# Transcript Classification (Deception Detection)

Whether a person is lying or telling the truth is classified from the transcript using a convolutional neural network (CNN).

However, before building the model, the transcripts must be processed so that the model can find useful information in them.

Following are the steps:

   - Read the transcripts and store them in a variable
   - Load pretrained word-vector data (GloVe 840B)
   - Convert the transcripts to matrices using the word vectors
   - Split the dataset in a 70% - 30% ratio
   - Build the model and train it



# Import Libraries

The CNN model is built with the __Tensorflow__ library. Modules of the same library are also used for specific tasks such as tokenizing, vectorizing and word embedding.

In [None]:
# Importing necessary libraries

import numpy as np  # for numerical operations and matrix like datatype
import os           # for accessing directory related functions

# Importing functions from tensorflow library


from tensorflow.keras.layers.experimental.preprocessing import TextVectorization # for making text vector

from tensorflow.keras.layers import Embedding  # For mapping word indices to numerical vectors
from tensorflow.keras import layers,models     # For the various layers of the model (e.g. Conv1D) and for loading saved models
from tensorflow import keras                   # For building machine learning model

# Read Transcripts

Each hearing has its own text file containing the transcript, and the files are kept in separate folders.
So first, all the transcripts are loaded into a single variable.

Folder structure is assumed as follows (based on visual inspection)

- Truthful/
    - trial_truth_001.txt
    - ...
- Deceptive/
    - trial_lie_001.txt
    - ...

Here the __os__ library is used to get the list of file names.



In [None]:
# Getting the list of File Names

Truthful = "Truthful"   # These are the folder names where the transcripts are placed
Deceptive = "Deceptive" 

truth_ids = [fname[:-4] for fname in os.listdir(Truthful)]  # Keep only the file name without the ".txt" extension
lie_ids = [fname[:-4] for fname in os.listdir(Deceptive)]   # The id is the same for the corresponding video file and other information

## Read all files and store in a list

Once all the file names are fetched, the transcripts are stored in a list, which is then converted to a numpy array for further convenience. Another list stores the class of each transcript (i.e. truth = 1 or lie = 0).

In [None]:
# Now Read those files and store in list

Truth_sentences = []
Lie_sentences = []

# First reading the truthful transcripts
for fname in truth_ids:
    path = os.path.join(Truthful,fname+".txt")
    with open(path) as f:
        Truth_sentences.append(f.read())
        
# Reading lie transcripts
for fname in lie_ids:
    path = os.path.join(Deceptive,fname+".txt")
    with open(path) as f:
        Lie_sentences.append(f.read())
        
# Labels for classification

Truth_labels = [1 for _ in range(len(Truth_sentences))]
Lie_labels = [0 for _ in range(len(Lie_sentences))]

# Converting in numpy array
Truth_sentences = np.array(Truth_sentences).reshape(len(Truth_sentences),1)  # Column vector
Lie_sentences = np.array(Lie_sentences).reshape(len(Lie_sentences),1)

Truth_labels = np.array(Truth_labels).reshape(len(Truth_labels),1)  # Column vector
Lie_labels = np.array(Lie_labels).reshape(len(Lie_labels),1)

## Merge both type of list in one

Once the transcripts and labels are stored in the appropriate lists (or numpy arrays), they are joined into one numpy array with two columns: the first holds the transcripts and the second their labels (i.e. truth = 1 or lie = 0).

In [None]:
# Now store in a numpy array: first column is the transcript and second one its class (i.e. truth or lie)

T_data = np.concatenate([Truth_sentences,Truth_labels],axis=1)  # Join the 2 column vectors side by side (column-wise)
L_data = np.concatenate([Lie_sentences,Lie_labels],axis = 1)  # Join the 2 column vectors side by side (column-wise)

AllData = np.concatenate([T_data,L_data],axis=0)  # Stack the 2 matrices on top of each other (i.e. appending rows)
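
A NumPy array holds a single dtype, so joining text and integer labels in one array turns the labels into strings; this is why the printed labels appear as '0' rather than 0. The following is a minimal sketch with made-up toy sentences (not taken from the dataset). The explicit `astype(str)` is an assumption added here only to make the conversion visible.

```python
import numpy as np

# Toy data standing in for the transcripts (hypothetical sentences)
sentences = np.array(["yes I did", "no sir"]).reshape(-1, 1)   # column vector of text
labels = np.array([1, 0]).reshape(-1, 1)                        # column vector of classes

# Make the int -> str conversion explicit before joining the columns
data = np.concatenate([sentences, labels.astype(str)], axis=1)

print(data.shape)     # one row per transcript: text + label
print(data[:, 1])     # the labels are now strings

# Recover numeric labels when they are needed for training
numeric_labels = data[:, 1].astype("float32")
print(numeric_labels)
```

Keeping sentences and labels in two separate arrays would avoid the round trip through strings; the single two-column array is simply convenient for shuffling both together.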


In [None]:
AllData[-2:]

array([['That word has a definition and you are not using it. You are not using the definition that I apply in this ... in this continuum. So if you want me to talk about what stalking means, I will talk about it.',
        '0'],
       ['The lunging and the gun going off were sort of contemporanios, I don’t remember how close they were or if it happened at exactly the same moment or one right after the other, it all happened very fast, and it all seemed to happen all at once, and I would say as far as distance, maybe as far as Mr. Babbikey (sp?) is, (court reporter?) but I couldn’t say for sure with absolute certainty.',
        '0']], dtype='<U1071')

# Word Vectorization

For machine learning, text data needs to be converted into numerical values with an appropriate method. One such method is Word2Vector, where each word is represented by an __n__-dimensional vector. So a sentence (transcript) with __m__ words is represented by either an __n__ x __m__ or an __m__ x __n__ sized matrix. 

To represent a word as a vector, either an unsupervised learning method or pretrained data is used. Here, pretrained data from __GloVe__ (Global Vectors for Word Representation) is used to get the vector for each word. From the __GloVe__ dataset, the __300__-dimensional (i.e. __n=300__) representation is used.

Since the transcripts have different lengths, a window of __100__ words is used (i.e. __m=100__). A transcript with fewer than __100__ words is padded with zeros, and a longer one is trimmed.

So, a transcript is represented by an __n__ x __m__ = __300__ x __100__ matrix.
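
The padding and trimming rule can be sketched with plain NumPy. This is only an illustration under assumptions: `fit_to_window` is a hypothetical helper, and random numbers stand in for real word vectors.

```python
import numpy as np

n, m = 300, 100  # vector dimension and window length from the text

def fit_to_window(word_vectors, m=m):
    """Trim or zero-pad a (k, n) stack of word vectors to (m, n), then return it as (n, m)."""
    k, n_dim = word_vectors.shape
    window = np.zeros((m, n_dim), dtype=word_vectors.dtype)
    window[:min(k, m)] = word_vectors[:m]   # keep at most the first m words
    return window.T

rng = np.random.default_rng(0)
short = fit_to_window(rng.normal(size=(8, n)))     # 8-word transcript: padded
long = fit_to_window(rng.normal(size=(150, n)))    # 150-word transcript: trimmed
print(short.shape, long.shape)
```

Both results have shape (300, 100); in the short case every column after the eighth is all zeros.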

## Loading GloVe dataset

First, load all the data from the pre-trained dataset into a dictionary.

In [None]:
# Loading pretrained GloVe dataset
glove_folder = "glove.840B"
glove_file = "glove.840B.300d.txt"
glove_path = os.path.join(glove_folder,glove_file)

embodided_word_map = {}

with open(glove_path, encoding="utf8") as emFile:  # explicit encoding, since the file contains non-ASCII words
    for line in emFile:
        word,vect = line.split(" ",maxsplit=1)
        vect = np.fromstring(vect, "f", sep=" ")
        embodided_word_map[word] = vect

print("Total words in GloVe dataset:",len(embodided_word_map))
# embodided_word_map['hi']

Total words in GloVe dataset: 2196016
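
To see what the loading loop does to each line, here is a small sketch that parses two made-up lines in the same text layout (a word followed by space-separated numbers). The words and values are invented, and `np.asarray` on the split values is used here instead of `np.fromstring`; that swap is a choice of this sketch, not the notebook's code.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: "<word> <v1> <v2> ... <vn>"
fake_glove = io.StringIO("look 0.1 -0.2 0.3\nsir 0.5 0.0 -1.5\n")

word_map = {}
for line in fake_glove:
    # Split off the word only; everything after the first space is the vector
    word, values = line.rstrip().split(" ", maxsplit=1)
    word_map[word] = np.asarray(values.split(), dtype="float32")

print(len(word_map), word_map["look"])
```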


In [None]:
# Example:- 
embodided_word_map['look']

array([-1.9463e-02, -1.8862e-01, -3.3833e-01, -1.7087e-02,  2.4807e-01,
       -2.0557e-01,  1.9839e-01,  9.0633e-03, -1.8412e-01,  2.1553e+00,
       -3.5654e-01, -1.8052e-01,  4.8173e-02, -2.7695e-01,  5.6454e-02,
        1.6258e-01, -2.7082e-01,  1.0765e+00, -4.1729e-01, -3.3334e-01,
        5.8293e-03, -2.1324e-01,  3.2689e-01, -2.0474e-01, -1.8690e-01,
        2.3764e-01, -3.5091e-02, -1.0563e-01,  2.1216e-01, -1.8023e-01,
       -3.4032e-01, -4.8700e-02, -1.1078e-01,  6.8588e-02,  2.5711e-01,
       -1.4287e-01, -4.4981e-02,  1.0357e-01, -3.3532e-01, -1.9495e-01,
       -3.3474e-01, -1.5415e-01,  1.8489e-01, -1.4937e-01,  2.8578e-01,
       -2.1299e-01, -4.9552e-01, -1.8745e-01, -7.4939e-02,  5.0816e-02,
       -3.5211e-02,  1.4748e-01,  1.0345e-01, -4.2498e-01,  2.1406e-01,
        1.4149e-01, -2.4607e-01, -4.4894e-02,  2.4726e-01,  1.2828e-01,
        1.6653e-01, -4.6914e-01, -1.5911e-01,  3.3017e-01,  4.4409e-02,
        1.5897e-01, -2.3019e-01,  2.3289e-01,  5.7056e-01,  9.27

## Get list of unique words from all transcripts (Vocabulary)

To get the vocabulary of our transcripts, the TextVectorization module from tensorflow is used. It finds all the unique words in the transcripts and can also perform a simple vectorization by integer (index) encoding, which represents each word by its index in the vocabulary. Here we can also specify the length of the output vector (__m__).

In [None]:
# Creating a layer that learns the vocabulary and converts a sentence to a vector of word indices.
# From each sentence only the first 100 words are kept; shorter sentences are padded with 0

make_vector = TextVectorization(output_sequence_length = 100) # m = 100
make_vector.adapt(AllData[:,0])  # Adapting to our dataset

vocab = make_vector.get_vocabulary()
vocab_idx = dict(zip(vocab,range(len(vocab))))
print("Length of Vocabulary",len(vocab_idx))

Length of Vocabulary 1521


In [None]:
vocab_idx

{'': 0,
 '[UNK]': 1,
 'i': 2,
 'and': 3,
 'the': 4,
 'to': 5,
 'was': 6,
 'that': 7,
 'a': 8,
 'he': 9,
 'of': 10,
 'it': 11,
 'in': 12,
 'my': 13,
 'me': 14,
 'um': 15,
 'you': 16,
 'so': 17,
 'just': 18,
 'had': 19,
 'on': 20,
 'we': 21,
 'they': 22,
 'is': 23,
 'but': 24,
 'know': 25,
 'have': 26,
 'at': 27,
 'what': 28,
 'like': 29,
 'him': 30,
 'not': 31,
 'for': 32,
 'as': 33,
 'there': 34,
 'uh': 35,
 'were': 36,
 'this': 37,
 'time': 38,
 'with': 39,
 'then': 40,
 'she': 41,
 'when': 42,
 'up': 43,
 'did': 44,
 'all': 45,
 'about': 46,
 'out': 47,
 'or': 48,
 'do': 49,
 'her': 50,
 'said': 51,
 'didnt': 52,
 'would': 53,
 'get': 54,
 'dont': 55,
 'back': 56,
 'if': 57,
 'going': 58,
 'remember': 59,
 'no': 60,
 'really': 61,
 'are': 62,
 'his': 63,
 'because': 64,
 'be': 65,
 'two': 66,
 'been': 67,
 '…': 68,
 'well': 69,
 'told': 70,
 'started': 71,
 'other': 72,
 'its': 73,
 'got': 74,
 'go': 75,
 'very': 76,
 'through': 77,
 'them': 78,
 'right': 79,
 'im': 80,
 'house': 81,

The **make_vector** function above converts a given transcript (sentence) to a vector containing the vocabulary index of each word, trimmed or padded to length __m__ = __100__. 

**Example:**

Consider this transcript:

    "No sir I was not, not at all."


In [None]:
for i,sent in enumerate(AllData[:,0]):
    if len(sent.split()) == 8:
        print("Sentence:",sent )
        print("word2vector:\n",make_vector([sent]),AllData[i,1])


Sentence: No sir I was not, not at all.
word2vector:
 tf.Tensor(
[[ 60 261   2   6  31  31  27  45   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0]], shape=(1, 100), dtype=int64) 0


The result above shows the conversion of the given sentence using index encoding.

In that result 60, 261, 2, 6, 31, 31, 27 and 45 are the vocabulary indices of the words "No", "sir", "I", "was", "not", "not", "at" and "all". The rest is 0, as there are only 8 words in the sentence and we need a vector of length 100.
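
The index encoding can be roughly mimicked in plain Python, which may help to see what the layer does. This is a sketch under assumptions: `standardize`, `build_vocab` and `encode` are hypothetical helpers, the corpus is a toy one, and the real layer's punctuation handling and tie-breaking between equally frequent words may differ.

```python
import re
from collections import Counter

def standardize(text):
    """Lowercase, strip punctuation, then split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocab(texts):
    """Index 0 is reserved for padding, index 1 for unknown words; the rest by frequency."""
    counts = Counter(w for t in texts for w in standardize(t))
    return ["", "[UNK]"] + [w for w, _ in counts.most_common()]

def encode(text, vocab_idx, m=100):
    """Map each word to its vocabulary index, then trim or zero-pad to length m."""
    ids = [vocab_idx.get(w, 1) for w in standardize(text)][:m]
    return ids + [0] * (m - len(ids))

texts = ["No sir I was not, not at all.", "I was at the house."]  # toy corpus
vocab_idx = {w: i for i, w in enumerate(build_vocab(texts))}
vec = encode("No sir I was not, not at all.", vocab_idx, m=12)
print(vec)
```

The two occurrences of "not" receive the same index, and the last four positions are 0 because the sentence has only 8 words.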


## Using GloVe

To convert a **word** into an __n__-dimensional vector, a conversion matrix needs to be created for our vocabulary from the **GloVe** dataset.

In [None]:
# Now creating a matrix from the vectors of the words
total_tokens = len(vocab_idx) + 2  # Vocabulary size plus 2 spare rows (they stay all-zero)
dimention_vect = 300  # Dimension of a vector of the GloVe dataset
conversion_matrix = np.zeros((total_tokens,dimention_vect)) # Row i will hold the vector of the word with index i
nf_words = []  # Words not found in GloVe
hit,miss = (0,0)
for word,i in vocab_idx.items():
    # Check if the word is present in the GloVe dataset
    if word in embodided_word_map:
        conversion_matrix[i] = embodided_word_map[word]
        hit = hit + 1
    else:
        nf_words.append(word)  # Its row stays all-zero
        miss = miss + 1

print(" Total word found in GloVe:",hit," out of:",hit+miss)
# embodided_word_map = {}

 Total word found in GloVe: 1468  out of: 1521


In [None]:
conversion_matrix.shape

(1523, 300)

Some of the words are not available in GloVe, so for those an all-zero vector is used. As their number is small (53 out of 1521), this should not affect the result much.

To convert a given word into a 300-dimensional vector, the "conversion_matrix" is used.

For example:
    
The word "No" is represented by the $60^{th}$ row of "conversion_matrix", since the index of the word "No" in the vocabulary is 60.
Similarly, the word "sir" is represented by the $261^{st}$ row of "conversion_matrix".

So, using this matrix, the index-encoded vector of a transcript (sentence) is converted to an __n__ x __m__ sized matrix.
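
The row lookup can be illustrated with a tiny NumPy example. The vocabulary, the 4-dimensional vectors and the window length of 6 below are all made up for the sketch; the notebook itself uses 300 dimensions and 100 positions.

```python
import numpy as np

n = 4  # toy vector dimension (300 in the notebook)
vocab_idx = {"": 0, "[UNK]": 1, "no": 2, "sir": 3, "hmm": 4}
toy_vectors = {"no": np.ones(n), "sir": np.full(n, 2.0)}   # "hmm" is deliberately missing

# Row i of the matrix holds the vector of the word with index i; missing words stay zero
matrix = np.zeros((len(vocab_idx), n))
for word, i in vocab_idx.items():
    if word in toy_vectors:
        matrix[i] = toy_vectors[word]

indices = np.array([2, 3, 4, 0, 0, 0])   # "no sir hmm" padded to m = 6
embedded = matrix[indices]                # shape (m, n): one row per word position
transcript_matrix = embedded.T            # shape (n, m), as described in the text
print(embedded.shape, transcript_matrix.shape)
```

Indexing the matrix with the index vector performs the same row lookup that a frozen embedding layer does, which is why the padding index 0 and any word missing from the pretrained data both end up as zero columns.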

In [None]:
# For converting text to a 300x100 matrix, the Embedding layer from keras is used
embedding_layer = Embedding(
    total_tokens,
    dimention_vect,
    embeddings_initializer=keras.initializers.Constant(conversion_matrix),
    trainable=False,
)

# Preparing Training and Testing Data

The dataset we have created is a matrix with 2 columns in which the first block of rows belongs to one category and 
the rest to the other, so it is shuffled first. Then each input is converted to a 300x100 matrix using the embedding_layer.

After that, the first 70% of the dataset is used for training and the remainder for testing.

The inputs and labels are also separated.

## Split data in training and testing data



In [None]:
np.random.shuffle(AllData)  # Shuffling data row wise
x = make_vector(np.array([[s] for s in AllData[:,0]])).numpy() # Creating index vectors for all the transcripts

data_in = []
for s in x:
    data_in.append(embedding_layer(s).numpy().T)  # (100, 300) embedding transposed to (300, 100)
data_in = np.array(data_in)   

# Now splitting data in training and testing
n = int(0.7 * x.shape[0])
x_train = data_in[:n]
y_train = np.array(AllData[:n,1],dtype='float32')

x_test = data_in[n:]
y_test = np.array(AllData[n:,1],dtype='float32')
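
The shuffle-then-split step can be sketched on toy data as follows. The seeded generator and the use of a single permutation for inputs and labels are assumptions of this sketch, chosen so that the split is repeatable and the pairs stay aligned; the notebook itself shuffles the combined two-column array in place.

```python
import numpy as np

rng = np.random.default_rng(seed=42)          # fixed seed so the split is repeatable
inputs = np.arange(20).reshape(10, 2)         # 10 toy samples
labels = np.array([1] * 5 + [0] * 5)          # first half truth, second half lie

order = rng.permutation(len(inputs))          # one shuffled ordering for both arrays
inputs, labels = inputs[order], labels[order]

n_train = int(0.7 * len(inputs))              # 70% for training
x_tr, y_tr = inputs[:n_train], labels[:n_train]
x_te, y_te = inputs[n_train:], labels[n_train:]
print(x_tr.shape, x_te.shape)
```

Without the shuffle, the last 30% of the rows would all come from one class, so the test set would not be representative.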

# Build model and train

In [None]:
cb = keras.callbacks.EarlyStopping(monitor="val_acc",
    min_delta=0.01,
    patience=110,
    verbose=0,
    mode="max",
    baseline=None,
    restore_best_weights=True)

op_m = keras.optimizers.RMSprop(
    learning_rate=0.001, rho=0.9, momentum=0.001, epsilon=1e-07, centered=False,
    name='RMSprop'
)

int_sequences_input = keras.Input(shape=(300,100))

x = layers.Conv1D(3,8,activation="relu",kernel_regularizer=keras.regularizers.l2(0.001)
                 )(int_sequences_input)
x = layers.MaxPooling1D(2)(x)
x = layers.Dropout(0.5)(x)

# x = layers.Conv1D(2,15,activation="relu",kernel_regularizer=keras.regularizers.l2(0.001))(x)
# x = layers.MaxPooling1D(2)(x)
# # x = layers.Dropout(0.2)(x)

# x = layers.Conv1D(2,15,activation="relu",kernel_regularizer=keras.regularizers.l2(0.001))(x)
# x = layers.MaxPooling1D(2)(x)
# # x = layers.Dropout(0.2)(x)

# x = layers.Conv1D(2,15,activation="relu",kernel_regularizer=keras.regularizers.l2(0.001))(x)
# x = layers.MaxPooling1D(2)(x)
# x = layers.Dropout(0.2)(x)

x = layers.Flatten()(x)

x = layers.Dense(300,activation="relu",kernel_regularizer=keras.regularizers.l2(0.001))(x)
x = layers.Dropout(0.3)(x)

x = layers.Dense(1024,activation="relu",kernel_regularizer=keras.regularizers.l2(0.01))(x)
x = layers.Dropout(0.6)(x)

# x = layers.Dense(50,activation="relu")(x)
# x = layers.Dropout(0.2)(x)

preds = layers.Dense(2, activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()


Model: "functional_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 300, 100)]        0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 293, 3)            2403      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 146, 3)            0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 146, 3)            0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 438)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 300)               131700    
_________________________________________________________________
dropout_4 (Dropout)          (None, 300)              
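
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults for `Conv1D` ("valid" padding, stride 1) and for `MaxPooling1D(2)` (stride equal to the pool size). With an input of shape (300, 100), the convolution slides along the 300-long axis and treats the 100 word positions as channels.

```python
# Shape bookkeeping for Conv1D(3, 8) -> MaxPooling1D(2) -> Flatten -> Dense(300)
steps, channels = 300, 100          # input shape (300, 100)
filters, kernel = 3, 8

conv_steps = steps - kernel + 1                      # "valid" padding, stride 1
conv_params = kernel * channels * filters + filters  # weights + biases
pool_steps = conv_steps // 2                         # pool size 2, stride 2
flat = pool_steps * filters                          # length after Flatten
dense_params = flat * 300 + 300                      # weights + biases of Dense(300)

print(conv_steps, conv_params, pool_steps, flat, dense_params)
# 293 2403 146 438 131700
```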

In [None]:
model.compile(
    loss="sparse_categorical_crossentropy", optimizer='rmsprop', metrics=["acc"]
)
history = model.fit(x_train, y_train,epochs=1000,validation_data=(x_test, y_test),callbacks=[cb])

Epoch 1/1000
...
Epoch 136/1000


In [None]:
pred = np.argmax(model.predict(x_test),axis=1)
print("Accuracy on Test Data:", np.sum(pred == y_test)/y_test.shape[0])

Accuracy on Test Data: 0.5945945945945946
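
The accuracy computation takes the most probable class from the two softmax outputs and compares it with the labels. A minimal sketch with made-up probabilities (not model output):

```python
import numpy as np

# Made-up softmax outputs for 4 samples: columns are P(lie = 0), P(truth = 1)
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
y_true = np.array([0., 1., 1., 1.], dtype="float32")

pred = np.argmax(probs, axis=1)          # most probable class per row
accuracy = np.mean(pred == y_true)       # fraction of matching predictions
print(pred, accuracy)
```

Here three of the four predictions match, so the accuracy is 0.75.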


In [None]:
test_model = models.load_model("BestSofar")  # Load a previously saved model from the "BestSofar" path

In [None]:
test_model.summary()

Model: "functional_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 300, 100)]        0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 293, 3)            2403      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 146, 3)            0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 146, 3)            0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 438)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 300)               131700    
_________________________________________________________________
dropout_4 (Dropout)          (None, 300)              

In [None]:
pred = np.argmax(test_model.predict(x_test),axis=1)
print("Accuracy on Test Data:", np.sum(pred == y_test)/y_test.shape[0])

Accuracy on Test Data: 0.8648648648648649
