# Instructions

To run the code below correctly, follow these instructions:

1) Download the SNLI dataset from the following link and keep only the snli_1.0_train.jsonl and snli_1.0_test.jsonl files: https://nlp.stanford.edu/projects/snli/snli_1.0.zip
2) Download the GloVe embeddings from https://nlp.stanford.edu/data/glove.6B.zip and use the glove.6B.50d.txt file.
3) Unzip/extract the files if necessary.
4) Save those files in the same folder as this Notebook.
5) Ensure all of the relevant libraries/packages are installed, with the same version numbers shown below, to replicate the results.
6) Delete any existing log files before running the code; otherwise TensorBoard may mix old runs with the new one.
7) Restart the kernel and select "Run All".

This model takes around 112 minutes to complete training.
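
Steps 4 and 6 above can be checked before the notebook is run. The sketch below is only an illustration: the helper names `find_missing_files` and `clear_old_logs` are hypothetical, and it assumes the data files sit in the notebook's folder and that the log directory is the `logs_BiLSTM` folder used in Step 6.

```python
from pathlib import Path
import shutil

# File names taken from the instructions above; the log directory name
# matches the log_dir passed to TensorBoard in Step 6.
REQUIRED_FILES = [
    "snli_1.0_train.jsonl",
    "snli_1.0_test.jsonl",
    "glove.6B.50d.txt",
]
LOG_DIR = "logs_BiLSTM"


def find_missing_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]


def clear_old_logs(folder, log_dir=LOG_DIR):
    """Delete the log directory inside `folder` if it exists.

    Returns True when a directory was removed, False otherwise.
    """
    target = Path(folder) / log_dir
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling `find_missing_files(".")` from the notebook folder should return an empty list once steps 1 to 4 are complete, and `clear_old_logs(".")` removes any previous run's logs.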

# Step 1 - Import libraries/packages

In [4]:
# Python version using - 3.12.5
import random
random.seed(1)
import numpy as np # Version - 1.23.4
import pandas as pd # Version 2.2.2
import os # Standard library module (operating system version 10.0.22631)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' # Stops long messages coming through
from sklearn import preprocessing # Version 1.5.1
import keras # Version 2.6.0
import tensorflow as tf # Version 2.6.0
from tensorflow.keras.models import Sequential # Version 2.6.0
from tensorflow.keras.layers import Dense, LSTM, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D  # Version 2.6.0
from tensorflow.keras.preprocessing.text import Tokenizer # Version 2.6.0
from tensorflow.keras.preprocessing.sequence import pad_sequences # Version 2.6.0
from tensorflow.keras.optimizers import Adam # Version 2.6.0
from tensorflow.keras.callbacks import TensorBoard # Version 2.6.0


In [5]:
%load_ext tensorboard 
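
To compare an environment against the version comments above, the installed versions can be read with the standard library. This is a minimal sketch; the helper name `installed_versions` is hypothetical, and the package names passed to it are assumed to be the names the packages are published under.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names assumed from the imports in Step 1.
packages = ["numpy", "pandas", "scikit-learn", "keras", "tensorflow"]
for name, version in installed_versions(packages).items():
    print(f"{name}: {version or 'not installed'}")
```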

# Step 2 - Load in the data

In [7]:
train_snli = pd.read_json("snli_1.0_train.jsonl", lines=True) # Reads the train_snli file, this is the dataset used for training
test_snli = pd.read_json("snli_1.0_test.jsonl", lines = True) # Reads the test_snli file, this is the dataset used for testing

full_dataset = [train_snli,test_snli] # Collects the train and test data frames in a list

full_dataset = pd.concat(full_dataset, ignore_index=True) # Concatenates the train and test datasets into one data frame

full_dataset.tail() # Checks if the dataset has been concatenated correctly

Unnamed: 0,annotator_labels,captionID,gold_label,pairID,sentence1,sentence1_binary_parse,sentence1_parse,sentence2,sentence2_binary_parse,sentence2_parse
560147,"[contradiction, contradiction, contradiction, ...",4378810163.jpg#4,contradiction,4378810163.jpg#4r1c,Two women are observing something together.,( ( Two women ) ( ( are ( ( observing somethin...,(ROOT (S (NP (CD Two) (NNS women)) (VP (VBP ar...,Two women are standing with their eyes closed.,( ( ( Two women ) ( are ( standing ( with ( th...,(ROOT (S (NP (NP (CD Two) (NNS women)) (SBAR (...
560148,"[entailment, entailment, entailment, contradic...",4378810163.jpg#4,entailment,4378810163.jpg#4r1e,Two women are observing something together.,( ( Two women ) ( ( are ( ( observing somethin...,(ROOT (S (NP (CD Two) (NNS women)) (VP (VBP ar...,Two girls are looking at something.,( ( Two girls ) ( ( are ( looking ( at somethi...,(ROOT (S (NP (CD Two) (NNS girls)) (VP (VBP ar...
560149,"[contradiction, contradiction, contradiction, ...",152881593.jpg#1,contradiction,152881593.jpg#1r1c,A man in a black leather jacket and a book in ...,( ( ( ( ( A man ) ( in ( a ( black ( leather j...,(ROOT (S (NP (NP (NP (DT A) (NN man)) (PP (IN ...,A man is flying a kite.,( ( A man ) ( ( is ( flying ( a kite ) ) ) . ) ),(ROOT (S (NP (DT A) (NN man)) (VP (VBZ is) (VP...
560150,"[entailment, entailment, entailment, neutral, ...",152881593.jpg#1,entailment,152881593.jpg#1r1e,A man in a black leather jacket and a book in ...,( ( ( ( ( A man ) ( in ( a ( black ( leather j...,(ROOT (S (NP (NP (NP (DT A) (NN man)) (PP (IN ...,A man is speaking in a classroom.,( ( A man ) ( ( is ( speaking ( in ( a classro...,(ROOT (S (NP (DT A) (NN man)) (VP (VBZ is) (VP...
560151,"[neutral, neutral, neutral, neutral, neutral]",152881593.jpg#1,neutral,152881593.jpg#1r1n,A man in a black leather jacket and a book in ...,( ( ( ( ( A man ) ( in ( a ( black ( leather j...,(ROOT (S (NP (NP (NP (DT A) (NN man)) (PP (IN ...,A man is teaching science in a classroom.,( ( A man ) ( ( is ( ( teaching science ) ( in...,(ROOT (S (NP (DT A) (NN man)) (VP (VBZ is) (VP...
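
The `lines=True` argument used above tells pandas that the file is in JSON Lines format, with one JSON object per line. The sketch below shows the same idea with only the standard library. The two sample records are made up for illustration and are not rows from SNLI, and `read_jsonl` is a hypothetical helper.

```python
import io
import json

# Two invented records in JSON Lines form (one JSON object per line).
sample_jsonl = io.StringIO(
    '{"gold_label": "neutral", "sentence1": "A man rides a bike.", '
    '"sentence2": "A man is outdoors."}\n'
    '{"gold_label": "entailment", "sentence1": "Two dogs run.", '
    '"sentence2": "Animals are moving."}\n'
)


def read_jsonl(handle):
    """Parse one JSON object from every non-empty line of a text stream."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_jsonl(sample_jsonl)
print(len(records))               # number of parsed rows
print(records[0]["gold_label"])   # label of the first row
```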


# Step 3 - Exploratory data analysis and pre-processing of the data

In [9]:
FiftyPercentFullDataset = round((50 * len(full_dataset) / 100)) # Number of rows in half (50%) of the dataset, rounded to the nearest integer

FiftyPercentFullDataset = full_dataset.iloc[0:FiftyPercentFullDataset] # Slices the first half of the rows; the variable now holds the data frame rather than the row count

print(len(FiftyPercentFullDataset)) # Checks the length of the dataset to see if it has been cut in half

280076


In [10]:
FiftyPercentFullDataset = FiftyPercentFullDataset[['sentence1', 'sentence2','gold_label']] # Selects only the Premise (sentence1), the Hypothesis (sentence2), and the Label (gold_label)


In [11]:
FiftyPercentFullDataset.dtypes # Looks at the data types of each covariate

sentence1     object
sentence2     object
gold_label    object
dtype: object

In [12]:
NA_train = FiftyPercentFullDataset.isnull() 

NA_train.value_counts() # No null values

sentence1  sentence2  gold_label
False      False      False         280076
Name: count, dtype: int64

In [13]:
FiftyPercentFullDataset['gold_label'].value_counts() # Looks at the counts of all possible values in the gold_label column

gold_label
entailment       93428
contradiction    93265
neutral          93029
-                  354
Name: count, dtype: int64

In [14]:
# Removes the rows labelled "-" (pairs with no gold label), since using them would lead to inaccurate results

FiftyPercentFullDataset = FiftyPercentFullDataset[FiftyPercentFullDataset['gold_label'] != "-"]
FiftyPercentFullDataset.tail()
FiftyPercentFullDataset['gold_label'].value_counts() # Checks if the '-' label has been removed

gold_label
entailment       93428
contradiction    93265
neutral          93029
Name: count, dtype: int64
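
The filtering step above can be sketched without pandas to show what happens to the counts. The helper `drop_unlabelled` and the sample labels are hypothetical; the final check uses the row counts printed in the cells above.

```python
from collections import Counter


def drop_unlabelled(labels, placeholder="-"):
    """Keep every label except the placeholder used for missing gold labels."""
    return [label for label in labels if label != placeholder]


# Made-up labels for illustration only.
sample_labels = ["entailment", "-", "neutral", "contradiction", "-", "neutral"]
kept = drop_unlabelled(sample_labels)

print(Counter(sample_labels))
print(Counter(kept))

# Consistency check on the notebook's own numbers:
# 280076 rows minus 354 "-" rows leaves the three class counts.
print(280076 - 354 == 93428 + 93265 + 93029)
```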

In [15]:
# Renames the columns to make the covariates easier to read

FiftyPercentFullDataset = FiftyPercentFullDataset.rename(columns = {"sentence1": "Premise", "sentence2": "Hypothesis", "gold_label": "Label"})

In [16]:
FiftyPercentFullDataset.tail() # Checks if the covariates have been renamed

Unnamed: 0,Premise,Hypothesis,Label
280071,"An ATV rider, speeds around a corner, while sl...",A man rides an ATV down a dirt path.,neutral
280072,"An ATV rider, speeds around a corner, while sl...",A ATV rider just fell of their ATV into a river.,contradiction
280073,"An ATV rider, speeds around a corner, while sl...",An ATV rider is in the woods.,neutral
280074,"An ATV rider, speeds around a corner, while sl...",A person rides an ATV while focusing on the tr...,entailment
280075,"An ATV rider, speeds around a corner, while sl...","A man rides a motorcycle down a dirt trail, ar...",contradiction


# Step 4 - Create Word Embeddings

In [18]:

covariatePremise = FiftyPercentFullDataset['Premise']
covariateHypothesis = FiftyPercentFullDataset['Hypothesis']


labels = FiftyPercentFullDataset['Label'] # Extracts the Label covariate as its own Series


LabelEncoder = preprocessing.LabelEncoder()
labels = LabelEncoder.fit_transform(labels)

labels = tf.keras.utils.to_categorical(labels, 3).astype("int32") # One-hot encodes the three classes and stores the result as ints.
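
To read the model's predictions later, it helps to know which column of the one-hot labels belongs to which class. `LabelEncoder` assigns integers to the classes in sorted order, so the mapping can be sketched in plain Python as below. The helpers `fit_label_index` and `one_hot` are hypothetical stand-ins for `LabelEncoder` and `to_categorical`, not the library code.

```python
def fit_label_index(labels):
    """Assign an integer to each distinct label, in sorted order."""
    classes = sorted(set(labels))
    return {name: index for index, name in enumerate(classes)}


def one_hot(index, num_classes):
    """Return a list with a 1 at position `index` and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(num_classes)]


label_index = fit_label_index(["neutral", "contradiction", "entailment", "neutral"])
print(label_index)                          # {'contradiction': 0, 'entailment': 1, 'neutral': 2}
print(one_hot(label_index["neutral"], 3))   # [0, 0, 1]
```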



In [19]:
premiseTokeniser = Tokenizer(num_words = 300) # Keeps only the most frequent words (token IDs below 300) when the Premise is converted to sequences
hypothesisTokeniser = Tokenizer(num_words = 300) # Same vocabulary limit for the Hypothesis

premiseTokeniser.fit_on_texts(covariatePremise)
hypothesisTokeniser.fit_on_texts(covariateHypothesis)

# Converts every sentence to its own sequence of token IDs. This is necessary because LSTMs operate on sequences
premiseSequences = premiseTokeniser.texts_to_sequences(covariatePremise)
hypothesisSequences = hypothesisTokeniser.texts_to_sequences(covariateHypothesis)

premisePadded = pad_sequences(premiseSequences, maxlen=50) # Pads or truncates every sequence to a fixed length of 50; this is the first input of the model
hypothesisPadded = pad_sequences(hypothesisSequences, maxlen=50) # This is the second input of the model

print(premisePadded.shape) # Checks if the padding has worked
print(hypothesisPadded.shape)

(279722, 50)
(279722, 50)
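
The two transformations in the cell above can be sketched in plain Python. The helpers `keep_frequent` and `pad_pre` are hypothetical: the first imitates the effect of `num_words` (token IDs at or above the limit are dropped), and the second imitates the default behaviour of `pad_sequences`, which pads and truncates at the start of the sequence. It assumes `maxlen` is at least 1.

```python
def keep_frequent(sequence, num_words):
    """Drop token IDs that fall outside the num_words limit."""
    return [token_id for token_id in sequence if token_id < num_words]


def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated


print(keep_frequent([4, 350, 12, 300, 299], 300))  # [4, 12, 299]
print(pad_pre([5, 8, 2], 5))                       # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))           # [3, 4, 5, 6, 7]
```

Because truncation happens at the start, sentences longer than 50 tokens keep their final 50 tokens rather than their first 50.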


In [20]:
'''
The code below splits the dataset.
'''


TestSplit = 0.2 # Used for the testing set

TestSplit = int(TestSplit * len(premisePadded)) # Finds 20% of the dataset for the testing split.


# Negative slicing keeps everything except the last TestSplit rows, i.e. the first 80% of the data
premisePaddedTrain = premisePadded[:-TestSplit]
hypothesisPaddedTrain = hypothesisPadded[:-TestSplit]
labelTrain = labels[:-TestSplit]

# Negative slicing keeps only the last TestSplit rows, i.e. the final 20% of the data
premisePaddedTest = premisePadded[-TestSplit:]
hypothesisPaddedTest = hypothesisPadded[-TestSplit:]
labelTest = labels[-TestSplit:]

print(len(FiftyPercentFullDataset))
print(len(premisePaddedTrain)) # Checks the size of them to ensure that the splitting has worked
print(len(premisePaddedTest))

print(len(premisePaddedTrain) + len(premisePaddedTest)) # Checks that the train and test sizes add up to the size of the full dataset.



279722
223778
55944
279722
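
The split above takes the test set from the end of the data without shuffling. The sketch below reproduces the slicing on a small list; `tail_split` is a hypothetical helper. It also guards against a test size of zero, because `items[:-0]` would return an empty list rather than the full data.

```python
def tail_split(items, test_fraction):
    """Split a sequence into (train, test), taking the test part from the end."""
    test_size = int(test_fraction * len(items))
    if test_size == 0:
        # items[:-0] is empty, so handle this case explicitly.
        return list(items), []
    return list(items[:-test_size]), list(items[-test_size:])


train, test = tail_split(list(range(10)), 0.2)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]

# The same arithmetic on the notebook's row count.
print(int(0.2 * 279722), 279722 - int(0.2 * 279722))  # 55944 223778
```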


In [21]:
tokenDictionary = {} # A dictionary that maps each word to its GloVe vector

with open(file = "glove.6B.50d.txt", encoding="utf8") as glove: # This is the smallest file in glove.6B: 50-dimensional vectors trained on 6 billion tokens

    for entry in glove:
        
        entryLine = entry.split()
        entryVector = np.array(entryLine[1:]) # The first element is always the word, so the numerical values start at the second element
        entryVector = entryVector.astype(np.float32) # The values are read as strings, so convert them to floats

        if entryVector.shape[0] == 50:
            tokenDictionary[entryLine[0]] = entryVector
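
Each line of a GloVe text file holds a word followed by its vector values, separated by spaces. The sketch below parses a tiny in-memory example in the same way as the loop above, skipping lines that do not have the expected number of values. The three sample lines use made-up 3-dimensional vectors, and `parse_glove` is a hypothetical helper.

```python
import io


def parse_glove(handle, dim):
    """Read 'word v1 v2 ... v_dim' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # Skip malformed or empty lines.
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Invented 3-dimensional entries; the last line is deliberately too short.
sample = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\nbroken 0.7\n")
vectors = parse_glove(sample, dim=3)
print(sorted(vectors))   # ['cat', 'the']
print(vectors["cat"])    # [0.4, 0.5, -0.6]
```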

In [22]:
'''
Below creates an embedding matrix for the Premise covariate
'''

premiseEmbeddingMatrix = np.zeros((len(premiseTokeniser.word_index) + 1, 50)) # Starts as all zeros, so out-of-vocabulary words (not found in GloVe) keep a zero vector

for character, i in premiseTokeniser.word_index.items(): # Each key is a token from the tokeniser's vocabulary, and i is its integer ID
    premiseEmbVector = tokenDictionary.get(character)
    if premiseEmbVector is not None:
        premiseEmbeddingMatrix[i] = premiseEmbVector


print(premiseEmbeddingMatrix.shape)

(14336, 50)


In [23]:
'''
Below creates an embedding matrix for the Hypothesis covariate
'''

hypothesisEmbeddingMatrix = np.zeros((len(hypothesisTokeniser.word_index) + 1, 50))


for character, i in hypothesisTokeniser.word_index.items():
    hypothesisEmbVector = tokenDictionary.get(character)
    if hypothesisEmbVector is not None:
        hypothesisEmbeddingMatrix[i] = hypothesisEmbVector


print(hypothesisEmbeddingMatrix.shape)

(23202, 50)
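
The two cells above follow the same pattern: row `i` of the matrix holds the GloVe vector of the word with ID `i`, and words without a GloVe vector stay as zeros. The sketch below shows that pattern on a toy vocabulary and also counts how many words were found, which gives a rough measure of coverage. The helper `build_embedding_matrix` and the toy data are hypothetical.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Return (matrix, found): one row per word ID, zeros where no vector exists."""
    # Row 0 is reserved because tokeniser word IDs start at 1.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    found = 0
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[i] = list(vector)
            found += 1
    return matrix, found


# Toy vocabulary and 2-dimensional vectors, for illustration only.
word_index = {"the": 1, "cat": 2, "zzzunknown": 3}
toy_vectors = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}

matrix, found = build_embedding_matrix(word_index, toy_vectors, dim=2)
print(len(matrix), found)  # 4 2
print(matrix[3])           # [0.0, 0.0]
```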


# Step 5 - Model Building

In [25]:
premiseEmbeddingLayer = keras.layers.Embedding( # Creates the embedding layer for the Premise Covariate
    14336,
    50,
    trainable = False, # These should not update during training
)

premiseEmbeddingLayer.build((1,))
premiseEmbeddingLayer.set_weights([premiseEmbeddingMatrix])

In [26]:
hypothesisembeddinglayer = keras.layers.Embedding( # Creates the embedding layer for the Hypothesis Covariate
    23202,
    50,
    trainable = False, # These should not update during training - since they are the pre-trained embeddings.
)

hypothesisembeddinglayer.build((1,))
hypothesisembeddinglayer.set_weights([hypothesisEmbeddingMatrix])


In [27]:
# The model has two inputs, so it is built with the Keras functional API (keras.Model) rather than a Sequential model

input1 = keras.Input(shape=(50,), dtype="int32", name ="Premise") # Each input is a padded sequence of 50 integer token IDs
input2 = keras.Input(shape=(50,), dtype="int32", name ="Hypothesis")

embeddedSequencesPremise = premiseEmbeddingLayer(input1)
embeddedSequencesHypothesis = hypothesisembeddinglayer(input2)

x = keras.layers.concatenate([embeddedSequencesPremise, embeddedSequencesHypothesis]) # Concatenates the two embedded sequences along the feature axis, giving 100 features per time step.

x = keras.layers.Bidirectional(LSTM(64, return_sequences = True, recurrent_dropout = 0.2, name= "BiLSTMLayer1"))(x)
x = keras.layers.Dropout(0.25)(x) # Adds a dropout layer after each BiLSTM layer to reduce overfitting
x = keras.layers.Bidirectional(LSTM(64, return_sequences = True, recurrent_dropout = 0.2, name= "BiLSTMLayer2"))(x)
x = keras.layers.Dropout(0.25)(x)
x = keras.layers.Bidirectional(LSTM(64, return_sequences = True, recurrent_dropout = 0.2, name= "BiLSTMLayer3"))(x)
x = keras.layers.Dropout(0.25)(x)

averagePooling = keras.layers.GlobalAveragePooling1D()(x)
maxPooling =  keras.layers.GlobalMaxPooling1D()(x)

x = keras.layers.concatenate([averagePooling, maxPooling]) # Concatenates the pooling layers into one layer

output = keras.layers.Dense(3, activation = "softmax", name = "final_output")(x) # Output layer

model_Bi_LSTM = keras.Model(inputs=[input1,input2], outputs = output)

model_Bi_LSTM.compile(loss = "categorical_crossentropy",
              optimizer = Adam(learning_rate=0.001),
              metrics = ['accuracy'],
             )

model_Bi_LSTM.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Premise (InputLayer)            [(None, 50)]         0                                            
__________________________________________________________________________________________________
Hypothesis (InputLayer)         [(None, 50)]         0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 50, 50)       716800      Premise[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 50, 50)       1160100     Hypothesis[0][0]                 
______________________________________________________________________________________________
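
The parameter counts in a summary like the one above can be checked by hand. The sketch below uses the standard counting formulas for embedding, LSTM and dense layers, with layer sizes read from the model code in this step: 100 input features after concatenating the two embeddings, 128 after each bidirectional layer, and 256 after concatenating the two pooling layers. The function names are hypothetical. Only the two embedding figures can be compared with the summary shown above; the rest are what the formulas predict.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units + 1) * units)


def bidirectional_lstm_params(input_dim, units):
    """A forward and a backward LSTM of the same size."""
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    """A weight per input plus one bias, for each output unit."""
    return (input_dim + 1) * units


print(embedding_params(14336, 50))         # 716800
print(embedding_params(23202, 50))         # 1160100
print(bidirectional_lstm_params(100, 64))  # 84480
print(bidirectional_lstm_params(128, 64))  # 98816
print(dense_params(256, 3))                # 771
```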

# Step 6 - Train and Test the model

In [29]:
logger = TensorBoard( # Creates a TensorBoard logger so that the results of the model can be viewed during training.
    log_dir = "logs_BiLSTM",
    histogram_freq=1, # Computes weight histograms for the layers every epoch.
    write_graph = True,
    write_images = True
    # The two lines above tell TensorBoard to log the model's graph and to write the model weights as images, in addition to the scalar metrics
)
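
An alternative to deleting old logs before each run is to write every run to its own timestamped subfolder, so that TensorBoard lists the runs separately. This is only a sketch of that idea and is not what the notebook does; `make_run_log_dir` is a hypothetical helper, and its result would be passed as `log_dir` in place of the fixed folder name.

```python
import os
from datetime import datetime


def make_run_log_dir(base_dir="logs_BiLSTM", now=None):
    """Return a log path such as logs_BiLSTM/20240102-030405 for the current run."""
    now = now or datetime.now()
    return os.path.join(base_dir, now.strftime("%Y%m%d-%H%M%S"))


print(make_run_log_dir())
```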

In [30]:
history = model_Bi_LSTM.fit([premisePaddedTrain, hypothesisPaddedTrain], labelTrain, # Trains the model on the two padded inputs.
                batch_size = 256,
                epochs = 10, # The model makes 10 full passes over the training data.
                verbose = 1,
                validation_split = 0.1, # Holds out the last 10% of the training data for validation.
                callbacks = [logger])

print("Testing Model: \n........................")

model_Bi_LSTM.evaluate([premisePaddedTest, hypothesisPaddedTest], labelTest) # Tests the model on the testing set

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Testing Model: 
........................


[0.804968535900116, 0.6405333876609802]
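
The list returned by `evaluate` holds the test loss followed by the test accuracy, so the model reaches about 64% accuracy on the test split, compared with about 33% for guessing among three classes. The sketch below estimates how the training call divides its data, assuming Keras takes the validation samples from the end of the arrays and rounds the training size down. The function name `fit_split_sizes` is hypothetical.

```python
import math


def fit_split_sizes(num_samples, validation_split, batch_size):
    """Return (train_size, val_size, steps_per_epoch) for a validation split."""
    train_size = math.floor(num_samples * (1.0 - validation_split))
    val_size = num_samples - train_size
    steps_per_epoch = math.ceil(train_size / batch_size)
    return train_size, val_size, steps_per_epoch


# 223778 training rows, validation_split = 0.1, batch_size = 256.
print(fit_split_sizes(223778, 0.1, 256))  # (201400, 22378, 787)
```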

In [31]:
# Launches TensorBoard on the log directory.
%tensorboard --logdir logs_BiLSTM

# End of Notebook