# Natural Language Inference using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

In this lab we'll work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a premise P, if H describes something that is definitely true given P, it is an **entailment**. If H describes something that *may* be true given P, it is **neutral**, and if H describes something that is definitely *false* given P, it is a **contradiction**.

# 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. The dataset can be downloaded [here](https://nlp.stanford.edu/projects/snli/). We prepared a "simplified" version, with only the relevant columns [here](https://gubox.box.com/s/idd9b9cfbks4dnhznps0gjgbnrzsvfs4).

The (simplified) data is organized as follows (tab-separated values):
* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation

Like in the previous lab, we'll use torchtext to build a dataloader. You can essentially do the same thing as you did in the last lab, but with our new dataset. **[1 mark]**

In [84]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1" 
# first we import some packages that we need
import torch
import torch.optim as optim
import torch.nn as nn
import torchtext

# our hyperparameters (add more when/if you need them)
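# fall back to the CPU when the third GPU (cuda:2) is not available
device = torch.device('cuda:2' if torch.cuda.device_count() > 2 else 'cpu')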

batch_size = 8
learning_rate = 0.001
epochs = 3

#other packages that we are going to use
import random
import math

import numpy as np
import pandas as pd

#### Download and unzip: https://github.com/sdobnik/computational-semantics/blob/master/assignments/05-natural-language-inference/simple_snli_1.0.zip

In [85]:
train_data = pd.read_csv('simple_snli_1.0_train.csv', header=None, sep='\t')
train_data.columns = ['premise', 'hypothesis', 'relation']
test_data = pd.read_csv('simple_snli_1.0_test.csv', header=None, sep='\t')
test_data.columns = ['premise', 'hypothesis', 'relation']
dev_data = pd.read_csv('simple_snli_1.0_dev.csv', header=None, sep='\t')
dev_data.columns = ['premise', 'hypothesis', 'relation']

In [6]:
from torch.utils.data import DataLoader, Dataset

In [12]:
def tokenize(string):
    # The tokenizer was given as a whitespace tokenizer
    return string.lower().split()

In [90]:
# I implement a Dataset to keep track of vocab, word2idx, idx2word
# Dataset can also be used in DataLoader which gives batch loading, etc, for free.

class InferenceDataset(Dataset):

    def __init__(self, data, unk_label='<unk>', pad_label='<pad>'):
        
        self.unk_idx, self.unk_label = 0, unk_label
        self.pad_idx, self.pad_label = 1, pad_label

        self.data = data.copy()
        self.data['premise'] = self.data['premise'].apply(self.tokenize)
        self.data['hypothesis'] = self.data['hypothesis'].apply(self.tokenize)


        self.vocab = self.__unique_words()
        
        self.word2idx = dict()
        self.idx2word = dict()
        self.word2idx[self.unk_label] = self.unk_idx
        self.word2idx[self.pad_label] = self.pad_idx
        self.word2idx.update({word:idx+max(self.word2idx.values())+1 for idx, word in enumerate(self.vocab)})

        self.idx2word = {v:k for k,v in self.word2idx.items()}

        self.labels = list(np.unique(self.data['relation']))

    def __unique_words(self):
        all_words = []
        for s in self.data['premise']:
            all_words += s
        for s in self.data['hypothesis']:
            all_words += s
        return np.unique(all_words)
        
    def tokenize(self, string):
        if isinstance(string, str): 
            # The tokenizer was given as a whitespace tokenizer
            return string.lower().split()
        else:  # for NaN
            return "<unk>"

    def __getitem__(self, idx):
        #x = self.data.iloc[0] #for test
        x = self.data.iloc[idx]
        out = (x['premise'], x['hypothesis'], x['relation'])
        return out
        
    def __len__(self):
        return len(self.data)

In [26]:
train_dataset = InferenceDataset(train_data)

In [27]:
print(train_dataset[0])
print(len(train_dataset))
print("labels", (train_dataset.labels))
print(len(train_dataset.word2idx))

(['a', 'person', 'on', 'a', 'horse', 'jumps', 'over', 'a', 'broken', 'down', 'airplane.'], ['a', 'person', 'is', 'training', 'his', 'horse', 'for', 'a', 'competition.'], 'neutral')
550152
labels ['-', 'contradiction', 'entailment', 'neutral']
56258


In [32]:
from collections import namedtuple
from torch.nn.utils.rnn import pad_sequence 

relation_to_idx = {k:v for v,k in enumerate(sorted(np.unique(train_data['relation'])))}
idx_relation = {v:k for k,v in relation_to_idx.items()}

class Collate():
    def __init__(self, word_to_idx, pad_idx=1, unk_idx=0, relation_to_idx=relation_to_idx):
        self.pad_idx = pad_idx
        self.unk_idx = unk_idx
        self.word_to_idx = word_to_idx
        self.relation_to_idx = relation_to_idx
    def __call__(self, batch):
        # batch is a list of (premise_tokens, hypothesis_tokens, relation) tuples;
        # zip(*batch) regroups it column-wise without building a ragged NumPy array
        premises, hypothesis, relations = zip(*batch)

        premises = [torch.tensor([self.word_to_idx.get(w, self.unk_idx) for w in s], device=device) for s in premises]
        premises = pad_sequence(premises, batch_first=True, padding_value=self.pad_idx)  # batch first

        hypothesis = [torch.tensor([self.word_to_idx.get(w, self.unk_idx) for w in s], device=device) for s in hypothesis]
        hypothesis = pad_sequence(hypothesis, batch_first=True, padding_value=self.pad_idx)  # batch first

        relations = [self.relation_to_idx[rel] for rel in relations]

        return premises, hypothesis, relations


# word2idx, pad_idx and unk_idx are passed in so that indices match between train and test.
# In hindsight, I'd probably do this another way.
def dataloader(dataset, word2idx, pad_idx, unk_idx, batch_size=32, shuffle=True):
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=Collate(word2idx, pad_idx, unk_idx) )
    return loader



In [45]:
loader = dataloader(train_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=5,shuffle=False)
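
To make the collate step concrete, here is a minimal, self-contained sketch of what index lookup and padding do to a batch of two token lists. The toy vocabulary `toy_word2idx` and the sentences are made up for illustration; only `pad_sequence` and the index convention (`<unk>` = 0, `<pad>` = 1) come from the code above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy vocabulary; indices 0 and 1 follow the <unk>/<pad> convention used above.
toy_word2idx = {'<unk>': 0, '<pad>': 1, 'a': 2, 'horse': 3, 'jumps': 4}
sentences = [['a', 'horse', 'jumps'], ['a', 'zebra']]  # 'zebra' is out of vocabulary

# Map tokens to indices, falling back to <unk> (0) for unknown words.
indexed = [torch.tensor([toy_word2idx.get(w, 0) for w in s]) for s in sentences]

# Pad every sequence to the length of the longest one in the batch.
padded = pad_sequence(indexed, batch_first=True, padding_value=1)
print(padded.shape)  # (batch_size, longest_sentence)
print(padded)
```

The shorter sentence is filled up with the padding index, so the whole batch fits into one rectangular tensor.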

# 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using one shared bidirectional LSTM (or two different LSTMs)
    2) Perform max pooling over the tokens in the premise and the hypothesis
    3) Combine the encoded premise and encoded hypothesis into one representation
    4) Predict the relationship 

### Creating a representation of a sentence

Let's first consider step 2 where we perform max/mean pooling. There is a function in pytorch for this, but we'll implement it from scratch. 

Let's consider the general case. What we want these methods to do is apply some function $f$ along dimension $i$, for every $i$. As an example, consider the matrix $S$ of size ``(N, D)``, where N is the number of words and D the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement the max pooling method. When performing max pooling, $\max$ is the function that selects the _maximum_ value from a vector and $x$ is the output. Thus, for each dimension $d$ in our output $x$ we get:

\begin{equation}
    x_d = \max(s_{1d}, s_{2d}, \dots, s_{nd})
\end{equation}


This operation will reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, dimensions)`` meaning that we now have created a sentence representation based on the content of the words representations in the sentence. 

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)``, performs max pooling, and returns the result (the output should be of size ``(batch_size, dimensions)``). [**4 Marks**]

In [38]:
num_words = 3
dimensions = 5
# A tensor for testing
test_tensor = torch.rand([batch_size, num_words, dimensions], dtype=torch.float64, device=device)
print(test_tensor)

tensor([[[0.8061, 0.2281, 0.5062, 0.6049, 0.7887],
         [0.2986, 0.7589, 0.3305, 0.5077, 0.6510],
         [0.6243, 0.8474, 0.4927, 0.0142, 0.0017]],

        [[0.2019, 0.0107, 0.8083, 0.0365, 0.5661],
         [0.1913, 0.8099, 0.0156, 0.5976, 0.6845],
         [0.0220, 0.8118, 0.6238, 0.6299, 0.1341]],

        [[0.1902, 0.2143, 0.8930, 0.1624, 0.5947],
         [0.8394, 0.1865, 0.9838, 0.6364, 0.4700],
         [0.7046, 0.0562, 0.7640, 0.4350, 0.6576]],

        [[0.9125, 0.8473, 0.7251, 0.2109, 0.4149],
         [0.6755, 0.2177, 0.6164, 0.5084, 0.3705],
         [0.5049, 0.1899, 0.7938, 0.5299, 0.4353]],

        [[0.4814, 0.3094, 0.8386, 0.5047, 0.4816],
         [0.0770, 0.5450, 0.1650, 0.1687, 0.6292],
         [0.4714, 0.7477, 0.4895, 0.9195, 0.0179]],

        [[0.7469, 0.5814, 0.7323, 0.9269, 0.4896],
         [0.7308, 0.0728, 0.1488, 0.7182, 0.4672],
         [0.7275, 0.8991, 0.3934, 0.1995, 0.9469]],

        [[0.9286, 0.7681, 0.3522, 0.0741, 0.5673],
         [0.7792, 0

In [39]:
def pooling(input_tensor):
    batches = len(input_tensor)
    words = len(input_tensor[0])
    dims = len(input_tensor[0][0])

    new_tensors = []

    for i in range(0,batches):
        new_tensor = []

        for j in range(0,dims):
            temp_tensor = []
        
            for k in range(0,words):
                temp_tensor.append(input_tensor[i][k][j])
            
            max_val = max(temp_tensor)
            new_tensor.append(max_val)
    
        actual_new_tensor = new_tensor[0].unsqueeze(0)
    
        for l in range(1,len(new_tensor)):
            actual_new_tensor = torch.cat((actual_new_tensor, new_tensor[l].unsqueeze(0)))

        new_tensors.append(actual_new_tensor)
    
    output_tensor = torch.stack(new_tensors)

    return output_tensor

In [40]:
pooling(test_tensor)

tensor([[0.8061, 0.8474, 0.5062, 0.6049, 0.7887],
        [0.2019, 0.8118, 0.8083, 0.6299, 0.6845],
        [0.8394, 0.2143, 0.9838, 0.6364, 0.6576],
        [0.9125, 0.8473, 0.7938, 0.5299, 0.4353],
        [0.4814, 0.7477, 0.8386, 0.9195, 0.6292],
        [0.7469, 0.8991, 0.7323, 0.9269, 0.9469],
        [0.9286, 0.7681, 0.5235, 0.9810, 0.5673],
        [0.6589, 0.8666, 0.5201, 0.7905, 0.7835]], dtype=torch.float64)
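
The nested loops above make the operation explicit, but they are slow on real batches. As a possible alternative (our assumption, not part of the assignment's from-scratch requirement), the same reduction can be written with `torch.max` along the word dimension. The helper name `pooling_vectorized` is hypothetical.

```python
import torch

def pooling_vectorized(input_tensor):
    # input_tensor: (batch_size, num_words, dimensions)
    # torch.max along dim=1 returns (values, indices); we keep only the values.
    return torch.max(input_tensor, dim=1).values

# One sentence with three words and two dimensions: shape (1, 3, 2)
example = torch.tensor([[[0.1, 0.9], [0.5, 0.2], [0.3, 0.4]]])
pooled = pooling_vectorized(example)
print(pooled.shape)
print(pooled)
```

Both versions also take the maximum over padded positions; masking those out before pooling would be a further refinement.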

### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; |P-H|; P \cdot H]$$

Here, we concatenate P, H, the absolute value of P minus H, and the element-wise product of P and H, then return the result.
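
As a small worked example with made-up numbers and $d = 2$, take $P = (1, 2)$ and $H = (3, 0.5)$. The difference and the product are both taken element-wise:

$$|P - H| = (2, 1.5), \qquad P \cdot H = (3, 1)$$

$$X = [1, 2, 3, 0.5, 2, 1.5, 3, 1]$$

so $X$ has $4d = 8$ entries.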

Implement the function. **[2 marks]**

In [41]:
# Test tensors (size of batch, num, dim)
t = torch.rand([2*batch_size, num_words, dimensions], dtype=torch.float64, device=device)
t1, t2 = torch.split(t, batch_size)
# Pooled test tensors (size of batch, dim)
pt1 = pooling(t1)
pt2 = pooling(t2)
print(pt1)
print(pt2)

tensor([[0.7854, 0.5951, 0.9977, 0.9915, 0.3051],
        [0.6959, 0.7494, 0.8796, 0.8771, 0.6148],
        [0.8451, 0.4234, 0.8094, 0.1745, 0.8213],
        [0.7048, 0.8210, 0.9782, 0.8134, 0.5554],
        [0.8686, 0.7258, 0.9975, 0.6675, 0.7278],
        [0.9578, 0.8795, 0.6341, 0.7726, 0.9834],
        [0.8506, 0.9707, 0.9655, 0.8623, 0.7997],
        [0.6671, 0.7804, 0.6940, 0.4311, 0.6858]], dtype=torch.float64)
tensor([[0.9325, 0.3408, 0.8460, 0.9052, 0.8642],
        [0.6949, 0.7839, 0.8212, 0.7429, 0.9528],
        [0.7171, 0.9547, 0.7250, 0.3005, 0.9764],
        [0.6507, 0.8414, 0.9914, 0.6650, 0.9458],
        [0.8940, 0.8808, 0.7559, 0.6979, 0.5078],
        [0.9085, 0.6848, 0.8746, 0.9807, 0.7672],
        [0.8020, 0.8979, 0.9812, 0.8814, 0.6962],
        [0.7778, 0.7521, 0.9830, 0.3091, 0.9889]], dtype=torch.float64)


In [42]:
def combine_premise_and_hypothesis(hypothesis, premise):
    # Both inputs have size (batch_size, dims); the output has size (batch_size, 4*dims).
    # X = [P; H; |P-H|; P*H], with the difference and the product taken element-wise.
    difference = torch.abs(premise - hypothesis)
    product = premise * hypothesis

    return torch.cat((premise, hypothesis, difference, product), dim=1)

In [43]:
combine_premise_and_hypothesis(pt1, pt2)

tensor([[0.9325, 0.3408, 0.8460, 0.9052, 0.8642, 0.7854, 0.5951, 0.9977, 0.9915,
         0.3051, 0.1471, 0.2544, 0.1516, 0.0863, 0.5591, 0.7324, 0.2028, 0.8441,
         0.8975, 0.2637],
        [0.6949, 0.7839, 0.8212, 0.7429, 0.9528, 0.6959, 0.7494, 0.8796, 0.8771,
         0.6148, 0.0010, 0.0345, 0.0584, 0.1342, 0.3380, 0.4835, 0.5875, 0.7224,
         0.6516, 0.5858],
        [0.7171, 0.9547, 0.7250, 0.3005, 0.9764, 0.8451, 0.4234, 0.8094, 0.1745,
         0.8213, 0.1280, 0.5313, 0.0844, 0.1260, 0.1552, 0.6060, 0.4042, 0.5868,
         0.0524, 0.8019],
        [0.6507, 0.8414, 0.9914, 0.6650, 0.9458, 0.7048, 0.8210, 0.9782, 0.8134,
         0.5554, 0.0541, 0.0204, 0.0131, 0.1484, 0.3904, 0.4586, 0.6908, 0.9698,
         0.5409, 0.5253],
        [0.8940, 0.8808, 0.7559, 0.6979, 0.5078, 0.8686, 0.7258, 0.9975, 0.6675,
         0.7278, 0.0253, 0.1550, 0.2416, 0.0305, 0.2200, 0.7765,

### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionally, you should use *dropout* in the model. For efficiency, it's acceptable to train the model with only one of max or mean pooling.

Implement the model [**6 marks**]

In [78]:
class SNLIModel(nn.Module):
    def __init__(self, word2idx, relation2idx, embedding_dim=32, hidden_size=128, padding_idx=1):
        super().__init__()
        self.vocab_size = len(word2idx)
        self.output_dim = len(relation2idx)
        self.hidden_size = hidden_size
        # your code goes here
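        self.embeddings = nn.Embedding(self.vocab_size, embedding_dim, padding_idx=padding_idx)
        # batch_first=True because the collate function pads batches as (batch, words)
        self.LSTM = nn.LSTM(input_size=embedding_dim, hidden_size=self.hidden_size, num_layers=1, bidirectional=True, batch_first=True)
        # 2 directions * 4 concatenated parts = 8 * hidden_size input features
        self.classifier = nn.Linear(self.hidden_size*8, self.output_dim)
        self.dropout = nn.Dropout(0.1)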
        
    def forward(self, premise, hypothesis):
        p = self.embeddings(premise)
        h = self.embeddings(hypothesis)
        
        lstm_p, (hidden, c) = self.LSTM(p)
        lstm_h, (hidden, c) = self.LSTM(h)
        
        p_pooled = pooling(lstm_p)
        h_pooled = pooling(lstm_h)
        
        ph_representation = combine_premise_and_hypothesis(h_pooled,p_pooled)
        ph_representation = self.dropout(ph_representation)  # is this at the right stage??
        
        predictions = self.classifier(ph_representation)
        
        return predictions

# 3. Training and testing

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[2 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*
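
One way to follow the tip above is to cap the number of batches per epoch while developing. The sketch below uses `itertools.islice` for this; the helper name `limited` and the stand-in `fake_loader` are hypothetical, and any iterable of batches (such as a `DataLoader`) could take its place.

```python
from itertools import islice

def limited(batches, max_batches=None):
    # Yield at most max_batches items; None means "use everything".
    return islice(batches, max_batches)

# Stand-in for a DataLoader: any iterable of batches works the same way.
fake_loader = [f'batch_{i}' for i in range(100)]

for epoch in range(2):
    for batch in limited(fake_loader, max_batches=1):
        print(epoch, batch)  # the real training/testing step would go here
```

Setting `max_batches=None` restores the full run without changing the loop.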

In [91]:
train_dataset = InferenceDataset(train_data)
test_dataset = InferenceDataset(test_data)
dev_dataset = InferenceDataset(dev_data)

In [92]:
#remember change batch size to 8
#do we need a new loader for every epoch? it worked here
loader = dataloader(train_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=5,shuffle=False)
test_loader = dataloader(test_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=5,shuffle=False)
dev_loader = dataloader(dev_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=5,shuffle=False)

In [93]:
# train_iter, test_iter = dataloader(path_to_folder)
# train, dev, test, premises, hypotheses, relations = dataloader('./')

# pass word2idx (not vocab) so the embedding table also covers the <unk> and <pad> indices
model = SNLIModel(train_dataset.word2idx, train_dataset.labels).to(device)
loss_function = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(epochs):
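    # train model
    # NOTE: this run iterates over dev_loader rather than the training loader (`loader`)
    total_loss = 0
    for i, batch in enumerate(dev_loader):
        prems = batch[0]
        hyps = batch[1]
        rels = torch.tensor(batch[2], dtype=torch.long, device=device)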

        output = model(prems, hyps)
        
        loss = loss_function(output, rels)
        total_loss += loss.item()
        
        if i%10==0:
            print(f' Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
            
        # calculate gradients
        loss.backward()
        # update model weights
        optimizer.step()
        # reset gradients
        optimizer.zero_grad()
    
# test model after all epochs are completed

 Batch 0 : Average Loss = 1.40449
 Batch 10 : Average Loss = 1.29644
 Batch 20 : Average Loss = 1.25519
 Batch 30 : Average Loss = 1.21706
 Batch 40 : Average Loss = 1.21594
 Batch 50 : Average Loss = 1.20818
 Batch 60 : Average Loss = 1.20633
 Batch 70 : Average Loss = 1.20285
 Batch 80 : Average Loss = 1.20869
 Batch 90 : Average Loss = 1.21662
 Batch 100 : Average Loss = 1.2063
 Batch 110 : Average Loss = 1.20213
 Batch 120 : Average Loss = 1.19442
 Batch 130 : Average Loss = 1.19616
 Batch 140 : Average Loss = 1.18865
 Batch 150 : Average Loss = 1.20131
 Batch 160 : Average Loss = 1.19774
 Batch 170 : Average Loss = 1.19363
 Batch 180 : Average Loss = 1.19312
 Batch 190 : Average Loss = 1.18933
 Batch 200 : Average Loss = 1.19558
 Batch 210 : Average Loss = 1.19424
 Batch 220 : Average Loss = 1.19504
 Batch 230 : Average Loss = 1.19602
 Batch 240 : Average Loss = 1.19247
 Batch 250 : Average Loss = 1.18865
 Batch 260 : Average Loss = 1.18718
 Batch 270 : Average Loss = 1.18634
 Bat

In [94]:
torch.save(model, 'inference_dev.model')

In [96]:
import pickle
#so that when we load it back in, we can have access to the same word2idx etc.
with open("train_dataset.pickle","wb") as f:
    pickle.dump(train_dataset, f)

In [None]:
with open("train_dataset.pickle", 'rb') as f:
    train_dataset = pickle.load(f)
    
model = torch.load('inference_dev.model')
model.eval()
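
Saving the whole model object with `torch.save(model, ...)` ties the file to the exact class definition that was in scope when it was pickled. A commonly recommended alternative is to save only the parameters via `state_dict()`. The sketch below uses a tiny stand-in `nn.Linear` model and a temporary file; both are assumptions for illustration, not the lab's model.

```python
import os
import tempfile

import torch
import torch.nn as nn

model_a = nn.Linear(4, 3)  # stand-in for the trained model
path = os.path.join(tempfile.mkdtemp(), 'toy_model_state.pt')

# Save only the learned parameters.
torch.save(model_a.state_dict(), path)

# Re-create the architecture, then load the parameters into it.
model_b = nn.Linear(4, 3)
model_b.load_state_dict(torch.load(path))
model_b.eval()  # switch to evaluation mode before testing

print(torch.equal(model_a.weight, model_b.weight))
```

For the model in this lab, the same pattern would require constructing `SNLIModel` with the same arguments before calling `load_state_dict`.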

In [104]:
# I interrupted this because it is slow and with the 400+ batches we get a good idea of the accuracy either way.
with torch.no_grad():

    correct = 0
    counter = 0
    for i, batch in enumerate(test_loader):
        test_output = model(batch[0], batch[1])
        # test_output = model(batch[1])
        test_output = torch.argmax(test_output, dim=1)
        targets = torch.tensor(batch[2], device=device)
        correct += torch.sum(test_output == targets)
        counter += len(test_output)

        test_accu = correct/counter
        
        if i%10==0:
            print(f' Batch {i} : Average Test Accuracy = {round(float(test_accu), 5)}')

    print(f'Total Test Accuracy = {round(float(test_accu), 5)}')

 Batch 0 : Average Test Accuracy = 0.4
 Batch 10 : Average Test Accuracy = 0.50909
 Batch 20 : Average Test Accuracy = 0.51429
 Batch 30 : Average Test Accuracy = 0.52903
 Batch 40 : Average Test Accuracy = 0.54634
 Batch 50 : Average Test Accuracy = 0.53333
 Batch 60 : Average Test Accuracy = 0.54426
 Batch 70 : Average Test Accuracy = 0.52113
 Batch 80 : Average Test Accuracy = 0.5358
 Batch 90 : Average Test Accuracy = 0.53846
 Batch 100 : Average Test Accuracy = 0.52475
 Batch 110 : Average Test Accuracy = 0.52793
 Batch 120 : Average Test Accuracy = 0.52397
 Batch 130 : Average Test Accuracy = 0.51603
 Batch 140 : Average Test Accuracy = 0.52766
 Batch 150 : Average Test Accuracy = 0.52583
 Batch 160 : Average Test Accuracy = 0.52671
 Batch 170 : Average Test Accuracy = 0.52398
 Batch 180 : Average Test Accuracy = 0.52265
 Batch 190 : Average Test Accuracy = 0.51728
 Batch 200 : Average Test Accuracy = 0.51642
 Batch 210 : Average Test Accuracy = 0.51943
 Batch 220 : Average Test 

KeyboardInterrupt: 

Suggest a _baseline_ that we can compare our model against **[2 marks]**

Theoretically, given that we have 4 classes ('entailment', 'contradiction', 'neutral', '-'), if we guess blindly, we should have 25% accuracy. However, this conclusion assumes that the classes are evenly spread out, which is not the case. While inspecting the files one can notice that the '-' class is rather rare (and should probably have been excluded). Thus, given the three basic classes (mentioned as such in the task description above, by the way), guessing the same class every time should give us a **33% accuracy** (as these three are rather evenly spread out). We would suggest this to be our baseline.
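
Rather than assuming the 33% figure, the baseline can be computed from the label columns. This is a minimal sketch of a majority-class baseline; the function name `majority_baseline` and the toy label lists are made up to show the mechanics, and no real SNLI numbers are implied.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    # Predict the most frequent training label for every test example.
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)

# Made-up label lists, only to show the mechanics.
toy_train = ['entailment', 'neutral', 'entailment', 'contradiction']
toy_test = ['entailment', 'neutral', 'contradiction', 'entailment']
label, accuracy = majority_baseline(toy_train, toy_test)
print(label, accuracy)
```

With the data frames above, the inputs would be `train_data['relation']` and `test_data['relation']`.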

Suggest some ways (other than using a baseline) in which we can analyse the model's performance **[4 marks]**.

+ We could look at the model's performance per class, since a) the classes differ in frequency, with '-' being a special case, and b) the 'neutral' class is notoriously difficult for NLI models to predict. We could then check whether this holds for our model.
+ Having done that, we could compare that performance to other BiLSTM models' performance in general and in these particular classes.
+ We could look at class-wise precision and recall to see if our model is underpredicting or overpredicting certain classes.
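
The class-wise precision and recall mentioned above could be computed along these lines. This is a plain-Python sketch under the assumption that gold and predicted labels are available as two equally long lists; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def per_class_precision_recall(gold, predicted):
    # Count true positives, predicted totals and gold totals per label.
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    pred_total = Counter(predicted)
    gold_total = Counter(gold)
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        precision = true_pos[label] / pred_total[label] if pred_total[label] else 0.0
        recall = true_pos[label] / gold_total[label] if gold_total[label] else 0.0
        scores[label] = (precision, recall)
    return scores

toy_gold = ['entailment', 'neutral', 'neutral', 'contradiction']
toy_pred = ['entailment', 'entailment', 'neutral', 'contradiction']
for label, (precision, recall) in per_class_precision_recall(toy_gold, toy_pred).items():
    print(f'{label}: precision={precision:.2f} recall={recall:.2f}')
```

Low precision for a class suggests the model overpredicts it, while low recall suggests it underpredicts it.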

Suggest some ways to improve the model **[3 marks]**.

+ We are pretty sure that we should remove the problematic lines from the data - the ones where the class is marked as '-' as well as the ones where one of the sentences is marked as N/A, which translated to NaN when the file was loaded in and caused all sorts of issues (and was replaced by one unknown token). 
+ We are unsure if the dropout rate we set was a good one, or if we set it in the right place.
+ We could test different hyperparameters and see which ones are best for the model.
+ We could make use of pre-trained word embeddings instead of initializing them from scratch.
+ We could experiment with the number of different layers aside from LSTM and classification.
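
The first suggestion, removing the problematic lines, could look roughly like this with pandas. The helper name `drop_problematic_rows` and the toy data frame are assumptions for illustration; the column names match the ones used when loading the data above.

```python
import pandas as pd

def drop_problematic_rows(frame):
    # Remove rows whose class is '-' and rows with a missing sentence.
    frame = frame[frame['relation'] != '-']
    return frame.dropna(subset=['premise', 'hypothesis']).reset_index(drop=True)

toy = pd.DataFrame({
    'premise': ['a dog runs.', 'a cat sleeps.', None],
    'hypothesis': ['an animal moves.', 'a cat is awake.', 'something happens.'],
    'relation': ['entailment', '-', 'neutral'],
})
cleaned = drop_problematic_rows(toy)
print(len(toy), '->', len(cleaned))
```

Applying the same filter to the train, dev and test frames before building the datasets would also make the NaN branch in `tokenize` unnecessary.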

### Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 

[2] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.