# A5: Natural Language Inference using Neural Networks

Adam Ek

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Write all your answers and the code in the appropriate boxes below.


In this lab we will work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a sentence P, if H definitely describe something true given P then it is an **entailment**. If H describe something that's *maybe* true given P, it's **neutral**, and if H describe something that's definitely *false* given P it's a **contradiction**. 

## 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. The dataset can be downloaded [here](https://nlp.stanford.edu/projects/snli/). We prepared a "simplified" version, with only the relevant columns in `simple_snli_1.0.zip`.

The (simplified) data is organized as follows (tab-separated values):
* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation

Like in the previous lab, we'll use torchtext to build a dataloader. You can essentially do the same thing as you did in the last lab, but with our new dataset. **[1 mark]**

In [1]:
!wget https://nlp.stanford.edu/projects/snli/snli_1.0.zip

--2023-05-29 12:42:54--  https://nlp.stanford.edu/projects/snli/snli_1.0.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94550081 (90M) [application/zip]
Saving to: ‘snli_1.0.zip’


2023-05-29 12:43:06 (8.84 MB/s) - ‘snli_1.0.zip’ saved [94550081/94550081]



In [2]:
!unzip snli_1.0.zip

Archive:  snli_1.0.zip
   creating: snli_1.0/
  inflating: snli_1.0/.DS_Store      
   creating: __MACOSX/
   creating: __MACOSX/snli_1.0/
  inflating: __MACOSX/snli_1.0/._.DS_Store  
 extracting: snli_1.0/Icon           
  inflating: __MACOSX/snli_1.0/._Icon  
  inflating: snli_1.0/README.txt     
  inflating: __MACOSX/snli_1.0/._README.txt  
  inflating: snli_1.0/snli_1.0_dev.jsonl  
  inflating: snli_1.0/snli_1.0_dev.txt  
  inflating: snli_1.0/snli_1.0_test.jsonl  
  inflating: snli_1.0/snli_1.0_test.txt  
  inflating: snli_1.0/snli_1.0_train.jsonl  
  inflating: snli_1.0/snli_1.0_train.txt  
  inflating: __MACOSX/._snli_1.0     


In [11]:
!pip install gensim

Defaulting to user installation because normal site-packages is not writeable
Collecting gensim
  Downloading gensim-4.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.4 MB)
     |████████████████████████████████| 26.4 MB 272 kB/s            
Installing collected packages: gensim
Successfully installed gensim-4.3.1


In [48]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from gensim.utils import tokenize
from torch.utils.data import DataLoader, Dataset
import pickle
import torch.optim as optim
from tqdm.auto import tqdm

device = torch.device('cuda:3')

In [2]:
train_path = './snli_1.0/snli_1.0_train.txt'
dev_path = './snli_1.0/snli_1.0_dev.txt'
test_path = './snli_1.0/snli_1.0_test.txt'

In [49]:
class Vocab:
    def __init__(self, tokens, pad_token='PAD', unk_token='UNK', unk_and_pad=True):
        """If we are creating a vocab of a finite set of labels, we don't need unk and pad tokens. 
        We then set unk_and_pad to False.
        """
        if unk_and_pad:
            self.tokens = tokens+[unk_token]
            self.i2t = {i: t for i, t in enumerate(self.tokens, start=1)}
            self.i2t[0] = pad_token
            self.t2i = {v: k for k, v in self.i2t.items()}
            self.tokens += [pad_token] 
        else:
            self.tokens = tokens
            self.i2t = {i: t for i, t in enumerate(self.tokens)}
            self.t2i = {v: k for k, v in self.i2t.items()}
        
    def __len__(self):
        return len(self.tokens)
    
    def __getitem__(self, x):
        if type(x) == str:
            return self.t2i[x]
        if type(x) == int:
            return self.i2t[x]

In [62]:
class NLI_Dataset(Dataset):
    def __init__(self, tsv_file,
                 train=True,
                 unk_token='UNK',
                 pad_token='PAD',
                 vocab=None,
                 label_vocab=None):
        
        self.unk_token = unk_token
        self.pad_token = pad_token
        self.alldata = pd.read_csv(tsv_file, sep='\t')
        self.data = self.remove_missing(self.alldata)
        
        self.gold_labels = [label for label in self.data['gold_label']]
        self.premises = [list(tokenize(premise)) for premise in self.data['sentence1']]
        self.hypotheses = [list(tokenize(hypothesis)) for hypothesis in self.data['sentence2']]
        
        if train:
            self.tokens = list(set([token for line in self.premises for token in line]+[token for line in self.hypotheses for token in line]))
            self.vocab = Vocab(self.tokens, pad_token=self.pad_token, unk_token=self.unk_token)    
            self.label_vocab = Vocab(['neutral', 'contradiction', '-', 'entailment'], unk_and_pad=False)
            self.int_premises = [[self.vocab[word] for word in seq] for seq in self.premises]
            self.int_hypotheses = [[self.vocab[word] for word in seq] for seq in self.hypotheses]
        
        else:
            self.vocab = vocab
            self.label_vocab = label_vocab
            self.int_premises = [[self.vocab[word] if word in self.vocab.t2i else self.vocab[self.unk_token] for word in seq] for seq in self.premises]
            self.int_hypotheses = [[self.vocab[word] if word in self.vocab.t2i else self.vocab[self.unk_token] for word in seq] for seq in self.hypotheses]
            
        self.int_gold_labels = [self.label_vocab[label] for label in self.gold_labels]
    
    def remove_missing(self, df): # Quickfix for removing nan values in the training data (there were only six such rows).
        
        droprows = [i for i, x in enumerate(df['sentence1']) if type(x) != str]
        droprows += [i for i, x in enumerate(df['sentence2']) if type(x) != str]
        droprows += [i for i, x in enumerate(df['gold_label']) if type(x) != str]
        droprows = list(set(droprows))
        
        return df.drop(index=droprows)
    
    def __getitem__(self, idx):
        
        return (self.int_gold_labels[idx], self.int_premises[idx], self.int_hypotheses[idx])

    def __len__(self):
        return len(self.int_gold_labels)

In [63]:
def nli_pad_fn(data):
    p_len = max([len(x[1]) for x in data])
    h_len = max([len(x[2]) for x in data])
    padded_data = [(x[0], [w for w in x[1]+[0]*(p_len-len(x[1]))], [w for w in x[2]+[0]*(h_len-len(x[2]))]) for x in data]
    return padded_data


def dataloader(path_to_snli, batch_size, train=True, vocab=None, label_vocab=None):
    dataset = NLI_Dataset(path_to_snli, train=train, vocab=vocab, label_vocab=label_vocab)
    dataloader = DataLoader(dataset,
                            batch_size=batch_size,
                            shuffle=True,
                            collate_fn=nli_pad_fn)
    if train==True:
        return dataloader, dataset.vocab, dataset.label_vocab
    else:
        return dataloader

## 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using one shared bidirectional LSTM (or two different LSTMS)
    2) Perform max over the tokens in the premise and the hypothesis
    3) Combine the encoded premise and encoded hypothesis into one representation
    4) Predict the relationship 

### Creating a representation of a sentence

Let's first consider step 2 where we perform max/mean pooling. There is a function in pytorch for this, but we'll implement it from scratch. 

Let's consider the general case, what we want to do for these methods is apply some function $f$ along dimension $i$, and we want to do this for all $i$'s. As an example we consider the matrix S with size ``(N, D)`` where N is the number of words and D the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement both the max pooling method. When performing max-pooling, $max$ will be the function which selects a _maximum_ value from a vector and $x$ is the output, thus for each dimension $d$ in our output $x$ we get:

\begin{equation}
    x_d = max(s_{1d}, s_{2d}, ..., s_{nd})
\end{equation}


This operation will reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, dimensions)`` meaning that we now have created a sentence representation based on the content of the words representations in the sentence. 

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)`` then performs max pooling and returns the result (the output should be of size: ```(batch_size, dimensions)```). [**4 Marks**]

In [66]:
def pooling(input_tensor):
    output_tensor = torch.max(input_tensor, 1)[0] 
    return output_tensor

### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; |P-H|; P \cdot H]$$

Here, what we do is concatenating P, H, P times H, and the absolute value of P minus H, then return the result.

Implement the function. **[2 marks]**

In [67]:
def combine_premise_and_hypothesis(premise, hypothesis):
    p_minus_h = torch.sub(premise, hypothesis)
    p_times_h = torch.stack([torch.mul(premise[i],hypothesis[i])for i in range(len(premise))])
    output = torch.cat((premise, hypothesis, p_minus_h, p_times_h), 1)
    return output

### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionaly, in the model you should use *dropout*. For efficiency purposes, it's acceptable to only train the model with either max or mean pooling. 

Implement the model [**6 marks**]

In [68]:
class SNLIModel(nn.Module):
    def __init__(self, vocab_size, embedding_dims, hidden_dims, n_labels, pad_index=0, dropout=0.1):
        super(SNLIModel, self).__init__()
        
        self.pad_index = pad_index
        self.embeddings = nn.Embedding(vocab_size, 
                                       embedding_dims, 
                                       padding_idx=pad_index)
        self.rnn = nn.LSTM(embedding_dims, 
                           hidden_dims, 
                           bidirectional=True,
                           batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dims*8, n_labels) 
        
    def forward(self, premise, hypothesis):
        p_embedded = self.embeddings(premise)
        h_embedded = self.embeddings(hypothesis)
        
        p = self.dropout(p_embedded)  
        h = self.dropout(h_embedded)  
        
        p, _ = self.rnn(p)
        h, _ = self.rnn(h)
         
        p_pooled = pooling(p)
        h_pooled = pooling(h)
        
        ph_representation = combine_premise_and_hypothesis(p_pooled,h_pooled)
        predictions = self.classifier(ph_representation)
        
        return predictions

## 3. Training and testing

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[2 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*

In [76]:
# Hyperparameters:
epochs = 3
batch_size = 4
embedding_dims = 128
hidden_dims = 128
dropout = 0.1
learning_rate = 0.001

In [75]:
# Load data:
train_iter, vocab, label_vocab = dataloader(train_path, batch_size)
test_iter = dataloader(test_path, batch_size, train=False, vocab=vocab, label_vocab=label_vocab)

In [77]:
!pip install nvidia-cudnn-cu11==8.5.0.96  # To keep cudnn from throwing errors all the time.

Defaulting to user installation because normal site-packages is not writeable


In [78]:
model = SNLIModel(len(vocab), embedding_dims, hidden_dims, len(label_vocab), dropout=dropout).to(device)
loss_function = nn.CrossEntropyLoss(reduction='mean').to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate) 

In [79]:
for epoch in tqdm(range(epochs)):
    model.train()
    total_loss = 0
    for i, batch in enumerate(train_iter):
         
        gold_labels = torch.Tensor([example[0] for example in batch]).long().to(device)
        premises = torch.Tensor([example[1] for example in batch]).long().to(device)
        hypotheses = torch.Tensor([example[2] for example in batch]).long().to(device)
                  
        output = model(premises, hypotheses)
                  
        loss = loss_function(output, gold_labels)
        total_loss += loss.item()
        
        loss.backward()
                  
        optimizer.step()
                  
        optimizer.zero_grad()             
                         
        
        if (i%100) == 0:
            print('total_loss:', round(total_loss / (i + 1), 4), end='\r')
            

# save trained model
pickle.dump(model, open('nli_model.pickle', 'wb'))
    
# test model after all epochs are completed
accuracies = []
for batch in test_iter:
    gold_labels = [example[0] for example in batch]
    premises = torch.Tensor([example[1] for example in batch]).long().to(device)
    hypotheses = torch.Tensor([example[2] for example in batch]).long().to(device)
    output = model(premises, hypotheses).cpu().numpy(force=True)
    model_predictions = [np.argmax(output[i]) for i in range(len(batch))]
    batch_accuracies = [int(gold_labels[i] == model_predictions[i]) for i in range(len(batch))]
    accuracies.extend(batch_accuracies)

print('accuracy:', sum(accuracies)/len(accuracies))


  0%|          | 0/3 [00:00<?, ?it/s]

accuracy: 0.721564


Suggest a _baseline_ that we can compare our model against **[2 marks]**

**Your answer should go here**

Suggest some ways (other than using a baseline) in which we can analyse the models performance **[4 marks]**.

**Your answer should go here**

Suggest some ways to improve the model **[3 marks]**.

**Your answer should go here**

## Readings

[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). 

[2] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

## Statement of contribution

Briefly state how many times you have met for discussions, who was present, to what degree each member contributed to the discussion and the final answers you are submitting.

## Marks

This assignment has a total of 23 marks.