# Natural Language Inference using Neural Networks
Adam Ek

----------------------------------

The lab is an exploration and learning exercise to be done in a group and also in discussion with the teachers and other students.

Before starting, please read the instructions on [how to work on group assignments](https://github.com/sdobnik/computational-semantics/blob/master/README.md).

Write all your answers and the code in the appropriate boxes below.

----------------------------------

In the VG part of problem set 3, we will work with neural networks for natural language inference. Our task is: given a premise sentence P and hypothesis H, what entailment relationship holds between them? Is H entailed by P, contradicted by P or neutral towards P?

Given a premise P, if H describes something that is definitely true given P, it is an **entailment**. If H describes something that *may* be true given P, it is **neutral**, and if H describes something that is definitely *false* given P, it is a **contradiction**.

This definition of inference, and the method we use to solve it, is different from what you've previously worked with. Briefly discuss strengths and weaknesses of using formal semantics versus using statistical methods for natural language inference. **[4 marks]**

**Your answer should go here**

Formal semantics: It requires a lot of manual work to categorize expressions and define logical operations. Its accuracy might be higher than that of statistical models in tasks such as question answering, because formal semantics can make generalizations about language that include syntax. However, given the complexity of the modeling, it is a computationally expensive alternative. At the morphological and sentential level, words and sentences need to be parsed, written, and structured correctly in order to be identified and processed, which makes the approach more error prone.

Statistical methods: These are more automated, but they generalize less and encode less knowledge of language, such as syntax. However, they can offer richer word representations that carry meaning, as we see in vector space models (VSM). These representations are corpus based, though, which means that a better model requires more data to build a more inclusive picture of the language with sufficient coverage. A larger corpus combined with improved word representations, such as deep contextualized representations, would make the model good at handling polysemy and synonymy, where context plays a crucial role. However, a larger corpus also requires annotation, and different structures depending on the task the model is trained for. So being corpus dependent has both advantages and disadvantages.


# 1. Data

We will explore natural language inference using neural networks on the SNLI dataset, described in [1]. The dataset can be downloaded [here](https://nlp.stanford.edu/projects/snli/). We prepared a "simplified" version, with only the relevant columns [here](https://gubox.box.com/s/idd9b9cfbks4dnhznps0gjgbnrzsvfs4).

The (simplified) data is organized as follows (tab-separated values):
* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation
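
As a rough illustration of this layout, the sketch below parses two made-up rows with Python's standard `csv` module. The example sentences, the header names, and the presence of a header row are assumptions for illustration, not taken from the actual files.

```python
import csv
import io

# Two hypothetical rows in the simplified format:
# premise <TAB> hypothesis <TAB> relation.
# The header names and the sentences are made up for illustration.
sample = (
    "premise\thypothesis\trelation\n"
    "A man is playing a guitar.\tA person is making music.\tentailment\n"
    "A man is playing a guitar.\tA man is asleep.\tcontradiction\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)                   # skip the (assumed) header row
rows = [tuple(row) for row in reader]   # (premise, hypothesis, relation)

for premise, hypothesis, relation in rows:
    print(f"{relation:>13}: {premise} -> {hypothesis}")
```

Each row unpacks into exactly three fields, which is what the three `Field` objects in the loader correspond to.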

Like in the previous lab, we'll use torchtext to build a dataloader. You can essentially do the same thing as you did in the last lab, but with our new dataset. **[1 mark]**

In [1]:
import pandas as pd

In [2]:
import torch
from torchtext.data import Field, TabularDataset, BucketIterator

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


def dataloader(path):
    # whitespace tokenizer
    whitespacer = lambda x: x.split(' ')

    # premises and hypotheses are lowercased and padded/truncated to 16 tokens
    premises = Field(tokenize=whitespacer, lower=True, batch_first=True,
                     pad_token="<pad>", unk_token="<unk>", fix_length=16)
    hypothesis = Field(tokenize=whitespacer, lower=True, batch_first=True,
                       pad_token="<pad>", unk_token="<unk>", fix_length=16)
    # the relation is a single label per example, not a token sequence
    relation = Field(batch_first=True, sequential=False, is_target=True)

    # create tabular datasets (the files are tab-separated despite the .csv extension)
    train, test = TabularDataset.splits(path=path,
                                        train="simple_snli_1.0_train.csv",
                                        test="simple_snli_1.0_test.csv",
                                        format="csv",
                                        fields=[("premises", premises),
                                                ("hypothesis", hypothesis),
                                                ("relation", relation)],
                                        skip_header=True,
                                        csv_reader_params={"delimiter": "\t"})

    # build one vocabulary shared by premises and hypotheses
    premises.build_vocab(train.premises, test.premises, train.hypothesis, test.hypothesis)
    hypothesis.vocab = premises.vocab
    relation.build_vocab(train, test, min_freq=3)

    train_iter, test_iter = BucketIterator.splits((train, test),
                                                  batch_size=12,
                                                  sort_within_batch=False,
                                                  shuffle=True,
                                                  sort_key=lambda x: (len(x.premises), len(x.hypothesis)),
                                                  device=device)
    return train_iter, test_iter, premises, relation

# 2. Model

In this part, we'll build the model for predicting the relationship between H and P.

We will process each sentence using an LSTM. Then, we will construct some representation of the sentence. When we have a representation for H and P, we will combine them into one vector which we can use to predict the relationship.

We will train a model described in [2], the BiLSTM with mean/max-pooling model. The procedure for the model is roughly:

    1) Encode the Hypothesis and the Premise using a bidirectional LSTM
    2) Perform max or mean pooling over the premise and hypothesis
    3) Combine the premise and hypothesis into one representation
    4) Predict the relationship 
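
To make the tensor sizes in these steps concrete, here is a small bookkeeping sketch. It does not build a network; it only computes the shape after each step, assuming a bidirectional LSTM that concatenates its forward and backward states and the four-way concatenation described later in this section. The function name `trace_shapes` is hypothetical, and the example sizes are the ones used in the code cells of this notebook.

```python
def trace_shapes(batch_size, num_words, embedding_dim, hidden_dim):
    """Return the tensor shape after each step of the procedure."""
    # a bidirectional LSTM concatenates forward and backward states
    encoder_dim = 2 * hidden_dim
    return {
        "embedded": (batch_size, num_words, embedding_dim),
        "encoded": (batch_size, num_words, encoder_dim),   # step 1
        "pooled": (batch_size, 1, encoder_dim),            # step 2
        "combined": (batch_size, 1, 4 * encoder_dim),      # step 3
    }


shapes = trace_shapes(batch_size=12, num_words=16, embedding_dim=50, hidden_dim=25)
for name, shape in shapes.items():
    print(f"{name:>9}: {shape}")
```

With a hidden size of 25, the combined vector has 4 * 2 * 25 = 200 features, which is why the classifier's linear layer takes `hidden_dim * 8` inputs.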

### Creating a representation of a sentence

Let's first consider step 2, where we perform max/mean pooling. There is a function in PyTorch for this, but we'll implement it from scratch.

Let's consider the general case: for these methods, we want to apply some function $f$ along dimension $i$, and we want to do this for every $i$. As an example, consider the matrix $S$ of size ``(N, D)``, where N is the number of words and D the number of dimensions:

$S = \begin{bmatrix}
    s_{11} & s_{12} & s_{13} & \dots  & s_{1d} \\
    s_{21} & s_{22} & s_{23} & \dots  & s_{2d} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    s_{n1} & s_{n2} & s_{n3} & \dots  & s_{nd}
\end{bmatrix}$

What we want to do is apply our function $f$ on each dimension, taking the input $s_{1d}, s_{2d}, ..., s_{nd}$ and generating the output $x_d$. 

You will implement both the max and mean pooling methods. When performing mean-pooling, $f$ will be the mean function and $x$ is the output, thus for each dimension $d$ we calculate:

\begin{equation}
x_d = \frac{1}{N}\sum_{j=1}^N s_{jd}
\end{equation}

When performing max-pooling we do the same thing, but let $f$ be the ``max`` function (the largest value itself, not its index, which is what ``argmax`` would return):

\begin{equation}
    x_d = f(s_{1d}, s_{2d}, ..., s_{nd}) = \max(s_{1d}, s_{2d}, ..., s_{nd})
\end{equation}


Both of these operations reduce a batch of size ``(batch_size, num_words, dimensions)`` to ``(batch_size, 1, dimensions)``, meaning that we have now created a sentence representation based on the content of the word representations in the sentence (by applying some function $f$ along a dimension).
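
A small worked example may help here. The sketch below uses plain nested Python lists instead of tensors, so it runs without PyTorch; the helper names `pool` and `mean` and the numbers are made up for illustration. It applies a function $f$ to each dimension across the words of a sentence, once with the mean and once with the max.

```python
def pool(batch, f):
    """Apply f over the word axis of a (batch_size, num_words, dimensions) nested list."""
    pooled = []
    for sentence in batch:
        # zip(*sentence) regroups the rows so each tuple holds one dimension's values
        pooled.append([[f(column) for column in zip(*sentence)]])
    return pooled


def mean(values):
    return sum(values) / len(values)


# One sentence with N = 3 words and D = 2 dimensions.
batch = [[[1.0, 6.0],
          [2.0, 4.0],
          [3.0, 5.0]]]

print(pool(batch, mean))  # [[[2.0, 5.0]]]
print(pool(batch, max))   # [[[3.0, 6.0]]]
```

The result keeps a singleton word axis, so a ``(1, 3, 2)`` input becomes ``(1, 1, 2)``, matching the size reduction described above.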

Create a function that takes as input a tensor of size ``(batch_size, num_words, dimensions)``, performs max or mean pooling, and returns the result. [**6 Marks**]

In [3]:

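import torch
import torch.nn as nn

def pooling(input_tensor, mode="max"):
    # (batch_size, num_words, dimensions) -> (batch_size, 1, dimensions)
    if mode == "max":
        # torch.max over a dimension returns (values, indices); keep the values
        pooled, _ = torch.max(input_tensor, dim=1, keepdim=True)
    elif mode == "mean":
        pooled = torch.mean(input_tensor, dim=1, keepdim=True)
    else:
        raise ValueError("mode must be 'max' or 'mean'")
    return pooled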
    
    


### Combining sentence representations

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ will be ``(batch_size, 1, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; P \cdot H; |P-H|]$$

Here, we concatenate P, H, the element-wise product of P and H, and the absolute value of P minus H, then return the result.
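
As a quick sanity check of this combination, the sketch below works through one pair of small vectors using plain Python lists rather than tensors. The helper name `combine` and the numbers are hypothetical; it concatenates P, H, their element-wise product, and the absolute value of their difference.

```python
def combine(p, h):
    """Concatenate P, H, P*H and |P-H| for two equal-length vectors."""
    assert len(p) == len(h)
    product = [a * b for a, b in zip(p, h)]        # element-wise product
    abs_diff = [abs(a - b) for a, b in zip(p, h)]  # absolute difference
    return p + h + product + abs_diff


p = [1.0, -2.0]
h = [3.0, 0.5]
x = combine(p, h)
print(x)  # [1.0, -2.0, 3.0, 0.5, 3.0, -1.0, 2.0, 2.5]
```

With `d = 2` the result has `4d = 8` entries, which is the size relationship stated above.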

Implement the function. **[4 marks]**

In [4]:
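def combine_premise_and_hypothesis(premise, hypothesis):
    # premise, hypothesis: pooled sentence tensors of size (batch_size, 1, d)
    product = premise * hypothesis                 # element-wise product
    difference = torch.abs(premise - hypothesis)   # absolute element-wise difference
    # concatenate along the feature dimension: (batch_size, 1, 4d)
    combined = torch.cat((premise, hypothesis, product, difference), dim=2)
    # drop the singleton word dimension: (batch_size, 4d)
    return torch.squeeze(combined, 1)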

                      


### Creating the model

Finally, we can build the model according to the procedure given previously by using the functions we defined above. Additionally, in the model you should use *dropout*. For efficiency purposes, it's acceptable to only train the model with either max or mean pooling.

Implement the model [**6 marks**]

In [5]:
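import torch.nn.functional as F

class SNLIModel(nn.Module):
    def __init__(self, vocab_length, embedding_dim, hidden_dim, num_labels):
        super().__init__()
        # word embeddings shared by premise and hypothesis
        self.embeddings = nn.Embedding(vocab_length, embedding_dim)
        # 2-layer bidirectional LSTM; dropout is applied between the two layers
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True, batch_first=True, num_layers=2, dropout=0.1)
        # input: four concatenated vectors of size 2 * hidden_dim each
        # NOTE: the output layer has hidden_dim units and num_labels is unused,
        # so label indices must be smaller than hidden_dim
        self.nf = nn.Linear(hidden_dim * 8, hidden_dim)

    def forward(self, premise, hypothesis):
        # (batch_size, num_words) -> (batch_size, num_words, embedding_dim)
        prem = self.embeddings(premise)
        hypo = self.embeddings(hypothesis)

        # encode both sentences with the same BiLSTM
        prem_encode, _ = self.rnn(prem)
        hypo_encode, _ = self.rnn(hypo)

        # max pooling over the words of each sentence
        max_prem = pooling(prem_encode)
        max_hypo = pooling(hypo_encode)

        # (batch_size, 8 * hidden_dim)
        x = combine_premise_and_hypothesis(max_prem, max_hypo)

        lin = self.nf(x)
        return F.relu(lin)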

# 3. Training and testing

As before, implement the training and testing of the model. SNLI can take a very long time to train, so I suggest you only run it for one or two epochs. **[2 marks]** 

**Tip for efficiency:** *when developing your model, try training and testing the model on one batch (for each epoch) of data to make sure everything works! It's very annoying if you train for N epochs to find out that something went wrong when testing the model, or to find that something goes wrong when moving from epoch 0 to epoch 1.*
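
One possible way to follow this tip is to cap the number of batches consumed per epoch while debugging. The sketch below uses `itertools.islice` for that; the helper name `limit_batches` is hypothetical, and the list of strings only stands in for a real data iterator.

```python
from itertools import islice


def limit_batches(iterator, max_batches=None):
    """Return an iterator over at most max_batches items (all of them if None)."""
    return islice(iterator, max_batches)


# Stand-in for a real data iterator: any iterable of batches works.
fake_batches = [f"batch_{i}" for i in range(5)]

debug_run = list(limit_batches(fake_batches, max_batches=1))
full_run = list(limit_batches(fake_batches))
print(debug_run)      # ['batch_0']
print(len(full_run))  # 5
```

In a training or testing loop, this would amount to writing `for batch in limit_batches(iterator, 1):` during development and removing the limit for the real run.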

In [6]:
import torch.optim as optim
import torch.nn.functional as F
from sklearn.metrics import accuracy_score

train_iter, test_iter, Vocab, relation = dataloader("SNLI-data")

num_of_batches = len(train_iter)  # number of training batches per epoch

embedding_size = 50
hidden_size = 25
epochs = 3
output_size = 3

model = SNLIModel(len(Vocab.vocab), embedding_size, hidden_size, output_size)
model = model.to(device)
loss_function = nn.CrossEntropyLoss()
loss_function = loss_function.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)


def train(model, iterator, optimizer, criterion, epochs):
    for e in range(epochs):
        model.train()
        for batch in iterator:
            optimizer.zero_grad()
            prem = batch.premises
            hypo = batch.hypothesis
            predictions = model(prem, hypo)
            tag = batch.relation
            loss = criterion(predictions, tag)
            loss.backward()
            optimizer.step()


train(model, train_iter, optimizer, loss_function, epochs)

print("train done")

model.eval()
with torch.no_grad():
    pred = []
    label = []
    for batch in test_iter:
        hyp = batch.hypothesis
        prem = batch.premises
        answer = model(prem, hyp)
        # index of the highest-scoring class for each example
        pred1 = torch.max(answer, 1)[1].view(batch.relation.size()).tolist()
        gold = batch.relation.tolist()  # gold labels for this batch
        pred += pred1
        label += gold

print("test done")

# accuracy_score expects (y_true, y_pred)
acc = accuracy_score(label, pred)
print("Accuracy:", acc)





train done
test done
Accuracy: 0.4774


Suggest a baseline that we can compare our model against **[2 marks]**

I cannot think of an obviously good baseline. Perhaps the average accuracy over the entailment versus not-entailment labels could serve as one.
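
One way to make a baseline concrete is a majority-class classifier, which always predicts the most frequent training label. This is a swapped-in suggestion rather than the exact baseline described above; the function name `majority_baseline` and the label lists are made up for illustration.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)


# Made-up label lists, for illustration only.
train_labels = ["entailment", "neutral", "entailment", "contradiction"]
test_labels = ["entailment", "contradiction", "neutral", "entailment"]

label, accuracy = majority_baseline(train_labels, test_labels)
print(label, accuracy)  # entailment 0.5
```

A trained model would be expected to score clearly above whatever number such a baseline gives on the real test labels.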


**Your answer should go here**

Suggest some ways (other than using a baseline) in which we can analyse the model's performance **[6 marks]**.

**Your answer should go here**

The model's performance can be measured by testing it on more hypotheses based on the same premise, which shows whether the model is practical and could perform on real-world tasks. In a practical task, the model should be able to handle multiple hypotheses/inferences and decide for each whether it is entailed by the premise or not. Another way to measure performance is to check the validity of one hypothesis against multiple premises: a single hypothesis can be true/entailed given different premises, so we can ask which context/premise correctly entails it. Of course this might require other datasets, but in a general sense, a model that resembles human logical thinking should ideally be able to handle such reasoning problems. For example, a question that a human would naturally ask is "What do we understand from this text?", that is, what can we infer from it and how many inferences can be drawn from it. A summary of a text consists of multiple inferences/hypotheses, and the summary as a whole is valid only if all the inferences are correct given the original text/premises.
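
The "valid only if all the inferences are correct" idea can be turned into a stricter metric than per-example accuracy. The sketch below is one possible reading, with a hypothetical function name and made-up data: a premise counts as solved only when every hypothesis attached to it is classified correctly.

```python
from collections import defaultdict


def premise_level_accuracy(examples):
    """Fraction of premises whose hypotheses are all classified correctly.

    examples: iterable of (premise, gold_label, predicted_label) triples.
    """
    all_correct = defaultdict(lambda: True)
    for premise, gold, predicted in examples:
        all_correct[premise] = all_correct[premise] and (gold == predicted)
    return sum(all_correct.values()) / len(all_correct)


# Made-up predictions: premise "p1" is fully correct, "p2" has one error.
examples = [
    ("p1", "entailment", "entailment"),
    ("p1", "neutral", "neutral"),
    ("p2", "contradiction", "neutral"),
    ("p2", "entailment", "entailment"),
]
print(premise_level_accuracy(examples))  # 0.5
```

Per-example accuracy on the same data would be 0.75, so the gap between the two numbers indicates how often errors are spread across different premises.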


Suggest some ways to improve the model **[4 marks]**.

1. Including contextually richer embeddings/representations such as ELMo should, hypothetically, add stronger meaning to the sentence representation.

2. Hyperparameter tuning to find the parameters that best fit the task should add a reasonable improvement.

3. Testing/improving different methods for computing the premise-hypothesis combination.
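
The hyperparameter point (item 2) could be approached with a simple grid search. The sketch below is only an outline under stated assumptions: `grid_search` is a hypothetical helper, and `toy_evaluate` is a stand-in for "train the model with this configuration and return its accuracy on held-out data", which in practice would be the expensive part.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Return (best_score, best_config) over every combination in grid."""
    best_score, best_config = None, None
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if best_score is None or score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


# Hypothetical stand-in for training and evaluating a model; it just
# prefers sizes close to 100 so the example has a clear winner.
def toy_evaluate(config):
    return -abs(config["hidden_size"] - 100) - abs(config["embedding_size"] - 100)


grid = {"embedding_size": [50, 100], "hidden_size": [25, 100], "lr": [0.001, 0.0001]}
score, config = grid_search(toy_evaluate, grid)
print(config)
```

The configurations should be compared on a held-out development split rather than on the test set, so that the final test accuracy stays an unbiased estimate.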


**Your answer should go here**

### Readings

[1] Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[2] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.