# Natural Language Inference using Neural Networks - VG
Based on the Python Notebook by Adam Ek, expanded upon by Maria Irena Szawerna for the VT2022 Computational Semantics course.

----------------------------------

# 1. Data

##### Problem description, references
In this notebook I will, on the basis of Lab 5, explore how SNLI-trained models perform on FRACAS and whether or not fine-tuning them will improve the performance. A more detailed summary can be found at the end of the notebook. Throughout the notebook I will remark on what was sourced from the original Lab 5 notebook or somewhere else; I did contribute to all the parts of the basic Lab 5 though, so it is not strictly only my classmates' work. The references for the lab can be found at the end of the notebook, and I did not use anything besides that except for the FRACAS dataset itself, which is linked later in this cell, as well as a guide on how to access XML files, linked in the appropriate code block.  

##### Research question and thesis statement  
+ Does an SNLI-trained BiLSTM model perform well when evaluated on the FRACAS dataset? Does fine-tuning it on FRACAS influence the performance?
+ My prediction is that fine-tuning a model trained on SNLI on FRACAS will improve its performance on FRACAS-related tasks. However, the FRACAS dataset is much smaller, so the change may not be drastically big.

##### Description of the datasets' structure and my changes to the FRACAS dataset
*The dataset could not be downloaded in the simplified version, the link did not work; instead, I used [this](https://github.com/sdobnik/computational-semantics/blob/master/assignments/05-natural-language-inference/simple_snli_1.0.zip)*

The (simplified) data is organized as follows (tab-separated values):
* Column 1: Premise
* Column 2: Hypothesis
* Column 3: Relation

I will also work with the FRACAS dataset from [here](https://nlp.stanford.edu/~wcmac/downloads/fracas.xml). This one accepts the following relations:
* yes - equivalent to entailment
* no - equivalent to contradiction
* unknown - equivalent to neutral (neither entailment nor contradiction can be concluded)
* undef - the example is too tricky for an expert to resolve  

I will have to transform this dataset into a format that will be easy to work with, replacing their names of judgements with mine. In addition, some of their examples have two or more premises for one hypothesis. According to the dataset decription, one-premise problems constitute around 55% of the dataset, that is 192 problems. While this does slim the dataset down, I cannot do it any other way without combining the multiple premises into one sentence, as the model I am working with only compares two sentences (the hypothesis and the premise), and so the multi-premise problems will have to be excluded. In the end fewer than 192 problems are used, as some of those do not have an answer (undef), and so I excluded them.  

----

In [1]:
# This is the same as in Lab 5
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1" 
import torch
import torch.optim as optim
import torch.nn as nn
import torchtext

device = torch.device('cuda:2')

batch_size = 8
learning_rate = 0.001
epochs = 3

import random
import math

import numpy as np
import pandas as pd

### SNLI data and datasets/loaders  
----
**THIS WHOLE SECTION IS THE SAME AS IN LAB 5**

In [2]:
train_data = pd.read_csv('simple_snli_1.0_train.csv', header=None, sep='\t')
train_data.columns = ['premise', 'hypothesis', 'relation']
test_data = pd.read_csv('simple_snli_1.0_test.csv', header=None, sep='\t')
test_data.columns = ['premise', 'hypothesis', 'relation']
dev_data = pd.read_csv('simple_snli_1.0_dev.csv', header=None, sep='\t')
dev_data.columns = ['premise', 'hypothesis', 'relation']

In [3]:
from torch.utils.data import DataLoader, Dataset

In [4]:
# I implement a Dataset to keep track of vocab, word2idx, idx2word
# Dataset can also be used in DataLoader which gives batch loading, etc, for free.

class InferenceDataset(Dataset):

    def __init__(self, data, unk_label='<unk>', pad_label='<pad>'):
        
        self.unk_idx, self.unk_label = 0, unk_label
        self.pad_idx, self.pad_label = 1, pad_label

        self.data = data.copy()
        self.data['premise'] = self.data['premise'].apply(self.tokenize)
        self.data['hypothesis'] = self.data['hypothesis'].apply(self.tokenize)


        self.vocab = self.__unique_words()
        
        self.word2idx = dict()
        self.idx2word = dict()
        self.word2idx[self.unk_label] = self.unk_idx
        self.word2idx[self.pad_label] = self.pad_idx
        self.word2idx.update({word:idx+max(self.word2idx.values())+1 for idx, word in enumerate(self.vocab)})

        self.idx2word = {v:k for k,v in self.word2idx.items()}

        self.labels = list(np.unique(self.data['relation']))

    def __unique_words(self):
        all_words = []
        for s in self.data['premise']:
            all_words += s
        for s in self.data['hypothesis']:
            all_words += s
        return np.unique(all_words)
        
    def tokenize(self, string):
        if isinstance(string, str): 
            # The tokenizer was given as a whitespace tokenizer
            return string.lower().split()
        else:  # for NaN
            return "<unk>"

    def __getitem__(self, idx):
        #x = self.data.iloc[0] #for test
        x = self.data.iloc[idx]
        out = (x['premise'], x['hypothesis'], x['relation'])
        return out
        
    def __len__(self):
        return len(self.data)

In [5]:
from collections import namedtuple
from torch.nn.utils.rnn import pad_sequence 

relation_to_idx = {k:v for v,k in enumerate(sorted(np.unique(train_data['relation'])))}
idx_relation = {v:k for k,v in relation_to_idx.items()}

class Collate():
    def __init__(self, word_to_idx, pad_idx=1, unk_idx=0, relation_to_idx=relation_to_idx):
        self.pad_idx = pad_idx
        self.unk_idx = unk_idx
        self.word_to_idx = word_to_idx
        self.relation_to_idx = relation_to_idx
    def __call__(self, batch):
        batch = np.transpose(batch)
        
        premises = np.transpose(batch[0])
        premises = [torch.tensor([self.word_to_idx.get(w, self.unk_idx) for w in s], device=device) for s in premises]
        premises = pad_sequence(premises, batch_first=True, padding_value=self.pad_idx)

        hypothesis = np.transpose(batch[1]) #batch first
        hypothesis = [torch.tensor([self.word_to_idx.get(w, self.unk_idx) for w in s], device=device) for s in hypothesis]
        hypothesis = pad_sequence(hypothesis, batch_first=True, padding_value=self.pad_idx)
        
        relations = [self.relation_to_idx[rel] for rel in batch[2]]

        return premises, hypothesis, relations


def dataloader(dataset, word2idx, pad_idx, unk_idx, batch_size=32, shuffle=True): # Need word2idx etc to match between train and test. Id probably do this is another wya in hindsight.
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        collate_fn=Collate(word2idx, pad_idx, unk_idx) )
    return loader



### FRACAS data
----
**THIS WHOLE SECTION IS ENTIRELY NEW, WITH THE parseXML() FUNCTION ADAPTED FROM UNDER THE LINK IN THE COMMENT**

In [6]:
import csv
import xml.etree.ElementTree as ET

In [7]:
def parseXML(xmlfile):
    # adapted from https://www.geeksforgeeks.org/xml-parsing-python/
    tree = ET.parse(xmlfile)
    root = tree.getroot()
    problems = []
    counter = 0
  
    for item in root.findall('./problem'):
        counter += 1
        
        items = {}

        for child in item:
            if child.tag not in items:
                items[child.tag] = child.text.encode('utf8')
            else:  # if it is in items, meaning it's a problem with multiple premises
                items['multi'] = 'yes'
            
            # I estimated that from the category 5. Adjectives and onwards, those are semantic examples
            if counter <= 196:
                items['category'] = 'syntactic'
            else:
                items['category'] = 'semantic'
        
        if 'multi' not in items:  # if it is a 1-premise problem
            problems.append(items)
        else:
            continue
      
    return problems

In [8]:
def decode_problem(problem):
    problem['p'] = problem['p'].decode("utf-8").strip()
    problem['h'] = problem['h'].decode("utf-8").strip()
    del problem['q']

    return problem

In [9]:
def select_1_premise_problems(xmlfile):
    
    problems = parseXML(xmlfile)
    
    # removing unnecessary "features"
    for problem in problems:
        if 'note' in problem:
            del problem['note']
        if 'why' in problem:
            del problem['why']

    # picking the ones that are not undef, changing the relation name
    selected_problems = []
    for problem in problems:
        if problem['a'] == b' Yes ':
            problem['a'] = 'entailment'
            problem = decode_problem(problem)
            selected_problems.append(problem)
        elif problem['a'] == b' No ':
            problem['a'] = 'contradiction'
            problem = decode_problem(problem)
            selected_problems.append(problem)
        elif problem['a'] == b" Don't know ":
            problem['a'] = 'neutral'
            problem = decode_problem(problem)
            selected_problems.append(problem)
        else:
            continue
            
    return selected_problems

In [10]:
problems = select_1_premise_problems('Natlog Problems.xml')

In [11]:
full_fracas_frame = pd.DataFrame(problems)
# I need to rename the columns so that it works with our datasets and loaders from the section above.
full_fracas_frame.columns = ['premise', 'category', 'hypothesis', 'relation']

In [12]:
full_fracas_frame

Unnamed: 0,premise,category,hypothesis,relation
0,An Italian became the world's greatest tenor.,syntactic,There was an Italian who became the world's gr...,entailment
1,The really ambitious tenors are Italian.,syntactic,There are really ambitious tenors who are Ital...,entailment
2,No really great tenors are modest.,syntactic,There are really great tenors who are modest.,contradiction
3,Some great tenors are Swedish.,syntactic,There are great tenors who are Swedish.,entailment
4,Many great tenors are German.,syntactic,There are great tenors who are German.,entailment
...,...,...,...,...
160,It is true that ITEL won the contract in 1992.,semantic,ITEL won the contract in 1992.,entailment
161,It is false that ITEL won the contract in 1992.,semantic,ITEL won the contract in 1992.,contradiction
162,Smith saw Jones sign the contract.,semantic,Jones signed the contract.,entailment
163,Smith saw Jones sign the contract and his secr...,semantic,Smith saw Jones sign the contract.,entailment


----
Usually data is divided into train, test, dev in a 8:1:1 ratio. Since I will not have a train set here (I train on the SNLI
data and only finetune on FRACAS dev), that leaves us with test and dev at 1:1.

since I want it shuffled, but I want to keep the results reproducible for re-running the notebook, I will keep a fixed seed. 

----

In [14]:
random.Random(25).shuffle(problems)
print(problems)

[{'p': 'A few committee members are from Scandinavia.', 'category': 'syntactic', 'h': 'At least a few female committee members are from Scandinavia.', 'a': 'neutral'}, {'p': 'Smith wrote a report for two hours.', 'category': 'semantic', 'h': 'Smith wrote a report.', 'a': 'neutral'}, {'p': 'Smith wrote a novel in 1991.', 'category': 'semantic', 'h': 'Smith wrote it in 1992.', 'a': 'contradiction'}, {'p': 'Smith and Jones left the meeting.', 'category': 'semantic', 'h': 'Smith left the meeting.', 'a': 'entailment'}, {'p': 'No delegate finished the report on time.', 'category': 'syntactic', 'h': 'Some Scandinavian delegate finished the report on time.', 'a': 'contradiction'}, {'p': 'Some Irish delegates finished the survey on time.', 'category': 'syntactic', 'h': 'Some delegates finished the survey on time.', 'a': 'entailment'}, {'p': 'Several delegates got the results published in major national newspapers.', 'category': 'syntactic', 'h': 'Several delegates got the results published.', '

In [15]:
split_point = math.ceil(len(problems) / 2)

test_fracas = problems[:split_point]
dev_fracas = problems[split_point:]

In [16]:
test_fracas_frame = pd.DataFrame(test_fracas)
test_fracas_frame.columns = ['premise', 'category', 'hypothesis', 'relation']
dev_fracas_frame = pd.DataFrame(dev_fracas)
dev_fracas_frame.columns = ['premise', 'category', 'hypothesis', 'relation']

In [17]:
test_fracas_frame

Unnamed: 0,premise,category,hypothesis,relation
0,A few committee members are from Scandinavia.,syntactic,At least a few female committee members are fr...,neutral
1,Smith wrote a report for two hours.,semantic,Smith wrote a report.,neutral
2,Smith wrote a novel in 1991.,semantic,Smith wrote it in 1992.,contradiction
3,Smith and Jones left the meeting.,semantic,Smith left the meeting.,entailment
4,No delegate finished the report on time.,syntactic,Some Scandinavian delegate finished the report...,contradiction
...,...,...,...,...
78,"When Jones got his job at the CIA, he knew tha...",semantic,It is the case that Jones is not and will neve...,entailment
79,"Either Smith, Jones or Anderson signed the con...",syntactic,If Smith and Anderson did not sign the contrac...,entailment
80,In 1994 ITEL sent a progress report every month.,semantic,ITEL sent a progress report in July 1994.,entailment
81,Fido is not a small animal.,semantic,Fido is a large animal.,neutral


In [18]:
dev_fracas_frame

Unnamed: 0,premise,category,hypothesis,relation
0,Smith claimed he had costed his proposal and s...,syntactic,Jones claimed Smith had costed Jones' proposal.,neutral
1,Neither commissioner spends time at home.,syntactic,One of the commissioners spends a lot of time ...,contradiction
2,Several Portuguese delegates got the results p...,syntactic,Several delegates got the results published in...,entailment
3,Many delegates obtained interesting results fr...,syntactic,Many delegates obtained results from the survey.,entailment
4,At least three commissioners spend a lot of ti...,syntactic,At least three commissioners spend time at home.,entailment
...,...,...,...,...
77,Smith discovered a new species in 1991.,semantic,Smith discovered it in 1992.,contradiction
78,"John wanted to buy a car, and he did.",syntactic,John bought a car.,entailment
79,The people who were at the meeting all voted f...,syntactic,Everyone at the meeting voted for a new chairman.,entailment
80,Some great tenors are Swedish.,syntactic,There are great tenors who are Swedish.,entailment


In [19]:
# I will want to see the performance on syntactic vs. semantic examples, so I will divide the test set into those two
syn_test_fracas = []
sem_test_fracas = []

for item in test_fracas:
    if item["category"] == 'syntactic':
        syn_test_fracas.append(item)
    else:  # if category == semantic
        sem_test_fracas.append(item)

In [20]:
syn_test_fracas_frame = pd.DataFrame(syn_test_fracas)
syn_test_fracas_frame.columns = ['premise', 'category', 'hypothesis', 'relation']
sem_test_fracas_frame = pd.DataFrame(sem_test_fracas)
sem_test_fracas_frame.columns = ['premise', 'category', 'hypothesis', 'relation']

In [21]:
syn_test_fracas_frame

Unnamed: 0,premise,category,hypothesis,relation
0,A few committee members are from Scandinavia.,syntactic,At least a few female committee members are fr...,neutral
1,No delegate finished the report on time.,syntactic,Some Scandinavian delegate finished the report...,contradiction
2,Some Irish delegates finished the survey on time.,syntactic,Some delegates finished the survey on time.,entailment
3,Several delegates got the results published in...,syntactic,Several delegates got the results published.,entailment
4,Bill suggested to Frank's boss that they shoul...,syntactic,If it was suggested that Bill and Frank's boss...,neutral
5,Mary used her workstation.,syntactic,Mary is female.,entailment
6,An Italian became the world's greatest tenor.,syntactic,There was an Italian who became the world's gr...,entailment
7,John said Bill had hurt himself.,syntactic,Someone said John had been hurt.,neutral
8,Several great tenors are British.,syntactic,There are great tenors who are British.,entailment
9,Most Europeans who are resident in Europe can ...,syntactic,Most Europeans can travel freely within Europe.,neutral


In [22]:
sem_test_fracas_frame

Unnamed: 0,premise,category,hypothesis,relation
0,Smith wrote a report for two hours.,semantic,Smith wrote a report.,neutral
1,Smith wrote a novel in 1991.,semantic,Smith wrote it in 1992.,contradiction
2,Smith and Jones left the meeting.,semantic,Smith left the meeting.,entailment
3,Smith ran his own business for two years.,semantic,Smith ran his own business.,entailment
4,John has a genuine diamond.,semantic,John has a diamond.,entailment
5,Smith wrote a report in two hours.,semantic,Smith spent more than two hours writing the re...,contradiction
6,ITEL won more orders than APCOM lost.,semantic,APCOM lost some orders.,neutral
7,The PC-6082 is as fast as the ITEL-XZ.,semantic,The PC-6082 is fast.,neutral
8,Smith discovered new species for two years.,semantic,Smith discovered new species.,entailment
9,John is a former successful university student.,semantic,John is a university student.,neutral


# 2. Model

**This section is roughly the same as in Lab 5, with the exception of using PyTorch max pooling rather than a custom (and slower) function. I have also changed the embedding size relative to what was submitted in the draft submission of Lab 5.**

### Creating a representation of a sentence
----
I will not use my custom max pooling function, as it was rather slow. Instead, I will use the PyTorch one. 

In [23]:
num_words = 3
dimensions = 5
# A tensor for testing
test_tensor = torch.rand([batch_size, num_words, dimensions], dtype=torch.float64, device=device)
print(test_tensor)

tensor([[[0.1323, 0.7823, 0.5127, 0.9199, 0.3370],
         [0.3986, 0.7415, 0.1556, 0.1236, 0.9320],
         [0.7010, 0.7487, 0.6831, 0.8826, 0.2171]],

        [[0.0781, 0.1650, 0.0951, 0.0395, 0.1147],
         [0.7370, 0.1587, 0.0340, 0.9292, 0.9115],
         [0.0550, 0.2563, 0.2875, 0.2127, 0.1141]],

        [[0.0838, 0.2077, 0.7513, 0.6818, 0.7008],
         [0.5850, 0.8879, 0.3350, 0.7218, 0.6587],
         [0.1865, 0.7171, 0.5937, 0.8862, 0.8820]],

        [[0.3829, 0.5779, 0.0744, 0.1269, 0.7876],
         [0.6733, 0.6044, 0.9346, 0.4400, 0.7812],
         [0.1369, 0.6862, 0.2518, 0.8547, 0.5197]],

        [[0.3529, 0.6909, 0.0228, 0.5946, 0.6634],
         [0.8360, 0.4068, 0.0725, 0.4669, 0.8832],
         [0.3467, 0.7944, 0.9746, 0.6484, 0.5324]],

        [[0.1317, 0.1929, 0.7879, 0.9044, 0.9152],
         [0.5748, 0.4898, 0.7750, 0.2756, 0.1717],
         [0.2656, 0.2103, 0.6207, 0.3750, 0.3937]],

        [[0.1467, 0.4918, 0.5485, 0.6889, 0.4375],
         [0.2035, 0

In [24]:
torch.max(test_tensor, dim=1)[0]

tensor([[0.7010, 0.7823, 0.6831, 0.9199, 0.9320],
        [0.7370, 0.2563, 0.2875, 0.9292, 0.9115],
        [0.5850, 0.8879, 0.7513, 0.8862, 0.8820],
        [0.6733, 0.6862, 0.9346, 0.8547, 0.7876],
        [0.8360, 0.7944, 0.9746, 0.6484, 0.8832],
        [0.5748, 0.4898, 0.7879, 0.9044, 0.9152],
        [0.7589, 0.8200, 0.9739, 0.6889, 0.7853],
        [0.4414, 0.9071, 0.7231, 0.8754, 0.8779]], device='cuda:2',
       dtype=torch.float64)

### Combining sentence representations
----

Next, we need to combine the premise and hypothesis into one representation. We will do this by concatenating four tensors (the final size of our tensor $X$ should be ``(batch_size, 4d)`` where ``d`` is the number of dimensions that you use): 

$$X = [P; H; |P-H|; P \cdot H]$$

Here, what we do is concatenating P, H, P times H, and the absolute value of P minus H, then return the result.

In [25]:
# Test tensors (size of batch, num, dim)
t = torch.rand([2*batch_size, num_words, dimensions], dtype=torch.float64, device=device)
t1, t2 = torch.split(t, batch_size)
# Pooled test tensors (size of batch, dim)
pt1 = torch.max(t1, dim=1)[0]
pt2 = torch.max(t2, dim=1)[0]
print(pt1)
print(pt2)

tensor([[0.8366, 0.3686, 0.4485, 0.9285, 0.9173],
        [0.9342, 0.9869, 0.9526, 0.9503, 0.8760],
        [0.9636, 0.6425, 0.6769, 0.8461, 0.7809],
        [0.9657, 0.7991, 0.9479, 0.9716, 0.9304],
        [0.6827, 0.8800, 0.5000, 0.6977, 0.4619],
        [0.9638, 0.7895, 0.7666, 0.5614, 0.9902],
        [0.5916, 0.8364, 0.4759, 0.7160, 0.7296],
        [0.4630, 0.7979, 0.8645, 0.9520, 0.4016]], device='cuda:2',
       dtype=torch.float64)
tensor([[0.3498, 0.7195, 0.8864, 0.4609, 0.7742],
        [0.2673, 0.8991, 0.4715, 0.8393, 0.3224],
        [0.8582, 0.8807, 0.7806, 0.7027, 0.6609],
        [0.5979, 0.3729, 0.7625, 0.8292, 0.8282],
        [0.9695, 0.5026, 0.5869, 0.8204, 0.9345],
        [0.9071, 0.6866, 0.8673, 0.5703, 0.5868],
        [0.7155, 0.8885, 0.4790, 0.9022, 0.6091],
        [0.6841, 0.9987, 0.9844, 0.9081, 0.9202]], device='cuda:2',
       dtype=torch.float64)


In [26]:
def combine_premise_and_hypothesis(hypothesis, premise):
    
    batches = len(hypothesis)
    dims = len(hypothesis[0])
    final_dims = 4*dims

    new_tensors = []

    for i in range(0,batches):
        hyp = hypothesis[i]
        pre = premise[i]
    
        summed = torch.cat((pre,hyp))
        subtracted = pre - hyp
        multiplied = torch.mul(pre, hyp)
    
        new_tensors.append(torch.cat((summed, subtracted, multiplied)))
    
    output = torch.stack(new_tensors)
    
    return output

In [27]:
combine_premise_and_hypothesis(pt1, pt2)

tensor([[ 0.3498,  0.7195,  0.8864,  0.4609,  0.7742,  0.8366,  0.3686,  0.4485,
          0.9285,  0.9173, -0.4868,  0.3509,  0.4379, -0.4676, -0.1431,  0.2926,
          0.2652,  0.3975,  0.4280,  0.7102],
        [ 0.2673,  0.8991,  0.4715,  0.8393,  0.3224,  0.9342,  0.9869,  0.9526,
          0.9503,  0.8760, -0.6669, -0.0877, -0.4811, -0.1110, -0.5537,  0.2497,
          0.8873,  0.4492,  0.7975,  0.2824],
        [ 0.8582,  0.8807,  0.7806,  0.7027,  0.6609,  0.9636,  0.6425,  0.6769,
          0.8461,  0.7809, -0.1054,  0.2381,  0.1037, -0.1433, -0.1200,  0.8270,
          0.5658,  0.5284,  0.5946,  0.5161],
        [ 0.5979,  0.3729,  0.7625,  0.8292,  0.8282,  0.9657,  0.7991,  0.9479,
          0.9716,  0.9304, -0.3677, -0.4262, -0.1854, -0.1423, -0.1022,  0.5774,
          0.2980,  0.7228,  0.8057,  0.7706],
        [ 0.9695,  0.5026,  0.5869,  0.8204,  0.9345,  0.6827,  0.8800,  0.5000,
          0.6977,  0.4619,  0.2868, -0.3774,  0.0869,  0.1228,  0.4725,  0.6619,
      

### Creating the model
----

In [28]:
class SNLIModel(nn.Module):
    def __init__(self, word2idx, relation2idx, embedding_dim=64, hidden_size=128, padding_idx=1):
        super().__init__()
        self.vocab_size = len(word2idx)
        self.output_dim = len(relation2idx)
        self.hidden_size = hidden_size
        # your code goes here
        self.embeddings = nn.Embedding(self.vocab_size, embedding_dim, padding_idx=padding_idx) #
        self.LSTM = nn.LSTM(input_size=embedding_dim, hidden_size=self.hidden_size, num_layers=1, bidirectional=True)
        self.classifier = nn.Linear(self.hidden_size*8, self.output_dim)
        self.dropout = nn.Dropout(0.1)
        
    def forward(self, premise, hypothesis):
        p = self.embeddings(premise)
        h = self.embeddings(hypothesis)
        
        lstm_p, (hidden, c) = self.LSTM(p)
        lstm_h, (hidden, c) = self.LSTM(h)
        
        p_pooled = torch.max(lstm_p, dim=1)[0]
        h_pooled = torch.max(lstm_h, dim=1)[0]
        
        ph_representation = combine_premise_and_hypothesis(h_pooled,p_pooled)
        ph_representation = self.dropout(ph_representation)  # is this at the right stage??
        
        predictions = self.classifier(ph_representation)
        
        return predictions

# 3. Training and testing

----
**This section re-uses the training loop from Lab 5 and adapts its evaluation loop into a function.**

In [29]:
train_dataset = InferenceDataset(train_data)
test_dataset = InferenceDataset(test_data)
# the dev set was used when testing if this works
dev_dataset = InferenceDataset(dev_data)

In [30]:
dev_fracas_dataset = InferenceDataset(dev_fracas_frame)
test_fracas_dataset = InferenceDataset(test_fracas_frame)
syn_test_fracas_dataset = InferenceDataset(syn_test_fracas_frame)
sem_test_fracas_dataset = InferenceDataset(sem_test_fracas_frame)

In [31]:
train_loader = dataloader(train_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=batch_size,shuffle=False)
test_loader = dataloader(test_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=batch_size,shuffle=False)
dev_loader = dataloader(dev_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=batch_size,shuffle=False)
dev_fracas_loader = dataloader(dev_fracas_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=batch_size,shuffle=False)
test_fracas_loader = dataloader(test_fracas_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=batch_size,shuffle=False)
syn_test_fracas_loader = dataloader(syn_test_fracas_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=batch_size,shuffle=False)
sem_test_fracas_loader = dataloader(sem_test_fracas_dataset, train_dataset.word2idx, train_dataset.pad_idx, train_dataset.unk_idx, batch_size=batch_size,shuffle=False)

### Training the model just on SNLI

In [32]:
model = SNLIModel(train_dataset.word2idx, train_dataset.labels).to(device)
loss_function = nn.CrossEntropyLoss(reduction='mean')
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(epochs):
    # train model
    total_loss = 0
    for i, batch in enumerate(train_loader):
        prems = batch[0]
        hyps = batch[1]
        rels = torch.Tensor(batch[2]).long().to(device)

        output = model(prems, hyps)
        
        loss = loss_function(output, rels)
        total_loss += loss.item()
        
        if i%200==0:
            print(f' Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
            
        # calculate gradients
        loss.backward()
        # update model weights
        optimizer.step()
        # reset gradients
        optimizer.zero_grad()

  return array(a, dtype, copy=False, order=order)


 Batch 0 : Average Loss = 1.40134
 Batch 200 : Average Loss = 1.10077
 Batch 400 : Average Loss = 1.07978
 Batch 600 : Average Loss = 1.06881
 Batch 800 : Average Loss = 1.05873
 Batch 1000 : Average Loss = 1.04806
 Batch 1200 : Average Loss = 1.03642
 Batch 1400 : Average Loss = 1.03038
 Batch 1600 : Average Loss = 1.02504
 Batch 1800 : Average Loss = 1.01949
 Batch 2000 : Average Loss = 1.01454
 Batch 2200 : Average Loss = 1.00815
 Batch 2400 : Average Loss = 1.00342
 Batch 2600 : Average Loss = 0.99813
 Batch 2800 : Average Loss = 0.99525
 Batch 3000 : Average Loss = 0.99171
 Batch 3200 : Average Loss = 0.98877
 Batch 3400 : Average Loss = 0.98656
 Batch 3600 : Average Loss = 0.98324
 Batch 3800 : Average Loss = 0.98025
 Batch 4000 : Average Loss = 0.97578
 Batch 4200 : Average Loss = 0.97138
 Batch 4400 : Average Loss = 0.96819
 Batch 4600 : Average Loss = 0.96609
 Batch 4800 : Average Loss = 0.96293
 Batch 5000 : Average Loss = 0.961
 Batch 5200 : Average Loss = 0.95769
 Batch 540

In [33]:
torch.save(model, 'inference_VG.model')

In [34]:
import pickle
#so that when we load it back in, we can have access to the same word2idx etc.
with open("train_dataset.pickle","wb") as f:
    pickle.dump(train_dataset, f)

In [35]:
with open("train_dataset.pickle", 'rb') as f:
    train_dataset = pickle.load(f)
    
model = torch.load('inference_VG.model')
model.eval()

SNLIModel(
  (embeddings): Embedding(56258, 64, padding_idx=1)
  (LSTM): LSTM(64, 128, bidirectional=True)
  (classifier): Linear(in_features=1024, out_features=4, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [36]:
def test_model(model, loader, name, divider):
    
    with torch.no_grad():

        correct = 0
        counter = 0
        for i, batch in enumerate(loader):
            test_output = model(batch[0], batch[1])
            # test_output = model(batch[1])
            test_output = torch.argmax(test_output, dim=1)
            targets = torch.tensor(batch[2], device=device)
            correct += torch.sum(test_output == targets)
            counter += len(test_output)

            test_accu = correct/counter
        
            if i%divider==0:
                print(f' Batch {i} : Average Test Accuracy on the {name} = {round(float(test_accu), 5)}')

    print(f'Total Test Accuracy on the {name} = {round(float(test_accu), 5)}')

In [37]:
test_model(model, test_loader, 'SNLI test set', 200)

 Batch 0 : Average Test Accuracy on the SNLI test set = 0.375
 Batch 200 : Average Test Accuracy on the SNLI test set = 0.7245
 Batch 400 : Average Test Accuracy on the SNLI test set = 0.73036
 Batch 600 : Average Test Accuracy on the SNLI test set = 0.72983
 Batch 800 : Average Test Accuracy on the SNLI test set = 0.73002
 Batch 1000 : Average Test Accuracy on the SNLI test set = 0.7294
 Batch 1200 : Average Test Accuracy on the SNLI test set = 0.73012
Total Test Accuracy on the SNLI test set = 0.7313


In [38]:
test_model(model, test_fracas_loader, 'FRACAS test set', 2)

 Batch 0 : Average Test Accuracy on the FRACAS test set = 0.75
 Batch 2 : Average Test Accuracy on the FRACAS test set = 0.54167
 Batch 4 : Average Test Accuracy on the FRACAS test set = 0.55
 Batch 6 : Average Test Accuracy on the FRACAS test set = 0.57143
 Batch 8 : Average Test Accuracy on the FRACAS test set = 0.51389
 Batch 10 : Average Test Accuracy on the FRACAS test set = 0.49398
Total Test Accuracy on the FRACAS test set = 0.49398


In [48]:
test_model(model, syn_test_fracas_loader, 'syntactic FRACAS test set', 2)

 Batch 0 : Average Test Accuracy on the syntactic FRACAS test set = 0.625
 Batch 2 : Average Test Accuracy on the syntactic FRACAS test set = 0.58333
 Batch 4 : Average Test Accuracy on the syntactic FRACAS test set = 0.525
Total Test Accuracy on the syntactic FRACAS test set = 0.525


In [49]:
test_model(model, sem_test_fracas_loader, 'semantic FRACAS test set', 2)

 Batch 0 : Average Test Accuracy on the semantic FRACAS test set = 0.5
 Batch 2 : Average Test Accuracy on the semantic FRACAS test set = 0.54167
 Batch 4 : Average Test Accuracy on the semantic FRACAS test set = 0.425
Total Test Accuracy on the semantic FRACAS test set = 0.4186


### Fine-tuning the model on FRACAS

In [41]:
model.train()

for epoch in range(epochs):
    # train model
    total_loss = 0
    for i, batch in enumerate(dev_fracas_loader):
        prems = batch[0]
        hyps = batch[1]
        rels = torch.Tensor(batch[2]).long().to(device)

        output = model(prems, hyps)
        
        loss = loss_function(output, rels)
        total_loss += loss.item()
        
        if i%2==0:
            print(f' Batch {i} : Average Loss = {round(total_loss/(i+1),5)}')#, end='\r')
            
        # calculate gradients
        loss.backward()
        # update model weights
        optimizer.step()
        # reset gradients
        optimizer.zero_grad()

 Batch 0 : Average Loss = 0.86952
 Batch 2 : Average Loss = 0.89718
 Batch 4 : Average Loss = 0.78495
 Batch 6 : Average Loss = 0.84265
 Batch 8 : Average Loss = 0.85689
 Batch 10 : Average Loss = 0.8796
 Batch 0 : Average Loss = 1.05906
 Batch 2 : Average Loss = 0.99514
 Batch 4 : Average Loss = 0.80413
 Batch 6 : Average Loss = 0.86354
 Batch 8 : Average Loss = 0.885
 Batch 10 : Average Loss = 0.93186
 Batch 0 : Average Loss = 1.13973
 Batch 2 : Average Loss = 0.99478
 Batch 4 : Average Loss = 0.83041
 Batch 6 : Average Loss = 0.9379
 Batch 8 : Average Loss = 0.92207
 Batch 10 : Average Loss = 0.89088


In [42]:
torch.save(model, 'inference_VG_finetuned.model')

In [43]:
finetuned_model = torch.load('inference_VG_finetuned.model')
model.eval()

SNLIModel(
  (embeddings): Embedding(56258, 64, padding_idx=1)
  (LSTM): LSTM(64, 128, bidirectional=True)
  (classifier): Linear(in_features=1024, out_features=4, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [44]:
test_model(finetuned_model, test_loader, 'SNLI test set', 200)

 Batch 0 : Average Test Accuracy on the SNLI test set = 0.375
 Batch 200 : Average Test Accuracy on the SNLI test set = 0.72264
 Batch 400 : Average Test Accuracy on the SNLI test set = 0.73005
 Batch 600 : Average Test Accuracy on the SNLI test set = 0.72421
 Batch 800 : Average Test Accuracy on the SNLI test set = 0.72425
 Batch 1000 : Average Test Accuracy on the SNLI test set = 0.72203
 Batch 1200 : Average Test Accuracy on the SNLI test set = 0.72304
Total Test Accuracy on the SNLI test set = 0.7239


In [56]:
test_model(finetuned_model, test_fracas_loader, 'FRACAS test set', 2)

 Batch 0 : Average Test Accuracy on the FRACAS test set = 0.75
 Batch 2 : Average Test Accuracy on the FRACAS test set = 0.66667
 Batch 4 : Average Test Accuracy on the FRACAS test set = 0.6
 Batch 6 : Average Test Accuracy on the FRACAS test set = 0.60714
 Batch 8 : Average Test Accuracy on the FRACAS test set = 0.54167
 Batch 10 : Average Test Accuracy on the FRACAS test set = 0.53012
Total Test Accuracy on the FRACAS test set = 0.53012


In [52]:
test_model(finetuned_model, syn_test_fracas_loader, 'syntactic FRACAS test set', 2)

 Batch 0 : Average Test Accuracy on the syntactic FRACAS test set = 0.75
 Batch 2 : Average Test Accuracy on the syntactic FRACAS test set = 0.66667
 Batch 4 : Average Test Accuracy on the syntactic FRACAS test set = 0.575
Total Test Accuracy on the syntactic FRACAS test set = 0.575


In [53]:
test_model(finetuned_model, sem_test_fracas_loader, 'semantic FRACAS test set', 2)

 Batch 0 : Average Test Accuracy on the semantic FRACAS test set = 0.25
 Batch 2 : Average Test Accuracy on the semantic FRACAS test set = 0.5
 Batch 4 : Average Test Accuracy on the semantic FRACAS test set = 0.425
Total Test Accuracy on the semantic FRACAS test set = 0.44186


## Summary, evaluation, and discussion
#### Summary
The goal of this VG project was to verify and compare how well, in terms of accuracy, a BiLSTM model trained on the SNLI dataset performs on an SNLI test set and on the FRACAS test set (optionally split into semantic- and syntactic-based examples). The second step was to fine-tune the same model on a section of the FRACAS dataset, and perform the same evaluation, so that the results could be compared.  

The first step in this assignment was transforming the data from the FRACAS dataset into a format that would be similar to what I had in the SNLI dataset. This meant having to decide how to translate the relation labels as well as what to do with the examples that had more than one premise. All of that has been described in section 1 of the assignment.  
The second step was to create the model, for which I used the same architecture as in Assignment 5, with the exception of not using the custom max pooling function, as that one was really slow in comparison with the PyTorch one. I did make sure that the PyTorch one was giving me the output I wanted.  
The final step was training the model on SNLI and evaluating it, and then fine-tuning it on FRACAS and re-evaluating again. Both of the models have been saved (along with the train_dataset, whose word2idx is used to encode the words in loaders). The results of the evaluation can be seen below:  

| MODEL | SNLI | FRACAS (all) | FRACAS (syntactic) | FRACAS (semantic) |
|-----|-----|-------|-----------|---------|
| base | 0.7313 | 0.49398 | 0.525 | 0.4186 |
| finetuned | 0.7239 | 0.53012 | 0.575 | 0.44186 |  

#### Evaluation 
It can be noticed that the base model (just trained on SNLI) performed better on the SNLI test set, but worse on all of the FRACAS ones. The fine-tuned model (first trained on SNLI, then trained more on FRACAS) performed worse on the SNLI test set, but better on the FRACAS ones. The difference in the peformance on SNLI is not major, and the one on FRACAS is bigger. However, these results are not always consistent when re-running the notebook, and generally oscillate around these values. It seems like fine-tuning on FRACAS has a rather marginal effect.  

#### Discussion
However, it is worth noting that the FRACAS dataset is significantly smaller than the SNLI train set. SNLI train has 550k examples, while our FRACAS dev - barely more than 80 (that is not 80k examples, that is 80 examples). This massive disproportion likely makes it so that the FRACAS examples do not contribute nearly as much as the SNLI ones, and perhaps balancing them out better would have been a better idea - this is difficult to do though, given that FRACAS itself is a small dataset. It is interesting to see though that it does have an effect, and on some runs of the notebook at least they do suggest that training a BiLSTM model on multiple kinds of inference datasets could improve its performance in some tasks without sacrificing too much of the performance elsewhere.  

Overall, given that we had 4 classes (the '-', 'contradiction', 'entailment', 'neutral' from SNLI), and given that '-' is a very rare one, the "stupid" baseline for this model would have been 33%. It performs better than that in all the categories, so I am fairly confident when I say it has learned something that lets it perform NLI - but whether it is something about the meaning or the structure of the data itself, I cannot say for certain.  

It is interesting to note that unlike what I thought would happen, the semantics-based examples from FRACAS are the ones the model is performing the worst on. This may be a clue to it actually not learning to classify the examples based on meaning but on sentence structure. I expected it to perform better on semantics-based examples, and not the syntactic ones. It might also be due to the semantics examples containing a lot of specialized vocabulary or just vocabulary that does not exist in the model's lexicon (so a lot of UNKNOWN tokens).

#### Possible future work
It would be really interesting to see the results with a more balanced SNLI-to-FRACAS ratio, seeing what the tradeoff is between gaining proficiency in FRACAS tasks and losing it in SNLI ones. It could also be interesting to test the model on other NLI datasets, or just the embeddings on other tasks. Overall I would also be interested in how NLI-based tasks could be used in training bigger, BERT-like models as one of the pre-training tasks, and what kind of meaning they would contribute - as clearly it also depends on the NLI dataset used.

#### Final remarks
When annotating the notebook for which parts I have done myself and which I have adapted, I felt like I have not done all that much, but I can promise that I did put quite some work into this little project, and hopefully it is sufficient. We have re-used similar elements in all of our neural network labs (e.g. the custom datasets and dataloaders which were initially done by Isac Boström from my group), and this project (and lab 5) is no exception in some of its sections. I found it particularly fun to have to think how to retrieve the FRACAS data and process it so that it would work with the model and the evaluation, and therein lies the most work I put into it. I also had to fix some minor issues with the model/training loop since relative to the draft submission of lab 5 there were some indexing errors when training on the training set. And, naturally, the discussion of the results is entirely mine, as is anything in the code that differs from lab 5 (unless stated otherwise). I hope that this is sufficient for the VG project, as discussed in e-mails before, and in case it is not, or you have any questions, issues, or comments, do not hesitate to contact me.

### Readings
[1] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

[2] Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.