# Pretraining language models on SNLI

This notebook gives an example of how to use the provided code.

## Training and evaluation

First, import the functions used for training and evaluation from the other files.

In [1]:
from encoders import *
from trainFunctions import *
from utils import *
from sentEval import runSentEval

Then load the SNLI data and the fields, which contain the preprocessing pipeline and metadata (this takes a while).

In [2]:
train_data, val_data, test_data, TEXT, LABEL = get_data()
data = {"train":train_data, "val": val_data, "test": test_data}

print("Data loaded")

Accessing raw input and preprocessing


2020-04-17 14:44:57,730 : Loading vectors from .vector_cache/glove.840B.300d.txt.pt


done
Building vocabulary with GloVe
done
Loading data into iterables
done, returning data
Data loaded


Next, we define the parameters that stay fixed across all runs.

In [3]:
metadata = {
    "vector_size" : 300,
    "vocab_size" : len(TEXT.vocab),
    "pretrained" : TEXT.vocab.vectors,
    "pad_idx" : TEXT.vocab.stoi[TEXT.pad_token]
}
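
To illustrate why pad_idx is part of the fixed metadata, here is a minimal sketch of mean pooling over word vectors that skips padding tokens. The repository's MeanEncoder is not shown in this notebook, so the function mean_encode and the toy vectors below are hypothetical, and they use plain Python lists rather than tensors.

```python
from typing import List, Sequence


def mean_encode(token_ids: Sequence[int],
                vectors: Sequence[Sequence[float]],
                pad_idx: int) -> List[float]:
    """Average the word vectors of a sentence, ignoring padding tokens.

    Hypothetical helper for illustration; not the repository's MeanEncoder.
    """
    real = [vectors[i] for i in token_ids if i != pad_idx]
    dim = len(vectors[0])
    if not real:
        # A sentence made only of padding gets a zero vector.
        return [0.0] * dim
    return [sum(v[d] for v in real) / len(real) for d in range(dim)]


# Toy embedding table: index 0 plays the role of the padding token.
toy_vectors = [[0.0, 0.0], [1.0, 3.0], [3.0, 5.0]]
sentence = [1, 2, 0, 0]  # two real tokens followed by two pads
print(mean_encode(sentence, toy_vectors, pad_idx=0))
```

Without the pad_idx check, the two padding positions would pull the average toward zero, so sentences in the same batch would be encoded differently depending on how much padding they received.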

We also need to define the default parameters used during sweeping. Only one parameter is swept at a time while the rest stay at their default values. This saves time, but it may not find the best combination.

In [None]:
## edit this to change default parameters
default_params = {
    "lr_decrease_factor":5,
    "lr_stopping" : 1e-6,
    "layer_num" : 1,
    "layer_size" : 512,
    "lr" : 0.001,
}

We also define the ranges over which these parameters are swept.

In [None]:
## edit this to change parameters ranges
param_ranges = {
    "learning rates":[0.01, 0.001],
    "lr_decrease_factors":[3, 5],
    "lr_stoppings": [1e-5, 1e-6], 
    "layer nums":[1,2],
    "layer sizes":[512,1024],
}

Note that the keys in all of the dictionaries above are fixed, because the models look them up by name. Only change the values if you want to try different setups.
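
The one-parameter-at-a-time sweep can be sketched as follows. This is only an illustration of the idea, not the actual paramSweep implementation: the function one_at_a_time_configs and the RANGE_TO_PARAM mapping between range keys and default keys are assumptions made for this example.

```python
from typing import Any, Dict, Iterator, List

# Assumed mapping from range keys to default-parameter keys.
# The real paramSweep may resolve these names differently.
RANGE_TO_PARAM = {
    "learning rates": "lr",
    "lr_decrease_factors": "lr_decrease_factor",
    "lr_stoppings": "lr_stopping",
    "layer nums": "layer_num",
    "layer sizes": "layer_size",
}


def one_at_a_time_configs(defaults: Dict[str, Any],
                          ranges: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield configs that differ from the defaults in at most one parameter."""
    seen = set()
    for range_key, values in ranges.items():
        param = RANGE_TO_PARAM[range_key]
        for value in values:
            config = dict(defaults)
            config[param] = value
            fingerprint = tuple(sorted(config.items()))
            if fingerprint not in seen:  # skip repeats of the default config
                seen.add(fingerprint)
                yield config


defaults = {"lr_decrease_factor": 5, "lr_stopping": 1e-6,
            "layer_num": 1, "layer_size": 512, "lr": 0.001}
ranges = {"learning rates": [0.01, 0.001], "lr_decrease_factors": [3, 5],
          "lr_stoppings": [1e-5, 1e-6], "layer nums": [1, 2],
          "layer sizes": [512, 1024]}

configs = list(one_at_a_time_configs(defaults, ranges))
print(len(configs))
```

With the values above this visits 6 distinct configurations, whereas a full grid over the same ranges would need 2^5 = 32 training runs. The price is that interactions between parameters are never explored.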

Now we define the list of encoder models that we want to train and evaluate

In [None]:
encoders = [MeanEncoder, LSTMEncoder, BiLSTMEncoder, MaxBiLSTMEncoder]

Finally, we loop through the encoders and, for each one:
* search for the best parameters
* construct a model with the best parameters and train it
* test the model
* store the trained model and the dev/test results
* evaluate on SentEval

There is a function for each of these tasks; see the readme for more details.

(Note: I wouldn't recommend actually running this cell, as it takes a very long time. All the cells below it will still work because the outputs are stored.)

In [None]:
for encoderClass in encoders:
    # searching for best params
    best_params_for_model = paramSweep(encoderClass, data, default_params, param_ranges, metadata, forceOptimize = False)
    # training model with best params (and saving training plots)
    best_model = construct_and_train_model_with_config(encoderClass, data, best_params_for_model, metadata, forceRetrain=False)
    # testing the best model
    best_model_results = testModel(best_model, data)
    # saving best model and results
    save_model_and_res(best_model, best_model_results)
    # running SentEval for the model
    runSentEval(best_model, TEXT, tasks="paper")

That's it. Once the above cell has finished (it may take days, depending on the ranges), all trained models, their configs, and their results are stored in the appropriately named folders.

We can test some examples interactively. Just pass an encoder name, the text field (for preprocessing), and the label field (for decoding the predicted label). If the fields are not passed, the script loads them itself:

In [5]:
testExample("Vector mean", TEXT, LABEL)

Type a premise (x to exit): This is an example text
Type a hypothesis (x to exit): This is not an example text




Verdict is: Neutral
Type a premise (x to exit): This is an other example
Type a hypothesis (x to exit): I like omletts
Verdict is: Entailment
Type a premise (x to exit): I like football
Type a hypothesis (x to exit): The sun is great
Verdict is: Contradiction
Type a premise (x to exit): x


To assess the performance more formally, we can create tables of results similar to those in the paper:

In [4]:
encoderNames = ["Vector mean", "LSTM", "BiLSTM", "Pooled BiLSTM"]  ### you could select a subset, or store the name in the above loop as well
printResults(encoderNames, resultType = "SNLI+transfer")

| Model         |   dev accuracy |   test accuracy:  |   transfer macro |   transfer micro |
|---------------+----------------+-------------------+------------------+------------------|
| Vector mean   |        73.3981 |           73.8839 |          80.5629 |          81.4773 |
| LSTM          |        82.67   |           82.7618 |          78.0964 |          79.6539 |
| BiLSTM        |        82.3872 |           82.4269 |          77.1967 |          79.0204 |
| Pooled BiLSTM |        84.3397 |           84.3141 |          78.25   |          79.762  |
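
The two transfer columns can be read as follows, assuming they follow the convention of the Conneau paper: the macro score is the plain average of the per-task dev accuracies, and the micro score weights each task by its number of dev samples. The sketch below uses made-up accuracies and dev-set sizes purely for illustration; macro_micro is a hypothetical helper, not a function from this repository.

```python
from typing import Dict, Tuple


def macro_micro(accuracies: Dict[str, float],
                dev_sizes: Dict[str, int]) -> Tuple[float, float]:
    """Return (macro, micro) averages of per-task accuracies."""
    # Macro: every task counts equally.
    macro = sum(accuracies.values()) / len(accuracies)
    # Micro: every dev sample counts equally, so large tasks dominate.
    total = sum(dev_sizes.values())
    micro = sum(accuracies[t] * dev_sizes[t] for t in accuracies) / total
    return macro, micro


# Illustrative numbers only, not results from this notebook.
toy_acc = {"task_a": 80.0, "task_b": 90.0}
toy_sizes = {"task_a": 1000, "task_b": 3000}
print(macro_micro(toy_acc, toy_sizes))
```

In this toy case the micro score is higher than the macro score because the larger task is also the one with the better accuracy.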


In [3]:
printResults(encoderNames, resultType = "SentEval")

| Model         |    MR |    CR |   MPQA |   SUBJ |   SST2 |   TREC | MRPC        |   SICKEntailment | STS14     |
|---------------+-------+-------+--------+--------+--------+--------+-------------+------------------+-----------|
| Vector mean   | 74.33 | 78.01 |  84.6  |  89.53 |  79.24 |   80.8 | 71.83/81.31 |            77.43 | 0.5/0.52  |
| LSTM          | 68.54 | 75.23 |  83.48 |  82.1  |  71.94 |   65.6 | 69.51/78.77 |            82.52 | 0.53/0.51 |
| BiLSTM        | 68.19 | 75.1  |  83.67 |  82.54 |  70.02 |   66.2 | 70.78/80.75 |            82.06 | 0.54/0.51 |
| Pooled BiLSTM | 73.21 | 80.05 |  85.26 |  88.57 |  77.38 |   81.6 | 72.75/81.18 |            83.8  | 0.64/0.61 |


## Customize runs

#### Default parameters and ranges
The default parameters and the ranges can be changed by passing different dictionaries as input.

#### I have the best configs stored, but I want to rerun sweeping
Set forceOptimize=True in paramSweep; the script then ignores the stored best config and overwrites it.

#### I have the best model stored, but I have changed the best params
Set forceRetrain=True in construct_and_train_model_with_config so that it ignores the stored model and overwrites it with a new one.
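
The behaviour of these two flags is the usual "load if stored, otherwise compute and save" pattern. Here is a minimal sketch of that pattern, assuming results are kept as JSON files; load_or_compute and the file name are hypothetical, and the actual scripts may store things in a different format.

```python
import json
import os
import tempfile
from typing import Any, Callable


def load_or_compute(path: str, compute: Callable[[], Any], force: bool = False) -> Any:
    """Return the value stored at `path`; recompute and overwrite if forced or missing."""
    if not force and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    value = compute()
    with open(path, "w") as f:
        json.dump(value, f)
    return value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_config.json")
    first = load_or_compute(path, lambda: {"lr": 0.001})              # nothing stored: computes
    cached = load_or_compute(path, lambda: {"lr": 0.01})              # stored: compute is skipped
    forced = load_or_compute(path, lambda: {"lr": 0.01}, force=True)  # recomputes and overwrites
    print(first, cached, forced)
```

The second call returns the stored value even though a different result would be computed, which is why the force flags are needed after changing the ranges or the parameters.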

#### I just want to train one model with specified params
You can always call the above functions separately. Just make sure you pass valid inputs (note that the keys are not named the same as in the config) and give a run name. The run name determines the directories in which the output is saved. It defaults to "best", so by default the output is saved in best_configs, best_models, best_model_results, but given e.g. "lstm" it would be saved to lstm_configs ... (the directories are created by the script). Any function that needs to access a stored file can take runName as an argument, and it always defaults to "best". As an example, here is how to train a simple LSTM encoder with custom params, without sweeping or SentEval evaluation:

In [None]:
custom_params = {
    "learning rate": 0.0001,
    "lr_stopping": 1e-06,
    "lr_decrease_factor": 7,
    "number of layers": 1,
    "number of neurons per layer": 256
}

runName = "custom_lstm_run"

trained_model = construct_and_train_model_with_config(LSTMEncoder, data, custom_params, metadata, runName=runName)
trained_model_results = testModel(trained_model, data)
save_model_and_res(trained_model, trained_model_results, runName = runName)

## we can also call the result printing with the runName
printResults(["LSTM"], resultType = "SNLI", runName = runName)
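
The naming scheme described above can be summarised in a few lines. The helper output_dirs is hypothetical and only mirrors the directory names mentioned in the text; the scripts themselves may build the paths differently.

```python
from typing import Dict


def output_dirs(run_name: str = "best") -> Dict[str, str]:
    """Map a run name to the three output directory names described above."""
    return {
        "configs": f"{run_name}_configs",
        "models": f"{run_name}_models",
        "results": f"{run_name}_model_results",
    }


print(output_dirs())                   # default run
print(output_dirs("custom_lstm_run"))  # the custom run from the cell above
```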



## Analysis

The results and models reported in this notebook (the ones under the "best" folders) are actually not the output of parameter sweeping. Instead, they use the same parameters as the Conneau paper, for easier comparison.

There are two differences from the original paper's setup:
* I used **dropout** of 0.5 at both the encoder and the classifier (the original did not report any dropout)
* I used the **Adam optimizer**. For that, the (starting) learning rate and the stopping value had to be reduced, to 0.001 and 1e-06 respectively.
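
To make the role of the starting learning rate, the decrease factor, and the stopping value concrete, here is a small simulation. It assumes the schedule works as in the Conneau paper, that is, the learning rate is divided by lr_decrease_factor whenever the dev accuracy does not improve, and training stops once the learning rate falls below lr_stopping. The function run_schedule and the dev accuracies are made up for illustration; the real logic lives in trainFunctions and may differ.

```python
from typing import List, Sequence


def run_schedule(dev_accuracies: Sequence[float],
                 lr: float = 0.001,
                 decrease_factor: float = 5,
                 lr_stopping: float = 1e-6) -> List[float]:
    """Return the learning rate used in each epoch until the stopping rule fires."""
    history = []
    best_acc = float("-inf")
    for acc in dev_accuracies:
        history.append(lr)  # learning rate used for this epoch
        if acc > best_acc:
            best_acc = acc
        else:
            lr /= decrease_factor  # no improvement on dev: shrink the step size
        if lr < lr_stopping:
            break  # learning rate too small to be useful: stop training
    return history


# Illustrative dev accuracies, not taken from an actual run.
toy_dev_acc = [80, 82, 81, 83, 82, 82, 82, 82, 82]
history = run_schedule(toy_dev_acc)
print(len(history), history)
```

Starting from 0.001 with a factor of 5, the learning rate has to be decreased five times before it drops below 1e-06, so in this toy run training stops after 8 of the 9 epochs.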

If we compare the results to the ones reported in the paper we can see two things:
* The SNLI results are comparable, slightly better here
* The transfer results are noticeably worse

This suggests that we are not overfitting the data, but we **are overfitting the task**. My intuition was that this might be due to the Adam optimizer, since the other difference, dropout, is a form of regularization that should not cause task overfitting. To test that, I ran a version with the SGD optimizer (change the optimizer in trainFunctions). The corresponding files can be found in the sgd_... folders.

Let us look at how the results compare:



In [2]:
encoderNames = ["LSTM", "BiLSTM", "Pooled BiLSTM"]  
print("\nResults with Adam\n")
printResults(encoderNames, resultType = "SNLI+transfer")
print("\nResults with SGD\n")
printResults(encoderNames, resultType = "SNLI+transfer", runName = "sgd")


Results with Adam

| Model         |   dev accuracy |   test accuracy:  |   transfer macro |   transfer micro |
|---------------+----------------+-------------------+------------------+------------------|
| LSTM          |        82.67   |           82.7618 |          75.63   |          77.8306 |
| BiLSTM        |        82.3872 |           82.4269 |          75.5136 |          77.792  |
| Pooled BiLSTM |        84.3397 |           84.3141 |          77.479  |          79.1903 |

Results with SGD

| Model         |   dev accuracy |   test accuracy:  |   transfer macro |   transfer micro |
|---------------+----------------+-------------------+------------------+------------------|
| LSTM          |        80.5349 |           80.3064 |          77.3029 |          79.1279 |
| BiLSTM        |        80.1843 |           80.3571 |          77.3843 |          78.9623 |
| Pooled BiLSTM |        80.222  |           79.86   |          79.1748 |          80.8795 |


In [3]:
print("\nResults with Adam\n")
printResults(encoderNames, resultType = "SentEval")
print("\nResults with SGD\n")
printResults(encoderNames, resultType = "SentEval", runName="sgd")


Results with Adam

| Model         |    MR |    CR |   MPQA |   SUBJ |   SST2 |   TREC | MRPC        |   SICKEntailment | STS14     |
|---------------+-------+-------+--------+--------+--------+--------+-------------+------------------+-----------|
| LSTM          | 68.54 | 75.23 |  83.48 |  82.1  |  71.94 |   65.6 | 69.51/78.77 |            82.52 | 0.53/0.51 |
| BiLSTM        | 68.19 | 75.1  |  83.67 |  82.54 |  70.02 |   66.2 | 70.78/80.75 |            82.06 | 0.54/0.51 |
| Pooled BiLSTM | 73.21 | 80.05 |  85.26 |  88.57 |  77.38 |   81.6 | 72.75/81.18 |            83.8  | 0.64/0.61 |

Results with SGD

| Model         |    MR |    CR |   MPQA |   SUBJ |   SST2 |   TREC | MRPC        |   SICKEntailment | STS14     |
|---------------+-------+-------+--------+--------+--------+--------+-------------+------------------+-----------|
| LSTM          | 70.22 | 75.28 |  84.2  |  84.27 |  74.24 |   70.2 | 72.93/81.49 |            82.71 | 0.57/0.56 |
| BiLSTM        | 69.22 | 75.79 |  84.4  