# Pretraining sentence representations on SNLI

This notebook shows how to use the provided code and how to customize runs. The analysis of the results is included at the end.

## Training and evaluation

First, import the functions used for training and evaluation from the other files:

In [None]:
from encoders import *
from trainFunctions import *
from utils import *
from sentEval import runSentEval

## let's ignore the pytorch warnings for readability
import warnings
warnings.filterwarnings('ignore')

Then load the SNLI data and the fields, which contain the preprocessing pipeline and metadata (this takes a while):

In [None]:
train_data, val_data, test_data, TEXT, LABEL = get_data()
data = {"train":train_data, "val": val_data, "test": test_data}

print("Data loaded")

We need to define the parameters that are fixed

In [None]:
metadata = {
    "vector_size" : 300,
    "vocab_size" : len(TEXT.vocab),
    "pretrained" : TEXT.vocab.vectors,
    "pad_idx" : TEXT.vocab.stoi[TEXT.pad_token]
}

We also need to define the default parameters that are used during the sweep. \
(Only one parameter is swept at a time while the rest stay unchanged. This saves time, but it may not find the best combination.)

In [None]:
## edit this to change default parameters
default_params = {
    "lr_decrease_factor":5,
    "lr_stopping" : 1e-6,
    "layer_num" : 1,
    "layer_size" : 512,
    "lr" : 0.001,
}
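
The roles of `lr`, `lr_decrease_factor` and `lr_stopping` can be illustrated with a small sketch. It assumes the schedule works as in the Conneau paper: the learning rate is divided by the decrease factor whenever the validation accuracy drops, and training stops once the learning rate falls below the stopping threshold. The `lr_schedule` helper and the accuracy values are hypothetical and only for illustration; the actual logic lives in the training functions and may differ.

```python
def lr_schedule(val_accuracies, lr, lr_decrease_factor, lr_stopping):
    """Return the learning rate used in each epoch until training stops.

    Assumed rule: after an epoch whose validation accuracy is lower than the
    previous epoch's, divide lr by lr_decrease_factor. Training stops as soon
    as lr drops below lr_stopping.
    """
    used = []
    previous = float("-inf")
    for acc in val_accuracies:
        if lr < lr_stopping:
            break  # learning rate too small: stop training
        used.append(lr)
        if acc < previous:
            lr /= lr_decrease_factor  # validation accuracy dropped: shrink lr
        previous = acc
    return used


# Made-up validation accuracies and illustrative settings (not the defaults above).
history = [0.60, 0.70, 0.69, 0.72, 0.71, 0.71]
rates = lr_schedule(history, lr=0.1, lr_decrease_factor=5, lr_stopping=0.01)
print(rates)
```

With these toy values the learning rate is reduced twice, and the second reduction pushes it below the threshold, so the last epoch in `history` is never trained.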

We also define the ranges over which these are swept:

In [None]:
## edit this to change parameters ranges
param_ranges = {
    "learning rates":[0.01, 0.001],
    "lr_decrease_factors":[3, 5],
    "lr_stoppings": [1e-5, 1e-6], 
    "layer nums":[1,2],
    "layer sizes":[512,1024],
}
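
The one-parameter-at-a-time strategy can be sketched as follows. This is only an illustration of the idea, not the actual `paramSweep` implementation: the `RANGE_TO_PARAM` mapping and the `one_at_a_time_configs` helper are hypothetical names introduced here, and the dictionaries are local copies of the ones above.

```python
# Hypothetical mapping from range keys to the default-parameter keys they vary.
RANGE_TO_PARAM = {
    "learning rates": "lr",
    "lr_decrease_factors": "lr_decrease_factor",
    "lr_stoppings": "lr_stopping",
    "layer nums": "layer_num",
    "layer sizes": "layer_size",
}


def one_at_a_time_configs(defaults, ranges):
    """Yield configs that differ from the defaults in at most one parameter."""
    seen = set()
    for range_key, values in ranges.items():
        param = RANGE_TO_PARAM[range_key]
        for value in values:
            config = dict(defaults, **{param: value})
            key = tuple(sorted(config.items()))
            if key not in seen:  # the default config shows up once per range
                seen.add(key)
                yield config


defaults = {"lr_decrease_factor": 5, "lr_stopping": 1e-6,
            "layer_num": 1, "layer_size": 512, "lr": 0.001}
ranges = {"learning rates": [0.01, 0.001], "lr_decrease_factors": [3, 5],
          "lr_stoppings": [1e-5, 1e-6], "layer nums": [1, 2],
          "layer sizes": [512, 1024]}

configs = list(one_at_a_time_configs(defaults, ranges))
print(len(configs))  # 6 configs, versus 2**5 = 32 for a full grid
```

The saving grows quickly with the number of parameters, at the cost of never testing interactions between two non-default values.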

Note that the keys in all of the dictionaries above are fixed, because the models look them up by name. Change only the values if you want to try different setups.

Now we define the list of encoder models that we want to train and evaluate

In [None]:
encoders = [MeanEncoder,LSTMEncoder,BiLSTMEncoder, MaxBiLSTMEncoder]

Finally, we loop through the encoders and, for each one:
* search for the best parameters
* construct a model with the best parameters and train it
* test the model
* store the trained model and the dev/test results
* evaluate on SentEval

There is a function for each of these tasks; see the README for more details.

(Note: I wouldn't recommend actually running this cell, as it takes very long. All the cells below it will still work, because the outputs are stored.)

(Note 2: we use the default run name, "best", here.)

In [None]:
for encoderClass in encoders:
    # searching for best params
    best_params_for_model = paramSweep(encoderClass, data, default_params, param_ranges, metadata, forceOptimize = False)
    # training model with best params (and saving training plots)
    best_model = construct_and_train_model_with_config(encoderClass, data, best_params_for_model, metadata, forceRetrain=False)
    # testing the best model
    best_model_results = testModel(best_model, data)
    # saving best model and results
    save_model_and_res(best_model, best_model_results)
    # running SentEval for the model
    runSentEval(best_model, TEXT, tasks="paper")

That's it. Once the above cell has finished (which may take days, depending on the ranges), all trained models, their configs and their results are stored in the appropriately named folders.

We can test some examples: just pass an encoder name, the text field (for preprocessing) and the label field (for getting the label). If the fields are not passed, they are loaded by the script:

In [None]:
testExample("Pooled BiLSTM", TEXT, LABEL)

(Don't forget to exit the example above before continuing!)

To more formally assess the performance, we can create tables with results, similarly to the paper:

In [None]:
encoderNames = ["Vector mean", "LSTM", "BiLSTM", "Pooled BiLSTM"]  ### you could select a subset, or store the name in the above loop as well
printResults(encoderNames, resultType = "SNLI+transfer")

In [None]:
printResults(encoderNames, resultType = "SentEval")

## Customize runs

#### Default parameters and ranges
The default parameters and the ranges can be changed by passing different dictionaries as input.

#### I just want to train one model with specified params
You can always call the above functions separately; just make sure you pass valid inputs (note that the keys are not named the same as in the config) and give a **run name**.

The run name determines the directories in which the output is saved. It defaults to "best", so by default the outputs are saved in runs/best/ under best_configs, best_models and best_model_results; given e.g. "lstm", they would be saved to runs/lstm/lstm_configs ... (the directories are created by the script). Any function that needs to access a stored file can take runName as an argument, and in all of them it defaults to "best". As an example, here is how to train a simple LSTM encoder with custom params, without sweeping or SentEval evaluation:

In [None]:
custom_params = {
    "learning rate": 0.0001,
    "lr_stopping": 1e-06,
    "lr_decrease_factor": 7,
    "number of layers": 1,
    "number of neurons per layer": 256
}

runName = "custom_lstm_run"

# use the LSTM encoder, matching the run name and the layer parameters above
trained_model = construct_and_train_model_with_config(LSTMEncoder, data, custom_params, metadata, runName=runName)
trained_model_results = testModel(trained_model, data)
save_model_and_res(trained_model, trained_model_results, runName = runName)

## we can also call the result printing with the runName
printResults(["LSTM"], resultType = "SNLI", runName = runName)
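
The folder naming convention for run names can be sketched as below. The `run_dirs` helper is hypothetical and not part of the provided code; it only reproduces the layout described in this section (runs/best/best_configs, runs/lstm/lstm_configs, and so on).

```python
from pathlib import PurePosixPath


def run_dirs(run_name="best", root="runs"):
    """Map a run name to its output folders (hypothetical helper)."""
    base = PurePosixPath(root) / run_name
    return {
        kind: base / f"{run_name}_{kind}"
        for kind in ("configs", "models", "model_results")
    }


print(run_dirs()["configs"])                  # runs/best/best_configs
print(run_dirs("custom_lstm_run")["models"])  # runs/custom_lstm_run/custom_lstm_run_models
```

Keeping the run name in both the parent folder and the subfolder names means a stored file can be traced back to its run even if it is copied elsewhere.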



#### I have the best configs stored, but I want to rerun sweeping
Set forceOptimize=True in paramSweep, and the script ignores the stored best config and overwrites it. Example with LSTM:

In [None]:
best_params_for_model = paramSweep(LSTMEncoder, data, default_params, param_ranges, metadata, forceOptimize = True, runName = "retrain_test")


#### I have the best model stored, but I have changed the best params
Set forceRetrain=True in construct_and_train_model_with_config so that it ignores the stored model and overwrites it with a new one. Example with LSTM:

In [None]:
best_model = construct_and_train_model_with_config(LSTMEncoder, data, custom_params, metadata, forceRetrain=True, runName = "retrain_test")


## Analysis

The results and models shown above (the ones under the "best" folders) are actually not the output of parameter sweeping; they use the same parameters as the Conneau paper for easier comparison.\
\
If you are familiar with the paper, you can see that the SNLI results are comparable to, and in some cases even better than, the ones reported there, and that they show the same pattern: the baseline vector mean approach is the worst, LSTM is noticeably better, BiLSTM is slightly better than LSTM, and the pooled BiLSTM performs best. This is as expected, since:
* The vector mean approach is a very naive compositional approach: as it only averages, it contains no information about word order; the individual GloVe vectors contain no information about context; and the model does not employ any attention mechanism to focus on the important parts.
* The LSTM method is sequential in nature, and the running cell state with input and forget gates offers a mechanism that can capture some contextual meaning from the word vectors. However, it is strictly unidirectional, and because it processes words one by one it still has trouble capturing long-distance relations (though less so than a plain RNN). The fact that we only use the last hidden state makes it hard to retain the words at the beginning, or to separate the contributions of individual words in general.
* The BiLSTM improves on the LSTM by concatenating two unidirectional encoders, one of which reads the sentence from the end. This introduces a shallow bidirectionality, where the information from the other end is also encoded (it does not, however, let every word see every other word at the same time, as transformers do).
* The pooled BiLSTM works best, as it adds a weak form of attention to the model. Though the pooling is not queried based on the output, the fact that every word has a chance to contribute to the final output makes it possible for the model to extract a more meaningful representation regardless of position.
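
The difference between the three ways of collapsing a sequence into one vector (averaging, keeping the last state, max pooling) can be made concrete with a toy sketch. It operates on a fixed list of per-word vectors with made-up values and is not the encoders' actual code; in the real LSTM models the per-word states themselves depend on word order, which this sketch deliberately leaves out.

```python
def mean_pool(states):
    """Average the per-word vectors dimension by dimension."""
    return [sum(dim) / len(states) for dim in zip(*states)]


def last_state(states):
    """Keep only the final time step, as a plain LSTM encoder does."""
    return states[-1]


def max_pool(states):
    """Take the maximum over time steps for each dimension."""
    return [max(dim) for dim in zip(*states)]


# Toy per-word vectors (3 words, 2 dimensions); the values are made up.
states = [[1.0, 0.0], [0.25, 0.5], [0.0, 2.0]]
reordered = list(reversed(states))

print(mean_pool(states) == mean_pool(reordered))  # True: averaging ignores order
print(last_state(states), last_state(reordered))  # depends only on the final word
print(max_pool(states))                           # each dimension can come from a different word
```

In the max-pooled vector the first dimension comes from the first word and the second from the last word, which is the sense in which every position has a chance to contribute.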

However, if you've read the original paper, you might have noticed that, while the SNLI performance is good, the performance on the transfer tasks is quite bad: worse than the results reported in the paper (and actually worse than the baseline model's).

To investigate this, we should note that there are two differences from the original paper's setup:
* I used **dropout** of 0.5 in both the encoder and the classifier (the original did not report any dropout).
* I used the **Adam optimizer**. For that, the (starting) learning rate and the stopping threshold had to be reduced, to 0.001 and 1e-06 respectively.

This suggests that we are not overfitting the data, but we **are overfitting the task**. As the paper mentions, this is probably because Adam generally gives a better fit on the objective task. To test this, I ran a version with the SGD optimizer (change the optimizer in trainFunctions); the corresponding files can be found in the runs/sgd/sgd_... folders.

Let us look at how the results compare:



In [None]:
encoderNames = ["LSTM", "BiLSTM", "Pooled BiLSTM"]  
print("\nResults with Adam\n")
printResults(encoderNames, resultType = "SNLI+transfer")
print("\nResults with SGD\n")
printResults(encoderNames, resultType = "SNLI+transfer", runName = "sgd")

What we can see here is that the transfer performance is indeed noticeably higher for the SGD version, supporting the idea that Adam is overfitting the task. To get a better picture, let's look at the results on the separate tasks:

In [None]:
print("\nResults with Adam\n")
printResults(encoderNames, resultType = "SentEval")
print("\nResults with SGD\n")
printResults(encoderNames, resultType = "SentEval", runName="sgd")

We can see that on tasks directly involving inference or sentence comparison (SickEntailment, STS14) we achieve high performance, whereas as we move further from the original task (MR, TREC) the performance drops rapidly. This difference is bigger with Adam than with SGD, further strengthening the idea that Adam makes the model overfit the specific task more strongly than SGD does. For transfer use, SGD is therefore the better choice, even though its performance on the original task is worse.

### Conclusion
The idea of transfer learning for NLP has been highly influential, and it is one of the main reasons the field has been among the fastest developing in recent years. However, this specific task does not prove to be "NLP-complete", as even slightly complex models easily overfit on it without generalizing well to other NLP tasks. The architectures presented here also lack some key aspects of the models that have proven most effective in recent years, such as proper attention or deep bidirectionality. The upside of these models is that they are lightweight compared to transformers, so it is feasible to train them from scratch. If that is not an important aspect, it is normally better to use a pre-trained transformer language model, as the pretraining task (masked language modelling) is more NLP-complete and the deep attention-based architecture has been shown to represent richer linguistic content. Still, the main point of the paper and of these experiments, that transfer learning should be exploited in NLP, holds up, and it has become standard practice.

### Further questions
* The finding is quite curious, and it would be interesting to look into the theoretical background of why this happens. As mentioned in the paper, Adam converges faster, but for a proper comparison the training should be stopped at an equal performance level, so that any difference we see comes from the nature of the optimizer and not from how far each run is into convergence.
* The first training runs used a large batch size (300), but they were stopped after discussing with the TAs that this would probably hurt performance. Those runs could not finish, so results were obtained only for the baseline and the LSTM model; surprisingly, performance was better on both. It would be interesting to run the full experiment with a large batch size to see the results.
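
The equal-performance comparison suggested in the first point could be set up roughly as follows. This is only a sketch: `first_epoch_reaching` is a hypothetical helper, and the two accuracy curves are placeholders, not measured results from these runs.

```python
def first_epoch_reaching(val_accuracies, target):
    """Return the 1-based epoch at which accuracy first reaches target, or None."""
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc >= target:
            return epoch
    return None


# Placeholder validation curves, NOT measured results.
adam_curve = [0.70, 0.78, 0.82, 0.84]
sgd_curve = [0.55, 0.65, 0.72, 0.77, 0.80, 0.82]

target = 0.80
checkpoints = {
    "adam": first_epoch_reaching(adam_curve, target),
    "sgd": first_epoch_reaching(sgd_curve, target),
}
print(checkpoints)  # {'adam': 3, 'sgd': 5}
```

The checkpoints saved at those epochs would then be the ones to evaluate on the transfer tasks, so that both optimizers are compared at the same SNLI validation accuracy.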