# Pretraining language models on SNLI

This notebook shows how to use the provided code.

## Training and evaluation

First, import the training and evaluation functions from the other files.

In [9]:
from encoders import *
from trainFunctions import *
from utils import *
from sentEval import runSentEval

Then load the SNLI data and the field that contains the preprocessing pipeline and metadata (this takes a while).

In [10]:
train_data, val_data, test_data, TEXT = get_data()
data = {"train":train_data, "val": val_data, "test": test_data}

print("Data loaded")

Accessing raw input and preprocessing


2020-04-15 19:26:12,631 : Loading vectors from .vector_cache/glove.840B.300d.txt.pt


done
Building vocabulary with GloVe
done
Loading data into iterables
done, returning data
Data loaded, starting train


Next, we define the parameters that stay fixed.

In [11]:
metadata = {
    "vector_size" : 300,
    "vocab_size" : len(TEXT.vocab),
    "pretrained" : TEXT.vocab.vectors,
    "pad_idx" : TEXT.vocab.stoi[TEXT.pad_token]
}

We also need to define the default parameters used during the sweep. Only one parameter is swept at a time while the rest stay unchanged. This saves time, but it may not give the best results.

In [12]:
## edit this to change default parameters
default_params = {
    "lr_decrease_factor":5,
    "lr_stopping" : 1e-6,
    "layer_num" : 1,
    "layer_size" : 512,
    "lr" : 0.001,
}
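The names lr_decrease_factor and lr_stopping suggest a schedule in which the learning rate is divided by a factor whenever the dev accuracy drops, and training stops once the learning rate falls below a threshold. The sketch below illustrates that reading only. The function lr_schedule and the accuracy values are hypothetical and are not taken from the provided code, whose actual rule may differ.

```python
def lr_schedule(dev_accuracies, lr, lr_decrease_factor, lr_stopping):
    """Return the learning rate used in each epoch under the assumed rule.

    Assumption (not confirmed by the provided code): after an epoch whose dev
    accuracy is lower than the previous epoch's, lr is divided by
    lr_decrease_factor; training stops once lr drops below lr_stopping.
    """
    history = []
    previous_acc = float("-inf")
    for acc in dev_accuracies:
        history.append(lr)  # learning rate used for this epoch
        if acc < previous_acc:
            lr /= lr_decrease_factor
        previous_acc = acc
        if lr < lr_stopping:
            break  # assumed stopping criterion
    return history


# Made-up dev accuracies, for illustration only
rates = lr_schedule([0.60, 0.65, 0.64, 0.66, 0.65],
                    lr=0.001, lr_decrease_factor=5, lr_stopping=1e-6)
print(rates)
```

Under this reading, a larger lr_decrease_factor or a larger lr_stopping ends training sooner.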

We also define the ranges over which these parameters are swept.

In [13]:
## edit this to change parameters ranges
param_ranges = {
    "learning rates":[0.01, 0.001],
    "lr_decrease_factors":[3, 5],
    "lr_stoppings": [1e-5, 1e-6], 
    "layer nums":[1,2],
    "layer sizes":[512,1024],
}

Note that the keys in all of the dictionaries above are fixed, because the models look them up by name. Change only the values if you want to try different setups.
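The one-parameter-at-a-time sweep described above can be sketched as follows. This is a minimal illustration, not the code behind paramSweep: the function sweep_one_at_a_time, the toy dictionaries, and toy_score (a stand-in for dev accuracy) are all hypothetical, and the sketch assumes that the parameters not being swept stay at their default values.

```python
def sweep_one_at_a_time(defaults, ranges, score):
    """Pick the best value for each parameter independently."""
    best = dict(defaults)
    for name, candidates in ranges.items():
        # Vary only `name`; every other parameter keeps its default value.
        trials = [{**defaults, name: value} for value in candidates]
        best[name] = max(trials, key=score)[name]
    return best


def toy_score(params):
    # Hypothetical stand-in for dev accuracy, for illustration only
    return -abs(params["lr"] - 0.001) + params["layer_size"] / 1e6


toy_defaults = {"lr": 0.01, "layer_size": 512}
toy_ranges = {"lr": [0.01, 0.001], "layer_size": [512, 1024]}

best = sweep_one_at_a_time(toy_defaults, toy_ranges, toy_score)
print(best)
```

The cost grows with the sum of the range lengths instead of their product. With five parameters and two candidate values each, this is 10 training runs instead of the 32 a full grid search would need. The trade-off is that interactions between parameters are never explored.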

Now we define the list of encoder models that we want to train and evaluate

In [14]:
encoders = [MeanEncoder,LSTMEncoder,BiLSTMEncoder, MaxBiLSTMEncoder]

Finally, we loop through the encoders and, for each one:
* search for the best parameters
* construct a model with the best parameters and train it
* test the model
* store the trained model and the dev/test results
* evaluate the model on SentEval

There is a function for each of these tasks; see the README for more details.

In [15]:
for encoderClass in encoders:
    # searching for best params
    best_params_for_model = paramSweep(encoderClass, data, default_params, param_ranges, metadata, forceOptimize = False)
    # training model with best params (and saving training plots)
    best_model = construct_and_train_model_with_config(encoderClass, data, best_params_for_model, metadata, forceRetrain=False)
    # testing the best model
    best_model_results = testModel(best_model, data)
    # saving best model and results
    save_model_and_res(best_model, best_model_results)
    # running SentEval for the model
    runSentEval(best_model, TEXT, tasks="paper")

Vector mean SNLI optimized parameters already exists, retrieving
best learning rate retrieved from stored param file
best lr stopping criterion retrived from stored param file
best lr decrease factor retrived from stored param file
++++++++++++++++++++++++++ Training model Vector mean SNLI with best params +++++++++++++++++++++++++++++++


RuntimeError: CUDA error: unspecified launch failure

That's it. Once the above cell has finished (it may take days, depending on the ranges), all trained models, their configs, and their results are stored in the appropriately named folders. To create a comparison table we can call:

## Customize runs

#### Default parameters and ranges
The default parameters and the ranges can be changed by editing the values in the default_params and param_ranges dictionaries defined above.

#### I have the best configs stored, but I want to rerun sweeping
Set forceOptimize=True in paramSweep. The script then ignores the stored best config and overwrites it.

#### I have the best model stored, but I have changed the best params
Set forceRetrain=True in construct_and_train_model_with_config. The script then ignores the stored model and overwrites it with a new one.
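Both flags follow a reuse-unless-forced pattern: a stored result is used if it exists, unless the flag asks for it to be recomputed and overwritten. A minimal sketch of that pattern is shown below. The helper load_or_compute and the JSON file format are assumptions made for illustration; how the provided code actually stores configs and models is not shown here.

```python
import json
import os
import tempfile


def load_or_compute(path, compute, force=False):
    """Return the JSON stored at `path`, or call `compute()` and store its result."""
    if os.path.exists(path) and not force:
        with open(path) as f:
            return json.load(f)  # reuse the stored result
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)  # create or overwrite the stored result
    return result


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    first = load_or_compute(path, lambda: {"lr": 0.01})
    # A stored file now exists, so the new computation is skipped
    cached = load_or_compute(path, lambda: {"lr": 0.001})
    # force=True ignores the stored file and overwrites it
    forced = load_or_compute(path, lambda: {"lr": 0.001}, force=True)
```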

#### I just want to train one model with specified params
You can always call the above functions separately; just make sure you pass valid inputs and give a run name. The run name determines the directories in which the output is saved. It defaults to "best", so by default the output is saved in best_configs, best_models, and best_model_results, but given e.g. "lstm" it would be saved to lstm_configs ... (the directories are created by the script). As an example, here we train a simple LSTM encoder with custom params, without sweeping or SentEval evaluation:

In [16]:
custom_params = {
    "lr_decrease_factor":5,
    "lr_stopping" : 1e-6,
    "layer_num" : 1,
    "layer_size" : 512,
    "lr" : 0.001,
}

runName = "custom_lstm_run"

trained_model = construct_and_train_model_with_config(LSTMEncoder, data, custom_params, metadata, runName=runName)
trained_model_results = testModel(trained_model, data)
save_model_and_res(trained_model, trained_model_results, runName = runName)



TypeError: construct_and_train_model_with_config() got an unexpected keyword argument 'runName'
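The run-name convention described in this section can be illustrated with a small helper. The function output_dirs is hypothetical and only mirrors the directory names given in the text above; it does not create any directories and is not part of the provided code.

```python
def output_dirs(run_name="best"):
    """Return the output directory names implied by a run name."""
    # The suffixes follow the names mentioned in the text:
    # best_configs, best_models, best_model_results
    suffixes = ["configs", "models", "model_results"]
    return [f"{run_name}_{suffix}" for suffix in suffixes]


default_dirs = output_dirs()
custom_dirs = output_dirs("custom_lstm_run")
print(default_dirs)
print(custom_dirs)
```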

## Analysis