# Sentiment Polarity Prediction Model Analysis Introduction

This notebook analyses various models used to carry out sentiment analysis on movie reviews. The goal of these models is to classify a movie review as either positive or negative through document-level sentiment analysis.

The models analysed are:

1. A baseline Multinomial Naive Bayes model
2. A Logistic Regression model
3. A fine-tuned BERT model

Code for a Multinomial Naive Bayes classification model was provided as a baseline model for the previous assignment. As part of that assignment, I took the initial code for this model, refactored it, expanded it and implemented a more streamlined version of this model, which I use in this notebook.

The Logistic Regression model also stems from the previous assignment, in which I experimented with different models and parameters to improve on the accuracy of the initial baseline model. This Logistic Regression model was the most accurate model from those experiments, so it is included in this analysis as the benchmark in terms of accuracy.

The new model brought to this analysis is the BERT model. Code for a basic implementation of this model was provided as part of this assignment. This implementation uses the Hugging Face [Transformers](https://huggingface.co/transformers/) library with [PyTorch](https://pytorch.org/) and [Lightning](https://www.pytorchlightning.ai/).

---

All of the models in this notebook are trained on the movie review polarity data of Pang and Lee 2004 [A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts](https://www.aclweb.org/anthology/P04-1035/). The dataset used in this paper is available at http://www.cs.cornell.edu/People/pabo/movie-review-data (section "Sentiment polarity datasets") and contains 1000 positive and 1000 negative reviews, with each review being tokenised, sentence-split (one sentence per line) and lowercased.

In this dataset, each review has been assigned to 1 of 10 cross-validation folds by the authors. In order to compare the results of the different models outlined above, the models are evaluated and compared using an average of the 10-fold cross-validation accuracy scores.

During this process, no special treatment is given to rare or unknown words. Unknown words in the test data are skipped.

# --------------------------------------------------------------------------------------------------------------------------

# Setup

### Packages to install

In order to be able to re-implement this work, there are a number of packages you will need to install. The commands for these are as follows:

- conda install numpy
- conda install pandas
- conda install tabulate
- conda install matplotlib
- conda install selenium
- conda install tqdm
- conda install scikit-learn (version > 0.24.0)
- pip install torch==1.7.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
- pip install torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
- pip install torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
- pip install transformers
- pip install pytorch-lightning
- pip install pytorch-nlp
- pip install tensorboard

In [None]:
# adjust the torch version below following instructions on https://pytorch.org/get-started/locally/

import sys

# for why we use {sys.executable} see
# https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/

try:
    import torch
except ModuleNotFoundError:
    !{sys.executable} -m pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

try:
    import transformers
except ModuleNotFoundError:
    !{sys.executable} -m pip install transformers

try:
    import pytorch_lightning as pl
except ModuleNotFoundError:
    !{sys.executable} -m pip install pytorch-lightning

try:
    import torchnlp
except ModuleNotFoundError:
    !{sys.executable} -m pip install pytorch-nlp

try:
    import tensorboard
except ModuleNotFoundError:
    !{sys.executable} -m pip install tensorboard

!{sys.executable} -m pip install selenium

### Import packages

In [1]:
# general packages
import os
import sys
import pandas as pd
import time
from tqdm.auto import tqdm

In [2]:
# packages for defining the tokeniser
from transformers import AutoTokenizer
from tokenizers.pre_tokenizers import Whitespace

In [31]:
# packages for clearing CUDA cache
import gc
import torch

# packages for BERT tokenisation (torch is already imported above)
import transformers
import pytorch_lightning as pl
import torchnlp
import tensorboard

### Import the classes and functions from the '.py' files

Instead of having this notebook full of different functions with different applications, I decided to extract these from the notebook and put them into '.py' files. Importing the functions from these files keeps this notebook compact and streamlined.

For even more clarity, I have also broken the functions and classes down into three '.py' files. These files correspond to the functions and classes needed to:

1. Load the data
2. Train the Multinomial Naive Bayes model & the Logistic Regression model
3. Train the BERT model

The functions themselves are easily accessible and readable from the '.py' file.

In [3]:
# Load the autoreload extension
%load_ext autoreload
%autoreload 1

In [4]:
# functions for loading the data
%aimport Load_data_functions
import Load_data_functions as ld

In [5]:
# functions for running the Naive Bayes & Logistic Regression models
#%aimport Simple_model_functions
#import Simple_model_functions as smf

In [6]:
# functions for running the BERT model
%aimport Bert_functions
import Bert_functions as bf

### Increase the maximum number of rows displayed

In [7]:
pd.set_option("display.max_rows", 100)

### Define the location of the chromedriver

In this code, I make use of a chromedriver to download the data from the internet. This is only done if the data is not already in your local folder. If you have cloned my original repo then the data will already be in the right place and you won't have to worry about the chromedriver. This data should be in the 'Data' folder.

In [8]:
chromedriver_location = 'chromedriver.exe'

### Define the location of the data folder

In [9]:
data_directory = "Data"

# --------------------------------------------------------------------------------------------------------------------------

# Load in the data

I implemented a nice way to load in the data in the last assignment and it makes sense to use these same functions and the same logic again here in this assignment. The logic behind the below *load_data* function is:

1. Check if you have the data already downloaded and in the right place
    1. If you don't have the data already downloaded, the data will be downloaded from the web and put into the specified data folder
    2. If you have the data already downloaded then continue to step 2
2. Read in all the files in the data
3. Turn this data into a map of cross validation folds and class labels to their associated documents
4. Return this map

---

The resulting output is a dictionary in the following format:

    {(cross validation fold 1, 'pos'): [[list of sentences in doc1], [list of sentences in doc2], ...],
     (cross validation fold 1, 'neg'): [[list of sentences in doc1], [list of sentences in doc2], ...],
     .................................................................................................
     (cross validation fold 10, 'pos'): [[list of sentences in doc1], [list of sentences in doc2], ...],
     (cross validation fold 10, 'neg'): [[list of sentences in doc1], [list of sentences in doc2], ...],
    }

This dictionary has two entries for each cross validation fold, one for the positive documents and one for the negative documents. Each value in the dictionary contains a list of the documents associated with that cross validation fold and class label pair. Each document in this list is made up of a list of sentences with each sentence, in turn, being made up of a list of tokens.
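
To make this structure concrete, here is a minimal sketch that walks over a toy dictionary of the same shape. The contents, the integer fold keys and the helper `count_tokens` are assumptions for illustration only; they are not taken from the `Load_data_functions` file.

```python
# Toy stand-in for the dictionary returned by ld.load_data (contents are made up,
# and integer fold numbers in the keys are an assumption)
toy_data_dict = {
    (0, 'pos'): [[['a', 'great', 'film', '.'], ['loved', 'it', '.']]],
    (0, 'neg'): [[['a', 'dull', 'film', '.']]],
    (1, 'pos'): [[['wonderful', '!']]],
    (1, 'neg'): [[['terrible', 'plot', '.'], ['avoid', '.']]],
}

def count_tokens(document):
    """Count the tokens in a document given as a list of sentences (token lists)."""
    return sum(len(sentence) for sentence in document)

# one line per (fold, label) pair: number of documents and tokens per document
for (fold, label), documents in sorted(toy_data_dict.items()):
    token_counts = [count_tokens(doc) for doc in documents]
    print(fold, label, len(documents), token_counts)
```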

In [10]:
data_dict = ld.load_data(data_directory, chromedriver_location)

# --------------------------------------------------------------------------------------------------------------------------

# Define the BERT Tokeniser

We must set up a tokeniser for the BERT model. The tokeniser breaks each input sentence up into a sequence of tokens and maps these tokens to a numeric representation. When training the model, BERT uses these numeric token sequences to predict the class of a given document.

### Choose which size model to use

There are three model types, which are based on size, the first being tiny, the second being a base model, and the third being a large model. Here, we can select which size we want to proceed with in this modelling.

I decided to use the distilbert model as it was the most reliable and efficient size model to use. The base model and the large model ran out of GPU RAM under a few parameter configurations so for me the logical choice was to use the distilbert model. Despite its smaller size, its accuracy score held up well in comparison to what I saw using the larger models.

In [11]:
model_name = 'distilbert-base-uncased' # tiny
#model_name = 'bert-base-uncased'       # base
#model_name = 'bert-large-uncased'      # large

### Define the tokeniser

When it comes to the tokeniser used with this model, Hugging Face provides the `AutoTokenizer` class, which loads the tokeniser that matches whichever model name it is given. As a result, we can select any size model above without changing the tokeniser code.

In [12]:
tokeniser = AutoTokenizer.from_pretrained(model_name)

### Decide whether to pre-tokenise the data by splitting on whitespace

We also have the option of using a pre-tokeniser. This could be used to split the input sentences into whitespace-delimited tokens before they are passed into the tokeniser. Popular BERT libraries can provide tools to help with this.

In the case of our data, the P&L 04 corpus is already tokenised into words and punctuation, so this step doesn't actually matter as there are no whitespace characters within the tokens. As a result, this parameter will be set to **False** here.

In [13]:
force_whitespace_pre_tokeniser = False

if force_whitespace_pre_tokeniser:
    tokeniser.pre_tokenizer = Whitespace()

# --------------------------------------------------------------------------------------------------------------------------

# Test the Tokeniser on Example Input

In this section, I will show the steps that this tokeniser does on an example input. You will see that the tokenisation process involves:

1. Splitting the sentence into tokens
2. Adding a start and end token to the sentence to specify these positions - [CLS] & [SEP] respectively
3. Mapping each token in the sentence to a unique numeric encoding to be used in the BERT model

In [14]:
# Define a pre-tokenised input document
example_batch = [['hello', 'world', '!'],
                 ["tokenisation", "'s", "fun"],
                ]

In [15]:
bf.print_token_summary_of_pretokenised_sentences(tokeniser, example_batch)

|    |   sentence_num |   input_ids | tokens   |   word_ids |
|---:|---------------:|------------:|:---------|-----------:|
|  0 |              0 |         101 | [CLS]    |        nan |
|  1 |              0 |        7592 | hello    |          0 |
|  2 |              0 |        2088 | world    |          1 |
|  3 |              0 |         999 | !        |          2 |
|  4 |              0 |         102 | [SEP]    |        nan |


|    |   sentence_num |   input_ids | tokens    |   word_ids |
|---:|---------------:|------------:|:----------|-----------:|
|  0 |              1 |         101 | [CLS]     |        nan |
|  1 |              1 |       19204 | token     |          0 |
|  2 |              1 |        6648 | ##isation |          0 |
|  3 |              1 |        1005 | '         |          1 |
|  4 |              1 |        1055 | s         |          1 |
|  5 |              1 |        4569 | fun       |          2 |
|  6 |              1 |         102 | [SEP]     |        nan |

In [16]:
bf.print_encoding_of_pretokenised_sentences(tokeniser, example_batch)

|    | token   | encoding   |
|---:|:--------|:-----------|
|  0 | hello   | [7592]     |
|  1 | world   | [2088]     |
|  2 | !       | [999]      |


|    | token        | encoding      |
|---:|:-------------|:--------------|
|  0 | tokenisation | [19204, 6648] |
|  1 | 's           | [1005, 1055]  |
|  2 | fun          | [4569]        |




You will notice from the above outputs that in some cases, the tokeniser tokenises words even further than was done in the corpus loaded into this notebook. This is particularly evident in the outputs for the second example sentence where *'tokenisation'* gets turned into *'token'* and *'##isation'*. As a result of this, the token *'tokenisation'* gets mapped to two numeric encodings, one for each of its sub tokens.
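
The sub-token split seen above can be illustrated with a greedy longest-match-first lookup, which is the general idea behind WordPiece-style tokenisation. This is only a sketch under assumptions: the vocabulary `toy_vocab` is made up, the function `wordpiece_split` is hypothetical, and the real tokeniser also handles punctuation splitting, normalisation and special tokens, none of which are shown here.

```python
def wordpiece_split(word, vocab, unk_token='[UNK]'):
    """Split one word into sub-tokens by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# toy vocabulary (an assumption, far smaller than the real one)
toy_vocab = {'hello', 'world', 'fun', 'token', '##isation', '!'}
print(wordpiece_split('tokenisation', toy_vocab))
print(wordpiece_split('fun', toy_vocab))
```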

# --------------------------------------------------------------------------------------------------------------------------

# Analyse the Distribution of Sequence Length Across the Documents

The BERT models that have been made available to the public have only been trained up to a length of 512 subword units as memory requirements increase quadratically with the sequence length.

In this section, I set out to analyse the number of tokens in each of the documents to get a feel for how the documents are distributed based on their length. This will enable me to see how much information would be lost under the 512 subword unit restriction and how to proceed with implementing the BERT model based on this information.

### Set the bin width of this distribution

The below bin width parameter specifies the width of the document length bins to put the documents into. In our case, we will specify **256** as this parameter value as this is half of the maximum sequence length of the model and will enable us to get a better feel for the document length distributions.

In [17]:
bin_width = 256
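
As a rough illustration of how a bin width groups documents by length, the sketch below bins a handful of made-up lengths using integer division. The function `bin_document_lengths` and the values in `toy_lengths` are hypothetical; the function actually used in this notebook lives in `Bert_functions` and may work differently.

```python
from collections import Counter

def bin_document_lengths(lengths, bin_width):
    """Map each bin's (low, high) token range to the number of documents in it."""
    counts = Counter(length // bin_width for length in lengths)
    return {
        (index * bin_width, (index + 1) * bin_width - 1): counts[index]
        for index in sorted(counts)
    }

# made-up document lengths in sub-word tokens
toy_lengths = [120, 300, 511, 512, 926, 1500]
for (low, high), count in bin_document_lengths(toy_lengths, 256).items():
    print(f'{low} -> {high}: {count}')
```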

### Create a table of the documents distribution

In [18]:
distribution, max_length_bin = bf.get_distribution_of_document_lengths(data_dict, tokeniser, bin_width)

Token indices sequence length is longer than the specified maximum sequence length for this model (926 > 512). Running this sequence through the model will result in indexing errors


In [19]:
bf.print_doc_breakdown_of_bins(distribution, bin_width, max_length_bin)

|    | Bin_length   |   pos |   neg |   total |
|---:|:-------------|------:|------:|--------:|
|  0 | 0 -> 255     |     7 |    16 |      23 |
|  1 | 256 -> 511   |   133 |   152 |     285 |
|  2 | 512 -> 767   |   284 |   339 |     623 |
|  3 | 768 -> 1023  |   277 |   288 |     565 |
|  4 | 1024 -> 1279 |   155 |   125 |     280 |
|  5 | 1280 -> 1535 |    78 |    46 |     124 |
|  6 | 1536 -> 1791 |    32 |    19 |      51 |
|  7 | 1792 -> 2047 |    18 |     8 |      26 |


As you can see from the above distribution table, only **15.5%** of the documents have fewer than 512 tokens. Because of this, together with the constraints of the model, we will not be able to consider each document as a whole and will instead have to use a subset of the document. This subset, which can contain at most 512 tokens, limits the amount of information that is available to us from each document.

# --------------------------------------------------------------------------------------------------------------------------

# Create Training-Test Splits for Cross-Validation

This is where we split the loaded dataset into train and test splits. In this dataset, there are 1000 positive documents and 1000 negative documents and we must split these 2000 documents into 10 different cross validation splits. This gives an 1800-200 train-test split for each of the 10 cross validation folds.

In [20]:
train_test_splits = ld.get_train_test_splits(data_dict)
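
The idea behind these splits can be sketched as follows: each fold in turn serves as the test set, and the documents from all other folds form the training set. The function `make_cv_splits`, the `(train, test)` tuple layout and the toy data are assumptions for illustration and are not taken from `Load_data_functions`.

```python
def make_cv_splits(fold_label_to_docs):
    """Build one (train, test) pair per fold; each item is a (document, label) tuple."""
    folds = sorted({fold for fold, _ in fold_label_to_docs})
    splits = []
    for test_fold in folds:
        train, test = [], []
        for (fold, label), documents in fold_label_to_docs.items():
            # documents from the held-out fold go to test, all others to train
            target = test if fold == test_fold else train
            target.extend((doc, label) for doc in documents)
        splits.append((train, test))
    return splits

# toy data: three folds with one positive and one negative document each
toy_folds = {
    (0, 'pos'): ['p0'], (0, 'neg'): ['n0'],
    (1, 'pos'): ['p1'], (1, 'neg'): ['n1'],
    (2, 'pos'): ['p2'], (2, 'neg'): ['n2'],
}
for train, test in make_cv_splits(toy_folds):
    print(len(train), len(test))
```

With the real data (10 folds of 200 documents each), the same logic would give 1800 training and 200 test documents per fold.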

### Show the splits

We can visualise the number of documents in each cross validation fold.

In [None]:
ld.count_docs_in_train_test_split(train_test_splits)

# --------------------------------------------------------------------------------------------------------------------------

# Test the Slicing of Documents to the specified BERT Sequence Length

Here we will test how the documents are distributed when we subset them to the maximum specified sequence length, using one of the cross validation folds.

### Set the maximum sequence length to put into our BERT model

This parameter specifies the maximum sequence length that the documents will be subset to before they are fed into the BERT model. This maximum length cannot exceed 512, as the BERT model cannot deal with longer sequences, but it can be set to any value up to 512.

This parameter's value is usually set to a power of 2. Each increase in this value comes with a significant increase in the resource requirements to run the code and, as a result, its value may need to be restricted to reflect the GPU you have available. The fine-tuning of BERT is where the process becomes the most memory intensive.

As I have Google Colab Pro, I figured that I would be able to run my BERT model using the full sequence length of 512. However, when I ran the code using this maximum sequence length, I ran into the following error:

    RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity; 14.79 GiB already allocated; 215.75 MiB free; 14.81 GiB reserved in total by PyTorch)

This told me that the GPU I had access to at that time did not have enough memory to use all 512 tokens. For this reason, I have chosen to restrict the model to a maximum sequence length of **256**.

In [21]:
max_sequence_length = 256

### Select which cross validation fold to use in this test

It makes sense to just use the first cross validation fold in this test, however, this parameter may be experimented with to see the results across the other folds.

In [22]:
cv_num = 0
cv_0_training_data = train_test_splits[cv_num][0]

### Specify how the sequence will be distributed across the start & end of the documents

If we specify 512 tokens as our maximum specified length, there are a number of ways this subset of 512 tokens can be selected from the documents, these are:
1. Using the first 512 tokens in the document
2. Using the last 512 tokens in the document
3. Combining a mix of the start tokens and the end tokens to form one list of tokens of length 512

We can actually specify how we want these tokens to be selected from the documents using the below *start_end_fraction* parameter. This parameter specifies what proportion of the tokens in the final sequence to take from the start, with the rest of the tokens being taken from the end of the sequence.

Eg. if:

    start_end_fraction = 1   -->  we take all tokens from the start of the document
    start_end_fraction = 0.5 -->  the tokens selected are split evenly between the start and the end
    start_end_fraction = 0   -->  we take all tokens from the end of the document
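
The effect of this parameter can be sketched at the level of a flat token list. This is a simplified illustration only: the function `slice_tokens` is hypothetical, it ignores sentence boundaries and the [CLS] and [SEP] tokens, and the slicer in `Bert_functions` may select tokens differently.

```python
def slice_tokens(tokens, max_length, start_end_fraction):
    """Keep at most max_length tokens, split between the start and end of the document."""
    if len(tokens) <= max_length:
        return list(tokens)  # short documents are kept whole
    n_start = int(round(max_length * start_end_fraction))
    n_end = max_length - n_start
    # tokens[-0:] would return the whole list, so handle n_end == 0 explicitly
    tail = tokens[-n_end:] if n_end > 0 else []
    return list(tokens[:n_start]) + list(tail)

toy_tokens = list(range(20))              # stand-in for a 20-token document
print(slice_tokens(toy_tokens, 8, 1))     # all 8 tokens from the start
print(slice_tokens(toy_tokens, 8, 0.25))  # 2 tokens from the start, 6 from the end
print(slice_tokens(toy_tokens, 8, 0))     # all 8 tokens from the end
```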

In [23]:
start_end_fraction = 0.25

### Create a summary table and a distribution table of the documents after they are sliced

In [27]:
sliced_doc_summary_df, length_to_count_map = bf.test_document_slicer_on_train_test_split(cv_0_training_data, tokeniser, start_end_fraction, max_sequence_length)

There are 1800 training documents in this cross validation fold


#### Output the sliced document summary

This shows how the tokens in the first 10 documents were distributed according to the above *start_end_fraction* parameter.

In [28]:
sliced_doc_summary_df

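|    |   doc_idx |   seq_len |   num_tokens_from_start |   num_tokens_from_end |   total_tokens |
|---:|----------:|----------:|------------------------:|----------------------:|---------------:|
|  0 |         0 |       256 |                      59 |                   175 |            234 |
|  1 |         1 |       256 |                      59 |                   159 |            218 |
|  2 |         2 |       256 |                      50 |                   167 |            217 |
|  3 |         3 |       256 |                      53 |                   178 |            231 |
|  4 |         4 |       256 |                      46 |                   177 |            223 |
|  5 |         5 |       256 |                      50 |                   150 |            200 |
|  6 |         6 |       256 |                      61 |                   148 |            209 |
|  7 |         7 |       256 |                      58 |                   171 |            229 |
|  8 |         8 |       256 |                      51 |                   174 |            225 |
|  9 |         9 |       256 |                      48 |                   146 |            194 |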


#### Output the distribution of each document sequence length

This shows the number of documents with each document length once this document slicing has been completed.
We should see the majority of documents having a length equal to the specified *'max_sequence_length'* parameter.

In [29]:
print('Frequency of each sequence length (document length --> # docs):')
for length in sorted(list(length_to_count_map.keys())):
    print(length, "-->", length_to_count_map[length])

Frequency of each sequence length (document length --> # docs):
53 --> 1
243 --> 1
254 --> 5
255 --> 35
256 --> 1758


# --------------------------------------------------------------------------------------------------------------------------

# Train a BERT classifier on each CV fold

Now that I have tested how the document sequences are selected, it is time to see how the BERT model performs when trained on each of the cross validation folds.
As this is how the Multinomial Naive Bayes model and the Logistic Regression model were evaluated, an independent BERT model will be trained on each of the cross validation folds to produce a model accuracy score. The accuracy scores from each of these models will then be averaged to obtain an overall accuracy score for the BERT model. We can then accurately compare these accuracy scores across the different models.

### Set the hyperparameters used in the BERT classifier

There are a number of hyperparameters that we need to set when running this model. These parameters are used to configure the model in a way that best suits the data we are using and our application. Some of these parameters define important things like the model learning rate and batch size.

These parameters must be set carefully as they directly affect the model's performance and the computational requirements needed to run it. In the case of the batch size, increases in this parameter result in a linear increase in the memory requirements of the model.

While ideally we would like a batch size of 16 or 32 for efficient training, I have decided to select a batch size of 10 as this only requires 12GB of GPU RAM. This is important as the GPU available only has 15GB of RAM.
Increasing to batch_size = 16 gives:

    RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 15.90 GiB total capacity; 14.62 GiB already allocated; 87.75 MiB free; 14.94 GiB reserved in total by PyTorch)

In [30]:
classifier_hyperparams = {# Encoder specific learning rate
                          "encoder_learning_rate": 1e-05,

                          # Classification head learning rate
                          "learning_rate": 3e-05,

                          # Number of epochs we want to keep the encoder model frozen
                          "nr_frozen_epochs": 3,

                          # How many subprocesses to use for data loading - 0 means data is loaded in the main process
                          "loader_workers": 4,

                          # Number of GPUs you have available
                          "gpus": 1,

                          # Number of documents used in each iteration of the model
                          "batch_size": 10,
                         }

### Set the other parameters needed to train the model

There are other parameters we can specify when training the model. These parameters are more general than the above hyperparameters.

For instance, the *max_epochs* and the *model_patience* parameters control how many iterations the model does over the data. These values directly affect how long the model takes to train: the higher they are, the more iterations are run and hence the more time the model takes.
In this model, I have chosen to set the maximum number of epochs to **10**. This keeps the training time down while still giving the BERT model a reasonable number of iterations to become accurate.

The *start_end_fraction* parameter is also specified here again. This parameter details how the tokens are extracted from the documents. This can have big implications for the model as we want to select a split that retains the majority of the sentiment details.
Here, I have set this parameter to favour the end of the document. The reason for this is that I felt the majority of a document's sentiment would come at the end of the review, as the writer summed up their feelings.

In [73]:
model_params = {# Number of iterations the model does over the dataset
                "max_epochs": 10,

                # proportion of the tokens in the final sequence to take from the start of the document
                "start_end_fraction": 0.0,  # set to 0.0001 to duplicate short documents

                # The Pre-Processing Batch size
                "preproc_batch_size": 8,

                # Maximum number of epochs to allow the model to have without accuracy improvement
                "model_patience": 5,

                # Minimum change in accuracy the model should have
                "min_early_stopping_delta": 0.0,
               }
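
To show how *max_epochs*, *model_patience* and *min_early_stopping_delta* interact, here is a minimal sketch of patience-based early stopping on a made-up validation accuracy history. The function `epochs_run_with_early_stopping` and the values in `toy_history` are hypothetical, and the early stopping callback actually used by the trainer may differ in its details.

```python
def epochs_run_with_early_stopping(val_accuracies, patience, min_delta=0.0):
    """Return how many epochs run before patience is exhausted (toy logic)."""
    best = float('-inf')
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies, start=1):
        if accuracy > best + min_delta:
            # a new best score resets the patience counter
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    # patience never ran out, so training ends at the maximum number of epochs
    return len(val_accuracies)

# made-up validation accuracies for max_epochs = 10
toy_history = [0.70, 0.80, 0.79, 0.80, 0.78, 0.79, 0.80, 0.81, 0.82, 0.83]
print(epochs_run_with_early_stopping(toy_history, patience=5))
print(epochs_run_with_early_stopping(toy_history, patience=2))
```

In this toy history the model would have improved again from epoch 8 onwards, which shows the trade-off: a small patience saves time but can stop training before a later improvement.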

### Put the predefined variables in a dictionary to be used when training the classifiers

These are parameters that I have defined in the above code for various reasons that are also needed to define the classification model.

In [None]:
predefined_variables = {"model_name": model_name,
                        "tokeniser": tokeniser,
                        "max_sequence_length": max_sequence_length,
                       }

### Define a BERT classifier and a trainer for each CV fold using the above parameters

I have stored all of the above parameters in dictionaries as this enables me to pass only a few arguments to my function when defining the classification models. These values can then be accessed and unpacked inside this function.

This function iterates through each cross-validation fold and creates a classifier and a model trainer for each of these. It also creates a model callback for each cross-validation which stores the values for the best model trained across all the epochs.

In [84]:
classifiers_list, trainers_list, modeL_callbacks_list = bf.define_classifier_and_trainer_for_each_cv_fold(train_test_splits, classifier_hyperparams, model_params, predefined_variables)

Defining the classifiers for each CV fold:


  0%|          | 0/10 [00:00<?, ?it/s]

Defining the trainers for each CV fold:


  0%|          | 0/10 [00:00<?, ?it/s]

MisconfigurationException: You requested GPUs: [0]
 But your machine only has: []

### Fit the defined classifiers to the data

Once these classifiers and trainers have been defined, it is time to train the models using them. Each item in these lists corresponds to one of the cross-validation folds, so we can iterate through these lists to get a few evaluation metrics for each fold.

Due to the way this model is set up, we have a validation accuracy along with a test accuracy score. The validation accuracy score is recorded during training while the test accuracy is obtained through a separate test after the model is trained. We will record both of these metrics during the training.

In [None]:
# set up the dataframe we output with the evaluation summary statistics
fold_eval_df = pd.DataFrame(columns=['fold_no', 'fold_time', 'fold_val_accuracy', 'fold_test_accuracy', 'fold_test_loss'], index=range(len(train_test_splits)))

fold_val_accuracies, fold_test_accuracies, fold_test_loss = [], [], []
cv_best_model_paths = []
eval_start = time.time()
for i, (classifier, trainer, save_top_model_callback) in tqdm(enumerate(zip(classifiers_list, trainers_list, modeL_callbacks_list)), total=len(classifiers_list)):
    iteration_start = time.time()

    # fit the model to the data
    trainer.fit(classifier, classifier.data)

    # get the time this training took
    iteration_duration = time.time() - iteration_start

    # get the best validation accuracy recorded during training
    best_val_accuracy = save_top_model_callback.best_model_score.item()
    fold_val_accuracies.append(best_val_accuracy)

    # test the model and get its test accuracy & test loss
    test_results = trainer.test(verbose=False)
    print(test_results)
    test_accuracy = test_results[0]["test_acc"]
    test_loss = test_results[0]["test_loss"]
    fold_test_accuracies.append(test_accuracy)
    fold_test_loss.append(test_loss)

    # add the results from this fold to the fold evaluation dataframe
    fold_eval_df.loc[i, 'fold_no'] = i + 1
    fold_eval_df.loc[i, 'fold_time'] = iteration_duration
    fold_eval_df.loc[i, 'fold_val_accuracy'] = best_val_accuracy
    fold_eval_df.loc[i, 'fold_test_accuracy'] = test_accuracy
    fold_eval_df.loc[i, 'fold_test_loss'] = test_loss

    # save the path to the best model
    cv_best_model_paths.append(save_top_model_callback.best_model_path)

    # collect the garbage & empty the cuda memory
    gc.collect()
    torch.cuda.empty_cache()

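### Analyse these models' results

As explained above, the validation accuracy is recorded during training, while the test accuracy and test loss are measured after training. The cells below plot and summarise these metrics across the folds.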

In [None]:
# plot the fold evaluation scores
bf.plot_fold_eval_scores(fold_eval_df)

In [None]:
# create the evaluation summary statistics for this model
n_test = float(len(fold_test_accuracies))
avg_test = sum(fold_test_accuracies) / n_test
variance_test = sum([(x-avg_test)**2 for x in fold_test_accuracies]) / n_test
n_val = float(len(fold_val_accuracies))
avg_val = sum(fold_val_accuracies) / n_val
variance_val = sum([(x-avg_val)**2 for x in fold_val_accuracies]) / n_val
eval_duration = time.time() - eval_start
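
As a cross-check on the summary statistics, the standard library's `statistics` module gives the same mean and population standard deviation. The sketch below uses made-up accuracies in `toy_accuracies`, not results from this notebook, and also shows the sample standard deviation, which divides by n - 1 and is therefore slightly larger.

```python
import statistics

# made-up fold accuracies, not results from this notebook
toy_accuracies = [0.85, 0.90, 0.88, 0.87, 0.90]

mean = statistics.mean(toy_accuracies)
population_std = statistics.pstdev(toy_accuracies)  # divides by n
sample_std = statistics.stdev(toy_accuracies)       # divides by n - 1

# the same population variance computed by hand
manual_variance = sum((x - mean) ** 2 for x in toy_accuracies) / len(toy_accuracies)
print(mean, population_std, sample_std, manual_variance ** 0.5)
```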

In [None]:
# create a dataframe with one row summarising the model evaluation
eval_values = {'Full Name': model_name,
               'Avg Test Accuracy': avg_test,
               'Avg Val Accuracy': avg_val,
               'Test Accuracy Std Dev': variance_test**0.5,
               'Val Accuracy Std Dev': variance_val**0.5,
               'Min Test Accuracy': min(fold_test_accuracies),
               'Max Test Accuracy': max(fold_test_accuracies),
               'Min Val Accuracy': min(fold_val_accuracies),
               'Max Val Accuracy': max(fold_val_accuracies),
               'Total Time (s)': round(eval_duration, 2),
               'All Fold Test Accuracies': str(fold_test_accuracies),
               'All Fold Val Accuracies': str(fold_val_accuracies),
              }
full_eval_df = pd.DataFrame(eval_values, index=[0])
full_eval_df

In [None]:
cv_best_model_paths

# --------------------------------------------------------------------------------------------------------------------------

# Save Best Model outside Logs

Rather than manually locating the best model in the lightning logs folder and copying it to another location, we use the Lightning library to save a copy. This also gives us the option to save a copy without the training state of the Adam optimiser (reducing model size by about 67%) and without the training parameters and filesystem paths that we may not want to share with users of the model.

In [None]:
for i, (classifier, trainer) in enumerate(zip(classifiers_list, trainers_list)):

    # After just having run test(), the best checkpoint is still loaded but that's not a documented feature
    # To be on the safe side for future versions we need to explicitly load the best checkpoint
    save_best_model(classifier, trainer, fold_num=i+1)

GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO:lightning:TPU available: None, using: 0 TPU cores


Ready


In [None]:
# https://pytorch-lightning.readthedocs.io/en/latest/common/weights_loading.html

# after just having run test(), the best checkpoint is still loaded but that's
# not a documented feature so to be on the safe side for future versions we
# need to explicitly load the best checkpoint:

best_model = bf.Classifier.load_from_checkpoint(checkpoint_path = trainer.checkpoint_callback.best_model_path
                                             # the hparams including hparams.batch_size appear to have been
                                             # saved in the checkpoint automatically
                                            )

# best_model.save_checkpoint('best.ckpt') does not exist
# --> need to wrap model into trainer to be able to save a checkpoint

new_trainer = pl.Trainer(resume_from_checkpoint=trainer.checkpoint_callback.best_model_path,
                         gpus = -1,  # avoid warnings (-1 = automatic selection)
                         # https://github.com/PyTorchLightning/pytorch-lightning/issues/6690
                         logger = pl.loggers.TensorBoardLogger(os.path.abspath('lightning_logs')),
                        )

new_trainer.model = best_model  # @model.setter in plugins/training_type/training_type_plugin.py

#new_trainer.save_checkpoint("best-model.ckpt")  # contains absolute paths and training parameters

new_trainer.save_checkpoint("best-model-weights-only.ckpt", True,  # save_weights_only
                           )

# to just save the bert model without the classification head, we could follow
# https://github.com/PyTorchLightning/pytorch-lightning/issues/3096#issuecomment-686877242
# (save_pretrained() writes a directory in Hugging Face format, so despite the
# '.pt' suffix, 'best-bert-encoder.pt' will be a directory)
best_model.bert.save_pretrained('best-bert-encoder.pt')

# Since a LightningModule is a subclass of torch.nn.Module, we can save the
# weights of the full network in pytorch format:
torch.save(best_model.state_dict(), 'best-model.pt')

print('Ready')

Note: The `.ckpt` files are zip files containing a [pickle](https://docs.python.org/3/library/pickle.html) file, version information and various binary files, presumably the raw tensor data.
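
To inspect such a file, Python's standard `zipfile` module is enough. The helper below is a minimal sketch with a hypothetical name. It assumes the checkpoint was written in the zip-based format and returns `None` otherwise.

```python
import zipfile

def list_checkpoint_members(path):
    """Return the member names of a zip-format checkpoint, or None if `path` is not a zip file."""
    if not zipfile.is_zipfile(path):
        return None  # e.g. a checkpoint written in an older, non-zip format
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()

# usage, assuming the file written in the cell above exists in the working directory:
# print(list_checkpoint_members('best-model-weights-only.ckpt'))
```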

# --------------------------------------------------------------------------------------------------------------------------

## Load a Model and Test Again

In [None]:
best_model = Classifier.load_from_checkpoint(checkpoint_path='best-model-weights-only.ckpt')

best_model.eval()  # enter prediction mode, e.g. turn off dropout

print(best_model.data.data_split)  # confirm the data is not saved

test_dataloader = DataLoader(dataset     = SlicedDocuments(raw_data                    = train_test_splits[xval_run][-1], #test
                                                           tokeniser                   = tokeniser,
                                                           fraction_for_first_sequence = 0.0,
                                                           max_sequence_length         = max_sequence_length,
                                                           second_part_as_sequence_B   = False,
                                                           preproc_batch_size          = 8
                                                          ),
                             batch_size  = best_model.hparams.batch_size,
                             collate_fn  = best_model.prepare_sample,
                             num_workers = best_model.hparams.loader_workers,
                            )

print('number of batches:', len(test_dataloader))

new_trainer = pl.Trainer(gpus = -1,
                         # https://github.com/PyTorchLightning/pytorch-lightning/issues/6690
                         logger = pl.loggers.TensorBoardLogger(os.path.abspath('lightning_logs')),
                        )

if best_model.tokenizer is None:
    print('setting tokeniser')
    best_model.tokenizer = tokeniser

print(new_trainer.test(best_model, test_dataloaders=[test_dataloader]))

(None, None, None)


GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO:lightning:TPU available: None, using: 0 TPU cores


number of batches: 20
setting tokeniser


Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.8949999213218689, 'test_loss': 0.21454036235809326}
--------------------------------------------------------------------------------
[{'test_loss': 0.21454036235809326, 'test_acc': 0.8949999213218689}]


# --------------------------------------------------------------------------------------------------------------------------

## Make Predictions

Pytorch_lightning does not seem to provide functionality to re-use the above code for making predictions. The example code from their website directly calls the `forward()` function of the model, assuming that the inputs of the test items are ready in a suitable batch. For a small test set that does not exceed the batch size, we can manually create such a batch as follows.

### Small Test Sets

In [None]:
# reminder: In our dataset, documents are lists of sentences
# and each sentence is a list of words and punctuation

mini_test_set  = [# document 1
                  ([['This', 'movie', 'is', 'great', '.'], ['So', 'much', 'fun', '.']], 'pos'),
                  # document 2
                  ([['What', 'a', 'waste', 'of', 'time', '.'], ['Never', 'seen', 'anything', 'this', 'bad', '.']], 'neg'),
                 ]

dataset = SlicedDocuments(# subclass of torch.utils.data.Dataset
                          mini_test_set,
                          preproc_batch_size = 8,
                          # the following should match the trained model
                          tokeniser = tokeniser,
                          fraction_for_first_sequence = 0.0,  
                          max_sequence_length = max_sequence_length,
                          second_part_as_sequence_B = False,
                         )

print('model device:', best_model.device)
print('number of documents:', len(dataset))

model device: cuda:0
number of documents: 2


In [None]:
import numpy

encoded_batch, gold_labels = best_model.prepare_sample(dataset)
print('number of items in batch:', len(encoded_batch))  # TODO: Why is this not len(dataset)?

best_model.eval()  # just in case (already called further above)

best_model.freeze()  # some examples call this before making predictions

# https://github.com/huggingface/transformers/issues/5111
encoded_batch.to(best_model.device)

model_out = best_model(encoded_batch)

# adjusted copy of code from predict()
logits = model_out["logits"]
logits = logits.detach().cpu().numpy()
predicted_labels = [best_model.data.label_encoder.index_to_token[prediction] for prediction in numpy.argmax(logits, axis=1)]

print('number of predictions:', len(predicted_labels))  # matches len(dataset)

for index, item in enumerate(dataset):
    print('[%d]' %index, item)
    print('prediction:', predicted_labels[index])
    
# the 'parts' list has two parts when second_part_as_sequence_B = True and fraction_for_first_sequence > 0.0

number of items in batch: 4
number of predictions: 2
[0] {'parts': [['This', 'movie', 'is', 'great', '.', 'So', 'much', 'fun', '.']], 'label': 'pos'}
prediction: pos
[1] {'parts': [['What', 'a', 'waste', 'of', 'time', '.', 'Never', 'seen', 'anything', 'this', 'bad', '.']], 'label': 'neg'}
prediction: neg
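
The decoding step in the cell above picks, for each row of logits, the index of the largest value and maps it to a label name. A minimal standalone sketch of that step follows, using plain Python lists in place of tensors. The logit values and the `index_to_label` list are hypothetical stand-ins for the model output and for `best_model.data.label_encoder.index_to_token`.

```python
def decode_predictions(logits, index_to_label):
    """Map each row of logits to the label with the highest score."""
    labels = []
    for row in logits:
        best_index = max(range(len(row)), key=lambda i: row[i])  # argmax
        labels.append(index_to_label[best_index])
    return labels

# hypothetical values, for illustration only
index_to_label = ['neg', 'pos']
example_logits = [[-1.2, 2.3], [1.7, -0.4]]
print(decode_predictions(example_logits, index_to_label))  # prints ['pos', 'neg']
```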


### Large Test Sets
For test sets that do not fit into a single batch, we extend the model's evaluation function to also record predictions in the metrics dictionary. We also keep a record of the inputs, because the test items may be distributed over multiple GPUs and their order may therefore change. We then only need to tokenise the test items again and fetch the predictions from the metrics dictionary.
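
The idea can be sketched independently of the model: store each prediction under a key derived from the input token IDs with padding removed, so that the lookup does not depend on the order in which items were processed. The function names below are hypothetical and are not the implementation behind `start_recording_predictions()`. The sketch assumes a padding ID of 0.

```python
def make_key(token_ids, pad_id=0):
    """Turn a padded token-ID sequence into a hashable key without trailing padding."""
    ids = list(token_ids)
    while ids and ids[-1] == pad_id:
        ids.pop()
    return tuple(ids)

def record_batch(seq2label, batch_token_ids, predicted_labels):
    """Store one prediction per input sequence in the seq2label dictionary."""
    for token_ids, label in zip(batch_token_ids, predicted_labels):
        seq2label[make_key(token_ids)] = label

# hypothetical token IDs; the second sequence is padded to the batch length
seq2label = {}
record_batch(seq2label, [[101, 7, 8, 9, 102], [101, 5, 102, 0, 0]], ['pos', 'neg'])

# the lookup still works if the same item is padded to a different length later
print(seq2label[make_key([101, 5, 102, 0, 0, 0, 0])])  # prints neg
```

One consequence of this design is that two documents with identical token sequences share a single entry, which is harmless as long as the model is deterministic in evaluation mode.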

In [86]:
# uses best_model, dataset and new_trainer from above

best_model.start_recording_predictions()

new_trainer.test(best_model,
                 test_dataloaders=[DataLoader(dataset     = dataset, # First test the functionality with a small test set
                                              batch_size  = best_model.hparams.batch_size,
                                              collate_fn  = best_model.prepare_sample,
                                              num_workers = best_model.hparams.loader_workers,
                                             )
                                  ]
                )

best_model.stop_recording_predictions()


In [None]:
print(best_model.seq2label)

{(101, 2023, 3185, 2003, 2307, 1012, 2061, 2172, 4569, 1012, 102): 'pos', (101, 2054, 1037, 5949, 1997, 2051, 1012, 2196, 2464, 2505, 2023, 2919, 1012, 102): 'neg'}


In [None]:
for index, item in enumerate(dataset):
    print('[%d]' %index, item)
    input_token_ids = best_model.prepare_sample([item])[0]['input_ids']
    key = input_token_ids.tolist()[0]
    
    # strip trailing padding (token ID 0)
    while key and key[-1] == 0:
        del key[-1]

    key = tuple(key)

    try:
        print('prediction:', best_model.seq2label[key])
    except KeyError:
        print('prediction not found')

[0] {'parts': [['This', 'movie', 'is', 'great', '.', 'So', 'much', 'fun', '.']], 'label': 'pos'}
prediction: pos
[1] {'parts': [['What', 'a', 'waste', 'of', 'time', '.', 'Never', 'seen', 'anything', 'this', 'bad', '.']], 'label': 'neg'}
prediction: neg


In [None]:
# now with a bigger dataset

best_model.start_recording_predictions()

xval_test_dataset = SlicedDocuments(raw_data                    = train_test_splits[xval_run][-1],  # test data
                                    tokeniser                   = tokeniser,
                                    fraction_for_first_sequence = 0.0,
                                    max_sequence_length         = max_sequence_length,
                                    second_part_as_sequence_B   = False,
                                    preproc_batch_size          = 8
                                   )

new_trainer.test(best_model,
                 test_dataloaders=[DataLoader(dataset     = xval_test_dataset,    
                                              batch_size  = best_model.hparams.batch_size,
                                              collate_fn  = best_model.prepare_sample,
                                              num_workers = best_model.hparams.loader_workers,
                                             )
                                  ]
                )

best_model.stop_recording_predictions()
print('Ready')

Testing: 0it [00:00, ?it/s]

--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': 0.8949999213218689, 'test_loss': 0.21454036235809326}
--------------------------------------------------------------------------------
Ready


In [None]:
prediction_df = bf.test_model_and_get_results(best_model, xval_test_dataset)
prediction_df