In this lab session, we will build a text classifier for sentiment analysis. The idea is to reuse a pre-trained BERT model and to fine-tune it with a small amount of data. 


# Introduction
The Hugging Face Python libraries provide a convenient API for working with transformer models such as BERT, ALBERT, and GPT. To quote their website: "Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. Transformers support framework interoperability between PyTorch, TensorFlow, and JAX."

Each model can have its own peculiarities in its inputs and outputs. Hugging Face provides an API that helps you write relatively generic code. When you reuse a transformer model, you have to take care of a few important points:

- What is required as input? A plain sequence of token indices is not enough in most cases.
- What is the output? Are you using a model that was only pre-trained with a self-supervised criterion such as masked language modeling, or are you relying on an already fine-tuned model?


The roadmap: 
- Load the data
- Use the tokenizer associated with your model
- Build dataloaders
- Fine tune BERT



In [None]:
!pip install transformers

In [None]:
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
from tabulate import tabulate
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification, DistilBertTokenizer, DistilBertModel
from tqdm.notebook import tqdm

import torch as th
import torch.nn as nn
from torch.utils.data import TensorDataset, Dataset, DataLoader, Subset,  RandomSampler
from google_drive_downloader import GoogleDriveDownloader as gdd
import os, gzip, pickle

# If the machine you run this on has a GPU available with CUDA installed,
# use it. Using a GPU for learning often leads to huge speedups in training.
# See https://developer.nvidia.com/cuda-downloads for installing CUDA
device = th.device('cuda' if th.cuda.is_available() else 'cpu')
device

## Download the training data

In [None]:
DATA_PATH = 'data/imdb_reviews.csv'
if not os.path.isfile(DATA_PATH):
    gdd.download_file_from_google_drive(
        file_id='1zfM5E6HvKIe7f3rEt1V2gBpw5QOSSKQz',
        dest_path=DATA_PATH,
    )

In [None]:
df = pd.read_csv(DATA_PATH)

In [None]:
print(df)


In [None]:
alltexts = df.review.values
alllabels = df.label.values


Looking at some input texts: 

In [None]:
print(alltexts[0])
print("---------- label: ",alllabels[0])

# Prepare data for BERT finetuning

The input format of BERT can look "over-specified", especially if you focus on a single type of task (sequence classification, word tagging, paraphrase detection, ...). It is designed to be generic:
- Add special tokens to the start and end of each sentence.
- Pad & truncate all sentences to a single constant length.
- Explicitly differentiate real tokens from padding tokens with the "attention mask".
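
The three steps above can be sketched without any library. This is only an illustration: the vocabulary and its ids are made up, and `build_inputs` is a hypothetical helper, not part of the tokenizer API.

```python
PAD, CLS, SEP = "[PAD]", "[CLS]", "[SEP]"
# Toy vocabulary with made-up ids (the real BERT vocabulary is much larger).
toy_vocab = {PAD: 0, CLS: 1, SEP: 2, "hello": 3, "my": 4, "name": 5, "is": 6, "kevin": 7}

def build_inputs(tokens, max_length):
    """Return (input_ids, attention_mask), both of length max_length."""
    # Keep two positions for [CLS] and [SEP], truncate the rest.
    tokens = tokens[: max_length - 2]
    tokens = [CLS] + tokens + [SEP]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(tokens)
    n_pad = max_length - len(tokens)
    tokens = tokens + [PAD] * n_pad
    attention_mask = attention_mask + [0] * n_pad
    input_ids = [toy_vocab[t] for t in tokens]
    return input_ids, attention_mask

ids, mask = build_inputs(["hello", "my", "name", "is", "kevin"], max_length=10)
print(ids)   # [1, 3, 4, 5, 6, 7, 2, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

The real tokenizer additionally splits unknown words into sub-word pieces, which this sketch leaves out.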

It looks like this: 

<img src="https://drive.google.com/uc?export=view&id=1cb5xeqLu_5vPOgs3eRnail2Y00Fl2pCo" width="600">

## Tokenizer 
If you do not want to build these inputs by hand, you can use the pre-trained tokenizer associated with BERT. Its `encode_plus` method will:
- Tokenize the sentence.
- Prepend the `[CLS]` token to the start.
- Append the `[SEP]` token to the end.
- Map tokens to their IDs.
- Pad or truncate the sentence to `max_length`.
- Create the attention mask that separates real tokens from `[PAD]` tokens.

In [None]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case = True
    )

In [None]:
# Some useful steps: 
message = "hello my name is Kevin"
tok = tokenizer.tokenize(message)
print("tokenized : ",tok)
enc = tokenizer.encode(tok)
print("encoded : ",enc)
enc2 = tokenizer.encode_plus(tok)
print("encoded plus : ",enc2)
dec = tokenizer.decode(enc)
print("decoded : ",dec)

In [None]:
def preprocessing(input_text, tokenizer):
  '''
  Returns <class transformers.tokenization_utils_base.BatchEncoding> with the following fields:
    - input_ids: list of token ids
    - token_type_ids: list of token type ids
    - attention_mask: list of 0/1 values specifying which tokens should be considered by the model (return_attention_mask = True).
  '''
  return tokenizer.encode_plus(
                        input_text,
                        add_special_tokens = True,
                        truncation=True,
                        max_length = 20,
                        padding = 'max_length',
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )


Note the important difference in the output between `encode` and `encode_plus`. The output of `encode_plus` is a Python dictionary-like object with everything you need for BERT.
- Can you explain the outputs?
- Do you know how to access each of them?

Now we can tokenize everything. 

In [None]:
token_id = []
attention_masks = []
ntrain = 5000
nvalid = 1000

ids = np.arange(len(alllabels))
np.random.shuffle(ids)
trainidx = ids[:ntrain]
valididx = ids[ntrain:ntrain+nvalid]
allidx = ids[:ntrain+nvalid]
for sent in tqdm(alltexts[allidx]):
    encoding_dict = preprocessing(sent, tokenizer)
    token_id.append(encoding_dict['input_ids']) 
    attention_masks.append(encoding_dict['attention_mask'])


token_id = th.cat(token_id, dim = 0)
attention_masks = th.cat(attention_masks, dim = 0)

print(token_id.shape, token_id[:ntrain].shape)
labels = th.tensor(alllabels[allidx])
print("\n", labels.sum().item(), "positive labels / ", labels.shape[0], "total")


## Dataset and Dataloader
Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. 

PyTorch provides two data primitives: `torch.utils.data.DataLoader` and `torch.utils.data.Dataset` that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
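
To make the two roles concrete, here is a minimal sketch of a hand-written `Dataset` wrapped in a `DataLoader`. The class name `ReviewDataset` and the random stand-in tensors are assumptions for illustration only; for tensors that are already stacked, `TensorDataset` gives the same behaviour with less code.

```python
import torch as th
from torch.utils.data import Dataset, DataLoader

class ReviewDataset(Dataset):
    """Hypothetical dataset holding pre-tokenized reviews."""
    def __init__(self, input_ids, attention_masks, labels):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.labels = labels

    def __len__(self):
        # Number of samples.
        return len(self.labels)

    def __getitem__(self, idx):
        # One sample: token ids, attention mask, label.
        return self.input_ids[idx], self.attention_masks[idx], self.labels[idx]

# Random stand-in data: 10 "reviews" of 20 token ids each.
toy_ids = th.randint(0, 1000, (10, 20))
toy_masks = th.ones(10, 20, dtype=th.long)
toy_labels = th.randint(0, 2, (10,))

toy_dataset = ReviewDataset(toy_ids, toy_masks, toy_labels)
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

# The loader stacks individual samples into batches.
batch_shapes = [tuple(ids.shape) for ids, masks, labels in toy_loader]
print(batch_shapes)
```

The last batch is smaller than `batch_size` whenever the dataset size is not a multiple of it.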

In [None]:
train_set = TensorDataset(token_id[:ntrain], 
                          attention_masks[:ntrain], 
                          labels[:ntrain])

valid_set = TensorDataset(token_id[ntrain:ntrain+nvalid], 
                          attention_masks[ntrain:ntrain+nvalid], 
                          labels[ntrain:ntrain+nvalid])


In [None]:
batch_size = 32
train_dataloader = DataLoader(
            train_set,
            sampler = RandomSampler(train_set),
            batch_size = batch_size
        )
valid_dataloader = DataLoader(
            valid_set,
            batch_size = batch_size
        )

# BERT fine-Tuning

For this task, we want to start from a pre-trained BERT model. If it was pre-trained as a masked language model (MLM), we need to modify the architecture to get outputs for classification. Then we want to resume the training process on our dataset (this is fine-tuning). The final goal is to get a model which is well-suited for our task. 

Thankfully, the Hugging Face PyTorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained BERT model, each has different top layers and output types designed to accommodate their specific NLP task.  

Here is the current list of classes provided for fine-tuning:
* BertModel
* BertForPreTraining
* BertForMaskedLM
* BertForNextSentencePrediction
* **BertForSequenceClassification** - The one we'll use.
* BertForTokenClassification
* BertForQuestionAnswering

The documentation for these can be found [here](https://huggingface.co/transformers/v4.7.0/model_doc/bert.html).


We'll be using [BertForSequenceClassification](https://huggingface.co/transformers/v4.7.0/model_doc/bert.html#bertforsequenceclassification). This is the normal BERT model with a single linear layer added on top for classification, which we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer are trained on our specific task. 

There are different pre-trained BERT models available. `bert-base-uncased` means the version that has only lowercase letters ("uncased") and is the smaller version ("base" vs "large").

The documentation for `from_pretrained` can be found [here](https://huggingface.co/transformers/v4.7.0/main_classes/model.html#transformers.PreTrainedModel.from_pretrained), with the additional parameters defined [here](https://huggingface.co/transformers/v4.7.0/main_classes/configuration.html#transformers.PretrainedConfig).

## Loading BERT

In [None]:
bmodel = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False,
)


print(bmodel)

This raises several questions: 
- What does the warning mean? 
- Why `num_labels=2`? 
- What about the other options? 
- What does the print show? 

Just for curiosity's sake, we can browse all of the model's parameters by name here.

The cell below prints the names and dimensions of the weights for:

- The embedding layer
- The first of the twelve transformer layers
- The output layer.



In [None]:
# Get all of the model's parameters as a list of tuples.
params = list(bmodel.named_parameters())

In [None]:
print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
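
The slices `[0:5]`, `[5:21]` and `[-4:]` used above follow from the architecture. As a cross-check, the number of named tensors and of individual weights can be recomputed by hand from the published `bert-base-uncased` configuration; the helper `linear` below is only an illustrative name.

```python
# bert-base-uncased configuration values.
vocab_size, max_positions, type_vocab = 30522, 512, 2
hidden, intermediate, n_layers, n_labels = 768, 3072, 12, 2

def linear(n_in, n_out):
    """Number of weights in a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
one_layer = (
    4 * linear(hidden, hidden)       # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)   # feed-forward, up-projection
    + linear(intermediate, hidden)   # feed-forward, down-projection
    + layer_norm
)
pooler = linear(hidden, hidden)
classifier = linear(hidden, n_labels)

total = embeddings + n_layers * one_layer + pooler + classifier
# 5 embedding tensors, 16 per transformer layer, 2 for the pooler, 2 for the classifier.
n_named = 5 + n_layers * 16 + 2 + 2
print(f"{total:,} weights in {n_named} named tensors")
# 109,483,778 weights in 201 named tensors
```

Only the `classifier` term (1,538 weights) is new; everything else comes from pre-training.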

## Test your bert 
We can already try the model on the validation set. Before that, look at the output of the model on one batch. 
- interpret the output !  
- do you understand everything ? 
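
As a hint for reading the output: the classification model returns one row of raw scores (logits) per example, with one column per class. The sketch below uses made-up logits and plain Python to show how such scores turn into probabilities, predictions, accuracy and cross-entropy loss; the numbers are not model outputs.

```python
import math

def softmax(row):
    """Turn a row of raw scores into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

# Made-up logits for a batch of three reviews (columns: class 0, class 1).
toy_logits = [[0.2, -0.1], [-1.5, 2.0], [0.0, 0.0]]
toy_labels = [0, 1, 1]

probs = [softmax(row) for row in toy_logits]
# Predicted class = index of the largest logit.
preds = [row.index(max(row)) for row in toy_logits]
accuracy = sum(p == y for p, y in zip(preds, toy_labels)) / len(toy_labels)
# Cross-entropy: negative log-probability of the correct class, averaged.
loss = -sum(math.log(p[y]) for p, y in zip(probs, toy_labels)) / len(toy_labels)

print(preds, round(accuracy, 3), round(loss, 3))
```

Note that equal logits, as in the third row, give a probability of 0.5 for each class.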


In [None]:
## TODO

In [None]:
def validation(model, dloader): 
    """Run the BERT model for text classification and compute 
    the loss as well as the accuracy"""
    ## TODO 


In [None]:
# and test the function 
validation(bmodel, valid_dataloader)

- What do you think about the result ? 
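
One possible shape for such a validation function is sketched below. To stay runnable without downloading BERT, it uses `StandInClassifier`, a hypothetical tiny model that only mimics the calling convention assumed here: `model(input_ids, attention_mask=..., labels=...)` returns an object with `.loss` and `.logits`. The name `validation_sketch` and the random data are also illustrative.

```python
from types import SimpleNamespace
import torch as th
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class StandInClassifier(nn.Module):
    """Hypothetical tiny model with the same call signature as the classifier."""
    def __init__(self, vocab_size=100, dim=8, num_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        # Average the embeddings of the non-padded positions.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (self.emb(input_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        logits = self.out(pooled)
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return SimpleNamespace(loss=loss, logits=logits)

def validation_sketch(model, dloader, device="cpu"):
    """Return (mean loss, accuracy) over a dataloader, without gradients."""
    model.to(device)
    model.eval()  # disable dropout
    total_loss, n_correct, n_seen = 0.0, 0, 0
    with th.no_grad():
        for input_ids, attention_mask, labels in dloader:
            input_ids, attention_mask, labels = (
                t.to(device) for t in (input_ids, attention_mask, labels)
            )
            out = model(input_ids, attention_mask=attention_mask, labels=labels)
            # Weight the batch loss by the batch size to get a true mean.
            total_loss += out.loss.item() * labels.size(0)
            n_correct += (out.logits.argmax(dim=1) == labels).sum().item()
            n_seen += labels.size(0)
    return total_loss / n_seen, n_correct / n_seen

toy_set = TensorDataset(
    th.randint(0, 100, (12, 5)),
    th.ones(12, 5, dtype=th.long),
    th.randint(0, 2, (12,)),
)
val_loss, val_acc = validation_sketch(StandInClassifier(), DataLoader(toy_set, batch_size=4))
print(val_loss, val_acc)
```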


## Fine-Tuning 

With our model loaded and ready, we need to choose the training hyperparameters. For fine-tuning, the authors recommend choosing from the following values (from Appendix A.3 of the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf)):

- **Batch size:** 16, 32  
- **Learning rate (Adam):** 5e-5, 3e-5, 2e-5  
- **Number of epochs:** 2, 3, 4 

We chose:
* Batch size: 32 (set when creating our DataLoaders)
* Learning rate: 5e-5
* Epochs: 5 (we'll see that this is probably too many...)
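
These choices fix the number of optimizer updates. The sketch below counts them for the 5000 training examples used earlier. It also shows a linear learning-rate decay, which is only an assumption here: the notebook does not prescribe a schedule, and `linear_schedule` is a hypothetical helper.

```python
import math

n_train, batch_size, n_epochs, base_lr = 5000, 32, 5, 5e-5

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs
print(steps_per_epoch, total_steps)  # 157 785

def linear_schedule(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at the start, in the middle and at the end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule(step, total_steps, base_lr))
```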

The epsilon parameter `eps = 1e-8` is "a very small number to prevent any division by zero in the implementation" (from [here](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)).
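
To see where `eps` enters, here is a sketch of a single Adam update for one scalar parameter, written in plain Python. The function `adam_step` is illustrative, and the decoupled weight decay that AdamW adds is left out.

```python
import math

def adam_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    # eps keeps the denominator non-zero when v_hat is 0 (e.g. a zero gradient).
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Zero gradient: without eps this would be 0 / 0.
p0, m0, v0 = adam_step(param=1.0, grad=0.0, m=0.0, v=0.0, t=1)
# Non-zero gradient: the first step has a size close to lr.
p1, m1, v1 = adam_step(param=1.0, grad=0.3, m=0.0, v=0.0, t=1)
print(p0, 1.0 - p1)
```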

You can find the creation of the AdamW optimizer in `run_glue.py` [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109).

In [None]:
## Fine tune BERT: 
## TODO 
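
A fine-tuning loop could look like the sketch below. It is one possible layout under stated assumptions, not a reference solution: `StandInClassifier` is a hypothetical tiny model that mimics the interface `model(input_ids, attention_mask=..., labels=...)` returning `.loss` and `.logits`, the data is random, and the gradient clipping at norm 1.0 is a common but optional choice.

```python
from types import SimpleNamespace
import torch as th
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class StandInClassifier(nn.Module):
    """Hypothetical tiny model with the same call signature as the classifier."""
    def __init__(self, vocab_size=100, dim=8, num_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (self.emb(input_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        logits = self.out(pooled)
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return SimpleNamespace(loss=loss, logits=logits)

def train_sketch(model, train_loader, n_epochs=2, lr=5e-5, device="cpu"):
    """Train the model and return the mean training loss of each epoch."""
    model.to(device)
    optimizer = th.optim.AdamW(model.parameters(), lr=lr, eps=1e-8)
    history = []
    for epoch in range(n_epochs):
        model.train()  # enable dropout
        running, n_seen = 0.0, 0
        for input_ids, attention_mask, labels in train_loader:
            input_ids, attention_mask, labels = (
                t.to(device) for t in (input_ids, attention_mask, labels)
            )
            optimizer.zero_grad()
            out = model(input_ids, attention_mask=attention_mask, labels=labels)
            out.loss.backward()
            # Assumption: clip gradients to limit the size of each update.
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            running += out.loss.item() * labels.size(0)
            n_seen += labels.size(0)
        history.append(running / n_seen)
    return history

toy_set = TensorDataset(
    th.randint(0, 100, (16, 5)),
    th.ones(16, 5, dtype=th.long),
    th.randint(0, 2, (16,)),
)
loss_history = train_sketch(StandInClassifier(), DataLoader(toy_set, batch_size=4, shuffle=True))
print(loss_history)
```

Calling a validation function after each epoch is a simple way to check how many epochs are actually useful.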

# DistilBERT 

In this part we will see another use-case: 
- We have a pretrained model, but not an architecture adapted to text classification. 
- As an illustration, we will use DistilBERT, another BERT-like model. It is lighter yet very efficient. [Here is the paper](https://arxiv.org/abs/1910.01108) that describes the distillation. 

The idea is to write a wrapper around DistilBERT. 

##  Tokenizer and dataloader

As before, you have to prepare the inputs that the model expects. There may be some differences in the inputs and outputs compared to BERT, but they are easy to handle. 
The roadmap:
- get the tokenizer
- process the data
- create the dataloader



In [None]:
## TODO


## The Wrapper

We can simply write a class that loads the pretrained DistilBERT model: 


In [None]:
class DistilBertClassifier(th.nn.Module):
    def __init__(self):
        super().__init__()
        self.distilbert = None  # TODO: load the pretrained DistilBERT model
        # TODO: add a Dropout layer and a Linear classifier

    def forward(self, ids, mask):
        # TODO
        raise NotImplementedError

In [None]:
# Create a model and test it on a batch 
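
The wrapper idea can be sketched independently of DistilBERT. In the block below, `StandInEncoder` is a hypothetical replacement for the pretrained encoder: it only reproduces the assumed interface, namely an output object with a `last_hidden_state` of shape `(batch, seq_len, dim)`. `EncoderClassifier` is likewise an illustrative name, and the single output logit is a design choice that pairs with a binary loss on logits.

```python
from types import SimpleNamespace
import torch as th
import torch.nn as nn

class StandInEncoder(nn.Module):
    """Hypothetical encoder returning an object with .last_hidden_state."""
    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, input_ids, attention_mask=None):
        # Shape: (batch, seq_len, dim). The mask is ignored by this stand-in.
        return SimpleNamespace(last_hidden_state=self.emb(input_ids))

class EncoderClassifier(nn.Module):
    """Wrap an encoder with dropout and a linear layer producing one logit."""
    def __init__(self, encoder, dim, p_drop=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(p_drop)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, ids, mask):
        hidden = self.encoder(input_ids=ids, attention_mask=mask).last_hidden_state
        first_vec = hidden[:, 0]  # vector at the first position ([CLS] for BERT-like models)
        return self.classifier(self.dropout(first_vec)).squeeze(-1)

toy_model = EncoderClassifier(StandInEncoder(), dim=16)
toy_ids = th.randint(0, 100, (4, 7))
toy_mask = th.ones(4, 7, dtype=th.long)
toy_logits = toy_model(toy_ids, toy_mask)

# One logit per example, compared against float 0/1 targets.
toy_loss = nn.BCEWithLogitsLoss()(toy_logits, th.tensor([1.0, 0.0, 1.0, 0.0]))
print(toy_logits.shape, toy_loss.item())
```

With the real model, the stand-in would be replaced by the pretrained DistilBERT encoder and `dim` by its hidden size.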



Since we have built the model almost by hand, we have to define the loss function ourselves. Write a function that computes the validation scores (loss and accuracy) with your model and a dataloader.

In [None]:
criterion = nn.BCEWithLogitsLoss()
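
This loss expects one raw score (logit) per example and float targets of the same shape, and it applies the sigmoid internally. The plain-Python sketch below shows the quantity it computes for a single example, in the numerically stable form and in the textbook form; both function names are illustrative.

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross-entropy for one raw score and a 0/1 target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Textbook form: sigmoid first, then cross-entropy. Breaks for large |logit|."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

# Both forms agree for moderate logits.
for x, y in [(0.0, 1.0), (2.0, 0.0), (-1.3, 1.0)]:
    print(x, y, bce_with_logits(x, y), naive_bce(x, y))

# The stable form still works for an extreme logit.
print(bce_with_logits(1000.0, 1.0))
```

In practice this means casting the integer labels to float and making sure the model output and the labels have matching shapes before calling the criterion.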


In [None]:
def validDB(model, loader):
  """Compute the validation loss and accuracy of the wrapper model."""
  ## TODO
  raise NotImplementedError


## Fine Tuning 

Let's go: train the model. 

In [None]:
## TODO 

