In this lab session, we will build a text classifier for sentiment analysis. The idea is to reuse a pre-trained BERT model and to fine-tune it with a small amount of data. 


# Introduction
The Hugging Face Python libraries provide a convenient API for working with transformer models such as BERT, ALBERT, GPT, and others. To quote their website: "Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. Transformers support framework interoperability between PyTorch, TensorFlow, and JAX."

Each model can have its own peculiarities in its inputs and outputs. Hugging Face proposes an API that helps you write relatively generic code. When you reuse a transformer model, you have to take care of some important points: 

- What is required as input? A plain sequence of token indices is not enough in most cases. 
- What is the output? Are you using a model that was only pre-trained with a self-supervised criterion such as masked language modeling, or are you relying on an already fine-tuned model? 


The roadmap: 
- Load the data
- Use the tokenizer associated with your model
- Build dataloaders
- Fine tune BERT



In [1]:
!pip install transformers



In [2]:
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
from tabulate import tabulate
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from transformers import BertTokenizer, BertForSequenceClassification, DistilBertTokenizer, DistilBertModel
from tqdm.notebook import tqdm

import torch as th
import torch.nn as nn
from torch.utils.data import TensorDataset, Dataset, DataLoader, Subset,  RandomSampler
from google_drive_downloader import GoogleDriveDownloader as gdd
import os, gzip, pickle

# If the machine you run this on has a GPU available with CUDA installed,
# use it. Using a GPU for learning often leads to huge speedups in training.
# See https://developer.nvidia.com/cuda-downloads for installing CUDA
device = th.device('cuda' if th.cuda.is_available() else 'cpu')
device

device(type='cpu')

## Download the training data

In [3]:
DATA_PATH = 'data/imdb_reviews.csv'
if not os.path.isfile(DATA_PATH):
    gdd.download_file_from_google_drive(
        file_id='1zfM5E6HvKIe7f3rEt1V2gBpw5QOSSKQz',
        dest_path=DATA_PATH,
    )

In [4]:
df = pd.read_csv(DATA_PATH)

In [5]:
print(df)


                                                  review  label
0      Once again Mr. Costner has dragged out a movie...      0
1      This is an example of why the majority of acti...      0
2      First of all I hate those moronic rappers, who...      0
3      Not even the Beatles could write songs everyon...      0
4      Brass pictures (movies is not a fitting word f...      0
...                                                  ...    ...
62150  I am a student of film, and have been for seve...      0
62151  Unimaginably stupid, redundant and humiliating...      0
62152  Guy is a loser. Can't get girls, needs to buil...      0
62153  This 30 minute documentary Buñuel made in the ...      0
62154  I saw this movie as a child and it broke my he...      1

[62155 rows x 2 columns]


In [6]:
alltexts = df.review.values
alllabels = df.label.values


Looking at some input texts: 

In [9]:
i = -1 
print(alltexts[i])
print("---------- label: ",alllabels[i])

I saw this movie as a child and it broke my heart! No other story had such a unfinished ending... I grew up on many great anime movies and this was one of my favourites, because it was so unusual - a story about unfairness, and cruelty, and loneliness, and life, and choices that can't be undone, and the need for others. Chirin is made alone when the Wolf kills his mother, but the Wolf is alone, too, when Chirin follows him into the mountain. The Wolf doesn't kill the lamb, even though each night he says \maybe I'll eat you tomorrow.\" The tape of it I have is broken and degraded from age and use. I will repair it and watch the movie again someday and cry just as hard as I did as a child. Stories like this, with this depth and feeling, and this intricacy of meaning, are very rare. It is a sad story, but I've never encountered any catharsis more beautifully made. I am glad I have seen this movie, and I'm glad I saw it as a child."
---------- label:  1


# Prepare data for BERT finetuning

The input format of BERT may look "over-specified", especially if you focus on a single type of task (sequence classification, word tagging, paraphrase detection, ...). The format is designed to be generic: 
- Add special tokens to the start and end of each sentence.
- Pad & truncate all sentences to a single constant length.
- Explicitly differentiate real tokens from padding tokens with the "attention mask".

It looks like this: 

<img src="https://drive.google.com/uc?export=view&id=1cb5xeqLu_5vPOgs3eRnail2Y00Fl2pCo" width="600">
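The three steps listed above can be sketched in plain Python, without any library. This is only an illustration of the format, not the tokenizer's actual implementation: `build_inputs` is a hypothetical helper, the word ids are made up, and only the special-token ids (`[PAD]` = 0, `[CLS]` = 101, `[SEP]` = 102) are those of `bert-base-uncased`.

```python
# Special-token ids of bert-base-uncased; the word ids used below are made up.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def build_inputs(token_ids, max_length):
    """Add [CLS]/[SEP], truncate, pad, and build the attention mask."""
    # Keep room for the two special tokens when truncating.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens (including the special ones) get a 1 in the mask.
    attention_mask = [1] * len(input_ids)
    # Padding positions get the [PAD] id and a 0 in the mask.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask

# A short sequence is padded, a long one is truncated.
short_ids, short_mask = build_inputs([7, 8, 9], max_length=8)
long_ids, long_mask = build_inputs(list(range(10, 30)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Both outputs have exactly `max_length` entries, which is what allows sentences of different lengths to be stacked into one tensor.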

## Tokenizer 
If you don't want to build this kind of input by hand, you can use the pre-trained tokenizer associated with BERT. The function `encode_plus` will:
- Tokenize the sentence.
- Prepend the `[CLS]` token to the start.
- Append the `[SEP]` token to the end.
- Map tokens to their IDs.
- Pad or truncate the sentence to `max_length`
- Create the attention mask that separates real tokens from `[PAD]` tokens.

In [10]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case = True
    )

In [18]:
## Some useful steps: 
message = "Hello my name is Kevin Costner from Washington"
tok = tokenizer.tokenize(message)
print("tokenized : ",tok)
enc = tokenizer.encode(tok)
print("encoded : ",enc)
enc2 = tokenizer.encode_plus(tok, max_length=100)
print("encoded plus : ",enc2)
dec = tokenizer.decode(enc)
print("decoded : ",dec)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


tokenized :  ['hello', 'my', 'name', 'is', 'kevin', 'cost', '##ner', 'from', 'washington']
encoded :  [101, 7592, 2026, 2171, 2003, 4901, 3465, 3678, 2013, 2899, 102]
encoded plus :  {'input_ids': [101, 7592, 2026, 2171, 2003, 4901, 3465, 3678, 2013, 2899, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
decoded :  [CLS] hello my name is kevin costner from washington [SEP]


In [20]:
def preprocessing(input_text, tokenizer):
  '''
  Returns <class transformers.tokenization_utils_base.BatchEncoding> with the following fields:
    - input_ids: list of token ids
    - token_type_ids: list of token type ids
    - attention_mask: list of 0/1 values specifying which tokens should be considered by the model (return_attention_mask = True).
  '''
  return tokenizer.encode_plus(
                        input_text,
                        add_special_tokens = True,
                        truncation=True,
                        max_length = 20,
                        padding = 'max_length',
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )


Note the important difference in the output between `encode` and `encode_plus`. For `encode_plus`, the output has the form of a Python dictionary with everything you need for BERT.  
- can you explain the outputs? 
- do you know how to access each of them? 
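As a hint for the second question, here is a small sketch of key-based access. To keep it runnable without downloading a tokenizer, a plain dictionary stands in for the `BatchEncoding` object; its values are copied from the `encoded plus` printout above. The variable names are illustrative.

```python
# Stand-in for the BatchEncoding printed earlier (same keys, same values).
enc2 = {
    "input_ids": [101, 7592, 2026, 2171, 2003, 4901, 3465, 3678, 2013, 2899, 102],
    "token_type_ids": [0] * 11,
    "attention_mask": [1] * 11,
}

# Each field is accessed by its key, like in a regular dictionary.
input_ids = enc2["input_ids"]
attention_mask = enc2["attention_mask"]
token_type_ids = enc2["token_type_ids"]

# The attention mask counts the real (non-padding) tokens.
n_real_tokens = sum(attention_mask)

print(sorted(enc2.keys()))
print("first id:", input_ids[0], "last id:", input_ids[-1])
print("real tokens:", n_real_tokens)
```

Here the first and last ids are the `[CLS]` and `[SEP]` tokens, there is no padding because the printed example was not padded, and `token_type_ids` is all zeros because a single sentence was encoded.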

Now we can tokenize everything. 

In [21]:
token_id = []
attention_masks = []
ntrain = 5000
nvalid = 1000

ids = np.arange(len(alllabels))
np.random.shuffle(ids)
trainidx = ids[:ntrain]
valididx = ids[ntrain:ntrain+nvalid]
allidx = ids[:ntrain+nvalid]
for sent in tqdm(alltexts[allidx]):
    encoding_dict = preprocessing(sent, tokenizer)
    token_id.append(encoding_dict['input_ids']) 
    attention_masks.append(encoding_dict['attention_mask'])


token_id = th.cat(token_id, dim = 0)
attention_masks = th.cat(attention_masks, dim = 0)

print(token_id.shape, token_id[:ntrain].shape)
labels = th.tensor(alllabels[allidx])
print("\n", labels.sum().item(), "positive labels / ", labels.shape[0], "total")


  0%|          | 0/6000 [00:00<?, ?it/s]

torch.Size([6000, 20]) torch.Size([5000, 20])

 2980 positive labels /  6000 total


## Dataset and Dataloader
Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. 

PyTorch provides two data primitives: `torch.utils.data.DataLoader` and `torch.utils.data.Dataset` that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
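A minimal sketch of these two primitives on toy data, assuming PyTorch is installed as in the rest of the notebook. The tensors are made up and much smaller than the real ones; only the pattern (three aligned tensors wrapped in a `TensorDataset`, then batched by a `DataLoader`) matches what is done below.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data: 10 "sentences" of 4 token ids each, with a mask and a binary label.
toy_ids = torch.arange(40).reshape(10, 4)
toy_masks = torch.ones(10, 4, dtype=torch.long)
toy_labels = torch.tensor([0, 1] * 5)

# The dataset stores the samples: item i is the tuple (ids[i], mask[i], label[i]).
toy_set = TensorDataset(toy_ids, toy_masks, toy_labels)

# The dataloader iterates over the dataset in mini-batches.
toy_loader = DataLoader(toy_set, batch_size=4, shuffle=False)

batch_sizes = []
for b_ids, b_masks, b_labels in toy_loader:
    batch_sizes.append(b_labels.size(0))
    print(b_ids.shape, b_masks.shape, b_labels.shape)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others; code that accumulates statistics per batch should therefore count examples rather than batches.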

In [22]:
train_set = TensorDataset(token_id[:ntrain], 
                          attention_masks[:ntrain], 
                          labels[:ntrain])

valid_set = TensorDataset(token_id[ntrain:ntrain+nvalid], 
                          attention_masks[ntrain:ntrain+nvalid], 
                          labels[ntrain:ntrain+nvalid])


In [23]:
batch_size = 32
train_dataloader = DataLoader(
            train_set,
            sampler = RandomSampler(train_set),
            batch_size = batch_size
        )
valid_dataloader = DataLoader(
            valid_set,
            batch_size = batch_size
        )

# BERT fine-Tuning

For this task, we want to start from a pre-trained BERT model. If it was pre-trained as a masked language model (MLM), we need to modify the architecture to get outputs for classification. Then we want to resume the training process on our dataset (this is fine-tuning). The final goal is to get a model that is well suited to our task. 

Thankfully, the Hugging Face PyTorch implementation includes a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained BERT model, each has different top layers and output types designed to accommodate its specific NLP task.  

Here is the current list of classes provided for fine-tuning:
* BertModel
* BertForPreTraining
* BertForMaskedLM
* BertForNextSentencePrediction
* **BertForSequenceClassification** - The one we'll use.
* BertForTokenClassification
* BertForQuestionAnswering

The documentation for these can be found [here](https://huggingface.co/transformers/v4.7.0/model_doc/bert.html).


We'll be using [BertForSequenceClassification](https://huggingface.co/transformers/v4.7.0/model_doc/bert.html#bertforsequenceclassification). This is the normal BERT model with a single linear layer added on top, which we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer are trained together on our specific task. 

There are different pre-trained BERT models available. `bert-base-uncased` means the version that has only lowercase letters ("uncased") and is the smaller version ("base" vs "large").

The documentation for `from_pretrained` can be found [here](https://huggingface.co/transformers/v4.7.0/main_classes/model.html#transformers.PreTrainedModel.from_pretrained), with the additional parameters defined [here](https://huggingface.co/transformers/v4.7.0/main_classes/configuration.html#transformers.PretrainedConfig).

## Loading BERT

In [24]:
bmodel = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False,
)


print(bmodel)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

This raises several questions: 
- what does the warning mean? 
- why `num_labels=2`? 
- what do the other options do? 
- what does the printed model show? 

Just for curiosity's sake, we can browse all of the model's parameters by name here.

The cell below prints the names and dimensions of the weights for:

- The embedding layer
- The first of the twelve transformer layers
- The output layer.



In [25]:
# Get all of the model's parameters as a list of tuples.
params = list(bmodel.named_parameters())

In [26]:
print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (30522, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (
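The count of 201 named parameters can be checked by hand from the shapes printed above. The sketch below is a worked calculation, not a call to the library. It assumes the standard `bert-base` layout: 16 named tensors per transformer layer (as in the `params[5:21]` slice), a pooler and a classifier with a weight and a bias each, and a feed-forward size of 3072, which is not visible in the truncated printout.

```python
hidden, vocab, max_pos, n_types = 768, 30522, 512, 2
ffn = 3072            # assumption: bert-base feed-forward size
n_layers, n_labels = 12, 2

def linear(n_in, n_out):
    """Number of values in a linear layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # one weight and one bias vector

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab + max_pos + n_types) * hidden + layer_norm

one_layer = (
    4 * linear(hidden, hidden)   # query, key, value, attention output
    + layer_norm
    + linear(hidden, ffn)        # feed-forward expansion
    + linear(ffn, hidden)        # feed-forward projection
    + layer_norm
)
pooler = linear(hidden, hidden)
classifier = linear(hidden, n_labels)

# 5 embedding tensors, 16 per layer, 2 for the pooler, 2 for the classifier.
named_tensors = 5 + n_layers * 16 + 2 + 2
total_values = embeddings + n_layers * one_layer + pooler + classifier

print(named_tensors)
print(f"{total_values:,}")
```

Under these assumptions the classification head accounts for only `768 * 2 + 2` values; almost everything else comes from the pre-trained checkpoint.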

## Test your BERT 
We can already try the model on the validation set. Before that, look at the output of the model on one batch. 
- interpret the output!  
- do you understand everything? 


In [27]:
mb = next(iter(train_dataloader))


In [33]:
bids , bmasks, blabels = mb 
print(bids.shape, bmasks.shape, blabels.shape)


torch.Size([32, 20]) torch.Size([32, 20]) torch.Size([32])


In [37]:
out = bmodel(bids, 
       token_type_ids=None, 
       attention_mask=bmasks, 
       labels=blabels)
print(out.loss)
print(out.logits)

tensor(0.7194, grad_fn=<NllLossBackward0>)
tensor([[ 0.1542, -0.0306],
        [ 0.3419, -0.2630],
        [ 0.2716, -0.0286],
        [ 0.2706, -0.1063],
        [ 0.3288, -0.1396],
        [ 0.3151, -0.1417],
        [ 0.2859, -0.2869],
        [ 0.2438, -0.2639],
        [ 0.3364, -0.2155],
        [ 0.2203, -0.1720],
        [ 0.1995, -0.1742],
        [ 0.3729, -0.1778],
        [ 0.0526, -0.0860],
        [ 0.0658,  0.1173],
        [ 0.3416, -0.1921],
        [ 0.2862, -0.1030],
        [ 0.2916, -0.2205],
        [ 0.3379, -0.2262],
        [ 0.3611, -0.2140],
        [ 0.2508, -0.1993],
        [ 0.2826, -0.1588],
        [ 0.2767, -0.1246],
        [ 0.2150, -0.1597],
        [ 0.0660, -0.0161],
        [ 0.2602, -0.1841],
        [ 0.3565, -0.2100],
        [ 0.2451, -0.2276],
        [ 0.3035, -0.1527],
        [ 0.2642, -0.2702],
        [ 0.2563, -0.1769],
        [ 0.3047, -0.1870],
        [ 0.2198, -0.1558]], grad_fn=<AddmmBackward0>)
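To help interpret these numbers, the sketch below turns one row of logits into class probabilities with a softmax and computes the cross-entropy for each possible label. For two classes with integer labels, `out.loss` is the cross-entropy averaged over the batch (the `NllLossBackward0` in the printout is consistent with this). The helper functions are illustrative and written in plain Python.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])

row = [0.1542, -0.0306]          # first row of the logits printed above
probs = softmax(row)
pred = probs.index(max(probs))   # same decision as argmax over the logits

loss_if_0 = cross_entropy(row, 0)
loss_if_1 = cross_entropy(row, 1)
print("probabilities:", probs, "prediction:", pred)
print("loss if label is 0:", loss_if_0)
print("loss if label is 1:", loss_if_1)
print("ln 2:", math.log(2))
```

Because the classification layer is still untrained, the probabilities stay close to 0.5, so the loss stays close to ln 2 (about 0.693) whatever the label; the batch loss of 0.7194 printed above is in that range.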


In [44]:
def validation(model, dloader, device = 'cpu'): 
    """Run the BERT model for text classification and compute 
    the loss as well as the accuracy"""
    model = model.to(device)
    model.eval()
    tgood = 0 
    total = 0 
    ltotal = 0 
    with th.no_grad():
        for batch in tqdm(dloader):
            batch = [t.to(device) for t in batch]
            bids, bmask, blabel = batch
            total += blabel.size(0)
            out = model(bids, token_type_ids = None,
                       attention_mask = bmask,
                       labels = blabel)
            preds = out.logits
            preds = th.argmax(preds, dim=1)
            tgood += (preds==blabel).sum().item()
            ltotal += out.loss.item()
    return ltotal, tgood, total, (tgood*100/total)

In [45]:
# and test the function 
validation(bmodel, valid_dataloader)

  0%|          | 0/32 [00:00<?, ?it/s]

(22.945774972438812, 490, 1000, 49.0)

- What do you think about the result ? 
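One way to read the tuple returned above is sketched below. `validation` sums the loss over batches, so dividing by the number of batches gives a mean loss that can be compared with ln 2, the loss of a two-class model that always predicts 50/50. The numbers are copied from the output above; the variable names are illustrative.

```python
import math

# Values returned by validation(bmodel, valid_dataloader) above.
ltotal, tgood, total = 22.945774972438812, 490, 1000
batch_size = 32

# 1000 examples in batches of 32 give 32 batches (the last one is partial).
n_batches = math.ceil(total / batch_size)
mean_batch_loss = ltotal / n_batches
accuracy = 100 * tgood / total
chance_loss = math.log(2)

print("batches:", n_batches)
print("mean loss per batch:", round(mean_batch_loss, 4))
print("chance-level loss:", round(chance_loss, 4))
print("accuracy (%):", accuracy)
```

Both the mean loss and the accuracy are close to what random guessing would give on a roughly balanced dataset, which is consistent with a classification layer that has not been trained yet.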


## Fine-Tuning 

With our model loaded and ready, we need to choose the training hyperparameters. For fine-tuning, the authors recommend choosing from the following values (from Appendix A.3 of the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf)):

- **Batch size:** 16, 32  
- **Learning rate (Adam):** 5e-5, 3e-5, 2e-5  
- **Number of epochs:** 2, 3, 4 

We chose:
* Batch size: 32 (set when creating our DataLoaders)
* Learning rate: 5e-5
* Epochs: 5 (we'll see that this is probably too many...)

The epsilon parameter `eps = 1e-8` is "a very small number to prevent any division by zero in the implementation" (from [here](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)).

You can find the creation of the AdamW optimizer in `run_glue.py` [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109).
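To make the role of `eps` concrete, here is a sketch of one AdamW update for a single scalar parameter. It is a hand-written illustration, not the PyTorch implementation: `adamw_step` is a hypothetical helper, and the values of `beta1`, `beta2` and `weight_decay` are assumed to match the defaults of `th.optim.AdamW`.

```python
import math

def adamw_step(p, g, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update of parameter p with gradient g at step t (t >= 1)."""
    p = p * (1 - lr * weight_decay)          # decoupled weight decay
    m = beta1 * m + (1 - beta1) * g          # running mean of gradients
    v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    # eps keeps the denominator non-zero when v_hat is 0.
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# A regular step: the parameter moves by roughly lr, whatever the gradient scale.
p1, m1, v1 = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)

# A zero gradient: without eps the update would be 0 / 0.
p0, m0, v0 = adamw_step(p=1.0, g=0.0, m=0.0, v=0.0, t=1)

print(p1, p0)
```

In the zero-gradient case only the weight decay changes the parameter, and the division is well defined thanks to `eps`.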

In [47]:
## Fine-tune BERT: 
bmodel.to(device)
optimizer = th.optim.AdamW(bmodel.parameters(),
                          lr = 5e-5, eps = 1e-8)
Nepoch = 5
for e in range(Nepoch):
    tgood = 0 
    total = 0 
    tloss = 0 
    bmodel.train()
    for batch in tqdm(train_dataloader):
        # Move the batch to the same device as the model
        bids, bmask, blabel = [t.to(device) for t in batch]
        optimizer.zero_grad()
        out = bmodel(bids, token_type_ids=None,
                    attention_mask = bmask,
                    labels = blabel
                    )
        out.loss.backward()
        optimizer.step()
        preds = out.logits
        preds = th.argmax(preds, dim=1)
        tgood += (preds==blabel).sum().item()
        tloss += out.loss.item()
        total += blabel.size(0)
    ##################################
    # Losses are summed over batches; accuracies are percentages
    vloss, _, _, acc = validation(bmodel, valid_dataloader, device)
    print("train: ", tloss, tgood*100/total, 
          "||| valid: ", vloss, acc)
    
        

  0%|          | 0/157 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

train:  94.67458745837212 66.0 ||| valid:  17.919084697961807 67.5


  0%|          | 0/157 [00:00<?, ?it/s]

  0%|          | 0/32 [00:00<?, ?it/s]

train:  68.78304666280746 80.02 ||| valid:  19.251477509737015 70.5


  0%|          | 0/157 [00:00<?, ?it/s]

KeyboardInterrupt: 
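In the two completed epochs above, the training accuracy keeps rising while the summed validation loss goes up, which is the kind of signal that suggests stopping early (the validation accuracy still improves slightly, so which quantity to monitor is a choice). The sketch below shows one simple stopping rule on the recorded validation losses. `best_epoch` and `should_stop` are hypothetical helpers, not part of any library.

```python
def best_epoch(valid_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(valid_losses)), key=valid_losses.__getitem__)

def should_stop(valid_losses, patience=1):
    """True once the loss has not improved for `patience` epochs in a row."""
    epochs_since_best = len(valid_losses) - 1 - best_epoch(valid_losses)
    return epochs_since_best >= patience

# Validation losses printed by the training loop above.
valid_losses = [17.919084697961807, 19.251477509737015]

stop_after_first = should_stop(valid_losses[:1])
stop_after_second = should_stop(valid_losses)
print(best_epoch(valid_losses), stop_after_first, stop_after_second)
```

In practice such a rule is usually combined with saving the model weights at the best epoch, so that training can stop without losing the best checkpoint.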

# DistilBERT 

In this part we will see another use case: 
- We have a pretrained model, but no ready-made architecture for text classification. 
- As an illustration, we will use DistilBERT as another BERT-like model. It is lighter yet very efficient. [Here is the paper](https://arxiv.org/abs/1910.01108) that describes the distillation. 

The idea is to write a wrapper around DistilBERT. 

##  Tokenizer and dataloader

As before, you have to create what the interface needs. There may be some differences in the inputs and outputs, but they are easy to handle. 
The roadmap:
- get the tokenizer
- process the data
- create the dataloader



In [48]:
## TODO
tokenizer = DistilBertTokenizer.from_pretrained(
    'distilbert-base-uncased',
    do_lower_case = True
    )

In [49]:
token_id = []
attention_masks = []
ntrain = 5000
nvalid = 1000

ids = np.arange(len(alllabels))
np.random.shuffle(ids)
trainidx = ids[:ntrain]
valididx = ids[ntrain:ntrain+nvalid]
allidx = ids[:ntrain+nvalid]
for sent in tqdm(alltexts[allidx]):
    encoding_dict = preprocessing(sent, tokenizer)
    token_id.append(encoding_dict['input_ids']) 
    attention_masks.append(encoding_dict['attention_mask'])


token_id = th.cat(token_id, dim = 0)
attention_masks = th.cat(attention_masks, dim = 0)

print(token_id.shape, token_id[:ntrain].shape)
labels = th.tensor(alllabels[allidx])
print("\n", labels.sum().item(), "positive labels / ", labels.shape[0], "total")


  0%|          | 0/6000 [00:00<?, ?it/s]

torch.Size([6000, 20]) torch.Size([5000, 20])

 2926 positive labels /  6000 total


In [50]:
train_set = TensorDataset(token_id[:ntrain], 
                          attention_masks[:ntrain], 
                          labels[:ntrain])

valid_set = TensorDataset(token_id[ntrain:ntrain+nvalid], 
                          attention_masks[ntrain:ntrain+nvalid], 
                          labels[ntrain:ntrain+nvalid])

batch_size = 32
train_dataloader = DataLoader(
            train_set,
            sampler = RandomSampler(train_set),
            batch_size = batch_size
        )
valid_dataloader = DataLoader(
            valid_set,
            batch_size = batch_size
        )

## The Wrapper

We can simply write a class that loads the pretrained DistilBERT model. First, load the model on its own and inspect it: 


In [53]:
db = DistilBertModel.from_pretrained("distilbert-base-uncased")
print(db)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

In [54]:
mb = next(iter(train_dataloader))
print(mb)

[tensor([[  101,  2216, 10261,  2000,  2156,  2143, 15587, 12661,  2015,  1010,
          2323,  2025,  2156,  2023,  2004,  2004,  1037, 12661,  1010,   102],
        [  101,  2023,  2003,  1996,  2309,  5409,  3185,  1045,  2031,  2412,
          2464,  1012,  1045,  3685,  4671,  2129,  2919,  2009,  2003,   102],
        [  101,  2616,  2064,  1005,  1056,  3432,  6235,  2129,  9643,  2023,
          2143,  2003,  1012,  1045,  3427,  2009,  2006,  2678,  2197,   102],
        [  101,  5041,  2438,  2005,  2017,  1029,  3524,  6229,  2017,  2156,
          2023,  3082,  4375,  1026,  7987,  1013,  1028,  1026,  7987,   102],
        [  101,  1037,  3819,  2210,  2012, 21735,  1012,  1012,  1012,  1045,
          4797,  2065,  1037,  2309,  2915,  6354,  2005,  2062,  2059,   102],
        [  101,  2065,  2023,  3185,  2003,  2746,  2000,  1037,  4258,  2379,
          2017,  1010,  5136,  2009,  1037,  5081,  1012,  1045,  2001,   102],
        [  101,  1045,  2870,  2514,  2023,  

In [57]:
bid, bmask,bl  = mb 
db(bid, bmask)

BaseModelOutput(last_hidden_state=tensor([[[ 0.0991,  0.0824, -0.1134,  ..., -0.1087,  0.6027,  0.4332],
         [ 0.0559,  0.4726,  0.0803,  ..., -0.0820,  0.5042, -0.1935],
         [ 0.3800,  0.0403,  0.3531,  ...,  0.0814, -0.0420,  0.0784],
         ...,
         [ 0.6492,  0.1376,  0.1465,  ..., -0.4594,  0.2568, -0.0452],
         [-0.1394,  0.0465, -0.3045,  ...,  0.1266,  0.1463,  0.1328],
         [ 0.8458,  0.4010, -0.3130,  ..., -0.0396, -0.4325, -0.0612]],

        [[ 0.2072,  0.0987,  0.0999,  ...,  0.0426,  0.5348,  0.3799],
         [-0.1453, -0.3001, -0.0451,  ..., -0.3086,  1.0134,  0.0872],
         [-0.2888,  0.0309,  0.3852,  ..., -0.0333,  0.6259,  0.6185],
         ...,
         [-0.0146, -0.3319, -0.1322,  ..., -0.2350,  0.5971,  0.1578],
         [-0.1944, -0.2464, -0.1043,  ...,  0.3554,  0.2606,  0.3138],
         [ 0.5858,  0.4162,  0.3597,  ...,  0.4129,  0.1757, -0.1727]],

        [[ 0.3372, -0.0214, -0.0111,  ..., -0.1411,  0.4034,  0.3552],
         [ 

In [None]:
class DistilBertClassifier(th.nn.Module):
    def __init__(self):
        super().__init__()
        self.distilbert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.drop = th.nn.Dropout(0.3)
        self.out = th.nn.Linear(768, 1)

    def forward(self, ids, mask):
        distilbert_output = self.distilbert(ids, mask)
        hidden_state = distilbert_output[0]  # (bs, seq_len, dim)
        pooled_output = hidden_state[:, 0]  # (bs, dim)
        output_1 = self.drop(pooled_output)
        output = self.out(output_1)
        return output

In [None]:
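# Create a model and test it on a batch 

dbmodel = DistilBertClassifier()
dbmodel = dbmodel.to(device)

# One logit per example is expected: shape (batch_size, 1)
with th.no_grad():
    print(dbmodel(bid.to(device), bmask.to(device)).shape)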


Since we have created the model almost by hand, we have to define the loss function ourselves. Write a function that computes the validation scores (loss and accuracy) with your model and a dataloader.

In [None]:
criterion = nn.BCEWithLogitsLoss()
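With a single output unit, the model produces one logit per example, and `BCEWithLogitsLoss` applies the sigmoid and the binary cross-entropy in one step (averaged over the batch by default). The sketch below reproduces the per-example computation in plain Python to show what the loss measures and why thresholding the logit at 0 is the same as thresholding the probability at 0.5. The helper names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(z, y):
    """Numerically stable binary cross-entropy for one logit z and a target y in {0, 1}."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def naive_bce(z, y):
    """Textbook formula: sigmoid first, then cross-entropy."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

examples = [(2.0, 1.0), (2.0, 0.0), (-1.5, 0.0), (0.0, 1.0)]
stable = [bce_with_logits(z, y) for z, y in examples]
naive = [naive_bce(z, y) for z, y in examples]

for (z, y), s in zip(examples, stable):
    # Predicting the positive class when z > 0 matches sigmoid(z) > 0.5.
    print(f"logit={z:+.1f} target={y:.0f} loss={s:.4f} predicted={int(z > 0)}")
```

A confident wrong prediction (logit 2.0 with target 0) is penalized much more than a confident correct one, and a logit of 0 gives a loss of ln 2.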


In [None]:
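def validDB(model, loader):
    """Return the summed loss, the number of correct predictions
    and the number of examples seen in `loader`."""
    model.eval()  # disable dropout during evaluation
    total = 0
    tgood = 0
    tloss = 0
    with th.no_grad():
        for batch in tqdm(loader):
            batch = tuple(t.to(device) for t in batch)
            binputs, bmasks, blabels = batch
            blabels = blabels.float()
            # One logit per example: (bs, 1) -> (bs,)
            preds = model(binputs, bmasks).squeeze(-1)
            loss = criterion(preds, blabels)
            # A positive logit means the positive class is predicted
            acc = (preds > 0).float() == blabels
            total += acc.shape[0]
            tgood += acc.sum().item()
            tloss += loss.item()

    return tloss, tgood, total
validDB(dbmodel, valid_dataloader)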

## Fine Tuning 

Let's go: train the model. 

In [None]:
epochs = 5
accs = []
trainlosses = []
optimizer = th.optim.AdamW(dbmodel.parameters(),
                              lr = 2e-5,
                              eps = 1e-08
                              )
for e in range(epochs):

    # ========== Training ==========

    # Set model to training mode
    dbmodel.train()

    # Tracking variable: training loss summed over the epoch
    tr_loss = 0
    for batch in tqdm(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        binputs, bmasks, blabels = batch
        blabels = blabels*1.0
        optimizer.zero_grad()
        # Forward pass
        outputs = dbmodel(binputs,
                             bmasks)
        # Backward pass
        loss = criterion(outputs.squeeze(), blabels)
        loss.backward()
        optimizer.step()
        tr_loss += loss.item()
    # ========== Validation ==========

    # Set model to evaluation mode
    dbmodel.eval()
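    l , a , t = validDB(dbmodel, valid_dataloader)
    accs.append(a*100/t)
    trainlosses.append(tr_loss)
    print(e,'\n\t - Train loss: {:.4f}'.format(tr_loss),
          '\n\t - Valid loss: {:.4f}'.format(l),
          '\n\t - Valid accuracy: {:.2f}'.format(accs[-1]))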