# Lab assignment: Transformers

<img src="https://i0.wp.com/wallur.com/wp-content/uploads/2016/12/transformers-background-1.jpg?w=1920">
<div align="right"><a href=http://wallur.com/wallpaper/36471>Image source</a></div>

In this assignment notebook we will once again tackle the task of detecting toxic comments in social media. This time, however, we will use a pre-trained Transformer-based language model to do so. We will also play a bit with the deep learning library PyTorch.

## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
You will need to solve a question by writing your own code or answer in the cell immediately below or in a different file, as instructed.</font>

***

<img src="img/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
This is a hint or useful observation that can help you solve this assignment. You should pay attention to these hints to better understand the assignment.
</font>

***

<img src="img/pro.png" height="80" width="80" style="float: right;"/>

***
<font color=#259b4c>
This is an advanced exercise that can help you gain a deeper knowledge into the topic. Good luck!</font>

***

To avoid missing packages and compatibility issues you should run this notebook under one of the [recommended Deep Learning environment files](https://github.com/albarji/teaching-environments/tree/master/deeplearning), or make use of [Google Colaboratory](https://colab.research.google.com/). If you use Colaboratory make sure to [activate GPU support](https://colab.research.google.com/notebooks/gpu.ipynb).

If you are running this notebook in Google Colaboratory, you will need to install the transformers library by uncommenting and running the following line.

In [1]:
#pip install transformers==2.5.1

Let us set a random seed so experiments are reproducible across runs

In [2]:
import torch
import numpy as np
torch.manual_seed(0)
np.random.seed(0)

Lastly, if you need any help on the usage of a Python function you can place the writing cursor over its name and press Shift+Tab to produce a pop-out with related documentation. This will only work inside code cells. 

Let's go!

## Data loading

Data is provided as two separate files, one with texts for training the model and another one for testing. Both files are available in compressed form under the *data* folder.

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    
Load the data into two <a href=https://pandas.pydata.org/>pandas</a> Dataframes, <b>train</b> for the training data and <b>test</b> for the test data.
</font>

***

In [3]:
####### INSERT YOUR CODE HERE
import pandas as pd
train = pd.read_csv("data/toxic_multiclass_train.csv.zip", index_col="id")
test = pd.read_csv("data/toxic_multiclass_test.csv.zip", index_col="id")

If you have loaded the data properly, you should be able to visualize the first rows of each data set as follows

In [4]:
train.head(10)

id,comment_text,toxicity
e0fdfd98c66fb643,"""\n\n Huggle not working \n\nHi Gurch. There i...",normal
1864753b5fb6c9a3,Mossad actually. I know where you live.,normal
ce1db53fb22d399c,REDIRECT Talk:UFC Fight Night: Belfort vs. Hen...,normal
fed4f08d59399398,"""\n\nUPA IRC\nWhat about 19:00 UTC? e | ταλκ """,normal
06e7f93938ad9e72,"""\nI've re-added your information, together wi...",normal
4a5e851879fdd674,"""\nI'm not an elitist, I'm just spreading the ...",normal
ff39db4975a78363,"""\n\nIt is not listed on this European list as...",normal
73cc03c5e157ce86,You made a mistake you ass.,obscene
ca0891e20b7bbd66,Lol dynamic IP. Just you try to stop me! 82.13...,normal
b890cc6153e51480,"""Thanks for trying to fix Neil Steinberg. I ju...",normal


In [5]:
test.head(10)

id,comment_text,toxicity
dd2dcd01f0536e53,"""\nEdit: Please stated the basis for your acc...",normal
b20b1e8381e306f8,Wikipedia is a repetition of the old joke that...,normal
54a933831401b9b2,And the same in Serbian:ђе and Croatian:če and...,normal
e7a4575dba88d9f3,"""\n\n Congrats... You gave an awesome answer i...",normal
452d5cc2ccd1d611,Wiki users controlling information \n\nDeliber...,normal
e13de405c4f2ae03,The name Tajik no more implies a lack of conne...,normal
a7b33257178c17a7,Updated? \n\nAccording to this website (http:/...,normal
d6093ce71ecbb577,"""\n\nIt is incorrect the way it sounds now. Th...",normal
7e5f6eafd2cd3147,"Template:Geobox coor \n\nHi, I had to do an em...",normal
c665f590918ecce3,ADDITIONAL EVIDENCE TO ABOVE: One of the coll...,normal


As you can see, the data files include a column *comment_text* with the text we must classify, and an additional column *toxicity* with the kind of toxicity present in the comment: *toxic*, *severe_toxic*, *obscene*, *threat*, *insult* or *identity_hate*, or *normal* if the text contains no toxicity.

### Reducing the training data

To allow faster experimenting in this assignment, we will only use a portion of the training data.

<img src="img/pro.png" height="80" width="80" style="float: right;"/>

***

<font color=#259b4c>
    
Do not apply this reduction, use instead the whole training dataset. This will lead to better performances, though with much longer training times!</font>

***

In [6]:
import numpy as np

print(f"Training patterns before reduction: {len(train)}")
np.random.seed(123456789)
train = train.sample(int(len(train)/10))
print(f"Training patterns after reduction:  {len(train)}")

Training patterns before reduction: 119678
Training patterns after reduction:  11967


### Extract X and Y

In [7]:
from sklearn.preprocessing import LabelEncoder

X_train = train["comment_text"].values
X_test = test["comment_text"].values

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train["toxicity"].values)
y_test = label_encoder.transform(test["toxicity"].values)

#y_train = train["toxicity"].values
#y_test = test["toxicity"].values

## Transformers library

A convenient library to make use of Transformer-based language models is... [Transformers](https://github.com/huggingface/transformers)!

<img src="img/transformers_logo_name.png">

Transformers provides implementations of many language models, such as BERT and GPT-2. It also lets you use pre-trained versions of these models, thus saving a lot of time when solving practical problems.

Among the provided models, in this notebook we will make use of DistilBERT, a distilled version of BERT that can obtain good accuracies while keeping the model size small. Let's start by importing the configuration object for DistilBERT.

In [8]:
from transformers import DistilBertConfig



We will use the configuration to tell transformers we want to use a pretrained version of DistilBERT, trained on a dataset of uncased data, since case might not be important for the problem at hand. We also need to specify that we will use this pre-trained model to solve a classification problem with a specific number of labels.

In [9]:
num_labels = len(set(y_train))
config = DistilBertConfig.from_pretrained('distilbert-base-uncased', num_labels=num_labels)

We can check out the resultant configuration, which contains all the model parameters, like dropout rates, embeddings sizes, and so on

In [10]:
config

DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "vocab_size": 30522
}

## Tokenization

Each model in transformers can use its own tokenizer. DistilBERT is no exception

In [11]:
from transformers import DistilBertTokenizer

Again, we will load a particular tokenizer, pre-trained with an uncased dataset as we did with the configuration above

In [12]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

Let's check that the tokenizer works

In [13]:
tokenizer.tokenize("A long trip to Mordor")

['a', 'long', 'trip', 'to', 'mor', '##dor']

Most BERT models use a WordPiece tokenizer, which divides text into tokens that might represent a whole word, or a part of a word if the word is not common in the language. Pieces that continue a word are marked with the `##` prefix. Also, since we are using an uncased model, the tokenizer maps all words to lower case.
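As a rough illustration of how such a split can be produced, here is a minimal sketch of greedy longest-match-first tokenization. The function name `wordpiece_tokenize` and the tiny vocabulary are hypothetical; the real DistilBERT tokenizer uses a vocabulary of 30522 entries and additional normalization steps.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lower-cased word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: treat the whole word as unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical mini-vocabulary, NOT the real DistilBERT vocabulary
toy_vocab = {"a", "long", "trip", "to", "mor", "##dor", "##d", "##or"}

sentence = "A long trip to Mordor"
tokens = [piece
          for word in sentence.lower().split()
          for piece in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
```

With this toy vocabulary the sketch splits "mordor" into `mor` and `##dor`, the same pieces the real tokenizer produced above.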

We can also ask the tokenizer to transform the text directly into a list of vocabulary ids.

In [14]:
tokenizer.encode("A long trip to Mordor")

[101, 1037, 2146, 4440, 2000, 22822, 7983, 102]

This encoding also adds the special tokens `[CLS]` and `[SEP]` required by BERT models at the beginning and end of each text. We can check that out as follows:

In [15]:
for token_id in tokenizer.encode("A long trip to Mordor"):
    print(f'{token_id} -> {tokenizer.decode([token_id])}')

101 -> [CLS]
1037 -> a
2146 -> long
4440 -> trip
2000 -> to
22822 -> mor
7983 -> ##dor
102 -> [SEP]


If we want to tokenize more than one text we can use the `batch_encode_plus` function. We can configure this function to make sure that every encoded text has the same length, which will be useful when working in batches on the GPU. In the following example we will use a common length of 10, which manages to cover all tokens in every text.

In [16]:
texts = [
    "A long trip to Mordor", 
    "Our mind a sea",
    "Mabuka is the end of light"
]

tokenizer.batch_encode_plus(texts, max_length=10, pad_to_max_length=True)

{'input_ids': [[101, 1037, 2146, 4440, 2000, 22822, 7983, 102, 0, 0], [101, 2256, 2568, 1037, 2712, 102, 0, 0, 0, 0], [101, 26661, 15750, 2003, 1996, 2203, 1997, 2422, 102, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]]}

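`batch_encode_plus` returns a dictionary with up to three entries (for DistilBERT only `input_ids` and `attention_mask` are produced, as seen in the output above):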

* `input_ids`: the ids of the tokens encoding each of the texts.
* `token_type_ids`: for language models that learn with pairs of sentences, 0/1 indicators telling the sentence to which each token belongs.
* `attention_mask`: 0/1 indicators telling whether the attention layers should consider this token in the mixings or not. Padding tokens always get a 0 value in the mask.

Depending on the type of model we are using, we might need all three sets of indicators or only a few of them.
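The relation between padding and the attention mask can be sketched in a few lines of plain Python. The helper `pad_and_mask` below is hypothetical and only mimics the padding step. Its truncation is naive (it simply cuts the list), whereas the real tokenizer takes care of keeping the special tokens in place.

```python
def pad_and_mask(token_ids, max_length, pad_token_id=0):
    """Cut or right-pad a list of token ids and build its attention mask."""
    ids = token_ids[:max_length]          # naive truncation (assumption)
    n_padding = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_padding   # 1 = real token, 0 = padding
    input_ids = ids + [pad_token_id] * n_padding
    return input_ids, attention_mask

# Token ids of "Our mind a sea", copied from the output above
ids = [101, 2256, 2568, 1037, 2712, 102]
input_ids, attention_mask = pad_and_mask(ids, max_length=10)
print(input_ids)
print(attention_mask)
```

The padding id 0 matches the `pad_token_id` entry of the configuration shown earlier.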

We can also ask the `batch_encode_plus` function to produce TensorFlow or PyTorch tensors instead of Python lists. For instance, to obtain PyTorch tensors we will do as follows:

In [17]:
tokenizer.batch_encode_plus(texts, max_length=10, pad_to_max_length=True, return_tensors="pt")

{'input_ids': tensor([[  101,  1037,  2146,  4440,  2000, 22822,  7983,   102,     0,     0],
        [  101,  2256,  2568,  1037,  2712,   102,     0,     0,     0,     0],
        [  101, 26661, 15750,  2003,  1996,  2203,  1997,  2422,   102,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}

The returned structure is the same, but now each entry is a Pytorch tensor.

Now, what would be the ideal maximum length for encoding our texts? BERT accepts input texts of up to 512 tokens, but always using this maximum length will result in slow training times. We can try tokenizing all texts without a length limit and study the distribution of text lengths.

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    Perform the following tasks to find an appropriate maximum length to use in the encoding:
    <ul>
        <li>Encode all texts in X_train using batch_encode_plus, but don't specify any maximum length, padding or tensor options.</li>
        <li>Create a list with the lengths of each one of the indices vectors.</li>
        <li>Find the value of the 90th percentile of the lengths distribution. Tip: use the <a href=https://docs.scipy.org/doc/numpy/reference/generated/numpy.quantile.html>numpy quantile function</a>. Save this value as an integer number into a variable named <b>maxlength</b>.</li>
    </ul>
</font>

***

In [18]:
####### INSERT YOUR CODE HERE
import numpy as np

encoded = tokenizer.batch_encode_plus(X_train)
lengths = [len(x) for x in encoded["input_ids"]]
maxlength = int(np.quantile(lengths, 0.9))
maxlength

Token indices sequence length is longer than the specified maximum sequence length for this model (1068 > 512). Running this sequence through the model will result in indexing errors

202

We will use this maximum length later on.
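To see why a percentile is a reasonable cutoff, consider a toy list of lengths with a long tail. The numbers below are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Invented text lengths with a long tail (most texts short, a few very long)
lengths = [5, 8, 12, 20, 35, 50, 80, 120, 300, 1000]

# Same recipe as above: 90th percentile, truncated to an integer
cutoff = int(np.quantile(lengths, 0.9))

# Fraction of texts that fit entirely within the cutoff
covered = np.mean(np.array(lengths) <= cutoff)

print(f"Longest text: {max(lengths)} tokens")
print(f"Cutoff: {cutoff} tokens, covering {covered:.0%} of the texts")
```

In this toy case the cutoff is far below the longest text, yet nine out of ten texts still fit without truncation. The same trade-off motivates the choice of `maxlength`.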

## DistilBERT model

We are now ready to use the DistilBERT model. Let's load the pre-trained version of DistilBERT

In [19]:
from transformers import DistilBertModel

distilbert = DistilBertModel.from_pretrained('distilbert-base-uncased')

This pre-trained version contains the "body" of the model, which can receive a sequence of tokens and produce the "contextualized" embeddings for each one of those tokens.

<img src="http://jalammar.github.io/images/bert-encoders-input.png">
<div align="right">Image credit: <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)</a></div>

We can try this with a batch of one text of the training data, but remember we first need to transform it through the tokenizer and obtain Pytorch tensors

In [20]:
sample = tokenizer.batch_encode_plus(X_train[0:1], max_length=40, pad_to_max_length=True, return_tensors="pt")
sample

{'input_ids': tensor([[  101,  4283,  4563,  2078,  1010,  2204,  2000,  2156,  2008,  2002,
          2003, 10303,  1012,  1045,  1005,  1049,  2469,  2383,  1996,  2610,
         10591,  2012,  2010,  2341,  2097,  5676,  2002,  2180,  1005,  1056,
          2191,  2008,  6707,  2153,  1012,   102,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}

Now we can input the tensor into the DistilBERT model. Among all the information produced by the tokenizer, DistilBERT only needs the `input_ids`

In [21]:
outputs = distilbert(sample["input_ids"])

A Transformers model always returns a tuple, which might contain several pieces of information. In the case of DistilBERT the tuple holds a single object: a PyTorch tensor containing the embeddings.

In [22]:
embeddings = outputs[0]
print(f"Input tensor shape {sample['input_ids'].shape}")
print(f"Input tensor values {sample['input_ids']}")
print(f"DistilBERT embeddings shape {embeddings.shape}")
print(f"DistilBERT embeddings values {embeddings}")

Input tensor shape torch.Size([1, 40])
Input tensor values tensor([[  101,  4283,  4563,  2078,  1010,  2204,  2000,  2156,  2008,  2002,
          2003, 10303,  1012,  1045,  1005,  1049,  2469,  2383,  1996,  2610,
         10591,  2012,  2010,  2341,  2097,  5676,  2002,  2180,  1005,  1056,
          2191,  2008,  6707,  2153,  1012,   102,     0,     0,     0,     0]])
DistilBERT embeddings shape torch.Size([1, 40, 768])
DistilBERT embeddings values tensor([[[ 0.0731,  0.1213,  0.0741,  ...,  0.1772,  0.2574,  0.2577],
         [ 0.3810,  0.2826,  1.1690,  ...,  0.2767,  0.2446,  0.3849],
         [ 0.6726, -0.3960, -0.0019,  ...,  0.5035,  0.1053, -1.3693],
         ...,
         [ 0.2292,  0.1030,  0.3928,  ...,  0.2168, -0.0718, -0.0745],
         [ 0.2484,  0.0594,  0.3928,  ...,  0.2539, -0.0813, -0.0433],
         [ 0.2328,  0.0147,  0.2474,  ...,  0.3424, -0.0758,  0.0546]]],
       grad_fn=<NativeLayerNormBackward>)


DistilBERT returns an embedding vector of 768 numbers for each input token.

Although it is tempting to use these embeddings as features for the toxic classification task, this approach does not generally give good results. Instead, it is advisable to add a classification "head" to the model, growing out of the embedding produced for the `[CLS]` special token, and fine-tune the whole model to the task through back-propagation.
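The idea of a head "growing out of" the `[CLS]` embedding can be sketched with NumPy. This is a deliberately simplified, untrained linear head on random stand-in embeddings. The names `W` and `b` are hypothetical, and the head the library builds contains more than this single layer (the configuration above lists a `seq_classif_dropout`, for instance).

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, hidden_dim, num_classes = 1, 40, 768, 7

# Stand-in for the DistilBERT output: random numbers, not real embeddings
embeddings = rng.normal(size=(batch_size, seq_len, hidden_dim))

# Hypothetical, untrained head parameters
W = rng.normal(scale=0.02, size=(hidden_dim, num_classes))
b = np.zeros(num_classes)

cls_embedding = embeddings[:, 0, :]   # [CLS] is always the first token
logits = cls_embedding @ W + b        # one unnormalized score per class

print(cls_embedding.shape, logits.shape)
```

During fine-tuning, back-propagation would adjust both the head parameters and the weights of the language model that produce the embeddings.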

<img src="http://jalammar.github.io/images/bert-classifier.png">
<div align="right">Image credit: <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)</a></div>

The Transformers library can prepare all of this for us, by loading a version of DistilBERT with a Sequence Classification head. We will provide the configuration we prepared above, so that the classification head produces as many outputs as there are classes in our problem.

In [23]:
from transformers import DistilBertForSequenceClassification
distilbert = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', config=config)

## Let's try fine-tuning using the new Trainer class

In [24]:
from transformers import TrainingArguments, Trainer, DataCollator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

training_args = TrainingArguments(
    output_dir="./models/toxic_model",
    overwrite_output_dir=True,
    #do_train=True,
    #do_eval=True,
    per_gpu_train_batch_size=32,
    per_gpu_eval_batch_size=128,
    num_train_epochs=1,
    logging_steps=100,
    save_steps=100,
    #evaluate_during_training=True,
)

def encode_texts(texts):
    tensors = tokenizer.batch_encode_plus(texts, max_length=maxlength, pad_to_max_length=True, return_tensors="pt")
    return tensors["input_ids"].to(device)

class ToxicCollator(DataCollator):
    def collate_batch(self, patterns):
        # Split input ids and targets
        train_idx, targets = zip(*patterns)
        X = encode_texts(train_idx)
        Y = torch.tensor(targets).long().to(device)
        batch = {"input_ids": X, "labels": Y}
        return batch

In [25]:
#train_dataset=list(zip(X_train, y_train))
#eval_dataset=list(zip(X_test, y_test))

train_dataset=list(zip(X_train[:100], y_train[:100]))
eval_dataset=list(zip(X_test[:100], y_test[:100]))

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', config=config)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=ToxicCollator(),
    #prediction_loss_only=True
)

In [26]:
%%time
trainer.train()




CPU times: user 2.88 s, sys: 1.09 s, total: 3.97 s
Wall time: 4.01 s


TrainOutput(global_step=4, training_loss=1.659606009721756)

Now we can evaluate the model

In [27]:
trainer.evaluate()



{"eval_loss": 1.3135157823562622, "epoch": 1.0, "step": 4}


{'eval_loss': 1.3135157823562622, 'epoch': 1.0}

In [None]:
preds = trainer.predict(list(zip(X_test, y_test)))


In [None]:
preds.predictions[0]

In [None]:
# Transform to probabilities
from scipy.special import softmax

scores = softmax(preds.predictions, axis=1)
scores[0]
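For reference, the softmax applied in the cell above can be written by hand. The following is a minimal NumPy sketch of a numerically stable version, run on made-up logits. It is not the SciPy implementation, only an illustration of the same transformation.

```python
import numpy as np

def stable_softmax(logits, axis=1):
    """Turn rows of logits into probabilities that sum to 1."""
    # Subtracting the row maximum does not change the result but avoids overflow
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

# Made-up logits: a regular row and a row with very large values
toy_logits = np.array([[2.0, 1.0, 0.1],
                       [1000.0, 1001.0, 1002.0]])
probs = stable_softmax(toy_logits)
print(probs)
print(probs.sum(axis=1))
```

The softmax keeps the ranking of the classes, so the most probable class is the one with the largest logit.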

In [None]:
from sklearn.metrics import roc_auc_score

print("AUC score", roc_auc_score(y_test, scores, multi_class='ovr'))


Now let's try predicting with this

In [None]:
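def predict(X):
    # Switch off dropout and gradient tracking for inference
    model.eval()
    with torch.no_grad():
        encoded = encode_texts(X)
        logits = model(encoded)[0].cpu()
        # Index of the highest-scoring class for each text
        most_probable = logits.argmax(dim=1).numpy()
        return most_probable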

In [None]:
preds = predict(X_test)

In [None]:
# Pipeline does not control for maximum input sequence size, and so it fails for sequences longer than 512 tokens
from transformers import pipeline

pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model='./models/toxic_model/checkpoint-300/')
#pipe = pipeline("sentiment-analysis", tokenizer='distilbert-base-uncased', model=model)

In [None]:
pipe.tokenizer

In [None]:
pipe.predict(X_test.tolist())

In [None]:
len(tokenizer.tokenize(X_test[17]))

In [None]:
#TODO: pipeline does not control te

In [None]:
for i in range(0, 20):
    print(i)
    print(f"Expected: {y_test[i]}")
    print(f"Predicted: {pipe.predict(X_test[i])}")

## Preparing for fine-tuning

Fine-tuning a Transformers model is not as simple as fitting a scikit-learn or Keras model: we will need to provide all the details on how to batch the data and perform the learning loop, following the style of PyTorch. To ease this task, we will first prepare some useful functions.

### Using the GPU

First, fine-tuning a language model on the CPU is not a good idea. We will be better off using a GPU. To do so, we first need to identify the computing device. The code below checks if a GPU is available in the system, and if so, prepares a Torch device to send the calculations there.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

The cell above should print `device(type='cuda')` if a GPU was found.

Let's load the model into the GPU

In [None]:
distilbert = distilbert.to(device)

### Collate functions

Now, we will need a <b>collate function</b> that receives an iterable of texts and produces a Torch tensor on the GPU with all that information. We can implement this function easily by taking advantage of the `batch_encode_plus` function we used above.

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    Implement a function named <b>collateX</b> that receives an iterable of texts and performs the following operations:
    <ul>
        <li>Use batch_encode_plus to transform the texts into a Pytorch tensor, with encodings of length maxlength, using padding for shorter texts.</li>
        <li>Extract the input_ids, move them to the GPU, and return them as the function output</li>
    </ul>
</font>

***

In [None]:
####### INSERT YOUR CODE HERE
def collateX(texts):
    tensors = tokenizer.batch_encode_plus(texts, max_length=maxlength, pad_to_max_length=True, return_tensors="pt")
    return tensors["input_ids"].to(device)

If implemented correctly, the following should produce a tensor with a batch of 5 training texts

In [None]:
collateX(X_train[0:5])

We will also need a collate function that works with pairs of (input text, output label).

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    Implement a function named <b>collateXY</b> that receives an iterable of tuples (text, labels) and performs the following operations:
    <ul>
        <li>Unzip the tuples to obtain a list of texts and a list of labels.</li>
        <li>Use collateX to obtain the pytorch tensor for the texts.</li>
        <li>Transform the labels into a torch tensor, cast them to floats, and move them to the GPU.</li>
        <li>Return both the encoded texts tensor and the labels tensor.</li>
    </ul>
</font>

***

In [None]:
####### INSERT YOUR CODE HERE
def collateXY(patterns):
    # Split input ids and targets
    train_idx, targets = zip(*patterns)
    X = collateX(train_idx)
    Y = torch.tensor(targets).float().to(device)
    return X, Y

If implemented properly, the following should produce a tuple with two tensors, one for the input texts and another one for the output labels

In [None]:
collateXY(zip(X_train[0:5], y_train[0:5]))
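The "unzip" step inside `collateXY` relies on a standard Python idiom: `zip(*pairs)` turns a sequence of pairs into a pair of sequences. A tiny example with made-up data:

```python
# Made-up (text, label) pairs, as a DataLoader would hand them to the collate function
pairs = [("first text", 1), ("second text", 0), ("third text", 2)]

# zip(*pairs) regroups the first elements together and the second elements together
texts, labels = zip(*pairs)

print(texts)
print(labels)
```

Both results are tuples, which is why the labels still need to be converted into a tensor afterwards.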

### DataLoaders

The main point of creating the collate functions is to build DataLoader objects that produce batches for the learning process. For instance, we can create a DataLoader that generates batches of size 16 made of training input-output pairs as follows.

In [None]:
from torch.utils.data import DataLoader
loader = DataLoader(list(zip(X_train, y_train)), batch_size=16, collate_fn=collateXY)

So we can run over the dataset generating batches as follows

In [None]:
for i, batch in enumerate(loader):
    print(f"Generated batch {i} with {len(batch[0])} patterns, shape {(batch[0].shape, batch[1].shape)}")
    # Only iterate over the first 5 batches
    if i >= 5:
        break
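Conceptually, a DataLoader configured like this slices the dataset into consecutive chunks and passes each chunk through the collate function. The sketch below reproduces that behaviour in plain Python with a hypothetical `simple_batches` generator and a toy collate function. It ignores features of the real `DataLoader` such as shuffling and worker processes.

```python
def simple_batches(dataset, batch_size, collate_fn):
    """Yield collated batches in order, without shuffling (a simplification)."""
    for start in range(0, len(dataset), batch_size):
        yield collate_fn(dataset[start:start + batch_size])

def toy_collate(patterns):
    # Same unzip idea as collateXY, but returning plain lists instead of tensors
    texts, labels = zip(*patterns)
    return list(texts), list(labels)

# Made-up dataset of 100 (text, label) pairs
toy_dataset = [(f"text {i}", i % 7) for i in range(100)]

batches = list(simple_batches(toy_dataset, batch_size=32, collate_fn=toy_collate))
batch_sizes = [len(texts) for texts, _ in batches]
print(len(batches), batch_sizes)
```

Note that the last batch is smaller whenever the dataset size is not a multiple of the batch size.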

Let's prepare the DataLoaders we will need for the training and evaluation processes below. These will be three loaders:
* A loader for running backpropagation through the network, which will provide batches of training input-output pairs.
* A loader for evaluating the performance over the training set, which produces batches of training inputs.
* A loader for evaluating the performance over the test set, which produces batches of test inputs.

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    Create the following DataLoader objects, configured as follows:
    <ul>
        <li><b>dataloader_trainXY</b>: producing batches of 16 training pairs (inputs, outputs).</li>
        <li><b>dataloader_trainX</b>: producing batches of 64 training inputs.</li>
        <li><b>dataloader_testX</b>: producing batches of 64 testing inputs.</li>
    </ul>
</font>

***

In [None]:
####### INSERT YOUR CODE HERE
dataloader_trainXY = DataLoader(list(zip(X_train, y_train)), batch_size=16, collate_fn=collateXY)
dataloader_trainX = DataLoader(X_train, batch_size=64, collate_fn=collateX)
dataloader_testX = DataLoader(X_test, batch_size=64, collate_fn=collateX)

If implemented correctly, the following cell should run without exceptions

In [None]:
assert len(next(iter(dataloader_trainXY))) == 2, "dataloader_trainXY must generate tuples of 2 elements"
assert torch.is_tensor(next(iter(dataloader_trainXY))[0]), "dataloader_trainXY must generate tensors"
assert torch.is_tensor(next(iter(dataloader_trainXY))[1]), "dataloader_trainXY must generate tensors"
assert torch.is_tensor(next(iter(dataloader_trainX))), "dataloader_trainX must generate tensors"
assert torch.is_tensor(next(iter(dataloader_testX))), "dataloader_testX must generate tensors"

### Optimizer

Now we will define the optimizer for the learning process. An algorithm that has provided good results is AdamW, a variant of Adam that decouples the weight decay from the gradient-based update.

In [None]:
from transformers import AdamW

The optimizer needs to know the parameters it is going to optimize over, as well as the learning rate and other Adam parameters. We will initialize it as follows.

In [None]:
optimizer = AdamW(distilbert.parameters(), lr=1e-4, eps=1e-8)
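
To give some intuition on what `lr` and `eps` control, the following is a rough sketch of the textbook AdamW update for a single scalar parameter. The function `adamw_step` and its default values are written here for illustration only. The library implementation works on whole tensors and may differ in details.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    # Decoupled weight decay: shrink the parameter directly, not through the gradient
    theta = theta * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and of its square
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction, needed because both averages start at zero
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # eps avoids a division by zero when the gradient history is tiny
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from theta=1.0 with gradient 2.0 and a large learning rate
theta, m, v = adamw_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1, lr=0.1)
print(theta)
```

In the first step the update has a size of roughly `lr` regardless of the gradient magnitude, because the gradient is divided by the square root of its own squared average.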

### Output and loss functions

The output and loss functions to use for the network must also be defined, as the classification head provided by DistilBERT ends without an activation or loss function. Since our problem is multilabel, we will use the sigmoid as the activation function for the output, so that each label receives its own independent probability.

In [None]:
from torch.nn import Sigmoid
output_activation = Sigmoid().to(device)

Regarding the loss function, again because we are in a multilabel problem, the binary cross-entropy is a good choice. But we will use the `BCEWithLogitsLoss` loss, which combines in the same layer the Sigmoid activation with the binary cross-entropy loss, for better numerical stability. Since this loss takes the raw logits as input, the Sigmoid defined above is only needed when generating predictions.

In [None]:
from torch.nn import BCEWithLogitsLoss
lossf = BCEWithLogitsLoss().to(device)
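
The numerical stability argument can be illustrated with plain Python floats. The two functions below, `bce_naive` and `bce_with_logits`, are written just for this example and handle a single logit. They sketch the kind of rearrangement that a fused sigmoid plus cross-entropy layer can exploit, and are not the actual Pytorch implementation.

```python
import math

def bce_naive(logit, target):
    """Sigmoid followed by cross-entropy, computed as two separate steps."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def bce_with_logits(logit, target):
    """Same loss, rearranged so that the log of a value near zero is never taken."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Both versions agree for moderate logits
print(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0))

# For a very confident wrong prediction the naive version breaks down:
# sigmoid(40) rounds to exactly 1.0, so log(1 - p) becomes log(0)
try:
    bce_naive(40.0, 0.0)
except ValueError as err:
    print("naive version failed:", err)

# The fused version still returns a finite loss
print(bce_with_logits(40.0, 0.0))
```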

### Training loop

We have all the basic pieces to write the training loop. Here we define a function that performs one training epoch over the model, following the general pattern for training Pytorch networks:

In [None]:
from tqdm import tqdm

def train_epoch(model, dataloader, optimizer, lossf):
    """Performs a training epoch of a model over a dataset, following a given optimizer and loss function

    Arguments:
        model: pytorch model to train
        dataloader: DataLoader object generating batches of inputs-outputs pairs
        optimizer: pytorch optimizer to use to minimize loss
        lossf: loss function to minimize during training
        
    Returns the loss attained by the model in this iteration
    """
    # Set the model in training mode
    model.train()
    # Initialize loss
    total_loss = 0
    # Iterate over the training batches produced by the dataloader
    for batch in tqdm(dataloader, desc="Training epoch", total=len(dataloader)):
        # Split inputs and outputs
        inputs, labels = batch
        
        ###### Forward step
        # Reset network gradients
        model.zero_grad()
        # Forward the inputs through the network
        outputs = model(inputs)
        # Extract values of output layer
        logits = outputs[0]
        # Apply loss function against expected outputs
        loss = lossf(logits, labels)
        # Extract loss as a standard python float
        total_loss += loss.item()
        
        ###### Backward step
        # Compute gradient
        loss.backward()
        # Clip gradient for numerical stability
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Perform one optimizer step using the computed gradient
        optimizer.step()
        
    # Average loss across all batches
    total_loss /= len(dataloader)
    return total_loss
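
The gradient clipping step in the loop above rescales all gradients together whenever their global norm exceeds the given threshold (1.0 in our case). A minimal sketch of that idea on a plain list of numbers could look as follows. The function `clip_by_global_norm` is a hypothetical helper written for this example, not part of Pytorch.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale is never larger than 1, so small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# A gradient with norm 5 is shrunk to norm (almost exactly) 1, keeping its direction
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

# A gradient with norm 0.5 is already below the threshold and stays the same
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)
```

Clipping by norm keeps the direction of the update and only limits its length, which prevents a single unusual batch from producing a huge parameter change.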

We also define a function that generates predictions for a dataset using a model

In [None]:
def predict(model, dataloader, outputf):
    """Generate predictions using a fine-tuned model
    
    Arguments:
        model: pytorch model to use to generate predictions
        dataloader: DataLoader object generating batches of inputs
        outputf: output activation function
    """
    # Set the model to evaluation mode
    model.eval()
    preds = []
    # Deactivate gradient calculations
    with torch.no_grad():
        # Iterate over evaluation batches produced by the dataloader
        for batch in tqdm(dataloader, desc="Model prediction", total=len(dataloader)):
            # Forward the inputs through the network
            outputs = model(batch)
            # Extract values of output layer
            logits = outputs[0]
            # Apply output activation function, extract predictions as a python list
            preds.extend(outputf(logits).tolist())
        
    return preds

The final piece to add is how to compute the performance of our model. Since this is an unbalanced problem, as most of the texts are not toxic, the ROC AUC is a good metric.

In [None]:
from sklearn.metrics import roc_auc_score
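
To see why the ROC AUC suits unbalanced problems, recall that for a single label it can be read as the probability that a randomly chosen positive pattern receives a higher score than a randomly chosen negative one. The sketch below implements that reading directly in plain Python. The function `pairwise_auc` and the toy data are made up for illustration, and the quadratic loop is only practical for tiny examples.

```python
def pairwise_auc(labels, scores):
    """ROC AUC for one binary label, as the fraction of correctly ranked (positive, negative) pairs."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # A tie between a positive and a negative counts as half a correct ranking
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Unbalanced toy problem: 9 clean texts and 1 toxic text
toy_labels = [0] * 9 + [1]

# A "model" that always answers "not toxic" is right 90% of the time...
always_clean = [0.0] * 10
accuracy = sum((s >= 0.5) == (l == 1) for l, s in zip(toy_labels, always_clean)) / len(toy_labels)
print("accuracy", accuracy)

# ...but its AUC reveals that it cannot separate the classes at all
print("AUC", pairwise_auc(toy_labels, always_clean))

# A small balanced example with one wrongly ranked pair out of four
print("AUC", pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```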

To establish a first baseline, we can check the performance of the model before any fine-tuning. We just need to run the `predict` function we just defined over the test DataLoader, and compare the predictions against the expected targets using the `roc_auc_score`.

In [None]:
preds = predict(distilbert, dataloader_testX, output_activation)
print("AUC score", roc_auc_score(y_test, preds))

## Fine-tuning the model

We have all the pieces we need to fine-tune the language model for our problem at hand. Let's go!

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    Write a loop that performs <b>5 epochs</b> of model training. At each epoch:
    <ul>
        <li>Perform a call to <b>train_epoch</b> with the appropriate parameters, saving the returned training loss into a variable.</li>
        <li>Obtain predictions for the training data by calling <b>predict</b>.</li>
        <li>Compute the ROC AUC of those predictions, and print it together with the training loss.</li>
    </ul>
</font>

***

In [None]:
####### INSERT YOUR CODE HERE
epochs = 5
for epoch in range(epochs):
    loss = train_epoch(distilbert, dataloader_trainXY, optimizer, lossf) 
    preds = predict(distilbert, dataloader_trainX, output_activation)
    print(f"Epoch {epoch}: training loss {loss}, AUC score {roc_auc_score(y_train, preds)}")

<img src="img/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
    Obtain predictions for the test dataset, and compute the ROC AUC for them.
</font>

***

In [None]:
####### INSERT YOUR CODE HERE
preds = predict(distilbert, dataloader_testX, output_activation)
print("AUC score", roc_auc_score(y_test, preds))

<center>
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.<br>
                          THIS IS THE END OF THE ASSIGNMENT<br>
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.<br>
</center>

<img src="img/pro.png" height="80" width="80" style="float: right;"/>

***
<font color=#259b4c>
    You can try to obtain better test performance by tuning the following parameters:
    <ul>
        <li>Number of training epochs.</li>
        <li>Training batch size.</li>
        <li>Optimizer learning rate.</li>
        <li>Model dropouts (change them in the model configuration).</li>
    </ul>
</font>

***

In [None]:
####### INSERT YOUR CODE HERE
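
One possible way to organize such a search is to enumerate every combination of candidate values up front and then train one model per combination. The sketch below only builds the list of configurations. The candidate values in `search_space` are arbitrary examples, not recommendations, and the training of each configuration is left out.

```python
from itertools import product

# Hypothetical candidate values for each parameter to tune
search_space = {
    "epochs": [3, 5],
    "batch_size": [16, 32],
    "lr": [1e-5, 5e-5, 1e-4],
    "dropout": [0.1, 0.2],
}

# Build one dictionary per combination of values
configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]

print(len(configs))  # 24 combinations: 2 * 2 * 3 * 2
print(configs[0])
```

Keep in mind that each configuration should start from a freshly loaded pre-trained model and a new optimizer, since otherwise the fine-tuning would continue from the weights left by the previous configuration.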