# Training Transformers models

<img src="https://i0.wp.com/wallur.com/wp-content/uploads/2016/12/transformers-background-1.jpg?w=1920">
<div align="right"><a href=http://wallur.com/wallpaper/36471>Image source</a></div>

In this notebook we will tackle the task of detecting toxic comments in social media, making use of a pre-trained Transformer-based language model to do so. The data we will use are a simplified version of the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) dataset, where comments with several toxicity labels have been simplified to just one label (multilabel to multiclass).

To avoid missing packages and compatibility issues you should run this notebook under the environment defined in the accompanying environment file, or make use of [Google Colaboratory](https://colab.research.google.com/). If you use Colaboratory make sure to [activate GPU support](https://colab.research.google.com/notebooks/gpu.ipynb).

If you are running this notebook in Google Colaboratory, you will need to install the transformers library by uncommenting and running the following line.

In [None]:
#pip install transformers==2.9

Let us set a random seed so experiments are reproducible across runs

In [None]:
import torch
import numpy as np
torch.manual_seed(0)
np.random.seed(0)

## Data loading

Data is provided as two separate files, one with texts for training the model and another one for testing. Both files are available in compressed form under the *data* folder.

In [None]:
import pandas as pd
train = pd.read_csv("data/toxic_multiclass_train.csv.zip", index_col="id")
test = pd.read_csv("data/toxic_multiclass_test.csv.zip", index_col="id")

If you have loaded the data properly, you should be able to visualize the first rows of each data set as follows

In [None]:
train.head(10)

In [None]:
test.head(10)

As you can see, the data files include a column *comment_text* with the text we must classify, and an additional column *toxicity* with the kind of toxicity present in the comment: *toxic*, *severe_toxic*, *obscene*, *threat*, *insult* or *identity_hate*, or *normal* if the text contains no toxicity.

### Reducing the training data

To allow faster experimenting, we will only use a portion of the data. Note that reducing the training data will result in worse model performance, and reducing the test data will result in a poorer estimate of the performance of the model. If you want to obtain the best results with the best confidence, do not run the following cell. But be prepared for a very, VERY long training!

In [None]:
import numpy as np

print(f"Training patterns before reduction: {len(train)}")
train = train.sample(int(len(train)/100), random_state=12345)
print(f"Training patterns after reduction:  {len(train)}")

print(f"Test patterns before reduction: {len(test)}")
test = test.sample(int(len(test)/100), random_state=12345)
print(f"Test patterns after reduction:  {len(test)}")

### Extract X and Y

In [None]:
from sklearn.preprocessing import LabelEncoder

X_train = train["comment_text"].values
X_test = test["comment_text"].values

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train["toxicity"].values)
y_test = label_encoder.transform(test["toxicity"].values)
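
To make the label encoding step concrete, here is a minimal pure-Python sketch of the kind of mapping `LabelEncoder` builds: the distinct class names are sorted and each one is assigned its position in that order. The toy label list and the helper names `fit_labels` and `transform_labels` are invented for illustration; this is not scikit-learn's implementation.

```python
def fit_labels(labels):
    """Return the sorted list of distinct class names."""
    return sorted(set(labels))


def transform_labels(labels, classes):
    """Map each class name to its index in the sorted class list."""
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels]


# Toy labels, not taken from the real data files
toy_labels = ["normal", "toxic", "normal", "insult", "obscene"]
classes = fit_labels(toy_labels)
encoded = transform_labels(toy_labels, classes)
print(classes)
print(encoded)
```

In this sketch a label that never appeared during fitting has no index. The same holds for `LabelEncoder`, so transforming the test labels can fail if the reduced training sample happens to miss one of the classes.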

## Transformers library

A convenient library to make use of Transformer-based language models is... [Transformers](https://github.com/huggingface/transformers)!

<img src="img/transformers_logo_name.png">

Transformers provides implementations of many language models, such as BERT and GPT-2. It also lets us use pre-trained versions of these models, which saves a lot of time when solving practical problems.

Let's start by importing an AutoConfig object, which allows us to specify the configuration details for a language model.

In [None]:
from transformers import AutoConfig

Among the provided models, in this notebook we will make use of DistilBERT, a distilled version of BERT that can obtain good accuracies while keeping the model size small. We will use the configuration to tell Transformers we want to use a pretrained version of DistilBERT, trained on a dataset of uncased data, since case might not be important for the problem at hand. We also need to specify that we will use this pre-trained model to solve a classification problem with a specific number of labels.

In [None]:
pretrained_model = 'distilbert-base-uncased'
num_labels = len(set(y_train))
config = AutoConfig.from_pretrained(pretrained_model, num_labels=num_labels)

We can check out the resulting configuration, which contains all the model parameters, like dropout rates, embedding sizes, and so on

In [None]:
config

## Tokenization

The first step in a language model pipeline is to tokenize the data. We can do so using an AutoTokenizer

In [None]:
from transformers import AutoTokenizer

Again, we will load a particular tokenizer: the one used for training DistilBERT. This tokenizer is pre-trained with an uncased dataset, following the pattern we specified in the configuration above

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

Let's check that the tokenizer works

In [None]:
tokenizer.tokenize("A long trip to Mordor")

Most BERT models use a word-pieces tokenizer, dividing text into tokens that might represent a whole word, or a part of a word if the word is not common in the language. Also, since we are using an uncased model, the tokenizer maps all words to lower case.
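
The word-piece idea can be sketched as a greedy longest-match-first search over a vocabulary. The function `wordpiece_tokenize` and the tiny `toy_vocab` below are made up for illustration, so the split they produce need not match what the real DistilBERT tokenizer returns for the same text.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary, far smaller than a real one
toy_vocab = {"a", "long", "trip", "to", "mor", "##dor"}
text = "A long trip to Mordor"
tokens = [p for w in text.lower().split() for p in wordpiece_tokenize(w, toy_vocab)]
print(tokens)
```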

We can also ask the tokenizer to transform the text directly into a list of vocabulary ids, which is the input format the model expects.

In [None]:
tokenizer.encode("A long trip to Mordor")

This encoding also adds the special tokens `[CLS]` and `[SEP]` required by BERT models at the beginning and end of each text. We can check that out as follows:

In [None]:
for token_id in tokenizer.encode("A long trip to Mordor"):
    print(f'{token_id} -> {tokenizer.decode([token_id])}')

If we want to tokenize more than one text we can use the `batch_encode_plus` function. We can configure this function to make sure that every encoded text has the same length, which will be useful when working in batches on the GPU. In the following example we will use a common length of 10, which manages to cover all tokens in every one of these sample texts.

In [None]:
texts = [
    "A long trip to Mordor", 
    "Our mind a sea",
    "Mabuka is the end of light"
]

tokenizer.batch_encode_plus(texts, max_length=10, pad_to_max_length=True)

`batch_encode_plus` returns a dictionary with up to three entries:

* `input_ids`: the ids of the tokens encoding each of the texts.
* `attention_mask`: 0/1 indicators telling whether the attention layers should consider this token in the mixings or not. Padding tokens always get a 0 value in the mask.
* (optional) `token_type_ids`: for language models that learn with pairs of sentences, 0/1 indicators telling the sentence to which each token belongs.

All the returned elements will be necessary as inputs to the language model. Since DistilBERT does not learn from pairs of sentences, the `token_type_ids` entry is not returned.
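
The relation between padding and the attention mask can be illustrated with a small pure-Python sketch. The helper `pad_batch`, the padding id `0` and the token ids below are assumptions made for the example. The naive truncation used here simply cuts the list, whereas a real tokenizer takes care of keeping the special tokens.

```python
def pad_batch(batch_ids, max_length, pad_id=0):
    """Truncate or pad each id list to max_length and build the attention masks."""
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        ids = ids[:max_length]  # naive truncation, for illustration only
        n_pad = max_length - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 for real tokens, 0 for padding positions
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two short texts
toy_batch = [[101, 7, 8, 102], [101, 9, 102]]
padded = pad_batch(toy_batch, max_length=5)
print(padded["input_ids"])
print(padded["attention_mask"])
```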

We can also ask the `batch_encode_plus` function to produce Tensorflow or Pytorch tensors instead of python lists. For instance, to obtain Pytorch tensors we will do as follows:

In [None]:
tokenizer.batch_encode_plus(texts, max_length=10, pad_to_max_length=True, return_tensors="pt")

The returned structure is the same, but now each entry is a Pytorch tensor.

Now, what would be the ideal maximum length for encoding our texts? BERT accepts input texts of up to 512 tokens, but always using this maximum length will result in slow training. We can try tokenizing all texts without a length limit and study the distribution of text lengths.

In [None]:
encoded = tokenizer.batch_encode_plus(X_train)
lengths = [len(x) for x in encoded["input_ids"]]

We will use the numpy function `quantile` to obtain a length in which 90% of the documents can fit

In [None]:
import numpy as np

maxlength = int(np.quantile(lengths, 0.9))
maxlength

We will use this maximum length later on.
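
To see why a quantile is a reasonable choice here, the sketch below computes a 0.9 quantile by linear interpolation between sorted values, which is the default behaviour of `np.quantile`. The helper `linear_quantile` and the toy lengths are invented for the example. Note how the single very long text does not drive the chosen length.

```python
import math


def linear_quantile(values, q):
    """Quantile with linear interpolation between the two closest sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


# Hypothetical token counts, with one very long outlier
toy_lengths = [5, 8, 12, 15, 20, 22, 30, 41, 60, 250]
toy_maxlength = int(linear_quantile(toy_lengths, 0.9))

# Fraction of texts that fit completely in the chosen length
coverage = sum(n <= toy_maxlength for n in toy_lengths) / len(toy_lengths)
print(toy_maxlength, coverage)
```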

## DistilBERT model

We are now ready to explore the DistilBERT model. Let's load the pre-trained version of DistilBERT using an AutoModel class

In [None]:
from transformers import AutoModel

distilbert = AutoModel.from_pretrained('distilbert-base-uncased')

This pre-trained version contains the "body" of the model, which can receive a sequence of tokens and produce the "contextualized" embeddings for each one of those tokens.

<img src="http://jalammar.github.io/images/bert-encoders-input.png">
<div align="right">Image credit: <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)</a></div>

We can try this with a batch of one text of the training data, but remember we first need to transform it through the tokenizer and obtain Pytorch tensors

In [None]:
sample = tokenizer.batch_encode_plus(X_train[0:1], max_length=40, pad_to_max_length=True, return_tensors="pt")
sample

Now we can input the tensors into the DistilBERT model. A convenient way to pass all the entries of the dictionary to the model is using the unpacking operator `**`

In [None]:
outputs = distilbert(**sample)

A Transformers model always returns a tuple which might contain several pieces of information. In the case of DistilBERT, only a single object is returned, which is a pytorch tensor containing the embeddings

In [None]:
embeddings = outputs[0]
print(f"Input tensor shape {sample['input_ids'].shape}")
print(f"Input tensor values {sample['input_ids']}")
print(f"DistilBERT embeddings shape {embeddings.shape}")
print(f"DistilBERT embeddings values {embeddings}")

DistilBERT returns an embedding vector of 768 numbers for each input token.

Although it is tempting to use these embeddings as features for the toxic classification task, this approach does not generally give good results. Instead, it is advisable to add a classification "head" to the model, growing out of the embedding produced for the `[CLS]` special token, and fine-tune the whole model to the task through back-propagation.

<img src="http://jalammar.github.io/images/bert-classifier.png">
<div align="right">Image credit: <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)</a></div>

The Transformers library can prepare all of this for us, by loading a version of DistilBERT with a Sequence Classification head. We will provide the configuration we prepared above, so the classification head produces as many outputs as classes in our problem.

In [None]:
from transformers import AutoModelForSequenceClassification
distilbert_classification = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', config=config)
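
The idea of a classification head growing out of the `[CLS]` embedding can be sketched with plain PyTorch. This is a simplified illustration under stated assumptions: random numbers stand in for the DistilBERT embeddings, the head is a single linear layer, and `toy_num_labels` is set to 7 to match the classes listed earlier. The head shipped with the library may contain additional layers.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size = 768    # embedding size returned by DistilBERT for each token
toy_num_labels = 7   # assumed number of classes, for illustration only

# Stand-in for the DistilBERT output: (batch, tokens, hidden_size)
toy_embeddings = torch.randn(2, 40, hidden_size)

# A single linear layer acting as a classification head
head = nn.Linear(hidden_size, toy_num_labels)

cls_embedding = toy_embeddings[:, 0]  # embedding of the first token, [CLS]
logits = head(cls_embedding)          # one unnormalized score per class
print(logits.shape)
```

During fine-tuning, gradients flow from these scores back through both the head and the body of the model.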

## Preparing for fine-tuning

Fine-tuning a Transformers model is not as simple as fitting a scikit-learn or Keras model: we will need to provide all the details on how to batch the data, as well as other details of the training procedure. To ease this task, we will first prepare some useful functions.

### Using the GPU

First, fine-tuning a language model on the CPU is not a good idea. We will be better off using GPUs. To do so, we first need to identify the computing device. The code below checks if GPUs are available in the system, and if so, prepares a Torch device to send the calculations there.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

The cell above should print `device(type='cuda')` if a GPU was found.

Note that if your machine has several GPUs, all of them will be used for training. To constrain this notebook to use a specific GPU, run jupyter notebook as 

   `CUDA_VISIBLE_DEVICES=X jupyter notebook`
   
where `X` is the number of the GPU you want to use. You can check your available GPUs and their usage with the following command:

In [None]:
!nvidia-smi

### Datasets

Any dataset used in Transformers must follow one constraint: the dataset must be an iterable of samples. But what is a sample? A text together with the outputs we expect our model to generate for it. In our classification problem each sample will be a text together with its class, so a natural way to organize a sample is as a tuple (text, class). We will prepare our training and test data in this format:

In [None]:
train_dataset = list(zip(X_train, y_train))
eval_dataset = list(zip(X_test, y_test))

### Collator

Now, we will need a <b>collate class</b> that receives a part of a dataset (an iterable of samples) and performs the whole tokenization and encoding procedure to obtain Torch tensors on the GPU. This class should inherit from the Transformers `DataCollator` class, and implement a `collate_batch` method that receives an iterable of samples and returns an encoded batch. Here we provide an implementation for you:

In [None]:
from transformers import DataCollator

class TextClassificationCollator(DataCollator):
    """Data collator for a text classification problem"""
    
    def __init__(self, tokenizer, max_length):
        """Initializes the collator with a tokenizer and a maximum document length (in tokens)"""
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def encode_texts(self, texts):
        """Transforms an iterable of texts into a dictionary of model input tensors, stored in the GPU"""
        # Tokenize and encode texts as tensors, with maximum length
        tensors = self.tokenizer.batch_encode_plus(
            texts, 
            max_length=self.max_length, 
            pad_to_max_length=True, 
            return_tensors="pt"
        )
        # Move tensors to GPU
        for key in tensors:
            tensors[key] = tensors[key].to(device)
        return tensors
    
    def collate_batch(self, patterns):
        """Collate a batch of patterns
        
        Arguments:
            - patterns: iterable of tuples in the form (text, class)
            
        Output: dictionary of torch tensors ready for model input
        """
        # Split texts and classes from the input list of tuples
        texts, targets = zip(*patterns)
        # Encode inputs
        input_tensors = self.encode_texts(texts)
        # Transform class labels to a tensor in GPU
        Y = torch.tensor(targets).long().to(device)
        # Return batch as a dictionary with all the input tensors and the labels
        batch = {**input_tensors, "labels": Y}
        return batch

With the collator class defined, we will create an instance for the particular tokenizer and maximum sequence length we have chosen above

In [None]:
collator = TextClassificationCollator(tokenizer, maxlength)
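
The collator relies on the `zip(*...)` idiom to split a list of (text, class) tuples into one tuple of texts and one tuple of classes. The following sketch shows that step in isolation, using a made-up dataset and a hand-written batching loop in place of the batching that happens during training.

```python
# Hypothetical dataset in the (text, class) format used above
toy_dataset = [("first comment", 2), ("second comment", 0), ("third comment", 2)]
batch_size = 2

# Cut the dataset into consecutive batches; the last one may be smaller
batches = [toy_dataset[i:i + batch_size] for i in range(0, len(toy_dataset), batch_size)]

for patterns in batches:
    # Unzip the batch: one tuple with the texts, another with the classes
    texts, targets = zip(*patterns)
    print(texts, targets)

first_texts, first_targets = zip(*batches[0])
```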

### Training arguments

The next step is creating a `TrainingArguments` object. This object allows us to specify the training procedure details. For this notebook we will use a training batch size of 16 per GPU, which should fit into a small GPU. We will also train the model for only 1 epoch over the training data, so we can check the results quickly.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./models/toxic_model",  # Folder in which to save the trained model
    overwrite_output_dir=True,  # Whether to overwrite previous models found in the output folder
    per_gpu_train_batch_size=16,  # batch size during training
    per_gpu_eval_batch_size=128,  # batch size during evaluation (prediction)
    num_train_epochs=1,  # Model training epochs
    logging_steps=25,  # After how many training steps (batches) a log message showing progress will be printed
    save_steps=25  # After how many training steps (batches) the model will be checkpointed to disk
)
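
As a rough guide to what these settings imply, the arithmetic below estimates how many training steps, log messages and checkpoints a run produces. The dataset size is hypothetical, and the sketch assumes a single GPU, no gradient accumulation, and that the last incomplete batch is still processed.

```python
import math

toy_train_size = 1500  # hypothetical number of training samples
batch_size = 16        # per-GPU training batch size, assuming one GPU
num_epochs = 1
logging_steps = 25
save_steps = 25

# One step per batch; the last batch may be incomplete
steps_per_epoch = math.ceil(toy_train_size / batch_size)
total_steps = steps_per_epoch * num_epochs

# A log message or checkpoint is produced every 25 steps
n_logs = total_steps // logging_steps
n_checkpoints = total_steps // save_steps
print(steps_per_epoch, total_steps, n_logs, n_checkpoints)
```

Checkpointing every 25 steps can fill the output folder quickly on a full-size run, so `save_steps` is worth increasing when training on all the data.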

### Trainer

The last step of the preparation is to create a `Trainer` object. This is the object that performs the actual training, and it needs to receive all the information we prepared above:

* The model to be fine-tuned
* The `TrainingArguments` object we prepared above
* Training and evaluation datasets
* The DataCollator object that will batch the data.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=distilbert_classification,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collator
)

Everything is ready for training!

## Model training and evaluation

Once all the work above is done, training is easy

In [None]:
%%time
trainer.train()

Once the model is trained we can obtain the model predictions with the `predict` method of the `Trainer` object

In [None]:
preds = trainer.predict(eval_dataset)

We can see that in a `ForSequenceClassification` model each prediction is a list of **unnormalized** scores (logits), one for each class.

In [None]:
preds.predictions[0]

To obtain actual probabilities we need to apply a `softmax` function, which forces each value into the range `[0, 1]` and makes all the values for a sample sum up to `1`.

In [None]:
from scipy.special import softmax
probs = softmax(preds.predictions, axis=1)

We can now check that the numbers we have indeed look like probabilities

In [None]:
probs[0]
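
For reference, the softmax of a single row of scores can be written in a few lines of pure Python. The helper `softmax_row` and the toy scores are invented for this sketch. Subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import math


def softmax_row(logits):
    """Turn a list of unnormalized scores into probabilities."""
    shift = max(logits)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for a three-class problem
toy_logits = [2.0, -1.0, 0.5]
toy_probs = softmax_row(toy_logits)
print(toy_probs)
```

The ordering of the classes is preserved: the class with the largest score also gets the largest probability.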

Now we can use these probabilities to compute any standard metric from scikit-learn. For instance, ROC AUC

In [None]:
from sklearn.metrics import roc_auc_score
print("AUC score", roc_auc_score(y_test, probs, multi_class='ovr'))
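
Metrics that need a single predicted class per text, such as accuracy, require turning each row of probabilities into the index of its largest value. The sketch below does this with made-up probabilities and class names. In the notebook, the same mapping from indices back to names is what `label_encoder.inverse_transform` provides.

```python
# Hypothetical class names and probabilities, for illustration only
toy_classes = ["insult", "normal", "toxic"]
toy_probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


pred_ids = [argmax(row) for row in toy_probs]
pred_names = [toy_classes[i] for i in pred_ids]
print(pred_ids, pred_names)
```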