# Ironic Sentences - with DistilBERT

This kernel is adapted from [this tutorial kernel](https://www.kaggle.com/maroberti/fastai-with-transformers-bert-roberta), which is a supplement to the Medium article ["Fastai with 🤗Transformers (BERT, RoBERTa, XLNet, XLM, DistilBERT)"](https://medium.com/p/fastai-with-transformers-bert-roberta-xlnet-xlm-distilbert-4f41ee18ecb2?source=email-29c8f5cf1dc4--writer.postDistributed&sk=119c3e5d748b2827af3ea863faae6376). Make sure to upvote the original tutorial too if you found this helpful.

## Transformers: another type of transfer learning

This kernel uses pretrained transformers to perform a classification task. In another kernel, I use the [ULMFiT technique]().

Since the introduction of ULMFiT, **Transfer Learning** has become very popular in NLP, and Google (BERT, Transformer-XL, XLNet), Facebook (RoBERTa, XLM) and OpenAI (GPT, GPT-2) have begun to pre-train their own models on very large corpora. This time, instead of using the AWD-LSTM neural network, they all used a more powerful architecture based on the Transformer (cf. [Attention is all you need](https://arxiv.org/abs/1706.03762)).

Although these models are powerful, ``fastai`` does not integrate all of them. Fortunately, [HuggingFace](https://huggingface.co/) 🤗 created the well-known [transformers library](https://github.com/huggingface/transformers). Formerly known as ``pytorch-transformers`` or ``pytorch-pretrained-bert``, this library brings together over 40 state-of-the-art pre-trained NLP models (BERT, GPT-2, RoBERTa, CTRL…). It also provides useful additional utilities such as tokenizers, optimizers and schedulers.

The ``transformers`` library can be used on its own, but incorporating it into ``fastai`` gives a simpler implementation that is compatible with powerful fastai tools like **Discriminative Learning Rates**, **Gradual Unfreezing** and **Slanted Triangular Learning Rates**. The point here is to let non-NLP-experts easily get state-of-the-art results and therefore "make NLP uncool again".

Before beginning the implementation, note that ``transformers`` can be integrated into ``fastai`` in several different ways. The tutorial author decided to implement generic and flexible solutions. Specifically, he made as few modifications as possible to both libraries while keeping them compatible with as many transformer architectures as possible. Check out the [original tutorial](https://www.kaggle.com/maroberti/fastai-with-transformers-bert-roberta) to see other transformer architectures.

This notebook contains the following sections:
1. Loading the data
1. Integrating transformers and fastai for data processing
    - Custom Tokenizer
    - Custom Numericalizer
    - Custom Processor
1. Preparing cross validations and testing the DataBunch
1. Creating a custom model
1. Training the model
1. Interpreting the results



# Loading the data

Before starting the implementation, you will need to install the ``fastai`` and ``transformers`` libraries. To do so, just follow the instructions [here](https://github.com/fastai/fastai/blob/master/README.md#installation) and [here](https://github.com/huggingface/transformers#installation).

In Kaggle, the ``fastai`` library is already installed, so you just have to install ``transformers`` with:

In [None]:
%%bash
pip install transformers

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from pathlib import Path 
from collections import defaultdict

import os

import torch
import torch.optim as optim

import random 

# fastai
from fastai import *
from fastai.text import *
from fastai.callbacks import *

# transformers
from transformers import PreTrainedModel, PreTrainedTokenizer, PretrainedConfig
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer, DistilBertConfig

At the time of writing, the versions of the fastai and transformers libraries are 1.0.58 and 2.1.1 respectively.

In [None]:
import fastai
import transformers
print('fastai version :', fastai.__version__)
print('transformers version :', transformers.__version__)

Load and check the Irony Corpus data

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
irony_data = pd.read_csv('/kaggle/input/ironic-corpus/irony-labeled.csv')
print(irony_data.shape)
irony_data.head()

In [None]:
irony_data.label.value_counts()

# Integrating transformers and fastai for data processing

In ``transformers``, each model architecture is associated with 3 main types of classes:
* A **model class** to load/store a particular pre-trained model.
* A **tokenizer class** to pre-process the data and make it compatible with a particular model.
* A **configuration class** to load/store the configuration of a particular model.

For example, if you want to use the BERT architecture for text classification, you would use [``BertForSequenceClassification``](https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification) for the **model class**, [``BertTokenizer``](https://huggingface.co/transformers/model_doc/bert.html#berttokenizer) for the **tokenizer class** and [``BertConfig``](https://huggingface.co/transformers/model_doc/bert.html#bertconfig) for the **configuration class**.

First, we set some parameters and create a utility function for seeding the random number generators.

In [None]:
seed = 42
use_fp16 = False
bs = 16

model_type = 'distilbert'
pretrained_model_name = 'distilbert-base-uncased'

MODEL_CLASSES = {
    'distilbert': (DistilBertForSequenceClassification, DistilBertTokenizer, DistilBertConfig)
}

model_class, tokenizer_class, config_class = MODEL_CLASSES[model_type]

Although I've chosen to use the 'distilbert-base-uncased' model, there are a few DistilBERT models to choose from.

In [None]:
model_class.pretrained_model_archive_map.keys()

Function to set the seed for generating random numbers.

In [None]:
def seed_all(seed_value):
    random.seed(seed_value) # Python's built-in RNG
    np.random.seed(seed_value) # NumPy RNG
    torch.manual_seed(seed_value) # PyTorch CPU RNG
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value) # current GPU
        torch.cuda.manual_seed_all(seed_value) # all GPUs
        torch.backends.cudnn.deterministic = True  # use deterministic cuDNN algorithms
        torch.backends.cudnn.benchmark = False # disable the non-deterministic auto-tuner

In [None]:
seed_all(seed)
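
As a minimal, library-free sketch of what seeding buys us, the snippet below reseeds Python's built-in generator and checks that the same seed reproduces the same draws. The helper name `sample_after_seeding` is made up for this illustration. `seed_all` above applies the same idea to NumPy and PyTorch as well.

```python
import random

def sample_after_seeding(seed_value, n=3):
    """Reseed Python's RNG, then draw n pseudo-random floats."""
    random.seed(seed_value)
    return [random.random() for _ in range(n)]

first = sample_after_seeding(42)
second = sample_after_seeding(42)   # same seed, so the same sequence
other = sample_after_seeding(43)    # different seed, so a different sequence

print(first == second)
print(first == other)
```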

To match pre-training, the model input sequence has to follow a specific format.
To get there, you first **tokenize** and then **numericalize** the texts correctly.
The difficulty is that each pre-trained model we fine-tune requires exactly the same pre-processing (**tokenization** and **numericalization**) as the one used during pre-training.
Fortunately, the **tokenizer class** from ``transformers`` provides the correct pre-processing tools for each pre-trained model.
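
To see why the pre-processing has to match, here is a small toy sketch. The vocabulary and both tokenizer functions are hypothetical and much simpler than DistilBERT's real tokenizer. They only show that a mismatched pre-process maps known words to the unknown-token id.

```python
# Toy illustration only: TOY_VOCAB and both tokenizers are hypothetical.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "what": 2, "a": 3, "great": 4, "idea": 5}

def tokenize_like_pretraining(text):
    """Lowercase and split on whitespace, as the toy model 'saw' in pre-training."""
    return text.lower().split()

def tokenize_differently(text):
    """Split on whitespace but keep the casing (a mismatched pre-process)."""
    return text.split()

def numericalize(tokens, vocab=TOY_VOCAB):
    """Map each token to its id, falling back to the [UNK] id for unseen tokens."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

text = "What a GREAT idea"
matched_ids = numericalize(tokenize_like_pretraining(text))
mismatched_ids = numericalize(tokenize_differently(text))

print(matched_ids)     # [2, 3, 4, 5]
print(mismatched_ids)  # [1, 3, 1, 5]
```

With the mismatched tokenizer, two of the four words collapse to `[UNK]`, so the model would lose information it could otherwise use.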

In the ``fastai`` library, data pre-processing is done automatically during the creation of the ``DataBunch``.
As you will see in the ``DataBunch`` implementation, the **tokenizer** and **numericalizer** are passed in the ``processor`` argument in the following format:

``processor = [TokenizeProcessor(tokenizer=tokenizer,...), NumericalizeProcessor(vocab=vocab,...)]``

Let's first look at how we can integrate the ``transformers`` **tokenizer** into the ``TokenizeProcessor`` class.

## Custom Tokenizer

This part can be a little confusing because many classes with similar names are wrapped in each other.
To summarize, if we look closely at the ``fastai`` implementation, we notice that:
1. The [``TokenizeProcessor`` object](https://docs.fast.ai/text.data.html#TokenizeProcessor) takes a ``Tokenizer`` object as its ``tokenizer`` argument.
2. The [``Tokenizer`` object](https://docs.fast.ai/text.transform.html#Tokenizer) takes a ``BaseTokenizer`` object as its ``tok_func`` argument.
3. The [``BaseTokenizer`` object](https://docs.fast.ai/text.transform.html#BaseTokenizer) implements the function ``tokenizer(t:str) → List[str]``, which takes a text ``t`` and returns the list of its tokens.

Therefore, we can simply create a new class ``TransformersBaseTokenizer`` that inherits from ``BaseTokenizer`` and overrides the ``tokenizer`` function.

In [None]:
class TransformersBaseTokenizer(BaseTokenizer):
    """Wrapper around PreTrainedTokenizer to be compatible with fast.ai"""
    def __init__(self, pretrained_tokenizer: PreTrainedTokenizer, model_type = 'bert', **kwargs):
        self._pretrained_tokenizer = pretrained_tokenizer
        self.max_seq_len = pretrained_tokenizer.max_len
        self.model_type = model_type

    def __call__(self, *args, **kwargs): 
        return self

    def tokenizer(self, t:str) -> List[str]:
        """Limits the maximum sequence length and adds the special tokens"""
        CLS = self._pretrained_tokenizer.cls_token
        SEP = self._pretrained_tokenizer.sep_token
        if self.model_type in ['roberta']:
            tokens = self._pretrained_tokenizer.tokenize(t, add_prefix_space=True)[:self.max_seq_len - 2]
        else:
            tokens = self._pretrained_tokenizer.tokenize(t)[:self.max_seq_len - 2]
        return [CLS] + tokens + [SEP]
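
The truncation logic in ``tokenizer`` can be tried without either library. The sketch below uses a hypothetical helper, `wrap_tokens`, and plain whitespace tokens in place of the real tokenizer. It reserves two positions for the special tokens so the final sequence never exceeds `max_seq_len`.

```python
def wrap_tokens(tokens, max_seq_len, cls_token="[CLS]", sep_token="[SEP]"):
    """Truncate to leave room for two special tokens, then add them."""
    return [cls_token] + tokens[:max_seq_len - 2] + [sep_token]

tokens = "this is surely the best plan ever".split()  # 7 whitespace tokens

short = wrap_tokens(tokens, max_seq_len=16)  # fits: nothing is cut
long = wrap_tokens(tokens, max_seq_len=5)    # too long: only 3 tokens survive

print(short)
print(long)  # ['[CLS]', 'this', 'is', 'surely', '[SEP]']
```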

In [None]:
transformer_tokenizer = tokenizer_class.from_pretrained(pretrained_model_name)
transformer_base_tokenizer = TransformersBaseTokenizer(pretrained_tokenizer = transformer_tokenizer, model_type = model_type)
fastai_tokenizer = Tokenizer(tok_func = transformer_base_tokenizer, pre_rules=[], post_rules=[])

In [None]:
tokenizer_class.pretrained_vocab_files_map

## Custom Numericalizer

In ``fastai``, the [``NumericalizeProcessor`` object](https://docs.fast.ai/text.data.html#NumericalizeProcessor) takes a [``Vocab`` object](https://docs.fast.ai/text.transform.html#Vocab) as its ``vocab`` argument.
Here, we create a new class ``TransformersVocab`` that inherits from ``Vocab`` and overrides the ``numericalize`` and ``textify`` functions with ``convert_tokens_to_ids`` and ``convert_ids_to_tokens`` respectively.

In [None]:
class TransformersVocab(Vocab):
    def __init__(self, tokenizer: PreTrainedTokenizer):
        super(TransformersVocab, self).__init__(itos = [])
        self.tokenizer = tokenizer
    
    def numericalize(self, t:Collection[str]) -> List[int]:
        "Convert a list of tokens `t` to their ids."
        return self.tokenizer.convert_tokens_to_ids(t)
        #return self.tokenizer.encode(t)

    def textify(self, nums:Collection[int], sep=' ') -> List[str]:
        "Convert a list of `nums` to their tokens."
        nums = np.array(nums).tolist()
        return sep.join(self.tokenizer.convert_ids_to_tokens(nums)) if sep is not None else self.tokenizer.convert_ids_to_tokens(nums)
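
The round trip between tokens and ids can be sketched with a toy stand-in. `ToyVocab` and its six-entry vocabulary are hypothetical and only mirror the shape of the two overridden methods, including the `sep=None` case that returns a list instead of a string.

```python
class ToyVocab:
    """Hypothetical stand-in for the tokenizer's token/id lookups."""
    def __init__(self, itos):
        self.itos = list(itos)                                   # id -> token
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}  # token -> id
        self.unk_id = self.stoi["[UNK]"]

    def numericalize(self, tokens):
        """Convert a list of tokens to their ids ([UNK] for unseen tokens)."""
        return [self.stoi.get(tok, self.unk_id) for tok in tokens]

    def textify(self, ids, sep=" "):
        """Convert ids back to tokens, joined by `sep` unless `sep` is None."""
        toks = [self.itos[i] for i in ids]
        return sep.join(toks) if sep is not None else toks

vocab = ToyVocab(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "so", "funny"])
ids = vocab.numericalize(["[CLS]", "so", "funny", "[SEP]"])

print(ids)                           # [2, 4, 5, 3]
print(vocab.textify(ids))            # [CLS] so funny [SEP]
print(vocab.textify(ids, sep=None))  # ['[CLS]', 'so', 'funny', '[SEP]']
```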

## Custom Processor

Now that we have our custom **tokenizer** and **numericalizer**, we can create the custom **processor**. Notice that we pass the ``include_bos = False`` and ``include_eos = False`` options. This is because ``fastai`` adds its own special tokens by default, which would interfere with the ``[CLS]`` and ``[SEP]`` tokens added by our custom tokenizer.

In [None]:
transformer_vocab =  TransformersVocab(tokenizer = transformer_tokenizer)
numericalize_processor = NumericalizeProcessor(vocab=transformer_vocab)

tokenize_processor = TokenizeProcessor(tokenizer=fastai_tokenizer, include_bos=False, include_eos=False)

transformer_processor = [tokenize_processor, numericalize_processor]

When creating the DataBunch, take care to set the ``processor`` argument to our new custom processor ``transformer_processor`` and to handle the padding correctly.

As mentioned in the HuggingFace documentation, DistilBERT models use absolute position embeddings, so it's usually advised to pad the inputs on the right rather than the left. XLNet, by contrast, uses relative position embeddings, so its inputs can be padded on either the right or the left.

In [None]:
pad_first = bool(model_type in ['xlnet'])
pad_idx = transformer_tokenizer.pad_token_id
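
To make the effect of ``pad_first`` concrete, here is a small sketch of right versus left padding. `pad_batch` is a hypothetical helper, not the fastai collate function, and the token ids are illustrative.

```python
def pad_batch(sequences, pad_idx, pad_first=False):
    """Pad every sequence to the longest one, on the left or on the right."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_idx] * (max_len - len(seq))
        padded.append(padding + list(seq) if pad_first else list(seq) + padding)
    return padded

batch = [[101, 7, 8, 102], [101, 9, 102]]  # illustrative ids only

right_padded = pad_batch(batch, pad_idx=0, pad_first=False)
left_padded = pad_batch(batch, pad_idx=0, pad_first=True)

print(right_padded)  # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(left_padded)   # [[101, 7, 8, 102], [0, 101, 9, 102]]
```

With right padding, the first real token always sits at position 0, which is what a model with absolute position embeddings saw during pre-training.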

# Preparing cross validations and testing the DataBunch

Now that we have integrated the transformers components into the fastai architecture, we can begin to create the data structures we will need. Since the original paper used 5-fold cross validation, we divide the data into 5 folds using `sklearn`.

In [None]:
from sklearn.model_selection import KFold

X = irony_data['comment_text']
y = irony_data['label']
kf = KFold(n_splits=5)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator

Here, we collect the 'train' and 'test' (i.e., validation) indices of each fold.

In [None]:
trains = list()
tests = list()
for train_index, test_index in kf.split(X):
    trains.append(train_index)
    tests.append(test_index)
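
Because `KFold` is created without `shuffle=True`, the validation folds are contiguous blocks of row indices. The pure-Python sketch below mimics that behaviour for illustration. `contiguous_kfold` is a hypothetical helper, not the sklearn implementation.

```python
def contiguous_kfold(n_samples, n_splits):
    """Yield (train_indices, test_indices) for unshuffled, contiguous folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

splits = list(contiguous_kfold(n_samples=12, n_splits=5))
for train, test in splits:
    print(test)

# every sample lands in exactly one validation fold
all_test = sorted(i for _, test in splits for i in test)
print(all_test == list(range(12)))
```

If the rows of the CSV happen to be ordered in some meaningful way, contiguous folds may not be representative. In that case `KFold(n_splits=5, shuffle=True, random_state=seed)` is an option worth considering.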

Each fold is then condensed into one dataframe, with the validation set marked in the `is_valid` column.

In [None]:
def create_validation(valnum):
    # training rows of this fold
    train = {'comment_text': X[trains[valnum]], 'label': y[trains[valnum]],'is_valid':False}
    dftrain = pd.DataFrame(data=train)
    
    # validation rows of this fold
    valid = {'comment_text': X[tests[valnum]], 'label': y[tests[valnum]],'is_valid':True}
    dfvalid = pd.DataFrame(data=valid)
    
    # pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
    alldata = pd.concat([dftrain, dfvalid])
    
    return alldata

In [None]:
fold1 = create_validation(0)
fold2 = create_validation(1)
fold3 = create_validation(2)
fold4 = create_validation(3)
fold5 = create_validation(4)

Fastai uses a structure called a `DataBunch` to hold everything related to the data: the processor, training and testing sets. Let's do a check to make sure it works correctly. 

In [None]:
bs = 10
databunch = (TextList.from_df(fold2, cols='comment_text', processor=transformer_processor)
             .split_from_df(col='is_valid')
             .label_from_df(cols= 'label')
             .databunch(bs=bs, pad_first=pad_first, pad_idx=pad_idx))

Check batch and tokenizer :

In [None]:
print('[CLS] token :', transformer_tokenizer.cls_token)
print('[SEP] token :', transformer_tokenizer.sep_token)
print('[PAD] token :', transformer_tokenizer.pad_token)
databunch.show_batch()

Check batch and numericalizer :

In [None]:
print('[CLS] id :', transformer_tokenizer.cls_token_id)
print('[SEP] id :', transformer_tokenizer.sep_token_id)
print('[PAD] id :', pad_idx)
test_one_batch = databunch.one_batch()[0]
print('Batch shape : ',test_one_batch.shape)
print(test_one_batch)

# Creating a custom model

As mentioned [here](https://github.com/huggingface/transformers#models-always-output-tuples), every model's forward method always outputs a ``tuple`` whose elements depend on the model and the configuration parameters. In our case, we only need the logits. [Note here](https://huggingface.co/transformers/_modules/transformers/modeling_distilbert.html#DistilBertForSequenceClassification) that, when no labels are passed, the logits are in the first position of the DistilBERT classifier's `outputs`.

Here we access them by creating a custom model.

In [None]:
# defining our model architecture 
class CustomTransformerModel(nn.Module):
    def __init__(self, transformer_model: PreTrainedModel):
        super(CustomTransformerModel,self).__init__()
        self.transformer = transformer_model
        
    def forward(self, input_ids, attention_mask=None):
        
        #attention_mask = (input_ids!=1).type(input_ids.type()) # Test attention_mask for RoBERTa
        
        logits = self.transformer(input_ids,
                                attention_mask = attention_mask)[0]   
        return logits

To adapt the transformer to our classification task, we need to specify the number of labels before loading the pre-trained model. To do so, you can either modify the config instance or, as in [Keita Kurita's article](https://mlexplained.com/2019/05/13/a-tutorial-to-fine-tuning-bert-with-fast-ai/) (Section: *Initializing the Learner*), pass the ``num_labels`` argument.

Here, we have two labels: 'ironic' and 'not ironic'.

In [None]:
config = config_class.from_pretrained(pretrained_model_name)
config.num_labels = 2
config.use_bfloat16 = use_fp16
print(config)

In [None]:
transformer_model = model_class.from_pretrained(pretrained_model_name, config = config)
# alternative: transformer_model = model_class.from_pretrained(pretrained_model_name, num_labels = 2)

custom_transformer_model = CustomTransformerModel(transformer_model = transformer_model)

## Learner: Custom Optimizer / Custom Metric

In ``pytorch-transformers``, HuggingFace implemented two specific optimizers (BertAdam and OpenAIAdam), which have since been replaced by a single AdamW optimizer.
This optimizer matches the PyTorch Adam optimizer API, so it is straightforward to integrate it into ``fastai``.
Note that to reproduce the specific behavior of BertAdam, you have to set ``correct_bias = False``.
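
To get a feel for what ``correct_bias`` changes, the sketch below computes the size of the very first Adam update for a single scalar parameter, with and without bias correction. It is a simplified illustration of the standard Adam update, not the library code. The function name and the hyperparameter values are assumptions chosen for the example.

```python
import math

def adam_first_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6,
                    correct_bias=True):
    """Magnitude of the first Adam update for one scalar parameter."""
    step = 1
    exp_avg = (1 - beta1) * grad            # first-moment estimate, starts at 0
    exp_avg_sq = (1 - beta2) * grad * grad  # second-moment estimate, starts at 0
    step_size = lr
    if correct_bias:
        # compensate for both moment estimates being initialised at zero
        step_size *= math.sqrt(1 - beta2 ** step) / (1 - beta1 ** step)
    return step_size * exp_avg / (math.sqrt(exp_avg_sq) + eps)

corrected = adam_first_step(grad=1.0, correct_bias=True)
uncorrected = adam_first_step(grad=1.0, correct_bias=False)

print(corrected)
print(uncorrected)
print(uncorrected / corrected)
```

With these values the corrected first step is close to the learning rate itself, while the uncorrected one is roughly three times larger. This is one reason BertAdam-style training is usually paired with a warm-up schedule.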

Here, because the dataset is imbalanced, we want to use weighted cross entropy. The untrained learner is saved right after it is created: because we are going to use cross validation, every fold needs to start from the same clean state. Otherwise, weights trained on an earlier fold, whose training set contains items from the current validation set, would carry over and inflate our metrics.
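
As a rough sketch of what the class weights do, the function below computes weighted cross entropy by hand in plain Python. It follows the weighted-mean convention that PyTorch's `nn.CrossEntropyLoss(weight=...)` uses with its default mean reduction: each sample's loss is scaled by the weight of its target class, and the sum is divided by the sum of those weights. The logits and targets are made up for the example.

```python
import math

def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample cross entropy over a small batch."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        log_norm = math.log(sum(math.exp(z) for z in row))  # log-sum-exp
        nll = log_norm - row[target]       # negative log-likelihood of the target
        total += class_weights[target] * nll
        weight_sum += class_weights[target]
    return total / weight_sum

# Two samples, both confidently predicted as class 0; the second is really class 1.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]

unweighted = weighted_cross_entropy(logits, targets, class_weights=[1.0, 1.0])
weighted = weighted_cross_entropy(logits, targets, class_weights=[1.0, 3.0])

print(unweighted)
print(weighted)
```

The mistake on the minority class (index 1) dominates the weighted loss, so the model is pushed harder to get those examples right.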

In [None]:
weights = [1., 3.]
class_weights=torch.FloatTensor(weights).cuda()

In [None]:
from fastai.callbacks import *
from transformers import AdamW

learner = Learner(databunch, 
                  custom_transformer_model, 
                  loss_func = nn.CrossEntropyLoss(weight=class_weights),
                  opt_func = lambda input: AdamW(input,correct_bias=False), 
                  metrics=[Precision(),Recall(),FBeta(beta=1)])

# Show graph of learner stats and metrics after each epoch.
learner.callbacks.append(ShowGraph(learner))

# Put the learner in FP16 precision mode. --> Does not seem to work
if use_fp16: learner = learner.to_fp16()
learner.save('untrained')


# Training the model

Although we can use fastai's built-in features like **Slanted Triangular Learning Rates**, **Discriminative Learning Rates** and **gradually unfreezing the model**, it is unclear whether they would help training here. Instead, we unfreeze the whole model for training. Since there is some randomness involved, we run the model multiple times and average the output.

In [None]:
# constants across test
epochs = 2
n_reps = 2

folds = [fold1, fold2, fold3, fold4, fold5]
learning_rates = 1e-6
moms = (0.825,0.725)

metrics = np.zeros([len(folds),n_reps,3]) 

This loop creates a new databunch and new learner for each fold in the cross validation. The last metrics from each training cycle are retained. 

In [None]:
for reps in range(n_reps):
    
    for fold in range(0,len(folds)):
        
        # create a databunch with the current fold
        databunch = (TextList.from_df(folds[fold], cols='comment_text', processor=transformer_processor)
            .split_from_df(col='is_valid')
            .label_from_df(cols= 'label')
            .databunch(bs=bs, pad_first=pad_first, pad_idx=pad_idx))
        learner = Learner(databunch, 
            custom_transformer_model, 
            loss_func = nn.CrossEntropyLoss(weight=class_weights),
            opt_func = lambda input: AdamW(input,correct_bias=False), 
            metrics=[Precision(),Recall(),FBeta(beta=1)])
        if use_fp16: learner = learner.to_fp16()
    
        # start from the saved untrained weights
        learner.load('untrained')
        learner.unfreeze()
        #learner.freeze_to(unfreeze_layers)
        # train on current parameters
        learner.fit_one_cycle(epochs, max_lr=learning_rates, moms=moms)
        # keep the last epoch's (precision, recall, F1) for this fold and repetition
        metrics[fold, reps, :] = [float(m) for m in learner.recorder.metrics[-1]]

# Interpreting the results

Let's see how the results compare to the ones presented in the paper. 

In [None]:
metrics.shape

In [None]:
avg_per_fold = np.mean(metrics,axis=1);avg_per_fold

In [None]:
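def format_scores(avg_metrics):
    def print_line(name, arr):
        # mean across folds, followed by the lowest and highest fold score
        print(f"{name}: {np.mean(arr):.3f} (range {np.min(arr):.3f} - {np.max(arr):.3f})")
    
    # metric columns follow the Learner's order: precision, recall, F1
    print_line('F1 score', avg_metrics[:,2])
    print_line('recall', avg_metrics[:,1])
    print_line('precision', avg_metrics[:,0])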
    

In [None]:
format_scores(avg_per_fold)

Scores presented in the paper:
- average [F1 score](https://en.wikipedia.org/wiki/F1_score): 0.383 (range 0.330 - 0.412)
- average [recall](https://en.wikipedia.org/wiki/Precision_and_recall): 0.496 (range 0.446 - 0.548)
- average [precision](https://en.wikipedia.org/wiki/Precision_and_recall): 0.315 (range 0.261 - 0.380)
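
As a reminder of how these three metrics relate, here is a minimal sketch that derives them from confusion-matrix counts. The counts are hypothetical and are not taken from the paper or from this kernel's runs.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for the 'ironic' class
p, r, f1 = precision_recall_f1(tp=30, fp=60, fn=30)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it is pulled towards the lower of precision and recall. Note also that the average of per-fold F1 scores is generally not the F1 of the averaged precision and recall.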

Overall, the transformers perform comparably to the SVM. What I notice with the transformers is that they occasionally perform very badly at first. The minimum values can be very low on some folds, which brings the average down.

## References
* Hugging Face, Transformers GitHub (Nov 2019), [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
* Fast.ai, Fastai documentation (Nov 2019), [https://docs.fast.ai/text.html](https://docs.fast.ai/text.html)
* Jeremy Howard & Sebastian Ruder, Universal Language Model Fine-tuning for Text Classification (May 2018), [https://arxiv.org/abs/1801.06146](https://arxiv.org/abs/1801.06146)
* Keita Kurita's article : [A Tutorial to Fine-Tuning BERT with Fast AI](https://mlexplained.com/2019/05/13/a-tutorial-to-fine-tuning-bert-with-fast-ai/) (May 2019)
* Dev Sharma's article : [Using RoBERTa with Fastai for NLP](https://medium.com/analytics-vidhya/using-roberta-with-fastai-for-nlp-7ed3fed21f6c) (Sep 2019)