# Day 2 of AI Academy 2022 - NLP Track

Hello! Welcome to the second day of AI Academy 2022, NLP track. In this lab we'll go over how to use HuggingFace's `transformers` library to train a BERT-based model for our sentiment analysis task. BERT is a transformer-based deep learning model, which requires more computation than the traditional models we worked with yesterday. Deep learning models typically require a GPU to train and run efficiently. If your local machine doesn't have a GPU, we recommend that you run this notebook in Google's Colaboratory (or Colab for short), which is a free notebook runtime that can assign you a GPU for a limited amount of time (usually 12 hours). If you wish to run this in Colab, simply click the following badge.

<a target="_blank" href="https://colab.research.google.com/github/Mostafa-Samir/AI-Academy-NLP-Dec-2022/blob/main/Day-2-lab.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Colab Configuration

If you chose to run in Colab, you'll need to run the following cell. 

_**DO NOT RUN THE NEXT CELL IF YOU'RE RUNNING THE NOTEBOOK LOCALLY!**_

In [None]:
!git clone https://github.com/Mostafa-Samir/AI-Academy-NLP-Dec-2022.git

!cd AI-Academy-NLP-Dec-2022 && \
 pip install --upgrade pip && \
 pip install -r requirements.txt && \
 python -m setup-nltk

## Data Path Resolution

There'll be a slight difference in the file structure if you choose to run this notebook in Colab compared to running it locally. The following script takes this difference into account to make sure that the paths to the data are correct in the rest of the notebook. The script defines a `DATA_ROOT` such that a path like `DATA_ROOT/data/english/train.csv` will always resolve correctly regardless of the environment.

In [3]:
running_in_colab = 'google.colab' in str(get_ipython()) if hasattr(__builtins__,'__IPYTHON__') else False
DATA_ROOT = "./AI-Academy-NLP-Dec-2022" if running_in_colab else "."

## Data Preparation

To start using a BERT model to predict the sentiment of the text, we first need to prepare the data in the right format. Preparation here is much lighter than what we did yesterday. We'll only do very light cleaning on the data by normalizing user mentions and normalizing all URLs to a representative token. We'll then pass this lightly cleaned text into a pretrained unsupervised subword tokenizer (WordPiece, in BERT's case). We're choosing to do only light cleaning in order to showcase the power of pretrained unsupervised tokenizers.

In [12]:
import re

def pipeline(fn_list):
    def inner_function(text):
        out = text
        for fn in fn_list:
            out = fn(out)
        return out
    
    return inner_function

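def normalize_mentions(text: str) -> str:
    # Replace every @handle with the generic token "@user" (raw string avoids invalid-escape warnings)
    return re.sub(r"@\w*", "@user", text)

def normalize_urls(text: str) -> str:
    # Collapse every http/https URL into the single token "http"
    return re.sub(r"http(s{0,1})://[\w\-_./:]*", "http", text)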

In [13]:
import pandas as pd
import os

cleaning_pipeline = pipeline([normalize_mentions, normalize_urls])

training_data = pd.read_csv(os.path.join(DATA_ROOT, "data/english/train.csv"))
clean_training_data = training_data.copy()
clean_training_data.loc[:, "tweet"] = clean_training_data.loc[:, "tweet"].apply(cleaning_pipeline)

dev_data = pd.read_csv(os.path.join(DATA_ROOT, "data/english/dev.csv"))
clean_dev_data = dev_data.copy()
clean_dev_data.loc[:, "tweet"] = clean_dev_data.loc[:, "tweet"].apply(cleaning_pipeline)

testing_data = pd.read_csv(os.path.join(DATA_ROOT, "data/english/test.csv"))
clean_testing_data = testing_data.copy()
clean_testing_data.loc[:, "tweet"] = clean_testing_data.loc[:, "tweet"].apply(cleaning_pipeline)
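
As a quick sanity check, here is a minimal sketch of what the two cleaning functions do to a single string. It uses the same regular expressions as the cells above; the sample strings are made up for illustration and are not taken from the dataset.

```python
import re

def normalize_mentions(text: str) -> str:
    # Same pattern as in the notebook: "@" followed by any word characters
    return re.sub(r"@\w*", "@user", text)

def normalize_urls(text: str) -> str:
    # Same pattern as in the notebook: http or https URLs
    return re.sub(r"http(s{0,1})://[\w\-_./:]*", "http", text)

# Hypothetical tweet, for illustration only
sample = "@john_doe check https://t.co/abc123 now"
cleaned = normalize_urls(normalize_mentions(sample))
print(cleaned)  # @user check http now

# A limitation worth knowing about: the mention pattern also fires inside e-mail-like text
email_like = normalize_mentions("mail me at a@b.com")
print(email_like)  # mail me at a@user.com
```

The second call shows that the mention pattern is deliberately simple; for tweets this is usually acceptable, but it is not a general-purpose mention detector.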

### Tokenization

Now that we have our dataset lightly cleaned, we'll start looking at the tokenizer and how sentences are split into subword tokens for the BERT model. The pretrained BERT model that we'll be using is called `bert-base-uncased`, and we can get its tokenizer very easily using HuggingFace's `transformers` API. If the tokenizer is available locally, the library will load it directly. Otherwise, it will be downloaded automatically from the HuggingFace hub and cached locally for later use.

In [56]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

To get a glimpse of how subword tokenization works, let's take a comparative look at a single tweet's tokens in two modes. The first mode is a regular split on spaces, and the other is running the tweet through the pretrained tokenizer that we just instantiated.

In [57]:
tweet = clean_training_data.iloc[0 ,0]
print(tweet.split(" "))
print("")
print(tokenizer.tokenize(tweet))

['This', 'time', 'tomorrow\\u002c', '@user', 'and', 'I', 'will', 'be', 'well', 'on', 'our', 'way', 'to', 'Starkville', 'for', 'a', 'sick', 'weekend...the', 'USM', 'game', 'is', 'going', 'to', 'be', 'atrocious.', '\n']

['this', 'time', 'tomorrow', '\\', 'u', '##00', '##2', '##c', '@', 'user', 'and', 'i', 'will', 'be', 'well', 'on', 'our', 'way', 'to', 'stark', '##ville', 'for', 'a', 'sick', 'weekend', '.', '.', '.', 'the', 'us', '##m', 'game', 'is', 'going', 'to', 'be', 'at', '##ro', '##cious', '.']


The first list in the output above is the whitespace tokenization, and the second is the output of the pretrained tokenizer. We can see that the pretrained tokenizer is able to take a single token like `tomorrow\\u002c` and split it into multiple subwords `tomorrow, \\, u, ##00, ##2, ##c`, hence recovering the proper word tomorrow and splitting the other parts into tokens that it may have seen before. Another example is `Starkville`, which may not have been seen in the tokenizer's training data. The tokenizer has seen other words ending in `ville`, a suffix that often appears in place names, so it generates the two subwords `stark` and `##ville` to represent that single token (lowercased, because we're using an uncased tokenizer). The double hashes we see in some of the tokens indicate a subword that continues the previous token. Sometimes the generated subword tokenization may not make direct sense to us humans, but it makes statistical sense given the data that the tokenizer was pretrained on.

_**Does it make some sense now that we only applied light cleaning?**_
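
To make the `##` pieces less mysterious, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply to each word. The tiny vocabulary below is made up for illustration; the real `bert-base-uncased` vocabulary has tens of thousands of entries, and the real tokenizer also handles lowercasing, punctuation, and special tokens.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are stored with a "##" prefix
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is in the vocabulary: give up on the word
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"stark", "##ville", "at", "##ro", "##cious", "game"}

print(wordpiece_tokenize("starkville", toy_vocab))  # ['stark', '##ville']
print(wordpiece_tokenize("atrocious", toy_vocab))   # ['at', '##ro', '##cious']
print(wordpiece_tokenize("game", toy_vocab))        # ['game']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

With this toy vocabulary the sketch reproduces the splits of `starkville` and `atrocious` seen in the output above, which is why rare words rarely need to be cleaned away before tokenization.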

What we need to do now is start tokenizing all of our data into the format needed for training. This format has two main components to it:
- The `input_ids` of the tokenized sentences. These are the numerical ids of the tokens in the pretrained vocabulary.
- The attention mask, which represents which elements should be processed by the model and which shouldn't. This is important because all the sequences in a batch must have the same length so that we can process them in a parallel and memory-efficient way, and this can result in adding extra non-informative `PAD` symbols to the tokens. The attention mask has 0 for these non-informative pad tokens, and 1 for the informative original tokens of the sentence.
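
The relationship between padding and the attention mask can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how the tokenizer implements it internally; the token ids below are made up, and `pad_batch` is a hypothetical helper.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens are kept as-is, then pad ids fill the remaining positions
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Two hypothetical tokenized sentences of different lengths
batch = [[101, 2023, 2051, 102], [101, 2208, 102]]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 2023, 2051, 102], [101, 2208, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

This mirrors what `padding='longest'` requests from the tokenizer: every row ends up as long as the longest sentence, and the mask records which positions are real.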

Calling the tokenizer directly on the sentences will result in these two pieces of data.

In [58]:
tokenized_training_data = tokenizer(
    clean_training_data.loc[:, "tweet"].to_list(),
    padding='longest',
    return_tensors='pt',
    return_attention_mask = True,
    return_token_type_ids=False
)

tokenized_dev_data = tokenizer(
    clean_dev_data.loc[:, "tweet"].to_list(),
    padding='longest',
    return_tensors='pt',
    return_attention_mask = True,
    return_token_type_ids=False
)

tokenized_testing_data = tokenizer(
    clean_testing_data.loc[:, "tweet"].to_list(),
    padding='longest',
    return_tensors='pt',
    return_attention_mask = True,
    return_token_type_ids=False
)

In [59]:
tokenized_dev_data

{'input_ids': tensor([[  101,  1000,  2577,  ...,     0,     0,     0],
        [  101, 16009, 14851,  ...,     0,     0,     0],
        [  101,  1030,  5310,  ...,     0,     0,     0],
        ...,
        [  101,  1030,  5310,  ...,     0,     0,     0],
        [  101,  4788,  2518,  ...,     0,     0,     0],
        [  101,  1000,  3203,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

Before we can finalize the preparation of our dataset, we first need to encode our labels into numeric values. We'll set negative to be 0, neutral to be 1, and positive to be 2.

In [60]:
import torch

label_to_id = {"negative": 0, "neutral": 1, "positive": 2}

training_labels = torch.tensor([label_to_id[label] for label in clean_training_data.loc[:, "label"]])
dev_labels = torch.tensor([label_to_id[label] for label in clean_dev_data.loc[:, "label"]])
testing_labels = torch.tensor([label_to_id[label] for label in clean_testing_data.loc[:, "label"]])

Now that we have tokenized all of our data, we only need to wrap it in a `pytorch` `Dataset` class. The `Dataset` class is a convenient abstraction used to facilitate generating batches for training. We'll define a generic dataset class that is not tied to any specific split, and then use it to instantiate a separate dataset object for each of the splits.

In [61]:
from torch.utils.data import Dataset

class SentimentDataset(Dataset):

    def __init__(self, input_ids, attention_masks, encoded_labels):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.encoded_labels = encoded_labels

        self.size, *_ = input_ids.shape

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return {
            "input_ids": self.input_ids[index],
            # Return the attention mask itself, not a second copy of the input ids
            "attention_mask": self.attention_masks[index],
            "labels": self.encoded_labels[index]
        }

In [62]:
training_dataset = SentimentDataset(tokenized_training_data["input_ids"], tokenized_training_data["attention_mask"], training_labels)
dev_dataset = SentimentDataset(tokenized_dev_data["input_ids"], tokenized_dev_data["attention_mask"], dev_labels)
testing_dataset = SentimentDataset(tokenized_testing_data["input_ids"], tokenized_testing_data["attention_mask"], testing_labels)
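
To see why `__len__` and `__getitem__` are all a dataset needs to expose, here is a simplified sketch of how a loader could draw batches from such an object. `ToyDataset` and `iterate_batches` are hypothetical names used only for illustration; the real `torch.utils.data.DataLoader` additionally supports shuffling, collating items into tensors, and parallel loading.

```python
class ToyDataset:
    """Stand-in for a map-style dataset: indexable and with a known length."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, index):
        return self.items[index]


def iterate_batches(dataset, batch_size):
    """Yield consecutive batches using only len(dataset) and dataset[i]."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


# Five made-up examples shaped like the dictionaries our dataset returns
toy = ToyDataset([{"input_ids": [101, i, 102], "labels": i % 3} for i in range(5)])
batches = list(iterate_batches(toy, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

The last batch is smaller because five examples do not divide evenly by the batch size, which is also what happens at the end of each epoch during training unless the loader is told to drop it.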

## Training a BERT-based Classifier

Now that we have our dataset ready, we can start preparing our model for training. First we need to get an instance of the pretrained BERT weights to use in our model. This can be done via HuggingFace's APIs in the same way we loaded the tokenizer, this time using `AutoModelForSequenceClassification`.

In [63]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

As we can see, the `transformers` package downloads the pretrained weights from the hub and caches them locally like it did with the tokenizer earlier. As we discussed in the first session, this BERT model is trained on the masked language modeling (MLM) task, so it doesn't work directly for text classification. What `AutoModelForSequenceClassification` does is take the pretrained weights from the MLM task and append a feed-forward neural network on top of the model's output. This feed-forward network has an output size equal to the number of labels in our data so that it is able to predict them. It is initialized randomly and not yet trained, but it will get trained along with the other weights of BERT when we present our training data to it. This process is what we call **fine-tuning**.
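
The idea of a randomly initialized head can be sketched without any deep learning library. The snippet below is an illustration under stated assumptions, not the library's implementation: it treats the head as a single linear layer, uses a made-up 8-dimensional sentence vector instead of BERT's real output, and the helper names are hypothetical.

```python
import math
import random

def init_head(hidden_size, num_labels, seed=0):
    """Create small random weights and zero biases for a linear classification head."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
               for _ in range(num_labels)]
    bias = [0.0] * num_labels
    return weights, bias

def head_forward(sentence_vector, weights, bias):
    """Compute one logit per label: logits = W . x + b."""
    return [sum(w * x for w, x in zip(row, sentence_vector)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up sentence representation; the real one would come from BERT
sentence_vector = [0.3, -1.2, 0.8, 0.1, -0.4, 0.9, -0.7, 0.5]
weights, bias = init_head(hidden_size=len(sentence_vector), num_labels=3)
logits = head_forward(sentence_vector, weights, bias)
probabilities = softmax(logits)
print(len(logits), round(sum(probabilities), 6))  # 3 1.0
```

Because the weights start out small and random, the three probabilities are close to each other before any training, which is why the head has to be fine-tuned on our labeled tweets before its predictions mean anything.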

To do that fine-tuning, we'll utilize the `Trainer` object from the `transformers` library. In order to do so, we first need to provide two external objects. The first one is a metrics computation function. This function accepts an `EvalPrediction` object from the training loop (we call the argument `evaluation_predictions`), which contains both the model's predictions and the true labels for the evaluation set. We'll write a function that consumes this object and returns a dictionary of metrics, one of which will be used to track the model's performance.

In [64]:
import numpy as np
from sklearn.metrics import f1_score, recall_score

def compute_metrics(evaluation_predictions):

    logits = evaluation_predictions.predictions
    y_true = evaluation_predictions.label_ids

    y_pred = np.argmax(logits, axis=1)

    weighted_f1 = f1_score(y_true, y_pred, average='weighted')
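    # labels=[0] with average=None returns a one-element array; take the scalar so it logs cleanly
    negative_recall = float(recall_score(y_true, y_pred, labels=[0], average=None)[0])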

    return {
        "weighted-F1": weighted_f1,
        "negative-recall": negative_recall
    }
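
To make the two metrics concrete, here is a hand-computed version on a tiny made-up example, written in plain Python so each step is visible. The logits and labels are invented for illustration, and the helper functions are hypothetical; in the notebook itself we rely on `scikit-learn` for the actual computation.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn

def weighted_f1(y_true, y_pred, labels):
    """Per-label F1 averaged with weights equal to each label's support."""
    total = 0.0
    for label in labels:
        tp, fp, fn = counts(y_true, y_pred, label)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        support = sum(1 for t in y_true if t == label)
        total += f1 * support
    return total / len(y_true)

def recall(y_true, y_pred, label):
    """Fraction of the true examples of a label that were predicted as that label."""
    tp, _, fn = counts(y_true, y_pred, label)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up logits for six tweets over (negative, neutral, positive)
logits = [[2.0, 0.1, -1.0], [0.2, 1.5, 0.3], [1.2, 0.4, 0.1],
          [-0.5, 0.0, 2.2], [0.1, 0.9, 0.8], [0.3, 0.2, 1.1]]
y_true = [0, 1, 0, 2, 2, 0]

y_pred = [argmax(row) for row in logits]
print(y_pred)                                        # [0, 1, 0, 2, 1, 2]
print(round(weighted_f1(y_true, y_pred, [0, 1, 2]), 4))  # 0.6778
print(round(recall(y_true, y_pred, 0), 4))           # 0.6667
```

Here the negative class has an F1 of 0.8, neutral 2/3, and positive 0.5; weighting them by their supports (3, 1, and 2) gives 61/90, while two of the three truly negative tweets are recovered, giving a negative recall of 2/3.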

After defining the `compute_metrics` function, we'll go on and define the next object required to run the `Trainer`, which is the training arguments. You can think of these as a set of configurations that determine how the training should run. They include the training batch size, how many epochs to train for, when to save checkpoints, how many checkpoints to keep on disk, which metric (out of the two we defined above) to monitor, and so on.

In [66]:
from transformers import TrainingArguments

batch_size = 16
metric_name = "weighted-F1"  # must match a key returned by compute_metrics

training_args = TrainingArguments(
    evaluation_strategy="epoch",
    save_strategy = "epoch",
    num_train_epochs=50,
    load_best_model_at_end=True,
    save_total_limit=1,
    metric_for_best_model=metric_name,
    per_device_eval_batch_size=batch_size,
    per_device_train_batch_size=batch_size,
    output_dir="./models"
)

We're now ready to train our model using all the pieces that we have defined so far. We just need to import `Trainer` from `transformers` and initialize a trainer object with our specifications.

In [67]:
from transformers import Trainer

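trainer = Trainer(
    model,
    training_args,
    train_dataset=training_dataset,
    eval_dataset=dev_dataset,
    compute_metrics=compute_metrics,
)

We simply run the training now by calling `trainer.train()`!

In [68]:
# train() returns a summary of the run, not the model; with load_best_model_at_end=True
# the best checkpoint is loaded into trainer.model once training finishes
train_output = trainer.train()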

***** Running training *****
  Num examples = 1244
  Num Epochs = 50
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3900
  Number of trainable parameters = 109484547


  0%|          | 0/3900 [00:00<?, ?it/s]

KeyboardInterrupt: 
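
The numbers in the training log above can be reproduced with a little arithmetic, which is a useful habit for checking that a run is configured as intended. The sketch below assumes a single device, no gradient accumulation (as the log reports), and that the last partial batch of each epoch is kept. The hidden size of 768 is the standard value for BERT-base models.

```python
import math

# Values taken from the training log and the training arguments
num_examples = 1244
batch_size = 16
num_epochs = 50

# The last batch of an epoch is smaller, but still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 78 3900

# Parameters added by a single linear classification layer: weights plus biases
hidden_size = 768   # BERT-base hidden size
num_labels = 3
head_parameters = hidden_size * num_labels + num_labels
print(head_parameters)  # 2307

# What remains of the reported total after subtracting that layer
reported_trainable = 109484547
remaining_parameters = reported_trainable - head_parameters
print(remaining_parameters)  # 109482240
```

The 3900 optimization steps match the log exactly, and the subtraction shows how small the new classification layer is compared with the rest of the model, which is why most of the cost of fine-tuning comes from updating the pretrained weights.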