# Day 2 of AI Academy 2022 - NLP Track

Hello! Welcome to the second day of AI Academy 2022, NLP track. In this lab we'll go over how to use HuggingFace's `transformers` library to train a BERT-based model for our sentiment analysis task. BERT is a transformer-based deep learning model, which requires more computation than the traditional models we worked with yesterday. Deep learning models typically require a GPU to train and run efficiently. If your local machine doesn't have a GPU, we recommend that you run this notebook in Google's Colaboratory (or Colab for short), a free notebook runtime that can assign you a GPU for a limited amount of time (usually 12 hours). If you wish to run this in Colab, simply click the following badge.

<a target="_blank" href="https://colab.research.google.com/github/Mostafa-Samir/AI-Academy-NLP-Dec-2022/blob/main/Day-2-lab.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Colab Configuration

If you chose to run in Colab, you'll need to run the following cell. 

_**DO NOT RUN THE NEXT CELL IF YOU'RE RUNNING THE NOTEBOOK LOCALLY!**_

In [None]:
!git clone https://github.com/Mostafa-Samir/AI-Academy-NLP-Dec-2022.git

!cd AI-Academy-NLP-Dec-2022 && \
 pip install --upgrade pip && \
 pip install -r requirements.txt && \
 python -m setup-nltk

## Data Path Resolution

There'll be a slight difference in the file structure if you choose to run this notebook in Colab compared to running it locally. The following script takes this difference into account to make sure that the paths to the data are correct in the rest of the notebook. The script defines a `DATA_ROOT` such that a path like `DATA_ROOT/data/english/train.csv` will always resolve correctly regardless of the environment.

In [3]:
running_in_colab = 'google.colab' in str(get_ipython()) if hasattr(__builtins__,'__IPYTHON__') else False
DATA_ROOT = "./AI-Academy-NLP-Dec-2022" if running_in_colab else "."
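
As a small illustration of how `DATA_ROOT` is meant to be used, the sketch below builds the same dataset path for both environments. The helper `resolve_data_path` is a hypothetical name introduced only for this example; the notebook itself calls `os.path.join` directly.

```python
import os

def resolve_data_path(data_root: str, relative_path: str) -> str:
    """Join the environment-specific root with a path relative to the repository."""
    return os.path.join(data_root, relative_path)

# The two values DATA_ROOT can take in this notebook.
colab_path = resolve_data_path("./AI-Academy-NLP-Dec-2022", "data/english/train.csv")
local_path = resolve_data_path(".", "data/english/train.csv")

print(colab_path)
print(local_path)
```

Only the root differs between the two results, which is why the rest of the notebook can use the same relative paths everywhere.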

## Data Preparation

To start using a BERT model to predict the sentiment of the text, we first need to prepare the data in the right format. Preparation here is much lighter than what we did yesterday: we'll only do very light cleaning by normalizing user mentions and normalizing all URLs to a representative token. We'll then pass the lightly cleaned text into a pretrained unsupervised subword tokenizer (BERT uses WordPiece; SentencePiece is another common example). We're choosing to do only light cleaning in order to showcase the power of pretrained unsupervised tokenizers.

In [12]:
import re

def pipeline(fn_list):
    def inner_function(text):
        out = text
        for fn in fn_list:
            out = fn(out)
        return out
    
    return inner_function

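def normalize_mentions(text: str) -> str:
    # Replace every @-mention with the generic token "@user".
    # Raw strings keep "\w" from being treated as a string escape.
    return re.sub(r"@\w*", "@user", text)

def normalize_urls(text: str) -> str:
    # Replace every http/https URL with the generic token "http".
    return re.sub(r"https?://[\w\-_./:]*", "http", text)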
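
To see what the cleaning pipeline does to a single string, here is a minimal sketch applied to a made-up tweet. The snippet redefines `pipeline` and the two normalizers (with raw-string patterns) so that it runs on its own; `sample_tweet` is invented for illustration and does not come from the dataset.

```python
import re

def pipeline(fn_list):
    """Compose the given functions so they are applied left to right."""
    def inner_function(text):
        out = text
        for fn in fn_list:
            out = fn(out)
        return out
    return inner_function

def normalize_mentions(text: str) -> str:
    return re.sub(r"@\w*", "@user", text)

def normalize_urls(text: str) -> str:
    return re.sub(r"https?://[\w\-_./:]*", "http", text)

# A made-up tweet, used only to illustrate the two normalizers.
sample_tweet = "@jane_doe check https://example.com/a-b/c with @bob!"

cleaning_pipeline = pipeline([normalize_mentions, normalize_urls])
cleaned = cleaning_pipeline(sample_tweet)
print(cleaned)  # @user check http with @user!
```

Everything else in the tweet, including punctuation and casing, is left untouched for the tokenizer to handle.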

In [13]:
import pandas as pd
import os

cleaning_pipeline = pipeline([normalize_mentions, normalize_urls])

training_data = pd.read_csv(os.path.join(DATA_ROOT, "data/english/train.csv"))
clean_training_data = training_data.copy()
clean_training_data.loc[:, "tweet"] = clean_training_data.loc[:, "tweet"].apply(cleaning_pipeline)

dev_data = pd.read_csv(os.path.join(DATA_ROOT, "data/english/dev.csv"))
clean_dev_data = dev_data.copy()
clean_dev_data.loc[:, "tweet"] = clean_dev_data.loc[:, "tweet"].apply(cleaning_pipeline)

testing_data = pd.read_csv(os.path.join(DATA_ROOT, "data/english/test.csv"))
clean_testing_data = testing_data.copy()
clean_testing_data.loc[:, "tweet"] = clean_testing_data.loc[:, "tweet"].apply(cleaning_pipeline)

### Tokenization

Now that we have our dataset lightly cleaned, we'll look at the tokenizer and how sentences are split into subword tokens for the BERT model. The pretrained BERT model that we'll be using is called `bert-base-cased`, and we can get its tokenizer very easily using HuggingFace's `transformers` API. If the tokenizer files are available locally, the library will load them directly. Otherwise, they will be downloaded automatically from the HuggingFace hub and cached locally for later use.

In [17]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

To get a glimpse of how subword tokenization works, let's take a comparative look at a single tweet's tokens in two modes. The first mode is a regular split on spaces, and the second runs the tweet through the pretrained tokenizer that we just instantiated.

In [41]:
tweet = clean_training_data.iloc[0 ,0]
print(tweet.split(" "))
print("")
print(tokenizer.tokenize(tweet))

['This', 'time', 'tomorrow\\u002c', '@user', 'and', 'I', 'will', 'be', 'well', 'on', 'our', 'way', 'to', 'Starkville', 'for', 'a', 'sick', 'weekend...the', 'USM', 'game', 'is', 'going', 'to', 'be', 'atrocious.', '\n']

['This', 'time', 'tomorrow', '\\', 'u', '##00', '##2', '##c', '@', 'user', 'and', 'I', 'will', 'be', 'well', 'on', 'our', 'way', 'to', 'Stark', '##ville', 'for', 'a', 'sick', 'weekend', '.', '.', '.', 'the', 'US', '##M', 'game', 'is', 'going', 'to', 'be', 'at', '##ro', '##cious', '.']


The first list in the output above is the tokenization by whitespace; the second list comes from the pretrained tokenizer. We can see that the pretrained tokenizer is able to take a single token like `tomorrow\\u002c` and split it into multiple subwords `tomorrow, \\, u, ##00, ##2, ##c`, hence recovering the proper word tomorrow and splitting the remaining characters into tokens that it has seen before. Another example is `Starkville`: the word itself may never have appeared in the tokenizer's training data, but the tokenizer has seen other words ending in `ville` (often place names), so it generates two subwords, `Stark` and `##ville`, to represent that single token. The double hashes we see in some of the tokens indicate a subword that continues the preceding piece rather than starting a new word. Sometimes the generated subword tokenization may not make direct sense to us humans, but it makes statistical sense given the data that the tokenizer was pretrained on.
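
To make the splitting behaviour concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the general strategy that WordPiece-style tokenizers apply to each word at inference time. The vocabulary `toy_vocab` and the function `toy_wordpiece` are made up for illustration; the real `bert-base-cased` vocabulary is far larger, and the library implementation handles many more details (normalization, punctuation splitting, special tokens).

```python
def toy_wordpiece(word: str, vocab: set, unk_token: str = "[UNK]") -> list:
    """Split one word into subwords by repeatedly taking the longest vocabulary match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are stored with a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# A tiny hypothetical vocabulary, just large enough for the examples below.
toy_vocab = {"Stark", "##ville", "at", "##ro", "##cious", "game"}

print(toy_wordpiece("Starkville", toy_vocab))  # ['Stark', '##ville']
print(toy_wordpiece("atrocious", toy_vocab))   # ['at', '##ro', '##cious']
print(toy_wordpiece("game", toy_vocab))        # ['game']
print(toy_wordpiece("xyz", toy_vocab))         # ['[UNK]']
```

With a vocabulary learned from a large corpus, frequent words stay whole while rare words break into frequent fragments, which is the pattern visible in the tweet above.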

_**Does it make some sense now that we only applied light cleaning?**_

What we need to do now is start tokenizing all of our data into the format needed for training. This format has two main components to it:
- The `input_ids` of the tokenized sentences. These are the numerical ids of the tokens in the pretrained vocabulary.
- The attention mask, which indicates which elements should be processed by the model and which shouldn't. This is important because all sequences in a batch need to have the same length so that we can process them in a parallel and memory-efficient way, which means shorter sequences get extra non-informative `PAD` symbols appended to their tokens. The attention mask has 0 for these non-informative pad tokens and 1 for the original, informative tokens of the sentence.
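
The relationship between padding and the attention mask can be sketched in plain Python. The function `pad_batch` and the ids in `toy_batch` are made up for illustration (the sequences are framed by 101 and 102, the ids BERT uses for its `[CLS]` and `[SEP]` special tokens, and padded with 0, the id of `[PAD]`); the real tokenizer does all of this internally and returns tensors instead of lists.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up sequences of different lengths.
toy_batch = [[101, 1188, 102], [101, 1188, 1159, 4911, 102]]
padded = pad_batch(toy_batch)

print(padded["input_ids"])       # [[101, 1188, 102, 0, 0], [101, 1188, 1159, 4911, 102]]
print(padded["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]

# Summing a mask row recovers the number of real tokens in that sequence.
real_lengths = [sum(mask) for mask in padded["attention_mask"]]
print(real_lengths)  # [3, 5]
```

This mirrors what `padding='longest'` requests from the tokenizer: every row is padded to the length of the longest sequence in the batch.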

Calling the tokenizer directly on the sentences will result in these two pieces of data.

In [48]:
tokenized_training_data = tokenizer(
    clean_training_data.loc[:, "tweet"].to_list(),
    padding='longest',
    return_tensors='pt',
    return_attention_mask=True,
    return_token_type_ids=False
)

tokenized_dev_data = tokenizer(
    clean_dev_data.loc[:, "tweet"].to_list(),
    padding='longest',
    return_tensors='pt',
    return_attention_mask=True,
    return_token_type_ids=False
)

tokenized_testing_data = tokenizer(
    clean_testing_data.loc[:, "tweet"].to_list(),
    padding='longest',
    return_tensors='pt',
    return_attention_mask=True,
    return_token_type_ids=False
)

In [49]:
tokenized_dev_data

{'input_ids': tensor([[  101,   107,  1667,  ...,     0,     0,     0],
        [  101, 19408, 15603,  ...,     0,     0,     0],
        [  101,   137,  4795,  ...,     0,     0,     0],
        ...,
        [  101,   137,  4795,  ...,     0,     0,     0],
        [  101,   160, 18649,  ...,     0,     0,     0],
        [  101,   107,  2876,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}