# Sentence Classification with BERT

In the last module, we learnt how to load a custom dataset, convert it into features, and divide the dataset into train/validation splits.
In this module, we will learn how to fine-tune BERT for a sentence-level classification task. Using BERT, we will build a sentiment analyzer on the climate-change tweet dataset that we went through in the last notebook. 

__What you will learn:__
This notebook, in addition to the learnings from the previous notebook, will help you finetune BERT for a sentence-level classification task. By the end of the notebook, you should have the skills to load a custom dataset, convert it into features, and finetune BERT for the task.

Topics covered:
- Sentence classification
- Train/validation loop


### Using GPUs on Colab
To run this notebook on a GPU, we first need to enable one on Colab:
- Click on `Edit`
- Select `Notebook Settings`
- Choose `GPU` under `Hardware accelerator`

Wait until the resources have been allocated.


### Recap
In the last lesson, we learnt how to load our own dataset. Let's go ahead and use that code to load a custom dataset. While we are at it, let's simplify the code into a `convert_examples_to_features` function.

In [None]:
# from google.colab import files
# climate_change_dataset = files.upload()

In [1]:
import pandas as pd

df = pd.read_csv("twitter_sentiment_data.csv")

Set up `transformers`

In [4]:
# !pip install --quiet transformers

### Load pretrained BERT
We will now load the BERT model and its corresponding tokenizer. As you might already know, BERT is one of the most widely used transformer architectures in NLP. To learn more about BERT models, check out this great [video](https://ai.science/e/bertbert-explained-pre-training-of-deep-bidirectional-transformers-for-language-understanding--2018-11-06).

Since our task is sequence (sentence-level) classification, we will use the `AutoModelForSequenceClassification` class from `transformers`. An `AutoClass` is an abstraction that automatically infers the transformer model type, sparing practitioners the pain of finding the right classes. Each `AutoClass` is mapped to individual model types, and the model type is based on the name of the model passed to the `AutoClass`. For example, if we were to pass `gpt2` to `AutoTokenizer`, we would automatically construct a `GPT2Tokenizer`. Read more about `AutoClass` [here](https://huggingface.co/transformers/model_doc/auto.html).
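
The name-based lookup described above can be pictured as a small registry. The sketch below is a toy illustration only and is not how `transformers` is implemented (the library's real lookup is more involved); the classes `ToyBertTokenizer` and `ToyGPT2Tokenizer`, the registry, and the helper `toy_auto_tokenizer` are all hypothetical names made up for this example.

```python
# Toy illustration of name-based dispatch; NOT the transformers implementation.
class ToyBertTokenizer:
    model_type = "bert"


class ToyGPT2Tokenizer:
    model_type = "gpt2"


# Hypothetical registry: model-type keyword -> class to construct.
TOY_TOKENIZER_REGISTRY = {
    "bert": ToyBertTokenizer,
    "gpt2": ToyGPT2Tokenizer,
}


def toy_auto_tokenizer(model_name):
    """Return an instance of the class whose keyword appears in model_name."""
    for keyword, tokenizer_cls in TOY_TOKENIZER_REGISTRY.items():
        if keyword in model_name:
            return tokenizer_cls()
    raise ValueError(f"Unrecognized model name: {model_name}")


print(type(toy_auto_tokenizer("bert-base-uncased")).__name__)
print(type(toy_auto_tokenizer("gpt2")).__name__)
```

The caller only supplies a model name and never has to import the concrete class, which is the convenience an `AutoClass` offers.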




In [2]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pre-process data
Here comes our pre-processing step from the last notebook. One more step that we add in this notebook is converting the labels to their corresponding numerical ids.

In [3]:
from keras.preprocessing.sequence import pad_sequences
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

MAX_LEN = 64
# the dataset labels are -1, 0, 1, 2; shift them to class ids 0..3
label2id = {label: label + 1 for label in range(-1, 3)}
id2label = {v: k for k, v in label2id.items()}

def convert_examples_to_features(tweets, labels):
  input_ids = [
      bert_tokenizer.encode(tweet, add_special_tokens=True) for tweet in tweets
  ]

  input_ids = pad_sequences(
      input_ids,
      maxlen=MAX_LEN,
      dtype="long", 
      value=bert_tokenizer.pad_token_id,
      padding="post",
      truncating="post"
  )

  input_ids = torch.tensor(input_ids)
  # 1 for real tokens, 0 for padding positions
  attention_masks = (input_ids != bert_tokenizer.pad_token_id).int()
  labels = torch.tensor([label2id[label] for label in labels])

  return TensorDataset(input_ids, attention_masks, labels)



In [4]:
dataset = convert_examples_to_features(df.message, list(df.sentiment))
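
To make the padding, truncation, and mask steps concrete, here is a minimal pure-Python sketch of the same idea on made-up token ids. It assumes a pad id of 0 (the id of `[PAD]` in `bert-base-uncased`) and uses a tiny `MAX_LEN` of 6 for readability; the helpers `pad_or_truncate` and `attention_mask` are hypothetical names, not part of Keras or `transformers`.

```python
# Pure-Python sketch of post-padding, post-truncation and mask creation.
PAD_ID = 0   # assumed pad id ([PAD] is id 0 in bert-base-uncased)
MAX_LEN = 6  # tiny value for illustration; the notebook uses 64


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Cut sequences longer than max_len and right-pad shorter ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def attention_mask(padded_ids, pad_id=PAD_ID):
    """1 for real tokens, 0 for padding positions."""
    return [int(tok != pad_id) for tok in padded_ids]


# Made-up token ids; 101 and 102 stand in for [CLS] and [SEP].
short = pad_or_truncate([101, 2023, 102])
long = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102])

print(short, attention_mask(short))
print(long, attention_mask(long))

# Dataset labels -1..2 shifted to class ids 0..3.
label2id = {label: label + 1 for label in range(-1, 3)}
print(label2id)
```

Note that with post-truncation an over-long sequence loses its trailing `[SEP]` token, as the second example shows.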

### Train/Validation Set
Divide the dataset into train/validation splits.

In [5]:
from sklearn.model_selection import train_test_split

train_data, val_data, train_labels, val_labels = train_test_split(
    dataset,
    list(df.sentiment), 
    random_state=1234,
    test_size=0.2
)

print(f"Train size: {len(train_data)}, Validation size: {len(val_data)}")

Train size: 35154, Validation size: 8789


# Model definition
Okay, let's get right into defining our model. We will load the `bert-base-uncased` model.

In [6]:
bert_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Using `AutoModelForSequenceClassification` with this checkpoint maps to `BertForSequenceClassification`, as defined [here](https://github.com/huggingface/transformers/blob/c89bdfbe720bc8f41c7dc6db5473a2cb0955f224/src/transformers/models/bert/modeling_bert.py#L1313). When you open the link, you'll see that the `BertForSequenceClassification` model essentially consists of the following:
- `BERT` model
- Dropout
- a Linear layer

Note that `BertForSequenceClassification` already comes with a linear layer, and we just have to configure it to meet our requirements. The BERT output for each token (there are `MAX_LEN` tokens per input) is a 768-dimensional vector. The vector corresponding to the `[CLS]` token is used for sequence classification tasks. The linear layer maps this 768-dimensional vector to a vector whose size equals the number of output labels.
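
The dropout and linear layers on top of BERT can be sketched in isolation. The snippet below is a minimal illustration, not the library's code: it feeds random vectors in place of real BERT output, and the dropout probability of 0.1 and the batch size of 2 are assumptions chosen for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)

HIDDEN_SIZE = 768  # size of each BERT-base output vector
NUM_LABELS = 4     # classes in the tweet dataset
BATCH_SIZE = 2     # assumed batch size for illustration

# Stand-in for the sentence-level BERT vector; random values, not real BERT output.
sentence_vectors = torch.randn(BATCH_SIZE, HIDDEN_SIZE)

# Dropout followed by a linear layer, mirroring the structure listed above.
head = nn.Sequential(
    nn.Dropout(p=0.1),  # assumed dropout probability
    nn.Linear(HIDDEN_SIZE, NUM_LABELS),
)

logits = head(sentence_vectors)
print(logits.shape)  # torch.Size([2, 4])
```

Each row of `logits` holds one score per label; the predicted class is the index of the largest score.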

You might also notice that the number of labels is determined by a config file. Let us load the config corresponding to the pre-trained model and check whether `num_labels` is equal to 4.

In [7]:
from transformers import AutoConfig

bert_config = AutoConfig.from_pretrained(
    "bert-base-uncased"
)
print(bert_config.num_labels)

2


The default number of output labels in the pre-trained model is 2, but our classification task has 4 labels. Let us fix that first. To do this, obtain the pre-trained model config and change the number of labels to 4. We will also have to specify the new label-to-id mapping.

In [8]:
from transformers import AutoConfig, AutoModel

bert_sequential_config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    num_labels=4,
    id2label=id2label,
    label2id=label2id
)

In [9]:
bert_sequential_model = AutoModelForSequenceClassification.from_pretrained(
            pretrained_model_name_or_path="bert-base-uncased",
            config=bert_sequential_config,
        )

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

That is everything! We will now move the model to the GPU using `.to()`. If a GPU is available, we set the device to `cuda`; otherwise we use `cpu`.

In [10]:
if torch.cuda.is_available():
  device = torch.device("cuda")
else:
  device = torch.device("cpu")

print(f"Moving model to device: {device}")
bert_sequential_model = bert_sequential_model.to(device)

Moving model to device: cpu


# Training
Alright, let's revisit what we have done so far:
1. Preprocess data: done!
2. Load pre-trained model: done!
3. Train model: let's get to it now!

In this step, we will sample data in batches from our train `DataLoader` and will finetune our BERT model.


## Setup Dataloaders

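Time to answer the homework question from the last notebook: why shouldn't we use a `RandomSampler` on the validation dataset? Random sampling brings no benefit during validation, because no weights are updated and the order of the examples does not matter. It can also hurt: if the sampler draws with replacement or covers only part of the data, the validation accuracy varies from run to run, and so does the model that is saved based on the best validation accuracy. Avoid the random sampler and use a `SequentialSampler` instead, which also keeps the predictions aligned with the original order of the examples.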
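
A small pure-Python sketch of when sampling affects a validation metric. The list of per-example results is made up, and the snippet imitates the samplers with the standard `random` module rather than using the PyTorch classes, so treat it as an illustration of the idea only.

```python
import random

random.seed(0)

# Made-up validation results: 1 = correct prediction, 0 = wrong prediction.
correct = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]


def accuracy(indices):
    """Accuracy over the examples selected by indices."""
    return sum(correct[i] for i in indices) / len(indices)


sequential = list(range(len(correct)))

# Shuffled order without replacement: every example is still visited once.
shuffled = sequential[:]
random.shuffle(shuffled)

# Sampling with replacement: some examples may repeat, others may be skipped.
with_replacement = [random.randrange(len(correct)) for _ in correct]

print(accuracy(sequential))
print(accuracy(shuffled))
print(accuracy(with_replacement))
```

The first two numbers are always identical, since a full pass in any order covers the same examples. The third depends on the draw, which is the kind of run-to-run variation we want to keep out of model selection.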

In [14]:
from torch.utils.data import (
    DataLoader,
    TensorDataset,
    RandomSampler,
    SequentialSampler,
)

# BATCH_SZ = 64
BATCH_SZ = 2

train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(
    dataset=train_data,
    sampler=train_sampler,
    batch_size=BATCH_SZ
)

val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(
    dataset=val_data,
    sampler=val_sampler,
    batch_size=BATCH_SZ
)

## Setup optimizer
As a part of the training process, we should also define an optimizer. We will use our good old `SGD` optimizer in this case.

In [15]:
from torch.optim import SGD

# define a learning rate
LR=5e-4
optimizer = SGD(bert_sequential_model.parameters(), lr=LR)
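
For intuition, plain SGD (no momentum, no weight decay, as configured above) moves each weight against its gradient, scaled by the learning rate. The sketch below applies one such step by hand to made-up weights and gradients; `sgd_step` is a hypothetical helper, not a PyTorch function.

```python
LR = 5e-4  # same learning rate as in the cell above


def sgd_step(weights, grads, lr=LR):
    """Plain SGD: w <- w - lr * g for every weight."""
    return [w - lr * g for w, g in zip(weights, grads)]


weights = [1.0, -2.0]
grads = [4.0, -2.0]  # made-up gradients for illustration
new_weights = sgd_step(weights, grads)
print(new_weights)  # approximately [0.998, -1.999]
```

A positive gradient lowers the weight and a negative gradient raises it, and the small learning rate keeps each step tiny.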

## Training loop

The training loop has two steps:
1. Epoch loop: an epoch means one pass through the training dataset. 
2. Batch training loop: in this inner loop, the model is trained on a batch of data at each step.

At the end of each epoch, we run our model on the validation dataset and calculate evaluation metrics (accuracy and F1) on it. As you might know, at the beginning of each training epoch we have to specify that we're in training mode by calling `model.train()`. This ensures that operations like dropout and batch normalization run in training mode rather than in evaluation mode. Similarly, during the validation phase, we switch to evaluation mode with `model.eval()` and make sure that gradients are not computed, which speeds up computation.
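
The difference between the two modes is easy to see on a single dropout layer. This is a minimal sketch on a toy tensor rather than on BERT; the dropout probability of 0.5 is chosen only to make the effect visible.

```python
import torch
from torch import nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(8)

dropout.train()         # training mode: elements are randomly zeroed
train_out = dropout(x)  # surviving elements are scaled by 1 / (1 - p) = 2.0

dropout.eval()          # evaluation mode: dropout passes the input through
eval_out = dropout(x)

with torch.no_grad():   # no computation graph is recorded inside this block
    y = nn.Linear(8, 2)(x)

print(train_out)
print(eval_out)
print(y.requires_grad)  # False
```

In training mode the output contains only zeros and twos, while in evaluation mode it equals the input. Tensors computed under `torch.no_grad()` carry no gradient information, which saves memory and time during validation.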

Some of these concepts can be confusing and you might ask, "what if I accidentally forget to call `model.train()`?" Luckily for us, [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) abstracts away much of this boilerplate code, as we will see in the next notebook.

Before the training process:
- the BERT model has weights from pre-training
- the linear layer on top has random weights

During the finetuning process:
- the weights of the linear layer as well as the BERT model will be updated.
- we can also control which layers in the overall model are updated. For example, we can decide to finetune only the linear layer, or only certain layers in the BERT model. For the purpose of our notebook, we will finetune the entire model (BERT + linear layer) for 7 epochs.
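
Controlling which layers get updated usually comes down to the `requires_grad` flag on the parameters. The sketch below shows this on a toy two-part model that stands in for "encoder + classification head"; it is not a real BERT model, and the names `toy_model`, `encoder`, and `classifier` are made up for this example.

```python
import torch
from torch import nn

# Toy stand-in for "encoder + classification head"; not a real BERT model.
toy_model = nn.ModuleDict({
    "encoder": nn.Linear(768, 768),
    "classifier": nn.Linear(768, 4),
})

# Freeze the encoder so that only the classifier receives gradient updates.
for param in toy_model["encoder"].parameters():
    param.requires_grad = False

trainable = [name for name, p in toy_model.named_parameters() if p.requires_grad]
print(trainable)

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in toy_model.parameters() if p.requires_grad), lr=5e-4
)
```

With the notebook's model, the same loop would run over the parameters of the BERT encoder (exposed as the `bert` attribute of `BertForSequenceClassification`) instead of the toy `encoder`.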

In [None]:
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

EPOCHS = 7

for epoch in range(EPOCHS):
    batch_loss = 0
    # The model is in training mode now; for evaluation,
    # we switch it with .eval()
    bert_sequential_model.train()

    for batch in train_dataloader:
        # move the input data to device
        batch = tuple(t.to(device) for t in batch)
        input_ids, attention_mask, labels = batch

        # pass the input to the model
        outputs = bert_sequential_model(
            input_ids, 
            attention_mask=attention_mask, 
            labels=labels
        )
        
        # set model gradients to 0, so that the optimizer won't accumulate
        # them over subsequent training iterations
        optimizer.zero_grad()
        loss = outputs[0]

        # obtain loss, and backprop
        batch_loss += loss.item()
        loss.backward()
        #clip gradient norms to avoid any exploding gradient problems
        # torch.nn.utils.clip_grad_norm_(bert_sequential_model.parameters(), 1.0)
        optimizer.step()

    epoch_train_loss = batch_loss / len(train_dataloader)  
    print(f"epoch: {epoch+1}, train_loss: {epoch_train_loss}")
    
    # At the end of each epoch, we will also run the model
    # on the validation dataset, in evaluation mode (disables dropout)
    bert_sequential_model.eval()
    val_loss = 0
    true_labels, predictions = [], []

    for val_batch in val_dataloader:
        val_batch = tuple(t.to(device) for t in val_batch)
        input_ids, attention_mask, labels = val_batch

        with torch.no_grad():
            outputs = bert_sequential_model(
              input_ids,
              attention_mask=attention_mask,
              labels=labels
            )

        # outputs[0] is the loss of this validation batch
        val_loss += outputs[0].item()

        # convert predictions and gold labels to numpy arrays so that
        # we can compute evaluation metrics like accuracy and f1
        label_ids = labels.to('cpu').numpy()
        preds = outputs[1].detach().cpu().numpy()
        preds = np.argmax(preds, axis=1)
        true_labels.extend(label_ids)
        predictions.extend(preds)

    # average the validation loss over batches, like the training loss
    epoch_val_loss = val_loss / len(val_dataloader)
    acc = accuracy_score(y_true=true_labels, y_pred=predictions)
    f1 = f1_score(y_true=true_labels, y_pred=predictions, average='macro')

    print(f"epoch: {epoch+1} val loss: {epoch_val_loss}, accuracy:{acc}, f1:{f1}")
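
To see why both accuracy and macro F1 are worth reporting, the sketch below computes them by hand on made-up labels for a 4-class problem in which one class is rare. The helpers `accuracy` and `macro_f1` are written from the standard definitions for illustration; in the notebook itself the scikit-learn functions do this work.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels; class 3 is rare and is never predicted.
y_true = [0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 0, 1, 1, 2, 2]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Accuracy stays high because only one example is wrong, while macro F1 drops noticeably because the missed rare class counts as much as any other class.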

## On Data Annotation
In an industrial setting, you may not have enough data to even support finetuning. In such cases, it is general practice to annotate more data. Some standard steps that you would follow as a Machine Learning practitioner for faster data annotation:
1. Collect unlabeled data relevant to the downstream task.
2. Label the dataset with a pre-trained model.
3. Manually verify the labels, and re-annotate the cases where the predictions are wrong.
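
The steps above can be sketched as a small pre-labelling script. Everything here is hypothetical: `toy_model_predict` is a keyword rule standing in for a real pre-trained classifier, the label names are made up, and routing only low-confidence predictions to manual review (with an assumed threshold of 0.9) is one possible way to prioritize the verification step, not a prescribed workflow.

```python
CONFIDENCE_THRESHOLD = 0.9  # assumed cut-off; tune it for your task


def toy_model_predict(text):
    """Hypothetical stand-in for a pre-trained classifier: (label, confidence)."""
    if "hoax" in text.lower():
        return "anti", 0.95
    return "neutral", 0.55


def pre_label(texts, threshold=CONFIDENCE_THRESHOLD):
    """Split texts into confident pre-labels and items needing manual review."""
    accepted, needs_review = [], []
    for text in texts:
        label, confidence = toy_model_predict(text)
        record = {"text": text, "label": label, "confidence": confidence}
        if confidence >= threshold:
            accepted.append(record)
        else:
            needs_review.append(record)
    return accepted, needs_review


unlabeled = ["Climate change is a hoax", "New report on sea levels released"]
accepted, needs_review = pre_label(unlabeled)
print(len(accepted), len(needs_review))
```

An annotator would then look at the `needs_review` items first and spot-check the `accepted` ones.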

[Prodigy](https://prodi.gy) is an annotation tool that is widely used in industry.
One of the advantages of Prodigy is that it makes annotation simpler and faster through its integrations with existing models! Our team has labelled thousands of samples in a matter of hours by leveraging existing models + Prodigy. For a quick look at what Prodigy looks like and how it works, check out this [video](https://www.youtube.com/watch?v=5di0KlKl0fE) by their co-founder Ines.


If you are an organization with multiple annotators, it is totally worth the buy; otherwise it is way too expensive! You can install Prodigy by following the instructions [here](https://prodi.gy/docs/install). If Prodigy seems too expensive for your needs, consider annotating with a script or through a Jupyter Notebook.


 

# Homework: Custom Networks

So far, we have used the `Linear` layer provided by the `BertForSequenceClassification` class. As a part of your homework, construct a custom neural network that adds a multi-layer perceptron (MLP) on top of the `BERT` model.

Some hints:
- Define a PyTorch model by inheriting from the `nn.Module` class
- Use `AutoModel` to obtain an instance of `BertModel`
    ```python
    bert_model = AutoModel.from_pretrained(
          pretrained_model_name_or_path="bert-base-uncased"
    )
    ```
- Add `MLP` layers on top:
  - BERT output vector dimension is 768, therefore the input dimension of MLP should also be 768

This exercise might turn out to be a bit challenging, but don't worry, as we will learn how to define a custom network in the next notebook.