# Lesson 10: Fine-tuning BERT for sentiment analysis using HuggingFace Transformers

<a href="https://colab.research.google.com/github/Paulescu/practical-nlp-2021/blob/main/notebooks/1_fine_tune_bert_for_sentiment_analysis.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/>
</a>

## Learnings:

At the end of this lesson you will know:

- How to leverage massive pre-trained models for your particular ML problem.

- How to use the HuggingFace Transformers API to quickly build a training pipeline.

## Stage 1: Read the data

### Download the data

In [6]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2020-11-24 16:47:48--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz.1’


2020-11-24 16:48:36 (1.67 MB/s) - ‘aclImdb_v1.tar.gz.1’ saved [84125825/84125825]



### Read the data into Python lists

In [1]:
from typing import List, Tuple
from pathlib import Path

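def read_imdb_split(split_dir: str) -> Tuple[List[str], List[int]]:
    """Read all reviews under split_dir/pos and split_dir/neg."""
    split_dir = Path(split_dir)
    texts = []
    labels = []
    for label_dir in ['pos', 'neg']:
        for text_file in (split_dir/label_dir).iterdir():
            texts.append(text_file.read_text(encoding='utf-8'))
            # compare strings with ==, not `is`; label 0 = negative, 1 = positive
            labels.append(0 if label_dir == "neg" else 1)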

    return texts, labels

train_texts_, train_labels_ = read_imdb_split('aclImdb/train')
print('train_texts: ', len(train_texts_))

test_texts, test_labels = read_imdb_split('aclImdb/test')
print('test_texts: ', len(test_texts))

train_texts:  25000
test_texts:  25000
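
The reader above relies on a directory layout with one `pos` and one `neg` folder per split, each holding one text file per review. As a minimal sketch of that layout, the block below builds a tiny fake split in a temporary directory and reads it back. The name `read_split`, the file names, and the review texts are made up for illustration; `sorted()` is added only to make the order deterministic.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

def read_split(split_dir: str) -> Tuple[List[str], List[int]]:
    """Read <split_dir>/pos/* and <split_dir>/neg/*; label 1 = pos, 0 = neg."""
    texts, labels = [], []
    for label_dir in ["pos", "neg"]:
        # sorted() gives the same order on every file system
        for text_file in sorted((Path(split_dir) / label_dir).iterdir()):
            texts.append(text_file.read_text(encoding="utf-8"))
            labels.append(1 if label_dir == "pos" else 0)
    return texts, labels

# Hypothetical toy reviews, keyed by the folder they live in
toy_reviews = {"pos": ["Great film."], "neg": ["Dull.", "Too long."]}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "toy_split"
    for label_dir, reviews in toy_reviews.items():
        (root / label_dir).mkdir(parents=True)
        for i, review in enumerate(reviews):
            (root / label_dir / f"{i}.txt").write_text(review, encoding="utf-8")
    toy_texts, toy_labels = read_split(str(root))

print(toy_texts)   # ['Great film.', 'Dull.', 'Too long.']
print(toy_labels)  # [1, 0, 0]
```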


## 3. Split data into train/validation/test

In [2]:
from sklearn.model_selection import train_test_split

train_texts, val_texts, train_labels, val_labels = \
    train_test_split(train_texts_, train_labels_, test_size=0.2, random_state=1)
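
To make the split sizes concrete, here is a rough pure-Python sketch of a shuffled 80/20 split. It is not how scikit-learn implements `train_test_split`, and the same seed will not reproduce the same shuffle; `toy_train_val_split` and the placeholder reviews are hypothetical.

```python
import random

def toy_train_val_split(items, labels, val_fraction=0.2, seed=1):
    """Shuffle indices once, then cut off the first val_fraction as validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(items) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # texts and labels are selected with the same indices, so they stay aligned
    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

# Placeholder data with the same size as the IMDb training set
toy_texts = [f"review {i}" for i in range(25000)]
toy_labels = [i % 2 for i in range(25000)]

tr_x, va_x, tr_y, va_y = toy_train_val_split(toy_texts, toy_labels)
print(len(tr_x), len(va_x))  # 20000 5000
```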

## 4. Text tokenization

A tokenizer maps a string (for example, a sentence) to a list of integers (token ids).

In [5]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)



In [14]:
print(len(train_encodings.input_ids[0]))
print(len(train_encodings.input_ids[10]))

512
512


In [16]:
idx = 10

for key, value in train_encodings.items():
    print(key, value[idx])

input_ids [101, 3100, 1010, 2061, 1045, 2131, 2009, 1012, 2057, 1005, 2128, 4011, 2000, 2022, 14603, 1012, 1996, 2801, 2038, 2042, 8461, 1012, 1037, 2611, 2003, 2725, 2014, 3611, 1998, 2635, 7760, 1997, 2009, 1012, 2655, 2033, 2058, 1996, 5213, 1011, 2600, 6907, 2021, 1045, 2655, 2005, 1996, 13216, 17555, 1997, 2019, 2552, 2077, 1045, 2064, 2991, 2005, 2023, 1012, 2021, 2123, 1005, 1056, 5987, 2033, 2000, 3422, 1037, 3730, 1011, 22555, 1998, 2468, 14603, 2008, 2016, 2003, 1005, 2725, 2014, 2269, 1005, 1012, 1012, 1012, 1045, 2812, 8440, 1005, 1056, 2008, 4680, 2468, 1037, 2978, 16999, 1999, 1996, 4639, 2143, 3068, 2525, 29543, 2094, 2007, 1005, 9040, 1010, 1998, 16709, 20100, 1005, 22555, 1012, 1012, 1012, 5469, 3475, 1005, 1056, 2054, 2115, 2568, 2064, 7966, 2017, 2046, 8929, 1012, 2009, 2003, 2054, 2941, 6526, 1999, 2143, 1012, 2023, 2003, 2073, 2771, 17339, 11896, 1999, 10367, 1053, 1012, 4654, 7913, 26725, 4150, 10256, 2043, 2009, 4150, 1037, 5454, 2115, 2219, 6172, 1012, 102, 0, 0

**Example of text tokenization**

In [58]:
# example 1
sentence = "I loved the movie"
mock_encodings = tokenizer(sentence, truncation=True, padding=True)
print('Original text: \t', sentence)
print('Tokens: \t', mock_encodings.tokens())
print('Token ids: \t', mock_encodings.input_ids)

# example 2
sentence = "I couldn't understand the movie"
mock_encodings = tokenizer(sentence, truncation=True, padding=True)
print('\nOriginal text: \t', sentence)
print('Tokens: \t', mock_encodings.tokens())
print('Token ids: \t', mock_encodings.input_ids)

# example 3
sentence = "I have seen this movie 23 times!!!"
mock_encodings = tokenizer(sentence, truncation=True, padding=True)
print('\nOriginal text: \t', sentence)
print('Tokens: \t', mock_encodings.tokens())
print('Token ids: \t', mock_encodings.input_ids)

Original text: 	 I loved the movie
Tokens: 	 ['[CLS]', 'i', 'loved', 'the', 'movie', '[SEP]']
Token ids: 	 [101, 1045, 3866, 1996, 3185, 102]

Original text: 	 I couldn't understand the movie
Tokens: 	 ['[CLS]', 'i', 'couldn', "'", 't', 'understand', 'the', 'movie', '[SEP]']
Token ids: 	 [101, 1045, 2481, 1005, 1056, 3305, 1996, 3185, 102]

Original text: 	 I have seen this movie 23 times!!!
Tokens: 	 ['[CLS]', 'i', 'have', 'seen', 'this', 'movie', '23', 'times', '!', '!', '!', '[SEP]']
Token ids: 	 [101, 1045, 2031, 2464, 2023, 3185, 2603, 2335, 999, 999, 999, 102]
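
The examples above show the special tokens `[CLS]` and `[SEP]`, but not what `truncation=True` and `padding=True` do. The toy encoder below is a sketch of those two steps only. It splits on whole words rather than WordPiece subwords, so it is not the real DistilBERT tokenizer. The ids for the six known tokens are copied from the output above; `PAD_ID = 0` and `UNK_ID = 100` are assumptions for this sketch, and `toy_encode` is a hypothetical name.

```python
import re
from typing import Dict, List

PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102  # PAD_ID and UNK_ID are assumed
TOY_VOCAB: Dict[str, int] = {"i": 1045, "loved": 3866, "the": 1996, "movie": 3185}

def toy_encode(text: str, max_length: int) -> Dict[str, List[int]]:
    # lowercase, then split into words and single punctuation marks
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    # truncation: keep room for the two special tokens
    words = words[: max_length - 2]
    ids = [CLS_ID] + [TOY_VOCAB.get(w, UNK_ID) for w in words] + [SEP_ID]
    # padding: fill up to max_length; the mask marks real tokens with 1
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

padded = toy_encode("I loved the movie", max_length=8)
print(padded["input_ids"])       # [101, 1045, 3866, 1996, 3185, 102, 0, 0]
print(padded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]

truncated = toy_encode("I loved the movie", max_length=4)
print(truncated["input_ids"])    # [101, 1045, 3866, 102]
```

The attention mask is what lets the model ignore padded positions, which is why the training loop passes it alongside `input_ids`.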


## 5. Create a PyTorch Dataset object

### How to create a custom dataset in PyTorch?

Steps:
- Define a new class (e.g. `MyDataset`) that extends `torch.utils.data.Dataset`.
- Override the `__getitem__(idx)` method to return the sample at position `idx`.
- Override the `__len__()` method to return the number of samples.

In [47]:
import torch

class IMDbDataset(torch.utils.data.Dataset):
    
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
        
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    
    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)
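
Why are `__getitem__` and `__len__` enough? A loader only needs to know how many samples exist and how to fetch one by index. The torch-free sketch below illustrates the idea with plain lists. `ToyDataset` and `iter_batches` are hypothetical names, and the real `DataLoader` additionally handles shuffling, tensors, and worker processes.

```python
from typing import Dict, Iterator, List

class ToyDataset:
    """Map-style dataset: fetch one sample by index, report the total count."""

    def __init__(self, encodings: Dict[str, List[List[int]]], labels: List[int]):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx: int) -> Dict[str, object]:
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self) -> int:
        return len(self.labels)

def iter_batches(dataset, batch_size: int) -> Iterator[Dict[str, list]]:
    """Simplified stand-in for a data loader (no shuffling, no tensors)."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[i] for i in range(start, stop)]
        # collate: turn a list of per-sample dicts into one dict of lists
        yield {key: [item[key] for item in items] for key in items[0]}

toy_encodings = {
    "input_ids": [[101, 1045, 102], [101, 3185, 102], [101, 1996, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
}
toy_dataset = ToyDataset(toy_encodings, labels=[1, 0, 1])
toy_batches = list(iter_batches(toy_dataset, batch_size=2))

print(len(toy_batches))           # 2
print(toy_batches[1]["labels"])   # [1]
```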

## Setup Tensorboard to visualize metrics during training

In [8]:
%load_ext tensorboard
%tensorboard --logdir runs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


## Fine-tuning with native PyTorch

In [9]:
BATCH_SIZE = 16
LEARNING_RATE = 5e-5
EPOCHS = 10
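
As a quick back-of-the-envelope check of what these settings imply, the sketch below counts optimizer steps. It assumes the 20,000 training reviews left after the 80/20 split of the 25,000-review training set, and a loader that keeps the last partial batch.

```python
import math

n_train = 20_000   # 80% of the 25,000 training reviews
batch_size = 16
epochs = 10

# one optimizer step per batch; ceil() accounts for a possible partial last batch
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 1250 12500
```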

In [None]:
from torch.utils.data import DataLoader
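from transformers import DistilBertForSequenceClassification
# transformers' own AdamW is deprecated in favour of the PyTorch implementation
from torch.optim import AdamW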

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(device)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

optim = AdamW(model.parameters(), lr=LEARNING_RATE)
global_train_step = 0

# Setup logging to Tensorboard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()

def train_val_epoch(
    # model,
    batch_loader: DataLoader,
    epoch: int,
    is_train: bool=True
    ):
    
    if is_train:
        model.train()
    else:
        model.eval()

    epoch_loss = 0
    epoch_predictions = 0
    epoch_correct_predictions = 0
    for batch in batch_loader:
        
        # forward pass
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(
            input_ids,
            attention_mask=attention_mask,
            labels=labels,
        )
        loss = outputs[0]
        _, predictions = torch.max(outputs[1], 1)

        if is_train:
            # backward pass
            optim.zero_grad()    
            loss.backward()
            optim.step()
        
        # batch stats
        batch_size = input_ids.shape[0]
        batch_correct_predictions = predictions.eq(labels.data).sum().item()
        batch_accuracy = batch_correct_predictions/batch_size
        batch_loss = loss.item()

        # epoch stats
        epoch_correct_predictions += batch_correct_predictions
        epoch_predictions += batch_size     
        # the model returns the batch mean loss, so weight it by the batch size
        epoch_loss += batch_loss * batch_size

        # log batch metrics, only in train mode.
        # The purpose is to verify the loss is actually going down as we traverse
        # the whole train set.
        if is_train:
            global global_train_step
            global_train_step += batch_size
            writer.add_scalar('training_batch_loss',
                              batch_loss,
                              global_train_step)
            writer.add_scalar('training_batch_accuracy',
                              batch_accuracy,
                              global_train_step)
      
    # epoch loss and accuracy
    epoch_loss = epoch_loss / epoch_predictions 
    epoch_accuracy = epoch_correct_predictions / epoch_predictions

    # log epoch metrics, both in train and validation mode.
    epoch_loss_metric_name = 'training_epoch_loss' if is_train \
        else 'validation_epoch_loss'
    epoch_accuracy_metric_name = 'training_epoch_accuracy' if is_train \
        else 'validation_epoch_accuracy'
    writer.add_scalar(epoch_loss_metric_name, epoch_loss, epoch)
    writer.add_scalar(epoch_accuracy_metric_name, epoch_accuracy, epoch)

for epoch in range(EPOCHS):
    # train
    train_val_epoch(train_loader, epoch, is_train=True)

    # validation
    with torch.no_grad():
        train_val_epoch(val_loader, epoch, is_train=False)
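
The epoch metrics logged above are aggregates over batches. One consistent way to compute them is to weight each batch's mean loss by its batch size before dividing by the number of samples, so a smaller final batch does not count as much as a full one. The sketch below shows that arithmetic without PyTorch; `aggregate_epoch` and the numbers are made up for illustration.

```python
from typing import List, Tuple

def aggregate_epoch(batches: List[Tuple[float, int, int]]) -> Tuple[float, float]:
    """Each entry is (mean_loss, n_correct, batch_size) for one batch."""
    total_loss = sum(mean_loss * size for mean_loss, _, size in batches)
    total_correct = sum(n_correct for _, n_correct, _ in batches)
    total_samples = sum(size for _, _, size in batches)
    # per-sample mean loss and accuracy over the whole epoch
    return total_loss / total_samples, total_correct / total_samples

# Two full batches of 16 and a final partial batch of 8 (hypothetical values)
toy_stats = [(0.5, 12, 16), (0.25, 14, 16), (1.0, 4, 8)]
toy_epoch_loss, toy_epoch_accuracy = aggregate_epoch(toy_stats)

print(toy_epoch_loss, toy_epoch_accuracy)  # 0.5 0.75
```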