## Sentiment Analysis with BERT
To train a sentiment analysis model, we will perform the following operations:
- Install the Transformers library
- Load the BERT classifier and tokenizer
- Download the Financial PhraseBank dataset from Hugging Face and create a processed dataset
- Configure the loaded BERT model and fine-tune it
- Make predictions with the fine-tuned model

To fine-tune the BERT model, it is recommended to run the Python notebook on Google Colab. Training on a CPU can take anywhere from several hours to much longer, depending on the size of the dataset.
Google Colab offers free GPUs and TPUs. Since we'll be training a large neural network, it's best to make use of them.

A GPU can be added by going to the menu and selecting:

Runtime -> Change runtime type -> Hardware accelerator: GPU


### Step 1: Installing Transformers
Install the transformers and datasets libraries using the following commands:

In [None]:
!pip install -qqq transformers
!pip install -qqq datasets


## Step 2: Load BERT Classifier and Tokenizer
After the installation is complete, we import the libraries needed for loading, exploring, and modeling the dataset. The Hugging Face Transformers library contains PyTorch implementations of state-of-the-art NLP models, including BERT, along with pre-trained model weights.

In [None]:
import numpy as np
import pandas as pd


import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

from torch import nn, optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

from transformers import BertModel, BertConfig, BertTokenizer, BertForSequenceClassification


### Step 3: Download the financial_phrasebank dataset from Hugging Face
This dataset, downloaded from Hugging Face, contains 4846 sentences from English-language financial news, categorised by sentiment. It is divided into configurations by the agreement rate among the 5-8 annotators.
#### Data Fields
- sentence: a tokenized line from the dataset
- label: the sentiment class ('negative', 'neutral' or 'positive'), stored as an integer class index

#### Data Splits
The dataset is available in four possible configurations depending on the percentage of agreement of annotators. I will be working with the following configuration:

**sentences_50agree**: number of instances with >=50% annotator agreement: 4846

The dataset has no train/validation/test split, so I use the train_test_split method of the datasets library to split it into train and test sets. I then split the test set further into validation and test halves, and gather everything into a single dataset dictionary.


In [None]:
# load  dataset and split the data into train/validation/test datasets
from datasets import load_dataset, DatasetDict

raw_datasets = load_dataset("financial_phrasebank", "sentences_50agree")
# 90% train and 10% test + validation
train_test_ds = raw_datasets["train"].train_test_split(test_size=0.1)

# Split the 10% test + valid in half test, half valid
test_valid = train_test_ds['test'].train_test_split(test_size=0.5)

# Gather everything into a single DatasetDict
dataset = DatasetDict({
    'train': train_test_ds['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 4361
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 243
    })
    valid: Dataset({
        features: ['sentence', 'label'],
        num_rows: 242
    })
})
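
The split sizes printed above can be reproduced with a little arithmetic. The sketch below is a plain-Python illustration, not the library's code: it assumes the test share is rounded up to a whole number of rows, which is consistent with the row counts shown. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up to a whole number of rows.
    """
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test

# First split: 90% train, 10% held out
n_train, n_holdout = split_sizes(4846, 0.1)
# Second split: the held-out rows are divided in half (validation / test)
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)  # 4361 242 243
```

Because neither call fixes a random seed, the rows that land in each split change from run to run, while the sizes stay the same.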

In [None]:
dataset["train"][0]

{'label': 2, 'sentence': 'Cargo volume increased by approximately 5 % .'}

The labels are already stored as integers. To find out which class name each integer stands for, inspect the features of the dataset.

In [None]:
dataset["train"].features

{'label': ClassLabel(num_classes=3, names=['negative', 'neutral', 'positive'], names_file=None, id=None),
 'sentence': Value(dtype='string', id=None)}
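
To read predictions later on, it helps to have the integer-to-name mapping at hand. The sketch below hard-codes the names in the order reported by the features output above; the helper names `id_to_label` and `label_to_id` are hypothetical. The `ClassLabel` feature itself offers `int2str` and `str2int` methods for the same purpose.

```python
# Class names in the order shown by dataset["train"].features
LABEL_NAMES = ["negative", "neutral", "positive"]

def id_to_label(label_id):
    """Map an integer class index to its name."""
    return LABEL_NAMES[label_id]

def label_to_id(name):
    """Map a class name back to its integer index."""
    return LABEL_NAMES.index(name)

# The first training example shown earlier
example = {"label": 2, "sentence": "Cargo volume increased by approximately 5 % ."}
print(id_to_label(example["label"]))  # positive
```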

### Preprocess the dataset
Machine learning models don't work with raw text, so I need to convert the text into numbers the model can make sense of. The pre-built BertTokenizer does this in two steps: it splits each sentence into tokens and then maps every token to its integer ID in the vocabulary. _DataCollatorWithPadding_ then pads the items of each batch to the length of the longest item in that batch.

In [None]:
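from transformers import BertTokenizer, DataCollatorWithPadding

checkpoint = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(checkpoint)

# Quick check: tokenize all training sentences at once (the result is not used later)
inputs = tokenizer(dataset["train"]["sentence"])

def tokenize_function(example):
    # With batched=True, example["sentence"] is a list of sentences;
    # padding=True pads each such list to its own longest sentence
    return tokenizer(example["sentence"], padding=True, truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
# The collator pads the items of each DataLoader batch to a common length
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)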
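
To make the two ideas concrete (tokens to IDs, then padding per batch), here is a minimal pure-Python sketch. It is a toy under stated assumptions: the vocabulary and its IDs are hypothetical, and whitespace splitting stands in for BERT's WordPiece tokenization. The names `TOY_VOCAB`, `encode` and `pad_batch` are illustrative only.

```python
# Hypothetical toy vocabulary; the real BERT vocabulary has roughly 30,000 entries
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "cargo": 4, "volume": 5, "increased": 6, "profit": 7, "fell": 8}

def encode(sentence):
    """Lowercase, split on whitespace, and wrap the token IDs in [CLS] ... [SEP]."""
    tokens = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]

def pad_batch(encoded):
    """Pad every sequence to the longest one in this batch and build attention masks."""
    max_len = max(len(ids) for ids in encoded)
    pad_id = TOY_VOCAB["[PAD]"]
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in encoded]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = pad_batch([encode("Cargo volume increased"), encode("Profit fell")])
print(toy_batch["input_ids"])       # [[2, 4, 5, 6, 3], [2, 7, 8, 3, 0]]
print(toy_batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]]
```

Padding per batch rather than to one global length keeps batches of short sentences small, which is the motivation for using a collator.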



Next, put **_tokenized_datasets_** into the format the dataloaders and the model expect: drop the raw _sentence_ column, which the model cannot consume, and rename _label_ to _labels_, the argument name the model uses.

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(
    ["sentence"]
)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets["train"].column_names

['attention_mask', 'input_ids', 'labels', 'token_type_ids']

Before writing the training loop, I create an iterator over each dataset split using the torch DataLoader class. It serves the data in mini-batches of 8 (shuffled for the training split), so only one batch at a time has to be moved to the GPU.

In [None]:
# Define dataloaders
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["valid"], batch_size=8, collate_fn=data_collator
)

To confirm that preprocessing worked, inspect one batch from the training dataloader before turning to the model.

In [None]:
#inspecting dataloaders
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 150]),
 'input_ids': torch.Size([8, 150]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 150])}

## Step 4: Configure, train and fine-tune the BERT model
BERT-base consists of 12 transformer layers. Each layer takes in a sequence of token embeddings and produces the same number of embeddings with the same hidden size (dimension) on the output. The transformers library provides the BertForSequenceClassification class, which is designed for classification tasks.
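
As a rough illustration of what the classification class adds on top of the pre-trained encoder: the head is a single linear layer that maps the pooled sentence representation to one logit per class. The sketch below counts its parameters, assuming the BERT-base hidden size of 768 and the three sentiment classes used here. The helper name `linear_param_count` is hypothetical.

```python
HIDDEN_SIZE = 768  # hidden size of BERT-base
NUM_LABELS = 3     # negative, neutral, positive

def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

print(linear_param_count(HIDDEN_SIZE, NUM_LABELS))  # 2307
```

These few thousand parameters are the only ones that start from random values; the rest of the network starts from the pre-trained checkpoint.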


### 4.1 Define the model

In [None]:
# Define the model 
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

To make sure that everything will go smoothly during training, we pass our batch to this model:

In [None]:
# Pass batch to the model
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(1.2860, grad_fn=<NllLossBackward>) torch.Size([8, 3])


All Transformers models return the loss when labels are provided, and we also get the logits, which can be converted into probabilities.
Before training the model, I need to create an optimizer and a learning rate scheduler.
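
The conversion from logits to probabilities is a softmax, and the loss for single-label classification is typically the cross-entropy of the true class. The sketch below shows both in plain Python; the logit values are made up for illustration, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])

# Made-up logits for one sentence over (negative, neutral, positive)
probs = softmax([2.0, 0.5, -1.0])
print([round(p, 3) for p in probs])

# A model that is undecided between three classes has a loss of ln(3)
print(round(cross_entropy([0.0, 0.0, 0.0], 1), 4))  # 1.0986
```

The loss of 1.2860 printed above is in the same range as ln(3), which is what one would expect from a classification head that has not been trained yet.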

### 4.2 Optimizer & Learning Rate Scheduler

In [None]:
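# transformers.AdamW is deprecated in favor of the PyTorch implementation
from torch.optim import AdamW

# weight_decay=0.0 matches the default of the old transformers.AdamW
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)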

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
print(num_training_steps)

1638
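
The number 1638 follows from the training set size and the batch size, and the linear schedule with zero warmup lowers the learning rate from its starting value to zero over those steps. The sketch below reproduces both in plain Python as an illustration of the idea, not the library's implementation; the function name `linear_lr` is hypothetical.

```python
import math

TRAIN_ROWS, BATCH_SIZE, NUM_EPOCHS, BASE_LR = 4361, 8, 3, 5e-5

# The last batch of each epoch is smaller, so round up
steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = NUM_EPOCHS * steps_per_epoch

def linear_lr(step, warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - warmup_steps)

print(steps_per_epoch, total_steps)  # 546 1638
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step))
```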


### 4.3 The training loop

I will train the BERT classifier for 3 epochs and then evaluate its performance on the validation and test sets. Training might take a while, so make sure GPU acceleration is enabled in the notebook settings.

In [None]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

device(type='cuda')

To get some sense of when training will be finished, we add a progress bar over our number of training steps, using the tqdm library:

In [None]:
from tqdm.auto import tqdm

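progress_bar = tqdm(range(num_training_steps))

model.train()  # enable dropout and other training-time behavior
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # Move the batch to the same device as the model
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass; the loss is returned because the batch contains labels
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()  # compute gradients

        optimizer.step()       # update the weights
        lr_scheduler.step()    # update the learning rate
        optimizer.zero_grad()  # clear gradients before the next batch
        progress_bar.update(1)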
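
The order of operations in the loop (forward pass, backward pass, optimizer step, gradient reset) can be seen in isolation on a toy problem. The sketch below fits a single weight with plain gradient descent in pure Python. It is a stand-in for illustration, not AdamW, and the data and variable names are made up. The gradient buffer is accumulated into and then cleared, mimicking why `zero_grad()` is needed.

```python
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # toy data generated by y = 2x
w, grad, lr = 0.0, 0.0, 0.05

for step in range(50):
    # Forward pass: mean squared error of the prediction w * x
    loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    # Backward pass: the gradient is added to the buffer, as PyTorch does with .grad
    grad += sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
    # Optimizer step, then clear the buffer so the next step starts from zero
    w -= lr * grad
    grad = 0.0

print(round(w, 4))  # 2.0
```

If the buffer were not cleared, each update would reuse the gradients of all earlier steps and the updates would no longer follow the current loss.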

### 4.4 The evaluation loop
I evaluate the model with the **_classification_report_** function, which needs the predicted labels and the true labels. Since the model outputs logits as tensors, one batch at a time, I take the argmax of the logits to get the predicted class for each example, collect the per-batch predictions in a list, and then concatenate them into a single NumPy array.

In [None]:
from sklearn.metrics import classification_report
pred_list = []

model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        pred_list.append(predictions.cpu().numpy())
        
pred_list = np.concatenate(pred_list, axis =0)
pred_list


array([0, 2, 1, 1, 0, 2, 2, 0, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 0, 1, 1, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
       1, 1, 0, 2, 1, 0, 1, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,
       1, 1, 2, 1, 1, 1, 2, 1, 2, 0, 1, 0, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2,
       2, 2, 0, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2,
       2, 1, 0, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2,
       1, 1, 2, 2, 1, 1, 1, 1, 1, 0, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, 2,
       2, 2, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 0, 0, 1, 2, 1, 2, 1, 2, 1, 0,
       1, 1, 1, 2, 0, 1, 1, 2, 2, 1, 1, 1, 1, 2, 0, 1, 0, 1, 1, 2, 1, 1,
       0, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 1, 2, 2, 1, 0, 1, 2, 1, 1, 1])

In [None]:
true_list = dataset["valid"]["label"]

In [None]:
print(classification_report(true_list,pred_list))

              precision    recall  f1-score   support

           0       0.86      0.83      0.84        23
           1       0.88      0.88      0.88       147
           2       0.78      0.81      0.79        72

    accuracy                           0.85       242
   macro avg       0.84      0.84      0.84       242
weighted avg       0.85      0.85      0.85       242
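
The per-class numbers in the report come from simple counts of true positives, false positives and false negatives. The sketch below computes them in plain Python on a small made-up example. The labels are hypothetical and not taken from the validation set, and the function name `per_class_report` is illustrative.

```python
def per_class_report(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn

# Hypothetical toy labels: 0 = negative, 1 = neutral, 2 = positive
toy_true = [0, 1, 1, 2, 2, 2]
toy_pred = [0, 1, 2, 2, 2, 1]

for label in (0, 1, 2):
    print(label, per_class_report(toy_true, toy_pred, label))
```

Precision asks how many of the predictions for a class were right, recall asks how many of the true members of a class were found, and F1 is their harmonic mean.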



In [None]:
test_dataloader = DataLoader(
    tokenized_datasets["test"], batch_size=8, collate_fn=data_collator
)


In [None]:
pred_test = []
for batch in test_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        pred_test.append(predictions.cpu().numpy())
        
pred_test = np.concatenate(pred_test, axis =0)
pred_test

array([1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1,
       1, 0, 0, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2,
       1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 2, 1, 1, 1, 0, 1, 1, 1, 2, 0, 1, 1, 1, 2, 1, 2, 1, 1, 2,
       2, 2, 2, 1, 2, 1, 1, 1, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 2, 0, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1,
       1, 1, 1, 0, 2, 2, 2, 1, 1, 2, 1, 2, 1, 0, 2, 1, 0, 2, 1, 1, 1, 1,
       1, 1, 2, 1, 1, 2, 0, 2, 1, 0, 0, 1, 0, 2, 1, 2, 1, 1, 2, 1, 2, 1,
       2, 0, 2, 1, 1, 1, 0, 0, 1, 0, 2, 1, 0, 2, 1, 2, 1, 1, 1, 2, 2, 1,
       1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 0, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 2, 1, 1, 0, 1, 0, 1, 1, 1, 1, 2,
       0])

In [None]:
print(classification_report(dataset["test"]["label"],pred_test))

              precision    recall  f1-score   support

           0       0.75      0.88      0.81        24
           1       0.95      0.88      0.91       160
           2       0.74      0.83      0.78        59

    accuracy                           0.87       243
   macro avg       0.81      0.86      0.83       243
weighted avg       0.88      0.87      0.87       243
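
The two summary rows differ in how they combine the per-class scores: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below recomputes both precision averages from the rounded per-class values in the test report above, so small rounding differences are possible in general. The variable names are illustrative.

```python
# Per-class precision and support, copied from the test report above
precision = {0: 0.75, 1: 0.95, 2: 0.74}
support = {0: 24, 1: 160, 2: 59}

total = sum(support.values())
macro_avg = sum(precision.values()) / len(precision)
weighted_avg = sum(precision[c] * support[c] for c in precision) / total

print(total)                   # 243
print(round(macro_avg, 2))     # 0.81
print(round(weighted_avg, 2))  # 0.88
```

Because the neutral class dominates the dataset, the weighted average sits closer to the neutral score, while the macro average is pulled down by the smaller classes.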

