# Introduction

### What is BERT? (Bidirectional Encoder Representations from Transformers)

***BERT is a deep learning model developed by Google in 2018.***  
  
***It is based on the Transformer architecture.***  
  
***It reads entire sentences at once, from both left and right — this is called bidirectional.***

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805).

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)


# Exploratory Data Analysis and Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [None]:
import torch # Imports the PyTorch library which is a popular deep learning framework used for building and training neural networks.
import pandas as pd
from tqdm.notebook import tqdm
pd.set_option('display.max_rows', None)

In [None]:
df = pd.read_csv('smileannotationsfinal.csv', names=['id', 'text', 'category'])
df.set_index('id', inplace=True)

In [None]:
df.head()

In [None]:
df.category.value_counts()

In [None]:
df = df[~df.category.str.contains(r'\|')]  # drop rows with more than one emotion label (raw string avoids an invalid-escape warning)

In [None]:
df = df[df.category != 'nocode']

In [None]:
df.category.value_counts()

In [None]:
possible_labels = df.category.unique()

In [None]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
label_dict

In [None]:
df['label'] = df.category.map(label_dict)  # map each category name to its integer ID

In [None]:
df.head()
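The filtering and label-encoding steps above can be sketched on a few toy rows without pandas. This is only an illustration: the rows and the names `rows`, `kept`, and `toy_label_dict` are hypothetical and not part of the dataset or the notebook.

```python
# Toy (text, category) pairs standing in for the dataset rows (hypothetical data).
rows = [
    ("great exhibition", "happy"),
    ("no opinion here", "nocode"),
    ("mixed feelings", "happy|sad"),
    ("so annoyed by the queue", "angry"),
    ("lovely afternoon", "happy"),
]

# Drop multi-label rows (category contains '|') and unlabeled rows ('nocode').
kept = [(text, cat) for text, cat in rows if "|" not in cat and cat != "nocode"]

# Give each remaining category an integer ID in order of first appearance.
toy_label_dict = {}
for _, cat in kept:
    toy_label_dict.setdefault(cat, len(toy_label_dict))

labels = [toy_label_dict[cat] for _, cat in kept]
print(toy_label_dict)  # {'happy': 0, 'angry': 1}
print(labels)          # [0, 1, 0]
```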

# Training/Validation Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df.index.values,
                                                  df.label.values,
                                                  test_size=0.15,
                                                  random_state=17,
                                                  stratify=df.label.values)

In [None]:
df['data_type'] = ['not_set']*df.shape[0]

In [None]:
df

In [None]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [None]:
df.groupby(['category', 'label', 'data_type']).count()
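To see what `stratify` is meant to achieve, here is a hand-rolled sketch of a stratified split in plain Python. It is not scikit-learn's implementation; the function `stratified_split` and the toy labels are assumptions made for illustration. The point is that each class contributes roughly the same fraction of its samples to the validation set.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, val_idx) so every class keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_val = round(len(indices) * test_size)   # this class's validation quota
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Hypothetical imbalanced labels: 20 / 40 / 40 samples for classes 0 / 1 / 2.
toy_labels = [0] * 20 + [1] * 40 + [2] * 40
train_idx, val_idx = stratified_split(toy_labels, test_size=0.15, seed=17)

train_counts = Counter(toy_labels[i] for i in train_idx)  # 17 / 34 / 34 per class
val_counts = Counter(toy_labels[i] for i in val_idx)      # 3 / 6 / 6 per class
print(train_counts, val_counts)
```

Without stratification, a rare class could end up almost entirely in one of the two parts, which would make the validation scores for that class unreliable.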

# Loading Tokenizer and Encoding our Data

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [None]:
#uses a BERT tokenizer to convert text data into numerical token IDs and attention masks
#padding='max_length' with truncation=True replaces the deprecated pad_to_max_length=True
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=256,
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

In [None]:
encoded_data_train
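The encoded output holds two aligned sequences per tweet: `input_ids` and `attention_mask`. The sketch below shows, in plain Python, roughly how padding to a fixed length relates the two. The helper `pad_and_mask` is hypothetical and simplified (the real tokenizer also keeps the closing special token when it truncates); 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` IDs in the bert-base-uncased vocabulary, while the other IDs are arbitrary.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token IDs and build the matching attention mask."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)                 # 1 = real token the model should attend to
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad   # 0 = padding to ignore

# A short hypothetical sequence: [CLS] token token [SEP]
ids, mask = pad_and_mask([101, 2023, 2003, 102], max_length=8)
print(ids)   # [101, 2023, 2003, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```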

In [None]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [None]:
len(dataset_train)

In [None]:
len(dataset_val)

# Setting up BERT Pretrained Model

In [None]:
from transformers import BertForSequenceClassification

**BertForSequenceClassification is built on top of BertModel: it takes the pretrained BertModel encoder and adds a classification layer on top. The class is imported from the transformers library. Converting the input text into a format the model can process is the job of the tokenizer, not of this class.**

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

**This line loads a pre-trained BERT model for classifying text into categories. It uses the bert-base-uncased version, which was trained on lowercased text and therefore does not distinguish uppercase from lowercase.
num_labels sets how many classes the model will predict.
It turns off extra outputs like attention and hidden states to save memory and speed up processing.**

# Creating Data Loaders

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

**The dataloader facilitates efficient loading and processing of data, particularly for machine learning models.**

**Random sampler samples elements from the dataset randomly, without replacement, ensuring that each element is selected only once during an epoch.**

**Sequential Sampler iterates through the dataset indices in a sequential manner, starting from the first element and moving to the next until the end.**

In [None]:
#set up data loaders to efficiently feed data to the model in batches during training and validation.
batch_size = 32

dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val,
                                   sampler=SequentialSampler(dataset_val),
                                   batch_size=batch_size)

**These lines create data loaders to feed data into the model during training and validation.
Batch size means that the model will process 32 samples at a time.
dataloader_train randomly shuffles training data using RandomSampler.
dataloader_validation reads validation data in order using SequentialSampler.**
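The difference between the two samplers, and the number of batches per epoch, can be sketched in plain Python. This is an illustration only, not how PyTorch implements `DataLoader`; the dataset size of 100 and the helper `batches` are assumptions.

```python
import math
import random

def batches(indices, batch_size):
    """Split a list of indices into consecutive batches; the last one may be smaller."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]

n_samples, batch_size = 100, 32   # hypothetical dataset size
indices = list(range(n_samples))

# Sequential sampling: indices in their original order.
sequential = batches(indices, batch_size)

# Random sampling without replacement: a shuffled permutation of the same indices.
rng = random.Random(17)
shuffled = indices[:]
rng.shuffle(shuffled)
randomised = batches(shuffled, batch_size)

print(len(sequential), math.ceil(n_samples / batch_size))  # 4 4
print([len(b) for b in sequential])                          # [32, 32, 32, 4]
```

Both orderings visit every sample exactly once per epoch; only the order differs. The batch count is what `len(dataloader_train)` reports.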

# Setting Up Optimiser and Scheduler

In [None]:
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

**AdamW is the optimizer: it adjusts the model's weights to reduce the training loss, using the Adam update rule with decoupled weight decay.**

**get_linear_schedule_with_warmup increases the learning rate linearly during the warmup steps, then decreases it linearly to zero, helping the model learn more smoothly.**

In [None]:
optimizer = AdamW(model.parameters(),
                  lr=1e-5,
                  eps=1e-8)

**This line initializes the AdamW optimizer, which is used to update the model's weights during training to minimize the error. It takes the model's parameters as input, meaning it knows which parts of the model to adjust. The learning rate is set to a small value of 1e-5, which controls how quickly the model learns. Smaller values make learning slower and more stable. The eps=1e-8 is a small number added to avoid division by zero or very small numbers during optimization, ensuring numerical stability.**
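To make the role of `lr` and `eps` concrete, here is a minimal sketch of one AdamW update for a single scalar parameter. It follows the published update rule rather than PyTorch's source code; the function `adamw_step` is hypothetical, and the `beta1`, `beta2`, and `weight_decay` values are assumed to match PyTorch's defaults since the notebook does not set them.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter; returns (param, m, v)."""
    param = param * (1 - lr * weight_decay)        # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad             # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad      # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)  # eps keeps the denominator non-zero
    return param, m, v

# First step on a parameter of 1.0 with gradient 0.5: the change is about lr in size.
p, m, v = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(p)

# With a zero gradient the denominator would be 0 without eps; here the step is simply 0.
p_zero, _, _ = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(p_zero)
```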

In [None]:
epochs = 3
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)

**The first line sets the number of training epochs to 3. The scheduler wraps the AdamW optimizer created above. num_warmup_steps=0 means there's no warmup period, so the learning rate starts decreasing right away. The num_training_steps is set to the total number of batches the model will see during training, calculated by multiplying the number of batches per epoch (len(dataloader_train)) by the total number of epochs. This makes the learning rate decrease gradually, reaching zero at the last training step.**
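The shape of the schedule can be sketched as a multiplier on the base learning rate. This is a plain-Python restatement of the linear warmup and decay rule, not the library code; `linear_schedule_factor` and the figure of 40 batches per epoch are assumptions for illustration.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)            # linear ramp up
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))  # linear decay

base_lr = 1e-5
total_steps = 40 * 3   # hypothetical: 40 batches per epoch, 3 epochs

# With no warmup the rate starts at its full value and falls to zero at the last step.
for step in (0, 60, 120):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```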

# Defining our Performance Metrics

In [None]:
import numpy as np
from sklearn.metrics import f1_score

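**F1 score is an evaluation metric that combines precision and recall into a single number (their harmonic mean), which makes it more informative than plain accuracy when classes are imbalanced.
NumPy is used to work with n-dimensional arrays.**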

In [None]:
#calculates the weighted F1 score by comparing the model's predicted classes with the true labels
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

**This function calculates the F1 score, which measures how well the model is performing. It first finds the predicted class with the highest score using argmax. Then, it flattens both predictions and true labels to 1D arrays for comparison. Finally, it returns the weighted F1 score: the F1 score is computed for each class and then averaged, with each class weighted by its number of true samples.**
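A hand computation shows what the weighted average does. The sketch below is written in plain Python to mirror the definition, not scikit-learn's implementation; `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's count in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true          # weight the class F1 by its support
    return total / len(y_true)

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
# Class 0: precision 3/4, recall 3/4, F1 0.75 (weight 4)
# Class 1: precision 1/2, recall 1/2, F1 0.50 (weight 2)
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6667
```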

In [None]:
#calculates and prints the prediction accuracy for each individual class in your dataset
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

**This function prints the accuracy for each class separately.
First, it creates a dictionary to map label numbers back to their names.
It flattens the predictions and true labels to make them easier to compare.
Then, for each unique class, it finds how many predictions were correct out of the total for that class.
It prints the class name and its accuracy in the format: correct predictions / total samples.**

# Creating our Training Loop

In [None]:
#sets random seeds for reproducibility in Python, NumPy, and PyTorch, ensuring consistent results from random operations.
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

**These lines set a fixed random seed to make the results reproducible.
By setting seed_val = 17, it ensures that random operations (like shuffling or weight initialization) give the same results every time.
It sets the seed for Python random module, NumPy, and PyTorch, so everything behaves consistently during each run.**
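The effect of a fixed seed is easy to check with Python's own `random` module. This small sketch uses a hypothetical helper, `shuffled_indices`, and covers only the Python generator; NumPy and PyTorch keep separate generators, which is why the cell above seeds each one.

```python
import random

def shuffled_indices(seed, n=10):
    """Shuffle range(n) after seeding Python's global random generator."""
    random.seed(seed)
    indices = list(range(n))
    random.shuffle(indices)
    return indices

run_a = shuffled_indices(17)
run_b = shuffled_indices(17)
print(run_a == run_b)  # True: the same seed reproduces the same shuffle
```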

In [None]:
#sets the computing device for the model to either a GPU (if available) or the CPU and then prints the selected device.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

**If torch.cuda.is_available() returns True, the model runs on the GPU (cuda); otherwise it runs on the CPU. The selected device is then printed so you can see which one is in use.**

In [None]:
# function evaluates the model on validation data, calculates the average loss, and collects predictions and true labels
def evaluate(dataloader_val):

    model.eval()

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(dataloader_val)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals

**This function evaluates the model's performance on the validation data.
It sets the model to evaluation mode using model.eval() to turn off dropout and other training-only behaviors.
For each batch in the validation set, it moves data to the correct device CPU or GPU and prepares the input.
It then runs the model without updating weights (torch.no_grad()), collects the loss, predictions, and true labels.
Finally, it calculates the average loss and returns it along with all predictions and true labels for further analysis.**

In [None]:
#trains the BERT model for a specified number of epochs, processes data in batches, updates model weights based on calculated loss, saves the model state after each epoch, and evaluates performance on validation data.
for epoch in tqdm(range(1, epochs+1)):

    model.train()

    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        outputs = model(**inputs)

        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

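        # loss is already averaged over the samples in the batch; len(batch) is the
        # number of tensors in the tuple (3), not the batch size, so do not divide by it
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})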


    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')

    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')

    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

**This code trains the model for multiple epochs with progress bars to show real-time updates.
For each epoch, it sets the model to training mode and initializes the total training loss.
It loops through batches in the training data, moves data to the device, and computes the loss.
After calculating gradients with loss.backward(), it clips them to prevent very large updates, then updates model weights with the optimizer and adjusts the learning rate using the scheduler.
The progress bar shows the current training loss per batch. After each epoch, the model's state is saved to a file.
Finally, it prints the average training loss, evaluates the model on validation data, and prints the validation loss and weighted F1 score to track performance.**
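Gradient clipping by norm, as done with `clip_grad_norm_` in the loop, can be sketched on a flat list of numbers. This is a simplified plain-Python version for illustration; `clip_by_global_norm` is a hypothetical helper, and the real function operates on all parameter tensors of the model at once.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)            # 5.0
print(round(norm_after, 3))   # 1.0

# Gradients that are already small pass through unchanged.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)                  # [0.3, 0.4]
```

The direction of the gradient is preserved; only its length is capped, which limits the size of any single weight update.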

In [None]:
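model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device) #pre-trained BERT model for text classification

# Assumption: evaluate the checkpoint saved after the final training epoch.
model.load_state_dict(torch.load(f'finetuned_BERT_epoch_{epochs}.model', map_location=device))

**This code rebuilds the BERT text classification model from the "bert-base-uncased" version, which treats uppercase and lowercase letters the same. It sets the number of output classes based on your label dictionary and disables extra outputs like attention scores and hidden states to keep things simple. It then moves the model to the specified device (CPU or GPU). A freshly loaded model has an untrained classification layer, so the fine-tuned weights saved during training are loaded back with load_state_dict. The final-epoch checkpoint is assumed here; any other saved epoch can be substituted.**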

In [None]:
_, predictions, true_vals = evaluate(dataloader_validation) #evaluates the model on the validation data and stores the predictions and true labels for later analysis.

**This line runs the evaluate function on the validation data to check how well the model is performing.**

**It returns three values: the average validation loss (discarded here with _), the model's predicted outputs (predictions), and the actual labels (true_vals).
These predictions and true values can then be used to calculate metrics like accuracy or F1 score.**

In [None]:
accuracy_per_class(predictions, true_vals) #This line calculates and prints the accuracy of the model for each category in the validation data.

**This line calls the accuracy_per_class function to measure how accurately the model predicted each class.
It compares the model's predictions (predictions) with the actual labels (true_vals) and prints the accuracy for each class individually.**