## Project Outline
Task 1: Exploratory Data Analysis and Preprocessing

Task 2: Training/Validation Split

Task 3: Loading Tokenizer and Encoding our Data

Task 4: Setting up BERT Pretrained Model

Task 5: Creating Data Loaders

Task 6: Setting Up Optimizer and Scheduler

Task 7: Defining our Performance Metrics

Task 8: Creating our Training Loop

Task 9: Loading and Evaluating our Model


Importing the necessary libraries

In [67]:
import torch
import pandas as pd
from tqdm.notebook import tqdm

In [68]:
import warnings
warnings.filterwarnings('ignore')

In [69]:
#Reading the data from the csv file
df = pd.read_csv('smile-annotations-final.csv',
                 names = ['id', 'text', 'category'])

#Setting the id as index
df.set_index('id', inplace=True)

## Task 1: Exploratory Data Analysis and Preprocessing

In [70]:
df.shape

(3085, 2)

In [71]:
#Checking the first few records
df.head()

id,text,category
611857364396965889,@aandraous @britishmuseum @AndrewsAntonio Merc...,nocode
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy
614877582664835073,@Sofabsports thank you for following me back. ...,happy
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy


In [72]:
#Checking the full text of one tweet (the fourth record)
df.text.iloc[3]

'@Sofabsports thank you for following me back. Great to hear from a diverse &amp; interesting panel #DefeatingDepression @RAMMuseum'

In [73]:
#Checking the total number of counts under each category
df.category.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: category, dtype: int64

In [74]:
#Removing tweets annotated with multiple emotions (labels joined by '|')
df = df[~df.category.str.contains('|', regex=False)]

In [75]:
#Removing the nocode category
df = df[df['category'] != 'nocode']

In [76]:
#Checking the total number of counts under each category
df.category.value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64
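Because 'happy' dominates the remaining data, it is useful to know what a model that always predicts the majority class would score. The sketch below works that out from the counts printed above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {
    'happy': 1137, 'not-relevant': 214, 'angry': 57,
    'surprise': 35, 'sad': 32, 'disgust': 6,
}
total = sum(counts.values())

# Share of each class in the filtered dataset.
shares = {name: n / total for name, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = shares[majority_class]

print(total)
print(majority_class, round(baseline_accuracy, 4))
```

Any accuracy close to this baseline should be read with care, since it can be reached without learning the minority classes at all.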

In [77]:
#Converting the labels into numeric ids
possible_labels = list(df.category.unique())
possible_labels

['happy', 'not-relevant', 'angry', 'disgust', 'sad', 'surprise']

In [78]:
list(enumerate(possible_labels))

[(0, 'happy'),
 (1, 'not-relevant'),
 (2, 'angry'),
 (3, 'disgust'),
 (4, 'sad'),
 (5, 'surprise')]

In [79]:
#Mapping each category name to an integer id
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [80]:
label_dict

{'happy': 0,
 'not-relevant': 1,
 'angry': 2,
 'disgust': 3,
 'sad': 4,
 'surprise': 5}
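The same mapping can be written as a dictionary comprehension, and the inverse mapping is useful later when predictions have to be turned back into category names. This is a minimal sketch that hard-codes the label list shown above; `label_dict_inverse` is the same helper name the notebook uses further down.

```python
possible_labels = ['happy', 'not-relevant', 'angry', 'disgust', 'sad', 'surprise']

# Forward mapping: category name -> integer id.
label_dict = {label: index for index, label in enumerate(possible_labels)}

# Inverse mapping: integer id -> category name, for reading predictions.
label_dict_inverse = {index: label for label, index in label_dict.items()}

print(label_dict['sad'])
print(label_dict_inverse[1])
```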

In [81]:
#Creating a separate column for the encoded labels
df['label'] = df.category.replace(label_dict)

In [82]:
df.head()

id,text,category,label
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0


## Task 2: Training/Validation Split

In [83]:
from sklearn.model_selection import train_test_split

In [84]:
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    stratify=df.label.values #keep the imbalanced class proportions in both splits
)

In [85]:
df['data_type'] = ['not_set']*df.shape[0]

In [86]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [87]:
df.groupby(['category', 'label', 'data_type']).count()

category,label,data_type,text
angry,2,train,48
angry,2,val,9
disgust,3,train,5
disgust,3,val,1
happy,0,train,966
happy,0,val,171
not-relevant,1,train,182
not-relevant,1,val,32
sad,4,train,27
sad,4,val,5
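The counts in this table follow from the 15% stratified split. As a rough check, the sketch below takes 15% of each class and rounds it. This is only an approximation of what `stratify` does; scikit-learn's exact allocation can differ by one sample for some classes.

```python
# Class counts after filtering, copied from the value_counts() output.
counts = {
    'happy': 1137, 'not-relevant': 214, 'angry': 57,
    'surprise': 35, 'sad': 32, 'disgust': 6,
}
test_size = 0.15

# Approximate validation share per class: 15% of the class, rounded.
expected_val = {name: round(n * test_size) for name, n in counts.items()}
expected_train = {name: counts[name] - expected_val[name] for name in counts}

print(expected_val)
print(sum(expected_train.values()), sum(expected_val.values()))
```

With only six 'disgust' tweets, a single example ends up in the validation set, so any per-class metric for that class rests on one sample.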


## Task 3: Loading Tokenizer and Encoding our Data

In [88]:
pip install transformers



In [89]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [90]:
#Loading the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased', #uncased checkpoint: input text is lowercased
    do_lower_case=True)

In [91]:
#Tokenizing the text in batch using the BERT tokenizer
encoded_train_data = tokenizer.batch_encode_plus(
    df[df['data_type']=='train'].text.values.tolist(),
    add_special_tokens=True, #[CLS], [SEP]
    return_attention_mask=True,
    padding='max_length', #replaces the deprecated pad_to_max_length=True
    max_length=256,
    truncation=True,
    return_tensors='pt'
)

encoded_val_data = tokenizer.batch_encode_plus(
    df[df['data_type']=='val'].text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    max_length=256,
    truncation=True,
    return_tensors='pt'
)

input_ids_train = encoded_train_data['input_ids']
attention_mask_train = encoded_train_data['attention_mask']
labels_train = torch.tensor(df[df['data_type']=='train'].label.values)

input_ids_val = encoded_val_data['input_ids']
attention_mask_val = encoded_val_data['attention_mask']
labels_val = torch.tensor(df[df['data_type']=='val'].label.values)

In [92]:
#Creating tensor dataset out of train and validation dataset
dataset_train = TensorDataset(input_ids_train,
                              attention_mask_train,
                              labels_train)

dataset_val = TensorDataset(input_ids_val,
                            attention_mask_val,
                            labels_val)

In [93]:
len(dataset_train)

1258

In [94]:
len(dataset_val)

223
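Each encoded example is a fixed-length row of token ids plus an attention mask that marks which positions hold real tokens. The sketch below illustrates the idea in plain Python. `pad_and_mask` is a hypothetical helper, not a tokenizer API, and it simplifies truncation: the real tokenizer keeps the closing [SEP] token when it truncates.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token ids and build the matching attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)          # 1 marks a real token
    padding = max_length - len(ids)
    # 0 in the mask tells the model to ignore the padded positions.
    return ids + [pad_id] * padding, mask + [0] * padding

# 101 and 102 are the [CLS] and [SEP] ids in bert-base-uncased;
# the two ids in between are arbitrary placeholders.
input_ids, attention_mask = pad_and_mask([101, 7592, 2088, 102], max_length=8)
print(input_ids)
print(attention_mask)
```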

## Task 4: Setting up BERT Pretrained Model

In [95]:
from transformers import BertForSequenceClassification

In [96]:
model = BertForSequenceClassification.from_pretrained(
   'bert-base-uncased',
    num_labels=len(label_dict),
    output_attentions=False,
    output_hidden_states=False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Task 5: Creating Data Loaders

In [97]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [98]:
batch_size = 32

dataloader_train = DataLoader(
     dataset_train,
     sampler=RandomSampler(dataset_train),
     batch_size=batch_size)

dataloader_val = DataLoader(
     dataset_val,
     sampler=SequentialSampler(dataset_val), #no shuffling needed for evaluation
     batch_size=batch_size)
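The batch size fixes how many optimizer steps one epoch takes. A quick calculation, using the dataset sizes printed earlier, gives the number of batches per loader; the variable names here are illustrative.

```python
import math

batch_size = 32
n_train, n_val = 1258, 223   # len(dataset_train), len(dataset_val)

# The last batch is smaller, so the batch count is rounded up.
steps_per_epoch = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)
last_train_batch = n_train - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch, val_batches, last_train_batch)
```

These values correspond to the totals of 40 and 7 shown by the progress bars in the training log.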

## Task 6: Setting Up Optimizer and Scheduler

In [99]:
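from torch.optim import AdamW #the AdamW class in transformers is deprecated
from transformers import get_linear_schedule_with_warmup

In [100]:
optimizer = AdamW(
    model.parameters(),
    lr=1e-5, #the BERT paper suggests 2e-5 to 5e-5
    eps=1e-8,
    weight_decay=0.0 #matches the default of the transformers AdamW
)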

In [101]:
# for name, param in model.named_parameters():
#     print(f"Parameter name: {name}")
#     print(f"Parameter size: {param.size()}")
#     print("------")

In [102]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataloader_train)*epochs #the number of times the optimizer will update the model's parameters.
)
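A linear schedule with warmup scales the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The function below is a sketch of that factor written from the schedule's documented behaviour, not the library's own code; `linear_schedule_factor` is a hypothetical name.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear increase from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 40 * 10   # batches per epoch * epochs

for step in (0, 100, 200, 400):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With `num_warmup_steps=0`, as in this notebook, the learning rate starts at its full value and is halved by the midpoint of training.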

## Task 7: Defining our Performance Metrics

In [103]:
import numpy as np

In [104]:
from sklearn.metrics import f1_score
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

In [105]:
#preds = [0.9, 0.05, 0.05, 0, 0, 0]
#we want preds = [1, 0, 0, 0, 0, 0]

In [106]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
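The `np.argmax(preds, axis=1)` step turns each row of scores into a single predicted class id. A plain-Python sketch of the same idea, using the example row from the comment above plus one made-up row:

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

label_names = ['happy', 'not-relevant', 'angry', 'disgust', 'sad', 'surprise']

# First row is the example from the notebook comment; the second is invented.
preds = [
    [0.9, 0.05, 0.05, 0, 0, 0],
    [0.1, 0.7, 0.1, 0.05, 0.05, 0.0],
]

predicted_ids = [argmax(row) for row in preds]
predicted_names = [label_names[i] for i in predicted_ids]

print(predicted_ids)
print(predicted_names)
```

The model actually returns raw logits rather than probabilities, but the index of the largest value is the same in both cases.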

In [107]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v:k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true  = labels_flat[labels_flat==label]
        print(f'class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])/len(y_true)}\n')

In [108]:
def performance_metrics(preds, labels):

    class_names = list(df.category.unique())

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    conf_mat = confusion_matrix(y_true=labels_flat,
                                y_pred=preds_flat)

    cls_rpt = classification_report(y_true=labels_flat,
                                    y_pred=preds_flat,
                                    target_names=class_names,
                                    digits=4)

    print(f"Confusion matrix: \n {conf_mat} \n")

    print(f"Classification report: \n {cls_rpt} \n")

    accuracy = metrics.accuracy_score(labels_flat, preds_flat)

    print(f"Accuracy Score = {accuracy}")

    return None

## Task 8: Creating our Training Loop

In [109]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [110]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [112]:
def evaluate(dataloader_val):

    model.eval()

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in tqdm(dataloader_val):

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(dataloader_val)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals


In [113]:
import os
os.makedirs("Models", exist_ok=True)

for epoch in range(1, epochs+1):

    model.train()
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train,
                       desc='Epoch {:1d}'.format(epoch),
                       leave=False,
                       disable=False)

    for batch in progress_bar:

        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)

        inputs = {
            'input_ids'         : batch[0],
            'attention_mask'    : batch[1],
            'labels'            : batch[2]
         }

        outputs = model(**inputs)

        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})

    torch.save(model.state_dict(), f'Models/BERT_ft_epoch{epoch}.model')

    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total/len(dataloader_train)

    tqdm.write(f'Training loss: {loss_train_avg}')

    val_loss, predictions, true_vals = evaluate(dataloader_val)

    val_f1 = f1_score_func(predictions, true_vals)

    tqdm.write(f'Validation loss: {val_loss}')

    tqdm.write(f'F1 score (weighted): {val_f1}')

Epoch 1:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 1
Training loss: 1.0446282580494881


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.770200422831944
F1 score (weighted): 0.6953185953656175


Epoch 2:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 2
Training loss: 0.6656686812639236


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.610957954611097
F1 score (weighted): 0.7563849108764485


Epoch 3:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 3
Training loss: 0.5200087413191795


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.5645225346088409
F1 score (weighted): 0.7977010042316681


Epoch 4:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 4
Training loss: 0.441915014013648


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.5576145393507821
F1 score (weighted): 0.7913155090931329


Epoch 5:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 5
Training loss: 0.3862154735252261


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.5843018846852439
F1 score (weighted): 0.7889399248379658


Epoch 6:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 6
Training loss: 0.35830602450296284


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.5754426462309701
F1 score (weighted): 0.7913155090931329


Epoch 7:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 7
Training loss: 0.3193357797339559


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.5585537425109318
F1 score (weighted): 0.797312114366634


Epoch 8:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 8
Training loss: 0.28044388853013513


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.5529141213212695
F1 score (weighted): 0.7982797009484407


Epoch 9:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 9
Training loss: 0.26472845710813997


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.5500086035047259
F1 score (weighted): 0.7982797009484407


Epoch 10:   0%|          | 0/40 [00:00<?, ?it/s]


Epoch 10
Training loss: 0.2514849495142698


  0%|          | 0/7 [00:00<?, ?it/s]

Validation loss: 0.5589492746761867
F1 score (weighted): 0.8020210951809512


We fine-tuned the pretrained BERT model for 10 epochs. The training loss falls steadily from 1.04 to 0.25, while the validation loss stops improving after about epoch 4 and then stays between roughly 0.55 and 0.58, so the growing gap between the two points to overfitting. A model is usually trained over many epochs, and the checkpoint that best balances training and validation performance is selected for testing on unseen data. Here we choose the model from epoch 2 for testing, because the gap between its training loss (0.67) and validation loss (0.61) is still small.
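The comparison between training and validation loss can be made explicit with a little arithmetic over the logged values. The sketch below uses the numbers from the training log, rounded to four decimals. Picking the epoch with the lowest validation loss or the highest F1 are alternative selection criteria, shown only for comparison; they are not the rule applied in the text.

```python
# Values copied from the training log above, rounded to four decimals.
train_loss = [1.0446, 0.6657, 0.5200, 0.4419, 0.3862,
              0.3583, 0.3193, 0.2804, 0.2647, 0.2515]
val_loss = [0.7702, 0.6110, 0.5645, 0.5576, 0.5843,
            0.5754, 0.5586, 0.5529, 0.5500, 0.5589]
val_f1 = [0.6953, 0.7564, 0.7977, 0.7913, 0.7889,
          0.7913, 0.7973, 0.7983, 0.7983, 0.8020]

# Validation loss minus training loss; a growing positive gap hints at overfitting.
gaps = [round(v - t, 4) for t, v in zip(train_loss, val_loss)]

# Epochs are numbered from 1, list indices from 0.
best_by_loss = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_by_f1 = max(range(len(val_f1)), key=val_f1.__getitem__) + 1

for epoch, gap in enumerate(gaps, start=1):
    print(epoch, gap)
print(best_by_loss, best_by_f1)
```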

## Task 9: Loading and Evaluating our Model

In [114]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [115]:
model.to(device)
pass

Loading the fine-tuned model from the first epoch and evaluating

In [116]:
model.load_state_dict(
torch.load('Models/BERT_ft_epoch1.model',
          map_location=torch.device('cpu')))

<All keys matched successfully>

In [117]:
_, predictions, true_vals = evaluate(dataloader_val)

  0%|          | 0/7 [00:00<?, ?it/s]

In [118]:
accuracy_per_class(predictions, true_vals)

class: happy
Accuracy: 1.0

class: not-relevant
Accuracy: 0.09375

class: angry
Accuracy: 0.0

class: disgust
Accuracy: 0.0

class: sad
Accuracy: 0.0

class: surprise
Accuracy: 0.0



The model classifies every 'happy' tweet correctly but only about 9% of the 'not-relevant' tweets, and none of the tweets from the remaining classes.

In [119]:
performance_metrics(predictions, true_vals)

Confusion matrix: 
 [[171   0   0   0   0   0]
 [ 29   3   0   0   0   0]
 [  9   0   0   0   0   0]
 [  1   0   0   0   0   0]
 [  5   0   0   0   0   0]
 [  5   0   0   0   0   0]] 

Classification report: 
               precision    recall  f1-score   support

       happy     0.7773    1.0000    0.8747       171
not-relevant     1.0000    0.0938    0.1714        32
       angry     0.0000    0.0000    0.0000         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.0000    0.0000    0.0000         5

    accuracy                         0.7803       223
   macro avg     0.2962    0.1823    0.1744       223
weighted avg     0.7395    0.7803    0.6953       223
 

Accuracy Score = 0.7802690582959642


After the first epoch the training loss (1.04) is still higher than the validation loss (0.77), so the model is likely underfit. As a result, it predicts 'happy' for almost every tweet and classifies only that class reliably.
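Every figure in the classification report above follows from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes recall (the per-class accuracy), precision, weighted F1, and overall accuracy in plain Python; `safe_div` is a hypothetical helper that returns 0 for classes that are never predicted.

```python
# Confusion matrix of the epoch 1 checkpoint, copied from the output above.
conf_mat = [
    [171, 0, 0, 0, 0, 0],
    [29, 3, 0, 0, 0, 0],
    [9, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [5, 0, 0, 0, 0, 0],
    [5, 0, 0, 0, 0, 0],
]
n = len(conf_mat)
support = [sum(row) for row in conf_mat]                          # true count per class
predicted = [sum(conf_mat[r][c] for r in range(n)) for c in range(n)]  # predicted count per class

def safe_div(a, b):
    """Division that returns 0.0 when the denominator is zero."""
    return a / b if b else 0.0

recall = [safe_div(conf_mat[i][i], support[i]) for i in range(n)]
precision = [safe_div(conf_mat[i][i], predicted[i]) for i in range(n)]
f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]

# Weighted F1 averages the per-class F1 scores by class support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)
accuracy = sum(conf_mat[i][i] for i in range(n)) / sum(support)

print([round(r, 4) for r in recall])
print(round(weighted_f1, 4), round(accuracy, 4))
```

Because the weights are the class supports, the large 'happy' class dominates the weighted F1, which is why it stays near 0.70 even though four classes score zero.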

Loading the fine-tuned model from the second epoch and evaluating

In [124]:
model.load_state_dict(
torch.load('Models/BERT_ft_epoch2.model',
          map_location=torch.device('cpu')))

<All keys matched successfully>

In [125]:
_, predictions, true_vals = evaluate(dataloader_val)

  0%|          | 0/7 [00:00<?, ?it/s]

In [126]:
accuracy_per_class(predictions, true_vals)

class: happy
Accuracy: 0.9824561403508771

class: not-relevant
Accuracy: 0.375

class: angry
Accuracy: 0.0

class: disgust
Accuracy: 0.0

class: sad
Accuracy: 0.0

class: surprise
Accuracy: 0.0



We can already see some improvement in the predictions. The model from epoch 2 correctly classifies 98% of the 'happy' tweets and 37.5% of the 'not-relevant' tweets.

In [123]:
performance_metrics(predictions, true_vals)

Confusion matrix: 
 [[166   5   0   0   0   0]
 [ 13  19   0   0   0   0]
 [  1   8   0   0   0   0]
 [  1   0   0   0   0   0]
 [  4   1   0   0   0   0]
 [  2   3   0   0   0   0]] 

Classification report: 
               precision    recall  f1-score   support

       happy     0.8877    0.9708    0.9274       171
not-relevant     0.5278    0.5938    0.5588        32
       angry     0.0000    0.0000    0.0000         9
     disgust     0.0000    0.0000    0.0000         1
         sad     0.0000    0.0000    0.0000         5
    surprise     0.0000    0.0000    0.0000         5

    accuracy                         0.8296       223
   macro avg     0.2359    0.2608    0.2477       223
weighted avg     0.7564    0.8296    0.7913       223
 

Accuracy Score = 0.8295964125560538


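Recall in the classification report is the same quantity that accuracy_per_class prints. The figures here differ slightly from the per-class accuracies above because this cell (In [123]) was executed before the checkpoint was reloaded in In [124]; its weighted F1 of 0.7913 matches the value logged for epochs 4 and 6. The overall accuracy in this report is almost 83%.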