# Sentiment Analysis with Deep Learning using BERT

### Project Outline

**Task 1**: Exploratory Data Analysis and Preprocessing

**Task 2**: Training/Validation Split

**Task 3**: Loading Tokenizer and Encoding Data

**Task 4**: Setting up BERT Pretrained Model

**Task 5**: Creating Data Loaders

**Task 6**: Setting Up Optimizer and Scheduler

**Task 7**: Defining the Performance Metrics

**Task 8**: Creating Training Loop

**Task 9**: Loading and Evaluating the Model

## Task 1: Exploratory Data Analysis and Preprocessing

In [1]:
import torch
import pandas as pd
from tqdm.notebook import tqdm

In [2]:
df = pd.read_csv('/content/smile-annotations-final.csv',
                 names=['id', 'text', 'category'])
df.set_index('id', inplace = True)

In [3]:
df.text.iloc[0]

'@aandraous @britishmuseum @AndrewsAntonio Merci pour le partage! @openwinemap'

In [4]:
df.category.value_counts()

Unnamed: 0_level_0,count
category,Unnamed: 1_level_1
nocode,1572
happy,1137
not-relevant,214
angry,57
surprise,35
sad,32
happy|surprise,11
happy|sad,9
disgust|angry,7
disgust,6


In [5]:
df = df[~df.category.str.contains('\|')]

In [6]:
df = df[df.category != 'nocode']

In [7]:
df.category.value_counts()

Unnamed: 0_level_0,count
category,Unnamed: 1_level_1
happy,1137
not-relevant,214
angry,57
surprise,35
sad,32
disgust,6


In [8]:
possible_labels = df.category.unique()

In [9]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [10]:
label_dict

{'happy': 0,
 'not-relevant': 1,
 'angry': 2,
 'disgust': 3,
 'sad': 4,
 'surprise': 5}

In [11]:
df['label'] = df.category.replace(label_dict)
df.head(10)

  df['label'] = df.category.replace(label_dict)


Unnamed: 0_level_0,text,category,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,0
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,0
614877582664835073,@Sofabsports thank you for following me back. ...,happy,0
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,0
611570404268883969,@NationalGallery @ThePoldarkian I have always ...,happy,0
614499696015503361,Lucky @FitzMuseum_UK! Good luck @MirandaStearn...,happy,0
613601881441570816,Yr 9 art students are off to the @britishmuseu...,happy,0
613696526297210880,@RAMMuseum Please vote for us as @sainsbury #s...,not-relevant,1
610746718641102848,#AskTheGallery Have you got plans to privatise...,not-relevant,1
612648200588038144,@BarbyWT @britishmuseum so beautiful,happy,0


## Task 2: Training/Validation Split

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size = 0.15,
    random_state = 17,
    stratify = df.label.values)

In [14]:
df['data_type'] = ['not_set']*df.shape[0]

In [15]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [16]:
df.groupby(['category', 'label', 'data_type']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,text
category,label,data_type,Unnamed: 3_level_1
angry,2,train,48
angry,2,val,9
disgust,3,train,5
disgust,3,val,1
happy,0,train,966
happy,0,val,171
not-relevant,1,train,182
not-relevant,1,val,32
sad,4,train,27
sad,4,val,5


## Task 3: Loading Tokenizer and Encoding Data

In [17]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [18]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case = True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [19]:
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type =='train'].text.values,
    add_special_tokens = True,
    return_attention_mask = True,
    padding="max_length",
    truncation=True,
    max_length = 256,
    return_tensors = 'pt')

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type =='val'].text.values,
    add_special_tokens = True,
    return_attention_mask = True,
    padding="max_length",
    truncation=True,
    max_length = 256,
    return_tensors = 'pt')

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type == 'train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type == 'val'].label.values)

In [20]:
dataset_train = TensorDataset(input_ids_train,
                              attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val,
                            attention_masks_val, labels_val)

In [21]:
len(dataset_train)

1258

In [22]:
len(dataset_val)

223

## Task 4: Setting up BERT Pretrained Model

In [23]:
from transformers import BertForSequenceClassification

In [24]:
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels = len(label_dict),
    output_attentions = False,
    output_hidden_states = False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Task 5: Creating Data Loaders

In [25]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [26]:
batch_size = 4 #32

dataloader_train = DataLoader(
    dataset_train,
    sampler = RandomSampler(dataset_train),
    batch_size = batch_size
    )

dataloader_val = DataLoader(
    dataset_val,
    sampler = RandomSampler(dataset_val),
    batch_size = 32
    )

## Task 6: Setting Up Optimizer and Scheduler

In [27]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [28]:
optimizer = AdamW(
    model.parameters(),
    lr = 1e-5, #2e-5 > 5e-5
    eps = 1e-8
)



In [29]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps = 0,
    num_training_steps = len(dataloader_train)*epochs
)

## Task 7: Defining the Performance Metrics

In [30]:
import numpy as np

In [31]:
from sklearn.metrics import f1_score

In [32]:
# preds = [0.9 0.05 0.05 0 0 0]
# preds = [1 0 0 0 0 0]

In [33]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis = 1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

In [34]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis = 1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat == label]
        y_true = labels_flat[labels_flat == label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds == label])}/{len(y_true)}\n')

## Task 8: Creating Training Loop

In [35]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [36]:
device  = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [37]:
def evaluate(dataloader_val):

    model.eval()

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(dataloader_val)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals


In [38]:
for epoch in tqdm(range(1, epochs+1)):

    model.train()

    loss_train_total = 0

    progress_bar = tqdm(dataloader_train,
                        desc='Epoch {:d}'.format(epoch),
                        leave = False,
                        disable = False)
    for batch in progress_bar:
        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)

        inputs = {
            'input_ids' : batch[0],
            'attention_mask' : batch[1],
            'labels': batch[2]
        }

        outputs = model(**inputs)

        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})

    torch.save(model.state_dict(), f'Models/BERT_ft_epoch{epoch}.model')

    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')

    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validations loss: {val_loss}')
    tqdm.write(f'F1 Score (weighted): {val_f1}')

  0%|          | 0/10 [00:00<?, ?it/s]

Epoch 1:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 1
Training loss: 0.8246664029974786
Validations loss: 0.6390010586806706
F1 Score (weighted): 0.7734205796280599


Epoch 2:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 2
Training loss: 0.4808978693234542
Validations loss: 0.6431908522333417
F1 Score (weighted): 0.8325884138219828


Epoch 3:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 3
Training loss: 0.29499922998781714
Validations loss: 0.7349185006959098
F1 Score (weighted): 0.8224495708744424


Epoch 4:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 4
Training loss: 0.19634591821130246
Validations loss: 0.7270713661398206
F1 Score (weighted): 0.8451223873360602


Epoch 5:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 5
Training loss: 0.13469819418849452
Validations loss: 0.7164951775755201
F1 Score (weighted): 0.8544989784517811


Epoch 6:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 6
Training loss: 0.07779741223369326
Validations loss: 0.7493510544300079
F1 Score (weighted): 0.8630324357026097


Epoch 7:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 7
Training loss: 0.04458564270383841
Validations loss: 0.7233505419322422
F1 Score (weighted): 0.8516963210642173


Epoch 8:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 8
Training loss: 0.03245461578998301
Validations loss: 0.7122840455600193
F1 Score (weighted): 0.8661428587363785


Epoch 9:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 9
Training loss: 0.022284895068012355
Validations loss: 0.7187472837311881
F1 Score (weighted): 0.8548504750034691


Epoch 10:   0%|          | 0/315 [00:00<?, ?it/s]


Epoch 10
Training loss: 0.018592688935394917
Validations loss: 0.7285295852593013
F1 Score (weighted): 0.8630079734034756


## Task 9: Loading and Evaluating the Model

In [39]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [40]:
model.load_state_dict(
    torch.load('/content/Models/BERT_ft_epoch1.model',
              map_location=torch.device('cpu')))

  torch.load('/content/Models/BERT_ft_epoch1.model',


<All keys matched successfully>

In [41]:
_, predictions, true_vals = evaluate(dataloader_val)

In [42]:
accuracy_per_class(predictions, true_vals)

Class: happy
Accuracy: 164/171

Class: not-relevant
Accuracy: 17/32

Class: angry
Accuracy: 0/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 0/5



In [43]:
model.load_state_dict(
    torch.load('/content/Models/BERT_ft_epoch9.model',
              map_location=torch.device('cpu')))
_, predictions, true_vals = evaluate(dataloader_val)
accuracy_per_class(predictions, true_vals)

  torch.load('/content/Models/BERT_ft_epoch9.model',


Class: happy
Accuracy: 162/171

Class: not-relevant
Accuracy: 19/32

Class: angry
Accuracy: 7/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 2/5

Class: surprise
Accuracy: 2/5

