# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

- Intermediate-level knowledge of Python 3 (NumPy and Pandas preferably, but not required)
- Exposure to PyTorch usage
- Basic understanding of Deep Learning and Language Models (BERT specifically)

### Project Outline

**Task 1**: Introduction (this section)

**Task 2**: Exploratory Data Analysis and Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading Tokenizer and Encoding our Data

**Task 5**: Setting up BERT Pretrained Model

**Task 6**: Creating Data Loaders

**Task 7**: Setting Up Optimizer and Scheduler

**Task 8**: Defining our Performance Metrics

**Task 9**: Creating our Training Loop

**Task 10**: Loading and Evaluating our Model

## Task 1: Introduction

### What is BERT

BERT is a large-scale transformer-based Language Model that can be finetuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)

<img src="images/BERT_diagrams.png" width="1000">

## Task 2: Exploratory Data Analysis and Preprocessing

We will use the SMILE Twitter dataset.

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [24]:
import torch
import pandas as pd
from tqdm.notebook import tqdm

In [25]:
tweets_df = pd.read_csv('Data/smile-annotations-final.csv',
                names=['id', 'text', 'category'])
tweets_df.set_index('id', inplace=True)
tweets_df.head()

| id | text | category |
|---|---|---|
| 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |


In [26]:
tweets_df.text.iloc[0]

'@aandraous @britishmuseum @AndrewsAntonio Merci pour le partage! @openwinemap'

In [27]:
tweets_df.category.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: category, dtype: int64

In [28]:
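# Escape the pipe with a backslash, since | is a special regex character (alternation);
# ~ negates the boolean mask, so rows carrying multiple labels are dropped
tweets_df = tweets_df[~tweets_df.category.str.contains(r'\|')]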
tweets_df = tweets_df[tweets_df.category != 'nocode']
tweets_df.category.value_counts()

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64
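
To illustrate why the pipe has to be escaped, here is a minimal sketch using only the standard-library `re` module. The `categories` list is a made-up stand-in for the `category` column, not the real data.

```python
import re

# Hypothetical stand-in for the category column
categories = ["happy", "happy|sad", "nocode", "sad|disgust|angry", "angry"]

# Unescaped, "|" means "empty pattern OR empty pattern", which matches every string
unescaped = [c for c in categories if re.search("|", c)]

# Escaped (in a raw string), it matches only a literal pipe character
multi_label = [c for c in categories if re.search(r"\|", c)]

# Keep single-label rows and drop the 'nocode' placeholder
kept = [c for c in categories if c not in multi_label and c != "nocode"]

print(unescaped)    # all five entries
print(multi_label)  # ['happy|sad', 'sad|disgust|angry']
print(kept)         # ['happy', 'angry']
```

Without the backslash, the filter would have flagged every row as multi-label and removed the whole dataset.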

In [29]:
emotion_dict = {}

for idx, emotion in enumerate(tweets_df.category.unique()):
    emotion_dict[emotion] = idx
    
print(emotion_dict)

{'happy': 0, 'not-relevant': 1, 'angry': 2, 'disgust': 3, 'sad': 4, 'surprise': 5}


In [30]:
tweets_df['label'] = tweets_df.category.replace(emotion_dict)
tweets_df.tail()

| id | text | category | label |
|---|---|---|---|
| 611258135270060033 | @_TheWhitechapel @Campaignforwool @SlowTextile... | not-relevant | 1 |
| 612214539468279808 | “@britishmuseum: Thanks for ranking us #1 in @... | happy | 0 |
| 613678555935973376 | MT @AliHaggett: Looking forward to our public ... | happy | 0 |
| 615246897670922240 | @MrStuchbery @britishmuseum Mesmerising. | happy | 0 |
| 613016084371914753 | @NationalGallery The 2nd GENOCIDE against #Bia... | not-relevant | 1 |


## Task 3: Training/Validation Split

In [31]:
from sklearn.model_selection import train_test_split

In [32]:
# Stratified Splitting, as we have very unbalanced data
X_train, X_val, y_train, y_val = train_test_split(tweets_df.index.values, tweets_df.label.values, test_size=0.15,
                                                  random_state=0, shuffle=True, stratify=tweets_df.label.values)
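
As a rough illustration of what `stratify` achieves, the sketch below splits each class separately so that both sets keep a similar class mix. It uses only the standard library; `stratified_split` is a hypothetical helper and not scikit-learn's actual algorithm (the per-class rounding in particular is an assumption), and the labels are toy data.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.15, seed=0):
    """Return (train_idx, val_idx) with roughly the same class mix in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: round per class, but keep at least one validation sample
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy, heavily imbalanced label list
labels = ["happy"] * 100 + ["angry"] * 20 + ["disgust"] * 6
train_idx, val_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print(train_counts)
print(val_counts)
```

With a plain random split, a class as small as `disgust` could easily end up entirely in one of the two sets.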

In [33]:
tweets_df['set'] = ['unknown']*tweets_df.shape[0]
tweets_df.sample(2)

| id | text | category | label | set |
|---|---|---|---|---|
| 614109057998323712 | This looks great, so hard to decide which day ... | happy | 0 | unknown |
| 614783210191486976 | Very intense and well written on #LGBT history... | happy | 0 | unknown |


In [34]:
tweets_df.loc[X_train, 'set'] = 'train'
tweets_df.loc[X_val, 'set'] = 'val'

In [58]:
# We should see 85% in train & 15% in val
tweets_df.groupby(['category', 'label', 'set']).count()

| category | label | set | text |
|---|---|---|---|
| angry | 2 | train | 48 |
| angry | 2 | val | 9 |
| disgust | 3 | train | 5 |
| disgust | 3 | val | 1 |
| happy | 0 | train | 966 |
| happy | 0 | val | 171 |
| not-relevant | 1 | train | 182 |
| not-relevant | 1 | val | 32 |
| sad | 4 | train | 27 |
| sad | 4 | val | 5 |


In [59]:
tweets_df.groupby('set').count()

| set | text | category | label |
|---|---|---|---|
| train | 1258 | 1258 | 1258 |
| val | 223 | 223 | 223 |


## Task 4: Loading Tokenizer and Encoding our Data

In [37]:
# !pip install transformers==2.11.0  (batch_encode_plus is not available in older versions)

In [38]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [39]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [45]:
# We need to encode our tweets into a form the model can read.
# batch_encode_plus returns a BatchEncoding, which holds the output of the tokenizer's
# (here BertTokenizer's) encoding methods.

# Parameters:
# First argument -> what we want to encode, i.e. the values of our text column
# add_special_tokens -> adds the [CLS] and [SEP] tokens, which mark the beginning and end of each sequence
# return_attention_masks -> also return the attention masks. A mask is 1 for real tokens and 0 for padding,
# so the model pays attention only to the tweet and not to the remaining padded positions.
# Tweets shorter than max_length get padded, which is why the masks are needed; see
# https://huggingface.co/transformers/main_classes/tokenizer.html
# https://huggingface.co/transformers/glossary.html#attention-mask
# pad_to_max_length & max_length -> pad every input to the same length
# return_tensors -> type of the returned tensors; 'pt' for PyTorch
# Note: newer transformers releases use return_attention_mask=True, padding='max_length'
# and truncation=True instead.

encoded_data_train = tokenizer.batch_encode_plus(tweets_df[tweets_df.set=='train'].text.values, 
                                                add_special_tokens=True, return_attention_masks=True, 
                                                pad_to_max_length=True, max_length=256, return_tensors='pt')

encoded_data_val = tokenizer.batch_encode_plus(tweets_df[tweets_df.set=='val'].text.values, 
                                                add_special_tokens=True, return_attention_masks=True, 
                                                pad_to_max_length=True, max_length=256, return_tensors='pt')
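
To make the padding and attention-mask idea concrete, here is a toy encoder in plain Python. It is only a sketch: `encode` and the mini `vocab` are hypothetical, and real BERT tokenization also splits words into WordPiece sub-tokens. Only the special-token IDs (0, 100, 101, 102) mirror `bert-base-uncased`.

```python
def encode(tokens, vocab, max_length):
    """Toy encoder: add [CLS]/[SEP], truncate, pad, and build the attention mask."""
    ids = [vocab["[CLS]"]] + [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    # Truncate so that [SEP] still fits inside max_length
    ids = ids[: max_length - 1] + [vocab["[SEP]"]]
    attention_mask = [1] * len(ids)          # 1 = real token
    n_pad = max_length - len(ids)
    ids += [vocab["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad            # 0 = padding
    return ids, attention_mask

# Hypothetical mini vocabulary; the IDs for "thank" and "you" are made up
vocab = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "thank": 5, "you": 6}

ids, mask = encode(["thank", "you", "museum"], vocab, max_length=8)
print(ids)   # [101, 5, 6, 100, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every sequence comes out with the same length, and the mask records which positions hold text.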

In [68]:
print(encoded_data_train.keys())
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(tweets_df[tweets_df.set == 'train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(tweets_df[tweets_df.set == 'val'].label.values)

# 1258 is the number of training tweets and 256 the specified max_length
print(input_ids_train.shape, attention_masks_train.shape, labels_train.shape)
# What we feed the model: 101 is the ID of the [CLS] token that starts every sequence; 1030 is most likely '@'
print(input_ids_train[:5, :5])
# What the model should pay attention to: 0 marks padding, where there is no text anymore
print(attention_masks_train[:20, :20])
# The different labels
print(labels_val[:10])

# Inside BERT, each ID is looked up in a learned embedding table, so every token becomes a vector;
# position and segment embeddings are added before the result enters the transformer layers.

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
torch.Size([1258, 256]) torch.Size([1258, 256]) torch.Size([1258])
tensor([[  101, 16092,  3897,  2007, 10098],
        [  101,  1030, 27034, 14406, 18382],
        [  101,  1030, 10682,  5910, 25378],
        [  101,  1030,  2329,  7606, 14820],
        [  101,  1030,  2120, 22263,  7301]])
tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1

In [69]:
# Create a dataset for each split; indexing it returns the input IDs, attention mask and label of a sample together
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)
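
What happens to the input IDs once they reach the model can be sketched as follows: each ID selects a row of a learned embedding table, and token, position and segment embeddings are summed per position. This is a toy version in plain Python with random values and tiny, made-up sizes. The real model uses learned weights, a hidden size of 768 for the base variant, and applies layer normalization and dropout after the sum.

```python
import random

rng = random.Random(0)
vocab_size, max_positions, hidden = 8, 4, 3   # toy sizes, not BERT's

def table(rows, cols):
    """Random stand-in for a learned embedding matrix."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

token_emb = table(vocab_size, hidden)
position_emb = table(max_positions, hidden)
segment_emb = table(2, hidden)

def embed(input_ids, token_type_ids):
    """Sum token, position and segment embeddings for each position."""
    out = []
    for pos, (tok, seg) in enumerate(zip(input_ids, token_type_ids)):
        out.append([t + p + s for t, p, s in
                    zip(token_emb[tok], position_emb[pos], segment_emb[seg])])
    return out

vectors = embed([1, 5, 2], [0, 0, 0])
print(len(vectors), len(vectors[0]))  # 3 3
```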

## Task 5: Setting up BERT Pretrained Model

In [72]:
# There are multiple task-specific BERT models, such as BertForQuestionAnswering or BertForMultipleChoice
# We will use BertForSequenceClassification to classify our text sequences into different emotions
from transformers import BertForSequenceClassification

In [74]:
# 'base' is the smaller version (a 'large' one also exists)
# from_pretrained loads the model's base config, which already has the following defaults:
# https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json
# Additionally, we can pass the parameters specified in PretrainedConfig:
# https://github.com/huggingface/transformers/blob/6c32d8bb95aa81de6a047cca5ae732b93b9db020/src/transformers/configuration_utils.py
# -> The default for num_labels is 2, but we need 6 (one per emotion in emotion_dict)

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(emotion_dict), 
                                                      output_attentions=False, output_hidden_states=False )

## Task 6: Creating Data Loaders

In [77]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [81]:
# We use the RandomSampler for our training dataset
# For validation the order doesn't matter, as the weights are not updated; we just use the SequentialSampler

batch_size = 4 # Small, as RAM is limited; for validation we use 32, as it is less expensive (no backprop)

dataloader_train = DataLoader(dataset_train, batch_size=batch_size, sampler=RandomSampler(dataset_train))
dataloader_val = DataLoader(dataset_val, batch_size=32, sampler=SequentialSampler(dataset_val))
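
The difference between the two samplers can be sketched without PyTorch. The helpers below are hypothetical stand-ins that only produce the index batches a loader would draw. They also confirm the batch counts: 1258 training tweets at batch size 4 give 315 batches, and 223 validation tweets at batch size 32 give 7.

```python
import random

def batches(indices, batch_size):
    """Cut an index list into consecutive chunks of at most batch_size."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]

def sequential_batches(n, batch_size):
    # Same order every time, like a sequential sampler
    return batches(list(range(n)), batch_size)

def random_batches(n, batch_size, rng):
    # Fresh shuffle on every call, like a random sampler at each epoch
    order = list(range(n))
    rng.shuffle(order)
    return batches(order, batch_size)

rng = random.Random(17)
print(sequential_batches(10, 4))   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(random_batches(10, 4, rng))  # a shuffled order
print(len(sequential_batches(1258, 4)), len(sequential_batches(223, 32)))  # 315 7
```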

## Task 7: Setting Up Optimizer and Scheduler

In [82]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [99]:
# Roughly following the recommendations from the BERT paper
# -> e is scientific notation for 'times ten to the power of' (e.g. 1e5 = 10^5, 1e-5 = 10^-5)
# Not to be confused with Euler's number
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)

In [100]:
epochs = 10

# Schedules changes in the learning rate (LR)
# num_warmup_steps: for how many steps the LR increases linearly from 0 (0 = no warmup)
# num_training_steps: total number of training steps; after warmup the LR decreases linearly to 0 by the last step
scheduler = get_linear_schedule_with_warmup(optimizer=optimizer, num_warmup_steps=0, 
                                            num_training_steps=epochs*len(dataloader_train))
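
The shape of this schedule can be written as a multiplier on the base learning rate. The function below is a plain-Python sketch of a linear warmup followed by a linear decay; `linear_schedule_factor` is a hypothetical name. The step count assumes 10 epochs of 315 batches each, i.e. 3150 steps.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base LR at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 10 * 315
for step in (0, 1575, 3150):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

With `num_warmup_steps=0`, the learning rate starts at its full value and halves by the midpoint of training.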

## Task 8: Defining our Performance Metrics

The accuracy approach is adapted from the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [101]:
import numpy as np

In [102]:
from sklearn.metrics import f1_score

In [110]:
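def f1_score_func(preds, labels):
    # preds holds one row of logits per sample; argmax picks the predicted class
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    # f1_score expects (y_true, y_pred); 'weighted' weights each class by its true support
    return f1_score(labels_flat, pred_flat, average='weighted')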
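
To show what a weighted F1 score computes, here is a small from-scratch sketch in plain Python. `weighted_f1` is a hypothetical helper, not scikit-learn's implementation, and the labels are toy data. Each class's F1 is weighted by how often the class occurs in the true labels, so the argument order matters.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
print(round(weighted_f1(y_true, y_pred), 4))  # 0.8148
print(round(weighted_f1(y_pred, y_true), 4))  # 0.8519 (arguments swapped)
```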

In [111]:
def accuracy_per_class(preds, labels):
    emotion_dict_inverse = {v: k for k, v in emotion_dict.items()}
    
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = pred_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {emotion_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
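
The same per-class bookkeeping can be sketched without NumPy, which makes the two steps explicit: take the argmax of each row of logits, then count hits among the samples of each true class. The logits and labels below are made up, and `per_class_counts` is a hypothetical helper.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def per_class_counts(logits, labels):
    """Return {label: (n_correct, n_total)} from raw logits and true labels."""
    counts = {}
    for row, label in zip(logits, labels):
        correct, total = counts.get(label, (0, 0))
        counts[label] = (correct + (argmax(row) == label), total + 1)
    return counts

# Toy logits for four samples and three classes
logits = [[2.0, 0.1, -1.0], [0.3, 1.2, 0.0], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0]]
labels = [0, 1, 1, 2]
result = per_class_counts(logits, labels)
print(result)  # {0: (1, 1), 1: (1, 2), 2: (1, 1)}
```

The ratio reported per class is the fraction of that class's true samples that were predicted correctly, i.e. the per-class recall.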

## Task 9: Creating our Training Loop

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [106]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
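
The effect of seeding can be seen with the standard library alone. In this sketch, `sample_order` is a hypothetical helper: the same seed reproduces the same shuffle, which is what makes a run repeatable.

```python
import random

def sample_order(n, seed):
    """Shuffle range(n) with a dedicated, seeded generator."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order

run_a = sample_order(10, seed=17)
run_b = sample_order(10, seed=17)
print(run_a == run_b)  # True: same seed, same order
```

The notebook seeds `random`, NumPy and PyTorch separately because each library keeps its own random number generator.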

In [107]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cpu


In [125]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals


In [114]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, desc='Epoch: {}'.format(epoch), leave=False, disable=False)
    
    for batch in progress_bar:
        
        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids': batch[0], 'attention_mask': batch[1], 'labels': batch[2]}
        
        # ** unpacks the dict
        output = model(**inputs)
        
        loss = output[0]
        loss_train_total += loss.item()
        loss.backward()
        
        # clip_grad_norm_ (trailing underscore) is the in-place, non-deprecated variant
        torch.nn.utils.clip_grad_norm_(model.parameters(), 10.0)
        
        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) would just be 3 (the tuple of tensors)
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
        
    torch.save(model.state_dict(), f'model/BERT_ft_epoch{epoch}.model')
    
    tqdm.write(f'\nEpoch: {epoch}')
    
    loss_train_avg = loss_train_total / len(dataloader_train)
    tqdm.write(f'Training Loss: {loss_train_avg}')
    
    loss_val_avg, predictions, true_vals = evaluate(dataloader_val)
    
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation Loss: {loss_val_avg}')
    tqdm.write(f'F1 Score: {val_f1}')
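
The gradient clipping step in the loop rescales all gradients together whenever their combined norm exceeds a threshold. The sketch below shows the idea in plain Python with made-up gradient values. `clip_by_global_norm` is a hypothetical helper, and the small epsilon in the denominator is an assumption for numerical safety.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by the same factor so the global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

# Toy gradients for two parameter tensors; global norm = sqrt(9 + 16 + 144) = 13
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, max_norm=10.0)
clipped_norm = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm, round(clipped_norm, 3))  # 13.0 10.0
```

Because every gradient is scaled by the same factor, the direction of the update is preserved and only its size is capped.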
        

KeyboardInterrupt: 

## Task 10: Loading and Evaluating our Model

In [121]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(emotion_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

In [122]:
model.to(device)
pass

In [123]:
# Find the model on https://drive.google.com/file/d/1StVw0jsBcz_9w_N_x9LaWC5coWe-GcFN/view?usp=sharing
model.load_state_dict(torch.load('model/finetuned_bert_epoch_1_gpu_trained.model', map_location=torch.device(device)))

<All keys matched successfully>

In [126]:
loss_val_avg, predictions, true_vals = evaluate(dataloader_val)

In [127]:
accuracy_per_class(predictions, true_vals)

Class: happy
Accuracy: 170/171

Class: not-relevant
Accuracy: 31/32

Class: angry
Accuracy: 9/9

Class: disgust
Accuracy: 1/1

Class: sad
Accuracy: 5/5

Class: surprise
Accuracy: 5/5



In [None]:
# Google Colab – GPU Instance K80
# Batch Size 32
# epoch = 10

**Graded Quiz: Test your Project understanding**

**Question 1**

Why was Exploratory Data Analysis useful for our project?

- It showed us that some samples belonged to multiple classes.
  Correct. The BERT finetuning approach we undertook required a sample to belong to a single class.
- It showed us the severe class imbalance in our dataset.
  Correct! This helped us to use a stratified approach when splitting our dataset.
- It gave us hints as to what learning rate to use.

**Question 2**

Why did we use a stratified approach to split our dataset for training and validation?

- To ensure that each class had some representation in each resulting set.
  Correct! Not using this approach could have catastrophic repercussions on our model.
- To avoid having samples fall into both the training and validation splits.
- Our dataset had severe class imbalances, and we had to address them by splitting each class's samples into one of the two sets.
  Correct. This addressed the class imbalance in the eyes of evaluation.

**Question 3**

What does BERT's attention mask refer to?

- It tells BERT which words in an input sentence are important and which are insignificant.
- It allows for one word to look back at different words to gather additional context.
- It marks whether or not a dimension in the input vector is text or padding.
  Correct. Since BERT needs a fixed-size input, padding is necessary to ensure this holds.

**Question 4**

Why do we use a RandomSampler for training, but not necessarily for validation?

- It's an insignificant artifact of BERT's training.
- For each epoch, we want our dataset to be randomly sorted to improve generalization and prevent the model from learning common sequences of input.
  Correct. It's just a method of adding good 'noise' to training.

**Question 5**

Why do we use random seed values in machine learning projects?

- To make sure that our model is not biased.
- For the sake of reproducibility.
  Correct. Claims of amazing performance mean nothing without being able to reproduce them.

**Question 6**

What is the point of using model.zero_grad() when training a PyTorch model?

- It allows for our model's weights to begin from 0.
- It halts model training for evaluation.
- It sets all gradients to zero for each new batch gradient change.
  Correct. We don't want gradients to accumulate across batches here; gradient accumulation is usually useful for Recurrent Neural Networks (RNNs).

**Question 7**

What does model.train() do?

- It commences model training.
- It puts the model into training mode, wherein backpropagation can occur.
  Correct. Conversely, model.eval() puts the model into evaluation mode, which turns off training-only behavior such as dropout; gradients are disabled separately with torch.no_grad(), as in our evaluate function.

**Question 8**

Large-scale models are incredibly powerful since they have access to such huge chunks of data from which to learn. Can you think of another task you could finetune BERT to learn, and secondly, do you see any potential problems with using such a model in production?

a) Apart from sentiment analysis, BERT could be useful for intention detection (e.g. in a chatbot), error detection (e.g. like Grammarly), or writing tools (e.g. suggesting words). As we use language every day, the range of NLP applications is huge.

b) Bias; impact on the environment (mostly during the training phase); outdated information (the text BERT was trained on might become outdated); robustness (BERT always needs to be fine-tuned and is not very good at zero-shot tasks).

Correct. BERT can be adapted to do question answering, multiple choice, sentence completion, and predicting missing words.

Firstly, one clear problem with BERT is its speed. It can be very slow for production. Secondly, we have a more philosophical problem. What kind of biases do you think a large-scale language model will have if it's trained on data from the internet, which represents wealthier people and countries more than poorer ones? Sure, if we are just detecting sentiment in Tweets, maybe this isn't an issue. What if we're using it to screen and monitor perception of people though?