# Emotion classification with BERT

This notebook focuses on classifying text with emotions using the BERT transformer model. The goal is to analyze and understand the emotions conveyed in textual data.

By Constant Fotie Moghommahie

[fotiecodes](https://fotiecodes.com).


Classes: 
```json {
    'sadness': 0,
    'fear': 1,
    'anger': 2,
    'love': 3,
    'happy': 4,
    'surprise': 5
}
```

### 1. Import and install neccessary libs


In [16]:
!pip3 install nltk
!pip3 install --upgrade tensorflow




In [17]:
import tensorflow as tf
#import tensorflow_hub as hub
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import re
import unicodedata
from nltk.corpus import stopwords
from tensorflow import keras
from tensorflow.keras.layers import Dense,Dropout, Input
#from tqdm import tqdm
import pickle
from sklearn.metrics import confusion_matrix,f1_score,classification_report
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from tensorflow.keras import regularizers

In [18]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/fotiem.constant/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
from transformers import BertTokenizer, TFBertModel, BertConfig,TFDistilBertModel,DistilBertTokenizer,DistilBertConfig

  from .autonotebook import tqdm as notebook_tqdm


### 2.  Preprocessing and cleaning functions

In [20]:
def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

def clean_stopwords_shortwords(w):
    stopwords_list=stopwords.words('english')
    words = w.split() 
    clean_words = [word for word in words if (word not in stopwords_list) and len(word) > 2]
    return " ".join(clean_words) 

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    w = re.sub(r"([?.!,¿])", r" ", w)
    w = re.sub(r'[" "]+', " ", w)
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    w=clean_stopwords_shortwords(w)
    w=re.sub(r'@\w+', '',w)
    return w

### 3. Load and Inspect our dataset
We will first load and and preview our dataset. So we can check for any missing values.

In [21]:
# Load and inspect the new dataset
data_file = './data/emotion_final.csv'

# Read the data from the CSV file into a pandas DataFrame
data=pd.read_csv(data_file,encoding='ISO-8859-1')

# Display the first few rows of the DataFrame
data.head()

Unnamed: 0,Text,Emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


>  Removing Unnamed Columns, dropping NaN data and resetting the index after dropping some rows/columns containing NaN dataset and finally shuffling the dataset

In [22]:
print('File has {} rows and {} columns'.format(data.shape[0],data.shape[1]))
data=data.dropna()
data=data.reset_index(drop=True)
print('File has {} rows and {} columns'.format(data.shape[0],data.shape[1]))
data = shuffle(data)

data.head()

File has 21459 rows and 2 columns
File has 21459 rows and 2 columns


Unnamed: 0,Text,Emotion
2168,i couldnt bring myself to blog about it right ...,sadness
3814,i have eternal hope he says and when they arri...,love
7834,i have a feeling that it is in canada where sh...,happy
1611,i remember that beauty truly is in the eye of ...,fear
336,i recall those high school feelings and the lo...,love


In [23]:
data=data.rename(columns = {'Emotion': 'label', 'Text': 'text'}, inplace = False)

# Ensure all unique labels are included in the mapping dictionary
unique_labels = data['label'].unique()
mapping_dict = {label: index for index, label in enumerate(unique_labels)}

data['gt'] = data['label'].map(mapping_dict)

# Fill NaN values with a specific value (e.g., -1) and convert to integer
data['gt'] = data['gt'].fillna(-1).astype(int)

data['text']=data['text'].map(preprocess_sentence)

num_classes=len(data.label.unique())

data.head()

Unnamed: 0,text,label,gt
2168,couldnt bring blog right away mostly feel abso...,sadness,0
3814,eternal hope says arrive bridge finds likes fe...,love,1
7834,feeling canada find prince charming,happy,2
1611,remember beauty truly eye beholder people see ...,fear,3
336,recall high school feelings longing watched ol...,love,1


###  4. Loading DistilBERT Tokenizer and the DistilBERT model

In [24]:
dbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [25]:
dbert_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


###  5. Preparing input for the model

In [26]:
# Set the maximum length of the input sentences
max_len = 32

# Get the sentences and labels from the data
sentences = data['text']
labels = data['gt']

# Print the length of the sentences and labels
len(sentences), len(labels)


(21459, 21459)

####  Let's take a sentence from the dataset and understand the input and output of the DistilBERT

In [27]:
sentences[0]

'didnt feel humiliated'

In [28]:
dbert_tokenizer.tokenize(sentences[0])

['didn', '##t', 'feel', 'humiliated']

> Input ids and the attention masks from the tokenizer

In [29]:
dbert_inp=dbert_tokenizer.encode_plus(sentences[0],add_special_tokens = True,max_length =20,pad_to_max_length = True,truncation=True)
dbert_inp



{'input_ids': [101, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [30]:
dbert_inp['input_ids']

[101, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

>  DistilBERT model output: Give input_ids and the attention_mask obtained from the tokenizer. The output will be a tuple of the size (1,max_len,768)

In [31]:
id_inp=np.asarray(dbert_inp['input_ids'])
mask_inp=np.asarray(dbert_inp['attention_mask'])
out=dbert_model([id_inp.reshape(1,-1),mask_inp.reshape(1,-1)])
type(out),out

(transformers.modeling_tf_outputs.TFBaseModelOutput,
 TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(1, 20, 768), dtype=float32, numpy=
 array([[[-0.1713852 ,  0.04808453, -0.11152899, ..., -0.06434262,
           0.3309234 ,  0.36421138],
         [ 0.35398236,  0.237599  ,  0.21699837, ...,  0.00849791,
           0.46305755, -0.41722137],
         [-0.41779673,  0.0801597 ,  0.4836889 , ..., -0.17692386,
           0.5626378 ,  0.43658376],
         ...,
         [-0.15297808,  0.03547417,  0.06371693, ...,  0.03469767,
           0.0273383 , -0.1618905 ],
         [-0.14419594,  0.06348429,  0.07919774, ...,  0.11205912,
           0.0483825 , -0.20951898],
         [-0.11421237,  0.06052669,  0.07216734, ...,  0.1228725 ,
           0.0578859 , -0.19464041]]], dtype=float32)>, hidden_states=None, attentions=None))

> Obtain the embeddings of a sentence from the output

In [32]:
out[0][:,0,:]

<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-1.71385199e-01,  4.80845273e-02, -1.11528993e-01,
        -2.09115595e-02, -2.34398291e-01, -4.43111286e-02,
         2.90086627e-01,  2.66656905e-01,  2.05921069e-01,
        -2.20593572e-01,  5.82962371e-02, -2.02092111e-01,
        -1.94992542e-01,  1.50783718e-01,  9.85640213e-02,
         5.78697622e-02,  2.98279021e-02,  1.51283830e-01,
         6.53628679e-03, -5.72892092e-02, -7.78306574e-02,
        -3.38846177e-01, -1.36250377e-01,  7.62554258e-02,
        -1.06427208e-01,  3.59946080e-02,  1.48689047e-01,
        -3.07668447e-01,  6.57811016e-02, -1.10575236e-01,
         6.58358634e-02,  2.32130304e-01, -1.61469772e-01,
         2.03121938e-02, -1.84822798e-01,  1.77119240e-01,
         3.92265581e-02,  1.18414432e-01,  1.43307313e-01,
         1.92273743e-02, -2.72444546e-01, -5.23973629e-02,
        -2.44558126e-01, -4.83159423e-02,  1.09537050e-01,
        -1.48615092e-01, -2.21261787e+00, -3.53776738e-02,
      

> Decode the original sentence from the tokenizer 

In [33]:
dbert_tokenizer.decode(dbert_inp['input_ids'])

'[CLS] didnt feel humiliated [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

In [34]:
print("Available labels: ",data.label.unique())
num_classes = len(data.label.unique())
data.head()

Available labels:  ['sadness' 'love' 'happy' 'fear' 'anger' 'surprise']


Unnamed: 0,text,label,gt
2168,couldnt bring blog right away mostly feel abso...,sadness,0
3814,eternal hope says arrive bridge finds likes fe...,love,1
7834,feeling canada find prince charming,happy,2
1611,remember beauty truly eye beholder people see ...,fear,3
336,recall high school feelings longing watched ol...,love,1


###  6. Create a basic NN model using DistilBERT embeddings to get the predictions

In [35]:
def create_model():
    inps = Input(shape = (max_len,), dtype='int64')
    masks= Input(shape = (max_len,), dtype='int64')
    dbert_layer = dbert_model(inps, attention_mask=masks)[0][:,0,:]
    dense = Dense(512,activation='relu',kernel_regularizer=regularizers.l2(0.01))(dbert_layer)
    dropout= Dropout(0.5)(dense)
    pred = Dense(num_classes, activation='softmax',kernel_regularizer=regularizers.l2(0.01))(dropout)
    model = tf.keras.Model(inputs=[inps,masks], outputs=pred)
    print(model.summary())
    return model   

> Feel free to add more Dense and Dropout layers with variable units and the regularizers

In [36]:
model=create_model()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 32)]                 0         []                            
                                                                                                  
 input_2 (InputLayer)        [(None, 32)]                 0         []                            
                                                                                                  
 tf_distil_bert_model (TFDi  TFBaseModelOutput(last_hid   6636288   ['input_1[0][0]',             
 stilBertModel)              den_state=(None, 32, 768),   0          'input_2[0][0]']             
                              hidden_states=None, atten                                           
                             tions=None)                                                      

> Prepare the model input

In [37]:
input_ids=[]
attention_masks=[]

for sent in sentences:
    dbert_inps=dbert_tokenizer.encode_plus(sent,add_special_tokens = True,max_length =max_len,pad_to_max_length = True,return_attention_mask = True,truncation=True)
    input_ids.append(dbert_inps['input_ids'])
    attention_masks.append(dbert_inps['attention_mask'])

input_ids=np.asarray(input_ids)
attention_masks=np.array(attention_masks)
labels=np.array(labels)



In [38]:
len(input_ids),len(attention_masks),len(labels)

(21459, 21459, 21459)

> Save the model input in the pickle files to use it later without performing the above steps

In [39]:
print('Preparing the pickle file.....')

pickle_inp_path='./data/pickle_files/dbert_inp.pkl'
pickle_mask_path='./data/pickle_files/dbert_mask.pkl'
pickle_label_path='./data/pickle_files/dbert_label.pkl'

Preparing the pickle file.....


In [40]:
pickle.dump((input_ids),open(pickle_inp_path,'wb'))
pickle.dump((attention_masks),open(pickle_mask_path,'wb'))
pickle.dump((labels),open(pickle_label_path,'wb'))


print('Pickle files saved as ',pickle_inp_path,pickle_mask_path,pickle_label_path)

Pickle files saved as  ./data/pickle_files/dbert_inp.pkl ./data/pickle_files/dbert_mask.pkl ./data/pickle_files/dbert_label.pkl


In [41]:
print('Loading the saved pickle files..')

input_ids=pickle.load(open(pickle_inp_path, 'rb'))
attention_masks=pickle.load(open(pickle_mask_path, 'rb'))
labels=pickle.load(open(pickle_label_path, 'rb'))

print('Input shape {} Attention mask shape {} Input label shape {}'.format(input_ids.shape,attention_masks.shape,labels.shape))

Loading the saved pickle files..
Input shape (21459, 32) Attention mask shape (21459, 32) Input label shape (21459,)


In [42]:
label_class_dict = dict(enumerate(data['label'].unique()))
target_names = label_class_dict.values()
target_names


dict_values(['sadness', 'love', 'happy', 'fear', 'anger', 'surprise'])

> Train Test split and setting up the loss function, accuracy and optimizer for the model. 

In [43]:
train_inp,val_inp,train_label,val_label,train_mask,val_mask=train_test_split(input_ids,labels,attention_masks,test_size=0.2)

print('Train inp shape {} Val input shape {}\nTrain label shape {} Val label shape {}\nTrain attention mask shape {} Val attention mask shape {}'.format(train_inp.shape,val_inp.shape,train_label.shape,val_label.shape,train_mask.shape,val_mask.shape))


log_dir='dbert_model'
model_save_path='./dbert_model.h5'

callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath=model_save_path,save_weights_only=True,monitor='val_loss',mode='min',save_best_only=True),keras.callbacks.TensorBoard(log_dir=log_dir)]

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)

model.compile(loss=loss,optimizer=optimizer, metrics=[metric])



Train inp shape (17167, 32) Val input shape (4292, 32)
Train label shape (17167,) Val label shape (4292,)
Train attention mask shape (17167, 32) Val attention mask shape (4292, 32)


In [44]:
callbacks= [tf.keras.callbacks.ModelCheckpoint(filepath=model_save_path,save_weights_only=True,monitor='val_loss',mode='min',save_best_only=True),keras.callbacks.TensorBoard(log_dir=log_dir)]
model.compile(loss=loss,optimizer=optimizer, metrics=[metric])

### Training

In [61]:
history=model.fit([train_inp,train_mask],train_label,batch_size=16,epochs=5,validation_data=([val_inp,val_mask],val_label),callbacks=callbacks)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Tensorboard visualization (Training-Testing curve)

In [62]:
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [63]:
%tensorboard --logdir {log_dir}

Reusing TensorBoard on port 6006 (pid 42543), started 0:37:09 ago. (Use '!kill 42543' to kill it.)

### We will Increase the number of epochs in order to decrease the loss further
Here we will use the saved model for predictions and calculating the evaluation metrics

In [64]:
trained_model = create_model()
trained_model.compile(loss=loss,optimizer=optimizer, metrics=[metric])
trained_model.load_weights(model_save_path)

Model: "model_2"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_5 (InputLayer)        [(None, 32)]                 0         []                            
                                                                                                  
 input_6 (InputLayer)        [(None, 32)]                 0         []                            
                                                                                                  
 tf_distil_bert_model (TFDi  TFBaseModelOutput(last_hid   6636288   ['input_5[0][0]',             
 stilBertModel)              den_state=(None, 32, 768),   0          'input_6[0][0]']             
                              hidden_states=None, atten                                           
                             tions=None)                                                    

In [65]:
preds = trained_model.predict([val_inp,val_mask],batch_size=16)
pred_labels = preds.argmax(axis=1)
f1 = f1_score(val_label, pred_labels, average='weighted')
f1



0.9288570145902261

In [66]:
print(target_names)
# print(target_names.shape)
print(val_label.shape)
print(pred_labels.shape)


# we print F1 score and classification report
print('F1 score:', f1)
print('Classification Report:')
print(classification_report(val_label, pred_labels))

print('Training and saving built model...')

dict_values(['sadness', 'love', 'happy', 'fear', 'anger', 'surprise'])
(4292,)
(4292,)
F1 score: 0.9288570145902261
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.97      1287
           1       0.82      0.82      0.82       326
           2       0.94      0.94      0.94      1365
           3       0.89      0.94      0.91       545
           4       0.97      0.89      0.93       603
           5       0.85      0.77      0.81       166

    accuracy                           0.93      4292
   macro avg       0.90      0.89      0.90      4292
weighted avg       0.93      0.93      0.93      4292

Training and saving built model...


### Quick test in real world scenario
We are now gonna try to predict a sentence with any input

In [67]:
# now we try to predict the label of a random sentence from user
def predict(sentence):
    # we first preprocess the sentence
    sentence = preprocess_sentence(sentence)
    # then we do some tokenization on the sentence
    dbert_inps=dbert_tokenizer.encode_plus(sentence,add_special_tokens = True,max_length =max_len,pad_to_max_length = True,return_attention_mask = True,truncation=True)
    # then we convert to numpy array
    id_inp=np.asarray(dbert_inps['input_ids'])
    mask_inp=np.asarray(dbert_inps['attention_mask'])
    # and predict the label using the trained model
    pred = trained_model.predict([id_inp.reshape(1,-1),mask_inp.reshape(1,-1)])
    # and then get the label with the highest probability
    pred_label = np.argmax(pred,axis=1)
    # we then return it
    return label_class_dict[pred_label[0]]

In [71]:
# test the model on a random sentence
predict("It is a love hate relationship between me and my father")



'anger'