# Emotion Classification and Explanation with BERT

This notebook applies BERT to multi-class text classification. The dataset consists of written dialogs in which each utterance/message is labeled with one of five emotion categories: joy, anger, sadness, fear, or neutral.

## Workflow: 
1. Import Data
2. Data preprocessing and downloading BERT
3. Training and validation
4. Prediction on the MIT dataset
5. Explanations for the predictions
6. Saving and loading the model

In [None]:
# Install ktrain, SHAP, and the eli5 fork that ktrain uses for explanations (Google Colab)
!pip3 install ktrain
!pip3 install shap
!pip3 install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip

In [None]:
import pandas as pd
import numpy as np

import ktrain
from ktrain import text

from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')

import transformers
import shap

import csv
from tqdm import tqdm
import os 
import glob

import ast
from IPython.display import HTML

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 1. Import Data

In [None]:
data_train = pd.read_csv('drive/MyDrive/Colab Notebooks/XAI/datasets/data_train.csv', encoding='utf-8')
data_test = pd.read_csv('drive/MyDrive/Colab Notebooks/XAI/datasets/data_test.csv', encoding='utf-8')

X_train = data_train.Text.tolist()
X_test = data_test.Text.tolist()

y_train = data_train.Emotion.tolist()
y_test = data_test.Emotion.tolist()

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same combined frame
data = pd.concat([data_train, data_test], ignore_index=True)

class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']

print('size of training set: %s' % (len(data_train['Text'])))
print('size of test set: %s' % (len(data_test['Text'])))
print(data.Emotion.value_counts())

data.head(10)

size of training set: 7934
size of test set: 3393
joy        2326
sadness    2317
anger      2259
neutral    2254
fear       2171
Name: Emotion, dtype: int64


Unnamed: 0,Emotion,Text
0,neutral,There are tons of other paintings that I thin...
1,sadness,"Yet the dog had grown old and less capable , a..."
2,fear,When I get into the tube or the train without ...
3,fear,This last may be a source of considerable disq...
4,anger,She disliked the intimacy he showed towards so...
5,sadness,When my family heard that my Mother's cousin w...
6,joy,Finding out I am chosen to collect norms for C...
7,anger,A spokesperson said : ` Glen is furious that t...
8,neutral,Yes .
9,sadness,"When I see people with burns I feel sad, actua..."


In [None]:
encoding = {
    'joy': 0,
    'sadness': 1,
    'fear': 2,
    'anger': 3,
    'neutral': 4
}

# Integer values for each class
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]
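
The model later returns integer class ids, so it helps to be able to turn them back into emotion names. A minimal sketch of the inverse lookup, assuming the same `encoding` dictionary as above; the names `decoding` and `decode_labels` are hypothetical helpers, not part of the original notebook:

```python
encoding = {'joy': 0, 'sadness': 1, 'fear': 2, 'anger': 3, 'neutral': 4}

# Inverse lookup: integer id -> emotion name
decoding = {idx: name for name, idx in encoding.items()}


def decode_labels(ids):
    """Translate a list of integer class ids back into emotion names."""
    return [decoding[i] for i in ids]


print(decode_labels([0, 3, 4]))
```

Because the ids follow the order of `class_names`, indexing `class_names` directly would give the same result.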

## 2. Data preprocessing

* The text must be preprocessed in a specific way for use with BERT. This is accomplished by setting preprocess_mode to 'bert'. The BERT model and vocabulary are downloaded automatically.

* BERT can handle sequences of at most 512 tokens, but we use a shorter maxlen (350) to reduce memory use and speed up training.
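
One way to sanity-check a maxlen choice is to look at how long the texts actually are. The sketch below is only a rough proxy: it counts whitespace-separated words, whereas BERT's WordPiece tokenizer usually produces more tokens than that, so the numbers should be read as a lower bound. The helper `length_percentile` and the sample texts are hypothetical.

```python
import math


def length_percentile(texts, pct):
    """Return the pct-th percentile (nearest-rank) of whitespace word counts."""
    lengths = sorted(len(t.split()) for t in texts)
    rank = max(1, math.ceil(pct / 100 * len(lengths)))
    return lengths[rank - 1]


# Hypothetical sample; in the notebook this would be X_train
sample_texts = [
    "Yes .",
    "I am so happy today !",
    "That was a long and rather frightening train ride home",
]

print(length_percentile(sample_texts, 50))
print(length_percentile(sample_texts, 95))
```

If the 95th or 99th percentile is far below the chosen maxlen, a smaller value would likely save memory without truncating many texts.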

In [None]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=X_train, y_train=y_train,
    x_test=X_test, y_test=y_test,
    class_names=class_names,
    preprocess_mode='bert',
    maxlen=350,
    max_features=35000)

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en


task: text classification


## 3. Training and validation


Load the pretrained BERT model for text classification

In [None]:
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)

Is Multi-Label? False
maxlen is 350
done.


Wrap it in a Learner object

In [None]:
learner = ktrain.get_learner(model, train_data=(x_train, y_train), 
                             val_data=(x_test, y_test),
                             batch_size=6)

In [None]:
learner.fit_onecycle(2e-5, 3)



begin training using onecycle policy with max lr of 2e-05...
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fb02c4b7850>
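
To give an intuition for what a one-cycle policy does with the maximum learning rate of 2e-5, here is a simplified triangular schedule: the rate rises linearly to the maximum during the first half of training and falls back during the second half. This is an illustrative approximation under that assumption, not ktrain's exact implementation, and `triangular_lr` is a hypothetical helper.

```python
def triangular_lr(step, total_steps, max_lr, min_lr=0.0):
    """Learning rate at `step` for a simple rise-then-fall (triangular) schedule."""
    half = total_steps / 2
    if step <= half:
        frac = step / half                   # warm-up: 0 -> 1
    else:
        frac = (total_steps - step) / half   # decay: 1 -> 0
    return min_lr + (max_lr - min_lr) * frac


total_steps = 100  # illustrative value, not the notebook's real step count
for step in (0, 25, 50, 75, 100):
    print(step, triangular_lr(step, total_steps, max_lr=2e-5))
```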

Validation

In [None]:
learner.validate(val_data=(x_test, y_test), class_names=class_names)

              precision    recall  f1-score   support

         joy       0.87      0.85      0.86       707
     sadness       0.80      0.82      0.81       676
        fear       0.87      0.85      0.86       679
       anger       0.79      0.78      0.79       693
     neutral       0.80      0.82      0.81       638

    accuracy                           0.83      3393
   macro avg       0.83      0.83      0.83      3393
weighted avg       0.83      0.83      0.83      3393



array([[601,  12,  15,  12,  67],
       [ 20, 556,  29,  52,  19],
       [ 13,  26, 580,  48,  12],
       [ 20,  66,  33, 543,  31],
       [ 40,  34,  10,  29, 525]])
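
The per-class figures in the report can be recomputed from the confusion matrix, which makes the relationship between the two outputs explicit. The sketch below assumes the usual convention that rows are true labels and columns are predicted labels (the row sums match the support column above); `per_class_metrics` is a hypothetical helper.

```python
class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']

# Confusion matrix copied from the validation output above
confusion = [
    [601, 12, 15, 12, 67],
    [20, 556, 29, 52, 19],
    [13, 26, 580, 48, 12],
    [20, 66, 33, 543, 31],
    [40, 34, 10, 29, 525],
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall) pairs, one per class."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        true_positives = matrix[k][k]
        actual_total = sum(matrix[k])                         # row sum: true class k
        predicted_total = sum(matrix[r][k] for r in range(n))  # column sum: predicted k
        metrics.append((true_positives / predicted_total,
                        true_positives / actual_total))
    return metrics


for name, (precision, recall) in zip(class_names, per_class_metrics(confusion)):
    print(f'{name:>8}  precision={precision:.2f}  recall={recall:.2f}')

# Accuracy is the share of samples on the diagonal
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / sum(map(sum, confusion))
print(f'accuracy={accuracy:.2f}')
```

Reading along a row shows where a class is lost; for example, most misclassified joy utterances (67 of them) are predicted as neutral.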

## 4. Predict labels for the MIT dataset

In [None]:
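MIT_df = pd.read_csv('drive/MyDrive/Colab Notebooks/XAI/datasets/MIT_interviews_preprocessed.csv', encoding='utf-8')
# Split each interview transcript into sentences
MIT_df['tokenize_sentence'] = MIT_df['text_remove_interview_signs'].apply(sent_tokenize)
# A predictor must exist before this cell runs: build it from the trained model here,
# or load a saved one with ktrain.load_predictor (see the saving/loading cells)
predictor = ktrain.get_predictor(model, preproc)
# One probability vector per sentence, ordered as in class_names
MIT_df['prediction'] = MIT_df['tokenize_sentence'].apply(lambda sentences: [predictor.predict_proba(sentence) for sentence in sentences])
MIT_df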

Unnamed: 0.1,Unnamed: 0,Participant,text_remove_interview_signs,tokenize_sentence,prediction
0,0,p1,Im pretty good. ok uhm so have you looked at ...,"[ Im pretty good., ok uhm so have you looked a...","[[0.14461343, 0.0018815603, 0.015007783, 0.002..."
1,1,p10,Great how about you? Im a little by the resu...,"[ Great how about you?, Im a little by the re...","[[0.9812996, 0.0010401229, 0.0009908269, 0.001..."
2,2,p11,Uhh Im a junior at MIT uhh Im double majoring...,[ Uhh Im a junior at MIT uhh Im double majorin...,"[[0.0042518657, 0.0040851985, 0.9801481, 0.000..."
3,3,p12,Im good how are you? Ok so Im a Junior at MIT...,"[ Im good how are you?, Ok so Im a Junior at M...","[[0.23260541, 0.0010147338, 0.00569431, 0.0008..."
4,4,p13,Good. Ok umm Im currently a junior at M.I.T. ...,"[ Good., Ok umm Im currently a junior at M.I.T...","[[0.025457432, 0.0010670138, 0.0010337941, 0.0..."
...,...,...,...,...,...
133,133,pp83,Um pretty good pretty good. Getting busy with...,"[ Um pretty good pretty good., Getting busy wi...","[[0.9243789, 0.0022251836, 0.010719742, 0.0022..."
134,134,pp84,Good thank you how are you? Alright well so I...,"[ Good thank you how are you?, Alright well so...","[[0.9003823, 0.00066801475, 0.0015706142, 0.00..."
135,135,pp85,Okay well Im a junior here at MIT. Umm Im dou...,"[ Okay well Im a junior here at MIT., Umm Im d...","[[0.23916756, 0.007959177, 0.13531123, 0.00218..."
136,136,pp86,In my technical background um. Been a junior ...,"[ In my technical background um., Been a junio...","[[0.36475015, 0.053461522, 0.11626156, 0.00347..."
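
Each entry in the prediction column is a list of probability vectors, one per sentence. A minimal sketch of how such vectors could be reduced to the most likely emotion per sentence and to label counts per interview; the helpers `top_label` and `label_counts` and the example probabilities are hypothetical, and the class order is assumed to match `class_names`:

```python
from collections import Counter

class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']


def top_label(probabilities):
    """Return (emotion name, probability) for the most likely class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]


def label_counts(sentence_probabilities):
    """Count how often each emotion is the top prediction in one interview."""
    return dict(Counter(top_label(p)[0] for p in sentence_probabilities))


# Made-up probabilities for a three-sentence interview
example_interview = [
    [0.90, 0.02, 0.03, 0.01, 0.04],
    [0.10, 0.05, 0.70, 0.05, 0.10],
    [0.80, 0.05, 0.05, 0.05, 0.05],
]

print([top_label(p) for p in example_interview])
print(label_counts(example_interview))
```

In the notebook this could be applied with `MIT_df['prediction'].apply(label_counts)`.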


In [None]:
MIT_df.to_csv('/content/drive/MyDrive/Colab Notebooks/XAI/datasets/MIT_dataset_emotion_prediction_percentage.csv')

## 5. Explanations for the MIT sentences

In [None]:
# Explaining every sentence is too slow to finish in one go, so the work is split into
# batches of 10 interviews; this run handles the left-out participants listed below.
output_path = 'drive/MyDrive/Colab Notebooks/XAI/datasets/explanations_from_leftout.csv'
selected = {'p31', 'p32', 'p67'}

# Write the header row once
with open(output_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Number', 'Participant', 'tokenize_sentence', 'explanations'])

for index, interview in MIT_df.iterrows():
    if interview['Participant'] not in selected:
        continue
    print(f'Index: {index}, interview: {interview.Participant}')
    explanations_list = []
    for i, sentence in tqdm(enumerate(interview['tokenize_sentence'])):
        try:
            explanations_list.append(predictor.explain(sentence).data)
        except Exception:
            # Keep a placeholder so explanations stay aligned with the sentences
            explanations_list.append(None)
    # Append after each interview so finished work survives a disconnect
    with open(output_path, 'a', newline='') as file:
        writer = csv.writer(file)
        writer.writerow([index, interview['Participant'], interview['tokenize_sentence'], explanations_list])

print('\ndone')

Index: 18, interview: p31


3it [2:42:44, 3253.23s/it]
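
Since a single interview can take hours to explain, it can be useful to skip participants whose explanations are already on disk when a run is restarted. A minimal sketch of that idea, assuming the same CSV layout as above (a `Participant` column); `finished_participants` and the demo file are hypothetical and not part of the original notebook:

```python
import csv
import os
import tempfile


def finished_participants(csv_path):
    """Return the set of participant ids already present in the output CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline='') as f:
        return {row['Participant'] for row in csv.DictReader(f)}


# Demo with a temporary file standing in for the real explanations CSV
demo_path = os.path.join(tempfile.mkdtemp(), 'explanations_demo.csv')
with open(demo_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Number', 'Participant', 'tokenize_sentence', 'explanations'])
    writer.writerow([18, 'p31', "['Hello.']", '[None]'])

done = finished_participants(demo_path)
todo = [p for p in ['p31', 'p32', 'p67'] if p not in done]
print(todo)
```

For this to work across restarts, the output file would have to be opened in append mode only, without rewriting the header each time.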

## 6. Saving/Loading the BERT model


In [None]:
# save
ktrain.get_predictor(model, preproc).save('/content/drive/MyDrive/Colab Notebooks/XAI/Save Model')

In [None]:
# load
predictor = ktrain.load_predictor('/content/drive/MyDrive/Colab Notebooks/XAI/Save Model')

['joy', 'sadness', 'fear', 'anger', 'neutral']

In [None]:
file_list = glob.glob("drive/MyDrive/Colab Notebooks/XAI/datasets/explanations_from_*.csv")
# Read every explanations file and stack them into one DataFrame
df = pd.concat([pd.read_csv(name) for name in file_list], ignore_index=True)
df = df.drop('Number', axis=1)
# The lists were stored in the CSV as strings; parse them back into Python lists
df['tokenize_sentence'] = df['tokenize_sentence'].apply(ast.literal_eval)
df['explanations'] = df['explanations'].apply(ast.literal_eval)
df.head()

Unnamed: 0,Participant,tokenize_sentence,explanations
0,p21,"[ Um pretty good., Um yeah., So um my name is...",[\n <style>\n table.eli5-weights tr:hove...
1,p22,"[ Im good., How are you?, OK. Um Im a junior i...",[\n <style>\n table.eli5-weights tr:hove...
2,p24,"[ Im doing very well., How are you?, Im curren...",[\n <style>\n table.eli5-weights tr:hove...
3,p25,"[ Um., Im good., How are you?, Um., Im a core ...","[None, \n <style>\n table.eli5-weights t..."
4,p27,"[ Im doing well thank you., Um Im currently a ...",[\n <style>\n table.eli5-weights tr:hove...
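
The `ast.literal_eval` step is needed because `csv.writer` stores a Python list as its string representation. The following self-contained sketch reproduces that round trip in memory; the row contents are made up for illustration.

```python
import ast
import csv
import io

# Write a row whose cells are Python lists, as the explanation loop does
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Participant', 'tokenize_sentence', 'explanations'])
writer.writerow(['p25', ['Um.', 'Im good.'], [None, '<p>weights</p>']])

# Reading it back yields plain strings, not lists
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row['explanations']
print(type(raw).__name__, raw)

# literal_eval safely parses the string back into a list (None is preserved)
parsed = ast.literal_eval(raw)
print(parsed)
```

In the notebook, a parsed explanation that is not None is an HTML string and can be rendered with `HTML(...)` from `IPython.display`.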
