# Deep Broad Learning for Emotion Classification in Textual Conversations

This model is inspired by the paper on Deep Broad Learning for Emotion Classification in Textual Conversations.
I have implemented a similar model in this project for emotion classification in conversations. 

Due to limited computing resources, I have reduced several hyperparameters to suit easier computing at heavy cost of accuracy. I have mentioned the places where I have compromised these parameters.



# Importing necessary libraries

In [636]:
import torch
import pandas as pd
import numpy as np
import tensorflow as tf

In [637]:
from keras import layers, models
from keras.preprocessing import sequence

In [638]:
from sklearn.metrics import mean_squared_error, accuracy_score, f1_score

In [639]:
from transformers import BertTokenizer, BertModel

# Importing Dataset and preprocessing

In [640]:
#Using approximately half the dataset to train CNNs and Bi-LSTMs to save memory
df= pd.read_csv(r"MELD.Raw\MELD.Raw\train\train_sent_emo_halved.csv")

In [641]:
#Using the full dataset to train the Broad Learning System
df_full = pd.read_csv(r"MELD.Raw\MELD.Raw\train\train_sent_emo.csv")

In [642]:
class_mapping = {'anger': 0, 'disgust': 1, 'fear': 2, 'joy': 3, 'neutral': 4, 'sadness': 5, 'surprise': 6}
df['Emotion_int'] = df['Emotion'].map(class_mapping)

In [643]:
df_full['Emotion_int'] = df_full['Emotion'].map(class_mapping)

In [644]:
emotions = torch.tensor(df['Emotion_int'].tolist())

In [645]:
emotions_hot = tf.keras.utils.to_categorical(emotions) #One-Hot encodings for emotion classes

In [646]:
#Getting the conversation list and emotion list for training the utterance-level Bi-LSTM model
conversation_list = []
emotion_list = []
for value in df["Dialogue_ID"].unique():
  conversation = df[df["Dialogue_ID"] == value]["Utterance"].tolist()
  conversation_list.append(conversation)
  emotions_conv = df[df["Dialogue_ID"] == value]["Emotion_int"].tolist()
  emotions_conv_hot = tf.keras.utils.to_categorical(emotions_conv, num_classes=7)
  emotion_list.append(emotions_conv_hot)

In [647]:

emotion_list_full = []
for value in df_full["Dialogue_ID"].unique():
  emotions_conv = df_full[df_full["Dialogue_ID"] == value]["Emotion_int"].tolist()
  emotions_conv_hot = tf.keras.utils.to_categorical(emotions_conv, num_classes=7)
  emotion_list_full.append(emotions_conv_hot)

In [648]:
#The maximum length of any conversations (number of utterances) in input
max_seq_length = 35

In [649]:
#Padding the emotion data for training the Bi-LSTM models
emotion_list = sequence.pad_sequences(emotion_list, padding='post', maxlen=max_seq_length, value=[0, 0, 0, 0, 0, 0, 0])

In [650]:
emotion_list_full = sequence.pad_sequences(emotion_list_full, padding='post', maxlen=max_seq_length, value=[0, 0, 0, 0, 0, 0, 0])

In [651]:
##Getting the conversation list and emotion list for training the speaker-level Bi-LSTM model
conversation_speaker_utts = []
emotion_speaker = []
for value in df["Dialogue_ID"].unique():
  speaker_utts = []
  for speaker in df[df["Dialogue_ID"] == value]["Speaker"].unique():
    conversation_speaker = df[(df["Speaker"] == speaker) & (df["Dialogue_ID"] == value)]["Utterance"].tolist()
    speaker_utts.append(conversation_speaker)
    emotions_conv = df[(df["Dialogue_ID"] == value) & (df["Speaker"] == speaker)]["Emotion_int"].tolist()
    emotions_conv_hot = tf.keras.utils.to_categorical(emotions_conv, num_classes=7)
    emotion_speaker.append(emotions_conv_hot)
  conversation_speaker_utts.append(speaker_utts)

In [652]:
#The maximum number of utterances by a speaker in a conversations in input
max_length_speaker = 15

In [653]:
#Padding the emotion data for training the Bi-LSTM models
emotion_speaker = sequence.pad_sequences(emotion_speaker, padding='post', maxlen=max_length_speaker, value=[0, 0, 0, 0, 0, 0, 0])

## Utterance encoding

BERT is used to extract the features of utterances as it gets both, the contextual and key features in any text.
Thus, we will obtain the embeddings of every utterance in the dataset


In [654]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # Loading the BERT tokenizer and the pretrained BERT model
bert = BertModel.from_pretrained('bert-base-uncased',
                                output_hidden_states=True)  #

# Defining a function to get BERT embeddings for a given text
def get_embeddings(text):
    # [CLS] and [SEP] are added to the text for the BERT model
    marked_text = "[CLS] " + text + " [SEP]"

    # Tokenizing the input text and getting the token indices
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    tokens = torch.tensor(indexed_tokens).unsqueeze(0)

    with torch.no_grad():
        outputs = bert(tokens)

    # Extracting the embeddings and returning them
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embeddings


In [655]:
utt_list= []
for utterance in df['Utterance']:
  embedding = get_embeddings(utterance)
  utt_list.append(embedding.tolist())

# Utterance-level context encoding

In this section, we will pass a conversation with $K$ utterances to a Convolution Neural Network (CNN). The CNN only considers the particular input utterance and predicts an emotion based on that. We will extract the features after the maxpooling layer and pass these features to a Bi-LSTM. Bi-LSTM is able to recognise contextual information of the utterances said before and after. For each utterance, we will concatenate the contextual features of forward and backward LSTM.

Thus, for a conversation, we will have list of features from the Bi-LSTM.

$C_{u} = [c_{1}, c_{2},c_{3}....c_{K}]$

where $K$ is the number of utterances in the conversation and $c_{j}$ is the vector of utterance-level features of the
$j$-th utterance


First, we will train the CNN using the BERT encodings and the target emotion values

In [672]:
#Parameters for the CNN model
num_filters= 10 #According to the paper- 100
batch_size= 20 #Ideally, higher the better (32)
embedding_dim = 768 #Embedding vector for BERT
num_classes = max(emotions) + 1
num_stride = 2 #To reduce the size of array output from CNN

In [657]:
utterances = np.array(utt_list).reshape((-1, embedding_dim, 1))

In [673]:
# Constructing the CNN model using TensorFlow and Keras
model_CNN = tf.keras.models.Sequential()
# Adding 3 1D convolutional layers with 100 filters and kernel sizes 2, 3 and 4 respectively.
model_CNN.add(tf.keras.layers.Conv1D(num_filters, 2, padding='valid', activation='relu', input_shape=(embedding_dim, 1), strides = num_stride))
model_CNN.add(tf.keras.layers.Conv1D(num_filters, 3, padding='valid', activation='relu', strides = num_stride))
model_CNN.add(tf.keras.layers.Conv1D(num_filters, 4, padding='valid', activation='relu',strides = num_stride))

# Adding a 1D max pooling layer to down-sample the feature maps and flattening them
model_CNN.add(tf.keras.layers.MaxPooling1D())
model_CNN.add(tf.keras.layers.Flatten())


model_CNN.add(tf.keras.layers.Dropout(0.5))

# Adding a dense layer for classification of emotions
model_CNN.add(tf.keras.layers.Dense(num_classes, activation='softmax'))

output_at_maxpool = tf.keras.models.Model(inputs=model_CNN.input, outputs=model_CNN.layers[4].output)
output_at_maxpool.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_CNN.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model_CNN.summary()

Model: "sequential_12"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d_18 (Conv1D)          (None, 384, 10)           30        
                                                                 
 conv1d_19 (Conv1D)          (None, 191, 10)           310       
                                                                 
 conv1d_20 (Conv1D)          (None, 94, 10)            410       
                                                                 
 max_pooling1d_6 (MaxPoolin  (None, 47, 10)            0         
 g1D)                                                            
                                                                 
 flatten_6 (Flatten)         (None, 470)               0         
                                                                 
 dropout_6 (Dropout)         (None, 470)               0         
                                                     

In [674]:
#Training the model
model_CNN.fit(utterances, emotions_hot, batch_size=batch_size, epochs= 10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x27345f70a10>

In [688]:
#Output dimension after the maxpool layer
output_maxpool_dim = ((embedding_dim - 2)/num_stride) + 1
output_maxpool_dim = ((output_maxpool_dim - 3)/num_stride) + 1
output_maxpool_dim = ((output_maxpool_dim - 4)/num_stride) + 1
output_maxpool_dim = ((output_maxpool_dim - 1) //2) + 1

In [676]:
CNN_out = []
for convo in conversation_list:
  convo_embed = []
  for string in convo:
    string_embed = get_embeddings(string)
    convo_embed.append(string_embed)
  text = np.array(convo_embed).reshape((-1, embedding_dim, 1))
  temp = output_at_maxpool.predict(text)
  CNN_out.append(temp)



In [677]:
CNN_out = sequence.pad_sequences(CNN_out, maxlen=max_seq_length, value=0.0, padding="post",dtype= "float32")


In [682]:
CNN_out.shape

(498, 35, 470)

Now, to train the utterance-level Bi-LSTM model we will make use of the emotions data in the dataset. This is like training an auto-encoder.

In [678]:
#Parameters for the Bi-LSTM models
hidden_layer_nodes = 200

In [689]:
# Constructing the Bi-LSTM model
input_layer = layers.Input(shape=(None, (int(output_maxpool_dim*num_filters))))

# Masking layer will effectively ignore any added padding
masking_layer = layers.Masking(mask_value= [0]*(int(output_maxpool_dim*num_filters)))(input_layer)

# Bidirectional LSTM layer with 200 units
bi_lstm = layers.Bidirectional(layers.LSTM(hidden_layer_nodes, activation='relu', return_sequences=True, return_state=True))
bi_output_u, forward_state_h_u, forward_state_c_u, backward_state_h_u, backward_state_c_u = bi_lstm(masking_layer)

# Concatenating forward and backward LSTM outputs
concatenated_output = layers.Concatenate(name = 'concatenate')([bi_output_u[:, :, :200], bi_output_u[:, :, 200:]])

repeated_vector = layers.Dense(hidden_layer_nodes, activation='relu')(concatenated_output)

# LSTM layer with 200 units
lstm_output = layers.LSTM(hidden_layer_nodes, activation='relu', return_sequences=True)(repeated_vector)

output_layer = layers.Dense(num_classes, activation = "softmax")(lstm_output)


# Creating and compiling the Bi-LSTM model with input and output layers
model_LSTM = tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
model_LSTM.compile(optimizer="adam", loss='mse')


model_LSTM.summary()


Model: "model_3982"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_52 (InputLayer)       [(None, None, 470)]          0         []                            
                                                                                                  
 masking_54 (Masking)        (None, None, 470)            0         ['input_52[0][0]']            
                                                                                                  
 bidirectional_47 (Bidirect  [(None, None, 400),          1073600   ['masking_54[0][0]']          
 ional)                       (None, 200),                                                        
                              (None, 200),                                                        
                              (None, 200),                                               

In [690]:
model_LSTM.fit(CNN_out, emotion_list, epochs= 50, batch_size=batch_size)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.src.callbacks.History at 0x275015ab090>

In [691]:
model_LSTM.save("model_LSTM")

INFO:tensorflow:Assets written to: model_LSTM\assets


INFO:tensorflow:Assets written to: model_LSTM\assets


# Speaker-level encoding

Now, we will pass the output of the CNN to a Bi-LSTM which will capture the contextual information of the same speaker in the conversation.

We will obtain the output for a conversation as
$C_{s} = [c_{1}, c_{2},c_{3}....c_{K}]$

where $K$ is the number of utterances in the conversation and $c_{j}$ is the vector of speaker-level features of the $j$-th utterance

Similar to the utterance-level Bi-LSTM model, we will train the speaker-level Bi-LSTM model like an auto-encoder from the dataset.

In [692]:
CNN_out_speaker = []
for convo in conversation_speaker_utts:
  convo_embed = []
  for speaker in convo:
    for string in speaker:
      string_embed = get_embeddings(string)
      convo_embed.append(string_embed)
    text = np.array(convo_embed).reshape((-1, embedding_dim, 1))
    temp = output_at_maxpool.predict(text)
    CNN_out_speaker.append(temp)



In [693]:
CNN_out_speaker = sequence.pad_sequences(CNN_out_speaker, maxlen=max_length_speaker, value=0.0, padding="post",dtype= "float32")

In [696]:
len(CNN_out_speaker)

1342

In [697]:
CNN_out_speaker =CNN_out_speaker.reshape((len(CNN_out_speaker),max_length_speaker,int(output_maxpool_dim*num_filters)))

In [699]:
# Constructing the Bi-LSTM model
input_layer_s = layers.Input(shape=(None, int(output_maxpool_dim*num_filters)))

# Masking layer will effectively ignore the added padding
masking_layer_s = layers.Masking(mask_value= [0]*(int(output_maxpool_dim*num_filters)))(input_layer_s)
# Bidirectional LSTM layer with 200 units
bi_lstm_s = layers.Bidirectional(layers.LSTM(hidden_layer_nodes, activation='relu', return_sequences=True, return_state=True))
bi_output_s, forward_state_h_s, forward_state_c_s, backward_state_h_s, backward_state_c_s = bi_lstm_s(masking_layer_s)

# Concatenating forward and backward LSTM outputs
concatenated_output_s = layers.Concatenate(name = 'concatenate_1')([bi_output_s[:, :, :200], bi_output_s[:, :, 200:]])

repeated_vector_s = layers.Dense(hidden_layer_nodes, activation='relu')(concatenated_output_s)

# LSTM layer with 200 units
lstm_output_s = layers.LSTM(hidden_layer_nodes, activation='relu', return_sequences=True)(repeated_vector_s)

output_layer_s = layers.Dense(num_classes, activation = "softmax")(lstm_output_s)

# Creating and compiling the Bi-LSTM model
model_LSTM_speaker = tf.keras.models.Model(inputs=input_layer_s, outputs=output_layer_s)
model_LSTM_speaker.compile(optimizer="adam", loss='mse')

model_LSTM_speaker.summary()


Model: "model_3983"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_54 (InputLayer)       [(None, None, 470)]          0         []                            
                                                                                                  
 masking_55 (Masking)        (None, None, 470)            0         ['input_54[0][0]']            
                                                                                                  
 bidirectional_48 (Bidirect  [(None, None, 400),          1073600   ['masking_55[0][0]']          
 ional)                       (None, 200),                                                        
                              (None, 200),                                                        
                              (None, 200),                                               

In [700]:
model_LSTM_speaker.fit(CNN_out_speaker, emotion_speaker, epochs= 100, batch_size=batch_size)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.src.callbacks.History at 0x27416118090>

# Emotion classifier


Now, we will use these features from both of the Bi-LSTMs as inputs to a Broad Learning System model.

This model will effectively integrate the utterance-level and speaker-level contextual information to predict the emotion.

Thus, we will have a prediction for each utterance by considering its contextual information

Here, for each utterance in a conversation from the training dataset, we will extract the features from the Bi-LSTM models. These features will be used to "train" the Broad learning model.

The utterance-level and speaker-level features will be horizontally stacked in the below code-block.

It will give us

$C = [C_{u}, C_{s}]$


$C = [[c_{u1},c_{s1}],[c_{u2},c_{s2}]....[c_{uK},c_{sK}]]$

In [701]:
#Pre-processing the dataset to train the BL model from the outputs of previously trained data

Data_X = []

concatenated_output_model_utt = tf.keras.models.Model(inputs=model_LSTM.input, outputs=model_LSTM.get_layer('concatenate').output)
concatenated_output_model_speaker = tf.keras.models.Model(inputs=model_LSTM_speaker.input, outputs=model_LSTM_speaker.get_layer('concatenate_1').output)

# Iterating through the conversations
for value in df_full["Dialogue_ID"].unique():
    convo_concat_out = []
    
    # Extracting data for the current utterance in the conversation
    conversation_data = df_full[df_full["Dialogue_ID"] == value]
    convo_embed = []
    for string in conversation_data["Utterance"].tolist():
        string_embed = get_embeddings(string)
        convo_embed.append(string_embed)
    text = np.array(convo_embed).reshape((-1, embedding_dim, 1))
    
    # Getting predictions from the model for the utterance
    prediction = output_at_maxpool.predict(text)
    prediction = prediction.reshape(1, prediction.shape[0], prediction.shape[1])
    
    # Getting concatenated output for the utterance model
    concatenated_utterance_result = concatenated_output_model_utt.predict(prediction)[0]
    
    speaker_order = []
    speaker_utts_out = []
    conco_temp = []
    
    # Iterating through unique speakers in the conversation
    for speaker in conversation_data["Speaker"].unique():
        speaker_utterance = conversation_data[conversation_data["Speaker"] == speaker]
        
        # Collecting the order of utterances and embeddings for each speaker 
        speaker_order = speaker_order + (speaker_utterance["Utterance"].tolist())
        temp_convo = []
        for string in speaker_utterance["Utterance"]:
            string_embed = get_embeddings(string)
            temp_convo.append(string_embed)
        
        text = np.array(temp_convo).reshape((-1, embedding_dim, 1))
        
        # Getting predictions from the model for the speaker utterance
        prediction = output_at_maxpool.predict(text)
        prediction = prediction.reshape(1, prediction.shape[0], prediction.shape[1])
        
        # Getting concatenated output for the speaker model
        concatenated_speaker_result = concatenated_output_model_speaker.predict(prediction)[0]
        speaker_utts_out.append(concatenated_speaker_result)
    
    # Concatenating the speaker outputs
    conco_temp = np.concatenate(speaker_utts_out, axis=0)
    
    conversation_order = conversation_data["Utterance"].tolist()
    
    # Iterating through the order of utterances in the conversation and convatinating the features from both the models
    for utterance_index, utterance in enumerate(conversation_order):
        for speak_utterance_index, speak_utterance in enumerate(speaker_order):
            if utterance == speak_utterance:
                concat_temp = np.hstack((concatenated_utterance_result[utterance_index], conco_temp[speak_utterance_index]))
                convo_concat_out.append(concat_temp)
                break
    
    Data_X.append(convo_concat_out)





Here, we will pad all the conversations to be of the same length equal to the *max_seq_length* variable

In [702]:
X_LSTM = []
for list in Data_X:
  zero_lists = [[0] * Data_X[0][0].shape[0] for _ in range(max_seq_length-len(list))]
  padded_list = list + zero_lists
  X_LSTM.append(padded_list)

In [703]:
X_arr = []
for array in X_LSTM:
  temp_array = np.array(array)
  X_arr.append(temp_array)

# Broad learning model
The concatenated features will be mapped to m groups of enhancement nodes.
The j-th group of enhancement nodes will be



$E_{j} = \xi ([C_{u},C_{s}]W_{ej} + \beta _{ej})$



where $\xi$ is an activation function. $W_{ej}$ and $\beta_{ej}$ are randomly generated denoting the weight matrix and bias matrix respectively

$E = [E_{1},E_{2},E_{3}...E_{m}]$

$Y = [C_{u},C_{s}, E]W_{BL}$

$A = [C_{u},C_{s}, E] = [[c_{u1},c_{s1}, E],[c_{u2},c_{s2}, E]....[c_{uK},c_{sK}, E]]$

$Y$ is the output of BL

During training, $W_{BL}$ is found using ridge regression



Parameters for the Broad Learning System

In [704]:
num_windows = 10
nodes_per = 50
C = 0.001

tanh will be used as the activation function and ridge regression is used to find \W_{BL}

In [705]:
def tanh(x):
  return np.tanh(x)


def ridgeRegression(x, y):
  return (np.linalg.inv(x.T.dot(x)+C*np.eye(x.shape[1])).dot(x.T).dot(y))

In [706]:
#Generating random W_{ej}
W_enhanced = []
for j in range(num_windows):
  W_enhanced.append(2*np.random.rand(800, nodes_per)-1)

In [707]:
#Appending E to the concatenation of utterance and speaker level features
A_list = []
for convo in X_arr:
  A = convo
  for j in range(num_windows):
    E_j = tanh(convo.dot(W_enhanced[j]))
    A = np.hstack((A, E_j))
  A_list.append(A)

In [708]:
A = np.array(A_list)

In [709]:
#Reshaping for training model
X = np.vstack(A_list)
Y = np.vstack(emotion_list_full)

In [710]:
#Obtaining W_BL
beta = ridgeRegression(X, Y)
W_BL = beta.reshape((A_list[0].shape[1], emotion_list_full[0].shape[1]))

In [711]:
#List of outputs from the BL
Y_out = []
for array in A_list:
  Y = array.dot(W_BL)
  Y_out.append(Y)

In [712]:
Y_out = np.array(Y_out)

#  Prediction
The output of BL will be put into a softmax prediction model which will give the final predictions after integrating speaker and utterance level features with the help of BL

In [713]:
emotion_train = np.array(emotion_list_full)

In [714]:
model_prediction = models.Sequential()
model_prediction.add(layers.Masking(mask_value=[0] * num_classes, input_shape=(Y_out.shape[1],Y_out.shape[2])))
model_prediction.add(layers.Dense(256, activation='relu'))
model_prediction.add(layers.Dense(128, activation='relu'))
model_prediction.add(layers.Dense(num_classes, activation='softmax'))
model_prediction.compile(optimizer="adam", loss='categorical_crossentropy', metrics=['accuracy'])

In [715]:
model_prediction.fit(Y_out, emotion_train, epochs=50, batch_size=batch_size)

Epoch 1/50


Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.src.callbacks.History at 0x2732c5a30d0>

# Testing the new model called DBL (Deep Broad Learning)

In [716]:
df_test= pd.read_csv(r"MELD.Raw\MELD.Raw\test_sent_emo.csv")

In [717]:
class_mapping = {'anger': 0, 'disgust': 1, 'fear': 2, 'joy': 3, 'neutral': 4, 'sadness': 5, 'surprise': 6}
df_test['Emotion_int'] = df_test['Emotion'].map(class_mapping)

Checking the accuracy of just the CNN model which will only look at the utterance to predict the emotion

In [718]:
utt_list_test= []
for utterance in df_test['Utterance']:
  embedding = get_embeddings(utterance)
  utt_list_test.append(embedding.tolist())

In [719]:
ground_truth = df_test["Emotion_int"].tolist()

In [720]:
utterances_test = np.array(utt_list_test).reshape((-1, embedding_dim, 1))

In [721]:
prediction_test_CNN = model_CNN.predict(utterances_test)



In [722]:
pred_class = []
for array in prediction_test_CNN:
    pred_class.append(np.argmax(array))

In [723]:
f1_score(ground_truth, pred_class, average= "weighted")

0.54073647725169

Checking the accuracy of DBL which integrates contextual information from the conversation from Bi-LSTMs and Broad Learning system

In [724]:
Data_X_test = []

for value in df_test["Dialogue_ID"].unique():
  convo_concat_out = []
  conversation_data = df_test[df_test["Dialogue_ID"] == value]
  convo_embed = []
  for string in conversation_data["Utterance"].tolist():
    string_embed = get_embeddings(string)
    convo_embed.append(string_embed)
  text = np.array(convo_embed).reshape((-1, embedding_dim, 1))
  prediction = output_at_maxpool.predict(text)
  prediction = prediction.reshape(1, prediction.shape[0], prediction.shape[1])
  concatenated_output_model = tf.keras.models.Model(inputs=model_LSTM.input, outputs=model_LSTM.get_layer('concatenate').output)
  concatenated_utterance_result = concatenated_output_model.predict(prediction)[0]
  speaker_order = []
  speaker_utts_out = []
  conco_temp = []
  for speaker in conversation_data["Speaker"].unique():
    speaker_utterance = conversation_data[conversation_data["Speaker"] == speaker]
    speaker_order = speaker_order + (speaker_utterance["Utterance"].tolist())
    temp_convo = []
    for string in speaker_utterance["Utterance"]:
      string_embed = get_embeddings(string)
      temp_convo.append(string_embed)
    text = np.array(temp_convo).reshape((-1, embedding_dim, 1))
    prediction = output_at_maxpool.predict(text)
    prediction = prediction.reshape(1, prediction.shape[0], prediction.shape[1])
    concatenated_output_model = tf.keras.models.Model(inputs=model_LSTM_speaker.input, outputs=model_LSTM_speaker.get_layer('concatenate_1').output)
    concatenated_speaker_result = concatenated_output_model.predict(prediction)[0]
    speaker_utts_out.append(concatenated_speaker_result)
  conco_temp = np.concatenate(speaker_utts_out, axis=0)
  conversation_order = conversation_data["Utterance"].tolist()
  for utterance_index, utterance in enumerate(conversation_order):
    for speak_utterance_index, speak_utterance in enumerate(speaker_order):
        if utterance == speak_utterance:
            concat_temp = np.hstack((concatenated_utterance_result[utterance_index], conco_temp[speak_utterance_index]))
            convo_concat_out.append(concat_temp)
            break
  Data_X_test.append(convo_concat_out)

(3, 400)
(8, 400)
(11, 400)
(7, 400)
(9, 400)
(9, 400)
(3, 400)
(9, 400)
(7, 400)
(18, 400)
(4, 400)
(5, 400)
(20, 400)
(8, 400)
(11, 400)
(6, 400)
(10, 400)
(33, 400)
(3, 400)
(1, 400)
(8, 400)
(7, 400)
(22, 400)
(3, 400)
(4, 400)
(17, 400)
(1, 400)
(2, 400)
(17, 400)
(13, 400)
(6, 400)
(1, 400)
(13, 400)
(11, 400)
(19, 400)
(5, 400)
(2, 400)
(8, 400)
(5, 400)
(12, 400)
(2, 400)
(13, 400)
(9, 400)
(7, 400)
(4, 400)
(9, 400)
(9, 400)
(14, 400)
(7, 400)
(17, 400)
(4, 400)
(3, 400)
(12, 400)
(13, 400)
(12, 400)
(3, 400)
(12, 400)
(6, 400)
(17, 400)
(3, 400)
(7, 400)
(3, 400)
(2, 400)
(2, 400)
(17, 400)
(11, 400)
(5, 400)
(4, 400)
(15, 400)
(6, 400)
(23, 400)
(14, 400)
(17, 400)
(11, 400)
(5, 400)
(6, 400)
(19, 400)
(5, 400)
(10, 400)
(2, 400)
(10, 400)
(2, 400)
(1, 400)
(11, 400)
(3, 400)
(6, 400)
(8, 400)
(2, 400)
(14, 400)
(3, 400)
(6, 400)
(1, 400)
(6, 400)
(13, 400)
(3, 400)
(20, 400)
(21, 400)
(12, 400)
(12, 400)
(10, 400)
(20, 400)
(4, 400)
(10, 400)
(6, 400)
(3, 400)
(3, 400)
(4, 

In [725]:
X_LSTM_test = []
convo_len = []
for list in Data_X_test:
  zero_lists = [[0] * Data_X_test[0][0].shape[0] for _ in range(max_seq_length-len(list))]
  convo_len.append(len(list))
  padded_list = list + zero_lists
  X_LSTM_test.append(padded_list)

In [726]:
X_arr_test = []
for array in X_LSTM_test:
  temp_array = np.array(array)
  X_arr_test.append(temp_array)

In [727]:
A_list_test = []
for convo in X_arr_test:
  A = convo
  for j in range(num_windows):
    E_j = tanh(convo.dot(W_enhanced[j]))
    A = np.hstack((A, E_j))
  A_list_test.append(A)

In [728]:
Y_pred_test = []
for convo in A_list_test:
  Y_pred = convo.dot(W_BL)
  Y_pred_test.append(Y_pred)

In [729]:
input_sequence = np.expand_dims(Y_pred_test[1], axis=0)

In [730]:
final_out = []
for seq in Y_pred_test:
  input_sequence = np.expand_dims(seq, axis=0)
  prediction = model_prediction.predict(input_sequence)
  final_out.append(prediction[0])



In [731]:
final_emo = []
for index, convo in enumerate(final_out):
  temp = []
  for i in range(convo_len[index]):
    Y = np.argmax(convo[i])
    temp.append(Y)
  final_emo.append(temp)

In [732]:
accuracy_test = []
for list in final_emo:
  accuracy_test = accuracy_test + list

In [733]:
f1_score(ground_truth, accuracy_test, average = 'weighted')

0.563886617136929

Thus, we achieve an increase in accuracy with the integration of contextual features. This increase can be made better with higher parameters

# Notes

The code above has been adjusted for lower computing resources and can be improved a lot with higher resource availability. Only a part of the dataset could be used to train the Bi-LSTM models as an array of size (1038, 35, 381, 100) would be too large for the resources at hand. I have compromised for this with a larger number of epochs. The suggested parameters have been mentioned throughout the code to improve accuracy. 

The cited paper shows an accuracy of around 64% with the above model


