### The goal of this notebook is to explore a convolutional neural network (CNN) for multi-label emotion classification.

###### Why is CNN considered?

In [1]:
# mount drive to google colab
from google.colab import drive
drive.mount('/content/drive')

# mount specific file path to notebook
%cd /content/drive/Othercomputers/My_laptop/02_shiningstars_work/01_dataset

Mounted at /content/drive
/content/drive/Othercomputers/My_laptop/02_shiningstars_work/01_dataset


In [10]:
# import necessary packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Dense, Flatten, Dropout
from tensorflow.keras.utils import to_categorical

In [18]:
# load the training data; make sure you input your own file path
train_data = pd.read_csv("/content/drive/Othercomputers/My_laptop/02_shiningstars_work/01_dataset/01_train/01_eng.csv")
train_data.head()

Unnamed: 0,id,text,Anger,Fear,Joy,Sadness,Surprise
0,eng_train_track_a_00001,But not very happy.,0,0,1,1,0
1,eng_train_track_a_00002,Well she's not gon na last the whole song like...,0,0,1,0,0
2,eng_train_track_a_00003,She sat at her Papa's recliner sofa only to mo...,0,0,0,0,0
3,eng_train_track_a_00004,"Yes, the Oklahoma city bombing.",1,1,0,1,1
4,eng_train_track_a_00005,They were dancing to Bolero.,0,0,1,0,0


In [7]:
# combine target labels into a multi-label array
labels = train_data[['Anger', 'Fear', 'Joy', 'Sadness', 'Surprise']]

##### Steps in data processing
- Tokenize the text column
- Convert to sequences
- Pad the sequences (to have the same length)
- Split data
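
The first three steps above can be sketched without Keras to show what each one produces. This is a minimal pure-Python illustration, not the Keras implementation: the helper names (`build_word_index`, `texts_to_ids`, `pad_left`) and the toy sentences are hypothetical, and tokenization is a plain lowercase whitespace split. It mirrors two Keras defaults: word indices start at 1 in order of frequency (0 is reserved for padding), and `pad_sequences` pads and truncates at the front.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_left(sequences, maxlen):
    """Left-pad with 0 and keep only the last `maxlen` ids of longer sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

# Toy sentences, for illustration only
texts = ["they were dancing", "they were not very happy"]
word_index = build_word_index(texts)
sequences = texts_to_ids(texts, word_index)
padded = pad_left(sequences, maxlen=4)

print(word_index)
print(sequences)
print(padded)
```

Every row of `padded` has the same length, which is what the embedding and convolution layers need as input.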


In [8]:
# Text Preprocessing
MAX_NUM_WORDS = 10000  # Vocabulary size
MAX_SEQUENCE_LENGTH = 50  # Maximum length of input sequences
EMBEDDING_DIM = 100  # Dimension of the embedding vector

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train_data['text'])

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(train_data['text'])
word_index = tokenizer.word_index
print(f'Found {len(word_index)} unique tokens.')

Found 5695 unique tokens.


In [11]:
# Pad sequences to the same length
X = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

# Train-validation split (20% of the training data is held out for validation)
X_train, X_val, y_train, y_val = train_test_split(X, labels, test_size=0.2, random_state=42)

> Sequential Model: The model is built using Keras' Sequential() API, which allows layers to be stacked sequentially.

> Embedding Layer:

- Input dimension (input_dim=MAX_NUM_WORDS): The size of the vocabulary.
- Output dimension (output_dim=EMBEDDING_DIM): The size of the embedding vector for each word.
- Maps words to dense vectors, providing a compact representation of the input text.

> Convolutional Layer:

- Conv1D with 128 filters, a kernel size of 5, and ReLU activation function.
- Extracts local features from the text sequences.

> MaxPooling Layer:

- MaxPooling1D with a pool size of 4.
- Reduces the dimensionality of the feature map and keeps the most important features.

> Flatten Layer:

- Flattens the output of the previous layer into a one-dimensional vector for the fully connected layers.
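
As a rough check of the shapes through the layers described so far, the arithmetic can be worked out by hand. The sketch below assumes the Keras defaults for `Conv1D` and `MaxPooling1D` (`padding='valid'`, convolution stride 1, pooling stride equal to the pool size) and the values `MAX_NUM_WORDS = 10000`, `MAX_SEQUENCE_LENGTH = 50` and `EMBEDDING_DIM = 100` used in this notebook. The helper function names are hypothetical.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def maxpool1d_output_length(length, pool_size):
    """Output length of 1D max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1

vocab_size, seq_len, embedding_dim = 10000, 50, 100
filters, kernel_size, pool_size = 128, 5, 4

conv_len = conv1d_output_length(seq_len, kernel_size)    # time steps after Conv1D
pool_len = maxpool1d_output_length(conv_len, pool_size)  # time steps after pooling
flat_size = pool_len * filters                           # vector length after Flatten

# Trainable parameters: one vector per vocabulary entry, then
# kernel_size * input_channels weights plus one bias per filter.
embedding_params = vocab_size * embedding_dim
conv_params = kernel_size * embedding_dim * filters + filters

print(conv_len, pool_len, flat_size)
print(embedding_params, conv_params)
```

Under these assumptions the flattened vector has 1408 entries, which is the input size seen by the first dense layer.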

> Fully Connected (Dense) Layers:

- First dense layer with 128 units and ReLU activation.
- Dropout layer with 50% dropout rate to prevent overfitting.
- Second dense layer (output layer) with 5 units and sigmoid activation for multi-label classification.

> Compilation:

- Loss function: binary_crossentropy (since it’s a multi-label classification problem).
- Optimizer: adam (adaptive learning rate optimization algorithm).
- Metrics: accuracy (reported during training; with multi-label targets it is only a rough indicator, so per-label metrics such as F1 are worth checking too).
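
To make the pairing of sigmoid outputs with `binary_crossentropy` concrete, here is a minimal pure-Python sketch of the loss for a single example. Each of the five labels is treated as an independent yes/no decision, and the loss is the mean of the five per-label terms. The logits, the target vector and the helper names are made up for illustration; Keras' own implementation additionally clips the probabilities for numerical stability.

```python
import math

def sigmoid(x):
    """Squash a real-valued score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_crossentropy(y_true, y_prob):
    """Mean of the per-label binary cross-entropy terms."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(y_true, y_prob)]
    return sum(terms) / len(terms)

# Hypothetical raw scores for Anger, Fear, Joy, Sadness, Surprise
logits = [2.0, -1.0, 0.0, 3.0, -2.0]
y_true = [1, 0, 0, 1, 0]

y_prob = [sigmoid(z) for z in logits]
loss = binary_crossentropy(y_true, y_prob)

print([round(p, 3) for p in y_prob])
print(round(loss, 4))
```

Unlike softmax, the sigmoid probabilities are not forced to sum to 1, so several emotions can be predicted for the same text.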

In [15]:
# Model Building
model = Sequential()

# Embedding layer
model.add(Embedding(input_dim=MAX_NUM_WORDS, output_dim=EMBEDDING_DIM))
# input_length=MAX_SEQUENCE_LENGTH is left unset; the sequence length is taken from the input data

# Convolutional Layer
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))

# Flatten and Fully Connected Layers
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(5, activation='sigmoid'))  # Multi-label classification, so use sigmoid

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [16]:
# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))

Epoch 1/10
70/70 ━━━━━━━━━━━━━━━━━━━━ 5s 35ms/step - accuracy: 0.4737 - loss: 0.6049 - val_accuracy: 0.4874 - val_loss: 0.5654
Epoch 2/10
70/70 ━━━━━━━━━━━━━━━━━━━━ 5s 41ms/step - accuracy: 0.4729 - loss: 0.5635 - val_accuracy: 0.4874 - val_loss: 0.5580
Epoch 3/10
70/70 ━━━━━━━━━━━━━━━━━━━━ 4s 53ms/step - accuracy: 0.4933 - loss: 0.5324 - val_accuracy: 0.4621 - val_loss: 0.5553
Epoch 4/10
70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 31ms/step - accuracy: 0.5232 - loss: 0.4615 - val_accuracy: 0.4278 - val_loss: 0.5563
Epoch 5/10
70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 34ms/step - accuracy: 0.6131 - loss: 0.3404 - val_accuracy: 0.4116 - val_loss: 0.6069
Epoch 6/10
70/70 ━━━━━━━━━━━━━━━━━━━━ 2s 30ms/step - accuracy: 0.6489 - loss: 0.2576 - val_accuracy: 0.3953 - val_loss: 0.6804
Epoch 7/10
70/70 ━━━━

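In the epochs shown above, the validation loss reaches its minimum at epoch 3 (0.5553) and rises afterwards while the training loss keeps falling, which is the usual sign of overfitting. One common response is early stopping, for which Keras provides an `EarlyStopping` callback. The sketch below only illustrates the patience logic in plain Python on the six `val_loss` values visible in the log; the function name `early_stopping_epoch` and the patience of 2 are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss values copied from epochs 1 to 6 of the training log
val_losses = [0.5654, 0.5580, 0.5553, 0.5563, 0.6069, 0.6804]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)

print(f"best epoch: {best_epoch}, training would stop after epoch {stop_epoch}")
```

With these numbers, the weights from epoch 3 would be the ones to keep, rather than those from the final epoch.
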
In [17]:
# Evaluate the model
loss, accuracy = model.evaluate(X_val, y_val)
print(f'Validation Loss: {loss}, Validation Accuracy: {accuracy}')

18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.3528 - loss: 0.9188
Validation Loss: 0.9542540311813354, Validation Accuracy: 0.3357400596141815
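
Because each example can carry several labels at once, a single accuracy number says little about how well individual emotions are predicted. A per-label F1 score, averaged over the labels (macro-F1), is a common complement. The following is a minimal pure-Python sketch on made-up label matrices with three labels; `f1_per_label` and `macro_f1` are hypothetical helpers, and in practice `sklearn.metrics.f1_score(y_true, y_pred, average='macro')` computes the same quantity.

```python
def f1_per_label(y_true, y_pred):
    """F1 score for each label column of two 0/1 matrices (lists of rows)."""
    scores = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # A label with no true and no predicted positives scores 0.0
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    scores = f1_per_label(y_true, y_pred)
    return sum(scores) / len(scores)

# Made-up targets and predictions, for illustration only
y_true_demo = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred_demo = [[1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

per_label = f1_per_label(y_true_demo, y_pred_demo)
macro = macro_f1(y_true_demo, y_pred_demo)

print([round(s, 3) for s in per_label])
print(round(macro, 3))
```

The same idea applies to `y_val` and the thresholded validation predictions of the model.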


In [20]:
# load in dev data
dev_data = pd.read_csv("/content/drive/Othercomputers/My_laptop/02_shiningstars_work/01_dataset/02_dev/01_eng_a.csv")
dev_data.head()

Unnamed: 0,id,text,Anger,Fear,Joy,Sadness,Surprise
0,eng_dev_track_a_00001,"My mouth fell open `` No, no, no... I..",,,,,
1,eng_dev_track_a_00002,You can barely make out your daughter's pale f...,,,,,
2,eng_dev_track_a_00003,But after blinking my eyes for a few times lep...,,,,,
3,eng_dev_track_a_00004,Slowly rising to my feet I came to the conclus...,,,,,
4,eng_dev_track_a_00005,I noticed this months after moving in and doin...,,,,,


In [21]:
dev_data.shape[0]

116

In [22]:
# Preprocess the dev data
dev_sequences = tokenizer.texts_to_sequences(dev_data['text'])
dev_X = pad_sequences(dev_sequences, maxlen=MAX_SEQUENCE_LENGTH)

# Make predictions using the trained model
dev_predictions = model.predict(dev_X)

# Convert predictions to binary (0 or 1) using a threshold
threshold = 0.5
dev_predictions_binary = (dev_predictions > threshold).astype(int)

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step

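The thresholding step turns each sigmoid probability into an independent 0/1 decision, so a row may end up with several labels or with none. A small pure-Python sketch on made-up probabilities (not actual model output) shows the effect, including how lowering the threshold adds labels. The helpers `binarize` and `label_names` are hypothetical.

```python
LABELS = ['Anger', 'Fear', 'Joy', 'Sadness', 'Surprise']

def binarize(probabilities, threshold=0.5):
    """Turn each probability into 1 if it exceeds the threshold, else 0."""
    return [[int(p > threshold) for p in row] for row in probabilities]

def label_names(binary_row):
    """List the label names that are switched on in one row."""
    return [name for name, flag in zip(LABELS, binary_row) if flag]

# Made-up probabilities for two texts, for illustration only
probs = [
    [0.10, 0.70, 0.05, 0.60, 0.20],
    [0.30, 0.40, 0.45, 0.20, 0.10],
]

at_050 = binarize(probs, threshold=0.5)
at_035 = binarize(probs, threshold=0.35)

print([label_names(row) for row in at_050])
print([label_names(row) for row in at_035])
```

A threshold of 0.5 is a reasonable default; tuning it on the validation split, possibly per label, is one option when many rows come out with no label at all.
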

In [26]:
# Create a DataFrame for the predictions
predicted_labels = pd.DataFrame(dev_predictions_binary, columns=['Anger', 'Fear', 'Joy', 'Sadness', 'Surprise'])

# Combine the dev text with the predicted labels
dev_data_with_predictions = pd.concat([dev_data[['id','text']], predicted_labels], axis=1)

In [27]:
dev_data_with_predictions.head()

Unnamed: 0,id,text,Anger,Fear,Joy,Sadness,Surprise
0,eng_dev_track_a_00001,"My mouth fell open `` No, no, no... I..",0,0,0,0,0
1,eng_dev_track_a_00002,You can barely make out your daughter's pale f...,0,1,0,1,0
2,eng_dev_track_a_00003,But after blinking my eyes for a few times lep...,0,0,0,0,0
3,eng_dev_track_a_00004,Slowly rising to my feet I came to the conclus...,0,1,0,1,0
4,eng_dev_track_a_00005,I noticed this months after moving in and doin...,0,1,0,0,1


In [29]:
# index=False keeps the DataFrame row index out of the saved file
dev_data_with_predictions.to_csv('predictions_first_version.csv', index=False)