# <center>NLP Course <br><small>Graded Project Instructions <br>Spring 2023</small></center>

About the dataset: 

A list of tweet texts, each labelled with one of six emotions: joy, sadness, anger, fear, love or surprise. 
The dataset is split into train, test and validation sets for building the machine learning model. At first, you are 
given only the train and test sets. The validation set will be given at the end of the project so that you can check 
the final performance of your algorithm (to make sure you have not overfitted to the test data). 

You can work on this project in groups of one, two or three students. This exercise is mandatory; not 
handing it in is equivalent to getting the lowest grade. 

Goal: 

• Train different kinds of models able to classify each text according to the sentiment mainly present 
in it 

• Compare the results of your different models and try to analyze and explain the differences

Train different classification models relying mainly on:

1. A Fully Connected Neural Network (see Course 2): 5 points 

2. A Recurrent Neural Network, based on LSTM or GRU (see Course 3): 5 points 

3. A fine-tuned Transformer architecture, starting from a pretrained model such as those found on sites 
like HuggingFace (see Course 4): 5 points 

4. Compare the different models to find the best approach and try to replicate it on a “real life” 
text classification task (this new “real life” dataset will be given to you soon): 5 points

# Loading and Preprocessing the data sets

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_train = pd.read_csv('./train.txt', header=None, delimiter=';')
df_test = pd.read_csv('./test.txt', header=None, delimiter=';')
df_train = df_train.rename(columns={0: 'tweet', 1: 'sentiment'})
df_test = df_test.rename(columns={0: 'tweet', 1: 'sentiment'})
df_train.head()

| | tweet | sentiment |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


In [3]:
df_train['sentiment'].value_counts(), df_train.shape

(joy         5362
 sadness     4666
 anger       2159
 fear        1937
 love        1304
 surprise     572
 Name: sentiment, dtype: int64,
 (16000, 2))

In [4]:
df_test['sentiment'].value_counts(), df_test.shape

(joy         695
 sadness     581
 anger       275
 fear        224
 love        159
 surprise     66
 Name: sentiment, dtype: int64,
 (2000, 2))
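
Both splits are imbalanced: joy and sadness together make up more than 60% of the training tweets, while surprise is under 4%. One common way to compensate, not used anywhere in this notebook, is to weight each class inversely to its frequency. The sketch below applies the `n_samples / (n_classes * count)` heuristic to the training counts printed above. The names `train_counts`, `class_weights` and `class_weight_by_index` are illustrative, and the index order assumes labels are numbered alphabetically, as `LabelEncoder` does.

```python
# Training-set class counts, copied from the value_counts() output above.
train_counts = {
    "joy": 5362,
    "sadness": 4666,
    "anger": 2159,
    "fear": 1937,
    "love": 1304,
    "surprise": 572,
}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# "Balanced" heuristic: rare classes get weights above 1, frequent ones below 1.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

# Keras expects integer class indices as keys; this assumes alphabetical numbering.
class_weight_by_index = {
    index: class_weights[label]
    for index, label in enumerate(sorted(train_counts))
}

for label in sorted(train_counts):
    print(f"{label:>8}: {class_weights[label]:.2f}")
```

A dictionary of this shape could be passed to `model.fit(..., class_weight=class_weight_by_index)`. Whether it helps on this dataset would have to be checked experimentally.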

# RNN, LSTM and GRU

In [6]:
import matplotlib.pyplot as plt

def plot_results(history):
    """Plot training and validation accuracy and loss per epoch from a Keras History."""
    # Select the four curves by name. history.history also contains the extra
    # metrics (precision, recall, AUC), so assigning four column names to the
    # whole frame would raise a length-mismatch error.
    hist_df = pd.DataFrame(history.history)[["loss", "accuracy", "val_loss", "val_accuracy"]]
    hist_df.index = np.arange(1, len(hist_df) + 1)  # number epochs from 1

    fig, axs = plt.subplots(nrows=2, sharex=True, figsize=(16, 10))
    axs[0].plot(hist_df.val_accuracy, lw=3, label='Validation Accuracy')
    axs[0].plot(hist_df.accuracy, lw=3, label='Training Accuracy')
    axs[0].set_ylabel('Accuracy')
    axs[0].set_xlabel('Epoch')
    axs[0].grid()
    axs[0].legend(loc=0)
    axs[1].plot(hist_df.val_loss, lw=3, label='Validation Loss')
    axs[1].plot(hist_df.loss, lw=3, label='Training Loss')
    axs[1].set_ylabel('Loss')
    axs[1].set_xlabel('Epoch')
    axs[1].grid()
    axs[1].legend(loc=0)

    plt.show()


In [55]:
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
# Import everything from tensorflow.keras rather than mixing it with the standalone keras package
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.metrics import Precision, Recall, AUC

In [39]:
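# Maximum sequence length, in tokens. The commented-out line below measures tweets in
# characters (the longest is 300), which overestimates the number of tokens needed.
# max_length = df_train['tweet'].map(lambda x: len(x)).sort_values().values[-1]
max_length = 100
trunc_type = 'post'
# oov_tok = "<OOV>"

# Build the vocabulary from the training tweets only
tokenizer = Tokenizer()  # (num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(df_train['tweet'])
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(df_train['tweet'])
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

# Encode the test set with the same tokenizer and the same padding/truncation settings
testing_sequences = tokenizer.texts_to_sequences(df_test['tweet'])
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, truncating=trunc_type)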
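
To make the padding step concrete: `pad_sequences` pads at the front of each sequence by default, and its `truncating` argument decides which end is cut when a sequence is longer than `maxlen`. The function below is a toy re-implementation for a single sequence, written only to illustrate that behaviour. `pad_or_truncate` and the example token ids are hypothetical and are not part of Keras.

```python
def pad_or_truncate(seq, maxlen, truncating="pre", value=0):
    """Toy single-sequence version of pad_sequences with front ('pre') padding."""
    if len(seq) > maxlen:
        # 'post' keeps the first maxlen tokens, 'pre' keeps the last maxlen tokens.
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    # Fill the front with the padding value until the sequence has maxlen items.
    return [value] * (maxlen - len(seq)) + list(seq)


short_tweet = [4, 8, 15]             # made-up token ids
long_tweet = [1, 2, 3, 4, 5, 6, 7]   # made-up token ids

print(pad_or_truncate(short_tweet, 5))                     # [0, 0, 4, 8, 15]
print(pad_or_truncate(long_tweet, 5, truncating="post"))   # [1, 2, 3, 4, 5]
print(pad_or_truncate(long_tweet, 5, truncating="pre"))    # [3, 4, 5, 6, 7]
```

The two truncation modes keep different tokens, so the train and test sets should be processed with the same setting.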

In [40]:
le = LabelEncoder()
y_train = le.fit_transform(df_train['sentiment'])
y_test = le.transform(df_test['sentiment'])
y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(y_test)

In [56]:
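vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
embedding_dim = 100  # size of each word vector; unrelated to max_length, which also happens to be 100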


model = tf.keras.Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_length),
    Bidirectional(LSTM(64)), #32 
    Dropout(0.4),
    Dense(32, activation='leaky_relu', kernel_regularizer='l1_l2'),
    Dropout(0.4),
    Dense(6, activation='softmax')
])


loss_function = 'categorical_crossentropy'
optimizer = 'adam'

model.compile(loss=loss_function, optimizer=optimizer, metrics=['accuracy', Precision(), Recall(), AUC()])

model.summary()  # prints the summary itself; wrapping it in print() would also print "None"

Model: "sequential_8"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_8 (Embedding)     (None, 100, 100)          1521300   
                                                                 
 bidirectional_8 (Bidirectio  (None, 128)              84480     
 nal)                                                            
                                                                 
 dropout_10 (Dropout)        (None, 128)               0         
                                                                 
 dense_16 (Dense)            (None, 32)                4128      
                                                                 
 dropout_11 (Dropout)        (None, 32)                0         
                                                                 
 dense_17 (Dense)            (None, 6)                 198       
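
The parameter counts in this summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below redoes the arithmetic. The vocabulary size of 15,213 is not printed anywhere in the notebook; it is inferred from the embedding layer's 1,521,300 parameters divided by the 100 embedding dimensions.

```python
vocab_size = 15213     # inferred: 1,521,300 embedding parameters / 100 dimensions
embedding_dim = 100
lstm_units = 64

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# Bidirectional: one LSTM per direction, outputs concatenated (2 * 64 = 128).
bilstm_params = 2 * lstm_params

# Dense layers: inputs * outputs weights plus one bias per output.
dense_32_params = (2 * lstm_units) * 32 + 32
dense_6_params = 32 * 6 + 6

total_params = embedding_params + bilstm_params + dense_32_params + dense_6_params

print(embedding_params, bilstm_params, dense_32_params, dense_6_params)
print(total_params)
```

About 94% of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.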
                                                      

In [57]:
predictors = np.array(padded) 
label = np.array(y_train_encoded)
epochs_value = 50
validation_split_value = 0.2
early_stopping = tf.keras.callbacks.EarlyStopping(patience=3)

history = model.fit(predictors, label, epochs=epochs_value, verbose=1, validation_split=validation_split_value, callbacks=[early_stopping])


Epoch 1/50 ... Epoch 21/50 (per-epoch progress output not preserved; training ended after epoch 21 of 50)
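
Training ended well before the 50 requested epochs, which is consistent with the `EarlyStopping(patience=3)` callback: by default it monitors `val_loss` and stops once that value has failed to improve for `patience` consecutive epochs. The function below is a simplified pure-Python illustration of that rule, not the Keras implementation. `early_stopping_epoch` and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) under a simple patience rule on val_loss."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical validation losses: the best value is reached at epoch 3.
hypothetical_val_losses = [0.90, 0.60, 0.50, 0.55, 0.52, 0.51, 0.40]
stop_epoch, best_epoch = early_stopping_epoch(hypothetical_val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In this toy run, training stops at epoch 6 although the best epoch was 3, so the model ends with the weights from the last epoch. Keras offers `restore_best_weights=True` on `EarlyStopping` to return to the best epoch instead. The notebook does not use it, so the reported results correspond to the final epoch's weights.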


In [58]:
plot_results(history)


KeyboardInterrupt



In [59]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

predictions = model.predict(testing_padded)
prediction_labels = predictions.argmax(axis=1)
print(classification_report(y_test, prediction_labels))
print(confusion_matrix(y_test, prediction_labels))
print(accuracy_score(y_test, prediction_labels))

              precision    recall  f1-score   support

           0       0.89      0.91      0.90       275
           1       0.85      0.92      0.88       224
           2       0.95      0.92      0.94       695
           3       0.86      0.77      0.81       159
           4       0.94      0.97      0.95       581
           5       0.81      0.73      0.77        66

    accuracy                           0.92      2000
   macro avg       0.88      0.87      0.88      2000
weighted avg       0.91      0.92      0.91      2000

[[249  12   2   0  12   0]
 [  8 206   1   0   4   5]
 [  7   6 642  19  16   5]
 [  4   0  27 123   4   1]
 [ 11   4   4   0 562   0]
 [  1  14   2   1   0  48]]
0.915
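
The report and confusion matrix above use integer class ids. `LabelEncoder` numbers classes in alphabetical order, so 0 to 5 correspond to anger, fear, joy, love, sadness and surprise; the support column (275, 224, 695, 159, 581, 66) matches the test-set counts for those labels. The sketch below copies the confusion matrix from the output above and recomputes accuracy, per-class recall and the most frequent confusions with the label names attached. The variable names are illustrative.

```python
# Class names in LabelEncoder order (alphabetical).
labels = ["anger", "fear", "joy", "love", "sadness", "surprise"]

# Confusion matrix copied from the output above: rows are true labels,
# columns are predicted labels.
cm = [
    [249, 12, 2, 0, 12, 0],
    [8, 206, 1, 0, 4, 5],
    [7, 6, 642, 19, 16, 5],
    [4, 0, 27, 123, 4, 1],
    [11, 4, 4, 0, 562, 0],
    [1, 14, 2, 1, 0, 48],
]

n = len(labels)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

# Recall per class: correct predictions divided by the number of true examples.
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(n)}

# Off-diagonal cells, largest first: (count, true label, predicted label).
confusions = sorted(
    ((cm[i][j], labels[i], labels[j]) for i in range(n) for j in range(n) if i != j),
    reverse=True,
)

print(f"accuracy: {accuracy:.3f}")
for label, value in recall.items():
    print(f"{label:>8}: recall {value:.2f}")
for count, true_label, predicted_label in confusions[:3]:
    print(f"{true_label} predicted as {predicted_label}: {count}")
```

The largest confusions are between love and joy (27 love tweets predicted as joy, 19 joy tweets predicted as love), and the two rarest classes, love and surprise, have the lowest recall.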
