# GoEmotions Dataset Sentiment Analysis

This notebook is the final project for the Spring 2020 CSCI-360 machine learning course.
It performs sentiment analysis with multiple emotion labels, mainly with an LSTM network built in Keras on top of the Keras `Tokenizer`.
Baseline models (naive Bayes, decision tree, MLP, CNN) were built and run beforehand.

Author: Kaiyan Zhan (kz2271)

In [63]:
# imports
import json
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# use the Keras bundled with TensorFlow throughout, instead of mixing
# the standalone `keras` package with `tensorflow.keras`
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

Import the dataset and perform data preprocessing. The original dataset is split across three CSV files, which are concatenated here.

In [25]:
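# `data_dir` rather than `dir`, which would shadow the Python builtin
data_dir = '/Users/zhankaiyan/PycharmProjects/CSCI-360/final_project/archive 3/data/full_dataset/'
set1 = pd.read_csv(data_dir + 'goemotions_1.csv')
set2 = pd.read_csv(data_dir + 'goemotions_2.csv')
set3 = pd.read_csv(data_dir + 'goemotions_3.csv')

# stack the three parts into one frame with a fresh 0..n-1 index
data = pd.concat([set1, set2, set3], ignore_index=True)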


In [26]:
with open('/Users/zhankaiyan/PycharmProjects/CSCI-360/final_project/archive 3/data/ekman_mapping.json') as file:
    ekman_mapping = json.load(file)

In [27]:
print(ekman_mapping)

{'anger': ['anger', 'annoyance', 'disapproval'], 'disgust': ['disgust'], 'fear': ['fear', 'nervousness'], 'joy': ['joy', 'amusement', 'approval', 'excitement', 'gratitude', 'love', 'optimism', 'relief', 'pride', 'admiration', 'desire', 'caring'], 'sadness': ['sadness', 'disappointment', 'embarrassment', 'grief', 'remorse'], 'surprise': ['surprise', 'realization', 'confusion', 'curiosity']}
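
The mapping printed above goes from each Ekman category to its fine-grained GoEmotions labels. For relabeling, the reverse lookup is often handier. A minimal sketch, with the dictionary copied from the output above; `fine_to_ekman` is a name introduced here for illustration only:

```python
# Ekman mapping as printed above: coarse category -> fine-grained labels
ekman_mapping = {
    'anger': ['anger', 'annoyance', 'disapproval'],
    'disgust': ['disgust'],
    'fear': ['fear', 'nervousness'],
    'joy': ['joy', 'amusement', 'approval', 'excitement', 'gratitude', 'love',
            'optimism', 'relief', 'pride', 'admiration', 'desire', 'caring'],
    'sadness': ['sadness', 'disappointment', 'embarrassment', 'grief', 'remorse'],
    'surprise': ['surprise', 'realization', 'confusion', 'curiosity'],
}

# Invert it: fine-grained label -> coarse category
fine_to_ekman = {
    fine: coarse
    for coarse, fine_labels in ekman_mapping.items()
    for fine in fine_labels
}

print(fine_to_ekman['annoyance'])  # anger
print(len(fine_to_ekman))          # 27 fine-grained labels ('neutral' is not in the mapping)
```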


Take a look at the data. The emotion labels are already one-hot encoded in the original dataset.

In [28]:
data.shape

(211225, 37)

In [29]:
print(data.head())

                                                text       id  \
0                                    That game hurt.  eew5j0j   
1   >sexuality shouldn’t be a grouping category I...  eemcysk   
2     You do right, if you don't care then fuck 'em!  ed2mah1   
3                                 Man I love reddit.  eeibobj   
4  [NAME] was nowhere near them, he was by the Fa...  eda6yn6   

                author            subreddit    link_id   parent_id  \
0                Brdd9                  nrl  t3_ajis4z  t1_eew18eq   
1          TheGreen888     unpopularopinion  t3_ai4q37   t3_ai4q37   
2             Labalool          confessions  t3_abru74  t1_ed2m7g7   
3        MrsRobertshaw             facepalm  t3_ahulml   t3_ahulml   
4  American_Fascist713  starwarsspeculation  t3_ackt2f  t1_eda65q2   

    created_utc  rater_id  example_very_unclear  admiration  ...  love  \
0  1.548381e+09         1                 False           0  ...     0   
1  1.548084e+09        37               

## Data cleaning

In [30]:
# drop the rows flagged as unclear and the metadata columns that are not needed
data = data[~data['example_very_unclear']]
data = data.drop(columns=['author', 'subreddit', 'link_id', 'parent_id', 'created_utc', 'rater_id'])
# data = data.drop(columns=['neutral'])
print(data.head())

                                                text       id  \
0                                    That game hurt.  eew5j0j   
2     You do right, if you don't care then fuck 'em!  ed2mah1   
3                                 Man I love reddit.  eeibobj   
4  [NAME] was nowhere near them, he was by the Fa...  eda6yn6   
5  Right? Considering it’s such an important docu...  eespn2i   

   example_very_unclear  admiration  amusement  anger  annoyance  approval  \
0                 False           0          0      0          0         0   
2                 False           0          0      0          0         0   
3                 False           0          0      0          0         0   
4                 False           0          0      0          0         0   
5                 False           0          0      0          0         0   

   caring  confusion  ...  love  nervousness  optimism  pride  realization  \
0       0          0  ...     0            0         0      0 

In [34]:
# Grouping emotions:
anger_list = [ "anger", "annoyance", "disapproval", "disgust"]
fear_list = ["fear", "nervousness"]
joy_list = ["joy", "amusement", "approval", "excitement", "gratitude","love", "optimism", "relief", "pride", "admiration", "desire", "caring"]
sadness_list = ["sadness", "disappointment", "embarrassment", "grief", "remorse"]
surprise_list = ["surprise", "realization", "confusion", "curiosity"]
emotion_groups = [anger_list, fear_list, joy_list, sadness_list, surprise_list]

"""
Labels:
Anger (0) : [“Anger”, “annoyance”, “disapproval”, “disgust”]
Fear (1) : [“fear”, “nervousness” ]
Joy (2) : [“joy” , “amusement”, “approval”, “excitement”, “gratitude”,
     “love”, “optimism”, “relief”, “pride”, “admiration”, “desire”, “caring”]
Sadness (3) : [“Sadness”, “Disappointment”, “Embarrassment”, “grief”, “remorse”]
Surprise (4) : [“Surprise”, “Realization”, “confusion”, “curiosity”]
Neutral (5) : ["Neutral"]
"""
col_names = ['text','group_label']
new_data = []
for _, row in data.iterrows():
    # rows flagged as unclear were already dropped above; skip any that remain
    if row['example_very_unclear']:
        continue
    if row['neutral'] == 1:
        group = 5
    else:
        # pick the group with the most active fine-grained labels;
        # on a tie the earlier group wins
        counts = [sum(row[label] == 1 for label in eg) for eg in emotion_groups]
        group = counts.index(max(counts))
    new_data.append([row['text'], group])

# np.array() casts the mixed [text, int] rows to strings,
# so group_label is stored as '0'..'5'
emotion_group = pd.DataFrame(np.array(new_data), columns=col_names)
emotion_group.shape

(207814, 2)
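
The loop above gives each comment the group that holds the most of its active fine-grained labels, with ties going to the earlier group and neutral handled as a separate class. The same rule can be written without a Python-level loop. This is only a sketch on a toy matrix: the column layout and the `group_columns` index lists are made up for the example and are not the notebook's data.

```python
import numpy as np

# Toy one-hot matrix: one row per comment, one column per fine-grained label.
# The column order is an assumption made for this example.
columns = ['anger', 'annoyance', 'fear', 'joy', 'love', 'sadness', 'surprise', 'neutral']
rows = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0],   # annoyance only
    [0, 0, 0, 1, 1, 1, 0, 0],   # two joy labels against one sadness label
    [0, 0, 0, 0, 0, 0, 0, 1],   # neutral
    [0, 0, 1, 0, 0, 0, 1, 0],   # fear and surprise tie
])

# Column indices of each group, in the same order as the group ids 0..4
group_columns = [
    [0, 1],   # 0: anger
    [2],      # 1: fear
    [3, 4],   # 2: joy
    [5],      # 3: sadness
    [6],      # 4: surprise
]

# counts[i, g] = number of active labels of group g in row i
counts = np.stack([rows[:, cols].sum(axis=1) for cols in group_columns], axis=1)

# argmax returns the first maximum, so ties go to the lower group id,
# which matches the strict `>` comparison in the loop
group_label = counts.argmax(axis=1)

# neutral rows get the extra class 5
neutral = rows[:, columns.index('neutral')] == 1
group_label = np.where(neutral, 5, group_label)

print(group_label.tolist())  # [0, 2, 5, 1]
```

Keeping the labels as integers, as here, would also avoid the string comparisons (`== '0'`) needed further down.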

The dataset is highly imbalanced (the 'fear' group has very few examples). Downsample every group to the same size to get balanced classes.

In [36]:
emotion_group.group_label.value_counts()

2    79279
5    55298
0    33937
4    20967
3    14292
1     4041
Name: group_label, dtype: int64

In [46]:
num_of_text = 4000  # rows kept per class (the smallest class, fear, has 4041)

# shuffle, then take the first num_of_text rows of each class;
# group_label holds strings because np.array() cast the mixed rows to str
shuffled = emotion_group.reindex(np.random.permutation(emotion_group.index))
anger = shuffled[shuffled['group_label'] == '0'][:num_of_text]
fear = shuffled[shuffled['group_label'] == '1'][:num_of_text]
joy = shuffled[shuffled['group_label'] == '2'][:num_of_text]
sad = shuffled[shuffled['group_label'] == '3'][:num_of_text]
surprise = shuffled[shuffled['group_label'] == '4'][:num_of_text]
neutral = shuffled[shuffled['group_label'] == '5'][:num_of_text]
concated = pd.concat([anger, fear, joy, sad, surprise, neutral], ignore_index=True)

# shuffle again so the classes are interleaved
concated = concated.reindex(np.random.permutation(concated.index))

In [47]:
print(concated)

                                                    text group_label
7922   He’s probably just getting stressed out that i...           1
10832  I agree. More stupid ideas by ''the adults in ...           2
13046  leave bro- this is the exact type of behavior ...           3
21817      You were. I gave you away to see Tool in '06.           5
6735   Certain mental health issues can be smelled on...           1
...                                                  ...         ...
10357  Unfortunately it isn’t available in my locatio...           2
6513   [NAME] " haters gonna hate" the most cringe I'...           1
4645   Blizzard was working with Activision long befo...           1
21093          Sorry, [NAME], but [NAME] beat you to it.           5
4992   Aaaand I'm still here. I don't have a alt acco...           1

[24000 rows x 2 columns]
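
The per-class slicing above can also be expressed with a grouped sample. The sketch below runs on a toy frame, not the notebook's data, and takes the size of the smallest class as the per-class count instead of a fixed 4000; the `random_state` values are arbitrary and only there to make the draw repeatable.

```python
import pandas as pd

# Toy imbalanced frame with the same two columns as emotion_group
toy = pd.DataFrame({
    'text': [f'comment {i}' for i in range(12)],
    'group_label': ['0'] * 6 + ['1'] * 2 + ['2'] * 4,
})

# Size of the smallest class
n_per_class = int(toy['group_label'].value_counts().min())

# Draw n_per_class rows from every class, then shuffle the combined result
balanced = (
    toy.groupby('group_label')
       .sample(n=n_per_class, random_state=0)
       .sample(frac=1, random_state=0)
       .reset_index(drop=True)
)

print(balanced['group_label'].value_counts().to_dict())
```

Downsampling discards most of the larger classes; class weights in the loss would be an alternative that keeps all rows, at the cost of a longer training run.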


Perform one-hot encoding again, this time on the six group labels.

In [51]:
if 'group_label' in concated.columns:
    # convert the string group labels ('0'..'5') to integer class ids
    concated['LABEL'] = concated['group_label'].astype('int64')
    # drop() returns a new DataFrame, so the result has to be assigned back
    concated = concated.drop(columns=['group_label'])
print(concated['LABEL'][:10])
labels = to_categorical(concated['LABEL'], num_classes=6)
print(labels[:10])

7922     1
10832    2
13046    3
21817    5
6735     1
11988    2
21310    5
15181    3
14300    3
17873    4
Name: LABEL, dtype: int64
[[0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]]
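
What the one-hot encoding does can be reproduced with plain NumPy, which also shows how to get back from a one-hot row (or a softmax output) to the class id. A small sketch using the first five labels printed above:

```python
import numpy as np

class_ids = np.array([1, 2, 3, 5, 1])   # first five labels printed above
num_classes = 6

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes, dtype='float32')[class_ids]
print(one_hot)

# argmax undoes the encoding
recovered = one_hot.argmax(axis=1)
print(recovered.tolist())  # [1, 2, 3, 5, 1]
```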


Tokenize the text. Punctuation is removed through the tokenizer's `filters` argument and the text is lowercased.

In [58]:
n_most_common_words = 8000  # vocabulary size used when building the sequences
max_len = 130               # every sequence is padded or truncated to this length
# raw string so that the backslash in the filter list is taken literally
tokenizer = Tokenizer(num_words=n_most_common_words, filters=r'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(concated['text'].values)
sequences = tokenizer.texts_to_sequences(concated['text'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# pad (or truncate) every sequence to max_len token ids
X = keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len)

Found 18285 unique tokens.
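
The count above covers every token seen in the corpus, while `num_words` only limits which of them survive into the sequences. A pure-Python sketch of the idea follows. It mirrors the Keras behaviour as I understand it (frequency-ranked indices starting at 1, words ranked at or beyond `num_words` dropped, zeros added and tokens cut at the front by default); the helper functions are hypothetical and not part of Keras.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most common word, 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by count and keeps first-seen order among ties
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below num_words; all other words are dropped."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen ids if the sequence is too long."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

texts = ['the game hurt', 'i love the game', 'the end']
word_index = build_word_index(texts)
seqs = [to_sequence(t, word_index, num_words=5) for t in texts]
padded = [pad_pre(s, maxlen=2) for s in seqs]

print(len(word_index))  # 6 unique tokens, although only indices 1..4 are used
print(padded)           # [[2, 3], [1, 2], [0, 1]]
```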


## Model setup

In [59]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X , labels, test_size=0.25, random_state=42)
print((X_train.shape, y_train.shape, X_test.shape, y_test.shape))

In [60]:
# define hyperparameters
epochs = 10
emb_dim = 128
batch_size = 256
labels[:2]

array([[0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0.]], dtype=float32)
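
With 24000 balanced rows, `test_size=0.25`, and the `validation_split=0.2` passed to `model.fit` in the next cell, the data ends up in three parts. The arithmetic below is a sketch of how the sizes work out, assuming scikit-learn rounds the test size up and Keras carves the validation rows off the end of the training array:

```python
import math

n_samples = 24000        # rows in the balanced dataset
test_size = 0.25         # passed to train_test_split
validation_split = 0.2   # passed to model.fit
batch_size = 256

n_test = math.ceil(n_samples * test_size)                # held-out test rows
n_train_full = n_samples - n_test                        # rows handed to model.fit
n_fit = int(n_train_full * (1 - validation_split))       # rows actually used for gradient updates
n_val = n_train_full - n_fit                             # rows used for val_loss / val_acc
steps_per_epoch = math.ceil(n_fit / batch_size)          # batches per epoch

print(n_test, n_train_full, n_fit, n_val, steps_per_epoch)  # 6000 18000 14400 3600 57
```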

In [None]:
model = Sequential()
model.add(Embedding(n_most_common_words, emb_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.7))
model.add(LSTM(64, dropout=0.7, recurrent_dropout=0.7))
model.add(Dense(6, activation='softmax'))  # one output per emotion group
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
print(model.summary())

# stop early if the validation loss has not improved for 7 epochs
early_stopping = EarlyStopping(monitor='val_loss', patience=7, min_delta=0.0001)
history = model.fit(
    X_train, y_train,
    epochs=epochs,
    batch_size=batch_size,
    validation_split=0.2,
    callbacks=[early_stopping],
)

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 130, 128)          1024000   
                                                                 
 spatial_dropout1d_3 (Spatia  (None, 130, 128)         0         
 lDropout1D)                                                     
                                                                 
 lstm_3 (LSTM)               (None, 64)                49408     
                                                                 
 dense_3 (Dense)             (None, 6)                 390       
                                                                 
Total params: 1,073,798
Trainable params: 1,073,798
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/10
Epoch 2/10
Epoch 3/10
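
The parameter counts in the summary above can be checked by hand from the layer sizes. A short worked calculation, using the standard formulas for an embedding table, an LSTM layer with four gates, and a dense layer with bias:

```python
vocab_size = 8000    # n_most_common_words
emb_dim = 128
lstm_units = 64
num_classes = 6

# one emb_dim-sized vector per vocabulary entry
embedding_params = vocab_size * emb_dim

# 4 gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * ((emb_dim + lstm_units) * lstm_units + lstm_units)

# weight matrix plus one bias per output class
dense_params = lstm_units * num_classes + num_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)  # 1024000 49408 390 1073798
```

Almost all of the parameters sit in the embedding table, so the vocabulary size is the main lever on model size here.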

In [None]:
# model evaluation: training and validation curves
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

# separate name so the `epochs` hyperparameter is not overwritten
epoch_range = range(1, len(acc) + 1)

plt.plot(epoch_range, acc, 'bo', label='Training acc')
plt.plot(epoch_range, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epoch_range, loss, 'bo', label='Training loss')
plt.plot(epoch_range, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()
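
The curves above only cover the training and validation rows. The held-out `X_test` and `y_test` from the earlier split could be scored as well, for instance with a confusion matrix. The sketch below uses toy arrays with three classes; in the notebook the inputs would presumably be `y_test` and `model.predict(X_test)`, which is an assumption since that step is not shown. The `confusion_matrix` helper is written here for illustration.

```python
import numpy as np

def confusion_matrix(y_true_onehot, y_prob, num_classes):
    """Rows are true classes, columns are predicted classes."""
    y_true = y_true_onehot.argmax(axis=1)
    y_pred = y_prob.argmax(axis=1)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # add 1 at every (true, predicted) pair, counting repeated pairs correctly
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Toy stand-ins for y_test and the predicted class probabilities
y_true_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.3, 0.2],   # true class 2, predicted class 0
    [0.2, 0.7, 0.1],
])

cm = confusion_matrix(y_true_onehot, y_prob, num_classes=3)
accuracy = np.trace(cm) / cm.sum()
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(accuracy)          # 0.75
print(per_class_recall)  # recall per true class
```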

In [None]:
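# predict the emotion group of a new text
# names in the same order as the class ids 0..5 defined in the grouping step
group_names = ['anger', 'fear', 'joy', 'sadness', 'surprise', 'neutral']
txt = [""]  # placeholder: put the text to classify here
seq = tokenizer.texts_to_sequences(txt)
padded = keras.preprocessing.sequence.pad_sequences(seq, maxlen=max_len)
pred = model.predict(padded)
print(pred, group_names[np.argmax(pred)])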