In [43]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

### First things first, let's look a little bit at the data

In [44]:
# Each file has no header row: column 0 is the text, column 1 is the emotion label.
train = pd.read_csv('train.txt', sep=';', header=None)
test = pd.read_csv('test.txt', sep=';', header=None)
train.head()

                                                   0        1
0                            i didnt feel humiliated  sadness
1  i can go from feeling so hopeless to so damned...  sadness
2   im grabbing a minute to post i feel greedy wrong    anger
3  i am ever feeling nostalgic about the fireplac...     love
4                               i am feeling grouchy    anger


In [45]:
train[1].value_counts()

joy         5362
sadness     4666
anger       2159
fear        1937
love        1304
surprise     572
Name: 1, dtype: int64

In [46]:
test[1].value_counts()

joy         695
sadness     581
anger       275
fear        224
love        159
surprise     66
Name: 1, dtype: int64

The data is already split into a training file and a test file, so we don't need to do a train/test split ourselves. As the counts show, the classes are imbalanced: joy has 5362 training examples while surprise has only 572. Later we will try using the same number of examples for each emotion. For now, let's fit a simple Naive Bayes classifier to see how well it performs.
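To put numbers on the imbalance, here is a small sketch that works from the label counts printed above rather than from the files themselves. The names `train_counts`, `test_counts`, `imbalance_ratio`, and `majority_baseline` are ours, introduced only for illustration.

```python
import pandas as pd

# Label counts copied from the value_counts() outputs above.
train_counts = pd.Series(
    {"joy": 5362, "sadness": 4666, "anger": 2159,
     "fear": 1937, "love": 1304, "surprise": 572}
)
test_counts = pd.Series(
    {"joy": 695, "sadness": 581, "anger": 275,
     "fear": 224, "love": 159, "surprise": 66}
)

# Share of each emotion in the training set.
train_shares = train_counts / train_counts.sum()
print(train_shares.round(3))

# How many times larger the biggest class is than the smallest one.
imbalance_ratio = train_counts.max() / train_counts.min()
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")

# Accuracy of a model that always predicts the most frequent test label.
majority_baseline = test_counts.max() / test_counts.sum()
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

The majority-class baseline is a useful yardstick: any accuracy we report later should be read against it.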

In [47]:
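def tester(train: pd.DataFrame, test: pd.DataFrame) -> None:
    # Column 0 is the text, column 1 is the emotion label.
    x_train = train[0]
    x_test = test[0]
    y_train = train[1]
    y_test = test[1]
    # Bag-of-words counts; the vocabulary is learned from the training set only.
    count_vect = CountVectorizer()
    x_train = count_vect.fit_transform(x_train)
    x_test = count_vect.transform(x_test)
    clf = MultinomialNB().fit(x_train, y_train)
    print('Score:\n', clf.score(x_test, y_test))  # accuracy on the test set
    # Note: classification_report expects (y_true, y_pred). The predictions are
    # passed first here, so in the printed report the precision and recall
    # columns are swapped and "support" counts predicted labels.
    print('F1_Score:\n', classification_report(clf.predict(x_test), y_test))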

In [48]:
tester(train, test)

Score:
 0.7655
F1_Score:
               precision    recall  f1-score   support

       anger       0.57      0.92      0.70       170
        fear       0.53      0.82      0.64       146
         joy       0.97      0.74      0.84       911
        love       0.23      0.95      0.37        38
     sadness       0.94      0.74      0.83       735
    surprise       0.00      0.00      0.00         0

    accuracy                           0.77      2000
   macro avg       0.54      0.69      0.56      2000
weighted avg       0.88      0.77      0.80      2000



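Although we achieved an OK accuracy (about 0.77), the model never predicts surprise. Its row has a support of 0, and because the predictions were passed to classification_report as the first argument, support here counts predicted labels rather than true ones. The model therefore never gets a single surprise example right.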
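The argument order of `classification_report` matters when reading such a report. The sketch below uses made-up toy labels (not the notebook's data) to show what changes when the true labels and the predictions trade places; `output_dict=True` and `zero_division=0` are standard scikit-learn options, assuming a reasonably recent version.

```python
from sklearn.metrics import classification_report

# Toy labels, invented for illustration: the model always predicts "joy".
y_true = ["joy", "joy", "joy", "surprise"]
y_pred = ["joy", "joy", "joy", "joy"]

# Conventional order: true labels first, predictions second.
usual = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# Swapped order, as in the tester function above.
swapped = classification_report(y_pred, y_true, output_dict=True, zero_division=0)

for name, report in [("usual", usual), ("swapped", swapped)]:
    for label in ["joy", "surprise"]:
        row = report[label]
        print(name, label, row["precision"], row["recall"], row["support"])
```

In the conventional order, support for surprise is the number of true surprise examples. In the swapped order it is the number of surprise predictions, which is zero here, and the precision and recall values for joy trade places.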

Let's first undersample every emotion to 572 examples, the size of the smallest class (surprise).

In [49]:
# Keep the first 572 rows of each emotion (572 is the size of the smallest class).
emotions = ['joy', 'sadness', 'anger', 'fear', 'love', 'surprise']
new_train = pd.concat([train.loc[train[1] == emotion][:572] for emotion in emotions])
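Taking the first 572 rows of each emotion depends on the order of the file. A possible alternative, not used in this notebook, is to draw a random sample of the same size per class. The sketch below shows the idea on a tiny made-up frame with the same column layout (0 for text, 1 for label); it assumes pandas 1.1 or newer, where `groupby(...).sample` is available.

```python
import pandas as pd

# Tiny made-up frame mirroring the notebook's layout: 0 = text, 1 = label.
toy = pd.DataFrame({
    0: ["a", "b", "c", "d", "e", "f", "g"],
    1: ["joy", "joy", "joy", "joy", "sadness", "sadness", "surprise"],
})

# Size of the smallest class; in the real training set this would be 572.
n_min = toy[1].value_counts().min()

# Draw n_min random rows from every label; random_state makes it repeatable.
balanced = toy.groupby(1).sample(n=n_min, random_state=0)
print(balanced[1].value_counts())
```

With the real data, the same two lines applied to `train` would give a balanced frame that could be passed to `tester` in place of `new_train`.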

In [50]:
tester(new_train, test)

Score:
 0.6445
F1_Score:
               precision    recall  f1-score   support

       anger       0.74      0.63      0.68       324
        fear       0.71      0.60      0.65       264
         joy       0.57      0.85      0.68       468
        love       0.81      0.41      0.54       313
     sadness       0.60      0.78      0.68       443
    surprise       0.83      0.29      0.43       188

    accuracy                           0.64      2000
   macro avg       0.71      0.59      0.61      2000
weighted avg       0.68      0.64      0.63      2000



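This time accuracy dropped from 0.77 to 0.64, but the predictions are more balanced: the model now predicts every emotion, including surprise, and the macro-averaged F1 rose from 0.56 to 0.61.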
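Accuracy and macro-averaged F1 can move in opposite directions on imbalanced data, which is what happened between the two runs. The toy example below, with invented labels and two hypothetical sets of predictions (`pred_majority` and `pred_balanced`), is only a sketch of that effect.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented labels: 8 "joy" examples and 2 "surprise" examples.
y_true = ["joy"] * 8 + ["surprise"] * 2

# Hypothetical model A: always predicts the majority class.
pred_majority = ["joy"] * 10

# Hypothetical model B: finds both surprises but mislabels three joys.
pred_balanced = ["joy"] * 5 + ["surprise"] * 3 + ["surprise"] * 2

for name, pred in [("majority", pred_majority), ("balanced", pred_balanced)]:
    acc = accuracy_score(y_true, pred)
    macro_f1 = f1_score(y_true, pred, average="macro", zero_division=0)
    print(f"{name}: accuracy={acc:.2f}, macro F1={macro_f1:.2f}")
```

Model A has the higher accuracy, while model B has the higher macro F1 because every class counts equally in that average.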