Import libraries

In [3]:
import pandas as pd
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

lineReader function: reads our data file and stores every row in a list called line

In [4]:
def lineReader(x):
  with open(x) as file: 
    line = []
    for lines in file.readlines():
      line.append(lines)
    return line

csv function: splits each row into two parts, the sentence and its emotion label. Our list holds one raw row of the dataset per element, so we split each row using ';' as the separator and build a DataFrame with a sentence column and an emotion column

In [5]:
def csv(line):
  list1,list2 = [],[]
  for lines in line:
    x,y = lines.split(';')
    y = y.replace('\n','')
    list1.append(x)
    list2.append(y)
  df = pd.DataFrame(list(list1),columns=['sentence'])
  df['emotion'] = list2
  return df
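
As a point of comparison, the two helper functions above can probably be replaced by a single pandas call. The sketch below assumes each row holds exactly one ';' and no quote characters; `load_emotions` is a hypothetical name, and the three sample rows are copied from the training data shown in this notebook.

```python
import io
import pandas as pd

def load_emotions(source):
    """Hypothetical helper: read 'sentence;emotion' rows into a DataFrame."""
    # header=None because the file has no header row; names supplies the columns
    return pd.read_csv(source, sep=';', header=None, names=['sentence', 'emotion'])

# In-memory stand-in for a file such as './train.txt'
sample = io.StringIO(
    "i didnt feel humiliated;sadness\n"
    "i am feeling grouchy;anger\n"
    "i feel strong and good overall;joy\n"
)
sample_df = load_emotions(sample)
print(sample_df)
```

Passing a file path instead of the StringIO object should work the same way, since read_csv accepts both.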

Calling the lineReader and csv functions on the training dataset

In [6]:
line = lineReader('./train.txt')
df = csv(line)

In [7]:
df

Unnamed: 0,sentence,emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger
...,...,...
15995,i just had a very brief time in the beanbag an...,sadness
15996,i am now turning and i feel pathetic that i am...,sadness
15997,i feel strong and good overall,joy
15998,i feel like this was such a rude comment and i...,anger


Checking the number of training samples for each emotion

In [8]:
df.emotion.value_counts()

joy         5362
sadness     4666
anger       2159
fear        1937
love        1304
surprise     572
Name: emotion, dtype: int64
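
The counts above are easier to compare as percentages. This is a small sketch that re-enters the printed counts by hand (the names `counts` and `shares` are only for illustration); on the real DataFrame, `df.emotion.value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series(
    {'joy': 5362, 'sadness': 4666, 'anger': 2159,
     'fear': 1937, 'love': 1304, 'surprise': 572},
    name='emotion',
)

# Share of each emotion in the training set, in percent
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

The classes are clearly imbalanced: joy and sadness together make up roughly 63% of the rows, while surprise is under 4%. That is worth keeping in mind when reading the per-class recall later on.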

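Building the WordNet lemmatizer and using it to lemmatize our sentences. Lemmatization maps each word to its dictionary form (its lemma), for example turning a plural noun into its singular. Because lemmatize is called without a part-of-speech tag, every word is treated as a noun, which is why a word like 'feeling' is left unchanged. The lem function also removes English stop words. We lemmatize our training dataset here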

In [9]:
wn = WordNetLemmatizer()

In [10]:
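def lem(x):
  # Build the stop-word set once; a set lookup is much faster than reloading the list for every word
  stop_words = set(stopwords.words('english'))
  corpus = []
  for words in x:
    words = words.split()
    y = [wn.lemmatize(word) for word in words if word not in stop_words]
    y = ' '.join(y)
    corpus.append(y)
  return corpus
x = lem(df['sentence'])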

Calling lineReader and csv on our test data

In [11]:
test_line = lineReader('./test.txt') 
test_df = csv(test_line)

Lemmatizing our testing dataset. We pass only the sentence column, because that is our only feature

In [12]:
x_test = lem(test_df['sentence'])

We build our labels from the column at index 1, which holds the emotion

In [13]:
y_train = df.iloc[:,1].values
y_test = test_df.iloc[:,1].values

In [14]:
y_train = pd.DataFrame(y_train)
y_test = pd.DataFrame(y_test)

We also build all_y, which holds the labels of both the training and testing datasets

In [15]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
all_y = pd.concat([y_train, y_test])

Doing the same thing for our features

In [16]:
x_train = x
print(len(x_train))

all_x = x_train + x_test

16000


In [17]:
all_x = pd.DataFrame(all_x)

In [18]:
all_y

Unnamed: 0,0
0,sadness
1,sadness
2,anger
3,love
4,anger
...,...
1995,anger
1996,anger
1997,joy
1998,joy


In [19]:
all_x[0].values

array(['didnt feel humiliated',
       'go feeling hopeless damned hopeful around someone care awake',
       'im grabbing minute post feel greedy wrong', ...,
       'feel useful people give great feeling achievement',
       'im feeling comfortable derby feel though start step shell',
       'feel weird meet w people text like dont talk face face w'],
      dtype=object)

We transform all our sentences into count vectors so that every sentence has the same shape. This is required because the sentences do not all have the same length. CountVectorizer builds a vocabulary from every word in the corpus and, for each sentence, counts how many times each vocabulary word occurs, which gives one fixed-length row per sentence

In [20]:
v = CountVectorizer()
all_x = v.fit_transform(all_x[0].values)
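
To make the idea concrete, here is a minimal sketch on two made-up sentences (`toy_sentences`, `toy_v` and `toy_counts` are illustrative names, not part of the notebook). It uses the default CountVectorizer settings, the same as above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences of different lengths
toy_sentences = ['feel strong and good', 'feel weird feel lost']

toy_v = CountVectorizer()
toy_counts = toy_v.fit_transform(toy_sentences)

# Vocabulary: word -> column index, assigned in alphabetical order
print(sorted(toy_v.vocabulary_.items(), key=lambda kv: kv[1]))

# Dense view: one row per sentence, one column per vocabulary word
print(toy_counts.toarray())
```

Both rows have six columns (and, feel, good, lost, strong, weird) even though the sentences differ in length, and the second row holds a 2 in the 'feel' column because that word appears twice.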

This is a manual version of train_test_split: the first 16000 rows are the training data and the rest are the testing data

In [21]:
x_train = all_x[:16000]
x_test = all_x[16000:]

y_train = all_y[:16000]
y_test = all_y[16000:]
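
A possible alternative to concatenating everything and slicing it back apart is to fit the vectorizer on the training sentences only and then reuse it on the test sentences. This is not what the notebook does; it is a sketch on made-up sentences, and the `alt_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-ins for the lemmatized training and testing sentences
train_sentences = ['feel strong good', 'feeling grouchy']
test_sentences = ['feel pathetic', 'feeling good']

alt_v = CountVectorizer()
# Learn the vocabulary from the training sentences only
alt_x_train = alt_v.fit_transform(train_sentences)
# Reuse that vocabulary for the test sentences; unseen words are ignored
alt_x_test = alt_v.transform(test_sentences)

print(alt_x_train.shape, alt_x_test.shape)
print(alt_x_test.toarray())
```

Both matrices end up with the same number of columns, so no manual slicing is needed, and the vocabulary is not influenced by the test data. The trade-off is that a word seen only in the test set (here 'pathetic') gets no column at all.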

Building the Multinomial Naive Bayes model, getting the predictions for the testing dataset, and checking the accuracy on the testing dataset too. We compute the accuracy in two ways (MNB.score and accuracy_score) and they yield the same result

In [22]:
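# ravel() passes the labels as a 1-D array, which is the shape fit expects
MNB = MultinomialNB().fit(x_train,y_train.values.ravel())
MNB_pred = MNB.predict(x_test)
MNB_score = MNB.score(x_test,y_test) * 100
# accuracy_score takes the true labels first, then the predictions
MNB_acc = accuracy_score(y_test,MNB_pred) * 100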
print(f"MNB Score : {MNB_score : .2f} % ---- MNB Accuracy : {MNB_acc:.2f} %")

MNB Score :  80.05 % ---- MNB Accuracy : 80.05 %
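
The two numbers match because a classifier's score method returns the mean accuracy, which is the same quantity accuracy_score computes from the predictions. A tiny sketch on made-up count data (all `toy_` names and values are invented for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up word counts: column 0 is frequent in 'joy', column 1 in 'sadness'
toy_x = np.array([[2, 0], [3, 1], [0, 2], [1, 3]])
toy_y = np.array(['joy', 'joy', 'sadness', 'sadness'])

toy_model = MultinomialNB().fit(toy_x, toy_y)
toy_pred = toy_model.predict(toy_x)

# score() predicts internally and compares against the true labels
via_score = toy_model.score(toy_x, toy_y)
# accuracy_score() compares true labels with predictions we already have
via_metric = accuracy_score(toy_y, toy_pred)

print(via_score, via_metric)
```

So one of the two checks is enough; accuracy_score is mainly handy when the predictions are already stored, as with MNB_pred.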




This is the classification report

In [23]:
print(classification_report(y_test,MNB_pred))

              precision    recall  f1-score   support

       anger       0.90      0.68      0.77       275
        fear       0.83      0.65      0.73       224
         joy       0.78      0.95      0.86       695
        love       0.87      0.37      0.52       159
     sadness       0.78      0.93      0.85       581
    surprise       0.64      0.11      0.18        66

    accuracy                           0.80      2000
   macro avg       0.80      0.61      0.65      2000
weighted avg       0.81      0.80      0.78      2000
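
The gap between the macro and weighted averages comes from the class imbalance. The sketch below recomputes both recall averages from the rounded per-class values printed above, so the results are only approximate; the dictionary names are hypothetical.

```python
# Per-class recall and support, copied from the report above (rounded values)
recall = {'anger': 0.68, 'fear': 0.65, 'joy': 0.95,
          'love': 0.37, 'sadness': 0.93, 'surprise': 0.11}
support = {'anger': 275, 'fear': 224, 'joy': 695,
           'love': 159, 'sadness': 581, 'surprise': 66}

total = sum(support.values())

# Macro average: every class counts equally, however few samples it has
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall    : {macro_recall:.3f}")
print(f"weighted recall : {weighted_recall:.3f}")
```

The macro value is pulled down by love and surprise, which have few test samples and low recall, while the weighted value is dominated by joy and sadness and lands close to the overall accuracy.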



And these are the predictions for the testing dataset

In [24]:
MNB_pred

array(['sadness', 'sadness', 'sadness', ..., 'joy', 'joy', 'fear'],
      dtype='<U8')