Use techniques like bag of n-grams and TF-IDF for text representation, then apply different classification algorithms.



Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp



The dataset has two columns:
  - Comment: the text of the statement
  - Emotion: the emotion label for that statement (anger, joy, or fear)

In [2]:
import pandas as pd

df = pd.read_csv("Emotion_classify_Data.csv")

print(df.shape)

df.head()

(5937, 2)


Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [3]:
df.Emotion.value_counts()

anger    2000
joy      2000
fear     1937
Name: Emotion, dtype: int64

The three classes are nearly balanced (2000, 2000, and 1937 samples), so no resampling is needed.
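As a quick check on the balance, the class shares can be computed from the counts printed above. This is only arithmetic on those three numbers; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"anger": 2000, "joy": 2000, "fear": 1937}

total = sum(counts.values())

# Share of each class in the whole dataset.
shares = {label: n / total for label, n in counts.items()}

# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.3f}")
```

A ratio this close to 1 is why plain accuracy is a reasonable headline metric in the reports that follow.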

Add a new column, "Emotion_num", that maps each emotion to a unique integer.

In [4]:
df['Emotion_num'] = df['Emotion'].map({'joy': 0, 'fear': 1, 'anger': 2})

df.head()

Unnamed: 0,Comment,Emotion,Emotion_num
0,i seriously hate one subject to death but now ...,fear,1
1,im so full of life i feel appalled,anger,2
2,i sit here to write i start to dig out my feel...,fear,1
3,ive been really angry with r and i feel like a...,joy,0
4,i feel suspicious if there is no one outside l...,fear,1
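Because the reports further down show only 0, 1, and 2, it can help to keep the inverse mapping around. The sketch below reuses the mapping from the cell above; the `predicted` list is made up for illustration.

```python
# Same mapping as in the cell above.
label_to_num = {"joy": 0, "fear": 1, "anger": 2}

# Inverse mapping, for turning numeric predictions back into emotion names.
num_to_label = {num: label for label, num in label_to_num.items()}

# Hypothetical predictions, just to show the decoding step.
predicted = [1, 2, 0]
decoded = [num_to_label[n] for n in predicted]
print(decoded)

# Names ordered by class number; a list like this can be passed as
# target_names to classification_report to label the rows.
target_names = [num_to_label[i] for i in sorted(num_to_label)]
print(target_names)
```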


# Modelling without pre-processing the data

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.Comment, 
    df.Emotion_num, 
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df.Emotion_num
)

In [6]:

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)

Shape of X_train:  (4749,)
Shape of X_test:  (1188,)
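The split sizes can be verified by hand. The sketch assumes the test size is rounded up, which matches the shapes printed above. The per-class figures are plain expected values, not the exact allocation rule used by the stratified split.

```python
import math

n_samples = 5937   # rows in the dataframe
test_size = 0.2    # same fraction as in train_test_split above

# Assumption: the test set size is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# With stratify, each class should contribute about 20% of its rows
# to the test set (the split has to round these to whole rows).
class_counts = {"joy": 2000, "fear": 1937, "anger": 2000}
for label, count in class_counts.items():
    print(label, round(test_size * count, 1))
```

These expected values line up with the support column (400, 388, 400) in the classification reports below.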


Random forest with trigrams only (ngram_range=(3, 3)).

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

clf = Pipeline([
    ('vectorizer_tri_grams', CountVectorizer(ngram_range=(3, 3))),  # trigrams only
    ('random_forest', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.59      0.27      0.37       400
           1       0.36      0.81      0.50       388
           2       0.56      0.20      0.29       400

    accuracy                           0.42      1188
   macro avg       0.51      0.42      0.39      1188
weighted avg       0.51      0.42      0.39      1188
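One plausible reason for the weak trigram-only score is that single emotion words never become features on their own. The sketch below imitates what `ngram_range` selects, using a plain whitespace split as a stand-in for the vectorizer's tokenizer (the real default tokenizer also drops one-character tokens such as "i"). The helper `word_ngrams` and the two sentences are made up for illustration.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, in order."""
    tokens = text.lower().split()  # simplification: whitespace tokenizer
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

a = "i feel so angry today"
b = "i feel so happy today"

tri_a = word_ngrams(a, 3, 3)     # like ngram_range=(3, 3)
tri_b = word_ngrams(b, 3, 3)
uni_bi_a = word_ngrams(a, 1, 2)  # like ngram_range=(1, 2)

print(tri_a)
print(sorted(set(tri_a) & set(tri_b)))  # trigrams the two sentences share
print("angry" in tri_a, "angry" in uni_bi_a)
```

With trigrams only, the word "angry" exists only inside longer phrases that rarely repeat across comments, whereas the (1, 2) range keeps it as a feature of its own.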




Use CountVectorizer with both unigrams and bigrams (ngram_range=(1, 2)),
and Multinomial Naive Bayes as the classifier.

In [10]:
from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('vectorizer_bi_grams', CountVectorizer(ngram_range = (1, 2))),                   
    ('Multi NB', MultinomialNB())        
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.86      0.87       400
           1       0.87      0.83      0.85       388
           2       0.83      0.88      0.85       400

    accuracy                           0.86      1188
   macro avg       0.86      0.86      0.86      1188
weighted avg       0.86      0.86      0.86      1188
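To make the classifier less of a black box, here is a minimal sketch of multinomial Naive Bayes with add-one smoothing (alpha=1.0, the MultinomialNB default). It works on whitespace unigrams only and is not scikit-learn's implementation; `train_nb`, `predict_nb`, and the six training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        token_counts[label].update(tokens)
        vocab.update(tokens)

    model = {}
    for label, n_class_docs in class_docs.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)  # smoothing adds alpha per vocab word
        model[label] = {
            "log_prior": math.log(n_class_docs / len(docs)),
            "log_likelihood": {
                t: math.log((token_counts[label][t] + alpha) / denom)
                for t in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for t in doc.split():
            # Words never seen in training are skipped.
            score += params["log_likelihood"].get(t, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data.
docs = [
    "i feel happy and glad", "so happy today",
    "i feel scared and afraid", "so afraid of the dark",
    "i am furious and angry", "so angry right now",
]
labels = ["joy", "joy", "fear", "fear", "anger", "anger"]

model = train_nb(docs, labels)
print(predict_nb(model, "i feel angry"))
print(predict_nb(model, "so happy"))
```

The pipeline above does the same kind of counting, but over the unigram and bigram columns produced by CountVectorizer.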



Use CountVectorizer with both unigrams and bigrams (ngram_range=(1, 2)),
and RandomForest as the classifier.

In [11]:

clf = Pipeline([
    ('vectorizer_bi_grams', CountVectorizer(ngram_range = (1, 2))),                   
    ('Random Forest', RandomForestClassifier())        
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.97      0.90       400
           1       0.95      0.88      0.91       388
           2       0.94      0.86      0.90       400

    accuracy                           0.90      1188
   macro avg       0.91      0.90      0.90      1188
weighted avg       0.91      0.90      0.90      1188



Use TfidfVectorizer (default settings, unigrams only) to vectorize the text,
and RandomForest as the classifier.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),                   
    ('Random Forest', RandomForestClassifier())        
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.87      0.94      0.90       400
           1       0.92      0.91      0.91       388
           2       0.93      0.86      0.89       400

    accuracy                           0.90      1188
   macro avg       0.91      0.90      0.90      1188
weighted avg       0.91      0.90      0.90      1188
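For intuition about what the TF-IDF step computes, the sketch below works through the weighting by hand. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is my understanding of the TfidfVectorizer defaults. The function `tfidf` and the three sentences are made up for illustration, and whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per document, L2-normalised."""
    tokenized = [d.split() for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents each token appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))

    # Smoothed inverse document frequency (assumed default formula).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical toy corpus.
docs = ["i feel angry", "i feel happy", "i feel afraid"]
vectors = tfidf(docs)

for token, weight in sorted(vectors[0].items()):
    print(f"{token}: {weight:.3f}")
```

Words that occur in every document ("i", "feel") get a lower weight than the word that is specific to one document ("angry"), which is the effect TF-IDF is meant to have.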



# Use text pre-processing to remove stop words and punctuation, and apply lemmatization

In [14]:
import spacy
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    # drop stop words and punctuation, keep the lemma of every remaining token
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [16]:
df['pre_processed_text'] = df['Comment'].apply(preprocess)
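To show the kind of transformation this step applies without needing the spaCy model, here is a toy stand-in. It is not equivalent to `preprocess`: the stop-word set and the lemma lookup are tiny hand-made assumptions, whereas spaCy uses its own stop list and lemmatizer.

```python
import string

# Hypothetical, deliberately tiny stand-ins for spaCy's resources.
STOP_WORDS = {"i", "am", "so", "and", "the", "were"}
LEMMAS = {"feeling": "feel", "dogs": "dog", "barking": "bark"}

def toy_preprocess(text):
    """Drop stop words and punctuation, map remaining words to a base form."""
    kept = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)  # remove surrounding punctuation
        if not token or token in STOP_WORDS:
            continue
        kept.append(LEMMAS.get(token, token))  # fall back to the word itself
    return " ".join(kept)

cleaned = toy_preprocess("I am feeling so angry, and the dogs were barking!")
print(cleaned)
```

The output is shorter and keeps mostly content words, which is the property the vectorizers in the next cells benefit from.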

Modelling with the pre-processed text: split again, this time on the pre_processed_text column.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    df.pre_processed_text, 
    df.Emotion_num, 
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df.Emotion_num
)

Use CountVectorizer with both unigrams and bigrams (ngram_range=(1, 2)),
and RandomForest as the classifier.


In [18]:

clf = Pipeline([
    ('vectorizer_bi_grams', CountVectorizer(ngram_range = (1, 2))),                   
    ('Random Forest', RandomForestClassifier())        
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.95      0.94       400
           1       0.95      0.91      0.93       388
           2       0.92      0.94      0.93       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188
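The f1-score column can be recomputed from the precision and recall columns as their harmonic mean. The sketch uses the rounded values from the report above, so on other reports the last digit may differ slightly from what scikit-learn prints from unrounded numbers.

```python
# (precision, recall) per class, copied from the report above.
report = {0: (0.93, 0.95), 1: (0.95, 0.91), 2: (0.92, 0.94)}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {label: round(f1(p, r), 2) for label, (p, r) in report.items()}

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = round(sum(f1(p, r) for p, r in report.values()) / len(report), 2)

print(f1_scores)
print(macro_f1)
```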



Use TfidfVectorizer (default settings, unigrams only) on the pre-processed text,
and RandomForest as the classifier.

In [19]:

clf = Pipeline([
    ('vectorizer_tfidf', TfidfVectorizer()),                   
    ('Random Forest', RandomForestClassifier())        
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.95      0.94       400
           1       0.93      0.93      0.93       388
           2       0.92      0.92      0.92       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188

