# 🧠 Emotion Classification from Text

Author: Roya – Growth Analyst & AI Enthusiast
Date: July 2025

This notebook presents an emotion classification model trained on labeled textual data to identify the emotional tone (e.g., joy, anger, sadness) in user-generated content. The model uses a classical NLP pipeline with CountVectorizer for feature extraction and Logistic Regression for classification.

Despite the simplicity of the approach, the model reaches a mean cross-validation accuracy of about 89% on the training split, demonstrating the strength of traditional NLP methods when paired with careful preprocessing and clean feature engineering.

This project is part of a broader journey to understand how AI can decode user sentiment at scale, particularly in growth marketing contexts, where interpreting customer emotions can directly influence retention strategies, personalization, and brand trust.

In [None]:
import pandas as pd
import numpy as np
import datasets

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer

# !pip install -U datasets


In [None]:
# `trust_remote_code` is no longer supported by `datasets`, and this dataset ships as Parquet files
emotions = datasets.load_dataset('emotion')


README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/127k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/129k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

In [None]:
emotions.set_format(type="pandas")
train = emotions["train"][:]
test = emotions["test"][:]
# print both shapes; a notebook cell only displays its last bare expression
print(train.shape, test.shape)


(16000, 2) (2000, 2)

In [None]:
train_text = train["text"]
y_train = train["label"]
test_text = test["text"]
y_test = test["label"]

In [None]:
vect = CountVectorizer().fit(train_text)
print(f"Vocabulary size: {len(vect.vocabulary_)}")
# print(f"Vocabulary content:\n {vect.vocabulary_}")


Vocabulary size: 15186
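To make the vocabulary step concrete, here is a minimal pure-Python sketch of what fitting a bag-of-words vocabulary amounts to. It assumes the default `CountVectorizer` behaviour (lowercasing, and tokens of two or more word characters); the helper names and the toy corpus are hypothetical and are not part of the notebook or of scikit-learn.

```python
import re

# Default word pattern of CountVectorizer: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


# Hypothetical toy corpus, not taken from the emotion dataset.
toy_docs = ["i feel happy today", "I feel so sad", "a happy happy day"]
toy_vocab = build_vocabulary(toy_docs)

print(f"Vocabulary size: {len(toy_vocab)}")
print(toy_vocab)
```

Single-character tokens such as "i" and "a" never reach the vocabulary under this pattern, which is one reason the vocabulary is smaller than a naive whitespace split would suggest.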


In [None]:

X_train = vect.transform(train_text)
X_test = vect.transform(test_text)

print(f"bag_of_words: {repr(X_train)}")


bag_of_words: <Compressed Sparse Row sparse matrix of dtype 'int64'
	with 249634 stored elements and shape (16000, 15186)>
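The sparse-matrix summary above already tells us how sparse the representation is. As a quick worked calculation using only the numbers printed there (the variable names are my own):

```python
# Numbers copied from the printed sparse-matrix summary.
n_rows, n_cols = 16000, 15186
stored = 249634

# Fraction of cells that hold a non-zero count.
density = stored / (n_rows * n_cols)

# Each stored element is one (document, distinct term) pair.
avg_terms_per_doc = stored / n_rows

print(f"Density: {density:.4%}")
print(f"Average distinct vocabulary terms per document: {avg_terms_per_doc:.1f}")
```

Roughly one cell in a thousand is non-zero, which is why a sparse format is the sensible storage choice here.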


In [None]:

scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=7)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))


Mean cross-validation accuracy: 0.89
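For intuition about what `cv=7` means on 16000 rows, the sketch below works out the fold sizes. It assumes plain, unstratified k-fold; for a classifier, `cross_val_score` with an integer `cv` uses stratified folds, so real fold sizes can differ slightly. The function name is hypothetical.

```python
def kfold_sizes(n_samples, n_splits):
    """Sizes of the validation folds when samples are split as evenly as possible."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds receive one additional sample.
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = kfold_sizes(16000, 7)
print(sizes)
print(f"Each model trains on about {16000 - sizes[0]} rows and validates on about {sizes[0]}")
```

The reported score is the mean of the seven per-fold accuracies, so it estimates performance on unseen data without touching the test split.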


## Improving Tokenization
Setting `min_df=5` keeps only tokens that appear in at least five documents, which reduces the number of features from 15186 to 3421. The model's accuracy does not improve, but cutting the number of features by 77% without losing accuracy is a worthwhile result.

In [None]:
vect = CountVectorizer(min_df=5).fit(train_text)
X_train = vect.transform(train_text)
print("X_train with min_df: {}".format(repr(X_train)))


X_train with min_df: <Compressed Sparse Row sparse matrix of dtype 'int64'
	with 231641 stored elements and shape (16000, 3421)>


In [None]:

scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=7)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))


Mean cross-validation accuracy: 0.89
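The effect of `min_df` can be sketched in a few lines of plain Python: count in how many documents each token occurs, then keep tokens at or above the threshold. The helper names and toy corpus are hypothetical, and the tokenizer assumes the default `CountVectorizer` pattern; the last lines simply redo the reduction arithmetic from the two reported feature counts.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def document_frequencies(documents):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        # A set ensures each token counts once per document.
        df.update(set(TOKEN_PATTERN.findall(doc.lower())))
    return df


def prune_vocabulary(documents, min_df):
    """Keep tokens that occur in at least `min_df` documents."""
    df = document_frequencies(documents)
    return sorted(tok for tok, count in df.items() if count >= min_df)


# Hypothetical toy corpus.
toy_docs = ["feel happy", "feel sad", "feel happy again", "so happy"]
kept = prune_vocabulary(toy_docs, min_df=2)
print(kept)

# Reduction reported in this section: 15186 features down to 3421.
reduction = 1 - 3421 / 15186
print(f"Feature reduction: {reduction:.0%}")
```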


## Stop Words

In [None]:
vect = CountVectorizer(min_df=5, stop_words="english").fit(train_text)
X_train = vect.transform(train_text)
print("X_train without stop words:\n{}".format(repr(X_train)))


X_train without stop words:
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 112573 stored elements and shape (16000, 3152)>


In [None]:

scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=7)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))


Mean cross-validation accuracy: 0.89
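Comparing the two sparse-matrix summaries shows what stop-word removal changes: few columns disappear, but a large share of the stored counts does. The sketch below redoes that arithmetic and shows the filtering idea on one sentence. The stop list here is a tiny hypothetical one, not scikit-learn's built-in English list.

```python
# Tiny hypothetical stop list for illustration only.
STOP_WORDS = {"the", "and", "to", "so", "am", "that"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]


print(remove_stop_words(["am", "so", "happy", "to", "see", "that"]))

# Numbers copied from the min_df=5 outputs with and without stop words.
features_removed = 3421 - 3152
stored_reduction = 1 - 112573 / 231641

print(f"Features removed: {features_removed}")
print(f"Stored elements removed: {stored_reduction:.0%}")
```

Only a few hundred columns are dropped, yet about half of the non-zero entries go with them, because stop words occur in so many documents. Accuracy stays at 0.89.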


## TF-IDF
With tf-idf, the best cross-validation score is slightly worse: 0.88 versus 0.89 with raw counts.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(min_df=5, stop_words="english", norm=None), LogisticRegression())
param_grid = {'logisticregression__C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(pipe, param_grid, cv=5)

grid.fit(train_text, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))


Best cross-validation score: 0.88
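To show what the tf-idf weights look like with `norm=None`, here is a small pure-Python sketch. It assumes the smoothed idf that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with the raw term count as tf and no row normalization. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def tfidf_unnormalized(tokenized_docs):
    """Return one {token: tf * idf} dict per document, without normalization."""
    n_docs = len(tokenized_docs)
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: smooth_idf(n_docs, count) for tok, count in df.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(doc).items()}
        for doc in tokenized_docs
    ]


# Hypothetical, already tokenized toy documents.
toy = [["feel", "happy"], ["feel", "sad"], ["feel", "happy", "happy"]]
weights = tfidf_unnormalized(toy)
for row in weights:
    print({tok: round(value, 3) for tok, value in row.items()})
```

A token that appears in every document ("feel") gets the minimum weight of 1 per occurrence, while rarer tokens are weighted up. Without normalization, longer or more repetitive documents produce larger feature values.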


## Bag-of-Words with More Than One Word (n-Grams)


In [None]:

cv = CountVectorizer(ngram_range=(1, 3)).fit(train_text)
print("Vocabulary size: {}".format(len(cv.vocabulary_)))
print("Vocabulary:\n{}".format(cv.get_feature_names_out()))


Vocabulary size: 323027
Vocabulary:
['aa' 'aa full' 'aa full force' ... 'zz' 'zz top' 'zz top logo']
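The jump to 323027 features comes from adding every contiguous two-word and three-word sequence to the vocabulary. A minimal sketch of how word n-grams are generated from a token list (the function name is hypothetical; the example tokens come from the last entries of the vocabulary printed above):

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous n-grams for n in [min_n, max_n], joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


example = word_ngrams(["zz", "top", "logo"])
print(example)
# ['zz', 'top', 'logo', 'zz top', 'top logo', 'zz top logo']
```

A document with k tokens contributes up to k + (k - 1) + (k - 2) n-grams, and most longer n-grams are unique to a single document, which is why the vocabulary grows so quickly.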


In [None]:

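# Note: X_train is still the min_df=5, stop-word-filtered matrix from the Stop Words section.
# The n-gram vectorizer `cv` is fitted above but never used to transform train_text,
# so this score does not reflect the n-gram features.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=7)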
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))


Mean cross-validation accuracy: 0.89


## Advanced Tokenization, Stemming, and Lemmatization


In [None]:
import spacy
import nltk
# load spacy's English-language models
en_nlp = spacy.load("en_core_web_sm")
# instantiate nltk's Porter stemmer
stemmer = nltk.stem.PorterStemmer()

In [None]:
def custom_tokenizer(document):
    # run the spaCy pipeline on the document and return each token's lemma
    doc_spacy = en_nlp(document)
    return [token.lemma_ for token in doc_spacy]

# define a count vectorizer with the custom tokenizer
lemma_vect = CountVectorizer(tokenizer=custom_tokenizer, min_df=5)
X_train_lemma = lemma_vect.fit_transform(train_text)
print("X_train_lemma.shape: {}".format(X_train_lemma.shape))



In [None]:

scores = cross_val_score(LogisticRegression(max_iter=1000), X_train_lemma, y_train, cv=7)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))


Mean cross-validation accuracy: 0.87
