
Emotion Classification Using NLP Techniques (Bag-of-Words & TF-IDF)

This project classifies emotions in Twitter messages using the Kaggle Emotion Classification dataset. It extracts features with two text representations, Bag-of-Words (BoW) and TF-IDF, and trains a separate K-Nearest Neighbors (KNN) classifier on each to predict one of six emotions: sadness, joy, love, anger, fear, or surprise.

In [18]:
#!pip install pandas
#!pip install scikit-learn
#!pip install kagglehub

In [19]:
import os

import kagglehub
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# TensorFlow is not imported in this notebook, so this setting has no effect here.
# If TensorFlow were loaded, '2' would hide its INFO and WARNING log messages.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

In [20]:
# Download dataset from Kaggle
path = kagglehub.dataset_download("bhavikjikadara/emotions-dataset")
print("Path to dataset files:", path)

# Load the dataset
dataset_path = os.path.join(path, "emotions.csv")
dataset = pd.read_csv(dataset_path)

# Display dataset information
print("Dataset Shape:", dataset.shape)
print("Columns:", dataset.columns)
print("\nFirst few rows of the dataset:")
print(dataset.head())

Path to dataset files: /root/.cache/kagglehub/datasets/bhavikjikadara/emotions-dataset/versions/1
Dataset Shape: (416809, 2)
Columns: Index(['text', 'label'], dtype='object')

First few rows of the dataset:
                                                text  label
0      i just feel really helpless and heavy hearted      4
1  ive enjoyed being able to slouch about relax a...      0
2  i gave up my internship with the dmrg and am f...      4
3                         i dont know i feel so lost      0
4  i am a kindergarten teacher and i am thoroughl...      4


In [21]:
dataset

                                                     text  label
0           i just feel really helpless and heavy hearted      4
1       ive enjoyed being able to slouch about relax a...      0
2       i gave up my internship with the dmrg and am f...      4
3                              i dont know i feel so lost      0
4       i am a kindergarten teacher and i am thoroughl...      4
...                                                   ...    ...
416804  i feel like telling these horny devils to find...      2
416805  i began to realize that when i was feeling agi...      3
416806  i feel very curious be why previous early dawn...      5
416807  i feel that becuase of the tyranical nature of...      3


In [22]:
# Check for missing values
print("\nMissing values:", dataset.isnull().sum())


Missing values: text     0
label    0
dtype: int64


In [23]:
# Preprocessing: drop rows with missing values
# (the check above found none, so this is only a safeguard)
dataset.dropna(inplace=True)

In [24]:
# Extract features (text) and labels (emotion)
X = dataset["text"]
y = dataset["label"]  # Emotions: 0=sadness, 1=joy, 2=love, 3=anger, 4=fear, 5=surprise
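
The integer labels are easier to read once mapped to emotion names. The sketch below assumes the mapping given in the comment above; `EMOTION_NAMES`, `label_distribution`, and the toy labels are illustrative names and values, not part of the original notebook. It also reports the share of each class, which helps when interpreting accuracy later.

```python
import pandas as pd

# Mapping taken from the comment in the cell above
EMOTION_NAMES = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}


def label_distribution(labels: pd.Series) -> pd.Series:
    """Return the share of each emotion, indexed by emotion name."""
    return labels.map(EMOTION_NAMES).value_counts(normalize=True)


# Toy labels for illustration only (not drawn from the real dataset)
toy_labels = pd.Series([4, 0, 4, 0, 1, 1, 1, 2])
toy_distribution = label_distribution(toy_labels)
print(toy_distribution)
```

On the real data, the same helper could be called as `label_distribution(y)`.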

In [25]:
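# Bag-of-Words (BoW) approach
# Note: the vectorizer is fit on the full dataset, before the train/test split,
# so the vocabulary also includes words that occur only in test rows.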
print("\n--- Bag-of-Words (BoW) ---")
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(X)

# Display vocabulary size and some sample words
print(f"There are {len(vectorizer_bow.get_feature_names_out())} unique words in the dataset.")
print("Sample vocabulary:", vectorizer_bow.get_feature_names_out()[:10])


--- Bag-of-Words (BoW) ---
There are 75276 unique words in the dataset.
Sample vocabulary: ['aa' 'aaa' 'aaaa' 'aaaaaaaaaaaaaaaaggghhhh'
 'aaaaaaaaaaaaaaarrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrggggggggggggggggggggggggggggghhhhhhhhhhhhhh'
 'aaaaaaaall' 'aaaaaaand' 'aaaaaand' 'aaaaah' 'aaaaahhhhhh']


In [26]:
# Train and evaluate a KNN model using BoW features
X_train_bow, X_test_bow, y_train, y_test = train_test_split(X_bow, y, test_size=0.3, random_state=42)

model_bow = KNeighborsClassifier(n_neighbors=3)
model_bow.fit(X_train_bow, y_train)
y_pred_bow = model_bow.predict(X_test_bow)

accuracy_bow = accuracy_score(y_test, y_pred_bow)
print("BoW Model Accuracy:", accuracy_bow)

BoW Model Accuracy: 0.5381108898538902


In [27]:
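# TF-IDF approach
# Note: as with BoW, the vectorizer is fit on the full dataset, so the IDF
# weights are computed from test rows as well as training rows.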
print("\n--- TF-IDF ---")
vectorizer_tfidf = TfidfVectorizer()
X_tfidf = vectorizer_tfidf.fit_transform(X)

# Display vocabulary size and some sample words
print(f"There are {len(vectorizer_tfidf.get_feature_names_out())} unique words in the dataset.")
print("Sample vocabulary:", vectorizer_tfidf.get_feature_names_out()[:10])


--- TF-IDF ---
There are 75276 unique words in the dataset.
Sample vocabulary: ['aa' 'aaa' 'aaaa' 'aaaaaaaaaaaaaaaaggghhhh'
 'aaaaaaaaaaaaaaarrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrggggggggggggggggggggggggggggghhhhhhhhhhhhhh'
 'aaaaaaaall' 'aaaaaaand' 'aaaaaand' 'aaaaah' 'aaaaahhhhhh']


In [28]:
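# Train and evaluate a KNN model using TF-IDF features
# The same random_state and row count reproduce the BoW split, so the
# y_train and y_test from the BoW cell are reused and the labels returned here are discarded.
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(X_tfidf, y, test_size=0.3, random_state=42)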

model_tfidf = KNeighborsClassifier(n_neighbors=3)
model_tfidf.fit(X_train_tfidf, y_train)
y_pred_tfidf = model_tfidf.predict(X_test_tfidf)

accuracy_tfidf = accuracy_score(y_test, y_pred_tfidf)
print("TF-IDF Model Accuracy:", accuracy_tfidf)

TF-IDF Model Accuracy: 0.7044616651871756
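
Accuracy is a single number and can hide weak performance on rarer emotions. A hedged sketch of a per-class view follows, using scikit-learn's `confusion_matrix` and macro-averaged `f1_score` on made-up toy predictions (the arrays and variable names are illustrative, not results from this notebook).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels and predictions for illustration only
toy_true = [0, 0, 1, 1, 1, 4]
toy_pred = [0, 1, 1, 1, 1, 1]
toy_classes = [0, 1, 4]

toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes
toy_confusion = confusion_matrix(toy_true, toy_pred, labels=toy_classes)

# Macro F1 averages the per-class F1 scores with equal weight,
# so a class that is never predicted pulls the score down.
toy_macro_f1 = f1_score(
    toy_true, toy_pred, labels=toy_classes, average="macro", zero_division=0
)

print("Accuracy:", toy_accuracy)
print("Confusion matrix:\n", toy_confusion)
print("Macro F1:", toy_macro_f1)
```

On the real models, the same calls could be applied to `y_test` with `y_pred_bow` or `y_pred_tfidf`.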
