In [3]:
import pandas as pd
from google.colab import files

uploaded = files.upload()


Saving nlp_dataset.csv to nlp_dataset.csv


In [4]:
df = pd.read_csv("/content/nlp_dataset.csv")

# Show first few rows
print(df.head())

                                             Comment Emotion
0  i seriously hate one subject to death but now ...    fear
1                 im so full of life i feel appalled   anger
2  i sit here to write i start to dig out my feel...    fear
3  ive been really angry with r and i feel like a...     joy
4  i feel suspicious if there is no one outside l...    fear


In [10]:
# Convert every comment to lowercase
df['Comment'] = df['Comment'].str.lower()
print(df)
print("count", df['Comment'].count())

                                                Comment Emotion
0     i seriously hate one subject to death but now ...    fear
1                    im so full of life i feel appalled   anger
2     i sit here to write i start to dig out my feel...    fear
3     ive been really angry with r and i feel like a...     joy
4     i feel suspicious if there is no one outside l...    fear
...                                                 ...     ...
5932                 i begun to feel distressed for you    fear
5933  i left feeling annoyed and angry thinking that...   anger
5934  i were to ever get married i d have everything...     joy
5935  i feel reluctant in applying there because i w...    fear
5936  i just wanted to apologize to you because i fe...   anger

[5937 rows x 2 columns]
count 5937


In [11]:
# Remove URLs (anything starting with http or www)
df['Comment'] = df['Comment'].str.replace(r'http\S+|www\S+', '', regex=True)
print(df)
print("count", df['Comment'].count())

                                                Comment Emotion
0     i seriously hate one subject to death but now ...    fear
1                    im so full of life i feel appalled   anger
2     i sit here to write i start to dig out my feel...    fear
3     ive been really angry with r and i feel like a...     joy
4     i feel suspicious if there is no one outside l...    fear
...                                                 ...     ...
5932                 i begun to feel distressed for you    fear
5933  i left feeling annoyed and angry thinking that...   anger
5934  i were to ever get married i d have everything...     joy
5935  i feel reluctant in applying there because i w...    fear
5936  i just wanted to apologize to you because i fe...   anger

[5937 rows x 2 columns]
count 5937


In [13]:
# Remove special characters and digits, keeping only lowercase letters and whitespace
df['Comment'] = df['Comment'].str.replace(r'[^a-z\s]', '', regex=True)
print(df)
print("count", df['Comment'].count())

                                                Comment Emotion
0     i seriously hate one subject to death but now ...    fear
1                    im so full of life i feel appalled   anger
2     i sit here to write i start to dig out my feel...    fear
3     ive been really angry with r and i feel like a...     joy
4     i feel suspicious if there is no one outside l...    fear
...                                                 ...     ...
5932                 i begun to feel distressed for you    fear
5933  i left feeling annoyed and angry thinking that...   anger
5934  i were to ever get married i d have everything...     joy
5935  i feel reluctant in applying there because i w...    fear
5936  i just wanted to apologize to you because i fe...   anger

[5937 rows x 2 columns]
count 5937


In [15]:
# Split each comment into a list of word tokens
df['tokens'] = df['Comment'].str.split()
df['tokens']

      tokens
0     [i, seriously, hate, one, subject, to, death, ...
1     [im, so, full, of, life, i, feel, appalled]
2     [i, sit, here, to, write, i, start, to, dig, o...
3     [ive, been, really, angry, with, r, and, i, fe...
4     [i, feel, suspicious, if, there, is, no, one, ...
...   ...
5932  [i, begun, to, feel, distressed, for, you]
5933  [i, left, feeling, annoyed, and, angry, thinki...
5934  [i, were, to, ever, get, married, i, d, have, ...
5935  [i, feel, reluctant, in, applying, there, beca...


In [18]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Remove common English stopwords such as "is", "the", and "and"
df['tokens'] = df['tokens'].apply(
    lambda words: [w for w in words if w not in stop_words]
)
print(df['tokens'])

0       [seriously, hate, one, subject, death, feel, r...
1                        [im, full, life, feel, appalled]
2       [sit, write, start, dig, feelings, think, afra...
3       [ive, really, angry, r, feel, like, idiot, tru...
4       [feel, suspicious, one, outside, like, rapture...
                              ...                        
5932                            [begun, feel, distressed]
5933    [left, feeling, annoyed, angry, thinking, cent...
5934    [ever, get, married, everything, ready, offer,...
5935    [feel, reluctant, applying, want, able, find, ...
5936    [wanted, apologize, feel, like, heartless, bitch]
Name: tokens, Length: 5937, dtype: object


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
# Join the cleaned tokens back into a single string per comment
df['cleaned_comment'] = df['tokens'].str.join(' ')
print(df['cleaned_comment'])

0       seriously hate one subject death feel reluctan...
1                              im full life feel appalled
2       sit write start dig feelings think afraid acce...
3       ive really angry r feel like idiot trusting fi...
4       feel suspicious one outside like rapture happe...
                              ...                        
5932                                begun feel distressed
5933    left feeling annoyed angry thinking center stu...
5934    ever get married everything ready offer got to...
5935    feel reluctant applying want able find company...
5936           wanted apologize feel like heartless bitch
Name: cleaned_comment, Length: 5937, dtype: object


The dataset was loaded into Google Colab and has two columns, Comment and Emotion, with 5,937 rows. All text was converted to lowercase so that the same word in different cases maps to a single token. URLs, digits, and special characters were then removed as noise. Each comment was tokenized by splitting it on whitespace, and stopwords such as “the”, “is”, and “and” were dropped because they carry little emotional signal. Together these steps shrink the vocabulary, remove noise, and keep the words most likely to indicate an emotion, which generally helps the downstream classifier.
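The preprocessing steps described above can be sketched as one plain-Python function. This is a minimal illustration rather than the notebook's actual code: the function name `clean_comment` and the tiny `STOP_WORDS` set are assumptions made for the example, whereas the notebook uses NLTK's full English stopword list.

```python
import re

# Illustrative stopword subset (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"i", "to", "the", "is", "and", "for", "you", "of", "so"}


def clean_comment(text, stop_words=STOP_WORDS):
    """Apply the notebook's cleaning steps to a single comment."""
    text = text.lower()                          # 1. lowercase
    text = re.sub(r"http\S+|www\S+", "", text)   # 2. strip URLs
    text = re.sub(r"[^a-z\s]", "", text)         # 3. keep letters and whitespace only
    tokens = text.split()                        # 4. whitespace tokenization
    tokens = [w for w in tokens if w not in stop_words]  # 5. drop stopwords
    return " ".join(tokens)                      # 6. rebuild the cleaned string


print(clean_comment("I begun to feel distressed for you! http://example.com 123"))
# prints: begun feel distressed
```

The same regular expressions are used here as in the cells above, so the function could be applied to a column with `df['Comment'].apply(clean_comment)` if a single-pass version were preferred.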

## **Feature Extraction**

In [22]:
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_comment'],
    df['Emotion'],
    test_size=0.2,
    random_state=42
)

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer


In [23]:
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

Feature extraction was performed with scikit-learn's TfidfVectorizer. TF-IDF converts text into numerical features by weighting each word according to how important it is to a comment relative to the whole corpus. Words that occur often in a comment but rarely across all comments receive higher weights, while words that appear in most comments receive lower weights. The vectorizer is fitted on the training comments only and then applied to the test comments, so no information from the test set leaks into the features. This representation helps the model focus on emotionally significant words.
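To make the weighting concrete, the sketch below computes TF-IDF by hand on three toy comments. It assumes the formula that TfidfVectorizer applies with its default settings, a smoothed IDF of ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row; the toy corpus and the helper name `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (assumption): "feel" appears in every comment, the other words in one.
docs = ["feel angry", "feel afraid", "feel happy happy"]


def tfidf(docs):
    """Return (idf, rows), where rows holds one {term: weight} dict per document."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)                      # raw term counts
        raw = {t: tf[t] * idf[t] for t in tf}     # tf * idf
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})  # L2-normalize
    return idf, rows


idf, rows = tfidf(docs)
print(idf["feel"], round(idf["angry"], 3))
print({t: round(w, 3) for t, w in rows[0].items()})
```

Because "feel" occurs in every toy comment, its IDF is the minimum value of 1.0, so in the first comment the rarer word "angry" ends up with the larger weight.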

In [24]:
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train_tfidf, y_train)


A Multinomial Naive Bayes classifier is the first emotion classification model. It is trained on the TF-IDF features of the training comments together with their emotion labels.

In [25]:
from sklearn.svm import LinearSVC

svm_model = LinearSVC()
svm_model.fit(X_train_tfidf, y_train)


A linear Support Vector Machine (scikit-learn's LinearSVC) is the second emotion classification model. It is trained on the same TF-IDF features and labels as the Naive Bayes model.
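As a possible alternative to fitting the vectorizer and each classifier in separate cells, the two models can be wrapped in scikit-learn pipelines so that vectorization and classification are fitted together. The sketch below is only an illustration on a made-up six-comment corpus; the `texts`, `labels`, and `models` names are assumptions and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set (assumption), already cleaned like 'cleaned_comment'.
texts = [
    "feel afraid dark", "feel scared alone",
    "feel angry annoyed", "feel furious angry",
    "feel happy joyful", "feel glad happy",
]
labels = ["fear", "fear", "anger", "anger", "joy", "joy"]

# Each pipeline fits its own TF-IDF vocabulary, then the classifier on top of it.
models = {
    "Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

predictions = {}
for name, model in models.items():
    model.fit(texts, labels)
    # predict() takes raw strings; the pipeline vectorizes them internally.
    predictions[name] = model.predict(["feel scared afraid"])[0]

print(predictions)
```

A pipeline keeps the fitted vocabulary and the classifier in one object, which makes it harder to transform test data with the wrong vectorizer by accident.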

In [26]:
from sklearn.metrics import accuracy_score, f1_score

nb_pred = nb_model.predict(X_test_tfidf)

nb_accuracy = accuracy_score(y_test, nb_pred)
nb_f1 = f1_score(y_test, nb_pred, average='weighted')

print("Naive Bayes Accuracy:", nb_accuracy)
print("Naive Bayes F1-score:", nb_f1)


Naive Bayes Accuracy: 0.9082491582491582
Naive Bayes F1-score: 0.9081097912623348


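The Naive Bayes model was evaluated on the held-out test set using accuracy and F1-score. Accuracy is the fraction of test comments whose predicted emotion matches the true label. The F1-score is the harmonic mean of precision and recall; with average='weighted' it is computed per emotion class and then averaged with weights proportional to the number of test samples in each class.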
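The two metrics can be worked through by hand on a small example. The sketch below reimplements accuracy and the weighted F1-score in plain Python to show the arithmetic; the six labels and the helper names `accuracy` and `weighted_f1` are made up for illustration and are not the notebook's data or code.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n  # weight by support
    return total / len(y_true)


# Made-up labels (assumption) for illustration only.
y_true = ["fear", "fear", "anger", "joy", "joy", "joy"]
y_pred = ["fear", "anger", "anger", "joy", "joy", "fear"]

print(round(accuracy(y_true, y_pred), 4))     # 4 of 6 correct
print(round(weighted_f1(y_true, y_pred), 4))  # per-class F1: fear 0.5, anger 0.667, joy 0.8
```

Here the per-class F1 values are weighted by supports of 2, 1, and 3, giving (0.5·2 + 0.667·1 + 0.8·3) / 6 ≈ 0.678, slightly above the accuracy of 0.667.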

In [27]:
svm_pred = svm_model.predict(X_test_tfidf)

svm_accuracy = accuracy_score(y_test, svm_pred)
svm_f1 = f1_score(y_test, svm_pred, average='weighted')

print("SVM Accuracy:", svm_accuracy)
print("SVM F1-score:", svm_f1)


SVM Accuracy: 0.9528619528619529
SVM F1-score: 0.9528275140277518


The SVM model was evaluated on the same test set with the same two metrics, accuracy and weighted F1-score, so its results are directly comparable to those of Naive Bayes.

In [28]:
import pandas as pd

results = pd.DataFrame({
    'Model': ['Naive Bayes', 'SVM'],
    'Accuracy': [nb_accuracy, svm_accuracy],
    'F1-score': [nb_f1, svm_f1]
})

results


   Model        Accuracy  F1-score
0  Naive Bayes  0.908249  0.90811
1  SVM          0.952862  0.952828


The Naive Bayes and Support Vector Machine (SVM) models were compared on accuracy and weighted F1-score. Naive Bayes achieved an accuracy of 90.82% and an F1-score of 0.908, while SVM reached an accuracy of 95.29% and an F1-score of 0.953. Because SVM scores higher on both metrics on this test split, it is the more suitable of the two models for this emotion classification task.