# <span style="color : Purple" > **NLP - Emotion Classification** </span>

## <span style="color : maroon" > Objective </span>

### **Develop machine learning models to classify emotions in text samples.**

### Models used : 
* <span style="color : green" > **Naive Bayes** </span>
* <span style="color : green" > **Support Vector Machine** </span>


In [22]:
import nltk

In [23]:
import pandas as pd
import numpy as np
import re
import string
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

## <span style="color : maroon" > Loading and Preprocessing </span>

In [24]:
df=pd.read_csv("nlp_dataset.csv")
print(df)

print("\nData Loaded Successfully")

                                                Comment Emotion
0     i seriously hate one subject to death but now ...    fear
1                    im so full of life i feel appalled   anger
2     i sit here to write i start to dig out my feel...    fear
3     ive been really angry with r and i feel like a...     joy
4     i feel suspicious if there is no one outside l...    fear
...                                                 ...     ...
5932                 i begun to feel distressed for you    fear
5933  i left feeling annoyed and angry thinking that...   anger
5934  i were to ever get married i d have everything...     joy
5935  i feel reluctant in applying there because i w...    fear
5936  i just wanted to apologize to you because i fe...   anger

[5937 rows x 2 columns]

Data Loaded Successfully


In [25]:
print("First few rows of the dataset are :\n")
df.head()

First few rows of the dataset are :



Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [26]:
print("Last few rows of the dataset are :\n")
df.tail()

Last few rows of the dataset are :



Unnamed: 0,Comment,Emotion
5932,i begun to feel distressed for you,fear
5933,i left feeling annoyed and angry thinking that...,anger
5934,i were to ever get married i d have everything...,joy
5935,i feel reluctant in applying there because i w...,fear
5936,i just wanted to apologize to you because i fe...,anger


In [27]:
df.shape

(5937, 2)

#### There are 5937 rows and 2 columns in the dataset.

In [28]:
df.describe()

Unnamed: 0,Comment,Emotion
count,5937,5937
unique,5934,3
top,i feel like a tortured artist when i talk to her,anger
freq,2,2000


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5937 entries, 0 to 5936
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Comment  5937 non-null   object
 1   Emotion  5937 non-null   object
dtypes: object(2)
memory usage: 92.9+ KB


In [30]:
missing_values=df.isnull().sum()
print("Missing values in the dataframe :\n", missing_values)

Missing values in the dataframe :
 Comment    0
Emotion    0
dtype: int64


#### There are no missing values in the dataset.

In [31]:
df.drop_duplicates(inplace=True)

In [32]:
df.columns

Index(['Comment', 'Emotion'], dtype='object')

#### There are two columns : Comment and Emotion.

## Text Cleaning

In [33]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gokul\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gokul\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Remove punctuation and numbers

In [34]:
cleaned_text = []
for text in df['Comment']:
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    cleaned_text.append(text)

### Convert to lower case

In [35]:
cleaned_text = [text.lower() for text in cleaned_text]

## Tokenization

### Tokenization 

In [36]:
tokenized_text = [word_tokenize(text) for text in cleaned_text]

Tokenization is the process of breaking down a piece of text into smaller units called "tokens." These tokens can be words, phrases, or even individual characters, depending on the level of granularity desired. Tokenization is a crucial step in text preprocessing, especially in natural language processing (NLP) tasks, as it transforms text into a format that can be more easily analyzed and processed by machine learning models.

### Stopwords removal

In [37]:
stop_words = set(stopwords.words('english'))
filtered_text = []
for tokens in tokenized_text:
    filtered_text.append([word for word in tokens if word not in stop_words])

Stopword removal is the process of eliminating common words from a text that are often considered unimportant for certain natural language processing (NLP) tasks. These words, known as "stopwords," include frequently used words. It reduce noises, improve efficiency and enhance model performance.

In [38]:
final_text = [' '.join(tokens) for tokens in filtered_text]

# Add cleaned text to the dataframe
df['cleaned_text'] = final_text
print(df.head)

<bound method NDFrame.head of                                                 Comment Emotion  \
0     i seriously hate one subject to death but now ...    fear   
1                    im so full of life i feel appalled   anger   
2     i sit here to write i start to dig out my feel...    fear   
3     ive been really angry with r and i feel like a...     joy   
4     i feel suspicious if there is no one outside l...    fear   
...                                                 ...     ...   
5932                 i begun to feel distressed for you    fear   
5933  i left feeling annoyed and angry thinking that...   anger   
5934  i were to ever get married i d have everything...     joy   
5935  i feel reluctant in applying there because i w...    fear   
5936  i just wanted to apologize to you because i fe...   anger   

                                           cleaned_text  
0     seriously hate one subject death feel reluctan...  
1                            im full life feel ap

## <span style="color : maroon" > Feature Extraction </span>

Feature extraction is the process of transforming raw data into a set of features that can be effectively used by machine learning models. The goal is to capture the most important information in the data while reducing its complexity, making it easier for algorithms to process and learn from the data.
In this model, we are using TfidfVectorizer.

In [39]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform the cleaned text
X = vectorizer.fit_transform(df['cleaned_text']).toarray()

# Extract target variable
y = df['Emotion']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## <span style="color : maroon" > Model Development </span>

### **Naive Bayes**

In [40]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

y_pred_nb = nb_model.predict(X_test)

accuracy_nb = accuracy_score(y_test, y_pred_nb)
f1_nb = f1_score(y_test, y_pred_nb, average='weighted')

print("Naive Bayes Accuracy:",accuracy_nb)
print("Naive Bayes F1-Score: ",f1_nb)

Naive Bayes Accuracy: 0.9132996632996633
Naive Bayes F1-Score:  0.9133716011282641


#### Naive Bayes 
* Naive Bayes is easy to implement and understand.
* Naive Bayes performs well with high-dimensional data.
* It is particularly effective with discrete features, making it a natural choice for text classification.
* It requires less preprocessing of data compared to other algorithms.

### **Support Vector Machine**

In [41]:
from sklearn.svm import SVC

svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

accuracy_svm = accuracy_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm, average='weighted')

print("Support Vector Machine Accuracy:",accuracy_svm)
print("Support Vector Machine F1-Score: ",f1_svm)

Support Vector Machine Accuracy: 0.9461279461279462
Support Vector Machine F1-Score:  0.9460903357272678


#### Support Vector Machine
* SVM is a powerful supervised learning algorithm used primarily for classification tasks.
* Effective in Binary and Multiclass Classification.
* SVM tends to be less prone to overfitting.
* It is based on solid mathematical principles of optimization, ensuring reliable performance and predictability.

## <span style="color : maroon" > Model Comparison </span>

In [42]:
model_comparison = pd.DataFrame({
    'Model': ['Naive Bayes', 'Support Vector Machine'],
    'Accuracy': [accuracy_nb, accuracy_svm],
    'F1 Score': [f1_nb, f1_svm]})
print(model_comparison)

                    Model  Accuracy  F1 Score
0             Naive Bayes  0.913300  0.913372
1  Support Vector Machine  0.946128  0.946090


#### Accuracy of Support vector machine(94.61%) is more than that of Naive Bayes(91.33%). Similarly, F1 score of Support vector machine(94.6%) is more than that of Naive bayes(91.33%). Hence, best model is Support vector machine.