 <h1 style="color:blue">NLP - Emotion Classification</h1>

<h3 style>
    
    1. Loading and Preprocessing
    2. Feature Extraction
    3. Model Development
    4. Model Comparison
</h3>



 <h4  style="color:green;"> 1. Loading and Preprocessing </h4>

In [1]:
import pandas as pd
import numpy as np
import re
import string
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
# Load the dataset
df = pd.read_csv("nlp_dataset.csv")
print(df)

                                                Comment Emotion
0     i seriously hate one subject to death but now ...    fear
1                    im so full of life i feel appalled   anger
2     i sit here to write i start to dig out my feel...    fear
3     ive been really angry with r and i feel like a...     joy
4     i feel suspicious if there is no one outside l...    fear
...                                                 ...     ...
5932                 i begun to feel distressed for you    fear
5933  i left feeling annoyed and angry thinking that...   anger
5934  i were to ever get married i d have everything...     joy
5935  i feel reluctant in applying there because i w...    fear
5936  i just wanted to apologize to you because i fe...   anger

[5937 rows x 2 columns]


In [3]:
df.shape

(5937, 2)

In [4]:
# Display basic information about the dataset
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5937 entries, 0 to 5936
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Comment  5937 non-null   object
 1   Emotion  5937 non-null   object
dtypes: object(2)
memory usage: 92.9+ KB
None


In [5]:
# Display the first few rows of the dataset
df.head()

                                             Comment Emotion
0  i seriously hate one subject to death but now ...    fear
1                 im so full of life i feel appalled   anger
2  i sit here to write i start to dig out my feel...    fear
3  ive been really angry with r and i feel like a...     joy
4  i feel suspicious if there is no one outside l...    fear


In [6]:
# Display the last 10 rows of the dataset
df.tail(10)

                                                Comment Emotion
5927  i have never done anything to make her cry or ...    fear
5928  i feel angry because i have led myself to lead...   anger
5929  i mean weve been friends for a long time and t...   anger
5930  i think we often feel this way about planting ...    fear
5931  i have lost touch with the things that i feel ...     joy
5932                 i begun to feel distressed for you    fear
5933  i left feeling annoyed and angry thinking that...   anger
5934  i were to ever get married i d have everything...     joy
5935  i feel reluctant in applying there because i w...    fear
5936  i just wanted to apologize to you because i fe...   anger


In [7]:
# Display summary statistics
print(df.describe())


                                                 Comment Emotion
count                                               5937    5937
unique                                              5934       3
top     i feel like a tortured artist when i talk to her   anger
freq                                                   2    2000


In [8]:
df.columns

Index(['Comment', 'Emotion'], dtype='object')

In [9]:
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

Missing values in each column:
 Comment    0
Emotion    0
dtype: int64


In [10]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [11]:
# Remove punctuation and numbers
cleaned_text = []
for text in df['Comment']:
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    cleaned_text.append(text)


In [12]:
# Convert to lowercase
cleaned_text = [text.lower() for text in cleaned_text]


In [13]:
# Tokenize
tokenized_text = [word_tokenize(text) for text in cleaned_text]

In [14]:
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_text = []
for tokens in tokenized_text:
    filtered_text.append([word for word in tokens if word not in stop_words])

In [20]:
final_text = [' '.join(tokens) for tokens in filtered_text]

# Add cleaned text to the dataframe
df['cleaned_text'] = final_text
print(df.head())

                                             Comment Emotion  \
0  i seriously hate one subject to death but now ...    fear   
1                 im so full of life i feel appalled   anger   
2  i sit here to write i start to dig out my feel...    fear   
3  ive been really angry with r and i feel like a...     joy   
4  i feel suspicious if there is no one outside l...    fear   

                                        cleaned_text  
0  seriously hate one subject death feel reluctan...  
1                         im full life feel ap

<h4>Preprocessing Techniques and Their Impact on Model Performance</h4>
Reduce Noise: Removing punctuation, digits, and stopwords strips out tokens that carry little emotional signal, making it easier for models to focus on meaningful words.
Improve Consistency: Lowercasing ensures a uniform representation of each word.
Enhance Feature Quality: The cleaned text provides more meaningful features for the models to learn from, which tends to improve performance in text classification tasks.
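
The preprocessing cells above (strip punctuation and digits, lowercase, tokenize, drop stopwords) can be collapsed into a single helper. This is a minimal sketch, not the notebook's code: it assumes a whitespace tokenizer in place of word_tokenize and a tiny hand-picked stopword set in place of NLTK's English list, so that it runs without any downloads. The names clean_text and STOP_WORDS are illustrative.

```python
import re

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"i", "to", "of", "so", "the", "a", "and", "but", "is", "there"}

def clean_text(text: str) -> str:
    """Strip punctuation and digits, lowercase, tokenize, and drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    text = re.sub(r"\d+", "", text)       # remove digits
    tokens = text.lower().split()         # whitespace tokenizer (stand-in for word_tokenize)
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(clean_text("I'm so full of life, I feel appalled 123!"))
```

Because punctuation is stripped before stopword filtering, a contraction such as "I'm" becomes "im", which a stopword list of standard spellings will not match. That is consistent with "im" surviving in the cleaned text shown in the notebook output.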

<h4  style="color:green;">2. Feature Extraction</h4>

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform the cleaned text
X = vectorizer.fit_transform(df['cleaned_text']).toarray()

# Extract target variable
y = df['Emotion']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Feature Representation: TF-IDF transforms the text into numerical features that machine learning algorithms can process.
Weighting Important Terms: Terms that occur often in a document (higher TF) and rarely across the corpus (higher IDF) receive higher weights.
Dimensionality Reduction: Limiting the vocabulary to the 1000 most frequent terms (max_features=1000) reduces the complexity and improves the efficiency of the model.
By using TF-IDF, we create a feature set that captures the importance of each term within its document and across the entire dataset, giving a more meaningful numerical representation for text classification models.
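
To make the weighting concrete, here is a small sketch of the TF-IDF calculation on a toy corpus. It assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The corpus and the names doc_freq and tfidf_row are invented for illustration.

```python
import math
from collections import Counter

docs = ["feel happy happy", "feel angry", "feel scared"]  # toy corpus (assumption)
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row = tfidf_row(tokenized[0])
for term, weight in zip(vocab, row):
    print(f"{term:>7}: {weight:.3f}")
```

In the first document, "happy" receives a much larger weight than "feel": it occurs twice there and in no other document, while "feel" appears in every document and so gets the minimum IDF of 1.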

<h4  style="color:green;">3. Model Development</h4>

In [28]:
# Naive Bayes (multinomial variant, which accepts the non-negative TF-IDF features)
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

y_pred_nb = nb_model.predict(X_test)

accuracy_nb = accuracy_score(y_test, y_pred_nb)
f1_nb = f1_score(y_test, y_pred_nb, average='weighted')

print("Naive Bayes Accuracy:",accuracy_nb)
print("Naive Bayes F1-Score: ",f1_nb)

Naive Bayes Accuracy: 0.9158249158249159
Naive Bayes F1-Score:  0.9158756487178424


In [29]:
# Support Vector Machine with a linear kernel
from sklearn.svm import SVC

svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
y_pred_svm = svm_model.predict(X_test)

accuracy_svm = accuracy_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm, average='weighted')

print("Support Vector Machine Accuracy:",accuracy_svm)
print("Support Vector Machine F1-Score: ",f1_svm)

Support Vector Machine Accuracy: 0.9436026936026936
Support Vector Machine F1-Score:  0.9436022027688145


<h4  style="color:green;">4. Model Comparison</h4>

In [30]:
model_comparison = pd.DataFrame({
    'Model': ['Naive Bayes', 'Support Vector Machine'],
    'Accuracy': [accuracy_nb, accuracy_svm],
    'F1-Score': [f1_nb, f1_svm]})
print(model_comparison)


                    Model  Accuracy  F1-Score
0             Naive Bayes  0.915825  0.915876
1  Support Vector Machine  0.943603  0.943602
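
The scores above come from a single 80/20 split, with the vectorizer fitted on the full dataset before splitting. A more conservative comparison is sketched below, as one possible approach rather than the notebook's method. It wraps the vectorizer and classifier in a pipeline so that the TF-IDF statistics are learned from the training folds only, and it averages the weighted F1-score over 5-fold cross-validation. The texts and labels are a made-up stand-in so the snippet runs by itself; the numbers it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up stand-in data; in the notebook use df['cleaned_text'] and df['Emotion'].
texts = ["feel happy today", "so glad and happy", "joyful glad day",
         "happy joyful feeling", "glad happy joyful",
         "feel angry now", "so mad and angry", "furious mad day",
         "angry furious feeling", "mad angry furious"]
labels = ["joy"] * 5 + ["anger"] * 5

candidates = {
    "Naive Bayes": make_pipeline(TfidfVectorizer(max_features=1000), MultinomialNB()),
    "Support Vector Machine": make_pipeline(TfidfVectorizer(max_features=1000), SVC(kernel="linear")),
}

scores = {}
for name, pipeline in candidates.items():
    # Each fold refits the vectorizer on its own training portion.
    fold_scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_weighted")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean weighted F1 = {scores[name]:.3f}")
```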


<h3>Chosen Models and Their Suitability for Emotion Classification</h3>

Naive Bayes

Model Description:
Naive Bayes is a probabilistic classifier based on Bayes' theorem, assuming independence between features.

Text Data: Works well with text data; although words are not truly independent of one another, the independence assumption is often a workable approximation for classification.

Speed and Efficiency: Fast to train and predict, making it suitable for large datasets.

Performance: Often performs well with sparse data, which is common in text classification tasks like emotion classification.
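
As an illustration of the independence assumption, the sketch below scores a document under a multinomial Naive Bayes model by adding a log prior to one log likelihood per word. The word counts and priors are invented, and add-one (Laplace) smoothing is assumed, matching the default alpha=1.0 of MultinomialNB. Note that the notebook feeds TF-IDF weights to the model rather than raw counts.

```python
import math

# Invented per-class word counts (assumption, for illustration only)
word_counts = {
    "joy":   {"happy": 8, "love": 6, "angry": 1},
    "anger": {"happy": 1, "love": 1, "angry": 9},
}
class_priors = {"joy": 0.5, "anger": 0.5}
vocab = sorted({w for counts in word_counts.values() for w in counts})

def log_score(tokens, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    counts = word_counts[label]
    total = sum(counts.values()) + alpha * len(vocab)
    score = math.log(class_priors[label])
    for tok in tokens:
        if tok in vocab:  # words outside the vocabulary are ignored
            score += math.log((counts.get(tok, 0) + alpha) / total)
    return score

def predict(tokens):
    """Pick the class with the highest log score."""
    return max(class_priors, key=lambda label: log_score(tokens, label))

print(predict(["angry", "angry", "love"]))
```

Each word contributes its own term to the sum regardless of which other words are present, which is exactly what "assuming independence between features" means in practice.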

Support Vector Machine (SVM)

Model Description:
SVM is a powerful classifier that finds the hyperplane that best separates data into classes.

High Dimensional Spaces: Effective in high-dimensional spaces and with sparse data, typical of text features.

Margin Maximization: Focuses on maximizing the margin between classes, which can lead to better generalization on unseen data.

Versatility: Can handle non-linear classification through the use of kernel functions, making it adaptable to complex relationships in the data. This notebook uses a linear kernel.
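
For a linear kernel, the decision rule reduces to the sign of w·x + b, and training trades the margin width against the hinge loss on points that fall inside the margin. The sketch below uses a hand-picked hyperplane and four 2-D points. All values are assumptions for illustration; a fitted SVC would learn w and b from the data.

```python
import math

w = [1.0, -1.0]   # hand-picked weight vector (assumption)
b = 0.0           # hand-picked bias (assumption)

# (point, label) pairs with labels in {-1, +1}
samples = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 2.0], -1), ([1.0, 1.5], -1)]

def decision(x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(x, y):
    """Zero when the point is on the correct side with functional margin >= 1."""
    return max(0.0, 1.0 - y * decision(x))

# Distance between the planes w.x + b = +1 and w.x + b = -1
margin_width = 2.0 / math.sqrt(sum(wi * wi for wi in w))
losses = [hinge_loss(x, y) for x, y in samples]

print("margin width:", round(margin_width, 3))
print("hinge losses:", losses)
```

The last point is classified correctly but sits inside the margin, so it is the only one with a non-zero loss. Maximizing the margin while keeping such losses small is what drives the generalization behavior described above.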

Naive Bayes is suitable for emotion classification due to its simplicity, efficiency, and good performance with text data.

SVM is suitable because of its effectiveness in high-dimensional and sparse datasets, and its strong generalization capabilities. On this dataset, SVM achieves the higher accuracy (0.944 vs. 0.916) and the higher weighted F1-score (0.944 vs. 0.916).