<a href="https://colab.research.google.com/github/Arjun-R-krishnan/NLP---Emotion-Classification-in-Text/blob/main/NLP__Emotion_Classification_in_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing necessary libraries

In [None]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report,f1_score

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

## Loading Dataset

In [None]:
# Provide the URL of the CSV file
csv_url ='https://drive.google.com/uc?export=download&id=1HWczIICsMpaL8EJyu48ZvRFcXx3_pcnb'


# Load the CSV file into a DataFrame
df = pd.read_csv(csv_url)

# Display the first few rows of the DataFrame to verify the data is loaded
df.head()

Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


## Preprocessing

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5937 entries, 0 to 5936
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Comment  5937 non-null   object
 1   Emotion  5937 non-null   object
dtypes: object(2)
memory usage: 92.9+ KB


In [None]:
# checking for null values
df.isnull().sum()

Unnamed: 0,0
Comment,0
Emotion,0


In [None]:
# checking for any duplicates in the 'Comment' column
duplicate_df = df[df['Comment'].duplicated(keep = False)]


duplicate_df

Unnamed: 0,Comment,Emotion
986,i resorted to yesterday the post peak day of i...,anger
1930,i resorted to yesterday the post peak day of i...,fear
2262,i feel like a tortured artist when i talk to her,anger
2877,i feel pretty tortured because i work a job an...,anger
4869,i feel pretty tortured because i work a job an...,fear
5870,i feel like a tortured artist when i talk to her,fear


### The dataset contains three duplicated comments, each appearing twice with two different emotion labels, so we remove the duplicates and keep only one instance of each comment.



In [None]:
# Drop rows whose 'Comment' text repeats; drop_duplicates() keeps the first occurrence by default
df = df.drop_duplicates(subset = 'Comment')
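
To make the effect of this step concrete, the sketch below reproduces it on a small made-up DataFrame (the rows are illustrative and not taken from the dataset). It suggests two things worth keeping in mind: with the default `keep='first'`, the label that survives depends on row order, and `keep=False` would instead discard every copy of a conflicting comment.

```python
import pandas as pd

# Hypothetical toy data: one comment appears twice with conflicting labels
toy = pd.DataFrame({
    "Comment": ["i feel tortured", "i feel great", "i feel tortured"],
    "Emotion": ["anger", "joy", "fear"],
})

# Count distinct labels per comment to spot conflicting annotations
labels_per_comment = toy.groupby("Comment")["Emotion"].nunique()
conflicting = labels_per_comment[labels_per_comment > 1].index.tolist()
print(conflicting)

# Default behaviour (keep='first'): the first label seen survives
kept_first = toy.drop_duplicates(subset="Comment")
print(kept_first["Emotion"].tolist())

# Alternative: drop every copy of a comment that occurs more than once
kept_none = toy.drop_duplicates(subset="Comment", keep=False)
print(kept_none["Comment"].tolist())
```

Which option is preferable depends on whether the first label can be trusted; with only three affected comments out of 5937, either choice should have little effect on the results.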

In [None]:
# checking for duplicates after removal
duplicate_df = df[df['Comment'].duplicated(keep = False)]



duplicate_df

Unnamed: 0,Comment,Emotion


### Text Cleaning, Tokenization, and Stopword Removal


In [None]:
# for all these steps we define a function called clean_text

def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove punctuation and special characters
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    cleaned_tokens = [word for word in tokens if word not in stop_words]

    # Rejoin tokens to form the cleaned text
    return ' '.join(cleaned_tokens)

## Preprocessing Techniques and Their Impact on Model Performance

### Lowercasing the Text
Technique: Converts all characters in the text to lowercase.

Impact: A bag-of-words model treats "This" and "this" as different tokens. Lowercasing merges such variants into a single feature, which shrinks the vocabulary and pools the evidence for each word.


### Removing Punctuation and Special Characters
Technique: Uses the regular expression `[^\w\s]` to remove every character that is neither a word character (letter, digit, or underscore) nor whitespace.

Impact: Punctuation and special characters rarely help a bag-of-words classifier. Removing them reduces noise, so the extracted features correspond to actual words.
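
As a quick illustration of what this pattern does and does not remove, the sketch below applies the same regular expression used in `clean_text` to a few made-up strings. The helper name `strip_punctuation` and the sample strings are assumptions for demonstration only.

```python
import re

def strip_punctuation(text):
    # Same pattern as in clean_text: drop anything that is not \w or whitespace
    return re.sub(r'[^\w\s]', '', text)

# Hypothetical sample inputs
samples = ["i don't know!!!", "snake_case stays", "wait... what?", "price: $5 (approx.)"]
cleaned = [strip_punctuation(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two details stand out: apostrophes are stripped, so "don't" becomes "dont", and underscores survive because `\w` counts them as word characters.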


### Tokenization
Technique: Splits the text into individual words or "tokens."

Impact: Tokenization breaks the text into word-level units, which most NLP pipelines require. Here it allows stopwords to be filtered word by word before the remaining tokens are rejoined for feature extraction.

### Removing Stopwords
Technique: Removes common words (like "the", "is", "and") that typically carry little meaning on their own.

Impact: Stopwords occur in almost every comment and can dilute the weight of more informative terms. Removing them shrinks the vocabulary and lets the model focus on words that are more likely to signal an emotion.
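
The filtering step can be sketched without NLTK as follows. The small `STOP_WORDS` set and the `remove_stopwords` helper are hypothetical stand-ins for the much longer NLTK English list used in `clean_text`, and `str.split` stands in for `word_tokenize`.

```python
# Hypothetical mini stop list; NLTK's English list is much longer
STOP_WORDS = {"i", "am", "the", "is", "and", "not", "to"}

def remove_stopwords(text):
    # Keep only the tokens that are not in the stop list
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

plain = remove_stopwords("i am happy")
negated = remove_stopwords("i am not happy")
print(plain)
print(negated)
```

Both sentences reduce to the same text here because "not" is in the stop list, as it is in NLTK's English list. For emotion classification this may be worth checking, since negation can change the meaning of a comment.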

In [None]:
# testing the function
test_sentence = "i seriously hate one subject to death but now i feel reluctant to drop it"

print(f"The cleaned output of the given sentence {test_sentence} is:\n{clean_text(test_sentence)}")

The cleaned output of the given sentence i seriously hate one subject to death but now i feel reluctant to drop it is:
seriously hate one subject death feel reluctant drop


In [None]:
# Applying text preprocessing to the 'Comment' column by using the clean_text function
# and storing the result in a new column 'Cleaned_Comment'

df['Cleaned_Comment'] = df['Comment'].apply(clean_text)

In [None]:
df.head()

Unnamed: 0,Comment,Emotion,Cleaned_Comment
0,i seriously hate one subject to death but now ...,fear,seriously hate one subject death feel reluctan...
1,im so full of life i feel appalled,anger,im full life feel appalled
2,i sit here to write i start to dig out my feel...,fear,sit write start dig feelings think afraid acce...
3,ive been really angry with r and i feel like a...,joy,ive really angry r feel like idiot trusting fi...
4,i feel suspicious if there is no one outside l...,fear,feel suspicious one outside like rapture happe...


## Feature Extraction
We use TfidfVectorizer for feature extraction because it weighs how often a term occurs in a comment against how many comments contain it. Terms that are frequent in one comment but rare across the dataset, which often include emotionally significant words, receive high weights, while words that appear in most comments are downweighted. This reduces the influence of uninformative terms and tends to help the classifier generalize.

In [None]:
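# Initialize the TfidfVectorizer; stop_words='english' applies scikit-learn's built-in
# English stop word list on top of the NLTK list already used in clean_text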
vectorizer = TfidfVectorizer(stop_words='english')
# Apply the vectorizer to the 'Cleaned_Comment' column
X = vectorizer.fit_transform(df['Cleaned_Comment'])
# The target variable 'y' is extracted from the 'Emotion' column
y = df['Emotion']
# Print the resulting TF-IDF feature matrix 'X'
print(X)

  (0, 6673)	0.40106682538342764
  (0, 3448)	0.34119360412192185
  (0, 7349)	0.460447890291778
  (0, 1843)	0.43356898889079565
  (0, 2800)	0.08542282256037335
  (0, 6223)	0.3326836039812941
  (0, 2279)	0.45250697155850844
  (1, 2800)	0.15613061561446895
  (1, 3738)	0.32251346517435714
  (1, 4380)	0.4935887528453009
  (1, 341)	0.7924509061851693
  (2, 6863)	0.32580046623999326
  (2, 8558)	0.2826411284956054
  (2, 7194)	0.2883284551357934
  (2, 2052)	0.42790106245525095
  (2, 2805)	0.2609799338683638
  (2, 7678)	0.224254398987721
  (2, 160)	0.2761450037056625
  (2, 36)	0.38067553965723294
  (2, 5702)	0.39217633356820664
  (2, 4573)	0.23315802337899746
  (3, 2800)	0.09946043018381333
  (3, 4036)	0.3075410274209818
  (3, 6090)	0.27522561396615997
  (3, 282)	0.35927895433157725
  :	:
  (5931, 3252)	0.21071024788354095
  (5931, 5199)	0.29669781328738076
  (5931, 5485)	0.246248307555487
  (5931, 7533)	0.266681054673041
  (5931, 6631)	0.24519122916279404
  (5931, 4483)	0.24028224545924445
  ... (output truncated)

## TF-IDF Transformation of Text Data into Numerical Features


1. Tokenization:
The text is split into individual words (tokens) for analysis.

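2. Term Frequency (TF):
Measures how often a term appears in a document. Terms that occur more often within a document get higher TF scores. TfidfVectorizer uses the raw count by default and normalizes each document vector to unit length afterwards, rather than dividing by the document's word count.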

3. Document Frequency (DF):
Counts how many documents contain a specific term. This helps determine the term's significance.

4. Inverse Document Frequency (IDF):
Measures the importance of a term across the entire dataset. Rare terms that appear in fewer documents are given higher scores.

5. TF-IDF Score Calculation:
Combines TF and IDF to calculate the final score for each term in a document:

$$
\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
$$
This score reflects how important a term is in a specific document relative to the whole dataset.

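6. Sparse Matrix Representation:
The result is a sparse matrix where rows represent documents (e.g., comments), columns represent unique terms, and values are the TF-IDF scores. Because each comment contains only a few of the vocabulary's terms, almost all entries are zero, so only the non-zero scores are stored. In the printed output above, each line shows a (row, column) pair followed by its score.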
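
The steps above can be sketched in plain Python on a toy corpus. This is a minimal illustration, assuming the smoothed IDF that scikit-learn documents as the TfidfVectorizer default, $\text{IDF}(t) = \ln\frac{1 + n}{1 + \text{DF}(t)} + 1$, followed by L2 normalization of each document vector. The corpus and the helper names `smoothed_idf` and `tfidf_vector` are assumptions, not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already cleaned
docs = ["feel happy happy", "feel sad", "feel afraid"]
tokenized = [d.split() for d in docs]   # step 1: tokenization
n_docs = len(tokenized)

# Step 3: document frequency, the number of documents containing each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Step 4: ln((1 + n) / (1 + DF)) + 1, so rarer terms score higher
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(tokens):
    tf = Counter(tokens)                                   # step 2: raw term counts
    raw = {t: tf[t] * smoothed_idf(t) for t in tf}         # step 5: TF x IDF
    norm = math.sqrt(sum(v * v for v in raw.values()))     # L2 norm of the row
    return {t: v / norm for t, v in raw.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

In this toy corpus "feel" occurs in every document, so its IDF is exactly 1 and it ends up with the lowest weight in each row, while "happy" dominates the first document.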

## Splitting the dataset into training and testing data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model Development

### Naive Bayes

In [None]:
# Initialize the Multinomial Naive Bayes classifier
classifier_nb = MultinomialNB()
# Fit the Multinomial Naive Bayes classifier to the training data
classifier_nb.fit(X_train, y_train)

### Testing the model on a custom comment

In [None]:
new_sentence = input("enter a comment:")

# Step 1: Clean the input sentence (assuming clean_text function is defined)
cleaned_sentence = clean_text(new_sentence)

# Step 2: Transform the sentence using the same TF-IDF vectorizer used for training
sentence_tfidf = vectorizer.transform([cleaned_sentence])

# Step 3: Use the trained model to predict the emotion
predicted_emotion_nb = classifier_nb.predict(sentence_tfidf)

enter a comment:i love cars


In [None]:
print(f"The emotion of the given input\n {new_sentence}  is\n{predicted_emotion_nb[0]}")

The emotion of the given input
 i love cars  is
joy


In [None]:
predictions_nb = classifier_nb.predict(X_test)

### Support Vector Machine

In [None]:
# Initialize the Support Vector Classifier (SVC)
classifier_svc = SVC()

# Train the SVM classifier on the training data
classifier_svc.fit(X_train, y_train)

### Testing the model on a custom comment

In [None]:
new_sentence = input("enter a comment:")

# Step 1: Clean the input sentence (assuming clean_text function is defined)
cleaned_sentence = clean_text(new_sentence)

# Step 2: Transform the sentence using the same TF-IDF vectorizer used for training
sentence_tfidf = vectorizer.transform([cleaned_sentence])

# Step 3: Use the trained model to predict the emotion
predicted_emotion_svc = classifier_svc.predict(sentence_tfidf)

enter a comment:there is a snake in the room


In [None]:
print(f"The emotion of the given input\n {new_sentence}  is\n{predicted_emotion_svc[0]}")

The emotion of the given input
 there is a snake in the room  is
fear


In [None]:
predictions_svc = classifier_svc.predict(X_test)

## Model Comparison

We compare the Naive Bayes model and the Support Vector Machine (SVM) on the held-out test set using accuracy and the weighted F1 score, which averages the per-class F1 scores weighted by the number of test samples in each class.

In [None]:
accuracy_nb = accuracy_score(y_test,predictions_nb)
f1_nb = f1_score(y_test,predictions_nb, average='weighted')

accuracy_svm = accuracy_score(y_test,predictions_svc)
f1_svm = f1_score(y_test,predictions_svc, average='weighted')

In [None]:
print(f'Naive Bayes - Accuracy: {accuracy_nb:.2f}, F1 Score: {f1_nb:.2f}')
print(f'Support Vector Machine - Accuracy: {accuracy_svm:.2f}, F1 Score: {f1_svm:.2f}')

Naive Bayes - Accuracy: 0.90, F1 Score: 0.90
Support Vector Machine - Accuracy: 0.92, F1 Score: 0.92
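
To clarify what `average='weighted'` computes, here is a small pure-Python sketch that calculates accuracy and the weighted F1 score by hand. The `weighted_f1` helper and the label lists are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    # Per-class F1, averaged with weights equal to each class's true count
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    total = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * support[label]
    return total / len(y_true)

# Hypothetical labels and predictions
y_true = ["joy", "joy", "fear", "anger", "fear", "joy"]
y_pred = ["joy", "fear", "fear", "anger", "joy", "joy"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = weighted_f1(y_true, y_pred)
print(f"Accuracy: {accuracy:.2f}, weighted F1: {f1:.2f}")
```

Accuracy and weighted F1 happen to coincide in this toy case, as they do in the results above, but they can diverge when classes are imbalanced and the model favors the larger ones.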


## Conclusion
Both models perform well on the test set. The SVM reaches 0.92 for both accuracy and weighted F1, slightly ahead of the Naive Bayes model at 0.90.

SVMs are often effective for text classification because they cope well with high-dimensional, sparse features such as TF-IDF vectors. Their accuracy depends on the kernel and on hyperparameters such as the regularization strength C. The SVC used here runs with scikit-learn's default settings, so tuning these may improve the result further.
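
One possible way to explore kernels and hyperparameters is a cross-validated grid search. The sketch below is an illustration under stated assumptions rather than the method used in this notebook: the toy sentences, the parameter grid, and `cv=2` are all made up, and on the real data the grid and fold count would need to be chosen with more care. Wrapping the vectorizer and classifier in a `Pipeline` means the TF-IDF vocabulary is fitted only on the training folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: four short comments per emotion
texts = [
    "i love this so much", "what a wonderful happy day",
    "i feel great and joyful", "this makes me smile",
    "i am scared of the dark", "there is a snake in the room",
    "i feel afraid and nervous", "that noise terrifies me",
    "i am furious with him", "this makes me so angry",
    "i hate being ignored", "she was mad and annoyed",
]
labels = ["joy"] * 4 + ["fear"] * 4 + ["anger"] * 4

# Vectorizer and classifier are tuned and refitted together
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svc", SVC())])

# Illustrative grid: two kernels and three regularization strengths
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_weighted")
search.fit(texts, labels)

print(search.best_params_)
prediction = search.predict(["there is a snake in the room"])[0]
print(prediction)
```

With a dataset this small the selected parameters are not meaningful; the point is only the structure of the search.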