In [1]:
import pandas as pd

# Importing dataset
X = pd.read_csv("C:/Users/USER/OneDrive/Desktop/Datas/nlp_dataset.csv")
X

,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear
...,...,...
5932,i begun to feel distressed for you,fear
5933,i left feeling annoyed and angry thinking that...,anger
5934,i were to ever get married i d have everything...,joy
5935,i feel reluctant in applying there because i w...,fear


In [2]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5937 entries, 0 to 5936
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Comment  5937 non-null   object
 1   Emotion  5937 non-null   object
dtypes: object(2)
memory usage: 92.9+ KB


In [3]:
X.duplicated().sum()

0

In [4]:
X['Emotion'].unique()

array(['fear', 'anger', 'joy'], dtype=object)
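Since the closing notes of this notebook mention class imbalance, it can help to look at how often each label occurs, not just which labels exist. The sketch below uses a small hypothetical DataFrame (`toy`) in place of `X`; on the real data the same two calls would be applied to `X['Emotion']`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the notebook's DataFrame `X`
toy = pd.DataFrame({
    "Comment": ["i feel scared", "i am furious", "i feel great", "so happy today"],
    "Emotion": ["fear", "anger", "joy", "joy"],
})

# Absolute and relative class frequencies
counts = toy["Emotion"].value_counts()
shares = toy["Emotion"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Roughly equal shares suggest accuracy is a reasonable headline metric; strongly unequal shares are a reason to look at per-class or macro-averaged scores as well.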

### Text Preprocessing
Text preprocessing often improves model performance. It generally includes:

Tokenization: Breaking the text into individual words or tokens.

Removing punctuation and special characters: Punctuation and special symbols usually contribute little to emotion classification.

Lowercasing: Converting all text to lowercase so that the model treats "Happy" and "happy" as the same word.

Stopword removal: Stopwords (common words like "the", "and") usually carry little information for classification and are commonly removed. Note that NLTK's English stopword list also contains negations such as "not", so removing them can change the meaning of a comment.

Stemming/Lemmatization (optional): Reducing words to their base form. This step is not applied in this notebook.
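The optional stemming step could be sketched as follows. This is only an illustration, assuming NLTK's `PorterStemmer` (which needs no extra data download); the helper name `stem_tokens` is hypothetical and is not part of the notebook's pipeline.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Reduce each token to its Porter stem (hypothetical helper)."""
    return [stemmer.stem(tok) for tok in tokens]

# Inflected forms collapse to a shared stem, shrinking the vocabulary
tokens = ["feelings", "trusting", "thinking"]
print(stem_tokens(tokens))
```

Stemming reduces the number of distinct features, but stems are not always real words, so lemmatization may be preferable when readability of the features matters.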

# Preprocessing

In [5]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string 

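# Downloading necessary NLTK data
# Note: recent NLTK releases load the tokenizer from 'punkt_tab'; if
# word_tokenize raises a LookupError, also run nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once instead of on every function call
stop_words = set(stopwords.words('english'))

# Define a function to preprocess the text
def preprocess_text(text):
    # Lowercasing
    text = text.lower()

    # Removing punctuation
    text_no_punct = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenizing
    tokens = word_tokenize(text_no_punct)

    # Removing stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Re-join the tokens into a single cleaned string
    return " ".join(tokens)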



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
feature = X['Comment'].apply(preprocess_text)
feature

0       seriously hate one subject death feel reluctan...
1                              im full life feel appalled
2       sit write start dig feelings think afraid acce...
3       ive really angry r feel like idiot trusting fi...
4       feel suspicious one outside like rapture happe...
                              ...                        
5932                                begun feel distressed
5933    left feeling annoyed angry thinking center stu...
5934    ever get married everything ready offer got to...
5935    feel reluctant applying want able find company...
5936           wanted apologize feel like heartless bitch
Name: Comment, Length: 5937, dtype: object

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Transform the cleaned text into TF-IDF features
x = tfidf_vectorizer.fit_transform(feature)
y = X['Emotion']  # Target labels come from the 'Emotion' column


TF-IDF Vectorization
Feature extraction converts text data into a numerical format that can be fed into machine learning models. We use the TF-IDF vectorizer because it weighs each word by its frequency in a document relative to how many documents contain it, which highlights the words that distinguish one comment from another.

Explanation of TF-IDF Vectorizer
Term Frequency (TF): Measures how frequently a word appears in a document.
Inverse Document Frequency (IDF): Reduces the weight of words that appear in many documents.
As a result, a word that occurs in many comments (such as "feel", which appears in several of the rows shown above) receives a low weight, while rarer, more distinctive words (such as "appalled" or "suspicious") receive a higher weight.

In [8]:
from sklearn.model_selection import train_test_split

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
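
One caveat with the cells above: the vectorizer is fitted on all comments before the split, so the IDF weights have already seen the test comments. A possible alternative, sketched here on hypothetical toy data, is to split the raw text first and let a scikit-learn pipeline fit TF-IDF on the training portion only. The `stratify` argument, which keeps class proportions equal in both sets, is also an addition not used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned comments and labels
texts = ["feel scared", "feel afraid", "feel terrified", "feel nervous",
         "feel happy", "feel joyful", "feel glad", "feel delighted"]
labels = ["fear"] * 4 + ["joy"] * 4

# Split the raw text first; stratify preserves the class balance
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(text_train, y_train)

# At prediction time the test texts are transformed with the training vocabulary
predictions = model.predict(text_test)
print(len(predictions))
```

With this arrangement the reported test scores reflect text the vectorizer has never seen, which is closer to how the model would be used on new comments.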


# Training Naive Bayes Model

In [9]:
from sklearn.naive_bayes import MultinomialNB

# Initialize and train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)


# Training Support Vector Machine (SVM) Model

In [10]:
from sklearn.svm import SVC

# Initialize and train the SVM model
svm_model = SVC()
svm_model.fit(X_train, y_train)
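
Called without arguments, `SVC()` uses the RBF kernel. For sparse TF-IDF features a linear kernel is a common alternative, and `LinearSVC` is a faster linear implementation. The sketch below, on hypothetical toy data, shows how the variants could be set up side by side; it is not a claim about which one performs best on this dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC

# Hypothetical toy data, used only to show the API
texts = ["feel scared", "feel afraid", "feel terrified",
         "feel happy", "feel joyful", "feel glad"]
labels = ["fear", "fear", "fear", "joy", "joy", "joy"]

features = TfidfVectorizer().fit_transform(texts)

rbf_svm = SVC()                     # default kernel is 'rbf'
linear_svm = SVC(kernel="linear")   # explicit linear kernel
fast_linear_svm = LinearSVC()       # linear SVM with a different solver

for name, clf in [("rbf", rbf_svm), ("linear", linear_svm), ("LinearSVC", fast_linear_svm)]:
    clf.fit(features, labels)
    # Training-set accuracy only; a held-out set is needed for a real comparison
    print(name, clf.score(features, labels))
```

On the real data, each variant would be fitted on `X_train` and evaluated on `X_test` in the same way as `svm_model`.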


# Model Comparison

In [11]:
from sklearn.metrics import accuracy_score, f1_score

# Predicting on the test data
y_pred_nb = nb_model.predict(X_test)
y_pred_svm = svm_model.predict(X_test)

# Naive Bayes Evaluation
nb_accuracy = accuracy_score(y_test, y_pred_nb)
nb_f1 = f1_score(y_test, y_pred_nb, average='weighted')

# SVM Evaluation
svm_accuracy = accuracy_score(y_test, y_pred_svm)
svm_f1 = f1_score(y_test, y_pred_svm, average='weighted')

print(f"Naive Bayes Accuracy: {nb_accuracy}, F1-Score: {nb_f1}")
print(f"SVM Accuracy: {svm_accuracy}, F1-Score: {svm_f1}")


Naive Bayes Accuracy: 0.9116161616161617, F1-Score: 0.9115095144506911
SVM Accuracy: 0.9351851851851852, F1-Score: 0.9351483654514323


Naive Bayes: Often works well for text classification. It assumes that words are conditionally independent given the class, which is a simplification but keeps the model fast and effective on bag-of-words features such as TF-IDF.

SVM: Often performs well on high-dimensional, sparse data such as TF-IDF vectors, because it separates the classes with maximum-margin hyperplanes in the feature space. A linear kernel is a common choice for text; note that SVC() as called above uses scikit-learn's default RBF kernel.

Both models are compared on accuracy and weighted F1-score. On this test split the SVM scores higher on both (about 0.935 versus 0.912 for Naive Bayes), although which model is better depends in general on the dataset and its complexity. If the dataset is imbalanced, F1-score may be a more reliable metric than accuracy.
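
The point about imbalance can be illustrated with a small, made-up example: ten labels, nine of them "joy", and a degenerate model that always predicts "joy". The labels here are hypothetical and unrelated to the notebook's results.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: 9 "joy", 1 "fear"
y_true = ["joy"] * 9 + ["fear"]
# A degenerate model that always predicts the majority class
y_pred = ["joy"] * 10

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.3f}, weighted F1={weighted_f1:.3f}, macro F1={macro_f1:.3f}")
```

Accuracy comes out at 0.9 even though the model never detects "fear". The weighted F1 used in this notebook is still fairly high because it is dominated by the majority class, while the macro average (below 0.5 here) exposes the failure, so macro F1 or a per-class report may be worth adding when classes are uneven.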