In [18]:
import pandas as pd
import numpy as np
import string
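# stopwords needs the NLTK 'stopwords' data and word_tokenize needs 'punkt'
# ('punkt_tab' in recent NLTK releases); fetch them once with nltk.download(...)
import nltk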
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, f1_score

# Loading The Dataset

In [3]:
df = pd.read_csv("nlp_dataset.csv")

In [3]:
df.head()

Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [4]:
df.tail()

Unnamed: 0,Comment,Emotion
5932,i begun to feel distressed for you,fear
5933,i left feeling annoyed and angry thinking that...,anger
5934,i were to ever get married i d have everything...,joy
5935,i feel reluctant in applying there because i w...,fear
5936,i just wanted to apologize to you because i fe...,anger


# Preprocessing Steps

## 1) Analyzing Non-String Elements in a DataFrame

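To check for non-string elements, the cell below applies applymap() to df with a lambda function that returns True wherever an element is not a string. Note that value_counts is referenced without parentheses, so the output is the bound method's representation, which prints the whole mask rather than a tally. Every visible entry is False, meaning those values are strings. In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map().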

In [5]:
non_string_mask = df.applymap(lambda x: not isinstance(x, str))
non_string_mask.value_counts

<bound method DataFrame.value_counts of       Comment  Emotion
0       False    False
1       False    False
2       False    False
3       False    False
4       False    False
...       ...      ...
5932    False    False
5933    False    False
5934    False    False
5935    False    False
5936    False    False

[5937 rows x 2 columns]>
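
A more compact summary of such a mask is a per-column count of the True entries. The sketch below is only an illustration: it uses a small made-up frame (toy) instead of the notebook's data, and it builds the mask with apply plus Series.map, which should behave the same on older and newer pandas versions.

```python
import pandas as pd

# Hypothetical toy frame: a NaN and an integer hide among the strings
toy = pd.DataFrame({
    "Comment": ["i feel fine", float("nan"), "so happy"],
    "Emotion": ["joy", "fear", 3],
})

# True where the cell is NOT a string
non_string_mask = toy.apply(lambda col: col.map(lambda x: not isinstance(x, str)))

# Summing booleans counts the offending cells in each column
non_string_counts = non_string_mask.sum()
print(non_string_counts)
```

A count of zero in every column would confirm that all values are strings.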

## 2) Converting Text to Lowercase in DataFrame Columns

To convert the text to lowercase, str.lower() is applied to the Comment and Emotion columns. The method returns a new Series and leaves the DataFrame unchanged, so each result is assigned back to its column. The final line displays the lowercased Emotion column.

In [6]:
df['Comment'] = df['Comment'].str.lower()
df['Emotion'] = df['Emotion'].str.lower()
df['Emotion']

0        fear
1       anger
2        fear
3         joy
4        fear
        ...  
5932     fear
5933    anger
5934      joy
5935     fear
5936    anger
Name: Emotion, Length: 5937, dtype: object

## 3) Removing Punctuation from Text

The function remove_punctuation strips punctuation from a text string. Inside the function, the variable punctuation holds string.punctuation, which contains all standard ASCII punctuation characters, and str.translate() deletes each of them from the text.

In [7]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

These are the punctuation characters that will be removed.


In [8]:
def remove_punctuation(text):
    punctuation = string.punctuation
    return text.translate(str.maketrans('','',punctuation))

In [9]:
df[['Comment','Emotion']] = df[['Comment','Emotion']].applymap(lambda x:remove_punctuation(x))
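
To make the effect concrete, here is a minimal sketch that runs the same function on a single made-up sentence (sample is a hypothetical input, not a row from the dataset).

```python
import string

def remove_punctuation(text):
    # str.maketrans('', '', chars) builds a table that deletes every character in chars
    return text.translate(str.maketrans('', '', string.punctuation))

sample = "i can't believe it... wow!"
cleaned = remove_punctuation(sample)
print(cleaned)  # i cant believe it wow
```

Apostrophes are removed as well, so a contraction such as "can't" becomes "cant". This matters for the next step, since contracted entries in a stop-word list (for example "don't") will no longer match the cleaned text.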

## 4) Removing Stop Words from Text

The function remove_stopwors removes stop words from a text string. Stop words are very common words, such as "the", "and", and "to", that carry little information about the emotion expressed.

### Listing English Stop Words

In [10]:
", ".join(stopwords.words('english'))

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [11]:
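# A set makes the membership test below a constant-time lookup instead of a list scan
stop_word = set(stopwords.words('english'))
def remove_stopwors(text):
    # Keep only the words that are not stop words
    return " ".join([word for word in text.split() if word not in stop_word])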

In [12]:
df[['Comment','Emotion']] = df[['Comment','Emotion']].applymap(lambda x:remove_stopwors(x))
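
The filtering logic can be tried without downloading any NLTK data. The sketch below assumes a tiny hand-picked stop-word set (a hypothetical stand-in for stopwords.words('english')) and a made-up sentence.

```python
# Hypothetical mini stop-word set; the notebook uses stopwords.words('english')
stop_word = {"i", "to", "the", "and", "a", "so"}

def remove_stopwors(text):
    # Split on whitespace, drop stop words, and join the rest back together
    return " ".join(word for word in text.split() if word not in stop_word)

filtered = remove_stopwors("i feel so alone and scared to go")
print(filtered)  # feel alone scared go
```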

## 5) Stemming Words in a Text

The function stem_words reduces each word in a text string to its stem using the Porter stemming algorithm, so that inflected forms of a word share one token. Only the Comment column is stemmed. The Emotion labels are left as they are.

In [13]:
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [14]:
df['Comment'] = df['Comment'].apply(lambda x: stem_words(x))
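
As a quick illustration of what stemming does, the sketch below applies the same function to a made-up phrase. It assumes nltk is installed. The Porter stemmer itself needs no extra data download.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_words(text):
    # Stem each whitespace-separated word and join the stems back into one string
    return " ".join(ps.stem(word) for word in text.split())

stemmed = stem_words("running feeling angry")
print(stemmed)  # run feel angri
```

Stems are not always dictionary words ("angri" here). That is acceptable for this pipeline because the stems only serve as feature names for the vectorizer.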

## 6) Tokenizing Text

The function tocken splits a text string into individual words or tokens using the word_tokenize function from the Natural Language Toolkit (nltk). The tokens are stored in a separate column (named Tokens here) rather than overwriting Comment and Emotion, because TfidfVectorizer and LabelEncoder, which are used below, expect plain strings and not lists of tokens.

In [16]:
def tocken(text):
    return word_tokenize(text)

In [17]:
df['Tokens'] = df['Comment'].apply(tocken)

## 7) Splitting Data into Training and Testing Sets

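The data is split into training and testing sets with the train_test_split function from the sklearn.model_selection module. The comments (df['Comment']) are the features and the emotions (df['Emotion']) are the target. Setting test_size=0.2 holds out 20% of the 5937 rows (1188 rows) for testing, and random_state=42 makes the split reproducible.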

In [4]:
x_train_raw,x_test_raw,y_train_raw,y_test_raw = train_test_split(df['Comment'],df['Emotion'],test_size=0.2,random_state = 42)
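
The sketch below shows the same call on a small made-up dataset (texts and labels are hypothetical). It also passes stratify, an option the notebook does not use, which keeps the class proportions the same in both splits. Whether that helps here depends on how balanced the emotion labels are.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 comments, two evenly represented emotions
texts = [f"comment {i}" for i in range(10)]
labels = ["joy", "fear"] * 5

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # fixed seed for a reproducible split
    stratify=labels,    # assumption: preserve class proportions in both splits
)
print(len(x_tr), len(x_te), sorted(y_te))
```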

# Transforming Text Data Using TF-IDF Vectorization

TF-IDF vectorization transforms textual data into a numerical format that machine learning algorithms can take as input. It weighs the importance of words, giving more significance to those that are concentrated in a few documents while reducing the weight of words used throughout the corpus. This representation helps the model differentiate between texts based on their content. The vectorizer is fitted on the training comments only and then applied to the test comments, so no information from the test set leaks into the vocabulary or the IDF weights.



In [10]:
tfid = TfidfVectorizer()
x_train = tfid.fit_transform(x_train_raw)
x_test = tfid.transform(x_test_raw)

# Encoding Target Labels Using Label Encoding

Label encoding converts a categorical target variable into integer codes, which enables the use of algorithms and metrics that require numerical input. This approach is particularly useful for classification tasks where the target variable consists of distinct categories (e.g., emotions, classes). The encoder is fitted on the training labels and reused for the test labels so that both share the same mapping.

In [14]:
lb = LabelEncoder()
y_train = lb.fit_transform(y_train_raw)
y_test = lb.transform(y_test_raw)
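
The integer codes are easier to interpret once the mapping is visible. The sketch below uses a short made-up label list. LabelEncoder assigns codes in sorted label order, so with the three emotions seen in this dataset the mapping should be anger=0, fear=1, joy=2.

```python
from sklearn.preprocessing import LabelEncoder

lb = LabelEncoder()
codes = lb.fit_transform(["fear", "anger", "fear", "joy"])

print(list(lb.classes_))                       # labels in code order
print(list(codes))                             # encoded version of the input
print(list(lb.inverse_transform([0, 1, 2])))   # codes mapped back to labels
```

Calling inverse_transform on model predictions turns them back into emotion names.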

# Training a Multinomial Naive Bayes Model

Multinomial Naive Bayes is a simple yet effective algorithm for text classification tasks, such as sentiment analysis or emotion detection. It assumes that the features are conditionally independent given the class label and calculates the probabilities of each class based on the frequencies of the features in the training data.

In [16]:
nb_model = MultinomialNB()
nb_model.fit(x_train,y_train)
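
The following sketch trains the same kind of model on four made-up sentences (train_texts and train_labels are hypothetical) to show how class probabilities follow from word frequencies. It relies on the default alpha=1.0 smoothing, which keeps unseen words from producing zero probabilities.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data with disjoint vocabularies per class
train_texts = ["happy glad smile", "glad happy", "scared afraid", "afraid terror"]
train_labels = ["joy", "joy", "fear", "fear"]

vec = TfidfVectorizer()
nb_model = MultinomialNB()  # alpha=1.0 (additive smoothing) by default
nb_model.fit(vec.fit_transform(train_texts), train_labels)

# Words seen only in "joy" comments push the prediction toward "joy"
new_text = vec.transform(["happy smile"])
pred = nb_model.predict(new_text)[0]
proba = nb_model.predict_proba(new_text)[0]
print(pred, dict(zip(nb_model.classes_, proba.round(3))))
```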


# Training a Support Vector Machine Model

SVM is a robust classification technique that can handle both linear and nonlinear classification problems, depending on the kernel. SVC() with default settings, as used here, applies the RBF kernel. SVMs are particularly effective in high-dimensional spaces and are well-suited for text classification tasks, such as sentiment analysis and emotion detection, especially when combined with vectorization techniques like TF-IDF.

In [17]:
svm_model = SVC()
svm_model.fit(x_train,y_train)
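
The kernel is chosen through the kernel parameter of SVC. The sketch below fits the default model and a linear-kernel variant on a few made-up sentences. The linear variant is shown only as an alternative that is often tried on sparse TF-IDF features, not as something the notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy training data
texts = ["happy glad smile", "glad happy", "scared afraid", "afraid terror"]
labels = ["joy", "joy", "fear", "fear"]
X = TfidfVectorizer().fit_transform(texts)

rbf_svm = SVC()                    # default settings: kernel='rbf'
linear_svm = SVC(kernel="linear")  # alternative: linear decision boundary

rbf_svm.fit(X, labels)
linear_svm.fit(X, labels)

print(rbf_svm.kernel, linear_svm.kernel)
print(linear_svm.predict(X))
```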

## Making Predictions with the Naive Bayes Model

Making predictions with a trained model is a crucial step in the machine learning workflow. The predictions can then be compared with the true labels (y_test) to evaluate the model's performance. Common evaluation metrics include accuracy, precision, recall, and F1-score, which help assess how well the model performs in classifying the target labels.

In [19]:
pred_nb = nb_model.predict(x_test)

## Making Predictions with the Support Vector Machine Model

The same step is repeated for the SVM model. Because both models predict on the same test set (x_test), their predictions can be compared against the same true labels (y_test) using the metrics described above.

In [20]:
pred_svm = svm_model.predict(x_test)

## Generating Classification Reports for Model Evaluation

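The classification report provides a detailed breakdown of each model's performance: precision, recall, F1-score, and support for every class, along with the overall accuracy and the macro and weighted averages.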

In [23]:
report_nb = classification_report(y_test, pred_nb)
report_svm = classification_report(y_test, pred_svm)
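
Two options of classification_report make the result easier to work with: target_names replaces the integer codes with readable labels, and output_dict=True returns the numbers as a dictionary instead of a string. The sketch below uses made-up labels and predictions. The name order anger, fear, joy is an assumption that matches sorted label order.

```python
from sklearn.metrics import classification_report

# Hypothetical encoded labels and predictions
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# print() renders the line breaks that the raw string stores as \n
report = classification_report(y_true, y_pred, target_names=["anger", "fear", "joy"])
print(report)

# Dictionary form for programmatic access to individual metrics
report_dict = classification_report(y_true, y_pred, output_dict=True)
print(report_dict["accuracy"])
```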


## Calculating Accuracy for the Naive Bayes Model

Accuracy is a straightforward metric that gives an overall indication of the model's performance. However, it may not always be the best metric to rely on, especially in cases where the class distribution is imbalanced. It’s often useful to complement accuracy with other evaluation metrics like precision, recall, and F1-score for a more comprehensive understanding of the model's performance.

In [35]:
accuracy_nb = accuracy_score(y_test,pred_nb)
accuracy_nb

0.898989898989899
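
Accuracy is the number of correct predictions divided by the number of predictions, so the printed value can be turned back into a count. The short check below uses only figures that appear in this notebook (5937 rows, test_size=0.2, and the accuracy above) and relies on train_test_split rounding the test share up.

```python
import math

n_rows = 5937                      # rows in the dataset
n_test = math.ceil(0.2 * n_rows)   # size of the 20% test split, rounded up
accuracy_nb = 0.898989898989899    # value printed above

# Multiplying accuracy by the test-set size recovers the number of correct predictions
correct_nb = round(accuracy_nb * n_test)
print(n_test, correct_nb, correct_nb / n_test)
```

In other words, the Naive Bayes model labels 1068 of the 1188 test comments correctly.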

## Calculating Accuracy for the Support Vector Machine Model

Accuracy for the SVM model is computed the same way. As with the Naive Bayes model, precision, recall, and F1-score give a more complete view of the model's effectiveness, especially in cases of class imbalance.

In [34]:
accuracy_svm = accuracy_score(y_test,pred_svm)
accuracy_svm

0.9141414141414141

In [29]:
print(report_nb)

              precision    recall  f1-score   support

           0       0.87      0.94      0.90       392
           1       0.92      0.88      0.90       416
           2       0.91      0.88      0.89       380

    accuracy                           0.90      1188
   macro avg       0.90      0.90      0.90      1188
weighted avg       0.90      0.90      0.90      1188

In [36]:
print(report_svm)

              precision    recall  f1-score   support

           0       0.90      0.92      0.91       392
           1       0.96      0.87      0.91       416
           2       0.89      0.95      0.92       380

    accuracy                           0.91      1188
   macro avg       0.92      0.92      0.91      1188
weighted avg       0.92      0.91      0.91      1188

# Model Comparison

In [37]:
models = [nb_model,svm_model]

In [42]:
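def best(models):
    """Print the model with the highest weighted F1-score on the test set."""
    best_model = None
    best_f1 = 0

    for model in models:
        # Refit on the training data so the function also works with untrained estimators
        model.fit(x_train, y_train)
        pre1 = model.predict(x_test)
        # Weighted F1 averages the per-class F1-scores by class support
        f1 = f1_score(y_test, pre1, average="weighted")

        # Keep the model with the highest score seen so far
        if f1 > best_f1:
            best_f1 = f1
            best_model = model

    print("\nbest model")
    print(f"Model: {best_model} with f1_score: {best_f1}")

best(models)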


best model
Model: SVC() with f1_score: 0.9140861076081777
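
The weighted F1-score used for the comparison is the support-weighted mean of the per-class F1-scores. The sketch below checks this on made-up labels and predictions (y_true and y_pred are hypothetical) by comparing a manual computation with f1_score.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical encoded labels and predictions
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

per_class_f1 = f1_score(y_true, y_pred, average=None)   # one F1-score per class
support = np.bincount(y_true)                           # true instances per class

# Weight each class's F1-score by its share of the true labels
manual_weighted = np.average(per_class_f1, weights=support)
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")
print(per_class_f1, manual_weighted, sklearn_weighted)
```

Because larger classes carry more weight, this average stays close to the plain mean of the per-class scores when the classes are similar in size, as they are in the test set here.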


## SVM Model Evaluation Report: Justification for Selection
The classification report for the Support Vector Machine (SVM) model supports choosing it over Naive Bayes for this emotion classification task, although the margin is small: accuracy is 0.914 versus 0.899. The key metrics are broken down below. Classes 0, 1, and 2 are the integer codes assigned by the label encoder.

### High Precision

The SVM model achieves precision scores of 0.90 for class 0, 0.96 for class 1, and 0.89 for class 2. Precision is the share of predictions for a class that are correct, so these scores indicate low rates of false positives for every class.

### Strong Recall

The recall values of 0.92 for class 0, 0.87 for class 1, and 0.95 for class 2 show that the model captures most of the actual instances of each class. Class 1 has the lowest recall, with about 13% of its instances assigned to another class, but the overall performance remains strong.

### Balanced F1-Scores

The F1-scores are 0.91 for class 0, 0.91 for class 1, and 0.92 for class 2, reflecting a good balance between precision and recall with no class lagging behind. This balance matters in emotion classification, where both missed emotions and misattributed emotions reduce the reliability of the results.

### Overall Accuracy

With an overall accuracy of 0.91, the SVM model correctly classifies 91% of the 1188 instances in the test set, slightly more than the Naive Bayes model (0.90).