# Intro

Sentiment analysis is a technique for analyzing a piece of text to determine the sentiment behind it. In this notebook, we train a Naïve Bayes classifier for sentiment analysis on the Hugging Face `emotion` dataset.

**Please pay attention to these notes:**

<br/>

- Write your code in the cells denoted by:
```
######## Your Code Here ########
```
- You can add more cells if necessary
- Any form of copying will result in a grade of zero.
- When your solution is ready to submit, rename this notebook to "Name_StudentID.ipynb".
- If you have any questions about this assignment, feel free to drop us a line. You can also ask in the Telegram group.
- You must run this notebook on the Google Colab platform.

<br/>



# Libraries

In [19]:
# Standard library
import collections
import re
import string
from collections import Counter

# Third-party libraries
import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split as tts
from sklearn.naive_bayes import MultinomialNB

# Download the NLTK resources used for stop-word removal and lemmatization
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Load data

In [2]:
!pip install datasets

from datasets import load_dataset
emotion_data = load_dataset("emotion")

"""
    emotion_data is a dictionary-like object containing the train, validation, and test splits.
    For convenience, you can convert each split to a pandas DataFrame.
"""

README.md:   0%|          | 0.00/9.05k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/127k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/129k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

# Preprocess
The first step in NLP is text preprocessing. Data cleaning is crucial for any machine learning model, and even more so for NLP: without it, the dataset is often just a jumble of tokens that the model cannot use effectively. Raw text, whether well formed or not, usually contains many unwanted components such as null values, HTML tags, links/URLs, emoji, and stop words. In this step, these components are removed to improve performance and accuracy.
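
As a minimal sketch of the cleaning steps described above, the snippet below uses only the standard library. The tiny stop-word set `TOY_STOP_WORDS` and the function name `clean_text` are hypothetical and chosen for illustration; the notebook itself relies on NLTK's full English stop-word list and also lemmatizes, which this sketch omits.

```python
import re

# Hypothetical, hand-picked stop words for illustration only;
# the notebook uses NLTK's English stop-word list instead.
TOY_STOP_WORDS = {"i", "am", "so", "the", "a", "this", "is", "at"}

def clean_text(text):
    """Lowercase, strip URLs, HTML tags and punctuation, then drop toy stop words."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # remove links
    text = re.sub(r"<.*?>", "", text)           # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", "", text)     # remove punctuation and emoji
    return " ".join(w for w in text.split() if w not in TOY_STOP_WORDS)

sample = "I am SO happy!!! <b>Look</b> at this: https://example.com :)"
print(clean_text(sample))  # prints "happy look"
```

The order matters here: links and tags are removed before punctuation, because stripping punctuation first would destroy the `<`, `>`, and `://` markers the earlier patterns rely on.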

In [58]:
from datasets import concatenate_datasets

# Keep only labels 0 (sadness) and 1 (joy) to get a binary sentiment task
train_data = emotion_data['train'].filter(lambda x: x['label'] in [0, 1])
val_data = emotion_data['validation'].filter(lambda x: x['label'] in [0, 1])
merged_data = concatenate_datasets([train_data, val_data])

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
punctuations = string.punctuation

def preprocess_text(text):
    text = text.lower()
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    # Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    # Remove punctuation, emoji, and other non-alphanumeric characters
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Remove stop words, then lemmatize the remaining tokens
    text = " ".join(word for word in text.split() if word not in stop_words)
    text = " ".join(lemmatizer.lemmatize(word) for word in text.split())

    return text

# Training
Use the Naive Bayes algorithm to train a sentiment classifier.
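
To make the algorithm concrete, here is a rough from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on raw word counts. The names `train_nb` and `predict_nb` and the toy documents are assumptions made for illustration. The notebook uses scikit-learn's `MultinomialNB` on TF-IDF features, so its numbers will not match this simplified version.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Add-alpha smoothing keeps unseen words from getting zero probability
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab

def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c]
        + sum(log_likelihood[c][w] for w in doc.split() if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy data: 0 = sadness, 1 = joy (same label convention as the notebook)
toy_docs = ["feel sad lonely", "feel miserable", "feel happy excited", "love happy day"]
toy_labels = [0, 0, 1, 1]
model = train_nb(toy_docs, toy_labels)
print(predict_nb(model, "happy love"))     # prints 1
print(predict_nb(model, "sad miserable"))  # prints 0
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.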

In [65]:
# Clean every text with the preprocessing function defined above
texts = [preprocess_text(text) for text in merged_data['text']]
labels = merged_data['label']

# Hold out 20% of the merged data for testing
X_train, X_test, y_train, y_test = tts(texts, labels, test_size=0.2, random_state=42)

# Fit TF-IDF on the training texts only, then reuse the same vocabulary for the test texts
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vectorized, y_train)

y_train_pred = nb_classifier.predict(X_train_vectorized)
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))

Train Accuracy: 0.9864819944598338


# Test
Now run inference on your test set.

In [66]:
y_pred = nb_classifier.predict(X_test_vectorized)
print("Test Accuracy:", accuracy_score(y_test, y_pred))

Test Accuracy: 0.9539211342490032


# Evaluation
After training is finished, we need metrics to evaluate the trained model on the test set. Here, you need to write code that calculates the metrics below <span style="background-color: yellow;">without the sklearn libraries</span> and compare your results with the sklearn results.
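
As a reference for the definitions involved, the following sketch computes precision, recall, and F1 for one class directly from label lists, without sklearn. The helper names and the toy labels are hypothetical; only the formulas (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as their harmonic mean) are standard.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true positives, false positives, and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, fn

def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) for the class treated as positive."""
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels for illustration: 0 = sadness, 1 = joy
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 1, 0, 1, 0]
print(precision_recall_f1(toy_true, toy_pred, positive=1))  # prints (0.75, 0.75, 0.75)
```

The zero-denominator guards mirror the ones used in the notebook cells: a class that is never predicted has no defined precision, so the value falls back to 0.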

# Confusion matrix

In [68]:
classes = sorted(set(y_test))
conf_matrix = np.zeros((len(classes), len(classes)), dtype=int)

# Rows are true labels, columns are predicted labels
for true, pred in zip(y_test, y_pred):
    conf_matrix[true, pred] += 1
print("Manual Confusion Matrix:")
print(conf_matrix)
print("sklearn Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Manual Confusion Matrix:
[[ 964   69]
 [  35 1189]]
sklearn Confusion Matrix:
[[ 964   69]
 [  35 1189]]
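
As a quick consistency check, the test accuracy can be recovered from the confusion matrix alone: the diagonal holds the correct predictions, and the sum of all entries is the test-set size. The sketch below hard-codes the matrix reported above; the variable names are illustrative.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions
conf = [[964, 69],
        [35, 1189]]

correct = sum(conf[i][i] for i in range(len(conf)))  # diagonal entries
total = sum(sum(row) for row in conf)                # all test examples
accuracy = correct / total
print(correct, total, accuracy)
```

This gives 2153 correct predictions out of 2257 test examples, which agrees with the reported test accuracy of about 0.9539.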


# Precision

In [69]:
# Class 0 (sadness): false positives are true-joy examples predicted as sadness
TP_0 = conf_matrix[0, 0]
FP_0 = conf_matrix[1, 0]
precision_0 = TP_0 / (TP_0 + FP_0) if (TP_0 + FP_0) > 0 else 0
print(f"Manual Precision (Class sadness): {precision_0}")
print(f"sklearn Precision (Class sadness): {precision_score(y_test, y_pred, labels=[0], average='macro')}")

# Class 1 (joy): false positives are true-sadness examples predicted as joy
TP_1 = conf_matrix[1, 1]
FP_1 = conf_matrix[0, 1]
precision_1 = TP_1 / (TP_1 + FP_1) if (TP_1 + FP_1) > 0 else 0
print(f"Manual Precision (Class joy): {precision_1}")
print(f"sklearn Precision (Class joy): {precision_score(y_test, y_pred, labels=[1], average='macro')}")

Manual Precision (Class sadness): 0.964964964964965
sklearn Precision (Class sadness): 0.964964964964965
Manual Precision (Class joy): 0.9451510333863276
sklearn Precision (Class joy): 0.9451510333863276


# Recall

In [70]:
# Class 0 (sadness): false negatives are true-sadness examples predicted as joy
FN_0 = conf_matrix[0, 1]
recall_0 = TP_0 / (TP_0 + FN_0) if (TP_0 + FN_0) > 0 else 0
print(f"Manual Recall (Class sadness): {recall_0}")
print(f"sklearn Recall (Class sadness): {recall_score(y_test, y_pred, labels=[0], average='macro')}")

# Class 1 (joy): false negatives are true-joy examples predicted as sadness
FN_1 = conf_matrix[1, 0]
recall_1 = TP_1 / (TP_1 + FN_1) if (TP_1 + FN_1) > 0 else 0
print(f"Manual Recall (Class joy): {recall_1}")
print(f"sklearn Recall (Class joy): {recall_score(y_test, y_pred, labels=[1], average='macro')}")


Manual Recall (Class sadness): 0.9332042594385286
sklearn Recall (Class sadness): 0.9332042594385286
Manual Recall (Class joy): 0.9714052287581699
sklearn Recall (Class joy): 0.9714052287581699


# F-measure

In [71]:
# F1 is the harmonic mean of precision and recall
f1_0 = 2 * (precision_0 * recall_0) / (precision_0 + recall_0) if (precision_0 + recall_0) > 0 else 0
print(f"Manual F1-Measure (Class sadness): {f1_0}")
print(f"sklearn F1-Measure (Class sadness): {f1_score(y_test, y_pred, labels=[0], average='macro')}")

f1_1 = 2 * (precision_1 * recall_1) / (precision_1 + recall_1) if (precision_1 + recall_1) > 0 else 0
print(f"Manual F1-Measure (Class joy): {f1_1}")
print(f"sklearn F1-Measure (Class joy): {f1_score(y_test, y_pred, labels=[1], average='macro')}")


Manual F1-Measure (Class sadness): 0.9488188976377953
sklearn F1-Measure (Class sadness): 0.9488188976377953
Manual F1-Measure (Class joy): 0.9580983078162771
sklearn F1-Measure (Class joy): 0.9580983078162773
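
The two F1 values reported for class joy differ only in the last printed digit. A difference of that size is typical of floating-point rounding when algebraically equivalent formulas are evaluated in a different order. The sketch below compares two such forms using the counts from the confusion matrix; it is only an illustration, and the source does not state which form sklearn uses internally.

```python
import math

# Class joy, read off the confusion matrix above
tp, fp, fn = 1189, 69, 35

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Form 1: harmonic mean of precision and recall (as in the notebook cell)
f1_harmonic = 2 * precision * recall / (precision + recall)
# Form 2: the same quantity written directly in terms of the counts
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f1_harmonic, f1_counts)
print(math.isclose(f1_harmonic, f1_counts, rel_tol=1e-12))  # prints True
```

When comparing a manual metric against a library result, a tolerance-based check such as `math.isclose` is therefore more appropriate than exact equality.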
