# Intro
Sentiment analysis is a technique through which you can analyze a piece of text to determine the sentiment behind it. In this notebook, we're going to train a GRADIENT DESCENT Classifier for the task of sentiment analysis on hugging face emotion dataset.

**Please pay attention to these notes:**

<br/>

- Write your code in the cells denoted by:
```
######## Your Code Here ########
```
- You can add more cells if necessary
- Finding any sort of copying will zero down your grade.
- When your solution is ready to submit, don't forget to set the name of this notebook like  "Name_StudentID.ipynb".
- If you have any questions about this assignment, feel free to drop us a line. You can also ask your questions on the telegram group.
- You must run this notebook on Google Colab platform.

<br/>



# Libraries

In [1]:
# importing the libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import numpy as np
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('punkt_tab')
import string
from nltk.stem import WordNetLemmatizer
import collections
from collections import Counter
from sklearn.model_selection import train_test_split as tts

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


# Load data

In [2]:
!pip install datasets

from datasets import load_dataset
emotion_data = load_dataset("emotion")

"""
    emotion_data is a dictionary contains train, val, and test data.
    for your convenience you can convert each of them to pandas dataframe.
"""

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/9.05k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/127k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/129k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/16000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/2000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/2000 [00:00<?, ? examples/s]

'\n    emotion_data is a dictionary contains train, val, and test data.\n    for your convenience you can convert each of them to pandas dataframe.\n'

In [12]:
train_data = emotion_data["train"].to_pandas()
validation_data = emotion_data["validation"].to_pandas()
test_data = emotion_data["test"].to_pandas()
print(type(train_data))
combined_data = pd.concat([train_data, validation_data])

filtered_data = combined_data[combined_data["label"].isin([0, 1])]
test_data = test_data[test_data["label"].isin([0, 1])]

<class 'pandas.core.frame.DataFrame'>


# Preprocess
The first step of NLP is text preprocessing. Data cleaning is a very crucial step in any machine learning model, but more so for NLP. Without the cleaning process, the dataset is often a cluster of words that the computer doesn’t understand. Raw data over a properly or improperly formed sentence is not always desirable as it contains lot of unwanted components like null/html/links/url/emoji/stopwords etc. So in this step, this unwanted components are removed for better performance and accuracy.

In [4]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)

filtered_data["text"] = filtered_data["text"].apply(preprocess_text)
test_data["text"] = test_data["text"].apply(preprocess_text)
vocab = list(set(word for sentence in filtered_data["text"] for word in sentence.split()))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}

def text_to_vector(sentence):
    vector = np.zeros(len(vocab))
    for word in sentence.split():
        if word in word_to_idx:
            vector[word_to_idx[word]] += 1
    return vector

X = np.array([text_to_vector(sentence) for sentence in filtered_data["text"]])
y = filtered_data["label"].values

X_test = np.array([text_to_vector(sentence) for sentence in test_data["text"]])
y_test = test_data["label"].values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_data["text"] = filtered_data["text"].apply(preprocess_text)


# Training
Use GRADIENT DESCENT algorithm to train a Language Model

In [13]:
# The Sigmoid Function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Loss Function (Binary Cross-Entropy)
def compute_loss(y_true, y_pred):
    return - (y_true * np.log(y_pred + 1e-15) + (1 - y_true) * np.log(1 - y_pred + 1e-15))

# Stochastic Gradient Descent Algorithm for Logistic Regression
def train_logistic_regression_sgd(X, y, learning_rate=0.01, bias=0, epochs=100):
    n_samples, n_features = X.shape
    np.random.seed(0)
    theta = np.random.randn(n_features)  # Random initialize parameters

    for epoch in range(epochs):
        for i in range(n_samples):
            # Select one sample at a time
            x_i = X[i]
            y_i = y[i]

            # Compute prediction (sigmoid)
            y_pred = sigmoid(np.dot(theta, x_i)+bias)

            # Compute loss
            loss = compute_loss(y_i, y_pred)

            # Compute gradient
            gradient_w = (y_pred - y_i) * x_i
            gradient_b = (y_pred - y_i)

            # Update parameters
            theta -= learning_rate * gradient_w
            bias -= learning_rate * gradient_b

        # Print loss at each epoch for monitoring
        if epoch % 10 == 0:
            y_preds_epoch = sigmoid(np.dot(X, theta))
            epoch_loss = np.mean(compute_loss(y, y_preds_epoch))
            print(f"Epoch {epoch}, Loss: {epoch_loss:.3}")

    return theta , bias

learning_rate = 0.01
epochs = 1000
bias = 0
theta , bias = train_logistic_regression_sgd(X, y, learning_rate,bias, epochs)

Epoch 0, Loss: 1.01
Epoch 10, Loss: 0.288
Epoch 20, Loss: 0.164
Epoch 30, Loss: 0.115
Epoch 40, Loss: 0.0898
Epoch 50, Loss: 0.0741
Epoch 60, Loss: 0.0635
Epoch 70, Loss: 0.0557
Epoch 80, Loss: 0.0498
Epoch 90, Loss: 0.0451
Epoch 100, Loss: 0.0413
Epoch 110, Loss: 0.0382
Epoch 120, Loss: 0.0355
Epoch 130, Loss: 0.0332
Epoch 140, Loss: 0.0312
Epoch 150, Loss: 0.0294
Epoch 160, Loss: 0.0279
Epoch 170, Loss: 0.0265
Epoch 180, Loss: 0.0253
Epoch 190, Loss: 0.0241
Epoch 200, Loss: 0.0231
Epoch 210, Loss: 0.0222
Epoch 220, Loss: 0.0213
Epoch 230, Loss: 0.0206
Epoch 240, Loss: 0.0198
Epoch 250, Loss: 0.0192
Epoch 260, Loss: 0.0185
Epoch 270, Loss: 0.018
Epoch 280, Loss: 0.0174
Epoch 290, Loss: 0.0169
Epoch 300, Loss: 0.0164
Epoch 310, Loss: 0.016
Epoch 320, Loss: 0.0155
Epoch 330, Loss: 0.0151
Epoch 340, Loss: 0.0148
Epoch 350, Loss: 0.0144
Epoch 360, Loss: 0.0141
Epoch 370, Loss: 0.0137
Epoch 380, Loss: 0.0134
Epoch 390, Loss: 0.0131
Epoch 400, Loss: 0.0128
Epoch 410, Loss: 0.0126
Epoch 420,

# Test
Now you need to run inference on your test set

In [14]:
y_preds = [sigmoid(np.dot(theta, x) + bias) >= 0.5 for x in X_test]
accuracy = np.mean(y_preds == y_test)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9780564263322884


# Evaluation
After training is finished, we need some metrics to evaluate the trained model on the test set. Here, you need to write code for utilizing the metrics bellow without the sklearn libraries!

Calculated Manuely

In [15]:
y_preds = np.array([1 if sigmoid(np.dot(theta, x) + bias) >= 0.5 else 0 for x in X_test])

# Initialize counts
tp = 0  # True Positive
tn = 0  # True Negative
fp = 0  # False Positive
fn = 0  # False Negative

for true, pred in zip(y_test, y_preds):
    if true == 1 and pred == 1:
        tp += 1
    elif true == 0 and pred == 0:
        tn += 1
    elif true == 0 and pred == 1:
        fp += 1
    elif true == 1 and pred == 0:
        fn += 1
precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

# Display the results
print("Confusion Matrix:")
print(np.array([[tn, fp], [fn, tp]]))
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F-Measure: {f1:.2f}")

Confusion Matrix:
[[561  20]
 [  8 687]]
Precision: 0.97
Recall: 0.99
F-Measure: 0.98


Use sklearn.metrics

In [16]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

precision = precision_score(y_test, y_preds)
recall = recall_score(y_test, y_preds)
f1_score = f1_score(y_test, y_preds)
confusion_matrix = confusion_matrix(y_test, y_preds)

print("Confusion Matrix:")
print(confusion_matrix)
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1_score:.2f}")

Confusion Matrix:
[[561  20]
 [  8 687]]
Precision: 0.97
Recall: 0.99
F1 Score: 0.98
