<a href="https://colab.research.google.com/github/SandySingh72/DATA_Analytics/blob/main/Project_Sentiments_Classification_For_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Sentiment Classification for Coffee Maker Reviews**

**Problem Statement:**

1. We are given a dataset "coffee_maker.csv" with two columns: "review" and "rating".
2. The task is to classify the reviews as Negative, Neutral, or Positive sentiments based on the following mapping:

         • Ratings 1 or 2 are labeled as Negative
         • Rating 3 is labeled as Neutral
         • Ratings 4 or 5 are labeled as Positive

# **Stepwise Approach:**

**Setup: Loading of Libraries**

In [1]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

**Step - 1: Data Loading and Preprocessing:**

1. Loaded the dataset and removed rows with missing values in "review" or "rating".
2. Converted "rating" column to integers.
3. Mapped ratings to sentiment labels ("Negative", "Neutral", "Positive").

In [3]:
df = pd.read_csv('/content/coffee_maker.csv')  # Load the data file (coffee_maker.csv)
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')  # Non-numeric ratings become NaN
df.dropna(subset=['review', 'rating'], inplace=True)  # Drop rows with a missing review or rating
df['rating'] = df['rating'].astype(int)  # Convert the rating to integer form
# Map each rating to its sentiment label
def map_sentiment(rating):
    if rating in [1, 2]:
        return 'Negative'
    elif rating == 3:
        return 'Neutral'
    elif rating in [4, 5]:
        return 'Positive'
    return None  # Ratings outside 1-5 get no label
df['sentiment'] = df['rating'].apply(map_sentiment)
display(df.head())

| Unnamed: 0 | review_date | handle | rating | helpfulness_rating | review | sentiment |
|---|---|---|---|---|---|---|
| 0 | April 14, 2018 | The Dolphin | 2 | 513 | Delightful coffee maker if you’re only looking... | Negative |
| 1 | February 7, 2019 | Karen Kaffenberger | 1 | 122 | UPDATE: Bought this 10-21-18 and I finally ret... | Negative |
| 2 | December 23, 2017 | C1C3C11 | 4 | 185 | The big reason I ordered this was because I wa... | Positive |
| 3 | November 26, 2016 | Paul Roberts | 5 | 224 | I've owned several of their older brewstation ... | Positive |
| 4 | November 28, 2017 | JennyD | 3 | 116 | I agonized over which coffee maker to purchase... | Neutral |
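
Accuracy and weighted F1 both depend on how the three labels are distributed, so it can help to check the class shares right after the mapping. The following is a minimal sketch in plain Python; `label_shares` and `example_ratings` are hypothetical names, and the ratings are made up for illustration rather than taken from coffee_maker.csv.

```python
from collections import Counter

def map_sentiment(rating):
    """Map a 1-5 star rating to a sentiment label (same rule as the notebook)."""
    if rating in (1, 2):
        return "Negative"
    if rating == 3:
        return "Neutral"
    if rating in (4, 5):
        return "Positive"
    return None  # anything outside 1-5 is left unlabeled

def label_shares(ratings):
    """Return {label: fraction of rows} for a list of ratings."""
    labels = [map_sentiment(r) for r in ratings]
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical ratings for illustration only, not taken from coffee_maker.csv
example_ratings = [5, 4, 5, 1, 3, 5, 2, 4, 5, 4]
shares = label_shares(example_ratings)
print(shares)
```

On the real DataFrame, `df['sentiment'].value_counts(normalize=True)` gives the same kind of summary.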


**Step - 2**

**Approach 1 – Zero-shot Classification with Pretrained Transformer:**

1. Used the Hugging Face model "facebook/bart-large-mnli".
2. Applied zero-shot classification on each review with target labels.

In [4]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")  # Load the zero-shot classification model
labels = ["Negative", "Neutral", "Positive"]  # Candidate labels for the classifier
preds = [classifier(str(text), labels)['labels'][0] for text in tqdm(df['review'])]  # Keep the top-scoring label for each review
accuracy = accuracy_score(df['sentiment'], preds)  # Accuracy against the rating-derived labels
f1 = f1_score(df['sentiment'], preds, average='weighted')  # F1 averaged across classes, weighted by class frequency
print(f"Zero-shot Model Accuracy: {accuracy:.4f}")
print(f"Zero-shot Model F1-score: {f1:.4f}")

Device set to use cuda:0
100%|██████████| 4999/4999 [08:24<00:00,  9.92it/s]

Zero-shot Model Accuracy: 0.8326
Zero-shot Model F1-score: 0.8016
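
The run above calls the classifier once per review, which took a little over eight minutes for 4,999 reviews. Passing reviews in batches is a common way to use the GPU more efficiently. The sketch below shows only the batching logic in plain Python: `batched`, `classify_in_batches`, and `fake_classify_batch` are hypothetical helpers, and the stand-in classifier replaces the real model so the example runs without a download. The commented-out call to `classifier` assumes the pipeline returns one result dict per input text when given a list, which should be checked against the installed transformers version.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def classify_in_batches(texts, classify_batch, batch_size=16):
    """Apply `classify_batch` (list of texts -> list of labels) batch by batch."""
    predictions = []
    for chunk in batched(texts, batch_size):
        predictions.extend(classify_batch(chunk))
    return predictions

# Hypothetical stand-in for the real model so the sketch runs without a download.
# With the notebook's pipeline, this callable could instead be something like
# (assumption, not verified here):
#   lambda chunk: [r["labels"][0] for r in classifier(chunk, labels)]
def fake_classify_batch(chunk):
    return ["Positive" if "love" in text.lower() else "Negative" for text in chunk]

reviews = ["Love it", "Broke after a week", "I love the carafe", "Leaks", "Too loud"]
batch_preds = classify_in_batches(reviews, fake_classify_batch, batch_size=2)
print(batch_preds)
```

The predictions come back in the same order as the input, so they can be compared with `df['sentiment']` exactly as before.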





**Train-Test Split:**

Used sklearn's train_test_split to divide the data into 70% training and 30% testing.

In [5]:
X = df['review']  # Review text (model input)
y = df['sentiment']  # Sentiment label (target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
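
The split above is purely random. When classes are imbalanced, a stratified split keeps roughly the same class proportions in the training and test sets; in scikit-learn this is available through the `stratify` argument of `train_test_split`. The sketch below illustrates the idea in plain Python. `stratified_split` and `toy_labels` are hypothetical, the labels are made up, and whether stratifying would change the results in this notebook has not been tested.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) with roughly `test_size` of each class in test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels for illustration only
toy_labels = ["Positive"] * 70 + ["Negative"] * 20 + ["Neutral"] * 10
train_idx, test_idx = stratified_split(toy_labels)
test_labels = [toy_labels[i] for i in test_idx]
print(len(train_idx), len(test_idx))
print({c: test_labels.count(c) for c in ("Positive", "Negative", "Neutral")})
```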

**Approach 2 – Sentence Embedding + Logistic Regression:**

1. Used "all-MiniLM-L6-v2" from SentenceTransformers to embed reviews.
2. Trained Logistic Regression classifier on embedded vectors.

In [6]:
model = SentenceTransformer('all-MiniLM-L6-v2')  # Load the pretrained sentence-embedding model
# Generate embeddings
X_train_emb = model.encode(X_train.tolist(), show_progress_bar=True)
X_test_emb = model.encode(X_test.tolist(), show_progress_bar=True)
# Train a logistic regression classifier on the embeddings
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_emb, y_train)
# Predict on the held-out test set and evaluate
y_pred = clf.predict(X_test_emb)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Custom Classifier Accuracy: {acc:.4f}")
print(f"Custom Classifier F1-score: {f1:.4f}")


Custom Classifier Accuracy: 0.7940
Custom Classifier F1-score: 0.7539
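
Each approach in this notebook reports a weighted F1 score that sits below its accuracy. One possible reason, not verified on this dataset, is that a minority class such as Neutral is recognized poorly. The sketch below computes weighted F1 by hand on made-up labels to show how that gap can arise; `per_class_f1`, `weighted_f1`, and the toy labels are hypothetical and only meant to mirror what `f1_score(..., average='weighted')` does.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's share of y_true."""
    total = len(y_true)
    return sum(
        per_class_f1(y_true, y_pred, label) * y_true.count(label) / total
        for label in sorted(set(y_true))
    )

# Hypothetical labels: every Neutral review is predicted as Positive
toy_true = ["Positive"] * 6 + ["Negative"] * 2 + ["Neutral"] * 2
toy_pred = ["Positive"] * 6 + ["Negative"] * 2 + ["Positive"] * 2

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
toy_f1 = weighted_f1(toy_true, toy_pred)
print(f"accuracy={toy_accuracy:.4f}, weighted F1={toy_f1:.4f}")
```

In this toy case the accuracy is 0.8 while the weighted F1 is about 0.71, because the Neutral class contributes an F1 of zero.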


**Approach 3 – MPNet Embedding + Logistic Regression:**

1. Used "all-mpnet-base-v2" for sentence embedding.
2. Trained Logistic Regression on MPNet embeddings.

In [7]:
from sentence_transformers import SentenceTransformer
#Load more powerful model
mpnet_model = SentenceTransformer('all-mpnet-base-v2')


In [8]:
#Generate embeddings with mpnet
X_train_emb_mpnet = mpnet_model.encode(X_train.tolist(), show_progress_bar=True)
X_test_emb_mpnet = mpnet_model.encode(X_test.tolist(), show_progress_bar=True)


In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

In [10]:
#Train classifier using logistic regression
clf_mpnet = LogisticRegression(max_iter=1000)
clf_mpnet.fit(X_train_emb_mpnet, y_train)
#Predict
y_pred_mpnet = clf_mpnet.predict(X_test_emb_mpnet)
acc_mpnet = accuracy_score(y_test, y_pred_mpnet)
f1_mpnet = f1_score(y_test, y_pred_mpnet, average='weighted')
print(f"MPNet Classifier Accuracy: {acc_mpnet:.4f}")
print(f"MPNet Classifier F1-score: {f1_mpnet:.4f}")

MPNet Classifier Accuracy: 0.8200
MPNet Classifier F1-score: 0.7795


**Approach 4 – MPNet Embedding + XGBoost:**

Used MPNet embeddings as input to XGBoost classifier.

In [11]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)
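
XGBoost expects integer class labels, which is why the string labels are encoded first. `LabelEncoder` stores the class names in sorted order and assigns each one its position as the integer code. The plain-Python sketch below mimics that behavior to make the mapping explicit; `fit_label_encoding` and `toy_y` are hypothetical names used only for illustration.

```python
def fit_label_encoding(labels):
    """Return (classes, to_int): sorted class names and their integer codes."""
    classes = sorted(set(labels))
    to_int = {label: i for i, label in enumerate(classes)}
    return classes, to_int

# Hypothetical labels for illustration only
toy_y = ["Positive", "Negative", "Neutral", "Positive"]
classes, to_int = fit_label_encoding(toy_y)

encoded = [to_int[label] for label in toy_y]  # strings -> integers
decoded = [classes[i] for i in encoded]       # integers -> strings

print(classes)
print(encoded)
print(decoded)
```

With the real encoder, `le.classes_` holds the sorted class names and `le.inverse_transform` converts integer predictions back to the sentiment strings.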

In [12]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score
clf_xgb = XGBClassifier(eval_metric='mlogloss')  # use_label_encoder is omitted: this XGBoost version ignores it and warns
clf_xgb.fit(X_train_emb_mpnet, y_train_enc)
y_pred_xgb = clf_xgb.predict(X_test_emb_mpnet)
acc_xgb = accuracy_score(y_test_enc, y_pred_xgb)
f1_xgb = f1_score(y_test_enc, y_pred_xgb, average='weighted')
print(f"MPNet + XGBoost Accuracy: {acc_xgb:.4f}")
print(f"MPNet + XGBoost F1-score: {f1_xgb:.4f}")


MPNet + XGBoost Accuracy: 0.8133
MPNet + XGBoost F1-score: 0.7730


**Approach 5 – MPNet Embedding + SVM:**

Trained a Support Vector Machine classifier using MPNet embeddings.

In [13]:
from sklearn.svm import SVC
clf_svm = SVC(kernel='linear')
clf_svm.fit(X_train_emb_mpnet, y_train)
y_pred_svm = clf_svm.predict(X_test_emb_mpnet)
acc_svm = accuracy_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm, average='weighted')
print(f"MPNet + SVM Accuracy: {acc_svm:.4f}")
print(f"MPNet + SVM F1-score: {f1_svm:.4f}")

MPNet + SVM Accuracy: 0.8207
MPNet + SVM F1-score: 0.7788


# **Comparison of Results:**

| Method | Accuracy | F1-Score |
|---|---|---|
| Zero-shot (BART-large-MNLI) | 0.8326 | 0.8016 |
| MiniLM Embedding + Logistic Regression | 0.7940 | 0.7539 |
| MPNet Embedding + Logistic Regression | 0.8200 | 0.7795 |
| MPNet Embedding + XGBoost | 0.8133 | 0.7730 |
| MPNet Embedding + SVM | 0.8207 | 0.7788 |
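
To judge how much these scores add over a trivial model, it can help to compare them with a majority-class baseline that always predicts the most frequent training label. The sketch below is a minimal illustration in plain Python; `majority_baseline_accuracy` and the toy labels are hypothetical, and the class proportions of coffee_maker.csv are not shown in this notebook, so no baseline figure is claimed for the real data.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Predict the most frequent training label for every test row."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    correct = sum(label == majority_label for label in y_test)
    return majority_label, correct / len(y_test)

# Hypothetical labels for illustration only
toy_train = ["Positive"] * 7 + ["Negative"] * 2 + ["Neutral"]
toy_test = ["Positive", "Positive", "Negative", "Positive", "Neutral"]

baseline_label, baseline_acc = majority_baseline_accuracy(toy_train, toy_test)
print(baseline_label, baseline_acc)
```

scikit-learn offers the same baseline as `DummyClassifier(strategy='most_frequent')`.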

# **Conclusion:**

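The zero-shot classification using "facebook/bart-large-mnli" gave the highest accuracy (0.8326) and
F1 score (0.8016), making it the most suitable option among those tested. Note that it was scored on
the full dataset (4,999 reviews), while the trained classifiers were scored on the 30% test split, so
the comparison is indicative rather than exact. Among custom-trained models, MPNet embeddings
combined with Logistic Regression or a linear SVM performed best (accuracy about 0.82, F1 about 0.78),
with XGBoost close behind, and could be further tuned for improvements.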