
# **Sentiment Classification for Coffee Maker Reviews**

**Problem Statement:**

1. We are given a dataset "coffee_maker.csv" with two columns: "review" and "rating".
2. The task is to classify each review as Negative, Neutral, or Positive based on the following mapping:

         • Ratings 1 or 2 are labeled as Negative
         • Rating 3 is labeled as Neutral
         • Ratings 4 or 5 are labeled as Positive

# **Stepwise Approach:**

**Setup: Loading Libraries**

In [None]:
import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

**Step - 1: Data Loading and Preprocessing:**

1. Loaded the dataset and removed rows with missing values in "review" or "rating".
2. Converted "rating" column to integers.
3. Mapped ratings to sentiment labels ("Negative", "Neutral", "Positive").

In [None]:
df = pd.read_csv('/content/coffee_maker.csv')  # Load the data file (coffee_maker.csv)
# Coerce ratings to numbers first, so non-numeric entries become NaN and are dropped with the other missing values
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
df.dropna(subset=['review', 'rating'], inplace=True)  # Drop rows with a missing review or rating
df['rating'] = df['rating'].astype(int)  # Convert the rating to integer form

# Map each rating to its sentiment label
def map_sentiment(rating):
    if rating in [1, 2]:
        return 'Negative'
    elif rating == 3:
        return 'Neutral'
    elif rating in [4, 5]:
        return 'Positive'
    return None  # Ratings outside 1-5 get no label

df['sentiment'] = df['rating'].apply(map_sentiment)
df.dropna(subset=['sentiment'], inplace=True)  # Drop rows whose rating falls outside 1-5
display(df.head())
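
Before comparing models, it can help to check how the three labels are distributed, since both accuracy and weighted F1 are sensitive to class imbalance. A minimal sketch, assuming a small made-up list of ratings (the values below are illustrative and not taken from coffee_maker.csv; `sentiment_shares` is a hypothetical helper):

```python
from collections import Counter

def map_sentiment(rating):
    """Same rating-to-label mapping as in the notebook."""
    if rating in (1, 2):
        return "Negative"
    if rating == 3:
        return "Neutral"
    if rating in (4, 5):
        return "Positive"
    return None

def sentiment_shares(ratings):
    """Return the fraction of reviews falling under each sentiment label."""
    labels = [map_sentiment(r) for r in ratings]
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical ratings, for illustration only
toy_ratings = [5, 4, 5, 1, 3, 5, 2, 4, 5, 4]
shares = sentiment_shares(toy_ratings)
print(shares)
```

On the notebook's DataFrame, `df['sentiment'].value_counts(normalize=True)` reports the same kind of shares.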

**Step - 2**

Approach 1 – Zero-shot Classification with Pretrained Transformer:

1. Used the Hugging Face model "facebook/bart-large-mnli".
2. Applied zero-shot classification on each review with target labels.
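
For a single input string, the zero-shot pipeline returns a dictionary whose `labels` list is sorted by descending score, so the first entry is the predicted label. A minimal sketch of reading such a result, assuming a hand-written dictionary in that shape (the review text and scores are made up and are not real model output; `top_label` is a hypothetical helper):

```python
def top_label(result):
    """Return the highest-scoring label and its score from a zero-shot result."""
    return result["labels"][0], result["scores"][0]

# Hypothetical result in the shape returned for one review (values are illustrative)
mock_result = {
    "sequence": "The carafe leaks every time I pour.",
    "labels": ["Negative", "Neutral", "Positive"],
    "scores": [0.91, 0.06, 0.03],
}

label, score = top_label(mock_result)
print(label, score)
```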

In [None]:
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")  # Load the zero-shot classification model
labels = ["Negative", "Neutral", "Positive"]  # Sentiment names used as candidate labels
preds = [classifier(str(text), labels)['labels'][0] for text in tqdm(df['review'])]  # Keep the top-scoring label for each review
accuracy = accuracy_score(df['sentiment'], preds)  # Accuracy over the full dataset
f1 = f1_score(df['sentiment'], preds, average='weighted')  # Per-class F1, weighted by class support
print(f"Zero-shot Model Accuracy: {accuracy:.4f}")
print(f"Zero-shot Model F1-score: {f1:.4f}")
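
The `average='weighted'` setting computes an F1 score per class and averages them, weighting each class by its number of true examples. A minimal pure-Python sketch of that calculation, assuming toy labels (the helper `weighted_f1` is hypothetical and written here only for illustration; the notebook itself uses scikit-learn's `f1_score`):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        # True positives: examples of this class that were predicted as this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Toy labels, for illustration only
y_true_toy = ["Positive", "Positive", "Positive", "Negative", "Neutral"]
y_pred_toy = ["Positive", "Positive", "Negative", "Negative", "Positive"]
toy_score = weighted_f1(y_true_toy, y_pred_toy)
print(round(toy_score, 4))
```

Because the weights follow class frequency, a model can score well here while still doing poorly on a rare class such as Neutral, so a per-class breakdown is worth checking alongside it.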

**Train-Test Split:**

Used sklearn's train_test_split to divide the data into 70% training and 30% testing.

In [None]:
X = df['review']  # Review text (model input)
y = df['sentiment']  # Sentiment label (target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
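
If the sentiment classes are imbalanced, a plain random split can leave the test set with a different label mix than the training set. A minimal pure-Python sketch of a stratified 70/30 split, assuming toy data (the helper `stratified_split` is hypothetical and is not what the notebook ran):

```python
import random
from collections import defaultdict

def stratified_split(items, labels, test_size=0.3, seed=42):
    """Split per class so each class keeps roughly the same test share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    train = [(items[i], labels[i]) for i in sorted(train_idx)]
    test = [(items[i], labels[i]) for i in sorted(test_idx)]
    return train, test

# Hypothetical reviews: 10 Positive and 10 Negative
toy_reviews = [f"review {i}" for i in range(20)]
toy_labels = ["Positive"] * 10 + ["Negative"] * 10
train_part, test_part = stratified_split(toy_reviews, toy_labels)
print(len(train_part), len(test_part))
```

With scikit-learn, passing `stratify=y` to `train_test_split` has the same effect.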

**Approach 2 – Sentence Embedding + Logistic Regression:**

1. Used "all-MiniLM-L6-v2" from SentenceTransformers to embed reviews.
2. Trained Logistic Regression classifier on embedded vectors.

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2') #Load pretrained transformer model
#Generate embeddings
X_train_emb = model.encode(X_train.tolist(), show_progress_bar=True)
X_test_emb = model.encode(X_test.tolist(), show_progress_bar=True)
#Train a classifier using logistic regression
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_emb, y_train)
#Predict and evaluate based on the split dataset and trained model
y_pred = clf.predict(X_test_emb)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Custom Classifier Accuracy: {acc:.4f}")
print(f"Custom Classifier F1-score: {f1:.4f}")
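
Once the reviews are embedded, multinomial Logistic Regression scores each class with a linear function of the embedding and turns the scores into probabilities with a softmax. A minimal sketch of that prediction step, assuming a made-up 3-dimensional embedding and made-up weights (real MiniLM embeddings have 384 dimensions, and the real weights are learned by `clf.fit`; `softmax` and `predict` are hypothetical helpers):

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(embedding, weights, biases, classes):
    """Score each class as w . x + b, then return the most probable class."""
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, embedding)) + b
        for w, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best], probs

toy_classes = ["Negative", "Neutral", "Positive"]
# Hypothetical weights and embedding, for illustration only
toy_weights = [[1.0, -0.5, 0.0], [0.0, 0.2, 0.1], [-1.0, 0.5, 0.3]]
toy_biases = [0.0, 0.1, 0.0]
toy_embedding = [-0.8, 0.6, 0.4]

toy_label, toy_probs = predict(toy_embedding, toy_weights, toy_biases, toy_classes)
print(toy_label, [round(p, 3) for p in toy_probs])
```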

**Approach 3 – MPNet Embedding + Logistic Regression:**

1. Used "all-mpnet-base-v2" for sentence embedding.
2. Trained Logistic Regression on MPNet embeddings.

In [None]:
from sentence_transformers import SentenceTransformer
#Load the larger MPNet-based embedding model
mpnet_model = SentenceTransformer('all-mpnet-base-v2')

In [None]:
#Generate embeddings with mpnet
X_train_emb_mpnet = mpnet_model.encode(X_train.tolist(), show_progress_bar=True)
X_test_emb_mpnet = mpnet_model.encode(X_test.tolist(), show_progress_bar=True)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

In [None]:
#Train a classifier using logistic regression
clf_mpnet = LogisticRegression(max_iter=1000)
clf_mpnet.fit(X_train_emb_mpnet, y_train)
#Predict
y_pred_mpnet = clf_mpnet.predict(X_test_emb_mpnet)
acc_mpnet = accuracy_score(y_test, y_pred_mpnet)
f1_mpnet = f1_score(y_test, y_pred_mpnet, average='weighted')
print(f"MPNet Classifier Accuracy: {acc_mpnet:.4f}")
print(f"MPNet Classifier F1-score: {f1_mpnet:.4f}")

**Approach 4 – MPNet Embedding + XGBoost:**

Used MPNet embeddings as input to XGBoost classifier.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)
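
XGBoost expects integer class ids, which is why the string labels are encoded first. `LabelEncoder` assigns ids in sorted order of the class names. A minimal pure-Python sketch of that mapping and its inverse, assuming toy labels (the helper names are hypothetical):

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer id, in sorted order."""
    classes = sorted(set(labels))
    return {label: idx for idx, label in enumerate(classes)}, classes

def encode(labels, mapping):
    """Replace each label with its integer id."""
    return [mapping[label] for label in labels]

def decode(ids, classes):
    """Replace each integer id with its original label."""
    return [classes[i] for i in ids]

# Toy labels, for illustration only
y_toy = ["Positive", "Negative", "Neutral", "Positive"]
toy_mapping, toy_class_names = fit_label_mapping(y_toy)
toy_encoded = encode(y_toy, toy_mapping)
print(toy_mapping)
print(toy_encoded)
print(decode(toy_encoded, toy_class_names))
```

In the notebook, `le.inverse_transform(y_pred_xgb)` converts the predicted ids back to sentiment names.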

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score
clf_xgb = XGBClassifier(eval_metric='mlogloss')  # use_label_encoder is deprecated in recent XGBoost; labels are already integer-encoded above
clf_xgb.fit(X_train_emb_mpnet, y_train_enc)
y_pred_xgb = clf_xgb.predict(X_test_emb_mpnet)
acc_xgb = accuracy_score(y_test_enc, y_pred_xgb)
f1_xgb = f1_score(y_test_enc, y_pred_xgb, average='weighted')
print(f"MPNet + XGBoost Accuracy: {acc_xgb:.4f}")
print(f"MPNet + XGBoost F1-score: {f1_xgb:.4f}")

**Approach 5 – MPNet Embedding + SVM:**

Trained a Support Vector Machine classifier using MPNet embeddings.

In [None]:
from sklearn.svm import SVC
clf_svm = SVC(kernel='linear')
clf_svm.fit(X_train_emb_mpnet, y_train)
y_pred_svm = clf_svm.predict(X_test_emb_mpnet)
acc_svm = accuracy_score(y_test, y_pred_svm)
f1_svm = f1_score(y_test, y_pred_svm, average='weighted')
print(f"MPNet + SVM Accuracy: {acc_svm:.4f}")
print(f"MPNet + SVM F1-score: {f1_svm:.4f}")

# **Comparison of Results:**

| Method | Accuracy | F1-score |
|---|---|---|
| Zero-shot (BART-large-MNLI) | 0.8326 | 0.8016 |
| MiniLM Embedding + Logistic Regression | 0.7940 | 0.7539 |
| MPNet Embedding + Logistic Regression | 0.8200 | 0.7795 |
| MPNet Embedding + XGBoost | 0.8133 | 0.7730 |
| MPNet Embedding + SVM | 0.8207 | 0.7788 |

# **Conclusion:**

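Zero-shot classification with "facebook/bart-large-mnli" gave the highest accuracy (0.8326) and weighted
F1-score (0.8016), making it the strongest option among those tested. It was scored on the full dataset,
while the trained classifiers were scored on the 30% test split, so the comparison is indicative rather
than exact. Among the custom-trained models, MPNet embeddings with Logistic Regression or a linear SVM
came closest (accuracy of about 0.82), with XGBoost slightly behind, and all three could be tuned further
for improvements.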