
# Sentiment Classification for E-commerce Customer Feedback

## Business Context
In today's fast-paced e-commerce landscape, **customer reviews significantly influence product perception and buying decisions**. Businesses must actively monitor customer sentiment to extract insights and maintain a competitive edge. Ignoring negative feedback can lead to serious consequences such as:

- **Customer Churn**: Unresolved complaints drive loyal customers away, reducing retention and future revenue.  
- **Reputation Damage**: Persistent negative sentiment erodes brand trust and deters new buyers.  
- **Financial Loss**: Declining sales and shifting preferences toward competitors directly impact profitability.  

With a **200% increase in customer base over three years** and a recent **25% spike in feedback volume**, manual review of feedback is unsustainable.  

👉 The company aims to **implement an AI-driven solution** to automatically classify customer sentiment (**positive, negative, neutral**) from product reviews, surveys, and social media.  

---

## Objective
As a Data Scientist, your task is to:
1. **Analyze** the provided customer reviews dataset.  
2. **Preprocess** and clean the data.  
3. **Build predictive models** for sentiment classification:  
   - Traditional ML baselines (TF-IDF + Logistic Regression / SVM).  
   - Transformer-based fine-tuned model (DistilBERT).  
4. **Evaluate** models on accuracy and macro-averaged F1.  
5. Provide insights and recommendations for deployment in real-world monitoring systems.  


## 1) Setup

In [None]:

# Install dependencies (uncomment if running in a clean environment)
# %pip install pandas numpy scikit-learn matplotlib seaborn
# %pip install transformers datasets evaluate accelerate torch
# %pip install fastapi uvicorn emoji textblob wordcloud


## 2) Imports

In [None]:

import os, re, random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

import torch
from datasets import Dataset
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding,
    TrainingArguments, Trainer
)
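import evaluate

# Fix random seeds so splits and training runs are repeatable
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)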


## 3) Load Data

In [None]:

# Set your dataset path here (CSV with columns: Review, Sentiment)
DATA_PATH = "customer_feedback.csv"
df = pd.read_csv(DATA_PATH)

# Normalize column names
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

assert "sentiment" in df.columns, "Dataset must include 'sentiment' column"
assert any("review" in c for c in df.columns), "Dataset must include a review text column"

text_col = [c for c in df.columns if "review" in c][0]
df = df[[text_col, "sentiment"]].rename(columns={text_col: "text"})

df.head()


## 4) Exploratory Data Analysis

In [None]:

print("Dataset size:", len(df))
print("\nClass distribution:")
print(df["sentiment"].value_counts())

df["text_length"] = df["text"].str.split().str.len()
df["text_length"].hist(bins=40)
plt.title("Review Length Distribution")
plt.xlabel("Words per review")
plt.ylabel("Count")
plt.show()
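
Beyond class balance and review length, it can help to check for missing or duplicated reviews before preprocessing. The following is a minimal sketch on a toy frame; the `toy` rows are invented for illustration, and in the notebook the same calls would run on `df`.

```python
import pandas as pd

# Toy stand-in for df; the rows are invented for illustration
toy = pd.DataFrame({
    "text": ["Great phone", "Great phone", None, "Late delivery", "It is okay"],
    "sentiment": ["positive", "positive", "negative", "negative", "neutral"],
})

# Reviews with no text at all
n_missing = int(toy["text"].isna().sum())
# Rows that exactly repeat an earlier (text, sentiment) pair
n_duplicates = int(toy.duplicated(subset=["text", "sentiment"]).sum())
# Class balance expressed as fractions of the dataset
class_share = toy["sentiment"].value_counts(normalize=True)

print("Missing:", n_missing, "Duplicates:", n_duplicates)
print(class_share.round(2))
```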


## 5) Preprocessing

In [None]:

import emoji

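def clean_text(s: str) -> str:
    """Lowercase and strip URLs, mentions, hashtags, emoji, and stray symbols."""
    s = s.lower()
    s = re.sub(r"http\S+|www\.\S+", " ", s)       # URLs
    s = re.sub(r"@\w+|#\w+", " ", s)              # @mentions and #hashtags
    s = emoji.replace_emoji(s, replace=" ")       # emoji
    s = re.sub(r"[^a-z0-9\s\.\,\!\?]", " ", s)    # keep letters, digits, basic punctuation
    s = re.sub(r"\s+", " ", s).strip()            # collapse whitespace
    return s

# Drop rows without review text; clean_text expects strings
df = df.dropna(subset=["text"]).copy()
df["text"] = df["text"].astype(str)
df["text_clean"] = df["text"].map(clean_text)

label2id = {"negative": 0, "neutral": 1, "positive": 2}
id2label = {v: k for k, v in label2id.items()}
df["y"] = df["sentiment"].astype(str).str.strip().str.lower().map(label2id)

# Labels outside label2id map to NaN, which would break the stratified split
df = df.dropna(subset=["y"]).copy()
df["y"] = df["y"].astype(int)
df.head()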
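
To see what the cleaning rules keep and drop, here is a stdlib-only sketch of the same regex steps applied to two sample strings. `clean_text_demo` and the sample reviews are hypothetical. The `emoji` step is left out on the assumption that, for typical pictographic emoji, the character whitelist already replaces them with a space.

```python
import re

def clean_text_demo(s: str) -> str:
    """Stdlib-only mirror of the cleaning steps (hypothetical helper for illustration)."""
    s = s.lower()
    s = re.sub(r"http\S+|www\.\S+", " ", s)   # drop URLs
    s = re.sub(r"@\w+|#\w+", " ", s)          # drop @mentions and #hashtags
    s = re.sub(r"[^a-z0-9\s.,!?]", " ", s)    # keep letters, digits, basic punctuation
    return re.sub(r"\s+", " ", s).strip()     # collapse whitespace

# Invented sample reviews
samples = [
    "LOVED it!!! 😍 See https://shop.example.com/item?id=3 @support #happy",
    "Don't buy this. Battery died in 2 days!",
]
cleaned = [clean_text_demo(s) for s in samples]
for out in cleaned:
    print(repr(out))
```

Note that the apostrophe in a contraction is replaced by a space, so "don't" becomes "don t".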


## 6) Train / Validation / Test Split

In [None]:

train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["y"])
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, stratify=temp_df["y"])
print("Train:", train_df.shape, "Val:", val_df.shape, "Test:", test_df.shape)
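
The two-step split holds out 20% of the data and then halves that pool, which gives roughly 80/10/10 for train, validation, and test while keeping class proportions. The following is a small sketch on invented labels that checks the arithmetic; `labels` and `rows` are toy stand-ins for the real data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels: 30 negative (0), 20 neutral (1), 50 positive (2); invented for illustration
labels = [0] * 30 + [1] * 20 + [2] * 50
rows = list(range(len(labels)))

# Step 1: hold out 20% as a temporary pool, keeping class proportions
train_rows, temp_rows, train_y, temp_y = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)
# Step 2: split the pool in half into validation and test
val_rows, test_rows, val_y, test_y = train_test_split(
    temp_rows, temp_y, test_size=0.5, random_state=42, stratify=temp_y
)

print(len(train_rows), len(val_rows), len(test_rows))
print("Test class counts:", sorted(Counter(test_y).items()))
```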


## 7) Baseline Models (TF-IDF + LR / SVM)

In [None]:

X_train, y_train = train_df["text_clean"], train_df["y"]
X_val, y_val = val_df["text_clean"], val_df["y"]
X_test, y_test = test_df["text_clean"], test_df["y"]

# Logistic Regression
pipe_lr = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2)), 
                    ("clf", LogisticRegression(max_iter=200))])
pipe_lr.fit(X_train, y_train)
val_pred_lr = pipe_lr.predict(X_val)
print("Validation Report (LR):\n", classification_report(y_val, val_pred_lr))

# SVM
pipe_svm = Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2)), 
                     ("clf", LinearSVC())])
pipe_svm.fit(X_train, y_train)
val_pred_svm = pipe_svm.predict(X_val)
print("Validation Report (SVM):\n", classification_report(y_val, val_pred_svm))
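
The classification reports list precision, recall, and F1 per class. Macro F1, the unweighted mean of the per-class F1 scores, condenses them into one number that weights all three classes equally regardless of size. The following is a worked sketch in plain Python; the label arrays are invented and `macro_f1` is a hypothetical helper. `f1_score(y_true, y_pred, average="macro")` from scikit-learn computes the same quantity.

```python
def macro_f1(y_true, y_pred, classes=(0, 1, 2)):
    """Unweighted mean of per-class F1 (hypothetical helper for illustration)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented labels: 0 = negative, 1 = neutral, 2 = positive
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

In this toy case accuracy (0.80) is higher than macro F1 (about 0.75) because the small neutral class reaches an F1 of only 0.50.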


## 8) Transformer Model (DistilBERT Fine-tuning)

In [None]:

MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True)

train_ds = Dataset.from_pandas(train_df[["text", "y"]].rename(columns={"y":"labels"}))
val_ds = Dataset.from_pandas(val_df[["text", "y"]].rename(columns={"y":"labels"}))
test_ds = Dataset.from_pandas(test_df[["text", "y"]].rename(columns={"y":"labels"}))

train_ds = train_ds.map(tokenize_fn, batched=True)
val_ds = val_ds.map(tokenize_fn, batched=True)
test_ds = test_ds.map(tokenize_fn, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3, id2label=id2label, label2id=label2id
)

metric_f1 = evaluate.load("f1")
metric_acc = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    return {
        "accuracy": metric_acc.compute(predictions=preds, references=labels)["accuracy"],
        "f1_macro": metric_f1.compute(predictions=preds, references=labels, average="macro")["f1"],
    }

training_args = TrainingArguments(
    output_dir="outputs",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    logging_steps=50,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,  # newer transformers releases prefer processing_class=tokenizer
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train();
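
`tokenize_fn` truncates but does not pad. `DataCollatorWithPadding` pads each batch to the length of its longest sequence at collation time, so short batches stay short. The following is a conceptual sketch in plain Python; the `pad_batch` helper and the token IDs are made up, and the real collator also builds tensors and handles other fields.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the longest one in this batch (conceptual sketch)."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs for two batches of different lengths
short_batch = pad_batch([[101, 2307, 102], [101, 2919, 2326, 102]])
long_batch = pad_batch([[101] + [2000] * 10 + [102], [101, 102]])

print(len(short_batch["input_ids"][0]), len(long_batch["input_ids"][0]))
```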


## 9) Evaluation & Error Analysis

In [None]:

# Evaluate transformer on test set
test_metrics = trainer.evaluate(test_ds)
print("Transformer Test Metrics:", test_metrics)

# Misclassified examples
preds = trainer.predict(test_ds).predictions.argmax(axis=-1)
mis_idx = np.where(preds != test_df["y"].values)[0].tolist()
for i in mis_idx[:5]:
    print("---")
    print("Text:", test_df.iloc[i]["text"][:300])
    print("True:", id2label[test_df.iloc[i]["y"]], "Pred:", id2label[preds[i]])
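
A confusion matrix complements the list of misclassified examples by showing which classes get mixed up with which. The following is a minimal sketch using `confusion_matrix` from scikit-learn on invented label IDs; in the notebook the inputs would be `test_df["y"].values` and `preds`.

```python
from sklearn.metrics import confusion_matrix

labels = ["negative", "neutral", "positive"]
# Invented label IDs standing in for the true and predicted test labels
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
for name, row in zip(labels, cm):
    print(f"{name:>8}: {row.tolist()}")
```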


## 10) Inference Utility

In [None]:

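def predict_sentiment(texts, model, tokenizer):
    """Return a sentiment label for each input string."""
    model.eval()  # disable dropout so predictions are deterministic
    enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    enc = enc.to(model.device)  # inputs must be on the same device as the model (e.g. GPU after training)
    with torch.no_grad():
        logits = model(**enc).logits
    preds = logits.argmax(dim=-1).tolist()
    return [id2label[p] for p in preds]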

# Example
# predict_sentiment(["Great product, loved the battery life!", "Worst experience ever!"], model, tokenizer)
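
`predict_sentiment` returns only the top label. For monitoring, it may be useful to keep a confidence score as well, obtained by applying softmax to the logits. The following is a stdlib-only sketch on made-up logits; `softmax` and `label_with_confidence` are hypothetical helpers, and in the notebook `torch.softmax(logits, dim=-1)` would serve the same purpose.

```python
import math

id2label = {0: "negative", 1: "neutral", 2: "positive"}

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Made-up logits for two reviews
for row in [[-1.2, 0.1, 3.4], [0.9, 1.0, 0.8]]:
    label, conf = label_with_confidence(row)
    print(label, round(conf, 3))
```

A low top probability, as in the second row, flags a review the model is unsure about.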
