# Sentiment-Based Product Recommendation System

Starter Jupyter notebook for the Ebuss case study. This notebook walks through:

1. Data loading & EDA
2. Data cleaning & text preprocessing
3. Feature extraction (TF-IDF)
4. Training multiple classifiers (Logistic Regression, Random Forest, XGBoost, Naive Bayes)
5. Evaluating and selecting the best sentiment model
6. Building User- and Item-based recommenders
7. Re-ranking recommended items using predicted sentiment
8. Pickling artifacts and deployment notes

Save this notebook and run cells sequentially. Replace file paths as needed.

In [7]:
## 1) Imports and helper functions

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score, f1_score, confusion_matrix

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

# XGBoost is optional. If it is missing, install it with: pip install xgboost
try:
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None

# Recommender utilities
from sklearn.metrics.pairwise import cosine_similarity
import pickle

# Text preprocessing
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

print('Imports ready')

Imports ready


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
## 2) Load dataset

# Update the path to your CSV file if different
DATA_PATH = 'sample30.csv'

if not os.path.exists(DATA_PATH):
    raise FileNotFoundError(f"Dataset not found at {DATA_PATH}. Place 'sample30.csv' in the working directory or update DATA_PATH.")

df = pd.read_csv(DATA_PATH)
print('Shape:', df.shape)
df.head()

Shape: (30000, 15)


Unnamed: 0,id,brand,categories,manufacturer,name,reviews_date,reviews_didPurchase,reviews_doRecommend,reviews_rating,reviews_text,reviews_title,reviews_userCity,reviews_userProvince,reviews_username,user_sentiment
0,AV13O1A8GV-KLJ3akUyj,Universal Music,"Movies, Music & Books,Music,R&b,Movies & TV,Mo...",Universal Music Group / Cash Money,Pink Friday: Roman Reloaded Re-Up (w/dvd),2012-11-30T06:21:45.000Z,,,5,i love this album. it's very good. more to the...,Just Awesome,Los Angeles,,joshua,Positive
1,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,True,,5,Good flavor. This review was collected as part...,Good,,,dorothy w,Positive
2,AV14LG0R-jtxr-f38QfS,Lundberg,"Food,Packaged Foods,Snacks,Crackers,Snacks, Co...",Lundberg,Lundberg Organic Cinnamon Toast Rice Cakes,2017-07-09T00:00:00.000Z,True,,5,Good flavor.,Good,,,dorothy w,Positive
3,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-01-06T00:00:00.000Z,False,False,1,I read through the reviews on here before look...,Disappointed,,,rebecca,Negative
4,AV16khLE-jtxr-f38VFn,K-Y,"Personal Care,Medicine Cabinet,Lubricant/Sperm...",K-Y,K-Y Love Sensuality Pleasure Gel,2016-12-21T00:00:00.000Z,False,False,1,My husband bought this gel for us. The gel cau...,Irritation,,,walker557,Negative


In [9]:
## 3) Quick EDA

# Basic info
print(df.info())

# Show missing values
print('\nMissing values per column:\n', df.isnull().sum())

# Distribution of ratings (the rating column in this dataset is 'reviews_rating')
if 'reviews_rating' in df.columns:
    display(df['reviews_rating'].value_counts().sort_index())

# Number of unique users and products
if 'reviews_username' in df.columns:
    print('Unique users:', df['reviews_username'].nunique())
if 'name' in df.columns:
    print('Unique products:', df['name'].nunique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    30000 non-null  object
 1   brand                 30000 non-null  object
 2   categories            30000 non-null  object
 3   manufacturer          29859 non-null  object
 4   name                  30000 non-null  object
 5   reviews_date          29954 non-null  object
 6   reviews_didPurchase   15932 non-null  object
 7   reviews_doRecommend   27430 non-null  object
 8   reviews_rating        30000 non-null  int64 
 9   reviews_text          30000 non-null  object
 10  reviews_title         29810 non-null  object
 11  reviews_userCity      1929 non-null   object
 12  reviews_userProvince  170 non-null    object
 13  reviews_username      29937 non-null  object
 14  user_sentiment        29999 non-null  object
dtypes: int64(1), object(14)
memory usage

In [10]:
## 4) Text cleaning & preprocessing

# We'll create a single text column by combining review title and review text if available
text_cols = []
if 'reviews_text' in df.columns:
    text_cols.append('reviews_text')
if 'reviews_title' in df.columns:
    text_cols.append('reviews_title')

if not text_cols:
    # fallback: try common column names
    for c in ['reviewText','review_title','text']:
        if c in df.columns:
            text_cols.append(c)

if len(text_cols)==0:
    raise ValueError('No text columns found. Update text_cols list to include your review columns.')

# create 'text' column; fill NaN first, because astype(str) would turn NaN into the literal string 'nan'
df['text'] = df[text_cols].fillna('').astype(str).apply(lambda row: ' '.join(t for t in row if t), axis=1)

# simple cleaning function
def clean_text(s):
    s = str(s).lower()
    s = re.sub(r"http\S+", "", s)
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    tokens = [w for w in s.split() if w not in STOPWORDS and len(w)>1]
    return ' '.join(tokens)

# apply cleaning
print('Cleaning text...')
df['clean_text'] = df['text'].fillna('').apply(clean_text)

# sample
print(df['clean_text'].iloc[:3])

Cleaning text...
0    love album good hip hop side current pop sound...
1     good flavor review collected part promotion good
2                                     good flavor good
Name: clean_text, dtype: object
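
To make the cleaning steps concrete, here is a dependency-free sketch that applies the same sequence (lowercase, strip URLs, strip punctuation, collapse whitespace, drop stopwords and one-character tokens) to a single made-up review. The small `DEMO_STOPWORDS` set and the name `clean_text_demo` are assumptions for illustration; the notebook itself uses NLTK's full English list, so results on real data can differ.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list.
DEMO_STOPWORDS = {"i", "this", "it", "s", "is", "the", "a", "very", "to", "at"}

def clean_text_demo(s, stopwords=DEMO_STOPWORDS):
    """Lowercase, strip URLs and punctuation, drop stopwords and 1-char tokens."""
    s = str(s).lower()
    s = re.sub(r"http\S+", "", s)        # remove URLs
    s = re.sub(r"[^a-z0-9\s]", " ", s)   # keep only letters, digits, whitespace
    s = re.sub(r"\s+", " ", s).strip()   # collapse runs of whitespace
    return " ".join(w for w in s.split() if w not in stopwords and len(w) > 1)

cleaned = clean_text_demo("I LOVE this album!! It's very good. See http://example.com")
print(cleaned)
```

Note that apostrophes become spaces, so "it's" splits into "it" and "s"; both are then dropped, one as a stopword and one by the length filter.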


In [12]:
## 5) Create sentiment labels

# Use existing sentiment labels if present, otherwise derive from ratings
if 'user_sentiment' in df.columns:
    # Normalize values to 'positive'/'negative' strings (in case of different casing)
    df['sentiment_label'] = df['user_sentiment'].astype(str).str.strip().str.lower().map(
        lambda x: 'positive' if 'pos' in x else ('negative' if 'neg' in x else x)
    )
    # Keep only rows with positive or negative labels
    df_sent = df[df['sentiment_label'].isin(['positive','negative'])].copy()
    print("Used 'user_sentiment' column. Counts:")
    print(df_sent['sentiment_label'].value_counts())

elif 'reviews_rating' in df.columns:
    # Map ratings to sentiment (binary). Drop neutral (rating == 3).
    df['sentiment_label'] = df['reviews_rating'].apply(
        lambda x: 'positive' if float(x) >= 4 else ('negative' if float(x) <= 2 else 'neutral')
    )
    print("Created sentiment_label from 'reviews_rating'. Value counts (including neutral):")
    print(df['sentiment_label'].value_counts())
    # Drop neutral reviews for binary classification
    df_sent = df[df['sentiment_label'] != 'neutral'].copy()
    print('After removing neutral, shape:', df_sent.shape)

else:
    raise ValueError('No rating or sentiment column found. Provide "user_sentiment" or "reviews_rating".')

# Ensure we have the cleaned text column created earlier. If not, create it from available text columns
if 'clean_text' not in df_sent.columns:
    text_cols = []
    for c in ['reviews_text','reviews_title','reviewText','text']:
        if c in df_sent.columns:
            text_cols.append(c)
    if len(text_cols)==0:
        raise ValueError("No text column found. Expected one of ['reviews_text','reviews_title','reviewText','text'].")
    df_sent['text'] = df_sent[text_cols].fillna('').astype(str).apply(lambda row: ' '.join(t for t in row if t), axis=1)
    # apply your cleaning function (ensure clean_text function exists in the notebook)
    df_sent['clean_text'] = df_sent['text'].fillna('').apply(clean_text)

# Final preview
df_sent = df_sent.reset_index(drop=True)
display(df_sent[['reviews_username','name','reviews_rating','sentiment_label','clean_text']].head())


Used 'user_sentiment' column. Counts:
sentiment_label
positive    26632
negative     3367
Name: count, dtype: int64


Unnamed: 0,reviews_username,name,reviews_rating,sentiment_label,clean_text
0,joshua,Pink Friday: Roman Reloaded Re-Up (w/dvd),5,positive,love album good hip hop side current pop sound...
1,dorothy w,Lundberg Organic Cinnamon Toast Rice Cakes,5,positive,good flavor review collected part promotion good
2,dorothy w,Lundberg Organic Cinnamon Toast Rice Cakes,5,positive,good flavor good
3,rebecca,K-Y Love Sensuality Pleasure Gel,1,negative,read reviews looking buying one couples lubric...
4,walker557,K-Y Love Sensuality Pleasure Gel,1,negative,husband bought gel us gel caused irritation fe...
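
The rating-based fallback in the cell above did not run here because `user_sentiment` was present. As a rough illustration of what that branch does, the sketch below maps star ratings to labels and then drops the neutral ones. The function name `rating_to_sentiment` and the sample ratings are made up for the example.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a label; 3 stars is treated as neutral."""
    r = float(rating)
    if r >= 4:
        return "positive"
    if r <= 2:
        return "negative"
    return "neutral"

ratings_demo = [5, 4, 3, 2, 1]
labels = [rating_to_sentiment(r) for r in ratings_demo]

# Keep only the rows usable for binary classification.
binary = [(r, lab) for r, lab in zip(ratings_demo, labels) if lab != "neutral"]
print(labels)
print(binary)
```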


In [14]:
## 6) Train-test split and TF-IDF vectorizer

X = df_sent['clean_text'].values
y = df_sent['sentiment_label'].map({'negative':0,'positive':1}).values

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)
print('Train/test sizes:', X_train.shape, X_test.shape)

# TF-IDF
tfidf = TfidfVectorizer(max_features=20000, ngram_range=(1,2))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
print('TF-IDF shapes:', X_train_tfidf.shape, X_test_tfidf.shape)

# Save vectorizer for later (the with-block closes the file handle)
with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)
print('Saved tfidf_vectorizer.pkl')

Train/test sizes: (23999,) (6000,)
TF-IDF shapes: (23999, 20000) (6000, 20000)
Saved tfidf_vectorizer.pkl
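
For intuition about what the vectorizer produces, here is a small pure-Python sketch of TF-IDF on a toy corpus. It uses raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which are the defaults scikit-learn documents for `TfidfVectorizer`. It covers unigrams only (the notebook also uses bigrams), and the corpus and helper names are invented for the example.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration); tokens are split on whitespace.
docs = ["good flavor good", "good album", "bad flavor"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Raw term counts times idf, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

Terms that appear in fewer documents ("album", "bad") get a larger idf than terms spread across the corpus ("good", "flavor").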


In [16]:
## 7) Train multiple classifiers and compare

results = {}

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_tfidf, y_train)
yp = lr.predict(X_test_tfidf)
results['LogisticRegression'] = {'accuracy': accuracy_score(y_test, yp), 'f1': f1_score(y_test, yp)}
print('Logistic:', results['LogisticRegression'])

# Random Forest
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train_tfidf, y_train)
yp = rf.predict(X_test_tfidf)
results['RandomForest'] = {'accuracy': accuracy_score(y_test, yp), 'f1': f1_score(y_test, yp)}
print('RandomForest:', results['RandomForest'])

# Naive Bayes
nb = MultinomialNB()
nb.fit(X_train_tfidf, y_train)
yp = nb.predict(X_test_tfidf)
results['NaiveBayes'] = {'accuracy': accuracy_score(y_test, yp), 'f1': f1_score(y_test, yp)}
print('NaiveBayes:', results['NaiveBayes'])

# XGBoost (optional)
if XGBClassifier is not None:
    xgb = XGBClassifier(eval_metric='logloss')
    xgb.fit(X_train_tfidf, y_train)
    yp = xgb.predict(X_test_tfidf)
    results['XGBoost'] = {'accuracy': accuracy_score(y_test, yp), 'f1': f1_score(y_test, yp)}
    print('XGBoost:', results['XGBoost'])
else:
    print('XGBoost not available in this environment')

# summary
results

# Save best model (choose by f1)
best_model_name = max(results, key=lambda k: results[k]['f1'])
print('Best model:', best_model_name)
model_map = {'LogisticRegression': lr, 'RandomForest': rf, 'NaiveBayes': nb}
if XGBClassifier is not None:
    model_map['XGBoost'] = xgb
best_model = model_map[best_model_name]
with open('sentiment_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
print('Saved sentiment_model.pkl')

Logistic: {'accuracy': 0.9058333333333334, 'f1': 0.9494316656224828}
RandomForest: {'accuracy': 0.913, 'f1': 0.9531502423263328}
NaiveBayes: {'accuracy': 0.8913333333333333, 'f1': 0.9421164772727273}


XGBoost: {'accuracy': 0.9293333333333333, 'f1': 0.9612502284774265}
Best model: XGBoost
Saved sentiment_model.pkl
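
The F1 values above are computed for the positive class, which makes up roughly 89% of the labels. As a sanity check, the sketch below works out what a classifier that always predicts "positive" would score. It uses the label counts printed earlier for the whole dataset (26632 positive, 3367 negative) rather than the exact test split, so treat the numbers as an approximation. The helper `binary_metrics` is written for this example.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy plus precision, recall and F1 for the class counted as 'positive'."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1

n_pos, n_neg = 26632, 3367  # label counts from the sentiment-label cell

# Baseline that predicts "positive" for every review.
acc, prec, rec, f1_pos = binary_metrics(tp=n_pos, fp=n_neg, fn=0, tn=0)

# The same baseline scored on the negative class: it never finds one.
_, _, _, f1_neg = binary_metrics(tp=0, fp=0, fn=n_neg, tn=n_pos)

macro_f1 = (f1_pos + f1_neg) / 2
print(f"accuracy={acc:.4f} f1_pos={f1_pos:.4f} f1_neg={f1_neg:.4f} macro_f1={macro_f1:.4f}")
```

Since this trivial baseline already reaches a positive-class F1 of about 0.94, comparing models on the negative class or on a macro average (for example `f1_score(y_test, yp, average='macro')`) may separate them more clearly.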


In [18]:
## 8) Build simple User-based and Item-based recommenders (ratings pivot + cosine similarity)

# We need a user-item rating matrix. We'll use 'reviews_username', 'name' (product), and 'reviews_rating'.
if not all(c in df.columns for c in ['reviews_username','name','reviews_rating']):
    raise ValueError('Required columns for recommender not found: reviews_username, name, reviews_rating')

ratings = df[['reviews_username','name','reviews_rating']].dropna()
ratings = ratings.rename(columns={
    'reviews_username':'user',
    'name':'item',
    'reviews_rating':'rating'
})
print('Ratings shape:', ratings.shape)

# pivot (users × items)
user_item = ratings.pivot_table(index='user', columns='item', values='rating')
print('User-Item matrix shape:', user_item.shape)

# Fill missing with 0 for similarity-based CF (implicit approach)
user_item_filled = user_item.fillna(0)

# Item-based similarity
item_sim = cosine_similarity(user_item_filled.T)
item_sim_df = pd.DataFrame(item_sim, index=user_item_filled.columns, columns=user_item_filled.columns)

# User-based similarity
user_sim = cosine_similarity(user_item_filled)
user_sim_df = pd.DataFrame(user_sim, index=user_item_filled.index, columns=user_item_filled.index)

print('Constructed similarity matrices')


Ratings shape: (29937, 3)
User-Item matrix shape: (24914, 271)
Constructed similarity matrices
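
To show what the similarity matrices contain, here is a minimal pure-Python sketch of cosine similarity on a zero-filled toy user-item matrix. The users, ratings, and the `cosine` helper are invented for the example; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

# Toy user-item ratings (invented); 0 marks "not rated", as after fillna(0).
ratings_demo = {
    "alice": [5, 0, 3],
    "bob":   [4, 0, 0],
    "carol": [0, 2, 0],
}

def cosine(u, v):
    """Cosine similarity; this sketch returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

users = list(ratings_demo)
user_sim_demo = {
    (a, b): cosine(ratings_demo[a], ratings_demo[b]) for a in users for b in users
}
print(round(user_sim_demo[("alice", "bob")], 3))
print(user_sim_demo[("alice", "carol")])
```

Two users who share no rated item always get a similarity of 0, which is the common case in a matrix this sparse.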


In [19]:
## 9) Recommendation functions

def recommend_items_item_based(user_id, top_k=20):
    # find items the user has rated
    if user_id not in user_item.index:
        raise ValueError('User not found')
    user_ratings = user_item.loc[user_id].dropna()
    scores = {}
    for item, r in user_ratings.items():
        # similar items
        sims = item_sim_df[item]
        for other_item, sim in sims.items():
            if other_item in user_ratings.index:
                continue
            scores.setdefault(other_item, 0)
            scores[other_item] += sim * r
    # sort
    recs = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [x[0] for x in recs]


def recommend_items_user_based(user_id, top_k=20):
    if user_id not in user_item.index:
        raise ValueError('User not found')
    sims = user_sim_df[user_id]
    sim_scores = sims.drop(index=user_id).sort_values(ascending=False)
    neighbor_users = sim_scores.head(10).index
    # items the target user has already rated (computed once, outside the loops)
    seen = set(user_item.loc[user_id].dropna().index)
    # aggregate neighbor ratings
    candidates = {}
    for nu in neighbor_users:
        for item, rating in user_item.loc[nu].dropna().items():
            if item in seen:
                continue
            candidates.setdefault(item, []).append(rating)
    # average rating
    avg_scores = {k: np.mean(v) for k,v in candidates.items()}
    recs = sorted(avg_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    return [x[0] for x in recs]

# Example: choose a user from the data
sample_user = user_item.index[0]
print('Sample user:', sample_user)
print('Item-based sample recs:', recommend_items_item_based(sample_user, top_k=5))
print('User-based sample recs:', recommend_items_user_based(sample_user, top_k=5))

Sample user: 00dog3
Item-based sample recs: ['Power Crunch Protein Energy Bar Peanut Butter Creme Original', 'Delta Single Handle Shower Faucet', 'Mike Dave Need Wedding Dates (dvd + Digital)', 'Caress Moisturizing Body Bar Natural Silk, 4.75oz', 'Pendaflex174 Divide It Up File Folder, Multi Section, Letter, Assorted, 12/pack']
User-based sample recs: []
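
The empty user-based list is worth a second look. One plausible explanation, not verified against the data here, is sparsity: with 29937 ratings spread over 24914 users, most users have rated a single product, so the nearest neighbours may have rated only items the target user already has. The toy sketch below reproduces that situation; all names and ratings in it are invented.

```python
# Toy ratings (invented): most users rated a single product.
ratings_demo = {
    "target": {"soap": 5},
    "n1": {"soap": 5},
    "n2": {"soap": 4},
    "n3": {"soap": 5, "dvd": 4},
}

def user_based_candidates(user, neighbors, ratings):
    """Average neighbour ratings for items the user has not rated yet."""
    seen = set(ratings[user])
    pooled = {}
    for n in neighbors:
        for item, r in ratings[n].items():
            if item not in seen:
                pooled.setdefault(item, []).append(r)
    return {item: sum(rs) / len(rs) for item, rs in pooled.items()}

only_overlap = user_based_candidates("target", ["n1", "n2"], ratings_demo)
with_new_item = user_based_candidates("target", ["n1", "n2", "n3"], ratings_demo)
print(only_overlap)    # neighbours add nothing new
print(with_new_item)   # a neighbour with a second item yields a candidate
```

If this is the cause, widening the neighbourhood or falling back to the item-based list when the user-based one is empty are two options to try.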


In [21]:
## 10) Rerank top-20 using sentiment model to pick top-5

# Load vectorizer and sentiment model (with-blocks close the file handles)
with open('tfidf_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('sentiment_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Function to compute a product's positive-sentiment score: the mean predicted
# probability that its reviews are positive (a value in [0, 1], not a percentage)
def product_positive_pct(product_name):
    reviews = df[df['name']==product_name]
    if reviews.empty:
        return 0.0
    texts = reviews['clean_text'].fillna('')
    Xv = vectorizer.transform(texts)
    # if classifier has predict_proba
    if hasattr(model, 'predict_proba'):
        proba = model.predict_proba(Xv)[:,1]
        return proba.mean()
    else:
        preds = model.predict(Xv)
        return np.mean(preds)


def recommend_top5_for_user(user_id, base='item', top_k=20):
    if base=='item':
        candidates = recommend_items_item_based(user_id, top_k=top_k)
    else:
        candidates = recommend_items_user_based(user_id, top_k=top_k)
    scored = [(p, product_positive_pct(p)) for p in candidates]
    scored_sorted = sorted(scored, key=lambda x: x[1], reverse=True)
    return scored_sorted[:5]

print('Top-5 re-ranked for sample user:', recommend_top5_for_user(sample_user))

Top-5 re-ranked for sample user: [('Pantene Pro-V Expert Collection Age Defy Conditioner', np.float32(0.95623916)), ('My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Digital)', np.float32(0.94860965)), ('Delta Single Handle Shower Faucet', np.float32(0.9363709)), ('Planes: Fire Rescue (2 Discs) (includes Digital Copy) (blu-Ray/dvd)', np.float32(0.9311966)), ('Clorox Disinfecting Bathroom Cleaner', np.float32(0.922053))]
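
Stripped of the model and the DataFrame, the re-ranking step is a sort by mean positive probability. The sketch below shows that logic on hypothetical per-review probabilities; the product names, numbers, and the helper `rerank_by_sentiment` are made up for illustration.

```python
from statistics import mean

# Hypothetical per-review positive probabilities for each candidate product.
review_probs_demo = {
    "conditioner": [0.97, 0.95, 0.94],
    "faucet": [0.99, 0.60],
    "cleaner": [0.92],
    "dvd": [],  # no reviews available
}

def rerank_by_sentiment(candidates, review_probs, top_n=2):
    """Order candidates by mean positive probability; unreviewed items score 0."""
    scored = [
        (c, mean(review_probs[c]) if review_probs.get(c) else 0.0)
        for c in candidates
    ]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]

top = rerank_by_sentiment(["dvd", "faucet", "cleaner", "conditioner"], review_probs_demo)
print(top)
```

A product with a single glowing review can outrank one with many mixed reviews, so a minimum review count may be worth adding before trusting the mean.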


In [22]:
## 11) Save artifacts and deployment notes

# Already saved: tfidf_vectorizer.pkl and sentiment_model.pkl
# Save user-item pivot and similarity matrices for serving
for obj, path in [(user_item, 'user_item_pivot.pkl'), (item_sim_df, 'item_sim.pkl'), (user_sim_df, 'user_sim.pkl')]:
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
print('Saved pivot and similarity matrices')

# Deployment notes (Flask): create three files for deployment
notes = '''
Deployment files to create:

1) model.py  - loads pickled artifacts (tfidf_vectorizer.pkl, sentiment_model.pkl, user_item_pivot.pkl, item_sim.pkl) and exposes a function recommend_for_user(username)
2) app.py    - Flask app that uses model.py to serve recommendations (POST form with username -> returns top 5 products)
3) templates/index.html - simple form to collect username and show results

High level app.py example:

from flask import Flask, render_template, request
import model
app = Flask(__name__)

@app.route('/', methods=['GET','POST'])
def index():
    if request.method=='POST':
        user = request.form.get('username')
        recs = model.recommend_for_user(user)
        return render_template('index.html', recommendations=recs)
    return render_template('index.html')

if __name__=='__main__':
    app.run(debug=True)

'''
print(notes)

Saved pivot and similarity matrices

Deployment files to create:

1) model.py  - loads pickled artifacts (tfidf_vectorizer.pkl, sentiment_model.pkl, user_item_pivot.pkl, item_sim.pkl) and exposes a function recommend_for_user(username)
2) app.py    - Flask app that uses model.py to serve recommendations (POST form with username -> returns top 5 products)
3) templates/index.html - simple form to collect username and show results

High level app.py example:

from flask import Flask, render_template, request
import model
app = Flask(__name__)

@app.route('/', methods=['GET','POST'])
def index():
    if request.method=='POST':
        user = request.form.get('username')
        recs = model.recommend_for_user(user)
        return render_template('index.html', recommendations=recs)
    return render_template('index.html')

if __name__=='__main__':
    app.run(debug=True)
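
The notes above leave `model.py` unspecified. Below is one possible shape for it, as a sketch rather than a finished module. It assumes the candidate generator raises `ValueError` for unknown users (as the notebook's recommenders do) and takes the two scoring functions as arguments so the logic can be tried without the pickle files. The names `load_artifact`, `candidates_fn`, and `positive_rate_fn` are hypothetical.

```python
import pickle

def load_artifact(path):
    """Load one pickled artifact, closing the file afterwards."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

def recommend_for_user(username, candidates_fn, positive_rate_fn, top_k=20, top_n=5):
    """Return up to top_n (product, score) pairs, or [] for unknown users."""
    try:
        candidates = candidates_fn(username, top_k)
    except ValueError:  # unknown user: return an empty list instead of an error page
        return []
    scored = [(p, float(positive_rate_fn(p))) for p in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]

# Stand-ins for the real recommender and sentiment scorer.
def demo_candidates(user, top_k):
    if user != "joshua":
        raise ValueError("User not found")
    return ["soap", "dvd", "faucet"][:top_k]

demo_rates = {"soap": 0.91, "dvd": 0.97, "faucet": 0.88}

known = recommend_for_user("joshua", demo_candidates, demo_rates.get, top_n=2)
unknown = recommend_for_user("nobody", demo_candidates, demo_rates.get)
print(known)
print(unknown)
```

In the real module, `candidates_fn` and `positive_rate_fn` would wrap the notebook's `recommend_items_item_based` and `product_positive_pct`. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.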




In [27]:
# generate_user_neighbors.py
import pickle
import numpy as np
from sklearn.neighbors import NearestNeighbors

# load pivot (users x items) and ensure it's numeric, fillna(0)
with open("user_item_pivot.pkl", "rb") as f:
    user_item = pickle.load(f)
X = user_item.fillna(0).astype(np.float32).values  # shape: (n_users, n_items)
users = list(user_item.index)

n_neighbors = 30  # tune
print("Fitting NearestNeighbors (this may take time)...")
nbrs = NearestNeighbors(n_neighbors=n_neighbors+1, metric="cosine", algorithm="brute", n_jobs=-1)
nbrs.fit(X)
distances, indices = nbrs.kneighbors(X)

# build top-k dictionary, excluding each user from their own neighbor list
user_neighbors = {}
for i, user in enumerate(users):
    # Filter self out by index, not by position: users with identical rating
    # vectors tie at distance 0, so self is not guaranteed to be returned first.
    pairs = [(j, d) for j, d in zip(indices[i], distances[i]) if j != i][:n_neighbors]
    # convert cosine distance -> similarity
    user_neighbors[user] = [(users[j], float(1.0 - d)) for j, d in pairs]

with open("user_neighbors.pkl", "wb") as f:
    pickle.dump(user_neighbors, f)
print("saved user_neighbors.pkl")



Fitting NearestNeighbors (this may take time)...
saved user_neighbors.pkl
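
One way the precomputed `user_neighbors` dictionary could be used at serving time is sketched below. It ranks unseen items by a similarity-weighted mean of neighbour ratings, which is a different scoring rule from the unweighted mean used in `recommend_items_user_based`. The data and the function name `recommend_from_neighbors` are hypothetical.

```python
# Hypothetical precomputed neighbours: user -> [(neighbour, similarity), ...]
user_neighbors_demo = {"ann": [("bo", 0.9), ("cy", 0.5)]}

# Hypothetical ratings: user -> {item: rating}
ratings_demo = {
    "ann": {"soap": 5},
    "bo": {"soap": 4, "dvd": 5},
    "cy": {"dvd": 2, "faucet": 4},
}

def recommend_from_neighbors(user, neighbors, ratings, top_k=5):
    """Rank unseen items by the similarity-weighted mean of neighbour ratings."""
    seen = set(ratings.get(user, {}))
    num, den = {}, {}
    for other, sim in neighbors.get(user, []):
        if sim <= 0:
            continue  # ignore neighbours with no positive similarity
        for item, r in ratings.get(other, {}).items():
            if item in seen:
                continue
            num[item] = num.get(item, 0.0) + sim * r
            den[item] = den.get(item, 0.0) + sim
    scores = {item: num[item] / den[item] for item in num}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

recs = recommend_from_neighbors("ann", user_neighbors_demo, ratings_demo)
print(recs)
```

Looking up a short neighbour list per user avoids holding the full user-by-user similarity matrix in memory when serving.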


## Final notes

This notebook is a starting point rather than a finished system. Next steps:

- Hyperparameter tuning (GridSearchCV) for each classifier
- Handle class imbalance (SMOTE, class weights)
- Improve text preprocessing (lemmatization, spelling correction)
- Use pretrained sentence embeddings (sentence-transformers) instead of TF-IDF to capture meaning beyond exact word overlap
- Use FAISS or Milvus for scalable nearest neighbors on embeddings
- Improve recommender by blending collaborative and content-based signals
- Build a proper Flask frontend and test locally before deploying to Heroku or similar
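
As a first step on the class-imbalance item, the sketch below computes "balanced" class weights, `n_samples / (n_classes * count)`, from the label counts reported earlier in this notebook. The helper name `balanced_class_weights` is made up for the example.

```python
def balanced_class_weights(counts):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Label counts from the sentiment-label cell.
weights = balanced_class_weights({"positive": 26632, "negative": 3367})
print({k: round(v, 3) for k, v in weights.items()})
```

In scikit-learn, passing `class_weight='balanced'` to `LogisticRegression` or `RandomForestClassifier` applies this weighting, so each negative review counts roughly eight times as much as a positive one during training.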