# Hybrid Product Recommendation System (SVD + KNNWithMeans)

Dataset: Women’s E-Commerce Clothing Reviews

Goal: Recommend products to users based on ratings and similarity, combining SVD and KNNWithMeans for improved accuracy.

# Import Libraries

In [41]:
%pip install scikit-surprise

import pandas as pd
from surprise import Dataset, Reader, KNNWithMeans, SVD, accuracy
from surprise.model_selection import train_test_split, GridSearchCV
import math

Note: you may need to restart the kernel to use updated packages.


# Loading the data

In [19]:
df = pd.read_csv('C:\\Users\\ICTServices\\Desktop\\Womens Clothing E-Commerce Reviews.csv')

# Data Exploration

In [20]:
df.head()   

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [21]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,23486.0,23486.0,23486.0,23486.0,23486.0,23486.0
mean,11742.5,918.118709,43.198544,4.196032,0.822362,2.535936
std,6779.968547,203.29898,12.279544,1.110031,0.382216,5.702202
min,0.0,0.0,18.0,1.0,0.0,0.0
25%,5871.25,861.0,34.0,4.0,1.0,0.0
50%,11742.5,936.0,41.0,5.0,1.0,1.0
75%,17613.75,1078.0,52.0,5.0,1.0,3.0
max,23485.0,1205.0,99.0,5.0,1.0,122.0


In [38]:
df.isnull().sum()

product_id        0
Rating            0
Title          3810
Review Text     845
user_id           0
dtype: int64

In [39]:
df.duplicated().sum()

0

The dataset contains missing values (3,810 in Title and 845 in Review Text) and no duplicate rows.

# Data Cleaning

In [None]:

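# Keep only the columns needed for the recommender
df = df[['Clothing ID', 'Rating', 'Title', 'Review Text']]
df = df.rename(columns={'Clothing ID': 'product_id'})
# Drop rows with a missing product ID or rating
df = df.dropna(subset=['product_id', 'Rating'])
# The dataset has no user IDs, so assign one artificial ID per review
df['user_id'] = range(1, len(df) + 1)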

Rows with a missing product_id or Rating are dropped because the Surprise library cannot handle missing ratings.
This dataset does not have user IDs. To simulate a recommender environment, I create artificial user IDs, one per review.

# Data preparation

In [None]:

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['user_id', 'product_id', 'Rating']], reader)

The Surprise library requires a 'Reader' that defines the rating scale (1 to 5)
and a DataFrame whose columns are in the order [user_id, item_id, rating].

# Train test split

In [40]:
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

We split the data into 80% training and 20% testing.
This allows us to evaluate the models on unseen data.

# Training of individual models

In [31]:
# Step 5: Train individual models
# Matrix factorization model
svd = SVD(n_factors=100, lr_all=0.005, reg_all=0.02, n_epochs=30)
svd.fit(trainset)

# Item-based neighborhood model (user_based=False compares products, not users)
sim_options = {"name": "pearson_baseline", "user_based": False}
knn = KNNWithMeans(k=30, sim_options=sim_options, verbose=False)
knn.fit(trainset)

<surprise.prediction_algorithms.knns.KNNWithMeans at 0x1a097ccf080>

SVD (Matrix Factorization)
Learns hidden (latent) features that represent users and products.
Captures complex relationships and preferences.

KNNWithMeans (Collaborative Filtering)
Uses similarity between products (item-based approach, since user_based is False).
Adjusts for rating bias using mean-centering. In the item-based setting, ratings are centered on each product's mean rating.
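For reference, the prediction rules behind the two models can be written out as below. This follows the formulation given in the Surprise documentation, with the KNNWithMeans rule shown in its item-based form to match `user_based: False`. Here μ is the global mean rating, b_u and b_i are user and item biases, p_u and q_i are the latent factor vectors, μ_i is the mean rating of item i, and N_u^k(i) is the set of at most k items rated by user u that are most similar to item i.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
\qquad \text{(SVD)}

\hat{r}_{ui} = \mu_i +
  \frac{\sum_{j \in N_u^k(i)} \operatorname{sim}(i, j)\,(r_{uj} - \mu_j)}
       {\sum_{j \in N_u^k(i)} \operatorname{sim}(i, j)}
\qquad \text{(item-based KNNWithMeans)}
```

Note that the second rule needs other items rated by the same user, so it has nothing to work with when a user has rated only one product.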

 

# Combination of predictions

In [None]:

alpha = 0.6  # weight for SVD (0.6 means 60% SVD, 40% KNN)
hybrid_predictions = []

for uid, iid, true_r in testset:
    svd_pred = svd.predict(uid, iid).est
    knn_pred = knn.predict(uid, iid).est
    hybrid_est = alpha * svd_pred + (1 - alpha) * knn_pred
    hybrid_predictions.append((uid, iid, true_r, hybrid_est))


Each model predicts ratings separately.
We combine them with a weighted average:
    hybrid = α * SVD + (1 - α) * KNN
where α (here 0.6) determines how much weight the SVD prediction gets relative to the KNN prediction.
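As a worked illustration of the weighted average, the sketch below blends two made-up predictions and then sweeps α over a small grid to see which weight gives the lowest RMSE. The helper names `blend`, `rmse_for_alpha` and `best_alpha`, and all the numbers, are hypothetical and not part of the notebook.

```python
import math

def blend(svd_est, knn_est, alpha):
    """Weighted average of two rating estimates; alpha is the SVD weight."""
    return alpha * svd_est + (1 - alpha) * knn_est

def rmse_for_alpha(rows, alpha):
    """rows holds (true_rating, svd_est, knn_est) tuples."""
    sq_err = [(true_r - blend(s, k, alpha)) ** 2 for true_r, s, k in rows]
    return math.sqrt(sum(sq_err) / len(sq_err))

def best_alpha(rows, grid):
    """Return the alpha from grid with the lowest RMSE on rows."""
    return min(grid, key=lambda a: rmse_for_alpha(rows, a))

# Invented numbers for illustration only.
toy_rows = [(5, 4.4, 5.4), (3, 3.6, 2.6), (4, 3.7, 4.2)]
grid = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

example = blend(4.0, 5.0, alpha=0.6)   # 0.6 * 4.0 + 0.4 * 5.0
chosen = best_alpha(toy_rows, grid)
print(f"blend: {example:.2f}, best alpha on toy data: {chosen}")
```

If α were tuned this way on real data, it would be safer to do it on a separate validation split than on the test set used for the final RMSE.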

# Evaluation of the Hybrid Model

In [34]:
# Step 7: Evaluate RMSE
import math

mse = sum([(true_r - est) ** 2 for (_, _, true_r, est) in hybrid_predictions]) / len(hybrid_predictions)
rmse = math.sqrt(mse)
print(f"\n Hybrid Model RMSE: {rmse:.4f}")


 Hybrid Model RMSE: 1.1016


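We calculate the RMSE (Root Mean Square Error)
to measure how close the predicted ratings are to the actual ratings.
The RMSE is 1.1016 on a 1 to 5 scale. For comparison, the standard deviation of Rating reported by df.describe() is 1.11, which suggests the hybrid model does little better than always predicting the mean rating.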
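To judge whether an RMSE value is good, it helps to compare it with a trivial baseline that predicts the training mean for every test rating. The sketch below shows that comparison on invented ratings; the function names `rmse` and `mean_baseline_rmse` are hypothetical helpers, not part of the notebook or of Surprise.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mean_baseline_rmse(train_ratings, test_ratings):
    """RMSE obtained by predicting the training mean for every test rating."""
    mean_rating = sum(train_ratings) / len(train_ratings)
    return rmse([(t, mean_rating) for t in test_ratings])

# Invented ratings for illustration only.
train = [5, 4, 5, 3, 4, 5, 2, 5]
test = [5, 4, 1, 5]
baseline = mean_baseline_rmse(train, test)
print(f"Mean-baseline RMSE: {baseline:.4f}")
```

In the notebook, the same check could presumably be run with `trainset.global_mean` as the constant prediction and the `true_r` values from `testset`. A model is only adding value if its RMSE is clearly below this baseline.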

# Generate Recommendations

In [35]:
# Step 8: Recommendation function
def recommend_products(user_id, num_recommendations=5):
    all_products = df['product_id'].unique()
    # Use a set so the membership test below is a fast lookup
    rated_products = set(df.loc[df['user_id'] == user_id, 'product_id'])
    products_to_predict = [pid for pid in all_products if pid not in rated_products]

    predictions = []
    for pid in products_to_predict:
        svd_pred = svd.predict(user_id, pid).est
        knn_pred = knn.predict(user_id, pid).est
        hybrid_est = alpha * svd_pred + (1 - alpha) * knn_pred
        predictions.append((pid, hybrid_est))

    predictions.sort(key=lambda x: x[1], reverse=True)

    print(f"\nTop {num_recommendations} Recommended Products for User {user_id}:\n")
    for pid, est in predictions[:num_recommendations]:
        print(f"Product ID: {pid}, Predicted Rating: {est:.2f}")

The function recommends the top-N products for a given user.
It predicts a hybrid rating for every product the user has not rated and ranks the products by that prediction.
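The ranking step on its own can be sketched as follows. Instead of sorting the full list, this variant uses `heapq.nlargest` from the standard library, which only keeps track of the best N entries. The helper name `top_n` and the scores are invented for illustration.

```python
import heapq

def top_n(scored_items, n=5):
    """Return the n (item_id, score) pairs with the highest score, best first."""
    return heapq.nlargest(n, scored_items, key=lambda pair: pair[1])

# Invented (product_id, predicted_rating) pairs for illustration only.
scores = [(101, 4.2), (102, 4.9), (103, 3.1), (104, 4.7)]
top_two = top_n(scores, n=2)
print(top_two)
```

For a catalogue of roughly a thousand products the difference from a full sort is negligible, so this is a matter of style more than speed.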

In [36]:
# Step 9: Example — Recommend for User 100
recommend_products(user_id=100, num_recommendations=5)


Top 5 Recommended Products for User 100:

Product ID: 123, Predicted Rating: 4.74
Product ID: 906, Predicted Rating: 4.73
Product ID: 961, Predicted Rating: 4.70
Product ID: 1125, Predicted Rating: 4.70
Product ID: 378, Predicted Rating: 4.70


Example: top 5 product recommendations for user 100.

# Challenges

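Artificial user IDs
→ Since the dataset doesn’t have real users, each review is tied to its own unique “user”, so no user has more than one rating. Without overlapping ratings, collaborative filtering has very little to work with.

Sparse matrix
→ Because each artificial user rated exactly one item, neither SVD nor KNN can learn meaningful similarities between users or products.

No feature diversity
→ Using only product_id and rating limits what the model can learn. The review text, age, and department columns are not used, so recommendations are made without context.

Model parameters
→ The hyperparameters were set by hand (mostly Surprise defaults) and were not tuned. GridSearchCV is imported but never used, so the chosen values may not fit this dataset well.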
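The first two challenges can be made concrete by counting ratings per user and computing how full the user-item matrix is. The sketch below does this on a few invented rows that mimic the notebook's scheme of one new user_id per review; `rating_matrix_stats` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def rating_matrix_stats(ratings):
    """Summarize a list of (user_id, product_id, rating) tuples."""
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    per_user = Counter(u for u, _, _ in ratings)
    # Density = share of user-item cells that actually hold a rating
    density = len(ratings) / (len(users) * len(items))
    return {
        "n_users": len(users),
        "n_items": len(items),
        "max_ratings_per_user": max(per_user.values()),
        "density": density,
    }

# Invented rows for illustration only: every review gets a fresh user_id.
toy = [(1, 767, 4), (2, 1080, 5), (3, 1077, 3), (4, 1080, 5)]
stats = rating_matrix_stats(toy)
print(stats)
```

When every user has exactly one rating, the density is always 1 divided by the number of products, and no two products ever share a rater, which is the situation described above.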