# Building an E-commerce Product Recommender System: A Step-by-Step Guide

This project involves creating a recommender system for an e-commerce platform. We will cover the entire pipeline—from data preprocessing and model development to deployment and monitoring—using tools like Docker, AWS Free Tier, and public datasets. This guide assumes you have a basic understanding of Python, machine learning, and AWS services.

Data from: https://amazon-reviews-2023.github.io/



In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


In [2]:
!pip install datasets


In [3]:
from datasets import load_dataset

# Load the dataset
reviews_dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", name="raw_review_All_Beauty", trust_remote_code=True)

raw_meta_dataset = load_dataset("McAuley-Lab/Amazon-Reviews-2023", name="raw_meta_All_Beauty", trust_remote_code=True)



In [4]:
reviews_dataset

DatasetDict({
    full: Dataset({
        features: ['rating', 'title', 'text', 'images', 'asin', 'parent_asin', 'user_id', 'timestamp', 'helpful_vote', 'verified_purchase'],
        num_rows: 701528
    })
})

In [5]:

# Convert to a pandas DataFrame
df = reviews_dataset["full"].to_pandas()

## Exploratory Analysis of the Data

In [6]:
df.head()

Unnamed: 0,rating,title,text,images,asin,parent_asin,user_id,timestamp,helpful_vote,verified_purchase
0,5.0,Such a lovely scent but not overpowering.,This spray is really nice. It smells really go...,[],B00YQ6X8EO,B00YQ6X8EO,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588687728923,0,True
1,4.0,Works great but smells a little weird.,"This product does what I need it to do, I just...",[],B081TJ8YS3,B081TJ8YS3,AGKHLEW2SOWHNMFQIJGBECAF7INQ,1588615855070,1,True
2,5.0,Yes!,"Smells good, feels great!",[],B07PNNCSP9,B097R46CSY,AE74DYR3QUGVPZJ3P7RFWBGIX7XQ,1589665266052,2,True
3,1.0,Synthetic feeling,Felt synthetic,[],B09JS339BZ,B09JS339BZ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1643393630220,0,True
4,5.0,A+,Love it,[],B08BZ63GMJ,B08BZ63GMJ,AFQLNQNQYFWQZPJQZS6V3NZU4QBQ,1609322563534,0,True


**asin, str**: ID of the product

**parent_asin, str**: Parent ID of the product. Note: Products with different colors, styles, sizes usually belong to the same parent ID. The “asin” in previous Amazon datasets is actually parent ID. Please use parent ID to find product meta.
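
Because product metadata is keyed by the parent ID, a join between reviews and metadata could look like the sketch below. The two small frames are made-up stand-ins for the real tables, and it is an assumption here that the metadata table exposes `parent_asin` and `title` columns. The suffix names are also just illustrative.

```python
import pandas as pd

# Toy stand-ins for the reviews and metadata frames (all values are invented).
reviews = pd.DataFrame({
    "asin": ["A1", "A2", "A3"],
    "parent_asin": ["P1", "P1", "P2"],   # A1 and A2 are variants of P1
    "title": ["Nice", "Ok", "Bad"],      # review title
    "rating": [5.0, 4.0, 1.0],
})
meta = pd.DataFrame({
    "parent_asin": ["P1", "P2"],
    "title": ["Body spray", "Hair brush"],   # product title
})

# Join on parent_asin, not asin: product variants share one metadata row.
# Both frames have a "title" column, so suffixes keep them apart.
reviews_with_meta = reviews.merge(
    meta, on="parent_asin", how="left", suffixes=("_review", "_product")
)
print(reviews_with_meta[["asin", "parent_asin", "title_product", "rating"]])
```

A left join keeps every review even when no metadata row matches, which makes missing products easy to spot as nulls in `title_product`.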

In [7]:
print(df.isnull().sum())
print(df.describe())
print()
df.info()

rating               0
title                0
text                 0
images               0
asin                 0
parent_asin          0
user_id              0
timestamp            0
helpful_vote         0
verified_purchase    0
dtype: int64
              rating     timestamp   helpful_vote
count  701528.000000  7.015280e+05  701528.000000
mean        3.960245  1.554781e+12       0.923588
std         1.494452  8.005792e+10       5.471391
min         1.000000  9.730527e+11       0.000000
25%         3.000000  1.501616e+12       0.000000
50%         5.000000  1.571595e+12       0.000000
75%         5.000000  1.614647e+12       1.000000
max         5.000000  1.694220e+12     646.000000

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 701528 entries, 0 to 701527
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   rating             701528 non-null  float64
 1   title              701528 non-null  ob

In [8]:
df.dropna(subset=['asin', 'title', 'text'], inplace=True)


In [None]:
print(len(df)-len(df.drop_duplicates(subset=['asin','title','text','user_id'])))

# Collaborative Filtering:

In [None]:
df.drop_duplicates(subset=['asin','title','text','user_id'], inplace=True)

In [None]:
# A pivot table would consume too much memory, so we avoid:
# user_item_matrix = df.pivot_table(index='user_id', columns='asin', values='rating')
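
To see why the pivot table is a problem, a rough memory estimate helps. The sketch below compares a dense float64 matrix with a CSR matrix of the same shape. The user, item, and rating counts are hypothetical round numbers chosen for illustration, not values measured from the dataset, and the CSR formula is an approximation that assumes 4-byte indices.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory for a dense float64 matrix of the given shape."""
    return n_rows * n_cols * bytes_per_value


def csr_bytes(n_rows, n_nonzero, bytes_per_value=8, bytes_per_index=4):
    """Approximate CSR memory: values + column indices + row pointers."""
    return (n_nonzero * (bytes_per_value + bytes_per_index)
            + (n_rows + 1) * bytes_per_index)


# Hypothetical sizes: illustrative only, not measured from the dataset.
n_users, n_items, n_ratings = 600_000, 100_000, 700_000

print(f"dense:  {dense_bytes(n_users, n_items) / 1e9:.1f} GB")
print(f"sparse: {csr_bytes(n_users, n_ratings) / 1e6:.1f} MB")
```

The dense size grows with users times items, while the sparse size grows with the number of ratings actually observed.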

In [None]:
from scipy.sparse import csr_matrix

# Create mappings from IDs to indices
user_map = {u: i for i, u in enumerate(df['user_id'].unique())}
item_map = {a: j for j, a in enumerate(df['asin'].unique())}

# Map user_id and asin to integer indices
df['user_idx'] = df['user_id'].map(user_map)
df['item_idx'] = df['asin'].map(item_map)

# csr_matrix sums duplicate (row, col) entries, so average repeated
# ratings of the same item by the same user first
ratings = df.groupby(['user_idx', 'item_idx'], as_index=False)['rating'].mean()

# Sparse form of the user-item matrix:
# csr_matrix((data, (row, col)), shape=(n_users, n_items))
sparse_matrix = csr_matrix(
    (ratings['rating'], (ratings['user_idx'], ratings['item_idx'])),
    shape=(len(user_map), len(item_map))
)

In [None]:
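from sklearn.metrics.pairwise import cosine_similarity

# Transpose so rows are items; the result is a dense (n_items x n_items) array
item_similarity = cosine_similarity(sparse_matrix.T)

# Label rows and columns with ASINs, in the same order used to build item_map
column_names = df['asin'].unique()
item_similarity_df = pd.DataFrame(item_similarity, index=column_names, columns=column_names)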

Note that the result is an item-item similarity matrix.

`sparse_matrix.T` transposes the user-item matrix, turning the rows into items and the columns into users.
After transposition, each row in the matrix represents an item, and each column holds one user's rating for that item.

**cosine_similarity computes the pairwise similarity between the rows of the input matrix.**
Since the rows represent items, the result is a similarity matrix where:
Rows and columns both correspond to items.
Each value item_similarity[i, j] represents the cosine similarity between item i and item j.
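
As a small worked example of the same idea, the NumPy sketch below computes item-item cosine similarity by hand on a made-up 3 x 3 rating matrix. The ratings are invented, and the computation follows the textbook definition rather than calling scikit-learn.

```python
import numpy as np

# Toy user-item ratings: 3 users (rows) x 3 items (columns); 0 means "not rated".
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
])

item_vectors = ratings.T                                      # rows are now items
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
unit_items = item_vectors / norms                             # each item vector has length 1
toy_similarity = unit_items @ unit_items.T                    # [i, j] = cosine(item i, item j)

print(np.round(toy_similarity, 3))
```

Items 0 and 1 were rated similarly by the same two users, so their similarity is close to 1. Item 2 shares no raters with the others, so its similarity to them is 0.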

In [None]:
def recommend_products(product_id, num_recommendations):
    # Drop the product itself rather than assuming it sorts first (ties are possible)
    sim_scores = item_similarity_df[product_id].drop(product_id)
    return sim_scores.nlargest(num_recommendations).index.tolist()

# Example usage:
recommend_products('B00YQ6X8EO', 5)

# Content-Based Filtering with NLP

We will use product descriptions to find similar products.


In [None]:
df_all = raw_meta_dataset["full"].to_pandas()
df_all.head()

## Text preprocessing and vectorizing using TF-IDF:

Before applying TfidfVectorizer, ensure your column contains only meaningful text. Use pandas to filter out non-text entries.
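
One way to do that filtering is sketched below on a made-up Series. The sample values are invented, and the rule used here (keep only non-empty strings) is one reasonable reading of "meaningful text", not the only one.

```python
import pandas as pd

# Made-up titles, including the kinds of entries worth filtering out.
titles = pd.Series(["Rose body spray", None, "   ", 12345, "Hair brush"])

# Keep only genuine strings, strip whitespace, and drop the empty ones.
is_text = titles.map(lambda value: isinstance(value, str))
clean_titles = titles[is_text].str.strip()
clean_titles = clean_titles[clean_titles != ""]

print(clean_titles.tolist())
```

Filtering rows changes the index, so any similarity matrix built afterwards has to be mapped back to products through the filtered frame, not the original one.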



In [None]:
# Using title of the products:
df_all['title'] = df_all['title'].fillna('')

In [None]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_all['title'])

## Compute Cosine Similarity:

### Reduce Dimensionality
If the TF-IDF matrix has a very large number of features, reduce its dimensionality with Truncated SVD before computing similarity:

In [None]:

# from sklearn.decomposition import TruncatedSVD

# # Reduce dimensionality of TF-IDF matrix
# svd = TruncatedSVD(n_components=100)  # Adjust components as needed
# reduced_tfidf_matrix = svd.fit_transform(tfidf_matrix)


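The tfidf_matrix generated by TfidfVectorizer is already a sparse matrix (in CSR format). The memory issue comes from the result: by default cosine_similarity returns a dense array with one row and one column per product, and it also makes a normalized copy of its input first.

Solution:
Use linear_kernel from sklearn instead of cosine_similarity. TfidfVectorizer L2-normalizes each row by default, so the plain dot product that linear_kernel computes already equals the cosine similarity, and the extra normalization pass is skipped. The output is still dense by default, so this reduces the overhead but not the size of the result.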
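
One reason linear_kernel can stand in for cosine_similarity here is that TfidfVectorizer L2-normalizes each row by default (`norm='l2'`), so a plain dot product already equals the cosine. The NumPy sketch below checks that identity on made-up vectors. It does not call scikit-learn, and the random rows only stand in for TF-IDF rows.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.random((4, 6))            # made-up non-negative rows

# Cosine similarity computed from its definition.
norms = np.linalg.norm(vectors, axis=1)
cosine = (vectors @ vectors.T) / np.outer(norms, norms)

# After L2-normalizing the rows, a plain dot product gives the same numbers.
unit = vectors / norms[:, None]
dot_on_unit = unit @ unit.T

print(np.allclose(cosine, dot_on_unit))
```

The identity holds only while the rows stay unit length. A transformation such as Truncated SVD changes the row norms, so rows would need to be normalized again before dot products can be read as cosine similarities.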

In [None]:
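# from sklearn.metrics.pairwise import linear_kernel
# from sklearn.preprocessing import normalize

# # SVD output is no longer unit length, so re-normalize before taking dot products
# reduced_tfidf_matrix = normalize(reduced_tfidf_matrix)
# cosine_sim = linear_kernel(reduced_tfidf_matrix, reduced_tfidf_matrix)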


Using both of these methods again resulted in a memory error, so we divide the dataset into smaller batches (chunking, as in Approach 2) and compute the similarity for each chunk.

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
import numpy as np

# Step 1: Reduce dimensionality, then re-normalize rows so dot products are cosine similarities
svd = TruncatedSVD(n_components=100)
reduced_tfidf_matrix = normalize(svd.fit_transform(tfidf_matrix))

# Step 2: Compute similarity in chunks (if needed)
num_rows = reduced_tfidf_matrix.shape[0]
chunk_size = 1000  # Adjust chunk size
cosine_sim_chunks = []

for start in range(0, num_rows, chunk_size):
    end = min(start + chunk_size, num_rows)
    cosine_sim_chunk = linear_kernel(reduced_tfidf_matrix[start:end], reduced_tfidf_matrix)
    cosine_sim_chunks.append(cosine_sim_chunk)

# Combine chunks into a single (num_rows x num_rows) similarity matrix
cosine_sim = np.vstack(cosine_sim_chunks)




The above approach again resulted in a memory crash: stacking the chunks with np.vstack still builds the full dense similarity matrix. Trying alternatives:
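
Before moving to other libraries, one variation that stays within the chunking idea is to keep only the top-k neighbours of each row and discard the rest of each chunk, so the full matrix never exists in memory. The sketch below is a minimal NumPy version run on random toy data. The function name `top_k_neighbors` and its parameters are hypothetical, and the toy array only stands in for the reduced TF-IDF matrix.

```python
import numpy as np


def top_k_neighbors(vectors, k=5, chunk_size=1000):
    """Return (indices, scores) of the k most cosine-similar rows for every row."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)     # avoid dividing by zero
    n = unit.shape[0]
    k = min(k, n - 1)                                # a row cannot be its own neighbour
    top_idx = np.empty((n, k), dtype=np.int64)
    top_sim = np.empty((n, k), dtype=unit.dtype)

    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        sims = unit[start:end] @ unit.T              # only a (chunk x n) block in memory
        sims[np.arange(end - start), np.arange(start, end)] = -np.inf  # ignore self-matches
        part = np.argpartition(-sims, k - 1, axis=1)[:, :k]            # unordered top-k
        part_sims = np.take_along_axis(sims, part, axis=1)
        order = np.argsort(-part_sims, axis=1)                         # sort those k by score
        top_idx[start:end] = np.take_along_axis(part, order, axis=1)
        top_sim[start:end] = np.take_along_axis(part_sims, order, axis=1)
    return top_idx, top_sim


# Toy data standing in for the reduced TF-IDF matrix.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 8))
idx, sim = top_k_neighbors(toy, k=3, chunk_size=16)
print(idx[0], sim[0])
```

The stored result has shape (n, k) instead of (n, n), so memory grows linearly with the number of products. The search itself is still exact and still quadratic in time.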

## 1. Approximate Nearest Neighbors (ANN)
Instead of calculating the full cosine similarity matrix, use an approximate method to find similar items efficiently. Libraries like FAISS (Facebook AI Similarity Search) or ScaNN are optimized for high-dimensional data and large datasets.

### Example with FAISS:

In [None]:
# !pip install faiss-cpu
# !pip install faiss-gpu

In [None]:
# import faiss
# import numpy as np

# # Convert sparse matrix to dense for FAISS (FAISS expects a dense float32 array)
# dense_matrix = tfidf_matrix.toarray().astype('float32')

# # Build the FAISS index. IndexFlatL2 is an exact (brute-force) L2 index; on
# # L2-normalized TF-IDF rows, ranking by L2 distance matches ranking by cosine similarity.
# index = faiss.IndexFlatL2(dense_matrix.shape[1])
# index.add(dense_matrix)

# # Query for nearest neighbors
# k = 10  # Number of nearest neighbors
# distances, indices = index.search(dense_matrix, k)


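The above crashed as well, most likely because tfidf_matrix.toarray() builds the full dense TF-IDF matrix before FAISS is even used. Trying another method: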

## 2. Sparse Matrix Operations
Leverage sparse matrix-specific operations to avoid dense representations in memory. Libraries like scipy are efficient for sparse data.

Example:

In [None]:
# from scipy.sparse import vstack
# from sklearn.metrics.pairwise import cosine_similarity

# # Compute similarity row by row; dense_output=False keeps each result sparse
# row_sims = []
# for i in range(tfidf_matrix.shape[0]):
#     sim = cosine_similarity(tfidf_matrix[i], tfidf_matrix, dense_output=False)
#     row_sims.append(sim)

# # Stack the sparse rows into one sparse similarity matrix
# row_sims = vstack(row_sims).tocsr()
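
A variation on the row-wise idea above is to multiply sparse chunks directly with scipy and keep only the k best matches per row. The sketch below is a minimal version under the assumption that the rows are already L2-normalized, as TfidfVectorizer output is by default. The function name `sparse_top_k` and the toy matrix are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_top_k(matrix, k=5, chunk_size=1000):
    """Top-k neighbours per row via sparse dot products; rows assumed L2-normalized."""
    matrix = csr_matrix(matrix)
    transposed = matrix.T.tocsr()
    neighbors = {}
    for start in range(0, matrix.shape[0], chunk_size):
        end = min(start + chunk_size, matrix.shape[0])
        sims = (matrix[start:end] @ transposed).tocsr()    # sparse (chunk x n) block
        for offset in range(end - start):
            row_id = start + offset
            lo, hi = sims.indptr[offset], sims.indptr[offset + 1]
            cols, vals = sims.indices[lo:hi], sims.data[lo:hi]
            keep = cols != row_id                           # drop the self-match
            cols, vals = cols[keep], vals[keep]
            best = np.argsort(-vals)[:k]                    # highest scores first
            neighbors[row_id] = list(zip(cols[best].tolist(), vals[best].tolist()))
    return neighbors


# Toy unit-length rows: rows 0 and 1 share a feature, row 2 shares none.
toy_sparse = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]))
result = sparse_top_k(toy_sparse, k=2)
print(result)
```

Only rows that share at least one non-zero feature show up as neighbours, because pairs with no overlap are never stored in the sparse product. A product whose title shares no terms with any other gets an empty list.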
