# UWDSC Recommender Systems Workshop
Notebook by Tanvir Deol, uWaterloo Data Science Club


# Outline
We will show 3 different implementations of a recommender system,  
starting from the most basic and increasing in complexity:

- Content-Based Filtering (KNN method)
- Collaborative Filtering (Matrix Factorization method)
- Neural Collaborative Filtering

## Datasets
Links to the datasets used in this notebook:  
[Amazon Product Metadata Dataset](https://www.kaggle.com/datasets/asaniczka/amazon-products-dataset-2023-1-4m-products/data?select=amazon_products.csv) --> Content-Based Filtering.  
[Amazon Product Reviews Dataset](https://www.kaggle.com/datasets/saurav9786/amazon-product-reviews) --> Neural Collaborative Filtering/Collaborative Filtering




First, we connect to Google Drive to access our datasets.

In [1]:
from google.colab import drive

drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


----

# Content-Based Filtering (K Nearest Neighbors)

In this section we will be using the Amazon Product Metadata Dataset.  
This is because content-based filtering needs the embeddings of the users and items beforehand, unlike the other two methods, which learn them from the rating data.

In our dataset, we are given only the titles of Amazon product listings.  
We use a text embedding model to turn each product title into a 384-dimensional vector, so that every product becomes a point in an $N$-dimensional space (here $N = 384$).

<img src="https://python-charts.com/en/correlation/3d-scatter-plot-matplotlib_files/figure-html/3d-scatter-plot-markers-color-group-matplotlib.png" alt="3D Scatter Plot" width="150" height="150">

After this, we run the K Nearest Neighbors algorithm with the Euclidean distance metric to find the K items most similar to our user's preferences.
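To make the nearest-neighbor step concrete, here is a minimal brute-force sketch in plain Python. The item names and the 3-dimensional vectors are made up for illustration (real embeddings here have 384 dimensions), and `euclidean` and `k_nearest` are hypothetical helpers, not part of scikit-learn.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(query, items, k):
    """Return the k (name, distance) pairs closest to `query`."""
    scored = [(name, euclidean(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]

# Hypothetical 3-dimensional "embeddings" standing in for real title embeddings.
toy_items = {
    "basketball shorts": [0.9, 0.1, 0.0],
    "basketball jersey": [0.8, 0.2, 0.1],
    "carry-on luggage": [0.0, 0.9, 0.3],
    "laptop sleeve": [0.1, 0.3, 0.9],
}
toy_query = [1.0, 0.0, 0.0]  # stands in for the embedded user query

nearest = k_nearest(toy_query, toy_items, k=2)
for name, dist in nearest:
    print(f"{name}: {dist:.3f}")
```

A library implementation such as scikit-learn's `NearestNeighbors` returns the same kind of result but can use index structures that avoid comparing the query against every item.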





In [None]:
!pip3 install pandas scikit-learn sentence-transformers

In [None]:
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer



In [None]:
# Function to get embeddings for a list of texts
def get_embeddings(texts, model):
    # Return a NumPy array so scikit-learn can consume it directly,
    # even when the model runs on a GPU
    embeddings = model.encode(texts, convert_to_numpy=True)
    return embeddings

# Load a pre-trained model from sentence-transformers
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Load products
data = pd.read_csv("/content/gdrive/MyDrive/amazon_products.csv")
data = data.head(10000)
data = data[["asin","title"]]




In [None]:
data.head(10)

| | asin | title |
|---|---|---|
| 0 | B014TMV5YE | Sion Softside Expandable Roller Luggage, Black... |
| 1 | B07GDLCQXV | Luggage Sets Expandable PC+ABS Durable Suitcas... |
| 2 | B07XSCCZYG | Platinum Elite Softside Expandable Checked Lug... |
| 3 | B08MVFKGJM | Freeform Hardside Expandable with Double Spinn... |
| 4 | B01DJLKZBA | Winfield 2 Hardside Expandable Luggage with Sp... |
| 5 | B07XSCD2R4 | Maxlite 5 Softside Expandable Luggage with 4 S... |
| 6 | B07MXF4G8K | Hard Shell Carry on Luggage Airline Approved, ... |
| 7 | B07H515VCZ | Maxporter II 30" Hardside Spinner Trunk Luggag... |
| 8 | B08BXBCNMQ | Omni 2 Hardside Expandable Luggage with Spinne... |
| 9 | B0B9K44XTS | Luggage Sets Expandable Lightweight Suitcases ... |


In [None]:
# Get embeddings for product names
embeddings = get_embeddings(data["title"].tolist(), model)

# Initialize and fit the NearestNeighbors model
# (Minkowski with the default p=2 is the Euclidean distance)
knn = NearestNeighbors(n_neighbors=5, metric='minkowski')
knn.fit(embeddings)

Here our user enters the query "Basketball", so we embed it into a 384-dimensional vector and find the closest items.

In [None]:
# Function to find the most similar product names
def find_similar_products(query, k=5):
    query_embedding = get_embeddings([query], model)
    distances, indices = knn.kneighbors(query_embedding, n_neighbors=k)
    # indices are row positions, so look them up by position with .iloc
    similar_products = [data["title"].iloc[i] for i in indices[0]]
    return similar_products

# Example usage
query_product = "Basketball" # <-- User Query
similar_products = find_similar_products(query_product, k=5)
print(f"Products similar to '{query_product}':")
for product in similar_products:
    print(product)

Products similar to 'Basketball':
Reversible Men's Mesh Athletic Basketball Jersey Single for Team Scrimmage
2 Pack Men’s Compression Pants One Leg 3/4 Capri Tights Leggings Athletic Base Layer for Gym Running Basketball
Men's Sportswear Club Short Basketball Graphic
Men's Elite Basketball Shorts
Basketball Pants with Knee Pads, Black Knee Pads Compression Pants, 3/4 Capri Leggings


----

# Collaborative Filtering (Matrix Factorization)

In this section we use the Amazon Product Reviews Dataset, together with the `scikit-surprise` library for our Matrix Factorization implementation. The dataset tells us how users rate items on a scale from 1 to 5.


If you recall from the presentation, the data we have here resembles a feedback matrix $A$, which we want to factorize into a user matrix $U$ and an item matrix $V$ so that $UV^T$ is a good approximation of $A$.

<img src="https://media.licdn.com/dms/image/C4E12AQGnmr-VQ1zM7g/article-inline_image-shrink_1500_2232/0/1625459218650?e=1721865600&v=beta&t=KBRq8pzFfeznWXZbhzRgn19SthXjI7KB_OoxPgnF6e8" alt="Matrix Factorization" width="550" height="300">

Once we have the factor matrices $U$ and $V$, we can do inference by passing in a user ID and receiving a list of recommended products for that user.
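As a rough illustration of the idea, the sketch below factorizes a tiny made-up feedback matrix with stochastic gradient descent and then ranks one user's unrated items. The matrix values, the rank, the learning rate, and the regularization strength are all assumptions chosen for the example; this is not how `scikit-surprise` is implemented internally, only a simplified version of the same kind of model (without bias terms).

```python
import math
import random

random.seed(0)

# Toy feedback matrix A: rows are users, columns are items, None = unrated.
A = [
    [5, 4, None, 1],
    [4, None, 1, 1],
    [1, 1, 5, None],
    [None, 1, 4, 5],
]
n_users, n_items, rank = len(A), len(A[0]), 2
observed = [(u, i, A[u][i]) for u in range(n_users)
            for i in range(n_items) if A[u][i] is not None]

# U (users x rank) and V (items x rank), initialized with small random values.
U = [[random.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_users)]
V = [[random.uniform(0.0, 0.1) for _ in range(rank)] for _ in range(n_items)]

def predict(u, i):
    """Entry (u, i) of U V^T: dot product of a user row and an item row."""
    return sum(U[u][f] * V[i][f] for f in range(rank))

def observed_rmse():
    """RMSE of U V^T against the observed entries of A only."""
    total = sum((r - predict(u, i)) ** 2 for u, i, r in observed)
    return math.sqrt(total / len(observed))

initial_rmse = observed_rmse()
lr, reg = 0.01, 0.02  # assumed learning rate and L2 regularization strength
for _ in range(2000):
    for u, i, r in observed:
        err = r - predict(u, i)
        for f in range(rank):
            pu, qi = U[u][f], V[i][f]
            U[u][f] += lr * (err * qi - reg * pu)
            V[i][f] += lr * (err * pu - reg * qi)
final_rmse = observed_rmse()
print(f"RMSE before: {initial_rmse:.3f}, after: {final_rmse:.3f}")

# Inference: rank the items user 0 has not rated by predicted rating.
unrated = [i for i in range(n_items) if A[0][i] is None]
ranked = sorted(unrated, key=lambda i: predict(0, i), reverse=True)
print("Recommended item indices for user 0:", ranked)
```

Only the observed entries of $A$ contribute to the error, which is what lets the learned $UV^T$ fill in predictions for the missing ones.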

On top of that, we evaluate our implementation using an RMSE (Root Mean Squared Error) score.
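For reference, RMSE over $n$ test ratings $r_i$ with predictions $\hat{r}_i$ is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(r_i - \hat{r}_i)^2}$, so lower is better. A small sketch of the calculation follows; the `rmse` helper and the sample ratings are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true ratings and model predictions on the 1-to-5 scale.
true_ratings = [5.0, 3.0, 1.0]
predicted_ratings = [4.0, 3.0, 2.0]
score = rmse(true_ratings, predicted_ratings)
print(f"RMSE: {score:.4f}")
```

Because the errors are squared before averaging, a few badly wrong predictions raise the score more than many slightly wrong ones.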

In [None]:
!pip install scikit-surprise



In [None]:
import pandas as pd
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

In [2]:
file_path = '/content/gdrive/MyDrive/ratings_Electronics.csv'

In [None]:
# Load the data
data = pd.read_csv(file_path, header=None)
data.columns = ["user_id", "product_id", "rating", "timestamp"]
data = data.head(100000)

# Define a Reader object with the appropriate rating scale
reader = Reader(rating_scale=(1, 5))

# Load the dataset from the pandas dataframe
dataset = Dataset.load_from_df(data[['user_id', 'product_id', 'rating']], reader)

# Split the dataset into training and testing sets
trainset, testset = train_test_split(dataset, test_size=0.25)

In [None]:
data.head(10)

| | user_id | product_id | rating | timestamp |
|---|---|---|---|---|
| 0 | AKM1MP6P0OYPR | 132793040 | 5.0 | 1365811200 |
| 1 | A2CX7LUOHB2NDG | 321732944 | 5.0 | 1341100800 |
| 2 | A2NWSAGRHCP8N5 | 439886341 | 1.0 | 1367193600 |
| 3 | A2WNBOD3WNDNKT | 439886341 | 3.0 | 1374451200 |
| 4 | A1GI0U4ZRJA8WN | 439886341 | 1.0 | 1334707200 |
| 5 | A1QGNMC6O1VW39 | 511189877 | 5.0 | 1397433600 |
| 6 | A3J3BRHTDRFJ2G | 511189877 | 2.0 | 1397433600 |
| 7 | A2TY0BTJOTENPG | 511189877 | 5.0 | 1395878400 |
| 8 | A34ATBPOK6HCHY | 511189877 | 5.0 | 1395532800 |
| 9 | A89DO69P0XZ27 | 511189877 | 5.0 | 1395446400 |


In [None]:
# Use the SVD algorithm for matrix factorization
algo = SVD()

# Train the algorithm on the training set
algo.fit(trainset)

# Test the algorithm on the testing set
predictions = algo.test(testset)

# Compute and print the RMSE
accuracy.rmse(predictions)

RMSE: 1.2781


1.2780584542183315

In [None]:
# Function to recommend products for a user
def recommend_products(algo, user_id, data, num_recommendations=5):
    # Get a list of all product IDs
    product_ids = data['product_id'].unique()

    # Get the set of products the user has already rated (a set gives fast lookups)
    user_rated_products = set(data[data['user_id'] == user_id]['product_id'])

    # Generate predictions for products the user has not rated
    predictions = [algo.predict(user_id, pid) for pid in product_ids if pid not in user_rated_products]

    # Sort the predictions by estimated rating
    predictions.sort(key=lambda x: x.est, reverse=True)

    # Return the top-n recommended products
    top_predictions = predictions[:num_recommendations]
    top_product_ids = [pred.iid for pred in top_predictions]
    top_ratings = [pred.est for pred in top_predictions]

    return pd.DataFrame({'product_id': top_product_ids, 'predicted_rating': top_ratings})

# Example usage
user_id = 'A3SGXH7AUHU8GW'
recommendations = recommend_products(algo, user_id, data, num_recommendations=5)
print(recommendations)


   product_id  predicted_rating
0  B000053HC5          4.903466
1  B00004U89X          4.812169
2  B000053HH5          4.810761
3  B00000JBHE          4.810125
4  B00004TENT          4.795880


----

# Neural Collaborative Filtering

In this section we reuse the Amazon Product Reviews Dataset.

We use the `cornac` library, which is made for multi-modal recommender systems.

Once the NCF model is fitted to our data, we evaluate it using the RMSE (Root Mean Squared Error) and Precision@K metrics.
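Precision@K is the fraction of the top K recommended items that are actually relevant to the user. The sketch below shows the calculation on made-up data; `precision_at_k` is a hypothetical helper written for this example, not the `cornac` implementation.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the first k recommended items that are in `relevant`."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical ranked recommendations and the items the user truly liked.
recommended_items = ["a", "b", "c", "d"]
relevant_items = {"a", "d", "x"}
p_at_4 = precision_at_k(recommended_items, relevant_items, k=4)
print(f"Precision@4: {p_at_4:.2f}")
```

Unlike RMSE, this metric ignores the predicted rating values and only looks at which items end up at the top of the ranking.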

Lastly, we run an example inference where we pass in a specific user ID and receive the recommended products for that user.


In [None]:
!pip3 install cornac

In [4]:
import pandas as pd
import numpy as np
import cornac
from cornac.data import Dataset
from cornac.models import NeuMF
from cornac.eval_methods import RatioSplit
from cornac.metrics import RMSE, MAE, Recall, NDCG, Precision
from cornac.data import Reader

In [5]:
data = Reader().read(file_path, sep=',')
data = data[0:100000]

# Create a Cornac dataset
dataset = Dataset.from_uir(data, seed=0)

# Define the NCF model
ncf = NeuMF()

In [6]:
# Train the NCF model on the full dataset
ncf.fit(dataset)


<cornac.models.ncf.recom_neumf.NeuMF at 0x7dc975621ba0>

In [7]:
ratio_split = RatioSplit(data=data, test_size=0.25, rating_threshold=1.0, seed=0)

cornac.Experiment(eval_method=ratio_split,
                  models=[ncf],
                  metrics=[RMSE(), Precision(k=10)]).run()



TEST:
...
      |   RMSE | Precision@10 | Train (s) | Test (s)
----- + ------ + ------------ + --------- + --------
NeuMF | 3.1901 |       0.0050 |  213.4477 |  11.6987



In [None]:
# Example usage
user_id = 'AKM1MP6P0OYPR'
recommended_items = ncf.recommend(user_id, k=10)

# Print the recommended items
print(recommended_items)

['B00005A0R9', 'B00004YZQ9', 'B00004YK37', 'B00000J3IO', 'B00004Y289', '957321296X', 'B00005A8SO', '9983765012', 'B00004SD92', 'B00004UF7U']
