## Collaborative Filtering

### Loading Datasets

The reviews and items are from the output of the `preprocessing.ipynb` file

In [1]:
import pandas as pd

reviews = pd.read_csv('../datasets/slimmed/reviews.csv')
items = pd.read_csv('../datasets/slimmed/items.csv')

Helper function to get title of item from its id (parent_asin)

In [84]:
def get_item_name_from_id(parent_asin):
	return items[items['parent_asin'] == parent_asin]['title'].unique()[0]

### Creating Sparse Matrix

The size of the user-item matrix would be too large to fit in memory and would contain many zero values anyway

In [2]:
num_user_ids, num_item_ids = reviews['user_id'].nunique(), items['parent_asin'].nunique()
format(num_user_ids, ','), format(num_item_ids, ','), format(num_user_ids * num_item_ids, ',')

('2,282,093', '121,820', '278,004,569,260')

The sparse matrix and mappings from `user_item_matrix` are imported into `uim`

In [3]:
import nbformat

# Load the notebook
with open('user_item_matrix.ipynb', 'r', encoding='utf-8') as f:
	nb = nbformat.read(f, as_version=4)

# Execute all code cells and store data in the uim dict
uim = {}
for cell in nb.cells:
	if cell.cell_type == 'code':
		exec(cell.source, uim)

### ALS Model (Alternating Least Squares)

`implicit` library already uses multithreading so `BLAS` threads should be set to 1 to avoid overhead

In [4]:
import threadpoolctl 
threadpoolctl.threadpool_limits(1, 'blas')

<threadpoolctl.threadpool_limits at 0x1c37f3b5a60>

#### Transforming CSR Ratings To Confidence

A core issue here is that implicit's ALS model works with implicit feedback and not explicit ones such as rating 

The following strategy is proposed for handling this<br><br>
For every item with mean user rating $\mu_u$ and standard deviation $\sigma_u$,<br>
o If an item rating is less than $\mu_u$, then it is set to 0 (no confidence)<br>
o Otherwise, it is scaled to a value in the range [1, 5] depending on $\mu_u$ and $\sigma_u$ (or scaled from [mean, 5] to [1, 5])

In [9]:
import numpy as np

# Extract components
data = uim['sparse_matrix_csr'].data
indices = uim['sparse_matrix_csr'].indices
indptr = uim['sparse_matrix_csr'].indptr

# Compute per-user mean ratings
n_users = uim['sparse_matrix_csr'].shape[0]

user_means = np.zeros(n_users)
max_user_ratings = np.zeros(n_users)

for user in range(n_users):
	start, end = indptr[user], indptr[user + 1]
	user_ratings = data[start:end]

	if len(user_ratings) > 0:
		user_means[user] = np.mean(user_ratings)
		max_user_ratings[user] = np.max(user_ratings)
	else:
		user_means[user] = 0.0
		max_user_ratings[user] = 1

`ALPHA` is scaling factor that determines how strongly the higher ratings are trusted over low ones.

In [52]:
ALPHA = 80

In [53]:
new_data = data.copy()

for user in range(n_users):
	start, end = indptr[user], indptr[user + 1]
	for i in range(start, end):
		rating = data[i]
		mean = user_means[user]
		max_rating = max_user_ratings[user]

		if rating < mean:
			new_data[i] = 0  # no confidence
		else:
			# # If user only gave ratings of 5, then it can be considered as the "neutral" rating
			# if mean == 5:
			#     conf = 3
			# # Linear map from [mean, 5] to [1, 5]
			# else:
			#     conf = (rating - mean) / (5 - mean) * 4 + 1

			s = 0
			if max_rating == mean:
				s = 1.0
			else:
				s = (rating - mean) / (max_rating - mean)

			new_data[i] = 1 + ALPHA * s

In [54]:
from scipy.sparse import csr_matrix
confidence_csr = csr_matrix((new_data, indices, indptr), shape=uim['sparse_matrix_csr'].shape)

In [55]:
confidence_csr.eliminate_zeros()

The ALS model is trained

In [56]:
from implicit.als import AlternatingLeastSquares

# Train ALS model
als_model = AlternatingLeastSquares(factors=50, iterations=15, regularization=0.1, random_state=42)
als_model.fit(confidence_csr)

  0%|          | 0/15 [00:00<?, ?it/s]

#### Saving ALS Model

In [106]:
import pickle
import gzip

# Save to a pickle file
with gzip.open('../data_structures/als_model.pkl', 'wb', compresslevel=5) as f:
	pickle.dump(als_model, f)

#### Loading ALS Model

In [121]:
import pickle
import gzip

from typing import cast
from implicit.cpu.als import AlternatingLeastSquares

# Load the compressed file
with gzip.open('../data_structures/als_model.pkl', 'rb') as f:
	als_model = cast(AlternatingLeastSquares, pickle.load(f))

#### Predicting User Ratings

A test run where the top 5 items are recommended for user with id from the map

In [108]:
user_id = 2  # Target user
num_recommendations = 15  # How many items to recommend

# Get top N recommended items and their scores
recommended_items = als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], N=num_recommendations
)

recommendations, scores = recommended_items
recommendations_scores = zip(recommendations, scores)

print(f'Top {num_recommendations} recommended items for User {uim['reverse_user_map'][user_id]}:')
for item_id, score in recommendations_scores:
	print(f'Item {uim['reverse_item_map'][item_id]} - Score: {score:.4f}')

Top 15 recommended items for User AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q:
Item B009GE437W - Score: 0.3135
Item B087SHFL9B - Score: 0.3040
Item B07ZJ6RY1W - Score: 0.2774
Item B00DDILSBG - Score: 0.2502
Item B07L3D7C21 - Score: 0.2393
Item B0015AARJI - Score: 0.2247
Item B08JHZHWZ3 - Score: 0.2195
Item B0C3BNJFBV - Score: 0.2191
Item B0BMGHMP23 - Score: 0.2142
Item B016XBGWAQ - Score: 0.2042
Item B08F4D36D9 - Score: 0.2031
Item B07213YKCX - Score: 0.2007
Item B07R6NYNBJ - Score: 0.1996
Item B06Y2LGTW3 - Score: 0.1964
Item B01EJ9DMQQ - Score: 0.1947


In [109]:
already_rated_user_items = reviews[reviews['user_id'] == uim['reverse_user_map'][user_id]][['title', 'parent_asin', 'text', 'rating']]
already_rated_user_items[['parent_asin', 'rating']]

Unnamed: 0,parent_asin,rating
3,B0BCHWZX95,5
4,B00HUWA45W,5


In [110]:
items[items['parent_asin'].isin(already_rated_user_items['parent_asin'])][['title']]

Unnamed: 0,title
4085,PowerA Enhanced Wireless Controller for Ninten...
18048,KontrolFreek FPS Freek CQC Signature - Xbox One


In [111]:
als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], items=[3, 4], filter_already_liked_items=False
)

(array([3, 4]), array([0.686079  , 0.00524818], dtype=float32))

Those confidence scores in the items are now converted back to user ratings

In [112]:
user_id, uim['reverse_user_map'][user_id], user_means[user_id], max_user_ratings[user_id]

(2, 'AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q', np.float64(5.0), np.float64(5.0))

A helper function to convert confidence scores to predicted ratings

In [113]:
def confidence_to_predicted_rating(user_id, confidences):
    mean = user_means[user_id]
    max_rating = max_user_ratings[user_id]

    s = (confidences - 1) / ALPHA
    return mean + s * (max_rating - mean)

The model correctly predicted the user's ratings on items they'd seen before

In [114]:
confidence_to_predicted_rating(user_id, als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], items=[3, 4], filter_already_liked_items=False
)[1])

array([5., 5.])

The predicted ratings that the user would give to the recommended items

In [115]:
list(zip(recommendations, confidence_to_predicted_rating(user_id, recommendations)))

[(np.int32(769), np.float64(5.0)),
 (np.int32(1282), np.float64(5.0)),
 (np.int32(187), np.float64(5.0)),
 (np.int32(1013), np.float64(5.0)),
 (np.int32(3148), np.float64(5.0)),
 (np.int32(4351), np.float64(5.0)),
 (np.int32(927), np.float64(5.0)),
 (np.int32(1760), np.float64(5.0)),
 (np.int32(3999), np.float64(5.0)),
 (np.int32(2017), np.float64(5.0)),
 (np.int32(139), np.float64(5.0)),
 (np.int32(2740), np.float64(5.0)),
 (np.int32(2712), np.float64(5.0)),
 (np.int32(1521), np.float64(5.0)),
 (np.int32(153), np.float64(5.0))]

The names of the recommended items

In [116]:
list(map(lambda i: get_item_name_from_id(uim['reverse_item_map'][i]), recommendations))

['Remote Plus, Mario - Nintendo Wii',
 'Super Mario Odyssey - Nintendo Switch',
 '$45 Nintendo eShop Gift Card [Digital Code]',
 'Final Fantasy XV Deluxe Edition - PlayStation 4',
 'Nintendo Switch Online 12-Month Individual Membership [Digital Code]',
 'PlayStation 3 Dualshock 3 Wireless Controller (Black)',
 'Super Mario 3D All-Stars - Nintendo Switch, 175 pieces',
 'Logitech G815 LIGHTSYNC RGB Mechanical Gaming Keyboard with Low Profile GL Tactile switch, 5 programmable G-keys,USB Passthrough, dedicated media control - Linear, Black',
 'Logitech G502 Lightspeed Wireless Gaming Mouse with Hero 25K Sensor, PowerPlay Compatible, Tunable Weights and Lightsync RGB - Black',
 'Steam Link',
 'Pokémon Sword + Pokémon Sword Expansion Pass - Nintendo Switch',
 'Pokémon Ultra Sun and Ultra Moon Steelbook Dual Pack - Nintendo 3DS',
 'Logitech G635 DTS, X 7.1 Surround Sound LIGHTSYNC RGB PC Gaming Headset',
 'Power Supply Brick for Xbox One, Xbox Power Supply Brick Cord AC Adapter Power Supply C

### Finding Similar Users

In [147]:
num_similar = 10  # How many similar items to find
top_similar_users = als_model.similar_users(user_id, N=num_similar+1)

similar_users, scores = top_similar_users
similar_users_scores = list(zip(similar_users[1:], scores[1:]))

print(f'Top {num_similar} users similar to User {uim['reverse_user_map'][user_id]}:')
for sim_user_id, similarity in similar_users_scores[1:]:
	print(f'User {uim['reverse_user_map'][sim_user_id]} - Similarity Score: {similarity:.4f}')

Top 10 users similar to User AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q:
User AGOIYLLZ563YGSOCOQKCBH2FOAIQ - Similarity Score: 0.9977
User AG7XTSDKYIHMEUQ7S7I66GOBZBZQ - Similarity Score: 0.9977
User AEVD66QO6BYDVTAN73RB7VXETLFQ - Similarity Score: 0.9976
User AEOBH5LQGQHMW6CMUNOQ2DULKQ3Q - Similarity Score: 0.9976
User AEBMZFW7E6HKMJDBM3YOFX6PCEPQ - Similarity Score: 0.9976
User AFDZ3XB6OJF2SRSJKJYIWZTZBC5A - Similarity Score: 0.9976
User AHDRCQN6PLDSKTCGGTMRK23K3BZA - Similarity Score: 0.9976
User AGBQL7LTUQ5GDCHH4CROBNN6NUYQ - Similarity Score: 0.9976
User AEFNJ5ZXL4ZT5HVBETVRLX75TN6A - Similarity Score: 0.9976


#### Finding Similar Items

In [117]:
item_id = 12  # Target item
num_similar = 10  # How many similar items to find

# Get top N similar items and their similarity scores (+1 is added to skip the item itself later on)
top_similar_items = als_model.similar_items(item_id, N=num_similar+1)

similar_items, scores = top_similar_items
similar_items_scores = list(zip(similar_items, scores))

print(f'Top {num_similar} items similar to Item {uim['reverse_item_map'][item_id]}:')
for sim_item_id, similarity in similar_items_scores[1:]:
	print(f'Item {uim['reverse_item_map'][sim_item_id]} - Similarity Score: {similarity:.4f}')

Top 10 items similar to Item B09JY72CNG:
Item B095J5JP9T - Similarity Score: 0.9978
Item B09LBFSL1F - Similarity Score: 0.9962
Item B07LBG68KV - Similarity Score: 0.8819
Item B07PB4BDKD - Similarity Score: 0.8503
Item B07LCRLXM7 - Similarity Score: 0.8308
Item B09K7YM2TN - Similarity Score: 0.8277
Item B07NTT87J9 - Similarity Score: 0.8267
Item B08XX4WPB7 - Similarity Score: 0.8158
Item B0B8XDYWG1 - Similarity Score: 0.8130
Item B07GBSHTQK - Similarity Score: 0.8121


Very good and relevant recommendations for the given item (first in the list)

In [118]:
list(map(get_item_name_from_id, map(lambda x: uim['reverse_item_map'][x], [item_id, *similar_items[1:]])))

['Razer Goliathus Extended Chroma Gaming Mouse Pad: Customizable Chroma RGB Lighting - Soft, Cloth Material - Balanced Control & Speed - Non-Slip Rubber Base - Mercury White',
 'Razer BlackWidow Mechanical Gaming Keyboard - Mercury Edition (Renewed)',
 'Corsair Virtuoso RGB Wireless XT High-Fidelity Gaming Headset with Bluetooth and Spatial Audio - Works with Mac, PC, PS5, PS4, Xbox Series X/S - Slate (Renewed)',
 'Amaping Retro Mechanical Keyboard Steampunk Style Pattern RGB Colorful LED Backlit USB Wired 87 Keys Gaming Keyboards for PUBG LOL Gamer Ergonomic Design (White)',
 'Razer Abyssus Essential: True 7,200 DPI Optical Sensor - 3 Hyperesponse Buttons - Powered by Razer Chroma - Ambidextrous Ergonomic Gaming Mouse (Renewed)',
 'Xbox One USB Hub Adapter,VSEER High Speed USB Hub Extension with 4 USB Ports for Xbox One Game Console Accessories(Third Party Product)-Black',
 '(Mouse + Grip Tape) Glorious Model O Wireless Gaming Mouse - RGB 69g Lightweight Wireless Gaming Mouse (Matte B

### Text Features

A more powerful recommendation system can be built using the other features in the `items` dataset

In [119]:
items[['title', 'parent_asin', 'features', 'description', 'details', 'categories']]

Unnamed: 0,title,parent_asin,features,description,details,categories
0,Phantasmagoria: A Puzzle of Flesh,B00069EVOG,['Windows 95'],[],"{'Best Sellers Rank': {'Video Games': 137612, ...","['Video Games', 'PC', 'Games']"
1,NBA 2K17 - Early Tip Off Edition - PlayStation 4,B00Z9TLVK0,['The #1 rated NBA video game simulation serie...,['Following the record-breaking launch of NBA ...,"{'Release date': 'September 16, 2016', 'Best S...","['Video Games', 'PlayStation 4', 'Games']"
2,Nintendo Selects: The Legend of Zelda Ocarina ...,B07SZJZV88,['Authentic Nintendo Selects: The Legend of Ze...,[],"{'Best Sellers Rank': {'Video Games': 51019, '...","['Video Games', 'Legacy Systems', 'Nintendo Sy..."
3,"Spongebob Squarepants, Vol. 1",B0001ZNU56,['Bubblestand: SpongeBob shows Patrick and Squ...,['Now you can watch the wild underwater antics...,"{'Release date': 'August 15, 2004', 'Best Sell...","['Video Games', 'Legacy Systems', 'Nintendo Sy..."
4,eXtremeRate Soft Touch Top Shell Front Housing...,B07H93H878,['Compatibility Models: Ultra fits for Xbox On...,[],"{'Best Sellers Rank': {'Video Games': 48130, '...","['Video Games', 'Xbox One', 'Accessories', 'Fa..."
...,...,...,...,...,...,...
121815,DANVILLE SKY,B014RXTSDK,[],['Disney Infinity Series 3 Power Disc Danville...,"{'Best Sellers Rank': {'Video Games': 105422, ...","['Video Games', 'Legacy Systems', 'Nintendo Sy..."
121816,Ci-Yu-Online Charizard Black #1 Limited Editio...,B07JDT455V,[],[],{'Pricing': 'The strikethrough price is the Li...,"['Video Games', 'Legacy Systems', 'Nintendo Sy..."
121817,Story of Seasons: Pioneers Of Olive Town (Nint...,B09XQJS4CZ,['A wild world of discovery - tame the wildern...,"['Product Description', ""Inspired by Tales of ...","{'Release date': 'March 26, 2021', 'Best Selle...","['Video Games', 'Nintendo Switch', 'Games']"
121818,MotoGP 18 (PC DVD) UK IMPORT REGION FREE,B07DGPTGNV,['Brand new game engine - MotoGP18 has been re...,['Become the champion of the 2018 MotoGP Seaso...,{'Pricing': 'The strikethrough price is the Li...,"['Video Games', 'Game Genre of the Month']"


Use TF-IDF or BERT Embeddings... (Embeddings would be better as descriptions may not contain similar words)

Good luck :)