## Collaborative Filtering

### Loading Datasets

The reviews and items are from the output of the `preprocessing.ipynb` file

In [1]:
import pandas as pd

reviews = pd.read_csv('../datasets/slimmed/reviews.csv')
items = pd.read_csv('../datasets/slimmed/items.csv')

Helper function to get title of item from its id (parent_asin)

In [2]:
def get_item_name_from_id(parent_asin):
	return items[items['parent_asin'] == parent_asin]['title'].unique()[0]

### Creating Sparse Matrix

The size of the user-item matrix would be too large to fit in memory and would contain many zero values anyway

In [3]:
num_user_ids, num_item_ids = reviews['user_id'].nunique(), items['parent_asin'].nunique()
format(num_user_ids, ','), format(num_item_ids, ','), format(num_user_ids * num_item_ids, ',')

('2,282,093', '121,820', '278,004,569,260')

The sparse matrix and mappings from `user_item_matrix` are imported into `uim`

In [4]:
import nbformat

# Load the notebook
with open('user_item_matrix.ipynb', 'r', encoding='utf-8') as f:
	nb = nbformat.read(f, as_version=4)

# Execute all code cells and store data in the uim dict
uim = {}
for cell in nb.cells:
	if cell.cell_type == 'code':
		exec(cell.source, uim)

99.998734


### ALS Model (Alternating Least Squares)

`implicit` library already uses multithreading so `BLAS` threads should be set to 1 to avoid overhead

In [5]:
import threadpoolctl 
threadpoolctl.threadpool_limits(1, 'blas')

<threadpoolctl.threadpool_limits at 0x298d7c2ec60>

#### Transforming CSR Ratings To Confidence

A core issue here is that implicit's ALS model works with implicit feedback and not explicit ones such as rating 

In [6]:
import numpy as np

# Extract components
data = uim['sparse_matrix_csr'].data
indices = uim['sparse_matrix_csr'].indices
indptr = uim['sparse_matrix_csr'].indptr

# Compute per-user mean ratings
n_users = uim['sparse_matrix_csr'].shape[0]

user_means = np.zeros(n_users)
max_user_ratings = np.zeros(n_users)

for user in range(n_users):
	start, end = indptr[user], indptr[user + 1]
	user_ratings = data[start:end]

	if len(user_ratings) > 0:
		user_means[user] = np.mean(user_ratings)
		max_user_ratings[user] = np.max(user_ratings)
	else:
		user_means[user] = 0.0
		max_user_ratings[user] = 1

`ALPHA` is scaling factor that determines how strongly the higher ratings are trusted over low ones.

In [7]:
ALPHA = 80

The following strategy is proposed for handling this<br><br>
For every item with mean user rating $\mu_u$<br>
o If an item rating is less than $\mu_u$, then it is set to 0 (considered as not seen)<br>
o Otherwise, it is scaled to a value in the range [1, 5] using min-max normalization of min=$\mu_u$ and max=max_user_rating

In [8]:
new_data = data.copy()

for user in range(n_users):
	start, end = indptr[user], indptr[user + 1]
	for i in range(start, end):
		rating = data[i]
		mean = user_means[user]
		max_rating = max_user_ratings[user]

		if rating < mean:
			new_data[i] = 0  # no confidence
		else:
			# # If user only gave ratings of 5, then it can be considered as the "neutral" rating
			# if mean == 5:
			#     conf = 3
			# # Linear map from [mean, 5] to [1, 5]
			# else:
			#     conf = (rating - mean) / (5 - mean) * 4 + 1

			s = 0
			if max_rating == mean:
				s = 1.0
			else:
				s = (rating - mean) / (max_rating - mean)

			new_data[i] = 1 + ALPHA * s

In [9]:
from scipy.sparse import csr_matrix
confidence_csr = csr_matrix((new_data, indices, indptr), shape=uim['sparse_matrix_csr'].shape)

In [10]:
confidence_csr.eliminate_zeros()

The ALS model is trained

In [11]:
from implicit.als import AlternatingLeastSquares

# Train ALS model
als_model = AlternatingLeastSquares(factors=200, iterations=15, regularization=0.1, random_state=42, calculate_training_loss=True)
als_model.fit(confidence_csr)

  0%|          | 0/15 [00:00<?, ?it/s]

#### Saving ALS Model

In [12]:
import pickle
import gzip

# Save to a pickle file
with gzip.open('../data_structures/als_model.pkl', 'wb', compresslevel=5) as f:
	pickle.dump(als_model, f)

#### Loading ALS Model

In [13]:
import pickle
import gzip

from typing import cast
from implicit.cpu.als import AlternatingLeastSquares

# Load the compressed file
with gzip.open('../data_structures/als_model.pkl', 'rb') as f:
	als_model = cast(AlternatingLeastSquares, pickle.load(f))

#### Predicting User Ratings

A test run where the top 5 items are recommended for user with id from the map

In [14]:
user_id = 2  # Target user
num_recommendations = 15  # How many items to recommend

# Get top N recommended items and their scores
recommended_items = als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], N=num_recommendations
)

recommendations, scores = recommended_items
recommendations_scores = zip(recommendations, scores)

print(f'Top {num_recommendations} recommended items for User {uim['reverse_user_map'][user_id]}:')
for item_id, score in recommendations_scores:
	print(f'Item {uim['reverse_item_map'][item_id]} - Score: {score:.4f}')

Top 15 recommended items for User AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q:
Item B00KSVXSZU - Score: 0.3600
Item B00DBDPOZ4 - Score: 0.3291
Item B01GW8VG7O - Score: 0.2962
Item B00EQNP8F4 - Score: 0.2639
Item B00DB2BI8M - Score: 0.2566
Item B08LMKNSPL - Score: 0.2524
Item B01JS3F79I - Score: 0.2338
Item B00KSRV19E - Score: 0.2279
Item B07K3KHFSY - Score: 0.2209
Item B00MMN7W9A - Score: 0.2192
Item B00LMHX16K - Score: 0.2181
Item B07FSYMKPD - Score: 0.2171
Item B082W522QZ - Score: 0.2170
Item B00ZDNNRB8 - Score: 0.2161
Item B00ASKNT3W - Score: 0.2153


In [15]:
already_rated_user_items = reviews[reviews['user_id'] == uim['reverse_user_map'][user_id]][['title', 'parent_asin', 'text', 'rating']]
already_rated_user_items[['parent_asin', 'rating']]

Unnamed: 0,parent_asin,rating
3,B0BCHWZX95,5
4,B00HUWA45W,5


In [16]:
items[items['parent_asin'].isin(already_rated_user_items['parent_asin'])][['title']]

Unnamed: 0,title
4085,PowerA Enhanced Wireless Controller for Ninten...
18048,KontrolFreek FPS Freek CQC Signature - Xbox One


In [17]:
als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], items=[3, 4], filter_already_liked_items=False
)

(array([4, 3]), array([0.07769904, 0.00035912], dtype=float32))

Those confidence scores in the items are now converted back to user ratings

In [18]:
user_id, uim['reverse_user_map'][user_id], user_means[user_id], max_user_ratings[user_id]

(2, 'AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q', np.float64(5.0), np.float64(5.0))

A helper function to convert confidence scores to predicted ratings

In [19]:
def confidence_to_predicted_rating(user_id, confidences):
    mean = user_means[user_id]
    max_rating = max_user_ratings[user_id]

    s = (confidences - 1) / ALPHA
    return mean + s * (max_rating - mean)

The model correctly predicted the user's ratings on items they'd seen before

In [20]:
confidence_to_predicted_rating(user_id, als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], items=[3, 4], filter_already_liked_items=False
)[1])

array([5., 5.])

The predicted ratings that the user would give to the recommended items

In [21]:
list(zip(recommendations, confidence_to_predicted_rating(user_id, recommendations)))

[(np.int32(78352), np.float64(5.0)),
 (np.int32(104562), np.float64(5.0)),
 (np.int32(68874), np.float64(5.0)),
 (np.int32(40735), np.float64(5.0)),
 (np.int32(8938), np.float64(5.0)),
 (np.int32(968), np.float64(5.0)),
 (np.int32(91094), np.float64(5.0)),
 (np.int32(10450), np.float64(5.0)),
 (np.int32(28605), np.float64(5.0)),
 (np.int32(66300), np.float64(5.0)),
 (np.int32(44264), np.float64(5.0)),
 (np.int32(16727), np.float64(5.0)),
 (np.int32(22046), np.float64(5.0)),
 (np.int32(1645), np.float64(5.0)),
 (np.int32(56938), np.float64(5.0))]

The names of the recommended items

In [22]:
list(map(lambda i: get_item_name_from_id(uim['reverse_item_map'][i]), recommendations))

['Turtle Beach - Ear Force XO One Amplified Gaming Headset and Headset Audio Controller- Xbox One',
 'Xbox One Play and Charge Kit',
 'Xbox One X 1TB Limited Edition Console - Project Scorpio Edition [Discontinued]',
 'Microsoft Xbox LIVE 12 Month Gold Membership (Physical Card)',
 'Rocksmith 2014 Edition - PC/Mac (Cable Included)',
 'KontrolFreek Alpha for Xbox One and Xbox Series X Controller | Performance Thumbsticks | 2 Low-Rise Concave | Green (Blue)',
 'Turtle Beach - Stealth 520 Premium Fully Wireless Gaming Headset \xa0PS4 Pro PS4 & PS3 (Discontinued by Manufacturer)',
 'Forza Horizon 2 for Xbox One',
 'Nyko Charge Base - 2 Port Controller Charger with 2 USB Charge Adapters for PlayStation 3',
 'KontrolFreek CQCX Performance Thumbsticks for Xbox One 2 Mid-Rise Convex Thumb Grips Black',
 'KontrolFreek FPS Freek CQC for Xbox One Controller | Performance Thumbsticks | 2 Mid-Rise Concave | Black',
 'HyperX Cloud Pro Gaming Headset - Silver - with in-Line Audio Control for PS4, Xbo

### Finding Similar Users

In [23]:
num_similar = 10  # How many similar items to find
top_similar_users = als_model.similar_users(user_id, N=num_similar+1)

similar_users, scores = top_similar_users
similar_users_scores = list(zip(similar_users[1:], scores[1:]))

print(f'Top {num_similar} users similar to User {uim['reverse_user_map'][user_id]}:')
for sim_user_id, similarity in similar_users_scores[1:]:
	print(f'User {uim['reverse_user_map'][sim_user_id]} - Similarity Score: {similarity:.4f}')

Top 10 users similar to User AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q:
User AHAFOJR2DG35UZR36IGJPUKYOP6Q - Similarity Score: 0.8024
User AEHYRREFC65LM6WI6PUSKEV7EGSA - Similarity Score: 0.8024
User AFOOTQZFQTKZCTCMF6J2QY4SMBVQ - Similarity Score: 0.8024
User AGPYL2VFOE3ECPEM7MLLMPWPPWIQ - Similarity Score: 0.8024
User AGOY3BLZLETYUBBYEQXOD4F74OOA - Similarity Score: 0.8024
User AFDLCLJLGZXA2XJ45DONGSVO4XXQ - Similarity Score: 0.8024
User AFNFYEEHTMG4DCD7O2Y5YHUAYJHQ - Similarity Score: 0.8024
User AE4JBASSCLATBR3EW5F2DZOI2H6A - Similarity Score: 0.8024
User AFTJ7MJFDEHXWGP6E7VDR624A4WQ - Similarity Score: 0.8024


#### Finding Similar Items

In [47]:
item_id = 1  # Target item
num_similar = 10  # How many similar items to find

# Get top N similar items and their similarity scores (+1 is added to skip the item itself later on)
top_similar_items = als_model.similar_items(item_id, N=num_similar+1)

similar_items, scores = top_similar_items
similar_items_scores = list(zip(similar_items, scores))

print(f'Top {num_similar} items similar to Item {uim['reverse_item_map'][item_id]}:')
for sim_item_id, similarity in similar_items_scores[1:]:
	print(f'Item {uim['reverse_item_map'][sim_item_id]} - Similarity Score: {similarity:.4f}')

Top 10 items similar to Item B00Z9TLVK0:
Item B00Z9TMH1W - Similarity Score: 0.5020
Item B00XWE54NE - Similarity Score: 0.4507
Item B00XWE54CU - Similarity Score: 0.4130
Item B00AY1CT3G - Similarity Score: 0.3753
Item B07GQ8HRFV - Similarity Score: 0.3648
Item B01MG6DORB - Similarity Score: 0.3630
Item B07BT5L9RS - Similarity Score: 0.3551
Item B00ZKZY04M - Similarity Score: 0.3394
Item B00XZDQ6O8 - Similarity Score: 0.3304
Item B00XZ30HSO - Similarity Score: 0.3297


Very good and relevant recommendations for the given item (first in the list)

In [48]:
items[items['parent_asin'] == uim['reverse_item_map'][item_id]]

Unnamed: 0,title,features,description,videos,details,images,parent_asin,categories,average_rating,rating_number,main_category,store,price
1,NBA 2K17 - Early Tip Off Edition - PlayStation 4,['The #1 rated NBA video game simulation serie...,['Following the record-breaking launch of NBA ...,[{'title': 'NBA 2K17 - Kobe: Haters vs Players...,"{'Release date': 'September 16, 2016', 'Best S...",[{'thumb': 'https://m.media-amazon.com/images/...,B00Z9TLVK0,"['Video Games', 'PlayStation 4', 'Games']",4.3,223,Video Games,2K,58.0


In [49]:
list(map(get_item_name_from_id, map(lambda x: uim['reverse_item_map'][x], [item_id, *similar_items[1:]])))

['NBA 2K17 - Early Tip Off Edition - PlayStation 4',
 'Madden NFL 17 - Deluxe Edition - Xbox One',
 'NBA 2K16 : Early Tip-off Edition - Xbox One',
 'NBA 2K16 : Early Tip-off Edition - PlayStation 4',
 'Madden NFL 25 - Playstation 3',
 'Creed: Rise to Glory - PlayStation VR',
 'NBA 2K17 Standard Edition - PlayStation 4',
 'NBA 2K17 - Legends Gold - Xbox One Digital Code',
 'NBA Live 16 - Xbox One Digital Code',
 'WWE 2K16 - Steam PC [Online Game Code]',
 'Turtle Beach Ear Force XO One Gaming Headset (Certified Refurbished)']

### Handling Guests

Guest (vectors) are not in the ALS matrix and so cannot use the `similar_users` & `recommend_items` above directly but this can be handled<br>

In [50]:
guest_vector = ['B07KRWJCQW', 'B07ZJ6RY1W', 'B07JGVX9D6', 'B075YBBQMM', 'B0BN942894', 'B077GG9D5D', 'B00ZQB28XK', 'B014R4KYMS', 'B07YBXFF5C']
mapped_guest_vector = uim['item_map'][uim['item_map']['parent_asin'].isin(guest_vector)].index.tolist()

mapped_guest_vector

[2107, 7305, 12910, 18299, 38594, 45030, 57258, 60129, 90348]

In [51]:
[get_item_name_from_id(parent_asin) for parent_asin in guest_vector]

['$40 Xbox Gift Card [Digital Code]',
 '$45 Nintendo eShop Gift Card [Digital Code]',
 'Microsoft Xbox One X 1 TB with Red Dead Redemption 2',
 'PS4 Controller Charger Dock Station, OIVO PS4 Controller Charging Dock Station with Upgraded 1.8-Hours Charging Chip, Charging Dock Station Replacement for Playstation 4 Dualshock 4 Controller Charger',
 'BENGOO Stereo Pro Gaming Headset for PS4, PC, Xbox One Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Wii Accessory Kits',
 'DualShock 4 Wireless Controller for PlayStation 4 - Jet Black',
 "No Man's Sky - PlayStation 4",
 "Uncharted 4: A Thief's End - PlayStation 4",
 'Doom - PC']

`similar_items` only needs item ids (similar_items also includes the given item so N+1 similar items must be generated)

In [52]:
personalized_items = als_model.similar_items(mapped_guest_vector, N=10+1)

recommend_items, scores = personalized_items
similar_items = list(zip(recommend_items, scores))

similar_items[0] # An example print of similar items for 'B07KRWJCQW'

(array([  2107, 102879,  70320,   9491,  35150,  94542,  32501,  57818,
        101932,  81290,  51247], dtype=int32),
 array([0.9999999 , 0.3593087 , 0.3019174 , 0.30171075, 0.27262   ,
        0.2725433 , 0.27099964, 0.26811698, 0.26728627, 0.26313052,
        0.26081574], dtype=float32))

In [53]:
for idx, parent_asin in enumerate(guest_vector):
    print(f'For {get_item_name_from_id(parent_asin)}')
    print(f'The similar items are {[get_item_name_from_id(uim['reverse_item_map'][parent_asin]) for parent_asin in similar_items[idx][0][1:10]]}')
    print('----------')

For $40 Xbox Gift Card [Digital Code]
The similar items are ['Xbox $5 Gift Card - Xbox 360 Digital Code', 'MLB The Show 16 - MVP Edition - PS4 [Digital Code]', '3-Pack 6FT PS4 Controller Charger Cable for Xbox One Controller,Dualshock 4,PS4 Charging Cord,Nylon Braided Micro USB Data Sync Cable for Xbox One S/X,Playstation 4,PS4 Slim/Pro,Charge and Play Wire', 'Sliq Gaming Pro-Hex Thumb Stick Grips for Xbox Series X|S & Xbox One Controllers', 'PowerA Complete Power Station', 'Grand Theft Auto V - Tiger Shark Cash Card - Xbox One Digital Code', 'Pokemon HeartGold Version - Limited Edition - Nintendo DS', 'RGB Gaming Mouse Pad with 4-Port USB Hub, LED Soft Extended Large Size Mousepad, 16 Color 3 Brightness Mouse Mat, Non-Slip Rubber Base for Desk Laptop Computer PC Games (31.5×11.8x0.16in)', 'YHT Wireless Joy Pad Controller for Switch, Replacement with Redesigned Ergonomic Hand Grip Comfortable Handheld Gamepad Remote(Grey)']
----------
For $45 Nintendo eShop Gift Card [Digital Code]
The

In [54]:
get_item_name_from_id(guest_vector[0]), [get_item_name_from_id(uim['reverse_item_map'][parent_asin]) for parent_asin in similar_items[0][0][1:10]]

('$40 Xbox Gift Card [Digital Code]',
 ['Xbox $5 Gift Card - Xbox 360 Digital Code',
  'MLB The Show 16 - MVP Edition - PS4 [Digital Code]',
  '3-Pack 6FT PS4 Controller Charger Cable for Xbox One Controller,Dualshock 4,PS4 Charging Cord,Nylon Braided Micro USB Data Sync Cable for Xbox One S/X,Playstation 4,PS4 Slim/Pro,Charge and Play Wire',
  'Sliq Gaming Pro-Hex Thumb Stick Grips for Xbox Series X|S & Xbox One Controllers',
  'PowerA Complete Power Station',
  'Grand Theft Auto V - Tiger Shark Cash Card - Xbox One Digital Code',
  'Pokemon HeartGold Version - Limited Edition - Nintendo DS',
  'RGB Gaming Mouse Pad with 4-Port USB Hub, LED Soft Extended Large Size Mousepad, 16 Color 3 Brightness Mouse Mat, Non-Slip Rubber Base for Desk Laptop Computer PC Games (31.5×11.8x0.16in)',
  'YHT Wireless Joy Pad Controller for Switch, Replacement with Redesigned Ergonomic Hand Grip Comfortable Handheld Gamepad Remote(Grey)'])

### Text Features

A more powerful recommendation system can be built using the other features in the `items` dataset

In [55]:
items[['title', 'parent_asin', 'features', 'description', 'details', 'categories']]

Unnamed: 0,title,parent_asin,features,description,details,categories
0,Phantasmagoria: A Puzzle of Flesh,B00069EVOG,['Windows 95'],[],"{'Best Sellers Rank': {'Video Games': 137612, ...","['Video Games', 'PC', 'Games']"
1,NBA 2K17 - Early Tip Off Edition - PlayStation 4,B00Z9TLVK0,['The #1 rated NBA video game simulation serie...,['Following the record-breaking launch of NBA ...,"{'Release date': 'September 16, 2016', 'Best S...","['Video Games', 'PlayStation 4', 'Games']"
2,Nintendo Selects: The Legend of Zelda Ocarina ...,B07SZJZV88,['Authentic Nintendo Selects: The Legend of Ze...,[],"{'Best Sellers Rank': {'Video Games': 51019, '...","['Video Games', 'Legacy Systems', 'Nintendo Sy..."
3,"Spongebob Squarepants, Vol. 1",B0001ZNU56,['Bubblestand: SpongeBob shows Patrick and Squ...,['Now you can watch the wild underwater antics...,"{'Release date': 'August 15, 2004', 'Best Sell...","['Video Games', 'Legacy Systems', 'Nintendo Sy..."
4,eXtremeRate Soft Touch Top Shell Front Housing...,B07H93H878,['Compatibility Models: Ultra fits for Xbox On...,[],"{'Best Sellers Rank': {'Video Games': 48130, '...","['Video Games', 'Xbox One', 'Accessories', 'Fa..."
...,...,...,...,...,...,...
121815,DANVILLE SKY,B014RXTSDK,[],['Disney Infinity Series 3 Power Disc Danville...,"{'Best Sellers Rank': {'Video Games': 105422, ...","['Video Games', 'Legacy Systems', 'Nintendo Sy..."
121816,Ci-Yu-Online Charizard Black #1 Limited Editio...,B07JDT455V,[],[],{'Pricing': 'The strikethrough price is the Li...,"['Video Games', 'Legacy Systems', 'Nintendo Sy..."
121817,Story of Seasons: Pioneers Of Olive Town (Nint...,B09XQJS4CZ,['A wild world of discovery - tame the wildern...,"['Product Description', ""Inspired by Tales of ...","{'Release date': 'March 26, 2021', 'Best Selle...","['Video Games', 'Nintendo Switch', 'Games']"
121818,MotoGP 18 (PC DVD) UK IMPORT REGION FREE,B07DGPTGNV,['Brand new game engine - MotoGP18 has been re...,['Become the champion of the 2018 MotoGP Seaso...,{'Pricing': 'The strikethrough price is the Li...,"['Video Games', 'Game Genre of the Month']"


Use TF-IDF or BERT Embeddings... (Embeddings would be better as descriptions may not contain similar words)

Good luck :)

### Saving Model (OOP)

The model is to be used in the backend but this is not possible without all its dependencies being saved as well.