## Collaborative Filtering

### Loading Datasets

The reviews and items are from the output of the `preprocessing.ipynb` file

In [2]:
import pandas as pd

reviews = pd.read_csv('../datasets/slimmed/reviews.csv')
items = pd.read_csv('../datasets/slimmed/items.csv')

Helper function to get title of item from its id (parent_asin)

In [3]:
def get_item_name_from_id(parent_asin):
	return items[items['parent_asin'] == parent_asin]['title'].unique()[0]

### Creating Sparse Matrix

The size of the user-item matrix would be too large to fit in memory and would contain many zero values anyway

In [4]:
num_user_ids, num_item_ids = reviews['user_id'].nunique(), items['parent_asin'].nunique()
format(num_user_ids, ','), format(num_item_ids, ','), format(num_user_ids * num_item_ids, ',')

('2,282,093', '121,820', '278,004,569,260')

The sparse matrix and mappings from `user_item_matrix` are imported into `uim`

In [5]:
import nbformat

# Load the notebook
with open('user_item_matrix.ipynb', 'r', encoding='utf-8') as f:
	nb = nbformat.read(f, as_version=4)

# Execute all code cells and store data in the uim dict
uim = {}
for cell in nb.cells:
	if cell.cell_type == 'code':
		exec(cell.source, uim)

### ALS Model (Alternating Least Squares)

`implicit` library already uses multithreading so `BLAS` threads should be set to 1 to avoid overhead

In [6]:
import threadpoolctl 
threadpoolctl.threadpool_limits(1, 'blas')

<threadpoolctl.threadpool_limits at 0x2e3a2af9100>

#### Transforming CSR Ratings To Confidence

A core issue here is that implicit's ALS model works with implicit feedback and not explicit ones such as rating 

In [7]:
import numpy as np

# Extract components
data = uim['sparse_matrix_csr'].data
indices = uim['sparse_matrix_csr'].indices
indptr = uim['sparse_matrix_csr'].indptr

# Compute per-user mean ratings
n_users = uim['sparse_matrix_csr'].shape[0]

user_means = np.zeros(n_users)
max_user_ratings = np.zeros(n_users)

for user in range(n_users):
	start, end = indptr[user], indptr[user + 1]
	user_ratings = data[start:end]

	if len(user_ratings) > 0:
		user_means[user] = np.mean(user_ratings)
		max_user_ratings[user] = np.max(user_ratings)
	else:
		user_means[user] = 0.0
		max_user_ratings[user] = 1

`ALPHA` is scaling factor that determines how strongly the higher ratings are trusted over low ones.

In [8]:
ALPHA = 80

The following strategy is proposed for handling this<br><br>
For every item with mean user rating $\mu_u$<br>
o If an item rating is less than $\mu_u$, then it is set to 0 (considered as not seen)<br>
o Otherwise, it is scaled to a value in the range [1, 5] using min-max normalization of min=$\mu_u$ and max=max_user_rating

In [9]:
new_data = data.copy()

for user in range(n_users):
	start, end = indptr[user], indptr[user + 1]
	for i in range(start, end):
		rating = data[i]
		mean = user_means[user]
		max_rating = max_user_ratings[user]

		if rating < mean:
			new_data[i] = 0  # no confidence
		else:
			# # If user only gave ratings of 5, then it can be considered as the "neutral" rating
			# if mean == 5:
			#     conf = 3
			# # Linear map from [mean, 5] to [1, 5]
			# else:
			#     conf = (rating - mean) / (5 - mean) * 4 + 1

			s = 0
			if max_rating == mean:
				s = 1.0
			else:
				s = (rating - mean) / (max_rating - mean)

			new_data[i] = 1 + ALPHA * s

In [10]:
from scipy.sparse import csr_matrix
confidence_csr = csr_matrix((new_data, indices, indptr), shape=uim['sparse_matrix_csr'].shape)

In [11]:
confidence_csr.eliminate_zeros()

The ALS model is trained

In [12]:
from implicit.als import AlternatingLeastSquares

# Train ALS model
als_model = AlternatingLeastSquares(factors=200, iterations=15, regularization=0.1, random_state=42, calculate_training_loss=True)
als_model.fit(confidence_csr)

  0%|          | 0/15 [00:00<?, ?it/s]

#### Saving ALS Model

In [13]:
import pickle
import gzip

# Save to a pickle file
with gzip.open('../data_structures/als_model.pkl', 'wb', compresslevel=5) as f:
	pickle.dump(als_model, f)

#### Loading ALS Model

In [14]:
import pickle
import gzip

from typing import cast
from implicit.cpu.als import AlternatingLeastSquares

# Load the compressed file
with gzip.open('../data_structures/als_model.pkl', 'rb') as f:
	als_model = cast(AlternatingLeastSquares, pickle.load(f))

#### Predicting User Ratings

A test run where the top 5 items are recommended for user with id from the map

In [15]:
user_id = 2  # Target user
num_recommendations = 15  # How many items to recommend

# Get top N recommended items and their scores
recommended_items = als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], N=num_recommendations
)

recommendations, scores = recommended_items
recommendations_scores = zip(recommendations, scores)

print(f'Top {num_recommendations} recommended items for User {uim['reverse_user_map'][user_id]}:')
for item_id, score in recommendations_scores:
	print(f'Item {uim['reverse_item_map'][item_id]} - Score: {score:.4f}')

Top 15 recommended items for User AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q:
Item B017QU5G1O - Score: 0.3404
Item B01N4JYY1H - Score: 0.2889
Item B0CB7JB1V6 - Score: 0.2749
Item B07KRWJCQW - Score: 0.2592
Item B00CMQTUSS - Score: 0.2528
Item B00HUW901Q - Score: 0.2527
Item B01MRN26ES - Score: 0.2478
Item B08LMKNSPL - Score: 0.2464
Item B07ZJ6RY1W - Score: 0.2447
Item B08GPBG66Y - Score: 0.2441
Item B017V6YVDC - Score: 0.2425
Item B000N5Z2L4 - Score: 0.2401
Item B01N6QKT7H - Score: 0.2318
Item B087NNPYP3 - Score: 0.2312
Item B08Q1MFHT3 - Score: 0.2302


In [16]:
already_rated_user_items = reviews[reviews['user_id'] == uim['reverse_user_map'][user_id]][['title', 'parent_asin', 'text', 'rating']]
already_rated_user_items[['parent_asin', 'rating']]

Unnamed: 0,parent_asin,rating
3,B0BCHWZX95,5
4,B00HUWA45W,5


In [17]:
items[items['parent_asin'].isin(already_rated_user_items['parent_asin'])][['title']]

Unnamed: 0,title
4085,PowerA Enhanced Wireless Controller for Ninten...
18048,KontrolFreek FPS Freek CQC Signature - Xbox One


In [18]:
als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], items=[3, 4], filter_already_liked_items=False
)

(array([3, 4]), array([0.866863  , 0.22684452], dtype=float32))

Those confidence scores in the items are now converted back to user ratings

In [19]:
user_id, uim['reverse_user_map'][user_id], user_means[user_id], max_user_ratings[user_id]

(2, 'AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q', np.float64(5.0), np.float64(5.0))

A helper function to convert confidence scores to predicted ratings

In [20]:
def confidence_to_predicted_rating(user_id, confidences):
    mean = user_means[user_id]
    max_rating = max_user_ratings[user_id]

    s = (confidences - 1) / ALPHA
    return mean + s * (max_rating - mean)

The model correctly predicted the user's ratings on items they'd seen before

In [21]:
confidence_to_predicted_rating(user_id, als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], items=[3, 4], filter_already_liked_items=False
)[1])

array([5., 5.])

The predicted ratings that the user would give to the recommended items

In [22]:
list(zip(recommendations, confidence_to_predicted_rating(user_id, recommendations)))

[(np.int32(587), np.float64(5.0)),
 (np.int32(936), np.float64(5.0)),
 (np.int32(335), np.float64(5.0)),
 (np.int32(634), np.float64(5.0)),
 (np.int32(5761), np.float64(5.0)),
 (np.int32(10963), np.float64(5.0)),
 (np.int32(805), np.float64(5.0)),
 (np.int32(8802), np.float64(5.0)),
 (np.int32(187), np.float64(5.0)),
 (np.int32(1706), np.float64(5.0)),
 (np.int32(9), np.float64(5.0)),
 (np.int32(738), np.float64(5.0)),
 (np.int32(284), np.float64(5.0)),
 (np.int32(822), np.float64(5.0)),
 (np.int32(1951), np.float64(5.0))]

The names of the recommended items

In [23]:
list(map(lambda i: get_item_name_from_id(uim['reverse_item_map'][i]), recommendations))

['PDP Gaming Energizer Dual Controller Charging System, Two Rechargeable Battery Packs: Black - Xbox One',
 'Xbox Wireless Controller – Red',
 'ASTARRY Wireless Pro Controller Compatible with Switch Lite/Switch OLED (Red)',
 '$40 Xbox Gift Card [Digital Code]',
 'Xbox One Wireless Controller (Without 3.5 millimeter headset jack)',
 'KontrolFreek FPS Freek Phantom for Xbox One Controller | Performance Thumbsticks | 2 High-Rise Concave | White',
 'Rii RK100+ Multiple Color Rainbow LED Backlit Large Size USB Wired Mechanical Feeling Multimedia PC Gaming Keyboard,Office Keyboard for Working or Primer Gaming,Office Device',
 'KontrolFreek Alpha for Xbox One and Xbox Series X Controller | Performance Thumbsticks | 2 Low-Rise Concave | Green (Blue)',
 '$45 Nintendo eShop Gift Card [Digital Code]',
 'havit RGB Gaming Mouse Pad Soft Non-Slip Rubber Base Mouse Mat for Laptop Computer PC Games (13.8 X 9.8 X 0.16 inches, Black)',
 'PlayStation Plus: 1 Month Membership [Digital Code]',
 'Xbox Live 

### Finding Similar Users

In [24]:
num_similar = 10  # How many similar items to find
top_similar_users = als_model.similar_users(user_id, N=num_similar+1)

similar_users, scores = top_similar_users
similar_users_scores = list(zip(similar_users[1:], scores[1:]))

print(f'Top {num_similar} users similar to User {uim['reverse_user_map'][user_id]}:')
for sim_user_id, similarity in similar_users_scores[1:]:
	print(f'User {uim['reverse_user_map'][sim_user_id]} - Similarity Score: {similarity:.4f}')

Top 10 users similar to User AFTC6ZR5IKNRDG5JCPVNVMU3XV2Q:
User AFDLCLJLGZXA2XJ45DONGSVO4XXQ - Similarity Score: 0.8034
User AGTDQUFRFYACW5KROJR6ORNSLJIA - Similarity Score: 0.8034
User AFNFYEEHTMG4DCD7O2Y5YHUAYJHQ - Similarity Score: 0.8034
User AE4JBASSCLATBR3EW5F2DZOI2H6A - Similarity Score: 0.8034
User AEKBLLHCHN2SUABPQA37IQVPPAHA - Similarity Score: 0.8034
User AGPYL2VFOE3ECPEM7MLLMPWPPWIQ - Similarity Score: 0.8034
User AGOY3BLZLETYUBBYEQXOD4F74OOA - Similarity Score: 0.8034
User AHAFOJR2DG35UZR36IGJPUKYOP6Q - Similarity Score: 0.8034
User AEFUC4ESBD2SEXMZV32XWVEGN57A - Similarity Score: 0.8034


#### Finding Similar Items

In [25]:
item_id = 12  # Target item
num_similar = 10  # How many similar items to find

# Get top N similar items and their similarity scores (+1 is added to skip the item itself later on)
top_similar_items = als_model.similar_items(item_id, N=num_similar+1)

similar_items, scores = top_similar_items
similar_items_scores = list(zip(similar_items, scores))

print(f'Top {num_similar} items similar to Item {uim['reverse_item_map'][item_id]}:')
for sim_item_id, similarity in similar_items_scores[1:]:
	print(f'Item {uim['reverse_item_map'][sim_item_id]} - Similarity Score: {similarity:.4f}')

Top 10 items similar to Item B09JY72CNG:
Item B07RVMNWDM - Similarity Score: 0.6381
Item B07DHNX18W - Similarity Score: 0.5705
Item B07P6T66XS - Similarity Score: 0.5275
Item B087ZRQ44S - Similarity Score: 0.4865
Item B07NTT87J9 - Similarity Score: 0.4800
Item B08HC6N4KY - Similarity Score: 0.4533
Item B07DHNZ676 - Similarity Score: 0.4393
Item B093XLM8YX - Similarity Score: 0.4284
Item B07FFCN3DJ - Similarity Score: 0.4235
Item B08LYVLM84 - Similarity Score: 0.4176


Very good and relevant recommendations for the given item (first in the list)

In [37]:
items[items['parent_asin'] == uim['reverse_item_map'][item_id]]

Unnamed: 0,title,features,description,videos,details,images,parent_asin,categories,average_rating,rating_number,main_category,store,price
410,Razer Goliathus Extended Chroma Gaming Mouse P...,['Ultimate Personalization & Gaming Immersion ...,['The Razer Goliathus extended Chroma soft gam...,[{'title': 'My 3 Biggest Thoughts On This Gami...,{'Product Dimensions': '11.58 x 1.23 x 0.12 in...,[{'thumb': 'https://m.media-amazon.com/images/...,B09JY72CNG,"['Video Games', 'PC', 'Accessories', 'Gaming M...",4.8,11426,Computers,Razer,59.99


In [26]:
list(map(get_item_name_from_id, map(lambda x: uim['reverse_item_map'][x], [item_id, *similar_items[1:]])))

['Razer Goliathus Extended Chroma Gaming Mouse Pad: Customizable Chroma RGB Lighting - Soft, Cloth Material - Balanced Control & Speed - Non-Slip Rubber Base - Mercury White',
 'Razer Base Station Chroma Headphone/Headset Stand w/USB Hub: Chroma RGB Lighting - 3X USB 3.0 Ports - Non-Slip Rubber Base - Designed for Gaming Headsets - Mercury White',
 'Razer Huntsman Elite Gaming Keyboard: Fast Keyboard Switches - Clicky Optical Switches - Chroma RGB Lighting - Magnetic Plush Wrist Rest - Dedicated Media Keys & Dial - Classic Black',
 'Razer Lancehead Wireless Gaming Mouse: 16K DPI Optical Sensor - Chroma RGB Lighting - 9 Programmable Buttons - Mechanical Switches - 50Hr Battery - Gunmetal',
 'Razer Firefly Hard V2 RGB Gaming Mouse Pad: Customizable Chroma Lighting, Built-in Cable Management, Balanced Control & Speed, Non-Slip Rubber Base',
 'Razer Basilisk Gaming Mouse: 16,000 DPI Optical Sensor - Chroma RGB Lighting - 8 Programmable Buttons - Mechanical Switches - Customizable Scroll Re

### Handling Guests

Guest (vectors) are not in the ALS matrix and so cannot use the `similar_users` & `recommend_items` above directly but this can be handled<br>

In [27]:
guest_vector = ['B07KRWJCQW', 'B07ZJ6RY1W', 'B07JGVX9D6', 'B075YBBQMM', 'B0BN942894', 'B077GG9D5D', 'B00ZQB28XK', 'B014R4KYMS', 'B07YBXFF5C']
mapped_guest_vector = [uim['item_map'][item_id] for item_id in guest_vector]

mapped_guest_vector

[634, 187, 83187, 100, 601, 225, 1020, 323, 573]

In [28]:
[get_item_name_from_id(parent_asin) for parent_asin in guest_vector]

['$40 Xbox Gift Card [Digital Code]',
 '$45 Nintendo eShop Gift Card [Digital Code]',
 'Microsoft Xbox One X 1 TB with Red Dead Redemption 2',
 'PS4 Controller Charger Dock Station, OIVO PS4 Controller Charging Dock Station with Upgraded 1.8-Hours Charging Chip, Charging Dock Station Replacement for Playstation 4 Dualshock 4 Controller Charger',
 'BENGOO Stereo Pro Gaming Headset for PS4, PC, Xbox One Controller, Noise Cancelling Over Ear Headphones with Mic, LED Light, Bass Surround, Soft Memory Earmuffs for Laptop Mac Wii Accessory Kits',
 'DualShock 4 Wireless Controller for PlayStation 4 - Jet Black',
 "No Man's Sky - PlayStation 4",
 "Uncharted 4: A Thief's End - PlayStation 4",
 'Doom - PC']

`similar_items` only needs item ids (similar_items also includes the given item so N+1 similar items must be generated)

In [29]:
personalized_items = als_model.similar_items(mapped_guest_vector, N=10+1)

recommend_items, scores = personalized_items
similar_items = list(zip(recommend_items, scores))

similar_items[0] # An example print of similar items for 'B07KRWJCQW'

(array([  634, 18137, 10201, 10203,  6917, 32427, 14639, 27012,  5448,
          842, 31988], dtype=int32),
 array([0.9999999 , 0.36045963, 0.29659322, 0.2835863 , 0.28153694,
        0.28103322, 0.28006414, 0.2779255 , 0.27625108, 0.27396494,
        0.2709237 ], dtype=float32))

In [30]:
for idx, parent_asin in enumerate(guest_vector):
    print(f'For {get_item_name_from_id(parent_asin)}')
    print(f'The similar items are {[get_item_name_from_id(uim['reverse_item_map'][parent_asin]) for parent_asin in similar_items[idx][0][1:10]]}')
    print('----------')

For $40 Xbox Gift Card [Digital Code]
The similar items are ['Xbox $5 Gift Card - Xbox 360 Digital Code', 'PowerA Complete Power Station', "WB Games Rocket League: Collector's Edition - Xbox One", 'UL SADES SA807 Multi-Platform Gaming Headsets Headphones For New Xbox one PS4 PC Laptop Mac iPad iPod (Black&Blue)', 'Sour Patch Kids Limited Edition Just Blue', 'Wireless Controller for Nintendo Switch Pro - Sinbadteck Remote Controller for Nintendo Switch Switch pro Console w 6 Axis Gyro Dual Vibration', 'MLB The Show 16 - MVP Edition - PS4 [Digital Code]', 'Xbox LIVE 4000 Microsoft Points - Xbox 360 Digital Code', '$20 PlayStation Store Gift Card [Digital Code]']
----------
For $45 Nintendo eShop Gift Card [Digital Code]
The similar items are ['Nintendo Switch Online 12-Month Individual Membership [Digital Code]', 'Rocket League: Rocket League Crate Unlock Key X5 - PS4 [Digital Code]', 'Apex Legends - 1,000 Coins Virtual Currency - PS4 [Digital Code]', 'NCSoft NCoin 1600 [Online Game Code

In [31]:
get_item_name_from_id(guest_vector[0]), [get_item_name_from_id(uim['reverse_item_map'][parent_asin]) for parent_asin in similar_items[0][0][1:10]]

('$40 Xbox Gift Card [Digital Code]',
 ['Xbox $5 Gift Card - Xbox 360 Digital Code',
  'PowerA Complete Power Station',
  "WB Games Rocket League: Collector's Edition - Xbox One",
  'UL SADES SA807 Multi-Platform Gaming Headsets Headphones For New Xbox one PS4 PC Laptop Mac iPad iPod (Black&Blue)',
  'Sour Patch Kids Limited Edition Just Blue',
  'Wireless Controller for Nintendo Switch Pro - Sinbadteck Remote Controller for Nintendo Switch Switch pro Console w 6 Axis Gyro Dual Vibration',
  'MLB The Show 16 - MVP Edition - PS4 [Digital Code]',
  'Xbox LIVE 4000 Microsoft Points - Xbox 360 Digital Code',
  '$20 PlayStation Store Gift Card [Digital Code]'])

### Text Features

A more powerful recommendation system can be built using the other features in the `items` dataset

In [32]:
items[['title', 'parent_asin', 'features', 'description', 'details', 'categories']]

Unnamed: 0,title,parent_asin,features,description,details,categories
0,Phantasmagoria: A Puzzle of Flesh,B00069EVOG,['Windows 95'],[],"{'Best Sellers Rank': {'Video Games': 137612, ...","['Video Games', 'PC', 'Games']"
1,NBA 2K17 - Early Tip Off Edition - PlayStation 4,B00Z9TLVK0,['The #1 rated NBA video game simulation serie...,['Following the record-breaking launch of NBA ...,"{'Release date': 'September 16, 2016', 'Best S...","['Video Games', 'PlayStation 4', 'Games']"
2,Nintendo Selects: The Legend of Zelda Ocarina ...,B07SZJZV88,['Authentic Nintendo Selects: The Legend of Ze...,[],"{'Best Sellers Rank': {'Video Games': 51019, '...","['Video Games', 'Legacy Systems', 'Nintendo Sy..."
3,"Spongebob Squarepants, Vol. 1",B0001ZNU56,['Bubblestand: SpongeBob shows Patrick and Squ...,['Now you can watch the wild underwater antics...,"{'Release date': 'August 15, 2004', 'Best Sell...","['Video Games', 'Legacy Systems', 'Nintendo Sy..."
4,eXtremeRate Soft Touch Top Shell Front Housing...,B07H93H878,['Compatibility Models: Ultra fits for Xbox On...,[],"{'Best Sellers Rank': {'Video Games': 48130, '...","['Video Games', 'Xbox One', 'Accessories', 'Fa..."
...,...,...,...,...,...,...
121815,DANVILLE SKY,B014RXTSDK,[],['Disney Infinity Series 3 Power Disc Danville...,"{'Best Sellers Rank': {'Video Games': 105422, ...","['Video Games', 'Legacy Systems', 'Nintendo Sy..."
121816,Ci-Yu-Online Charizard Black #1 Limited Editio...,B07JDT455V,[],[],{'Pricing': 'The strikethrough price is the Li...,"['Video Games', 'Legacy Systems', 'Nintendo Sy..."
121817,Story of Seasons: Pioneers Of Olive Town (Nint...,B09XQJS4CZ,['A wild world of discovery - tame the wildern...,"['Product Description', ""Inspired by Tales of ...","{'Release date': 'March 26, 2021', 'Best Selle...","['Video Games', 'Nintendo Switch', 'Games']"
121818,MotoGP 18 (PC DVD) UK IMPORT REGION FREE,B07DGPTGNV,['Brand new game engine - MotoGP18 has been re...,['Become the champion of the 2018 MotoGP Seaso...,{'Pricing': 'The strikethrough price is the Li...,"['Video Games', 'Game Genre of the Month']"


Use TF-IDF or BERT Embeddings... (Embeddings would be better as descriptions may not contain similar words)

Good luck :)