## Collaborative Filtering

### Loading Datasets

The `reviews` and `items` datasets are the output of the `preprocessing.ipynb` notebook.

In [None]:
import pandas as pd

reviews = pd.read_csv('../datasets/slimmed/reviews.csv')
items = pd.read_csv('../datasets/slimmed/items.csv')

Helper function to get the title of an item from its id (`parent_asin`)

In [None]:
def get_item_name_from_id(parent_asin):
	return items[items['parent_asin'] == parent_asin]['title'].unique()[0]

### Creating Sparse Matrix

A dense user-item matrix would be too large to fit in memory, and almost all of its entries would be zero anyway, so a sparse representation is used instead

In [None]:
num_user_ids, num_item_ids = reviews['user_id'].nunique(), items['parent_asin'].nunique()
format(num_user_ids, ','), format(num_item_ids, ','), format(num_user_ids * num_item_ids, ',')
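To make the memory argument concrete, the sketch below compares the storage a dense matrix would need with that of a CSR matrix. The function names, the example counts and the 4-byte value and index sizes are illustrative assumptions, not figures taken from the dataset.

```python
def dense_bytes(n_users, n_items, bytes_per_value=4):
    """Bytes needed to store every user-item cell explicitly."""
    return n_users * n_items * bytes_per_value


def csr_bytes(n_users, n_ratings, bytes_per_value=4, bytes_per_index=4):
    """Approximate bytes for CSR storage: data + indices + indptr arrays."""
    data = n_ratings * bytes_per_value
    indices = n_ratings * bytes_per_index
    indptr = (n_users + 1) * bytes_per_index
    return data + indices + indptr


# Hypothetical sizes chosen only for illustration
n_users, n_items, n_ratings = 100_000, 50_000, 1_000_000

print(f"dense: {dense_bytes(n_users, n_items) / 1e9:.1f} GB")
print(f"CSR:   {csr_bytes(n_users, n_ratings) / 1e6:.1f} MB")
```

The CSR cost grows with the number of stored ratings rather than with the product of users and items, which is why only the sparse form is practical here.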

The sparse matrix and the id mappings built in `user_item_matrix.ipynb` are loaded into the `uim` dict by executing that notebook's code cells

In [None]:
import nbformat

# Load the notebook
with open('user_item_matrix.ipynb', 'r', encoding='utf-8') as f:
	nb = nbformat.read(f, as_version=4)

# Execute all code cells and store data in the uim dict
uim = {}
for cell in nb.cells:
	if cell.cell_type == 'code':
		exec(cell.source, uim)
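The loop above relies on `exec` writing every top-level name a cell defines into the dict passed as its globals. A minimal illustration, using two made-up cell sources in place of the real notebook:

```python
# Stand-ins for the source of two notebook code cells (hypothetical content)
cell_sources = [
    "user_map = {'u1': 0, 'u2': 1}",
    "reverse_user_map = {idx: uid for uid, idx in user_map.items()}",
]

namespace = {}
for source in cell_sources:
    # Later cells can see names defined by earlier ones, as in a notebook
    exec(source, namespace)

print(namespace["reverse_user_map"][1])
```

Because `exec` runs every code cell, any slow or side-effecting cell in `user_item_matrix.ipynb` is executed again each time this notebook loads it.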

### ALS Model (Alternating Least Squares)

The `implicit` library already uses multithreading, so the number of `BLAS` threads should be limited to 1 to avoid oversubscription overhead

In [None]:
import threadpoolctl 
threadpoolctl.threadpool_limits(1, 'blas')

#### Transforming CSR Ratings To Confidence

A core issue here is that `implicit`'s ALS model expects implicit feedback (confidence values) rather than explicit feedback such as ratings, so the ratings must first be transformed into confidences

In [None]:
import numpy as np

# Extract components
data = uim['sparse_matrix_csr'].data
indices = uim['sparse_matrix_csr'].indices
indptr = uim['sparse_matrix_csr'].indptr

# Compute per-user mean and maximum ratings
n_users = uim['sparse_matrix_csr'].shape[0]

user_means = np.zeros(n_users)
max_user_ratings = np.zeros(n_users)

for user in range(n_users):
	start, end = indptr[user], indptr[user + 1]
	user_ratings = data[start:end]

	if len(user_ratings) > 0:
		user_means[user] = np.mean(user_ratings)
		max_user_ratings[user] = np.max(user_ratings)
	else:
		user_means[user] = 0.0
		max_user_ratings[user] = 1
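The per-user loop can be slow with many users. As an alternative sketch (not the notebook's implementation), the same means and maxima can be computed with `np.add.reduceat` and `np.maximum.reduceat` over the CSR `data` array. `per_user_mean_and_max` is a hypothetical name, and the defaults for users without ratings mirror the loop above.

```python
import numpy as np


def per_user_mean_and_max(data, indptr):
    """Vectorized per-row mean and max of a CSR matrix's stored values."""
    data = np.asarray(data, dtype=float)
    indptr = np.asarray(indptr)
    counts = np.diff(indptr)          # number of ratings per user
    nonempty = counts > 0

    means = np.zeros(len(counts))     # users without ratings keep mean 0
    maxes = np.ones(len(counts))      # ... and max 1, as in the loop version

    if nonempty.any():
        # Row starts of non-empty rows are strictly increasing, so each
        # reduceat segment covers exactly one user's ratings
        starts = indptr[:-1][nonempty]
        means[nonempty] = np.add.reduceat(data, starts) / counts[nonempty]
        maxes[nonempty] = np.maximum.reduceat(data, starts)
    return means, maxes


# Three users: ratings [5, 3], no ratings, ratings [4, 2]
means, maxes = per_user_mean_and_max([5, 3, 4, 2], [0, 2, 2, 4])
print(means, maxes)
```

This assumes a well-formed CSR matrix, where the last entry of `indptr` equals the length of `data`.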

`ALPHA` is a scaling factor that determines how strongly higher ratings are trusted over lower ones.

In [None]:
ALPHA = 80

The following strategy is proposed for handling this<br><br>
For every user $u$ with mean rating $\mu_u$<br>
o If an item rating is less than $\mu_u$, then it is set to 0 (considered as not seen)<br>
o Otherwise, it is scaled to a confidence in the range [1, 1 + `ALPHA`] using min-max normalization with min=$\mu_u$ and max=max_user_rating; if the user's maximum rating equals $\mu_u$ (all ratings identical), the confidence is 1 + `ALPHA`
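A small worked example of this mapping for a single user may help. It is a sketch of the per-rating rule, with `rating_to_confidence` as a hypothetical name and `alpha=80` matching `ALPHA`, so the largest possible confidence is 1 + 80 = 81.

```python
def rating_to_confidence(rating, user_mean, user_max, alpha=80):
    """Map one explicit rating to an implicit-feedback confidence."""
    if rating < user_mean:
        return 0.0                      # treated as "not seen"
    if user_max == user_mean:
        return 1.0 + alpha              # all of the user's ratings are equal
    # Min-max scale the rating to [0, 1] between the user's mean and max
    s = (rating - user_mean) / (user_max - user_mean)
    return 1.0 + alpha * s


# Hypothetical user who rated four items 3, 4, 4 and 5 (mean 4, max 5)
ratings = [3, 4, 4, 5]
mean, top = sum(ratings) / len(ratings), max(ratings)
print([rating_to_confidence(r, mean, top) for r in ratings])
```

A rating exactly at the user's mean gets the minimum non-zero confidence of 1, while the user's highest rating gets 1 + `alpha`.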

In [None]:
new_data = data.copy()

for user in range(n_users):
	start, end = indptr[user], indptr[user + 1]
	for i in range(start, end):
		rating = data[i]
		mean = user_means[user]
		max_rating = max_user_ratings[user]

		if rating < mean:
			new_data[i] = 0  # no confidence
		else:
			# # If user only gave ratings of 5, then it can be considered as the "neutral" rating
			# if mean == 5:
			#     conf = 3
			# # Linear map from [mean, 5] to [1, 5]
			# else:
			#     conf = (rating - mean) / (5 - mean) * 4 + 1

			s = 0
			if max_rating == mean:
				s = 1.0
			else:
				s = (rating - mean) / (max_rating - mean)

			new_data[i] = 1 + ALPHA * s
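The nested Python loop above visits every stored rating one at a time. A vectorized alternative is sketched below, assuming the same rule and the same per-user statistics. `ratings_to_confidence` is a hypothetical name and the inputs are made-up values.

```python
import numpy as np


def ratings_to_confidence(data, indptr, user_means, user_maxes, alpha=80):
    """Vectorized rating -> confidence transform over CSR stored values."""
    data = np.asarray(data, dtype=float)
    # Expand per-user statistics so they align with each stored rating
    rows = np.repeat(np.arange(len(indptr) - 1), np.diff(indptr))
    mean = np.asarray(user_means, dtype=float)[rows]
    span = np.asarray(user_maxes, dtype=float)[rows] - mean

    # Min-max scale; where max == mean every rating gets s = 1
    safe_span = np.where(span > 0, span, 1.0)
    s = np.where(span > 0, (data - mean) / safe_span, 1.0)

    # Ratings below the user's mean carry no confidence
    return np.where(data < mean, 0.0, 1.0 + alpha * s)


# Two hypothetical users: ratings [3, 4, 4, 5] and [5, 5]
conf = ratings_to_confidence(
    data=[3, 4, 4, 5, 5, 5], indptr=[0, 4, 6],
    user_means=[4.0, 5.0], user_maxes=[5.0, 5.0],
)
print(conf)
```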

In [None]:
from scipy.sparse import csr_matrix
confidence_csr = csr_matrix((new_data, indices, indptr), shape=uim['sparse_matrix_csr'].shape)

In [None]:
confidence_csr.eliminate_zeros()

#### Optimizing k Latent Features

It would be best to evaluate the model against the ratings of the users with the most reviews

In [None]:
# Returns indices of top n users who've reviewed the most items
def getFrequentReviewersIdx(n):
    userReviewTotals = reviews.groupby("user_id").size().reset_index(name="total_reviews")
    mostFreqReviewers = userReviewTotals.sort_values(by="total_reviews", ascending=False)[:n]
    return mostFreqReviewers["user_id"].map(uim["user_map"]).values

In [None]:
# Gets the indices and ratings of all items a user has reviewed
def getRatings(user_idx):
    ratings = uim["sparse_matrix_csr"][user_idx, :].toarray().flatten()
    itemIndices = ratings.nonzero()[0]
    ratings = ratings[itemIndices]

    return list(zip(itemIndices, ratings))

In [None]:
def confidence_to_predicted_rating(user_id, confidences):
    mean = user_means[user_id]
    max_rating = max_user_ratings[user_id]

    s = (confidences - 1) / ALPHA
    return mean + s * (max_rating - mean)
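Because this helper inverts the confidence formula `1 + ALPHA * s`, it recovers a rating exactly only when the rating was at or above the user's mean and the user's ratings are not all equal. The sketch below checks that round trip on made-up numbers. `to_confidence` and `to_rating` are hypothetical stand-ins for the notebook's transform and helper.

```python
ALPHA = 80


def to_confidence(rating, mean, max_rating):
    """Forward transform: rating -> confidence."""
    if rating < mean:
        return 0.0
    s = 1.0 if max_rating == mean else (rating - mean) / (max_rating - mean)
    return 1.0 + ALPHA * s


def to_rating(confidence, mean, max_rating):
    """Inverse transform: confidence -> predicted rating."""
    s = (confidence - 1) / ALPHA
    return mean + s * (max_rating - mean)


mean, max_rating = 3.0, 5.0   # hypothetical user statistics

# A rating at or above the mean survives the round trip
kept = to_rating(to_confidence(4.0, mean, max_rating), mean, max_rating)

# A rating below the mean was zeroed, so the original value is lost
lost = to_rating(to_confidence(2.0, mean, max_rating), mean, max_rating)

print(kept, lost)
```

For a user whose ratings are all equal, `max_rating - mean` is 0, so the helper returns the mean whatever the confidence is.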

Our evaluation metric for optimizing k will be RMSE (root mean squared error) between the actual and predicted ratings

In [None]:
from sklearn.metrics import mean_squared_error

# Calculates RMSE of a model on ratings of the most active reviewers
def evalRMSE(als_model, topReviewers):
    user_factors = als_model.user_factors
    item_factors = als_model.item_factors

    # Cumulative arrays containing all users' ratings and predictions
    allRatings = []
    allPredictions = []
    
    for user_index in topReviewers:
        ratedItems, ratings = zip(*getRatings(user_index))
        ratings = list(ratings)
        # Implicit ALS model doesn't have .predict(), so we use dot prod @ between user_factors and item_factors to predict specific ratings (without bias)
        predictedRatingsConf = np.array([user_factors[user_index] @ item_factors[item_index] for item_index in ratedItems])
        # Above calculates confidence (implicit), so we need to convert to rating (explicit)
        predictedRatings = [confidence_to_predicted_rating(user_index, confidence) for confidence in predictedRatingsConf]
        
        allRatings.extend(ratings)
        allPredictions.extend(predictedRatings)


    mse = mean_squared_error(allRatings, allPredictions)
    rmse = np.sqrt(mse)

    print("RMSE")
    print(rmse)
    
    return rmse
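For reference, RMSE itself is a short computation. The sketch below works it through on made-up ratings without `sklearn`; the function name `rmse` is illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up ratings and predictions: squared errors are 1, 0 and 4
print(rmse([5, 3, 4], [4, 3, 2]))
```

Note that `evalRMSE` scores ratings that were also used to fit the model, so it measures how well the model reproduces its training data rather than held-out performance.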

In [None]:
from implicit.als import AlternatingLeastSquares

# Finds best k latent features given confidence matrix and number of reviewers we want to use to evaluate RMSE
def optimizeK(kVals, confidence_csr, num_reviewers):

    topReviewers = getFrequentReviewersIdx(num_reviewers)  
    bestK = None
    bestRMSE = float("inf")

    for k in kVals:
        
        als_model = AlternatingLeastSquares(factors=k, iterations=15, regularization=0.1, random_state=42, calculate_training_loss=True)
        als_model.fit(confidence_csr)

        rmse = evalRMSE(als_model, topReviewers)

        if rmse < bestRMSE:
            bestRMSE = rmse
            bestK = k
            

    return bestK


In [None]:
kVals = [5, 10, 15, 20, 25]
bestK = optimizeK(kVals, confidence_csr, 15)
print(f"The best k value is {bestK}")

The ALS model is then trained with the best k found above

In [None]:
# Train ALS model
als_model = AlternatingLeastSquares(factors=bestK, iterations=15, regularization=0.1, random_state=42, calculate_training_loss=True)
als_model.fit(confidence_csr)

#### Saving ALS Model

In [None]:
import pickle
import gzip

# Save to a pickle file
with gzip.open('../data_structures/als_model.pkl', 'wb', compresslevel=5) as f:
	pickle.dump(als_model, f)

#### Loading ALS Model

In [None]:
import pickle
import gzip

from typing import cast
from implicit.cpu.als import AlternatingLeastSquares

# Load the compressed file
with gzip.open('../data_structures/als_model.pkl', 'rb') as f:
	als_model = cast(AlternatingLeastSquares, pickle.load(f))

#### Evaluating The Model

In [None]:
def precisionRecallK(model, test_users, k):
    precisions = []
    recalls = []

    for user_index in test_users:
        # Ground truth relevant items and indices
        relevantItems = (getRatings(user_index))
        relevantIndices = set([item for item, _ in relevantItems])

        # k recommended items and indices
        recommendedItems = model.recommend(user_index, uim['sparse_matrix_csr'][user_index], N=k, filter_already_liked_items=False)
        recommendations, scores = recommendedItems
        recommendations_scores = zip(recommendations, scores)
        recommendationIndices = set([item_id for item_id, score in recommendations_scores])

        # Relevant items in top k
        overlap = recommendationIndices & relevantIndices

        precision = len(overlap) / k
        recall = len(overlap) / len(relevantIndices)

        precisions.append(precision)
        recalls.append(recall)          

    return np.mean(precisions), np.mean(recalls)
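The per-user computation inside `precisionRecallK` can be checked by hand on a tiny case. The sketch below uses made-up item indices, and `precision_recall_at_k` is a hypothetical name.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: ranked list of item indices; relevant: set of item indices.
    """
    top_k = set(recommended[:k])
    hits = len(top_k & relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 2 of the top 4 recommendations are among 3 relevant items
print(precision_recall_at_k([7, 2, 9, 4, 1], {2, 4, 8}, k=4))
```

In `precisionRecallK`, every item a user has rated counts as relevant regardless of the rating value, and `filter_already_liked_items=False` lets the model recommend those same items back.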

In [None]:
import matplotlib.pyplot as plt

def plotPRK(model, test_users, ks):
    precisions = []
    recalls = []

    for k in ks:
        p, r = precisionRecallK(model, test_users, k)
        precisions.append(p)
        recalls.append(r)

    plt.figure(figsize=(10, 6))
    plt.plot(ks, precisions, label='Precision@k', marker='o')
    plt.plot(ks, recalls, label='Recall@k', marker='x')
    plt.xlabel('k')
    plt.ylabel('Value')
    plt.title('Precision and Recall vs k Recommendations')
    plt.legend()
    plt.grid(True)
    plt.show()


In [None]:
testUsers = getFrequentReviewersIdx(10)
plotPRK(als_model, testUsers, kVals)

In [None]:
def plotF1K(model, test_users, ks):
    f1s = []

    for k in ks:
        precision, recall = precisionRecallK(model, test_users, k)
        if precision + recall == 0:
            f1 = 0
        else:
            f1 = 2 * precision * recall / (precision + recall)
        f1s.append(f1)

    plt.figure(figsize=(10, 6))
    plt.plot(ks, f1s, label='F1 Score@k', marker='s')
    plt.xlabel('k')
    plt.ylabel('F1 Score')
    plt.title('F1 Score vs k Recommendations')
    plt.legend()
    plt.grid(True)
    plt.show()
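`plotF1K` computes F1 from the mean precision and mean recall across users. That is generally not the same number as the average of per-user F1 scores, as this sketch with two made-up users shows (`f1_score` is a local helper, not the `sklearn` function).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up per-user (precision, recall) pairs
users = [(1.0, 0.1), (0.2, 0.2)]

mean_p = sum(p for p, _ in users) / len(users)
mean_r = sum(r for _, r in users) / len(users)

f1_of_means = f1_score(mean_p, mean_r)                        # what plotF1K does
mean_of_f1s = sum(f1_score(p, r) for p, r in users) / len(users)

print(f1_of_means, mean_of_f1s)
```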

In [None]:
plotF1K(als_model, testUsers, kVals)

In [None]:
def plotKRMSE(k_values, rmse_values):
    plt.figure(figsize=(8, 5))
    plt.plot(k_values, rmse_values, marker='o')
    plt.xlabel('k Latent Factors')
    plt.ylabel('RMSE')
    plt.title('RMSE vs. k Latent Features')
    plt.grid(True)
    plt.show()

In [None]:
modelK5 = AlternatingLeastSquares(factors=5, iterations=15, regularization=0.1, random_state=42, calculate_training_loss=True)
modelK5.fit(confidence_csr)

modelK10 = AlternatingLeastSquares(factors=10, iterations=15, regularization=0.1, random_state=42, calculate_training_loss=True)
modelK10.fit(confidence_csr)

modelK15 = AlternatingLeastSquares(factors=15, iterations=15, regularization=0.1, random_state=42, calculate_training_loss=True)
modelK15.fit(confidence_csr)

modelK20 = AlternatingLeastSquares(factors=20, iterations=15, regularization=0.1, random_state=42, calculate_training_loss=True)
modelK20.fit(confidence_csr)

modelK25 = AlternatingLeastSquares(factors=25, iterations=15, regularization=0.1, random_state=42, calculate_training_loss=True)
modelK25.fit(confidence_csr)


models = [modelK5, modelK10, modelK15, modelK20, modelK25]

rmseVals = [evalRMSE(model, testUsers) for model in models]

plotKRMSE(kVals, rmseVals)

#### Predicting User Ratings

A test run where the top `num_recommendations` (15) items are recommended for a user, identified by their index in the user map

In [None]:
user_id = 2  # Target user
num_recommendations = 15  # How many items to recommend

# Get top N recommended items and their scores
recommended_items = als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], N=num_recommendations
)

recommendations, scores = recommended_items
recommendations_scores = zip(recommendations, scores)

# Double-quoted f-strings so the inner single quotes also parse before Python 3.12
print(f"Top {num_recommendations} recommended items for User {uim['reverse_user_map'][user_id]}:")
for item_id, score in recommendations_scores:
	print(f"Item {uim['reverse_item_map'][item_id]} - Score: {score:.4f}")

In [None]:
already_rated_user_items = reviews[reviews['user_id'] == uim['reverse_user_map'][user_id]][['title', 'parent_asin', 'text', 'rating']]
already_rated_user_items[['parent_asin', 'rating']]

In [None]:
items[items['parent_asin'].isin(already_rated_user_items['parent_asin'])][['title']]

In [None]:
als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], items=[3, 4], filter_already_liked_items=False
)

The confidence scores of those items are now converted back to predicted user ratings

In [None]:
user_id, uim['reverse_user_map'][user_id], user_means[user_id], max_user_ratings[user_id]

A helper function to convert confidence scores to predicted ratings (the same helper defined earlier for the RMSE evaluation)

In [None]:
def confidence_to_predicted_rating(user_id, confidences):
    mean = user_means[user_id]
    max_rating = max_user_ratings[user_id]

    s = (confidences - 1) / ALPHA
    return mean + s * (max_rating - mean)

The model correctly predicted the user's ratings on items they'd seen before

In [None]:
confidence_to_predicted_rating(user_id, als_model.recommend(
	user_id, uim['sparse_matrix_csr'][user_id], items=[3, 4], filter_already_liked_items=False
)[1])

The predicted ratings that the user would give to the recommended items

In [None]:
# Pass the model's scores (confidences), not the item indices, to the converter
list(zip(recommendations, confidence_to_predicted_rating(user_id, scores)))

The names of the recommended items

In [None]:
list(map(lambda i: get_item_name_from_id(uim['reverse_item_map'][i]), recommendations))

#### Finding Similar Users

In [None]:
num_similar = 10  # How many similar users to find
# +1 is requested because the target user itself is returned as the top match
top_similar_users = als_model.similar_users(user_id, N=num_similar+1)

similar_users, scores = top_similar_users
# Skip the first entry (the target user itself) exactly once
similar_users_scores = list(zip(similar_users[1:], scores[1:]))

print(f"Top {num_similar} users similar to User {uim['reverse_user_map'][user_id]}:")
for sim_user_id, similarity in similar_users_scores:
	print(f"User {uim['reverse_user_map'][sim_user_id]} - Similarity Score: {similarity:.4f}")

#### Finding Similar Items

In [None]:
item_id = 1  # Target item
num_similar = 10  # How many similar items to find

# Get top N similar items and their similarity scores (+1 is added to skip the item itself later on)
top_similar_items = als_model.similar_items(item_id, N=num_similar+1)

similar_items, scores = top_similar_items
similar_items_scores = list(zip(similar_items, scores))

print(f"Top {num_similar} items similar to Item {uim['reverse_item_map'][item_id]}:")
for sim_item_id, similarity in similar_items_scores[1:]:
	print(f"Item {uim['reverse_item_map'][sim_item_id]} - Similarity Score: {similarity:.4f}")
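As a rough illustration of why the query item comes back first, the sketch below ranks items by cosine similarity between factor vectors. Treating `similar_items` as a cosine ranking is an assumption made for illustration, not a description of the library's internals. `top_similar` and the tiny factor matrix are made up.

```python
import numpy as np


def top_similar(item_factors, item_idx, n):
    """Return the n rows most cosine-similar to row item_idx (itself included)."""
    norms = np.linalg.norm(item_factors, axis=1)
    norms[norms == 0] = 1.0                      # avoid division by zero
    unit = item_factors / norms[:, None]
    sims = unit @ unit[item_idx]
    order = np.argsort(-sims)[:n]
    return order, sims[order]


# Hypothetical 2-dimensional item factors for four items
factors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])

ids, sims = top_similar(factors, item_idx=0, n=3)
print(ids[1:], sims[1:])   # drop the first hit, which is the item itself
```

Any vector has cosine similarity 1 with itself, so requesting one extra result and discarding the first keeps the intended number of neighbours.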

The similar items are highly relevant to the given item (shown first in the list below)

In [None]:
items[items['parent_asin'] == uim['reverse_item_map'][item_id]]

In [None]:
list(map(get_item_name_from_id, map(lambda x: uim['reverse_item_map'][x], [item_id, *similar_items[1:]])))

### Handling Guests

Guests (represented as vectors of item ids) are not in the ALS matrix, so `similar_users` and `recommend` cannot be used for them directly, but they can still be served through `similar_items`

In [None]:
guest_vector = ['B07KRWJCQW', 'B07ZJ6RY1W', 'B07JGVX9D6', 'B075YBBQMM', 'B0BN942894', 'B077GG9D5D', 'B00ZQB28XK', 'B014R4KYMS', 'B07YBXFF5C']
mapped_guest_vector = uim['item_map'][uim['item_map']['parent_asin'].isin(guest_vector)].index.tolist()

mapped_guest_vector
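A boolean `isin` filter returns rows in the order of the frame being filtered, not in the order of `guest_vector`. If the mapped indices need to line up with `guest_vector` by position, one option is an explicit ASIN-to-index lookup. This is a sketch that assumes the reverse mapping can be iterated as index/ASIN pairs; the small dict is a stand-in, not the real `uim['reverse_item_map']`, and `map_asins_in_order` is a hypothetical name.

```python
def map_asins_in_order(asins, reverse_item_map):
    """Map ASINs to matrix indices, keeping input order and skipping unknowns."""
    asin_to_idx = {asin: idx for idx, asin in reverse_item_map.items()}
    return [asin_to_idx[a] for a in asins if a in asin_to_idx]


# Stand-in mapping from item index to ASIN (hypothetical values)
reverse_item_map = {0: "B000AAA", 1: "B000BBB", 2: "B000CCC"}

print(map_asins_in_order(["B000CCC", "B000XXX", "B000AAA"], reverse_item_map))
```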

In [None]:
[get_item_name_from_id(parent_asin) for parent_asin in guest_vector]

`similar_items` only needs item ids (similar_items also includes the given item so N+1 similar items must be generated)

In [None]:
personalized_items = als_model.similar_items(mapped_guest_vector, N=10+1)

recommend_items, scores = personalized_items
similar_items = list(zip(recommend_items, scores))

similar_items[0] # Similar items for the first entry of mapped_guest_vector

In [None]:
# similar_items is ordered like mapped_guest_vector, so iterate over that list
for idx, guest_item_idx in enumerate(mapped_guest_vector):
    print(f"For {get_item_name_from_id(uim['reverse_item_map'][guest_item_idx])}")
    # [1:] skips the query item itself and keeps all 10 similar items
    print(f"The similar items are {[get_item_name_from_id(uim['reverse_item_map'][item_idx]) for item_idx in similar_items[idx][0][1:]]}")
    print('----------')

In [None]:
get_item_name_from_id(uim['reverse_item_map'][mapped_guest_vector[0]]), [get_item_name_from_id(uim['reverse_item_map'][item_idx]) for item_idx in similar_items[0][0][1:]]

### Text Features

A more powerful recommendation system can be built using the other features in the `items` dataset

In [None]:
items[['title', 'parent_asin', 'features', 'description', 'details', 'categories']]

This could use TF-IDF vectors or BERT embeddings of the text fields. Embeddings would likely work better, as the descriptions of similar items may not share the same words.
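To illustrate the TF-IDF option and its limitation, here is a minimal pure-Python sketch with whitespace tokenization and made-up titles. The helper names and the `log(N / df) + 1` weighting are choices made for this example, not part of the notebook.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up item titles
titles = [
    "wireless gaming mouse",
    "gaming mouse with rgb lighting",
    "cordless pointer for pc games",
]
vecs = tfidf_vectors(titles)

print(cosine(vecs[0], vecs[1]))  # shared words give a positive similarity
print(cosine(vecs[0], vecs[2]))  # no shared words, so similarity is 0
```

The third title describes a similar product to the first but shares no tokens with it, which is the case where embeddings would be expected to help.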

Good luck :)

### Saving Model (OOP)

The model is to be used in the backend but this is not possible without all its dependencies being saved as well.
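One way to keep the model together with what it depends on (id mappings, `ALPHA` and similar values) is a small wrapper class that saves everything as a single file. The sketch below is an assumption about how that could look: `RecommenderBundle`, its fields and the file name are hypothetical, and a plain dict stands in for the trained model.

```python
import gzip
import os
import pickle
import tempfile


class RecommenderBundle:
    """Hypothetical container for a model plus the data needed to use it."""

    def __init__(self, model, item_map, reverse_item_map, alpha):
        self.model = model
        self.item_map = item_map
        self.reverse_item_map = reverse_item_map
        self.alpha = alpha

    def save(self, path):
        # Store plain attributes so loading does not depend on this class
        # being importable under the same module path
        with gzip.open(path, "wb", compresslevel=5) as f:
            pickle.dump(vars(self), f)

    @classmethod
    def load(cls, path):
        with gzip.open(path, "rb") as f:
            return cls(**pickle.load(f))


bundle = RecommenderBundle(
    model={"factors": 5},                     # stand-in for the ALS model
    item_map={"B000AAA": 0},
    reverse_item_map={0: "B000AAA"},
    alpha=80,
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "recommender.pkl.gz")
    bundle.save(path)
    restored = RecommenderBundle.load(path)

print(restored.reverse_item_map[restored.item_map["B000AAA"]], restored.alpha)
```

With the real model in place of the stand-in, the backend would still need `implicit` installed, since unpickling the model requires its class to be importable.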