## Table of Contents

1. [Dataset](#dataset)

2. [User-item Matrix](#matrix)

    2.1 [Pandas Version](#pandas)

    2.2 [Surprise Lib Version](#surprise)

3. [Recommender Systems](#recsys)

    3.1 [Popularity](#popularity)

    3.2 [Modelling](#modelling)

    3.3 [Top Recommendations](#top-rec)

In [None]:
import numpy as np
import pandas as pd
from surprise import Dataset, Reader, NormalPredictor, KNNBasic, KNNWithMeans, KNNWithZScore, SVD
from surprise.model_selection import train_test_split, cross_validate
from collections import defaultdict
from surprise import accuracy
import random

In [None]:
import pandas as pd

# Sample dataframes
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})

df2 = pd.DataFrame({'X': ['X0', 'X1'],
                    'Y': ['Y0', 'Y1']})

# Performing a Cartesian product (cross-join)
result = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)

print(result)

In [None]:
import matplotlib.pyplot as plt

## 1. Dataset

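Jester dataset: https://eigentaste.berkeley.edu/dataset/ (the cell below loads Surprise's built-in `ml-100k` MovieLens dataset instead; `Dataset.load_builtin` also accepts `'jester'`)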

In [None]:
# Load data
from surprise import Dataset
from surprise import Reader

# Load the built-in MovieLens 100k data (downloaded on first use)
data = Dataset.load_builtin('ml-100k')

Our dataset is currently stored in the variable `data`, a `Dataset` object from the Surprise library.

In [None]:
data

We will transform the dataset into a pandas DataFrame in order to explore and visualize the data.

In [None]:
# transform the surprise dataset into a pandas dataframe
# each raw rating is a (user, item, rating, timestamp) tuple; we drop the timestamp
df = pd.DataFrame(data.raw_ratings, columns=['user_id', 'item_id', 'rating', 'timestamp']).drop(columns=['timestamp'])

What can you tell about this dataset?

1. How many ratings do we have?

2. How many users do we have?

3. How many items do we have?

4. What is the distribution of ratings?

In [None]:
# TODO
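As a starting point for these questions, here is a minimal sketch on a hypothetical toy DataFrame. The name `toy` and its values are made up for illustration; it only mirrors the `user_id`, `item_id` and `rating` columns of `df`, so the same calls should carry over.

```python
import pandas as pd

# Hypothetical toy ratings, mirroring the columns of df
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "item_id": ["i1", "i2", "i1", "i1", "i2", "i3"],
    "rating":  [5.0, 3.0, 4.0, 5.0, 1.0, 3.0],
})

n_ratings = len(toy)                 # one row per rating
n_users = toy["user_id"].nunique()   # distinct users
n_items = toy["item_id"].nunique()   # distinct items
# Number of ratings per rating value, ordered by rating value
rating_counts = toy["rating"].value_counts().sort_index()

print(n_ratings, n_users, n_items)
print(rating_counts)
```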

## 2. User-item Matrix

The first step in creating a recommender system is to transform the dataset into a user-item matrix. To that end, we must first define the "user", the "item" and the "value". The value can be a rating (explicit feedback) or binary information (implicit feedback).

In this case, the user is the column "user_id", the item is the column "item_id" and the value is the column "rating".
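To make the explicit/implicit distinction concrete, the sketch below turns explicit ratings into binary implicit feedback. The toy data and the threshold of 4 are assumptions chosen for illustration; they are not part of this notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy ratings (explicit feedback)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "item_id": ["i1", "i2", "i1", "i3"],
    "rating":  [5.0, 2.0, 4.0, 3.0],
})

LIKE_THRESHOLD = 4.0  # assumed cut-off for "the user liked the item"

# Implicit feedback: 1 if the rating reaches the threshold, 0 otherwise
toy["liked"] = (toy["rating"] >= LIKE_THRESHOLD).astype(int)
print(toy)
```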

### 2.1 Pandas Version

Let's create a user-item matrix based on our pandas dataframe.

In [None]:
# we will use the pivot function
df_matrix = df.pivot(index='user_id', columns='item_id', values='rating')

df_matrix

Is this dataset sparse?

To answer this, we count the number of ratings (that is, the number of filled cells in the matrix) and divide by the total number of user-item pairs. This ratio is the density of the matrix; the sparsity is one minus the density. The number of users is the number of rows in the matrix, and the number of items is the number of columns.

In [None]:
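# density = filled cells / all user-item pairs; sparsity = 1 - density
density = df_matrix.notnull().sum().sum() / (df_matrix.shape[0] * df_matrix.shape[1])
print(f"density: {density:.2%}")
print(f"sparsity: {1 - density:.2%}")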
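The arithmetic is easy to check by hand on a small matrix. The sketch below uses a hypothetical 3x3 toy example (made-up data) with 4 filled cells out of 9.

```python
import pandas as pd

# Hypothetical toy ratings: 3 users, 3 items, 4 ratings
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "item_id": ["i1", "i2", "i1", "i3"],
    "rating":  [5.0, 3.0, 4.0, 2.0],
})

matrix = toy.pivot(index="user_id", columns="item_id", values="rating")

filled = int(matrix.notna().sum().sum())     # cells that hold a rating
total = matrix.shape[0] * matrix.shape[1]    # users x items
density = filled / total
sparsity = 1 - density

print(f"filled={filled}, total={total}")
print(f"density={density:.2%}, sparsity={sparsity:.2%}")
```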

What is the:

1. distribution of total number of items per user?

2. distribution of total number of users per item?

3. distribution of mean ratings per user?

4. distribution of ratings?

(Show the histograms)

In [None]:
# TODO
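One possible way to get the per-user and per-item quantities is `groupby`. This is only a sketch on hypothetical toy data; the variable names are made up, and the same calls should work on `df`.

```python
import pandas as pd

# Hypothetical toy ratings
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "item_id": ["i1", "i2", "i1", "i1", "i2", "i3"],
    "rating":  [5.0, 3.0, 4.0, 5.0, 1.0, 3.0],
})

# Number of distinct items rated by each user
items_per_user = toy.groupby("user_id")["item_id"].nunique()
# Number of distinct users who rated each item
users_per_item = toy.groupby("item_id")["user_id"].nunique()
# Mean rating given by each user
mean_rating_per_user = toy.groupby("user_id")["rating"].mean()

print(items_per_user)
print(users_per_item)
print(mean_rating_per_user)
```

Each resulting Series can then be plotted with its `.hist()` method (which requires matplotlib).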

### 2.2 Surprise Lib Version

Now let's transform the dataset into a user-item matrix **using the surprise library**.

To that end, we can apply the method "build_full_trainset".

This method builds a "trainset" object that can be used for training purposes. Be aware that in this case we are building the training set from the full matrix, that is, without a train/test split (keep doing the exercises to find the solution).

This is the documentation for the trainset object https://surprise.readthedocs.io/en/stable/trainset.html

(Using the trainset object is useful for applying the surprise library methods)

In [None]:
# Build the trainset
trainset = data.build_full_trainset()

trainset
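One detail worth knowing before exploring the object: a trainset works with consecutive integer "inner" ids rather than the raw ids of the dataset, and methods such as `to_inner_uid` and `to_raw_uid` (used later in this notebook) convert between the two. The sketch below is not Surprise's implementation; `build_id_maps` is a hypothetical helper that only illustrates the idea.

```python
def build_id_maps(raw_ids):
    """Map each distinct raw id to a consecutive integer, in order of first appearance."""
    raw_to_inner = {}
    for raw in raw_ids:
        if raw not in raw_to_inner:
            raw_to_inner[raw] = len(raw_to_inner)
    # Reverse mapping, to go from inner ids back to raw ids
    inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}
    return raw_to_inner, inner_to_raw


# Hypothetical raw user ids, with one repeated user
raw_users = ["196", "186", "196", "22"]
raw_to_inner, inner_to_raw = build_id_maps(raw_users)

print(raw_to_inner)
print(inner_to_raw[2])
```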

Explore the trainset object. 

Can you answer the same questions about the dataset using only the methods available in the trainset object? Do you have the same results?

In [None]:
# TODO

What are the two most popular items?

In [None]:
# TODO

## 3. Recommender Systems

In [None]:
# HACK: if the dataset is too big, we can take our pandas dataframe, draw a sample and then convert it to a surprise lib dataset
reader = Reader(rating_scale=(df.rating.min(), df.rating.max()))
# you can try other sizes
size = 10000
data_sml = Dataset.load_from_df(df[['user_id', 'item_id', 'rating']].sample(size), reader)

To properly evaluate the recommender systems, we will now split the dataset into a training set and a test set.

In [None]:
# split into train and test set: 80% of the ratings for training, 20% for testing
trainset, testset = train_test_split(data, test_size=0.2)
# if you want to use the small dataset, please change data to data_sml
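Conceptually, the split shuffles the individual ratings and holds out a fraction of them. The sketch below illustrates that idea in plain Python; `split_ratings` is a hypothetical helper and the toy ratings are made up, so this is not how Surprise implements it.

```python
import random


def split_ratings(ratings, test_size=0.2, seed=0):
    """Shuffle (user, item, rating) tuples and hold out a fraction as the test set."""
    rng = random.Random(seed)        # seeded so that the split is reproducible
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy ratings
ratings = [(f"u{i % 5}", f"i{i}", float(i % 5 + 1)) for i in range(10)]
train, test = split_ratings(ratings, test_size=0.2)

print(len(train), len(test))
```

Surprise's own `train_test_split` also accepts a `random_state` argument if a reproducible split is needed.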

### 3.1 Popularity

The popularity method is the simplest recommender system.

It finds the most popular items and then recommends them to new users.

In [None]:
# Popular Recommender -> maybe we should use the rating too
def popular_recommendations(trainset, top_n=10):
    item_counts = defaultdict(int)

    # Iterate through the trainset to count the ratings of each item (inner ids)
    for _, item_id, _ in trainset.all_ratings():
        item_counts[item_id] += 1

    # Sort items by popularity (number of ratings)
    popular_items = sorted(item_counts.items(), key=lambda x: x[1], reverse=True)

    # Keep the top N most popular items and convert their inner ids to raw ids
    return [trainset.to_raw_iid(i) for i, _ in popular_items[:top_n]]

These are the five most popular items:

In [None]:
popular_recommendations(trainset, 5)
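The comment in the function above suggests that the rating could be used too. One possible variant, sketched here on plain `(user, item, rating)` tuples, ranks items by mean rating and ignores items with too few ratings. The helper `popular_by_mean_rating` and the `min_count` parameter are assumptions for illustration, not part of the notebook.

```python
from collections import defaultdict


def popular_by_mean_rating(ratings, top_n=10, min_count=2):
    """Rank items by mean rating, keeping only items with at least min_count ratings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _, item_id, rating in ratings:
        totals[item_id] += rating
        counts[item_id] += 1

    means = {i: totals[i] / counts[i] for i in counts if counts[i] >= min_count}
    # Sort by mean rating, breaking ties by the number of ratings
    ranked = sorted(means, key=lambda i: (means[i], counts[i]), reverse=True)
    return ranked[:top_n]


# Hypothetical toy ratings
ratings = [
    ("u1", "a", 5.0), ("u2", "a", 4.0), ("u3", "a", 3.0),
    ("u1", "b", 5.0), ("u2", "b", 5.0),
    ("u3", "c", 5.0),
]
print(popular_by_mean_rating(ratings, top_n=2))
```

The `min_count` filter keeps an item with a single 5-star rating (item `c` here) from outranking items that many users have rated.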

### 3.2 Modelling

The predictor fills the matrix with estimated ratings for unseen user-item pairs.

To evaluate how good the model is, we calculate the RMSE (root mean squared error) between the true ratings and the predicted ones.

Because the true ratings are only known for held-out data, we create predictions only for the user-item pairs found in the test set.

In [None]:
# Define evaluation function
def evaluate_algorithm(algo, trainset, testset):
    algo.fit(trainset)
    predictions = algo.test(testset)
    
    # Compute and return RMSE
    rmse = accuracy.rmse(predictions)
    return rmse
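For reference, RMSE is the square root of the mean squared difference between true and predicted ratings. The sketch below computes it by hand on made-up numbers; the `rmse` helper is hypothetical and separate from `accuracy.rmse`.

```python
import math


def rmse(pairs):
    """pairs: iterable of (true_rating, predicted_rating) tuples."""
    pairs = list(pairs)
    squared_errors = [(true - pred) ** 2 for true, pred in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))


# Hypothetical (true, predicted) ratings: squared errors are 1, 0, 4 and 1
pairs = [(4.0, 3.0), (2.0, 2.0), (5.0, 3.0), (1.0, 2.0)]
print(f"RMSE: {rmse(pairs):.3f}")
```

Lower is better, and the value is expressed in the same units as the ratings.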

In [None]:
# Random Recommender
random_algo = NormalPredictor()
random_rmse = evaluate_algorithm(random_algo, trainset, testset)

In [None]:
# User-Based Collaborative Filtering
#ubcf_algo = KNNBasic(sim_options={'user_based': True})
#ubcf_rmse = evaluate_algorithm(ubcf_algo, trainset, testset)

In [None]:
# Item-Based Collaborative Filtering
ibcf_algo = KNNBasic(sim_options={'user_based': False})
ibcf_rmse = evaluate_algorithm(ibcf_algo, trainset, testset)

In [None]:
# Singular Value Decomposition (SVD)
svd_algo = SVD()
svd_rmse = evaluate_algorithm(svd_algo, trainset, testset)

In [None]:
print(f"Random RMSE: {random_rmse:.3f}")
#print(f"User-Based CF RMSE: {ubcf_rmse:.3f}")
print(f"Item-Based CF RMSE: {ibcf_rmse:.3f}")
print(f"SVD RMSE: {svd_rmse:.3f}")

### 3.3 Top Recommendations

The following function generates personalized recommendations for a user from a recommender model (`algo`) and a `Trainset` object. It uses the model to predict a rating for each item the user has not interacted with, sorts the items by estimated score in descending order, and selects the top `n` items as the recommendations. Note that `user_id` is the inner id used by the trainset, not the raw id from the dataset.

In [None]:
# Recommend top N items for a user using a recommender model
def recommend_top_n(algo, trainset, user_id, n=10):
    # user_id is an inner id; trainset.ur maps it to a list of (inner item id, rating)
    rated_items = {item_id for (item_id, _) in trainset.ur[user_id]}
    raw_uid = trainset.to_raw_uid(user_id)

    item_scores = {}
    # not the most elegant way to do this, but it works: score every item the user has not rated
    for item_id in trainset.all_items():
        if item_id not in rated_items:
            # algo.predict expects raw ids, so we convert the inner ids back
            prediction = algo.predict(raw_uid, trainset.to_raw_iid(item_id), verbose=False)
            item_scores[item_id] = prediction.est

    top_items = sorted(item_scores, key=item_scores.get, reverse=True)[:n]

    # from inner id to raw id
    return [trainset.to_raw_iid(i) for i in top_items]
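The selection step itself (drop the seen items, keep the `n` best scores) can be isolated from the library. The sketch below does it on a plain dictionary with `heapq.nlargest`, which avoids sorting the full list; `top_n_unseen` and the toy scores are hypothetical.

```python
import heapq


def top_n_unseen(scores, seen, n=10):
    """scores: dict item -> estimated rating; seen: items the user already rated."""
    candidates = ((item, est) for item, est in scores.items() if item not in seen)
    best = heapq.nlargest(n, candidates, key=lambda pair: pair[1])
    return [item for item, _ in best]


# Hypothetical estimated ratings for one user, who has already rated item "c"
scores = {"a": 4.5, "b": 3.0, "c": 4.9, "d": 2.0, "e": 4.7}
seen = {"c"}
print(top_n_unseen(scores, seen, n=2))
```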

In [None]:
# Get recommendations for a specific user using the Item-Based CF model
user_id = 3 # Inner user id; change to the desired user
ibcf_top_items = recommend_top_n(ibcf_algo, trainset, user_id, n=5)
print("Top 5 Item-Based CF Recommendations for User", trainset.to_raw_uid(user_id), ":", ibcf_top_items)

In [None]:
user_id = 10
n = 5
print("user_id", trainset.to_raw_uid(user_id))

print(recommend_top_n(svd_algo, trainset, user_id, n))

How to evaluate this ranking?

In [None]:
df_testset = pd.DataFrame(testset, columns=['user_id', 'item_id', 'rating'])

In [None]:
# we only count positive ratings as relevant
pos_rating = 5
df_testset_pos = df_testset[df_testset["rating"] >= pos_rating]
# keep the users with positive test ratings that also exist in the trainset
users = []
for u in df_testset_pos["user_id"].unique():
    try:
        trainset.to_inner_uid(u)
        users.append(u)
    except ValueError:
        # to_inner_uid raises ValueError for users unknown to the trainset
        continue

print("number of users in the testset that exist in the trainset:", len(users))

In [None]:
random_user = random.choice(users)
n = 5
print("user_id : ", random_user)
# ground truth: items this user rated positively (>= pos_rating) in the test set
gt = df_testset[(df_testset['user_id']==random_user) & (df_testset['rating']>=pos_rating)].item_id.to_list()
print("ground truth : ", gt)
recs = recommend_top_n(svd_algo, trainset, trainset.to_inner_uid(random_user), n)
print("recommendations: ", recs)
print(f"hits: {len(set(gt).intersection(set(recs)))} / {n}")
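The hit count divided by `n` is commonly called precision at `n`; dividing the hits by the size of the ground truth instead gives recall. The sketch below computes both for one user on made-up lists; `precision_recall_at_k` is a hypothetical helper, not a Surprise function.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations against the relevant items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical recommendations and ground truth for one user
recs = ["a", "b", "c", "d", "e"]
gt = ["b", "e", "z"]
precision, recall = precision_recall_at_k(recs, gt, k=5)
print(f"precision@5: {precision:.2f}, recall@5: {recall:.2f}")
```

Averaging these values over all users in `users` would give a single ranking score per model.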