# <u> Memory-based Collaborative Filtering</u>

This notebook explores **memory-based collaborative filtering** as a first baseline for building a movie recommendation system. The approach relies solely on the raw user–item rating matrix, without the need for training or any machine learning framework.

The general idea is to build recommendations based on similarities, which can be computed in two main ways:

- **User–User Collaborative Filtering**: recommends movies liked by users who have similar preferences and rating behavior to us.
- **Item–Item Collaborative Filtering**: recommends new movies that are similar to the ones we have already seen and liked.

<br>

To define similarities between users or items, we commonly use two metrics:

- **Cosine Similarity**: a fast and popular measure that computes the cosine of the angle between two vectors (the rating vectors of two users or two items). The formula is:

<br>

$$
\text{sim}_{\text{cosine}}(x, y) = \frac{x \cdot y}{\|x\| \|y\|}
$$

<br>

- **Pearson Correlation**: more computationally demanding, but it measures the linear relationship between co-rated items, correcting for each user’s individual rating bias. The formula is:

<br>

$$
\text{sim}_{\text{pearson}}(x, y) = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2} \sqrt{\sum(y_i - \bar{y})^2}}
$$

<br>

Due to its ability to correct for individual rating biases, Pearson correlation is particularly effective in user–user collaborative filtering, where users may differ significantly in how they rate items. However, in item–item collaborative filtering, this adjustment is not appropriate, as centering on item means does not address user-level biases. Instead, cosine similarity, and in particular its adjusted form, is used to account for user rating behavior when comparing items.
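
To make the difference between the two metrics concrete, here is a minimal sketch on two hypothetical rating vectors (the values and the helper names `cosine_sim` and `pearson_sim` are illustrative assumptions, not part of the notebook's pipeline). User `b` has the same taste as user `a` but rates every movie one star lower.

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_sim(x, y):
    """Pearson correlation, i.e. cosine similarity of the mean-centered vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# Hypothetical ratings of two users on the same four co-rated movies.
a = np.array([5.0, 4.0, 2.0, 5.0])
b = a - 1.0  # same preferences, shifted by a constant rating bias

cos_ab = cosine_sim(a, b)
pear_ab = pearson_sim(a, b)
print(f"cosine:  {cos_ab:.4f}")
print(f"pearson: {pear_ab:.4f}")
```

Because Pearson centers each vector on its own mean, the constant shift disappears and the correlation is 1, whereas the plain cosine stays slightly below 1.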


## <u>0. Setting:</u>

### <u>0.1 Import libraries and dataframe</u>

In [None]:
# Import necessary libraries
import os
import sys

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.sparse import csr_matrix
from scipy.stats import pearsonr
from tqdm.notebook import tqdm
from sklearn.metrics import mean_squared_error
import pyarrow as pa
import pyarrow.parquet as pq

# Set the working directory
current_dir = os.getcwd()

project_root = os.path.abspath(os.path.join(current_dir, ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

# Import module for data processing
from modules.data_analysis import *


In [7]:
# Import cleaned dataframe
file_path = '../data/processed/ratings_processed.parquet'
ratings_processed = pd.read_parquet(file_path, engine="pyarrow")
ratings_processed.head(3)


In [4]:
print(f"Number of unique userId: {ratings_processed['userId'].nunique()}")
print(f"Number of unique movieId: {ratings_processed['movieId'].nunique()}")
print(f"Number of reviews: ~ {len(ratings_processed)/1000000:.1f} M")

Number of unique userId: 138383
Number of unique movieId: 12531
Number of reviews: ~ 19.9 M


### <u>0.2 Model evaluation</u>
In order to evaluate model performance and enable fair comparison with future approaches, we adopt a Leave-5-out strategy: for each user, the last 5 ratings are held out for testing, while the rest are used for training. This mirrors real-world scenarios where unseen items are recommended based on past behavior.
We assess accuracy using Root Mean Squared Error (RMSE) between predicted and actual ratings on the held-out items. This evaluation setup provides a consistent and realistic benchmark across all models.

The choice of holding out 5 ratings per user balances training size and evaluation coverage. Since we only include users with at least 20 ratings, each user keeps at least 75% of their ratings for training and contributes at most 25% to the test set. Based on the user activity distribution observed in the `eda.fe.ipynb` notebook, this threshold offers sufficient signal for training while ensuring each user contributes to model evaluation.

Additionally, with approximately 138,000 unique users in the final dataset, this strategy yields around 690,000 test points (5 per user) out of nearly 20 million total ratings, providing robust and reliable validation across the user base.
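
As a small illustration of the metric, the sketch below computes RMSE on five hypothetical held-out ratings for a single user. The helper `rmse` and the numbers are assumptions made for the example; they are not results from this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical held-out ratings for one user and the model's predictions.
actual = [4.0, 3.5, 5.0, 2.0, 4.5]
predicted = [3.5, 3.5, 4.0, 3.0, 4.5]

score = rmse(actual, predicted)
print(f"RMSE: {score:.3f}")
```

Since errors are squared before averaging, the two one-star misses dominate the score, which is why RMSE penalizes large individual errors more than a mean absolute error would.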

In [5]:
# Sort by user id and review date, then pick the 5 most recent reviews for each userId
sorted_df = ratings_processed.sort_values(by=['userId','timestamp'], ascending=[True,True])
test_df = sorted_df.groupby('userId').tail(5)

# Build train df by removing the test_df rows from it
train_df = ratings_processed.drop(test_df.index).reset_index(drop=True)
test_df = test_df.reset_index(drop=True)


In [6]:
# Assert correct length across users in the test set
assert test_df.groupby('userId').size().eq(5).all(), "Some users in test_df have ≠ 5 ratings"

In [None]:
# Manual Check of test_train split
user_id = 1

# All ratings for that user, sorted as in the previous setting
user_ratings = ratings_processed[ratings_processed['userId'] == user_id].sort_values('timestamp', ascending=True)

# Check that the 5 most recent ratings match test_df's entries for that user
expected_test_rows = user_ratings.tail(5).reset_index(drop=True)
actual_test_rows = test_df[test_df['userId'] == user_id].reset_index(drop=True)

# Assert that they match
assert expected_test_rows[['movieId', 'timestamp']].equals(actual_test_rows[['movieId', 'timestamp']])

## <u> 1. User-User Collaborative Filtering </u>

As previously explained, the main idea behind the **user–user collaborative filtering** approach is to recommend items to a target user by finding other users with similar tastes, then suggesting items they liked but the target user hasn't seen yet.

To compute similarities across users, we use **Pearson correlation** on the **user–item rating matrix**, which is structured as follows:
- Rows represent users (`userId`)
- Columns represent movies (`movieId`)
- Values are the ratings users assigned to each movie

This results in a very sparse matrix, as the dataset contains approximately 138,000 users and 12,500 movies, and most users rate only a small fraction of all available movies.
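
The matrix layout described above can be sketched as follows on a toy dataframe. The column names match the notebook's `userId`, `movieId` and `rating`, but the data, the variable names and the use of `pd.factorize` for the index encoding are assumptions made for illustration; the notebook's own construction may differ.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical ratings in the same long format as the notebook's dataframe.
toy = pd.DataFrame({
    "userId":  [10, 10, 20, 30, 30],
    "movieId": [1, 3, 1, 2, 3],
    "rating":  [4.0, 5.0, 3.0, 2.5, 4.5],
})

# Encode raw ids as contiguous row/column indices (order of first appearance).
user_codes, user_ids = pd.factorize(toy["userId"])
movie_codes, movie_ids = pd.factorize(toy["movieId"])

# Rows are users, columns are movies, stored values are the ratings.
user_item = csr_matrix(
    (toy["rating"].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_ids), len(movie_ids)),
)

density = user_item.nnz / (user_item.shape[0] * user_item.shape[1])
print(user_item.shape, f"density = {density:.2f}")
```

Only the observed ratings are stored, so memory grows with the number of ratings rather than with the number of users times the number of movies.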

The prediction for how much a user $u$ will like an unseen item $i$ is computed using a **similarity-weighted average** of the ratings from the most similar users who have rated item $i$:

$$
\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v) \cdot (r_{v,i} - \bar{r}_v)}{\sum_{v \in N(u)} |\text{sim}(u,v)|}
$$

Where:
- $\hat{r}_{u,i}$ is the predicted rating for user $u$ on item $i$
- $\bar{r}_u$ is the average rating of user $u$
- $N(u)$ is the set of top-$K$ most similar users to $u$ who rated item $i$
- $\text{sim}(u,v)$ is the Pearson correlation between users $u$ and $v$
- $r_{v,i}$ is the rating that user $v$ gave to item $i$
- $\bar{r}_v$ is the average rating of user $v$

<br>

For the sake of efficiency and prediction stability, we set $K=20$, meaning that predictions are based on the 20 most similar users who have rated the item. This value offers a good trade-off: it is large enough to smooth out noise from individual ratings, while still focusing on the most relevant users.
Additionally, by setting the minimum number of reviews per movie to 25, we ensure that each item has enough rating data to support consistent top-K neighbor selection during prediction.
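
The prediction rule above can be sketched on a small dense matrix, where `NaN` marks an unrated movie. This is an illustrative sketch under stated assumptions, not the notebook's implementation: the function names are hypothetical, similarities are computed with means taken over co-rated items, and neighbors are ranked by signed similarity.

```python
import numpy as np

def pearson_on_corated(ru, rv, min_common=2):
    """Pearson correlation over the items both users rated (NaN = unrated)."""
    common = ~np.isnan(ru) & ~np.isnan(rv)
    if common.sum() < min_common:
        return 0.0
    x = ru[common] - ru[common].mean()
    y = rv[common] - rv[common].mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0

def predict_user_user(R, u, i, k=20):
    """Predict R[u, i] from the k most similar users who rated item i."""
    r_u_mean = np.nanmean(R[u])
    candidates = [v for v in range(R.shape[0]) if v != u and not np.isnan(R[v, i])]
    sims = [(pearson_on_corated(R[u], R[v]), v) for v in candidates]
    # Keep the k neighbors with the largest similarity.
    top = sorted(sims, key=lambda t: t[0], reverse=True)[:k]
    denom = sum(abs(s) for s, _ in top)
    if denom == 0:
        return float(r_u_mean)  # no usable neighbor: fall back to the user's mean
    num = sum(s * (R[v, i] - np.nanmean(R[v])) for s, v in top)
    return float(r_u_mean + num / denom)

# Hypothetical ratings: 3 users x 4 movies; user 0 has not rated movie 2.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 3.0, 4.0, 1.0],   # similar taste to user 0
    [1.0, 2.0, 2.0, 5.0],   # opposite taste to user 0
])

pred = predict_user_user(R, u=0, i=2, k=2)
print(f"predicted rating: {pred:.2f}")
```

In this toy case the similar user rated the movie above their own mean and the dissimilar user rated it below theirs, so both push the prediction above user 0's average rating.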

<br>

In [20]:
conda install -c conda-forge scikit-surprise



## <u> 2. Item-Item Collaborative Filtering </u>

As an alternative to user–user collaborative filtering, which is often inefficient due to high user sparsity and unstable similarity estimates, **item–item collaborative filtering** shifts the perspective to the items themselves. Instead of finding similar users, we recommend items to a user based on the similarity between items they have already rated and the target item. This method is typically more computationally efficient and stable, since items (e.g., movies) receive many ratings and exhibit more consistent co-rating patterns across users.

As in the previous approach, we use a sparse rating matrix to represent user-item interactions, but this time structured as an **item–user matrix**, where each row corresponds to a movie and each column to a user. While the matrix is still sparse overall, the item dimension tends to be denser and more stable than the user dimension, making similarity computation more reliable and less expensive.


<br>

**<u>Similarity:</u>**

To compute item–item similarities, we use **adjusted cosine similarity**, which corrects for user rating biases by centering each rating around the corresponding user’s average. This prevents misleading similarity scores caused by users who consistently rate much higher or lower than others.

Let $R_{ui}$ denote the rating of user $u$ for item $i$, and let $\bar{R}_u$ denote the average rating by user $u$. The adjusted cosine similarity between items $i$ and $j$ is defined as:

<br>

$$
\text{sim}(i, j) = \frac{\sum_{u \in U_{ij}} (R_{ui} - \bar{R}_u)(R_{uj} - \bar{R}_u)}{\sqrt{\sum_{u \in U_{ij}} (R_{ui} - \bar{R}_u)^2} \cdot \sqrt{\sum_{u \in U_{ij}} (R_{uj} - \bar{R}_u)^2}}
$$

Where:
- $U_{ij}$ is the set of users who rated both items $i$ and $j$
- The user mean $\bar{R}_u$ removes user-specific bias

This differs from standard cosine similarity, which does not correct for user bias, and from Pearson correlation, which centers on item means, making it less suitable for item–item comparison.
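
A minimal sketch of the adjusted cosine formula, assuming a small dense users-by-items matrix with `NaN` for missing ratings. The function name and the toy data are hypothetical; a real implementation would operate on the sparse matrix.

```python
import numpy as np

def adjusted_cosine(R, i, j):
    """Adjusted cosine similarity between item columns i and j.

    R is a users x items array with NaN for unrated entries; each rating is
    centered on the mean of the user who gave it.
    """
    user_means = np.nanmean(R, axis=1)
    both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])  # users in U_ij
    if not both.any():
        return 0.0
    di = R[both, i] - user_means[both]
    dj = R[both, j] - user_means[both]
    denom = np.linalg.norm(di) * np.linalg.norm(dj)
    return float(di @ dj / denom) if denom > 0 else 0.0

# Hypothetical ratings: 3 users x 3 movies.
R = np.array([
    [5.0, 5.0, 2.0],
    [2.0, 2.0, 5.0],
    [4.0, np.nan, 1.0],
])

s01 = adjusted_cosine(R, 0, 1)
s02 = adjusted_cosine(R, 0, 2)
print(f"sim(0, 1) = {s01:.3f}, sim(0, 2) = {s02:.3f}")
```

Movies 0 and 1 receive identical deviations from each user's mean and so come out maximally similar, while movie 2 is rated in the opposite direction by every user and gets a negative score.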

<br>


<u> **Prediction:** </u>

The prediction formula is based on the assumption that a user’s opinion on an item can be inferred from how they rated similar items, adjusted for each item's overall average reception across the user base. Let $\hat{R}_{ui}$ be the predicted rating for user $u$ on item $i$. The prediction is computed using the baseline average rating of item $i$ and a weighted sum of the user's deviations from the item means on similar items:

<br>

$$
\hat{R}_{ui} = \bar{R}_i + \frac{\sum_{j \in N(i;u)} \text{sim}(i, j) \cdot (R_{uj} - \bar{R}_j)}{\sum_{j \in N(i;u)} |\text{sim}(i, j)|}
$$

Where:
- $\bar{R}_i$ is the average rating of item $i$ across all users
- $N(i;u)$ is the set of the top $k$ most similar items to $i$ that user $u$ has rated
- $\bar{R}_j$ is the average rating of item $j$
- $R_{uj}$ is the rating of user $u$ for item $j$

<br>

To ensure stable and computationally efficient predictions, we limit the similarity neighborhood size $k$ to 10. Additionally, only positively similar items are considered to avoid misleading predictions when the user’s rated movies are unrelated to the target item. If fewer than $k$ similar items are available, the prediction still proceeds using the available subset; if none are available, it falls back entirely on the average rating $\bar{R}_i$ of the target item. This fallback provides a robust baseline while still allowing for personalization when enough neighbors are available.
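
The prediction rule and its fallback can be sketched as follows. This is an illustrative sketch, not the notebook's implementation: the function name is hypothetical, `R` is a small dense users-by-items array with `NaN` for missing ratings, and `S` is a hand-written similarity matrix standing in for the adjusted cosine scores.

```python
import numpy as np

def predict_item_item(R, S, u, i, k=10):
    """Predict R[u, i] from the k most similar items that user u has rated.

    Only items with positive similarity to i are used; with no such item the
    prediction falls back to the average rating of item i.
    """
    item_means = np.nanmean(R, axis=0)
    rated = [
        j for j in range(R.shape[1])
        if j != i and not np.isnan(R[u, j]) and S[i, j] > 0
    ]
    top = sorted(rated, key=lambda j: S[i, j], reverse=True)[:k]
    denom = sum(S[i, j] for j in top)
    if denom == 0:
        return float(item_means[i])
    num = sum(S[i, j] * (R[u, j] - item_means[j]) for j in top)
    return float(item_means[i] + num / denom)

# Hypothetical ratings (3 users x 3 movies); user 0 has not rated movie 0.
R = np.array([
    [np.nan, 5.0, 2.0],
    [4.0, 5.0, 1.0],
    [2.0, 3.0, 3.0],
])
# Hypothetical item-item similarities; movie 2 is negatively related to movie 0.
S = np.array([
    [1.0, 0.8, -0.5],
    [0.8, 1.0, -0.3],
    [-0.5, -0.3, 1.0],
])

pred = predict_item_item(R, S, u=0, i=0)
fallback = predict_item_item(R, np.eye(3), u=0, i=0)
print(f"prediction: {pred:.3f}, fallback: {fallback:.3f}")
```

Here only movie 1 qualifies as a neighbor, so the prediction is movie 0's mean plus user 0's deviation from movie 1's mean. With the identity matrix as `S` there is no positive neighbor and the item mean is returned unchanged.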

<br>

<u>**Extension:**</u>

One important extension to this method involves incorporating content-based similarity alongside rating-based similarity. In particular, item metadata such as genre can be encoded using multi-hot vectors and used to compute genre-based cosine similarity. This additional information can improve recommendation quality, especially in sparse or cold-start scenarios where rating-based item similarity is limited or unavailable.
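
As a rough illustration of this extension, the sketch below encodes genres as multi-hot vectors and computes their cosine similarity. The genre vocabulary, the movie names and the variable names are hypothetical and do not come from the dataset.

```python
import numpy as np

# Hypothetical genre vocabulary and per-movie genre tags.
genres = ["Action", "Comedy", "Drama", "Sci-Fi"]
movie_genres = {
    "movie_a": ["Action", "Sci-Fi"],
    "movie_b": ["Action", "Sci-Fi", "Drama"],
    "movie_c": ["Comedy"],
}

def multi_hot(tags):
    """Encode a list of genre tags as a 0/1 vector over the vocabulary."""
    return np.array([1.0 if g in tags else 0.0 for g in genres])

G = np.vstack([multi_hot(tags) for tags in movie_genres.values()])

# Normalize rows to unit length so the dot product equals cosine similarity.
norms = np.linalg.norm(G, axis=1, keepdims=True)
G_unit = G / np.where(norms == 0, 1.0, norms)
genre_sim = G_unit @ G_unit.T

print(np.round(genre_sim, 3))
```

Such a matrix needs no ratings at all, so it could complement the rating-based similarity for movies with too few co-ratings; how the two scores are combined is left open here.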


### <u>Preparation</u>

#### <u> 2.1.1 Build item-user rating matrix:</u> 

To save memory and avoid redundant computation, we reuse the sparse matrix constructed in section `1.1.1` for user–user collaborative filtering (`item_user_sparse`). As a reminder, it was built using `csr_matrix`, and the mappings between the original and encoded `userId` and `movieId` values are already available in the environment.