# <u>Model-based Collaborative Filtering</u>


As an alternative to memory-based collaborative filtering explored in `mem_collab_filtering.ipynb`, **model-based collaborative filtering** leverages matrix factorization to capture latent structures in the user–item interaction matrix. The main idea is to represent each user and each item in a **latent space** of dimension $p$, such that the predicted rating $\hat{R}_{ui}$ is given by the dot product of their latent vectors:

<br>

$$
\hat{R}_{ui} = \mathbf{p}_u^\top \mathbf{q}_i
$$

<br>

Where:  
- $\mathbf{p}_u \in \mathbb{R}^p$ is the latent vector of user $u$  
- $\mathbf{q}_i \in \mathbb{R}^p$ is the latent vector of item $i$  
- $p$ is the number of latent factors (a hyperparameter to tune)  
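As a minimal sketch of the formula above, the snippet below builds two small factor matrices and shows that a single predicted rating is the dot product of one user row and one item row. The matrix sizes, the random initialization, and the names `P`, `Q`, and `predict` are illustrative assumptions, not values taken from this study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration
n_users, n_items, n_factors = 4, 5, 3

# Row u of P is the latent vector p_u, row i of Q is the latent vector q_i
P = rng.normal(scale=0.1, size=(n_users, n_factors))
Q = rng.normal(scale=0.1, size=(n_items, n_factors))


def predict(u, i):
    """Predicted rating for user u and item i: the dot product p_u^T q_i."""
    return float(P[u] @ Q[i])


# All predictions at once: an (n_users x n_items) dense matrix
R_hat = P @ Q.T
```

Computing `P @ Q.T` yields every user–item prediction in one matrix product, which is why the factor matrices are a compact stand-in for the full rating matrix.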

This decomposition replaces the sparse user–item matrix with two dense, low-rank factor matrices that capture hidden relationships between users and items. This study focuses on two commonly applied model-based approaches: **optimized Singular Value Decomposition (SVD)** and **Alternating Least Squares (ALS)**. The key difference between them lies in the optimization strategy used to learn the latent matrices: optimized SVD applies gradient-based updates on the observed ratings, whereas ALS alternates between solving for the user matrix with the item matrix held fixed and solving for the item matrix with the user matrix held fixed. This choice affects both scalability and computational efficiency.

A baseline truncated SVD could in principle be applied as well, but it requires imputing the missing ratings before the matrix can be decomposed. In sparse datasets like ours, this is a significant limitation: filling unseen items with mean values smooths out individual user preferences, so the model loses sensitivity to personal tastes. For these reasons, we do not evaluate truncated SVD in this case study and instead focus on approaches that learn latent factors directly from the observed ratings.
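To make the idea of learning factors only from observed ratings concrete, here is a minimal sketch of stochastic gradient descent on a regularized squared error. It is an illustration under stated assumptions, not the implementation used later in this notebook: it omits bias terms, and the function names, the toy ratings, and the hyperparameter values (`lr`, `reg`, `n_epochs`) are all hypothetical.

```python
import numpy as np


def sgd_factorize(ratings, n_users, n_items, n_factors=2,
                  lr=0.02, reg=0.02, n_epochs=300, seed=0):
    """Learn latent factors from (user, item, rating) triples with plain SGD."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        # Visit the observed ratings in a random order each epoch
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - P[u] @ Q[i]   # prediction error on one observed rating
            p_u = P[u].copy()       # keep the old p_u for the q_i update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q


def train_rmse(ratings, P, Q):
    """RMSE over the observed triples only."""
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


# Toy data: 3 users, 3 items, 6 observed ratings (3 cells are missing)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P0, Q0 = sgd_factorize(ratings, 3, 3, n_epochs=0)  # initial factors only
P, Q = sgd_factorize(ratings, 3, 3)                # trained factors
rmse_before = train_rmse(ratings, P0, Q0)
rmse_after = train_rmse(ratings, P, Q)
```

The missing cells never enter the loss, so no imputation is needed. ALS minimizes the same kind of objective but replaces the per-rating gradient steps with alternating least-squares solves for `P` and `Q`.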

## <u>0. Setting:</u>

### <u>0.1 Import libraries</u>

In [1]:
# Import necessary libraries
import os
import sys
import time

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from surprise import Dataset, Reader, KNNWithMeans, accuracy

# Add the project root (parent of the current directory) to sys.path so local modules can be imported
current_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(current_dir, ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

### <u>0.2 Import pre-built datasets</u>

To keep this study comparable with the memory-based collaborative filtering system, the same train–validation–test split is applied to the dataset. This ensures a fair comparison between models, since every model is evaluated on the same held-out data. Because the algorithm does not incorporate any content-based information about the movies, only the `userId`, `movieId`, and `rating` columns are used for training. As in the previous approach, RMSE, MAE, and training/evaluation time are recorded on the test set for comparison, while hyperparameter tuning is performed on the validation set.
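For reference, the two error metrics and the timing measurement mentioned above can be sketched with NumPy as follows. The helper names `rmse` and `mae` and the four example ratings are made up for illustration; the notebook itself may rely on library utilities instead.

```python
import time

import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))


# Made-up true and predicted ratings, for illustration only
y_true = np.array([4.0, 3.0, 5.0, 2.0])
y_pred = np.array([3.5, 3.0, 4.0, 3.0])

start = time.perf_counter()
scores = {"rmse": rmse(y_true, y_pred), "mae": mae(y_true, y_pred)}
elapsed = time.perf_counter() - start  # wall-clock seconds for the evaluation

print(scores)  # {'rmse': 0.75, 'mae': 0.625}
```

Wrapping the training call and the evaluation call in separate `time.perf_counter()` pairs gives the training and evaluation times independently.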

In [None]:
# Load each split, keeping only the columns of interest
train_df = pd.read_csv('../data/processed/train_df.csv')[['userId', 'movieId', 'rating']]
val_df = pd.read_csv('../data/processed/val_df.csv')[['userId', 'movieId', 'rating']]
test_df = pd.read_csv('../data/processed/test_df.csv')[['userId', 'movieId', 'rating']]

## <u>1. Optimized SVD:</u>