# <u>Hybrid Content-Collaborative Filtering Model</u>


Building on the collaborative filtering approaches explored in previous notebooks, **hybrid content-collaborative filtering** extends the Funk SVD model by adding a content-feature component on top of the purely collaborative prediction explored in the model-based CF notebook. The goal is not to replace collaborative signals but to correct and enrich them, especially in cold-start or sparse regimes, while preserving the strong warm-start performance of matrix factorization observed earlier.

**<u>Predictions:</u>**

The predicted rating for a user $u$ on an item $i$ is defined as:

<br>

$$
\hat{r}_{u,i} = \mu + b_u + b_i + \mathbf{p}_u^\top \mathbf{q}_i 
+ \mathbf{w}_u^\top \mathbf{x}_u + \mathbf{w}_i^\top \mathbf{x}_i
$$



where:
- The first part of the equation corresponds to the Funk SVD model, where $\mu$ denotes the global mean rating and $b_u$ and $b_i$ are user and item bias terms. The vectors $\mathbf{p}_u$ and $\mathbf{q}_i$ represent the latent user and item embeddings learned from interaction data.
- The additional terms introduce content-based corrections. The vectors $\mathbf{x}_u \in \mathbb{R}^{d_u}$ and $\mathbf{x}_i \in \mathbb{R}^{d_i}$ denote user and item content features, respectively, while $\mathbf{w}_u$ and $\mathbf{w}_i$ are learned linear weights mapping these features directly to rating adjustments.
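
As a minimal sketch of the prediction rule above, the following NumPy function evaluates $\hat{r}_{u,i}$ for a single user-item pair. The function name `predict_rating` and the toy values are illustrative assumptions, not part of the notebook's PyTorch implementation. The sketch also assumes that $\mathbf{w}_u$ and $\mathbf{w}_i$ are single weight vectors shared across all users and all items, respectively.

```python
import numpy as np

def predict_rating(mu, b_u, b_i, p_u, q_i, w_u, x_u, w_i, x_i):
    """Hybrid prediction: Funk SVD term plus linear content corrections."""
    collaborative = mu + b_u + b_i + p_u @ q_i   # Funk SVD part
    content = w_u @ x_u + w_i @ x_i              # additive content part
    return float(collaborative + content)

# Toy values (hypothetical): 2 latent factors, d_u = 2, d_i = 3
rating = predict_rating(
    mu=3.5, b_u=0.2, b_i=-0.1,
    p_u=np.array([0.5, -1.0]), q_i=np.array([1.0, 0.5]),
    w_u=np.array([0.1, 0.0]), x_u=np.array([1.0, 2.0]),
    w_i=np.array([0.0, 0.2, -0.1]), x_i=np.array([1.0, 1.0, 1.0]),
)
```

With all content weights set to zero, the function reduces to the plain Funk SVD prediction, which is what makes the content part a correction rather than a replacement.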

**<u>Training and Iterative Updates:</u>**

During training, the model jointly learns the latent factors, bias terms, and content feature weights by minimizing a regularized squared error over observed ratings:


<br>

$$
\min_{\mathbf{P}, \mathbf{Q}, \mathbf{w}_u, \mathbf{w}_i, \mathbf{b}}
\sum_{(u,i) \in \mathcal{K}}
\left(r_{u,i} - \hat{r}_{u,i}\right)^2
+ \lambda \left(
\|\mathbf{P}\|^2 + \|\mathbf{Q}\|^2 + \|\mathbf{w}_u\|^2 + \|\mathbf{w}_i\|^2 + \|\mathbf{b}\|^2
\right)
$$

<br>

where $\mathcal{K}$ denotes the set of observed user–item interactions in the training data and $\lambda$ controls the strength of regularization.

The optimization follows the same iterative procedure as Funk SVD and is performed with stochastic gradient descent (SGD). This keeps the training setup consistent and the comparison fair, while allowing content-based signals to be incorporated with minimal additional complexity.
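
To make the update rule concrete, here is a hedged NumPy sketch of one SGD step on a single observed rating. It is not the notebook's PyTorch training loop: the name `sgd_step` is hypothetical, the weight vectors are assumed to be shared across users and items, the constant factor 2 from the squared error is absorbed into the learning rate, and the regularization term is applied at every step, as is conventional for Funk SVD.

```python
import numpy as np

def sgd_step(r, mu, b_u, b_i, p_u, q_i, w_u, w_i, x_u, x_i, lr, reg):
    """One SGD update for one observed rating r; returns updated parameters."""
    pred = mu + b_u + b_i + p_u @ q_i + w_u @ x_u + w_i @ x_i
    err = r - pred  # error before the update
    # Each parameter moves along err * (d pred / d param), shrunk by reg
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = p_u + lr * (err * q_i - reg * p_u)
    new_q_i = q_i + lr * (err * p_u - reg * q_i)
    new_w_u = w_u + lr * (err * x_u - reg * w_u)
    new_w_i = w_i + lr * (err * x_i - reg * w_i)
    return new_b_u, new_b_i, new_p_u, new_q_i, new_w_u, new_w_i, err

# Toy setup (hypothetical values): apply two steps to the same rating
x_u, x_i = np.array([1.0, 0.0]), np.array([0.0, 1.0])
params = (0.0, 0.0, np.array([0.1, 0.1]), np.array([0.1, 0.1]),
          np.zeros(2), np.zeros(2))
errors = []
for _ in range(2):
    *params, err = sgd_step(4.0, 3.5, *params, x_u, x_i, lr=0.05, reg=0.02)
    errors.append(err)
```

The content weights receive the same kind of update as the bias terms, scaled by the feature values, so adding them does not change the structure of the training loop.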

**<u>Hyperparameter Selection:</u>**

The main hyperparameters evaluated on the validation set are:

- Number of latent factors, controlling the capacity of the collaborative component;  
- Regularization parameter, applied uniformly across all learned parameters;  
- Learning rate, governing the optimization dynamics;  
- Number of epochs, determining the number of full passes over the training data.
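
The search over these four hyperparameters can be sketched as an exhaustive grid search. The candidate values and the `dummy_score` function below are placeholders chosen for illustration, not the settings or results of this notebook. A real scorer would train the model on the training set and return its RMSE on the validation set.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Return the configuration with the lowest score (e.g. validation RMSE)."""
    names = list(param_grid)
    best_config, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        config = dict(zip(names, values))
        score = score_fn(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Hypothetical candidate values, not the ones used in this notebook
param_grid = {
    "n_factors": [20, 50],
    "reg": [0.01, 0.1],
    "lr": [0.005, 0.01],
    "n_epochs": [10, 20],
}

# Stand-in scorer for illustration only
def dummy_score(config):
    return abs(config["n_factors"] - 50) + abs(config["reg"] - 0.1)

best_config, best_score = grid_search(param_grid, dummy_score)
```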

**<u>Implementation:</u>**

Due to the absence of a pre-built library supporting this specific additive hybrid formulation, the model is implemented using a custom PyTorch architecture. This approach enables direct control over the loss function and optimization process, while extending the original Funk SVD implementation in a minimal and transparent manner. Existing libraries such as Surprise (SVD++) and LightFM do not align with these requirements, as the former does not support explicit content features and the latter focuses on implicit feedback with ranking-based objectives rather than explicit rating prediction with an MSE loss.


## <u>0. Setting:</u>

### <u>0.1 Import libraries</u>

In [1]:
# Import necessary libraries
import os
import sys
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


# Suppress UserWarning messages
import warnings
warnings.filterwarnings("ignore", category=UserWarning)


# Add the project root to the module search path
current_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(current_dir, ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

# Import custom modules
from modules.hybrid_CFF import *

### <u>0.2 Import pre-built datasets</u>

For consistency with the previous experiments, we adopt the same **time-based train–validation–test split**. For each user, the **earliest 70% of ratings** are used for training, the **next 10% for validation**, and the **most recent 20% for testing**. This ensures a fair comparison across models while preventing information leakage into the evaluation set.
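
The datasets loaded below were built in an earlier step, so the following is only a dependency-free sketch of how such a per-user chronological split could be done. The tuple layout `(user, item, rating, timestamp)`, the function name `time_based_split`, and the floor rounding at the cut points are assumptions.

```python
from collections import defaultdict

def time_based_split(ratings):
    """Split (user, item, rating, timestamp) tuples 70/10/20 per user by time."""
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, val, test = [], [], []
    for rows in by_user.values():
        rows.sort(key=lambda row: row[3])   # oldest rating first
        n = len(rows)
        train_end = n * 7 // 10             # assumed rounding: floor
        val_end = n * 8 // 10
        train.extend(rows[:train_end])
        val.extend(rows[train_end:val_end])
        test.extend(rows[val_end:])
    return train, val, test

# Toy data: one user with 10 ratings at timestamps 0..9, given in reverse order
ratings = [("u1", f"m{t}", 4.0, t) for t in reversed(range(10))]
train, val, test = time_based_split(ratings)
```

Because every validation and test rating of a user is later than that user's training ratings, the model never sees a user's future behavior during training.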

Because the hybrid algorithm jointly leverages **collaborative interactions and content-based features**, the movie feature vectors constructed in `05_content_F.ipynb` are used as item information. In addition, user-specific feature representations are constructed and incorporated into the model. This allows the model to learn representations that combine interaction patterns with explicit content information, in contrast to its baseline model evaluated in `04_model_CF.ipynb`.

Following the evaluation protocol used for the collaborative and model-based approaches, RMSE is reported on the full test set as well as separately for **warm-start** and **cold-start** settings. The cold-start subset corresponds to movies with fewer than 10 interactions in the training set, where predictions rely more heavily on content features than on collaborative signals. Additionally, hyperparameter tuning is performed on the validation set, and the total training and evaluation time is reported to enable computational performance comparisons.
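
As an illustration of this protocol, the sketch below computes RMSE on the full test set and on the warm and cold subsets, using the threshold of 10 training interactions per movie. The function names, the `(item, true, predicted)` row layout, and the choice to treat movies absent from training as cold are assumptions, and the toy values are made up.

```python
import math
from collections import Counter

def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def warm_cold_rmse(train_items, test_rows, threshold=10):
    """test_rows holds (item, true_rating, predicted_rating) triples."""
    counts = Counter(train_items)  # training interactions per movie
    cold = [(t, p) for item, t, p in test_rows if counts[item] < threshold]
    warm = [(t, p) for item, t, p in test_rows if counts[item] >= threshold]
    return {
        "all": rmse([(t, p) for _, t, p in test_rows]),
        "warm": rmse(warm) if warm else float("nan"),
        "cold": rmse(cold) if cold else float("nan"),
    }

# Toy data: m1 is warm (12 training ratings), m2 and m3 are cold (3 and 0)
train_items = ["m1"] * 12 + ["m2"] * 3
test_rows = [("m1", 4.0, 4.0), ("m1", 3.0, 4.0),
             ("m2", 5.0, 3.0), ("m3", 2.0, 4.0)]
scores = warm_cold_rmse(train_items, test_rows)
```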


In [None]:
# Load the pre-built train/validation/test splits and the movie feature vectors
train_df = pd.read_csv('../data/processed/train_df.csv')
val_df = pd.read_csv('../data/processed/val_df.csv')
test_df = pd.read_csv('../data/processed/test_df.csv')
warm_test_df = pd.read_csv('../data/processed/warm_test_df.csv')
cold_test_df = pd.read_csv('../data/processed/cold_test_df.csv')
movies_vector_std = pd.read_csv('../data/processed/movies_vector_std.csv')