# <u>Content-based Filtering</u>


As an alternative to the collaborative filtering approaches explored in previous notebooks, **content-based filtering** relies on the intrinsic features of items rather than on user–item interactions across the entire population. The main idea is to represent each movie as a **feature vector** capturing its attributes, and to construct a **user profile** by aggregating the features of movies the user has previously liked. The predicted preference for a new movie is then determined by the **similarity** between the user profile and the movie feature vector.

<br>

Mathematically, the predicted score for a user $u$ on an unseen movie $m$ can be expressed as

<br>

$$
\hat{r}_{u,m} = \text{scale}\Big(\text{sim}(\text{user\_profile}_u, \text{movie\_vector}_m)\Big)
$$

<br>

where  

- $\text{user\_profile}_u \in \mathbb{R}^d$ is the vector representing user $u$’s preferences;  
- $\text{movie\_vector}_m \in \mathbb{R}^d$ is the vector describing movie $m$;  
- $\text{sim}(\cdot,\cdot)$ is a similarity measure, such as **cosine similarity** or the **dot product**;  
- $\text{scale}(\cdot)$ is a linear transformation that maps the similarity score onto the original rating range (0.5 to 5).  

<br>

This approach allows the recommender system to suggest movies that are **most similar to what the user already enjoys**. The method is particularly useful for capturing individual tastes and for handling situations where a user has few interactions or little overlap with other users. In practice, the quality of the recommendations depends mostly on the richness and relevance of the item features used to construct the user profile.
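
As a minimal sketch of the $\text{scale}(\cdot)$ step, the function below maps a similarity score linearly onto the rating range. The name `scale_to_rating` and the assumed similarity range of $[-1, 1]$ are illustrative choices, not taken from the notebook's implementation.

```python
import numpy as np

def scale_to_rating(sim, sim_min=-1.0, sim_max=1.0, r_min=0.5, r_max=5.0):
    """Linearly map a similarity in [sim_min, sim_max] onto [r_min, r_max].

    The default similarity range is an assumption (cosine similarity can
    take values in [-1, 1]); out-of-range inputs are clipped first.
    """
    sim = np.clip(sim, sim_min, sim_max)
    frac = (sim - sim_min) / (sim_max - sim_min)
    return float(r_min + frac * (r_max - r_min))

print(scale_to_rating(1.0))   # 5.0
print(scale_to_rating(0.0))   # 2.75
print(scale_to_rating(-1.0))  # 0.5
```

If all feature values are non-negative, cosine similarity cannot fall below 0, so `sim_min=0.0` may be the more appropriate setting in that case.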


## <u>0. Setting:</u>

### <u>0.1 Import libraries</u>

In [None]:
# Import necessary libraries
import os
import sys
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Suppress UserWarning messages
import warnings
warnings.filterwarnings("ignore", category=UserWarning)


# Add the project root (the parent directory) to the import path
current_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(current_dir, ".."))
if project_root not in sys.path:
    sys.path.append(project_root)

### <u>0.2 Import pre-built datasets</u>

To allow a direct comparison with the memory-based and model-based collaborative filtering systems, we use the same time-based **train–validation–test split**. For each user, the **earliest 70% of their ratings** are used for training, the **next 10% for validation**, and the **most recent 20% for testing**. Using identical splits ensures a fair comparison between models without introducing bias in the evaluation set. From the interaction data, only the `userId`, `movieId`, and `rating` columns are used for training; the content information about the movies enters through the movie feature vectors built later in this notebook. As in the memory-based evaluation, RMSE is recorded on the full test set as well as separately for the **warm-start** and **cold-start** subsets. Hyperparameter tuning is performed on the validation set, and total training and evaluation time is reported to compare computational cost.
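
The splits loaded below were built beforehand, so the following is only a sketch of how a per-user 70/10/20 chronological split could be produced. It assumes a `timestamp` column in the raw ratings; the function name `temporal_split` and the toy data are hypothetical.

```python
import pandas as pd

def temporal_split(ratings, train_frac=0.7, val_frac=0.1):
    """Split each user's ratings chronologically into train/val/test."""
    ordered = ratings.sort_values(["userId", "timestamp"])
    grouped = ordered.groupby("userId")
    # Relative position of each rating in its user's history, in [0, 1)
    position = grouped.cumcount() / grouped["rating"].transform("count")
    train = ordered[position < train_frac]
    val = ordered[(position >= train_frac) & (position < train_frac + val_frac)]
    test = ordered[position >= train_frac + val_frac]
    return train, val, test

# Toy data: one user with ten ratings in shuffled time order
toy = pd.DataFrame({
    "userId": [1] * 10,
    "movieId": list(range(100, 110)),
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 1.5, 5.0, 3.5],
    "timestamp": [50, 10, 90, 30, 70, 20, 100, 40, 80, 60],
})
train, val, test = temporal_split(toy)
print(len(train), len(val), len(test))  # 7 1 2
```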



In [None]:
# Load the pre-built train, validation, and test splits
train_df = pd.read_csv('../data/processed/train_df.csv')
val_df = pd.read_csv('../data/processed/val_df.csv')
test_df = pd.read_csv('../data/processed/test_df.csv')
warm_test_df = pd.read_csv('../data/processed/warm_test_df.csv')
cold_test_df = pd.read_csv('../data/processed/cold_test_df.csv')

## <u>1. Similarity</u>



In this experiment, we use **cosine similarity** as the measure of similarity between the user profile and an unseen movie vector. The motivation for this choice lies in how cosine similarity emphasizes **alignment in feature space** rather than absolute magnitude. In this implementation, the user profile is constructed as a weighted aggregation of movie feature vectors, where the weights correspond to the ratings assigned by the user. Although this aggregation is normalized by the sum of the weights, the resulting vector may still exhibit differences in magnitude due to variations in how concentrated or diverse a user’s preferences are across feature dimensions. As a result, similarity measures that are sensitive to vector magnitude, such as the dot product, may assign systematically higher scores to users whose profiles have larger norms, regardless of how well the underlying feature patterns align.

Cosine similarity addresses this issue by normalizing both the user profile and the movie vector, effectively focusing on the **direction of the vectors rather than their scale**. This allows the model to compare how closely the feature composition of a movie matches the user’s preference profile, independently of the user’s overall activity level or rating intensity.
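
The rating-weighted aggregation described above can be sketched as follows. The helper `build_user_profile` and the toy one-hot features are illustrative assumptions; the example only shows that two normalized profiles can still differ in norm depending on how concentrated a user's tastes are.

```python
import numpy as np

def build_user_profile(movie_vectors, ratings):
    """Rating-weighted average of the feature vectors of the rated movies."""
    movie_vectors = np.asarray(movie_vectors, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    return ratings @ movie_vectors / ratings.sum()

# User A: two movies with the same single feature (concentrated taste)
profile_a = build_user_profile([[1, 0, 0], [1, 0, 0]], [4.0, 5.0])

# User B: three movies with one distinct feature each (diverse taste)
profile_b = build_user_profile(np.eye(3), [4.0, 4.0, 4.0])

# Both profiles are weight-normalized, yet their Euclidean norms differ
print(np.linalg.norm(profile_a), np.linalg.norm(profile_b))
```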

Mathematically, the cosine similarity between the user profile vector $\mathbf{u}$ and a movie feature vector $\mathbf{m}$ is defined as

<br>

$$
\text{sim}_{\text{cosine}}(\mathbf{u}, \mathbf{m}) = 
\frac{\mathbf{u} \cdot \mathbf{m}}{\|\mathbf{u}\| \, \|\mathbf{m}\|}
$$

where $\|\cdot\|$ denotes the **Euclidean norm** of a vector.
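
A small sketch of this formula, contrasted with the dot product: rescaling the user vector changes the dot product but leaves the cosine similarity unchanged. The zero-norm guard returning 0.0 is an assumption added here for safety, and the vectors are made up.

```python
import numpy as np

def cosine_similarity(u, m, eps=1e-12):
    """Cosine of the angle between u and m; 0.0 if either vector is ~zero."""
    u = np.asarray(u, dtype=float)
    m = np.asarray(m, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(m)
    if denom < eps:
        return 0.0
    return float(u @ m / denom)

u = np.array([0.6, 0.2, 0.0])   # toy user profile
m = np.array([1.0, 0.0, 0.0])   # toy movie vector

# The dot product grows with the profile's magnitude ...
dot_small, dot_large = float(u @ m), float((3 * u) @ m)
# ... while the cosine similarity depends only on direction
cos_small, cos_large = cosine_similarity(u, m), cosine_similarity(3 * u, m)
```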

## <u>2. Movie Vector</u>

Based on the exploratory data analysis conducted in `01_eda.ipynb`, several features have been identified for inclusion in our movie vector representation. In this notebook, the following features will be implemented:

- **Movie genres**: The 19 distinct genres are encoded using **one-hot encoding**, providing a categorical representation of the movie’s type.
- **Bayesian-adjusted rating**: This measure is included to capture the overall quality and community appraisal of a movie. It is preferred over the simple average rating, as it accounts for differences in popularity, exposure, and review count, mitigating the influence of movies with very few ratings.
- **Decade of release**: The decade is derived from the release year to provide a temporal context. This feature allows movies to be compared in terms of their production period, capturing similarity across time.
- **Tag relevance scores**: User-provided tags are incorporated as an additional source of content. Given the large number of distinct tags (1128), variance-based filtering followed by **PCA** is applied to reduce dimensionality while retaining the most significant information for movie comparison.

This combination of categorical, numerical, and derived features forms the basis of our content-based movie representation, providing a rich and compact vector space for similarity computation.
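
The four feature groups listed above could be assembled roughly as in the sketch below. Everything in it is an assumption made for illustration: the toy data, the column names `year`, `mean_rating`, and `n_ratings`, the shrinkage form of the Bayesian-adjusted rating with prior strength `m = 25`, the variance threshold, and the use of an SVD in place of a library PCA.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real data (all values are made up)
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Action|Comedy", "Drama", "Action"],
    "year": [1994, 2008, 1999],
    "mean_rating": [4.0, 5.0, 3.0],
    "n_ratings": [200, 2, 50],
})
tag_relevance = np.array([      # rows: movies, columns: tags
    [0.5, 0.9, 0.1, 0.8],
    [0.5, 0.2, 0.7, 0.1],
    [0.5, 0.8, 0.2, 0.9],
])

# 1. One-hot genres from a pipe-separated string
genre_features = movies["genres"].str.get_dummies(sep="|")

# 2. Bayesian-adjusted rating: shrink each mean toward the global mean C,
#    more strongly for movies with few ratings (m is an assumed prior strength)
C = np.average(movies["mean_rating"], weights=movies["n_ratings"])
m, v = 25, movies["n_ratings"]
bayes_rating = (v * movies["mean_rating"] + m * C) / (v + m)

# 3. Decade of release derived from the year
decade = movies["year"] // 10 * 10

# 4. Tags: drop near-constant columns, then PCA via SVD of the centered matrix
keep = tag_relevance.var(axis=0) > 0.01
centered = tag_relevance[:, keep] - tag_relevance[:, keep].mean(axis=0)
U, S, _ = np.linalg.svd(centered, full_matrices=False)
n_components = 2
tag_components = U[:, :n_components] * S[:n_components]

# Concatenate all feature groups into one vector per movie
movie_vectors = np.hstack([
    genre_features.to_numpy(dtype=float),
    bayes_rating.to_numpy()[:, None],
    decade.to_numpy(dtype=float)[:, None],
    tag_components,
])
print(movie_vectors.shape)  # (3, 7)
```

In this toy example, the movie rated 5.0 by only two users ends up with a lower adjusted rating than the movie rated 4.0 by 200 users. Features on very different scales, such as the raw decade, would likely need rescaling before being used in a similarity computation.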
