# Experiments - Content-Based Filtering

Recommends items similar to those a user liked, based on item/user attributes.

Some papers:
- [Survey on Collaborative Filtering, Content-based Filtering and Hybrid Recommendation System](https://www.academia.edu/download/59762468/10.1.1.695.642820190617-91457-z4s1rf.pdf)

Table of contents:
- [Random Forest](#Random-Forest)
- [XGBRanker](#XGBRanker)
- [FFNN](#FFNN)

The following features are available:

* **User Features**:
  - Past interaction history
  - User’s preferred tags, categories, or creators

* **Item Features**:
  - Video duration
  - Video watch ratio
  - Captions / tags / categories
  - Reports / likes / comments

* **Temporal Features**: Not used

## Imports

In [1]:
from datetime import datetime
import math
import sys
import os
import random
from cycler import cycler
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split, KFold

sys.path.insert(1, '../..')

from src.models.content_xgb_ranker import ContentBasedFilteringXGRanker
from src.models.content_random_forest import ContentBasedFilteringRF
from src.models.content_ffnn import ContentBasedFilteringNN
from src.pipelines.content_pipeline import ContentPipeline
from src.metrics.bench import bench_model
from src.utils.ground_truth import build_ground_truth_top_10_percent

from src.utils.random import set_seed

set_seed(45)

plt.rcParams["figure.figsize"] = (20, 13)
colors = plt.get_cmap('tab10').colors
plt.rc('axes', prop_cycle=cycler('color', colors))

%matplotlib inline
%config InlineBackend.figure_format = "retina"


## Preprocessing

The full implementation is in `src/pipelines/content_pipeline.py`.

### Preprocessing on User

- **categorical**: `is_live_streamer`, `is_user_full_active`
- **numerical**: followers, fans, whether the video matches the user's preferred category, whether the video matches the user's preferred upload type, `video_duration_diff`

### Preprocessing on Videos

- **categorical**: upload_type (18 dummies), first video category (39 dummies)
- **text**: tags (TF-IDF), captions (TF-IDF)
- **numerical**: engagement metrics (likes, comments, shares, reports, and derived ratios), `video_duration`, `is_add`

### Final Aggregation

- **Stores representations**: `self.users`, `self.videos`
- **Column groups**  
  - `cat_cols`: `upload_type`  
  - `binary_cols`: flags (ad, matches, weekend, user activity, etc.)  
  - `num_cols`: engagement & popularity metrics
- **Feature engineering**
  - `category_match`, `upload_type_match`, `duration_diff`
- **Merging**: join interactions with pre-processed `users` & `videos`
- **Preprocessing steps**  
  1. Drop / fill missing (medians for numeric, modes for categorical)  
  2. One-hot encode categorical (`OneHotEncoder`)  
  3. Scale numeric (`StandardScaler`)  
  4. TF-IDF on `tags_caption_cover` (removed because of performance issues)
- **`fit_transform`**  
  - Cleans interactions, computes `engagement` + `is_weekend`  
  - Builds `videos` via `preprocess_videos`, `users` via `preprocess_users`  
  - Learns medians, modes, OHE, scaler; saves parquet `train_processed_data`
- **`transform`**  
  - Reuses learned encoders/scaler on new data  
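
The split between `fit_transform` and `transform` matters because imputation and scaling statistics should come from the training interactions only. The sketch below illustrates that idea with plain NumPy. `NumericPreprocessor` is a hypothetical stand-in written for this example, not the `ContentPipeline` class, which relies on scikit-learn's `OneHotEncoder` and `StandardScaler`.

```python
import numpy as np


class NumericPreprocessor:
    """Hypothetical stand-in: median imputation followed by standard scaling."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Statistics are learned once, on the training data only.
        self.medians_ = np.nanmedian(X, axis=0)
        filled = np.where(np.isnan(X), self.medians_, X)
        self.means_ = filled.mean(axis=0)
        stds = filled.std(axis=0)
        # Guard against constant columns to avoid dividing by zero.
        self.stds_ = np.where(stds == 0, 1.0, stds)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # New data is imputed and scaled with the training statistics.
        filled = np.where(np.isnan(X), self.medians_, X)
        return (filled - self.means_) / self.stds_


train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
test = np.array([[np.nan, 20.0]])

prep = NumericPreprocessor().fit(train)
train_scaled = prep.transform(train)
test_scaled = prep.transform(test)
```

The missing test value is replaced by the training median, so the test row is never used to estimate any statistic.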

In [2]:
pipeline = ContentPipeline(tfidf_max_features=32)

train_path = "../../save/data_content_preprocessed.parquet"
test_path  = "../../save/data_test_content_preprocessed.parquet"

if not os.path.exists(train_path) or not os.path.exists(test_path):
    interactions_test = pd.read_csv("../data_final_project/KuaiRec 2.0/data/small_matrix.csv")
    interactions_train = pd.read_csv("../data_final_project/KuaiRec 2.0/data/big_matrix.csv")
    video_features = pd.read_csv("../data_final_project/KuaiRec 2.0/data/kuairec_caption_category.csv", lineterminator='\n')
    video_categories = pd.read_csv("../data_final_project/KuaiRec 2.0/data/item_categories.csv")
    video_daily = pd.read_csv("../data_final_project/KuaiRec 2.0/data/item_daily_features.csv")
    user_features = pd.read_csv("../data_final_project/KuaiRec 2.0/data/user_features.csv")
    df_train = pipeline.fit_transform(
        interactions=interactions_train,
        video_features=video_features,
        video_daily=video_daily,
        video_categories=video_categories,
        user_features=user_features,
    )
    df_test = pipeline.transform(
        interactions=interactions_test,
        video_features=video_features,
        video_daily=video_daily,
        video_categories=video_categories,
        user_features=user_features,
    )
    df_train.to_parquet(train_path, compression="gzip")
    df_test.to_parquet(test_path, compression="gzip")
else:
    df_train = pd.read_parquet(train_path)
    df_test  = pd.read_parquet(test_path)
    # FIX UP because loading it all takes so much time ...
    for col in df_test.select_dtypes(include=['int', 'int64', 'int32']).columns:
        df_test[col] = df_test[col].astype(np.int16)
    for col in df_test.select_dtypes(include=['float', 'float64']).columns:
        df_test[col] = df_test[col].astype(np.float32)
    df_test.drop(columns=['user_active_degree'], inplace=True)
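
The fix-up above casts every integer column to `int16`. NumPy's `astype` does not raise when a value falls outside the target range; it wraps silently, so a range check before downcasting is a cheap safeguard. The helper below is a hypothetical sketch (`downcast_int` is not part of the project) and the ID values are made up for illustration.

```python
import numpy as np


def downcast_int(values):
    """Return `values` in the smallest of int16/int32 that can hold them, else unchanged."""
    values = np.asarray(values)
    if values.size == 0:
        return values
    lo, hi = values.min(), values.max()
    for dtype in (np.int16, np.int32):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return values.astype(dtype)
    return values


small_ids = downcast_int(np.array([0, 7_000, 10_000], dtype=np.int64))  # fits in int16
large_ids = downcast_int(np.array([0, 40_000], dtype=np.int64))         # needs int32

# A blind cast wraps around instead of failing.
wrapped = int(np.array([40_000], dtype=np.int64).astype(np.int16)[0])
```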

In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10300969 entries, 0 to 10300968
Data columns (total 41 columns):
 #   Column                              Dtype  
---  ------                              -----  
 0   video_id                            int16  
 1   user_id                             int16  
 2   engagement                          float32
 3   like_play_ratio                     float32
 4   comment_play_ratio                  float32
 5   share_play_ratio                    float32
 6   like_cancel_ratio                   float32
 7   video_duration                      float32
 8   is_add                              int16  
 9   like_to_comment_ratio               float32
 10  follow_play_ratio                   float32
 11  follow_cancel_ratio                 float32
 12  is_lowactive_period                 int16  
 13  is_live_streamer                    int16  
 14  is_video_author                     int16  
 15  follower_fan_ratio                  float32
 16

## Random Forest

Our first model is a Random Forest regressor that tries to minimize the gap between predicted and actual engagement.

* implementation: `src/models/content_random_forest.py`
* RandomForestRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
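
A regressor only produces one score per (user, video) row; the recommendation list comes from sorting those scores within each user. The sketch below shows that step in isolation. `top_n_per_user` is a hypothetical helper and is not necessarily how `ContentBasedFilteringRF.recommend` is implemented; the return shape (user mapped to a list of `(item_id, score)` pairs) is chosen to match how the notebook consumes the recommendations.

```python
from collections import defaultdict


def top_n_per_user(user_ids, item_ids, scores, top_n):
    """Group predicted scores by user and keep the `top_n` best items per user."""
    per_user = defaultdict(list)
    for user, item, score in zip(user_ids, item_ids, scores):
        per_user[user].append((item, score))
    return {
        user: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:top_n]
        for user, pairs in per_user.items()
    }


recs = top_n_per_user(
    user_ids=[14, 14, 14, 19, 19],
    item_ids=[101, 102, 103, 101, 104],
    scores=[0.2, 0.9, 0.5, 0.7, 0.1],
    top_n=2,
)
```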

### Model & Hyperparams

In [3]:
X_cols = [col for col in df_train.columns if col != "engagement"]
X_train = df_train[X_cols].to_numpy(copy=False)
y_train = df_train["engagement"]
groups_train = df_train["user_id"].to_numpy(copy=False)

In [6]:
mask = ~np.isnan(X_train).any(axis=1)
X_train_clean = X_train[mask]
y_train_clean = y_train[mask]
groups_train_clean = groups_train[mask]

In [7]:
os.makedirs("../../save/content_random_forest", exist_ok=True)

rf_param_list = [
    {"max_depth": 6, "n_estimators": 25},
    {"max_depth": 10, "n_estimators": 50},
    {"max_depth": 12, "n_estimators": 80},
]

for param in rf_param_list:
    print(f"Training Random Forest with params: {param}")
    model = ContentBasedFilteringRF()
    model.fit(
        X_train_clean, y_train_clean, groups_train_clean,
        params={
            'n_estimators': param["n_estimators"],
            'max_depth': param["max_depth"],
            'min_samples_split': 10,
            'min_samples_leaf': 5,
            'max_features': 'sqrt',
            'random_state': 42,
            'n_jobs': -1
        }
    )
    save_path = f"../../save/content_random_forest/rf_md{param['max_depth']}_ne{param['n_estimators']}"
    model.save(save_path)

Training Random Forest with params: {'max_depth': 6, 'n_estimators': 25}
2025-05-16 07:49:37,503 - [INFO] {src.models.base} - Starting training with parameters: {'n_estimators': 25, 'max_depth': 6, 'min_samples_split': 10, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'random_state': 42, 'n_jobs': -1}
2025-05-16 07:49:37,504 - [INFO] {src.models.base} - Using 3-fold cross-validation
2025-05-16 07:50:43,004 - [INFO] {src.models.base} - Fold 1 NDCG: 0.461572
2025-05-16 07:51:51,525 - [INFO] {src.models.base} - Fold 2 NDCG: 0.459078
2025-05-16 07:52:50,001 - [INFO] {src.models.base} - Fold 3 NDCG: 0.450828

fit took 192.5476 seconds
Training Random Forest with params: {'max_depth': 10, 'n_estimators': 50}
2025-05-16 07:55:20,065 - [INFO] {src.models.base} - Fold 1 NDCG: 0.470651
2025-05-16 07:57:48,607 - [INFO] {src.models.base} - Fold 2 NDCG: 0.469463
2025-05-16 08:00:19,594 - [INFO] {src.models.base} - Fold 3 NDCG: 0.461536

fit took 449.5576 seconds
Training Random Forest with params: {'max_depth': 12, 'n_estimators': 80}
2025-05-16 08:04:53,698 - [INFO] {src.models.base} - Fold 1 NDCG: 0.472044
2025-05-16 08:09:35,810 - [INFO] {src.models.base} - Fold 2 NDCG: 0.471470
2025-05-16 08:14:17,514 - [INFO] {src.models.base} - Fold 3 NDCG: 0.462552

fit took 837.8924 seconds
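
Averaging the fold scores makes the trade-off between the three configurations easier to read. The values below are copied from the training log above; nothing else is assumed.

```python
# Fold NDCG values and fit times copied from the training log, keyed by (max_depth, n_estimators).
cv_log = {
    (6, 25): {"ndcg": [0.461572, 0.459078, 0.450828], "fit_seconds": 192.5476},
    (10, 50): {"ndcg": [0.470651, 0.469463, 0.461536], "fit_seconds": 449.5576},
    (12, 80): {"ndcg": [0.472044, 0.471470, 0.462552], "fit_seconds": 837.8924},
}

# Mean cross-validated NDCG and fit time per configuration.
summary = {
    params: (sum(run["ndcg"]) / len(run["ndcg"]), run["fit_seconds"])
    for params, run in cv_log.items()
}

for (max_depth, n_estimators), (mean_ndcg, seconds) in summary.items():
    print(f"max_depth={max_depth:>2} n_estimators={n_estimators:>2} "
          f"mean NDCG={mean_ndcg:.4f} fit={seconds:.0f}s")
```

The deepest configuration has the highest mean NDCG, but it is only about 0.0015 above `max_depth=10, n_estimators=50` while taking roughly 1.9 times as long to fit.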


### Evaluation

In [4]:
df_train_pivot = df_train.drop_duplicates(['user_id', 'video_id'], keep='first')\
            .pivot(index='user_id', columns='video_id', values='engagement')
df_test_pivot = df_test.drop_duplicates(['user_id', 'video_id'], keep='first')\
            .pivot(index='user_id', columns='video_id', values='engagement')
train_ground_truth = build_ground_truth_top_10_percent(df_train_pivot)
test_ground_truth = build_ground_truth_top_10_percent(df_test_pivot)
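
The ground truth is built per user from the engagement pivot. The implementation of `build_ground_truth_top_10_percent` is not shown here, so the sketch below is only one plausible reading of its name: for each user, keep the videos whose engagement is in that user's top 10%. The helper `top_10_percent_items`, the dict-of-dicts input, and the "at least one video" rule are assumptions made for this example.

```python
def top_10_percent_items(engagement_by_user):
    """For each user, keep the videos in that user's top 10% by engagement."""
    ground_truth = {}
    for user, video_scores in engagement_by_user.items():
        ranked = sorted(video_scores, key=video_scores.get, reverse=True)
        # Assumption: keep at least one video so no user ends up with an empty set.
        keep = max(1, len(ranked) // 10)
        ground_truth[user] = ranked[:keep]
    return ground_truth


engagement_by_user = {
    14: {video_id: video_id / 20 for video_id in range(1, 21)},  # 20 watched videos
    19: {7: 0.3, 8: 1.2, 9: 0.1},                                # 3 watched videos
}
truth = top_10_percent_items(engagement_by_user)
```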

In [5]:
best_rf_model = ContentBasedFilteringRF()
best_rf_model.load(f"../../save/content_random_forest/rf_md10_ne50_best.pkl")

2025-05-16 11:11:31,696 - [INFO] {src.models.base} - Model loaded from ../../save/content_random_forest/rf_md10_ne50_best.pkl


<src.models.content_random_forest.ContentBasedFilteringRF at 0x7ff0eed8ea50>

In [6]:
X_cols = [col for col in df_test.columns if col != "engagement"]
X_test = df_test[X_cols].to_numpy(copy=False)
y_test = df_test["engagement"]
mask = ~np.isnan(X_test).any(axis=1)
X_test_clean = X_test[mask]
y_test_clean = y_test[mask]
all_user_item_ids = X_test_clean[: , :2] # video_id, user_id

In [7]:
K = 100
N_USERS = 10

user_ids = [14, 19, 21, 23]

mask = np.isin(X_test_clean[:, 1], user_ids) # second column is user id
X_subset = X_test_clean[mask]
all_user_item_ids_subset = all_user_item_ids[mask]

recommendations = best_rf_model.recommend(
    X=X_subset,
    user_ids=user_ids,
    all_user_item_ids=all_user_item_ids_subset,
    top_n=K
)

for user in user_ids:
    recommended_items = [item_id for item_id, _ in recommendations.get(user, [])]

    test_hits = len(set(test_ground_truth.get(user, [])).intersection(recommended_items))
    print(f"user {user}: {test_hits} / {K} test recommendations in ground truth")

2025-05-16 11:11:32,476 - [INFO] {src.models.base} - Generating predictions for 12571 samples...
2025-05-16 11:11:32,476 - [INFO] {src.models.base} - Using best model for prediction


predict took 0.0141 seconds
recommend took 0.0215 seconds
user 14: 44 / 100 test recommendations in ground truth
user 19: 38 / 100 test recommendations in ground truth
user 21: 57 / 100 test recommendations in ground truth
user 23: 20 / 100 test recommendations in ground truth


In [8]:
test_sample_user_ids = random.sample(list(np.unique(X_test_clean[:, 1])), N_USERS)
test_sample_recommendations = {}
print(f"Selected users: {test_sample_user_ids}")

mask = np.isin(X_test_clean[:, 1], test_sample_user_ids) # second column is user id
X_subset = X_test_clean[mask]
all_user_item_ids_subset = all_user_item_ids[mask]

test_sample_recommendations_with_scores = best_rf_model.recommend(
    X=X_subset,
    user_ids=test_sample_user_ids,
    all_user_item_ids=all_user_item_ids_subset,
    top_n=K
)
test_sample_recommendations = {
    user: [item_id for item_id, _ in item_score_list]
    for user, item_score_list in test_sample_recommendations_with_scores.items()
}
test_sample_ground_truth = {user: test_ground_truth.get(user, []) for user in test_sample_user_ids}

2025-05-16 11:11:32,644 - [INFO] {src.models.base} - Generating predictions for 31485 samples...
2025-05-16 11:11:32,645 - [INFO] {src.models.base} - Using best model for prediction


Selected users: [2812.0, 4461.0, 5096.0, 2686.0, 946.0, 3221.0, 3625.0, 242.0, 847.0, 5066.0]
predict took 0.0373 seconds
recommend took 0.0513 seconds
{2812.0: [9178.0, 314.0, 154.0, 4040.0, 5464.0, 600.0, 7383.0, 1305.0, 8524.0, 8366.0, 5525.0, 9910.0, 6787.0, 9815.0, 4123.0, 6222.0, 2130.0, 10206.0, 2337.0, 7594.0, 211.0, 8958.0, 3344.0, 2590.0, 7375.0, 5995.0, 2343.0, 4932.0, 9907.0, 8298.0, 8732.0, 5365.0, 2263.0, 701.0, 3338.0, 619.0, 4367.0, 10062.0, 7135.0, 3723.0, 7049.0, 1445.0, 2894.0, 9199.0, 8186.0, 723.0, 4637.0, 3947.0, 4282.0, 10519.0, 9850.0, 6870.0, 9261.0, 3107.0, 3118.0, 5716.0, 10500.0, 6985.0, 9892.0, 3211.0, 7559.0, 4858.0, 4665.0, 2354.0, 9786.0, 8629.0, 4689.0, 7085.0, 768.0, 2396.0, 905.0, 7190.0, 5466.0, 8743.0, 4094.0, 9794.0, 624.0, 3904.0, 1372.0, 217.0, 9965.0, 1412.0, 2178.0, 3597.0, 7284.0, 471.0, 2383.0, 5666.0, 4312.0, 289.0, 4101.0, 8531.0, 690.0, 4025.0, 9812.0, 8464.0, 10037.0, 4663.0, 8819.0, 2312.0], 4461.0: [4040.0, 9178.0, 314.0, 154.0, 1305.0,

In [9]:
print(f"Testing: k={K}, users={N_USERS}\n")
bench_model(recommendations=test_sample_recommendations, ground_truth=test_sample_ground_truth, k=K)

Testing: k=100, users=10

NDCG@100 = 0.6269
MAP@100 = 0.5730
MAR@100 = 0.1746
F1@100 = 0.2676
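
For readers unfamiliar with these metrics, the sketch below gives standard binary-relevance definitions of precision, recall, NDCG and F1 at a cutoff `k`. The internals of `bench_model` are not shown in this notebook, so these functions are assumptions and may differ from the project's code, in particular in how MAP and MAR are averaged across users.

```python
import math


def precision_recall_at_k(recommended, relevant, k):
    """Fraction of the top-k that is relevant, and fraction of relevant items retrieved."""
    relevant = set(relevant)
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: a hit at 0-based rank i contributes 1 / log2(i + 2)."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


recommended = [5, 3, 9, 1]
relevant = [3, 1, 7]
p, r = precision_recall_at_k(recommended, relevant, k=4)
ndcg = ndcg_at_k(recommended, relevant, k=4)

# The F1 reported above is consistent with the harmonic mean of MAP@100 and MAR@100.
reported_f1 = f1(0.5730, 0.1746)
```

The low recall next to a much higher precision is expected at `k=100` when a user's ground-truth set is larger than the recommendation list.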


## XGBRanker

XGBRanker tries to maximize NDCG over user and item representations built from carefully selected features (see the EDAs for more information):
- Video features:
    - **categorical**: upload_type (18 dummies), first video category (39 dummies)
    - **text**: tags (TF-IDF), captions (TF-IDF)
    - **numerical**: engagement metrics (likes, comments, shares, reports, and derived ratios), `video_duration`, `is_add`
- User features:
    - **categorical**: `is_live_streamer`, `is_user_full_active`
    - **numerical**: followers, fans, whether the video matches the user's preferred category, whether the video matches the user's preferred upload type, `video_duration_diff`

* implementation: `src/models/content_xgb_ranker.py`
* XGBRanker: https://xgboost.readthedocs.io/en/latest/tutorials/learning_to_rank.html
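
A learning-to-rank objective such as `rank:ndcg` compares rows within a query group, which here means one group per user. XGBoost expects the rows of each group to be contiguous, so the interactions are usually sorted by user first. How `ContentBasedFilteringXGRanker.fit` consumes `groups_train` is not shown in this notebook; the helper `sort_by_user` below is a hypothetical sketch of the preparation step only.

```python
import numpy as np


def sort_by_user(X, y, user_ids):
    """Reorder rows so each user's interactions are contiguous; also return group sizes."""
    user_ids = np.asarray(user_ids)
    # A stable sort keeps the original interaction order inside each user.
    order = np.argsort(user_ids, kind="stable")
    _, group_sizes = np.unique(user_ids[order], return_counts=True)
    return np.asarray(X)[order], np.asarray(y)[order], group_sizes


X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5]])
y = np.array([3, 0, 1, 2, 5])
user_ids = np.array([3, 1, 3, 2, 1])

X_sorted, y_sorted, group_sizes = sort_by_user(X, y, user_ids)
```

`group_sizes` lists the number of rows per user in sorted order, which is the form accepted by the `group` argument of `xgboost.XGBRanker.fit`.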

### Model & Hyperparams

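Engagement is mapped to integer relevance labels between 0 and 20 with a quartile-based piecewise-linear scaling, so that each quartile of the engagement distribution spans five label values.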

In [3]:
X_cols = [col for col in df_train.columns if col != "engagement"]
X_train = df_train[X_cols].to_numpy(copy=False)
y_train = df_train["engagement"]
groups_train = df_train["user_id"].to_numpy(copy=False)

q1 = np.percentile(y_train, 25)
q2 = np.percentile(y_train, 50)
q3 = np.percentile(y_train, 75)
max_val = np.max(y_train)

y_scaled = np.zeros_like(y_train)

mask1 = (y_train <= q1) & (y_train >= 0)
if q1 > 0:
    y_scaled[mask1] = np.round(5 * y_train[mask1] / q1)
mask2 = (y_train > q1) & (y_train <= q2)
if q2 > q1:
    y_scaled[mask2] = np.round(5 + 5 * (y_train[mask2] - q1) / (q2 - q1))
mask3 = (y_train > q2) & (y_train <= q3)
if q3 > q2:
    y_scaled[mask3] = np.round(10 + 5 * (y_train[mask3] - q2) / (q3 - q2))
mask4 = (y_train > q3)
if max_val > q3:
    y_scaled[mask4] = np.round(15 + 5 * (y_train[mask4] - q3) / (max_val - q3))
y_scaled = y_scaled.astype(int)
print(y_scaled)

[15 15  1 ...  8  5 11]
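
The piecewise scaling above can be written compactly with `np.interp`, which makes it easy to compare against a plain linear rescaling on a small example. This is a sketch under the assumption that the breakpoints are strictly increasing (`0 < q1 < q2 < q3 < max`); the cell above guards against ties explicitly, and `quartile_labels` is a name introduced here for illustration.

```python
import numpy as np


def quartile_labels(y, max_label=20):
    """Map non-negative targets to integer labels, a quarter of the label range per quartile."""
    y = np.asarray(y, dtype=float)
    q1, q2, q3 = np.percentile(y, [25, 50, 75])
    # np.interp needs increasing breakpoints: assumes 0 < q1 < q2 < q3 < max(y).
    breakpoints = [0.0, q1, q2, q3, float(y.max())]
    targets = np.linspace(0, max_label, num=5)  # 0, 5, 10, 15, 20
    return np.round(np.interp(y, breakpoints, targets)).astype(int)


y = np.array([0.0, 1.0, 2.0, 3.0, 100.0])  # one extreme outlier
linear = np.round(y / y.max() * 20).astype(int)
by_quartile = quartile_labels(y)
```

With a single outlier, linear rescaling squeezes almost every sample into labels 0 and 1, while the quartile-based version still spreads them over the full range.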


In [4]:
os.makedirs("../../save/content_xgb_ranker", exist_ok=True)

xgb_params_list = [
    #{"max_depth": 4, "n_estimators": 20, "tree_method": "hist"},
    #{"max_depth": 6, "n_estimators": 50, "tree_method": "hist"},
    #{"max_depth": 10, "n_estimators": 35, "tree_method": "hist"},
    #{"max_depth": 12, "n_estimators": 25, "tree_method": "hist"},
    #{"max_depth": 4, "n_estimators": 25, "tree_method": "approx"},
    {"max_depth": 8, "n_estimators": 75, "tree_method": "approx"},
]

for param in xgb_params_list:
    print(f"Training XGB Ranker with params: {param}")
    model = ContentBasedFilteringXGRanker()
    model.fit(
        X_train, y_scaled, groups_train,
        params={
            'objective': 'rank:ndcg',
            'eval_metric': 'ndcg@100',
            'learning_rate': 0.05,
            'ndcg_exp_gain': False,
            'max_depth': param["max_depth"],
            'min_child_weight': 50,
            'subsample': 0.8,
            'colsample_bytree': 0.7,
            'n_estimators': param["n_estimators"],
            'tree_method': param["tree_method"],
            'random_state': 42,
        }
    )

    save_path = f"../../save/content_xgb_ranker/xgb_md{param['max_depth']}_ne{param['n_estimators']}_tm{param['tree_method']}"
    model.save(save_path)

2025-05-15 23:10:34,064 - [INFO] {src.models.base} - Starting training with parameters: {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@100', 'learning_rate': 0.05, 'ndcg_exp_gain': False, 'max_depth': 8, 'min_child_weight': 50, 'subsample': 0.8, 'colsample_bytree': 0.7, 'n_estimators': 75, 'tree_method': 'approx', 'random_state': 42}
2025-05-15 23:10:34,065 - [INFO] {src.models.base} - Using 3-fold cross-validation


Training XGB Ranker with params: {'max_depth': 8, 'n_estimators': 75, 'tree_method': 'approx'}


Cross-validation:   0%|                                                 | 0/3 [00:00<?, ?it/s]2025-05-15 23:10:34,158 - [INFO] {src.models.base} - 
Fold 1/3
Parameters: { "n_estimators" } are not used.

  self.starting_round = model.num_boosted_rounds()


[0]	train-ndcg@100:0.67943	validation-ndcg@100:0.68371
[10]	train-ndcg@100:0.68238	validation-ndcg@100:0.68617
[20]	train-ndcg@100:0.68482	validation-ndcg@100:0.68818
[30]	train-ndcg@100:0.68562	validation-ndcg@100:0.68857
[40]	train-ndcg@100:0.68620	validation-ndcg@100:0.68880
[50]	train-ndcg@100:0.68642	validation-ndcg@100:0.68909
[60]	train-ndcg@100:0.68682	validation-ndcg@100:0.68950
[70]	train-ndcg@100:0.68709	validation-ndcg@100:0.68959


Cross-validation:   0%|                                                 | 0/3 [05:08<?, ?it/s]


KeyboardInterrupt: 

### Evaluation

In [10]:
df_train_pivot = df_train.drop_duplicates(['user_id', 'video_id'], keep='first')\
            .pivot(index='user_id', columns='video_id', values='engagement')
df_test_pivot = df_test.drop_duplicates(['user_id', 'video_id'], keep='first')\
            .pivot(index='user_id', columns='video_id', values='engagement')
train_ground_truth = build_ground_truth_top_10_percent(df_train_pivot)
test_ground_truth = build_ground_truth_top_10_percent(df_test_pivot)

In [11]:
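best_xgb_model = ContentBasedFilteringXGRanker()
# FIXME: best_xgb_model is never loaded; this reloads the Random Forest checkpoint, and the
# cells below call best_rf_model, so their results are Random Forest results.
best_rf_model.load("../../save/content_random_forest/rf_md10_ne50_best.pkl")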

2025-05-16 11:12:04,423 - [INFO] {src.models.base} - Model loaded from ../../save/content_random_forest/rf_md10_ne50_best.pkl


<src.models.content_random_forest.ContentBasedFilteringRF at 0x7ff0eed8ea50>

In [12]:
X_cols = [col for col in df_test.columns if col != "engagement"]
X_test = df_test[X_cols].to_numpy(copy=False)
y_test = df_test["engagement"]
mask = ~np.isnan(X_test).any(axis=1)
X_test_clean = X_test[mask]
y_test_clean = y_test[mask]
all_user_item_ids = X_test_clean[: , :2] # video_id, user_id

In [13]:
K = 100
N_USERS = 10

user_ids = [14, 19, 21, 23]

mask = np.isin(X_test_clean[:, 1], user_ids) # second column is user id
X_subset = X_test_clean[mask]
all_user_item_ids_subset = all_user_item_ids[mask]

recommendations = best_rf_model.recommend(
    X=X_subset,
    user_ids=user_ids,
    all_user_item_ids=all_user_item_ids_subset,
    top_n=K
)

for user in user_ids:
    recommended_items = [item_id for item_id, _ in recommendations.get(user, [])]

    test_hits = len(set(test_ground_truth.get(user, [])).intersection(recommended_items))
    print(f"user {user}: {test_hits} / {K} test recommendations in ground truth")

2025-05-16 11:12:05,391 - [INFO] {src.models.base} - Generating predictions for 12571 samples...
2025-05-16 11:12:05,392 - [INFO] {src.models.base} - Using best model for prediction


predict took 0.0136 seconds
recommend took 0.0200 seconds
user 14: 44 / 100 test recommendations in ground truth
user 19: 38 / 100 test recommendations in ground truth
user 21: 57 / 100 test recommendations in ground truth
user 23: 20 / 100 test recommendations in ground truth


In [24]:
test_sample_user_ids = random.sample(list(np.unique(X_test_clean[:, 1])), N_USERS)
test_sample_recommendations = {}
print(f"Selected users: {test_sample_user_ids}")

mask = np.isin(X_test_clean[:, 1], test_sample_user_ids) # second column is user id
X_subset = X_test_clean[mask]
all_user_item_ids_subset = all_user_item_ids[mask]

test_sample_recommendations_with_scores = best_rf_model.recommend(
    X=X_subset,
    user_ids=test_sample_user_ids,
    all_user_item_ids=all_user_item_ids_subset,
    top_n=K
)
test_sample_recommendations = {
    user: [item_id for item_id, _ in item_score_list]
    for user, item_score_list in test_sample_recommendations_with_scores.items()
}
test_sample_ground_truth = {user: test_ground_truth.get(user, []) for user in test_sample_user_ids}

2025-05-16 11:12:31,830 - [INFO] {src.models.base} - Generating predictions for 31447 samples...
2025-05-16 11:12:31,831 - [INFO] {src.models.base} - Using best model for prediction


Selected users: [5265.0, 6409.0, 4528.0, 3101.0, 2220.0, 4439.0, 4140.0, 4037.0, 1590.0, 6039.0]
predict took 0.1712 seconds
recommend took 0.1928 seconds


In [25]:
print(f"Testing: k={K}, users={N_USERS}\n")
bench_model(recommendations=test_sample_recommendations, ground_truth=test_sample_ground_truth, k=K)

Testing: k=100, users=10

NDCG@100 = 0.6008
MAP@100 = 0.5520
MAR@100 = 0.1674
F1@100 = 0.2569


## FFNN

A feed-forward neural network (FFNN) trained on the same feature matrix as the previous models.

* implementation: `src/models/content_ffnn.py`

### Model & Hyperparams

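Unlike the XGBRanker section, the target is not rescaled here: the network is trained on the raw `engagement` values, after dropping rows that contain NaN features.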

In [3]:
X_cols = [col for col in df_train.columns if col != "engagement"]
X_train = df_train[X_cols].to_numpy(copy=False)
y_train = df_train["engagement"]
groups_train = df_train["user_id"].to_numpy(copy=False)
mask = ~np.isnan(X_train).any(axis=1)
X_train_clean = X_train[mask]
y_train_clean = y_train[mask]
groups_train_clean = groups_train[mask]

In [None]:
os.makedirs("../../save/content_nn_ranker", exist_ok=True)

nn_params_list = [
    # Architecture variations - smaller to larger networks
    #{
    #    "hidden_layers": [64],
    #    "dropout_rate": 0.2,
    #    "learning_rate": 0.01,
    #    "l2_reg": 0.001,
    #    "batch_size": 256,
    #    "epochs": 10,
    #    "patience": 10,
    #    "activation": "relu",
    #    "random_state": 42
    #},
    {
        "hidden_layers": [64, 32],
        "dropout_rate": 0.3,
        "learning_rate": 0.01,
        "l2_reg": 0.0005,
        "batch_size": 512,
        "epochs": 5,
        "patience": 2,
        "activation": "relu",
        "random_state": 42
    },
    {
        "hidden_layers": [128, 64, 32],
        "dropout_rate": 0.3,
        "learning_rate": 0.01,
        "l2_reg": 0.0005,
        "batch_size": 512, 
        "epochs": 5,
        "patience": 2,
        "activation": "relu",
        "random_state": 42
    },
    # Learning rate variations
    {
        "hidden_layers": [64, 32, 16],
        "dropout_rate": 0.3,
        "learning_rate": 0.1,  # Higher learning rate
        "l2_reg": 0.0005,
        "batch_size": 512,
        "epochs": 5,
        "patience": 2,
        "activation": "relu",
        "random_state": 42
    },
    {
        "hidden_layers": [64, 32, 16],
        "dropout_rate": 0.3,
        "learning_rate": 0.0001,
        "l2_reg": 0.0005,
        "batch_size": 512,
        "epochs": 5,
        "patience": 2,
        "activation": "relu",
        "random_state": 42
    },
    # Regularization variations
    {
        "hidden_layers": [64, 32, 16],
        "dropout_rate": 0.5,
        "learning_rate": 0.01,
        "l2_reg": 0.001,
        "batch_size": 512,
        "epochs": 5,
        "patience": 2,
        "activation": "relu",
        "random_state": 42
    },
    {
        "hidden_layers": [64, 32, 16],
        "dropout_rate": 0.2,
        "learning_rate": 0.001,
        "l2_reg": 0.01,
        "batch_size": 512,
        "epochs": 5,
        "patience": 2,
        "activation": "relu",
        "random_state": 42
    },
    # Activation function variations
    {
        "hidden_layers": [64, 32, 16],
        "dropout_rate": 0.3,
        "learning_rate": 0.001,
        "l2_reg": 0.0005,
        "batch_size": 512,
        "epochs": 5,
        "patience": 2,
        "activation": "elu",  # Different activation function
        "random_state": 42
    },
    # Batch size variations
    {
        "hidden_layers": [64, 32, 16],
        "dropout_rate": 0.3,
        "learning_rate": 0.001,
        "l2_reg": 0.0005,
        "batch_size": 128,
        "epochs": 5,
        "patience": 2,
        "activation": "relu",
        "random_state": 42
    },
    {
        "hidden_layers": [64, 32, 16],
        "dropout_rate": 0.3,
        "learning_rate": 0.001,
        "l2_reg": 0.0005,
        "batch_size": 512,
        "epochs": 5,
        "patience": 2,
        "activation": "relu",
        "random_state": 42
    }
]

def get_model_name(params):
    layers_str = "_".join(str(l) for l in params["hidden_layers"])
    return (f"nn_layers{layers_str}_dr{params['dropout_rate']}_"
            f"lr{params['learning_rate']}_l2{params['l2_reg']}_"
            f"bs{params['batch_size']}_act{params['activation']}")

for i, params in enumerate(nn_params_list):
    print(f"\n[{i+1}/{len(nn_params_list)}] Training Neural Network with params:")
    for key, value in params.items():
        print(f"  {key}: {value}")

    model = ContentBasedFilteringNN()
    model.fit(
        X_train_clean, y_train_clean, groups_train_clean,
        params=params
    )

    model_name = get_model_name(params)
    save_path = f"../../save/content_nn_ranker/{model_name}.keras"
    model.save(save_path)
    
    print(f"Model saved to {save_path}")

2025-05-17 09:26:22,814 - [INFO] {src.models.base} - Starting training with parameters: {'hidden_layers': [64, 32], 'dropout_rate': 0.3, 'learning_rate': 0.01, 'l2_reg': 0.0005, 'batch_size': 512, 'epochs': 5, 'patience': 2, 'activation': 'relu', 'random_state': 42}



[1/9] Training Neural Network with params:
  hidden_layers: [64, 32]
  dropout_rate: 0.3
  learning_rate: 0.01
  l2_reg: 0.0005
  batch_size: 512
  epochs: 5
  patience: 2
  activation: relu
  random_state: 42



Epoch 1/5


2025-05-17 09:26:24.200944: W external/local_xla/xla/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1570444320 exceeds 10% of free system memory.


19171/19171 ━━━━━━━━━━━━━━━━━━━━ 175s 9ms/step - loss: 5.2921 - learning_rate: 0.0100
Epoch 2/5
19171/19171 ━━━━━━━━━━━━━━━━━━━━ 172s 9ms/step - loss: 5.2719 - learning_rate: 0.0100
Epoch 3/5
19171/19171 ━━━━━━━━━━━━━━━━━━━━ 181s 9ms/step - loss: 5.2718 - learning_rate: 0.0100
Epoch 4/5
19171/19171 ━━━━━━━━━━━━━━━━━━━━ 189s 10ms/step - loss: 5.2715 - learning_rate: 0.0100
Epoch 5/5
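
The logged loss barely moves after the first epoch, which is the situation the `patience` hyperparameter is meant to handle. The sketch below shows the usual patience rule on the four loss values from the log above. Which quantity `ContentBasedFilteringNN` monitors, and whether it applies a minimum improvement threshold, is not shown in this notebook; `stopping_epoch` and `min_delta` are assumptions made for illustration.

```python
def stopping_epoch(losses, patience, min_delta=0.0):
    """Return the 1-based epoch where patience-based early stopping halts, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            # Improvement large enough to count: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


logged_losses = [5.2921, 5.2719, 5.2718, 5.2715]  # epochs 1 to 4 from the log
never_stops = stopping_epoch(logged_losses, patience=2)
stops_early = stopping_epoch(logged_losses, patience=2, min_delta=0.001)
```

Without a threshold, every tiny decrease counts as progress and training runs to the last epoch; with a threshold of 0.001, the same losses would trigger a stop at epoch 4.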


### Evaluation

In [10]:
df_train_pivot = df_train.drop_duplicates(['user_id', 'video_id'], keep='first')\
            .pivot(index='user_id', columns='video_id', values='engagement')
df_test_pivot = df_test.drop_duplicates(['user_id', 'video_id'], keep='first')\
            .pivot(index='user_id', columns='video_id', values='engagement')
train_ground_truth = build_ground_truth_top_10_percent(df_train_pivot)
test_ground_truth = build_ground_truth_top_10_percent(df_test_pivot)

In [11]:
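best_nn_model = ContentBasedFilteringNN()
# FIXME: the path below is a directory, not a saved `.keras` file, and the cells below still
# call best_rf_model, so the outputs in this section do not come from the neural network.
best_nn_model.load("../../save/content_nn_ranker/")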


In [12]:
X_cols = [col for col in df_test.columns if col != "engagement"]
X_test = df_test[X_cols].to_numpy(copy=False)
y_test = df_test["engagement"]
mask = ~np.isnan(X_test).any(axis=1)
X_test_clean = X_test[mask]
y_test_clean = y_test[mask]
all_user_item_ids = X_test_clean[: , :2] # video_id, user_id

In [13]:
K = 100
N_USERS = 10

user_ids = [14, 19, 21, 23]

mask = np.isin(X_test_clean[:, 1], user_ids) # second column is user id
X_subset = X_test_clean[mask]
all_user_item_ids_subset = all_user_item_ids[mask]

recommendations = best_rf_model.recommend(
    X=X_subset,
    user_ids=user_ids,
    all_user_item_ids=all_user_item_ids_subset,
    top_n=K
)

for user in user_ids:
    recommended_items = [item_id for item_id, _ in recommendations.get(user, [])]

    test_hits = len(set(test_ground_truth.get(user, [])).intersection(recommended_items))
    print(f"user {user}: {test_hits} / {K} test recommendations in ground truth")

2025-05-16 11:12:05,391 - [INFO] {src.models.base} - Generating predictions for 12571 samples...
2025-05-16 11:12:05,392 - [INFO] {src.models.base} - Using best model for prediction


predict took 0.0136 seconds
recommend took 0.0200 seconds
user 14: 44 / 100 test recommendations in ground truth
user 19: 38 / 100 test recommendations in ground truth
user 21: 57 / 100 test recommendations in ground truth
user 23: 20 / 100 test recommendations in ground truth


In [24]:
test_sample_user_ids = random.sample(list(np.unique(X_test_clean[:, 1])), N_USERS)
test_sample_recommendations = {}
print(f"Selected users: {test_sample_user_ids}")

mask = np.isin(X_test_clean[:, 1], test_sample_user_ids) # second column is user id
X_subset = X_test_clean[mask]
all_user_item_ids_subset = all_user_item_ids[mask]

test_sample_recommendations_with_scores = best_rf_model.recommend(
    X=X_subset,
    user_ids=test_sample_user_ids,
    all_user_item_ids=all_user_item_ids_subset,
    top_n=K
)
test_sample_recommendations = {
    user: [item_id for item_id, _ in item_score_list]
    for user, item_score_list in test_sample_recommendations_with_scores.items()
}
test_sample_ground_truth = {user: test_ground_truth.get(user, []) for user in test_sample_user_ids}

2025-05-16 11:12:31,830 - [INFO] {src.models.base} - Generating predictions for 31447 samples...
2025-05-16 11:12:31,831 - [INFO] {src.models.base} - Using best model for prediction


Selected users: [5265.0, 6409.0, 4528.0, 3101.0, 2220.0, 4439.0, 4140.0, 4037.0, 1590.0, 6039.0]
predict took 0.1712 seconds
recommend took 0.1928 seconds


In [25]:
print(f"Testing: k={K}, users={N_USERS}\n")
bench_model(recommendations=test_sample_recommendations, ground_truth=test_sample_ground_truth, k=K)

Testing: k=100, users=10

NDCG@100 = 0.6008
MAP@100 = 0.5520
MAR@100 = 0.1674
F1@100 = 0.2569
