# Expirements - Content-Based Filtering

Recommends items similar to those a user liked, based on item/user attributes.

Some papers:
- [Survey on Collaborative Filtering, Content-based
Filtering and Hybrid Recommendation System](https://www.academia.edu/download/59762468/10.1.1.695.642820190617-91457-z4s1rf.pdf)

Table of content:
- [NCF (xxx)]( #)

Here are the different usable features:

* **User Features**:
    - Past interactions history
    - User’s preferred tags, categories, or creators

* **Item Features**:
  - Video Duration
  - Video Watch Ratio
  - Captions / Tags / Categories
  - Reports / Likes / Comments / 

* **Temporal Features**: Not used

### Imports

In [1]:
from datetime import datetime
import random
import math
import logging
import time
from tqdm import tqdm
from collections import defaultdict
from functools import reduce
from typing import Dict, Any, Optional, List, Tuple

from cycler import cycler
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np
import xgboost as xgb
import tensorflow as tf
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold

#import jieba
import warnings

from random_utils import set_seed
from model_utils import normalize_ratings

set_seed(45)

plt.rcParams["figure.figsize"] = (20, 13)
colors = plt.get_cmap('tab10').colors
plt.rc('axes', prop_cycle=cycler('color', colors))

%matplotlib inline
%config InlineBackend.figure_format = "retina"

2025-05-05 14:24:47.998810: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-05 14:24:48.119482: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2025-05-05 14:24:48.214548: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1746455088.316332     542 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746455088.344856     542 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746455088.545434     542 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linkin

### Loading Datasets

In [2]:
interactions_test = pd.read_csv("../data_final_project/KuaiRec 2.0/data/small_matrix.csv")
interactions_test

Unnamed: 0,user_id,video_id,play_duration,video_duration,time,date,timestamp,watch_ratio
0,14,148,4381,6067,2020-07-05 05:27:48.378,20200705.0,1.593898e+09,0.722103
1,14,183,11635,6100,2020-07-05 05:28:00.057,20200705.0,1.593898e+09,1.907377
2,14,3649,22422,10867,2020-07-05 05:29:09.479,20200705.0,1.593898e+09,2.063311
3,14,5262,4479,7908,2020-07-05 05:30:43.285,20200705.0,1.593898e+09,0.566388
4,14,8234,4602,11000,2020-07-05 05:35:43.459,20200705.0,1.593899e+09,0.418364
...,...,...,...,...,...,...,...,...
4676565,7162,2267,11908,5467,,,,2.178160
4676566,7162,2065,11919,6067,,,,1.964562
4676567,7162,1296,16690,19870,,,,0.839960
4676568,7162,4822,11862,24400,,,,0.486148


In [3]:
interactions_train = pd.read_csv("../data_final_project/KuaiRec 2.0/data/big_matrix.csv")
interactions_train

Unnamed: 0,user_id,video_id,play_duration,video_duration,time,date,timestamp,watch_ratio
0,0,3649,13838,10867,2020-07-05 00:08:23.438,20200705,1.593879e+09,1.273397
1,0,9598,13665,10984,2020-07-05 00:13:41.297,20200705,1.593879e+09,1.244082
2,0,5262,851,7908,2020-07-05 00:16:06.687,20200705,1.593879e+09,0.107613
3,0,1963,862,9590,2020-07-05 00:20:26.792,20200705,1.593880e+09,0.089885
4,0,8234,858,11000,2020-07-05 00:43:05.128,20200705,1.593881e+09,0.078000
...,...,...,...,...,...,...,...,...
12530801,7175,1281,34618,140017,2020-09-05 15:07:10.576,20200905,1.599290e+09,0.247241
12530802,7175,3407,12619,21888,2020-09-05 15:08:45.228,20200905,1.599290e+09,0.576526
12530803,7175,10360,2407,7067,2020-09-05 19:10:29.041,20200905,1.599304e+09,0.340597
12530804,7175,10360,6455,7067,2020-09-05 19:10:36.995,20200905,1.599304e+09,0.913400


In [4]:
video_features = pd.read_csv("../data_final_project/KuaiRec 2.0/data/kuairec_caption_category.csv", lineterminator='\n')
video_features.head()

Unnamed: 0,video_id,manual_cover_text,caption,topic_tag,first_level_category_id,first_level_category_name,second_level_category_id,second_level_category_name,third_level_category_id,third_level_category_name
0,0,UNKNOWN,精神小伙路难走 程哥你狗粮慢点撒,[],8,颜值,673,颜值随拍,-124,UNKNOWN
1,1,UNKNOWN,,[],27,高新数码,-124,UNKNOWN,-124,UNKNOWN
2,2,UNKNOWN,晚饭后，运动一下！,[],9,喜剧,727,搞笑互动,-124,UNKNOWN
3,3,UNKNOWN,我平淡无奇，惊艳不了时光，温柔不了岁月，我只想漫无目的的走走，努力发笔小财，给自己买花 自己长大.,[],26,摄影,686,主题摄影,2434,景物摄影
4,4,五爱街最美美女 一天1q,#搞笑 #感谢快手我要上热门 #五爱市场 这真是完美搭配啊！,"[五爱市场,感谢快手我要上热门,搞笑]",5,时尚,737,营销售卖,2596,女装


In [5]:
video_categories = pd.read_csv("../data_final_project/KuaiRec 2.0/data/item_categories.csv")
video_categories.head()

Unnamed: 0,video_id,feat
0,0,[8]
1,1,"[27, 9]"
2,2,[9]
3,3,[26]
4,4,[5]


In [6]:
video_daily = pd.read_csv("../data_final_project/KuaiRec 2.0/data/item_daily_features.csv")
video_daily.head()
video_daily.columns

Index(['video_id', 'date', 'author_id', 'video_type', 'upload_dt',
       'upload_type', 'visible_status', 'video_duration', 'video_width',
       'video_height', 'music_id', 'video_tag_id', 'video_tag_name',
       'show_cnt', 'show_user_num', 'play_cnt', 'play_user_num',
       'play_duration', 'complete_play_cnt', 'complete_play_user_num',
       'valid_play_cnt', 'valid_play_user_num', 'long_time_play_cnt',
       'long_time_play_user_num', 'short_time_play_cnt',
       'short_time_play_user_num', 'play_progress', 'comment_stay_duration',
       'like_cnt', 'like_user_num', 'click_like_cnt', 'double_click_cnt',
       'cancel_like_cnt', 'cancel_like_user_num', 'comment_cnt',
       'comment_user_num', 'direct_comment_cnt', 'reply_comment_cnt',
       'delete_comment_cnt', 'delete_comment_user_num', 'comment_like_cnt',
       'comment_like_user_num', 'follow_cnt', 'follow_user_num',
       'cancel_follow_cnt', 'cancel_follow_user_num', 'share_cnt',
       'share_user_num', 'download

In [7]:
user_features = pd.read_csv("../data_final_project/KuaiRec 2.0/data/user_features.csv")
user_features

Unnamed: 0,user_id,user_active_degree,is_lowactive_period,is_live_streamer,is_video_author,follow_user_num,follow_user_num_range,fans_user_num,fans_user_num_range,friend_user_num,...,onehot_feat8,onehot_feat9,onehot_feat10,onehot_feat11,onehot_feat12,onehot_feat13,onehot_feat14,onehot_feat15,onehot_feat16,onehot_feat17
0,0,high_active,0,0,0,5,"(0,10]",0,0,0,...,184,6,3,0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,full_active,0,0,0,386,"(250,500]",4,"[1,10)",2,...,186,6,2,0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,full_active,0,0,0,27,"(10,50]",0,0,0,...,51,2,3,0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,full_active,0,0,0,16,"(10,50]",0,0,0,...,251,3,2,0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,full_active,0,0,0,122,"(100,150]",4,"[1,10)",0,...,99,4,2,0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7171,7171,full_active,0,0,1,52,"(50,100]",1,"[1,10)",0,...,259,1,4,0,1.0,0.0,0.0,0.0,0.0,0.0
7172,7172,full_active,0,0,0,45,"(10,50]",2,"[1,10)",2,...,11,2,0,0,1.0,0.0,0.0,0.0,0.0,0.0
7173,7173,full_active,0,0,0,615,500+,3,"[1,10)",2,...,51,2,2,0,1.0,0.0,0.0,0.0,0.0,0.0
7174,7174,full_active,0,0,0,959,500+,0,0,0,...,107,3,2,0,0.0,0.0,0.0,0.0,0.0,0.0


### XGBRanker

Our first model will use a a XGBRanker that will try to maximize the NDCG on user and item representation from carefuly selected features including (see EDAs for more information):
- Video features:
    - **categorical**: upload_type (18 dummies), first video category (39 dummies)
    - **text**: tags (TF-IDF), captions (TF-IDF)
    - **numerical**: engagement metrics (like, comments, shares, reports and other ratios), video_duration, is_add
- User features:
    - **categorical**: is_live_streamer, is_user_full_active, is_live_streamer
    - **numerical**: followers, fans, does video has its prefered category ?, does video has its prefered upload_type ?, video_duration_diff

#### Preprocessing

##### Preprocessing on User

First we preprocess the user features and videos features to create user representation

In [8]:
def preprocess_users(interactions: pd.DataFrame, user_features: pd.DataFrame, 
                    video_categories: pd.DataFrame, video_daily: pd.DataFrame) -> pd.DataFrame:
    ###########
    # Generic user stats
    user_stats = pd.DataFrame({'user_id': user_features['user_id']})
    user_stats['follower_fan_ratio'] = user_features['follow_user_num'] / (user_features['fans_user_num'] + 1)
    user_stats['popularity_score'] = user_features['follow_user_num'] + user_features['fans_user_num']
    user_stats['video_id_count'] = interactions.groupby('user_id').agg({'video_id': 'count'}).reset_index()['video_id']
    user_stats['is_user_active'] = (user_features['user_active_degree'] == 'full_active').astype(int)

    ###########
    # Additional user features
    user_features = user_features[['user_id', 'user_active_degree', 'is_lowactive_period', 'is_live_streamer', 'is_video_author']].copy()

    ###########
    # Get video information for user preference extraction
    video_info = video_daily[['video_id', 'video_duration', 'upload_type', 'video_type']].drop_duplicates('video_id')
    video_categories = video_categories.explode('feat').rename(columns={'feat': 'category_id'})
    video_info = pd.merge(video_info, video_categories, on='video_id', how='left')

    # Find user preferred categories and video types based on their interactions
    user_video_interactions = pd.merge(interactions[['video_id', 'user_id', 'watch_ratio']], video_info, on='video_id', how='left')

    # Get preferred categories
    top_categories = (user_video_interactions
        .groupby(['user_id', 'category_id'])
        .agg({
            'watch_ratio': 'sum',
            'video_id': 'count'
        })
        .reset_index()
        .sort_values(['user_id', 'watch_ratio'], ascending=[True, False])
        .groupby('user_id')
        .first()
        .reset_index()
        .rename(columns={'category_id': 'preferred_category'})
    )
    top_categories['preferred_category'] = top_categories['preferred_category'].astype(str).str.replace('[', '').str.replace(']', '')

    # Get preferred upload types
    top_upload_types = (user_video_interactions
        .groupby(['user_id', 'upload_type'])
        .agg({
            'watch_ratio': 'sum',
            'video_id': 'count'
        })
        .reset_index()
        .sort_values(['user_id', 'watch_ratio'], ascending=[True, False])
        .groupby('user_id')
        .first()
        .reset_index()
        .rename(columns={'upload_type': 'preferred_upload_type'})
    )

    # Get preferred video_type
    top_video_types = (user_video_interactions
        .groupby(['user_id', 'video_type'])
        .agg({
            'watch_ratio': 'sum',
            'video_id': 'count'
        })
        .reset_index()
        .sort_values(['user_id', 'watch_ratio'], ascending=[True, False])
        .groupby('user_id')
        .first()
        .reset_index()
        .rename(columns={'video_type': 'preferred_video_type'})
    )
    
    # Get preferred video duration (average of watched videos)
    user_duration_prefs = (user_video_interactions
        .groupby('user_id')
        .agg({
            'video_duration': 'mean'
        })
        .reset_index()
        .rename(columns={'video_duration': 'preferred_duration'})
    )

    # Merge all user preference data
    dfs = [
        user_features,
        user_stats,
        top_categories[['user_id', 'preferred_category']],
        top_video_types[['user_id', 'preferred_video_type']],
        top_upload_types[['user_id', 'preferred_upload_type']],
        user_duration_prefs
    ]
    users = reduce(lambda left, right: pd.merge(left, right, on='user_id', how='left'), dfs)

    # Fill NaN values
    fill_dict = {
        'video_id_count': 0,
        'follower_fan_ratio': 0,
        'popularity_score': 0,
        'preferred_category': -1,
        'preferred_duration': users['preferred_duration'].median() if not users['preferred_duration'].isna().all() else 0
    }
    users.fillna(fill_dict, inplace=True)

    print(f"Processed {len(users)} unique users")
    return users

##### Preprocessing on Videos

We do the same thing for videos with generic information about the video

In [9]:
 def preprocess_videos(video_features: pd.DataFrame, video_daily: pd.DataFrame, 
                         video_categories: pd.DataFrame) -> pd.DataFrame:
    # Computing generic video stats and information
    video_stats = video_daily.groupby('video_id').agg({
        'play_cnt': 'mean',
        'play_duration': 'mean',
        'like_cnt': 'mean',
        'cancel_like_cnt': 'mean',
        'comment_cnt': 'mean',
        #'reply_comment_cnt': 'mean',
        'share_cnt': 'mean',
        'download_cnt': 'mean',
        'report_cnt': 'mean',
        'follow_cnt': 'mean',
        'cancel_follow_cnt': 'mean',
        # first value for categorical/constant features
        'video_duration': 'first', 
        'video_type': 'first',
        #'video_tag_id': 'first',
        #'video_tag_name': 'first',
        #'author_id': 'first',
        'upload_type': 'first'
    }).reset_index()

    # Engagement metrics
    video_stats['like_play_ratio'] = video_stats['like_cnt'] / (video_stats['play_cnt'] + 1)
    video_stats['comment_play_ratio'] = video_stats['comment_cnt'] / (video_stats['play_cnt'] + 1)
    video_stats['share_play_ratio'] = video_stats['share_cnt'] / (video_stats['play_cnt'] + 1)
    video_stats['follow_play_ratio'] = video_stats['follow_cnt'] / (video_stats['play_cnt'] + 1)
    video_stats['follow_cancel_ratio'] = video_stats['cancel_follow_cnt'] / (video_stats['follow_cnt'] + 1)
    video_stats['like_cancel_ratio'] = video_stats['cancel_like_cnt'] / (video_stats['like_cnt'] + 1)
    video_stats['like_to_comment_ratio'] = video_stats['like_cnt'] / (video_stats['comment_cnt'] + 1)

    # Binarize cat
    video_stats["is_add"] = (video_stats["video_type"] == "AD").astype(int)

    # Keep all relevant columns
    video_stats = video_stats[['video_id', 'like_play_ratio', 'comment_play_ratio', 'share_play_ratio',
                'like_cancel_ratio', 'video_duration', 'is_add', 'like_to_comment_ratio', 
                'upload_type', 'follow_play_ratio', 'follow_cancel_ratio']]

    # Join with video features and categories
    videos = pd.merge(video_stats, video_features[['video_id', 'manual_cover_text', 'caption', 'topic_tag']], 
                     on='video_id', how='left').merge(video_categories, on='video_id', how='left')

    # Process text features if needed
    def parse_tags(tag_str):
        if isinstance(tag_str, str):
            tag_str = tag_str.strip("[]")
            tags = [tag.strip() for tag in tag_str.split(",") if tag.strip()]
            return tags[:10]
        return []

    # Parse tags and concatenate with caption
    videos['parsed_tags'] = videos['topic_tag'].apply(parse_tags)
    videos['caption'] = videos['caption'].fillna('')
    videos['manual_cover_text'] = videos['manual_cover_text'].apply(lambda x: '' if x == 'UNKNOWN' else x)
    videos['tags_caption_cover'] = videos.apply(
        lambda row: ' '.join(row['parsed_tags']) + ' ' + row['caption'] + ' ' + row['manual_cover_text'], axis=1
    )
    videos.drop(columns=['parsed_tags', 'caption', 'topic_tag', 'manual_cover_text'], inplace=True)
    return videos

#### Full Preprocessing pipeline

We implement the full pipeline with fit and transform steps after both user and video preprocessing steps are done

In [10]:
class PreprocessingXGRanker:
    def __init__(self, tfidf_max_features=64):
        #  Representations, should be computed once by fit
        self.users = None
        self.videos = None
        # Registering different features
        self.cat_cols = ['upload_type']
        self.binary_cols = ['is_add', 'category_match', 'upload_type_match', 'is_weekend', 'is_user_active',
                            'is_lowactive_period', 'is_live_streamer', 	'is_video_author']
        self.num_cols = ['duration_diff', 'like_play_ratio', 'comment_play_ratio',
            'share_play_ratio', 'follow_play_ratio', 'follow_cancel_ratio',
            'like_cancel_ratio', 'like_to_comment_ratio', 'popularity_score', 'follower_fan_ratio'
        ]
        self.target_col = "engagement"
        # Some variables that should be reused in transform step
        self.cat_values = {}
        self.num_cols_medians = {}
        self.cat_cols_modes = {}
        self.tfidf_max_features = tfidf_max_features
        self.tfidf_video = None
        self.standard_scaler = None
        self.ohe = None

    def _generate_features(self, df: pd.DataFrame) -> None:
        df['category_match'] = df.apply(lambda row: int(row['preferred_category'] in row['feat']), axis=1)
        df['upload_type_match'] = (df['preferred_upload_type'] == df['upload_type']).astype(int)
        #df['video_type_match'] = (df['preferred_video_type'] == df['video_type']).astype(int)
        df['duration_diff'] = abs(df['video_duration'] - df['preferred_duration'])

    def _merge_representations(self, df: pd.DataFrame) -> pd.DataFrame:
        df_merged = df[['video_id', 'user_id', 'engagement']].merge(self.videos, on='video_id', how='left')
        df_merged.dropna(subset=['video_id', 'user_id', 'engagement'], inplace=True)
        return df_merged.merge(self.users, on='user_id', how='left').dropna(subset=['video_id', 'user_id', 'engagement'])

    def _encode_categoricals(self, df: pd.DataFrame, is_fit: bool) -> None:
        if is_fit:
            self.ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
            encoded = self.ohe.fit_transform(df[self.cat_cols])
        else:
            encoded = self.ohe.transform(df[self.cat_cols])

        encoded_df = pd.DataFrame(
            encoded.astype(np.int16),
            columns=self.ohe.get_feature_names_out(self.cat_cols),
            index=df.index
        )

        df.drop(columns=self.cat_cols, inplace=True)
        df[encoded_df.columns] = encoded_df

    def _drop_and_fill_missing(self, df: pd.DataFrame, is_fit: bool) -> None:
        col_to_remove = (set(self.num_cols) | set(self.cat_cols) | set(self.binary_cols) | set(["tags_caption_cover"]) | set([self.target_col])) - set(df.columns)
        print(f"Removing columns: {col_to_remove}")
        df.drop(columns=list(col_to_remove), inplace=True, errors='ignore')

        # NUMERIC
        for col in self.num_cols:
            if is_fit:
                self.num_cols_medians[col] = df[col].median()
            df[col].fillna(self.num_cols_medians.get(col, df[col].median()), inplace=True)

        # CATEGORICAL
        for col in self.cat_cols:
            mode = self.cat_cols_modes.get(col) if not is_fit else (df[col].mode()[0] if not df[col].mode().empty else 'unknown')
            if is_fit:
                self.cat_cols_modes[col] = mode
            df[col].fillna(mode, inplace=True)

    def _scale_numerical(self, df: pd.DataFrame, is_fit: bool) -> None:
        if is_fit:
            self.standard_scaler = StandardScaler()
            df[self.num_cols] = self.standard_scaler.fit_transform(df[self.num_cols])
        else:
            df[self.num_cols] = self.standard_scaler.transform(df[self.num_cols])

    def transform(self, interactions: pd.DataFrame, video_features: pd.DataFrame, video_daily: pd.DataFrame, 
                  video_categories: pd.DataFrame, user_features: pd.DataFrame, save = True) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        if self.users is None or self.videos is None or self.standard_scaler is None or self.ohe is None:
            raise ValueError("Preprocessor has not been fitted. Call fit_transform first.")

        interactions_clean = interactions.copy()

        print("Processing interaction data...")
        interactions_clean.drop_duplicates(['user_id', 'video_id'], keep='first', inplace=True)
        interactions_clean['engagement'] = interactions_clean['watch_ratio'] * (
            np.log1p(interactions_clean['play_duration']) / np.log1p(interactions_clean['video_duration'])
        )
        #interactions_clean['time'] = pd.to_datetime(interactions_clean['timestamp'], unit='s')
        #interactions_clean['is_weekend'] = (interactions_clean['time'].dt.dayofweek >= 5).astype(int)

        print("Combining all features...")
        df_interactions = self._merge_representations(interactions_clean)

        print("Generating new features...")
        self._generate_features(df_interactions)

        print("Handling missing features (cat / num)...")
        self._drop_and_fill_missing(df_interactions, is_fit=False)

        print("Encoding Categorical features...")
        self._encode_categoricals(df_interactions, is_fit=False)

        print("Scaling Numerical features...")
        self._scale_numerical(df_interactions, is_fit=False)

        #print("TF-IDF on text")
        #tfidf_matrix = self.tfidf_video.fit_transform(df_interactions['tags_caption_cover'])
        #tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=[
        #    f'tfidf_feature_{i}' for i in range(tfidf_matrix.shape[1])
        #])
        #df_interactions = pd.concat([df_interactions.reset_index(drop=True), tfidf_df], axis=1)
        #df_interactions.drop(columns=['tags_caption_cover'], inplace=True)

        df_interactions.drop(columns=['preferred_category', 'preferred_video_type', 'preferred_upload_type', 'feat', 'tags_caption_cover'], inplace=True)        
        print(f"Final dataset shape: {df_interactions.shape}")
        print(df_interactions.columns)

        if save:
            df_interactions.to_parquet("dataset/test_processed_data.parquet", compression='gzip')

        return df_interactions
    
    def fit_transform(self, interactions: pd.DataFrame, video_features: pd.DataFrame, video_daily: pd.DataFrame, 
                   video_categories: pd.DataFrame, user_features: pd.DataFrame, save = True) -> tuple:
        interactions_clean = interactions.drop_duplicates(['user_id', 'video_id'], keep='first')
    
        print("Processing interaction data...")
        interactions_clean['engagement'] = interactions_clean['watch_ratio'] * (
            np.log1p(interactions_clean['play_duration']) / np.log1p(interactions_clean['video_duration'])
        )
        interactions_clean['time'] = pd.to_datetime(interactions_clean['timestamp'], unit='s')
        interactions_clean['is_weekend'] = interactions_clean['time'].dt.dayofweek.apply(lambda x: 1 if x >= 5 else 0)

        print("Processing videos...")
        self.videos = preprocess_videos(video_features, video_daily, video_categories)
        print(self.videos.head())

        print("Processing users...")
        self.users = preprocess_users(interactions_clean, user_features, video_categories, video_daily)
        print(self.users.head())
    
        print("Combining all features...")
        df_interactions = self._merge_representations(interactions_clean)

        del interactions_clean

        print("Generating new features...")
        self._generate_features(df_interactions)

        print("Dropping features...")
        self._drop_and_fill_missing(df_interactions, is_fit=True)

        print("Encoding Categoricals...")
        self._encode_categoricals(df_interactions, is_fit=True)

        print("Scaling numerical...")
        self._scale_numerical(df_interactions, is_fit=True)

        #print("TF-IDF on Text..")
        #self.tfidf_video = TfidfVectorizer(max_features=self.tfidf_max_features)
        #tfidf_matrix = self.tfidf_video.fit_transform(df_interactions['tags_caption_cover'])
        #tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=[
        #    f'tfidf_feature_{i}' for i in range(tfidf_matrix.shape[1])
        #])
        #df_interactions = pd.concat([df_interactions.reset_index(drop=True), tfidf_df], axis=1)
        #df_interactions.drop(columns=['tags_caption_cover'], inplace=True)

        for col in df_interactions.select_dtypes(include=['int', 'int64', 'int32']).columns:
            df_interactions[col] = df_interactions[col].astype(np.int16)
        for col in df_interactions.select_dtypes(include=['float', 'float64']).columns:
            df_interactions[col] = df_interactions[col].astype(np.float32)

        df_interactions.drop(columns=['preferred_category', 'preferred_video_type', 'preferred_upload_type', 'feat', 'tags_caption_cover', 'user_active_degree'], inplace=True)        
        print(f"Final dataset shape: {df_interactions.shape}")
        print(df_interactions.columns)

        #df_interactions.to_csv("dataset/df_train_interactions.csv", index=False, chunksize=100000)
        if save:
            print("Saving train dataframe into parquet: dataset/train_processed_data.parquet")
            df_interactions.to_parquet("dataset/train_processed_data.parquet", compression='gzip')
        return df_interactions

In [11]:
pipeline = PreprocessingXGRanker(tfidf_max_features=32)

df_train = pipeline.fit_transform(
    interactions=interactions_train,
    video_features=video_features,
    video_daily=video_daily,
    video_categories=video_categories,
    user_features=user_features,
    save=True
)

Processing interaction data...


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  interactions_clean['engagement'] = interactions_clean['watch_ratio'] * (
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  interactions_clean['time'] = pd.to_datetime(interactions_clean['timestamp'], unit='s')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  interactions_clean['is_weekend'] = interactio

Processing videos...
   video_id  like_play_ratio  comment_play_ratio  share_play_ratio  \
0         0         0.059485            0.001088          0.000255   
1         1         0.058329            0.000416          0.000452   
2         2         0.004744            0.000045          0.000028   
3         3         0.102283            0.000761          0.002826   
4         4         0.004532            0.000000          0.000000   

   like_cancel_ratio  video_duration  is_add  like_to_comment_ratio  \
0           0.187164          5966.0       0              47.931507   
1           0.603601             NaN       0              67.762295   
2           0.117484          8000.0       0              34.193548   
3           0.147410             NaN       0              13.442857   
4           0.090909         18000.0       0               0.057692   

   upload_type  follow_play_ratio  follow_cancel_ratio     feat  \
0  ShortImport           0.030664                  0.0      [8] 

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(self.num_cols_medians.get(col, df[col].median()), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(mode, inplace=True)


Encoding Categoricals...
Scaling numerical...
Final dataset shape: (10300969, 41)
Index(['video_id', 'user_id', 'engagement', 'like_play_ratio',
       'comment_play_ratio', 'share_play_ratio', 'like_cancel_ratio',
       'video_duration', 'is_add', 'like_to_comment_ratio',
       'follow_play_ratio', 'follow_cancel_ratio', 'is_lowactive_period',
       'is_live_streamer', 'is_video_author', 'follower_fan_ratio',
       'popularity_score', 'video_id_count', 'is_user_active',
       'preferred_duration', 'category_match', 'upload_type_match',
       'duration_diff', 'upload_type_AiCutVideo', 'upload_type_FlashPhoto',
       'upload_type_FollowShoot', 'upload_type_Kmovie',
       'upload_type_LocalCollection', 'upload_type_LocalIntelligenceAlbum',
       'upload_type_LongCamera', 'upload_type_LongImport',
       'upload_type_LongPicture', 'upload_type_PhotoCopy',
       'upload_type_PictureCopy', 'upload_type_PictureSet',
       'upload_type_SameFrame', 'upload_type_ShareFromOtherApp',
 

In [13]:
df_test = pipeline.transform(
    interactions=interactions_test,
    video_features=video_features,
    video_daily=video_daily,
    video_categories=video_categories,
    user_features=user_features,
    save=True
)

Processing interaction data...
Combining all features...
Generating new features...
Handling missing features (cat / num)...
Removing columns: {'is_weekend'}


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(self.num_cols_medians.get(col, df[col].median()), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna(mode, inplace=True)


Encoding Categorical features...
Scaling Numerical features...
Final dataset shape: (4676570, 42)
Index(['video_id', 'user_id', 'engagement', 'like_play_ratio',
       'comment_play_ratio', 'share_play_ratio', 'like_cancel_ratio',
       'video_duration', 'is_add', 'like_to_comment_ratio',
       'follow_play_ratio', 'follow_cancel_ratio', 'user_active_degree',
       'is_lowactive_period', 'is_live_streamer', 'is_video_author',
       'follower_fan_ratio', 'popularity_score', 'video_id_count',
       'is_user_active', 'preferred_duration', 'category_match',
       'upload_type_match', 'duration_diff', 'upload_type_AiCutVideo',
       'upload_type_FlashPhoto', 'upload_type_FollowShoot',
       'upload_type_Kmovie', 'upload_type_LocalCollection',
       'upload_type_LocalIntelligenceAlbum', 'upload_type_LongCamera',
       'upload_type_LongImport', 'upload_type_LongPicture',
       'upload_type_PhotoCopy', 'upload_type_PictureCopy',
       'upload_type_PictureSet', 'upload_type_SameFram

#### Model

In [6]:
class ContentBasedFilteringXGRanker:
    """
    Content-based filtering ranker using XGBoost with K-Fold cross-validation
    and comprehensive logging capabilities.
    """
    
    def __init__(self, verbose: bool = True):
        self.models = []
        self.best_model = None
        self.feature_importances = None
        self.verbose = verbose
        self.setup_logging()

    def setup_logging(self):
        logging.basicConfig(
            level=logging.INFO if self.verbose else logging.WARNING,
            format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger('XGRanker')
        
    def _create_dmatrix(self, X: np.ndarray, y: Optional[np.ndarray] = None, 
                        groups: Optional[np.ndarray] = None) -> xgb.DMatrix:
        dmatrix = xgb.DMatrix(X, y)
        if groups is not None:
            unique_groups, group_counts = np.unique(groups, return_counts=True)
            dmatrix.set_group(group_counts)
        return dmatrix
    
    def fit(self, X: np.ndarray, y: np.ndarray, groups: np.ndarray, 
            params: Optional[Dict[str, Any]] = None, n_folds: int = 3, 
            early_stopping_rounds: int = 50):
        start_time = time.time()

        # Default parameters if none provided
        if params is None:
            params = {
                'objective': 'rank:ndcg',
                'eval_metric': 'ndcg@100',
                'learning_rate': 0.05,
                'ndcg_exp_gain': False,
                'max_depth': 8,
                'min_child_weight': 50,
                'subsample': 0.8,
                'colsample_bytree': 0.7,
                'n_estimators': 300,
                'tree_method': 'hist',
                'random_state': 42,
            }
        
        self.logger.info(f"Starting training with parameters: {params}")
        self.logger.info(f"Using {n_folds}-fold cross-validation")
        
        # Get unique groups to ensure they're not split across folds
        unique_groups = np.unique(groups)
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=params.get('random_state', 42))
        
        fold_scores = []
        self.models = []
        
        # For tracking best model
        best_score = -np.inf
        best_model = None
        
        for fold, (train_idx, val_idx) in enumerate(tqdm(kf.split(unique_groups), 
                                                        total=n_folds, 
                                                        desc="Cross-validation")):
            self.logger.info(f"\n{'='*50}\nFold {fold+1}/{n_folds}\n{'='*50}")
            
            # Get indices for this fold based on groups
            train_groups = unique_groups[train_idx]
            val_groups = unique_groups[val_idx]
            
            train_mask = np.isin(groups, train_groups)
            val_mask = np.isin(groups, val_groups)
            
            X_train, y_train, groups_train = X[train_mask], y[train_mask], groups[train_mask]
            X_val, y_val, groups_val = X[val_mask], y[val_mask], groups[val_mask]
            
            # Create DMatrix objects
            dtrain = self._create_dmatrix(X_train, y_train, groups_train)
            dval = self._create_dmatrix(X_val, y_val, groups_val)
            
            # Setup watch list for monitoring progress
            watchlist = [(dtrain, 'train'), (dval, 'validation')]
            
            # Train the model for this fold
            evals_result = {}
            model = xgb.train(
                params,
                dtrain,
                num_boost_round=params.get('n_estimators', 300),
                evals=watchlist,
                early_stopping_rounds=early_stopping_rounds,
                evals_result=evals_result,
                verbose_eval=10 if self.verbose else False
            )
            
            # Save the model and validation score
            self.models.append(model)
            best_iteration = model.best_iteration
            best_score_fold = max(evals_result['validation'][params['eval_metric']])
            fold_scores.append(best_score_fold)
            
            self.logger.info(f"Fold {fold+1} best score: {best_score_fold:.6f} at iteration {best_iteration}")
            
            # Track the best model
            if best_score_fold > best_score:
                best_score = best_score_fold
                best_model = model
                
        # Set the best model
        self.best_model = best_model
        
        # Calculate feature importances from all models
        if best_model is not None:
            self.feature_importances = best_model.get_score(importance_type='gain')
        
        mean_score = np.mean(fold_scores)
        std_score = np.std(fold_scores)
        
        training_time = time.time() - start_time
        self.logger.info(f"\n{'='*50}")
        self.logger.info(f"Training completed in {training_time:.2f} seconds")
        self.logger.info(f"Cross-validation scores: {fold_scores}")
        self.logger.info(f"Mean CV score: {mean_score:.6f} ± {std_score:.6f}")
        self.logger.info(f"Best model score: {best_score:.6f}")
        
        if self.feature_importances:
            self.logger.info("Top 10 feature importances:")
            sorted_features = sorted(self.feature_importances.items(), key=lambda x: x[1], reverse=True)[:10]
            for feature, importance in sorted_features:
                self.logger.info(f"  {feature}: {importance:.4f}")
        
        return self
    
    def predict(self, X: np.ndarray, use_all_models: bool = False) -> np.ndarray:
        start_time = time.time()
        self.logger.info(f"Generating predictions for {X.shape[0]} samples...")
        
        dtest = xgb.DMatrix(X)
        
        if use_all_models and len(self.models) > 0:
            # Ensemble prediction by averaging all models
            self.logger.info("Using ensemble prediction from all folds")
            preds = np.zeros(X.shape[0])
            for model in self.models:
                preds += model.predict(dtest)
            preds /= len(self.models)
        elif self.best_model is not None:
            # Use only the best model
            self.logger.info("Using best model for prediction")
            preds = self.best_model.predict(dtest)
        else:
            raise ValueError("No trained models available. Please fit the model first.")
        
        prediction_time = time.time() - start_time
        self.logger.info(f"Prediction completed in {prediction_time:.2f} seconds")
        
        return preds
    
    def get_feature_importances(self) -> Dict[str, float]:
        if self.feature_importances is None:
            self.logger.warning("Feature importances not available. Model may not be trained.")
            return {}
        return self.feature_importances
    
    def save_model(self, filepath: str, save_all: bool = False) -> None:
        if self.best_model is not None:
            best_model_path = f"{filepath}_best.json"
            self.best_model.save_model(best_model_path)
            self.logger.info(f"Best model saved to {best_model_path}")
            
        if save_all and len(self.models) > 0:
            for i, model in enumerate(self.models):
                fold_path = f"{filepath}_fold_{i}.json"
                model.save_model(fold_path)
            self.logger.info(f"All {len(self.models)} fold models saved")
    
    def load_model(self, filepath: str):
        self.best_model = xgb.Booster()
        self.best_model.load_model(filepath)
        self.logger.info(f"Model loaded from {filepath}")
        
        try:
            self.feature_importances = self.best_model.get_score(importance_type='gain')
        except:
            self.logger.warning("Could not extract feature importances from loaded model")
            
        return self

#### Loading Train Datasets

In [3]:
df_train = pd.read_parquet("dataset/train_processed_data.parquet")
df_train.info(memory_usage="deep")
X_cols = [col for col in df_train.columns if col != "engagement"]
X_train = df_train[X_cols].to_numpy(copy=False)
y_train = np.ceil(df_train["engagement"].to_numpy(copy=False)).astype(int)
groups_train = df_train["user_id"].to_numpy(copy=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10300969 entries, 0 to 10300968
Data columns (total 41 columns):
 #   Column                              Dtype  
---  ------                              -----  
 0   video_id                            int16  
 1   user_id                             int16  
 2   engagement                          float32
 3   like_play_ratio                     float32
 4   comment_play_ratio                  float32
 5   share_play_ratio                    float32
 6   like_cancel_ratio                   float32
 7   video_duration                      float32
 8   is_add                              int16  
 9   like_to_comment_ratio               float32
 10  follow_play_ratio                   float32
 11  follow_cancel_ratio                 float32
 12  is_lowactive_period                 int16  
 13  is_live_streamer                    int16  
 14  is_video_author                     int16  
 15  follower_fan_ratio                  float32
 16

##### Training Model

In [4]:
print(y_train.min())
print(y_train.max())

0
1309


In [None]:
model = ContentBasedFilteringXGRanker()
model.fit(X_train, y_train, groups_train, params={
    'objective': 'rank:ndcg',
    'eval_metric': 'ndcg@100',
    'learning_rate': 0.05,
    'ndcg_exp_gain': False,
    'max_depth': 8,
    'min_child_weight': 50,
    'subsample': 0.8,
    'colsample_bytree': 0.7,
    'n_estimators': 300,
    'tree_method': 'hist',
    'random_state': 42,
})

2025-05-05 14:26:11,301 - XGRanker - INFO - Starting training with parameters: {'objective': 'rank:ndcg', 'eval_metric': 'ndcg@100', 'learning_rate': 0.05, 'ndcg_exp_gain': False, 'max_depth': 8, 'min_child_weight': 50, 'subsample': 0.8, 'colsample_bytree': 0.7, 'n_estimators': 300, 'tree_method': 'hist', 'random_state': 42}
2025-05-05 14:26:11,302 - XGRanker - INFO - Using 3-fold cross-validation
Cross-validation:   0%|          | 0/3 [00:00<?, ?it/s]2025-05-05 14:26:11,480 - XGRanker - INFO - 
Fold 1/3


#### Evaluation

In [9]:
df_test = pd.read_parquet("dataset/test_processed_data.parquet")

X_test = df_train[list(set(df_train.columns) - set(["engagement"]))].to_numpy()
y_test = df_train["engagement"].to_numpy()

y_pred = model.predict(X_test)


In [10]:
y_pred

array([0.9250729 , 0.94992095, 0.93757343, ..., 0.9539452 , 0.97146255,
       0.93965286], dtype=float32)