# Project Assignment: Short Video Recommender System (KuaiRec)

Dataset Source: [KuaiRec](https://kuairec.com/)

Arxiv Paper: [KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems](https://arxiv.org/pdf/2202.10842)

## Dataset import

The download server used below is down. Please download the archive from the Google Drive link provided instead.

In [None]:
!wget https://nas.chongminggao.top:4430/datasets/KuaiRec.zip --no-check-certificate
!unzip KuaiRec.zip

## Imports

In [18]:
import os
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot, Dense
from tensorflow.keras.callbacks import EarlyStopping
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split


# Look for the dataset in the Kaggle input directory first,
# then fall back to a local copy in the current working directory
DATA_PATH = "/kaggle/input/kuairec/KuaiRec 2.0/data"
if not os.path.exists(DATA_PATH):
    DATA_PATH = f"{os.getcwd()}/KuaiRec/data"
if not os.path.exists(DATA_PATH):
    DATA_PATH = f"{os.getcwd()}/KuaiRec 2.0/data"
if not os.path.exists(DATA_PATH):
    raise FileNotFoundError("KuaiRec dataset not found. Please check the path.")

DATA_PATH

'/home/tofeha/ING2/ING2/REMA1/FinalProject_2025_aziz.zeghal/KuaiRec 2.0/data'

# Step 1: Load the dataset

In [4]:
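def data_clear(df: pd.DataFrame) -> pd.DataFrame:
    """Clean an interaction matrix: drop redundant columns, duplicates and invalid rows."""
    # "time" and "date" duplicate the information in "timestamp", so we drop them.
    # drop() returns a new frame, so the caller's DataFrame is left untouched.
    df = df.drop(columns=["time", "date"])

    # Downcast to smaller dtypes to save memory.
    # With errors="ignore", a column whose cast fails keeps its original dtype.
    df = df.astype({
        "user_id": "int32",
        "video_id": "int32",
        "play_duration": "int32",
        "timestamp": "int64",
        "watch_ratio": "float32"}, errors="ignore")

    # Remove duplicate rows, rows with missing values and negative timestamps
    df = df.drop_duplicates().dropna()
    df = df[df["timestamp"] >= 0].copy()  # copy() avoids SettingWithCopyWarning below

    # Convert the Unix timestamp (in seconds) to a datetime
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")

    return df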

In [None]:
def my_describe(df: pd.DataFrame) -> pd.DataFrame:
    """
    Custom describe for datasets containing user_id and video_id
    """
    print(f"Shape of the matrix: {df.shape}")
    unique_users = df["user_id"].nunique()
    unique_posts = df["video_id"].nunique()
    print(f"Number of unique users: {unique_users}")
    print(f"Number of unique posts: {unique_posts}")
    # Density = observed interactions / possible (user, video) pairs
    print(f"Matrix density: {len(df) / (unique_posts * unique_users) * 100:.2f}%")
    return df.describe()

## Small matrix

This table has a density of 99.6%. This means that 99.6% of the possible user-video pairs have an observed interaction, so almost every user has interacted with almost every item.
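To make the notion of density concrete, here is a minimal sketch on a made-up interaction table. The `toy` frame and its values are invented for this illustration and are not taken from KuaiRec. Density is the number of observed (user, video) pairs divided by the number of possible pairs.

```python
import pandas as pd

# Toy interaction log (hypothetical values, not KuaiRec data)
toy = pd.DataFrame({
    "user_id":  [0, 0, 0, 1, 1, 2],
    "video_id": [10, 11, 12, 10, 12, 11],
})


def density(df: pd.DataFrame) -> float:
    """Share of the user x video grid that has an observed interaction."""
    observed = len(df.drop_duplicates(["user_id", "video_id"]))
    possible = df["user_id"].nunique() * df["video_id"].nunique()
    return observed / possible


toy_density = density(toy)
# 6 observed pairs out of 3 users * 3 videos = 9 possible pairs
print(f"Density: {toy_density:.1%}")
```

A density close to 100% means the matrix is almost fully observed, while a low density means most user-video pairs have no recorded interaction.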

In [None]:
small_matrix = pd.read_csv(f"{DATA_PATH}/small_matrix.csv")

small_matrix = data_clear(small_matrix)


## Big matrix

This table has a density of 16.3%. We will use this matrix for training and testing.

It contains more interactions and includes the users and items of the small matrix, so we do not need to subtract the small matrix.

In [5]:
big_matrix = pd.read_csv(f"{DATA_PATH}/big_matrix.csv")

big_matrix = data_clear(big_matrix)


In [None]:
big_matrix

## Misc

In [None]:
print(f"Proportion of small_matrix relative to big_matrix: {small_matrix.shape[0] * 100 / big_matrix.shape[0]:.2f}%")

## Item category encoding

We have the characteristics of the videos (author_id, video_type, ...), but this part requires less preprocessing.

For content-based filtering, we need the features of the videos (a list of tags). TF-IDF is not needed here: a simple one-hot encoding is enough.

In [None]:
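import ast

# No missing values for this data
item_categories = pd.read_csv(f"{DATA_PATH}/item_categories.csv")

# Parse the "feat" column from its string form (e.g. "[8, 27]") into a Python list.
# ast.literal_eval only accepts literals, so it is safer than eval.
item_categories["feat"] = item_categories["feat"].apply(ast.literal_eval)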

## Item daily features

This dataset is also useful for content-based filtering.

It is mostly composed of textual data, so we will use a TF-IDF vectorizer to encode the features of the videos.

In [None]:
item_daily_features = pd.read_csv(f"{DATA_PATH}/item_daily_features.csv", lineterminator='\n')
item_daily_features

## Caption Category

In [None]:
caption_category = pd.read_csv(f"{DATA_PATH}/kuairec_caption_category.csv", lineterminator='\n')
caption_category

# Step 2: Feature Engineering

- Create meaningful features from interaction and metadata (e.g., content tags, user activity history)
- Build user-item interaction matrix
- Optionally extract time-based or popularity-based features
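The three steps above can be sketched on a small made-up interaction log. The values below are invented. The column names mirror those of the cleaned matrices, and the feature names (`n_interactions`, `mean_watch_ratio`, `hour`, `day_of_week`) are hypothetical choices for this example.

```python
import pandas as pd

# Toy interaction log (hypothetical values)
toy = pd.DataFrame({
    "user_id":     [0, 0, 1, 1, 2],
    "video_id":    [10, 11, 10, 12, 10],
    "watch_ratio": [0.5, 1.2, 2.0, 0.1, 0.9],
    "timestamp":   pd.to_datetime([
        "2020-07-05 08:00", "2020-07-05 21:30", "2020-07-06 09:15",
        "2020-07-06 23:45", "2020-07-07 12:00",
    ]),
})

# User-item matrix: one row per user, one column per video, NaN where unobserved
user_item = toy.pivot_table(index="user_id", columns="video_id",
                            values="watch_ratio", aggfunc="mean")

# Popularity features: number of interactions and mean watch ratio per video
popularity = toy.groupby("video_id")["watch_ratio"].agg(
    n_interactions="count", mean_watch_ratio="mean")

# Time-based features derived from the interaction timestamp
toy["hour"] = toy["timestamp"].dt.hour
toy["day_of_week"] = toy["timestamp"].dt.dayofweek  # Monday = 0

print(user_item)
print(popularity)
```

A dense pivot table is convenient for inspection, but on a matrix the size of `big_matrix` a sparse representation would probably be a better fit.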

## Item category encoding

Each video carries a list of tags in the `feat` column. We one-hot encode these lists so that every tag becomes a binary column. This is enough for content-based filtering, so TF-IDF is not needed here.

In [None]:
# Use MultiLabelBinarizer to one-hot encode the list-valued "feat" column:
# one binary column per tag, one row per video
mlb = MultiLabelBinarizer()

matrix_item_category = pd.DataFrame(
    mlb.fit_transform(item_categories["feat"]),
    columns=mlb.classes_,
    index=item_categories["video_id"],
)


In [None]:
matrix_item_category
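One possible way to use such a one-hot matrix for content-based filtering is to compare videos by the cosine similarity of their tag vectors. The sketch below runs on invented tag lists, so `toy_categories` and its values are hypothetical, and it is not necessarily the approach this project ends up using.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy tag lists (hypothetical), shaped like the parsed "feat" column
toy_categories = pd.DataFrame({
    "video_id": [0, 1, 2],
    "feat": [[8], [8, 27], [5]],
})

mlb_toy = MultiLabelBinarizer()
onehot = pd.DataFrame(
    mlb_toy.fit_transform(toy_categories["feat"]),
    columns=mlb_toy.classes_,
    index=toy_categories["video_id"],
)

# Cosine similarity between every pair of videos
vectors = onehot.to_numpy(dtype=float)
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / np.clip(norms, 1e-12, None)  # guard against all-zero rows
similarity = unit @ unit.T

print(pd.DataFrame(similarity, index=onehot.index, columns=onehot.index).round(3))
```

Videos 0 and 1 share one tag and get a positive similarity, while video 2 shares no tag with the others and gets a similarity of 0.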

# Step 3: Two-Tower Model
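Two-Tower is an embedding model used mostly for retrieval tasks, such as search or recommendation.

The two towers are two separate neural networks that encode the user features and the item features. Each tower encodes its own input independently of the other, but both are trained jointly: their output embeddings are combined (here with a dot product) to make a prediction, and the loss on that prediction updates both towers.

It is meant to be efficient on large datasets, because item embeddings can be computed ahead of time, and to scale to new users and items when the towers take features as input.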
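As a rough illustration of how the two towers combine, the NumPy sketch below treats each tower as a plain embedding table and scores a (user, video) pair with a dot product. The sizes, the random tables and the target watch ratios are all invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_videos, dim = 4, 6, 3

# Each "tower" is reduced to an embedding table (random stand-in values)
user_table = rng.normal(size=(n_users, dim))
video_table = rng.normal(size=(n_videos, dim))


def predict(user_ids: np.ndarray, video_ids: np.ndarray) -> np.ndarray:
    """Score each (user, video) pair with the dot product of its embeddings."""
    u = user_table[user_ids]    # user tower output, shape (batch, dim)
    v = video_table[video_ids]  # video tower output, shape (batch, dim)
    return np.sum(u * v, axis=1)


user_ids = np.array([0, 1, 3])
video_ids = np.array([2, 2, 5])
targets = np.array([0.8, 1.5, 0.2])  # invented watch ratios

preds = predict(user_ids, video_ids)
mse = np.mean((preds - targets) ** 2)  # loss used to update both tables jointly
print(preds, mse)
```

Because the user side and the video side only meet in the final dot product, video embeddings can be precomputed once and reused for every user.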

The work on the model is split into four parts:
- Data preparation and tuning
- Model training
- Model evaluation
- Model saving


## Model 1: Basic Two-Tower Model (no features)

### Data preparation

In [13]:
# Train test split
train, test = train_test_split(big_matrix[["user_id", "video_id", "watch_ratio"]], test_size=0.2, random_state=42)

### Model creation

In [19]:
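# Vocabulary sizes for the embedding layers.
# IDs are used directly as embedding indices, so input_dim must be larger
# than the largest ID (nunique() is only enough if IDs are contiguous from 0).
num_users = big_matrix['user_id'].max() + 1
num_videos = big_matrix['video_id'].max() + 1

# User tower
user_input = Input(shape=(1,), name="user_input")
user_embedding = Embedding(input_dim=num_users, output_dim=50, name="user_embedding")(user_input)
user_embedding = Flatten(name="user_flatten")(user_embedding)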

In [20]:
# Video tower
video_input = Input(shape=(1,), name="video_input")
video_embedding = Embedding(input_dim=num_videos, output_dim=50, name="video_embedding")(video_input)
video_embedding = Flatten(name="video_flatten")(video_embedding)

In [21]:
# Dot product
dot_product = Dot(axes=1)([user_embedding, video_embedding])



In [22]:
# Create the model
model = Model(inputs=[user_input, video_input], outputs=dot_product)

model.compile(optimizer="adam", loss="mse")

model.summary()

### Training

In [23]:
history = model.fit(
    x=[train["user_id"], train["video_id"]],
    y=train["watch_ratio"],
    batch_size=128,
    epochs=5,
    validation_data=([test["user_id"], test["video_id"]], test["watch_ratio"]),
)

Epoch 1/5
72238/72238 ━━━━━━━━━━━━━━━━━━━━ 234s 3ms/step - loss: 2.8754 - val_loss: 2.6026
Epoch 2/5
72238/72238 ━━━━━━━━━━━━━━━━━━━━ 229s 3ms/step - loss: 2.6324 - val_loss: 2.6029
Epoch 3/5
72238/72238 ━━━━━━━━━━━━━━━━━━━━ 228s 3ms/step - loss: 2.6286 - val_loss: 2.6085
Epoch 4/5
72238/72238 ━━━━━━━━━━━━━━━━━━━━ 231s 3ms/step - loss: 2.5747 - val_loss: 2.6113
Epoch 5/5
72238/72238 ━━━━━━━━━━━━━━━━━━━━ 230s 3ms/step - loss: 2.5740 - val_loss: 2.6172


### Evaluation

In [24]:
# Evaluate on the test set
test_loss = model.evaluate([test["user_id"], test["video_id"]], test["watch_ratio"])
print(f"Test loss (MSE): {test_loss}")

72238/72238 ━━━━━━━━━━━━━━━━━━━━ 154s 2ms/step - loss: 2.5886
Test loss (MSE): 2.617175817489624
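In the training log, `val_loss` is lowest after the first epoch (2.6026) and rises slightly afterwards, and the test loss equals the final `val_loss` because the test split is also used as validation data. `EarlyStopping` is imported in the notebook but never used. The sketch below shows one way it could be wired in, on a tiny random dataset with a separate validation slice. The data, sizes and the `patience` value are assumptions made for this illustration, not settings from this project.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dot, Embedding, Flatten, Input
from tensorflow.keras.models import Model

# Tiny synthetic dataset (random values, for illustration only)
rng = np.random.default_rng(0)
n_users, n_videos, n_rows = 20, 30, 500
users = rng.integers(0, n_users, size=n_rows).reshape(-1, 1)
videos = rng.integers(0, n_videos, size=n_rows).reshape(-1, 1)
ratios = rng.random(n_rows).astype("float32")

# Same architecture as the model above, with smaller embeddings
user_in = Input(shape=(1,))
video_in = Input(shape=(1,))
user_vec = Flatten()(Embedding(n_users, 8)(user_in))
video_vec = Flatten()(Embedding(n_videos, 8)(video_in))
toy_model = Model([user_in, video_in], Dot(axes=1)([user_vec, video_vec]))
toy_model.compile(optimizer="adam", loss="mse")

# Stop once val_loss has not improved for 1 epoch and restore the best weights
early_stop = EarlyStopping(monitor="val_loss", patience=1, restore_best_weights=True)

toy_history = toy_model.fit(
    x=[users[:400], videos[:400]], y=ratios[:400],
    validation_data=([users[400:], videos[400:]], ratios[400:]),
    epochs=20, batch_size=32, callbacks=[early_stop], verbose=0,
)
epochs_run = len(toy_history.history["val_loss"])
print(f"Training stopped after {epochs_run} epoch(s)")
```

Keeping a validation slice that is separate from the test set means the reported test loss is measured on data that played no role in choosing when to stop.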


### Saving

In [25]:
model.save("two_tower_model.keras")

# Step 4: Two-Tower Recommendation

- Predict which videos are likely to be enjoyed by each user in the test set
- Generate a top-N ranked list of recommendations for each user

### Loading model

### Recommendation
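A minimal sketch of top-N recommendation from two embedding tables is shown below. The tables are random stand-ins and the `seen` dictionary is hypothetical. With the trained model, the tables could presumably be read from the embedding layers, for example with `model.get_layer("user_embedding").get_weights()[0]`.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_videos, dim, top_n = 3, 8, 4, 3

# Stand-ins for the trained embedding tables (random, for illustration only)
user_emb = rng.normal(size=(n_users, dim))
video_emb = rng.normal(size=(n_videos, dim))

# Videos each user has already interacted with (hypothetical)
seen = {0: [1, 2], 1: [0], 2: []}

# Predicted score for every (user, video) pair: shape (n_users, n_videos)
scores = user_emb @ video_emb.T

# Never recommend a video the user has already seen
for user, vids in seen.items():
    if vids:
        scores[user, vids] = -np.inf

# Top-N video ids per user, best first
recommendations = np.argsort(-scores, axis=1)[:, :top_n]
print(recommendations)
```

For a large catalogue, `np.argpartition` or an approximate nearest-neighbour index would avoid sorting every score, at the cost of a slightly more involved implementation.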

# Evaluation

- Choose suitable metrics (e.g., Precision@K, Recall@K, MAP, NDCG)
- Evaluate performance and provide interpretations
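The metrics listed above can be sketched as follows for a single user, with binary relevance. What counts as "relevant" is an assumption to be made for this project (for example, a watch ratio above some threshold). The `recommended` list and `relevant` set below are invented.

```python
import numpy as np


def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    gains = [1.0 if item in relevant else 0.0 for item in recommended[:k]]
    dcg = sum(g / np.log2(rank + 2) for rank, g in enumerate(gains))
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / np.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


recommended = [5, 2, 9, 7]  # ranked list for one user (invented)
relevant = {2, 7, 8}        # ground-truth relevant videos (invented)

p = precision_at_k(recommended, relevant, 3)  # 1 hit in the top 3
r = recall_at_k(recommended, relevant, 3)     # 1 of 3 relevant items found
n = ndcg_at_k(recommended, relevant, 3)
print(f"Precision@3={p:.3f}  Recall@3={r:.3f}  NDCG@3={n:.3f}")
```

Averaging these per-user values over all test users gives the dataset-level scores.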