# Project Assignment: Short Video Recommender System (KuaiRec)

Dataset Source: [Kuairec](https://kuairec.com/)

Arxiv Paper: [KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems](https://arxiv.org/pdf/2202.10842)

## Dataset import

The server is down, please download from the Google Drive in the given link.

In [14]:
!wget https://nas.chongminggao.top:4430/datasets/KuaiRec.zip --no-check-certificate
!unzip KuaiRec.zip

--2025-04-23 10:53:03--  https://nas.chongminggao.top:4430/datasets/KuaiRec.zip
Resolving nas.chongminggao.top (nas.chongminggao.top)... 211.86.155.249
Connecting to nas.chongminggao.top (nas.chongminggao.top)|211.86.155.249|:4430... failed: Connection refused.
unzip:  cannot find or open KuaiRec.zip, KuaiRec.zip.zip or KuaiRec.zip.ZIP.


## Imports

In [1]:
import os
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Flatten, Dot, Dense
from tensorflow.keras.callbacks import EarlyStopping

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from sklearn.model_selection import train_test_split


# I get my dataset from a Kaggle input
DATA_PATH = "/kaggle/input/kuairec/KuaiRec 2.0/data"
if not os.path.exists(DATA_PATH):
   DATA_PATH = f"{os.getcwd()}/KuaiRec/data"
if not os.path.exists(DATA_PATH):
   DATA_PATH = f"{os.getcwd()}/KuaiRec 2.0/data"
if not os.path.exists(DATA_PATH):
   raise FileNotFoundError("KuaiRec dataset not found. Please check the path.")

DATA_PATH

2025-04-23 21:47:20.531520: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745437640.545679  163207 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1745437640.549566  163207 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1745437640.562252  163207 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1745437640.562277  163207 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1745437640.562278  163207 computation_placer.cc:177] computation placer alr

'/home/tofeha/ING2/ING2/REMA1/FinalProject_2025_aziz.zeghal/KuaiRec 2.0/data'

# Step 1: Load the dataset

In [2]:
def data_clear(df : pd.DataFrame) -> pd.DataFrame:
    # Date is time in a weird format

    # Time and Date are duplicated of timestamp, we can drop them
    df.drop(columns=["time", "date"], inplace=True)
    # Not a problem, we want to keep the data for the density
    df = df.astype({
        "user_id": "int32",
        "video_id": "int32",
        "play_duration":"int32",
        "timestamp": "int64",
        "watch_ratio": "float32"}, errors="ignore")
    
    # Drop duplicates
    df.drop_duplicates(inplace=True)
    df.dropna(inplace=True)
    df = df[df["timestamp"] >= 0]
    
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")

    return df

## Small matrix

This table has a density of 99.6%. This means that 99.6% of the entries in the matrix are non-zero, indicating that most users have interacted with most items.

In [4]:
small_matrix = pd.read_csv(f"{DATA_PATH}/small_matrix.csv")

small_matrix = data_clear(small_matrix)


## Big matrix

This table has a density of 16.3%. We will use this matrix for our training and testing.

It contains more interactions with the same users/items of the small matrix. We do not need to substract the small matrix.

In [4]:
big_matrix = pd.read_csv(f"{DATA_PATH}/big_matrix.csv")

big_matrix = data_clear(big_matrix)


In [None]:
big_matrix

Unnamed: 0,user_id,video_id,play_duration,video_duration,timestamp,watch_ratio
0,0,3649,13838,10867,2020-07-04 16:08:23,1.273396
1,0,9598,13665,10984,2020-07-04 16:13:41,1.244082
2,0,5262,851,7908,2020-07-04 16:16:06,0.107613
3,0,1963,862,9590,2020-07-04 16:20:26,0.089885
4,0,8234,858,11000,2020-07-04 16:43:05,0.078000
...,...,...,...,...,...,...
12530801,7175,1281,34618,140017,2020-09-05 07:07:10,0.247241
12530802,7175,3407,12619,21888,2020-09-05 07:08:45,0.576526
12530803,7175,10360,2407,7067,2020-09-05 11:10:29,0.340597
12530804,7175,10360,6455,7067,2020-09-05 11:10:36,0.913400


## Misc

In [None]:
print(f"Proportion of small_matrix relative to big_matrix: {small_matrix.shape[0] * 100 / big_matrix.shape[0]:.2f}%")

## Item category encoding

We have the caracteristics of the videos (author_id, video_type...) but this part requires less preprocessing.

For Content-based filtering, we need to use features of the videos (list of tags). No need for TF-IDF, we will use a simple one-hot encoding.

In [5]:
# No missing values for this data
item_categories = pd.read_csv(f"{DATA_PATH}/item_categories.csv")

## Item daily features

This dataset is also interesting for content-based filtering.

Mostly composed of textual data, we will use a TF-IDF vectorizer to encode the features of the videos.

In [70]:
item_daily_features = pd.read_csv(f"{DATA_PATH}/item_daily_features.csv", lineterminator='\n')

## Caption Category

In [6]:
caption_category = pd.read_csv(f"{DATA_PATH}/kuairec_caption_category.csv", lineterminator='\n')

## User features

In [41]:
user_features = pd.read_csv(f"{DATA_PATH}/user_features.csv", lineterminator='\n')

# Step 2: Feature Engineering

- Create meaningful features from interaction and metadata (e.g., content tags, user activity history)
- Build user-item interaction matrix
- Optionally extract time-based or popularity-based features

## Item category encoding

We have the caracteristics of the videos (author_id, video_type...) but this part requires less preprocessing.

For Content-based filtering, we need to use features of the videos (list of tags). No need for TF-IDF, we will use a simple one-hot encoding.

## Tower preparation

In [42]:
# Use MultiLabelBinarizer to manage efficiently the feat column
mlb = MultiLabelBinarizer()

# Transform the feat column to a list (evaluate with python)
item_categories["feat"] = item_categories["feat"].apply(eval)

item_features = pd.DataFrame(mlb.fit_transform(item_categories["feat"]), 
                  columns=mlb.classes_,
                  index=item_categories["video_id"])

for column in item_features.columns:
    item_features[column] = item_features[column].astype("int16")

item_features.columns = item_features.columns.map(str)

item_feature_columns = [str(i) for i in range(1, 31)]

# Keep IDs for dataset creation
item_feature_map = item_features[item_feature_columns]


TypeError: eval() arg 1 must be a string, bytes or code object

In [None]:
user_feature_columns = [
    "is_lowactive_period","is_live_streamer", "is_video_author",
    "onehot_feat0", "onehot_feat1", "onehot_feat2", "onehot_feat3",
    "onehot_feat4", "onehot_feat5", "onehot_feat6", "onehot_feat7",
    "onehot_feat8", "onehot_feat9", "onehot_feat10", "onehot_feat11", 
    "onehot_feat12", "onehot_feat13", "onehot_feat14", "onehot_feat15",
    "onehot_feat16", "onehot_feat17"
]
# Keep IDs for dataset creation
user_features_map = user_features[["user_id"] + user_feature_columns].copy().set_index("user_id")

user_features_map.fillna(0, inplace=True)
for column in user_feature_columns:
    user_features_map[column] = user_features_map[column].astype("int16")

In [9]:
# Index is the associated IDs for quick creation
display(user_features_map.head())
display(item_feature_map.head())

Unnamed: 0_level_0,is_lowactive_period,is_live_streamer,is_video_author,onehot_feat0,onehot_feat1,onehot_feat2,onehot_feat3,onehot_feat4,onehot_feat5,onehot_feat6,...,onehot_feat8,onehot_feat9,onehot_feat10,onehot_feat11,onehot_feat12,onehot_feat13,onehot_feat14,onehot_feat15,onehot_feat16,onehot_feat17
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0,0,0,0,1,17,638,2,0,1,...,184,6,3,0,0,0,0,0,0,0
1,0,0,0,0,3,25,1021,0,0,1,...,186,6,2,0,0,0,0,0,0,0
2,0,0,0,0,6,8,402,0,0,0,...,51,2,3,0,0,0,0,0,0,0
3,0,0,0,0,1,8,281,0,0,0,...,251,3,2,0,0,0,0,0,0,0
4,0,0,0,0,1,8,316,1,0,1,...,99,4,2,0,0,0,0,0,0,0


Unnamed: 0_level_0,1,2,3,4,5,6,7,8,9,10,...,21,22,23,24,25,26,27,28,29,30
video_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Dataset preparation

For each interaction in the dataset, we will create a row of user_feature and item_feature.

For example:
```python
interaction[0] = {
    'user_id': 1,
    'item_id': 2,
    'watch_ratio': 0.9
}
```
First row of user_features_vector has user1's features, and item_vector_features has item2's features.

In [49]:
# Take the IDs, populate the vectors
interaction_vector = big_matrix.iloc[:2_000_000].copy()
(user_ids, video_ids) = (interaction_vector["user_id"].values, interaction_vector["video_id"].values)
interaction_vector = interaction_vector[["watch_ratio"]].values

# Create the vector with features, drop IDs as not needed by the model
user_features_vector = pd.DataFrame(user_features_map.iloc[user_ids], columns=user_feature_columns).reset_index(drop=True)
item_features_vector = pd.DataFrame(item_feature_map.iloc[video_ids], columns=item_feature_columns).reset_index(drop=True)

In [50]:
def sanitize_check(df: pd.DataFrame) -> None:
    """
    Check if the dataframe has nan/inf values.
    Sends an error message if it does.
    """
    if np.isnan(df).any().any() or np.isinf(df).any().any():
        raise ValueError(f"The dataframe with these columns: {df.columns[:3]}\n has nan/inf values")
    else:
        print("df is good")

sanitize_check(user_features_vector)
sanitize_check(item_features_vector)
sanitize_check(interaction_vector)

df is good
df is good
df is good


## Scale features

Once we have our feature vectors associated to each interaction, we will scale the features and target

In [51]:
scalerItem = StandardScaler()
scalerItem.fit(item_features_vector)
item_features_vector = scalerItem.transform(item_features_vector)


In [52]:
scalerUser = StandardScaler()
scalerUser.fit(user_features_vector)
user_features_vector = scalerUser.transform(user_features_vector)

In [53]:
scalerTarget = StandardScaler()
scalerTarget.fit(interaction_vector.reshape(-1, 1))
interaction_vector = scalerTarget.transform(interaction_vector.reshape(-1, 1))

## Data splitting

In [54]:
# Train test split

y_train, y_test = train_test_split(interaction_vector, test_size=0.2, random_state=42)

item_features_train, item_features_test = train_test_split(item_features_vector, test_size=0.2, random_state=42)

user_features_train, user_features_test = train_test_split(user_features_vector, test_size=0.2, random_state=42)


# Step 3: Two-Tower Model
Two-Tower is an embedding model used mostly for retrieval tasks, such as search or recommendation.

Two towers refer to the two separate neural networks that are used to encode the user and item features. Each tower is trained independently, and the outputs of the two towers are combined to make predictions.

It is meant to be efficient for large data, and scalable to new users and items.

The model is cut into 4 parts:
- Data preparation and tuning
- Model training
- Model evaluation
- Model saving


## Model 1: Basic Two-Tower Model

### Model creation

In [55]:
# Item tower
num_item_features = item_features_train.shape[1]

# Input layer
input_item = Input(shape=(num_item_features,), name="item_input")

# Process features
item_NN = tf.keras.models.Sequential(
    [
        Dense(256, activation="relu", name="item_x"),
        Dense(128, activation="relu", name="item_embedding"),
    ], name="item_NN"
)

item_embedding = item_NN(input_item)

In [57]:
# User tower
num_user_features = user_features_train.shape[1]

# Input layer
input_user = Input(shape=(num_user_features,), name="user_input")

# Process features
user_NN = tf.keras.models.Sequential(
    [
        Dense(256, activation="relu", name="user_x"),
        Dense(128, activation="relu", name="user_embedding"),
    ], name="user_NN"
)
user_embedding = user_NN(input_user)

In [58]:
output = Dot(axes=1, name="dot_product")([user_embedding, item_embedding])

model = Model(inputs=[input_user, input_item], outputs=output, name = "Two_Tower_Model")

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.RootMeanSquaredError()],
)
model.summary()

### Training

In [59]:
user_features_train.shape, item_features_train.shape, y_train.shape

((1600000, 21), (1600000, 30), (1600000, 1))

In [60]:
history = model.fit(
    [user_features_train, item_features_train],
    y_train,
    epochs=5,
    batch_size=128,
    validation_data=([user_features_test, item_features_test], y_test),
)


2025-04-23 22:08:08.974253: W external/local_xla/xla/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 192000000 exceeds 10% of free system memory.
2025-04-23 22:08:09.360358: W external/local_xla/xla/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 192000000 exceeds 10% of free system memory.


Epoch 1/5
[1m12500/12500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m51s[0m 4ms/step - loss: 0.9623 - root_mean_squared_error: 0.9806 - val_loss: 0.8387 - val_root_mean_squared_error: 0.9158
Epoch 2/5
[1m12500/12500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m47s[0m 4ms/step - loss: 1.0854 - root_mean_squared_error: 1.0405 - val_loss: 0.8402 - val_root_mean_squared_error: 0.9166
Epoch 3/5
[1m12500/12500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m49s[0m 4ms/step - loss: 0.9913 - root_mean_squared_error: 0.9944 - val_loss: 0.8402 - val_root_mean_squared_error: 0.9166
Epoch 4/5
[1m12500/12500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m49s[0m 4ms/step - loss: 1.0641 - root_mean_squared_error: 1.0312 - val_loss: 0.8402 - val_root_mean_squared_error: 0.9166
Epoch 5/5
[1m12500/12500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m47s[0m 4ms/step - loss: 1.0826 - root_mean_squared_error: 1.0398 - val_loss: 0.8402 - val_root_mean_squared_error: 0.9166


### Evaluation

In [61]:
# Evaluate on the test set
test_loss = model.evaluate([user_features_test, item_features_test], y_test)
print(f"Baseline mean value: {big_matrix["watch_ratio"].mean():.4f}")
print(f"Test loss: {test_loss[0]:.4f}")

[1m12500/12500[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m36s[0m 3ms/step - loss: 0.8409 - root_mean_squared_error: 0.9117
Baseline mean value: 0.9470
Test loss: 0.8402


### Saving

In [42]:
model.save("two_tower_model.keras")

# Step 4: Two-Tower Recommendation

- Predict which videos are likely to be enjoyed by each user in the test set
- Generate a top-N ranked list of recommendations for each user

### Loading model

In [None]:
model = tf.keras.models.load_model("two_tower_model.keras")

### Recommendation

In [35]:
test_predictions = model.predict([user_test, item_test])

[1m50000/50000[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m75s[0m 1ms/step


In [40]:
test_predictions

array([[1.0342128 ],
       [1.0342128 ],
       [1.1473429 ],
       ...,
       [1.1473429 ],
       [0.90934604],
       [1.1473429 ]], dtype=float32)

In [None]:
test_results = pd.DataFrame({
    "user_id": y_test["user_id"],
    "video_id": y_test["video_id"],
    "watch_ratio": y_test["watch_ratio"],
    "predictions": test_predictions.flatten()
})
test_results

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

In [None]:
test_results["error"] = round(abs(test_results["watch_ratio"] - test_results["predicted_watch_ratio"]), 2)
test_results.sort_values(by="error", inplace=True)

In [None]:
test_results

# Evaluation

- Choose suitable metrics (e.g., Precision@K, Recall@K, MAP, NDCG)
- Evaluate performance and provide interpretations