# Project Assignment: Short Video Recommender System (KuaiRec)

Dataset Source: [KuaiRec](https://kuairec.com/)

Arxiv Paper: [KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems](https://arxiv.org/pdf/2202.10842)

## Dataset import

In [None]:
!wget https://nas.chongminggao.top:4430/datasets/KuaiRec.zip --no-check-certificate
!unzip KuaiRec.zip

## Imports

In [1]:
import os

# Restrict OpenBLAS to a single thread; set before importing NumPy so it takes effect
os.environ["OPENBLAS_NUM_THREADS"] = "1"
import pandas as pd
import numpy as np
from sklearn import metrics
from scipy.sparse import csr_matrix


# Look for the dataset in the Kaggle input directory first, then in the working directory
candidate_paths = [
    "/kaggle/input/kuairec/KuaiRec 2.0/data",
    f"{os.getcwd()}/KuaiRec/data",
    f"{os.getcwd()}/KuaiRec 2.0/data",
]
DATA_PATH = next((path for path in candidate_paths if os.path.exists(path)), None)
if DATA_PATH is None:
    raise FileNotFoundError("KuaiRec dataset not found. Please check the path.")

DATA_PATH

'/home/tofeha/ING2/ING2/REMA1/FinalProject_2025_aziz.zeghal/KuaiRec 2.0/data'

## Step 1: Load and observe the dataset

- Load and inspect the dataset
- Handle missing or inconsistent data
- Merge metadata for content-based models if necessary

### Small matrix

The small matrix has a density of 99.6%: nearly every (user, video) pair has a recorded interaction, so the matrix is almost fully observed.
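
As a quick illustration of how such a density figure is defined, the sketch below computes it on a tiny made-up interaction table. The `toy` frame and the `density` helper are hypothetical and not part of the notebook.

```python
import pandas as pd

def density(df: pd.DataFrame) -> float:
    """Share of (user, video) pairs that have an observed interaction."""
    pairs = df.drop_duplicates(subset=["user_id", "video_id"])
    n_users = pairs["user_id"].nunique()
    n_videos = pairs["video_id"].nunique()
    return len(pairs) / (n_users * n_videos)

# Hypothetical toy data: 2 users x 3 videos, 5 of the 6 possible pairs observed
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "video_id": [10, 11, 12, 10, 12],
})
toy_density = density(toy)
print(f"{toy_density:.1%}")  # 83.3%
```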

In [2]:
def data_clear(df: pd.DataFrame) -> pd.DataFrame:
    """Drop the `date` column, cast dtypes, and keep one row per (user, video) pair."""
    # `date` repeats the information in `time` in a less convenient format
    df = df.drop(columns="date")

    # `timestamp` and `time` can be missing.
    # Not a problem: we keep those rows so the density is unchanged.
    # errors="ignore" leaves a column's dtype as-is when its cast fails.
    df = df.astype({
        "user_id": "int32",
        "video_id": "int32",
        "play_duration": "int32",
        "timestamp": "int64",
        "watch_ratio": "float32"}, errors="ignore")

    # Keep only the first interaction of each (user, video) pair
    df.drop_duplicates(subset=["user_id", "video_id"], inplace=True)

    df["time"] = pd.to_datetime(df["time"])

    return df

In [3]:
train_set = pd.read_csv(f"{DATA_PATH}/small_matrix.csv")

train_set = data_clear(train_set)


In [4]:
print(f"Shape of the small matrix: {train_set.shape}")
unique_users = train_set["user_id"].nunique()
unique_posts = train_set["video_id"].nunique()
print(f"Matrix density: {len(train_set) / (unique_posts * unique_users) * 100}%")

Shape of the small matrix: (4676570, 7)
Matrix density: 99.62024941648522%


In [5]:
train_set.head()

Unnamed: 0,user_id,video_id,play_duration,video_duration,time,timestamp,watch_ratio
0,14,148,4381,6067,2020-07-05 05:27:48.378,1593898000.0,0.722103
1,14,183,11635,6100,2020-07-05 05:28:00.057,1593898000.0,1.907377
2,14,3649,22422,10867,2020-07-05 05:29:09.479,1593898000.0,2.063311
3,14,5262,4479,7908,2020-07-05 05:30:43.285,1593898000.0,0.566388
4,14,8234,4602,11000,2020-07-05 05:35:43.459,1593899000.0,0.418364


### Big matrix

The big matrix has a density of 16.3%. We will use this matrix for our training and testing.

It contains further interactions involving the users and items of the small matrix. We do not need to subtract the small matrix.

In [6]:
evaluation_set = pd.read_csv(f"{DATA_PATH}/big_matrix.csv")

evaluation_set = data_clear(evaluation_set)


In [7]:
evaluation_set

Unnamed: 0,user_id,video_id,play_duration,video_duration,time,timestamp,watch_ratio
0,0,3649,13838,10867,2020-07-05 00:08:23.438,1593878903,1.273396
1,0,9598,13665,10984,2020-07-05 00:13:41.297,1593879221,1.244082
2,0,5262,851,7908,2020-07-05 00:16:06.687,1593879366,0.107613
3,0,1963,862,9590,2020-07-05 00:20:26.792,1593879626,0.089885
4,0,8234,858,11000,2020-07-05 00:43:05.128,1593880985,0.078000
...,...,...,...,...,...,...,...
12530800,7175,6630,4342,13855,2020-09-05 15:00:33.379,1599289233,0.313389
12530801,7175,1281,34618,140017,2020-09-05 15:07:10.576,1599289630,0.247241
12530802,7175,3407,12619,21888,2020-09-05 15:08:45.228,1599289725,0.576526
12530803,7175,10360,2407,7067,2020-09-05 19:10:29.041,1599304229,0.340597


### Item metadata

In [117]:
item_categories = pd.read_csv(f"{DATA_PATH}/kuairec_caption_category.csv", lineterminator='\n')
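# Assign the result: astype returns a new frame and does not modify in place
item_categories = item_categories.astype({"video_id": "int32"})
item_categories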


Unnamed: 0,video_id,manual_cover_text,caption,topic_tag,first_level_category_id,first_level_category_name,second_level_category_id,second_level_category_name,third_level_category_id,third_level_category_name
0,0,UNKNOWN,精神小伙路难走 程哥你狗粮慢点撒,[],8,颜值,673,颜值随拍,-124,UNKNOWN
1,1,UNKNOWN,,[],27,高新数码,-124,UNKNOWN,-124,UNKNOWN
2,2,UNKNOWN,晚饭后，运动一下！,[],9,喜剧,727,搞笑互动,-124,UNKNOWN
3,3,UNKNOWN,我平淡无奇，惊艳不了时光，温柔不了岁月，我只想漫无目的的走走，努力发笔小财，给自己买花 自己长大.,[],26,摄影,686,主题摄影,2434,景物摄影
4,4,五爱街最美美女 一天1q,#搞笑 #感谢快手我要上热门 #五爱市场 这真是完美搭配啊！,"[五爱市场,感谢快手我要上热门,搞笑]",5,时尚,737,营销售卖,2596,女装
...,...,...,...,...,...,...,...,...,...,...
10723,10723,UNKNOWN,昨天爱你，今天爱你，明天也爱你，丫头，别担心，我以后都会爱你，我的小傻瓜@公主没烦恼 、(O...,[],33,自拍,-124,UNKNOWN,-124,UNKNOWN
10724,10724,UNKNOWN,#感谢推广小助手 #感谢快手绿色平台 #,"[感谢快手绿色平台,感谢推广小助手]",6,明星娱乐,-124,UNKNOWN,-124,UNKNOWN
10725,10725,UNKNOWN,,[],15,艺术,170,表演,-124,UNKNOWN
10726,10726,老人言,老人言，喜欢留个关注加红心 #老人言 @今天拍点啥(O840386039) @快手活动中...,[老人言],38,读书,696,文学赏析,2477,民间俗语
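
Step 1 mentions merging metadata for content-based models. A minimal sketch of such a merge is shown below on miniature, made-up tables that only reuse the column names `video_id`, `watch_ratio` and `first_level_category_name` from the real files. The frames `toy_interactions` and `toy_items` are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the interaction and metadata tables
toy_interactions = pd.DataFrame({
    "user_id": [14, 14, 19],
    "video_id": [0, 2, 2],
    "watch_ratio": [0.72, 1.91, 0.57],
})
toy_items = pd.DataFrame({
    "video_id": [0, 1, 2],
    "first_level_category_name": ["颜值", "高新数码", "喜剧"],
})

# Left join keeps every interaction, even if a video has no metadata row.
# validate="many_to_one" raises if video_id is not unique in the metadata.
toy_merged = toy_interactions.merge(
    toy_items, on="video_id", how="left", validate="many_to_one"
)

# Example content feature: mean watch ratio per top-level category
category_mean = toy_merged.groupby("first_level_category_name")["watch_ratio"].mean()
```

A left join is used here so that the number of interactions is unchanged by the merge; whether that is the right choice depends on how missing metadata should be handled.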


## Step 2: Feature Engineering

- Create meaningful features from interaction and metadata (e.g., content tags, user activity history)
- Build user-item interaction matrix
- Optionally extract time-based or popularity-based features

In [8]:
def popularity_score(video_id: int) -> float:
    """
    Calculate the popularity score of a video as its mean watch ratio in the small matrix.
    Returns 0.0 for a video with no interactions.
    """
    video_interest = train_set[train_set["video_id"] == video_id]
    return video_interest["watch_ratio"].sum() / len(video_interest) if len(video_interest) > 0 else 0.0

In [9]:
popularity_score(148)

1.430915337104302
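
`popularity_score` filters the whole table on every call. When a score is needed for every video, a single `groupby` is usually cheaper. The sketch below shows the idea on made-up data; `popularity_table` and `toy` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def popularity_table(df: pd.DataFrame) -> pd.Series:
    """Mean watch_ratio per video_id, computed for all videos in one pass."""
    return df.groupby("video_id")["watch_ratio"].mean().rename("popularity")

# Hypothetical toy interactions
toy = pd.DataFrame({
    "user_id": [14, 19, 14, 21],
    "video_id": [148, 148, 183, 183],
    "watch_ratio": [0.5, 1.5, 2.0, 1.0],
})
popularity = popularity_table(toy)

# Videos absent from the table fall back to 0.0, as popularity_score does
score_148 = popularity.get(148, 0.0)
score_missing = popularity.get(999, 0.0)
```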

## Step 3: Model Development

- Choose a recommendation approach:
    - Collaborative filtering (e.g., ALS, Matrix Factorisation)
    - Content-based filtering
    - Sequence-aware models
    - Hybrid approaches
- Train and validate your model on the training set

In [109]:
matrix_train = train_set.pivot(index='user_id', columns='video_id', values='watch_ratio').fillna(0)
interactions = csr_matrix(matrix_train.values)

# user_ids = train_set["user_id"].astype("category").cat.codes.values
# item_ids = train_set["video_id"].astype("category").cat.codes.values
# interactions = csr_matrix((train_set["watch_ratio"], (user_ids, item_ids)))
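
The commented-out lines hint at building the sparse matrix directly from the long-format table instead of going through a dense pivot. A minimal sketch of that route on made-up data follows. It also keeps the lookup arrays needed to translate matrix positions back to raw ids. All names (`toy`, `toy_matrix`, `user_ids`, `video_ids`) are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy interactions in long format
toy = pd.DataFrame({
    "user_id": [14, 14, 19, 21],
    "video_id": [103, 120, 103, 109],
    "watch_ratio": [0.4, 0.7, 0.6, 1.0],
})

# Categorical codes give contiguous 0-based positions for each distinct id
user_cat = toy["user_id"].astype("category")
item_cat = toy["video_id"].astype("category")
rows = user_cat.cat.codes.to_numpy()
cols = item_cat.cat.codes.to_numpy()

toy_matrix = csr_matrix(
    (toy["watch_ratio"].to_numpy(dtype=np.float32), (rows, cols)),
    shape=(len(user_cat.cat.categories), len(item_cat.cat.categories)),
)

# Position -> raw id lookups, needed to interpret model output later
user_ids = user_cat.cat.categories.to_numpy()
video_ids = item_cat.cat.categories.to_numpy()
```

This avoids materialising the dense pivot, which matters more for a sparse table such as the big matrix than for the nearly full small matrix.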


In [84]:
matrix_train

user_id,103,109,120,122,128,130,131,133,136,137,...,10430,10436,10457,10462,10500,10506,10519,10552,10589,10595
14,0.429126,1.482039,0.728738,0.477810,0.439333,1.150245,1.816317,0.781122,1.554396,2.307486,...,0.348932,0.965220,1.787169,1.816010,0.150323,1.535256,2.757278,0.143556,1.273362,1.719201
19,0.624466,1.070684,1.006063,0.759092,0.882691,0.639313,0.670019,1.407319,0.874814,0.722665,...,0.642896,0.633833,0.586222,1.178295,0.000000,0.977297,1.266322,0.265038,0.928168,1.107873
21,1.415049,1.028840,1.809125,0.688823,0.588365,0.619549,0.818749,1.944596,1.015039,0.575723,...,0.896847,0.918930,0.602573,0.995887,1.173871,0.957399,1.148837,0.216699,1.210398,1.713792
23,0.169223,2.549891,0.247487,0.438669,0.114338,0.828292,0.038440,2.455882,1.128438,1.021400,...,0.577134,0.335534,5.304503,0.610346,0.185161,4.725427,0.338674,0.430445,2.225363,0.000000
24,0.345049,0.449337,0.802936,0.797411,1.875599,0.783867,2.104939,6.418434,0.228018,3.892566,...,0.884743,0.578658,0.300125,2.151558,2.311935,1.848424,0.388630,0.103633,0.547944,0.093900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7142,0.575631,0.960989,0.950854,0.611157,0.569484,0.944993,0.481794,1.020178,0.805988,0.514642,...,0.727006,0.794809,0.843906,1.491536,1.525323,1.261752,1.630146,0.300578,1.047888,1.419171
7147,1.112427,0.380971,1.419818,0.617423,1.067951,1.020166,0.925878,1.656635,0.907465,0.722232,...,0.541901,0.813574,1.014475,1.575067,1.614839,1.191774,1.935745,0.261799,0.974836,1.364633
7153,1.338544,0.414703,0.664433,0.339429,1.038049,0.225616,0.425432,2.330027,2.045522,0.604141,...,0.568422,0.887281,0.287259,4.364816,0.898387,0.944044,1.031352,0.309217,0.723977,0.425931
7159,0.658155,0.231235,0.788894,0.870249,0.179318,1.141559,0.464358,0.173222,1.065303,0.718853,...,0.201295,0.444783,1.494014,3.151875,1.700323,1.479968,2.092679,0.393572,1.600266,1.596454


### Model 1: Alternating Least Squares (ALS)
Considering that we only have implicit feedback, ALS can work well. We will not use demographic data for this simple model.

ALS is typically applied to sparse interaction matrices. Note that the small matrix used here is almost fully observed (99.6% density).

In [110]:
from implicit.als import AlternatingLeastSquares


# 128 latent factors with L2 regularization on the factor matrices
model = AlternatingLeastSquares(
    factors=128,
    regularization=0.15,
    calculate_training_loss=True
)


In [111]:
%%time
model.fit(interactions)


  0%|          | 0/15 [00:00<?, ?it/s]

CPU times: user 43.1 s, sys: 159 ms, total: 43.2 s
Wall time: 3.62 s


In [89]:
item_factors, user_factors = model.item_factors, model.user_factors

After training, the factors should satisfy:

R ≈ U × Vᵀ

Where:
- R is the user-item interaction matrix (users × items)
- U is the user feature matrix (users × factors)
- V is the item feature matrix (items × factors)
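
For reference, implicit-feedback ALS is commonly based on the formulation of Hu, Koren and Volinsky, in which the factors minimise a confidence-weighted squared error. The symbols below (preference p, confidence c, weight α, regularization λ, and the factor rows u_u and v_i of U and V) are not defined in this notebook, and the exact weighting applied by `implicit` is an assumption here:

```latex
\min_{U, V} \; \sum_{u, i} c_{ui} \left( p_{ui} - \mathbf{u}_u^\top \mathbf{v}_i \right)^2
  + \lambda \left( \sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{v}_i \rVert^2 \right),
\qquad
p_{ui} =
\begin{cases}
  1 & \text{if } r_{ui} > 0 \\
  0 & \text{otherwise}
\end{cases},
\qquad
c_{ui} = 1 + \alpha \, r_{ui}
```

Under this reading, the product of the factors approximates the binary preference p rather than the raw watch ratio r, which would be consistent with predicted scores that cluster around 1.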

In [97]:
a = user_factors @ item_factors.T
a

array([[0.7639965 , 0.9111066 , 0.83232737, ..., 0.9715715 , 0.7584928 ,
        0.9574807 ],
       [0.9792026 , 0.9637219 , 0.92031574, ..., 0.713523  , 0.7093892 ,
        1.2840939 ],
       [0.95884335, 1.1278911 , 1.188765  , ..., 0.31304657, 0.68187606,
        1.0420986 ],
       ...,
       [0.97609   , 1.0717669 , 0.8040704 , ..., 0.81096596, 1.0353142 ,
        0.37594905],
       [1.0837406 , 1.2667172 , 0.8468077 , ..., 0.71362275, 1.0840847 ,
        0.85573477],
       [1.009142  , 1.0000426 , 1.0124941 , ..., 1.0988531 , 0.9064739 ,
        0.9205485 ]], dtype=float32)

## Step 4: Recommendation Algorithm

- Predict which videos are likely to be enjoyed by each user in the test set
- Generate a top-N ranked list of recommendations for each user
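
The top-N step can be sketched independently of any particular model: given one user's predicted scores, mask the items already seen and keep the N best. The helper `top_n` and the toy arrays below are hypothetical. The masking mimics what a recommender typically does when it filters already-consumed items.

```python
import numpy as np

def top_n(scores: np.ndarray, seen: np.ndarray, n: int) -> np.ndarray:
    """Return column positions of the n best-scoring unseen items, best first."""
    masked = np.where(seen, -np.inf, scores)  # never recommend seen items
    n = min(n, int((~seen).sum()))
    if n == 0:
        return np.array([], dtype=int)
    # argpartition selects the n largest without sorting everything
    candidates = np.argpartition(-masked, n - 1)[:n]
    # sort only the selected candidates by descending score
    return candidates[np.argsort(-masked[candidates])]

# Hypothetical scores for one user over four items; item 1 was already watched
toy_scores = np.array([0.2, 0.9, 0.5, 0.7])
toy_seen = np.array([False, True, False, False])
ranked = top_n(toy_scores, toy_seen, n=2)
print(ranked)  # [3 2]
```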

### Model 1: Alternating Least Squares (ALS)

In [128]:
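# `implicit` identifies users by row position in `interactions`, not by raw KuaiRec user_id,
# so this requests recommendations for the user stored in row 14 of `matrix_train`.
user_id = 14
ids, scores = model.recommend(
    userid=user_id,
    user_items=interactions[user_id],
    N=10
)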

In [None]:
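# Note: `ids` are column positions in `interactions`; `matrix_train.columns[ids]`
# gives the corresponding raw video_ids.
pd.DataFrame({
    "video_id": ids,
    # Get the caption (looked up by row position in `item_categories`)
    "video_caption": item_categories.iloc[ids, 2],
    "score": scores
}).sort_values("score", ascending=False).set_index("video_id")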

video_id,video_caption,score
408,多给予你爱的人一些理解与包容#情感,0.933605
1244,你学会了吗？ #推广助力计划 #热门 #818购房节 #告别金融pua #用快影上热门 #作品推广,0.885279
519,真实不？ #12星座 #快手创作者服务中心 #作品推广 #感谢快手我要上热门,0.852139
2769,#街拍# #穿搭# #每日穿搭# @快手时尚,0.844528
3060,自驾游路上遇到洪水山体滑坡，堵住隧道口，堵车两小时，雨天行车，一定小心谨慎，注意安全！ #暑...,0.838215
2756,有一段感情的阴影 会影响以后遇见的所有人 安全感这东西一旦炸裂 你会觉得所有东西都带刀而来。,0.670181
850,剧里的暗恋真的太心酸了💔只对小果可见的朋友圈可太让人心疼你了 #王安宇 #二十不惑 #作品...,0.669137
1782,一岁多的娃跳起来兄弟呀想你了，浑身哈撒 #快手有萌娃 #作品推广,0.632251
1717,#美鞋教搭配 #秋上新 #感恩所有遇见 #感谢快手平台 #感谢快手平台,0.565446
735,还是单身狗比较懂单身狗哈@林锡（护妻狂魔）(O1840275520) @郝青平_(O1420...,0.549887
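
`model.recommend` in `implicit` works with row and column positions of the matrix passed to `fit`, so raw `user_id` and `video_id` values have to be translated in both directions. The sketch below shows that translation on a made-up pivot table. It uses stand-in positions and scores rather than a fitted model, and all names (`toy_pivot`, `toy_captions`, `item_positions`) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical pivot: rows are raw user_ids, columns are raw video_ids
toy_pivot = pd.DataFrame(
    [[0.4, 1.5, 0.7], [0.6, 1.1, 1.0]],
    index=pd.Index([14, 19], name="user_id"),
    columns=pd.Index([103, 109, 120], name="video_id"),
)
toy_captions = pd.DataFrame({
    "video_id": [103, 109, 120],
    "caption": ["caption A", "caption B", "caption C"],
}).set_index("video_id")

# Raw user_id -> row position expected by the model
user_row = toy_pivot.index.get_loc(14)

# Stand-in for the (positions, scores) pair a recommender would return
item_positions = np.array([2, 0])
scores = np.array([0.9, 0.3])

# Column positions -> raw video_ids, then a label-based caption lookup
rec_video_ids = toy_pivot.columns[item_positions].to_numpy()
recommendations = pd.DataFrame({
    "video_id": rec_video_ids,
    "caption": toy_captions.loc[rec_video_ids, "caption"].to_numpy(),
    "score": scores,
}).set_index("video_id")
```

Looking captions up by label (`.loc` on a `video_id` index) rather than by position avoids relying on the metadata rows being ordered by `video_id`.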


## Step 5: Evaluation

- Choose suitable metrics (e.g., Precision@K, Recall@K, MAP, NDCG)
- Evaluate performance and provide interpretations
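
A minimal sketch of three of the metrics listed above, for a single user and binary relevance. How "relevant" is defined (for example, a watch ratio above some threshold) is an assumption left to the evaluation design. The function names and toy ids are hypothetical.

```python
import numpy as np

def precision_recall_at_k(recommended, relevant: set, k: int):
    """Precision@K and Recall@K for one user with binary relevance."""
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def ndcg_at_k(recommended, relevant: set, k: int) -> float:
    """NDCG@K for one user with binary gains and a log2 position discount."""
    top_k = list(recommended)[:k]
    gains = np.array([1.0 if item in relevant else 0.0 for item in top_k])
    discounts = 1.0 / np.log2(np.arange(2, len(top_k) + 2))
    dcg = float((gains * discounts).sum())
    # Ideal ranking places every relevant item (up to k) at the top
    ideal_hits = min(len(relevant), k)
    idcg = float((1.0 / np.log2(np.arange(2, ideal_hits + 2))).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranked list and ground-truth set for one user
toy_recommended = [408, 1244, 519, 2769]
toy_relevant = {1244, 2769, 999}
precision, recall = precision_recall_at_k(toy_recommended, toy_relevant, k=4)
ndcg = ndcg_at_k(toy_recommended, toy_relevant, k=4)
```

Averaging these per-user values over all evaluated users gives the dataset-level figures usually reported.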