# Project Assignment: Short Video Recommender System (KuaiRec)

Dataset Source: [Kuairec](https://kuairec.com/)

Arxiv Paper: [KuaiRec: A Fully-observed Dataset and Insights for Evaluating Recommender Systems](https://arxiv.org/pdf/2202.10842)

## Dataset import

In [None]:
!wget https://nas.chongminggao.top:4430/datasets/KuaiRec.zip --no-check-certificate
!unzip KuaiRec.zip

## Imports

In [1]:
import os
# Limit OpenBLAS to one thread; this must be set before NumPy is imported
os.environ["OPENBLAS_NUM_THREADS"] = "1"
import pandas as pd
import numpy as np
from sklearn import metrics
from scipy.sparse import csr_matrix


# I get my dataset from a Kaggle input
DATA_PATH = "/kaggle/input/kuairec/KuaiRec 2.0/data"
if not os.path.exists(DATA_PATH):
   DATA_PATH = f"{os.getcwd()}/KuaiRec/data"
if not os.path.exists(DATA_PATH):
   DATA_PATH = f"{os.getcwd()}/KuaiRec 2.0/data"
if not os.path.exists(DATA_PATH):
   raise FileNotFoundError("KuaiRec dataset not found. Please check the path.")

DATA_PATH

'/home/tofeha/ING2/ING2/REMA1/FinalProject_2025_aziz.zeghal/KuaiRec 2.0/data'

## Step 1: Load and observe the dataset

- Load and inspect the dataset
- Handle missing or inconsistent data
- Merge metadata for content-based models if necessary

### Small matrix

This table has a density of 99.6%: almost every user-video pair has a recorded interaction, so the matrix is close to fully observed.

In [None]:
def data_clear(df: pd.DataFrame) -> pd.DataFrame:
    # "date" repeats the information in "time" in a less convenient format
    df = df.drop(columns="date")

    # "timestamp" and "time" can be missing.
    # Not a problem: we keep those rows because they count towards the density
    df = df.astype({
        "user_id": "int32",
        "video_id": "int32",
        "play_duration": "int32",
        "timestamp": "int64",
        "watch_ratio": "float32"}, errors="ignore")

    # Keep a single row per (user, video) pair
    df = df.drop_duplicates(subset=["user_id", "video_id"])

    df["time"] = pd.to_datetime(df["time"])

    return df

In [None]:
train_set = pd.read_csv(f"{DATA_PATH}/small_matrix.csv")

train_set = data_clear(train_set)


In [None]:
print(f"Shape of the small matrix: {train_set.shape}")
unique_users = train_set["user_id"].nunique()
unique_posts = train_set["video_id"].nunique()
print(f"Matrix density: {len(train_set) / (unique_posts * unique_users) * 100}%")

In [None]:
train_set.head()

### Big matrix

This table has a density of 16.3%. We will use this matrix for our training and testing.

It contains more interactions and includes the users and items of the small matrix. We do not need to subtract the small matrix.
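One way to check the overlap described above is to compare the ID sets of the two tables. The sketch below is only an illustration: `small_df` and `big_df` are tiny hypothetical frames standing in for the real CSVs, and the helper `id_overlap` is an assumed name, not part of this notebook.

```python
import pandas as pd


def id_overlap(small: pd.DataFrame, big: pd.DataFrame, column: str) -> float:
    """Fraction of distinct `column` values in `small` that also appear in `big`."""
    small_ids = set(small[column].unique())
    big_ids = set(big[column].unique())
    if not small_ids:
        return 0.0
    return len(small_ids & big_ids) / len(small_ids)


# Hypothetical stand-ins for small_matrix.csv and big_matrix.csv
small_df = pd.DataFrame({"user_id": [14, 14, 19], "video_id": [148, 183, 148]})
big_df = pd.DataFrame({"user_id": [0, 14, 19, 21], "video_id": [148, 183, 3649, 5262]})

user_share = id_overlap(small_df, big_df, "user_id")
video_share = id_overlap(small_df, big_df, "video_id")
```

A share of 1.0 for both columns would mean that every user and video of the small table is also present in the big one.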

In [None]:
evaluation_set = pd.read_csv(f"{DATA_PATH}/big_matrix.csv")

evaluation_set = data_clear(evaluation_set)


In [None]:
evaluation_set

### Misc

In [None]:
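item_categories = pd.read_csv(f"{DATA_PATH}/kuairec_caption_category.csv", lineterminator='\n')
# astype returns a new frame, so the result has to be assigned back
item_categories = item_categories.astype({"video_id": "int32"})
item_categories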


## Step 2: Feature Engineering

- Create meaningful features from interaction and metadata (e.g., content tags, user activity history)
- Build user-item interaction matrix
- Optionally extract time-based or popularity-based features
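As a rough illustration of the user-activity and time-based features listed above, here is a minimal sketch on a hypothetical sample. The frame `sample` and the feature names `interaction_count` and `hour` are assumptions chosen for the example; only the column names `user_id`, `video_id`, `watch_ratio` and `time` come from the dataset.

```python
import pandas as pd

# Hypothetical sample using the same column names as the interaction tables
sample = pd.DataFrame({
    "user_id": [14, 14, 19, 19, 19],
    "video_id": [148, 183, 148, 183, 3649],
    "watch_ratio": [0.5, 2.0, 1.5, 1.0, 0.25],
    "time": pd.to_datetime([
        "2020-07-05 08:00:00", "2020-07-05 21:30:00",
        "2020-07-06 08:15:00", "2020-07-06 09:00:00", "2020-07-06 23:45:00",
    ]),
})

# User activity feature: number of interactions per user
user_activity = sample.groupby("user_id").size().rename("interaction_count")

# Time-based feature: hour of day at which each interaction happened
sample["hour"] = sample["time"].dt.hour
```

Both features can then be joined back onto the interaction table by `user_id` if a content-based or hybrid model needs them.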

In [None]:
def popularity_score(video_id: int) -> float:
    """
    Calculate the popularity score of a video as its mean watch ratio
    over all interactions in the training set (0.0 if the video is unseen).
    """
    video_interest = train_set[train_set["video_id"] == video_id]
    return float(video_interest["watch_ratio"].mean()) if len(video_interest) > 0 else 0.0

In [None]:
popularity_score(148)
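Calling the function above once per video filters the whole table on every call. A possible alternative, sketched here on a hypothetical sample, computes the mean watch ratio of every video in a single `groupby` pass; the names `sample`, `popularity` and `popularity_lookup` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample with the two columns the score depends on
sample = pd.DataFrame({
    "video_id": [148, 148, 183],
    "watch_ratio": [0.5, 1.5, 2.0],
})

# Mean watch ratio per video, computed once for the whole table
popularity = sample.groupby("video_id")["watch_ratio"].mean()


def popularity_lookup(video_id: int) -> float:
    """Return the precomputed score, or 0.0 for a video with no interactions."""
    return float(popularity.get(video_id, 0.0))
```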

In [None]:
matrix_train = train_set.pivot(index='user_id', columns='video_id', values='watch_ratio').fillna(0)
interactions = csr_matrix(matrix_train.values)

# user_ids = train_set["user_id"].astype("category").cat.codes.values
# item_ids = train_set["video_id"].astype("category").cat.codes.values
# interactions = csr_matrix((train_set["watch_ratio"], (user_ids, item_ids)))


In [None]:
interactions
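The commented-out variant above builds the sparse matrix directly from category codes instead of going through a dense pivot. A minimal sketch of that idea on a hypothetical sample follows; `sample`, `sparse_interactions`, `user_index` and `item_index` are assumed names, and the explicit `shape` argument is an addition for safety.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical sample with the same columns as the interaction tables
sample = pd.DataFrame({
    "user_id": [14, 14, 19],
    "video_id": [148, 183, 148],
    "watch_ratio": [0.5, 2.0, 1.5],
})

# Category codes give contiguous 0-based row and column positions
user_cat = sample["user_id"].astype("category")
item_cat = sample["video_id"].astype("category")

sparse_interactions = csr_matrix(
    (
        sample["watch_ratio"].to_numpy(),
        (user_cat.cat.codes.to_numpy(), item_cat.cat.codes.to_numpy()),
    ),
    shape=(len(user_cat.cat.categories), len(item_cat.cat.categories)),
)

# Keep the position -> original ID mappings to decode rows and columns later
user_index = user_cat.cat.categories
item_index = item_cat.cat.categories
```

This avoids materialising the dense pivot table, which matters more for the big matrix than for the small one.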

## Step 3: Model Development

- Choose a recommendation approach:
    - Collaborative filtering (e.g., ALS, Matrix Factorisation)
    - Content-based filtering
    - Sequence-aware models
    - Hybrid approaches
- Train and validate your model on the training set

### Model 1: Alternating Least Squares (ALS)
Considering that we only have implicit feedback, ALS can work well. We will not use demographic data for this simple model.

This algorithm is mostly used for sparse datasets.
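For reference, implicit-feedback ALS is commonly written as a confidence-weighted least-squares problem. The block below is a sketch of that common formulation, not a statement of Spark's exact internals: $r_{ui}$ is the observed signal (here the watch ratio), $p_{ui}$ a binary preference, $c_{ui}$ a confidence weight, and the symbols $x_u$, $y_i$, $\alpha$ and $\lambda$ are notation chosen for this note. The regularisation scaling used inside Spark may differ.

```latex
\min_{x_*,\,y_*} \sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_u^{\top} y_i\bigr)^2
  + \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr),
\qquad
p_{ui} = \begin{cases} 1 & \text{if } r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases},
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
```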

#### Pyspark imports

In [2]:
import pyspark
from pyspark.ml.recommendation import ALS
from pyspark.sql import SparkSession
# To evaluate the model with RMSE
from pyspark.ml.evaluation import RegressionEvaluator
# For hyperparameter tuning
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

print(f"Spark version: {pyspark.__version__}")
print(f"Pandas version: {pd.__version__}")

# Create a Spark session
spark = SparkSession.builder \
    .appName("KuaiRec ALS") \
    .config("spark.driver.memory", "16g") \
    .config("spark.executor.memory", "16g") \
    .getOrCreate()

Spark version: 3.4.1
Pandas version: 2.2.3




#### Data preparation for Pyspark

In [3]:
# Add the training dataframe to the Spark session

# Optional: filter training data to users/items that also exist in test set

# We load directly from the CSV to avoid memory issues
# TODO: Maybe later on use parquet files for cleaned up data
train_data = spark.read.csv(
    f"{DATA_PATH}/small_matrix.csv",
    header=True,
    sep=",",
    nullValue="",
    # We have to infer for correct types
    inferSchema=True,
).select("user_id", "video_id", "watch_ratio")

train_data.show()

# Load the evaluation data
test_data = spark.read.csv(
    f"{DATA_PATH}/big_matrix.csv",
    header=True,
    sep=",",
    inferSchema=True,
    nullValue="",
).select("user_id", "video_id", "watch_ratio")



test_data.show()

                                                                                

+-------+--------+------------------+
|user_id|video_id|       watch_ratio|
+-------+--------+------------------+
|     14|     148|0.7221031811438932|
|     14|     183| 1.907377049180328|
|     14|    3649| 2.063310941382166|
|     14|    5262|0.5663884673748103|
|     14|    8234|0.4183636363636364|
|     14|    6789|0.6487525439059321|
|     14|    1963|0.8981230448383734|
|     14|     175| 0.250247237390893|
|     14|    1973|0.6178378378378379|
|     14|     171|1.6327391221008245|
|     14|    6803|1.3599621092516576|
|     14|    3634|1.0625113574413956|
|     14|    6787|2.2209484106305366|
|     14|    1951|            2.2415|
|     14|     179|1.4244272292731168|
|     14|    5266|1.2312694373763076|
|     14|    5241|0.4111530321155793|
|     14|    6782|1.1655276381909547|
|     14|    6788|0.7878258532652512|
|     14|    8220|1.4898848971049878|
+-------+--------+------------------+
only showing top 20 rows





+-------+--------+------------------+
|user_id|video_id|       watch_ratio|
+-------+--------+------------------+
|      0|    3649|1.2733965215790926|
|      0|    9598|1.2440823015294975|
|      0|    5262|0.1076125442589782|
|      0|    1963|0.0898852971845672|
|      0|    8234|             0.078|
|      0|    8228| 1.572294776119403|
|      0|    6789|0.1753976030752996|
|      0|    6812| 2.212061894108874|
|      0|     183|0.1304918032786885|
|      0|     169|1.4062659977475171|
|      0|    1988|1.8889678703440431|
|      0|    5274|2.0970967741935485|
|      0|     179|2.1577385857919893|
|      0|    3647|0.0679368262722486|
|      0|    8248|1.2918181818181818|
|      0|     206|0.0902172714238447|
|      0|    6801| 1.935958459541324|
|      0|     171| 33.27602070155262|
|      0|    3672|1.4420652173913044|
|      0|    2000|0.0716965742251223|
+-------+--------+------------------+
only showing top 20 rows



                                                                                

#### Hyperparameter tuning and Cross Validation

In [9]:
# ALS model configuration
als = ALS(
    maxIter=10,
    rank=10,
    userCol="user_id",
    itemCol="video_id",
    ratingCol="watch_ratio",
    # Handle NaN predictions
    coldStartStrategy="drop",
    implicitPrefs=True,
)

# For CrossValidator
params = ParamGridBuilder() \
    .addGrid(als.rank, [10, 20]) \
    .addGrid(als.maxIter, [10, 15]) \
    .build()


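# RMSE against the raw watch ratio.
# Caveat: with implicitPrefs=True the model predicts a preference score rather
# than the watch ratio itself, so this RMSE is only a rough tuning signal.
evaluator = RegressionEvaluator(metricName="rmse", labelCol="watch_ratio", predictionCol="prediction")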


# CrossValidator
cvs = CrossValidator(
    estimator=als,
    estimatorParamMaps=params,
    evaluator=evaluator,
    # Between 2 and 5
    numFolds=3,
)

#### Training
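After training, the model approximates the interaction matrix as the product of two low-rank factors:

R ≈ U × Vᵀ

Where:
- R is the user-item interaction matrix (users × items)
- U is the user feature matrix (users × rank)
- V is the item feature matrix (items × rank)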
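To make the factorisation above concrete, here is a minimal NumPy sketch of alternating least squares on a small, fully observed matrix. It is the plain explicit-feedback variant with ridge regularisation, not the implicit-feedback algorithm that Spark runs; the function name `als_explicit` and the toy matrix are assumptions made for the illustration.

```python
import numpy as np


def als_explicit(R, rank=2, reg=0.1, iters=20, seed=0):
    """Factorise a fully observed matrix R into U @ V.T with ridge-regularised ALS."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Fix V and solve the regularised least-squares problem for all users
        U = np.linalg.solve(V.T @ V + ridge, V.T @ R.T).T
        # Fix U and solve the same problem for all items
        V = np.linalg.solve(U.T @ U + ridge, U.T @ R).T
    return U, V


# Toy rank-1 interaction matrix: 4 users x 3 items
R = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [0.5, 1.0, 1.5],
    [3.0, 6.0, 9.0],
])
U, V = als_explicit(R, rank=2, reg=1e-3)
reconstruction_error = np.abs(R - U @ V.T).max()
```

Each half-step is an ordinary ridge regression, which is why the method alternates: with one factor fixed, the problem for the other is linear.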

In [11]:
# Fit the ALS model on the train data
models = cvs.fit(train_data)

                                                                                

In [15]:
# Take the best model from the CrossValidator
my_model = models.bestModel

# Note: the predictions are made on the training data,
# so the RMSE below is an in-sample figure
predictions = my_model.transform(train_data)
rmse = evaluator.evaluate(predictions)

                                                                                

In [16]:
print(f"RMSE: {rmse}")
print(f"Rank: {my_model.rank}")
print(f"MaxIter: {my_model._java_obj.parent().getMaxIter()}")
print(f"RegParam: {my_model._java_obj.parent().getRegParam()}")

RMSE: 1.3592511123402955
Rank: 10
MaxIter: 15
RegParam: 0.1


## Step 4: Recommendation Algorithm

- Predict which videos are likely to be enjoyed by each user in the test set
- Generate a top-N ranked list of recommendations for each user
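The two steps above can be sketched with plain NumPy once user and item factors are available. This is an illustration under assumptions, not what `recommendForAllUsers` does internally: the function `top_n`, the toy factors and the `seen` mask (used to skip videos a user has already watched) are all hypothetical.

```python
import numpy as np


def top_n(U, V, seen, n=2):
    """Return, for each user, the indices of the n highest-scoring unseen items."""
    scores = U @ V.T
    # Already-watched items get -inf so they can never be recommended
    scores = np.where(seen, -np.inf, scores)
    # argpartition selects the n best items per row without a full sort
    candidates = np.argpartition(-scores, n - 1, axis=1)[:, :n]
    # Sort only those n candidates by descending score
    candidate_scores = np.take_along_axis(scores, candidates, axis=1)
    order = np.argsort(-candidate_scores, axis=1)
    return np.take_along_axis(candidates, order, axis=1)


# Toy factors: 2 users, 4 items, rank 2
U = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]])
seen = np.array([
    [True, False, False, False],
    [False, False, False, False],
])
ranked = top_n(U, V, seen, n=2)
```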

### Model 1: Alternating Least Squares (ALS)

In [17]:
recommends = my_model.recommendForAllUsers(10)
recommends_df = recommends.toPandas()

                                                                                

In [18]:
recommends_df

| | user_id | recommendations |
|---|---|---|
| 0 | 120 | [(7383, 0.9606729745864868), (4040, 0.95974075... |
| 1 | 137 | [(7383, 0.9637352228164673), (4040, 0.96279984... |
| 2 | 140 | [(7383, 0.9655925035476685), (4040, 0.96465539... |
| 3 | 155 | [(7383, 0.9650819897651672), (4040, 0.96414554... |
| 4 | 157 | [(7383, 0.9602046012878418), (4040, 0.95927274... |
| ... | ... | ... |
| 1406 | 7135 | [(7383, 0.9609105587005615), (4040, 0.95997804... |
| 1407 | 7141 | [(7383, 0.9661915302276611), (4040, 0.96525388... |
| 1408 | 7142 | [(7383, 0.9652813673019409), (4040, 0.96434468... |
| 1409 | 7153 | [(7383, 0.9657152891159058), (4040, 0.96477800... |
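The nested `recommendations` column is awkward to filter or join. A possible way to flatten it into one row per (user, video) pair is sketched below on a hypothetical stand-in for `recommends_df`; the frame `sample_recs`, its rounded scores and the `rank` column are assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in: one list of (video_id, score) pairs per user
sample_recs = pd.DataFrame({
    "user_id": [120, 137],
    "recommendations": [
        [(7383, 0.96), (4040, 0.95)],
        [(7383, 0.97), (4040, 0.96)],
    ],
})

# One row per recommended video
long_recs = sample_recs.explode("recommendations", ignore_index=True)

# Split each (video_id, score) pair into two flat columns
pairs = long_recs.pop("recommendations")
long_recs["video_id"] = [video_id for video_id, _ in pairs]
long_recs["score"] = [score for _, score in pairs]

# Position of each video in the user's list (1 = best)
long_recs["rank"] = long_recs.groupby("user_id").cumcount() + 1
```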


## Evaluation

- Choose suitable metrics (e.g., Precision@K, Recall@K, MAP, NDCG)
- Evaluate performance and provide interpretations
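As a starting point for the metrics listed above, here is a minimal pure-Python sketch of Precision@K, Recall@K and NDCG@K with binary relevance. The function names and the toy lists are assumptions; how a video counts as relevant (for example, a watch ratio above some threshold) is a modelling choice that this sketch leaves open.

```python
import math


def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: hits near the top of the list count more."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranked list for one user and the set of videos they liked
recommended = [7383, 4040, 148, 183]
relevant = {4040, 183, 3649}

precision = precision_at_k(recommended, relevant, k=4)
recall = recall_at_k(recommended, relevant, k=4)
ndcg = ndcg_at_k(recommended, relevant, k=4)
```

Averaging each metric over all test users gives a single figure per model, which makes different approaches comparable.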