# 1. Data Preparation

**Goal:** Load the raw KuaiRec data (`small_matrix.csv`, `item_categories.csv`), process it, perform a time-based train/test split, and save the results to the following files:
*   `../data/interactions_train.csv`
*   `../data/interactions_test.csv`
*   `../data/video_metadata.csv`
*   `../data/sample_submission.csv`
*   `../data/test_user_item_map.pkl` (Ground truth for evaluation)

## Setup
Import libraries and define file paths.

In [9]:
import pandas as pd
import numpy as np
import pickle
import os

# --- Configuration ---
RAW_DATA_DIR = "../raw_data/KuaiRec/data/"
PROCESSED_DATA_DIR = "../data/"
POSITIVE_INTERACTION_THRESHOLD = 1.0 # Define positive interaction (watch_ratio >= 1.0)
TRAIN_RATIO = 0.8

# Input file paths
raw_interactions_path = os.path.join(RAW_DATA_DIR, "small_matrix.csv")
raw_item_categories_path = os.path.join(RAW_DATA_DIR, "item_categories.csv")

# Output file paths
output_video_metadata_path = os.path.join(PROCESSED_DATA_DIR, "video_metadata.csv")
output_train_path = os.path.join(PROCESSED_DATA_DIR, "interactions_train.csv")
output_test_path = os.path.join(PROCESSED_DATA_DIR, "interactions_test.csv")
output_sample_submission_path = os.path.join(PROCESSED_DATA_DIR, "sample_submission.csv")
output_ground_truth_path = os.path.join(PROCESSED_DATA_DIR, "test_user_item_map.pkl")

# Ensure output directory exists
os.makedirs(PROCESSED_DATA_DIR, exist_ok=True)

## Load Raw Data

In [10]:
print(f"Loading raw interactions from: {raw_interactions_path}")
try:
    interactions_df = pd.read_csv(raw_interactions_path)
except FileNotFoundError:
    print(f"Error: Interaction file not found at {raw_interactions_path}. Ensure data is downloaded.")
    # Re-raise rather than call exit(), so the notebook reports the failure with a traceback
    raise

print(f"Loading raw item categories from: {raw_item_categories_path}")
try:
    item_categories_df = pd.read_csv(raw_item_categories_path)
except FileNotFoundError:
    print(f"Error: Item categories file not found at {raw_item_categories_path}. Ensure data is downloaded.")
    # Re-raise rather than call exit(), so the notebook reports the failure with a traceback
    raise

print(f"Loaded {len(interactions_df)} interactions and metadata for {len(item_categories_df)} items.")

Loading raw interactions from: ../raw_data/KuaiRec/data/small_matrix.csv
Loading raw item categories from: ../raw_data/KuaiRec/data/item_categories.csv
Loaded 4676570 interactions and metadata for 10728 items.


## Create `video_metadata.csv`
Rename `video_id` to `item_id` and keep only the `item_id` and `feat` columns.

In [11]:
print("Processing item metadata...")
video_metadata_df = item_categories_df.rename(columns={'video_id': 'item_id'})
video_metadata_df = video_metadata_df[['item_id', 'feat']]

video_metadata_df.to_csv(output_video_metadata_path, index=False)
print(f"Saved processed video metadata to: {output_video_metadata_path}")

Processing item metadata...
Saved processed video metadata to: ../data/video_metadata.csv
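The `feat` column is written to disk unchanged. If, as assumed here, each value is a string-encoded list of category ids (for example `"[27, 9]"`), downstream code needs to parse it before use. The sketch below works on a toy frame rather than the real file; `parse_feat` and the toy values are hypothetical.

```python
import ast

import pandas as pd

# Toy stand-in for video_metadata.csv. The string-encoded lists are an assumption
# about how `feat` is stored, not something verified in this notebook.
toy_metadata = pd.DataFrame({
    "item_id": [0, 1, 2],
    "feat": ["[8]", "[27, 9]", "[9]"],
})


def parse_feat(value):
    """Turn a string such as "[27, 9]" into a Python list of ints."""
    return list(ast.literal_eval(value))


toy_metadata["feat_list"] = toy_metadata["feat"].apply(parse_feat)

# One row per (item_id, category) pair, which is convenient for joins or one-hot encoding
item_category_pairs = toy_metadata.explode("feat_list")[["item_id", "feat_list"]]
print(item_category_pairs)
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals.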


## Process Interactions
1. Rename `video_id` to `item_id`.
2. Flag an interaction as positive (`positive_interaction = 1`) when `watch_ratio` is at least 1.0.
3. Sort interactions by `timestamp`.

In [12]:
print("Processing interactions data...")
# Rename column
interactions_df = interactions_df.rename(columns={'video_id': 'item_id'})

# Define positive interaction
# Justification: watch_ratio is play duration divided by video duration, so watch_ratio >= 1.0
# means the user's play time was at least the full length of the video.
# This is a reasonable heuristic for positive engagement, though other thresholds could be explored.
interactions_df['positive_interaction'] = (interactions_df['watch_ratio'] >= POSITIVE_INTERACTION_THRESHOLD).astype(int)
print(f"Defined 'positive_interaction' using threshold: {POSITIVE_INTERACTION_THRESHOLD}")
print(interactions_df['positive_interaction'].value_counts(normalize=True))

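# Sort by timestamp for chronological split
# kind='stable' keeps the original order of rows that share a timestamp, so the split is reproducible
interactions_df = interactions_df.sort_values('timestamp', kind='stable').reset_index(drop=True)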
print("Interactions sorted by timestamp.")

Processing interactions data...
Defined 'positive_interaction' using threshold: 1.0
positive_interaction
0    0.676034
1    0.323966
Name: proportion, dtype: float64
Interactions sorted by timestamp.
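The code comment above notes that other thresholds could be explored. A minimal sketch of that comparison, on toy `watch_ratio` values rather than the real data, might look like this; `positive_rate` and the candidate thresholds are hypothetical.

```python
import pandas as pd

# Toy watch_ratio values; the real column comes from small_matrix.csv
toy = pd.DataFrame({"watch_ratio": [0.2, 0.5, 0.9, 1.0, 1.4, 2.1]})


def positive_rate(df, threshold):
    """Fraction of rows whose watch_ratio is at or above the threshold."""
    return float((df["watch_ratio"] >= threshold).mean())


# Compare the share of positives under a few candidate thresholds
rates = {t: positive_rate(toy, t) for t in (0.5, 1.0, 2.0)}
for threshold, rate in rates.items():
    print(f"threshold={threshold}: positive share={rate:.3f}")
```

Running the same loop on `interactions_df` would show how strongly the class balance depends on the chosen cutoff.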


## Time-Based Train/Test Split

In [13]:
split_index = int(len(interactions_df) * TRAIN_RATIO)

train_df = interactions_df.iloc[:split_index].copy()
test_df = interactions_df.iloc[split_index:].copy()

print(f"Split data: {len(train_df)} train interactions, {len(test_df)} test interactions.")

Split data: 3741256 train interactions, 935314 test interactions.
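The split takes the earliest `TRAIN_RATIO` share of the sorted rows as training data and the remainder as test data. The sketch below reproduces that logic on a toy frame and adds a check that no training row is later than any test row; `chronological_split` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def chronological_split(df, train_ratio, time_col="timestamp"):
    """Sort by time, then cut at the row index given by train_ratio."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    split_index = int(len(ordered) * train_ratio)
    return ordered.iloc[:split_index].copy(), ordered.iloc[split_index:].copy()


# Toy interactions, deliberately out of order
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2],
    "item_id": [10, 11, 12, 10, 13],
    "timestamp": [50.0, 10.0, 30.0, 20.0, 40.0],
})

toy_train, toy_test = chronological_split(toy, train_ratio=0.8)

# Sanity check: the latest training row is not later than the earliest test row
no_leakage = toy_train["timestamp"].max() <= toy_test["timestamp"].min()
print(len(toy_train), len(toy_test), no_leakage)  # 4 1 True
```

Because the cut is made by row index, rows that share the boundary timestamp can land on both sides of the split; the `<=` in the check allows for that.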


## Create and Save `interactions_train.csv`

In [14]:
cols_to_keep_train = ['user_id', 'item_id', 'watch_ratio', 'timestamp', 'positive_interaction']
interactions_train_df = train_df[cols_to_keep_train]

interactions_train_df.to_csv(output_train_path, index=False)
print(f"Saved training interactions to: {output_train_path}")

Saved training interactions to: ../data/interactions_train.csv


## Create and Save `interactions_test.csv`
Contains only the (`user_id`, `item_id`) pairs to score. Test interactions from users who do not appear in the training set are filtered out.

In [15]:
# Filter test set for users present in the training set (common practice)
train_users = train_df['user_id'].unique()
test_df_filtered = test_df[test_df['user_id'].isin(train_users)].copy()
print(f"Filtered test set to {len(test_df_filtered)} interactions from users seen in training.")

# Select only user_id and item_id, remove duplicates
interactions_test_df = test_df_filtered[['user_id', 'item_id']].drop_duplicates().reset_index(drop=True)

interactions_test_df.to_csv(output_test_path, index=False)
print(f"Saved {len(interactions_test_df)} unique test user-item pairs to: {output_test_path}")

Filtered test set to 935314 interactions from users seen in training.
Saved 935314 unique test user-item pairs to: ../data/interactions_test.csv
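The cell above filters on users only. A similar `isin` check can count test pairs whose item never appears in training, which matters for models that cannot score unseen items. This is a sketch on toy frames; the notebook itself does not filter cold-start items, and the variable names are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [10, 11, 10]})
toy_test = pd.DataFrame({"user_id": [1, 2, 3], "item_id": [12, 11, 10]})

# Boolean masks: is the test row's user / item present in the training set?
known_user = toy_test["user_id"].isin(toy_train["user_id"].unique())
known_item = toy_test["item_id"].isin(toy_train["item_id"].unique())

n_cold_users = int((~known_user).sum())  # user 3 is unseen
n_cold_items = int((~known_item).sum())  # item 12 is unseen

# Rows where both the user and the item were seen during training
warm_test = toy_test[known_user & known_item]
print(n_cold_users, n_cold_items, len(warm_test))  # 1 1 1
```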


## Store Test Ground Truth (`test_user_item_map.pkl`)
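Create a map from each user to the set of items they interacted with positively during the test period, for use in evaluation. Users with no positive test interactions do not appear in the map.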

In [16]:
test_ground_truth_df = test_df_filtered[test_df_filtered['positive_interaction'] == 1]
test_user_item_map = test_ground_truth_df.groupby('user_id')['item_id'].agg(set).to_dict()

with open(output_ground_truth_path, 'wb') as f:
    pickle.dump(test_user_item_map, f)
print(f"Saved test ground truth map for {len(test_user_item_map)} users to: {output_ground_truth_path}")

Saved test ground truth map for 1411 users to: ../data/test_user_item_map.pkl
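To illustrate how the saved map could be consumed, the sketch below round-trips a toy dictionary through `pickle` and scores toy recommendations with recall@k. The metric is an assumption for illustration only, since the notebook does not say which metric the evaluation uses; `recall_at_k` and the toy data are hypothetical.

```python
import pickle

# Toy ground truth in the same shape as test_user_item_map: user -> set of positive items
toy_ground_truth = {1: {10, 12}, 2: {11}}

# Round-trip through pickle in memory, mirroring pickle.dump / pickle.load on a file
restored = pickle.loads(pickle.dumps(toy_ground_truth))


def recall_at_k(recommended, relevant, k):
    """Share of a user's relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical ranked recommendations per user, best first
toy_recs = {1: [10, 11, 12], 2: [13, 11, 10]}

per_user = {u: recall_at_k(toy_recs[u], items, k=2) for u, items in restored.items()}
mean_recall = sum(per_user.values()) / len(per_user)
print(per_user, mean_recall)  # {1: 0.5, 2: 1.0} 0.75
```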


## Create and Save `sample_submission.csv`

In [17]:
sample_submission_df = interactions_test_df.copy()
sample_submission_df['score'] = 0.5 # Placeholder score

sample_submission_df.to_csv(output_sample_submission_path, index=False)
print(f"Saved sample submission file to: {output_sample_submission_path}")

Saved sample submission file to: ../data/sample_submission.csv
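The sample submission has one row per test pair with a `score` column. A submission in this layout could be checked against the test pairs along the following lines; the validation rules and `is_valid_submission` are assumptions for illustration, not a specification taken from the notebook.

```python
import pandas as pd

toy_test_pairs = pd.DataFrame({"user_id": [1, 1, 2], "item_id": [10, 12, 11]})

toy_submission = toy_test_pairs.copy()
toy_submission["score"] = 0.5  # same placeholder as the sample submission


def is_valid_submission(submission, test_pairs):
    """Check column layout, missing scores, and exact coverage of the test pairs."""
    if list(submission.columns) != ["user_id", "item_id", "score"]:
        return False
    if submission["score"].isna().any():
        return False
    if len(submission) != len(test_pairs):
        return False
    # An outer merge with indicator=True flags pairs present on only one side
    merged = test_pairs.merge(
        submission, on=["user_id", "item_id"], how="outer", indicator=True
    )
    return bool((merged["_merge"] == "both").all())


valid_full = is_valid_submission(toy_submission, toy_test_pairs)
valid_partial = is_valid_submission(toy_submission.iloc[:2], toy_test_pairs)
print(valid_full, valid_partial)  # True False
```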


## Completion Summary
Data preparation complete. The required files have been generated in the `../data/` directory.