# KRAFT: Model Training

This notebook focuses on training the two main components of our recommender system:
1.  **ALS (Alternating Least Squares) Model:** For candidate generation. This model is trained on the user-item interaction matrix derived from `big_matrix`.
2.  **LightGBM Model:** For ranking the candidates generated by ALS (or for direct ranking if evaluating on `small_matrix`). This model is trained on the feature-rich dataset derived from `big_matrix`.

The trained models will be saved to disk for later use in evaluation and inference.
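
The two-stage flow described above (retrieve candidates, then re-rank them) can be sketched on toy data. This is a minimal illustration, not the notebook's pipeline: the random factor matrices stand in for what an ALS model would learn, and `generate_candidates`, `rank_candidates`, and `toy_ranker` are hypothetical names introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 50, 8

# Assumption: random stand-ins for the latent factors an ALS model would learn.
user_factors = rng.normal(size=(n_users, n_factors))
item_factors = rng.normal(size=(n_items, n_factors))

def generate_candidates(user_idx, n_candidates=10):
    """Stage 1: top-N items by dot product of user and item factors."""
    scores = item_factors @ user_factors[user_idx]
    # argpartition finds the N best items without sorting all of them
    top = np.argpartition(-scores, n_candidates - 1)[:n_candidates]
    return top[np.argsort(-scores[top])]

def rank_candidates(user_idx, candidates, score_fn):
    """Stage 2: re-order the candidates with a second scoring function."""
    ranker_scores = np.array([score_fn(user_idx, int(item)) for item in candidates])
    return candidates[np.argsort(-ranker_scores, kind="stable")]

def toy_ranker(user_idx, item_idx):
    # Hypothetical stand-in for a ranking model's predicted watch_ratio.
    return float((item_idx * 7 + user_idx) % 13)

candidates = generate_candidates(user_idx=0, n_candidates=10)
ranked = rank_candidates(0, candidates, toy_ranker)
print("candidates:", candidates)
print("re-ranked: ", ranked)
```

The first stage is cheap and only narrows the catalogue; the second stage can afford richer features because it scores a handful of items per user.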

## 1. Imports and Configuration

Import necessary libraries and define paths for loading processed data and saving trained models.

In [3]:
import pandas as pd
import numpy as np
import gc
import os
import json
from scipy.sparse import csr_matrix, load_npz
import implicit
import joblib
import lightgbm as lgb

# --- Path Definitions ---
RAW_DATA_BASE_PATH = "../raw_data/KuaiRec/data/"
PROCESSED_DATA_PATH = "../data/"
MODELS_PATH = "../models/"
os.makedirs(MODELS_PATH, exist_ok=True)

# --- Global Variables ---
TARGET_COL = 'watch_ratio' # Consistent with data preparation

## 2. ALS Model Training (Candidate Generation)

The ALS model is trained on the sparse user-item interaction matrix built from the `big_matrix` interactions. Each `watch_ratio` value, floored at 0.001, is used as the confidence score for its user-video pair.
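
A minimal sketch of this matrix construction on toy data may help. The IDs, ratios, and mappings below are made up for illustration; only the steps (filter to mapped IDs, map to indices, floor at 0.001, build a CSR matrix) mirror the training cell.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy raw interactions (assumed values): one row per (user_id, video_id, watch_ratio)
user_ids = [14, 14, 27, 27, 27]
video_ids = [103, 250, 103, 103, 999]
watch_ratio = np.array([0.0, 1.5, 0.4, 0.6, 2.0])

# Hypothetical ID -> row/column index mappings; video 999 is deliberately unmapped
user_to_idx = {14: 0, 27: 1}
video_to_idx = {103: 0, 250: 1}

# Keep only interactions whose user and video both have an index
keep = np.array([u in user_to_idx and v in video_to_idx
                 for u, v in zip(user_ids, video_ids)])
rows = np.array([user_to_idx[u] for u, k in zip(user_ids, keep) if k])
cols = np.array([video_to_idx[v] for v, k in zip(video_ids, keep) if k])

# Floor the confidence so a zero watch_ratio is still stored as an observed entry
vals = np.maximum(watch_ratio[keep], 0.001)

matrix = csr_matrix((vals, (rows, cols)),
                    shape=(len(user_to_idx), len(video_to_idx)))
print(matrix.toarray())
```

Note that SciPy sums duplicate (row, column) entries when building the matrix, so the two interactions of user 27 with video 103 collapse into a single cell holding their combined value.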

**Note:** The data preparation notebook saves the ID mappings (`user_to_idx`, `video_to_idx`) to disk. The cell below loads them and rebuilds `interaction_matrix_als` from the raw `big_matrix.csv` interactions, so this notebook does not depend on objects left in memory by an earlier notebook.
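
One detail of loading the mappings is easy to miss: JSON object keys are always strings, so integer IDs have to be cast back after loading. A small sketch with a made-up mapping:

```python
import json

# Hypothetical mapping with integer video IDs as keys
video_to_idx = {103: 0, 250: 1}

# Round-trip through JSON: the integer keys come back as strings
restored_raw = json.loads(json.dumps(video_to_idx))
print(restored_raw)          # keys are '103' and '250', not 103 and 250

# Cast the keys back to int so lookups with integer IDs work again
restored = {int(k): v for k, v in restored_raw.items()}
print(restored == video_to_idx)
```

Without the cast, an `isin` or `map` call with integer IDs against the loaded dictionary would silently match nothing.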

In [4]:
print("--- Training ALS Model ---")

# Load ALS components (ID mappings and interaction matrix)
print("Re-creating ALS interaction matrix from processed big_matrix data...")
try:
    # Load the ID-to-index mappings saved by the data preparation notebook
    # (JSON keys are strings, so they are cast back to int)
    with open(os.path.join(PROCESSED_DATA_PATH, 'user_to_idx_als.json'), 'r') as f:
        user_to_idx = {int(k): v for k, v in json.load(f).items()}
    with open(os.path.join(PROCESSED_DATA_PATH, 'video_to_idx_als.json'), 'r') as f:
        video_to_idx = {int(k): v for k, v in json.load(f).items()}
    
    # We need the raw interactions from big_matrix to build the sparse matrix again
    print("Loading minimal big_matrix_interactions for ALS matrix construction...")
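    # Read IDs as float32 so missing values parse as NaN, then cast to int32 after filling.
    # float32 represents integers exactly only up to 2**24, so this assumes IDs stay below that.
    interaction_cols_initial_load_als = {'user_id': 'float32', 'video_id': 'float32', 'watch_ratio': 'float32'}
    interaction_cols_final_dtypes_als = {'user_id': 'int32', 'video_id': 'int32', 'watch_ratio': 'float32'}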
    
    temp_big_interactions = pd.read_csv(os.path.join(RAW_DATA_BASE_PATH, "big_matrix.csv"),
                                  usecols=interaction_cols_initial_load_als.keys(),
                                  dtype=interaction_cols_initial_load_als)
    # Fill missing IDs with -1 (removed by the mapping filter below, provided -1 is not a mapped ID)
    # and cast each column to its final dtype
    for col in ['user_id', 'video_id']:
        temp_big_interactions[col] = temp_big_interactions[col].fillna(-1).astype(interaction_cols_final_dtypes_als[col])
    temp_big_interactions['watch_ratio'] = temp_big_interactions['watch_ratio'].astype(interaction_cols_final_dtypes_als['watch_ratio'])

    # Filter out any interactions where user_id or video_id is not in our mappings
    temp_big_interactions = temp_big_interactions[
        temp_big_interactions['user_id'].isin(user_to_idx.keys()) &
        temp_big_interactions['video_id'].isin(video_to_idx.keys())
    ]

    als_user_ids = temp_big_interactions['user_id'].map(user_to_idx)
    als_item_ids = temp_big_interactions['video_id'].map(video_to_idx)
    als_ratings = temp_big_interactions['watch_ratio']
    als_ratings_clipped = np.maximum(als_ratings, 0.001)
    
    num_users_als = len(user_to_idx)
    num_videos_als = len(video_to_idx)

    interaction_matrix_als = csr_matrix((als_ratings_clipped, (als_user_ids, als_item_ids)),
                                        shape=(num_users_als, num_videos_als))
    print(f"Successfully re-created ALS Sparse Matrix Shape: {interaction_matrix_als.shape}, NNZ: {interaction_matrix_als.nnz}")
    del temp_big_interactions, als_user_ids, als_item_ids, als_ratings, als_ratings_clipped
    gc.collect()

except FileNotFoundError as e:
    print(f"Error: required input file not found: {e.filename}. Please run the Data Preparation notebook first.")
    raise

# ALS Model Configuration
als_params = {
    'factors': 100, 
    'regularization': 0.1,
    'iterations': 20,
    'use_cg': True,
    'calculate_training_loss': True,
    'random_state': 42 # For reproducibility
}

als_model = implicit.als.AlternatingLeastSquares(**als_params)

# Train the ALS model (implicit >= 0.5 expects a user-item matrix; older releases expected item-user)
print("Fitting ALS model...")
als_model.fit(interaction_matrix_als)
print("ALS model training complete.")

# Save the trained ALS model
als_model_path = os.path.join(MODELS_PATH, "als_model.joblib")
joblib.dump(als_model, als_model_path)
print(f"ALS model saved to: {als_model_path}")

del interaction_matrix_als, als_model
gc.collect()

--- Training ALS Model ---
Re-creating ALS interaction matrix from processed big_matrix data...
Loading minimal big_matrix_interactions for ALS matrix construction...
Successfully re-created ALS Sparse Matrix Shape: (7176, 10728), NNZ: 10300969
Fitting ALS model...


ALS model training complete.
ALS model saved to: ../models/als_model.joblib



## 3. LightGBM Model Training (Ranking)

The LightGBM model is trained for the ranking task using the feature-engineered training data derived from `big_matrix`. It predicts the `watch_ratio`.
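
The training cell in this section uses the `regression_l1` objective, which minimizes mean absolute error. A short numerical sketch shows why that choice is less sensitive to extreme target values than squared error. The `watch_ratio` values here are invented for illustration, with one deliberately extreme entry.

```python
import numpy as np

# Toy targets (assumed values) with one extreme watch_ratio
watch_ratio = np.array([0.2, 0.5, 0.8, 1.0, 25.0])

# Evaluate every constant prediction on a fine grid under both losses
grid = np.linspace(0.0, 25.0, 2501)
errors = watch_ratio[None, :] - grid[:, None]
mae = np.abs(errors).mean(axis=1)
mse = (errors ** 2).mean(axis=1)

best_l1 = grid[mae.argmin()]   # best constant under L1 loss
best_l2 = grid[mse.argmin()]   # best constant under L2 loss

print(f"L1-optimal constant: {best_l1:.2f} (median = {np.median(watch_ratio):.2f})")
print(f"L2-optimal constant: {best_l2:.2f} (mean   = {watch_ratio.mean():.2f})")
```

The L1 optimum sits at the median and barely notices the extreme value, while the L2 optimum is pulled toward it because it tracks the mean.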

In [5]:
print("\n--- Training LightGBM Model ---")

# Load preprocessed training data for LightGBM
print("Loading LightGBM training data...")
train_lgbm_parquet_path = os.path.join(PROCESSED_DATA_PATH, 'lightgbm_train_data.parquet')
try:
    train_lgbm_df = pd.read_parquet(train_lgbm_parquet_path)
except FileNotFoundError:
    print(f"Error: {train_lgbm_parquet_path} not found. Please run Data Preparation notebook first.")
    raise

print(f"Loaded LightGBM training data: {train_lgbm_df.shape}")

y_train_lgbm = train_lgbm_df[TARGET_COL]
X_train_lgbm = train_lgbm_df.drop(columns=[TARGET_COL])
del train_lgbm_df
gc.collect()

# Determine categorical features for LightGBM
base_categorical_features = ['user_id', 'video_id', 'user_active_degree', 
                             'interaction_hour', 'interaction_day_of_week']
user_flag_categoricals = ['is_lowactive_period', 'is_live_streamer', 'is_video_author']
onehot_feature_names_train = [f'onehot_feat{i}' for i in range(18)]
daily_item_categoricals = ['author_id', 'video_type', 'video_tag_id']

categorical_features_for_lgbm_training = []
for col_list in [base_categorical_features, user_flag_categoricals, onehot_feature_names_train, daily_item_categoricals]:
    for col in col_list:
        if col in X_train_lgbm.columns:
            categorical_features_for_lgbm_training.append(col)

# Ensure categorical features have 'category' dtype
print("Verifying and casting dtypes for categorical features in training data...")
for col in categorical_features_for_lgbm_training:
    if X_train_lgbm[col].dtype.name != 'category':
        X_train_lgbm[col] = X_train_lgbm[col].astype('category')
print(f"Identified {len(categorical_features_for_lgbm_training)} categorical features for LGBM training.")

# LightGBM Model Configuration
lgbm_train_params = {
    'objective': 'regression_l1', # L1 (MAE) loss is less sensitive to extreme watch_ratio values than L2
    'metric': ['mae', 'rmse'], # Only reported when valid_sets is passed to lgb.train (not the case here)
    'boosting_type': 'gbdt',
    'num_leaves': 63,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,    # Column subsampling
    'bagging_fraction': 0.8,    # Row subsampling
    'bagging_freq': 1,          # Perform bagging at every iteration
    'verbose': -1,              
    'n_jobs': -1,               
    'seed': 42                  # For reproducibility
}
num_boost_round_lgbm = 1000 # Number of boosting iterations

print("Creating LightGBM training dataset...")
lgb_train_dataset = lgb.Dataset(X_train_lgbm, y_train_lgbm,
                                categorical_feature=categorical_features_for_lgbm_training,
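                                free_raw_data=False) # Dataset keeps a reference to the raw features
# Because free_raw_data=False, the Dataset still references the feature frame,
# so this del only drops the local names; the memory is released with the Dataset.
del X_train_lgbm, y_train_lgbm
gc.collect()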

print(f"Training LightGBM model for {num_boost_round_lgbm} rounds...")
model_lgbm_trained = lgb.train(
    params=lgbm_train_params,
    train_set=lgb_train_dataset,
    num_boost_round=num_boost_round_lgbm
    # To enable early stopping, pass valid_sets=[...] and callbacks=[lgb.early_stopping(...)]
)
print("LightGBM model training complete.")

# Save the trained LightGBM model
lgbm_model_path = os.path.join(MODELS_PATH, "lightgbm_ranker_model.txt")
model_lgbm_trained.save_model(lgbm_model_path)
print(f"LightGBM model saved to: {lgbm_model_path}")

del lgb_train_dataset, model_lgbm_trained
gc.collect()

print(f"\n--- Model Training Phase Complete. Models are saved in {MODELS_PATH} ---")


--- Training LightGBM Model ---
Loading LightGBM training data...
Loaded LightGBM training data: (10024644, 42)
Verifying and casting dtypes for categorical features in training data...
Identified 29 categorical features for LGBM training.
Creating LightGBM training dataset...
Training LightGBM model for 1000 rounds...
LightGBM model training complete.
LightGBM model saved to: ../models/lightgbm_ranker_model.txt

--- Model Training Phase Complete. Models are saved in ../models/ ---
