# KRAFT: Exploratory Data Analysis

This notebook performs an exploratory data analysis on the KuaiRec dataset. The goals are to:
1. Understand the basic statistics and distributions of users, items, and interactions.
2. Analyze the characteristics of user features and item features.
3. Investigate the target variable (`watch_ratio`) and its relationship with other features.
4. Gain insights that can inform feature engineering, model selection, and evaluation strategies for the recommender system.

We will primarily load raw data files for this EDA to inspect their original state.

## 1. Imports and Configuration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import gc

# --- Path Definitions ---
RAW_DATA_BASE_PATH = "../raw_data/KuaiRec/data/"

# --- Plotting Configuration ---
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 12

## 2. Loading Data for EDA

We load the key raw data files with an explicit column subset and compact dtypes. If a file such as `big_matrix.csv` proves too memory-intensive for interactive EDA, loading a row sample or fewer columns is an option, but at this dataset's size a full load of the selected columns is usually feasible.
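
If a full load ever becomes too heavy, one option is to read a head sample with `nrows`, or to stream the file with `chunksize` while keeping running aggregates. The sketch below is illustrative only: the in-memory CSV, its rows, and the `extra` column are made up and are not KuaiRec data.

```python
import io
import pandas as pd

# Made-up stand-in for an interaction file (not real KuaiRec rows).
csv_text = """user_id,video_id,play_duration,video_duration,watch_ratio,extra
0,10,5000,10000,0.5,a
0,11,12000,6000,2.0,b
1,10,3000,10000,0.3,c
2,12,8000,8000,1.0,d
"""

cols = {"user_id": "int32", "video_id": "int32", "watch_ratio": "float32"}

# Option 1: load only the first rows for a quick look.
df_head = pd.read_csv(io.StringIO(csv_text), usecols=list(cols), dtype=cols, nrows=2)

# Option 2: stream the file in chunks and keep running aggregates,
# so the full file never has to sit in memory at once.
total_rows = 0
ratio_sum = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=list(cols), dtype=cols, chunksize=2):
    total_rows += len(chunk)
    ratio_sum += float(chunk["watch_ratio"].sum())

mean_watch_ratio = ratio_sum / total_rows
print(df_head.shape, total_rows, round(mean_watch_ratio, 4))
```

Note that `nrows` takes the first rows of the file, which is not a random sample if the file is sorted by user or time.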

In [None]:
# Define columns and dtypes for leaner loading during EDA
interaction_eda_cols = {
    'user_id': 'int32', 'video_id': 'int32', 
    'play_duration': 'int32', 'video_duration': 'int32', 
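    # 'date' uses the nullable Int32 dtype: with plain int32, read_csv fails if any row has a missing date
    'time': 'str', 'date': 'Int32', 'watch_ratio': 'float32'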
}
user_features_eda_cols = { # Load most user features for EDA
    'user_id': 'int32', 'user_active_degree': 'category',
    'is_lowactive_period': 'int8', 'is_live_streamer': 'int8', 
    'is_video_author': 'int8', 'follow_user_num': 'int32',
    'fans_user_num': 'int32', 'friend_user_num': 'int32', 
    'register_days': 'int32'
    # Skipping onehot_feat* for initial EDA overview, can be added if needed
}
item_categories_eda_cols = {'video_id': 'int32', 'feat': 'str'}
item_daily_eda_cols = { # Select key daily features
    'video_id': 'int32', 'date': 'int32', 'author_id': 'int32',
    'video_type': 'category', 'upload_dt': 'str', 'video_duration': 'float32',
    'show_cnt': 'int32', 'play_cnt': 'int32', 'like_cnt': 'int32',
    'play_progress': 'float32'
}

print("Loading big_matrix.csv for EDA...")
try:
    df_big = pd.read_csv(os.path.join(RAW_DATA_BASE_PATH, "big_matrix.csv"), 
                         usecols=interaction_eda_cols.keys(), 
                         dtype=interaction_eda_cols)
    print(f"big_matrix loaded: {df_big.shape}")
except Exception as e:
    print(f"Error loading big_matrix: {e}")
    df_big = pd.DataFrame() # Empty df if load fails

print("\nLoading small_matrix.csv for EDA...")
try:
    df_small = pd.read_csv(os.path.join(RAW_DATA_BASE_PATH, "small_matrix.csv"), 
                           usecols=interaction_eda_cols.keys(), 
                           dtype=interaction_eda_cols)
    print(f"small_matrix loaded: {df_small.shape}")
except Exception as e:
    print(f"Error loading small_matrix: {e}")
    df_small = pd.DataFrame()

print("\nLoading user_features.csv for EDA...")
try:
    df_users = pd.read_csv(os.path.join(RAW_DATA_BASE_PATH, "user_features.csv"), 
                           usecols=user_features_eda_cols.keys(),
                           dtype=user_features_eda_cols)
    print(f"user_features loaded: {df_users.shape}")
except Exception as e:
    print(f"Error loading user_features: {e}")
    df_users = pd.DataFrame()

print("\nLoading item_categories.csv for EDA...")
try:
    df_items_cat = pd.read_csv(os.path.join(RAW_DATA_BASE_PATH, "item_categories.csv"), 
                               usecols=item_categories_eda_cols.keys(),
                               dtype=item_categories_eda_cols)
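    # Process 'feat' to get number of tags
    import ast  # literal_eval parses the list literal without executing arbitrary code
    def parse_feat_list(feat_str):
        if pd.isna(feat_str) or not isinstance(feat_str, str):
            return 0
        try:
            return len(ast.literal_eval(feat_str))
        except (ValueError, SyntaxError, TypeError):
            return 0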
    df_items_cat['num_tags'] = df_items_cat['feat'].apply(parse_feat_list).astype('int16')
    print(f"item_categories loaded and processed: {df_items_cat.shape}")
except Exception as e:
    print(f"Error loading item_categories: {e}")
    df_items_cat = pd.DataFrame()

print("\nLoading item_daily_features.csv for EDA...")
try:
    df_items_daily = pd.read_csv(os.path.join(RAW_DATA_BASE_PATH, "item_daily_features.csv"), 
                                 usecols=item_daily_eda_cols.keys(),
                                 dtype=item_daily_eda_cols,
                                 # nrows=500000 # Optional: load a sample for quicker EDA on very large daily file
                                )
    print(f"item_daily_features loaded: {df_items_daily.shape}")
except Exception as e:
    print(f"Error loading item_daily_features: {e}")
    df_items_daily = pd.DataFrame()

## 3. Basic Statistics and Sparsity

Understanding the scale of the data (users, items, interactions) and the sparsity of the interaction matrices is fundamental.

In [None]:
if not df_big.empty:
    num_users_big = df_big['user_id'].nunique()
    num_items_big = df_big['video_id'].nunique()
    num_interactions_big = len(df_big)
    sparsity_big = 1 - (num_interactions_big / (num_users_big * num_items_big))
    print("--- Big Matrix Statistics ---")
    print(f"Number of unique users: {num_users_big}")
    print(f"Number of unique items: {num_items_big}")
    print(f"Number of interactions: {num_interactions_big}")
    print(f"Sparsity: {sparsity_big:.4f} ({(sparsity_big * 100):.2f}% empty)")
    
    # The dataset description reports a density of 16.3%, i.e. a sparsity of 1 - 0.163 = 0.837.
    # That figure is based on the user/item totals of the full dataset rather than only the
    # users/items that appear in big_matrix interactions, so we recompute density with those totals.
    total_users_dataset = 7176
    total_items_dataset = 10728 
    density_big_reported = num_interactions_big / (total_users_dataset * total_items_dataset)
    print(f"Density (based on all users/items in dataset): {density_big_reported:.4f} ({(density_big_reported * 100):.2f}%)")

if not df_small.empty:
    num_users_small = df_small['user_id'].nunique()
    num_items_small = df_small['video_id'].nunique()
    num_interactions_small = len(df_small)
    density_small_reported = num_interactions_small / (num_users_small * num_items_small) # Small matrix is dense for its users/items
    print("\n--- Small Matrix Statistics ---")
    print(f"Number of unique users: {num_users_small}")
    print(f"Number of unique items: {num_items_small}")
    print(f"Number of interactions: {num_interactions_small}")
    print(f"Density (for its own scope): {density_small_reported:.4f} ({(density_small_reported * 100):.2f}%)")

**Justification for Model Choices (Initial Thoughts):**
- Most user-item pairs in `big_matrix` are unobserved (about 84%, given the dataset's reported 16.3% density), which favors methods that cope well with sparse data: matrix factorization variants such as ALS, or hybrid models such as LightGBM with well-engineered features.
- The presence of rich side information (user and item features) strongly motivates a hybrid approach over pure collaborative filtering.
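
As a small worked example of the density and sparsity arithmetic, the sketch below uses a made-up interaction table. The helper `interaction_density` is hypothetical, and it counts distinct `(user_id, video_id)` pairs, whereas the cell above counts rows; the two agree only when no pair repeats.

```python
import pandas as pd

def interaction_density(df, n_users=None, n_items=None):
    """Fraction of user-item pairs with at least one interaction row.

    n_users / n_items default to the unique counts seen in df; pass the
    full-dataset totals to reproduce a "reported" density instead.
    """
    observed_pairs = df[["user_id", "video_id"]].drop_duplicates().shape[0]
    n_users = n_users or df["user_id"].nunique()
    n_items = n_items or df["video_id"].nunique()
    return observed_pairs / (n_users * n_items)

# Made-up interactions: 3 users, 2 videos, one repeated pair (0, 10).
toy = pd.DataFrame({
    "user_id":  [0, 0, 0, 1, 2],
    "video_id": [10, 10, 11, 10, 11],
})

density_own_scope = interaction_density(toy)                       # 4 pairs / (3 * 2)
density_full_scope = interaction_density(toy, n_users=4, n_items=5)  # 4 pairs / (4 * 5)
print(f"density={density_own_scope:.4f}, sparsity={1 - density_own_scope:.4f}")
```

Passing the larger full-dataset totals always lowers the density, which is why the two figures printed in the cell above can differ.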

## 4. Analysis of Interaction Data (`df_big`)

In [None]:
if not df_big.empty:
    print("\n--- Interaction Data Analysis (df_big) ---")
    
    # Distribution of watch_ratio
    plt.figure(figsize=(10, 5))
    sns.histplot(df_big['watch_ratio'], bins=50, kde=False)
    plt.title('Distribution of Watch Ratio (Big Matrix)')
    plt.xlabel('Watch Ratio')
    plt.ylabel('Frequency')
    plt.xlim(0, 5) # Zoom in on a practical range
    plt.show()
    print(df_big['watch_ratio'].describe())
    print(f"Percentage of interactions with watch_ratio > 1: {(df_big['watch_ratio'] > 1).mean()*100:.2f}%")
    print(f"Percentage of interactions with watch_ratio > 2: {(df_big['watch_ratio'] > 2).mean()*100:.2f}%")

    # Distribution of play_duration and video_duration (log scale due to skew)
    fig, axes = plt.subplots(1, 2, figsize=(16, 5))
    sns.histplot(np.log1p(df_big['play_duration']), bins=50, ax=axes[0], kde=True)
    axes[0].set_title('Distribution of log(Play Duration)')
    sns.histplot(np.log1p(df_big['video_duration']), bins=50, ax=axes[1], kde=True)
    axes[1].set_title('Distribution of log(Video Duration)')
    plt.tight_layout()
    plt.show()
    
    # Interactions per user
    user_interaction_counts = df_big['user_id'].value_counts()
    plt.figure(figsize=(10, 5))
    sns.histplot(user_interaction_counts, bins=50, log_scale=(False, True))
    plt.title('Distribution of Interactions per User (Big Matrix)')
    plt.xlabel('Number of Interactions')
    plt.ylabel('Number of Users (log scale)')
    plt.show()
    print("\nInteractions per user describe:")
    print(user_interaction_counts.describe())

    # Interactions per item (video)
    item_interaction_counts = df_big['video_id'].value_counts()
    plt.figure(figsize=(10, 5))
    sns.histplot(item_interaction_counts, bins=50, log_scale=(True, True))
    plt.title('Distribution of Interactions per Video (Big Matrix)')
    plt.xlabel('Number of Interactions (log scale)')
    plt.ylabel('Number of Videos (log scale)')
    plt.show()
    print("\nInteractions per video describe:")
    print(item_interaction_counts.describe())

    # Temporal patterns (assumes 'time' holds parseable timestamp strings)
    df_big['interaction_datetime'] = pd.to_datetime(df_big['time'], errors='coerce')
    # Drop rows whose time could not be parsed. Note: this mutates df_big in place,
    # so later cells also see the reduced frame.
    df_big.dropna(subset=['interaction_datetime'], inplace=True)
    
    if not df_big.empty:
        df_big['interaction_hour'] = df_big['interaction_datetime'].dt.hour
        plt.figure(figsize=(10,5))
        sns.countplot(data=df_big, x='interaction_hour')
        plt.title('Interactions by Hour of Day (Big Matrix)')
        plt.show()
    else:
        print("df_big became empty after dropping unparseable times, skipping hourly plot.")
else:
    print("df_big is empty, skipping interaction analysis.")

**Justification for Feature Engineering (Interactions):**
- The `watch_ratio` distribution shows many values between 0 and 2, with a long right tail. Values above 1 mean the play duration exceeded the video length, for example through rewatching or looping playback. `watch_ratio` is therefore a reasonable continuous regression target, and an MAE (L1) objective for LightGBM is a sensible choice because it is less sensitive to the tail than squared error.
- `play_duration` and `video_duration` are right-skewed, as is typical for time-based measures. A log transformation helps visualization and may help some models, although tree models handle the raw scale well.
- User and item interaction counts are long-tailed (power-law-like), as is typical in recommender systems: a few users are very active and a few items are very popular, while most are not. The models need to cope with this imbalance, and user/item activity counts are candidate features.
- The hour-of-day plot motivates `interaction_hour` as a temporal feature. `day_of_week` is a natural companion, although it is not plotted in this notebook.
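
Extracting the temporal features mentioned above can be sketched as follows. The timestamp strings are made up for illustration, and their format is an assumption about how the `time` column looks; unparseable values become missing rather than raising.

```python
import pandas as pd

# Made-up timestamps; the last one is deliberately invalid.
toy = pd.DataFrame({
    "time": ["2020-07-05 00:08:23.438", "2020-07-06 21:15:00.000", "not a time"]
})

ts = pd.to_datetime(toy["time"], errors="coerce")  # invalid strings -> NaT
toy["interaction_hour"] = ts.dt.hour               # 0-23, NaN where time is missing
toy["day_of_week"] = ts.dt.dayofweek               # Monday=0 ... Sunday=6

print(toy)
```

Both derived columns come out as floats because of the missing row; dropping or imputing that row first would allow a compact integer dtype.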

## 5. Analysis of User Features (`df_users`)

In [None]:
if not df_users.empty:
    print("\n--- User Feature Analysis ---")
    print(df_users.head())
    df_users.info()  # info() prints its report itself and returns None
    print(df_users.describe(include='all'))
    
    # Categorical user features
    if 'user_active_degree' in df_users.columns:
        plt.figure(figsize=(8, 4))
        sns.countplot(data=df_users, y='user_active_degree', order=df_users['user_active_degree'].value_counts().index)
        plt.title('Distribution of User Active Degree')
        plt.show()
    
    # Numerical user features
    numerical_user_cols = ['fans_user_num', 'register_days', 'follow_user_num', 'friend_user_num']
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    axes = axes.flatten()
    for i, col in enumerate(numerical_user_cols):
        if col in df_users.columns:
            sns.histplot(np.log1p(df_users[col]), bins=30, ax=axes[i], kde=True)
            axes[i].set_title(f'Distribution of log({col})')
    plt.tight_layout()
    plt.show()
else:
    print("df_users is empty, skipping user feature analysis.")

**Justification for Feature Engineering (User Features):**
- Categorical features like `user_active_degree` can be used directly or one-hot encoded. Their distribution shows whether certain activity levels dominate.
- Numerical features like `fans_user_num` and `register_days` are skewed, so a log transformation would help linear or distance-based models if those were used. Tree models are largely insensitive to monotone transformations, so keeping the raw values for LightGBM is fine. These features capture user engagement and tenure, which are likely predictive.
- Binary flags like `is_live_streamer`, `is_video_author` directly segment users and are easy to incorporate.
- The `onehot_feat*` (not plotted here for brevity) are pre-encoded categorical features and should be included in models that can handle categorical inputs well (like LightGBM).
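
The encoding options discussed above can be sketched on a toy frame. The category labels and counts below are illustrative values, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up user rows for illustration.
users = pd.DataFrame({
    "user_active_degree": ["full_active", "high_active", "full_active", "middle_active"],
    "fans_user_num": [0, 3, 25, 12000],
})

# log1p compresses the heavy right tail and is defined at zero.
users["log_fans_user_num"] = np.log1p(users["fans_user_num"])

# Option A: keep a pandas category column (integer codes under the hood).
users["user_active_degree"] = users["user_active_degree"].astype("category")
codes = users["user_active_degree"].cat.codes

# Option B: one-hot encode for models that need numeric inputs.
one_hot = pd.get_dummies(users["user_active_degree"], prefix="active")

print(users)
print(one_hot)
```

The log transform preserves the ordering of values, which is why it makes little difference to tree splits while helping scale-sensitive models.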

## 6. Analysis of Item Features (`df_items_cat`, `df_items_daily`)

In [None]:
if not df_items_cat.empty:
    print("\n--- Item Category Analysis ---")
    print(df_items_cat.head())
    if 'num_tags' in df_items_cat.columns:
        plt.figure(figsize=(8,4))
        sns.histplot(df_items_cat['num_tags'], discrete=True)  # discrete=True gives one unit-width bin per tag count
        plt.title('Distribution of Number of Tags per Item')
        plt.xlabel('Number of Tags')
        plt.show()
        print(df_items_cat['num_tags'].describe())

if not df_items_daily.empty:
    print("\n--- Item Daily Features Analysis ---")
    print(df_items_daily.head())
    df_items_daily.info()  # info() prints its report itself and returns None
    
    # Distribution of play_progress across all loaded daily entries
    if 'play_progress' in df_items_daily.columns:
        plt.figure(figsize=(10, 5))
        sns.histplot(df_items_daily['play_progress'].dropna(), bins=50, kde=True)
        plt.title('Distribution of Daily Play Progress (All Daily Entries)')
        plt.xlim(0, df_items_daily['play_progress'].quantile(0.99)) # Zoom on main distribution
        plt.show()
        print(df_items_daily['play_progress'].describe())
    
    # Average play_cnt per video over its observed days
    if 'play_cnt' in df_items_daily.columns:
        avg_daily_play_cnt = df_items_daily.groupby('video_id')['play_cnt'].mean()
        plt.figure(figsize=(10, 5))
        sns.histplot(np.log1p(avg_daily_play_cnt), bins=50, kde=True)
        plt.title('Distribution of log(Average Daily Play Count per Video)')
        plt.show()
        print("\nAverage Daily Play Count per Video describe:")
        print(avg_daily_play_cnt.describe())
    
    if 'video_type' in df_items_daily.columns:
        plt.figure(figsize=(8,4))
        sns.countplot(data=df_items_daily, y='video_type', order=df_items_daily['video_type'].value_counts().index)
        plt.title('Distribution of Video Types (Daily Entries)')
        plt.show()
else:
    print("df_items_daily is empty, skipping item daily feature analysis.")

**Justification for Feature Engineering (Item Features):**
- `num_tags` (derived from `item_categories`; listed as `num_item_tags` in the feature summary at the end of this notebook) is a simple proxy for how richly an item's content is categorized.
- `item_daily_features` are rich. 
    - Features like `play_progress` directly measure item engagement on a given day.
    - Aggregating daily stats (e.g., average daily play count) can give a general popularity signal for an item. This EDA shows these stats are also skewed.
    - Calculating ratios from daily counts (e.g., play/show, like/play) is crucial for normalizing popularity and capturing efficiency, as done in the main data prep notebook. This EDA shows the raw counts are available for such derivations.
    - `video_type` is a useful categorical feature.
    - `upload_dt` can be used to calculate `video_age_days`, which is a very common and useful feature representing item freshness.
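
The ratio and freshness features described above can be sketched on a toy daily table. The rows are made up, the helper `safe_ratio` is hypothetical, and the `upload_dt` string format is an assumption; the main data prep notebook may compute these differently.

```python
import pandas as pd

# Made-up daily rows; 'date' follows the YYYYMMDD integer layout used above.
daily = pd.DataFrame({
    "video_id": [1, 1, 2],
    "date": [20200705, 20200706, 20200705],
    "upload_dt": ["2020-07-01", "2020-07-01", "2020-07-05"],
    "show_cnt": [100, 0, 50],
    "play_cnt": [40, 0, 25],
    "like_cnt": [4, 0, 0],
})

def safe_ratio(num, den):
    """Element-wise num / den, giving NaN where the denominator is zero."""
    den = den.astype("float64")
    return num / den.where(den != 0)

daily["play_per_show"] = safe_ratio(daily["play_cnt"], daily["show_cnt"])
daily["like_per_play"] = safe_ratio(daily["like_cnt"], daily["play_cnt"])

# Video age in days on each observation date.
obs_date = pd.to_datetime(daily["date"].astype(str), format="%Y%m%d")
upload = pd.to_datetime(daily["upload_dt"], format="%Y-%m-%d")
daily["video_age_days"] = (obs_date - upload).dt.days

print(daily)
```

Leaving zero-denominator ratios as NaN keeps "no exposure" distinct from "exposed but never played", and LightGBM can consume missing values directly.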

## 7. Correlating Features with Target (Watch Ratio)

To get a preliminary idea of feature importance, we can compute average `watch_ratio` grouped by some key categorical features, or look at correlations for numerical features. This requires joining feature tables with interaction data aggregates.

In [None]:
if not df_big.empty and not df_users.empty:
    print("\n--- Correlating User Features with Watch Ratio (using df_big) ---")
    # Calculate average watch_ratio per user
    avg_watch_ratio_per_user = df_big.groupby('user_id')['watch_ratio'].mean().reset_index()
    avg_watch_ratio_per_user.rename(columns={'watch_ratio': 'avg_user_watch_ratio'}, inplace=True)
    
    # Merge with user features
    df_users_eda_merged = pd.merge(df_users, avg_watch_ratio_per_user, on='user_id', how='left')
    
    if 'user_active_degree' in df_users_eda_merged.columns:
        plt.figure(figsize=(10, 6))
        sns.boxplot(data=df_users_eda_merged, x='user_active_degree', y='avg_user_watch_ratio',
                    order=df_users_eda_merged.groupby('user_active_degree', observed=True)['avg_user_watch_ratio'].median().sort_values(ascending=False).index)
        plt.title('Average Watch Ratio by User Active Degree')
        plt.ylim(0, df_users_eda_merged['avg_user_watch_ratio'].quantile(0.95)) # Zoom
        plt.xticks(rotation=45)
        plt.show()
        
    numerical_user_cols_for_corr = ['fans_user_num', 'register_days', 'follow_user_num']
    for col in numerical_user_cols_for_corr:
        if col in df_users_eda_merged.columns and 'avg_user_watch_ratio' in df_users_eda_merged.columns:
            # Scatter plot (log-scaled x) for fans_user_num only; Spearman rank correlation
            # is computed for every column below because the features are skewed
            if col == 'fans_user_num':
                plt.figure(figsize=(8,5))
                # Sample to avoid overplotting if df is large (fixed seed for reproducibility)
                sample_df = df_users_eda_merged.sample(min(len(df_users_eda_merged), 5000), random_state=42)
                sns.scatterplot(data=sample_df, x=np.log1p(sample_df[col]), y=sample_df['avg_user_watch_ratio'])
                plt.title(f'log({col}) vs. Avg User Watch Ratio')
                plt.ylim(0, sample_df['avg_user_watch_ratio'].quantile(0.98))
                plt.show()
            correlation = df_users_eda_merged[col].corr(df_users_eda_merged['avg_user_watch_ratio'], method='spearman')
            print(f"Spearman correlation between {col} and avg_user_watch_ratio: {correlation:.2f}")
else:
    print("Skipping user feature correlation with watch ratio due to missing dataframes.")

if not df_big.empty and not df_items_daily.empty:
    print("\n--- Correlating Item Daily Features with Watch Ratio (using df_big) ---")
    # Calculate average watch_ratio per item
    avg_watch_ratio_per_item = df_big.groupby('video_id')['watch_ratio'].mean().reset_index()
    avg_watch_ratio_per_item.rename(columns={'watch_ratio': 'avg_item_watch_ratio'}, inplace=True)
    
    # We need an average of daily stats per video_id from df_items_daily
    item_daily_summary_eda = df_items_daily.groupby('video_id').agg(
        avg_daily_play_progress=('play_progress', 'mean'),
        video_type_mode=('video_type', lambda x: x.mode()[0] if not x.mode().empty else 'Unknown')
    ).reset_index()
    
    df_items_eda_merged = pd.merge(item_daily_summary_eda, avg_watch_ratio_per_item, on='video_id', how='inner')
    
    if 'avg_daily_play_progress' in df_items_eda_merged.columns:
        correlation = df_items_eda_merged['avg_daily_play_progress'].corr(df_items_eda_merged['avg_item_watch_ratio'])
        print(f"Pearson correlation between avg_daily_play_progress and avg_item_watch_ratio: {correlation:.2f}")
        plt.figure(figsize=(8,5))
        sample_df = df_items_eda_merged.sample(min(len(df_items_eda_merged), 5000), random_state=42)  # fixed seed for reproducibility
        sns.scatterplot(data=sample_df, x='avg_daily_play_progress', y='avg_item_watch_ratio')
        plt.title('Avg Daily Play Progress vs. Avg Item Watch Ratio')
        plt.ylim(0, sample_df['avg_item_watch_ratio'].quantile(0.98))
        plt.xlim(0, sample_df['avg_daily_play_progress'].quantile(0.98))
        plt.show()
else:
    print("Skipping item feature correlation with watch ratio due to missing dataframes.")

**Justification for Feature Importance (Preliminary):**
- Features that show clear trends or correlations with average `watch_ratio`, even at the aggregated level used here, are good candidates for the feature set. For example, `user_active_degree` appears to be associated with different average watch ratios, and `fans_user_num` and `avg_daily_play_progress` also show some correlation.
- This justifies keeping these types of features in the main data preparation pipeline.
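
To illustrate why a rank correlation suits skewed features such as `fans_user_num`, the sketch below compares Pearson and Spearman on made-up values. Spearman is computed here as Pearson on ranks, which is its definition.

```python
import pandas as pd

# Made-up values: x has a heavy right tail, y increases monotonically with x.
x = pd.Series([0, 1, 5, 20, 300, 50000])
y = pd.Series([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])

pearson = x.corr(y)                  # linear association, dominated by the outlier
spearman = x.rank().corr(y.rank())   # Pearson on ranks = Spearman

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

The relationship is perfectly monotone, so Spearman is 1, while the single extreme value pulls Pearson well below that.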

## 8. EDA Summary & Next Steps

This initial EDA has provided several key insights:
1.  **Data Characteristics:** We have a large, sparse `big_matrix` for training and a smaller, dense `small_matrix`. Because nearly every user-item pair within its scope is observed, `small_matrix` is valuable for evaluation on fully observed interactions.
2.  **Target Variable:** `watch_ratio` is continuous and right-skewed, with values often exceeding 1.0. This supports regression models and loss functions robust to outliers (like L1/MAE).
3.  **Feature Relevance:** User activity, tenure, and popularity (fans), item daily engagement metrics (like play progress), and item categories (number of tags) show potential relationships with the target. Temporal features (hour, day of week, video age) are also likely to matter, although only hour of day is examined here.
4.  **Data Quality:** The `item_daily_features` file contained duplicate entries for `(video_id, date)`, which needed to be handled in the main data preparation pipeline to prevent rows from being duplicated during merges.
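
A duplicate-key check of the kind described in the data quality point can be sketched as follows. The rows are made up, and keeping the last row per key is an assumption for illustration; the main pipeline may resolve duplicates differently (for example by aggregating).

```python
import pandas as pd

# Made-up daily rows with one duplicated (video_id, date) key.
daily = pd.DataFrame({
    "video_id": [1, 1, 1, 2],
    "date": [20200705, 20200705, 20200706, 20200705],
    "play_cnt": [40, 42, 10, 25],
})

key = ["video_id", "date"]

# keep=False flags every row that shares its key with another row.
dup_mask = daily.duplicated(subset=key, keep=False)
n_dup_rows = int(dup_mask.sum())

# One possible resolution: keep the last row seen for each key.
deduped = daily.drop_duplicates(subset=key, keep="last")

print(f"rows involved in duplicate keys: {n_dup_rows}")
print(deduped)
```

Without this step, merging on `(video_id, date)` would emit one output row per matching duplicate and silently inflate the interaction table.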

**Implications for Modeling & Feature Engineering:**
- The choice of a two-stage model (ALS for candidates, LightGBM for ranking) is consistent with these findings: ALS handles sparsity for candidate generation, and LightGBM can leverage the rich feature set for fine-grained ranking.
- The feature engineering steps in the main data preparation notebook (e.g., creating ratios from daily stats, calculating video age, extracting temporal features) are justified by the patterns observed here.
- The selected features (`user_active_degree`, `fans_user_num`, `register_days`, `onehot_feat*` for users; `num_item_tags`, daily ratios, `video_age_days`, `author_id`, `video_type` for items) seem like a good starting point.

This EDA provides a foundation for the choices made in subsequent data preparation, model training, and evaluation notebooks.