# Phase 3: ML Preprocessing (Optimized)
## MovieLens 32M Dataset (30% Sample)

**Objectives:**
1. Create user-stratified random train/val/test splits
2. Ensure every user and item evaluated in val/test also appears in training
3. Build ID mappings and sparse matrices
4. Compute statistics from training data only
5. Prepare content-based features (genres)
6. Save all artifacts for Phase 4 (Model Training)

**Why a Random Split (not Temporal)?**
- Movie preferences are assumed to be stable over time (a movie's genres do not change)
- A temporal split leaves roughly 93% of val/test users without any training ratings (artificial cold-start)
- Rating patterns are assumed to be consistent regardless of when a rating was given
- Focus: model quality, not temporal realism
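
The cold-start effect described above can be checked on any ratings table by counting how many evaluation users never appear in training. The following is a minimal sketch on made-up data, not the measurement behind the 93% figure; `cold_start_share`, the toy frame, and the cutoff value are hypothetical.

```python
import pandas as pd


def cold_start_share(train, test, key="userId"):
    """Fraction of distinct `key` values in test that never appear in train."""
    test_ids = set(test[key])
    if not test_ids:
        return 0.0
    return len(test_ids - set(train[key])) / len(test_ids)


# Toy data (hypothetical): each user is active in a different time window.
toy = pd.DataFrame({
    "userId":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movieId":   [10, 11, 12, 10, 11, 13, 11, 12, 13],
    "timestamp": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Global temporal split: everything after the cutoff becomes the test set.
cutoff = 6  # assumed cutoff, chosen only for this illustration
temporal_train = toy[toy["timestamp"] <= cutoff]
temporal_test = toy[toy["timestamp"] > cutoff]

# Per-user split: hold out one rating per user, keep the rest for training.
per_user_test = toy.groupby("userId").tail(1)
per_user_train = toy.drop(per_user_test.index)

print("temporal:", cold_start_share(temporal_train, temporal_test))
print("per-user:", cold_start_share(per_user_train, per_user_test))
```

In this toy case the only test user under the temporal split has no training history, while the per-user split leaves no test user unseen.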

**Data Leakage Prevention:**
- Statistics computed from training set only
- ID mappings fitted on training set only

In [1]:
import numpy as np
import pandas as pd
import pickle
import os
from scipy.sparse import csr_matrix, save_npz
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from collections import defaultdict

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
np.random.seed(42)

In [None]:
# ===========================================
# CONFIGURATION
# ===========================================

PROCESSED_PATH = 'data/processed'
ML_READY_PATH = 'data/ml_ready'

# Split ratios
TRAIN_RATIO = 0.70
VAL_RATIO = 0.15
TEST_RATIO = 0.15

# Minimum ratings thresholds
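MIN_USER_RATINGS = 5   # Users must have at least this many ratings (checked before the split)
MIN_ITEM_RATINGS = 5   # Items must have at least this many ratings (checked before the split, not on training alone)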

# Random state for reproducibility
RANDOM_STATE = 42

os.makedirs(ML_READY_PATH, exist_ok=True)

print("CONFIGURATION")
print("=" * 50)
print(f"Input path: {PROCESSED_PATH}")
print(f"Output path: {ML_READY_PATH}")
print(f"Split ratios: {TRAIN_RATIO}/{VAL_RATIO}/{TEST_RATIO}")
print(f"Min user ratings: {MIN_USER_RATINGS}")
print(f"Min item ratings: {MIN_ITEM_RATINGS}")
print(f"Random state: {RANDOM_STATE}")

CONFIGURATION
Input path: D:/Courses/DL INTERNSHIP/THIRD PROJECT/data/processed
Output path: D:/Courses/DL INTERNSHIP/THIRD PROJECT/data/ml_ready
Split ratios: 0.7/0.15/0.15
Min user ratings: 5
Min item ratings: 5
Random state: 42


In [3]:
# Track all preprocessing steps for summary
preprocessing_log = {}

---
## 1. Load Clean Data

In [4]:
print("=" * 60)
print("1. LOAD CLEAN DATA")
print("=" * 60)

ratings_df = pd.read_parquet(f'{PROCESSED_PATH}/ratings_clean.parquet')
movies_df = pd.read_parquet(f'{PROCESSED_PATH}/movies_clean.parquet')

preprocessing_log['initial_ratings'] = len(ratings_df)
preprocessing_log['initial_users'] = ratings_df['userId'].nunique()
preprocessing_log['initial_items'] = len(movies_df)

print(f"\nInitial data:")
print(f"  Ratings: {preprocessing_log['initial_ratings']:,}")
print(f"  Users: {preprocessing_log['initial_users']:,}")
print(f"  Items: {preprocessing_log['initial_items']:,}")

1. LOAD CLEAN DATA

Initial data:
  Ratings: 9,659,235
  Users: 60,284
  Items: 61,455


---
## 2. Filter Active Users and Items



In [5]:
print("=" * 60)
print("2. FILTER ACTIVE USERS AND ITEMS")
print("=" * 60)

# Count ratings per user and item
user_counts = ratings_df.groupby('userId').size()
item_counts = ratings_df.groupby('movieId').size()

print(f"\nBefore filtering:")
print(f"  Users with <{MIN_USER_RATINGS} ratings: {(user_counts < MIN_USER_RATINGS).sum():,}")
print(f"  Items with <{MIN_ITEM_RATINGS} ratings: {(item_counts < MIN_ITEM_RATINGS).sum():,}")

2. FILTER ACTIVE USERS AND ITEMS

Before filtering:
  Users with <5 ratings: 0
  Items with <5 ratings: 33,951


In [6]:
# Iterative filtering (users and items affect each other)
for iteration in range(5):  # Usually converges in 2-3 iterations
    n_before = len(ratings_df)
    
    # Filter users
    user_counts = ratings_df.groupby('userId').size()
    valid_users = user_counts[user_counts >= MIN_USER_RATINGS].index
    ratings_df = ratings_df[ratings_df['userId'].isin(valid_users)]
    
    # Filter items
    item_counts = ratings_df.groupby('movieId').size()
    valid_items = item_counts[item_counts >= MIN_ITEM_RATINGS].index
    ratings_df = ratings_df[ratings_df['movieId'].isin(valid_items)]
    
    n_after = len(ratings_df)
    
    if n_before == n_after:
        print(f"\nConverged after {iteration + 1} iterations")
        break
    else:
        print(f"Iteration {iteration + 1}: {n_before:,} → {n_after:,} ratings")

Iteration 1: 9,659,235 → 9,596,347 ratings

Converged after 2 iterations


In [7]:
preprocessing_log['filtered_ratings'] = len(ratings_df)
preprocessing_log['filtered_users'] = ratings_df['userId'].nunique()
preprocessing_log['filtered_items'] = ratings_df['movieId'].nunique()

print(f"\nAfter filtering:")
print(f"  Ratings: {preprocessing_log['filtered_ratings']:,} ({preprocessing_log['filtered_ratings']/preprocessing_log['initial_ratings']*100:.1f}% kept)")
print(f"  Users: {preprocessing_log['filtered_users']:,} ({preprocessing_log['filtered_users']/preprocessing_log['initial_users']*100:.1f}% kept)")
print(f"  Items: {preprocessing_log['filtered_items']:,} ({preprocessing_log['filtered_items']/preprocessing_log['initial_items']*100:.1f}% kept)")


After filtering:
  Ratings: 9,596,347 (99.3% kept)
  Users: 60,284 (100.0% kept)
  Items: 27,504 (44.8% kept)
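
The filter is repeated because dropping sparse items can push a user below the threshold, and the reverse. A minimal sketch of that cascade on toy data follows; `k_core_filter`, its thresholds of 2, and the toy frame are hypothetical and only mirror the loop used in this notebook.

```python
import pandas as pd


def k_core_filter(df, min_user=2, min_item=2, max_iter=10):
    """Repeat the user and item count filters until no more rows are removed."""
    for _ in range(max_iter):
        n_before = len(df)
        user_counts = df["userId"].value_counts()
        df = df[df["userId"].isin(user_counts[user_counts >= min_user].index)]
        item_counts = df["movieId"].value_counts()
        df = df[df["movieId"].isin(item_counts[item_counts >= min_item].index)]
        if len(df) == n_before:  # nothing removed in this pass: converged
            break
    return df


toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 20, 30],
})

filtered = k_core_filter(toy)
print(filtered)
```

In the first pass only movie 30 is removed, which leaves user 3 with a single rating; the second pass removes that user, and the third pass confirms that nothing else changes.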


---
## 3. User-Stratified Random Split

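**Strategy:** For each user, randomly shuffle their ratings and cut them into train/val/test by the configured ratios.
Because every remaining user has at least `MIN_USER_RATINGS = 5` ratings, each user ends up with at least one rating in all three sets.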
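
The per-user cut points come from integer truncation, so small users do not split exactly 70/15/15. The sketch below reproduces only that arithmetic; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
def split_sizes(n, train_ratio=0.70, val_ratio=0.15):
    """Return (train, val, test) counts for a user with n ratings."""
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))
    return train_end, val_end - train_end, n - val_end


for n in (3, 4, 5, 20):
    print(n, split_sizes(n))

# n=3 -> (2, 0, 1): no validation rating
# n=5 -> (3, 1, 1): the smallest user kept by the filter still covers all splits
```

The test set absorbs the rounding remainder, which is consistent with the test split coming out slightly larger than the validation split.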

In [None]:
print("=" * 60)
print("3. USER-STRATIFIED RANDOM SPLIT")
print("=" * 60)

def user_stratified_split(df, train_ratio=0.7, val_ratio=0.15, random_state=42):
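    """
    Shuffle each user's ratings and cut them into train/val/test.

    The test set receives whatever remains after the train and val cuts,
    so a user needs enough ratings for every split to be non-empty.
    Returns (train_df, val_df, test_df) with the same columns as df.
    """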
    np.random.seed(random_state)
    
    train_list = []
    val_list = []
    test_list = []
    
    for user_id, group in df.groupby('userId'):
        n = len(group)
        indices = np.random.permutation(n)
        
        train_end = int(n * train_ratio)
        val_end = int(n * (train_ratio + val_ratio))
        
        train_idx = indices[:train_end]
        val_idx = indices[train_end:val_end]
        test_idx = indices[val_end:]
        
        group_array = group.values
        
        if len(train_idx) > 0:
            train_list.append(group_array[train_idx])
        if len(val_idx) > 0:
            val_list.append(group_array[val_idx])
        if len(test_idx) > 0:
            test_list.append(group_array[test_idx])
    
    columns = df.columns
    train_df = pd.DataFrame(np.vstack(train_list), columns=columns)
    val_df = pd.DataFrame(np.vstack(val_list), columns=columns)
    test_df = pd.DataFrame(np.vstack(test_list), columns=columns)
    
    # Restore dtypes (group.values upcasts the mixed int/float columns)
    for split_df in (train_df, val_df, test_df):
        for col in ['userId', 'movieId', 'timestamp']:
            split_df[col] = split_df[col].astype('int32')
        split_df['rating'] = split_df['rating'].astype('float32')
    
    return train_df, val_df, test_df


train_df, val_df, test_df = user_stratified_split(
    ratings_df, 
    train_ratio=TRAIN_RATIO, 
    val_ratio=VAL_RATIO, 
    random_state=RANDOM_STATE
)

preprocessing_log['train_size'] = len(train_df)
preprocessing_log['val_size'] = len(val_df)
preprocessing_log['test_size'] = len(test_df)

print(f"\nSplit sizes:")
print(f"  Train: {len(train_df):,} ({len(train_df)/len(ratings_df)*100:.1f}%)")
print(f"  Val: {len(val_df):,} ({len(val_df)/len(ratings_df)*100:.1f}%)")
print(f"  Test: {len(test_df):,} ({len(test_df)/len(ratings_df)*100:.1f}%)")

3. USER-STRATIFIED RANDOM SPLIT
Splitting data by user...

Split sizes:
  Train: 6,690,428 (69.7%)
  Val: 1,437,878 (15.0%)
  Test: 1,468,041 (15.3%)


In [9]:
# Verify user coverage
train_users = set(train_df['userId'].unique())
val_users = set(val_df['userId'].unique())
test_users = set(test_df['userId'].unique())

val_users_in_train = len(val_users & train_users)
test_users_in_train = len(test_users & train_users)

print(f"\nUser coverage:")
print(f"  Val users in train: {val_users_in_train:,} / {len(val_users):,} ({val_users_in_train/len(val_users)*100:.1f}%)")
print(f"  Test users in train: {test_users_in_train:,} / {len(test_users):,} ({test_users_in_train/len(test_users)*100:.1f}%)")


User coverage:
  Val users in train: 60,284 / 60,284 (100.0%)
  Test users in train: 60,284 / 60,284 (100.0%)


In [10]:
# Verify item coverage
train_items = set(train_df['movieId'].unique())
val_items = set(val_df['movieId'].unique())
test_items = set(test_df['movieId'].unique())

val_items_in_train = len(val_items & train_items)
test_items_in_train = len(test_items & train_items)

print(f"\nItem coverage:")
print(f"  Val items in train: {val_items_in_train:,} / {len(val_items):,} ({val_items_in_train/len(val_items)*100:.1f}%)")
print(f"  Test items in train: {test_items_in_train:,} / {len(test_items):,} ({test_items_in_train/len(test_items)*100:.1f}%)")

preprocessing_log['val_user_coverage'] = val_users_in_train / len(val_users) * 100
preprocessing_log['test_user_coverage'] = test_users_in_train / len(test_users) * 100
preprocessing_log['val_item_coverage'] = val_items_in_train / len(val_items) * 100
preprocessing_log['test_item_coverage'] = test_items_in_train / len(test_items) * 100


Item coverage:
  Val items in train: 24,066 / 24,072 (100.0%)
  Test items in train: 24,122 / 24,128 (100.0%)


---
## 4. ID Mappings

In [11]:
print("=" * 60)
print("4. ID MAPPINGS")
print("=" * 60)

# Fit on training data
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()

user_encoder.fit(train_df['userId'])
item_encoder.fit(train_df['movieId'])

n_users = len(user_encoder.classes_)
n_items = len(item_encoder.classes_)

preprocessing_log['n_users'] = n_users
preprocessing_log['n_items'] = n_items

print(f"\nUsers: {n_users:,}")
print(f"Items: {n_items:,}")

4. ID MAPPINGS

Users: 60,284
Items: 27,498


In [None]:
# Transform all sets
train_df['user_idx'] = user_encoder.transform(train_df['userId'])
train_df['item_idx'] = item_encoder.transform(train_df['movieId'])

# Safe transform for val/test (handle any edge cases)
def safe_transform(encoder, values):
    # Mark values the encoder saw during fit; unseen values get index -1
    mask = np.isin(values, encoder.classes_)
    result = np.full(len(values), -1, dtype=np.int32)
    result[mask] = encoder.transform(values[mask])
    return result, mask

val_df['user_idx'], val_user_mask = safe_transform(user_encoder, val_df['userId'].values)
val_df['item_idx'], val_item_mask = safe_transform(item_encoder, val_df['movieId'].values)

test_df['user_idx'], test_user_mask = safe_transform(user_encoder, test_df['userId'].values)
test_df['item_idx'], test_item_mask = safe_transform(item_encoder, test_df['movieId'].values)




Transformation complete.


In [13]:
# Filter val/test to only known users and items
val_valid_mask = (val_df['user_idx'] != -1) & (val_df['item_idx'] != -1)
test_valid_mask = (test_df['user_idx'] != -1) & (test_df['item_idx'] != -1)

val_unknown = (~val_valid_mask).sum()
test_unknown = (~test_valid_mask).sum()

print(f"\nUnknown user/item pairs:")
print(f"  Val: {val_unknown:,} ({val_unknown/len(val_df)*100:.2f}%)")
print(f"  Test: {test_unknown:,} ({test_unknown/len(test_df)*100:.2f}%)")

# Filter to keep only valid
val_df = val_df[val_valid_mask].copy()
test_df = test_df[test_valid_mask].copy()

print(f"\nAfter filtering unknown:")
print(f"  Val: {len(val_df):,}")
print(f"  Test: {len(test_df):,}")

preprocessing_log['val_final'] = len(val_df)
preprocessing_log['test_final'] = len(test_df)


Unknown user/item pairs:
  Val: 17 (0.00%)
  Test: 14 (0.00%)

After filtering unknown:
  Val: 1,437,861
  Test: 1,468,027


---
## 5. Compute Statistics (Training Only)

In [14]:
print("=" * 60)
print("5. COMPUTE STATISTICS")
print("=" * 60)

# Global mean
global_mean = train_df['rating'].mean()
preprocessing_log['global_mean'] = global_mean
print(f"\nGlobal mean: {global_mean:.4f}")

5. COMPUTE STATISTICS

Global mean: 3.5377


In [15]:
# User statistics
user_stats = train_df.groupby('user_idx').agg(
    num_ratings=('rating', 'count'),
    mean_rating=('rating', 'mean'),
    std_rating=('rating', 'std')
).reset_index()

user_stats['std_rating'] = user_stats['std_rating'].fillna(0)
user_stats['bias'] = user_stats['mean_rating'] - global_mean

print(f"\nUser statistics:")
print(f"  Avg ratings/user: {user_stats['num_ratings'].mean():.1f}")
print(f"  Avg user bias: {user_stats['bias'].mean():.4f}")
print(f"  User bias std: {user_stats['bias'].std():.4f}")


User statistics:
  Avg ratings/user: 111.0
  Avg user bias: 0.1682
  User bias std: 0.4927


In [16]:
# Item statistics
item_stats = train_df.groupby('item_idx').agg(
    num_ratings=('rating', 'count'),
    mean_rating=('rating', 'mean'),
    std_rating=('rating', 'std')
).reset_index()

item_stats['std_rating'] = item_stats['std_rating'].fillna(0)
item_stats['bias'] = item_stats['mean_rating'] - global_mean

print(f"\nItem statistics:")
print(f"  Avg ratings/item: {item_stats['num_ratings'].mean():.1f}")
print(f"  Avg item bias: {item_stats['bias'].mean():.4f}")
print(f"  Item bias std: {item_stats['bias'].std():.4f}")


Item statistics:
  Avg ratings/item: 243.3
  Avg item bias: -0.3827
  Item bias std: 0.5748


In [None]:
# Create lookup dictionaries
user_bias_dict = dict(zip(user_stats['user_idx'], user_stats['bias']))
item_bias_dict = dict(zip(item_stats['item_idx'], item_stats['bias']))
user_mean_dict = dict(zip(user_stats['user_idx'], user_stats['mean_rating']))
item_mean_dict = dict(zip(item_stats['item_idx'], item_stats['mean_rating']))

# Item popularity for negative sampling
item_popularity = dict(zip(item_stats['item_idx'], item_stats['num_ratings']))



Lookup dictionaries created.
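
One way these lookups could be used downstream is a simple additive baseline. The sketch below is an assumption about later use, not the Phase 4 model: `baseline_predict` is hypothetical, the 0.5 to 5.0 clipping range assumes the MovieLens half-star scale, and the notebook's biases are plain mean offsets rather than regularized ones, so adding both may overshoot.

```python
import numpy as np


def baseline_predict(user_idx, item_idx, global_mean, user_bias, item_bias,
                     lo=0.5, hi=5.0):
    """Predict global_mean + user bias + item bias, clipped to the rating scale.

    Unknown users or items contribute a bias of 0, so the prediction
    falls back toward the global mean.
    """
    b_u = user_bias.get(user_idx, 0.0)
    b_i = item_bias.get(item_idx, 0.0)
    return float(np.clip(global_mean + b_u + b_i, lo, hi))


# Toy lookups standing in for user_bias_dict / item_bias_dict.
toy_user_bias = {0: 0.5}
toy_item_bias = {0: -1.0, 1: 2.0}

print(baseline_predict(0, 0, 3.5, toy_user_bias, toy_item_bias))
print(baseline_predict(99, 99, 3.5, toy_user_bias, toy_item_bias))
print(baseline_predict(0, 1, 3.5, toy_user_bias, toy_item_bias))
```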


---
## 6. Create Sparse Matrices

In [18]:
print("=" * 60)
print("6. SPARSE MATRICES")
print("=" * 60)

# Rating matrix
train_sparse = csr_matrix(
    (train_df['rating'].values, (train_df['user_idx'].values, train_df['item_idx'].values)),
    shape=(n_users, n_items)
)

# Binary interaction matrix
train_binary = csr_matrix(
    (np.ones(len(train_df)), (train_df['user_idx'].values, train_df['item_idx'].values)),
    shape=(n_users, n_items)
)

# Density
density = train_sparse.nnz / (n_users * n_items) * 100
preprocessing_log['density'] = density

print(f"\nSparse matrices:")
print(f"  Shape: ({n_users:,}, {n_items:,})")
print(f"  Non-zero: {train_sparse.nnz:,}")
print(f"  Density: {density:.4f}%")
print(f"  Memory (ratings): {train_sparse.data.nbytes / 1024**2:.2f} MB")
print(f"  Memory (binary): {train_binary.data.nbytes / 1024**2:.2f} MB")

6. SPARSE MATRICES

Sparse matrices:
  Shape: (60,284, 27,498)
  Non-zero: 6,690,428
  Density: 0.4036%
  Memory (ratings): 25.52 MB
  Memory (binary): 51.04 MB
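
The binary matrix reports twice the memory of the rating matrix because `np.ones` defaults to float64 while the ratings were cast to float32. Both figures also count only the `.data` array and leave out the index arrays. A minimal sketch on a toy matrix follows; `csr_nbytes` is a hypothetical helper.

```python
import numpy as np
from scipy.sparse import csr_matrix


def csr_nbytes(m):
    """Total bytes held by the three arrays that make up a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


rows = np.array([0, 0, 1])
cols = np.array([0, 2, 1])

# Same interactions, stored with float64 (np.ones default) and float32 values.
binary64 = csr_matrix((np.ones(3), (rows, cols)), shape=(2, 3))
binary32 = csr_matrix((np.ones(3, dtype=np.float32), (rows, cols)), shape=(2, 3))

print("data only:", binary64.data.nbytes, "vs", binary32.data.nbytes)
print("full CSR: ", csr_nbytes(binary64), "vs", csr_nbytes(binary32))
```

Passing `dtype=np.float32` to `np.ones` would halve the value storage of the binary matrix if Phase 4 does not need double precision.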


---
## 7. Content-Based Features (Genres)

In [19]:
print("=" * 60)
print("7. CONTENT-BASED FEATURES")
print("=" * 60)

# Get movies in training
train_movie_ids = set(train_df['movieId'].unique())
movies_train = movies_df[movies_df['movieId'].isin(train_movie_ids)].copy()

# Add item_idx
movies_train['item_idx'] = item_encoder.transform(movies_train['movieId'])

print(f"\nMovies in training: {len(movies_train):,}")

7. CONTENT-BASED FEATURES

Movies in training: 27,498


In [20]:
# Parse and encode genres
movies_train['genre_list'] = movies_train['genres'].str.split('|')

mlb = MultiLabelBinarizer()
genre_matrix_raw = mlb.fit_transform(movies_train['genre_list'])
genre_names = list(mlb.classes_)

preprocessing_log['n_genres'] = len(genre_names)

print(f"\nGenres: {len(genre_names)}")
print(f"  {genre_names}")


Genres: 19
  ['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']


In [21]:
# Create genre feature matrix aligned with item_idx
genre_features = np.zeros((n_items, len(genre_names)), dtype=np.float32)

# Rows of genre_matrix_raw follow the row order of movies_train,
# so each row can be placed at its item_idx in a single assignment
genre_features[movies_train['item_idx'].to_numpy()] = genre_matrix_raw

# Verify
items_with_genres = (genre_features.sum(axis=1) > 0).sum()
print(f"\nGenre feature matrix:")
print(f"  Shape: {genre_features.shape}")
print(f"  Items with genres: {items_with_genres:,} / {n_items:,}")


Genre feature matrix:
  Shape: (27498, 19)
  Items with genres: 27,498 / 27,498


---
## 8. Evaluation Data Preparation

In [22]:
print("=" * 60)
print("8. EVALUATION DATA")
print("=" * 60)

# Positive items for each user (for negative sampling)
user_positive_items = train_df.groupby('user_idx')['item_idx'].apply(set).to_dict()
all_items = set(range(n_items))

print(f"\nPositive items stored for {len(user_positive_items):,} users")

8. EVALUATION DATA

Positive items stored for 60,284 users


In [23]:
# Group val/test by user for ranking evaluation
val_user_items = val_df.groupby('user_idx').apply(
    lambda x: list(zip(x['item_idx'].values, x['rating'].values))
).to_dict()

test_user_items = test_df.groupby('user_idx').apply(
    lambda x: list(zip(x['item_idx'].values, x['rating'].values))
).to_dict()

print(f"\nEvaluation user groups:")
print(f"  Val: {len(val_user_items):,} users")
print(f"  Test: {len(test_user_items):,} users")


Evaluation user groups:
  Val: 60,284 users
  Test: 60,284 users


In [24]:
# Relevant items (rating >= 4.0) for ranking metrics
RELEVANCE_THRESHOLD = 4.0

val_relevant = val_df[val_df['rating'] >= RELEVANCE_THRESHOLD].groupby('user_idx')['item_idx'].apply(set).to_dict()
test_relevant = test_df[test_df['rating'] >= RELEVANCE_THRESHOLD].groupby('user_idx')['item_idx'].apply(set).to_dict()

print(f"\nRelevant items (rating >= {RELEVANCE_THRESHOLD}):")
print(f"  Val users with relevant: {len(val_relevant):,}")
print(f"  Test users with relevant: {len(test_relevant):,}")


Relevant items (rating >= 4.0):
  Val users with relevant: 58,788
  Test users with relevant: 59,059
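
The relevant-item sets are the ground truth for ranking metrics. As a rough illustration of how they might be consumed, here is a sketch of precision and recall at k for one user; `precision_recall_at_k` and the toy ranking are hypothetical, and Phase 4 may define its metrics differently.

```python
def precision_recall_at_k(ranked_items, relevant, k):
    """Precision@k and recall@k for one user.

    ranked_items: item indices ordered from most to least recommended.
    relevant: set of held-out item indices rated at or above the threshold.
    """
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: one of the top-3 recommendations is among 3 relevant items.
ranked = [3, 1, 7, 2]
relevant_items = {1, 2, 9}
precision, recall = precision_recall_at_k(ranked, relevant_items, k=3)
print(precision, recall)
```

Users without any relevant held-out item (about 1,500 in val and 1,200 in test, judging by the counts above) are missing from these dictionaries and would need to be skipped or scored as zero.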


---
## 9. Create Reverse Mappings

In [25]:
print("=" * 60)
print("9. REVERSE MAPPINGS")
print("=" * 60)

# idx to original ID
idx_to_user = {idx: uid for idx, uid in enumerate(user_encoder.classes_)}
idx_to_item = {idx: mid for idx, mid in enumerate(item_encoder.classes_)}

# item_idx to movie title
item_to_title = dict(zip(movies_train['item_idx'], movies_train['title']))
item_to_genres = dict(zip(movies_train['item_idx'], movies_train['genres']))

print(f"\nReverse mappings created:")
print(f"  idx_to_user: {len(idx_to_user):,}")
print(f"  idx_to_item: {len(idx_to_item):,}")
print(f"  item_to_title: {len(item_to_title):,}")

9. REVERSE MAPPINGS

Reverse mappings created:
  idx_to_user: 60,284
  idx_to_item: 27,498
  item_to_title: 27,498


---
## 10. Save All Artifacts

In [26]:
print("=" * 60)
print("10. SAVE ARTIFACTS")
print("=" * 60)

# 1. Split dataframes
train_df.to_parquet(f'{ML_READY_PATH}/train.parquet', index=False)
val_df.to_parquet(f'{ML_READY_PATH}/val.parquet', index=False)
test_df.to_parquet(f'{ML_READY_PATH}/test.parquet', index=False)

print("\n1. Split dataframes:")
for f in ['train.parquet', 'val.parquet', 'test.parquet']:
    size = os.path.getsize(f'{ML_READY_PATH}/{f}') / 1024**2
    print(f"   {f}: {size:.2f} MB")

10. SAVE ARTIFACTS

1. Split dataframes:
   train.parquet: 58.35 MB
   val.parquet: 13.96 MB
   test.parquet: 14.19 MB


In [27]:
# 2. Sparse matrices
save_npz(f'{ML_READY_PATH}/train_sparse.npz', train_sparse)
save_npz(f'{ML_READY_PATH}/train_binary.npz', train_binary)

print("\n2. Sparse matrices:")
print(f"   train_sparse.npz")
print(f"   train_binary.npz")


2. Sparse matrices:
   train_sparse.npz
   train_binary.npz


In [28]:
# 3. Genre features
np.save(f'{ML_READY_PATH}/genre_features.npy', genre_features)

print("\n3. Genre features:")
print(f"   genre_features.npy: {genre_features.shape}")


3. Genre features:
   genre_features.npy: (27498, 19)


In [29]:
# 4. Mappings
mappings = {
    'user_encoder': user_encoder,
    'item_encoder': item_encoder,
    'n_users': n_users,
    'n_items': n_items,
    'idx_to_user': idx_to_user,
    'idx_to_item': idx_to_item,
    'item_to_title': item_to_title,
    'item_to_genres': item_to_genres,
    'genre_encoder': mlb,
    'genre_names': genre_names
}

with open(f'{ML_READY_PATH}/mappings.pkl', 'wb') as f:
    pickle.dump(mappings, f)

print("\n4. Mappings:")
print(f"   mappings.pkl")


4. Mappings:
   mappings.pkl


In [30]:
# 5. Statistics
stats = {
    'global_mean': global_mean,
    'user_bias': user_bias_dict,
    'item_bias': item_bias_dict,
    'user_mean': user_mean_dict,
    'item_mean': item_mean_dict,
    'item_popularity': item_popularity
}

with open(f'{ML_READY_PATH}/stats.pkl', 'wb') as f:
    pickle.dump(stats, f)

print("\n5. Statistics:")
print(f"   stats.pkl")


5. Statistics:
   stats.pkl


In [31]:
# 6. Evaluation data
eval_data = {
    'val_user_items': val_user_items,
    'test_user_items': test_user_items,
    'val_relevant': val_relevant,
    'test_relevant': test_relevant,
    'user_positive_items': user_positive_items,
    'all_items': all_items,
    'relevance_threshold': RELEVANCE_THRESHOLD
}

with open(f'{ML_READY_PATH}/eval_data.pkl', 'wb') as f:
    pickle.dump(eval_data, f)

print("\n6. Evaluation data:")
print(f"   eval_data.pkl")


6. Evaluation data:
   eval_data.pkl


In [32]:
# 7. Movies metadata
movies_train.to_parquet(f'{ML_READY_PATH}/movies_train.parquet', index=False)

print("\n7. Movies metadata:")
print(f"   movies_train.parquet")


7. Movies metadata:
   movies_train.parquet


In [33]:
# 8. Preprocessing config (for reproducibility)
config = {
    'train_ratio': TRAIN_RATIO,
    'val_ratio': VAL_RATIO,
    'test_ratio': TEST_RATIO,
    'min_user_ratings': MIN_USER_RATINGS,
    'min_item_ratings': MIN_ITEM_RATINGS,
    'random_state': RANDOM_STATE,
    'relevance_threshold': RELEVANCE_THRESHOLD,
    'split_type': 'user_stratified_random'
}

with open(f'{ML_READY_PATH}/config.pkl', 'wb') as f:
    pickle.dump(config, f)

print("\n8. Config:")
print(f"   config.pkl")


8. Config:
   config.pkl
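
A later phase would read these files back in. The sketch below shows one possible loader for a subset of the artifacts and exercises it on tiny stand-in files in a temporary directory; `load_artifacts` is a hypothetical helper and the stand-in contents are made up.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz


def load_artifacts(path):
    """Load a subset of the Phase 3 artifacts from `path` into a dict."""
    artifacts = {
        'train_sparse': load_npz(os.path.join(path, 'train_sparse.npz')),
        'genre_features': np.load(os.path.join(path, 'genre_features.npy')),
    }
    for name in ('stats', 'config'):
        # Only unpickle files you created yourself: pickle can execute code.
        with open(os.path.join(path, f'{name}.pkl'), 'rb') as f:
            artifacts[name] = pickle.load(f)
    return artifacts


# Round-trip demo: write stand-in files, then load them back.
with tempfile.TemporaryDirectory() as tmp:
    toy_matrix = csr_matrix(np.array([[4.0, 0.0], [0.0, 3.5]], dtype=np.float32))
    save_npz(os.path.join(tmp, 'train_sparse.npz'), toy_matrix)
    np.save(os.path.join(tmp, 'genre_features.npy'), np.eye(2, dtype=np.float32))
    for name, obj in (('stats', {'global_mean': 3.75}),
                      ('config', {'random_state': 42})):
        with open(os.path.join(tmp, f'{name}.pkl'), 'wb') as f:
            pickle.dump(obj, f)
    loaded = load_artifacts(tmp)

print(loaded['train_sparse'].shape, loaded['stats']['global_mean'])
```

Note that `mappings.pkl` stores fitted scikit-learn encoders, so unpickling it generally requires a compatible scikit-learn version in the loading environment.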


---
## Phase 3 Summary

In [34]:
print("=" * 70)
print("PHASE 3 SUMMARY: ML PREPROCESSING (OPTIMIZED)")
print("=" * 70)

print("\n" + "-" * 70)
print("1. DATA FILTERING")
print("-" * 70)
print(f"   Initial: {preprocessing_log['initial_ratings']:,} ratings, {preprocessing_log['initial_users']:,} users, {preprocessing_log['initial_items']:,} items")
print(f"   Filtered: {preprocessing_log['filtered_ratings']:,} ratings, {preprocessing_log['filtered_users']:,} users, {preprocessing_log['filtered_items']:,} items")
print(f"   Kept: {preprocessing_log['filtered_ratings']/preprocessing_log['initial_ratings']*100:.1f}% of ratings")

print("\n" + "-" * 70)
print("2. USER-STRATIFIED RANDOM SPLIT")
print("-" * 70)
print(f"   Train: {preprocessing_log['train_size']:,} ratings ({TRAIN_RATIO*100:.0f}%)")
print(f"   Val: {preprocessing_log['val_final']:,} ratings")
print(f"   Test: {preprocessing_log['test_final']:,} ratings")
print(f"   ✓ Every user has ratings in all splits")

print("\n" + "-" * 70)
print("3. COVERAGE")
print("-" * 70)
print(f"   Val users in train: {preprocessing_log['val_user_coverage']:.1f}%")
print(f"   Val items in train: {preprocessing_log['val_item_coverage']:.1f}%")
print(f"   Test users in train: {preprocessing_log['test_user_coverage']:.1f}%")
print(f"   Test items in train: {preprocessing_log['test_item_coverage']:.1f}%")

print("\n" + "-" * 70)
print("4. MATRIX")
print("-" * 70)
print(f"   Shape: ({preprocessing_log['n_users']:,}, {preprocessing_log['n_items']:,})")
print(f"   Density: {preprocessing_log['density']:.4f}%")

print("\n" + "-" * 70)
print("5. STATISTICS")
print("-" * 70)
print(f"   Global mean: {preprocessing_log['global_mean']:.4f}")
print(f"   User biases: {len(user_bias_dict):,}")
print(f"   Item biases: {len(item_bias_dict):,}")
print(f"   ✓ Computed from training only")

print("\n" + "-" * 70)
print("6. CONTENT FEATURES")
print("-" * 70)
print(f"   Genres: {preprocessing_log['n_genres']}")
print(f"   Feature matrix: ({preprocessing_log['n_items']:,}, {preprocessing_log['n_genres']})")

print("\n" + "-" * 70)
print("7. OUTPUT FILES")
print("-" * 70)
print(f"   Location: {ML_READY_PATH}")
total_size = 0
for f in os.listdir(ML_READY_PATH):
    size = os.path.getsize(f'{ML_READY_PATH}/{f}') / 1024**2
    total_size += size
    print(f"     - {f}: {size:.2f} MB")
print(f"   Total: {total_size:.2f} MB")

print("\n" + "=" * 70)
print("IMPROVEMENTS vs TEMPORAL SPLIT")
print("=" * 70)
print(f"   ✓ User coverage: ~100% (was ~7% with temporal)")
print(f"   ✓ Item coverage: ~{preprocessing_log['val_item_coverage']:.0f}%+ (was ~7% with temporal)")
print(f"   ✓ No artificial cold-start problem")
print(f"   ✓ All val/test ratings usable for CF evaluation")

print("\n" + "=" * 70)
print("READY FOR PHASE 4: MODEL TRAINING")
print("=" * 70)

PHASE 3 SUMMARY: ML PREPROCESSING (OPTIMIZED)

----------------------------------------------------------------------
1. DATA FILTERING
----------------------------------------------------------------------
   Initial: 9,659,235 ratings, 60,284 users, 61,455 items
   Filtered: 9,596,347 ratings, 60,284 users, 27,504 items
   Kept: 99.3% of ratings

----------------------------------------------------------------------
2. USER-STRATIFIED RANDOM SPLIT
----------------------------------------------------------------------
   Train: 6,690,428 ratings (70%)
   Val: 1,437,861 ratings
   Test: 1,468,027 ratings
   ✓ Every user has ratings in all splits

----------------------------------------------------------------------
3. COVERAGE
----------------------------------------------------------------------
   Val users in train: 100.0%
   Val items in train: 100.0%
   Test users in train: 100.0%
   Test items in train: 100.0%

--------------------------------------------------------------------