# Model Training & Hyperparameter Tuning (Lab)

This notebook serves as the **prototyping lab** for our recommendation pipeline. 

**Strategy**: 3-Way Split (Train / Validation / Test)
1.  **Split Data**: Train (60%), Validation (20%), Test (20%).
2.  **Hyperparameter Tuning**: Optimize ALS and SVD models using the **Validation** set.
3.  **Evaluate**: Measure final performance (RMSE, Precision@K) on the **Test** set using the best parameters.
4.  **Production**: Retrain on the FULL dataset with the optimal hyperparameters.


In [1]:
import pandas as pd
import numpy as np
import scipy.sparse as sparse
import implicit
from surprise import Dataset, Reader, SVD, accuracy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
import pickle
import os

# Ensure models directory exists
os.makedirs("../models", exist_ok=True)

# Set seed for reproducibility
SEED = 42

  from .autonotebook import tqdm as notebook_tqdm


## 1. Data Splitting (60/20/20)
We need a Validation set to select the best Hyperparameters without overfitting the Test set.

In [2]:
# Load Ratings
RATINGS_FILE = "../data/ml-1m/ratings.dat"
ratings_cols = ['UserID', 'MovieID', 'Rating', 'Timestamp']
ratings = pd.read_csv(RATINGS_FILE, sep='::', header=None, names=ratings_cols, engine='python', encoding='latin-1')
ratings = ratings.drop(columns=['Timestamp'])

# 1. Split into Train+Val (80%) and Test (20%)
train_val_df, test_df = train_test_split(ratings, test_size=0.2, random_state=SEED, stratify=ratings['UserID'])

# 2. Split Train+Val into Train (75% of 80% = 60%) and Validation (25% of 80% = 20%)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=SEED, stratify=train_val_df['UserID'])

print(f"Total Ratings: {len(ratings)}")
print(f"Train Set: {len(train_df)} ({len(train_df)/len(ratings):.1%})")
print(f"Validation Set: {len(val_df)} ({len(val_df)/len(ratings):.1%})")
print(f"Test Set: {len(test_df)} ({len(test_df)/len(ratings):.1%})")

Total Ratings: 1000209
Train Set: 600125 (60.0%)
Validation Set: 200042 (20.0%)
Test Set: 200042 (20.0%)


## 2. Hyperparameter Tuning
We will tune:
- **ALS**: `factors` (latent dimension), `regularization`.
- **SVD**: `n_factors`, `lr_all` (learning rate).

### 2.1 Tuning ALS (Candidate Generation)

In [3]:
# Prepare Sparse Matrices for Training
all_users = ratings['UserID'].unique()
all_movies = ratings['MovieID'].unique()

ts_users = pd.Categorical(train_df['UserID'], categories=all_users)
ts_movies = pd.Categorical(train_df['MovieID'], categories=all_movies)

item_user_train = sparse.csr_matrix(
    (train_df['Rating'].astype(float), (ts_movies.codes, ts_users.codes)),
    shape=(len(all_movies), len(all_users))
)

# Create User x Item Matrix (for recommend function & filtering)
user_item_train = item_user_train.T.tocsr()

print(f"ALS Matrix Sparsity: {100 * (1 - item_user_train.nnz / (item_user_train.shape[0] * item_user_train.shape[1])):.2f}%")

# For Validation Custom Metric (Precision@K Proxy): 
# We check if the model assigns high scores to items in the Validation Set.

def evaluate_als(model, val_df, k=10, sample_size=500):
    # Quick P@K on Validation
    user_ids = val_df['UserID'].unique()
    sample_users = np.random.choice(user_ids, size=min(sample_size, len(user_ids)), replace=False)
    
    hits = 0
    count = 0
    
    # User Map Check
    real_user_id_to_idx = {uid: i for i, uid in enumerate(all_users)}
    idx_to_real_movie_id = {i: mid for i, mid in enumerate(all_movies)}

    for uid in sample_users:
        if uid not in real_user_id_to_idx: continue
        u_idx = real_user_id_to_idx[uid]
        
        # Ground Truth from Val
        pos_items = set(val_df[(val_df['UserID'] == uid) & (val_df['Rating'] >= 4)]['MovieID'])
        if not pos_items: continue
        
        # Recommend
        # Note: implicit requires the user_items matrix (User X Item) for filtering
        ids, _ = model.recommend(u_idx, user_item_train[u_idx], N=k, filter_already_liked_items=False)
        recs = {idx_to_real_movie_id[i] for i in ids}
        
        if len(recs & pos_items) > 0:
            hits += len(recs & pos_items) / k
        count += 1
        
    return hits / count if count > 0 else 0

# --- Grid Search Loop ---
param_grid_als = {
    'factors': [32, 64],
    'regularization': [0.05, 0.1]
}

best_als_score = -1
best_als_params = None
best_als_model = None

print("Tuning ALS...")
for f in param_grid_als['factors']:
    for r in param_grid_als['regularization']:
        model = implicit.als.AlternatingLeastSquares(factors=f, regularization=r, iterations=15, random_state=SEED)
        
        # --- CORRRECCIÓN AQUÍ ---
        # Usamos user_item_train (User x Item) en lugar de item_user_train
        model.fit(user_item_train)
        # ------------------------
        
        score = evaluate_als(model, val_df)
        print(f"ALS (factors={f}, reg={r}) -> Precision@10 (Val): {score:.4f}")
        
        if score > best_als_score:
            best_als_score = score
            best_als_params = {'factors': f, 'regularization': r}
            best_als_model = model

print(f"Best ALS Params: {best_als_params}")

  check_blas_config()


ALS Matrix Sparsity: 97.32%
Tuning ALS...


100%|██████████| 15/15 [00:01<00:00, 12.61it/s]


ALS (factors=32, reg=0.05) -> Precision@10 (Val): 0.0940


100%|██████████| 15/15 [00:01<00:00, 12.52it/s]


ALS (factors=32, reg=0.1) -> Precision@10 (Val): 0.0988


100%|██████████| 15/15 [00:02<00:00,  5.92it/s]


ALS (factors=64, reg=0.05) -> Precision@10 (Val): 0.0531


100%|██████████| 15/15 [00:02<00:00,  5.79it/s]


ALS (factors=64, reg=0.1) -> Precision@10 (Val): 0.0641
Best ALS Params: {'factors': 32, 'regularization': 0.1}


### 2.2 Tuning SVD (Ranking)
Optimizing RMSE on Validation Set. SVD is purely for explicit rating prediction here.

In [4]:
# Prepare Data for Surprise
reader = Reader(rating_scale=(1, 5))
data_train = Dataset.load_from_df(train_df[['UserID', 'MovieID', 'Rating']], reader)
trainset = data_train.build_full_trainset()

# Validation List for Accuracy Check
val_set = list(val_df[['UserID', 'MovieID', 'Rating']].itertuples(index=False, name=None))

# --- Grid Search Loop ---
param_grid_svd = {
    'n_factors': [50, 100],
    'lr_all': [0.005, 0.01]
}

best_svd_rmse = float('inf')
best_svd_params = None
best_svd_model = None

print("\nTuning SVD...")
for nf in param_grid_svd['n_factors']:
    for lr in param_grid_svd['lr_all']:
        model = SVD(n_factors=nf, lr_all=lr, n_epochs=20, random_state=SEED)
        model.fit(trainset)
        predictions = model.test(val_set)
        rmse = accuracy.rmse(predictions, verbose=False)
        print(f"SVD (n_factors={nf}, lr={lr}) -> RMSE (Val): {rmse:.4f}")
        
        if rmse < best_svd_rmse:
            best_svd_rmse = rmse
            best_svd_params = {'n_factors': nf, 'lr_all': lr}
            best_svd_model = model

print(f"Best SVD Params: {best_svd_params}")


Tuning SVD...
SVD (n_factors=50, lr=0.005) -> RMSE (Val): 0.8869
SVD (n_factors=50, lr=0.01) -> RMSE (Val): 0.9068
SVD (n_factors=100, lr=0.005) -> RMSE (Val): 0.8921
SVD (n_factors=100, lr=0.01) -> RMSE (Val): 0.9167
Best SVD Params: {'n_factors': 50, 'lr_all': 0.005}


### 2.3 Prepare Content-Based (No Tuning needed for now)
We simply fit TF-IDF on all movies.

In [5]:
MOVIES_FILE = "../data/ml-1m/movies.dat"
movies_cols = ['MovieID', 'Title', 'Genres']
movies = pd.read_csv(MOVIES_FILE, sep='::', header=None, names=movies_cols, engine='python', encoding='latin-1')

movies['genres_str'] = movies['Genres'].str.replace('|', ' ', regex=False)
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b[A-Za-z-]+\b")
tfidf_matrix = tfidf.fit_transform(movies['genres_str'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

## 3. Final Test Evaluation (Prototype Pipeline)
Now we evaluate the **Hybrid Strategy** using the BEST models on the **Test Set** (which was unseen during tuning).

In [6]:
# Helper maps
real_user_id_to_idx = {uid: i for i, uid in enumerate(all_users)}
idx_to_real_movie_id = {i: mid for i, mid in enumerate(all_movies)}
movie_id_to_df_idx = pd.Series(movies.index, index=movies['MovieID']).to_dict()

def get_hybrid_recommendations(user_id, k=10):
    # 1. ALS Candidates (Best Model)
    if user_id in real_user_id_to_idx:
        user_idx = real_user_id_to_idx[user_id]
        ids, scores = best_als_model.recommend(user_idx, user_item_train[user_idx], N=50, filter_already_liked_items=False)
        als_candidates = [idx_to_real_movie_id[i] for i in ids]
    else:
        als_candidates = []
        
    # 2. Content Candidates
    user_history = train_df[train_df['UserID'] == user_id]
    # Note: In a real scenario we'd use all history, here we use what's in Training Split
    top_movies = user_history[user_history['Rating'] >= 4].sort_values('Rating', ascending=False).head(3)['MovieID'].tolist()
    
    content_candidates = []
    for mid in top_movies:
        if mid in movie_id_to_df_idx:
            idx = movie_id_to_df_idx[mid]
            sim_scores = list(enumerate(cosine_sim[idx]))
            sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:11]
            content_candidates.extend([movies.iloc[i[0]]['MovieID'] for i in sim_scores])
            
    # Union
    all_candidates = list(set(als_candidates + content_candidates))
    
    # 3. SVD Ranking (Best Model)
    scored_candidates = []
    for mid in all_candidates:
        est = best_svd_model.predict(user_id, mid).est
        scored_candidates.append((mid, est))
        
    scored_candidates.sort(key=lambda x: x[1], reverse=True)
    return [x[0] for x in scored_candidates[:k]]

# Evaluate on Test Set
def precision_at_k(truth_df, k=10, sample_size=200):
    hits = 0
    total = 0
    sample_users = np.random.choice(truth_df['UserID'].unique(), size=sample_size)
    
    for uid in sample_users:
        true_pos = set(truth_df[(truth_df['UserID'] == uid) & (truth_df['Rating'] >= 4)]['MovieID'])
        if len(true_pos) == 0: continue
            
        recs = get_hybrid_recommendations(uid, k)
        hits += len(set(recs) & true_pos) / k
        total += 1
    return hits / total

print(f"Final Test Precision@10: {precision_at_k(test_df):.4f}")

Final Test Precision@10: 0.0864


## 4. Retrain on Full Dataset (Production)
Using the best hyperparameters found.

In [7]:
best_als_f = best_als_params['factors']
best_als_r = best_als_params['regularization']
best_svd_nf = best_svd_params['n_factors']
best_svd_lr = best_svd_params['lr_all']

print(f"Retraining Full Models with: ALS(f={best_als_f}, r={best_als_r}), SVD(nf={best_svd_nf}, lr={best_svd_lr})")

# --- Full Data Prep ---
full_users = pd.Categorical(ratings['UserID'], categories=all_users)
full_movies = pd.Categorical(ratings['MovieID'], categories=all_movies)

# Matriz base (Item x User)
item_user_full = sparse.csr_matrix(
    (ratings['Rating'].astype(float), (full_movies.codes, full_users.codes)),
    shape=(len(all_movies), len(all_users))
)

# --- CORRECCIÓN IMPORTANTE ---
# Transponemos a User x Item explícitamente para implicit
user_item_full = item_user_full.T.tocsr()
# -----------------------------

data_full = Dataset.load_from_df(ratings[['UserID', 'MovieID', 'Rating']], reader)
trainset_full = data_full.build_full_trainset()

# --- Train ---
# ALS
als_final = implicit.als.AlternatingLeastSquares(factors=best_als_f, regularization=best_als_r, iterations=20, random_state=SEED)
# Usamos la matriz transpuesta (User x Item)
als_final.fit(user_item_full)

# SVD
svd_final = SVD(n_factors=best_svd_nf, lr_all=best_svd_lr, n_epochs=20, random_state=SEED)
svd_final.fit(trainset_full)

# --- Save Artifacts ---
# ALS
user_map = dict(enumerate(full_users.categories))
movie_map = dict(enumerate(full_movies.categories))

als_artifacts = {
    "model": als_final,
    "user_item_matrix": user_item_full, # Guardamos la User x Item
    "user_inv_map": {v: k for k, v in user_map.items()},
    "movie_inv_map": {v: k for k, v in movie_map.items()},
    "user_map": user_map,
    "movie_map": movie_map
}
with open("../models/als_artifacts.pkl", "wb") as f: pickle.dump(als_artifacts, f)

# SVD
with open("../models/svd_model.pkl", "wb") as f: pickle.dump(svd_final, f)

# Content
content_artifacts = {
    "tfidf_matrix": tfidf_matrix,
    "tfidf_vectorizer": tfidf,
    "cosine_sim_matrix": cosine_sim,
    "movies_df": movies[['MovieID', 'Title', 'Genres']]
}
with open("../models/content_artifacts.pkl", "wb") as f: pickle.dump(content_artifacts, f)

print("All production models saved to ../models/")

Retraining Full Models with: ALS(f=32, r=0.1), SVD(nf=50, lr=0.005)


100%|██████████| 20/20 [00:02<00:00,  7.47it/s]


All production models saved to ../models/


In [8]:
# --- 3.5 Robustness Check: Train on (Train + Test), Evaluate on Validation ---
print("\n--- Robustness Check (Train + Test -> Eval on Val) ---")

# 1. Combine Train + Test
train_test_combined = pd.concat([train_df, test_df])

# 2. Prepare Matrices
tt_users = pd.Categorical(train_test_combined['UserID'], categories=all_users)
tt_movies = pd.Categorical(train_test_combined['MovieID'], categories=all_movies)

item_user_tt = sparse.csr_matrix(
    (train_test_combined['Rating'].astype(float), (tt_movies.codes, tt_users.codes)),
    shape=(len(all_movies), len(all_users))
)
# ¡IMPORTANTE! Transponer para implicit
user_item_tt = item_user_tt.T.tocsr()

# 3. Train ALS (Best Params)
print(f"Training on {len(train_test_combined)} ratings (Train+Test)...")
als_check = implicit.als.AlternatingLeastSquares(
    factors=best_als_params['factors'], 
    regularization=best_als_params['regularization'], 
    iterations=20, 
    random_state=SEED
)
als_check.fit(user_item_tt)

# 4. Evaluate on Validation Set
score_val = evaluate_als(als_check, val_df)
print(f"Robustness Precision@10 (on Validation): {score_val:.4f}")


--- Robustness Check (Train + Test -> Eval on Val) ---
Training on 800167 ratings (Train+Test)...


100%|██████████| 20/20 [00:02<00:00,  8.26it/s]


Robustness Precision@10 (on Validation): 0.1082
