# Model Training: ALS, SVD, and Content-Based

In this notebook, we will train three recommender system components:
1.  **ALS (Alternating Least Squares)** from the `implicit` library: For candidate generation (collaborative filtering).
2.  **SVD (Singular Value Decomposition)** from `scikit-surprise`: For scoring and ranking.
3.  **Content-Based (TF-IDF)**: Using movie genres to find similar items, useful for cold-start or hybrid approaches.

We will save all trained models and artifacts to the `../models/` directory, which the first cell creates if it does not exist.

In [None]:
import pandas as pd
import numpy as np
import scipy.sparse as sparse
import implicit
from surprise import Dataset, Reader, SVD
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
import os

# Ensure models directory exists
os.makedirs("../models", exist_ok=True)

## 1. Load Data (Ratings)
We keep only the columns the models need: `UserID`, `MovieID`, and `Rating`. The file also contains `Timestamp`, which is read and then **dropped**.

In [None]:
RATINGS_FILE = "../data/ml-1m/ratings.dat"
ratings_cols = ['UserID', 'MovieID', 'Rating', 'Timestamp']

# Load and drop timestamp
ratings = pd.read_csv(RATINGS_FILE, sep='::', header=None, names=ratings_cols, engine='python', encoding='latin-1')
ratings = ratings.drop(columns=['Timestamp'])

print(f"Ratings shape: {ratings.shape}")
ratings.head()

## 2. Train ALS Model (Implicit)

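ALS in `implicit` is trained on a sparse ratings matrix. Which orientation `fit` expects depends on the installed version: releases before 0.5 take an Item x User matrix, while 0.5 and later take a User x Item matrix. We'll treat the rating values as "confidence" weights.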
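To make the index mapping concrete, here is a minimal sketch on toy data. The IDs and ratings are made up for illustration, and the `toy_*` names are hypothetical. It shows how category codes turn non-contiguous IDs into contiguous row and column indices, and that the two orientations are transposes of each other.

```python
import pandas as pd
import scipy.sparse as sparse

# Toy ratings with non-contiguous IDs (hypothetical values, not from MovieLens).
toy = pd.DataFrame({
    "UserID":  [10, 10, 42, 77],
    "MovieID": [500, 900, 500, 120],
    "Rating":  [5, 3, 4, 1],
})

# Category codes give each distinct ID a contiguous index, in sorted ID order.
toy_users = toy["UserID"].astype("category")
toy_movies = toy["MovieID"].astype("category")

toy_user_item = sparse.csr_matrix(
    (
        toy["Rating"].astype(float).to_numpy(),
        (toy_users.cat.codes.to_numpy(), toy_movies.cat.codes.to_numpy()),
    )
)

# The item-user orientation is the transpose of the user-item one.
toy_item_user = toy_user_item.T.tocsr()

print(toy_user_item.toarray())
```

User 10 becomes row 0 and movie 500 becomes column 1, so the rating of 5 sits at position `[0, 1]` of the user-item matrix and at `[1, 0]` of its transpose.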

In [None]:
# Create categorical types for mapping IDs to matrix indices efficiently
users = ratings['UserID'].astype("category")
movies_cat = ratings['MovieID'].astype("category") # Rename to avoid clash with movies df later

# Create sparse matrix (rows=items, cols=users for implicit training)
item_user_matrix = sparse.csr_matrix(
    (ratings['Rating'].astype(float), (movies_cat.cat.codes, users.cat.codes))
)

# Also create user_item matrix for inference references
user_item_matrix = sparse.csr_matrix(
    (ratings['Rating'].astype(float), (users.cat.codes, movies_cat.cat.codes))
)

n_items, n_users = item_user_matrix.shape
sparsity = 100 * (1 - item_user_matrix.nnz / (n_items * n_users))
print(f"Matrix Sparsity: {sparsity:.2f}%")

# Initialize ALS model
als_model = implicit.als.AlternatingLeastSquares(
    factors=50, 
    regularization=0.1, 
    iterations=20, 
    random_state=42
)

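# Train
# NOTE: fit() takes an item-user matrix in implicit < 0.5 (as used here).
# In implicit >= 0.5 it takes a user-item matrix, so pass user_item_matrix instead.
print("Training ALS model...")
als_model.fit(item_user_matrix)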
print("ALS Training Complete.")

### Save ALS Artifacts
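We save the model together with the mappings between real MovieLens IDs and matrix indices. Without these mappings, the model's row and column indices cannot be translated back to users and movies.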

In [None]:
# Mappings
user_map = dict(enumerate(users.cat.categories))
movie_map = dict(enumerate(movies_cat.cat.categories))
user_inv_map = {v: k for k, v in user_map.items()}
movie_inv_map = {v: k for k, v in movie_map.items()}

# Save everything in a dictionary
als_artifacts = {
    "model": als_model,
    "user_item_matrix": user_item_matrix,
    "user_inv_map": user_inv_map,  # Real UserID -> Matrix Index
    "movie_inv_map": movie_inv_map, # Real MovieID -> Matrix Index
    "user_map": user_map,          # Matrix Index -> Real UserID
    "movie_map": movie_map         # Matrix Index -> Real MovieID
}

with open("../models/als_artifacts.pkl", "wb") as f:
    pickle.dump(als_artifacts, f)
    
print("ALS artifacts saved to ../models/als_artifacts.pkl")
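At inference time the saved mappings translate between real IDs and matrix indices. The following is a minimal sketch in plain Python with made-up IDs. The helper `to_index` is hypothetical, and the sorted order mirrors how pandas orders integer categories.

```python
# Toy MovieLens-style IDs with gaps and repeats (hypothetical values).
raw_movie_ids = [500, 120, 900, 500, 120]

# Categories are the sorted distinct IDs, so index i maps to the i-th smallest ID.
categories = sorted(set(raw_movie_ids))
movie_map = dict(enumerate(categories))               # Matrix Index -> Real MovieID
movie_inv_map = {v: k for k, v in movie_map.items()}  # Real MovieID -> Matrix Index


def to_index(movie_id, inv_map):
    """Return the matrix index for a real ID, or None if the ID was never rated."""
    return inv_map.get(movie_id)


print(to_index(900, movie_inv_map))   # 2
print(to_index(9999, movie_inv_map))  # None
print(movie_map[to_index(500, movie_inv_map)])  # 500
```

Returning `None` for unknown IDs lets the caller decide how to handle users or movies that were absent from the training data, for example by falling back to the content-based model.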

## 3. Train Content-Based Model (TF-IDF on Genres)
As per `tech_note_fe_genres.MD`, we will use TF-IDF to represent movie genres. This allows us to find similar movies based on content.

In [None]:
MOVIES_FILE = "../data/ml-1m/movies.dat"
movies_cols = ['MovieID', 'Title', 'Genres']

print("Loading movies for Content-Based Filtering...")
movies = pd.read_csv(MOVIES_FILE, sep='::', header=None, names=movies_cols, engine='python', encoding='latin-1')

# Preprocess Genres: Replace pipe with space
movies['genres_str'] = movies['Genres'].str.replace('|', ' ', regex=False)

# TF-IDF Vectorization
# Token pattern keeps hyphenated genres (Sci-Fi, Film-Noir) and genres with
# apostrophes (Children's) as single tokens
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b[A-Za-z'-]+\b")
tfidf_matrix = tfidf.fit_transform(movies['genres_str'])

print(f"TF-IDF Matrix Shape: {tfidf_matrix.shape}")
print(f"Vocabulary: {list(tfidf.vocabulary_.keys())[:10]}...")
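Genre names with punctuation are easy to mis-tokenize, so it is worth checking a candidate pattern before training. `TfidfVectorizer` applies `token_pattern` as a regular expression to the lowercased text, so the sketch below uses Python's `re` module to compare two patterns on a toy genre string. The variable names are hypothetical.

```python
import re

# Lowercased, as TfidfVectorizer does by default before tokenizing.
genres = "children's sci-fi film-noir"

letters_hyphen = re.compile(r"(?u)\b[A-Za-z-]+\b")
letters_hyphen_apostrophe = re.compile(r"(?u)\b[A-Za-z'-]+\b")

print(letters_hyphen.findall(genres))
# ['children', 's', 'sci-fi', 'film-noir']
print(letters_hyphen_apostrophe.findall(genres))
# ["children's", 'sci-fi', 'film-noir']
```

A pattern without the apostrophe splits `children's` into `children` and a stray `s` token, which would then appear in the vocabulary as if it were a genre.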

### Save Content-Based Artifacts
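We save the matrix, the vectorizer, and the movies DataFrame (for metadata mapping). Row `i` of the matrix corresponds to row `i` of the DataFrame, so the two must stay in the same order.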

In [None]:
content_artifacts = {
    "tfidf_matrix": tfidf_matrix,
    "tfidf_vectorizer": tfidf,
    "movies_df": movies[['MovieID', 'Title', 'Genres']] # Keep metadata
}

with open("../models/content_artifacts.pkl", "wb") as f:
    pickle.dump(content_artifacts, f)

print("Content-based artifacts saved to ../models/content_artifacts.pkl")
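One way the saved matrix could be queried for similar movies is sketched below on a toy catalogue. The IDs, titles, and the helper `similar_movies` are hypothetical, and the sketch assumes the vectorizer's default L2 normalization, under which a plain dot product equals cosine similarity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy catalogue (hypothetical IDs and titles, not taken from movies.dat).
toy_movies = pd.DataFrame({
    "MovieID": [11, 22, 33, 44],
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Genres": ["Action|Sci-Fi", "Action|Sci-Fi|Thriller", "Comedy|Romance", "Comedy"],
})
toy_genres = toy_movies["Genres"].str.replace("|", " ", regex=False)
toy_matrix = TfidfVectorizer(token_pattern=r"(?u)\b[A-Za-z'-]+\b").fit_transform(toy_genres)


def similar_movies(movie_id, movies_df, matrix, top_n=2):
    """Return the MovieIDs of the top_n most similar movies by genre."""
    # Row position of the query movie; matrix rows follow movies_df row order.
    row = np.flatnonzero(movies_df["MovieID"].to_numpy() == movie_id)[0]
    # TF-IDF rows are L2-normalized by default, so the dot product is cosine similarity.
    scores = linear_kernel(matrix[row], matrix).ravel()
    scores[row] = -1.0  # exclude the query movie itself
    best = np.argsort(scores)[::-1][:top_n]
    return movies_df["MovieID"].to_numpy()[best].tolist()


print(similar_movies(11, toy_movies, toy_matrix, top_n=1))
```

The lookup goes through `MovieID` rather than a matrix index because the TF-IDF rows follow the order of the movies file, which is not necessarily the same indexing as the ALS matrix built from the ratings.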

## 4. Train SVD Model (Surprise)
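Surprise is designed for explicit feedback: its `SVD` model predicts the rating a user would give a movie, which makes it suitable for scoring and ranking candidates.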

In [None]:
# Define Reader (MovieLens is 1-5 scale)
reader = Reader(rating_scale=(1, 5))

# Load data from DataFrame
data = Dataset.load_from_df(ratings[['UserID', 'MovieID', 'Rating']], reader)

# Build the trainset from all ratings (no hold-out split, so nothing is evaluated here)
trainset = data.build_full_trainset()

# Initialize SVD
svd_model = SVD(n_factors=100, n_epochs=20, random_state=42)

# Train
print("Training SVD model...")
svd_model.fit(trainset)
print("SVD Training Complete.")
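Surprise documents the `SVD` prediction as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The sketch below reproduces that arithmetic in plain Python. The helper `predict_rating` and all numbers are made up for illustration, and the clipping bounds assume the 1 to 5 scale defined in the `Reader`.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Baseline plus latent-factor dot product, clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    return min(high, max(low, raw))


# Hypothetical learned parameters for one user-movie pair.
estimate = predict_rating(mu=3.5, b_u=0.2, b_i=-0.1, p_u=[0.5, -0.2], q_i=[0.4, 0.1])
print(round(estimate, 2))  # 3.78

# A raw score above the scale is clipped to the maximum rating.
print(predict_rating(mu=4.8, b_u=0.5, b_i=0.0, p_u=[0.0], q_i=[0.0]))  # 5.0
```

For a user or movie that was not in the trainset, Surprise treats the corresponding bias and factors as zero, so the estimate falls back toward the global mean.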

### Save SVD Model

In [None]:
with open("../models/svd_model.pkl", "wb") as f:
    pickle.dump(svd_model, f)
    
print("SVD model saved to ../models/svd_model.pkl")
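Downstream notebooks need to read these files back. A minimal loading sketch follows. The helper `load_artifacts` is hypothetical, and the demonstration uses a stand-in dictionary in a temporary directory instead of the real model files. Because unpickling can execute arbitrary code, this should only be used on files you created yourself.

```python
import os
import pickle
import tempfile


def load_artifacts(path, required_keys=()):
    """Load a pickled artifact and, for dict artifacts, check the expected keys."""
    with open(path, "rb") as f:
        artifacts = pickle.load(f)  # only unpickle trusted files
    missing = set(required_keys) - set(artifacts) if required_keys else set()
    if missing:
        raise KeyError(f"{path} is missing keys: {sorted(missing)}")
    return artifacts


# Demonstration with a stand-in dictionary written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_artifacts.pkl")
    with open(demo_path, "wb") as f:
        pickle.dump({"user_map": {0: 10}, "movie_map": {0: 120}}, f)

    loaded = load_artifacts(demo_path, required_keys=("user_map", "movie_map"))
    print(loaded["movie_map"][0])  # 120
```

Loading the real files also requires the same libraries (and ideally the same versions) that were installed when the models were pickled, since pickle stores references to their classes and not the code itself.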