# Collaborative Filtering with SVD

This notebook implements a **collaborative filtering (CF)** approach using TruncatedSVD to leverage historical interaction data.

## Why Collaborative Filtering?
- Discovers hidden patterns from user-program interactions
- Can recommend programs based on what similar users liked
- Complements content-based filtering for a hybrid system
- Effective for personalization as interaction data grows

## Why TruncatedSVD?
- **Matrix factorization**: Decomposes user-item matrix into latent factors
- **Efficient**: Works well with sparse matrices (most users haven't interacted with most programs)
- **Interpretable**: Latent dimensions capture hidden preference patterns
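- **scikit-learn**: Distributed as prebuilt packages for supported Python versions, so installation normally needs no compilation
- **Fast**: Fitting and scoring are quick on a matrix of this size (500 users × 50 programs)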
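
As a minimal sketch of the factorization described above, the snippet below fits `TruncatedSVD` on a tiny, made-up interaction matrix. The 4 × 5 matrix and the `toy_*` names are illustrative assumptions, not data from this project.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical 4-user x 5-program matrix (1 = interacted, 0 = no interaction)
toy = csr_matrix(np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float))

# Factorize into 2 latent dimensions; the sparse input is used as-is
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_user_factors = toy_svd.fit_transform(toy)  # one row per user
toy_item_factors = toy_svd.components_.T       # one row per program

print(toy_user_factors.shape, toy_item_factors.shape)
```

Each user and each program ends up as a short vector in the same latent space, which is what the rest of the notebook builds on.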

## Import Libraries
TruncatedSVD from scikit-learn provides efficient matrix factorization.

In [20]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
import joblib


## Load Data and Create Train/Test Split

Load all data and split interactions 80/20 for training/testing.

**Critical:** The CF model must be trained on the training split only (80%) to avoid data leakage during evaluation.

In [21]:
# Load raw data
users = pd.read_csv("../data/raw/users.csv")
programs = pd.read_csv("../data/raw/programs.csv")
interactions = pd.read_csv("../data/raw/interactions.csv")

print(f"Total data: {len(users)} users, {len(programs)} programs, {len(interactions)} interactions")

# Split interactions 80/20 for train/test
train_interactions, test_interactions = train_test_split(
    interactions, 
    test_size=0.2, 
    random_state=42
)

# Save for later use
train_interactions.to_csv("../data/processed/train_interactions.csv", index=False)
test_interactions.to_csv("../data/processed/test_interactions.csv", index=False)

print("\nTrain/Test Split:")
print(f"  Training: {len(train_interactions)} interactions ({len(train_interactions)/len(interactions)*100:.1f}%)")
print(f"  Testing:  {len(test_interactions)} interactions ({len(test_interactions)/len(interactions)*100:.1f}%)")

# Create user and item ID mappings from training data
user_ids = train_interactions["user_id"].unique()
program_ids = train_interactions["program_id"].unique()

user_id_map = {uid: idx for idx, uid in enumerate(user_ids)}
item_id_map = {pid: idx for idx, pid in enumerate(program_ids)}

reverse_item_map = {idx: pid for pid, idx in item_id_map.items()}

print(f"Unique users in training: {len(user_ids)}")
print(f"Unique programs in training: {len(program_ids)}")

Total data: 500 users, 50 programs, 8004 interactions

Train/Test Split:
  Training: 6403 interactions (80.0%)
  Testing:  1601 interactions (20.0%)
Unique users in training: 500
Unique programs in training: 50
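
Because `user_id_map` and `item_id_map` are built from the training split alone, a test row could in principle refer to a user or program that has no matrix index. In the run above the training split already contains 500 users and 50 programs, matching the totals loaded, so that is unlikely to matter here. The sketch below shows one way to check it; `coverage_report` and the toy frames are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

def coverage_report(train_df, test_df):
    """Count test rows whose user and program both appear in the training split."""
    known_users = set(train_df["user_id"])
    known_programs = set(train_df["program_id"])
    covered = (
        test_df["user_id"].isin(known_users)
        & test_df["program_id"].isin(known_programs)
    )
    return {"test_rows": len(test_df), "covered_rows": int(covered.sum())}

# Hypothetical toy frames using the notebook's column names
toy_train = pd.DataFrame({"user_id": ["u1", "u1", "u2"],
                          "program_id": ["p1", "p2", "p1"]})
toy_test = pd.DataFrame({"user_id": ["u2", "u3"],
                         "program_id": ["p2", "p1"]})

report = coverage_report(toy_train, toy_test)
print(report)  # u3 never appears in training, so only one test row is covered
```

On the real data the call would be `coverage_report(train_interactions, test_interactions)`.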


## Create ID Mappings
The previous cell maps each user ID and program ID seen in training to a row or column index.

These mappings fix the set of users and items the model knows about. The next cell uses them to build the sparse interaction matrix.

In [22]:
# Build sparse user-item interaction matrix from TRAINING data only
row_indices = [user_id_map[uid] for uid in train_interactions["user_id"]]
col_indices = [item_id_map[pid] for pid in train_interactions["program_id"]]
data = train_interactions["interaction"].values

interaction_matrix = csr_matrix(
    (data, (row_indices, col_indices)),
    shape=(len(user_ids), len(program_ids))
)

print(f"Interaction matrix shape: {interaction_matrix.shape}")
print(f"Number of training interactions: {interaction_matrix.nnz}")


Interaction matrix shape: (500, 50)
Number of training interactions: 6403


## Build Interaction Matrix
The cell above converts the **training interactions only** into a sparse user-item matrix.

**Important:** Only the 80% training split is used to build the model. The remaining 20% is held out for evaluation in notebook 05.

**Format:** Rows = users, Columns = programs, Values = interaction (1 = liked/viewed)

- Using CSR (Compressed Sparse Row) format for efficiency
- Most entries are 0 (users haven't interacted with most programs)
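
To put a number on "most entries are 0": the matrix above stores 6403 values in 500 × 50 = 25,000 cells, so roughly 25.6% of the cells are filled. A small sketch of that calculation on a made-up matrix follows; `density` is a hypothetical helper.

```python
import numpy as np
from scipy.sparse import csr_matrix

def density(matrix):
    """Fraction of cells that hold a stored (non-zero) value."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)

# Hypothetical 3-user x 4-program matrix built from (row, col, value) triples,
# the same construction the notebook uses
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 2, 1, 2])
vals = np.ones(4)
toy_matrix = csr_matrix((vals, (rows, cols)), shape=(3, 4))

print(f"density: {density(toy_matrix):.3f}")
```

On the real data the call would be `density(interaction_matrix)`.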

In [23]:
# Train SVD with at most 20 latent components (fewer if there are fewer than 21 programs)
n_components = min(20, len(program_ids) - 1)  # Adaptive based on program count
svd = TruncatedSVD(n_components=n_components, random_state=42)
user_factors = svd.fit_transform(interaction_matrix)
item_factors = svd.components_.T

print(f"User factors shape: {user_factors.shape}")
print(f"Item factors shape: {item_factors.shape}")
print(f"Explained variance ratio: {svd.explained_variance_ratio_.sum():.3f}")


User factors shape: (500, 20)
Item factors shape: (50, 20)
Explained variance ratio: 0.546


## Train SVD Model

**How it works:**
- Decomposes the user-item matrix into two lower-dimensional matrices
- **User factors**: Each user represented by latent features
- **Item factors**: Each program represented by latent features
- Predictions = dot product of user and item factors
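
In matrix terms, and assuming $R$ denotes the training interaction matrix (a symbol introduced here for illustration only), the truncated SVD with $k$ components gives the rank-$k$ approximation

$$
R \approx \hat{R} = U_k \Sigma_k V_k^{\top},
\qquad
\hat{r}_{ui} = \mathbf{p}_u \cdot \mathbf{q}_i ,
$$

where $\mathbf{p}_u$ is row $u$ of `user_factors` (the product $U_k \Sigma_k$ returned by `fit_transform`) and $\mathbf{q}_i$ is row $i$ of `item_factors` (the matrix $V_k$, that is, `svd.components_.T`).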

**Hyperparameters:**
- `n_components`: Latent factor dimension (adaptive based on dataset size)
  - Lower = simpler patterns, less overfitting
  - Higher = more complex patterns
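  - Set to min(20, num_programs - 1), a heuristic cap rather than a tuned value; it evaluates to 20 for the 50 programs here

**Explained variance**: The fraction of the variance in the interaction matrix captured by the retained latent factors (0.546 in this run).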
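
One way to see how `n_components` trades off against explained variance is to look at the cumulative ratio over the fitted components. The sketch below does this on synthetic data; the random 200 × 30 matrix and the `toy_*` names are assumptions standing in for the real interaction matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in: 200 users x 30 programs, about 25% of cells filled
rng = np.random.default_rng(0)
toy_matrix = csr_matrix((rng.random((200, 30)) < 0.25).astype(float))

toy_svd = TruncatedSVD(n_components=20, random_state=42).fit(toy_matrix)

# Running total of the variance captured by the first k components
cumulative = np.cumsum(toy_svd.explained_variance_ratio_)

for k in (5, 10, 20):
    print(f"first {k:>2} components: {cumulative[k - 1]:.3f}")
```

A curve that flattens early suggests fewer components would do. Refitting with a smaller `n_components` may give slightly different numbers than reading a prefix of this curve, because the default solver is randomized.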

In [24]:
# Compute predicted scores for all user-item pairs
predicted_scores = user_factors @ item_factors.T


print(f"Predicted scores shape: {predicted_scores.shape}")
print(f"Score range: [{predicted_scores.min():.3f}, {predicted_scores.max():.3f}]")

Predicted scores shape: (500, 50)
Score range: [-0.620, 1.442]


## Compute Predictions
Calculate predicted interaction scores for all user-program pairs using matrix multiplication.

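**Result:** Each user gets a predicted score for every program, where a higher score means a stronger predicted preference. The scores are not probabilities (they range from -0.620 to 1.442 here), so they are meaningful for ranking programs rather than as likelihoods.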
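
The same score matrix can also be obtained from scikit-learn's `TruncatedSVD.inverse_transform`, which multiplies the factors back by `components_`. The sketch below checks the equivalence on synthetic data; the random matrix and `toy_*` names are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the interaction matrix: 40 users x 12 programs
rng = np.random.default_rng(1)
toy_matrix = csr_matrix((rng.random((40, 12)) < 0.3).astype(float))

toy_svd = TruncatedSVD(n_components=5, random_state=42)
toy_user_factors = toy_svd.fit_transform(toy_matrix)
toy_item_factors = toy_svd.components_.T

# Explicit dot product, as in the notebook cell
scores_by_dot = toy_user_factors @ toy_item_factors.T
# Library route: map the latent user vectors back to program space
scores_by_api = toy_svd.inverse_transform(toy_user_factors)

print(np.allclose(scores_by_dot, scores_by_api))  # True
```

Either route gives a dense users × programs array, which is fine at this scale but grows with the product of both counts.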

In [25]:
def recommend_cf(user_id, k=3):
    """Generate top-k recommendations for a user using CF"""
    if user_id not in user_id_map:
        return []  # Handle cold-start users
    
    user_idx = user_id_map[user_id]
    scores = predicted_scores[user_idx]
    
    # Filter out programs the user has already interacted with
    interacted_items = interaction_matrix[user_idx].nonzero()[1]
    scores_copy = scores.copy()
    scores_copy[interacted_items] = -np.inf
    
    # Get top-k programs
    top_items = np.argsort(scores_copy)[::-1][:k]
    recommendations = [
        {"program_id": reverse_item_map[i], "score": scores[i]}
        for i in top_items
    ]
    return recommendations

# Test with an actual user from the training data
test_user = list(user_id_map.keys())[0]
print(f"Testing recommendations for user: {test_user}")
recommend_cf(test_user)


Testing recommendations for user: u_0067


[{'program_id': 'p_012', 'score': np.float64(0.6177207136133009)},
 {'program_id': 'p_025', 'score': np.float64(0.5879447565336278)},
 {'program_id': 'p_038', 'score': np.float64(0.44512790154136594)}]

## Recommendation Function
Generate top-k program recommendations for a given user based on CF predictions.

**Process:**
1. Convert user ID to internal index
2. Get predicted scores for all programs
3. **Filter out already-interacted programs** (don't recommend what they've seen)
4. Sort by score (descending)
5. Return top-k program IDs with scores

**Key detail:** Programs the user has already interacted with in the training data are excluded by setting their scores to `-np.inf` before ranking.
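
The filtering and ranking steps above can be sketched in isolation. This variant, offered as an assumption rather than the notebook's implementation, uses `np.argpartition` to avoid a full sort and caps `k` at the number of unseen programs, so a user with fewer than `k` unseen programs never gets an already-seen item back. `top_k_unseen` and the toy scores are hypothetical.

```python
import numpy as np

def top_k_unseen(scores, seen_idx, k=3):
    """Return indices of the k highest-scoring items not in seen_idx, best first."""
    masked = np.array(scores, dtype=float)  # copy so the caller's array is untouched
    masked[np.asarray(seen_idx, dtype=int)] = -np.inf

    # Never return more items than there are unseen candidates
    k = min(k, int(np.isfinite(masked).sum()))
    if k == 0:
        return np.array([], dtype=int)

    # Partial selection of the k best, then sort only those k by score
    candidates = np.argpartition(-masked, k - 1)[:k]
    return candidates[np.argsort(-masked[candidates])]

toy_scores = np.array([0.9, 0.1, 0.7, 0.4, 0.8])
print(top_k_unseen(toy_scores, seen_idx=[0], k=2))  # [4 2]
```

With only 50 programs the full `np.argsort` in `recommend_cf` is perfectly adequate. Partial selection matters mainly for much larger catalogs.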

In [26]:
# Save SVD model and mappings
joblib.dump({
    "svd": svd,
    "user_factors": user_factors,
    "item_factors": item_factors,
    "predicted_scores": predicted_scores,
    "user_id_map": user_id_map,
    "item_id_map": item_id_map,
    "reverse_item_map": reverse_item_map,
    "interaction_matrix": interaction_matrix
}, "../models/cf_svd.pkl")

print("✓ CF model saved to ../models/cf_svd.pkl")


✓ CF model saved to ../models/cf_svd.pkl


## Save Model
Persist the trained SVD model, factors, and mappings for deployment.

**Saved components:**
- SVD model (fitted transformer, e.g. to project new interaction rows without refitting)
- User and item latent factors
- Precomputed predicted scores (fast inference)
- ID mappings (external ↔ internal indices)
- Interaction matrix (to filter already-seen items)
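
A consumer of the saved file would load the dictionary with `joblib.load` and look scores up by index. The sketch below round-trips a toy bundle through a temporary file so that it runs on its own. The bundle values, the `u_demo` and `p_*` IDs, and the file name are made up, and only a subset of the saved keys is shown.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np

# Toy bundle reusing some of the saved dictionary's keys (values are hypothetical)
toy_bundle = {
    "predicted_scores": np.array([[0.2, 0.9, 0.5]]),
    "user_id_map": {"u_demo": 0},
    "reverse_item_map": {0: "p_a", 1: "p_b", 2: "p_c"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "cf_svd_demo.pkl"
    joblib.dump(toy_bundle, model_path)
    loaded = joblib.load(model_path)

# Look up the user's row, then map the best-scoring column back to a program ID
user_idx = loaded["user_id_map"]["u_demo"]
best_idx = int(np.argmax(loaded["predicted_scores"][user_idx]))
print(loaded["reverse_item_map"][best_idx])  # p_b
```

For the real model the path would be `../models/cf_svd.pkl`. Pickle-based files should only be loaded from trusted sources, since unpickling can execute arbitrary code.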