# Collaborative Filtering with SVD

This notebook implements a **collaborative filtering (CF)** approach using TruncatedSVD to leverage historical interaction data.

## Why Collaborative Filtering?
- Discovers hidden patterns from user-program interactions
- Can recommend programs based on what similar users liked
- Complements content-based filtering for a hybrid system
- Effective for personalization as interaction data grows

## Why TruncatedSVD?
- **Matrix factorization**: Decomposes user-item matrix into latent factors
- **Efficient**: Works well with sparse matrices (most users haven't interacted with most programs)
- **Interpretable**: Latent dimensions capture hidden preference patterns
- **scikit-learn**: Ships as prebuilt wheels for supported Python versions, so no local compilation is typically needed
- **Fast**: The default randomized solver scales to large sparse matrices

## Import Libraries
TruncatedSVD from scikit-learn provides efficient matrix factorization.

In [13]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
import joblib


In [14]:
users = pd.read_csv("../data/raw/users.csv")
programs = pd.read_csv("../data/raw/programs.csv")
interactions = pd.read_csv("../data/raw/interactions.csv")


## Load Data
Load users, programs, and historical interactions (likes) generated in notebook 01.

In [15]:
# Create user and item ID mappings
user_ids = users["user_id"].unique()
program_ids = programs["program_id"].unique()

user_id_map = {uid: idx for idx, uid in enumerate(user_ids)}
item_id_map = {pid: idx for idx, pid in enumerate(program_ids)}

reverse_item_map = {idx: pid for pid, idx in item_id_map.items()}

print(f"Users: {len(user_ids)}, Programs: {len(program_ids)}")

Users: 100, Programs: 6


## Create ID Mappings
Map user and program IDs to matrix indices for building the interaction matrix.

This establishes the vocabulary for both users and items before building the sparse matrix.
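An interaction row that references an ID missing from the user or program vocabulary would raise a `KeyError` when the matrix is built. The sketch below shows one way to drop such rows first. It uses small toy frames (hypothetical data, not the notebook's CSV files), and the names `known`, `clean_interactions`, and `dropped` are introduced only for illustration.

```python
import pandas as pd

# Toy stand-ins for the CSVs (hypothetical data, not the notebook's files)
users = pd.DataFrame({"user_id": ["u_0", "u_1", "u_2"]})
programs = pd.DataFrame({"program_id": ["p_0", "p_1"]})
interactions = pd.DataFrame({
    "user_id": ["u_0", "u_1", "u_9"],  # u_9 does not appear in users
    "program_id": ["p_0", "p_1", "p_0"],
    "interaction": [1, 1, 1],
})

# Same mapping construction as in the cell above
user_id_map = {uid: idx for idx, uid in enumerate(users["user_id"].unique())}
item_id_map = {pid: idx for idx, pid in enumerate(programs["program_id"].unique())}

# Keep only rows whose IDs appear in both vocabularies
known = (
    interactions["user_id"].isin(list(user_id_map))
    & interactions["program_id"].isin(list(item_id_map))
)
clean_interactions = interactions[known]
dropped = len(interactions) - len(clean_interactions)
print(f"Dropped {dropped} interaction(s) with unknown IDs")
```

Whether to drop unknown IDs or extend the vocabulary instead depends on how the data is produced, so treat the filter as one option rather than the required step.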

In [16]:
# Build sparse user-item interaction matrix
row_indices = [user_id_map[uid] for uid in interactions["user_id"]]
col_indices = [item_id_map[pid] for pid in interactions["program_id"]]
data = interactions["interaction"].values

interaction_matrix = csr_matrix(
    (data, (row_indices, col_indices)),
    shape=(len(user_ids), len(program_ids))
)

print(f"Interaction matrix shape: {interaction_matrix.shape}")
print(f"Number of interactions: {interaction_matrix.nnz}")


Interaction matrix shape: (100, 6)
Number of interactions: 200


## Build Interaction Matrix
Convert interaction data into a sparse user-item matrix format.

**Format:** Rows = users, Columns = programs, Values = interaction (1 = liked/viewed)
- Using CSR (Compressed Sparse Row) format for efficiency
- Most entries are 0 (users haven't interacted with most programs)
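One detail worth checking: when `csr_matrix` is built from `(data, (row, col))` coordinates, duplicate coordinates are summed. If the same user-program pair appears twice in the interactions, the cell would hold 2 rather than 1. The sketch below illustrates this on toy coordinates (hypothetical values) and shows an optional clip back to binary feedback, along with a density calculation. For reference, the notebook's matrix stores 200 entries in 100 x 6 = 600 cells.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy coordinates (hypothetical): user 0 liked item 1 twice
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 1, 0, 2])
vals = np.array([1, 1, 1, 1])

m = csr_matrix((vals, (rows, cols)), shape=(3, 3))

# Duplicate (row, col) pairs were summed during construction
dense_before = m.toarray()
print(dense_before)

# Optional: clip stored values to 1 if feedback is meant to be binary
m.data = np.minimum(m.data, 1)

# Density = stored entries / total cells
density = m.nnz / (m.shape[0] * m.shape[1])
print(f"Density: {density:.2%}")
```

Clipping is only appropriate if repeated interactions should not count as stronger signal; keeping the summed counts is an equally valid choice.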

In [17]:
# Train SVD model
svd = TruncatedSVD(n_components=5, random_state=42)
user_factors = svd.fit_transform(interaction_matrix)
item_factors = svd.components_.T

print(f"User factors shape: {user_factors.shape}")
print(f"Item factors shape: {item_factors.shape}")
print(f"Explained variance ratio: {svd.explained_variance_ratio_.sum():.3f}")


User factors shape: (100, 5)
Item factors shape: (6, 5)
Explained variance ratio: 0.844


## Train SVD Model

**How it works:**
- Decomposes the user-item matrix into two lower-dimensional matrices
- **User factors**: Each user represented by 5 latent features
- **Item factors**: Each program represented by 5 latent features
- Predictions = dot product of user and item factors

**Hyperparameters:**
- `n_components=5`: Latent factor dimension
  - Lower = simpler patterns, less overfitting
  - Higher = more complex patterns
  - With only 6 programs, the matrix has rank at most 6, so 5 components keep most of the signal but compress very little

**Explained variance**: The fraction of the variance in the interaction matrix that the latent factors capture (0.844 here).
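One possible way to choose `n_components` is to look at the cumulative explained variance and take the smallest number of components that reaches a target. The sketch below does this on synthetic binary data with the same shape as the notebook's matrix. The data, the `threshold` value of 0.80, and the name `best_k` are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Synthetic binary interactions (hypothetical), shape (100, 6)
toy = csr_matrix((rng.random((100, 6)) < 0.33).astype(float))

svd = TruncatedSVD(n_components=5, random_state=42).fit(toy)

# Running total of the variance captured by the first 1..5 components
cumulative = np.cumsum(svd.explained_variance_ratio_)

threshold = 0.80  # assumed target; tune for the real data
reached = cumulative >= threshold
# Smallest component count reaching the threshold, else all components
best_k = int(np.argmax(reached)) + 1 if reached.any() else len(cumulative)

print(f"Cumulative explained variance: {np.round(cumulative, 3)}")
print(f"Components needed for {threshold:.0%}: {best_k}")
```

Explained variance measures reconstruction quality, not recommendation quality, so a held-out evaluation would be a better final criterion where enough data exists.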

In [18]:
# Compute predicted scores for all user-item pairs
predicted_scores = user_factors @ item_factors.T


print(f"Predicted scores shape: {predicted_scores.shape}")
print(f"Score range: [{predicted_scores.min():.3f}, {predicted_scores.max():.3f}]")

Predicted scores shape: (100, 6)
Score range: [-0.236, 1.120]


## Compute Predictions
Calculate predicted interaction scores for all user-program pairs using matrix multiplication.

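**Result:** Each user gets a predicted score for every program. A higher score indicates a stronger predicted preference; the scores are not probabilities and can fall outside [0, 1], as the range above shows.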
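To make the matrix multiplication concrete, the sketch below checks on toy data (hypothetical, not the notebook's matrix) that a single entry of the score matrix is the dot product of one user's and one program's factor vectors, and that the full product matches `TruncatedSVD.inverse_transform`. Since raw scores can fall outside [0, 1], it also shows an optional per-user min-max scaling for display; the names `single` and `scaled` are introduced only for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(1)
# Synthetic binary interactions (hypothetical), shape (20, 6)
toy = csr_matrix((rng.random((20, 6)) < 0.4).astype(float))

svd = TruncatedSVD(n_components=3, random_state=42)
user_factors = svd.fit_transform(toy)   # shape (20, 3)
item_factors = svd.components_.T        # shape (6, 3)

# Low-rank reconstruction of the interaction matrix
predicted = user_factors @ item_factors.T

# One (user, program) score is the dot product of their factor vectors
single = user_factors[4] @ item_factors[2]

# Optional: scale each user's scores to [0, 1] for display only
row_min = predicted.min(axis=1, keepdims=True)
row_range = predicted.max(axis=1, keepdims=True) - row_min
scaled = (predicted - row_min) / np.where(row_range == 0, 1, row_range)

print(np.isclose(single, predicted[4, 2]))
print(np.allclose(predicted, svd.inverse_transform(user_factors)))
```

The scaling is monotonic within each row, so it leaves each user's ranking unchanged.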

In [19]:
def recommend_cf(user_id, k=3):
    """Generate top-k recommendations for a user using CF"""
    user_idx = user_id_map[user_id]
    scores = predicted_scores[user_idx]
    
    # Filter out programs the user has already interacted with
    interacted_items = interaction_matrix[user_idx].nonzero()[1]
    scores_copy = scores.copy()
    scores_copy[interacted_items] = -np.inf
    
    # Get top-k programs, skipping masked ones when fewer than k are unseen
    top_items = [
        i for i in np.argsort(scores_copy)[::-1][:k]
        if np.isfinite(scores_copy[i])
    ]
    recommendations = [
        {"program_id": reverse_item_map[i], "score": scores[i]}
        for i in top_items
    ]
    return recommendations

# Test with a user
recommend_cf("u_0")


[{'program_id': 'p_5', 'score': np.float64(0.5674785327653241)},
 {'program_id': 'p_4', 'score': np.float64(0.24600269506252911)},
 {'program_id': 'p_2', 'score': np.float64(0.08358082680506651)}]

## Recommendation Function
Generate top-k program recommendations for a given user based on CF predictions.

**Process:**
1. Convert user ID to internal index
2. Get predicted scores for all programs
3. **Filter out already-interacted programs** (don't recommend what they've seen)
4. Sort by score (descending)
5. Return top-k program IDs with scores

**Key improvement:** Excludes programs the user has already liked/viewed.
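The function above assumes the user is already in `user_id_map`; an unseen user would raise a `KeyError`. The sketch below shows one possible variant that falls back to the most-interacted programs for unknown users. The function name `recommend_with_fallback`, the popularity fallback, and the toy state are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy state (hypothetical values standing in for the notebook's objects)
user_id_map = {"u_0": 0, "u_1": 1}
reverse_item_map = {0: "p_0", 1: "p_1", 2: "p_2"}
interaction_matrix = csr_matrix(np.array([[1, 0, 0], [1, 1, 0]]))
predicted_scores = np.array([[0.9, 0.2, 0.5], [0.8, 0.7, 0.1]])


def recommend_with_fallback(user_id, k=3):
    """Top-k CF recommendations; unknown users get the most-liked programs."""
    if user_id not in user_id_map:
        # Cold start: rank programs by total interactions (column sums)
        popularity = np.asarray(interaction_matrix.sum(axis=0)).ravel()
        top = np.argsort(-popularity, kind="stable")[:k]
        return [{"program_id": reverse_item_map[int(i)], "score": None} for i in top]

    user_idx = user_id_map[user_id]
    scores = predicted_scores[user_idx].astype(float)  # astype returns a copy
    seen = interaction_matrix[user_idx].nonzero()[1]
    scores[seen] = -np.inf  # mask programs the user already interacted with

    order = np.argsort(-scores, kind="stable")[:k]
    # Skip masked entries so seen programs never appear in short lists
    return [
        {"program_id": reverse_item_map[int(i)], "score": float(scores[i])}
        for i in order
        if np.isfinite(scores[i])
    ]


print(recommend_with_fallback("u_0"))
print(recommend_with_fallback("u_new", k=2))
```

Popularity is only one cold-start option; in a hybrid system, the content-based model could serve unknown users instead.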

In [20]:
# Save SVD model and mappings
joblib.dump({
    "svd": svd,
    "user_factors": user_factors,
    "item_factors": item_factors,
    "predicted_scores": predicted_scores,
    "user_id_map": user_id_map,
    "item_id_map": item_id_map,
    "reverse_item_map": reverse_item_map,
    "interaction_matrix": interaction_matrix
}, "../models/cf_svd.pkl")

print("✓ CF model saved to ../models/cf_svd.pkl")


✓ CF model saved to ../models/cf_svd.pkl


## Save Model
Persist the trained SVD model, factors, and mappings for deployment.

**Saved components:**
- SVD model (for retraining if needed)
- User and item latent factors
- Interaction matrix (to filter already-seen items)
- Precomputed predicted scores (fast inference)
- ID mappings (external ↔ internal indices)
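For completeness, here is a minimal sketch of the loading side: it round-trips a miniature artifact through `joblib.dump` and `joblib.load` in a temporary directory and looks up the best-scoring program for one user. The artifact contents are hypothetical and include only a subset of the keys saved above.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical miniature artifact reusing a subset of the saved keys
artifact = {
    "predicted_scores": np.array([[0.1, 0.9], [0.7, 0.3]]),
    "user_id_map": {"u_0": 0, "u_1": 1},
    "reverse_item_map": {0: "p_0", 1: "p_1"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cf_svd.pkl")
    joblib.dump(artifact, path)
    loaded = joblib.load(path)

# Look up the best-scoring program for one user from the loaded artifact
row = loaded["predicted_scores"][loaded["user_id_map"]["u_1"]]
best_program = loaded["reverse_item_map"][int(np.argmax(row))]
print(best_program)
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and a fitted scikit-learn estimator is safest to load under the same library version that saved it.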