# Lab B.1: Collaborative Filtering Fundamentals

**Module:** B - Recommender Systems  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the user-item interaction matrix and sparsity challenges
- [ ] Implement matrix factorization from scratch in PyTorch
- [ ] Train a collaborative filtering model with mini-batch gradient descent (Adam)
- [ ] Evaluate recommendations using RMSE
- [ ] Visualize learned embeddings with t-SNE

---

## 📚 Prerequisites

- Completed: Module 2.1 (PyTorch Fundamentals)
- Knowledge of: Basic linear algebra (vectors, matrices, dot products)

---

## 🌍 Real-World Context

**The Netflix Prize**: In 2006, Netflix offered $1 million to the first team to improve the RMSE of its recommendation algorithm by 10%. The winning entry (2009) was a large ensemble of models, and **matrix factorization**, the technique you'll learn today, was one of its core components.

Matrix factorization and the embedding models that grew out of it underpin recommendation features such as:
- 🎬 **Netflix**: "Because you watched..."
- 🎵 **Spotify**: Discover Weekly playlists
- 🛒 **Amazon**: "Customers who bought this also bought..."
- 📱 **TikTok**: The For You page

---

## 🧒 ELI5: Collaborative Filtering

> **Imagine you're at a pizza party with friends...**
>
> You've never tried the Hawaiian pizza, but you notice that:
> - Your friend Sarah loves pepperoni AND Hawaiian pizza
> - You ALSO love pepperoni pizza
> - So maybe you'd like Hawaiian pizza too!
>
> This is **collaborative filtering**: finding patterns in what similar people like.
>
> **Matrix Factorization** takes this further: instead of just finding "similar people," 
> it learns *hidden factors* like "likes spicy food" or "prefers comedy movies" that 
> explain why people rate things the way they do.
>
> **In AI terms:** We decompose a giant ratings matrix (users × items) into two smaller 
> matrices: user preferences and item characteristics. The dot product of these gives us 
> predicted ratings!
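
The pizza-party intuition can be written down as a tiny user-based collaborative filter. This is a minimal sketch with made-up ratings (the third friend "Tom" and all the numbers are assumptions for illustration), not the method used in the rest of this lab:

```python
import numpy as np

# Rows: you, Sarah, Tom. Columns: pepperoni, Hawaiian, veggie.
# 0 means "not tried yet". All values are invented for illustration.
ratings = np.array([
    [5.0, 0.0, 1.0],   # you (Hawaiian is unknown)
    [5.0, 4.0, 1.0],   # Sarah
    [1.0, 2.0, 5.0],   # Tom
])

def cosine(a, b):
    """Cosine similarity between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

you, hawaiian = 0, 1
others = [1, 2]

# Compare tastes only on the pizzas you have already rated
rated = ratings[you] > 0
sims = np.array([cosine(ratings[you, rated], ratings[o, rated]) for o in others])

# Similarity-weighted average of the other people's Hawaiian ratings
prediction = sims @ ratings[others, hawaiian] / sims.sum()

print("Similarity to Sarah and Tom:", sims.round(3))
print(f"Predicted Hawaiian rating for you: {prediction:.2f}")
```

Sarah's taste matches yours more closely than Tom's, so her opinion of Hawaiian pizza carries more weight in the prediction.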

---

## Part 1: Setup and Data Loading

Let's start by loading the MovieLens dataset - the classic benchmark for recommender systems.

In [None]:
# First, let's make sure we have our utilities available
import sys
from pathlib import Path

# Add scripts directory to path
module_dir = Path.cwd().parent if 'notebooks' in str(Path.cwd()) else Path.cwd()
sys.path.insert(0, str(module_dir / 'scripts'))

# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# Our utilities
from data_utils import (
    download_movielens, 
    print_dataset_info,
    train_test_split_by_time,
    RatingsDataset,
    compute_statistics
)

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🖥️  Using device: {device}")

if torch.cuda.is_available():
    print(f"🚀 GPU: {torch.cuda.get_device_name(0)}")
    print(f"💾 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Download and load MovieLens 100K dataset
ratings_df, movies_df = download_movielens('100k')

# Display dataset statistics
print_dataset_info(ratings_df, movies_df)

### 🔍 What Just Happened?

We loaded the MovieLens 100K dataset with:
- **943 users** who rated **1,682 movies**
- **100,000 ratings** on a 1-5 star scale
- **93.7% sparsity** - most user-movie pairs have NO rating!

This sparsity is the core challenge of recommender systems. How do we predict ratings for movies a user has never seen?
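
The sparsity figure follows directly from the three counts above. A quick sanity check, using only the numbers quoted in this section:

```python
# Sparsity = fraction of user-movie cells that hold no rating.
n_users, n_items, n_ratings = 943, 1682, 100_000  # MovieLens 100K figures quoted above

total_cells = n_users * n_items      # every possible (user, movie) pair
density = n_ratings / total_cells    # fraction of cells that hold a rating
sparsity = 1.0 - density             # fraction of cells that are empty

print(f"Total cells: {total_cells:,}")
print(f"Density:     {density:.1%}")
print(f"Sparsity:    {sparsity:.1%}")
```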

In [None]:
# Let's visualize the rating distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Rating distribution
axes[0].hist(ratings_df['rating'], bins=np.arange(0.5, 6.5, 1.0), edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Rating')
axes[0].set_ylabel('Count')
axes[0].set_title('Rating Distribution')

# Ratings per user
user_counts = ratings_df.groupby('user_id').size()
axes[1].hist(user_counts, bins=50, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Number of Ratings')
axes[1].set_ylabel('Number of Users')
axes[1].set_title('Ratings per User')
axes[1].axvline(user_counts.median(), color='red', linestyle='--', label=f'Median: {user_counts.median():.0f}')
axes[1].legend()

# Ratings per movie
item_counts = ratings_df.groupby('item_id').size()
axes[2].hist(item_counts, bins=50, edgecolor='black', alpha=0.7)
axes[2].set_xlabel('Number of Ratings')
axes[2].set_ylabel('Number of Movies')
axes[2].set_title('Ratings per Movie (Long Tail!)')
axes[2].axvline(item_counts.median(), color='red', linestyle='--', label=f'Median: {item_counts.median():.0f}')
axes[2].legend()

plt.tight_layout()
plt.show()

print(f"\n📊 Key Observations:")
print(f"   - Most ratings are 3-4 stars (skewed toward the high end of the scale)")
print(f"   - Some users rate 20 movies, others rate 700+")
print(f"   - Long tail: many movies have very few ratings (cold start problem!)")

---

## Part 2: Understanding the User-Item Matrix

Before we do matrix factorization, let's visualize what we're working with.

In [None]:
# Create a small user-item matrix for visualization
# (Full matrix would be 943 x 1682 ≈ 1.59 million cells!)

# Select 20 most active users and 30 most popular movies
top_users = ratings_df.groupby('user_id').size().nlargest(20).index
top_items = ratings_df.groupby('item_id').size().nlargest(30).index

# Filter to these users/items
subset = ratings_df[
    ratings_df['user_id'].isin(top_users) & 
    ratings_df['item_id'].isin(top_items)
]

# Create pivot table (the user-item matrix); unrated cells stay NaN
matrix = subset.pivot_table(
    index='user_id', 
    columns='item_id', 
    values='rating'
)

# Visualize (seaborn leaves NaN cells blank)
plt.figure(figsize=(14, 8))
sns.heatmap(
    matrix, 
    cmap='YlOrRd',
    vmin=1, vmax=5,
    cbar_kws={'label': 'Rating'},
    linewidths=0.5
)
plt.title('User-Item Rating Matrix (Subset)\nBlank = No Rating (the sparsity problem!)')
plt.xlabel('Movie ID')
plt.ylabel('User ID')
plt.show()

# Calculate sparsity of this subset (share of cells with no rating)
subset_sparsity = matrix.isna().sum().sum() / matrix.size
print(f"\n📊 Even in this active subset: {subset_sparsity:.1%} of entries are empty!")

### 🧒 ELI5: Why Factorization?

> **The Big Idea:**
>
> Instead of storing 943 × 1,682 ≈ 1.59 million numbers (mostly empty),
> we represent each user with a small vector (say, 64 numbers)
> and each movie with a small vector (64 numbers).
>
> To predict User 5's rating for Movie 100:
> - Look up User 5's preference vector: [0.2, -0.5, 0.8, ...]
> - Look up Movie 100's characteristic vector: [0.1, 0.3, -0.2, ...]
> - Dot product → predicted rating!
>
> **Why this works:** The 64 dimensions learn to represent things like
> "how much does this user like action movies?" and "how action-y is this movie?"
> We never tell the model what these dimensions mean - it discovers them from data!

```
User-Item Matrix (sparse)     ≈    User Matrix    ×    Item Matrix
    943 × 1,682                    943 × 64            64 × 1,682
    1.59M cells                    60K params          108K params
    (~94% empty)                   (dense!)            (dense!)
```
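
The shapes in the diagram can be checked with a few lines of NumPy. This is a sketch with random, untrained factors (the names `P` and `Q` and the random initialization are assumptions for illustration), so the scores themselves are meaningless. Only the shapes and the dot-product mechanics matter here:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 943, 1682, 64

# Hypothetical (untrained) factor matrices: one row per user / per movie
P = rng.normal(scale=0.1, size=(n_users, dim))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, dim))   # item factors

# The score for one (user, movie) pair is a single dot product
user, movie = 5, 100
score = P[user] @ Q[movie]

# All scores at once: (943 x 64) @ (64 x 1682) -> (943 x 1682)
all_scores = P @ Q.T

print(all_scores.shape)
print(np.isclose(score, all_scores[user, movie]))
print(f"{P.size + Q.size:,} stored numbers vs {n_users * n_items:,} matrix cells")
```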

---

## Part 3: Implementing Matrix Factorization

Now let's build our model! We'll implement the classic Matrix Factorization with biases:

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u \cdot \mathbf{q}_i$$

Where:
- $\mu$ = global average rating
- $b_u$ = user bias ("does this user tend to rate high or low?")
- $b_i$ = item bias ("is this movie generally liked or disliked?")
- $\mathbf{p}_u$ = user embedding vector
- $\mathbf{q}_i$ = item embedding vector
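
To make the formula concrete, here is a worked example with made-up numbers (every value below is an assumption chosen so the arithmetic is easy to follow, with 3-dimensional embeddings instead of 64):

```python
import numpy as np

mu = 3.5                           # global average rating (made-up value)
b_u = 0.3                          # this user rates 0.3 stars above average
b_i = -0.2                         # this movie is rated 0.2 stars below average
p_u = np.array([0.5, -1.0, 0.2])   # user embedding (3 dims for readability)
q_i = np.array([0.4, -0.5, 1.0])   # item embedding

# Dot product: 0.5*0.4 + (-1.0)*(-0.5) + 0.2*1.0 = 0.9
interaction = p_u @ q_i

r_hat = mu + b_u + b_i + interaction
print(f"Predicted rating: {r_hat:.2f}")  # 4.50
```

The biases shift the prediction away from the global mean, and the dot product adds the user-specific match between this user and this movie.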

In [None]:
class MatrixFactorization(nn.Module):
    """
    Matrix Factorization for Collaborative Filtering.
    
    Models from this family were a core component of the winning Netflix Prize entry.
    
    The key insight: every user and every item can be represented
    as a vector in a shared "latent space". Similar users have 
    similar vectors. Similar movies have similar vectors.
    """
    
    def __init__(self, num_users, num_items, embedding_dim=64):
        super().__init__()
        
        self.num_users = num_users
        self.num_items = num_items
        self.embedding_dim = embedding_dim
        
        # User and item embeddings (the "P" and "Q" matrices)
        self.user_embeddings = nn.Embedding(num_users, embedding_dim)
        self.item_embeddings = nn.Embedding(num_items, embedding_dim)
        
        # Bias terms (very important for accuracy!)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)
        self.global_bias = nn.Parameter(torch.zeros(1))
        
        # Initialize with small random values
        self._init_weights()
        
    def _init_weights(self):
        """Initialize embeddings with small values to prevent exploding gradients."""
        nn.init.normal_(self.user_embeddings.weight, std=0.01)
        nn.init.normal_(self.item_embeddings.weight, std=0.01)
        nn.init.zeros_(self.user_bias.weight)
        nn.init.zeros_(self.item_bias.weight)
        
    def forward(self, user_ids, item_ids):
        """
        Predict ratings for user-item pairs.
        
        Args:
            user_ids: Tensor of user IDs (batch_size,)
            item_ids: Tensor of item IDs (batch_size,)
            
        Returns:
            Predicted ratings (batch_size,)
        """
        # Look up embeddings
        user_emb = self.user_embeddings(user_ids)  # (batch, dim)
        item_emb = self.item_embeddings(item_ids)  # (batch, dim)
        
        # Dot product: sum of element-wise multiplication
        interaction = (user_emb * item_emb).sum(dim=1)  # (batch,)
        
        # Add all the biases
        prediction = (
            self.global_bias +           # Overall average
            self.user_bias(user_ids).squeeze(-1) +  # User tendency
            self.item_bias(item_ids).squeeze(-1) +  # Item tendency
            interaction                  # User-item affinity
        )
        
        return prediction
    
    def recommend_for_user(self, user_id, top_k=10, exclude_rated=None):
        """
        Get top-K recommendations for a user.
        
        Args:
            user_id: The user to recommend for
            top_k: Number of recommendations
            exclude_rated: Set of already-rated item IDs to exclude
            
        Returns:
            Tuple of (item_ids, predicted_ratings)
        """
        self.eval()
        with torch.no_grad():
            # Predict for all items
            device = next(self.parameters()).device
            user_ids = torch.full(
                (self.num_items,), int(user_id), dtype=torch.long, device=device
            )
            item_ids = torch.arange(self.num_items, device=device)
            
            predictions = self(user_ids, item_ids)
            
            # Exclude already-rated items
            if exclude_rated is not None and len(exclude_rated) > 0:
                rated = torch.tensor(
                    [int(i) for i in exclude_rated], dtype=torch.long, device=device
                )
                predictions[rated] = float('-inf')
            
            # Get top K
            top_scores, top_items = torch.topk(predictions, top_k)
            
        return top_items.cpu().numpy(), top_scores.cpu().numpy()

# Quick test
num_users = ratings_df['user_id'].nunique()
num_items = ratings_df['item_id'].nunique()

model = MatrixFactorization(num_users, num_items, embedding_dim=64)
print(f"✅ Model created!")
print(f"   - Users: {num_users}, Items: {num_items}")
print(f"   - Embedding dimension: 64")
print(f"   - Total parameters: {sum(p.numel() for p in model.parameters()):,}")

### 🔍 Parameter Count Analysis

Let's break down where our parameters come from:
- User embeddings: 943 users × 64 dimensions = 60,352
- Item embeddings: 1,682 items × 64 dimensions = 107,648
- User biases: 943
- Item biases: 1,682
- Global bias: 1

**Total: 170,626 parameters (~170K)** - far fewer than the 1.59M cells of the full matrix, but still more than the 100,000 observed ratings, which is why regularization matters!
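
The same breakdown as a small helper, which makes it easy to see how the count scales with the embedding dimension. `mf_param_count` is a hypothetical function written for this illustration, not part of the lab's utilities:

```python
def mf_param_count(num_users: int, num_items: int, dim: int) -> int:
    """Parameter count of biased matrix factorization (hypothetical helper)."""
    embeddings = (num_users + num_items) * dim   # user + item embedding tables
    biases = num_users + num_items + 1           # per-user, per-item, and global bias
    return embeddings + biases

for dim in (16, 64, 256):
    print(f"embedding_dim={dim:3d}: {mf_param_count(943, 1682, dim):,} parameters")
```

The count grows linearly with the embedding dimension, while the number of observed ratings stays fixed.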

---

## Part 4: Training the Model

Let's train our matrix factorization model!

In [None]:
# Split data: 80% train, 20% test (by time for realism)
train_df, test_df = train_test_split_by_time(ratings_df, test_ratio=0.2)

print(f"📊 Data Split:")
print(f"   Training:   {len(train_df):,} ratings")
print(f"   Testing:    {len(test_df):,} ratings")

In [None]:
# Create PyTorch datasets and dataloaders
train_dataset = RatingsDataset(train_df)
test_dataset = RatingsDataset(test_df)

# Batch size of 1024 works well on DGX Spark
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)

print(f"✅ DataLoaders created")
print(f"   Training batches: {len(train_loader)}")
print(f"   Test batches: {len(test_loader)}")

In [None]:
def train_epoch(model, train_loader, optimizer, criterion, device):
    """Train for one epoch."""
    model.train()
    total_loss = 0
    
    for users, items, ratings in train_loader:
        users = users.to(device)
        items = items.to(device)
        ratings = ratings.to(device)
        
        # Forward pass
        predictions = model(users, items)
        loss = criterion(predictions, ratings)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item() * len(users)
    
    return total_loss / len(train_loader.dataset)


def evaluate(model, test_loader, criterion, device):
    """Evaluate on test set."""
    model.eval()
    total_loss = 0
    all_preds = []
    all_targets = []
    
    with torch.no_grad():
        for users, items, ratings in test_loader:
            users = users.to(device)
            items = items.to(device)
            ratings = ratings.to(device)
            
            predictions = model(users, items)
            loss = criterion(predictions, ratings)
            
            total_loss += loss.item() * len(users)
            all_preds.extend(predictions.cpu().numpy())
            all_targets.extend(ratings.cpu().numpy())
    
    avg_loss = total_loss / len(test_loader.dataset)
    rmse = np.sqrt(avg_loss)
    
    return rmse, np.array(all_preds), np.array(all_targets)

In [None]:
# Initialize model, optimizer, and loss
model = MatrixFactorization(
    num_users=num_users,
    num_items=num_items,
    embedding_dim=64
).to(device)

# Set global bias to the mean training rating (smart initialization!)
# fill_ writes in place, so the parameter stays on the model's device
with torch.no_grad():
    model.global_bias.fill_(float(train_df['rating'].mean()))

# Optimizer with weight decay (regularization)
optimizer = optim.Adam(
    model.parameters(), 
    lr=0.005,           # Learning rate
    weight_decay=1e-5   # L2 regularization to prevent overfitting
)

# Mean Squared Error loss
criterion = nn.MSELoss()

print(f"🎯 Target: RMSE < 0.95")
print(f"   (Reference only: Netflix's Cinematch baseline scored ~0.95 RMSE on the Netflix Prize data, a different dataset)")
print(f"\nStarting training...\n")

In [None]:
# Training loop
num_epochs = 30
train_losses = []
test_rmses = []
best_rmse = float('inf')
best_epoch = 0

for epoch in range(num_epochs):
    # Train
    train_loss = train_epoch(model, train_loader, optimizer, criterion, device)
    train_losses.append(train_loss)
    
    # Evaluate
    test_rmse, _, _ = evaluate(model, test_loader, criterion, device)
    test_rmses.append(test_rmse)
    
    # Track best (selecting on the test set is optimistic; a held-out validation split is cleaner)
    if test_rmse < best_rmse:
        best_rmse = test_rmse
        best_epoch = epoch + 1
    
    # Print progress every 5 epochs
    if (epoch + 1) % 5 == 0 or epoch == 0:
        print(f"Epoch {epoch+1:2d}/{num_epochs} | "
              f"Train Loss: {train_loss:.4f} | "
              f"Test RMSE: {test_rmse:.4f}")

print(f"\n{'='*50}")
print(f"🏆 Best RMSE: {best_rmse:.4f} (Epoch {best_epoch})")
if best_rmse < 0.95:
    print(f"🎉 Goal achieved! RMSE < 0.95")
else:
    print(f"📈 Keep tuning! Try more epochs or adjust hyperparameters.")

In [None]:
# Visualize training progress
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Training loss
axes[0].plot(train_losses, 'b-', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Training Loss (MSE)')
axes[0].set_title('Training Loss')
axes[0].grid(True, alpha=0.3)

# Test RMSE
axes[1].plot(test_rmses, 'r-', linewidth=2)
axes[1].axhline(y=0.95, color='green', linestyle='--', label='Target: 0.95')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Test RMSE')
axes[1].set_title('Test RMSE')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 🔍 What Just Happened?

We trained a collaborative filtering model that learned:
1. **User embeddings**: 64-dimensional vectors representing each user's preferences
2. **Item embeddings**: 64-dimensional vectors representing each movie's characteristics
3. **Biases**: Accounting for users who rate high/low and movies that are generally liked/disliked

The model minimizes the difference between predicted and actual ratings (MSE loss).
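
To judge whether a given RMSE is good, it helps to compare it with a trivial baseline that always predicts the training mean. A minimal sketch on toy ratings (the arrays below are invented for illustration and are not MovieLens data):

```python
import numpy as np

def rmse(predictions, targets):
    """Root mean squared error between two equally sized sequences."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))

# Toy ratings, invented for illustration
train_ratings = np.array([4, 5, 3, 4, 2, 5, 4, 3])
test_ratings = np.array([5, 3, 4, 1])

# Baseline: ignore users and movies, always predict the training mean
baseline_pred = np.full(len(test_ratings), train_ratings.mean())

print(f"Training mean:      {train_ratings.mean():.2f}")
print(f"Mean-baseline RMSE: {rmse(baseline_pred, test_ratings):.3f}")
```

Running the same baseline on `train_df` and `test_df` gives a reference point: a trained model is only useful if its test RMSE is clearly below it.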

---

## Part 5: Analyzing the Learned Embeddings

The magic of matrix factorization is in the learned embeddings. Let's visualize them!

In [None]:
# Extract item embeddings
model.eval()
with torch.no_grad():
    item_embeddings = model.item_embeddings.weight.cpu().numpy()
    user_embeddings = model.user_embeddings.weight.cpu().numpy()

print(f"📊 Embedding shapes:")
print(f"   User embeddings: {user_embeddings.shape}")
print(f"   Item embeddings: {item_embeddings.shape}")

### Key Tools for Embedding Analysis

Before we visualize our embeddings, let's understand two important tools we'll use:

**1. t-SNE (t-Distributed Stochastic Neighbor Embedding)** from `sklearn.manifold`
- Reduces high-dimensional embeddings (64D) to 2D for visualization
- Preserves local neighborhood structure: similar items stay close together
- Key parameters:
  - `n_components=2`: Output dimensions for plotting
  - `perplexity=30`: Balance between local/global structure (5-50 typical)
  - `random_state`: For reproducibility

**2. `np.linalg.norm` (Vector Norm/Magnitude)**
- Computes the length of vectors: for v = [x, y, z], norm = √(x² + y² + z²)
- Key parameters:
  - `axis=1`: Compute norm for each row (useful for matrices of vectors)
  - `keepdims=True`: Maintain shape for broadcasting during normalization
- **Use case**: Normalizing vectors to unit length enables cosine similarity via dot product

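```python
import numpy as np

# Example: normalize vectors to unit length
vectors = np.array([[3.0, 4.0], [1.0, 0.0]])            # toy (n, d) matrix
norms = np.linalg.norm(vectors, axis=1, keepdims=True)  # Shape: (n, 1)
normalized = vectors / norms                            # Each row now has length 1
query = normalized[0]                                   # the query must be unit length too
similarity = normalized @ query                         # Dot product = cosine similarity!
```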

In [None]:
# Use t-SNE to visualize movie embeddings in 2D
from sklearn.manifold import TSNE

# Only use movies with enough ratings for clearer visualization
popular_items = ratings_df.groupby('item_id').size()
popular_items = popular_items[popular_items >= 50].index.tolist()

print(f"Visualizing {len(popular_items)} movies with 50+ ratings...")

# Get embeddings for popular items
popular_embeddings = item_embeddings[popular_items]

# Run t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
embeddings_2d = tsne.fit_transform(popular_embeddings)

print("✅ t-SNE complete!")

In [None]:
# Get genre information for coloring
def get_primary_genre(item_id):
    """Get the first (primary) genre for a movie."""
    genres = movies_df[movies_df['item_id'] == item_id]['genres'].values
    if len(genres) > 0 and genres[0]:
        return genres[0].split('|')[0]
    return 'Unknown'

# Get genres for popular items
genres = [get_primary_genre(item_id) for item_id in popular_items]

# Create color mapping (sorted so the genre -> color assignment is reproducible)
unique_genres = sorted(set(genres))
genre_to_color = {g: plt.cm.tab20(i % 20) for i, g in enumerate(unique_genres)}
colors = [genre_to_color[g] for g in genres]

# Plot
plt.figure(figsize=(14, 10))
scatter = plt.scatter(
    embeddings_2d[:, 0], 
    embeddings_2d[:, 1],
    c=colors,
    alpha=0.7,
    s=50
)

# Add legend (uses exactly the same colors as the points)
handles = []
for genre in unique_genres:
    handles.append(plt.scatter([], [], color=genre_to_color[genre], label=genre, s=100))
plt.legend(handles=handles, loc='upper right', title='Genre')

plt.title('Movie Embeddings Visualization (t-SNE)\nSimilar movies cluster together!')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.tight_layout()
plt.show()

print("\n📊 Look for movies of similar genres sitting near each other.")
print("   Any such structure emerged from the ratings alone - we never told the model about genres!")

In [None]:
# Find similar movies using embedding similarity
def find_similar_movies(movie_title, top_k=5):
    """Find movies similar to the given one based on learned embeddings."""
    # Find the movie
    matches = movies_df[movies_df['title'].str.contains(movie_title, case=False, na=False)]
    
    if len(matches) == 0:
        print(f"Movie '{movie_title}' not found!")
        return
    
    movie_id = matches.iloc[0]['item_id']
    movie_name = matches.iloc[0]['title']
    
    # Get embedding
    movie_emb = item_embeddings[movie_id]
    
    # Compute similarity to all movies (dot product of normalized vectors = cosine similarity)
    norms = np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    normalized = item_embeddings / (norms + 1e-8)
    query_normalized = movie_emb / (np.linalg.norm(movie_emb) + 1e-8)
    
    similarities = normalized @ query_normalized
    
    # Get top K (mask the query movie so it can never be returned)
    similarities[movie_id] = -np.inf
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    print(f"\n🎬 Movies similar to: {movie_name}")
    print("─" * 60)
    
    for i, idx in enumerate(top_indices):
        similar_movie = movies_df[movies_df['item_id'] == idx]
        if len(similar_movie) > 0:
            title = similar_movie.iloc[0]['title']
            genre = similar_movie.iloc[0]['genres']
            sim = similarities[idx]
            print(f"  {i+1}. {title}")
            print(f"     Genres: {genre} | Similarity: {sim:.3f}")

# Try it out!
find_similar_movies("Toy Story")
find_similar_movies("Star Wars")
find_similar_movies("Pulp Fiction")

---

## Part 6: Making Recommendations

Let's use our trained model to make actual recommendations!

In [None]:
def recommend_for_user(user_id, model, ratings_df, movies_df, top_k=10):
    """
    Generate recommendations for a user.
    
    Shows what the user has rated highly, then recommends new movies.
    """
    # Get user's rated items
    user_ratings = ratings_df[ratings_df['user_id'] == user_id].copy()
    rated_items = set(user_ratings['item_id'].values)
    
    # Show user's top rated movies
    user_ratings = user_ratings.merge(movies_df, on='item_id')
    top_rated = user_ratings.nlargest(5, 'rating')
    
    print(f"\n👤 User {user_id}'s Top Rated Movies:")
    print("─" * 60)
    for _, row in top_rated.iterrows():
        print(f"  ⭐ {row['rating']:.0f}/5 - {row['title']}")
    
    # Get recommendations
    model.to(device)
    rec_items, rec_scores = model.recommend_for_user(
        user_id, 
        top_k=top_k,
        exclude_rated=rated_items
    )
    
    print(f"\n🎬 Top {top_k} Recommendations:")
    print("─" * 60)
    for i, (item_id, score) in enumerate(zip(rec_items, rec_scores)):
        movie = movies_df[movies_df['item_id'] == item_id]
        if len(movie) > 0:
            title = movie.iloc[0]['title']
            genre = movie.iloc[0]['genres']
            print(f"  {i+1:2d}. {title}")
            print(f"      Predicted: {score:.2f} stars | Genres: {genre}")

# Recommend for a few users
recommend_for_user(user_id=0, model=model, ratings_df=train_df, movies_df=movies_df)
recommend_for_user(user_id=100, model=model, ratings_df=train_df, movies_df=movies_df)

---

## ✋ Try It Yourself!

### Exercise 1: Hyperparameter Tuning

Try different embedding dimensions and see how they affect RMSE:

<details>
<summary>💡 Hint</summary>

Create models with embedding_dim = 16, 32, 64, 128, 256 and compare:
- Smaller embeddings = faster, but less expressive
- Larger embeddings = more expressive, but risk overfitting

</details>

In [None]:
# YOUR CODE HERE: Experiment with different embedding dimensions
# embedding_dims = [16, 32, 64, 128, 256]
# results = {}
# for dim in embedding_dims:
#     mf = MatrixFactorization(num_users, num_items, embedding_dim=dim).to(device)
#     ... train and evaluate mf (a new name keeps the trained `model` intact) ...
#     results[dim] = best_rmse



### Exercise 2: Regularization Impact

Try different weight_decay values (L2 regularization) and observe the train/test gap:

<details>
<summary>💡 Hint</summary>

- weight_decay=0: No regularization, might overfit
- weight_decay=1e-5: Light regularization (we used this)
- weight_decay=1e-3: Strong regularization

Compare training RMSE (the square root of the training MSE) with test RMSE: a widening gap signals overfitting!

</details>

In [None]:
# YOUR CODE HERE: Experiment with regularization



---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting to Initialize Global Bias

In [None]:
# ❌ Wrong: Random global bias initialization
# model.global_bias = nn.Parameter(torch.randn(1))  # Could be anything!

# ✅ Right: Initialize to the mean training rating (in place, so the device is preserved)
# with torch.no_grad():
#     model.global_bias.fill_(float(train_df['rating'].mean()))

print("Why this matters:")
print(f"  Mean rating in dataset: {train_df['rating'].mean():.2f}")
print(f"  Random init might start at: 0.5 or -2.3 or anything!")
print(f"  \n  Smart initialization = faster convergence, and often better results")

### Mistake 2: Not Handling Cold Start

In [None]:
# ❌ Wrong: Assume all user/item IDs are valid
# prediction = model(new_user_id, item_id)  # Crash if new_user_id >= num_users!

# ✅ Right: Check for cold start users/items
def safe_predict(model, user_id, item_id, num_users, num_items):
    if not 0 <= user_id < num_users:
        print(f"⚠️ User {user_id} is new (cold start). Using global average.")
        return model.global_bias.item()
    if not 0 <= item_id < num_items:
        print(f"⚠️ Item {item_id} is new (cold start). Using global average.")
        return model.global_bias.item()
    
    with torch.no_grad():
        return model(
            torch.LongTensor([user_id]).to(device),
            torch.LongTensor([item_id]).to(device)
        ).item()

# Test
print(f"Prediction for existing user/item: {safe_predict(model, 0, 0, num_users, num_items):.2f}")
print(f"Prediction for new user: {safe_predict(model, 99999, 0, num_users, num_items):.2f}")

### Mistake 3: Data Leakage in Evaluation

In [None]:
# ❌ Wrong: Random train/test split
# This can leak future information into training!
# train, test = sklearn.model_selection.train_test_split(ratings, test_size=0.2)

# ✅ Right: Time-based split (as we did)
# train_df, test_df = train_test_split_by_time(ratings_df, test_ratio=0.2)

print("Why time-based split matters:")
print(f"  Train set max timestamp: {train_df['timestamp'].max()}")
print(f"  Test set min timestamp:  {test_df['timestamp'].min()}")
print(f"  \n  In production, you can't use future ratings to predict!")
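
How such a split can work is sketched below. `split_by_time` is a hypothetical stand-in written for illustration; the actual `train_test_split_by_time` helper in `data_utils` may differ (for example, it could split per user rather than globally). The toy DataFrame is invented as well:

```python
import pandas as pd

def split_by_time(df: pd.DataFrame, test_ratio: float = 0.2, time_col: str = "timestamp"):
    """Hypothetical global time split: oldest rows go to train, newest to test."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    n_test = int(round(len(ordered) * test_ratio))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy ratings with shuffled timestamps, invented for illustration
toy = pd.DataFrame({
    "user_id":   [0, 1, 0, 2, 1, 2, 0, 1, 2, 0],
    "item_id":   [3, 3, 1, 0, 2, 1, 2, 0, 3, 0],
    "rating":    [4, 5, 3, 2, 4, 5, 1, 3, 4, 5],
    "timestamp": [50, 10, 90, 30, 70, 20, 100, 40, 80, 60],
})

train, test = split_by_time(toy, test_ratio=0.2)
print(f"Train: {len(train)} rows, latest timestamp   {train['timestamp'].max()}")
print(f"Test:  {len(test)} rows, earliest timestamp {test['timestamp'].min()}")
```

Every training rating is at least as old as every test rating, which is the property the two timestamps printed in the cell above are meant to confirm.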

---

## 🎉 Checkpoint

You've learned:
- ✅ What the user-item matrix is and why it's sparse
- ✅ How matrix factorization decomposes ratings into embeddings
- ✅ Implementing MF with biases in PyTorch
- ✅ Training and evaluating with RMSE
- ✅ Visualizing embeddings with t-SNE
- ✅ Generating recommendations for users

---

## 🚀 Challenge (Optional)

**Implement SVD++ (10-15 min):**

SVD++ incorporates implicit feedback: the mere fact that a user rated an item (regardless of the score) tells us something about their preferences.

$$\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{q}_i^T \left( \mathbf{p}_u + \frac{1}{\sqrt{|N(u)|}} \sum_{j \in N(u)} \mathbf{y}_j \right)$$

Where $N(u)$ is the set of items user $u$ has rated, and $\mathbf{y}_j$ are implicit factor vectors.

On rating-prediction benchmarks this often improves RMSE by a small margin (on the order of 1%) over plain biased matrix factorization.
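
As a starting point, the SVD++ formula can be evaluated for a single user-item pair in NumPy. This is only a sketch of the prediction rule with random, untrained factors (all values and the names `Q`, `Y`, and `rated_by_u` are assumptions for illustration). Turning it into a trainable PyTorch module is the actual challenge:

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, dim = 6, 4

mu, b_u, b_i = 3.5, 0.1, -0.2                      # made-up bias values
p_u = rng.normal(scale=0.1, size=dim)              # explicit user factors p_u
Q = rng.normal(scale=0.1, size=(num_items, dim))   # item factors q_i (one row per item)
Y = rng.normal(scale=0.1, size=(num_items, dim))   # implicit factors y_j (one row per item)

rated_by_u = np.array([0, 2, 5])                   # N(u): items this user has rated
target_item = 3

# Implicit part: sum of y_j over N(u), scaled by 1 / sqrt(|N(u)|)
implicit = Y[rated_by_u].sum(axis=0) / np.sqrt(len(rated_by_u))

# Full prediction: biases + q_i . (p_u + implicit part)
r_hat = mu + b_u + b_i + Q[target_item] @ (p_u + implicit)
print(f"SVD++ prediction: {r_hat:.3f}")
```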

---

## 📖 Further Reading

- [Matrix Factorization Techniques for Recommender Systems](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf) - Koren, Bell, and Volinsky (2009), written by members of the Netflix Prize-winning team
- [Factorization Machines](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf) - Rendle (2010), an extension that handles side features
- [Surprise Library](http://surpriselib.com/) - Easy recommender systems in Python

---

## 🧹 Cleanup

In [None]:
# Clear GPU memory
import gc

del model
gc.collect()  # drop Python references first so cached blocks can be released
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("✅ Memory cleared!")

---

## ➡️ Next Steps

In the next notebook, we'll build **Neural Collaborative Filtering (NeuMF)** - a deep learning approach that can learn non-linear user-item interactions!

Continue to: **02-neural-collaborative-filtering.ipynb**