# Project 07: Movie Recommendation System

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 180-240 minutes  
**Prerequisites**: Basic pandas, machine learning concepts, linear algebra fundamentals

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand and implement collaborative filtering techniques (user-based and item-based)
2. Apply matrix factorization methods (SVD, SVD++) for recommendation systems
3. Handle sparse rating matrices and compute similarity metrics
4. Evaluate recommendation systems using RMSE, MAE, and ranking metrics
5. Address the cold start problem and discuss scalability considerations
6. Build a hybrid recommendation system combining multiple approaches

## Table of Contents

1. [Setup and Data Loading](#1-setup-and-data-loading)
2. [Exploratory Data Analysis](#2-exploratory-data-analysis)
3. [Data Preparation](#3-data-preparation)
4. [User-Based Collaborative Filtering](#4-user-based-collaborative-filtering)
5. [Item-Based Collaborative Filtering](#5-item-based-collaborative-filtering)
6. [Matrix Factorization: SVD](#6-matrix-factorization-svd)
7. [Advanced: SVD++](#7-advanced-svd)
8. [Hybrid Recommendation System](#8-hybrid-recommendation-system)
9. [Model Comparison and Evaluation](#9-model-comparison-and-evaluation)
10. [Cold Start Problem](#10-cold-start-problem)
11. [Production-Ready Recommendation Function](#11-production-ready-recommendation-function)
12. [Summary and Next Steps](#12-summary-and-next-steps)

## 1. Setup and Data Loading

We'll use the MovieLens 100K dataset for this project: 100,000 ratings from 943 users on 1,682 movies. It is small enough to iterate on quickly, yet sparse enough to demonstrate the problems that real recommendation systems face.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

# Surprise library for recommendation algorithms
from surprise import Dataset, Reader
from surprise import KNNBasic, KNNWithMeans, SVD, SVDpp
from surprise.model_selection import train_test_split, cross_validate
from surprise import accuracy
from surprise.model_selection import GridSearchCV

# Scikit-learn utilities
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
warnings.filterwarnings('ignore')
np.random.seed(42)

print("Libraries imported successfully!")

In [None]:
# Load MovieLens 100K dataset
# Download from: https://grouplens.org/datasets/movielens/100k/

# Define data paths (adjust if dataset is in different location)
data_dir = Path('data/ml-100k')

# Check if dataset exists; stop early with instructions if it does not,
# instead of failing later inside pd.read_csv
if not data_dir.exists():
    raise FileNotFoundError(
        f"Dataset not found at {data_dir.absolute()}. "
        "Download MovieLens 100K from https://grouplens.org/datasets/movielens/100k/ "
        "and extract it to that location."
    )
print(f"Dataset found at {data_dir}")

# Load ratings data
# Format: user_id, item_id, rating, timestamp
ratings = pd.read_csv(
    data_dir / 'u.data',
    sep='\t',
    names=['user_id', 'movie_id', 'rating', 'timestamp'],
    encoding='latin-1'
)

# Load movie information
# Format: movie_id, title, release_date, video_release_date, IMDb_URL, genres...
movies = pd.read_csv(
    data_dir / 'u.item',
    sep='|',
    names=['movie_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL',
           'unknown', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy',
           'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
           'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'],
    encoding='latin-1'
)

# Load user information
users = pd.read_csv(
    data_dir / 'u.user',
    sep='|',
    names=['user_id', 'age', 'gender', 'occupation', 'zip_code'],
    encoding='latin-1'
)

print(f"Loaded {len(ratings):,} ratings from {ratings.user_id.nunique()} users on {ratings.movie_id.nunique()} movies")
print(f"Loaded {len(movies):,} movies")
print(f"Loaded {len(users):,} users")

## 2. Exploratory Data Analysis

Let's understand the structure and characteristics of our data before building recommendation models.

In [None]:
# Display sample data
print("Sample Ratings:")
print(ratings.head(10))
print("\nRatings Info:")
ratings.info()  # info() prints its report directly and returns None
print("\nRatings Statistics:")
print(ratings.describe())

In [None]:
# Rating distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Rating value distribution
rating_counts = ratings['rating'].value_counts().sort_index()
axes[0].bar(rating_counts.index, rating_counts.values, color='steelblue', edgecolor='black')
axes[0].set_xlabel('Rating', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Distribution of Rating Values', fontsize=14, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Ratings per user
ratings_per_user = ratings.groupby('user_id').size()
axes[1].hist(ratings_per_user, bins=50, color='coral', edgecolor='black')
axes[1].set_xlabel('Number of Ratings', fontsize=12)
axes[1].set_ylabel('Number of Users', fontsize=12)
axes[1].set_title('Distribution of Ratings per User', fontsize=14, fontweight='bold')
axes[1].axvline(ratings_per_user.mean(), color='red', linestyle='--', 
                label=f'Mean: {ratings_per_user.mean():.1f}')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)

# Ratings per movie
ratings_per_movie = ratings.groupby('movie_id').size()
axes[2].hist(ratings_per_movie, bins=50, color='lightgreen', edgecolor='black')
axes[2].set_xlabel('Number of Ratings', fontsize=12)
axes[2].set_ylabel('Number of Movies', fontsize=12)
axes[2].set_title('Distribution of Ratings per Movie', fontsize=14, fontweight='bold')
axes[2].axvline(ratings_per_movie.mean(), color='red', linestyle='--',
                label=f'Mean: {ratings_per_movie.mean():.1f}')
axes[2].legend()
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Average rating: {ratings['rating'].mean():.2f}")
print(f"Rating standard deviation: {ratings['rating'].std():.2f}")
print(f"Average ratings per user: {ratings_per_user.mean():.1f}")
print(f"Average ratings per movie: {ratings_per_movie.mean():.1f}")

In [None]:
# Sparsity analysis
n_users = ratings['user_id'].nunique()
n_movies = ratings['movie_id'].nunique()
n_ratings = len(ratings)
n_possible_ratings = n_users * n_movies
sparsity = 1 - (n_ratings / n_possible_ratings)

print("\n" + "="*60)
print("SPARSITY ANALYSIS")
print("="*60)
print(f"Number of users: {n_users:,}")
print(f"Number of movies: {n_movies:,}")
print(f"Possible user-movie combinations: {n_possible_ratings:,}")
print(f"Actual ratings: {n_ratings:,}")
print(f"Sparsity: {sparsity:.2%}")
print(f"\nThis means {sparsity:.2%} of the rating matrix is empty!")
print("This high sparsity is a key challenge for recommendation systems.")
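
The setup cell imports `csr_matrix`, and a sparse matrix is the natural container for data this empty. The sketch below shows one way to build it. The `toy_ratings` table and its values are made up for illustration; they only mirror the column names of `ratings`.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings table with the same columns as `ratings` (values are made up)
toy_ratings = pd.DataFrame({
    'user_id':  [1, 1, 2, 3, 3],
    'movie_id': [10, 20, 10, 20, 30],
    'rating':   [5, 3, 4, 2, 5],
})

# Map raw IDs to contiguous 0-based row/column positions
user_codes, user_index = pd.factorize(toy_ratings['user_id'])
movie_codes, movie_index = pd.factorize(toy_ratings['movie_id'])

# Only observed ratings are stored; missing entries take no memory
rating_matrix = csr_matrix(
    (toy_ratings['rating'].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_index), len(movie_index)),
)

n_cells = rating_matrix.shape[0] * rating_matrix.shape[1]
toy_sparsity = 1 - rating_matrix.nnz / n_cells
print(rating_matrix.toarray())
print(f"Stored entries: {rating_matrix.nnz}, sparsity: {toy_sparsity:.2%}")
```

In this representation a zero means "not rated". That is safe here only because valid ratings run from 1 to 5.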

In [None]:
# Popular movies analysis
movie_stats = ratings.groupby('movie_id').agg({
    'rating': ['count', 'mean']
}).reset_index()
movie_stats.columns = ['movie_id', 'rating_count', 'rating_mean']
movie_stats = movie_stats.merge(movies[['movie_id', 'title']], on='movie_id')

# Top 10 most rated movies
print("\nTop 10 Most Rated Movies:")
print(movie_stats.nlargest(10, 'rating_count')[['title', 'rating_count', 'rating_mean']])

# Top 10 highest rated movies (with at least 50 ratings)
print("\nTop 10 Highest Rated Movies (min 50 ratings):")
popular_movies = movie_stats[movie_stats['rating_count'] >= 50]
print(popular_movies.nlargest(10, 'rating_mean')[['title', 'rating_count', 'rating_mean']])

## 3. Data Preparation

Prepare the data for building recommendation models using the Surprise library.

In [None]:
# Create Surprise Dataset
# Define reader with rating scale (1-5 for MovieLens)
reader = Reader(rating_scale=(1, 5))

# Load data from DataFrame
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

# Split into train and test sets (80-20 split)
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

print(f"Training set size: {trainset.n_ratings:,} ratings")
print(f"Test set size: {len(testset):,} ratings")
print(f"Number of users in training: {trainset.n_users:,}")
print(f"Number of items in training: {trainset.n_items:,}")

## 4. User-Based Collaborative Filtering

User-based CF finds similar users and recommends items that similar users have liked. It uses the principle: "Users who agreed in the past tend to agree again in the future."
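
To make the idea concrete before handing it to a library, here is a minimal NumPy sketch on a made-up 3x4 rating matrix. It computes cosine similarity over co-rated items, then predicts a missing rating as a similarity-weighted average of the neighbors' ratings. The matrix `R` and the helper `cosine_on_common` are illustrative assumptions, not part of the notebook or of Surprise.

```python
import numpy as np

# Toy user-item matrix (rows = users, columns = items); 0 marks "not rated"
R = np.array([
    [5, 4, 0, 1],   # target user: item 2 is unrated
    [4, 5, 4, 1],   # tastes close to the target user
    [1, 2, 1, 5],   # tastes far from the target user
], dtype=float)

def cosine_on_common(u, v):
    """Cosine similarity computed only over items both users rated."""
    common = (u > 0) & (v > 0)
    if not common.any():
        return 0.0
    return float(u[common] @ v[common] /
                 (np.linalg.norm(u[common]) * np.linalg.norm(v[common])))

target, item = 0, 2
neighbors = [1, 2]
sims = np.array([cosine_on_common(R[target], R[other]) for other in neighbors])
neighbor_ratings = R[neighbors, item]

# Similarity-weighted average of the neighbors' ratings for the item
prediction = float(sims @ neighbor_ratings / sims.sum())
print(f"similarities: {sims.round(3)}, predicted rating: {prediction:.2f}")
```

Because all ratings are positive, cosine similarity stays fairly high even for the dissimilar user. This is one reason mean-centered variants such as `KNNWithMeans` exist.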

In [None]:
# User-based collaborative filtering using cosine similarity
# We'll use KNNBasic with user-based approach

# Configure similarity options for user-based CF
sim_options_user = {
    'name': 'cosine',        # Use cosine similarity
    'user_based': True       # Compute similarities between users
}

# Create and train the model
user_cf_model = KNNBasic(sim_options=sim_options_user, verbose=False)
user_cf_model.fit(trainset)

# Make predictions on test set
user_cf_predictions = user_cf_model.test(testset)

# Evaluate
user_cf_rmse = accuracy.rmse(user_cf_predictions, verbose=False)
user_cf_mae = accuracy.mae(user_cf_predictions, verbose=False)

print("\nUser-Based Collaborative Filtering Results:")
print(f"RMSE: {user_cf_rmse:.4f}")
print(f"MAE: {user_cf_mae:.4f}")

In [None]:
# Let's examine a specific user and their similar users
sample_user_id = 1  # First user in dataset

# Get inner id (used internally by Surprise)
inner_user_id = trainset.to_inner_uid(sample_user_id)

# Get K most similar users (K=10)
similar_users = user_cf_model.get_neighbors(inner_user_id, k=10)

# Convert back to raw user IDs
similar_user_ids = [trainset.to_raw_uid(inner_id) for inner_id in similar_users]

print(f"\nUsers most similar to User {sample_user_id}:")
print(similar_user_ids)

# Show rating patterns for original user
user_ratings = ratings[ratings['user_id'] == sample_user_id].merge(
    movies[['movie_id', 'title']], on='movie_id'
).nlargest(10, 'rating')[['title', 'rating']]

print(f"\nTop rated movies by User {sample_user_id}:")
print(user_ratings.to_string(index=False))

## 5. Item-Based Collaborative Filtering

Item-based CF finds items similar to those the user has already liked and recommends them. It is often more stable than user-based CF on sparse datasets, because each item usually has more ratings to compare than each user does.

In [None]:
# Item-based collaborative filtering using cosine similarity

# Configure similarity options for item-based CF
sim_options_item = {
    'name': 'cosine',        # Use cosine similarity
    'user_based': False      # Compute similarities between items
}

# Create and train the model
item_cf_model = KNNBasic(sim_options=sim_options_item, verbose=False)
item_cf_model.fit(trainset)

# Make predictions on test set
item_cf_predictions = item_cf_model.test(testset)

# Evaluate
item_cf_rmse = accuracy.rmse(item_cf_predictions, verbose=False)
item_cf_mae = accuracy.mae(item_cf_predictions, verbose=False)

print("\nItem-Based Collaborative Filtering Results:")
print(f"RMSE: {item_cf_rmse:.4f}")
print(f"MAE: {item_cf_mae:.4f}")

In [None]:
# Find similar movies to a popular movie
sample_movie_title = 'Star Wars (1977)'
sample_movie_id = movies[movies['title'] == sample_movie_title]['movie_id'].values[0]

# Get inner id
inner_movie_id = trainset.to_inner_iid(sample_movie_id)

# Get K most similar movies
similar_movies = item_cf_model.get_neighbors(inner_movie_id, k=10)

# Convert back to raw movie IDs and get titles, keeping the order returned by get_neighbors
similar_movie_ids = [trainset.to_raw_iid(inner_id) for inner_id in similar_movies]
similar_movie_titles = (
    movies.set_index('movie_id')
    .loc[similar_movie_ids, ['title']]
    .reset_index()
)

print(f"\nMovies most similar to '{sample_movie_title}':")
print(similar_movie_titles.to_string(index=False))
print("\nCheck whether these neighbors look plausible: with plain cosine similarity, "
      "movies that share only a few raters can rank surprisingly high.")

## 6. Matrix Factorization: SVD

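SVD-style matrix factorization represents the user-item rating matrix with lower-dimensional latent factors. Surprise's `SVD` follows the approach popularized by Simon Funk during the Netflix Prize: rather than computing a classical Singular Value Decomposition of the full matrix, it learns user and item factors (plus bias terms) by stochastic gradient descent on the observed ratings only. This approach: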
- Handles sparsity better than memory-based methods
- Learns latent features representing user preferences and item characteristics
- Scales better to large datasets
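
The sketch below illustrates the latent-factor idea with a biased matrix factorization trained by stochastic gradient descent on a handful of made-up ratings. It is a teaching sketch under stated assumptions (toy data, hand-picked learning rate and regularization), not Surprise's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (user, item, rating) triples; indices are already 0-based
triples = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
n_users, n_items, n_factors = 3, 3, 2
lr, reg, n_epochs = 0.01, 0.02, 200

mu = np.mean([r for _, _, r in triples])         # global mean rating
b_u, b_i = np.zeros(n_users), np.zeros(n_items)  # user and item biases
P = rng.normal(0, 0.1, (n_users, n_factors))     # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))     # item latent factors

def predict(u, i):
    return mu + b_u[u] + b_i[i] + Q[i] @ P[u]

def train_rmse():
    return np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in triples]))

rmse_before = train_rmse()
for _ in range(n_epochs):
    for u, i, r in triples:
        err = r - predict(u, i)
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        # Update both factor vectors from their pre-update values
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))
rmse_after = train_rmse()
print(f"training RMSE: {rmse_before:.3f} -> {rmse_after:.3f}")
```

Only the six observed ratings ever enter the loop, which is why this family of methods copes with a matrix that is mostly empty.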

In [None]:
# Train SVD model with default parameters
svd_model = SVD(random_state=42, verbose=False)
svd_model.fit(trainset)

# Make predictions
svd_predictions = svd_model.test(testset)

# Evaluate
svd_rmse = accuracy.rmse(svd_predictions, verbose=False)
svd_mae = accuracy.mae(svd_predictions, verbose=False)

print("\nSVD Model Results:")
print(f"RMSE: {svd_rmse:.4f}")
print(f"MAE: {svd_mae:.4f}")

In [None]:
# Hyperparameter tuning for SVD using GridSearchCV
print("Performing hyperparameter tuning for SVD...")
print("This may take a few minutes...\n")

param_grid = {
    'n_factors': [50, 100, 150],      # Number of latent factors
    'n_epochs': [20, 30],              # Number of training epochs
    'lr_all': [0.002, 0.005],          # Learning rate
    'reg_all': [0.02, 0.1]             # Regularization term
}

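# Perform 3-fold cross-validation
# Note: `data` also contains the rows held out in `testset`, so the tuned model's
# test score below is slightly optimistic. For a clean estimate, run the grid
# search on the training portion only.
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
gs.fit(data)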

# Best RMSE score
print(f"Best RMSE: {gs.best_score['rmse']:.4f}")
print(f"Best MAE: {gs.best_score['mae']:.4f}")
print(f"\nBest parameters:")
print(gs.best_params['rmse'])

# Train model with best parameters
best_svd = gs.best_estimator['rmse']
best_svd.fit(trainset)

# Evaluate on test set
best_svd_predictions = best_svd.test(testset)
best_svd_rmse = accuracy.rmse(best_svd_predictions, verbose=False)
best_svd_mae = accuracy.mae(best_svd_predictions, verbose=False)

print(f"\nTuned SVD Test Set Performance:")
print(f"RMSE: {best_svd_rmse:.4f}")
print(f"MAE: {best_svd_mae:.4f}")

## 7. Advanced: SVD++

SVD++ extends SVD by incorporating implicit feedback. It considers not just explicit ratings but also the fact that a user rated an item (regardless of the rating value).
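
The prediction rule is commonly written as follows (this is the form given in the Surprise documentation for `SVDpp`). Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are the explicit latent factors, $I_u$ is the set of items user $u$ has rated, and each $y_j$ is an extra item factor that captures the implicit signal "user $u$ chose to rate item $j$":

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top}\left(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\right)
```

Setting every $y_j$ to zero recovers the plain biased factorization used by `SVD`. The sum over $I_u$ is what makes each training step more expensive.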

In [None]:
# Train SVD++ model
# Note: SVD++ is slower than SVD but typically more accurate
print("Training SVD++ model...")
print("This may take several minutes...\n")

svdpp_model = SVDpp(random_state=42, verbose=False)
svdpp_model.fit(trainset)

# Make predictions
svdpp_predictions = svdpp_model.test(testset)

# Evaluate
svdpp_rmse = accuracy.rmse(svdpp_predictions, verbose=False)
svdpp_mae = accuracy.mae(svdpp_predictions, verbose=False)

print("\nSVD++ Model Results:")
print(f"RMSE: {svdpp_rmse:.4f}")
print(f"MAE: {svdpp_mae:.4f}")
print(f"\nImprovement over basic SVD:")
print(f"RMSE reduction: {((svd_rmse - svdpp_rmse) / svd_rmse * 100):.2f}%")

## 8. Hybrid Recommendation System

Combine multiple models to leverage their individual strengths. We'll create a weighted average of predictions from different models.
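
As a small worked example of the weighting scheme, the sketch below turns two made-up RMSE values into normalized inverse-RMSE weights and blends two made-up predictions. The model names and numbers are hypothetical.

```python
# Hypothetical validation RMSEs for two models (numbers are made up)
toy_rmses = {'model_a': 1.00, 'model_b': 0.80}

# Inverse-RMSE weights, normalized so they sum to 1
toy_inverse = {name: 1 / value for name, value in toy_rmses.items()}
toy_total = sum(toy_inverse.values())
toy_weights = {name: value / toy_total for name, value in toy_inverse.items()}

# Blend two made-up predictions for the same user-item pair
toy_preds = {'model_a': 3.0, 'model_b': 4.0}
toy_blended = sum(toy_weights[name] * toy_preds[name] for name in toy_preds)
print(toy_weights, round(toy_blended, 3))
```

When the models' RMSEs are close, these weights end up close to uniform, so the blend behaves much like a simple average.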

In [None]:
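# Create hybrid predictions by weighted averaging
# Weights based on individual model performance (lower RMSE = higher weight)
# Note: these RMSEs were measured on the test set, so the hybrid score below is
# optimistic; in practice, set the weights using a separate validation split.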

# Calculate weights (inverse of RMSE, normalized)
model_rmses = {
    'user_cf': user_cf_rmse,
    'item_cf': item_cf_rmse,
    'svd': best_svd_rmse,
    'svdpp': svdpp_rmse
}

# Inverse RMSE as weights
inverse_rmses = {k: 1/v for k, v in model_rmses.items()}
total_inverse = sum(inverse_rmses.values())
weights = {k: v/total_inverse for k, v in inverse_rmses.items()}

print("Hybrid Model Weights:")
for model, weight in weights.items():
    print(f"{model}: {weight:.3f}")

# Create hybrid predictions
hybrid_predictions = []

# Get all predictions in dictionary format for easy lookup
user_cf_dict = {(pred.uid, pred.iid): pred.est for pred in user_cf_predictions}
item_cf_dict = {(pred.uid, pred.iid): pred.est for pred in item_cf_predictions}
svd_dict = {(pred.uid, pred.iid): pred.est for pred in best_svd_predictions}
svdpp_dict = {(pred.uid, pred.iid): pred.est for pred in svdpp_predictions}

# Compute weighted average for each test instance
for test_rating in testset:
    uid, iid, true_rating = test_rating
    
    # Get prediction from each model, falling back to the training-set mean rating
    fallback = trainset.global_mean
    user_cf_pred = user_cf_dict.get((uid, iid), fallback)
    item_cf_pred = item_cf_dict.get((uid, iid), fallback)
    svd_pred = svd_dict.get((uid, iid), fallback)
    svdpp_pred = svdpp_dict.get((uid, iid), fallback)
    
    # Weighted average
    hybrid_pred = (
        weights['user_cf'] * user_cf_pred +
        weights['item_cf'] * item_cf_pred +
        weights['svd'] * svd_pred +
        weights['svdpp'] * svdpp_pred
    )
    
    # Clip to valid rating range
    hybrid_pred = np.clip(hybrid_pred, 1, 5)
    
    hybrid_predictions.append((uid, iid, true_rating, hybrid_pred))

# Calculate hybrid model performance
true_ratings = [pred[2] for pred in hybrid_predictions]
predicted_ratings = [pred[3] for pred in hybrid_predictions]

hybrid_rmse = np.sqrt(mean_squared_error(true_ratings, predicted_ratings))
hybrid_mae = mean_absolute_error(true_ratings, predicted_ratings)

print(f"\nHybrid Model Results:")
print(f"RMSE: {hybrid_rmse:.4f}")
print(f"MAE: {hybrid_mae:.4f}")

## 9. Model Comparison and Evaluation

Let's compare all models side by side and visualize their performance.
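
As a reminder of what the two metrics measure, here is a hand computation on four made-up rating pairs. The arrays are illustrative only.

```python
import numpy as np

toy_actual = np.array([4.0, 3.0, 5.0, 2.0])     # made-up true ratings
toy_predicted = np.array([3.5, 3.0, 4.0, 4.0])  # made-up model estimates

toy_errors = toy_predicted - toy_actual
toy_rmse = np.sqrt(np.mean(toy_errors ** 2))  # squaring lets large misses dominate
toy_mae = np.mean(np.abs(toy_errors))         # every unit of error counts equally
print(f"RMSE: {toy_rmse:.4f}, MAE: {toy_mae:.4f}")
```

RMSE is never smaller than MAE on the same predictions. A wide gap between the two suggests that a few large errors are driving the score.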

In [None]:
# Compile results
results = pd.DataFrame({
    'Model': ['User-Based CF', 'Item-Based CF', 'SVD', 'SVD (Tuned)', 'SVD++', 'Hybrid'],
    'RMSE': [user_cf_rmse, item_cf_rmse, svd_rmse, best_svd_rmse, svdpp_rmse, hybrid_rmse],
    'MAE': [user_cf_mae, item_cf_mae, svd_mae, best_svd_mae, svdpp_mae, hybrid_mae]
})

# Sort by RMSE
results = results.sort_values('RMSE')

print("\n" + "="*70)
print("MODEL COMPARISON")
print("="*70)
print(results.to_string(index=False))
print("\nLower RMSE and MAE indicate better performance.")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# RMSE comparison
axes[0].barh(results['Model'], results['RMSE'], color='steelblue', edgecolor='black')
axes[0].set_xlabel('RMSE', fontsize=12)
axes[0].set_title('Model Comparison: RMSE', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(axis='x', alpha=0.3)

# Add value labels
for i, (model, rmse) in enumerate(zip(results['Model'], results['RMSE'])):
    axes[0].text(rmse + 0.01, i, f'{rmse:.4f}', va='center', fontsize=10)

# MAE comparison
axes[1].barh(results['Model'], results['MAE'], color='coral', edgecolor='black')
axes[1].set_xlabel('MAE', fontsize=12)
axes[1].set_title('Model Comparison: MAE', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(axis='x', alpha=0.3)

# Add value labels
for i, (model, mae) in enumerate(zip(results['Model'], results['MAE'])):
    axes[1].text(mae + 0.01, i, f'{mae:.4f}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# Prediction error analysis for SVD++ (check the comparison table to confirm it is the best model on your run)
errors = [pred.est - pred.r_ui for pred in svdpp_predictions]

plt.figure(figsize=(12, 5))

# Error distribution
plt.subplot(1, 2, 1)
plt.hist(errors, bins=50, color='skyblue', edgecolor='black')
plt.xlabel('Prediction Error', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Distribution of Prediction Errors (SVD++)', fontsize=14, fontweight='bold')
plt.axvline(0, color='red', linestyle='--', label='Perfect Prediction')
plt.legend()
plt.grid(axis='y', alpha=0.3)

# Actual vs Predicted scatter
plt.subplot(1, 2, 2)
actual = [pred.r_ui for pred in svdpp_predictions]
predicted = [pred.est for pred in svdpp_predictions]
plt.scatter(actual, predicted, alpha=0.3, s=10)
plt.plot([1, 5], [1, 5], 'r--', label='Perfect Prediction')
plt.xlabel('Actual Rating', fontsize=12)
plt.ylabel('Predicted Rating', fontsize=12)
plt.title('Actual vs Predicted Ratings (SVD++)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Mean error: {np.mean(errors):.4f}")
print(f"Median error: {np.median(errors):.4f}")
print(f"Error std: {np.std(errors):.4f}")

## 10. Cold Start Problem

The cold start problem occurs when:
1. **New User**: User has no rating history
2. **New Item**: Item has no ratings from any user

Let's discuss strategies to handle this challenge.

In [None]:
# Analyze cold start scenarios in our dataset

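# Users with very few ratings
# Note: MovieLens 100K only includes users with at least 20 ratings,
# so this count is expected to be 0 on the unmodified dataset.
user_rating_counts = ratings.groupby('user_id').size()
cold_start_users = user_rating_counts[user_rating_counts <= 5]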

# Movies with very few ratings
movie_rating_counts = ratings.groupby('movie_id').size()
cold_start_movies = movie_rating_counts[movie_rating_counts <= 5]

print("COLD START ANALYSIS")
print("="*60)
print(f"Users with ≤5 ratings: {len(cold_start_users)} ({len(cold_start_users)/len(user_rating_counts)*100:.1f}%)")
print(f"Movies with ≤5 ratings: {len(cold_start_movies)} ({len(cold_start_movies)/len(movie_rating_counts)*100:.1f}%)")

print("\n" + "="*60)
print("STRATEGIES TO HANDLE COLD START")
print("="*60)

strategies = """
1. NEW USER COLD START:
   - Ask new users to rate a few popular movies during onboarding
   - Use demographic information (age, gender, location) for initial recommendations
   - Show trending/popular items globally
   - Apply content-based filtering using item metadata

2. NEW ITEM COLD START:
   - Use item metadata (genre, actors, director) for content-based recommendations
   - Show to diverse set of users to gather initial ratings quickly
   - Recommend based on similar items in the catalog
   - Use "early adopter" user segments who rate new items frequently

3. HYBRID APPROACHES:
   - Combine collaborative filtering with content-based filtering
   - Use knowledge-based recommendations for completely new users
   - Implement "exploration" strategies (Thompson Sampling, UCB)
   - Gradually transition from content-based to collaborative as data accumulates
"""

print(strategies)
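
One of the strategies above, content-based recommendation from item metadata, can be sketched with nothing more than genre flags. The `toy_movies` table, its titles, and the `genre_neighbors` helper below are made up for illustration. The sketch assumes every title has at least one genre flag set, since an all-zero row would divide by zero.

```python
import numpy as np
import pandas as pd

# Made-up binary genre flags in the same style as the `movies` genre columns
toy_movies = pd.DataFrame(
    {'Action': [1, 1, 0, 0], 'Sci-Fi': [1, 1, 0, 0],
     'Comedy': [0, 0, 1, 1], 'Romance': [0, 1, 1, 0]},
    index=['New Release', 'Space Battle', 'Date Night', 'Stand-Up Special'],
)

def genre_neighbors(title, genre_table, k=2):
    """Rank the other titles by cosine similarity of their genre vectors."""
    vectors = genre_table.to_numpy(dtype=float)
    norms = np.linalg.norm(vectors, axis=1)
    target = genre_table.index.get_loc(title)
    sims = vectors @ vectors[target] / (norms * norms[target])
    ranked = pd.Series(sims, index=genre_table.index).drop(title)
    return ranked.sort_values(ascending=False).head(k)

toy_neighbors = genre_neighbors('New Release', toy_movies)
print(toy_neighbors)
```

A brand-new movie with no ratings can be placed next to its genre neighbors this way, then handed over to collaborative filtering once ratings accumulate.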

In [None]:
# Demonstrate fallback strategy: Popular items recommendation

def get_popular_recommendations(n=10, min_ratings=50):
    """
    Get popular movie recommendations (fallback for cold start)
    
    Parameters:
    - n: Number of recommendations
    - min_ratings: Minimum number of ratings required
    
    Returns:
    - DataFrame with top N popular movies
    """
    popular = movie_stats[movie_stats['rating_count'] >= min_ratings].copy()
    popular['score'] = popular['rating_mean'] * np.log1p(popular['rating_count'])
    popular = popular.nlargest(n, 'score')
    
    return popular[['title', 'rating_count', 'rating_mean', 'score']]

# Get top 10 popular movies for new users
print("Popular Recommendations (for new users):")
print(get_popular_recommendations(n=10))
print("\nScore = mean_rating * log(1 + rating_count)")
print("This balances rating quality with popularity.")

## 11. Production-Ready Recommendation Function

Create a complete function that can generate recommendations for any user, handling edge cases and cold start scenarios.

In [None]:
def get_movie_recommendations(user_id, n=10, model='svdpp', min_expected_rating=3.5):
    """
    Get top-N movie recommendations for a user
    
    Parameters:
    - user_id: ID of the user (int)
    - n: Number of recommendations to return (int)
    - model: Model to use ('user_cf', 'item_cf', 'svd', 'svdpp'); other values fall back to 'svdpp'
    - min_expected_rating: Minimum predicted rating threshold (float)
    
    Returns:
    - DataFrame with titles, predicted ratings, and genres; for unknown users,
      the popular-movies table from get_popular_recommendations instead
    """
    # Check if user exists
    if user_id not in ratings['user_id'].values:
        print(f"User {user_id} not found. Showing popular recommendations.")
        return get_popular_recommendations(n=n)
    
    # Get movies the user has already rated
    user_rated_movies = set(ratings[ratings['user_id'] == user_id]['movie_id'])
    
    # Get all movies
    all_movies = set(ratings['movie_id'].unique())
    
    # Movies to predict ratings for (not yet rated by user)
    movies_to_predict = all_movies - user_rated_movies
    
    # Select model
    model_map = {
        'user_cf': user_cf_model,
        'item_cf': item_cf_model,
        'svd': best_svd,
        'svdpp': svdpp_model
    }
    
    if model not in model_map:
        print(f"Model '{model}' not found. Using 'svdpp'.")
        model = 'svdpp'
    
    selected_model = model_map[model]
    
    # Generate predictions for all unrated movies
    predictions = []
    for movie_id in movies_to_predict:
        pred = selected_model.predict(user_id, movie_id)
        predictions.append((movie_id, pred.est))
    
    # Convert to DataFrame
    pred_df = pd.DataFrame(predictions, columns=['movie_id', 'predicted_rating'])
    
    # Filter by minimum expected rating
    pred_df = pred_df[pred_df['predicted_rating'] >= min_expected_rating]
    
    # Sort by predicted rating
    pred_df = pred_df.sort_values('predicted_rating', ascending=False)
    
    # Get top N
    top_n = pred_df.head(n)
    
    # Genre indicator columns (the 'unknown' flag is left out)
    genre_cols = ['Action', 'Adventure', 'Animation', 'Children', 'Comedy',
                  'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 
                  'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 
                  'Thriller', 'War', 'Western']
    
    # Merge with movie titles and genre flags in a single step
    recommendations = top_n.merge(
        movies[['movie_id', 'title'] + genre_cols],
        on='movie_id'
    )
    
    # Create genre string
    def get_genres(row):
        genres = [genre for genre in genre_cols if row[genre] == 1]
        return ', '.join(genres) if genres else 'Unknown'
    
    recommendations['genres'] = recommendations.apply(get_genres, axis=1)
    
    # Select final columns
    recommendations = recommendations[['title', 'predicted_rating', 'genres']]
    recommendations = recommendations.reset_index(drop=True)
    
    return recommendations

print("Recommendation function created successfully!")

In [None]:
# Test the recommendation function
test_user_id = 150

print(f"\n{'='*80}")
print(f"RECOMMENDATIONS FOR USER {test_user_id}")
print(f"{'='*80}\n")

# Show what user has already rated highly
user_history = ratings[ratings['user_id'] == test_user_id].merge(
    movies[['movie_id', 'title']], on='movie_id'
).nlargest(5, 'rating')[['title', 'rating']]

print("User's Top Rated Movies:")
print(user_history.to_string(index=False))

# Generate recommendations
print(f"\n{'='*80}")
print("Recommended Movies (using SVD++ model):")
print(f"{'='*80}\n")

recommendations = get_movie_recommendations(
    user_id=test_user_id,
    n=10,
    model='svdpp',
    min_expected_rating=4.0
)

print(recommendations.to_string(index=False))

In [None]:
# Compare recommendations across different models
test_user = 50

print(f"\nComparing Recommendations Across Models for User {test_user}")
print("="*80)

for model_name in ['user_cf', 'item_cf', 'svd', 'svdpp']:
    print(f"\n{model_name.upper()} Model:")
    recs = get_movie_recommendations(test_user, n=5, model=model_name, min_expected_rating=4.0)
    print(recs[['title', 'predicted_rating']].to_string(index=False))

## 12. Summary and Next Steps

### Key Concepts Learned

1. **Collaborative Filtering Approaches**:
   - User-based: Find similar users, recommend what they liked
   - Item-based: Find similar items, recommend similar items (more stable)

2. **Matrix Factorization**:
   - SVD learns latent factors for users and items
   - Better handles sparsity and scales to larger datasets
   - SVD++ incorporates implicit feedback for improved accuracy

3. **Evaluation Metrics**:
   - RMSE/MAE measure rating prediction accuracy
   - Lower values indicate better performance
   - Target for the best model: RMSE < 0.90 (the final cell checks whether your run achieved it)

4. **Challenges**:
   - **Sparsity**: 93%+ of rating matrix is empty
   - **Cold Start**: Difficult to recommend for new users/items
   - **Scalability**: Memory-based methods struggle with large datasets

### Model Performance Summary

- **User-Based CF**: Simple but effective, struggles with sparsity
- **Item-Based CF**: More stable than user-based, better for sparse data
- **SVD**: Typically a good balance of accuracy and speed
- **SVD++**: Often the most accurate single model, but slower to train
- **Hybrid**: Combines strengths of multiple models

### Next Steps and Advanced Topics

1. **Content-Based Filtering**:
   - Use movie metadata (genres, actors, directors)
   - Combine with collaborative filtering for hybrid approach
   - Better handles cold start for new items

2. **Deep Learning Approaches**:
   - Neural Collaborative Filtering (NCF)
   - Autoencoders for recommendation
   - Recurrent networks for sequential recommendations

3. **Context-Aware Recommendations**:
   - Incorporate temporal dynamics (time of day, season)
   - Consider user context (location, device, mood)
   - Session-based recommendations

4. **Production Deployment**:
   - Build REST API with Flask/FastAPI
   - Implement caching for faster response times
   - A/B testing framework for model comparison
   - Real-time model updates as new ratings arrive

5. **Advanced Evaluation**:
   - Ranking metrics: Precision@K, Recall@K, NDCG
   - Diversity and serendipity of recommendations
   - User satisfaction and engagement metrics
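
As a starting point for the ranking metrics listed above, here is a minimal sketch of per-user Precision@K and Recall@K, modeled on the approach shown in the Surprise FAQ. It expects plain `(uid, iid, true_rating, estimate)` tuples, which is the shape of `hybrid_predictions` in this notebook. The relevance threshold of 3.5 and the toy data are assumptions.

```python
from collections import defaultdict

def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Per-user precision@k and recall@k from (uid, iid, true_r, est) tuples."""
    by_user = defaultdict(list)
    for uid, _, true_r, est in predictions:
        by_user[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, user_ratings in by_user.items():
        # Rank this user's items by estimated rating, best first
        user_ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = user_ratings[:k]
        n_relevant = sum(true_r >= threshold for _, true_r in user_ratings)
        n_recommended = sum(est >= threshold for est, _ in top_k)
        n_hits = sum(est >= threshold and true_r >= threshold
                     for est, true_r in top_k)
        precisions[uid] = n_hits / n_recommended if n_recommended else 0.0
        recalls[uid] = n_hits / n_relevant if n_relevant else 0.0
    return precisions, recalls

# Made-up predictions for one user: (uid, iid, true rating, estimate)
toy_predictions = [
    ('u1', 'a', 5.0, 4.8), ('u1', 'b', 2.0, 4.1),
    ('u1', 'c', 4.0, 3.9), ('u1', 'd', 4.0, 2.5),
]
toy_precisions, toy_recalls = precision_recall_at_k(toy_predictions, k=3)
print(toy_precisions, toy_recalls)
```

Surprise `Prediction` objects carry a fifth `details` field, so convert them first, for example with `[(p.uid, p.iid, p.r_ui, p.est) for p in svdpp_predictions]`. Averaging the per-user values gives a single score to compare across models.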

### Practical Applications

The techniques learned here apply to many domains:
- E-commerce product recommendations
- Music and video streaming services
- News article recommendations
- Social media content curation
- Job or connection recommendations

### Additional Resources

- **Surprise Documentation**: http://surpriselib.com/
- **RecSys Conference**: Premier conference for recommender systems research
- **Book**: "Recommender Systems Handbook" by Ricci et al.
- **Course**: Andrew Ng's Machine Learning course (collaborative filtering module)

**Congratulations!** You've built a complete movie recommendation system with multiple approaches and production-ready functionality.

In [None]:
# Final summary statistics
print("\n" + "="*80)
print("PROJECT COMPLETION SUMMARY")
print("="*80)
print(f"\nDataset: MovieLens 100K")
print(f"Total Ratings: {len(ratings):,}")
print(f"Users: {ratings.user_id.nunique():,}")
print(f"Movies: {ratings.movie_id.nunique():,}")
print(f"Sparsity: {sparsity:.2%}")
print(f"\nBest Model: {results.iloc[0]['Model']}")
print(f"Best RMSE: {results.iloc[0]['RMSE']:.4f}")
print(f"Best MAE: {results.iloc[0]['MAE']:.4f}")
print(f"\nTarget Achievement: {'✓ ACHIEVED' if results.iloc[0]['RMSE'] < 0.90 else '✗ NOT ACHIEVED'}")
print("\n" + "="*80)