# **Kenza Bouqdir - Assignment 4 - Movie Recommender System using SVD (with User-Mean Centering)**

**Overview**

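This project implements a movie recommendation system using truncated Singular Value Decomposition (SVD) on user-mean-centered ratings. It is built on the MovieLens 25M dataset (the ml-25m files), reduces memory use by downcasting column dtypes, and selects the number of latent components by comparing RMSE, MAE, Precision@K, and explained variance.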

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt
import seaborn as sns
import gc  # Garbage collector for memory management
from scipy.sparse import csr_matrix
import time
from collections import defaultdict
import warnings
warnings.filterwarnings("ignore")

**Table of Contents**

    1- Data Loading and Preparation
    2- Data Exploration
    3- Train-Test Split
    4- Mean-Centering
    5- Sparse Matrix Creation
    6- SVD Application
    7- Compare Results for Different n_components
    8- Model Selection
    9- Rebuild Final SVD Model
    10- Recommendation Utilities
    11- Sample Recommendations
    12- Similar Movies
    13- Latent Space Visualization

## **1. Data Loading and Preparation**

The code begins by loading the MovieLens dataset with memory optimization techniques. It includes:

- Memory usage reduction by optimizing data types
- Fallback to sampling if memory issues arise
- Filtering out users with very few ratings to improve model quality
- Filtering out movies with very few ratings
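
The two filtering steps in the list above can be illustrated on a toy frame. This is a minimal sketch: the helper name `filter_min_count` and the toy data are assumptions made for illustration, and only the column names follow the MovieLens `ratings.csv` layout.

```python
import pandas as pd

# Toy ratings frame (hypothetical data).
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 3, 3],
    "movieId": [10, 20, 30, 10, 10, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

def filter_min_count(df, column, min_count):
    """Keep rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].map(df[column].value_counts())
    return df[counts >= min_count]

# Users first, then movies, mirroring the order used in this notebook.
filtered = filter_min_count(toy, "userId", 2)        # drops user 2 (one rating)
filtered = filter_min_count(filtered, "movieId", 2)  # drops movie 30 (one rating)
print(filtered)
```

Note that a single pass in this order does not guarantee both thresholds hold at the end: removing rare movies can push a user back below the user threshold.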

In [2]:
# Set random seed for reproducibility
np.random.seed(42)

print("=" * 80)
print("MOVIE RECOMMENDER SYSTEM USING SVD - CSC5356 DATA ENGINEERING (with USER-MEAN CENTERING)")
print("=" * 80)

MOVIE RECOMMENDER SYSTEM USING SVD - CSC5356 DATA ENGINEERING (with USER-MEAN CENTERING)


In [3]:
# Start timer to measure performance
start_time = time.time()

In [4]:
# 1. Load and prepare the MovieLens dataset with optimization
print("\n1. LOADING AND PREPARING DATASET...")


1. LOADING AND PREPARING DATASET...


In [5]:
# Define a function to reduce memory usage
def reduce_mem_usage(df):
    """ Reduce memory usage of a dataframe by setting data types. """
    start_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory usage of dataframe is {start_mem:.2f} MB')
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    
    end_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory usage after optimization is: {end_mem:.2f} MB')
    print(f'Decreased by {100 * (start_mem - end_mem) / start_mem:.1f}%')
    
    return df

In [6]:
# Function to create a progress bar
def progress_bar(current, total, bar_length=50):
    fraction = current / total
    arrow = int(fraction * bar_length) * '█'
    padding = (bar_length - len(arrow)) * '░'
    print(f'\r[{arrow}{padding}] {int(fraction * 100)}%', end='')

In [7]:
# Try to load the full dataset, but fall back to sampling if memory issues arise
try:
    print("Attempting to load full dataset...")
    ratings = pd.read_csv('ml-25m/ml-25m/ratings.csv')
    use_sample = False
except (MemoryError, FileNotFoundError) as e:
    print(f"Error: {e}")
    print("Falling back to sampled dataset due to memory constraints or file not found")
    try:
        ratings = pd.read_csv('ratings.csv')
    except FileNotFoundError:
        ratings = pd.read_csv('ml-25m/ml-25m/ratings.csv')
    use_sample = True
    sample_size = 1000000  # Increased sample size for better representation

if use_sample:
    print(f"Using a sample of {sample_size} ratings due to memory constraints")
    # Not truly stratified: keep users with at least 20 ratings, then draw a uniform random sample of rows
    user_counts = ratings['userId'].value_counts()
    users_to_include = user_counts[user_counts >= 20].index.tolist()
    
    if len(users_to_include) > 0:
        sampled_ratings = ratings[ratings['userId'].isin(users_to_include)]
        if len(sampled_ratings) > sample_size:
            ratings = sampled_ratings.sample(n=sample_size, random_state=42)
        else:
            ratings = ratings.sample(n=min(sample_size, len(ratings)), random_state=42)
    else:
        ratings = ratings.sample(n=min(sample_size, len(ratings)), random_state=42)

ratings = reduce_mem_usage(ratings)

Attempting to load full dataset...
Memory usage of dataframe is 762.94 MB
Memory usage after optimization is: 381.47 MB
Decreased by 50.0%


In [8]:
# Load movies data with proper error handling
try:
    movies = pd.read_csv('ml-25m/ml-25m/movies.csv')
except FileNotFoundError:
    try:
        movies = pd.read_csv('movies.csv')
    except FileNotFoundError:
        movies = pd.read_csv('ml-25m/ml-25m/movies.csv')

movies = reduce_mem_usage(movies)

Memory usage of dataframe is 1.43 MB
Memory usage after optimization is: 1.19 MB
Decreased by 16.7%


In [9]:
# Add a timestamp conversion for analysis
ratings['date'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['year'] = ratings['date'].dt.year
ratings['month'] = ratings['date'].dt.month

# Filter out users with very few ratings to improve model quality
min_ratings_per_user = 5
user_counts = ratings['userId'].value_counts()
valid_users = user_counts[user_counts >= min_ratings_per_user].index
ratings = ratings[ratings['userId'].isin(valid_users)]

# Filter out movies with very few ratings
min_ratings_per_movie = 3
movie_counts = ratings['movieId'].value_counts()
valid_movies = movie_counts[movie_counts >= min_ratings_per_movie].index
ratings = ratings[ratings['movieId'].isin(valid_movies)]

print(f"\nWorking with {len(ratings)} ratings from {len(ratings['userId'].unique())} users")
print(f"Dataset spans {len(ratings['movieId'].unique())} movies")


Working with 24974531 ratings from 162541 users
Dataset spans 41116 movies


## **2. Data Exploration**

This section analyzes the dataset structure and characteristics:

- Summary statistics for ratings
- Distribution of movie ratings visualization
- Analysis of movie genres
- Rating trends over time
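
As a small illustration of the genre analysis, the pipe-separated `genres` strings can be counted in a single pass with `collections.Counter`. The toy movie titles below are hypothetical; only the `genres` column format matches the dataset.

```python
from collections import Counter

import pandas as pd

# Toy movies frame (hypothetical titles) using the same "A|B|C" genre format.
toy_movies = pd.DataFrame({
    "title":  ["A (1999)", "B (2001)", "C (2010)"],
    "genres": ["Action|Sci-Fi", "Comedy", "Action|Comedy"],
})

# Split each cell on "|" and count every genre once per movie.
genre_counts = Counter(
    genre for cell in toy_movies["genres"] for genre in cell.split("|")
)
print(genre_counts.most_common(2))
```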

In [10]:
print("\n2. EXPLORING DATA STRUCTURE...")
# Generate summary statistics
print("Summary statistics for ratings:")
rating_stats = ratings['rating'].describe()
print(rating_stats)



2. EXPLORING DATA STRUCTURE...
Summary statistics for ratings:
count    2.497453e+07
mean     3.534428e+00
std      1.060538e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64


In [11]:
# Visualize rating distribution
plt.figure(figsize=(10, 6))
sns.histplot(ratings['rating'], bins=10, kde=True)
plt.title('Distribution of Movie Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.savefig('rating_distribution.png')
plt.close()

In [12]:
# Extract and analyze genres
genres_list = []
for genres_str in movies['genres'].str.split('|'):
    genres_list.extend(genres_str)

unique_genres = set(genres_list)
genre_counts = {genre: genres_list.count(genre) for genre in unique_genres}

# Plot top genres
top_genres = dict(sorted(genre_counts.items(), key=lambda x: x[1], reverse=True)[:10])
plt.figure(figsize=(12, 6))
sns.barplot(x=list(top_genres.keys()), y=list(top_genres.values()))
plt.title('Top 10 Movie Genres')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('top_genres.png')
plt.close()

In [13]:
# Check for missing values
print("\nMissing values in ratings:", ratings.isnull().sum().sum())
print("Missing values in movies:", movies.isnull().sum().sum())


Missing values in ratings: 0
Missing values in movies: 0


In [14]:
# Analyze rating trends over time
yearly_ratings = ratings.groupby('year')['rating'].mean().reset_index()
plt.figure(figsize=(12, 6))
sns.lineplot(x='year', y='rating', data=yearly_ratings)
plt.title('Average Rating by Year')
plt.xlabel('Year')
plt.ylabel('Average Rating')
plt.grid(True)
plt.savefig('rating_trends.png')
plt.close()

## **3. Train-Test Split**
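The dataset is split into training and test sets:

- 80% training, 20% testing
- Stratified on a coarse bucket of userId (`min(userId % 10, 5)`), which keeps the six buckets proportionally represented in both sets. This does not guarantee that every user or movie appears in both sets, so the evaluation code falls back to a default prediction for users or movies unseen in training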
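
If the goal is for every user to have ratings in both sets, one option is a per-user holdout. The sketch below is not the split used in this notebook; the function name `per_user_split` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def per_user_split(df, test_frac=0.2, seed=42):
    """Hold out about `test_frac` of each user's rows (at least one, never all)."""
    rng = np.random.default_rng(seed)
    test_positions = []
    # .indices maps each userId to the positional row indices of that user
    for positions in df.groupby("userId").indices.values():
        if len(positions) < 2:
            continue  # a user with a single rating stays entirely in train
        n_test = min(len(positions) - 1, max(1, int(round(test_frac * len(positions)))))
        test_positions.extend(rng.choice(positions, size=n_test, replace=False))
    mask = np.zeros(len(df), dtype=bool)
    mask[np.array(test_positions, dtype=int)] = True
    return df[~mask], df[mask]

# Toy data: users with 5, 10, and 2 ratings.
toy = pd.DataFrame({
    "userId":  [1] * 5 + [2] * 10 + [3] * 2,
    "movieId": list(range(5)) + list(range(10)) + list(range(2)),
    "rating":  3.0,
})
train, test = per_user_split(toy)
print(len(train), len(test))
```

With this approach every user who has two or more ratings contributes to both sets, at the cost of a Python-level loop over users.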

In [15]:
# 3. Train-Test Split
print("\n3. PREPARING TRAIN-TEST SPLIT...")

# Stratify on a coarse userId bucket (last digit, capped at 5) so the six buckets keep
# the same proportions in both sets; this does not guarantee per-user coverage
ratings_train, ratings_test = train_test_split(
    ratings, test_size=0.2, random_state=42,
    stratify=ratings['userId'].apply(lambda x: min(x % 10, 5))  # buckets 0-5, not individual users
)
print(f"Training set: {len(ratings_train)} ratings")
print(f"Test set: {len(ratings_test)} ratings")



3. PREPARING TRAIN-TEST SPLIT...
Training set: 19979624 ratings
Test set: 4994907 ratings


## **4. Mean-Centering**
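User ratings are mean-centered so that the SVD models each user's deviation from their own typical rating instead of the raw rating scale:

- Calculate the mean rating for each user from the training set only
- Subtract each user's mean from their training ratings; at prediction time the mean is added back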
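
A worked example on toy data may make the centering step concrete. The numbers below are made up; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy training ratings: user 1 averages 4.0, user 2 averages 1.5.
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2],
    "movieId": [10, 20, 30, 10, 20],
    "rating":  [5.0, 4.0, 3.0, 2.0, 1.0],
})

# transform("mean") broadcasts each user's mean back onto that user's rows.
user_means = toy.groupby("userId")["rating"].transform("mean")
toy["centered_rating"] = toy["rating"] - user_means
print(toy)
```

After centering, each user's ratings sum to zero, so a positive value means "liked more than this user's average" regardless of how generous the user is overall.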

In [16]:
# 4. Mean-Center the Training Ratings
print("\n4. MEAN-CENTERING TRAINING RATINGS...")
user_mean_map = ratings_train.groupby('userId')['rating'].mean().to_dict()

# Subtract each user's mean rating from their training ratings
# (vectorized; avoids a Python loop over roughly 20 million rows)
centered_df = ratings_train[['userId', 'movieId']].copy()
centered_df['centered_rating'] = ratings_train['rating'] - ratings_train['userId'].map(user_mean_map)


4. MEAN-CENTERING TRAINING RATINGS...


## **5. Sparse Matrix Creation**
A sparse matrix is created from the centered ratings:

- Maps user IDs and movie IDs to matrix indices
- Creates a compressed sparse row matrix
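
The ID-to-index mapping can be sketched on toy data. This version uses `pd.factorize` as a compact alternative to the dictionary maps built in this notebook; the toy IDs and values are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy centered ratings with non-contiguous IDs, as in MovieLens.
toy = pd.DataFrame({
    "userId":          [7, 7, 42, 99],
    "movieId":         [500, 300, 500, 1000],
    "centered_rating": [0.5, -0.5, 1.0, -1.0],
})

# factorize assigns 0..n-1 codes in order of first appearance and
# returns the original IDs so that codes can be mapped back later.
user_codes, user_index = pd.factorize(toy["userId"])
movie_codes, movie_index = pd.factorize(toy["movieId"])

mat = csr_matrix(
    (toy["centered_rating"].to_numpy(), (user_codes, movie_codes)),
    shape=(len(user_index), len(movie_index)),
)
print(mat.toarray())
```

Every entry that is not stored is treated as 0 by the SVD. Because the values are mean-centered, that amounts to assuming an unrated movie would receive the user's average rating.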

In [17]:
# 5. Build Sparse Matrix from Centered Ratings
print("\n5. CREATING SPARSE MATRIX (MEAN-CENTERED) FOR TRAINING...")
user_ids = centered_df['userId'].unique()
movie_ids = centered_df['movieId'].unique()

user_id_map = {uid: i for i, uid in enumerate(user_ids)}
movie_id_map = {mid: j for j, mid in enumerate(movie_ids)}

# Reverse mappings (used later for recommendations)
user_idx_map = {i: uid for uid, i in user_id_map.items()}
movie_idx_map = {j: mid for mid, j in movie_id_map.items()}

train_rows = centered_df['userId'].map(user_id_map).values
train_cols = centered_df['movieId'].map(movie_id_map).values
train_vals = centered_df['centered_rating'].values

train_sparse = csr_matrix((train_vals, (train_rows, train_cols)),
                          shape=(len(user_ids), len(movie_ids)))

print(f"Train sparse matrix shape: {train_sparse.shape}")
print(f"Non-zero entries: {train_sparse.nnz}")


5. CREATING SPARSE MATRIX (MEAN-CENTERED) FOR TRAINING...
Train sparse matrix shape: (162541, 41067)
Non-zero entries: 19979624


## **6. SVD Application**
SVD is applied with different numbers of components to find the optimal model:

- Tests multiple component counts: 10, 20, 50, 100
- Evaluates each model using RMSE, MAE, and Precision@K
- Visualizes predictions vs. actual ratings
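
The Precision@K metric used in this section can be isolated as a small function. This is a sketch of the same idea (rank each user's test items by predicted rating, then count how many of the top K have an actual rating of at least 4.0); the function name and toy inputs are assumptions for illustration.

```python
from collections import defaultdict

def precision_at_k(user_ids, predicted, actual, k=10, threshold=4.0):
    """Mean fraction of each user's top-k predicted test items that are relevant.

    Users with fewer than k test items are skipped.
    """
    per_user = defaultdict(list)
    for user, pred, act in zip(user_ids, predicted, actual):
        per_user[user].append((pred, act))

    scores = []
    for pairs in per_user.values():
        if len(pairs) < k:
            continue
        top = sorted(pairs, key=lambda pa: pa[0], reverse=True)[:k]
        scores.append(sum(act >= threshold for _, act in top) / k)
    return sum(scores) / len(scores) if scores else 0.0

# User "a" has three test items, user "b" has only one and is skipped for k=2.
p_at_2 = precision_at_k(
    user_ids=["a", "a", "a", "b"],
    predicted=[4.5, 3.0, 4.0, 5.0],
    actual=[5.0, 4.0, 2.0, 5.0],
    k=2,
)
print(p_at_2)
```

Because only items that appear in the test set are ranked, this measures how well the model orders a user's held-out ratings, not how well it ranks the whole catalogue.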

In [None]:
# 6. Apply SVD (Testing Multiple Components)
print("\n6. APPLYING SVD WITH USER-MEAN CENTERING AND OPTIMIZING COMPONENTS...")
n_components_list = [10, 20, 50, 100]  
results = []

for n_components in n_components_list:
    print(f"\nTesting with {n_components} components...")

    # Fit SVD on the mean-centered training matrix
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    user_features = svd.fit_transform(train_sparse)
    movie_features = svd.components_  # shape = (n_components, num_movies)

    # Calculate explained variance
    explained_variance = svd.explained_variance_ratio_.sum()
    print(f"Explained variance: {explained_variance:.4f}")

    # Evaluate on test set
    print("Generating predictions on test set...")
    test_users = ratings_test['userId'].values
    test_movies = ratings_test['movieId'].values
    test_ratings = ratings_test['rating'].values

    predictions = []
    actuals = []
    total_tests = len(ratings_test)

    for i in range(total_tests):
        if i % max(1, (total_tests // 10)) == 0:
            progress_bar(i, total_tests)

        uid = test_users[i]
        mid = test_movies[i]
        actual = test_ratings[i]

        # Cold start: if the user or the movie is absent from training, fall back
        # to a fixed 3.5 (close to the global mean rating of about 3.53)
        if uid not in user_id_map or mid not in movie_id_map:
            pred_rating = 3.5
        else:
            u_idx = user_id_map[uid]
            m_idx = movie_id_map[mid]

            # Dot product in latent space
            centered_pred = np.dot(user_features[u_idx], movie_features[:, m_idx])
            # Add user mean back
            user_mean = user_mean_map[uid]
            pred_rating = user_mean + centered_pred

        # Clip predictions
        pred_rating = max(0.5, min(5.0, pred_rating))
        predictions.append(pred_rating)
        actuals.append(actual)

    progress_bar(total_tests, total_tests)
    print()

    # Calculate error metrics
    mse = mean_squared_error(actuals, predictions)
    rmse = sqrt(mse)
    mae = mean_absolute_error(actuals, predictions)

    # Compute a simple Precision@K
    k = 10
    user_test_ratings = defaultdict(list)
    for i in range(len(actuals)):
        user_test_ratings[test_users[i]].append((predictions[i], actuals[i]))

    precision_at_k = []
    for user_id_, user_preds in user_test_ratings.items():
        if len(user_preds) >= k:
            user_preds.sort(key=lambda x: x[0], reverse=True)
            # relevant if actual >= 4
            relevant = sum(1 for _, act in user_preds[:k] if act >= 4.0)
            precision_at_k.append(relevant / k)

    avg_precision_at_k = np.mean(precision_at_k) if precision_at_k else 0

    print(f"RMSE with {n_components} components: {rmse:.4f}")
    print(f"MAE with {n_components} components: {mae:.4f}")
    print(f"Precision@{k} with {n_components} components: {avg_precision_at_k:.4f}")

    results.append((n_components, rmse, mae, avg_precision_at_k, explained_variance))

    # Quick plots of predictions vs. actual
    sample_size = min(1000, len(predictions))
    sample_indices = np.random.choice(len(predictions), sample_size, replace=False)

    plt.figure(figsize=(8, 6))
    plt.scatter([actuals[i] for i in sample_indices],
                [predictions[i] for i in sample_indices],
                alpha=0.3)
    plt.plot([0.5, 5], [0.5, 5], 'r--')
    plt.xlabel('Actual Ratings')
    plt.ylabel('Predicted Ratings')
    plt.title(f'Actual vs Predicted (User-Mean Centered) - {n_components} Components')
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(f'prediction_scatter_{n_components}.png')
    plt.close()

    # Error distribution
    errors = np.array(predictions) - np.array(actuals)
    plt.figure(figsize=(8, 6))
    sns.histplot(errors, kde=True, bins=50)
    plt.title(f'Error Distribution with {n_components} Components')
    plt.xlabel('Prediction Error')
    plt.ylabel('Count')
    plt.grid(True)
    plt.savefig(f'error_distribution_{n_components}.png')
    plt.close()

    # Cleanup
    del user_features, movie_features
    gc.collect()



6. APPLYING SVD WITH USER-MEAN CENTERING AND OPTIMIZING COMPONENTS...

Testing with 10 components...
Explained variance: 0.0765
Generating predictions on test set...
[██████████████████████████████████████████████████] 100%
RMSE with 10 components: 0.9021
MAE with 10 components: 0.6896
Precision@10 with 10 components: 0.7173

Testing with 20 components...
Explained variance: 0.1032
Generating predictions on test set...
[██████████████████████████████████████████████████] 100%
RMSE with 20 components: 0.8969
MAE with 20 components: 0.6850
Precision@10 with 20 components: 0.7217

Testing with 50 components...
Explained variance: 0.1562
Generating predictions on test set...
[██████████████████████████████████████████████████] 100%
RMSE with 50 components: 0.8936
MAE with 50 components: 0.6824
Precision@10 with 50 components: 0.7192

Testing with 100 components...
Explained variance: 0.2167
Generating predictions on test set...
[██████████████████████████████████████████████████] 100%
RMS

## **7. Compare Results for Different n_components**


In [19]:
# 7. Compare results for different n_components
plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
components, rmses, maes, precisions, variances = zip(*results)
plt.plot(components, rmses, marker='o', label='RMSE')
plt.plot(components, maes, marker='s', label='MAE')
plt.xlabel('Number of Components')
plt.ylabel('Error Metric')
plt.title('Error Metrics vs. #Components (Mean-Centered)')
plt.grid(True)
plt.legend()

plt.subplot(2, 2, 2)
plt.plot(components, precisions, marker='d', color='green')
plt.xlabel('Number of Components')
plt.ylabel(f'Precision@{k}')
plt.title(f'Precision@{k} vs. #Components')
plt.grid(True)

plt.subplot(2, 2, 3)
plt.plot(components, variances, marker='*', color='purple')
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance')
plt.title('Explained Variance vs. #Components')
plt.grid(True)

plt.subplot(2, 2, 4)
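# Efficiency proxy: explained variance per component. Fit times were not recorded
# in `results`, so wall-clock efficiency cannot be plotted here.
comp_efficiency = [v / c for c, v in zip(components, variances)]
plt.plot(components, comp_efficiency, marker='x', color='orange')
plt.xlabel('Number of Components')
plt.ylabel('Explained Variance per Component')
plt.title('Variance per Component vs. #Components')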
plt.grid(True)

plt.tight_layout()
plt.savefig('component_analysis.png')
plt.close()

## **8. Model Selection**
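The best model is selected by a combined score:

- RMSE, Precision@K, and explained variance are each min-max normalized to [0, 1] across the tested component counts (RMSE is inverted so that lower error scores higher)
- The normalized metrics are combined with weights 0.4 (RMSE), 0.3 (precision), and 0.3 (variance)
- The component count with the highest combined score is used to rebuild the final model in Section 9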
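
The scoring can be traced by hand on the three configurations whose metrics are fully reported in the Section 6 output (10, 20, and 50 components). The 100-component figures are cut off in that output, so they are left out here; as a result the normalized values differ from those in the notebook's four-way comparison. The helper names are assumptions for illustration.

```python
def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; return 0.5 for all when they are identical."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    if higher_is_better:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]

def combined_scores(rmses, precisions, variances, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of normalized RMSE (inverted), precision, and variance."""
    n_rmse = min_max(rmses, higher_is_better=False)
    n_prec = min_max(precisions)
    n_var = min_max(variances)
    w_rmse, w_prec, w_var = weights
    return [w_rmse * r + w_prec * p + w_var * v
            for r, p, v in zip(n_rmse, n_prec, n_var)]

# Metrics copied from the reported output for 10, 20, and 50 components.
components = [10, 20, 50]
rmses = [0.9021, 0.8969, 0.8936]
precisions = [0.7173, 0.7217, 0.7192]
variances = [0.0765, 0.1032, 0.1562]

scores = combined_scores(rmses, precisions, variances)
best = components[scores.index(max(scores))]
print(dict(zip(components, scores)), best)
```

Since explained variance grows with every added component, the 0.3 weight on variance biases the score toward larger models; the RMSE and precision terms are what can pull the choice back.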

In [20]:
# 8. Select Best Model (Combined Score)
print("\n8. SELECTING BEST MODEL...")

min_rmse = min(rmses)
max_rmse = max(rmses)
normalized_rmses = [(max_rmse - r) / (max_rmse - min_rmse) if max_rmse > min_rmse else 0.5 for r in rmses]

min_prec = min(precisions)
max_prec = max(precisions)
normalized_precs = [(p - min_prec) / (max_prec - min_prec) if max_prec > min_prec else 0.5 for p in precisions]

min_var = min(variances)
max_var = max(variances)
normalized_vars = [(v - min_var) / (max_var - min_var) if max_var > min_var else 0.5 for v in variances]

weights = (0.4, 0.3, 0.3)  # RMSE, precision, variance
combined_scores = [weights[0]*nr + weights[1]*np_ + weights[2]*nv
                   for nr, np_, nv in zip(normalized_rmses, normalized_precs, normalized_vars)]

best_idx = combined_scores.index(max(combined_scores))
best_n_components = components[best_idx]
best_rmse = rmses[best_idx]
best_precision = precisions[best_idx]

print(f"Best model uses {best_n_components} components:")
print(f"  - RMSE: {best_rmse:.4f}")
print(f"  - Precision@{k}: {best_precision:.4f}")
print(f"  - Explained Variance: {variances[best_idx]:.4f}")


8. SELECTING BEST MODEL...
Best model uses 50 components:
  - RMSE: 0.8936
  - Precision@10: 0.7192
  - Explained Variance: 0.1562


## **9. Rebuild Final SVD Model**


In [21]:
# 9. Rebuild Final SVD Model
print("\n9. REBUILDING BEST MODEL FOR RECOMMENDATIONS...")
svd_final = TruncatedSVD(n_components=best_n_components, random_state=42)
final_user_factors = svd_final.fit_transform(train_sparse)
final_movie_factors = svd_final.components_

# For interpretability
final_explained_variance = svd_final.explained_variance_ratio_

plt.figure(figsize=(10, 6))
plt.bar(range(len(final_explained_variance)), final_explained_variance)
plt.xlabel('Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by SVD Component (Best Model)')
plt.savefig('explained_variance_best_model.png')
plt.close()

cumulative_variance = np.cumsum(final_explained_variance)
plt.figure(figsize=(10, 6))
plt.plot(range(len(cumulative_variance)), cumulative_variance, marker='o')
plt.axhline(y=0.9, color='r', linestyle='-', label='90% Variance')
plt.grid(True)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance (Best Model)')
plt.legend()
plt.savefig('cumulative_variance_best_model.png')
plt.close()


9. REBUILDING BEST MODEL FOR RECOMMENDATIONS...


## **10. Recommendation Utilities**
Utility functions for generating recommendations:

- ***predict_rating:*** Predicts rating for a user-movie pair
- ***recommend_movies:*** Recommends top N movies for a user
- ***find_similar_movies:*** Finds movies similar to a given movie
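
Scoring every movie for one user does not need a per-movie function call: a single matrix-vector product yields all centered predictions at once. The sketch below illustrates that idea under stated assumptions. The function name, argument names, and toy factors are hypothetical, and `movie_factors` is assumed to have shape (n_components, num_movies) like `svd.components_`.

```python
import numpy as np

def top_n_for_user(user_vec, movie_factors, user_mean, rated_cols, n=5):
    """Return column indices and scores of the n best unrated movies."""
    # One product gives the centered prediction for every movie.
    scores = np.clip(user_mean + user_vec @ movie_factors, 0.5, 5.0)
    # Exclude movies the user has already rated.
    rated = np.fromiter(rated_cols, dtype=np.int64)
    scores[rated] = -np.inf
    # Sort descending, then drop the excluded columns.
    order = np.argsort(-scores, kind="stable")
    order = order[np.isfinite(scores[order])][:n]
    return order, scores[order]

# Toy factors: 2 components, 4 movies.
user_vec = np.array([1.0, 0.0])
movie_factors = np.array([[0.5, -0.5, 0.2, 0.0],
                          [0.3,  0.3, 0.3, 0.3]])
top_cols, top_scores = top_n_for_user(
    user_vec, movie_factors, user_mean=3.0, rated_cols={0}, n=2
)
print(top_cols, top_scores)
```

The returned values are matrix column indices; mapping them back to movie IDs would use a reverse lookup such as `movie_idx_map`.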

In [22]:
# 10. Prediction & Recommendation Utilities
def predict_rating(user_id, movie_id):
    """Predict rating for a single user–movie pair using mean-centering approach."""
    if user_id not in user_mean_map:
        return 3.5  # fallback
    
    if user_id in user_id_map and movie_id in movie_id_map:
        u_idx = user_id_map[user_id]
        m_idx = movie_id_map[movie_id]
        # Dot product in latent space
        centered_pred = np.dot(final_user_factors[u_idx], final_movie_factors[:, m_idx])
        # Add user mean back
        user_mean = user_mean_map[user_id]
        pred = user_mean + centered_pred
    else:
        # fallback
        pred = 3.5
    
    # Clip to [0.5, 5.0]
    return max(0.5, min(5.0, pred))


In [23]:
def recommend_movies(user_id, n_recommendations=5):
    """Recommend top N movies for a user based on final SVD model."""
    if user_id not in user_id_map:
        print(f"User {user_id} not found in training data")
        return []
    
    # Get all user-rated movies to exclude (a set makes the membership test below O(1))
    user_rated = set(ratings_train[ratings_train['userId'] == user_id]['movieId'].unique())
    
    recs = []
    for mid in movie_id_map:
        if mid not in user_rated:
            pr = predict_rating(user_id, mid)
            recs.append((mid, pr))
    recs.sort(key=lambda x: x[1], reverse=True)
    
    # Build recommendation list with movie titles
    recommendations = []
    for mid, score in recs[:n_recommendations]:
        row = movies[movies['movieId'] == mid]
        if not row.empty:
            title = row.iloc[0]['title']
            genres = row.iloc[0]['genres']
            recommendations.append((mid, title, genres, score))
    return recommendations

In [24]:
def safe_cosine_similarity(vec1, vec2):
    """Safe cosine similarity."""
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return np.dot(vec1, vec2) / (norm1 * norm2)


In [25]:
def find_similar_movies(movie_id, n_similar=5):
    """Find movies similar to a given movie using final SVD model."""
    if movie_id not in movie_id_map:
        print(f"Movie {movie_id} not found in training data.")
        return []
    
    m_idx = movie_id_map[movie_id]
    movie_vec = final_movie_factors[:, m_idx]
    similarities = []
    
    for other_idx, other_id in movie_idx_map.items():
        if other_id != movie_id:
            other_vec = final_movie_factors[:, other_idx]
            sim = safe_cosine_similarity(movie_vec, other_vec)
            similarities.append((other_id, sim))
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # Build final list
    similar_movies = []
    for mid, sim in similarities[:n_similar]:
        row = movies[movies['movieId'] == mid]
        if not row.empty:
            title = row.iloc[0]['title']
            genres = row.iloc[0]['genres']
            similar_movies.append((mid, title, genres, sim))
    return similar_movies

## **11. Sample Recommendations**
Generates sample recommendations for different types of users:

- Light users (fewer than 20 training ratings)
- Medium users (20 to 99 training ratings)
- Heavy users (100 or more training ratings)

In [26]:
# 11. Sample Recommendations
print("\n10. GENERATING SAMPLE RECOMMENDATIONS...")

user_rating_counts = ratings_train['userId'].value_counts()
light_user_id = user_rating_counts[user_rating_counts < 20].index[0] if any(user_rating_counts < 20) else user_rating_counts.index[0]
medium_user_id = user_rating_counts[(user_rating_counts >= 20) & (user_rating_counts < 100)].index[0] if any((user_rating_counts >= 20) & (user_rating_counts < 100)) else user_rating_counts.index[1]
heavy_user_id = user_rating_counts[user_rating_counts >= 100].index[0] if any(user_rating_counts >= 100) else user_rating_counts.index[2]

sample_users = [light_user_id, medium_user_id, heavy_user_id]

for uid in sample_users:
    cnt = user_rating_counts[uid]
    avg_r = user_mean_map.get(uid, 3.5)
    print(f"\nRecommendations for User {uid} (rated {cnt} movies, avg rating: {avg_r:.2f}):")
    user_recs = recommend_movies(uid, n_recommendations=5)
    for i, (mid, title, genres, score) in enumerate(user_recs, start=1):
        print(f"{i}. {title} ({genres}) - Predicted Rating: {score:.2f}")


10. GENERATING SAMPLE RECOMMENDATIONS...

Recommendations for User 124860 (rated 19 movies, avg rating: 3.95):
1. 2001: A Space Odyssey (1968) (Adventure|Drama|Sci-Fi) - Predicted Rating: 4.06
2. Twelve Monkeys (a.k.a. 12 Monkeys) (1995) (Mystery|Sci-Fi|Thriller) - Predicted Rating: 4.04
3. Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) (Comedy|War) - Predicted Rating: 4.03
4. Dark Knight, The (2008) (Action|Crime|Drama|IMAX) - Predicted Rating: 4.02
5. Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) (Comedy|Romance) - Predicted Rating: 4.02

Recommendations for User 69862 (rated 99 movies, avg rating: 3.80):
1. Shrek (2001) (Adventure|Animation|Children|Comedy|Fantasy|Romance) - Predicted Rating: 4.25
2. Lord of the Rings: The Return of the King, The (2003) (Action|Adventure|Drama|Fantasy) - Predicted Rating: 4.22
3. Lord of the Rings: The Two Towers, The (2002) (Adventure|Fantasy) - Predicted Rating: 4.21
4. Pirates of the Caribbean: The Curse of th

## **12. Similar Movies**
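Finds and displays similar movies for popular genres:

- For each of Drama, Comedy, Action, and Sci-Fi, picks the most-rated movie in the training set as the query
- Ranks all other movies by cosine similarity between their latent factor vectors and returns the top 5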
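
Cosine similarity against every movie can also be computed in one shot by normalizing the factor columns first. This is a sketch with hypothetical names and toy factors, again assuming a (n_components, num_items) layout; columns with zero norm get similarity 0, matching the behavior of a guarded cosine function.

```python
import numpy as np

def most_similar_columns(factors, target_idx, n=5):
    """Indices and cosine similarities of the n columns closest to target_idx."""
    norms = np.linalg.norm(factors, axis=0)
    # Avoid division by zero; all-zero columns stay zero after scaling.
    unit = factors / np.where(norms == 0, 1.0, norms)
    sims = unit.T @ unit[:, target_idx]
    sims[target_idx] = -np.inf  # never return the query item itself
    n = min(n, factors.shape[1] - 1)
    order = np.argsort(-sims, kind="stable")[:n]
    return order, sims[order]

# Toy factors: columns are (1,0), (2,0), (0,1), (-1,0).
factors = np.array([[1.0, 2.0, 0.0, -1.0],
                    [0.0, 0.0, 1.0,  0.0]])
nearest, nearest_sims = most_similar_columns(factors, target_idx=0, n=2)
print(nearest, nearest_sims)
```

Cosine similarity ignores vector length, so a movie with very few ratings (and therefore a tiny factor vector) can still rank as highly similar to a popular one.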

In [27]:
# 12. Similar Movies
print("\n11. FINDING SIMILAR MOVIES IN POPULAR GENRES...")

genre_to_movie = {}
for _, row in movies.iterrows():
    for g in row['genres'].split('|'):
        if g not in genre_to_movie:
            genre_to_movie[g] = []
        genre_to_movie[g].append(row['movieId'])

popular_genre_movies = []
# Count training ratings per movie once instead of rescanning the frame for every movie
train_movie_counts = ratings_train['movieId'].value_counts()
for g in ['Drama', 'Comedy', 'Action', 'Sci-Fi']:
    if g in genre_to_movie:
        # Keep only this genre's movies that appear in the training set
        genre_counts_g = train_movie_counts.reindex(genre_to_movie[g]).dropna()
        if not genre_counts_g.empty:
            # Most-rated movie of the genre in the training set
            popular_genre_movies.append(genre_counts_g.idxmax())

for mid in popular_genre_movies:
    row = movies[movies['movieId'] == mid]
    if not row.empty:
        m_title = row.iloc[0]['title']
        m_genres = row.iloc[0]['genres']
        print(f"\nMovies similar to '{m_title}' ({m_genres}):")
        sims = find_similar_movies(mid, n_similar=5)
        for i, (smid, stitle, sgenres, s) in enumerate(sims, start=1):
            print(f"{i}. {stitle} ({sgenres}) - Similarity: {s:.4f}")



11. FINDING SIMILAR MOVIES IN POPULAR GENRES...

Movies similar to 'Shawshank Redemption, The (1994)' (Crime|Drama):
1. The Dancer and the Thief (2009) (Drama) - Similarity: 0.5362
2. Flower Girl (2009) (Comedy|Romance) - Similarity: 0.4999
3. A Single Rider (2017) (Drama|Mystery) - Similarity: 0.4977
4. Grey Lady (2017) ((no genres listed)) - Similarity: 0.4645
5. Pig (2011) (Drama|Mystery|Sci-Fi|Thriller) - Similarity: 0.4318

Movies similar to 'Forrest Gump (1994)' (Comedy|Drama|Romance|War):
1. A la mala (2015) (Comedy) - Similarity: 0.6567
2. Life in the Doghouse (2018) (Documentary) - Similarity: 0.5674
3. Under The Greenwood Tree (2005) (Drama|Romance) - Similarity: 0.5484
4. The Double (1971) (Mystery|Thriller) - Similarity: 0.5154
5. 3 Worlds of Gulliver, The (1960) (Adventure|Fantasy) - Similarity: 0.5065

Movies similar to 'Matrix, The (1999)' (Action|Sci-Fi|Thriller):
1. The End of Old Times (1990) (Comedy) - Similarity: 0.6879
2. Perfect High (2015) ((no genres listed)) -

## **13. Latent Space Visualization**
Visualizes the latent space of movie factors:

- Projects movie factors to 2D using PCA
- Colors movies by decade
- Shows relationships between movies in the latent space
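
The projection and decade labeling can be sketched on random stand-in factors. The array below is synthetic, not the notebook's `final_movie_factors`, and the helper `decade_label` is a hypothetical name; the point is the shape handling, since each movie is a column of the factor matrix and PCA expects one row per sample.

```python
import numpy as np
from sklearn.decomposition import PCA

def decade_label(year):
    """Map a release year to a label such as '1990s'."""
    return f"{year // 10 * 10}s"

# Synthetic stand-in for movie factors: (n_components, n_movies).
rng = np.random.default_rng(0)
movie_factors = rng.normal(size=(50, 30))

# Transpose so that each movie becomes one row, then project to 2D.
points_2d = PCA(n_components=2).fit_transform(movie_factors.T)
print(points_2d.shape, decade_label(1994))
```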

In [28]:
# 13. Optional: Visualize Latent Space
print("\n12. VISUALIZING LATENT SPACE...")

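def extract_year(title):
    """Return the 4-digit year from a title ending in '(YYYY)', else None."""
    try:
        return int(title.strip()[-5:-1]) if title.strip()[-5:-1].isdigit() else None
    except (AttributeError, TypeError, ValueError):
        # Non-string titles (e.g. NaN) or unparsable years
        return None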

movies['year'] = movies['title'].apply(extract_year)

movies_with_year = movies.dropna(subset=['year'])
decades = [(1950, 1959), (1960, 1969), (1970, 1979), (1980, 1989),
           (1990, 1999), (2000, 2009), (2010, 2019)]

sampled_decade_movies = []
for start_year, end_year in decades:
    dec_movies = movies_with_year[(movies_with_year['year'] >= start_year) & 
                                  (movies_with_year['year'] <= end_year)]
    if len(dec_movies) > 0:
        sampled_decade_movies.append(dec_movies.sample(min(10, len(dec_movies)), random_state=42))

if len(sampled_decade_movies) > 0:
    sampled_movies = pd.concat(sampled_decade_movies)
    sampled_movie_ids = sampled_movies['movieId'].values

    movie_latent_factors = []
    movie_titles = []
    decade_labels = []

    for mid in sampled_movie_ids:
        if mid in movie_id_map:
            m_idx = movie_id_map[mid]
            row = movies[movies['movieId'] == mid]
            title = row.iloc[0]['title']
            year = extract_year(title)
            if year:
                decade = f"{year // 10 * 10}s"
            else:
                decade = "Unknown"
            movie_latent_factors.append(final_movie_factors[:, m_idx])
            movie_titles.append(title)
            decade_labels.append(decade)

    if len(movie_latent_factors) > 0:
        from sklearn.decomposition import PCA
        arr = np.array(movie_latent_factors)
        pca = PCA(n_components=2)
        movie_factors_2d = pca.fit_transform(arr)

        plt.figure(figsize=(12, 8))
        decade_colors = {
            '1950s': 'blue', '1960s': 'green', '1970s': 'red',
            '1980s': 'purple', '1990s': 'orange', '2000s': 'brown', '2010s': 'black'
        }

        for i, (x, y) in enumerate(movie_factors_2d):
            color = decade_colors.get(decade_labels[i], 'gray')
            plt.scatter(x, y, color=color, alpha=0.7)
            plt.annotate(movie_titles[i], (x, y), fontsize=8)

        plt.title("Latent Space Visualization (Sampled Movies)")
        plt.xlabel("PC1")
        plt.ylabel("PC2")
        plt.grid(True)
        plt.savefig("latent_space_visualization.png")
        plt.close()


12. VISUALIZING LATENT SPACE...


In [29]:
end_time = time.time()
elapsed = end_time - start_time
print(f"\nDONE! Total runtime: {elapsed:.2f} seconds.")


DONE! Total runtime: 2510.89 seconds.


**NOTE**: With all of these steps, the total runtime is about 42 minutes (2510.89 seconds).