# Movie Recommendation System – MovieLens

### Task Overview:
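This notebook builds a movie recommendation system on the MovieLens `ml-latest-small` dataset. It trains and evaluates four models (SVD++ from `scikit-surprise`, plus three PyTorch models: plain matrix factorization, matrix factorization with biases, and a small neural network) and uses the best of them to suggest the Top-N movies for a given user.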

In [None]:
!pip install "numpy<2"

Collecting numpy<2
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
opencv-contrib-python 4.12.0.88 requires num

In [None]:
# Install surprise for our baseline SVD model

!pip uninstall -y scikit-surprise
!pip install scikit-surprise -q

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from pathlib import Path
from collections import defaultdict
import math
import pickle
import os

from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Download and unzip the dataset
!wget -q --show-progress http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!unzip -o ml-latest-small.zip

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for scikit-surprise (pyproject.toml) ... done
Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


In [None]:
# Define the path to the data
PATH = Path("/content/ml-latest-small")

## 2. Data Loading and Splitting

We load `ratings.csv` and split it at random, row by row, into a training set (about 80% of the ratings) and a validation set (about 20%), using a fixed seed for reproducibility.

In [None]:
# Load the full ratings dataset
data = pd.read_csv(PATH/"ratings.csv")

# Split data into training and validation sets
np.random.seed(42)
msk = np.random.rand(len(data)) < 0.8
train_df = data[msk].copy()
val_df = data[~msk].copy()

print(f"Training set size: {len(train_df)}")
print(f"Validation set size: {len(val_df)}")

Training set size: 80764
Validation set size: 20072
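
Because the split is row-wise and random, some validation ratings can involve a user or movie that never appears in the training set. The sketch below counts such cold-start rows on a toy frame. The helper `cold_start_rows` and the toy data are hypothetical, introduced only for illustration; they are not part of the notebook.

```python
import pandas as pd

def cold_start_rows(train, val):
    """Count validation rows whose user or movie is absent from training."""
    unseen_user = ~val["userId"].isin(train["userId"])
    unseen_movie = ~val["movieId"].isin(train["movieId"])
    return int((unseen_user | unseen_movie).sum())

# Toy data (illustrative only)
toy_train = pd.DataFrame({"userId": [1, 2], "movieId": [10, 20]})
toy_val = pd.DataFrame({"userId": [1, 3], "movieId": [10, 20]})
print(cold_start_rows(toy_train, toy_val))  # 1: user 3 never appears in training
```

On the notebook's own frames the call would be `cold_start_rows(train_df, val_df)`. Rows counted this way are the ones `encode_data` later drops before the PyTorch models see the validation data.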


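## 3. Model 1: Baseline with Singular Value Decomposition (SVD++)

To fulfill the requirement of exploring SVD, we start with an implementation from the `scikit-surprise` library. The code below uses `SVDpp` (SVD++), an extension of FunkSVD-style matrix factorization for sparse rating data that also takes implicit ratings into account (which movies a user rated, regardless of the rating value). Its hyperparameters are tuned with a 3-fold grid search, and the tuned model serves as our baseline.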
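
For intuition, the prediction rule of this model family can be sketched in NumPy. The biased matrix-factorization estimate is $\hat r_{ui} = \mu + b_u + b_i + q_i^\top p_u$; SVD++ (the `SVDpp` class used in the next cell) adds an implicit-feedback term built from the items the user has rated. The function and argument names here are illustrative, and this is a sketch of the formula only, not of Surprise's training code.

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, p_u, q_i, y_rated):
    """Estimate one rating with the SVD++ prediction rule.

    mu       : global mean rating
    b_u, b_i : user and item biases
    p_u, q_i : user and item factor vectors, shape (n_factors,)
    y_rated  : implicit item factors for the items this user rated,
               shape (n_rated, n_factors)
    """
    if len(y_rated) > 0:
        # Sum of implicit factors, scaled by 1 / sqrt(number of rated items)
        implicit = y_rated.sum(axis=0) / np.sqrt(len(y_rated))
    else:
        implicit = np.zeros_like(p_u)
    return mu + b_u + b_i + q_i @ (p_u + implicit)

# Hypothetical values, chosen so the arithmetic is easy to follow
p_u = np.array([1.0, 0.0])
q_i = np.array([0.5, 0.5])
y_rated = np.ones((4, 2))
print(svdpp_predict(3.5, 0.1, -0.2, p_u, q_i, y_rated))
```

With an empty `y_rated` the function reduces to the plain biased matrix-factorization estimate.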

In [None]:
from surprise.model_selection import GridSearchCV
from surprise import SVDpp # Import SVDpp instead of SVD

# The Reader class is used to parse a file containing ratings.
reader = Reader(rating_scale=(0.5, 5.0))

# Load the training data into surprise's data format
train_data = Dataset.load_from_df(train_df[['userId', 'movieId', 'rating']], reader)

# --- Hyperparameter Tuning for SVD++ ---
# 2 x 2 x 2 x 2 = 16 parameter combinations, each evaluated with 3-fold CV
param_grid = {
    'n_factors': [50, 80],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.007],
    'reg_all': [0.02, 0.04],
    'cache_ratings': [True]
}

print("Starting SVD++ hyperparameter tuning with GridSearchCV...")
gs = GridSearchCV(SVDpp, param_grid, measures=['rmse'], cv=3, joblib_verbose=2)
gs.fit(train_data)

# --- Get the best SVD++ model and TRAIN IT ---
print("\nGridSearchCV for SVD++ finished.")
print(f"Best cross-validation RMSE score: {gs.best_score['rmse']:.4f}")
print("Best parameters:", gs.best_params['rmse'])

# The best estimator is an UNTRAINED algorithm with the best parameters
svd_model_blueprint = gs.best_estimator['rmse']

# Build the full trainset from ALL of our training data
full_trainset = train_data.build_full_trainset()

# Refit the best configuration on the full training set;
# the grid search itself only fits on cross-validation folds
print("\nTraining the best SVD++ model on the full training set...")
svd_model_blueprint.fit(full_trainset)

# Now, the model is fully trained and we can use it for predictions
svd_model = svd_model_blueprint

# --- Final Evaluation on the held-out validation set ---
val_data = Dataset.load_from_df(val_df[['userId', 'movieId', 'rating']], reader)
valset = val_data.build_full_trainset().build_testset()

predictions = svd_model.test(valset)
print("\nValidation RMSE of the final SVD++ model:")
accuracy.rmse(predictions)

Starting SVD++ hyperparameter tuning with GridSearchCV...


[Parallel(n_jobs=1)]: Done  40 tasks      | elapsed: 82.5min
[Parallel(n_jobs=1)]: Done  48 out of  48 | elapsed: 108.7min finished



GridSearchCV for SVD++ finished.
Best cross-validation RMSE score: 0.8768
Best parameters: {'n_factors': 80, 'n_epochs': 20, 'lr_all': 0.007, 'reg_all': 0.04, 'cache_ratings': True}

Training the best SVD++ model on the full training set...

Validation RMSE of the final SVD++ model:
RMSE: 0.8626


0.8625526193406733

## 4. Data Encoding for PyTorch Models

In [None]:
def encode_data(df, train_ref=None):
    """Encodes user and movie ids as contiguous integers (0..n-1); ids unseen in train_ref are dropped."""
    df = df.copy()

    if train_ref is not None:
        user_map = {o: i for i, o in enumerate(train_ref['userId'].unique())}
        movie_map = {o: i for i, o in enumerate(train_ref['movieId'].unique())}
    else:
        user_map = {o: i for i, o in enumerate(df['userId'].unique())}
        movie_map = {o: i for i, o in enumerate(df['movieId'].unique())}

    df["userId"] = df["userId"].map(user_map).fillna(-1).astype(int)
    df["movieId"] = df["movieId"].map(movie_map).fillna(-1).astype(int)

    df = df[(df["userId"] >= 0) & (df["movieId"] >= 0)]
    return df, user_map, movie_map

# Encode the datasets and get the mappings
df_train_encoded, user_map, movie_map = encode_data(train_df)
df_val_encoded, _, _ = encode_data(val_df, train_ref=train_df)

# Invert mappings for later use (to get original IDs back)
user_inv_map = {i: o for o, i in user_map.items()}
movie_inv_map = {i: o for o, i in movie_map.items()}

num_users = len(user_map)
num_items = len(movie_map)

print(f"Number of unique users: {num_users}")
print(f"Number of unique movies: {num_items}")
print("\nEncoded Training Data Head:")
print(df_train_encoded.head())

Number of unique users: 610
Number of unique movies: 8985

Encoded Training Data Head:
   userId  movieId  rating  timestamp
0       0        0     4.0  964982703
2       0        1     4.0  964982224
3       0        2     5.0  964983815
4       0        3     5.0  964982931
5       0        4     3.0  964982400
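
The encoding idea can be shown on a toy list: raw IDs get contiguous indices in first-seen order, and any validation ID missing from the training index is marked `-1` so it can be filtered out. This is a plain-Python sketch; `build_index` and the sample IDs are hypothetical and stand in for the pandas-based `encode_data` above.

```python
def build_index(ids):
    """Map each distinct raw ID to a contiguous index in first-seen order."""
    index = {}
    for raw in ids:
        index.setdefault(raw, len(index))
    return index

train_movies = [50, 7, 50, 1200]   # raw IDs, with a repeat
val_movies = [7, 999, 1200]        # 999 never appears in training

movie_index = build_index(train_movies)
encoded_val = [movie_index.get(m, -1) for m in val_movies]

print(movie_index)   # {50: 0, 7: 1, 1200: 2}
print(encoded_val)   # [1, -1, 2]
```

Contiguous indices matter because `nn.Embedding(num_items, emb_size)` is indexed by position, so the largest index must be smaller than `num_items`.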


## 5. Models: MF, MF_bias, and Neural Network

Now, we build our custom models in PyTorch. This includes two matrix factorization models (one plain, one with biases) and a neural network model for the optional enhancement.

In [None]:
class MF(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.user_emb.weight.data.uniform_(0, 0.05)
        self.item_emb.weight.data.uniform_(0, 0.05)
    def forward(self, u, v):
        return (self.user_emb(u) * self.item_emb(v)).sum(1)

class MF_bias(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100):
        super(MF_bias, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.user_bias = nn.Embedding(num_users, 1)
        self.item_bias = nn.Embedding(num_items, 1)
        self.user_emb.weight.data.uniform_(0, 0.05)
        self.item_emb.weight.data.uniform_(0, 0.05)
        self.user_bias.weight.data.uniform_(-0.01, 0.01)
        self.item_bias.weight.data.uniform_(-0.01, 0.01)
    def forward(self, u, v):
        dot = (self.user_emb(u) * self.item_emb(v)).sum(1)
        u_bias = self.user_bias(u).squeeze()
        v_bias = self.item_bias(v).squeeze()
        return dot + u_bias + v_bias

class CollabFNet(nn.Module):
    def __init__(self, num_users, num_items, emb_size=100, n_hidden=10):
        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_emb = nn.Embedding(num_items, emb_size)
        self.lin1 = nn.Linear(emb_size * 2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)
    def forward(self, u, v):
        x = F.relu(torch.cat([self.user_emb(u), self.item_emb(v)], dim=1))
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.lin2(x)
        return x.squeeze()
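
The expression `(self.user_emb(u) * self.item_emb(v)).sum(1)` computes one dot product per (user, item) pair in the batch. The sketch below reproduces that computation in NumPy rather than PyTorch so it runs without a model; the array sizes and the helper name `batch_scores` are illustrative assumptions.

```python
import numpy as np

def batch_scores(user_emb, item_emb, u, v):
    """Row-wise dot product between selected user and item embeddings."""
    # Fancy indexing plays the role of the embedding lookup
    return (user_emb[u] * item_emb[v]).sum(axis=1)

rng = np.random.default_rng(0)
user_emb = rng.uniform(0, 0.05, size=(4, 3))  # 4 users, 3 factors
item_emb = rng.uniform(0, 0.05, size=(5, 3))  # 5 items, 3 factors

u = np.array([0, 2, 3])  # user index for each rating in the batch
v = np.array([1, 1, 4])  # item index for each rating in the batch
scores = batch_scores(user_emb, item_emb, u, v)
print(scores.shape)  # (3,)
```

With embeddings drawn from `uniform(0, 0.05)` these scores start close to zero while the ratings range from 0.5 to 5.0, which is consistent with the large losses in the first training epochs.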

## 6. Training the Models

In [None]:
import copy

def train_pytorch_model(model, epochs=40, lr=0.01, wd=1e-4):
    """
    Improved training loop with Early Stopping and Learning Rate Scheduling.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    # Reduces learning rate when validation loss has stopped improving
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', patience=3, factor=0.5)

    users_train = torch.LongTensor(df_train_encoded.userId.values)
    items_train = torch.LongTensor(df_train_encoded.movieId.values)
    ratings_train = torch.FloatTensor(df_train_encoded.rating.values)

    users_val = torch.LongTensor(df_val_encoded.userId.values)
    items_val = torch.LongTensor(df_val_encoded.movieId.values)
    ratings_val = torch.FloatTensor(df_val_encoded.rating.values)

    # --- Early Stopping variables ---
    best_val_loss = float('inf')
    best_model_state = None
    epochs_no_improve = 0
    patience = 5  # Number of epochs to wait for improvement before stopping

    for i in range(epochs):
        model.train()
        y_hat = model(users_train, items_train)
        loss = F.mse_loss(y_hat, ratings_train)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            y_hat_val = model(users_val, items_val)
            val_loss = F.mse_loss(y_hat_val, ratings_val)

        # Update the learning rate scheduler
        scheduler.step(val_loss)

        if (i+1) % 2 == 0:
            print(f"Epoch {i+1}/{epochs} - Train Loss: {loss.item():.4f}, Val Loss: {val_loss.item():.4f}")

        # --- Early Stopping logic ---
        if val_loss.item() < best_val_loss:
            best_val_loss = val_loss.item()  # store a plain float, not a tensor
            # Use deepcopy to ensure the best model state is not a reference
            best_model_state = copy.deepcopy(model.state_dict())
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1

        if epochs_no_improve == patience:
            print(f"\nEarly stopping triggered after {i+1} epochs.")
            break

    # Load the best model state before returning
    if best_model_state is not None:
        print(f"\nFinished training. Loading best model with Val Loss: {best_val_loss:.4f}")
        model.load_state_dict(best_model_state)

    return model

# --- Train all three PyTorch models with the training function above ---

# Train the MF model
print("--- Training MF Model ---")
mf_model = MF(num_users, num_items, emb_size=100)
mf_model = train_pytorch_model(mf_model, epochs=40, lr=0.02, wd=1e-5)

# Train the MF_bias model with the tuned aggressive learning rate
print("\n--- Training MF_bias Model (Tuned) ---")
mf_bias_model = MF_bias(num_users, num_items, emb_size=100)
mf_bias_model = train_pytorch_model(mf_bias_model, epochs=40, lr=0.05, wd=1e-5)

# Train the CollabFNet model
print("\n--- Training Neural Network Model ---")
collab_net_model = CollabFNet(num_users, num_items, emb_size=100)
collab_net_model = train_pytorch_model(collab_net_model, epochs=40, lr=0.01, wd=1e-6)

--- Training MF Model ---
Epoch 2/40 - Train Loss: 11.9766, Val Loss: 10.6572
Epoch 4/40 - Train Loss: 8.9021, Val Loss: 7.0130
Epoch 6/40 - Train Loss: 4.9849, Val Loss: 3.1872
Epoch 8/40 - Train Loss: 1.7434, Val Loss: 1.0692
Epoch 10/40 - Train Loss: 1.0928, Val Loss: 1.8434
Epoch 12/40 - Train Loss: 2.3794, Val Loss: 2.7812

Early stopping triggered after 13 epochs.

Finished training. Loading best model with Val Loss: 1.0692

--- Training MF_bias Model (Tuned) ---
Epoch 2/40 - Train Loss: 9.1600, Val Loss: 4.4257
Epoch 4/40 - Train Loss: 1.1516, Val Loss: 2.5739
Epoch 6/40 - Train Loss: 3.7498, Val Loss: 2.6419
Epoch 8/40 - Train Loss: 1.0760, Val Loss: 0.9966
Epoch 10/40 - Train Loss: 0.7808, Val Loss: 1.1167
Epoch 12/40 - Train Loss: 1.1169, Val Loss: 1.4308
Epoch 14/40 - Train Loss: 1.3039, Val Loss: 1.4355

Early stopping triggered after 14 epochs.

Finished training. Loading best model with Val Loss: 0.9762

--- Training Neural Network Model ---
Epoch 2/40 - Train Loss: 10.05
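
The early-stopping rule in `train_pytorch_model` can be isolated from the model and replayed on a list of validation losses. The sketch below is plain Python with a made-up loss sequence; `run_early_stopping` is a hypothetical helper, and the numbers are not taken from the runs above.

```python
def run_early_stopping(val_losses, patience=5):
    """Replay early stopping over a loss sequence.

    Returns (best_epoch, best_loss, last_epoch_run), with 1-based epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_no_improve = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
        if epochs_no_improve == patience:
            return best_epoch, best_loss, epoch
    return best_epoch, best_loss, len(val_losses)

# Illustrative losses: the minimum at epoch 3 is followed by five worse epochs
losses = [3.0, 2.0, 1.0, 1.5, 1.2, 1.1, 1.3, 1.4, 0.5]
print(run_early_stopping(losses, patience=5))  # (3, 1.0, 8)
```

The final `0.5` is never reached, which shows the trade-off: a small `patience` saves time but can stop before a late improvement, especially when the loss oscillates as it does in the logs above.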

## 7. Evaluation Metrics: Precision@K, Recall@K, NDCG@K

In [None]:
def get_top_n_for_eval(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions."""
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = [iid for (iid, _) in user_ratings[:n]]

    return top_n

def calculate_ranking_metrics(predictions, k=10, threshold=4.0):
    """Calculate Precision@K, Recall@K, and NDCG@K."""

    # Get the top-K recommendations for each user
    top_k = get_top_n_for_eval(predictions, n=k)

    # Get the actual relevant items for each user
    actuals = defaultdict(list)
    for uid, iid, r, _, _ in predictions:
        if r >= threshold:
            actuals[uid].append(iid)

    precisions = dict()
    recalls = dict()
    ndcgs = dict()

    for uid, recs in top_k.items():
        if uid not in actuals:
            continue # User has no relevant items in validation set

        # Metrics calculation
        hits = len(set(recs) & set(actuals[uid]))
        precisions[uid] = hits / k
        recalls[uid] = hits / len(actuals[uid]) if actuals[uid] else 0

        # NDCG calculation
        relevance_scores = [1 if item in actuals[uid] else 0 for item in recs]
        dcg = sum([rel / math.log2(i + 2) for i, rel in enumerate(relevance_scores)])
        idcg = sum([1 / math.log2(i + 2) for i in range(min(len(actuals[uid]), k))])
        ndcgs[uid] = dcg / idcg if idcg > 0 else 0

    avg_precision = sum(prec for prec in precisions.values()) / len(precisions) if precisions else 0
    avg_recall = sum(rec for rec in recalls.values()) / len(recalls) if recalls else 0
    avg_ndcg = sum(ndcg for ndcg in ndcgs.values()) / len(ndcgs) if ndcgs else 0

    return {"Precision@K": avg_precision, "Recall@K": avg_recall, "NDCG@K": avg_ndcg}

# Evaluate the baseline SVD model
print("--- Evaluating Baseline SVD Model ---")
svd_predictions = svd_model.test(valset)
svd_metrics = calculate_ranking_metrics(svd_predictions)
print(svd_metrics)

# --- Helper function to evaluate PyTorch models ---
def evaluate_pytorch_model(model, val_df, k=10, threshold=4.0):
    """Score every validation row whose user and movie were seen in training."""
    model.eval()

    # Map original IDs to training indices; unseen IDs become NaN and are skipped
    enc_u = val_df['userId'].map(user_map)
    enc_i = val_df['movieId'].map(movie_map)
    known = enc_u.notna() & enc_i.notna()
    rows = val_df[known]

    # Predict all known rows in a single batch instead of one row at a time
    users = torch.LongTensor(enc_u[known].astype(int).values)
    items = torch.LongTensor(enc_i[known].astype(int).values)
    with torch.no_grad():
        ests = model(users, items).reshape(-1).tolist()

    # Same tuple layout as surprise predictions: (uid, iid, true_r, est, details)
    all_preds = [
        (uid, iid, true_r, est, None)
        for uid, iid, true_r, est in zip(rows['userId'], rows['movieId'], rows['rating'], ests)
    ]

    return calculate_ranking_metrics(all_preds, k=k, threshold=threshold)

# Evaluate the custom PyTorch models
print("\n--- Evaluating PyTorch MF Model ---")
mf_metrics = evaluate_pytorch_model(mf_model, val_df)
print(mf_metrics)

print("\n--- Evaluating PyTorch MF_bias Model ---")
mf_bias_metrics = evaluate_pytorch_model(mf_bias_model, val_df)
print(mf_bias_metrics)

print("\n--- Evaluating PyTorch Neural Network Model ---")
collab_net_metrics = evaluate_pytorch_model(collab_net_model, val_df)
print(collab_net_metrics)

--- Evaluating Baseline SVD Model ---
{'Precision@K': 0.5790540540540541, 'Recall@K': 0.6816506551842707, 'NDCG@K': 0.8070618866342171}

--- Evaluating PyTorch MF Model ---
{'Precision@K': 0.5272419627749577, 'Recall@K': 0.6631758979579774, 'NDCG@K': 0.7337122147650847}

--- Evaluating PyTorch MF_bias Model ---
{'Precision@K': 0.5710659898477157, 'Recall@K': 0.6872413274344841, 'NDCG@K': 0.7989320940005483}

--- Evaluating PyTorch Neural Network Model ---
{'Precision@K': 0.4798646362098139, 'Recall@K': 0.6263591190379837, 'NDCG@K': 0.6613086067122192}
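
To make the three metrics concrete, here is a worked example for a single user, following the same binary-relevance definitions as `calculate_ranking_metrics`. The helper `metrics_at_k` and the item labels are hypothetical.

```python
import math

def metrics_at_k(ranked_items, relevant_items, k):
    """Precision@K, Recall@K and NDCG@K for one user with binary relevance."""
    top_k = ranked_items[:k]
    relevant = set(relevant_items)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    # DCG discounts each hit by log2(rank + 1), with ranks starting at 1
    dcg = sum(1 / math.log2(pos + 2) for pos, item in enumerate(top_k) if item in relevant)
    # Ideal DCG places all relevant items (up to k) at the top
    idcg = sum(1 / math.log2(pos + 2) for pos in range(min(len(relevant), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

ranked = ["A", "B", "C", "D"]   # sorted by predicted rating, best first
relevant = ["A", "C", "E"]      # items with true rating >= threshold
p, r, n = metrics_at_k(ranked, relevant, k=3)
print(round(p, 3), round(r, 3), round(n, 3))  # 0.667 0.667 0.704
```

In the notebook, the ranked list for each user contains only the movies that user rated in the validation set, so these scores measure how well a model orders already-rated movies rather than how well it retrieves them from the full catalogue.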


In [None]:
def recommend_movies(user_id, N, model=svd_model):
    """
    Recommends N movies for a given user using the best-performing model (SVD++).

    Args:
        user_id (int): The original ID of the user.
        N (int): The number of movies to recommend.
        model: The trained surprise SVD++ model.

    Returns:
        pd.DataFrame: A DataFrame with the top N recommended movie titles and genres.
                      Returns a message if the user is unknown.
    """
    # Check if the user ID exists in the training data
    try:
        # Surprise uses inner IDs, but we can check if the raw ID is known
        model.trainset.to_inner_uid(user_id)
    except ValueError:
        return f"User ID {user_id} not found in the training data."

    # Get a list of all movie IDs from the training set
    all_movie_ids = train_df['movieId'].unique()

    # Get movies the user has already rated from the training set
    # (a set gives O(1) membership checks in the filter below)
    rated_movie_ids = set(train_df[train_df['userId'] == user_id]['movieId'])

    # Predict ratings for movies the user has NOT rated
    unrated_movie_ids = [mid for mid in all_movie_ids if mid not in rated_movie_ids]
    predictions = [model.predict(user_id, movie_id) for movie_id in unrated_movie_ids]

    # Sort the predictions by the estimated rating
    predictions.sort(key=lambda x: x.est, reverse=True)

    # Get the top N movie IDs from the sorted predictions
    top_n_movie_ids = [pred.iid for pred in predictions[:N]]

    # Get movie titles and genres from the original movies dataframe
    movies_df = pd.read_csv(PATH/"movies.csv")
    recommendations = movies_df[movies_df['movieId'].isin(top_n_movie_ids)]

    # Reorder the dataframe to match the recommendation order
    recommendations = recommendations.set_index('movieId').loc[top_n_movie_ids].reset_index()

    return recommendations[['title', 'genres']]
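
The selection step inside `recommend_movies` (drop already-rated movies, then keep the N highest estimates) can also be written without a full sort. The sketch below uses `heapq.nlargest` on a dictionary of made-up scores; `top_n_unseen` is a hypothetical helper, not a replacement the notebook requires.

```python
import heapq

def top_n_unseen(scores, seen, n):
    """Return the n highest-scoring item IDs that are not in `seen`.

    scores : dict mapping item ID -> predicted rating
    seen   : set of item IDs the user has already rated
    """
    candidates = ((item, est) for item, est in scores.items() if item not in seen)
    best = heapq.nlargest(n, candidates, key=lambda pair: pair[1])
    return [item for item, _ in best]

# Illustrative predicted ratings for four movies
scores = {1: 4.5, 2: 3.0, 3: 4.9, 4: 4.7}
print(top_n_unseen(scores, seen={3}, n=2))  # [4, 1]
```

For a catalogue of a few thousand movies the difference from sorting is small; the main point is that filtering by a set of seen IDs and ranking by estimate are two separate steps.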



In [None]:
# --- Example Usage ---
# Get 10 recommendations for user with original ID 100
user_to_recommend = 100
num_recommendations = 10

recommended_list = recommend_movies(user_to_recommend, num_recommendations)
print(f"Top {num_recommendations} recommendations for User ID {user_to_recommend}:\n")
print(recommended_list.to_string(index=False))

Top 10 recommendations for User ID 100:

                                                                                                             title                                genres
                                                                                         Lawrence of Arabia (1962)                   Adventure|Drama|War
Neon Genesis Evangelion: The End of Evangelion (Shin seiki Evangelion Gekijô-ban: Air/Magokoro wo, kimi ni) (1997) Action|Animation|Drama|Fantasy|Sci-Fi
                                                                   Seventh Seal, The (Sjunde inseglet, Det) (1957)                                 Drama
                                                                                  Streetcar Named Desire, A (1951)                                 Drama
                                                                                                Rear Window (1954)                      Mystery|Thriller
                                         

In [None]:
import os
import pickle

# Create a new directory for the Streamlit deployment files
os.makedirs("hf_space_streamlit_deploy", exist_ok=True)

# 1. Save the best model (the trained SVD++ model)
with open('hf_space_streamlit_deploy/svd_model.pkl', 'wb') as f:
    pickle.dump(svd_model, f)

# 2. Save the training data. The app needs this to know which movies a user has already seen.
train_df.to_csv('hf_space_streamlit_deploy/train_data.csv', index=False)

# 3. Copy the movies.csv file for getting movie titles
!cp ml-latest-small/movies.csv hf_space_streamlit_deploy/movies.csv

print("All necessary files have been saved to the 'hf_space_streamlit_deploy' directory.")

All necessary files have been saved to the 'hf_space_streamlit_deploy' directory.
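
Before shipping the saved files, it can be worth confirming that a pickle round trip returns an equivalent object. The sketch below does this with a stand-in dictionary in a temporary directory; `save_and_reload` and the stand-in are hypothetical, and in the notebook the object would be the trained `svd_model`.

```python
import pickle
import tempfile
from pathlib import Path

def save_and_reload(obj, path):
    """Pickle obj to path, then load it back and return the copy."""
    path = Path(path)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    with path.open("rb") as f:
        return pickle.load(f)

# Stand-in object (illustrative only)
stand_in = {"n_factors": 80, "n_epochs": 20}
with tempfile.TemporaryDirectory() as tmp:
    restored = save_and_reload(stand_in, Path(tmp) / "model.pkl")
print(restored == stand_in)  # True
```

Unpickling the real model requires `scikit-surprise` to be importable in the loading environment, because pickle stores a reference to the class rather than its code.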
