# Introduction

In this notebook, we explore and evaluate several recommendation algorithms using the `Surprise` library, a popular tool for building and evaluating recommender systems. We train three models on the same train/test split, compare their RMSE and MAE on the held-out ratings, and select the model with the lowest RMSE.

## Overview

1. **Data Preparation**:
   - We start by loading the movie and rating data from CSV files. The ratings file supplies the `userId`, `movieId`, and `rating` columns used to train and evaluate the models.

2. **Data Processing**:
   - Using the `Surprise` library, we create a `Dataset` object from the ratings data and split it into training (80%) and test (20%) sets. Separately, we build a sparse matrix of the ratings as a memory-efficient representation; the `Surprise` models do not use it.

3. **Model Training**:
   - We define and train three recommendation models:
     - **SVD (Singular Value Decomposition)**: A matrix factorization technique that learns latent user and item factors (in `Surprise`, fitted with stochastic gradient descent).
     - **NMF (Non-negative Matrix Factorization)**: Another matrix factorization method that ensures all factors are non-negative.
     - **BaselineOnly**: A baseline model that predicts ratings based on user and item biases.

4. **Model Evaluation**:
   - We compute two key performance metrics for each model:
     - **RMSE (Root Mean Square Error)**: The square root of the mean squared prediction error; it penalizes large errors more heavily than small ones.
     - **MAE (Mean Absolute Error)**: The mean of the absolute prediction errors; every error counts in proportion to its size.
   - These metrics are stored and printed for comparison.

5. **Model Selection**:
   - We identify the best model based on the RMSE score. The model with the lowest RMSE is considered the best-performing model.

6. **Model Saving and Loading**:
   - Each trained model is saved with `pickle`, and the best-performing one is then reloaded from disk. This allows us to reuse the model later for generating recommendations or other tasks.
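
Steps 4 and 5 can be illustrated without the library. The following is a minimal plain-Python sketch that assumes a handful of made-up (true, predicted) rating pairs. The helper names `rmse` and `mae` and the `toy_predictions` data are hypothetical and are not part of `Surprise`.

```python
import math

def rmse(pairs):
    """Root mean square error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

# Made-up predictions for two hypothetical models (illustration only).
toy_predictions = {
    "model_a": [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0)],  # three moderate errors
    "model_b": [(4.0, 4.0), (3.0, 3.0), (5.0, 3.0)],  # one large error
}

scores = {
    name: {"RMSE": rmse(pairs), "MAE": mae(pairs)}
    for name, pairs in toy_predictions.items()
}

# Select the model with the lowest RMSE, as in step 5.
best = min(scores, key=lambda name: scores[name]["RMSE"])
```

Both toy models have the same total absolute error, so their MAE is identical, but `model_b` concentrates its error in a single prediction and therefore has the higher RMSE. Selecting on RMSE favors `model_a` here.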

## Code Implementation

The following code performs the steps outlined above:
1. Imports necessary libraries.
2. Loads and prepares the data.
3. Defines and trains multiple recommendation models.
4. Evaluates and compares model performance.
5. Saves each trained model and reloads the best-performing one.
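
Of the three models, `BaselineOnly` is the easiest to work through by hand: it predicts a rating as the global mean plus a user bias plus an item bias. The sketch below is a deliberately simplified illustration of that idea. It estimates the biases in one unregularized pass, which is an assumption made for brevity and not the ALS or SGD procedure that `Surprise` runs. The function names and toy ratings are hypothetical.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """Estimate the global mean and user/item biases from (user, item, rating) triples.

    Simplified: a single unregularized pass, not the library's ALS/SGD fit.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item bias: average deviation of an item's ratings from the global mean.
    item_dev = defaultdict(list)
    for _, item, r in ratings:
        item_dev[item].append(r - mu)
    b_i = {item: sum(d) / len(d) for item, d in item_dev.items()}

    # User bias: average remaining deviation after removing the item bias.
    user_dev = defaultdict(list)
    for user, item, r in ratings:
        user_dev[user].append(r - mu - b_i[item])
    b_u = {user: sum(d) / len(d) for user, d in user_dev.items()}
    return mu, b_u, b_i

def predict_baseline(mu, b_u, b_i, user, item):
    """Predict mu + b_u + b_i; unknown users or items contribute a zero bias."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

# Made-up ratings for illustration only.
toy_ratings = [("u1", "m1", 5.0), ("u1", "m2", 3.0),
               ("u2", "m1", 4.0), ("u2", "m2", 2.0)]
mu, b_u, b_i = fit_baseline(toy_ratings)
```

On this toy data the global mean is 3.5, `m1` is rated one point above average, and `u1` rates half a point above average, so `predict_baseline(mu, b_u, b_i, "u1", "m1")` gives 5.0.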


In [1]:
# Import necessary libraries
import pandas as pd
import pickle
import numpy as np
from surprise import Dataset, Reader, SVD, NMF, BaselineOnly
from surprise.model_selection import train_test_split
from surprise import accuracy
from scipy.sparse import csr_matrix

# Load movie and rating data
movies_data = pd.read_csv('data/movies.csv', encoding='latin-1')  # Movie data
ratings_data = pd.read_csv('data/ratings.csv', encoding='latin-1')  # Ratings data

# Wrap the ratings in a Surprise Dataset (ratings range from 0.5 to 5.0)
reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings_data[['userId', 'movieId', 'rating']], reader)

# Split the data into training (80%) and test (20%) sets
# (the split is random; pass random_state=... for reproducible results)
trainset, testset = train_test_split(data, test_size=0.2)

# Create a sparse user-by-movie ratings matrix indexed by raw IDs
# (saves memory; not used by the Surprise models below)
ratings_matrix = csr_matrix((ratings_data['rating'], (ratings_data['userId'], ratings_data['movieId'])))

# Define the models to be trained
models = {
    'SVD': SVD(),
    'NMF': NMF(),
    'BaselineOnly': BaselineOnly()
}

# Train, evaluate, and save each model
model_performance = {}

for model_name, model in models.items():
    print(f"Training {model_name} model...")
    model.fit(trainset)
    
    # Evaluate model performance on the test set
    predictions = model.test(testset)
    rmse = accuracy.rmse(predictions, verbose=False)
    mae = accuracy.mae(predictions, verbose=False)
    
    model_performance[model_name] = {'RMSE': rmse, 'MAE': mae}
    
    # Save the model using pickle
    with open(f'{model_name}_model.pkl', 'wb') as file:
        pickle.dump(model, file)
    
    print(f"{model_name} - RMSE: {rmse}, MAE: {mae}")

# Print out the performance summary of all models
print("\nModel Performance Summary:")
for model_name, metrics in model_performance.items():
    print(f"{model_name} - RMSE: {metrics['RMSE']}, MAE: {metrics['MAE']}")

# Select the best model based on RMSE
best_model_name = min(model_performance, key=lambda x: model_performance[x]['RMSE'])
print(f"\nBest model based on RMSE: {best_model_name}")

# Load the best model (if needed)
with open(f'{best_model_name}_model.pkl', 'rb') as file:
    best_model = pickle.load(file)

# The `best_model` can now be used for generating recommendations or other tasks.


Training SVD model...
SVD - RMSE: 0.785632909943577, MAE: 0.5888763266768074
Training NMF model...
NMF - RMSE: 0.8711662399478903, MAE: 0.6625935510049581
Training BaselineOnly model...
Estimating biases using als...
BaselineOnly - RMSE: 0.8629859282099892, MAE: 0.6572851763002336

Model Performance Summary:
SVD - RMSE: 0.785632909943577, MAE: 0.5888763266768074
NMF - RMSE: 0.8711662399478903, MAE: 0.6625935510049581
BaselineOnly - RMSE: 0.8629859282099892, MAE: 0.6572851763002336

Best model based on RMSE: SVD
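
As one way to use the reloaded model, the sketch below ranks a list of candidate items for a single user by estimated rating. It relies only on the model exposing `predict(uid, iid)` and returning an object with an `est` field, which is how `Surprise` algorithms behave. The helper `top_n_for_user` and the `StubModel` class are hypothetical; the stub exists only so the example runs without a trained model.

```python
from collections import namedtuple

def top_n_for_user(model, user_id, candidate_item_ids, n=5):
    """Rank candidate items for one user by the model's estimated rating."""
    scored = [
        (item_id, model.predict(user_id, item_id).est)
        for item_id in candidate_item_ids
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical stand-in that mimics the predict(uid, iid) -> .est interface.
_Prediction = namedtuple("_Prediction", ["uid", "iid", "est"])

class StubModel:
    def __init__(self, estimates):
        self._estimates = estimates

    def predict(self, uid, iid):
        # Fall back to a neutral 3.0 for items without a stored estimate.
        return _Prediction(uid, iid, self._estimates.get(iid, 3.0))

stub = StubModel({10: 4.5, 20: 2.0, 30: 3.8})
top = top_n_for_user(stub, user_id=1, candidate_item_ids=[10, 20, 30, 40], n=2)
```

With the notebook's objects, the same helper could be called as `top_n_for_user(best_model, some_user_id, unseen_movie_ids)`, where `some_user_id` and `unseen_movie_ids` are placeholders for a raw `userId` and the raw `movieId` values that user has not yet rated.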
