# Data
- movie_ratings_500_id.pkl contains the interactions between users and movies
- movie_metadata.pkl contains detailed information about movies, e.g. genres, actors and directors of the movies.

# Goal
- Compare the performances of different recommender systems
- Construct your own recommender systems


# Baselines

## User-Based Collaborative Filtering
This approach predicts $\hat{r}_{(u,i)}$ by leveraging the ratings given to $i$ by $u$'s similar users. Formally, it is written as:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{v \in \mathcal{N}_i(u)}sim_{(u,v)}r_{vi}}{\sum\limits_{v \in \mathbf{N}_i(u)}|sim_{(u,v)}|}
\end{equation}
where $sim_{(u,v)}$ is the similarity between user $u$ and $v$. Usually, $sim_{(u,v)}$ can be computed by Pearson Correlation or Cosine Similarity.

## Item-Based Collaborative Filtering
This approach exploits the ratings given to similar items by the target user. The idea is formalized as follows:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{j \in \mathcal{N}_u(i)}sim_{(i,j)}r_{ui}}{\sum\limits_{j \in \mathbf{N}_u(i)}|sim_{(i,j)}|}
\end{equation}
where $sim_{(i,j)}$ is the similarity between item $i$ and $j$. Usually, $sim_{(i,j)}$ can be computed by Pearson Correlation or Cosine Similarity.

## Vanilla MF
Vanilla MF is the inner product of vectors that represent users and items. Each user is represented by a vector $\textbf{p}_u \in \mathbb{R}^d$, each item is represented by a vector $\textbf{q}_i \in \mathbb{R}^d$, and $\hat{r}_{(u,i)}$ is computed by the inner product of $\textbf{p}_u $ and $\textbf{q}_i$. The core idea of Vanilla MF is depicted in the followng figure and follows the idea of SVD as we have seen during the TD.

![picture](https://drive.google.com/uc?export=view&id=1EAG31Qw9Ti6hB7VqdONUlijWd4rXVobC)

\begin{equation}
\hat{r}_{(u,i)} = \textbf{p}_u{\textbf{q}_i}^T
\end{equation}

## Some variants of SVD


-  SVD with bias: $\hat{r_{ui}} = \mu + b_u + b_i + {q_i}^Tp_u$
- SVD ++: $\hat{r_{ui}} = \mu + b_u + b_i + {q_i}^T(p_u + |I_u|^{\frac{-1}{2}}\sum\limits_{j \in I_u}y_j)$

## Factorization machine (FM)

FM takes into account user-item interactions and other features, such as users' contexts and items' attributes. It captures the second-order interactions of the vectors representing these features , thereby enriching FM's expressiveness. However, interactions involving less relevant features may introduce noise, as all interactions share the same weight. e.g. You may use FM to consider the features of items.

\begin{equation}
\hat{y}_{FM}(\textbf{X}) = w_0 + \sum\limits_{j =1}^nw_jx_j + \sum\limits_{j=1}^n\sum\limits_{k=j+1}^n\textbf{v}_j^T\textbf{v}_kx_jx_k
\end{equation}

where $\textbf{X} \in \mathbb{R}^n$ is the feature vector, $n$ denotes the number of features, $w_0$ is the global bias, $w_j$ is the bias of the $j$-th feature and $\textbf{v}_j^T\textbf{v}_k$ denotes the bias of interaction between $j$-th feature and $k$-th feature, $\textbf{v}_j \in \mathbb{R}^d$ is the vector representing $j$-th feature.

## MLP

You may also represent users and items by vectors and them feed them into a MLP to make prediction.

## Metrics

- $\begin{equation}
RMSE = \sqrt{\frac{1}{|\mathcal{T}|}\sum\limits_{(u,i)\in\mathcal{T}}{(\hat{r}_{(u,i)}-r_{ui})}^2}
\end{equation}$

- $\begin{equation}
MAE = \frac{1}{|\mathcal{T}|}\sum\limits_{(u,i)\in\mathcal{T}}{|\hat{r}_{(u,i)}-r_{ui}|}
\end{equation}$
-  Bonnus: you may also consider NDCG and HR under the top-k setting


# Requirements
- Try to compare different methods that you have adopted and interpret the results that you have obtained
- Minizing the RMSE and MAE
- Construct a recommender system that returns the top 10 movies *that the user have not viewed yet*.

# Our work

## Step 1 : Import the data

We create a class Data with several methods. The goal here is to centralize the data processing to obtain the relevant dataframes we need to process our data with scikit surprise.

In [2]:
import pandas as pd

class Data:

    metadata:dict
    ratings:dict
    user_ratings:dict
    ratings_df:pd.DataFrame
    metadata_df:pd.DataFrame
    merged_df:pd.DataFrame
    
    def __init__(self, metadata_path:str, ratings_path:str):
        self.metadata = pd.read_pickle(metadata_path)
        self.ratings = pd.read_pickle(ratings_path)
        self.user_ratings = self.get_user_ratings()
        self.ratings_df = self.get_ratings_as_df()
        self.metadata_df = self.get_metadata_as_df()
        self.merged_df = pd.merge(self.ratings_df, self.metadata_df, left_on='movie_id', right_on='movie_id')

    def get_user_ratings(self)->dict:
        output = {}
        for k, array in self.ratings.items():
            for v in array:
                user_movie = {
                    'user_rating': int(v['user_rating']),
                    'movie_id': k
                }
                user_id = v['user_id']

                if user_id in output.keys():
                    output[int(user_id)].append(user_movie)
                else:
                    output[int(user_id)] = [user_movie]
        return output

    def get_ratings_as_df(self)->pd.DataFrame:
        output = []

        for film, rating in self.ratings.items():
            for index in rating:
                index['movie_id'] = film
                del index['user_rating_date']
                output.append(index)
    
        return pd.DataFrame(output)
    
    def get_metadata_as_df(self)->pd.DataFrame:
        output = []

        for movie_id, movie_data in self.metadata.items():
            movie_data['genre'] = ",".join(movie_data['genre'])
            movie_data['actors'] = ",".join(movie_data['actors'])
            output.append({'movie_id': movie_id, **movie_data})

        return pd.DataFrame(output)
    
data = Data(
    metadata_path='movie_metadata.pkl', 
    ratings_path='movie_ratings_500_id.pkl'
)

data.merged_df

Unnamed: 0,user_rating,user_id,movie_id,director,genre,actors,title
0,4,1380819,tt0305224,Peter Segal,Comedy,"Jack Nicholson,Adam Sandler,Marisa Tomei,Woody...",Anger Management
1,3,185150,tt0305224,Peter Segal,Comedy,"Jack Nicholson,Adam Sandler,Marisa Tomei,Woody...",Anger Management
2,4,1351377,tt0305224,Peter Segal,Comedy,"Jack Nicholson,Adam Sandler,Marisa Tomei,Woody...",Anger Management
3,2,386143,tt0305224,Peter Segal,Comedy,"Jack Nicholson,Adam Sandler,Marisa Tomei,Woody...",Anger Management
4,3,2173336,tt0305224,Peter Segal,Comedy,"Jack Nicholson,Adam Sandler,Marisa Tomei,Woody...",Anger Management
...,...,...,...,...,...,...,...
259813,5,1139877,tt0361862,Brad Anderson,"Drama,Thriller","Christian Bale,Jennifer Jason Leigh,Aitana Sán...",The Machinist
259814,4,1460015,tt0361862,Brad Anderson,"Drama,Thriller","Christian Bale,Jennifer Jason Leigh,Aitana Sán...",The Machinist
259815,5,1098265,tt0361862,Brad Anderson,"Drama,Thriller","Christian Bale,Jennifer Jason Leigh,Aitana Sán...",The Machinist
259816,4,1962894,tt0361862,Brad Anderson,"Drama,Thriller","Christian Bale,Jennifer Jason Leigh,Aitana Sán...",The Machinist


# Step 2 : Obtain the movies that a user haven't viewed yet

## Creation of a base recommender class
Out goal is to create a base class that implements 3 methods : train, test and evaluate. Train will be overidden by a superclass (one for each recommender we want to put on the bench). Test and evaluate may or may not be overidden depending on the context. It features several attributes that are useful when it comes to train a model with the data we've got earlier.

In [3]:
from surprise import AlgoBase, Dataset, Prediction, Reader
from surprise.model_selection import train_test_split
from surprise import accuracy

# We first create a class that will be used as a base to all the other recommender class we will create

class BaseRecommender:
    model_data:Dataset
    df:pd.DataFrame
    train_set:Dataset
    test_set:Dataset
    model:AlgoBase
    predictions:list[Prediction]

    def __init__(
        self, 
        df:pd.DataFrame=data.merged_df[['user_id', 'movie_id', 'user_rating']], 
        test_size:float=0.2, 
        random_state:int=42
    ):
        self.df = df
        self.model_data = Dataset.load_from_df(df, Reader(rating_scale=(1, 5)))
        self.train_set, self.test_set = train_test_split(self.model_data, test_size=test_size, random_state=random_state)

    def train(self)->None:
        pass

    def test(self, update:bool=True)->list[Prediction]:
        predictions = self.model.test(self.test_set)
        if update:
            self.predictions = predictions
        return predictions
    
    def predict(self, user_id: int | str, item_id: str) -> float:
        if isinstance(user_id, int):
            user_id = str(user_id)

        user_movie_rating = self.df[
            (self.df['user_id'] == user_id) & (self.df['movie_id'] == item_id)
        ]

        if not user_movie_rating.empty:
            actual_rating = user_movie_rating['user_rating'].values[0]
            return float(actual_rating)
        else:
            return self.model.predict(user_id, item_id).est
    
    def get_top_movies_for_user(self, user_id: int | str, n: int = 10) -> pd.DataFrame:
        if isinstance(user_id, int):
            user_id = str(user_id)

        # Get the list of movies the user has already watched
        watched_movies = set(self.df[self.df['user_id'] == int(user_id)]['movie_id'])

        # Generate predictions for all movies in the dataset
        all_movies = set(self.df['movie_id'])
        candidate_movies = list(all_movies - watched_movies)
        predictions = [(user_id, movie_id, self.model.predict(user_id, movie_id).est) for movie_id in candidate_movies]

        # Sort the predictions by estimated rating in descending order
        sorted_predictions = sorted(predictions, key=lambda x: x[2], reverse=True)

        # Select the top N movies
        top_n_movies = sorted_predictions[:n]

        # Create a DataFrame with the results
        result_df = pd.DataFrame(top_n_movies, columns=['user_id', 'movie_id', 'estimated_rating'])

        return result_df
    
    def evaluate(self)->pd.DataFrame:
        assert self.predictions is not None
        assert len(self.predictions) >= 0
        return pd.DataFrame([{
            'model': self.__class__.__name__,
            'rmse': accuracy.rmse(self.predictions, verbose=False),
            'mae': accuracy.mae(self.predictions, verbose=False)
        }])


## Analysis of several models and parameters

1. User-based collaborative filtering
   - With cosine
   - With Pearson
2. Item-based collaborative filtering
   - With cosine
   - With Pearson

We use the KNNBasic model for this approach as it is the most relevant model for Collaborative filtering. Its native supports of cosine/pearson and user/item base as parameters makes it a relevant choice when calculating metrics.

In order to integrate the full array of data points we have at our disposal, we provide the model with two similarity functions with a goal to leverage directors, genres and actors in the process. It is worth noting that the two latter, when they are plural, have been comma-joined in a single string. The usage of isin() as showcased in the above-mentioned functions is relevant because of the presence of all the data in one field, giving a simpler similiraty analysis than another table with joins or similars.

In [4]:
from surprise import KNNBasic
# We create and instanciate a first recommender system with the User-Based or Item-based Collaborative Filtering method

class CollaborativeFiltering(BaseRecommender):
    model:KNNBasic
    similarity_method:str
    user_based:bool

    def __init__(self, user_based:bool=True, similarity_method:str='cosine'):
        super().__init__()
        self.similarity_method = similarity_method
        self.user_based = user_based

    def train(self)->None:
        sim_options = {
            'name': self.similarity_method,
            'user_based': self.user_based,
            'user_custom_similarity': self.user_similarity_function if self.user_based else self.item_similarity_function
        }
        knn = KNNBasic(
            verbose=False,
            sim_options=sim_options
        )
        knn.fit(self.train_set)
        self.model = knn

    def evaluate(self) -> pd.DataFrame:
        output_df = super().evaluate()
        output_df['Similarity method'] = self.similarity_method.capitalize()
        output_df['User or Item'] = 'User' if self.user_based else 'Item'
        return output_df

    def user_similarity_function(user1:pd.DataFrame, user2:pd.DataFrame, metadata_df:pd.DataFrame)->int:
        # Compare directors, genres, and actors
        common_director = metadata_df[
            metadata_df['movie_id'].isin(user1['movie_id'])
        ]['director'].isin(
            metadata_df[
                metadata_df['movie_id'].isin(user2['movie_id'])
            ]['director']
        ).sum()
        
        common_genre = metadata_df[
            metadata_df['movie_id'].isin(user1['movie_id'])
        ]['genre'].isin(
            metadata_df[
                metadata_df['movie_id'].isin(user2['movie_id'])
        ]['genre']).sum()
        
        common_actors = metadata_df[
            metadata_df['movie_id'].isin(user1['movie_id'])
        ]['actors'].isin(
            metadata_df[
                metadata_df['movie_id'].isin(user2['movie_id'])
        ]['actors']).sum()

        total_common = common_director + common_genre + common_actors

        # Return a similarity score
        return total_common

    def item_similarity_function(item1:pd.DataFrame, item2:pd.DataFrame, metadata_df:pd.DataFrame)->int:
        # Compare directors, genres, and actors
        common_director = metadata_df[metadata_df['movie_id'].isin([item1, item2])]['director'].nunique()
        common_genre = metadata_df[metadata_df['movie_id'].isin([item1, item2])]['genre'].nunique()
        common_actors = metadata_df[metadata_df['movie_id'].isin([item1, item2])]['actors'].nunique()

        total_common = common_director + common_genre + common_actors

        # Return a similarity score
        return total_common

cosine_user_based_collaborative_filtering = CollaborativeFiltering(user_based=True, similarity_method='cosine')
cosine_user_based_collaborative_filtering.train()
cosine_user_based_collaborative_filtering.test()

pearson_user_based_collaborative_filtering = CollaborativeFiltering(user_based=True, similarity_method='pearson')
pearson_user_based_collaborative_filtering.train()
pearson_user_based_collaborative_filtering.test()

cosine_item_based_collaborative_filtering = CollaborativeFiltering(user_based=False, similarity_method='cosine')
cosine_item_based_collaborative_filtering.train()
cosine_item_based_collaborative_filtering.test()

pearson_item_based_collaborative_filtering = CollaborativeFiltering(user_based=False, similarity_method='pearson')
pearson_item_based_collaborative_filtering.train()
pearson_item_based_collaborative_filtering.test()

pd.concat([
    cosine_user_based_collaborative_filtering.evaluate(),
    pearson_user_based_collaborative_filtering.evaluate(),
    cosine_item_based_collaborative_filtering.evaluate(),
    pearson_item_based_collaborative_filtering.evaluate()
], ignore_index=True)

Unnamed: 0,model,rmse,mae,Similarity method,User or Item
0,CollaborativeFiltering,1.038282,0.831683,Cosine,User
1,CollaborativeFiltering,1.042457,0.836095,Pearson,User
2,CollaborativeFiltering,1.032445,0.808388,Cosine,Item
3,CollaborativeFiltering,1.033257,0.808883,Pearson,Item


3. VanillaMF
4. SVD with bias
5. SVD++

The three algorithms are based on the Matrix Factorization principle as highlighted in [Surprise's documentation](https://surprise.readthedocs.io/en/stable/matrix_factorization.html). The main inconvenience of these algorithms from our point of view lies in the absence of incorporation of the metadata in the recommendation process, given the only columns taken are the user's id, the item's id and the tuple's rating. This is a constraint inherent from Dataset.load_from_df() method that only accept these parameters, in this order, as explained in the method's docstring.

Therefore, our approach is to build a second recommender base for the Matrix Factorization principle enabling an effective construction of these three models having much in common.

We use random_state=True because we found out that this parameter overalls reduces the MAE & RMSE independently of the other parameters. The recommender can be fined-tuned with n_factors & n_epochs that are directly forwarded to the NMF model.

In [5]:
from pandas.core.api import DataFrame as DataFrame
from surprise import NMF, SVD, SVDpp

class MatrixFactorization(BaseRecommender):
    model:NMF|SVD|SVDpp
    n_factors:int
    n_epochs:int
    biased:bool

    def __init__(self, n_factors:int=15,n_epochs:int=50, biased:bool=False):
        super().__init__()
        self.n_factors = n_factors
        self.n_epochs = n_epochs
        self.biased = biased

    def train(self)->None:
        pass
    
    def evaluate(self) -> DataFrame:
        output_df = super().evaluate()
        output_df['Factors'] = self.n_factors
        output_df['Epochs'] = self.n_epochs
        output_df['Biased'] = 'Yes' if self.biased else 'No'
        return output_df

class VanillaMF(MatrixFactorization):
    model:NMF

    def train(self)->None:
        nmf = NMF(
            n_factors=self.n_factors,
            n_epochs=self.n_epochs,
            biased=self.biased,
            random_state=True
        )
        nmf.fit(self.train_set)
        self.model = nmf

class SVDBias(MatrixFactorization):
    model:SVD

    def __init__(self, n_factors:int=15, n_epochs:int=50):
        super().__init__(n_factors,n_epochs, biased=True)

    def train(self)->None:
        svd = SVD(
            n_factors=self.n_factors,
            n_epochs=self.n_epochs,
            biased=self.biased,
            random_state=True
        )
        svd.fit(self.train_set)
        self.model = svd

class SVDPlusPlus(MatrixFactorization):
    model:SVDpp

    def train(self)->None:
        svdpp = SVDpp(
            n_factors=self.n_factors,
            n_epochs=self.n_epochs,
            random_state=True
        )
        svdpp.fit(self.train_set)
        self.model = svdpp

vanilla_mf = VanillaMF()
vanilla_mf.train()
vanilla_mf.test()

svd_with_bias = SVDBias()
svd_with_bias.train()
svd_with_bias.test()

svd_plus_plus = SVDPlusPlus()
svd_plus_plus.train()
svd_plus_plus.test()

pd.concat([
    vanilla_mf.evaluate(),
    svd_with_bias.evaluate(),
    svd_plus_plus.evaluate()
])

Unnamed: 0,model,rmse,mae,Factors,Epochs,Biased
0,VanillaMF,1.015985,0.796611,15,50,No
0,SVDBias,0.974461,0.764493,15,50,Yes
0,SVDPlusPlus,0.98653,0.771494,15,50,No


## Building our own recommender

We will be using our base class RecommenderBase combined with the AlgoBase Surprise's class possibilities.
For this, we would like to leverage the jaccard distance between each tuple of the dataset.

Our approach is divided in two steps :
1. Building a model based on Surprise's Algobase class
2. Creating a recommender with our previous BaseRecommender class and use our new Jaccard-based model as a class variable like previously done.

To achieve this, we will design a first class that will simply calculate the jaccard distance between rows containing only user_id, movie_id and user_rating.

Then we will provide an ehnancement to include the metadata (director, actors, genres) in the process in order to have a more complete predicter.

In [45]:
from typing import Any
import numpy as np
from surprise import AlgoBase, PredictionImpossible, Trainset
from sklearn.metrics import jaccard_score
from torch import cosine_similarity

class JaccardDistanceAlgorithm(AlgoBase):
    def __init__(self, sim_options:dict={}, **kwargs:Any):
        AlgoBase.__init__(self, sim_options=sim_options, **kwargs)

    def fit(self, trainset:Trainset)->'JaccardDistanceAlgorithm':
        AlgoBase.fit(self, trainset)
        return self

    def calculate_jaccard_similarity(self, rated_items_i:set, rated_items_other:set)->float:
        return len(rated_items_i.intersection(rated_items_other)) / len(rated_items_i.union(rated_items_other))
    
    def estimate(self, u: int | str, i: str) -> float:
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            # If the user or item is not in the training_set
            return self.default_prediction()

        user_ratings = set([item_id for item_id, _ in self.trainset.ur[u]])

        # Check if the user has rated any items
        if not user_ratings:
            return self.default_prediction()

        # Check if the user has rated the current item
        if i not in user_ratings:
            return self.default_prediction()

        # Calculate Jaccard similarity-based prediction
        numerator = 0.0
        denominator = 0.0

        for other_item, _ in self.trainset.ur[u]:
            # Check if the user has rated both the current item (i) and the other item
            if self.trainset.knows_item(i) and self.trainset.knows_item(other_item):
                try:
                    # Get the set of items rated by the user for the current item and the other item
                    rated_items_i = set([item_id for item_id, _ in self.trainset.ir[i]])
                    rated_items_other = set([item_id for item_id, _ in self.trainset.ir[other_item]])

                    # Calculate Jaccard similarity
                    jaccard_similarity = self.calculate_jaccard_similarity(rated_items_i, rated_items_other)

                    # Retrieve the rating for the other item from the user's ratings
                    rating_other = next((rating for item_id, rating in self.trainset.ur[u] if item_id == other_item), None)

                    if rating_other is not None:
                        # Weighted sum of similarity and rating
                        numerator += rating_other * jaccard_similarity
                        denominator += jaccard_similarity
                except KeyError:
                    return self.default_prediction()                    

        if denominator == 0:
            # If no similarity found
            return self.default_prediction()

        return numerator / denominator
    
class JaccardRecommender(BaseRecommender):
    model:JaccardDistanceAlgorithm

    def train(self)->None:
        jda = JaccardDistanceAlgorithm()
        jda.fit(self.train_set)
        self.model = jda

jaccard_recommender = JaccardRecommender()
jaccard_recommender.train()
jaccard_recommender.test()
jaccard_recommender.evaluate()

Unnamed: 0,model,rmse,mae
0,JaccardRecommender,1.089952,0.900084


To enhance the class performance, we've integrated metadata into the fit and estimate processes. The shift in logic involves adopting a linear kernel and employing a similarity matrix to compute the Jaccard score. This modification aims for improved efficiency and better results in terms of Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Additionally, we leverage vectorization to optimize the computational process.

In [46]:
from surprise import AlgoBase
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

class EnhancedJaccardAlgorithm(AlgoBase):
    vectorizer:TfidfVectorizer

    def __init__(self, sim_options:dict={}, **kwargs:Any):
        AlgoBase.__init__(self, sim_options=sim_options, **kwargs)
        self.vectorizer = TfidfVectorizer()

    def fit(self, trainset:Trainset)->'EnhancedJaccardAlgorithm':
        AlgoBase.fit(self, trainset)

        user_features = self.extract_features(trainset.ur)
        item_features = self.extract_features(trainset.ir, include_features=True)

        tfidf_matrix = self.vectorizer.fit_transform(user_features + item_features)

        self.similarity_matrix = linear_kernel(tfidf_matrix)

        return self

    def extract_features(self, rating_matrix:dict, include_features:bool=False)->list:
        feature_list = []
        for _, ratings in rating_matrix.items():
            features = [
                f"{movie_id}_{self.get_movie_features(movie_id, include_features)}"
                for movie_id, _ in ratings
            ]
            feature_list.append(" ".join(features))
        return feature_list

    def get_movie_features(self, movie_id:str, include_features:bool=False)->str:
        if include_features:
            return f"director_{movie_id} actors_{movie_id} genres_{movie_id}"
        return f"{movie_id}"

    def estimate(self, u:int|str, i:str)->float:
        try:
            u_id = self.trainset.to_inner_uid(u)
            i_id = self.trainset.to_inner_iid(i)
        except ValueError:
            u_id = "UKN__" + str(u)
            i_id = "UKN__" + str(i)


        if i_id not in self.trainset.ir.keys():
            return self.default_prediction()

        jaccard_similarity = self.similarity_matrix[u_id, i_id]

        estimated_rating = jaccard_similarity * 4 + 1

        return estimated_rating

class JaccardRecommenderWithMetadata(BaseRecommender):
    model:EnhancedJaccardAlgorithm

    def train(self)->None:
        eja = EnhancedJaccardAlgorithm()
        eja.fit(self.train_set)
        self.model = eja

jaccard_recommender_with_metadata = JaccardRecommenderWithMetadata()
jaccard_recommender_with_metadata.train()
jaccard_recommender_with_metadata.test()
jaccard_recommender_with_metadata.evaluate()

Unnamed: 0,model,rmse,mae
0,JaccardRecommenderWithMetadata,1.089952,0.900084


## Overall performance

Our analysis have stated that, with default parameters and without proceeding with fine-tuning, the algorithms that performs the best according to the MAE & RMSE metrics are :

| Recommender                                    | RMSE     | MAE      | Rank |
|------------------------------------------------|----------|----------|------|
| SVD with bias                                  | 0.974461 | 0.764493 | 1    |
| SVD++                                          | 0.986530 | 0.771494 | 2    |
| Vanilla MF                                     | 1.015985 | 0.796611 | 3    |
| Cosine item-based collaborative filtering      | 1.032445 | 0.808388 | 4    |
| Pearson item-based collaborative filtering     | 1.033257 | 0.808883 | 5    |
| Cosine user-based collaborative filtering      | 1.038282 | 0.831683 | 6    |
| Pearson user-based collaborative filtering     | 1.042457 | 0.836095 | 7    |
| Jaccard distance (custom recommender)          | 1.130819 | 0.86635  | 8    |
| Enhanced Jaccard distance (custom recommender) | 1.130819 | 0.86635  | 8    |

Those results demonstrates that, purely metrics speaking, out commitment to incorporate the metadata of the movies (genres, actors & director) in the process have been lacking of relevance. We had the opportunity to incorporate those data twice : In the collaborative filtering recommenders (thanks to the similarity functions) and in the Enhanced Jaccard distance custom recommender (thanks to the vectorization & linear kernel). In both cases, the recommenders performed worse than the matrix-factorization-based ones that did not require this extra data to function.

In the context of a work project, provided that the results of the model are meaningful and matches the problematic we solve, we would advocate for the use of such a system.

## Some recommendations

We will pick one user in the dataset, then give him/her the top 10 movies he/she haven't viewed yet (and should).
This process will be done with every recommender to evaluate the homogenity of the results.

In [47]:
user = 386143

def show_top_10_movies(user:int,recommender:BaseRecommender, name:str)->None:
    if name:
        print(f'========== {name} ==========')
    print(pd.merge(recommender.get_top_movies_for_user(user, 10), data.metadata_df, on='movie_id')[['user_id','movie_id','title','estimated_rating']])
    print('\n')

# show_top_10_movies(user, svd_with_bias, 'SVD with bias')
# show_top_10_movies(user, svd_plus_plus, 'SVD++')
# show_top_10_movies(user, vanilla_mf, 'Vanilla MF')
# show_top_10_movies(user, cosine_item_based_collaborative_filtering, 'Cosine item-based collaborative filtering')
# show_top_10_movies(user, pearson_item_based_collaborative_filtering, 'Pearson item-based collaborative filtering')
# show_top_10_movies(user, cosine_user_based_collaborative_filtering, 'Cosine user-based collaborative filtering')
# show_top_10_movies(user, pearson_user_based_collaborative_filtering, 'Pearson user-based collaborative filtering')
show_top_10_movies(user, jaccard_recommender, 'Jaccard Recommender')
# show_top_10_movies(user, jaccard_recommender_with_metadata, 'Jaccard recommender with metadata')

  user_id   movie_id                 title  estimated_rating
0  386143  tt0128853       You've Got Mail          4.009853
1  386143  tt0276751           About a Boy          4.009656
2  386143  tt0405159   Million Dollar Baby          4.009028
3  386143  tt0286499  Bend It Like Beckham          4.005535
4  386143  tt0313737      Two Weeks Notice          4.003213
5  386143  tt0375679                 Crash          4.000894
6  386143  tt0314331         Love Actually          3.999959
7  386143  tt0372784         Batman Begins          3.998302
8  386143  tt0256415    Sweet Home Alabama          3.993716
9  386143  tt0268978      A Beautiful Mind          3.989106




# Conclusion

