# Data
- movie_ratings_500_id.pkl contains the interactions between users and movies
- movie_metadata.pkl contains detailed information about movies, e.g. genres, actors and directors of the movies.

# Goal
- Compare the performance of different recommender systems
- Construct your own recommender systems


# Baselines

## User-Based Collaborative Filtering
This approach predicts $\hat{r}_{(u,i)}$ by leveraging the ratings given to $i$ by $u$'s similar users. Formally, it is written as:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{v \in \mathcal{N}_i(u)}sim_{(u,v)}r_{vi}}{\sum\limits_{v \in \mathcal{N}_i(u)}|sim_{(u,v)}|}
\end{equation}
where $\mathcal{N}_i(u)$ is the set of users most similar to $u$ who have rated $i$, and $sim_{(u,v)}$ is the similarity between users $u$ and $v$. Usually, $sim_{(u,v)}$ is computed with Pearson Correlation or Cosine Similarity.
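The formula above can be sketched in a few lines of plain Python. This is only an illustration: the `ratings` dictionary, the user and movie names, and the neighbourhood size `k` are made-up assumptions, and cosine similarity is computed over co-rated items only.

```python
import math

# Toy ratings: user -> {item: rating}. All values are invented for illustration.
ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m3": 5, "m4": 4},
    "carol": {"m1": 1, "m2": 5, "m4": 2},
}

def cosine_sim(u, v):
    """Cosine similarity computed over the items rated by both users."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_user_based(u, i, k=2):
    """Similarity-weighted average of the ratings given to i by u's k nearest users."""
    # N_i(u): users other than u who rated i, ranked by similarity to u
    candidates = [(cosine_sim(u, v), v) for v in ratings if v != u and i in ratings[v]]
    neighbours = sorted(candidates, reverse=True)[:k]
    denom = sum(abs(s) for s, _ in neighbours)
    if denom == 0:
        return None  # no usable neighbour: the formula is undefined
    return sum(s * ratings[v][i] for s, v in neighbours) / denom

prediction = predict_user_based("alice", "m4")
```

Here the prediction for alice on `m4` falls between bob's rating (4) and carol's rating (2), and closer to bob's because bob is the more similar user.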

## Item-Based Collaborative Filtering
This approach exploits the ratings given to similar items by the target user. The idea is formalized as follows:

\begin{equation}
\hat{r}_{(u,i)} = \frac{\sum\limits_{j \in \mathcal{N}_u(i)}sim_{(i,j)}r_{uj}}{\sum\limits_{j \in \mathcal{N}_u(i)}|sim_{(i,j)}|}
\end{equation}
where $\mathcal{N}_u(i)$ is the set of items most similar to $i$ that $u$ has rated, and $sim_{(i,j)}$ is the similarity between items $i$ and $j$. Usually, $sim_{(i,j)}$ is computed with Pearson Correlation or Cosine Similarity.
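As a quick sanity check of the item-based formula, here is a worked example with made-up numbers (the similarities and ratings are assumptions, not values from the dataset). Suppose the target user $u$ has rated two neighbours of $i$, namely $j_1$ and $j_2$, with $sim_{(i,j_1)} = 0.9$, $sim_{(i,j_2)} = 0.3$, $r_{uj_1} = 4$ and $r_{uj_2} = 2$:

\begin{equation}
\hat{r}_{(u,i)} = \frac{0.9 \times 4 + 0.3 \times 2}{|0.9| + |0.3|} = \frac{4.2}{1.2} = 3.5
\end{equation}

The prediction lands closer to 4 than to 2 because $j_1$ is the more similar item.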

## Vanilla MF
Vanilla MF predicts a rating as the inner product of vectors that represent users and items. Each user is represented by a vector $\textbf{p}_u \in \mathbb{R}^d$, each item is represented by a vector $\textbf{q}_i \in \mathbb{R}^d$, and $\hat{r}_{(u,i)}$ is computed as the inner product of $\textbf{p}_u$ and $\textbf{q}_i$. The core idea of Vanilla MF is depicted in the following figure and follows the idea of SVD, as we have seen during the TD.

![picture](https://drive.google.com/uc?export=view&id=1EAG31Qw9Ti6hB7VqdONUlijWd4rXVobC)

\begin{equation}
\hat{r}_{(u,i)} = \textbf{p}_u{\textbf{q}_i}^T
\end{equation}
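A minimal sketch of how such vectors might be learned with stochastic gradient descent on the squared error, in plain Python. The toy ratings, the learning rate `lr`, the regularisation `reg` and the number of epochs are all illustrative assumptions, not values used elsewhere in this notebook.

```python
import random

random.seed(0)

# Toy observed ratings (user index, item index, rating); values are illustrative only.
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, d = 3, 3, 2

# p_u and q_i start as small random vectors in R^d
P = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n_users)]
Q = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(n_items)]

def predict(u, i):
    """r_hat(u, i) = inner product of p_u and q_i."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def squared_error():
    return sum((r - predict(u, i)) ** 2 for u, i, r in observed)

initial_error = squared_error()
lr, reg = 0.05, 0.01  # hypothetical hyper-parameters
for _ in range(500):
    for u, i, r in observed:
        err = r - predict(u, i)
        for f in range(d):
            p_uf, q_if = P[u][f], Q[i][f]
            # Gradient step on (r - p_u . q_i)^2 with L2 regularisation
            P[u][f] += lr * (err * q_if - reg * p_uf)
            Q[i][f] += lr * (err * p_uf - reg * q_if)
final_error = squared_error()
```

After training, `final_error` should be well below `initial_error`, and `predict(u, i)` can be called on pairs that were never observed.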

## Some variants of SVD


- SVD with bias: $\hat{r}_{ui} = \mu + b_u + b_i + {q_i}^Tp_u$
- SVD++: $\hat{r}_{ui} = \mu + b_u + b_i + {q_i}^T(p_u + |I_u|^{-\frac{1}{2}}\sum\limits_{j \in I_u}y_j)$
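The two prediction rules above can be written out directly. The sketch below only evaluates the formulas for given parameters (it does not learn them); every number is a made-up assumption, with $\mu$ read as the global mean rating, $b_u$ and $b_i$ as user and item biases, and $y_j$ as the implicit-feedback vector of item $j$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict_svd_bias(mu, b_u, b_i, p_u, q_i):
    """r_hat = mu + b_u + b_i + q_i^T p_u"""
    return mu + b_u + b_i + dot(q_i, p_u)

def predict_svdpp(mu, b_u, b_i, p_u, q_i, y_rated):
    """r_hat = mu + b_u + b_i + q_i^T (p_u + |I_u|^(-1/2) * sum_{j in I_u} y_j)"""
    norm = 1.0 / math.sqrt(len(y_rated)) if y_rated else 0.0
    implicit = [norm * sum(y[f] for y in y_rated) for f in range(len(p_u))]
    return mu + b_u + b_i + dot(q_i, [p + s for p, s in zip(p_u, implicit)])

# Illustrative parameters only
mu, b_u, b_i = 3.5, 0.2, -0.1
p_u, q_i = [1.0, 0.5], [0.4, 0.2]
y_rated = [[0.1, 0.0], [0.0, 0.2], [0.1, 0.2], [0.2, 0.0]]  # one y_j per item in I_u

r_bias = predict_svd_bias(mu, b_u, b_i, p_u, q_i)
r_pp = predict_svdpp(mu, b_u, b_i, p_u, q_i, y_rated)
```

With these numbers the implicit term shifts the user vector from `[1.0, 0.5]` to `[1.2, 0.7]`, which raises the prediction slightly compared with the biased variant.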

## Factorization machine (FM)

FM takes into account user-item interactions and other features, such as users' contexts and items' attributes. It captures the second-order interactions between the vectors representing these features, which enriches its expressiveness. However, all interactions share the same weight, so interactions involving less relevant features may introduce noise. For example, you may use FM to take the features of items into account.

\begin{equation}
\hat{y}_{FM}(\textbf{X}) = w_0 + \sum\limits_{j =1}^nw_jx_j + \sum\limits_{j=1}^n\sum\limits_{k=j+1}^n\textbf{v}_j^T\textbf{v}_kx_jx_k
\end{equation}

where $\textbf{X} \in \mathbb{R}^n$ is the feature vector, $n$ denotes the number of features, $w_0$ is the global bias, $w_j$ is the weight of the $j$-th feature, $\textbf{v}_j \in \mathbb{R}^d$ is the vector representing the $j$-th feature, and $\textbf{v}_j^T\textbf{v}_k$ is the weight of the interaction between the $j$-th and $k$-th features.
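To make the equation concrete, here is a small plain-Python sketch that evaluates it in two ways: a direct double loop over feature pairs, and the usual linear-time rewriting of the pairwise term. The feature vector and all parameters are invented for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fm_predict_naive(x, w0, w, V):
    """Direct evaluation: w0 + sum_j w_j x_j + sum_{j<k} <v_j, v_k> x_j x_k."""
    n = len(x)
    linear = sum(w[j] * x[j] for j in range(n))
    pairwise = sum(
        dot(V[j], V[k]) * x[j] * x[k]
        for j in range(n) for k in range(j + 1, n)
    )
    return w0 + linear + pairwise

def fm_predict_fast(x, w0, w, V):
    """Same value, using sum_{j<k} <v_j,v_k> x_j x_k
    = 0.5 * sum_f [(sum_j v_jf x_j)^2 - sum_j (v_jf x_j)^2]."""
    n, d = len(x), len(V[0])
    linear = sum(w[j] * x[j] for j in range(n))
    pairwise = 0.0
    for f in range(d):
        s = sum(V[j][f] * x[j] for j in range(n))
        s_sq = sum((V[j][f] * x[j]) ** 2 for j in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

# Illustrative input: two one-hot style slots, one active item slot, one real-valued feature
x = [1.0, 0.0, 1.0, 0.5]
w0 = 0.1
w = [0.2, 0.0, -0.1, 0.4]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]

y_naive = fm_predict_naive(x, w0, w, V)
y_fast = fm_predict_fast(x, w0, w, V)
```

Both functions return the same value; the second one costs $O(nd)$ instead of $O(n^2d)$, which matters when the feature vector is long and sparse.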

## MLP

You may also represent users and items by vectors and then feed them into an MLP to make predictions.
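One possible reading of this idea, sketched in plain Python: look up a user vector and an item vector, concatenate them, and pass the result through one hidden ReLU layer. The layer sizes and the random (untrained) weights are assumptions made purely for illustration; a real model would learn them from the ratings.

```python
import random

random.seed(0)
d, hidden = 4, 8  # hypothetical embedding and hidden-layer sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

user_emb = rand_matrix(3, d)   # one vector per user
item_emb = rand_matrix(5, d)   # one vector per item
W1, b1 = rand_matrix(hidden, 2 * d), [0.0] * hidden
W2, b2 = rand_matrix(1, hidden), [0.0]

def linear(W, b, x):
    """Affine map W x + b for a list-of-lists weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp_predict(u, i):
    x = user_emb[u] + item_emb[i]                 # concatenate the two vectors
    h = [max(0.0, z) for z in linear(W1, b1, x)]  # hidden layer with ReLU
    return linear(W2, b2, h)[0]                   # scalar score

score = mlp_predict(0, 2)
```

Unlike the inner product used by MF, the hidden layer lets the model combine user and item dimensions non-linearly.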

## Metrics

- $RMSE = \sqrt{\frac{1}{|\mathcal{T}|}\sum\limits_{(u,i)\in\mathcal{T}}{(\hat{r}_{(u,i)}-r_{ui})}^2}$

- $MAE = \frac{1}{|\mathcal{T}|}\sum\limits_{(u,i)\in\mathcal{T}}{|\hat{r}_{(u,i)}-r_{ui}|}$
- Bonus: you may also consider NDCG and HR under the top-k setting
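The metrics listed above can be sketched as follows. RMSE and MAE follow the formulas directly; for the top-k metrics, this sketch assumes binary relevance (an item is either relevant or not), which is one common convention among several. The sample predictions and ranking are made up.

```python
import math

def rmse(pairs):
    """pairs: iterable of (predicted, true) ratings over the evaluation set."""
    pairs = list(pairs)
    return math.sqrt(sum((p - r) ** 2 for p, r in pairs) / len(pairs))

def mae(pairs):
    pairs = list(pairs)
    return sum(abs(p - r) for p, r in pairs) / len(pairs)

def hit_ratio_at_k(ranked, relevant, k):
    """1.0 if at least one relevant item appears in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k]) if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

pairs = [(4.0, 5), (3.0, 3), (2.0, 4)]   # (predicted, true), illustrative
ranked = ["a", "b", "c", "d"]            # recommendation list, best first
relevant = {"b"}                         # held-out item(s) the user actually liked
```

For a whole test set, HR and NDCG are usually averaged over users.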


# Requirements
- Compare the different methods that you have adopted and interpret the results that you have obtained
- Minimize the RMSE and MAE
- Construct a recommender system that returns the top 10 movies *that the user has not viewed yet*.
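The last requirement can be sketched independently of any particular model: score every movie the user has not rated and keep the 10 best. The function name `top_n_unseen`, the toy catalogue and the stand-in scoring function below are all hypothetical.

```python
def top_n_unseen(user_id, all_items, seen_items, score, n=10):
    """Return the n highest-scoring items that the user has not rated yet."""
    seen = set(seen_items)
    candidates = [item for item in all_items if item not in seen]
    return sorted(candidates, key=lambda item: score(user_id, item), reverse=True)[:n]

# Toy catalogue and a stand-in scoring function (both made up)
all_items = [f"tt{num:07d}" for num in range(1, 16)]
seen_items = {"tt0000015", "tt0000014"}

def toy_score(user_id, item):
    return int(item[2:])  # pretend that a larger id means a higher predicted rating

recommendations = top_n_unseen(42, all_items, seen_items, toy_score)
```

With a fitted surprise model, `score` could be something like `lambda u, i: model.predict(u, i).est`, where `model` is whichever algorithm was trained.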

# Our work

## Step 1 : Import the data

We create a class Data with several methods. The goal here is to centralize the data processing and obtain the dataframes we need in order to work with scikit-surprise.

In [3]:
import pandas as pd

class Data:

    metadata:dict
    ratings:dict
    user_ratings:dict
    ratings_df:pd.DataFrame
    metadata_df:pd.DataFrame
    merged_df:pd.DataFrame
    
    def __init__(self, metadata_path:str, ratings_path:str):
        self.metadata = pd.read_pickle(metadata_path)
        self.ratings = pd.read_pickle(ratings_path)
        self.user_ratings = self.get_user_ratings()
        self.ratings_df = self.get_ratings_as_df()
        self.metadata_df = self.get_metadata_as_df()
        self.merged_df = pd.merge(self.ratings_df, self.metadata_df, left_on='movie_id', right_on='movie_id')

    def get_user_ratings(self)->dict:
        output = {}
        for k, array in self.ratings.items():
            for v in array:
                user_movie = {
                    'user_rating': int(v['user_rating']),
                    'movie_id': k
                }
                # Cast once so that the membership test and the dict keys use the same type
                user_id = int(v['user_id'])

                if user_id in output:
                    output[user_id].append(user_movie)
                else:
                    output[user_id] = [user_movie]
        return output

    def get_ratings_as_df(self)->pd.DataFrame:
        output = []

        for film, rating in self.ratings.items():
            for entry in rating:
                # Build a new row so that the raw ratings dict is left untouched
                row = {k: v for k, v in entry.items() if k != 'user_rating_date'}
                row['movie_id'] = film
                output.append(row)
    
        return pd.DataFrame(output)
    
    def get_metadata_as_df(self)->pd.DataFrame:
        output = []

        for movie_id, movie_data in self.metadata.items():
            # Join the list fields into strings without mutating the raw metadata,
            # so that calling this method twice gives the same result
            output.append({
                'movie_id': movie_id,
                **movie_data,
                'genre': ",".join(movie_data['genre']),
                'actors': ",".join(movie_data['actors'])
            })

        return pd.DataFrame(output)
    
data = Data(
    metadata_path='movie_metadata.pkl', 
    ratings_path='movie_ratings_500_id.pkl'
)

data.merged_df

| | user_rating | user_id | movie_id | director | genre | actors | title |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 1380819 | tt0305224 | Peter Segal | Comedy | Jack Nicholson,Adam Sandler,Marisa Tomei,Woody... | Anger Management |
| 1 | 3 | 185150 | tt0305224 | Peter Segal | Comedy | Jack Nicholson,Adam Sandler,Marisa Tomei,Woody... | Anger Management |
| 2 | 4 | 1351377 | tt0305224 | Peter Segal | Comedy | Jack Nicholson,Adam Sandler,Marisa Tomei,Woody... | Anger Management |
| 3 | 2 | 386143 | tt0305224 | Peter Segal | Comedy | Jack Nicholson,Adam Sandler,Marisa Tomei,Woody... | Anger Management |
| 4 | 3 | 2173336 | tt0305224 | Peter Segal | Comedy | Jack Nicholson,Adam Sandler,Marisa Tomei,Woody... | Anger Management |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 259813 | 5 | 1139877 | tt0361862 | Brad Anderson | Drama,Thriller | Christian Bale,Jennifer Jason Leigh,Aitana Sán... | The Machinist |
| 259814 | 4 | 1460015 | tt0361862 | Brad Anderson | Drama,Thriller | Christian Bale,Jennifer Jason Leigh,Aitana Sán... | The Machinist |
| 259815 | 5 | 1098265 | tt0361862 | Brad Anderson | Drama,Thriller | Christian Bale,Jennifer Jason Leigh,Aitana Sán... | The Machinist |
| 259816 | 4 | 1962894 | tt0361862 | Brad Anderson | Drama,Thriller | Christian Bale,Jennifer Jason Leigh,Aitana Sán... | The Machinist |


# Step 2 : Obtain the movies that a user hasn't viewed yet

## Creation of a base recommender class
Our goal is to create a base class that implements three methods: train, test and evaluate. Train will be overridden by a subclass (one for each recommender we want to benchmark). Test and evaluate may or may not be overridden depending on the context. The class also holds several attributes that are useful when training a model on the data loaded earlier.

In [14]:
from surprise import AlgoBase, Dataset, Prediction, Reader, Trainset
from surprise.model_selection import train_test_split
from surprise import accuracy

# Base class shared by all the recommender classes created below

class BaseRecommender:
    model_data:Dataset
    train_set:Trainset  # train_test_split returns a Trainset...
    test_set:list       # ...and a list of (user, item, rating) tuples
    model:AlgoBase
    predictions:list[Prediction]

    def __init__(
        self, 
        df:pd.DataFrame=data.merged_df[['user_id', 'movie_id', 'user_rating']], 
        test_size:float=0.2, 
        random_state:int=42
    ):
        self.model_data = Dataset.load_from_df(df, Reader(rating_scale=(1, 5)))
        self.train_set, self.test_set = train_test_split(self.model_data, test_size=test_size, random_state=random_state)

    def train(self)->None:
        pass

    def test(self, update:bool=True)->list[Prediction]:
        predictions = self.model.test(self.test_set)
        if update:
            self.predictions = predictions
        return predictions
    
    def evaluate(self)->pd.DataFrame:
        # test() must have been called first so that predictions exist and are non-empty
        assert getattr(self, 'predictions', None) is not None
        assert len(self.predictions) > 0
        return pd.DataFrame([{'rmse': accuracy.rmse(self.predictions), 'mae': accuracy.mae(self.predictions)}])


## Analysis of several models and parameters

1. User-based collaborative filtering
   - With cosine
   - With Pearson
2. Item-based collaborative filtering
   - With cosine
   - With Pearson

In [15]:
from surprise import KNNBasic
# We create and instantiate a first recommender system with the user-based or item-based collaborative filtering method

class CollaborativeFiltering(BaseRecommender):
    model:KNNBasic
    similarity_method:str
    user_based:bool

    def __init__(self, user_based:bool=True, similarity_method:str='cosine'):
        super().__init__()
        self.similarity_method = similarity_method
        self.user_based = user_based

    def train(self)->None:
        # surprise's sim_options only documents 'name', 'user_based', 'min_support'
        # and 'shrinkage', so the metadata-based helpers below are not passed here
        sim_options = {
            'name': self.similarity_method, 
            'user_based': self.user_based
        }
        knn = KNNBasic(sim_options=sim_options)
        knn.fit(self.train_set)
        self.model = knn

    def evaluate(self) -> pd.DataFrame:
        output_df = super().evaluate()
        output_df['Similarity method'] = self.similarity_method.capitalize()
        output_df['User or Item'] = 'User' if self.user_based else 'Item'
        return output_df

    # Metadata-based similarity helpers (standalone: KNNBasic does not call them)
    @staticmethod
    def user_similarity_function(user1:pd.DataFrame, user2:pd.DataFrame, metadata_df:pd.DataFrame)->int:
        # Compare directors, genres, and actors
        common_director = metadata_df[
            metadata_df['movie_id'].isin(user1['movie_id'])
        ]['director'].isin(
            metadata_df[
                metadata_df['movie_id'].isin(user2['movie_id'])
            ]['director']
        ).sum()
        
        common_genre = metadata_df[
            metadata_df['movie_id'].isin(user1['movie_id'])
        ]['genre'].isin(
            metadata_df[
                metadata_df['movie_id'].isin(user2['movie_id'])
        ]['genre']).sum()
        
        common_actors = metadata_df[
            metadata_df['movie_id'].isin(user1['movie_id'])
        ]['actors'].isin(
            metadata_df[
                metadata_df['movie_id'].isin(user2['movie_id'])
        ]['actors']).sum()

        total_common = common_director + common_genre + common_actors

        # Return a similarity score
        return total_common

    @staticmethod
    def item_similarity_function(item1:str, item2:str, metadata_df:pd.DataFrame)->int:
        # Compare directors, genres, and actors of the two movies
        pair = metadata_df[metadata_df['movie_id'].isin([item1, item2])]

        # nunique() == 1 means both movies share the same value for that column,
        # so the score counts the fields on which they agree (0 to 3)
        total_common = int((pair[['director', 'genre', 'actors']].nunique() == 1).sum())

        # Return a similarity score
        return total_common

cosine_user_based_collaborative_filtering = CollaborativeFiltering(user_based=True, similarity_method='cosine')
cosine_user_based_collaborative_filtering.train()
cosine_user_based_collaborative_filtering.test()

pearson_user_based_collaborative_filtering = CollaborativeFiltering(user_based=True, similarity_method='pearson')
pearson_user_based_collaborative_filtering.train()
pearson_user_based_collaborative_filtering.test()

cosine_item_based_collaborative_filtering = CollaborativeFiltering(user_based=False, similarity_method='cosine')
cosine_item_based_collaborative_filtering.train()
cosine_item_based_collaborative_filtering.test()

pearson_item_based_collaborative_filtering = CollaborativeFiltering(user_based=False, similarity_method='pearson')
pearson_item_based_collaborative_filtering.train()
pearson_item_based_collaborative_filtering.test()

pd.concat([
    cosine_user_based_collaborative_filtering.evaluate(),
    pearson_user_based_collaborative_filtering.evaluate(),
    cosine_item_based_collaborative_filtering.evaluate(),
    pearson_item_based_collaborative_filtering.evaluate()
], ignore_index=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
RMSE: 1.0383
MAE:  0.8317
RMSE: 1.0425
MAE:  0.8361
RMSE: 1.0324
MAE:  0.8084
RMSE: 1.0333
MAE:  0.8089


| | rmse | mae | Similarity method | User or Item |
|---|---|---|---|---|
| 0 | 1.038282 | 0.831683 | Cosine | User |
| 1 | 1.042457 | 0.836095 | Pearson | User |
| 2 | 1.032445 | 0.808388 | Cosine | Item |
| 3 | 1.033257 | 0.808883 | Pearson | Item |


3. Factorization Machine (approximated here with surprise's NMF, using movie ids derived from the item metadata)

In [25]:
from surprise import NMF

class FactorizationMachine(BaseRecommender):
    model:NMF
    n_factors:int
    biased:bool

    def __init__(self, n_factors:int=15, biased:bool=False):
        new_merged_df = self.replace_item_id_with_num_metadata(data.merged_df[['user_id', 'movie_id', 'user_rating']])
        super().__init__(df=new_merged_df)
        self.n_factors = n_factors
        self.biased = biased

    def train(self)->None:
        self.model = NMF(n_factors=self.n_factors, biased=self.biased)
        self.model.fit(self.train_set)

    def get_item_features(self, movie_id:str, metadata_df:pd.DataFrame)->list:
        row = metadata_df[metadata_df['movie_id'] == movie_id].iloc[0]
        return [row['director'], row['genre'], row['actors']]
    
    def convert_features_to_numbers(self, features:list)->int:
        # You may use more sophisticated methods for feature conversion.
        # Note: Python salts str hashes per process, so these ids change between runs
        # unless PYTHONHASHSEED is fixed.
        return hash(" ".join(map(str, features)))
    
    def replace_item_id_with_num_metadata(self, merged_df:pd.DataFrame)->pd.DataFrame:
        item_features = {}
        for movie_id in merged_df['movie_id'].unique():
            features = self.get_item_features(movie_id, data.metadata_df)
            numerical_representation = self.convert_features_to_numbers(features)
            item_features[movie_id] = numerical_representation

        # Work on a copy so that the slice of data.merged_df is not modified in place
        # (this avoids pandas' SettingWithCopyWarning)
        merged_df = merged_df.copy()
        merged_df['movie_id'] = merged_df['movie_id'].map(item_features)
        return merged_df

factorization_machine = FactorizationMachine()
factorization_machine.train()
factorization_machine.test()
factorization_machine.evaluate()



RMSE: 1.0783
MAE:  0.8383


| | rmse | mae |
|---|---|---|
| 0 | 1.07826 | 0.838327 |
