# Recommendation algorithms in Python

In this lab, we will implement different recommendation algorithms and evaluate their performance on a movie rating prediction task.

**Task 1:** First, we will build a simple item-based kNN recommendation algorithm with different item feature representations. We will visualize the item vectors on a 2D plot.

**Task 2:** Next, we will use the `surprise` Python package, which implements several collaborative filtering algorithms, and compare their performance for different hyperparameter settings on a standard movie ratings dataset.

In [None]:
import urllib.request
import pandas as pd
import numpy as np
import zipfile
import os
import datetime

from matplotlib import pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

import seaborn as sns
import umap 
import plotly
import plotly.express as px

## Task 1: Item-based kNN recommendations with different item representations

First, we will implement an item-based approach to recommendations using a k Nearest Neighbors (kNN) model.

### Read the data
We use a small version of the popular MovieLens movie recommendation dataset from GroupLens: https://grouplens.org/datasets/movielens/

In [None]:
data_url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
urllib.request.urlretrieve(data_url, 'movielens.zip')
movies_file = zipfile.ZipFile('movielens.zip')
data_filename = 'ml-latest-small'

ratings = pd.read_csv(movies_file.open('ml-latest-small/ratings.csv'))
movies = pd.read_csv(movies_file.open('ml-latest-small/movies.csv'))

### Exploratory data analysis 
First, we perform an exploratory analysis to learn basic characteristics of the dataset.


In [None]:
ratings.head()

What is the rating time distribution?

In [None]:
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['datetime'].hist(bins=100)

What is the rating values distribution?

In [None]:
ratings.rating.describe()

In [None]:
ratings.rating.hist(bins=10)

We can observe that each user has rated relatively few movies, so the rating matrix is sparse, which is a significant problem for recommendation systems.

In [None]:
ratings.groupby('movieId').count()['rating'].describe()

In [None]:
ratings.groupby('userId').count()['rating'].describe()

### Calculate the density of the ratings matrix
  <font color='red'>**ToDo:**</font>
Matrix density is the fraction of non-zero elements in the matrix:  
 
 **density = nr of ratings / (nr of users * nr of items)**
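
As a quick sanity check of the formula, the sketch below computes the density of a tiny, made-up ratings table. The `toy_ratings` frame and its values are hypothetical and are not taken from MovieLens.

```python
import pandas as pd

# Hypothetical toy data: 3 users, 4 movies, 5 ratings.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 30, 40],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5],
})

n_ratings = len(toy_ratings)
n_users = toy_ratings['userId'].nunique()
n_items = toy_ratings['movieId'].nunique()

# density = nr of ratings / (nr of users * nr of items)
toy_density = n_ratings / (n_users * n_items)
print(f'{n_ratings} ratings / ({n_users} users * {n_items} items) = {toy_density:.3f}')
```

Here 5 of the 12 possible user-movie pairs are rated, so the toy matrix is far denser than a real ratings matrix usually is.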

In [None]:
print(??) 

### Item-based kNN recommendations
We will test a simple item-based kNN recommender. We will define the similarity between movies for different movie representations and return the movies most similar to a selected one.


In [None]:
def get_similar_movies(movie_id: int, similarity_df: pd.DataFrame, n_neighbors:int=5):
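    # Drop the query movie explicitly: with tied similarities it is not guaranteed to sort first.
    similar_ids = similarity_df.loc[movie_id].drop(movie_id).sort_values(ascending=False).head(n_neighbors).reset_index()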
    return similar_ids.merge(movies, on='movieId')[['title', 'genres']]


def plot_2d_movies(movie_vectors: pd.DataFrame, movie_metadata: pd.DataFrame, samp_size: int=2000) -> None:
    features_sample = movie_vectors.sample(samp_size)
    features_sample_2d = umap.UMAP().fit_transform(features_sample)
    df = pd.DataFrame(features_sample_2d, index=features_sample.index, columns=['x', 'y'])
    df = df.merge(movie_metadata, left_index=True, right_on='movieId')[['x', 'y', 'title', 'genres']]
    fig = px.scatter(df, x='x', y='y', hover_name="title", hover_data=['genres'])
    fig.update_traces(textposition='top center')
    fig.update_layout(height=800)
    fig.show()


We select an example of a movie for a qualitative evaluation of different methods.

In [None]:
test_movie = movies.iloc[0]
test_movie

### Content-Based Recommendations

As the first approach, we use the content-based features of movies to calculate their similarity - in this case these are the movie genres.

In [None]:
movies.head()

 <font color='red'>**ToDo:**</font>
- Use the sklearn `CountVectorizer` to build the content-based movie feature matrix from the genres: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- Split the lists of genres on the `|` separator (use the argument `token_pattern='[^|]+'` for `CountVectorizer`)
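
To see what the suggested token pattern does, the sketch below applies the same regular expression with the standard `re` module to a few made-up genre strings. The `toy_genres` list is hypothetical, and the code only illustrates the tokenization step, not the vectorizer itself.

```python
import re
from collections import Counter

# Hypothetical genre strings in the same pipe-separated format as movies['genres'].
toy_genres = ['Adventure|Animation|Children', 'Comedy|Romance', 'Sci-Fi|Comedy']

# '[^|]+' matches maximal runs of characters that are not '|'.
tokens_per_movie = [re.findall(r'[^|]+', g) for g in toy_genres]
print(tokens_per_movie)

# Counting tokens over all movies gives the genre frequencies.
genre_counts = Counter(t for tokens in tokens_per_movie for t in tokens)
print(genre_counts)
```

Note that the pattern keeps `Sci-Fi` as a single token, whereas the default word-based pattern of `CountVectorizer` would split it at the hyphen. `CountVectorizer` also lowercases its input by default, so its feature names come out in lower case.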



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = ??
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
content_features = pd.DataFrame(vectorizer.fit_transform(movies['genres']).toarray(),
                                columns=vectorizer.get_feature_names_out(),
                                index=movies['movieId'])
content_features.head()

#### Plot 2D movie representations

We will use the UMAP algorithm to project the multi-dimensional vectors into 2D space and Plotly to draw an interactive plot of the 2D vectors: https://plotly.com/python/plotly-express/.

You can read more and experiment with UMAP: https://pair-code.github.io/understanding-umap/

<font color='red'>**ToDo:**</font>

Plot  `content_features` vectors in 2D using `plot_2d_movies` function

In [None]:
plot_2d_movies(movie_vectors=??, movie_metadata=movies)

As the similarity measure, we use pairwise cosine similarity between movie feature vectors. This measure is more robust to sparse vectors than the Euclidean distance because it depends only on the angle between two vectors, not on their magnitudes. We will calculate the similarity matrix for all movie pairs according to this metric.
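
As an illustration of the measure itself, the sketch below computes cosine similarity by hand with NumPy for three made-up binary genre vectors. The vectors and the helper `cosine` are hypothetical and are not part of the lab code.

```python
import numpy as np

# Hypothetical binary genre vectors for three movies.
a = np.array([1, 1, 0, 0])
b = np.array([1, 1, 1, 0])
c = np.array([0, 0, 0, 1])


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


print(cosine(a, b))  # two shared genres: high similarity
print(cosine(a, c))  # no shared genres: similarity 0
```

Movies with identical genre sets get a similarity of 1, so ties are common with content features.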

<font color='red'>**ToDo:**</font>
 - Build the similarity matrix for movies based on their `content_features`. 
 
 Hint: use `cosine_similarity` function https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

In [None]:
cosine_similarity_content_mx = ??
movie_similarity_content = pd.DataFrame(cosine_similarity_content_mx,
                                        columns=content_features.index, 
                                        index=content_features.index)
movie_similarity_content.head()

The most similar movies to "Toy Story" based on the content features:

In [None]:
get_similar_movies(test_movie['movieId'], movie_similarity_content)

### Collaborative Filtering recommendations

The content-based approach is simple and quite effective, even for new items that have no ratings yet, but it ignores the information in user interaction patterns. To address this problem, we will implement a collaborative-filtering kNN recommendation algorithm which calculates movie similarity based on the user rating matrix.


#### Build a user-movie rating matrix
First, we will construct the rating matrix $R$. Note that our rating matrix is sparse (there are many empty values), and we fill the empty entries with 0s. This exercise uses a small dataset, so we keep a dense matrix for readability; for larger datasets a sparse matrix is more efficient.

<font color='red'>**ToDo:**</font>
- Construct a pivot table with user ids as columns, movie ids as index and user-movie ratings as values. 

Hint: 
You may use `df.pivot_table(index=..., columns=..., values=...)` https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
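
For larger datasets, the sparse alternative mentioned above could look roughly like the following sketch, which builds a SciPy CSR matrix from a made-up long-format table. The `toy` frame is hypothetical, and this is one possible construction rather than the approach used in this lab.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical toy ratings in the same long format as `ratings`.
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0],
})

# Map raw ids to consecutive row/column positions.
movie_codes, movie_ids = pd.factorize(toy['movieId'])
user_codes, user_ids = pd.factorize(toy['userId'])

# Rows are movies, columns are users; unrated pairs are simply not stored.
sparse_mx = csr_matrix((toy['rating'].to_numpy(), (movie_codes, user_codes)),
                       shape=(len(movie_ids), len(user_ids)))

print(sparse_mx.shape, sparse_mx.nnz)
print(sparse_mx.toarray())
```

Only the four observed ratings are stored, while the dense view printed at the end shows the implicit zeros.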

In [None]:
rating_mx = ??
rating_mx.fillna(0, inplace=True)
rating_mx.head()

Plot  `rating_mx` vectors in 2D using `plot_2d_movies` function

In [None]:
plot_2d_movies(movie_vectors=??, movie_metadata=movies)

 <font color='red'>**ToDo:**</font>
 - Build the similarity matrix for movies based on their `rating_mx`. 

In [None]:
cosine_similarity_cf_mx = ??
movie_similarity_cf = pd.DataFrame(cosine_similarity_cf_mx, columns=rating_mx.index, index=rating_mx.index)
movie_similarity_cf.head()

In [None]:
get_similar_movies(test_movie['movieId'], movie_similarity_cf)

## Task 2: Comparing the performance of different recommendation algorithms 
In this task, we will use the `surprise` Python package to evaluate the performance of different collaborative filtering recommendation algorithms on the movie recommendation task for a larger dataset.

We will use the built-in MovieLens 100K dataset.

https://surprise.readthedocs.io/en/stable/


In [None]:
from surprise import NMF, SVD, KNNBasic, KNNWithMeans, KNNWithZScore, NormalPredictor
from surprise import Dataset
from surprise.model_selection import GridSearchCV

def evaluate_models(models: list, param_grid: dict, test_metrics: list) -> pd.DataFrame:
    # NOTE: relies on the notebook-level variables `data` and `cv` defined in other cells.
    results = []
    for model in models:
        print('Evaluating model: {}'.format(model.__name__))
        search = GridSearchCV(model, param_grid, measures=test_metrics, cv=cv)
        search.fit(data)
        model_cv_results = pd.DataFrame(search.cv_results)
        model_cv_results['model'] = model.__name__
        model_cv_results['params'] = model_cv_results['params'].astype(str)
        results.append(model_cv_results)
    # DataFrame.append was removed in pandas 2.0; concatenate the per-model frames instead.
    return pd.concat(results, sort=False) if results else pd.DataFrame()


# use the 100K movie recommendation dataset.
data = Dataset.load_builtin('ml-100k')

### Model selection 

We will compare several types of models and search for best configurations.

For each group of models prepare a list of classes and parameters for the grid search:

`some_models = [SomeModel1, SomeModel2]
 param_grid = {'param1': [1,2,3],
               'param2': [True, False]}`
               

<font color='red'>**ToDo:**</font>
Prepare the configuration for three types of models:
* Baseline model: use `NormalPredictor` model which predicts a random rating based on the distribution of the training set. This baseline does not require any parameters (grid will be empty).
Docs: https://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.random_pred.NormalPredictor

* KNN models: use  `KNNBasic`,  `KNNWithZScore`.
Prepare the hyperparameters grid for the neighborhood-based models - `k` - number of neighbors (20 and 50) and boolean `user_based` (`True` and `False` for using user or item-based similarity).
Docs: https://surprise.readthedocs.io/en/stable/knn_inspired.html

* Matrix Factorization models: use `SVD`, `NMF`. Prepare the hyperparameters grid for the MF models - `n_factors` - number of latent dimensions (20 and 50). Docs: https://surprise.readthedocs.io/en/stable/matrix_factorization.html
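
A grid search evaluates every combination of the listed values, so the number of configurations multiplies across parameters. The sketch below illustrates this with plain `itertools`, reusing the placeholder names `param1` and `param2` from the example above. It shows the idea only and makes no claim about how `surprise` implements its search.

```python
from itertools import product

# Placeholder grid with hypothetical parameter names.
toy_grid = {'param1': [1, 2, 3], 'param2': [True, False]}

# Every combination of values is one configuration to cross-validate.
combinations = [dict(zip(toy_grid, values)) for values in product(*toy_grid.values())]
for combo in combinations:
    print(combo)

# An empty grid still yields exactly one (empty) configuration.
empty_grid_combinations = list(product(*{}.values()))
print(empty_grid_combinations)
```

With 5-fold cross-validation, each of the 6 configurations in this toy grid would be fitted 5 times, which is worth keeping in mind when choosing the grid size.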



In [None]:
baseline_models = [??]
param_grid_baseline = {}


knn_models = [??]
param_grid_neighbors = {??}


mf_models = [??]
param_grid_mf = {??}

Run the grid search and save the results for each model. The best models are selected by minimizing the Mean Absolute Error: $$\text{MAE} = \frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in
\hat{R}}|r_{ui} - \hat{r}_{ui}|$$
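
As a small worked instance of the formula, the sketch below computes the MAE with NumPy for four made-up rating pairs. The arrays `r_true` and `r_pred` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical true and predicted ratings for four user-item pairs.
r_true = np.array([4.0, 3.0, 5.0, 2.0])
r_pred = np.array([3.5, 3.0, 4.0, 3.0])

# Mean of the absolute errors, matching the formula above.
mae = np.mean(np.abs(r_true - r_pred))
print(mae)
```

The absolute errors are 0.5, 0, 1 and 1, so the MAE is 2.5 / 4 = 0.625 rating points.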

In [None]:
cv_results = pd.DataFrame()
test_metrics = ['mae']

cv = 5
models_config = [ ('baselines', baseline_models, param_grid_baseline),
                ('KNN', knn_models, param_grid_neighbors),
                ('MF', mf_models, param_grid_mf)]

for model_cls, model_list, model_params in models_config:
    print('-------------------------------------------------')
    print(f'------ Evaluating model class: {model_cls} --------')
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead.
    cv_results = pd.concat([cv_results, evaluate_models(model_list, model_params, test_metrics)], sort=False)

### Results comparison

In [None]:
plot_metrics = ['mean_test_mae', 'mean_fit_time', 'mean_test_time']

for metric in plot_metrics:
    display(cv_results.sort_values(metric)[["model", "params", metric]])
    fig = sns.barplot(data=cv_results, hue='params', y=metric, x='model')
    fig.set_title(metric)
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    fig.set_xticklabels(fig.get_xticklabels(), rotation=90)
    plt.show()