# Benchmark of recommendation algorithms v1
---

In this first **benchmark**, we look at the following models:
- Collaborative filtering: item-based neighborhood
- Collaborative filtering: Matrix Factorization (SVD/Funk SVD, SGD/ALS)
- LTR (learning-to-rank) algorithms: Bayesian Personalized Ranking
- Neural Collaborative Filtering

*The source code can be found in the `src` folder.*

It is important to note that the algorithms evaluated here are very sensitive to the **cold-start** problem (new users, new items).
This problem is not addressed here; the data is therefore filtered to keep only users and items with past activity.

## Preprocessing

In [2]:
import sys
sys.path.append('../src')

In [1]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [3]:
# (Users, Items) -> Ratings DataFrame
DEBUG_SIZE = 5000, 500
PRODUCTION_SIZE = 20000, 5000

In [4]:
df_ratings = pd.read_csv('../Datasets/dataset/ydata-ymusic-user-artist-ratings-v1_0.txt', sep='\t',header=None, encoding="ISO-8859-1", names=['user_id', 'artist_id','ratings'])

In [74]:
df_lookup = pd.read_csv('../Datasets/dataset/ydata-ymusic-artist-names-v1_0.txt', sep='\t',header=None, encoding="ISO-8859-1", names=['artist_id', 'artist_name'])

In [5]:
df_ratings.head()

Unnamed: 0,user_id,artist_id,ratings
0,1,1000125,90
1,1,1006373,100
2,1,1006978,90
3,1,1007035,100
4,1,1007098,100


In [7]:
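# Recode the ratings column only, so that ids equal to 0 or 255 are left untouched.
# 0 is treated as a missing rating; 255 (apparently the dataset's "never play again" code) becomes 0.
df_ratings['ratings'] = df_ratings['ratings'].replace(0, np.nan)
df_ratings['ratings'] = df_ratings['ratings'].replace(255, 0)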

In [8]:
df_artists = df_ratings.groupby('artist_id').count().sort_values(by='ratings', ascending=False)
artists_pop = df_artists.head(DEBUG_SIZE[1]).reset_index()['artist_id'].tolist()

In [9]:
user_cf = df_ratings.groupby('user_id').count().reset_index().sort_values(by='ratings', ascending=False).head(DEBUG_SIZE[0])['user_id'].tolist()

In [10]:
len(user_cf), len(artists_pop)

(5000, 500)

In [11]:
df_ratings_cf = df_ratings.loc[df_ratings['artist_id'].isin(artists_pop)]

In [12]:
df_ratings_cf = df_ratings_cf.loc[df_ratings_cf['user_id'].isin(user_cf)]

In [13]:
total_items = df_ratings_cf["artist_id"].nunique()
total_users = df_ratings_cf['user_id'].nunique()
total_items, total_users

(500, 5000)

In [14]:
train_df, test_df = train_test_split(df_ratings_cf,
                                   test_size=0.20,
                                   shuffle=True,
                                   random_state=42)

In [15]:
train_df.head()

Unnamed: 0,user_id,artist_id,ratings
29614817,498068,1001925,87.0
87044977,1465888,1015424,
1718837,28640,1027263,100.0
73381208,1233790,1012903,50.0
78053062,1312829,1031703,


In [16]:
train_df.shape

(1315098, 3)

## Baseline: Popularity Recommender


In [17]:
%load_ext autoreload

In [18]:
%autoreload 2

In [19]:
from Popularity import PopularityRecommender

recsys_ = PopularityRecommender('user_id', 'artist_id', df_lookup=df_lookup)
recsys_.fit(df_ratings_cf)

In [20]:
recsys_.recommend(20)

Unnamed: 0,user_id,artist_id,score,Rank,artist_name
0,20,1022226,4866,1.0,Red Hot Chili Peppers
1,20,1021815,4849,2.0,Queen
2,20,1008023,4813,3.0,Eagles
3,20,1027798,4802,4.0,U2
4,20,1016600,4793,5.0,Madonna
5,20,1021869,4793,6.0,R.E.M.
6,20,1053507,4782,7.0,Linkin Park
7,20,1015272,4776,8.0,Led Zeppelin
8,20,1017240,4768,9.0,Matchbox Twenty
9,20,1010787,4765,10.0,Green Day
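
`PopularityRecommender` lives in `src` and its code is not shown here. As a rough sketch of what such a baseline typically does, assuming the score is the number of distinct users who interacted with each artist (the function `popularity_ranking` and the toy data are illustrative, not the author's implementation):

```python
import pandas as pd

def popularity_ranking(df, user_col="user_id", item_col="artist_id", top_n=10):
    """Rank items by the number of distinct users who interacted with them."""
    scores = (
        df.groupby(item_col)[user_col]
        .nunique()                      # distinct users per item
        .rename("score")
        .reset_index()
        # Highest score first; ties broken by item id for a stable order
        .sort_values(["score", item_col], ascending=[False, True])
        .head(top_n)
        .reset_index(drop=True)
    )
    scores["rank"] = scores.index + 1
    return scores

# Toy interactions (hypothetical data)
toy = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 3, 3, 3],
    "artist_id": [10, 20, 10, 30, 10, 20, 30],
})
top = popularity_ranking(toy, top_n=2)
print(top)
```

Such a ranking does not depend on the user, so every user receives the same list; this is what makes it a useful non-personalized baseline.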


## Collaborative filtering: item-based neighborhood


*Amazon* algorithm:
 * For each item I1:
   + For each user C who interacted with I1:
       + For each item I2 of C:
           + Record [I1, I2]
   + For each item I2:
       + Compute Sim(I1, I2)

The similarity function chosen is cos(), for its stability and efficiency.
Explicit ratings must be normalized by subtracting the user's mean rating (or the item's mean rating).

Finally, predicted ratings are computed with the formula:

![Imgur](https://i.imgur.com/ELcXOvD.png)                                 
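
The steps above can be sketched in NumPy. This is a minimal illustration and not the code of `ItemCFRecommender`: it assumes a dense user x item matrix with `np.nan` for missing ratings, user-mean centering, and the usual weighted-average prediction over the `k` most similar items, which may differ in detail from the formula in the image. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def item_cosine_similarity(ratings):
    """ratings: (n_users, n_items) array with np.nan for missing ratings."""
    user_means = np.nanmean(ratings, axis=1, keepdims=True)
    # Center each rating on its user's mean; missing ratings contribute 0
    centered = np.where(np.isnan(ratings), 0.0, ratings - user_means)
    norms = np.linalg.norm(centered, axis=0)
    norms[norms == 0] = 1.0  # avoid dividing by zero for items with no signal
    sim = (centered.T @ centered) / np.outer(norms, norms)
    return sim, user_means

def predict(ratings, sim, user_means, user, item, k=2):
    """Predict a rating from the user's k most similar rated items."""
    rated = np.flatnonzero(~np.isnan(ratings[user]))
    rated = rated[rated != item]
    neighbors = rated[np.argsort(-sim[item, rated])[:k]]
    weights = sim[item, neighbors]
    denom = np.abs(weights).sum()
    if denom == 0:
        return float(user_means[user, 0])  # fall back to the user's mean
    deviations = ratings[user, neighbors] - user_means[user, 0]
    return float(user_means[user, 0] + (weights @ deviations) / denom)

# Toy matrix (hypothetical): users 0-1 like items 0-1, users 2-3 like items 2-3
R = np.array([
    [100.0,  90.0, np.nan,  10.0],
    [ 90.0, 100.0,  20.0, np.nan],
    [ 10.0, np.nan, 100.0,  90.0],
    [np.nan, 20.0,  90.0, 100.0],
])
sim, user_means = item_cosine_similarity(R)
pred = predict(R, sim, user_means, user=0, item=2)
print(sim.round(2))
print(pred)
```

In this toy example, user 0 rated item 3 (the item most similar to item 2) well below their own average, so the prediction for item 2 falls below that average too.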

In [21]:
from ItemCF import ItemCFRecommender

recsys_itemcf = ItemCFRecommender(train_df, item_col='artist_id', load_sim_matrix=False)

In [22]:
# The naive algorithm is too slow; we prefer to use the scikit-surprise library
# recsys_itemcf.recommend(20)

Using Surprise (http://surpriselib.com/). Algorithm: KNNWithMeans


In [23]:
from surprise import Reader
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate
from surprise import Dataset

reader = Reader(rating_scale=(0, 100))

train_df_2 = train_df.copy().dropna()

data = Dataset.load_from_df(train_df_2[['user_id', 'artist_id', 'ratings']], reader)

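# Defaults: user-based neighbors with MSD similarity (see the log below).
# For item-based cosine, pass sim_options={'name': 'cosine', 'user_based': False}.
algo = KNNWithMeans()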

cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    21.7856 21.7352 21.7920 21.8087 21.8102 21.7864 0.0273  
MAE (testset)     17.0125 16.9888 17.0245 17.0290 17.0502 17.0210 0.0202  
Fit time          258.19  234.91  237.50  236.60  233.28  240.10  9.16    
Test time         432.75  435.18  465.91  492.59  466.13  458.51  22.28   


{'test_rmse': array([21.78564957, 21.73515604, 21.79203133, 21.8087461 , 21.81023902]),
 'test_mae': array([17.01249169, 16.98877092, 17.02453125, 17.02896947, 17.05023754]),
 'fit_time': (258.1892161369324,
  234.9132616519928,
  237.49816632270813,
  236.60197138786316,
  233.27565050125122),
 'test_time': (432.7506802082062,
  435.18238282203674,
  465.9146683216095,
  492.58912110328674,
  466.1335551738739)}

## Collaborative filtering: Matrix Factorization

In [24]:
from CF_SVD import SVDRecommender

recsys_svd = SVDRecommender(train_df, 0.001, 25, itemCol='artist_id')

In [25]:
recsys_svd.train__SGD()

Preprocessing data...

Epoch 1/100  | took 1.8 sec
Epoch 2/100  | took 0.6 sec
Epoch 3/100  | took 0.6 sec
Epoch 4/100  | took 0.6 sec
Epoch 5/100  | took 0.6 sec
Epoch 6/100  | took 0.6 sec
Epoch 7/100  | took 0.6 sec
Epoch 8/100  | took 0.6 sec
Epoch 9/100  | took 0.6 sec
Epoch 10/100 | took 0.6 sec
Epoch 11/100 | took 0.6 sec
Epoch 12/100 | took 0.6 sec
Epoch 13/100 | took 0.6 sec
Epoch 14/100 | took 0.6 sec
Epoch 15/100 | took 0.6 sec
Epoch 16/100 | took 0.6 sec
Epoch 17/100 | took 0.6 sec
Epoch 18/100 | took 0.6 sec
Epoch 19/100 | took 0.6 sec
Epoch 20/100 | took 0.6 sec
Epoch 21/100 | took 0.6 sec
Epoch 22/100 | took 0.6 sec
Epoch 23/100 | took 0.6 sec
Epoch 24/100 | took 0.6 sec
Epoch 25/100 | took 0.6 sec
Epoch 26/100 | took 0.6 sec
Epoch 27/100 | took 0.6 sec
Epoch 28/100 | took 0.6 sec
Epoch 29/100 | took 0.6 sec
Epoch 30/100 | took 0.6 sec
Epoch 31/100 | took 0.7 sec
Epoch 32/100 | took 0.6 sec
Epoch 33/100 | took 0.6 sec
Epoch 34/100 | took 0.6 sec
Epoch 35/100 | took 0.6 sec
Epoch 

In [26]:
recsys_svd.evaluate__SGD(test_df)

Metrics on test set for funkSVD model : 
 RMSE=24.68 
 MAE=18.95
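
Both metrics are expressed on the same 0-100 scale as the ratings. A minimal sketch of how they can be computed, assuming pairs with a missing (NaN) rating are skipped; the actual `evaluate__SGD` implementation is not shown here, and the helper names `rmse` and `mae` are illustrative:

```python
import numpy as np

def _valid_pairs(y_true, y_pred):
    """Keep only the pairs where both the rating and the prediction are defined."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = ~np.isnan(y_true) & ~np.isnan(y_pred)
    return y_true[mask], y_pred[mask]

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    t, p = _valid_pairs(y_true, y_pred)
    return float(np.sqrt(np.mean((t - p) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the error in rating points."""
    t, p = _valid_pairs(y_true, y_pred)
    return float(np.mean(np.abs(t - p)))

# Toy values (hypothetical); the NaN rating is ignored
y_true = [100.0, 50.0, np.nan, 0.0]
y_pred = [90.0, 60.0, 70.0, 0.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred))
```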


In [27]:
user_recs = recsys_svd.recommend__SGD(527421)

In [28]:
user_recs.reset_index(inplace=True)
user_recs.rename(columns={'index': 'artist_id'}, inplace=True)
user_recs.merge(df_lookup, on='artist_id', how='inner')

Unnamed: 0,artist_id,0,artist_name
0,1021300,100.0,The Police
1,1020036,100.0,Orbital
2,1004095,100.0,Busta Rhymes
3,1023876,87.759499,Seal
4,1023356,87.101301,Sade
5,1028153,83.977972,Stevie Ray Vaughan
6,1024951,80.130625,The Smiths
7,1015851,78.132867,Lonestar
8,1010065,76.609026,Marvin Gaye
9,1009850,76.602437,Peter Gabriel


With this algorithm, the recommendations are much more precisely personalized.
It also exposes many of the model's attributes, such as the **User** and **Item** vectors, which can be used to add new items/users.

Many hyperparameters can be tuned and can strongly affect the system's accuracy:
- The number of latent factors: the more there are, the more the system tends to overfit the training set.
- The learning rate and the regularization parameter, each of which can help the model generalize better.

Nevertheless, this library trains extremely fast and scales well (10x faster than Surprise, for example).
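
To make the role of these hyperparameters concrete, here is a minimal NumPy sketch of Funk SVD trained with SGD. It is not the code of `SVDRecommender` or of the library it relies on; the function names, the parameter names (`n_factors`, `lr`, `reg`) and the toy ratings are assumptions chosen for illustration.

```python
import numpy as np

def train_funk_svd(triples, n_users, n_items, n_factors=5,
                   lr=0.01, reg=0.02, n_epochs=200, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i by SGD on (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, (n_users, n_factors))   # user latent vectors
    Q = rng.normal(0.0, 0.1, (n_items, n_factors))   # item latent vectors
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    mu = float(np.mean([r for _, _, r in triples]))  # global mean rating
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            # lr scales each step; reg shrinks parameters toward 0
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()  # use the old user vector in the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q

def training_rmse(model, triples):
    """RMSE of the model on the triples it was fitted on."""
    mu, bu, bi, P, Q = model
    errors = [r - (mu + bu[u] + bi[i] + P[u] @ Q[i]) for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errors))))

# Toy ratings on a 1-5 scale (hypothetical data)
triples = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
           (2, 1, 2), (2, 2, 5), (3, 0, 1), (3, 2, 4)]
untrained = train_funk_svd(triples, n_users=4, n_items=3, n_epochs=0)
trained = train_funk_svd(triples, n_users=4, n_items=3)
print(training_rmse(untrained, triples), training_rmse(trained, triples))
```

Increasing `n_factors` or lowering `reg` lets the model fit the training triples more closely, which is where the overfitting risk mentioned above comes from; evaluating on held-out ratings is what reveals it.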


## Bayesian Personalized Ranking

In [29]:
from BPR import BPRRecommender

recsys_bpr = BPRRecommender(train_df, itemCol='artist_id')

Instructions for updating:
non-resource variables are not supported in the long term
Data Sparsity : 47.39608 %


In [55]:
hyperparams = {
    'num_factors': 64,
    'lambda_user': 1e-6,
    'lambda_item': 1e-6,
    'lambda_bias': 1e-6,
    'learning_rate': 0.002
}

In [56]:
recsys_bpr.set_hyperparameters(**hyperparams)

In [57]:
recsys_bpr.build_graph()

In [60]:
epochs = 50
batches = 100
samples = 500

recsys_bpr.train(epochs, batches, samples)

Epoch 0 : Loss: 0.786 | AUC: 0.520
 Epoch 1 : Loss: 0.795 | AUC: 0.504
 Epoch 2 : Loss: 0.767 | AUC: 0.504
 Epoch 3 : Loss: 0.731 | AUC: 0.512
 Epoch 4 : Loss: 0.725 | AUC: 0.540
 Epoch 5 : Loss: 0.695 | AUC: 0.562
 Epoch 6 : Loss: 0.683 | AUC: 0.578
 Epoch 7 : Loss: 0.684 | AUC: 0.560
 Epoch 8 : Loss: 0.688 | AUC: 0.548
 Epoch 9 : Loss: 0.683 | AUC: 0.552
 Epoch 10 : Loss: 0.667 | AUC: 0.594
 Epoch 11 : Loss: 0.662 | AUC: 0.574
 Epoch 12 : Loss: 0.651 | AUC: 0.612
 Epoch 13 : Loss: 0.669 | AUC: 0.604
 Epoch 14 : Loss: 0.655 | AUC: 0.590
 Epoch 15 : Loss: 0.662 | AUC: 0.610
 Epoch 16 : Loss: 0.658 | AUC: 0.598
 Epoch 17 : Loss: 0.671 | AUC: 0.578
 Epoch 18 : Loss: 0.653 | AUC: 0.614
 Epoch 19 : Loss: 0.660 | AUC: 0.610
 Epoch 20 : Loss: 0.646 | AUC: 0.606
 Epoch 21 : Loss: 0.658 | AUC: 0.566
 Epoch 22 : Loss: 0.675 | AUC: 0.608
 Epoch 23 : Loss: 0.653 | AUC: 0.592
 Epoch 24 : Loss: 0.656 | AUC: 0.580
 Epoch 25 : Loss: 0.648 | AUC: 0.628
 Epoch 26 : Loss: 0.652 | AUC: 0.604
 Epoch 27 : 
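
The loss and AUC printed at each epoch can be read against the standard BPR formulation: for a sampled triple (user u, observed item i, unobserved item j), the model maximizes log sigmoid(x_ui - x_uj), where x_ui is the dot product of the user and item factors. The sketch below is a NumPy illustration under that assumption, not the TensorFlow code of `BPRRecommender`; item biases (`lambda_bias`) are omitted and the helper names are hypothetical.

```python
import numpy as np

def pairwise_scores(user_vecs, pos_vecs, neg_vecs):
    """x_uij = x_ui - x_uj for each sampled (user, positive, negative) triple."""
    return np.sum(user_vecs * (pos_vecs - neg_vecs), axis=1)

def bpr_loss(user_vecs, pos_vecs, neg_vecs, reg=1e-6):
    """Mean negative log-sigmoid of x_uij plus an L2 penalty on the factors."""
    x_uij = pairwise_scores(user_vecs, pos_vecs, neg_vecs)
    log_sigmoid = -np.logaddexp(0.0, -x_uij)  # numerically stable log(sigmoid(x))
    l2 = reg * (np.sum(user_vecs ** 2) + np.sum(pos_vecs ** 2) + np.sum(neg_vecs ** 2))
    return float(-np.mean(log_sigmoid) + l2)

def sampled_auc(user_vecs, pos_vecs, neg_vecs):
    """Fraction of sampled triples where the observed item outranks the other."""
    return float(np.mean(pairwise_scores(user_vecs, pos_vecs, neg_vecs) > 0))

# All-zero factors: every pair is tied, so the loss is ln(2)
zeros = np.zeros((3, 2))
loss_at_zero = bpr_loss(zeros, zeros, zeros)

# Perfectly separated pairs: every observed item is ranked first
u = np.ones((3, 2))
auc_separated = sampled_auc(u, np.ones((3, 2)), -np.ones((3, 2)))
print(loss_at_zero, auc_separated)
```

With all-zero factors the loss equals ln 2 ≈ 0.693, which gives a reference point for reading the log above: values below it mean the model ranks observed items above unobserved ones more often than not.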

In [96]:
df_lookup = pd.read_csv('../Datasets/dataset/ydata-ymusic-artist-names-v1_0.txt', sep='\t',header=None, encoding="ISO-8859-1", names=['artist_id', 'artist_name'])

In [97]:
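# Rename the name column and drop the first two rows of the lookup file
df_lookup.rename(columns={"artist_name": 'item_name'}, inplace=True)
df_lookup = df_lookup.iloc[2:]
# Keep only the artists retained during preprocessing
df_lookup = df_lookup[df_lookup.artist_id.isin(artists_pop)]
# Replace the original artist ids with contiguous positional indices (0..n-1)
df_lookup.reset_index(inplace=True)
df_lookup = df_lookup[['item_name']]
df_lookup.reset_index(inplace=True)
df_lookup.rename(columns={"index": 'artist_id'}, inplace=True)
df_lookup.head()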

Unnamed: 0,artist_id,item_name
0,0,112
1,1,2Pac
2,2,311
3,3,A-Ha
4,4,A.F.I.


In [109]:
recsys_bpr.find_similar_items(df_lookup, item='Queen')

Unnamed: 0,item,score
0,La Ley,1.579186
1,Led Zeppelin,1.517054
2,Saliva,1.487757
3,Pink Floyd,1.468446
4,Kirk Franklin,1.439833
5,Savage Garden,1.435377
6,Neil Young,1.42883
7,Godsmack,1.387676
8,Guns N' Roses,1.375337
9,Yes,1.372328


In [1]:
recsys_bpr.make_recommendation(df_lookup, user_id=1000)

NameError: name 'recsys_bpr' is not defined

## Neural Collaborative Filtering

-> See the notebook NeuCF.ipynb for a demonstration

### References

 + THE library to help design a recommender system: https://github.com/microsoft/recommenders/tree/master/reco_utils
 + The documentation of the **Surprise** scikit library: https://surprise.readthedocs.io/en/stable/
 + A very fast implementation of the FunkSVD algorithm: https://github.com/gbolmier/funk-svd
 + The *Implicit* library, a fast implementation of implicit collaborative filtering: https://github.com/benfred/implicit
