# Recommender Systems

Build, understand and tune a recommender system.

# Description

## Familiarisation

Recommender systems are traditionally used, as the name suggests, to recommend content to users.
For example, they can recommend films to users based on the ones they have already watched, suggest music or videos, or power "more like this" features.

We will start by following and reproducing the steps of this tutorial:

*  https://www.datacamp.com/community/tutorials/recommender-systems-python

Assuming you have little RAM, we will stop at the point where the `cosine_sim` variable is computed.


**Step 1: simple recommender**

What is the memory complexity of this operation?
(Use `cosine_similarity`, which needs less memory; it still takes about 8 GB, which is feasible on Colab.)
Does it fit on your machine?

What is the author trying to achieve with this computation?
How can we work around this problem?
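For the memory question, a back-of-the-envelope estimate helps. The sketch below assumes a dense similarity matrix over the 45,466 movies of the dataset (the row count reported by `tfidf_matrix.shape` later in this notebook); the helper name `dense_matrix_bytes` is illustrative only.

```python
def dense_matrix_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Memory needed to hold a dense matrix (float64 by default)."""
    return n_rows * n_cols * bytes_per_value

n_movies = 45466  # rows of tfidf_matrix in this notebook

full_f64 = dense_matrix_bytes(n_movies, n_movies)      # all pairs, O(n^2)
full_f32 = dense_matrix_bytes(n_movies, n_movies, 4)   # same matrix in float32
one_row = dense_matrix_bytes(n_movies, 1)              # one movie against all, O(n)

print(f"all pairs, float64: {full_f64 / 1e9:.1f} GB")
print(f"all pairs, float32: {full_f32 / 1e9:.1f} GB")
print(f"single movie:       {one_row / 1e6:.2f} MB")
```

The quadratic term is what hurts: comparing one movie against all the others at query time only needs a vector of length n.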


**Step 2: content-based recommender**

Implement the second part while avoiding the matrix product.

**Step 3: improvements**

Code the two improvements:
1. Introduce a popularity filter: this recommender takes the 30 most similar movies, calculates the weighted ratings (using the IMDB formula from above), sorts the movies by this rating, and returns the top 10.
2. Use PCA with 100 components to speed up the similarity search. Are the results coherent?

In [2]:
import pandas as pd

file = pd.read_csv("datasets/movies_metadata.csv", low_memory=False)

In [3]:
file.columns.tolist()

['adult',
 'belongs_to_collection',
 'budget',
 'genres',
 'homepage',
 'id',
 'imdb_id',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'poster_path',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'video',
 'vote_average',
 'vote_count']

file.shape

In [4]:
file.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0


In [5]:
C = file['vote_average'].mean()
print(C)

5.618207215133889


In [6]:
# Calculate the minimum number of votes required to be in the chart, m
m = file['vote_count'].quantile(0.9)
print(m)

160.0


In [7]:
# Filter out all qualified movies into a new DataFrame
q_movies = file.copy().loc[file['vote_count'] >= m]
q_movies.shape

(4555, 24)

In [8]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x['vote_count']
    R = x['vote_average']
    # Calculation based on the IMDB formula
    return (v/(v+m) * R) + (m/(m+v) * C)
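
Written out, the expression returned by `weighted_rating` is the following, where $v$ is the movie's `vote_count`, $R$ its `vote_average`, $m$ the minimum number of votes required and $C$ the mean vote over the whole dataset:

```latex
WR = \frac{v}{v+m}\,R + \frac{m}{v+m}\,C
```

The two weights sum to 1, so $WR$ lies between $R$ and $C$: a movie with many votes keeps a score close to its own average, while a movie with few votes is pulled towards the global mean.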

In [9]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies['score'] = q_movies.apply(weighted_rating, axis=1)

In [10]:
#Sort movies based on score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

# Print the top 30 movies
q_movies[['title', 'vote_count', 'vote_average', 'score', 'popularity']].head(30)

Unnamed: 0,title,vote_count,vote_average,score,popularity
314,The Shawshank Redemption,8358.0,8.5,8.445869,51.645403
834,The Godfather,6024.0,8.5,8.425439,41.109264
10309,Dilwale Dulhania Le Jayenge,661.0,9.1,8.421453,34.457024
12481,The Dark Knight,12269.0,8.3,8.265477,123.167259
2843,Fight Club,9678.0,8.3,8.256385,63.869599
292,Pulp Fiction,8670.0,8.3,8.251406,140.950236
522,Schindler's List,4436.0,8.3,8.206639,41.725123
23673,Whiplash,4376.0,8.3,8.205404,64.29999
5481,Spirited Away,3968.0,8.3,8.196055,41.048867
2211,Life Is Beautiful,3643.0,8.3,8.187171,39.39497
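
As a sanity check, the scores in the table can be recomputed by hand from the values of `C` and `m` printed earlier. The scalar helper below is a hypothetical stand-in for the row-wise `weighted_rating` function, written so that the snippet runs on its own.

```python
def weighted_rating_scalar(v: float, R: float, m: float, C: float) -> float:
    """IMDB-style weighted rating: shrink the movie mean R towards the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C = 5.618207215133889  # global mean of vote_average, printed by cell In [5]
m = 160.0              # 90th percentile of vote_count, printed by cell In [6]

# vote_count and vote_average taken from the first and third rows of the table
shawshank = weighted_rating_scalar(v=8358.0, R=8.5, m=m, C=C)
dilwale = weighted_rating_scalar(v=661.0, R=9.1, m=m, C=C)

print(round(shawshank, 6), round(dilwale, 6))
```

Dilwale Dulhania Le Jayenge has the higher raw average (9.1), but with only 661 votes it is shrunk more strongly towards `C` and ends up just below the two films above it.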


# Content-Based Recommender

In [11]:
# Print plot overviews of the first 5 movies.
file['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [12]:
#Import TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
file['overview'] = file['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(file['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)

In [13]:
tfidf_matrix

<45466x75827 sparse matrix of type '<class 'numpy.float64'>'
	with 1210882 stored elements in Compressed Sparse Row format>
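
The shape and the number of stored elements reported above are enough to estimate why the sparse format matters here. This is a rough sketch: it assumes float64 values and 32-bit indices for the CSR arrays, which is an assumption about the storage layout rather than something read from this notebook.

```python
n_rows, n_cols = 45466, 75827   # shape of tfidf_matrix reported above
nnz = 1210882                   # stored (non-zero) elements reported above

density = nnz / (n_rows * n_cols)

# Dense storage: one float64 per cell.
dense_bytes = n_rows * n_cols * 8

# CSR storage (assumed layout): one float64 value and one int32 column index
# per stored element, plus one int32 row pointer per row (+1).
csr_bytes = nnz * (8 + 4) + (n_rows + 1) * 4

print(f"density: {density:.4%}")
print(f"dense:   {dense_bytes / 1e9:.1f} GB")
print(f"CSR:     {csr_bytes / 1e6:.1f} MB")
```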

In [14]:
# Array mapping from feature integer indices to feature names.
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
tfidf.get_feature_names()[5000:5010]

['avails',
 'avaks',
 'avalanche',
 'avalanches',
 'avallone',
 'avalon',
 'avant',
 'avanthika',
 'avanti',
 'avaracious']

In [15]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
#cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [16]:
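# Construct a reverse map from movie title to row index.
# Titles can repeat, so keep only the first row for each title;
# drop_duplicates() on the (already unique) index values would remove nothing.
indices = pd.Series(file.index, index=file['title'])
indices = indices[~indices.index.duplicated(keep='first')]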

In [17]:
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

* ### get_recommendations (simple)

# Similarity score
#cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix[0])

In [18]:
#cosine_sim.shape

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

In [20]:
# Function that takes a movie title as input and returns the most similar movies
def get_recommendations(title):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Cosine similarity of this movie against all movies: an (n_movies, 1) column,
    # flattened to 1-D so that each score is a plain float
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix[idx]).ravel()

    # Pair each movie index with its similarity score
    sim_scores = list(enumerate(cosine_sim))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return file['title'].iloc[movie_indices]
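
Sorting every score is more work than needed when only the top 10 are kept. A minimal sketch of an alternative, assuming a small dense array for readability: `np.argpartition` selects the k best candidates in linear time and only those are sorted. The function name `top_k_similar` and the toy data are hypothetical.

```python
import numpy as np

def top_k_similar(vectors: np.ndarray, idx: int, k: int = 10) -> np.ndarray:
    """Indices of the k rows most cosine-similar to row idx, best first (idx excluded)."""
    # Normalise the rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.where(norms == 0, 1.0, norms)

    sims = unit @ unit[idx]      # one row against all rows: O(n) memory
    sims[idx] = -np.inf          # never recommend the query itself

    k = min(k, len(sims) - 1)
    candidates = np.argpartition(-sims, k - 1)[:k]     # unordered top k
    return candidates[np.argsort(-sims[candidates])]   # sort only those k

toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
print(top_k_similar(toy, idx=0, k=2))
```

With scikit-learn's default settings, TF-IDF rows are already L2-normalised, which is why the tutorial can use `linear_kernel` in place of `cosine_similarity`.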

In [21]:
get_recommendations('Interstellar')

10674         Voices of a Distant Star
5460                 Suburban Commando
312                           Stargate
4728                    Hail Columbia!
23037     Space Pirate Captain Harlock
16335              On the Silver Globe
37041                      Time Runner
1598                 Starship Troopers
44216    Star Force: Fugitive Alien II
43523                           Cosmos
Name: title, dtype: object

* ### get_recommendations with PCA (TruncatedSVD)

In [22]:
# Better speed with PCA or TruncatedSVD

In [23]:
from sklearn.decomposition import PCA, TruncatedSVD

SVD = TruncatedSVD(n_components=100, n_iter=15, random_state=42)
tfidf_matrix_SVD = SVD.fit_transform(tfidf_matrix)

In [24]:
def get_recommendations_SVD(title):
    # Get the index of the movie that matches the title
    idx = indices[title]
    # cosine_similarity expects 2-D inputs, so reshape the movie's vector to a single row
    cosine_sim_SVD = cosine_similarity(tfidf_matrix_SVD, tfidf_matrix_SVD[idx].reshape(1, -1)).ravel()
    sim_scores_SVD = list(enumerate(cosine_sim_SVD))

    # Sort the movies based on the similarity scores
    sim_scores_SVD = sorted(sim_scores_SVD, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    sim_scores_SVD = sim_scores_SVD[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores_SVD]

    # Return the top 10 most similar movies
    return file['title'].iloc[movie_indices]

In [25]:
get_recommendations_SVD('Interstellar')

11575                Wake Up, Ron Burgundy: The Lost Movie
18904    If a Tree Falls: A Story of the Earth Liberati...
28498                                   A Blank on the Map
42101                                     Ferocious Planet
40145                                    Fascisti su Marte
37055                                         Skull Forest
19577                    Daleks' Invasion Earth: 2150 A.D.
30418                           K2: Siren of the Himalayas
42908                                     Escape from Mars
26392                          They Came from Beyond Space
Name: title, dtype: object

In [26]:
get_recommendations('Interstellar')

10674         Voices of a Distant Star
5460                 Suburban Commando
312                           Stargate
4728                    Hail Columbia!
23037     Space Pirate Captain Harlock
16335              On the Silver Globe
37041                      Time Runner
1598                 Starship Troopers
44216    Star Force: Fugitive Alien II
43523                           Cosmos
Name: title, dtype: object

# The SVD-based recommendations are less relevant than the previous (full TF-IDF) ones
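
One simple way to put a number on how much the two recommenders agree is the share of items their top-10 lists have in common. The helper below is a sketch (the name `overlap_at_k` is hypothetical); it is applied to the row indices printed in the two outputs above.

```python
def overlap_at_k(list_a, list_b, k=10):
    """Fraction of the top-k items shared by two ranked lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k

# Row indices returned for 'Interstellar' by get_recommendations (TF-IDF)
tfidf_top10 = [10674, 5460, 312, 4728, 23037, 16335, 37041, 1598, 44216, 43523]
# Row indices returned for 'Interstellar' by get_recommendations_SVD
svd_top10 = [11575, 18904, 28498, 42101, 40145, 37055, 19577, 30418, 42908, 26392]

print(overlap_at_k(tfidf_top10, svd_top10))
```

The two lists share no movie at all, so projecting onto 100 components changes the neighbourhood of this film completely.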

* #### get_recommendations with "popularity filter"

In [27]:
# file.sort_values(by='popularity', ascending=False)

In [28]:
# Function that takes a movie title as input and returns the most similar movies,
# re-ranked by weighted rating

def get_recommendations_popularity(title):
    # Get the index of the movie that matches the title
    idx = indices[title]
    
    # Cosine similarity of this movie against all movies, flattened to 1-D
    cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix[idx]).ravel()
    
    # Pair each movie index with its similarity score
    sim_scores = list(enumerate(cosine_sim))

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the most similar movies, skipping the movie itself.
    # Note: [1:30] yields 29 movies; [1:31] would give exactly 30.
    sim_scores = sim_scores[1:30]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    similarity_metadata = file.iloc[movie_indices]
    
    # Popularity filter: keep movies with at least m votes, then rank by weighted rating
    q_similarity_metadata = similarity_metadata.copy().loc[similarity_metadata['vote_count'] >= m]
    q_similarity_metadata['score'] = q_similarity_metadata.apply(weighted_rating, axis=1)
    q_similarity_metadata = q_similarity_metadata.sort_values('score', ascending=False)

    # Return all qualifying movies, best score first (the caller takes the top 10)
    return q_similarity_metadata[['title', 'vote_count','score']]

In [29]:
get_recommendations_popularity('Interstellar').head(10)

Unnamed: 0,title,vote_count,score
312,Stargate,942.0,6.628415
1598,Starship Troopers,1584.0,6.600753
18995,Prometheus,5152.0,6.279464
23037,Space Pirate Captain Harlock,365.0,6.161739
29943,The Green Inferno,359.0,5.190584


* #### get_recommendations with title suggestion

## LastFM Project

Mr Pontier contacts you for help building a recommender system. He has a database of (anonymised) user data listing the artists each user listens to on his platform, along with the number of plays. Mr Pontier wants to recommend to his users artists they have not listened to yet, based on their musical preferences.

Mr Pontier wants to use the LightFM library, for which he already has a driver that loads his data; he provides it to you, which is a real bonus.
Mr Pontier has seen that the documentation describes several models. He wants to evaluate them on a train set and a test set, and use the best one.

For the evaluation, he wants to compare AUC, precision and recall (see the LightFM documentation), presented in a table. (Give the values for both the train AND the test set, and compare them.)
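
To make the three metrics concrete, here is a conceptual sketch for a single user in plain Python. It is not LightFM's implementation (the library computes these per user over sparse matrices and can exclude known train interactions); the function names are hypothetical and every relevant item is assumed to appear in the ranking.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision@k and recall@k for one user's ranked recommendation list."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

def auc(ranked_items, relevant_items):
    """Probability that a random relevant item is ranked above a random irrelevant one."""
    n_pos = len(relevant_items)
    n_neg = len(ranked_items) - n_pos
    correct_pairs = 0
    negatives_below = 0
    # Walk up from the bottom of the ranking, counting for each relevant
    # item how many irrelevant items sit below it.
    for item in reversed(ranked_items):
        if item in relevant_items:
            correct_pairs += negatives_below
        else:
            negatives_below += 1
    return correct_pairs / (n_pos * n_neg)

ranked = ["A", "B", "C", "D", "E"]   # model's ranking, best first
relevant = {"A", "C"}                # artists the user really listened to

print(precision_recall_at_k(ranked, relevant, k=2))
print(auc(ranked, relevant))
```

Precision and recall only look at the top k positions, whereas AUC scores the whole ranking, which is why a model can do well on one and poorly on the other.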

---
**The train and test sets have a slightly different shape from what we are used to seeing, so look at their shapes and investigate what they are and what they represent.**

Here are two additional sub-tasks that will help us evaluate and interpret our model once the result tables are obtained:
* write the function get_recommendation, which takes a User as input and returns the recommended Artists (from best to worst according to the recommendation score)
* get_ground_truth, which returns the artists a user has listened to, in decreasing order of playCountScaled

This will let us evaluate qualitatively the results the model returns and compare them with the ground truth.


#### Bonus

* Compare the AUC of the best LightFM model with that of a PCA (TruncatedSVD).
* Training must be as fast as possible while still giving the best results, so you are asked to find the number of iterations needed to reach 95% of the maximum AUC value on the test set.
-- 
* clustering of the artists using the embeddings
* hyper-parameter optimisation (k, n, learning_schedule, learning_rate)
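
For the convergence bonus, one possible approach is to record the test AUC after each epoch (for instance by calling LightFM's `fit_partial` with `epochs=1` in a loop) and then look for the first epoch that reaches 95% of the best value. The sketch below covers only that last step; the function name is hypothetical and the AUC history is invented for illustration, not measured.

```python
def epochs_to_reach(auc_history, fraction=0.95):
    """Smallest number of epochs whose AUC reaches `fraction` of the best AUC observed."""
    if not auc_history:
        raise ValueError("auc_history must contain at least one value")
    target = fraction * max(auc_history)
    for epoch, value in enumerate(auc_history, start=1):
        if value >= target:
            return epoch

# Hypothetical test AUC after epochs 1, 2, 3, ... (illustrative values only)
history = [0.62, 0.74, 0.80, 0.83, 0.845, 0.85, 0.851]

print(epochs_to_reach(history))
```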



### Background research

Which recommender system will you put in place?

What is an "implicit feedback" recommender system? And an "explicit feedback" one?

What is LightFM? Explain what these methods do:
* fit_partial
* precision_at_k (and recall_at_k)



### Resources:

LightFM: https://github.com/lyst/lightfm

Last.fm dataset: https://grouplens.org/datasets/hetrec-2011/

The world is large and we know only a small part of it; don't forget the big picture: https://github.com/jihoo-kim/awesome-RecSys

# Deliverable

Notebook and code for the backend (if a backend is built)


In [2]:
import pandas as pd
import numpy as np

In [4]:
plays = pd.read_csv('datasets/user_artists.dat', sep='\t')
artists = pd.read_csv('datasets/artists.dat', sep='\t', usecols=['id','name'])

# Merge artist and user pref data
ap = pd.merge(artists, plays, how="inner", left_on="id", right_on="artistID")
ap = ap.rename(columns={"weight": "playCount"})

# Group artist by name
artist_rank = ap.groupby(['name']) \
    .agg({'userID' : 'count', 'playCount' : 'sum'}) \
    .rename(columns={"userID" : 'totalUsers', "playCount" : "totalPlays"}) \
    .sort_values(['totalPlays'], ascending=False)

artist_rank['avgPlays'] = artist_rank['totalPlays'] / artist_rank['totalUsers']
print(artist_rank)

                    totalUsers  totalPlays     avgPlays
name                                                   
Britney Spears             522     2393140  4584.559387
Depeche Mode               282     1301308  4614.567376
Lady Gaga                  611     1291387  2113.563011
Christina Aguilera         407     1058405  2600.503686
Paramore                   399      963449  2414.659148
...                        ...         ...          ...
Morris                       1           1     1.000000
Eddie Kendricks              1           1     1.000000
Excess Pressure              1           1     1.000000
My Mine                      1           1     1.000000
A.M. Architect               1           1     1.000000

[17632 rows x 3 columns]


In [32]:
# Merge into ap matrix
ap = ap.join(artist_rank, on="name", how="inner") \
    .sort_values(['playCount'], ascending=False)

# Preprocessing
pc = ap.playCount
play_count_scaled = (pc - pc.min()) / (pc.max() - pc.min())
ap = ap.assign(playCountScaled=play_count_scaled)
#print(ap)

# Build a user-artist rating matrix 
ratings_df = ap.pivot(index='userID', columns='artistID', values='playCountScaled')
ratings = ratings_df.fillna(0).values

# Show sparsity (as computed here: the percentage of non-zero entries, i.e. the density)
sparsity = float(len(ratings.nonzero()[0])) / (ratings.shape[0] * ratings.shape[1]) * 100
print("sparsity: %.2f" % sparsity)

sparsity: 0.28


In [33]:
from scipy.sparse import csr_matrix

# Build a sparse matrix
X = csr_matrix(ratings)

n_users, n_items = ratings_df.shape
print("rating matrix shape", ratings_df.shape)

user_ids = ratings_df.index.values
artist_names = ap.sort_values("artistID")["name"].unique()

rating matrix shape (1892, 17632)


In [34]:
from lightfm import LightFM
from lightfm.evaluation import auc_score, precision_at_k, recall_at_k
from lightfm.cross_validation import random_train_test_split
from lightfm.data import Dataset

# Build data references + train test
print("X:", X.shape)
Xcoo = X.tocoo()
print("Xcoo:", Xcoo.shape)

data = Dataset()
data.fit(np.arange(n_users), np.arange(n_items))
interactions, weights = data.build_interactions(zip(Xcoo.row, Xcoo.col, Xcoo.data)) 
train, test = random_train_test_split(interactions)

# Ignore that (weight seems to be ignored...)
#train = train_.tocsr()
#test = test_.tocsr()
#train[train==1] = X[train==1]
#test[test==1] = X[test==1]

# To be completed...
#print(train.shape, test.shape)
#print("Train Shape:")
#print(train)

#print("Test Shape:")
#print(test)


X: (1892, 17632)
Xcoo: (1892, 17632)


In [51]:
def Scoring():
    
    learning_rate = [0.03, 0.05, 0.08, 0.10]
    losslist = ['logistic', 'bpr', 'warp', 'warp-kos']
    klist = [3, 5, 7, 10]
    results = []
    
    for x in learning_rate:
        for y in losslist:
            for z in klist:
            
                model = LightFM(learning_rate=x, loss = y)
                model.fit(train, epochs=10, num_threads=4)

                trainPrecision = precision_at_k(model, train, k=z).mean()
                testPrecision = precision_at_k(model, test, k=z, train_interactions=train).mean()

                trainAUC = auc_score(model, train).mean()
                testAUC = auc_score(model, test, train_interactions=train).mean()

                # Collect the metrics for this (learning rate, loss, k) combination
                dicttemp = {'K':z, 'Name':y, 'Learning Rate':x, 'Train Precision':trainPrecision, 'Train AUC':trainAUC, 'Test Precision':testPrecision, 'Test AUC':testAUC}

                results.append(dicttemp)
            
    results = pd.DataFrame(results)
    results.to_csv('tests.csv', encoding='utf-8')

    return results
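
In the grid above, `k` only affects the evaluation, yet the model is retrained for every value of `k`. A possible restructuring, sketched with plain callables so it runs without LightFM: train once per (learning rate, loss) pair and evaluate that model for each `k`. `grid_search`, `fake_fit` and `fake_evaluate` are hypothetical names; the two fakes are placeholders for the real training and metric code.

```python
from itertools import product

def grid_search(learning_rates, losses, ks, fit, evaluate):
    """fit(lr, loss) returns a model; evaluate(model, k) returns a dict of metrics."""
    rows = []
    for lr, loss in product(learning_rates, losses):
        model = fit(lr, loss)            # train once per (lr, loss) pair
        for k in ks:                     # k is only used at evaluation time
            row = {"Learning Rate": lr, "Name": loss, "K": k}
            row.update(evaluate(model, k))
            rows.append(row)
    return rows

fit_calls = []

def fake_fit(lr, loss):
    # Placeholder: a real version would build and fit a LightFM model here.
    fit_calls.append((lr, loss))
    return (lr, loss)

def fake_evaluate(model, k):
    # Placeholder: a real version would return precision, recall and AUC.
    return {"evaluated_at_k": k}

rows = grid_search([0.03, 0.05], ["bpr", "warp"], [3, 5, 7], fake_fit, fake_evaluate)
print(len(rows), "result rows from", len(fit_calls), "fits")
```

Besides saving time, this makes rows that differ only in `k` comparable, since they come from the same trained model.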

In [23]:
Scoring()

Unnamed: 0,K,Name,Learning Rate,Train Precision,Train AUC,Test Precision,Test AUC
0,3,logistic,0.03,0.225875,0.887064,0.096763,0.807194
1,5,logistic,0.03,0.215058,0.886890,0.087407,0.807038
2,7,logistic,0.03,0.206257,0.887128,0.073563,0.807698
3,10,logistic,0.03,0.200424,0.887002,0.068570,0.807619
4,3,bpr,0.03,0.396783,0.767165,0.177161,0.732145
...,...,...,...,...,...,...,...
59,10,warp,0.10,0.377253,0.983816,0.122038,0.844771
60,3,warp-kos,0.10,0.354719,0.888683,0.149413,0.808378
61,5,warp-kos,0.10,0.357688,0.889755,0.137673,0.811491
62,7,warp-kos,0.10,0.349265,0.888411,0.130279,0.811224


In [49]:
import time

def Scoring():
    
#    learning_rate = [0.05, 0.08, 0.10]
#    losslist = ['logistic', 'bpr', 'warp', 'warp-kos']
#    klist = [5, 7, 10]
    learning_rate = [0.05, 0.08, 0.10]
    losslist = ['bpr', 'warp']
    klist = [5, 7, 9, 11]
    results = []
    
    for x in learning_rate:
        for y in losslist:
            for z in klist:
            
                model = LightFM(learning_rate=x, loss = y)
                t1 = time.process_time()
                model.fit(train, epochs=10, num_threads=2)
                t2 = time.process_time()
                t = t2 - t1
                trainPrecision = precision_at_k(model, train, k=z).mean()
                trainRecall = recall_at_k(model, train, k=z).mean()

                testPrecision = precision_at_k(model, test, k=z, train_interactions=train).mean()
                testRecall = recall_at_k(model, test, k=z, train_interactions=train).mean()

                trainAUC = auc_score(model, train).mean()
                testAUC = auc_score(model, test, train_interactions=train).mean()
                
                t3 = time.process_time()
                tt = t3 - t1
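                # Each recall needs its own key: with the empty key '' for both, the test recall
                # overwrites the train recall (the single unnamed column in the output below)
                dicttemp = {'Fit Time:':t,'Total Time:':tt, 'K':z, 'Name':y, 'Learning Rate':x, 'Train Precision':trainPrecision, 'Train Recall':trainRecall, 'Train AUC':trainAUC, 'Test Precision':testPrecision, 'Test Recall':testRecall, 'Test AUC':testAUC}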

                results.append(dicttemp)
            
    results = pd.DataFrame(results)
    results.to_csv('270126 - tests.csv', encoding='utf-8')

    return results
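
A note on the timing: `time.process_time()` reports CPU time summed over the whole process, not elapsed time, so with `num_threads=2` it can differ noticeably from what a stopwatch would show. The small sketch below illustrates the difference with a sleep, which costs wall-clock time but almost no CPU time.

```python
import time

wall_start = time.perf_counter()   # elapsed (wall-clock) time
cpu_start = time.process_time()    # CPU time used by this process

time.sleep(0.2)                    # waiting uses almost no CPU

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(f"wall-clock: {wall:.3f} s, CPU: {cpu:.3f} s")
```

For reporting how long a fit takes from the user's point of view, `time.perf_counter()` is usually the more natural choice.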

In [50]:
Scoring()

Unnamed: 0,Fit Time:,Total Time:,K,Name,Learning Rate,Train Precision,Unnamed: 7,Train AUC,Test Precision,Test AUC
0,0.91782,16.717014,5,bpr,0.05,0.40212,0.079103,0.830727,0.153098,0.76901
1,0.79758,16.50225,7,bpr,0.05,0.409569,0.103956,0.857695,0.144002,0.784728
2,0.815862,16.159363,9,bpr,0.05,0.374845,0.115492,0.849287,0.125119,0.778638
3,0.832158,16.314365,11,bpr,0.05,0.358771,0.130818,0.852113,0.115919,0.780888
4,1.14717,16.644932,5,warp,0.05,0.420562,0.083281,0.962781,0.162286,0.852697
5,0.880927,15.848824,7,warp,0.05,0.40215,0.107138,0.965409,0.14797,0.85526
6,0.888408,15.275538,9,warp,0.05,0.387093,0.124869,0.967018,0.13509,0.854509
7,0.882138,15.210878,11,warp,0.05,0.371971,0.141798,0.964584,0.125583,0.85363
8,0.822649,16.021328,5,bpr,0.08,0.441865,0.084224,0.896863,0.164209,0.800768
9,0.800649,15.585142,7,bpr,0.08,0.452646,0.11527,0.911498,0.159341,0.808072


In [47]:
df = pd.read_csv('270126 - tests.csv')

print(df)

    Unnamed: 0  Fit Time:  Total Time:   K  Name  Learning Rate  \
0            0   0.890023    11.277907   5   bpr           0.05   
1            1   0.849757    11.160908   7   bpr           0.05   
2            2   0.900814    11.348791   9   bpr           0.05   
3            3   0.937319    11.905783  11   bpr           0.05   
4            4   1.282540    11.372688   5  warp           0.05   
5            5   1.031303    10.497100   7  warp           0.05   
6            6   1.209725    11.425931   9  warp           0.05   
7            7   1.290642    11.833668  11  warp           0.05   
8            8   0.874659    11.014168   5   bpr           0.08   
9            9   0.851164    10.877806   7   bpr           0.08   
10          10   0.849799    11.295545   9   bpr           0.08   
11          11   0.818872    10.921807  11   bpr           0.08   
12          12   0.904939    10.885537   5  warp           0.08   
13          13   1.049845    10.553159   7  warp           0.0

## The best Test AUC (85.72%) is obtained with a learning rate of 0.05 and k = 11
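
Picking the best configuration from a results table can be done with `idxmax`. The sketch below uses only the four `warp` rows at learning rate 0.05 from the `In [50]` output above, so it illustrates the mechanics on a subset; the full results file may rank differently.

```python
import pandas as pd

# Subset of the In [50] output: warp loss, learning rate 0.05
rows = [
    {"K": 5, "Name": "warp", "Learning Rate": 0.05, "Test AUC": 0.852697},
    {"K": 7, "Name": "warp", "Learning Rate": 0.05, "Test AUC": 0.85526},
    {"K": 9, "Name": "warp", "Learning Rate": 0.05, "Test AUC": 0.854509},
    {"K": 11, "Name": "warp", "Learning Rate": 0.05, "Test AUC": 0.85363},
]
results = pd.DataFrame(rows)

# Row with the highest test AUC
best = results.loc[results["Test AUC"].idxmax()]
print(best)
```

Since `auc_score` takes no `k` argument, the differences between rows that differ only in `K` presumably come from refitting the model with a different random initialisation rather than from `k` itself.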