M. Pontier vous contact pour l'aider à construire un système de recommandation. Il dispose d'une base de données comportant des données concernant ses utilisateurs (anonymisé) contenant les artistes qu'ils écoutent sur sa plateforme ainsi que le nombre d'écoutes. Monsieur pontier souhaite recommander à ses utilisateur des artistes qu'il n'ont pas encore écoutés, et cela en fonction de leurs préférences musicale.

Monsieur pontier souhaite utiliser la librairie Lightfm, avec laquelle il a déjà un driver permettant de charger ses données qu'il vous fournit, un vrai bonus. Monsieur Pontier à pu voir que la documentation comporte plusieurs modèle, il souhaite évaluer les modèle sur une jeux de train/test et utiliser le meilleurs modéle.

Pour l'évaluation, il souhaite comparer la mesure AUC, la précision et le rappel (visiter la documentation de Lightfm), qui devront être présenté dans un tableau.

In [None]:
#Veille

https://medium.com/@piocalderon/an-overview-of-recommendation-systems-f40ab77bc2c6

Hybrid systems leverage both item metadata and transaction data to give recommendations. There are many ways we can mix and match collaborative filtering models and content-based models. One such model is called Light FM, made by the fashion company Lyst.

Light FM uses item features, user features and transaction data to give recommendations. It does so by representing item features and user features in a latent embedding similar to Word2Vec. Then, each item and each user is represented by the vector sum of the latent features. Recommendations are made for a user by sorting the cosine similarity between the user’s vector representation and each item’s representation.
Feedback: Explicit vs. Implicit

One issue that e-commerce websites usually face is a lack of feedback. For Amazon, there are explicit 1- to 5-star reviews for each product, so making recommendation systems is a relatively straightforward task. This type of data is explicit feedback. What can one do if there aren’t a lot of reviews for your products? You can leverage implicit data, such as the number of orders made for a certain product or the number of clicks that the item receives. These constitute implicit feedback. LightFM has models that take into account either explicit and implicit feedback.
The Bottomline

The literature on Recommendation Systems is very large and there are still lots of open questions. To get started on the topic, the resource I recommend is Recommender Systems by Aggarwal. It is very comprehensive and perfect for new learners on the topic.

https://towardsdatascience.com/how-to-build-a-movie-recommender-system-in-python-using-lightfm-8fa49d7cbe3b
LightFm

LightFM is a Python implementation of a number of popular recommendation algorithms. LightFM includes implementations of BPR and WARP ranking losses(A loss function is a measure of how good a prediction model does in terms of being able to predict the expected outcome.).

BPR: Bayesian Personalised Ranking pairwise loss: It maximizes the prediction difference between a positive example and a randomly chosen negative example. It is useful when only positive interactions are present.

WARP: Weighted Approximate-Rank Pairwise loss: Maximises the rank of positive examples by repeatedly sampling negative examples until rank violating one is found

In [None]:
# Recommendation System

In [1]:
import pandas as pd
import numpy as np
import time

In [2]:
# import Datas
plays = pd.read_csv('hetrec2011-lastfm-2k/user_artists.dat', sep='\t')
artists = pd.read_csv('hetrec2011-lastfm-2k/artists.dat', sep='\t', usecols=['id','name'])
plays

Unnamed: 0,userID,artistID,weight
0,2,51,13883
1,2,52,11690
2,2,53,11351
3,2,54,10300
4,2,55,8983
...,...,...,...
92829,2100,18726,337
92830,2100,18727,297
92831,2100,18728,281
92832,2100,18729,280


In [3]:
artists

Unnamed: 0,id,name
0,1,MALICE MIZER
1,2,Diary of Dreams
2,3,Carpathian Forest
3,4,Moi dix Mois
4,5,Bella Morte
...,...,...
17627,18741,Diamanda Galás
17628,18742,Aya RL
17629,18743,Coptic Rain
17630,18744,Oz Alchemist


In [4]:
# Merge artist and user pref data
ap = pd.merge(artists, plays, how="inner", left_on="id", right_on="artistID")
ap = ap.rename(columns={"weight": "playCount"})

ap

Unnamed: 0,id,name,userID,artistID,playCount
0,1,MALICE MIZER,34,1,212
1,1,MALICE MIZER,274,1,483
2,1,MALICE MIZER,785,1,76
3,2,Diary of Dreams,135,2,1021
4,2,Diary of Dreams,257,2,152
...,...,...,...,...,...
92829,18741,Diamanda Galás,454,18741,301
92830,18742,Aya RL,454,18742,294
92831,18743,Coptic Rain,454,18743,287
92832,18744,Oz Alchemist,454,18744,286


In [5]:
# Group artist by name
artist_rank = ap.groupby(['name']) \
    .agg({'userID' : 'count', 'playCount' : 'sum'}) \
    .rename(columns={"userID" : 'totalUsers', "playCount" : "totalPlays"}) \
    .sort_values(['totalPlays'], ascending=False)

artist_rank['avgPlays'] = artist_rank['totalPlays'] / artist_rank['totalUsers']
print(artist_rank)


                    totalUsers  totalPlays     avgPlays
name                                                   
Britney Spears             522     2393140  4584.559387
Depeche Mode               282     1301308  4614.567376
Lady Gaga                  611     1291387  2113.563011
Christina Aguilera         407     1058405  2600.503686
Paramore                   399      963449  2414.659148
...                        ...         ...          ...
Morris                       1           1     1.000000
Eddie Kendricks              1           1     1.000000
Excess Pressure              1           1     1.000000
My Mine                      1           1     1.000000
A.M. Architect               1           1     1.000000

[17632 rows x 3 columns]


In [6]:
ap

Unnamed: 0,id,name,userID,artistID,playCount
0,1,MALICE MIZER,34,1,212
1,1,MALICE MIZER,274,1,483
2,1,MALICE MIZER,785,1,76
3,2,Diary of Dreams,135,2,1021
4,2,Diary of Dreams,257,2,152
...,...,...,...,...,...
92829,18741,Diamanda Galás,454,18741,301
92830,18742,Aya RL,454,18742,294
92831,18743,Coptic Rain,454,18743,287
92832,18744,Oz Alchemist,454,18744,286


In [7]:
# Merge into ap matrix
ap = ap.join(artist_rank, on="name", how="inner") \
    .sort_values(['playCount'], ascending=False)

# Preprocessing
pc = ap.playCount
play_count_scaled = (pc - pc.min()) / (pc.max() - pc.min())
ap = ap.assign(playCountScaled=play_count_scaled)
#print(ap)

# Build a user-artist rating matrix 
ratings_df = ap.pivot(index='userID', columns='artistID', values='playCountScaled')
ratings = ratings_df.fillna(0).values

# Show density
density = float(len(ratings.nonzero()[0])) / (ratings.shape[0] * ratings.shape[1]) * 100
print("density: %.2f" % density)

density: 0.28


In [11]:
from scipy.sparse import csr_matrix

# Build a sparse matrix
X = csr_matrix(ratings)

n_users, n_items = ratings_df.shape
print("rating matrix shape", ratings_df.shape)

user_ids = ratings_df.index.values
artist_names = ap.sort_values("artistID")["name"].unique()

rating matrix shape (1892, 17632)


In [16]:
from lightfm import LightFM
from lightfm.evaluation import auc_score, precision_at_k, recall_at_k
from lightfm.cross_validation import random_train_test_split
from lightfm.data import Dataset

# Build data references + train test
Xcoo = X.tocoo()
data = Dataset()
data.fit(np.arange(n_users), np.arange(n_items))
interactions, weights = data.build_interactions(zip(Xcoo.row, Xcoo.col, Xcoo.data)) 
train, test = random_train_test_split(interactions)

# Ignore that (weight seems to be ignored...)
#train = train_.tocsr()
#test = test_.tocsr()
#train[train==1] = X[train==1]
#test[test==1] = X[test==1]

# To be completed...



In [29]:
#présenter les résultats sur le jeux de train ET de test 
# warning:  le train et test set ont une forme un peu différente de ce qu'on a l'habitude de voir, 
#donc regardez leurs shape et enquêter sur ce que c'est/ce qu'ils représentent 


#In a supervised learning, you use a training dataset, that contains outcomes, to train the machine. 
#You then use testing dataset that has no outcomes to predict outcomes.
#Training Data is data for build model and Testing Data is data for test model.
print(train.shape)
print(test.shape)

(1892, 17632)
(1892, 17632)


In [None]:
#lightfm.cross_validation

In [30]:
"""
Dataset splitting functions.
https://making.lyst.com/lightfm/docs/_modules/lightfm/cross_validation.html

"""

import numpy as np
import scipy.sparse as sp


def _shuffle(uids, iids, data, random_state):

    shuffle_indices = np.arange(len(uids))
    random_state.shuffle(shuffle_indices)

    return (uids[shuffle_indices],
            iids[shuffle_indices],
            data[shuffle_indices])
def random_train_test_split(interactions,
                            test_percentage=0.2,
                            random_state=None):
    """
    Randomly split interactions between training and testing.

    This function takes an interaction set and splits it into
    two disjoint sets, a training set and a test set. Note that
    no effort is made to make sure that all items and users with
    interactions in the test set also have interactions in the
    training set; this may lead to a partial cold-start problem
    in the test set.

    Parameters
    ----------

    interactions: a scipy sparse matrix containing interactions
        The interactions to split.
    test_percentage: float, optional
        The fraction of interactions to place in the test set.
    random_state: np.random.RandomState, optional
        The random state used for the shuffle.

    Returns
    -------

    (train, test): (scipy.sparse.COOMatrix,
                    scipy.sparse.COOMatrix)
         A tuple of (train data, test data)
    """

    if not sp.issparse(interactions):
        raise ValueError('Interactions must be a scipy.sparse matrix.')

    if random_state is None:
        random_state = np.random.RandomState()

    interactions = interactions.tocoo()

    shape = interactions.shape
    uids, iids, data = (interactions.row,
                        interactions.col,
                        interactions.data)

    uids, iids, data = _shuffle(uids, iids, data, random_state)

    cutoff = int((1.0 - test_percentage) * len(uids))

    train_idx = slice(None, cutoff)
    test_idx = slice(cutoff, None)

    train = sp.coo_matrix((data[train_idx],
                           (uids[train_idx],
                            iids[train_idx])),
                          shape=shape,
                          dtype=interactions.dtype)
    test = sp.coo_matrix((data[test_idx],
                          (uids[test_idx],
                           iids[test_idx])),
                         shape=shape,
                         dtype=interactions.dtype)

    return train, test

In [33]:
random_train_test_split(interactions)

(<1892x17632 sparse matrix of type '<class 'numpy.int32'>'
 	with 73758 stored elements in COOrdinate format>,
 <1892x17632 sparse matrix of type '<class 'numpy.int32'>'
 	with 18440 stored elements in COOrdinate format>)

Voici deux sous taches supplémentaire qui vont nous aider à evaluer/interpréter notre modéle, après l'obtention des tableaux de résultats : 
*  faire la fonction get_recommandation qui prend en entrée un User et renvoie les Artists recommandé (du meilleurs au moins bon au sens du score de recommandation)
* get_ground_truth qui renvoie les artistes ecoutés par un utilisateur par ordre décroissant du playCountScaled

Ceci nous permettra d"evaluer qualitatement les résultats que retourne le modéle et le comparer avec la vérité terrain.

In [133]:
#artist_recommendation en boucle

#https://towardsdatascience.com/how-to-build-a-movie-recommender-system-in-python-using-lightfm-8fa49d7cbe3b
model = LightFM(loss = 'warp')
model.fit(train, epochs=30, num_threads=2)

def artist_recommendation(model, ap, user_id):

 
    n_users, n_items = ap.shape
    for user_id in user_ids:
        
        known_positives = artist_names[train.tocsr()[user_id].indices]
        
        scores = model.predict(int(user_id), np.arange(n_items))

        top_items = artist_names[np.argsort(-scores)]

        print("User %s" % user_id)
        print("Known positives:")
        
        for x in known_positives[:3]:
            print("        %s" % x)
        
        print("Recommended:")
        
        for x in top_items[:3]:
            print("        %s" % x)

In [134]:
#artist_recommendation(model, ap, 1)

In [114]:

model = LightFM(loss = 'warp')
model.fit(train, epochs=30, num_threads=2)

def artist_recommendation(model, ap, user_id):

 
    n_users, n_items = ap.shape
    scores = model.predict(user_id, np.arange(n_items))
    top_items = artist_names[np.argsort(-scores)]
    print("User %s" % user_id)
    for x in top_items[:3]:
            print("        %s" % x)

In [115]:
artist_recommendation(model,ap,1)

User 1
        Carpathian Forest
        Moi dix Mois
        Bella Morte


# LightFM_ models evaluation

In [23]:
def models_results():
    learning_rate_datas = [0.05, 0.08, 0.1 ] #(dans la montagne) les pas 
    model_exploration = ['logistic', 'bpr', 'warp','warp-kos']
    k_list = [8,10] # nombre des items
    
    results = []
    for a in learning_rate_datas :
        for i in model_exploration :
            for z in k_list:
                
            # time
                t1 = time.process_time()
                t2 = time.process_time()
                t = t2 - t1
            # train
                model = LightFM(learning_rate=a, loss=i)
                model.fit(train, epochs=10, num_threads=2)
            
            #Evaluate
                train_precision = precision_at_k(model, train, k=z).mean()
                test_precision = precision_at_k(model, test, k=z).mean()

                train_auc = auc_score(model, train).mean()
                test_auc = auc_score(model, test).mean()
            
            
                dicttemp = {}
                dicttemp = {'Time:':t,'K':z, 'Name':i, 'Learning Rate':a, 'Train Precision':train_precision , 'Train AUC':train_auc, 'Test Precision':test_precision, 'Test AUC':test_auc}

                results.append(dicttemp)
            
    results = pd.DataFrame(results)
    results.to_csv('210126 - tests.csv', encoding='utf-8')  
    return results
    
      

In [24]:
models_results()

Unnamed: 0,Time:,K,Name,Learning Rate,Train Precision,Train AUC,Test Precision,Test AUC
0,4e-06,8,logistic,0.05,0.199616,0.88671,0.05404,0.810176
1,4e-06,10,logistic,0.05,0.195339,0.886537,0.05206,0.810845
2,4e-06,8,bpr,0.05,0.376126,0.846123,0.083935,0.77505
3,5e-06,10,bpr,0.05,0.348358,0.838972,0.076458,0.771834
4,4e-06,8,warp,0.05,0.396782,0.966209,0.092028,0.859558
5,5e-06,10,warp,0.05,0.377913,0.966627,0.088068,0.859423
6,5e-06,8,warp-kos,0.05,0.341367,0.886399,0.080591,0.819909
7,4e-06,10,warp-kos,0.05,0.34285,0.886877,0.077314,0.821613
8,5e-06,8,logistic,0.08,0.198954,0.884407,0.053973,0.811075
9,1.2e-05,10,logistic,0.08,0.194015,0.884128,0.05222,0.811118


In [66]:
## Construction de Grid-search manuelle

from itertools import product 
import copy 
from collections import defaultdict
## construire plusieurs modèles avec différentes valeurs des hyperparamètres mixées

        
def const_modeles_recom_grid(param_grid,base_model):
    
    ## modeles= dict()
    modeles= defaultdict(object)
    ## name of the model
    
    keys, values = zip(*param_grid.items())
    for v in product(*values):
        params = dict(zip(keys, v))
        this_model = copy.deepcopy(base_model)
        name = "-".join([str(x) for x in v])
        for k, v in params.items():
            setattr(this_model, k, v)
        modeles[name]= this_model
    
    return modeles 


param_grid = dict()
#space['LightFM__learning_rate'] = [0.05, 0.1, 0.5, 0.7, 0.8, 0.9]
#space['LightFM__loss'] = ["logistic", "warp", "bpr", "warp-kos"]
param_grid['learning_rate'] = [0.05, 0.1]
param_grid['loss'] = ["warp", "bpr"]
    # define the base_model
base_model=LightFM() 

In [69]:
const_modeles_recom_grid(param_grid,base_model)

defaultdict(object,
            {'0.05-warp': <lightfm.lightfm.LightFM at 0x7fe45e56d710>,
             '0.05-bpr': <lightfm.lightfm.LightFM at 0x7fe45e703d90>,
             '0.1-warp': <lightfm.lightfm.LightFM at 0x7fe45e56de50>,
             '0.1-bpr': <lightfm.lightfm.LightFM at 0x7fe45e56dbd0>})

In [105]:
## Construction de Grid-search manuelle (code fait avec Bassem à partir de ce code:)
#https://www.ethanrosenthal.com/2016/10/19/implicit-mf-part-1/

from itertools import product 
import copy 

## construire plusieurs modèles avec différentes valeurs des hyperparamètres mixéesdef const_modeles_recom_grid(param_grid,base_model):
    
    
    modeles= defaultdict(object)
    keys, values = zip(*param_grid.items())
    
    for v in product(*values):     
        params = dict(zip(keys, v))
        name = "-".join([str(x) for x in v])
        modeles[name]=base_model(**params)
    
    return modeles

# test
param_grid = dict()
param_grid['learning_rate'] = [0.05, 0.1,0.2]
param_grid['loss'] = ["warp", "bpr"]
   
# define the base_model
base_model=LightFM
modeles=const_modeles_recom_grid(param_grid,base_model)
modeles 

# code pour récupérer le meilleur jeu de parametres

defaultdict(object,
            {'0.05-warp': <lightfm.lightfm.LightFM at 0x7fe45e6a0a50>,
             '0.05-bpr': <lightfm.lightfm.LightFM at 0x7fe45e6a0190>,
             '0.1-warp': <lightfm.lightfm.LightFM at 0x7fe45e6a0490>,
             '0.1-bpr': <lightfm.lightfm.LightFM at 0x7fe45e6a0410>,
             '0.2-warp': <lightfm.lightfm.LightFM at 0x7fe45e6a0c10>,
             '0.2-bpr': <lightfm.lightfm.LightFM at 0x7fe45e6a0650>})

In [129]:
# Optimisation de paramètres avec GridSearch (code fait collectivement avec Sassia et Nidhal)

from sklearn.model_selection import ParameterGrid

# Create the parameter grid based on the results of random search 

param_grid = {
    'learning_rate': [0.05 , 0.08],
    'learning_schedule':['adagrad','adadelta'],
    'loss': ['warp','bpr','logistic','warp-kos']

}

# Create a based model

list(ParameterGrid(param_grid))

auc_score_values = []

for grid in ParameterGrid(param_grid):
    model = LightFM(**grid)
    pred = model.fit(train)
    auc_score_values.append(round(auc_score(model, test, train_interactions=train).mean(),3))

max_value = max(auc_score_values) 
max_index = auc_score_values.index(max_value) 
ParameterGrid(param_grid)[max_index ].items()

dict_items([('loss', 'warp'), ('learning_schedule', 'adadelta'), ('learning_rate', 0.05)])

In [130]:
print(f"la combinaison de paramètres qui permet d'optimiser  le model : {round(max_value,3)} est la suivante \n {ParameterGrid(param_grid)[max_index ].items()}") 

la combinaison de paramètres qui permet d'optimiser  le model : 0.8230000138282776 est la suivante 
 dict_items([('loss', 'warp'), ('learning_schedule', 'adadelta'), ('learning_rate', 0.05)])
