Recommender systems are a family of algorithms that suggest items to users based on information collected from those users. They have become ubiquitous and are commonly used in online stores, movie databases, and job search sites. In this notebook, we explore content-based recommender systems and implement a simple version of one using Python and the Pandas library.

### Importing the libraries

In [1]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Loading the data

In [2]:
# Store the movie information in a dataframe
movies_df = pd.read_csv('movies.csv')
# Store the user ratings in a dataframe
ratings_df = pd.read_csv('ratings.csv')

movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


### Exploring the data

In [4]:
movies_df.describe()

Unnamed: 0,movieId
count,34208.0
mean,75585.571445
std,50726.054238
min,1.0
25%,26008.75
50%,86452.0
75%,119454.5
max,151711.0


In [7]:
set(movies_df['movieId'])

{131072,
 1,
 2,
 3,
 4,
 5,
 ...}

In [8]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34208 entries, 0 to 34207
Data columns (total 3 columns):
movieId    34208 non-null int64
title      34208 non-null object
genres     34208 non-null object
dtypes: int64(1), object(2)
memory usage: 801.9+ KB


In [10]:
movies_df.shape

(34208, 3)

In [11]:
ratings_df.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,22884380.0,22884380.0,22884380.0,22884380.0
mean,123545.2,11408.16,3.526077,1128959000.0
std,71474.69,24136.88,1.061173,181989200.0
min,1.0,1.0,0.5,789652000.0
25%,61339.0,920.0,3.0,974763900.0
50%,123322.0,2329.0,3.5,1115685000.0
75%,185525.0,5218.0,4.0,1271194000.0
max,247753.0,151711.0,5.0,1454054000.0


In [12]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22884377 entries, 0 to 22884376
Data columns (total 4 columns):
userId       int64
movieId      int64
rating       float64
timestamp    int64
dtypes: float64(1), int64(3)
memory usage: 698.4 MB


Let's extract the year from the __title__ column into a new __year__ column using pandas' `str.extract`, and then remove it from the title with `str.replace`.

In [13]:
# Use a regular expression to find a year stored between parentheses
# We include the parentheses so as not to match movies that have a year in their title
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)


In [15]:
# Remove the parentheses
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)


In [17]:
# Remove the year from the 'title' column (recent pandas versions need regex=True here)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
# Strip any leading or trailing whitespace left behind
movies_df['title'] = movies_df['title'].str.strip()
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995
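
The extraction and clean-up steps above can be condensed, as a rough sketch on a toy frame. The name `toy_df` and its two rows are made up for illustration. The pattern is also anchored to the end of the title so that a year-like number elsewhere in the title is left alone; that anchoring is an assumption about the data, not something the cells above do.

```python
import pandas as pd

# Hypothetical two-row frame standing in for movies_df.
toy_df = pd.DataFrame({'title': ['Toy Story (1995)', '2001: A Space Odyssey (1968)']})

# Capture only the four digits inside the trailing parentheses.
toy_df['year'] = toy_df['title'].str.extract(r'\((\d{4})\)\s*$', expand=False)
# Remove the trailing "(YYYY)" together with surrounding whitespace.
toy_df['title'] = (
    toy_df['title']
    .str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)
    .str.strip()
)

print(toy_df)
```

Note that `year` is still stored as a string here, as in the notebook; converting it to a number would be a separate step.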


In [18]:
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


Since keeping the genres in a list is not well suited to content-based recommendation, we will use one-hot encoding to convert the list of genres into a vector in which each column corresponds to one possible value of the feature. This encoding is needed to feed categorical data into the computations that follow. Here, each genre gets its own column containing either 1 or 0: 1 means the movie has that genre, and 0 means it does not. We also store this dataframe in a separate variable, since the genres will not matter for our first recommender system.

In [19]:
# Copy the movie dataframe into a new one, since we will not need the genre information in our first case.
moviesWithGenres_df = movies_df.copy()

for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
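
The row-by-row loop above is easy to follow but can be slow on tens of thousands of movies. One possible vectorized alternative, sketched here on a made-up three-row frame (`toy_df` and `genre_dummies` are illustrative names), joins each list back into a `|`-separated string and lets `str.get_dummies` build the indicator columns.

```python
import pandas as pd

# Hypothetical miniature of movies_df after the genres were split into lists.
toy_df = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': [['Adventure', 'Comedy'], ['Adventure'], ['Comedy', 'Romance']],
})

# Join each list into "A|B|C", then expand into one 0/1 column per genre.
genre_dummies = toy_df['genres'].str.join('|').str.get_dummies(sep='|')
toy_with_genres = pd.concat([toy_df, genre_dummies], axis=1)

print(toy_with_genres)
```

Two differences from the loop are worth keeping in mind: the genre columns come out in alphabetical order rather than in order of first appearance, and they hold integers rather than floats.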


In [20]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


Each row of the ratings dataframe has a user ID associated with a movie, a rating, and a timestamp indicating when the rating was recorded. We will not need the timestamp column, so let's drop it to keep things simple.

In [21]:
ratings_df = ratings_df.drop(columns='timestamp')
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0
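
Since the ratings file takes up several hundred megabytes in memory, another option is to skip the unwanted column at load time instead of dropping it afterwards. The sketch below uses an in-memory stand-in for `ratings.csv` (three rows copied from the output above). The narrower dtypes are an assumption that happens to fit the ID ranges reported by `describe()`, and `float32` is only a reasonable choice if half-star ratings are all that is stored.

```python
import io
import pandas as pd

# Stand-in for ratings.csv; the rows are copied from the head() output above.
csv_text = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,169,2.5,1204927694\n"
    "1,2471,3.0,1204927438\n"
    "2,2571,3.5,1436165433\n"
)

# Read only the columns we need and use smaller numeric types.
ratings_small = pd.read_csv(
    csv_text,
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'},
)

print(ratings_small.dtypes)
```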


# Implementing the content-based recommendation system

Now let's see how to implement __Content-Based__ (or __Item-Item__) recommender systems. This technique tries to work out which aspects of an item a user likes most, and then recommends items that share those aspects. In our case, we will try to infer the input user's favorite genres from the movies and ratings provided.

Let's start by creating an input user to recommend movies to:

Note: to add more movies, simply add more elements to __userInput__. Feel free to add as many as you like! Just make sure each title is capitalized as it appears in the dataset, and if a movie starts with "The", such as "The Matrix", write it like this: 'Matrix, The'.

In [22]:
userInput = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


#### Adding movieId to the input user
With the input complete, let's extract the input movies' IDs from the movies dataframe and add them to the input.

We can do this by first filtering the rows that contain the input movies' titles and then merging that subset with the input dataframe. We also drop the columns the input does not need, to save memory.

In [23]:
#Filtering out the movies by title

inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]


In [24]:
inputId

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
293,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994
1246,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988
1885,1968,"Breakfast Club, The","[Comedy, Drama]",1985


In [25]:
# Then merge it with the input to get the movieId. The merge is implicitly on the title column.
inputMovies = pd.merge(inputId, inputMovies)
inputMovies

Unnamed: 0,movieId,title,genres,year,rating
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,3.5
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,2.0
2,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,5.0
3,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988,4.5
4,1968,"Breakfast Club, The","[Comedy, Drama]",1985,5.0


In [26]:
inputMovies = inputMovies.drop(columns=['genres', 'year'])
# The final input dataframe
# If a movie you added above is missing here, it may not be in the original
# dataframe, or it may be spelled differently; please check the capitalization.
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0
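
Because the merge above silently drops any title that does not match exactly, it can help to list the misses explicitly. A small sketch, using made-up frames (`catalog` and `wanted` are hypothetical names, and the lowercase "jumanji" is a deliberate typo): a left merge with `indicator=True` keeps every requested title and records whether a match was found.

```python
import pandas as pd

# Hypothetical miniature catalog and user input.
catalog = pd.DataFrame({'movieId': [1, 2], 'title': ['Toy Story', 'Jumanji']})
wanted = pd.DataFrame({'title': ['Toy Story', 'jumanji'], 'rating': [3.5, 2.0]})

# how='left' keeps every requested title; the `_merge` column flags unmatched rows.
checked = pd.merge(wanted, catalog, on='title', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'title'].tolist()

print(unmatched)
```

Any title that shows up in `unmatched` can then be corrected before the profile is built.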


We start by learning the input user's preferences, so let's take the subset of movies the input user has watched from the dataframe that holds the genres as binary values.

In [27]:
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
293,296,Pulp Fiction,"[Comedy, Crime, Drama, Thriller]",1994,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1246,1274,Akira,"[Action, Adventure, Animation, Sci-Fi]",1988,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1885,1968,"Breakfast Club, The","[Comedy, Drama]",1985,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We only need the genre table itself, so let's clean up the data a little by resetting the index and dropping the movieId, title, genres, and year columns.

In [28]:
# Reset the index so that rows line up with inputMovies later on
userMovies = userMovies.reset_index(drop=True)
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres', 'year'])
userGenreTable

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We are now ready to learn the input user's preferences!

To do this, we turn each genre into a weight. We multiply the input user's ratings into the input user's genre table and then sum the resulting table by column. This operation is in fact a dot product between a matrix and a vector, so we can carry it out simply by calling Pandas' `dot` method.

In [30]:
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


In [29]:
inputMovies['rating']

0    3.5
1    2.0
2    5.0
3    4.5
4    5.0
Name: rating, dtype: float64

In [31]:
userGenreTable.transpose()

Unnamed: 0,0,1,2,3,4
Adventure,1.0,1.0,0.0,1.0,0.0
Animation,1.0,0.0,0.0,1.0,0.0
Children,1.0,1.0,0.0,0.0,0.0
Comedy,1.0,0.0,1.0,0.0,1.0
Fantasy,1.0,1.0,0.0,0.0,0.0
Romance,0.0,0.0,0.0,0.0,0.0
Drama,0.0,0.0,1.0,0.0,1.0
Action,0.0,0.0,0.0,1.0,0.0
Crime,0.0,0.0,1.0,0.0,0.0
Thriller,0.0,0.0,1.0,0.0,0.0


In [32]:
# Dot product to get the genre weights
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])

userProfile

Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
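
To make the dot product less of a black box, here is a small check on two of the genre columns, with the values copied from the tables above (`by_hand` and `by_dot` are illustrative names). It computes the weights once by scaling each movie's row by its rating and summing down the columns, and once with `dot`.

```python
import numpy as np
import pandas as pd

# Ratings of the five input movies, in the order shown above.
ratings = pd.Series([3.5, 2.0, 5.0, 4.5, 5.0])
# Two of the genre columns from userGenreTable.
genres = pd.DataFrame({
    'Comedy':    [1.0, 0.0, 1.0, 0.0, 1.0],
    'Adventure': [1.0, 1.0, 0.0, 1.0, 0.0],
})

# Scale each movie's row by its rating, then sum down each genre column.
by_hand = genres.mul(ratings, axis=0).sum(axis=0)
# The same computation expressed as a matrix-vector product.
by_dot = genres.T.dot(ratings)

print(by_hand)
```

Both approaches give 13.5 for Comedy (3.5 + 5.0 + 5.0) and 10.0 for Adventure (3.5 + 2.0 + 4.5), matching the profile above.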

We now have a weight for each of the user's preferences. This is known as the user profile. With it, we can recommend movies that match the user's preferences.

Let's start by extracting the genre table from the original dataframe:

In [33]:
# Now get the genres of every movie in the original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])
# Drop the columns we do not need
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres', 'year'])
genreTable.head()

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
genreTable.shape

(34208, 20)

With the input profile and the full list of movies and their genres in hand, we take the weighted average of each movie's genres against the input profile and recommend the twenty movies that match it best.

In [35]:
# Multiply the genres by the weights, then take the weighted average.
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df.head()

movieId
1    0.594406
2    0.293706
3    0.188811
4    0.328671
5    0.188811
dtype: float64
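
As a sanity check on the first two scores, the weighted average can be reproduced by hand from the profile printed earlier. The helper `score` below is a hypothetical name used only for this illustration; the weights and genre lists are copied from the outputs above.

```python
import pandas as pd

# Non-zero weights copied from the user profile printed above.
profile = pd.Series({
    'Adventure': 10.0, 'Animation': 8.0, 'Children': 5.5, 'Comedy': 13.5,
    'Fantasy': 5.5, 'Drama': 10.0, 'Action': 4.5, 'Crime': 5.0,
    'Thriller': 5.0, 'Sci-Fi': 4.5,
})

toy_story_genres = ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
jumanji_genres = ['Adventure', 'Children', 'Fantasy']

def score(genres):
    # Sum of the weights of the movie's genres, divided by the total weight.
    return profile[genres].sum() / profile.sum()

print(round(score(toy_story_genres), 6), round(score(jumanji_genres), 6))
```

The total weight is 71.5, so Toy Story scores 42.5 / 71.5 and Jumanji 21.0 / 71.5, which agrees with the values 0.594406 and 0.293706 shown above.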

In [36]:
# Sort the recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
recommendationTable_df.head()

movieId
5018      0.748252
26093     0.734266
27344     0.720280
148775    0.685315
6902      0.678322
dtype: float64

In [37]:
# The final recommendation table (rows appear in catalog order, not in ranked order)
movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
664,673,Space Jam,"[Adventure, Animation, Children, Comedy, Fanta...",1996
1824,1907,Mulan,"[Adventure, Animation, Children, Comedy, Drama...",1998
2902,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988
4923,5018,Motorama,"[Adventure, Comedy, Crime, Drama, Fantasy, Mys...",1991
6793,6902,Interstate 60,"[Adventure, Comedy, Drama, Fantasy, Mystery, S...",2002
8605,26093,"Wonderful World of the Brothers Grimm, The","[Adventure, Animation, Children, Comedy, Drama...",1962
8783,26340,"Twelve Tasks of Asterix, The (Les douze travau...","[Action, Adventure, Animation, Children, Comed...",1976
9296,27344,Revolutionary Girl Utena: Adolescence of Utena...,"[Action, Adventure, Animation, Comedy, Drama, ...",1999
9825,32031,Robots,"[Adventure, Animation, Children, Comedy, Fanta...",2005
11716,51632,Atlantis: Milo's Return,"[Action, Adventure, Animation, Children, Comed...",2003
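
Filtering with `isin` returns the recommended movies in catalog order and can include movies the user has already rated. One possible refinement, sketched on a four-movie toy example (the `catalog`, `already_rated`, and `ranked` names are made up; scores and titles are copied from the outputs above), drops the rated movies first and then looks titles up in ranked order.

```python
import pandas as pd

# Scores copied from the outputs above, indexed by movieId.
scores = pd.Series({1: 0.594406, 2: 0.293706, 5018: 0.748252, 6902: 0.678322})
scores.index.name = 'movieId'

catalog = pd.DataFrame({
    'movieId': [1, 2, 5018, 6902],
    'title': ['Toy Story', 'Jumanji', 'Motorama', 'Interstate 60'],
})
already_rated = [1, 2]

# Drop movies the user has rated, then keep the highest-scoring ones.
top = scores.drop(index=already_rated).nlargest(2)
# Look the titles up in ranked order and attach the score.
ranked = catalog.set_index('movieId').loc[top.index].assign(score=top)

print(ranked)
```

Whether already-rated movies should be excluded is a design choice; the notebook above keeps them in the candidate pool.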
