# Study: Build a Movie Recommendation Engine

A movie website has contacted you: they want to launch a film recommendation engine to rescue their future customers' movie nights. Your job is to build that engine. The only catch: the site has no users yet.

#### Project context

Since the client has no data on its users, it provides a public database of movie information instead: https://www.dropbox.com/s/5779djv4tefh6vz/imdb-5000-movie-dataset%20%281%29.zip?dl=0

Using unsupervised methods, you must build a system that, given a movie title (or an id), returns 5 recommendations of similar films likely to interest the visitor.
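
One way to approach this without user data is content-based similarity: describe each film by its metadata and return the nearest neighbours of the requested title. The sketch below is only an illustration under assumptions, not the project's final model. The toy catalogue, the `recommend` helper, and the choice of genre indicators with cosine distance are all hypothetical; `NearestNeighbors` is the scikit-learn class imported in this notebook.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy catalogue using the dataset's "genres" format (not real rows)
toy = pd.DataFrame({
    "movie_title": ["Space Quest", "Star Raiders", "Quiet Tears", "Laugh Riot",
                    "Love Letters", "Robot Dawn", "Jungle Run"],
    "genres": ["Action|Adventure|Sci-Fi", "Action|Adventure|Sci-Fi", "Drama", "Comedy",
               "Comedy|Romance", "Action|Sci-Fi", "Action|Adventure"],
})

# One 0/1 indicator column per genre
features = toy["genres"].str.get_dummies(sep="|").to_numpy()

# Brute-force neighbour search with cosine distance (an assumed, simple choice)
knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(features)

def recommend(title, n=5):
    """Return the n titles closest to `title`, excluding the film itself."""
    idx = toy.index[toy["movie_title"] == title][0]
    # Ask for one extra neighbour because the query film is its own nearest match
    _, neighbours = knn.kneighbors(features[[idx]], n_neighbors=n + 1)
    return [toy["movie_title"].iloc[i] for i in neighbours[0] if i != idx][:n]

print(recommend("Space Quest"))
```

With real data the feature matrix could also include keywords, directors, or actors; which columns to use is exactly the variable-selection question the brief asks about.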


#### Learning format

Individual work. The project involves text processing, particularly for feature engineering. The following resources will help you build the necessary skills:

Course materials: https://www.dropbox.com/sh/jinhd7xyjq8gmxi/AAAXcjnfKGIB5gRBE3iI_3JPa?dl=0

https://www.dropbox.com/sh/6p3eqo93l85oj67/AADI0DkLpKHAmPy_h4KksX5ja?dl=0

Exercises: https://www.dropbox.com/sh/rgcc3mrs7hsvk5z/AAA4Rtcah1SKOqqUI3vgrsdGa?dl=0


#### Performance criteria

    Clean, commented exploratory analysis
    Clear, concise background research
    Relevant choice of modeling approach
    Relevant selection of variables for the modeling dataset
    Several algorithms tested, with performance evaluation
    Justified choice of the final model

### 1 - Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

### 2 - Set display options

In [2]:
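# use full option keys: short patterns such as "max_columns" can match several options in recent pandas
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", 500)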

### 3 - Read the dataset

In [3]:
df = pd.read_csv("movie_metadata.csv")

In [4]:
df.head(1)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000


In [5]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      5024 non-null   object 
 1   director_name              4939 non-null   object 
 2   num_critic_for_reviews     4993 non-null   float64
 3   duration                   5028 non-null   float64
 4   director_facebook_likes    4939 non-null   float64
 5   actor_3_facebook_likes     5020 non-null   float64
 6   actor_2_name               5030 non-null   object 
 7   actor_1_facebook_likes     5036 non-null   float64
 8   gross                      4159 non-null   float64
 9   genres                     5043 non-null   object 
 10  actor_1_name               5036 non-null   object 
 11  movie_title                5043 non-null   object 
 12  num_voted_users            5043 non-null   int64  
 13  cast_total_facebook_likes  5043 non-null   int64
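
The non-null counts above translate directly into missing-value shares; for example, `gross` has 4159 non-null values out of 5043 rows, so about 17.5% of its entries are missing. A minimal sketch of how such shares could be computed, shown on a hypothetical toy frame rather than the real `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps (values are illustrative only)
toy = pd.DataFrame({
    "gross": [760505847.0, np.nan, np.nan, 1.0],
    "genres": ["Action", "Drama", "Comedy", "Drama"],
    "color": ["Color", None, "Color", "Color"],
})

# isna() gives a boolean frame; its column means are the shares of missing values
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to `df`, the same two calls would rank the columns by how much imputation they need.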

### Data cleaning

In [6]:
# make a copy of data set
df_cp = df.copy()

#### Remove the trailing "\xa0" from movie_title

In [7]:
df_cp['movie_title']=='Avatar\xa0'

0        True
1       False
2       False
3       False
4       False
        ...  
5038    False
5039    False
5040    False
5041    False
5042    False
Name: movie_title, Length: 5043, dtype: bool

In [8]:
# strip ordinary spaces from both ends of each title
df_cp['movie_title'] = df_cp['movie_title'].str.strip(' ')

In [9]:
# \xa0 is the non-breaking space in Latin-1 (ISO 8859-1), i.e. chr(160).
# Strip it only where it is present instead of blindly slicing off the last character.
df_cp['movie_title'] = df_cp['movie_title'].str.rstrip('\xa0')

In [10]:
df_cp['movie_title']=='Avatar\xa0'

0       False
1       False
2       False
3       False
4       False
        ...  
5038    False
5039    False
5040    False
5041    False
5042    False
Name: movie_title, Length: 5043, dtype: bool
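
A blind slice such as `.str[:-1]` is only safe when every title really ends with `\xa0`. The sketch below contrasts it with replacing the non-breaking space and then stripping, which leaves titles without the artifact untouched. The sample titles are made up for illustration.

```python
import pandas as pd

# Hypothetical titles: with the artifact, without it, and with a space before it
titles = pd.Series(["Avatar\xa0", "Spectre", "Ben-Hur \xa0"])

# Blind slice: also eats the final letter of a clean title
sliced = titles.str[:-1]

# Replace the non-breaking space with a normal space, then strip both ends
cleaned = titles.str.replace("\xa0", " ", regex=False).str.strip()

print(sliced.tolist())
print(cleaned.tolist())
```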

#### Remove duplicate movie titles

In [11]:
df_cp[df_cp['movie_title']=='Ben-Hur']

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
367,Color,Timur Bekmambetov,1.0,141.0,335.0,635.0,Ayelet Zurer,11000.0,,Adventure|Drama|History,Morgan Freeman,Ben-Hur,57,13379,Moises Arias,2.0,,http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1,1.0,English,USA,PG-13,,2016.0,745.0,6.1,2.35,0
2613,Color,Timur Bekmambetov,1.0,141.0,335.0,635.0,Ayelet Zurer,11000.0,,Adventure|Drama|History,Morgan Freeman,Ben-Hur,62,13390,Moises Arias,2.0,chariot race|epic|false accusation|jerusalem|slave,http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1,1.0,English,USA,PG-13,100000000.0,2016.0,744.0,6.1,2.35,0
3967,Color,Timur Bekmambetov,1.0,141.0,335.0,635.0,Ayelet Zurer,11000.0,,Adventure|Drama|History,Morgan Freeman,Ben-Hur,67,13391,Moises Arias,2.0,chariot race|epic|false accusation|jerusalem|slave,http://www.imdb.com/title/tt2638144/?ref_=fn_tt_tt_1,1.0,English,USA,PG-13,100000000.0,2016.0,744.0,6.0,2.35,0


In [12]:
df_cp.movie_title.value_counts()

The Fast and the Furious    3
Ben-Hur                     3
Home                        3
Victor Frankenstein         3
King Kong                   3
                           ..
The Whole Ten Yards         1
North Country               1
88 Minutes                  1
Love & Other Drugs          1
My Date with Drew           1
Name: movie_title, Length: 4916, dtype: int64

In [13]:
# drop_duplicates() removes duplicate rows; here a duplicate is the same (director_name, movie_title) pair, and the last occurrence is kept
df_cp = df_cp.drop_duplicates(subset=['director_name','movie_title'],keep='last')

In [14]:
df_cp.movie_title.value_counts()

Out of the Blue                        2
The Host                               2
The Dead Zone                          2
Avatar                                 1
Held Up                                1
                                      ..
Trapped                                1
Barney's Version                       1
The Imaginarium of Doctor Parnassus    1
An Unfinished Life                     1
My Date with Drew                      1
Name: movie_title, Length: 4916, dtype: int64

Each of these three titles still appears twice, but with different director names, so the rows are distinct productions rather than duplicates.

In [15]:
df_cp[df_cp['movie_title']=='The Dead Zone']

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
2702,Color,,18.0,60.0,,186.0,Nicole de Boer,443.0,,Drama|Fantasy|Mystery|Sci-Fi,David Ogden Stiers,The Dead Zone,7122,981,Chris Bruno,1.0,psychic|psychic power|psychometry|spin off|supernatural power,http://www.imdb.com/title/tt0281432/?ref_=fn_tt_tt_1,77.0,English,Canada,TV-14,,,319.0,7.5,,576
3130,Color,David Cronenberg,112.0,103.0,0.0,275.0,Herbert Lom,1000.0,,Horror|Sci-Fi|Thriller,Tom Skerritt,The Dead Zone,44804,2013,Anthony Zerbe,0.0,car accident|coma|evil politician|psychic|vision,http://www.imdb.com/title/tt0085407/?ref_=fn_tt_tt_1,182.0,English,USA,R,10000000.0,1983.0,278.0,7.2,1.85,0
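
The effect of `subset` and `keep='last'` can be checked on a small frame that mimics the two situations above: three near-identical Ben-Hur rows from one director, and two The Dead Zone rows with different directors. The frame is a hypothetical reduction with a few values copied from the outputs shown, not the notebook's data.

```python
import pandas as pd

# Toy reduction of the rows shown above (only three columns kept)
toy = pd.DataFrame({
    "director_name": ["Timur Bekmambetov"] * 3 + [None, "David Cronenberg"],
    "movie_title": ["Ben-Hur"] * 3 + ["The Dead Zone"] * 2,
    "num_voted_users": [57, 62, 67, 7122, 44804],
})

# Rows count as duplicates only if BOTH director_name and movie_title match
deduped = toy.drop_duplicates(subset=["director_name", "movie_title"], keep="last")
print(deduped)
```

Only the last Ben-Hur row survives, while both The Dead Zone rows are kept because their directors differ.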


In [16]:
#df_cp.index.unique()

#### Replace NaN with placeholder values

In [17]:
df_cp.head(1)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000


In [18]:
# fill null string columns with 'Unknown' ('Unrated' for content_rating) so the tables can be normalized (3NF)
# these placeholder values will be excluded later when building features for the model
df_cp = df_cp.fillna({'color':'Unknown', 
           'director_name':'Unknown', 
           'actor_2_name':'Unknown', 
           'genres':'Unknown',
           'actor_1_name':'Unknown', 
           'actor_3_name':'Unknown', 
           'plot_keywords':'Unknown', 
           'movie_imdb_link':'Unknown',
           'language':'Unknown',
           'country': 'Unknown',
           'content_rating':'Unrated'
          })

In [19]:
df_cp[df_cp['director_name'] == 'Unknown'].head(2)

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
177,Color,Unknown,21.0,60.0,,184.0,Philip Michael Thomas,982.0,,Action|Crime|Drama|Mystery|Thriller,Don Johnson,Miami Vice,16769,1687,John Diehl,2.0,cult tv|detective|drugs|police|undercover,http://www.imdb.com/title/tt0086759/?ref_=fn_tt_tt_1,74.0,English,USA,TV-14,1500000.0,,321.0,7.5,1.33,0
260,Color,Unknown,29.0,60.0,,432.0,Dirk Benedict,669.0,,Action|Adventure|Crime,George Peppard,The A-Team,25402,1655,Dwight Schultz,4.0,1980s|cult tv|famous opening theme|good versus evil|hero for hire,http://www.imdb.com/title/tt0084967/?ref_=fn_tt_tt_1,97.0,English,USA,TV-PG,,,554.0,7.6,4.0,0


In [20]:
# fill null numeric (int, float) columns with the sentinel value -1
df_cp = df_cp.fillna({'num_critic_for_reviews':-1, 
           'duration':-1, 
           'director_facebook_likes':-1, 
           'actor_3_facebook_likes':-1, 
           'actor_1_facebook_likes':-1, 
           'gross':-1,
           'num_voted_users':-1,
           'cast_total_facebook_likes':-1,
           'facenumber_in_poster':-1,
           'num_user_for_reviews':-1,
           'budget':-1,
           'title_year':-1,
           'actor_2_facebook_likes':-1,
           'imdb_score':-1,
           'aspect_ratio':-1,
           'movie_facebook_likes':-1
          })

#df_cp = df_cp.fillna(-1)

In [21]:
df_cp.isnull().sum()

color                        0
director_name                0
num_critic_for_reviews       0
duration                     0
director_facebook_likes      0
actor_3_facebook_likes       0
actor_2_name                 0
actor_1_facebook_likes       0
gross                        0
genres                       0
actor_1_name                 0
movie_title                  0
num_voted_users              0
cast_total_facebook_likes    0
actor_3_name                 0
facenumber_in_poster         0
plot_keywords                0
movie_imdb_link              0
num_user_for_reviews         0
language                     0
country                      0
content_rating               0
budget                       0
title_year                   0
actor_2_facebook_likes       0
imdb_score                   0
aspect_ratio                 0
movie_facebook_likes         0
dtype: int64
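
Listing every column by hand works, but the same fill rules could also be derived from the column dtypes. The following is a sketch under assumptions: the toy frame is hypothetical, and `select_dtypes` is used to separate numeric columns from text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per column
toy = pd.DataFrame({
    "director_name": ["James Cameron", None],
    "content_rating": [None, "PG-13"],
    "budget": [237000000.0, np.nan],
})

num_cols = toy.select_dtypes(include="number").columns
text_cols = toy.select_dtypes(exclude="number").columns

# Build the fill dictionary: 'Unknown' for text, -1 for numbers, 'Unrated' as a special case
fill_values = {col: "Unknown" for col in text_cols}
fill_values["content_rating"] = "Unrated"
fill_values.update({col: -1 for col in num_cols})

filled = toy.fillna(fill_values)
print(filled)
```

A sentinel such as -1 is convenient for storage, but it would distort any distance computed on those numeric columns, so it may need to be masked or imputed differently before modelling.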

In [22]:
df_cp.shape

(4919, 28)

In [23]:
# change title_year type to int
df_cp['title_year'] = df_cp['title_year'].astype(int)

In [24]:
#export to csv
df_cp.to_csv('movie.csv', index = False)

#### Prepare the director CSV

In [25]:
director = df_cp['director_name'].unique()

In [26]:
director

array(['James Cameron', 'Gore Verbinski', 'Sam Mendes', ...,
       'Scott Smith', 'Benjamin Roberds', 'Daniel Hsia'], dtype=object)

In [27]:
df_director = pd.DataFrame(director, columns=['director_name']) 

In [28]:
#export to csv
df_director.to_csv('df_director.csv', index = False)

#### Split director_name into first_name and last_name

In [29]:
df_director_split = df_director.copy()

In [30]:
for i in range(1,10,1):
    print(df_director.loc[df_director['director_name'].str.split().str.len() == i])

     director_name
30             McG
106        Unknown
171          Pitof
462   Costa-Gavras
1114           RZA
1409       Maïwenn
1667        Shekar
1947          Remo
2333     Valentine
          director_name
0         James Cameron
1        Gore Verbinski
2            Sam Mendes
3     Christopher Nolan
4           Doug Walker
...                 ...
2393      Shane Carruth
2395    Anthony Vallone
2396        Scott Smith
2397   Benjamin Roberds
2398        Daniel Hsia

[2182 rows x 1 columns]
                  director_name
34           Guillermo del Toro
88          Jennifer Yuh Nelson
89           M. Night Shyamalan
101                  Jon M. Chu
107       Alejandro G. Iñárritu
130         Mark Steven Johnson
133             James L. Brooks
137               Kirk De Micco
149                 Jan de Bont
183        Michael Patrick King
198              Brian De Palma
200              Alan J. Pakula
205          Paul W.S. Anderson
262                F. Gary Gray
300         Phil 

In [31]:
df_director_split.iloc[610:620]

Unnamed: 0,director_name
610,Patrick Tatopoulos
611,Gary David Goldberg
612,Alan Poul
613,Luke Greenfield
614,Gil Junger
615,Michael Ritchie
616,Steven E. de Souza
617,Alexandre Aja
618,Michael Rymer
619,Hugh Wilson


In [32]:
# single-word names (e.g. McG): the same token is used as both first_name and last_name
df_director_split.loc[df_director_split['director_name'].str.split().str.len() == 1, 'first_name'] = df_director_split['director_name'].str.split().str[0]
df_director_split.loc[df_director_split['director_name'].str.split().str.len() == 1, 'last_name'] = df_director_split['director_name'].str.split().str[-1]

In [33]:
# two-word names: the first word is first_name, the second is last_name
df_director_split.loc[df_director_split['director_name'].str.split().str.len() == 2, 'first_name'] = df_director_split['director_name'].str.split().str[0]
df_director_split.loc[df_director_split['director_name'].str.split().str.len() == 2, 'last_name'] = df_director_split['director_name'].str.split().str[-1]

In [34]:
# three-word names: the first word is first_name, the last two words form last_name
df_director_split.loc[df_director_split['director_name'].str.split().str.len() == 3, 'first_name'] = df_director_split['director_name'].str.split().str[0]
df_director_split.loc[df_director_split['director_name'].str.split().str.len() == 3, 'last_name'] = df_director_split['director_name'].str.split().str[-2] + ' ' + df_director_split['director_name'].str.split().str[-1]

In [35]:
# four-word names: the first word is first_name, the last three words form last_name
df_director_split.loc[df_director_split['director_name'].str.split().str.len() == 4, 'first_name'] = df_director_split['director_name'].str.split().str[0]
df_director_split.loc[df_director_split['director_name'].str.split().str.len() == 4, 'last_name'] = df_director_split['director_name'].str.split().str[-3] + ' ' + df_director_split['director_name'].str.split().str[-2] + ' ' + df_director_split['director_name'].str.split().str[-1]

In [36]:
df_director_split.loc[df_director_split['director_name'].str.split().str.len() == 4]

Unnamed: 0,director_name,first_name,last_name
616,Steven E. de Souza,Steven,E. de Souza
1184,Preston A. Whitmore II,Preston,A. Whitmore II
1192,Daisy von Scherler Mayer,Daisy,von Scherler Mayer
1318,Florian Henckel von Donnersmarck,Florian,Henckel von Donnersmarck
1362,Álex de la Iglesia,Álex,de la Iglesia
1671,Analeine Cal y Mayor,Analeine,Cal y Mayor
1722,Fernando León de Aranoa,Fernando,León de Aranoa
2051,Regardt van den Bergh,Regardt,van den Bergh


In [37]:
df_director_split.shape

(2399, 3)

In [38]:
#export to csv
df_director_split.to_csv('df_director_split.csv', index = False)

#### Prepare actor.csv

In [39]:
from itertools import chain

df_actor123 = df_cp[['actor_1_name','actor_2_name','actor_3_name']]
actor_unique = list(set(chain.from_iterable(df_actor123.values)))
df_actor = pd.DataFrame(actor_unique, columns=['actor_name'])

In [40]:
df_actor.nunique()

actor_name    6256
dtype: int64
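
Because missing names were filled with 'Unknown' earlier, that placeholder ends up in the actor list as well. Below is a sketch of another way to collect the unique names while leaving the placeholder out, using hypothetical toy rows. Dropping 'Unknown' is an assumption about what the actor table should contain.

```python
import pandas as pd

# Hypothetical toy rows in the shape of the three actor columns
toy = pd.DataFrame({
    "actor_1_name": ["CCH Pounder", "Morgan Freeman", "Unknown"],
    "actor_2_name": ["Joel David Moore", "Ayelet Zurer", "Morgan Freeman"],
    "actor_3_name": ["Wes Studi", "Moises Arias", "Unknown"],
})

# Flatten the three columns into one Series, then de-duplicate and drop the placeholder
all_names = toy[["actor_1_name", "actor_2_name", "actor_3_name"]].stack()
unique_names = sorted(set(all_names) - {"Unknown"})

actors = pd.DataFrame({"actor_name": unique_names})
print(actors)
```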

In [41]:
#export to csv
df_actor.to_csv('df_actor.csv', index = False)

#### Split actor_name into first_name and last_name

In [42]:
df_actor_split = df_actor.copy()

In [43]:
for i in range(1,10,1):
    print(df_actor.loc[df_actor['actor_name'].str.split().str.len() == i])

            actor_name
20                Akon
21        Brahmanandam
676        Ann-Margret
749               T.I.
909              Bhama
978               Pelé
1094           Steve-O
1206           Lalaine
1226       Prabhudheva
1291           Maïwenn
1343            Divine
1415             Ice-T
1517           Revathy
1572              Flea
1731              Mako
1899              Bono
2106           Charice
2167           Luenell
2280               Eve
2571           Chester
2944          Mo'Nique
3020            Xzibit
3065              Leon
3088              Remo
3106             Slash
3129            Denden
3172            Common
3314             Akima
3434             Topol
3565           Aaliyah
3724             Rekha
3835             Terry
3868           Prabhas
4020       Laura-Leigh
4035              Pink
4037            Malika
4054          Madhavan
4201             Lemmy
4207              Rain
4235           Prateik
4243         Will.i.am
4401             Björk
4526       

Empty DataFrame
Columns: [actor_name]
Index: []
Empty DataFrame
Columns: [actor_name]
Index: []
Empty DataFrame
Columns: [actor_name]
Index: []
Empty DataFrame
Columns: [actor_name]
Index: []
Empty DataFrame
Columns: [actor_name]
Index: []


In [44]:
df_actor_split.loc[df_actor_split['actor_name'].str.split().str.len() == 1]

Unnamed: 0,actor_name
20,Akon
21,Brahmanandam
676,Ann-Margret
749,T.I.
909,Bhama
978,Pelé
1094,Steve-O
1206,Lalaine
1226,Prabhudheva
1291,Maïwenn


In [45]:
df_actor_split

Unnamed: 0,actor_name
0,Michael Chapman
1,Martin Sharpe
2,Jordan Trovillion
3,Akie Kotabe
4,Kamatari Fujiwara
...,...
6251,Yayan Ruhian
6252,Ivan Dixon
6253,Fernanda Torres
6254,Kang-ho Song


In [46]:
df_actor_split.isnull().sum()

actor_name    0
dtype: int64

In [47]:
# same scheme as for directors: single-word names fill both first_name and last_name
df_actor_split.loc[df_actor_split['actor_name'].str.split().str.len() == 1, 'first_name'] = df_actor_split['actor_name'].str.split().str[0]
df_actor_split.loc[df_actor_split['actor_name'].str.split().str.len() == 1, 'last_name'] = df_actor_split['actor_name'].str.split().str[0]

In [48]:
df_actor_split.loc[df_actor_split['actor_name'].str.split().str.len() == 2, 'first_name'] = df_actor_split['actor_name'].str.split().str[-2]
df_actor_split.loc[df_actor_split['actor_name'].str.split().str.len() == 2, 'last_name'] = df_actor_split['actor_name'].str.split().str[-1]

In [49]:
df_actor_split.loc[df_actor_split['actor_name'].str.split().str.len() == 3, 'first_name'] = df_actor_split['actor_name'].str.split().str[0]
df_actor_split.loc[df_actor_split['actor_name'].str.split().str.len() == 3, 'last_name'] = df_actor_split['actor_name'].str.split().str[-2] + ' ' + df_actor_split['actor_name'].str.split().str[-1]

In [50]:
df_actor_split.loc[df_actor_split['actor_name'].str.split().str.len() == 4, 'first_name'] = df_actor_split['actor_name'].str.split().str[0]
df_actor_split.loc[df_actor_split['actor_name'].str.split().str.len() == 4, 'last_name'] = df_actor_split['actor_name'].str.split().str[-3] + ' ' + df_actor_split['actor_name'].str.split().str[-2] + ' ' + df_actor_split['actor_name'].str.split().str[-1]

In [51]:
df_actor_split.isnull().sum()

actor_name    0
first_name    0
last_name     0
dtype: int64

In [52]:
#export to csv
df_actor_split.to_csv('df_actor_split.csv', index = False)

#### Prepare genres.csv

In [53]:
df_cp['genres'].value_counts()

Drama                                                     233
Comedy                                                    205
Comedy|Drama                                              189
Comedy|Drama|Romance                                      185
Comedy|Romance                                            157
                                                         ... 
Crime|Drama|History                                         1
Action|Animation|Comedy|Family|Fantasy|Sci-Fi               1
Biography|Crime|Drama|History|Thriller                      1
Adventure|Animation|Comedy|Drama|Family|Fantasy|Sci-Fi      1
Comedy|Crime|Horror                                         1
Name: genres, Length: 914, dtype: int64

In [54]:
genres = df_cp['genres'].unique()

In [55]:
df_genres= pd.DataFrame(genres, columns=['genres']) 

In [56]:
# split each genre string on '|' and stack the pieces into one long Series
stacked = pd.DataFrame(df_genres.genres.str.split(pat='|').tolist()).stack()

In [57]:
stacked

0    0       Action
     1    Adventure
     2      Fantasy
     3       Sci-Fi
1    0       Action
            ...    
912  1        Drama
     2       Horror
913  0       Comedy
     1        Crime
     2       Horror
Length: 3569, dtype: object

In [58]:
genres = pd.DataFrame(stacked.value_counts().index.to_list(), columns = ["genres"])

In [59]:
genres = genres.sort_values(by="genres", inplace=False, ignore_index=True)

In [60]:
genres.to_csv('genres.csv', index = False)
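
The split-and-stack steps above could also be condensed with `Series.explode`. This is an alternative sketch on a hypothetical toy Series, not the notebook's exact code path:

```python
import pandas as pd

# Hypothetical genre strings in the dataset's pipe-separated format
toy = pd.Series(["Action|Adventure|Sci-Fi", "Comedy|Drama", "Drama"], name="genres")

# Split into lists, put each genre on its own row, then keep the sorted unique values
unique_genres = sorted(toy.str.split("|").explode().unique())

genres_table = pd.DataFrame({"genres": unique_genres})
print(genres_table)
```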