# Table of Contents

[Importing libraries and functions](#Import)

[Importing the datasets](#Import)

[Data exploration](#Exploration)

[Data cleaning and transformation](#Nettoyage)

## KPI

[1) Top 1000 highest-rated films: number of films per director](#TOP_1000_des_films_les_mieux_notes)

[2) Top 1000 highest-rated films: number of films per genre](#TOP_1000_des_films_les_mieux_notes1)

[3) Top 1000 highest-rated films: number of appearances of lead actors](#TOP_1000_des_films_les_mieux_notes2)

[4) Evolution of the number of films per genre over the years](#Evolution_du_nombre)

## **Importing libraries and functions <a id='Import'></a>**

### Libraries

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
import statsmodels.api as sm
import statsmodels.formula.api as smf
import plotly.express as px
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import re

### Functions

In [None]:
def remove_outliers(df, column):
    """Keep the rows whose value in `column` lies within the 1.5 * IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    df_filtered = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return df_filtered

def keep_outliers(df, column):
    """Keep the rows whose value in `column` lies outside the 1.5 * IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    # Strict comparisons so that no row is returned by both functions
    df_filtered = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return df_filtered

def outliers_funct_echantillon(df, colonne):
    """Return the values of `colonne` at or above its 99.56th percentile."""
    lower_bound = df[colonne].quantile(0.9956)
    upper_bound = df[colonne].quantile(1)  # quantile(1) is the column maximum
    outliers = df[(df[colonne] >= lower_bound) & (df[colonne] <= upper_bound)][colonne]
    return outliers
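
To make the 1.5 × IQR rule behind `remove_outliers` and `keep_outliers` concrete, here is a minimal sketch on made-up vote counts. The helper `iqr_bounds` and the toy values are assumptions for illustration only; they are not part of the notebook.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up vote counts with one obvious extreme value
toy = pd.DataFrame({'numVotes': [10, 11, 12, 12, 13, 500]})
lower, upper = iqr_bounds(toy['numVotes'])

inside = toy['numVotes'].between(lower, upper)  # inclusive on both ends
inliers = toy[inside]
outliers = toy[~inside]
print(lower, upper)
print(outliers)
```

Only the row with 500 votes falls outside the fences here, so it is the one `keep_outliers` would return and `remove_outliers` would discard.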

## **Importing the datasets**

In [None]:
# Load the IMDb datasets
df_names = pd.read_csv('https://datasets.imdbws.com/name.basics.tsv.gz', sep ='\t', nrows=1000000)
df_title = pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', sep = '\t', dtype={'isAdult': str}, nrows=1000000)
df_title_crew = pd.read_csv('https://datasets.imdbws.com/title.crew.tsv.gz', sep ='\t', nrows=10000000)
df_title_principals = pd.read_csv('https://datasets.imdbws.com/title.principals.tsv.gz', sep ='\t', nrows=1000000)
df_ratings = pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', sep ='\t',nrows = 1000000)
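
The IMDb files mark missing values with the literal string `\N`, which is why so many columns are loaded as `object`. One option, which this notebook does not use, is to let `pd.read_csv` turn that marker into `NaN` at load time through `na_values`. The sketch below runs on a two-row in-memory sample (`sample_tsv` is made up) rather than on the real download.

```python
import io
import pandas as pd

# Two made-up rows in the same tab-separated layout as title.basics
sample_tsv = "tconst\tstartYear\truntimeMinutes\ntt0000001\t1894\t1\ntt0000002\t\\N\t\\N\n"

# na_values='\\N' tells pandas to read the IMDb missing marker as NaN
df_sample = pd.read_csv(io.StringIO(sample_tsv), sep='\t', na_values='\\N')
print(df_sample.dtypes)
print(df_sample.isna().sum())
```

With this option the numeric columns arrive as floats directly. Applying it to the real files would change the outputs shown in the rest of the notebook, so it is offered only as a possible variant.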

## **Data exploration <a id='Exploration'></a>**

### df_ratings

In [None]:
# Preview the DataFrame
df_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2008
1,tt0000002,5.7,270
2,tt0000003,6.5,1926
3,tt0000004,5.4,178
4,tt0000005,6.2,2701


In [None]:
# General information
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1000000 non-null  object 
 1   averageRating  1000000 non-null  float64
 2   numVotes       1000000 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 22.9+ MB


In [None]:
# Descriptive statistics
df_ratings.describe()

Unnamed: 0,averageRating,numVotes
count,1000000.0,1000000.0
mean,6.907042,1170.645
std,1.37395,19620.16
min,1.0,5.0
25%,6.2,12.0
50%,7.1,28.0
75%,7.8,108.0
max,10.0,2830977.0


### df_names

In [None]:
# Preview the DataFrame
df_names.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack","tt0117057,tt0038355,tt0037382,tt0075213"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,soundtrack,music_department","tt0057345,tt0049189,tt0056404,tt0054452"
3,nm0000004,John Belushi,1949,1982,"actor,soundtrack,writer","tt0072562,tt0077975,tt0078723,tt0080455"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0083922,tt0069467,tt0050986,tt0050976"


In [None]:
# General information
df_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column             Non-Null Count    Dtype 
---  ------             --------------    ----- 
 0   nconst             1000000 non-null  object
 1   primaryName        999999 non-null   object
 2   birthYear          1000000 non-null  object
 3   deathYear          1000000 non-null  object
 4   primaryProfession  962695 non-null   object
 5   knownForTitles     1000000 non-null  object
dtypes: object(6)
memory usage: 45.8+ MB


In [None]:
# Descriptive statistics
df_names.describe()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
count,1000000,999999,1000000,1000000,962695,1000000
unique,1000000,941393,375,333,11942,760287
top,nm0000001,John Williams,\N,\N,actor,\N
freq,1,27,749388,865201,223091,17277


### df_title_crew

In [None]:
# Preview the DataFrame
df_title_crew.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


In [None]:
# General information
df_title_crew.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   tconst     object
 1   directors  object
 2   writers    object
dtypes: object(3)
memory usage: 228.9+ MB


In [None]:
# Descriptive statistics
df_title_crew.describe()

Unnamed: 0,tconst,directors,writers
count,10000000,10000000,10000000
unique,10000000,894109,1282395
top,tt0000001,\N,\N
freq,1,4277048,4845596


### df_title

In [None]:
# Preview the DataFrame
df_title.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [None]:
# General information
df_title.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 9 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   tconst          1000000 non-null  object
 1   titleType       1000000 non-null  object
 2   primaryTitle    1000000 non-null  object
 3   originalTitle   1000000 non-null  object
 4   isAdult         1000000 non-null  object
 5   startYear       1000000 non-null  object
 6   endYear         1000000 non-null  object
 7   runtimeMinutes  1000000 non-null  object
 8   genres          1000000 non-null  object
dtypes: object(9)
memory usage: 68.7+ MB


In [None]:
# Descriptive statistics
df_title.describe()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
count,1000000,1000000,1000000,1000000,1000000,1000000,1000000,1000000,1000000
unique,1000000,10,722884,730146,2,139,84,639,1721
top,tt0000001,tvEpisode,Episode #1.1,Episode #1.1,0,2006,\N,\N,Drama
freq,1,466199,2285,2285,958372,62923,977791,442168,99549


### df_title_principals

In [None]:
# Preview the DataFrame
df_title_principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0374658,cinematographer,director of photography,\N
3,tt0000002,1,nm0721526,director,\N,\N
4,tt0000002,2,nm1335271,composer,\N,\N


In [None]:
# General information
df_title_principals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   tconst      1000000 non-null  object
 1   ordering    1000000 non-null  int64 
 2   nconst      1000000 non-null  object
 3   category    1000000 non-null  object
 4   job         1000000 non-null  object
 5   characters  1000000 non-null  object
dtypes: int64(1), object(5)
memory usage: 45.8+ MB


In [None]:
# Descriptive statistics
df_title_principals.describe()

Unnamed: 0,ordering
count,1000000.0
mean,5.24463
std,2.84
min,1.0
25%,3.0
50%,5.0
75%,8.0
max,10.0


## **Data cleaning and transformation <a id='Nettoyage'></a>**

#### Table: df_title

In [None]:
# Convert to numeric values ('\N' and other non-numeric entries become NaN)
df_title['startYear'] = pd.to_numeric(df_title['startYear'], errors='coerce')
df_title['endYear'] = pd.to_numeric(df_title['endYear'], errors='coerce')
df_title['runtimeMinutes'] = pd.to_numeric (df_title['runtimeMinutes'], errors = 'coerce')
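
A minimal sketch of year conversion with `pd.to_numeric` on made-up values. The third value is an early year chosen on purpose: it lies outside the range that pandas timestamps cover at the default nanosecond resolution (roughly 1677 to 2262), but it is still a perfectly valid number. The variable names are illustrative.

```python
import pandas as pd

# Made-up year strings: a normal year, the IMDb missing marker, and an early year
years = pd.Series(['1894', '\\N', '1564'])

as_float = pd.to_numeric(years, errors='coerce')  # '\N' becomes NaN
as_int = as_float.astype('Int64')                 # nullable integers keep the gap as <NA>
print(as_int)
```

The nullable `Int64` dtype is optional. It avoids the `1894.0` float display while still allowing missing values.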

#### Table: df_names

In [None]:
# Convert to numeric values ('\N' becomes NaN; early birth years stay valid numbers)
df_names['birthYear'] = pd.to_numeric(df_names['birthYear'], errors='coerce')
df_names['deathYear'] = pd.to_numeric(df_names['deathYear'], errors='coerce')

#### df_title - keep only 'movie' titles

In [None]:
# Filter: keep only rows whose titleType is 'movie'
df_title_only_movies = df_title.loc[df_title['titleType'] == 'movie']
df_title_only_movies

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894.0,,45.0,Romance
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,,100.0,"Documentary,News,Sport"
498,tt0000502,movie,Bohemios,Bohemios,0,1905.0,,100.0,\N
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,0,1906.0,,70.0,"Action,Adventure,Biography"
587,tt0000591,movie,The Prodigal Son,L'enfant prodigue,0,1907.0,,90.0,Drama
...,...,...,...,...,...,...,...,...,...
999463,tt10057306,movie,Game-Changer,Game-Changer,0,,,,Family
999484,tt10057342,movie,The Orchestration of Audrey,The Orchestration of Audrey,0,,,64.0,Drama
999745,tt10057796,movie,Best Wishes,Best Wishes,0,2019.0,,,Drama
999770,tt10057838,movie,;,;,0,2017.0,,,\N


In [None]:
# EXPLODE
# When a film has several genres, create one row per genre so that each row holds a single genre

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slice
df_title_only_movies = df_title_only_movies.assign(genres=df_title_only_movies['genres'].str.split(','))
df_title_only_movies = df_title_only_movies.explode('genres')
df_title_only_movies


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894.0,,45.0,Romance
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,,100.0,Documentary
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,,100.0,News
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,0,1897.0,,100.0,Sport
498,tt0000502,movie,Bohemios,Bohemios,0,1905.0,,100.0,\N
...,...,...,...,...,...,...,...,...,...
999463,tt10057306,movie,Game-Changer,Game-Changer,0,,,,Family
999484,tt10057342,movie,The Orchestration of Audrey,The Orchestration of Audrey,0,,,64.0,Drama
999745,tt10057796,movie,Best Wishes,Best Wishes,0,2019.0,,,Drama
999770,tt10057838,movie,;,;,0,2017.0,,,\N
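
The split-then-explode step can be checked on a tiny made-up table. This is a sketch with hypothetical names (`toy_movies`, `exploded`); the two rows reuse values visible in the output above.

```python
import pandas as pd

toy_movies = pd.DataFrame({
    'tconst': ['tt0000147', 'tt0000009'],
    'genres': ['Documentary,News,Sport', 'Romance'],
})

# Split the comma-separated string into a list, then emit one row per list element
exploded = toy_movies.assign(genres=toy_movies['genres'].str.split(',')).explode('genres')
print(exploded)
```

The original index is repeated for every genre of the same film, which is why index 144 appears three times in the notebook output.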


In [None]:
# Keep only films whose runtime is longer than 60 minutes
df_title_only_movies = df_title_only_movies[df_title_only_movies['runtimeMinutes'] > 60]

In [None]:
# Count missing values per column, to decide which columns to drop
df_title_only_movies.isna().sum()

tconst                 0
titleType              0
primaryTitle           0
originalTitle          0
isAdult                0
startYear             73
endYear           260320
runtimeMinutes         0
genres                 0
dtype: int64

In [None]:
df_title_only_movies = df_title_only_movies.drop('endYear', axis = 1)

In [None]:
df_title_only_movies = df_title_only_movies.drop('isAdult', axis = 1)

In [None]:
# Remove the rows whose genre is the missing-value marker '\\N'
df_title_only_movies = df_title_only_movies[df_title_only_movies['genres'] != '\\N']

#### df_title_principals

In [None]:
df_title_principals = df_title_principals.drop('job', axis = 1)

In [None]:
df_title_principals = df_title_principals.drop('characters', axis = 1)

In [None]:
df_title_principals

Unnamed: 0,tconst,ordering,nconst,category
0,tt0000001,1,nm1588970,self
1,tt0000001,2,nm0005690,director
2,tt0000001,3,nm0374658,cinematographer
3,tt0000002,1,nm0721526,director
4,tt0000002,2,nm1335271,composer
...,...,...,...,...
999995,tt0116669,1,nm0000245,actor
999996,tt0116669,2,nm0000178,actress
999997,tt0116669,3,nm0450116,actor
999998,tt0116669,4,nm0000182,actress


# **KPI**

## 1- Top 1000 highest-rated films: number of films per director <a id="TOP_1000_des_films_les_mieux_notes"></a>

### **Top 1000 movies**

In [None]:
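# Apply outliers_funct_echantillon to keep only the most-voted titles (numVotes at or above the 99.56th percentile)

df_ratings_by_numVotes = df_ratings.copy()  # work on a copy so that df_ratings itself is not modified
df_ratings_by_numVotes['numVotes'] = outliers_funct_echantillon(df_ratings_by_numVotes, colonne='numVotes')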

In [None]:
# Drop the rows left as NaN by the previous step, i.e. the titles below the vote threshold
df_ratings_by_numVotes.dropna(subset=['numVotes'], inplace=True)

In [None]:
df_ratings_by_numVotes

Unnamed: 0,tconst,averageRating,numVotes
297,tt0000417,8.2,54390.0
2845,tt0010323,8.0,68768.0
3361,tt0012349,8.2,132734.0
3643,tt0013442,7.9,103696.0
4230,tt0015324,8.2,55176.0
...,...,...,...
998451,tt2872732,6.4,525376.0
998453,tt2872750,7.3,42811.0
998491,tt2873282,6.6,198292.0
999417,tt2879552,8.1,96364.0
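
The two cells above amount to keeping the titles whose vote count is at or above a high quantile. A more direct way to express that, shown here as a sketch on made-up data, is a single boolean filter. The 0.8 quantile is used only because the toy table has five rows; the notebook uses 0.9956.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4', 'tt5'],
    'numVotes': [5, 12, 28, 108, 2830977],
})

# Vote count at the chosen quantile; 0.8 here only because the toy table is tiny
threshold = toy_ratings['numVotes'].quantile(0.8)
most_voted = toy_ratings[toy_ratings['numVotes'] >= threshold]
print(threshold)
print(most_voted)
```

Filtering this way leaves `numVotes` as integers, whereas assigning a partial Series and then dropping the NaN rows converts the column to floats, as the `54390.0` values in the output show.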


In [None]:
# Merge df_ratings_by_numVotes with df_title_only_movies
df_ratings_movie_numVotes = df_title_only_movies.merge(df_ratings_by_numVotes, on='tconst')

In [None]:
df_ratings_movie_numVotes

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0010323,movie,The Cabinet of Dr. Caligari,Das Cabinet des Dr. Caligari,1920.0,76.0,Horror,8.0,68768.0
1,tt0010323,movie,The Cabinet of Dr. Caligari,Das Cabinet des Dr. Caligari,1920.0,76.0,Mystery,8.0,68768.0
2,tt0010323,movie,The Cabinet of Dr. Caligari,Das Cabinet des Dr. Caligari,1920.0,76.0,Thriller,8.0,68768.0
3,tt0012349,movie,The Kid,The Kid,1921.0,68.0,Comedy,8.2,132734.0
4,tt0012349,movie,The Kid,The Kid,1921.0,68.0,Drama,8.2,132734.0
...,...,...,...,...,...,...,...,...,...
6812,tt10028196,movie,Laal Singh Chaddha,Laal Singh Chaddha,2022.0,159.0,Comedy,5.6,177200.0
6813,tt10028196,movie,Laal Singh Chaddha,Laal Singh Chaddha,2022.0,159.0,Drama,5.6,177200.0
6814,tt10028196,movie,Laal Singh Chaddha,Laal Singh Chaddha,2022.0,159.0,Romance,5.6,177200.0
6815,tt10039344,movie,Countdown,Countdown,2019.0,90.0,Horror,5.4,43973.0


In [None]:
# Top 1000 highest-rated films
df_top_movie_by_averageRating = df_ratings_movie_numVotes.sort_values(by= 'averageRating', ascending = False)
# One row per film: after the explode, only one of each film's genres is kept
df_top_movie_by_averageRating = df_top_movie_by_averageRating.drop_duplicates(subset='tconst', keep='first')
# Tied ratings share their average rank, which astype(int) then truncates
df_top_movie_by_averageRating['Rank'] = df_top_movie_by_averageRating['averageRating'].rank(ascending=False).astype(int)
df_top_movie_by_averageRating = df_top_movie_by_averageRating.drop(['titleType'], axis=1)
df_top_movie_by_averageRating = df_top_movie_by_averageRating.head(1000)

In [None]:
df_top_movie_by_averageRating.head()

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank
2221,tt0111161,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
548,tt0068646,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2
3868,tt0252487,The Chaos Class,Hababam Sinifi,1975.0,87.0,Comedy,9.2,42480.0,2
250,tt0050083,12 Angry Men,12 Angry Men,1957.0,96.0,Drama,9.0,844503.0,6
5928,tt0468569,The Dark Knight,The Dark Knight,2008.0,152.0,Drama,9.0,2812354.0,6
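
The `Rank` column jumps from 2 to 6 because `Series.rank` uses `method='average'` by default, so tied ratings receive the mean of the positions they occupy. The sketch below compares that default with the `'min'` and `'dense'` methods on made-up ratings containing a three-way tie.

```python
import pandas as pd

ratings = pd.Series([9.3, 9.2, 9.2, 9.2, 9.0])

ranks = pd.DataFrame({
    # default method='average': the tie occupies positions 2, 3, 4, so each gets 3
    'average_truncated': ratings.rank(ascending=False).astype(int),
    # 'min': every tied value gets the lowest position of its group
    'min': ratings.rank(ascending=False, method='min').astype(int),
    # 'dense': like 'min', but the next rank follows without gaps
    'dense': ratings.rank(ascending=False, method='dense').astype(int),
})
print(ranks)
```

Which method fits best depends on how the ranking is meant to be read; `'min'` matches the usual "joint second place" convention.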


### **Films per director**

In [None]:
# Filter: ['category'] == 'director'
df_directors = df_title_principals.loc[df_title_principals['category'] == 'director']
df_directors

Unnamed: 0,tconst,ordering,nconst,category
1,tt0000001,2,nm0005690,director
3,tt0000002,1,nm0721526,director
5,tt0000003,1,nm0721526,director
9,tt0000004,1,nm0721526,director
13,tt0000005,3,nm0005690,director
...,...,...,...,...
999976,tt0116664,5,nm0894591,director
999981,tt0116665,1,nm0676189,director
999984,tt0116666,3,nm0304298,director
999990,tt0116668,5,nm0851444,director


In [None]:
# Join 'df_directors' with 'df_names'
name_directors = pd.merge(df_directors,df_names, how = 'outer', on = 'nconst', indicator = True)
name_directors

Unnamed: 0,tconst,ordering,nconst,category,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,_merge
0,tt0000001,2.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
1,tt0000005,3.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
2,tt0000006,1.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
3,tt0000007,3.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
4,tt0000008,2.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
...,...,...,...,...,...,...,...,...,...,...
1087400,,,nm10084254,,Sona Sood,,,,\N,right_only
1087401,,,nm10084255,,Evan Pantely,,,composer,tt8919048,right_only
1087402,,,nm10084256,,Lucie Leroux,,,costume_designer,\N,right_only
1087403,,,nm10084258,,Chandrashekar,,,actor,tt8919066,right_only


In [None]:
name_directors['_merge'].value_counts()

right_only    978676
both          107417
left_only       1312
Name: _merge, dtype: int64
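
The `_merge` column produced by `indicator=True` records where each row came from. The sketch below reproduces the idea on two made-up tables (`toy_directors` and `toy_names` are hypothetical). It also shows that dropping the `right_only` rows leaves the same rows as a left join.

```python
import pandas as pd

toy_directors = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'],
                              'nconst': ['nm1', 'nm2', 'nm9']})
toy_names = pd.DataFrame({'nconst': ['nm1', 'nm2', 'nm3'],
                          'primaryName': ['A', 'B', 'C']})

# Outer join: keeps unmatched rows from both sides and labels their origin
outer = pd.merge(toy_directors, toy_names, how='outer', on='nconst', indicator=True)
counts = outer['_merge'].value_counts()

# Left join: keeps every director row, matched or not
left = pd.merge(toy_directors, toy_names, how='left', on='nconst')
print(counts)
```

In the notebook, the large `right_only` count corresponds to people in `df_names` who have no director credit in the loaded sample, and `left_only` to director credits whose person is missing from the first million names.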

In [None]:
df_director_names = name_directors[name_directors['category'].isin(['director'])]
df_director_names

Unnamed: 0,tconst,ordering,nconst,category,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,_merge
0,tt0000001,2.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
1,tt0000005,3.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
2,tt0000006,1.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
3,tt0000007,3.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
4,tt0000008,2.0,nm0005690,director,William K.L. Dickson,1860.0,1935.0,"cinematographer,director,producer","tt0219560,tt1496763,tt0308254,tt1428455",both
...,...,...,...,...,...,...,...,...,...,...
108724,tt0116661,5.0,nm0378794,director,Kelli Herd,,,"director,writer,producer","tt0177271,tt0116661,tt1381714",both
108725,tt0116663,5.0,nm0355145,director,Eyal Halfon,,,"director,writer,producer","tt0116663,tt4967126,tt0468729,tt0163586",both
108726,tt0116664,5.0,nm0894591,director,Darko Vernic,1952.0,2000.0,"director,writer,cinematographer","tt0334619,tt0334129,tt14633880,tt0116664",both
108727,tt0116665,1.0,nm0676189,director,Vladimir Petek,1940.0,2003.0,"director,cinematographer,editor","tt0116665,tt2364658,tt8659628,tt0268001",both


In [None]:
# Join 'df_director_names' with 'df_top_movie_by_averageRating'
df_top_directors_movies = pd.merge(df_director_names, df_top_movie_by_averageRating, how = 'right', on = 'tconst')
df_top_directors_movies

Unnamed: 0,tconst,ordering,nconst,category,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,_merge,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank
0,tt0111161,5.0,nm0001104,director,Frank Darabont,1959.0,,"writer,producer,director","tt0884328,tt0111161,tt0120689,tt1520211",both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
1,tt0068646,5.0,nm0000338,director,Francis Ford Coppola,1939.0,,"producer,director,writer","tt0078788,tt0071360,tt0068646,tt0071562",both,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2
2,tt0252487,,,,,,,,,,The Chaos Class,Hababam Sinifi,1975.0,87.0,Comedy,9.2,42480.0,2
3,tt0050083,5.0,nm0001486,director,Sidney Lumet,1924.0,2011.0,"director,producer,writer","tt0292963,tt0072890,tt0050083,tt0070666",both,12 Angry Men,12 Angry Men,1957.0,96.0,Drama,9.0,844503.0,6
4,tt0468569,,,,,,,,,,The Dark Knight,The Dark Knight,2008.0,152.0,Drama,9.0,2812354.0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1065,tt0096446,5.0,nm0000165,director,Ron Howard,1954.0,,"producer,actor,writer","tt0268978,tt0367279,tt0075213,tt0112384",both,Willow,Willow,1988.0,126.0,Drama,7.2,128492.0,1050
1066,tt0962736,,,,,,,,,,The Young Victoria,The Young Victoria,2009.0,105.0,History,7.2,65661.0,1050
1067,tt0070379,5.0,nm0000217,director,Martin Scorsese,1942.0,,"producer,director,actor","tt0075314,tt0070379,tt0099685,tt5537002",both,Mean Streets,Mean Streets,1973.0,112.0,Thriller,7.2,117246.0,1050
1068,tt0159097,,,,,,,,,,The Virgin Suicides,The Virgin Suicides,1999.0,97.0,Romance,7.2,165610.0,1050


In [None]:
top_realisateur = df_top_directors_movies['primaryName'].value_counts()
top_realisateur.head(50).index.tolist()

['Alfred Hitchcock',
 'Stanley Kubrick',
 'Steven Spielberg',
 'Martin Scorsese',
 'Akira Kurosawa',
 'Billy Wilder',
 'Wilfred Jackson',
 'Hamilton Luske',
 'Francis Ford Coppola',
 'Brian De Palma',
 'John Huston',
 'Federico Fellini',
 'Richard Donner',
 'Rob Reiner',
 'James Cameron',
 'Clyde Geronimi',
 'Joel Coen',
 'Ethan Coen',
 'Sidney Lumet',
 'Hayao Miyazaki',
 'Sergio Leone',
 'Robert Zemeckis',
 'Tim Burton',
 'Ingmar Bergman',
 'Howard Hawks',
 'David Lynch',
 'David Lean',
 'John McTiernan',
 'John Ford',
 'John Landis',
 'Andrei Tarkovsky',
 'Krzysztof Kieslowski',
 'David Hand',
 'Frank Capra',
 'Robert Altman',
 'Don Siegel',
 'Elia Kazan',
 'Alan Parker',
 'Richard Attenborough',
 'Mel Brooks',
 'Jack Kinney',
 'Sam Raimi',
 'John Hughes',
 'Michael Curtiz',
 'Ridley Scott',
 'Fred Zinnemann',
 'Penny Marshall',
 'Luc Besson',
 'John Carpenter',
 'Jim Jarmusch']

In [None]:
# Convert the Series to a DataFrame
df_top = pd.DataFrame({'Directors': top_realisateur.index, 'Number of films': top_realisateur.values})

# Use Plotly Express to build the bar chart
fig = px.bar(df_top[:10], x='Directors', y='Number of films', title='NUMBER OF FILMS PER DIRECTOR IN THE TOP 1000')

fig.update_layout(title={'y': 0.9, 'x': 0.5, 'xanchor': 'center'}, height= 600, width=1000)

# Show the chart
fig.show()
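
As a side note, the counts Series can also be turned into a two-column table without building the DataFrame by hand. This is a sketch on three made-up rows; the column labels are illustrative.

```python
import pandas as pd

names = pd.Series(['Alfred Hitchcock', 'Stanley Kubrick', 'Alfred Hitchcock'], name='primaryName')

counts = names.value_counts()
# rename_axis labels the index, reset_index(name=...) labels the counts column
table = counts.rename_axis('Director').reset_index(name='Number of films')
print(table)
```

Note that a film with several credited directors contributes one count to each of them, which is why Joel Coen and Ethan Coen appear as separate entries in the list above.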


## 2- Top 1000 highest-rated films: number of films per genre <a id='TOP_1000_des_films_les_mieux_notes1'></a>

In [None]:
genres = df_title_only_movies['genres'].unique()
genres

array(['Documentary', 'News', 'Sport', 'Action', 'Adventure', 'Biography',
       'Drama', 'Fantasy', 'Romance', 'History', 'War', 'Thriller',
       'Crime', 'Mystery', 'Horror', 'Comedy', 'Western', 'Family',
       'Sci-Fi', 'Animation', 'Music', 'Musical', 'Film-Noir', 'Adult',
       'Reality-TV', 'Talk-Show'], dtype=object)

In [None]:
df_grp_films_genres = df_title_only_movies['genres'].value_counts()
df_grp_films_genres

Drama          74656
Comedy         41630
Romance        19760
Crime          16266
Action         15804
Documentary    10045
Adventure       9951
Thriller        9529
Horror          7175
Mystery         5535
Family          5049
Musical         5033
War             4342
Fantasy         4257
Western         3255
Adult           3252
History         3225
Music           3204
Sci-Fi          3008
Biography       2693
Sport           1312
Animation       1289
Film-Noir        836
News              15
Reality-TV         6
Talk-Show          1
Name: genres, dtype: int64

In [None]:
nb_films_genres = df_top_movie_by_averageRating['genres'].value_counts()
nb_films_genres

Drama          266
Comedy         108
Crime           81
Action          77
Thriller        67
Romance         55
Adventure       54
Biography       41
Mystery         31
Horror          28
Sci-Fi          28
War             26
Fantasy         26
Family          22
History         20
Animation       18
Western         13
Music           12
Sport           10
Musical          8
Film-Noir        7
Documentary      2
Name: genres, dtype: int64
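
The counts above sum to exactly 1000 because `df_top_movie_by_averageRating` holds one row per `tconst`: each film contributes only the single genre that survived `drop_duplicates`. If the intent is to count every genre of every top film, one possible variant is to filter the exploded table by the top film identifiers instead. The sketch below illustrates the difference on made-up data; `toy_exploded` and `top_ids` are hypothetical.

```python
import pandas as pd

# Made-up exploded table: one row per (film, genre) pair
toy_exploded = pd.DataFrame({
    'tconst': ['tt1', 'tt1', 'tt2', 'tt2', 'tt3'],
    'genres': ['Crime', 'Drama', 'Action', 'Drama', 'Comedy'],
})
top_ids = ['tt1', 'tt2']  # stand-in for the tconst values of the top films

in_top = toy_exploded['tconst'].isin(top_ids)

# Current approach: one retained genre per film
one_genre_per_film = toy_exploded[in_top].drop_duplicates(subset='tconst')['genres'].value_counts()

# Variant: every genre of every top film is counted
all_genres = toy_exploded.loc[in_top, 'genres'].value_counts()
print(one_genre_per_film)
print(all_genres)
```

In the toy data, Drama disappears entirely under the first approach and comes out on top under the second, so the choice can change the shape of the genre distribution.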

In [None]:
films_genres_top_movie = nb_films_genres.reset_index()
films_genres_top_movie.columns = ['Genres', 'Number of films']

films_genres_top_movie['Percentage'] = round(((films_genres_top_movie['Number of films'] / films_genres_top_movie['Number of films'].sum()) * 100), 2)

fig = px.treemap(films_genres_top_movie, path=['Genres','Percentage'], values='Percentage',
                 title='NUMBER OF FILMS PER GENRE IN THE TOP 1000')

fig.update_layout(title={'y': 0.85, 'x': 0.5, 'xanchor': 'center'}, height=600, width=1050)

fig.update_traces(textinfo='percent root')

fig.show()

## 3- Top 1000 highest-rated films: number of appearances of lead actors <a id='TOP_1000_des_films_les_mieux_notes2'></a>

In [None]:
df_actors_actress = df_title_principals.loc[(df_title_principals['category'] == 'actor') | (df_title_principals['category'] == 'actress')]
df_actors_actress

Unnamed: 0,tconst,ordering,nconst,category
11,tt0000005,1,nm0443482,actor
12,tt0000005,2,nm0653042,actor
16,tt0000007,1,nm0179163,actor
17,tt0000007,2,nm0183947,actor
21,tt0000008,1,nm0653028,actor
...,...,...,...,...
999993,tt0116668,8,nm0638832,actor
999995,tt0116669,1,nm0000245,actor
999996,tt0116669,2,nm0000178,actress
999997,tt0116669,3,nm0450116,actor


In [None]:
# Join 'df_actors_actress' with 'df_names'
df_actors_actress_names = pd.merge(df_names, df_actors_actress, how = 'outer', on = 'nconst', indicator = True)
df_actors_actress_names

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,tconst,ordering,category,_merge
0,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0025164,1.0,actor,both
1,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0026942,2.0,actor,both
2,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0027125,1.0,actor,both
3,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0027630,1.0,actor,both
4,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0028333,1.0,actor,both
...,...,...,...,...,...,...,...,...,...,...
1371656,nm1748350,,,,,,tt0116645,4.0,actress,right_only
1371657,nm1071554,,,,,,tt0116653,1.0,actress,right_only
1371658,nm1880915,,,,,,tt0116657,2.0,actor,right_only
1371659,nm1888074,,,,,,tt0116657,3.0,actress,right_only


In [None]:
df_actors_actress_names['_merge'].value_counts()

left_only     895789
both          453495
right_only     22377
Name: _merge, dtype: int64

In [None]:
df_actors_actress_names = df_actors_actress_names.query("category == 'actor' or category == 'actress'")
df_actors_actress_names

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,tconst,ordering,category,_merge
0,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0025164,1.0,actor,both
1,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0026942,2.0,actor,both
2,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0027125,1.0,actor,both
3,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0027630,1.0,actor,both
4,nm0000001,Fred Astaire,1899.0,1987.0,"soundtrack,actor,miscellaneous","tt0050419,tt0053137,tt0031983,tt0072308",tt0028333,1.0,actor,both
...,...,...,...,...,...,...,...,...,...,...
1371656,nm1748350,,,,,,tt0116645,4.0,actress,right_only
1371657,nm1071554,,,,,,tt0116653,1.0,actress,right_only
1371658,nm1880915,,,,,,tt0116657,2.0,actor,right_only
1371659,nm1888074,,,,,,tt0116657,3.0,actress,right_only


In [None]:
# Join 'df_actors_actress_names' with 'df_top_movie_by_averageRating'
df_top_actor_actress = pd.merge(df_actors_actress_names, df_top_movie_by_averageRating, how = 'right', on = 'tconst')
df_top_actor_actress

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,tconst,ordering,category,_merge,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank
0,nm0000151,Morgan Freeman,1937.0,,"actor,producer,soundtrack","tt0468569,tt0097239,tt0114369,tt0405159",tt0111161,2.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
1,nm0000209,Tim Robbins,1958.0,,"actor,producer,director","tt0091225,tt0327056,tt0111161,tt0105151",tt0111161,1.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
2,nm0006669,William Sadler,1950.0,,"actor,soundtrack,producer","tt0111161,tt0884328,tt0099423,tt0101452",tt0111161,4.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
3,nm0348409,Bob Gunton,1945.0,,"actor,soundtrack","tt4513678,tt0111161,tt0285331,tt0129290",tt0111161,3.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
4,nm0000008,Marlon Brando,1924.0,2004.0,"actor,soundtrack,director","tt0078788,tt0070849,tt0047296,tt0068646",tt0068646,1.0,actor,both,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2606,nm0000172,Harvey Keitel,1939.0,,"actor,producer,soundtrack","tt2278388,tt0110912,tt0103759,tt0105236",tt0070379,2.0,actor,both,Mean Streets,Mean Streets,1973.0,112.0,Thriller,7.2,117246.0,1050
2607,nm0698998,David Proval,1942.0,,"actor,miscellaneous","tt0117331,tt0070379,tt0111161,tt0098546",tt0070379,3.0,actor,both,Mean Streets,Mean Streets,1973.0,112.0,Thriller,7.2,117246.0,1050
2608,nm0732364,Amy Robinson,1948.0,,"producer,actress,miscellaneous","tt0120681,tt0126916,tt0070379,tt0088680",tt0070379,4.0,actress,both,Mean Streets,Mean Streets,1973.0,112.0,Thriller,7.2,117246.0,1050
2609,,,,,,,tt0159097,,,,The Virgin Suicides,The Virgin Suicides,1999.0,97.0,Romance,7.2,165610.0,1050


In [None]:
top_actors_actress = df_top_actor_actress['primaryName'].value_counts()
top_actors_actress.head(100).index.tolist()

['Robert De Niro',
 'Clint Eastwood',
 'James Stewart',
 'Al Pacino',
 'Dustin Hoffman',
 'Harrison Ford',
 'Gene Hackman',
 'Sean Connery',
 'Cary Grant',
 'Jack Nicholson',
 'Humphrey Bogart',
 'John Wayne',
 'Joe Pesci',
 'Tom Hanks',
 'Michael Caine',
 'Diane Keaton',
 'Paul Newman',
 'William Holden',
 'Dennis Hopper',
 'Johnny Depp',
 'Anthony Hopkins',
 'Audrey Hepburn',
 'Woody Allen',
 'Robin Williams',
 'Toshirô Mifune',
 'Laurence Olivier',
 'Arnold Schwarzenegger',
 'Morgan Freeman',
 'Kevin Costner',
 'Marlon Brando',
 'Robert Redford',
 'Harvey Keitel',
 'Charles Chaplin',
 'Gregory Peck',
 'Robert Duvall',
 'Shirley MacLaine',
 'Val Kilmer',
 'Anthony Quinn',
 'Takashi Shimura',
 'Jack Lemmon',
 'Willem Dafoe',
 'John Cleese',
 'Faye Dunaway',
 'Mel Gibson',
 'Tatsuya Nakadai',
 'Charles Bronson',
 'Grace Kelly',
 'Keanu Reeves',
 'Danny Aiello',
 'Claude Rains',
 'Steve McQueen',
 'Christopher Lloyd',
 'Charlton Heston',
 'Brad Pitt',
 'Gene Wilder',
 'Winona Ryder',
 '

In [None]:
# Convert the Series to a DataFrame
df_best = pd.DataFrame({'Acteur | Actrice': top_actors_actress.index, 'Nombre de films': top_actors_actress.values})

# Build the bar chart with Plotly Express (10 most frequent names only)
fig = px.bar(df_best[:10], x='Acteur | Actrice', y='Nombre de films', title='NUMBER OF FILMS PER ACTOR OR ACTRESS IN THE TOP 1000')

# Center the title and set the figure size
fig.update_layout(title={'y': 0.9, 'x': 0.5, 'xanchor': 'center'}, height=600, width=1000)

# Display the chart
fig.show()
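
The Series-to-DataFrame step can also be written with `rename_axis` and `reset_index`. This is a minimal sketch on made-up data: the names `names`, `df_counts` and `film_count` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for df_top_actor_actress['primaryName']
names = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'], name='primaryName')

# value_counts() sorts by descending frequency
counts = names.value_counts()

# Name the index and the values, then turn the Series into a two-column DataFrame
df_counts = counts.rename_axis('name').reset_index(name='film_count')

top2 = df_counts.head(2)
print(top2)
```

The resulting frame can be passed to `px.bar` in the same way as `df_best`.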

## 4- Evolution of the number of films per genre over the years <a id='Evolution_du_nombre'></a>

In [None]:
df_title_only_movies.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,1897.0,100.0,Documentary
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,1897.0,100.0,News
144,tt0000147,movie,The Corbett-Fitzsimmons Fight,The Corbett-Fitzsimmons Fight,1897.0,100.0,Sport
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,1906.0,70.0,Action
570,tt0000574,movie,The Story of the Kelly Gang,The Story of the Kelly Gang,1906.0,70.0,Adventure


In [None]:
# Identify the 10 most frequent genres
top_genres = df_title_only_movies['genres'].value_counts().head(10)
top_genres

Drama          74656
Comedy         41630
Romance        19760
Crime          16266
Action         15804
Documentary    10045
Adventure       9951
Thriller        9529
Horror          7175
Mystery         5535
Name: genres, dtype: int64

In [None]:
# Keep the 7 most frequent genres
top_genres = df_title_only_movies['genres'].value_counts().head(7).index

# Rename the 'genres' column to 'Top Genres' (it becomes the legend title)
df_title_only_movies = df_title_only_movies.rename(columns={'genres': 'Top Genres'})

# Filter the DataFrame to keep only films in the top genres
df_top_genres = df_title_only_movies[df_title_only_movies['Top Genres'].isin(top_genres)]

# Group by year and genre, then count the films in each group
df_nb_films_top_genres = df_top_genres.groupby(['startYear', 'Top Genres']).size().reset_index(name='FilmCount')

# Sort by year, then by descending film count within each year
df_nb_films_top_genres = df_nb_films_top_genres.sort_values(by=['startYear', 'FilmCount'], ascending=[True, False])

# Build the line chart with Plotly Express
fig = px.line(df_nb_films_top_genres, x='startYear', y='FilmCount', color='Top Genres',
              labels={'startYear': 'Year', 'FilmCount': 'Number of films'},
              title='Evolution of the number of films per top genre over time')

# Center the title and set the figure size
fig.update_layout(title={'y': 0.9, 'x': 0.5, 'xanchor': 'center'}, height=600, width=1000)

fig.show()
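
The per-year, per-genre count relies on a long-format table in which a film appears once per genre. The sketch below reproduces the `groupby(...).size()` step on made-up rows and adds a wide view with `pivot`. The `films` and `wide` names are illustrative only.

```python
import pandas as pd

# Hypothetical long-format table: one row per (film, genre)
films = pd.DataFrame({
    'startYear': [2000, 2000, 2000, 2001, 2001],
    'genres': ['Drama', 'Drama', 'Comedy', 'Drama', 'Comedy'],
})

# Count films per (year, genre)
counts = films.groupby(['startYear', 'genres']).size().reset_index(name='FilmCount')

# Wide view: one row per year, one column per genre, missing combinations set to 0
wide = (
    counts.pivot(index='startYear', columns='genres', values='FilmCount')
    .fillna(0)
    .astype(int)
)
print(wide)
```

Because each film is counted once per genre, the genre columns of a given year can add up to more than the number of distinct films.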


## 5- Top 10 films per year

# **Recommendation**

## Building the "df_artistes" table (top 100 actors, top 100 actresses and top 20 directors)

In [None]:
df_top_actor_actress.head() # KPI no. 3

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,tconst,ordering,category,_merge,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank
0,nm0000151,Morgan Freeman,1937.0,,"actor,producer,soundtrack","tt0468569,tt0097239,tt0114369,tt0405159",tt0111161,2.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
1,nm0000209,Tim Robbins,1958.0,,"actor,producer,director","tt0091225,tt0327056,tt0111161,tt0105151",tt0111161,1.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
2,nm0006669,William Sadler,1950.0,,"actor,soundtrack,producer","tt0111161,tt0884328,tt0099423,tt0101452",tt0111161,4.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
3,nm0348409,Bob Gunton,1945.0,,"actor,soundtrack","tt4513678,tt0111161,tt0285331,tt0129290",tt0111161,3.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
4,nm0000008,Marlon Brando,1924.0,2004.0,"actor,soundtrack,director","tt0078788,tt0070849,tt0047296,tt0068646",tt0068646,1.0,actor,both,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2


In [None]:
# Keep only the actors:
name_actor = df_top_actor_actress[df_top_actor_actress['category'] == 'actor'].drop_duplicates()
name_actor.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,tconst,ordering,category,_merge,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank
0,nm0000151,Morgan Freeman,1937.0,,"actor,producer,soundtrack","tt0468569,tt0097239,tt0114369,tt0405159",tt0111161,2.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
1,nm0000209,Tim Robbins,1958.0,,"actor,producer,director","tt0091225,tt0327056,tt0111161,tt0105151",tt0111161,1.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
2,nm0006669,William Sadler,1950.0,,"actor,soundtrack,producer","tt0111161,tt0884328,tt0099423,tt0101452",tt0111161,4.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
3,nm0348409,Bob Gunton,1945.0,,"actor,soundtrack","tt4513678,tt0111161,tt0285331,tt0129290",tt0111161,3.0,actor,both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
4,nm0000008,Marlon Brando,1924.0,2004.0,"actor,soundtrack,director","tt0078788,tt0070849,tt0047296,tt0068646",tt0068646,1.0,actor,both,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2


In [None]:
# List the top 100 actors:
top_actors = name_actor['primaryName'].value_counts().head(100).index.tolist()
top_actors

['Robert De Niro',
 'Clint Eastwood',
 'Al Pacino',
 'James Stewart',
 'Dustin Hoffman',
 'Harrison Ford',
 'Gene Hackman',
 'Cary Grant',
 'Sean Connery',
 'Humphrey Bogart',
 'Jack Nicholson',
 'Michael Caine',
 'Paul Newman',
 'Joe Pesci',
 'Tom Hanks',
 'John Wayne',
 'William Holden',
 'Anthony Hopkins',
 'Robert Redford',
 'Arnold Schwarzenegger',
 'Charles Chaplin',
 'Johnny Depp',
 'Toshirô Mifune',
 'Dennis Hopper',
 'Morgan Freeman',
 'Robin Williams',
 'Woody Allen',
 'Gregory Peck',
 'Marlon Brando',
 'Robert Duvall',
 'Harvey Keitel',
 'Kevin Costner',
 'Laurence Olivier',
 'Tatsuya Nakadai',
 'Claude Rains',
 'Christopher Lloyd',
 'Danny Aiello',
 'Charles Bronson',
 'Keanu Reeves',
 'Steve McQueen',
 'Tim Robbins',
 'Dan Aykroyd',
 'Mel Gibson',
 'Charlton Heston',
 'Willem Dafoe',
 'Val Kilmer',
 'Jack Lemmon',
 'Anthony Quinn',
 'Gene Wilder',
 'John Cleese',
 'John Malkovich',
 'Denzel Washington',
 'Henry Fonda',
 'Takashi Shimura',
 'Bruce Willis',
 'Brad Pitt',
 'O

In [None]:
# Actors DataFrame (film id and name only):
df_top_actors = name_actor[name_actor['primaryName'].isin(top_actors)]
df_actors_100 = df_top_actors.loc[:,['tconst','primaryName']]
df_actors_100

Unnamed: 0,tconst,primaryName
0,tt0111161,Morgan Freeman
1,tt0111161,Tim Robbins
4,tt0068646,Marlon Brando
5,tt0068646,Al Pacino
9,tt0050083,Henry Fonda
...,...,...
2579,tt0064276,Jack Nicholson
2580,tt0064276,Dennis Hopper
2600,tt0096446,Val Kilmer
2605,tt0070379,Robert De Niro


In [None]:
# Keep only the actresses:
name_actress = df_top_actor_actress[df_top_actor_actress['category'] == 'actress'].drop_duplicates()
name_actress.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,tconst,ordering,category,_merge,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank
6,nm0000473,Diane Keaton,1946.0,,"actress,producer,director","tt0337741,tt0075686,tt0082979,tt0356680",tt0068646,4.0,actress,both,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2
17,nm0000473,Diane Keaton,1946.0,,"actress,producer,director","tt0337741,tt0075686,tt0082979,tt0356680",tt0071562,4.0,actress,both,The Godfather Part II,The Godfather Part II,1974.0,202.0,Crime,9.0,1339273.0,6
21,nm0328751,Caroline Goodall,1959.0,,"actress,producer,writer","tt0106582,tt0360139,tt0102057,tt0109635",tt0108052,4.0,actress,both,Schindler's List,Schindler's List,1993.0,195.0,History,9.0,1422563.0,6
24,nm0000235,Uma Thurman,1970.0,,"actress,producer,soundtrack","tt0266697,tt0118688,tt0378194,tt0110912",tt0110912,2.0,actress,both,Pulp Fiction,Pulp Fiction,1994.0,154.0,Crime,8.9,2170768.0,9
34,nm0000398,Sally Field,1946.0,,"actress,producer,soundtrack","tt0109830,tt0098384,tt0443272,tt0076729",tt0109830,4.0,actress,both,Forrest Gump,Forrest Gump,1994.0,142.0,Drama,8.8,2206221.0,12


In [None]:
# List the top 100 actresses:
top_actress = name_actress['primaryName'].value_counts().head(100).index.tolist()
top_actress

['Diane Keaton',
 'Audrey Hepburn',
 'Winona Ryder',
 'Shirley MacLaine',
 'Mia Farrow',
 'Carrie Fisher',
 'Katharine Hepburn',
 'Faye Dunaway',
 'Susan Sarandon',
 'Geena Davis',
 'Barbara Hershey',
 'Grace Kelly',
 'Kirsten Dunst',
 'Whoopi Goldberg',
 'Talia Shire',
 'Sigourney Weaver',
 'Ingrid Bergman',
 'Jamie Lee Curtis',
 'Juliette Lewis',
 'Kathy Bates',
 'Julie Delpy',
 'Frances McDormand',
 'Jessica Tandy',
 'Emma Thompson',
 'Meryl Streep',
 'Laura Dern',
 'Vera Miles',
 'Janet Leigh',
 'Giulietta Masina',
 'Yvonne Furneaux',
 'Shelley Winters',
 'Bibi Andersson',
 'Kim Basinger',
 'Piper Laurie',
 'Lee Remick',
 'Holly Hunter',
 'Anouk Aimée',
 'Teresa Wright',
 'Elisabeth Shue',
 "Beverly D'Angelo",
 'Irène Jacob',
 'Andie MacDowell',
 'Madeline Kahn',
 'Jessica Lange',
 'Julie Andrews',
 'Jean Arthur',
 'Melinda Dillon',
 'Glenn Close',
 'Katharine Ross',
 'Mary Steenburgen',
 'Kate Winslet',
 'Brooke Adams',
 'Patricia Neal',
 'Sissy Spacek',
 'Catherine Deneuve',
 'Ju

In [None]:
# Actresses DataFrame (film id and name only):
df_top_actress = name_actress[name_actress['primaryName'].isin(top_actress)]
df_actress_100 = df_top_actress.loc[:,['tconst','primaryName']]
df_actress_100

Unnamed: 0,tconst,primaryName
6,tt0068646,Diane Keaton
17,tt0071562,Diane Keaton
34,tt0109830,Sally Field
36,tt0109830,Robin Wright
43,tt0080684,Carrie Fisher
...,...,...
2573,tt0097239,Jessica Tandy
2585,tt0110005,Kate Winslet
2595,tt0110367,Winona Ryder
2596,tt0110367,Susan Sarandon
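
The actor and actress cells follow the same three steps: filter on `category`, rank names with `value_counts`, keep the matching rows. A small helper could factor this out. The sketch below is only an illustration: `top_n_by_category` is a hypothetical function, and the toy table reuses the notebook's column names.

```python
import pandas as pd

def top_n_by_category(df, category, n):
    """Return (tconst, primaryName) rows for the n most frequent names in a category."""
    subset = df[df['category'] == category].drop_duplicates()
    top_names = subset['primaryName'].value_counts().head(n).index
    return subset.loc[subset['primaryName'].isin(top_names), ['tconst', 'primaryName']]

# Hypothetical toy cast table with the same column names as df_top_actor_actress
cast = pd.DataFrame({
    'tconst': ['t1', 't1', 't2', 't2', 't3'],
    'primaryName': ['X', 'Y', 'X', 'Z', 'Y'],
    'category': ['actor', 'actress', 'actor', 'actress', 'actress'],
})

top_actresses = top_n_by_category(cast, 'actress', 1)
print(top_actresses)
```

With such a helper, the two selections would reduce to one call per category.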


In [None]:
df_top_directors_movies.head() # KPI no. 1

Unnamed: 0,tconst,ordering,nconst,category,primaryName,birthYear,deathYear,primaryProfession,knownForTitles,_merge,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank
0,tt0111161,5.0,nm0001104,director,Frank Darabont,1959.0,,"writer,producer,director","tt0884328,tt0111161,tt0120689,tt1520211",both,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
1,tt0068646,5.0,nm0000338,director,Francis Ford Coppola,1939.0,,"producer,director,writer","tt0078788,tt0071360,tt0068646,tt0071562",both,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2
2,tt0252487,,,,,,,,,,The Chaos Class,Hababam Sinifi,1975.0,87.0,Comedy,9.2,42480.0,2
3,tt0050083,5.0,nm0001486,director,Sidney Lumet,1924.0,2011.0,"director,producer,writer","tt0292963,tt0072890,tt0050083,tt0070666",both,12 Angry Men,12 Angry Men,1957.0,96.0,Drama,9.0,844503.0,6
4,tt0468569,,,,,,,,,,The Dark Knight,The Dark Knight,2008.0,152.0,Drama,9.0,2812354.0,6


In [None]:
# List the top 20 directors:
top_director = df_top_directors_movies['primaryName'].value_counts().head(20).index.tolist()
top_director

['Alfred Hitchcock',
 'Stanley Kubrick',
 'Steven Spielberg',
 'Martin Scorsese',
 'Akira Kurosawa',
 'Billy Wilder',
 'Wilfred Jackson',
 'Hamilton Luske',
 'Francis Ford Coppola',
 'Brian De Palma',
 'John Huston',
 'Federico Fellini',
 'Richard Donner',
 'Rob Reiner',
 'James Cameron',
 'Clyde Geronimi',
 'Joel Coen',
 'Ethan Coen',
 'Sidney Lumet',
 'Hayao Miyazaki']

In [None]:
# Directors DataFrame (film id and name only):
df_top_director = df_top_directors_movies[df_top_directors_movies['primaryName'].isin(top_director)]
df_director_20 = df_top_director.loc[:,['tconst','primaryName']]
df_director_20


Unnamed: 0,tconst,primaryName
1,tt0068646,Francis Ford Coppola
3,tt0050083,Sidney Lumet
5,tt0071562,Francis Ford Coppola
6,tt0108052,Steven Spielberg
16,tt0099685,Martin Scorsese
...,...,...
1037,tt0043274,Clyde Geronimi
1038,tt0043274,Wilfred Jackson
1039,tt0043274,Hamilton Luske
1045,tt0097733,Richard Donner


In [None]:
# Concatenate the three DataFrames (directors, actresses, actors)
df_artistes = pd.concat([df_director_20, df_actress_100, df_actors_100], axis = 0)
df_artistes

Unnamed: 0,tconst,primaryName
1,tt0068646,Francis Ford Coppola
3,tt0050083,Sidney Lumet
5,tt0071562,Francis Ford Coppola
6,tt0108052,Steven Spielberg
16,tt0099685,Martin Scorsese
...,...,...
2579,tt0064276,Jack Nicholson
2580,tt0064276,Dennis Hopper
2600,tt0096446,Val Kilmer
2605,tt0070379,Robert De Niro


## get_dummies to convert the categorical data (genres, actor, actress and director) into binary values

In [None]:
df_top_movie_by_averageRating # KPI no. 1

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank
2221,tt0111161,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1
548,tt0068646,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2
3868,tt0252487,The Chaos Class,Hababam Sinifi,1975.0,87.0,Comedy,9.2,42480.0,2
250,tt0050083,12 Angry Men,12 Angry Men,1957.0,96.0,Drama,9.0,844503.0,6
5928,tt0468569,The Dark Knight,The Dark Knight,2008.0,152.0,Drama,9.0,2812354.0,6
...,...,...,...,...,...,...,...,...,...
1481,tt0096446,Willow,Willow,1988.0,126.0,Drama,7.2,128492.0,1050
6692,tt0962736,The Young Victoria,The Young Victoria,2009.0,105.0,History,7.2,65661.0,1050
583,tt0070379,Mean Streets,Mean Streets,1973.0,112.0,Thriller,7.2,117246.0,1050
3337,tt0159097,The Virgin Suicides,The Virgin Suicides,1999.0,97.0,Romance,7.2,165610.0,1050


In [None]:
# One-hot encode the genres with str.get_dummies
df_dummies_genres = pd.concat([df_top_movie_by_averageRating, df_top_movie_by_averageRating['genres'].str.get_dummies(sep=',')], axis = 1)
df_dummies_genres.head()

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank,Action,...,Horror,Music,Musical,Mystery,Romance,Sci-Fi,Sport,Thriller,War,Western
2221,tt0111161,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1,0,...,0,0,0,0,0,0,0,0,0,0
548,tt0068646,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2,0,...,0,0,0,0,0,0,0,0,0,0
3868,tt0252487,The Chaos Class,Hababam Sinifi,1975.0,87.0,Comedy,9.2,42480.0,2,0,...,0,0,0,0,0,0,0,0,0,0
250,tt0050083,12 Angry Men,12 Angry Men,1957.0,96.0,Drama,9.0,844503.0,6,0,...,0,0,0,0,0,0,0,0,0,0
5928,tt0468569,The Dark Knight,The Dark Knight,2008.0,152.0,Drama,9.0,2812354.0,6,0,...,0,0,0,0,0,0,0,0,0,0
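
For reference, this is how `Series.str.get_dummies` behaves when a cell holds several comma-separated values. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical genre strings; the second film has two genres
genres = pd.Series(['Drama', 'Crime,Drama', 'Comedy'])

# One 0/1 column per distinct token, columns sorted alphabetically
dummies = genres.str.get_dummies(sep=',')
print(dummies)
```

Each row gets a 1 in every column whose token appears in the original string.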


In [None]:
# Left join between df_dummies_genres and df_artistes on 'tconst':
df_artistes = pd.merge(df_dummies_genres,df_artistes, how='left', on='tconst')
# df_artistes = df_artistes.drop('primaryName_x', axis = 1)

In [None]:
df_artistes.head()

Unnamed: 0,tconst,primaryTitle,originalTitle,startYear,runtimeMinutes,genres,averageRating,numVotes,Rank,Action,...,Music,Musical,Mystery,Romance,Sci-Fi,Sport,Thriller,War,Western,primaryName
0,tt0111161,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1,0,...,0,0,0,0,0,0,0,0,0,Morgan Freeman
1,tt0111161,The Shawshank Redemption,The Shawshank Redemption,1994.0,142.0,Drama,9.3,2830977.0,1,0,...,0,0,0,0,0,0,0,0,0,Tim Robbins
2,tt0068646,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2,0,...,0,0,0,0,0,0,0,0,0,Francis Ford Coppola
3,tt0068646,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2,0,...,0,0,0,0,0,0,0,0,0,Diane Keaton
4,tt0068646,The Godfather,The Godfather,1972.0,175.0,Drama,9.2,1973182.0,2,0,...,0,0,0,0,0,0,0,0,0,Marlon Brando


In [None]:
# Join the distinct non-null values of a group into one string (skipping NaN avoids errors in str.join)
def join_without_nan(lst):
    unique_values = set()
    for x in lst:
        if pd.notnull(x):
            unique_values.add(str(x))
    return ', '.join(unique_values)

# Build a table with one row per film
df_artistes1 = df_artistes.groupby(["tconst", "primaryTitle"], as_index=False).agg({
    'originalTitle':'first',
    'Rank':'first',
    'startYear': 'first',
    'runtimeMinutes': 'first',
    'genres': 'first',
    'averageRating': 'first',
    'numVotes': 'first',
    'Action': 'first',
    'Adventure': 'first',
    'Animation': 'first',
    'Biography': 'first',
    'Comedy': 'first',
    'Crime': 'first',
    'Documentary': 'first',
    'Drama': 'first',
    'Family': 'first',
    'Fantasy': 'first',
    'Film-Noir': 'first',
    'History': 'first',
    'Horror': 'first',
    'Music': 'first',
    'Musical': 'first',
    'Mystery': 'first',
    'Romance': 'first',
    'Sci-Fi': 'first',
    'Sport': 'first',
    'Thriller': 'first',
    'War': 'first',
    'Western': 'first',
    'primaryName': join_without_nan,
}).reset_index()
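
The aggregation dictionary can also be generated rather than typed column by column. The sketch below does so on a made-up table. `join_sorted_unique` is a hypothetical variant of `join_without_nan` that sorts the names, so the joined string is the same on every run (iteration order of a plain `set` of strings is not guaranteed).

```python
import pandas as pd

def join_sorted_unique(values):
    # Distinct non-null values, sorted so the result is reproducible
    return ', '.join(sorted({str(v) for v in values if pd.notnull(v)}))

# Hypothetical rows: film t1 has two artists, film t2 has none
rows = pd.DataFrame({
    'tconst': ['t1', 't1', 't2'],
    'genres': ['Drama', 'Drama', 'Comedy'],
    'Drama': [1, 1, 0],
    'primaryName': ['B', 'A', None],
})

# 'first' for every column except the grouping key and the names
agg_spec = {col: 'first' for col in rows.columns if col not in ('tconst', 'primaryName')}
agg_spec['primaryName'] = join_sorted_unique

per_film = rows.groupby('tconst', as_index=False).agg(agg_spec)
print(per_film)
```

With `as_index=False` the grouping key is already a regular column, so no extra `reset_index()` is needed in this sketch.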

In [None]:
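# One-hot encode the artist names; the separator must match the ', ' used by join_without_nan,
# otherwise names after the first keep a leading space and end up in separate columns
df_finale = pd.concat([df_artistes1, df_artistes1['primaryName'].str.get_dummies(sep=', ')], axis = 1)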
df_finale

Unnamed: 0,index,tconst,primaryTitle,originalTitle,Rank,startYear,runtimeMinutes,genres,averageRating,numVotes,...,Val Kilmer,Vera Miles,Veronica Cartwright,Whoopi Goldberg,Wilfred Jackson,Willem Dafoe,William Holden,Winona Ryder,Woody Allen,Yvonne Furneaux
0,0,tt0010323,The Cabinet of Dr. Caligari,Das Cabinet des Dr. Caligari,258,1920.0,76.0,Horror,8.0,68768.0,...,0,0,0,0,0,0,0,0,0,0
1,1,tt0012349,The Kid,The Kid,119,1921.0,68.0,Family,8.2,132734.0,...,0,0,0,0,0,0,0,0,0,0
2,2,tt0013442,Nosferatu,"Nosferatu, eine Symphonie des Grauens",332,1922.0,94.0,Horror,7.9,103696.0,...,0,0,0,0,0,0,0,0,0,0
3,3,tt0015648,Battleship Potemkin,Bronenosets Potemkin,332,1925.0,75.0,Thriller,7.9,60664.0,...,0,0,0,0,0,0,0,0,0,0
4,4,tt0015864,The Gold Rush,The Gold Rush,178,1925.0,95.0,Drama,8.1,116847.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,995,tt0986233,Hunger,Hunger,721,2008.0,96.0,Drama,7.5,72845.0,...,0,0,0,0,0,0,0,0,0,0
996,996,tt0986264,Like Stars on Earth,Taare Zameen Par,83,2007.0,162.0,Family,8.3,203343.0,...,0,0,0,0,0,0,0,0,0,0
997,997,tt0988045,Sherlock Holmes,Sherlock Holmes,611,2009.0,128.0,Action,7.6,660683.0,...,0,0,0,0,0,0,0,0,0,0
998,998,tt0993846,The Wolf of Wall Street,The Wolf of Wall Street,119,2013.0,180.0,Biography,8.2,1536505.0,...,0,0,0,0,0,0,0,0,0,0
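
Because `join_without_nan` joins names with `', '` (comma plus space), the separator given to `str.get_dummies` affects the column names. The sketch below compares both separators on a made-up Series.

```python
import pandas as pd

# Hypothetical joined name strings, in both possible orders
names = pd.Series(['A, B', 'B, A'])

# Splitting on ',' keeps the space that follows each comma in the token
loose = names.str.get_dummies(sep=',')

# Splitting on ', ' yields one clean column per name
tight = names.str.get_dummies(sep=', ')

print(list(loose.columns))
print(list(tight.columns))
```

With `sep=','` the same person can be split across two columns (with and without a leading space), so a test such as `df['A'] == 1` only finds the rows where that name comes first.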


In [None]:
# Export the final table "df_finale" to CSV:
df_finale.to_csv('df_reco_finale.csv', index=False)

## Recommendation system - KNN

In [None]:
# Define the features X and the target y
X = df_finale.select_dtypes(include = 'number')
# Keep y as a 1-D Series: scikit-learn expects a target of shape (n_samples,)
y = df_finale['primaryTitle']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create the model
model = KNeighborsClassifier(n_neighbors = 10)
# Fit the model on the standardized data
model.fit(X_scaled, y)



In [None]:
# Convert "X_scaled" back to a DataFrame with the original column names:
X_scaled = pd.DataFrame(X_scaled)
X_scaled.columns = X.columns
X_scaled

Unnamed: 0,index,Rank,startYear,runtimeMinutes,averageRating,numVotes,Action,Adventure,Animation,Biography,...,Val Kilmer,Vera Miles,Veronica Cartwright,Whoopi Goldberg,Wilfred Jackson,Willem Dafoe,William Holden,Winona Ryder,Woody Allen,Yvonne Furneaux
0,-1.730320,-0.841935,-3.457911,-1.652899,0.642999,-0.624592,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
1,-1.726856,-1.324381,-3.407041,-1.950907,1.189303,-0.445441,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
2,-1.723391,-0.585093,-3.356171,-0.982382,0.369848,-0.526769,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
3,-1.719927,-0.585093,-3.203563,-1.690150,0.369848,-0.647290,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
4,-1.716463,-1.119602,-3.203563,-0.945131,0.916151,-0.489936,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,1.716463,0.765063,1.018612,-0.907880,-0.722760,-0.613174,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
996,1.719927,-1.449331,0.967743,1.550682,1.462455,-0.247684,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
997,1.723391,0.383271,1.069482,0.284150,-0.449608,1.033203,3.462227,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
998,1.726856,-1.324381,1.272960,2.221199,1.189303,3.486146,-0.288831,-0.238919,-0.135388,4.836346,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
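
The recommendation step only needs neighbour lookup, never class prediction. scikit-learn also offers `NearestNeighbors`, an unsupervised estimator with the same `kneighbors` method and no target variable. The sketch below uses it on a made-up feature matrix. It is a possible alternative, not what the notebook uses.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 5 films, 2 numeric features (year, runtime)
X = np.array([[1990, 100], [1991, 102], [2010, 150], [2011, 149], [1950, 90]], dtype=float)
X_scaled = StandardScaler().fit_transform(X)

# Unsupervised neighbour search: fit() takes the features only
nn = NearestNeighbors(n_neighbors=2, metric='euclidean').fit(X_scaled)

# Query with film 0: the closest point is film 0 itself, then its nearest neighbour
distances, indices = nn.kneighbors(X_scaled[[0]])
print(indices[0], distances[0])
```

`kneighbors` returns the neighbours of each query row ordered by increasing distance.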


In [None]:
# Find the row index of the film(s) to locate
# Here, search by film title or by artist, for example
index_movie = np.array(df_finale.loc[(df_finale['primaryTitle'] == 'Street Fighter')|(df_finale['Robert De Niro'] == 1)].index)
index_movie

array([242, 316, 323, 373, 451])

In [None]:
# Show the rows of "X_scaled" selected by "index_movie",
# i.e. the standardized features of the films found above
# (these are feature values, not distances)
X_scaled.loc[X_scaled.index.isin(index_movie)]

Unnamed: 0,index,Rank,startYear,runtimeMinutes,averageRating,numVotes,Action,Adventure,Animation,Biography,...,Val Kilmer,Vera Miles,Veronica Cartwright,Whoopi Goldberg,Wilfred Jackson,Willem Dafoe,William Holden,Winona Ryder,Woody Allen,Yvonne Furneaux
242,-0.892007,-1.324381,-0.609214,-0.237363,1.189303,1.699786,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
316,-0.635663,-1.449331,-0.202257,4.046496,1.462455,0.221736,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
323,-0.611414,-0.585093,-0.151388,0.433154,0.369848,-0.232426,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
373,-0.438209,0.765063,0.001221,0.209648,-0.72276,-0.557739,3.462227,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639
451,-0.168009,-0.310897,0.255569,0.023394,0.096696,-0.379512,-0.288831,-0.238919,-0.135388,-0.206768,...,-0.044766,-0.054855,-0.044766,-0.044766,-0.031639,-0.031639,-0.070888,-0.063372,-0.063372,-0.031639


In [None]:
################## Recommendation:
def recommend_movies(df):

    # Read the inputs: film title, artist name and genre
    # strip() removes leading and trailing whitespace; the regex collapses repeated spaces
    movie_title = input("\nEntrer le nom du film: ").strip()
    movie_title = re.sub(r'\s{2,}', ' ', movie_title).lower()

    actor_name = input("\nEntrer le nom de l'actrice/ acteur/ réalisatrice/ réalisateur : ").strip()
    actor_name = re.sub(r'\s{2,}', ' ', actor_name).lower()

    type_genre = input("\nEntrer le nom du genre: ").strip()
    type_genre = re.sub(r'\s{2,}', ' ', type_genre).lower()

    X = df.select_dtypes(include='number')
    y = df['primaryTitle']

    # Put all features on the same scale
    X_scaled = StandardScaler().fit_transform(X)
    X_scaled = pd.DataFrame(X_scaled)
    X_scaled.columns = X.columns

    # KNeighborsClassifier model (used here only for its kneighbors lookup)
    model = KNeighborsClassifier(n_neighbors=100, metric='euclidean').fit(X_scaled, y)

    # Get the row index of the film (if a title was entered)
    while True:
        if not movie_title:
            index_movie = np.array([])
            break
        elif movie_title in df['primaryTitle'].str.lower().values:
            index_movie = np.array(df.loc[df['primaryTitle'].str.lower() == movie_title].index[:1])
            break
        else:
            movie_title = input("\nFilm non trouvé. Entrer le nom du film: ").strip()
            movie_title = re.sub(r'\s{2,}', ' ', movie_title).lower()

    # Get the row indices of the artist's films (if a name was entered; first 3 matches)
    while True:
        if not actor_name:
            index_actor = np.array([])
            break
        elif actor_name in df.columns.str.strip().str.lower():
            actor_column = df.columns[df.columns.str.strip().str.lower() == actor_name].values[0]
            index_actor = np.array(df.loc[df[actor_column] == 1].index[:3])
            break
        else:
            actor_name = input("\nNom non trouvé. Entrer le nom de l'actrice/ acteur/ réalisatrice/ réalisateur : ").strip()
            actor_name = re.sub(r'\s{2,}', ' ', actor_name).lower()

    # Get the row indices of films in the genre (if a genre was entered; first 3 matches)
    while True:
        if not type_genre:
            index_genre = np.array([])
            break
        elif type_genre in df.columns.str.strip().str.lower():
            genre_column = df.columns[df.columns.str.strip().str.lower() == type_genre].values[0]
            index_genre = np.array(df.loc[df[genre_column] == 1].index[:3])
            break
        else:
            type_genre = input("\nGenre non trouvé. Entrer le nom du genre: ").strip()
            type_genre = re.sub(r'\s{2,}', ' ', type_genre).lower()

    # Row of the film
    ligne_movie = X_scaled.loc[X_scaled.index.isin(index_movie)]

    ligne_movie_mean = pd.DataFrame(X_scaled.loc[X_scaled.index.isin(index_movie)].mean()).T

    # Rows of the artist
    ligne_actor = X_scaled.loc[X_scaled.index.isin(index_actor)]

    ligne_actor_mean = pd.DataFrame(X_scaled.loc[X_scaled.index.isin(index_actor)].mean()).T

    # Rows of the genre
    ligne_genre = X_scaled.loc[X_scaled.index.isin(index_genre)]

    ligne_genre_mean = pd.DataFrame(X_scaled.loc[X_scaled.index.isin(index_genre)].mean()).T

    # Compute the weights (0 when the corresponding input is missing)

    movie_weight = 3 if len(ligne_movie) > 0 else 0
    actor_weight = 1 if len(ligne_actor) > 0 else 0
    genre_weight = 1 if len(ligne_genre) > 0 else 0

    # Build the chimera: a synthetic query point combining the film, artist and genre rows
    chimere = (ligne_movie.sum(axis=0) * movie_weight) + ((ligne_actor_mean.sum(axis=0)) * actor_weight) + (ligne_genre_mean.sum(axis=0) * genre_weight)

    chimere = pd.DataFrame(chimere).T

    # Get the nearest neighbours
    recommandation = model.kneighbors(chimere) # returns a tuple (distances, indices), each of shape (1, n_neighbors)
    index_films = recommandation[1][0] # row indices of the neighbours of the single query point
    distance_films = recommandation[0][0] # distances from the query point to those neighbours

    # Sort the neighbour indices by increasing distance
    sorted_indices = index_films[np.argsort(distance_films)]

    # Select the recommended films, preceded by the artist and genre matches
    recommended_movies = df.iloc[sorted_indices, [2, 4, 5, 6, 7, 8,9,32]]
    recommended_movies1 = pd.concat([df.loc[index_actor], df.loc[index_genre], recommended_movies])

    # Keep the display columns and the first 10 rows
    recommended_movies2 = recommended_movies1.iloc[:, [2, 4, 5, 6, 7, 8,9,32]].head(10)

    # Sort by genre and year (both descending)
    recommended_movies3 = recommended_movies2.sort_values(by=['genres', 'startYear'], ascending=[False, False])

    recommended_movies4 = pd.concat([df.loc[index_movie], recommended_movies3])

    recommended_movies5 = recommended_movies4.drop_duplicates(subset=['primaryTitle'])

    recommended_movies6 = recommended_movies5.iloc[:, [2, 4, 5, 6, 7, 8,9,32]].head(10)

    recommended_movies7 = recommended_movies6.reset_index(drop=True)

    return recommended_movies7
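
In `recommend_movies`, the chimera is a weighted sum, so a film entered alone is multiplied by 3 before the neighbour search. One possible variant divides by the total weight, which keeps the query point on the same scale as the standardized data. This is not the notebook's method: the sketch below is an assumption, and `weighted_query` is a hypothetical helper shown on made-up vectors.

```python
import numpy as np

def weighted_query(vectors, weights):
    """Weighted average of the supplied feature vectors; None entries are skipped."""
    pairs = [(np.asarray(v, dtype=float), w) for v, w in zip(vectors, weights) if v is not None]
    if not pairs:
        raise ValueError("at least one vector is required")
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total

# Hypothetical standardized rows for a film and an artist; no genre was given
movie = [1.0, 0.0]
actor = [0.0, 2.0]

q = weighted_query([movie, actor, None], [3, 1, 1])
q_movie_only = weighted_query([movie, None, None], [3, 1, 1])
print(q, q_movie_only)
```

With this normalization, a query built from a single film is that film's own feature vector, so the film is its own nearest neighbour.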

In [None]:
recommend_movies(df_finale)


Entrer le nom du film: the godfather

Entrer le nom de l'actrice/ acteur/ réalisatrice/ réalisateur : 

Entrer le nom du genre: 


Unnamed: 0,primaryTitle,Rank,startYear,runtimeMinutes,genres,averageRating,numVotes,primaryName
0,The Godfather,2,1972.0,175.0,Drama,9.2,1973182.0,"Francis Ford Coppola, Al Pacino, Marlon Brando..."
1,Apocalypse Now,58,1979.0,147.0,War,8.4,698514.0,"Francis Ford Coppola, Robert Duvall, Marlon Br..."
2,The Conversation,411,1974.0,113.0,Thriller,7.8,118808.0,"Francis Ford Coppola, Gene Hackman, John Cazale"
3,Annie Hall,258,1977.0,93.0,Romance,8.0,274632.0,"Woody Allen, Diane Keaton"
4,Manhattan Murder Mystery,931,1993.0,104.0,Mystery,7.3,46214.0,"Woody Allen, Diane Keaton"
5,The Dark Knight,6,2008.0,152.0,Drama,9.0,2812354.0,
6,Fight Club,12,1999.0,139.0,Drama,8.8,2266552.0,
7,The Godfather Part III,611,1990.0,162.0,Drama,7.6,417616.0,"Francis Ford Coppola, Al Pacino, Talia Shire, ..."
8,Manhattan,411,1979.0,96.0,Drama,7.8,145348.0,"Woody Allen, Diane Keaton"
9,Superman,827,1978.0,143.0,Action,7.4,184665.0,"Richard Donner, Gene Hackman, Marlon Brando"


In [None]:
recommend_movies(df_finale)


Entrer le nom du film: titanic

Entrer le nom de l'actrice/ acteur/ réalisatrice/ réalisateur : 

Entrer le nom du genre: 


Unnamed: 0,primaryTitle,Rank,startYear,runtimeMinutes,genres,averageRating,numVotes,primaryName
0,Titanic,332,1997.0,194.0,Drama,7.9,1257325.0,
1,Interstellar,17,2014.0,169.0,Drama,8.7,2026029.0,
2,The Dark Knight,6,2008.0,152.0,Drama,9.0,2812354.0,
3,Batman Begins,119,2005.0,140.0,Drama,8.2,1549433.0,
4,Catch Me If You Can,178,2002.0,141.0,Drama,8.1,1065280.0,
5,Fight Club,12,1999.0,139.0,Drama,8.8,2266552.0,
6,The Green Mile,25,1999.0,189.0,Crime,8.6,1377461.0,
7,The Lord of the Rings: The Fellowship of the Ring,12,2001.0,178.0,Adventure,8.8,1965383.0,
8,The Lord of the Rings: The Return of the King,6,2003.0,201.0,Action,9.0,1938516.0,
9,The Lord of the Rings: The Two Towers,12,2002.0,179.0,Action,8.8,1747855.0,


In [None]:
recommend_movies(df_finale)


Entrer le nom du film: brave

Entrer le nom de l'actrice/ acteur/ réalisatrice/ réalisateur : 

Entrer le nom du genre: 

Film non trouvé. Entrer le nom du film: 


Unnamed: 0,primaryTitle,Rank,startYear,runtimeMinutes,genres,averageRating,numVotes,primaryName
0,Adaptation.,510,2002.0,115.0,Drama,7.7,200440.0,
1,Moulin Rouge!,611,2001.0,127.0,Drama,7.6,296056.0,
2,The Experiment,510,2001.0,120.0,Drama,7.7,95623.0,
3,And Your Mother Too,510,2001.0,106.0,Drama,7.7,127763.0,
4,The Game,510,1997.0,129.0,Drama,7.7,421261.0,
5,Gattaca,510,1997.0,106.0,Drama,7.7,318385.0,
6,Open Your Eyes,510,1997.0,119.0,Drama,7.7,72485.0,
7,Star Trek: First Contact,611,1996.0,111.0,Drama,7.6,130754.0,
8,Boyz n the Hood,411,1991.0,112.0,Drama,7.8,151084.0,
9,Wings of Desire,332,1987.0,128.0,Drama,7.9,75378.0,
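
In the last run, no film, name or genre was matched, so all three weights are 0 and the query vector is all zeros. After standardization every column has mean 0, so that point is the centroid of the dataset and the neighbours returned are the films closest to the average. The sketch below checks the zero-mean property on made-up data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: 3 films, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_scaled = StandardScaler().fit_transform(X)

# Each standardized column has mean 0, so an all-zero query sits at the dataset centroid
col_means = X_scaled.mean(axis=0)
print(col_means)
```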
