# Data pipeline

**Objectives:**

- Remove the one parsed row with inconsistent values, identified by *id* = 0
- Filter on the *release_date* column to keep only films already released when the data was extracted: a row is dropped if the value is empty or >= 2023-06-01
- Extract the year from the *release_date* column, since that level of detail is enough for us
- Replace empty lists with the value "Non renseigné" ("not specified") in the affected columns: *genres*, *production_companies*
- Convert the *genres* and *production_companies* list strings to JSON-compatible lists (double quotes)
- Drop the columns not needed for our visualizations: *Unnamed: 0*, *id*, *release_date* (once the year is extracted), *original_language*, *tagline*

In [21]:
import pandas as pd

df = pd.read_csv(
    "top_1000_popular_movies_tmdb.csv",
    sep=",",              # field separator
    encoding="utf-8",     # standard encoding
    on_bad_lines="skip",  # skip malformed lines
    engine="python",      # more tolerant parser than the default 'c' engine
)

print(df.shape)
df.head()


(10001, 15)


Unnamed: 0.1,Unnamed: 0,id,title,release_date,genres,original_language,vote_average,vote_count,popularity,overview,budget,production_companies,revenue,runtime,tagline
0,0,385687,Fast X,2023-05-17,"['Action', 'Crime', 'Thriller']",English,7.4,1347.0,8363.473,Over many missions and against impossible odds...,340000000.0,"['Universal Pictures', 'Original Film', 'One R...",652000000.0,142.0,The end of the road begins.
1,1,603692,John Wick: Chapter 4,2023-03-22,"['Action', 'Thriller', 'Crime']",English,7.9,2896.0,4210.313,"With the price on his head ever increasing, Jo...",90000000.0,"['Thunder Road', '87Eleven', 'Summit Entertain...",431769200.0,170.0,"No way back, one way out."
2,2,502356,The Super Mario Bros. Movie,2023-04-05,"['Animation', 'Family', 'Adventure', 'Fantasy'...",English,7.8,4628.0,3394.458,"While working underground to fix a water main,...",100000000.0,"['Universal Pictures', 'Illumination', 'Ninten...",1308767000.0,92.0,
3,3,569094,Spider-Man: Across the Spider-Verse,2023-05-31,"['Action', 'Adventure', 'Animation', 'Science ...",English,8.8,1160.0,2859.047,"After reuniting with Gwen Stacy, Brooklyn’s fu...",100000000.0,"['Columbia Pictures', 'Sony Pictures Animation...",313522200.0,140.0,It's how you wear the mask that matters
4,4,536437,Hypnotic,2023-05-11,"['Mystery', 'Thriller', 'Science Fiction']",English,6.5,154.0,2654.854,A detective becomes entangled in a mystery inv...,70000000.0,"['Studio 8', 'Solstice Productions', 'Ingeniou...",0.0,94.0,Control is an illusion.
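
To make the effect of `on_bad_lines="skip"` concrete, here is a minimal sketch on an in-memory CSV. The `raw` string and the `sample` name are made up for illustration and do not come from the TMDB file; the point is only that a line with too many fields is dropped silently rather than raising an error.

```python
import io

import pandas as pd

# Hypothetical sample: the third line has one field too many.
raw = "id,title\n1,Alpha\n2,Beta,extra\n3,Gamma\n"

sample = pd.read_csv(
    io.StringIO(raw),
    sep=",",
    on_bad_lines="skip",  # drop lines with too many fields instead of raising
    engine="python",
)

print(sample.shape)
print(sample["title"].tolist())
```

Because skipped lines leave no trace, it can be worth comparing the row count with the number of lines in the file when a few missing rows would matter.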


Remove the one parsed row with inconsistent values, identified by *id* = 0

In [22]:
df_filtre = df[df["id"] != 0]
print(df_filtre.shape)
df_filtre.head()

(10000, 15)


Unnamed: 0.1,Unnamed: 0,id,title,release_date,genres,original_language,vote_average,vote_count,popularity,overview,budget,production_companies,revenue,runtime,tagline
0,0,385687,Fast X,2023-05-17,"['Action', 'Crime', 'Thriller']",English,7.4,1347.0,8363.473,Over many missions and against impossible odds...,340000000.0,"['Universal Pictures', 'Original Film', 'One R...",652000000.0,142.0,The end of the road begins.
1,1,603692,John Wick: Chapter 4,2023-03-22,"['Action', 'Thriller', 'Crime']",English,7.9,2896.0,4210.313,"With the price on his head ever increasing, Jo...",90000000.0,"['Thunder Road', '87Eleven', 'Summit Entertain...",431769200.0,170.0,"No way back, one way out."
2,2,502356,The Super Mario Bros. Movie,2023-04-05,"['Animation', 'Family', 'Adventure', 'Fantasy'...",English,7.8,4628.0,3394.458,"While working underground to fix a water main,...",100000000.0,"['Universal Pictures', 'Illumination', 'Ninten...",1308767000.0,92.0,
3,3,569094,Spider-Man: Across the Spider-Verse,2023-05-31,"['Action', 'Adventure', 'Animation', 'Science ...",English,8.8,1160.0,2859.047,"After reuniting with Gwen Stacy, Brooklyn’s fu...",100000000.0,"['Columbia Pictures', 'Sony Pictures Animation...",313522200.0,140.0,It's how you wear the mask that matters
4,4,536437,Hypnotic,2023-05-11,"['Mystery', 'Thriller', 'Science Fiction']",English,6.5,154.0,2654.854,A detective becomes entangled in a mystery inv...,70000000.0,"['Studio 8', 'Solstice Productions', 'Ingeniou...",0.0,94.0,Control is an illusion.
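
Before dropping rows by a sentinel value, it can help to look at what is about to be removed. The sketch below does this on a toy DataFrame; the data and the names `toy`, `bad_rows` and `toy_clean` are assumptions for illustration, not the notebook's own variables.

```python
import pandas as pd

# Toy data standing in for the real file; the row with id 0 is the bad one.
toy = pd.DataFrame(
    {
        "id": [385687, 0, 603692],
        "title": ["Fast X", None, "John Wick: Chapter 4"],
    }
)

bad_rows = toy[toy["id"] == 0]   # inspect the rows that will be removed
toy_clean = toy[toy["id"] != 0]  # same boolean filter as in the cell above

# The number of rows removed should match the number of flagged rows.
assert len(toy) - len(toy_clean) == len(bad_rows)

print(bad_rows)
print(toy_clean.shape)
```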


Filter on the *release_date* column to keep only films already released when the data was extracted. A row is dropped if the value is empty or >= 2023-06-01

In [23]:
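# Enable copy-on-write so that assigning a column to a filtered DataFrame
# does not trigger a SettingWithCopyWarning
pd.options.mode.copy_on_write = True

# Empty or unparseable dates become NaT
df_filtre["release_date"] = pd.to_datetime(df_filtre["release_date"], errors="coerce")

# NaT never satisfies the comparison, so rows without a date are dropped too
df_filtre_date = df_filtre[df_filtre["release_date"] < "2023-06-01"]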

print(df_filtre_date.shape)
df_filtre_date.head()

(9733, 15)


Unnamed: 0.1,Unnamed: 0,id,title,release_date,genres,original_language,vote_average,vote_count,popularity,overview,budget,production_companies,revenue,runtime,tagline
0,0,385687,Fast X,2023-05-17,"['Action', 'Crime', 'Thriller']",English,7.4,1347.0,8363.473,Over many missions and against impossible odds...,340000000.0,"['Universal Pictures', 'Original Film', 'One R...",652000000.0,142.0,The end of the road begins.
1,1,603692,John Wick: Chapter 4,2023-03-22,"['Action', 'Thriller', 'Crime']",English,7.9,2896.0,4210.313,"With the price on his head ever increasing, Jo...",90000000.0,"['Thunder Road', '87Eleven', 'Summit Entertain...",431769200.0,170.0,"No way back, one way out."
2,2,502356,The Super Mario Bros. Movie,2023-04-05,"['Animation', 'Family', 'Adventure', 'Fantasy'...",English,7.8,4628.0,3394.458,"While working underground to fix a water main,...",100000000.0,"['Universal Pictures', 'Illumination', 'Ninten...",1308767000.0,92.0,
3,3,569094,Spider-Man: Across the Spider-Verse,2023-05-31,"['Action', 'Adventure', 'Animation', 'Science ...",English,8.8,1160.0,2859.047,"After reuniting with Gwen Stacy, Brooklyn’s fu...",100000000.0,"['Columbia Pictures', 'Sony Pictures Animation...",313522200.0,140.0,It's how you wear the mask that matters
4,4,536437,Hypnotic,2023-05-11,"['Mystery', 'Thriller', 'Science Fiction']",English,6.5,154.0,2654.854,A detective becomes entangled in a mystery inv...,70000000.0,"['Studio 8', 'Solstice Productions', 'Ingeniou...",0.0,94.0,Control is an illusion.
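
The date filter relies on two behaviours: `errors="coerce"` turns empty or unparseable values into `NaT`, and `NaT` never satisfies a `<` comparison. The sketch below checks both on toy data. The rows and the names `toy`, `cutoff` and `kept` are illustrative assumptions, and it compares against a `pd.Timestamp` rather than the plain string used in the notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["released", "cutoff day", "upcoming", "no date", "garbled"],
        "release_date": ["2023-05-31", "2023-06-01", "2023-07-14", None, "not a date"],
    }
)

# Missing and unparseable values are coerced to NaT instead of raising.
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

cutoff = pd.Timestamp("2023-06-01")

# Strict "<": the cutoff day itself, later dates and NaT rows are all excluded.
kept = toy[toy["release_date"] < cutoff]

print(toy["release_date"].isna().sum())
print(kept["title"].tolist())
```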


Extract the year from *release_date*

In [24]:
df_filtre_date["year"] = df_filtre_date["release_date"].dt.year
df_filtre_date.head()

Unnamed: 0.1,Unnamed: 0,id,title,release_date,genres,original_language,vote_average,vote_count,popularity,overview,budget,production_companies,revenue,runtime,tagline,year
0,0,385687,Fast X,2023-05-17,"['Action', 'Crime', 'Thriller']",English,7.4,1347.0,8363.473,Over many missions and against impossible odds...,340000000.0,"['Universal Pictures', 'Original Film', 'One R...",652000000.0,142.0,The end of the road begins.,2023
1,1,603692,John Wick: Chapter 4,2023-03-22,"['Action', 'Thriller', 'Crime']",English,7.9,2896.0,4210.313,"With the price on his head ever increasing, Jo...",90000000.0,"['Thunder Road', '87Eleven', 'Summit Entertain...",431769200.0,170.0,"No way back, one way out.",2023
2,2,502356,The Super Mario Bros. Movie,2023-04-05,"['Animation', 'Family', 'Adventure', 'Fantasy'...",English,7.8,4628.0,3394.458,"While working underground to fix a water main,...",100000000.0,"['Universal Pictures', 'Illumination', 'Ninten...",1308767000.0,92.0,,2023
3,3,569094,Spider-Man: Across the Spider-Verse,2023-05-31,"['Action', 'Adventure', 'Animation', 'Science ...",English,8.8,1160.0,2859.047,"After reuniting with Gwen Stacy, Brooklyn’s fu...",100000000.0,"['Columbia Pictures', 'Sony Pictures Animation...",313522200.0,140.0,It's how you wear the mask that matters,2023
4,4,536437,Hypnotic,2023-05-11,"['Mystery', 'Thriller', 'Science Fiction']",English,6.5,154.0,2654.854,A detective becomes entangled in a mystery inv...,70000000.0,"['Studio 8', 'Solstice Productions', 'Ingeniou...",0.0,94.0,Control is an illusion.,2023


In [25]:
print(df_filtre_date.shape)

(9733, 16)


In [26]:
df_filtre_date.dtypes

Unnamed: 0                      object
id                               int64
title                           object
release_date            datetime64[ns]
genres                          object
original_language               object
vote_average                   float64
vote_count                     float64
popularity                     float64
overview                        object
budget                         float64
production_companies            object
revenue                        float64
runtime                        float64
tagline                         object
year                             int32
dtype: object
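
The integer dtype of *year* depends on the earlier filter having removed every missing date. As a minimal sketch on made-up dates (the names `dates` and `with_gap` are hypothetical), `.dt.year` yields integers when all dates are present and falls back to floats when a `NaT` is present:

```python
import pandas as pd

# All dates present: the extracted years are integers.
dates = pd.Series(pd.to_datetime(["2023-05-17", "1999-03-31"]))
years = dates.dt.year

# One missing date: the result needs NaN, so the dtype becomes float.
with_gap = pd.Series(pd.to_datetime(["2023-05-17", None]))
years_with_gap = with_gap.dt.year

print(years.tolist())
print(years.dtype, years_with_gap.dtype)
```

Extracting the year after the date filter, as the notebook does, avoids values such as `2023.0` in the exported file.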

Replace empty and missing values

In [27]:
# Replace empty genres lists ("[]") with the placeholder
df_filtre_date["genres"] = df_filtre_date["genres"].replace("[]", "['Non renseigné']")

# Replace empty ("[]") and missing (None) production_companies values with the placeholder
df_filtre_date["production_companies"] = df_filtre_date["production_companies"].replace("[]", "['Non renseigné']")
df_filtre_date["production_companies"] = df_filtre_date["production_companies"].fillna("['Non renseigné']")

print(df_filtre_date.shape)
df_filtre_date.head()

(9733, 16)


Unnamed: 0.1,Unnamed: 0,id,title,release_date,genres,original_language,vote_average,vote_count,popularity,overview,budget,production_companies,revenue,runtime,tagline,year
0,0,385687,Fast X,2023-05-17,"['Action', 'Crime', 'Thriller']",English,7.4,1347.0,8363.473,Over many missions and against impossible odds...,340000000.0,"['Universal Pictures', 'Original Film', 'One R...",652000000.0,142.0,The end of the road begins.,2023
1,1,603692,John Wick: Chapter 4,2023-03-22,"['Action', 'Thriller', 'Crime']",English,7.9,2896.0,4210.313,"With the price on his head ever increasing, Jo...",90000000.0,"['Thunder Road', '87Eleven', 'Summit Entertain...",431769200.0,170.0,"No way back, one way out.",2023
2,2,502356,The Super Mario Bros. Movie,2023-04-05,"['Animation', 'Family', 'Adventure', 'Fantasy'...",English,7.8,4628.0,3394.458,"While working underground to fix a water main,...",100000000.0,"['Universal Pictures', 'Illumination', 'Ninten...",1308767000.0,92.0,,2023
3,3,569094,Spider-Man: Across the Spider-Verse,2023-05-31,"['Action', 'Adventure', 'Animation', 'Science ...",English,8.8,1160.0,2859.047,"After reuniting with Gwen Stacy, Brooklyn’s fu...",100000000.0,"['Columbia Pictures', 'Sony Pictures Animation...",313522200.0,140.0,It's how you wear the mask that matters,2023
4,4,536437,Hypnotic,2023-05-11,"['Mystery', 'Thriller', 'Science Fiction']",English,6.5,154.0,2654.854,A detective becomes entangled in a mystery inv...,70000000.0,"['Studio 8', 'Solstice Productions', 'Ingeniou...",0.0,94.0,Control is an illusion.,2023
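
The two cases handled above, the literal string `"[]"` and a missing value, can be checked on a small Series. This is a sketch on invented values; `companies`, `placeholder` and `filled` are illustrative names.

```python
import pandas as pd

placeholder = "['Non renseigné']"

# One regular value, one empty list stored as text, one missing value.
companies = pd.Series(["['Studio 8', 'Solstice Productions']", "[]", None])

# replace() matches the whole cell value; fillna() then covers None/NaN.
filled = companies.replace("[]", placeholder).fillna(placeholder)

print(filled.tolist())
```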


Convert the *genres* and *production_companies* lists

In [28]:
import ast
import json

# Parse each string into a Python list
df_filtre_date["genres"] = df_filtre_date["genres"].apply(ast.literal_eval)
df_filtre_date["production_companies"] = df_filtre_date["production_companies"].apply(ast.literal_eval)

# Serialize to a JSON-compatible format (double quotes)
df_filtre_date["genres"] = df_filtre_date["genres"].apply(json.dumps)
df_filtre_date["production_companies"] = df_filtre_date["production_companies"].apply(json.dumps)

print(df_filtre_date.shape)
df_filtre_date.head()

(9733, 16)


Unnamed: 0.1,Unnamed: 0,id,title,release_date,genres,original_language,vote_average,vote_count,popularity,overview,budget,production_companies,revenue,runtime,tagline,year
0,0,385687,Fast X,2023-05-17,"[""Action"", ""Crime"", ""Thriller""]",English,7.4,1347.0,8363.473,Over many missions and against impossible odds...,340000000.0,"[""Universal Pictures"", ""Original Film"", ""One R...",652000000.0,142.0,The end of the road begins.,2023
1,1,603692,John Wick: Chapter 4,2023-03-22,"[""Action"", ""Thriller"", ""Crime""]",English,7.9,2896.0,4210.313,"With the price on his head ever increasing, Jo...",90000000.0,"[""Thunder Road"", ""87Eleven"", ""Summit Entertain...",431769200.0,170.0,"No way back, one way out.",2023
2,2,502356,The Super Mario Bros. Movie,2023-04-05,"[""Animation"", ""Family"", ""Adventure"", ""Fantasy""...",English,7.8,4628.0,3394.458,"While working underground to fix a water main,...",100000000.0,"[""Universal Pictures"", ""Illumination"", ""Ninten...",1308767000.0,92.0,,2023
3,3,569094,Spider-Man: Across the Spider-Verse,2023-05-31,"[""Action"", ""Adventure"", ""Animation"", ""Science ...",English,8.8,1160.0,2859.047,"After reuniting with Gwen Stacy, Brooklyn’s fu...",100000000.0,"[""Columbia Pictures"", ""Sony Pictures Animation...",313522200.0,140.0,It's how you wear the mask that matters,2023
4,4,536437,Hypnotic,2023-05-11,"[""Mystery"", ""Thriller"", ""Science Fiction""]",English,6.5,154.0,2654.854,A detective becomes entangled in a mystery inv...,70000000.0,"[""Studio 8"", ""Solstice Productions"", ""Ingeniou...",0.0,94.0,Control is an illusion.,2023
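
The conversion on a single value looks as follows. This is a sketch on hand-written strings, not the notebook's data. It also shows a detail worth knowing: by default `json.dumps` escapes non-ASCII characters, so the placeholder "Non renseigné" is stored with a `\u00e9` escape unless `ensure_ascii=False` is passed. Both forms decode to the same list.

```python
import ast
import json

raw = "['Action', 'Science Fiction']"  # Python-style repr, single quotes

parsed = ast.literal_eval(raw)  # safely evaluates the literal into a list
as_json = json.dumps(parsed)    # JSON text, double quotes

# Default behaviour escapes the accented character; ensure_ascii=False keeps it.
placeholder_escaped = json.dumps(["Non renseigné"])
placeholder_utf8 = json.dumps(["Non renseigné"], ensure_ascii=False)

print(as_json)
print(placeholder_escaped)
print(placeholder_utf8)
```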


Drop the unneeded columns

In [29]:
df_final = df_filtre_date.drop(columns=["Unnamed: 0", "id", "release_date", "original_language", "tagline"])

print(df_final.shape)
df_final.head()

(9733, 11)


Unnamed: 0,title,genres,vote_average,vote_count,popularity,overview,budget,production_companies,revenue,runtime,year
0,Fast X,"[""Action"", ""Crime"", ""Thriller""]",7.4,1347.0,8363.473,Over many missions and against impossible odds...,340000000.0,"[""Universal Pictures"", ""Original Film"", ""One R...",652000000.0,142.0,2023
1,John Wick: Chapter 4,"[""Action"", ""Thriller"", ""Crime""]",7.9,2896.0,4210.313,"With the price on his head ever increasing, Jo...",90000000.0,"[""Thunder Road"", ""87Eleven"", ""Summit Entertain...",431769200.0,170.0,2023
2,The Super Mario Bros. Movie,"[""Animation"", ""Family"", ""Adventure"", ""Fantasy""...",7.8,4628.0,3394.458,"While working underground to fix a water main,...",100000000.0,"[""Universal Pictures"", ""Illumination"", ""Ninten...",1308767000.0,92.0,2023
3,Spider-Man: Across the Spider-Verse,"[""Action"", ""Adventure"", ""Animation"", ""Science ...",8.8,1160.0,2859.047,"After reuniting with Gwen Stacy, Brooklyn’s fu...",100000000.0,"[""Columbia Pictures"", ""Sony Pictures Animation...",313522200.0,140.0,2023
4,Hypnotic,"[""Mystery"", ""Thriller"", ""Science Fiction""]",6.5,154.0,2654.854,A detective becomes entangled in a mystery inv...,70000000.0,"[""Studio 8"", ""Solstice Productions"", ""Ingeniou...",0.0,94.0,2023


### Save the dataset as CSV

In [30]:
df_final.to_csv("movie_dataset_transform.csv", index=False)
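
A consumer of the exported file can turn the JSON columns back into lists with `json.loads`. The sketch below round-trips a one-row toy frame through an in-memory buffer instead of a file. The data and names are assumptions, and the final `explode` is only one possible downstream step for per-genre counts, not something the notebook itself does.

```python
import io
import json

import pandas as pd

toy = pd.DataFrame(
    {
        "title": ["Hypnotic"],
        "genres": [json.dumps(["Mystery", "Thriller"])],
        "year": [2023],
    }
)

# Write and re-read through a buffer, mirroring to_csv(..., index=False).
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The column comes back as JSON text; decode it into Python lists.
reloaded["genres"] = reloaded["genres"].apply(json.loads)

# One row per (film, genre) pair.
exploded = reloaded.explode("genres")

print(reloaded.loc[0, "genres"])
print(exploded[["title", "genres"]])
```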