Importing the Spark libraries

In [182]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, explode, avg, to_date
from pyspark.sql.types import DateType

Initializing the Spark session

In [183]:
spark = SparkSession.builder.appName('MoviesAnalyse').getOrCreate()

Step 1: Data preparation

Loading the CSV file into a Spark DataFrame

In [184]:
chemin_fichier = "./rotten_tomatoes_movies.csv"
movies_df = spark.read.csv(chemin_fichier, header=True, inferSchema=True)

Displaying the data

In [185]:
movies_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+
|         movie_title|          movie_info|   critics_consensus|              rating|               genre|           directors|             writers|                cast|    in_theaters_date|   on_streaming_date|  runtime_in_minutes|         studio_name|  tomatometer_status|  tomatometer_rating|   tomatometer_count|audience_rating|      audience_count|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------

Data cleaning

Removing empty or invalid rows

In [186]:
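# First drop rows missing the date and rating fields the analyses depend on
cleaned_df = movies_df.dropna(subset=["in_theaters_date", "on_streaming_date", "tomatometer_rating"])
# Then drop rows with a null in any other column, chaining on cleaned_df
# (calling movies_df.dropna() here would silently discard the previous step)
cleaned_df = cleaned_df.dropna()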
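
`dropna` only removes missing values; a value that is present but implausible still passes. A minimal sketch of a range check, assuming `tomatometer_rating` is a percentage between 0 and 100 (the helper name `keep_valid_ratings` is hypothetical and not part of the notebook):

```python
def keep_valid_ratings(df, column="tomatometer_rating"):
    """Keep rows whose rating lies in the assumed 0-100 range.

    `df` is a Spark DataFrame. The condition is passed as a SQL expression
    string, which DataFrame.filter accepts.
    """
    # Assumption: valid ratings are percentages, so anything outside
    # 0-100 is treated as a parsing or data-entry problem.
    return df.filter(f"{column} BETWEEN 0 AND 100")
```

It could be applied as `cleaned_df = keep_valid_ratings(cleaned_df)`. Rows where the rating is null are removed as well, because the comparison does not evaluate to true for them.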

Converting the dates to yyyy-MM-dd format

In [187]:
cleaned_df = cleaned_df.withColumn("in_theaters_date", to_date(col("in_theaters_date"), "yyyy-MM-dd"))
cleaned_df = cleaned_df.withColumn("on_streaming_date", to_date(col("on_streaming_date"), "yyyy-MM-dd"))

Step 2: Data manipulation and analysis

Filtering for films with a low rating (< 20)

In [188]:
low_rated_movies = cleaned_df.filter(col("tomatometer_rating") < 20)

Filtering for films released in theaters after January 1, 2000

In [189]:
movies_after_2000 = cleaned_df.filter(col("in_theaters_date") > "2000-01-01")
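
The comparison above keeps every film released after January 1, 2000, which includes most of the year 2000 itself. If the intent is films from 2001 onward, a sketch that compares on the release year instead, assuming `in_theaters_date` has already been converted to a date (`released_after_year` is a hypothetical helper):

```python
def released_after_year(df, year=2000, column="in_theaters_date"):
    """Keep rows whose release year is strictly greater than `year`.

    `df` is a Spark DataFrame; year() is a Spark SQL built-in function.
    """
    # int() guards the value interpolated into the SQL expression.
    return df.filter(f"year({column}) > {int(year)}")
```

With the default arguments, `released_after_year(cleaned_df)` would exclude all releases from the year 2000.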

Average film rating by studio

In [190]:
average_rating_by_studio = (
    cleaned_df.groupBy("studio_name")
    .agg(avg("tomatometer_rating").alias("avg_rating"))
    .filter(col("avg_rating").isNotNull())
    .orderBy("avg_rating", ascending=False)
)

Average film rating by director

In [191]:
average_rating_by_director = (
    cleaned_df.groupBy("directors")
    .agg(avg("tomatometer_rating").alias("avg_rating"))
    .filter(col("avg_rating").isNotNull())
    .orderBy("avg_rating", ascending=False)
)

Step 3: Advanced features

Splitting multi-genre entries into individual genres

The genres are split on commas (the separator is a comma followed by a space)

In [192]:
genres_df = cleaned_df.withColumn("genre", explode(split(col("genre"), ", ")))
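
To make the effect of `split` followed by `explode` concrete, here is a plain-Python illustration (not Spark code) of what happens to a single row: the genre string is cut on `", "` and the row is repeated once per piece. The sample row and the name `explode_genres` are invented for the illustration.

```python
def explode_genres(row, sep=", "):
    """Return one copy of `row` per genre, mimicking split + explode."""
    return [{**row, "genre": genre} for genre in row["genre"].split(sep)]


# Invented sample row, used only to show the shape of the result.
sample = {"movie_title": "Example Film", "genre": "Comedy, Drama"}
for exploded in explode_genres(sample):
    print(exploded)
# {'movie_title': 'Example Film', 'genre': 'Comedy'}
# {'movie_title': 'Example Film', 'genre': 'Drama'}
```

A film with two genres therefore contributes to two groups in any later per-genre aggregate. One difference to keep in mind: Spark's `split` treats its separator as a regular expression, whereas `str.split` takes it literally; for `", "` the two behave the same.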

Average film runtime for each genre

In [193]:
average_duration_by_genre = (
    genres_df.groupBy("genre")
    .agg(avg("runtime_in_minutes").alias("avg_duration"))
    .filter(col("avg_duration").isNotNull())
    .orderBy("avg_duration", ascending=False)
)
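
An average alone does not say how many films stand behind it. A sketch that also reports the number of films per genre, using the dictionary form of `agg` and renaming the columns Spark generates (`runtime_stats_by_genre` and `n_films` are hypothetical names, not part of the notebook):

```python
def runtime_stats_by_genre(genres_df):
    """Average runtime and film count per genre, longest genres first.

    `genres_df` is the exploded DataFrame with one row per (film, genre).
    """
    return (
        genres_df.groupBy("genre")
        # The dict form names the results "avg(...)" and "count(...)".
        .agg({"runtime_in_minutes": "avg", "movie_title": "count"})
        .withColumnRenamed("avg(runtime_in_minutes)", "avg_duration")
        .withColumnRenamed("count(movie_title)", "n_films")
        .orderBy("avg_duration", ascending=False)
    )
```

Calling `runtime_stats_by_genre(genres_df).show()` would make it easier to judge whether a genre's average rests on a handful of films or on several hundred.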

Displaying the results

The lowest-rated films

In [194]:
low_rated_movies.show()

+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         movie_title|          movie_info|   critics_consensus|rating|               genre|           directors|             writers|                cast|in_theaters_date|on_streaming_date|runtime_in_minutes|         studio_name|tomatometer_status|tomatometer_rating|tomatometer_count|audience_rating|audience_count|
+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         10,000 B.C.|A young outcast f...|Wit

Films released in theaters after January 1, 2000

In [195]:
movies_after_2000.show()

+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         movie_title|          movie_info|   critics_consensus|rating|               genre|           directors|             writers|                cast|in_theaters_date|on_streaming_date|runtime_in_minutes|         studio_name|tomatometer_status|tomatometer_rating|tomatometer_count|audience_rating|audience_count|
+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|Percy Jackson & t...|A teenager discov...|Tho

Average rating by studio

In [196]:
average_rating_by_studio.show()

+-----------+----------+
|studio_name|avg_rating|
+-----------+----------+
| 1959-11-18|     222.0|
| 1961-10-30|     165.0|
| 2009-12-18|     162.0|
| 1993-06-01|     154.0|
| 2002-01-25|     131.0|
| 1982-04-02|     119.0|
| 2013-12-18|     119.0|
| 2011-09-30|     118.0|
| 2007-03-09|     116.0|
| 2016-11-18|     116.0|
| 2003-04-25|     116.0|
| 1988-06-29|     116.0|
| 2007-04-06|     115.0|
| 1996-05-31|     115.0|
| 1989-05-19|     114.0|
| 2008-09-04|     114.0|
| 2017-10-27|     110.0|
| 2008-12-05|     108.0|
| 2008-05-17|     107.0|
| 1988-06-15|     107.0|
+-----------+----------+
only showing top 20 rows
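
The `studio_name` values above are dates and the averages exceed 100, which is not possible for a percentage score. This pattern suggests that some fields were shifted into the wrong columns when the CSV was parsed. One plausible cause (an assumption, not confirmed by the notebook) is free-text fields containing commas, doubled quotes, or line breaks. A minimal sketch of a stricter read, where `read_movies` and `CSV_OPTIONS` are hypothetical names and the option values are assumptions to verify against the file:

```python
# Hypothetical reader options; check them against the actual file.
CSV_OPTIONS = {
    "header": True,       # first line holds the column names
    "inferSchema": True,  # let Spark guess the column types
    "quote": '"',         # quoted fields may contain commas
    "escape": '"',        # assumption: a quote inside a field is written as ""
    "multiLine": True,    # assumption: quoted fields may span several lines
}


def read_movies(spark, path):
    """Read the movies CSV with explicit quoting rules.

    `spark` is an existing SparkSession and `path` the CSV location.
    """
    return spark.read.csv(path, **CSV_OPTIONS)
```

Calling `read_movies(spark, chemin_fichier)` in place of the original `spark.read.csv(...)` and re-running the aggregation would show whether the studio names line up again.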



Average rating by director

In [197]:
average_rating_by_director.show()

+--------------------+----------+
|           directors|avg_rating|
+--------------------+----------+
|                   G|     222.0|
|With enough narra...|     165.0|
|It might be more ...|     162.0|
|The Firm is a big...|     154.0|
|Though it may not...|     131.0|
|It's just as unev...|     119.0|
|Paul Schrader's k...|     119.0|
|Bunraku admirably...|     118.0|
|Bleed for This ri...|     116.0|
|A simple-minded b...|     116.0|
|Eddie Murphy was ...|     116.0|
|         Matt Dillon|     116.0|
|The Hoax is an en...|     115.0|
|Stylish and inven...|     115.0|
|The Women is a to...|     114.0|
|All I See Is You ...|     110.0|
|What Cadillac Rec...|     108.0|
|Despite a talente...|     107.0|
|This derivative p...|     107.0|
|Kevin Costner at ...|     107.0|
+--------------------+----------+
only showing top 20 rows



Average film runtime for each genre

In [198]:
average_duration_by_genre.show()

+--------------------+------------------+
|               genre|      avg_duration|
+--------------------+------------------+
|             Western|117.63157894736842|
|            Classics| 114.8936170212766|
|Faith & Spirituality|113.71428571428571|
|    Sports & Fitness|110.71428571428571|
|               Drama|110.09607843137255|
|Art House & Inter...|109.51196172248804|
|  Action & Adventure|109.33513513513513|
|             Romance|109.06306306306307|
|  Mystery & Suspense|106.88592233009709|
|Science Fiction &...|105.33165829145729|
|       Anime & Manga|             105.0|
|Musical & Perform...|104.16071428571429|
|              Comedy|100.36717428087987|
|              Horror| 98.87958115183245|
|       Gay & Lesbian|              96.5|
|          Television|              96.0|
|    Special Interest|              95.6|
|       Kids & Family| 95.06766917293233|
|         Documentary| 91.77272727272727|
|         Cult Movies|              90.4|
+--------------------+------------------+

Closing the Spark session

In [199]:
spark.stop()