Setting up Spark

In [10]:
from pyspark.sql import SparkSession

In [11]:
spark = SparkSession.builder.appName('Introduction').getOrCreate()

Upload the CSV

In [12]:
from google.colab import files

In [13]:
uploaded = files.upload()

Saving Rotten Tomatoes Movies.csv to Rotten Tomatoes Movies.csv


Load the CSV file

In [14]:
file_path = "/content/Rotten Tomatoes Movies.csv"

In [15]:
movies_df = spark.read.csv(file_path, header=True, inferSchema=True)
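
Free-text columns such as movie_info can contain commas or line breaks inside quotes, and a reader that is not quote-aware will split such a record into several rows. The sketch below uses Python's csv module on a made-up two-line sample (not taken from the dataset) to show the difference. In Spark, the multiLine and escape options of spark.read.csv are the usual way to get quote-aware parsing; whether this particular file needs them is an assumption that the notebook does not confirm.

```python
import csv
import io

# Made-up sample: the movie_info value contains a comma and a line break,
# both inside double quotes.
sample = (
    "movie_title,movie_info,rating\n"
    '"Some Film","A long, winding\nsynopsis",PG\n'
)

# A quote-aware parser keeps the synopsis as a single field.
rows = list(csv.reader(io.StringIO(sample, newline="")))
header, record = rows
print(len(rows), record)

# Splitting on raw newlines and commas breaks the same record apart.
naive = [line.split(",") for line in sample.splitlines()]
print(len(naive), [len(fields) for fields in naive])
```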

Preview the CSV data

In [16]:
movies_df.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+
|         movie_title|          movie_info|   critics_consensus|              rating|               genre|           directors|             writers|                cast|    in_theaters_date|   on_streaming_date|  runtime_in_minutes|         studio_name|  tomatometer_status|  tomatometer_rating|   tomatometer_count|audience_rating|      audience_count|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------

Remove rows with missing values

In [17]:
movies_cleaned_df = movies_df.dropna()
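
Called without arguments, dropna() removes every row that has a NULL in any column, so one frequently empty column can discard many rows that are otherwise usable. The plain-Python sketch below, on made-up rows with None standing in for NULL, counts missing values per column and compares dropping on all columns with dropping on a subset. The helper name drop_missing is hypothetical; in Spark the subset form is dropna(subset=[...]).

```python
# Made-up rows; None plays the role of NULL.
rows = [
    {"movie_title": "A", "critics_consensus": None, "tomatometer_rating": 80},
    {"movie_title": "B", "critics_consensus": "Fun.", "tomatometer_rating": 45},
    {"movie_title": "C", "critics_consensus": None, "tomatometer_rating": None},
]


def drop_missing(rows, subset=None):
    """Keep rows with no None in `subset` (every column when subset is None)."""
    return [
        row for row in rows
        if all(row[name] is not None for name in (subset or row.keys()))
    ]


# Missing values per column, worth inspecting before dropping anything.
null_counts = {name: sum(row[name] is None for row in rows) for name in rows[0]}
print(null_counts)

# Dropping on every column keeps one row; restricting the check to the
# column the analysis needs keeps two.
kept_all_columns = drop_missing(rows)
kept_rating_only = drop_missing(rows, subset=["tomatometer_rating"])
print(len(kept_all_columns), len(kept_rating_only))
```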

Convert the date columns from yyyy-MM-dd strings to dates

In [18]:
from pyspark.sql.functions import to_date

In [19]:
from pyspark.sql.functions import to_date, col # Import the 'col' function

movies_cleaned_df = movies_cleaned_df.withColumn(
    "in_theaters_date", to_date(col("in_theaters_date"), "yyyy-MM-dd")
).withColumn(
    "on_streaming_date", to_date(col("on_streaming_date"), "yyyy-MM-dd")
)
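
In this run, to_date returns NULL for any value that does not match the yyyy-MM-dd pattern, which is where the NULL dates in the next check come from. Other Spark configurations, such as ANSI mode, may raise an error instead; that depends on the Spark version and settings. The sketch below is a plain-Python analogue of the "NULL on mismatch" behavior, using a hypothetical helper parse_iso_date and made-up inputs.

```python
from datetime import date, datetime
from typing import Optional


def parse_iso_date(text: str) -> Optional[date]:
    """Return a date for 'YYYY-MM-DD' text, or None when it does not match."""
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None


# One well-formed date, one stray text value, one date in another layout.
samples = ["2010-02-12", "Drama", "02/12/2010"]
parsed = [parse_iso_date(s) for s in samples]
print(parsed)
```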

Check the changes

In [20]:
movies_cleaned_df.select("in_theaters_date", "on_streaming_date").show()

+----------------+-----------------+
|in_theaters_date|on_streaming_date|
+----------------+-----------------+
|      2010-02-12|       2010-06-29|
|      2010-04-30|       2010-10-19|
|            NULL|             NULL|
|      1954-01-01|       2003-05-20|
|      2008-03-07|       2008-06-24|
|      1935-08-01|       1935-06-06|
|      2005-09-03|       2006-08-08|
|            NULL|             NULL|
|      2005-06-10|       2005-10-11|
|      2004-09-24|       2005-04-12|
|      2005-06-17|       2006-04-11|
|            NULL|             NULL|
|      2006-01-13|       2006-04-25|
|            NULL|             NULL|
|      2006-03-03|       2006-06-27|
|      2007-01-12|       2007-05-15|
|      2004-08-26|       2005-10-18|
|      2006-01-27|       2006-07-04|
|      2005-12-23|       2006-04-25|
|      2005-08-05|       2005-12-13|
+----------------+-----------------+
only showing top 20 rows



Remove NULL values from the date columns (text values that could not be parsed as dates)

In [23]:
date_columns = ["in_theaters_date", "on_streaming_date"]

In [24]:
movies_cleaned_df = movies_cleaned_df.dropna(subset=date_columns)

Check

In [26]:
movies_cleaned_df.select("in_theaters_date", "on_streaming_date").show()

+----------------+-----------------+
|in_theaters_date|on_streaming_date|
+----------------+-----------------+
|      2010-02-12|       2010-06-29|
|      2010-04-30|       2010-10-19|
|      1954-01-01|       2003-05-20|
|      2008-03-07|       2008-06-24|
|      1935-08-01|       1935-06-06|
|      2005-09-03|       2006-08-08|
|      2005-06-10|       2005-10-11|
|      2004-09-24|       2005-04-12|
|      2005-06-17|       2006-04-11|
|      2006-01-13|       2006-04-25|
|      2006-03-03|       2006-06-27|
|      2007-01-12|       2007-05-15|
|      2004-08-26|       2005-10-18|
|      2006-01-27|       2006-07-04|
|      2005-12-23|       2006-04-25|
|      2005-08-05|       2005-12-13|
|      2005-09-30|       2006-01-24|
|      2005-09-16|       2006-03-28|
|      1986-07-18|       1999-06-01|
|      2006-04-07|       2006-08-22|
+----------------+-----------------+
only showing top 20 rows



Filter: movies with a very low rating (tomatometer_rating < 20)

In [27]:
low_rated_movies = movies_cleaned_df.filter(col("tomatometer_rating") < 20)

Check

In [28]:
low_rated_movies.show()

+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         movie_title|          movie_info|   critics_consensus|rating|               genre|           directors|             writers|                cast|in_theaters_date|on_streaming_date|runtime_in_minutes|         studio_name|tomatometer_status|tomatometer_rating|tomatometer_count|audience_rating|audience_count|
+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         10,000 B.C.|A young outcast f...|Wit

Movies released in theaters from the year 2000 onward

In [29]:
recent_movies = movies_cleaned_df.filter(col("in_theaters_date") >= "2000-01-01")
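
The threshold 2000-01-01 with >= keeps films dated within the year 2000 itself. If "after 2000" is meant strictly, the cutoff has to move to 2001. The plain-Python sketch below shows both readings on made-up dates; in Spark, the strict version could be written with the year function from pyspark.sql.functions.

```python
from datetime import date

# Made-up release dates on either side of the boundary.
releases = {
    "Film 1999": date(1999, 12, 31),
    "Film 2000": date(2000, 6, 15),
    "Film 2001": date(2001, 1, 1),
}

# Same rule as the notebook filter: on or after 2000-01-01.
from_2000 = [title for title, day in releases.items() if day >= date(2000, 1, 1)]

# Stricter reading: released once the year 2000 is over.
after_2000 = [title for title, day in releases.items() if day.year > 2000]

print(from_2000)
print(after_2000)
```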

Check

In [30]:
recent_movies.show()

+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         movie_title|          movie_info|   critics_consensus|rating|               genre|           directors|             writers|                cast|in_theaters_date|on_streaming_date|runtime_in_minutes|         studio_name|tomatometer_status|tomatometer_rating|tomatometer_count|audience_rating|audience_count|
+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|Percy Jackson & t...|A teenager discov...|Tho

Average movie rating by studio

In [31]:
average_rating_by_studio = movies_cleaned_df.groupBy("studio_name") \
    .agg({"tomatometer_rating": "avg"}) \
    .withColumnRenamed("avg(tomatometer_rating)", "average_rating")

Check

In [32]:
average_rating_by_studio.show()

+--------------------+------------------+
|         studio_name|    average_rating|
+--------------------+------------------+
|    Relativity Media| 33.32142857142857|
|  New World Pictures| 69.66666666666667|
|Alluvial Film Com...|              92.0|
|       Shout Factory|              65.0|
|            El Deseo|              84.0|
|Oscilloscope Pict...| 80.36842105263158|
|       Cavu Pictures|              72.5|
|        Toho Company|              86.0|
|  Fine Line Features|60.785714285714285|
|           HBO Video| 74.55555555555556|
|             42 West|              50.0|
|   Empire Film Group|              15.0|
|        Disneynature|              77.0|
| Perdido Productions|              71.0|
|  Big World Pictures|              77.5|
|          Wellspring|              67.6|
|       October Films|              82.8|
|               GKIDS|              88.0|
|          NCM Fathom|              74.0|
|Electric City Ent...|              82.0|
+--------------------+------------------+
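
Several studios in the table above show round averages such as 92.0 or 15.0, which may mean the average rests on only one or two films. Reporting the number of films next to the average makes that visible. The sketch below illustrates the idea in plain Python with made-up ratings; the min_movies threshold is hypothetical. In Spark, the equivalent would be to add a count aggregate alongside the average in the same agg call.

```python
from collections import defaultdict
from statistics import mean

# Made-up (studio, rating) pairs.
ratings = [
    ("Studio A", 92),
    ("Studio B", 70),
    ("Studio B", 50),
    ("Studio B", 60),
]

# Collect the ratings of each studio.
by_studio = defaultdict(list)
for studio, rating in ratings:
    by_studio[studio].append(rating)

# Average plus the number of films behind it.
summary = {
    studio: {"average_rating": mean(values), "n_movies": len(values)}
    for studio, values in by_studio.items()
}

# Hypothetical threshold: ignore studios with fewer than 3 rated films.
min_movies = 3
reliable = {s: stats for s, stats in summary.items() if stats["n_movies"] >= min_movies}

print(summary)
print(reliable)
```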

Average movie rating by director

In [33]:
average_rating_by_director = movies_cleaned_df.groupBy("directors") \
    .agg({"tomatometer_rating": "avg"}) \
    .withColumnRenamed("avg(tomatometer_rating)", "average_rating")

Check

In [34]:
average_rating_by_director.show()

+--------------------+------------------+
|           directors|    average_rating|
+--------------------+------------------+
|    Laurence Olivier|              91.0|
|        Jim Jarmusch|            74.625|
|          John Wells|53.333333333333336|
|Harry Elfont, Deb...|              40.0|
|         John Milius|              66.0|
|          Will Gluck|59.333333333333336|
|          Rob Bowman|              26.5|
|       Paul Morrison|              24.0|
|        Michael Kang|              87.0|
|Molly Bingham, St...|              82.0|
|       Carlos Brooks|              61.0|
|      Chan-wook Park|              77.6|
|       Greg Pritikin|              71.0|
|        Zak Hilditch|              86.0|
|Jon Hurwitz, Hayd...|              45.0|
|    Peter Strickland|              89.5|
|       Peter Sattler|              75.0|
|     Larry Fessenden|              81.5|
|         Randy Moore|              56.0|
|         Josh Radnor|              70.0|
+--------------------+------------------+
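
Rows such as "Harry Elfont, Deb..." show that the directors column can hold several names in one string, so grouping on the raw value treats each directing team as a separate "director". A per-person average needs the names split first, in the same way the notebook splits genres. The plain-Python sketch below shows the difference on made-up rows; the names and ratings are illustrative only.

```python
from collections import defaultdict
from statistics import mean

# Made-up (directors, rating) rows; the second film has two directors.
films = [
    ("Jane Doe", 80),
    ("Jane Doe, John Roe", 40),
    ("John Roe", 60),
]

# Grouping on the raw string yields one group per distinct string.
raw_groups = {directors for directors, _ in films}

# Splitting first credits the shared film to each co-director.
per_person = defaultdict(list)
for directors, rating in films:
    for name in directors.split(","):
        per_person[name.strip()].append(rating)

average_by_person = {name: mean(values) for name, values in per_person.items()}

print(len(raw_groups))
print(average_by_person)
```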

Split multi-genre values in a column into individual genres

In [35]:
from pyspark.sql.functions import explode, split, col

# Check whether the 'genre' column exists
if 'genre' in movies_cleaned_df.columns:
    # If it does, split it on commas and emit one row per genre
    movies_genres_df = movies_cleaned_df.withColumn("genre", explode(split(col("genre"), ",")))
else:
    # Otherwise, warn and fall back to a column named 'genres'
    print("Warning: column 'genre' does not exist; falling back to 'genres'.")
    movies_genres_df = movies_cleaned_df.withColumn("genre", explode(split(col("genre"), ",")))

Check

In [36]:
movies_genres_df.select("genre").show()

+--------------------+
|               genre|
+--------------------+
|  Action & Adventure|
|              Comedy|
|               Drama|
| Science Fiction ...|
|              Comedy|
|  Action & Adventure|
|               Drama|
|       Kids & Family|
|  Action & Adventure|
|            Classics|
|               Drama|
|  Action & Adventure|
|            Classics|
|  Mystery & Suspense|
|               Drama|
|Art House & Inter...|
|               Drama|
| Faith & Spiritua...|
|               Drama|
|  Mystery & Suspense|
+--------------------+
only showing top 20 rows
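
Some values in the output above, such as " Science Fiction ..." and " Faith & Spiritua...", start with a space. Splitting on a bare comma leaves the space that followed each comma attached to the next genre. The plain-Python sketch below reproduces the effect on a made-up string and shows a whitespace-tolerant split. Spark's split takes a regular expression as its pattern, so a pattern such as ",\s*" or a trim on the result is a likely fix there; treat that as a suggestion, not as what the notebook does.

```python
import re

# Made-up genre string in the same "a, b, c" layout as the dataset.
raw = "Action & Adventure, Comedy, Drama"

# Splitting on a bare comma keeps the space that followed it.
naive = raw.split(",")

# Splitting on a comma plus surrounding whitespace gives clean labels.
clean = re.split(r"\s*,\s*", raw.strip())

print(naive)
print(clean)
```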



Compute the average movie runtime for each genre

In [37]:
average_duration_by_genre = movies_genres_df.groupBy("genre") \
    .agg({"runtime_in_minutes": "avg"}) \
    .withColumnRenamed("avg(runtime_in_minutes)", "average_runtime")

Check

In [38]:
average_duration_by_genre.show()

+--------------------+------------------+
|               genre|   average_runtime|
+--------------------+------------------+
|  Action & Adventure|109.54932912391476|
|             Romance|             105.0|
|       Kids & Family| 95.20731707317073|
|         Documentary| 98.66666666666667|
|             Romance|108.60377358490567|
|    Special Interest| 98.75935828877006|
|            Classics|120.80898876404494|
|               Drama|110.15460992907802|
|         Documentary| 95.94252873563218|
|              Comedy| 100.1852487135506|
|               Drama|109.04821802935011|
|       Gay & Lesbian| 99.71428571428571|
|Art House & Inter...|107.84421364985164|
|  Mystery & Suspense|106.92134831460675|
|       Kids & Family|           104.375|
|Science Fiction &...|103.61538461538461|
|            Classics|112.77464788732394|
|          Television| 99.74285714285715|
|Musical & Perform...|             113.0|
|         Cult Movies| 96.61111111111111|
+--------------------+------------------+
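
Genres such as Romance, Drama, and Documentary appear twice in the table above, most likely because one label carries a leading space left over from the comma split and is therefore grouped separately. The two rows cannot be merged by averaging their averages, since the groups differ in size; the mean has to be recomputed from the rows after normalizing the labels. The plain-Python sketch below shows this on made-up runtimes; the helper average_runtime is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up (genre, runtime) rows; " Drama" carries a leading space.
rows = [("Drama", 120), (" Drama", 100), (" Drama", 110), ("Comedy", 90)]


def average_runtime(rows, normalize):
    """Group runtimes by the normalized genre label and average each group."""
    groups = defaultdict(list)
    for genre, runtime in rows:
        groups[normalize(genre)].append(runtime)
    return {genre: mean(values) for genre, values in groups.items()}


# Raw labels: "Drama" and " Drama" form two separate groups.
raw = average_runtime(rows, normalize=lambda genre: genre)

# Stripped labels: one Drama group, averaged over all three films.
merged = average_runtime(rows, normalize=str.strip)

# Averaging the two raw Drama averages gives a different, wrong figure.
mean_of_means = mean([raw["Drama"], raw[" Drama"]])

print(raw)
print(merged)
print(mean_of_means)
```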