<a href="https://colab.research.google.com/github/Forkou-francine/Documents_By_PAF/blob/main/Tp_notation_films.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab: Movie Data Analysis with PySpark**

**Objective**: Use PySpark to manipulate a movie dataset, apply simple transformations, and run analyses.

Work carried out by:

1. Ange-Francine PENE FORKOU
2. Stanislas Raphael MACOS



### **Step 1: Getting Started - Data Preparation**
1. Load the CSV file containing the movie data into a PySpark DataFrame:

In [11]:
# Import the libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, explode, split, avg

# Create a Spark session
spark = SparkSession.builder.appName('RottenTomatoesMovieAnalysis').getOrCreate()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [12]:
# Load the CSV file
file_path = "Rotten_Tomatoes_Movies.csv"

# Create the DataFrame that will hold the movie data
df_movies = spark.read.csv(file_path, header=True, inferSchema=True)

# Show a preview of the data
df_movies.show(5)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+
|         movie_title|          movie_info|   critics_consensus|              rating|               genre|           directors|             writers|                cast|    in_theaters_date|   on_streaming_date|  runtime_in_minutes|         studio_name|  tomatometer_status|  tomatometer_rating|   tomatometer_count|audience_rating|      audience_count|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------

2.   Clean the data by removing empty or invalid rows:


*   Remove rows with null or invalid values

In [23]:
# Key columns to check for missing values before the analysis.
# They hold the information the analysis depends on: dates, ratings,
# and studio/director names.
list_columns = ["tomatometer_rating", "critics_consensus", "in_theaters_date", "on_streaming_date", "studio_name", "directors"]

# Drop the rows that contain a null value in any of these columns
df_movies_cleaned = df_movies.dropna(subset=list_columns)
#df_movies_cleaned.show(10)

+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+--------------------+
|         movie_title|          movie_info|   critics_consensus|             rating|               genre|           directors|             writers|                cast|in_theaters_date|on_streaming_date|  runtime_in_minutes|         studio_name|  tomatometer_status|  tomatometer_rating|   tomatometer_count|audience_rating|      audience_count|
+--------------------+--------------------+--------------------+-------------------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+

### **Step 2: Data Manipulation and Analysis**
1. Filtering

In [25]:
# Movies with a very low rating (tomatometer_rating < 20)
df_movies_low_rating = df_movies_cleaned.filter(col("tomatometer_rating") < 20)
df_movies_low_rating.show(5)

+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         movie_title|          movie_info|   critics_consensus|rating|               genre|           directors|             writers|                cast|in_theaters_date|on_streaming_date|runtime_in_minutes|         studio_name|tomatometer_status|tomatometer_rating|tomatometer_count|audience_rating|audience_count|
+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         10,000 B.C.|A young outcast f...|Wit

In [26]:
# Movies released in theaters after January 1, 2000
df_movies_after_2000 = df_movies_cleaned.filter(col("in_theaters_date") > "2000-01-01")
df_movies_after_2000.show(5)

+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         movie_title|          movie_info|   critics_consensus|rating|               genre|           directors|             writers|                cast|in_theaters_date|on_streaming_date|runtime_in_minutes|         studio_name|tomatometer_status|tomatometer_rating|tomatometer_count|audience_rating|audience_count|
+--------------------+--------------------+--------------------+------+--------------------+--------------------+--------------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|Percy Jackson & t...|A teenager discov...|Tho

2. Averages

In [24]:
# Average rating per studio
df_avg_rating_studio = df_movies_cleaned.groupBy("studio_name").agg(avg("tomatometer_rating").alias("avg_rating"))
df_avg_rating_studio.show(5)

+--------------------+------------------+
|         studio_name|        avg_rating|
+--------------------+------------------+
|Virgil Films & En...|              13.0|
|    Relativity Media|33.310344827586206|
|  New World Pictures|              50.1|
| and seeks out th...|              NULL|
|       Nicholas Hope|              NULL|
+--------------------+------------------+
only showing top 5 rows



In [27]:
# Average rating per director
df_avg_rating_director = df_movies_cleaned.groupBy("directors").agg(avg("tomatometer_rating").alias("avg_rating"))
df_avg_rating_director.show(5)

+--------------------+----------+
|           directors|avg_rating|
+--------------------+----------+
|Comedy, Science F...|      NULL|
|         John Milius|      59.5|
|    Laurence Olivier|      79.0|
|      Jean Yarbrough|      30.0|
|        Jim Jarmusch|      77.0|
+--------------------+----------+
only showing top 5 rows



### **Step 3: Using Advanced Functions**



1.   Split the multiple genres stored in one column into individual genres



In [29]:
# Split the genres on the ", " separator and create one row per genre
df_individual_genres = df_movies_cleaned.withColumn("genre", explode(split(col("genre"), ", ")))
df_individual_genres.show(5)

+--------------------+--------------------+--------------------+------+--------------------+-----------------+-----------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|         movie_title|          movie_info|   critics_consensus|rating|               genre|        directors|          writers|                cast|in_theaters_date|on_streaming_date|runtime_in_minutes|         studio_name|tomatometer_status|tomatometer_rating|tomatometer_count|audience_rating|audience_count|
+--------------------+--------------------+--------------------+------+--------------------+-----------------+-----------------+--------------------+----------------+-----------------+------------------+--------------------+------------------+------------------+-----------------+---------------+--------------+
|Percy Jackson & t...|A teenager discov...|Though it may see...|

2. Compute the average movie runtime for each genre