# PySpark Lab - Spotify Dataset Analysis

This notebook answers the lab's 10 questions using the following PySpark operations:
- **Filters** (`filter`)
- **Selections** (`select`)
- **Aggregations** (`groupBy`)
- **Joins** (`join`)

## 1. Create a SparkSession

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, desc, asc
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder \
    .appName("TP_PySpark_Spotify") \
    .getOrCreate()

print("✅ SparkSession created successfully!")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/10 10:11:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ SparkSession created successfully!


## 2. Load the CSV with an explicit schema

In [2]:
# Define the schema explicitly to avoid type errors
schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("artists", StringType(), True),
    StructField("duration_ms", IntegerType(), True),
    StructField("release_date", StringType(), True),
    StructField("year", IntegerType(), True),
    StructField("acousticness", DoubleType(), True),
    StructField("danceability", DoubleType(), True),
    StructField("energy", DoubleType(), True),
    StructField("instrumentalness", DoubleType(), True),
    StructField("liveness", DoubleType(), True),
    StructField("loudness", DoubleType(), True),
    StructField("speechiness", DoubleType(), True),
    StructField("tempo", DoubleType(), True),
    StructField("valence", DoubleType(), True),
    StructField("mode", IntegerType(), True),
    StructField("key", IntegerType(), True),
    StructField("popularity", IntegerType(), True),
    StructField("explicit", IntegerType(), True)
])

df_spotify = spark.read.csv("spotify-data.csv", header=True, schema=schema)
df_spotify.show(5)

                                                                                

+--------------------+--------------------+--------------------+-----------+------------+----+------------+------------+------+----------------+--------+--------+-----------+-------+-------+----+---+----------+--------+
|                  id|                name|             artists|duration_ms|release_date|year|acousticness|danceability|energy|instrumentalness|liveness|loudness|speechiness|  tempo|valence|mode|key|popularity|explicit|
+--------------------+--------------------+--------------------+-----------+------------+----+------------+------------+------+----------------+--------+--------+-----------+-------+-------+----+---+----------+--------+
|6KbQ3uYMLKb5jDxLF...|Singende Bataillo...| ['Carl Woitschach']|     158648|        1928|1928|       0.995|       0.708| 0.195|           0.563|   0.151| -12.428|     0.0506|118.469|  0.779|   1| 10|         0|       0|
|6KuQTIu1KoTTkLXKr...|Fantasiestücke, O...|['Robert Schumann...|     282133|        1928|1928|       0.994|       0.379

In [3]:
# Display the schema
df_spotify.printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- artists: string (nullable = true)
 |-- duration_ms: integer (nullable = true)
 |-- release_date: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- danceability: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- mode: integer (nullable = true)
 |-- key: integer (nullable = true)
 |-- popularity: integer (nullable = true)
 |-- explicit: integer (nullable = true)



In [4]:
# Total number of rows
print(f"📈 Total number of rows: {df_spotify.count()}")



📈 Total number of rows: 169909


                                                                                

## Creating the decades DataFrame

This code builds a list of tuples, each holding a year (`y`) and the name of its decade (e.g. 1985 → "1980s").

The expression `y//10 * 10` rounds the year down to the start of its decade.
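
As a quick check of the formula, the sketch below applies the same floor-division expression to a few years, including both ends of the 1920 to 2020 range. The helper name `decade_label` is an illustrative assumption and is not used elsewhere in the notebook.

```python
def decade_label(year: int) -> str:
    """Return the decade label for a year, e.g. 1985 -> '1980s'."""
    # Floor division drops the last digit; multiplying by 10 restores the scale.
    return f"{year // 10 * 10}s"


for y in (1920, 1929, 1985, 2020):
    print(y, "->", decade_label(y))
```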

In [5]:
decades_data = [
    (y, f"{y//10 * 10}s")
    for y in range(1920, 2021)
]
decades_df = spark.createDataFrame(decades_data, ["year", "decade_name"])
decades_df.show(5)


+----+-----------+
|year|decade_name|
+----+-----------+
|1920|      1920s|
|1921|      1920s|
|1922|      1920s|
|1923|      1920s|
|1924|      1920s|
+----+-----------+
only showing top 5 rows


                                                                                

## Explanation: repartition vs coalesce

### `repartition(n)`
- Performs a full shuffle of the data
- Can increase OR decrease the number of partitions
- Slower, because it redistributes all the data across the cluster
- Use it when you want the data evenly distributed across partitions

### `coalesce(n)`
- Merges existing partitions WITHOUT a full shuffle
- Can ONLY decrease the number of partitions
- Usually much faster, because it avoids the shuffle's network transfer
- Use it to reduce the number of output files

---
# Business Questions

## Q1: Songs released after 2015 with popularity > 85

In [6]:
popular_recent_songs = df_spotify.filter(
    (col("year") > 2015) & (col("popularity") > 85)
)
popular_recent_songs.select("name", "artists", "year", "popularity").show(10, truncate=False)
print(f"📊 Number of songs found: {popular_recent_songs.count()}")

+---------------------------------------------+---------------------------------------------------------------------+----+----------+
|name                                         |artists                                                              |year|popularity|
+---------------------------------------------+---------------------------------------------------------------------+----+----------+
|goosebumps                                   |['Travis Scott']                                                     |2016|92        |
|Jocelyn Flores                               |['XXXTENTACION']                                                     |2017|87        |
|Clean White Noise - Loopable with no fade    |['Erik Eriksson', 'White Noise Baby Sleep', 'White Noise for Babies']|2017|86        |
|Believer                                     |['Imagine Dragons']                                                  |2017|87        |
|Perfect                                      |['Ed Sheeran'] 



📊 Number of songs found: 128


                                                                                

## Q2: Non-explicit AND non-instrumental songs

In [7]:
non_explicit_non_instrumental = df_spotify.filter(
    (col("explicit") == 0) & (col("instrumentalness") == 0)
)
non_explicit_non_instrumental.select("name", "artists", "explicit", "instrumentalness").show(10, truncate=False)
print(f"📊 Number of songs found: {non_explicit_non_instrumental.count()}")

+------------------------------+-----------------------+--------+----------------+
|name                          |artists                |explicit|instrumentalness|
+------------------------------+-----------------------+--------+----------------+
|Chapter 1.18 - Zamek kaniowski|['Seweryn Goszczyński']|0       |0.0             |
|Chapter 1.3 - Zamek kaniowski |['Seweryn Goszczyński']|0       |0.0             |
|Chapter 4.12 - Zamek kaniowski|['Seweryn Goszczyński']|0       |0.0             |
|Chapter 4.10 - Zamek kaniowski|['Seweryn Goszczyński']|0       |0.0             |
|Chapter 2.11 - Zamek kaniowski|['Seweryn Goszczyński']|0       |0.0             |
|Chapter 3.4 - Zamek kaniowski |['Seweryn Goszczyński']|0       |0.0             |
|Chapter 1.13 - Zamek kaniowski|['Seweryn Goszczyński']|0       |0.0             |
|Chapter 2.23 - Zamek kaniowski|['Seweryn Goszczyński']|0       |0.0             |
|Chapter 2.16 - Zamek kaniowski|['Seweryn Goszczyński']|0       |0.0          

📊 Number of songs found: 37151


## Q3: Highly danceable OR highly positive songs

In [8]:
danceable_or_positive = df_spotify.filter(
    (col("danceability") > 0.8) | (col("valence") > 0.8)
)
danceable_or_positive.select("name", "artists", "danceability", "valence").show(10, truncate=False)
print(f"📊 Number of songs found: {danceable_or_positive.count()}")

+---------------------------------------------------------+-----------------------------------+------------+-------+
|name                                                     |artists                            |danceability|valence|
+---------------------------------------------------------+-----------------------------------+------------+-------+
|Chapter 1.18 - Zamek kaniowski                           |['Seweryn Goszczyński']            |0.749       |0.88   |
|Per aspera ad astra                                      |['Carl Woitschach']                |0.555       |0.857  |
|Invocación al Tango - Remasterizado                      |['Francisco Canaro', 'Luis Scalon']|0.787       |0.849  |
|Tendrás Que Llorar Conmigo - Instrumental (Remasterizado)|['Francisco Canaro']               |0.763       |0.832  |
|Quisiste Cachar un Gil - Instrumental (Remasterizado)    |['Francisco Canaro']               |0.833       |0.568  |
|La Recova - Instrumental (Remasterizado)                 |['



📊 Number of songs found: 39677


                                                                                

## Q4: Average song duration by release year

In [9]:
avg_duration_by_year = df_spotify.groupBy("year").agg(
    avg("duration_ms").alias("avg_duration_ms")
).orderBy("year")

# Convert to minutes for readability
avg_duration_by_year = avg_duration_by_year.withColumn(
    "avg_duration_minutes",
    col("avg_duration_ms") / 60000
)
avg_duration_by_year.show(20)



+----+------------------+--------------------+
|year|   avg_duration_ms|avg_duration_minutes|
+----+------------------+--------------------+
|NULL|              NULL|                NULL|
|1921|    229911.9140625|      3.831865234375|
|1922|167904.54166666666|   2.798409027777778|
|1923| 178354.2142857143|  2.9725702380952383|
|1924|188461.64978902953|  3.1410274964838254|
|1925|184004.24809160305|  3.0667374681933843|
|1926|170391.50114416477|   2.839858352402746|
|1927| 184643.5101010101|  3.0773918350168348|
|1928| 217253.4771573604|  3.6208912859560067|
|1929|169983.42532467534|  2.8330570887445887|
|1930|195895.26640926642|   3.264921106821107|
|1931|178691.72100840337|   2.978195350140056|
|1932|193475.14435146443|   3.224585739191074|
|1933|  192454.038585209|  3.2075673097534834|
|1934| 185728.3981818182|   3.095473303030303|
|1935|218078.28859934854|  3.6346381433224755|
|1936|242105.25430210325|  4.0350875717017205|
|1937| 205470.5755033557|   3.424509591722595|
|1938|244455.

                                                                                

## Q5: Artist with the most tracks

In [10]:
# Count the number of tracks per artist
artist_count = df_spotify.groupBy("artists").agg(
    count("*").alias("nombre_titres")
).orderBy(desc("nombre_titres"))

print("🎤 Top 10 most prolific artists:")
artist_count.show(10, truncate=False)

🎤 Top 10 most prolific artists:




+----------------------+-------------+
|artists               |nombre_titres|
+----------------------+-------------+
|['Эрнест Хемингуэй']  |1215         |
|['Francisco Canaro']  |938          |
|['Эрих Мария Ремарк'] |781          |
|['Ignacio Corsini']   |620          |
|['Frank Sinatra']     |592          |
|['Bob Dylan']         |539          |
|['The Rolling Stones']|512          |
|['Johnny Cash']       |502          |
|['The Beach Boys']    |491          |
|['Elvis Presley']     |488          |
+----------------------+-------------+
only showing top 10 rows


                                                                                

In [11]:
top_artist = artist_count.first()
print(f"🏆 Artist with the most tracks: {top_artist['artists']} with {top_artist['nombre_titres']} tracks")

🏆 Artist with the most tracks: ['Эрнест Хемингуэй'] with 1215 tracks


                                                                                

## Q6: Average characteristics by mode (major/minor)

In [12]:
characteristics_by_mode = df_spotify.groupBy("mode").agg(
    avg("energy").alias("avg_energy"),
    avg("acousticness").alias("avg_acousticness")
).orderBy("mode")

print("🎵 Mode 0 = Minor, Mode 1 = Major")
characteristics_by_mode.show()

🎵 Mode 0 = Minor, Mode 1 = Major




+----+------------------+------------------+
|mode|        avg_energy|  avg_acousticness|
+----+------------------+------------------+
|NULL| 8712.954153432946|28779.414001728608|
|   0|0.5463335744795589|1.7549986961803765|
|   1|0.4840932445979928|0.5012573746052835|
|1942|              NULL|              NULL|
|1951|              NULL|              NULL|
+----+------------------+------------------+



                                                                                

## Q7: Years with the lowest and highest loudness

In [13]:
loudness_by_year = df_spotify.groupBy("year").agg(
    avg("loudness").alias("avg_loudness")
).orderBy("avg_loudness")

print("🔊 Distribution of average loudness by year (ascending order):")
loudness_by_year.show(10)

🔊 Distribution of average loudness by year (ascending order):




+----+-------------------+
|year|       avg_loudness|
+----+-------------------+
|1922|-19.179958333333328|
|1946| -17.47131085164836|
|1928| -17.31900169204738|
|1921|        -17.0954375|
|1945| -16.93371228070176|
|1929|-16.607088744588747|
|1926|-16.410621281464525|
|1941| -15.83473347193347|
|1952|-15.737432241289653|
|1949|-15.517253059177525|
+----+-------------------+
only showing top 10 rows


                                                                                

In [14]:
# Year with the lowest loudness
quietest_year = loudness_by_year.first()
print(f"📉 Year with the LOWEST loudness: {quietest_year['year']} (avg: {quietest_year['avg_loudness']:.2f} dB)")

# Year with the highest loudness
loudest_year = loudness_by_year.orderBy(desc("avg_loudness")).first()
print(f"📈 Year with the HIGHEST loudness: {loudest_year['year']} (avg: {loudest_year['avg_loudness']:.2f} dB)")

📉 Year with the LOWEST loudness: 1922 (avg: -19.18 dB)
📈 Year with the HIGHEST loudness: None (avg: 3733.57 dB)


## Q8: Matching songs to their decade (JOIN)

In [15]:
df_with_decades = df_spotify.join(
    decades_df,
    on="year",
    how="inner"
)

df_with_decades.select("name", "artists", "year", "decade_name").show(10, truncate=False)

                                                                                

+------------------------------------------------+-----------------------------------------------------------------------------------+----+-----------+
|name                                            |artists                                                                            |year|decade_name|
+------------------------------------------------+-----------------------------------------------------------------------------------+----+-----------+
|Aragonaise (Act IV Entr'acte)                   |['Georges Bizet', 'Arturo Toscanini']                                              |1921|1920s      |
|La Payasa - Remasterizado                       |['Ignacio Corsini']                                                                |1921|1920s      |
|La Brisa - Remasterizado                        |['Ignacio Corsini']                                                                |1921|1920s      |
|Quand Il Y A Une Femme Dans Un Coin             |['Maurice Chevalier']                 

In [16]:
# Distribution by decade
print("📊 Distribution of songs by decade:")
df_with_decades.groupBy("decade_name").count().orderBy("decade_name").show()

📊 Distribution of songs by decade:




+-----------+-----+
|decade_name|count|
+-----------+-----+
|      1920s| 4443|
|      1930s| 8873|
|      1940s|14852|
|      1950s|19400|
|      1960s|19950|
|      1970s|19964|
|      1980s|19958|
|      1990s|19953|
|      2000s|19921|
|      2010s|19866|
|      2020s| 1752|
+-----------+-----+



                                                                                

## Q9: Most popular artist (average popularity) and their tracks

In [17]:
# Find the artist with the highest average popularity
# (keeping only artists with at least 5 tracks to limit small-sample bias)
avg_popularity_by_artist = df_spotify.groupBy("artists").agg(
    avg("popularity").alias("avg_popularity"),
    count("*").alias("nombre_titres")
).filter(col("nombre_titres") >= 5).orderBy(desc("avg_popularity"))

print("🎤 Top 10 most popular artists (min 5 tracks):")
avg_popularity_by_artist.show(10, truncate=False)

🎤 Top 10 most popular artists (min 5 tracks):




+------------------+-----------------+-------------+
|artists           |avg_popularity   |nombre_titres|
+------------------+-----------------+-------------+
|['Tones And I']   |81.83333333333333|6            |
|['Ava Max']       |80.5             |6            |
|['Camilo']        |79.4             |5            |
|['Billie Eilish'] |78.8             |30           |
|['6ix9ine']       |77.6             |5            |
|['Harry Styles']  |77.03846153846153|26           |
|['Lewis Capaldi'] |76.85714285714286|14           |
|['Arizona Zervas']|76.8             |5            |
|['Bad Bunny']     |74.86666666666666|30           |
|['Anuel AA']      |74.08333333333333|12           |
+------------------+-----------------+-------------+
only showing top 10 rows


                                                                                

In [18]:
most_popular_artist = avg_popularity_by_artist.first()
print(f"🏆 Most popular artist: {most_popular_artist['artists']}")
print(f"   Average popularity: {most_popular_artist['avg_popularity']:.2f}")
print(f"   Number of tracks: {most_popular_artist['nombre_titres']}")



🏆 Most popular artist: ['Tones And I']
   Average popularity: 81.83
   Number of tracks: 6


                                                                                

In [19]:
# Show all tracks by this artist
print(f"🎵 All tracks by {most_popular_artist['artists']}:")
df_spotify.filter(col("artists") == most_popular_artist['artists']) \
    .select("name", "year", "popularity") \
    .orderBy(desc("popularity")) \
    .show(20, truncate=False)

🎵 All tracks by ['Tones And I']:
+-------------------+----+----------+
|name               |year|popularity|
+-------------------+----+----------+
|Dance Monkey       |2019|94        |
|Dance Monkey       |2019|84        |
|Never Seen The Rain|2020|81        |
|Bad Child          |2020|80        |
|Ur So F**kInG cOoL |2020|77        |
|Never Seen The Rain|2019|75        |
+-------------------+----+----------+



## Q10: Popularity deviation from the yearly average

In [20]:
# Compute the average popularity per year
avg_pop_by_year = df_spotify.groupBy("year").agg(
    avg("popularity").alias("avg_popularity_year")
)

# Join back to the original DataFrame
df_with_avg_pop = df_spotify.join(avg_pop_by_year, on="year", how="inner")

# Compute the deviation
df_with_deviation = df_with_avg_pop.withColumn(
    "popularity_deviation",
    col("popularity") - col("avg_popularity_year")
)

print("📊 Songs with their popularity deviation:")
df_with_deviation.select(
    "name", "artists", "year", "popularity",
    "avg_popularity_year", "popularity_deviation"
).orderBy(desc("popularity_deviation")).show(10, truncate=False)

📊 Songs with their popularity deviation:


                                                                                

+-----------------------------------------------------------------------------+--------------------------------------------+----+----------+-------------------+--------------------+
|name                                                                         |artists                                     |year|popularity|avg_popularity_year|popularity_deviation|
+-----------------------------------------------------------------------------+--------------------------------------------+----+----------+-------------------+--------------------+
|Gymnopédie No. 1                                                             |['Erik Satie', 'Philippe Entremont']        |1949|65        |3.3405215646940825 |61.659478435305914  |
|Whatever Will Be, Will Be (Que Sera, Sera) (with Frank DeVol & His Orchestra)|['Doris Day', 'Frank DeVol & His Orchestra']|1948|63        |1.7391537225495448 |61.260846277450455  |
|Can't Help Falling in Love                                                   |['Elvis Pr

                                                                                

In [21]:
# Top overperforming songs
print("🚀 Top 10 songs that overperform the most:")
df_with_deviation.select(
    "name", "artists", "year", "popularity", "popularity_deviation"
).orderBy(desc("popularity_deviation")).show(10, truncate=False)

🚀 Top 10 songs that overperform the most:
+-----------------------------------------------------------------------------+--------------------------------------------+----+----------+--------------------+
|name                                                                         |artists                                     |year|popularity|popularity_deviation|
+-----------------------------------------------------------------------------+--------------------------------------------+----+----------+--------------------+
|Gymnopédie No. 1                                                             |['Erik Satie', 'Philippe Entremont']        |1949|65        |61.659478435305914  |
|Whatever Will Be, Will Be (Que Sera, Sera) (with Frank DeVol & His Orchestra)|['Doris Day', 'Frank DeVol & His Orchestra']|1948|63        |61.260846277450455  |
|Can't Help Falling in Love                                                   |['Elvis Presley']                           |1961|78      

In [22]:
# Top underperforming songs
print("📉 Top 10 songs that underperform the most:")
df_with_deviation.select(
    "name", "artists", "year", "popularity", "popularity_deviation"
).orderBy(asc("popularity_deviation")).show(10, truncate=False)

📉 Top 10 songs that underperform the most:
+--------------------------------+--------------------------------------------------------------------------------------------------+----+----------+--------------------+
|name                            |artists                                                                                           |year|popularity|popularity_deviation|
+--------------------------------+--------------------------------------------------------------------------------------------------+----+----------+--------------------+
|Reality Check                   |['Swae Lee']                                                                                      |2020|0         |-63.09760273972603  |
|Traicionera - Remix             |['Sebastian Yatra', 'Cosculluela', 'Cali Y El Dandee']                                            |2020|0         |-63.09760273972603  |
|Bleed                           |['A Boogie Wit da Hoodie']                               

---
## Stopping the SparkSession

In [23]:
spark.stop()
print("🛑 SparkSession stopped.")

🛑 SparkSession stopped.
