# 05_Steam_Plateformes

---

In [0]:
from pyspark.sql import functions as F
from pyspark.sql import Row

### Loading the dataset

In [0]:
df_steam_flat = spark.read.json("/dbfs/FileStore/export/steam_flat_prep.json")

---

### Platform analysis

Are most games available on Windows, Mac, and Linux?

In [0]:
df_steam_flat.select("platforms").limit(5).toPandas()

Unnamed: 0,platforms
0,"{'linux': False, 'mac': False, 'windows': True}"
1,"{'linux': False, 'mac': False, 'windows': True}"
2,"{'linux': False, 'mac': False, 'windows': True}"
3,"{'linux': False, 'mac': False, 'windows': True}"
4,"{'linux': False, 'mac': False, 'windows': True}"


I want to unpack the `platforms` column, which is a struct (shown as a dictionary in the preview), into one column per platform.

In [0]:
# Extract each field of the `platforms` struct into its own column, then drop the struct
df_platforms = df_steam_flat \
    .withColumn("platform_linux", F.col("platforms").getField("linux")) \
    .withColumn("platform_mac", F.col("platforms").getField("mac")) \
    .withColumn("platform_windows", F.col("platforms").getField("windows")) \
    .drop("platforms")
df_platforms.printSchema()

root
 |-- appid: string (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ccu: long (nullable = true)
 |-- developer: string (nullable = true)
 |-- discount: double (nullable = true)
 |-- genre: string (nullable = true)
 |-- header_image: string (nullable = true)
 |-- initialprice: double (nullable = true)
 |-- languages: string (nullable = true)
 |-- name: string (nullable = true)
 |-- negative: long (nullable = true)
 |-- owners: string (nullable = true)
 |-- positive: long (nullable = true)
 |-- price: double (nullable = true)
 |-- publisher: string (nullable = true)
 |-- release_date: string (nullable = true)
 |-- required_age: long (nullable = true)
 |-- short_description: string (nullable = true)
 |-- tags: struct (nullable = true)
 |    |-- 1980s: long (nullable = true)
 |    |-- 1990's: long (nullable = true)
 |    |-- 2.5D: long (nullable = true)
 |    |-- 2D: long (nullable = true)
 |    |-- 2D Fighter: long (nulla

In [0]:
# Encode each boolean flag as 0/1 so that summing a column counts the games (null becomes 0)
df_platforms = df_platforms \
    .withColumn("platform_linux", F.when(F.col("platform_linux") == True, 1).otherwise(0)) \
    .withColumn("platform_mac", F.when(F.col("platform_mac") == True, 1).otherwise(0)) \
    .withColumn("platform_windows", F.when(F.col("platform_windows") == True, 1).otherwise(0))


In [0]:
df_platforms.select("appid", "platform_linux", "platform_mac", "platform_windows").limit(5).toPandas()

Unnamed: 0,appid,platform_linux,platform_mac,platform_windows
0,686260,0,0,1
1,686270,0,0,1
2,686290,0,0,1
3,686300,0,0,1
4,686340,0,0,1


Now, for each game, we can tell which platform(s) it is available on.
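To make the 0/1 encoding concrete, here is a minimal pure-Python sketch of the same idea, outside Spark. The helper `to_flag` and the `rows` list are illustrative names, not part of the notebook. The rows copy the five previewed `platforms` values.

```python
# Five rows mirroring the `platforms` struct shown in the preview
rows = [{"linux": False, "mac": False, "windows": True} for _ in range(5)]

def to_flag(value):
    """Return 1 only for True; False and None both map to 0,
    like F.when(col == True, 1).otherwise(0)."""
    return 1 if value is True else 0

# One 0/1 column per platform
flags = [{f"platform_{name}": to_flag(v) for name, v in row.items()} for row in rows]

# Summing a 0/1 column counts the rows where the flag is set
totals = {column: sum(f[column] for f in flags) for column in flags[0]}
print(totals)
```

Because each flag is 0 or 1, the sum of a column is exactly the number of games available on that platform, which is what the aggregation in the next cell relies on.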

In [0]:
display(df_platforms.agg(
    F.sum("platform_linux").alias("nb_games_linux"),
    F.sum("platform_mac").alias("nb_games_mac"),
    F.sum("platform_windows").alias("nb_games_windows")
))


nb_games_linux,nb_games_mac,nb_games_windows
8457,12769,55670



55,670 games are available on **Windows**, which is almost all of the games.

12,769 are available on Mac and 8,457 on Linux.
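As a quick worked check, the three totals can be compared with each other. The sketch below expresses the Mac and Linux counts relative to the Windows count. This is a ratio between catalogue sizes, not a share of the full dataset, because the total number of games is not shown in this notebook.

```python
# Totals reported by the aggregation above
nb_games = {"linux": 8457, "mac": 12769, "windows": 55670}

# Size of each platform's catalogue relative to the Windows catalogue, in percent
relative_to_windows = {
    platform: round(count / nb_games["windows"] * 100, 1)
    for platform, count in nb_games.items()
}
print(relative_to_windows)
```

The Mac catalogue is roughly 23% of the size of the Windows one, and the Linux catalogue roughly 15%.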

---

Do some genres tend to be available preferentially on certain platforms?

In [0]:
df_steam_flat.select("genre").limit(5).toPandas()

Unnamed: 0,genre
0,"Indie, Simulation, Strategy"
1,"Adventure, Free to Play, Indie"
2,"Action, Adventure, Casual, Indie, Sports, Stra..."
3,"Action, Casual, Indie"
4,"Adventure, Free to Play, Indie"


I want to unpack the `genre` column, which is a string in which the genres are separated by commas.

In [0]:
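# Split the comma-separated genre string into an array.
# Note: the separator is a bare ",", so every genre after the first keeps its leading space.
df_platforms_genre = df_platforms \
    .withColumn("genre_list", F.split(F.col("genre"), ",")) \
    .drop("genre")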
df_platforms_genre.select("genre_list").limit(5).toPandas()

Unnamed: 0,genre_list
0,"[Indie, Simulation, Strategy]"
1,"[Adventure, Free to Play, Indie]"
2,"[Action, Adventure, Casual, Indie, Sports,..."
3,"[Action, Casual, Indie]"
4,"[Adventure, Free to Play, Indie]"


In [0]:
df_platforms_genre_exploded = df_platforms_genre.withColumn("genre", F.explode("genre_list"))
df_platforms_genre_exploded.select("appid", "genre").limit(5).toPandas()

Unnamed: 0,appid,genre
0,686260,Indie
1,686260,Simulation
2,686260,Strategy
3,686270,Adventure
4,686270,Free to Play


Now, for each game, we can tell which genre(s) it is associated with.
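One caveat about the split: the genre strings use `", "` (comma plus space) as the separator, while the split above uses a bare `","`. Every genre after the first therefore keeps a leading space, so `"Indie"` and `" Indie"` become two different groups. The two `Sports` rows in the percentage table further down are consistent with this. The pure-Python sketch below illustrates the effect on three of the previewed strings. The variable names are illustrative.

```python
import re
from collections import Counter

# Three genre strings copied from the preview above
genre_strings = [
    "Indie, Simulation, Strategy",
    "Adventure, Free to Play, Indie",
    "Action, Casual, Indie",
]

# Splitting on a bare comma keeps the space that follows each comma
naive = [g for s in genre_strings for g in s.split(",")]

# Splitting on "comma plus optional whitespace" yields clean genre names
clean = [g for s in genre_strings for g in re.split(r",\s*", s)]

naive_counts = Counter(naive)
clean_counts = Counter(clean)
print(naive_counts["Indie"], naive_counts[" Indie"], clean_counts["Indie"])
```

In PySpark, `F.split` takes a regular expression as its pattern, so the same idea would presumably be `F.split(F.col("genre"), r",\s*")`. Applying it would change the per-genre counts shown below, so the recorded outputs would need to be regenerated.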

In [0]:
# Number of games available on each platform, per genre
df_genre_prefered_platforms = df_platforms_genre_exploded.groupBy("genre").agg(
    F.count("appid").alias("nb_games"),
    F.sum("platform_linux").alias("linux_sum"),
    F.sum("platform_mac").alias("mac_sum"),
    F.sum("platform_windows").alias("windows_sum"))

df_genre_prefered_platforms.limit(5).toPandas()

Unnamed: 0,genre,nb_games,linux_sum,mac_sum,windows_sum
0,Sports,2425,282,478,2424
1,Education,141,7,18,141
2,Massively Multiplayer,45,1,8,45
3,Simulation,9449,1382,2165,9446
4,Sexual Content,54,7,13,54


In [0]:
# Percentage of each genre's games available on each platform
df_ratio_genre_platforms = df_genre_prefered_platforms.select(
    "genre",
    (F.col("linux_sum") / F.col("nb_games") * 100).alias("linux_percent"),
    (F.col("mac_sum") / F.col("nb_games") * 100).alias("mac_percent"),
    (F.col("windows_sum") / F.col("nb_games") * 100).alias("windows_percent"))

# Highest percentage and the name of the associated platform.
# On ties, the first matching branch wins (linux, then mac, then windows).
df_ratio_genre_platforms = df_ratio_genre_platforms \
    .withColumn("max_percent", F.greatest("linux_percent", "mac_percent", "windows_percent")) \
    .withColumn("max_column",
        F.when(F.col("max_percent") == F.col("linux_percent"), "linux_percent")
        .when(F.col("max_percent") == F.col("mac_percent"), "mac_percent")
        .when(F.col("max_percent") == F.col("windows_percent"), "windows_percent"))

display(df_ratio_genre_platforms)

genre,linux_percent,mac_percent,windows_percent,max_percent,max_column
Sports,11.628865979381445,19.711340206185568,99.95876288659794,99.95876288659794,windows_percent
Education,4.964539007092199,12.76595744680851,100.0,100.0,windows_percent
Massively Multiplayer,2.2222222222222223,17.77777777777778,100.0,100.0,windows_percent
Simulation,14.625886337178535,22.912477510847708,99.96825060853,99.96825060853,windows_percent
Sexual Content,12.962962962962962,24.074074074074076,100.0,100.0,windows_percent
Adventure,17.423423423423422,28.144144144144143,99.97297297297295,99.97297297297295,windows_percent
Web Publishing,6.153846153846154,21.53846153846154,100.0,100.0,windows_percent
Sports,1.4705882352941175,10.294117647058822,100.0,100.0,windows_percent
Audio Production,5.405405405405405,26.126126126126124,98.1981981981982,98.1981981981982,windows_percent
Design & Illustration,15.53398058252427,30.582524271844655,100.0,100.0,windows_percent



Every genre is at least 98% available on **Windows**.

It is the main platform, whatever the genre.
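To check the percentage and "best platform" logic by hand, here is a small pure-Python sketch applied to one aggregated row (the `Sports` row with 2425 games from the per-genre table). The function `best_platform` is a hypothetical helper written for this illustration. It mirrors the `F.greatest` plus chained `F.when` logic, including its tie-breaking order.

```python
PLATFORMS = ("linux", "mac", "windows")

def best_platform(row):
    """Return (percents, max_percent, max_column) for one aggregated genre row."""
    percents = {
        f"{p}_percent": row[f"{p}_sum"] / row["nb_games"] * 100
        for p in PLATFORMS
    }
    max_percent = max(percents.values())
    # First column, in linux/mac/windows order, that reaches the maximum
    max_column = next(name for name, value in percents.items() if value == max_percent)
    return percents, max_percent, max_column

# Counts copied from the per-genre aggregation output
sports = {"nb_games": 2425, "linux_sum": 282, "mac_sum": 478, "windows_sum": 2424}
percents, max_percent, max_column = best_platform(sports)
print(percents, max_column)
```

For this row the Windows percentage is just under 100% because one of the 2425 games is not flagged as available on Windows. If two platforms had exactly the same percentage, the first one in the linux, mac, windows order would be reported.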