# 02_Steam_Preprocessing

---

In [0]:
from pyspark.sql import functions as F
from pyspark.sql import Row

### Loading the dataset

In [0]:
df_steam_flat = spark.read.json("/dbfs/FileStore/export/steam_flat.json")

In [0]:
df_steam_flat.printSchema()

root
 |-- appid: long (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ccu: long (nullable = true)
 |-- developer: string (nullable = true)
 |-- discount: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- header_image: string (nullable = true)
 |-- initialprice: string (nullable = true)
 |-- languages: string (nullable = true)
 |-- name: string (nullable = true)
 |-- negative: long (nullable = true)
 |-- owners: string (nullable = true)
 |-- platforms: struct (nullable = true)
 |    |-- linux: boolean (nullable = true)
 |    |-- mac: boolean (nullable = true)
 |    |-- windows: boolean (nullable = true)
 |-- positive: long (nullable = true)
 |-- price: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- release_date: string (nullable = true)
 |-- required_age: string (nullable = true)
 |-- short_description: string (nullable = true)
 |-- tags: struct (nullable = true)
 |    |-- 1980s: lon

In [0]:
print(f"nb rows: {df_steam_flat.count()}, nb columns: {len(df_steam_flat.columns)}")

nb rows: 55691, nb columns: 22


---

### Preprocessing
#### Cleaning and transforming the data before use

Missing values

In [0]:
display(df_steam_flat.describe())

summary,appid,ccu,developer,discount,genre,header_image,initialprice,languages,name,negative,owners,positive,price,publisher,release_date,required_age,short_description,type,website
count,55691.0,55691.0,55691,55691.0,55691,55691,55691.0,55691,55691,55691.0,55691,55691.0,55691.0,55691,55691,55691,55691,55691,55691
mean,1025603.0926720656,138.9596164550825,67392.0,2.603777989262179,,,797.5663033524268,,Infinity,241.8376937027527,,1470.8755992889337,773.2849832109317,2001.0,,0.1978882344490734,,,
stddev,522784.9683283419,6002.067909130784,210681.70504552333,12.887080174743142,,,1104.7624778413358,,,5765.413761559603,,30982.73347953487,1093.1345827234525,1921.8937275510318,,2.296292461481821,,,
min,10.0,0.0,,0.0,,https://cdn.akamai.steamstatic.com/steam/apps/10/header.jpg?t=1666823513,0.0,,Fieldrunners 2,0.0,"0 .. 20,000",0.0,0.0,,,0,,game,
max,2190950.0,874053.0,＼上／,90.0,Web Publishing,https://cdn.akamai.steamstatic.com/steam/apps/999990/header.jpg?t=1610733322,9999.0,Turkish,～Daydream～蝶が舞う頃に,908515.0,"500,000 .. 1,000,000",5943345.0,9999.0,Ｌｅｍｏｎ　Ｂａｌｍ,2022/11/7,MA 15+,"🚗 Take part in a roller coaster of emotions with Louise embarking on a road trip of a lifetime through the late 1960s USA, trying to show her son Mitch how to navigate the often cruel modern world. Your choices matter! ✅",hardware,www.windybeard.com


In [0]:
# There do not seem to be any missing values; verify by counting nulls per column
def count_missing(col_name):
  return F.sum(F.col(col_name).isNull().cast("int")).alias(col_name)
# then we can apply it to all columns using a list comprehension
missing_values = df_steam_flat.select(*[count_missing(c) for c in df_steam_flat.columns]).toPandas()
display(missing_values)

appid,categories,ccu,developer,discount,genre,header_image,initialprice,languages,name,negative,owners,platforms,positive,price,publisher,release_date,required_age,short_description,tags,type,website
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


There are no missing (null) values in any column.
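
A null count of zero does not rule out empty strings: the `describe()` output above shows an empty minimum for `developer`, `genre` and `publisher`, which suggests that some rows hold a blank value rather than a null. The distinction can be sketched in plain Python as follows. The helper name `count_null_and_blank` and the sample rows are hypothetical and are not taken from the dataset.

```python
def count_null_and_blank(rows, column):
    """Return (null_count, blank_count) for one column of a list of dict rows."""
    # A null is a missing value; a blank is a string that is empty or whitespace only.
    nulls = sum(1 for row in rows if row.get(column) is None)
    blanks = sum(
        1
        for row in rows
        if isinstance(row.get(column), str) and row[column].strip() == ""
    )
    return nulls, blanks


# Hypothetical sample rows, for illustration only.
sample_rows = [
    {"name": "Game A", "developer": "Studio A"},
    {"name": "Game B", "developer": ""},
    {"name": "Game C", "developer": "   "},
]

print(count_null_and_blank(sample_rows, "developer"))  # (0, 2)
```

Here the null count is 0 even though two of the three rows carry no usable developer name, so a separate check for blank strings may be worth running on the text columns.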

---

 `required_age`

In [0]:
df_steam_flat.select(F.col("required_age")).distinct().count()

Out[42]: 21

In [0]:
df_steam_flat.groupBy("required_age").agg(F.countDistinct("appid").alias("nb_games")).orderBy(F.desc("nb_games")).toPandas()

Unnamed: 0,required_age,nb_games
0,0,55030
1,15,264
2,18,223
3,16,38
4,17,38
5,12,32
6,13,26
7,14,10
8,10,7
9,6,4


I will replace 21+ with 21, 7+ with 7, and MA 15+ with 15, so that the minimum age can be converted to Integer and queried.

In [0]:
df_steam_flat = df_steam_flat \
    .replace(to_replace="21+", value="21", subset=["required_age"]) \
    .replace(to_replace="7+", value="7", subset=["required_age"]) \
    .replace(to_replace="MA 15+", value="15", subset=["required_age"])
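
The three chained `replace` calls cover the labels identified above. As a possible alternative, the numeric part of a label can be extracted with a regular expression, which would also handle labels that were not listed explicitly. The sketch below is plain Python, and `normalize_age` is a hypothetical helper rather than part of the notebook.

```python
import re


def normalize_age(label):
    """Return the first run of digits in an age label as an int, or None if there is none."""
    match = re.search(r"\d+", label)
    return int(match.group()) if match else None


# The first three labels are the ones replaced above; "18" is an already clean value.
for label in ["21+", "7+", "MA 15+", "18"]:
    print(label, "->", normalize_age(label))
```

In Spark, a comparable result could probably be obtained with `F.regexp_extract` on the `required_age` column, although that variant was not run here.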

In [0]:
df_steam_flat.groupBy("required_age").agg(F.countDistinct("appid").alias("nb_games")).orderBy(F.desc("nb_games")).toPandas()

Unnamed: 0,required_age,nb_games
0,0,55030
1,15,265
2,18,223
3,16,38
4,17,38
5,12,32
6,13,26
7,14,10
8,10,7
9,6,4


There are 4 games whose minimum age is 180, which appears to be a data-entry error.

In [0]:
df_steam_flat.filter(F.col("required_age") == "180").toPandas()

Unnamed: 0,appid,categories,ccu,developer,discount,genre,header_image,initialprice,languages,name,...,platforms,positive,price,publisher,release_date,required_age,short_description,tags,type,website
0,1091210,"[Single-player, Steam Achievements]",0,ТЯН ТЯН ТЯН ЛАМПОВАЯ ТЯН INDUSTRIES,0,"Action, Simulation",https://cdn.akamai.steamstatic.com/steam/apps/...,99,English,Kissing Simulator,...,"{'linux': False, 'mac': False, 'windows': True}",62,99,Kavkaz Sila Games,2019/07/15,180,Kissing Simulator,"{'1980s': None, '1990's': None, '2.5D': None, ...",game,
1,758050,"[Single-player, Steam Achievements]",0,Abu Insdustries,0,"Indie, Simulation",https://cdn.akamai.steamstatic.com/steam/apps/...,299,"English, Not supported, Russian",Internet Simulator,...,"{'linux': False, 'mac': False, 'windows': True}",26,299,Kavkaz Sila Games,2018/03/15,180,Web-browser simulator from beautiful locations,"{'1980s': None, '1990's': None, '2.5D': None, ...",game,
2,844090,[Single-player],0,Ludwig van Beethoven,0,"Simulation, Early Access",https://cdn.akamai.steamstatic.com/steam/apps/...,299,"English, Russian",Piano Simulator,...,"{'linux': False, 'mac': False, 'windows': True}",2,299,Kavkaz Siia Games,2018/05/29,180,The Piano Simulator,"{'1980s': None, '1990's': None, '2.5D': None, ...",game,
3,873190,"[Single-player, Steam Achievements]",1,Censored,0,Simulation,https://cdn.akamai.steamstatic.com/steam/apps/...,99,Russian,ЕСТЬ ДВА СТУЛА,...,"{'linux': False, 'mac': False, 'windows': True}",1289,99,Kavkaz Sila Games,2018/06/19,180,Тюремные Загадки,"{'1980s': None, '1990's': None, '2.5D': None, ...",game,


I remove these rows from the dataset.

In [0]:
df_steam_flat = df_steam_flat.filter(F.col("required_age") != "180")
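
Filtering on the literal value "180" removes only that one anomaly. The later `describe()` output still reports a maximum `required_age` of 35, so a range check may catch more. The sketch below is plain Python on a hypothetical sample. The helper `implausible_ages` and the upper limit of 21 are assumptions chosen for illustration (21 is the highest label mentioned above), not rules taken from the dataset.

```python
def implausible_ages(ages, max_age=21):
    """Return the sorted distinct ages that fall outside the range [0, max_age]."""
    return sorted(
        {age for age in ages if age is not None and not 0 <= age <= max_age}
    )


# Hypothetical sample mixing plausible values with the anomalies discussed in the text.
sample_ages = [0, 15, 18, 180, 35, 180]

print(implausible_ages(sample_ages))  # [35, 180]
```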

I visualize the distribution of the minimum age.

In [0]:
display(df_steam_flat.groupBy("required_age").agg(F.countDistinct("appid").alias("nb_games")).orderBy(F.desc("nb_games")))

required_age,nb_games
0,55030
15,265
18,223
16,38
17,38
12,32
13,26
14,10
10,7
6,4


Databricks visualization. Run in Databricks to view.

Most games have no age restriction (`required_age` = 0).

In [0]:
df_steam_flat.count()

Out[49]: 55687

---

`type`

In [0]:
df_steam_flat.select(F.col("type")).distinct().count()

Out[50]: 2

In [0]:
df_steam_flat.select(F.col("type")).distinct().collect()

Out[51]: [Row(type='game'), Row(type='hardware')]

In [0]:
df_steam_flat.groupBy("type").count().show()

+--------+-----+
|    type|count|
+--------+-----+
|    game|55686|
|hardware|    1|
+--------+-----+



I find a single entry that is not of type "game" (Steam Link, typed "hardware"), so I remove it.

In [0]:
display(df_steam_flat.filter(F.col("type") == "hardware"))

appid,categories,ccu,developer,discount,genre,header_image,initialprice,languages,name,negative,owners,platforms,positive,price,publisher,release_date,required_age,short_description,tags,type,website
353380,"List(Full controller support, Remote Play Together)",0,,0,,https://cdn.akamai.steamstatic.com/steam/apps/353380/header.jpg?t=1617990330,0,,Steam Link,1771,"500,000 .. 1,000,000","List(true, true, true)",5803,0,Anima Locus,2015/11/10,0,"Extend your Steam gaming experience to your mobile device, TV, or another PC - all you need is a local network or internet connection. In addition, the Steam Link app now supports Remote Play Together. Now you can join games hosted on a friend’s PC just by clicking a link.","List(null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 10, null, 8, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 5, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 11, null, null, null, null, null, null, null, 6, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 15, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 
null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 6, null, null, null, null, null, null, null, null, null, 10, null, 16, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 23, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, 39, null, null, null, null, null, null, null, null, null, null, null, null, null, 440, null, null, null, null, null, null, null, null, null, 11, 17, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null, null)",hardware,https://store.steampowered.com/remoteplay


In [0]:
df_steam_flat = df_steam_flat.filter(F.col("type") == "game")

In [0]:
df_steam_flat.groupBy("type").count().show()

+----+-----+
|type|count|
+----+-----+
|game|55686|
+----+-----+



---

Data format

I review the data format. I will cast some columns to make them easier to query:

appid : Identifier, Text

categories : Categories, List

ccu : Number of concurrent players, Long

developer : Developer, Text

discount : Price reduction, Float

genre : Genre (type of game), Text

header_image : Header image, Text

initialprice : Original price, Float

languages : Languages, Text

name : Name, Text

negative : Number of negative reviews, Long

owners : Range of owners (between x and y copies sold), Text

platforms : Platforms on which the game is available, Dictionary

positive : Number of positive reviews, Long

price : Price, Float

publisher : Publisher, Text

release_date : Release date, Date

required_age : (see above) Minimum age, Integer

short_description : Description, Text

tags : Keywords and the user votes for each keyword (Steam lets players add keywords to games, and these votes are counted to display the most popular tags), Dictionary

type : (see above) Game, Text

website : URL, Text


In [0]:
df_steam_flat.printSchema()

root
 |-- appid: long (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- ccu: long (nullable = true)
 |-- developer: string (nullable = true)
 |-- discount: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- header_image: string (nullable = true)
 |-- initialprice: string (nullable = true)
 |-- languages: string (nullable = true)
 |-- name: string (nullable = true)
 |-- negative: long (nullable = true)
 |-- owners: string (nullable = true)
 |-- platforms: struct (nullable = true)
 |    |-- linux: boolean (nullable = true)
 |    |-- mac: boolean (nullable = true)
 |    |-- windows: boolean (nullable = true)
 |-- positive: long (nullable = true)
 |-- price: string (nullable = true)
 |-- publisher: string (nullable = true)
 |-- release_date: string (nullable = true)
 |-- required_age: string (nullable = true)
 |-- short_description: string (nullable = true)
 |-- tags: struct (nullable = true)
 |    |-- 1980s: lon

The columns to cast are therefore:

appid : Text

discount: Float

initialprice : Float

price: Float

release_date : Date

required_age : Integer

In [0]:
df_steam_flat = df_steam_flat \
  .withColumn("appid", F.col("appid").cast("string")) \
  .withColumn("discount", F.col("discount").cast("float")) \
  .withColumn("initialprice", F.col("initialprice").cast("float")) \
  .withColumn("price", F.col("price").cast("float")) \
  .withColumn("required_age", F.col("required_age").cast("int"))
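
One reason these casts matter is that string columns are compared lexicographically. The first `describe()` output reports a maximum of 9999.0 for `initialprice`, while the output computed after the cast reports 99900.0, which is consistent with that behaviour. A small plain-Python illustration follows. The helper name `max_as_text_and_number` is hypothetical, and the three values are ones that appear in the outputs of this notebook.

```python
def max_as_text_and_number(values):
    """Return the maximum of the values compared as strings, then compared as floats."""
    return max(values), max(float(value) for value in values)


# Prices stored as text, as in the raw JSON schema.
prices_as_text = ["9999", "99900", "500"]

text_max, numeric_max = max_as_text_and_number(prices_as_text)
print(text_max)     # 9999   (character-by-character comparison: "9999" > "99900")
print(numeric_max)  # 99900.0
```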

---

`release_date`

In [0]:
df_steam_flat.select(F.col("release_date")).show(20)

+------------+
|release_date|
+------------+
|   2000/11/1|
|  2021/05/14|
|  2020/10/16|
|  2020/10/14|
|  2019/03/30|
|  2019/06/24|
|  2019/01/24|
|   2019/04/8|
|   2019/01/6|
|   2021/09/9|
|  2019/12/17|
|  2021/02/16|
|   2019/01/3|
|   2019/02/1|
|  2019/11/22|
|  2019/05/24|
|   2019/02/4|
|   2021/12/5|
|  2019/02/11|
|  2019/03/21|
+------------+
only showing top 20 rows



In [0]:
df_steam_flat = df_steam_flat.withColumn("release_date", F.to_timestamp(F.col("release_date"), format="y/M/d"))

In [0]:
df_steam_flat.select(F.col("release_date")).show(20)


+-------------------+
|       release_date|
+-------------------+
|2000-11-01 00:00:00|
|2021-05-14 00:00:00|
|2020-10-16 00:00:00|
|2020-10-14 00:00:00|
|2019-03-30 00:00:00|
|2019-06-24 00:00:00|
|2019-01-24 00:00:00|
|2019-04-08 00:00:00|
|2019-01-06 00:00:00|
|2021-09-09 00:00:00|
|2019-12-17 00:00:00|
|2021-02-16 00:00:00|
|2019-01-03 00:00:00|
|2019-02-01 00:00:00|
|2019-11-22 00:00:00|
|2019-05-24 00:00:00|
|2019-02-04 00:00:00|
|2021-12-05 00:00:00|
|2019-02-11 00:00:00|
|2019-03-21 00:00:00|
+-------------------+
only showing top 20 rows



I visualize the release dates of the games.

In [0]:
display(df_steam_flat.groupBy("release_date").agg(F.countDistinct("appid").alias("nb_games")).orderBy(F.desc("nb_games")))

release_date,nb_games
,222
2020-01-17T00:00:00.000+0000,74
2022-09-30T00:00:00.000+0000,64
2020-10-15T00:00:00.000+0000,63
2021-09-30T00:00:00.000+0000,62
2021-10-14T00:00:00.000+0000,59
2022-09-01T00:00:00.000+0000,58
2021-02-26T00:00:00.000+0000,58
2020-07-31T00:00:00.000+0000,57
2021-06-17T00:00:00.000+0000,57


Databricks visualization. Run in Databricks to view.

This quick visualization shows a sharp increase in the number of games released on Steam from 2014 onward.
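
The first row of the grouped output above has an empty `release_date` for 222 games. This suggests that those strings did not match the `y/M/d` pattern and became null during the conversion, even though the column had no nulls beforehand. The plain-Python sketch below shows the same idea of parsing with a fallback to `None`. The helper `parse_release_date` and the non-date sample strings are hypothetical, and the actual unparsed values in the dataset were not inspected here.

```python
from datetime import datetime


def parse_release_date(text):
    """Parse a 'year/month/day' string; return None when the text is not a valid date."""
    try:
        return datetime.strptime(text, "%Y/%m/%d")
    except (ValueError, TypeError):
        return None


# The first two strings come from the output above; the last two are hypothetical.
samples = ["2000/11/1", "2019/04/8", "Coming soon", ""]

parsed = [parse_release_date(sample) for sample in samples]
unparsed = [sample for sample, value in zip(samples, parsed) if value is None]
print(len(unparsed))  # 2
```

Listing the original strings of the rows that fail to parse, before overwriting the column, would show whether they can be recovered or should be dropped.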

---

Price analysis

In [0]:
display(df_steam_flat.describe())

summary,appid,ccu,developer,discount,genre,header_image,initialprice,languages,name,negative,owners,positive,price,publisher,required_age,short_description,type,website
count,55686.0,55686.0,55686,55686.0,55686,55686,55686.0,55686,55686,55686.0,55686,55686.0,55686.0,55686,55686.0,55686,55686,55686
mean,1025624.7874510648,138.97207556656969,67392.0,2.6040117803397624,,,797.6236217361635,,Infinity,241.8229716625364,,1470.878694824552,773.3401213949646,2001.0,0.1857378874402902,,,
stddev,522798.4087328436,6002.3372242551,210681.70504552333,12.887635112557996,,,1104.7949315559604,,,5765.668726815331,,30984.117176566942,1093.167581116487,1921.8937275510318,1.7214162791461722,,,
min,10.0,0.0,,0.0,,https://cdn.akamai.steamstatic.com/steam/apps/10/header.jpg?t=1666823513,0.0,,Fieldrunners 2,0.0,"0 .. 20,000",0.0,0.0,,0.0,,game,
max,999990.0,874053.0,＼上／,90.0,Web Publishing,https://cdn.akamai.steamstatic.com/steam/apps/999990/header.jpg?t=1610733322,99900.0,Turkish,～Daydream～蝶が舞う頃に,908515.0,"500,000 .. 1,000,000",5943345.0,99900.0,Ｌｅｍｏｎ　Ｂａｌｍ,35.0,"🚗 Take part in a roller coaster of emotions with Louise embarking on a road trip of a lifetime through the late 1960s USA, trying to show her son Mitch how to navigate the often cruel modern world. Your choices matter! ✅",game,www.windybeard.com


The maximum prices look high (99900.0 for both `initialprice` and `price`).

`initialprice`

In [0]:
display(df_steam_flat.groupBy("initialprice").agg(F.countDistinct("appid").alias("appid_count")))

initialprice,appid_count
3599.0,1
550.0,1
500.0,20
360.0,1
2599.0,1
969.0,1
3595.0,2
379.0,2
1558.0,1
6499.0,7


Databricks visualization. Run in Databricks to view.

`price`

In [0]:
display(df_steam_flat.groupBy("price").agg(F.countDistinct("appid").alias("appid_count")))

price,appid_count
3599.0,4
714.0,2
3749.0,1
769.0,1
550.0,1
64.0,3
107.0,1
500.0,20
1438.0,1
1079.0,4


Databricks visualization. Run in Databricks to view.

The prices contain extreme values, so I will filter them out.

In [0]:
# Columns to process
col_outliers = ["price", "initialprice"]

# For each column, compute the mean and standard deviation, then filter out the outliers.
# Note: the filter is applied inside the loop, so the statistics for the second column
# are computed on the data already filtered on the first column.
for col in col_outliers:
    # Compute the statistics
    stats = df_steam_flat.select(
        F.mean(F.col(col)).alias("mean"),
        F.stddev(F.col(col)).alias("stddev")
    ).collect()[0]

    mean_val = stats["mean"]
    stddev_val = stats["stddev"]

    # Keep values within three standard deviations of the mean
    upper_bound = mean_val + 3 * stddev_val
    lower_bound = mean_val - 3 * stddev_val

    # Filter out the outliers
    df_steam_flat = df_steam_flat.filter((F.col(col) <= upper_bound) & (F.col(col) >= lower_bound))
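
As a worked example of the three-standard-deviation rule used above, the bounds for `price` can be recomputed from the mean and standard deviation reported by the preceding `describe()` output. `price` is the first column in the loop, so those statistics apply to it directly. The helper `three_sigma_bounds` is a hypothetical name used only for this sketch.

```python
def three_sigma_bounds(mean, stddev, k=3):
    """Return (lower, upper) bounds located k standard deviations around the mean."""
    return mean - k * stddev, mean + k * stddev


# Mean and standard deviation of `price`, copied from the describe() output above.
price_mean = 773.3401213949646
price_stddev = 1093.167581116487

lower, upper = three_sigma_bounds(price_mean, price_stddev)
print(round(lower, 2), round(upper, 2))  # -2506.16 4052.84
```

Because prices cannot be negative, the lower bound never excludes a row here. In practice only the upper bound, roughly 4053 in the units of the `price` column, removes data.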

I visualize the price distribution again.

`initialprice`

In [0]:
display(df_steam_flat.groupBy("initialprice").agg(F.countDistinct("appid").alias("appid_count")))

initialprice,appid_count
550.0,1
500.0,20
360.0,1
2599.0,1
969.0,1
379.0,2
1558.0,1
187.0,1
189.0,1
90.0,34


Databricks visualization. Run in Databricks to view.

`price`

In [0]:
display(df_steam_flat.groupBy("price").agg(F.countDistinct("appid").alias("appid_count")))

price,appid_count
714.0,2
769.0,1
550.0,1
64.0,3
107.0,1
500.0,20
1079.0,4
360.0,1
911.0,1
2599.0,1


Databricks visualization. Run in Databricks to view.

In [0]:
df_steam_flat.count()

Out[81]: 54474

The data is now ready for analysis.

---

Saving the preprocessed file

In [0]:
df_steam_flat.write.mode("overwrite").json("/dbfs/FileStore/export/steam_flat_prep.json")