# Cleaning and Preparing Steam Data with ETL: A JSON-Based Approach

## Description:
In this project, we run an ETL (Extract, Transform, Load) process on a Steam dataset that describes each game through the following fields: publisher, genres, app name, title, URL, release date, tags, reviews URL, specs, price, early access, ID, and developer.

### Extraction:
We read the JSON files that contain the Steam data, which cover games, user reviews, and developer details, among other things.

### Transformation:
In this stage, we clean and transform the data. This can involve removing duplicate or irrelevant records, handling missing values, converting data types, and normalizing text. We may also derive new features that are useful for later analysis.

### Load:
Finally, the cleaned and transformed data is saved in a format suitable for later analysis: in this case CSV, with JSON Lines and Parquet copies written as well.

## 1) DATA IMPORT AND LOADING

In [1]:
import pandas as pd
import numpy as np
import ast

def load_json_lines(new_file_path):
    """Parse each line with ast.literal_eval and collect the records into a DataFrame."""
    data = []
    with open(new_file_path, "r", encoding="utf-8") as file:
        for line in file:
            # Each line is evaluated as a Python literal rather than as strict JSON
            data.append(ast.literal_eval(line))
    return pd.DataFrame(data)

# Load the JSON source files
data_games = pd.read_json("./steam_games.json/output_steam_games.json", lines=True)
data_reviews = load_json_lines("./user_reviews.json/australian_user_reviews.json")
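
The reviews file is parsed with `ast.literal_eval` instead of a JSON parser. A plausible reason (an assumption, since the file contents are not shown here) is that each line is written as a Python dict literal with single quotes, which strict JSON rejects. The sketch below uses a made-up sample line to show the difference; the field names are hypothetical.

```python
import ast
import json

# Hypothetical sample line, written with Python-style single quotes and True
sample_line = "{'user_id': 'example_user', 'items_count': 3, 'recommend': True}"

# Strict JSON requires double-quoted keys, so this parse is expected to fail
try:
    json.loads(sample_line)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval accepts Python literals (dicts, lists, strings, numbers, booleans)
record = ast.literal_eval(sample_line)

print(parsed_as_json)
print(record["items_count"])
```

`ast.literal_eval` only evaluates literals, so it does not execute arbitrary code from the file.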


In [2]:
# Drop the rows in which every column is null so that the dataset can be interpreted
data_games = data_games.dropna(how='all')
data_games.to_csv(r'steam_games.csv',index=False)
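
`dropna(how='all')` removes only rows that are completely empty; rows with some missing fields are kept. The toy frame below (made-up values) is a small sketch contrasting it with the default `how='any'`.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is complete, row 1 is entirely null, row 2 is partially null
toy = pd.DataFrame({"name": ["A", None, "C"], "price": [4.99, np.nan, np.nan]})

# how="all" drops a row only when every column is null
only_empty_rows_removed = toy.dropna(how="all")

# The default (how="any") drops a row as soon as one column is null
any_null_removed = toy.dropna()

print(len(only_empty_rows_removed), len(any_null_removed))
```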

## 2) DATA EXPLORATION

In [3]:
data_games.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
88310,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,0.0,761140.0,Kotoshiro
88311,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,0.0,643980.0,Secret Level SRL
88312,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,0.0,670290.0,Poolians.com
88313,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,0.0,767400.0,彼岸领域
88314,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,0.0,773570.0,


In [4]:
data_games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32135 entries, 88310 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   publisher     24083 non-null  object 
 1   genres        28852 non-null  object 
 2   app_name      32133 non-null  object 
 3   title         30085 non-null  object 
 4   url           32135 non-null  object 
 5   release_date  30068 non-null  object 
 6   tags          31972 non-null  object 
 7   reviews_url   32133 non-null  object 
 8   specs         31465 non-null  object 
 9   price         30758 non-null  object 
 10  early_access  32135 non-null  float64
 11  id            32133 non-null  float64
 12  developer     28836 non-null  object 
dtypes: float64(2), object(11)
memory usage: 3.4+ MB


In [5]:

print(data_games.isnull().sum())

publisher       8052
genres          3283
app_name           2
title           2050
url                0
release_date    2067
tags             163
reviews_url        2
specs            670
price           1377
early_access       0
id                 2
developer       3299
dtype: int64
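
Raw counts are easier to compare as percentages: here `publisher` is missing in 8052 of 32135 rows, roughly 25%. A minimal sketch of that calculation on a toy frame (made-up values) could look like this:

```python
import numpy as np
import pandas as pd

# Toy data standing in for data_games
toy = pd.DataFrame({
    "publisher": ["Kotoshiro", None, None, "Poolians.com"],
    "price": [4.99, 0.99, np.nan, 2.99],
})

# isnull().mean() gives the fraction of nulls per column; scale to percent and sort
null_share = toy.isnull().mean().mul(100).round(1).sort_values(ascending=False)

print(null_share)
```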


In [6]:
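# Fill missing early_access values with the mode (the column has no nulls at this point, so nothing changes)
data_games['early_access'] = data_games['early_access'].fillna(data_games['early_access'].mode().iloc[0])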


In [7]:
data_games.head(6)



Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
88310,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,0.0,761140.0,Kotoshiro
88311,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,0.0,643980.0,Secret Level SRL
88312,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,0.0,670290.0,Poolians.com
88313,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,0.0,767400.0,彼岸领域
88314,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,0.0,773570.0,
88315,Trickjump Games Ltd,"[Action, Adventure, Simulation]",Battle Royale Trainer,Battle Royale Trainer,http://store.steampowered.com/app/772540/Battl...,2018-01-04,"[Action, Adventure, Simulation, FPS, Shooter, ...",http://steamcommunity.com/app/772540/reviews/?...,"[Single-player, Steam Achievements]",3.99,0.0,772540.0,Trickjump Games Ltd


## 3) DATA CLEANING AND PROCESSING

In [8]:
data_games.replace(['', 'null', 'None'], np.nan, inplace=True)
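
Placeholder strings such as `''`, `'null'`, and `'None'` are not counted by `isnull()` until they are replaced with `NaN`. That is consistent with the `publisher` null count rising from 8052 to 8061 after this step. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy column mixing a real value, placeholder strings, and a true null
toy = pd.DataFrame({"publisher": ["Kotoshiro", "", "None", "null", None]})

# Only the true null is counted before the replacement
before = toy["publisher"].isnull().sum()

# Turn the placeholder strings into NaN so they are counted as missing
cleaned = toy.replace(["", "null", "None"], np.nan)
after = cleaned["publisher"].isnull().sum()

print(before, after)
```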

In [9]:
data_games.rename(columns={'app_name': 'name', 'id': 'item_id'}, inplace=True)

In [10]:
data_games[['publisher', 'developer']].isnull().sum()

publisher    8061
developer    3299
dtype: int64

In [11]:
# Show the rows where item_id is missing
data_games[data_games['item_id'].isna()]

Unnamed: 0,publisher,genres,name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer
88384,,,,,http://store.steampowered.com/,,,,,19.99,0.0,,
119271,"Warner Bros. Interactive Entertainment, Feral ...","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,Batman: Arkham City - Game of the Year Edition,http://store.steampowered.com/app/200260,2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...",,"[Single-player, Steam Achievements, Steam Trad...",19.99,0.0,,"Rocksteady Studios,Feral Interactive (Mac)"


In [12]:
# Drop the rows where item_id is null
data_games = data_games.dropna(subset=['item_id'])

In [13]:
# Inspect the unique values in the 'release_date' column
unique_release_dates = data_games['release_date'].unique()

for value in unique_release_dates:
    print(value)

2018-01-04
2017-07-24
2017-12-07
None
Soon..
2018-01-03
2017-12-22
2017-12-23
1997-06-30
1998-11-08
2016-11-25
2018-01-01
2017-12-30
2006-07-06
2006-07-11
2017
Beta测试已开启
2017-12-29
2018-03-30
2005-08-09
2006-09-29
2006-11-20
2006-11-29
2006-11-24
2006-12-14
2006-12-19
2003-08-23
2006-12-21
2006-04-17
2006-08-01
2005-07-12
2007-06-26
2006-07-24
2002-11-12
2000-11-17
2003-10-23
2006-10-16
1998-10-31
2007-06-01
1995-04-01
2007-06-05
2006-05-23
2006-10-17
1996-06-17
2007-06-29
2006-12-20
1995-04-30
1995-06-01
1994-05-05
1994-08-03
2001-11-20
1997-02-28
1993-10-10
2003-09-09
2007-08-03
2007-05-31
2007-03-20
2006-01-01
2007-08-21
2006-10-09
2004-09-22
2006-06-26
2007-12-14
1998-06-30
1999-10-25
2004-04-20
2003-07-01
2003-10-14
2006-04-07
2001-07-25
2008-10-28
2007-10-10
2007-10-23
2007-08-01
2007-09-01
2007-12-03
2008-02-08
2007-05-08
2006-02-21
2006-05-02
2007-10-16
2006-05-30
2006-10-26
2005-03-14
2003-02-03
1998-04-30
2008-03-21
2003-02-18
2004-03-23
2000-10-25
2006-12-12
2005-03-15
2008-

In [14]:
# Convert the release_date column to datetime; values that cannot be parsed become NaT
data_games['release_date']=pd.to_datetime(data_games['release_date'], errors='coerce', exact=False)
# Only the year is needed for our analysis, so we add a new column called year
data_games['year'] = data_games['release_date'].dt.year.astype('Int64')
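
With `errors='coerce'`, entries such as `Soon..` do not raise an error; they become `NaT`, and the derived year becomes `<NA>` under the nullable `Int64` dtype. The sketch below uses a few sample values taken from the listing above and omits `exact=False` for brevity.

```python
import pandas as pd

# A parseable date, a free-text placeholder, and a missing value
raw = pd.Series(["2018-01-04", "Soon..", None])

# Unparseable entries are coerced to NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

# Int64 (capital I) is the nullable integer dtype, so missing years stay missing
years = parsed.dt.year.astype("Int64")

print(years)
```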



In [15]:
data_games.head()

Unnamed: 0,publisher,genres,name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer,year
88310,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,0.0,761140.0,Kotoshiro,2018.0
88311,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,0.0,643980.0,Secret Level SRL,2018.0
88312,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,0.0,670290.0,Poolians.com,2017.0
88313,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,0.0,767400.0,彼岸领域,2017.0
88314,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,NaT,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,0.0,773570.0,,


In [16]:
# Fill missing developer values with the publisher, when the publisher is known
data_games.loc[data_games['developer'].isnull() & ~data_games['publisher'].isnull(), 'developer'] = data_games['publisher']
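
The same fill can also be written with `fillna`, which aligns the two columns by index. This is an alternative sketch on made-up rows, not the form used in the cell above.

```python
import pandas as pd

# Toy rows: developer known, developer missing but publisher known, both missing
toy = pd.DataFrame({
    "publisher": ["Kotoshiro", "Making Fun, Inc.", None],
    "developer": ["Kotoshiro", None, None],
})

# Take the publisher wherever developer is null; rows where both are null stay null
toy["developer"] = toy["developer"].fillna(toy["publisher"])

print(toy)
```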

In [17]:
# Count the remaining missing values
data_games[['publisher', 'developer']].isnull().sum()

publisher    8060
developer    3233
dtype: int64

In [18]:
data_games.head()

Unnamed: 0,publisher,genres,name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer,year
88310,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,0.0,761140.0,Kotoshiro,2018.0
88311,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,0.0,643980.0,Secret Level SRL,2018.0
88312,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,0.0,670290.0,Poolians.com,2017.0
88313,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,0.0,767400.0,彼岸领域,2017.0
88314,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,NaT,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,0.0,773570.0,,


In [19]:
# List the unique values in the publisher column
unique_values = data_games['publisher'].unique()
unique_values

array(['Kotoshiro', 'Making Fun, Inc.', 'Poolians.com', ...,
       'OrtiGames/OrtiSoft', 'INGAME', 'Bidoniera Games'], dtype=object)

In [20]:
# Check how many publisher values are missing
null_counts = data_games['publisher'].isna().value_counts()
null_counts

publisher
False    24073
True      8060
Name: count, dtype: int64

In [21]:
# Fill missing publisher values with the mode (the most frequent string)
data_games['publisher'] = data_games['publisher'].fillna(data_games['publisher'].mode().iloc[0])

In [22]:
data_games.drop(['title','url','reviews_url','early_access','publisher','release_date'], axis=1, inplace=True)

In [23]:
# Set an 80% threshold for deciding which columns to drop because of null values
umbral_nulos = 0.8

# Select the columns whose share of nulls exceeds the threshold
columnas_a_eliminar = data_games.columns[data_games.isnull().mean() > umbral_nulos]

# Print the columns that exceed the threshold (candidates for removal)
print(f"Columns with more than {umbral_nulos * 100}% null values (candidates for removal):\n{columnas_a_eliminar}")

Columns with more than 80.0% null values (candidates for removal):
Index([], dtype='object')
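
No column in this dataset crosses the threshold, so nothing is dropped. To illustrate what the filter would do otherwise, here is a sketch on a toy frame (made-up columns) in which one column is entirely null:

```python
import numpy as np
import pandas as pd

# Toy data: "mostly_empty" is 60% null, "sparse" is 100% null
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, 2.0],
    "sparse": [np.nan] * 5,
})

null_threshold = 0.8
null_fraction = toy.isnull().mean()

# Keep only the column names whose null fraction exceeds the threshold
to_drop = null_fraction[null_fraction > null_threshold].index.tolist()
trimmed = toy.drop(columns=to_drop)

print(to_drop)
print(list(trimmed.columns))
```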


In [24]:
# Fill the remaining missing years with the median year
data_games['year'] = data_games['year'].fillna(data_games['year'].median())
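
Because `year` uses the nullable `Int64` dtype, a median that falls between two years (for example 2016.5) would not be a valid integer fill value and may be rejected by pandas. One hedged way to guard against that is to cast the median to an integer first; truncation is an arbitrary choice made for this sketch.

```python
import pandas as pd

# Toy year column with one missing value
years = pd.Series([2016, 2017, 2018, pd.NA], dtype="Int64")

# The median is computed over non-missing values; int() truncates a fractional median
median_year = int(years.median())

filled = years.fillna(median_year)

print(filled)
```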

In [25]:
# Convert 'price' to numeric; non-numeric labels such as 'Free To Play' become NaN
data_games['price'] = pd.to_numeric(data_games['price'], errors='coerce')
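
Coercion makes free games indistinguishable from games whose price is genuinely missing. If later analysis needs that distinction, one option is to map labels that start with "free" to `0.0`. Treating those labels as a zero price is an assumption of this sketch, not something the notebook does.

```python
import pandas as pd

# Toy price column with numbers, numeric strings, free-to-play labels, and a null
prices = pd.Series([4.99, "Free To Play", "Free to Play", "0.99", None], dtype=object)

# Same coercion as in the notebook: anything non-numeric becomes NaN
coerced = pd.to_numeric(prices, errors="coerce")

# Assumption: labels starting with "free" (case-insensitive) mean a price of zero
is_free = prices.astype(str).str.lower().str.startswith("free")
with_free_as_zero = coerced.mask(is_free & coerced.isna(), 0.0)

print(coerced.isna().sum(), with_free_as_zero.isna().sum())
```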

## Checking the cleaned dataset and saving it as CSV

In [32]:
# Summarize each column of the DataFrame to check that it has been cleaned
for col in data_games.columns:
    print(f"Column: {col}")

    # Frequency of each distinct value
    print(data_games[col].value_counts())

    # Count the missing values
    print(f"Missing values: {data_games[col].isnull().sum()}")

Column: genres
genres
[Action]                                                          1880
[Action, Indie]                                                   1650
[Simulation]                                                      1396
[Casual, Simulation]                                              1359
[Action, Adventure, Indie]                                        1082
                                                                  ... 
[Action, Adventure, Racing, Simulation, Strategy]                    1
[Action, Adventure, Casual, Indie, Racing, Sports, Strategy]         1
[Action, Adventure, Casual, Indie, Racing, Simulation, Sports]       1
[Action, Massively Multiplayer, RPG, Strategy]                       1
[Adventure, Casual, RPG, Simulation, Early Access]                   1
Name: count, Length: 883, dtype: int64
Missing values: 3282
Column: name
name
Soundtrack                                        3
Luna                                              2
Ultimate A

In [29]:
# Save the files in our project folder
data_games.to_csv('steam_games_cleaned.csv', index=False)
data_games.to_json('steam_games_cleaned.json', orient='records', lines=True)
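
One thing to keep in mind when reading the CSV back: list-valued columns such as `genres` are written as their text representation, so they come back as strings. The sketch below round-trips a toy frame through an in-memory buffer (instead of a file) and restores the lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, like genres in data_games
toy = pd.DataFrame({"name": ["Lost Summoner Kitty"], "genres": [["Action", "Indie"]]})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip the list is plain text
genres_as_text = reloaded.loc[0, "genres"]

# Parse the text back into a list; non-string values (e.g. NaN) are left untouched
reloaded["genres"] = reloaded["genres"].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else value
)

print(type(genres_as_text).__name__, reloaded.loc[0, "genres"])
```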


In [30]:
data_games.to_parquet('steam_games_cleaned.parquet', index=False)