# Extraction, Transformation, and Load (ETL) Process
This section walks through the Extraction, Transformation, and Load (ETL) process for the data on games available on the Steam platform. The data covers attributes such as genre, developer, and price, among others.

The main goal of this process is to leave the data ready for analysis. We start by importing the required libraries, which must already be installed for the notebook to run.

Throughout the report we identify and resolve problems found in the data, applying cleaning and preprocessing techniques to ensure quality and consistency. Finally, we store the transformed data in a format suitable for later exploration and analysis.

In [1]:
import pandas as pd
import numpy as np
import json

In [3]:
# LOAD THE ORIGINAL FILES
# Path to the JSON file
nombre_archivo = "C:\\Users\\Gary Alexander Bean\\Desktop\\output_steam_games .json"

# Open the file in read mode ('r') with UTF-8 encoding
with open(nombre_archivo, 'r', encoding='utf-8') as archivo:
    # Read every line of the file
    lineas = archivo.readlines()

# Parse each line as an individual JSON object
objetos_json = [json.loads(line) for line in lineas]

# Build the DataFrame
df_SteamGames = pd.DataFrame(objetos_json)
# EXPLORE AND UNDERSTAND THE DATA
df_SteamGames

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
120440,Ghost_RUS Games,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,Colony On Mars,http://store.steampowered.com/app/773640/Colon...,2018-01-04,"[Strategy, Indie, Casual, Simulation]",http://steamcommunity.com/app/773640/reviews/?...,"[Single-player, Steam Achievements]",1.99,False,773640,"Nikita ""Ghost_RUS"""
120441,Sacada,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGis...,2018-01-04,"[Strategy, Indie, Casual]",http://steamcommunity.com/app/733530/reviews/?...,"[Single-player, Steam Achievements, Steam Clou...",4.99,False,733530,Sacada
120442,Laush Studio,"[Indie, Racing, Simulation]",Russian Roads,Russian Roads,http://store.steampowered.com/app/610660/Russi...,2018-01-04,"[Indie, Simulation, Racing]",http://steamcommunity.com/app/610660/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",1.99,False,610660,Laush Dmitriy Sergeevich
120443,SIXNAILS,"[Casual, Indie]",EXIT 2 - Directions,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_...,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",http://steamcommunity.com/app/658870/reviews/?...,"[Single-player, Steam Achievements, Steam Cloud]",4.99,False,658870,"xropi,stev3ns"
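
The loading cell reads the whole file into memory and parses every line. As a minimal sketch of the same idea, the snippet below streams the lines and skips blank ones. The helper name `read_json_lines` and the in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import io
import json

import pandas as pd


def read_json_lines(handle):
    """Yield one parsed object per non-blank line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines instead of failing on them
            yield json.loads(line)


# Hypothetical two-record sample standing in for the real file
sample = io.StringIO('{"id": "1", "price": 4.99}\n\n{"id": "2", "price": null}\n')
df_sample = pd.DataFrame(list(read_json_lines(sample)))
print(df_sample.shape)
```

For a real file, the same generator could be fed from `open(nombre_archivo, 'r', encoding='utf-8')`.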


In [4]:
print('\nGeneral information about the DataFrame')
df_SteamGames.info()


General information about the DataFrame
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120445 entries, 0 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   publisher     24083 non-null  object
 1   genres        28852 non-null  object
 2   app_name      32133 non-null  object
 3   title         30085 non-null  object
 4   url           32135 non-null  object
 5   release_date  30068 non-null  object
 6   tags          31972 non-null  object
 7   reviews_url   32133 non-null  object
 8   specs         31465 non-null  object
 9   price         30758 non-null  object
 10  early_access  32135 non-null  object
 11  id            32133 non-null  object
 12  developer     28836 non-null  object
dtypes: object(13)
memory usage: 11.9+ MB


Every column has at least 88,310 nulls because the file contains a large number of completely empty rows: of 120,445 entries, no column has more than 32,135 non-null values. 'url' and 'early_access' reach that maximum, which suggests they are complete for every real record, while 'app_name', 'reviews_url', and 'id' are missing in only two of them. Columns such as 'genres', 'tags', and 'specs' appear to hold lists of categories, which can be useful for organizing and classifying the games.

The 'release_date' column appears to contain dates, so it should be converted to a datetime type to make time-based analysis easier. The 'early_access' column, currently of type 'object', could be converted to a boolean type (True or False) indicating whether the game is in early access.

The 'price' column could also be numeric, which would allow calculations and analysis on game prices. These type conversions matter because they keep the data correctly structured and consistent for later analysis.

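The conversions discussed above can be sketched on a small sample. The three rows below are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical three-row sample mimicking the raw 'object' columns
sample = pd.DataFrame({
    "release_date": ["2018-01-04", "2017-07-24", "not a date"],
    "price": [4.99, "Free To Play", 0.99],
    "early_access": pd.Series([False, True, False], dtype=object),
})

converted = sample.assign(
    # Unparseable dates become NaT instead of raising
    release_date=pd.to_datetime(sample["release_date"], errors="coerce"),
    # Non-numeric prices become NaN instead of raising
    price=pd.to_numeric(sample["price"], errors="coerce"),
    # Nullable boolean dtype
    early_access=sample["early_access"].astype("boolean"),
)
print(converted.dtypes)
```
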
In [5]:
# CLEAN AND PREPROCESS THE DATA

# Drop rows in which every value is null
df_SteamGames = df_SteamGames.dropna(how='all').reset_index(drop=True)
df_SteamGames.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,2018-01-04,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,False,761140,Kotoshiro
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,2018-01-04,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,False,643980,Secret Level SRL
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,2017-07-24,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,False,670290,Poolians.com
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,2017-12-07,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,False,767400,彼岸领域
4,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,False,773570,
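
The choice of `how='all'` matters here: it removes only the rows that are empty in every column, whereas `how='any'` would also discard partially filled records such as row 4 above. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one complete row, one empty row, one partial row
sample = pd.DataFrame({
    "name": ["Game A", np.nan, "Game B"],
    "price": [4.99, np.nan, np.nan],
})

only_empty_removed = sample.dropna(how="all").reset_index(drop=True)
any_null_removed = sample.dropna(how="any").reset_index(drop=True)
print(len(only_empty_removed), len(any_null_removed))
```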


In [6]:
# Replace empty strings, 'null', and 'None' with NaN across the whole DataFrame
df_SteamGames.replace(['', 'null', 'None'], np.nan, inplace=True)

# Rename columns
df_SteamGames.rename(columns={'app_name': 'name', 'id': 'item_id'}, inplace=True)
# Count null values in the 'publisher' and 'developer' columns
print(df_SteamGames[['publisher', 'developer']].isnull().sum())

# Count null values in the 'name' and 'title' columns
print(df_SteamGames[['name', 'title']].isnull().sum())

publisher    8061
developer    3299
dtype: int64
name        2
title    2050
dtype: int64


In [7]:
valores_nulos_por_columna = df_SteamGames.isnull().sum()

# Show the results
print(valores_nulos_por_columna)

publisher       8061
genres          3283
name               2
title           2050
url                0
release_date    2067
tags             163
reviews_url        2
specs            670
price           1377
early_access       0
item_id            2
developer       3299
dtype: int64
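
Raw counts are easier to compare when shown next to the share of rows they represent. The helper below is a hypothetical sketch (the name `null_summary` and the sample data are assumptions) that adds a percentage column and sorts by the number of nulls.

```python
import numpy as np
import pandas as pd


def null_summary(df):
    """Return null counts and percentages per column, most incomplete first."""
    nulls = df.isnull().sum()
    pct = (nulls / len(df) * 100).round(2)
    return pd.DataFrame({"nulls": nulls, "pct": pct}).sort_values(
        "nulls", ascending=False
    )


# Hypothetical four-row sample
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "publisher": ["P1", np.nan, np.nan, "P2"],
})
summary = null_summary(sample)
print(summary)
```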


In [8]:
# Show the records where item_id is null
df_SteamGames[df_SteamGames['item_id'].isna()]

Unnamed: 0,publisher,genres,name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer
74,,,,,http://store.steampowered.com/,,,,,19.99,False,,
30961,"Warner Bros. Interactive Entertainment, Feral ...","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,Batman: Arkham City - Game of the Year Edition,http://store.steampowered.com/app/200260,2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...",,"[Single-player, Steam Achievements, Steam Trad...",19.99,False,,"Rocksteady Studios,Feral Interactive (Mac)"


In [9]:
# Drop the rows where item_id is null
df_SteamGames = df_SteamGames.dropna(subset=['item_id'])

# Show the records where 'name' is null
df_SteamGames[df_SteamGames['name'].isna()]

Unnamed: 0,publisher,genres,name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer
2580,,"[Action, Indie]",,,http://store.steampowered.com/app/317160/_/,2014-08-26,"[Action, Indie]",http://steamcommunity.com/app/317160/reviews/?...,"[Single-player, Game demo]",,False,317160,


In [10]:
# Drop the 'title' column: 'name' holds the same values and has only one null.
df_SteamGames.drop(['title'], axis=1, inplace=True)

# Convert 'release_date' to datetime; unparseable values become NaT
df_SteamGames['release_date'] = pd.to_datetime(df_SteamGames['release_date'], errors='coerce', exact=False)

# Create the 'year' column from 'release_date'
df_SteamGames['year'] = df_SteamGames['release_date'].dt.year.astype('Int64')



# Print the number of nulls in 'year' after the conversion
print("Nulls after the conversion to datetime:")
print(df_SteamGames['year'].isnull().sum())

Nulls after the conversion to datetime:
2351


In [11]:
# Fill the remaining nulls in 'year' with the median year (no interpolation is applied beforehand).
df_SteamGames['year'] = df_SteamGames['year'].fillna(df_SteamGames['year'].median())

# Check again whether 'year' still contains nulls
hay_nulos_en_year = df_SteamGames['year'].isnull().any()

if hay_nulos_en_year:
    print("There are null values in the 'year' column.")
else:
    print("There are no null values in the 'year' column.")

There are no null values in the 'year' column.

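Median imputation removes the nulls but also hides which years were imputed. A minimal sketch, on made-up values, that records a flag before filling and truncates the median to an integer so that it fits the nullable integer column:

```python
import pandas as pd

# Hypothetical sample of release years with two missing values
years = pd.Series([2016, 2017, pd.NA, 2018, pd.NA], dtype="Int64")

# Remember which entries were missing before imputing
was_missing = years.isna()

# The median skips missing values; int() keeps the fill value integral
median_year = int(years.median())
filled = years.fillna(median_year)
print(filled.tolist(), int(was_missing.sum()))
```

Keeping `was_missing` as an extra column is one way to let later analysis exclude or down-weight imputed years; whether that is needed depends on how 'year' is used downstream.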

In [12]:
# Copy 'publisher' into 'developer' on the rows where 'developer' is null and 'publisher' is not.
df_SteamGames.loc[df_SteamGames['developer'].isnull() & ~df_SteamGames['publisher'].isnull(), 'developer'] = df_SteamGames['publisher']
# Drop 'url', 'reviews_url', 'early_access', 'publisher', and 'release_date', which are considered irrelevant ('title' was already dropped earlier)

df_SteamGames.drop(['url','reviews_url','early_access','publisher','release_date'], axis=1, inplace=True)

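The publisher-to-developer fill can also be written with `Series.fillna`, which accepts another Series and aligns it by index. This is a sketch on made-up rows, offered as an equivalent formulation rather than a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: developer known, developer missing, both missing
sample = pd.DataFrame({
    "publisher": ["Pub A", "Pub B", np.nan],
    "developer": ["Dev A", np.nan, np.nan],
})

# Missing developers take the publisher from the same row; existing values are kept
sample["developer"] = sample["developer"].fillna(sample["publisher"])
print(sample["developer"].tolist())
```
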
In [13]:
df_SteamGames.head()

Unnamed: 0,genres,name,tags,specs,price,item_id,developer,year
0,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,761140,Kotoshiro,2018
1,"[Free to Play, Indie, RPG, Strategy]",Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",Free To Play,643980,Secret Level SRL,2018
2,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",Free to Play,670290,Poolians.com,2017
3,"[Action, Adventure, Casual]",弹炸人2222,"[Action, Adventure, Casual]",[Single-player],0.99,767400,彼岸领域,2017
4,,Log Challenge,"[Action, Indie, Casual, Sports]","[Single-player, Full controller support, HTC V...",2.99,773570,,2016


In [14]:
# Convert 'price' to numeric; non-numeric entries such as 'Free To Play' become NaN
df_SteamGames['price'] = pd.to_numeric(df_SteamGames['price'], errors='coerce')
df_SteamGames.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32133 entries, 0 to 32134
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   genres     28851 non-null  object 
 1   name       32132 non-null  object 
 2   tags       31971 non-null  object 
 3   specs      31464 non-null  object 
 4   price      28846 non-null  float64
 5   item_id    32133 non-null  object 
 6   developer  28900 non-null  object 
 7   year       32133 non-null  Int64  
dtypes: Int64(1), float64(1), object(6)
memory usage: 3.2+ MB
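
With `errors='coerce'`, free games end up indistinguishable from games whose price is genuinely unknown. If that distinction matters, one possible alternative is to map the free-to-play labels to 0.0 before coercing. The sketch below assumes that "Free To Play" and "Free to Play" (the two spellings visible in the sample rows) are the labels of interest; the other values are made up.

```python
import pandas as pd

# Hypothetical sample of raw prices
raw_price = pd.Series([4.99, "Free To Play", "Free to Play", "Unknown text", None])

# Case-insensitive match on the free-to-play label
is_free = (
    raw_price.astype("string").str.strip().str.lower().eq("free to play")
    .fillna(False)
    .astype(bool)
)

# Coerce everything to numbers, then set free games to 0.0
price = pd.to_numeric(raw_price, errors="coerce").mask(is_free, 0.0)
print(price.tolist())
```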


In [15]:
genres = df_SteamGames['genres']
tags = df_SteamGames['tags']
specs = df_SteamGames['specs']

print('Genres:', genres)
print('Tags:', tags)
print('Specs:', specs)

Genres: 0            [Action, Casual, Indie, Simulation, Strategy]
1                     [Free to Play, Indie, RPG, Strategy]
2        [Casual, Free to Play, Indie, Simulation, Sports]
3                              [Action, Adventure, Casual]
4                                                      NaN
                               ...                        
32130                [Casual, Indie, Simulation, Strategy]
32131                            [Casual, Indie, Strategy]
32132                          [Indie, Racing, Simulation]
32133                                      [Casual, Indie]
32134                                                  NaN
Name: genres, Length: 32133, dtype: object
Tags: 0            [Strategy, Action, Indie, Casual, Simulation]
1        [Free to Play, Strategy, Indie, RPG, Card Game...
2        [Free to Play, Simulation, Sports, Casual, Ind...
3                              [Action, Adventure, Casual]
4                          [Action, Indie, Casual, Sports]

In [16]:
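# Fill the nulls in 'genres' with the string '[]' so that every row can be expanded.
# Note: the string is split into the characters '[' and ']', which show up as two spurious genres in the results.
df_SteamGames['genres'] = df_SteamGames['genres'].fillna('[]')
# Expand the genre lists into columns, stack them, and one-hot encode them with get_dummies.
# tolist() discards the DataFrame index, so genres_df is numbered 0..n-1 by position.
genres_df = pd.DataFrame(df_SteamGames['genres'].tolist())
genres_df_obj = genres_df.stack()
genres_df1 = pd.get_dummies(genres_df_obj)
# Sum the dummies per row (axis=0 is the default, so the deprecated argument is omitted)
genres_df1 = genres_df1.groupby(level=[0]).sum()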


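As an alternative sketch of the same encoding, `Series.explode` followed by `get_dummies` keeps the original index labels and leaves rows without genres as all zeros, so no placeholder string is needed. The sample data and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample with non-consecutive labels, as after dropping rows
games = pd.DataFrame(
    {"name": ["A", "B", "C"], "genres": [["Action", "Indie"], None, ["Indie"]]},
    index=[0, 2, 5],
)

# One row per (game, genre); the index label of each game is repeated
exploded = games["genres"].explode()

# One-hot encode and collapse back to one row per game label
dummies = pd.get_dummies(exploded, dtype=int).groupby(level=0).sum()

# Join on index labels so every game receives its own genres
encoded = games.drop(columns="genres").join(dummies)
print(encoded)
```
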

In [17]:
# Compute the share of games in each genre to decide which ones to keep
genres_df1.sum().sort_values(ascending=False)/len(genres_df1)

Indie                        0.493511
Action                       0.352286
Casual                       0.257741
Adventure                    0.256496
Strategy                     0.216506
Simulation                   0.208477
RPG                          0.170510
]                            0.102138
[                            0.102138
Free to Play                 0.063206
Early Access                 0.045498
Sports                       0.039119
Massively Multiplayer        0.034482
Racing                       0.033704
Design &amp; Illustration    0.014316
Utilities                    0.010581
Web Publishing               0.008340
Animation &amp; Modeling     0.005695
Education                    0.003890
Video Production             0.003610
Software Training            0.003268
Audio Production             0.002894
Photo Editing                0.002396
Accounting                   0.000218
dtype: float64

In [18]:
# Select the genre columns to keep
genres = genres_df1[['Indie','Action','Casual','Adventure','Strategy','Simulation','RPG','Free to Play','Early Access','Sports','Massively Multiplayer','Racing','Design &amp; Illustration','Utilities']]
# Join the two DataFrames and drop the 'genres' column.
# Note: concat aligns on index labels; genres_df1 is numbered by position while df_SteamGames
# has gaps at 74 and 30961, so rows after the first gap receive the genres of a later game.
df_SteamGames = pd.concat([df_SteamGames, genres], axis=1)
df_SteamGames.drop(columns=['genres'], inplace=True)
# Drop the rows with null values in the 'Indie' column
df_SteamGames.dropna(subset='Indie', inplace=True)
# Cast 'item_id' to integer
df_SteamGames['item_id'] = df_SteamGames['item_id'].astype('Int64')
df_SteamGames

Unnamed: 0,name,tags,specs,price,item_id,developer,year,Indie,Action,Casual,...,Strategy,Simulation,RPG,Free to Play,Early Access,Sports,Massively Multiplayer,Racing,Design &amp; Illustration,Utilities
0,Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,761140,Kotoshiro,2018,1.0,1.0,1.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",,643980,Secret Level SRL,2018,1.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",,670290,Poolians.com,2017,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
3,弹炸人2222,"[Action, Adventure, Casual]",[Single-player],0.99,767400,彼岸领域,2017,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Log Challenge,"[Action, Indie, Casual, Sports]","[Single-player, Full controller support, HTC V...",2.99,773570,,2016,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32130,Colony On Mars,"[Strategy, Indie, Casual, Simulation]","[Single-player, Steam Achievements]",1.99,773640,"Nikita ""Ghost_RUS""",2018,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
32131,LOGistICAL: South Africa,"[Strategy, Indie, Casual]","[Single-player, Steam Achievements, Steam Clou...",4.99,733530,Sacada,2018,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32132,Russian Roads,"[Indie, Simulation, Racing]","[Single-player, Steam Achievements, Steam Trad...",1.99,610660,Laush Dmitriy Sergeevich,2018,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
74,,,,,,,,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
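
The row labeled 74 in the output above has genre values but no game data. A likely explanation is that `pd.concat(..., axis=1)` aligns on index labels rather than on position. The sketch below reproduces the effect on made-up frames and shows one possible way to avoid it by giving the second frame the labels of the first.

```python
import pandas as pd

# Hypothetical frames: 'left' lost label 2, 'right' is numbered by position
left = pd.DataFrame({"name": ["A", "B", "C"]}, index=[0, 1, 3])
right = pd.DataFrame({"Indie": [1, 0, 1]})

# Label-based alignment creates an extra row and shifts the last value
misaligned = pd.concat([left, right], axis=1)

# Reusing the labels of 'left' restores a row-by-row match
aligned = pd.concat([left, right.set_axis(left.index)], axis=1)
print(len(misaligned), len(aligned))
```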


In [19]:
# Drop every row in which item_id is null (labels 74 and 30961, brought back by the label-based concat)
df_SteamGames.dropna(subset=['item_id'], inplace=True)
df_SteamGames

Unnamed: 0,name,tags,specs,price,item_id,developer,year,Indie,Action,Casual,...,Strategy,Simulation,RPG,Free to Play,Early Access,Sports,Massively Multiplayer,Racing,Design &amp; Illustration,Utilities
0,Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,761140,Kotoshiro,2018,1.0,1.0,1.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",,643980,Secret Level SRL,2018,1.0,0.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",,670290,Poolians.com,2017,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
3,弹炸人2222,"[Action, Adventure, Casual]",[Single-player],0.99,767400,彼岸领域,2017,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Log Challenge,"[Action, Indie, Casual, Sports]","[Single-player, Full controller support, HTC V...",2.99,773570,,2016,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32128,BAE 2,"[Indie, Casual]",[Single-player],,769330,Riviysky,2018,1.0,0.0,1.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32129,Kebab it Up!,"[Action, Indie, Casual, Violent, Adventure]","[Single-player, Steam Achievements, Steam Cloud]",1.99,745400,Bidoniera Games,2018,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32130,Colony On Mars,"[Strategy, Indie, Casual, Simulation]","[Single-player, Steam Achievements]",1.99,773640,"Nikita ""Ghost_RUS""",2018,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
32131,LOGistICAL: South Africa,"[Strategy, Indie, Casual]","[Single-player, Steam Achievements, Steam Clou...",4.99,733530,Sacada,2018,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
df_SteamGames.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32131 entries, 0 to 32132
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   name                       32130 non-null  object 
 1   tags                       31969 non-null  object 
 2   specs                      31462 non-null  object 
 3   price                      28844 non-null  float64
 4   item_id                    32131 non-null  Int64  
 5   developer                  28899 non-null  object 
 6   year                       32131 non-null  Int64  
 7   Indie                      32131 non-null  float64
 8   Action                     32131 non-null  float64
 9   Casual                     32131 non-null  float64
 10  Adventure                  32131 non-null  float64
 11  Strategy                   32131 non-null  float64
 12  Simulation                 32131 non-null  float64
 13  RPG                        32131 non-null  float64


In [21]:
# Save the file in Parquet format
df_SteamGames.to_parquet('steam_games_limpio.parquet', index=False)