### **ETL process (Extract, Transform & Load) - Steam video game platform**
#### **DATA: steam_games**

We start by understanding, transforming, and making the data available: we identify problems in the data and apply cleaning and preprocessing techniques. Finally, we store the transformed data for future exploration.  

We also review the context and the information required for our analysis, and make the data consistent.

**Note: the following libraries must be installed in the environment:**

*`pip install pandas numpy gdown`*

Writing the Parquet file in step 4 additionally requires a Parquet engine such as `pyarrow`.


### **Import the required libraries**

In [34]:

import numpy as np # arrays and matrices

import pandas as pd # tabular data analysis

import os # interaction with the operating system

import gdown # downloads files from Google Drive by ID

import json # JSON data-interchange format; part of the standard library, no installation needed

import warnings 
warnings.filterwarnings("ignore") # silences all warnings raised while the notebook runs

import ast # safely parses Python literals stored as strings (e.g. lists), via ast.literal_eval

### **1. Create the loading function for the .json file and convert the data into a DataFrame using pandas**

In [35]:
def dowload_json(url, file_name):

    ''' 
    Downloads the JSON file at the given URL, saves it locally,
    and reads it into a pandas DataFrame.

    Parameters:
        - url (str): URL of the file stored in Google Drive
        - file_name (str): Local name given to the file

    Returns:
        - df (pd.DataFrame): pandas DataFrame created from the JSON file
    '''

    # check whether the file already exists
    if not os.path.exists(file_name):
        # if it does not exist, download it from Google Drive with gdown.download
        gdown.download(url, file_name, quiet=False)

    # read the JSON file into a pandas DataFrame
    df = pd.read_json(file_name, lines=True)

    return df

In [36]:
# assign the URL to a variable and set the local file name in the variable output
url = 'https://drive.google.com/uc?export=download&id=1ga6pEOIlTOIrt0hvHUBwBC4CGDPenN29'
output = 'steam_games.json'


# call the function we created
games_df = dowload_json(url, output)
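
As a minimal sketch of what `pd.read_json(..., lines=True)` does inside the function, the snippet below parses a small in-memory JSON Lines sample. The sample records and the names `sample` and `sample_df` are illustrative assumptions, not part of the dataset.

```python
import io

import pandas as pd

# Hypothetical JSON Lines sample: one JSON object per line.
sample = (
    '{"id": 761140, "app_name": "Lost Summoner Kitty"}\n'
    '{"id": 643980, "app_name": "Ironbound"}\n'
)

# lines=True tells pandas to parse each line as a separate record.
sample_df = pd.read_json(io.StringIO(sample), lines=True)
print(sample_df.shape)
```

Each line becomes one row and each key becomes a column, which is why the real file loads as a single wide table.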

### **2. First look at the data and its structure**

In [37]:
# quick view of the table

print('DataFrame - First & Last Rows:')
games_df

DataFrame - First & Last Rows:


Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
120440,Ghost_RUS Games,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,Colony On Mars,http://store.steampowered.com/app/773640/Colon...,2018-01-04,"[Strategy, Indie, Casual, Simulation]",http://steamcommunity.com/app/773640/reviews/?...,"[Single-player, Steam Achievements]",1.99,0.0,773640.0,"Nikita ""Ghost_RUS"""
120441,Sacada,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGis...,2018-01-04,"[Strategy, Indie, Casual]",http://steamcommunity.com/app/733530/reviews/?...,"[Single-player, Steam Achievements, Steam Clou...",4.99,0.0,733530.0,Sacada
120442,Laush Studio,"[Indie, Racing, Simulation]",Russian Roads,Russian Roads,http://store.steampowered.com/app/610660/Russi...,2018-01-04,"[Indie, Simulation, Racing]",http://steamcommunity.com/app/610660/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",1.99,0.0,610660.0,Laush Dmitriy Sergeevich
120443,SIXNAILS,"[Casual, Indie]",EXIT 2 - Directions,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_...,2017-09-02,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",http://steamcommunity.com/app/658870/reviews/?...,"[Single-player, Steam Achievements, Steam Cloud]",4.99,0.0,658870.0,"xropi,stev3ns"


### **# Analysis**
1. Examining the 'Games' DataFrame, we see that all of the initial rows contain NaN (null/None) values in every column.

2. The data source or the loading process needs to be investigated to make sure it ran correctly.

3. Columns **['genres', 'tags', 'specs']** contain nested lists in JSON format.

4. We proceed by opening the file in read mode with UTF-8 encoding.

5. Approximately 73% of the full dataset contains null values in every column.
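
One way to inspect the raw file, as mentioned in point 4, is to read it line by line with an explicit UTF-8 encoding. The sketch below writes a tiny hypothetical file and assumes every line is a valid JSON object, which may not hold for the real `steam_games.json`; the file name `sample_games.json` and all variable names are illustrative.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample lines; the second record is an empty object.
sample_lines = [
    '{"id": 761140, "app_name": "Lost Summoner Kitty"}',
    '{}',
    '{"id": 767400, "app_name": "弹炸人2222"}',
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_games.json")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(sample_lines))

    # Open in read mode with UTF-8 so non-ASCII titles decode correctly.
    with open(path, "r", encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]

sample_df = pd.DataFrame(records)

# An empty record turns into a row that is null in every column.
empty_rows = int(sample_df.isnull().all(axis=1).sum())
print(empty_rows)
```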
   

In [38]:
# Rows that are null in every column

print(f'total rows in games_df: {games_df.shape[0]}')
print(f'total rows in games_df with nulls in every column: {games_df.isnull().all(axis=1).sum()}')


total rows in games_df: 120445
total rows in games_df with nulls in every column: 88310
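
The two counts above are enough to check the share of fully empty rows quoted in the analysis. This is plain arithmetic on the printed values; the variable names are illustrative.

```python
total_rows = 120445     # rows reported by games_df.shape[0]
all_null_rows = 88310   # rows that are null in every column

share_null = all_null_rows / total_rows
remaining_rows = total_rows - all_null_rows

print(f"{share_null:.1%}")  # share of rows with no data at all
print(remaining_rows)       # rows that carry at least one value
```

The share comes out at roughly 73.3%, and 32,135 rows carry at least one value.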


In [39]:
# Relevant info - data types
print(' DF Games - General Insights:')
games_df.info()

 DF Games - General Insights:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120445 entries, 0 to 120444
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   publisher     24083 non-null  object 
 1   genres        28852 non-null  object 
 2   app_name      32133 non-null  object 
 3   title         30085 non-null  object 
 4   url           32135 non-null  object 
 5   release_date  30068 non-null  object 
 6   tags          31972 non-null  object 
 7   reviews_url   32133 non-null  object 
 8   specs         31465 non-null  object 
 9   price         30758 non-null  object 
 10  early_access  32135 non-null  float64
 11  id            32133 non-null  float64
 12  developer     28836 non-null  object 
dtypes: float64(2), object(11)
memory usage: 11.9+ MB


### **# Analysis**
1. Ignoring the rows that are entirely null (32,135 rows have data), columns **['url', 'early_access']** have no nulls, and **['app_name', 'reviews_url']** are missing only 2 values each.

2. Columns such as **['genres', 'tags', 'specs']** contain lists of nested categorical data.

3. Column **'release_date'** holds dates stored as 'object'; we will convert them to 'datetime'.

4. Column **'early_access'** could be converted from 'float64' to the Boolean (bool) data type.
   - 1 for yes, 0 for no
   - This column is not relevant to our analysis, so we will drop it.
   
5. Column **'price'** could be converted to the 'float' data type.
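
Although 'early_access' is dropped later, the conversion described in point 4 could look like the sketch below. It uses pandas' nullable `boolean` dtype so that missing values stay missing; the toy series `flags` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'early_access' column: 1.0 = yes, 0.0 = no.
flags = pd.Series([0.0, 1.0, np.nan], name="early_access")

# The nullable 'boolean' dtype maps 0/1 to False/True and keeps NaN as <NA>.
flags_bool = flags.astype("boolean")
print(flags_bool)
```

A plain `astype(bool)` would instead turn NaN into `True`, so it is only safe once the nulls have been handled.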


### **3. DF Cleaning and Pre-Processing Phase**
1. cleaning of duplicate values
2. cleaning of null values
3. identification and removal of unnecessary columns

In [40]:
# DF shape before cleaning
print (f'Full DF shape: {games_df.shape}')


# drop rows whose cells are all null
games_df = games_df.dropna(how= 'all').reset_index(drop= True)
print (f'DF shape after dropping rows whose cells are all null: {games_df.shape}')


Full DF shape: (120445, 13)
DF shape after dropping rows whose cells are all null: (32135, 13)


In [41]:
# replace empty strings, 'null', and 'None' with NaN across the whole DF
games_df.replace(['', 'null', 'None'], np.nan, inplace= True)

# rename some columns
games_df.rename(columns={'app_name': 'name', 'id': 'item_id'}, inplace=True)

# count the nulls in the columns ['publisher', 'developer']
games_df[['publisher', 'developer']].isnull().sum()

publisher    8061
developer    3299
dtype: int64

### **# Analysis**
1. The **'developer'** and **'publisher'** columns share many values, but they also show discrepancies and some null values.
   
    1.  'developer' column: 3,299 null values
   
    2.  'publisher' column: 8,061 null values
   
For this reason, we copy the value of 'publisher' into 'developer' only when 'developer' is empty or null and 'publisher' has a value.
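
The rule above can be sketched on a toy table. This version uses `Series.fillna` with another column, which is one possible way to express it and not necessarily the form used later in the notebook; the DataFrame `toy` and its values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "publisher": ["Kotoshiro", "Poolians.com", np.nan],
    "developer": ["Kotoshiro", np.nan, np.nan],
})

# Fill 'developer' from 'publisher' only where 'developer' is missing.
# Rows where both are missing stay NaN.
toy["developer"] = toy["developer"].fillna(toy["publisher"])
print(toy)
```

Existing 'developer' values are never overwritten, so discrepancies between the two columns are preserved.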

In [42]:
# show the records where 'item_id' is null or NaN; these records are not needed.

print(f'Number of records where "item_id" is NaN: {games_df["item_id"].isna().sum()}')
games_df[games_df['item_id'].isna()]

Number of records where "item_id" is NaN: 2


Unnamed: 0,publisher,genres,name,title,url,release_date,tags,reviews_url,specs,price,early_access,item_id,developer
74,,,,,http://store.steampowered.com/,,,,,19.99,0.0,,
30961,"Warner Bros. Interactive Entertainment, Feral ...","[Action, Adventure]",Batman: Arkham City - Game of the Year Edition,Batman: Arkham City - Game of the Year Edition,http://store.steampowered.com/app/200260,2012-09-07,"[Action, Open World, Batman, Adventure, Stealt...",,"[Single-player, Steam Achievements, Steam Trad...",19.99,0.0,,"Rocksteady Studios,Feral Interactive (Mac)"


In [43]:
# Drop the records where item_id is null or NaN
games_df = games_df.dropna(subset=['item_id'])

#### Data transformations

In [44]:
'''
# convert 'genres' to a list data type, keeping the existing NaNs
games_df['genres'] = games_df['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

# 'explode' that list into rows
genre_df = games_df.explode('genres')

# collect the unique genres, including NaN values
genre_list = genre_df['genres'].unique()
genre_list


games_df['tags'] = games_df['tags'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

games_df['genres'] = games_df.apply(
    lambda row: [tag for tag in row['tags'] if tag in genre_list] if isinstance(row['tags'], list) else row['genres'],
    axis=1
    )'''

In [45]:
# Inspect the values in the 'release_date' column
unique_release_dates = games_df['release_date'].unique()

for value in unique_release_dates:
    print(value)

2018-01-04
2017-07-24
2017-12-07
None
Soon..
2018-01-03
2017-12-22
2017-12-23
1997-06-30
1998-11-08
2016-11-25
2018-01-01
2017-12-30
2006-07-06
2006-07-11
2017
Beta测试已开启
2017-12-29
2018-03-30
2005-08-09
2006-09-29
2006-11-20
2006-11-29
2006-11-24
2006-12-14
2006-12-19
2003-08-23
2006-12-21
2006-04-17
2006-08-01
2005-07-12
2007-06-26
2006-07-24
2002-11-12
2000-11-17
2003-10-23
2006-10-16
1998-10-31
2007-06-01
1995-04-01
2007-06-05
2006-05-23
2006-10-17
1996-06-17
2007-06-29
2006-12-20
1995-04-30
1995-06-01
1994-05-05
1994-08-03
2001-11-20
1997-02-28
1993-10-10
2003-09-09
2007-08-03
2007-05-31
2007-03-20
2006-01-01
2007-08-21
2006-10-09
2004-09-22
2006-06-26
2007-12-14
1998-06-30
1999-10-25
2004-04-20
2003-07-01
2003-10-14
2006-04-07
2001-07-25
2008-10-28
2007-10-10
2007-10-23
2007-08-01
2007-09-01
2007-12-03
2008-02-08
2007-05-08
2006-02-21
2006-05-02
2007-10-16
2006-05-30
2006-10-26
2005-03-14
2003-02-03
1998-04-30
2008-03-21
2003-02-18
2004-03-23
2000-10-25
2006-12-12
2005-03-15
2008-

In [46]:
# Convert column 'item_id' to int to drop the decimal part
games_df['item_id'] = games_df['item_id'].astype(float).astype(int)

# then convert that integer to str
games_df['item_id'] = games_df['item_id'].astype(str)
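
A small sketch of why the intermediate cast to int matters: converting a float ID straight to a string keeps the trailing `.0`. The series `ids` is a toy example built from values shown in the table above.

```python
import pandas as pd

ids = pd.Series([761140.0, 643980.0])

direct = ids.astype(str).tolist()               # keeps the decimal part
via_int = ids.astype(int).astype(str).tolist()  # drops it first

print(direct)
print(via_int)
```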


In [47]:
# convert column 'release_date' to datetime
games_df['release_date']=pd.to_datetime(games_df['release_date'], errors= 'coerce', exact= False)

# Extract only the year into a new column 'year'
games_df['year']= games_df['release_date'].dt.year.astype('Int64')

# Drop the 'release_date' column
games_df = games_df.drop(columns=['release_date'])
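
To illustrate how `errors='coerce'` handles entries such as `Soon..`, here is a minimal sketch on a toy series. The values are taken from the listing above, while the names `raw_dates`, `parsed`, and `years` are illustrative.

```python
import pandas as pd

raw_dates = pd.Series(["2018-01-04", "Soon..", None])

# Unparseable strings and missing values become NaT instead of raising.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# The nullable Int64 dtype keeps the year as an integer and NaT as <NA>.
years = parsed.dt.year.astype("Int64")
print(years)
```

Partial dates such as `2017` may be parsed differently depending on the pandas version, so they are left out of this sketch.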

In [48]:
games_df

Unnamed: 0,publisher,genres,name,title,url,tags,reviews_url,specs,price,early_access,item_id,developer,year
0,Kotoshiro,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,Lost Summoner Kitty,http://store.steampowered.com/app/761140/Lost_...,"[Strategy, Action, Indie, Casual, Simulation]",http://steamcommunity.com/app/761140/reviews/?...,[Single-player],4.99,0.0,761140,Kotoshiro,2018
1,"Making Fun, Inc.","[Free to Play, Indie, RPG, Strategy]",Ironbound,Ironbound,http://store.steampowered.com/app/643980/Ironb...,"[Free to Play, Strategy, Indie, RPG, Card Game...",http://steamcommunity.com/app/643980/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free To Play,0.0,643980,Secret Level SRL,2018
2,Poolians.com,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,Real Pool 3D - Poolians,http://store.steampowered.com/app/670290/Real_...,"[Free to Play, Simulation, Sports, Casual, Ind...",http://steamcommunity.com/app/670290/reviews/?...,"[Single-player, Multi-player, Online Multi-Pla...",Free to Play,0.0,670290,Poolians.com,2017
3,彼岸领域,"[Action, Adventure, Casual]",弹炸人2222,弹炸人2222,http://store.steampowered.com/app/767400/2222/,"[Action, Adventure, Casual]",http://steamcommunity.com/app/767400/reviews/?...,[Single-player],0.99,0.0,767400,彼岸领域,2017
4,,,Log Challenge,,http://store.steampowered.com/app/773570/Log_C...,"[Action, Indie, Casual, Sports]",http://steamcommunity.com/app/773570/reviews/?...,"[Single-player, Full controller support, HTC V...",2.99,0.0,773570,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
32130,Ghost_RUS Games,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,Colony On Mars,http://store.steampowered.com/app/773640/Colon...,"[Strategy, Indie, Casual, Simulation]",http://steamcommunity.com/app/773640/reviews/?...,"[Single-player, Steam Achievements]",1.99,0.0,773640,"Nikita ""Ghost_RUS""",2018
32131,Sacada,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,LOGistICAL: South Africa,http://store.steampowered.com/app/733530/LOGis...,"[Strategy, Indie, Casual]",http://steamcommunity.com/app/733530/reviews/?...,"[Single-player, Steam Achievements, Steam Clou...",4.99,0.0,733530,Sacada,2018
32132,Laush Studio,"[Indie, Racing, Simulation]",Russian Roads,Russian Roads,http://store.steampowered.com/app/610660/Russi...,"[Indie, Simulation, Racing]",http://steamcommunity.com/app/610660/reviews/?...,"[Single-player, Steam Achievements, Steam Trad...",1.99,0.0,610660,Laush Dmitriy Sergeevich,2018
32133,SIXNAILS,"[Casual, Indie]",EXIT 2 - Directions,EXIT 2 - Directions,http://store.steampowered.com/app/658870/EXIT_...,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...",http://steamcommunity.com/app/658870/reviews/?...,"[Single-player, Steam Achievements, Steam Cloud]",4.99,0.0,658870,"xropi,stev3ns",2017


In [49]:
# Where 'developer' is empty and 'publisher' has a value, copy the value across
games_df.loc[games_df['developer'].isnull() & ~games_df['publisher'].isnull(), 'developer'] = games_df['publisher']

In [50]:
# Drop the columns 'title','url','reviews_url','early_access','publisher' as not relevant ('release_date' was already dropped above)
games_df.drop(['title','url','reviews_url','early_access','publisher'], axis=1, inplace=True)

In [51]:
# set an 80% threshold to decide which columns to drop because of null values
umbral_nulos = 0.8

# Compute the fraction of null values per column
porcentaje_nulos = games_df.isnull().mean()

# Keep the columns that exceed the threshold
Columnas_Null80 =  porcentaje_nulos[porcentaje_nulos > umbral_nulos]

# Show those columns and their percentage of null values
print("Columns with more than {}% null values (candidates for removal):".format(umbral_nulos * 100))
for columna, porcentaje in Columnas_Null80.items():
    print("{}: {:.2%}".format(columna, porcentaje))

Columns with more than 80.0% null values (candidates for removal):


In [52]:
# Fill the remaining null values in 'year' with the column median.
games_df['year'] = games_df['year'].fillna(games_df['year'].median())
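
A toy sketch of the median fill on a nullable integer column. Casting the median with `int()` is an assumption added here: it truncates any fractional part, so the fill value always fits the `Int64` dtype. The series `years` is illustrative.

```python
import pandas as pd

years = pd.Series([2018, 2017, pd.NA, 2016], dtype="Int64")

# The median skips missing values; int() keeps the fill value integral.
median_year = int(years.median())

filled = years.fillna(median_year)
print(filled.tolist())
```

Imputing the median keeps every row, at the cost of assigning an artificial release year to games whose date was unknown or unparseable.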

In [53]:
# Convert 'price' to a numeric type; non-numeric values such as 'Free To Play' become NaN
games_df['price'] = pd.to_numeric(games_df['price'], errors='coerce')
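
The effect of `errors='coerce'` on text prices can be seen on a toy series. The values mirror those in the table above, and the names `prices` and `numeric` are illustrative.

```python
import pandas as pd

prices = pd.Series(["4.99", "Free To Play", 0.99], dtype="object")

# Numeric strings are parsed; any other text is replaced by NaN.
numeric = pd.to_numeric(prices, errors="coerce")
print(numeric)
```

After this step a free game and a game with a missing price are both NaN, so any later analysis of prices cannot tell them apart.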

In [54]:
games_df

Unnamed: 0,genres,name,tags,specs,price,item_id,developer,year
0,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,761140,Kotoshiro,2018
1,"[Free to Play, Indie, RPG, Strategy]",Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",,643980,Secret Level SRL,2018
2,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",,670290,Poolians.com,2017
3,"[Action, Adventure, Casual]",弹炸人2222,"[Action, Adventure, Casual]",[Single-player],0.99,767400,彼岸领域,2017
4,,Log Challenge,"[Action, Indie, Casual, Sports]","[Single-player, Full controller support, HTC V...",2.99,773570,,2016
...,...,...,...,...,...,...,...,...
32130,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,"[Strategy, Indie, Casual, Simulation]","[Single-player, Steam Achievements]",1.99,773640,"Nikita ""Ghost_RUS""",2018
32131,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,"[Strategy, Indie, Casual]","[Single-player, Steam Achievements, Steam Clou...",4.99,733530,Sacada,2018
32132,"[Indie, Racing, Simulation]",Russian Roads,"[Indie, Simulation, Racing]","[Single-player, Steam Achievements, Steam Trad...",1.99,610660,Laush Dmitriy Sergeevich,2018
32133,"[Casual, Indie]",EXIT 2 - Directions,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...","[Single-player, Steam Achievements, Steam Cloud]",4.99,658870,"xropi,stev3ns",2017


In [55]:
games_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32133 entries, 0 to 32134
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   genres     28851 non-null  object 
 1   name       32132 non-null  object 
 2   tags       31971 non-null  object 
 3   specs      31464 non-null  object 
 4   price      28846 non-null  float64
 5   item_id    32133 non-null  object 
 6   developer  28900 non-null  object 
 7   year       32133 non-null  Int64  
dtypes: Int64(1), float64(1), object(6)
memory usage: 3.2+ MB


### **4. Saving the clean data**
The data can be saved in the following formats to make it available: CSV, JSON, or Parquet. Only the Parquet export is active in the cell below.

In [56]:
# These files are stored locally
# games_df.to_csv('steam_games_limpios.csv', index= False)
# games_df.to_json('steam_games_limpios.json', lines= False)
games_df.to_parquet('../Dataframes/Dataframes_limpios/steam_games_limpios.parquet', index= False)
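
If the commented-out CSV export were used instead, one point to keep in mind is that list columns such as 'genres' are written as text. The sketch below round-trips a toy row through an in-memory CSV and restores the list with `ast.literal_eval`; the DataFrame `toy` and the buffer are assumptions made for illustration.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"item_id": ["761140"], "genres": [["Action", "Indie"]]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Read 'item_id' back as text so it is not re-inferred as a number.
restored = pd.read_csv(buffer, dtype={"item_id": str})
as_text = restored["genres"].iloc[0]   # the list comes back as a string

# Parse the string representation back into a real list.
restored["genres"] = restored["genres"].apply(ast.literal_eval)
as_list = restored["genres"].iloc[0]
print(type(as_text).__name__, as_list)
```

With real data the parsing step would need a guard such as `isinstance(x, str)`, because rows with missing genres come back as NaN.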

In [57]:
games_df

Unnamed: 0,genres,name,tags,specs,price,item_id,developer,year
0,"[Action, Casual, Indie, Simulation, Strategy]",Lost Summoner Kitty,"[Strategy, Action, Indie, Casual, Simulation]",[Single-player],4.99,761140,Kotoshiro,2018
1,"[Free to Play, Indie, RPG, Strategy]",Ironbound,"[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",,643980,Secret Level SRL,2018
2,"[Casual, Free to Play, Indie, Simulation, Sports]",Real Pool 3D - Poolians,"[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",,670290,Poolians.com,2017
3,"[Action, Adventure, Casual]",弹炸人2222,"[Action, Adventure, Casual]",[Single-player],0.99,767400,彼岸领域,2017
4,,Log Challenge,"[Action, Indie, Casual, Sports]","[Single-player, Full controller support, HTC V...",2.99,773570,,2016
...,...,...,...,...,...,...,...,...
32130,"[Casual, Indie, Simulation, Strategy]",Colony On Mars,"[Strategy, Indie, Casual, Simulation]","[Single-player, Steam Achievements]",1.99,773640,"Nikita ""Ghost_RUS""",2018
32131,"[Casual, Indie, Strategy]",LOGistICAL: South Africa,"[Strategy, Indie, Casual]","[Single-player, Steam Achievements, Steam Clou...",4.99,733530,Sacada,2018
32132,"[Indie, Racing, Simulation]",Russian Roads,"[Indie, Simulation, Racing]","[Single-player, Steam Achievements, Steam Trad...",1.99,610660,Laush Dmitriy Sergeevich,2018
32133,"[Casual, Indie]",EXIT 2 - Directions,"[Indie, Casual, Puzzle, Singleplayer, Atmosphe...","[Single-player, Steam Achievements, Steam Cloud]",4.99,658870,"xropi,stev3ns",2017
