# ETL (Data Engineering)

### Import Libraries

In [1]:
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

import pandas as pd
import sys
import os
import numpy as np

In [2]:
# Get the current working directory
current_dir = os.getcwd()

# Move up to the project's root directory
project_root = os.path.abspath(os.path.join(current_dir, '..'))

# Add the project path to sys.path
sys.path.append(project_root)

Import the ETL functions (defined in the `functions` folder, `ETL` file)

In [3]:
from functions.ETL import load_data, normalize, export  # defined in the functions folder, ETL file

## Data Extraction

File path:

In [4]:
path = r'..\data\steam_games.json.gz'

Extract and preview the data

In [5]:
df = load_data(path)  # defined in the functions folder, ETL file
df.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


## Data Transformation

### Data Cleaning

#### Remove Unnecessary Rows

Drop rows in which every element is empty:

In [6]:
df_clean = df.dropna(how='all')
len(df_clean)

32135

`app_name` has two missing values. Without a name, the other columns are not enough to identify a record, so these rows are dropped:

In [7]:
df_clean = df_clean.dropna(subset=['app_name'])
len(df_clean)

32133
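
The two `dropna` calls above behave differently, which is easy to miss. As a minimal sketch on a made-up toy frame (the values below are illustrative, not taken from the dataset), `how='all'` only removes rows that are empty in every column, while `subset=[...]` removes rows that lack a value in the listed columns:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): row 1 is fully empty, row 2 only lacks a name
toy = pd.DataFrame({
    "app_name": ["Game A", np.nan, np.nan, "Game B"],
    "id": [10.0, np.nan, 30.0, 40.0],
})

# how='all' removes only rows in which every column is missing
no_empty_rows = toy.dropna(how="all")

# subset=[...] removes rows that are missing a value in the listed columns
named_only = no_empty_rows.dropna(subset=["app_name"])

print(len(toy), len(no_empty_rows), len(named_only))
```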

Identify duplicates by `id` and `app_name`:

In [8]:
df_clean[df_clean.duplicated(subset=['app_name', 'id'])]

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
102883,Bethesda Softworks,[Action],Wolfenstein II: The New Colossus,Wolfenstein II: The New Colossus,http://store.steampowered.com/app/612880/Wolfe...,2017-10-26,"[Action, FPS, Gore, Violent, Alternate History...",http://steamcommunity.com/app/612880/reviews/?...,"[Single-player, Steam Achievements, Full contr...",59.99,0.0,612880.0,Machine Games


Drop duplicates

In [9]:
df_clean = df_clean.drop_duplicates(subset=['app_name', 'id'], keep='first')
len(df_clean)

32132
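
By default `duplicated` flags only the later occurrences, which is why a single row was listed above. A small sketch on made-up data shows how `keep=False` can be used to inspect every copy before deciding which one to keep; the toy values are assumptions for illustration only:

```python
import pandas as pd

# Toy frame (made-up values) with one repeated (id, app_name) pair
toy = pd.DataFrame({
    "id": [1.0, 2.0, 2.0, 3.0],
    "app_name": ["A", "B", "B", "C"],
    "price": [9.99, 59.99, 59.99, 0.0],
})

# keep=False flags every member of a duplicated group, not just the repeats
both_copies = toy[toy.duplicated(subset=["id", "app_name"], keep=False)]

# keep='first' retains the first occurrence and drops the later ones
deduped = toy.drop_duplicates(subset=["id", "app_name"], keep="first")

print(len(both_copies), len(deduped))
```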

Compare the `app_name` and `title` columns:

In [10]:
distincts = df_clean[df_clean['app_name'] != df_clean['title']]
len(distincts)

2603

In [11]:
len(distincts) - distincts['title'].isnull().sum()

np.int64(555)

Rows where `app_name` and `title` do not match would have to be dropped as inconsistent data, because the game's name could not be determined. Of the 2,603 mismatches, however, more than 2,000 are null values in the `title` column (2,603 - 555 = 2,048), and 555 are non-null titles that differ from `app_name`, which is complete. Inspecting the content shows that the difference lies in the `&` character: `app_name` shows it normally, while `title` shows it as the HTML entity `&amp;`.

In [12]:
filtered = distincts[distincts['title'].notnull()]
filtered.head(3)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
88390,Telltale Games,"[Action, Adventure]",Sam & Max 101: Culture Shock,Sam &amp; Max 101: Culture Shock,http://store.steampowered.com/app/8200/Sam__Ma...,2006-10-17,"[Point & Click, Comedy, Adventure, Detective, ...",http://steamcommunity.com/app/8200/reviews/?br...,[Single-player],19.99,0.0,8200.0,Telltale Games
88393,Telltale Games,"[Action, Adventure]",Sam & Max 102: Situation: Comedy,Sam &amp; Max 102: Situation: Comedy,http://store.steampowered.com/app/8210/Sam__Ma...,2006-12-20,"[Adventure, Action]",http://steamcommunity.com/app/8210/reviews/?br...,[Single-player],19.99,0.0,8210.0,Telltale Games
88419,Electronic Arts,[Strategy],Command & Conquer: Red Alert 3,Command &amp; Conquer: Red Alert 3,http://store.steampowered.com/app/17480/Comman...,2008-10-28,"[Strategy, RTS, Base Building, Multiplayer, Co...",http://steamcommunity.com/app/17480/reviews/?b...,[Single-player],19.99,0.0,17480.0,EA Los Angeles
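
One way to check whether HTML entities account for the mismatches is to decode `title` and compare again. The sketch below uses Python's standard `html.unescape` on a toy frame modelled on the rows shown above; it is an illustration under that assumption, not a step the notebook actually runs, and it makes no claim about all 555 rows:

```python
import html
import pandas as pd

# Toy rows modelled on the mismatches shown above (the last row has no title)
sample = pd.DataFrame({
    "app_name": [
        "Sam & Max 101: Culture Shock",
        "Command & Conquer: Red Alert 3",
        "Ironbound",
    ],
    "title": [
        "Sam &amp; Max 101: Culture Shock",
        "Command &amp; Conquer: Red Alert 3",
        None,
    ],
})

# Decode HTML entities in the non-null titles; missing titles are left untouched
decoded = sample["title"].map(html.unescape, na_action="ignore")

# Rows that still differ after decoding and are not simply missing a title
still_different = sample[decoded.notna() & (decoded != sample["app_name"])]
print(len(still_different))
```

If the same check on the real data left rows in `still_different`, those would be the ones worth inspecting by hand.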


#### Drop unnecessary columns and reorder the rest:

In [13]:
# List the columns to keep, in the desired order, and select them
cols = ['id', 
        'app_name', 
        'genres', 
        'tags', 
        'specs', 
        'url', 
        'reviews_url', 
        'price',  
        'release_date', 
        'developer', 
        'publisher']

df_games = df_clean[cols].copy()  # copy so that later assignments do not act on a view of df_clean

In [14]:
df_games.head(3)

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
88310,761140.0,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]","[Strategy, Action, Indie, Casual, Simulation]",[Single-player],http://store.steampowered.com/app/761140/Lost_...,http://steamcommunity.com/app/761140/reviews/?...,4.99,2018-01-04,Kotoshiro,Kotoshiro
88311,643980.0,Ironbound,"[Free to Play, Indie, RPG, Strategy]","[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",http://store.steampowered.com/app/643980/Ironb...,http://steamcommunity.com/app/643980/reviews/?...,Free To Play,2018-01-04,Secret Level SRL,"Making Fun, Inc."
88312,670290.0,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]","[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",http://store.steampowered.com/app/670290/Real_...,http://steamcommunity.com/app/670290/reviews/?...,Free to Play,2017-07-24,Poolians.com,Poolians.com


#### Impute null values:

In [15]:
df_games.isnull().sum()

id                 1
app_name           0
genres          3282
tags             162
specs            669
url                0
reviews_url        1
price           1376
release_date    2066
developer       3297
publisher       8050
dtype: int64

The `genres` and `tags` columns hold similar data, so `tags` can be used to fill the null values in `genres`.

In [16]:
df_games['genres'] = df_games['genres'].fillna(df_games['tags'])
df_games['genres'].isnull().sum()

np.int64(138)

The `developer` and `publisher` columns often coincide, so `publisher` can be used to fill the null values in `developer`.

In [17]:
df_games['developer'] = df_games['developer'].fillna(df_games['publisher'])
df_games['developer'].isnull().sum()

np.int64(3232)

#### Impute the `price` column when the value 'Free to Play' appears in `genres` or `tags`

In [18]:
df_games['price'] = df_games.apply(lambda row: 'Free to Play' if isinstance(row['tags'], list) and 'Free to Play' in row['tags'] and pd.isna(row['price']) else row['price'], axis=1)

In [19]:
df_games['price'] = df_games.apply(lambda row: 0 if isinstance(row['genres'], list) and 'Free to Play' in row['genres'] and pd.isna(row['price']) else row['price'], axis=1)

In [20]:
df_games['price'].isnull().sum()

np.int64(1171)
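
The two row-wise `apply` calls above can also be expressed with a boolean mask. The following is a sketch on made-up toy data, not the notebook's own code; the helper name `has_free_to_play` is hypothetical, and unlike cell 18 it writes `0.0` directly instead of the string 'Free to Play':

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) covering the relevant cases
toy = pd.DataFrame({
    "genres": [["Free to Play", "Indie"], ["Action"], np.nan, ["RPG"]],
    "tags": [["Indie"], ["Free to Play"], ["Free to Play"], np.nan],
    "price": [np.nan, np.nan, 4.99, np.nan],
})

def has_free_to_play(value):
    """True only when value is a list that contains 'Free to Play'."""
    return isinstance(value, list) and "Free to Play" in value

# Rows where either column lists 'Free to Play'
is_free = toy["genres"].map(has_free_to_play) | toy["tags"].map(has_free_to_play)

# Only fill prices that are missing; existing prices are left alone
mask = toy["price"].isna() & is_free
toy.loc[mask, "price"] = 0.0

print(toy["price"].tolist())
```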

Check the row with a null `id`

In [21]:
df_games[df_games['id'].isnull()]

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
119271,,Batman: Arkham City - Game of the Year Edition,"[Action, Adventure]","[Action, Open World, Batman, Adventure, Stealt...","[Single-player, Steam Achievements, Steam Trad...",http://store.steampowered.com/app/200260,,19.99,2012-09-07,"Rocksteady Studios,Feral Interactive (Mac)","Warner Bros. Interactive Entertainment, Feral ..."


In [22]:
df_games[df_games['id'] == 200260]


Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
89378,200260.0,Batman: Arkham City - Game of the Year Edition,"[Action, Adventure]","[Action, Open World, Batman, Adventure, Stealt...","[Single-player, Steam Achievements, Steam Trad...",http://store.steampowered.com/app/200260/Batma...,http://steamcommunity.com/app/200260/reviews/?...,19.99,2012-09-07,"Rocksteady Studios,Feral Interactive (Mac)","Warner Bros. Interactive Entertainment, Feral ..."
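
The value 200260 used in the lookup above matches the number in the URL of the row whose `id` is missing. As a sketch of doing that lookup programmatically, assuming store URLs follow the `/app/<number>` pattern seen in the outputs, the id can be extracted with a regular expression and checked against the existing ids (toy data below, URLs copied as displayed):

```python
import numpy as np
import pandas as pd

# Toy frame modelled on the two rows shown above
toy = pd.DataFrame({
    "id": [np.nan, 200260.0],
    "url": [
        "http://store.steampowered.com/app/200260",
        "http://store.steampowered.com/app/200260/Batma...",
    ],
})

# Pull the digits that follow '/app/' out of each URL
id_from_url = toy["url"].str.extract(r"/app/(\d+)", expand=False).astype(float)

# Candidate ids for the rows whose id is missing
candidate = id_from_url[toy["id"].isna()]

# Is there already a row carrying that id?
already_present = toy["id"].isin(candidate).any()
print(candidate.tolist(), already_present)
```

When `already_present` is true, as in this toy case, dropping the row with the null `id` loses no game.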


#### Convert numeric data

In [23]:
df_games = df_games.dropna(subset='id')
df_games['id'] = df_games['id'].astype('int64')
df_games['id'].dtype

dtype('int64')

#### Create a DataFrame with the game and review URLs
- The game URLs may be needed to improve the system by recommending the games' store pages.
- The review URLs could be used in the future for web scraping.

In [24]:
games_urls = df_games[['id', 'url', 'reviews_url']]

Drop the columns `tags`, `publisher`, `url` and `reviews_url`

In [25]:
df_steam = df_games.drop(columns=['tags', 'publisher', 'url', 'reviews_url'])

In [26]:
df_steam.head(3)

Unnamed: 0,id,app_name,genres,specs,price,release_date,developer
88310,761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",[Single-player],4.99,2018-01-04,Kotoshiro
88311,643980,Ironbound,"[Free to Play, Indie, RPG, Strategy]","[Single-player, Multi-player, Online Multi-Pla...",Free To Play,2018-01-04,Secret Level SRL
88312,670290,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]","[Single-player, Multi-player, Online Multi-Pla...",Free to Play,2017-07-24,Poolians.com


Check data types

In [27]:
for c in df_steam.columns:
    print(c, df_steam[c].dtype)

id int64
app_name object
genres object
specs object
price object
release_date object
developer object

#### Convert the remaining columns

In [28]:
# Cast the text columns to str (pandas still reports their dtype as object);
# list values in specs become their string representation
df_steam = df_steam.astype({'app_name': 'str', 'specs': 'str', 'price': 'str', 'developer': 'str'})
for c in df_steam.columns:
    print(c, df_steam[c].dtype)

id int64
app_name object
genres object
specs object
price object
release_date object
developer object

In [29]:
# Function to remove "Free to Play" from the genres
def delete_free_to_play(genres):
    '''
    Remove the "Free to Play" entry from genres, whether genres is a
    list, an array or a comma-separated string.
    '''
    # Lists and arrays: filter out "Free to Play"
    if isinstance(genres, (list, np.ndarray)):
        return [genre for genre in genres if genre != "Free to Play"]

    # Comma-separated string: split, strip each item, filter and join again
    if isinstance(genres, str):
        genres_list = [genre.strip() for genre in genres.split(",")]
        return ", ".join(genre for genre in genres_list if genre != "Free to Play")

    # None, NaN or any other type: return the value unchanged
    return genres

In [30]:
df_steam['genres'] = df_steam['genres'].apply(delete_free_to_play)
df_steam.head(3)

Unnamed: 0,id,app_name,genres,specs,price,release_date,developer
88310,761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",['Single-player'],4.99,2018-01-04,Kotoshiro
88311,643980,Ironbound,"[Indie, RPG, Strategy]","['Single-player', 'Multi-player', 'Online Mult...",Free To Play,2018-01-04,Secret Level SRL
88312,670290,Real Pool 3D - Poolians,"[Casual, Indie, Simulation, Sports]","['Single-player', 'Multi-player', 'Online Mult...",Free to Play,2017-07-24,Poolians.com


In [31]:
# Define a function to handle prices:
# if they include the word Free, the price is set to 0
def handle_price(price):
    '''
    Return 0.0 if the price contains the word Free or free;
    otherwise try to convert it to float and return NaN if that fails.
    '''
    if isinstance(price, str) and ('Free' in price or 'free' in price):
        return 0.0
    # No Free in the value: try to convert it to float, NaN on failure
    try:
        return float(price)
    except (ValueError, TypeError):
        return np.nan

In [32]:
# Values that imply Free to Play become 0.0; the rest are converted to float
df_steam['price'] = df_steam['price'].apply(handle_price)
# Check the data type of the price column

df_steam['price'].dtype

dtype('float64')

In [33]:
df_steam.head(3)

Unnamed: 0,id,app_name,genres,specs,price,release_date,developer
88310,761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",['Single-player'],4.99,2018-01-04,Kotoshiro
88311,643980,Ironbound,"[Indie, RPG, Strategy]","['Single-player', 'Multi-player', 'Online Mult...",0.0,2018-01-04,Secret Level SRL
88312,670290,Real Pool 3D - Poolians,"[Casual, Indie, Simulation, Sports]","['Single-player', 'Multi-player', 'Online Mult...",0.0,2017-07-24,Poolians.com
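
For comparison, the same price rule can be sketched without a row-wise function by combining `Series.str.contains` with `pd.to_numeric(errors='coerce')`. This is an alternative shown on made-up values, not the notebook's method; note that it matches "free" in any letter case, which is slightly broader than `handle_price`:

```python
import pandas as pd

# Made-up price strings, as they might look after the cast to str
prices = pd.Series(["4.99", "Free To Play", "Free to Play", "nan", "Starting at $499.00"])

# Anything mentioning "free" (case-insensitive) is treated as price 0.0
is_free = prices.str.contains("free", case=False, na=False)

# Non-numeric text becomes NaN instead of raising an error
numeric = pd.to_numeric(prices, errors="coerce")

# Overwrite the free entries with 0.0 and keep the parsed numbers elsewhere
clean = numeric.mask(is_free, 0.0)
print(clean.tolist())
```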


In [34]:
# Convert the dates, which come in different formats, to a unified datetime format
# (values that cannot be parsed become NaT)
df_steam['release_date'] = pd.to_datetime(df_steam['release_date'], errors='coerce')

# Extract the year from the converted dates
df_steam['release_year'] = df_steam['release_date'].dt.year

# Show the first rows as an example
df_steam.head()

Unnamed: 0,id,app_name,genres,specs,price,release_date,developer,release_year
88310,761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",['Single-player'],4.99,2018-01-04,Kotoshiro,2018.0
88311,643980,Ironbound,"[Indie, RPG, Strategy]","['Single-player', 'Multi-player', 'Online Mult...",0.0,2018-01-04,Secret Level SRL,2018.0
88312,670290,Real Pool 3D - Poolians,"[Casual, Indie, Simulation, Sports]","['Single-player', 'Multi-player', 'Online Mult...",0.0,2017-07-24,Poolians.com,2017.0
88313,767400,弹炸人2222,"[Action, Adventure, Casual]",['Single-player'],0.99,2017-12-07,彼岸领域,2017.0
88314,773570,Log Challenge,"[Action, Indie, Casual, Sports]","['Single-player', 'Full controller support', '...",2.99,NaT,,
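
The years appear as `2018.0` because a column that contains missing values is stored as float. If whole-number years are preferred, one option is pandas' nullable `Int64` dtype. This is a sketch on made-up date strings, offered as a possible refinement rather than something the notebook does:

```python
import pandas as pd

# Made-up release dates, including an unparseable value and a missing one
dates = pd.Series(["2018-01-04", "2017-07-24", "Soon..", None])

# Unparseable or missing entries become NaT
parsed = pd.to_datetime(dates, errors="coerce")

# Nullable integer dtype keeps whole-number years and marks gaps as <NA>
years = parsed.dt.year.astype("Int64")
print(years.tolist())
```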


In [35]:
# Drop the release_date column and show the first three rows
df_steam = df_steam.drop(columns=['release_date'])
df_steam.head(3)

Unnamed: 0,id,app_name,genres,specs,price,developer,release_year
88310,761140,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]",['Single-player'],4.99,Kotoshiro,2018.0
88311,643980,Ironbound,"[Indie, RPG, Strategy]","['Single-player', 'Multi-player', 'Online Mult...",0.0,Secret Level SRL,2018.0
88312,670290,Real Pool 3D - Poolians,"[Casual, Indie, Simulation, Sports]","['Single-player', 'Multi-player', 'Online Mult...",0.0,Poolians.com,2017.0


## Data Loading

The processed data is saved in Parquet and CSV formats, each in its own folder, so that either can be used as the situation requires. If the directories do not exist, they are created.

In [36]:
export(df_steam, project_root, 'steam_games')
# Also export the game and review URLs in case web scraping is planned in the future
export(games_urls, project_root, 'games_urls')

Creando el directorio d:\Henry-DataScience\LABS\Proyecto Individual 1 Steam\VideoGameRecommender\src\CSV...
Creando el directorio d:\Henry-DataScience\LABS\Proyecto Individual 1 Steam\VideoGameRecommender\src\Parquet...
Archivos exportados exitosamente.
Archivos exportados exitosamente.
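
The body of `export` lives in the project's `functions/ETL` file and is not shown in this notebook. Purely as an illustration of the behaviour described above, a helper along these lines could create the folders and write both formats. The name `export_sketch`, the `formats` parameter and the `src/CSV` and `src/Parquet` layout (inferred from the log output) are assumptions, not the author's implementation:

```python
import os
import pandas as pd

def export_sketch(df, root, name, formats=("csv", "parquet")):
    """Hypothetical stand-in for export(): write df under <root>/src/<Folder>/."""
    folders = {"csv": "CSV", "parquet": "Parquet"}  # assumed folder names
    written = []
    for fmt in formats:
        out_dir = os.path.join(root, "src", folders[fmt])
        # Create the directory only if it does not exist yet
        os.makedirs(out_dir, exist_ok=True)
        out_path = os.path.join(out_dir, f"{name}.{fmt}")
        if fmt == "csv":
            df.to_csv(out_path, index=False)
        else:
            # Requires a Parquet engine such as pyarrow or fastparquet
            df.to_parquet(out_path, index=False)
        written.append(out_path)
    return written
```

A call such as `export_sketch(df_steam, project_root, 'steam_games')` would then mirror the cell above, returning the paths it wrote.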
