# ETL (Data Engineering)

### Import Libraries

In [155]:
import pandas as pd
import sys
import os
import numpy as np
import warnings
import ast

# Ignore all warnings
warnings.filterwarnings("ignore")

In [156]:
# Get the current working directory
current_dir = os.getcwd()

# Go up one level to the project root
project_root = os.path.abspath(os.path.join(current_dir, '..'))

# Add the project root to sys.path
sys.path.append(project_root)

Import the ETL helper functions (defined in the ETL file of the functions folder)

In [157]:
from functions.ETL import load_data, normalize, export  # defined in functions/ETL.py

## Data Extraction

File path:

In [158]:
path = r'..\data\steam_games.json.gz'

Extract and preview the data

In [159]:
df = load_data(path)  # defined in functions/ETL.py
df.head()

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
0,,,,,,,,,,,,,
1,,,,,,,,,,,,,
2,,,,,,,,,,,,,
3,,,,,,,,,,,,,
4,,,,,,,,,,,,,


## Data Transformation

### Data Cleaning

#### Drop unnecessary rows

Drop rows in which every element is empty:

In [160]:
df_clean = df.dropna(how='all')
len(df_clean)

32135

app_name has two empty values. Without a name, those records cannot be identified from the other columns, so they are dropped:

In [161]:
df_clean = df_clean.dropna(subset=['app_name'])
len(df_clean)

32133
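The two calls above use different `dropna` modes. A minimal sketch on made-up rows (the toy frame and its values are assumptions for illustration, not data from the file) shows how they differ:

```python
import numpy as np
import pandas as pd

# Toy data: one complete row, one row without a name, one fully empty row
toy = pd.DataFrame({
    "app_name": ["Game A", np.nan, np.nan],
    "id": [1.0, 2.0, np.nan],
})

# how="all" drops a row only when every column is null (the last row here)
no_empty_rows = toy.dropna(how="all")

# subset=[...] drops a row when the listed columns are null, whatever the rest holds
named_rows = no_empty_rows.dropna(subset=["app_name"])

print(len(toy), len(no_empty_rows), len(named_rows))
```

Running the `how="all"` pass first keeps the two counts separate, which is how the notebook can report that exactly two named-less rows were removed.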

Identify duplicates by id and app_name:

In [162]:
df_clean[df_clean.duplicated(df_clean.columns[df_clean.columns.isin(['app_name', 'id'])])]

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
102883,Bethesda Softworks,[Action],Wolfenstein II: The New Colossus,Wolfenstein II: The New Colossus,http://store.steampowered.com/app/612880/Wolfe...,2017-10-26,"[Action, FPS, Gore, Violent, Alternate History...",http://steamcommunity.com/app/612880/reviews/?...,"[Single-player, Steam Achievements, Full contr...",59.99,0.0,612880.0,Machine Games


Drop duplicates

In [163]:
df_clean = df_clean.drop_duplicates(df_clean.columns[df_clean.columns.isin(['app_name', 'id'])], keep='first')
len(df_clean)

32132
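The column selection used above is equivalent to passing the key columns through `subset`. A small sketch on invented rows (names and values are assumptions) shows what `duplicated` flags and what `keep` changes:

```python
import pandas as pd

# Toy data: the first two rows share the same (id, app_name) key
toy = pd.DataFrame({
    "id": [10, 10, 20],
    "app_name": ["Game A", "Game A", "Game B"],
    "price": [59.99, 59.99, 4.99],
})
key = ["id", "app_name"]

# Default keep="first": only the second and later occurrences are flagged
later_copies = toy[toy.duplicated(subset=key)]

# keep=False flags every member of a duplicated group, useful for inspection
all_copies = toy[toy.duplicated(subset=key, keep=False)]

# Keep the first occurrence of each key and drop the rest
deduped = toy.drop_duplicates(subset=key, keep="first")

print(len(later_copies), len(all_copies), len(deduped))
```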

Compare the app_name and title columns:

In [164]:
distincts = df_clean[df_clean['app_name'] != df_clean['title']]
len(distincts)

2603

In [165]:
# Count the mismatches between title and app_name where title is not null
len(distincts) - distincts['title'].isnull().sum()

np.int64(555)

Rows where app_name and title disagree would be inconsistent data and would have to be removed, since the game's name could not be determined. Of the 2,603 mismatches, 2,048 are null values in title and 555 are non-null titles that differ from app_name, which is complete. Inspecting the contents shows that the difference lies in the & character: app_name shows it as-is, while title shows the HTML entity &amp;amp;

In [166]:
filtered = distincts[distincts['title'].notnull()]
filtered.head(3)

Unnamed: 0,publisher,genres,app_name,title,url,release_date,tags,reviews_url,specs,price,early_access,id,developer
88390,Telltale Games,"[Action, Adventure]",Sam & Max 101: Culture Shock,Sam &amp; Max 101: Culture Shock,http://store.steampowered.com/app/8200/Sam__Ma...,2006-10-17,"[Point & Click, Comedy, Adventure, Detective, ...",http://steamcommunity.com/app/8200/reviews/?br...,[Single-player],19.99,0.0,8200.0,Telltale Games
88393,Telltale Games,"[Action, Adventure]",Sam & Max 102: Situation: Comedy,Sam &amp; Max 102: Situation: Comedy,http://store.steampowered.com/app/8210/Sam__Ma...,2006-12-20,"[Adventure, Action]",http://steamcommunity.com/app/8210/reviews/?br...,[Single-player],19.99,0.0,8210.0,Telltale Games
88419,Electronic Arts,[Strategy],Command & Conquer: Red Alert 3,Command &amp; Conquer: Red Alert 3,http://store.steampowered.com/app/17480/Comman...,2008-10-28,"[Strategy, RTS, Base Building, Multiplayer, Co...",http://steamcommunity.com/app/17480/reviews/?b...,[Single-player],19.99,0.0,17480.0,EA Los Angeles
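One way to confirm that the mismatches come only from HTML escaping is to decode the entities in title and compare again. The sketch below uses the standard-library `html.unescape` on a two-row toy frame; only the Sam & Max title is taken from the output above, the rest is assumed for illustration:

```python
import html
import pandas as pd

toy = pd.DataFrame({
    "app_name": ["Sam & Max 101: Culture Shock", "Ironbound"],
    "title": ["Sam &amp; Max 101: Culture Shock", "Ironbound"],
})

# Mismatches before decoding
raw_mismatch = int((toy["app_name"] != toy["title"]).sum())

# Decode HTML entities; na_action="ignore" skips null titles instead of failing on them
decoded_title = toy["title"].map(html.unescape, na_action="ignore")
decoded_mismatch = int((toy["app_name"] != decoded_title).sum())

print(raw_mismatch, decoded_mismatch)
```

If the second count on the real data is not zero, the remaining rows differ for some other reason and are worth inspecting individually.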


#### Drop unnecessary columns and reorder the rest:

In [167]:
# List the columns to keep, in the desired order, and select them
cols = ['id', 
        'app_name', 
        'genres', 
        'tags', 
        'specs', 
        'url', 
        'reviews_url', 
        'price',  
        'release_date', 
        'developer', 
        'publisher']

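# .copy() makes df_games independent of df_clean, so the column assignments below are unambiguous
df_games = df_clean[cols].copy()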

In [168]:
df_games.head(3)

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
88310,761140.0,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]","[Strategy, Action, Indie, Casual, Simulation]",[Single-player],http://store.steampowered.com/app/761140/Lost_...,http://steamcommunity.com/app/761140/reviews/?...,4.99,2018-01-04,Kotoshiro,Kotoshiro
88311,643980.0,Ironbound,"[Free to Play, Indie, RPG, Strategy]","[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",http://store.steampowered.com/app/643980/Ironb...,http://steamcommunity.com/app/643980/reviews/?...,Free To Play,2018-01-04,Secret Level SRL,"Making Fun, Inc."
88312,670290.0,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]","[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",http://store.steampowered.com/app/670290/Real_...,http://steamcommunity.com/app/670290/reviews/?...,Free to Play,2017-07-24,Poolians.com,Poolians.com


#### Impute null values:

In [169]:
df_games.isnull().sum()

id                 1
app_name           0
genres          3282
tags             162
specs            669
url                0
reviews_url        1
price           1376
release_date    2066
developer       3297
publisher       8050
dtype: int64

The genres, tags and specs columns hold similar data, so they can be merged, lowercased and split according to their elements.

In [170]:
# Replace NaN values with empty lists
df_games['genres'] = df_games['genres'].apply(lambda x: x if isinstance(x, list) else [])
df_games['tags'] = df_games['tags'].apply(lambda x: x if isinstance(x, list) else [])
df_games['specs'] = df_games['specs'].apply(lambda x: x if isinstance(x, list) else [])
df_games.head(3)

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
88310,761140.0,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy]","[Strategy, Action, Indie, Casual, Simulation]",[Single-player],http://store.steampowered.com/app/761140/Lost_...,http://steamcommunity.com/app/761140/reviews/?...,4.99,2018-01-04,Kotoshiro,Kotoshiro
88311,643980.0,Ironbound,"[Free to Play, Indie, RPG, Strategy]","[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",http://store.steampowered.com/app/643980/Ironb...,http://steamcommunity.com/app/643980/reviews/?...,Free To Play,2018-01-04,Secret Level SRL,"Making Fun, Inc."
88312,670290.0,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Sports]","[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",http://store.steampowered.com/app/670290/Real_...,http://steamcommunity.com/app/670290/reviews/?...,Free to Play,2017-07-24,Poolians.com,Poolians.com


In [171]:
# Concatenate the lists in the 'genres', 'tags' and 'specs' columns row by row
df_games['genres'] = df_games.apply(lambda row: row['genres'] + row['tags'] + row['specs'], axis=1)

# Preview the result
df_games.head(3)

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
88310,761140.0,Lost Summoner Kitty,"[Action, Casual, Indie, Simulation, Strategy, ...","[Strategy, Action, Indie, Casual, Simulation]",[Single-player],http://store.steampowered.com/app/761140/Lost_...,http://steamcommunity.com/app/761140/reviews/?...,4.99,2018-01-04,Kotoshiro,Kotoshiro
88311,643980.0,Ironbound,"[Free to Play, Indie, RPG, Strategy, Free to P...","[Free to Play, Strategy, Indie, RPG, Card Game...","[Single-player, Multi-player, Online Multi-Pla...",http://store.steampowered.com/app/643980/Ironb...,http://steamcommunity.com/app/643980/reviews/?...,Free To Play,2018-01-04,Secret Level SRL,"Making Fun, Inc."
88312,670290.0,Real Pool 3D - Poolians,"[Casual, Free to Play, Indie, Simulation, Spor...","[Free to Play, Simulation, Sports, Casual, Ind...","[Single-player, Multi-player, Online Multi-Pla...",http://store.steampowered.com/app/670290/Real_...,http://steamcommunity.com/app/670290/reviews/?...,Free to Play,2017-07-24,Poolians.com,Poolians.com


In [172]:
# Lowercase every element of the lists in the 'genres' and 'specs' columns
df_games['genres'] = df_games['genres'].apply(lambda genres: [g.lower() for g in genres])
df_games['specs'] = df_games['specs'].apply(lambda genres: [g.lower() for g in genres])

# Preview the result
df_games.head(3)

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
88310,761140.0,Lost Summoner Kitty,"[action, casual, indie, simulation, strategy, ...","[Strategy, Action, Indie, Casual, Simulation]",[single-player],http://store.steampowered.com/app/761140/Lost_...,http://steamcommunity.com/app/761140/reviews/?...,4.99,2018-01-04,Kotoshiro,Kotoshiro
88311,643980.0,Ironbound,"[free to play, indie, rpg, strategy, free to p...","[Free to Play, Strategy, Indie, RPG, Card Game...","[single-player, multi-player, online multi-pla...",http://store.steampowered.com/app/643980/Ironb...,http://steamcommunity.com/app/643980/reviews/?...,Free To Play,2018-01-04,Secret Level SRL,"Making Fun, Inc."
88312,670290.0,Real Pool 3D - Poolians,"[casual, free to play, indie, simulation, spor...","[Free to Play, Simulation, Sports, Casual, Ind...","[single-player, multi-player, online multi-pla...",http://store.steampowered.com/app/670290/Real_...,http://steamcommunity.com/app/670290/reviews/?...,Free to Play,2017-07-24,Poolians.com,Poolians.com


In [173]:
# Filter 'genres', removing the elements that contain the word 'player'
df_games['genres'] = df_games['genres'].apply(lambda specs: [s for s in specs if 'player' not in s])
# Filter 'specs', keeping only the elements that contain the word 'player'
df_games['specs'] = df_games['specs'].apply(lambda specs: [s for s in specs if 'player' in s])

# Preview the result
df_games.head(3)

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
88310,761140.0,Lost Summoner Kitty,"[action, casual, indie, simulation, strategy, ...","[Strategy, Action, Indie, Casual, Simulation]",[single-player],http://store.steampowered.com/app/761140/Lost_...,http://steamcommunity.com/app/761140/reviews/?...,4.99,2018-01-04,Kotoshiro,Kotoshiro
88311,643980.0,Ironbound,"[free to play, indie, rpg, strategy, free to p...","[Free to Play, Strategy, Indie, RPG, Card Game...","[single-player, multi-player, online multi-pla...",http://store.steampowered.com/app/643980/Ironb...,http://steamcommunity.com/app/643980/reviews/?...,Free To Play,2018-01-04,Secret Level SRL,"Making Fun, Inc."
88312,670290.0,Real Pool 3D - Poolians,"[casual, free to play, indie, simulation, spor...","[Free to Play, Simulation, Sports, Casual, Ind...","[single-player, multi-player, online multi-pla...",http://store.steampowered.com/app/670290/Real_...,http://steamcommunity.com/app/670290/reviews/?...,Free to Play,2017-07-24,Poolians.com,Poolians.com


In [174]:
# Remove duplicates within each list of the 'genres' and 'specs' columns
df_games['genres'] = df_games['genres'].apply(lambda genres: list(set(genres)))
df_games['specs'] = df_games['specs'].apply(lambda genres: list(set(genres)))

# Preview the result
df_games.head(3)

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
88310,761140.0,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]","[Strategy, Action, Indie, Casual, Simulation]",[single-player],http://store.steampowered.com/app/761140/Lost_...,http://steamcommunity.com/app/761140/reviews/?...,4.99,2018-01-04,Kotoshiro,Kotoshiro
88311,643980.0,Ironbound,"[character customization, strategy, free to pl...","[Free to Play, Strategy, Indie, RPG, Card Game...","[multi-player, cross-platform multiplayer, sin...",http://store.steampowered.com/app/643980/Ironb...,http://steamcommunity.com/app/643980/reviews/?...,Free To Play,2018-01-04,Secret Level SRL,"Making Fun, Inc."
88312,670290.0,Real Pool 3D - Poolians,"[simulation, in-app purchases, stats, casual, ...","[Free to Play, Simulation, Sports, Casual, Ind...","[multi-player, single-player, online multi-pla...",http://store.steampowered.com/app/670290/Real_...,http://steamcommunity.com/app/670290/reviews/?...,Free to Play,2017-07-24,Poolians.com,Poolians.com
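The four steps above (fill, merge, lowercase, split on 'player', deduplicate) can also be expressed as one helper applied per row. This is only a sketch: `as_list` and `merge_labels` are hypothetical names, and it deduplicates with `dict.fromkeys` so that the first-seen order is kept, whereas `list(set(...))` returns an arbitrary order.

```python
def as_list(value):
    """Treat anything that is not a list (for example NaN) as an empty list."""
    return value if isinstance(value, list) else []


def merge_labels(genres, tags, specs):
    """Return (genres, specs) for one row: lowercased, deduplicated, split on 'player'."""
    specs_lower = [s.lower() for s in as_list(specs)]
    merged = [x.lower() for x in as_list(genres) + as_list(tags)] + specs_lower
    # dict.fromkeys removes duplicates while keeping first-seen order
    new_genres = [g for g in dict.fromkeys(merged) if "player" not in g]
    new_specs = [s for s in dict.fromkeys(specs_lower) if "player" in s]
    return new_genres, new_specs


row_genres, row_specs = merge_labels(
    ["Action", "Indie"], ["Indie", "Strategy"], ["Single-player", "Steam Achievements"]
)
print(row_genres)  # ['action', 'indie', 'strategy', 'steam achievements']
print(row_specs)   # ['single-player']
```

A stable order makes the previews reproducible between runs, which the set-based version does not guarantee.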


The developer and publisher columns match in many cases, so publisher can be used to fill null values in developer.

In [175]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_games['developer'] = df_games['developer'].fillna(df_games['publisher'])
df_games['developer'].isnull().sum()

np.int64(3232)
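Filling one column from another only helps on rows where the second column has a value, which is why some nulls remain after the step above. A toy sketch (the rows are assumptions; only the developer and publisher names come from the previews above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "developer": ["Kotoshiro", np.nan, np.nan],
    "publisher": ["Kotoshiro", "Making Fun, Inc.", np.nan],
})

# fillna with a Series aligns on the index and fills row by row
toy["developer"] = toy["developer"].fillna(toy["publisher"])

# The last row stays null because publisher is null there too
remaining = int(toy["developer"].isna().sum())
print(toy["developer"].tolist(), remaining)
```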

#### Impute price when 'free to play' appears in genres

In [176]:
# Fill missing 'price' values with 0 when 'free to play' is in 'genres'
df_games['price'] = df_games.apply(
    lambda row: 0 if pd.isna(row['price']) and 'free to play' in row['genres'] else row['price'], 
    axis=1
)

# Preview the result
df_games.head(3)

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
88310,761140.0,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]","[Strategy, Action, Indie, Casual, Simulation]",[single-player],http://store.steampowered.com/app/761140/Lost_...,http://steamcommunity.com/app/761140/reviews/?...,4.99,2018-01-04,Kotoshiro,Kotoshiro
88311,643980.0,Ironbound,"[character customization, strategy, free to pl...","[Free to Play, Strategy, Indie, RPG, Card Game...","[multi-player, cross-platform multiplayer, sin...",http://store.steampowered.com/app/643980/Ironb...,http://steamcommunity.com/app/643980/reviews/?...,Free To Play,2018-01-04,Secret Level SRL,"Making Fun, Inc."
88312,670290.0,Real Pool 3D - Poolians,"[simulation, in-app purchases, stats, casual, ...","[Free to Play, Simulation, Sports, Casual, Ind...","[multi-player, single-player, online multi-pla...",http://store.steampowered.com/app/670290/Real_...,http://steamcommunity.com/app/670290/reviews/?...,Free to Play,2017-07-24,Poolians.com,Poolians.com


In [177]:
df_games['price'].isnull().sum()

np.int64(1171)
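The row-wise `apply` above can also be written with a boolean mask, which keeps the two conditions visible. A sketch on invented rows (all values are assumptions):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price": [np.nan, np.nan, 4.99],
    "genres": [["free to play", "indie"], ["action"], ["free to play"]],
})

# True where the genre list contains the exact label 'free to play'
is_f2p = toy["genres"].apply(lambda labels: "free to play" in labels)

# Only rows that are both missing a price and tagged free to play are set to 0
toy["price"] = toy["price"].mask(toy["price"].isna() & is_f2p, 0)

print(toy["price"].tolist())
```

The second row stays null because nothing indicates that the game is free, and the third keeps its existing price.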

In [178]:
# Remove "free to play" from the lists in the 'genres' column
df_games['genres'] = df_games['genres'].apply(lambda x: [genre for genre in x if genre != 'free to play'])

# Preview the result
df_games.head(3)

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
88310,761140.0,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]","[Strategy, Action, Indie, Casual, Simulation]",[single-player],http://store.steampowered.com/app/761140/Lost_...,http://steamcommunity.com/app/761140/reviews/?...,4.99,2018-01-04,Kotoshiro,Kotoshiro
88311,643980.0,Ironbound,"[character customization, strategy, design & i...","[Free to Play, Strategy, Indie, RPG, Card Game...","[multi-player, cross-platform multiplayer, sin...",http://store.steampowered.com/app/643980/Ironb...,http://steamcommunity.com/app/643980/reviews/?...,Free To Play,2018-01-04,Secret Level SRL,"Making Fun, Inc."
88312,670290.0,Real Pool 3D - Poolians,"[simulation, in-app purchases, stats, casual, ...","[Free to Play, Simulation, Sports, Casual, Ind...","[multi-player, single-player, online multi-pla...",http://store.steampowered.com/app/670290/Real_...,http://steamcommunity.com/app/670290/reviews/?...,Free to Play,2017-07-24,Poolians.com,Poolians.com


Identify the null id

In [179]:
df_games[df_games['id'].isnull()]

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
119271,,Batman: Arkham City - Game of the Year Edition,"[open world, metroidvania, controller, beat 'e...","[Action, Open World, Batman, Adventure, Stealt...",[single-player],http://store.steampowered.com/app/200260,,19.99,2012-09-07,"Rocksteady Studios,Feral Interactive (Mac)","Warner Bros. Interactive Entertainment, Feral ..."


In [180]:
# The id appears in the URL, so check whether the record already exists
df_games[df_games['id'] == 200260]

Unnamed: 0,id,app_name,genres,tags,specs,url,reviews_url,price,release_date,developer,publisher
89378,200260.0,Batman: Arkham City - Game of the Year Edition,"[open world, metroidvania, controller, beat 'e...","[Action, Open World, Batman, Adventure, Stealt...",[single-player],http://store.steampowered.com/app/200260/Batma...,http://steamcommunity.com/app/200260/reviews/?...,19.99,2012-09-07,"Rocksteady Studios,Feral Interactive (Mac)","Warner Bros. Interactive Entertainment, Feral ..."
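The id was read off the URL by eye here. If more ids were missing, it could be extracted programmatically. The helper below is a sketch with a hypothetical name (`app_id_from_url`), assuming store URLs always contain `/app/<digits>` as in the rows shown:

```python
import re

# Assumption: the numeric id follows "/app/" in the store URL
APP_ID = re.compile(r"/app/(\d+)")


def app_id_from_url(url):
    """Return the numeric app id found in a store URL, or None if there is none."""
    if not isinstance(url, str):
        return None
    match = APP_ID.search(url)
    return int(match.group(1)) if match else None


print(app_id_from_url("http://store.steampowered.com/app/200260"))  # 200260
```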


In [181]:
# Drop the rows with a null id, now that the record is confirmed to be a duplicate
df_games = df_games.dropna(subset=['id'])
# Change the column's data type
df_games['id'] = df_games['id'].astype('int64')
df_games['id'].dtype

dtype('int64')

In [182]:
# Check for null values again
df_games.isnull().sum()

id                 0
app_name           0
genres             0
tags               0
specs              0
url                0
reviews_url        0
price           1171
release_date    2066
developer       3232
publisher       8050
dtype: int64

#### Create a DataFrame with the game and review URLs
- The game URLs may be needed to improve the system by recommending the games' store pages.
- The review URLs could be used for web scraping in the future.

In [183]:
games_urls = df_games[['id', 'url', 'reviews_url']]

Drop the tags, publisher, url and reviews_url columns

In [184]:
# Drop the unnecessary columns and preview
df_steam = df_games.drop(columns=['tags', 'publisher', 'url', 'reviews_url'])
df_steam.head(3)

Unnamed: 0,id,app_name,genres,specs,price,release_date,developer
88310,761140,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]",[single-player],4.99,2018-01-04,Kotoshiro
88311,643980,Ironbound,"[character customization, strategy, design & i...","[multi-player, cross-platform multiplayer, sin...",Free To Play,2018-01-04,Secret Level SRL
88312,670290,Real Pool 3D - Poolians,"[simulation, in-app purchases, stats, casual, ...","[multi-player, single-player, online multi-pla...",Free to Play,2017-07-24,Poolians.com


Check data types

In [185]:
[print(c, df_steam[c].dtype) for c in df_steam.columns]

id int64
app_name object
genres object
specs object
price object
release_date object
developer object


[None, None, None, None, None, None, None]

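#### Convert the remaining columns

In [186]:
# Convert the text columns to str
# Note: astype('str') also turns NaN into the string 'nan', so developer reports 0 nulls from here on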
df_steam = df_steam.astype({'app_name': 'str', 'price': 'str', 'developer': 'str'})
[print(c, df_steam[c].dtype) for c in df_steam.columns]

id int64
app_name object
genres object
specs object
price object
release_date object
developer object


[None, None, None, None, None, None, None]
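The null count for developer drops from 3232 to 0 further down without any imputation step, which is consistent with the missing values having been turned into the text 'nan' by the string conversion. If the nulls should stay detectable, one option is to convert only the non-null entries. This is a sketch on a two-element toy Series, not the notebook's method:

```python
import numpy as np
import pandas as pd

developer = pd.Series(["Kotoshiro", np.nan], dtype=object)

# Python renders a float NaN as the text "nan"; once converted element by
# element, the missing value looks like an ordinary string
nan_as_text = str(np.nan)

# Convert only the non-null entries and leave NaN in place
safe = developer.where(developer.isna(), developer.astype(str))

print(nan_as_text, int(safe.isna().sum()))
```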

In [187]:
# Define a function to handle prices:
# if a value contains the word Free, the price is set to 0
def handle_price(price):
    '''
    Return 0.0 if the price contains the word Free or free;
    otherwise try to convert it to float and return NaN if that fails.
    '''
    if 'Free' in price or 'free' in price: 
        return 0.0
    # No "Free" in the value: try the float conversion and fall back to NaN
    try:
        return float(price)
    except ValueError:
        return np.nan

In [188]:
# Apply handle_price: values that mention Free (e.g. "Free to Play") become 0.0
df_steam['price'] = df_steam['price'].apply(handle_price)
# Check the dtype of the price column

df_steam['price'].dtype

dtype('float64')
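The same rule can be sketched without a Python-level function by combining `pd.to_numeric` with a mask. The input strings below are assumptions chosen to cover each branch ("Third-party" stands for any non-numeric text), and `case=False` is slightly broader than the original check because it also matches spellings such as "FREE":

```python
import pandas as pd

raw = pd.Series(["4.99", "Free To Play", "Free to Play", "nan", "Third-party"])

# Rows whose text mentions "free" in any letter case
is_free = raw.str.contains("free", case=False)

# Non-numeric text becomes NaN, then the free rows are overwritten with 0.0
price = pd.to_numeric(raw, errors="coerce").mask(is_free, 0.0)

print(price.tolist())
```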

In [189]:
df_steam.head(3)

Unnamed: 0,id,app_name,genres,specs,price,release_date,developer
88310,761140,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]",[single-player],4.99,2018-01-04,Kotoshiro
88311,643980,Ironbound,"[character customization, strategy, design & i...","[multi-player, cross-platform multiplayer, sin...",0.0,2018-01-04,Secret Level SRL
88312,670290,Real Pool 3D - Poolians,"[simulation, in-app purchases, stats, casual, ...","[multi-player, single-player, online multi-pla...",0.0,2017-07-24,Poolians.com


In [190]:
# Parse dates written in different formats into a single datetime format
df_steam['release_date'] = pd.to_datetime(df_steam['release_date'], errors='coerce')

# Extract the year from the parsed dates
df_steam['release_year'] = df_steam['release_date'].dt.year

# Show the first rows as an example
df_steam.head()

Unnamed: 0,id,app_name,genres,specs,price,release_date,developer,release_year
88310,761140,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]",[single-player],4.99,2018-01-04,Kotoshiro,2018.0
88311,643980,Ironbound,"[character customization, strategy, design & i...","[multi-player, cross-platform multiplayer, sin...",0.0,2018-01-04,Secret Level SRL,2018.0
88312,670290,Real Pool 3D - Poolians,"[simulation, in-app purchases, stats, casual, ...","[multi-player, single-player, online multi-pla...",0.0,2017-07-24,Poolians.com,2017.0
88313,767400,弹炸人2222,"[casual, action, adventure]",[single-player],0.99,2017-12-07,彼岸领域,2017.0
88314,773570,Log Challenge,"[oculus rift, room-scale, action, casual, spor...",[single-player],2.99,NaT,,


In [191]:
# Drop the release_date column and show the first three rows
df_steam = df_steam.drop(columns=['release_date'])
df_steam.head(3)

Unnamed: 0,id,app_name,genres,specs,price,developer,release_year
88310,761140,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]",[single-player],4.99,Kotoshiro,2018.0
88311,643980,Ironbound,"[character customization, strategy, design & i...","[multi-player, cross-platform multiplayer, sin...",0.0,Secret Level SRL,2018.0
88312,670290,Real Pool 3D - Poolians,"[simulation, in-app purchases, stats, casual, ...","[multi-player, single-player, online multi-pla...",0.0,Poolians.com,2017.0


In [192]:
df_steam.isna().sum()

id                 0
app_name           0
genres             0
specs              0
price           1181
developer          0
release_year    2351
dtype: int64
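release_year has more nulls (2351) than release_date had (2066) because `errors='coerce'` turns every unparseable date into NaT, and the years print as 2018.0 because a column holding NaN falls back to float. A sketch of both effects, using pandas' nullable `Int64` dtype to keep whole-number years (the value "Soon.." is an invented example of an unparseable date):

```python
import pandas as pd

dates = pd.Series(["2018-01-04", "2017-07-24", "Soon.."])

# Unparseable text becomes NaT instead of raising
parsed = pd.to_datetime(dates, errors="coerce")

# Nullable integer dtype: missing years are stored as <NA>, the rest stay integers
release_year = parsed.dt.year.astype("Int64")

print(release_year.tolist())
```

Whether the nullable dtype is preferable depends on the consumers of the exported file, since some tools read it differently from a plain float column.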

### Genre Processing

In [193]:
# Dictionary that maps abbreviations and spelling variants to full names
genre_replacements = {
    'fps': 'first-person shooter',
    'first person shooter': 'first-person shooter',
    'first-person shooter': 'first-person shooter',
    'first person': 'first-person',
    'tps': 'third-person shooter',
    'third person': 'third-person',
    'third person shooter': 'third-person shooter',
    'third-person shooter': 'third-person shooter',
    'sim': 'simulation',
    'simulator': 'simulation',
    'rts': 'real-time strategy',
    'platform': 'platformer',
    'story rich': 'story-rich',
    'rpg': 'role-playing',
    'dungeon': 'role-playing',
    'shoot': 'shooter',
    'combat': 'fighting',
    'fiction': 'sci-fi',
    'real time': 'real-time',
    "shoot 'em up": "shoot'em up"
}

In [194]:
# Define a function that appends normalized values to the genre list
def replace_genres(genre_list):
    '''
    Takes a list of genres, looks for the substrings defined in genre_replacements and appends the matching normalized genres.
    If the value is null or not a list, it is returned unchanged.
    '''
    # Null value (None or a single NaN): return it unchanged
    if genre_list is None or (isinstance(genre_list, float) and pd.isna(genre_list)):
        return genre_list
    
    # Not a list: return it unchanged
    if not isinstance(genre_list, list):
        return genre_list
    
    replaced_genres = list(genre_list)  # Copy the original list so it is not modified in place
    
    for genre in genre_list:
        # Walk the replacement dictionary and append the replacement when its key is a substring of the genre
        for key, replacement in genre_replacements.items():
            if key in genre and replacement not in replaced_genres:  # Avoid duplicates
                replaced_genres.append(replacement)

    return replaced_genres  # The original list plus the appended genres

In [195]:
df_steam['genres'] = df_steam['genres'].apply(replace_genres)
df_steam.head()

Unnamed: 0,id,app_name,genres,specs,price,developer,release_year
88310,761140,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]",[single-player],4.99,Kotoshiro,2018.0
88311,643980,Ironbound,"[character customization, strategy, design & i...","[multi-player, cross-platform multiplayer, sin...",0.0,Secret Level SRL,2018.0
88312,670290,Real Pool 3D - Poolians,"[simulation, in-app purchases, stats, casual, ...","[multi-player, single-player, online multi-pla...",0.0,Poolians.com,2017.0
88313,767400,弹炸人2222,"[casual, action, adventure]",[single-player],0.99,彼岸领域,2017.0
88314,773570,Log Challenge,"[oculus rift, room-scale, action, casual, spor...",[single-player],2.99,,


In [196]:
# List of all the final normalized genres
genres = [
    'indie', 'action', 'adventure', 'casual', 'simulation', 'strategy', "co-op", 'puzzle', 'sports', 'online', 
    'platformer', "story rich", 'sci-fi', 'fantasy', 'horror', 'anime', 'shooter', 'racing', "first-person", 
    'local', 'sandbox', 'arcade', 'retro', "turn-based", 'mmo', 'comedy', 'survival', 'classic', 'gore', 
    "third-person", "visual novel", 'utilities', 'exploration', "hidden object", 'rogue-like', 'tactical', 
    'rpgmaker', 'design', 'illustration', 'physics', 'mystery', 'psychological', 'education', 'historical', 
    "fast-paced", 'building', "tower defense", 'relaxing', 'war', 'hack', 'slash', 'crafting', 'jrpg', 
    'animation', 'modeling', "party-based", 'fighting', 'rogue-lite', 'competitive', "role-playing", 'pvp', 'card', 
    "post-apocalyptic", 'board', 'cyberpunk', "top-down", 'metroidvania', 'drama', 'military', 'mmorpg', 
    'futuristic', 'city builder', 'flight', 'romance', 'trains', 'driving', 'dark', 'surreal', 'windows mixed reality', 
    'video', 'production', 'mature', 'steampunk', 'economy', "game development", "team-based", 'dating', 'audio', 
    'thriller', 'dark humor', "text-based", 'lovecraftian', 'arena', "real-time", 'clicker', 'otome', 'moba', 
    'gamemaker', 'mod', 'science', 'crpg', 'tactics', 'fmv', "e-sports", "god game", 'pve', 'hunting', 'mythology', 
    "martial arts", "class-based", 'agriculture', '6dof', "lore-rich", 'sailing', 'offroad', 'campaign', 'fishing', 
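    # 'philisophical' is kept as originally written; as a substring it cannot match a tag spelled 'philosophical'
    'philisophical', 'mining', 'investigation', 'conversation', 'wrestling', 'nsfw', "shoot'em up"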
]

In [197]:
# Define a function that scans every genre element and returns only the normalized genres
def new_genres(old_genres):
    '''
    Takes the list of old, non-normalized genres and checks each element for substrings that match the final genres; every final genre found is added to the list that is returned.
    '''
    # List that collects the final genres
    res = []

    # Only process values that are not None and not a single NaN
    if old_genres is not None and not (isinstance(old_genres, float) and pd.isna(old_genres)):
        # Walk each element of the original list
        for genre in old_genres:
            # Walk the final genres and add each one whose text appears in the element
            for final_genre in genres:
                if final_genre in genre and final_genre not in res and 'steam ' not in genre: # avoid duplicates and skip tags that mention steam
                    res.append(final_genre)

    return res  # Return the list of final genres

In [198]:
# Create a new column with the new genre values so they can be compared
df_steam['new_genres'] = df_steam['genres'].apply(new_genres)

# Compare the previous genres with the normalized genres
df_steam.head(3)

Unnamed: 0,id,app_name,genres,specs,price,developer,release_year,new_genres
88310,761140,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]",[single-player],4.99,Kotoshiro,2018.0,"[simulation, action, strategy, casual, indie]"
88311,643980,Ironbound,"[character customization, strategy, design & i...","[multi-player, cross-platform multiplayer, sin...",0.0,Secret Level SRL,2018.0,"[strategy, design, illustration, board, pvp, t..."
88312,670290,Real Pool 3D - Poolians,"[simulation, in-app purchases, stats, casual, ...","[multi-player, single-player, online multi-pla...",0.0,Poolians.com,2017.0,"[simulation, casual, sports, indie, strategy, ..."


In [199]:
# Replace genres with new_genres and drop the auxiliary new_genres column
df_steam['genres'] = df_steam['new_genres']
df_steam = df_steam.drop(columns=['new_genres'])

# Preview
df_steam.head(3)

Unnamed: 0,id,app_name,genres,specs,price,developer,release_year
88310,761140,Lost Summoner Kitty,"[simulation, action, strategy, casual, indie]",[single-player],4.99,Kotoshiro,2018.0
88311,643980,Ironbound,"[strategy, design, illustration, board, pvp, t...","[multi-player, cross-platform multiplayer, sin...",0.0,Secret Level SRL,2018.0
88312,670290,Real Pool 3D - Poolians,"[simulation, casual, sports, indie, strategy, ...","[multi-player, single-player, online multi-pla...",0.0,Poolians.com,2017.0
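Both genre functions match by substring, so a short entry such as 'war' also fires inside longer, unrelated words. The sketch below contrasts that with a whole-word match; the function names, the three-entry vocabulary and the sample labels are assumptions for illustration. Whole-word matching is stricter and would also stop matching plural or suffixed forms, so it is a trade-off rather than a drop-in replacement.

```python
import re


def match_substring(label, vocabulary):
    """Entries that appear anywhere inside the label (the approach used above)."""
    return [g for g in vocabulary if g in label]


def match_whole_word(label, vocabulary):
    """Entries that appear as a standalone word, not inside another word."""
    return [
        g for g in vocabulary
        if re.search(rf"(?<![\w-]){re.escape(g)}(?![\w-])", label)
    ]


vocabulary = ["war", "mod", "card"]
print(match_substring("software training", vocabulary))   # ['war']
print(match_whole_word("software training", vocabulary))  # []
print(match_whole_word("world war ii", vocabulary))       # ['war']
```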


In [200]:
# Unnest genres and specs into one row per value
df_steam = df_steam.explode('genres')
df_steam = df_steam.explode('specs')
df_steam.reset_index(drop=True, inplace=True)

# Show the last five rows
df_steam.tail()

Unnamed: 0,id,app_name,genres,specs,price,developer,release_year
192950,658870,EXIT 2 - Directions,indie,single-player,4.99,"xropi,stev3ns",2017.0
192951,681550,Maze Run VR,simulation,single-player,4.99,,
192952,681550,Maze Run VR,action,single-player,4.99,,
192953,681550,Maze Run VR,adventure,single-player,4.99,,
192954,681550,Maze Run VR,indie,single-player,4.99,,
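Exploding two list columns in sequence yields every genre and spec combination per game, which is why the row count grows far beyond the number of games. A toy sketch (ids and labels are assumptions) also shows that an empty list becomes a single row with NaN:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2],
    "genres": [["action", "indie", "casual"], []],
    "specs": [["single-player", "multi-player"], ["single-player"]],
})

# Game 1: 3 genres x 2 specs = 6 rows; game 2: empty genres -> 1 NaN row x 1 spec
long = toy.explode("genres").explode("specs").reset_index(drop=True)

print(len(long), long["id"].nunique())
```

Any count taken on the exploded frame should therefore be computed over distinct id values, not over rows.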


## Data Loading

The processed data is saved in Parquet and CSV formats, each in its own folder, so that either can be used depending on the situation. If the directories do not exist, they are created.

In [201]:
export(df_steam, project_root, 'steam_games')
# The game and review URLs can also be exported in case web scraping is planned in the future;
# uncomment the next line to export the URL files if needed
#export(games_urls, project_root, 'games_urls')
# The export function is defined in ./functions/ETL.py

Archivos exportados exitosamente.
Archivos exportados exitosamente.
