# 📊 RA1 - Video Game Analysis with Pandas

## Phase 1: Data Exploration and Cleaning

Goal: load, explore, clean, and normalize the dataset so it is ready for ETL. The phase is organized into five steps, each with a brief explanation and its code.

### 1. Import libraries
We import Pandas and NumPy for data handling, plus `sqlite3` and `pathlib` for the load step, and configure Pandas to display every column.

In [31]:
import pandas as pd
import numpy as np
import sqlite3
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
print(f"Pandas: {pd.__version__}")

Pandas: 2.1.4


### 2. Load the dataset
The loader tries a relative path (local run) and a container path, and falls back to the first one if neither exists.

In [32]:
import os

possible_paths = [
    '../data/videogames.csv',   # local run
    '/app/data/videogames.csv'  # container run
]

# Use the first path that exists; fall back to the local path otherwise
csv_path = next((p for p in possible_paths if os.path.exists(p)), possible_paths[0])

df = pd.read_csv(csv_path)
print(f"Loaded path: {csv_path}")
print(f"Dimensions: {df.shape[0]} rows x {df.shape[1]} columns")
df.head()

Loaded path: ../data/videogames.csv
Dimensions: 10000 rows x 21 columns


Unnamed: 0,name,genre,cost,platform,popularity,pegi,year,developer,publisher,region,mode,engine,award,dlc_support,language,metascore,user_score,reviews,rating_source,copies_sold_millions,revenue_millions_usd
0,Super Mario Odyssey,Action,74.45,Mobile,56,7+,2011,Capcom,Square Enix,?,Multiplayer,CryEngine,Indie Award,Unknown,JP,?,?,9.388708962265735,Metacritic,41.93,?
1,God of War,RPG,0,Mobile,?,7+,2023,Rockstar,Nintendo,Global,Online,Unity,?,Y,DE,98.1,8.4,?,IGN,1.5M,
2,Persona 5 Royal,Shooter,Free,PS,64,12,2020,nintendo,Square Enix,,Single-player,Custom Engine,GotY,Y,DE,31.7,2.6,?,IGN,25.08,889.0
3,NBA 2K24,Puzzle,,Mobile,972.7113240416031,RP,2017,Sony,Square Enix,Global,Single-player,Custom Engine,NONE,?,ES,80/100,,?,OpenCritic,,$500M
4,Overwatch,?,33.4,PC,612.6268621737502,18+,2015,Nintendo,Bandai Namco,NA/EU,Multiplayer,Custom,Indie Award,Unknown,IT,36.0,2.3,,Metacritic,,$1B
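
The fallback logic used to pick `csv_path` can be factored into a small reusable helper. The following is a minimal sketch; the name `first_existing` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def first_existing(candidates, default=None):
    """Return the first candidate path that exists on disk.

    Falls back to `default` (or to the first candidate) when none exists,
    mirroring the `next(...)` expression used to choose csv_path.
    """
    for candidate in candidates:
        if Path(candidate).exists():
            return candidate
    return default if default is not None else candidates[0]


# Same two locations as in the notebook
csv_path = first_existing(["../data/videogames.csv", "/app/data/videogames.csv"])
```

Keeping the fallback in one function would let the load step reuse the same rule when it chooses the database location.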


### 3. Inspect data types
We look at the structure of the DataFrame and summarize the nulls in each column.

In [33]:
print("="*70)
print("DATASET INFO")
print("="*70)
df.info()

print("\n" + "="*70)
print("NULLS PER COLUMN")
print("="*70)
# Count and percentage of nulls, keeping only the columns that have any
nulls = pd.DataFrame({
    'columna': df.columns,
    'nulos': df.isnull().sum(),
    '%': (df.isnull().sum() / len(df) * 100).round(2)
}).query('nulos > 0').sort_values('%', ascending=False)

nulls

DATASET INFO
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   name                  9714 non-null   object
 1   genre                 10000 non-null  object
 2   cost                  8350 non-null   object
 3   platform              10000 non-null  object
 4   popularity            7467 non-null   object
 5   pegi                  10000 non-null  object
 6   year                  9768 non-null   object
 7   developer             10000 non-null  object
 8   publisher             10000 non-null  object
 9   region                8602 non-null   object
 10  mode                  10000 non-null  object
 11  engine                10000 non-null  object
 12  award                 8578 non-null   object
 13  dlc_support           10000 non-null  object
 14  language              10000 non-null  object
 15  metascore           

| columna | nulos | % |
|---|---|---|
| reviews | 2566 | 25.66 |
| popularity | 2533 | 25.33 |
| metascore | 2491 | 24.91 |
| user_score | 2439 | 24.39 |
| copies_sold_millions | 2015 | 20.15 |
| revenue_millions_usd | 1977 | 19.77 |
| cost | 1650 | 16.5 |
| award | 1422 | 14.22 |
| region | 1398 | 13.98 |
| name | 286 | 2.86 |
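
The summary above counts only the values that Pandas parsed as NaN at load time; placeholders such as `?` or `Unknown`, visible in the `head()` output, are still ordinary strings at this point. One option, sketched here on a made-up three-row sample, is to declare those tokens through the `na_values` parameter of `pd.read_csv` so they count as missing from the start. The sample data is illustrative only.

```python
import io

import pandas as pd

# Made-up sample that imitates the placeholders seen in the dataset
sample_csv = io.StringIO(
    "name,cost,dlc_support\n"
    "Game A,?,Unknown\n"
    "Game B,19.99,Y\n"
    "Game C,,?\n"
)

# Extra tokens to treat as missing, on top of the pandas defaults
sample = pd.read_csv(sample_csv, na_values=["?", "Unknown"])

null_counts = sample.isnull().sum()
print(null_counts)
```

With the placeholders declared, `cost` is also read as a numeric column instead of text.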


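### 4. Detect and handle missing values and duplicates
We replace placeholder values with NaN, drop duplicate rows, discard columns and rows that are more than 60% null, and impute what remains (median for numeric columns, mode for the rest).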

In [34]:
# Working copy
df_clean = df.copy()

# 4.a Placeholder values -> NaN
specials = ['?', 'N/A', 'Unknown', 'unknown', '', ' ', 'nan', 'NaN']
df_clean = df_clean.replace(specials, np.nan)

# 4.b Drop duplicates
before = len(df_clean)
df_clean = df_clean.drop_duplicates()
removed = before - len(df_clean)

print(f"Duplicates removed: {removed}")

# ---- Missing-value treatment ----
print("\nNull summary BEFORE treatment:")
print(df_clean.isnull().sum().sort_values(ascending=False))

# 4.c Drop columns that are more than 60% null
missing_ratio_cols = df_clean.isnull().mean()
cols_to_drop = missing_ratio_cols[missing_ratio_cols > 0.6].index.tolist()
df_clean = df_clean.drop(columns=cols_to_drop)

# 4.d Drop rows that are more than 60% null
missing_ratio_rows = df_clean.isnull().mean(axis=1)
rows_to_drop = missing_ratio_rows[missing_ratio_rows > 0.6].index
df_clean = df_clean.drop(index=rows_to_drop)

# 4.e Impute the rest:
#     - numeric -> median
#     - categorical/text -> mode
# Note: df.info() above reports object dtypes, so columns that still hold
# numbers as text are not selected as numeric here and receive the mode.
num_cols = df_clean.select_dtypes(include='number').columns
cat_cols = df_clean.select_dtypes(exclude='number').columns

for col in num_cols:
    med = df_clean[col].median()
    df_clean[col] = df_clean[col].fillna(med)

for col in cat_cols:
    moda = df_clean[col].mode()
    if not moda.empty:
        df_clean[col] = df_clean[col].fillna(moda.iloc[0])

print("\nNull summary AFTER treatment:")
print(df_clean.isnull().sum().sort_values(ascending=False))
print(f"\nTotal nulls remaining: {int(df_clean.isnull().sum().sum())}")

Duplicates removed: 0

Null summary BEFORE treatment:
reviews                 5062
user_score              5002
metascore               4977
popularity              4974
copies_sold_millions    4014
revenue_millions_usd    3981
cost                    3353
dlc_support             2863
region                  2851
award                   2849
pegi                    2082
mode                    1466
genre                   1446
rating_source           1385
engine                  1209
language                 927
publisher                845
developer                695
name                     588
year                     453
platform                   0
dtype: int64

Null summary AFTER treatment:
name                    0
engine                  0
copies_sold_millions    0
rating_source           0
reviews                 0
user_score              0
metascore               0
language                0
dlc_support             0
award                   0
mode    
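
Step 4.e selects numeric columns with `select_dtypes(include='number')`. Because `df.info()` lists the columns as `object`, figures stored as text are not picked up as numeric and are filled with the mode rather than the median. The sketch below, on a made-up four-row sample, shows one way to get median imputation: convert with `pd.to_numeric` first. It illustrates the idea and is not the notebook's own procedure.

```python
import pandas as pd

# Made-up sample: numbers stored as text, as in the raw dataset
toy = pd.DataFrame({
    "cost": ["10", "20", None, "40"],
    "genre": ["RPG", None, "RPG", "Action"],
})

# While "cost" is text, it is not selected as a numeric column
numeric_before = toy.select_dtypes(include="number").columns.tolist()

# Convert first, then impute: median for numbers, mode for text
toy["cost"] = pd.to_numeric(toy["cost"], errors="coerce")
toy["cost"] = toy["cost"].fillna(toy["cost"].median())
toy["genre"] = toy["genre"].fillna(toy["genre"].mode().iloc[0])

print(numeric_before)
print(toy)
```

Values that cannot be converted become NaN under `errors="coerce"`, so they are imputed together with the original gaps.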

### 5. Normalize / transform columns
We trim whitespace in text columns, unify platform names, parse numeric metrics from text, and min-max scale them into new columns.

In [35]:
# 5.a Text: trim whitespace
text_cols = df_clean.select_dtypes(include=['object']).columns
for col in text_cols:
    df_clean[col] = df_clean[col].astype(str).str.strip()

# 5.b Unify platform names
platform_map = {
    'ps': 'PS', 'playstation': 'PS', 'ps1': 'PS', 'ps2': 'PS', 'ps3': 'PS', 'ps4': 'PS', 'ps5': 'PS',
    'xbox': 'Xbox', 'xbox one': 'Xbox', 'xbox series': 'Xbox', 'xbox series x': 'Xbox', 'xbox series s': 'Xbox',
    'pc': 'PC', 'windows': 'PC',
    'mobile': 'Mobile',
    'nintendo switch': 'Switch', 'switch': 'Switch'
}

if 'platform' in df_clean.columns:
    df_clean['platform'] = (
        df_clean['platform']
        .astype(str)
        .str.lower()
        .map(lambda x: platform_map.get(x, x.title()))
    )

# 5.c Parsing helpers
import re  # used by parse_cost and parse_millions

def parse_cost(x):
    """Convert prices such as '$59.99', '€49,99', 'free' to float."""
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    if s in ['free', 'gratis', '0', '0.0']:
        return 0.0
    s = s.replace('$', '').replace('€', '')
    s = s.replace(',', '.')          # decimal comma -> decimal point
    s = re.sub(r'[^0-9\.]', '', s)   # keep only digits and dots
    if s == '':
        return np.nan
    try:
        return float(s)
    except ValueError:
        return np.nan

def parse_score(x):
    """Convert '85', '8.5', etc. to float (NaN if it cannot be parsed)."""
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    if s in ['tbd', 'tba', 'n/a', 'na', '-']:
        return np.nan
    s = s.replace(',', '.')
    # Note: ratio strings such as '80/100' are not handled and become NaN
    try:
        return float(s)
    except ValueError:
        return np.nan

def parse_millions(x):
    """Convert '10M', '5.5m', '0.8B' to millions (float)."""
    if pd.isna(x):
        return np.nan
    s = str(x).strip().lower()
    s = s.replace(',', '.')
    # Note: re.match anchors at the start, so a leading '$' (e.g. '$500M')
    # does not match and the value becomes NaN
    m = re.match(r'([0-9]*\.?[0-9]+)\s*([mb]?)', s)
    if not m:
        return np.nan
    value = float(m.group(1))
    suf = m.group(2)
    if suf == 'b':   # billions -> millions
        value *= 1000
    return value

# 5.d Apply the transformations if the columns exist
if 'cost' in df_clean.columns:
    df_clean['cost_usd'] = df_clean['cost'].apply(parse_cost)
if 'metascore' in df_clean.columns:
    df_clean['metascore_num'] = df_clean['metascore'].apply(parse_score)
if 'user_score' in df_clean.columns:
    df_clean['user_score_num'] = df_clean['user_score'].apply(parse_score)
if 'copies_sold_millions' in df_clean.columns:
    df_clean['copies_sold_millions_num'] = df_clean['copies_sold_millions'].apply(parse_millions)
if 'revenue_millions_usd' in df_clean.columns:
    df_clean['revenue_millions_usd_num'] = df_clean['revenue_millions_usd'].apply(parse_millions)

# 5.e Min-max scale numeric metrics to [0,1] in new *_scaled columns
num_cols = df_clean.select_dtypes(include='number').columns
for col in num_cols:
    col_min = df_clean[col].min()
    col_max = df_clean[col].max()
    if col_min == col_max:
        df_clean[col + '_scaled'] = 0.0
    else:
        df_clean[col + '_scaled'] = (df_clean[col] - col_min) / (col_max - col_min)

# In case a transformation produced new NaN values, impute them again
num_cols_all = df_clean.select_dtypes(include='number').columns
for col in num_cols_all:
    if df_clean[col].isna().any():
        med = df_clean[col].median()
        df_clean[col] = df_clean[col].fillna(med)

print("Total nulls after transformations:", int(df_clean.isnull().sum().sum()))

cols_preview = [c for c in [
    'cost_usd', 'metascore_num', 'user_score_num',
    'copies_sold_millions_num', 'revenue_millions_usd_num'
] if c in df_clean.columns]

df_clean[cols_preview].head()

Total nulls after transformations: 0


|   | cost_usd | metascore_num | user_score_num | copies_sold_millions_num | revenue_millions_usd_num |
|---|---|---|---|---|---|
| 0 | 74.45 | 59.4 | 5.1 | 41.93 | 992.9 |
| 1 | 0.0 | 98.1 | 8.4 | 1.5 | 992.9 |
| 2 | 0.0 | 31.7 | 2.6 | 25.08 | 889.0 |
| 3 | 0.0 | 59.4 | 5.1 | 1.5 | 992.9 |
| 4 | 33.4 | 36.0 | 2.3 | 1.5 | 992.9 |
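
In `parse_millions`, `re.match` only matches at the start of the string, so a value with a leading currency symbol such as `$500M` does not match and comes back as NaN, which the later median imputation then fills. A minimal sketch of a more tolerant variant follows. The name `parse_amount_millions` is hypothetical, and switching to `re.search` is a suggestion, not what the notebook does.

```python
import math
import re

# Number followed by an optional 'm' (millions) or 'b' (billions) suffix
_AMOUNT = re.compile(r"([0-9]*\.?[0-9]+)\s*([mb]?)")


def parse_amount_millions(raw):
    """Hypothetical variant of parse_millions that tolerates a currency prefix."""
    if raw is None:
        return math.nan
    text = str(raw).strip().lower().replace(",", ".")
    found = _AMOUNT.search(text)  # search, not match: skips a leading '$'
    if found is None:
        return math.nan
    value = float(found.group(1))
    if found.group(2) == "b":     # billions -> millions
        value *= 1000
    return value


for raw in ("$500M", "$1B", "1.5M", "889.0", "?"):
    print(raw, "->", parse_amount_millions(raw))
```

Using such a variant would change the imputed revenue values shown above, so the downstream aggregates would need to be recomputed.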


---

## ETL Process with Pandas

Goal: Extract, Transform, and Load (ETL) the clean data into a SQLite database in the `warehouse` directory.

### 6. EXTRACTION (E)
We start from a copy of the cleaned and normalized DataFrame produced in Phase 1.

In [40]:
df_etl = df_clean.copy()

print("="*70)
print("EXTRACTION - Clean dataset loaded")
print("="*70)
print(f"Records: {len(df_etl)}")
print(f"Columns: {len(df_etl.columns)}")
print(f"Total nulls: {df_etl.isnull().sum().sum()}")
print("\nFirst rows:")
df_etl.head()

EXTRACTION - Clean dataset loaded
Records: 10000
Columns: 31
Total nulls: 0

First rows:


Unnamed: 0,name,genre,cost,platform,popularity,pegi,year,developer,publisher,region,mode,engine,award,dlc_support,language,metascore,user_score,reviews,rating_source,copies_sold_millions,revenue_millions_usd,cost_usd,metascore_num,user_score_num,copies_sold_millions_num,revenue_millions_usd_num,cost_usd_scaled,metascore_num_scaled,user_score_num_scaled,copies_sold_millions_num_scaled,revenue_millions_usd_num_scaled
0,Super Mario Odyssey,Action,74.45,Mobile,56.0,7+,2011,Capcom,Square Enix,JP,Multiplayer,CryEngine,Indie Award,Yes,JP,80/100,9/10,9.388708962265737,Metacritic,41.93,$500M,74.45,59.4,5.1,41.93,992.9,0.621193,0.4925,0.51,0.838956,0.496371
1,God of War,RPG,0,Mobile,14.0,7+,2023,Rockstar,Nintendo,Global,Online,Unity,Nominated,Y,DE,98.1,8.4,28238.0,IGN,1.5M,$500M,0.0,98.1,8.4,1.5,992.9,0.0,0.97625,0.84,0.027108,0.496371
2,Persona 5 Royal,Shooter,Free,PS,64.0,12,2020,nintendo,Square Enix,JP,Single-player,Custom Engine,GotY,Y,DE,31.7,2.6,28238.0,IGN,25.08,889.0,0.0,31.7,2.6,25.08,889.0,0.0,0.14625,0.26,0.500602,0.444361
3,NBA 2K24,Puzzle,Free,Mobile,972.7113240416032,RP,2017,Sony,Square Enix,Global,Single-player,Custom Engine,NONE,Yes,ES,80/100,9/10,28238.0,OpenCritic,1.5M,$500M,0.0,59.4,5.1,1.5,992.9,0.0,0.4925,0.51,0.027108,0.496371
4,Overwatch,Adventure,33.4,PC,612.6268621737502,18+,2015,Nintendo,Bandai Namco,NA/EU,Multiplayer,Custom,Indie Award,Yes,IT,36.0,2.3,28238.0,Metacritic,1.5M,$1B,33.4,36.0,2.3,1.5,992.9,0.278682,0.2,0.23,0.027108,0.496371
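
The `*_scaled` columns above follow the usual min-max formula, `(x - min) / (max - min)`. As a worked check, the `metascore_num_scaled` values shown are consistent with a column minimum of 20 and a maximum of 100. Those two bounds are inferred from the printed rows and are not stated in the notebook.

```python
def min_max_scale(value, lowest, highest):
    """Scale `value` to [0, 1]; a constant column maps to 0.0, as in step 5.e."""
    if highest == lowest:
        return 0.0
    return (value - lowest) / (highest - lowest)


# Bounds inferred from the preview rows, not taken from the notebook
METASCORE_MIN, METASCORE_MAX = 20.0, 100.0

# metascore_num values of rows 0, 1, 2 and 4 in the table above
scaled = [
    round(min_max_scale(v, METASCORE_MIN, METASCORE_MAX), 5)
    for v in (59.4, 98.1, 31.7, 36.0)
]
print(scaled)
```

The results can be compared with the `metascore_num_scaled` column of the same rows (0.4925, 0.97625, 0.14625 and 0.2).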


### 7. TRANSFORMATION (T)
We apply further transformations: new calculated columns and aggregations by genre and by platform.

In [41]:
# Transformation 1: create calculated columns
if 'metascore_num' in df_etl.columns and 'user_score_num' in df_etl.columns:
    # Note: metascore runs up to 100 and user_score up to 10, so this plain
    # mean is dominated by metascore
    df_etl['score_promedio'] = (df_etl['metascore_num'] + df_etl['user_score_num']) / 2
    df_etl['score_promedio'] = df_etl['score_promedio'].round(2)

if 'copies_sold_millions_num' in df_etl.columns:
    # Sales tiers in millions of copies: <=1, (1, 5], (5, 10], >10
    df_etl['categoria_ventas'] = pd.cut(
        df_etl['copies_sold_millions_num'],
        bins=[-float('inf'), 1, 5, 10, float('inf')],
        labels=['Bajo', 'Moderado', 'Exitoso', 'Blockbuster']
    )

if 'revenue_millions_usd_num' in df_etl.columns and 'copies_sold_millions_num' in df_etl.columns:
    # Avoid division by zero: zero copies becomes NaN before dividing
    df_etl['ingreso_por_copia'] = (
        df_etl['revenue_millions_usd_num'] /
        df_etl['copies_sold_millions_num'].replace(0, np.nan)
    ).round(2)

# Transformation 2: aggregation by genre
if 'genre' in df_etl.columns:
    df_by_genre = df_etl.groupby('genre').agg({
        'name': 'count',                      # the dataset column is 'name', not 'title'
        'metascore_num': 'mean',
        'user_score_num': 'mean',
        'copies_sold_millions_num': 'sum',
        'revenue_millions_usd_num': 'sum'
    }).rename(columns={
        'name': 'total_juegos',
        'metascore_num': 'metascore_promedio',
        'user_score_num': 'user_score_promedio',
        'copies_sold_millions_num': 'total_copias_vendidas',
        'revenue_millions_usd_num': 'total_ingresos_millones'
    }).round(2).reset_index()

    print("Aggregation by genre:")
    print(df_by_genre.head(10))

# Transformation 3: aggregation by platform
if 'platform' in df_etl.columns:
    df_by_platform = df_etl.groupby('platform').agg({
        'name': 'count',                      # same column as above
        'copies_sold_millions_num': 'sum',
        'revenue_millions_usd_num': 'sum'
    }).rename(columns={
        'name': 'total_juegos',
        'copies_sold_millions_num': 'total_copias_vendidas',
        'revenue_millions_usd_num': 'total_ingresos_millones'
    }).round(2).reset_index()

    print("\nAggregation by platform:")
    print(df_by_platform.head(10))

print("\n" + "="*70)
print("TRANSFORMATION COMPLETED")
print("="*70)
print("New columns added: score_promedio, categoria_ventas, ingreso_por_copia")
print("Aggregated tables: df_by_genre, df_by_platform")


Aggregation by genre:
        genre  total_juegos  metascore_promedio  user_score_promedio  \
0      Action           704               59.19                 5.16   
1   Adventure          2187               60.05                 5.06   
2       Indie           707               59.67                 5.11   
3      Puzzle           739               59.70                 5.11   
4         RPG          1452               59.76                 5.09   
5      Racing           711               59.18                 5.08   
6     Shooter           697               58.76                 5.10   
7  Simulation           677               59.22                 5.12   
8      Sports           735               58.97                 5.05   
9    Strategy           699               58.93                 5.15   

   total_copias_vendidas  total_ingresos_millones  
0                8056.83                 708563.6  
1               27161.72                2196125.2  
2                8200.83  
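
`pd.cut` uses right-closed intervals by default, so a value that sits exactly on a bin edge falls into the lower category: exactly 1 million copies is `Bajo`, and exactly 5 million is `Moderado`. The sketch below applies the same bins and labels as `categoria_ventas` to a few made-up sales figures.

```python
import pandas as pd

# Made-up sales figures (millions of copies), including values on bin edges
copies = pd.Series([0.5, 1.0, 1.5, 5.0, 10.0, 41.93])

categories = pd.cut(
    copies,
    bins=[-float("inf"), 1, 5, 10, float("inf")],
    labels=["Bajo", "Moderado", "Exitoso", "Blockbuster"],
)  # right=True by default, so each interval is (low, high]

print(categories.tolist())
```

Passing `right=False` would instead assign edge values to the higher category.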

### 8. LOAD (L)
We save the final result to SQLite (`warehouse_pandas.db`) in the `warehouse` directory: a star schema (dimension tables plus a fact table), the full table, and the two aggregates.

In [42]:
warehouse_paths = [
    '../warehouse/warehouse_pandas.db',
    '/app/warehouse/warehouse_pandas.db'
]

# Use the first location whose directory exists; fall back to the local path
db_path = next(
    (p for p in warehouse_paths if Path(p).parent.exists()),
    warehouse_paths[0]
)
Path(db_path).parent.mkdir(parents=True, exist_ok=True)

# Open the SQLite connection
conn = sqlite3.connect(db_path)

print("="*70)
print("LOAD - Saving data to SQLite")
print("="*70)
print(f"Database: {db_path}\n")

# ==========================================================
# 1) DIMENSIONS + FACT TABLE (star schema)
# ==========================================================

# GENRE DIMENSION
if 'genre' in df_etl.columns:
    dim_genre = (
        df_etl[['genre']]
        .drop_duplicates()
        .sort_values('genre')
        .reset_index(drop=True)
    )
    dim_genre['genre_id'] = dim_genre.index + 1
    dim_genre = dim_genre[['genre_id', 'genre']].rename(columns={'genre': 'genre_name'})
    dim_genre.to_sql('dim_genre', conn, if_exists='replace', index=False)
    print(f"✓ Dimension table 'dim_genre' loaded: {len(dim_genre)} records")
else:
    print("⚠ 'dim_genre' was not created because df_etl has no 'genre' column.")

# PLATFORM DIMENSION
if 'platform' in df_etl.columns:
    dim_platform = (
        df_etl[['platform']]
        .drop_duplicates()
        .sort_values('platform')
        .reset_index(drop=True)
    )
    dim_platform['platform_id'] = dim_platform.index + 1
    dim_platform = dim_platform[['platform_id', 'platform']].rename(columns={'platform': 'platform_name'})
    dim_platform.to_sql('dim_platform', conn, if_exists='replace', index=False)
    print(f"✓ Dimension table 'dim_platform' loaded: {len(dim_platform)} records")
else:
    print("⚠ 'dim_platform' was not created because df_etl has no 'platform' column.")

# YEAR DIMENSION
if 'year' in df_etl.columns:
    dim_year = (
        df_etl[['year']]
        .drop_duplicates()
        .sort_values('year')
        .reset_index(drop=True)
    )
    dim_year['year_id'] = dim_year.index + 1
    dim_year = dim_year[['year_id', 'year']].rename(columns={'year': 'year_value'})
    dim_year.to_sql('dim_year', conn, if_exists='replace', index=False)
    print(f"✓ Dimension table 'dim_year' loaded: {len(dim_year)} records")
else:
    print("⚠ 'dim_year' was not created because df_etl has no 'year' column.")

# PUBLISHER DIMENSION
if 'publisher' in df_etl.columns:
    dim_publisher = (
        df_etl[['publisher']]
        .drop_duplicates()
        .sort_values('publisher')
        .reset_index(drop=True)
    )
    dim_publisher['publisher_id'] = dim_publisher.index + 1
    dim_publisher = dim_publisher[['publisher_id', 'publisher']].rename(columns={'publisher': 'publisher_name'})
    dim_publisher.to_sql('dim_publisher', conn, if_exists='replace', index=False)
    print(f"✓ Dimension table 'dim_publisher' loaded: {len(dim_publisher)} records")
else:
    print("⚠ 'dim_publisher' was not created because df_etl has no 'publisher' column.")

# FACT TABLE
if all(c in df_etl.columns for c in ['genre', 'platform']):
    fact_videogame = df_etl.copy()

    # Join with the dimensions to obtain their IDs
    fact_videogame = fact_videogame.merge(
        dim_genre, left_on='genre', right_on='genre_name', how='left'
    ).merge(
        dim_platform, left_on='platform', right_on='platform_name', how='left'
    )

    if 'year' in df_etl.columns:
        fact_videogame = fact_videogame.merge(
            dim_year, left_on='year', right_on='year_value', how='left'
        )
    if 'publisher' in df_etl.columns:
        fact_videogame = fact_videogame.merge(
            dim_publisher, left_on='publisher', right_on='publisher_name', how='left'
        )

    # Surrogate game ID
    fact_videogame['game_id'] = range(1, len(fact_videogame) + 1)

    # Columns to keep in the fact table
    fact_cols = [
        'game_id',
        'name',
        'genre_id',
        'platform_id',
        'year_id',
        'publisher_id',
        'metascore_num',
        'user_score_num',
        'score_promedio',
        'copies_sold_millions_num',
        'revenue_millions_usd_num',
        'ingreso_por_copia',
        'categoria_ventas'
    ]
    fact_cols = [c for c in fact_cols if c in fact_videogame.columns]

    fact_videogame = fact_videogame[fact_cols]
    fact_videogame.to_sql('fact_videogame', conn, if_exists='replace', index=False)
    print(f"✓ Fact table 'fact_videogame' loaded: {len(fact_videogame)} records")
else:
    print("⚠ 'fact_videogame' was not created because 'genre' or 'platform' is missing.")

# ==========================================================
# 2) Additional tables (full dataset and aggregates)
# ==========================================================

# Full main table
df_etl.to_sql('videogames', conn, if_exists='replace', index=False)
print(f"\n✓ Table 'videogames' loaded: {len(df_etl)} records")

# Aggregated table: by_genre
if 'df_by_genre' in locals():
    df_by_genre.to_sql('by_genre', conn, if_exists='replace', index=False)
    print(f"✓ Table 'by_genre' loaded: {len(df_by_genre)} records")

# Aggregated table: by_platform
if 'df_by_platform' in locals():
    df_by_platform.to_sql('by_platform', conn, if_exists='replace', index=False)
    print(f"✓ Table 'by_platform' loaded: {len(df_by_platform)} records")

# Close the connection
conn.close()

print("\n" + "="*70)
print("ETL PROCESS COMPLETED SUCCESSFULLY")
print("="*70)
print("Available tables: dim_genre, dim_platform, dim_year, dim_publisher, fact_videogame, videogames, by_genre, by_platform")
print(f"Database created: {db_path}")


LOAD - Saving data to SQLite
Database: ../warehouse/warehouse_pandas.db

✓ Dimension table 'dim_genre' loaded: 11 records
✓ Dimension table 'dim_platform' loaded: 5 records
✓ Dimension table 'dim_year' loaded: 41 records
✓ Dimension table 'dim_publisher' loaded: 12 records
✓ Fact table 'fact_videogame' loaded: 10000 records

✓ Table 'videogames' loaded: 10000 records
✓ Table 'by_genre' loaded: 11 records
✓ Table 'by_platform' loaded: 5 records

ETL PROCESS COMPLETED SUCCESSFULLY
Available tables: dim_genre, dim_platform, dim_year, dim_publisher, fact_videogame, videogames, by_genre, by_platform
Database created: ../warehouse/warehouse_pandas.db
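
Once the star schema is loaded, analytical queries join the fact table to a dimension through its ID. The sketch below rebuilds a tiny in-memory version of `dim_genre` and `fact_videogame` with made-up rows (only a subset of the real columns) and aggregates sales per genre. It illustrates the query shape and does not read the real `warehouse_pandas.db`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Made-up rows; table and column names follow the load step above
conn.executescript("""
    CREATE TABLE dim_genre (genre_id INTEGER, genre_name TEXT);
    CREATE TABLE fact_videogame (
        game_id INTEGER, name TEXT, genre_id INTEGER,
        copies_sold_millions_num REAL
    );
    INSERT INTO dim_genre VALUES (1, 'Action'), (2, 'RPG');
    INSERT INTO fact_videogame VALUES
        (1, 'Game A', 1, 2.0),
        (2, 'Game B', 1, 3.5),
        (3, 'Game C', 2, 1.5);
""")

# Resolve genre_id to its name and aggregate per genre
rows = conn.execute("""
    SELECT g.genre_name,
           COUNT(*)                        AS total_juegos,
           SUM(f.copies_sold_millions_num) AS total_copias_vendidas
    FROM fact_videogame AS f
    JOIN dim_genre AS g ON g.genre_id = f.genre_id
    GROUP BY g.genre_name
    ORDER BY g.genre_name
""").fetchall()
conn.close()

print(rows)
```

The same join pattern applies to `dim_platform`, `dim_year`, and `dim_publisher` through their respective ID columns.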


### 9. Database verification
We query the database to verify that the data loaded correctly.

In [43]:
# Connect to the database
conn = sqlite3.connect(db_path)

# List all tables
query_tables = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql(query_tables, conn)
print("Tables in the database:")
print(tables)

# Check the record count of each table
print("\n" + "="*70)
print("RECORD COUNT PER TABLE")
print("="*70)

for table_name in tables['name']:
    # Quote the identifier so unusual table names do not break the query
    query = f'SELECT COUNT(*) AS count FROM "{table_name}";'
    result = pd.read_sql(query, conn)
    print(f"{table_name}: {result['count'].iloc[0]} records")

# Show a sample of the main table
print("\n" + "="*70)
print("SAMPLE FROM TABLE 'videogames'")
print("="*70)
query_sample = "SELECT * FROM videogames LIMIT 5;"
sample = pd.read_sql(query_sample, conn)
print(sample)

# Close the connection
conn.close()

print("\n✅ Verification completed. Database working correctly.")

Tables in the database:
             name
0       dim_genre
1    dim_platform
2        dim_year
3   dim_publisher
4  fact_videogame
5      videogames
6        by_genre
7     by_platform

RECORD COUNT PER TABLE
dim_genre: 11 records
dim_platform: 5 records
dim_year: 41 records
dim_publisher: 12 records
fact_videogame: 10000 records
videogames: 10000 records
by_genre: 11 records
by_platform: 5 records

SAMPLE FROM TABLE 'videogames'
                  name      genre   cost platform         popularity pegi  \
0  Super Mario Odyssey     Action  74.45   Mobile                 56   7+   
1           God of War        RPG      0   Mobile                 14   7+   
2      Persona 5 Royal    Shooter   Free       PS                 64   12   
3             NBA 2K24     Puzzle   Free   Mobile  972.7113240416031   RP   
4            Overwatch  Adventure   33.4       PC  612.6268621737502  18+   

   year developer     publisher  region           mode         engine  \
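
Row counts confirm that the tables exist, but not that the keys line up. A stricter check, sketched below against a made-up in-memory database, counts fact rows whose `genre_id` has no match in `dim_genre`. The function name `count_orphans` is hypothetical.

```python
import sqlite3


def count_orphans(conn, fact, dim, key):
    """Count rows in `fact` whose `key` has no matching row in `dim`.

    Table and column names are interpolated into the SQL, so they must come
    from trusted code, never from user input.
    """
    query = (
        f'SELECT COUNT(*) FROM "{fact}" AS f '
        f'LEFT JOIN "{dim}" AS d ON d."{key}" = f."{key}" '
        f'WHERE d."{key}" IS NULL'
    )
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")

# Made-up rows: game 3 points at a genre_id that does not exist
conn.executescript("""
    CREATE TABLE dim_genre (genre_id INTEGER, genre_name TEXT);
    CREATE TABLE fact_videogame (game_id INTEGER, genre_id INTEGER);
    INSERT INTO dim_genre VALUES (1, 'Action');
    INSERT INTO fact_videogame VALUES (1, 1), (2, 1), (3, 99);
""")
orphans = count_orphans(conn, "fact_videogame", "dim_genre", "genre_id")
conn.close()

print(orphans)
```

A result of zero for every dimension would indicate that each fact row resolves to a dimension row.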
