# Inserting Natural Disaster Data

This notebook extracts data on the frequency and impact of natural disasters by country from the CSV  
`Indicator_11_1_Physical_Risks_Climate_related_disasters_frequency_7212563912390016675.csv`  
and loads it into the fact table of our database.

**Objectives**  
1. Load and preview the natural disaster data.  
2. Reshape the dataset to “long” format if necessary.  
3. Normalize the countries and map them to their internal codes.  
4. Map each disaster row to its indicator identifier in the `Indicadores` table.  
5. Build the fact tuples and insert them in batches.  
6. Verify that no orphan records remain and close the connection.

### Import libraries and connect to MySQL

This cell imports the required packages, reads the environment variables, and opens a cursor on the database where the tables are defined.
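
`os.getenv` returns `None` when a variable is not set, so a missing `.env` entry only shows up later as a connection error. The sketch below shows one way to fail early instead. The helper `require_env` is hypothetical and not part of this notebook, and the example passes an explicit mapping rather than reading the real environment.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for the given variables, or raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [n for n in names if not environ.get(n)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {n: environ[n] for n in names}


# Example with an explicit mapping (illustrative values, not the real environment)
settings = require_env(
    ["DB_HOST", "DB_USER"],
    environ={"DB_HOST": "localhost", "DB_USER": "root"},
)
```

In this notebook the call would be made right after `load_dotenv()`, with the four `DB_*` names.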


In [1]:
import os
import pandas as pd
import pymysql
from pymysql.constants import CLIENT
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()
DB_HOST     = os.getenv('DB_HOST')
DB_USER     = os.getenv('DB_USER')
DB_PASSWORD = os.getenv('DB_PASSWORD')
DB_NAME     = os.getenv('DB_NAME')

# Connect to MySQL with multi-statements enabled
conexion = pymysql.connect(
    host=DB_HOST,
    user=DB_USER,
    password=DB_PASSWORD,
    database=DB_NAME,
    client_flag=CLIENT.MULTI_STATEMENTS
)
cursor = conexion.cursor()
print(f"🔗 Connected to database `{DB_NAME}` as {DB_USER}@{DB_HOST}")


🔗 Connected to database `tfm_cambio_climatico` as root@localhost


### 1. Reading the natural disasters CSV

- We use `pd.read_csv()` to load the file.  
- We inspect the columns and the first rows to understand its structure.


In [2]:
# Path to the CSV (adjust if necessary)
csv_path = "../../data/fuentes/climaticos/Indicator_11_1_Physical_Risks_Climate_related_disasters_frequency_7212563912390016675.csv"

# Read the file
df = pd.read_csv(csv_path, encoding='utf-8')

# Show the dimensions, the columns, and a first look at the data
print(f"Dimensions of the raw CSV: {df.shape}")
print("Available columns:", df.columns.tolist())
df.head(5)


Dimensions of the raw CSV: (1972, 55)
Available columns: ['ObjectId', 'Country', 'ISO2', 'ISO3', 'Indicator', 'Unit', 'Source', 'CTS Code', 'CTS Name', 'CTS Full Descriptor', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023', '2024']


Unnamed: 0,ObjectId,Country,ISO2,ISO3,Indicator,Unit,Source,CTS Code,CTS Name,CTS Full Descriptor,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,1,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Adaptation, Clima...",...,,,,1.0,,,1.0,,,
1,2,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Adaptation, Clima...",...,,,,,,,,,1.0,1.0
2,3,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Adaptation, Clima...",...,1.0,4.0,1.0,3.0,6.0,5.0,2.0,5.0,2.0,5.0
3,4,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Adaptation, Clima...",...,4.0,,2.0,1.0,1.0,1.0,1.0,1.0,,2.0
4,5,"Afghanistan, Islamic Rep. of",AF,AFG,"Climate related disasters frequency, Number of...",Number of,"The Emergency Events Database (EM-DAT) , Centr...",ECCD,Climate Related Disasters Frequency,"Environment, Climate Change, Adaptation, Clima...",...,,,2.0,,,1.0,,,,


### 2. Reshape to “long” format (melt)

The CSV has one column per year (1980, 1981, …, 2024). To simplify insertion into the fact table, we unpivot those columns into two:  
- `Year`  
- `Value`
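
As a small illustration of what the reshape does, the sketch below melts a two-row toy table (invented values, not taken from the CSV). Each input row yields one output row per year column, which is why 1,972 rows with 45 year columns become 88,740 rows in the real data.

```python
import pandas as pd

# Toy wide table: one column per year (illustrative values only)
wide = pd.DataFrame({
    "ISO2": ["AF", "AL"],
    "Indicator": ["Flood", "Flood"],
    "1980": [1.0, None],
    "1981": [None, 2.0],
})

# Year columns are the ones whose name is made up only of digits
year_cols = [c for c in wide.columns if c.isdigit()]

# Unpivot: every (row, year column) pair becomes one row
long = wide.melt(
    id_vars=["ISO2", "Indicator"],
    value_vars=year_cols,
    var_name="Year",
    value_name="Value",
)
long["Year"] = long["Year"].astype(int)
```

The result has 2 × 2 = 4 rows, including the ones whose `Value` is missing.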


In [3]:
# Detect the year columns (names made up only of digits)
year_cols = [c for c in df.columns if c.isdigit()]

# Unpivot from wide → long
df_long = df.melt(
    id_vars=["Country","ISO2","Indicator","Unit"],
    value_vars=year_cols,
    var_name="Year",
    value_name="Value"
)

# Convert Year to int
df_long["Year"] = df_long["Year"].astype(int)

# Quick look after the melt
print(f"Dimensions after melt: {df_long.shape}")
df_long.head(5)


Dimensions after melt: (88740, 6)


Unnamed: 0,Country,ISO2,Indicator,Unit,Year,Value
0,"Afghanistan, Islamic Rep. of",AF,"Climate related disasters frequency, Number of...",Number of,1980,
1,"Afghanistan, Islamic Rep. of",AF,"Climate related disasters frequency, Number of...",Number of,1980,
2,"Afghanistan, Islamic Rep. of",AF,"Climate related disasters frequency, Number of...",Number of,1980,1.0
3,"Afghanistan, Islamic Rep. of",AF,"Climate related disasters frequency, Number of...",Number of,1980,
4,"Afghanistan, Islamic Rep. of",AF,"Climate related disasters frequency, Number of...",Number of,1980,


### 5. Data cleaning

- We convert the `Value` field to numeric.  
- We count and drop the rows with no value (`NaN`).
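
With `errors='coerce'`, entries that cannot be parsed become `NaN` without raising, so counting the missing values before dropping them is what makes the loss visible. A minimal sketch on invented values:

```python
import pandas as pd

# Illustrative raw values: two parseable, one unparseable, one missing
raw = pd.Series(["1.0", "n/a", None, "3"])

values = pd.to_numeric(raw, errors="coerce")  # unparseable entries become NaN
n_missing = int(values.isna().sum())          # count before dropping
clean = values.dropna()
```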


In [4]:
# Make sure Value is numeric (unparseable entries become NaN)
df_long['Value'] = pd.to_numeric(df_long['Value'], errors='coerce')

n_missing = df_long['Value'].isna().sum()
print(f"⚠️ Rows without a value: {n_missing} will be dropped")
df_long = df_long.dropna(subset=['Value'])


⚠️ Rows without a value: 66504 will be dropped


### 6. Country mapping

To avoid inconsistencies between country names, we use the **ISO2** column as the key:
1. We fetch the valid codes from the `Paises` catalog.  
2. We drop the rows whose `ISO2` does not exist in the dimension.


In [5]:
# 6.1 Load the valid codes from Paises
cursor.execute("SELECT codigo FROM Paises;")
valid_iso2 = {row[0] for row in cursor.fetchall()}

# 6.2 Assign pais_id = ISO2 directly
df_long['pais_id'] = df_long['ISO2']

# 6.3 Show which ISO2 codes do not exist in the Paises table
invalid_iso2 = set(df_long['pais_id']) - valid_iso2
print("⚠️ ISO2 codes not recognized in Paises:", invalid_iso2)

# Optional: show sample rows that contain them
if invalid_iso2:
    display(df_long[df_long['pais_id'].isin(invalid_iso2)].drop_duplicates(subset=['pais_id', 'Country']).head(100))

# 6.4 Keep only the ISO2 codes that exist in the Paises table
antes = len(df_long)
df_long = df_long[df_long['pais_id'].isin(valid_iso2)]
print(f"Rows dropped because of unknown country: {antes - len(df_long)}")


⚠️ ISO2 codes not recognized in Paises: {'CS', nan, 'AN'}


Unnamed: 0,Country,ISO2,Indicator,Unit,Year,Value,pais_id
3584,Soviet Union (former),,"Climate related disasters frequency, Number of...",Number of,1981,1.0,
5162,Namibia,,"Climate related disasters frequency, Number of...",Number of,1982,1.0,
6594,Germany Fed Rep (former),,"Climate related disasters frequency, Number of...",Number of,1983,1.0,
8562,Germany Dem Rep (former),,"Climate related disasters frequency, Number of...",Number of,1984,2.0,
18986,Netherlands Antilles,AN,"Climate related disasters frequency, Number of...",Number of,1989,1.0,AN
25183,Serbia and Montenegro,CS,"Climate related disasters frequency, Number of...",Number of,1992,1.0,CS
31679,Azores Island,,"Climate related disasters frequency, Number of...",Number of,1996,1.0,
37796,Canary Island,,"Climate related disasters frequency, Number of...",Number of,1999,1.0,
74438,Saint Barthélemy,,"Climate related disasters frequency, Number of...",Number of,2017,1.0,
74446,Saint Martin (French Part),,"Climate related disasters frequency, Number of...",Number of,2017,1.0,
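
The table above lists Namibia with an empty `ISO2`, even though it is a current country. One plausible cause, which this notebook does not confirm, is that `pd.read_csv` treats the literal string `NA` (Namibia's ISO2 code) as a missing value by default. The sketch below reproduces the effect on a two-line toy CSV and shows one way to keep the code by restricting the missing-value markers to the empty string.

```python
import io

import pandas as pd

# Toy CSV (not the real file): Namibia's ISO2 code is the string "NA"
csv_text = "Country,ISO2\nNamibia,NA\nAfghanistan,AF\n"

# Default parsing: "NA" is one of pandas' default missing-value markers
default = pd.read_csv(io.StringIO(csv_text))

# Only treat empty fields as missing, so "NA" survives as text
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])
```

If this is indeed the cause, reading the source CSV with the same two arguments would keep Namibia's rows, and the discard counts in this section would change accordingly.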


### 7. Indicator mapping

In this step we translate each indicator description, exactly as it appears in the CSV, to its **internal code** in the `Indicadores` table, and then look up the corresponding `id` to load into the fact table.

- First, we load the `{code → id}` dictionary from the `Indicadores` dimension.
- We define an explicit `{csv_text → code_bd}` mapping for the 14 natural disaster indicators.
- We translate the DataFrame descriptions to their internal code and then to the `id`.
- We list any indicators that could not be mapped.
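
A two-step lookup can fail in two different ways: the CSV text may be absent from the explicit mapping, or the mapped code may be absent from the dimension. The sketch below separates the two cases on toy data. The labels and the id `101` are invented for illustration and do not come from the database.

```python
import pandas as pd

# Toy mappings (hypothetical labels and id)
indicator_map = {
    "Disasters: Flood": "desastres_inundacion",
    "Disasters: Storm": "desastres_tormenta",
}
dim_ind_map = {"desastres_inundacion": 101}

labels = pd.Series(["Disasters: Flood", "Disasters: Storm", "Disasters: Hail"])

codes = labels.map(indicator_map)  # step 1: CSV text → internal code
ids = codes.map(dim_ind_map)       # step 2: internal code → id

# Failure 1: text that has no entry in the explicit mapping
unmapped_text = labels[codes.isna()].unique().tolist()
# Failure 2: codes that were mapped but do not exist in the dimension
unknown_codes = sorted(set(codes.dropna()) - set(dim_ind_map))
```

Here `unmapped_text` holds the label missing from `indicator_map` and `unknown_codes` holds the code missing from `dim_ind_map`, so each problem can be fixed in the right place.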


In [6]:
# 7.1 Load the {code_bd → id} mapping from the Indicadores table
cursor.execute("""
    SELECT codigo, id
      FROM Indicadores
     WHERE categoria_id = (
       SELECT id FROM Categorias WHERE nombre = 'Catástrofes naturales'
     );
""")
dim_ind_map = { code: iid for code, iid in cursor.fetchall() }

# 7.2 Explicit mapping from CSV text → code in the DB
indicator_map = {
    "Climate related disasters frequency, Number of Disasters: Drought":"desastres_sequia",
    "Climate related disasters frequency, Number of Disasters: Extreme temperature":"desastres_temp_extrema",
    "Climate related disasters frequency, Number of Disasters: Flood":"desastres_inundacion",
    "Climate related disasters frequency, Number of Disasters: Landslide":"desastres_deslizamiento",
    "Climate related disasters frequency, Number of Disasters: Storm":"desastres_tormenta",
    "Climate related disasters frequency, Number of Disasters: TOTAL":"desastres_total",
    "Climate related disasters frequency, Number of Disasters: Wildfire":"desastres_incendios",
    "Climate related disasters frequency, Number of People Affected: Drought":"afectados_sequia",
    "Climate related disasters frequency, Number of People Affected: Extreme temperature":"afectados_temp_extrema",
    "Climate related disasters frequency, Number of People Affected: Flood":"afectados_inundacion",
    "Climate related disasters frequency, Number of People Affected: Landslide":"afectados_deslizamiento",
    "Climate related disasters frequency, Number of People Affected: Storm":"afectados_tormenta",
    "Climate related disasters frequency, Number of People Affected: TOTAL":"afectados_total",
    "Climate related disasters frequency, Number of People Affected: Wildfire":"afectados_incendios",
}

# 7.3 Add a column with the code used in the DB
df_long['indicador_code'] = df_long['Indicator'].map(indicator_map)

# 7.4 Translate code → id
df_long['indicador_id'] = df_long['indicador_code'].map(dim_ind_map)

# 7.5 Detect and show the descriptions that were not mapped
unmatched = df_long[df_long['indicador_id'].isna()]['Indicator'].unique()
if len(unmatched):
    print("⚠️ These indicators could NOT be mapped:")
    for txt in unmatched:
        print("   •", txt)
else:
    print("✅ All indicators were mapped correctly.")

# 7.6 Keep only the rows with a valid indicator
antes = len(df_long)
df_long = df_long[df_long['indicador_id'].notna()]
print(f"Rows dropped because of unknown indicator: {antes - len(df_long)}")


✅ All indicators were mapped correctly.
Rows dropped because of unknown indicator: 0


### 8. Remove duplicates

We make sure there is at most one row for each (`pais_id`, `Year`, `indicador_id`) combination.  
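
When `drop_duplicates` is given a `subset`, it keeps the first row of each key and discards the rest, even if their values differ. The sketch below, on invented rows, shows how `duplicated(keep=False)` can list every conflicting row before anything is dropped.

```python
import pandas as pd

# Toy facts: the first two rows share a key but carry different values
facts = pd.DataFrame({
    "pais_id": ["AF", "AF", "AL"],
    "Year": [1980, 1980, 1980],
    "indicador_id": [1, 1, 1],
    "Value": [1.0, 2.0, 3.0],
})
key = ["pais_id", "Year", "indicador_id"]

# All rows that share a key with at least one other row
conflicts = facts[facts.duplicated(subset=key, keep=False)]

# Default keep='first': the row with Value 2.0 is silently discarded
deduped = facts.drop_duplicates(subset=key)
```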


In [7]:
antes = len(df_long)
df_long = df_long.drop_duplicates(subset=['pais_id','Year','indicador_id'])
print(f"❇️ Rows before: {antes} → after removing duplicates: {len(df_long)}")


❇️ Rows before: 21968 → after removing duplicates: 21968


### 9. Insert into `Hechos`

1. We build a list of tuples with  
   (`pais_id`, `periodo_id=17`, `anio`, `indicador_id`, `valor`).  
2. We insert in batches of 1,000 so as not to lock the database.  
3. We close the connection.  
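
The batching itself is plain list slicing. The sketch below isolates it in a small generator; the name `chunks` is hypothetical and not part of this notebook.

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Five items in batches of two: the last batch is shorter
batches = list(chunks(list(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```

With 21,968 rows and a batch size of 1,000 this gives 22 batches, the last one holding 968 rows.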


In [8]:
# 9.1 Build the tuples for the INSERT
to_insert = [
    (
      row['pais_id'],
      17,  # fixed periodo_id for natural disasters
      int(row['Year']),
      int(row['indicador_id']),
      float(row['Value'])
    )
    for _, row in df_long.iterrows()
]

# 9.2 SQL statement and batched insert
sql = """
INSERT INTO Hechos
  (pais_id, periodo_id, anio, indicador_id, valor)
VALUES (%s, %s, %s, %s, %s);
"""
batch_size = 1000
total = len(to_insert)
print(f"Total records to insert: {total}")

for i in range(0, total, batch_size):
    chunk = to_insert[i:i+batch_size]
    cursor.executemany(sql, chunk)
    conexion.commit()
    print(f"  ✔ Rows inserted {i+1}–{min(i+batch_size, total)}")

# 9.3 Close the connection
cursor.close()
conexion.close()
print("✅ Natural disaster insertion completed.")


Total records to insert: 21968
  ✔ Rows inserted 1–1000
  ✔ Rows inserted 1001–2000
  ✔ Rows inserted 2001–3000
  ✔ Rows inserted 3001–4000
  ✔ Rows inserted 4001–5000
  ✔ Rows inserted 5001–6000
  ✔ Rows inserted 6001–7000
  ✔ Rows inserted 7001–8000
  ✔ Rows inserted 8001–9000
  ✔ Rows inserted 9001–10000
  ✔ Rows inserted 10001–11000
  ✔ Rows inserted 11001–12000
  ✔ Rows inserted 12001–13000
  ✔ Rows inserted 13001–14000
  ✔ Rows inserted 14001–15000
  ✔ Rows inserted 15001–16000
  ✔ Rows inserted 16001–17000
  ✔ Rows inserted 17001–18000
  ✔ Rows inserted 18001–19000
  ✔ Rows inserted 19001–20000
  ✔ Rows inserted 20001–21000
  ✔ Rows inserted 21001–21968
✅ Natural disaster insertion completed.
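
The objectives mention verifying that no orphan records remain, but the cells above do not show that check. The sketch below illustrates one way to do it with a `LEFT JOIN`, using an in-memory SQLite database as a stand-in for MySQL. The table and column names follow the notebook, while the reduced schemas and the sample rows are assumptions made for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Reduced, assumed schemas: only the columns needed for the check
    CREATE TABLE Paises (codigo TEXT PRIMARY KEY);
    CREATE TABLE Indicadores (id INTEGER PRIMARY KEY);
    CREATE TABLE Hechos (pais_id TEXT, indicador_id INTEGER, anio INTEGER, valor REAL);
    INSERT INTO Paises VALUES ('AF');
    INSERT INTO Indicadores VALUES (1);
    -- The second fact points to a country that is not in Paises
    INSERT INTO Hechos VALUES ('AF', 1, 1980, 1.0), ('ZZ', 1, 1980, 2.0);
""")

# Facts whose country or indicator has no match in its dimension
orphans = con.execute("""
    SELECT COUNT(*)
      FROM Hechos h
      LEFT JOIN Paises p ON p.codigo = h.pais_id
      LEFT JOIN Indicadores i ON i.id = h.indicador_id
     WHERE p.codigo IS NULL OR i.id IS NULL
""").fetchone()[0]
con.close()
```

The query uses standard SQL, so it should carry over to the MySQL connection used here, provided it runs before `conexion.close()`.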
