# 02 – Eurostat exploration (HICP and HBS)

This notebook explores the files downloaded from Eurostat:

- `data_raw/eurostat/prc_hicp_aind_tabular.tsv` (Harmonised Index of Consumer Prices, HICP).
- `data_raw/eurostat/hbs_str_t211_tabular.tsv` (household expenditure on food, HBS).

**Objectives:**

1. Verify that the files can be read correctly from the project structure.
2. Inspect their structure (columns, row count, types).
3. Identify:
   - Available countries (`geo` codes).
   - Available years.
   - Category codes (COICOP or others) relevant to food.
4. Check coverage for the EU-27 countries and for the approximate period 2015–2023.

The results of this analysis will later be used to decide:
- Which years and countries can be joined with OpenFoodFacts and FAOSTAT.
- How to design the category mapping table (taxonomies).

In [1]:
from pathlib import Path
import duckdb
import pandas as pd

# Project root folder (we go up from notebooks/exploratory)
ROOT_DIR = Path("..").resolve().parent

DATA_RAW = ROOT_DIR / "data_raw" / "eurostat"

HICP_PATH = DATA_RAW / "prc_hicp_aind_tabular.tsv"
HBS_PATH = DATA_RAW / "hbs_str_t211_tabular.tsv"

ROOT_DIR, HICP_PATH.exists(), HBS_PATH.exists(), HICP_PATH, HBS_PATH

(WindowsPath('C:/Users/santi/OneDrive - UNIR/UNIR/MASTER ANÁLISIS Y VISUALIZACIÓN BIG DATA/TFM/dashboard-coherencia-ue-tfm'),
 True,
 True,
 WindowsPath('C:/Users/santi/OneDrive - UNIR/UNIR/MASTER ANÁLISIS Y VISUALIZACIÓN BIG DATA/TFM/dashboard-coherencia-ue-tfm/data_raw/eurostat/prc_hicp_aind_tabular.tsv'),
 WindowsPath('C:/Users/santi/OneDrive - UNIR/UNIR/MASTER ANÁLISIS Y VISUALIZACIÓN BIG DATA/TFM/dashboard-coherencia-ue-tfm/data_raw/eurostat/hbs_str_t211_tabular.tsv'))

In [2]:
# In-memory connection
con = duckdb.connect(database=":memory:")
con

<_duckdb.DuckDBPyConnection at 0x1efbf2b7070>

In [3]:
hicp_preview = con.execute(f"""
    SELECT *
    FROM read_csv_auto('{HICP_PATH}', delim='\t', header=TRUE)
    LIMIT 5
""").fetchdf()

hicp_preview

Unnamed: 0,"freq,unit,coicop,geo\TIME_PERIOD",1996,1997,1998,1999,2000,2001,2002,2003,2004,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
0,"A,CID_EA,TOT_X_NRG_FOOD,AT",:,:,:,:,:,:,-0.4,-0.6,-0.3,...,0.7,0.8,1.1,0.8,0.6,1.3,0.8,1.1,2.4,1.1
1,"A,CID_EA,TOT_X_NRG_FOOD,BE",:,:,:,:,:,:,-0.3,-0.3,-0.6,...,0.6,0.9,0.5,0.3,0.5,0.7,-0.2,0.0,1.0,0.5
2,"A,CID_EA,TOT_X_NRG_FOOD,BG",:,:,:,:,:,:,5.3,0.3,0.3,...,-1.7,-1.8,-1.5,1.1,0.8,0.5,-0.1,3.7,4.0,0.3
3,"A,CID_EA,TOT_X_NRG_FOOD,CY",:,:,:,:,:,:,-1.4,0.1,-2.0,...,-1.5,-1.6,-0.6,-0.9,-0.2,-1.5,-0.1,1.0,-1.1,-0.2
4,"A,CID_EA,TOT_X_NRG_FOOD,CZ",:,:,:,:,:,:,0.1,-1.2,0.1,...,-0.3,0.4,0.9,0.7,1.1,2.4,2.1,8.1,4.4,1.2
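The preview shows `:` wherever Eurostat has no value, which is why the year columns cannot be read as numbers directly. The following is a minimal sketch of a cell parser in plain Python. The helper name `parse_cell` is hypothetical, and the handling of a trailing flag letter (for example `1.1 p`) is an assumption about the Eurostat TSV format that is not confirmed by the rows displayed here.

```python
import re
from typing import Optional

# Leading numeric part of a cell: optional sign, digits, optional decimals.
_NUMBER = re.compile(r"^\s*(-?\d+(?:\.\d+)?)")


def parse_cell(cell: str) -> Optional[float]:
    """Return the numeric value of a cell, or None when it is missing (':')."""
    match = _NUMBER.match(cell)
    return float(match.group(1)) if match else None


# Illustrative cells: "1.1 p" and ": c" are assumed flag formats, not taken from the file.
cells = [":", "-0.4", "1.1 p", ": c", "0.0"]
parsed = [parse_cell(c) for c in cells]
print(parsed)
```

Anything that does not start with a number is treated as missing, so unexpected placeholders fail safe to `None` instead of raising.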


In [4]:
# Schema (column names and types)
hicp_schema = con.execute(f"""
    DESCRIBE
    SELECT *
    FROM read_csv_auto('{HICP_PATH}', delim='\t', header=TRUE)
""").fetchdf()

hicp_schema

Unnamed: 0,column_name,column_type,null,key,default,extra
0,"freq,unit,coicop,geo\TIME_PERIOD",VARCHAR,YES,,,
1,1996,VARCHAR,YES,,,
2,1997,VARCHAR,YES,,,
3,1998,VARCHAR,YES,,,
4,1999,VARCHAR,YES,,,
5,2000,VARCHAR,YES,,,
6,2001,VARCHAR,YES,,,
7,2002,VARCHAR,YES,,,
8,2003,VARCHAR,YES,,,
9,2004,VARCHAR,YES,,,


In [5]:
# Total number of rows
hicp_n = con.execute(f"""
    SELECT COUNT(*) AS n_filas
    FROM read_csv_auto('{HICP_PATH}', delim='\t', header=TRUE)
""").fetchdf()

hicp_n

Unnamed: 0,n_filas
0,35251


In [6]:
hicp_preview.columns

Index(['freq,unit,coicop,geo\TIME_PERIOD', '1996', '1997', '1998', '1999',
       '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017',
       '2018', '2019', '2020', '2021', '2022', '2023', '2024'],
      dtype='object')
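The columns listed above are one per year, so checking year coverage is easier once the table is in long format. As a rough sketch, assuming pandas is acceptable for this step, `DataFrame.melt` can do the reshaping. The small frame below reuses values from the preview rows for AT and BE, and the variable names `wide` and `long_df` are illustrative only.

```python
import pandas as pd

key_col = "freq,unit,coicop,geo\\TIME_PERIOD"

# Tiny wide-format sample built from the preview (AT and BE, three years).
wide = pd.DataFrame({
    key_col: ["A,CID_EA,TOT_X_NRG_FOOD,AT", "A,CID_EA,TOT_X_NRG_FOOD,BE"],
    "1996": [":", ":"],
    "2022": ["1.1", "0.0"],
    "2023": ["2.4", "1.0"],
})

# One row per (key, year) instead of one column per year.
long_df = wide.melt(id_vars=key_col, var_name="year", value_name="value")

# Strip defensively in case the headers carry stray whitespace (an assumption).
long_df["year"] = long_df["year"].str.strip().astype(int)

# ':' becomes NaN; any other non-numeric content would also be coerced to NaN.
long_df["value"] = pd.to_numeric(long_df["value"], errors="coerce")

print(long_df)
```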

In [7]:
# Actual name of the first column
first_col = hicp_preview.columns[0]
first_col

'freq,unit,coicop,geo\\TIME_PERIOD'
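The column name above packs four dimension names (before the backslash) plus the name of the time axis (after it), and each row key follows the same comma-separated order. A minimal sketch of how one key could be unpacked in plain Python, using a row key from the preview:

```python
# Header of the first column, as returned by hicp_preview.columns[0].
first_col = "freq,unit,coicop,geo\\TIME_PERIOD"

# Everything before the backslash is the list of dimension names.
dims_part, time_axis = first_col.split("\\")
dims = dims_part.split(",")

# A row key taken from the preview output.
row_key = "A,CID_EA,TOT_X_NRG_FOOD,AT"
record = dict(zip(dims, row_key.split(",")))

print(dims)       # dimension names
print(time_axis)  # name of the time axis
print(record)     # one value per dimension
```

This is the same positional logic that the `split_part` calls in the following cells rely on, one dimension at a time.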

In [8]:
# 1) Frequency (freq)
hicp_freq = con.execute(f"""
    SELECT DISTINCT 
        split_part("{first_col}", ',', 1) AS freq
    FROM read_csv_auto('{HICP_PATH}', delim='\t', header=TRUE)
    ORDER BY freq
""").fetchdf()

hicp_freq

Unnamed: 0,freq
0,A


In [9]:
# 2) Unit / index type (unit)
hicp_units = con.execute(f"""
    SELECT DISTINCT
        split_part("{first_col}", ',', 2) AS unit
    FROM read_csv_auto('{HICP_PATH}', delim='\t', header=TRUE)
    ORDER BY unit
""").fetchdf()

hicp_units

Unnamed: 0,unit
0,CID_EA
1,INX_A_AVG
2,RCH_A_AVG


In [10]:
# 3) COICOP / HICP categories (coicop)
hicp_coicop = con.execute(f"""
    SELECT 
        split_part("{first_col}", ',', 3) AS coicop,
        COUNT(*) AS n
    FROM read_csv_auto('{HICP_PATH}', delim='\t', header=TRUE)
    GROUP BY coicop
    ORDER BY n DESC
    LIMIT 30
""").fetchdf()

hicp_coicop

Unnamed: 0,coicop,n
0,TOT_X_NRG_FOOD,117
1,CP03,90
2,CP04,90
3,CP08,90
4,CP00,90
5,CP11,90
6,CP02,90
7,CP09,90
8,CP10,90
9,CP01,90


In [11]:
# 4) Countries (geo)
hicp_geo = con.execute(f"""
    SELECT 
        split_part("{first_col}", ',', 4) AS geo,
        COUNT(*) AS n
    FROM read_csv_auto('{HICP_PATH}', delim='\t', header=TRUE)
    GROUP BY geo
    ORDER BY n DESC
    LIMIT 30
""").fetchdf()

hicp_geo

Unnamed: 0,geo,n
0,EU27_2020,933
1,EU,932
2,EU28,932
3,PL,931
4,EA20,923
5,EA,922
6,EA19,920
7,DE,895
8,FR,875
9,EEA,872
