# 03 – FAOSTAT exploration

This notebook explores the files downloaded from **FAOSTAT** and stored in:

`data_raw/faostat/`

**Objectives:**

- Inventory the CSV files available in the `data_raw/faostat` folder.
- Verify that DuckDB can read them correctly from within the project structure.
- Inspect the typical FAOSTAT structure (`Area`, `Item`, `Element`, `Year`, `Unit`, `Value`, etc.).
- Identify, for each file:
  - The geographic areas (`Area`) available.
  - The elements (`Element`) measured.
  - The range of years (`Year`) and the coverage of the 2015–2023 period.
- Gather the information needed to decide which FAOSTAT tables and indicators will be used in the downstream ETL steps.

In [1]:
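from pathlib import Path
import duckdb
import pandas as pd
from IPython.display import display  # explicit import so display() also works outside Jupyter

# Project root folder (two levels above the notebook's working directory)
ROOT_DIR = Path("..").resolve().parent

# Folder with the raw FAOSTAT data
DATA_RAW = ROOT_DIR / "data_raw" / "faostat"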

ROOT_DIR, DATA_RAW, DATA_RAW.exists()

(WindowsPath('C:/Users/santi/OneDrive - UNIR/UNIR/MASTER ANÁLISIS Y VISUALIZACIÓN BIG DATA/TFM/dashboard-coherencia-ue-tfm'),
 WindowsPath('C:/Users/santi/OneDrive - UNIR/UNIR/MASTER ANÁLISIS Y VISUALIZACIÓN BIG DATA/TFM/dashboard-coherencia-ue-tfm/data_raw/faostat'),
 True)
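
The `ROOT_DIR` expression above only works when the notebook runs from a folder exactly two levels below the project root. As a possible alternative, the sketch below walks up the directory tree until it finds a folder that contains `data_raw`. The helper name `find_project_root` and its `marker` parameter are hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "data_raw") -> Optional[Path]:
    """Return the closest ancestor of `start` (or `start` itself) containing `marker`.

    Hypothetical helper: the project itself does not define this function.
    """
    start = start.resolve()
    # Check the starting folder first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # no ancestor contains the marker folder
```

In a notebook this could be called as `find_project_root(Path.cwd())`, checking for `None` before building `DATA_RAW` from the result.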

In [2]:
# List the CSV files in data_raw/faostat
faostat_files = sorted(DATA_RAW.glob("*.csv"))

len(faostat_files), [f.name for f in faostat_files]

(1, ['FAOSTAT_data_en_12-9-2025.csv'])

In [3]:
# In-memory DuckDB connection
con = duckdb.connect(database=":memory:")
con

<_duckdb.DuckDBPyConnection at 0x1d6438841f0>

In [4]:
def explorar_faostat_file(path: Path):
    """
    Explore a FAOSTAT file:
    - Show 5 sample rows.
    - Describe the schema (columns and types).
    - Compute a basic summary if the standard columns exist.
    """
    print(f"\n=== Explorando: {path.name} ===\n")  # "Exploring: <file name>"

    # 1) Preview
    preview = con.execute(f"""
        SELECT *
        FROM read_csv_auto('{path}', header=TRUE)
        LIMIT 5
    """).fetchdf()
    display(preview)

    # 2) Schema (column names and types)
    schema_df = con.execute(f"""
        DESCRIBE
        SELECT *
        FROM read_csv_auto('{path}', header=TRUE)
    """).fetchdf()
    display(schema_df)

    # 3) Basic summary, only if the file follows the standard FAOSTAT structure
    cols = set(schema_df["column_name"])

    if {"Year", "Area", "Item", "Element"}.issubset(cols):
        basic_stats = con.execute(f"""
            SELECT
                MIN("Year") AS min_year,
                MAX("Year") AS max_year,
                COUNT(*)     AS n_filas,
                COUNT(DISTINCT "Area")    AS n_areas,
                COUNT(DISTINCT "Item")    AS n_items,
                COUNT(DISTINCT "Element") AS n_elements
            FROM read_csv_auto('{path}', header=TRUE)
        """).fetchdf()
        print("\n--- Resumen básico ---")  # "Basic summary"
        display(basic_stats)

        elements = con.execute(f"""
            SELECT "Element", COUNT(*) AS n
            FROM read_csv_auto('{path}', header=TRUE)
            GROUP BY "Element"
            ORDER BY n DESC
            LIMIT 20
        """).fetchdf()
        print("\n--- Elementos más frecuentes ---")  # "Most frequent elements"
        display(elements)

        areas = con.execute(f"""
            SELECT "Area", COUNT(*) AS n
            FROM read_csv_auto('{path}', header=TRUE)
            GROUP BY "Area"
            ORDER BY n DESC
            LIMIT 20
        """).fetchdf()
        print("\n--- Áreas más frecuentes ---")  # "Most frequent areas"
        display(areas)

        stats_2015_2023 = con.execute(f"""
            SELECT
                SUM( ("Year" BETWEEN 2015 AND 2023)::INT ) AS n_2015_2023,
                COUNT(*) AS n_total
            FROM read_csv_auto('{path}', header=TRUE)
        """).fetchdf()
        print("\n--- Cobertura 2015–2023 ---")  # "2015–2023 coverage"
        display(stats_2015_2023)
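
`explorar_faostat_file` reports the 2015–2023 coverage for the file as a whole. When deciding which indicators to keep, a per-`Item` breakdown may be more informative. The following is a minimal, dependency-free sketch of that idea. The helper `missing_years_by_item` and the sample rows are hypothetical and made up for illustration, not real FAOSTAT values. It assumes `Year` holds a single integer year, as in the file explored here.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

TARGET_YEARS = range(2015, 2024)  # 2015..2023 inclusive


def missing_years_by_item(rows: Iterable[Mapping[str, str]]) -> Dict[str, List[int]]:
    """Map each Item to the target years for which it has no row at all."""
    years_seen = defaultdict(set)
    for row in rows:
        years_seen[row["Item"]].add(int(row["Year"]))
    return {
        item: [year for year in TARGET_YEARS if year not in years]
        for item, years in years_seen.items()
    }


# Made-up sample (not real FAOSTAT data), only to show the shape of the result.
sample_lines = ["Area,Item,Year,Value"]
sample_lines += [f"Area A,Indicator X,{year},1.0" for year in range(2015, 2024)]
sample_lines += [f"Area A,Indicator Y,{year},2.0" for year in (2017, 2018, 2024)]

gaps = missing_years_by_item(csv.DictReader(io.StringIO("\n".join(sample_lines))))
print(gaps)
```

An empty list means the item has at least one row for every year in the window. The same function could be fed a `csv.DictReader` opened on the real file.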

In [5]:
# Check that there is at least one CSV, then explore all of them
assert faostat_files, "No CSV files found in data_raw/faostat. Download the FAOSTAT data first."

for path in faostat_files:
    explorar_faostat_file(path)


=== Explorando: FAOSTAT_data_en_12-9-2025.csv ===



Unnamed: 0,Domain Code,Domain,Area Code (M49),Area,Element Code,Element,Item Code,Item,Year Code,Year,Release,Unit,Value,Flag,Flag Description
0,CAHD,Cost and Affordability of a Healthy Diet (CoAHD),40,Austria,6205,Value,70041,"Cost of a healthy diet (CoHD), LCU per person ...",2017,2017,July 2025 (SOFI report),LCU/cap/d,1.72,E,Estimated value
1,CAHD,Cost and Affordability of a Healthy Diet (CoAHD),40,Austria,6205,Value,70041,"Cost of a healthy diet (CoHD), LCU per person ...",2018,2018,July 2025 (SOFI report),LCU/cap/d,1.75,E,Estimated value
2,CAHD,Cost and Affordability of a Healthy Diet (CoAHD),40,Austria,6205,Value,70041,"Cost of a healthy diet (CoHD), LCU per person ...",2019,2019,July 2025 (SOFI report),LCU/cap/d,1.77,E,Estimated value
3,CAHD,Cost and Affordability of a Healthy Diet (CoAHD),40,Austria,6205,Value,70041,"Cost of a healthy diet (CoHD), LCU per person ...",2020,2020,July 2025 (SOFI report),LCU/cap/d,1.81,E,Estimated value
4,CAHD,Cost and Affordability of a Healthy Diet (CoAHD),40,Austria,6205,Value,70041,"Cost of a healthy diet (CoHD), LCU per person ...",2021,2021,July 2025 (SOFI report),LCU/cap/d,1.82,E,Estimated value


Unnamed: 0,column_name,column_type,null,key,default,extra
0,Domain Code,VARCHAR,YES,,,
1,Domain,VARCHAR,YES,,,
2,Area Code (M49),VARCHAR,YES,,,
3,Area,VARCHAR,YES,,,
4,Element Code,BIGINT,YES,,,
5,Element,VARCHAR,YES,,,
6,Item Code,BIGINT,YES,,,
7,Item,VARCHAR,YES,,,
8,Year Code,BIGINT,YES,,,
9,Year,BIGINT,YES,,,



--- Resumen básico ---


Unnamed: 0,min_year,max_year,n_filas,n_areas,n_items,n_elements
0,2017,2024,1512,27,16,1



--- Elementos más frecuentes ---


Unnamed: 0,Element,n
0,Value,1512



--- Áreas más frecuentes ---


Unnamed: 0,Area,n
0,Denmark,56
1,Hungary,56
2,Slovakia,56
3,Austria,56
4,Greece,56
5,France,56
6,Netherlands (Kingdom of the),56
7,Bulgaria,56
8,Portugal,56
9,Romania,56



--- Cobertura 2015–2023 ---


Unnamed: 0,n_2015_2023,n_total
0,1404.0,1512
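
The coverage figures above can be unpacked with a little arithmetic. The sketch below only reuses numbers from the recorded outputs (`n_total`, `n_2015_2023`, `min_year`, `max_year`, `n_areas`) and does not read the CSV. It assumes no row has a missing `Year`. Under that assumption, every row outside the 2015–2023 window must belong to 2024, because the maximum year is 2024.

```python
# Figures copied from the recorded outputs above.
n_total = 1512
n_2015_2023 = 1404
n_areas = 27
min_year, max_year = 2017, 2024

# Rows not counted inside the 2015-2023 window (assumed to be 2024 rows).
n_outside = n_total - n_2015_2023

# Fraction of all rows that fall inside the window.
share_inside = n_2015_2023 / n_total

# Target years that cannot appear in the file because the series starts later.
years_never_present = [year for year in range(2015, 2024) if year < min_year]

print(n_outside, round(100 * share_inside, 1), years_never_present)
print(n_outside / n_areas)  # average number of out-of-window rows per area
```

Roughly 93% of the rows fall inside the window, but 2015 and 2016 are absent because the series starts in 2017. This file therefore cannot cover the full 2015–2023 period on its own.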
