### PST Dataset Summary / Resumen del conjunto PST

- **Shape / Dimensiones**: `(264, 8)`  

| Column name           | Type     | Description (EN)                                                | Descripción (ES)                                                |
|-----------------------|----------|------------------------------------------------------------------|------------------------------------------------------------------|
| `año`                 | int      | Reference year                                                   | Año de referencia                                                |
| `concepto`            | object   | Emission concept (activity + pollutant type)                    | Concepto de emisión (actividad + tipo de contaminante)           |
| `tipo_territorio`     | object   | Territory type (e.g., municipality, region)                     | Tipo de territorio (municipio, región, etc.)                     |
| `código_territorio`   | float    | Territory code (all null in this extract)                       | Código del territorio (todo nulo en este extracto)               |
| `territorio`          | float    | Territory name (all null in this extract)                       | Nombre del territorio (todo nulo en este extracto)               |
| `valor`               | int      | Emission value in metric tons                                   | Valor de emisión en toneladas métricas                           |
| `unidad`              | object   | Unit of measure (always `t`)                                    | Unidad de medida (siempre `t`)                                   |
| `estado_dato`         | float    | Data status (e.g., estimated, validated; all null in this extract) | Estado del dato (estimado, validado; todo nulo en este extracto) |

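> Note: `código_territorio`, `territorio`, and `estado_dato` are entirely null in this extract (0 of 264 non-null), which is why pandas reports them as `float`. Together with the single `tipo_territorio` value (`Otros`), this suggests region-wide aggregates rather than per-territory records, or incomplete metadata.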
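
The observation in the note can be checked mechanically. The following is a minimal sketch on a toy frame (the values are illustrative, and `empty_and_constant_columns` is a hypothetical helper, not part of `src.cleaning`) that lists columns that are entirely null and columns that hold a single distinct value:

```python
import pandas as pd

# Toy frame that mimics the structure reported by info(); values are illustrative only.
demo = pd.DataFrame({
    "año": [2000, 2000, 2001],
    "tipo_territorio": ["Otros", "Otros", "Otros"],
    "código_territorio": [float("nan")] * 3,
    "valor": [830, 18926, 12],
})

def empty_and_constant_columns(df):
    """Return (all-null columns, single-valued columns) as two lists."""
    all_null = [c for c in df.columns if df[c].isna().all()]
    constant = [c for c in df.columns
                if c not in all_null and df[c].nunique(dropna=True) == 1]
    return all_null, constant

all_null, constant = empty_and_constant_columns(demo)
print("all-null:", all_null)
print("constant:", constant)
```

Columns in either list carry no information for modelling, which is consistent with the notebook later keeping only `año`, `concepto`, and `valor`.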

In [2]:
# Cell 1: Parameters

import pandas as pd, numpy as np, geopandas as gpd
import os, sys, re, glob, yaml
from pathlib import Path

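# Walk up from the working directory until a project marker (config.yml or .git) is found
ROOT = Path.cwd()
while ROOT != ROOT.parent and not (ROOT / "config.yml").exists() and not (ROOT / ".git").exists():
    ROOT = ROOT.parent
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))
# safe_load returns None for an empty file, so fall back to an empty dict
cfg = (yaml.safe_load((ROOT / "config.yml").read_text(encoding="utf-8")) or {}) if (ROOT / "config.yml").exists() else {}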
RAW_DIR   = ROOT / cfg.get("data", {}).get("raw_dir", "data/raw")
PROC_DIR  = ROOT / cfg.get("data", {}).get("processed_dir", "data/processed")
AUDIT_DIR = ROOT / cfg.get("data", {}).get("audit_dir", "data/ingest_audit")
ADOPTION_DEFAULT = cfg.get("defaults", {}).get("adoption_rate_default", 0.30)
PRIORITY         = cfg.get("defaults", {}).get("priority_districts", [10,11,12,13,15])
madrid_codes_official = set(cfg.get("defaults", {}).get("madrid_postal_codes_official", []))
print(f"RAW={RAW_DIR}\nPROC={PROC_DIR}\nAUDIT={AUDIT_DIR}")

RAW=c:\_Workspace\2_Work\1_Projects_Active\Datos_Abiertos_Madrid\Low-Carbon-Heating-Roadmap-for-Madrid\data\raw
PROC=c:\_Workspace\2_Work\1_Projects_Active\Datos_Abiertos_Madrid\Low-Carbon-Heating-Roadmap-for-Madrid\data\processed
AUDIT=c:\_Workspace\2_Work\1_Projects_Active\Datos_Abiertos_Madrid\Low-Carbon-Heating-Roadmap-for-Madrid\data\ingest_audit


In [3]:
# Cell 2: Ingestion / Ingesta
from src.loader import load_pst
from src.cleaning import inspect_dataframe
pst = load_pst(save=False)
inspect_dataframe(pst, name="pst")


 pst.shape → 264 rows × 8 columns

 head ()


Unnamed: 0,año,concepto,tipo_territorio,código_territorio,territorio,valor,unidad,estado_dato
0,2000,Emisión de partículas en suspensión en otras f...,Otros,,,830,t,
1,2000,Total emisión de partículas en suspensión,Otros,,,18926,t,
2,2000,Emisión de partículas en suspensión en la comb...,Otros,,,12,t,
3,2000,Emisión de partículas en suspensión en plantas...,Otros,,,2505,t,
4,2000,Emisión de partículas en suspensión en plantas...,Otros,,,1263,t,



 tail ()


Unnamed: 0,año,concepto,tipo_territorio,código_territorio,territorio,valor,unidad,estado_dato
259,2021,Emisión de partículas en suspensión en plantas...,Otros,,,1244,t,
260,2021,Emisión de partículas en suspensión en la comb...,Otros,,,61,t,
261,2021,Total emisión de partículas en suspensión,Otros,,,11050,t,
262,2021,Emisión de partículas en suspensión en agricul...,Otros,,,823,t,
263,2021,Emisión de partículas en suspensión en otras f...,Otros,,,284,t,



 info ():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   año                264 non-null    int64  
 1   concepto           264 non-null    object 
 2   tipo_territorio    264 non-null    object 
 3   código_territorio  0 non-null      float64
 4   territorio         0 non-null      float64
 5   valor              264 non-null    int64  
 6   unidad             264 non-null    object 
 7   estado_dato        0 non-null      float64
dtypes: float64(3), int64(2), object(3)
memory usage: 16.6+ KB

 describe all


Unnamed: 0,año,concepto,tipo_territorio,código_territorio,territorio,valor,unidad,estado_dato
count,264.0,264,264,0.0,0.0,264.0,264,0.0
unique,,12,1,,,,1,
top,,Emisión de partículas en suspensión en otras f...,Otros,,,,t,
freq,,22,264,,,,264,
mean,2010.5,,,,,2591.306818,,
std,6.356339,,,,,4639.757009,,
min,2000.0,,,,,0.0,,
25%,2005.0,,,,,281.0,,
50%,2010.5,,,,,714.5,,
75%,2016.0,,,,,2493.75,,


In [4]:
# Cell 3: Schema validation / Validación de esquema
# Validate required columns using src.cleaning.validate_schema / usa helper
from src.cleaning import validate_schema
req_pst_cols = ["año", "concepto", "valor"]
try:
    validate_schema(pst, req_pst_cols)
    print("pst schema OK")
except AssertionError as e:
    raise AssertionError(f"Schema error pst: {e}") from e
# Drop columns that are not required / Eliminar columnas innecesarias
# .copy() gives an independent frame, so later edits do not touch a slice of the original
pst = pst[req_pst_cols].copy()

pst schema OK
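
The body of `validate_schema` is not shown in this notebook. As a rough illustration of the kind of check the cell relies on, here is a minimal sketch assuming the helper only asserts that the required columns exist and raises `AssertionError` otherwise (which is what the `except` clause above expects). `check_required_columns` is a hypothetical stand-in, not the project's function:

```python
import pandas as pd

def check_required_columns(df, required):
    """Hypothetical stand-in for validate_schema: assert every required column exists."""
    missing = [c for c in required if c not in df.columns]
    assert not missing, f"missing required columns: {missing}"
    return True

# Illustrative one-row frame with the three columns the notebook requires
demo = pd.DataFrame({"año": [2000], "concepto": ["Total"], "valor": [18926]})
check_required_columns(demo, ["año", "concepto", "valor"])  # passes silently

try:
    check_required_columns(demo.drop(columns="valor"), ["año", "concepto", "valor"])
except AssertionError as err:
    message = str(err)
    print(message)
```

A real helper may also verify dtypes; that part is left out here because the notebook does not show which types it enforces.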


In [5]:
# Cell 4: Data quality report / Informe de calidad de datos
# Use dq_report to get structured report and store it / usa dq_report y guárdalo
from src.cleaning import dq_report
dq_pst = dq_report(pst)
# print compact summary / imprimir resumen compacto
print("rows_in:", dq_pst["rows_in"], "duplicate_rows:", dq_pst["duplicate_rows"])
for col, meta in list(dq_pst["columns"].items())[:8]:
    print(f"{col}: nulls={meta['null_count']} null_pct={meta['null_pct']:.2f}% uniques={meta['unique_nonnull']}")
# keep report in memory for audit / conservar para auditoría
dq_reports = {"df_pst": dq_pst}

rows_in: 264 duplicate_rows: 0
año: nulls=0 null_pct=0.00% uniques=22
concepto: nulls=0 null_pct=0.00% uniques=12
valor: nulls=0 null_pct=0.00% uniques=224
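
The report printed above exposes `rows_in`, `duplicate_rows`, and per-column `null_count`, `null_pct`, and `unique_nonnull`. The sketch below shows one way such a structure could be computed with plain pandas. It is an assumption about the shape of the output only: `dq_summary` is a hypothetical name and the real `dq_report` may compute more or do it differently.

```python
import pandas as pd

def dq_summary(df):
    """Hypothetical equivalent of dq_report, limited to the keys the notebook reads."""
    n = len(df)
    columns = {}
    for col in df.columns:
        nulls = int(df[col].isna().sum())
        columns[col] = {
            "null_count": nulls,
            "null_pct": 100.0 * nulls / n if n else 0.0,  # percentage, not fraction
            "unique_nonnull": int(df[col].nunique(dropna=True)),
        }
    return {
        "rows_in": n,
        "duplicate_rows": int(df.duplicated().sum()),  # exact duplicates after the first
        "columns": columns,
    }

# Illustrative data: one duplicated row and one null
demo = pd.DataFrame({"year": [2000, 2000, 2001, 2001], "value": [1.0, 1.0, None, 3.0]})
report = dq_summary(demo)
print(report["rows_in"], report["duplicate_rows"], report["columns"]["value"])
```

Casting the counts to Python `int` at this stage keeps the report directly serializable when it is later stored for audit.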


In [6]:
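# Cell 5: Rename and clean / Renombrar y limpiar
# --- 0. Rename columns to canonical names / Renombrar columnas ---
rename_dict = {
    "año": "year",                     # Reference year / Año de referencia
    "concepto": "emission_concept",    # Activity + pollutant type / Actividad + tipo de contaminante
    "valor": "tons_pst"                # Suspended-particle emissions in tons / Emisiones en toneladas de partículas en suspensión
}
# Reassign instead of inplace=True to avoid mutating a sliced frame
pst = pst.rename(columns=rename_dict)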

critical_cols = [
    "year", "tons_pst"
]

sentinels = ['99999999,99','99999999.99','99999999','99999999,00',
             'NaN','nan','NULL','-']


# --- 1. Convert to numeric with cleaning ---
for c in critical_cols:
    if c in pst.columns:
        s = pst[c].astype(str).replace(sentinels, pd.NA)
        s = s.str.replace(r'\.', '', regex=True).str.replace(',', '.', regex=True)
        s = s.str.replace(r'[^\d\.\-]', '', regex=True)
        pst[c] = pd.to_numeric(s, errors='coerce')


# --- 2. Drop rows with NaN in any critical column ---
n_before = len(pst)
pst = pst.dropna(subset=critical_cols)
n_after_nan = len(pst)

# --- 3. Final check ---
for col in critical_cols:
    n_null = pst[col].isna().sum()
    n_neg  = (pst[col] < 0).sum()
    n_zero = (pst[col] == 0).sum()
    print(f"{col}: nulls={n_null}, negatives={n_neg}, zeros={n_zero}")

# --- 4. Register transformation (once, outside the loop) ---
transforms = [{
    "step": "rename_and_clean_pst",
    "renamed_columns": list(rename_dict.items()),
    "columns_cleaned": critical_cols,
    "sentinels_mapped": sentinels,
    "rows_before": n_before,
    "rows_dropped_nan": n_before - n_after_nan,
    "rows_final": n_after_nan,
}]

year: nulls=0, negatives=0, zeros=0
tons_pst: nulls=0, negatives=0, zeros=22
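
The cleaning loop in this cell strips every dot before converting the decimal comma. That suits Spanish-formatted strings such as `1.234,56`, but a float column rendered as `830.0` would become `8300`. Both critical columns are `int64` here, so the result is unaffected. For reuse on other datasets, a more defensive variant could skip columns that are already numeric. This is a minimal sketch, with `parse_es_number` as a hypothetical helper name:

```python
import pandas as pd

# Same sentinel strings as the notebook cell
SENTINELS = {"99999999,99", "99999999.99", "99999999", "99999999,00",
             "NaN", "nan", "NULL", "-"}

def parse_es_number(series):
    """Parse Spanish-formatted numbers ('1.234,56'); leave numeric dtypes untouched."""
    if pd.api.types.is_numeric_dtype(series):
        return series  # already numeric: string cleaning could corrupt floats
    s = series.astype(str).str.strip()
    s = s.where(~s.isin(SENTINELS))            # sentinels become NaN
    s = s.str.replace(".", "", regex=False)    # drop thousands separators
    s = s.str.replace(",", ".", regex=False)   # decimal comma -> decimal point
    return pd.to_numeric(s, errors="coerce")

parsed = parse_es_number(pd.Series(["1.234,56", "830", "99999999,99", "-"]))
floats = parse_es_number(pd.Series([830.0, 12.5]))
print(parsed.tolist())
print(floats.tolist())
```

The 22 zeros reported for `tons_pst` are kept by the notebook; whether they are genuine zero emissions or placeholders cannot be determined from this output alone.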


In [7]:
# Cell 6: Final Check / Revisión Final
pst.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   year              264 non-null    int64 
 1   emission_concept  264 non-null    object
 2   tons_pst          264 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 6.3+ KB


In [8]:
# Cell 7: Export and audit / Exportar y auditar
from src.io import save_df, write_audit_log  # persistence helpers / ayudantes de persistencia
# save processed artifact / guardar artefacto procesado
out_path = save_df(pst, str(PROC_DIR.joinpath("df_pst.csv")))
print("Saved:", out_path)
# build audit entry and write / construir objeto de auditoría y guardar
audit = {
    "source": "emision-de-contaminantes-atmosfericos-por-sectores-particulas-en-suspension-pst.csv",
    "rows_in": int(dq_pst["rows_in"]),   # cast to Python int
    "rows_out": int(len(pst)),           # ensure Python int
    "transforms": []
}

# normalize transforms too: cast counts to Python int
# rows_final falls back to the current row count when a step did not record it
for t in transforms:
    audit["transforms"].append({
        **t,
        "rows_before": int(t.get("rows_before", 0)),
        "rows_final": int(t.get("rows_final", len(pst))),
        "rows_dropped_nan": int(t.get("rows_dropped_nan", 0)),
        "rows_dropped_negatives_or_zeros": int(t.get("rows_dropped_negatives_or_zeros", 0))
    })

audit_path = write_audit_log(**audit)
print("Audit saved:", audit_path)

Saved: c:\_Workspace\2_Work\1_Projects_Active\Datos_Abiertos_Madrid\Low-Carbon-Heating-Roadmap-for-Madrid\data\processed\df_pst.csv
Audit saved: C:\_Workspace\2_Work\1_Projects_Active\Datos_Abiertos_Madrid\Low-Carbon-Heating-Roadmap-for-Madrid\data\ingest_audit\audit_emision-de-contaminantes-atmosfericos-por-sectores-particulas-en-suspension-pst_20251029_215558.json
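
The explicit `int(...)` casts in this cell are needed because NumPy integers such as `numpy.int64` are not serializable by Python's standard `json` module. Assuming `write_audit_log` serializes with `json` (its implementation is not shown here), an alternative is to pass a `default` hook that converts NumPy types in one place. `to_builtin` below is a hypothetical helper for illustration:

```python
import json
import numpy as np

def to_builtin(obj):
    """json.dumps fallback: convert NumPy scalars and arrays to plain Python types."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Illustrative audit entry holding NumPy values, as pandas often returns them
entry = {
    "rows_in": np.int64(264),
    "rows_out": np.int64(264),
    "years": np.array([2000, 2021]),
}
payload = json.dumps(entry, default=to_builtin)
print(payload)
```

With such a hook the per-key casts become optional, at the cost of hiding the conversion inside the serializer.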


In [9]:
# Cell 8: Reproducibility checks / Verificaciones de reproducibilidad

# list artifacts and audits / listar artefactos y auditorías
proc_files = glob.glob(str(PROC_DIR.joinpath("*.csv")))
audit_files = glob.glob(str(AUDIT_DIR.joinpath("*.json")))
print("processed files:", proc_files)
print("audit files:", audit_files)
# basic checks / comprobaciones básicas
assert proc_files, "No processed artifacts found in data/processed/  / No hay artefactos procesados"
assert audit_files, "No audit JSONs found in data/ingest_audit/  / No hay JSONs de auditoría"
# size checks (rows) / comprobación de filas mínima (adjust expected as needed)
min_rows_expected = 10
for p in proc_files:
    df = pd.read_csv(p, nrows=min_rows_expected)
    if df.shape[0] < min_rows_expected:
        raise RuntimeError(f"Artifact {p} has <{min_rows_expected} rows; check processing  / El artefacto tiene muy pocas filas")
print("Reproducibility smoke tests passed / Pruebas de reproducibilidad OK")

processed files: ['c:\\_Workspace\\2_Work\\1_Projects_Active\\Datos_Abiertos_Madrid\\Low-Carbon-Heating-Roadmap-for-Madrid\\data\\processed\\df_ceee.csv', 'c:\\_Workspace\\2_Work\\1_Projects_Active\\Datos_Abiertos_Madrid\\Low-Carbon-Heating-Roadmap-for-Madrid\\data\\processed\\df_gei.csv', 'c:\\_Workspace\\2_Work\\1_Projects_Active\\Datos_Abiertos_Madrid\\Low-Carbon-Heating-Roadmap-for-Madrid\\data\\processed\\df_pst.csv', 'c:\\_Workspace\\2_Work\\1_Projects_Active\\Datos_Abiertos_Madrid\\Low-Carbon-Heating-Roadmap-for-Madrid\\data\\processed\\sql_buildings_train.csv']
audit files: ['c:\\_Workspace\\2_Work\\1_Projects_Active\\Datos_Abiertos_Madrid\\Low-Carbon-Heating-Roadmap-for-Madrid\\data\\ingest_audit\\audit_atm_inventario_gei_20251029_213554.json', 'c:\\_Workspace\\2_Work\\1_Projects_Active\\Datos_Abiertos_Madrid\\Low-Carbon-Heating-Roadmap-for-Madrid\\data\\ingest_audit\\audit_emision-de-contaminantes-atmosfericos-por-sectores-particulas-en-suspension-pst_20251029_215558.json', '