
# Phase 01: Initial dataset exploration (`01_explore`)

This notebook is part of the **MLOps4OFP workflow** and is designed to:

- **always work on a specific variant** of the `01_explore` phase (e.g. `v001`),
- read its parameters from `executions/01_explore/vNNN/params.yaml`,
- apply the **cleaning strategy** configured in the variant,
- generate a complete EDA (*Exploratory Data Analysis*),
- save all artifacts in the variant's folder.

Variants are created from the command line with (example values):

```bash
make variant1 VARIANT=v001 RAW=/path/to/dataset.csv \
    CLEANING_STRATEGY=basic \
    NAN_VALUES='[-999999,-1]' \
    ERROR_VALUES='{"hum":[0,999],"temp":[-50]}'
```

The notebook is then run for a variant with:

```bash
make nb1-run VARIANT=v001
```
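
The `NAN_VALUES` and `ERROR_VALUES` arguments of the `make variant1` command look like JSON literals. As a rough sketch, and assuming (the notebook does not confirm this) that the Makefile target forwards them as JSON strings, they could be decoded as follows. `parse_cleaning_args` is a hypothetical helper, not part of `mlops4ofp`.

```python
import json

def parse_cleaning_args(nan_values_json, error_values_json):
    """Decode the two JSON strings into a list of floats and a dict of float lists."""
    nan_values = [float(v) for v in json.loads(nan_values_json)]
    error_values = {
        column: [float(v) for v in values]
        for column, values in json.loads(error_values_json).items()
    }
    return nan_values, error_values

# Same example values as in the make command
nan_values, error_values = parse_cleaning_args(
    '[-999999,-1]', '{"hum":[0,999],"temp":[-50]}'
)
print(nan_values)
print(error_values)
```

Converting to `float` mirrors how `nan_values` shows up later in the notebook output (`[-999999.0]`).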


In [1]:
# =====================================================================
# 1. COMMON IMPORTS
# =====================================================================
import os
import sys
import json
import yaml
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime, timezone
import seaborn as sns
import shutil

# Common plotting style
plt.rcParams["figure.figsize"] = (8, 4)
plt.rcParams["axes.grid"] = True

print("✔ Imports and style loaded")

✔ Imports and style loaded


In [2]:
# =====================================================================
# 2. EXECUTION CONTEXT (bootstrap + run_context)
# =====================================================================
import os
import sys
from pathlib import Path

# --- MINIMAL BOOTSTRAP (before any mlops4ofp import) ---
execution_dir = Path.cwd().resolve()

# Walk up at most 10 levels until a folder containing "mlops4ofp" is found
current = execution_dir
for _ in range(10):
    if (current / "mlops4ofp").exists():
        project_root = current
        break
    current = current.parent
else:
    raise RuntimeError("❌ Could not locate project_root")

sys.path.insert(0, str(project_root))

print(f"📁 Project root added to sys.path: {project_root}")

# --- NOW the regular imports ---
from mlops4ofp.tools.run_context import (
    detect_execution_dir,
    detect_project_root,
    assemble_run_context,
)
import yaml

PHASE = "01_explore"

# -------------------------------------------------------------------
# Variant selection
# -------------------------------------------------------------------
ACTIVE_VARIANT = os.environ.get("ACTIVE_VARIANT")

# Case 1: run through the Makefile (preferred)
if ACTIVE_VARIANT:
    print(f"[INFO] Active variant (from environment): {ACTIVE_VARIANT}")

# Case 2: notebook opened manually
else:
    print("⚠️ ACTIVE_VARIANT is not set; selecting a variant automatically…")

    variants_file = project_root / "executions" / PHASE / "variants.yaml"

    if not variants_file.exists():
        raise RuntimeError(
            "❌ No variants exist yet.\n"
            "Create one with:\n"
            "    make variant1 VARIANT=v001 RAW=your_dataset.csv"
        )

    with open(variants_file, "r", encoding="utf-8") as f:
        reg = yaml.safe_load(f) or {}

    all_variants = sorted(reg.get("variants", {}).keys())
    if not all_variants:
        raise RuntimeError("❌ No variants defined in variants.yaml")

    # sorted() is lexicographic: the last item is the most recent variant
    # only while names stay zero-padded (v001, v002, ...)
    ACTIVE_VARIANT = all_variants[-1]
    print(f"[INFO] Using the most recent variant: {ACTIVE_VARIANT}")

# -------------------------------------------------------------------
# Build the final context
# -------------------------------------------------------------------
from mlops4ofp.tools.params_manager import ParamsManager

pm = ParamsManager(PHASE, project_root)

print("PHASE_DIR according to ParamsManager:", pm.phase_dir)
print("Does PHASE_DIR exist?", pm.phase_dir.exists())
print("Does the variant exist?", (pm.phase_dir / ACTIVE_VARIANT).exists())

pm.set_current(ACTIVE_VARIANT)

variant_root = pm.current_variant_dir()

ctx = assemble_run_context(
    execution_dir=detect_execution_dir(),
    project_root=project_root,
    phase=PHASE,
    variant=ACTIVE_VARIANT,
    variant_root=variant_root,
)

print("✔ Execution context built")
print(f"   Phase: {ctx['phase']}")
print(f"   Variant: {ctx['variant']}")
print(f"   Variant folder: {ctx['variant_root']}")


📁 Project root added to sys.path: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp
[INFO] Active variant (from environment): v002
PHASE_DIR according to ParamsManager: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore
Does PHASE_DIR exist? True
Does the variant exist? True
✔ Execution context built
   Phase: 01_explore
   Variant: v002
   Variant folder: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002
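
The bootstrap in this cell locates the project root by walking up from the working directory until it finds a folder that contains `mlops4ofp`. The same idea can be sketched as a reusable function. `find_project_root` and its parameters are illustrative names, not part of the project, and the demo runs on a throwaway directory tree.

```python
import tempfile
from pathlib import Path

def find_project_root(start, marker="mlops4ofp", max_levels=10):
    """Return the first ancestor of `start` (inclusive) that contains `marker`."""
    current = Path(start).resolve()
    for _ in range(max_levels):
        if (current / marker).exists():
            return current
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    raise RuntimeError(f"Could not locate a directory containing '{marker}'")

# Demo tree: <tmp>/repo/mlops4ofp and <tmp>/repo/executions/01_explore
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp).resolve() / "repo"
    (repo / "mlops4ofp").mkdir(parents=True)
    nested = repo / "executions" / "01_explore"
    nested.mkdir(parents=True)
    root_matches = find_project_root(nested) == repo

print(root_matches)
```

Stopping at the filesystem root avoids spending the remaining iterations on a path that can no longer change.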


In [3]:
# =====================================================================
# 4. DATA AND OUTPUT PATHS (VARIANT)
#    Every file is prefixed with the phase name
# =====================================================================
from mlops4ofp.tools.run_context import build_phase_outputs
VARIANT_DIR = ctx["variant_root"]
PHASE_PREFIX = ctx["phase"]

# Input data (raw data is shared across the whole project)
RAW_DIR = ctx["project_root"] / "data" / "01-raw"
RAW_DIR.mkdir(parents=True, exist_ok=True)

# Outputs of this phase (files at the root of the variant folder).
# The hand-written dict is kept commented out for reference.
#OUTPUTS = {
#    "dataset": VARIANT_DIR / f"{PHASE_PREFIX}_dataset.parquet",
#    "report": VARIANT_DIR / f"{PHASE_PREFIX}_report.html",
#    "params": VARIANT_DIR / f"{PHASE_PREFIX}_params.json",
#    "metadata": VARIANT_DIR / f"{PHASE_PREFIX}_metadata.json",
#}
OUTPUTS = build_phase_outputs(
    variant_root=VARIANT_DIR,
    phase=ctx["phase"],
)

# Figures (the only subfolder allowed inside the variant)
FIGURES_DIR = ctx["figures_dir"]

print("✔ Output paths ready (scoped to the variant)")
OUTPUTS, FIGURES_DIR

✔ Output paths ready (scoped to the variant)


({'dataset': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/01_explore_dataset.parquet'),
  'report': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/01_explore_report.html'),
  'params': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/01_explore_params.json'),
  'metadata': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/01_explore_metadata.json')},
 PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/figures'))
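
Judging only from the commented-out dictionary in the cell and the paths printed in its output, the output layout is one file per artifact, prefixed with the phase name. A minimal stand-in that reproduces that layout could look like this. `sketch_phase_outputs` is a hypothetical function; the real `build_phase_outputs` may do more.

```python
from pathlib import Path

def sketch_phase_outputs(variant_root, phase):
    """Hypothetical stand-in: one file per artifact, prefixed with the phase name."""
    root = Path(variant_root)
    suffixes = {
        "dataset": "dataset.parquet",
        "report": "report.html",
        "params": "params.json",
        "metadata": "metadata.json",
    }
    return {key: root / f"{phase}_{suffix}" for key, suffix in suffixes.items()}

outputs = sketch_phase_outputs("executions/01_explore/v002", "01_explore")
for key, path in outputs.items():
    print(f"{key:9s} {path.as_posix()}")
```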

## 1. Variant parameters and raw dataset selection

In [4]:

# =====================================================================
# 3. Read the parameters of the active variant
# =====================================================================
from mlops4ofp.tools.params_manager import validate_params

params_path = VARIANT_DIR / "params.yaml"
if not params_path.exists():
    raise FileNotFoundError(
        f"The variant's parameter file does not exist: {params_path}\n"
        f"Create the variant first with:\n"
        f"    make variant1 VARIANT={ACTIVE_VARIANT} RAW=/path/to/dataset"
    )

# Read the parameters and validate them against traceability_schema.yaml
with open(params_path, "r", encoding="utf-8") as f:
    variant_params = yaml.safe_load(f) or {}
validate_params(PHASE, variant_params, ctx["project_root"])
print("✔ Parameters validated against the schema.")

raw_dataset_path = variant_params.get("raw_dataset_path")
cleaning_strategy = variant_params.get("cleaning_strategy", "none")
nan_values = variant_params.get("nan_values", [])
error_values_by_column = variant_params.get("error_values_by_column", {})

# Strict validation of the cleaning strategy
ALLOWED_STRATEGIES = {"none", "basic", "full"}
if cleaning_strategy not in ALLOWED_STRATEGIES:
    raise ValueError(
        f"cleaning_strategy='{cleaning_strategy}' is not valid.\n"
        f"Allowed options: {sorted(ALLOWED_STRATEGIES)}"
    )

# Compact parameter log
print("────────── VARIANT PARAMETERS ──────────")
print(f"raw_dataset_path    = {raw_dataset_path}")
print(f"cleaning_strategy   = {cleaning_strategy}")
print(f"nan_values          = {nan_values}")
print(f"error_values_by_col = {error_values_by_column}")
print("────────────────────────────────────────")

# =====================================================================
# 4. Resolve the path of the input dataset
# =====================================================================
if not raw_dataset_path:
    raise ValueError("raw_dataset_path is missing from params.yaml")

# Expand "~" first, then resolve relative paths against project_root
raw_input = Path(raw_dataset_path).expanduser()
if not raw_input.is_absolute():
    raw_input = ctx["project_root"] / raw_input
raw_input = raw_input.resolve()
if not raw_input.exists():
    raise FileNotFoundError(
        f"The file given in raw_dataset_path does not exist:\n{raw_input}"
    )

# =====================================================================
# 5. Copy the raw dataset to data/01-raw (immutable working copy)
# =====================================================================
# Note: the prefix is added unconditionally, so a source file that already
# carries it ends up with the prefix twice (as in the output of this run).
raw_path = RAW_DIR / f"01_explore_raw_{raw_input.name}"
if not raw_path.exists():
    shutil.copy2(raw_input, raw_path)
    print(f"[VARIANT] Raw dataset copied to: {raw_path}")
else:
    print(f"[VARIANT] Local copy of the raw dataset already exists: {raw_path}")

raw_path



✔ Parameters validated against the schema.
────────── VARIANT PARAMETERS ──────────
raw_dataset_path    = data/01-raw/01_explore_raw_raw.csv
cleaning_strategy   = none
nan_values          = [-999999.0]
error_values_by_col = {}
────────────────────────────────────────
[VARIANT] Local copy of the raw dataset already exists: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/data/01-raw/01_explore_raw_01_explore_raw_raw.csv


PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/data/01-raw/01_explore_raw_01_explore_raw_raw.csv')
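
The resulting file name carries the `01_explore_raw_` prefix twice because the source file already had it. One possible way to make the naming idempotent is sketched below, as a suggestion only. `local_copy_name` is a hypothetical helper and is not used by the notebook.

```python
from pathlib import Path

PREFIX = "01_explore_raw_"

def local_copy_name(source_name, prefix=PREFIX):
    """Add the phase prefix only when the file name does not already carry it."""
    return source_name if source_name.startswith(prefix) else f"{prefix}{source_name}"

raw_dir = Path("data/01-raw")
print((raw_dir / local_copy_name("raw.csv")).as_posix())
print((raw_dir / local_copy_name("01_explore_raw_raw.csv")).as_posix())
```

Both calls yield the same target path, so re-running a variant whose `raw_dataset_path` already points into `data/01-raw` would not create a second copy.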

## 2. Loading the dataset and a quick look

In [5]:
# =====================================================
# 4. Load the dataset from the local copy
# =====================================================

print(f"📄 Using input file (local copy): {raw_path}")

# Load CSV or Parquet
suffix = raw_path.suffix.lower()
if suffix == ".csv":
    df = pd.read_csv(raw_path)
elif suffix in {".parquet", ".pq"}:
    df = pd.read_parquet(raw_path)
else:
    raise ValueError(f"❌ Unsupported file extension: {suffix}")

print("✔ Dataset loaded successfully.")
print(f"   Rows: {len(df):,}  |  Columns: {df.shape[1]}")
df.head()

📄 Using input file (local copy): /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/data/01-raw/01_explore_raw_01_explore_raw_raw.csv


✔ Dataset loaded successfully.
   Rows: 3,887,242  |  Columns: 18


Unnamed: 0,Timestamp,Battery_Active_Power,Battery_Active_Power_Set_Response,PVPCS_Active_Power,GE_Body_Active_Power,GE_Active_Power,GE_Body_Active_Power_Set_Response,FC_Active_Power_FC_END_Set,FC_Active_Power,FC_Active_Power_FC_end_Set_Response,Island_mode_MCCB_Active_Power,MG-LV-MSB_AC_Voltage,Receiving_Point_AC_Voltage,Island_mode_MCCB_AC_Voltage,Island_mode_MCCB_Frequency,MG-LV-MSB_Frequency,Inlet_Temperature_of_Chilled_Water,Outlet_Temperature
0,2022-05-01 00:00:01,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
1,2022-05-01 00:00:11,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
2,2022-05-01 00:00:21,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.200001
3,2022-05-01 00:00:31,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.950001,59.950001,22.1,22.200001
4,2022-05-01 00:00:41,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1


In [6]:
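import io
print("=== DATAFRAME INFO ===")
buf = io.StringIO()
# `buf` must be passed by keyword: the first positional parameter of
# DataFrame.info is `verbose`, so df.info(buf) would leave the buffer empty.
df.info(buf=buf)
info_str = buf.getvalue()
print(info_str)
print("=== DATA TYPES ===")
df.dtypes.to_frame("dtype")



=== DATAFRAME INFO ===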
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3887242 entries, 0 to 3887241
Data columns (total 18 columns):
 #   Column                               Dtype  
---  ------                               -----  
 0   Timestamp                            object 
 1   Battery_Active_Power                 float64
 2   Battery_Active_Power_Set_Response    float64
 3   PVPCS_Active_Power                   float64
 4   GE_Body_Active_Power                 float64
 5   GE_Active_Power                      float64
 6   GE_Body_Active_Power_Set_Response    float64
 7   FC_Active_Power_FC_END_Set           float64
 8   FC_Active_Power                      float64
 9   FC_Active_Power_FC_end_Set_Response  float64
 10  Island_mode_MCCB_Active_Power        float64
 11  MG-LV-MSB_AC_Voltage                 float64
 12  Receiving_Point_AC_Voltage           float64
 13  Island_mode_MCCB_AC_Voltage          float64
 14  Island_mode_MCCB_Frequency           float64
 15  MG-LV

Unnamed: 0,dtype
Timestamp,object
Battery_Active_Power,float64
Battery_Active_Power_Set_Response,float64
PVPCS_Active_Power,float64
GE_Body_Active_Power,float64
GE_Active_Power,float64
GE_Body_Active_Power_Set_Response,float64
FC_Active_Power_FC_END_Set,float64
FC_Active_Power,float64
FC_Active_Power_FC_end_Set_Response,float64


## 3. Preparing the time axis (`segs`)

In [7]:

# ================================================================
# 4. TEMPORAL PREPROCESSING: building `segs` and `segs_diff`
# ================================================================

print("=== 4. Preparing the time axis ===")

df_original_cols = df.columns.tolist()

# -------------------------------------------------------------------
# 4.1 Detect the "real" time column
# -------------------------------------------------------------------
time_col = None

# 1) Explicit priority: the 'Timestamp' column
if "Timestamp" in df.columns:
    time_col = "Timestamp"
    print("✔ Detected time column 'Timestamp' (highest priority).")

else:
    # 2) Look for columns whose name suggests time
    time_keywords = ["time", "timestamp", "fecha", "date"]
    candidates = [c for c in df.columns
                  if any(k in c.lower() for k in time_keywords)]
    if candidates:
        time_col = candidates[0]
        print(f"✔ Detected candidate time column: {time_col}")

# -------------------------------------------------------------------
# 4.2 Build / reuse `segs` from the best available time source
# -------------------------------------------------------------------

if time_col is not None:
    # Use the "real" time column as the reference
    print(f"→ Using '{time_col}' as the time source to build 'segs'.")
    ts = pd.to_datetime(df[time_col])
    # Epoch in whole seconds
    df["segs"] = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
    df = df.set_index("segs").sort_index()

elif "segs" in df.columns:
    print("✔ No explicit time column found, but 'segs' exists. Using it as the index.")
    df = df.set_index("segs").sort_index()

elif "epoc" in df.columns:
    print("✔ Detected column 'epoc'; renaming it to 'segs'.")
    df = df.rename(columns={"epoc": "segs"})
    df = df.set_index("segs").sort_index()

else:
    print("⚠️ No time column found ('Timestamp', 'time*', 'segs' or 'epoc').")
    print("   → 'segs' will be generated automatically from the dataframe index.")
    # Synthetic axis: one sample per second starting at 2020-01-01
    start = pd.Timestamp("2020-01-01")
    df["segs"] = (
        (start + pd.to_timedelta(df.index, unit="s"))
        - pd.Timestamp("1970-01-01")
    ) // pd.Timedelta("1s")
    df = df.set_index("segs").sort_index()
    print("✔ Column 'segs' generated automatically.")


# -------------------------------------------------------------------
# 4.3 Compute the time difference between consecutive samples
# -------------------------------------------------------------------
df["segs_diff"] = df.index.to_series().diff()

median_step = float(df["segs_diff"].median())
print(f"✔ Median sampling interval (Tu): {median_step:.6f} seconds")

print("\nPreview of the index and segs_diff:")
display(df[["segs_diff"]].head(5))



=== 4. Preparing the time axis ===
✔ Detected time column 'Timestamp' (highest priority).
→ Using 'Timestamp' as the time source to build 'segs'.


✔ Median sampling interval (Tu): 10.000000 seconds

Preview of the index and segs_diff:


Unnamed: 0_level_0,segs_diff
segs,Unnamed: 1_level_1
1651363201,
1651363211,10.0
1651363221,10.0
1651363231,10.0
1651363241,10.0
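
The `segs` index is the number of whole seconds since 1970-01-01. The same arithmetic can be sketched with the standard library alone, assuming the timestamps are naive and treated as UTC (which is what subtracting `pd.Timestamp("1970-01-01")` amounts to). `to_epoch_seconds` is an illustrative name.

```python
from datetime import datetime, timezone
from statistics import median

def to_epoch_seconds(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as UTC and return whole seconds since the epoch."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())

stamps = [
    "2022-05-01 00:00:01",
    "2022-05-01 00:00:11",
    "2022-05-01 00:00:21",
    "2022-05-01 00:00:31",
]
segs = [to_epoch_seconds(s) for s in stamps]
segs_diff = [b - a for a, b in zip(segs, segs[1:])]

print(segs[0])            # 1651363201, the first index value in the preview
print(median(segs_diff))  # 10
```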


## 4. Value analysis and diagnostics

In [8]:
print("=== 5. Pre-cleaning diagnostics ===")

with open(VARIANT_DIR / "params.yaml", "r", encoding="utf-8") as f:
    variant_params = yaml.safe_load(f) or {}

cleaning_strategy = variant_params.get("cleaning_strategy", "none")
nan_values = variant_params.get("nan_values", [])
error_values_map = variant_params.get("error_values_by_column", {})

print(f"[PARAMS] cleaning_strategy      = {cleaning_strategy}")
print(f"[PARAMS] nan_values             = {nan_values}")
print(f"[PARAMS] error_values_by_column = {error_values_map}")

report_preclean = {}

# -------------------------------------------------------------------
# 5.1 Null values per column
# -------------------------------------------------------------------
nulls = df.isna().sum()
report_preclean["nulls"] = nulls.to_dict()

print("\n📌 Null values per column:")
display(nulls[nulls > 0])

# -------------------------------------------------------------------
# 5.2 Constant columns (no variation)
# -------------------------------------------------------------------
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
report_preclean["constant_columns"] = constant_cols

print("\n📌 Constant columns:")
display(constant_cols)

# -------------------------------------------------------------------
# 5.3 Values outside the typical range using the IQR (outliers)
# -------------------------------------------------------------------
numeric_cols = df.select_dtypes(include=[np.number]).columns

outlier_summary = {}

for col in numeric_cols:
    series = df[col].dropna()
    if len(series) < 10:
        continue

    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outliers = ((series < lower) | (series > upper)).sum()
    if outliers > 0:
        outlier_summary[col] = int(outliers)

report_preclean["outliers_IQR"] = outlier_summary

print("\n📌 Columns with outliers detected (IQR method):")
display(outlier_summary)

# -------------------------------------------------------------------
# 5.4 Suspect values according to the user parameters
# -------------------------------------------------------------------

suspect_map = {}

# 5.4.1 Global values to be interpreted as NaN
for v in nan_values:
    for col in numeric_cols:
        cnt = (df[col] == v).sum()
        if cnt > 0:
            suspect_map.setdefault(col, {})[f"nan_value_{v}"] = int(cnt)

# 5.4.2 Erroneous values per column
for col, bad_values in error_values_map.items():
    if col in df.columns:
        for v in bad_values:
            cnt = (df[col] == v).sum()
            if cnt > 0:
                suspect_map.setdefault(col, {})[f"error_value_{v}"] = int(cnt)

report_preclean["suspect_values"] = suspect_map

print("\n📌 Suspect values according to the cleaning parameters:")
display(suspect_map)

# -------------------------------------------------------------------
# 5.5 Save the pre-cleaning report (for the final HTML)
# -------------------------------------------------------------------

report_preclean_path = VARIANT_DIR / "01_preclean_report.json"
with open(report_preclean_path, "w", encoding="utf-8") as f:
    json.dump(report_preclean, f, indent=2, ensure_ascii=False)

print(f"\n✔ Pre-cleaning report saved to: {report_preclean_path}")


=== 5. Pre-cleaning diagnostics ===
[PARAMS] cleaning_strategy      = none
[PARAMS] nan_values             = [-999999.0]
[PARAMS] error_values_by_column = {}

📌 Null values per column:


segs_diff    1
dtype: int64


📌 Constant columns:


[]


📌 Columns with outliers detected (IQR method):


{'Battery_Active_Power': 232483,
 'Battery_Active_Power_Set_Response': 456643,
 'PVPCS_Active_Power': 266862,
 'GE_Body_Active_Power': 2572,
 'GE_Active_Power': 17219,
 'GE_Body_Active_Power_Set_Response': 2572,
 'FC_Active_Power_FC_END_Set': 315946,
 'FC_Active_Power': 807635,
 'FC_Active_Power_FC_end_Set_Response': 331742,
 'Island_mode_MCCB_Active_Power': 5204,
 'MG-LV-MSB_AC_Voltage': 3135,
 'Receiving_Point_AC_Voltage': 21998,
 'Island_mode_MCCB_AC_Voltage': 13252,
 'Island_mode_MCCB_Frequency': 263054,
 'MG-LV-MSB_Frequency': 268234,
 'Inlet_Temperature_of_Chilled_Water': 2881,
 'Outlet_Temperature': 2617,
 'segs_diff': 11}


📌 Suspect values according to the cleaning parameters:


{'Battery_Active_Power': {'nan_value_-999999.0': 2571},
 'Battery_Active_Power_Set_Response': {'nan_value_-999999.0': 2571},
 'PVPCS_Active_Power': {'nan_value_-999999.0': 2571},
 'GE_Body_Active_Power': {'nan_value_-999999.0': 2571},
 'GE_Active_Power': {'nan_value_-999999.0': 2571},
 'GE_Body_Active_Power_Set_Response': {'nan_value_-999999.0': 2571},
 'FC_Active_Power_FC_END_Set': {'nan_value_-999999.0': 1406},
 'FC_Active_Power': {'nan_value_-999999.0': 2571},
 'FC_Active_Power_FC_end_Set_Response': {'nan_value_-999999.0': 2571},
 'Island_mode_MCCB_Active_Power': {'nan_value_-999999.0': 2571},
 'MG-LV-MSB_AC_Voltage': {'nan_value_-999999.0': 2571},
 'Receiving_Point_AC_Voltage': {'nan_value_-999999.0': 2571},
 'Island_mode_MCCB_AC_Voltage': {'nan_value_-999999.0': 2571},
 'Island_mode_MCCB_Frequency': {'nan_value_-999999.0': 2571},
 'MG-LV-MSB_Frequency': {'nan_value_-999999.0': 2571},
 'Inlet_Temperature_of_Chilled_Water': {'nan_value_-999999.0': 2571},
 'Outlet_Temperature': {'nan


✔ Pre-cleaning report saved to: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/01_preclean_report.json


In [9]:
# ================================================================
# 5 bis. IQR OUTLIER DETECTION (diagnostic only, no cleaning)
# ================================================================

print("\n=== 5 bis. Outlier detection (IQR) per column ===")

num_cols = df.select_dtypes(include=[np.number]).columns

iqr_report = {}

for col in num_cols:
    series = df[col].dropna()
    if len(series) < 5:
        continue  # not enough data for the IQR

    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr

    mask_low = series < low
    mask_high = series > high

    n_low = mask_low.sum()
    n_high = mask_high.sum()
    total_outliers = n_low + n_high

    if total_outliers > 0:
        iqr_report[col] = {
            "low_threshold": float(low),
            "high_threshold": float(high),
            "num_low_outliers": int(n_low),
            "num_high_outliers": int(n_high),
            "total_outliers": int(total_outliers),
            "percent": float(100 * total_outliers / len(series)),
        }


if len(iqr_report) == 0:
    print("✔ No IQR outliers found in any column.")
else:
    print("✔ IQR outliers found in numeric columns:")

    outlier_df = pd.DataFrame.from_dict(iqr_report, orient="index")
    display(outlier_df)

OUTLIER_REPORT = iqr_report  # in case the HTML report needs it later




=== 5 bis. Outlier detection (IQR) per column ===


✔ IQR outliers found in numeric columns:


Unnamed: 0,low_threshold,high_threshold,num_low_outliers,num_high_outliers,total_outliers,percent
Battery_Active_Power,-0.5,0.3,197248,35235,232483,5.980667
Battery_Active_Power_Set_Response,0.0,0.0,412078,44565,456643,11.747223
PVPCS_Active_Power,-22.5,37.5,2571,264291,266862,6.865073
GE_Body_Active_Power,-172.5,287.5,2571,1,2572,0.066165
GE_Active_Power,-172.450005,284.750008,17215,4,17219,0.442962
GE_Body_Active_Power_Set_Response,0.0,320.0,2571,1,2572,0.066165
FC_Active_Power_FC_END_Set,40.0,40.0,274736,41210,315946,8.127768
FC_Active_Power,30.5,42.5,771656,35979,807635,20.776556
FC_Active_Power_FC_end_Set_Response,40.0,40.0,276240,55502,331742,8.534123
Island_mode_MCCB_Active_Power,-275.0,141.0,5182,22,5204,0.133874
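
Some rows in the table have identical low and high thresholds (for example `0.0`/`0.0` or `40.0`/`40.0`). That happens when the first and third quartiles coincide: the IQR is zero and every value different from that quartile is flagged. The sketch below reproduces the effect with the standard library. It assumes linearly interpolated quartiles (the pandas default), and `iqr_fences` and `count_outliers` are illustrative names.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (low, high) Tukey fences using linearly interpolated quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(values, k=1.5):
    """Count the values that fall strictly outside the fences."""
    low, high = iqr_fences(values, k)
    return sum(1 for v in values if v < low or v > high)

spread = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
setpoint = [40.0] * 9 + [80.0]  # almost constant, like a set-point signal

print(iqr_fences(spread), count_outliers(spread))      # (-3.5, 14.5) 1
print(iqr_fences(setpoint), count_outliers(setpoint))  # (40.0, 40.0) 1
```

For near-constant signals the IQR rule therefore measures "differs from the usual value" rather than "extreme", which is worth keeping in mind when reading the percentages.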


## 5. Applying the cleaning strategy

In [10]:
# ================================================================
# 6. APPLY THE CLEANING STRATEGY
# ================================================================

print("\n=== 6. Applying the cleaning strategy ===")

cleaning_strategy = variant_params.get("cleaning_strategy", "none")
nan_values = variant_params.get("nan_values", [])
error_values_by_column = variant_params.get("error_values_by_column", {})

print(f"[INFO] Cleaning strategy          : {cleaning_strategy}")
print(f"[INFO] Global values → NaN        : {nan_values}")
print(f"[INFO] Erroneous values per column: {error_values_by_column}")

df_clean = df.copy()
nan_replacements_total = 0

# -------------------------------
# Strategy: none
# -------------------------------
if cleaning_strategy == "none":
    # Nothing is modified
    print("➡️ No cleaning is applied.")


# -------------------------------
# Strategy: basic
# -------------------------------
elif cleaning_strategy == "basic":

    # 1) Replace the global sentinel values with NaN
    if nan_values:
        before = df_clean.isna().sum().sum()
        df_clean.replace(nan_values, np.nan, inplace=True)
        after = df_clean.isna().sum().sum()
        changed = after - before
        nan_replacements_total += changed
        print(f"✔ Replaced values {nan_values} with NaN (global): {changed}")

    # 2) Replace the erroneous values per column.
    #    Assign the result back: an in-place replace on df_clean[col]
    #    may act on a temporary copy and leave df_clean untouched.
    for col, bad_vals in error_values_by_column.items():
        if col in df_clean.columns:
            before = df_clean[col].isna().sum()
            df_clean[col] = df_clean[col].replace(bad_vals, np.nan)
            after = df_clean[col].isna().sum()
            changed = after - before
            nan_replacements_total += changed
            print(f"✔ Column {col}: replaced {bad_vals} → NaN: {changed}")

    # 3) Drop columns that ended up completely empty
    before = df_clean.shape[1]
    df_clean.dropna(axis=1, how="all", inplace=True)
    after = df_clean.shape[1]
    print(f"✔ Columns dropped because they ended up empty: {before - after}")


# -------------------------------
# Strategy: full
# -------------------------------
elif cleaning_strategy == "full":

    # 1) Basic cleaning first
    if nan_values:
        before = df_clean.isna().sum().sum()
        df_clean.replace(nan_values, np.nan, inplace=True)
        after = df_clean.isna().sum().sum()
        changed = after - before
        nan_replacements_total += changed
        print(f"✔ Replaced values {nan_values} with NaN (global): {changed}")

    for col, bad_vals in error_values_by_column.items():
        if col in df_clean.columns:
            before = df_clean[col].isna().sum()
            df_clean[col] = df_clean[col].replace(bad_vals, np.nan)
            after = df_clean[col].isna().sum()
            changed = after - before
            nan_replacements_total += changed
            print(f"✔ Column {col}: replaced {bad_vals} → NaN: {changed}")

    # 2) Automatic removal of IQR outliers (set to NaN)
    num_cols = df_clean.select_dtypes(include=[np.number]).columns
    total_outliers = 0

    for col in num_cols:
        series = df_clean[col]
        q1 = series.quantile(0.25)
        q3 = series.quantile(0.75)
        iqr = q3 - q1
        low = q1 - 1.5 * iqr
        high = q3 + 1.5 * iqr

        mask = (series < low) | (series > high)
        outliers = mask.sum()
        total_outliers += int(outliers)

        df_clean.loc[mask, col] = np.nan

    print(f"✔ Outliers detected and set to NaN: {total_outliers}")

    # 3) Drop rows that ended up completely empty
    before = df_clean.shape[0]
    df_clean.dropna(axis=0, how="all", inplace=True)
    after = df_clean.shape[0]
    print(f"✔ Rows dropped because they ended up empty: {before - after}")

else:
    raise ValueError(f"Unknown cleaning strategy: {cleaning_strategy}")

print("✔ Cleaning finished.")
print(f"🧮 Total values converted to NaN: {nan_replacements_total}")

df_clean.head()


=== 6. Applying the cleaning strategy ===
[INFO] Cleaning strategy          : none
[INFO] Global values → NaN        : [-999999.0]
[INFO] Erroneous values per column: {}
➡️ No cleaning is applied.
✔ Cleaning finished.
🧮 Total values converted to NaN: 0


Unnamed: 0_level_0,Timestamp,Battery_Active_Power,Battery_Active_Power_Set_Response,PVPCS_Active_Power,GE_Body_Active_Power,GE_Active_Power,GE_Body_Active_Power_Set_Response,FC_Active_Power_FC_END_Set,FC_Active_Power,FC_Active_Power_FC_end_Set_Response,Island_mode_MCCB_Active_Power,MG-LV-MSB_AC_Voltage,Receiving_Point_AC_Voltage,Island_mode_MCCB_AC_Voltage,Island_mode_MCCB_Frequency,MG-LV-MSB_Frequency,Inlet_Temperature_of_Chilled_Water,Outlet_Temperature,segs_diff
segs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1651363201,2022-05-01 00:00:01,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1,
1651363211,2022-05-01 00:00:11,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1,10.0
1651363221,2022-05-01 00:00:21,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.200001,10.0
1651363231,2022-05-01 00:00:31,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.950001,59.950001,22.1,22.200001,10.0
1651363241,2022-05-01 00:00:41,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1,10.0
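
The `basic` strategy was not exercised in this run (the variant uses `none`). To make its three steps concrete, here is a dependency-free sketch on a toy dataset stored as plain lists. `basic_clean` and the toy column names are illustrative assumptions; the notebook itself performs the same steps with pandas.

```python
import math

def basic_clean(columns, nan_values, error_values_by_column):
    """Replace sentinel values with NaN, then drop columns left fully empty.

    Returns (cleaned_columns, number_of_replacements).
    """
    global_bad = set(nan_values)
    cleaned, replaced = {}, 0
    for name, values in columns.items():
        # Global sentinels plus the per-column erroneous values
        bad = global_bad | set(error_values_by_column.get(name, []))
        new_values = [math.nan if v in bad else v for v in values]
        replaced += sum(1 for v in values if v in bad)
        # Keep the column only if at least one real value survives
        if not all(math.isnan(v) for v in new_values):
            cleaned[name] = new_values
    return cleaned, replaced

toy = {
    "temp": [22.1, -999999.0, -50.0],
    "hum": [0.0, 55.0, 999.0],
    "dead": [-999999.0, -999999.0, -999999.0],
}
cleaned, replaced = basic_clean(
    toy,
    nan_values=[-999999.0],
    error_values_by_column={"hum": [0, 999], "temp": [-50]},
)
print(sorted(cleaned), replaced)  # ['hum', 'temp'] 7
```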


## 6. Basic statistics

In [11]:

df_clean.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Battery_Active_Power,3887242.0,-662.285805,25709.040089,-999999.0,-0.2,-0.1,0.0,119.900002
Battery_Active_Power_Set_Response,3887242.0,-663.310368,25709.012057,-999999.0,0.0,0.0,0.0,100.0
PVPCS_Active_Power,3887242.0,-652.845511,25709.283982,-999999.0,0.0,0.0,15.0,51.0
GE_Body_Active_Power,3887242.0,-619.330307,25710.219729,-999999.0,0.0,0.0,115.0,293.0
GE_Active_Power,3887242.0,-622.802896,25710.159461,-999999.0,-1.0,-0.7,113.300003,400.0
GE_Body_Active_Power_Set_Response,3887242.0,-506.601263,25713.070589,-999999.0,120.0,130.0,200.0,460.0
FC_Active_Power_FC_END_Set,3887242.0,-324.219143,19015.567318,-999999.0,40.0,40.0,40.0,80.0
FC_Active_Power,3887242.0,-631.596176,25709.831685,-999999.0,35.0,37.0,38.0,76.0
FC_Active_Power_FC_end_Set_Response,3887242.0,-623.818145,25710.029285,-999999.0,40.0,40.0,40.0,80.0
Island_mode_MCCB_Active_Power,3887242.0,-716.941724,25707.705093,-999999.0,-119.0,-26.0,-15.0,197.0
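
Every column in this table has a minimum of `-999999.0` and a strongly negative mean, which is what one would expect when the sentinel value is left in place (the variant uses `cleaning_strategy = none`). The toy example below, with made-up readings, shows how a single sentinel dominates the mean.

```python
from statistics import fmean

SENTINEL = -999999.0
readings = [-0.2, -0.1, 0.0, -0.1, SENTINEL]  # illustrative values only

raw_mean = fmean(readings)
clean_mean = fmean([v for v in readings if v != SENTINEL])

print(f"mean with sentinel:    {raw_mean:.2f}")
print(f"mean without sentinel: {clean_mean:.2f}")
```

The quartiles in the table (`25%`, `50%`, `75%`) are far less affected, which is why they remain a more reliable summary here than `mean` and `std`.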


## 7. Histograms per numeric column

In [12]:
from mlops4ofp.tools.figures import save_figure

fig_paths = []

for col in num_cols:
    data = df_clean[col].dropna()
    if data.empty:
        continue

    bins = max(30, min(100, int(np.sqrt(len(data)))))
    out_path = FIGURES_DIR / f"hist_{col}.png"

    save_figure(
        out_path,
        plot_fn=lambda d=data, b=bins, c=col: (
            plt.hist(d, bins=b, edgecolor="black", alpha=0.7),
            plt.title(f"Histogram: {c}")
        ),
        figsize=(8, 4),
    )
    fig_paths.append((f"Histogram {col}", out_path))
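
The bin count in the cell above follows the square-root rule, clamped to the range 30 to 100. The sketch below isolates that rule. The helper name `clamped_sqrt_bins` is hypothetical and is not part of `mlops4ofp`.

```python
import math


def clamped_sqrt_bins(n, lo=30, hi=100):
    """Square-root rule for the number of bins, clamped to [lo, hi]."""
    return max(lo, min(hi, int(math.sqrt(n))))


# Small, medium and very large sample sizes.
for n in (100, 2_500, 3_887_242):
    print(n, clamped_sqrt_bins(n))
```

With the 3,887,242 rows reported by `describe()`, the square root is about 1,971, so the clamp keeps the histogram at 100 bins.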




## 8. Temporal analysis of sample gaps (`segs_diff`)

In [13]:
from mlops4ofp.tools.figures import save_figure

if "segs_diff" in df_clean.columns:
    out_path = FIGURES_DIR / "temporal_segs_diff.png"

    save_figure(
        out_path,
        plot_fn=lambda: (
            df_clean["segs_diff"].dropna().plot(),
            plt.title("Δ segs between consecutive samples"),
        ),
        figsize=(10, 4),
    )
    fig_paths.append(("Δ segs between consecutive samples", out_path))



In [14]:
def resolve_Tu_and_nan_repl(df_clean, Tu=None, nan_repl=None):
    """
    Resolve Tu and nan_repl with safe fallbacks.

    Tu falls back to the median of 'segs_diff'; nan_repl falls back to 0.
    """
    # Tu: use the explicit value if given, otherwise the median sampling gap
    if Tu is not None:
        Tu_value = float(Tu)
    else:
        if "segs_diff" not in df_clean.columns:
            raise RuntimeError("❌ Could not find 'segs_diff' to compute Tu.")
        Tu_value = float(df_clean["segs_diff"].median())

    # nan_repl: anything that cannot be converted to int counts as 0
    try:
        nan_repl_value = int(nan_repl) if nan_repl is not None else 0
    except (TypeError, ValueError, OverflowError):
        nan_repl_value = 0

    return Tu_value, nan_repl_value

Tu_value, nan_repl_value = resolve_Tu_and_nan_repl(
    df_clean=df_clean,
    Tu=globals().get("Tu"),
    nan_repl=globals().get("nan_repl"),
)
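
When `Tu` is not supplied, the function above takes the median of `segs_diff`. The sketch below shows why the median is a robust choice when the series has occasional gaps. It assumes `segs_diff` is the difference between consecutive `segs` timestamps, and the timestamps are invented for illustration.

```python
import pandas as pd

# Hypothetical timestamps in seconds: sampled every 10 s, with one 70 s gap.
segs = pd.Series([0, 10, 20, 30, 100, 110, 120])

# Assumed definition of segs_diff: gap to the previous sample (first is NaN).
segs_diff = segs.diff()

Tu_median = float(segs_diff.median())  # ignores the NaN, robust to the gap
Tu_mean = float(segs_diff.mean())      # pulled upwards by the single gap

print(f"median = {Tu_median}, mean = {Tu_mean}")
```

Here the median recovers the nominal 10 s period, while the mean is doubled by a single outage.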

## 9. Correlation matrix

In [16]:

from mlops4ofp.tools.figures import save_figure

num_cols_corr = [c for c in num_cols if c != "segs_diff"]

if len(num_cols_corr) >= 2:
    corr = df_clean[num_cols_corr].corr()
    out_path = FIGURES_DIR / "corr_heatmap.png"

    save_figure(
        out_path,
        plot_fn=lambda: (
            sns.heatmap(corr, annot=False),
            plt.title("Correlation matrix"),
        ),
        figsize=(10, 8),
    )
    fig_paths.append(("Correlation matrix", out_path))



## 10. Save the explored dataset, generated parameters, and metadata

In [17]:
from mlops4ofp.tools.artifacts import (
    get_git_hash,
    save_numeric_dataset,
    save_params_and_metadata,
)

# --- Dataset ---
dataset_path = OUTPUTS["dataset"]

numeric_cols, df_out = save_numeric_dataset(
    df=df_clean,
    output_path=dataset_path,
    index_name="segs",
    drop_columns=["Timestamp", "segs_diff", "segs_dt"],
)

print(f"[OK] Dataset saved to: {dataset_path}")

# --- Generated parameters ---
gen_params = {
    "Tu": float(Tu_value),
    "n_rows": int(len(df_out)),
    "n_cols": int(df_out.shape[1]),
    "numeric_cols": numeric_cols,
    "nan_replacements_total": nan_repl_value,
}

# --- Metadata ---
metadata_extra = {
    "dataset_explored": str(dataset_path),
    "Tu": float(Tu_value),
    "nan_replacements_total": nan_repl_value,
    "n_rows": int(len(df_out)),
    "n_cols": int(df_out.shape[1]),
    "cleaning_strategy": variant_params.get("cleaning_strategy"),
    "nan_values": variant_params.get("nan_values"),
    "error_values_by_column": variant_params.get("error_values_by_column"),
}

save_params_and_metadata(
    phase=PHASE,
    variant=ACTIVE_VARIANT,
    variant_root=ctx["variant_root"],
    raw_path=raw_path,
    gen_params=gen_params,
    metadata_extra=metadata_extra,
    pm=pm,  # optional
    git_commit=get_git_hash(),
)


[OK] Dataset saved to: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/01_explore_dataset.parquet


(PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/01_explore_params.json'),
 PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/01_explore_metadata.json'))

## 11. Generate the variant's HTML report

In [18]:
# ============================================================
# 11.x: Figures + full HTML report
# ============================================================

from mlops4ofp.tools.figures import save_figure
import matplotlib.pyplot as plt

FIGURES_DIR.mkdir(parents=True, exist_ok=True)

# ============================================================
# FIGURE 1: Percentage of nulls per column
# ============================================================

na_pct = df_out.isna().mean() * 100
fig1 = FIGURES_DIR / "01_nulls_pct.png"

save_figure(
    fig1,
    plot_fn=lambda: (
        na_pct.sort_values(ascending=False)
        .plot(kind="bar", title="Percentage of nulls per column")
    ),
    figsize=(12, 4),
)

fig_paths.append(("Nulls per column", fig1))


# ============================================================
# FIGURE 2: Mean per numeric variable
# ============================================================

desc = df_out.describe().T
fig2 = FIGURES_DIR / "02_mean_per_variable.png"

save_figure(
    fig2,
    plot_fn=lambda: (
        desc["mean"].plot(title="Mean per numeric variable")
    ),
    figsize=(12, 4),
)

fig_paths.append(("Mean per variable", fig2))


# ============================================================
# FIGURE 3: Histograms per numeric column
# ============================================================

for col in numeric_cols:
    fig_path = FIGURES_DIR / f"hist_{col}.png"

    save_figure(
        fig_path,
        plot_fn=lambda c=col: (
            df_out[c].dropna().hist(bins=50),
            plt.title(f"Histogram: {c}")
        ),
        figsize=(10, 4),
    )

    fig_paths.append((f"Histogram {col}", fig_path))


# ============================================================
# FIGURE 4: Time evolution per numeric column
# ============================================================

for col in numeric_cols:
    if col == "segs":
        continue

    fig_path = FIGURES_DIR / f"time_{col}.png"

    save_figure(
        fig_path,
        plot_fn=lambda c=col: (
            plt.plot(df_out["segs"], df_out[c]),
            plt.title(f"Time evolution: {c}"),
            plt.xlabel("segs"),
        ),
        figsize=(12, 4),
    )

    fig_paths.append((f"Time evolution {col}", fig_path))


# ============================================================
# FIGURE 5: Correlation matrix
# ============================================================

corr = df_out[numeric_cols].corr()
fig_corr = FIGURES_DIR / "correlation_matrix.png"

save_figure(
    fig_corr,
    plot_fn=lambda: (
        plt.imshow(corr, cmap="coolwarm", interpolation="nearest"),
        plt.colorbar(),
        plt.title("Correlation matrix"),
        plt.xticks(range(len(numeric_cols)), numeric_cols, rotation=90),
        plt.yticks(range(len(numeric_cols)), numeric_cols),
    ),
    figsize=(10, 8),
)

fig_paths.append(("Correlation matrix", fig_corr))


# ============================================================
# FINAL HTML REPORT
# ============================================================

EXPLORE_REPORT = VARIANT_DIR / "01_explore_report.html"

sections = []

# ----- HEADER -----
sections.append(f"<h1>Exploration Report: Variant {ACTIVE_VARIANT}</h1>")
sections.append(f"<p><b>RAW file:</b> {raw_path}</p>")
sections.append(f"<p><b>Rows:</b> {len(df_out):,}</p>")
sections.append(f"<p><b>Numeric columns:</b> {df_out.shape[1]}</p>")
sections.append(f"<p><b>Tu:</b> {Tu_value}</p>")
sections.append("<hr>")

# ----- TABLES -----
sections.append("<h2>Nulls per column (%)</h2>")
sections.append(na_pct.to_frame("pct_nulls").to_html())

sections.append("<h2>Basic statistics</h2>")
sections.append(desc.to_html())

# ----- FIGURES -----
sections.append("<hr>")
sections.append("<h2>Figures</h2>")

for title, path in fig_paths:
    sections.append(
        f"<h3>{title}</h3><img src='figures/{path.name}' width='900'>"
    )

# ----- SAVE HTML -----
with open(EXPLORE_REPORT, "w", encoding="utf-8") as f:
    f.write("<html><body>" + "\n".join(sections) + "</body></html>")

print(f"[OK] HTML report written to {EXPLORE_REPORT}")
print(f"[OK] Figures written to: {FIGURES_DIR}")


[OK] HTML report written to /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/01_explore_report.html
[OK] Figures written to: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v002/figures
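
Section 7 and the cell above both register `hist_{col}.png`, so the same file can appear more than once in `fig_paths` and would then be embedded repeatedly in the report. The sketch below shows one possible way to build the figure snippets with de-duplication and HTML escaping. The helper `figure_sections` is hypothetical and is not part of `mlops4ofp`.

```python
from html import escape
from pathlib import Path


def figure_sections(fig_paths, figures_subdir="figures", width=900):
    """Return one HTML snippet per distinct figure file, in first-seen order."""
    seen = set()
    snippets = []
    for title, path in fig_paths:
        name = Path(path).name
        if name in seen:
            # Same file registered twice: keep only the first entry.
            continue
        seen.add(name)
        snippets.append(
            f"<h3>{escape(title)}</h3>"
            f"<img src='{figures_subdir}/{escape(name, quote=True)}' width='{width}'>"
        )
    return snippets


# Illustrative input: one duplicated entry and one title needing escaping.
demo = [
    ("Histogram A", Path("figures/hist_A.png")),
    ("Histogram A", Path("figures/hist_A.png")),
    ("A < B", Path("figures/corr.png")),
]
snippets = figure_sections(demo)
print("\n".join(snippets))
```

Escaping matters little for the titles used here, but it keeps the report well-formed if a column name ever contains `<` or `&`.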


In [19]:
#import pandas as pd
#import numpy as np

# Note: the path below is hard-coded to variant v001.
df = pd.read_parquet(
    "../executions/01_explore/v001/01_explore_dataset.parquet"
)

na_counts = df.isna().sum()
print("Total NaN:", na_counts.sum())
print("Columns with NaN:")
print(na_counts[na_counts > 0])


Total NaN: 0
Columns with NaN:
Series([], dtype: int64)



## 12. Summary

This run has shown:

- How a **variant** (e.g. `v001`) fixes:
  - the input dataset (`raw_dataset_path`),
  - the cleaning strategy (`cleaning_strategy`, `nan_values`, `error_values_by_column`).
- How a systematic EDA is carried out:
  - analysis of types, nulls, histograms, correlations, and the time axis.
- How the following are generated and saved:
  - the explored dataset (`01_explore_dataset.parquet`),
  - derived parameters (`01_explore_params.json`),
  - metadata (`01_explore_metadata.json`),
  - the HTML report (`01_explore_report.html`),
  - figures in `figures/`.

Everything is **encapsulated in the variant's folder**, which makes it possible to:

- reproduce the experiment later,
- compare variants with each other,
- chain this variant as the input to the next phase of the pipeline.
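
Because every artifact lives under `executions/01_explore/vNNN/`, a later phase can locate them from the variant name alone. The sketch below follows the file names printed in the outputs above. The helper `variant_artifacts` and the project root passed to it are hypothetical, and the real pipeline may resolve paths differently.

```python
from pathlib import Path


def variant_artifacts(project_root, variant, phase="01_explore"):
    """Map artifact names to their expected paths inside a variant folder."""
    variant_dir = Path(project_root) / "executions" / phase / variant
    return {
        "dataset": variant_dir / f"{phase}_dataset.parquet",
        "params": variant_dir / f"{phase}_params.json",
        "metadata": variant_dir / f"{phase}_metadata.json",
        "report": variant_dir / f"{phase}_report.html",
        "figures": variant_dir / "figures",
    }


# Placeholder project root; replace it with the real checkout location.
paths = variant_artifacts("/path/to/mlops4ofp", "v002")
for name, path in paths.items():
    print(f"{name:9s} {path}")
```

Comparing two variants then amounts to calling the helper twice and loading the matching files.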
