# Phase 02: prepareEventsDS (notebook)

This notebook generates the **event dataset** from the clean dataset
produced in Phase 01 (`01_explore`), following the same variant and
traceability pattern as Phase 01.

**Main inputs:**
- Explored dataset from Phase 01, associated with the parent variant
  specified in the Phase 02 configuration.
- Phase 02 parameters, defined in the variant's `params.yaml`:
  - `band_thresholds_pct`: list of percentages of each signal's min-max range, used to discretize the signal into bands.
  - `event_strategy` ∈ {`levels`, `transitions`, `both`}.
  - `nan_handling` ∈ {`keep`, `discard`}.
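
The effect of `band_thresholds_pct` can be illustrated with a small, self-contained sketch. It mirrors the idea behind the notebook's `compute_cuts_and_labels` helper (percentages applied to the min-max range of a signal), but `band_cuts` is a hypothetical name used only here, the example range is illustrative, and the degenerate case of a constant signal is not handled.

```python
from typing import Dict, List


def band_cuts(minimum: float, maximum: float, thresholds_pct: List[float]) -> Dict[str, list]:
    """Cut points and labels for one signal, using percentages of its min-max range."""
    pcts = [0.0] + list(thresholds_pct) + [100.0]
    span = maximum - minimum
    # Each percentage is mapped linearly onto [minimum, maximum].
    cuts = [minimum + p / 100.0 * span for p in pcts]
    # A band is named after the two percentages that delimit it, e.g. "40_60".
    labels = [f"{int(lo)}_{int(hi)}" for lo, hi in zip(pcts, pcts[1:])]
    return {"cuts": cuts, "labels": labels}


# Illustrative signal ranging from -90 to 100, with thresholds at 40 %, 60 % and 80 %.
bands = band_cuts(-90.0, 100.0, [40, 60, 80])
print(bands["labels"])
print([round(c, 6) for c in bands["cuts"]])
```

With three thresholds the signal is split into four bands, so each measure contributes four possible level events.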


**Main outputs:**
- Band cut points for each measure: `/executions/02_prepareeventsds/vNN/02_prepareeventsds_bands.json`.
- Event dataset: `/executions/02_prepareeventsds/vNN/02_prepareeventsds_dataset.parquet`.
- Event catalog: `/executions/02_prepareeventsds/vNN/02_prepareeventsds_event_catalog.json` (name → ID).
- Metadata: `/executions/02_prepareeventsds/vNN/02_prepareeventsds_metadata.json`.
- Parameters: `/executions/02_prepareeventsds/vNN/02_prepareeventsds_params.json`.
- HTML report: `/executions/02_prepareeventsds/vNN/02_prepareeventsds_report.html`.
- Figures in `/executions/02_prepareeventsds/vNN/figures/`.

⚠️ **This notebook does not create new variants.**  
Variants are ALWAYS created from the command line with:

```bash
make variant2 VARIANT=v011 PARENT=v001 BANDS="40 60 90" STRATEGY=transitions NAN='discard'
```

The notebook is then run for a given variant with:

```bash
make nb2-run VARIANT=v011
```

## 1. Data loading and configuration


In [1]:
# =====================================================================
# 1. COMMON IMPORTS
# =====================================================================
import os
import sys
import json
import yaml
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime, timezone
import seaborn as sns
import shutil

# Common plotting style
plt.rcParams["figure.figsize"] = (8, 4)
plt.rcParams["axes.grid"] = True

print("✔ Imports and style loaded")

✔ Imports and style loaded


In [2]:
# =====================================================================
# 2. EXECUTION CONTEXT (bootstrap + run_context)
# =====================================================================
import sys
from pathlib import Path

# --- MINIMAL BOOTSTRAP (before any mlops4ofp import) ---
execution_dir = Path.cwd().resolve()

current = execution_dir
for _ in range(10):
    if (current / "mlops4ofp").exists():
        project_root = current
        break
    current = current.parent
else:
    raise RuntimeError("❌ Could not locate project_root")

sys.path.insert(0, str(project_root))

print(f"📁 Project root added to PYTHONPATH: {project_root}")

# --- Regular imports are possible from here on ---
from mlops4ofp.tools.run_context import (
    detect_execution_dir,
    detect_project_root,
    assemble_run_context,
)
import yaml

PHASE = "02_prepareeventsds"
phase = PHASE

# -------------------------------------------------------------------
# Variant selection
# -------------------------------------------------------------------
env_variant = os.getenv("VARIANT") or os.getenv("ACTIVE_VARIANT")

# env_variant = "v001"  # Uncomment and set this to force a specific variant

if not env_variant:
    raise RuntimeError(
        "❌ VARIANT is not defined. Run the notebook with: make nb2-run VARIANT=v001"
    )
ACTIVE_VARIANT = env_variant
print(f"[INFO] Active variant (from environment): {ACTIVE_VARIANT}")

# -------------------------------------------------------------------
# Build the final context
# -------------------------------------------------------------------
from mlops4ofp.tools.params_manager import ParamsManager

pm = ParamsManager(PHASE, project_root)

print("PHASE_DIR according to ParamsManager:", pm.phase_dir)
print("Does PHASE_DIR exist?", pm.phase_dir.exists())
print("Does the variant exist?", (pm.phase_dir / ACTIVE_VARIANT).exists())

pm.set_current(ACTIVE_VARIANT)

variant_root = pm.current_variant_dir()

ctx = assemble_run_context(
    execution_dir=detect_execution_dir(),
    project_root=project_root,
    phase=PHASE,
    variant=ACTIVE_VARIANT,
    variant_root=variant_root,
)

print("✔ Execution context built")
print(f"   Phase: {ctx['phase']}")
print(f"   Variant: {ctx['variant']}")
print(f"   Variant folder: {ctx['variant_root']}")


📁 Project root added to PYTHONPATH: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp
[INFO] Active variant (from environment): v200
PHASE_DIR according to ParamsManager: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds
Does PHASE_DIR exist? True
Does the variant exist? True
✔ Execution context built
   Phase: 02_prepareeventsds
   Variant: v200
   Variant folder: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200
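
The bootstrap step in the cell above locates the project root by walking up the directory tree until a folder named `mlops4ofp` is found. The sketch below isolates that idea as a reusable function; `find_project_root` and its parameters are hypothetical names introduced for illustration and are not part of the `mlops4ofp` package.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "mlops4ofp", max_levels: int = 10) -> Path:
    """Walk up from `start` until a directory containing `marker` is found."""
    current = start.resolve()
    for _ in range(max_levels):
        if (current / marker).exists():
            return current
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    raise RuntimeError(f"Could not locate a directory containing '{marker}'")


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "repo"
    (root / "mlops4ofp").mkdir(parents=True)
    nested = root / "notebooks" / "phase02"
    nested.mkdir(parents=True)
    found = find_project_root(nested)
    found_is_root = (found == root)

print("Root located correctly:", found_is_root)
```

Bounding the search with `max_levels` keeps the failure explicit when the notebook is launched from outside the repository.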


In [3]:
# =====================================================================
# 4. DATA AND OUTPUT PATHS (VARIANT)
#    Every file name is prefixed with the phase name
# =====================================================================
from mlops4ofp.tools.run_context import build_phase_outputs
VARIANT_DIR = ctx["variant_root"]
PHASE_PREFIX = ctx["phase"]

# Input data (raw is shared across the whole project)
RAW_DIR = ctx["project_root"] / "data" / "01-raw"
RAW_DIR.mkdir(parents=True, exist_ok=True)

OUTPUTS = build_phase_outputs(
    variant_root=VARIANT_DIR,
    phase=ctx["phase"],
)
ctx["outputs"] = OUTPUTS  # so that generate_figures_and_report can use ctx["outputs"]["report"]

# Figures (the only subfolder allowed inside the variant)
FIGURES_DIR = ctx["figures_dir"]

print("✔ Output paths ready (scoped to the variant)")
OUTPUTS, FIGURES_DIR

✔ Output paths ready (scoped to the variant)


({'dataset': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_dataset.parquet'),
  'report': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_report.html'),
  'params': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_params.json'),
  'metadata': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_metadata.json')},
 PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/figures'))
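
The output above suggests a simple naming convention: every artifact lives in the variant folder and is named `<phase>_<artifact>`. The following sketch reproduces that convention for the four keys shown. It is a hypothetical stand-in written from the printed paths, not the actual implementation of `build_phase_outputs`, and `sketch_phase_outputs` is a name invented for this example.

```python
from pathlib import Path


def sketch_phase_outputs(variant_root: Path, phase: str) -> dict:
    """Hypothetical stand-in: builds '<variant_root>/<phase>_<artifact>' paths."""
    suffixes = {
        "dataset": "dataset.parquet",
        "report": "report.html",
        "params": "params.json",
        "metadata": "metadata.json",
    }
    return {key: variant_root / f"{phase}_{suffix}" for key, suffix in suffixes.items()}


outputs = sketch_phase_outputs(
    Path("executions/02_prepareeventsds/v200"), "02_prepareeventsds"
)
for key, path in outputs.items():
    print(f"{key:9s} -> {path.name}")
```

Keeping the phase name in every file name makes artifacts self-describing when they are copied out of their variant folder.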

In [4]:
# =====================================================================
# 3. Read the parameters of the active variant (Phase 02)
# =====================================================================
from mlops4ofp.tools.params_manager import validate_params

params_path = VARIANT_DIR / "params.yaml"
if not params_path.exists():
    raise FileNotFoundError(
        f"The variant's parameter file does not exist: {params_path}\n"
        f"Create the variant first with:\n"
        f"    make variant2 VARIANT={ACTIVE_VARIANT} ..."
    )

# Read the parameters and validate them against traceability_schema.yaml
with open(params_path, "r", encoding="utf-8") as f:
    params_f02 = yaml.safe_load(f) or {}

validate_params(PHASE, params_f02, ctx["project_root"])
print("✔ Phase 02 parameters validated against the schema.")

# ---------------------------------------------------------------------
# Phase 02 specific parameters
# ---------------------------------------------------------------------
band_thresholds_pct = params_f02.get("band_thresholds_pct", [40, 60, 90])
event_strategy = params_f02.get("event_strategy", "both")
nan_handling = params_f02.get("nan_handling", "keep")

parent_phase = params_f02.get("parent_phase", "01_explore")
parent_variant = params_f02.get("parent_variant")
variant_id = params_f02.get("variant_id", ACTIVE_VARIANT)

# Minimal explicit (phase-specific) validations
if not parent_variant:
    # The parent paths built below need a concrete Phase 01 variant
    raise ValueError(
        "parent_variant is not defined in params.yaml; "
        "it is required to locate the Phase 01 outputs."
    )

ALLOWED_STRATEGIES = {"transitions", "levels", "both"}
if event_strategy not in ALLOWED_STRATEGIES:
    raise ValueError(
        f"event_strategy='{event_strategy}' is not valid.\n"
        f"Allowed options: {ALLOWED_STRATEGIES}"
    )

ALLOWED_NAN_HANDLING = {"keep", "discard"}
if nan_handling not in ALLOWED_NAN_HANDLING:
    raise ValueError(
        f"nan_handling='{nan_handling}' is not valid.\n"
        f"Allowed options: {ALLOWED_NAN_HANDLING}"
    )

# Compact log
print("────────── VARIANT PARAMETERS (F02) ──────────")
print(f"variant_id           = {variant_id}")
print(f"parent_phase         = {parent_phase}")
print(f"parent_variant       = {parent_variant}")
print(f"band_thresholds_pct  = {band_thresholds_pct}")
print(f"event_strategy       = {event_strategy}")
print(f"nan_handling         = {nan_handling}")
print("──────────────────────────────────────────────")

# =====================================================================
# 4. Load the Phase 01 (parent) metadata to inherit Tu
# =====================================================================
parent_metadata_path = (
    ctx["project_root"]
    / "executions"
    / parent_phase
    / parent_variant
    / f"{parent_phase}_metadata.json"
)

if not parent_metadata_path.exists():
    raise FileNotFoundError(
        f"Parent phase metadata not found:\n{parent_metadata_path}"
    )

with open(parent_metadata_path, "r", encoding="utf-8") as f:
    parent_params_f01 = json.load(f)

# =====================================================================
# 5. Resolve Tu (priority: F02 → F01)
# =====================================================================
def resolve_Tu(params_f02, parent_params_f01):
    """
    Tu selection:
      1. Explicit value in the Phase 02 variant
      2. Inherited from the Phase 01 metadata
         (the inherited value is also written back into params_f02)
      3. Error if neither is available
    """
    if params_f02.get("Tu") is not None:
        Tu = float(params_f02["Tu"])
        print(f"✔ Tu taken from the Phase 02 variant: {Tu}")
        return Tu

    Tu_f01 = parent_params_f01.get("Tu")
    if Tu_f01 is not None:
        Tu = float(Tu_f01)
        print(f"✔ Tu inherited from Phase 01 metadata: {Tu}")
        params_f02["Tu"] = Tu
        return Tu

    raise RuntimeError("❌ Could not determine Tu (neither in Phase 02 nor in Phase 01).")


Tu = resolve_Tu(params_f02, parent_params_f01)
print(f"👉 Final Tu = {Tu} seconds")
ctx["variant_params"] = params_f02


✔ Phase 02 parameters validated against the schema.
────────── VARIANT PARAMETERS (F02) ──────────
variant_id           = v200
parent_phase         = 01_explore
parent_variant       = v101
band_thresholds_pct  = [40, 60, 80]
event_strategy       = levels
nan_handling         = discard
──────────────────────────────────────────────
✔ Tu inherited from Phase 01 metadata: 10.0
👉 Final Tu = 10.0 seconds


In [5]:
# =====================================================================
# 6. Load the input dataset (output of Phase 01)
# =====================================================================

parent_dataset_path = (
    ctx["project_root"]
    / "executions"
    / parent_phase
    / parent_variant
    / f"{parent_phase}_dataset.parquet"  # same naming scheme as the metadata path
)

if not parent_dataset_path.exists():
    raise FileNotFoundError(
        f"Explored dataset of the parent phase not found:\n{parent_dataset_path}"
    )

print(f"✔ Parent dataset found:\n{parent_dataset_path}")

df_explored = pd.read_parquet(parent_dataset_path)

print("Explored dataset shape:", df_explored.shape)
display(df_explored.head())


✔ Parent dataset found:
/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v101/01_explore_dataset.parquet


Explored dataset shape: (3887242, 18)


Unnamed: 0,segs,Battery_Active_Power,Battery_Active_Power_Set_Response,PVPCS_Active_Power,GE_Body_Active_Power,GE_Active_Power,GE_Body_Active_Power_Set_Response,FC_Active_Power_FC_END_Set,FC_Active_Power,FC_Active_Power_FC_end_Set_Response,Island_mode_MCCB_Active_Power,MG-LV-MSB_AC_Voltage,Receiving_Point_AC_Voltage,Island_mode_MCCB_AC_Voltage,Island_mode_MCCB_Frequency,MG-LV-MSB_Frequency,Inlet_Temperature_of_Chilled_Water,Outlet_Temperature
0,1651363201,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
1,1651363211,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
2,1651363221,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.200001
3,1651363231,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.950001,59.950001,22.1,22.200001
4,1651363241,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1


## 2. Temporal structure of the dataset

In [6]:
# =====================================================================
# 7. Temporal normalization and base columns
# =====================================================================

# --- 7.1 Check the time column ---
if "segs" not in df_explored.columns:
    raise RuntimeError(
        "❌ The explored dataset lacks the mandatory time column 'segs'."
    )

# Ensure chronological order
df_explored = df_explored.sort_values("segs").reset_index(drop=True)

# --- 7.2 Drop Timestamp if present (no longer needed in F02) ---
if "Timestamp" in df_explored.columns:
    df_explored = df_explored.drop(columns=["Timestamp"])
    print("✔ Column 'Timestamp' dropped")

# --- 7.3 Identify the measurement columns ---
measurement_cols = [
    c for c in df_explored.columns
    if c != "segs"
]

if not measurement_cols:
    raise RuntimeError("❌ No measurement columns were detected.")

print(f"✔ Measurement columns detected ({len(measurement_cols)}):")
print(measurement_cols)

# Quick preview
display(df_explored.head())


✔ Measurement columns detected (17):
['Battery_Active_Power', 'Battery_Active_Power_Set_Response', 'PVPCS_Active_Power', 'GE_Body_Active_Power', 'GE_Active_Power', 'GE_Body_Active_Power_Set_Response', 'FC_Active_Power_FC_END_Set', 'FC_Active_Power', 'FC_Active_Power_FC_end_Set_Response', 'Island_mode_MCCB_Active_Power', 'MG-LV-MSB_AC_Voltage', 'Receiving_Point_AC_Voltage', 'Island_mode_MCCB_AC_Voltage', 'Island_mode_MCCB_Frequency', 'MG-LV-MSB_Frequency', 'Inlet_Temperature_of_Chilled_Water', 'Outlet_Temperature']


Unnamed: 0,segs,Battery_Active_Power,Battery_Active_Power_Set_Response,PVPCS_Active_Power,GE_Body_Active_Power,GE_Active_Power,GE_Body_Active_Power_Set_Response,FC_Active_Power_FC_END_Set,FC_Active_Power,FC_Active_Power_FC_end_Set_Response,Island_mode_MCCB_Active_Power,MG-LV-MSB_AC_Voltage,Receiving_Point_AC_Voltage,Island_mode_MCCB_AC_Voltage,Island_mode_MCCB_Frequency,MG-LV-MSB_Frequency,Inlet_Temperature_of_Chilled_Water,Outlet_Temperature
0,1651363201,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
1,1651363211,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
2,1651363221,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.200001
3,1651363231,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.950001,59.950001,22.1,22.200001
4,1651363241,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
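
The rows above are spaced exactly `Tu` = 10 seconds apart, and the event generator later in this notebook only emits transitions between rows that are exactly `Tu` apart. A quick way to see where that condition fails is to list the gaps in `segs`. The sketch below does this on a short, made-up list of timestamps; `find_time_gaps` is a hypothetical helper, not something the notebook defines.

```python
from typing import List, Tuple


def find_time_gaps(segs: List[int], tu: int) -> List[Tuple[int, int]]:
    """Return (previous, current) timestamp pairs whose spacing differs from `tu`."""
    return [
        (prev, curr)
        for prev, curr in zip(segs, segs[1:])
        if curr - prev != tu
    ]


# Made-up timestamps: regular 10 s sampling with one missing sample.
segs = [1651363201, 1651363211, 1651363221, 1651363241, 1651363251]
gaps = find_time_gaps(segs, tu=10)
print(gaps)
```

On the real dataset the same check can be done in vectorized form with `np.diff`, which is what the notebook's `fast_generate_events` uses to build its `is_consecutive` mask.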


## 3. Event definition and typology


In [7]:
# ============================================================
# HELPER FUNCTIONS (Phase 02, optimized for speed)
# ============================================================

import numpy as np
import pandas as pd
from typing import Dict, List
from pathlib import Path


# ------------------------------------------------------------
# Bands: cut points and labels (vectorized)
# ------------------------------------------------------------

def compute_cuts_and_labels(
    minmax_stats: Dict[str, Dict[str, float]],
    pct_thresholds: List[float],
) -> Dict[str, Dict[str, np.ndarray]]:
    """
    For each variable, returns:
      {
        col: {
          "cuts": np.ndarray,
          "labels": np.ndarray(dtype=object)
        }
      }
    """
    pct_list = np.array([0.0] + pct_thresholds + [100.0])
    out = {}

    for col, mm in minmax_stats.items():
        mn = mm["min"]
        mx = mm["max"]
        r = mx - mn

        if r == 0.0:
            cuts = np.array([mn, mx])
            labels = np.array(["0_100"], dtype=object)
        else:
            cuts = mn + (pct_list / 100.0) * r
            labels = np.array(
                [f"{int(pct_list[i])}_{int(pct_list[i+1])}"
                 for i in range(len(pct_list) - 1)],
                dtype=object
            )

        out[col] = {"cuts": cuts, "labels": labels}

    return out


# ------------------------------------------------------------
# Vectorized band assignment (very fast)
# ------------------------------------------------------------

def assign_bands_to_column(values: np.ndarray, cuts: np.ndarray, labels: np.ndarray):
    """
    For a vector of numeric values, returns:
        kind  : array with values in {"none", "NaN", "band"}
        label : array with "xx_yy" or None
    """
    is_nan = np.isnan(values)
    kind = np.where(is_nan, "NaN", "band")

    idx = np.searchsorted(cuts, values, side="right") - 1
    idx = np.clip(idx, 0, len(labels) - 1)

    assigned_labels = labels[idx].copy()
    assigned_labels[is_nan] = None

    out_of_range = (values < cuts[0]) | (values > cuts[-1])
    assigned_labels[out_of_range] = None
    kind[out_of_range] = "none"

    return kind, assigned_labels


# ------------------------------------------------------------
# Event catalog (compact IDs)
# ------------------------------------------------------------

def build_event_catalog(bands, event_strategy, nan_handling):
    """
    Returns:
      { event_name → event_id }
    """
    event_to_id = {}
    next_id = 1

    strat = event_strategy.lower()
    nan_keep = (nan_handling.lower() == "keep")

    for col, info in bands.items():
        labels = info["labels"]

        # Transition events (band → band only)
        if strat in ("transitions", "both"):
            for a in labels:
                for b in labels:
                    if a != b:
                        event_to_id[f"{col}_{a}-to-{b}"] = next_id
                        next_id += 1

        # Level events (band)
        if strat in ("levels", "both"):
            for a in labels:
                event_to_id[f"{col}_{a}"] = next_id
                next_id += 1

        # NaN level event (independent of the strategy)
        if nan_keep:
            event_to_id[f"{col}_NaN_NaN"] = next_id
            next_id += 1

    return event_to_id


# ------------------------------------------------------------
# Fast event generation (F02 core)
# ------------------------------------------------------------

def fast_generate_events(
    df,
    measure_cols,
    bands,
    event_to_id,
    event_strategy,
    nan_handling,
    Tu,
):
    """
    Builds the final event dataset: band assignment is vectorized per column,
    then events are collected row by row.
    """
    N = len(df)
    segs = df["segs"].values.astype(np.int64)

    is_consecutive = np.zeros(N, dtype=bool)
    is_consecutive[1:] = (np.diff(segs) == Tu)

    strat = event_strategy.lower()
    nan_keep = (nan_handling.lower() == "keep")

    events_column = [[] for _ in range(N)]

    prev_kind = {col: None for col in measure_cols}
    prev_label = {col: None for col in measure_cols}

    col_kind = {}
    col_label = {}

    for col in measure_cols:
        vals = df[col].values
        info = bands[col]
        # kind is one of "band", "NaN" or "none"; label is "xx_yy" or None
        k_arr, lbl_arr = assign_bands_to_column(vals, info["cuts"], info["labels"])
        col_kind[col] = k_arr
        col_label[col] = lbl_arr

    for i in range(N):
        row_events = []

        for col in measure_cols:
            curr_k = col_kind[col][i]
            curr_lbl = col_label[col][i]

            # Transitions (only between rows exactly Tu apart, band → band)
            if i > 0 and is_consecutive[i] and strat in ("transitions", "both"):
                pk = prev_kind[col]
                pl = prev_label[col]

                if pk == "band" and curr_k == "band" and pl != curr_lbl:
                    ev = event_to_id.get(f"{col}_{pl}-to-{curr_lbl}")
                    if ev is not None:
                        row_events.append(ev)

            # Levels
            if curr_k == "band" and strat in ("levels", "both"):
                ev = event_to_id.get(f"{col}_{curr_lbl}")
                if ev is not None:
                    row_events.append(ev)

            elif curr_k == "NaN" and nan_keep:
                ev = event_to_id.get(f"{col}_NaN_NaN")
                if ev is not None:
                    row_events.append(ev)

            prev_kind[col] = curr_k
            prev_label[col] = curr_lbl

        events_column[i] = row_events

    return pd.DataFrame({
        "segs": segs,
        "events": events_column,
    })


# ------------------------------------------------------------
# Min / max per measure (Phase 02)
# ------------------------------------------------------------

def compute_minmax(df: pd.DataFrame, measure_cols: List[str]) -> Dict[str, Dict[str, float]]:
    return {
        col: {
            "min": float(df[col].min()),
            "max": float(df[col].max()),
        }
        for col in measure_cols
    }


minmax_stats = compute_minmax(df_explored, measurement_cols)

print("[prepareeventsds] min/max per measure computed.")
for c, mm in list(minmax_stats.items())[:5]:
    print(f"  {c}: min={mm['min']:.3f}, max={mm['max']:.3f}")


[prepareeventsds] min/max per measure computed.
  Battery_Active_Power: min=-216.800, max=119.900
  Battery_Active_Power_Set_Response: min=-90.000, max=100.000
  PVPCS_Active_Power: min=-3.000, max=51.000
  GE_Body_Active_Power: min=-10.000, max=293.000
  GE_Active_Power: min=-601.000, max=400.000
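
To make the two event types concrete, here is a minimal pure-Python sketch of the logic in `fast_generate_events` for a single signal and the `both` strategy. It emits event names rather than catalog IDs, and it is not the notebook's implementation: `band_label`, `toy_events`, the signal name `x`, the timestamps and the readings are all made up for the example. The cut points are the ones obtained for a signal ranging from -90 to 100 with thresholds `[40, 60, 80]`.

```python
from bisect import bisect_right
from typing import List, Optional


def band_label(value: float, cuts: List[float], labels: List[str]) -> Optional[str]:
    """Label of the band containing `value`; None if NaN or outside [cuts[0], cuts[-1]]."""
    if value != value or value < cuts[0] or value > cuts[-1]:  # value != value detects NaN
        return None
    # The maximum value would index one past the end, so clamp it to the last band.
    idx = min(bisect_right(cuts, value) - 1, len(labels) - 1)
    return labels[idx]


def toy_events(segs, values, cuts, labels, tu, name="x"):
    """Per-row event names: a transition when the band changes, then the current level."""
    rows, prev_seg, prev_lbl = [], None, None
    for seg, val in zip(segs, values):
        lbl = band_label(val, cuts, labels)
        events = []
        consecutive = prev_seg is not None and seg - prev_seg == tu
        if consecutive and prev_lbl is not None and lbl is not None and lbl != prev_lbl:
            events.append(f"{name}_{prev_lbl}-to-{lbl}")
        if lbl is not None:
            events.append(f"{name}_{lbl}")
        rows.append(events)
        prev_seg, prev_lbl = seg, lbl
    return rows


cuts = [-90.0, -14.0, 24.0, 62.0, 100.0]
labels = ["0_40", "40_60", "60_80", "80_100"]
segs = [0, 10, 20, 40]            # made-up timestamps; the last step skips a sample
values = [0.0, 30.0, 30.0, 70.0]  # made-up readings
rows = toy_events(segs, values, cuts, labels, tu=10)
for seg, events in zip(segs, rows):
    print(seg, events)
```

The last row changes band but produces no transition event, because its timestamp is 20 seconds after the previous one instead of `tu` = 10.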


In [8]:
# =====================================================================
# 8. Event generation pipeline (Phase 02)
# =====================================================================

# --- 8.1 Build the bands ---
bands = compute_cuts_and_labels(
    minmax_stats=minmax_stats,
    pct_thresholds=band_thresholds_pct,
)

print(f"✔ Bands built for {len(bands)} measures")

# Example bands (first 2 columns)
for c in list(bands.keys())[:2]:
    print(f"{c}:")
    print("  cuts  :", bands[c]["cuts"])
    print("  labels:", bands[c]["labels"])


# --- 8.2 Event catalog ---
event_to_id = build_event_catalog(
    bands=bands,
    event_strategy=event_strategy,
    nan_handling=nan_handling,
)

print(f"✔ Event catalog generated: {len(event_to_id)} events")


# --- 8.3 Generate the event dataset ---
df_events = fast_generate_events(
    df=df_explored,
    measure_cols=measurement_cols,
    bands=bands,
    event_to_id=event_to_id,
    event_strategy=event_strategy,
    nan_handling=nan_handling,
    Tu=Tu,
)

# =====================================================================
# NaN EVENT CHECK (F02)
# =====================================================================

nan_event_ids = {
    eid for name, eid in event_to_id.items()
    if name.endswith("_NaN_NaN")
}

num_rows_with_nan_event = sum(
    any(e in nan_event_ids for e in evs)
    for evs in df_events["events"]
)

print("Rows with NaN events (level):", num_rows_with_nan_event)
print("NaN event IDs:", nan_event_ids)



print("✔ Event dataset generated")
print("Shape:", df_events.shape)

display(df_events.head())


✔ Bands built for 17 measures
Battery_Active_Power:
  cuts  : [-216.800003  -82.120001  -14.78       52.560001  119.900002]
  labels: ['0_40' '40_60' '60_80' '80_100']
Battery_Active_Power_Set_Response:
  cuts  : [-90. -14.  24.  62. 100.]
  labels: ['0_40' '40_60' '60_80' '80_100']
✔ Event catalog generated: 68 events


Rows with NaN events (level): 0
NaN event IDs: set()
✔ Dataset of events generated
Shape: (3887242, 2)


Unnamed: 0,segs,events
0,1651363201,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."
1,1651363211,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."
2,1651363221,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."
3,1651363231,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."
4,1651363241,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."
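
The catalog size reported above (68 events) follows directly from the rules in `build_event_catalog`: per measure, one level event per band, one transition event per ordered pair of distinct bands, and one NaN event when `nan_handling` is `keep`. The sketch below works through that arithmetic; `catalog_size` is a hypothetical helper, and it assumes every measure has the same number of bands (a constant signal would have a single band).

```python
def catalog_size(n_measures: int, n_bands: int, strategy: str, nan_handling: str) -> int:
    """Number of catalog entries when every measure has `n_bands` bands."""
    per_measure = 0
    if strategy in ("transitions", "both"):
        per_measure += n_bands * (n_bands - 1)  # ordered pairs of distinct bands
    if strategy in ("levels", "both"):
        per_measure += n_bands                  # one level event per band
    if nan_handling == "keep":
        per_measure += 1                        # one NaN event per measure
    return n_measures * per_measure


# This run: 17 measures, 4 bands each, strategy "levels", NaN events discarded.
size_levels = catalog_size(17, 4, "levels", "discard")
# For comparison: same bands with strategy "both" and NaN events kept.
size_both_keep = catalog_size(17, 4, "both", "keep")
print(size_levels, size_both_keep)
```

The transition term grows quadratically with the number of bands, so adding thresholds enlarges the catalog much faster under `transitions` or `both` than under `levels`.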


## 4. Persisting results, metadata and traceability

In [9]:
# =====================================================================
# 9. Persisting results, metadata and traceability (Phase 02)
# =====================================================================

import json
from mlops4ofp.tools.artifacts import (
    get_git_hash,
    save_numeric_dataset,
    save_params_and_metadata,
)

# ---------------------------------------------------------------------
# 9.1 Output paths (self-contained per phase / variant)
# ---------------------------------------------------------------------
PHASE_DIR = ctx["variant_root"]

# ---------------------------------------------------------------------
# 9.2 Save the event dataset
# ---------------------------------------------------------------------
events_dataset_path = PHASE_DIR / "02_prepareeventsds_dataset.parquet"


# Diagnostic: rows carrying NaN events in F02.
# Events are stored as integer IDs, so NaN events are identified through the
# catalog names ending in "_NaN_NaN"; a text search for "nan" alone would
# never match an integer ID.
nan_ids = {eid for name, eid in event_to_id.items() if name.endswith("_NaN_NaN")}

def has_nan_event(evs):
    if evs is None:
        return False

    # Single event given as a string (event name)
    if isinstance(evs, str):
        return "nan" in evs.lower()

    # List / iterable of events (integer IDs or names)
    if isinstance(evs, (list, tuple, set, np.ndarray)):
        return any(
            e in nan_ids or (isinstance(e, str) and "nan" in e.lower())
            for e in evs
        )

    # Any other type (int, float, etc.)
    return False

print("Distinct types in df['events']:",
      {type(evs).__name__ for evs in df_events["events"]})

print("Timestamps with NaN events:",
      sum(has_nan_event(evs) for evs in df_events["events"]))

# Event dataset: time column + list of events
df_events.to_parquet(events_dataset_path, index=False)

print(f"[OK] Event dataset saved to: {events_dataset_path}")

# ---------------------------------------------------------------------
# 9.3 Save auxiliary artifacts (bands and catalog)
# ---------------------------------------------------------------------
bands_path = PHASE_DIR / "02_prepareeventsds_bands.json"
event_catalog_path = PHASE_DIR / "02_prepareeventsds_event_catalog.json"

with open(bands_path, "w", encoding="utf-8") as f:
    json.dump(
        {
            col: {
                "cuts": bands[col]["cuts"].tolist(),
                "labels": bands[col]["labels"].tolist(),
            }
            for col in bands
        },
        f,
        indent=2,
    )

with open(event_catalog_path, "w", encoding="utf-8") as f:
    json.dump(event_to_id, f, indent=2)

print("[OK] Artifacts saved:")
print(f"  - {bands_path}")
print(f"  - {event_catalog_path}")

# ---------------------------------------------------------------------
# 9.4 Generated parameters (for traceability)
# ---------------------------------------------------------------------
gen_params = {
    "Tu": float(Tu),
    "band_thresholds_pct": band_thresholds_pct,
    "event_strategy": event_strategy,
    "nan_handling": nan_handling,
    "n_rows_input": int(len(df_explored)),
    "n_rows_events": int(len(df_events)),
    "n_events_catalog": int(len(event_to_id)),
    "n_measures": int(len(measurement_cols)),
}

# ---------------------------------------------------------------------
# 9.5 Phase 02 metadata (extended)
# ---------------------------------------------------------------------
metadata_extra = {
    "dataset_events": str(events_dataset_path),
    "parent_phase": parent_phase,
    "parent_variant": parent_variant,
    "input_dataset": str(parent_dataset_path),
    "Tu": float(Tu),
    "band_thresholds_pct": band_thresholds_pct,
    "event_strategy": event_strategy,
    "nan_handling": nan_handling,
    "n_rows_input": int(len(df_explored)),
    "n_rows_events": int(len(df_events)),
    "n_events_catalog": int(len(event_to_id)),
    "n_measures": int(len(measurement_cols)),
}

# ---------------------------------------------------------------------
# 9.6 Save params + metadata + traceability (as in F01)
# ---------------------------------------------------------------------
save_params_and_metadata(
    phase=PHASE,
    variant=ACTIVE_VARIANT,
    variant_root=ctx["variant_root"],
    raw_path=parent_dataset_path,   # the parent dataset acts as the "logical raw" input
    gen_params=gen_params,
    metadata_extra=metadata_extra,
    pm=pm,                          # ParamsManager (optional)
    git_commit=get_git_hash(),
)

print("[OK] Phase 02 metadata and traceability recorded successfully")



Distinct types in df['events']: {'list'}


Timestamps with NaN events: 0


[OK] Event dataset saved to: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_dataset.parquet
[OK] Artifacts saved:
  - /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_bands.json
  - /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_event_catalog.json
[OK] Phase 02 metadata and traceability recorded successfully
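
Because the dataset stores integer IDs, downstream consumers need the saved catalog (name → ID) to interpret them. A minimal sketch of that lookup follows. The catalog excerpt is reconstructed from the naming and numbering scheme of `build_event_catalog` for the first two measures under the `levels` strategy and should be read as illustrative data, not as the contents of the saved file; `decode` is a hypothetical helper.

```python
import json

# Illustrative catalog excerpt (name -> ID) for the first two measures.
catalog_json = json.dumps({
    "Battery_Active_Power_0_40": 1,
    "Battery_Active_Power_40_60": 2,
    "Battery_Active_Power_60_80": 3,
    "Battery_Active_Power_80_100": 4,
    "Battery_Active_Power_Set_Response_0_40": 5,
    "Battery_Active_Power_Set_Response_40_60": 6,
    "Battery_Active_Power_Set_Response_60_80": 7,
    "Battery_Active_Power_Set_Response_80_100": 8,
})

catalog = json.loads(catalog_json)
# Invert the mapping once so that each lookup is a constant-time dict access.
id_to_event = {event_id: name for name, event_id in catalog.items()}


def decode(event_ids):
    """Translate a list of event IDs into names; unknown IDs are kept as '?<id>'."""
    return [id_to_event.get(i, f"?{i}") for i in event_ids]


decoded = decode([3, 6])
print(decoded)
```

In practice the catalog would be read with `json.load` from the `02_prepareeventsds_event_catalog.json` file of the variant.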


## 5. HTML report and figure generation

In [10]:
import mlops4ofp.tools.html_reports.html02 as prepareevents_report02

prepareevents_report02.generate_figures_and_report(
    ctx=ctx,
    event_to_id=event_to_id,
    df_events=df_events,
)


[prepareeventsds] Generating final HTML report...


[OK] HTML report generated at /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_report.html


In [11]:
import pandas as pd

# Read the dataset back from the path stored in the run context
# (a relative path would depend on the current working directory)
df = pd.read_parquet(ctx["outputs"]["dataset"])
display(df.head())

display(ctx)

Unnamed: 0,segs,events
0,1651363201,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."
1,1651363211,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."
2,1651363221,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."
3,1651363231,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."
4,1651363241,"[3, 6, 9, 13, 18, 21, 26, 31, 34, 38, 44, 48, ..."


{'execution_dir': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/notebooks'),
 'project_root': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp'),
 'phase': '02_prepareeventsds',
 'variant': 'v200',
 'variant_root': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200'),
 'figures_dir': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/figures'),
 'outputs': {'dataset': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_dataset.parquet'),
  'report': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_report.html'),
  'params': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/02_prepareeventsds/v200/02_prepareeventsds_params.json'),
  'metadata': PosixPath('/Users/juancarlosduenaslopez/Docum