
# Phase 01: Initial dataset exploration (`01_explore`)

This notebook is part of the **MLOps4OFP workflow** and is designed to:

- **always work on one specific variant** of the `01_explore` phase (e.g. `v001`),
- read its parameters from `executions/01_explore/vNNN/params.yaml`,
- apply the **cleaning strategy** configured in the variant,
- generate a complete EDA (*Exploratory Data Analysis*),
- save all artifacts in the variant folder.

Variants are created from the command line with (example values):

```bash
make variant1 VARIANT=v001 RAW=data/raw.csv \
    CLEANING_STRATEGY=basic \
    NAN_VALUES='[-999999.0]' \
    FIRST_LINE=1 \
    MAX_LINES=10000
```

**Optional parameters:**
- `FIRST_LINE`: first line of the dataset to process (default: 1)
- `MAX_LINES`: maximum number of lines to process, starting at `FIRST_LINE`

The notebook is then run for a variant with:

```bash
make nb1-run VARIANT=v001
```



## 0. Configuration

In [1]:
# =====================================================================
# 1. COMMON IMPORTS
# =====================================================================
import os
import sys
import json
import yaml
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime, timezone
import seaborn as sns
import shutil

# Common plotting style
plt.rcParams["figure.figsize"] = (8, 4)
plt.rcParams["axes.grid"] = True

print("✔ Imports and style loaded")

✔ Imports and style loaded


In [2]:
# =====================================================================
# 2. EXECUTION CONTEXT (bootstrap + run_context)
# =====================================================================
import sys
from pathlib import Path

# --- MINIMAL BOOTSTRAP (before any mlops4ofp import) ---
execution_dir = Path.cwd().resolve()

# Walk up at most 10 levels until a folder containing "mlops4ofp" is found
current = execution_dir
for _ in range(10):
    if (current / "mlops4ofp").exists():
        project_root = current
        break
    current = current.parent
else:
    raise RuntimeError("❌ Could not locate project_root")

sys.path.insert(0, str(project_root))

print(f"📁 Project root added to PYTHONPATH: {project_root}")

# --- NOW the regular imports ---
from mlops4ofp.tools.run_context import (
    detect_execution_dir,
    detect_project_root,
    assemble_run_context,
)
import yaml

PHASE = "01_explore"

# -------------------------------------------------------------------
# Variant selection
# -------------------------------------------------------------------
env_variant = os.getenv("VARIANT") or os.getenv("ACTIVE_VARIANT")

#env_variant = "v002"  # To force a specific variant (uncomment and set the desired variant)

if not env_variant:
    raise RuntimeError(
        "❌ VARIANT is not defined. Run the notebook with: make nb1-run VARIANT=v001"
    )
ACTIVE_VARIANT = env_variant
print(f"[INFO] Active variant (from environment): {ACTIVE_VARIANT}")

# -------------------------------------------------------------------
# Build the final context
# -------------------------------------------------------------------
from mlops4ofp.tools.params_manager import ParamsManager

pm = ParamsManager(PHASE, project_root)

print("PHASE_DIR according to ParamsManager:", pm.phase_dir)
print("Does PHASE_DIR exist?", pm.phase_dir.exists())
print("Does the variant exist?", (pm.phase_dir / ACTIVE_VARIANT).exists())

pm.set_current(ACTIVE_VARIANT)

variant_root = pm.current_variant_dir()

ctx = assemble_run_context(
    execution_dir=detect_execution_dir(),
    project_root=project_root,
    phase=PHASE,
    variant=ACTIVE_VARIANT,
    variant_root=variant_root,
)

print("✔ Execution context built")
print(f"   Phase: {ctx['phase']}")
print(f"   Variant: {ctx['variant']}")
print(f"   Variant folder: {ctx['variant_root']}")


📁 Project root added to PYTHONPATH: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp
[INFO] Active variant (from environment): v100
PHASE_DIR according to ParamsManager: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore
Does PHASE_DIR exist? True
Does the variant exist? True
✔ Execution context built
   Phase: 01_explore
   Variant: v100
   Variant folder: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100
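
The bootstrap step in the cell above (walking up from the working directory until a folder containing `mlops4ofp` is found) can be sketched as a standalone function. This is only an illustration under assumptions: `find_project_root` and its parameters are hypothetical names that do not exist in the project, and the temporary directory tree merely stands in for a real checkout.

```python
from pathlib import Path
import tempfile


def find_project_root(start: Path, marker: str = "mlops4ofp", max_levels: int = 10) -> Path:
    """Walk up from `start` until a directory containing `marker` is found.

    Hypothetical helper that mirrors the for/else loop used in the notebook.
    """
    current = start.resolve()
    for _ in range(max_levels):
        if (current / marker).exists():
            return current
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    raise RuntimeError(f"Could not locate a directory containing '{marker}'")


# Toy layout: <tmp>/mlops4ofp and <tmp>/notebooks/deep (the "execution dir")
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "mlops4ofp").mkdir()
    nested = root / "notebooks" / "deep"
    nested.mkdir(parents=True)
    found = find_project_root(nested)
    found_ok = (found == root)
```

Returning the path from a function (instead of relying on the loop's `else` branch) makes the lookup easier to reuse and to test outside the notebook.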


In [3]:
# =====================================================================
# 4. DATA AND OUTPUT PATHS (VARIANT)
#    Every file is prefixed with the phase name
# =====================================================================
from mlops4ofp.tools.run_context import build_phase_outputs
VARIANT_DIR = ctx["variant_root"]
PHASE_PREFIX = ctx["phase"]

# Input data (raw is shared by the whole project)
RAW_DIR = ctx["project_root"] / "data" / "01-raw"
RAW_DIR.mkdir(parents=True, exist_ok=True)

# Outputs of this phase (files at the root of the variant)
#OUTPUTS = {
#    "dataset": VARIANT_DIR / f"{PHASE_PREFIX}_dataset.parquet",
#    "report": VARIANT_DIR / f"{PHASE_PREFIX}_report.html",
#    "params": VARIANT_DIR / f"{PHASE_PREFIX}_params.json",
#    "metadata": VARIANT_DIR / f"{PHASE_PREFIX}_metadata.json",
#}
OUTPUTS = build_phase_outputs(
    variant_root=VARIANT_DIR,
    phase=ctx["phase"],
)
ctx["outputs"] = OUTPUTS

# Figures (the only subfolder allowed inside the variant)
FIGURES_DIR = ctx["figures_dir"]

print("✔ Output paths ready (scoped to the variant)")
OUTPUTS, FIGURES_DIR

✔ Output paths ready (scoped to the variant)


({'dataset': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/01_explore_dataset.parquet'),
  'report': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/01_explore_report.html'),
  'params': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/01_explore_params.json'),
  'metadata': PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/01_explore_metadata.json')},
 PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/figures'))

## 1. Variant parameters and raw dataset selection

In [4]:

# =====================================================================
# 3. Read the parameters of the active variant
# =====================================================================
import copy

from mlops4ofp.tools.params_manager import validate_params

params_path = VARIANT_DIR / "params.yaml"
if not params_path.exists():
    raise FileNotFoundError(
        f"The variant parameter file does not exist: {params_path}\n"
        f"Create the variant first with:\n"
        f"    make variant1 VARIANT={ACTIVE_VARIANT} RAW=/path/to/dataset"
    )

# Read parameters and validate them against traceability_schema.yaml
with open(params_path, "r", encoding="utf-8") as f:
    variant_params = yaml.safe_load(f) or {}
# Deep copy so that changes to nested values made during validation are detected too
original_variant_params = copy.deepcopy(variant_params)

validate_params(PHASE, variant_params, ctx["project_root"])

# Persist the parameters only if validation modified them
if variant_params != original_variant_params:
    with open(params_path, "w", encoding="utf-8") as f:
        yaml.safe_dump(variant_params, f, sort_keys=False)

ctx["variant_params"] = variant_params
print("✔ Parameters validated against the schema.")

raw_dataset_path = variant_params.get("raw_dataset_path")
cleaning_strategy = variant_params.get("cleaning_strategy", "none")
nan_values = variant_params.get("nan_values", [])
error_values_by_column = variant_params.get("error_values_by_column", {})

# Strict validation of cleaning strategies
ALLOWED_STRATEGIES = {"none", "basic", "full"}
if cleaning_strategy not in ALLOWED_STRATEGIES:
    raise ValueError(
        f"cleaning_strategy='{cleaning_strategy}' is not valid.\n"
        f"Allowed options: {ALLOWED_STRATEGIES}"
    )

# Compact parameter log
print("────────── VARIANT PARAMETERS ──────────")
print(f"raw_dataset_path    = {raw_dataset_path}")
print(f"cleaning_strategy   = {cleaning_strategy}")
print(f"nan_values          = {nan_values}")
print(f"error_values_by_col = {error_values_by_column}")
print("────────────────────────────────────────")

# =====================================================================
# 4. Resolve the path of the input dataset
# =====================================================================
raw_input = (ctx["project_root"] / raw_dataset_path).expanduser().resolve()
if not raw_input.exists():
    raise FileNotFoundError(
        f"The file given in raw_dataset_path does not exist:\n{raw_input}"
    )

# =====================================================================
# 5. Copy the raw dataset to data/01-raw (immutable working copy)
# =====================================================================
raw_path = RAW_DIR / f"01_explore_raw_{raw_input.name}"
if not raw_path.exists():
    shutil.copy2(raw_input, raw_path)
    print(f"[VARIANT] Raw dataset copied to: {raw_path}")
else:
    print(f"[VARIANT] Local copy of the raw dataset already existed: {raw_path}")

raw_path


✔ Parameters validated against the schema.
────────── VARIANT PARAMETERS ──────────
raw_dataset_path    = data/01-raw/01_explore_raw_raw.csv
cleaning_strategy   = basic
nan_values          = [-999999]
error_values_by_col = {}
────────────────────────────────────────
[VARIANT] Local copy of the raw dataset already existed: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/data/01-raw/01_explore_raw_01_explore_raw_raw.csv


PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/data/01-raw/01_explore_raw_01_explore_raw_raw.csv')

## 2. Loading the dataset and a quick look

In [5]:
# =====================================================
# 4. Load the dataset from the local copy
# =====================================================

print(f"📄 Using input file (local copy): {raw_path}")

# Load CSV or Parquet
suffix = raw_path.suffix.lower()
if suffix == ".csv":
    df = pd.read_csv(raw_path)
elif suffix in {".parquet", ".pq"}:
    df = pd.read_parquet(raw_path)
else:
    raise ValueError(f"❌ Unsupported file extension: {suffix}")

print("✔ Dataset loaded successfully.")
print(f"   Initial rows: {len(df):,}  |  Columns: {df.shape[1]}")

# -------------------------------------------------------------------
# Optional parameters: first_line and max_lines
# -------------------------------------------------------------------
max_lines = variant_params.get("max_lines", variant_params.get("max_line"))
first_line = variant_params.get("first_line")

if max_lines is not None or first_line is not None:
    # first_line is 1-based; convert it to a 0-based start index
    start_idx = max(int(first_line or 1) - 1, 0)
    end_idx = start_idx + int(max_lines) if max_lines is not None else None
    df = df.iloc[start_idx:end_idx].reset_index(drop=True)
    print("✔ Line filtering applied:")
    print(f"   first_line={first_line}, max_lines={max_lines}")
    print(f"   Resulting rows: {len(df):,}")

print(f"\n📊 Final dataset: {len(df):,} rows  |  {df.shape[1]} columns")
df.head()

📄 Using input file (local copy): /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/data/01-raw/01_explore_raw_01_explore_raw_raw.csv


✔ Dataset loaded successfully.
   Initial rows: 3,887,242  |  Columns: 18
✔ Line filtering applied:
   first_line=1, max_lines=50000
   Resulting rows: 50,000

📊 Final dataset: 50,000 rows  |  18 columns


Unnamed: 0,Timestamp,Battery_Active_Power,Battery_Active_Power_Set_Response,PVPCS_Active_Power,GE_Body_Active_Power,GE_Active_Power,GE_Body_Active_Power_Set_Response,FC_Active_Power_FC_END_Set,FC_Active_Power,FC_Active_Power_FC_end_Set_Response,Island_mode_MCCB_Active_Power,MG-LV-MSB_AC_Voltage,Receiving_Point_AC_Voltage,Island_mode_MCCB_AC_Voltage,Island_mode_MCCB_Frequency,MG-LV-MSB_Frequency,Inlet_Temperature_of_Chilled_Water,Outlet_Temperature
0,2022-05-01 00:00:01,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
1,2022-05-01 00:00:11,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
2,2022-05-01 00:00:21,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.200001
3,2022-05-01 00:00:31,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.950001,59.950001,22.1,22.200001
4,2022-05-01 00:00:41,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1
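
The `first_line` / `max_lines` windowing used in the cell above can be looked at in isolation. The following is a minimal sketch, not part of the notebook: `row_window` is a hypothetical helper name, and it only reproduces the arithmetic behind the `iloc` slice (1-based `first_line`, optional `max_lines`), using a plain list as a stand-in for the DataFrame.

```python
def row_window(first_line=None, max_lines=None):
    """Return the 0-based (start, stop) bounds of the selected row window.

    Hypothetical helper mirroring the slice arithmetic of the notebook cell.
    `stop` is None when `max_lines` is not set, so the slice runs to the end.
    """
    start = max(int(first_line or 1) - 1, 0)
    stop = start + int(max_lines) if max_lines is not None else None
    return start, stop


rows = list(range(100))          # stand-in for a 100-row dataset
start, stop = row_window(first_line=11, max_lines=5)
selected = rows[start:stop]      # positions 10..14, i.e. lines 11..15 in 1-based terms
```

Note that a missing or zero `first_line` falls back to the first row, so the window never starts at a negative index.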


In [6]:
import io
print("=== DATAFRAME INFO ===")
buf = io.StringIO()
# `buf` must be passed by keyword: the first positional argument of info() is `verbose`
df.info(buf=buf)
info_str = buf.getvalue()
print(info_str)
print("=== DATA TYPES ===")
df.dtypes.to_frame("dtype")



=== DATAFRAME INFO ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Timestamp                            50000 non-null  object 
 1   Battery_Active_Power                 50000 non-null  float64
 2   Battery_Active_Power_Set_Response    50000 non-null  float64
 3   PVPCS_Active_Power                   50000 non-null  float64
 4   GE_Body_Active_Power                 50000 non-null  float64
 5   GE_Active_Power                      50000 non-null  float64
 6   GE_Body_Active_Power_Set_Response    50000 non-null  float64
 7   FC_Active_Power_FC_END_Set           50000 non-null  float64
 8   FC_Active_Power                      50000 non-null  float64
 9   FC_Active_Power_FC_end_Set_Response  50000 non-null  float64
 10  Island_mode_MCCB_Active_Power        50000 non-null  float64
 11  M

Unnamed: 0,dtype
Timestamp,object
Battery_Active_Power,float64
Battery_Active_Power_Set_Response,float64
PVPCS_Active_Power,float64
GE_Body_Active_Power,float64
GE_Active_Power,float64
GE_Body_Active_Power_Set_Response,float64
FC_Active_Power_FC_END_Set,float64
FC_Active_Power,float64
FC_Active_Power_FC_end_Set_Response,float64


## 3. Preparing the time axis (`segs`)

In [7]:

# ================================================================
# 4. TIME PREPROCESSING: building `segs` and `segs_diff`
# ================================================================

print("=== 4. Preparing the time axis ===")

df_original_cols = df.columns.tolist()

# -------------------------------------------------------------------
# 4.1 Detect the "real" time column
# -------------------------------------------------------------------
time_col = None

# 1) Explicit priority: the 'Timestamp' column
if "Timestamp" in df.columns:
    time_col = "Timestamp"
    print("✔ Detected time column 'Timestamp' (highest priority).")

else:
    # 2) Look for columns whose name suggests time
    time_keywords = ["time", "timestamp", "fecha", "date"]
    candidates = [c for c in df.columns
                  if any(k in c.lower() for k in time_keywords)]
    if candidates:
        time_col = candidates[0]
        print(f"✔ Detected candidate time column: {time_col}")

# -------------------------------------------------------------------
# 4.2 Build / use `segs` from the best available time source
# -------------------------------------------------------------------

if time_col is not None:
    # Use the "real" time column as the reference
    print(f"→ Using '{time_col}' as the time source to build 'segs'.")
    ts = pd.to_datetime(df[time_col])
    # Epoch in seconds (naive timestamps are measured from 1970-01-01 00:00:00)
    df["segs"] = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
    df = df.set_index("segs").sort_index()

elif "segs" in df.columns:
    print("✔ No explicit time column found, but 'segs' exists. Using it as the index.")
    df = df.set_index("segs").sort_index()

elif "epoc" in df.columns:
    print("✔ Detected column 'epoc'; renaming it to 'segs'.")
    df = df.rename(columns={"epoc": "segs"})
    df = df.set_index("segs").sort_index()

else:
    print("⚠️ No time column found ('Timestamp', 'time*', 'segs' or 'epoc').")
    print("   → 'segs' will be generated automatically from the dataframe index.")
    start = pd.Timestamp("2020-01-01")
    df["segs"] = (
        (start + pd.to_timedelta(df.index, unit="s"))
        - pd.Timestamp("1970-01-01")
    ) // pd.Timedelta("1s")
    df = df.set_index("segs").sort_index()
    print("✔ Column 'segs' generated automatically.")


# -------------------------------------------------------------------
# 4.3 Compute time differences between consecutive samples
# -------------------------------------------------------------------
df["segs_diff"] = df.index.to_series().diff()

# Tu: median sampling interval, in seconds
Tu = float(df["segs_diff"].median())
print(f"✔ Median time interval (Tu): {Tu:.6f} seconds")

print("\nPreview of index and segs_diff:")
display(df[["segs_diff"]].head(5))



=== 4. Preparing the time axis ===
✔ Detected time column 'Timestamp' (highest priority).
→ Using 'Timestamp' as the time source to build 'segs'.
✔ Median time interval (Tu): 10.000000 seconds

Preview of index and segs_diff:


Unnamed: 0_level_0,segs_diff
segs,Unnamed: 1_level_1
1651363201,
1651363211,10.0
1651363221,10.0
1651363231,10.0
1651363241,10.0
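
The `segs` values shown above can be checked by hand with the standard library only. This is a small sketch, assuming (as the pandas expression in the cell does) that the naive timestamps are measured from `1970-01-01 00:00:00`, i.e. treated as UTC; `to_epoch_seconds` is a hypothetical helper name used just for illustration.

```python
from datetime import datetime, timezone
from statistics import median


def to_epoch_seconds(stamp: str) -> int:
    """Convert a naive 'YYYY-MM-DD HH:MM:SS' string to epoch seconds, read as UTC."""
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return int(naive.replace(tzinfo=timezone.utc).timestamp())


# First timestamps of the dataset preview
stamps = [
    "2022-05-01 00:00:01",
    "2022-05-01 00:00:11",
    "2022-05-01 00:00:21",
    "2022-05-01 00:00:31",
]
segs = [to_epoch_seconds(s) for s in stamps]

# Differences between consecutive samples and their median (the notebook's Tu)
diffs = [b - a for a, b in zip(segs, segs[1:])]
tu = float(median(diffs))
```

The first value reproduces the index `1651363201` from the preview, and the median difference matches the reported `Tu` of 10 seconds. Using the median rather than the mean keeps `Tu` stable when a few samples are missing and leave larger gaps.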


## 4. Value analysis and diagnostics

In [8]:
print("=== 5. Pre-cleaning diagnostics ===")

with open(VARIANT_DIR / "params.yaml", "r", encoding="utf-8") as f:
    variant_params = yaml.safe_load(f) or {}

cleaning_strategy = variant_params.get("cleaning_strategy", "none")
nan_values = variant_params.get("nan_values", [])
error_values_map = variant_params.get("error_values_by_column", {})

print(f"[PARAMS] cleaning_strategy      = {cleaning_strategy}")
print(f"[PARAMS] nan_values             = {nan_values}")
print(f"[PARAMS] error_values_by_column = {error_values_map}")

report_preclean = {}

# -------------------------------------------------------------------
# 5.1 Null values per column
# -------------------------------------------------------------------
nulls = df.isna().sum()
report_preclean["nulls"] = nulls.to_dict()

print("\n📌 Null values per column:")
display(nulls[nulls > 0])

# -------------------------------------------------------------------
# 5.2 Constant columns (no variation)
# -------------------------------------------------------------------
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
report_preclean["constant_columns"] = constant_cols

print("\n📌 Constant columns:")
display(constant_cols)

# -------------------------------------------------------------------
# 5.3 Values outside the typical range using IQR (outliers)
# -------------------------------------------------------------------
numeric_cols = df.select_dtypes(include=[np.number]).columns

outlier_summary = {}

for col in numeric_cols:
    series = df[col].dropna()
    if len(series) < 10:
        continue

    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outliers = ((series < lower) | (series > upper)).sum()
    if outliers > 0:
        outlier_summary[col] = int(outliers)

report_preclean["outliers_IQR"] = outlier_summary

print("\n📌 Columns with detected outliers (IQR method):")
display(outlier_summary)

# -------------------------------------------------------------------
# 5.4 Suspect values according to user parameters
# -------------------------------------------------------------------

suspect_map = {}

# 5.4.1 Global values to be interpreted as NaN
for v in nan_values:
    for col in numeric_cols:
        cnt = (df[col] == v).sum()
        if cnt > 0:
            suspect_map.setdefault(col, {})[f"nan_value_{v}"] = int(cnt)

# 5.4.2 Error values per column
for col, bad_values in error_values_map.items():
    if col in df.columns:
        for v in bad_values:
            cnt = (df[col] == v).sum()
            if cnt > 0:
                suspect_map.setdefault(col, {})[f"error_value_{v}"] = int(cnt)

report_preclean["suspect_values"] = suspect_map

print("\n📌 Suspect values according to the cleaning parameters:")
display(suspect_map)

# -------------------------------------------------------------------
# 5.5 Save the preliminary report (for the final HTML)
# -------------------------------------------------------------------

report_preclean_path = VARIANT_DIR / "01_preclean_report.json"
with open(report_preclean_path, "w", encoding="utf-8") as f:
    json.dump(report_preclean, f, indent=2, ensure_ascii=False)

print(f"\n✔ Preliminary report saved to: {report_preclean_path}")


=== 5. Pre-cleaning diagnostics ===
[PARAMS] cleaning_strategy      = basic
[PARAMS] nan_values             = [-999999]
[PARAMS] error_values_by_column = {}

📌 Null values per column:


segs_diff    1
dtype: int64


📌 Constant columns:


['Battery_Active_Power',
 'Battery_Active_Power_Set_Response',
 'GE_Body_Active_Power',
 'GE_Active_Power',
 'GE_Body_Active_Power_Set_Response',
 'FC_Active_Power_FC_END_Set',
 'FC_Active_Power_FC_end_Set_Response']


📌 Columns with detected outliers (IQR method):


{'FC_Active_Power': 2982,
 'MG-LV-MSB_AC_Voltage': 653,
 'Island_mode_MCCB_AC_Voltage': 1,
 'Island_mode_MCCB_Frequency': 3921,
 'MG-LV-MSB_Frequency': 3898}


📌 Suspect values according to the cleaning parameters:


{}


✔ Preliminary report saved to: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/01_preclean_report.json


In [9]:
# ================================================================
# 5 bis. IQR OUTLIER DETECTION (diagnostics only, no cleaning)
# ================================================================

print("\n=== 5 bis. Outlier detection (IQR) per column ===")

num_cols = df.select_dtypes(include=[np.number]).columns

iqr_report = {}

for col in num_cols:
    series = df[col].dropna()
    if len(series) < 5:
        continue  # not enough data for IQR

    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr
    high = q3 + 1.5 * iqr

    mask_low = series < low
    mask_high = series > high

    n_low = mask_low.sum()
    n_high = mask_high.sum()
    total_outliers = n_low + n_high

    if total_outliers > 0:
        iqr_report[col] = {
            "low_threshold": float(low),
            "high_threshold": float(high),
            "num_low_outliers": int(n_low),
            "num_high_outliers": int(n_high),
            "total_outliers": int(total_outliers),
            "percent": float(100 * total_outliers / len(series)),
        }


if len(iqr_report) == 0:
    print("✔ No IQR outliers found in any column.")
else:
    print("✔ IQR outliers found in numeric columns; report:")

    outlier_df = pd.DataFrame.from_dict(iqr_report, orient="index")
    display(outlier_df)

OUTLIER_REPORT = iqr_report  # in case the HTML report needs it later




=== 5 bis. Outlier detection (IQR) per column ===
✔ IQR outliers found in numeric columns; report:


Unnamed: 0,low_threshold,high_threshold,num_low_outliers,num_high_outliers,total_outliers,percent
FC_Active_Power,38.0,38.0,2982,0,2982,5.964
MG-LV-MSB_AC_Voltage,476.5,488.5,646,7,653,1.306
Island_mode_MCCB_AC_Voltage,475.0,491.0,1,0,1,0.002
Island_mode_MCCB_Frequency,59.960008,60.039992,2051,1870,3921,7.842
MG-LV-MSB_Frequency,59.960008,60.039992,2050,1848,3898,7.796
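
The `FC_Active_Power` row above has identical low and high thresholds (38.0), which is what the IQR rule produces when the first and third quartiles coincide: the interquartile range is zero, so every value different from 38 is flagged. A minimal standard-library sketch of that effect follows. The toy data and the `iqr_fences` helper are made up for illustration, and `statistics.quantiles(..., method="inclusive")` is used here because it interpolates linearly between data points, which should agree with the default behaviour of `Series.quantile`.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return the (low, high) IQR fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy column: almost constant at 38 with a few readings at 37
values = [38] * 94 + [37] * 6

low, high = iqr_fences(values)
n_outliers = sum(1 for v in values if v < low or v > high)
```

With a nearly constant signal like this one, the rule says more about the lack of spread than about genuine anomalies, which is worth keeping in mind before choosing the `full` cleaning strategy.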


## 5. Applying the cleaning strategy

In [10]:
# ================================================================
# 6. APPLY THE CLEANING STRATEGY
# ================================================================

print("\n=== 6. Applying the cleaning strategy ===")

cleaning_strategy = variant_params.get("cleaning_strategy", "none")
nan_values = variant_params.get("nan_values", [])
error_values_by_column = variant_params.get("error_values_by_column", {})

print(f"[INFO] Cleaning strategy          : {cleaning_strategy}")
print(f"[INFO] Global values → NaN        : {nan_values}")
print(f"[INFO] Error values by column     : {error_values_by_column}")

df_clean = df.copy()
nan_replacements_total = 0

# -------------------------------
# Strategy: none
# -------------------------------
if cleaning_strategy == "none":
    # Nothing is modified
    print("➡️ No cleaning is applied.")


# -------------------------------
# Strategy: basic
# -------------------------------
elif cleaning_strategy == "basic":

    # 1) Replace global values with NaN
    if nan_values:
        before = df_clean.isna().sum().sum()
        df_clean.replace(nan_values, np.nan, inplace=True)
        after = df_clean.isna().sum().sum()
        changed = after - before
        nan_replacements_total += changed
        print(f"✔ Replaced values {nan_values} with NaN (global): {changed}")

    # 2) Replace error values per column
    #    (assign the result back: an in-place replace on df_clean[col]
    #    may act on a temporary copy and leave df_clean unchanged)
    for col, bad_vals in error_values_by_column.items():
        if col in df_clean.columns:
            before = df_clean[col].isna().sum()
            df_clean[col] = df_clean[col].replace(bad_vals, np.nan)
            after = df_clean[col].isna().sum()
            changed = after - before
            nan_replacements_total += changed
            print(f"✔ Column {col}: replaced {bad_vals} → NaN: {changed}")

    # 3) Drop columns that are completely empty
    before = df_clean.shape[1]
    df_clean.dropna(axis=1, how="all", inplace=True)
    after = df_clean.shape[1]
    print(f"✔ Columns dropped because they became empty: {before - after}")


# -------------------------------
# Strategy: full
# -------------------------------
elif cleaning_strategy == "full":

    # 1) Basic cleaning first
    if nan_values:
        before = df_clean.isna().sum().sum()
        df_clean.replace(nan_values, np.nan, inplace=True)
        after = df_clean.isna().sum().sum()
        nan_replacements_total += (after - before)
        #print(f"✔ Replaced {nan_replacements_total} values {nan_values} with NaN (global)")

    for col, bad_vals in error_values_by_column.items():
        if col in df_clean.columns:
            before = df_clean[col].isna().sum()
            df_clean[col] = df_clean[col].replace(bad_vals, np.nan)
            after = df_clean[col].isna().sum()
            nan_replacements_total += (after - before)
            print(f"✔ Column {col}: replaced {bad_vals} → NaN")

    # 2) Automatic removal of outliers by IQR (outliers are set to NaN)
    num_cols = df_clean.select_dtypes(include=[np.number]).columns
    total_outliers = 0

    for col in num_cols:
        series = df_clean[col]
        q1 = series.quantile(0.25)
        q3 = series.quantile(0.75)
        iqr = q3 - q1
        low = q1 - 1.5 * iqr
        high = q3 + 1.5 * iqr

        mask = (series < low) | (series > high)
        outliers = mask.sum()
        total_outliers += int(outliers)

        df_clean.loc[mask, col] = np.nan

    print(f"✔ Outliers detected and removed: {total_outliers}")

    # 3) Drop rows that are completely empty
    before = df_clean.shape[0]
    df_clean.dropna(axis=0, how="all", inplace=True)
    after = df_clean.shape[0]
    print(f"✔ Rows dropped because they became empty: {before - after}")

else:
    raise ValueError(f"Unknown cleaning strategy: {cleaning_strategy}")

print("✔ Cleaning completed.")
print(f"🧮 Total values converted to NaN: {nan_replacements_total}")

df_clean.head()


=== 6. Applying the cleaning strategy ===
[INFO] Cleaning strategy          : basic
[INFO] Global values → NaN        : [-999999]
[INFO] Error values by column     : {}
✔ Replaced values [-999999] with NaN (global): 0
✔ Columns dropped because they became empty: 0
✔ Cleaning completed.
🧮 Total values converted to NaN: 0


Unnamed: 0_level_0,Timestamp,Battery_Active_Power,Battery_Active_Power_Set_Response,PVPCS_Active_Power,GE_Body_Active_Power,GE_Active_Power,GE_Body_Active_Power_Set_Response,FC_Active_Power_FC_END_Set,FC_Active_Power,FC_Active_Power_FC_end_Set_Response,Island_mode_MCCB_Active_Power,MG-LV-MSB_AC_Voltage,Receiving_Point_AC_Voltage,Island_mode_MCCB_AC_Voltage,Island_mode_MCCB_Frequency,MG-LV-MSB_Frequency,Inlet_Temperature_of_Chilled_Water,Outlet_Temperature,segs_diff
segs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1651363201,2022-05-01 00:00:01,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1,
1651363211,2022-05-01 00:00:11,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1,10.0
1651363221,2022-05-01 00:00:21,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.200001,10.0
1651363231,2022-05-01 00:00:31,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-13.0,482.0,483.0,482.0,59.950001,59.950001,22.1,22.200001,10.0
1651363241,2022-05-01 00:00:41,0.0,0.0,0.0,0.0,-1.0,180.0,40.0,38.0,40.0,-14.0,482.0,483.0,482.0,59.959999,59.959999,22.1,22.1,10.0
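
Since this run found no sentinel values, the replacement step of the `basic` strategy is easier to see on a toy frame. The sketch below is illustrative only: `replace_sentinels` is a hypothetical helper (it is not part of `mlops4ofp`), the data are invented, and it assumes pandas and NumPy are installed as in the notebook. It returns a new frame and assigns each column back explicitly, which avoids relying on in-place replacement through `df[col]`.

```python
import numpy as np
import pandas as pd


def replace_sentinels(df, nan_values, error_values_by_column):
    """Return a cleaned copy of `df` and the number of values turned into NaN.

    Hypothetical helper: global sentinels are replaced in every column,
    per-column error values only in the column they are declared for.
    """
    out = df.replace(nan_values, np.nan) if nan_values else df.copy()
    for col, bad_vals in error_values_by_column.items():
        if col in out.columns:
            # Assign back instead of using inplace=True on a column selection
            out[col] = out[col].replace(bad_vals, np.nan)
    replaced = int(out.isna().sum().sum() - df.isna().sum().sum())
    return out, replaced


# Invented data: -999999.0 is a global sentinel, -1.0 is an error code for "b" only
toy = pd.DataFrame({"a": [1.0, -999999.0, 3.0], "b": [0.0, 5.0, -1.0]})
cleaned, n_replaced = replace_sentinels(toy, [-999999.0], {"b": [-1.0]})
```

Counting NaN values before and after, as the notebook does, gives the number of replacements without having to track each substitution individually.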


## 6. Basic statistics

In [11]:

df_clean.describe().T


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Battery_Active_Power,50000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Battery_Active_Power_Set_Response,50000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PVPCS_Active_Power,50000.0,13.66112,16.991301,-2.0,0.0,2.0,29.0,51.0
GE_Body_Active_Power,50000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
GE_Active_Power,50000.0,-1.0,0.0,-1.0,-1.0,-1.0,-1.0,-1.0
GE_Body_Active_Power_Set_Response,50000.0,180.0,0.0,180.0,180.0,180.0,180.0,180.0
FC_Active_Power_FC_END_Set,50000.0,40.0,0.0,40.0,40.0,40.0,40.0,40.0
FC_Active_Power,50000.0,37.94036,0.236821,37.0,38.0,38.0,38.0,38.0
FC_Active_Power_FC_end_Set_Response,50000.0,40.0,0.0,40.0,40.0,40.0,40.0,40.0
Island_mode_MCCB_Active_Power,50000.0,-25.2679,17.964912,-64.0,-42.0,-15.0,-10.0,-2.0


## 7. Saving the explored dataset, generated parameters, and metadata


In [12]:
def resolve_Tu_and_nan_repl(df_clean, Tu=None, nan_repl=None):
    """
    Resolve Tu and nan_repl with a safe fallback.
    """
    # Tu: use the given value, otherwise recompute it from segs_diff
    if Tu is not None:
        Tu_value = float(Tu)
    else:
        if "segs_diff" not in df_clean.columns:
            raise RuntimeError("❌ 'segs_diff' not found; cannot compute Tu.")
        Tu_value = float(df_clean["segs_diff"].median())

    # nan_repl: fall back to 0 if missing or not convertible to int
    try:
        nan_repl_value = int(nan_repl) if nan_repl is not None else 0
    except (TypeError, ValueError):
        nan_repl_value = 0

    return Tu_value, nan_repl_value

Tu_value, nan_repl_value = resolve_Tu_and_nan_repl(
    df_clean=df_clean,
    Tu=globals().get("Tu"),
    # The cleaning cell stores its counter in `nan_replacements_total`
    nan_repl=globals().get("nan_replacements_total"),
)

In [13]:
from mlops4ofp.tools.artifacts import (
    get_git_hash,
    save_numeric_dataset,
    save_params_and_metadata,
)

# --- Dataset ---
dataset_path = OUTPUTS["dataset"]

numeric_cols, df_out = save_numeric_dataset(
    df=df_clean,
    output_path=dataset_path,
    index_name="segs",
    drop_columns=["Timestamp", "segs_diff", "segs_dt"],
)

print(f"[OK] Dataset saved to: {dataset_path}")

# --- Generated parameters ---
gen_params = {
    "Tu": float(Tu_value),
    "n_rows": int(len(df_out)),
    "n_cols": int(df_out.shape[1]),
    "numeric_cols": numeric_cols,
    "nan_replacements_total": nan_repl_value,
}

# --- Metadata ---
metadata_extra = {
    "dataset_explored": str(dataset_path),
    "Tu": float(Tu_value),
    "nan_replacements_total": nan_repl_value,
    "n_rows": int(len(df_out)),
    "n_cols": int(df_out.shape[1]),
    "cleaning_strategy": variant_params.get("cleaning_strategy"),
    "nan_values": variant_params.get("nan_values"),
    "error_values_by_column": variant_params.get("error_values_by_column"),
}

save_params_and_metadata(
    phase=PHASE,
    variant=ACTIVE_VARIANT,
    variant_root=ctx["variant_root"],
    raw_path=raw_path,
    gen_params=gen_params,
    metadata_extra=metadata_extra,
    pm=pm,  # optional
    git_commit=get_git_hash(),
)


[OK] Dataset saved to: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/01_explore_dataset.parquet


(PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/01_explore_params.json'),
 PosixPath('/Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/01_explore_metadata.json'))

## 8. Generating the variant's HTML report

In [14]:
import importlib
import mlops4ofp.tools.html_reports.html01 as explore_report

# Reload so that edits to the report module are picked up without restarting the kernel
importlib.reload(explore_report)

explore_report.generate_figures_and_report(
    variant=ACTIVE_VARIANT,
    ctx=ctx,
    df_out=df_out,
    numeric_cols=[c for c in numeric_cols if c != "segs"],
    Tu_value=Tu_value,
    report_preclean=report_preclean,
)

[explore] Generating final HTML report...


  norm = (X - vmin) / span * 100.0


[OK] HTML report generated at /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/01_explore_report.html
[OK] Figures generated in: /Users/juancarlosduenaslopez/Documents/mlops/mlops4ofp/executions/01_explore/v100/figures



## 9. Summary

This run has shown:

- How a **variant** (e.g. `v001`) fixes:
  - the input dataset (`raw_dataset_path`),
  - the cleaning strategy (`cleaning_strategy`, `nan_values`, `error_values_by_column`).
- How a systematic EDA is carried out:
  - analysis of types, nulls, histograms, correlations, and the time axis.
- How the following are generated and saved:
  - the explored dataset (`01_explore_dataset.parquet`),
  - derived parameters (`01_explore_params.json`),
  - metadata (`01_explore_metadata.json`),
  - the HTML report (`01_explore_report.html`),
  - figures in `figures/`.

All of this is **encapsulated in the variant folder**, which makes it possible to:

- reproduce the experiment later,
- compare variants with each other,
- chain this variant as the input of the next pipeline phase.


In [15]:
# Sanity check: reload the saved dataset and count the remaining NaN values
df = pd.read_parquet(OUTPUTS["dataset"])

nan_per_col = df.isna().sum()
print("Total NaN:", nan_per_col.sum())
print("Columns with NaN:")
print(nan_per_col[nan_per_col > 0])


Total NaN: 0
Columns with NaN:
Series([], dtype: int64)
