# Modular Pipeline Experiments
This notebook uses the `src.pipelines.data_setup` and `src.pipelines.experiment_pipelines` modules to
load data, prepare features, and evaluate models reproducibly.


## 1. Environment setup
The cell below walks up from the working directory until it finds the `src` folder, then adds that project root to `sys.path` so the modules can be imported.


In [1]:
import sys
from pathlib import Path

candidates = [Path.cwd()] + list(Path.cwd().parents)
for candidate in candidates:
    if (candidate / 'src').exists():
        project_root = candidate
        if str(project_root) not in sys.path:
            sys.path.append(str(project_root))
        break
else:
    raise FileNotFoundError('Could not find the src folder in the directory hierarchy.')


## 2. Import utilities
The functions already written for data loading and pipeline construction are reused here.


In [2]:
import pandas as pd

from src.pipelines.data_setup import (
    DEFAULT_FEATURE_CONFIG,
    build_feature_frame,
    load_clean_dataframe,
    split_train_test,
)
from src.pipelines.experiment_pipelines import (
    build_linear_pipeline,
    build_xgb_pipeline,
    cross_validate_pipeline,
    evaluate_regression,
    get_default_xgb_param_grid,
    run_xgb_grid_search,
)


## 3. Load prepared data
The `data_setup` module encapsulates locating the clean CSV and preparing the variables.


In [3]:
feature_config = DEFAULT_FEATURE_CONFIG
df = load_clean_dataframe()
feature_df, target = build_feature_frame(df, feature_config)
X_train, X_test, y_train, y_test = split_train_test(feature_df, target)
df.shape, feature_df.shape


[INFO] Loaded dataset — Rows: 32966, Columns: 11
[INFO] Column validation passed.


((12236, 11), (12236, 10))

In [4]:
feature_config.to_dict()


{'categorical_features': ['weekstatus', 'day_of_week', 'load_type'],
 'numeric_base_features': ['lagging_current_reactive.power_kvarh',
  'leading_current_reactive_power_kvarh',
  'co2(tco2)',
  'lagging_current_power_factor',
  'leading_current_power_factor',
  'nsm'],
 'date_feature_names': ['date_hour',
  'date_dayofweek',
  'date_month',
  'date_dayofyear',
  'date_sin_hour',
  'date_cos_hour'],
 'interaction_columns': ['lagging_current_reactive.power_kvarh',
  'leading_current_power_factor'],
 'interaction_feature_names': ['lagging_current_reactive.power_kvarh__div__leading_current_power_factor'],
 'numeric_features_for_scaling': ['lagging_current_reactive.power_kvarh',
  'leading_current_reactive_power_kvarh',
  'co2(tco2)',
  'lagging_current_power_factor',
  'leading_current_power_factor',
  'nsm',
  'date_hour',
  'date_dayofweek',
  'date_month',
  'date_dayofyear',
  'date_sin_hour',
  'date_cos_hour',
  'lagging_current_reactive.power_kvarh__div__leading_current_power_facto
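The feature names `date_sin_hour`, `date_cos_hour`, and the `__div__` interaction suggest a cyclic hour encoding and a ratio feature. The implementation inside `data_setup` is not shown in this notebook, so the following is only a minimal sketch of those two ideas. The function names `encode_hour_cyclic` and `safe_ratio`, and the zero-denominator fallback, are assumptions made for illustration.

```python
import math


def encode_hour_cyclic(hour: float) -> tuple[float, float]:
    """Map an hour in [0, 24) onto the unit circle so 23:00 and 00:00 end up close."""
    angle = 2.0 * math.pi * (hour % 24) / 24.0
    return math.sin(angle), math.cos(angle)


def safe_ratio(numerator: float, denominator: float, eps: float = 1e-9) -> float:
    """Divide two feature values; return 0.0 for a near-zero denominator (hypothetical choice)."""
    if abs(denominator) < eps:
        return 0.0
    return numerator / denominator


sin_23, cos_23 = encode_hour_cyclic(23)
sin_0, cos_0 = encode_hour_cyclic(0)

# Distance between 23:00 and 00:00 on the circle is small,
# whereas the raw hour values differ by 23.
gap = math.hypot(sin_23 - sin_0, cos_23 - cos_0)
```

With the raw `date_hour` column alone, a linear model sees midnight and 23:00 as far apart; the sine and cosine pair removes that discontinuity.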

## 4. Linear Regression pipeline
The linear pipeline is evaluated with cross-validation and with metrics on the test set.


In [5]:
linear_pipeline = build_linear_pipeline(feature_config)
linear_cv_results, linear_cv_summary = cross_validate_pipeline(
    linear_pipeline, feature_df, target
)
linear_pipeline.fit(X_train, y_train)
linear_metrics = evaluate_regression(
    linear_pipeline, X_train, X_test, y_train, y_test, 'LinearRegression'
)
linear_cv_summary.round(4)


|   | metric | train_mean | test_mean | test_std |
|---|--------|------------|-----------|----------|
| 0 | rmse   | 0.2388     | 0.2485    | 0.0572   |
| 1 | mae    | 0.1055     | 0.113     | 0.0206   |
| 2 | r2     | 0.9465     | 0.9405    | 0.0165   |
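The `train_mean`, `test_mean`, and `test_std` columns read as aggregates over per-fold scores. How `cross_validate_pipeline` computes them is not shown, so the sketch below only illustrates one plausible aggregation. The function `summarize_cv`, the input layout, and the use of the population standard deviation are all assumptions, and the fold scores are made up.

```python
from statistics import mean, pstdev


def summarize_cv(fold_scores: dict[str, dict[str, list[float]]]) -> list[dict]:
    """Collapse per-fold train/test scores into one summary row per metric."""
    rows = []
    for metric, scores in fold_scores.items():
        rows.append({
            "metric": metric,
            "train_mean": mean(scores["train"]),
            "test_mean": mean(scores["test"]),
            # Population std (ddof=0); a sample std would be slightly larger.
            "test_std": pstdev(scores["test"]),
        })
    return rows


# Made-up fold scores, for illustration only (not the notebook's values).
example = {"rmse": {"train": [0.20, 0.25, 0.30], "test": [0.22, 0.26, 0.30]}}
summary = summarize_cv(example)
```

A large `test_std` relative to `test_mean` indicates that the score depends heavily on which fold is held out.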


In [6]:
pd.DataFrame([linear_metrics]).round(4)


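|   | model            | rmse_train | rmse_test | mae_test | r2_train | r2_test |
|---|------------------|------------|-----------|----------|----------|---------|
| 0 | LinearRegression | 0.0574     | 0.0566    | 0.105    | 0.9459   | 0.9479  |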
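In the table above, `rmse_test` (0.0566) is smaller than `mae_test` (0.105). That cannot hold for RMSE and MAE computed on the same predictions, because RMSE is always at least as large as MAE. One possible explanation is that the `rmse_*` columns actually hold MSE: squaring the cross-validated RMSE of roughly 0.24 to 0.25 gives about 0.06, which is in line with the reported numbers. This is an assumption, since `evaluate_regression` is not shown here. The sketch below uses made-up values to illustrate the relationship.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.4, 3.8]

# RMSE is never smaller than MAE on the same residuals...
rmse_ge_mae = rmse(y_true, y_pred) >= mae(y_true, y_pred)
# ...but MSE can be, when the errors are mostly below 1.
mse_lt_mae = mse(y_true, y_pred) < mae(y_true, y_pred)
```

A quick assertion such as `rmse >= mae` inside the evaluation helper would catch a missing square root early.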


## 5. XGBoost pipeline
A grid search is run over the XGBoost pipeline, reusing the module's default parameter grid.


In [7]:
xgb_pipeline = build_xgb_pipeline(feature_config)
xgb_grid = run_xgb_grid_search(
    xgb_pipeline,
    X_train,
    y_train,
    param_grid=get_default_xgb_param_grid(),
)
xgb_best = xgb_grid.best_estimator_
xgb_metrics = evaluate_regression(
    xgb_best, X_train, X_test, y_train, y_test, 'XGBRegressor'
)
xgb_grid.best_params_


Fitting 3 folds for each of 32 candidates, totalling 96 fits


{'regressor__colsample_bytree': 0.7,
 'regressor__learning_rate': 0.03,
 'regressor__max_depth': 6,
 'regressor__n_estimators': 200,
 'regressor__subsample': 1.0}
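The log line "32 candidates, totalling 96 fits" follows from the size of the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. The actual contents of `get_default_xgb_param_grid()` are not shown, so the grid below is hypothetical. Only the values 0.7, 0.03, 6, 200, and 1.0 come from the output above, and two values per parameter is just one of several layouts that would give 32 combinations.

```python
from itertools import product

# Hypothetical grid: the second value for each parameter is made up.
param_grid = {
    "regressor__colsample_bytree": [0.7, 1.0],
    "regressor__learning_rate": [0.03, 0.1],
    "regressor__max_depth": [4, 6],
    "regressor__n_estimators": [200, 400],
    "regressor__subsample": [0.8, 1.0],
}
n_folds = 3

# Cartesian product of all value lists: one dict per candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per cross-validation fold.
total_fits = len(candidates) * n_folds
```

The `regressor__` prefix is the scikit-learn convention for routing a parameter to the pipeline step named `regressor`.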

In [8]:
pd.DataFrame([xgb_metrics]).round(4)


|   | model        | rmse_train | rmse_test | mae_test | r2_train | r2_test |
|---|--------------|------------|-----------|----------|----------|---------|
| 0 | XGBRegressor | 0.0269     | 0.0377    | 0.0702   | 0.9747   | 0.9653  |


## 6. Final comparison
The results are consolidated for a quick comparison of the models tested.


In [9]:
results_df = pd.DataFrame([linear_metrics, xgb_metrics]).round(4)
results_df


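|   | model            | rmse_train | rmse_test | mae_test | r2_train | r2_test |
|---|------------------|------------|-----------|----------|----------|---------|
| 0 | LinearRegression | 0.0574     | 0.0566    | 0.105    | 0.9459   | 0.9479  |
| 1 | XGBRegressor     | 0.0269     | 0.0377    | 0.0702   | 0.9747   | 0.9653  |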
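To put the gap between the two models in relative terms, the test-set columns can be compared directly. The sketch below copies the figures from the comparison table above. The helper `relative_reduction` is a hypothetical name, and the calculation takes the column labels at face value.

```python
# Test-set metrics copied from the comparison table above.
linear = {"rmse_test": 0.0566, "mae_test": 0.105, "r2_test": 0.9479}
xgb = {"rmse_test": 0.0377, "mae_test": 0.0702, "r2_test": 0.9653}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which `candidate` lowers an error metric relative to `baseline`."""
    return (baseline - candidate) / baseline


rmse_drop = relative_reduction(linear["rmse_test"], xgb["rmse_test"])
mae_drop = relative_reduction(linear["mae_test"], xgb["mae_test"])
r2_gain = xgb["r2_test"] - linear["r2_test"]

print(f"rmse_test reduction: {rmse_drop:.1%}")
print(f"mae_test reduction:  {mae_drop:.1%}")
print(f"r2_test gain:        {r2_gain:.4f}")
```

On these numbers, XGBoost lowers both reported error columns by roughly a third on the test set.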
