# Trainer Experiments v2
This notebook uses the `PipelineModelTrainer` classes to train models with scikit-learn pipelines that include feature engineering.


## 1. Environment setup
The project root is detected so that the modules under `src` can be imported.


In [1]:
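import os
import sys
from pathlib import Path

# Walk up from the current directory until a folder containing `src` is found.
for candidate in [Path.cwd(), *Path.cwd().parents]:
    if (candidate / 'src').exists():
        project_root = candidate
        # Make `src` importable as a package from the project root.
        if str(project_root) not in sys.path:
            sys.path.append(str(project_root))
        break
else:
    # The loop ended without `break`: no ancestor directory contains `src`.
    raise FileNotFoundError('Could not find the src folder in the directory hierarchy.')

# Relative paths (data/, models/, params.yaml) now resolve against the project root.
os.chdir(project_root)
print(f'Using project_root: {project_root}')


Using project_root: c:\Users\uriel\OneDrive\Documentos\MNA\Git\steel-energy-mlops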


## 2. Import configuration and utilities
The hyperparameters from `params.yaml` and the data-loading and pipeline functions are reused.


In [2]:
import json
import yaml
import pandas as pd

from src.pipelines.data_setup import (
    DEFAULT_FEATURE_CONFIG,
    build_feature_frame,
    load_clean_dataframe,
    split_train_test,
)
from src.models.linear_regression_model import PipelineModelTrainer as LinearPipelineTrainer
from src.models.random_forest_model import PipelineModelTrainer as RandomForestPipelineTrainer
from src.models.xgboost_model import PipelineModelTrainer as XGBPipelineTrainer

params_path = project_root / 'params.yaml'
with params_path.open('r', encoding='utf-8') as f:
    cfg = yaml.safe_load(f)
print(f'Parameters loaded from: {params_path}')
cfg


Parameters loaded from: c:\Users\uriel\OneDrive\Documentos\MNA\Git\steel-energy-mlops\params.yaml


{'data': {'raw_path': 'data/clean/steel_energy_cleaned_V2.csv',
  'processed_path': 'data/processed/steel_energy_processed.csv'},
 'features': {'selected': ['lagging_current_reactive.power_kvarh',
   'leading_current_reactive_power_kvarh',
   'lagging_current_power_factor',
   'leading_current_power_factor',
   'nsm',
   'hour',
   'dayofweek_num',
   'month',
   'weekstatus_Weekend',
   'load_type_Maximum_load',
   'load_type_Medium_load'],
  'target': 'usage_kwh'},
 'train': {'model_type': 'xgboost',
  'model_types': ['xgboost', 'random_forest', 'linear_regression'],
  'cv_folds': 5,
  'xgboost': {'n_estimators': 200,
   'learning_rate': 0.1,
   'max_depth': 6,
   'subsample': 0.8,
   'colsample_bytree': 0.8,
   'eval_metric': 'rmse',
   'random_state': 42,
   'n_jobs': -1},
  'random_forest': {'n_estimators': 300,
   'max_depth': None,
   'min_samples_split': 2,
   'min_samples_leaf': 1,
   'max_features': None,
   'random_state': 42,
   'n_jobs': -1},
  'linear_regression': {'fit_i
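
As a rough illustration of how a hyperparameter block like the one above can be consumed, the sketch below unpacks the `random_forest` entry straight into a scikit-learn estimator. The dictionary is copied by hand from the printed output. Whether `PipelineModelTrainer` forwards the keys in exactly this way is an assumption, since its source is not shown here.

```python
from sklearn.ensemble import RandomForestRegressor

# Values copied from the `train.random_forest` block printed above.
rf_params = {
    'n_estimators': 300,
    'max_depth': None,
    'min_samples_split': 2,
    'min_samples_leaf': 1,
    'max_features': None,
    'random_state': 42,
    'n_jobs': -1,
}

# Assumption: the trainer forwards these keys unchanged to the estimator.
rf_model = RandomForestRegressor(**rf_params)

# get_params() echoes back what the estimator actually received.
print({key: rf_model.get_params()[key] for key in rf_params})
```

Keeping the keys in `params.yaml` identical to the estimator's keyword arguments is what makes this direct unpacking possible.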

## 3. Prepare the training data
The pipeline operates on raw features and generates the required transformations itself, so this step only loads the cleaned dataset and splits it.


In [3]:
feature_config = DEFAULT_FEATURE_CONFIG
df = load_clean_dataframe()
feature_df, target = build_feature_frame(df, feature_config)
X_train, X_test, y_train, y_test = split_train_test(feature_df, target)
df.shape, feature_df.shape


[INFO] Loaded dataset — Rows: 32966, Columns: 11
[INFO] Column validation passed.


((32966, 11), (32966, 10))
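
To make the idea of a pipeline that engineers its own features concrete, here is a minimal sketch with a `FunctionTransformer` step followed by a regressor. The `add_hour` helper, the assumption that `nsm` counts seconds since midnight, and the synthetic data are all illustrative. The project's real transformations live in `src` and are not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_hour(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature step: derive `hour` from `nsm` (assumed seconds since midnight)."""
    out = frame.copy()
    out['hour'] = out['nsm'] // 3600
    return out


# Synthetic stand-in for the raw feature frame (not the real dataset).
rng = np.random.default_rng(42)
raw = pd.DataFrame({'nsm': rng.integers(0, 86_400, size=200)})
y = 0.5 * (raw['nsm'] // 3600) + rng.normal(0.0, 0.1, size=200)

pipe = Pipeline([
    ('features', FunctionTransformer(add_hour)),  # runs inside both fit and predict
    ('model', LinearRegression()),
])
pipe.fit(raw, y)
print(pipe.predict(raw.head(3)))
```

Because the transformation is a pipeline step, it is re-applied automatically at prediction time, so callers only ever pass raw columns.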

## 4. Utilities for running experiments
The same workflow is reused for every model: train, evaluate on the hold-out split, then cross-validate.


In [4]:
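from IPython.display import display  # explicit import so the helper also works outside a notebook cell

def summarize_metrics(label, metrics):
    """Show one model's metrics as a single-row table, rounded to 4 decimals."""
    display(pd.DataFrame([metrics], index=[label]).round(4))

def run_pipeline_experiment(trainer, label, model_key):
    """Train and evaluate on the hold-out split, then cross-validate on the full feature frame."""
    metrics = trainer.run(X_train, X_test, y_train, y_test, model_type=model_key)
    summarize_metrics(label, metrics)
    # Cross-validation receives every row (train and test), so its scores are not independent of the hold-out metrics.
    cv_scores = trainer.cross_validate(feature_df, target)
    return metrics, cv_scores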
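
The trainer's `run` method returns a metrics dictionary whose keys match the columns reported in the output tables (`RMSE`, `MAE`, `R2_test`, `R2_train`). Its implementation is not shown, so the following is only a sketch of how such a dictionary could be computed with `sklearn.metrics`. The `evaluate` helper is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def evaluate(y_train, train_pred, y_test, test_pred):
    """Hypothetical helper returning the four metrics reported in this notebook."""
    return {
        # RMSE is the square root of the mean squared error on the test split.
        'RMSE': float(np.sqrt(mean_squared_error(y_test, test_pred))),
        'MAE': float(mean_absolute_error(y_test, test_pred)),
        'R2_test': float(r2_score(y_test, test_pred)),
        # R2 on the training split, useful for spotting overfitting.
        'R2_train': float(r2_score(y_train, train_pred)),
    }


# Tiny hand-checkable example: perfect on train, one error of 1 on test.
demo = evaluate(
    y_train=[1, 2, 3, 4], train_pred=[1, 2, 3, 4],
    y_test=[1, 2, 3, 4], test_pred=[1, 2, 3, 5],
)
print(demo)
```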


## 5. Experiments per model
Each `PipelineModelTrainer` is instantiated with its model's hyperparameters and the shared `cv_folds` value from `params.yaml`.


In [5]:
results = {}
cv_details = {}

# Linear Regression
linear_trainer = LinearPipelineTrainer(
    model_params=cfg['train']['linear_regression'],
    training_params={'cv_folds': cfg['train']['cv_folds']},
    feature_config=feature_config,
)
results['linear_regression'], cv_details['linear_regression'] = run_pipeline_experiment(
    linear_trainer, 'LinearRegression', 'linear_regression'
)

# Random Forest
rf_trainer = RandomForestPipelineTrainer(
    model_params=cfg['train']['random_forest'],
    training_params={'cv_folds': cfg['train']['cv_folds']},
    feature_config=feature_config,
)
results['random_forest'], cv_details['random_forest'] = run_pipeline_experiment(
    rf_trainer, 'RandomForest', 'random_forest'
)

# XGBoost
xgb_trainer = XGBPipelineTrainer(
    model_params=cfg['train']['xgboost'],
    training_params={'cv_folds': cfg['train']['cv_folds']},
    feature_config=feature_config,
)
results['xgboost'], cv_details['xgboost'] = run_pipeline_experiment(
    xgb_trainer, 'XGBoost', 'xgboost'
)


[INFO] Starting Linear Regression training pipeline...
[INFO] Training Linear Regression model...
[INFO] Training complete.
[INFO] Evaluating model performance...
[INFO] Model Evaluation:
   RMSE: 0.2208
   MAE: 0.0954
   R2_test: 0.9514
   R2_train: 0.9557
[INFO] ✅ Saved versioned model to: models/linear_regression/artifacts\model_20251101_222956.pkl
[INFO] Model uploaded as artifact from temporary folder: C:\Users\uriel\AppData\Local\Temp\tmp9wo4nptg\linear_regression_mlflow_model
[INFO] Linear Regression model saved at: models/linear_regression/artifacts\model_20251101_222956.pkl
[INFO] Linear Regression training pipeline complete.



| Model | RMSE | MAE | R2_test | R2_train |
|---|---|---|---|---|
| LinearRegression | 0.2208 | 0.0954 | 0.9514 | 0.9557 |


[INFO] Running cross-validation...
[INFO] CV R² mean: 0.9519 ± 0.0085
⚠️ .env not found, using system environment variables.
[INFO] Starting Random Forest training pipeline...
[INFO] Training Random Forest model...
[INFO] Training complete.
[INFO] Evaluating model performance...
[INFO] Model Evaluation:
   RMSE: 0.1893
   MAE: 0.0501
   R2_test: 0.9642
   R2_train: 0.9952
[INFO] ✅ Saved versioned model to: models/random_forest/artifacts\model_20251101_223006.pkl
[INFO] Model directory logged as artifacts under 'model/' (ok=True)
[INFO] Random Forest model saved at: models/random_forest/artifacts\model_20251101_223006.pkl
[INFO] Random Forest training pipeline complete.



| Model | RMSE | MAE | R2_test | R2_train |
|---|---|---|---|---|
| RandomForest | 0.1893 | 0.0501 | 0.9642 | 0.9952 |


[INFO] Running cross-validation...
[INFO] CV R² mean: 0.9606 ± 0.0062
[INFO] Starting XGBoost training pipeline...
[INFO] Training XGBoost model...
[INFO] Training complete.
[INFO] Evaluating model performance...
[INFO] Model Evaluation:
   RMSE: 0.1881
   MAE: 0.0573
   R2_test: 0.9647
   R2_train: 0.9824
[INFO] ✅ Saved versioned model to: models/xgboost/artifacts\model_20251101_223034.pkl
[INFO] MLflow model uploaded as artifacts from: C:\Users\uriel\AppData\Local\Temp\tmps66kr3ze\xgboost_mlflow_model
[INFO] XGBoost model saved at: models/xgboost/artifacts\model_20251101_223034.pkl
[INFO] XGBoost training pipeline complete.



| Model | RMSE | MAE | R2_test | R2_train |
|---|---|---|---|---|
| XGBoost | 0.1881 | 0.0573 | 0.9647 | 0.9824 |


[INFO] Running cross-validation...
[INFO] CV R² mean: 0.9624 ± 0.0053
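
The `cross_validate` calls above log a mean R² with a spread across folds. Their internals are not shown in this notebook, so the sketch below only illustrates the general technique with scikit-learn's `cross_val_score` on synthetic data. The shuffled 5-fold splitter and the toy dataset are assumptions, not the project's configuration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the real feature frame.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Assumption: 5 shuffled folds; the project's actual splitter is not shown.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')

# One R2 score per fold, summarized as mean and standard deviation.
print(f'CV R² mean: {scores.mean():.4f} ± {scores.std():.4f}')
```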


## 6. Results comparison
The per-model metrics are collected into a single table for side-by-side comparison.


In [6]:
results_df = pd.DataFrame(results).T
results_df.round(4)


| Model | RMSE | MAE | R2_test | R2_train |
|---|---|---|---|---|
| linear_regression | 0.2208 | 0.0954 | 0.9514 | 0.9557 |
| random_forest | 0.1893 | 0.0501 | 0.9642 | 0.9952 |
| xgboost | 0.1881 | 0.0573 | 0.9647 | 0.9824 |
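
One way to read the comparison table is to rank the models by error and look at the gap between training and test R², which hints at overfitting. The sketch below rebuilds the table from the values shown above. The `R2_gap` column is an addition made for illustration and is not part of the original notebook.

```python
import pandas as pd

# Metrics copied from the comparison table above.
table = pd.DataFrame(
    {
        'RMSE': [0.2208, 0.1893, 0.1881],
        'MAE': [0.0954, 0.0501, 0.0573],
        'R2_test': [0.9514, 0.9642, 0.9647],
        'R2_train': [0.9557, 0.9952, 0.9824],
    },
    index=['linear_regression', 'random_forest', 'xgboost'],
)

# Train/test R2 gap: larger values suggest more overfitting.
table['R2_gap'] = (table['R2_train'] - table['R2_test']).round(4)

print(table.sort_values('RMSE'))
print('Lowest RMSE:', table['RMSE'].idxmin())
print('Lowest MAE:', table['MAE'].idxmin())
```

On these numbers XGBoost has the lowest RMSE, Random Forest has the lowest MAE, and Random Forest also shows the largest train/test gap.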


## 7. Cross-validation summary
The per-fold scores of each model are printed for reference.


In [7]:
cv_json = {name: scores.tolist() if hasattr(scores, 'tolist') else list(scores)
           for name, scores in cv_details.items()}
print(json.dumps(cv_json, indent=2))


{
  "linear_regression": [
    0.9369767555145785,
    0.9515989789858095,
    0.9517217036276117,
    0.9630429369103755,
    0.9560684620532836
  ],
  "random_forest": [
    0.9513416023050324,
    0.9619400045543861,
    0.958608086997991,
    0.9705749530880969,
    0.9607359361718851
  ],
  "xgboost": [
    0.9559946186995685,
    0.9624023190543607,
    0.959632910151155,
    0.9719749056564009,
    0.9621246367841119
  ]
}
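
As a quick consistency check, the per-fold scores above can be summarized with the standard library alone. The sketch assumes the logged `CV R² mean: … ± …` lines report the mean and the population standard deviation (dividing by n, which is also the default of `numpy.std`). That assumption is consistent with the logged values but is not confirmed by the trainer's source.

```python
from statistics import fmean, pstdev

# Per-fold R2 scores copied from the JSON output above.
cv_scores = {
    'linear_regression': [
        0.9369767555145785, 0.9515989789858095, 0.9517217036276117,
        0.9630429369103755, 0.9560684620532836,
    ],
    'random_forest': [
        0.9513416023050324, 0.9619400045543861, 0.958608086997991,
        0.9705749530880969, 0.9607359361718851,
    ],
    'xgboost': [
        0.9559946186995685, 0.9624023190543607, 0.959632910151155,
        0.9719749056564009, 0.9621246367841119,
    ],
}

summary = {}
for name, scores in cv_scores.items():
    # pstdev divides by n (population standard deviation).
    summary[name] = (round(fmean(scores), 4), round(pstdev(scores), 4))
    print(f'{name}: {summary[name][0]:.4f} ± {summary[name][1]:.4f}')
```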
