# Scikit-learn Pipeline Experiments
This notebook reuses modules from `src.pipelines` to load data,
prepare features, and run experiments with different regression models.


## 1. Setup
The initial cells only configure the environment and import the helper utilities.


In [None]:
import sys
from pathlib import Path

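# Walk up from the working directory until a folder containing `src` is found.
candidates = [Path.cwd()] + list(Path.cwd().parents)
for candidate in candidates:
    if (candidate / 'src').exists():
        project_root = candidate
        # Make `src` importable without installing the project as a package.
        if str(project_root) not in sys.path:
            sys.path.append(str(project_root))
        break
else:
    # The for/else branch runs only when no candidate contained `src`.
    raise FileNotFoundError('Could not find the src folder in the directory hierarchy.')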


In [None]:
import pandas as pd

from src.pipelines.data_setup import (
    DEFAULT_FEATURE_CONFIG,
    build_feature_frame,
    load_clean_dataframe,
    split_train_test,
)
from src.pipelines.experiment_pipelines import (
    build_linear_pipeline,
    build_xgb_pipeline,
    cross_validate_pipeline,
    evaluate_regression,
    get_default_xgb_param_grid,
    run_xgb_grid_search,
)


## 2. Data loading and preparation
The `data_setup` module encapsulates loading the clean CSV, selecting columns, and splitting the data into train and test sets.
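
The internals of `data_setup` are not shown in this notebook, so the following is only a minimal sketch of what helpers of this kind might look like. The names `SketchFeatureConfig`, `sketch_build_feature_frame`, and `sketch_split_train_test`, the column names, and the synthetic data are all assumptions made for illustration; they are not the module's actual API.

```python
from dataclasses import asdict, dataclass

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


@dataclass(frozen=True)
class SketchFeatureConfig:
    """Hypothetical stand-in for the object behind DEFAULT_FEATURE_CONFIG."""

    numeric_features: tuple
    categorical_features: tuple
    target: str

    def to_dict(self):
        # Plain-dict view, convenient for logging or displaying the setup.
        return asdict(self)


def sketch_build_feature_frame(df, config):
    """Select the configured columns and separate the target column."""
    columns = list(config.numeric_features) + list(config.categorical_features)
    return df[columns].copy(), df[config.target].copy()


def sketch_split_train_test(features, target, test_size=0.2, random_state=42):
    """Hold out a fraction of the rows for the final evaluation."""
    return train_test_split(
        features, target, test_size=test_size, random_state=random_state
    )


# Synthetic data, used only so the sketch runs on its own.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "area": rng.uniform(40, 200, size=50),
        "rooms": rng.integers(1, 6, size=50),
        "zone": rng.choice(["north", "south"], size=50),
        "price": rng.uniform(100, 500, size=50),
    }
)

demo_config = SketchFeatureConfig(("area", "rooms"), ("zone",), "price")
demo_features, demo_target = sketch_build_feature_frame(demo_df, demo_config)
Xa, Xb, ya, yb = sketch_split_train_test(demo_features, demo_target)
print(demo_df.shape, demo_features.shape, Xa.shape, Xb.shape)
```

Keeping the column selection in a single config object means every pipeline in the notebook can be built from the same description of the features.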


In [None]:
feature_config = DEFAULT_FEATURE_CONFIG
df = load_clean_dataframe()
feature_df, target = build_feature_frame(df, feature_config)
X_train, X_test, y_train, y_test = split_train_test(feature_df, target)
df.shape, feature_df.shape


In [None]:
feature_config.to_dict()


## 3. Baseline model: Linear Regression
The pipeline is built from `experiment_pipelines`, evaluated with cross-validation on the full
feature frame, then fitted on the training split and scored against the hold-out set.
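
As a rough illustration of this two-step evaluation, the sketch below cross-validates a scikit-learn pipeline and then scores it on a hold-out split. The helper `sketch_cross_validate`, the scaling step, the chosen metrics, and the synthetic data are assumptions; the real `cross_validate_pipeline` may compute different metrics or use a different fold strategy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def sketch_cross_validate(pipeline, X, y, cv=5):
    """Return per-fold scores and a mean/std summary (hypothetical helper)."""
    scores = cross_validate(
        pipeline,
        X,
        y,
        cv=cv,
        scoring=("r2", "neg_root_mean_squared_error"),
    )
    per_fold = pd.DataFrame(
        {
            "r2": scores["test_r2"],
            # scikit-learn reports errors negated so that higher is better.
            "rmse": -scores["test_neg_root_mean_squared_error"],
        }
    )
    return per_fold, per_fold.agg(["mean", "std"])


# Synthetic, nearly linear data so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 2.0 * X_demo["a"] - X_demo["b"] + rng.normal(scale=0.1, size=200)

demo_pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", LinearRegression()),
    ]
)

# Step 1: cross-validation (the pipeline is cloned and refitted per fold).
per_fold, summary = sketch_cross_validate(demo_pipeline, X_demo, y_demo)

# Step 2: fit on the training split and score on the hold-out split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
demo_pipeline.fit(X_tr, y_tr)
holdout_r2 = demo_pipeline.score(X_te, y_te)
print(summary.round(4))
```

Because `cross_validate` works on clones, the pipeline still has to be fitted explicitly before the hold-out evaluation, which mirrors the explicit `fit` call in the cell that follows.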


In [None]:
linear_pipeline = build_linear_pipeline(feature_config)
linear_cv_results, linear_cv_summary = cross_validate_pipeline(
    linear_pipeline, feature_df, target
)
linear_pipeline.fit(X_train, y_train)
linear_metrics = evaluate_regression(
    linear_pipeline, X_train, X_test, y_train, y_test, 'LinearRegression'
)
linear_cv_summary.round(4)


In [None]:
pd.DataFrame([linear_metrics]).round(4)


## 4. XGBoost model
The same pipeline structure is reused, and a grid search is run on the training split with the module's default parameter grid.
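
The contents of `get_default_xgb_param_grid()` are not shown here, so the sketch below only illustrates the general pattern of a grid search over a pipeline. It swaps in scikit-learn's `GradientBoostingRegressor` for `XGBRegressor` to stay dependency-free, and the step names, grid values, scoring metric, and fold count are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic non-linear data so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=120)

demo_pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        # Stand-in for XGBRegressor; not the model used in this notebook.
        ("model", GradientBoostingRegressor(random_state=0)),
    ]
)

# Keys follow "<step name>__<parameter>" so that the grid reaches the
# estimator nested inside the pipeline.
demo_param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [2, 3],
    "model__learning_rate": [0.05, 0.1],
}

demo_grid = GridSearchCV(
    demo_pipeline,
    param_grid=demo_param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is refitted on all the data
# passed to fit() using the best parameter combination.
best_model = demo_grid.best_estimator_
print(demo_grid.best_params_)
```

Searching only on the training split, as the notebook does, keeps the hold-out set untouched until the final `evaluate_regression` call.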


In [None]:
xgb_pipeline = build_xgb_pipeline(feature_config)
xgb_grid = run_xgb_grid_search(
    xgb_pipeline,
    X_train,
    y_train,
    param_grid=get_default_xgb_param_grid(),
)
xgb_best = xgb_grid.best_estimator_
xgb_metrics = evaluate_regression(
    xgb_best, X_train, X_test, y_train, y_test, 'XGBRegressor'
)
xgb_grid.best_params_


In [None]:
pd.DataFrame([xgb_metrics]).round(4)


## 5. Quick comparison
The results are consolidated into a single table for easier reading. You can create additional modules for other models and
reuse the same functions.
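
One way to picture this reuse is a single evaluation helper that returns one row per model, so adding a model only means adding a row. The sketch below is an assumption about the shape of such a helper: `sketch_evaluate`, its metric names, the `Ridge` example, and the synthetic data are illustrative and do not reproduce `evaluate_regression`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def sketch_evaluate(model, X_tr, X_te, y_tr, y_te, name):
    """Return one row of train/test metrics for an already fitted model."""
    row = {"model": name}
    for split, X, y in (("train", X_tr, y_tr), ("test", X_te, y_te)):
        pred = model.predict(X)
        row[f"{split}_rmse"] = float(np.sqrt(mean_squared_error(y, pred)))
        row[f"{split}_r2"] = float(r2_score(y, pred))
    return row


# Synthetic, nearly linear data so the sketch runs on its own.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=150)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Each additional model contributes one more row to the same table.
rows = []
for name, model in (
    ("LinearRegression", LinearRegression()),
    ("Ridge", Ridge(alpha=1.0)),
):
    model.fit(X_tr, y_tr)
    rows.append(sketch_evaluate(model, X_tr, X_te, y_tr, y_te, name))

demo_results = pd.DataFrame(rows).round(4)
print(demo_results)
```

Reporting train and test metrics side by side makes overfitting easy to spot when the models are compared.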


In [None]:
results_df = pd.DataFrame([linear_metrics, xgb_metrics]).round(4)
results_df
