# **ML Experiment: Demand Analysis and Modeling / Alpha-Betha Classification**
---
**Author:** Juan Felipe Quinto Rios

>In this notebook we build two key models: a regression model that predicts demand and a classification model (Alpha/Betha) that guides material purchases.



# **1. Import libraries and load data**

This cell imports the required libraries and reads the four CSV files (`dataset_A_raw.csv`, `dataset_B_addon_sum.csv`, `dataset_C_contractx.csv` and `dataset_D_fulleng.csv`), each of which contains a different block of features.

In [None]:
import pandas as pd
import numpy as np

# File paths. Raw strings keep the backslashes literal; in a plain string,
# the "\t" in "\transformed_data" would be read as a tab character.
path_A = r"/contLAB_ML\Clasificacion_Regresion\transformed_data\dataset_A_raw.csv"
path_B = r"/contLAB_ML\Clasificacion_Regresion\transformed_data\dataset_B_addon_sum.csv"
path_C = r"/contLAB_ML\Clasificacion_Regresion\transformed_data\dataset_C_contractx.csv"
path_D = r"/contLAB_ML\Clasificacion_Regresion\transformed_data\dataset_D_fulleng.csv"

# Load the DataFrames
df_A = pd.read_csv(path_A)
df_B = pd.read_csv(path_B)
df_C = pd.read_csv(path_C)
df_D = pd.read_csv(path_D)
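
A note on the path strings: in an ordinary Python string literal a backslash starts an escape sequence, so a `\t` inside a Windows-style path is read as a tab rather than as a separator followed by `t`. The sketch below illustrates the pitfall and two common ways around it. It uses a shortened, hypothetical path (`data\transformed_data\file.csv`), not the notebook's real location.

```python
from pathlib import PureWindowsPath

# Hypothetical shortened path, used only to illustrate the escape problem.
plain = "data\transformed_data\file.csv"   # "\t" becomes a tab, "\f" a form feed
raw = r"data\transformed_data\file.csv"    # raw string keeps every backslash

print("\t" in plain)   # the plain literal silently contains a tab character
print("\t" in raw)     # the raw literal does not

# Building the path from its parts avoids typing separators by hand.
built = PureWindowsPath("data", "transformed_data", "file.csv")
print(str(built) == raw)
```

Either a raw string or a `pathlib` path built from parts would keep `pd.read_csv` pointed at the intended file.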



# **2. Building the generic preprocessor**

Here we define a `ColumnTransformer` that applies median imputation and standardization to the numeric variables, fills missing values with "Missing" and applies one-hot encoding to the categorical variables, and passes the binary variables through unchanged.

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Lists with every possible column name
all_numeric_cols     = ["Charges", "Contract_months", "TotalAddOns",
                        "Charges_per_AddOn", "Contract_x_Charges"]
all_binary_cols      = ["SeniorCity", "Partner", "Dependents",
                        "Service1", "Service2", "Security",
                        "OnlineBackup", "DeviceProtection",
                        "TechSupport", "PaperlessBilling",
                        "InternetService", "AutoPayment_flag"]
all_categorical_cols = ["PaymentMethod_simple"]

# Transformers for each variable type
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("onehot", OneHotEncoder(drop="first"))
])

def make_preprocessor(df):
    """
    Build a ColumnTransformer adapted to the columns present in each df.
    """
    num_cols = [c for c in all_numeric_cols     if c in df.columns]
    cat_cols = [c for c in all_categorical_cols if c in df.columns]
    bin_cols = [c for c in all_binary_cols      if c in df.columns]

    preprocessor = ColumnTransformer([
        ("num", numeric_transformer,     num_cols),
        ("cat", categorical_transformer, cat_cols),
        ("bin", "passthrough",           bin_cols)
    ])
    return preprocessor
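
To make the two transformers concrete, here is a plain-Python sketch of what they compute on a single column. It is an illustrative re-implementation on toy values, not scikit-learn's own code; the helper names `impute_and_scale` and `one_hot_drop_first` are hypothetical.

```python
from statistics import mean, median, pstdev

def impute_and_scale(values):
    """Median-impute None entries, then standardize to zero mean and unit variance."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    mu, sigma = mean(filled), pstdev(filled)   # population std, as StandardScaler uses
    return [(v - mu) / sigma for v in filled]

def one_hot_drop_first(values, missing="Missing"):
    """Fill None with a constant label, one-hot encode, and drop the first category."""
    filled = [missing if v is None else v for v in values]
    categories = sorted(set(filled))
    kept = categories[1:]                      # the first sorted category is the baseline
    return kept, [[1 if v == c else 0 for c in kept] for v in filled]

charges = [10.0, None, 30.0, 20.0]             # toy values, not taken from the datasets
scaled = impute_and_scale(charges)
print([round(z, 3) for z in scaled])

payment = ["Mailed check", None, "Electronic check", "Mailed check"]
kept, rows = one_hot_drop_first(payment)
print(kept)
print(rows)
```

Dropping the first category leaves one indicator fewer than there are categories, which avoids perfectly collinear columns in the linear models.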


# **3. Defining the base models**

Here we import the regression and classification classes that we will compare on each data block. Each dictionary maps a readable name to the unfitted model instance.

In [None]:
from sklearn.linear_model     import LinearRegression, Ridge, LogisticRegression
from sklearn.ensemble         import RandomForestRegressor, GradientBoostingRegressor
from sklearn.ensemble         import RandomForestClassifier, GradientBoostingClassifier
from lightgbm                 import LGBMRegressor, LGBMClassifier
from catboost                 import CatBoostRegressor, CatBoostClassifier

regressors = {
    "LinearRegression":    LinearRegression(),
    "Ridge":               Ridge(alpha=1.0, random_state=42),
    "RandomForestReg":     RandomForestRegressor(n_estimators=100, random_state=42),
    "GradientBoostingReg": GradientBoostingRegressor(n_estimators=100, learning_rate=0.05, random_state=42),
    "LGBMReg":             LGBMRegressor(n_estimators=100, learning_rate=0.05, random_state=42, verbose=-1),
    "CatBoostReg":         CatBoostRegressor(iterations=100, learning_rate=0.05, random_state=42, verbose=False)
}

classifiers = {
    "LogisticRegression":  LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
    "RandomForestClf":     RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42),
    "GradientBoostingClf": GradientBoostingClassifier(n_estimators=100, learning_rate=0.05, random_state=42),
    "LGBMClf":             LGBMClassifier(n_estimators=100, learning_rate=0.05, class_weight="balanced", random_state=42, verbose=-1),
    "CatBoostClf":         CatBoostClassifier(iterations=100, learning_rate=0.05, random_state=42, verbose=False)
}
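
Three of the five classifiers above are created with `class_weight="balanced"`. As a rough illustration of what that setting does, the sketch below applies the usual balanced heuristic, `n_samples / (n_classes * class_count)`. The real weights are derived from the training labels; the counts used here are the test-set supports printed in section 5 (1033 and 374), taken as a stand-in because they are the only class counts shown in this notebook.

```python
def balanced_class_weights(class_counts):
    """Return n_samples / (n_classes * count) for each class label."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Stand-in counts: test-set support of class 0 and class 1 (Betha)
counts = {0: 1033, 1: 374}
weights = balanced_class_weights(counts)
for label, w in weights.items():
    print(label, round(w, 4))
```

With these counts an error on the minority class costs a little under three times as much as one on the majority class, which pushes the model toward higher recall for class 1.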


# **4. Evaluating the blocks with cross-validation**

In this cell we define the function `evaluate_block(df, block_name)`, which, for each block, splits the data into 80% training and 20% test, runs 5-fold cross-validation on the training set, and then evaluates each model on the test set. For regression we record RMSE and MAE; for classification we record ROC-AUC and the precision/recall/F1 report for the Betha class.
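
As a reminder of how the two regression metrics differ, here is a minimal sketch on toy numbers (not taken from the datasets). RMSE squares the errors before averaging, so a single large miss moves it much more than it moves MAE.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [100.0, 200.0, 300.0, 400.0]   # toy targets
y_pred = [110.0, 190.0, 310.0, 300.0]   # three small errors and one large one

print(round(mae(y_true, y_pred), 2))
print(round(rmse(y_true, y_pred), 2))
```

A wide gap between RMSE and MAE, like the one in the results of this section, usually suggests that a few observations carry large errors.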

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold
from sklearn.metrics         import mean_squared_error, mean_absolute_error, roc_auc_score, classification_report

kf_reg = KFold(n_splits=5, shuffle=True, random_state=42)
skf_clf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = []

def evaluate_block(df, block_name):
    """
    - df: DataFrame that already contains only the feature columns + 'Demand' + 'Class_flag'.
    - block_name: label that identifies the block (e.g. "A_raw").

    This function:
      1) Separates X, y_reg and y_clf.
      2) Performs the train/test split.
      3) For each regressor: cross_val_score + final fit + RMSE/MAE.
      4) For each classifier: cross_val_score + final fit + AUC/Precision/Recall/F1.
      5) Appends the results to the global list "results".
    """
    # Separate features and targets for regression and classification
    feats   = [c for c in df.columns if c not in ["Demand", "Class_flag"]]
    X_r, y_r = df[feats], df["Demand"]
    X_c, y_c = df[feats], df["Class_flag"]

    # Train/test split
    Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_r, y_r, test_size=0.20, random_state=42)
    Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X_c, y_c, test_size=0.20, stratify=y_c, random_state=42)

    # Preprocessor adapted to this block's columns
    preproc = make_preprocessor(df)

    # Evaluate regressors
    for name, model in regressors.items():
        pipe = Pipeline([("preproc", preproc), ("model", model)])
        cv_scores = cross_val_score(pipe, Xr_tr, yr_tr, cv=kf_reg, scoring="neg_root_mean_squared_error", n_jobs=-1)
        rmse_cv = -cv_scores.mean()
        pipe.fit(Xr_tr, yr_tr)
        preds = pipe.predict(Xr_te)
        rmse_t = np.sqrt(mean_squared_error(yr_te, preds))
        mae_t  = mean_absolute_error(yr_te, preds)
        results.append({
            "block":     block_name,
            "type":      "regression",
            "model":     name,
            "rmse_cv":   rmse_cv,
            "rmse_test": rmse_t,
            "mae_test":  mae_t
        })
        print(f"[{block_name}][REG][{name}] RMSE_CV={rmse_cv:.2f} | RMSE_Test={rmse_t:.2f} | MAE_Test={mae_t:.2f}")

    # Evaluate classifiers
    for name, model in classifiers.items():
        pipe = Pipeline([("preproc", preproc), ("model", model)])
        cv_scores = cross_val_score(pipe, Xc_tr, yc_tr, cv=skf_clf, scoring="roc_auc", n_jobs=-1)
        auc_cv = cv_scores.mean()
        pipe.fit(Xc_tr, yc_tr)
        probas = pipe.predict_proba(Xc_te)[:, 1]
        auc_t  = roc_auc_score(yc_te, probas)
        report = classification_report(yc_te, pipe.predict(Xc_te), output_dict=True)
        results.append({
            "block":       block_name,
            "type":        "classification",
            "model":       name,
            "auc_cv":      auc_cv,
            "auc_test":    auc_t,
            "precision_1": report["1"]["precision"],
            "recall_1":    report["1"]["recall"],
            "f1_1":        report["1"]["f1-score"]
        })
        print(f"[{block_name}][CLF][{name}] AUC_CV={auc_cv:.4f} | AUC_Test={auc_t:.4f}")

# Run for the four blocks
evaluate_block(df_A, "A_raw")
evaluate_block(df_B, "B_addon_sum")
evaluate_block(df_C, "C_contractx")
evaluate_block(df_D, "D_fulleng")

# Collect the results in a DataFrame
df_results = pd.DataFrame(results)

# Regression summary
df_reg = df_results[df_results["type"] == "regression"]\
            .sort_values("rmse_test", ascending=True)\
            .reset_index(drop=True)
print("\n=== RESULTADOS REGRESIÓN (ordenados por RMSE_Test asc) ===")
display(df_reg)

# Classification summary
df_clf = df_results[df_results["type"] == "classification"]\
            .sort_values("auc_test", ascending=False)\
            .reset_index(drop=True)
print("\n=== RESULTADOS CLASIFICACIÓN (ordenados por AUC_Test desc) ===")
display(df_clf)


[A_raw][REG][LinearRegression] RMSE_CV=1141.59 | RMSE_Test=1147.21 | MAE_Test=899.88
[A_raw][REG][Ridge] RMSE_CV=1141.58 | RMSE_Test=1147.22 | MAE_Test=899.89
[A_raw][REG][RandomForestReg] RMSE_CV=1088.96 | RMSE_Test=1140.08 | MAE_Test=791.51
[A_raw][REG][GradientBoostingReg] RMSE_CV=1019.10 | RMSE_Test=1051.10 | MAE_Test=740.85




[A_raw][REG][LGBMReg] RMSE_CV=1021.66 | RMSE_Test=1060.65 | MAE_Test=745.02
[A_raw][REG][CatBoostReg] RMSE_CV=1002.96 | RMSE_Test=1043.65 | MAE_Test=735.05
[A_raw][CLF][LogisticRegression] AUC_CV=0.8311 | AUC_Test=0.8123
[A_raw][CLF][RandomForestClf] AUC_CV=0.7633 | AUC_Test=0.7478
[A_raw][CLF][GradientBoostingClf] AUC_CV=0.8274 | AUC_Test=0.8081




[A_raw][CLF][LGBMClf] AUC_CV=0.8182 | AUC_Test=0.8001
[A_raw][CLF][CatBoostClf] AUC_CV=0.8286 | AUC_Test=0.8107
[B_addon_sum][REG][LinearRegression] RMSE_CV=1146.08 | RMSE_Test=1155.48 | MAE_Test=904.57
[B_addon_sum][REG][Ridge] RMSE_CV=1146.08 | RMSE_Test=1155.51 | MAE_Test=904.56
[B_addon_sum][REG][RandomForestReg] RMSE_CV=1125.70 | RMSE_Test=1172.88 | MAE_Test=807.51
[B_addon_sum][REG][GradientBoostingReg] RMSE_CV=1029.43 | RMSE_Test=1062.90 | MAE_Test=751.41




[B_addon_sum][REG][LGBMReg] RMSE_CV=1031.53 | RMSE_Test=1073.27 | MAE_Test=751.80
[B_addon_sum][REG][CatBoostReg] RMSE_CV=1012.86 | RMSE_Test=1053.31 | MAE_Test=741.87
[B_addon_sum][CLF][LogisticRegression] AUC_CV=0.8304 | AUC_Test=0.8124
[B_addon_sum][CLF][RandomForestClf] AUC_CV=0.7627 | AUC_Test=0.7535
[B_addon_sum][CLF][GradientBoostingClf] AUC_CV=0.8270 | AUC_Test=0.8085




[B_addon_sum][CLF][LGBMClf] AUC_CV=0.8191 | AUC_Test=0.8051
[B_addon_sum][CLF][CatBoostClf] AUC_CV=0.8291 | AUC_Test=0.8123
[C_contractx][REG][LinearRegression] RMSE_CV=1055.82 | RMSE_Test=1081.67 | MAE_Test=782.68
[C_contractx][REG][Ridge] RMSE_CV=1055.82 | RMSE_Test=1081.71 | MAE_Test=782.70
[C_contractx][REG][RandomForestReg] RMSE_CV=1088.60 | RMSE_Test=1130.35 | MAE_Test=780.34
[C_contractx][REG][GradientBoostingReg] RMSE_CV=1018.73 | RMSE_Test=1057.27 | MAE_Test=748.38




[C_contractx][REG][LGBMReg] RMSE_CV=1024.21 | RMSE_Test=1063.09 | MAE_Test=742.75
[C_contractx][REG][CatBoostReg] RMSE_CV=1005.43 | RMSE_Test=1048.10 | MAE_Test=739.03
[C_contractx][CLF][LogisticRegression] AUC_CV=0.8318 | AUC_Test=0.8136
[C_contractx][CLF][RandomForestClf] AUC_CV=0.7725 | AUC_Test=0.7578
[C_contractx][CLF][GradientBoostingClf] AUC_CV=0.8278 | AUC_Test=0.8106




[C_contractx][CLF][LGBMClf] AUC_CV=0.8164 | AUC_Test=0.7995
[C_contractx][CLF][CatBoostClf] AUC_CV=0.8291 | AUC_Test=0.8097
[D_fulleng][REG][LinearRegression] RMSE_CV=1046.10 | RMSE_Test=1066.94 | MAE_Test=767.49
[D_fulleng][REG][Ridge] RMSE_CV=1046.09 | RMSE_Test=1066.98 | MAE_Test=767.56
[D_fulleng][REG][RandomForestReg] RMSE_CV=1097.75 | RMSE_Test=1136.14 | MAE_Test=785.63
[D_fulleng][REG][GradientBoostingReg] RMSE_CV=1019.23 | RMSE_Test=1053.83 | MAE_Test=744.30




[D_fulleng][REG][LGBMReg] RMSE_CV=1022.43 | RMSE_Test=1060.84 | MAE_Test=742.21
[D_fulleng][REG][CatBoostReg] RMSE_CV=1006.94 | RMSE_Test=1048.11 | MAE_Test=738.86
[D_fulleng][CLF][LogisticRegression] AUC_CV=0.8321 | AUC_Test=0.8144
[D_fulleng][CLF][RandomForestClf] AUC_CV=0.7796 | AUC_Test=0.7646
[D_fulleng][CLF][GradientBoostingClf] AUC_CV=0.8298 | AUC_Test=0.8090




[D_fulleng][CLF][LGBMClf] AUC_CV=0.8195 | AUC_Test=0.8016
[D_fulleng][CLF][CatBoostClf] AUC_CV=0.8295 | AUC_Test=0.8116

=== RESULTADOS REGRESIÓN (ordenados por RMSE_Test asc) ===


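| | block | type | model | rmse_cv | rmse_test | mae_test | auc_cv | auc_test | precision_1 | recall_1 | f1_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A_raw | regression | CatBoostReg | 1002.959339 | 1043.650234 | 735.053318 | NaN | NaN | NaN | NaN | NaN |
| 1 | C_contractx | regression | CatBoostReg | 1005.425359 | 1048.099156 | 739.026655 | NaN | NaN | NaN | NaN | NaN |
| 2 | D_fulleng | regression | CatBoostReg | 1006.935847 | 1048.112779 | 738.864619 | NaN | NaN | NaN | NaN | NaN |
| 3 | A_raw | regression | GradientBoostingReg | 1019.101552 | 1051.10047 | 740.845848 | NaN | NaN | NaN | NaN | NaN |
| 4 | B_addon_sum | regression | CatBoostReg | 1012.864701 | 1053.308305 | 741.869542 | NaN | NaN | NaN | NaN | NaN |
| 5 | D_fulleng | regression | GradientBoostingReg | 1019.228002 | 1053.833693 | 744.299945 | NaN | NaN | NaN | NaN | NaN |
| 6 | C_contractx | regression | GradientBoostingReg | 1018.732325 | 1057.273255 | 748.380116 | NaN | NaN | NaN | NaN | NaN |
| 7 | A_raw | regression | LGBMReg | 1021.657562 | 1060.651569 | 745.020738 | NaN | NaN | NaN | NaN | NaN |
| 8 | D_fulleng | regression | LGBMReg | 1022.432491 | 1060.84062 | 742.207148 | NaN | NaN | NaN | NaN | NaN |
| 9 | B_addon_sum | regression | GradientBoostingReg | 1029.434133 | 1062.898533 | 751.409892 | NaN | NaN | NaN | NaN | NaN |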



=== RESULTADOS CLASIFICACIÓN (ordenados por AUC_Test desc) ===


| | block | type | model | rmse_cv | rmse_test | mae_test | auc_cv | auc_test | precision_1 | recall_1 | f1_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | D_fulleng | classification | LogisticRegression | NaN | NaN | NaN | 0.832129 | 0.814359 | 0.463259 | 0.775401 | 0.58 |
| 1 | C_contractx | classification | LogisticRegression | NaN | NaN | NaN | 0.83183 | 0.813637 | 0.464968 | 0.780749 | 0.582834 |
| 2 | B_addon_sum | classification | LogisticRegression | NaN | NaN | NaN | 0.830418 | 0.812351 | 0.470305 | 0.783422 | 0.587763 |
| 3 | A_raw | classification | LogisticRegression | NaN | NaN | NaN | 0.831071 | 0.812293 | 0.46252 | 0.775401 | 0.579421 |
| 4 | B_addon_sum | classification | CatBoostClf | NaN | NaN | NaN | 0.829111 | 0.812263 | 0.619377 | 0.47861 | 0.53997 |
| 5 | D_fulleng | classification | CatBoostClf | NaN | NaN | NaN | 0.829504 | 0.81158 | 0.601974 | 0.489305 | 0.539823 |
| 6 | A_raw | classification | CatBoostClf | NaN | NaN | NaN | 0.828637 | 0.810738 | 0.598706 | 0.494652 | 0.541728 |
| 7 | C_contractx | classification | GradientBoostingClf | NaN | NaN | NaN | 0.827835 | 0.810647 | 0.59396 | 0.473262 | 0.526786 |
| 8 | C_contractx | classification | CatBoostClf | NaN | NaN | NaN | 0.829133 | 0.809731 | 0.60396 | 0.489305 | 0.54062 |
| 9 | D_fulleng | classification | GradientBoostingClf | NaN | NaN | NaN | 0.829835 | 0.808974 | 0.610345 | 0.473262 | 0.533133 |


## **4.1 Interpreting the results:**

In these results, the **A_raw** block with CatBoostRegressor shows the lowest test RMSE (~1043), while the **D_fulleng** block with LogisticRegression reaches the highest ROC-AUC in classification (~0.814). These results define the base models for the tuning phase that follows.
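
To put these rankings in perspective, the sketch below measures how far apart the blocks actually are for the two winning algorithms. The numbers are copied from the result tables above; the variable names are only illustrative.

```python
# Test RMSE of CatBoostReg per block (from the regression table)
catboost_rmse = {
    "A_raw": 1043.650234,
    "C_contractx": 1048.099156,
    "D_fulleng": 1048.112779,
    "B_addon_sum": 1053.308305,
}
# Test ROC-AUC of LogisticRegression per block (from the classification table)
logreg_auc = {
    "D_fulleng": 0.814359,
    "C_contractx": 0.813637,
    "B_addon_sum": 0.812351,
    "A_raw": 0.812293,
}

best_block = min(catboost_rmse, key=catboost_rmse.get)
best_rmse = catboost_rmse[best_block]
# Relative RMSE gap of every block with respect to the best one, in percent
gaps = {b: 100 * (v - best_rmse) / best_rmse for b, v in catboost_rmse.items()}
for block, gap in gaps.items():
    print(f"{block:12s} +{gap:.2f}% RMSE vs {best_block}")

auc_spread = max(logreg_auc.values()) - min(logreg_auc.values())
print(f"AUC spread across blocks: {auc_spread:.6f}")
```

The gaps come out below 1% in RMSE and around 0.002 in AUC, so the choice of block matters far less than the choice of algorithm here. Whether gaps this small are meaningful cannot be judged from a single train/test split.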

# **5. Light fine-tuning**

In this block we lightly tune the two selected models: CatBoostRegressor on `df_A` and LogisticRegression on `df_D`. We use a small grid and 3-fold CV to keep run times reasonable.
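
The cost of a grid search is easy to estimate before running it: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. The sketch below does this count for the two grids used in this section; `count_fits` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

param_grid_cb = {
    "model__iterations":    [100, 300],
    "model__learning_rate": [0.03, 0.05],
    "model__depth":         [4, 6],
    "model__l2_leaf_reg":   [1, 3],
}
param_grid_lr = {
    "model__C":       [0.1, 1],
    "model__penalty": ["l2"],
    "model__solver":  ["liblinear"],
}

def count_fits(grid, n_folds):
    """Return (number of candidates, number of cross-validation fits)."""
    n_candidates = len(list(product(*grid.values())))
    return n_candidates, n_candidates * n_folds

print(count_fits(param_grid_cb, n_folds=3))
print(count_fits(param_grid_lr, n_folds=3))
```

These counts match the "Fitting 3 folds for each of ..." lines that `GridSearchCV` prints with `verbose=1`. With the default `refit=True`, one more fit on the full training set follows for the best candidate.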

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, GridSearchCV
from sklearn.metrics       import mean_squared_error, mean_absolute_error, roc_auc_score, classification_report
from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute        import SimpleImputer
from sklearn.compose       import ColumnTransformer
from catboost             import CatBoostRegressor
from sklearn.linear_model import LogisticRegression

# Load the final DataFrames
df_A = pd.read_csv("/content/dataset_A_raw.csv")
df_D = pd.read_csv("/content/dataset_D_fulleng.csv")

# Column lists (identical to the earlier ones)
all_numeric_cols     = ["Charges", "Contract_months", "TotalAddOns", "Charges_per_AddOn", "Contract_x_Charges"]
all_binary_cols      = ["SeniorCity", "Partner", "Dependents", "Service1", "Service2", "Security",
                        "OnlineBackup", "DeviceProtection", "TechSupport", "PaperlessBilling",
                        "InternetService", "AutoPayment_flag"]
all_categorical_cols = ["PaymentMethod_simple"]

def make_preprocessor(df):
    num_cols = [c for c in all_numeric_cols     if c in df.columns]
    cat_cols = [c for c in all_categorical_cols if c in df.columns]
    bin_cols = [c for c in all_binary_cols      if c in df.columns]

    num_t = Pipeline([("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])
    cat_t = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="Missing")),
                      ("onehot", OneHotEncoder(drop="first"))])

    return ColumnTransformer([
        ("num", num_t,     num_cols),
        ("cat", cat_t,     cat_cols),
        ("bin", "passthrough", bin_cols)
    ])

# ----- Tuning for regression on df_A -----
X_A = df_A[[c for c in df_A.columns if c not in ["Demand", "Class_flag"]]]
y_A = df_A["Demand"]
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_A, y_A, test_size=0.20, random_state=42)

preproc_A = make_preprocessor(df_A)
pipe_cb = Pipeline([
    ("preproc", preproc_A),
    ("model", CatBoostRegressor(iterations=100, learning_rate=0.05, depth=6, l2_leaf_reg=3, random_state=42, verbose=False))
])

param_grid_cb = {
    "model__iterations":    [100, 300],
    "model__learning_rate": [0.03, 0.05],
    "model__depth":         [4, 6],
    "model__l2_leaf_reg":   [1, 3]
}

grid_cb = GridSearchCV(pipe_cb, param_grid=param_grid_cb,
                       cv=KFold(n_splits=3, shuffle=True, random_state=42),
                       scoring="neg_root_mean_squared_error", n_jobs=-1, verbose=1)
grid_cb.fit(Xr_tr, yr_tr)

print("Mejores parámetros CatBoostReg:", grid_cb.best_params_)
best_cb = grid_cb.best_estimator_
yr_pred = best_cb.predict(Xr_te)
rmse_final = np.sqrt(mean_squared_error(yr_te, yr_pred))
mae_final  = mean_absolute_error(yr_te, yr_pred)
print(f"RMSE_test final: {rmse_final:.2f}, MAE_test final: {mae_final:.2f}")

# ----- Tuning for classification on df_D -----
X_D = df_D[[c for c in df_D.columns if c not in ["Demand", "Class_flag"]]]
y_D = df_D["Class_flag"]
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(X_D, y_D, test_size=0.20, stratify=y_D, random_state=42)

preproc_D = make_preprocessor(df_D)
pipe_lr = Pipeline([
    ("preproc", preproc_D),
    ("model", LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42))
])

param_grid_lr = {
    "model__C":       [0.1, 1],
    "model__penalty": ["l2"],
    "model__solver":  ["liblinear"]
}

grid_lr = GridSearchCV(pipe_lr, param_grid=param_grid_lr,
                       cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
                       scoring="roc_auc", n_jobs=-1, verbose=1)
grid_lr.fit(Xc_tr, yc_tr)

print("Mejores parámetros LogisticRegression:", grid_lr.best_params_)
best_lr = grid_lr.best_estimator_
yc_proba = best_lr.predict_proba(Xc_te)[:, 1]
yc_pred  = best_lr.predict(Xc_te)
auc_final = roc_auc_score(yc_te, yc_proba)
print(f"ROC-AUC_test final: {auc_final:.4f}")
print("Reporte (clase Betha):")
print(classification_report(yc_te, yc_pred, digits=4))


Fitting 3 folds for each of 16 candidates, totalling 48 fits
Mejores parámetros CatBoostReg: {'model__depth': 4, 'model__iterations': 300, 'model__l2_leaf_reg': 1, 'model__learning_rate': 0.03}
RMSE_test final: 1038.63, MAE_test final: 725.55
Fitting 3 folds for each of 2 candidates, totalling 6 fits
Mejores parámetros LogisticRegression: {'model__C': 1, 'model__penalty': 'l2', 'model__solver': 'liblinear'}
ROC-AUC_test final: 0.8143
Reporte (clase Betha):
              precision    recall  f1-score   support

           0     0.8924    0.6747    0.7685      1033
           1     0.4633    0.7754    0.5800       374

    accuracy                         0.7015      1407
   macro avg     0.6779    0.7251    0.6742      1407
weighted avg     0.7784    0.7015    0.7184      1407
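
The report above can be read more directly as a confusion matrix. The counts below are not printed by the notebook; they are inferred from the rounded recall and support values (290/374 and 697/1033), so treat them as an assumption that happens to reproduce every figure in the report.

```python
# Inferred counts (assumption consistent with the printed report)
tp, fn = 290, 84     # actual class 1 (Betha), support 374
tn, fp = 697, 336    # actual class 0, support 1033

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 1: precision={precision_1:.4f} recall={recall_1:.4f} f1={f1_1:.4f}")
print(f"class 0: precision={precision_0:.4f} recall={recall_0:.4f}")
print(f"accuracy={accuracy:.4f}")
```

Under these counts the model flags 626 cases as Betha and 290 of them are correct, while it misses 84 of the 374 true Betha cases. That is the trade-off behind a recall of about 0.78 at a precision of about 0.46.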



# **6. Extracting importances and coefficients**

We extract the feature importances of the final CatBoostRegressor and the sorted coefficients of the final logistic regression, so that we can interpret the weight of each variable in both models.

In [None]:
import pandas as pd

# Feature importances of CatBoostReg (best_cb)
feat_names_cb = best_cb.named_steps["preproc"].get_feature_names_out()
importances_cb = best_cb.named_steps["model"].get_feature_importance()
feat_imp_cb = pd.Series(importances_cb, index=feat_names_cb).sort_values(ascending=False)

print("Top 10 features (CatBoostReg):")
print(feat_imp_cb.head(10))
print("\nImportancia completa:")
print(feat_imp_cb)

Top 10 features (CatBoostReg):
num__Charges                              40.404008
num__Contract_months                      35.784338
bin__OnlineBackup                          4.802999
bin__Service2                              4.638583
bin__Partner                               3.373142
bin__Security                              2.603208
bin__InternetService                       2.238062
bin__DeviceProtection                      1.932197
bin__AutoPayment_flag                      1.719117
cat__PaymentMethod_simple_Mailed check     0.663273
dtype: float64

Importancia completa:
num__Charges                                  40.404008
num__Contract_months                          35.784338
bin__OnlineBackup                              4.802999
bin__Service2                                  4.638583
bin__Partner                                   3.373142
bin__Security                                  2.603208
bin__InternetService                           2.238062
bin__DeviceProtecti
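
A cumulative view makes the concentration of importance easier to see. The sketch below uses the ten values printed above and assumes they are on a scale that sums to 100 across all features, which is what we would expect from CatBoost's default importance type; the full list is cut off in the output, so this cannot be checked here.

```python
# Top-10 importances copied from the printed output
top10 = {
    "num__Charges": 40.404008,
    "num__Contract_months": 35.784338,
    "bin__OnlineBackup": 4.802999,
    "bin__Service2": 4.638583,
    "bin__Partner": 3.373142,
    "bin__Security": 2.603208,
    "bin__InternetService": 2.238062,
    "bin__DeviceProtection": 1.932197,
    "bin__AutoPayment_flag": 1.719117,
    "cat__PaymentMethod_simple_Mailed check": 0.663273,
}

running = 0.0
cumulative = {}
for name, value in top10.items():
    running += value
    cumulative[name] = running
    print(f"{name:40s} {value:7.3f}  cumulative {running:7.3f}")

first_two = cumulative["num__Contract_months"]
total = running
```

Under that assumption, `Charges` and `Contract_months` together account for roughly three quarters of the total, and the ten features listed for about 98%.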

In [None]:
# Coefficients of LogisticRegression (best_lr)
feat_names_lr = best_lr.named_steps["preproc"].get_feature_names_out()
coeffs_lr = best_lr.named_steps["model"].coef_.flatten()
coef_ser = pd.Series(coeffs_lr, index=feat_names_lr).sort_values(key=abs, ascending=False)

print("\nTop 10 coeficientes (LogisticRegression):")
print(coef_ser.head(10))
print("\nCoeficientes completos:")
print(coef_ser)


Top 10 coeficientes (LogisticRegression):
num__Contract_months                         -1.606003
bin__Service1                                -0.766596
num__Contract_x_Charges                       0.568977
num__Charges                                  0.493067
num__TotalAddOns                             -0.278788
bin__Security                                -0.275157
cat__PaymentMethod_simple_Electronic check    0.267448
bin__AutoPayment_flag                        -0.260460
bin__PaperlessBilling                         0.240795
bin__Partner                                 -0.223493
dtype: float64

Coeficientes completos:
num__Contract_months                         -1.606003
bin__Service1                                -0.766596
num__Contract_x_Charges                       0.568977
num__Charges                                  0.493067
num__TotalAddOns                             -0.278788
bin__Security                                -0.275157
cat__PaymentMethod_simple_Electronic 
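
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to discuss. The sketch below converts the four largest coefficients printed above. Keep in mind that the numeric features were standardized, so their odds ratios refer to a change of one standard deviation, while for binary features they refer to switching the flag from 0 to 1.

```python
import math

# Four largest coefficients (by absolute value), copied from the output above
coefs = {
    "num__Contract_months": -1.606003,
    "bin__Service1": -0.766596,
    "num__Contract_x_Charges": 0.568977,
    "num__Charges": 0.493067,
}

odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}
for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:26s} odds ratio {ratio:.3f} ({direction} the odds of class 1)")
```

Because `Contract_x_Charges` is an interaction built from the other two numeric features, the coefficients of `Contract_months` and `Charges` should be read together with it rather than one at a time.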

# **7. Saving the final models**

We store both complete pipelines for production, which keeps the preprocessing together with the model, and, if needed, a JSON file containing the optimal threshold for classifying Betha.

In [None]:
import json
from pathlib import Path
from joblib import dump

# Create the folder for the models
ARTIFACT_DIR = Path("models")
ARTIFACT_DIR.mkdir(exist_ok=True)

# Complete regression pipeline: preprocessor + CatBoostReg
pipe_reg = Pipeline([
    ("preproc", best_cb.named_steps["preproc"]),
    ("model", best_cb.named_steps["model"])
])
dump(pipe_reg, ARTIFACT_DIR / "catboost_reg_pipeline.joblib")

# Complete classification pipeline: preprocessor + LogisticRegression
pipe_clf = Pipeline([
    ("preproc", best_lr.named_steps["preproc"]),
    ("model", best_lr.named_steps["model"])
])
dump(pipe_clf, ARTIFACT_DIR / "logistic_clf_pipeline.joblib")

# Optimal threshold (if one was applied)
threshold_dict = {"threshold_betha": 0.35}
with open(ARTIFACT_DIR / "threshold.json", "w") as f:
    json.dump(threshold_dict, f)
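
A stored threshold only has an effect if it is applied to the predicted probabilities at inference time. Here is a minimal sketch of that step. The probabilities are made up, `classify` is a hypothetical helper, and the value 0.35 is the one hard-coded in the cell above (this notebook does not show how it was chosen).

```python
import json

# Same content as threshold.json, kept in memory so the sketch is self-contained
threshold_json = json.dumps({"threshold_betha": 0.35})
threshold = json.loads(threshold_json)["threshold_betha"]

def classify(probas, cutoff):
    """Label 1 (Betha) when the predicted probability reaches the cutoff."""
    return [int(p >= cutoff) for p in probas]

probas = [0.10, 0.34, 0.35, 0.48, 0.72]   # hypothetical predict_proba(...)[:, 1] values

at_half = classify(probas, 0.5)           # roughly what a 0.5 cut-off would give
at_saved = classify(probas, threshold)    # using the stored threshold
print(at_half)
print(at_saved)
```

Lowering the cut-off from 0.5 to 0.35 flags more cases as Betha, trading precision for recall. In production the same comparison would run on the output of the saved classification pipeline.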


# **8. Saving model metrics and information as JSON**

In this section we create a file named `txt_info_modelos.json` that contains, for each final model (regression and classification), the main metrics, the name of the algorithm used, and the sizes of the training and test sets.

In [None]:
import json
from pathlib import Path

# --------------------------------------------------------------
#  Collect information about the data sets and the final metrics
# --------------------------------------------------------------
# For regression
n_train_reg = Xr_tr.shape[0]
n_test_reg  = Xr_te.shape[0]

info_reg = {
    "modelo":           "CatBoostRegressor",
    "n_train":          int(n_train_reg),
    "n_test":           int(n_test_reg),
    "rmse_test":        float(rmse_final),
    "mae_test":         float(mae_final)
}

# For classification
n_train_clf = Xc_tr.shape[0]
n_test_clf  = Xc_te.shape[0]

# Compute the report once and reuse the entry for class "1" (Betha)
report_1 = classification_report(yc_te, yc_pred, output_dict=True)["1"]

info_clf = {
    "modelo":           "LogisticRegression",
    "n_train":          int(n_train_clf),
    "n_test":           int(n_test_clf),
    "roc_auc_test":     float(auc_final),
    "precision_1":      float(report_1["precision"]),
    "recall_1":         float(report_1["recall"]),
    "f1_1":             float(report_1["f1-score"])
}

# --------------------------------------------------------------
# Build the overall dictionary and save it as JSON
# --------------------------------------------------------------
info_modelos = {
    "regresion":      info_reg,
    "clasificacion":  info_clf
}

ARTIFACT_DIR = Path("models")
file_path = ARTIFACT_DIR / "txt_info_modelos.json"

with open(file_path, "w") as f:
    json.dump(info_modelos, f, indent=4)

print(f"Guardado JSON de métricas en: {file_path}")


Guardado JSON de métricas en: models/txt_info_modelos.json


# **9. Final conclusions**

In this section we summarize the key findings: the final CatBoostRegressor for demand, with RMSE ≈ 1038, and the final LogisticRegression for classification, with ROC-AUC ≈ 0.814. We also highlight the most relevant variables, as a starting point for a later analysis aimed at the purchasing analyst on how to use these predictions in daily material planning.