# 03.5_Modelo_3outputs

---

## Objetivo
Construir un modelo de clasificación multiclase (tres categorías) partiendo de las etiquetas originales. Se remapean las etiquetas, se cargan los datos sintéticos, se define un preprocesador y se entrenan modelos (`Logistic Regression`, `Random Forest` y `GradientBoosting`) con validación mediante `SMOTE` y `GridSearch`. Finalmente, se optimiza un `Random Forest` minimizando un coste definido por una matriz de penalizaciones.

## Entradas (Inputs)
- `data/splits/experiments/X_train_17.parquet`
- `data/splits/experiments/X_val_17.parquet`
- `data/splits/experiments/X_test_17.parquet`
- `data/splits/final/y_train.parquet`
- `data/splits/final/y_val.parquet`
- `data/splits/final/y_test.parquet`

## Salidas (Outputs)
### Splits Experimentales:
- `data/splits/final/y_train_3_classes.parquet`
- `data/splits/final/y_val_3_classes.parquet`
- `data/splits/final/y_test_3_classes.parquet`

---


## Resumen Ejecutivo
- Objetivo: clasificar en tres niveles de riesgo financiero (bajo, medio, alto) usando pipelines con SMOTE y `StandardScaler`.  
- Técnicas: `GridSearchCV` estratificado 5-fold sobre tres clasificadores (LogisticRegression, RandomForest, GradientBoosting) + optimización de umbrales multiclass para RF.  
- Hiperparámetros óptimos (CV F1_macro):  
  - LR: `C=0.01` (F1≈0.4529)  
  - RF: `max_depth=5, n_estimators=100` (F1≈0.4900)  
  - GB: `learning_rate=0.01, max_depth=3, n_estimators=200` (F1≈0.4851)  
- En validación sin umbral:  
  - **LR**: Acc=0.4788, F1_macro=0.4501, AUC_ovr≈0.6823  
  - **RF**: Acc=0.5255, F1_macro=0.4646, AUC_ovr≈0.6748  
- La optimización de umbrales multiclass para RF eleva el F1_macro en validación a ≈0.5120.  
- En test con umbrales RF: Acc=0.5401, F1_macro=0.4623, AUC_ovr≈0.6625.  
- Se examinan matrices de confusión y conteo global de errores: 220 predicciones correctas, 115 sobreestimaciones y 89 subestimaciones, con sesgos distintos por clase.
---

## 1. Configuración de entorno local, importar librerías y cargar configuración

Monta Google Drive, añade la raíz del proyecto al `sys.path` y carga las rutas de configuración, tras importar las librerías de Colab, estándar, procesamiento de datos, scikit-learn, imbalanced-learn y configuración local.


In [1]:
import sys
import os
from pathlib import Path

# 1. Añadir la raíz del proyecto al path
current_dir = Path.cwd()
project_root = current_dir.parent if current_dir.name == 'notebooks' else current_dir
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))


# Data processing
import numpy as np
import pandas as pd
import joblib

# Scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
    roc_auc_score
)
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline


# Importar las rutas necesarias desde el archivo de configuración
from config import FINAL_SPLITS_DIR, EXP_SPLITS_DIR, EXP_ARTIFACTS_DIR

print("Drive montado, librerías importadas y configuración de rutas cargada.")

Módulo de configuración cargado y estructura de carpetas asegurada.
Drive montado, librerías importadas y configuración de rutas cargada.


## 2. Cargar conjuntos de datos desde Parquet

Carga los DataFrames `X_train_sint`, `X_val_sint`, `X_test_sint` y las Series `y_train`, `y_val`, `y_test` desde archivos Parquet, e imprime las dimensiones de cada split.


In [2]:
# CARGAR LOS CONJUNTOS DE DATOS
try:
    # Las 'X' con 17 features vienen del experimento 03.4
    X_train_sint = pd.read_parquet(EXP_SPLITS_DIR / 'X_train_17.parquet')
    X_val_sint   = pd.read_parquet(EXP_SPLITS_DIR / 'X_val_17.parquet')
    X_test_sint  = pd.read_parquet(EXP_SPLITS_DIR / 'X_test_17.parquet')

    # Las 'y' originales vienen del split final (03.1)
    y_train = pd.read_parquet(FINAL_SPLITS_DIR / 'y_train.parquet').squeeze()
    y_val   = pd.read_parquet(FINAL_SPLITS_DIR / 'y_val.parquet').squeeze()
    y_test  = pd.read_parquet(FINAL_SPLITS_DIR / 'y_test.parquet').squeeze()

    print("Datos .parquet cargados correctamente.")
    print("\nShapes tras cargar splits:")
    print(f"   • X_train_sint: {X_train_sint.shape}, y_train: {y_train.shape}")
    print(f"   • X_val_sint:   {X_val_sint.shape},   y_val:   {y_val.shape}")
    print(f"   • X_test_sint:  {X_test_sint.shape},  y_test:  {y_test.shape}")

except Exception as e:
    print(f"\nOcurrió un error inesperado al cargar los datos: {e}")

Datos .parquet cargados correctamente.

Shapes tras cargar splits:
   • X_train_sint: (1976, 14), y_train: (1976,)
   • X_val_sint:   (424, 14),   y_val:   (424,)
   • X_test_sint:  (424, 14),  y_test:  (424,)


## 3. Remapear etiquetas a tres clases y guardar resultados

Define `remap_to_3` para convertir las etiquetas originales a tres clases (Bajo-Medio, Alto, Muy Alto), aplica el mapeo sobre los splits de `y`, muestra su distribución y guarda los nuevos archivos Parquet.


In [3]:
# REMAPEAR ETIQUETAS A 3 CLASES Y GUARDAR

def remap_to_3(x):
    if x in [1.0, 2.0]: return 1.0 # Bajo-Medio
    elif x == 3.0:      return 2.0 # Alto
    else:               return 3.0 # Muy Alto

y_train_3 = y_train.map(remap_to_3)
y_val_3   = y_val.map(remap_to_3)
y_test_3  = y_test.map(remap_to_3)

print("\nDistribución remapeada y_train (3 clases):")
print(y_train_3.value_counts(normalize=True).rename('proporción'))

# Guardar los nuevos splits de 'y' en la carpeta de experimentos ---
y_train_3.to_frame(name='target').to_parquet(FINAL_SPLITS_DIR / 'y_train_3_classes.parquet')
y_val_3.to_frame(name='target').to_parquet(FINAL_SPLITS_DIR / 'y_val_3_classes.parquet')
y_test_3.to_frame(name='target').to_parquet(FINAL_SPLITS_DIR / 'y_test_3_classes.parquet')

print("\n-----------------------------------------------")
print(" Mapeo a 3 clases guardado correctamente.")
print(f"Archivos guardados en la carpeta: {FINAL_SPLITS_DIR}")
print("-----------------------------------------------")


Distribución remapeada y_train (3 clases):
B10
2.0    0.550101
1.0    0.354757
3.0    0.095142
Name: proporción, dtype: float64

-----------------------------------------------
 Mapeo a 3 clases guardado correctamente.
Archivos guardados en la carpeta: C:\Users\Antonio\TFM-Digitech\data\splits\final
-----------------------------------------------


## 4. Definir el preprocesador de datos

Crea un `ColumnTransformer` que imputa valores faltantes con la mediana y aplica un escalado estándar a todas las características numéricas de `X_train_sint`.


In [4]:
# DEFINIR EL PREPROCESADOR

# Lista con todas las columnas de X_train_sint (las 17 features)
numeric_features = X_train_sint.columns.tolist()

# Pipeline para imputar la mediana y luego escalar
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler',  StandardScaler())
])

# ColumnTransformer que aplica el pipeline a todas las columnas
preprocessor_sint = ColumnTransformer([
    ('num', numeric_transformer, numeric_features)
])

print(" Preprocesador 'preprocessor_sint' definido correctamente en el notebook.")
print(f"   Aplicará imputación y escalado a las {len(numeric_features)} columnas de entrada.")

 Preprocesador 'preprocessor_sint' definido correctamente en el notebook.
   Aplicará imputación y escalado a las 14 columnas de entrada.


## 5. Crear pipelines con SMOTE en cada fold

Define tres pipelines (`ImbPipeline`) que incorporan SMOTE, el preprocesador y los clasificadores `LogisticRegression`, `RandomForestClassifier` y `GradientBoostingClassifier` para clasificación multiclase.


In [5]:
# Definir pipelines con SMOTE en cada fold

pipe3_lr_base = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('pre',   preprocessor_sint),
    ('clf',   LogisticRegression(class_weight='balanced', random_state=42, max_iter=2000))
])

pipe3_rf_base = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('pre',   preprocessor_sint),
    ('clf',   RandomForestClassifier(class_weight='balanced', random_state=42))
])

pipe3_gb_base = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('pre',   preprocessor_sint),
    ('clf',   GradientBoostingClassifier(random_state=42))
])

## 6. Definir grillas de hiperparámetros y validación

Establece las grillas de parámetros para cada clasificador y configura un `StratifiedKFold` de 5 particiones para validación cruzada.


In [6]:
# Grillas de hiperparámetros
param_grid_lr3 = {'clf__C': [0.01, 0.1, 1, 10]}
param_grid_rf3 = {'clf__n_estimators': [100, 200], 'clf__max_depth': [None, 5, 10]}
param_grid_gb3 = {'clf__n_estimators': [100, 200], 'clf__learning_rate': [0.01, 0.1], 'clf__max_depth': [3, 5]}

cv3 = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


## 7. Ejecutar GridSearchCV con SMOTE para las tres clases

Lanza `GridSearchCV` sobre cada pipeline usando `f1_macro` como métrica, ajusta los modelos en `X_train_sint`/`y_train_3` e imprime los mejores parámetros y puntuaciones.


In [7]:
# GridSearchCV con SMOTE en cada fold para las 3 clases
gs3_lr = GridSearchCV(pipe3_lr_base, param_grid=param_grid_lr3, cv=cv3, scoring='f1_macro', n_jobs=-1, verbose=2)
gs3_rf = GridSearchCV(pipe3_rf_base, param_grid=param_grid_rf3, cv=cv3, scoring='f1_macro', n_jobs=-1, verbose=2)
gs3_gb = GridSearchCV(pipe3_gb_base, param_grid=param_grid_gb3, cv=cv3, scoring='f1_macro', n_jobs=-1, verbose=2)

print("Entrenando LR (3 clases) con SMOTE en cada fold...")
gs3_lr.fit(X_train_sint, y_train_3)

print("\nEntrenando RF (3 clases) con SMOTE en cada fold...")
gs3_rf.fit(X_train_sint, y_train_3)

print("\nEntrenando GB (3 clases) con SMOTE en cada fold...")
gs3_gb.fit(X_train_sint, y_train_3)

print(f"\n→ Mejores LR: {gs3_lr.best_params_}, F1_macro: {gs3_lr.best_score_}")
print(f"→ Mejores RF: {gs3_rf.best_params_}, F1_macro: {gs3_rf.best_score_}")
print(f"→ Mejores GB: {gs3_gb.best_params_}, F1_macro: {gs3_gb.best_score_}")


Entrenando LR (3 clases) con SMOTE en cada fold...
Fitting 5 folds for each of 4 candidates, totalling 20 fits

Entrenando RF (3 clases) con SMOTE en cada fold...
Fitting 5 folds for each of 6 candidates, totalling 30 fits

Entrenando GB (3 clases) con SMOTE en cada fold...
Fitting 5 folds for each of 8 candidates, totalling 40 fits

→ Mejores LR: {'clf__C': 1}, F1_macro: 0.4620606504441966
→ Mejores RF: {'clf__max_depth': 5, 'clf__n_estimators': 100}, F1_macro: 0.489696133529096
→ Mejores GB: {'clf__learning_rate': 0.01, 'clf__max_depth': 3, 'clf__n_estimators': 200}, F1_macro: 0.48339151468044816


## 8. Configurar pipelines finales con los mejores parámetros

Construye pipelines finales sin SMOTE que incluyen solo el preprocesador y el clasificador configurado con los mejores hiperparámetros obtenidos de la búsqueda.


In [8]:
# Pipelines finales con las mejores configuraciones
best3_lr = gs3_lr.best_params_
best3_rf = gs3_rf.best_params_
best3_gb = gs3_gb.best_params_

pipe3_final_lr = Pipeline([
    ('pre', preprocessor_sint),
    ('clf', LogisticRegression(
        # Eliminado: multi_class='ovr'
        class_weight='balanced', 
        random_state=42, 
        max_iter=2000, 
        C=best3_lr['clf__C']
    ))
])

pipe3_final_rf = Pipeline([
    ('pre', preprocessor_sint),
    ('clf', RandomForestClassifier(
        class_weight='balanced', 
        random_state=42, 
        n_estimators=best3_rf['clf__n_estimators'], 
        max_depth=best3_rf['clf__max_depth']
    ))
])

pipe3_final_gb = Pipeline([
    ('pre', preprocessor_sint),
    ('clf', GradientBoostingClassifier(
        random_state=42, 
        n_estimators=best3_gb['clf__n_estimators'], 
        learning_rate=best3_gb['clf__learning_rate'], 
        max_depth=best3_gb['clf__max_depth']
    ))
])

print("✔️ Pipelines finales (3 clases) definidas")

✔️ Pipelines finales (3 clases) definidas


## 9. Aplicar SMOTE y reentrenar pipelines finales

Aplica SMOTE al conjunto de entrenamiento para balancear las clases, muestra las nuevas dimensiones y entrena los pipelines finales (`LR`, `RF`, `GB`) con los datos re-muestreados.


In [9]:
# Aplicar SMOTE sobre X_train_sint y reentrenar finales
sm3 = SMOTE(random_state=42)
X_train_sm3, y_train_sm3 = sm3.fit_resample(X_train_sint, y_train_3)
print(f"Shapes after SMOTE (3 clases): {X_train_sm3.shape}, {y_train_sm3.shape}")

print("Entrenando LR3 final...")
pipe3_final_lr.fit(X_train_sm3, y_train_sm3)

print("Entrenando RF3 final...")
pipe3_final_rf.fit(X_train_sm3, y_train_sm3)

print("Entrenando GB3 final...")
pipe3_final_gb.fit(X_train_sm3, y_train_sm3)


Shapes after SMOTE (3 clases): (3261, 14), (3261,)
Entrenando LR3 final...
Entrenando RF3 final...
Entrenando GB3 final...


0,1,2
,"steps  steps: list of tuples List of (name of step, estimator) tuples that are to be chained in sequential order. To be compatible with the scikit-learn API, all steps must define `fit`. All non-last steps must also define `transform`. See :ref:`Combining Estimators ` for more details.","[('pre', ...), ('clf', ...)]"
,"transform_input  transform_input: list of str, default=None The names of the :term:`metadata` parameters that should be transformed by the pipeline before passing it to the step consuming it. This enables transforming some input arguments to ``fit`` (other than ``X``) to be transformed by the steps of the pipeline up to the step which requires them. Requirement is defined via :ref:`metadata routing `. For instance, this can be used to pass a validation set through the pipeline. You can only set this if metadata routing is enabled, which you can enable using ``sklearn.set_config(enable_metadata_routing=True)``. .. versionadded:: 1.6",
,"memory  memory: str or object with the joblib.Memory interface, default=None Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute ``named_steps`` or ``steps`` to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming. See :ref:`sphx_glr_auto_examples_neighbors_plot_caching_nearest_neighbors.py` for an example on how to enable caching.",
,"verbose  verbose: bool, default=False If True, the time elapsed while fitting each step will be printed as it is completed.",False

0,1,2
,"transformers  transformers: list of tuples List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data. name : str  Like in Pipeline and FeatureUnion, this allows the transformer and  its parameters to be set using ``set_params`` and searched in grid  search. transformer : {'drop', 'passthrough'} or estimator  Estimator must support :term:`fit` and :term:`transform`.  Special-cased strings 'drop' and 'passthrough' are accepted as  well, to indicate to drop the columns or to pass them through  untransformed, respectively. columns : str, array-like of str, int, array-like of int, array-like of bool, slice or callable  Indexes the data on its second axis. Integers are interpreted as  positional columns, while strings can reference DataFrame columns  by name. A scalar string or int should be used where  ``transformer`` expects X to be a 1d array-like (vector),  otherwise a 2d array will be passed to the transformer.  A callable is passed the input data `X` and can return any of the  above. To select multiple columns by name or dtype, you can use  :obj:`make_column_selector`.","[('num', ...)]"
,"remainder  remainder: {'drop', 'passthrough'} or estimator, default='drop' By default, only the specified columns in `transformers` are transformed and combined in the output, and the non-specified columns are dropped. (default of ``'drop'``). By specifying ``remainder='passthrough'``, all remaining columns that were not specified in `transformers`, but present in the data passed to `fit` will be automatically passed through. This subset of columns is concatenated with the output of the transformers. For dataframes, extra columns not seen during `fit` will be excluded from the output of `transform`. By setting ``remainder`` to be an estimator, the remaining non-specified columns will use the ``remainder`` estimator. The estimator must support :term:`fit` and :term:`transform`. Note that using this feature requires that the DataFrame columns input at :term:`fit` and :term:`transform` have identical order.",'drop'
,"sparse_threshold  sparse_threshold: float, default=0.3 If the output of the different transformers contains sparse matrices, these will be stacked as a sparse matrix if the overall density is lower than this value. Use ``sparse_threshold=0`` to always return dense. When the transformed output consists of all dense data, the stacked result will be dense, and this keyword will be ignored.",0.3
,"n_jobs  n_jobs: int, default=None Number of jobs to run in parallel. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details.",
,"transformer_weights  transformer_weights: dict, default=None Multiplicative weights for features per transformer. The output of the transformer is multiplied by these weights. Keys are transformer names, values the weights.",
,"verbose  verbose: bool, default=False If True, the time elapsed while fitting each transformer will be printed as it is completed.",False
,"verbose_feature_names_out  verbose_feature_names_out: bool, str or Callable[[str, str], str], default=True - If True, :meth:`ColumnTransformer.get_feature_names_out` will prefix  all feature names with the name of the transformer that generated that  feature. It is equivalent to setting  `verbose_feature_names_out=""{transformer_name}__{feature_name}""`. - If False, :meth:`ColumnTransformer.get_feature_names_out` will not  prefix any feature names and will error if feature names are not  unique. - If ``Callable[[str, str], str]``,  :meth:`ColumnTransformer.get_feature_names_out` will rename all the features  using the name of the transformer. The first argument of the callable is the  transformer name and the second argument is the feature name. The returned  string will be the new feature name. - If ``str``, it must be a string ready for formatting. The given string will  be formatted using two field names: ``transformer_name`` and ``feature_name``.  e.g. ``""{feature_name}__{transformer_name}""``. See :meth:`str.format` method  from the standard library for more info. .. versionadded:: 1.0 .. versionchanged:: 1.6  `verbose_feature_names_out` can be a callable or a string to be formatted.",True
,"force_int_remainder_cols  force_int_remainder_cols: bool, default=False This parameter has no effect. .. note::  If you do not access the list of columns for the remainder columns  in the `transformers_` fitted attribute, you do not need to set  this parameter. .. versionadded:: 1.5 .. versionchanged:: 1.7  The default value for `force_int_remainder_cols` will change from  `True` to `False` in version 1.7. .. deprecated:: 1.7  `force_int_remainder_cols` is deprecated and will be removed in 1.9.",'deprecated'

0,1,2
,"missing_values  missing_values: int, float, str, np.nan, None or pandas.NA, default=np.nan The placeholder for the missing values. All occurrences of `missing_values` will be imputed. For pandas' dataframes with nullable integer dtypes with missing values, `missing_values` can be set to either `np.nan` or `pd.NA`.",
,"strategy  strategy: str or Callable, default='mean' The imputation strategy. - If ""mean"", then replace missing values using the mean along  each column. Can only be used with numeric data. - If ""median"", then replace missing values using the median along  each column. Can only be used with numeric data. - If ""most_frequent"", then replace missing using the most frequent  value along each column. Can be used with strings or numeric data.  If there is more than one such value, only the smallest is returned. - If ""constant"", then replace missing values with fill_value. Can be  used with strings or numeric data. - If an instance of Callable, then replace missing values using the  scalar statistic returned by running the callable over a dense 1d  array containing non-missing values of each column. .. versionadded:: 0.20  strategy=""constant"" for fixed value imputation. .. versionadded:: 1.5  strategy=callable for custom value imputation.",'median'
,"fill_value  fill_value: str or numerical value, default=None When strategy == ""constant"", `fill_value` is used to replace all occurrences of missing_values. For string or object data types, `fill_value` must be a string. If `None`, `fill_value` will be 0 when imputing numerical data and ""missing_value"" for strings or object data types.",
,"copy  copy: bool, default=True If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if `copy=False`: - If `X` is not an array of floating values; - If `X` is encoded as a CSR matrix; - If `add_indicator=True`.",True
,"add_indicator  add_indicator: bool, default=False If True, a :class:`MissingIndicator` transform will stack onto output of the imputer's transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won't appear on the missing indicator even if there are missing values at transform/test time.",False
,"keep_empty_features  keep_empty_features: bool, default=False If True, features that consist exclusively of missing values when `fit` is called are returned in results when `transform` is called. The imputed value is always `0` except when `strategy=""constant""` in which case `fill_value` will be used instead. .. versionadded:: 1.2",False

0,1,2
,"copy  copy: bool, default=True If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.",True
,"with_mean  with_mean: bool, default=True If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.",True
,"with_std  with_std: bool, default=True If True, scale the data to unit variance (or equivalently, unit standard deviation).",True

0,1,2
,"loss  loss: {'log_loss', 'exponential'}, default='log_loss' The loss function to be optimized. 'log_loss' refers to binomial and multinomial deviance, the same as used in logistic regression. It is a good choice for classification with probabilistic outputs. For loss 'exponential', gradient boosting recovers the AdaBoost algorithm.",'log_loss'
,"learning_rate  learning_rate: float, default=0.1 Learning rate shrinks the contribution of each tree by `learning_rate`. There is a trade-off between learning_rate and n_estimators. Values must be in the range `[0.0, inf)`. For an example of the effects of this parameter and its interaction with ``subsample``, see :ref:`sphx_glr_auto_examples_ensemble_plot_gradient_boosting_regularization.py`.",0.01
,"n_estimators  n_estimators: int, default=100 The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance. Values must be in the range `[1, inf)`.",200
,"subsample  subsample: float, default=1.0 The fraction of samples to be used for fitting the individual base learners. If smaller than 1.0 this results in Stochastic Gradient Boosting. `subsample` interacts with the parameter `n_estimators`. Choosing `subsample < 1.0` leads to a reduction of variance and an increase in bias. Values must be in the range `(0.0, 1.0]`.",1.0
,"criterion  criterion: {'friedman_mse', 'squared_error'}, default='friedman_mse' The function to measure the quality of a split. Supported criteria are 'friedman_mse' for the mean squared error with improvement score by Friedman, 'squared_error' for mean squared error. The default value of 'friedman_mse' is generally the best as it can provide a better approximation in some cases. .. versionadded:: 0.18",'friedman_mse'
,"min_samples_split  min_samples_split: int or float, default=2 The minimum number of samples required to split an internal node: - If int, values must be in the range `[2, inf)`. - If float, values must be in the range `(0.0, 1.0]` and `min_samples_split`  will be `ceil(min_samples_split * n_samples)`. .. versionchanged:: 0.18  Added float values for fractions.",2
,"min_samples_leaf  min_samples_leaf: int or float, default=1 The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least ``min_samples_leaf`` training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. - If int, values must be in the range `[1, inf)`. - If float, values must be in the range `(0.0, 1.0)` and `min_samples_leaf`  will be `ceil(min_samples_leaf * n_samples)`. .. versionchanged:: 0.18  Added float values for fractions.",1
,"min_weight_fraction_leaf  min_weight_fraction_leaf: float, default=0.0 The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. Values must be in the range `[0.0, 0.5]`.",0.0
,"max_depth  max_depth: int or None, default=3 Maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. If int, values must be in the range `[1, inf)`.",3
,"min_impurity_decrease  min_impurity_decrease: float, default=0.0 A node will be split if this split induces a decrease of the impurity greater than or equal to this value. Values must be in the range `[0.0, inf)`. The weighted impurity decrease equation is the following::  N_t / N * (impurity - N_t_R / N_t * right_impurity  - N_t_L / N_t * left_impurity) where ``N`` is the total number of samples, ``N_t`` is the number of samples at the current node, ``N_t_L`` is the number of samples in the left child, and ``N_t_R`` is the number of samples in the right child. ``N``, ``N_t``, ``N_t_R`` and ``N_t_L`` all refer to the weighted sum, if ``sample_weight`` is passed. .. versionadded:: 0.19",0.0


## 10. Definir función de evaluación para tres clases

Implementa `evaluate_3cls`, que evalúa un pipeline en validación y test para multiclase, imprimiendo accuracy, F1_macro, AUC_ovr y el reporte de clasificación.


In [10]:
def evaluate_3cls(pipe, X_val, y_val, X_test, y_test, name):
    print(f"\n--- {name} sobre VALIDATION (3 clases) ---")
    yv_pred  = pipe.predict(X_val)
    yv_proba = pipe.predict_proba(X_val)
    print(f"Accuracy: {accuracy_score(y_val, yv_pred):.4f}")
    print(f"F1_macro: {f1_score(y_val, yv_pred, average='macro'):.4f}")
    print("\nClassification Report:\n", classification_report(y_val, yv_pred))
    print(f"AUC_ovr: {roc_auc_score(y_val, yv_proba, multi_class='ovr', average='macro'):.4f}")

    print(f"\n--- {name} sobre TEST (3 clases) ---")
    yt_pred  = pipe.predict(X_test)
    yt_proba = pipe.predict_proba(X_test)
    print(f"Accuracy: {accuracy_score(y_test, yt_pred):.4f}")
    print(f"F1_macro: {f1_score(y_test, yt_pred, average='macro'):.4f}")
    print("\nClassification Report:\n", classification_report(y_test, yt_pred))
    print(f"AUC_ovr: {roc_auc_score(y_test, yt_proba, multi_class='ovr', average='macro'):.4f}")
    print("-"*60)

# 8) Evaluar todos los modelos
evaluate_3cls(pipe3_final_lr, X_val_sint, y_val_3, X_test_sint, y_test_3, name='LogisticRegression3')
evaluate_3cls(pipe3_final_rf, X_val_sint, y_val_3, X_test_sint, y_test_3, name='RandomForest3')
evaluate_3cls(pipe3_final_gb, X_val_sint, y_val_3, X_test_sint, y_test_3, name='GradientBoosting3')



--- LogisticRegression3 sobre VALIDATION (3 clases) ---
Accuracy: 0.4906
F1_macro: 0.4576

Classification Report:
               precision    recall  f1-score   support

         1.0       0.56      0.54      0.55       150
         2.0       0.65      0.44      0.53       233
         3.0       0.20      0.59      0.30        41

    accuracy                           0.49       424
   macro avg       0.47      0.52      0.46       424
weighted avg       0.57      0.49      0.51       424

AUC_ovr: 0.6723

--- LogisticRegression3 sobre TEST (3 clases) ---
Accuracy: 0.4953
F1_macro: 0.4662

Classification Report:
               precision    recall  f1-score   support

         1.0       0.54      0.59      0.56       150
         2.0       0.68      0.41      0.51       233
         3.0       0.22      0.63      0.32        41

    accuracy                           0.50       424
   macro avg       0.48      0.54      0.47       424
weighted avg       0.59      0.50      0.51       4

## 11. Analizar errores direccionales y matriz de confusión

Calcula la matriz de confusión para `pipe3_final_rf`, clasifica cada predicción como subestimación, sobreestimación o correcta, y muestra el conteo global y por clase real.


In [11]:
# Análisis de errores direccionales y matriz de confusión

# Predicciones finales con el modelo que quieras analizar (por ejemplo RandomForest3)
y_test_pred = pipe3_final_rf.predict(X_test_sint)

# Matriz de confusión completa para 3 clases
labels = [1.0, 2.0, 3.0]
cm = confusion_matrix(y_test_3, y_test_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"True {lbl}" for lbl in labels],
    columns=[f"Pred {lbl}" for lbl in labels]
)
print("Matriz de confusión (3 clases):\n")
print(cm_df)

# Clasificar cada predicción en subestimación, sobreestimación o correcto
errors = pd.DataFrame({'true': y_test_3.values, 'pred': y_test_pred})
errors['tipo_error'] = errors.apply(
    lambda row: 'subestimar' if row['pred'] < row['true']
    else ('sobreestimar' if row['pred'] > row['true'] else 'correcto'),
    axis=1
)

# Conteo global de errores
counts = errors['tipo_error'].value_counts()
print("\nConteo de errores globales:")
print(counts)

# Detalle de errores por clase real
detail = errors.groupby(['true', 'tipo_error']).size().unstack(fill_value=0)
print("\nErrores por clase real:")
print(detail)


Matriz de confusión (3 clases):

          Pred 1.0  Pred 2.0  Pred 3.0
True 1.0        75        63        12
True 2.0        66       127        40
True 3.0         8        15        18

Conteo de errores globales:
tipo_error
correcto        220
sobreestimar    115
subestimar       89
Name: count, dtype: int64

Errores por clase real:
tipo_error  correcto  sobreestimar  subestimar
true                                          
1.0               75            75           0
2.0              127            40          66
3.0               18             0          23


## 12. Definir matriz de costes y scorer personalizado

Crea una matriz de costes para penalizar distintos errores en multiclase y genera un `cost_scorer` con `make_scorer` para optimizar esta métrica en las búsquedas.



In [12]:
# Definir matriz de costes y scorer personalizado

# Matriz de costes (filas=true 1,2,3 → columnas=pred 1,2,3)
cost_matrix = np.array([
    [ 0,  2, 10],   # true=1 → pred 1→0, 2→2, 3→10
    [ 2,  0,  5],   # true=2 → pred 1→2, 2→0, 3→5
    [ 8,  1,  0],   # true=3 → pred 1→8, 2→1, 3→0
])

def cost_score(y_true, y_pred):
    total = 0.0
    n = len(y_true)
    for yt, yp in zip(y_true, y_pred):
        i = int(yt) - 1
        j = int(yp) - 1
        total += cost_matrix[i, j]
    return total / n

cost_scorer = make_scorer(cost_score, greater_is_better=False)

## 13. GridSearchCV optimizando el coste para RandomForest3

Configura y ejecuta un `GridSearchCV` con SMOTE para `RandomForestClassifier`, usando `cost_scorer` como métrica, y muestra los mejores parámetros y coste medio.


In [13]:
# GridSearchCV optimizando el coste para RandomForest3

# Pipeline base con SMOTE y preprocessor_sint
pipe3_rf_cost_base = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('pre',   preprocessor_sint),
    ('clf',   RandomForestClassifier(class_weight='balanced', random_state=42))
])

# Grilla de hiperparámetros
param_grid_rf3 = {
    'clf__n_estimators': [100, 200],
    'clf__max_depth':    [None, 5, 10]
}

cv3 = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# GridSearchCV minimizando el coste
gs3_rf_cost = GridSearchCV(
    pipe3_rf_cost_base,
    param_grid=param_grid_rf3,
    cv=cv3,
    scoring=cost_scorer,
    n_jobs=-1,
    verbose=2
)

print("Entrenando RandomForest3 (cost-based) con SMOTE en cada fold...")
gs3_rf_cost.fit(X_train_sint, y_train_3)

print("\n→ Mejores parámetros (coste mínimo):", gs3_rf_cost.best_params_)
print("   Coste medio validación (best):", -gs3_rf_cost.best_score_)


Entrenando RandomForest3 (cost-based) con SMOTE en cada fold...
Fitting 5 folds for each of 6 candidates, totalling 30 fits

→ Mejores parámetros (coste mínimo): {'clf__max_depth': None, 'clf__n_estimators': 200}
   Coste medio validación (best): 0.9580539572944635


## 14. Reentrenar modelo final con parámetros de coste

Construye y entrena el pipeline `pipe3_final_rf_cost` con los parámetros óptimos basados en costes, aplicando SMOTE antes del ajuste.


In [14]:
# Reentrenar el modelo final con parámetros optimizados por coste

best3_rf = gs3_rf_cost.best_params_

pipe3_final_rf_cost = Pipeline([
    ('pre', preprocessor_sint),
    ('clf', RandomForestClassifier(
        class_weight='balanced',
        random_state=42,
        n_estimators=best3_rf['clf__n_estimators'],
        max_depth=best3_rf['clf__max_depth']
    ))
])

# Aplicar SMOTE y entrenar
sm3 = SMOTE(random_state=42)
X_train_sm3, y_train_sm3 = sm3.fit_resample(X_train_sint, y_train_3)

pipe3_final_rf_cost.fit(X_train_sm3, y_train_sm3)
print(" RandomForest3 (cost-based) entrenado")


 RandomForest3 (cost-based) entrenado


## 15. Evaluar modelo cost-based en VALIDATION y TEST

Implementa `evaluate_3cls_cost` para evaluar el pipeline cost-based en validación y test, imprimiendo métricas y el coste medio en cada conjunto.


In [15]:
# Evaluar el modelo cost-based en VALIDATION y TEST

def evaluate_3cls_cost(pipe, X_val, y_val, X_test, y_test):
    # VALIDATION
    print("\n--- RF (cost-based) sobre VALIDATION (3 clases) ---")
    yv_pred  = pipe.predict(X_val)
    yv_proba = pipe.predict_proba(X_val)
    print(f"Accuracy: {accuracy_score(y_val, yv_pred):.4f}")
    print(f"F1_macro: {f1_score(y_val, yv_pred, average='macro'):.4f}")
    print(f"AUC_ovr:   {roc_auc_score(y_val, yv_proba, multi_class='ovr', average='macro'):.4f}")
    medio_coste_val = cost_score(y_val, yv_pred)
    print(f"Coste medio validación: {medio_coste_val:.4f}")

    # TEST
    print("\n--- RF (cost-based) sobre TEST (3 clases) ---")
    yt_pred  = pipe.predict(X_test)
    yt_proba = pipe.predict_proba(X_test)
    print(f"Accuracy: {accuracy_score(y_test, yt_pred):.4f}")
    print(f"F1_macro: {f1_score(y_test, yt_pred, average='macro'):.4f}")
    print(f"AUC_ovr:   {roc_auc_score(y_test, yt_proba, multi_class='ovr', average='macro'):.4f}")
    medio_coste_test = cost_score(y_test, yt_pred)
    print(f"Coste medio test: {medio_coste_test:.4f}")

evaluate_3cls_cost(pipe3_final_rf_cost, X_val_sint, y_val_3, X_test_sint, y_test_3)


--- RF (cost-based) sobre VALIDATION (3 clases) ---
Accuracy: 0.5873
F1_macro: 0.4642
AUC_ovr:   0.6740
Coste medio validación: 0.9575

--- RF (cost-based) sobre TEST (3 clases) ---
Accuracy: 0.5920
F1_macro: 0.4723
AUC_ovr:   0.6761
Coste medio test: 1.0519


## 16. Visualizar matriz de confusión y errores tras coste-optimización

Genera la nueva matriz de confusión y cuenta subestimaciones, sobreestimaciones y aciertos tras entrenar el modelo cost-based, mostrando los totales y desglose por clase real.


In [16]:
# Nueva matriz de confusión y conteo de errores tras coste-optimización

y_test_pred_cost = pipe3_final_rf_cost.predict(X_test_sint)
labels = [1.0, 2.0, 3.0]
cm_cost = confusion_matrix(y_test_3, y_test_pred_cost, labels=labels)
cm_cost_df = pd.DataFrame(
    cm_cost,
    index=[f"True {lbl}" for lbl in labels],
    columns=[f"Pred {lbl}" for lbl in labels]
)
print("Matriz de confusión (post-cost-training):\n")
print(cm_cost_df)

errors_cost = pd.DataFrame({'true': y_test_3.values, 'pred': y_test_pred_cost})
errors_cost['tipo_error'] = errors_cost.apply(
    lambda row: 'subestimar' if row['pred'] < row['true']
    else ('sobreestimar' if row['pred'] > row['true'] else 'correcto'),
    axis=1
)
print("\nConteo de errores post-cost-training:")
print(errors_cost['tipo_error'].value_counts())

print("\nErrores por clase real (post-cost-training):")
print(errors_cost.groupby(['true', 'tipo_error']).size().unstack(fill_value=0))


Matriz de confusión (post-cost-training):

          Pred 1.0  Pred 2.0  Pred 3.0
True 1.0        75        70         5
True 2.0        49       169        15
True 3.0         7        27         7

Conteo de errores post-cost-training:
tipo_error
correcto        251
sobreestimar     90
subestimar       83
Name: count, dtype: int64

Errores por clase real (post-cost-training):
tipo_error  correcto  sobreestimar  subestimar
true                                          
1.0               75            75           0
2.0              169            15          49
3.0                7             0          34


## 17. Ajustar umbrales (threshold tuning) para RandomForest multiclase

Define `find_optimal_thresholds` para encontrar umbrales de decisión óptimos en validación y evalúa el modelo en test usando estos umbrales, mostrando métricas.


In [17]:
# Threshold tuning para RF3
def find_optimal_thresholds(pipe, X_val, y_val):
    proba = pipe.predict_proba(X_val)
    classes = pipe.classes_
    taus = np.array([0.5] * len(classes))
    for idx in range(len(classes)):
        best_f1, best_tau = 0, 0.5
        for τ in np.linspace(0.1, 0.9, 17):
            temp = taus.copy()
            temp[idx] = τ
            preds = []
            for row in proba:
                above = row >= temp
                if above.any():
                    preds.append(classes[np.argmax(row * above)])
                else:
                    preds.append(classes[np.argmax(row)])
            f1_temp = f1_score(y_val, preds, average='macro')
            if f1_temp > best_f1:
                best_f1, best_tau = f1_temp, τ
        taus[idx] = best_tau
    return dict(zip(classes, taus)), best_f1

best_taus_rf3, f1_val3_t = find_optimal_thresholds(pipe3_final_rf, X_val_sint, y_val_3)
print("Umbrales óptimos RF3:", best_taus_rf3)
print("F1_macro validación3 con umbrales:", f1_val3_t)

# Evaluar RF3 con umbrales en TEST
def predict_with_thresholds(pipe, X, thresholds):
    proba = pipe.predict_proba(X)
    classes = pipe.classes_
    preds = []
    for row in proba:
        above = row >= np.array([thresholds[c] for c in classes])
        if above.any():
            preds.append(classes[np.argmax(row * above)])
        else:
            preds.append(classes[np.argmax(row)])
    return np.array(preds)

y_test3_pred_t = predict_with_thresholds(pipe3_final_rf, X_test_sint, best_taus_rf3)
print("\n--- RF3 en TEST con umbrales ---")
print(f"Accuracy: {accuracy_score(y_test_3, y_test3_pred_t):.4f}")
print(f"F1_macro: {f1_score(y_test_3, y_test3_pred_t, average='macro'):.4f}")
print(f"AUC_ovr: {roc_auc_score(y_test_3, pipe3_final_rf.predict_proba(X_test_sint), multi_class='ovr', average='macro'):.4f}")


Umbrales óptimos RF3: {np.float64(1.0): np.float64(0.45000000000000007), np.float64(2.0): np.float64(0.35), np.float64(3.0): np.float64(0.5)}
F1_macro validación3 con umbrales: 0.5141012416037695

--- RF3 en TEST con umbrales ---
Accuracy: 0.5425
F1_macro: 0.4642
AUC_ovr: 0.6630


## Conclusiones Finales
- SMOTE + GridSearchCV generan modelos robustos; RF y GB superan a LR en F1_macro multiclass durante CV y validación.  
- La optimización de umbrales en RF mejora significativamente el equilibrio precisión-recall (F1_macro↑≈0.05) en validación, aunque la ganancia en test es moderada.  
- LogisticRegression mantiene el AUC_ovr más alto (~0.68) pero limita la flexibilidad para controlar costes de error por clase.  
- La matriz de confusión revela una tendencia a sobreestimar el riesgo bajo y subestimar el riesgo alto, especialmente notable en la clase 3.  
- El análisis de errores por clase (sobre/subestimaciones) enfatiza la importancia de ajustar los umbrales según el impacto de cada tipo de error en escenarios financieros.  
- El plateau en F1_macro tras ajustar umbrales sugiere que las 14 variables finales, combinadas con SMOTE, capturan la mayor parte de la señal disponible.  
- Se recomienda el pipeline **RandomForest con umbrales cost-based**, pues proporciona el mejor control de errores críticos en producción financiera.  






