# Notebook: Boosting Implementation
This notebook shows how to train boosting models (AdaBoost and gradient boosting, here via XGBoost) on a dataset that is treated as fully categorical and converted to numeric form with OneHotEncoder.

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

print("Imports OK")

Imports OK


## Load data
Change the code so that it loads your real dataset.

In [20]:
# Load the dataset (swap in your own)
df = pd.read_csv("health_data_processed.csv")  # REPLACE with your file
df.head()

Unnamed: 0,Edad,Género,Estado civil,Altura,Peso,Índice de masa corporal,¿Fuma actualmente?,¿Fumó en el pasado?,¿Consume alcohol frecuentemente?,Nivel de actividad física,...,Asma_real,Cáncer_real,Obesidad_real,Depresión/Ansiedad_real,Enfermedad cardiovascular_bin,Diabetes_bin,Asma_bin,Cáncer_bin,Obesidad_bin,Depresión/Ansiedad_bin
0,76.596326,Otro,Soltero,153.681426,76.920289,29.612895,No,Sí,No,Moderado,...,0.607766,0.206125,0.185332,0.526435,1,0,1,0,0,1
1,79.795297,Otro,Casado,155.882307,66.743641,9.902543,No,No,No,Moderado,...,0.127169,0.464862,0.353357,1.062676,0,0,0,0,0,1
2,90.603394,Otro,Casado,176.481841,124.818134,27.248719,Sí,No,Sí,Moderado,...,0.518238,0.152951,0.233972,0.290081,1,0,1,0,0,0
3,22.154276,Femenino,Viudo,158.681358,114.807668,27.634473,No,No,No,Moderado,...,0.087128,0.01628,0.908137,0.561079,1,0,0,0,1,1
4,46.176676,Masculino,Casado,184.451263,60.217207,24.094841,No,Sí,No,Sedentario,...,0.216952,0.110858,0.713157,1.180575,1,0,0,0,1,1


## Preprocessing

In [21]:
# Every column except the labels is treated as categorical
# (this includes numeric columns such as Edad, Altura and Peso)

target_cols = [
    "Asma_bin", "C√°ncer_bin", "Obesidad_bin",
    "Depresi√≥n/Ansiedad_bin", "Diabetes_bin",
    "Enfermedad cardiovascular_bin"
]

y = df[target_cols]   # Replace with the name(s) of your target variable(s)
X = df.drop(columns=target_cols)

categorical_cols = X.columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

print("Preprocessing configured")

Preprocessing configured
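
The cell above one-hot encodes every column, yet `df.head()` shows continuous columns such as `Edad`, `Altura` and `Peso`. One-hot encoding those creates roughly one column per distinct value. As a possible alternative, the sketch below routes columns by dtype with `make_column_selector`. The toy frame `X_demo` and the name `preprocessor_by_dtype` are illustrative assumptions, not part of the original notebook, and the results reported in later cells were not produced with this preprocessing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for X; column names borrowed from df.head()
X_demo = pd.DataFrame({
    "Edad": [76.6, 79.8, 90.6, 22.2],
    "Peso": [76.9, 66.7, 124.8, 114.8],
    "Género": ["Otro", "Otro", "Femenino", "Masculino"],
    "¿Fuma actualmente?": ["No", "No", "Sí", "No"],
})

# Numeric columns pass through unchanged; everything else is one-hot encoded
preprocessor_by_dtype = ColumnTransformer(
    transformers=[
        ("num", "passthrough", make_column_selector(dtype_include="number")),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude="number")),
    ]
)

n_cols_by_dtype = preprocessor_by_dtype.fit_transform(X_demo).shape[1]

# For comparison: encoding every column, as the original cell does
n_cols_all_ohe = OneHotEncoder(handle_unknown="ignore").fit_transform(X_demo).shape[1]

print("columns when routing by dtype:", n_cols_by_dtype)
print("columns when one-hot encoding everything:", n_cols_all_ohe)
```

Tree-based boosters do not need scaled inputs, which is why the numeric branch uses `"passthrough"` here; a scaler could be swapped in if a different model family is used.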


## Train/Test Split

In [22]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train.shape, X_test.shape

((16000, 62), (4000, 62))
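
The split above is not stratified, so it may be worth checking that each label's positive rate is similar in both partitions before training. The following is a minimal sketch on synthetic labels; `y_demo`, `label_common` and `label_rare` are made-up names used only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic binary labels standing in for y: one common label, one rare label
rng = np.random.default_rng(0)
y_demo = pd.DataFrame({
    "label_common": rng.random(1000) < 0.5,
    "label_rare": rng.random(1000) < 0.02,
}).astype(int)

y_tr, y_te = train_test_split(y_demo, test_size=0.2, random_state=42)

# The mean of a 0/1 column is its positive rate
prevalence = pd.DataFrame({"train": y_tr.mean(), "test": y_te.mean()})
print(prevalence)
```

In the notebook itself, the same comparison would use `y_train.mean()` and `y_test.mean()`.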

## AdaBoost

In [23]:
# AdaBoostClassifier and Pipeline were already imported in the first cell
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import f1_score

model_ada = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', MultiOutputClassifier(AdaBoostClassifier(n_estimators=200)))
])

model_ada.fit(X_train, y_train)
preds_ada = model_ada.predict(X_test)

print("F1 Score AdaBoost:", f1_score(y_test, preds_ada, average="macro"))

F1 Score AdaBoost: 0.11538461538461538


In [28]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, multilabel_confusion_matrix
import pandas as pd

y_pred = model_ada.predict(X_test)

# For multilabel targets, accuracy_score is the exact-match (subset) accuracy
print("\n🟩 Accuracy:", accuracy_score(y_test, y_pred))
print("\n🟩 Classification report:\n")
print(classification_report(y_test, y_pred))

print("\n🔷 CONFUSION MATRICES (one per disease):")
cms = multilabel_confusion_matrix(y_test, y_pred)

for i, disease in enumerate(target_cols):
    print(f"\n🦠 {disease}")
    print(cms[i])


🟩 Accuracy: 0.1845

🟩 Classification report:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       477
           1       0.00      0.00      0.00        70
           2       0.00      0.00      0.00      1031
           3       0.53      0.99      0.69      2130
           4       0.00      0.00      0.00       754
           5       0.00      0.00      0.00      1379

   micro avg       0.53      0.36      0.43      5841
   macro avg       0.09      0.16      0.12      5841
weighted avg       0.19      0.36      0.25      5841
 samples avg       0.53      0.33      0.39      5841


🔷 CONFUSION MATRICES (one per disease):

🦠 Asma_bin
[[3523    0]
 [ 477    0]]

🦠 Cáncer_bin
[[3930    0]
 [  70    0]]

🦠 Obesidad_bin
[[2969    0]
 [1031    0]]

🦠 Depresión/Ansiedad_bin
[[  22 1848]
 [  24 2106]]

🦠 Diabetes_bin
[[3246    0]
 [ 754    0]]

🦠 Enfermedad cardiovascular_bin
[[2621    0]
 [1379    0]]
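
The confusion matrices above explain the low macro F1: five of the six per-label models never predict the positive class, so their F1 is 0, and only Depresión/Ansiedad contributes. The sketch below recomputes the macro F1 by hand from the matrices printed above, assuming the `[[TN, FP], [FN, TP]]` layout that `multilabel_confusion_matrix` uses. The helper `f1_from_cm` is a name introduced here for illustration.

```python
# Per-label confusion matrices copied from the AdaBoost output,
# laid out as [[TN, FP], [FN, TP]]
cms_ada = {
    "Asma_bin": [[3523, 0], [477, 0]],
    "Cáncer_bin": [[3930, 0], [70, 0]],
    "Obesidad_bin": [[2969, 0], [1031, 0]],
    "Depresión/Ansiedad_bin": [[22, 1848], [24, 2106]],
    "Diabetes_bin": [[3246, 0], [754, 0]],
    "Enfermedad cardiovascular_bin": [[2621, 0], [1379, 0]],
}


def f1_from_cm(cm):
    """F1 of the positive class: 2*TP / (2*TP + FP + FN), or 0 if undefined."""
    (tn, fp), (fn, tp) = cm
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_label_f1 = {name: f1_from_cm(cm) for name, cm in cms_ada.items()}

# Macro averaging is the unweighted mean of the per-label scores
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)

for name, score in per_label_f1.items():
    print(f"{name}: {score:.4f}")
print("macro F1:", macro_f1)
```

The result agrees with the `F1 Score AdaBoost` value printed earlier, which confirms that the headline number is driven by a single label.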




## XGBoost

In [30]:
from xgboost import XGBClassifier

xgb_model = Pipeline(steps=[
    ('prep', preprocessor),
    ('clf', MultiOutputClassifier(
        XGBClassifier(
            n_estimators=250,
            learning_rate=0.05,
            max_depth=5,
            subsample=0.8,
            colsample_bytree=0.8,
            eval_metric="logloss",
            tree_method="hist"
        )
    ))
])

xgb_model.fit(X_train, y_train)

preds_xgb = xgb_model.predict(X_test)

print("F1 Score XGBoost (macro):", f1_score(y_test, preds_xgb, average="macro"))


F1 Score XGBoost (macro): 0.10680841552280833


In [31]:
# For multilabel targets, accuracy_score is the exact-match (subset) accuracy
print("\n🟩 Accuracy:", accuracy_score(y_test, preds_xgb))

print("\n🟩 Classification report:\n")
print(classification_report(y_test, preds_xgb, target_names=target_cols))

# Confusion matrices, one per disease
print("\n🔷 CONFUSION MATRICES (one per disease):")
cms = multilabel_confusion_matrix(y_test, preds_xgb)

for i, disease in enumerate(target_cols):
    print(f"\n🦠 {disease}")
    print(cms[i])


🟩 Accuracy: 0.17725

🟩 Classification report:

                               precision    recall  f1-score   support

                     Asma_bin       0.00      0.00      0.00       477
                   Cáncer_bin       0.00      0.00      0.00        70
                 Obesidad_bin       0.25      0.00      0.00      1031
       Depresión/Ansiedad_bin       0.53      0.74      0.62      2130
                 Diabetes_bin       0.00      0.00      0.00       754
Enfermedad cardiovascular_bin       0.22      0.01      0.02      1379

                    micro avg       0.52      0.27      0.36      5841
                    macro avg       0.17      0.13      0.11      5841
                 weighted avg       0.29      0.27      0.23      5841
                  samples avg       0.39      0.25      0.29      5841


🔷 CONFUSION MATRICES (one per disease):

🦠 Asma_bin
[[3523    0]
 [ 477    0]]

🦠 Cáncer_bin
[[3930    0]
 [  70    0]]

🦠 Obesidad_bin
[[2



In [32]:
ohe = xgb_model.named_steps['prep'].named_transformers_['cat']
feature_names_ohe = ohe.get_feature_names_out(categorical_cols)

# Extract the first model (one estimator is fitted per label)
first_estimator = xgb_model.named_steps['clf'].estimators_[0]

importances = first_estimator.feature_importances_

ranking = pd.DataFrame({
    "feature": feature_names_ohe,
    "importance": importances
}).sort_values("importance", ascending=False)

print("\n⭐ Feature importance (first model only):")
ranking.head(20)


⭐ Feature importance (first model only):


Unnamed: 0,feature,importance
64087,¿Padece de insomnio?_Sí,0.01229
64032,¿Tiene antecedentes de cáncer en la familia?_Sí,0.011421
64092,¿Tiene problemas digestivos frecuentes?_Sí,0.011359
64012,¿Consume alcohol frecuentemente?_Sí,0.010916
16002,Género_Otro,0.010661
64085,¿Consume alimentos procesados frecuentemente?_Sí,0.010656
64008,¿Fuma actualmente?_Sí,0.010634
64072,¿Ha tenido dolor en el pecho recientemente?_Sí,0.010623
64026,¿Experimenta estrés con frecuencia?_Sí,0.010362
64070,¿Ha experimentado pérdida de peso no intencion...,0.010358
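
The ranking above covers only the first per-label model. Since `MultiOutputClassifier` fits one estimator per label and exposes them through `estimators_`, the importances can be collected for every label and averaged. The sketch below illustrates the idea on synthetic data, with scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` (both expose `feature_importances_`). The data and the names `feature_names`, `label_names` and `ranking_all` are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multilabel problem standing in for the encoded health dataset
X_demo, Y_demo = make_multilabel_classification(
    n_samples=300, n_features=8, n_classes=3, random_state=0
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]
label_names = [f"label_{j}" for j in range(Y_demo.shape[1])]

# One booster is fitted per label column
multi = MultiOutputClassifier(
    GradientBoostingClassifier(n_estimators=20, random_state=0)
).fit(X_demo, Y_demo)

# One importance column per label, indexed by feature name
importances_all = pd.DataFrame(
    {name: est.feature_importances_
     for name, est in zip(label_names, multi.estimators_)},
    index=feature_names,
)

# Average across labels and sort to obtain a single overall ranking
importances_all["mean"] = importances_all[label_names].mean(axis=1)
ranking_all = importances_all.sort_values("mean", ascending=False)
print(ranking_all.head())
```

In the notebook, the same loop would iterate over `xgb_model.named_steps['clf'].estimators_` paired with `target_cols`, using `feature_names_ohe` as the index.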
