# **A09 - Bagging**
### **Miguel Aaron Castillon Ochoa**
### **Student ID:** 751858
### **Date:** 24 / 11 / 2025

In [31]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

## **Loading the dataset**

In [52]:
df = pd.read_csv("Default.csv")
df.head()

Unnamed: 0,default,student,balance,income
0,No,No,729.526495,44361.625074
1,No,Yes,817.180407,12106.1347
2,No,No,1073.549164,31767.13895
3,No,No,529.250605,35704.49394
4,No,No,785.655883,38463.49588


In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   default  10000 non-null  object 
 1   student  10000 non-null  object 
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB


## **Define the explanatory columns and the target**

In [54]:
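# Encode the student flag as 0/1 so it can be used as a numeric feature.
df["student_bin"] = (df["student"] == "Yes").astype(int)
features = ["balance", "income", "student_bin"]
X = df[features].astype(float)
# Binary target: 1 if the customer defaulted, 0 otherwise.
y = (df["default"] == "Yes").astype(int)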


## **Train/test split**

In [55]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

## **Logistic Regression**

In [56]:
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

y_proba_logreg = logreg.predict_proba(X_test)[:, 1]
y_pred_logreg = (y_proba_logreg >= 0.5).astype(int)

auc_logreg = roc_auc_score(y_test, y_proba_logreg)
acc_logreg = accuracy_score(y_test, y_pred_logreg)

print(f"AUC Logistic Regression:      {auc_logreg:.4f}")
print(f"Accuracy Logistic Regression: {acc_logreg:.4f}")

AUC Logistic Regression:      0.9511
Accuracy Logistic Regression: 0.9717


## **Bagging with 5000 trees**

In [57]:
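n_trees = 5000
n_train = X_train.shape[0]

trees = []
selected_features_list = []  # feature subset used by each tree, in training order

# Note: np.random is not seeded here, so subsets and samples differ between runs.
for i in range(n_trees):
    # Random feature subset: 2 of the 3 columns, without repetition.
    cols = np.random.choice(features, size=2, replace=False)
    selected_features_list.append(cols)

    # Bootstrap sample: 5000 training rows drawn with replacement.
    idx = np.random.choice(n_train, size=5000, replace=True)
    Xb = X_train[cols].iloc[idx]
    yb = y_train.iloc[idx]

    # Fully grown tree (no depth limit) fitted on the bootstrap sample.
    tree = DecisionTreeClassifier(random_state=i)
    tree.fit(Xb, yb)

    trees.append(tree)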
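
For comparison, the manual loop above (bootstrap rows plus a random 2-column subset per tree) resembles what scikit-learn's `BaggingClassifier` does when `max_features` and `max_samples` are set. The following is only a minimal sketch under assumptions: it runs on synthetic data from `make_classification` rather than on `Default.csv`, uses fewer trees and smaller samples to keep it quick, and all names ending in `_demo` or prefixed with `Xd_`/`yd_` are illustrative. Its scores are not comparable to the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (assumption): 3 features, rare positive class.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=3, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # base learner: a fully grown tree
    n_estimators=100,           # far fewer than 5000, to keep the sketch fast
    max_samples=1000,           # rows drawn per tree (an int means a row count)
    max_features=2,             # columns drawn per tree
    bootstrap=True,             # rows are sampled with replacement
    bootstrap_features=False,   # columns are sampled without replacement
    random_state=0,
)
bag.fit(Xd_train, yd_train)

# predict_proba averages the per-tree probabilities, like the manual aggregation step.
bag_proba = bag.predict_proba(Xd_test)[:, 1]
auc_demo = roc_auc_score(yd_test, bag_proba)
print(f"AUC (BaggingClassifier, synthetic data): {auc_demo:.4f}")
```

The fitted ensemble exposes the column indices each tree received in `bag.estimators_features_`, which plays the same role as `selected_features_list` in the manual version.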


## **Aggregating**

In [58]:
proba_all = []

# Each tree predicts using only the columns it was trained on.
for tree, cols in zip(trees, selected_features_list):
    p = tree.predict_proba(X_test[cols])[:, 1]
    proba_all.append(p)

# Shape (n_trees, n_test): average over trees to get the ensemble probability.
proba_all = np.array(proba_all)
bagging_prob = proba_all.mean(axis=0)

y_pred_bagging = (bagging_prob >= 0.5).astype(int)

auc_bagging = roc_auc_score(y_test, bagging_prob)
acc_bagging = accuracy_score(y_test, y_pred_bagging)

print(f"\nAUC Bagging (5000 trees):      {auc_bagging:.4f}")
print(f"Accuracy Bagging:             {acc_bagging:.4f}")


AUC Bagging (5000 trees):      0.9045
Accuracy Bagging:             0.9697
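
To make the aggregation step concrete, here is a toy illustration with made-up numbers (the array `toy_proba` is hypothetical, not taken from the fitted trees). Fully grown trees typically return probabilities of exactly 0 or 1, so averaging over trees amounts to computing the fraction of trees that vote for class 1.

```python
import numpy as np

# Hypothetical class-1 probabilities: 4 trees (rows) x 3 test samples (columns).
toy_proba = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])

# Average over the tree axis: one ensemble probability per test sample.
toy_mean = toy_proba.mean(axis=0)          # [0.75, 0.0, 0.5]

# Same 0.5 threshold as in the notebook; a tie at exactly 0.5 goes to class 1.
toy_pred = (toy_mean >= 0.5).astype(int)   # [1, 0, 1]

print(toy_mean, toy_pred)
```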


## **Comparison**

In [59]:
print("\n FINAL COMPARISON ")
print(f"AUC Logistic Regression:      {auc_logreg:.4f}")
print(f"AUC Bagging:                  {auc_bagging:.4f}")
print(f"Accuracy Logistic Regression: {acc_logreg:.4f}")
print(f"Accuracy Bagging:             {acc_bagging:.4f}")


 FINAL COMPARISON 
AUC Logistic Regression:      0.9511
AUC Bagging:                  0.9045
Accuracy Logistic Regression: 0.9717
Accuracy Bagging:             0.9697


## **Conclusions**
Logistic regression performed best, with an AUC of 0.9511 against 0.9045 for bagging. The most plausible reason is that it uses all three variables simultaneously and fits the nearly linear relationship in this dataset well.

Bagging with 5000 trees and only 2 randomly chosen columns per tree reached an AUC of 0.9045. This improves on earlier versions of the experiment but remains below the linear model, because every tree is trained without one of the three variables and therefore loses part of the information.
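
The information loss from drawing 2 of 3 columns can be quantified with a small simulation. This is an illustrative sketch only: the names `features_demo`, `left_out` and `shares` are introduced here, and the draw uses a seeded NumPy generator rather than the unseeded `np.random.choice` of the notebook. Each tree misses exactly one column, so each column should be absent from roughly one third of the trees.

```python
from collections import Counter

import numpy as np

features_demo = ["balance", "income", "student_bin"]
rng = np.random.default_rng(0)
n_draws = 6000

left_out = Counter()
for _ in range(n_draws):
    # Same kind of draw as in the bagging loop: 2 distinct columns out of 3.
    picked = rng.choice(features_demo, size=2, replace=False)
    # Exactly one column is missing from every draw.
    (missing,) = set(features_demo) - set(picked)
    left_out[missing] += 1

# Share of simulated trees that never see each column (expected: about 1/3 each).
shares = {name: left_out[name] / n_draws for name in features_demo}
print(shares)
```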

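Accuracy was very similar for the two models (0.9717 vs. 0.9697), but this metric is strongly influenced by class imbalance: when one class dominates, predicting it for nearly every observation already yields a high accuracy.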
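
The effect of imbalance on accuracy can be illustrated with a trivial baseline. The sketch below uses simulated labels with an assumed 3% positive rate, chosen only for illustration since the notebook does not print the actual rate. A "model" that always predicts the majority class reaches high accuracy while its AUC stays at chance level.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)

# Simulated labels with an assumed 3% positive rate (not the real dataset).
y_toy = (rng.random(10_000) < 0.03).astype(int)

# Baseline that ignores the inputs: always predict class 0 with a constant score.
baseline_pred = np.zeros_like(y_toy)
baseline_score = np.zeros(len(y_toy))

acc_baseline = accuracy_score(y_toy, baseline_pred)
auc_baseline = roc_auc_score(y_toy, baseline_score)  # constant scores give 0.5

print(f"Baseline accuracy: {acc_baseline:.4f}")
print(f"Baseline AUC:      {auc_baseline:.4f}")
```

This is why the comparison between the two models rests on AUC rather than on accuracy.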

Overall, logistic regression remains the more suitable model for this problem, offering greater predictive power (higher AUC) and more stability than bagging under the random-feature restriction.