# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, you should apply the same feature scaling and feature engineering as in the previous Lab.

In [5]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [7]:
import pandas as pd
import numpy as np

# Load the data
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

# Simple cleaning, consistent with the previous labs: drop rows with missing values
spaceship = spaceship.dropna()

# Minimal feature engineering: keep only the deck letter of the cabin
spaceship['CabinDeck'] = spaceship['Cabin'].str[0].str.upper()

# Target and features
y = spaceship['Transported']
X = pd.get_dummies(
    spaceship.drop(columns=['Transported', 'PassengerId', 'Name', 'Cabin']),
    drop_first=False
)

X.shape, y.shape


((6606, 24), (6606,))

Now perform the same as before:
- Feature Scaling
- Feature Selection


In [9]:
#your code here
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Feature selection (adjust k if you like)
k = min(30, X.shape[1])   # asks for the top 30, but X has only 24 columns, so every feature is kept
selector = SelectKBest(score_func=f_classif, k=k)
X_fs = selector.fit_transform(X_scaled, y)

X_fs.shape



(6606, 24)
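
The cell above fits `StandardScaler` and `SelectKBest` on all 6,606 rows before the split, so the test rows influence the scaling statistics and the feature scores. One common way to avoid this is to wrap the preprocessing and the model in a `Pipeline` and fit it on the training split only. The following is a minimal sketch under stated assumptions: it uses a synthetic dataset from `make_classification` as a stand-in for the Spaceship Titanic features, and `k=10`, the forest size, and all variable names are illustrative choices rather than values from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data with 24 features, not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=24, n_informative=8, random_state=42
)

# Split first, so nothing is learned from the held-out rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaler and selector are fitted inside the pipeline, on training rows only.
leak_free = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),  # k=10 is illustrative
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
leak_free.fit(X_tr, y_tr)

# Number of features the selector kept, and accuracy on the held-out rows.
n_kept = int(leak_free.named_steps["select"].get_support().sum())
pipeline_acc = leak_free.score(X_te, y_te)
print(n_kept, round(pipeline_acc, 3))
```

Tree-based ensembles are largely insensitive to feature scaling, so the scaler mainly matters here for keeping the preprocessing identical to the previous Lab.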

**Perform Train Test Split**

In [10]:
#your code here
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_fs, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape

((5284, 24), (1322, 24))

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [12]:
#your code here
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

results = {}

# Bagging (bootstrap=True)
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=200,
    max_samples=0.8,
    bootstrap=True,
    random_state=42,
    n_jobs=-1
)
bagging_clf.fit(X_train, y_train)
y_pred_bag = bagging_clf.predict(X_test)
results['Bagging'] = {
    'accuracy': accuracy_score(y_test, y_pred_bag),
    'f1_macro': f1_score(y_test, y_pred_bag, average='macro')
}
print("Bagging -> Acc:", results['Bagging']['accuracy'], " F1-macro:", results['Bagging']['f1_macro'])

# Pasting (bootstrap=False)
pasting_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=200,
    max_samples=0.8,
    bootstrap=False,
    random_state=42,
    n_jobs=-1
)
pasting_clf.fit(X_train, y_train)
y_pred_paste = pasting_clf.predict(X_test)
results['Pasting'] = {
    'accuracy': accuracy_score(y_test, y_pred_paste),
    'f1_macro': f1_score(y_test, y_pred_paste, average='macro')
}
print("Pasting -> Acc:", results['Pasting']['accuracy'], " F1-macro:", results['Pasting']['f1_macro'])




Bagging -> Acc: 0.7806354009077155  F1-macro: 0.7805746336996338
Pasting -> Acc: 0.7776096822995462  F1-macro: 0.7775771018891766
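
The only difference between the two models above is `bootstrap`: bagging draws each tree's training rows with replacement, pasting draws them without. A side effect is that bagging leaves some rows unseen by each tree, which allows an out-of-bag (OOB) accuracy estimate without touching the test set. The sketch below illustrates both points under stated assumptions: the data is synthetic, the ensemble is smaller than in the notebook, and the base tree is passed positionally (equivalent to the `estimator=` keyword used above).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data, not the Spaceship Titanic features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=24, n_informative=8, random_state=42
)

def fit_ensemble(bootstrap):
    """Fit a bagging (bootstrap=True) or pasting (bootstrap=False) ensemble."""
    clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=50,
        max_samples=0.8,
        bootstrap=bootstrap,
        oob_score=bootstrap,  # OOB scoring is only available with bootstrap=True
        random_state=42,
    )
    return clf.fit(X_demo, y_demo)

bag = fit_ensemble(bootstrap=True)
paste = fit_ensemble(bootstrap=False)

# Each tree is trained on 80% of the rows (480 draws here).
n_draw = int(0.8 * len(X_demo))

# Count distinct rows seen by the first tree of each ensemble.
bag_unique = len(set(bag.estimators_samples_[0]))      # fewer than n_draw: duplicates
paste_unique = len(set(paste.estimators_samples_[0]))  # exactly n_draw: no duplicates

print("distinct rows, bagging:", bag_unique, "| pasting:", paste_unique)
print("bagging OOB accuracy:", round(bag.oob_score_, 3))
```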


- Random Forests

In [13]:
#your code here
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

rf_clf = RandomForestClassifier(
    n_estimators=400,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42,
    n_jobs=-1
)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
results['RandomForest'] = {
    'accuracy': accuracy_score(y_test, y_pred_rf),
    'f1_macro': f1_score(y_test, y_pred_rf, average='macro')
}
print("Random Forest -> Acc:", results['RandomForest']['accuracy'], " F1-macro:", results['RandomForest']['f1_macro'])


Random Forest -> Acc: 0.7813918305597579  F1-macro: 0.781391705475192


- Gradient Boosting

In [14]:
#your code here
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score

gb_clf = GradientBoostingClassifier(
    learning_rate=0.1,
    n_estimators=200,
    max_depth=3,
    random_state=42
)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)
results['GradientBoosting'] = {
    'accuracy': accuracy_score(y_test, y_pred_gb),
    'f1_macro': f1_score(y_test, y_pred_gb, average='macro')
}
print("Gradient Boosting -> Acc:", results['GradientBoosting']['accuracy'], " F1-macro:", results['GradientBoosting']['f1_macro'])


Gradient Boosting -> Acc: 0.783661119515885  F1-macro: 0.7828659770606059
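
The model above fixes `n_estimators=200` by hand. Because boosting adds trees one at a time, `staged_predict` can replay a fitted model stage by stage and show how accuracy changes with the number of trees. The sketch below is one possible way to use it, under stated assumptions: the data is synthetic, the model is smaller than in the notebook, and the number of trees is chosen on a separate validation split (picking it on the test set would bias the reported test score).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data, not the Spaceship Titanic features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=24, n_informative=8, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

gb = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=100, max_depth=3, random_state=42
)
gb.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 100 trees.
staged_acc = [
    accuracy_score(y_val, y_stage) for y_stage in gb.staged_predict(X_val)
]

# Smallest number of trees that reaches the best validation accuracy.
best_n_trees = int(np.argmax(staged_acc)) + 1
print("best number of trees on validation data:", best_n_trees)
```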


- Adaptive Boosting

In [16]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=42),
    n_estimators=300,
    learning_rate=0.5,
    random_state=42
)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)
results['AdaBoost'] = {
    'accuracy': accuracy_score(y_test, y_pred_ada),
    'f1_macro': f1_score(y_test, y_pred_ada, average='macro')
}
print("AdaBoost -> Acc:", results['AdaBoost']['accuracy'], " F1-macro:", results['AdaBoost']['f1_macro'])



AdaBoost -> Acc: 0.7738275340393344  F1-macro: 0.7737186455794826


Which model is the best and why?

In [17]:
#comment here
import pandas as pd

# Results table
score_df = pd.DataFrame(results).T.sort_values(by=['accuracy','f1_macro'], ascending=False)
print(score_df)

best_model = score_df.index[0]
best_acc = score_df.loc[best_model, 'accuracy']
best_f1  = score_df.loc[best_model, 'f1_macro']

print("\nBest model:", best_model)
print(f"Accuracy: {best_acc:.4f} | F1-macro: {best_f1:.4f}")

# Comment:
# We pick the best model by accuracy and F1-macro on the test set.
# - Bagging/Pasting usually improve on a single base model by reducing variance.
# - Random Forest combines many trees built with bootstrap sampling plus random
#   feature subsampling; it is robust and strikes a good bias-variance balance.
# - Gradient Boosting and AdaBoost fit learners sequentially, each one correcting
#   the errors of the previous ones; they tend to do very well when noise is low.
# In the comparison above, Gradient Boosting scores highest (accuracy 0.7837,
# F1-macro 0.7829). Its lead over Random Forest (0.7814) amounts to about 3 of the
# 1,322 test rows, so the ranking among the top models is not conclusive.


                  accuracy  f1_macro
GradientBoosting  0.783661  0.782866
RandomForest      0.781392  0.781392
Bagging           0.780635  0.780575
Pasting           0.777610  0.777577
AdaBoost          0.773828  0.773719

Best model: GradientBoosting
Accuracy: 0.7837 | F1-macro: 0.7829
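
The top models in the table differ by roughly 0.2 percentage points of accuracy on a single train/test split, which is too small a gap to rank them reliably. Cross-validation gives a mean score and a spread per model, which makes such a comparison easier to judge. The sketch below is a minimal illustration under stated assumptions: the data is synthetic, only two of the five models are included for brevity, and the ensemble sizes and fold count are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: synthetic stand-in data, not the Spaceship Titanic features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=24, n_informative=8, random_state=42
)

# The same shuffled, stratified folds are reused for every model.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Mean and standard deviation of accuracy across the five folds.
cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the difference between two mean scores is smaller than their fold-to-fold spread, the simpler or faster model is usually a reasonable choice.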