# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [10]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

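# Split Cabin into its Deck / Cabin_num / Side components
spaceship[["Deck", "Cabin_num", "Side"]] = spaceship["Cabin"].str.split("/", expand=True)
# str.split returns strings; make Cabin_num explicitly numeric before imputing and scaling
spaceship["Cabin_num"] = pd.to_numeric(spaceship["Cabin_num"], errors="coerce")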

# Drop irrelevant columns
spaceship.drop(columns=["PassengerId", "Name", "Cabin"], inplace=True)

# Define the target
y = spaceship["Transported"].astype(int)
X = spaceship.drop(columns=["Transported"])

# Define numeric and categorical columns
num_cols = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Cabin_num"]
cat_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP", "Deck", "Side"]

# Pipelines for numeric and categorical features
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_transformer, num_cols),
        ("cat", cat_transformer, cat_cols)
    ]
)

print("✅ Preprocessing pipeline defined (feature scaling and selection)")


✅ Preprocessing pipeline defined (feature scaling and selection)
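
To make the `Cabin` handling concrete, here is a minimal sketch on a hypothetical three-row frame (`toy` is illustrative and not part of the dataset). It shows that `str.split` produces string columns and that a missing `Cabin` propagates as missing values in all three new columns, which the imputers in the pipeline then fill. The `pd.to_numeric` call is one way, assumed here, to make `Cabin_num` explicitly numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (assumption: illustrative values only)
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", np.nan]})

# Same split as in the lab: Deck / Cabin_num / Side
toy[["Deck", "Cabin_num", "Side"]] = toy["Cabin"].str.split("/", expand=True)

# The split yields strings, so convert the cabin number to a numeric dtype
toy["Cabin_num"] = pd.to_numeric(toy["Cabin_num"], errors="coerce")

print(toy)
print(toy.dtypes)
```

The last row stays missing in `Deck`, `Cabin_num`, and `Side`, so the median and most-frequent imputers still have work to do.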


**Perform Train Test Split**

In [12]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,    # 20% for the test set
    random_state=42,  # reproducibility
    stratify=y        # preserve class proportions
)

print(f"📊 Train size: {X_train.shape}")
print(f"📊 Test size: {X_test.shape}")


📊 Train size: (6954, 13)
📊 Test size: (1739, 13)
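
As a quick illustration of what `stratify=y` does, the sketch below uses synthetic labels (an assumption, not the Spaceship Titanic data) with a 30/70 class balance and checks that both splits keep roughly that balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 30% positive class (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 300 + [0] * 700)

_, _, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=42,
    stratify=y_demo,  # keep the class ratio in both splits
)

print(f"Positive rate, full data: {y_demo.mean():.3f}")
print(f"Positive rate, train:     {y_tr_demo.mean():.3f}")
print(f"Positive rate, test:      {y_te_demo.mean():.3f}")
```

The same check can be run in the lab with `y_train.mean()` and `y_test.mean()`.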


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [14]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Base classifier
base_clf = DecisionTreeClassifier(random_state=42)

# 🔹 Bagging (bootstrap=True)
bagging_clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", BaggingClassifier(
        estimator=base_clf,
        n_estimators=100,
        max_samples=1.0,
        bootstrap=True,  # sample with replacement
        random_state=42
    ))
])

bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)
acc_bagging = accuracy_score(y_test, y_pred_bagging)
print(f"✅ Accuracy Bagging: {acc_bagging:.4f}")

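# 🔹 Pasting (bootstrap=False)
# Note: with max_samples=1.0 and no replacement, every tree is trained on the full
# training set, so the ensemble behaves much like a single decision tree.
# Sampling a subset (max_samples < 1.0) is what makes pasting differ from one tree.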
pasting_clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", BaggingClassifier(
        estimator=base_clf,
        n_estimators=100,
        max_samples=1.0,
        bootstrap=False,  # sample without replacement
        random_state=42
    ))
])

pasting_clf.fit(X_train, y_train)
y_pred_pasting = pasting_clf.predict(X_test)
acc_pasting = accuracy_score(y_test, y_pred_pasting)
print(f"✅ Accuracy Pasting: {acc_pasting:.4f}")


✅ Accuracy Bagging: 0.8039
✅ Accuracy Pasting: 0.7510


- Random Forests

In [15]:
from sklearn.ensemble import RandomForestClassifier

# 🔹 Random Forest
rf_clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(
        n_estimators=200,       # more trees for more stable predictions
        max_depth=None,         # no depth limit
        random_state=42,
        n_jobs=-1               # use all CPU cores
    ))
])

# Train
rf_clf.fit(X_train, y_train)

# Evaluate
rf_acc = rf_clf.score(X_test, y_test)
print(f"🌲 Accuracy Random Forest: {rf_acc:.4f}")


🌲 Accuracy Random Forest: 0.8068
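
Because a Random Forest trains each tree on a bootstrap sample, the rows a tree did not see can serve as a built-in validation set. The sketch below shows this out-of-bag (OOB) estimate on synthetic data (an assumption); `oob_score=True` is not used in the lab cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

rf_demo = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # score each row using only trees that never saw it
    random_state=0,
)
rf_demo.fit(X_demo, y_demo)

print(f"OOB accuracy: {rf_demo.oob_score_:.4f}")
print(f"Feature importances sum to: {rf_demo.feature_importances_.sum():.2f}")
```

The OOB accuracy gives a second estimate of generalization without touching the test set.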


- Gradient Boosting

In [16]:
from sklearn.ensemble import GradientBoostingClassifier

# 🔹 Gradient Boosting
gb_clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    ))
])

# Train
gb_clf.fit(X_train, y_train)

# Evaluate
gb_acc = gb_clf.score(X_test, y_test)
print(f"⚡ Accuracy Gradient Boosting: {gb_acc:.4f}")


⚡ Accuracy Gradient Boosting: 0.8074
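
Since Gradient Boosting adds trees one at a time, accuracy can be tracked after each stage to see how many trees are actually needed. This is a minimal sketch on synthetic data with a separate validation split (both are assumptions), using `staged_predict`, which yields the ensemble's predictions after each boosting stage.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gb_demo.fit(X_tr, y_tr)

# Validation accuracy after 1, 2, ..., 100 trees
val_acc = [
    accuracy_score(y_val, y_stage) for y_stage in gb_demo.staged_predict(X_val)
]
best_n = int(np.argmax(val_acc)) + 1  # stages are 1-indexed

print(f"Best validation accuracy {max(val_acc):.4f} at {best_n} trees")
```

Choosing the number of trees on a validation split rather than the test set keeps the final test accuracy an honest estimate.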


- Adaptive Boosting

In [17]:
from sklearn.ensemble import AdaBoostClassifier

# 🔹 AdaBoost
ada_clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", AdaBoostClassifier(
        n_estimators=100,
        learning_rate=0.5,
        random_state=42
    ))
])

# Train
ada_clf.fit(X_train, y_train)

# Evaluate
y_pred_ada = ada_clf.predict(X_test)
acc_ada = accuracy_score(y_test, y_pred_ada)
print(f"⚡ Accuracy AdaBoost: {acc_ada:.4f}")




⚡ Accuracy AdaBoost: 0.8022


Which model is the best and why?

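The best model in this experiment is Gradient Boosting, though only by a narrow margin:

It achieved the highest test accuracy: 0.8074.

Gradient Boosting trains models sequentially, with each new tree correcting the mistakes of the previous ones, which helps reduce bias.

This makes it effective at capturing complex patterns that averaging methods such as Bagging or Random Forest, which mainly reduce variance, may miss. AdaBoost is also a boosting method, but here it scored slightly lower (0.8022).

Random Forest was very close in accuracy (0.8068). On a test set of 1,739 passengers that gap amounts to about one prediction, so the two models are effectively tied on this split. Gradient Boosting often has more potential with further tuning.

Pasting scored much lower (0.7510), which is expected with `max_samples=1.0` and `bootstrap=False`: every tree sees the same full training set, so the ensemble adds little over a single decision tree.

Final choice: Gradient Boosting is the top performer in this run, with the caveat that its lead over Random Forest is too small to be conclusive from a single train/test split.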
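
When two models differ by a fraction of a percentage point on one split, cross-validation gives a better sense of whether the gap is real. This is a minimal sketch on synthetic data (an assumption), comparing the two closest contenders with 5-fold stratified cross-validation. The model settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

cv_results = {}
for name, model in models.items():
    # One accuracy score per fold
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

In the lab, the pipelines `rf_clf` and `gb_clf` could be passed to `cross_val_score` in the same way with `X` and `y`, so that preprocessing is refit inside each fold.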


