# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. To make the comparison fair, apply the same feature scaling and feature engineering as in the previous Lab.

In [15]:
# Libraries
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


In [4]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [5]:
X = spaceship.drop(columns=["Transported"])
y = spaceship["Transported"]

In [6]:
# drop PassengerId
X = X.drop(columns=["PassengerId"])

# define numeric and categorical columns
# (cat_cols also picks up the high-cardinality text columns Cabin and Name)
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# pipelines
numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# combine
preprocess = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols)
])

**Perform Train Test Split**

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6954, 12), (1739, 12), (6954,), (1739,))

**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [8]:
bagging_clf = Pipeline([
    ("preprocess", preprocess),
    ("clf", BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=8),
        n_estimators=200,
        max_samples=3000,
        bootstrap=True,
        random_state=42
    ))
])
bagging_clf.fit(X_train, y_train)
bagging_clf.score(X_test, y_test)

0.7843588269120184

In [9]:
pasting_clf = Pipeline([
    ("preprocess", preprocess),
    ("clf", BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=8),
        n_estimators=200,
        max_samples=3000,
        bootstrap=False,
        random_state=42
    ))
])

pasting_clf.fit(X_train, y_train)
pasting_clf.score(X_test, y_test)

0.7872340425531915
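
The only difference between the two cells above is `bootstrap`: `True` draws each tree's training rows with replacement (bagging), `False` draws them without replacement (pasting). The sketch below mimics that draw with NumPy, using the row counts from this notebook. It illustrates the idea only and is not scikit-learn's internal sampling code; the variable names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n_train, max_samples = 6954, 3000  # training rows and max_samples from the cells above

# Bagging: sample row indices WITH replacement, so some rows repeat
bag_idx = rng.choice(n_train, size=max_samples, replace=True)

# Pasting: sample row indices WITHOUT replacement, so every row is distinct
paste_idx = rng.choice(n_train, size=max_samples, replace=False)

n_unique_bag = np.unique(bag_idx).size
n_unique_paste = np.unique(paste_idx).size
print(f"distinct rows seen by one tree: bagging={n_unique_bag}, pasting={n_unique_paste}")
```

With pasting each tree sees exactly 3000 distinct rows; with bagging it sees somewhat fewer because of the repeats, which adds a little extra diversity between trees.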

- Random Forests

In [10]:
rf_pipe = Pipeline([
    ("preprocess", preprocess),  # ColumnTransformer defined above
    ("model", RandomForestClassifier(n_estimators=400, random_state=42, n_jobs=-1))
])

cv = cross_val_score(rf_pipe, X_train, y_train, cv=5, scoring="accuracy", n_jobs=-1)
rf_pipe.fit(X_train, y_train)
test_acc = accuracy_score(y_test, rf_pipe.predict(X_test))
print(f"RF CV mean={cv.mean():.4f} (±{cv.std():.4f}) | Test acc={test_acc:.4f}")


RF CV mean=0.7912 (±0.0037) | Test acc=0.7913


- Gradient Boosting

In [11]:
gb_pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", GradientBoostingClassifier(max_depth=10, n_estimators=500, learning_rate=0.05, random_state=42))
])

cv = cross_val_score(gb_pipe, X_train, y_train, cv=5, scoring="accuracy", n_jobs=-1)
gb_pipe.fit(X_train, y_train)
test_acc = accuracy_score(y_test, gb_pipe.predict(X_test))
print(f"GB CV mean={cv.mean():.4f} (±{cv.std():.4f}) | Test acc={test_acc:.4f}")

GB CV mean=0.7965 (±0.0081) | Test acc=0.7872
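
With 500 deep trees the boosted model may be larger than it needs to be. One option worth trying is scikit-learn's built-in early stopping, which holds out part of the training data and stops adding trees once the validation score stops improving. The sketch below uses synthetic data from `make_classification` and illustrative settings (`max_depth=3`, `n_iter_no_change=10`); these are assumptions for the demo, not tuned values for this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data, only to show the mechanics
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

gb_demo = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.05,
    max_depth=3,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb_demo.fit(X_demo, y_demo)

# n_estimators_ is the number of trees actually fitted
print(f"trees fitted: {gb_demo.n_estimators_} of at most {gb_demo.n_estimators}")
```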


- Adaptive Boosting

In [16]:
ab_pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", AdaBoostClassifier(n_estimators=300, learning_rate=0.5, random_state=42))
])

cv = cross_val_score(ab_pipe, X_train, y_train, cv=5, scoring="accuracy", n_jobs=-1)
ab_pipe.fit(X_train, y_train)
test_acc = accuracy_score(y_test, ab_pipe.predict(X_test))
print(f"AB CV mean={cv.mean():.4f} (±{cv.std():.4f}) | Test acc={test_acc:.4f}")

AB CV mean=0.7785 (±0.0079) | Test acc=0.7642


Which model is the best and why?

In [17]:
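# Best model: Random Forest.

# Highest test accuracy: 0.7913 (vs. Gradient Boosting 0.7872, Pasting 0.7872, Bagging 0.7844, AdaBoost 0.7642).

# Stable: CV mean (0.7912) ≈ test (0.7913), with the lowest CV spread (±0.0037), which suggests good generalization.

# Gradient Boosting has the highest CV mean (0.7965) but drops to 0.7872 on the test set; Bagging and Pasting were scored on the test set only.

# The gaps between Random Forest, Gradient Boosting, Bagging and Pasting are all under 0.01, so Random Forest is the safest choice rather than a clear winner.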
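
The test accuracies above come from only 1739 rows, so it helps to see how much sampling noise that implies. The sketch below computes a rough 95% interval for each reported accuracy using the normal approximation to a binomial proportion. It treats each model separately and assumes independent test rows, so it is a back-of-the-envelope check rather than a proper paired comparison.

```python
import math

n_test = 1739  # test rows, from the train/test split output

# Test accuracies as printed in the cells above
test_acc = {
    "Random Forest": 0.7913,
    "Gradient Boosting": 0.7872,
    "Pasting": 0.7872,
    "Bagging": 0.7844,
    "AdaBoost": 0.7642,
}

half_width = {}
for name, acc in test_acc.items():
    se = math.sqrt(acc * (1 - acc) / n_test)  # binomial standard error
    half_width[name] = 1.96 * se              # approximate 95% half-width
    print(f"{name:<18} {acc:.4f} ± {half_width[name]:.4f}")
```

Each interval is roughly ±0.02, which is wider than the gaps between the four strongest models, so on this split alone their ranking is not firmly established.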