# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [44]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [45]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [46]:
from sklearn.preprocessing import StandardScaler

# Feature Engineering: Fill missing values and encode categorical variables
spaceship_clean = spaceship.copy()

# Fill missing numerical values with median
num_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
spaceship_clean[num_cols] = spaceship_clean[num_cols].fillna(spaceship_clean[num_cols].median())

# Fill missing categorical values with mode
cat_cols = ['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']
for col in cat_cols:
    spaceship_clean[col] = spaceship_clean[col].fillna(spaceship_clean[col].mode()[0])

# Encode boolean columns
spaceship_clean['CryoSleep'] = spaceship_clean['CryoSleep'].map({True: 1, False: 0, 'True': 1, 'False': 0})
spaceship_clean['VIP'] = spaceship_clean['VIP'].map({True: 1, False: 0, 'True': 1, 'False': 0})

# Encode categorical variables using one-hot encoding
spaceship_encoded = pd.get_dummies(spaceship_clean, columns=['HomePlanet', 'Destination'], drop_first=True)

# Feature selection: drop columns not useful for modeling
features = spaceship_encoded.drop(['PassengerId', 'Cabin', 'Name', 'Transported'], axis=1)
target = spaceship_encoded['Transported'].astype(int)

# Feature scaling
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)



**Perform Train Test Split**

In [47]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42, stratify=target)
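
The cells above impute and scale the full dataset before splitting, so the medians, means and standard deviations are computed with the test rows included. A minimal sketch of the alternative order (split first, then fit the preprocessing on the training rows only) is shown below. The `demo` DataFrame and all `Xd_*` names are hypothetical stand-ins for illustration, not the lab's data, and this is one possible arrangement rather than the required solution.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for two numeric Spaceship Titanic columns.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Age": rng.normal(30, 10, 200),
    "Spa": rng.exponential(300, 200),
})
demo_target = (demo["Spa"] > 200).astype(int)
demo.loc[::10, "Age"] = np.nan  # inject some missing values

# Split first, so the test rows never influence the fitted statistics.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_target, test_size=0.2, random_state=42, stratify=demo_target
)

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
Xd_train_scaled = prep.fit_transform(Xd_train)  # statistics learned from train only
Xd_test_scaled = prep.transform(Xd_test)        # same statistics reused on test

print(Xd_train_scaled.shape, Xd_test_scaled.shape)
```

For tree-based ensembles the scaling itself has little effect on accuracy, but keeping the fit on the training rows makes the test score a cleaner estimate.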

**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [48]:
import sklearn
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Detect correct parameter name based on scikit-learn version
sklearn_version = tuple(map(int, sklearn.__version__.split('.')[:2]))
param_name = 'estimator' if sklearn_version >= (1, 2) else 'base_estimator'

# Bagging Classifier (with replacement)
bagging_clf = BaggingClassifier(
    **{param_name: DecisionTreeClassifier()},
    n_estimators=100,
    bootstrap=True,
    n_jobs=-1,
    random_state=42
)
bagging_clf.fit(X_train, y_train)
bagging_score = bagging_clf.score(X_test, y_test)
print(f"Bagging Classifier Test Accuracy: {bagging_score:.4f}")

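# Pasting Classifier (sampling without replacement)
# Note: with the default max_samples=1.0, each tree is trained on the full
# training set, so the trees differ only through their own internal randomness.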
pasting_clf = BaggingClassifier(
    **{param_name: DecisionTreeClassifier()},
    n_estimators=100,
    bootstrap=False,
    n_jobs=-1,
    random_state=42
)
pasting_clf.fit(X_train, y_train)
pasting_score = pasting_clf.score(X_test, y_test)
print(f"Pasting Classifier Test Accuracy: {pasting_score:.4f}")


Bagging Classifier Test Accuracy: 0.7792
Pasting Classifier Test Accuracy: 0.7309
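
With `bootstrap=False` and the default `max_samples=1.0`, every estimator draws the whole training set, which may explain why the pasting score sits well below the bagging score. Pasting is usually run with `max_samples` below 1.0 so that each tree sees a different subset. The sketch below illustrates this on synthetic data; the value `0.5`, the `make_classification` data and the `*_demo` / `Xp_*` names are assumptions for illustration, and no result for the Spaceship Titanic data is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the lab's dataset).
Xp, yp = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    Xp, yp, test_size=0.2, random_state=42, stratify=yp
)

# The base tree is passed positionally, which avoids the
# `estimator` / `base_estimator` naming difference between versions.
pasting_demo = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=0.5,   # each tree sees half of the training rows
    bootstrap=False,   # drawn without replacement, i.e. pasting
    random_state=42,
)
pasting_demo.fit(Xp_train, yp_train)

# Indices of the rows used by the first tree: 400 distinct rows out of 800.
first_sample = pasting_demo.estimators_samples_[0]
print(len(first_sample), len(set(first_sample)))
print(f"Pasting (max_samples=0.5) accuracy: {pasting_demo.score(Xp_test, yp_test):.4f}")
```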


- Random Forests

In [53]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_clf.fit(X_train, y_train)
rf_score = rf_clf.score(X_test, y_test)
print(f"Random Forest Test Accuracy: {rf_score:.4f}")

Random Forest Test Accuracy: 0.7878
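
Because each tree in a random forest is trained on a bootstrap sample, the rows left out of that sample can serve as a built-in validation set. A small sketch of this out-of-bag (OOB) estimate follows, on synthetic data; the `Xo`, `yo` and `rf_oob` names are hypothetical and the data is not the lab's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the lab's dataset).
Xo, yo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)

# oob_score=True scores each row using only the trees that did not train on it.
rf_oob = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42, n_jobs=-1
)
rf_oob.fit(Xo, yo)

print(f"OOB accuracy estimate: {rf_oob.oob_score_:.4f}")
```

The OOB estimate gives a second opinion on generalization without touching the held-out test set.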


- Gradient Boosting

In [54]:
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_clf.fit(X_train, y_train)
gb_score = gb_clf.score(X_test, y_test)
print(f"Gradient Boosting Test Accuracy: {gb_score:.4f}")

Gradient Boosting Test Accuracy: 0.8039
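
The number of boosting stages interacts with `learning_rate`, so it can be useful to look at accuracy after each stage rather than only at the end. The sketch below uses `staged_predict` on a separate validation split of synthetic data; the data, the `Xg_*` names and the choice of 200 stages are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the lab's dataset).
Xg, yg = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
Xg_train, Xg_val, yg_train, yg_val = train_test_split(
    Xg, yg, test_size=0.2, random_state=42, stratify=yg
)

gb_demo = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, random_state=42
)
gb_demo.fit(Xg_train, yg_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
val_acc = [accuracy_score(yg_val, pred) for pred in gb_demo.staged_predict(Xg_val)]
best_n = int(np.argmax(val_acc)) + 1  # stages are 1-indexed

print(f"Best validation accuracy {max(val_acc):.4f} at {best_n} trees")
```

Selecting the stage count on a validation split, not on the final test set, keeps the test score untouched by tuning.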


- Adaptive Boosting

In [55]:
from sklearn.ensemble import AdaBoostClassifier

# Adaptive Boosting Classifier
# Reuse param_name from the bagging cell so this also runs on scikit-learn < 1.2
ada_clf = AdaBoostClassifier(
    **{param_name: DecisionTreeClassifier(max_depth=1)},
    n_estimators=100,
    learning_rate=0.5,
    random_state=42
)
ada_clf.fit(X_train, y_train)
ada_score = ada_clf.score(X_test, y_test)
print(f"Adaptive Boosting Test Accuracy: {ada_score:.4f}")



Adaptive Boosting Test Accuracy: 0.7861


Which model is the best and why?

In [56]:
# Compare the test accuracies of all ensemble models
print(f"Bagging Classifier Test Accuracy: {bagging_score:.4f}")
print(f"Pasting Classifier Test Accuracy: {pasting_score:.4f}")
print(f"Random Forest Test Accuracy: {rf_score:.4f}")
print(f"Gradient Boosting Test Accuracy: {gb_score:.4f}")
print(f"Adaptive Boosting Test Accuracy: {ada_score:.4f}")

# Identify the best model
scores = {
    "Bagging": bagging_score,
    "Pasting": pasting_score,
    "Random Forest": rf_score,
    "Gradient Boosting": gb_score,
    "AdaBoost": ada_score
}
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

print(f"\nThe best model is {best_model} with a test accuracy of {best_score:.4f}.")

# Explanation
print("\nGradient Boosting achieved the highest accuracy on the test set. Boosting fits trees sequentially, each one correcting the errors of the ensemble so far, which mainly reduces bias; shallow trees and a small learning rate help keep variance in check. The margin over Random Forest is small and comes from a single split, so cross-validation would be needed to confirm the ranking.")

Bagging Classifier Test Accuracy: 0.7792
Pasting Classifier Test Accuracy: 0.7309
Random Forest Test Accuracy: 0.7878
Gradient Boosting Test Accuracy: 0.8039
Adaptive Boosting Test Accuracy: 0.7861

The best model is Gradient Boosting with a test accuracy of 0.8039.

Gradient Boosting achieved the highest accuracy on the test set. Boosting fits trees sequentially, each one correcting the errors of the ensemble so far, which mainly reduces bias; shallow trees and a small learning rate help keep variance in check. The margin over Random Forest is small and comes from a single split, so cross-validation would be needed to confirm the ranking.
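
The comparison above rests on one train/test split, and the gaps between several models are under two percentage points. One way to check whether the ranking is stable is stratified k-fold cross-validation. The sketch below compares two of the models on synthetic data; the data and the `Xc`, `yc`, `candidates` and `cv_results` names are hypothetical, and the numbers it prints say nothing about the Spaceship Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the lab's dataset).
Xc, yc = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=42
)

# Same folds for every model, so the scores are directly comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, Xc, yc, cv=cv, scoring="accuracy")
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```

If the difference in mean accuracy is smaller than the spread across folds, the two models are hard to tell apart on this data.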
