# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [8]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [9]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [10]:
# CLEANING

# Drop missing values
print("\nbefore:", spaceship.isnull().sum().sum())
spaceship_clean = spaceship.dropna()
print("after :", spaceship_clean.isnull().sum().sum())

# Reduce Cabin granularity: keep only the first character (the deck letter)
spaceship_clean = spaceship_clean.copy()
spaceship_clean['Cabin'] = spaceship_clean['Cabin'].str[0]

# Drop PassengerId and Name
spaceship_clean = spaceship_clean.drop(columns=["PassengerId", "Name"])

# One-Hot Encoding
spaceship_clean = pd.get_dummies(
    spaceship_clean, 
    columns=['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'],
    drop_first=True
)

# Split into features X and target y
X = spaceship_clean.drop(columns=["Transported"])
y = spaceship_clean["Transported"].astype(int)


before: 2324
after : 0


**Perform Train Test Split**

In [11]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print("\nX_train shape :", X_train.shape)
print("X_test shape :", X_test.shape)


X_train shape : (5284, 19)
X_test shape : (1322, 19)


In [14]:
# FEATURE SCALING

# Instantiate the scaler and fit it on the training data only
scaler = StandardScaler()
scaler.fit(X_train)

# TRANSFORM on train and test
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# converting into DataFrame
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

In [15]:
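# FEATURE SELECTION

# Attach the target to the scaled training features (training data only)
df_with_target = X_train_scaled.copy()
df_with_target['Transported'] = y_train.values

# Absolute Pearson correlation of each feature with the target, strongest first
correlations = df_with_target.corr()['Transported'].drop('Transported')
correlations_abs = correlations.abs().sort_values(ascending=False)
print(correlations_abs)

# Keep the features whose absolute correlation exceeds the threshold
threshold = 0.1
selected_features = correlations_abs[correlations_abs > threshold].index.tolist()
print(selected_features)

# Apply the same selection to train and test
X_train_selected = X_train_scaled[selected_features]
X_test_selected = X_test_scaled[selected_features]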

CryoSleep_True               0.461425
RoomService                  0.244292
Spa                          0.221505
VRDeck                       0.204573
HomePlanet_Europa            0.179831
Cabin_B                      0.140558
Destination_TRAPPIST-1e      0.115129
Cabin_C                      0.112810
Cabin_F                      0.093712
Cabin_E                      0.087992
Age                          0.085729
FoodCourt                    0.044343
VIP_True                     0.040641
Cabin_D                      0.035937
HomePlanet_Mars              0.020240
Cabin_G                      0.014621
ShoppingMall                 0.008824
Destination_PSO J318.5-22    0.000671
Cabin_T                      0.000118
Name: Transported, dtype: float64
['CryoSleep_True', 'RoomService', 'Spa', 'VRDeck', 'HomePlanet_Europa', 'Cabin_B', 'Destination_TRAPPIST-1e', 'Cabin_C']
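
The correlation filter above can be wrapped in a small reusable function. The sketch below is only an illustration: the helper name `select_by_correlation` and the toy columns are hypothetical, and the data is randomly generated rather than taken from the Spaceship Titanic set.

```python
import numpy as np
import pandas as pd


def select_by_correlation(X: pd.DataFrame, y: pd.Series, threshold: float = 0.1) -> list:
    """Return the columns of X whose absolute Pearson correlation with y
    exceeds `threshold`, ordered from strongest to weakest."""
    # corrwith aligns X and y on their index, so both must share the same index
    corr = X.corrwith(y).abs().sort_values(ascending=False)
    return corr[corr > threshold].index.tolist()


# Toy data (assumption): one column related to the target, one pure noise column
rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=n)
toy_X = pd.DataFrame({
    "informative": signal + rng.normal(scale=0.5, size=n),
    "noise": rng.normal(size=n),
})
toy_y = pd.Series((signal > 0).astype(int))

kept = select_by_correlation(toy_X, toy_y, threshold=0.1)
print(kept)
```

In the lab itself, the equivalent call would be `select_by_correlation(X_train_scaled, y_train)`, since both objects keep the index produced by `train_test_split`.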


In [16]:
print(f"\nShape before : {X_train_scaled.shape}")
print(f"Shape after : {X_train_selected.shape}")


Shape before : (5284, 19)
Shape after : (5284, 8)


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

In [17]:
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier
)
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

results = {}

- Bagging and Pasting

In [18]:
bagging = BaggingClassifier(n_estimators=100, random_state=42)
bagging.fit(X_train_selected, y_train)
results['Bagging'] = bagging.score(X_test_selected, y_test)
print(f"Accuracy : {results['Bagging']*100:.2f}%")

Accuracy : 76.48%
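
The cell above uses the default `bootstrap=True`, which is bagging (sampling with replacement). Pasting draws each subsample without replacement, which in scikit-learn corresponds to `bootstrap=False` together with `max_samples` below 1.0. The sketch below compares the two on synthetic data from `make_classification`; the `*_toy` names and the value `max_samples=0.8` are assumptions chosen for illustration, not settings from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the Spaceship Titanic features
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Bagging: each tree is trained on a sample drawn WITH replacement
bagging_toy = BaggingClassifier(
    n_estimators=50, max_samples=0.8, bootstrap=True, random_state=42
)
# Pasting: each tree is trained on a sample drawn WITHOUT replacement.
# max_samples must stay below 1.0, otherwise every tree sees the full training set.
pasting_toy = BaggingClassifier(
    n_estimators=50, max_samples=0.8, bootstrap=False, random_state=42
)

toy_scores = {}
for name, model in [("Bagging", bagging_toy), ("Pasting", pasting_toy)]:
    model.fit(X_tr, y_tr)
    toy_scores[name] = model.score(X_te, y_te)
    print(f"{name:8s}: {toy_scores[name]*100:.2f}%")
```

To try pasting in the lab, the same two arguments can be added to the `BaggingClassifier` call and fitted on `X_train_selected`.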


- Random Forests

In [20]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_selected, y_train)
results['Random Forest'] = rf.score(X_test_selected, y_test)
print(f"Accuracy : {results['Random Forest']*100:.2f}%")

Accuracy : 77.16%
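
A fitted random forest exposes impurity-based feature importances through `feature_importances_`, which can help explain what drives the predictions. The following is a minimal sketch on synthetic data; the feature names are placeholders, and in the lab the index would be `selected_features` with the fitted `rf` model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_toy.shape[1])]  # placeholder names

rf_toy = RandomForestClassifier(n_estimators=100, random_state=42)
rf_toy.fit(X_toy, y_toy)

# Impurity-based importances are normalised to sum to 1 across features
importances = pd.Series(
    rf_toy.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances are computed on the training data and can favour features with many distinct values, so they are best read as a rough ranking.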


- Gradient Boosting

In [19]:
gb = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb.fit(X_train_selected, y_train)
results['Gradient Boosting'] = gb.score(X_test_selected, y_test)
print(f"Accuracy : {results['Gradient Boosting']*100:.2f}%")

Accuracy : 78.29%
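
Because gradient boosting adds trees one at a time, the accuracy after each stage can be inspected with `staged_predict` instead of refitting the model for every value of `n_estimators`. This sketch uses synthetic data and a separate validation split (both assumptions), since picking the number of trees on the test set would bias the final score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

gb_toy = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_toy.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., n_estimators trees
staged_acc = [accuracy_score(y_val, pred) for pred in gb_toy.staged_predict(X_val)]

# Number of trees with the highest validation accuracy (first one in case of ties)
best_n = max(range(len(staged_acc)), key=staged_acc.__getitem__) + 1
print(f"Best number of trees: {best_n} ({staged_acc[best_n - 1]*100:.2f}%)")
```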


- Adaptive Boosting

In [21]:
ada = AdaBoostClassifier(n_estimators=100, random_state=42)
ada.fit(X_train_selected, y_train)
results['AdaBoost'] = ada.score(X_test_selected, y_test)
print(f"Accuracy : {results['AdaBoost']*100:.2f}%")

Accuracy : 76.17%
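
AdaBoost is sensitive to its `learning_rate`, which scales the contribution of each weak learner. The sketch below tries a few values on synthetic data; the grid `(0.1, 0.5, 1.0)` and the validation split are assumptions for illustration, not tuned settings for this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

ada_scores = {}
for lr in (0.1, 0.5, 1.0):
    # The default weak learner is a depth-1 decision tree (a "stump")
    model = AdaBoostClassifier(n_estimators=100, learning_rate=lr, random_state=42)
    model.fit(X_tr, y_tr)
    ada_scores[lr] = model.score(X_val, y_val)
    print(f"learning_rate={lr}: {ada_scores[lr]*100:.2f}%")
```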


Which model is the best and why?

In [22]:
for model, accuracy in results.items():
    print(f"{model:20s} : {accuracy*100:.2f}%")

Bagging              : 76.48%
Gradient Boosting    : 78.29%
Random Forest        : 77.16%
AdaBoost             : 76.17%
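
On this single train/test split, Gradient Boosting has the highest test accuracy (78.29%), followed by Random Forest (77.16%), Bagging (76.48%) and AdaBoost (76.17%). The gaps are only one to two percentage points on 1,322 test rows, so they may partly reflect the particular split. Cross-validation gives a more stable comparison. The sketch below shows one way to do it on synthetic data; the `models` dictionary, the smaller `n_estimators=50` and the 5-fold setting are assumptions made to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=42
)

models = {
    "Bagging": BaggingClassifier(n_estimators=50, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

cv_results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy: mean and spread across folds
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name:20s} : {scores.mean()*100:.2f}% (+/- {scores.std()*100:.2f})")

# Model with the highest mean cross-validated accuracy
best_name = max(cv_results, key=lambda k: cv_results[k][0])
print("Best on this toy data:", best_name)
```

In the lab, the same loop can be run on `X_train_selected` and `y_train`, keeping the test set for a final check of the chosen model.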
