# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [27]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [28]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [29]:
# Check for missing values
spaceship.isnull().sum()

# Replace missing numeric values with the column median
spaceship.fillna(spaceship.median(numeric_only=True), inplace=True)

# Replace missing categorical values with the mode (most frequent value)
spaceship.fillna(spaceship.mode().iloc[0], inplace=True)
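
The two-step imputation above can be checked on a tiny frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not part of the Spaceship Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 60.0],
    "HomePlanet": ["Earth", None, "Earth", "Mars"],
})

# Step 1: numeric gaps get the column median (median of 20, 40, 60 is 40).
toy = toy.fillna(toy.median(numeric_only=True))

# Step 2: remaining (categorical) gaps get the most frequent value per column.
toy = toy.fillna(toy.mode().iloc[0])

print(toy)
```

The order matters: once the numeric columns are filled, the second `fillna` only has categorical gaps left to fill.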


In [30]:
# Split the 'Cabin' column into three parts (Deck, Num, Side)
spaceship[['Deck', 'Num', 'Side']] = spaceship['Cabin'].str.split('/', expand=True)

# str.split returns strings, so convert the cabin number to a numeric dtype
spaceship['Num'] = pd.to_numeric(spaceship['Num'])

# Convert booleans to 0 and 1
spaceship['CryoSleep'] = spaceship['CryoSleep'].map({True: 1, False: 0})
spaceship['VIP'] = spaceship['VIP'].map({True: 1, False: 0})
spaceship['Transported'] = spaceship['Transported'].map({True: 1, False: 0})

# Encode categorical variables (one-hot encoding)
categorical_cols = ['HomePlanet', 'Destination', 'Deck', 'Side']
spaceship = pd.get_dummies(spaceship, columns=categorical_cols, drop_first=True)

# Drop columns that are not useful as features
spaceship.drop(['Name', 'Cabin', 'PassengerId'], axis=1, inplace=True)
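
To see what the cabin split and one-hot encoding produce, here is a small sketch on three cabin strings taken from the `head()` output above. The `toy` and `toy_encoded` names are illustrative only.

```python
import pandas as pd

# Three cabin codes in the same Deck/Num/Side format as the dataset.
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/1/S", "A/0/S"]})

# Split into three string columns, then make the cabin number numeric.
toy[["Deck", "Num", "Side"]] = toy["Cabin"].str.split("/", expand=True)
toy["Num"] = toy["Num"].astype(int)

# drop_first=True drops the first category of each column (Deck 'A', Side 'P')
# so the remaining dummy columns are not perfectly collinear.
toy_encoded = pd.get_dummies(
    toy.drop(columns="Cabin"), columns=["Deck", "Side"], drop_first=True
)
print(toy_encoded.columns.tolist())
```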


In [31]:
from sklearn.preprocessing import StandardScaler

# Separate the features (X) from the target (y)
X = spaceship.drop('Transported', axis=1)
y = spaceship['Transported']

# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


In [32]:
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X_scaled, y)

# See which features were selected
# Note: the models below are trained on X_scaled (all features), not on X_selected
selected_features = X.columns[selector.get_support()]
print("Selected Features:")
print(selected_features)


Selected Features:
Index(['CryoSleep', 'RoomService', 'Spa', 'VRDeck', 'HomePlanet_Europa',
       'Destination_TRAPPIST-1e', 'Deck_B', 'Deck_C', 'Deck_E', 'Side_S'],
      dtype='object')
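
In the cells above, the scaler and the selector are fitted on the full dataset before the train/test split, so information from the test rows can influence the preprocessing. One common way to avoid this is to split first and wrap the steps in a `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the encoded features (an assumption, so the numbers it prints say nothing about the Spaceship Titanic data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=42
)

# Split first, so the test rows never influence the scaler or the selector.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Each step is fitted on the training rows only when pipe.fit is called.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_tr, y_tr)

demo_acc = pipe.score(X_te, y_te)
n_kept = int(pipe.named_steps["select"].get_support().sum())
print("Features kept:", n_kept)
print("Pipeline accuracy on held-out rows:", demo_acc)
```

With this layout the selected features are actually used by the model, because selection is a step inside the fitted pipeline.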


In [33]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score  # needed here; otherwise first imported in a later cell

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))


Random Forest Accuracy: 0.7947096032202415


**Perform Train Test Split**

In [35]:
#your code here
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)


**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [36]:
#your code here
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Bagging (sampling with replacement)
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator in scikit-learn < 1.2
    n_estimators=100,
    bootstrap=True,        # Bagging: each tree sees a bootstrap sample
    random_state=42
)
bagging_clf.fit(X_train, y_train)
y_pred_bagging = bagging_clf.predict(X_test)
bagging_acc = accuracy_score(y_test, y_pred_bagging)
print("Bagging Accuracy:", bagging_acc)

# Pasting (sampling without replacement)
# Note: with the default max_samples=1.0 and bootstrap=False, every tree is
# trained on the full training set, so this behaves much like a single tree.
pasting_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # named base_estimator in scikit-learn < 1.2
    n_estimators=100,
    bootstrap=False,       # Pasting: samples are drawn without replacement
    random_state=42
)
pasting_clf.fit(X_train, y_train)
y_pred_pasting = pasting_clf.predict(X_test)
pasting_acc = accuracy_score(y_test, y_pred_pasting)
print("Pasting Accuracy:", pasting_acc)




Bagging Accuracy: 0.7889591719378953
Pasting Accuracy: 0.7446808510638298
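
Pasting only gives the trees different views of the data when each one is trained on a proper subset of the rows. A minimal sketch, assuming synthetic data and an arbitrary `max_samples=0.6` (neither comes from the lab), that also shows the out-of-bag estimate available for bagging:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Pasting: each tree gets its own 60% subset, drawn without replacement.
# The estimator is passed positionally, which avoids the keyword rename
# (base_estimator -> estimator) between scikit-learn versions.
pasting_demo = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    max_samples=0.6,
    bootstrap=False,
    random_state=0,
)
pasting_demo.fit(X_tr, y_tr)
pasting_demo_acc = pasting_demo.score(X_te, y_te)

# Bagging: rows left out of a tree's bootstrap sample give a free
# validation estimate (oob_score_) without touching the test set.
bagging_demo = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),
    n_estimators=50,
    bootstrap=True,
    oob_score=True,
    random_state=0,
)
bagging_demo.fit(X_tr, y_tr)

print("Pasting (60% subsets) accuracy:", pasting_demo_acc)
print("Bagging out-of-bag accuracy:", bagging_demo.oob_score_)
```

The value of `max_samples` is itself a hyperparameter; 0.6 here is only an example.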


- Random Forests

In [37]:
#your code here
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=42
)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
rf_acc = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", rf_acc)


Random Forest Accuracy: 0.7947096032202415
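
A fitted random forest also exposes impurity-based feature importances, which can be compared with the `SelectKBest` ranking from earlier. The sketch below is illustrative only: it uses synthetic data and made-up feature names (`feat_0` ... `feat_5`), not the lab's columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=1
)
feature_names = [f"feat_{i}" for i in range(6)]

rf_demo = RandomForestClassifier(n_estimators=100, random_state=1)
rf_demo.fit(X_demo, y_demo)

# feature_importances_ is normalized to sum to 1 across all features.
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

In the notebook itself, the equivalent would be pairing `rf_clf.feature_importances_` with `X.columns`.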


- Gradient Boosting

In [38]:
#your code here
from sklearn.ensemble import GradientBoostingClassifier

gb_clf = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)
gb_acc = accuracy_score(y_test, y_pred_gb)
print("Gradient Boosting Accuracy:", gb_acc)


Gradient Boosting Accuracy: 0.7935595169637722
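
The number of boosting stages interacts with the learning rate, so it can help to look at accuracy after each stage instead of only the final one. A sketch using `staged_predict` on synthetic data with a separate validation split (both are assumptions; the test set should not be used for this choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=2
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=2
)

gb_demo = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=2
)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees.
val_acc = [
    accuracy_score(y_val, y_stage) for y_stage in gb_demo.staged_predict(X_val)
]
best_n = int(np.argmax(val_acc)) + 1  # stages are 1-indexed
print("Best number of trees on the validation split:", best_n)
print("Validation accuracy at that stage:", val_acc[best_n - 1])
```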


- Adaptive Boosting

In [39]:
#your code here
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    n_estimators=200,
    learning_rate=0.1,
    random_state=42
)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_test)
ada_acc = accuracy_score(y_test, y_pred_ada)
print("AdaBoost Accuracy:", ada_acc)


AdaBoost Accuracy: 0.7837837837837838


Which model is the best and why?

In [40]:
# Compare results
results = {
    'Bagging': bagging_acc,
    'Pasting': pasting_acc,
    'Random Forest': rf_acc,
    'Gradient Boosting': gb_acc,
    'AdaBoost': ada_acc
}

import pandas as pd
pd.DataFrame(results, index=['Accuracy']).T.sort_values(by='Accuracy', ascending=False)


| Model | Accuracy |
|---|---|
| Random Forest | 0.79471 |
| Gradient Boosting | 0.79356 |
| Bagging | 0.788959 |
| AdaBoost | 0.783784 |
| Pasting | 0.744681 |
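
The top two scores differ by about 0.001 on a single 80/20 split, which is too small a gap to rank the models with confidence. Cross-validation gives a mean and a spread per model. The following is a sketch on synthetic data with reduced `n_estimators` for speed (both are assumptions, so the printed ranking does not apply to the lab's data):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=3
)

models = {
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=25, random_state=3
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=3),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=3),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=3),
}

# 5-fold cross-validated accuracy: mean and standard deviation per model.
rows = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    rows[name] = {"mean": scores.mean(), "std": scores.std()}

cv_table = pd.DataFrame(rows).T.sort_values("mean", ascending=False)
print(cv_table)
```

If the gap between two means is smaller than their standard deviations, the models are hard to tell apart on accuracy alone.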
