# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [1]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection
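
As a reminder of what these two steps can look like in code, here is a minimal sketch on a toy frame. The choice of `StandardScaler`, the column groupings, and the decision to drop `PassengerId` are assumptions for illustration; the previous Lab may have used a different scaler or feature set, and the toy values are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows that mimic a few Spaceship Titanic columns (values are illustrative).
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
    "HomePlanet": ["Europa", "Earth", "Europa", "Earth"],
    "Age": [39.0, 24.0, 58.0, 33.0],
    "RoomService": [0.0, 109.0, 43.0, 0.0],
})

toy_numeric_cols = ["Age", "RoomService"]  # assumed numeric features to scale
toy_categorical_cols = ["HomePlanet"]      # assumed low-cardinality categorical

scaling_preprocessor = ColumnTransformer(
    transformers=[
        # Feature scaling: zero mean and unit variance per numeric column
        ("num", StandardScaler(), toy_numeric_cols),
        # One-hot encode the categorical column
        ("cat", OneHotEncoder(handle_unknown="ignore"), toy_categorical_cols),
    ],
    remainder="drop",  # feature selection: unlisted columns (PassengerId) are dropped
)

toy_scaled = scaling_preprocessor.fit_transform(toy)
print(toy_scaled.shape)
```

Tree-based ensembles are largely insensitive to feature scaling, so the main reason to keep this step is to make the comparison with the previous Lab fair.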


In [6]:
# Inspect the data type of each column
spaceship.dtypes
# Count missing values per column
spaceship.isnull().sum()
# Drop rows with missing values
spaceship = spaceship.dropna()



**Perform Train Test Split**

In [7]:
# Split the data into training and test sets
X = spaceship.drop(columns=["Transported"])
y = spaceship["Transported"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Model Selection** - now try different ensemble methods to see whether they give a better model

- Bagging and Pasting

In [9]:
# Ensemble, base estimator, and evaluation metrics
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
# Preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Identify categorical columns
categorical_cols = X_train.select_dtypes(include=['object']).columns

# Apply one-hot encoding to categorical columns
preprocessor = ColumnTransformer(
	transformers=[
		('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
	],
	remainder='passthrough'  # Keep other columns as is
)

# Transform the training and test data
X_train_encoded = preprocessor.fit_transform(X_train)
X_test_encoded = preprocessor.transform(X_test)

# Fit the BaggingClassifier; bootstrap=False samples without replacement (pasting)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=False, n_jobs=-1)
bagging.fit(X_train_encoded, y_train)
y_pred = bagging.predict(X_test_encoded)

# Print results
print("Pasting")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Pasting
Accuracy: 0.7859304084720121
              precision    recall  f1-score   support

       False       0.79      0.77      0.78       653
        True       0.78      0.80      0.79       669

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322
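
The cell above only runs pasting (`bootstrap=False`). To make the difference from bagging concrete, the sketch below trains both variants side by side. It uses a synthetic dataset from `make_classification` rather than the Spaceship Titanic data, and the `demo_`/`Xd_`/`yd_` names and the smaller `n_estimators` are illustrative choices, so the scores it prints say nothing about this Lab's data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scores = {}
for label, use_bootstrap in [("Bagging", True), ("Pasting", False)]:
    model = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=50,
        max_samples=100,          # each tree sees 100 training rows
        bootstrap=use_bootstrap,  # True: with replacement; False: without
        random_state=42,
    )
    model.fit(Xd_train, yd_train)
    demo_scores[label] = accuracy_score(yd_test, model.predict(Xd_test))
    print(label, "accuracy:", demo_scores[label])
```

In the notebook itself, switching to bagging would only require `bootstrap=True` in the existing `BaggingClassifier` call.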



- Random Forests

In [11]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
# Fit the RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=500, max_features='sqrt', n_jobs=-1)
random_forest.fit(X_train_encoded, y_train)
y_pred = random_forest.predict(X_test_encoded)
# Print results
print("Random Forest")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


- Gradient Boosting

In [14]:
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
# Fit the GradientBoostingClassifier
gradient_boosting = GradientBoostingClassifier(n_estimators=500, max_depth=3, learning_rate=0.1)
gradient_boosting.fit(X_train_encoded, y_train)
y_pred = gradient_boosting.predict(X_test_encoded)
# Print results
print("Gradient Boosting")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))



Gradient Boosting
Accuracy: 0.800302571860817
              precision    recall  f1-score   support

       False       0.83      0.75      0.79       653
        True       0.78      0.85      0.81       669

    accuracy                           0.80      1322
   macro avg       0.80      0.80      0.80      1322
weighted avg       0.80      0.80      0.80      1322
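
The value `n_estimators=500` above is fixed by hand. One way to check whether fewer boosting stages would do is `staged_predict`, which yields predictions after each stage. The sketch below is a possible approach on synthetic data, not part of the original Lab; the `gb_demo`, `X_tr`, and `X_val` names are hypothetical, and the stage count should be chosen on a validation split rather than the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Spaceship Titanic dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# Validation accuracy after 1, 2, ..., n_estimators boosting stages
staged_accuracy = [
    accuracy_score(y_val, y_stage) for y_stage in gb_demo.staged_predict(X_val)
]
best_n_estimators = int(np.argmax(staged_accuracy)) + 1
print("Best number of stages on the validation split:", best_n_estimators)
```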



- Adaptive Boosting

In [13]:
# Adaptive Boosting
from sklearn.ensemble import AdaBoostClassifier
# Fit the AdaBoostClassifier
ada_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=500, learning_rate=0.05)
ada_boost.fit(X_train_encoded, y_train)
y_pred = ada_boost.predict(X_test_encoded)
# Print results
print("AdaBoost")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


AdaBoost
Accuracy: 0.7450832072617246
              precision    recall  f1-score   support

       False       0.71      0.83      0.76       653
        True       0.80      0.66      0.72       669

    accuracy                           0.75      1322
   macro avg       0.75      0.75      0.74      1322
weighted avg       0.75      0.75      0.74      1322



Which model is the best and why?

In [None]:
# The best model is Gradient Boosting: it has the highest accuracy of the models
# evaluated above (0.800, vs. 0.786 for Pasting and 0.745 for AdaBoost).
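
A single train/test split can make small accuracy gaps look more meaningful than they are. As a hedged sketch of a more robust comparison, the block below scores the same four model families with 5-fold cross-validation. It runs on synthetic data with reduced `n_estimators` so it finishes quickly; the `candidates` and `cv_results` names are illustrative, and the printed scores do not describe the Spaceship Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)

candidates = {
    "Pasting": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=30, max_samples=100,
        bootstrap=False, random_state=42,
    ),
    "Random Forest": RandomForestClassifier(n_estimators=30, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=30, random_state=42),
    "AdaBoost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=30,
        learning_rate=0.05, random_state=42,
    ),
}

cv_results = {}
for name, model in candidates.items():
    # Accuracy on each of the 5 held-out folds
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the mean scores of two models differ by less than their fold-to-fold standard deviation, the ranking between them is not well supported.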