# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [85]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [86]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same preprocessing as before:
- Feature Scaling
- Feature Selection


In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer

# Load the data
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

# Convert categorical variables to numeric ones using one-hot encoding
# Note: this also encodes high-cardinality text columns such as PassengerId, Cabin and Name
spaceship_encoded = pd.get_dummies(spaceship)

# Separate the features (X) from the target variable (y)
X = spaceship_encoded.drop('Transported', axis=1)
y = spaceship_encoded['Transported']

# Note: steps 1-3 are fit on the full dataset before the train/test split,
# so test rows influence the imputer, the scaler and the selector

# 1. Impute missing values (NaN) using the mean of each column
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# 2. Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)

# 3. Feature selection (SelectKBest keeps the highest-scoring features)
selector = SelectKBest(f_classif, k=10)  # Keep the 10 best features
X_selected = selector.fit_transform(X_scaled, y)
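The preprocessing steps above can also be wrapped in a scikit-learn `Pipeline` that is fit on the training split only, which is one common way to keep test rows out of the imputer, scaler, and selector. The sketch below is a minimal illustration under stated assumptions: it uses a small synthetic matrix (`X_demo`, `y_demo`, hypothetical names) instead of the Spaceship Titanic data, and it is not the preprocessing used for the results reported in this lab.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix (assumption, not the lab data)
rng = np.random.default_rng(42)
X_full = rng.normal(size=(200, 15))
y_demo = (X_full[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Inject roughly 5% missing values so the imputer has work to do
X_demo = X_full.copy()
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Split first, so the test rows are never seen while fitting the preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
])

# Fit on the training split only, then reuse the fitted steps on the test split
X_tr_sel = preprocess.fit_transform(X_tr, y_tr)
X_te_sel = preprocess.transform(X_te)
print(X_tr_sel.shape, X_te_sel.shape)
```

Appending a classifier as a final pipeline step would let `fit` and `predict` apply the same preprocessing automatically.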


**Perform Train Test Split**

In [3]:

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Show the first rows of the scaled, selected X_train
print(X_train[:5])

[[-5.77428657e-02 -3.40589867e-01  3.06649292e-01 -2.69022628e-01
   9.42847454e-01 -5.69867136e-01  7.73480278e-01 -7.32770025e-01
  -5.11013194e-01  6.85312647e-01]
 [-8.24922656e-01 -3.40589867e-01 -2.76663422e-01 -2.69022628e-01
   9.42847454e-01 -5.69867136e-01  7.73480278e-01 -7.32770025e-01
  -5.11013194e-01  6.85312647e-01]
 [-5.77428657e-02 -3.40589867e-01 -2.76663422e-01 -2.69022628e-01
  -1.06061696e+00  1.75479500e+00 -1.29285779e+00  1.36468464e+00
   1.95689664e+00 -1.45918801e+00]
 [-6.15691804e-01  4.30826867e-17  5.91192079e-01 -2.69022628e-01
  -1.06061696e+00 -5.69867136e-01  7.73480278e-01 -7.32770025e-01
  -5.11013194e-01  6.85312647e-01]
 [ 5.00206073e-01 -3.40589867e-01 -2.76663422e-01 -2.69022628e-01
  -1.06061696e+00  1.75479500e+00 -1.29285779e+00  1.36468464e+00
   1.95689664e+00 -1.45918801e+00]]


**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score

# Create the base classifier (decision tree)
base_model = DecisionTreeClassifier(random_state=42)

# Bagging classifier: 100 trees, each trained on a bootstrap sample
bagging_model = BaggingClassifier(base_model, n_estimators=100, random_state=42)
bagging_model.fit(X_train, y_train)

# Evaluate bagging
y_pred_bagging = bagging_model.predict(X_test)
bagging_accuracy = accuracy_score(y_test, y_pred_bagging)
print(f"Bagging accuracy: {bagging_accuracy:.4f}")

# "Pasting" classifier (uses VotingClassifier as an example ensemble)
# Note: a VotingClassifier over one tree behaves like that single tree;
# true pasting would be BaggingClassifier with bootstrap=False
pasting_model = VotingClassifier(estimators=[('tree', base_model)], voting='hard')
pasting_model.fit(X_train, y_train)

# Evaluate "pasting"
y_pred_pasting = pasting_model.predict(X_test)
pasting_accuracy = accuracy_score(y_test, y_pred_pasting)
print(f"Pasting accuracy: {pasting_accuracy:.4f}")


Bagging accuracy: 0.7550
Pasting accuracy: 0.7062
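For reference, pasting in scikit-learn is usually expressed with `BaggingClassifier` and `bootstrap=False`, so that each tree is trained on a random subset drawn without replacement. The sketch below is a minimal illustration on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `pasting_demo` and the choice `max_samples=0.8` are assumptions for the example, and its accuracy is not comparable to the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not the lab data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Pasting: each tree sees 80% of the training rows, sampled WITHOUT replacement
pasting_demo = BaggingClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=False,
    random_state=42,
)
pasting_demo.fit(X_tr, y_tr)

acc = accuracy_score(y_te, pasting_demo.predict(X_te))
print(f"Pasting accuracy (synthetic data): {acc:.4f}")
```

With `bootstrap=False` and the default `max_samples=1.0`, every tree would receive the full training set, so a value below 1.0 is what gives the trees different training subsets.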


- Random Forests

In [8]:
from sklearn.ensemble import RandomForestClassifier

# Create the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the test set and compute accuracy
y_pred_rf = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest accuracy: {rf_accuracy:.4f}")


Random Forest accuracy: 0.7487


- Gradient Boosting

In [9]:
from sklearn.ensemble import GradientBoostingClassifier

# Create the Gradient Boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Train the model
gb_model.fit(X_train, y_train)

# Predict on the test set and compute accuracy
y_pred_gb = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting accuracy: {gb_accuracy:.4f}")


Gradient Boosting accuracy: 0.7757


- Adaptive Boosting

In [10]:
from sklearn.ensemble import AdaBoostClassifier

# Create the AdaBoost model
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Train the model
ada_model.fit(X_train, y_train)

# Predict on the test set and compute accuracy
y_pred_ada = ada_model.predict(X_test)
ada_accuracy = accuracy_score(y_test, y_pred_ada)
print(f"AdaBoost accuracy: {ada_accuracy:.4f}")


AdaBoost accuracy: 0.7476


Which model is the best and why?

In [11]:
# Compare the accuracy of the models
print("\nAccuracy comparison:")
print(f"Bagging accuracy: {bagging_accuracy:.4f}")
print(f"Pasting accuracy: {pasting_accuracy:.4f}")
print(f"Random Forest accuracy: {rf_accuracy:.4f}")
print(f"Gradient Boosting accuracy: {gb_accuracy:.4f}")
print(f"AdaBoost accuracy: {ada_accuracy:.4f}")



Accuracy comparison:
Bagging accuracy: 0.7550
Pasting accuracy: 0.7062
Random Forest accuracy: 0.7487
Gradient Boosting accuracy: 0.7757
AdaBoost accuracy: 0.7476


#comment here
In terms of accuracy, the metric used here, the model with the highest value is preferable, which in this case is Gradient Boosting (0.7757). However, the final choice should balance accuracy, training time, interpretability, and stability of performance against the project's goals.
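A single train/test split says little about the stability mentioned above. One way to gauge it is k-fold cross-validation, which reports a mean accuracy and its spread across folds. The sketch below is a minimal illustration on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, `models`, and `cv_results` are assumptions for the example, `n_estimators=50` is chosen only to keep it fast, and the scores are not comparable to the figures reported in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not the lab data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=42
)

models = {
    "Bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=42), n_estimators=50, random_state=42
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=42),
}

# 5-fold cross-validation: mean accuracy and standard deviation per model
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

A model with a slightly lower mean but a much smaller spread may be the safer choice when stable performance matters more than the last fraction of a point.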