# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [4]:
#Libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error # root_mean_squared_error

In [5]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as in the previous lab:
- Feature Scaling
- Feature Selection


**Perform Train Test Split**

In [6]:
#your code here
features = spaceship.drop(columns=["Transported","PassengerId","Name"])  
target = spaceship['Transported']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)

In [7]:
# Identify numerical columns for scaling
num_cols = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
cat_cols = ["HomePlanet", "CryoSleep", "Cabin", "Destination","VIP"]

# Numerical columns
scaler = MinMaxScaler()
scaler.fit(X_train[num_cols])
X_train_num_scaled_np = scaler.transform(X_train[num_cols])
X_test_num_scaled_np = scaler.transform(X_test[num_cols])

X_train_num_scaled_df = pd.DataFrame(X_train_num_scaled_np, columns=X_train[num_cols].columns, index=X_train.index)
X_test_num_scaled_df = pd.DataFrame(X_test_num_scaled_np, columns=X_test[num_cols].columns, index=X_test.index)

# Categorical columns
category_values = [spaceship[col].unique() for col in cat_cols]
ohe = OneHotEncoder(sparse_output=False, categories=category_values)
ohe.fit(X_train[cat_cols])
X_train_cat_ohe_np = ohe.transform(X_train[cat_cols]) 
X_test_cat_ohe_np = ohe.transform(X_test[cat_cols])

X_train_cat_ohe_df = pd.DataFrame(X_train_cat_ohe_np, columns=ohe.get_feature_names_out(), index=X_train.index)
X_test_cat_ohe_df = pd.DataFrame(X_test_cat_ohe_np, columns=ohe.get_feature_names_out(), index=X_test.index)

X_train_processed = pd.concat([X_train_num_scaled_df, X_train_cat_ohe_df], axis=1)
X_test_processed = pd.concat([X_test_num_scaled_df, X_test_cat_ohe_df], axis=1)

**Model Selection** - now apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [8]:
#your code here
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100, # number of models to use
                               max_samples = 1000)
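
The cell above leaves `bootstrap` at its default, so it performs bagging (rows sampled with replacement). The heading also mentions pasting, which `BaggingRegressor` supports through `bootstrap=False`. The following is a minimal sketch of the difference on synthetic stand-in data; the names `X_demo`, `y_demo`, `bagging_demo` and `pasting_demo`, and the smaller hyperparameters, are illustrative assumptions and not part of the lab.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 500 rows, 4 numeric features, 0/1 target
rng = np.random.default_rng(0)
X_demo = rng.random((500, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(float)

# bootstrap=True (the default): each tree sees rows drawn WITH replacement (bagging)
bagging_demo = BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                                n_estimators=20, max_samples=100,
                                bootstrap=True, random_state=0)

# bootstrap=False: each tree sees rows drawn WITHOUT replacement (pasting)
pasting_demo = BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                                n_estimators=20, max_samples=100,
                                bootstrap=False, random_state=0)

bagging_demo.fit(X_demo, y_demo)
pasting_demo.fit(X_demo, y_demo)

# Scores on the training data, only to show that both variants run
print(f"Bagging R2 {bagging_demo.score(X_demo, y_demo): .2f}")
print(f"Pasting R2 {pasting_demo.score(X_demo, y_demo): .2f}")
```

On the lab data, the same switch would amount to adding `bootstrap=False` to the `BaggingRegressor` call and comparing the test metrics of both variants.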

In [9]:
bagging_reg.fit(X_train_num_scaled_df, y_train)

In [10]:
y_pred_test_bag = bagging_reg.predict(X_test_num_scaled_df)

# scikit-learn metrics expect (y_true, y_pred) in that order
print(f"MAE {mean_absolute_error(y_test, y_pred_test_bag): .2f}")
print(f"MSE {mean_squared_error(y_test, y_pred_test_bag): .2f}")
#print(f"RMSE {root_mean_squared_error(y_test, y_pred_test_bag): .2f}")
print(f"R2 score {bagging_reg.score(X_test_num_scaled_df, y_test): .2f}")

MAE  0.34
MSE  0.16
R2 score  0.36
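
The RMSE line is commented out, possibly because `root_mean_squared_error` is only available from scikit-learn 1.4 onward. A version-independent way to get the same number is to take the square root of the MSE. The sketch below uses small made-up arrays (`y_true_demo`, `y_pred_demo`) purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only (assumption, not lab data)
y_true_demo = np.array([1.0, 0.0, 1.0, 1.0])
y_pred_demo = np.array([0.8, 0.2, 0.6, 1.0])

# RMSE is the square root of the MSE, so it works on any scikit-learn version
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
rmse_demo = np.sqrt(mse_demo)

print(f"MSE {mse_demo: .2f}")
print(f"RMSE {rmse_demo: .2f}")
```

Applied to the bagging result above, an MSE of about 0.16 corresponds to an RMSE of roughly 0.4.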


- Random Forests

In [11]:
#your code here
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

In [15]:
#forest.fit(X_train_num_scaled_df, y_train)

ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
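
As the error message suggests, one way around this is to impute the missing values before fitting. Below is a minimal sketch using `SimpleImputer` with a median strategy on synthetic stand-in data; the frame `X_demo`, the injected NaNs, and the choice of median are assumptions for illustration. If the previous lab handled missing values differently (for example by dropping rows), use that approach instead to keep the comparison fair.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer

# Synthetic stand-in for the scaled numeric features (assumption)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((300, 3)), columns=["Age", "Spa", "VRDeck"])
y_demo = (X_demo["Spa"] > 0.5).astype(float)
X_demo.iloc[::10, 0] = np.nan  # inject missing values into "Age"

X_train_demo, X_test_demo = X_demo.iloc[:240], X_demo.iloc[240:]
y_train_demo, y_test_demo = y_demo.iloc[:240], y_demo.iloc[240:]

# Fit the imputer on the training split only, then apply it to both splits
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train_demo)
X_test_imp = imputer.transform(X_test_demo)

forest_demo = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=0)
forest_demo.fit(X_train_imp, y_train_demo)

print(f"R2 score {forest_demo.score(X_test_imp, y_test_demo): .2f}")
```

Fitting the imputer on the training split only mirrors how the `MinMaxScaler` is fitted earlier and avoids leaking test information into preprocessing.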

- Gradient Boosting

In [12]:
#your code here
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

In [13]:
gb_reg.fit(X_train_num_scaled_df, y_train)

ValueError: Input X contains NaN.
GradientBoostingRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

- Adaptive Boosting

In [None]:
#your code here
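
This cell is left empty in the notebook. One possible starting point, sketched here on synthetic data with no missing values, is `AdaBoostRegressor` with a shallow decision tree as the base learner. The data, the names `ada_demo` and `X_demo`, and the hyperparameters are illustrative assumptions; on the lab data, missing values would likely need to be handled first, as with the models above.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data without missing values (assumption)
rng = np.random.default_rng(0)
X_demo = rng.random((400, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 2] > 1.0).astype(float)

X_train_demo, X_test_demo = X_demo[:320], X_demo[320:]
y_train_demo, y_test_demo = y_demo[:320], y_demo[320:]

# Each new tree gives more weight to the samples the previous trees predicted poorly
ada_demo = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                             n_estimators=50, random_state=0)
ada_demo.fit(X_train_demo, y_train_demo)

y_pred_demo = ada_demo.predict(X_test_demo)
print(f"R2 score {ada_demo.score(X_test_demo, y_test_demo): .2f}")
```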


Which model is the best and why?

In [None]:
#comment here
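
To answer this question, it helps to evaluate every model on the same test split with the same metrics. The following sketch collects MAE, MSE and R2 into one table; it runs on synthetic data, and the dictionary `models_demo` and its hyperparameters are assumptions, not the lab's results. Since `Transported` is a boolean target, the classifier counterparts of these estimators scored with accuracy could also be a reasonable basis for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.random((400, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(float)
X_tr, X_te, y_tr, y_te = X_demo[:320], X_demo[320:], y_demo[:320], y_demo[320:]

# Hypothetical model line-up; swap in the estimators fitted in the cells above
models_demo = {
    "Bagging": BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                                n_estimators=50, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, max_depth=5,
                                           random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                                   random_state=0),
    "AdaBoost": AdaBoostRegressor(DecisionTreeRegressor(max_depth=3),
                                  n_estimators=50, random_state=0),
}

rows = []
for name, model in models_demo.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    rows.append({"model": name,
                 "MAE": mean_absolute_error(y_te, y_pred),
                 "MSE": mean_squared_error(y_te, y_pred),
                 "R2": r2_score(y_te, y_pred)})

# One row per model, best R2 first
results_demo = (pd.DataFrame(rows)
                .set_index("model")
                .sort_values("R2", ascending=False))
print(results_demo.round(2))
```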