# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this lab, try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous lab.

In [1]:
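# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler  # needed by the scaling cells below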

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [5]:
#your code here
spaceship_cleaned = spaceship.dropna()  # first, clean the data a little by dropping rows with NaN

In [40]:
features = spaceship_cleaned.select_dtypes(include=['int64', 'float64']) # selecting numerical columns as features

target = spaceship_cleaned['Transported'] # setting the Transported column as target

#preparing the sets:
X = features
y = target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

**Standardization:** Centers each feature at zero mean and scales it to unit standard deviation.

In [41]:
scaler = StandardScaler()

scaler.fit(X_train)

In [58]:
# standardized data (standardized and normalized versions are kept separately for later comparison)
X_train_standarized_np = scaler.transform(X_train)
X_test_standarized_np = scaler.transform(X_test)

X_train_standarized_df = pd.DataFrame(X_train_standarized_np, columns = X_train.columns, index=X_train.index)
X_test_standarized_df  = pd.DataFrame(X_test_standarized_np, columns = X_test.columns, index=X_test.index)

In [59]:
X_train_standarized_df.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,5284.0,5284.0,5284.0,5284.0,5284.0,5284.0
mean,-1.035424e-16,-3.361765e-17,1.3447060000000001e-17,5.378825e-18,6.723531e-18,-3.2272950000000004e-17
std,1.000095,1.000095,1.000095,1.000095,1.000095,1.000095
min,-1.979531,-0.3470463,-0.2820994,-0.3058919,-0.271543,-0.2711225
25%,-0.673254,-0.3470463,-0.2820994,-0.3058919,-0.271543,-0.2711225
50%,-0.1232425,-0.3470463,-0.2820994,-0.3058919,-0.271543,-0.2711225
75%,0.6330232,-0.2709066,-0.2315362,-0.251515,-0.2172822,-0.2236418
max,3.451832,15.06736,17.24626,20.51536,19.02807,17.28467
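
The `std` row above reads 1.000095 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With 5284 rows the two differ by a factor of sqrt(5284/5283). A minimal sketch on a toy column (the values are made up for illustration and are not the lab data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column, made up for illustration (not taken from the dataset)
toy = pd.DataFrame({"Age": [16.0, 24.0, 33.0, 39.0, 58.0]})

scaled = StandardScaler().fit_transform(toy)
scaled_df = pd.DataFrame(scaled, columns=toy.columns)

# Manual z-score using the population standard deviation (ddof=0)
manual = (toy["Age"] - toy["Age"].mean()) / toy["Age"].std(ddof=0)

pop_std = scaled_df["Age"].std(ddof=0)     # the quantity StandardScaler sets to 1
sample_std = scaled_df["Age"].std(ddof=1)  # the quantity describe() prints

# Ratio between the two conventions for 5284 training rows
ratio_5284 = np.sqrt(5284 / 5283)
print(pop_std, sample_std, ratio_5284)
```

The non-zero means in the test-set summary have a similar explanation: the scaler was fitted on the training rows only, so the test rows are shifted by the training mean, not by their own.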


In [63]:
X_test_standarized_df.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,1322.0,1322.0,1322.0,1322.0,1322.0,1322.0
mean,0.034855,-0.002727,-0.002493,-0.014065,-0.009107,-0.04434
std,0.995475,1.011091,0.932751,0.891851,0.92424,0.855875
min,-1.979531,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123
25%,-0.604503,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123
50%,-0.123243,-0.347046,-0.282099,-0.305892,-0.271543,-0.271123
75%,0.633023,-0.270907,-0.248734,-0.270207,-0.21233,-0.236807
max,3.451832,12.994498,9.134403,9.808212,12.932775,14.66891


**Normalization:** Scales data to fit into a given range, usually [0, 1].

In [42]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [60]:
# here we have the data normalized:
X_train_norm_np = normalizer.transform(X_train)
X_test_norm_np = normalizer.transform(X_test)

In [61]:
X_train_norm_df = pd.DataFrame(X_train_norm_np, columns = X_train.columns, index=X_train.index)
X_train_norm_df.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
3432,0.405063,0.0,0.0,0.0,0.0,0.0
7312,0.050633,0.0,0.0,0.0,0.0,0.0
2042,0.379747,0.0,0.007916,0.0,0.051276,0.0
4999,0.21519,0.00131,0.0,0.046111,0.016378,4.9e-05
5755,0.329114,0.0,0.0,0.0,0.0,0.0


In [62]:
X_test_norm_df = pd.DataFrame(X_test_norm_np, columns = X_test.columns, index=X_test.index)
X_test_norm_df.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
2453,0.632911,0.0,0.0,0.0,0.0,0.0
1334,0.227848,0.0,0.0,0.0,0.0,0.0
8272,0.189873,0.0,0.0,0.0,0.0,0.0
5090,0.658228,0.0,0.0,0.0,0.0,0.0
4357,0.78481,0.0,0.054775,0.0,0.07774,0.0
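
`MinMaxScaler` maps each feature with x' = (x - min) / (max - min), where min and max come from the data passed to `fit`. Because the scaler here is fitted on the training set only, test values outside the training range can land outside [0, 1]. A small sketch with made-up numbers (not the lab data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values for illustration
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

mm = MinMaxScaler().fit(train)        # learns min=10 and max=30 from train only
train_scaled = mm.transform(train)    # stays inside [0, 1]
test_scaled = mm.transform(test)      # 5 and 40 fall outside the training range

# Same computation by hand, using the training min and max
manual = (test - train.min()) / (train.max() - train.min())
print(test_scaled.ravel())
```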


**Perform Train Test Split**

In [43]:
# standardized data (standardized and normalized versions are kept separately for later comparison)
X_train_standarized_np = scaler.transform(X_train)
X_test_standarized_np = scaler.transform(X_test)

#creating DataFrames:
X_train_standarized_df = pd.DataFrame(X_train_standarized_np, columns = X_train.columns, index=X_train.index)
X_test_standarized_df  = pd.DataFrame(X_test_standarized_np, columns = X_test.columns, index=X_test.index)

In [44]:
#your code here
# here we have the data normalized:
X_train_norm_np = normalizer.transform(X_train)
X_test_norm_np = normalizer.transform(X_test)

#creating DataFrames:
X_train_norm_df = pd.DataFrame(X_train_norm_np, columns = X_train.columns, index=X_train.index)
X_test_norm_df = pd.DataFrame(X_test_norm_np, columns = X_test.columns, index=X_test.index)

**Model Selection** - now you will try to apply different ensemble methods in order to get a better model

- Bagging and Pasting

In [45]:
#your code here
# Transported is a boolean target; the regressors below treat it as 0/1
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, root_mean_squared_error

In [46]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100,   # number of trees in the ensemble
                               max_samples=1000)   # rows drawn (with replacement by default) to train each tree

In [47]:
bagging_reg.fit(X_train_norm_df, y_train) # training the model

In [48]:
y_pred_test_bag = bagging_reg.predict(X_test_norm_df)

print(f"MAE {mean_absolute_error(y_pred_test_bag, y_test): .2f}")
print(f"MSE {mean_squared_error(y_pred_test_bag, y_test): .2f}")
print(f"RMSE {root_mean_squared_error(y_pred_test_bag, y_test): .2f}")
print(f"R2 score {bagging_reg.score(X_test_norm_df, y_test): .2f}")

MAE  0.32
MSE  0.16
RMSE  0.40
R2 score  0.35
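
The cell above covers bagging only, since `BaggingRegressor` samples with replacement by default. Pasting uses the same class with `bootstrap=False`, so each tree sees a subset drawn without replacement. The sketch below compares the two on synthetic data from `make_regression`; the data, the names `X_demo`, `y_demo`, `scores`, and the smaller settings are assumptions for illustration, so the scores say nothing about the Spaceship Titanic results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the lab dataset)
X_demo, y_demo = make_regression(n_samples=600, n_features=6,
                                 noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.2, random_state=0)

scores = {}
for name, use_bootstrap in [("bagging", True), ("pasting", False)]:
    model = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                             n_estimators=50,
                             max_samples=200,          # rows per tree
                             bootstrap=use_bootstrap,  # True: with replacement
                             random_state=0)
    model.fit(Xtr, ytr)
    scores[name] = model.score(Xte, yte)  # R2 on the held-out rows

print(scores)
```

To try pasting in this lab, the same `bootstrap=False` argument could be added to `bagging_reg` and the metrics cell rerun.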


- Random Forests

In [49]:
#your code here
# initialize random forest:
forest = RandomForestRegressor(n_estimators=100,
                             max_depth=20)

In [50]:
forest.fit(X_train_norm_df, y_train) # training the model

In [51]:
y_pred_test_rf = forest.predict(X_test_norm_df)

print(f"MAE, {mean_absolute_error(y_pred_test_rf, y_test): .2f}")
print(f"MSE, {mean_squared_error(y_pred_test_rf, y_test): .2f}")
print(f"RMSE, {root_mean_squared_error(y_pred_test_rf, y_test): .2f}")
print(f"R2 score, {forest.score(X_test_norm_df, y_test): .2f}")

MAE,  0.31
MSE,  0.16
RMSE,  0.41
R2 score,  0.34


- Gradient Boosting

In [52]:
#your code here
# Initialize a Gradient Boosting model:
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

In [53]:
gb_reg.fit(X_train_norm_df, y_train) # training the model

In [54]:
y_pred_test_gb = gb_reg.predict(X_test_norm_df)

print(f"MAE, {mean_absolute_error(y_pred_test_gb, y_test): .2f}")
print(f"MSE, {mean_squared_error(y_pred_test_gb, y_test): .2f}")
print(f"RMSE, {root_mean_squared_error(y_pred_test_gb, y_test): .2f}")
print(f"R2 score, {gb_reg.score(X_test_norm_df, y_test): .2f}")

MAE,  0.31
MSE,  0.22
RMSE,  0.47
R2 score,  0.12
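
Gradient boosting is usually built from shallow trees (the default in `GradientBoostingRegressor` is `max_depth=3`), so `max_depth=20` may be one reason for the low R2 here. That is a hypothesis, not something the output above proves. The sketch below shows how the two depths could be compared; it uses synthetic data from `make_regression` and made-up names (`X_gb`, `y_gb`, `depth_scores`), so its scores do not carry over to this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the lab dataset)
X_gb, y_gb = make_regression(n_samples=500, n_features=6,
                             noise=20.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_gb, y_gb,
                                      test_size=0.2, random_state=0)

depth_scores = {}
for depth in (3, 20):
    gb = GradientBoostingRegressor(max_depth=depth,
                                   n_estimators=50,
                                   random_state=0)
    gb.fit(Xtr, ytr)
    depth_scores[depth] = gb.score(Xte, yte)  # R2 on the held-out rows

print(depth_scores)
```

Running the same loop with `X_train_norm_df` and `y_train` would show whether a shallower tree helps on the lab data.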


- Adaptive Boosting

In [55]:
#your code here
# Initialize AdaBoost model:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20),
                            n_estimators=100)

In [56]:
ada_reg.fit(X_train_norm_df, y_train) # training the model

In [57]:
y_pred_test_ada = ada_reg.predict(X_test_norm_df)

print(f"MAE, {mean_absolute_error(y_pred_test_ada, y_test): .2f}")
print(f"MSE, {mean_squared_error(y_pred_test_ada, y_test): .2f}")
print(f"RMSE, {root_mean_squared_error(y_pred_test_ada, y_test): .2f}")
print(f"R2 score, {ada_reg.score(X_test_norm_df, y_test): .2f}")

MAE,  0.35
MSE,  0.23
RMSE,  0.48
R2 score,  0.06


Which model is the best and why?

In [None]:
#comment here
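# Among the ensembles above, bagging has the best test scores (R2 0.35, RMSE 0.40),
# with the random forest very close (R2 0.34); gradient boosting (R2 0.12) and
# AdaBoost (R2 0.06) are clearly worse.
# KNN from past labs had better results on accuracy. Note that the scores here are
# regression metrics, so they are not directly comparable with an accuracy score.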
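
Because `Transported` is a True/False target, an accuracy figure comparable with the KNN result needs a classifier rather than a regressor. The sketch below shows one way to do that with `BaggingClassifier` and `accuracy_score`. It runs on synthetic data from `make_classification`, and the names `X_clf`, `y_clf`, `bag_clf` are assumptions for illustration, so the printed accuracy is not a result for this lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (assumption: not the lab dataset)
X_clf, y_clf = make_classification(n_samples=800, n_features=6,
                                   n_informative=4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_clf, y_clf,
                                      test_size=0.2, random_state=0)

bag_clf = BaggingClassifier(DecisionTreeClassifier(max_depth=20),
                            n_estimators=100,
                            max_samples=0.5,   # each tree sees half of the training rows
                            random_state=0)
bag_clf.fit(Xtr, ytr)

pred = bag_clf.predict(Xte)
acc = accuracy_score(yte, pred)  # fraction of correctly classified rows
print(f"Accuracy {acc: .2f}")
```

With the lab data, `X_train_norm_df`, `y_train`, `X_test_norm_df` and `y_test` would take the place of the synthetic arrays.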

Why Bagging and Pasting Might Work Best:

**Variance Reduction:**

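Bagging reduces variance by aggregating the predictions of many models, each trained on a random subset of the training data (sampled with replacement for bagging, without replacement for pasting). Averaging smooths out the errors of the individual models, which makes the ensemble more robust against overfitting. This is particularly useful when the base model, such as a deep decision tree, has high variance.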
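
The averaging effect can be illustrated with plain numbers. The sketch below assumes an idealized case in which 100 models make independent errors with variance 1 around a true value of 0.5 (all of these numbers are made up). In that case the variance of the averaged prediction drops to roughly 1/100. Real bagged trees are correlated because they share training data, so the reduction in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 100, 10_000

# Each row is one trial; each column is one model's noisy prediction of 0.5
preds = 0.5 + rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_models))

var_single = preds[:, 0].var()       # variance of a single model
var_avg = preds.mean(axis=1).var()   # variance of the 100-model average

print(var_single, var_avg)
```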

**Parallelization:**

Each model is trained independently of the others, so the models can be trained in parallel and their predictions combined at the end. This shortens training time compared with boosting, where each model depends on the previous one.
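
In scikit-learn, parallel training is requested with the `n_jobs` parameter of `BaggingRegressor` (and of `RandomForestRegressor`); `n_jobs=-1` uses all available cores. A minimal sketch on synthetic data, where `X_par`, `y_par` and `parallel_bag` are made-up names and the settings are kept small for speed:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the lab dataset)
X_par, y_par = make_regression(n_samples=500, n_features=6,
                               noise=5.0, random_state=0)

parallel_bag = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                                n_estimators=20,
                                max_samples=200,
                                n_jobs=-1,       # train the trees on all available cores
                                random_state=0)
parallel_bag.fit(X_par, y_par)

print(len(parallel_bag.estimators_))  # one fitted tree per estimator
```

`AdaBoostRegressor` and `GradientBoostingRegressor` have no such parameter, because each boosting round depends on the previous one.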

**Simple Model Aggregation:**

By using base models that are fast to train (e.g., decision trees), Bagging and Pasting build an ensemble that captures diverse patterns without relying on one complex model.

**Low Parameter Sensitivity:**

Bagging, in particular, needs little hyperparameter tuning beyond the number of base models (`n_estimators`) and the number of samples per model (`max_samples`). It relies on redundancy and a little randomness to achieve good performance.

**Effective with Homogeneous Data:**

If the dataset shows consistent patterns across its segments, training on subsets and combining the results can capture the overall trend without being misled by noise.

**Considerations:**

- Model used: The effectiveness also depends on the base model used with Bagging. Decision trees are a common choice because their predictions can vary dramatically with the training data, which gives the ensemble something to average out.
- Dataset characteristics: If the dataset contains noise or outliers, aggregating predictions can reduce their influence.
- Overfitting prevention: Bagging helps manage overfitting by training multiple models on different data samples.

