# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see if you can obtain a better model than before. For a fair comparison, you should apply the same feature scaling and feature engineering as in the previous Lab.

In [22]:
#Libraries
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [5]:
# Count missing values per column before handling them
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [7]:
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

In [9]:
numeric_columns = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
categorical_columns = ["HomePlanet", "CryoSleep", "Destination", "VIP"]

# Fill numeric columns with the column median
for col in numeric_columns:
    spaceship[col] = spaceship[col].fillna(spaceship[col].median())

# Fill categorical columns with the most frequent value (mode)
for col in categorical_columns:
    spaceship[col] = spaceship[col].fillna(spaceship[col].mode()[0])



Now perform the same as before:
- Feature Scaling
- Feature Selection


In [11]:
# Feature Engineering 
# one-hot encoding for categorical variables
spaceship_processed = pd.get_dummies(
    spaceship, columns=["HomePlanet", "Destination"])
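
As a small illustration of what the one-hot encoding step produces, the sketch below applies `pd.get_dummies` to a hypothetical three-row frame (the `toy` data is invented for illustration and is not taken from the dataset). Each listed column is replaced by one indicator column per category, and the remaining columns are left untouched.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars"],
    "Destination": ["TRAPPIST-1e", "55 Cancri e", "TRAPPIST-1e"],
    "Age": [24.0, 39.0, 58.0],
})

# Each listed column becomes one indicator column per category
toy_encoded = pd.get_dummies(toy, columns=["HomePlanet", "Destination"])
print(list(toy_encoded.columns))
```

Every row has exactly one active indicator within each encoded group, which is why the normalized tables later in the notebook show a single 1.0 among the `HomePlanet_*` columns.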

**Perform Train Test Split**

In [13]:
# Prepare features and target 
features = spaceship_processed.drop(["PassengerId", "Name", "Cabin", "Transported"], axis=1)
target = spaceship_processed["Transported"]

In [15]:
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [18]:
normalizer = MinMaxScaler()
normalizer.fit(x_train)

In [19]:
x_train_norm = normalizer.transform(x_train)
x_test_norm = normalizer.transform(x_test)
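
The scaler above is fitted on the training split only and then reused for the test split. A minimal sketch of what that means, using hypothetical one-feature arrays (`train` and `test` are invented for illustration): the minimum and maximum come from the training rows, so test values outside that range can land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature data, invented for illustration only
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=0 and max=10 from the training rows only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Fitting on the training split alone keeps information about the test rows out of the preprocessing step, which is the reason for calling `fit` once and `transform` twice.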

In [20]:
x_train_norm = pd.DataFrame(x_train_norm, columns = x_train.columns)
x_train_norm.head()

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0.0,0.683544,0.0,0.0,0.020164,0.0,0.820482,0.115982,0.0,1.0,0.0,1.0,0.0,0.0
1,0.0,0.253165,0.0,0.0,0.000721,4.3e-05,0.037476,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.544304,0.0,0.127103,0.0,0.002001,0.001561,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,0.0,0.303797,0.0,0.012913,0.0,0.020262,0.097459,0.002196,1.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.316456,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


In [21]:
x_test_norm = pd.DataFrame(x_test_norm, columns = x_test.columns)
x_test_norm.head()

Unnamed: 0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e
0,0.0,0.202532,0.0,0.0,0.0,0.02652,0.002154,0.00866,1.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.025316,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.392405,0.0,0.036086,3.6e-05,0.012813,0.003231,0.000166,1.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.177215,0.0,0.045578,0.0,0.00017,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.379747,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0


**Model Selection** - now you will apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [23]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20), 
                               n_estimators=100, 
                               max_samples=1000)

In [24]:
# Train the Bagging model on the normalized data
bagging_reg.fit(x_train_norm, y_train)

In [25]:
pred = bagging_reg.predict(x_test_norm)

# scikit-learn metrics take (y_true, y_pred); RMSE is the square root of the MSE
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 SCORE", bagging_reg.score(x_test_norm, y_test))

MAE 0.2930696920278951
RMSE 0.3849210590648485
R2 SCORE 0.40730999126081446
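
The model above uses the default `bootstrap=True`, so each tree is trained on a sample drawn with replacement (bagging). For the pasting variant named in the heading, one option is to set `bootstrap=False`, which draws each sample without replacement. The sketch below is an assumption-laden illustration on synthetic data (`X`, `y`, and the hyperparameters are invented here, not taken from the lab).

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data, invented for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

# bootstrap=False draws each tree's sample without replacement (pasting)
pasting_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                               n_estimators=20,
                               max_samples=100,
                               bootstrap=False,
                               random_state=0)
pasting_reg.fit(X, y)

# Row indices used by the first tree: 100 rows, all distinct
sample_idx = pasting_reg.estimators_samples_[0]
print(len(sample_idx), len(np.unique(sample_idx)))
```

In the lab itself, the same change would amount to adding `bootstrap=False` to the `BaggingRegressor` call and refitting on `x_train_norm`.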




- Random Forests

In [26]:
forest = RandomForestRegressor(n_estimators=100,
                               max_depth=20)

In [27]:
forest.fit(x_train_norm, y_train)

Evaluating the model 

In [28]:
pred = forest.predict(x_test_norm)
# scikit-learn metrics take (y_true, y_pred); RMSE is the square root of the MSE
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 SCORE", forest.score(x_test_norm, y_test))

MAE 0.2852643122547987
RMSE 0.3909683093921075
R2 SCORE 0.38854095342249473




- Gradient Boosting

In [29]:
gb_reg = GradientBoostingRegressor(max_depth=20,
                                   n_estimators=100)

In [30]:
# Training the model
gb_reg.fit(x_train_norm, y_train)

In [None]:
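# Evaluate the model

pred = gb_reg.predict(x_test_norm)
# scikit-learn metrics take (y_true, y_pred); RMSE is the square root of the MSE
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 SCORE", gb_reg.score(x_test_norm, y_test))

MAE 0.2852643122547987
RMSE 0.3909683093921075
R2 SCORE 0.19813650962688367

Note: the MAE and RMSE shown above were recorded from a run in which `pred` still held the Random Forest predictions, so they repeat the Random Forest values. Only the R2 score reflects `gb_reg`.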




- Adaptive Boosting

In [32]:
ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=20), 
                            n_estimators=100)

In [33]:
ada_reg.fit(x_train_norm, y_train)

In [34]:
pred = ada_reg.predict(x_test_norm)

# scikit-learn metrics take (y_true, y_pred); RMSE is the square root of the MSE
print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 SCORE", ada_reg.score(x_test_norm, y_test))

MAE 0.28519865409272854
RMSE 0.4412497897477425
R2 SCORE 0.22115096698525616




Which model is the best and why?

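Of the ensembles tried here, Bagging with decision trees (`BaggingRegressor`) is the best: it has the highest R2 (0.407) and the lowest RMSE (0.385) on the test set. Random Forest is close behind (R2 0.389, with a slightly lower MAE of 0.285), while AdaBoost (R2 0.221) and Gradient Boosting (R2 0.198) score clearly lower, possibly because boosting depth-20 trees overfits the training data.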
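
To make this kind of comparison easier to read, the four models can be fitted in a loop and their metrics collected into one table. The sketch below is only an illustration: it runs on synthetic stand-in data (`X`, `y`) with assumed hyperparameters rather than on the lab's features, so its numbers say nothing about the Spaceship Titanic results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared features and 0/1 target (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(size=(600, 5))
noise = rng.normal(scale=0.2, size=600)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0.75).astype(float)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Assumed hyperparameters; swap in the lab's settings as needed
models = {
    "Bagging": BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                                n_estimators=50, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, max_depth=5,
                                           random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                                   random_state=0),
    "AdaBoost": AdaBoostRegressor(DecisionTreeRegressor(max_depth=5),
                                  n_estimators=50, random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)  # fresh predictions for every model
    rows.append({
        "model": name,
        "MAE": mean_absolute_error(y_te, pred),
        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
        "R2": r2_score(y_te, pred),
    })

# One row per model, best R2 first
results = pd.DataFrame(rows).set_index("model").sort_values("R2", ascending=False)
print(results)
```

Computing all three metrics from the same `pred` inside the loop avoids reporting one model's predictions under another model's name.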