# LAB | Ensemble Methods

**Load the data**

In this challenge, we will work with the same Spaceship Titanic data as in the previous Lab. The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

In this Lab, you should try different ensemble methods to see whether you can obtain a better model than before. For a fair comparison, apply the same feature scaling and feature engineering as in the previous Lab.

In [2]:
import pandas as pd
import numpy as np  # linear algebra

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [3]:
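# Load from the URL given above so the cell does not depend on a local path
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")

spaceship.shape           # rows and columns before cleaning (only shown if last in the cell)
spaceship.isnull().sum()  # missing values per column (only shown if last in the cell)
spaceship = spaceship.dropna()                               # drop rows with any missing value
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)  # drop identifier columns
spaceship['Cabin'] = spaceship['Cabin'].str[0]               # keep only the first character (the deck letter)
spaceship.columns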


Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP',
       'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported'],
      dtype='object')

In [4]:
spaceship.head()

|   | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | B | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False |
| 1 | Earth | False | F | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True |
| 2 | Europa | False | A | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False |
| 3 | Europa | False | A | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False |
| 4 | Earth | False | F | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [5]:
#your code here
features = spaceship.drop(columns = ['Transported'])
target = spaceship['Transported']

columns_to_encode = ['HomePlanet', 'Destination', 'Cabin']
columns_to_encode = [col for col in columns_to_encode if col in features.columns]

# Apply pd.get_dummies() only to the existing columns
X_encoded = pd.get_dummies(features, columns=columns_to_encode)
X_encoded




|   | CryoSleep | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Earth | HomePlanet_Europa | ... | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | False | True | ... | False | True | False | True | False | False | False | False | False | False |
| 1 | False | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | True | False | ... | False | True | False | False | False | False | False | True | False | False |
| 2 | False | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | False | True | ... | False | True | True | False | False | False | False | False | False | False |
| 3 | False | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | False | True | ... | False | True | True | False | False | False | False | False | False | False |
| 4 | False | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | True | False | ... | False | True | False | False | False | False | False | True | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8688 | False | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | False | True | ... | False | False | True | False | False | False | False | False | False | False |
| 8689 | True | 18.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | True | False | ... | True | False | False | False | False | False | False | False | True | False |
| 8690 | False | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | True | False | ... | False | True | False | False | False | False | False | False | True | False |
| 8691 | False | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | False | True | ... | False | False | False | False | False | False | True | False | False | False |
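To make the encoding step easier to follow, here is a minimal sketch on a three-row toy frame. The rows and the `deck/num/side` cabin strings are made up for illustration; only the `.str[0]` and `pd.get_dummies(..., columns=...)` calls mirror the cells above.

```python
import pandas as pd

# Toy rows standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0],
    "HomePlanet": ["Europa", "Earth", "Europa"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S"],
})

# Keep only the first character of the cabin string (the deck letter).
toy["Cabin"] = toy["Cabin"].str[0]

# One indicator column per category; columns not listed pass through unchanged.
toy_encoded = pd.get_dummies(toy, columns=["HomePlanet", "Cabin"])
print(toy_encoded.columns.tolist())
```

Each categorical column is replaced by one indicator column per distinct value, which is why the encoded frame above is much wider than the original.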


**Perform Train Test Split**

In [6]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(X_encoded, target, test_size = 0.20, random_state=0)

In [7]:
# Normalize the data: fit on the training split only, then apply to both splits
# (MinMaxScaler and StandardScaler are already imported in the first cell)
normalizer = MinMaxScaler()
normalizer.fit(X_train)

X_train_norm = normalizer.transform(X_train)
X_test_norm = normalizer.transform(X_test)

In [8]:
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
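The two scaling cells fit each scaler on the training split and reuse the fitted parameters on the test split. A small sketch of what that does, using made-up numbers rather than the real spending columns:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single-feature data: three training rows and one unseen test row.
train = np.array([[0.0], [50.0], [100.0]])
test = np.array([[150.0]])

minmax = MinMaxScaler().fit(train)      # learns min and max from train only
standard = StandardScaler().fit(train)  # learns mean and std from train only

train_minmax = minmax.transform(train)
test_minmax = minmax.transform(test)
train_standard = standard.transform(train)

# The training range maps onto [0, 1]; a test value beyond it lands outside.
print(train_minmax.ravel(), test_minmax.ravel())
# The standardized training column has mean 0 and standard deviation 1.
print(train_standard.mean(), train_standard.std())
```

Because the test row is larger than anything seen in training, its min-max value falls above 1. That is expected, and it is the reason the scaler is never refit on the test split.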

**Model Selection** - now you will apply different ensemble methods to try to get a better model

- Bagging and Pasting

In [9]:
#your code here

# Bagging: each tree is trained on a bootstrap sample of the training data
# (BaggingRegressor and DecisionTreeRegressor are already imported in the first cell)

# Cap max_samples at the smaller of 100 and the number of training rows
max_samples = min(100, X_train_scaled.shape[0])

bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=10),
                               n_estimators=50,
                               max_samples=max_samples)

bagging_reg.fit(X_train_scaled, y_train)


In [18]:
pred = bagging_reg.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
# squared=False was removed in scikit-learn 1.6, so take the square root explicitly
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", bagging_reg.score(X_test_scaled, y_test))

MAE 0.3163211420450452
RMSE 0.3880271963113288
R2 score 0.397739579691078
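The heading mentions both bagging and pasting, but the cell above only covers bagging, and it uses a regressor even though `Transported` is a True/False label. As a hedged sketch, the classifier counterpart could look like the following. It runs on synthetic data from `make_classification` (an assumption, used so the block is self-contained), and `fit_subsample_ensemble` is a hypothetical helper name. In scikit-learn, `bootstrap=True` samples with replacement (bagging) and `bootstrap=False` samples without replacement (pasting).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the encoded Spaceship Titanic features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)


def fit_subsample_ensemble(bootstrap):
    """Fit 50 depth-10 trees, each on 100 rows drawn with or without replacement."""
    model = BaggingClassifier(
        DecisionTreeClassifier(max_depth=10),
        n_estimators=50,
        max_samples=100,
        bootstrap=bootstrap,
        random_state=0,
    )
    return model.fit(Xtr, ytr)


bagging_clf = fit_subsample_ensemble(bootstrap=True)   # bagging
pasting_clf = fit_subsample_ensemble(bootstrap=False)  # pasting

bagging_acc = bagging_clf.score(Xte, yte)  # score() is accuracy for classifiers
pasting_acc = pasting_clf.score(Xte, yte)
print("Bagging accuracy", bagging_acc)
print("Pasting accuracy", pasting_acc)
```

With the real data, `Xtr`, `Xte`, `ytr` and `yte` would be replaced by `X_train_scaled`, `X_test_scaled`, `y_train` and `y_test`.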


- Random Forests

In [21]:
#your code here
forest = RandomForestRegressor(n_estimators=65,
                               max_depth=299)

forest.fit(X_train_scaled, y_train)

In [22]:
pred = forest.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", forest.score(X_test_scaled, y_test))

MAE 0.2691508511165344
RMSE 0.38562841713314966
R2 score 0.40516289559752605
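A random forest also reports how much each feature contributed to its splits, which can help with the feature selection step. The sketch below is an illustration on synthetic data, not the lab's result: the `feature_i` names are placeholders, and `RandomForestClassifier` is swapped in for the regressor because the target is binary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data; the feature names are placeholders.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
names = [f"feature_{i}" for i in range(X_demo.shape[1])]
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

forest_clf = RandomForestClassifier(n_estimators=65, random_state=0)
forest_clf.fit(Xtr, ytr)
forest_acc = forest_clf.score(Xte, yte)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(forest_clf.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)

print("Accuracy", forest_acc)
print(importances)
```

On the real data, passing `X_encoded.columns` as the index would label each importance with its column name.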


- Gradient Boosting

In [13]:
#your code here
gb_reg = GradientBoostingRegressor(max_depth=10,
                                   n_estimators=50)
gb_reg.fit(X_train_scaled, y_train)

In [14]:
pred = gb_reg.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", gb_reg.score(X_test_scaled, y_test))

MAE 0.2738284417574897
RMSE 0.3921059948109815
R2 score 0.3850115553331621
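Gradient boosting adds trees one at a time, so it can be useful to see how the score changes as trees are added. The following sketch uses `GradientBoostingClassifier.staged_predict` on synthetic data. The shallower `max_depth=3` is an assumption chosen for the illustration, not a tuned value. In practice the stage should be picked on a separate validation split rather than the test split used here for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the encoded features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

gb_clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gb_clf.fit(Xtr, ytr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 50 trees.
staged_acc = [accuracy_score(yte, stage_pred) for stage_pred in gb_clf.staged_predict(Xte)]

# Number of trees (1-based) at which held-out accuracy peaked.
best_stage = max(range(len(staged_acc)), key=staged_acc.__getitem__) + 1

print("Accuracy after all trees", staged_acc[-1])
print("Best number of trees", best_stage)
```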


- Adaptive Boosting

In [15]:
#your code here

ada_reg = AdaBoostRegressor(DecisionTreeRegressor(max_depth=10),
                            n_estimators=100)
ada_reg.fit(X_train_scaled, y_train)

In [16]:
pred = ada_reg.predict(X_test_scaled)

print("MAE", mean_absolute_error(y_test, pred))
print("RMSE", np.sqrt(mean_squared_error(y_test, pred)))
print("R2 score", ada_reg.score(X_test_scaled, y_test))

MAE 0.39142978619996815
RMSE 0.4325370422357467
R2 score 0.25164682837580754
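AdaBoost is usually paired with shallow base learners, since each round is only meant to correct the mistakes of the previous ones. As a sketch under that assumption, the block below fits `AdaBoostClassifier` on synthetic data with depth-1 trees (stumps) and with the depth-10 trees used above. Which one scores higher depends on the data, so no winner is claimed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the encoded features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

ada_acc = {}
for depth in (1, 10):
    ada_clf = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=depth),
        n_estimators=100,
        random_state=0,
    )
    ada_clf.fit(Xtr, ytr)
    # Boosting may stop early, so fewer than 100 trees can end up fitted.
    ada_acc[depth] = ada_clf.score(Xte, yte)
    print(f"max_depth={depth}: {len(ada_clf.estimators_)} trees, accuracy {ada_acc[depth]}")
```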


Which model is the best and why?

In [17]:
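# On this test split the random forest scores best on all three metrics:
# lowest MAE (0.269), lowest RMSE (0.386) and highest R2 (0.405).
# Gradient boosting and bagging are close behind; AdaBoost is clearly the weakest (R2 0.252).
# The models were fitted without a fixed random_state, so these numbers will vary slightly between runs.
# Transported is a binary target, so classifiers scored with accuracy would be the more natural comparison.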
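A single train/test split can favor one model by chance. One way to make the comparison more robust is cross-validation. This is only a sketch on synthetic data: the hyperparameters are illustrative and the `candidates` dictionary is a hypothetical name. It scores classifier versions of the four ensembles with 5-fold accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the encoded features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

# Illustrative settings, not tuned values.
candidates = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(max_depth=10), n_estimators=50, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=65, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds for each model.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
print("Best by mean CV accuracy:", best_name)
```

With the real data, the scaler would need to be fitted inside each fold (for example with a `Pipeline`) to keep the comparison free of leakage.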