# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing model so far and fine-tune its hyperparameters.
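
To see which hyperparameters a model exposes and what their default values are, the estimator can be inspected before any tuning. The snippet below is a minimal sketch; the variable names `default_forest` and `default_params` are illustrative and not part of the lab.

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate the classifier with default settings and read back its hyperparameters.
default_forest = RandomForestClassifier()
default_params = default_forest.get_params()

# Print a few of the hyperparameters that are discussed later in this lab.
for name in ("n_estimators", "max_depth", "max_leaf_nodes", "bootstrap", "oob_score"):
    print(f"{name}: {default_params[name]}")
```

Any key returned by `get_params()` can be used as a key in a search grid.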

In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [4]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [6]:
# Inspect the columns, dtypes, and non-null counts
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [8]:
# Check for missing values
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [10]:
# drop null values
spaceship.dropna(axis=0, how='any', inplace=True)
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   6606 non-null   object 
 1   HomePlanet    6606 non-null   object 
 2   CryoSleep     6606 non-null   object 
 3   Cabin         6606 non-null   object 
 4   Destination   6606 non-null   object 
 5   Age           6606 non-null   float64
 6   VIP           6606 non-null   object 
 7   RoomService   6606 non-null   float64
 8   FoodCourt     6606 non-null   float64
 9   ShoppingMall  6606 non-null   float64
 10  Spa           6606 non-null   float64
 11  VRDeck        6606 non-null   float64
 12  Name          6606 non-null   object 
 13  Transported   6606 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 729.0+ KB


In [12]:
# Transform the "Cabin" column: keep only the first part (the deck letter) of values such as "B/0/P"
spaceship['Cabin'] = spaceship['Cabin'].str.split('/').str[0]
spaceship.value_counts()

PassengerId  HomePlanet  CryoSleep  Cabin  Destination    Age   VIP    RoomService  FoodCourt  ShoppingMall  Spa    VRDeck  Name               Transported
0001_01      Europa      False      B      TRAPPIST-1e    39.0  False  0.0          0.0        0.0           0.0    0.0     Maham Ofracculy    False          1
6162_01      Earth       False      F      55 Cancri e    22.0  False  0.0          0.0        1.0           575.0  0.0     Bonyan Hineyley    False          1
6175_01      Earth       False      G      TRAPPIST-1e    18.0  False  628.0        0.0        0.0           31.0   150.0   Thel Pittler       False          1
6174_02      Earth       True       G      PSO J318.5-22  4.0   False  0.0          0.0        0.0           0.0    0.0     Cherry Fisheparks  True           1
6174_01      Earth       False      F      55 Cancri e    24.0  False  0.0          479.0      116.0         1.0    37.0    Jord Mcbriddley    False          1
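
The Cabin transformation can be checked on a few values in isolation. This is a small sketch using three cabin strings taken from the `head()` output above; the names `cabins` and `decks` are illustrative.

```python
import pandas as pd

# Cabin values look like "B/0/P"; splitting on "/" and taking element 0 keeps the deck letter.
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S"])
decks = cabins.str.split("/").str[0]

print(decks.tolist())  # ['B', 'F', 'A']
```

On the full dataset, `spaceship["Cabin"].value_counts()` would give the number of passengers per deck, which is usually easier to read than `value_counts()` on the whole DataFrame.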
                                             

In [14]:
# drop the columns "PassengerId" and "Name"
spaceship.drop(columns=["PassengerId", "Name"], inplace=True)
spaceship.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HomePlanet    6606 non-null   object 
 1   CryoSleep     6606 non-null   object 
 2   Cabin         6606 non-null   object 
 3   Destination   6606 non-null   object 
 4   Age           6606 non-null   float64
 5   VIP           6606 non-null   object 
 6   RoomService   6606 non-null   float64
 7   FoodCourt     6606 non-null   float64
 8   ShoppingMall  6606 non-null   float64
 9   Spa           6606 non-null   float64
 10  VRDeck        6606 non-null   float64
 11  Transported   6606 non-null   bool   
dtypes: bool(1), float64(6), object(5)
memory usage: 625.8+ KB


In [16]:
# Create dummies for all non-numerical columns
spaceship_dummies = pd.get_dummies(spaceship, drop_first=True)
spaceship_dummies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 20 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        6606 non-null   float64
 1   RoomService                6606 non-null   float64
 2   FoodCourt                  6606 non-null   float64
 3   ShoppingMall               6606 non-null   float64
 4   Spa                        6606 non-null   float64
 5   VRDeck                     6606 non-null   float64
 6   Transported                6606 non-null   bool   
 7   HomePlanet_Europa          6606 non-null   bool   
 8   HomePlanet_Mars            6606 non-null   bool   
 9   CryoSleep_True             6606 non-null   bool   
 10  Cabin_B                    6606 non-null   bool   
 11  Cabin_C                    6606 non-null   bool   
 12  Cabin_D                    6606 non-null   bool   
 13  Cabin_E                    6606 non-null   bool   
 1
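
The effect of `drop_first=True` is easier to see on a toy column. The sketch below uses a hypothetical four-row frame named `sample`; it is not the lab data.

```python
import pandas as pd

# One categorical column with three levels.
sample = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

# drop_first=True drops the first level ("Earth"), which is then encoded
# implicitly as "all remaining dummy columns are False".
encoded = pd.get_dummies(sample, drop_first=True)

print(encoded.columns.tolist())  # ['HomePlanet_Europa', 'HomePlanet_Mars']
```

This is why the encoded lab data contains `HomePlanet_Europa` and `HomePlanet_Mars` but no `HomePlanet_Earth` column.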

In [18]:
# define "Transported" as target
spaceship_target = spaceship_dummies["Transported"]
spaceship_target.info()

<class 'pandas.core.series.Series'>
Index: 6606 entries, 0 to 8692
Series name: Transported
Non-Null Count  Dtype
--------------  -----
6606 non-null   bool 
dtypes: bool(1)
memory usage: 58.1 KB


In [20]:
# select feature columns
spaceship_numeric = spaceship_dummies.drop(columns=["Transported"])
spaceship_numeric.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6606 entries, 0 to 8692
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        6606 non-null   float64
 1   RoomService                6606 non-null   float64
 2   FoodCourt                  6606 non-null   float64
 3   ShoppingMall               6606 non-null   float64
 4   Spa                        6606 non-null   float64
 5   VRDeck                     6606 non-null   float64
 6   HomePlanet_Europa          6606 non-null   bool   
 7   HomePlanet_Mars            6606 non-null   bool   
 8   CryoSleep_True             6606 non-null   bool   
 9   Cabin_B                    6606 non-null   bool   
 10  Cabin_C                    6606 non-null   bool   
 11  Cabin_D                    6606 non-null   bool   
 12  Cabin_E                    6606 non-null   bool   
 13  Cabin_F                    6606 non-null   bool   
 1

In [22]:
# Normalization
from sklearn.preprocessing import MinMaxScaler, StandardScaler
normalizer = MinMaxScaler()

In [24]:
# fitting = calculating min and max for each column
normalizer.fit(spaceship_numeric)

In [26]:
# transforming = using the min and max data to scale the rest of the values
spaceship_numeric_norm = normalizer.transform(spaceship_numeric)

In [28]:
# The normalizer returns an array instead of a dataframe
spaceship_numeric_norm

array([[4.93670886e-01, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [3.03797468e-01, 1.09879032e-02, 3.01881729e-04, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [7.34177215e-01, 4.33467742e-03, 1.19947674e-01, ...,
        0.00000000e+00, 1.00000000e+00, 1.00000000e+00],
       ...,
       [3.29113924e-01, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
       [4.05063291e-01, 0.00000000e+00, 3.51859927e-02, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [5.56962025e-01, 1.27016129e-02, 1.57246839e-01, ...,
        0.00000000e+00, 1.00000000e+00, 0.00000000e+00]])

In [30]:
# Convert the normalized array back into a DataFrame (note: the row index is reset to 0..n-1)
spaceship_numeric_norm = pd.DataFrame(spaceship_numeric_norm, columns = spaceship_numeric.columns)
spaceship_numeric_norm.head()

| | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | HomePlanet_Europa | HomePlanet_Mars | CryoSleep_True | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Destination_PSO J318.5-22 | Destination_TRAPPIST-1e | VIP_True |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.493671 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.303797 | 0.010988 | 0.000302 | 0.00204 | 0.0245 | 0.002164 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.734177 | 0.004335 | 0.119948 | 0.0 | 0.29967 | 0.00241 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 0.417722 | 0.0 | 0.043035 | 0.030278 | 0.148563 | 0.009491 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.202532 | 0.030544 | 0.002348 | 0.012324 | 0.025214 | 9.8e-05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
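
The cells above fit the scaler on the full dataset before splitting, so the test rows contribute to the min and max used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch under stated assumptions: `features` and `target` are hypothetical stand-ins for `spaceship_numeric` and `spaceship_target`, filled with synthetic values, and the original row index is kept so that features and target stay aligned by label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with a non-contiguous index, similar to what dropna() leaves behind.
rng = np.random.default_rng(0)
features = pd.DataFrame(
    {"Age": rng.integers(0, 80, size=50).astype(float),
     "Spa": rng.exponential(300.0, size=50)},
    index=np.arange(0, 100, 2),
)
target = pd.Series(rng.integers(0, 2, size=50).astype(bool), index=features.index)

# Split first, so the test rows never influence the scaling statistics.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.20, random_state=0
)

# Fit on the training rows only, then apply the same min/max to the test rows.
scaler = MinMaxScaler()
X_train_norm = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_norm = pd.DataFrame(
    scaler.transform(X_test), columns=X_test.columns, index=X_test.index
)
```

With this ordering, test values can fall slightly outside the 0 to 1 range, which is expected. For tree-based models such as Random Forest the choice of scaling usually matters little, since splits depend only on the ordering of feature values.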


In [32]:
# Train Test Split
# split X (features) and y (target) into X_train, X_test, y_train, and y_test. 
# 80% of the data should be in the training set and 20% in the test set.

X_train, X_test, y_train, y_test = train_test_split(spaceship_numeric_norm, spaceship_target, test_size=0.20, random_state=0)
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5284 entries, 2584 to 2732
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        5284 non-null   float64
 1   RoomService                5284 non-null   float64
 2   FoodCourt                  5284 non-null   float64
 3   ShoppingMall               5284 non-null   float64
 4   Spa                        5284 non-null   float64
 5   VRDeck                     5284 non-null   float64
 6   HomePlanet_Europa          5284 non-null   float64
 7   HomePlanet_Mars            5284 non-null   float64
 8   CryoSleep_True             5284 non-null   float64
 9   Cabin_B                    5284 non-null   float64
 10  Cabin_C                    5284 non-null   float64
 11  Cabin_D                    5284 non-null   float64
 12  Cabin_E                    5284 non-null   float64
 13  Cabin_F                    5284 non-null   float64

In [34]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1322 entries, 1830 to 5022
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        1322 non-null   float64
 1   RoomService                1322 non-null   float64
 2   FoodCourt                  1322 non-null   float64
 3   ShoppingMall               1322 non-null   float64
 4   Spa                        1322 non-null   float64
 5   VRDeck                     1322 non-null   float64
 6   HomePlanet_Europa          1322 non-null   float64
 7   HomePlanet_Mars            1322 non-null   float64
 8   CryoSleep_True             1322 non-null   float64
 9   Cabin_B                    1322 non-null   float64
 10  Cabin_C                    1322 non-null   float64
 11  Cabin_D                    1322 non-null   float64
 12  Cabin_E                    1322 non-null   float64
 13  Cabin_F                    1322 non-null   float64

In [36]:
y_train.info()

<class 'pandas.core.series.Series'>
Index: 5284 entries, 3432 to 3642
Series name: Transported
Non-Null Count  Dtype
--------------  -----
5284 non-null   bool 
dtypes: bool(1)
memory usage: 46.4 KB


In [38]:
y_test.info()

<class 'pandas.core.series.Series'>
Index: 1322 entries, 2453 to 6640
Series name: Transported
Non-Null Count  Dtype
--------------  -----
1322 non-null   bool 
dtypes: bool(1)
memory usage: 11.6 KB


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [40]:
# Importing models from sklearn
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [44]:
# Choose the Random Forest classifier for modelling
# Accuracy recorded for this model earlier: 0.794251134644478
# (no random_state is set, so the score varies slightly between runs)
forest_clf = RandomForestClassifier(n_estimators=100, max_depth=20)

In [46]:
# Training Random Forest model with normalized data
forest_clf.fit(X_train, y_train)

In [48]:
# prediction of "Transported" classification applying Random Forest Classification
pred_frst = forest_clf.predict(X_test)

- Evaluate your model

In [52]:
# Evaluate the Random Forest model's performance --> calculate accuracy
accuracy_frst=forest_clf.score(X_test, y_test)
accuracy_frst

0.7874432677760969

In [54]:
# Report the accuracy of the Random Forest classification model:
print(f"The accuracy of Transported prediction using the Random Forest Classification model is: {accuracy_frst}.")

The accuracy of Transported prediction using the Random Forest Classification model is: 0.7874432677760969.
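
For a classifier, `score` returns the accuracy, so the same number can be obtained from the stored predictions with `accuracy_score`. The sketch below shows the equivalence on synthetic data generated with `make_classification`; the data and variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the spaceship features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

clf = RandomForestClassifier(n_estimators=50, max_depth=20, random_state=0)
clf.fit(X_train, y_train)

# Accuracy computed from explicit predictions ...
pred = clf.predict(X_test)
acc_from_pred = accuracy_score(y_test, pred)

# ... matches the value returned by the classifier's score method.
acc_from_score = clf.score(X_test, y_test)
print(acc_from_pred, acc_from_score)
```

Setting `random_state` on the classifier makes the result reproducible between runs.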


In [56]:
# The model resulting in the highest accuracy for prediction of "Transported" Classification is the Random Forest model.
# However: the accuracy can still be improved

**Grid/Random Search**

For this lab we will use Grid Search.
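
For comparison, Random Search samples a fixed number of combinations instead of testing all of them. The snippet below is a minimal sketch of `RandomizedSearchCV` on synthetic data; the small parameter values, `n_iter=5`, and `cv=3` are assumptions chosen to keep it fast, not settings from this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the training set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values to sample from (27 possible combinations).
param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_leaf_nodes": [50, 100, None],
    "max_depth": [5, 10, 20],
}

# Only n_iter combinations are drawn at random and cross-validated.
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X, y)
print(random_search.best_params_)
```

Random Search tends to be the more practical choice when the grid is large, because its cost is controlled by `n_iter` rather than by the number of possible combinations.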

- Define hyperparameters to fine tune.

In [58]:
# Import Hyperparameter Tuning 
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [70]:
# Random Forest model: cross-validation using Grid Search --> all combinations will be tested
grid = {"n_estimators": [50, 100, 200],        # Number of decision trees in the forest.
        "max_leaf_nodes": [250, 500, None],    # Maximum number of leaf nodes in each tree (None = unlimited).
        "max_depth": [10, 20, 50]}             # Maximum depth of each tree; limiting it helps reduce overfitting.
# Not tuned here:
# bootstrap --> Whether to use bootstrapped samples when building trees (default: True).
# oob_score --> Whether to use out-of-bag samples to estimate the generalization score (default: False).
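
Since Grid Search tests every combination, it helps to count the candidates before running it. The sketch below uses `ParameterGrid` on the same grid and assumes 5-fold cross-validation, as used in the Grid Search cell of this lab.

```python
from sklearn.model_selection import ParameterGrid

grid = {"n_estimators": [50, 100, 200],
        "max_leaf_nodes": [250, 500, None],
        "max_depth": [10, 20, 50]}

# 3 x 3 x 3 value combinations.
n_candidates = len(ParameterGrid(grid))

# Each candidate is trained once per cross-validation fold.
cv_folds = 5
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 27 135
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set.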

In [72]:
# Define the Random Forest Model applying Grid Search
forest_clf_grid = RandomForestClassifier()

- Run Grid Search

In [75]:
# cv = number of cross-validation folds the training data is split into (usually 5 or 10)
# scoring defaults to the estimator's own score method (accuracy for classifiers);
# other metrics such as "recall" or "precision" can be passed instead
model = GridSearchCV(estimator = forest_clf_grid, param_grid = grid, cv=5)

In [77]:
# Fitting the model
model.fit(X_train, y_train)

In [79]:
# Best hyperparameter combination according to the mean cross-validated accuracy
model.best_params_

{'max_depth': 10, 'max_leaf_nodes': 500, 'n_estimators': 100}

In [81]:
best_model = model.best_estimator_
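
With the default `refit=True`, `best_estimator_` is already trained on the whole training set using the best hyperparameters, so it can be evaluated directly without rebuilding the model by hand. The following is a minimal sketch on synthetic data; `small_grid` and `search` are hypothetical names and the grid is deliberately tiny.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the spaceship training and test sets.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid=small_grid, cv=3)
search.fit(X_train, y_train)

best_model = search.best_estimator_               # already refit on all of X_train
cv_accuracy = search.best_score_                  # mean cross-validated accuracy of the best candidate
test_accuracy = best_model.score(X_test, y_test)  # accuracy on the held-out test set

print(search.best_params_, cv_accuracy, test_accuracy)
```

Comparing `cv_accuracy` with `test_accuracy` gives a rough indication of how well the cross-validated estimate carries over to unseen data.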

- Evaluate your model

In [93]:
# Rebuild the Random Forest classifier with the hyperparameters selected by Grid Search
forest_clf_grid = RandomForestClassifier(max_depth=10, max_leaf_nodes=500, n_estimators=100)

In [95]:
# Training Random Forest model with normalized data
forest_clf_grid.fit(X_train, y_train)

In [97]:
# prediction of "Transported" classification applying Random Forest Classification
pred_frst_grid = forest_clf_grid.predict(X_test)

In [99]:
# Evaluate the tuned Random Forest model's performance --> calculate accuracy
accuracy_frst_grid=forest_clf_grid.score(X_test, y_test)
accuracy_frst_grid

0.789712556732224

In [101]:
# Evaluating accuracy value of the Random Forest classification model after Grid Search Hyperparameter Tuning:
print(f"The accuracy of Transported prediction using the Random Forest Classification model after Grid Search Hyperparameter Tuning is: {accuracy_frst_grid}.")

The accuracy of Transported prediction using the Random Forest Classification model after Grid Search Hyperparameter Tuning is: 0.789712556732224.


In [103]:
# With the hyperparameters suggested by Grid Search, the test accuracy of the Random Forest model
# changes only marginally (0.7897 vs. 0.7874 before tuning).
# Because no random_state is set, a difference this small may simply reflect run-to-run variation.
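
One way to judge whether such a small difference is meaningful is to train each configuration with several random seeds and compare the spread of the scores. The snippet below is a sketch on synthetic data; the helper `accuracy_over_seeds` is hypothetical, and the two parameter sets mirror the baseline and tuned settings used in this lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, somewhat noisy data standing in for the spaceship dataset.
X, y = make_classification(n_samples=400, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

def accuracy_over_seeds(params, seeds=range(5)):
    """Train one forest per seed and return the test accuracies as an array."""
    scores = []
    for seed in seeds:
        clf = RandomForestClassifier(random_state=seed, **params)
        clf.fit(X_train, y_train)
        scores.append(clf.score(X_test, y_test))
    return np.array(scores)

baseline = accuracy_over_seeds({"n_estimators": 100, "max_depth": 20})
tuned = accuracy_over_seeds({"n_estimators": 100, "max_depth": 10, "max_leaf_nodes": 500})

print(f"baseline: {baseline.mean():.3f} +/- {baseline.std():.3f}")
print(f"tuned:    {tuned.mean():.3f} +/- {tuned.std():.3f}")
```

If the gap between the two means is smaller than the standard deviations, the tuning result is probably not distinguishable from noise.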