# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then take the best-performing models you have obtained so far and fine-tune their hyperparameters.
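To make "default values" concrete: every scikit-learn estimator exposes its current hyperparameters through `get_params()`. The following is a minimal sketch, using `DecisionTreeRegressor` purely as an illustration; the variable names `default_tree` and `defaults` are made up for this example.

```python
from sklearn.tree import DecisionTreeRegressor

# An estimator built without arguments uses scikit-learn's default hyperparameters.
default_tree = DecisionTreeRegressor()
defaults = default_tree.get_params()

# Show a few hyperparameters that are common tuning targets for decision trees.
for name in ["max_depth", "min_samples_split", "max_leaf_nodes", "max_features"]:
    print(f"{name}: {defaults[name]}")
```

A default of `None` for `max_depth` or `max_leaf_nodes` means the tree is allowed to grow without that limit, which is one reason tuning these values can change how much a tree overfits.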

In [1]:
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as st
import seaborn as sns

import optuna
import optuna.visualization as vis

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, root_mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor



In [2]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
# Drop rows with missing values; .copy() makes df an independent DataFrame,
# so the column assignment below does not raise SettingWithCopyWarning
df = spaceship.dropna(axis=0).copy()

In [3]:
# Keep only the deck letter (first character) of each cabin identifier
mapping = {k: k[0] for k in df['Cabin'].unique()}
df['Cabin'] = df['Cabin'].map(mapping)
df = df.drop(['PassengerId', 'Name'], axis=1)


Now perform the same as before:
- Feature Scaling
- Feature Selection
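As a reminder of the scaling step, the sketch below fits a `MinMaxScaler` on training rows only and reuses it for the test rows. It runs on synthetic data, and the names `X_train_demo`, `X_test_demo`, and `scaler` are hypothetical; it is not the exact preprocessing used in the cells of this lab.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic numeric features standing in for spending columns (assumption).
rng = np.random.default_rng(0)
X_train_demo = rng.uniform(0, 1000, size=(80, 3))
X_test_demo = rng.uniform(0, 1000, size=(20, 3))

# Learn the per-column min and max from the training rows only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)

# Apply the same training-derived scaling to the test rows.
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Because the scaler only sees the training rows, test values can fall slightly outside the [0, 1] range; that is expected and avoids leaking test information into preprocessing.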


In [4]:
non_num_col = df.select_dtypes('object').columns
# The encoder is fitted after the train/test split, on the training rows only,
# so that no information from the test set reaches the fitted categories.

In [5]:
features = df.drop(columns='Transported')
target = df["Transported"]
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.20, random_state=0)

In [6]:
ohe = OneHotEncoder(drop='first',sparse_output=False)
ohe.fit(X_train[non_num_col])

In [7]:
object_col=X_train.select_dtypes('object').columns

In [8]:
X_train.select_dtypes('object').columns

Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP'], dtype='object')

In [9]:
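X_train_trans_np = ohe.transform(X_train[object_col])
X_train_trans_df = pd.DataFrame(X_train_trans_np, columns=ohe.get_feature_names_out(), index=X_train.index)
# Encoded categoricals plus the numeric columns (the raw object columns are dropped)
X_train_final = pd.concat([X_train_trans_df, X_train.drop(columns=object_col)], axis=1)
# Note: the models below are fitted on X_train, which keeps only the numeric columns
X_train = X_train.drop(object_col, axis=1)

In [10]:
X_test_trans_np = ohe.transform(X_test[object_col])
X_test_trans_df = pd.DataFrame(X_test_trans_np, columns=ohe.get_feature_names_out(), index=X_test.index)
# Same layout as X_train_final: encoded categoricals plus numeric columns
X_test_final = pd.concat([X_test_trans_df, X_test.drop(columns=object_col)], axis=1)
X_test = X_test.drop(object_col, axis=1)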

- Now let's use the best model we have so far to see how much it improves when we fine-tune its hyperparameters.

In [11]:
bagging_reg = BaggingRegressor(DecisionTreeRegressor(max_depth=20),
                               n_estimators=100, # number of models to use
                               max_samples = 1000)

bagging_reg.fit(X_train, y_train)

- Evaluate your model

In [17]:
y_pred_test_bag = bagging_reg.predict(X_test)

# scikit-learn metrics expect (y_true, y_pred) in that order
print("Results for Bagging (decision trees)")
print(f"MAE {mean_absolute_error(y_test, y_pred_test_bag): .4f}")
print(f"MSE {mean_squared_error(y_test, y_pred_test_bag): .4f}")
print(f"RMSE {root_mean_squared_error(y_test, y_pred_test_bag): .4f}")
print(f"R2 score {bagging_reg.score(X_test, y_test): .4f}")

Results for Bagging (decision trees)
MAE  0.3162
MSE  0.1612
RMSE  0.4015
R2 score  0.3553


**Grid/Random Search**

For this lab we will use Grid Search.
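To illustrate the difference between the two strategies without fitting any model: Grid Search enumerates every combination of the listed values, while Random Search draws a fixed number of candidates from distributions. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The grid matches the one used in this lab; the `randint` ranges in `param_distributions` are assumptions chosen only for illustration.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid Search: every combination of the listed values is a candidate.
param_grid = {
    "max_depth": [10, 50],
    "min_samples_split": [4, 16],
    "max_leaf_nodes": [250, 100],
    "max_features": ["sqrt", "log2"],
}
grid_candidates = list(ParameterGrid(param_grid))
print(f"Grid candidates: {len(grid_candidates)}")  # 2 * 2 * 2 * 2 combinations

# Random Search: draw n_iter candidates from distributions (ranges are assumptions).
param_distributions = {
    "max_depth": randint(5, 51),          # integers from 5 to 50
    "min_samples_split": randint(2, 21),  # integers from 2 to 20
    "max_features": ["sqrt", "log2"],
}
random_candidates = list(
    ParameterSampler(param_distributions, n_iter=5, random_state=0)
)
print(f"Random candidates: {len(random_candidates)}")
```

With 10-fold cross-validation, each candidate costs 10 fits, so the number of candidates drives the total running time of either search.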

- Define hyperparameters to fine tune.

In [13]:
parameter_grid = {"max_depth": [10, 50],
                  "min_samples_split": [4, 16],
                  "max_leaf_nodes": [250, 100],
                  "max_features": ["sqrt", "log2"]}

dt = DecisionTreeRegressor(random_state=123)

confidence_level = 0.95 
folds = 10

gs = GridSearchCV(dt, param_grid=parameter_grid, cv=folds, verbose=10)

start_time = time.time()
gs.fit(X_train, y_train)
end_time = time.time()

print("\n")
print(f"Time taken to find the best combination of hyperparameters among the given ones: {end_time - start_time: .4f} seconds")
print("\n")


print(f"The best combination of hyperparameters has been: {gs.best_params_}")
print(f"The R2 is: {gs.best_score_: .4f}")

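results_gs_df = pd.DataFrame(gs.cv_results_).sort_values(by="mean_test_score", ascending=False)

# Select the scores of the best candidate by column name rather than by position
gs_mean_score = results_gs_df["mean_test_score"].iloc[0]
gs_sem = results_gs_df["std_test_score"].iloc[0] / np.sqrt(folds)  # standard error of the mean fold score

# Two-sided t interval with folds - 1 degrees of freedom
gs_tc = st.t.ppf(1 - ((1 - confidence_level) / 2), df=folds - 1)
gs_lower_bound = gs_mean_score - (gs_tc * gs_sem)
gs_upper_bound = gs_mean_score + (gs_tc * gs_sem)

# No backslash before "(": "\(" is an invalid escape sequence in a Python string
print(f"The R2 confidence interval for the best combination of hyperparameters is: ({gs_lower_bound: .4f}, {gs_mean_score: .4f}, {gs_upper_bound: .4f})")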

best_model = gs.best_estimator_




Fitting 10 folds for each of 16 candidates, totalling 160 fits
[CV 1/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 1/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.231 total time=   0.0s
[CV 2/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 2/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.365 total time=   0.0s
[CV 3/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 3/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.337 total time=   0.0s
[CV 4/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 4/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.285 total time=   0.0s
[CV 5/10; 1/16] START max_depth=10, max_features=sqrt


[CV 10/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.234 total time=   0.0s
[CV 1/10; 2/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=16
[CV 1/10; 2/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=16;, score=0.252 total time=   0.0s
[CV 2/10; 2/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=16
[CV 2/10; 2/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=16;, score=0.358 total time=   0.0s
[CV 3/10; 2/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=16
[CV 3/10; 2/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=16;, score=0.373 total time=   0.0s
[CV 4/10; 2/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=16
[CV 4/10; 2/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=16;, score=0.277 
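The confidence interval reported by the Grid Search cell is a t interval around the mean fold score. As a standalone sketch of that calculation, the hypothetical helper `t_interval` below computes it directly from a list of per-fold scores. The scores in `fold_scores` are made up for illustration and are not taken from the log. Note one assumption that differs from the cell: this helper uses the sample standard deviation (`ddof=1`), whereas `std_test_score` in `cv_results_` is, to my understanding, computed without that correction.

```python
import numpy as np
import scipy.stats as st


def t_interval(scores, confidence_level=0.95):
    """Return (lower, mean, upper) of a two-sided t interval for the mean score."""
    scores = np.asarray(scores, dtype=float)
    k = scores.size
    mean = scores.mean()
    # Standard error of the mean, using the sample standard deviation.
    sem = scores.std(ddof=1) / np.sqrt(k)
    # Critical t value with k - 1 degrees of freedom.
    t_crit = st.t.ppf(1 - (1 - confidence_level) / 2, df=k - 1)
    return mean - t_crit * sem, mean, mean + t_crit * sem


# Made-up per-fold R2 scores, for illustration only.
fold_scores = [0.20, 0.30, 0.25, 0.35, 0.40]
lower, mean, upper = t_interval(fold_scores)
print(f"({lower:.4f}, {mean:.4f}, {upper:.4f})")
```

A wide interval relative to the gap between candidates suggests that the ranking produced by the search may not be stable across different fold splits.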

- Predict with the best estimator found by Grid Search

In [14]:

y_pred_train_df = best_model.predict(X_train)
y_pred_test_df  = best_model.predict(X_test)


- Evaluate your model

In [16]:
# scikit-learn metrics expect (y_true, y_pred) in that order
print(f"Test MAE: {mean_absolute_error(y_test, y_pred_test_df): .4f}")
print(f"Test MSE: {mean_squared_error(y_test, y_pred_test_df): .4f}")
print(f"Test RMSE: {root_mean_squared_error(y_test, y_pred_test_df): .4f}")
print(f"Test R2 score:  {best_model.score(X_test, y_test): .4f}")
print("\n")

Test MAE:  0.3164
Test MSE:  0.1745
Test RMSE:  0.4177
Test R2 score:   0.3022


