# LAB | Hyperparameter Tuning

**Load the data**

The final step is to maximize the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have so far and fine-tune their hyperparameters.
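To see what "default values" means in practice, scikit-learn estimators expose their current hyperparameters through `get_params()`. The snippet below is a minimal sketch; the decision tree is only an illustration, and `default_tree` and `default_params` are names introduced here, not part of the lab.

```python
from sklearn.tree import DecisionTreeClassifier

# An estimator created without arguments uses scikit-learn's defaults.
default_tree = DecisionTreeClassifier()
default_params = default_tree.get_params()

# Print a few of the hyperparameters that are commonly tuned.
for name in ("max_depth", "min_samples_split", "min_samples_leaf"):
    print(f"{name}: {default_params[name]}")
```

Any value shown here is a candidate for tuning later in the lab.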

In [115]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import precision_score, recall_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler

from sklearn.utils import resample

In [4]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection
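As a reminder of how the scaling step can look, here is a minimal sketch under stated assumptions: `toy` is a small made-up stand-in for the numeric Spaceship Titanic columns, and only scaling is shown, not feature selection. Distance-based models such as KNN are sensitive to feature scale, which is why the scaler is fit on the training split only and then reused on the test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two numeric columns (values are illustrative, not the real data).
toy = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0, 33.0, 16.0, 44.0, 26.0, 28.0],
    "RoomService": [0.0, 109.0, 43.0, 0.0, 303.0, 0.0, 42.0, 0.0],
    "Transported": [False, True, False, False, True, True, True, True],
})

X_toy = toy.drop(columns=["Transported"])
y_toy = toy["Transported"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

# Fit the scaler on the training split only, then apply the same transform to the test split.
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)
```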


In [11]:
#your code here
# Copy after dropping rows with missing values, so the column assignment below
# works on an independent DataFrame and does not raise SettingWithCopyWarning.
df = spaceship.dropna().copy()
# Keep only the deck letter (first character) of each cabin string.
df['Cabin'] = df['Cabin'].str[0]
df = df.drop(columns=['PassengerId', 'Name'])


In [13]:
from sklearn.preprocessing import OneHotEncoder


ohe = OneHotEncoder(sparse_output=False)

In [17]:
non_numeric_cols = df.select_dtypes(include=['object']).columns
encoded_features = ohe.fit_transform(df[non_numeric_cols])
encoded_features

array([[0., 1., 0., ..., 1., 1., 0.],
       [1., 0., 0., ..., 1., 1., 0.],
       [0., 1., 0., ..., 1., 0., 1.],
       ...,
       [1., 0., 0., ..., 1., 1., 0.],
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 0., ..., 1., 1., 0.]])

In [36]:
encoded_df = pd.DataFrame(encoded_features, columns=ohe.get_feature_names_out(non_numeric_cols), index=df.index).astype(int)
encoded_df = pd.concat([df.drop(columns=non_numeric_cols), encoded_df], axis=1)
encoded_df

Unnamed: 0,Transported,HomePlanet_Earth,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_False,CryoSleep_True,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_55 Cancri e,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_False,VIP_True
0,False,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0
1,True,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,1,0
2,False,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,1
3,False,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,1,1,0
4,True,1,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,False,0,1,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1
8689,False,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0
8690,True,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0
8691,False,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0


In [29]:
encoded_df = encoded_df.drop(columns=['Age', 'RoomService','FoodCourt','ShoppingMall','Spa','VRDeck'])

In [38]:
encoded_df.dtypes

Transported                   bool
HomePlanet_Earth             int64
HomePlanet_Europa            int64
HomePlanet_Mars              int64
CryoSleep_False              int64
CryoSleep_True               int64
Cabin_A                      int64
Cabin_B                      int64
Cabin_C                      int64
Cabin_D                      int64
Cabin_E                      int64
Cabin_F                      int64
Cabin_G                      int64
Cabin_T                      int64
Destination_55 Cancri e      int64
Destination_PSO J318.5-22    int64
Destination_TRAPPIST-1e      int64
VIP_False                    int64
VIP_True                     int64
dtype: object

- Now let's use the best model we got so far to see how much it improves when we fine-tune its hyperparameters.

In [68]:
#your code here
features = encoded_df.drop(columns=['Transported'])
target = encoded_df["Transported"]
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=42)

In [70]:
#your code here
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3) # n_neighbors = K

knn.fit(X_train, y_train)

pred = knn.predict(X_test)

- Evaluate your model

In [73]:
#your code here
print(classification_report(y_pred = pred, y_true = y_test))

              precision    recall  f1-score   support

       False       0.63      0.64      0.63       653
        True       0.64      0.63      0.64       669

    accuracy                           0.64      1322
   macro avg       0.64      0.64      0.64      1322
weighted avg       0.64      0.64      0.64      1322



In [84]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=10, min_samples_split=3, min_samples_leaf=2, random_state=42)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)

In [86]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

       False       0.68      0.86      0.76       653
        True       0.81      0.60      0.69       669

    accuracy                           0.73      1322
   macro avg       0.74      0.73      0.72      1322
weighted avg       0.75      0.73      0.72      1322



**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

- Run Grid Search
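The two steps above can be sketched for a classifier like the decision tree trained earlier. This is a minimal illustration, not a tuned solution: the data comes from `make_classification` as a synthetic stand-in for the encoded Spaceship Titanic features, the candidate values in `param_grid` are arbitrary, and the names `X_demo`, `y_demo`, and `grid` are introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features (assumption, not the lab data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=18, n_informative=6, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# Step 1: define the hyperparameters to fine-tune (illustrative candidate values).
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 4, 16],
    "min_samples_leaf": [1, 2, 5],
}

# Step 2: run the grid search with 5-fold cross-validation on the training split.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)

print("Best hyperparameters:", grid.best_params_)
print(f"Best cross-validated accuracy: {grid.best_score_:.4f}")
print(f"Held-out accuracy: {grid.score(X_te, y_te):.4f}")
```

To apply the same idea to the lab data, the `X_train` and `y_train` built from `encoded_df` would take the place of the synthetic split.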

In [108]:
# Note: this cell runs the grid search on scikit-learn's California housing
# regression data, not on the Spaceship Titanic features prepared above, and it
# overwrites df, X_train, X_test, y_train and y_test.
import time

import numpy as np
import pandas as pd
import scipy.stats as st

from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load dataset
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["target"] = housing.target

# Split dataset
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

# Define Decision Tree model
dt = DecisionTreeRegressor(random_state=42)

# Define hyperparameter grid
parameter_grid = {
    "max_depth": [10, 50],
    "min_samples_split": [4, 16],
    "max_leaf_nodes": [250, 100],
    "max_features": ["sqrt", "log2"]
}

# Confidence Interval Parameters
confidence_level = 0.95
folds = 10

# Perform Grid Search
gs = GridSearchCV(dt, param_grid=parameter_grid, cv=folds, verbose=10)

start_time = time.time()
gs.fit(X_train, y_train)
end_time = time.time()

# Best Parameters & Score
print("\n")
print(f"Time taken for Grid Search: {end_time - start_time:.4f} seconds\n")
print(f"Best Hyperparameters: {gs.best_params_}")
print(f"Best R2 Score: {gs.best_score_:.4f}")

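# Compute Confidence Interval
# Look up the best row's statistics by column name rather than by position,
# so the result does not depend on the column order of cv_results_.
results_gs_df = pd.DataFrame(gs.cv_results_).sort_values(by="mean_test_score", ascending=False)
gs_mean_score = results_gs_df["mean_test_score"].iloc[0]
gs_sem = results_gs_df["std_test_score"].iloc[0] / np.sqrt(folds)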

gs_tc = st.t.ppf(1 - ((1 - confidence_level) / 2), df=folds - 1)
gs_lower_bound = gs_mean_score - (gs_tc * gs_sem)
gs_upper_bound = gs_mean_score + (gs_tc * gs_sem)

print(f"R2 Confidence Interval: ({gs_lower_bound:.4f}, {gs_mean_score:.4f}, {gs_upper_bound:.4f})")

# Best Model Evaluation
best_model = gs.best_estimator_
y_pred_test_df = best_model.predict(X_test)


Fitting 10 folds for each of 16 candidates, totalling 160 fits
[CV 1/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 1/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.639 total time=   0.0s
[CV 2/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 2/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.658 total time=   0.0s
[CV 3/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 3/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.651 total time=   0.0s
[CV 4/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 4/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.646 total time=   0.0s
[CV 5/10; 1/16] START max_depth=10, max_features=sqrt
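The confidence interval computed in the grid search cell can be wrapped in a small helper. This is a sketch under assumptions: `cv_confidence_interval` is a hypothetical function name, and because cross-validation folds share training data, the resulting interval should be read as approximate.

```python
import numpy as np
import scipy.stats as st

def cv_confidence_interval(fold_scores, confidence_level=0.95):
    """Return (lower, mean, upper) for the mean of per-fold CV scores."""
    scores = np.asarray(fold_scores, dtype=float)
    k = scores.size
    mean = scores.mean()
    # Population standard deviation (ddof=0), as in the std_test_score column.
    sem = scores.std() / np.sqrt(k)
    # Two-sided critical value of Student's t with k - 1 degrees of freedom.
    t_crit = st.t.ppf(1 - (1 - confidence_level) / 2, df=k - 1)
    return mean - t_crit * sem, mean, mean + t_crit * sem

# The four fold scores visible in the log above (the lab itself uses 10 folds).
lower, mean, upper = cv_confidence_interval([0.639, 0.658, 0.651, 0.646])
print(f"({lower:.4f}, {mean:.4f}, {upper:.4f})")
```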

- Evaluate your model

In [110]:
# Compute Metrics
mae = mean_absolute_error(y_test, y_pred_test_df)
mse = mean_squared_error(y_test, y_pred_test_df)
rmse = np.sqrt(mse)
r2 = best_model.score(X_test, y_test)

# Print Results
print("\n")
print(f"Test MAE: {mae:.4f}")
print(f"Test MSE: {mse:.4f}")
print(f"Test RMSE: {rmse:.4f}")
print(f"Test R2 Score: {r2:.4f}")
print("\n")



Test MAE: 0.4504
Test MSE: 0.4355
Test RMSE: 0.6600
Test R2 Score: 0.6676
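The section heading also mentions random search, which samples a fixed number of combinations instead of trying every one. The sketch below shows one way this could look with `RandomizedSearchCV`; the synthetic data, the distributions in `param_distributions`, and `n_iter=20` are all illustrative assumptions rather than values from the lab.

```python
import scipy.stats as st
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the lab data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=18, n_informative=6, random_state=42
)

# Distributions to sample from; lists are sampled uniformly.
param_distributions = {
    "max_depth": st.randint(2, 21),          # integers 2..20
    "min_samples_split": st.randint(2, 33),  # integers 2..32
    "max_features": ["sqrt", "log2", None],
}

# Evaluate 20 randomly drawn combinations with 5-fold cross-validation.
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print("Best hyperparameters:", random_search.best_params_)
print(f"Best cross-validated accuracy: {random_search.best_score_:.4f}")
```

With `n_iter` fixed, the cost of the search stays constant even when more hyperparameters or wider ranges are added, which is the usual reason to prefer it over an exhaustive grid.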


