# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then compare the best-performing models you have obtained so far, this time fine-tuning their hyperparameters.
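To make "default values" concrete, the sketch below lists a few of the defaults scikit-learn assigns to a `RandomForestClassifier` created without arguments. The choice of keys is only illustrative, and the exact defaults depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# An estimator created without arguments uses the library defaults
default_rf = RandomForestClassifier()
defaults = default_rf.get_params()

# Print a few hyperparameters that are commonly tuned for random forests
for name in ["n_estimators", "max_depth", "min_samples_split",
             "min_samples_leaf", "max_features"]:
    print(f"{name}: {defaults[name]}")
```

Any of these values can be overridden in the constructor, which is what a hyperparameter search does systematically.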

In [20]:
#Libraries
import optuna
import optuna.visualization as vis
import time

import scipy.stats as st

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier,  GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier 

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

In [21]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [3]:
spaceship.shape

(8693, 14)

In [4]:
spaceship.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

In [5]:
spaceship.isna().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

In [6]:
spaceship.dropna(inplace=True)

In [7]:
# Cabin is too granular - transform it in order to obtain {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'T'}
spaceship["Cabin"] = spaceship["Cabin"].str[0]
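As a small illustration of what `.str[0]` does, the sketch below applies it to a toy Series of cabin strings (made-up values in the same deck/number/side shape as the dataset). The missing value is included only to show that it stays missing; in this notebook the rows with missing values were already dropped.

```python
import pandas as pd

# Toy cabin values (illustrative, not taken from the full dataset)
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", None])

# .str[0] keeps the first character of each string (the deck letter)
# and leaves missing entries as missing
decks = cabins.str[0]
print(decks.tolist())
```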

In [8]:
# Drop PassengerId and Name
spaceship.drop(["PassengerId", "Name"], axis=1, inplace=True)

In [9]:
# One-hot encode the non-numerical columns
spaceship_df = pd.get_dummies(
    spaceship,
    columns=["HomePlanet", "CryoSleep", "Cabin", "Destination", "VIP"],
    dtype=int,
    drop_first=True,
)
spaceship_df

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,1,0,0,1,0,0,0,0,0,0,0,1,0
1,24.0,109.0,9.0,25.0,549.0,44.0,True,0,0,0,0,0,0,0,1,0,0,0,1,0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,1,0,0,0,0,0,0,0,0,0,0,1,1
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,1,0,0,0,0,0,0,0,0,0,0,1,0
4,16.0,303.0,70.0,151.0,565.0,2.0,True,0,0,0,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,False,1,0,0,0,0,0,0,0,0,0,0,0,1
8689,18.0,0.0,0.0,0.0,0.0,0.0,False,0,0,1,0,0,0,0,0,1,0,1,0,0
8690,26.0,0.0,0.0,1872.0,1.0,0.0,True,0,0,0,0,0,0,0,0,1,0,0,1,0
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,False,1,0,0,0,0,0,1,0,0,0,0,0,0
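The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a made-up four-row frame: the first category in sorted order ("Earth") gets no column of its own and becomes the implicit baseline, which is why the output above has no `HomePlanet_Earth` column.

```python
import pandas as pd

# Toy frame with a single categorical column (illustrative values)
toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", "Mars", "Earth"]})

# drop_first=True drops the first category ("Earth"); a row with all
# remaining dummies equal to 0 therefore means "Earth"
encoded = pd.get_dummies(toy, columns=["HomePlanet"], dtype=int, drop_first=True)
print(encoded)
```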


In [10]:
# prepare data
X = spaceship_df.drop('Transported', axis=1)
y = spaceship_df["Transported"]
X

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,1,0,0,1,0,0,0,0,0,0,0,1,0
1,24.0,109.0,9.0,25.0,549.0,44.0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,1,0,0,0,0,0,0,0,0,0,0,1,1
3,33.0,0.0,1283.0,371.0,3329.0,193.0,1,0,0,0,0,0,0,0,0,0,0,1,0
4,16.0,303.0,70.0,151.0,565.0,2.0,0,0,0,0,0,0,0,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,1,0,0,0,0,0,0,0,0,0,0,0,1
8689,18.0,0.0,0.0,0.0,0.0,0.0,0,0,1,0,0,0,0,0,1,0,1,0,0
8690,26.0,0.0,0.0,1872.0,1.0,0.0,0,0,0,0,0,0,0,0,1,0,0,1,0
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,1,0,0,0,0,0,1,0,0,0,0,0,0


**Perform Train Test Split**

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0)

In [23]:
# Fit the scaler on the training data only, so no test-set statistics leak into it
scaler = StandardScaler()
scaler.fit(X_train)

In [29]:
X_train_scaled_np = scaler.transform(X_train)
X_test_scaled_np = scaler.transform(X_test)


In [28]:
X_train_scaled_df = pd.DataFrame(X_train_scaled_np, columns = X_train.columns, index=X_train.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled_np, columns = X_test.columns, index=X_test.index)
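One possible way to keep scaling and modelling together is a scikit-learn pipeline, which fits the scaler on the training data and reuses it at prediction time. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption, not the Spaceship Titanic set); the names `X_demo`, `knn_pipe` and `demo_accuracy` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# fit() learns the scaling from the training data only;
# score() applies that same scaling to the test data before predicting
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
knn_pipe.fit(X_tr, y_tr)

demo_accuracy = knn_pipe.score(X_te, y_te)
print(f"Demo accuracy: {demo_accuracy:.2f}")
```

A pipeline like this can also be passed directly to `GridSearchCV`, so the scaler is refit inside every cross-validation fold.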


- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [18]:
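# Note: the model is fit on the unscaled features, and the report below reflects that.
# KNN is distance-based, so X_train_scaled_df / X_test_scaled_df could be used here instead.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)

print(classification_report(y_test, y_pred_knn))
print(f"KNN Accuracy is {accuracy_score(y_test, y_pred_knn): .2f}")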

              precision    recall  f1-score   support

       False       0.77      0.81      0.79       661
        True       0.80      0.76      0.78       661

    accuracy                           0.78      1322
   macro avg       0.79      0.78      0.78      1322
weighted avg       0.79      0.78      0.78      1322

KNN Accuracy is  0.78



- Evaluate your model

Random Forests

In [33]:
forest = RandomForestClassifier(random_state=42)

forest.fit(X_train, y_train)
y_pred_test_rf = forest.predict(X_test)

print(f"Accuracy Random Forest: {accuracy_score(y_test, y_pred_test_rf)}")
print(classification_report(y_test, y_pred_test_rf))


Accuracy Random Forest: 0.789712556732224
              precision    recall  f1-score   support

       False       0.78      0.80      0.79       661
        True       0.80      0.78      0.79       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322



**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

In [30]:
# First we need to set up a dictionary with all the values that we want to try for each hyperparameter
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}
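Before launching the search, it can help to check how many candidates the grid contains. The sketch below counts them with `ParameterGrid`, using a copy of the grid above and assuming 5-fold cross-validation (the variable `cv_folds` is introduced here only for illustration).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Grid search tries every combination: 3 * 4 * 3 * 3 * 2 candidates
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fit once per cross-validation fold (assumed: 5 folds)
cv_folds = 5
print(n_candidates, n_candidates * cv_folds)  # 216 1080
```

The cost grows multiplicatively with every value added to the grid, which is the main reason to keep grids small or to switch to a randomized search.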


- Run Grid Search

In [31]:
rf = RandomForestClassifier(random_state=42)

grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)



best_rf = grid_search.best_estimator_
print("Best hyperparameters:", grid_search.best_params_)


Fitting 5 folds for each of 216 candidates, totalling 1080 fits
Best hyperparameters: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
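For comparison, a randomized search samples a fixed number of candidates instead of trying all of them. The sketch below is a minimal illustration on synthetic data (an assumption, not the Spaceship Titanic set); the sampling ranges, `n_iter=5` and `cv=3` are arbitrary choices made to keep it fast, not recommended settings.

```python
import scipy.stats as st
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Lists are sampled uniformly; scipy distributions are sampled via .rvs()
# (st.randint(low, high) excludes the upper bound)
param_distributions = {
    "n_estimators": st.randint(20, 60),
    "max_depth": [None, 5, 10],
    "min_samples_split": st.randint(2, 11),
    "min_samples_leaf": st.randint(1, 5),
    "max_features": ["sqrt", "log2"],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,           # number of sampled candidates
    cv=3,
    scoring="accuracy",
    random_state=42,    # makes the sampled candidates reproducible
)
random_search.fit(X_demo, y_demo)

print("Best hyperparameters:", random_search.best_params_)
print("Best CV accuracy:", round(random_search.best_score_, 3))
```

With `n_iter=5` and `cv=3` this runs 15 fits, regardless of how wide the sampling ranges are.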


- Evaluate your model

In [32]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = best_rf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.7821482602118003
              precision    recall  f1-score   support

       False       0.79      0.77      0.78       661
        True       0.78      0.79      0.78       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322
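The gap between the default and the tuned model on this split (0.790 versus 0.782) corresponds to 10 of the 1322 test rows, so a single split may not be enough to separate them. One way to get a feel for the variability is to compare cross-validated scores, sketched below on synthetic data (an assumption, not the Spaceship Titanic set). The tuned settings are copied from the grid search output above; the dictionary `cv_summary` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

models = {
    "default": RandomForestClassifier(random_state=42),
    "tuned": RandomForestClassifier(
        n_estimators=200, max_depth=10, max_features="sqrt",
        min_samples_leaf=1, min_samples_split=5, random_state=42,
    ),
}

# Mean and standard deviation of accuracy over 5 folds for each model
cv_summary = {}
for label, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[label] = (scores.mean(), scores.std())
    print(f"{label}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

If the difference between the means is smaller than the fold-to-fold standard deviation, the two configurations are hard to tell apart on that data.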



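## Conclusion
Hyperparameter tuning did not improve performance on the test set: the tuned Random Forest reached an accuracy of 0.782, slightly below the 0.790 of the default Random Forest. The Random Forest with default hyperparameters therefore remains the best-performing model for this dataset.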