# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then take the best-performing models you have so far and fine-tune their hyperparameters.
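To make "default values" concrete, the sketch below lists the library defaults for the hyperparameters tuned later in this lab. It is not part of the original notebook; it only relies on scikit-learn's standard `get_params()` method, and the `tuned_names` list is just a convenience for this illustration.

```python
from sklearn.ensemble import GradientBoostingRegressor

# An untuned estimator: every hyperparameter keeps its library default.
default_model = GradientBoostingRegressor()
defaults = default_model.get_params()

# Show only the hyperparameters that this lab tunes later on.
tuned_names = ["n_estimators", "learning_rate", "max_depth",
               "min_samples_leaf", "subsample", "max_features"]
for name in tuned_names:
    print(f"{name}: {defaults[name]}")
```

These defaults are the baseline that any tuned configuration has to beat.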

In [74]:
# Libraries
import numpy as np
import pandas as pd
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor


In [21]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [22]:
spaceship.shape

(8693, 14)

In [23]:
spaceship = spaceship.dropna(axis=0).copy()
spaceship.shape

(6606, 14)

In [24]:
spaceship["CabinDeck"] = spaceship["Cabin"].str[0]
sorted(spaceship["CabinDeck"].unique())

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'T']

In [25]:
spaceship = spaceship.drop(columns=["PassengerId", "Name"])
spaceship.head()

Unnamed: 0,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,CabinDeck
0,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,B
1,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,F
2,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,A
3,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,A
4,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,F


In [26]:
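# One-hot encode every non-numeric column.
# Note: the raw "Cabin" column is still in the frame at this point, so each
# distinct cabin gets its own dummy column (hence the 5323 features below).
spaceship = pd.get_dummies(spaceship, drop_first=True)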
spaceship.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,...,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True,CabinDeck_B,CabinDeck_C,CabinDeck_D,CabinDeck_E,CabinDeck_F,CabinDeck_G,CabinDeck_T
0,39.0,0.0,0.0,0.0,0.0,0.0,False,True,False,False,...,False,True,False,True,False,False,False,False,False,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,False,False,False,...,False,True,False,False,False,False,False,True,False,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,...,False,True,True,False,False,False,False,False,False,False
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,...,False,True,False,False,False,False,False,False,False,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,False,False,False,...,False,True,False,False,False,False,False,True,False,False
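The cells above drop `PassengerId` and `Name` but keep the raw `Cabin` string, so `get_dummies` creates one column per distinct cabin. The toy sketch below (made-up five-row frame, not the lab data) shows the effect on the column count, and what changes if `Cabin` is dropped once `CabinDeck` has been extracted. Whether to drop it in the real notebook is a modelling decision; this is only an illustration.

```python
import pandas as pd

# Tiny stand-in for the real data (values copied from the head() output above).
toy = pd.DataFrame({
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/0/S", "F/1/S"],
    "Age": [39.0, 24.0, 58.0, 33.0, 16.0],
})
toy["CabinDeck"] = toy["Cabin"].str[0]

# Keeping Cabin: one dummy per distinct cabin (minus one for drop_first).
with_cabin = pd.get_dummies(toy, drop_first=True)

# Dropping Cabin first: only the low-cardinality CabinDeck is encoded.
without_cabin = pd.get_dummies(toy.drop(columns=["Cabin"]), drop_first=True)

print(with_cabin.shape[1], without_cabin.shape[1])
```

On the real data the same mechanism turns a handful of features into thousands, most of them nearly constant.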


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [66]:
x = spaceship.drop(columns=["Transported"])
y = spaceship["Transported"].astype(int)

In [67]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

x_train.shape, x_test.shape

((5284, 5323), (1322, 5323))

In [68]:
normalizer = MinMaxScaler()
normalizer.fit(x_train)


x_train_norm = normalizer.transform(x_train)
x_test_norm = normalizer.transform(x_test)
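The scaler above is fitted on the training split only and then applied to both splits. A minimal sketch with made-up one-column arrays shows what that implies: test values are mapped with the training minimum and maximum, so they can fall outside the [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: the training range is 0..10.
toy_train = np.array([[0.0], [10.0]])
toy_test = np.array([[5.0], [20.0]])

# Learn min and max from the training data only.
toy_scaler = MinMaxScaler().fit(toy_train)

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.ravel())  # training values span exactly 0..1
print(toy_test_scaled.ravel())   # 20.0 lies outside the training range
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step.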

- Now let's use the best model we have so far to see how much it improves when we fine-tune its hyperparameters.

In [69]:
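# Base estimator for the search. Transported was cast to 0/1 above, so this
# regressor predicts a continuous value for a binary target, scored with R^2.
gb_base = GradientBoostingRegressor(random_state=42)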

In [75]:
# Distributions: kept small and sensible so the search runs fast
param_distributions = {
    "n_estimators": randint(150, 450),      # 150–449
    "learning_rate": uniform(0.02, 0.18),   # 0.02–0.20
    "max_depth": randint(2, 5),             # 2–4
    "min_samples_leaf": randint(1, 5),      # 1–4
    "subsample": uniform(0.6, 0.4),         # 0.6–1.0
    "max_features": [None, "sqrt", "log2", 0.8]
}
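The range comments in the dictionary above follow SciPy's parameterisation: `uniform(loc, scale)` covers `[loc, loc + scale]`, and `randint(low, high)` excludes `high`. The sketch below draws samples from two of those distributions to check the ranges empirically; the variable names are illustrative only.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) spans loc .. loc + scale, so this is 0.02 .. 0.20.
lr_dist = uniform(0.02, 0.18)
# randint(low, high) yields integers low .. high - 1, so this is 2, 3 or 4.
depth_dist = randint(2, 5)

lr_samples = lr_dist.rvs(size=1000, random_state=0)
depth_samples = depth_dist.rvs(size=1000, random_state=0)

print(lr_samples.min(), lr_samples.max())
print(sorted(set(depth_samples.tolist())))
```

`RandomizedSearchCV` draws one value from each distribution per candidate, so `n_iter` controls how many such combinations are tried.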

In [77]:
search = RandomizedSearchCV(
    estimator=gb_base,
    param_distributions=param_distributions,
    n_iter=20,          # keep small for speed; raise to 40 later if you want
    cv=3,
    scoring="r2",
    n_jobs=-1,
    random_state=42,
    verbose=1
)

search.fit(x_train_norm, y_train)

print("Best CV R^2:", search.best_score_)
print("Best params:", search.best_params_)

best_model = search.best_estimator_
print("Fitted:", hasattr(best_model, "estimators_"))  # should be True

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best CV R^2: 0.45598481363227855
Best params: {'learning_rate': 0.022873925399638555, 'max_depth': 3, 'max_features': 0.8, 'min_samples_leaf': 4, 'n_estimators': 413, 'subsample': 0.6137554084460873}
Fitted: True
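Beyond `best_score_` and `best_params_`, a fitted search also stores one row per candidate in `cv_results_`, which helps to judge whether the winner is clearly better than the runners-up. The sketch below is self-contained and uses synthetic data from `make_regression` with a deliberately tiny search space, both of which are assumptions for illustration rather than the lab's setup.

```python
import pandas as pd
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data so the example runs on its own.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

demo_search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions={"n_estimators": randint(10, 40),
                         "max_depth": randint(2, 4)},
    n_iter=5,
    cv=3,
    scoring="r2",
    random_state=0,
)
demo_search.fit(X, y)

# One row per sampled candidate; rank 1 is the best mean CV score.
results = pd.DataFrame(demo_search.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(3)
print(top)
```

If the top few rows differ by less than their `std_test_score`, the exact winning configuration matters less than it may appear.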


- Evaluate your model

In [82]:
# Predictions
y_pred_test  = best_model.predict(x_test_norm)
y_pred_train = best_model.predict(x_train_norm)

# Metrics
print("TEST DATA")
print("R2:",  r2_score(y_test, y_pred_test))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_test)))
print("MAE:",  mean_absolute_error(y_test, y_pred_test))

print("\nTRAIN DATA")
print("R2:",  r2_score(y_train, y_pred_train))
print("RMSE:", np.sqrt(mean_squared_error(y_train, y_pred_train)))
print("MAE:",  mean_absolute_error(y_train, y_pred_train))


TEST DATA
R2: 0.4373860560746904
RMSE: 0.37502724966104456
MAE: 0.2908554610763867

TRAIN DATA
R2: 0.5011545000247855
RMSE: 0.3531358452112827
MAE: 0.26967535596022085
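Since the target is a 0/1 label, the continuous predictions can also be read as scores and thresholded to obtain class labels. The sketch below does this on made-up numbers; the 0.5 cut-off and the example arrays are assumptions, and accuracy is not a metric reported in this notebook.

```python
import numpy as np

# Hypothetical true labels and regressor outputs (not taken from the lab).
toy_y_true = np.array([0, 1, 1, 0, 1])
toy_y_score = np.array([0.2, 0.7, 0.4, 0.6, 0.9])

# Turn continuous predictions into 0/1 labels with a 0.5 threshold.
toy_y_label = (toy_y_score >= 0.5).astype(int)

# Fraction of labels that match the truth.
toy_accuracy = (toy_y_label == toy_y_true).mean()
print(toy_y_label, toy_accuracy)
```

Applied to `y_pred_test` and `y_test`, the same two lines would give an accuracy figure to read alongside R2, RMSE and MAE.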


**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

In [86]:
gb_base = GradientBoostingRegressor(random_state=42)

# Small, sensible grid so it runs quickly
param_grid = {
    "n_estimators": [150, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "min_samples_leaf": [1, 2],
    "subsample": [0.8, 1.0],
    "max_features": [None, "sqrt"]
}
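Grid Search evaluates every combination in the grid, so its cost is the product of the list lengths times the number of folds. The sketch below counts the combinations for the grid above using only the standard library; `cv_folds = 3` mirrors the `cv=3` setting used in this lab.

```python
from itertools import product
from math import prod

param_grid = {
    "n_estimators": [150, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "min_samples_leaf": [1, 2],
    "subsample": [0.8, 1.0],
    "max_features": [None, "sqrt"],
}

# Number of candidates = product of the number of values per hyperparameter.
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per cross-validation fold.
cv_folds = 3
n_fits = n_candidates * cv_folds

# Enumerate the candidates explicitly, as an exhaustive search would.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

print(n_candidates, n_fits)  # 64 192
```

Adding a third value to each of the six lists would raise the count from 64 to 729 candidates, which is why grids are usually kept small.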

- Run Grid Search

In [87]:
grid = GridSearchCV(
    estimator=gb_base,
    param_grid=param_grid,
    cv=3,
    scoring="r2",
    n_jobs=-1,
    verbose=1
)

grid.fit(x_train_norm, y_train)

print("Best CV R^2:", grid.best_score_)
print("Best params:", grid.best_params_)

best_model = grid.best_estimator_

Fitting 3 folds for each of 64 candidates, totalling 192 fits
Best CV R^2: 0.4541559243204946
Best params: {'learning_rate': 0.05, 'max_depth': 3, 'max_features': None, 'min_samples_leaf': 2, 'n_estimators': 300, 'subsample': 0.8}


- Evaluate your model

In [88]:
y_pred_test  = best_model.predict(x_test_norm)
y_pred_train = best_model.predict(x_train_norm)

print("TEST DATA")
print("R2:",  r2_score(y_test, y_pred_test))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_test)))
print("MAE:",  mean_absolute_error(y_test, y_pred_test))

print("\nTRAIN DATA")
print("R2:",  r2_score(y_train, y_pred_train))
print("RMSE:", np.sqrt(mean_squared_error(y_train, y_pred_train)))
print("MAE:",  mean_absolute_error(y_train, y_pred_train))

TEST DATA
R2: 0.43448384553550246
RMSE: 0.37599328332834836
MAE: 0.2915289984058934

TRAIN DATA
R2: 0.5168318442678523
RMSE: 0.3475425036056371
MAE: 0.2657056850364444
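To compare the two tuned models, it can help to put the reported scores side by side and look at the train/test gap. The sketch below only re-uses the R2 values printed above; the dictionary layout and the `gap` name are illustrative.

```python
# R2 scores copied from the outputs reported in this notebook.
scores = {
    "random_search": {"train": 0.5011545000247855, "test": 0.4373860560746904},
    "grid_search": {"train": 0.5168318442678523, "test": 0.43448384553550246},
}

# Gap between train and test R2: a larger gap suggests more overfitting.
gap = {name: s["train"] - s["test"] for name, s in scores.items()}

for name in scores:
    print(f"{name}: test R2 = {scores[name]['test']:.4f}, gap = {gap[name]:.4f}")
```

On these numbers the two searches give very similar test R2, with the grid-search model fitting the training data slightly more closely.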
