# LAB | Hyperparameter Tuning

**Load the data**

This is the final step in maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then revisit the best-performing models you have so far, this time fine-tuning their hyperparameters.

In [41]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score,  classification_report
from sklearn.model_selection import GridSearchCV


In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [5]:
# Drop rows with missing values and rebuild the index
spaceship = spaceship.dropna().reset_index(drop=True)
spaceship.shape

(6606, 14)

In [8]:
# Keep only the first letter of Cabin

spaceship['Cabin'] = spaceship['Cabin'].str[0]

print(spaceship.Cabin.value_counts())

Cabin
F    2152
G    1973
E     683
B     628
C     587
D     374
A     207
T       2
Name: count, dtype: int64


In [10]:
# Drop PassengerId and Name
spaceship = spaceship.drop(['PassengerId', 'Name'], axis=1)

Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [12]:
# Use dummies on non-numerical columns 
# Get non-numerical columns 
non_num_columns = spaceship.select_dtypes(include=['object']).columns

# Use get_dummies to transform these columns
spaceship_with_dummies = pd.get_dummies(spaceship, columns=non_num_columns, dtype=int) #(I got booleans before specifying int)

print(spaceship_with_dummies)

       Age  RoomService  FoodCourt  ShoppingMall     Spa  VRDeck  Transported  \
0     39.0          0.0        0.0           0.0     0.0     0.0        False   
1     24.0        109.0        9.0          25.0   549.0    44.0         True   
2     58.0         43.0     3576.0           0.0  6715.0    49.0        False   
3     33.0          0.0     1283.0         371.0  3329.0   193.0        False   
4     16.0        303.0       70.0         151.0   565.0     2.0         True   
...    ...          ...        ...           ...     ...     ...          ...   
6601  41.0          0.0     6819.0           0.0  1643.0    74.0        False   
6602  18.0          0.0        0.0           0.0     0.0     0.0        False   
6603  26.0          0.0        0.0        1872.0     1.0     0.0         True   
6604  32.0          0.0     1049.0           0.0   353.0  3235.0        False   
6605  44.0        126.0     4688.0           0.0     0.0    12.0         True   

      HomePlanet_Earth  Hom
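To make the encoding step concrete, here is a minimal sketch on a tiny hand-made frame. The toy rows and the explicit `categorical` list are assumptions for illustration, not part of the lab data. It shows that `pd.get_dummies` with `dtype=int` keeps the numeric column and appends one 0/1 column per category.

```python
import pandas as pd

# Toy frame (hypothetical values, not the Spaceship Titanic data)
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Earth"],
    "Age": [39.0, 24.0, 58.0],
})

# Columns to encode, listed explicitly for this sketch
categorical = ["HomePlanet"]

# dtype=int produces 0/1 integers rather than the default dummy dtype
encoded = pd.get_dummies(toy, columns=categorical, dtype=int)

print(encoded)
```

The encoded columns are placed after the untouched ones, which is why `Age` comes first in the result.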

- Now let's use the best model we have so far to see how much it improves when we fine-tune its hyperparameters.

In [14]:
features = spaceship_with_dummies.drop('Transported', axis=1)
target = spaceship_with_dummies['Transported']

# train test split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.20, random_state=0)

In [16]:
# Scale features

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)   # fit scaler on training data, transform training data
X_test_scaled = scaler.transform(X_test)         # only transform test data
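The fit-on-train, transform-on-test pattern above can be checked on a small hand-made example. The numbers below are made up for illustration. The sketch shows that the scaler keeps the training mean and reuses it for the test rows, so the test data are not centred on their own mean.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature data
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[10.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print("Mean learned from train:", scaler.mean_[0])
print("Scaled train mean:", train_scaled.mean())
print("Scaled test values:", test_scaled.ravel())
```

A test value equal to the training mean maps to 0, while a value far outside the training range maps to a large score. That is the expected behaviour when the test set is treated as unseen data.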

In [18]:
# Best model so far: Random Forest
# Create the model
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Fit the model on the scaled training data
forest.fit(X_train_scaled, y_train)

# Predict on the scaled test data
y_pred = forest.predict(X_test_scaled)




- Evaluate your model

In [20]:
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))

Random Forest Accuracy: 0.7912254160363086
Precision: 0.7975270479134466
Recall: 0.7806354009077155
F1 Score: 0.7889908256880734


In [24]:
train_score = forest.score(X_train_scaled, y_train)
test_score = forest.score(X_test_scaled, y_test)

print("Training accuracy:", train_score)
print("Test accuracy:", test_score)


Training accuracy: 0.9403860711582135
Test accuracy: 0.7912254160363086
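The gap between training accuracy (about 0.94) and test accuracy (about 0.79) suggests the forest fits the training set more closely than it generalizes. One way to explore this is to measure the gap for a few values of a regularizing hyperparameter such as `min_samples_leaf`. The sketch below uses synthetic data from `make_classification` as a stand-in, which is an assumption. It is not the Spaceship Titanic set, so the numbers it prints say nothing about the lab's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the lab dataset)
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gaps = {}
for leaf in (1, 4, 16):
    model = RandomForestClassifier(
        n_estimators=50, min_samples_leaf=leaf, random_state=0
    )
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    gaps[leaf] = train_acc - test_acc  # larger gap hints at more overfitting
    print(f"min_samples_leaf={leaf}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A shrinking gap without a drop in test accuracy would point to a setting worth including in the search grid.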


**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

In [30]:
# Parameters chosen with help from ChatGPT - not too many so my poor laptop keeps living

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 4],
    "bootstrap": [True]
}
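Before launching the search, it can help to count how many fits the grid implies. A small sketch using scikit-learn's `ParameterGrid` on the same grid, assuming 5-fold cross-validation:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 4],
    "bootstrap": [True],
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(param_grid))
cv_folds = 5
total_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", total_fits, "cross-validation fits")
```

Adding one more two-valued hyperparameter doubles both numbers, so the grid grows quickly.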


- Run Grid Search

In [37]:
# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=forest,
    param_grid=param_grid,
    cv=5,                      # 5-fold cross-validation
    scoring='accuracy',        # or 'f1' if you want
    n_jobs=-1,                 # use all cores
    verbose=2
)

# Fit to your scaled training data
grid_search.fit(X_train_scaled, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
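One caveat with searching on pre-scaled data: the scaler has already seen every training row, including the rows each fold later uses for validation. Tree-based models are largely insensitive to feature scaling, so the effect here is probably small, but wrapping the steps in a `Pipeline` keeps preprocessing inside each fold. The sketch below uses synthetic data and a deliberately small grid. The step names `scaler` and `forest` are assumptions chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the lab dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),                        # refit on each training fold
    ("forest", RandomForestClassifier(random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
small_grid = {
    "forest__n_estimators": [25, 50],
    "forest__min_samples_leaf": [1, 4],
}

search = GridSearchCV(pipe, small_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)  # raw features go in; scaling happens inside the pipeline

print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_te, y_te))
```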


- Evaluate your model

In [39]:
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Evaluate on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test_scaled, y_test)
print("Test set accuracy:", test_accuracy)


Best parameters: {'bootstrap': True, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100}
Best cross-validation score: 0.8031784854218629
Test set accuracy: 0.7904689863842662
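The best cross-validation score alone does not show how close the other candidates were. A fitted `GridSearchCV` exposes all results in `cv_results_`, which can be loaded into a DataFrame and ranked. The sketch below runs a tiny search on synthetic data, which is an assumption made so the example stays self-contained. The same column names apply to the `grid_search` object above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the lab dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    {"max_depth": [3, None], "min_samples_leaf": [1, 4]},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# One row per candidate, best rank first
results = pd.DataFrame(search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary.to_string(index=False))
```

If the top candidates differ by less than their `std_test_score`, the ranking between them is probably not meaningful.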


In [43]:
# Predict using the best estimator found by GridSearchCV
y_pred = grid_search.best_estimator_.predict(X_test_scaled)

# Generate classification report
report = classification_report(y_test, y_pred)
print(report)


              precision    recall  f1-score   support

       False       0.79      0.79      0.79       661
        True       0.79      0.79      0.79       661

    accuracy                           0.79      1322
   macro avg       0.79      0.79      0.79      1322
weighted avg       0.79      0.79      0.79      1322



In [None]:
# Grid search did not improve on the baseline Random Forest: test accuracy was 0.7905 after tuning versus 0.7912 before. A different ensemble method might improve the predictions further.
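As one possible follow-up, a different ensemble can be tuned with random search, which samples a fixed number of candidates from distributions instead of trying every grid combination. The sketch below pairs `GradientBoostingClassifier` with `RandomizedSearchCV` on synthetic data. The data, the parameter ranges, and `n_iter=6` are all assumptions chosen to keep the example fast, not recommendations for the Spaceship Titanic set.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (assumption: not the lab dataset)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative ranges only
param_distributions = {
    "n_estimators": randint(20, 80),       # integers in [20, 80)
    "max_depth": randint(2, 5),            # integers in [2, 5)
    "learning_rate": uniform(0.01, 0.29),  # floats in [0.01, 0.30]
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=6,            # number of sampled candidates
    cv=3,
    scoring="accuracy",
    random_state=0,      # makes the sampled candidates reproducible
)
random_search.fit(X_tr, y_tr)

print("Best parameters:", random_search.best_params_)
print("Best cross-validation score:", random_search.best_score_)
print("Test accuracy:", random_search.score(X_te, y_te))
```

The cost is controlled by `n_iter` rather than by the size of the search space, which can suit a laptop-sized compute budget.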