# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

The metadata is described here:

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, and then take the best-performing model you have so far and fine-tune its hyperparameters.
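
To make "default values" concrete, the sketch below prints a few defaults of scikit-learn's `RandomForestClassifier`. The choice of a random forest and of these five parameters is an assumption made for illustration, and the exact defaults can differ between scikit-learn versions.

```python
from sklearn.ensemble import RandomForestClassifier

# A model created without extra arguments uses scikit-learn's defaults
model = RandomForestClassifier(random_state=42)
defaults = model.get_params()

# Print the hyperparameters that are commonly tuned for a random forest
for name in ("n_estimators", "max_depth", "min_samples_split",
             "min_samples_leaf", "max_features"):
    print(f"{name}: {defaults[name]}")
```

Any of these values can be overridden in the constructor, which is what a hyperparameter search does systematically.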

In [2]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [3]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [4]:
import numpy as np

# 🚫 Fill boolean-like columns safely (no downcasting warning)
spaceship['CryoSleep'] = np.where(spaceship['CryoSleep'].isna(), False, spaceship['CryoSleep']).astype(int)
spaceship['VIP'] = np.where(spaceship['VIP'].isna(), False, spaceship['VIP']).astype(int)

# 🔢 Fill numerical columns
spaceship['Age'] = spaceship['Age'].fillna(spaceship['Age'].median())
spaceship['RoomService'] = spaceship['RoomService'].fillna(0)
spaceship['FoodCourt'] = spaceship['FoodCourt'].fillna(0)
spaceship['ShoppingMall'] = spaceship['ShoppingMall'].fillna(0)
spaceship['Spa'] = spaceship['Spa'].fillna(0)
spaceship['VRDeck'] = spaceship['VRDeck'].fillna(0)

# 🏷️ Fill categorical columns
spaceship['HomePlanet'] = spaceship['HomePlanet'].fillna(spaceship['HomePlanet'].mode()[0])
spaceship['Destination'] = spaceship['Destination'].fillna(spaceship['Destination'].mode()[0])
spaceship['Cabin'] = spaceship['Cabin'].fillna('Unknown/0/0')
spaceship['Name'] = spaceship['Name'].fillna('Unknown Unknown')


In [5]:
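# 🧱 Split 'Cabin' into 'Deck', 'CabinNum', and 'Side'
spaceship[['Deck', 'CabinNum', 'Side']] = spaceship['Cabin'].str.split('/', expand=True)
# The split yields strings, so convert 'CabinNum' to a numeric dtype
spaceship['CabinNum'] = spaceship['CabinNum'].astype(int)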

# 🗑️ Drop columns we no longer need
spaceship.drop(columns=['PassengerId', 'Name', 'Cabin'], inplace=True)


In [6]:
# One-hot encode categorical columns
spaceship = pd.get_dummies(spaceship, columns=['HomePlanet', 'Destination', 'Deck', 'Side'], drop_first=True)


In [7]:
# Separate features and target
X = spaceship.drop('Transported', axis=1)  # Drop the target column
y = spaceship['Transported'].astype(int)   # Target column


In [8]:
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [9]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
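
The scaler above is fitted on the training set only and then reused for the test set. A minimal sketch of what that means, using made-up toy arrays in place of `X_train` and `X_test` (the names `toy_train`, `toy_test`, and `toy_scaler` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data standing in for X_train and X_test (values are made up)
toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_test = np.array([[2.5], [10.0]])

toy_scaler = StandardScaler().fit(toy_train)   # statistics come from train only
train_scaled = toy_scaler.transform(toy_train)
test_scaled = toy_scaler.transform(toy_test)   # reuses the train mean and std

print(train_scaled.mean(), train_scaled.std())  # approximately 0 and 1
print(test_scaled.ravel())                      # generally not centered
```

Because the test set never influences the scaling statistics, the test score stays an honest estimate of performance on unseen data.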


In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# grid_search is only created later in this notebook, so referencing it here
# raises a NameError. Assumption: use a default random forest as the best
# model so far until the grid search has been run.
best_model = RandomForestClassifier(random_state=42)
best_model.fit(X_train_scaled, y_train)

# Evaluate the model on the test set
y_pred = best_model.predict(X_test_scaled)

# Compute accuracy and classification report to verify performance
test_accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

# Check the results
test_accuracy, classification_rep

In [25]:
from sklearn.feature_selection import SelectFromModel

# Fit the best model to get feature importances
best_model.fit(X_train_scaled, y_train)

# Select features whose importance is at least the mean importance (at most 10).
# prefit=True tells the selector to reuse the already fitted best_model.
selector = SelectFromModel(best_model, threshold="mean", max_features=10, prefit=True)
X_train_selected = selector.transform(X_train_scaled)
X_test_selected = selector.transform(X_test_scaled)

# Check which features were selected
selected_features = X.columns[selector.get_support()]
selected_features


Index(['CryoSleep', 'Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa',
       'VRDeck', 'CabinNum'],
      dtype='object')
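
The `threshold="mean"` rule used above keeps a feature when its importance is at least the mean importance. A minimal sketch of that rule with made-up importances (the feature names and values are hypothetical, not taken from the model above):

```python
import numpy as np

# Hypothetical importances for five features (made-up values)
feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])
importances = np.array([0.40, 0.30, 0.15, 0.10, 0.05])

# threshold="mean" keeps features whose importance is at least the mean
mask = importances >= importances.mean()
selected = feature_names[mask]
print(importances.mean(), list(selected))
```

`max_features` then acts as an upper bound on how many of the features that pass the threshold are kept.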

In [26]:
# Train the model using only the selected features
best_model.fit(X_train_selected, y_train)

# Make predictions on the selected test set
y_pred_selected = best_model.predict(X_test_selected)

# Evaluate the model's performance
test_accuracy_selected = accuracy_score(y_test, y_pred_selected)
classification_rep_selected = classification_report(y_test, y_pred_selected)

# Check the results
test_accuracy_selected, classification_rep_selected


(0.78205865439908,
 '              precision    recall  f1-score   support\n\n           0       0.81      0.73      0.77       861\n           1       0.76      0.83      0.79       878\n\n    accuracy                           0.78      1739\n   macro avg       0.78      0.78      0.78      1739\nweighted avg       0.78      0.78      0.78      1739\n')

In [29]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Re-train the original model using all features (before feature selection)
original_model = RandomForestClassifier(random_state=42)

# Fit the original model using all features
original_model.fit(X_train_scaled, y_train)

# Make predictions using the original model (with all features)
y_pred_original = original_model.predict(X_test_scaled)

# Evaluate the original model
test_accuracy_original = accuracy_score(y_test, y_pred_original)
classification_rep_original = classification_report(y_test, y_pred_original)

# Evaluate the model with selected features (using the model we tuned earlier)
y_pred_selected = best_model.predict(X_test_selected)

# Evaluate the selected model's performance
test_accuracy_selected = accuracy_score(y_test, y_pred_selected)
classification_rep_selected = classification_report(y_test, y_pred_selected)

# Compare the results
test_accuracy_original, test_accuracy_selected


(0.7947096032202415, 0.78205865439908)
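
To put the two accuracies in perspective, they can be converted back into counts of correct predictions. The sketch below only reuses numbers printed above (the test-set size of 1739 comes from the classification report); the variable names are hypothetical.

```python
# Values copied from the outputs above
n_test = 1739                      # support reported in the classification report
acc_original = 0.7947096032202415  # all features
acc_selected = 0.78205865439908    # selected features only

# Accuracy is correct / n_test, so multiply back to get counts
correct_original = round(acc_original * n_test)
correct_selected = round(acc_selected * n_test)
print(correct_original, correct_selected, correct_original - correct_selected)
```

The gap of about 1.3 percentage points works out to 22 test passengers, so the model trained on all features is only slightly ahead.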

- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [None]:
# your code here
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']  # 'auto' was removed in scikit-learn 1.3
}

# Initialize the model (we're using the RandomForestClassifier here)
rf_model = RandomForestClassifier(random_state=42)

# Set up GridSearchCV
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)

# Fit GridSearchCV
grid_search.fit(X_train_scaled, y_train)

# Get the best model
best_grid_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred_best_model = best_grid_model.predict(X_test_scaled)

# Get accuracy and classification report
test_accuracy_best_model = accuracy_score(y_test, y_pred_best_model)
classification_rep_best_model = classification_report(y_test, y_pred_best_model)

# Output the results
test_accuracy_best_model, classification_rep_best_model


- Evaluate your model

In [1]:
#your code here
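
One possible way to evaluate a tuned model is to compare it with a default baseline on the same held-out test set. The sketch below is not the lab's solution: it uses synthetic data from `make_classification` as a stand-in for the Spaceship Titanic features, a deliberately tiny grid, and hypothetical names such as `X_demo` and `search`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Spaceship Titanic features and target
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline with default hyperparameters
baseline = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Tuned model from a deliberately small grid (values chosen arbitrarily)
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# Score both models on the same test set
y_pred_tuned = search.best_estimator_.predict(X_te)
acc_baseline = accuracy_score(y_te, baseline.predict(X_te))
acc_tuned = accuracy_score(y_te, y_pred_tuned)
cm_tuned = confusion_matrix(y_te, y_pred_tuned)

print("baseline:", acc_baseline, "tuned:", acc_tuned)
print(cm_tuned)  # rows are true classes, columns are predicted classes
```

A tuned model is not guaranteed to beat the baseline on the test set, so it is worth reporting both numbers.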

**Grid/Random Search**

For this lab we will use Grid Search.
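
For contrast, random search samples a fixed number of combinations instead of trying every one. The sketch below is an illustration only, using synthetic data and arbitrary parameter ranges; names such as `X_demo` and `random_search` are hypothetical.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Distributions or lists to sample from (ranges are arbitrary)
param_distributions = {
    "n_estimators": randint(20, 101),   # integers from 20 to 100
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": randint(1, 5),  # integers from 1 to 4
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # only 5 sampled combinations are evaluated
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_, random_search.best_score_)
```

The cost is controlled by `n_iter` rather than by the size of the search space, which helps when the grid would otherwise be very large.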

- Define hyperparameters to fine tune.

In [None]:
#your code here
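
Before running a search, it helps to know how many model fits a grid implies. The sketch below counts the combinations of an example random-forest grid with `ParameterGrid`; the grid values are an assumption for illustration, and `demo_grid` and `cv_folds` are hypothetical names.

```python
from sklearn.model_selection import ParameterGrid

# Example grid for a random forest (values are illustrative)
demo_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, 30, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

cv_folds = 5
n_combinations = len(ParameterGrid(demo_grid))  # product of the list lengths
total_fits = n_combinations * cv_folds          # one fit per combination and fold

print(n_combinations, total_fits)
```

Here 3 × 4 × 3 × 3 × 2 gives 216 combinations, or 1080 fits with 5-fold cross-validation, so every extra value in a list multiplies the run time.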

- Run Grid Search

- Evaluate your model
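
The last two steps can be sketched together as follows. This is a hedged illustration, not the lab's reference solution: it uses synthetic data, a small grid, and a `Pipeline` so that scaling is refitted inside each cross-validation fold. The step names `scaler` and `rf` and all variable names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Bundle scaling and the model so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Step parameters are addressed as <step name>__<parameter name>
pipe_grid = {"rf__n_estimators": [50, 100], "rf__max_depth": [5, None]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=3, scoring="accuracy")
pipe_search.fit(X_tr, y_tr)

# Rank all tried combinations by their mean cross-validated accuracy
results = pd.DataFrame(pipe_search.cv_results_)
ranking = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(ranking)

# Compare the cross-validated score with the held-out test score
test_acc = pipe_search.score(X_te, y_te)
print(pipe_search.best_params_, pipe_search.best_score_, test_acc)
```

A large gap between `best_score_` and the test accuracy can be a sign that the search has overfitted to the cross-validation folds.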