# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have so far and fine-tune their hyperparameters.
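To see which default values have been in use so far, scikit-learn estimators expose them through `get_params()`. A minimal sketch, assuming the model of interest is `KNeighborsClassifier` (the classifier tuned later in this lab):

```python
from sklearn.neighbors import KNeighborsClassifier

# An estimator built with no arguments uses scikit-learn's default hyperparameters
default_knn = KNeighborsClassifier()
defaults = default_knn.get_params()

# Show only the three hyperparameters that are tuned in this lab
for name in ("n_neighbors", "weights", "p"):
    print(f"{name}: {defaults[name]}")
```

Any value not passed explicitly to the constructor keeps its default, which is why tuning starts from a grid of alternatives around these values.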

In [27]:
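# Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier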

In [29]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection


In [37]:
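# Keep only the deck letter from Cabin (missing cabins become the string "nan", so their deck is "n")
spaceship["Cabin"] = spaceship["Cabin"].astype(str).str[0]

In [39]:
# Drop the identifier columns
spaceship = spaceship.drop(columns=["PassengerId", "Name"])

In [41]:
# One-hot encode the categorical columns
spaceship = pd.get_dummies(spaceship)

In [43]:
# Drop the rows that still contain missing values
spaceship = spaceship.dropna()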

In [47]:
# Split features and target
X = spaceship.drop("Transported", axis=1)
y = spaceship["Transported"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features (all columns are numeric now): fit on the training split only, then apply to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
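The scaler is fitted on the training split only, and the test split is transformed with the training statistics, so nothing about the test set leaks into preprocessing. A minimal pure-Python sketch of what `StandardScaler` computes for a single column; the values below are made up for illustration:

```python
from statistics import fmean, pstdev

# Toy single-feature data (hypothetical values, not taken from the dataset)
train_values = [18.0, 25.0, 32.0, 41.0, 60.0]
test_values = [20.0, 45.0]

# "fit": learn the mean and population standard deviation from the training data only
mean = fmean(train_values)
std = pstdev(train_values)

def standardize(values):
    """Apply the training statistics: z = (x - mean) / std."""
    return [(x - mean) / std for x in values]

train_scaled = standardize(train_values)  # analogous to scaler.fit_transform(X_train)
test_scaled = standardize(test_values)    # analogous to scaler.transform(X_test)

print("train mean after scaling:", round(fmean(train_scaled), 6))
print("train std after scaling:", round(pstdev(train_scaled), 6))
print("test values scaled with train statistics:", [round(z, 3) for z in test_scaled])
```

Scaling matters for KNN in particular because the model is distance-based: an unscaled column with large values, such as the spending columns, would otherwise dominate the distance.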

- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [49]:
from sklearn.model_selection import GridSearchCV

# Set up the hyperparameter grid
param_grid = {
    'n_neighbors': list(range(1, 21)),  # try k from 1 to 20
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # p=1 (Manhattan), p=2 (Euclidean)
}

# Initialize a KNN model
knn = KNeighborsClassifier()

# Use GridSearchCV to find the best combination
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)


Best parameters: {'n_neighbors': 19, 'p': 1, 'weights': 'uniform'}
Best cross-validation score: 0.7649278500248793


- Evaluate your model

In [53]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define parameter grid for tuning
param_grid = {
    'n_neighbors': list(range(1, 21)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2]  # p=1: Manhattan, p=2: Euclidean
}

# Create base KNN model
knn = KNeighborsClassifier()

# Run GridSearchCV
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Best model from grid search
best_knn = grid_search.best_estimator_

# Evaluate on test set
y_pred_best = best_knn.predict(X_test_scaled)

# Results
print("Best Parameters:", grid_search.best_params_)
print("Test Accuracy:", accuracy_score(y_test, y_pred_best))
print("Classification Report:")
print(classification_report(y_test, y_pred_best))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_best))


Best Parameters: {'n_neighbors': 19, 'p': 1, 'weights': 'uniform'}
Test Accuracy: 0.8018372703412073

Classification Report:
              precision    recall  f1-score   support

       False       0.80      0.80      0.80       758
        True       0.81      0.80      0.80       766

    accuracy                           0.80      1524
   macro avg       0.80      0.80      0.80      1524
weighted avg       0.80      0.80      0.80      1524


Confusion Matrix:
[[610 148]
 [154 612]]


**Grid/Random Search**

For this lab we will use Grid Search.
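Grid search tries every combination in the grid, whereas random search evaluates only a random subset of them, which is cheaper on large grids. A minimal sketch of that difference in plain Python, assuming the KNN grid used in this lab; the budget of 10 samples and the seed are arbitrary choices for illustration. In scikit-learn the corresponding classes are `GridSearchCV` and `RandomizedSearchCV`.

```python
import itertools
import random

param_grid = {
    "n_neighbors": list(range(1, 21)),
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

# Grid search: the full Cartesian product of all value lists
names = list(param_grid)
all_combinations = [
    dict(zip(names, values))
    for values in itertools.product(*(param_grid[name] for name in names))
]

# Random search: a fixed budget of combinations drawn without replacement
rng = random.Random(42)  # arbitrary seed, only for reproducibility
sampled_combinations = rng.sample(all_combinations, k=10)

# Show the subset that a random search with this budget would evaluate
for combination in sampled_combinations:
    print(combination)
```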

- Define the hyperparameters to fine-tune.

In [55]:
param_grid = {
    'n_neighbors': list(range(1, 21)),       # Try k from 1 to 20
    'weights': ['uniform', 'distance'],      # Uniform: all neighbors equal, Distance: closer = more weight
    'p': [1, 2]                               # p=1: Manhattan distance, p=2: Euclidean distance
}
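The effect of `p` and `weights` can be illustrated without scikit-learn. A minimal sketch, assuming two made-up points and three made-up neighbors: `p` selects the order of the Minkowski distance, and `weights='distance'` weights each neighbor's vote by the inverse of its distance.

```python
from collections import defaultdict

def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a, point_b = (0.0, 0.0), (3.0, 4.0)
print("Manhattan (p=1):", minkowski(point_a, point_b, 1))
print("Euclidean (p=2):", minkowski(point_a, point_b, 2))

def vote(neighbors, weights):
    """Pick the winning label from (distance, label) pairs; assumes distance > 0."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        totals[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(totals, key=totals.get)

# Hypothetical neighbors: one very close True, two farther False
neighbors = [(0.5, True), (2.0, False), (2.5, False)]
print("uniform vote:", vote(neighbors, "uniform"))
print("distance vote:", vote(neighbors, "distance"))
```

With uniform weights the two distant neighbors outvote the close one; with distance weights the single close neighbor wins, which is why the two settings can lead to different predictions for the same `n_neighbors`.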


- Run Grid Search

In [57]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Define hyperparameter grid
param_grid = {
    'n_neighbors': list(range(1, 21)),
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

# Initialize base KNN model
knn = KNeighborsClassifier()

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=knn,
    param_grid=param_grid,
    cv=5,                     # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1                 # Use all CPU cores
)

# Fit the grid search on the scaled training data
grid_search.fit(X_train_scaled, y_train)

# Print the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Accuracy:", grid_search.best_score_)


Best Parameters: {'n_neighbors': 19, 'p': 1, 'weights': 'uniform'}
Best Cross-Validation Accuracy: 0.7649278500248793
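As a rough cost estimate for this search, the number of model fits is the number of grid combinations times the number of folds. A small sketch, assuming the grid and `cv=5` from the cell above and scikit-learn's default `refit=True`:

```python
import math

param_grid = {
    "n_neighbors": list(range(1, 21)),
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
cv_folds = 5

# Number of candidate settings = product of the sizes of each value list
n_candidates = math.prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold; refit=True (the default) adds one final fit
n_cv_fits = n_candidates * cv_folds
n_total_fits = n_cv_fits + 1

print(f"{n_candidates} candidates x {cv_folds} folds = {n_cv_fits} fits (+1 refit)")
```

This count grows multiplicatively with every hyperparameter added to the grid, which is the main reason to spread the work across cores with `n_jobs=-1`.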


- Evaluate your model

In [59]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Get the best model from grid search
best_knn = grid_search.best_estimator_

# Predict on the test set
y_pred_best = best_knn.predict(X_test_scaled)

# Accuracy
print("Test Accuracy:", accuracy_score(y_test, y_pred_best))

# Detailed classification report
print("Classification Report:")
print(classification_report(y_test, y_pred_best))

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_best))


Test Accuracy: 0.8018372703412073

Classification Report:
              precision    recall  f1-score   support

       False       0.80      0.80      0.80       758
        True       0.81      0.80      0.80       766

    accuracy                           0.80      1524
   macro avg       0.80      0.80      0.80      1524
weighted avg       0.80      0.80      0.80      1524


Confusion Matrix:
[[610 148]
 [154 612]]


In [None]:
# Test accuracy shows how well the model performs on unseen data.

# The classification report breaks performance down by class:

# precision: of the samples predicted as a class, the fraction that truly belong to it

# recall: of the samples that truly belong to a class, the fraction the model identified

# f1-score: the harmonic mean of precision and recall

# The confusion matrix counts the correct and incorrect predictions for each class.

# Compare these results with your earlier default KNN model to confirm whether tuning improved performance.
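The reported numbers can be recomputed by hand from the confusion matrix above. A short sketch in plain Python, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes, in the order False, True:

```python
# Confusion matrix from the evaluation output: rows = actual, columns = predicted
confusion = [
    [610, 148],  # actual False: predicted False, predicted True
    [154, 612],  # actual True:  predicted False, predicted True
]

tn, fp = confusion[0]
fn, tp = confusion[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_true = tp / (tp + fp)  # of all samples predicted True, the share that are True
recall_true = tp / (tp + fn)     # of all samples that are True, the share that were found
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(f"accuracy:         {accuracy:.4f}")
print(f"precision (True): {precision_true:.2f}")
print(f"recall (True):    {recall_true:.2f}")
print(f"f1-score (True):  {f1_true:.2f}")
```

The same formulas applied to the False class, with the roles of the two rows and columns swapped, give the first line of the classification report.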