# LAB | Hyperparameter Tuning

**Load the data**

This is the final step: maximizing the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before, then take the best-performing models you have so far and fine-tune their hyperparameters.
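To make "default values" concrete, the short sketch below lists a few of the defaults that a random forest is built with when no arguments are passed. The choice of which four hyperparameters to print is only illustrative, and the exact defaults can vary between scikit-learn versions.

```python
from sklearn.ensemble import RandomForestClassifier

# Instantiate with no arguments so every hyperparameter takes its default value
default_rf = RandomForestClassifier()
defaults = default_rf.get_params()

# Print a few of the hyperparameters that are commonly tuned
for name in ["n_estimators", "max_depth", "max_features", "min_samples_leaf"]:
    print(f"{name}: {defaults[name]}")
```

Any of these values can be overridden in the constructor, which is what a hyperparameter search does systematically.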

In [21]:
#Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

In [23]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


Now perform the same steps as before:
- Feature Scaling
- Feature Selection
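The two steps listed above could look roughly like the sketch below. It runs on a small synthetic DataFrame (the column names mirror the dataset, but the values, the toy target, and the choice of `StandardScaler` with `SelectKBest(k=2)` are all assumptions made for illustration, not the method used in the earlier labs).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the numeric spaceship columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(1, 80, size=200).astype(float),
    "RoomService": rng.exponential(200.0, size=200),
    "Spa": rng.exponential(300.0, size=200),
    "Noise": rng.normal(size=200),
})
# Hypothetical target, loosely driven by spending
toy_y = (toy["RoomService"] + toy["Spa"] < 400).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.2, random_state=42)

# Feature scaling: fit on the training split only, then apply to both splits
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Feature selection: keep the k features with the highest ANOVA F-score
selector = SelectKBest(score_func=f_classif, k=2)
X_tr_sel = selector.fit_transform(X_tr_scaled, y_tr)
X_te_sel = selector.transform(X_te_scaled)

kept = X_tr.columns[selector.get_support()].tolist()
print("Kept features:", kept)
print("Shapes:", X_tr_sel.shape, X_te_sel.shape)
```

Fitting the scaler and the selector on the training split only keeps information from the test split out of the preprocessing.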


In [25]:
#your code here
# Drop rows with missing values; .copy() returns an independent DataFrame,
# which avoids pandas' SettingWithCopyWarning on the assignments below
spaceship_v1 = spaceship.dropna().copy()
# The first character of Cabin is the deck letter
spaceship_v1["Cabin_Deck"] = spaceship_v1["Cabin"].str[0]
spaceship_v1 = spaceship_v1.drop(columns=['PassengerId', 'Name'])
# Note: the raw Cabin column is still present, so get_dummies one-hot encodes
# every distinct cabin value (hence the very large column count below)
spaceship_v1 = pd.get_dummies(spaceship_v1, drop_first=True)
spaceship_v1.columns


Index(['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Transported', 'HomePlanet_Europa', 'HomePlanet_Mars', 'CryoSleep_True',
       ...
       'Destination_PSO J318.5-22', 'Destination_TRAPPIST-1e', 'VIP_True',
       'Cabin_Deck_B', 'Cabin_Deck_C', 'Cabin_Deck_D', 'Cabin_Deck_E',
       'Cabin_Deck_F', 'Cabin_Deck_G', 'Cabin_Deck_T'],
      dtype='object', length=5324)
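The `length=5324` above suggests that the raw `Cabin` column was one-hot encoded along with everything else. A possible alternative, sketched here on five cabin values taken from the `head()` output, is to keep only the derived deck letter and drop the raw column before encoding. This is a suggestion only; the results reported in the rest of this notebook come from the 5324-column version.

```python
import pandas as pd

# Toy frame built from the first five rows shown by spaceship.head()
toy = pd.DataFrame({
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/0/S", "F/1/S"],
    "Age": [39.0, 24.0, 58.0, 33.0, 16.0],
})

# Keep the deck letter, then drop the high-cardinality raw column
toy["Cabin_Deck"] = toy["Cabin"].str[0]
encoded = pd.get_dummies(toy.drop(columns=["Cabin"]), drop_first=True)

print(encoded.columns.tolist())
# ['Age', 'Cabin_Deck_B', 'Cabin_Deck_F']
```

With `drop_first=True` the first deck category (`A` here) becomes the implicit baseline, so only the remaining decks get a column.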

- Now let's take the best model we have so far and see how much it improves when we fine-tune its hyperparameters.

In [29]:
#your code here
X = spaceship_v1.drop('Transported', axis=1)
y = spaceship_v1['Transported']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [31]:
#your code here

random_forest_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    n_jobs=-1
)

# Train the Random Forest model
print("--- Applying Random Forest ---")
random_forest_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = random_forest_model.predict(X_test)

# Evaluate the Random Forest model
rf_accuracy = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Classifier Accuracy: {rf_accuracy:.4f}")

--- Applying Random Forest ---
Random Forest Classifier Accuracy: 0.8162


- Evaluate your model

In [36]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

X = spaceship_v1.drop('Transported', axis=1)
y = spaceship_v1['Transported']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

cv_scores = cross_val_score(model, X_train, y_train, cv=5)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Cross-validation scores (5-fold on training data): {cv_scores}")
print(f"Mean Cross-validation Accuracy: {cv_scores.mean():.4f}")
print("Confusion Matrix:")
print(conf_matrix)

Accuracy: 0.7496
Cross-validation scores (5-fold on training data): [0.75685904 0.71428571 0.72280038 0.74172185 0.74147727]
Mean Cross-validation Accuracy: 0.7354
Confusion Matrix:
[[547 106]
 [225 444]]
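As a worked check, the accuracy reported above can be recomputed from the confusion matrix, together with precision and recall for the `True` class. The sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels, classes ordered `False`, `True`).

```python
import numpy as np

# Confusion matrix copied from the output above
conf_matrix = np.array([[547, 106],
                        [225, 444]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision = tp / (tp + fp)   # of all predicted True, how many were True
recall = tp / (tp + fn)      # of all actual True, how many were found

print(f"Accuracy:  {accuracy:.4f}")   # 0.7496
print(f"Precision: {precision:.4f}")  # 0.8073
print(f"Recall:    {recall:.4f}")     # 0.6637
```

The gap between precision and recall shows that this model misses about a third of the transported passengers (225 false negatives) while making fewer false positive errors (106).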


**Grid/Random Search**

For this lab we will use Grid Search.

- Define hyperparameters to fine tune.

In [38]:
#your code here
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [10, 20, None],
    'min_samples_leaf': [1, 2, 4]
}

print("Defined Hyperparameter Grid:")
print(param_grid)

Defined Hyperparameter Grid:
{'n_estimators': [50, 100, 200], 'max_features': ['sqrt', 'log2', None], 'max_depth': [10, 20, None], 'min_samples_leaf': [1, 2, 4]}
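Before running an exhaustive search it helps to know how many candidates the grid contains. The sketch below counts them with `ParameterGrid` and then runs `GridSearchCV` on a much smaller grid over synthetic data so that it finishes quickly. The synthetic dataset, the `small_grid` values, and `cv=3` are assumptions chosen for speed, not settings from this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Same grid as defined above
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [10, 20, None],
    'min_samples_leaf': [1, 2, 4]
}

# 3 * 3 * 3 * 3 = 81 combinations; each one is fit once per CV fold
n_candidates = len(ParameterGrid(param_grid))
print(f"Candidates: {n_candidates}")               # 81
print(f"Fits with 5-fold CV: {n_candidates * 5}")  # 405

# Hypothetical small problem and reduced grid, used only to keep the demo fast
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
small_grid = {'n_estimators': [10, 20], 'max_depth': [3, None]}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
    scoring='accuracy',
)
grid_search.fit(X_demo, y_demo)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validated accuracy: {grid_search.best_score_:.4f}")
```

On the real data, passing the full `param_grid` to `GridSearchCV` would follow the same pattern, at the cost of 81 candidates times the number of folds.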


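- Run the search. Note that the cell below uses RandomizedSearchCV rather than an exhaustive grid search: it samples n_iter candidates from its own param_distributions instead of evaluating every combination in the param_grid defined above.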

In [44]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Define the model
rf_model = RandomForestClassifier(random_state=42)

# Define hyperparameter distributions
param_distributions = {
    'n_estimators': randint(50, 150),  # Integers 50..149 (upper bound is exclusive); reduced range for quicker trials
    'max_features': ['sqrt', 'log2'],  # Removed None to simplify
    'max_depth': randint(5, 15),       # Integers 5..14; reduced max to limit complexity
    'min_samples_leaf': randint(1, 3)  # Samples only 1 or 2 (upper bound is exclusive)
}

# Configure RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf_model,
    param_distributions=param_distributions,
    n_iter=20,  # Reduced number of iterations for faster results
    cv=3,       # Using 3-fold cross-validation
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,  # Utilizes all available cores
    verbose=1   # Slightly more output to keep track of progress
)

# Run the hyperparameter search
random_search.fit(X_train, y_train)

# Print the results
print(f"Best parameters found: {random_search.best_params_}")
print(f"Best cross-validated accuracy: {random_search.best_score_:.4f}")

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best parameters found: {'max_depth': 11, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'n_estimators': 137}
Best cross-validated accuracy: 0.7405
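One detail worth checking when reading these results is the range each distribution can actually produce. `scipy.stats.randint(low, high)` excludes `high`, which the small sketch below illustrates; the sample size of 1000 is arbitrary.

```python
from scipy.stats import randint

# Draw many values from the same distribution used for min_samples_leaf
samples = randint(1, 3).rvs(size=1000, random_state=42)
print(sorted(set(samples.tolist())))  # [1, 2]

# The n_estimators distribution never reaches its upper bound either
n_est_samples = randint(50, 150).rvs(size=1000, random_state=42)
print(n_est_samples.min() >= 50, n_est_samples.max() <= 149)  # True True
```

So `min_samples_leaf` was searched over 1 and 2 only, and `max_depth` over 5 to 14, which is narrower than the grid defined earlier.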


- Evaluate your model

In [46]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Retrieve the best model from the randomized search
best_rf_model = random_search.best_estimator_

# No explicit refit is needed: with the default refit=True, RandomizedSearchCV
# has already refit best_estimator_ on the entire training set

# Make predictions on the test set
y_pred = best_rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print evaluation results
print("Test Accuracy: {:.4f}".format(accuracy))
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", class_report)

Test Accuracy: 0.7481
Confusion Matrix:
 [[546 107]
 [226 443]]
Classification Report:
               precision    recall  f1-score   support

       False       0.71      0.84      0.77       653
        True       0.81      0.66      0.73       669

    accuracy                           0.75      1322
   macro avg       0.76      0.75      0.75      1322
weighted avg       0.76      0.75      0.75      1322

