# LAB | Hyperparameter Tuning

**Load the data**

The final step is to maximize the performance of your Spaceship Titanic model.

The data can be found here:

https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv

Metadata

https://github.com/data-bootcamp-v4/data/blob/main/spaceship_titanic.md

So far we've been training and evaluating models with default values for hyperparameters.

Today we will perform the same feature engineering as before and then compare the best-performing models you have so far, this time fine-tuning their hyperparameters.

In [45]:
#Libraries
import pandas as pd
import numpy as np
from scipy import stats as st
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

In [10]:
spaceship = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/spaceship_titanic.csv")
spaceship.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Now perform the same as before:
- Feature Scaling
- Feature Selection


In [11]:
#your code here
spaceship_cleaned = spaceship.dropna().copy() # drop rows with NaN; .copy() gives an independent DataFrame, so the column assignment below does not raise SettingWithCopyWarning

In [12]:
spaceship_cleaned['Cabin'] = spaceship_cleaned['Cabin'].str[0] # keep only the first character (the deck letter) of each Cabin entry
print(spaceship_cleaned['Cabin'].unique())

['B' 'F' 'A' 'G' 'E' 'C' 'D' 'T']


In [13]:
spaceship_cleaned = spaceship_cleaned.drop(columns=['PassengerId', 'Name'])

In [14]:
# one-hot encode the categorical columns:

non_numeric_columns = spaceship_cleaned.select_dtypes(include=['object']).columns # Select non-numeric columns

spaceship_cleaned = pd.get_dummies(spaceship_cleaned, columns=non_numeric_columns, drop_first=True) # Generate dummy variables

spaceship_cleaned

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,HomePlanet_Europa,HomePlanet_Mars,CryoSleep_True,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Destination_PSO J318.5-22,Destination_TRAPPIST-1e,VIP_True
0,39.0,0.0,0.0,0.0,0.0,0.0,False,True,False,False,True,False,False,False,False,False,False,False,True,False
1,24.0,109.0,9.0,25.0,549.0,44.0,True,False,False,False,False,False,False,False,True,False,False,False,True,False
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,True,False,False,False,False,False,False,False,False,False,False,True,True
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,True,False,False,False,False,False,False,False,False,False,False,True,False
4,16.0,303.0,70.0,151.0,565.0,2.0,True,False,False,False,False,False,False,False,True,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,False,True,False,False,False,False,False,False,False,False,False,False,False,True
8689,18.0,0.0,0.0,0.0,0.0,0.0,False,False,False,True,False,False,False,False,False,True,False,True,False,False
8690,26.0,0.0,0.0,1872.0,1.0,0.0,True,False,False,False,False,False,False,False,False,True,False,False,True,False
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,False,True,False,False,False,False,False,True,False,False,False,False,False,False


In [15]:
features = spaceship_cleaned.select_dtypes(include=['int64', 'float64']) # select the int64/float64 columns as features (the dummy columns and the target are left out)

target = spaceship_cleaned['Transported'] # set the Transported column as the target

#preparing the sets:
X = features
y = target

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
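
One consequence of the feature selection above is worth checking: `select_dtypes(include=['int64', 'float64'])` keeps only integer and float columns, so the dummy columns produced by `pd.get_dummies` (boolean or `uint8`, depending on the pandas version) do not end up in `X`. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame mimicking a few Spaceship Titanic columns
toy = pd.DataFrame({
    "Age": [39.0, 24.0, 58.0],
    "Spa": [0.0, 549.0, 6715.0],
    "HomePlanet": ["Europa", "Earth", "Mars"],
    "Transported": [False, True, False],
})

# One-hot encode the text column, as in the lab
toy_encoded = pd.get_dummies(toy, columns=["HomePlanet"], drop_first=True)

# Keep only int64/float64 columns: the dummies and the boolean target drop out
toy_features = toy_encoded.select_dtypes(include=["int64", "float64"])
print(list(toy_features.columns))  # ['Age', 'Spa']
```

If the encoded categories should be part of the features, one option would be to select every column except the target instead, for example `toy_encoded.drop(columns="Transported")`.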

In [21]:
normalizer = MinMaxScaler()

normalizer.fit(X_train)

In [22]:
# scale both sets with the scaler fitted on the training data:
X_train_norm_np = normalizer.transform(X_train)
X_test_norm_np = normalizer.transform(X_test)

#creating DataFrames:
X_train_norm_df = pd.DataFrame(X_train_norm_np, columns = X_train.columns, index=X_train.index)
X_test_norm_df = pd.DataFrame(X_test_norm_np, columns = X_test.columns, index=X_test.index)
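
Min-max scaling maps each feature to x' = (x - min) / (max - min), where min and max are learned from the training data only. A small sketch with made-up numbers, showing that test values outside the training range land outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data (illustrative values only)
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=10 and max=30 from the training data

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training min and max

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.25 1.5 ]
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.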

- Now let's use the best model we got so far to see how much it improves when we fine-tune its hyperparameters.

In [23]:
#your code here

# initialize random forest:
forest = RandomForestClassifier(n_estimators=100,
                             max_depth=10)

In [24]:
forest.fit(X_train_norm_df, y_train) # training the model

- Evaluate your model

In [25]:
#your code here
# Make predictions with the Random Forest Classifier
y_pred_test_rf = forest.predict(X_test_norm_df)

# The classification metrics used below were already imported in the first cell

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred_test_rf)
print(f"Accuracy: {accuracy:.2f}")

# Calculate precision, recall, and F1 Score
precision = precision_score(y_test, y_pred_test_rf, average='weighted')
recall = recall_score(y_test, y_pred_test_rf, average='weighted')
f1 = f1_score(y_test, y_pred_test_rf, average='weighted')

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

# Confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test_rf))

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred_test_rf))

Accuracy: 0.78
Precision: 0.78
Recall: 0.78
F1 Score: 0.78
Confusion Matrix:
[[481 180]
 [114 547]]
Classification Report:
              precision    recall  f1-score   support

       False       0.81      0.73      0.77       661
        True       0.75      0.83      0.79       661

    accuracy                           0.78      1322
   macro avg       0.78      0.78      0.78      1322
weighted avg       0.78      0.78      0.78      1322
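
The figures in the report for the `True` class can be reproduced by hand from the confusion matrix printed above. A short sketch, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, in the order `False`, `True`):

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[481, 180],
               [114, 547]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_true = tp / (tp + fp)  # of the passengers predicted True, how many were True
recall_true = tp / (tp + fn)     # of the passengers actually True, how many were found
f1_true = 2 * precision_true * recall_true / (precision_true + recall_true)

print(f"accuracy={accuracy:.2f} precision={precision_true:.2f} "
      f"recall={recall_true:.2f} f1={f1_true:.2f}")
# accuracy=0.78 precision=0.75 recall=0.83 f1=0.79
```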



**Grid/Random Search**

For this lab we will use Grid Search.

In [36]:
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

- Define hyperparameters to fine tune.

In [37]:
#your code here
parameter_grid = {"max_depth": [10, 50],
                  "min_samples_split": [4, 16],
                  "max_leaf_nodes": [250, 100],
                  "max_features": ["sqrt", "log2"]} # this grid has 2 * 2 * 2 * 2 = 16 hyperparameter combinations

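A grid search evaluates every combination of the listed values, so the number of candidates is the product of the list lengths. As a quick check, scikit-learn's `ParameterGrid` can enumerate them for the grid defined above (the grid is repeated here so the snippet runs on its own):

```python
from sklearn.model_selection import ParameterGrid

parameter_grid = {"max_depth": [10, 50],
                  "min_samples_split": [4, 16],
                  "max_leaf_nodes": [250, 100],
                  "max_features": ["sqrt", "log2"]}

combinations = list(ParameterGrid(parameter_grid))
print(len(combinations))  # 16 candidates for the grid search to evaluate
print(combinations[0])    # one candidate, as a dict of hyperparameter values
```

With 10-fold cross-validation, each of the 16 candidates is fitted 10 times, which gives 160 fits in total.
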
- Run Grid Search

In [38]:
# We create an instance of our machine learning model
dt = DecisionTreeClassifier(random_state=123)

# We need to set these two variables to be able to compute a confidence interval
confidence_level = 0.95
folds = 10

# Now we need to create an instance of the GridSearchCV class
gs = GridSearchCV(dt, param_grid=parameter_grid, cv=folds, verbose=10) # "cv" sets the number of cross-validation folds

start_time = time.time()
gs.fit(X_train_norm_df, y_train)
end_time = time.time()

Fitting 10 folds for each of 16 candidates, totalling 160 fits
[CV 1/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 1/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.766 total time=   0.0s
[CV 2/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 2/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.786 total time=   0.0s
[CV 3/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 3/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.809 total time=   0.0s
[CV 4/10; 1/16] START max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4
[CV 4/10; 1/16] END max_depth=10, max_features=sqrt, max_leaf_nodes=250, min_samples_split=4;, score=0.769 total time=   0.0s
[CV 5/10; 1/16] START max_depth=10, max_features=sqrt
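
The section title also mentions random search. As a rough sketch of that alternative (it is not what this lab runs), `RandomizedSearchCV` samples a fixed number of combinations instead of trying all of them. The synthetic data from `make_classification`, the `n_iter=8` budget, the 5 folds, and the `randint` ranges below are assumptions chosen for illustration:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Spaceship Titanic set)
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

# Distributions to sample from; the ranges are illustrative
param_distributions = {"max_depth": randint(2, 51),
                       "min_samples_split": randint(2, 17),
                       "max_leaf_nodes": randint(10, 251),
                       "max_features": ["sqrt", "log2"]}

rs = RandomizedSearchCV(DecisionTreeClassifier(random_state=123),
                        param_distributions=param_distributions,
                        n_iter=8,  # only 8 sampled candidates are evaluated
                        cv=5,
                        random_state=123)
rs.fit(X_demo, y_demo)

print(rs.best_params_)
print(f"{rs.best_score_:.4f}")
```

Random search becomes more attractive as the grid grows, since its cost is set by `n_iter` rather than by the number of possible combinations.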

In [41]:
# start_time and end_time were recorded around gs.fit() in the grid search cell
print(f"Time taken to find the best combination of hyperparameters among the given ones: {end_time - start_time: .4f} seconds")
print("\n")

# Display the best hyperparameters found through grid search
print(f"The best combination of hyperparameters has been: {gs.best_params_}")

# best_score_ is the mean cross-validated accuracy of the best candidate (accuracy is the default score for classifiers)
print(f"The best accuracy score is: {gs.best_score_: .4f}")

# Create a DataFrame from the grid search results and sort it based on the appropriate score metric
results_gs_df = pd.DataFrame(gs.cv_results_).sort_values(by="mean_test_score", ascending=False)

Time taken to find the best combination of hyperparameters among the given ones:  1.3053 seconds


The best combination of hyperparameters has been: {'max_depth': 50, 'max_features': 'sqrt', 'max_leaf_nodes': 100, 'min_samples_split': 16}
The best accuracy score is:  0.7820


In [43]:
# Print the top results for further insight if needed
display(results_gs_df.head())

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_max_features,param_max_leaf_nodes,param_min_samples_split,params,split0_test_score,...,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
11,0.005935,0.000677,0.001405,0.000527,50,sqrt,100,16,"{'max_depth': 50, 'max_features': 'sqrt', 'max...",0.780718,...,0.773157,0.767045,0.801136,0.767045,0.772727,0.797348,0.776515,0.781981,0.014704,1
15,0.006123,0.000852,0.001658,0.000563,50,log2,100,16,"{'max_depth': 50, 'max_features': 'log2', 'max...",0.780718,...,0.773157,0.767045,0.801136,0.767045,0.772727,0.797348,0.776515,0.781981,0.014704,1
3,0.00573,0.000464,0.001526,0.0008,10,sqrt,100,16,"{'max_depth': 10, 'max_features': 'sqrt', 'max...",0.773157,...,0.778828,0.770833,0.782197,0.772727,0.778409,0.799242,0.768939,0.78198,0.012056,3
7,0.006232,0.000361,0.001391,0.000616,10,log2,100,16,"{'max_depth': 10, 'max_features': 'log2', 'max...",0.773157,...,0.778828,0.770833,0.782197,0.772727,0.778409,0.799242,0.768939,0.78198,0.012056,3
1,0.004449,0.000923,0.001301,0.000715,10,sqrt,250,16,"{'max_depth': 10, 'max_features': 'sqrt', 'max...",0.767486,...,0.775047,0.770833,0.780303,0.772727,0.778409,0.799242,0.768939,0.7799,0.012464,5


- Evaluate your model

In [46]:
# results_gs_df is sorted by mean_test_score (descending), so row 0 holds the best candidate
gs_mean_score = results_gs_df.iloc[0]['mean_test_score']  # mean accuracy across the folds
gs_std = results_gs_df.iloc[0]['std_test_score']  # standard deviation of the fold accuracies
gs_sem = gs_std / np.sqrt(folds)  # standard error of the mean

# Calculate the t-critical value for the confidence interval
gs_tc = st.t.ppf(1 - ((1 - confidence_level) / 2), df=folds - 1)
gs_lower_bound = gs_mean_score - (gs_tc * gs_sem)
gs_upper_bound = gs_mean_score + (gs_tc * gs_sem)

# Print the confidence interval for the best hyperparameter combination
print(f"The accuracy confidence interval for the best combination of hyperparameters is: \
    ({gs_lower_bound: .4f}, {gs_mean_score: .4f}, {gs_upper_bound: .4f}) ")

The accuracy confidence interval for the best combination of hyperparameters is:     ( 0.7715,  0.7820,  0.7925) 
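
The interval above is the mean plus or minus t times the standard error, with the standard error equal to the standard deviation divided by the square root of the number of folds, and t taken from a Student t distribution with `folds - 1` degrees of freedom. The arithmetic can be checked with the values from the top row of the results table:

```python
import numpy as np
from scipy import stats as st

# Values copied from the top row of the results table above
mean_score = 0.781981
std_score = 0.014704
folds = 10
confidence_level = 0.95

sem = std_score / np.sqrt(folds)                                 # standard error of the mean
t_crit = st.t.ppf(1 - (1 - confidence_level) / 2, df=folds - 1)  # two-sided t critical value
margin = t_crit * sem

lower, upper = mean_score - margin, mean_score + margin
print(f"({lower:.4f}, {upper:.4f})")  # (0.7715, 0.7925)
```

Because the folds share most of their training data, the fold scores are not independent, so the interval is best read as a rough guide rather than an exact 95% guarantee.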


In [47]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Let's store the best model
best_model = gs.best_estimator_

# Evaluate the model on the test set
y_pred_test_df = best_model.predict(X_test_norm_df)

# Print evaluation results for classification metrics
print("\n")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred_test_df): .4f}")
print(f"Test Precision: {precision_score(y_test, y_pred_test_df, average='weighted'): .4f}")
print(f"Test Recall: {recall_score(y_test, y_pred_test_df, average='weighted'): .4f}")
print(f"Test F1 Score: {f1_score(y_test, y_pred_test_df, average='weighted'): .4f}")
print("\n")

# (Optional) Print a confusion matrix and classification report for additional insights
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_test_df))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_test_df))



Test Accuracy:  0.7632
Test Precision:  0.7660
Test Recall:  0.7632
Test F1 Score:  0.7626


Confusion Matrix:
[[471 190]
 [123 538]]

Classification Report:
              precision    recall  f1-score   support

       False       0.79      0.71      0.75       661
        True       0.74      0.81      0.77       661

    accuracy                           0.76      1322
   macro avg       0.77      0.76      0.76      1322
weighted avg       0.77      0.76      0.76      1322

