# Spaceship. Part 5.
## Hyperparameter tuning

We'll load the data prepared in Part 4, along with the scores DataFrame, and set a random seed:

In [109]:
# Random seed for reproducibility
SEED = 123

import pandas as pd

train = pd.read_csv('04_train_prepared.csv', index_col=0)
test =  pd.read_csv('04_test_prepared.csv', index_col=0)
scores_df = pd.read_csv('04_scores_df.csv', index_col=0)
test_Ids = pd.read_csv('test_Ids.csv', index_col=0).reset_index(drop=True)

train['Transported'] = [1 if i else 0 for i in train['Transported']]

Next, we'll create a train_evaluate function that returns the average cross-validation ROC AUC score for a given set of parameters. The classifier is defined inside this function. We'll set n_estimators to 100 for speed; a larger number may improve performance.

In [110]:
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def train_evaluate(params):
    '''
    Takes a dict of hyperparameters for the classifier.
    
    Returns the average cross-validated ROC AUC score.
    '''
    
    # Prepare our best estimator for training
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(random_state=SEED,
                               n_estimators= 100,
                               n_jobs=-1
                               )


    # Set parameters for the model
    model.set_params(**params)
    
    # Create a StratifiedKFold object (6 splits with equal proportion of positive target values)
    skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)
    
    # An empty list for collecting scores
    test_roc_auc_scores = []
    
    # Iterate through folds
    for train_index, cv_index in skf.split(train.drop('Transported', axis=1), train['Transported']):
        # Obtain training and testing folds
        cv_train, cv_test = train.iloc[train_index], train.iloc[cv_index]
        
        # Fit the model
        model.fit(cv_train.drop('Transported', axis=1), cv_train['Transported']) 
        
        # Calculate ROC AUC score and append to the scores lists
        test_pred_proba = model.predict_proba(cv_test.drop('Transported', axis=1))[:, 1]
        test_roc_auc_scores.append(roc_auc_score(cv_test['Transported'], test_pred_proba))
        
    return np.mean(test_roc_auc_scores)
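
The manual fold loop in train_evaluate can also be written with scikit-learn's `cross_val_score` helper. The sketch below is only an illustration: it uses a small synthetic dataset from `make_classification` as a stand-in for the prepared Spaceship data (an assumption), and a smaller forest to keep it fast. It checks that both approaches give the same per-fold ROC AUC values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

SEED = 123

# Synthetic stand-in for the prepared training data (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=SEED)

model = RandomForestClassifier(n_estimators=20, random_state=SEED)
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)

# Manual loop, mirroring train_evaluate
manual_scores = []
for train_index, cv_index in skf.split(X, y):
    model.fit(X[train_index], y[train_index])
    proba = model.predict_proba(X[cv_index])[:, 1]
    manual_scores.append(roc_auc_score(y[cv_index], proba))

# The same evaluation through the built-in helper
helper_scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')

print(np.mean(manual_scores), np.mean(helper_scores))
```

The manual loop stays useful when extra per-fold values are needed, as in get_cv_scores later in this part.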
        

We'll use Optuna for our tuning. For this, we'll need to create a study:

In [111]:
import optuna

study = optuna.create_study(study_name='05_RF', direction='maximize')

[I 2023-07-28 18:58:11,329] A new study created in memory with name: 05_RF


Next, we need to define the objective function to optimize, which contains the search range for each parameter. To change the ranges, we'll need to redefine this function and start a new study.

In [112]:
def objective(trial):
    params = {
        # 'n_estimators': optuna.distributions.IntDistribution(100, 1000),
        # 'criterion': optuna.distributions.CategoricalDistribution(['log_loss', 'entropy']),
        'criterion': trial.suggest_categorical('criterion', ['log_loss', 'gini']),
        'max_depth': trial.suggest_int('max_depth', 2, 50),
        'max_features': trial.suggest_int('max_features', 1, 16),
        'max_leaf_nodes': trial.suggest_int('max_leaf_nodes', 20, 300),
        "min_impurity_decrease": trial.suggest_float("min_impurity_decrease", 1e-9, 5e+7, log=True),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 2, 30),
        'ccp_alpha': trial.suggest_float('ccp_alpha', 1e-7, 4e-1, log=True),
        'max_samples': trial.suggest_float('max_samples', 0.3, 1)
             
         }
    return train_evaluate(params)

Now we can run our study. Let's run it for 5 seconds:

In [113]:
study.optimize(objective, timeout=5, n_jobs=-1)

[I 2023-07-28 18:58:26,439] Trial 10 finished with value: 0.5 and parameters: {'criterion': 'gini', 'max_depth': 25, 'max_features': 11, 'max_leaf_nodes': 62, 'min_impurity_decrease': 1265495.7100621029, 'min_samples_leaf': 14, 'ccp_alpha': 6.197940917245772e-06, 'max_samples': 0.7087892800488842}. Best is trial 10 with value: 0.5.
[I 2023-07-28 18:58:26,450] Trial 2 finished with value: 0.5 and parameters: {'criterion': 'log_loss', 'max_depth': 34, 'max_features': 2, 'max_leaf_nodes': 146, 'min_impurity_decrease': 55.482474204021216, 'min_samples_leaf': 12, 'ccp_alpha': 3.3206893641002994e-05, 'max_samples': 0.4933093620487603}. Best is trial 10 with value: 0.5.
[I 2023-07-28 18:58:26,518] Trial 6 finished with value: 0.5 and parameters: {'criterion': 'gini', 'max_depth': 16, 'max_features': 7, 'max_leaf_nodes': 278, 'min_impurity_decrease': 43.902619759540904, 'min_samples_leaf': 5, 'ccp_alpha': 1.0672311645135976e-07, 'max_samples': 0.31718825945414153}. Best is trial 10 with value:
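
The three trials logged above score exactly 0.5, and each of them has a min_impurity_decrease far above 1. A likely explanation, sketched below on synthetic data (an assumption, not the Spaceship dataset): for a binary target the Gini impurity never exceeds 0.5 and the entropy never exceeds 1, so no split can reach such a threshold. Every tree then stays a single leaf, all samples get the same predicted probability, and the ROC AUC is 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 123

# Synthetic stand-in data (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=8, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=SEED)

# A threshold far above any achievable impurity decrease: no node is ever split
stump_forest = RandomForestClassifier(n_estimators=20,
                                      min_impurity_decrease=50.0,
                                      random_state=SEED)
stump_forest.fit(X_train, y_train)

proba = stump_forest.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
n_leaves = [tree.get_n_leaves() for tree in stump_forest.estimators_]

print(auc, max(n_leaves))
```

If this is what happens here, narrowing the upper bound of min_impurity_decrease in a new study would stop trials from being spent on such models.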

Here is what we've got in 5 seconds:

In [114]:
print("Best trial:", study.best_trial.number)
print("Best average cross-validation ROC AUC:", study.best_trial.value)
print("Best hyperparameters:", study.best_params)

Best trial: 0
Best average cross-validation ROC AUC: 0.8823579898190624
Best hyperparameters: {'criterion': 'log_loss', 'max_depth': 35, 'max_features': 7, 'max_leaf_nodes': 254, 'min_impurity_decrease': 6.154960068691581e-06, 'min_samples_leaf': 15, 'ccp_alpha': 2.861104663831306e-06, 'max_samples': 0.6008982179780424}


Let's take a look at the optimization history. Note that the plot is interactive, so we can zoom in:

In [115]:
# Plotting Optimization History
import optuna.visualization as vis

optimization_history_plot = vis.plot_optimization_history(study, error_bar=True)
optimization_history_plot.show()

Let's run it for 5 more seconds and look at the results and the plot:

In [116]:
study.optimize(objective, timeout=5, n_jobs=-1)

print("Best trial:", study.best_trial.number)
print("Best average cross-validation ROC AUC:", study.best_trial.value)
print("Best hyperparameters:", study.best_params)

# Plotting Optimization History
optimization_history_plot = vis.plot_optimization_history(study, error_bar=True)
optimization_history_plot.show()

[I 2023-07-28 18:58:45,766] Trial 16 finished with value: 0.5 and parameters: {'criterion': 'log_loss', 'max_depth': 7, 'max_features': 15, 'max_leaf_nodes': 82, 'min_impurity_decrease': 1.680183056308017e-09, 'min_samples_leaf': 30, 'ccp_alpha': 0.3577787055349788, 'max_samples': 0.5497640278914131}. Best is trial 0 with value: 0.8823579898190624.
[I 2023-07-28 18:58:45,777] Trial 18 finished with value: 0.8807969420978433 and parameters: {'criterion': 'log_loss', 'max_depth': 49, 'max_features': 16, 'max_leaf_nodes': 83, 'min_impurity_decrease': 5.7530577468305685e-08, 'min_samples_leaf': 20, 'ccp_alpha': 6.41635975304259e-05, 'max_samples': 0.5770694194676651}. Best is trial 0 with value: 0.8823579898190624.
[I 2023-07-28 18:58:45,783] Trial 17 finished with value: 0.8810569045353863 and parameters: {'criterion': 'log_loss', 'max_depth': 49, 'max_features': 15, 'max_leaf_nodes': 60, 'min_impurity_decrease': 1.3157470919177103e-09, 'min_samples_leaf': 21, 'ccp_alpha': 7.9331631644681

Best trial: 0
Best average cross-validation ROC AUC: 0.8823579898190624
Best hyperparameters: {'criterion': 'log_loss', 'max_depth': 35, 'max_features': 7, 'max_leaf_nodes': 254, 'min_impurity_decrease': 6.154960068691581e-06, 'min_samples_leaf': 15, 'ccp_alpha': 2.861104663831306e-06, 'max_samples': 0.6008982179780424}


Note that we didn't start over but continued the previous search. We can also look at the importance of each hyperparameter:

In [117]:
# Plotting Parameter Importance
param_importance_plot = vis.plot_param_importances(study)
param_importance_plot.show()

We can also draw contour plots, which show the relationship between two hyperparameters and the trial value. Let's look at max_depth vs. min_samples_leaf:

In [118]:
# Plotting a Contour Plot in Optuna
contour_plot = vis.plot_contour(study, params=["max_depth", "min_samples_leaf"])
contour_plot.show()

Finally, we can save the study to a file and load it later:

In [119]:
import joblib

joblib.dump(study, "05_RF.pkl")

study = joblib.load("05_RF.pkl")
    
print("Best trial:", study.best_trial.number)
print("Best average cross-validation ROC AUC:", study.best_trial.value)
print("Best hyperparameters:", study.best_params)

Best trial: 0
Best average cross-validation ROC AUC: 0.8823579898190624
Best hyperparameters: {'criterion': 'log_loss', 'max_depth': 35, 'max_features': 7, 'max_leaf_nodes': 254, 'min_impurity_decrease': 6.154960068691581e-06, 'min_samples_leaf': 15, 'ccp_alpha': 2.861104663831306e-06, 'max_samples': 0.6008982179780424}


Let's add our new scores to the scores table. For this, we'll reuse the get_cv_scores function from Part 4, but now with the best parameters from our Optuna study:

In [120]:
%%time

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Random seed for reproducibility
SEED = 123

# Prepare our best model for training
from sklearn.ensemble import RandomForestClassifier
model_for_tests = RandomForestClassifier(random_state=SEED,
                               n_estimators= 90,
                               n_jobs=-1,
                               **study.best_params
                               )


def get_cv_scores(train, test, model, scores_df, comment = "", verbose=False, prepare_submission=False):
    
    '''
    Takes train and test sets, a model for cross-validation, and a DataFrame with previous scores.
    It also takes an optional comment string describing the changes.
    
    Setting verbose to True makes the function print the updated scores.

    
    It returns:
        
        -) Updated DataFrame with new:
            1) Average training ROC AUC score.
            2) Average cross-validation ROC AUC score.
            3) Average training accuracy score. 
            4) Average cross-validation accuracy score.
        
        -) A dataset for a new submission, if prepare_submission is True
    '''
    
    # Create a StratifiedKFold object (6 splits with equal proportion of positive target values)
    skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)
    
    # Empty lists for collecting scores
    train_roc_auc_scores = []
    cv_roc_auc_scores = []
    train_accuracy_scores = []
    cv_accuracy_scores = []
    
    # Iterate through folds
    for train_index, cv_index in skf.split(train.drop('Transported', axis=1), train['Transported']):
        # Obtain training and testing folds
        cv_train, cv_test = train.iloc[train_index], train.iloc[cv_index]
        
        # Fit the model
        model.fit(cv_train.drop('Transported', axis=1), cv_train['Transported']) 
        
        # Calculate scores and append to the scores lists
        train_pred_proba = model.predict_proba(cv_train.drop('Transported', axis=1))[:, 1]
        train_roc_auc_scores.append(roc_auc_score(cv_train['Transported'], train_pred_proba))
        cv_pred_proba = model.predict_proba(cv_test.drop('Transported', axis=1))[:, 1]
        cv_roc_auc_scores.append(roc_auc_score(cv_test['Transported'], cv_pred_proba))
        train_accuracy_scores.append(model.score(cv_train.drop('Transported', axis=1), cv_train['Transported']))
        cv_accuracy_scores.append(model.score(cv_test.drop('Transported', axis=1), cv_test['Transported']))
        

    # Update the scores DataFrame with average scores:
    
    scores_df.loc[len(scores_df)] = [comment, np.mean(train_roc_auc_scores), np.mean(cv_roc_auc_scores),
                                     np.mean(train_accuracy_scores), np.mean(cv_accuracy_scores), np.nan]
    #scores_df.index = scores_df.index + 1
    #scores_df.sort_index()
    
    # Print the updated scores DataFrame
    if verbose:
        print(scores_df)
        
    submission = "prepare_submission=False"
        
    if prepare_submission:
    
        # Prepare the submission DataFrame
        test_pred = model.predict(test)
        test_pred = ["True" if i == 1 else "False" for i in test_pred]
        test_pred = pd.DataFrame(test_pred, columns=['Transported'])
        submission = pd.concat([test_Ids, test_pred], axis=1)

    
    return submission

print(model_for_tests)

get_cv_scores(train, test, model_for_tests, scores_df, comment= "optuna_test", prepare_submission=False)

scores_df

RandomForestClassifier(ccp_alpha=2.861104663831306e-06, criterion='log_loss',
                       max_depth=35, max_features=7, max_leaf_nodes=254,
                       max_samples=0.6008982179780424,
                       min_impurity_decrease=6.154960068691581e-06,
                       min_samples_leaf=15, n_estimators=90, n_jobs=-1,
                       random_state=123)
CPU times: total: 906 ms
Wall time: 2.13 s


| | Changes: | Train ROC AUC | Cross-val ROC AUC | Train Accuracy | Cross-val Accuracy | Test accuracy |
|---|---|---|---|---|---|---|
| 0 | Unprocessed numeric features | 0.891046 | 0.847367 | 0.830047 | 0.790751 | 0.80056 |
| 1 | GroupSize | 0.89933 | 0.854434 | 0.828759 | 0.792477 | |
| 2 | FamilySize | 0.900522 | 0.854035 | 0.828989 | 0.791787 | |
| 3 | - GroupSize | 0.896124 | 0.851555 | 0.829334 | 0.792938 | |
| 4 | 1 + Deck_enc | 0.927033 | 0.877194 | 0.831174 | 0.791672 | 0.79682 |
| 5 | + HomePlanet | 0.930021 | 0.880737 | 0.834349 | 0.794893 | |
| 6 | + Destination | 0.930903 | 0.882469 | 0.835431 | 0.796503 | |
| 7 | + CryoSleep | 0.931807 | 0.882508 | 0.838951 | 0.799264 | |
| 8 | + VIP | 0.931491 | 0.882599 | 0.83879 | 0.799724 | |
| 9 | + Side | 0.93437 | 0.886205 | 0.844196 | 0.799379 | |


That covers the whole process. Now we'll save our study; from now on we'll rerun ['05_hyperparameters_2.ipynb'](05_hyperparameters_2.ipynb), loading the study from a file and saving it back each time.

In [1]:
import joblib

joblib.dump(study, "05_RF.pkl")
scores_df.to_csv('05_scores_df.csv')