# Spaceship. Part 5. (continued)
## Hyperparameter tuning

Here we continue the hyperparameter search described in ['05_hyperparameters.ipynb'](05_hyperparameters.ipynb).

This notebook can be re-run repeatedly: each run loads the saved Optuna study and adds new trials to it.

Choose the running time for the current session:

In [197]:
HOURS = 0
MINUTES = 20
SECONDS = 0

RUNNING_TIME = HOURS * 3600 + MINUTES * 60 + SECONDS
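
The same budget can also be computed with the standard library's `datetime.timedelta`, which avoids hand-written unit conversions. This is only an alternative sketch; `running_time` is a name introduced here for illustration and is not used elsewhere in the notebook.

```python
from datetime import timedelta

# Same session length as above: 0 hours, 20 minutes, 0 seconds
HOURS, MINUTES, SECONDS = 0, 20, 0

# total_seconds() returns a float, so cast it to int for use as a timeout
running_time = int(timedelta(hours=HOURS, minutes=MINUTES, seconds=SECONDS).total_seconds())
print(running_time)  # 1200
```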

Let's load our data and our Optuna study, and define all the necessary functions.

During the search we fix `n_estimators` at 100 for speed. Larger values may increase the scores. For the scores table and submissions we'll use 500 estimators.

In [198]:
# Random seed for reproducibility
SEED = 123

import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import joblib
import optuna
import optuna.visualization as vis

train = pd.read_csv('04_train_prepared.csv', index_col=0)
test = pd.read_csv('04_test_prepared.csv', index_col=0)
scores_df = pd.read_csv('05_scores_df.csv', index_col=0)
test_Ids = pd.read_csv('test_Ids.csv', index_col=0).reset_index(drop=True)

train['Transported'] = [1 if i else 0 for i in train['Transported']]

study = joblib.load("05_RF.pkl")
total_seconds = pd.read_csv('05_total_seconds.csv', index_col=0)

print('Before current session: ')
print("Best trial:", study.best_trial.number)
print("Best average cross-validation ROC AUC:", study.best_trial.value)
print("Best hyperparameters:", study.best_params)


def train_evaluate(params):
    '''
    Take a dict of hyperparameters for the classifier.

    Return the average cross-validated ROC AUC score.
    '''

    # Prepare the base estimator; the trial's hyperparameters are applied below
    model = RandomForestClassifier(random_state=SEED,
                                   n_estimators=100,
                                   n_jobs=-1)


    # Set parameters for the model
    model.set_params(**params)
    
    # Create a StratifiedKFold object (6 splits with equal proportion of positive target values)
    skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)
    
    # An empty list for collecting scores
    test_roc_auc_scores = []
    
    # Iterate through folds
    for train_index, cv_index in skf.split(train.drop('Transported', axis=1), train['Transported']):
        # Obtain training and testing folds
        cv_train, cv_test = train.iloc[train_index], train.iloc[cv_index]
        
        # Fit the model
        model.fit(cv_train.drop('Transported', axis=1), cv_train['Transported']) 
        
        # Calculate the ROC AUC score on the held-out fold and append it to the list
        test_pred_proba = model.predict_proba(cv_test.drop('Transported', axis=1))[:, 1]
        test_roc_auc_scores.append(roc_auc_score(cv_test['Transported'], test_pred_proba))
        
    return np.mean(test_roc_auc_scores)
        

Before current session: 
Best trial: 1412
Best average cross-validation ROC AUC: 0.8877726804385926
Best hyperparameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 172, 'min_impurity_decrease': 2.1053751014191062e-05, 'min_samples_leaf': 3, 'ccp_alpha': 2.673562655646687e-07, 'max_samples': 0.6368446246746027}
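
To illustrate why `train_evaluate` uses `StratifiedKFold`, here is a minimal sketch on synthetic data (the arrays `X` and `y` are made up for this example and are not the Spaceship data). It checks that every validation fold keeps roughly the same share of positive targets as the full set, which keeps the per-fold ROC AUC scores comparable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training set: about 30% positive labels
rng = np.random.default_rng(123)
y = (rng.random(600) < 0.3).astype(int)
X = rng.normal(size=(600, 4))

# Same splitter settings as in the notebook
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=123)

overall_rate = y.mean()
# Share of positive labels inside each validation fold
fold_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]

print(f"overall positive rate: {overall_rate:.3f}")
for fold_number, rate in enumerate(fold_rates, start=1):
    print(f"fold {fold_number}: {rate:.3f}")
```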


In [199]:
def objective(trial):
    # Sample one hyperparameter combination and return its average cross-validated ROC AUC
    params = {
        # 'n_estimators': optuna.distributions.IntDistribution(100, 1000),
        # 'criterion': optuna.distributions.CategoricalDistribution(['log_loss', 'entropy']),
        'criterion': trial.suggest_categorical('criterion', ['log_loss', 'gini']),
        'max_depth': trial.suggest_int('max_depth', 2, 50),
        'max_features': trial.suggest_int('max_features', 1, 16),
        'max_leaf_nodes': trial.suggest_int('max_leaf_nodes', 20, 500),
        "min_impurity_decrease": trial.suggest_float("min_impurity_decrease", 1e-9, 1e-1, log=True),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 2, 30),
        'ccp_alpha': trial.suggest_float('ccp_alpha', 1e-7, 4e-1, log=True),
        'max_samples': trial.suggest_float('max_samples', 0.3, 1),
    }
    return train_evaluate(params)
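
The `log=True` flag on `min_impurity_decrease` and `ccp_alpha` tells Optuna to search those ranges in the log domain. The sketch below uses plain independent random draws, not Optuna's sampler (which is more sophisticated), to show what that means for a range spanning eight orders of magnitude. The helper `sample_log_uniform` is hypothetical and exists only for this illustration.

```python
import math
import random

random.seed(123)
LOW, HIGH = 1e-9, 1e-1  # same bounds as min_impurity_decrease above
N = 10_000


def sample_log_uniform(low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(random.uniform(math.log(low), math.log(high)))


log_draws = [sample_log_uniform(LOW, HIGH) for _ in range(N)]
linear_draws = [random.uniform(LOW, HIGH) for _ in range(N)]

# 1e-5 is the midpoint of the range on a log scale
threshold = 1e-5
frac_log = sum(x < threshold for x in log_draws) / N
frac_linear = sum(x < threshold for x in linear_draws) / N

print(f"share below {threshold:g}: log scale {frac_log:.3f}, linear scale {frac_linear:.4f}")
```

On the log scale about half of the draws fall below `1e-5`, while on a linear scale almost none do, so very small values would hardly ever be tried without `log=True`.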

In [200]:
def get_cv_scores(train, test, model, scores_df, comment = "", verbose=False, prepare_submission=False):
    
    '''
    Take the train and test sets, a model to cross-validate, and a DataFrame with previous scores.
    An optional comment string describes the change being evaluated.

    If verbose is True, the updated scores DataFrame is printed.

    A new row is appended to scores_df in place with:
        1) Average training ROC AUC score.
        2) Average cross-validation ROC AUC score.
        3) Average training accuracy score.
        4) Average cross-validation accuracy score.

    Returns a DataFrame for a new submission if prepare_submission is True,
    otherwise the placeholder string "prepare_submission=False".
    '''
    
    # Create a StratifiedKFold object (6 splits with equal proportion of positive target values)
    skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)
    
    # Empty lists for collecting scores
    train_roc_auc_scores = []
    cv_roc_auc_scores = []
    train_accuracy_scores = []
    cv_accuracy_scores = []
    
    # Iterate through folds
    for train_index, cv_index in skf.split(train.drop('Transported', axis=1), train['Transported']):
        # Obtain training and testing folds
        cv_train, cv_test = train.iloc[train_index], train.iloc[cv_index]
        
        # Fit the model
        model.fit(cv_train.drop('Transported', axis=1), cv_train['Transported']) 
        
        # Calculate scores and append to the scores lists
        train_pred_proba = model.predict_proba(cv_train.drop('Transported', axis=1))[:, 1]
        train_roc_auc_scores.append(roc_auc_score(cv_train['Transported'], train_pred_proba))
        cv_pred_proba = model.predict_proba(cv_test.drop('Transported', axis=1))[:, 1]
        cv_roc_auc_scores.append(roc_auc_score(cv_test['Transported'], cv_pred_proba))
        train_accuracy_scores.append(model.score(cv_train.drop('Transported', axis=1), cv_train['Transported']))
        cv_accuracy_scores.append(model.score(cv_test.drop('Transported', axis=1), cv_test['Transported']))
        

    # Update the scores DataFrame with average scores:
    
    scores_df.loc[len(scores_df)] = [comment, np.mean(train_roc_auc_scores), np.mean(cv_roc_auc_scores),
                                     np.mean(train_accuracy_scores), np.mean(cv_accuracy_scores), np.nan]
    #scores_df.index = scores_df.index + 1
    #scores_df.sort_index()
    
    # Print the updated scores DataFrame
    if verbose:
        print(scores_df)
        
    submission = "prepare_submission=False"
        
    if prepare_submission:
    
        # Prepare the submission DataFrame
        test_pred = model.predict(test)
        test_pred = ["True" if i == 1 else "False" for i in test_pred]
        test_pred = pd.DataFrame(test_pred, columns=['Transported'])
        submission = pd.concat([test_Ids, test_pred], axis=1)

    
    return submission
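
`get_cv_scores` appends its results with `scores_df.loc[len(scores_df)] = [...]`. A small sketch of that idiom follows, assuming the scores table has a default `0..n-1` index (with any other index the same assignment could overwrite an existing row). The column names and the numbers are made up for illustration; the real ones come from `05_scores_df.csv`.

```python
import pandas as pd

# Hypothetical column names and made-up scores, for illustration only
columns = ["Comment", "Train ROC AUC", "CV ROC AUC", "Train accuracy", "CV accuracy"]
demo_scores = pd.DataFrame([["baseline", 0.95, 0.87, 0.88, 0.79]], columns=columns)

# With a 0..n-1 index, len(df) is the first unused label, so .loc adds a new row
demo_scores.loc[len(demo_scores)] = ["tuned", 0.93, 0.88, 0.86, 0.80]

print(demo_scores)
```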
                         

Now let's run the optimization and observe the results:

In [201]:
study.optimize(objective, timeout=RUNNING_TIME, n_jobs=-1)

[I 2023-07-28 22:41:37,949] Trial 1427 finished with value: 0.8659329181023034 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 165, 'min_impurity_decrease': 0.00010918621336690212, 'min_samples_leaf': 3, 'ccp_alpha': 0.007222318749108655, 'max_samples': 0.6486078358299878}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:41:38,010] Trial 1429 finished with value: 0.8845166975047754 and parameters: {'criterion': 'log_loss', 'max_depth': 45, 'max_features': 11, 'max_leaf_nodes': 161, 'min_impurity_decrease': 6.67701197075021e-06, 'min_samples_leaf': 3, 'ccp_alpha': 1.575937831574664e-07, 'max_samples': 0.6104131727922167}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:41:38,015] Trial 1426 finished with value: 0.8853276477435709 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 163, 'min_impurity_decrease': 7.850868684705682e-06, 'min_samples_leaf': 3,

[I 2023-07-28 22:42:00,762] Trial 1436 finished with value: 0.8859812555379536 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 11, 'max_leaf_nodes': 177, 'min_impurity_decrease': 7.079026242336668e-05, 'min_samples_leaf': 2, 'ccp_alpha': 1.3885623436345569e-06, 'max_samples': 0.6744380702856267}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:42:01,367] Trial 1441 finished with value: 0.8856520455805675 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 11, 'max_leaf_nodes': 145, 'min_impurity_decrease': 7.0021613268248646e-06, 'min_samples_leaf': 3, 'ccp_alpha': 1.9622046735979357e-07, 'max_samples': 0.5982669381343944}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:42:21,249] Trial 1442 finished with value: 0.8844721248778536 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 11, 'max_leaf_nodes': 145, 'min_impurity_decrease': 7.263884729729261e-06, 'min_samples_leaf'

[I 2023-07-28 22:42:42,727] Trial 1459 finished with value: 0.8851381801166488 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 171, 'min_impurity_decrease': 3.81561207828704e-05, 'min_samples_leaf': 4, 'ccp_alpha': 1.545586544753712e-07, 'max_samples': 0.6164899165287621}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:42:42,882] Trial 1460 finished with value: 0.8745848814802848 and parameters: {'criterion': 'log_loss', 'max_depth': 5, 'max_features': 12, 'max_leaf_nodes': 171, 'min_impurity_decrease': 3.3909990492910728e-06, 'min_samples_leaf': 4, 'ccp_alpha': 2.2162643823483565e-07, 'max_samples': 0.6268838983370889}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:42:43,473] Trial 1464 finished with value: 0.8829596725061667 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 12, 'max_leaf_nodes': 170, 'min_impurity_decrease': 3.714229444039537e-05, 'min_samples_leaf': 19

[I 2023-07-28 22:43:25,181] Trial 1478 finished with value: 0.8860756287776166 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 11, 'max_leaf_nodes': 158, 'min_impurity_decrease': 5.465564916426863e-07, 'min_samples_leaf': 2, 'ccp_alpha': 3.159015491501682e-07, 'max_samples': 0.6628030167836654}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:43:25,217] Trial 1481 finished with value: 0.8848479993637044 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 11, 'max_leaf_nodes': 184, 'min_impurity_decrease': 0.0002150700704331697, 'min_samples_leaf': 3, 'ccp_alpha': 4.408434774948267e-07, 'max_samples': 0.5916486286945075}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:43:25,535] Trial 1482 finished with value: 0.8846925610458655 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 11, 'max_leaf_nodes': 151, 'min_impurity_decrease': 2.8688919883613476e-06, 'min_samples_leaf': 3

[I 2023-07-28 22:44:06,896] Trial 1510 finished with value: 0.8593442743319198 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 183, 'min_impurity_decrease': 2.5411873781824576e-07, 'min_samples_leaf': 2, 'ccp_alpha': 0.010988046501188264, 'max_samples': 0.6486752444858885}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:44:06,955] Trial 1507 finished with value: 0.8854168504881285 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 12, 'max_leaf_nodes': 184, 'min_impurity_decrease': 2.417973616723923e-05, 'min_samples_leaf': 2, 'ccp_alpha': 3.127022228259245e-06, 'max_samples': 0.6484424965306086}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:44:06,956] Trial 1505 finished with value: 0.8602499897732714 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 183, 'min_impurity_decrease': 1.695184648015542e-05, 'min_samples_leaf': 2, 

[I 2023-07-28 22:44:51,974] Trial 1526 finished with value: 0.8857933790299173 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 165, 'min_impurity_decrease': 5.3578688326020606e-05, 'min_samples_leaf': 4, 'ccp_alpha': 1.613966391543756e-05, 'max_samples': 0.699416453859743}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:44:52,018] Trial 1531 finished with value: 0.8864207334746065 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 174, 'min_impurity_decrease': 4.602991877510306e-05, 'min_samples_leaf': 3, 'ccp_alpha': 7.531572765881801e-05, 'max_samples': 0.587322019491919}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:44:52,235] Trial 1530 finished with value: 0.8854983237474275 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 139, 'min_impurity_decrease': 4.306494348469502e-05, 'min_samples_leaf': 3,

[I 2023-07-28 22:45:35,603] Trial 1552 finished with value: 0.8846075758377646 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 11, 'max_leaf_nodes': 134, 'min_impurity_decrease': 3.6636153154997405e-07, 'min_samples_leaf': 5, 'ccp_alpha': 1.788240999408533e-06, 'max_samples': 0.5960283947527364}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:45:35,709] Trial 1551 finished with value: 0.884631349249306 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 11, 'max_leaf_nodes': 142, 'min_impurity_decrease': 9.135136405746098e-05, 'min_samples_leaf': 5, 'ccp_alpha': 0.00013410543461450907, 'max_samples': 0.5925353517479962}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:45:35,755] Trial 1553 finished with value: 0.8861787537670508 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 11, 'max_leaf_nodes': 141, 'min_impurity_decrease': 7.128561749418043e-05, 'min_samples_leaf': 5, 

[I 2023-07-28 22:46:01,870] Trial 1572 finished with value: 0.8835739011880083 and parameters: {'criterion': 'log_loss', 'max_depth': 7, 'max_features': 12, 'max_leaf_nodes': 146, 'min_impurity_decrease': 1.534329064772744e-06, 'min_samples_leaf': 7, 'ccp_alpha': 5.143572309002287e-06, 'max_samples': 0.8419577765401959}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:46:05,727] Trial 1573 finished with value: 0.8823590005769626 and parameters: {'criterion': 'log_loss', 'max_depth': 7, 'max_features': 12, 'max_leaf_nodes': 156, 'min_impurity_decrease': 2.283876758974549e-06, 'min_samples_leaf': 7, 'ccp_alpha': 5.219500095439867e-06, 'max_samples': 0.8082393022629208}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:46:19,921] Trial 1574 finished with value: 0.8841710909029006 and parameters: {'criterion': 'log_loss', 'max_depth': 41, 'max_features': 11, 'max_leaf_nodes': 149, 'min_impurity_decrease': 0.00018119103121483327, 'min_samples_leaf': 7,

[I 2023-07-28 22:46:46,369] Trial 1594 finished with value: 0.8861218810795718 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 160, 'min_impurity_decrease': 4.3786628836403926e-05, 'min_samples_leaf': 4, 'ccp_alpha': 2.5188555110957994e-05, 'max_samples': 0.8660191568209964}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:46:47,372] Trial 1595 finished with value: 0.8858098110170083 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 174, 'min_impurity_decrease': 3.906065786390293e-05, 'min_samples_leaf': 4, 'ccp_alpha': 9.102779914877626e-06, 'max_samples': 0.5589782858977853}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:46:47,457] Trial 1596 finished with value: 0.8848361830648507 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 174, 'min_impurity_decrease': 3.366950579744402e-05, 'min_samples_leaf':

[I 2023-07-28 22:47:30,353] Trial 1617 finished with value: 0.8851863830041845 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 147, 'min_impurity_decrease': 6.110833446947329e-07, 'min_samples_leaf': 5, 'ccp_alpha': 7.919194246844762e-07, 'max_samples': 0.8553316709188502}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:47:31,067] Trial 1614 finished with value: 0.8849450556707383 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 152, 'min_impurity_decrease': 0.00034563605357721647, 'min_samples_leaf': 6, 'ccp_alpha': 1.2463153001732717e-06, 'max_samples': 0.860061003947087}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:47:31,485] Trial 1618 finished with value: 0.8842488181064666 and parameters: {'criterion': 'log_loss', 'max_depth': 8, 'max_features': 11, 'max_leaf_nodes': 143, 'min_impurity_decrease': 0.0004400755277130341, 'min_samples_leaf': 5, 

[I 2023-07-28 22:48:16,501] Trial 1637 finished with value: 0.8812018592524066 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 11, 'max_leaf_nodes': 165, 'min_impurity_decrease': 0.0016065102096143593, 'min_samples_leaf': 6, 'ccp_alpha': 7.006322443393891e-07, 'max_samples': 0.8375808109671316}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:48:16,886] Trial 1639 finished with value: 0.8845775121144884 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 11, 'max_leaf_nodes': 161, 'min_impurity_decrease': 0.00023500903100037368, 'min_samples_leaf': 6, 'ccp_alpha': 2.28784857876687e-06, 'max_samples': 0.8360084785928225}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:48:17,892] Trial 1640 finished with value: 0.885007914349696 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 11, 'max_leaf_nodes': 164, 'min_impurity_decrease': 0.00023535500155608122, 'min_samples_leaf': 6

[I 2023-07-28 22:49:02,149] Trial 1659 finished with value: 0.8579063401377627 and parameters: {'criterion': 'gini', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 178, 'min_impurity_decrease': 1.1164900396861193e-06, 'min_samples_leaf': 5, 'ccp_alpha': 0.005292770171935277, 'max_samples': 0.800426176385163}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:49:02,802] Trial 1661 finished with value: 0.8845771329952012 and parameters: {'criterion': 'gini', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 179, 'min_impurity_decrease': 1.5803661928668918e-06, 'min_samples_leaf': 5, 'ccp_alpha': 1.0171515342905447e-06, 'max_samples': 0.79167324534926}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:49:03,277] Trial 1662 finished with value: 0.7625594317502286 and parameters: {'criterion': 'gini', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 178, 'min_impurity_decrease': 2.731773698320871e-06, 'min_samples_leaf': 5, 'ccp_alpha'

[I 2023-07-28 22:49:46,159] Trial 1682 finished with value: 0.8852228215359658 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 140, 'min_impurity_decrease': 7.312157027798566e-07, 'min_samples_leaf': 7, 'ccp_alpha': 6.010114422267226e-07, 'max_samples': 0.8540705200947404}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:49:48,287] Trial 1684 finished with value: 0.8457991095370908 and parameters: {'criterion': 'log_loss', 'max_depth': 3, 'max_features': 12, 'max_leaf_nodes': 135, 'min_impurity_decrease': 8.236648719540109e-06, 'min_samples_leaf': 7, 'ccp_alpha': 6.517766620202629e-07, 'max_samples': 0.8513626323922207}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:49:48,442] Trial 1683 finished with value: 0.8771464044792889 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 138, 'min_impurity_decrease': 0.0028178844645285668, 'min_samples_leaf': 7, '

[I 2023-07-28 22:50:17,896] Trial 1704 finished with value: 0.8850632794985228 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 12, 'max_leaf_nodes': 155, 'min_impurity_decrease': 1.351234968925212e-05, 'min_samples_leaf': 6, 'ccp_alpha': 1.5709143011812601e-06, 'max_samples': 0.7403996048895493}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:50:21,176] Trial 1705 finished with value: 0.7616834186628143 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 13, 'max_leaf_nodes': 159, 'min_impurity_decrease': 1.7088930994784967e-06, 'min_samples_leaf': 6, 'ccp_alpha': 0.053381102608833254, 'max_samples': 0.6389143652853152}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:50:32,761] Trial 1708 finished with value: 0.8563232833673725 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 13, 'max_leaf_nodes': 158, 'min_impurity_decrease': 0.0109440496131774, 'min_samples_leaf': 5, 

[I 2023-07-28 22:50:57,925] Trial 1724 finished with value: 0.8854093444471278 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 11, 'max_leaf_nodes': 173, 'min_impurity_decrease': 1.5656478219448454e-07, 'min_samples_leaf': 4, 'ccp_alpha': 1.0160234054126014e-07, 'max_samples': 0.6972298435125465}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:51:01,408] Trial 1727 finished with value: 0.8855993182893767 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 11, 'max_leaf_nodes': 172, 'min_impurity_decrease': 0.0001299761764542547, 'min_samples_leaf': 4, 'ccp_alpha': 1.1638841650422681e-07, 'max_samples': 0.7067772248030244}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:51:03,566] Trial 1728 finished with value: 0.8844099790613678 and parameters: {'criterion': 'log_loss', 'max_depth': 38, 'max_features': 11, 'max_leaf_nodes': 169, 'min_impurity_decrease': 9.88426709463505e-07, 'min_samples_leaf':

[I 2023-07-28 22:51:40,647] Trial 1746 finished with value: 0.8828993973732716 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 5, 'max_leaf_nodes': 132, 'min_impurity_decrease': 1.407957365906643e-06, 'min_samples_leaf': 17, 'ccp_alpha': 3.528594110914836e-07, 'max_samples': 0.8202057694183943}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:51:40,792] Trial 1750 finished with value: 0.8844860728008673 and parameters: {'criterion': 'log_loss', 'max_depth': 19, 'max_features': 11, 'max_leaf_nodes': 124, 'min_impurity_decrease': 1.964330529229612e-06, 'min_samples_leaf': 4, 'ccp_alpha': 3.367275209325001e-07, 'max_samples': 0.39473282441587826}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:51:41,680] Trial 1749 finished with value: 0.8850283167381895 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 11, 'max_leaf_nodes': 120, 'min_impurity_decrease': 5.480488439283436e-07, 'min_samples_leaf': 

[I 2023-07-28 22:52:27,299] Trial 1770 finished with value: 0.8855412637602716 and parameters: {'criterion': 'log_loss', 'max_depth': 16, 'max_features': 12, 'max_leaf_nodes': 119, 'min_impurity_decrease': 2.8381112749595393e-06, 'min_samples_leaf': 5, 'ccp_alpha': 1.8103411052271452e-07, 'max_samples': 0.6548104231670921}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:52:27,897] Trial 1772 finished with value: 0.8853826816538071 and parameters: {'criterion': 'log_loss', 'max_depth': 15, 'max_features': 12, 'max_leaf_nodes': 122, 'min_impurity_decrease': 9.538726190422227e-07, 'min_samples_leaf': 5, 'ccp_alpha': 1.9817336268701045e-07, 'max_samples': 0.6716870692508523}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:52:27,964] Trial 1773 finished with value: 0.8858452572214337 and parameters: {'criterion': 'log_loss', 'max_depth': 16, 'max_features': 12, 'max_leaf_nodes': 130, 'min_impurity_decrease': 6.643882142916938e-07, 'min_samples_leaf'

[I 2023-07-28 22:53:11,049] Trial 1790 finished with value: 0.8863417133712445 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 153, 'min_impurity_decrease': 1.026284459975642e-05, 'min_samples_leaf': 3, 'ccp_alpha': 6.384368691470288e-06, 'max_samples': 0.6243686511962264}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:53:11,572] Trial 1792 finished with value: 0.8855195826374569 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 142, 'min_impurity_decrease': 4.310872598386661e-07, 'min_samples_leaf': 3, 'ccp_alpha': 4.993135668582312e-07, 'max_samples': 0.6686307673566612}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:53:11,920] Trial 1793 finished with value: 0.8853812829379445 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 145, 'min_impurity_decrease': 1.180485663434681e-06, 'min_samples_leaf': 3

[I 2023-07-28 22:53:58,490] Trial 1816 finished with value: 0.8604172333310327 and parameters: {'criterion': 'log_loss', 'max_depth': 4, 'max_features': 12, 'max_leaf_nodes': 132, 'min_impurity_decrease': 1.798983113465194e-05, 'min_samples_leaf': 15, 'ccp_alpha': 7.358993979884798e-07, 'max_samples': 0.7548571249141892}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:53:59,098] Trial 1817 finished with value: 0.8846153377859475 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 6, 'max_leaf_nodes': 131, 'min_impurity_decrease': 1.7159692644137866e-05, 'min_samples_leaf': 4, 'ccp_alpha': 2.7386467268475107e-07, 'max_samples': 0.7478427262250378}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:53:59,219] Trial 1814 finished with value: 0.8854204311650432 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 13, 'max_leaf_nodes': 134, 'min_impurity_decrease': 1.780843550875791e-05, 'min_samples_leaf': 

[I 2023-07-28 22:54:30,606] Trial 1836 finished with value: 0.8859911679280241 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 143, 'min_impurity_decrease': 7.159293394762654e-05, 'min_samples_leaf': 3, 'ccp_alpha': 1.2272184357099745e-07, 'max_samples': 0.846504184167793}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:54:39,779] Trial 1837 finished with value: 0.8823185186354722 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 143, 'min_impurity_decrease': 1.5290539588874205e-06, 'min_samples_leaf': 23, 'ccp_alpha': 1.3948404115544264e-07, 'max_samples': 0.8476058256596287}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:54:45,664] Trial 1838 finished with value: 0.8861118950822701 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 144, 'min_impurity_decrease': 6.93636461559961e-05, 'min_samples_leaf':

[I 2023-07-28 22:55:11,463] Trial 1858 finished with value: 0.886428834196336 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 117, 'min_impurity_decrease': 4.6863370943833316e-05, 'min_samples_leaf': 5, 'ccp_alpha': 1.7329008170997805e-07, 'max_samples': 0.8582755657670481}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:55:13,316] Trial 1859 finished with value: 0.8848575154592657 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 119, 'min_impurity_decrease': 0.00010047122901059577, 'min_samples_leaf': 5, 'ccp_alpha': 1.628293227666021e-07, 'max_samples': 0.8250719300770979}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:55:17,698] Trial 1860 finished with value: 0.8851661234592137 and parameters: {'criterion': 'log_loss', 'max_depth': 9, 'max_features': 12, 'max_leaf_nodes': 119, 'min_impurity_decrease': 0.00012248866883959728, 'min_samples_leaf': 5

[I 2023-07-28 22:55:54,845] Trial 1880 finished with value: 0.8847576499114069 and parameters: {'criterion': 'log_loss', 'max_depth': 8, 'max_features': 12, 'max_leaf_nodes': 140, 'min_impurity_decrease': 7.953544983385275e-05, 'min_samples_leaf': 6, 'ccp_alpha': 2.468053504659341e-07, 'max_samples': 0.8390614559851722}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:55:55,097] Trial 1882 finished with value: 0.8859381235153198 and parameters: {'criterion': 'log_loss', 'max_depth': 10, 'max_features': 13, 'max_leaf_nodes': 158, 'min_impurity_decrease': 4.542317974602481e-06, 'min_samples_leaf': 6, 'ccp_alpha': 2.5706271696360925e-07, 'max_samples': 0.4438229528990987}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:55:55,101] Trial 1881 finished with value: 0.8863309723235832 and parameters: {'criterion': 'log_loss', 'max_depth': 22, 'max_features': 12, 'max_leaf_nodes': 141, 'min_impurity_decrease': 0.00015671983492958482, 'min_samples_leaf': 

[I 2023-07-28 22:56:39,434] Trial 1902 finished with value: 0.8826483189570951 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 126, 'min_impurity_decrease': 1.1654656720810962e-07, 'min_samples_leaf': 21, 'ccp_alpha': 1.103658275623054e-07, 'max_samples': 0.7973951632544255}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:56:39,928] Trial 1903 finished with value: 0.886643546048482 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 126, 'min_impurity_decrease': 2.8465962473900525e-05, 'min_samples_leaf': 4, 'ccp_alpha': 3.3954885988507927e-07, 'max_samples': 0.7967527868955825}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:56:40,268] Trial 1904 finished with value: 0.884787523073807 and parameters: {'criterion': 'log_loss', 'max_depth': 20, 'max_features': 11, 'max_leaf_nodes': 164, 'min_impurity_decrease': 2.888073405340121e-08, 'min_samples_leaf':

[I 2023-07-28 22:57:23,606] Trial 1924 finished with value: 0.8848435547380192 and parameters: {'criterion': 'log_loss', 'max_depth': 8, 'max_features': 11, 'max_leaf_nodes': 124, 'min_impurity_decrease': 3.7337417666201394e-05, 'min_samples_leaf': 5, 'ccp_alpha': 4.806272747615542e-07, 'max_samples': 0.8152853085234405}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:57:24,355] Trial 1926 finished with value: 0.8845667632414838 and parameters: {'criterion': 'log_loss', 'max_depth': 8, 'max_features': 12, 'max_leaf_nodes': 107, 'min_impurity_decrease': 4.0576112911668546e-05, 'min_samples_leaf': 5, 'ccp_alpha': 4.3400736465567874e-06, 'max_samples': 0.8137018368825863}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:57:26,272] Trial 1925 finished with value: 0.8844158292058522 and parameters: {'criterion': 'log_loss', 'max_depth': 8, 'max_features': 11, 'max_leaf_nodes': 125, 'min_impurity_decrease': 3.531237496934875e-05, 'min_samples_leaf': 5

[I 2023-07-28 22:58:08,733] Trial 1946 finished with value: 0.8838837260545738 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 11, 'max_leaf_nodes': 82, 'min_impurity_decrease': 1.1700622116394498e-05, 'min_samples_leaf': 6, 'ccp_alpha': 3.722788286438462e-07, 'max_samples': 0.7845475773631102}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:58:09,550] Trial 1947 finished with value: 0.8871247044625196 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 12, 'max_leaf_nodes': 115, 'min_impurity_decrease': 1.2893962924362678e-05, 'min_samples_leaf': 6, 'ccp_alpha': 3.931253053789777e-07, 'max_samples': 0.7840693874082554}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:58:11,309] Trial 1948 finished with value: 0.8848954997338301 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 12, 'max_leaf_nodes': 78, 'min_impurity_decrease': 1.1484401205598057e-05, 'min_samples_leaf': 

[I 2023-07-28 22:58:45,558] Trial 1968 finished with value: 0.8840780533165721 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 12, 'max_leaf_nodes': 95, 'min_impurity_decrease': 6.043969142160993e-06, 'min_samples_leaf': 6, 'ccp_alpha': 5.564898883256519e-07, 'max_samples': 0.8000551833245294}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:58:52,725] Trial 1969 finished with value: 0.8859169870176546 and parameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 13, 'max_leaf_nodes': 119, 'min_impurity_decrease': 7.525872025048571e-06, 'min_samples_leaf': 7, 'ccp_alpha': 5.927350396757751e-07, 'max_samples': 0.7681292491571301}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:58:57,743] Trial 1970 finished with value: 0.8837361623341699 and parameters: {'criterion': 'log_loss', 'max_depth': 15, 'max_features': 11, 'max_leaf_nodes': 114, 'min_impurity_decrease': 2.10959764551251e-05, 'min_samples_leaf': 7, 

[I 2023-07-28 22:59:25,650] Trial 1989 finished with value: 0.8824706361673349 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 126, 'min_impurity_decrease': 4.1330797688693886e-05, 'min_samples_leaf': 22, 'ccp_alpha': 4.946428034414485e-05, 'max_samples': 0.8548186410066271}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:59:26,984] Trial 1991 finished with value: 0.8842910805873423 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 124, 'min_impurity_decrease': 3.438432859994142e-05, 'min_samples_leaf': 6, 'ccp_alpha': 4.630832235364179e-05, 'max_samples': 0.8566759772308915}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 22:59:30,885] Trial 1992 finished with value: 0.8835557651817149 and parameters: {'criterion': 'log_loss', 'max_depth': 11, 'max_features': 11, 'max_leaf_nodes': 150, 'min_impurity_decrease': 4.036683370307517e-05, 'min_samples_leaf':

[I 2023-07-28 23:00:10,389] Trial 2013 finished with value: 0.8544351728305196 and parameters: {'criterion': 'log_loss', 'max_depth': 14, 'max_features': 12, 'max_leaf_nodes': 136, 'min_impurity_decrease': 0.013414508082778906, 'min_samples_leaf': 5, 'ccp_alpha': 7.92006576367222e-07, 'max_samples': 0.47269318390820825}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 23:00:10,987] Trial 2012 finished with value: 0.8871587754240878 and parameters: {'criterion': 'log_loss', 'max_depth': 14, 'max_features': 12, 'max_leaf_nodes': 137, 'min_impurity_decrease': 2.136463015749651e-05, 'min_samples_leaf': 5, 'ccp_alpha': 1.1410811500464547e-06, 'max_samples': 0.8699720106744742}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 23:00:11,763] Trial 2015 finished with value: 0.74814255724123 and parameters: {'criterion': 'log_loss', 'max_depth': 14, 'max_features': 12, 'max_leaf_nodes': 133, 'min_impurity_decrease': 0.09582075730368599, 'min_samples_leaf': 5, 'c

[I 2023-07-28 23:00:55,317] Trial 2034 finished with value: 0.8849120450658937 and parameters: {'criterion': 'log_loss', 'max_depth': 40, 'max_features': 12, 'max_leaf_nodes': 126, 'min_impurity_decrease': 0.001086111716368876, 'min_samples_leaf': 5, 'ccp_alpha': 1.20955541791981e-06, 'max_samples': 0.5281827695524501}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 23:00:55,406] Trial 2035 finished with value: 0.879003526532072 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 12, 'max_leaf_nodes': 117, 'min_impurity_decrease': 0.0027666742766750446, 'min_samples_leaf': 5, 'ccp_alpha': 1.2259747011680812e-06, 'max_samples': 0.5289113172600428}. Best is trial 1412 with value: 0.8877726804385926.
[I 2023-07-28 23:00:56,729] Trial 2036 finished with value: 0.8834660051347698 and parameters: {'criterion': 'log_loss', 'max_depth': 13, 'max_features': 12, 'max_leaf_nodes': 124, 'min_impurity_decrease': 0.0010683365981325455, 'min_samples_leaf': 6, 

Save our study back to the file:

In [202]:
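# Add this session's time budget to the running total
total_seconds.iloc[0, 0] += RUNNING_TIME

# Persist the study and the accumulated time so the next run can resume from here
joblib.dump(study, "05_RF.pkl")
total_seconds.to_csv('05_total_seconds.csv')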

In [203]:
# Plotting Optimization History
optimization_history_plot = vis.plot_optimization_history(study, error_bar=True)
optimization_history_plot.show()

In [204]:
# Plotting Parameter Importance
param_importance_plot = vis.plot_param_importances(study)
param_importance_plot.show()

In [205]:
# Plotting a Contour Plot
contour_plot = vis.plot_contour(study, params=["max_depth", "min_samples_leaf"])
contour_plot.show()

In [206]:
print('After current session: ')
print("Best trial:", study.best_trial.number)
print("Best average cross-validation ROC AUC:", study.best_trial.value)
print("Best hyperparameters:", study.best_params)
total_hours = round(total_seconds.iloc[0, 0] / 3600, 3)
print("Total running time (hours):", total_hours)

After current session: 
Best trial: 1412
Best average cross-validation ROC AUC: 0.8877726804385926
Best hyperparameters: {'criterion': 'log_loss', 'max_depth': 12, 'max_features': 11, 'max_leaf_nodes': 172, 'min_impurity_decrease': 2.1053751014191062e-05, 'min_samples_leaf': 3, 'ccp_alpha': 2.673562655646687e-07, 'max_samples': 0.6368446246746027}
Total running time (hours): 1.006


Now let's test our best model with a greater number of estimators, add its scores to the table with the comment "optuna_(number of hours)" and prepare a submission file:

In [207]:
%%time

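# Refit the best configuration found so far with 500 trees
model_for_tests = RandomForestClassifier(
    random_state=SEED,
    n_estimators=500,
    n_jobs=-1,
    **study.best_params,
)

print(model_for_tests)

# Cross-validate, append the scores to scores_df and build the submission
submission = get_cv_scores(
    train, test, model_for_tests, scores_df,
    comment="optuna_{}".format(total_hours),
    prepare_submission=True,
)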

scores_df

RandomForestClassifier(ccp_alpha=2.673562655646687e-07, criterion='log_loss',
                       max_depth=12, max_features=11, max_leaf_nodes=172,
                       max_samples=0.6368446246746027,
                       min_impurity_decrease=2.1053751014191062e-05,
                       min_samples_leaf=3, n_estimators=500, n_jobs=-1,
                       random_state=123)
CPU times: total: 39.2 s
Wall time: 11.3 s


| | Changes: | Train ROC AUC | Cross-val ROC AUC | Train Accuracy | Cross-val Accuracy | Test accuracy |
|---|---|---|---|---|---|---|
| 0 | Unprocessed numeric features | 0.891046 | 0.847367 | 0.830047 | 0.790751 | 0.80056 |
| 1 | GroupSize | 0.89933 | 0.854434 | 0.828759 | 0.792477 | |
| 2 | FamilySize | 0.900522 | 0.854035 | 0.828989 | 0.791787 | |
| 3 | - GroupSize | 0.896124 | 0.851555 | 0.829334 | 0.792938 | |
| 4 | 1 + Deck_enc | 0.927033 | 0.877194 | 0.831174 | 0.791672 | 0.79682 |
| 5 | + HomePlanet | 0.930021 | 0.880737 | 0.834349 | 0.794893 | |
| 6 | + Destination | 0.930903 | 0.882469 | 0.835431 | 0.796503 | |
| 7 | + CryoSleep | 0.931807 | 0.882508 | 0.838951 | 0.799264 | |
| 8 | + VIP | 0.931491 | 0.882599 | 0.83879 | 0.799724 | |
| 9 | + Side | 0.93437 | 0.886205 | 0.844196 | 0.799379 | |
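
The final model uses 500 trees, since a larger forest may score slightly better. One intuition for this is that averaging more trees makes the ensemble's prediction less noisy. Here is a minimal toy sketch of that effect using only the standard library. It assumes every "tree" returns the true probability plus independent Gaussian noise, which is a simplification: real trees in a forest are correlated, so the actual gain is smaller. The function name `ensemble_spread` and all the constants are made up for illustration.

```python
import random
import statistics


def ensemble_spread(n_estimators, n_repeats=300, noise=0.1, seed=123):
    """Standard deviation of the averaged prediction over repeated toy ensembles.

    Assumption: each estimator outputs true_probability + independent Gaussian noise.
    """
    rng = random.Random(seed)
    true_probability = 0.7  # arbitrary toy value
    averages = []
    for _ in range(n_repeats):
        # One simulated ensemble: n_estimators noisy predictions, then their mean
        votes = [true_probability + rng.gauss(0.0, noise) for _ in range(n_estimators)]
        averages.append(sum(votes) / n_estimators)
    return statistics.pstdev(averages)


spread_small = ensemble_spread(90)
spread_large = ensemble_spread(500)
print("Smaller ensemble spread:", spread_small)
print("Larger ensemble spread: ", spread_large)
```

Under these assumptions the spread shrinks roughly like 1 / sqrt(n_estimators), so the returns diminish quickly as the forest grows.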


In [208]:
# FOR SUBMISSION

submission.to_csv('05_submission_13.csv', index=False)

scores_df.loc[13, 'Test accuracy'] = 0.79565

scores_df

| | Changes: | Train ROC AUC | Cross-val ROC AUC | Train Accuracy | Cross-val Accuracy | Test accuracy |
|---|---|---|---|---|---|---|
| 0 | Unprocessed numeric features | 0.891046 | 0.847367 | 0.830047 | 0.790751 | 0.80056 |
| 1 | GroupSize | 0.89933 | 0.854434 | 0.828759 | 0.792477 | |
| 2 | FamilySize | 0.900522 | 0.854035 | 0.828989 | 0.791787 | |
| 3 | - GroupSize | 0.896124 | 0.851555 | 0.829334 | 0.792938 | |
| 4 | 1 + Deck_enc | 0.927033 | 0.877194 | 0.831174 | 0.791672 | 0.79682 |
| 5 | + HomePlanet | 0.930021 | 0.880737 | 0.834349 | 0.794893 | |
| 6 | + Destination | 0.930903 | 0.882469 | 0.835431 | 0.796503 | |
| 7 | + CryoSleep | 0.931807 | 0.882508 | 0.838951 | 0.799264 | |
| 8 | + VIP | 0.931491 | 0.882599 | 0.83879 | 0.799724 | |
| 9 | + Side | 0.93437 | 0.886205 | 0.844196 | 0.799379 | |


After running this notebook several times, we reached a point where the best cross-validated ROC AUC had not improved for several hundred trials: the best trial is still number 1412, while the log above already runs past trial 2000.
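
This "no improvement for a long time" check can also be made explicit. Below is a minimal sketch in plain Python; the helper names `trials_since_best` and `should_stop` and the toy scores are ours, not part of Optuna. With a real study, the list of scores could be built as `[t.value for t in study.trials]`, assuming trials without a value show up as `None`.

```python
def trials_since_best(values):
    """Return (best_index, trials_since_best) for a maximised objective.

    `values` holds per-trial scores in trial order; None entries
    (trials without a score) are skipped.
    """
    best_index = None
    best_value = float("-inf")
    for i, value in enumerate(values):
        if value is not None and value > best_value:
            best_index, best_value = i, value
    if best_index is None:
        raise ValueError("no trial has a score yet")
    return best_index, len(values) - 1 - best_index


def should_stop(values, patience):
    """True if the best score has not improved for at least `patience` trials."""
    _, stale = trials_since_best(values)
    return stale >= patience


# Toy scores, made up for illustration (not taken from the study)
toy_scores = [0.880, 0.884, 0.8878, 0.8845, None, 0.8871, 0.8849]
best_index, stale = trials_since_best(toy_scores)
print("Best trial index:", best_index)
print("Trials since best:", stale)
print("Stop searching:", should_stop(toy_scores, patience=3))
```

A `patience` of a few hundred trials would match the rule of thumb we applied by eye here.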


So we can move on to the next part, which will be Model Ensembling: ['06_ensembling.ipynb'](06_ensembling.ipynb).