# OPTIMIZATION BY 5x2 CV - `DecisionTreeClassifier` - VALIDATION OF PROCESS BY MULTIPLE RUNS AND TRAINING/TEST SPLIT

In this notebook we use `optuna` to run several independent optimization studies for the `DecisionTreeClassifier`, using the mean 5x2 CV precision as the objective.

The aim of this check is to see whether the optimization done with one particular 5x2 CV split was just a lucky one.

In this case, we apply the 5x2 CV to the original S4 training sample (i.e. without oversampling).

The hyperparameters of `DecisionTreeClassifier` to optimize are listed below. The search ranges are derived from the best values found by the earlier basic optimization done when pre-selecting models, and they match the ranges coded in the objective function of this notebook:

- `max_depth`: integer $\in$ \[`1`, `8`\]. Rationale: the first optimization yielded $25$ (the lowest value in the grid). Even so, the tree was overfitting, so it is better to further reduce the tree depth.
- `min_samples_leaf`: integer $\in$ \[`5`, `10`\]. Rationale: the first optimization yielded $5$ (the lowest value in the grid). Even so, the tree was overfitting, so it is better to also try values larger than 5.
- `ccp_alpha`: float $\in$ \[`0.005`, `0.500`\], in steps of `0.005`. Rationale: the first optimization yielded $0.005$. Even so, the tree was overfitting, so it is better to prune further, which means larger values of `ccp_alpha`. This is the tuning parameter $\alpha$ that governs the trade-off between tree size and goodness of fit: larger values of $\alpha$ give smaller trees and a looser fit to the training data; smaller values give larger trees and a tighter fit. As we were suffering from severe overfitting, we are interested in smaller trees, even if the goodness of fit to the training sample decreases; that is, in higher values of $\alpha$ (`ccp_alpha`).
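The effect of `ccp_alpha` on tree size described above can be illustrated with a minimal sketch. It uses synthetic, imbalanced toy data from `make_classification` (an assumption made only for illustration, not the S4 sample), and the list of `alphas` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced toy data (NOT the S4 sample): roughly 10% positives.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    weights=[0.9, 0.1], random_state=0
)

# Arbitrary pruning strengths, from no pruning to very strong pruning.
alphas = [0.0, 0.005, 0.02, 0.1, 0.5]
node_counts = []
for alpha in alphas:
    clf = DecisionTreeClassifier(
        criterion="entropy", ccp_alpha=alpha, random_state=0
    )
    clf.fit(X, y)
    node_counts.append(clf.tree_.node_count)
    print(f"ccp_alpha={alpha:<6} nodes={clf.tree_.node_count:<4} depth={clf.get_depth()}")
```

With the same data and the same `random_state`, the unpruned tree is identical in every iteration, so larger values of `ccp_alpha` can only remove nodes from it.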

At the moment, and in order to keep computation time under control, we leave the following parameters at the values suggested by the first basic optimization operation, assuming it was a good choice.

- `criterion`: `'entropy'`
- `max_features`: `None` (Note: we set this to `None` because we feel `'sqrt'` and `'log2'` can introduce additional randomness to the models.)


## Modules and configuration

### Modules

In [1]:
import pickle

import pandas as pd
import numpy as np

from IPython.display import clear_output

from sklearn.tree import DecisionTreeClassifier, plot_tree

from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

from sklearn.metrics import precision_score, make_scorer, classification_report, confusion_matrix

import optuna

from matplotlib import pyplot as plt


### Configuration

In [2]:
RANDOM_STATE = 11 # For reproducibility

N = 250 # Number of experiments

#S4_TRAIN_SET_IN = "../data/DATASETS_ML/S4_02_DS_AfterImputing_TrainTest_OVERSAMPLED_n3.csv"
S4_TRAIN_SET_IN = "../data/DATASETS_ML/S4_02_DS_AfterImputing_TrainTest.csv"
# Train/test set for S4 sample, all 112 features
S4_VALIDATION_SET_IN = "../data/DATASETS_ML/S4_02_DS_AfterImputing_Validation.csv"
# Validation set for S4 sample, all 112 features
CARMENES_SET_IN = "../data/DATASETS_ML/ML_02_DS_AfterImputing.csv"

REL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Reliable_features.pickle"

OPT_RUNS_RESULTS_OUT = "../data/ML_MODELS/Results_DecisionTree/Opt_DT_Scaled_OptRunsResults.csv"


### Functions

In [3]:
def objective_cv_5x2_arg(trial, train_set, random_seed):
    # Defining hyper parameters:
    # Suggest maximum depth of tree:
    max_depth = trial.suggest_int("max_depth", 1, 8, step=1, log=False)
    # Suggest minimum samples in a leaf node:
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 5, 10, step=1, log=False)
    # Suggest the complexity parameter for Minimal Cost-Complexity Pruning:
    ccp_alpha = trial.suggest_float("ccp_alpha", 0.005, 0.500, step=0.005, log=False)
    
    # Create the classifier:
    clf = DecisionTreeClassifier(
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        ccp_alpha=ccp_alpha,
        criterion='entropy', max_features=None,
        splitter='best'
    )

    # Calculate the objective function (based on the 5 x 2 cross-validation score in the training set)
    # (Note: 'train_set' is passed in as an argument, but this function still
    # reads the global variable 'rel_features'. Maybe not the best practice,
    # but it avoids reloading the data each time the objective function is invoked)
    cv_results = []
    for fold in range(0,5): # Five runs
        run_results = []
        # Create 2-fold split:
        X_train, X_test, y_train, y_test = train_test_split(
            train_set[rel_features], train_set['Pulsating'],
            test_size=0.5, stratify=train_set['Pulsating'],
            random_state=random_seed + fold
        )
        
        # Fit on X_train, y_train:
        clf.fit(X_train, y_train)
        # Predict values on X_test:
        y_test_pred = clf.predict(X_test)
        # Measure precision and add to the run results:
        new_precision = precision_score(y_test, y_test_pred, zero_division=0.0)
        run_results.append(new_precision)
        
        # Invert the 2-fold split
        X_train, X_test, y_train, y_test = X_test, X_train, y_test, y_train
        # Fit on (new, swapped) X_train, y_train:
        clf.fit(X_train, y_train)
        # Predict values on (new, swapped) X_test:
        y_test_pred = clf.predict(X_test)
        # Measure precision and add to the run results:
        new_precision = precision_score(y_test, y_test_pred, zero_division=0.0)
        run_results.append(new_precision)
        
        # Add the average to the overall results:
        cv_results.append(np.mean(run_results))
    
    # Create the crossvalidation result:
    cv_avg = np.mean(cv_results)

    # Return objective value (notice that we must tell 'optuna' to *maximize* the objective)
    return cv_avg    
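As a cross-check of the hand-written loop above, the same 5x2 CV idea (five repetitions of a stratified 2-fold split, averaging precision over the ten fits) can be sketched with scikit-learn's built-in splitters. This is an alternative formulation, not the code used for the results in this notebook: the helper name `precision_5x2` and the synthetic data are assumptions, and the splits it generates will differ from those produced by `train_test_split`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced toy data (NOT the S4 sample).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0
)

def precision_5x2(clf, X, y, seed):
    """Hypothetical helper: mean precision over 5 repetitions of stratified 2-fold CV."""
    cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=seed)
    # zero_division=0 scores a fold as 0 when no positive is predicted.
    scorer = make_scorer(precision_score, zero_division=0)
    scores = cross_val_score(clf, X, y, scoring=scorer, cv=cv)
    return scores.mean(), scores

clf = DecisionTreeClassifier(
    criterion="entropy", max_depth=4, min_samples_leaf=5, random_state=0
)
mean_precision, fold_scores = precision_5x2(clf, X, y, seed=11)
print(f"{len(fold_scores)} fits, mean precision = {mean_precision:.3f}")
```

Because every repetition contributes exactly two scores, the mean over the ten scores equals the mean of the five per-repetition averages computed in the loop above.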

In [26]:
def re_evaluate_model(model, X_train, y_train, X_val, y_val, refit=True):
    '''Re-evaluate a model on the training and validation sets, refitting it first if requested'''
    if refit:
        model.fit(X_train, y_train)
    # Training score:
    y_train_pred = model.predict(X_train)
    tr_score = precision_score(y_train, y_train_pred, zero_division=0.0)
    # Validation score:
    y_val_pred = model.predict(X_val)
    val_score = precision_score(y_val, y_val_pred, zero_division=0.0)
    # Final depth of the tree:
    tree_depth = model.get_depth()
    
    return {'train_score': tr_score, 'validation_score': val_score, 'tree_depth': tree_depth}

## Load data

### Load reliable features

In [5]:
REL_FEATURES_IN

'../data/ML_MODELS/ML_pipeline_steps/Reliable_features.pickle'

In [6]:
with open(REL_FEATURES_IN, 'rb') as f:  # The context manager closes the file after loading
    rel_features = pickle.load(f)
print(rel_features)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_rel_phase3', '

In [7]:
len(rel_features)

48

### Load training data

In [8]:
S4_TRAIN_SET_IN

'../data/DATASETS_ML/S4_02_DS_AfterImputing_TrainTest.csv'

In [9]:
tr_load = pd.read_csv(S4_TRAIN_SET_IN, sep=',', decimal='.')
tr_load.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00163,False,0.0,0.0,0.0,2457444.0,0.0,-0.674126,0.519174,0.466681,...,-0.71231,-1.187392,0.425026,-0.002305,0.495906,-0.537353,-0.028926,-0.262548,-0.135686,-0.705143
1,Star-00123,True,30.0,0.72,0.0,2457401.0,0.37,-1.626729,1.911247,-0.740748,...,0.040924,-1.110488,-0.289189,0.056551,0.555375,-0.69959,-0.292135,-0.013533,0.443673,-1.207278
2,Star-00022,False,0.0,0.0,0.0,2457430.0,0.0,-0.039057,-1.012107,0.013895,...,-0.943428,0.637603,-0.679383,0.020496,-0.496592,-0.001214,-0.101526,-0.011097,-0.293389,0.242263
3,Star-00708,False,0.0,0.0,0.0,2459677.0,0.0,-0.039057,1.632833,-0.514355,...,-1.091456,0.75988,-0.161363,-0.21093,0.135863,0.662121,-0.492481,0.015621,-0.724783,0.682494
4,Star-00484,False,0.0,0.0,0.0,2457400.0,0.0,0.596012,-0.176863,-1.042605,...,-0.69626,0.153752,0.936459,0.070402,-0.067689,-0.656553,-0.237337,-0.032597,-0.139141,-0.09808


#### Transform training data

Map the `Pulsating` column to `0` / `1`.

In [10]:
tr_load['Pulsating'] = tr_load['Pulsating'].map(lambda x: 1 if x == True else 0)
tr_load.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00163,0,0.0,0.0,0.0,2457444.0,0.0,-0.674126,0.519174,0.466681,...,-0.71231,-1.187392,0.425026,-0.002305,0.495906,-0.537353,-0.028926,-0.262548,-0.135686,-0.705143
1,Star-00123,1,30.0,0.72,0.0,2457401.0,0.37,-1.626729,1.911247,-0.740748,...,0.040924,-1.110488,-0.289189,0.056551,0.555375,-0.69959,-0.292135,-0.013533,0.443673,-1.207278
2,Star-00022,0,0.0,0.0,0.0,2457430.0,0.0,-0.039057,-1.012107,0.013895,...,-0.943428,0.637603,-0.679383,0.020496,-0.496592,-0.001214,-0.101526,-0.011097,-0.293389,0.242263
3,Star-00708,0,0.0,0.0,0.0,2459677.0,0.0,-0.039057,1.632833,-0.514355,...,-1.091456,0.75988,-0.161363,-0.21093,0.135863,0.662121,-0.492481,0.015621,-0.724783,0.682494
4,Star-00484,0,0.0,0.0,0.0,2457400.0,0.0,0.596012,-0.176863,-1.042605,...,-0.69626,0.153752,0.936459,0.070402,-0.067689,-0.656553,-0.237337,-0.032597,-0.139141,-0.09808


Select only the reliable features and target. `train_set` is defined at notebook (module) level, so the cells below can use it directly and no `global` statement is needed.

In [11]:
train_set = tr_load[rel_features + ['Pulsating']].copy()
train_set.head()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2,Pulsating
0,-0.674126,0.519174,0.466681,0.766297,1.786498,-0.304944,0.843252,0.189055,1.390901,0.462908,...,0.908818,1.305379,1.413989,0.174334,-0.188773,0.985693,-0.258841,-1.099919,-0.461571,0
1,-1.626729,1.911247,-0.740748,0.691384,0.168331,1.522002,1.16642,0.157675,0.019744,-0.192345,...,-1.22435,0.710232,-1.272791,1.617586,1.392776,0.260283,0.708876,1.030413,0.400968,1
2,-0.039057,-1.012107,0.013895,-0.357397,1.168762,-0.232282,-0.443941,-0.136007,-0.412519,-0.042766,...,0.882515,1.044322,-1.204443,-0.593335,-1.011092,0.5925,0.135213,0.725303,-0.319881,0
3,-0.039057,1.632833,-0.514355,0.166993,1.47763,-0.544204,-0.572606,-0.586661,-0.338658,-0.449412,...,-0.92175,-1.095322,-0.03196,-0.068737,1.152465,-0.672518,0.391616,-1.301501,0.559262,0
4,0.596012,-0.176863,-1.042605,-0.43231,0.242158,-0.277263,-0.498198,-0.37002,-0.451106,-0.32157,...,-0.057582,0.541269,1.529879,-0.689578,1.47529,-1.268004,1.297212,1.327633,-0.059789,0


In [12]:
print("Pulsating stars in training sample: %d" %len(train_set[train_set['Pulsating'] == 1]))

Pulsating stars in training sample: 78


In [13]:
print("Non-pulsating stars in training sample: %d" %len(train_set[train_set['Pulsating'] == 0]))

Non-pulsating stars in training sample: 672


### Load validation data

In [14]:
S4_VALIDATION_SET_IN

'../data/DATASETS_ML/S4_02_DS_AfterImputing_Validation.csv'

In [15]:
val_load = pd.read_csv(S4_VALIDATION_SET_IN, sep=',', decimal='.')
val_load.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,False,0.0,0.0,0.0,2457430.0,0.0,-0.99166,0.031948,0.542146,...,0.215296,-1.17101,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,False,0.0,0.0,0.0,2457432.0,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,False,0.0,0.0,0.0,2457404.0,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.39456,0.012444,-0.506509,-0.073337,-0.01962,0.850942
3,Star-00120,False,0.0,0.0,0.0,2457395.0,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.22416,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,False,0.0,0.0,0.0,2457441.0,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.30514,-0.144503,0.889861,0.402366


#### Transform validation data

Map the `Pulsating` column to `0` / `1`.

In [16]:
val_load['Pulsating'] = val_load['Pulsating'].map(lambda x: 1 if x == True else 0)
val_load.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,0,0.0,0.0,0.0,2457430.0,0.0,-0.99166,0.031948,0.542146,...,0.215296,-1.17101,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,0,0.0,0.0,0.0,2457432.0,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,0,0.0,0.0,0.0,2457404.0,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.39456,0.012444,-0.506509,-0.073337,-0.01962,0.850942
3,Star-00120,0,0.0,0.0,0.0,2457395.0,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.22416,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,0,0.0,0.0,0.0,2457441.0,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.30514,-0.144503,0.889861,0.402366


Select only the reliable features and target.

In [17]:
val = val_load[rel_features + ['Pulsating']].copy()
val.head()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2,Pulsating
0,-0.99166,0.031948,0.542146,1.740165,-0.298361,2.924506,1.112558,1.223343,-0.528719,-0.333528,...,-1.55381,-0.039531,-1.766924,0.105349,-0.914928,1.624409,1.04124,-1.287022,0.212829,0
1,-1.309194,-1.081711,1.825039,1.815078,-1.61105,-0.692478,0.439292,1.223343,1.528017,2.763725,...,-0.257974,-0.443062,0.117884,0.110252,-1.094708,-0.142404,0.174193,1.058664,-1.753975,0
2,-0.356591,0.379966,0.844003,0.166993,-1.147748,-0.480752,-0.473181,0.12521,-0.246422,0.578714,...,-0.244072,-1.117475,1.393767,-0.017646,-0.23654,0.981973,0.721607,-0.990852,-0.054942,0
3,-0.039057,0.519174,0.994931,1.065949,-1.379399,-0.348004,-0.310918,-0.22466,-0.13696,-0.068049,...,0.049026,1.53379,-0.885426,1.05705,0.254424,0.604281,-1.070993,-0.70584,-2.196394,0
4,0.596012,-0.664089,-0.212498,0.391732,0.087724,1.19675,2.120285,1.085438,0.9309,0.226927,...,-0.834589,0.906258,1.467923,0.111756,-0.662832,-0.936044,0.525532,-1.277372,-0.605419,0


In [18]:
print("Pulsating stars in validation sample: %d" %len(val[val['Pulsating'] == 1]))

Pulsating stars in validation sample: 26


In [19]:
print("Non-pulsating stars in validation sample: %d" %len(val[val['Pulsating'] == 0]))

Non-pulsating stars in validation sample: 224
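The class counts printed above give a useful reference level for reading precision values. For a classifier whose positive predictions are independent of the true label, the expected precision equals the fraction of positives in the sample. A small sketch of that arithmetic follows; the helper name `prevalence` is introduced here only for illustration.

```python
# Class counts printed above for the S4 samples.
train_pos, train_neg = 78, 672
val_pos, val_neg = 26, 224

def prevalence(n_pos, n_neg):
    """Fraction of positive (pulsating) stars in a sample."""
    return n_pos / (n_pos + n_neg)

train_prev = prevalence(train_pos, train_neg)
val_prev = prevalence(val_pos, val_neg)
print(f"Training prevalence:   {train_prev:.3f}")
print(f"Validation prevalence: {val_prev:.3f}")
```

Both samples contain about 10.4% pulsating stars, so a precision close to 0.10 is roughly what label-independent guessing would be expected to achieve.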


### Load CARMENES data

In [20]:
CARMENES_SET_IN

'../data/DATASETS_ML/ML_02_DS_AfterImputing.csv'

In [21]:
carm = pd.read_csv(CARMENES_SET_IN, sep=',', decimal='.')
carm.head()

Unnamed: 0,Karmn,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,0.278478,-0.594485,0.013895,-0.956701,0.396592,-0.582246,-0.695467,0.903654,-0.497556,...,-1.344237,0.307975,2.25,0.336808,-0.001346,1.8128,-0.802968,-0.248463,-2.393587,0.183519
1,J23492+024,-0.674126,-0.733692,-0.438891,-0.057745,0.473809,-0.347436,-0.322543,1.312483,-0.153626,...,-2.251395,1.650742,-4.235033,0.118526,-0.244276,0.60428,-0.787328,-0.3944,-1.810785,1.600874
2,J23431+365,0.596012,-0.107259,0.542146,1.740165,-0.993314,-0.573237,-0.710438,-0.638376,-0.528719,...,0.178298,-1.062289,0.3663,0.121669,-0.255425,-0.788734,-0.490843,-0.66769,-0.338895,-1.116797
3,J23419+441,0.278478,1.354418,0.240288,1.140862,-0.993314,-0.281692,-0.412485,-0.559694,-0.332839,...,-2.029837,0.290144,0.829992,-0.008235,-0.235705,0.526272,-0.801627,-0.33911,-1.928617,0.555949
4,J23381-162,-0.039057,-0.385674,-0.891677,-0.657049,1.47763,0.003096,0.886923,0.217009,0.990671,...,-0.842653,0.846232,-0.199704,0.15027,-0.191291,0.634827,-0.436603,-0.188157,0.037966,0.22897


## Optimize with `optuna`

### Define objective function

See 'Functions' subsection.

### Loop through multiple `optuna` studies

In [22]:
# Initialize seed:
np.random.seed(RANDOM_STATE)

In [23]:
# Initialize the lists of results:
run = [] # Number of experiment
best_trial = []
max_depth = [] # Optimized value for 'max_depth'
min_samples_leaf = [] # Optimized value for 'min_samples_leaf'
ccp_alpha = [] # Optimized value for 'ccp_alpha'
avg_precision = [] # Average precision achieved over 5x2 CV folds
tr_precision = [] # Precision over the training/test sample from S4, after retraining.
val_precision = [] # Precision over the validation sample from S4, after retraining.
final_depth = [] # Final decision tree depth, after retraining.

In [24]:
# Set the verbosity value, to prevent too much output.
optuna.logging.set_verbosity(optuna.logging.ERROR)
for i in range(0, N):
    clear_output(wait=True)
    print("Experiment %d..." %i)
    
    # Create 'optuna' study:
    study = optuna.create_study(direction="maximize")
    # Execute 500 trials for each study:
    print("Optimizing optuna...")
    study.optimize(
        lambda trial: objective_cv_5x2_arg(trial, train_set, RANDOM_STATE + N * i),
        n_trials=500,
        show_progress_bar=True
    )

    # Record the results of the optimization:
    run.append(i)
    best_trial.append(study.best_trial.number)
    max_depth.append(study.best_params['max_depth'])
    min_samples_leaf.append(study.best_params['min_samples_leaf'])
    ccp_alpha.append(study.best_params['ccp_alpha'])
    avg_precision.append(study.best_value)
        
    # Create, re-train and evaluate a classifier with the optimized parameters:
    new_results = re_evaluate_model(
        DecisionTreeClassifier(
            max_depth=study.best_params['max_depth'],
            min_samples_leaf=study.best_params['min_samples_leaf'],
            ccp_alpha=study.best_params['ccp_alpha'],
            criterion='entropy', max_features=None,
            splitter='best'
        ),
        X_train=train_set[rel_features], y_train=train_set['Pulsating'],
        X_val=val[rel_features], y_val=val['Pulsating'],
        refit=True)

    tr_precision.append(new_results['train_score'])
    val_precision.append(new_results['validation_score'])
    final_depth.append(new_results['tree_depth'])

# Create the results dataframe:
opt_val_results = pd.DataFrame(
    data={
        'Experiment_ID': run,
        'Best_max_depth': max_depth,
        'Best_min_samples_leaf': min_samples_leaf,
        'Best_ccp_alpha': ccp_alpha,
        'Best_average_precision_5x2CV': avg_precision,
        'S4_traintest_precision': tr_precision,
        'S4_validation_precision': val_precision,
    }
)



Experiment 249...
Optimizing optuna...



In [25]:
opt_val_results

Unnamed: 0,Experiment_ID,Best_max_depth,Best_min_samples_leaf,Best_ccp_alpha,Best_average_precision,S4_traintest_precision,S4_validation_precision
0,0,7,5,0.005,0.093205,0.875000,0.000000
1,1,8,5,0.015,0.114634,0.000000,0.000000
2,2,5,6,0.005,0.101115,0.857143,0.000000
3,3,7,5,0.015,0.119038,0.000000,0.000000
4,4,6,5,0.005,0.099076,0.842105,0.000000
...,...,...,...,...,...,...,...
245,245,6,7,0.005,0.105166,0.772727,0.000000
246,246,4,7,0.005,0.101323,0.857143,0.000000
247,247,8,7,0.005,0.112393,0.789474,0.076923
248,248,8,7,0.010,0.115760,0.533333,0.333333


#### Fix: rename the erroneous `Best_min_samples_leaf ` (one trailing space was entered)

In [33]:
opt_val_results.rename(columns={'Best_min_samples_leaf ': 'Best_min_samples_leaf'}, inplace=True)

#### Fix: add the final tree depth

In [34]:
opt_val_results['Final_tree_depth'] = -1
opt_val_results['NEW_tr_precision'] = 0.0 # To check the fix; will be deleted later.
opt_val_results['NEW_val_precision'] = 0.0 # To check the fix; will be deleted later.
opt_val_results

Unnamed: 0,Experiment_ID,Best_max_depth,Best_min_samples_leaf,Best_ccp_alpha,Best_average_precision,S4_traintest_precision,S4_validation_precision,Final_tree_depth,NEW_tr_precision,NEW_val_precision
0,0,7,5,0.005,0.093205,0.875000,0.000000,-1,0.0,0.0
1,1,8,5,0.015,0.114634,0.000000,0.000000,-1,0.0,0.0
2,2,5,6,0.005,0.101115,0.857143,0.000000,-1,0.0,0.0
3,3,7,5,0.015,0.119038,0.000000,0.000000,-1,0.0,0.0
4,4,6,5,0.005,0.099076,0.842105,0.000000,-1,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
245,245,6,7,0.005,0.105166,0.772727,0.000000,-1,0.0,0.0
246,246,4,7,0.005,0.101323,0.857143,0.000000,-1,0.0,0.0
247,247,8,7,0.005,0.112393,0.789474,0.076923,-1,0.0,0.0
248,248,8,7,0.010,0.115760,0.533333,0.333333,-1,0.0,0.0


In [37]:
for i in range(0, N):
    new_clf = DecisionTreeClassifier(
        max_depth=opt_val_results.loc[i, 'Best_max_depth'],
        min_samples_leaf=opt_val_results.loc[i, 'Best_min_samples_leaf'],
        ccp_alpha=opt_val_results.loc[i, 'Best_ccp_alpha'],
        criterion='entropy', max_features=None,
        splitter='best'
    )
    new_clf.fit(train_set[rel_features], train_set['Pulsating'])
    opt_val_results.loc[i, 'Final_tree_depth'] = new_clf.get_depth()
    
    # Just for checking afterwards - Will be deleted later:
    opt_val_results.loc[i, 'NEW_tr_precision'] = \
        precision_score(train_set['Pulsating'], new_clf.predict(train_set[rel_features]), zero_division=0.0)
    opt_val_results.loc[i, 'NEW_val_precision'] = \
        precision_score(val['Pulsating'], new_clf.predict(val[rel_features]), zero_division=0.0)

opt_val_results

Unnamed: 0,Experiment_ID,Best_max_depth,Best_min_samples_leaf,Best_ccp_alpha,Best_average_precision,S4_traintest_precision,S4_validation_precision,Final_tree_depth,NEW_tr_precision,NEW_val_precision
0,0,7,5,0.005,0.093205,0.875000,0.000000,7,0.875000,0.000000
1,1,8,5,0.015,0.114634,0.000000,0.000000,0,0.000000,0.000000
2,2,5,6,0.005,0.101115,0.857143,0.000000,5,0.857143,0.000000
3,3,7,5,0.015,0.119038,0.000000,0.000000,0,0.000000,0.000000
4,4,6,5,0.005,0.099076,0.842105,0.000000,6,0.842105,0.000000
...,...,...,...,...,...,...,...,...,...,...
245,245,6,7,0.005,0.105166,0.772727,0.000000,6,0.772727,0.000000
246,246,4,7,0.005,0.101323,0.857143,0.000000,4,0.857143,0.000000
247,247,8,7,0.005,0.112393,0.789474,0.076923,8,0.789474,0.076923
248,248,8,7,0.010,0.115760,0.533333,0.333333,4,0.533333,0.333333
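Some rows in the table above (for example experiments 1 and 3) show `Final_tree_depth` equal to 0 together with a precision of 0. One reading consistent with this is that pruning collapsed the tree to its root node, so that it always predicts the majority class. The sketch below reproduces that behaviour on synthetic data (an assumption, not the S4 sample), using a deliberately large `ccp_alpha` to force full pruning.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced toy data (NOT the S4 sample).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0
)

# Deliberately large ccp_alpha (illustrative value) so that every split is pruned away.
root_only = DecisionTreeClassifier(criterion="entropy", ccp_alpha=1.0, random_state=0)
root_only.fit(X, y)

y_pred = root_only.predict(X)
depth = root_only.get_depth()
n_flagged = int(y_pred.sum())
# No positive predictions: precision is undefined and reported as 0.
precision = precision_score(y, y_pred, zero_division=0)
print(f"depth={depth}, predicted positives={n_flagged}, precision={precision}")
```

With no positive predictions, precision is undefined and `zero_division` makes it 0, which would explain a score of exactly 0 on both the training and validation samples.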


Check if results are the same:

In [39]:
(opt_val_results['S4_traintest_precision'] == opt_val_results['NEW_tr_precision']).sum()

250

In [40]:
(opt_val_results['S4_validation_precision'] == opt_val_results['NEW_val_precision']).sum()

198

**OBSERVATION:** there are 52 cases where the validation results are not the same. Let's see which ones they are:

In [41]:
opt_val_results[opt_val_results['S4_validation_precision'] != opt_val_results['NEW_val_precision']]

Unnamed: 0,Experiment_ID,Best_max_depth,Best_min_samples_leaf,Best_ccp_alpha,Best_average_precision,S4_traintest_precision,S4_validation_precision,Final_tree_depth,NEW_tr_precision,NEW_val_precision
12,12,8,5,0.005,0.122456,0.794872,0.2,8,0.794872,0.375
14,14,8,6,0.005,0.106438,0.789474,0.076923,8,0.789474,0.2
16,16,8,6,0.005,0.117757,0.789474,0.2,8,0.789474,0.181818
27,27,8,8,0.005,0.113723,0.611111,0.214286,8,0.611111,0.125
31,31,8,8,0.005,0.125223,0.611111,0.125,8,0.611111,0.230769
43,43,8,6,0.005,0.104659,0.789474,0.181818,8,0.789474,0.076923
65,65,8,8,0.005,0.093908,0.611111,0.214286,8,0.611111,0.230769
69,69,8,5,0.005,0.114879,0.794872,0.2,8,0.794872,0.375
70,70,8,5,0.005,0.109065,0.794872,0.375,8,0.794872,0.2
89,89,8,6,0.005,0.129276,0.789474,0.2,8,0.789474,0.076923


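**NOTE:** the training results are identical, so we continue with what we have and will review this (or repeat the optimization runs) later on. A plausible cause, not verified here, is that the classifiers are created without a fixed `random_state`: scikit-learn permutes the features at each split, so ties between equally good splits can be resolved differently on each fit, even with `max_features=None`.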
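A minimal sketch of how a refit could be made repeatable is given below, assuming that fixing `random_state` is acceptable for this check. It uses synthetic data with redundant features (an assumption chosen to make tied splits more likely), and `fit_predict` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data with redundant features (NOT the S4 sample).
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=3, n_redundant=6,
    weights=[0.9, 0.1], random_state=0
)
X_train, y_train, X_val = X[:350], y[:350], X[350:]

def fit_predict(seed):
    """Hypothetical helper: fit a tree with a fixed seed and predict on the held-out part."""
    clf = DecisionTreeClassifier(
        criterion="entropy", max_depth=8, min_samples_leaf=5,
        ccp_alpha=0.005, random_state=seed
    )
    clf.fit(X_train, y_train)
    return clf.predict(X_val)

# Two independent fits with the same seed give the same held-out predictions.
same = np.array_equal(fit_predict(11), fit_predict(11))
print("Identical validation predictions:", same)
```

Passing the same integer `random_state` both when the model is first evaluated and when it is rebuilt later should make the two sets of validation scores comparable.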

In [42]:
try:
    opt_val_results.drop(columns=['NEW_tr_precision', 'NEW_val_precision'], inplace=True)
except KeyError:
    # Raised by 'drop' when the columns were already removed in a previous execution.
    print("*Probably these columns no longer exist...")
    print(list(opt_val_results.columns))

In [43]:
opt_val_results

Unnamed: 0,Experiment_ID,Best_max_depth,Best_min_samples_leaf,Best_ccp_alpha,Best_average_precision,S4_traintest_precision,S4_validation_precision,Final_tree_depth
0,0,7,5,0.005,0.093205,0.875000,0.000000,7
1,1,8,5,0.015,0.114634,0.000000,0.000000,0
2,2,5,6,0.005,0.101115,0.857143,0.000000,5
3,3,7,5,0.015,0.119038,0.000000,0.000000,0
4,4,6,5,0.005,0.099076,0.842105,0.000000,6
...,...,...,...,...,...,...,...,...
245,245,6,7,0.005,0.105166,0.772727,0.000000,6
246,246,4,7,0.005,0.101323,0.857143,0.000000,4
247,247,8,7,0.005,0.112393,0.789474,0.076923,8
248,248,8,7,0.010,0.115760,0.533333,0.333333,4


## Save the results

### Save the results of the optimization runs

In [49]:
opt_val_results.to_csv(OPT_RUNS_RESULTS_OUT, sep=',', decimal='.', index=False)
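To judge whether a single optimization was a lucky one, the saved table can be summarized across runs. The sketch below shows the kind of summary that could be computed. To stay self-contained it rebuilds only the first five rows displayed above (experiments 0 to 4) instead of reading the CSV file, so the numbers it prints describe those five rows only.

```python
import pandas as pd

# First five rows of the results table displayed above (experiments 0 to 4).
sample = pd.DataFrame({
    "Best_max_depth": [7, 8, 5, 7, 6],
    "Best_min_samples_leaf": [5, 5, 6, 5, 5],
    "Best_ccp_alpha": [0.005, 0.015, 0.005, 0.015, 0.005],
    "S4_validation_precision": [0.0, 0.0, 0.0, 0.0, 0.0],
    "Final_tree_depth": [7, 0, 5, 0, 6],
})

# How often each pruning strength is selected across runs.
alpha_counts = sample["Best_ccp_alpha"].value_counts()
# Share of runs whose refitted tree collapses to the root node.
root_only_share = (sample["Final_tree_depth"] == 0).mean()
# Share of runs with zero precision on the validation sample.
zero_val_share = (sample["S4_validation_precision"] == 0).mean()

print(alpha_counts)
print(f"Root-only trees: {root_only_share:.0%}")
print(f"Zero validation precision: {zero_val_share:.0%}")
```

Applied to the full table, a wide spread in the selected hyperparameters or in the validation precision would suggest that the outcome of a single study depends strongly on the particular CV split.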

## Summary

**RESULTS:**

- We have executed $250$ independent `optuna` studies of $500$ trials each, every study using a different set of 5x2 CV splits, and saved the results.
