# MACHINE LEARNING TELESCOPIC SEARCH

In this notebook we run a second search (a Telescopic Search), starting from the best point found by the previous Grid Search (notebook _MachineLearning_05_GridSearch.ipynb_).
- First, we run a new Grid Search. Parameters whose best value in the first Grid Search fell on an edge of the grid are re-tested with values outside the original range.
- Then, we run a Random Search around the best values found by this new Grid Search.
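
The grid-refinement idea in the first step can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `telescope_grid`, its `factor` argument, and the example grid and best values are hypothetical and are not taken from the previous notebook.

```python
def telescope_grid(grid, best_params, factor=2.0):
    """Build a refined grid around the best values of a finished grid search.

    Hypothetical helper: edge optima are extended beyond the old range,
    interior optima are bracketed by their two neighbours.
    """
    new_grid = {}
    for name, values in grid.items():
        best = best_params[name]
        # Single-valued or non-numeric parameters (e.g. None) are kept fixed.
        if len(values) < 2 or not isinstance(best, (int, float)):
            new_grid[name] = [best]
            continue
        ordered = sorted(values)
        i = ordered.index(best)
        if i == 0:
            # Best value on the lower edge: probe below the original range.
            candidates = [best / factor, best, ordered[1]]
        elif i == len(ordered) - 1:
            # Best value on the upper edge: probe above the original range.
            candidates = [ordered[-2], best, best * factor]
        else:
            # Interior optimum: its neighbours become the new bounds.
            candidates = [ordered[i - 1], best, ordered[i + 1]]
        # Integer parameters (e.g. max_iter) must stay integers.
        if all(isinstance(v, int) for v in ordered):
            candidates = [int(round(c)) for c in candidates]
        new_grid[name] = candidates
    return new_grid


# Hypothetical first grid and best point, for illustration only.
first_grid = {"learning_rate": [0.1, 0.2, 0.4], "max_iter": [100, 200, 400], "max_depth": [None]}
first_best = {"learning_rate": 0.1, "max_iter": 200, "max_depth": None}
refined = telescope_grid(first_grid, first_best)
print(refined)
```

In this made-up case `learning_rate` sat on the lower edge of its grid, so the refined grid reaches below the old minimum, while `max_iter` was an interior optimum and keeps its neighbours as bounds.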

## Modules and configuration

### Modules

In [1]:
import warnings

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from scipy.stats import uniform, randint

from sklearn.metrics import roc_auc_score, precision_score, accuracy_score, recall_score, f1_score, \
    RocCurveDisplay

#from sklearn.experimental import enable_hist_gradient_boosting
# The 'experimental' import above is required on scikit-learn < 1.0 (including 0.24.x), where this
# estimator is still experimental; from 1.0 onwards the direct import below is sufficient.
from sklearn.ensemble import HistGradientBoostingClassifier

from joblib import dump, load


### Configuration

In [2]:
# CONFIGURATION:
RANDOM_STATE = 11 # For reproducibility

# FILES AND FOLDERS
SYNTH_DATASET = "../data/DATASETS_ML/DS4_Dataset.csv" # The already preprocessed one.
OUT_MODELS_FOLDER = "../data/MODELS_ML/"

# TARGET FEATURE - To remove the Lomb-Scargle (Periodic) 'cesium' features
# Notice that the input dataset has the proper 'cesium' features already preselected.
TARGET_DS4 = ['Pulsating']

# TRAIN/TEST SPLITS:
VALIDATION_SIZE = 0.25 # Fraction of DS4 reserved for model validation.

# MODEL OPTIMIZATION / TRAINING:
N_FOLD = 5 # n value for n-fold cross_validation (not-used, but could be added to the cross validations)
PARAM_GRID = { # Parameter grid, only valid for a 'HistGradientBoostingClassifier' outside a `Pipeline`.
    'learning_rate': [0.025, 0.05, 0.1],
    'max_iter': [25, 50, 100],
    'max_leaf_nodes': [None],
    'max_depth': [None],
    'min_samples_leaf': [20, 40, 80],
    'max_bins': [255]
} # This is the starting parameter grid, based on the results of the first grid search.



### Functions

## Load the train/test and validation dataset

In [3]:
ds4 = pd.read_csv(SYNTH_DATASET, sep=',', decimal='.')
ds4.head()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,qso_log_chi2_qsonu,qso_log_chi2nuNULL_chi2nu,skew,std,stetson_j,stetson_k,weighted_average,fold2P_slope_10percentile,fold2P_slope_90percentile,Pulsating
0,8.0,22.0,32.0,34.0,47.0,1.037736,1.122449,1.527778,1.081633,1.472222,...,11.309243,-0.256635,-0.781648,3.165073,24.980336,0.958442,640.080094,-804.937332,734.364181,1
1,4.0,25.0,23.0,49.0,,1.5,2.0,,1.333333,,...,10.193226,-0.719675,-0.066318,2.172986,18.856633,1.041665,1393.65176,-8.4e-05,6.9e-05,1
2,7.0,9.0,16.0,26.0,43.0,1.5,1.5,1.615385,1.0,1.076923,...,9.388395,0.509991,0.234427,2.803447,22.406281,0.974131,1798.934279,-0.000131,9.2e-05,0
3,9.0,10.0,46.0,49.0,0.0,1.0,1.0,2.0,1.0,2.0,...,12.881028,-0.441638,-0.583404,3.02131,27.726903,0.968993,918.862056,-0.00011,0.000187,0
4,7.0,23.0,26.0,19.0,3.0,1.0,1.5,3.0,1.5,3.0,...,9.92881,-0.441719,0.65328,2.83569,24.866941,1.020485,-1374.121335,-1.7e-05,2.9e-05,1


## Train/test set and validation set split

We do a train/test set and validation set split.

**NOTE:** it is important to stratify the split with the target variable (`Pulsating` column).

We first separate the features from the target variable:

In [4]:
X = ds4.drop(columns=TARGET_DS4).copy()
y = ds4[TARGET_DS4].copy()

In [5]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=VALIDATION_SIZE,
                                                    stratify=y, random_state=RANDOM_STATE)
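
As a quick illustration of why the split is stratified, the following sketch uses synthetic labels (not the DS4 data) to check that the class proportion is preserved in both subsets. The names `X_demo` and `y_demo` and the 70/30 class balance are assumptions made only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the target column (70% ones); not the DS4 data.
rng = np.random.default_rng(11)
y_demo = np.array([1] * 700 + [0] * 300)
X_demo = rng.normal(size=(1000, 3))

# Same call pattern as in the notebook: stratify on the target.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=11
)

# Fraction of positive samples in the full set, the train/test part and the validation part.
print(y_demo.mean(), y_tr.mean(), y_va.mean())
```

Without `stratify`, the positive fraction of the two subsets would fluctuate from one random split to another, which matters more the smaller and the more imbalanced the dataset is.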

In [6]:
X_train.shape

(750, 67)

In [7]:
y_train.shape

(750, 1)

In [8]:
y_train[:10]

Unnamed: 0,Pulsating
303,0
425,1
6,1
457,1
619,0
34,1
935,0
52,1
674,1
33,1


In [9]:
np.ravel(y_train)[:10]

array([0, 1, 1, 1, 0, 1, 0, 1, 1, 1], dtype=int64)

In [10]:
X_val.shape

(250, 67)

In [11]:
y_val.shape

(250, 1)

## Design the model for optimization

For now, we do not use a `Pipeline`, to keep things simple. Wrapping the estimator in a `Pipeline` would be preferable if more steps have to be added later (for example, a transformer that drops certain columns, a `RobustScaler`, etc.).

For optimization and training of the model we use telescopic search (Grid Search followed by Random Search) with 5-fold cross-validation and the `roc_auc` metric.

## Optimize and train the model

### Telescopic search step 1: Grid Search CV

In [12]:
# Define the Grid Search model:
clf = GridSearchCV(estimator=HistGradientBoostingClassifier(random_state=RANDOM_STATE),
                   param_grid=PARAM_GRID, cv=5, scoring='roc_auc',
                   verbose=1)

#### Optimize / train

In [13]:
# DISABLE WARNINGS:
warnings.filterwarnings('ignore')

# Optimize / train:
clf.fit(X_train, np.ravel(y_train))

Fitting 5 folds for each of 27 candidates, totalling 135 fits


GridSearchCV(cv=5, estimator=HistGradientBoostingClassifier(random_state=11),
             param_grid={'learning_rate': [0.025, 0.05, 0.1], 'max_bins': [255],
                         'max_depth': [None], 'max_iter': [25, 50, 100],
                         'max_leaf_nodes': [None],
                         'min_samples_leaf': [20, 40, 80]},
             scoring='roc_auc', verbose=1)

We now show the parameters of the estimator:

In [14]:
# Show the grid search parameters:
clf.get_params()

{'cv': 5,
 'error_score': nan,
 'estimator__categorical_features': None,
 'estimator__early_stopping': 'auto',
 'estimator__l2_regularization': 0.0,
 'estimator__learning_rate': 0.1,
 'estimator__loss': 'auto',
 'estimator__max_bins': 255,
 'estimator__max_depth': None,
 'estimator__max_iter': 100,
 'estimator__max_leaf_nodes': 31,
 'estimator__min_samples_leaf': 20,
 'estimator__monotonic_cst': None,
 'estimator__n_iter_no_change': 10,
 'estimator__random_state': 11,
 'estimator__scoring': 'loss',
 'estimator__tol': 1e-07,
 'estimator__validation_fraction': 0.1,
 'estimator__verbose': 0,
 'estimator__warm_start': False,
 'estimator': HistGradientBoostingClassifier(random_state=11),
 'n_jobs': None,
 'param_grid': {'learning_rate': [0.025, 0.05, 0.1],
  'max_iter': [25, 50, 100],
  'max_leaf_nodes': [None],
  'max_depth': [None],
  'min_samples_leaf': [20, 40, 80],
  'max_bins': [255]},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': 'roc_auc',
 '

#### Best params found

In [15]:
# Show the best parameters found:
clf.best_params_

{'learning_rate': 0.05,
 'max_bins': 255,
 'max_depth': None,
 'max_iter': 50,
 'max_leaf_nodes': None,
 'min_samples_leaf': 40}

#### Best estimator

In [16]:
clf.best_estimator_

HistGradientBoostingClassifier(learning_rate=0.05, max_iter=50,
                               max_leaf_nodes=None, min_samples_leaf=40,
                               random_state=11)

#### Best estimator performance on validation set (just as a reference)

In [17]:
# PRINT METRICS:
# Predict once and reuse the result. Note that ROC AUC is computed here from hard class labels.
y_val_pred = clf.best_estimator_.predict(X_val)
print("GridSearchCV - precision on validation set: %f"
      %precision_score(y_val, y_val_pred))
print("GridSearchCV - accuracy on validation set: %f"
      %accuracy_score(y_val, y_val_pred))
print("GridSearchCV - recall on validation set: %f"
      %recall_score(y_val, y_val_pred))
print("GridSearchCV - F1 score on validation set: %f"
      %f1_score(y_val, y_val_pred))
print("GridSearchCV - ROC - AUC on validation set: %f"
      %roc_auc_score(y_val, y_val_pred))


GridSearchCV - precision on validation set: 0.693277
GridSearchCV - accuracy on validation set: 0.664000
GridSearchCV - recall on validation set: 0.937500
GridSearchCV - F1 score on validation set: 0.797101
GridSearchCV - ROC - AUC on validation set: 0.475507


**OBSERVATION:** The results are exactly the same as before (with the first Grid Search), and not very good, so we now try the additional step of a Random Search.
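
One caveat when reading the validation ROC AUC: it is computed from the hard labels returned by `predict`, whereas the `roc_auc` scorer used during cross-validation ranks samples by a continuous score. The toy numbers below are made up (they are not DS4 results) and only show how much the two computations can differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class probabilities, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
proba = np.array([0.52, 0.60, 0.40, 0.70, 0.65, 0.90, 0.55, 0.80])

# Hard labels as a classifier's predict() would return them with a 0.5 threshold.
labels = (proba >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, proba)   # uses the full ranking
auc_from_labels = roc_auc_score(y_true, labels)  # ranking collapsed to 0/1

print(auc_from_scores, auc_from_labels)
```

In the notebook, the score-based equivalent would be `roc_auc_score(y_val, clf.best_estimator_.predict_proba(X_val)[:, 1])`; that value was not computed in the original run, so no result is reported for it here.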

### Telescopic search step 2: Random Search CV

In [18]:
random_params ={
    'learning_rate': uniform(loc=0.025, scale=0.075),
    'max_iter': randint(low=25, high=101), 
    'max_leaf_nodes': [None],
    'max_depth': [None],
    'min_samples_leaf': randint(low=20, high=81),
    'max_bins': [255]
}
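
To make the sampled ranges explicit: `uniform(loc, scale)` draws from the interval `[loc, loc + scale]`, and `randint(low, high)` draws integers from `low` up to `high - 1`, so the distributions above cover the same ranges as the grid in `PARAM_GRID`. The short check below draws samples to confirm this; the sample size and seed are arbitrary choices for this sketch.

```python
from scipy.stats import randint, uniform

# Same distributions as in random_params above.
lr_samples = uniform(loc=0.025, scale=0.075).rvs(size=1000, random_state=11)
iter_samples = randint(low=25, high=101).rvs(size=1000, random_state=11)
leaf_samples = randint(low=20, high=81).rvs(size=1000, random_state=11)

# Observed minimum and maximum of each sample.
print(lr_samples.min(), lr_samples.max())
print(iter_samples.min(), iter_samples.max())
print(leaf_samples.min(), leaf_samples.max())
```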

In [19]:
# Define the Random Search model:
clf_rand = RandomizedSearchCV(estimator=HistGradientBoostingClassifier(random_state=RANDOM_STATE),
                              param_distributions=random_params, cv=5, scoring='roc_auc', n_iter=200,
                              verbose=1)

#### Optimize / train

In [20]:
# DISABLE WARNINGS:
warnings.filterwarnings('ignore')

# Optimize / train:
clf_rand.fit(X_train, np.ravel(y_train))

Fitting 5 folds for each of 200 candidates, totalling 1000 fits


RandomizedSearchCV(cv=5,
                   estimator=HistGradientBoostingClassifier(random_state=11),
                   n_iter=200,
                   param_distributions={'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001CA96050820>,
                                        'max_bins': [255], 'max_depth': [None],
                                        'max_iter': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001CA96067520>,
                                        'max_leaf_nodes': [None],
                                        'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001CA96028820>},
                   scoring='roc_auc', verbose=1)

We now show the parameters of the estimator:

In [21]:
# Show the random search parameters:
clf_rand.get_params()

{'cv': 5,
 'error_score': nan,
 'estimator__categorical_features': None,
 'estimator__early_stopping': 'auto',
 'estimator__l2_regularization': 0.0,
 'estimator__learning_rate': 0.1,
 'estimator__loss': 'auto',
 'estimator__max_bins': 255,
 'estimator__max_depth': None,
 'estimator__max_iter': 100,
 'estimator__max_leaf_nodes': 31,
 'estimator__min_samples_leaf': 20,
 'estimator__monotonic_cst': None,
 'estimator__n_iter_no_change': 10,
 'estimator__random_state': 11,
 'estimator__scoring': 'loss',
 'estimator__tol': 1e-07,
 'estimator__validation_fraction': 0.1,
 'estimator__verbose': 0,
 'estimator__warm_start': False,
 'estimator': HistGradientBoostingClassifier(random_state=11),
 'n_iter': 200,
 'n_jobs': None,
 'param_distributions': {'learning_rate': <scipy.stats._distn_infrastructure.rv_frozen at 0x1ca96050820>,
  'max_iter': <scipy.stats._distn_infrastructure.rv_frozen at 0x1ca96067520>,
  'max_leaf_nodes': [None],
  'max_depth': [None],
  'min_samples_leaf': <scipy.stats._dist

#### Best params found

In [22]:
# Show the best parameters found:
clf_rand.best_params_

{'learning_rate': 0.08887773751235287,
 'max_bins': 255,
 'max_depth': None,
 'max_iter': 53,
 'max_leaf_nodes': None,
 'min_samples_leaf': 58}

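The best parameters have changed with respect to the Grid Search: `learning_rate` moved from 0.05 to about 0.089, `max_iter` from 50 to 53 and `min_samples_leaf` from 40 to 58.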

#### Best estimator

In [23]:
clf_rand.best_estimator_

HistGradientBoostingClassifier(learning_rate=0.08887773751235287, max_iter=53,
                               max_leaf_nodes=None, min_samples_leaf=58,
                               random_state=11)

#### Best estimator performance on validation set

In [29]:
# PRINT METRICS:
# Predict once and reuse the result. Note that ROC AUC is computed here from hard class labels.
y_val_pred_rand = clf_rand.best_estimator_.predict(X_val)
print("RandomizedSearchCV - precision on validation set: %f"
      %precision_score(y_val, y_val_pred_rand))
print("RandomizedSearchCV - accuracy on validation set: %f"
      %accuracy_score(y_val, y_val_pred_rand))
print("RandomizedSearchCV - recall on validation set: %f"
      %recall_score(y_val, y_val_pred_rand))
print("RandomizedSearchCV - F1 score on validation set: %f"
      %f1_score(y_val, y_val_pred_rand))
print("RandomizedSearchCV - ROC - AUC on validation set: %f"
      %roc_auc_score(y_val, y_val_pred_rand))


RandomizedSearchCV - precision on validation set: 0.701299
RandomizedSearchCV - accuracy on validation set: 0.668000
RandomizedSearchCV - recall on validation set: 0.920455
RandomizedSearchCV - F1 score on validation set: 0.796069
RandomizedSearchCV - ROC - AUC on validation set: 0.494011


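**OBSERVATION:** There is a slight improvement in precision, accuracy and ROC AUC, while recall and F1 score are marginally lower, and the ROC AUC (0.494011) is still below 0.5, so the improvement is probably not enough. However, for the moment we store this classifier as the best classifier so far.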

### Store the best classifier so far

In [30]:
dump(clf_rand, OUT_MODELS_FOLDER + "Best_Model_After_RandSearchCV.joblib") 

['../data/MODELS_ML/Best_Model_After_RandSearchCV.joblib']

#### Try loading the model

In [31]:
loaded_clf = load(OUT_MODELS_FOLDER + "Best_Model_After_RandSearchCV.joblib")

In [32]:
loaded_clf.best_estimator_

HistGradientBoostingClassifier(learning_rate=0.08887773751235287, max_iter=53,
                               max_leaf_nodes=None, min_samples_leaf=58,
                               random_state=11)

In [33]:
loaded_clf.best_estimator_.predict(X_val)

array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 0], dtype=int64)