# MODEL PRESELECTION

In this Notebook we take several _off-the-shelf_ Machine Learning models (classifiers) and apply them to the **oversampled** S4 sample.

We run 5-fold cross-validation over the training/test set (for classifiers, the K-fold split is stratified by default, so there is no need to request it explicitly) and then measure the performance of the trained model on the validation set. Both sets were previously split from the S4 sample, stratified by the target variable, and saved to files.
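
The default stratification mentioned above can be checked directly. The following is a minimal sketch on a toy imbalanced target (the data are made up for illustration); it relies on scikit-learn's `check_cv` helper, which resolves an integer `cv` into a concrete splitter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Toy imbalanced target: 90 negatives, 10 positives (illustrative only).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))

# For a classifier with a binary target, an integer `cv` resolves to StratifiedKFold.
splitter = check_cv(cv=5, y=y, classifier=True)
print(type(splitter).__name__)

# Each test fold keeps the 9:1 class ratio of the full target.
fold_positives = [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]
print(fold_positives)
```

Passing an explicit `StratifiedKFold(shuffle=True, random_state=...)` instead of an integer would only be needed if shuffled folds were wanted.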

The scoring metric for cross-validation will be `precision`.

We will test the following ML classification models. For each model, we run a basic `GridSearchCV` over its most relevant hyperparameters, generally trying the default value, one half (or one tenth) of it, and twice (or ten times) it.

- Linear models:
  - `Perceptron`
  - `LogisticRegression`
  - `PassiveAggressiveClassifier`
- Support Vector Machines:
  - `SVC`
- Nearest-Neighbours models:
  - `KNeighborsClassifier`
- Gaussian Processes models:
  - `GaussianProcessClassifier`
- Tree models:
  - `DecisionTreeClassifier`
- Ensemble models:
  - `RandomForestClassifier`
  - `AdaBoostClassifier` (a particular case of the next type of model)
  - `GradientBoostingClassifier`
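
The "default, half, twice" rule for building the grids can be sketched as follows. The helper `scaled_grid` is hypothetical (it is not part of this notebook or of scikit-learn); it simply reads the default value from the estimator and scales it.

```python
from sklearn.linear_model import LogisticRegression

def scaled_grid(estimator, param, factor):
    """Return [default / factor, default, default * factor] for a numeric hyperparameter."""
    default = estimator.get_params()[param]
    cast = type(default)  # keep integer hyperparameters (e.g. max_iter) as integers
    return [cast(default / factor), default, cast(default * factor)]

# LogisticRegression defaults: C=1.0 and max_iter=100.
print(scaled_grid(LogisticRegression(), 'C', 2))
print(scaled_grid(LogisticRegression(), 'max_iter', 2))
```

With a factor of 2 this reproduces the `C` and `max_iter` grids used for `LogisticRegression` in the configuration cell; a factor of 10 gives the "one tenth / ten times" variant.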

## Modules and configuration

### Modules

In [34]:
import pandas as pd
import numpy as np

import copy

from time import time

from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, \
    f1_score, log_loss, matthews_corrcoef, classification_report, \
    get_scorer_names, confusion_matrix

from collections import OrderedDict

import warnings

#from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif

from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, DotProduct

from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit # Used to avoid overfitting
#from sklearn.model_selection import cross_validate #### NOTE: THIS ONE MAY BE BETTER, TO KEEP CONTROL AND GET
# AS MANY RUNS AS WE WANT
###### IS THE GRID SEARCH NOT NEEDED AT ALL??? WE SIMPLY "FIT" AND THEN MEASURE WITH
# "PREDICT" ON THE VALIDATION SET

from sklearn.linear_model import Perceptron, LogisticRegression, PassiveAggressiveClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, \
    GradientBoostingClassifier, HistGradientBoostingClassifier

import pickle

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
plt.rcParams['figure.figsize'] = (15, 10)  # set_style() ignores non-style keys such as figure.figsize

from IPython.display import display

# from imblearn import 

### Configuration

In [99]:
RANDOM_STATE = 11 # For reproducibility

S4_TRAIN_SET_IN = "../data/DATASETS_ML/S4_02_DS_AfterImputing_TrainTest_OVERSAMPLED_n3.csv"
# Train/test set for S4 sample, all 112 features
S4_VALIDATION_SET_IN = "../data/DATASETS_ML/S4_02_DS_AfterImputing_Validation.csv"
# Validation set for S4 sample, all 112 features

REL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Reliable_features.pickle"
#UNREL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Unreliable_features.pickle"

ML_ADD_COLUMNS = ['Karmn'] # Only cesium features and this column will be kept.
S4_ADD_COLUMNS = ['ID', 'Pulsating', 'frequency', 'amplitudeRV',
                  'offsetRV', 'refepochRV', 'phase'] # Only cesium features and these columns will be kept.

MODELS_FOLDER = "../data/ML_MODELS/ML_model_preselection/"

PRECISION_RESULTS_OUT = "ModelPreselection_PrecisionResults_OversampledSMOTE_n3.csv"
VAL_PREDICTIONS_OUT = "ModelPreselection_ValidationPredictions_OversampledSMOTE_n3.csv"

OTS_CLF_OUT = "OVERSAMPLED_fitted_clf_ots_n3.pickle"
OPT_CLF_OUT = "OVERSAMPLED_fitted_clf_opt_n3.pickle"

# Note: it would be better to use a JSON file for this configuration.
OFF_THE_SHELF_CLASSIFIERS = OrderedDict({
    'Perceptron': {
        'clf': Perceptron(),
        'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                       'alpha': [0.001, 0.0001, 0.00001],
                       'l1_ratio': [None, 0.075, 0.15, 0.30],
                       'max_iter': [500, 1000, 2000],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'LogisticRegression': {
        'clf': LogisticRegression(),
        'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                       'C': [0.5, 1.0, 2.0],
                       'l1_ratio': [None, 0.075, 0.15, 0.30],
                       'solver': ['saga'],
                       'max_iter': [50, 100, 200],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'PassiveAggressiveClassifier': {
        'clf': PassiveAggressiveClassifier(),
        'param_grid': {'C': [0.5, 1.0, 2.0],
                       'max_iter': [500, 1000, 2000],
                       'loss': ['hinge', 'squared_hinge'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'SVC': {
        'clf': SVC(),
        'param_grid': {'C': [0.5, 1.0, 2.0],
                       # 'precomputed' is left out: it expects a square kernel matrix, not a feature matrix
                       'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
                       'degree': [2, 3, 6],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'KNeighborsClassifier': {
        'clf': KNeighborsClassifier(),
        'param_grid': {'n_neighbors': [1, 3, 5, 10],
                       'weights': ['uniform', 'distance'],
                       'algorithm': ['ball_tree', 'kd_tree', 'brute'],
                       'p': [1, 2]
                      }
    },
    'GaussianProcessClassifier': {
        'clf': GaussianProcessClassifier(),
        'param_grid': {'kernel': [RBF(), RationalQuadratic(), DotProduct()],
                       'max_iter_predict': [50, 100, 200],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'DecisionTreeClassifier': {
        'clf': DecisionTreeClassifier(),
        'param_grid': {'criterion': ['gini', 'entropy', 'log_loss'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'max_features': [None, 'sqrt', 'log2'],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'RandomForestClassifier': {
        'clf': RandomForestClassifier(),
        'param_grid': {'n_estimators': [50, 100, 200],
                       'criterion': ['gini', 'entropy', 'log_loss'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'max_features': [None, 'sqrt', 'log2'],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'AdaBoostClassifier': {
        'clf': AdaBoostClassifier(),
        'param_grid': {'n_estimators': [25, 50, 100],
                       'learning_rate': [0.5, 1.0, 2.0],
                       'algorithm': ['SAMME', 'SAMME.R'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'GradientBoostingClassifier': {
        'clf': GradientBoostingClassifier(),
        'param_grid': {'loss': ['log_loss', 'deviance'],
                       'learning_rate': [0.05, 0.1, 0.2],
                       'n_estimators': [25, 50, 100],
                       'criterion': ['friedman_mse', 'squared_error'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    }
})

IMAGE_FOLDER = './img/'
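
The configuration note above suggests moving the classifier definitions to a JSON file. A minimal sketch of that idea follows; the JSON layout, `CLASSIFIER_REGISTRY` and `load_classifiers` are assumptions made for illustration, and only two of the ten classifiers are shown.

```python
import json
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical JSON layout: classifier name -> parameter grid.
CONFIG_JSON = """
{
  "Perceptron": {"alpha": [0.001, 0.0001, 0.00001], "max_iter": [500, 1000, 2000]},
  "KNeighborsClassifier": {"n_neighbors": [1, 3, 5, 10], "weights": ["uniform", "distance"]}
}
"""

# JSON can only hold names, so a registry maps them back to estimator classes.
CLASSIFIER_REGISTRY = {
    "Perceptron": Perceptron,
    "KNeighborsClassifier": KNeighborsClassifier,
}

def load_classifiers(json_text):
    """Build the {'clf': ..., 'param_grid': ...} structure from JSON text."""
    grids = json.loads(json_text)
    return {name: {"clf": CLASSIFIER_REGISTRY[name](), "param_grid": grid}
            for name, grid in grids.items()}

classifiers = load_classifiers(CONFIG_JSON)
print(list(classifiers))
```

Grids that contain Python objects, such as the `RBF()` kernels of the Gaussian process entry, cannot be written to JSON as they are and would need their own name-to-object mapping.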

### Functions

## Load data

We load the data, which are the time series as previously featurized by _cesium_, scaled, and with `NaN` values imputed by a `KNNImputer`.
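
For reference, the kind of preprocessing described above could look like the sketch below. This is not the code that produced the files loaded here: the scaler type (`StandardScaler`), the order of the two steps and `n_neighbors=2` are all assumptions, and the data are a toy array.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two missing values (illustrative only).
X = np.array([[1.0, 10.0, 100.0],
              [2.0, np.nan, 110.0],
              [3.0, 30.0, 120.0],
              [np.nan, 40.0, 130.0],
              [5.0, 50.0, 140.0]])

# StandardScaler ignores NaN when fitting and keeps them when transforming;
# KNNImputer then fills each gap from the nearest rows.
prep = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_ready = prep.fit_transform(X)
print(np.isnan(X_ready).any())
```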

### Load reliable features list

In [36]:
with open(REL_FEATURES_IN, 'rb') as f:
    rel_features = pickle.load(f)
print(rel_features)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_rel_phase3', '

### Load the oversampled S4 sample data train set

In [37]:
s4_train_rel = pd.read_csv(S4_TRAIN_SET_IN, sep=',', decimal='.')
s4_train_rel

Unnamed: 0,Pulsating,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq1_rel_phase2,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2
0,0,-0.674126,0.519174,0.466681,0.766297,1.786498,-0.304944,0.843252,0.189055,1.390901,...,0.646033,0.908818,1.305379,1.413989,0.174334,-0.188773,0.985693,-0.258841,-1.099919,-0.461571
1,1,-1.626729,1.911247,-0.740748,0.691384,0.168331,1.522002,1.166420,0.157675,0.019744,...,1.532902,-1.224350,0.710232,-1.272791,1.617586,1.392776,0.260283,0.708876,1.030413,0.400968
2,0,-0.039057,-1.012107,0.013895,-0.357397,1.168762,-0.232282,-0.443941,-0.136007,-0.412519,...,0.384058,0.882515,1.044322,-1.204443,-0.593335,-1.011092,0.592500,0.135213,0.725303,-0.319881
3,0,-0.039057,1.632833,-0.514355,0.166993,1.477630,-0.544204,-0.572606,-0.586661,-0.338658,...,1.457441,-0.921750,-1.095322,-0.031960,-0.068737,1.152465,-0.672518,0.391616,-1.301501,0.559262
4,0,0.596012,-0.176863,-1.042605,-0.432310,0.242158,-0.277263,-0.498198,-0.370020,-0.451106,...,-0.829296,-0.057582,0.541269,1.529879,-0.689578,1.475290,-1.268004,1.297212,1.327633,-0.059789
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1339,1,0.473173,-0.668664,0.135630,-0.259518,-1.373200,-0.470559,-0.589896,-0.548709,-0.448373,...,-0.038051,-0.958444,0.386309,-0.805888,0.697947,0.293784,-0.667106,-0.192877,1.224979,-1.394465
1340,1,-0.993662,-0.524004,0.918516,1.589867,-0.372796,-0.377095,-0.318659,-0.118490,-0.120285,...,0.036617,1.595782,0.586577,-0.903790,0.315474,-0.690108,-0.846619,0.819517,-0.407266,-1.583849
1341,1,0.610021,-1.068148,0.492761,0.150468,-0.980256,-0.169826,-0.341273,-0.177918,-0.341825,...,-0.194276,-0.105307,1.257649,-0.697622,-0.195618,-0.123771,-0.670504,0.582847,1.003724,-1.285203
1342,1,1.061356,-0.373715,0.099776,-0.660900,-1.102506,0.339215,-0.083127,0.190607,-0.361995,...,-0.124422,-0.875222,1.033421,-0.335086,-0.822268,-0.909756,-0.608352,-0.050957,1.095146,-1.091260


Notice that the target variable is already encoded as `0` / `1` and that only the features remain (i.e. the metadata columns are no longer present).

### Load the S4 sample data validation set

In [38]:
s4_val = pd.read_csv(S4_VALIDATION_SET_IN, sep=',', decimal='.')
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,False,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,False,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,False,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,Star-00120,False,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,False,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,False,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,Star-00943,False,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,Star-00721,False,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,Star-00926,False,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


In this case we do need to apply the transformations ourselves.

#### Encode target variable (`Pulsating`)

We encode the target variable as `True` / `False` → `1` / `0`.

In [39]:
s4_val['Pulsating'] = (s4_val['Pulsating'] == True).astype(int)
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,0,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,0,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,0,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,Star-00120,0,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,0,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,0,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,Star-00943,0,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,Star-00721,0,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,Star-00926,0,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


#### Filter the relevant columns only

We now keep only the reliable feature columns plus the `Pulsating` column.

In [40]:
s4_val_rel = s4_val[['Pulsating'] + rel_features].copy()
s4_val_rel

Unnamed: 0,Pulsating,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq1_rel_phase2,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2
0,0,-0.991660,0.031948,0.542146,1.740165,-0.298361,2.924506,1.112558,1.223343,-0.528719,...,-0.938100,-1.553810,-0.039531,-1.766924,0.105349,-0.914928,1.624409,1.041240,-1.287022,0.212829
1,0,-1.309194,-1.081711,1.825039,1.815078,-1.611050,-0.692478,0.439292,1.223343,1.528017,...,-0.225260,-0.257974,-0.443062,0.117884,0.110252,-1.094708,-0.142404,0.174193,1.058664,-1.753975
2,0,-0.356591,0.379966,0.844003,0.166993,-1.147748,-0.480752,-0.473181,0.125210,-0.246422,...,-0.828381,-0.244072,-1.117475,1.393767,-0.017646,-0.236540,0.981973,0.721607,-0.990852,-0.054942
3,0,-0.039057,0.519174,0.994931,1.065949,-1.379399,-0.348004,-0.310918,-0.224660,-0.136960,...,0.041384,0.049026,1.533790,-0.885426,1.057050,0.254424,0.604281,-1.070993,-0.705840,-2.196394
4,0,0.596012,-0.664089,-0.212498,0.391732,0.087724,1.196750,2.120285,1.085438,0.930900,...,-0.143457,-0.834589,0.906258,1.467923,0.111756,-0.662832,-0.936044,0.525532,-1.277372,-0.605419
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,0,0.913546,0.101552,-1.419927,-0.057745,-0.761663,0.185932,-0.248147,-0.282121,-0.456553,...,1.095375,0.802932,0.271270,1.003811,-0.814867,0.239043,1.691595,-0.792703,-0.844428,1.395823
246,0,0.278478,-0.733692,1.447717,1.515426,-1.997135,-0.692478,-0.772586,-0.672852,-0.528719,...,-1.125088,0.000321,0.528247,0.944357,1.417351,-1.632795,-1.374663,-1.104800,-0.094134,0.181936
247,0,-0.356591,1.493625,1.221324,-0.507223,1.786498,-0.382451,-0.368627,-0.328089,-0.185930,...,-0.028582,1.325680,1.176487,-1.183766,0.406925,0.391323,-0.620253,-1.156760,0.794597,-0.604688
248,0,0.913546,-1.151314,0.240288,-0.207571,-1.533833,-0.005709,-0.205837,-0.022504,-0.283139,...,-1.312254,0.864168,-0.223449,-0.412393,1.222789,0.142060,0.311169,-0.012413,1.407608,0.625523


In [41]:
print(list(s4_val_rel.columns))

['Pulsating', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_r

## Train all the _off-the-shelf_ classifiers

### Train benchmark classifiers

We train the classifiers just off-the-shelf, with all the parameters set to their defaults.

In [111]:
warnings.simplefilter('ignore')
fitted_classifiers_benchmark = OrderedDict()
for classifier in list(OFF_THE_SHELF_CLASSIFIERS.keys()):
    print("Fitting off-the-shelf classifier %s..." %classifier)
    clf = copy.deepcopy(OFF_THE_SHELF_CLASSIFIERS[classifier]['clf'])
    #param_grid = OFF_THE_SHELF_CLASSIFIERS[classifier]['param_grid']
    # No hyperparameter search here (empty param_grid): cross-validate the defaults
    # and refit on the full training set:
    cv = GridSearchCV(clf, param_grid={}, scoring='precision', cv=4, refit=True)
    start_time = time()
    cv.fit(s4_train_rel[rel_features], s4_train_rel['Pulsating'])
    end_time = time()
    elapsed_time = end_time - start_time
    print("... completed. Elapsed time: %.3f seconds" %elapsed_time)
    # Add the best fitted classifier to the dictionary:
    fitted_classifiers_benchmark[classifier] = OrderedDict({
        'Fitted_clf': copy.deepcopy(cv.best_estimator_),
        'OptTrain_time': elapsed_time
    })

Fitting off-the-shelf classifier Perceptron...
... completed. Elapsed time: 0.030 seconds
Fitting off-the-shelf classifier LogisticRegression...
... completed. Elapsed time: 0.107 seconds
Fitting off-the-shelf classifier PassiveAggressiveClassifier...
... completed. Elapsed time: 0.030 seconds
Fitting off-the-shelf classifier SVC...
... completed. Elapsed time: 0.352 seconds
Fitting off-the-shelf classifier KNeighborsClassifier...
... completed. Elapsed time: 0.146 seconds
Fitting off-the-shelf classifier GaussianProcessClassifier...
... completed. Elapsed time: 2.208 seconds
Fitting off-the-shelf classifier DecisionTreeClassifier...
... completed. Elapsed time: 0.219 seconds
Fitting off-the-shelf classifier RandomForestClassifier...
... completed. Elapsed time: 2.659 seconds
Fitting off-the-shelf classifier AdaBoostClassifier...
... completed. Elapsed time: 1.504 seconds
Fitting off-the-shelf classifier GradientBoostingClassifier...
... completed. Elapsed time: 6.749 seconds
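
The benchmark loop above calls `GridSearchCV` with an empty `param_grid`. A small sketch on synthetic data (generated with `make_classification`, so unrelated to the S4 sample) shows what that does: a single candidate, the default configuration, is cross-validated and then refitted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=11)

search = GridSearchCV(DecisionTreeClassifier(random_state=11), param_grid={},
                      scoring='precision', cv=4, refit=True)
search.fit(X, y)

print(search.cv_results_['params'])           # the only candidate: no overrides
print(search.best_params_)
print(search.cv_results_['mean_test_score'])  # mean precision over the 4 folds
```

The cross-validated precision of the default configuration is therefore available in `cv_results_`, even though the loop above only keeps `best_estimator_` and the elapsed time.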


In [112]:
fitted_classifiers_benchmark

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.029919147491455078)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf', LogisticRegression()),
                           ('OptTrain_time', 0.10671472549438477)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf', PassiveAggressiveClassifier()),
                           ('OptTrain_time', 0.029920339584350586)])),
             ('SVC',
              OrderedDict([('Fitted_clf', SVC()),
                           ('OptTrain_time', 0.3515286445617676)])),
             ('KNeighborsClassifier',
              OrderedDict([('Fitted_clf', KNeighborsClassifier()),
                           ('OptTrain_time', 0.14560890197753906)])),
             ('GaussianProcessClassifier',
              OrderedDict([('Fitted_clf', GaussianProcessClassifier()),
                           ('OptTrain_time', 2.207959

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each fitted classifier:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient
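
One caveat about the LogLoss entry: log loss is defined on predicted probabilities, so feeding it hard `0` / `1` predictions gives a value that only reflects the error rate. The sketch below illustrates the difference on synthetic data; it is an illustration under assumed toy data, not a change to the notebook's procedure. Not every classifier in the list exposes `predict_proba` (for example `Perceptron`, `PassiveAggressiveClassifier`, or `SVC` with its default `probability=False`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Noisy synthetic problem so that the classifier makes some mistakes.
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=11)
clf = LogisticRegression().fit(X, y)

# Log loss on the predicted probability of the positive class.
ll_proba = log_loss(y, clf.predict_proba(X)[:, 1])
# Log loss on hard labels: every mistake is charged the maximum penalty.
ll_hard = log_loss(y, clf.predict(X).astype(float))
error_rate = np.mean(clf.predict(X) != y)

print(f"probabilities: {ll_proba:.3f}  hard labels: {ll_hard:.3f}  error rate: {error_rate:.3f}")
```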

### Performance measurements

Let's look at the confusion matrices and classification reports for all the classifiers:

#### Confusion matrices

In [114]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_benchmark.keys()):
    print("\n\nPrinting confusion matrices for classifier %s..." %classifier)
    print("Training set:")
    y_train_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(confusion_matrix(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("Validation set:")
    y_val_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(confusion_matrix(y_true=y_val_true, y_pred=y_val_pred))



Printing confusion matrices for classifier Perceptron...
Training set:
[[499 173]
 [376 296]]
Validation set:
[[154  70]
 [ 18   8]]


Printing confusion matrices for classifier LogisticRegression...
Training set:
[[443 229]
 [222 450]]
Validation set:
[[130  94]
 [ 17   9]]


Printing confusion matrices for classifier PassiveAggressiveClassifier...
Training set:
[[503 169]
 [357 315]]
Validation set:
[[145  79]
 [ 17   9]]


Printing confusion matrices for classifier SVC...
Training set:
[[626  46]
 [ 16 656]]
Validation set:
[[187  37]
 [ 20   6]]


Printing confusion matrices for classifier KNeighborsClassifier...
Training set:
[[466 206]
 [  0 672]]
Validation set:
[[130  94]
 [ 14  12]]


Printing confusion matrices for classifier GaussianProcessClassifier...
Training set:
[[672   0]
 [  0 672]]
Validation set:
[[158  66]
 [ 18   8]]


Printing confusion matrices for classifier DecisionTreeClassifier...
Training set:
[[672   0]
 [  0 672]]
Validation set:
[[163  61]
 [ 21   5]]


**OBSERVATION:** The main problem here is that most classifiers overfit the training set. However, the behaviour on the training set now seems better (i.e. the classifiers no longer predict a single label).
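
The overfitting can be quantified from the confusion matrices printed above. The sketch below copies three of them by hand and computes precision as TP / (TP + FP) for each set; the helper `precision_from_cm` is introduced here only for illustration.

```python
import numpy as np

# Confusion matrices copied from the output above, as (training, validation).
# scikit-learn layout: rows are true labels, columns are predicted labels,
# so each matrix reads [[TN, FP], [FN, TP]].
matrices = {
    'Perceptron': (np.array([[499, 173], [376, 296]]), np.array([[154, 70], [18, 8]])),
    'SVC': (np.array([[626, 46], [16, 656]]), np.array([[187, 37], [20, 6]])),
    'DecisionTreeClassifier': (np.array([[672, 0], [0, 672]]), np.array([[163, 61], [21, 5]])),
}

def precision_from_cm(cm):
    """Precision of the positive class from a 2x2 confusion matrix."""
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fp)

for name, (cm_train, cm_val) in matrices.items():
    p_train, p_val = precision_from_cm(cm_train), precision_from_cm(cm_val)
    print(f"{name}: train={p_train:.3f}  validation={p_val:.3f}  gap={p_train - p_val:.3f}")
```

A large gap between training and validation precision is the overfitting signature described in the observation.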

#### Classification reports

In [115]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_benchmark.keys()):
    print("Printing classification reports for classifier %s..." %classifier)
    print("\tTraining set:")
    y_train_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(classification_report(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("\tValidation set:")
    y_val_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(classification_report(y_true=y_val_true, y_pred=y_val_pred))


Printing classification reports for classifier Perceptron...
	Training set:
              precision    recall  f1-score   support

           0       0.57      0.74      0.65       672
           1       0.63      0.44      0.52       672

    accuracy                           0.59      1344
   macro avg       0.60      0.59      0.58      1344
weighted avg       0.60      0.59      0.58      1344

	Validation set:
              precision    recall  f1-score   support

           0       0.90      0.69      0.78       224
           1       0.10      0.31      0.15        26

    accuracy                           0.65       250
   macro avg       0.50      0.50      0.47       250
weighted avg       0.81      0.65      0.71       250

Printing classification reports for classifier LogisticRegression...
	Training set:
              precision    recall  f1-score   support

           0       0.67      0.66      0.66       672
           1       0.66      0.67      0.67       672

    a

#### Main metrics

In [116]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_benchmark.keys()):
    print("Calculating predictions and performance for classifier %s..." %classifier)
    # Training set:
    y_train_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    fitted_classifiers_benchmark[classifier]['Training metrics'] = OrderedDict({})
    fitted_classifiers_benchmark[classifier]['Training metrics']['accuracy'] = \
        accuracy_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['precision'] = \
        precision_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['recall'] = \
        recall_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['F1'] = \
        f1_score(y_true=y_train_true, y_pred=y_train_pred)
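    # Note: log_loss receives hard 0/1 predictions here (not probabilities), so the value
    # is proportional to the error rate rather than a probabilistic log loss.
    fitted_classifiers_benchmark[classifier]['Training metrics']['log_loss'] = \
        log_loss(y_true=y_train_true, y_pred=y_train_pred)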
    fitted_classifiers_benchmark[classifier]['Training metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_train_true, y_pred=y_train_pred)
    # Validation set:
    y_val_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    fitted_classifiers_benchmark[classifier]['Validation metrics'] = OrderedDict({})
    fitted_classifiers_benchmark[classifier]['Validation metrics']['accuracy'] = \
        accuracy_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['precision'] = \
        precision_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['recall'] = \
        recall_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['F1'] = \
        f1_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['log_loss'] = \
        log_loss(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_val_true, y_pred=y_val_pred)


Calculating predictions and performance for classifier Perceptron...
Calculating predictions and performance for classifier LogisticRegression...
Calculating predictions and performance for classifier PassiveAggressiveClassifier...
Calculating predictions and performance for classifier SVC...
Calculating predictions and performance for classifier KNeighborsClassifier...
Calculating predictions and performance for classifier GaussianProcessClassifier...
Calculating predictions and performance for classifier DecisionTreeClassifier...
Calculating predictions and performance for classifier RandomForestClassifier...
Calculating predictions and performance for classifier AdaBoostClassifier...
Calculating predictions and performance for classifier GradientBoostingClassifier...


In [117]:
fitted_classifiers_benchmark

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.029919147491455078),
                           ('Training metrics',
                            OrderedDict([('accuracy', 0.5915178571428571),
                                         ('precision', 0.6311300639658849),
                                         ('recall', 0.44047619047619047),
                                         ('F1', 0.5188431200701139),
                                         ('log_loss', 14.723188772786692),
                                         ('MCC', 0.19200593959743664)])),
                           ('Validation metrics',
                            OrderedDict([('accuracy', 0.648),
                                         ('precision', 0.10256410256410256),
                                         ('recall', 0.3076923076923077),
                                         ('F1', 0.15384615384615385),
                        

**OBSERVATION:** <font color='red'>**VERY BAD RESULTS**</font><font color='blue'>**, BUT THE IMBALANCE PROBLEM HAS CLEARLY DISAPPEARED NOW**</font>

We still have the problem of overfitting.

#### Focus on `precision`

We now focus on `precision`, as it is the metric we are most interested in.

In [118]:
precision_results = pd.DataFrame(index=fitted_classifiers_benchmark.keys())
precision_results

Perceptron
LogisticRegression
PassiveAggressiveClassifier
SVC
KNeighborsClassifier
GaussianProcessClassifier
DecisionTreeClassifier
RandomForestClassifier
AdaBoostClassifier
GradientBoostingClassifier


In [119]:
precision_results['BM_tr_precision'] = np.nan
precision_results['BM_val_precision'] = np.nan
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision
Perceptron,,
LogisticRegression,,
PassiveAggressiveClassifier,,
SVC,,
KNeighborsClassifier,,
GaussianProcessClassifier,,
DecisionTreeClassifier,,
RandomForestClassifier,,
AdaBoostClassifier,,
GradientBoostingClassifier,,


In [120]:
for clf in fitted_classifiers_benchmark.keys():
    precision_results.loc[clf, 'BM_tr_precision'] = \
        fitted_classifiers_benchmark[clf]['Training metrics']['precision']
    precision_results.loc[clf, 'BM_val_precision'] = \
        fitted_classifiers_benchmark[clf]['Validation metrics']['precision']
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision
Perceptron,0.63113,0.102564
LogisticRegression,0.662739,0.087379
PassiveAggressiveClassifier,0.650826,0.102273
SVC,0.934473,0.139535
KNeighborsClassifier,0.765376,0.113208
GaussianProcessClassifier,1.0,0.108108
DecisionTreeClassifier,1.0,0.075758
RandomForestClassifier,1.0,0.2
AdaBoostClassifier,0.87037,0.132075
GradientBoostingClassifier,0.969298,0.095238


In [121]:
print("TRAINING / VALIDATION PRECISION RESULTS, OFF-THE-SHELF CLASSIFIERS")
for idx in (precision_results.index):
    print("%s: %.2f / %.2f" %(idx,
                              precision_results.loc[idx, 'BM_tr_precision'],
                              precision_results.loc[idx, 'BM_val_precision']))

TRAINING / VALIDATION PRECISION RESULTS, OFF-THE-SHELF CLASSIFIERS
Perceptron: 0.63 / 0.10
LogisticRegression: 0.66 / 0.09
PassiveAggressiveClassifier: 0.65 / 0.10
SVC: 0.93 / 0.14
KNeighborsClassifier: 0.77 / 0.11
GaussianProcessClassifier: 1.00 / 0.11
DecisionTreeClassifier: 1.00 / 0.08
RandomForestClassifier: 1.00 / 0.20
AdaBoostClassifier: 0.87 / 0.13
GradientBoostingClassifier: 0.97 / 0.10


**OBSERVATION:** so, it is clear that most of the algorithms perform pretty well on the training set, but:

- All of them show heavy overfitting, some of them to an extreme degree. Notice that, compared with training on the imbalanced data, the classifiers least affected by this are the tree classifiers (DecisionTree and RandomForest).
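One way to quantify the overfitting noted above is the gap between training and validation precision. The following is a minimal sketch that hard-codes the rounded values from the table printed above; the `precision_gap` helper is a hypothetical name introduced only for this illustration.

```python
# Baseline (off-the-shelf) precision values copied from the table above,
# stored as (training, validation) pairs.
baseline_precision = {
    "Perceptron": (0.631130, 0.102564),
    "LogisticRegression": (0.662739, 0.087379),
    "PassiveAggressiveClassifier": (0.650826, 0.102273),
    "SVC": (0.934473, 0.139535),
    "KNeighborsClassifier": (0.765376, 0.113208),
    "GaussianProcessClassifier": (1.0, 0.108108),
    "DecisionTreeClassifier": (1.0, 0.075758),
    "RandomForestClassifier": (1.0, 0.2),
    "AdaBoostClassifier": (0.870370, 0.132075),
    "GradientBoostingClassifier": (0.969298, 0.095238),
}


def precision_gap(train_precision, val_precision):
    """Hypothetical helper: training precision minus validation precision."""
    return train_precision - val_precision


gaps = {name: precision_gap(tr, val)
        for name, (tr, val) in baseline_precision.items()}

# Print classifiers from the largest gap to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print("%s: gap = %.2f" % (name, gap))
```

With these numbers the gap is never below 0.5, which is what "heavy overfitting" means here in quantitative terms.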


### Save the models

In [122]:
# Use a context manager so that the file is always closed after writing:
with open(MODELS_FOLDER + OTS_CLF_OUT, 'wb') as f:
    pickle.dump(fitted_classifiers_benchmark, f)

## Optimize all the _off-the-shelf_ classifiers

We now try a very simple and quick optimization of those off-the-shelf classifiers, to see if the situation improves for some of them.

We choose the grid values around the default values, when possible.
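As an illustration of "grid values around the default values", the sketch below builds a three-point grid from a default. The helper `grid_around_default` is hypothetical (it is not part of this notebook or of scikit-learn); the defaults used in the example (`C=1.0`, `alpha=0.0001`) are the scikit-learn defaults for those parameters.

```python
def grid_around_default(default, factor=2.0):
    """Hypothetical helper: return [default / factor, default, default * factor]."""
    return [default / factor, default, default * factor]


# C defaults to 1.0 in LogisticRegression and PassiveAggressiveClassifier.
c_grid = grid_around_default(1.0)  # [0.5, 1.0, 2.0]

# alpha defaults to 0.0001 in Perceptron; here the step is a factor of ten.
alpha_grid = grid_around_default(0.0001, factor=10.0)

print(c_grid)
print(alpha_grid)
```

Integer parameters such as `max_iter` would additionally need rounding, and categorical ones (for example `penalty`) have to be listed by hand.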

In [123]:
list(OFF_THE_SHELF_CLASSIFIERS.keys())

['Perceptron',
 'LogisticRegression',
 'PassiveAggressiveClassifier',
 'SVC',
 'KNeighborsClassifier',
 'GaussianProcessClassifier',
 'DecisionTreeClassifier',
 'RandomForestClassifier',
 'AdaBoostClassifier',
 'GradientBoostingClassifier']

In [124]:
OFF_THE_SHELF_CLASSIFIERS

OrderedDict([('Perceptron',
              {'clf': Perceptron(),
               'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                'alpha': [0.001, 0.0001, 1e-05],
                'l1_ratio': [None, 0.075, 0.15, 0.3],
                'max_iter': [500, 1000, 2000],
                'random_state': [11]}}),
             ('LogisticRegression',
              {'clf': LogisticRegression(),
               'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                'C': [0.5, 1.0, 2.0],
                'l1_ratio': [None, 0.075, 0.15, 0.3],
                'solver': ['saga'],
                'max_iter': [50, 100, 200],
                'random_state': [11]}}),
             ('PassiveAggressiveClassifier',
              {'clf': PassiveAggressiveClassifier(),
               'param_grid': {'C': [0.5, 1.0, 2.0],
                'max_iter': [500, 1000, 2000],
                'loss': ['hinge', 'squared_hinge'],
                'random_state': [11]}}),
             ('SVC',
   

### Optimize classifiers

We now try a simple optimization of these off-the-shelf classifiers, targeting the `precision` metric for maximization.

In [125]:
warnings.simplefilter('ignore')
fitted_classifiers_opt = OrderedDict()
for classifier in list(OFF_THE_SHELF_CLASSIFIERS.keys()):
    print("Optimizing and fitting classifier %s..." %classifier)
    clf = copy.deepcopy(OFF_THE_SHELF_CLASSIFIERS[classifier]['clf'])
    param_grid = OFF_THE_SHELF_CLASSIFIERS[classifier]['param_grid']
    # Optimize with training data:
    cv = GridSearchCV(clf, param_grid=param_grid, scoring='precision', cv=4, refit=True)
    start_time = time()
    cv.fit(s4_train_rel[rel_features], s4_train_rel['Pulsating'])
    end_time = time()
    elapsed_time = end_time - start_time
    print("... completed. Elapsed time: %.3f seconds" %elapsed_time)
    # Add the best fitted classifier to the dictionary:
    fitted_classifiers_opt[classifier] = OrderedDict({
        'Fitted_clf': copy.deepcopy(cv.best_estimator_),
        'OptTrain_time': elapsed_time
    })

Optimizing and fitting classifier Perceptron...
... completed. Elapsed time: 3.008 seconds
Optimizing and fitting classifier LogisticRegression...
... completed. Elapsed time: 35.487 seconds
Optimizing and fitting classifier PassiveAggressiveClassifier...
... completed. Elapsed time: 0.490 seconds
Optimizing and fitting classifier SVC...
... completed. Elapsed time: 9.629 seconds
Optimizing and fitting classifier KNeighborsClassifier...
... completed. Elapsed time: 5.683 seconds
Optimizing and fitting classifier GaussianProcessClassifier...
... completed. Elapsed time: 103.219 seconds
Optimizing and fitting classifier DecisionTreeClassifier...
... completed. Elapsed time: 15.257 seconds
Optimizing and fitting classifier RandomForestClassifier...
... completed. Elapsed time: 3426.392 seconds
Optimizing and fitting classifier AdaBoostClassifier...
... completed. Elapsed time: 21.557 seconds
Optimizing and fitting classifier GradientBoostingClassifier...
... completed. Elapsed time: 7475.
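The elapsed times above depend largely on how many models an exhaustive grid search has to fit. The following sketch counts the candidates for the three grids that are fully visible in the `OFF_THE_SHELF_CLASSIFIERS` output; `count_candidates` is a hypothetical helper, and the fold count is the `cv=4` value passed to `GridSearchCV` in the cell above.

```python
from itertools import product

# Parameter grids copied from the OFF_THE_SHELF_CLASSIFIERS output above.
param_grids = {
    "Perceptron": {
        "penalty": ["l1", "l2", "elasticnet"],
        "alpha": [0.001, 0.0001, 1e-05],
        "l1_ratio": [None, 0.075, 0.15, 0.3],
        "max_iter": [500, 1000, 2000],
        "random_state": [11],
    },
    "LogisticRegression": {
        "penalty": ["l1", "l2", "elasticnet"],
        "C": [0.5, 1.0, 2.0],
        "l1_ratio": [None, 0.075, 0.15, 0.3],
        "solver": ["saga"],
        "max_iter": [50, 100, 200],
        "random_state": [11],
    },
    "PassiveAggressiveClassifier": {
        "C": [0.5, 1.0, 2.0],
        "max_iter": [500, 1000, 2000],
        "loss": ["hinge", "squared_hinge"],
        "random_state": [11],
    },
}

N_FOLDS = 4  # value of `cv` used in the optimization cell


def count_candidates(param_grid):
    """Number of hyperparameter combinations in an exhaustive grid."""
    return sum(1 for _ in product(*param_grid.values()))


for name, grid in param_grids.items():
    n_candidates = count_candidates(grid)
    # One fit per candidate and fold, plus one final refit (refit=True).
    n_fits = n_candidates * N_FOLDS + 1
    print("%s: %d candidates, %d fits" % (name, n_candidates, n_fits))
```

The same counting applied to the larger grids of the ensemble models helps explain why `RandomForestClassifier` and `GradientBoostingClassifier` take so much longer than the linear models.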

In [126]:
fitted_classifiers_opt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf',
                            Perceptron(alpha=0.001, l1_ratio=0.075, max_iter=500, penalty='l1',
                                       random_state=11)),
                           ('OptTrain_time', 3.007888078689575)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf',
                            LogisticRegression(l1_ratio=0.3, penalty='elasticnet', random_state=11,
                                               solver='saga')),
                           ('OptTrain_time', 35.487173318862915)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf',
                            PassiveAggressiveClassifier(C=0.5, loss='squared_hinge', max_iter=500,
                                                        random_state=11)),
                           ('OptTrain_time', 0.48962879180908203)])),
             ('SVC',
              OrderedDict([('Fitted_clf',
   

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each of the optimized and fitted classifiers:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient
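A note on the LogLoss entry: in the metric cells of this notebook, `log_loss` receives the hard 0/1 labels returned by `.predict()` rather than probabilities. With hard labels every wrong prediction pays the same clipped penalty, so the value reduces to the error rate times a constant. The sketch below reproduces the recorded benchmark value for `Perceptron` on the training set from its recorded accuracy. It assumes that predictions are clipped to float64 machine epsilon, which is an assumption about the installed scikit-learn version that happens to agree with the recorded number.

```python
import math
import sys

# Recorded benchmark values for Perceptron on the training set (output above).
recorded_accuracy = 0.5915178571428571
recorded_log_loss = 14.723188772786692

# Assumption: probabilities are clipped to [eps, 1 - eps], eps = float64 epsilon.
eps = sys.float_info.epsilon


def log_loss_from_hard_labels(accuracy, eps):
    """Binary log loss when every prediction is exactly 0 or 1, then clipped."""
    error_rate = 1.0 - accuracy
    penalty_wrong = -math.log(eps)        # cost of a confident wrong answer
    penalty_right = -math.log(1.0 - eps)  # practically zero
    return error_rate * penalty_wrong + accuracy * penalty_right


reconstructed = log_loss_from_hard_labels(recorded_accuracy, eps)
print("recorded: %.6f, reconstructed: %.6f" % (recorded_log_loss, reconstructed))
```

Under this reading, the LogLoss column carries the same information as accuracy. A more informative value would require the output of `predict_proba`, for the classifiers that provide it.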

### Performance measurements

In [127]:
fitted_classifiers_opt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf',
                            Perceptron(alpha=0.001, l1_ratio=0.075, max_iter=500, penalty='l1',
                                       random_state=11)),
                           ('OptTrain_time', 3.007888078689575)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf',
                            LogisticRegression(l1_ratio=0.3, penalty='elasticnet', random_state=11,
                                               solver='saga')),
                           ('OptTrain_time', 35.487173318862915)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf',
                            PassiveAggressiveClassifier(C=0.5, loss='squared_hinge', max_iter=500,
                                                        random_state=11)),
                           ('OptTrain_time', 0.48962879180908203)])),
             ('SVC',
              OrderedDict([('Fitted_clf',
   

Let's see the confusion matrices and the classification reports for all the classifiers:

#### Confusion matrices

In [128]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_opt.keys()):
    print("\n\nPrinting confusion matrices for classifier %s..." %classifier)
    print("Training set:")
    y_train_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(confusion_matrix(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("Validation set:")
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(confusion_matrix(y_true=y_val_true, y_pred=y_val_pred))



Printing confusion matrices for classifier Perceptron...
Training set:
[[456 216]
 [319 353]]
Validation set:
[[146  78]
 [ 18   8]]


Printing confusion matrices for classifier LogisticRegression...
Training set:
[[445 227]
 [220 452]]
Validation set:
[[128  96]
 [ 18   8]]


Printing confusion matrices for classifier PassiveAggressiveClassifier...
Training set:
[[375 297]
 [178 494]]
Validation set:
[[112 112]
 [ 13  13]]


Printing confusion matrices for classifier SVC...
Training set:
[[653  19]
 [  4 668]]
Validation set:
[[199  25]
 [ 20   6]]


Printing confusion matrices for classifier KNeighborsClassifier...
Training set:
[[672   0]
 [  0 672]]
Validation set:
[[194  30]
 [ 22   4]]


Printing confusion matrices for classifier GaussianProcessClassifier...
Training set:
[[656  16]
 [  9 663]]
Validation set:
[[198  26]
 [ 22   4]]


Printing confusion matrices for classifier DecisionTreeClassifier...
Training set:
[[639  33]
 [ 38 634]]
Validation set:
[[173  51]
 [ 19   7]]


**OBSERVATION:** Again, the main problem here is that most classifiers overfit to the training set.
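To connect the confusion matrices with the `precision` values used in the rest of the notebook, the sketch below recomputes precision and recall from the validation matrix printed above for the optimized `Perceptron`. It assumes the scikit-learn layout (rows are true classes, columns are predicted classes).

```python
# Validation confusion matrix printed above for the optimized Perceptron.
# Layout assumed: rows = true class (0, 1), columns = predicted class (0, 1).
confusion = [[146, 78],
             [18, 8]]

tn, fp = confusion[0]
fn, tp = confusion[1]

precision = tp / (tp + fp)  # fraction of predicted positives that are correct
recall = tp / (tp + fn)     # fraction of true positives that are recovered
prevalence = (tp + fn) / (tn + fp + fn + tp)  # share of positives in the set

print("precision = %.3f, recall = %.3f, prevalence = %.3f"
      % (precision, recall, prevalence))
```

A classifier that labelled stars at random would be expected to reach a precision close to the prevalence, so validation precisions around 0.10 are close to chance level for this validation set.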

#### Classification reports

In [129]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_opt.keys()):
    print("Printing classification reports for classifier %s..." %classifier)
    print("\tTraining set:")
    y_train_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(classification_report(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("\tValidation set:")
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(classification_report(y_true=y_val_true, y_pred=y_val_pred))


Printing classification reports for classifier Perceptron...
	Training set:
              precision    recall  f1-score   support

           0       0.59      0.68      0.63       672
           1       0.62      0.53      0.57       672

    accuracy                           0.60      1344
   macro avg       0.60      0.60      0.60      1344
weighted avg       0.60      0.60      0.60      1344

	Validation set:
              precision    recall  f1-score   support

           0       0.89      0.65      0.75       224
           1       0.09      0.31      0.14        26

    accuracy                           0.62       250
   macro avg       0.49      0.48      0.45       250
weighted avg       0.81      0.62      0.69       250

Printing classification reports for classifier LogisticRegression...
	Training set:
              precision    recall  f1-score   support

           0       0.67      0.66      0.67       672
           1       0.67      0.67      0.67       672

    a

#### Main metrics

In [130]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_opt.keys()):
    print("Calculating predictions and performance for classifier %s..." %classifier)
    # Training set:
    y_train_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    fitted_classifiers_opt[classifier]['Training metrics'] = OrderedDict({})
    fitted_classifiers_opt[classifier]['Training metrics']['accuracy'] = \
        accuracy_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['precision'] = \
        precision_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['recall'] = \
        recall_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['F1'] = \
        f1_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['log_loss'] = \
        log_loss(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_train_true, y_pred=y_train_pred)
    # Validation set:
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    fitted_classifiers_opt[classifier]['Validation metrics'] = OrderedDict({})
    fitted_classifiers_opt[classifier]['Validation metrics']['accuracy'] = \
        accuracy_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['precision'] = \
        precision_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['recall'] = \
        recall_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['F1'] = \
        f1_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['log_loss'] = \
        log_loss(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_val_true, y_pred=y_val_pred)


Calculating predictions and performance for classifier Perceptron...
Calculating predictions and performance for classifier LogisticRegression...
Calculating predictions and performance for classifier PassiveAggressiveClassifier...
Calculating predictions and performance for classifier SVC...
Calculating predictions and performance for classifier KNeighborsClassifier...
Calculating predictions and performance for classifier GaussianProcessClassifier...
Calculating predictions and performance for classifier DecisionTreeClassifier...
Calculating predictions and performance for classifier RandomForestClassifier...
Calculating predictions and performance for classifier AdaBoostClassifier...
Calculating predictions and performance for classifier GradientBoostingClassifier...


In [131]:
fitted_classifiers_opt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf',
                            Perceptron(alpha=0.001, l1_ratio=0.075, max_iter=500, penalty='l1',
                                       random_state=11)),
                           ('OptTrain_time', 3.007888078689575),
                           ('Training metrics',
                            OrderedDict([('accuracy', 0.6019345238095238),
                                         ('precision', 0.6203866432337434),
                                         ('recall', 0.5252976190476191),
                                         ('F1', 0.5688960515713134),
                                         ('log_loss', 14.34773404998339),
                                         ('MCC', 0.2063068189451933)])),
                           ('Validation metrics',
                            OrderedDict([('accuracy', 0.616),
                                         ('precision', 0.09302325581395349),
                                  

**OBSERVATION:** <font color='red'>**ONCE AGAIN, VERY BAD RESULTS EVEN WITH OPTIMIZATION**</font>

The problem here could be that all classifiers are suffering from overfitting and / or from the class imbalance that is still present in the validation set (26 pulsating stars out of 250).

#### Focus on `precision`

We now focus on `precision`, as it is the metric we are most interested in.

In [132]:
precision_results['BMOPT_tr_precision'] = np.nan
precision_results['BMOPT_val_precision'] = np.nan
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision,BMOPT_tr_precision,BMOPT_val_precision
Perceptron,0.63113,0.102564,,
LogisticRegression,0.662739,0.087379,,
PassiveAggressiveClassifier,0.650826,0.102273,,
SVC,0.934473,0.139535,,
KNeighborsClassifier,0.765376,0.113208,,
GaussianProcessClassifier,1.0,0.108108,,
DecisionTreeClassifier,1.0,0.075758,,
RandomForestClassifier,1.0,0.2,,
AdaBoostClassifier,0.87037,0.132075,,
GradientBoostingClassifier,0.969298,0.095238,,


In [133]:
for clf in fitted_classifiers_opt.keys():
    precision_results.loc[clf, 'BMOPT_tr_precision'] = \
        fitted_classifiers_opt[clf]['Training metrics']['precision']
    precision_results.loc[clf, 'BMOPT_val_precision'] = \
        fitted_classifiers_opt[clf]['Validation metrics']['precision']
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision,BMOPT_tr_precision,BMOPT_val_precision
Perceptron,0.63113,0.102564,0.620387,0.093023
LogisticRegression,0.662739,0.087379,0.665685,0.076923
PassiveAggressiveClassifier,0.650826,0.102273,0.624526,0.104
SVC,0.934473,0.139535,0.972344,0.193548
KNeighborsClassifier,0.765376,0.113208,1.0,0.117647
GaussianProcessClassifier,1.0,0.108108,0.976436,0.133333
DecisionTreeClassifier,1.0,0.075758,0.950525,0.12069
RandomForestClassifier,1.0,0.2,0.98941,0.055556
AdaBoostClassifier,0.87037,0.132075,0.919355,0.14
GradientBoostingClassifier,0.969298,0.095238,0.841954,0.086957


In [134]:
print("TRAINING / VALIDATION PRECISION RESULTS, SLIGHTLY OPTIMIZED OFF-THE-SHELF CLASSIFIERS")
for idx in (precision_results.index):
    print("%s: %.2f / %.2f" %(idx,
                              precision_results.loc[idx, 'BMOPT_tr_precision'],
                              precision_results.loc[idx, 'BMOPT_val_precision']))

TRAINING / VALIDATION PRECISION RESULTS, SLIGHTLY OPTIMIZED OFF-THE-SHELF CLASSIFIERS
Perceptron: 0.62 / 0.09
LogisticRegression: 0.67 / 0.08
PassiveAggressiveClassifier: 0.62 / 0.10
SVC: 0.97 / 0.19
KNeighborsClassifier: 1.00 / 0.12
GaussianProcessClassifier: 0.98 / 0.13
DecisionTreeClassifier: 0.95 / 0.12
RandomForestClassifier: 0.99 / 0.06
AdaBoostClassifier: 0.92 / 0.14
GradientBoostingClassifier: 0.84 / 0.09


**OBSERVATION:** so, it is clear that all the algorithms are still performing badly:

- All of them show some kind of overfitting.

**CONCLUSIONS: IMPROVEMENT/WORSENING WITH A SINGLE OPTIMIZATION OF OFF-THE-SHELF CLASSIFIERS:**
- Perceptron: **optimization has worsened the results a little** (validation precision from 0.10 to 0.09). **Overfitting is still a problem**, although, together with the other linear models, it shows one of the smallest train/validation gaps of all the classifiers.
- LogisticRegression: **no improvement at all with optimization** (validation precision from 0.09 to 0.08, training precision unchanged at about 0.67). Poor results in both training and validation.
- PassiveAggressiveClassifier: **practically unchanged with optimization** (validation precision stays at about 0.10, training precision from 0.65 to 0.62). Performance in line with that of `Perceptron`.
- SVC: **the largest improvement with optimization** (validation precision from 0.14 to 0.19, the best value among the optimized classifiers), but **it still falls into extreme overfitting** (training precision 0.97).
- KNeighborsClassifier: **optimization barely changes the validation precision** (from 0.11 to 0.12), while training precision becomes perfect. **Extreme overfitting problem**.
- GaussianProcessClassifier: **slight improvement with optimization** (validation precision from 0.11 to 0.13), with training precision still close to perfect (0.98). **Extreme overfitting problem**.
- DecisionTreeClassifier: **with optimization, it seems to overfit slightly less** (training precision from 1.00 to 0.95, validation precision from 0.08 to 0.12), but it is **still suffering a lot from overfitting**.
- RandomForestClassifier: **clearly worse with optimization**: validation precision drops from 0.20 (the best baseline value) to 0.06, while training precision stays at about 0.99. **Extreme overfitting problem**.
- AdaBoostClassifier: **marginal improvement with optimization** (validation precision from 0.13 to 0.14), but training precision also rises (from 0.87 to 0.92), so **the overfitting has not been reduced**.
- GradientBoostingClassifier: **optimization has slightly worsened the validation results** (from 0.10 to 0.09), although training precision drops from 0.97 to 0.84, so the train/validation gap is somewhat smaller.
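The per-classifier effect of the optimization can be read directly off the precision table above. As a minimal sketch, the code below hard-codes the baseline and optimized validation precision from that table and sorts the classifiers by their change.

```python
# (baseline, optimized) validation precision, copied from the table above.
val_precision = {
    "Perceptron": (0.102564, 0.093023),
    "LogisticRegression": (0.087379, 0.076923),
    "PassiveAggressiveClassifier": (0.102273, 0.104),
    "SVC": (0.139535, 0.193548),
    "KNeighborsClassifier": (0.113208, 0.117647),
    "GaussianProcessClassifier": (0.108108, 0.133333),
    "DecisionTreeClassifier": (0.075758, 0.12069),
    "RandomForestClassifier": (0.2, 0.055556),
    "AdaBoostClassifier": (0.132075, 0.14),
    "GradientBoostingClassifier": (0.095238, 0.086957),
}

# Positive values mean that optimization improved validation precision.
changes = {name: opt - base for name, (base, opt) in val_precision.items()}

for name, delta in sorted(changes.items(), key=lambda item: item[1], reverse=True):
    print("%-28s %+.3f" % (name, delta))

improved = [name for name, delta in changes.items() if delta > 0]
print("%d of %d classifiers improved" % (len(improved), len(changes)))
```

Since the validation set contains only 26 positive cases, small changes of this kind correspond to a handful of predictions and should be interpreted with caution.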

### Save the models

In [135]:
# Use a context manager so that the file is always closed after writing:
with open(MODELS_FOLDER + OPT_CLF_OUT, 'wb') as f:
    pickle.dump(fitted_classifiers_opt, f)
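For later analysis, the saved dictionary can be loaded back with `pickle.load`. The sketch below shows the round trip on a small stand-in dictionary. The content and the temporary path are hypothetical and used only for illustration; the notebook builds its real path from `MODELS_FOLDER`.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

# Stand-in for the real results dictionary (hypothetical content).
results = OrderedDict({"Perceptron": OrderedDict({"OptTrain_time": 3.0})})

# Hypothetical output path inside a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "classifiers.pickle")

# The context manager closes the file even if pickling fails.
with open(path, "wb") as f:
    pickle.dump(results, f)

# Load the object back and check that nothing was lost.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == results)
```

Pickled scikit-learn estimators should be loaded with the same scikit-learn version that created them, since compatibility across versions is not guaranteed.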

## Save the results

In [136]:
precision_results = precision_results.reset_index(drop=False).rename(columns={'index': 'Classifier'})
precision_results

Unnamed: 0,Classifier,BM_tr_precision,BM_val_precision,BMOPT_tr_precision,BMOPT_val_precision
0,Perceptron,0.63113,0.102564,0.620387,0.093023
1,LogisticRegression,0.662739,0.087379,0.665685,0.076923
2,PassiveAggressiveClassifier,0.650826,0.102273,0.624526,0.104
3,SVC,0.934473,0.139535,0.972344,0.193548
4,KNeighborsClassifier,0.765376,0.113208,1.0,0.117647
5,GaussianProcessClassifier,1.0,0.108108,0.976436,0.133333
6,DecisionTreeClassifier,1.0,0.075758,0.950525,0.12069
7,RandomForestClassifier,1.0,0.2,0.98941,0.055556
8,AdaBoostClassifier,0.87037,0.132075,0.919355,0.14
9,GradientBoostingClassifier,0.969298,0.095238,0.841954,0.086957


In [137]:
precision_results.to_csv(MODELS_FOLDER + PRECISION_RESULTS_OUT, sep=',', decimal='.', index=False)

## Predictions on the validation dataset

### Predictions

We now save the predictions on the validation dataset, along with all available metadata, so that they can be analysed later on.

In [138]:
s4_val_w_pred = s4_val.copy()
s4_val_w_pred

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,0,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,0,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,0,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,Star-00120,0,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,0,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,0,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,Star-00943,0,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,Star-00721,0,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,Star-00926,0,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


In [139]:
warnings.simplefilter('ignore')
for classifier in list(fitted_classifiers_opt.keys()):
    print("Calculating predictions for classifier %s..." %classifier)
    # Validation set:
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val[rel_features])
    s4_val_w_pred['Prediction_' + classifier] = y_val_pred

Calculating predictions for classifier Perceptron...
Calculating predictions for classifier LogisticRegression...
Calculating predictions for classifier PassiveAggressiveClassifier...
Calculating predictions for classifier SVC...
Calculating predictions for classifier KNeighborsClassifier...
Calculating predictions for classifier GaussianProcessClassifier...
Calculating predictions for classifier DecisionTreeClassifier...
Calculating predictions for classifier RandomForestClassifier...
Calculating predictions for classifier AdaBoostClassifier...
Calculating predictions for classifier GradientBoostingClassifier...


In [140]:
s4_val_w_pred.head(20)

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,Prediction_Perceptron,Prediction_LogisticRegression,Prediction_PassiveAggressiveClassifier,Prediction_SVC,Prediction_KNeighborsClassifier,Prediction_GaussianProcessClassifier,Prediction_DecisionTreeClassifier,Prediction_RandomForestClassifier,Prediction_AdaBoostClassifier,Prediction_GradientBoostingClassifier
0,Star-00107,0,0.0,0.0,0.0,2457430.0,0.0,-0.99166,0.031948,0.542146,...,0,0,0,0,0,0,0,0,0,0
1,Star-00868,0,0.0,0.0,0.0,2457432.0,0.0,-1.309194,-1.081711,1.825039,...,1,0,1,0,0,0,0,0,0,0
2,Star-00106,0,0.0,0.0,0.0,2457404.0,0.0,-0.356591,0.379966,0.844003,...,0,0,0,0,0,0,0,0,1,1
3,Star-00120,0,0.0,0.0,0.0,2457395.0,0.0,-0.039057,0.519174,0.994931,...,1,0,0,0,0,0,1,1,0,1
4,Star-00559,0,0.0,0.0,0.0,2457441.0,0.0,0.596012,-0.664089,-0.212498,...,0,0,1,0,0,0,0,0,0,0
5,Star-00205,1,71.94,0.53,0.0,2457510.0,0.05,-0.039057,1.493625,0.391217,...,0,0,0,0,0,0,1,0,0,0
6,Star-00061,0,0.0,0.0,0.0,2457432.0,0.0,-0.356591,1.493625,1.221324,...,1,1,1,0,0,0,0,0,0,0
7,Star-00258,0,0.0,0.0,0.0,2457401.0,0.0,0.278478,-0.176863,-0.665284,...,0,1,1,0,0,0,1,1,1,1
8,Star-00124,0,0.0,0.0,0.0,2457401.0,0.0,-1.944263,1.0064,2.051432,...,1,0,0,0,0,0,1,0,0,0
9,Star-00775,0,0.0,0.0,0.0,2457404.0,0.0,0.596012,-0.664089,-0.212498,...,0,0,1,0,0,0,0,0,0,0


### Prediction probabilities (if available)

In [141]:
warnings.simplefilter('ignore')
for classifier in list(fitted_classifiers_opt.keys()):
    print("Calculating predictions for classifier %s..." %classifier)
    # Validation set:
    try:
        y_val_pred_proba = fitted_classifiers_opt[classifier]['Fitted_clf'].predict_proba(s4_val[rel_features])
        # Assign the array directly, so that values are matched by position and not by index label:
        s4_val_w_pred['PredictionProb_' + classifier] = y_val_pred_proba[:, 1]
        print("... ok, probabilities calculated")
    except Exception:
        # E.g. classifiers that do not provide a 'predict_proba' method.
        print("**WARNING: 'predict_proba' method failed for classifier '%s'." %classifier)
        s4_val_w_pred['PredictionProb_' + classifier] = np.nan

Calculating predictions for classifier Perceptron...
Calculating predictions for classifier LogisticRegression...
... ok, probabilities calculated
Calculating predictions for classifier PassiveAggressiveClassifier...
Calculating predictions for classifier SVC...
Calculating predictions for classifier KNeighborsClassifier...
... ok, probabilities calculated
Calculating predictions for classifier GaussianProcessClassifier...
... ok, probabilities calculated
Calculating predictions for classifier DecisionTreeClassifier...
... ok, probabilities calculated
Calculating predictions for classifier RandomForestClassifier...
... ok, probabilities calculated
Calculating predictions for classifier AdaBoostClassifier...
... ok, probabilities calculated
Calculating predictions for classifier GradientBoostingClassifier...
... ok, probabilities calculated
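The classifiers without probabilities in the output above are those for which `predict_proba` is not available: `Perceptron` and `PassiveAggressiveClassifier` do not implement it, and `SVC` only exposes it when created with `probability=True`. These estimators do provide `decision_function`, which returns a signed margin rather than a probability. The sketch below shows one possible fallback. The two stub classes and the `positive_class_score` helper are hypothetical stand-ins written for this illustration, not part of the notebook.

```python
class ProbabilisticStub:
    """Hypothetical stand-in for a classifier that implements predict_proba."""

    def predict_proba(self, X):
        return [[0.7, 0.3] for _ in X]


class MarginStub:
    """Hypothetical stand-in for a classifier with only decision_function."""

    def decision_function(self, X):
        return [-1.5 for _ in X]


def positive_class_score(clf, X):
    """Return (kind, scores): class-1 probabilities if available, else margins."""
    if hasattr(clf, "predict_proba"):
        return "probability", [row[1] for row in clf.predict_proba(X)]
    if hasattr(clf, "decision_function"):
        return "margin", list(clf.decision_function(X))
    return "none", None


X = [[0.0], [1.0]]
print(positive_class_score(ProbabilisticStub(), X))
print(positive_class_score(MarginStub(), X))
```

Margins are useful for ranking candidates, but they are not calibrated probabilities, so they should be stored in a separate column rather than mixed with the `PredictionProb_` values.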


In [142]:
s4_val_w_pred.head(20)

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,PredictionProb_Perceptron,PredictionProb_LogisticRegression,PredictionProb_PassiveAggressiveClassifier,PredictionProb_SVC,PredictionProb_KNeighborsClassifier,PredictionProb_GaussianProcessClassifier,PredictionProb_DecisionTreeClassifier,PredictionProb_RandomForestClassifier,PredictionProb_AdaBoostClassifier,PredictionProb_GradientBoostingClassifier
0,Star-00107,0,0.0,0.0,0.0,2457430.0,0.0,-0.99166,0.031948,0.542146,...,,0.144057,,,0.0,0.296544,0.0,0.260963,0.392237,0.451709
1,Star-00868,0,0.0,0.0,0.0,2457432.0,0.0,-1.309194,-1.081711,1.825039,...,,0.195229,,,0.0,0.279777,0.0,0.096288,0.473364,0.286868
2,Star-00106,0,0.0,0.0,0.0,2457404.0,0.0,-0.356591,0.379966,0.844003,...,,0.452925,,,0.0,0.301869,0.25,0.315447,0.502762,0.531849
3,Star-00120,0,0.0,0.0,0.0,2457395.0,0.0,-0.039057,0.519174,0.994931,...,,0.384932,,,0.0,0.311255,0.943038,0.514411,0.499219,0.622738
4,Star-00559,0,0.0,0.0,0.0,2457441.0,0.0,0.596012,-0.664089,-0.212498,...,,0.401998,,,0.0,0.358663,0.0,0.20975,0.490705,0.315834
5,Star-00205,1,71.94,0.53,0.0,2457510.0,0.05,-0.039057,1.493625,0.391217,...,,0.481101,,,0.0,0.393601,1.0,0.454882,0.499896,0.446521
6,Star-00061,0,0.0,0.0,0.0,2457432.0,0.0,-0.356591,1.493625,1.221324,...,,0.532464,,,0.0,0.388108,0.071429,0.237232,0.413694,0.44665
7,Star-00258,0,0.0,0.0,0.0,2457401.0,0.0,0.278478,-0.176863,-0.665284,...,,0.673195,,,0.0,0.402867,0.943038,0.500021,0.501755,0.643477
8,Star-00124,0,0.0,0.0,0.0,2457401.0,0.0,-1.944263,1.0064,2.051432,...,,0.312567,,,0.0,0.397218,0.6,0.212781,0.394331,0.36752
9,Star-00775,0,0.0,0.0,0.0,2457404.0,0.0,0.596012,-0.664089,-0.212498,...,,0.357148,,,0.0,0.395846,0.130435,0.25975,0.491879,0.364385


### Save the predictions

And we now save the file:

In [143]:
s4_val_w_pred.to_csv(MODELS_FOLDER + VAL_PREDICTIONS_OUT, sep=',', decimal='.', index=False)

## Summary

**RESULTS:**

- We tested different classifiers from different families against the S4 sample oversampled with SMOTE.
- In general, results show only a slight improvement with a very simple and naive hyperparameter optimization: validation precision improves for six of the ten classifiers and worsens for the other four.
- Overfitting seems to be a serious problem, especially with the tree / ensemble methods.
- We have stored both the precision of the different ML models and their predictions on the validation set (and prediction probabilities when available).

**CONCLUSIONS:**

- Additional work on tree and ensemble classifiers is needed to prevent overfitting: pruning the trees, for example (although some values of the `ccp_alpha` parameter were already tried as part of the optimization).
