#  MODEL PRESELECTION - WITH ALL FEATURES

In this Notebook we take several _off-the-shelf_ Machine Learning models (classifiers) and apply them to the **oversampled** S4 sample.

We make 5-fold crossvalidation over the training/test set, and measure the performance of the trained model over the validation set. These two sets were previously splitted from S4 sample, stratified by the target variable, and saved to files.

The score metric for cross validation will be `precision`.

We will test the following ML classification models. For each model, we do a basic `GridSearchCV` with the more relevant hyperparameters for each model, generally taking values equal to default value, and one-half (or one tenth) and twice (or ten fold) that default value.

- Linear models:
  - `Perceptron`.
  - `LogisticRegressor`.
  - `PassiveAggressiveClassifier`
- Support Vector Machines:
  - `SVC`
- Nearest-Neighbours models:
  - `KNeighborsClassifier`
- Gaussian Processes models:
  - `GaussianProcessClassifier`
- Tree models:
  - `DecisionTreeClassifier`
- Ensemble models:
  - `RandomForestClassifier`
  - `AdaBoostClassifier` (a particular case of the next type of model)
  - `GradientBoostingClassifier`

## Modules and configuration

### Modules

In [2]:
import pandas as pd
import numpy as np

import copy

from time import time

from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, \
    f1_score, log_loss, matthews_corrcoef, classification_report, \
    get_scorer_names, confusion_matrix

from collections import OrderedDict

import warnings

#from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif

from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, DotProduct

from sklearn.model_selection import GridSearchCV # Se usa para evitar overfitting
#from sklearn.model_selection import cross_validate #### NOTA, IGUAL ES MEJOR ESTE, PARA TENER EL CONTROL Y SACAR
# TODOS LOS RUNS QUE QUERAMOS
###### NO HACE FALTA EL GRID SEARCH NI NADA DE ESTO??? SIMPLEMENTE LE HACEMOS EL "FIT" Y LUEGO MEDIMOS CON
# EL "PREDICT" SOBRE EL VALIDATION SET

from sklearn.linear_model import Perceptron, LogisticRegression, PassiveAggressiveClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, \
    GradientBoostingClassifier, HistGradientBoostingClassifier

import pickle

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white", {'figure.figsize':(15,10)})

from IPython.display import display

# from imblearn import 

### Configuration

In [3]:
RANDOM_STATE = 11 # For reproducibility

S4_TRAIN_SET_IN = "../data/DATASETS_ML/S4_02_DS_AfterImputing_TrainTest_OVERSAMPLED.csv"
# Train/test set for S4 sample, all 112 features
S4_VALIDATION_SET_IN = "../data/DATASETS_ML/S4_02_DS_AfterImputing_Validation.csv"
# Validation set for S4 sample, all 112 features

REL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Reliable_features.pickle"
UNREL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Unreliable_features.pickle"

ML_ADD_COLUMNS = ['Karmn'] # Only cesium features and this column will be kept.
S4_ADD_COLUMNS = ['ID', 'Pulsating', 'frequency', 'amplitudeRV',
                  'offsetRV', 'refepochRV', 'phase'] # Only cesium features and these columns will be kept.

MODELS_FOLDER = "../data/ML_MODELS/ML_model_preselection/"

PRECISION_RESULTS_OUT = "ModelPreselection_AllF_PrecisionResults_OversampledSMOTE.csv"
VAL_PREDICTIONS_OUT = "ModelPreselection_AllF_ValidationPredictions_OversampledSMOTE.csv"

# Note: it would be better to use a jason file for this configuration.
OFF_THE_SHELF_CLASSIFIERS = OrderedDict({
    'Perceptron': {
        'clf': Perceptron(),
        'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                       'alpha': [0.001, 0.0001, 0.00001],
                       'l1_ratio': [None, 0.075, 0.15, 0.30],
                       'max_iter': [500, 1000, 2000],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'LogisticRegression': {
        'clf': LogisticRegression(),
        'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                       'C': [0.5, 1.0, 2.0],
                       'l1_ratio': [None, 0.075, 0.15, 0.30],
                       'solver': ['saga'],
                       'max_iter': [50, 100, 200],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'PassiveAggressiveClassifier': {
        'clf': PassiveAggressiveClassifier(),
        'param_grid': {'C': [0.5, 1.0, 2.0],
                       'max_iter': [500, 1000, 2000],
                       'loss': ['hinge', 'squared_hinge'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'SVC': {
        'clf': SVC(),
        'param_grid': {'C': [0.5, 1.0, 2.0],
                       'kernel': ['linear', 'poly', 'rbf', 'sigmoid', 'recomputed'],
                       'degree': [2, 3, 6],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'KNeighborsClassifier': {
        'clf': KNeighborsClassifier(),
        'param_grid': {'n_neighbors': [1, 3, 5, 10],
                       'weights': ['uniform', 'distance'],
                       'algorithm': ['ball_tree', 'kd_tree', 'brute'],
                       'p': [1, 2]
                      }
    },
    'GaussianProcessClassifier': {
        'clf': GaussianProcessClassifier(),
        'param_grid': {'kernel': [RBF(), RationalQuadratic(), DotProduct()],
                       'max_iter_predict': [50, 100, 200],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'DecisionTreeClassifier': {
        'clf': DecisionTreeClassifier(),
        'param_grid': {'criterion': ['gini', 'entropy', 'log_loss'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'max_features': [None, 'sqrt', 'log2'],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'RandomForestClassifier': {
        'clf': RandomForestClassifier(),
        'param_grid': {'n_estimators': [50, 100, 200],
                       'criterion': ['gini', 'entropy', 'log_loss'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'max_features': [None, 'sqrt', 'log2'],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'AdaBoostClassifier': {
        'clf': AdaBoostClassifier(),
        'param_grid': {'n_estimators': [25, 50, 100],
                       'learning_rate': [0.5, 1.0, 2.0],
                       'algorithm': ['SAMME', 'SAMME.R'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'GradientBoostingClassifier': {
        'clf': GradientBoostingClassifier(),
        'param_grid': {'loss': ['log_loss', 'deviance'],
                       'learning_rate': [0.05, 0.1, 0.2],
                       'n_estimators': [25, 50, 100],
                       'criterion': ['friedman_mse', 'squared_error'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    }
})

IMAGE_FOLDER = './img/'

### Functions

## Load data

We load the data, which are the time series as previously featurized by _cesium_, scaled, and with `NaN` values imputed by a `KNNImputer`.

### Load reliable features list

In [4]:
rel_features = pickle.load(open(REL_FEATURES_IN, 'rb'))
print(rel_features)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_rel_phase3', '

### Load unreliable features

In [6]:
unrel_features = pickle.load(open(UNREL_FEATURES_IN, 'rb'))
print(unrel_features)

['avg_err', 'avgt', 'mean', 'med_err', 'std_err', 'amplitude', 'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50', 'flux_percentile_ratio_mid65', 'flux_percentile_ratio_mid80', 'max_slope', 'maximum', 'median', 'median_absolute_deviation', 'minimum', 'percent_amplitude', 'percent_close_to_median', 'percent_difference_flux_percentile', 'period_fast', 'qso_log_chi2_qsonu', 'qso_log_chi2nuNULL_chi2nu', 'skew', 'std', 'stetson_j', 'stetson_k', 'weighted_average', 'fold2P_slope_10percentile', 'fold2P_slope_90percentile', 'freq1_amplitude1', 'freq1_amplitude2', 'freq1_amplitude3', 'freq1_amplitude4', 'freq1_freq', 'freq1_lambda', 'freq1_signif', 'freq2_amplitude1', 'freq2_amplitude2', 'freq2_amplitude3', 'freq2_amplitude4', 'freq2_freq', 'freq3_amplitude1', 'freq3_amplitude2', 'freq3_amplitude3', 'freq3_amplitude4', 'freq3_freq', 'freq_amplitude_ratio_21', 'freq_amplitude_ratio_31', 'freq_frequency_ratio_21', 'freq_frequency_ratio_31', 'freq_model_max

In [12]:
all_features = rel_features + unrel_features

###  Load the oversampled S4 sample data train set

In [16]:
s4_train_all = pd.read_csv(S4_TRAIN_SET_IN, sep=',', decimal='.')
s4_train_all

Unnamed: 0,Pulsating,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq1_rel_phase2,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2
0,0,-0.674126,0.519174,0.466681,0.766297,1.786498,-0.304944,0.843252,0.189055,1.390901,...,0.646033,0.908818,1.305379,1.413989,0.174334,-0.188773,0.985693,-0.258841,-1.099919,-0.461571
1,1,-1.626729,1.911247,-0.740748,0.691384,0.168331,1.522002,1.166420,0.157675,0.019744,...,1.532902,-1.224350,0.710232,-1.272791,1.617586,1.392776,0.260283,0.708876,1.030413,0.400968
2,0,-0.039057,-1.012107,0.013895,-0.357397,1.168762,-0.232282,-0.443941,-0.136007,-0.412519,...,0.384058,0.882515,1.044322,-1.204443,-0.593335,-1.011092,0.592500,0.135213,0.725303,-0.319881
3,0,-0.039057,1.632833,-0.514355,0.166993,1.477630,-0.544204,-0.572606,-0.586661,-0.338658,...,1.457441,-0.921750,-1.095322,-0.031960,-0.068737,1.152465,-0.672518,0.391616,-1.301501,0.559262
4,0,0.596012,-0.176863,-1.042605,-0.432310,0.242158,-0.277263,-0.498198,-0.370020,-0.451106,...,-0.829296,-0.057582,0.541269,1.529879,-0.689578,1.475290,-1.268004,1.297212,1.327633,-0.059789
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1339,1,0.745746,-0.488099,-0.696575,0.862522,1.004277,-0.625246,-0.653433,-0.759447,-0.397622,...,-1.020565,1.368228,1.114817,-0.872012,1.010688,0.954973,1.612974,1.083675,-0.408500,0.397283
1340,1,-0.674126,-0.524881,0.013895,0.691384,0.396592,-0.466417,1.196717,2.700897,2.213595,...,-0.110442,0.416179,-0.356363,0.705712,0.055003,-0.096945,0.031624,1.218864,-0.592419,-1.376299
1341,1,-0.504019,1.622890,-1.255523,-1.570720,1.550711,0.256561,0.336707,-0.129644,0.098069,...,1.172653,-1.144025,-0.812328,-1.520540,-0.885421,0.021674,-1.172280,-0.572453,0.691047,0.172107
1342,1,1.458660,-0.952166,-0.752930,-0.936325,-0.384989,-0.329583,-0.428169,-0.297986,-0.319821,...,-0.247664,-0.306661,1.359459,-0.337809,-0.776955,-0.513643,-0.629629,1.317652,0.435735,-1.248442


Notice that target variable is already encoded as `0` / `1` and that we now only have the features (i.e. the metadata are no longer present).

###  Load the S4 sample data validation set

In [17]:
s4_val = pd.read_csv(S4_VALIDATION_SET_IN, sep=',', decimal='.')
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,False,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,False,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,False,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,Star-00120,False,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,False,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,False,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,Star-00943,False,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,Star-00721,False,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,Star-00926,False,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


In this case we do need to do the transformations.

#### Encode target variable (`Pulsating`)

We encode the target variable as `True` / `False` = `0` / `1`.

In [18]:
s4_val['Pulsating'] = (s4_val['Pulsating'] == True).astype(int)
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,0,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,0,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,0,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,Star-00120,0,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,0,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,0,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,Star-00943,0,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,Star-00721,0,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,Star-00926,0,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


#### Filter the relevant columns only

We now filter only by the reliable relevant columns plus the `Pulsating` column.

In [19]:
s4_val_all = s4_val[['Pulsating'] + all_features].copy()
s4_val_all

Unnamed: 0,Pulsating,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,0,-0.991660,0.031948,0.542146,1.740165,-0.298361,2.924506,1.112558,1.223343,-0.528719,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,0,-1.309194,-1.081711,1.825039,1.815078,-1.611050,-0.692478,0.439292,1.223343,1.528017,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,0,-0.356591,0.379966,0.844003,0.166993,-1.147748,-0.480752,-0.473181,0.125210,-0.246422,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,0,-0.039057,0.519174,0.994931,1.065949,-1.379399,-0.348004,-0.310918,-0.224660,-0.136960,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,0,0.596012,-0.664089,-0.212498,0.391732,0.087724,1.196750,2.120285,1.085438,0.930900,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,0,0.913546,0.101552,-1.419927,-0.057745,-0.761663,0.185932,-0.248147,-0.282121,-0.456553,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,0,0.278478,-0.733692,1.447717,1.515426,-1.997135,-0.692478,-0.772586,-0.672852,-0.528719,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,0,-0.356591,1.493625,1.221324,-0.507223,1.786498,-0.382451,-0.368627,-0.328089,-0.185930,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,0,0.913546,-1.151314,0.240288,-0.207571,-1.533833,-0.005709,-0.205837,-0.022504,-0.283139,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


In [20]:
print(list(s4_val_all.columns))

['Pulsating', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_r

## Train all the _off-the-shelf_ classifiers

### Train benchmark classifiers

We train the classifiers just off-the shelf, with all the parameters set to default.

In [21]:
warnings.simplefilter('ignore')
fitted_classifiers_benchmark = OrderedDict()
for classifier in list(OFF_THE_SHELF_CLASSIFIERS.keys()):
    print("Fitting off-the-shelf classifier %s..." %classifier)
    clf = copy.deepcopy(OFF_THE_SHELF_CLASSIFIERS[classifier]['clf'])
    #param_grid = OFF_THE_SHELF_CLASSIFIERS[classifier]['param_grid']
    # Optimize with training data:
    cv = GridSearchCV(clf, param_grid={}, scoring='precision', cv=3, refit=True)
    start_time = time()
    cv.fit(s4_train_all[all_features], s4_train_all['Pulsating'])
    end_time = time()
    elapsed_time = end_time - start_time
    print("... completed. Elapsed time: %.3f seconds" %elapsed_time)
    # Add the best fitted classifier to the dictionary:
    fitted_classifiers_benchmark[classifier] = OrderedDict({
        'Fitted_clf': copy.deepcopy(cv.best_estimator_),
        'OptTrain_time': elapsed_time
    })

Fitting off-the-shelf classifier Perceptron...


KeyError: "['avg_err', 'avgt', 'mean', 'med_err', 'std_err', 'amplitude', 'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50', 'flux_percentile_ratio_mid65', 'flux_percentile_ratio_mid80', 'max_slope', 'maximum', 'median', 'median_absolute_deviation', 'minimum', 'percent_amplitude', 'percent_close_to_median', 'percent_difference_flux_percentile', 'period_fast', 'qso_log_chi2_qsonu', 'qso_log_chi2nuNULL_chi2nu', 'skew', 'std', 'stetson_j', 'stetson_k', 'weighted_average', 'fold2P_slope_10percentile', 'fold2P_slope_90percentile', 'freq1_amplitude1', 'freq1_amplitude2', 'freq1_amplitude3', 'freq1_amplitude4', 'freq1_freq', 'freq1_lambda', 'freq1_signif', 'freq2_amplitude1', 'freq2_amplitude2', 'freq2_amplitude3', 'freq2_amplitude4', 'freq2_freq', 'freq3_amplitude1', 'freq3_amplitude2', 'freq3_amplitude3', 'freq3_amplitude4', 'freq3_freq', 'freq_amplitude_ratio_21', 'freq_amplitude_ratio_31', 'freq_frequency_ratio_21', 'freq_frequency_ratio_31', 'freq_model_max_delta_mags', 'freq_model_min_delta_mags', 'freq_n_alias', 'freq_signif_ratio_21', 'freq_signif_ratio_31', 'freq_varrat', 'freq_y_offset', 'linear_trend', 'medperc90_2p_p', 'p2p_scatter_2praw', 'p2p_scatter_over_mad', 'p2p_scatter_pfold_over_mad', 'p2p_ssqr_diff_over_var', 'scatter_res_raw'] not in index"

In [43]:
fitted_classifiers_benchmark

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.021939992904663086)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf', LogisticRegression()),
                           ('OptTrain_time', 0.06881594657897949)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf', PassiveAggressiveClassifier()),
                           ('OptTrain_time', 0.029920578002929688)])),
             ('SVC',
              OrderedDict([('Fitted_clf', SVC()),
                           ('OptTrain_time', 0.2733030319213867)])),
             ('KNeighborsClassifier',
              OrderedDict([('Fitted_clf', KNeighborsClassifier()),
                           ('OptTrain_time', 0.11864829063415527)])),
             ('GaussianProcessClassifier',
              OrderedDict([('Fitted_clf', GaussianProcessClassifier()),
                           ('OptTrain_time', 1.443141

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each of the optimized and fitted classifier:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient

### Performance measurements

In [44]:
fitted_classifiers_benchmark

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.021939992904663086)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf', LogisticRegression()),
                           ('OptTrain_time', 0.06881594657897949)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf', PassiveAggressiveClassifier()),
                           ('OptTrain_time', 0.029920578002929688)])),
             ('SVC',
              OrderedDict([('Fitted_clf', SVC()),
                           ('OptTrain_time', 0.2733030319213867)])),
             ('KNeighborsClassifier',
              OrderedDict([('Fitted_clf', KNeighborsClassifier()),
                           ('OptTrain_time', 0.11864829063415527)])),
             ('GaussianProcessClassifier',
              OrderedDict([('Fitted_clf', GaussianProcessClassifier()),
                           ('OptTrain_time', 1.443141

Let's see the classification reports for all the classifiers:

#### Confusion matrices

In [45]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_benchmark.keys()):
    print("\n\nPrinting confusion matrices for classifier %s..." %classifier)
    print("Training set:")
    y_train_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(confusion_matrix(y_true=y_train_true, y_pred=y_train_pred))
    # Training set:
    print("Validation set:")
    y_val_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(confusion_matrix(y_true=y_val_true, y_pred=y_val_pred))



Printing confusion matrices for classifier Perceptron...
Training set:
[[409 263]
 [311 361]]
Validation set:
[[138  86]
 [ 14  12]]


Printing confusion matrices for classifier LogisticRegression...
Training set:
[[435 237]
 [223 449]]
Validation set:
[[122 102]
 [ 14  12]]


Printing confusion matrices for classifier PassiveAggressiveClassifier...
Training set:
[[354 318]
 [183 489]]
Validation set:
[[102 122]
 [ 10  16]]


Printing confusion matrices for classifier SVC...
Training set:
[[616  56]
 [  6 666]]
Validation set:
[[186  38]
 [ 19   7]]


Printing confusion matrices for classifier KNeighborsClassifier...
Training set:
[[465 207]
 [  0 672]]
Validation set:
[[127  97]
 [ 14  12]]


Printing confusion matrices for classifier GaussianProcessClassifier...
Training set:
[[672   0]
 [  0 672]]
Validation set:
[[155  69]
 [ 17   9]]


Printing confusion matrices for classifier DecisionTreeClassifier...
Training set:
[[672   0]
 [  0 672]]
Validation set:
[[181  43]
 [ 20   6]]


**OBSERVATION:** The main problem here is that most classifiers overfit to the training set. However, it seems that the behavipour over the training set is now better (i.e. the classifiers do not predict a single label).

#### Classification reports

In [46]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_benchmark.keys()):
    print("Printing classification reports for classifier %s..." %classifier)
    print("\tTraining set:")
    y_train_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(classification_report(y_true=y_train_true, y_pred=y_train_pred))
    # Training set:
    print("\tValidation set:")
    y_val_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(classification_report(y_true=y_val_true, y_pred=y_val_pred))


Printing classification reports for classifier Perceptron...
	Training set:
              precision    recall  f1-score   support

           0       0.57      0.61      0.59       672
           1       0.58      0.54      0.56       672

    accuracy                           0.57      1344
   macro avg       0.57      0.57      0.57      1344
weighted avg       0.57      0.57      0.57      1344

	Validation set:
              precision    recall  f1-score   support

           0       0.91      0.62      0.73       224
           1       0.12      0.46      0.19        26

    accuracy                           0.60       250
   macro avg       0.52      0.54      0.46       250
weighted avg       0.83      0.60      0.68       250

Printing classification reports for classifier LogisticRegression...
	Training set:
              precision    recall  f1-score   support

           0       0.66      0.65      0.65       672
           1       0.65      0.67      0.66       672

    a

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each of the optimized and fitted classifier:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient

#### Main metrics

In [47]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_benchmark.keys()):
    print("Calculating predictions and performance for classifier %s..." %classifier)
    # Training set:
    y_train_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    fitted_classifiers_benchmark[classifier]['Training metrics'] = OrderedDict({})
    fitted_classifiers_benchmark[classifier]['Training metrics']['accuracy'] = \
        accuracy_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['precision'] = \
        precision_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['recall'] = \
        recall_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['F1'] = \
        f1_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['log_loss'] = \
        log_loss(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_train_true, y_pred=y_train_pred)
    # Validation set:
    y_val_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    fitted_classifiers_benchmark[classifier]['Validation metrics'] = OrderedDict({})
    fitted_classifiers_benchmark[classifier]['Validation metrics']['accuracy'] = \
        accuracy_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['precision'] = \
        precision_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['recall'] = \
        recall_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['F1'] = \
        f1_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['log_loss'] = \
        log_loss(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_val_true, y_pred=y_val_pred)


Calculating predictions and performance for classifier Perceptron...
Calculating predictions and performance for classifier LogisticRegression...
Calculating predictions and performance for classifier PassiveAggressiveClassifier...
Calculating predictions and performance for classifier SVC...
Calculating predictions and performance for classifier KNeighborsClassifier...
Calculating predictions and performance for classifier GaussianProcessClassifier...
Calculating predictions and performance for classifier DecisionTreeClassifier...
Calculating predictions and performance for classifier RandomForestClassifier...
Calculating predictions and performance for classifier AdaBoostClassifier...
Calculating predictions and performance for classifier GradientBoostingClassifier...


In [48]:
fitted_classifiers_benchmark

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.021939992904663086),
                           ('Training metrics',
                            OrderedDict([('accuracy', 0.5729166666666666),
                                         ('precision', 0.5785256410256411),
                                         ('recall', 0.5372023809523809),
                                         ('F1', 0.5570987654320989),
                                         ('log_loss', 15.39364363493545),
                                         ('MCC', 0.14620678678305088)])),
                           ('Validation metrics',
                            OrderedDict([('accuracy', 0.6),
                                         ('precision', 0.12244897959183673),
                                         ('recall', 0.46153846153846156),
                                         ('F1', 0.1935483870967742),
                            

**OBSERVATION:** <font color='red'>**VERY BAD RESULTS**</font><font color='blue'>**, BUT THE IMBALANCE PROBLEM HAS CLEARLY DISSAPEARED NOW**</font>

We have still the problem of overfitting.

#### Focus on `precision`

We now set the focus on `precision`, as it is the metric we are more interested in.

In [49]:
precision_results = pd.DataFrame(index=fitted_classifiers_benchmark.keys())
precision_results

Perceptron
LogisticRegression
PassiveAggressiveClassifier
SVC
KNeighborsClassifier
GaussianProcessClassifier
DecisionTreeClassifier
RandomForestClassifier
AdaBoostClassifier
GradientBoostingClassifier


In [50]:
precision_results['BM_tr_precision'] = np.nan
precision_results['BM_val_precision'] = np.nan
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision
Perceptron,,
LogisticRegression,,
PassiveAggressiveClassifier,,
SVC,,
KNeighborsClassifier,,
GaussianProcessClassifier,,
DecisionTreeClassifier,,
RandomForestClassifier,,
AdaBoostClassifier,,
GradientBoostingClassifier,,


In [51]:
for clf in fitted_classifiers_benchmark.keys():
    precision_results.loc[clf, 'BM_tr_precision'] = \
        fitted_classifiers_benchmark[clf]['Training metrics']['precision']
    precision_results.loc[clf, 'BM_val_precision'] = \
        fitted_classifiers_benchmark[clf]['Validation metrics']['precision']
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision
Perceptron,0.578526,0.122449
LogisticRegression,0.654519,0.105263
PassiveAggressiveClassifier,0.605948,0.115942
SVC,0.922438,0.155556
KNeighborsClassifier,0.764505,0.110092
GaussianProcessClassifier,1.0,0.115385
DecisionTreeClassifier,1.0,0.122449
RandomForestClassifier,1.0,0.166667
AdaBoostClassifier,0.859216,0.096154
GradientBoostingClassifier,0.968162,0.125


In [52]:
print("TRAINING / VALIDATION PRECISION RESULTS, OFF-THE-SHELF CLASSIFIERS")
for idx in (precision_results.index):
    print("%s: %.2f / %.2f" %(idx,
                              precision_results.loc[idx, 'BM_tr_precision'],
                              precision_results.loc[idx, 'BM_val_precision']))

TRAINING / VALIDATION PRECISION RESULTS, OFF-THE-SHELF CLASSIFIERS
Perceptron: 0.58 / 0.12
LogisticRegression: 0.65 / 0.11
PassiveAggressiveClassifier: 0.61 / 0.12
SVC: 0.92 / 0.16
KNeighborsClassifier: 0.76 / 0.11
GaussianProcessClassifier: 1.00 / 0.12
DecisionTreeClassifier: 1.00 / 0.12
RandomForestClassifier: 1.00 / 0.17
AdaBoostClassifier: 0.86 / 0.10
GradientBoostingClassifier: 0.97 / 0.12


**OBSERVATION:** so, it is clear that all the algorithms are performing pretty well for the trainins set, but:

- All of them show heavy overfitting, and even some of them in an extreme way. Notice that, compared wikth the training with imbalanced data, the classifiers less affected for that are the Tree classifiers (DecisionTree and RandomForest).


### Save the models

In [53]:
pickle.dump(fitted_classifiers_benchmark, open(MODELS_FOLDER + "OVERSAMPLED_fitted_clf_ots.pickle", 'wb'))

## Optimize all the _off-the-shelf_ classifiers

We now try to do a very simple and quick optimization of those off-the shelf classifiers, to see if the situation improves for some of them.

We choose the grid values around the default values, when possible.

In [54]:
list(OFF_THE_SHELF_CLASSIFIERS.keys())

['Perceptron',
 'LogisticRegression',
 'PassiveAggressiveClassifier',
 'SVC',
 'KNeighborsClassifier',
 'GaussianProcessClassifier',
 'DecisionTreeClassifier',
 'RandomForestClassifier',
 'AdaBoostClassifier',
 'GradientBoostingClassifier']

In [55]:
OFF_THE_SHELF_CLASSIFIERS

OrderedDict([('Perceptron',
              {'clf': Perceptron(),
               'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                'alpha': [0.001, 0.0001, 1e-05],
                'l1_ratio': [None, 0.075, 0.15, 0.3],
                'max_iter': [500, 1000, 2000],
                'random_state': [11]}}),
             ('LogisticRegression',
              {'clf': LogisticRegression(),
               'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                'C': [0.5, 1.0, 2.0],
                'l1_ratio': [None, 0.075, 0.15, 0.3],
                'solver': ['saga'],
                'max_iter': [50, 100, 200],
                'random_state': [11]}}),
             ('PassiveAggressiveClassifier',
              {'clf': PassiveAggressiveClassifier(),
               'param_grid': {'C': [0.5, 1.0, 2.0],
                'max_iter': [500, 1000, 2000],
                'loss': ['hinge', 'squared_hinge'],
                'random_state': [11]}}),
             ('SVC',
   

### Optimize classifiers

We now try a simple optimization of these off-the shelf classifiers, and targetting the  `precision` metric for minimization.

In [56]:
warnings.simplefilter('ignore')
fitted_classifiers_opt = OrderedDict()
for classifier in list(OFF_THE_SHELF_CLASSIFIERS.keys()):
    print("Optimizing and fitting classifier %s..." %classifier)
    clf = copy.deepcopy(OFF_THE_SHELF_CLASSIFIERS[classifier]['clf'])
    param_grid = OFF_THE_SHELF_CLASSIFIERS[classifier]['param_grid']
    # Optimize with training data:
    cv = GridSearchCV(clf, param_grid=param_grid, scoring='precision', cv=3, refit=True)
    start_time = time()
    cv.fit(s4_train_rel[rel_features], s4_train_rel['Pulsating'])
    end_time = time()
    elapsed_time = end_time - start_time
    print("... completed. Elapsed time: %.3f seconds" %elapsed_time)
    # Add the best fitted classifier to the dictionary:
    fitted_classifiers_opt[classifier] = OrderedDict({
        'Fitted_clf': copy.deepcopy(cv.best_estimator_),
        'OptTrain_time': elapsed_time
    })

Optimizing and fitting classifier Perceptron...
... completed. Elapsed time: 1.980 seconds
Optimizing and fitting classifier LogisticRegression...
... completed. Elapsed time: 22.171 seconds
Optimizing and fitting classifier PassiveAggressiveClassifier...
... completed. Elapsed time: 0.348 seconds
Optimizing and fitting classifier SVC...
... completed. Elapsed time: 6.169 seconds
Optimizing and fitting classifier KNeighborsClassifier...
... completed. Elapsed time: 5.489 seconds
Optimizing and fitting classifier GaussianProcessClassifier...
... completed. Elapsed time: 70.054 seconds
Optimizing and fitting classifier DecisionTreeClassifier...
... completed. Elapsed time: 10.043 seconds
Optimizing and fitting classifier RandomForestClassifier...
... completed. Elapsed time: 2487.177 seconds
Optimizing and fitting classifier AdaBoostClassifier...
... completed. Elapsed time: 17.554 seconds
Optimizing and fitting classifier GradientBoostingClassifier...
... completed. Elapsed time: 5170.5

In [57]:
fitted_classifiers_opt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf',
                            Perceptron(alpha=0.001, l1_ratio=0.075, max_iter=500, penalty='elasticnet',
                                       random_state=11)),
                           ('OptTrain_time', 1.979705810546875)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf',
                            LogisticRegression(C=0.5, l1_ratio=0.15, max_iter=50, penalty='elasticnet',
                                               random_state=11, solver='saga')),
                           ('OptTrain_time', 22.171473503112793)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf',
                            PassiveAggressiveClassifier(C=2.0, loss='squared_hinge', max_iter=500,
                                                        random_state=11)),
                           ('OptTrain_time', 0.3480699062347412)])),
             ('SVC',
              Ord

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each of the optimized and fitted classifier:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient

### Performance measurements

In [58]:
fitted_classifiers_opt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf',
                            Perceptron(alpha=0.001, l1_ratio=0.075, max_iter=500, penalty='elasticnet',
                                       random_state=11)),
                           ('OptTrain_time', 1.979705810546875)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf',
                            LogisticRegression(C=0.5, l1_ratio=0.15, max_iter=50, penalty='elasticnet',
                                               random_state=11, solver='saga')),
                           ('OptTrain_time', 22.171473503112793)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf',
                            PassiveAggressiveClassifier(C=2.0, loss='squared_hinge', max_iter=500,
                                                        random_state=11)),
                           ('OptTrain_time', 0.3480699062347412)])),
             ('SVC',
              Ord

Let's see the classification reports for all the classifiers:

#### Confusion matrices

In [59]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_opt.keys()):
    print("\n\nPrinting confusion matrices for classifier %s..." %classifier)
    print("Training set:")
    y_train_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(confusion_matrix(y_true=y_train_true, y_pred=y_train_pred))
    # Training set:
    print("Validation set:")
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(confusion_matrix(y_true=y_val_true, y_pred=y_val_pred))



Printing confusion matrices for classifier Perceptron...
Training set:
[[279 393]
 [141 531]]
Validation set:
[[ 76 148]
 [  7  19]]


Printing confusion matrices for classifier LogisticRegression...
Training set:
[[432 240]
 [214 458]]
Validation set:
[[122 102]
 [ 15  11]]


Printing confusion matrices for classifier PassiveAggressiveClassifier...
Training set:
[[394 278]
 [302 370]]
Validation set:
[[119 105]
 [ 10  16]]


Printing confusion matrices for classifier SVC...
Training set:
[[643  29]
 [  0 672]]
Validation set:
[[194  30]
 [ 20   6]]


Printing confusion matrices for classifier KNeighborsClassifier...
Training set:
[[672   0]
 [  0 672]]
Validation set:
[[194  30]
 [ 22   4]]


Printing confusion matrices for classifier GaussianProcessClassifier...
Training set:
[[654  18]
 [  0 672]]
Validation set:
[[203  21]
 [ 20   6]]


Printing confusion matrices for classifier DecisionTreeClassifier...
Training set:
[[631  41]
 [ 89 583]]
Validation set:
[[180  44]
 [ 20   6]]


**OBSERVATION:** Again, the main problem here is that most classifiers overfit to the training set.

#### Classification reports

In [60]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_opt.keys()):
    print("Printing classification reports for classifier %s..." %classifier)
    print("\tTraining set:")
    y_train_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(classification_report(y_true=y_train_true, y_pred=y_train_pred))
    # Training set:
    print("\tValidation set:")
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(classification_report(y_true=y_val_true, y_pred=y_val_pred))


Printing classification reports for classifier Perceptron...
	Training set:
              precision    recall  f1-score   support

           0       0.66      0.42      0.51       672
           1       0.57      0.79      0.67       672

    accuracy                           0.60      1344
   macro avg       0.62      0.60      0.59      1344
weighted avg       0.62      0.60      0.59      1344

	Validation set:
              precision    recall  f1-score   support

           0       0.92      0.34      0.50       224
           1       0.11      0.73      0.20        26

    accuracy                           0.38       250
   macro avg       0.51      0.54      0.35       250
weighted avg       0.83      0.38      0.46       250

Printing classification reports for classifier LogisticRegression...
	Training set:
              precision    recall  f1-score   support

           0       0.67      0.64      0.66       672
           1       0.66      0.68      0.67       672

    a

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each of the optimized and fitted classifier:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient

#### Main metrics

In [61]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_opt.keys()):
    print("Calculating predictions and performance for classifier %s..." %classifier)
    # Training set:
    y_train_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    fitted_classifiers_opt[classifier]['Training metrics'] = OrderedDict({})
    fitted_classifiers_opt[classifier]['Training metrics']['accuracy'] = \
        accuracy_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['precision'] = \
        precision_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['recall'] = \
        recall_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['F1'] = \
        f1_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['log_loss'] = \
        log_loss(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_train_true, y_pred=y_train_pred)
    # Validation set:
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    fitted_classifiers_opt[classifier]['Validation metrics'] = OrderedDict({})
    fitted_classifiers_opt[classifier]['Validation metrics']['accuracy'] = \
        accuracy_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['precision'] = \
        precision_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['recall'] = \
        recall_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['F1'] = \
        f1_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['log_loss'] = \
        log_loss(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_val_true, y_pred=y_val_pred)


Calculating predictions and performance for classifier Perceptron...
Calculating predictions and performance for classifier LogisticRegression...
Calculating predictions and performance for classifier PassiveAggressiveClassifier...
Calculating predictions and performance for classifier SVC...
Calculating predictions and performance for classifier KNeighborsClassifier...
Calculating predictions and performance for classifier GaussianProcessClassifier...
Calculating predictions and performance for classifier DecisionTreeClassifier...
Calculating predictions and performance for classifier RandomForestClassifier...
Calculating predictions and performance for classifier AdaBoostClassifier...
Calculating predictions and performance for classifier GradientBoostingClassifier...


In [62]:
fitted_classifiers_opt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf',
                            Perceptron(alpha=0.001, l1_ratio=0.075, max_iter=500, penalty='elasticnet',
                                       random_state=11)),
                           ('OptTrain_time', 1.979705810546875),
                           ('Training metrics',
                            OrderedDict([('accuracy', 0.6026785714285714),
                                         ('precision', 0.5746753246753247),
                                         ('recall', 0.7901785714285714),
                                         ('F1', 0.6654135338345865),
                                         ('log_loss', 14.320915855497438),
                                         ('MCC', 0.2215228119522081)])),
                           ('Validation metrics',
                            OrderedDict([('accuracy', 0.38),
                                         ('precision', 0.11377245508982035),
                          

**OBSERVATION:** <font color='red'>**ONCE AGAIN, VERY BAD RESULTS EVEN WITH OPTIMIZATION**</font>

The problem here could be that all classifiers are suffering from overfitting and / or imbalanced dataset.

#### Focus on `precision`

We now set the focus on `precision`, as it is the metric we are more interested in.

In [63]:
precision_results['BMOPT_tr_precision'] = np.nan
precision_results['BMOPT_val_precision'] = np.nan
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision,BMOPT_tr_precision,BMOPT_val_precision
Perceptron,0.578526,0.122449,,
LogisticRegression,0.654519,0.105263,,
PassiveAggressiveClassifier,0.605948,0.115942,,
SVC,0.922438,0.155556,,
KNeighborsClassifier,0.764505,0.110092,,
GaussianProcessClassifier,1.0,0.115385,,
DecisionTreeClassifier,1.0,0.122449,,
RandomForestClassifier,1.0,0.166667,,
AdaBoostClassifier,0.859216,0.096154,,
GradientBoostingClassifier,0.968162,0.125,,


In [64]:
for clf in fitted_classifiers_opt.keys():
    precision_results.loc[clf, 'BMOPT_tr_precision'] = \
        fitted_classifiers_opt[clf]['Training metrics']['precision']
    precision_results.loc[clf, 'BMOPT_val_precision'] = \
        fitted_classifiers_opt[clf]['Validation metrics']['precision']
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision,BMOPT_tr_precision,BMOPT_val_precision
Perceptron,0.578526,0.122449,0.574675,0.113772
LogisticRegression,0.654519,0.105263,0.65616,0.097345
PassiveAggressiveClassifier,0.605948,0.115942,0.570988,0.132231
SVC,0.922438,0.155556,0.958631,0.166667
KNeighborsClassifier,0.764505,0.110092,1.0,0.117647
GaussianProcessClassifier,1.0,0.115385,0.973913,0.222222
DecisionTreeClassifier,1.0,0.122449,0.934295,0.12
RandomForestClassifier,1.0,0.166667,0.98797,0.0625
AdaBoostClassifier,0.859216,0.096154,0.902616,0.073171
GradientBoostingClassifier,0.968162,0.125,0.796634,0.105263


In [65]:
print("TRAINING / VALIDATION PRECISION RESULTS, SLIGHTLY OPTIMIZED OFF-THE-SHELF CLASSIFIERS")
for idx in (precision_results.index):
    print("%s: %.2f / %.2f" %(idx,
                              precision_results.loc[idx, 'BMOPT_tr_precision'],
                              precision_results.loc[idx, 'BMOPT_val_precision']))

TRAINING / VALIDATION PRECISION RESULTS, SLIGHTLY OPTIMIZED OFF-THE-SHELF CLASSIFIERS
Perceptron: 0.57 / 0.11
LogisticRegression: 0.66 / 0.10
PassiveAggressiveClassifier: 0.57 / 0.13
SVC: 0.96 / 0.17
KNeighborsClassifier: 1.00 / 0.12
GaussianProcessClassifier: 0.97 / 0.22
DecisionTreeClassifier: 0.93 / 0.12
RandomForestClassifier: 0.99 / 0.06
AdaBoostClassifier: 0.90 / 0.07
GradientBoostingClassifier: 0.80 / 0.11


**OBSERVATION:** so, it is clear that all the algorithms are still performing badly:

- All of them show some kind of overfitting.

**CONCLUSIONS: IMPROVEMENT/WORSENING WITH A SINGLE OPTIMIZATION OF OFF-THE-SHELF CLASSIFIERS:**
- Perceptron: **optimization has worsened the results a little**, and **overfitting / imbalance is still a problem** (but not so hard as in other cases, and it is still the most train/validation-balanced case from all the classifiers).
- LogisticRegression: **no improvement at all with optimization**. Again, extremely bad results, both in training and in validation. **Probably, suffering from imbalanced classes in an extreme way**. Does not even have the opportunity to overfit.
- PassiveAggressiveClassifier: **improves with optimization**. Performances in line with those of `Perceptron`, but a little bit more overfitting. **Probably, suffering from imbalanced classes**. Does not even have the opportunity to overfit.
- SVC: **great improve with optimization**, but **it now falls in extreme overfitting.**.
- KNeighborsClassifier: **with optimization, results seem more reasonable**. Perfect precision in training, low precision in validation. **Extreme overfitting problem**.
- GaussianProcessClassifier: **much worse with optimization**. Both`precision` values go to null: **extreme problem with imbalanced classes.** Does not even have the opportunity to overfit.
- DecisionTreeClassifier: **with optimization, it seems to overfit less**, but the `precision` in validation goes to zero. **Probably Suffering a lot from overfitting**.
- RandomForestClassifier: **no change with optimization**: again, **problems with both overfitting and umbalanced classes**.
- AdaBoostClassifier: **with optimization, the overfitting problem has improved, it now overfits less**, but still seems to suffer a little from imbalanced classes.
- GradientBoostingClassifier: **optimization has worsened the results**. Zero precision for both the training and validation sets. **Why?**

### Save the models

In [66]:
pickle.dump(fitted_classifiers_opt, open(MODELS_FOLDER + "OVERSAMPLED_fitted_clf_opt.pickle", 'wb'))

## Save the results

In [69]:
precision_results = precision_results.reset_index(drop=False).rename(columns={'index': 'Classifier'})
precision_results

Unnamed: 0,Classifier,Classifier.1,BM_tr_precision,BM_val_precision,BMOPT_tr_precision,BMOPT_val_precision
0,0,Perceptron,0.578526,0.122449,0.574675,0.113772
1,1,LogisticRegression,0.654519,0.105263,0.65616,0.097345
2,2,PassiveAggressiveClassifier,0.605948,0.115942,0.570988,0.132231
3,3,SVC,0.922438,0.155556,0.958631,0.166667
4,4,KNeighborsClassifier,0.764505,0.110092,1.0,0.117647
5,5,GaussianProcessClassifier,1.0,0.115385,0.973913,0.222222
6,6,DecisionTreeClassifier,1.0,0.122449,0.934295,0.12
7,7,RandomForestClassifier,1.0,0.166667,0.98797,0.0625
8,8,AdaBoostClassifier,0.859216,0.096154,0.902616,0.073171
9,9,GradientBoostingClassifier,0.968162,0.125,0.796634,0.105263


In [70]:
precision_results.to_csv(MODELS_FOLDER + PRECISION_RESULTS_OUT, sep=',', decimal='.', index=False)

## Predictions on the validation dataset

### Predictions

We now save the predictions on the validation dataset, alongside with all available metadata, so that they can be analysed later on.

In [80]:
s4_val_w_pred = s4_val.copy()
s4_val_w_pred

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,0,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,0,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,0,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,Star-00120,0,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,0,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,0,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,Star-00943,0,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,Star-00721,0,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,Star-00926,0,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


In [81]:
warnings.simplefilter('ignore')
for classifier in list(fitted_classifiers_opt.keys()):
    print("Calculating predictions for classifier %s..." %classifier)
    # Training set:
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val[rel_features])
    s4_val_w_pred['Prediction_' + classifier] = y_val_pred

Calculating predictions for classifier Perceptron...
Calculating predictions for classifier LogisticRegression...
Calculating predictions for classifier PassiveAggressiveClassifier...
Calculating predictions for classifier SVC...
Calculating predictions for classifier KNeighborsClassifier...
Calculating predictions for classifier GaussianProcessClassifier...
Calculating predictions for classifier DecisionTreeClassifier...
Calculating predictions for classifier RandomForestClassifier...
Calculating predictions for classifier AdaBoostClassifier...
Calculating predictions for classifier GradientBoostingClassifier...


In [82]:
s4_val_w_pred.head(20)

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,Prediction_Perceptron,Prediction_LogisticRegression,Prediction_PassiveAggressiveClassifier,Prediction_SVC,Prediction_KNeighborsClassifier,Prediction_GaussianProcessClassifier,Prediction_DecisionTreeClassifier,Prediction_RandomForestClassifier,Prediction_AdaBoostClassifier,Prediction_GradientBoostingClassifier
0,Star-00107,0,0.0,0.0,0.0,2457430.0,0.0,-0.99166,0.031948,0.542146,...,0,0,0,0,0,0,0,0,0,0
1,Star-00868,0,0.0,0.0,0.0,2457432.0,0.0,-1.309194,-1.081711,1.825039,...,0,0,0,0,0,0,0,0,0,0
2,Star-00106,0,0.0,0.0,0.0,2457404.0,0.0,-0.356591,0.379966,0.844003,...,1,0,1,0,0,0,1,0,1,0
3,Star-00120,0,0.0,0.0,0.0,2457395.0,0.0,-0.039057,0.519174,0.994931,...,0,0,0,0,0,0,1,1,0,1
4,Star-00559,0,0.0,0.0,0.0,2457441.0,0.0,0.596012,-0.664089,-0.212498,...,0,0,1,0,0,0,0,0,0,0
5,Star-00205,1,71.94,0.53,0.0,2457510.0,0.05,-0.039057,1.493625,0.391217,...,1,1,1,0,0,0,1,0,0,1
6,Star-00061,0,0.0,0.0,0.0,2457432.0,0.0,-0.356591,1.493625,1.221324,...,1,1,1,0,0,0,0,0,0,0
7,Star-00258,0,0.0,0.0,0.0,2457401.0,0.0,0.278478,-0.176863,-0.665284,...,1,1,1,0,0,0,0,0,0,1
8,Star-00124,0,0.0,0.0,0.0,2457401.0,0.0,-1.944263,1.0064,2.051432,...,1,0,1,0,0,0,0,0,0,0
9,Star-00775,0,0.0,0.0,0.0,2457404.0,0.0,0.596012,-0.664089,-0.212498,...,0,0,1,0,0,0,0,0,0,0


### Prediction probabilities (if available)

In [83]:
warnings.simplefilter('ignore')
for classifier in list(fitted_classifiers_opt.keys()):
    print("Calculating predictions for classifier %s..." %classifier)
    # Validation set:
    try:
        y_val_pred_proba = fitted_classifiers_opt[classifier]['Fitted_clf'].predict_proba(s4_val[rel_features])
        s4_val_w_pred['PredictionProb_' + classifier] = pd.Series(y_val_pred_proba[:, 1])
        print("... ok, probabilities calculated")
    except:
        print("**WARNING: 'predict_proba' method failed for classifier '%s'." %classifier)
        s4_val_w_pred['PredictionProb_' + classifier] = np.nan

Calculating predictions for classifier Perceptron...
Calculating predictions for classifier LogisticRegression...
... ok, probabilities calculated
Calculating predictions for classifier PassiveAggressiveClassifier...
Calculating predictions for classifier SVC...
Calculating predictions for classifier KNeighborsClassifier...
... ok, probabilities calculated
Calculating predictions for classifier GaussianProcessClassifier...
... ok, probabilities calculated
Calculating predictions for classifier DecisionTreeClassifier...
... ok, probabilities calculated
Calculating predictions for classifier RandomForestClassifier...
... ok, probabilities calculated
Calculating predictions for classifier AdaBoostClassifier...
... ok, probabilities calculated
Calculating predictions for classifier GradientBoostingClassifier...
... ok, probabilities calculated


In [84]:
s4_val_w_pred.head(20)

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,PredictionProb_Perceptron,PredictionProb_LogisticRegression,PredictionProb_PassiveAggressiveClassifier,PredictionProb_SVC,PredictionProb_KNeighborsClassifier,PredictionProb_GaussianProcessClassifier,PredictionProb_DecisionTreeClassifier,PredictionProb_RandomForestClassifier,PredictionProb_AdaBoostClassifier,PredictionProb_GradientBoostingClassifier
0,Star-00107,0,0.0,0.0,0.0,2457430.0,0.0,-0.99166,0.031948,0.542146,...,,0.254676,,,0.0,0.313916,0.0,0.312359,0.3978,0.449421
1,Star-00868,0,0.0,0.0,0.0,2457432.0,0.0,-1.309194,-1.081711,1.825039,...,,0.193567,,,0.0,0.288121,0.079365,0.132156,0.476476,0.290055
2,Star-00106,0,0.0,0.0,0.0,2457404.0,0.0,-0.356591,0.379966,0.844003,...,,0.377001,,,0.0,0.286604,0.846154,0.273521,0.502487,0.494673
3,Star-00120,0,0.0,0.0,0.0,2457395.0,0.0,-0.039057,0.519174,0.994931,...,,0.438314,,,0.0,0.325637,0.962963,0.594324,0.497224,0.535603
4,Star-00559,0,0.0,0.0,0.0,2457441.0,0.0,0.596012,-0.664089,-0.212498,...,,0.458241,,,0.0,0.40798,0.022727,0.153099,0.488427,0.350336
5,Star-00205,1,71.94,0.53,0.0,2457510.0,0.05,-0.039057,1.493625,0.391217,...,,0.533833,,,0.0,0.462049,0.949275,0.44029,0.499768,0.506261
6,Star-00061,0,0.0,0.0,0.0,2457432.0,0.0,-0.356591,1.493625,1.221324,...,,0.646664,,,0.0,0.412224,0.352941,0.259671,0.411559,0.418226
7,Star-00258,0,0.0,0.0,0.0,2457401.0,0.0,0.278478,-0.176863,-0.665284,...,,0.673551,,,0.0,0.412592,0.377778,0.376673,0.499576,0.653731
8,Star-00124,0,0.0,0.0,0.0,2457401.0,0.0,-1.944263,1.0064,2.051432,...,,0.345656,,,0.0,0.389594,0.079365,0.216017,0.404364,0.358141
9,Star-00775,0,0.0,0.0,0.0,2457404.0,0.0,0.596012,-0.664089,-0.212498,...,,0.31465,,,0.0,0.422637,0.022727,0.260407,0.496226,0.369092


### Save the predictions

And we now save the file:

In [85]:
s4_val_w_pred.to_csv(MODELS_FOLDER + VAL_PREDICTIONS_OUT, sep=',', decimal='.', index=False)

## Summary

**RESULTS:**

- We tested different classifiers from different families against the S4 sample oversampled with SMOTE.
- In general, results show slight improvement with a very simple and naive model hyperparameter optimization.
- Overfitting seems to be a serious problem, more specially with the tree / ensemble methods.
- We have stored both the precision of the different ML models, as well as their predictions on the validation set (and prediction probabilities when available).

**CONCLUSIONS:**

- Additional work in tree and ensemble classifiers is needed to prevento overfitting: pruning the trees, for example (even if some values for `ccp_alpha` parameter were tried as part of the optimization).
