#  MODEL PRESELECTION

In this Notebook we take several _off-the-shelf_ Machine Learning models (classifiers) and apply them to the S4 sample.

We run 5-fold cross-validation on the training/test set and measure the performance of the trained model on the validation set. These two sets were previously split from the S4 sample, stratified by the target variable, and saved to files.
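As a rough illustration of what "stratified by the target variable" means, the sketch below splits a toy label vector so that both parts keep the same class proportions. The helper name `stratified_split`, the 25% hold-out fraction and the toy labels are assumptions made for this example (the fraction is only inferred from the set sizes that appear in the outputs of this notebook); the actual split was done in a previous step and is not shown here.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, holdout_fraction=0.25, seed=11):
    """Return (train_idx, holdout_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    # Group the row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, holdout_idx = [], []
    # Shuffle each class separately and hold out the same fraction of each.
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * holdout_fraction)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)

# Toy labels (hypothetical): 896 negatives and 104 positives.
labels = [0] * 896 + [1] * 104
train_idx, val_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print(train_counts[0], train_counts[1])  # 672 78
print(val_counts[0], val_counts[1])      # 224 26
```

In practice, `sklearn.model_selection.train_test_split` with its `stratify` argument does the same job; the pure-Python version is only meant to make the idea explicit.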

The scoring metric for cross-validation is `precision`.

We will test the following ML classification models. For each model, we run a basic `GridSearchCV` over its most relevant hyperparameters, generally trying the default value, one half (or one tenth) of it, and twice (or ten times) it.

- Linear models:
  - `Perceptron`
  - `LogisticRegression`
  - `PassiveAggressiveClassifier`
- Support Vector Machines:
  - `SVC`
- Nearest-Neighbours models:
  - `KNeighborsClassifier`
- Gaussian Processes models:
  - `GaussianProcessClassifier`
- Tree models:
  - `DecisionTreeClassifier`
- Ensemble models:
  - `RandomForestClassifier`
  - `AdaBoostClassifier` (a particular case of the next type of model)
  - `GradientBoostingClassifier`
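The "default, half and double" rule described above can be sketched with a small helper. `grid_around_default` is a hypothetical name introduced only for this illustration; the grids actually used in this notebook are written out by hand in the configuration cell.

```python
def grid_around_default(default, factor=2):
    """Return [default / factor, default, default * factor] as candidate values."""
    if isinstance(default, int):
        # Keep integer hyperparameters (e.g. n_estimators) as integers.
        return [default // factor, default, default * factor]
    return [default / factor, default, default * factor]

print(grid_around_default(1.0))   # [0.5, 1.0, 2.0], e.g. for C
print(grid_around_default(100))   # [50, 100, 200], e.g. for n_estimators
print(grid_around_default(0.0001, factor=10))  # one tenth / tenfold, e.g. for alpha
```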

## Modules and configuration

### Modules

In [59]:
import pandas as pd
import numpy as np

import copy

from time import time

from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, \
    f1_score, log_loss, matthews_corrcoef, classification_report, \
    get_scorer_names, confusion_matrix

from collections import OrderedDict

import warnings

#from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif

from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, DotProduct

from sklearn.model_selection import GridSearchCV # Used to avoid overfitting
#from sklearn.model_selection import cross_validate #### NOTE: THIS ONE MAY BE BETTER, TO KEEP CONTROL AND GET
# ALL THE RUNS WE WANT
###### IS THE GRID SEARCH NOT NEEDED AT ALL??? WE COULD SIMPLY "FIT" AND THEN MEASURE WITH
# "PREDICT" ON THE VALIDATION SET

from sklearn.linear_model import Perceptron, LogisticRegression, PassiveAggressiveClassifier
from sklearn.svm import SVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, \
    GradientBoostingClassifier, HistGradientBoostingClassifier

import pickle

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white", {'figure.figsize':(15,10)})

from IPython.display import display

# from imblearn import 

### Configuration

In [126]:
RANDOM_STATE = 11 # For reproducibility

S4_TRAIN_SET_IN = "../data/DATASETS_ML/S4_02_DS_AfterImputing_TrainTest.csv"
# Train/test set for S4 sample, all 112 features
S4_VALIDATION_SET_IN = "../data/DATASETS_ML/S4_02_DS_AfterImputing_Validation.csv"
# Validation set for S4 sample, all 112 features

REL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Reliable_features.pickle"
#UNREL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Unreliable_features.pickle"

ML_ADD_COLUMNS = ['Karmn'] # Only cesium features and this column will be kept.
S4_ADD_COLUMNS = ['ID', 'Pulsating', 'frequency', 'amplitudeRV',
                  'offsetRV', 'refepochRV', 'phase'] # Only cesium features and these columns will be kept.

MODELS_FOLDER = "../data/ML_MODELS/ML_model_preselection/"

PRECISION_RESULTS_OUT = "ModelPreselection_PrecisionResults_Imbalanced.csv"
VAL_PREDICTIONS_OUT = "ModelPreselection_ValidationPredictions_Imbalanced.csv"

# Note: it would be better to use a JSON file for this configuration.
OFF_THE_SHELF_CLASSIFIERS = OrderedDict({
    'Perceptron': {
        'clf': Perceptron(),
        'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                       'alpha': [0.001, 0.0001, 0.00001],
                       'l1_ratio': [None, 0.075, 0.15, 0.30],
                       'max_iter': [500, 1000, 2000],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'LogisticRegression': {
        'clf': LogisticRegression(),
        'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                       'C': [0.5, 1.0, 2.0],
                       'l1_ratio': [None, 0.075, 0.15, 0.30],
                       'solver': ['saga'],
                       'max_iter': [50, 100, 200],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'PassiveAggressiveClassifier': {
        'clf': PassiveAggressiveClassifier(),
        'param_grid': {'C': [0.5, 1.0, 2.0],
                       'max_iter': [500, 1000, 2000],
                       'loss': ['hinge', 'squared_hinge'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'SVC': {
        'clf': SVC(),
        'param_grid': {'C': [0.5, 1.0, 2.0],
                       'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],  # 'precomputed' left out: it needs a Gram matrix as X
                       'degree': [2, 3, 6],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'KNeighborsClassifier': {
        'clf': KNeighborsClassifier(),
        'param_grid': {'n_neighbors': [1, 3, 5, 10],
                       'weights': ['uniform', 'distance'],
                       'algorithm': ['ball_tree', 'kd_tree', 'brute'],
                       'p': [1, 2]
                      }
    },
    'GaussianProcessClassifier': {
        'clf': GaussianProcessClassifier(),
        'param_grid': {'kernel': [RBF(), RationalQuadratic(), DotProduct()],
                       'max_iter_predict': [50, 100, 200],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'DecisionTreeClassifier': {
        'clf': DecisionTreeClassifier(),
        'param_grid': {'criterion': ['gini', 'entropy', 'log_loss'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'max_features': [None, 'sqrt', 'log2'],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'RandomForestClassifier': {
        'clf': RandomForestClassifier(),
        'param_grid': {'n_estimators': [50, 100, 200],
                       'criterion': ['gini', 'entropy', 'log_loss'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'max_features': [None, 'sqrt', 'log2'],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'AdaBoostClassifier': {
        'clf': AdaBoostClassifier(),
        'param_grid': {'n_estimators': [25, 50, 100],
                       'learning_rate': [0.5, 1.0, 2.0],
                       'algorithm': ['SAMME', 'SAMME.R'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'GradientBoostingClassifier': {
        'clf': GradientBoostingClassifier(),
        'param_grid': {'loss': ['log_loss', 'deviance'],
                       'learning_rate': [0.05, 0.1, 0.2],
                       'n_estimators': [25, 50, 100],
                       'criterion': ['friedman_mse', 'squared_error'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    }
})

IMAGE_FOLDER = './img/'
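To get a feeling for the cost of each search, the number of candidates in a grid is the product of the lengths of its value lists, and each candidate is fitted once per cross-validation fold (plus one final refit when `refit=True`). The sketch below counts this for the `Perceptron` grid; `count_fits` is a hypothetical helper, and the fold count is passed explicitly because it is a choice made at search time.

```python
from math import prod

def count_fits(param_grid, n_folds):
    """Return (number of candidates, number of cross-validation fits)."""
    # prod() of an empty sequence is 1, so an empty grid means one candidate.
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * n_folds

perceptron_grid = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'alpha': [0.001, 0.0001, 0.00001],
    'l1_ratio': [None, 0.075, 0.15, 0.30],
    'max_iter': [500, 1000, 2000],
    'random_state': [11],
}

print(count_fits(perceptron_grid, n_folds=5))  # (108, 540)
print(count_fits({}, n_folds=3))               # (1, 3)
```

The second call shows why an empty `param_grid` is a convenient way to cross-validate a single default configuration.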

### Functions

## Load data

We load the data, which are the time series as previously featurized by _cesium_, scaled, and with `NaN` values imputed by a `KNNImputer`.

### Load reliable features list

In [61]:
rel_features = pickle.load(open(REL_FEATURES_IN, 'rb'))
print(rel_features)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_rel_phase3', '

### Load the S4 sample train/test set

In [62]:
s4_train = pd.read_csv(S4_TRAIN_SET_IN, sep=',', decimal='.')
s4_train

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00163,False,0.0,0.00,0.0,2.457444e+06,0.00,-0.674126,0.519174,0.466681,...,-0.712310,-1.187392,0.425026,-0.002305,0.495906,-0.537353,-0.028926,-0.262548,-0.135686,-0.705143
1,Star-00123,True,30.0,0.72,0.0,2.457401e+06,0.37,-1.626729,1.911247,-0.740748,...,0.040924,-1.110488,-0.289189,0.056551,0.555375,-0.699590,-0.292135,-0.013533,0.443673,-1.207278
2,Star-00022,False,0.0,0.00,0.0,2.457430e+06,0.00,-0.039057,-1.012107,0.013895,...,-0.943428,0.637603,-0.679383,0.020496,-0.496592,-0.001214,-0.101526,-0.011097,-0.293389,0.242263
3,Star-00708,False,0.0,0.00,0.0,2.459677e+06,0.00,-0.039057,1.632833,-0.514355,...,-1.091456,0.759880,-0.161363,-0.210930,0.135863,0.662121,-0.492481,0.015621,-0.724783,0.682494
4,Star-00484,False,0.0,0.00,0.0,2.457400e+06,0.00,0.596012,-0.176863,-1.042605,...,-0.696260,0.153752,0.936459,0.070402,-0.067689,-0.656553,-0.237337,-0.032597,-0.139141,-0.098080
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,Star-00795,False,0.0,0.00,0.0,2.457410e+06,0.00,0.278478,-0.803296,1.070396,...,-0.390463,-0.320856,-0.058001,0.061510,-0.608276,-1.229502,-0.332122,-0.410947,1.478398,-0.121757
746,Star-00221,False,0.0,0.00,0.0,2.457478e+06,0.00,-0.039057,-1.290522,-0.438891,...,0.703166,-1.543128,-0.386300,0.084782,-0.063902,0.199957,0.689712,-0.526121,1.559757,-1.150793
747,Star-00463,False,0.0,0.00,0.0,2.457409e+06,0.00,0.596012,-0.733692,-1.193534,...,-0.741328,1.272454,-0.054747,0.066032,-0.327664,0.414920,0.069037,0.039541,-0.218490,1.238094
748,Star-00873,False,0.0,0.00,0.0,2.457416e+06,0.00,-0.674126,-0.524881,0.542146,...,0.798791,-1.532917,-2.547988,0.149859,1.751137,-0.416064,-0.361215,0.544014,-1.894647,-1.094748


Notice that the target variable `Pulsating` is still stored as `True` / `False`, and that the metadata columns (`ID`, `frequency`, `amplitudeRV`, etc.) are still present alongside the features.

#### Encode target variable (`Pulsating`)

We encode the target variable as `1` / `0` (`True` becomes `1`, `False` becomes `0`).

In [63]:
s4_train['Pulsating'] = (s4_train['Pulsating'] == True).astype(int)
s4_train

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00163,0,0.0,0.00,0.0,2.457444e+06,0.00,-0.674126,0.519174,0.466681,...,-0.712310,-1.187392,0.425026,-0.002305,0.495906,-0.537353,-0.028926,-0.262548,-0.135686,-0.705143
1,Star-00123,1,30.0,0.72,0.0,2.457401e+06,0.37,-1.626729,1.911247,-0.740748,...,0.040924,-1.110488,-0.289189,0.056551,0.555375,-0.699590,-0.292135,-0.013533,0.443673,-1.207278
2,Star-00022,0,0.0,0.00,0.0,2.457430e+06,0.00,-0.039057,-1.012107,0.013895,...,-0.943428,0.637603,-0.679383,0.020496,-0.496592,-0.001214,-0.101526,-0.011097,-0.293389,0.242263
3,Star-00708,0,0.0,0.00,0.0,2.459677e+06,0.00,-0.039057,1.632833,-0.514355,...,-1.091456,0.759880,-0.161363,-0.210930,0.135863,0.662121,-0.492481,0.015621,-0.724783,0.682494
4,Star-00484,0,0.0,0.00,0.0,2.457400e+06,0.00,0.596012,-0.176863,-1.042605,...,-0.696260,0.153752,0.936459,0.070402,-0.067689,-0.656553,-0.237337,-0.032597,-0.139141,-0.098080
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,Star-00795,0,0.0,0.00,0.0,2.457410e+06,0.00,0.278478,-0.803296,1.070396,...,-0.390463,-0.320856,-0.058001,0.061510,-0.608276,-1.229502,-0.332122,-0.410947,1.478398,-0.121757
746,Star-00221,0,0.0,0.00,0.0,2.457478e+06,0.00,-0.039057,-1.290522,-0.438891,...,0.703166,-1.543128,-0.386300,0.084782,-0.063902,0.199957,0.689712,-0.526121,1.559757,-1.150793
747,Star-00463,0,0.0,0.00,0.0,2.457409e+06,0.00,0.596012,-0.733692,-1.193534,...,-0.741328,1.272454,-0.054747,0.066032,-0.327664,0.414920,0.069037,0.039541,-0.218490,1.238094
748,Star-00873,0,0.0,0.00,0.0,2.457416e+06,0.00,-0.674126,-0.524881,0.542146,...,0.798791,-1.532917,-2.547988,0.149859,1.751137,-0.416064,-0.361215,0.544014,-1.894647,-1.094748


#### Filter the relevant columns only

We now keep only the reliable features plus the `Pulsating` column.

In [64]:
s4_train_rel = s4_train[['Pulsating'] + rel_features].copy()
s4_train_rel

Unnamed: 0,Pulsating,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq1_rel_phase2,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2
0,0,-0.674126,0.519174,0.466681,0.766297,1.786498,-0.304944,0.843252,0.189055,1.390901,...,0.646033,0.908818,1.305379,1.413989,0.174334,-0.188773,0.985693,-0.258841,-1.099919,-0.461571
1,1,-1.626729,1.911247,-0.740748,0.691384,0.168331,1.522002,1.166420,0.157675,0.019744,...,1.532902,-1.224350,0.710232,-1.272791,1.617586,1.392776,0.260283,0.708876,1.030413,0.400968
2,0,-0.039057,-1.012107,0.013895,-0.357397,1.168762,-0.232282,-0.443941,-0.136007,-0.412519,...,0.384058,0.882515,1.044322,-1.204443,-0.593335,-1.011092,0.592500,0.135213,0.725303,-0.319881
3,0,-0.039057,1.632833,-0.514355,0.166993,1.477630,-0.544204,-0.572606,-0.586661,-0.338658,...,1.457441,-0.921750,-1.095322,-0.031960,-0.068737,1.152465,-0.672518,0.391616,-1.301501,0.559262
4,0,0.596012,-0.176863,-1.042605,-0.432310,0.242158,-0.277263,-0.498198,-0.370020,-0.451106,...,-0.829296,-0.057582,0.541269,1.529879,-0.689578,1.475290,-1.268004,1.297212,1.327633,-0.059789
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,0,0.278478,-0.803296,1.070396,0.991036,1.168762,0.022969,-0.301300,-0.379804,-0.414456,...,-0.288258,-0.848774,-0.854496,1.491923,1.008801,-0.519740,-1.680259,-1.300470,0.307095,-1.189971
746,0,-0.039057,-1.290522,-0.438891,-1.630917,-0.916097,1.116014,0.520084,-0.017803,-0.254488,...,1.636850,-1.044746,0.342069,-1.435674,-1.390858,-0.052260,-1.612403,1.449496,-0.028414,1.009892
747,0,0.596012,-0.733692,-1.193534,-0.731962,-0.993314,-0.671992,0.345105,-0.083419,1.325540,...,0.554391,-1.609160,1.073499,-1.750994,-1.230980,-0.473819,-0.204503,0.349673,0.121996,-0.524541
748,0,-0.674126,-0.524881,0.542146,0.916123,1.245979,0.340946,0.439292,-0.069517,0.156860,...,-1.035107,0.803325,-0.968768,-1.073748,-0.666836,0.702556,-0.746085,1.140938,-1.325713,0.126896


In [65]:
print(list(s4_train_rel.columns))

['Pulsating', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_r

### Load the S4 sample validation set

In [66]:
s4_val = pd.read_csv(S4_VALIDATION_SET_IN, sep=',', decimal='.')
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,False,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,False,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,False,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,Star-00120,False,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,False,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,False,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,Star-00943,False,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,Star-00721,False,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,Star-00926,False,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


#### Encode target variable (`Pulsating`)

We encode the target variable as `1` / `0` (`True` becomes `1`, `False` becomes `0`).

In [67]:
s4_val['Pulsating'] = (s4_val['Pulsating'] == True).astype(int)
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,0,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,0,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,0,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,Star-00120,0,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,0,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,0,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,Star-00943,0,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,Star-00721,0,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,Star-00926,0,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


#### Filter the relevant columns only

We now keep only the reliable features plus the `Pulsating` column.

In [68]:
s4_val_rel = s4_val[['Pulsating'] + rel_features].copy()
s4_val_rel

Unnamed: 0,Pulsating,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq1_rel_phase2,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2
0,0,-0.991660,0.031948,0.542146,1.740165,-0.298361,2.924506,1.112558,1.223343,-0.528719,...,-0.938100,-1.553810,-0.039531,-1.766924,0.105349,-0.914928,1.624409,1.041240,-1.287022,0.212829
1,0,-1.309194,-1.081711,1.825039,1.815078,-1.611050,-0.692478,0.439292,1.223343,1.528017,...,-0.225260,-0.257974,-0.443062,0.117884,0.110252,-1.094708,-0.142404,0.174193,1.058664,-1.753975
2,0,-0.356591,0.379966,0.844003,0.166993,-1.147748,-0.480752,-0.473181,0.125210,-0.246422,...,-0.828381,-0.244072,-1.117475,1.393767,-0.017646,-0.236540,0.981973,0.721607,-0.990852,-0.054942
3,0,-0.039057,0.519174,0.994931,1.065949,-1.379399,-0.348004,-0.310918,-0.224660,-0.136960,...,0.041384,0.049026,1.533790,-0.885426,1.057050,0.254424,0.604281,-1.070993,-0.705840,-2.196394
4,0,0.596012,-0.664089,-0.212498,0.391732,0.087724,1.196750,2.120285,1.085438,0.930900,...,-0.143457,-0.834589,0.906258,1.467923,0.111756,-0.662832,-0.936044,0.525532,-1.277372,-0.605419
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,0,0.913546,0.101552,-1.419927,-0.057745,-0.761663,0.185932,-0.248147,-0.282121,-0.456553,...,1.095375,0.802932,0.271270,1.003811,-0.814867,0.239043,1.691595,-0.792703,-0.844428,1.395823
246,0,0.278478,-0.733692,1.447717,1.515426,-1.997135,-0.692478,-0.772586,-0.672852,-0.528719,...,-1.125088,0.000321,0.528247,0.944357,1.417351,-1.632795,-1.374663,-1.104800,-0.094134,0.181936
247,0,-0.356591,1.493625,1.221324,-0.507223,1.786498,-0.382451,-0.368627,-0.328089,-0.185930,...,-0.028582,1.325680,1.176487,-1.183766,0.406925,0.391323,-0.620253,-1.156760,0.794597,-0.604688
248,0,0.913546,-1.151314,0.240288,-0.207571,-1.533833,-0.005709,-0.205837,-0.022504,-0.283139,...,-1.312254,0.864168,-0.223449,-0.412393,1.222789,0.142060,0.311169,-0.012413,1.407608,0.625523


In [69]:
print(list(s4_val_rel.columns))

['Pulsating', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_r

## Train all the _off-the-shelf_ classifiers (imbalanced classes)

### Train benchmark classifiers

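We train the classifiers off-the-shelf, with all parameters left at their defaults. `GridSearchCV` is called with an empty `param_grid`, so it evaluates a single candidate (the default configuration) with 3-fold cross-validation and then refits it on the whole training set.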

In [70]:
warnings.simplefilter('ignore')
fitted_classifiers_benchmark = OrderedDict()
for classifier in list(OFF_THE_SHELF_CLASSIFIERS.keys()):
    print("Fitting off-the-shelf classifier %s..." %classifier)
    clf = copy.deepcopy(OFF_THE_SHELF_CLASSIFIERS[classifier]['clf'])
    #param_grid = OFF_THE_SHELF_CLASSIFIERS[classifier]['param_grid']
    # Optimize with training data:
    cv = GridSearchCV(clf, param_grid={}, scoring='precision', cv=3, refit=True)
    start_time = time()
    cv.fit(s4_train_rel[rel_features], s4_train_rel['Pulsating'])
    end_time = time()
    elapsed_time = end_time - start_time
    print("... completed. Elapsed time: %.3f seconds" %elapsed_time)
    # Add the best fitted classifier to the dictionary:
    fitted_classifiers_benchmark[classifier] = OrderedDict({
        'Fitted_clf': copy.deepcopy(cv.best_estimator_),
        'OptTrain_time': elapsed_time
    })

Fitting off-the-shelf classifier Perceptron...
... completed. Elapsed time: 0.020 seconds
Fitting off-the-shelf classifier LogisticRegression...
... completed. Elapsed time: 0.051 seconds
Fitting off-the-shelf classifier PassiveAggressiveClassifier...
... completed. Elapsed time: 0.018 seconds
Fitting off-the-shelf classifier SVC...
... completed. Elapsed time: 0.090 seconds
Fitting off-the-shelf classifier KNeighborsClassifier...
... completed. Elapsed time: 0.091 seconds
Fitting off-the-shelf classifier GaussianProcessClassifier...
... completed. Elapsed time: 0.369 seconds
Fitting off-the-shelf classifier DecisionTreeClassifier...
... completed. Elapsed time: 0.079 seconds
Fitting off-the-shelf classifier RandomForestClassifier...
... completed. Elapsed time: 0.996 seconds
Fitting off-the-shelf classifier AdaBoostClassifier...
... completed. Elapsed time: 0.565 seconds
Fitting off-the-shelf classifier GradientBoostingClassifier...
... completed. Elapsed time: 2.026 seconds


In [71]:
fitted_classifiers_benchmark

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.019948244094848633)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf', LogisticRegression()),
                           ('OptTrain_time', 0.05086326599121094)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf', PassiveAggressiveClassifier()),
                           ('OptTrain_time', 0.017952442169189453)])),
             ('SVC',
              OrderedDict([('Fitted_clf', SVC()),
                           ('OptTrain_time', 0.08975863456726074)])),
             ('KNeighborsClassifier',
              OrderedDict([('Fitted_clf', KNeighborsClassifier()),
                           ('OptTrain_time', 0.09075641632080078)])),
             ('GaussianProcessClassifier',
              OrderedDict([('Fitted_clf', GaussianProcessClassifier()),
                           ('OptTrain_time', 0.36901

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each fitted classifier:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient
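As a reminder of how most of these metrics relate to a binary confusion matrix, here is a minimal sketch that derives them from the four counts. `metrics_from_confusion` is a hypothetical helper, not the code used in this notebook (which relies on the `sklearn.metrics` functions); log loss is left out because it needs predicted probabilities rather than counts. Undefined ratios are returned as `0.0`, which is an assumption of this sketch. The example counts are those printed further down for the `Perceptron` on the validation set.

```python
from math import sqrt

def metrics_from_confusion(tn, fp, fn, tp):
    """Compute count-based metrics for the positive class of a binary problem."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Precision: fraction of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true positives that were recovered.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient.
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1, 'mcc': mcc}

# Confusion matrix [[TN, FP], [FN, TP]] = [[197, 27], [22, 4]].
perceptron_val = metrics_from_confusion(tn=197, fp=27, fn=22, tp=4)
for name, value in perceptron_val.items():
    print(f"{name}: {value:.3f}")

# A classifier that never predicts the positive class: high accuracy, zero precision.
majority_only = metrics_from_confusion(tn=224, fp=0, fn=26, tp=0)
print(majority_only['accuracy'], majority_only['precision'])  # 0.896 0.0
```

The second example shows why accuracy alone is misleading on an imbalanced sample like this one.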

### Performance measurements

Let's look at the confusion matrices and classification reports for all the classifiers:

#### Confusion matrices

In [73]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_benchmark.keys()):
    print("\n\nPrinting confusion matrices for classifier %s..." %classifier)
    print("Training set:")
    y_train_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(confusion_matrix(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("Validation set:")
    y_val_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(confusion_matrix(y_true=y_val_true, y_pred=y_val_pred))



Printing confusion matrices for classifier Perceptron...
Training set:
[[622  50]
 [ 64  14]]
Validation set:
[[197  27]
 [ 22   4]]


Printing confusion matrices for classifier LogisticRegression...
Training set:
[[672   0]
 [ 78   0]]
Validation set:
[[224   0]
 [ 26   0]]


Printing confusion matrices for classifier PassiveAggressiveClassifier...
Training set:
[[634  38]
 [ 66  12]]
Validation set:
[[210  14]
 [ 24   2]]


Printing confusion matrices for classifier SVC...
Training set:
[[672   0]
 [ 78   0]]
Validation set:
[[224   0]
 [ 26   0]]


Printing confusion matrices for classifier KNeighborsClassifier...
Training set:
[[668   4]
 [ 76   2]]
Validation set:
[[223   1]
 [ 26   0]]


Printing confusion matrices for classifier GaussianProcessClassifier...
Training set:
[[672   0]
 [  0  78]]
Validation set:
[[212  12]
 [ 22   4]]


Printing confusion matrices for classifier DecisionTreeClassifier...
Training set:
[[672   0]
 [  0  78]]
Validation set:
[[191  33]
 [ 25   1]]


**OBSERVATION:** The main problem here is that most classifiers overfit the training set. However, the behaviour on the training set now seems better (i.e. not every classifier predicts a single label, although `LogisticRegression` and `SVC` still predict only the majority class).
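One simple way to put a number on the overfitting is to compare the precision on the training set with the precision on the validation set. The sketch below does this for three of the confusion matrices printed above; `precision_from_matrix` is a hypothetical helper, and treating an undefined precision as `0.0` is an assumption of this sketch.

```python
def precision_from_matrix(matrix):
    """Precision of the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

# (training matrix, validation matrix), copied from the printed outputs.
matrices = {
    'Perceptron': ([[622, 50], [64, 14]], [[197, 27], [22, 4]]),
    'GaussianProcessClassifier': ([[672, 0], [0, 78]], [[212, 12], [22, 4]]),
    'DecisionTreeClassifier': ([[672, 0], [0, 78]], [[191, 33], [25, 1]]),
}

gaps = {}
for name, (train_cm, val_cm) in matrices.items():
    train_precision = precision_from_matrix(train_cm)
    val_precision = precision_from_matrix(val_cm)
    gaps[name] = train_precision - val_precision
    print(f"{name}: train={train_precision:.3f}, "
          f"validation={val_precision:.3f}, gap={gaps[name]:.3f}")
```

A large positive gap, as for the Gaussian process and the decision tree, means the model reproduces the training labels but does not generalize to the validation set.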

#### Classification reports

In [74]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_benchmark.keys()):
    print("Printing classification reports for classifier %s..." %classifier)
    print("\tTraining set:")
    y_train_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(classification_report(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("\tValidation set:")
    y_val_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(classification_report(y_true=y_val_true, y_pred=y_val_pred))


Printing classification reports for classifier Perceptron...
	Training set:
              precision    recall  f1-score   support

           0       0.91      0.93      0.92       672
           1       0.22      0.18      0.20        78

    accuracy                           0.85       750
   macro avg       0.56      0.55      0.56       750
weighted avg       0.84      0.85      0.84       750

	Validation set:
              precision    recall  f1-score   support

           0       0.90      0.88      0.89       224
           1       0.13      0.15      0.14        26

    accuracy                           0.80       250
   macro avg       0.51      0.52      0.51       250
weighted avg       0.82      0.80      0.81       250

Printing classification reports for classifier LogisticRegression...
	Training set:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95       672
           1       0.00      0.00      0.00        78

    a

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each fitted classifier:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient
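
For reference, the less familiar of these metrics can be written out from the four confusion-matrix counts. This is a minimal sketch using the Perceptron training matrix printed earlier; it relies on the standard textbook definitions of F1 and the Matthews correlation coefficient, not on anything specific to this notebook.

```python
import math

# Perceptron training confusion matrix from the benchmark run: [[622, 50], [64, 14]].
tn, fp, fn, tp = 622, 50, 64, 14

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# Matthews correlation coefficient: +1 is perfect, 0 is chance level, -1 is fully inverted.
mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

print(f"accuracy = {accuracy:.3f}, F1 = {f1:.3f}, MCC = {mcc:.3f}")
```

With heavily imbalanced classes, accuracy stays high even for a weak model, while F1 and MCC stay low, which is why they are recorded alongside it.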

#### Main metrics

In [75]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_benchmark.keys()):
    print("Calculating predictions and performance for classifier %s..." %classifier)
    # Training set:
    y_train_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    fitted_classifiers_benchmark[classifier]['Training metrics'] = OrderedDict({})
    fitted_classifiers_benchmark[classifier]['Training metrics']['accuracy'] = \
        accuracy_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['precision'] = \
        precision_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['recall'] = \
        recall_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['F1'] = \
        f1_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['log_loss'] = \
        log_loss(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_benchmark[classifier]['Training metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_train_true, y_pred=y_train_pred)
    # Validation set:
    y_val_pred = fitted_classifiers_benchmark[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    fitted_classifiers_benchmark[classifier]['Validation metrics'] = OrderedDict({})
    fitted_classifiers_benchmark[classifier]['Validation metrics']['accuracy'] = \
        accuracy_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['precision'] = \
        precision_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['recall'] = \
        recall_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['F1'] = \
        f1_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['log_loss'] = \
        log_loss(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_benchmark[classifier]['Validation metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_val_true, y_pred=y_val_pred)


Calculating predictions and performance for classifier Perceptron...
Calculating predictions and performance for classifier LogisticRegression...
Calculating predictions and performance for classifier PassiveAggressiveClassifier...
Calculating predictions and performance for classifier SVC...
Calculating predictions and performance for classifier KNeighborsClassifier...
Calculating predictions and performance for classifier GaussianProcessClassifier...
Calculating predictions and performance for classifier DecisionTreeClassifier...
Calculating predictions and performance for classifier RandomForestClassifier...
Calculating predictions and performance for classifier AdaBoostClassifier...
Calculating predictions and performance for classifier GradientBoostingClassifier...


In [76]:
fitted_classifiers_benchmark

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.019948244094848633),
                           ('Training metrics',
                            OrderedDict([('accuracy', 0.848),
                                         ('precision', 0.21875),
                                         ('recall', 0.1794871794871795),
                                         ('F1', 0.19718309859154928),
                                         ('log_loss', 5.478635315145807),
                                         ('MCC', 0.11481799238319702)])),
                           ('Validation metrics',
                            OrderedDict([('accuracy', 0.804),
                                         ('precision', 0.12903225806451613),
                                         ('recall', 0.15384615384615385),
                                         ('F1', 0.14035087719298245),
                                         ('log_l

**OBSERVATION:** <font color='red'>**VERY BAD RESULTS**</font>

We still have the problem of overfitting.
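
A note on the `log_loss` values recorded above: they are computed from hard 0/1 predictions rather than from probabilities. In that case every misclassified sample contributes the same clipped penalty, so the score is essentially a rescaled error rate. The sketch below reproduces the Perceptron training value under the assumption that the installed scikit-learn clips probabilities at float64 machine epsilon; the notebook does not state this, but it is consistent with the numbers shown.

```python
import math
import sys

# Assumption: predictions of exactly 0 or 1 are clipped to [eps, 1 - eps],
# with eps equal to float64 machine epsilon (about 2.22e-16).
eps = sys.float_info.epsilon
penalty_per_error = -math.log(eps)   # cost of one confidently wrong prediction

accuracy = 0.848                     # Perceptron, training set (from the output above)
error_rate = 1.0 - accuracy

# Correct predictions cost about -log(1 - eps), which is negligible,
# so only the misclassified samples contribute.
hard_label_log_loss = error_rate * penalty_per_error
print(f"{hard_label_log_loss:.4f}")
```

For `log_loss` to carry information beyond accuracy, it would need class probabilities (for example from `predict_proba`, for the estimators that provide it) instead of predicted labels.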

#### Focus on `precision`

We now set the focus on `precision`, as it is the metric we are more interested in.

In [77]:
precision_results = pd.DataFrame(index=fitted_classifiers_benchmark.keys())
precision_results

Perceptron
LogisticRegression
PassiveAggressiveClassifier
SVC
KNeighborsClassifier
GaussianProcessClassifier
DecisionTreeClassifier
RandomForestClassifier
AdaBoostClassifier
GradientBoostingClassifier


In [78]:
precision_results['BM_tr_precision'] = np.nan
precision_results['BM_val_precision'] = np.nan
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision
Perceptron,,
LogisticRegression,,
PassiveAggressiveClassifier,,
SVC,,
KNeighborsClassifier,,
GaussianProcessClassifier,,
DecisionTreeClassifier,,
RandomForestClassifier,,
AdaBoostClassifier,,
GradientBoostingClassifier,,


In [79]:
for clf in fitted_classifiers_benchmark.keys():
    precision_results.loc[clf, 'BM_tr_precision'] = \
        fitted_classifiers_benchmark[clf]['Training metrics']['precision']
    precision_results.loc[clf, 'BM_val_precision'] = \
        fitted_classifiers_benchmark[clf]['Validation metrics']['precision']
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision
Perceptron,0.21875,0.129032
LogisticRegression,0.0,0.0
PassiveAggressiveClassifier,0.24,0.125
SVC,0.0,0.0
KNeighborsClassifier,0.333333,0.0
GaussianProcessClassifier,1.0,0.25
DecisionTreeClassifier,1.0,0.029412
RandomForestClassifier,1.0,1.0
AdaBoostClassifier,0.777778,0.153846
GradientBoostingClassifier,1.0,0.166667


In [80]:
print("TRAINING / VALIDATION PRECISION RESULTS, OFF-THE-SHELF CLASSIFIERS")
for idx in (precision_results.index):
    print("%s: %.2f / %.2f" %(idx,
                              precision_results.loc[idx, 'BM_tr_precision'],
                              precision_results.loc[idx, 'BM_val_precision']))

TRAINING / VALIDATION PRECISION RESULTS, OFF-THE-SHELF CLASSIFIERS
Perceptron: 0.22 / 0.13
LogisticRegression: 0.00 / 0.00
PassiveAggressiveClassifier: 0.24 / 0.12
SVC: 0.00 / 0.00
KNeighborsClassifier: 0.33 / 0.00
GaussianProcessClassifier: 1.00 / 0.25
DecisionTreeClassifier: 1.00 / 0.03
RandomForestClassifier: 1.00 / 1.00
AdaBoostClassifier: 0.78 / 0.15
GradientBoostingClassifier: 1.00 / 0.17


**OBSERVATION:** all the algorithms are clearly performing badly, in two respects:

- All suffer heavily from the imbalance of the classes.
- Many of them overfit heavily, some in an extreme way.

**CONCLUSIONS: OFF-THE-SHELF CLASSIFIERS:**
- Perceptron: bad precision results (0.22 / 0.13), but it is not overfitting much. **Probably suffering from imbalanced classes**.
- LogisticRegression: extremely bad results, both in training and in validation: it never predicts the positive class. **Probably suffering from imbalanced classes in an extreme way**. It does not even have the opportunity to overfit.
- PassiveAggressiveClassifier: bad results, both in training and in validation (0.24 / 0.12), in line with those of `Perceptron`. **Probably suffering from imbalanced classes**.
- SVC: extremely bad results, both in training and in validation: it never predicts the positive class. **Probably suffering from imbalanced classes in an extreme way**. It does not even have the opportunity to overfit.
- KNeighborsClassifier: low precision in training (0.33, from only 6 positive predictions) and null precision in validation (a single positive prediction, which is wrong). **It almost always predicts the majority class**.
- GaussianProcessClassifier: bad precision results in validation (0.25), perfect in training. **Probably suffering from overfitting**.
- DecisionTreeClassifier: perfect precision in training, 0.03 in validation. **Suffering a lot from overfitting**. The training does not seem to be impacted by imbalanced classes.
- RandomForestClassifier: perfect precision in both training and validation (1.00 / 1.00), apparently not impacted during training by imbalanced classes. The validation value may rest on very few positive predictions, so it should be checked against recall before drawing conclusions.
- AdaBoostClassifier: good results in training (0.78), very bad in validation (0.15), probably **suffering a lot from overfitting**.
- GradientBoostingClassifier: perfect results in training, very low in validation (0.17). **Extreme overfitting**.
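
To put these precision values in context, it helps to compare them with the share of positive samples, since a classifier that labels everything as positive obtains a precision equal to that share. The sketch below takes the class counts from the confusion matrices above; the `observed_val_precision` dictionary is just a hand-copied subset of the results table.

```python
# Class counts taken from the confusion matrices (row sums give the true class sizes).
n_neg_train, n_pos_train = 672, 78
n_neg_val, n_pos_val = 224, 26

prevalence_train = n_pos_train / (n_neg_train + n_pos_train)
prevalence_val = n_pos_val / (n_neg_val + n_pos_val)

# Labelling every sample as positive yields precision == prevalence,
# so this is the level a useful model has to beat.
baseline_precision_val = prevalence_val

observed_val_precision = {
    "Perceptron": 0.129032,
    "PassiveAggressiveClassifier": 0.125,
    "AdaBoostClassifier": 0.153846,
}
for name, p in observed_val_precision.items():
    lift = p / baseline_precision_val
    print(f"{name}: {p:.3f} vs baseline {baseline_precision_val:.3f} (lift {lift:.2f}x)")
```

Both sets have roughly 10% positives, so validation precisions between 0.12 and 0.15 are only marginally better than the trivial baseline.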


### Save the models

In [81]:
with open(MODELS_FOLDER + "fitted_clf_ots.pickle", 'wb') as f:
    pickle.dump(fitted_classifiers_benchmark, f)

## Optimize all the _off-the-shelf_ classifiers (imbalanced classes)

We now try a very simple and quick optimization of those off-the-shelf classifiers, to see if the situation improves for some of them.

We choose the grid values around the default values, when possible.
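
The rule of thumb for building each grid (default value, plus a fraction and a multiple of it) can be sketched as follows. The helper name `around_default` is hypothetical and is not part of the notebook; the defaults quoted are those of scikit-learn's `LogisticRegression` (`C=1.0`) and `Perceptron` (`alpha=0.0001`).

```python
def around_default(default, factor=2):
    """Return [default / factor, default, default * factor]."""
    return [default / factor, default, default * factor]

# Regularization strength C (default 1.0): half and twice the default.
c_grid = around_default(1.0)

# Perceptron's alpha (default 0.0001): one decade on either side.
alpha_grid = around_default(0.0001, factor=10)

print(c_grid)
print(alpha_grid)
```

Parameters without a numeric default (penalty type, kernel, criterion) cannot be generated this way and are simply enumerated by hand in the grids.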

In [82]:
list(OFF_THE_SHELF_CLASSIFIERS.keys())

['Perceptron',
 'LogisticRegression',
 'PassiveAggressiveClassifier',
 'SVC',
 'KNeighborsClassifier',
 'GaussianProcessClassifier',
 'DecisionTreeClassifier',
 'RandomForestClassifier',
 'AdaBoostClassifier',
 'GradientBoostingClassifier']

In [83]:
OFF_THE_SHELF_CLASSIFIERS

OrderedDict([('Perceptron',
              {'clf': Perceptron(),
               'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                'alpha': [0.001, 0.0001, 1e-05],
                'l1_ratio': [None, 0.075, 0.15, 0.3],
                'max_iter': [500, 1000, 2000],
                'random_state': [11]}}),
             ('LogisticRegression',
              {'clf': LogisticRegression(),
               'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                'C': [0.5, 1.0, 2.0],
                'l1_ratio': [None, 0.075, 0.15, 0.3],
                'solver': ['saga'],
                'max_iter': [50, 100, 200],
                'random_state': [11]}}),
             ('PassiveAggressiveClassifier',
              {'clf': PassiveAggressiveClassifier(),
               'param_grid': {'C': [0.5, 1.0, 2.0],
                'max_iter': [500, 1000, 2000],
                'loss': ['hinge', 'squared_hinge'],
                'random_state': [11]}}),
             ('SVC',
   

### Optimize classifiers (imbalanced target classes)

We now try a simple optimization of these off-the-shelf classifiers, targeting the `precision` metric for maximization.
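
Before launching the searches, it can help to estimate how many model fits each grid implies. A minimal sketch, using the Perceptron grid displayed above and assuming 3-fold cross-validation with a final refit, as in the `GridSearchCV` call used in this notebook:

```python
from math import prod

# Perceptron grid as displayed above.
param_grid = {
    "penalty": ["l1", "l2", "elasticnet"],
    "alpha": [0.001, 0.0001, 1e-05],
    "l1_ratio": [None, 0.075, 0.15, 0.3],
    "max_iter": [500, 1000, 2000],
    "random_state": [11],
}
n_folds = 3  # assumed to match cv=3 in the grid search

# An exhaustive grid search evaluates every combination of values.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_folds + 1   # +1 for the final refit on the full training set

print(n_candidates, n_fits)
```

The count grows multiplicatively with each added parameter, which is one plausible reason why the ensemble models, with larger grids and costlier fits, take far longer than the linear ones.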

In [84]:
warnings.simplefilter('ignore')
fitted_classifiers_opt = OrderedDict()
for classifier in list(OFF_THE_SHELF_CLASSIFIERS.keys()):
    print("Optimizing and fitting classifier %s..." %classifier)
    clf = copy.deepcopy(OFF_THE_SHELF_CLASSIFIERS[classifier]['clf'])
    param_grid = OFF_THE_SHELF_CLASSIFIERS[classifier]['param_grid']
    # Optimize with training data:
    cv = GridSearchCV(clf, param_grid=param_grid, scoring='precision', cv=3, refit=True)
    start_time = time()
    cv.fit(s4_train_rel[rel_features], s4_train_rel['Pulsating'])
    end_time = time()
    elapsed_time = end_time - start_time
    print("... completed. Elapsed time: %.3f seconds" %elapsed_time)
    # Add the best fitted classifier to the dictionary:
    fitted_classifiers_opt[classifier] = OrderedDict({
        'Fitted_clf': copy.deepcopy(cv.best_estimator_),
        'OptTrain_time': elapsed_time
    })

Optimizing and fitting classifier Perceptron...
... completed. Elapsed time: 1.435 seconds
Optimizing and fitting classifier LogisticRegression...
... completed. Elapsed time: 12.924 seconds
Optimizing and fitting classifier PassiveAggressiveClassifier...
... completed. Elapsed time: 0.242 seconds
Optimizing and fitting classifier SVC...
... completed. Elapsed time: 3.444 seconds
Optimizing and fitting classifier KNeighborsClassifier...
... completed. Elapsed time: 2.663 seconds
Optimizing and fitting classifier GaussianProcessClassifier...
... completed. Elapsed time: 14.203 seconds
Optimizing and fitting classifier DecisionTreeClassifier...
... completed. Elapsed time: 7.709 seconds
Optimizing and fitting classifier RandomForestClassifier...
... completed. Elapsed time: 929.023 seconds
Optimizing and fitting classifier AdaBoostClassifier...
... completed. Elapsed time: 8.699 seconds
Optimizing and fitting classifier GradientBoostingClassifier...
... completed. Elapsed time: 2055.440 

In [85]:
fitted_classifiers_opt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf',
                            Perceptron(alpha=0.001, l1_ratio=0.3, max_iter=500, penalty='elasticnet',
                                       random_state=11)),
                           ('OptTrain_time', 1.4351637363433838)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf',
                            LogisticRegression(C=0.5, max_iter=50, penalty='l1', random_state=11,
                                               solver='saga')),
                           ('OptTrain_time', 12.92373275756836)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf',
                            PassiveAggressiveClassifier(C=2.0, loss='squared_hinge', max_iter=500,
                                                        random_state=11)),
                           ('OptTrain_time', 0.24235224723815918)])),
             ('SVC',
              OrderedDict([('Fitted_clf',

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each of the optimized and fitted classifiers:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient

### Performance measurements

In [86]:
fitted_classifiers_opt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf',
                            Perceptron(alpha=0.001, l1_ratio=0.3, max_iter=500, penalty='elasticnet',
                                       random_state=11)),
                           ('OptTrain_time', 1.4351637363433838)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf',
                            LogisticRegression(C=0.5, max_iter=50, penalty='l1', random_state=11,
                                               solver='saga')),
                           ('OptTrain_time', 12.92373275756836)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf',
                            PassiveAggressiveClassifier(C=2.0, loss='squared_hinge', max_iter=500,
                                                        random_state=11)),
                           ('OptTrain_time', 0.24235224723815918)])),
             ('SVC',
              OrderedDict([('Fitted_clf',

Let's see the confusion matrices and classification reports for all the classifiers:

#### Confusion matrices

In [87]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_opt.keys()):
    print("\n\nPrinting confusion matrices for classifier %s..." %classifier)
    print("Training set:")
    y_train_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(confusion_matrix(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("Validation set:")
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(confusion_matrix(y_true=y_val_true, y_pred=y_val_pred))



Printing confusion matrices for classifier Perceptron...
Training set:
[[604  68]
 [ 69   9]]
Validation set:
[[196  28]
 [ 23   3]]


Printing confusion matrices for classifier LogisticRegression...
Training set:
[[672   0]
 [ 78   0]]
Validation set:
[[224   0]
 [ 26   0]]


Printing confusion matrices for classifier PassiveAggressiveClassifier...
Training set:
[[655  17]
 [ 72   6]]
Validation set:
[[212  12]
 [ 25   1]]


Printing confusion matrices for classifier SVC...
Training set:
[[666   6]
 [ 78   0]]
Validation set:
[[220   4]
 [ 26   0]]


Printing confusion matrices for classifier KNeighborsClassifier...
Training set:
[[672   0]
 [  0  78]]
Validation set:
[[216   8]
 [ 24   2]]


Printing confusion matrices for classifier GaussianProcessClassifier...
Training set:
[[672   0]
 [ 78   0]]
Validation set:
[[224   0]
 [ 26   0]]


Printing confusion matrices for classifier DecisionTreeClassifier...
Training set:
[[659  13]
 [ 49  29]]
Validation set:
[[215   9]
 [ 25   1]]


**OBSERVATION:** Again, the main problem here is that most classifiers overfit to the training set, and some of them even predict just one single class (non-pulsating).
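
Classifiers that collapse to a single predicted class can be flagged automatically from the column sums of their confusion matrices. The sketch below uses a hand-copied subset of the validation matrices printed above; `predicted_counts` is an illustrative helper, not a function defined elsewhere in the notebook.

```python
# Validation confusion matrices copied from the output above
# (rows = true class, columns = predicted class).
val_matrices = {
    "Perceptron": [[196, 28], [23, 3]],
    "LogisticRegression": [[224, 0], [26, 0]],
    "SVC": [[220, 4], [26, 0]],
    "GaussianProcessClassifier": [[224, 0], [26, 0]],
}

def predicted_counts(cm):
    """Column sums: how many samples were assigned to each class."""
    return [cm[0][0] + cm[1][0], cm[0][1] + cm[1][1]]

# A classifier has collapsed when one of the classes is never predicted.
single_class = [name for name, cm in val_matrices.items()
                if 0 in predicted_counts(cm)]
print(single_class)
```

Note that `SVC` is not flagged: it does predict a few positives, but none of them is correct, so its precision is still zero.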

#### Classification reports

In [88]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_opt.keys()):
    print("Printing classification reports for classifier %s..." %classifier)
    print("\tTraining set:")
    y_train_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(classification_report(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("\tValidation set:")
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(classification_report(y_true=y_val_true, y_pred=y_val_pred))


Printing classification reports for classifier Perceptron...
	Training set:
              precision    recall  f1-score   support

           0       0.90      0.90      0.90       672
           1       0.12      0.12      0.12        78

    accuracy                           0.82       750
   macro avg       0.51      0.51      0.51       750
weighted avg       0.82      0.82      0.82       750

	Validation set:
              precision    recall  f1-score   support

           0       0.89      0.88      0.88       224
           1       0.10      0.12      0.11        26

    accuracy                           0.80       250
   macro avg       0.50      0.50      0.50       250
weighted avg       0.81      0.80      0.80       250

Printing classification reports for classifier LogisticRegression...
	Training set:
              precision    recall  f1-score   support

           0       0.90      1.00      0.95       672
           1       0.00      0.00      0.00        78

    a

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each of the optimized and fitted classifiers:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient

#### Main metrics

In [89]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_opt.keys()):
    print("Calculating predictions and performance for classifier %s..." %classifier)
    # Training set:
    y_train_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    fitted_classifiers_opt[classifier]['Training metrics'] = OrderedDict({})
    fitted_classifiers_opt[classifier]['Training metrics']['accuracy'] = \
        accuracy_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['precision'] = \
        precision_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['recall'] = \
        recall_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['F1'] = \
        f1_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['log_loss'] = \
        log_loss(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_opt[classifier]['Training metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_train_true, y_pred=y_train_pred)
    # Validation set:
    y_val_pred = fitted_classifiers_opt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    fitted_classifiers_opt[classifier]['Validation metrics'] = OrderedDict({})
    fitted_classifiers_opt[classifier]['Validation metrics']['accuracy'] = \
        accuracy_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['precision'] = \
        precision_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['recall'] = \
        recall_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['F1'] = \
        f1_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['log_loss'] = \
        log_loss(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_opt[classifier]['Validation metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_val_true, y_pred=y_val_pred)


Calculating predictions and performance for classifier Perceptron...
Calculating predictions and performance for classifier LogisticRegression...
Calculating predictions and performance for classifier PassiveAggressiveClassifier...
Calculating predictions and performance for classifier SVC...
Calculating predictions and performance for classifier KNeighborsClassifier...
Calculating predictions and performance for classifier GaussianProcessClassifier...
Calculating predictions and performance for classifier DecisionTreeClassifier...
Calculating predictions and performance for classifier RandomForestClassifier...
Calculating predictions and performance for classifier AdaBoostClassifier...
Calculating predictions and performance for classifier GradientBoostingClassifier...


In [90]:
fitted_classifiers_opt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf',
                            Perceptron(alpha=0.001, l1_ratio=0.3, max_iter=500, penalty='elasticnet',
                                       random_state=11)),
                           ('OptTrain_time', 1.4351637363433838),
                           ('Training metrics',
                            OrderedDict([('accuracy', 0.8173333333333334),
                                         ('precision', 0.11688311688311688),
                                         ('recall', 0.11538461538461539),
                                         ('F1', 0.11612903225806452),
                                         ('log_loss', 6.583974019078733),
                                         ('MCC', 0.01427539397998512)])),
                           ('Validation metrics',
                            OrderedDict([('accuracy', 0.796),
                                         ('precision', 0.0967741935483871),
                        

**OBSERVATION:** <font color='red'>**ONCE AGAIN, VERY BAD RESULTS EVEN WITH OPTIMIZATION**</font>

The problem here could be that all classifiers are suffering from overfitting and/or from the imbalanced dataset.

#### Focus on `precision`

We now set the focus on `precision`, as it is the metric we are more interested in.

In [91]:
precision_results['BMOPT_tr_precision'] = np.nan
precision_results['BMOPT_val_precision'] = np.nan
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision,BMOPT_tr_precision,BMOPT_val_precision
Perceptron,0.21875,0.129032,,
LogisticRegression,0.0,0.0,,
PassiveAggressiveClassifier,0.24,0.125,,
SVC,0.0,0.0,,
KNeighborsClassifier,0.333333,0.0,,
GaussianProcessClassifier,1.0,0.25,,
DecisionTreeClassifier,1.0,0.029412,,
RandomForestClassifier,1.0,1.0,,
AdaBoostClassifier,0.777778,0.153846,,
GradientBoostingClassifier,1.0,0.166667,,


In [92]:
for clf in fitted_classifiers_opt.keys():
    precision_results.loc[clf, 'BMOPT_tr_precision'] = \
        fitted_classifiers_opt[clf]['Training metrics']['precision']
    precision_results.loc[clf, 'BMOPT_val_precision'] = \
        fitted_classifiers_opt[clf]['Validation metrics']['precision']
precision_results

Unnamed: 0,BM_tr_precision,BM_val_precision,BMOPT_tr_precision,BMOPT_val_precision
Perceptron,0.21875,0.129032,0.116883,0.096774
LogisticRegression,0.0,0.0,0.0,0.0
PassiveAggressiveClassifier,0.24,0.125,0.26087,0.076923
SVC,0.0,0.0,0.0,0.0
KNeighborsClassifier,0.333333,0.0,1.0,0.2
GaussianProcessClassifier,1.0,0.25,0.0,0.0
DecisionTreeClassifier,1.0,0.029412,0.690476,0.1
RandomForestClassifier,1.0,1.0,0.0,0.0
AdaBoostClassifier,0.777778,0.153846,0.947368,0.333333
GradientBoostingClassifier,1.0,0.166667,0.0,0.0


In [93]:
print("TRAINING / VALIDATION PRECISION RESULTS, SLIGHTLY OPTIMIZED OFF-THE-SHELF CLASSIFIERS")
for idx in (precision_results.index):
    print("%s: %.2f / %.2f" %(idx,
                              precision_results.loc[idx, 'BMOPT_tr_precision'],
                              precision_results.loc[idx, 'BMOPT_val_precision']))

TRAINING / VALIDATION PRECISION RESULTS, SLIGHTLY OPTIMIZED OFF-THE-SHELF CLASSIFIERS
Perceptron: 0.12 / 0.10
LogisticRegression: 0.00 / 0.00
PassiveAggressiveClassifier: 0.26 / 0.08
SVC: 0.00 / 0.00
KNeighborsClassifier: 1.00 / 0.20
GaussianProcessClassifier: 0.00 / 0.00
DecisionTreeClassifier: 0.69 / 0.10
RandomForestClassifier: 0.00 / 0.00
AdaBoostClassifier: 0.95 / 0.33
GradientBoostingClassifier: 0.00 / 0.00


**OBSERVATION:** so, it is clear that all the algorithms are still performing badly, in two respects:

- All of them are suffering from the imbalanced nature of the classes.
- Several of them show some kind of overfitting, some in an extreme way.

**CONCLUSIONS: IMPROVEMENT/WORSENING WITH A SINGLE OPTIMIZATION OF OFF-THE-SHELF CLASSIFIERS:**
- Perceptron: **optimization has worsened the results a little** (from 0.22 / 0.13 to 0.12 / 0.10), and **overfitting / imbalance is still a problem** (but not as severe as in other cases, and it is still the most train/validation-balanced of all the classifiers).
- LogisticRegression: **no improvement at all with optimization**. Again, extremely bad results, both in training and in validation. **Probably suffering from imbalanced classes in an extreme way**. It does not even have the opportunity to overfit.
- PassiveAggressiveClassifier: **mixed result with optimization**: training precision improves slightly (from 0.24 to 0.26) but validation precision drops (from 0.12 to 0.08). Performance is in line with that of `Perceptron`, with a little more overfitting. **Probably suffering from imbalanced classes**.
- SVC: **no improvement in precision with optimization** (still 0.00 / 0.00). It now predicts a few positives (6 in training, 4 in validation), but all of them are wrong. **Probably suffering from imbalanced classes in an extreme way**.
- KNeighborsClassifier: **with optimization, results seem more reasonable**. Perfect precision in training, low precision in validation (0.20). **Extreme overfitting problem**.
- GaussianProcessClassifier: **much worse with optimization**. Both `precision` values go to zero, as it now predicts only the negative class: **extreme problem with imbalanced classes.** It does not even have the opportunity to overfit.
- DecisionTreeClassifier: **with optimization, it seems to overfit less** (from 1.00 / 0.03 to 0.69 / 0.10), but the `precision` in validation remains very low. **Probably still suffering a lot from overfitting**.
- RandomForestClassifier: **much worse with optimization**: both `precision` values drop from 1.00 to 0.00. **Probably a problem with imbalanced classes**.
- AdaBoostClassifier: **with optimization, both `precision` values improve** (from 0.78 / 0.15 to 0.95 / 0.33), giving the best validation precision of all the optimized classifiers. However, the gap between training and validation is still large, so it keeps overfitting and still seems to suffer from imbalanced classes.
- GradientBoostingClassifier: **optimization has worsened the results**. Zero precision for both the training and validation sets. **Why?**
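
The comparison above can be made more systematic by computing the gap between training and validation precision for each classifier. This is only a rough overfitting indicator, sketched here with values hand-copied from the results table; classifiers with 0.00 / 0.00 are left out because their gap is trivially zero.

```python
# (training precision, validation precision) after optimization, from the table above.
opt_precision = {
    "Perceptron": (0.116883, 0.096774),
    "PassiveAggressiveClassifier": (0.26087, 0.076923),
    "KNeighborsClassifier": (1.0, 0.2),
    "DecisionTreeClassifier": (0.690476, 0.1),
    "AdaBoostClassifier": (0.947368, 0.333333),
}

# Gap between training and validation precision: larger means more overfitting.
gaps = {name: train - val for name, (train, val) in opt_precision.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: gap = {gap:.2f}")

# Classifier with the highest validation precision.
best_val = max(opt_precision, key=lambda name: opt_precision[name][1])
print("Best validation precision:", best_val)
```

A small gap is only meaningful together with a reasonable validation precision: `Perceptron` has the smallest gap here, but its precision is at chance level on both sets.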

### Save the models

In [94]:
with open(MODELS_FOLDER + "fitted_clf_opt.pickle", 'wb') as f:
    pickle.dump(fitted_classifiers_opt, f)

## Applying weights to the samples

We now try to apply weights to the samples. Note that not all of the classifiers have the `class_weight` parameter, so we leave it out for those that lack it, namely `KNeighborsClassifier`, `GaussianProcessClassifier` and `GradientBoostingClassifier`. For those, we just redo the optimization without that parameter (as `random_state` is the same, they should yield the same results).

For all the others, `class_weight` is set to `balanced` (i.e. weights inversely proportional to the class frequencies), with these exceptions:
- For `RandomForestClassifier` we try the two possible options: `balanced` and `balanced_subsample`. In the second case, the balanced weights are calculated individually for each bootstrap sample chosen for each of the trees.
- Although `AdaBoostClassifier` does not have a `class_weight` parameter per se, we will set the parameter in the base estimator of the algorithm (`DecisionTreeClassifier`).
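
To make the effect of `balanced` concrete, the weights can be worked out by hand for this training set. The sketch below applies the heuristic documented for scikit-learn's `class_weight='balanced'`, `n_samples / (n_classes * count_per_class)`, to the class counts seen in the training confusion matrices; the variable names are illustrative.

```python
# Class counts in the training set (672 negatives, 78 positives).
class_counts = {0: 672, 1: 78}
n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
balanced_weights = {label: n_samples / (n_classes * count)
                    for label, count in class_counts.items()}

for label, weight in balanced_weights.items():
    print(f"class {label}: weight = {weight:.3f}")
```

Each pulsating sample therefore counts roughly 8.6 times as much as a non-pulsating one, so that both classes contribute the same total weight to the loss.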

Additionally, we revert to the original (off-the-shelf) classifiers where optimization has worsened the results.

For that, we redefine the dictionary with models and parameter grids.

In [97]:
OFF_THE_SHELF_CLASSIFIERS_CLWT = OrderedDict({
#    'Perceptron': {
#        'clf': Perceptron(),
#        'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
#                       'alpha': [0.001, 0.0001, 0.00001],
#                       'l1_ratio': [None, 0.075, 0.15, 0.30],
#                       'max_iter': [500, 1000, 2000],
#                       'class_weight': ['balanced'],
#                       'random_state': [RANDOM_STATE]
#                      }
#    },
    'Perceptron': {
        'clf': Perceptron(),
        'param_grid': {}
    },
    'LogisticRegression': {
        'clf': LogisticRegression(),
        'param_grid': {'penalty': ['l1', 'l2', 'elasticnet'],
                       'C': [0.5, 1.0, 2.0],
                       'l1_ratio': [None, 0.075, 0.15, 0.30],
                       'solver': ['saga'],
                       'max_iter': [50, 100, 200],
                       'class_weight': ['balanced'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'PassiveAggressiveClassifier': {
        'clf': PassiveAggressiveClassifier(),
        'param_grid': {'C': [0.5, 1.0, 2.0],
                       'max_iter': [500, 1000, 2000],
                       'loss': ['hinge', 'squared_hinge'],
                       'class_weight': ['balanced'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'SVC': {
        'clf': SVC(),
        'param_grid': {'C': [0.5, 1.0, 2.0],
                       # 'precomputed' is not included: it expects a square Gram matrix, not raw features
                       'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
                       'degree': [2, 3, 6],
                       'class_weight': ['balanced'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'KNeighborsClassifier': {
        'clf': KNeighborsClassifier(),
        'param_grid': {'n_neighbors': [1, 3, 5, 10],
                       'weights': ['uniform', 'distance'],
                       'algorithm': ['ball_tree', 'kd_tree', 'brute'],
                       'p': [1, 2]
                      }
    },
#    'GaussianProcessClassifier': {
#        'clf': GaussianProcessClassifier(),
#        'param_grid': {'kernel': [RBF(), RationalQuadratic(), DotProduct()],
#                       'max_iter_predict': [50, 100, 200],
#                       'random_state': [RANDOM_STATE]
#                      }
#    },
    'GaussianProcessClassifier': {
        'clf': GaussianProcessClassifier(),
        'param_grid': {}
    },
    'DecisionTreeClassifier': {
        'clf': DecisionTreeClassifier(),
        'param_grid': {'criterion': ['gini', 'entropy', 'log_loss'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'max_features': [None, 'sqrt', 'log2'],
                       'class_weight': ['balanced'],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'RandomForestClassifier': {
        'clf': RandomForestClassifier(),
        'param_grid': {'n_estimators': [50, 100, 200],
                       'criterion': ['gini', 'entropy', 'log_loss'],
                       'max_depth': [25, 50, 100],
                       'min_samples_leaf': [5, 10, 20],
                       'max_features': [None, 'sqrt', 'log2'],
                       'ccp_alpha': [0.005, 0.015, 0.030],
                       'class_weight': ['balanced', 'balanced_subsample'],
                       'random_state': [RANDOM_STATE]
                      }
    },
    'AdaBoostClassifier': {
        'clf': AdaBoostClassifier(),
        'param_grid': {'estimator': [DecisionTreeClassifier(max_depth=1, class_weight='balanced')],
                       'n_estimators': [25, 50, 100],
                       'learning_rate': [0.5, 1.0, 2.0],
                       'algorithm': ['SAMME', 'SAMME.R'],
                       'random_state': [RANDOM_STATE]
                      }
    },
#    'GradientBoostingClassifier': {
#        'clf': GradientBoostingClassifier(),
#        'param_grid': {'loss': ['log_loss', 'deviance'],
#                       'learning_rate': [0.05, 0.1, 0.2],
#                       'n_estimators': [25, 50, 100],
#                       'criterion': ['friedman_mse', 'squared_error'],
#                       'max_depth': [25, 50, 100],
#                       'min_samples_leaf': [5, 10, 20],
#                       'ccp_alpha': [0.005, 0.015, 0.030],
#                       'random_state': [RANDOM_STATE]
#                      }
#    }
    'GradientBoostingClassifier': {
        'clf': GradientBoostingClassifier(),
        'param_grid': {}
    }
})


### Optimize classifiers (with class weights)

In [98]:
warnings.simplefilter('ignore')
fitted_classifiers_clwt = OrderedDict()
for classifier in list(OFF_THE_SHELF_CLASSIFIERS_CLWT.keys()):
    print("Optimizing and fitting classifier %s with class weights..." %classifier)
    clf = copy.deepcopy(OFF_THE_SHELF_CLASSIFIERS_CLWT[classifier]['clf'])
    param_grid = OFF_THE_SHELF_CLASSIFIERS_CLWT[classifier]['param_grid']
    # Optimize with training data:
    cv = GridSearchCV(clf, param_grid=param_grid, scoring='precision', cv=3, refit=True)
    start_time = time()
    cv.fit(s4_train_rel[rel_features], s4_train_rel['Pulsating'])
    end_time = time()
    elapsed_time = end_time - start_time
    print("... completed. Elapsed time: %.3f seconds" %elapsed_time)
    # Add the best fitted classifier to the dictionary:
    fitted_classifiers_clwt[classifier] = OrderedDict({
        'Fitted_clf': copy.deepcopy(cv.best_estimator_),
        'OptTrain_time': elapsed_time
    })

Optimizing and fitting classifier Perceptron with class weights...
... completed. Elapsed time: 0.023 seconds
Optimizing and fitting classifier LogisticRegression with class weights...
... completed. Elapsed time: 10.622 seconds
Optimizing and fitting classifier PassiveAggressiveClassifier with class weights...
... completed. Elapsed time: 0.258 seconds
Optimizing and fitting classifier SVC with class weights...
... completed. Elapsed time: 2.419 seconds
Optimizing and fitting classifier KNeighborsClassifier with class weights...
... completed. Elapsed time: 2.576 seconds
Optimizing and fitting classifier GaussianProcessClassifier with class weights...
... completed. Elapsed time: 0.380 seconds
Optimizing and fitting classifier DecisionTreeClassifier with class weights...
... completed. Elapsed time: 6.536 seconds
Optimizing and fitting classifier RandomForestClassifier with class weights...
... completed. Elapsed time: 1851.608 seconds
Optimizing and fitting classifier AdaBoostClassif
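
The elapsed times above scale with the number of candidate combinations in each `param_grid`. As a rough, standard-library-only sketch (not part of the original run), the size of the `RandomForestClassifier` grid can be counted as follows. `N_FOLDS` mirrors the `cv=3` argument of the `GridSearchCV` call, the literal `11` stands in for `RANDOM_STATE` (it is the value shown in the fitted estimators), and the variable names are illustrative only.

```python
from math import prod

# Grid copied from the 'RandomForestClassifier' entry of the dictionary above.
# Assumption: RANDOM_STATE is 11, as displayed in the fitted estimators.
rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'criterion': ['gini', 'entropy', 'log_loss'],
    'max_depth': [25, 50, 100],
    'min_samples_leaf': [5, 10, 20],
    'max_features': [None, 'sqrt', 'log2'],
    'ccp_alpha': [0.005, 0.015, 0.030],
    'class_weight': ['balanced', 'balanced_subsample'],
    'random_state': [11],
}
N_FOLDS = 3  # matches cv=3 in the GridSearchCV call

# Number of hyperparameter combinations = product of the list lengths.
n_candidates = prod(len(values) for values in rf_param_grid.values())
# Each candidate is fitted once per fold, plus one final refit (refit=True).
n_fits = n_candidates * N_FOLDS + 1
print(n_candidates, n_fits)
```

A count of this size is one plausible reason why this classifier dominates the total optimization time.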

In [99]:
fitted_classifiers_clwt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.022939205169677734)])),
             ('LogisticRegression',
              OrderedDict([('Fitted_clf',
                            LogisticRegression(C=0.5, class_weight='balanced', penalty='l1',
                                               random_state=11, solver='saga')),
                           ('OptTrain_time', 10.621622323989868)])),
             ('PassiveAggressiveClassifier',
              OrderedDict([('Fitted_clf',
                            PassiveAggressiveClassifier(C=2.0, class_weight='balanced',
                                                        loss='squared_hinge', max_iter=500,
                                                        random_state=11)),
                           ('OptTrain_time', 0.25833940505981445)])),
             ('SVC',
              OrderedDict([('Fitted_clf',
                            SVC(C=0.5, class_wei

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each of the optimized and fitted classifiers:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient
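
As a reminder of what these metrics measure, the sketch below computes them from the four cells of a binary confusion matrix using only the standard library. The counts are those of the `Perceptron` training confusion matrix printed in the "Confusion matrices" section of this notebook; the helper name `binary_metrics` is hypothetical. Log loss is left out because it is defined on predicted probabilities rather than on counts.

```python
from math import sqrt

def binary_metrics(tn, fp, fn, tp):
    """Return the main binary classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Precision: fraction of predicted positives that are true positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of actual positives that are recovered.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'F1': f1, 'MCC': mcc}

# scikit-learn layout for labels (0, 1): [[TN, FP], [FN, TP]]
perceptron_train = binary_metrics(tn=622, fp=50, fn=64, tp=14)
for name, value in perceptron_train.items():
    print("%s: %.6f" % (name, value))
```

These values agree with the `Perceptron` training metrics stored in `fitted_classifiers_clwt`.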

### Performance measurements

In [100]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_clwt.keys()):
    print("Calculating predictions and performance for classifier %s..." %classifier)
    # Training set:
    y_train_pred = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    fitted_classifiers_clwt[classifier]['Training metrics'] = OrderedDict({})
    fitted_classifiers_clwt[classifier]['Training metrics']['accuracy'] = \
        accuracy_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_clwt[classifier]['Training metrics']['precision'] = \
        precision_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_clwt[classifier]['Training metrics']['recall'] = \
        recall_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_clwt[classifier]['Training metrics']['F1'] = \
        f1_score(y_true=y_train_true, y_pred=y_train_pred)
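    # NOTE: log_loss is defined on predicted probabilities; with the hard 0/1 labels
    # returned by predict() it only applies a fixed penalty to each misclassification.
    fitted_classifiers_clwt[classifier]['Training metrics']['log_loss'] = \
        log_loss(y_true=y_train_true, y_pred=y_train_pred)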
    fitted_classifiers_clwt[classifier]['Training metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_train_true, y_pred=y_train_pred)
    # Validation set:
    y_val_pred = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    fitted_classifiers_clwt[classifier]['Validation metrics'] = OrderedDict({})
    fitted_classifiers_clwt[classifier]['Validation metrics']['accuracy'] = \
        accuracy_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['precision'] = \
        precision_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['recall'] = \
        recall_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['F1'] = \
        f1_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['log_loss'] = \
        log_loss(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_val_true, y_pred=y_val_pred)


Calculating predictions and performance for classifier Perceptron...
Calculating predictions and performance for classifier LogisticRegression...
Calculating predictions and performance for classifier PassiveAggressiveClassifier...
Calculating predictions and performance for classifier SVC...
Calculating predictions and performance for classifier KNeighborsClassifier...
Calculating predictions and performance for classifier GaussianProcessClassifier...
Calculating predictions and performance for classifier DecisionTreeClassifier...
Calculating predictions and performance for classifier RandomForestClassifier...
Calculating predictions and performance for classifier AdaBoostClassifier...
Calculating predictions and performance for classifier GradientBoostingClassifier...


In [101]:
fitted_classifiers_clwt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.022939205169677734),
                           ('Training metrics',
                            OrderedDict([('accuracy', 0.848),
                                         ('precision', 0.21875),
                                         ('recall', 0.1794871794871795),
                                         ('F1', 0.19718309859154928),
                                         ('log_loss', 5.478635315145807),
                                         ('MCC', 0.11481799238319702)])),
                           ('Validation metrics',
                            OrderedDict([('accuracy', 0.804),
                                         ('precision', 0.12903225806451613),
                                         ('recall', 0.15384615384615385),
                                         ('F1', 0.14035087719298245),
                                         ('log_l
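
A note on the `log_loss` values: they are computed from hard 0/1 predictions, so they carry no information beyond the error rate. The sketch below is a hedged check of that reading. It assumes that the predicted labels are clipped to float64 machine epsilon before taking the logarithm (the exact clipping constant depends on the scikit-learn version), and it uses the `Perceptron` training error counts (50 false positives and 64 false negatives out of 750 samples) from the confusion matrices in this notebook.

```python
import sys
from math import log

# Assumption: hard 0/1 predictions are clipped to [eps, 1 - eps] with
# eps equal to float64 machine epsilon before the logarithm is taken.
eps = sys.float_info.epsilon

n_samples = 750
n_errors = 50 + 64  # false positives + false negatives (Perceptron, training set)

# A wrong hard prediction costs -log(eps); a correct one costs about 0.
penalty_per_error = -log(eps)
hard_label_log_loss = n_errors / n_samples * penalty_per_error
print("%.6f" % hard_label_log_loss)
```

The result reproduces the recorded training `log_loss` of about 5.4786, which suggests the metric is just the error rate multiplied by a constant here. For classifiers that expose `predict_proba`, passing the predicted probabilities to `log_loss` would give a more informative value.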

**OBSERVATION:** for some of the classifiers, balancing the classes by weighting them seems to have improved the situation.
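
For reference, `class_weight='balanced'` in scikit-learn assigns each class the weight `n_samples / (n_classes * n_samples_in_class)`. The sketch below applies that formula to the class counts of the full training set (672 non-pulsating and 78 pulsating stars, as given by the `support` column of the classification reports). It is only an illustration: inside `GridSearchCV` the weights are recomputed on each training fold, so the actual values differ slightly.

```python
# Class counts of the training set (label 0 = non-pulsating, 1 = pulsating).
n_samples_per_class = {0: 672, 1: 78}

n_samples = sum(n_samples_per_class.values())
n_classes = len(n_samples_per_class)

# 'balanced' heuristic: n_samples / (n_classes * n_samples_in_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in n_samples_per_class.items()
}
for label, weight in balanced_weights.items():
    print("class %d: weight %.4f" % (label, weight))
```

With these counts, an error on a pulsating star weighs roughly 8.6 times as much as an error on a non-pulsating one.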

Let's see the confusion matrices and the classification reports for all the classifiers:

#### Confusion matrices

In [102]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_clwt.keys()):
    print("\n\nPrinting confusion matrices for classifier %s..." %classifier)
    # Training set:
    print("Training set:")
    y_train_pred = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(confusion_matrix(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("Validation set:")
    y_val_pred = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(confusion_matrix(y_true=y_val_true, y_pred=y_val_pred))



Printing confusion matrices for classifier Perceptron...
Training set:
[[622  50]
 [ 64  14]]
Validation set:
[[197  27]
 [ 22   4]]


Printing confusion matrices for classifier LogisticRegression...
Training set:
[[429 243]
 [ 29  49]]
Validation set:
[[110 114]
 [ 14  12]]


Printing confusion matrices for classifier PassiveAggressiveClassifier...
Training set:
[[466 206]
 [ 54  24]]
Validation set:
[[162  62]
 [ 16  10]]


Printing confusion matrices for classifier SVC...
Training set:
[[663   9]
 [ 45  33]]
Validation set:
[[216   8]
 [ 22   4]]


Printing confusion matrices for classifier KNeighborsClassifier...
Training set:
[[672   0]
 [  0  78]]
Validation set:
[[216   8]
 [ 24   2]]


Printing confusion matrices for classifier GaussianProcessClassifier...
Training set:
[[672   0]
 [  0  78]]
Validation set:
[[212  12]
 [ 22   4]]


Printing confusion matrices for classifier DecisionTreeClassifier...
Training set:
[[541 131]
 [  2  76]]
Validation set:
[[162  62]
 [ 15  11]]


**OBSERVATION:** we are still observing low precision values and even extreme overfitting for some of the classifiers.

#### Classification reports

In [103]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_clwt.keys()):
    print("Printing classification reports for classifier %s..." %classifier)
    # Training set:
    print("\tTraining set:")
    y_train_pred = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    print(classification_report(y_true=y_train_true, y_pred=y_train_pred))
    # Validation set:
    print("\tValidation set:")
    y_val_pred = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    print(classification_report(y_true=y_val_true, y_pred=y_val_pred))


Printing classification reports for classifier Perceptron...
	Training set:
              precision    recall  f1-score   support

           0       0.91      0.93      0.92       672
           1       0.22      0.18      0.20        78

    accuracy                           0.85       750
   macro avg       0.56      0.55      0.56       750
weighted avg       0.84      0.85      0.84       750

	Validation set:
              precision    recall  f1-score   support

           0       0.90      0.88      0.89       224
           1       0.13      0.15      0.14        26

    accuracy                           0.80       250
   macro avg       0.51      0.52      0.51       250
weighted avg       0.82      0.80      0.81       250

Printing classification reports for classifier LogisticRegression...
	Training set:
              precision    recall  f1-score   support

           0       0.94      0.64      0.76       672
           1       0.17      0.63      0.26        78

    a

We now validate the performance of each classifier, both on the training set and on the validation set.

We will calculate and record the following metrics for each of the optimized and fitted classifiers:
- Accuracy
- Precision
- Recall
- F1_score
- LogLoss
- Matthews correlation coefficient

#### Main metrics

In [104]:
warnings.simplefilter('ignore')
y_train_true = s4_train_rel['Pulsating']
y_val_true = s4_val_rel['Pulsating']
for classifier in list(fitted_classifiers_clwt.keys()):
    print("Calculating predictions and performance for classifier %s..." %classifier)
    # Training set:
    y_train_pred = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict(s4_train_rel[rel_features])
    fitted_classifiers_clwt[classifier]['Training metrics'] = OrderedDict({})
    fitted_classifiers_clwt[classifier]['Training metrics']['accuracy'] = \
        accuracy_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_clwt[classifier]['Training metrics']['precision'] = \
        precision_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_clwt[classifier]['Training metrics']['recall'] = \
        recall_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_clwt[classifier]['Training metrics']['F1'] = \
        f1_score(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_clwt[classifier]['Training metrics']['log_loss'] = \
        log_loss(y_true=y_train_true, y_pred=y_train_pred)
    fitted_classifiers_clwt[classifier]['Training metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_train_true, y_pred=y_train_pred)
    # Validation set:
    y_val_pred = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict(s4_val_rel[rel_features])
    fitted_classifiers_clwt[classifier]['Validation metrics'] = OrderedDict({})
    fitted_classifiers_clwt[classifier]['Validation metrics']['accuracy'] = \
        accuracy_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['precision'] = \
        precision_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['recall'] = \
        recall_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['F1'] = \
        f1_score(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['log_loss'] = \
        log_loss(y_true=y_val_true, y_pred=y_val_pred)
    fitted_classifiers_clwt[classifier]['Validation metrics']['MCC'] = \
        matthews_corrcoef(y_true=y_val_true, y_pred=y_val_pred)


Calculating predictions and performance for classifier Perceptron...
Calculating predictions and performance for classifier LogisticRegression...
Calculating predictions and performance for classifier PassiveAggressiveClassifier...
Calculating predictions and performance for classifier SVC...
Calculating predictions and performance for classifier KNeighborsClassifier...
Calculating predictions and performance for classifier GaussianProcessClassifier...
Calculating predictions and performance for classifier DecisionTreeClassifier...
Calculating predictions and performance for classifier RandomForestClassifier...
Calculating predictions and performance for classifier AdaBoostClassifier...
Calculating predictions and performance for classifier GradientBoostingClassifier...


In [105]:
fitted_classifiers_clwt

OrderedDict([('Perceptron',
              OrderedDict([('Fitted_clf', Perceptron()),
                           ('OptTrain_time', 0.022939205169677734),
                           ('Training metrics',
                            OrderedDict([('accuracy', 0.848),
                                         ('precision', 0.21875),
                                         ('recall', 0.1794871794871795),
                                         ('F1', 0.19718309859154928),
                                         ('log_loss', 5.478635315145807),
                                         ('MCC', 0.11481799238319702)])),
                           ('Validation metrics',
                            OrderedDict([('accuracy', 0.804),
                                         ('precision', 0.12903225806451613),
                                         ('recall', 0.15384615384615385),
                                         ('F1', 0.14035087719298245),
                                         ('log_l

**OBSERVATION:** <font color='red'>**STILL NOT GOOD ENOUGH RESULTS**</font><font color='blue'>**, BUT THINGS SEEM TO HAVE IMPROVED A LITTLE WITH CLASS BALANCING**</font>

The problem here could be that the classifiers are still suffering from overfitting and / or an imbalanced source dataset.

#### Focus on `precision`

We now focus on `precision`, as it is the metric we are most interested in.

In [106]:
precision_results['BMOPTCW_tr_precision'] = np.nan
precision_results['BMOPTCW_val_precision'] = np.nan
precision_results

| Classifier | BM_tr_precision | BM_val_precision | BMOPT_tr_precision | BMOPT_val_precision | BMOPTCW_tr_precision | BMOPTCW_val_precision |
|---|---|---|---|---|---|---|
| Perceptron | 0.21875 | 0.129032 | 0.116883 | 0.096774 | NaN | NaN |
| LogisticRegression | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| PassiveAggressiveClassifier | 0.24 | 0.125 | 0.26087 | 0.076923 | NaN | NaN |
| SVC | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| KNeighborsClassifier | 0.333333 | 0.0 | 1.0 | 0.2 | NaN | NaN |
| GaussianProcessClassifier | 1.0 | 0.25 | 0.0 | 0.0 | NaN | NaN |
| DecisionTreeClassifier | 1.0 | 0.029412 | 0.690476 | 0.1 | NaN | NaN |
| RandomForestClassifier | 1.0 | 1.0 | 0.0 | 0.0 | NaN | NaN |
| AdaBoostClassifier | 0.777778 | 0.153846 | 0.947368 | 0.333333 | NaN | NaN |
| GradientBoostingClassifier | 1.0 | 0.166667 | 0.0 | 0.0 | NaN | NaN |


In [107]:
for clf in fitted_classifiers_clwt.keys():
    precision_results.loc[clf, 'BMOPTCW_tr_precision'] = \
        fitted_classifiers_clwt[clf]['Training metrics']['precision']
    precision_results.loc[clf, 'BMOPTCW_val_precision'] = \
        fitted_classifiers_clwt[clf]['Validation metrics']['precision']
precision_results

| Classifier | BM_tr_precision | BM_val_precision | BMOPT_tr_precision | BMOPT_val_precision | BMOPTCW_tr_precision | BMOPTCW_val_precision |
|---|---|---|---|---|---|---|
| Perceptron | 0.21875 | 0.129032 | 0.116883 | 0.096774 | 0.21875 | 0.129032 |
| LogisticRegression | 0.0 | 0.0 | 0.0 | 0.0 | 0.167808 | 0.095238 |
| PassiveAggressiveClassifier | 0.24 | 0.125 | 0.26087 | 0.076923 | 0.104348 | 0.138889 |
| SVC | 0.0 | 0.0 | 0.0 | 0.0 | 0.785714 | 0.333333 |
| KNeighborsClassifier | 0.333333 | 0.0 | 1.0 | 0.2 | 1.0 | 0.2 |
| GaussianProcessClassifier | 1.0 | 0.25 | 0.0 | 0.0 | 1.0 | 0.25 |
| DecisionTreeClassifier | 1.0 | 0.029412 | 0.690476 | 0.1 | 0.36715 | 0.150685 |
| RandomForestClassifier | 1.0 | 1.0 | 0.0 | 0.0 | 0.785714 | 0.058824 |
| AdaBoostClassifier | 0.777778 | 0.153846 | 0.947368 | 0.333333 | 0.55 | 0.058824 |
| GradientBoostingClassifier | 1.0 | 0.166667 | 0.0 | 0.0 | 1.0 | 0.166667 |


In [108]:
print("TRAINING / VALIDATION PRECISION RESULTS, SLIGHTLY OPTIMIZED OFF-THE-SHELF CLASSIFIERS, WITH CLASS WEIGHTS")
for idx in (precision_results.index):
    print("%s: %.2f / %.2f" %(idx,
                              precision_results.loc[idx, 'BMOPTCW_tr_precision'],
                              precision_results.loc[idx, 'BMOPTCW_val_precision']))

TRAINING / VALIDATION PRECISION RESULTS, SLIGHTLY OPTIMIZED OFF-THE-SHELF CLASSIFIERS, WITH CLASS WEIGHTS
Perceptron: 0.22 / 0.13
LogisticRegression: 0.17 / 0.10
PassiveAggressiveClassifier: 0.10 / 0.14
SVC: 0.79 / 0.33
KNeighborsClassifier: 1.00 / 0.20
GaussianProcessClassifier: 1.00 / 0.25
DecisionTreeClassifier: 0.37 / 0.15
RandomForestClassifier: 0.79 / 0.06
AdaBoostClassifier: 0.55 / 0.06
GradientBoostingClassifier: 1.00 / 0.17


**OBSERVATION:** so, it is clear that none of the algorithms is performing well yet.

**CONCLUSIONS:**<font color='blue'> **HOWEVER, IT SEEMS THAT BALANCING THE CLASSES HAS IMPROVED THE SITUATION A LITTLE BIT. NOW, ALL THE CLASSIFIERS COULD BE INTERESTING STARTING POINTS**</font><font color='red'>**, EVEN IF THE PERFORMANCE OF ALL OF THEM IS LESS THAN ACCEPTABLE AND THREE OF THEM (`KNeighborsClassifier`, `GaussianProcessClassifier` and `GradientBoostingClassifier`) ARE STILL EXTREMELY OVERFITTING THE TRAINING SET.**</font>
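
To make the overfitting statement more concrete, the following standard-library sketch ranks the classifiers by validation precision and reports the gap between training and validation precision. The numbers are copied from the `BMOPTCW_tr_precision` and `BMOPTCW_val_precision` columns of `precision_results`; the variable names are illustrative, and the gap is only a crude indicator of overfitting.

```python
# (training precision, validation precision), class-weighted and optimized models
bmoptcw_precision = {
    'Perceptron': (0.21875, 0.129032),
    'LogisticRegression': (0.167808, 0.095238),
    'PassiveAggressiveClassifier': (0.104348, 0.138889),
    'SVC': (0.785714, 0.333333),
    'KNeighborsClassifier': (1.0, 0.2),
    'GaussianProcessClassifier': (1.0, 0.25),
    'DecisionTreeClassifier': (0.36715, 0.150685),
    'RandomForestClassifier': (0.785714, 0.058824),
    'AdaBoostClassifier': (0.55, 0.058824),
    'GradientBoostingClassifier': (1.0, 0.166667),
}

# Gap between training and validation precision (larger = more overfitting).
gaps = {name: tr - val for name, (tr, val) in bmoptcw_precision.items()}

# Classifiers sorted by validation precision, best first.
ranked = sorted(bmoptcw_precision,
                key=lambda name: bmoptcw_precision[name][1],
                reverse=True)

for name in ranked:
    tr, val = bmoptcw_precision[name]
    print("%-28s val=%.2f  gap=%+.2f" % (name, val, gaps[name]))
```

`SVC` comes first by validation precision, while the three classifiers with perfect training precision show the largest gaps.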


### Save the models

In [109]:
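# Use a context manager so that the file is closed (and flushed) after writing:
with open(MODELS_FOLDER + "fitted_clf_cw_opt.pickle", 'wb') as model_file:
    pickle.dump(fitted_classifiers_clwt, model_file)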

**CONCLUSION:** it is clear that all the algorithms, both off-the-shelf and slightly optimized, share two problems:

- All of them suffer from the imbalanced nature of the classes.
- All of them show some degree of overfitting, and some of them overfit to an extreme degree.

Before investigating each algorithm in depth, let's try another strategy to deal with dataset imbalance: oversampling, instead of assigning weights to the classes. We will do that in a separate Notebook.
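
As a rough idea of what oversampling means, the sketch below duplicates randomly chosen minority-class rows until both classes have the same size. It is a minimal, standard-library illustration on toy data, not the method used in the follow-up Notebook; the function name `random_oversample`, the toy rows and the seed are all hypothetical.

```python
import random

def random_oversample(rows, labels, minority_label, seed=11):
    """Duplicate random minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    n_majority = len(labels) - len(minority_idx)
    n_extra = max(n_majority - len(minority_idx), 0)
    if not minority_idx:
        # Nothing to resample from: return copies of the input unchanged.
        return list(rows), list(labels)
    # Sample minority indices with replacement.
    extra_idx = [rng.choice(minority_idx) for _ in range(n_extra)]
    all_idx = list(range(len(labels))) + extra_idx
    return [rows[i] for i in all_idx], [labels[i] for i in all_idx]

# Toy data (hypothetical): 8 majority samples and 2 minority samples.
toy_rows = [[float(i)] for i in range(10)]
toy_labels = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(toy_rows, toy_labels, minority_label=1)
print(y_bal.count(0), y_bal.count(1))
```

Resampling of this kind should be applied to the training data only, so that the validation set keeps the original class proportions.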

## Save the results

In [113]:
precision_results = precision_results.reset_index(drop=False).rename(columns={'index': 'Classifier'})
precision_results

| | Classifier | BM_tr_precision | BM_val_precision | BMOPT_tr_precision | BMOPT_val_precision | BMOPTCW_tr_precision | BMOPTCW_val_precision |
|---|---|---|---|---|---|---|---|
| 0 | Perceptron | 0.21875 | 0.129032 | 0.116883 | 0.096774 | 0.21875 | 0.129032 |
| 1 | LogisticRegression | 0.0 | 0.0 | 0.0 | 0.0 | 0.167808 | 0.095238 |
| 2 | PassiveAggressiveClassifier | 0.24 | 0.125 | 0.26087 | 0.076923 | 0.104348 | 0.138889 |
| 3 | SVC | 0.0 | 0.0 | 0.0 | 0.0 | 0.785714 | 0.333333 |
| 4 | KNeighborsClassifier | 0.333333 | 0.0 | 1.0 | 0.2 | 1.0 | 0.2 |
| 5 | GaussianProcessClassifier | 1.0 | 0.25 | 0.0 | 0.0 | 1.0 | 0.25 |
| 6 | DecisionTreeClassifier | 1.0 | 0.029412 | 0.690476 | 0.1 | 0.36715 | 0.150685 |
| 7 | RandomForestClassifier | 1.0 | 1.0 | 0.0 | 0.0 | 0.785714 | 0.058824 |
| 8 | AdaBoostClassifier | 0.777778 | 0.153846 | 0.947368 | 0.333333 | 0.55 | 0.058824 |
| 9 | GradientBoostingClassifier | 1.0 | 0.166667 | 0.0 | 0.0 | 1.0 | 0.166667 |


In [115]:
precision_results.to_csv(MODELS_FOLDER + PRECISION_RESULTS_OUT, sep=',', decimal='.', index=False)

## Save predictions on the validation dataset

### Predictions

We now save the predictions on the validation dataset, along with all available metadata, so that they can be analysed later on.

In [169]:
s4_val_w_pred = s4_val.copy()
s4_val_w_pred

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00107,0,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
1,Star-00868,0,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
2,Star-00106,0,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
3,Star-00120,0,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
4,Star-00559,0,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,0,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
246,Star-00943,0,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
247,Star-00721,0,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
248,Star-00926,0,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


In [170]:
warnings.simplefilter('ignore')
for classifier in list(fitted_classifiers_clwt.keys()):
    print("Calculating predictions for classifier %s..." %classifier)
    # Validation set:
    y_val_pred = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict(s4_val[rel_features])
    s4_val_w_pred['Prediction_' + classifier] = y_val_pred

Calculating predictions for classifier Perceptron...
Calculating predictions for classifier LogisticRegression...
Calculating predictions for classifier PassiveAggressiveClassifier...
Calculating predictions for classifier SVC...
Calculating predictions for classifier KNeighborsClassifier...
Calculating predictions for classifier GaussianProcessClassifier...
Calculating predictions for classifier DecisionTreeClassifier...
Calculating predictions for classifier RandomForestClassifier...
Calculating predictions for classifier AdaBoostClassifier...
Calculating predictions for classifier GradientBoostingClassifier...


In [171]:
s4_val_w_pred

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,Prediction_Perceptron,Prediction_LogisticRegression,Prediction_PassiveAggressiveClassifier,Prediction_SVC,Prediction_KNeighborsClassifier,Prediction_GaussianProcessClassifier,Prediction_DecisionTreeClassifier,Prediction_RandomForestClassifier,Prediction_AdaBoostClassifier,Prediction_GradientBoostingClassifier
0,Star-00107,0,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0,0,1,0,0,0,1,0,0,0
1,Star-00868,0,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0,0,1,0,0,0,0,0,0,0
2,Star-00106,0,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,1,0,0,0,0,0,0,0,0,0
3,Star-00120,0,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,0,0,1,0,0,0,0,0,0,0
4,Star-00559,0,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
245,Star-00232,0,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,0,0,0,0,0,0,0,0,0,0
246,Star-00943,0,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0,0,1,0,0,0,0,0,0,0
247,Star-00721,0,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0,1,1,0,0,0,1,0,0,0
248,Star-00926,0,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,0,0,0,0,0,0,1,0,0,0


In [165]:
s4_val_w_pred.head(20)

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,Prediction_Perceptron,Prediction_LogisticRegression,Prediction_PassiveAggressiveClassifier,Prediction_SVC,Prediction_KNeighborsClassifier,Prediction_GaussianProcessClassifier,Prediction_DecisionTreeClassifier,Prediction_RandomForestClassifier,Prediction_AdaBoostClassifier,Prediction_GradientBoostingClassifier
0,Star-00107,0,0.0,0.0,0.0,2457430.0,0.0,-0.99166,0.031948,0.542146,...,0,0,1,0,0,0,1,0,0,0
1,Star-00868,0,0.0,0.0,0.0,2457432.0,0.0,-1.309194,-1.081711,1.825039,...,0,0,1,0,0,0,0,0,0,0
2,Star-00106,0,0.0,0.0,0.0,2457404.0,0.0,-0.356591,0.379966,0.844003,...,1,0,0,0,0,0,0,0,0,0
3,Star-00120,0,0.0,0.0,0.0,2457395.0,0.0,-0.039057,0.519174,0.994931,...,0,0,1,0,0,0,0,0,0,0
4,Star-00559,0,0.0,0.0,0.0,2457441.0,0.0,0.596012,-0.664089,-0.212498,...,0,0,0,0,0,0,0,0,0,0
5,Star-00205,1,71.94,0.53,0.0,2457510.0,0.05,-0.039057,1.493625,0.391217,...,0,1,1,0,0,0,1,0,0,0
6,Star-00061,0,0.0,0.0,0.0,2457432.0,0.0,-0.356591,1.493625,1.221324,...,0,1,1,0,0,0,0,0,0,0
7,Star-00258,0,0.0,0.0,0.0,2457401.0,0.0,0.278478,-0.176863,-0.665284,...,0,1,0,0,0,0,0,0,0,0
8,Star-00124,0,0.0,0.0,0.0,2457401.0,0.0,-1.944263,1.0064,2.051432,...,0,0,0,0,0,0,0,0,0,1
9,Star-00775,0,0.0,0.0,0.0,2457404.0,0.0,0.596012,-0.664089,-0.212498,...,0,0,0,0,0,0,0,0,0,0


### Prediction probabilities (if available)

In [172]:
warnings.simplefilter('ignore')
for classifier in list(fitted_classifiers_clwt.keys()):
    print("Calculating predictions for classifier %s..." %classifier)
    # Validation set:
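    try:
        y_val_pred_proba = fitted_classifiers_clwt[classifier]['Fitted_clf'].predict_proba(s4_val[rel_features])
        # Column 1 holds the probability of class 1. Assign the raw array so that
        # values are matched by row position rather than by index label:
        s4_val_w_pred['PredictionProb_' + classifier] = y_val_pred_proba[:, 1]
        print("... ok, probabilities calculated")
    except Exception:
        # Some classifiers (e.g. Perceptron) do not implement 'predict_proba':
        print("**WARNING: 'predict_proba' method failed for classifier '%s'." %classifier)
        s4_val_w_pred['PredictionProb_' + classifier] = np.nan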

Calculating predictions for classifier Perceptron...
Calculating predictions for classifier LogisticRegression...
... ok, probabilities calculated
Calculating predictions for classifier PassiveAggressiveClassifier...
Calculating predictions for classifier SVC...
Calculating predictions for classifier KNeighborsClassifier...
... ok, probabilities calculated
Calculating predictions for classifier GaussianProcessClassifier...
... ok, probabilities calculated
Calculating predictions for classifier DecisionTreeClassifier...
... ok, probabilities calculated
Calculating predictions for classifier RandomForestClassifier...
... ok, probabilities calculated
Calculating predictions for classifier AdaBoostClassifier...
... ok, probabilities calculated
Calculating predictions for classifier GradientBoostingClassifier...
... ok, probabilities calculated


In [174]:
s4_val_w_pred.head(20)

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,PredictionProb_Perceptron,PredictionProb_LogisticRegression,PredictionProb_PassiveAggressiveClassifier,PredictionProb_SVC,PredictionProb_KNeighborsClassifier,PredictionProb_GaussianProcessClassifier,PredictionProb_DecisionTreeClassifier,PredictionProb_RandomForestClassifier,PredictionProb_AdaBoostClassifier,PredictionProb_GradientBoostingClassifier
0,Star-00107,0,0.0,0.0,0.0,2457430.0,0.0,-0.99166,0.031948,0.542146,...,,0.292668,,,0.229641,0.499994,0.831169,0.173323,0.403245,0.029434
1,Star-00868,0,0.0,0.0,0.0,2457432.0,0.0,-1.309194,-1.081711,1.825039,...,,0.295265,,,0.0,0.499941,0.160689,0.259035,0.489024,0.077813
2,Star-00106,0,0.0,0.0,0.0,2457404.0,0.0,-0.356591,0.379966,0.844003,...,,0.414528,,,0.0,0.499428,0.088258,0.2184,0.498291,0.030231
3,Star-00120,0,0.0,0.0,0.0,2457395.0,0.0,-0.039057,0.519174,0.994931,...,,0.365153,,,0.308834,0.499748,0.160689,0.350822,0.495524,0.087292
4,Star-00559,0,0.0,0.0,0.0,2457441.0,0.0,0.596012,-0.664089,-0.212498,...,,0.401949,,,0.0,0.499006,0.0,0.274581,0.49381,0.02527
5,Star-00205,1,71.94,0.53,0.0,2457510.0,0.05,-0.039057,1.493625,0.391217,...,,0.520586,,,0.320539,0.498293,0.934891,0.348724,0.496568,0.074594
6,Star-00061,0,0.0,0.0,0.0,2457432.0,0.0,-0.356591,1.493625,1.221324,...,,0.66579,,,0.0,0.499922,0.0,0.400572,0.416515,0.078724
7,Star-00258,0,0.0,0.0,0.0,2457401.0,0.0,0.278478,-0.176863,-0.665284,...,,0.590587,,,0.0,0.474454,0.0,0.424901,0.498678,0.11503
8,Star-00124,0,0.0,0.0,0.0,2457401.0,0.0,-1.944263,1.0064,2.051432,...,,0.377777,,,0.0,0.499996,0.0,0.451169,0.422502,0.50319
9,Star-00775,0,0.0,0.0,0.0,2457404.0,0.0,0.596012,-0.664089,-0.212498,...,,0.354509,,,0.0,0.499947,0.0,0.311714,0.497007,0.038833


### Save the predictions

And we now save the file:

In [175]:
s4_val_w_pred.to_csv(MODELS_FOLDER + VAL_PREDICTIONS_OUT, sep=',', decimal='.', index=False)

## Summary

**RESULTS:**

- We tested different classifiers from different families against the S4 sample.
- In general, results show a slight improvement with a very simple and naive hyperparameter optimization.
- The imbalance of the dataset also seems to pose a problem for many of the classifiers.
- Overfitting seems to be a serious problem, especially with the tree / ensemble methods.
- We have stored both the precision of the different ML models and their predictions on the validation set (with prediction probabilities when available).

**CONCLUSIONS:**

- We should opt for a balanced training dataset.
- Additional work on tree and ensemble classifiers is needed to prevent overfitting: pruning the trees, for example (even if some values for the `ccp_alpha` parameter were tried as part of the optimization).
