# TRAIN/TEST AND VALIDATION SPLIT

In this notebook we split the S4 sample into a train/test subsample and a validation subsample. We do not split the train/test subsample (the _training set_ from now on) any further, because we will apply k-fold cross-validation later, when training or optimizing models.

We will take $25\%$ of the records for the validation set and $75\%$ for the training set. The validation set is set apart and will be used to assess the performance of the model on new, previously unseen data, after the model has been optimized and trained by cross-validation on the training set.
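
The hold-out scheme described above can be sketched with plain index bookkeeping. This is only an illustration under stated assumptions: the number of folds (`N_FOLDS = 5`) is not fixed anywhere in this notebook, the seed is arbitrary, and stratification is ignored for brevity.

```python
import random

N_RECORDS = 1000   # size of the S4 sample loaded in this notebook
VAL_SIZE = 0.25    # fraction of records held out for validation
N_FOLDS = 5        # assumption: the value of k is not fixed in this notebook

rng = random.Random(11)
indices = list(range(N_RECORDS))
rng.shuffle(indices)

# Hold out the validation records first; they never take part in the folds.
n_val = int(N_RECORDS * VAL_SIZE)
val_idx = indices[:n_val]
train_idx = indices[n_val:]

# Cut the training set into k folds of equal size.
folds = [train_idx[i::N_FOLDS] for i in range(N_FOLDS)]

for k, test_fold in enumerate(folds):
    # Fit on every fold except the k-th one, test on the k-th one.
    fit_part = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(f"fold {k}: fit on {len(fit_part)} records, test on {len(test_fold)}")

print(f"validation records kept aside: {len(val_idx)}")
```

Every record of the training set is used for testing exactly once across the folds, while the validation records stay untouched until the final assessment.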

The $25\%$ of data points to set apart will be chosen randomly, but with a stratification strategy based on the target variable (i.e. the `Pulsating` field of the S4 sample), so that the training and validation sets contain the same fraction of pulsating stars. We could have chosen a more complex stratification strategy, for example one involving all the features, which would ensure that the validation set has the same statistical characteristics as the training set. However, taking into account what we saw when comparing the ML subsample with the S4 sample in terms of features, we prefer to be conservative in validating the models, and not to make their task too easy.
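
As a minimal sketch of what stratifying on the target means, the snippet below samples each class separately. It is not scikit-learn's implementation: the helper `stratified_split` is hypothetical, and the toy labels only mimic the class balance reported further down in this notebook (78 + 26 = 104 pulsating stars among 1000 records).

```python
import random

def stratified_split(labels, val_size, seed):
    """Return (train_idx, val_idx), drawing val_size of each class separately."""
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(members)
        n_val = round(len(members) * val_size)
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return train_idx, val_idx

# Toy labels with the class balance of the S4 sample: 104 pulsating out of 1000.
labels = [True] * 104 + [False] * 896

train_idx, val_idx = stratified_split(labels, val_size=0.25, seed=11)

n_puls_train = sum(labels[i] for i in train_idx)
n_puls_val = sum(labels[i] for i in val_idx)
print(len(train_idx), n_puls_train, f"{100 * n_puls_train / len(train_idx):.2f}%")
print(len(val_idx), n_puls_val, f"{100 * n_puls_val / len(val_idx):.2f}%")
```

Because each class is sampled on its own, both subsets end up with the same fraction of pulsating stars whichever records are drawn.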

Notice that we do this split for the S4 sample with all the $112$ features: later on we will select the proper subset of features (namely, the $48$ reliable features) or not, as needed. That way, any comparison of results using $112$ or $48$ features will share the same initial conditions. Ideally, several runs of the experiments with different random seeds should be carried out to ensure that the selected split does not have any special characteristic that favours one case over the other.
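
Repeating the split with several seeds could look like the sketch below. The seed list is hypothetical (only the value 11 appears in this notebook), and the toy target merely reproduces the class balance of the S4 sample instead of loading the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with the class balance of the S4 sample (104 pulsating out of 1000).
y = np.array([True] * 104 + [False] * 896)
ids = np.arange(len(y))

SEEDS = [11, 12, 13]  # hypothetical seeds; only 11 is used in this notebook

val_sets = {}
for seed in SEEDS:
    # With a single input array, train_test_split returns [train, test].
    _, ids_val = train_test_split(ids, test_size=0.25, stratify=y, random_state=seed)
    val_sets[seed] = set(ids_val.tolist())
    n_puls = int(y[ids_val].sum())
    print(f"seed {seed}: {len(ids_val)} validation records, {n_puls} pulsating")
```

The class counts are the same for every seed because of the stratification; what changes is which records land in the validation set, and that is what repeated experiments would probe.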

## Modules and configuration

### Modules

In [1]:
import pandas as pd
#import numpy as np

#import warnings

from sklearn.model_selection import train_test_split

#from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif

import pickle

#import matplotlib.pyplot as plt
#import seaborn as sns
#sns.set_style("white", {'figure.figsize':(15,10)})

#from IPython.display import display

### Configuration

In [2]:
RANDOM_STATE = 11 # For reproducibility
VAL_SIZE = 0.25 # The fraction of samples in the validation dataset

S4_INPUT_SCALED_AND_IMPUTED = "../data/DATASETS_ML/S4_02_DS_AfterImputing.csv"

#REL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Reliable_features.pickle"

S4_TRAIN_SET_OUT = "../data/DATASETS_ML/S4_02_DS_AfterImputing_TrainTest.csv"
# Train/test set for S4 sample, all 112 features
S4_VALIDATION_SET_OUT = "../data/DATASETS_ML/S4_02_DS_AfterImputing_Validation.csv"
# Validation set for S4 sample, all 112 features

ML_ADD_COLUMNS = ['Karmn'] # Only cesium features and this column will be kept.
S4_ADD_COLUMNS = ['ID', 'Pulsating', 'frequency', 'amplitudeRV',
                  'offsetRV', 'refepochRV', 'phase'] # Only cesium features and these columns will be kept.

IMAGE_FOLDER = './img/'

### Functions

## Load data

We load the S4 sample dataset, already scaled and with `NaN` values imputed by a `KNNImputer`.

In [3]:
s4 = pd.read_csv(S4_INPUT_SCALED_AND_IMPUTED, sep=',', decimal='.')
s4.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00000,True,10.33,1.14,0.0,2457582.0,0.9,-1.944263,2.120058,1.900503,...,0.175639,-1.179443,0.567705,0.100876,-0.114591,-0.21621,0.045386,-0.543694,1.459163,-1.311671
1,Star-00001,True,14.92,1.3,0.0,2457522.0,0.02,1.231081,0.171155,0.08936,...,0.28129,0.536314,0.053099,0.19184,-0.184775,1.237821,-0.589916,-0.372454,-1.663617,-0.333991
2,Star-00002,False,0.0,0.0,0.0,2457549.0,0.0,0.596012,-0.664089,-0.212498,...,-0.307759,0.629682,-0.137934,0.052383,-0.134684,-0.332168,0.316178,-0.194267,-0.020196,0.651418
3,Star-00003,False,0.0,0.0,0.0,2457460.0,0.0,2.501218,-0.664089,-1.495391,...,-0.076569,0.420813,-0.544527,0.065099,-0.073476,0.021419,-0.521209,-0.244714,0.08157,-0.272277
4,Star-00004,True,28.74,0.9,0.0,2457451.0,0.29,-0.039057,1.076003,0.391217,...,-0.39023,-1.060328,2.146213,0.010452,-0.401182,-0.849181,0.06774,-0.433568,0.438785,-0.757588


In [4]:
s4.shape

(1000, 119)

## Split Train/Test and Validation sets

In [6]:
# Notice that 'test_size' refers, in this context, to the validation set size (i.e. 25% of the full set), and
# that we discard the 'y' results of the split (as we will just store the sets with both features and target variable).
# 'random_state' is set so that the split is reproducible.
X_traintest, X_val, _, _ = train_test_split(s4, s4['Pulsating'], test_size=VAL_SIZE, stratify=s4['Pulsating'],
                                            random_state=RANDOM_STATE)

In [7]:
X_traintest.shape

(750, 119)

In [8]:
X_val.shape

(250, 119)

In [9]:
X_traintest

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
163,Star-00163,False,0.0,0.00,0.0,2.457444e+06,0.00,-0.674126,0.519174,0.466681,...,-0.712310,-1.187392,0.425026,-0.002305,0.495906,-0.537353,-0.028926,-0.262548,-0.135686,-0.705143
123,Star-00123,True,30.0,0.72,0.0,2.457401e+06,0.37,-1.626729,1.911247,-0.740748,...,0.040924,-1.110488,-0.289189,0.056551,0.555375,-0.699590,-0.292135,-0.013533,0.443673,-1.207278
22,Star-00022,False,0.0,0.00,0.0,2.457430e+06,0.00,-0.039057,-1.012107,0.013895,...,-0.943428,0.637603,-0.679383,0.020496,-0.496592,-0.001214,-0.101526,-0.011097,-0.293389,0.242263
708,Star-00708,False,0.0,0.00,0.0,2.459677e+06,0.00,-0.039057,1.632833,-0.514355,...,-1.091456,0.759880,-0.161363,-0.210930,0.135863,0.662121,-0.492481,0.015621,-0.724783,0.682494
484,Star-00484,False,0.0,0.00,0.0,2.457400e+06,0.00,0.596012,-0.176863,-1.042605,...,-0.696260,0.153752,0.936459,0.070402,-0.067689,-0.656553,-0.237337,-0.032597,-0.139141,-0.098080
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,Star-00795,False,0.0,0.00,0.0,2.457410e+06,0.00,0.278478,-0.803296,1.070396,...,-0.390463,-0.320856,-0.058001,0.061510,-0.608276,-1.229502,-0.332122,-0.410947,1.478398,-0.121757
221,Star-00221,False,0.0,0.00,0.0,2.457478e+06,0.00,-0.039057,-1.290522,-0.438891,...,0.703166,-1.543128,-0.386300,0.084782,-0.063902,0.199957,0.689712,-0.526121,1.559757,-1.150793
463,Star-00463,False,0.0,0.00,0.0,2.457409e+06,0.00,0.596012,-0.733692,-1.193534,...,-0.741328,1.272454,-0.054747,0.066032,-0.327664,0.414920,0.069037,0.039541,-0.218490,1.238094
873,Star-00873,False,0.0,0.00,0.0,2.457416e+06,0.00,-0.674126,-0.524881,0.542146,...,0.798791,-1.532917,-2.547988,0.149859,1.751137,-0.416064,-0.361215,0.544014,-1.894647,-1.094748


In [10]:
X_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
107,Star-00107,False,0.0,0.0,0.0,2.457430e+06,0.0,-0.991660,0.031948,0.542146,...,0.215296,-1.171010,-1.399418,0.202986,-0.550638,0.473838,-0.300629,-0.565171,-0.717831,-0.998997
868,Star-00868,False,0.0,0.0,0.0,2.457432e+06,0.0,-1.309194,-1.081711,1.825039,...,0.859339,-1.201677,-0.242453,-0.032886,-0.381893,4.479455,0.554354,0.946671,-2.945576,-0.390979
106,Star-00106,False,0.0,0.0,0.0,2.457404e+06,0.0,-0.356591,0.379966,0.844003,...,-0.619653,1.401153,0.280531,0.057394,-0.394560,0.012444,-0.506509,-0.073337,-0.019620,0.850942
120,Star-00120,False,0.0,0.0,0.0,2.457395e+06,0.0,-0.039057,0.519174,0.994931,...,-0.544944,-0.806949,-0.860069,0.186396,0.224160,-0.895172,-0.068329,-0.094626,0.283729,-1.018961
559,Star-00559,False,0.0,0.0,0.0,2.457441e+06,0.0,0.596012,-0.664089,-0.212498,...,0.600208,0.897864,0.219432,0.051864,-0.446379,-0.502312,-0.305140,-0.144503,0.889861,0.402366
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
232,Star-00232,False,0.0,0.0,0.0,2.457436e+06,0.0,0.913546,0.101552,-1.419927,...,-0.705425,0.564869,-0.512824,0.063347,-0.448318,-0.790483,0.590039,-0.056457,0.771706,0.054225
943,Star-00943,False,0.0,0.0,0.0,2.457418e+06,0.0,0.278478,-0.733692,1.447717,...,0.591706,-1.002341,0.166327,0.253063,0.827107,-0.848836,-0.189881,-0.728901,0.989328,-1.001833
721,Star-00721,False,0.0,0.0,0.0,2.457398e+06,0.0,-0.356591,1.493625,1.221324,...,0.631140,-1.384100,-0.953155,0.073608,0.003997,-0.510853,-0.228942,-0.016225,-0.886376,-1.003102
926,Star-00926,False,0.0,0.0,0.0,2.457425e+06,0.0,0.913546,-1.151314,0.240288,...,-0.040429,0.545847,-0.283424,0.096595,-0.575888,-0.223498,0.142206,-0.298919,-0.741117,0.182974


Let's check the fraction of pulsating stars in each set:

In [11]:
print("Number of pulsating stars in Train/Test set: %d" 
      %len(X_traintest[X_traintest['Pulsating'] == True]))
print("Number of pulsating stars in Validation set: %d" 
      %len(X_val[X_val['Pulsating'] == True]))

Number of pulsating stars in Train/Test set: 78
Number of pulsating stars in Validation set: 26


In [12]:
print("Ratio of pulsating stars in Train/Test set: %.2f%%" 
      %(100.0 * len(X_traintest[X_traintest['Pulsating'] == True])/len(X_traintest)))
print("Ratio of pulsating stars in Validation set: %.2f%%" 
      %(100.0 * len(X_val[X_val['Pulsating'] == True])/len(X_val)))

Ratio of pulsating stars in Train/Test set: 10.40%
Ratio of pulsating stars in Validation set: 10.40%


## Save results

### Save the train/test set

In [14]:
X_traintest.to_csv(S4_TRAIN_SET_OUT, sep=',', decimal='.', index=False)

### Save the validation set

In [15]:
X_val.to_csv(S4_VALIDATION_SET_OUT, sep=',', decimal='.', index=False)
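
A sanity check that could follow the save step is to reload both files and confirm that no star appears in both sets. The sketch below is self-contained and uses hypothetical toy frames and temporary file names rather than the real `X_traintest`, `X_val` and output paths.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for X_traintest and X_val (hypothetical data, same column names).
toy_train = pd.DataFrame({"ID": ["Star-00000", "Star-00001", "Star-00002"],
                          "Pulsating": [True, False, False]})
toy_val = pd.DataFrame({"ID": ["Star-00003"], "Pulsating": [False]})

with tempfile.TemporaryDirectory() as tmp:
    train_path = os.path.join(tmp, "train.csv")  # hypothetical file names
    val_path = os.path.join(tmp, "val.csv")
    toy_train.to_csv(train_path, sep=',', decimal='.', index=False)
    toy_val.to_csv(val_path, sep=',', decimal='.', index=False)

    train_back = pd.read_csv(train_path, sep=',', decimal='.')
    val_back = pd.read_csv(val_path, sep=',', decimal='.')

# No star may appear in both sets, and no record may be lost on the way.
shared_ids = set(train_back["ID"]) & set(val_back["ID"])
print("shared IDs:", len(shared_ids))
print("shapes preserved:", train_back.shape == toy_train.shape,
      val_back.shape == toy_val.shape)
```

With the real files, the same two checks would be run against `S4_TRAIN_SET_OUT` and `S4_VALIDATION_SET_OUT`.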

## Summary

**RESULTS:**

- We split the S4 sample into a train/test set ($750$ records) and a validation set ($250$ records), each with $10.40\%$ of pulsating stars, and stored the results.
