#  FEATURE ENGINEERING - FEATURE SCALING - ALTERNATIVE S4B SAMPLE FOR 1-NN IMPLEMENTATION

In this notebook we apply standard feature scaling (standardization): each feature has its mean subtracted and is then divided by its standard deviation.
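As a minimal sketch of what this scaling does, the example below checks that `StandardScaler` matches the manual formula `z = (x - mean) / std`, computed per column with the population standard deviation (`ddof=0`). It uses a small made-up array, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 4 samples, 2 features.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 60.0]])

# Manual standardization: per-column mean and population std (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation through scikit-learn.
scaled = StandardScaler().fit_transform(X)

# Both approaches should agree up to floating-point error.
print(np.allclose(manual, scaled))
```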

**(ALSO TRY WITH MINMAXSCALER TO 0-1)**
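A hedged sketch of that alternative follows. It uses toy arrays (`X_train` and `X_new` are hypothetical names, not the notebook's data) and scikit-learn's `MinMaxScaler` with `feature_range=(0, 1)`. Note that data outside the fitted range maps outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy arrays, made up for illustration.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [4.0, 60.0]])
X_new = np.array([[3.0, 80.0]])  # second feature exceeds the training maximum

# Fit the per-feature min and max on the training data only.
mm_scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = mm_scaler.fit_transform(X_train)

# Apply the same min/max to unseen data; values may fall outside [0, 1].
new_scaled = mm_scaler.transform(X_new)

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(new_scaled)
```

Unlike standardization, this scaling is driven entirely by each feature's extreme values, so a single outlier in the fitting sample compresses all other values.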

**NOTE:** we define (fit) the scaler using only the ALT_S4B sample, and then apply the scaling to the ALT_S4 sample. We save the fitted scaler for later use with the ML subsample.
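The workflow in this note can be sketched as follows: fit on one sample, save the scaler with `pickle`, reload it, and apply it to another sample. The frames `train` and `other`, the column names and the temporary file path are all hypothetical stand-ins, not the notebook's actual data or paths.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for the fitting sample and another sample.
train = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [10.0, 20.0, 60.0]})
other = pd.DataFrame({"f1": [2.0, 5.0], "f2": [30.0, 30.0]})

# Fit the scaler on the fitting sample only.
scaler = StandardScaler().fit(train)

# Save the fitted scaler to a temporary file, then reload it.
path = os.path.join(tempfile.mkdtemp(), "toy_scaler.pickle")
with open(path, "wb") as f:
    pickle.dump(scaler, f)
with open(path, "rb") as f:
    reloaded = pickle.load(f)

# Apply the reloaded scaler to the other sample: it uses the means and
# standard deviations learned from `train`, not those of `other`.
other_scaled = reloaded.transform(other)
print(other_scaled)
```

Fitting on a single sample and reusing the stored statistics keeps all datasets in the same feature space, which matters for a distance-based method such as 1-NN.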

## Modules and configuration

### Modules

In [74]:
import pandas as pd

from sklearn.preprocessing import StandardScaler
import pickle

#import matplotlib.pyplot as plt
#import seaborn as sns
#sns.set_style("white", {'figure.figsize':(15,10)})

### Configuration

In [75]:
CARMENES_IN = "../data/DATASETS_ML/1NN_CARMENES_00_DS_Initial.csv"
TRAIN_S4B_IN = "../data/DATASETS_ML/1NN_TRAIN_S4B_00_DS_Initial.csv"
VAL_S4B_IN = "../data/DATASETS_ML/1NN_VAL_S4B_00_DS_Initial.csv"

CESIUM_FEATURES_FILE = "../data/cesium_Features_by_Category.csv"

SCALER_FILE = "../data/ML_MODELS/ML_pipeline_steps/1NN_scaler.pickle" # Will store the fitted scaler object

CARMENES_OUT = "../data/DATASETS_ML/1NN_CARMENES_01_DS_AfterScaling.csv" # The scaled features of the ML subsample.
TRAIN_S4B_OUT = "../data/DATASETS_ML/1NN_TRAIN_S4B_01_DS_AfterScaling.csv" # The scaled features of the TRAINING S4 sample
VAL_S4B_OUT = "../data/DATASETS_ML/1NN_VAL_S4B_01_DS_AfterScaling.csv" # The scaled features of the VALIDATION S4 sample

ML_ADD_COLUMNS = ['Karmn'] # Only cesium features and this column will be kept.
S4_ADD_COLUMNS = ['ID', 'Pulsating', 'frequency', 'amplitudeRV',
                  'offsetRV', 'refepochRV', 'phase',
                  'CARMENES_source_idx', 'CARMENES_Ref_star'] # Only cesium features and these columns will be kept.


### Functions

## Load data

We load the data, which are the time series as previously featurized by _cesium_.

### Load the cesium features

In [76]:
cs_f = pd.read_csv(CESIUM_FEATURES_FILE, sep=';', decimal='.')
cs_f

Unnamed: 0,Type,Feature,Description
0,Cadence/Error,all_times_nhist_numpeaks,Number of peaks (local maxima) in histogram of...
1,Cadence/Error,all_times_nhist_peak1_bin,Return the (bin) index of the ith largest peak...
2,Cadence/Error,all_times_nhist_peak2_bin,Return the (bin) index of the ith largest peak...
3,Cadence/Error,all_times_nhist_peak3_bin,Return the (bin) index of the ith largest peak...
4,Cadence/Error,all_times_nhist_peak4_bin,Return the (bin) index of the ith largest peak...
...,...,...,...
107,Lomb-Scargle (periodic),p2p_scatter_2praw,Get ratio of variability (sum of squared diffe...
108,Lomb-Scargle (periodic),p2p_scatter_over_mad,Get ratio of variability of folded and unfolde...
109,Lomb-Scargle (periodic),p2p_scatter_pfold_over_mad,Get ratio of median of period-folded data over...
110,Lomb-Scargle (periodic),p2p_ssqr_diff_over_var,Get sum of squared differences of consecutive ...


In [77]:
cs_f_list = cs_f['Feature'].to_list()
print(cs_f_list)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude', 'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50', 'flux_percentile_ratio_mid65',

### Load the TRAINING S4 sample data

In [78]:
s4_tr = pd.read_csv(TRAIN_S4B_IN, sep=',', decimal='.')
s4_tr

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.00,0,J23505-095,10.0,...,0.852444,0.595003,-0.152339,-0.000978,1.074144,0.772892,1.201439,1.104317,2.008315,0.482745
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.00,0,J23505-095,10.0,...,0.875066,0.666432,0.140860,-0.000429,1.024657,0.962333,1.214286,1.314286,2.107205,0.585830
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,10.0,...,0.913891,0.672340,0.120452,-0.000041,0.767704,0.779927,1.320423,1.038732,2.067927,0.447995
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.00,0,J23505-095,10.0,...,0.961831,0.714286,-0.015359,0.000299,0.832362,0.868586,1.435185,1.550926,2.188623,0.487488
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.50,0,J23505-095,10.0,...,0.930988,0.698053,-0.051849,-0.000303,0.619250,0.574467,1.934343,1.050505,2.382155,0.730568
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37275,ALT-B_Star-37275,False,0.0,0.0,0.0,0.0,0.00,232,J00051+457,12.0,...,0.916000,0.561798,0.263209,-0.002615,0.857144,0.652312,1.617647,1.245098,2.364025,0.406583
37276,ALT-B_Star-37276,True,64.0,1.6,0.0,0.0,0.50,232,J00051+457,12.0,...,1.000267,0.575509,0.042197,-0.001708,0.888439,0.863114,1.583333,1.000000,1.918760,0.492849
37277,ALT-B_Star-37277,False,0.0,0.0,0.0,0.0,0.00,232,J00051+457,12.0,...,0.856494,0.592491,0.047063,-0.002131,0.800533,0.931275,1.403636,1.098182,2.033018,0.397017
37278,ALT-B_Star-37278,True,64.0,1.6,0.0,0.0,0.75,232,J00051+457,12.0,...,0.904447,0.546488,-0.374507,-0.001698,1.423353,0.905443,1.049451,1.060440,1.823070,0.385341


In [79]:
print(list(s4_tr.columns))

['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 'CARMENES_source_idx', 'CARMENES_Ref_star', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude

### Read the VALIDATION S4 sample data

In [80]:
s4_val = pd.read_csv(VAL_S4B_IN, sep=',', decimal='.')
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.00,0.00,0.0,2.457432e+06,0.00,116,J11511+352,11.0,...,0.948040,0.809789,0.008037,-0.000402,0.952390,0.823571,1.387755,1.673469,2.091815,0.732392
1,B_Star-00001,False,0.00,0.00,0.0,2.457487e+06,0.00,29,J20336+617,8.0,...,0.980112,0.604269,0.250163,0.000595,1.387666,0.716612,1.701031,1.134021,2.363048,0.372797
2,B_Star-00002,False,0.00,0.00,0.0,2.457417e+06,0.00,156,J08402+314,11.0,...,1.131387,0.234854,-0.012060,0.002152,2.565635,0.785639,3.083333,2.152778,2.362236,0.114480
3,B_Star-00003,False,0.00,0.00,0.0,2.457431e+06,0.00,180,J05421+124,12.0,...,0.957384,0.539035,0.130598,-0.000051,1.186332,0.588031,1.942197,1.005780,2.343527,0.457471
4,B_Star-00004,False,0.00,0.00,0.0,2.461026e+06,0.00,67,J17052-050,12.0,...,0.926403,0.625322,-0.001450,-0.000311,1.389174,0.576179,2.175000,1.000000,2.457354,0.486447
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3988,B_Star-03995,False,0.00,0.00,0.0,2.459911e+06,0.00,53,J18165+048,7.0,...,0.871491,0.508868,0.113752,0.002208,0.670102,1.184699,1.859155,1.267606,1.190609,0.517696
3989,B_Star-03996,False,0.00,0.00,0.0,2.457428e+06,0.00,8,J23216+172,4.0,...,1.080301,0.727763,-0.132918,-0.000010,1.251124,0.839217,1.426966,1.202247,1.971105,0.446290
3990,B_Star-03997,False,0.00,0.00,0.0,2.458409e+06,0.00,3,J23419+441,10.0,...,1.016076,0.752359,-0.028916,0.000088,0.993836,0.703634,1.431034,1.275862,2.133724,0.686681
3991,B_Star-03998,False,0.00,0.00,0.0,2.457468e+06,0.00,181,J05415+534,10.0,...,0.998230,0.806347,-0.023802,-0.000209,0.898179,0.768453,1.546392,1.206186,2.199315,0.621301


In [81]:
print(list(s4_val.columns))

['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 'CARMENES_source_idx', 'CARMENES_Ref_star', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude

###  Read the CARMENES ML subsample data

In [82]:
carm = pd.read_csv(CARMENES_IN, sep=',', decimal='.')
carm

Unnamed: 0,Karmn,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,1.022727,...,0.813481,0.554379,0.489066,0.004200,1.295720,1.436865,0.920684,1.092416,0.978539,0.445028
1,J23492+024,7.0,8.0,16.0,24.0,32.0,1.222588,1.557039,5.172370,1.273560,...,0.707163,0.879691,-0.910369,0.000814,1.044964,1.000000,0.935512,0.935512,1.242137,0.829985
2,J23431+365,11.0,17.0,29.0,48.0,13.0,1.076923,1.076923,1.400000,1.000000,...,0.991922,0.222404,0.082574,0.000863,1.033456,0.496443,1.216607,0.641686,1.907862,0.091860
3,J23419+441,10.0,38.0,25.0,40.0,13.0,1.265000,1.445714,1.552147,1.142857,...,0.733129,0.550059,0.182636,-0.001152,1.053811,0.971801,0.921956,0.994958,1.188842,0.546181
4,J23381-162,9.0,13.0,10.0,16.0,45.0,1.448718,3.054054,3.054054,2.108108,...,0.872267,0.684783,-0.039567,0.001306,1.099656,1.011043,1.268031,1.157254,2.078314,0.457373
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
228,J00184+440,13.0,7.0,11.0,9.0,3.0,1.000000,1.159763,1.185006,1.159763,...,0.666641,0.558432,-0.085931,0.002910,1.049630,2.126325,0.857117,1.381938,0.643994,0.752413
229,J00183+440,10.0,10.0,19.0,24.0,26.0,5.729958,6.722772,7.760000,1.173267,...,0.633442,0.543139,-0.020521,-0.000210,1.005997,1.155374,0.609987,0.673215,0.591909,0.472642
230,J00162+198E,14.0,15.0,23.0,34.0,13.0,1.000000,1.142857,1.333333,1.142857,...,0.918027,0.215861,-0.329471,0.003725,2.401874,0.826581,2.169255,1.637703,1.688356,0.169315
231,J00067-075,6.0,29.0,27.0,16.0,45.0,1.262032,1.627586,1.918699,1.289655,...,1.030706,0.693464,-0.023253,-0.001416,0.920014,1.296351,1.311097,1.494734,1.489506,0.609811


In [83]:
print(list(carm.columns))

['Karmn', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude', 'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50', 'flux_percentile_rati

## Define and train the scaler

In [84]:
fit_data = s4_tr[cs_f_list].copy()
fit_data

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,1.022727,4.090909,...,0.852444,0.595003,-0.152339,-0.000978,1.074144,0.772892,1.201439,1.104317,2.008315,0.482745
1,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,1.022727,4.090909,...,0.875066,0.666432,0.140860,-0.000429,1.024657,0.962333,1.214286,1.314286,2.107205,0.585830
2,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,1.022727,4.090909,...,0.913891,0.672340,0.120452,-0.000041,0.767704,0.779927,1.320423,1.038732,2.067927,0.447995
3,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,1.022727,4.090909,...,0.961831,0.714286,-0.015359,0.000299,0.832362,0.868586,1.435185,1.550926,2.188623,0.487488
4,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,1.022727,4.090909,...,0.930988,0.698053,-0.051849,-0.000303,0.619250,0.574467,1.934343,1.050505,2.382155,0.730568
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37275,12.0,2.0,25.0,22.0,6.0,1.443038,1.701493,2.590909,1.179104,1.795455,...,0.916000,0.561798,0.263209,-0.002615,0.857144,0.652312,1.617647,1.245098,2.364025,0.406583
37276,12.0,2.0,25.0,22.0,6.0,1.443038,1.701493,2.590909,1.179104,1.795455,...,1.000267,0.575509,0.042197,-0.001708,0.888439,0.863114,1.583333,1.000000,1.918760,0.492849
37277,12.0,2.0,25.0,22.0,6.0,1.443038,1.701493,2.590909,1.179104,1.795455,...,0.856494,0.592491,0.047063,-0.002131,0.800533,0.931275,1.403636,1.098182,2.033018,0.397017
37278,12.0,2.0,25.0,22.0,6.0,1.443038,1.701493,2.590909,1.179104,1.795455,...,0.904447,0.546488,-0.374507,-0.001698,1.423353,0.905443,1.049451,1.060440,1.823070,0.385341


In [85]:
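# Fit on the training features only. StandardScaler ignores NaN values
# when computing each feature's mean and standard deviation.
scaler = StandardScaler()
scaler.fit(fit_data)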

### Save the trained scaler

In [86]:
with open(SCALER_FILE, 'wb') as f:
    pickle.dump(scaler, f)

## Reload and apply the scaler to S4 sample and to ML subsample

### Reload scaler

In [87]:
with open(SCALER_FILE, 'rb') as f:
    ld_scaler = pickle.load(f)
ld_scaler

### Scale the features in TRAINING S4 sample

In [88]:
scaled_s4_tr = s4_tr.copy()
scaled_s4_tr.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.0,0,J23505-095,10.0,...,0.852444,0.595003,-0.152339,-0.000978,1.074144,0.772892,1.201439,1.104317,2.008315,0.482745
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,10.0,...,0.875066,0.666432,0.14086,-0.000429,1.024657,0.962333,1.214286,1.314286,2.107205,0.58583
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,10.0,...,0.913891,0.67234,0.120452,-4.1e-05,0.767704,0.779927,1.320423,1.038732,2.067927,0.447995
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,10.0,...,0.961831,0.714286,-0.015359,0.000299,0.832362,0.868586,1.435185,1.550926,2.188623,0.487488
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.5,0,J23505-095,10.0,...,0.930988,0.698053,-0.051849,-0.000303,0.61925,0.574467,1.934343,1.050505,2.382155,0.730568


In [89]:
scaled_s4_tr.loc[:, cs_f_list] = ld_scaler.transform(scaled_s4_tr.loc[:, cs_f_list])
scaled_s4_tr.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.0,0,J23505-095,0.205178,...,-0.90897,0.541171,-0.669931,-0.015526,-0.019047,-0.007352,-0.442166,-0.134743,-0.219987,0.356795
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,0.205178,...,-0.725712,0.832929,0.631644,-0.009818,-0.021695,0.552345,-0.429826,0.007825,-0.005592,0.720584
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,0.205178,...,-0.411204,0.857062,0.54105,-0.005791,-0.035445,0.013431,-0.327874,-0.179274,-0.090747,0.234164
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,0.205178,...,-0.022857,1.028396,-0.061844,-0.002265,-0.031985,0.275371,-0.217636,0.168502,0.170924,0.373533
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.5,0,J23505-095,0.205178,...,-0.272705,0.96209,-0.223832,-0.008516,-0.043389,-0.593592,0.261842,-0.171281,0.590502,1.231367


In [90]:
s4_tr.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.0,0,J23505-095,10.0,...,0.852444,0.595003,-0.152339,-0.000978,1.074144,0.772892,1.201439,1.104317,2.008315,0.482745
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,10.0,...,0.875066,0.666432,0.14086,-0.000429,1.024657,0.962333,1.214286,1.314286,2.107205,0.58583
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,10.0,...,0.913891,0.67234,0.120452,-4.1e-05,0.767704,0.779927,1.320423,1.038732,2.067927,0.447995
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,10.0,...,0.961831,0.714286,-0.015359,0.000299,0.832362,0.868586,1.435185,1.550926,2.188623,0.487488
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.5,0,J23505-095,10.0,...,0.930988,0.698053,-0.051849,-0.000303,0.61925,0.574467,1.934343,1.050505,2.382155,0.730568


In [91]:
s4_tr[cs_f_list].describe()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,37280.0,37280.0,36800.0,36000.0,34720.0,36800.0,36000.0,34720.0,36000.0,34720.0,...,37264.0,37264.0,37264.0,37264.0,36778.0,37280.0,37280.0,37280.0,37280.0,37264.0
mean,9.343348,18.042918,20.647826,24.391111,26.124424,1.456544,1.944625,2.56214,1.36751,1.816531,...,0.964652,0.4625134,-0.001427365,0.000517,1.430096,0.775381,1.661754,1.302762,2.109784,0.381641
std,3.200453,14.103665,13.185888,13.212565,12.997465,0.715273,1.241585,1.888397,0.695504,1.085073,...,0.123448,0.2448243,0.2252681,0.096315,18.68824,0.338476,1.04106,1.472788,0.461259,0.283369
min,2.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,...,0.186169,3.39825e-08,-1.463254,-5.271012,2.541953e-11,0.11257,0.090293,0.090293,0.172812,0.000359
25%,7.0,8.0,9.0,14.0,14.0,1.071111,1.330827,1.5,1.02381,1.235294,...,0.895362,0.2426018,-0.09878183,-0.000482,0.9285395,0.575155,1.335683,0.980263,1.852855,0.1454
50%,9.0,13.0,18.0,24.0,26.0,1.25,1.6,2.0,1.181818,1.5,...,0.959796,0.4544353,-8.930678e-11,1e-06,1.110795,0.736751,1.529304,1.212766,2.0636,0.339266
75%,11.0,26.0,31.0,35.0,37.0,1.535714,2.027778,3.0,1.446154,2.0,...,1.028426,0.670386,0.09537411,0.000489,1.508761,0.905465,1.787105,1.44,2.317864,0.576435
max,18.0,49.0,49.0,49.0,49.0,7.288276,13.351852,15.0,9.37037,9.37037,...,1.375438,0.9966831,1.936008,4.129752,3564.057,7.05428,85.626667,213.4375,4.519623,11.745672


In [92]:
scaled_s4_tr[cs_f_list].describe()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,37280.0,37280.0,36800.0,36000.0,34720.0,36800.0,36000.0,34720.0,36000.0,34720.0,...,37264.0,37264.0,37264.0,37264.0,36778.0,37280.0,37280.0,37280.0,37280.0,37264.0
mean,-4.2693550000000005e-17,-9.148619e-17,8.959017000000001e-17,-5.052748e-17,-1.3097560000000001e-17,9.576880000000001e-17,-5.842240000000001e-17,8.185976000000001e-17,1.831621e-16,4.584147e-17,...,3.3559340000000005e-17,-2.745764e-16,7.627123e-18,-2.2881370000000002e-18,-3.670758e-18,-9.529811000000001e-17,-9.758527000000001e-17,7.242657e-17,-5.611153e-16,-1.830509e-16
std,1.000013,1.000013,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,...,1.000013,1.000013,1.000013,1.000013,1.000014,1.000013,1.000013,1.000013,1.000013,1.000013
min,-2.294502,-1.279324,-1.565925,-1.84608,-2.009992,-0.6382879,-0.7608323,-0.8272426,-0.5284149,-0.7525235,...,-6.306264,-1.88919,-6.489361,-54.73265,-0.0765249,-1.958247,-1.509502,-0.8232581,-4.199372,-1.345553
25%,-0.7322022,-0.7120882,-0.8833674,-0.7864677,-0.9328434,-0.5388684,-0.4943731,-0.562464,-0.4941809,-0.535674,...,-0.5613034,-0.8982544,-0.4321774,-0.01037192,-0.02683846,-0.5915596,-0.3132149,-0.2189744,-0.5570242,-0.8336991
50%,-0.1072824,-0.3575656,-0.2008103,-0.02960186,-0.009573079,-0.2887663,-0.2775723,-0.2976853,-0.2669921,-0.2917182,...,-0.0393412,-0.03299619,0.006336377,-0.005353901,-0.01708591,-0.1141285,-0.1272274,-0.06110645,-0.1001276,-0.1495433
75%,0.5176375,0.5641929,0.7851056,0.8029505,0.836758,0.110687,0.06697425,0.2318719,0.1130764,0.1690871,...,0.5166123,0.8490798,0.4297225,-0.000288742,0.004209387,0.3843297,0.1204085,0.09318402,0.4511185,0.6874313
max,2.704857,2.194997,2.15022,1.862563,1.760028,8.153267,9.187762,6.586559,11.50672,6.961699,...,3.327655,2.181878,8.600693,42.87256,190.6373,18.55076,80.65438,144.0381,5.224551,40.10389
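The `std` row above reads 1.000013 rather than exactly 1. This is consistent with `StandardScaler` dividing by the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). The two differ by a factor of `sqrt(n / (n - 1))`. The sketch below reproduces the effect on synthetic data, with `n` taken from the `count` row; the column name `x` and the random values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 37280  # number of rows, as reported in the count row
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(size=n)})  # synthetic data, not the notebook's

# Standardize the synthetic column.
z = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

std_pop = z["x"].std(ddof=0)     # the quantity StandardScaler normalizes to 1
std_sample = z["x"].std(ddof=1)  # the quantity describe() reports
expected = np.sqrt(n / (n - 1))  # theoretical ratio between the two

print(std_pop, std_sample, expected)
```

Columns with missing values have a smaller effective `n`, which would explain the slightly larger 1.000014 shown for some features.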


#### Save the scaled TRAINING S4 sample dataset

In [93]:
scaled_s4_tr.to_csv(TRAIN_S4B_OUT, sep=',', decimal='.', index=False)

### Scale the features in VALIDATION S4 sample

In [94]:
scaled_s4_val = s4_val.copy()
scaled_s4_val.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.0,0.0,0.0,2457432.0,0.0,116,J11511+352,11.0,...,0.94804,0.809789,0.008037,-0.000402,0.95239,0.823571,1.387755,1.673469,2.091815,0.732392
1,B_Star-00001,False,0.0,0.0,0.0,2457487.0,0.0,29,J20336+617,8.0,...,0.980112,0.604269,0.250163,0.000595,1.387666,0.716612,1.701031,1.134021,2.363048,0.372797
2,B_Star-00002,False,0.0,0.0,0.0,2457417.0,0.0,156,J08402+314,11.0,...,1.131387,0.234854,-0.01206,0.002152,2.565635,0.785639,3.083333,2.152778,2.362236,0.11448
3,B_Star-00003,False,0.0,0.0,0.0,2457431.0,0.0,180,J05421+124,12.0,...,0.957384,0.539035,0.130598,-5.1e-05,1.186332,0.588031,1.942197,1.00578,2.343527,0.457471
4,B_Star-00004,False,0.0,0.0,0.0,2461026.0,0.0,67,J17052-050,12.0,...,0.926403,0.625322,-0.00145,-0.000311,1.389174,0.576179,2.175,1.0,2.457354,0.486447


In [95]:
scaled_s4_val.loc[:, cs_f_list] = ld_scaler.transform(scaled_s4_val.loc[:, cs_f_list])
scaled_s4_val.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.0,0.0,0.0,2457432.0,0.0,116,J11511+352,0.517637,...,-0.134573,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807
1,B_Star-00001,False,0.0,0.0,0.0,2457487.0,0.0,29,J20336+617,-0.419742,...,0.125231,0.579019,1.116863,0.000807,-0.00227,-0.173628,0.037728,-0.114574,0.549077,-0.031212
2,B_Star-00002,False,0.0,0.0,0.0,2457417.0,0.0,156,J08402+314,0.517637,...,1.350668,-0.929902,-0.047202,0.01698,0.060763,0.030306,1.36553,0.577156,0.547318,-0.942817
3,B_Star-00003,False,0.0,0.0,0.0,2457431.0,0.0,180,J05421+124,0.830097,...,-0.058875,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605
4,B_Star-00004,False,0.0,0.0,0.0,2461026.0,0.0,67,J17052-050,0.830097,...,-0.309842,0.665012,-0.000102,-0.008592,-0.00219,-0.588534,0.49301,-0.205573,0.753534,0.369862


In [96]:
s4_val.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.0,0.0,0.0,2457432.0,0.0,116,J11511+352,11.0,...,0.94804,0.809789,0.008037,-0.000402,0.95239,0.823571,1.387755,1.673469,2.091815,0.732392
1,B_Star-00001,False,0.0,0.0,0.0,2457487.0,0.0,29,J20336+617,8.0,...,0.980112,0.604269,0.250163,0.000595,1.387666,0.716612,1.701031,1.134021,2.363048,0.372797
2,B_Star-00002,False,0.0,0.0,0.0,2457417.0,0.0,156,J08402+314,11.0,...,1.131387,0.234854,-0.01206,0.002152,2.565635,0.785639,3.083333,2.152778,2.362236,0.11448
3,B_Star-00003,False,0.0,0.0,0.0,2457431.0,0.0,180,J05421+124,12.0,...,0.957384,0.539035,0.130598,-5.1e-05,1.186332,0.588031,1.942197,1.00578,2.343527,0.457471
4,B_Star-00004,False,0.0,0.0,0.0,2461026.0,0.0,67,J17052-050,12.0,...,0.926403,0.625322,-0.00145,-0.000311,1.389174,0.576179,2.175,1.0,2.457354,0.486447


In [97]:
s4_val[cs_f_list].describe()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,3993.0,3993.0,3939.0,3867.0,3728.0,3939.0,3867.0,3728.0,3867.0,3728.0,...,3987.0,3987.0,3987.0,3987.0,3938.0,3993.0,3993.0,3993.0,3993.0,3987.0
mean,9.356875,18.049086,20.869002,24.577709,26.037822,1.441207,1.95183,2.565117,1.384547,1.835793,...,0.974717,0.469068,0.002993,-0.000556,1.33714,0.76677,1.653001,1.278361,2.110168,0.382817
std,3.232803,14.019475,13.283738,13.272984,13.054052,0.693493,1.274144,1.901285,0.731997,1.134257,...,0.109177,0.246879,0.224028,0.074621,0.9721879,0.320665,0.726211,0.68753,0.44939,0.269142
min,2.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,...,0.412037,0.012189,-1.300869,-1.998146,8.961103e-10,0.143365,0.178138,0.216418,0.49487,0.008886
25%,7.0,8.0,9.0,14.0,14.0,1.066667,1.300971,1.5,1.022727,1.235294,...,0.903999,0.247992,-0.089468,-0.00048,0.9223354,0.572295,1.337705,0.968421,1.86078,0.151018
50%,9.0,13.0,18.0,24.0,25.0,1.234831,1.6,2.0,1.180851,1.5,...,0.964275,0.460597,6e-06,-2e-06,1.101106,0.731889,1.530612,1.202247,2.063319,0.333944
75%,12.0,26.0,31.0,36.0,37.0,1.522774,2.116279,3.0,1.454545,2.0,...,1.029773,0.676191,0.09704,0.000485,1.486661,0.89514,1.780822,1.433071,2.326295,0.573129
max,18.0,49.0,49.0,49.0,49.0,7.288276,13.351852,15.0,9.37037,9.37037,...,1.375438,0.992947,1.455934,3.289913,25.00916,4.36005,17.538462,18.153846,4.545622,3.029625


In [98]:
scaled_s4_val[cs_f_list].describe()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,3993.0,3993.0,3939.0,3867.0,3728.0,3939.0,3867.0,3728.0,3867.0,3728.0,...,3987.0,3987.0,3987.0,3987.0,3938.0,3993.0,3993.0,3993.0,3993.0,3987.0
mean,0.004227,0.000437,0.016774,0.014123,-0.006663,-0.021442,0.005803,0.001576,0.024496,0.017752,...,0.081534,0.026774,0.019621,-0.011142,-0.004974,-0.02544,-0.008408,-0.016568,0.000833,0.004148
std,1.010121,0.994044,1.007434,1.004587,1.004368,0.969563,1.026238,1.006839,1.052484,1.045343,...,0.884413,1.008406,0.994507,0.774763,0.052022,0.947391,0.697579,0.466829,0.974282,0.949807
min,-2.294502,-1.279324,-1.565925,-1.84608,-2.009992,-0.638288,-0.760832,-0.827243,-0.528415,-0.752524,...,-4.47657,-1.839401,-5.768502,-20.751503,-0.076525,-1.867264,-1.425121,-0.73762,-3.501148,-1.315462
25%,-0.732202,-0.712088,-0.883367,-0.786468,-0.932843,-0.545082,-0.51842,-0.562464,-0.495737,-0.535674,...,-0.491335,-0.876238,-0.390831,-0.010355,-0.02717,-0.60001,-0.311273,-0.227015,-0.539844,-0.813875
50%,-0.107282,-0.357566,-0.20081,-0.029602,-0.086512,-0.309973,-0.277572,-0.297685,-0.268383,-0.291718,...,-0.003054,-0.007826,0.006365,-0.005383,-0.017604,-0.128495,-0.125971,-0.068249,-0.100736,-0.168325
75%,0.830097,0.564193,0.785106,0.878637,0.836758,0.092596,0.138256,0.231872,0.125142,0.169087,...,0.527522,0.872792,0.437119,-0.000329,0.003027,0.353823,0.114373,0.088479,0.469398,0.675763
max,2.704857,2.194997,2.15022,1.862563,1.760028,8.153267,9.187762,6.586559,11.506715,6.961699,...,3.327655,2.166616,6.469539,34.152774,1.261723,10.590762,15.250728,11.441779,5.280916,9.344787


#### Save the scaled VALIDATION S4 sample dataset

In [99]:
scaled_s4_val.to_csv(VAL_S4B_OUT, sep=',', decimal='.', index=False)

### Scale the features in CARMENES subsample

In [100]:
scaled_carm = carm.copy()
scaled_carm.head()

Unnamed: 0,Karmn,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,1.022727,...,0.813481,0.554379,0.489066,0.0042,1.29572,1.436865,0.920684,1.092416,0.978539,0.445028
1,J23492+024,7.0,8.0,16.0,24.0,32.0,1.222588,1.557039,5.17237,1.27356,...,0.707163,0.879691,-0.910369,0.000814,1.044964,1.0,0.935512,0.935512,1.242137,0.829985
2,J23431+365,11.0,17.0,29.0,48.0,13.0,1.076923,1.076923,1.4,1.0,...,0.991922,0.222404,0.082574,0.000863,1.033456,0.496443,1.216607,0.641686,1.907862,0.09186
3,J23419+441,10.0,38.0,25.0,40.0,13.0,1.265,1.445714,1.552147,1.142857,...,0.733129,0.550059,0.182636,-0.001152,1.053811,0.971801,0.921956,0.994958,1.188842,0.546181
4,J23381-162,9.0,13.0,10.0,16.0,45.0,1.448718,3.054054,3.054054,2.108108,...,0.872267,0.684783,-0.039567,0.001306,1.099656,1.011043,1.268031,1.157254,2.078314,0.457373


In [101]:
# Apply the previously fitted scaler (transform only, no refit) to the cesium feature columns
scaled_carm.loc[:, cs_f_list] = ld_scaler.transform(scaled_carm.loc[:, cs_f_list])
scaled_carm.head()

Unnamed: 0,Karmn,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.68395,0.963624,-0.495737,...,-1.224593,0.375234,2.177405,0.038241,-0.007191,1.95433,-0.711851,-0.142823,-2.452548,0.223693
1,J23492+024,-0.732202,-0.712088,-0.35249,-0.029602,0.452062,-0.32709,-0.312174,1.382266,-0.135084,...,-2.085847,1.704012,-4.034985,0.003083,-0.020609,0.663629,-0.697608,-0.24936,-1.881067,1.582213
2,J23431+365,0.517637,-0.073948,0.633426,1.786876,-1.009783,-0.530743,-0.698876,-0.61542,-0.528415,...,0.220901,-0.980754,0.372899,0.00359,-0.021224,-0.824111,-0.427596,-0.448866,-0.437768,-1.022645
3,J23419+441,0.205178,1.415047,0.330067,1.181383,-1.009783,-0.267795,-0.401839,-0.534849,-0.323011,...,-1.8755,0.357589,0.817097,-0.017334,-0.020135,0.580317,-0.71063,-0.208997,-1.99661,0.580664
4,J23381-162,-0.107282,-0.357566,-0.807528,-0.635095,1.452272,-0.010941,0.893571,0.260497,1.064851,...,-0.748389,0.907884,-0.169309,0.008196,-0.017682,0.696254,-0.3782,-0.098799,-0.068227,0.267257


In [102]:
# Check that the original dataframe was not modified by the scaling step
carm.head()

Unnamed: 0,Karmn,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,1.022727,...,0.813481,0.554379,0.489066,0.0042,1.29572,1.436865,0.920684,1.092416,0.978539,0.445028
1,J23492+024,7.0,8.0,16.0,24.0,32.0,1.222588,1.557039,5.17237,1.27356,...,0.707163,0.879691,-0.910369,0.000814,1.044964,1.0,0.935512,0.935512,1.242137,0.829985
2,J23431+365,11.0,17.0,29.0,48.0,13.0,1.076923,1.076923,1.4,1.0,...,0.991922,0.222404,0.082574,0.000863,1.033456,0.496443,1.216607,0.641686,1.907862,0.09186
3,J23419+441,10.0,38.0,25.0,40.0,13.0,1.265,1.445714,1.552147,1.142857,...,0.733129,0.550059,0.182636,-0.001152,1.053811,0.971801,0.921956,0.994958,1.188842,0.546181
4,J23381-162,9.0,13.0,10.0,16.0,45.0,1.448718,3.054054,3.054054,2.108108,...,0.872267,0.684783,-0.039567,0.001306,1.099656,1.011043,1.268031,1.157254,2.078314,0.457373


In [103]:
carm[cs_f_list].describe()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,233.0,233.0,230.0,225.0,217.0,230.0,225.0,217.0,225.0,217.0,...,233.0,233.0,233.0,233.0,228.0,233.0,233.0,233.0,233.0,233.0
mean,9.343348,18.042918,20.647826,24.391111,26.124424,1.456544,1.944625,2.56214,1.36751,1.816531,...,0.931129,0.406203,0.044568,-0.00553,1.354686,0.970528,1.461775,1.338192,1.673497,0.344544
std,3.207301,14.133839,13.214468,13.241841,13.027329,0.716823,1.244336,1.892736,0.697046,1.087566,...,0.13928,0.209434,0.479494,0.063154,0.713218,0.434143,1.22245,1.203844,0.570022,0.231031
min,2.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,...,0.581979,0.050107,-1.378112,-0.946019,0.000299,0.30094,0.387727,0.352122,0.391725,0.020064
25%,7.0,8.0,9.0,14.0,14.0,1.07119,1.330827,1.5,1.02381,1.235294,...,0.846946,0.222533,-0.157819,-0.001681,0.964158,0.657207,0.982467,0.948118,1.298997,0.1338
50%,9.0,13.0,18.0,24.0,26.0,1.25,1.6,2.0,1.181818,1.5,...,0.934611,0.380692,0.011492,0.000343,1.169603,0.916748,1.251863,1.177379,1.614112,0.325024
75%,11.0,26.0,31.0,35.0,37.0,1.532738,2.027778,3.0,1.446154,2.0,...,1.006945,0.559939,0.267217,0.001723,1.534653,1.179091,1.556039,1.381938,2.029083,0.492239
max,18.0,49.0,49.0,49.0,49.0,7.288276,13.351852,15.0,9.37037,9.37037,...,1.375437,0.998119,1.855925,0.054773,5.925674,3.069775,15.930603,16.869044,3.703782,1.313978


In [104]:
scaled_carm[cs_f_list].describe()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,233.0,233.0,230.0,225.0,217.0,230.0,225.0,217.0,225.0,217.0,...,233.0,233.0,233.0,233.0,228.0,233.0,233.0,233.0,233.0,233.0
mean,8.386234e-17,-9.148619e-17,8.881784000000001e-17,-7.105427000000001e-17,-1.2278960000000002e-17,8.495620000000001e-17,-4.9343250000000004e-17,9.823171000000001e-17,2.289527e-16,4.9115860000000003e-17,...,-0.271566,-0.230007,0.204185,-0.062786,-0.004035,0.576555,-0.192094,0.024057,-0.945874,-0.130916
std,1.002153,1.002153,1.002181,1.00223,1.002312,1.002181,1.00223,1.002312,1.00223,1.002312,...,1.128264,0.855456,2.128576,0.655705,0.038165,1.282657,1.174251,0.817402,1.235812,0.815313
min,-2.294502,-1.279324,-1.565925,-1.84608,-2.009992,-0.6382879,-0.7608323,-0.8272426,-0.5284149,-0.7525235,...,-3.099925,-1.684524,-6.111399,-9.827589,-0.076509,-1.401715,-1.223795,-0.645478,-3.724766,-1.276013
25%,-0.7322022,-0.7120882,-0.8833674,-0.7864677,-0.9328434,-0.5387575,-0.4943731,-0.562464,-0.4941809,-0.535674,...,-0.953505,-0.980227,-0.694258,-0.022823,-0.024932,-0.349139,-0.652505,-0.2408,-1.757794,-0.874639
50%,-0.1072824,-0.3575656,-0.2008103,-0.02960186,-0.009573079,-0.2887663,-0.2775723,-0.2976853,-0.2669921,-0.2917182,...,-0.243356,-0.334207,0.057353,-0.001805,-0.013939,0.417663,-0.39373,-0.085134,-1.074622,-0.199803
75%,0.5176375,0.5641929,0.7851056,0.8029505,0.836758,0.1065261,0.06697425,0.2318719,0.1130764,0.1690871,...,0.342597,0.397944,1.192571,0.012517,0.005595,1.192746,-0.101547,0.05376,-0.174962,0.390301
max,2.704857,2.194997,2.15022,1.862563,1.760028,8.153267,9.187762,6.586559,11.50672,6.961699,...,3.327644,2.187744,8.245187,0.56332,0.24056,6.778697,13.706263,10.569407,3.4558,3.290234


<font color='blue'>**SIMILAR TO S4 SAMPLE**</font>

**OBSERVATION:** the mean and standard deviation of the scaled ML subsample are not exactly 0 and 1, respectively, and for some features they differ considerably from those values. This is expected to some extent: the scaler was fitted on the S4 sample, and we saw previously that the value distributions of many features differ markedly between the ML subsample and the S4 sample.
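The effect described in this observation can be reproduced with a minimal sketch on synthetic data. The `reference` and `other` dataframes and the `feat` column are hypothetical stand-ins for the S4 sample and the ML subsample, not the real datasets. The sketch also shows a smaller, unrelated source of deviation: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `describe()` and `std()` report the sample standard deviation (`ddof=1`) by default, so even the fitting sample shows a value slightly above 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-ins: "reference" plays the role of the sample used to fit
# the scaler, "other" the role of a sample with a different distribution.
reference = pd.DataFrame({"feat": rng.normal(loc=0.0, scale=1.0, size=1000)})
other = pd.DataFrame({"feat": rng.normal(loc=3.0, scale=2.0, size=233)})

# Fit on the reference sample only, then transform both samples.
scaler = StandardScaler().fit(reference)
scaled_reference = pd.DataFrame(scaler.transform(reference), columns=reference.columns)
scaled_other = pd.DataFrame(scaler.transform(other), columns=other.columns)

# On the fitting sample: mean 0 and population std 1 (up to rounding),
# but the pandas default (ddof=1) is slightly larger than 1.
ref_mean = scaled_reference["feat"].mean()
ref_std_pop = scaled_reference["feat"].std(ddof=0)
ref_std_sample = scaled_reference["feat"].std(ddof=1)

# On the other sample: neither mean nor std is close to 0 and 1.
other_mean = scaled_other["feat"].mean()
other_std = scaled_other["feat"].std(ddof=0)

print(f"reference: mean={ref_mean:.3g}, std(ddof=0)={ref_std_pop:.6f}, std(ddof=1)={ref_std_sample:.6f}")
print(f"other:     mean={other_mean:.3f}, std(ddof=0)={other_std:.3f}")
```

A check of this kind, run per feature on the real data, could help separate features whose deviation is only the `ddof` effect from those where the two samples are distributed differently.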

#### Save the scaled ML subsample dataset

In [105]:
scaled_carm.to_csv(CARMENES_OUT, sep=',', decimal='.', index=False)

## Summary

**RESULTS:**

- We have created and saved a `StandardScaler` fitted on the data in the S4 sample (note that `NaN` values are ignored during the fit and preserved during the transform).
- We have reloaded that scaler and applied it to the _cesium_ features of the S4 sample and the ML subsample.
- **NOTE:** as expected (since the value distributions of many _cesium_ features differ between the ML subsample and the S4 sample), the mean and standard deviation of the scaled ML subsample are not 0 and 1 for many of the features.
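
The first two points of the summary can be illustrated with a small, self-contained sketch. It uses a toy array instead of the real feature tables, and it round-trips the scaler through `pickle` in memory rather than through a file, so the names `X_fit`, `X_new` and `restored` are assumptions made for illustration only.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data with a missing value (hypothetical, not taken from the real datasets).
X_fit = np.array([[1.0], [2.0], [np.nan], [3.0]])

# NaN entries are disregarded when computing the mean and variance.
scaler = StandardScaler().fit(X_fit)

# Serialize and restore the fitted scaler; the notebook writes it to a file,
# here the round trip is done in memory to keep the example self-contained.
restored = pickle.loads(pickle.dumps(scaler))

# Transform new data with the restored scaler: NaN entries stay NaN.
X_new = np.array([[2.0], [np.nan], [4.0]])
X_scaled = restored.transform(X_new)

print("fitted mean:", restored.mean_[0])
print("scaled values:", X_scaled.ravel())
```

Because missing values pass through the transform unchanged, any later step that cannot handle `NaN` (a nearest-neighbour distance, for instance) still needs its own imputation or filtering.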