# FEATURE ENGINEERING - ANALYSIS TO DECIDE ON IMPUTING STRATEGIES (FOR 1NN STRATEGY)

In this notebook we check which features need imputing and decide on the most adequate imputing strategies.

We also preselect the features to be delivered to the ML pipeline.

## Modules and configuration

### Modules

In [1]:
import pandas as pd
import numpy as np

import pickle

from cesium.data_management import TimeSeries
from cesium.features import cadence_features as cscf
from cesium.features import graphs as csgr
from cesium.features import period_folding as cspf

import sys

import matplotlib.pyplot as plt
import seaborn as sns
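# 'figure.figsize' is not a seaborn style parameter, so `set_style` silently
# drops it; the default figure size is set through matplotlib instead.
sns.set_style("white")
plt.rcParams['figure.figsize'] = (15, 10)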

### Configuration

In [57]:
CARMENES_IN = "../data/DATASETS_CESIUM/1NN/cesium_ML_Subsample_Dataset.csv"
TR_S4B_IN = "../data/DATASETS_CESIUM/1NN/cesium_ML_FINAL_ALT_S4B.csv"
VAL_S4B_IN = "../data/DATASETS_CESIUM/1NN/cesium_ML_FINAL_S4B.csv"

CESIUM_FEATURES_FILE = "../data/cesium_Features_by_Category.csv"
CARM_PG_FILE = "../data/SELECTION_for_ML_CARM_VIS_objects_with_PG.csv"
# To have additional metadata available for ML subsample

ML_CURVES_FOLDER = "../data/CARM_VIS_RVs/"
ML_PREFIX = ""
ML_SUFFIX = ".avc.dat"
TR_S4B_CURVES_FOLDER = "../SYNTH_RV_SAMPLES/S4B_ALTERNATIVE_ts_files/"
TR_S4B_PREFIX = "RV_ALT-B_"
TR_S4B_SUFFIX = '.dat'
VAL_S4B_CURVES_FOLDER = "../SYNTH_RV_SAMPLES/S4B_ts_files/"
VAL_S4B_PREFIX = "S4-RV_B_"
VAL_S4B_SUFFIX = '.dat'

ML_ADD_COLUMNS = ['Karmn'] # Only cesium features and this column will be kept.
S4_ADD_COLUMNS = ['ID', 'Pulsating', 'frequency', 'amplitudeRV',
                  'offsetRV', 'refepochRV', 'phase',
                  'CARMENES_source_idx', 'CARMENES_Ref_star'] # Only cesium features and these columns will be kept.

CARMENES_OUT = "../data/DATASETS_ML/1NN/ML_00_DS_Initial.csv"
TR_S4B_OUT = "../data/DATASETS_ML/1NN/1NN_TRAIN_S4B_00_DS_Initial.csv"
VAL_S4B_OUT = "../data/DATASETS_ML/1NN/1NN_VAL_S4B_00_DS_Initial.csv"

IMPUTED_FEATURES_OUT = "../data/ML_MODELS/ML_pipeline_steps/1NN/1NN_imputed_features_list.pickle"
NAN_FEATURES_TABLE = './1NN_NaN_features_table.tex'

### Functions

In [58]:
def draw_curves(times, values, value_label, title, fig_filename=None):
    '''Draws several time series curves in the same plot.

    Had to be modified for the new seaborn version; needs improvement.
    '''
    plt.figure(figsize=(10,7))
    plt.title(title, fontsize=16)
    plt.xlabel("Time [BJD]", fontsize=12)
    plt.ylabel(value_label, fontsize=12)
    # One scatter series per curve, all drawn on the same axes
    for i in range(0, len(times)):
        data = pd.DataFrame(data={'Time': times[i], 'RV': values[i]})
        sns.scatterplot(data=data, x='Time', y='RV')
    # Save the image only if a file name was given
    if fig_filename is not None:
        plt.savefig(fig_filename, format='jpg', bbox_inches='tight')

## Load data

We load the data, i.e. the time series as previously featurized by _cesium_. Each dataset also includes many other columns (identifiers, characteristics of the star, etc.) that will be dropped in the output file.

### Load cesium features

In [59]:
cs_f = pd.read_csv(CESIUM_FEATURES_FILE, sep=';', decimal='.')
cs_f

Unnamed: 0,Type,Feature,Description
0,Cadence/Error,all_times_nhist_numpeaks,Number of peaks (local maxima) in histogram of...
1,Cadence/Error,all_times_nhist_peak1_bin,Return the (bin) index of the ith largest peak...
2,Cadence/Error,all_times_nhist_peak2_bin,Return the (bin) index of the ith largest peak...
3,Cadence/Error,all_times_nhist_peak3_bin,Return the (bin) index of the ith largest peak...
4,Cadence/Error,all_times_nhist_peak4_bin,Return the (bin) index of the ith largest peak...
...,...,...,...
107,Lomb-Scargle (periodic),p2p_scatter_2praw,Get ratio of variability (sum of squared diffe...
108,Lomb-Scargle (periodic),p2p_scatter_over_mad,Get ratio of variability of folded and unfolde...
109,Lomb-Scargle (periodic),p2p_scatter_pfold_over_mad,Get ratio of median of period-folded data over...
110,Lomb-Scargle (periodic),p2p_ssqr_diff_over_var,Get sum of squared differences of consecutive ...


We now extract the list of cesium features.

In [60]:
csf_list = cs_f['Feature'].to_list()
print(csf_list)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude', 'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50', 'flux_percentile_ratio_mid65',

### Load CARMENES full PG data

In [61]:
carm_pg = pd.read_csv(CARM_PG_FILE, sep=',', decimal='.')
carm_pg

Unnamed: 0,Karmn,Name,Comp,GJ,RA_J2016_deg,DE_J2016_deg,RA_J2000,DE_J2000,l_J2016_deg,b_J2016_deg,...,WF_offset_PG_TESS,WF_e_offset_PG_TESS,WF_FAP_PG_TESS,WF_valid_PG_TESS,WF_error_PG_TESS,WF_elapsed_time_PG_TESS,WF_plain_file_TESS,WF_fig_file_TESS,PG_file_RV,PG_file_TESS
0,J23505-095,LP 763-012,-,4367,357.634705,-9.560964,23:50:31.64,-09:33:32.7,80.777067,-67.303426,...,1000.000122,9.022946e-07,1.0,1.0,,132.607176,../data/CARM_VIS_TESS_WinFunc_PGs/WF_J23505-09...,../data/CARM_VIS_TESS_WinFunc_PGs/figures/WF_J...,../data/CARM_VIS_RVs_PGs/J23505-095_RV_PG.dat,../data/CARM_VIS_TESS_PGs/J23505-095_RV_PG.dat
1,J23492+024,BR Psc,-,908,357.306604,2.396918,23:49:12.53,+02:24:04.4,93.567467,-56.885396,...,,,,0.0,Not recognized as a supported data product:\nn...,0.001995,,,../data/CARM_VIS_RVs_PGs/J23492+024_RV_PG.dat,
2,J23431+365,GJ 1289,-,1289,355.781509,36.536310,23:43:06.31,+36:32:13.1,107.922839,-24.336479,...,999.999512,4.306074e-06,1.0,1.0,,97.939914,../data/CARM_VIS_TESS_WinFunc_PGs/WF_J23431+36...,../data/CARM_VIS_TESS_WinFunc_PGs/figures/WF_J...,../data/CARM_VIS_RVs_PGs/J23431+365_RV_PG.dat,../data/CARM_VIS_TESS_PGs/J23431+365_RV_PG.dat
3,J23419+441,HH And,-,905,355.480015,44.170376,23:41:55.04,+44:10:38.8,109.989338,-16.947350,...,,,,0.0,Not recognized as a supported data product:\nn...,0.000998,,,../data/CARM_VIS_RVs_PGs/J23419+441_RV_PG.dat,
4,J23381-162,G 273-093,-,4352,354.532687,-16.236514,23:38:08.16,-16:14:10.2,61.845437,-69.825220,...,1000.000122,9.022946e-07,1.0,1.0,,136.603404,../data/CARM_VIS_TESS_WinFunc_PGs/WF_J23381-16...,../data/CARM_VIS_TESS_WinFunc_PGs/figures/WF_J...,../data/CARM_VIS_RVs_PGs/J23381-162_RV_PG.dat,../data/CARM_VIS_TESS_PGs/J23381-162_RV_PG.dat
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
228,J00184+440,GQ And,B,15B,4.625301,44.028745,00:18:25.82,+44:01:38.1,116.700222,-18.444111,...,,,,0.0,Not recognized as a supported data product:\nn...,0.000998,,,../data/CARM_VIS_RVs_PGs/J00184+440_RV_PG.dat,
229,J00183+440,GX And,A,15A,4.613226,44.024787,00:18:22.88,+44:01:22.6,116.690592,-18.446865,...,999.999756,2.134945e-06,1.0,1.0,,191.241441,../data/CARM_VIS_TESS_WinFunc_PGs/WF_J00183+44...,../data/CARM_VIS_TESS_WinFunc_PGs/figures/WF_J...,../data/CARM_VIS_RVs_PGs/J00183+440_RV_PG.dat,../data/CARM_VIS_TESS_PGs/J00183+440_RV_PG.dat
230,J00162+198E,LP 404-062,B,1006B,4.070651,19.860692,00:16:16.15,+19:51:50.5,111.738263,-42.245679,...,,,,0.0,Not recognized as a supported data product:\nn...,0.001995,,,../data/CARM_VIS_RVs_PGs/J00162+198E_RV_PG.dat,
231,J00067-075,GJ 1002,-,1002,1.676350,-7.546475,00:06:43.20,-07:32:17.0,92.444693,-67.730511,...,,,,0.0,Not recognized as a supported data product:\nn...,0.001995,,,../data/CARM_VIS_RVs_PGs/J00067-075_RV_PG.dat,


In [62]:
print(list(carm_pg.columns))

['Karmn', 'Name', 'Comp', 'GJ', 'RA_J2016_deg', 'DE_J2016_deg', 'RA_J2000', 'DE_J2000', 'l_J2016_deg', 'b_J2016_deg', 'Ref01', 'SpT', 'SpTnum', 'Ref02', 'Teff_K', 'eTeff_K', 'logg', 'elogg', '[Fe/H]', 'e[Fe/H]', 'Ref03', 'L_Lsol', 'eL_Lsol', 'Ref04', 'R_Rsol', 'eR_Rsol', 'Ref05', 'M_Msol', 'eM_Msol', 'Ref06', 'muRA_masa-1', 'emuRA_masa-1', 'muDE_masa-1', 'emuDE_masa-1', 'Ref07', 'pi_mas', 'epi_mas', 'Ref08', 'd_pc', 'ed_pc', 'Ref09', 'Vr_kms-1', 'eVr_kms-1', 'Ref10', 'ruwe', 'Ref11', 'U_kms-1', 'eU_kms-1', 'V_kms-1', 'eV_kms-1', 'W_kms-1', 'eW_kms-1', 'Ref12', 'sa_m/s/a', 'esa_m/s/a', 'Ref13', 'SKG', 'Ref14', 'SKG_lit', 'Ref14_lit', 'Pop', 'Ref15', 'vsini_flag', 'vsini_kms-1', 'evsini_kms-1', 'Ref16', 'P_d', 'eP_d', 'Ref17', 'pEWHalpha_A', 'epEWHalpha_A', 'Ref18', 'log(LHalpha/Lbol)', 'elog(LHalpha/Lbol)', 'Ref19', '1RXS', 'CRT_s-1', 'eCRT_s-1', 'HR1', 'eHR1', 'HR2', 'eHR2', 'Flux_X_E-13_ergcm-2s-1', 'eFlux_X_E-13_ergcm-2s-1', 'LX/LJ', 'eLX/LJ', 'Ref20', 'Activity', 'Ref21', 'FUV_mag',

### Load CARMENES data (cesium)

In [63]:
carm_l = pd.read_csv(CARMENES_IN, sep=',', decimal='.')
carm_l

Unnamed: 0,Karmn,SpT,SpTnum,Teff_K,eTeff_K,logg,elogg,[Fe/H],e[Fe/H],L_Lsol,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,M4.0 V,4.0,3377.0,34.0,4.83,0.10,-0.08,0.10,0.010298,...,0.813481,0.554379,0.489066,0.004200,1.295720,1.436865,0.920684,1.092416,0.978539,0.445028
1,J23492+024,M1.0 V,1.0,3573.0,23.0,4.94,0.13,-0.55,0.08,0.025559,...,0.707163,0.879691,-0.910369,0.000814,1.044964,1.000000,0.935512,0.935512,1.242137,0.829985
2,J23431+365,M4.0 V,4.0,3301.0,30.0,5.18,0.12,-0.10,0.10,0.005663,...,0.991922,0.222404,0.082574,0.000863,1.033456,0.496443,1.216607,0.641686,1.907862,0.091860
3,J23419+441,M5.0 V,5.0,3186.0,41.0,5.15,0.18,0.04,0.17,0.002349,...,0.733129,0.550059,0.182636,-0.001152,1.053811,0.971801,0.921956,0.994958,1.188842,0.546181
4,J23381-162,M2.0 V,2.0,3570.0,22.0,5.07,0.12,-0.35,0.08,0.019451,...,0.872267,0.684783,-0.039567,0.001306,1.099656,1.011043,1.268031,1.157254,2.078314,0.457373
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
228,J00184+440,M3.5 V,3.5,3318.0,53.0,5.20,0.11,-0.36,0.17,0.003344,...,0.666641,0.558432,-0.085931,0.002910,1.049630,2.126325,0.857117,1.381938,0.643994,0.752413
229,J00183+440,M1.0 V,1.0,3603.0,24.0,4.99,0.14,-0.52,0.11,0.023892,...,0.633442,0.543139,-0.020521,-0.000210,1.005997,1.155374,0.609987,0.673215,0.591909,0.472642
230,J00162+198E,M4.0 V,4.0,3329.0,30.0,4.93,0.07,-0.11,0.09,0.008648,...,0.918027,0.215861,-0.329471,0.003725,2.401874,0.826581,2.169255,1.637703,1.688356,0.169315
231,J00067-075,M5.5 V,5.5,3169.0,53.0,5.20,0.16,-0.15,0.22,0.001367,...,1.030706,0.693464,-0.023253,-0.001416,0.920014,1.296351,1.311097,1.494734,1.489506,0.609811


We now preselect the columns we will work with (but keep the original with all the columns):

In [64]:
print(list(carm_l.columns))

['Karmn', 'SpT', 'SpTnum', 'Teff_K', 'eTeff_K', 'logg', 'elogg', '[Fe/H]', 'e[Fe/H]', 'L_Lsol', 'eL_Lsol', 'R_Rsol', 'eR_Rsol', 'M_Msol', 'eM_Msol', 'muRA_masa-1', 'emuRA_masa-1', 'muDE_masa-1', 'emuDE_masa-1', 'pi_mas', 'epi_mas', 'd_pc', 'ed_pc', 'Vr_kms-1', 'eVr_kms-1', 'ruwe', 'U_kms-1', 'eU_kms-1', 'V_kms-1', 'eV_kms-1', 'W_kms-1', 'eW_kms-1', 'sa_m/s/a', 'esa_m/s/a', 'Pop', 'vsini_flag', 'vsini_kms-1', 'P_d', 'pEWHalpha_A', 'epEWHalpha_A', 'Activity', 'FUV_mag', 'eFUV_mag', 'NUV_mag', 'eNUV_mag', 'u_mag', 'eu_mag', 'BT_mag', 'eBT_mag', 'B_mag', 'eB_mag', 'BP_mag', 'eBP_mag', 'g_mag', 'eg_mag', 'VT_mag', 'eVT_mag', 'V_mag', 'eV_mag', 'Ra_mag', 'r_mag', 'er_mag', 'GG_mag', 'eGG_mag', 'i_mag', 'ei_mag', 'RP_mag', 'eRP_mag', 'IN_mag', 'J_mag', 'eJ_mag', 'H_mag', 'eH_mag', 'Ks_mag', 'eKs_mag', 'QFlag_2M', 'W1_mag', 'eW1_mag', 'W2_mag', 'eW2_mag', 'W3_mag', 'eW3_mag', 'W4_mag', 'eW4_mag', 'QFlag_WISE', 'Multiplicity', 'Planet', 'PlanetNum', 'Teff_min_K', 'Teff_max_K', 'logg_min', 'logg

In [65]:
carm = carm_l[ML_ADD_COLUMNS + csf_list].copy()
carm.head()

Unnamed: 0,Karmn,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,1.022727,...,0.813481,0.554379,0.489066,0.0042,1.29572,1.436865,0.920684,1.092416,0.978539,0.445028
1,J23492+024,7.0,8.0,16.0,24.0,32.0,1.222588,1.557039,5.17237,1.27356,...,0.707163,0.879691,-0.910369,0.000814,1.044964,1.0,0.935512,0.935512,1.242137,0.829985
2,J23431+365,11.0,17.0,29.0,48.0,13.0,1.076923,1.076923,1.4,1.0,...,0.991922,0.222404,0.082574,0.000863,1.033456,0.496443,1.216607,0.641686,1.907862,0.09186
3,J23419+441,10.0,38.0,25.0,40.0,13.0,1.265,1.445714,1.552147,1.142857,...,0.733129,0.550059,0.182636,-0.001152,1.053811,0.971801,0.921956,0.994958,1.188842,0.546181
4,J23381-162,9.0,13.0,10.0,16.0,45.0,1.448718,3.054054,3.054054,2.108108,...,0.872267,0.684783,-0.039567,0.001306,1.099656,1.011043,1.268031,1.157254,2.078314,0.457373
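
The same preselection (identifier columns plus the cesium feature list) is applied to each dataset in this notebook. A minimal sketch of a helper that performs it and reports any missing column by name is shown below. The function name `preselect_columns` and the toy DataFrame are assumptions made for illustration only; they are not part of the original notebook.

```python
import pandas as pd

def preselect_columns(df, id_columns, feature_columns):
    """Hypothetical helper: keep only the identifier and feature columns.

    Raises a KeyError that lists every missing column, so a mismatch between
    the feature list and the dataset is easy to spot.
    """
    wanted = list(id_columns) + list(feature_columns)
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        raise KeyError("Columns missing from the dataset: %s" % missing)
    # Copy, so later edits do not touch the full original DataFrame
    return df[wanted].copy()

# Toy stand-in for a featurized dataset (values are made up)
toy = pd.DataFrame({
    "Karmn": ["J00001+000", "J00002+000"],
    "SpT": ["M1.0 V", "M2.0 V"],
    "mean": [0.1, 0.2],
})
subset = preselect_columns(toy, ["Karmn"], ["mean"])
print(list(subset.columns))
```

With the notebook's own variables, the call would look like `preselect_columns(carm_l, ML_ADD_COLUMNS, csf_list)`.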


#### Save the initial ML subsample dataset

In [66]:
carm.to_csv(CARMENES_OUT, sep=',', decimal='.', index=False)

### Load TRAINING S4 sample

In [67]:
s4_tr_l = pd.read_csv(TR_S4B_IN, sep=',', decimal='.')
s4_tr_l

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,Tobs,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.00,0,J23505-095,1581.691279,...,0.852444,0.595003,-0.152339,-0.000978,1.074144,0.772892,1.201439,1.104317,2.008315,0.482745
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.00,0,J23505-095,1581.691279,...,0.875066,0.666432,0.140860,-0.000429,1.024657,0.962333,1.214286,1.314286,2.107205,0.585830
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,1581.691279,...,0.913891,0.672340,0.120452,-0.000041,0.767704,0.779927,1.320423,1.038732,2.067927,0.447995
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.00,0,J23505-095,1581.691279,...,0.961831,0.714286,-0.015359,0.000299,0.832362,0.868586,1.435185,1.550926,2.188623,0.487488
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.50,0,J23505-095,1581.691279,...,0.930988,0.698053,-0.051849,-0.000303,0.619250,0.574467,1.934343,1.050505,2.382155,0.730568
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37275,ALT-B_Star-37275,False,0.0,0.0,0.0,0.0,0.00,232,J00051+457,693.216479,...,0.916000,0.561798,0.263209,-0.002615,0.857144,0.652312,1.617647,1.245098,2.364025,0.406583
37276,ALT-B_Star-37276,True,64.0,1.6,0.0,0.0,0.50,232,J00051+457,693.216479,...,1.000267,0.575509,0.042197,-0.001708,0.888439,0.863114,1.583333,1.000000,1.918760,0.492849
37277,ALT-B_Star-37277,False,0.0,0.0,0.0,0.0,0.00,232,J00051+457,693.216479,...,0.856494,0.592491,0.047063,-0.002131,0.800533,0.931275,1.403636,1.098182,2.033018,0.397017
37278,ALT-B_Star-37278,True,64.0,1.6,0.0,0.0,0.75,232,J00051+457,693.216479,...,0.904447,0.546488,-0.374507,-0.001698,1.423353,0.905443,1.049451,1.060440,1.823070,0.385341


We now preselect the columns we will work with:

In [68]:
print(list(s4_tr_l.columns))

['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 'CARMENES_source_idx', 'CARMENES_Ref_star', 'Tobs', 'Ps_mean', 'Ps_median', 'Ps_stdev', 'NumPoints', 'errorRV_dist_loc', 'errorRV_dist_scale', 'errorRV_mean', 'errorRV_median', 'errorRV_stdev', 'S4_ALT_file', 'VALID_RECORD', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', '

In [69]:
s4_tr = s4_tr_l[S4_ADD_COLUMNS + csf_list].copy()
s4_tr.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.0,0,J23505-095,10.0,...,0.852444,0.595003,-0.152339,-0.000978,1.074144,0.772892,1.201439,1.104317,2.008315,0.482745
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,10.0,...,0.875066,0.666432,0.14086,-0.000429,1.024657,0.962333,1.214286,1.314286,2.107205,0.58583
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,10.0,...,0.913891,0.67234,0.120452,-4.1e-05,0.767704,0.779927,1.320423,1.038732,2.067927,0.447995
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,10.0,...,0.961831,0.714286,-0.015359,0.000299,0.832362,0.868586,1.435185,1.550926,2.188623,0.487488
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.5,0,J23505-095,10.0,...,0.930988,0.698053,-0.051849,-0.000303,0.61925,0.574467,1.934343,1.050505,2.382155,0.730568


#### Save the initial TRAINING S4 sample dataset

In [70]:
s4_tr.to_csv(TR_S4B_OUT, sep=',', decimal='.', index=False)

### Load VALIDATION S4 sample

In [71]:
s4_val_l = pd.read_csv(VAL_S4B_IN, sep=',', decimal='.')
s4_val_l

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,errorRV_dist_loc,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.00,0.00,0.0,2.457432e+06,0.00,116,J11511+352,1.469807,...,0.948040,0.809789,0.008037,-0.000402,0.952390,0.823571,1.387755,1.673469,2.091815,0.732392
1,B_Star-00001,False,0.00,0.00,0.0,2.457487e+06,0.00,29,J20336+617,0.882503,...,0.980112,0.604269,0.250163,0.000595,1.387666,0.716612,1.701031,1.134021,2.363048,0.372797
2,B_Star-00002,False,0.00,0.00,0.0,2.457417e+06,0.00,156,J08402+314,1.093215,...,1.131387,0.234854,-0.012060,0.002152,2.565635,0.785639,3.083333,2.152778,2.362236,0.114480
3,B_Star-00003,False,0.00,0.00,0.0,2.457431e+06,0.00,180,J05421+124,1.213872,...,0.957384,0.539035,0.130598,-0.000051,1.186332,0.588031,1.942197,1.005780,2.343527,0.457471
4,B_Star-00004,False,0.00,0.00,0.0,2.461026e+06,0.00,67,J17052-050,1.381081,...,0.926403,0.625322,-0.001450,-0.000311,1.389174,0.576179,2.175000,1.000000,2.457354,0.486447
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,B_Star-03995,False,0.00,0.00,0.0,2.459911e+06,0.00,53,J18165+048,2.170328,...,0.871491,0.508868,0.113752,0.002208,0.670102,1.184699,1.859155,1.267606,1.190609,0.517696
3996,B_Star-03996,False,0.00,0.00,0.0,2.457428e+06,0.00,8,J23216+172,1.176807,...,1.080301,0.727763,-0.132918,-0.000010,1.251124,0.839217,1.426966,1.202247,1.971105,0.446290
3997,B_Star-03997,False,0.00,0.00,0.0,2.458409e+06,0.00,3,J23419+441,1.223698,...,1.016076,0.752359,-0.028916,0.000088,0.993836,0.703634,1.431034,1.275862,2.133724,0.686681
3998,B_Star-03998,False,0.00,0.00,0.0,2.457468e+06,0.00,181,J05415+534,1.524068,...,0.998230,0.806347,-0.023802,-0.000209,0.898179,0.768453,1.546392,1.206186,2.199315,0.621301


We now preselect the columns we will work with:

In [72]:
print(list(s4_val_l.columns))

['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 'CARMENES_source_idx', 'CARMENES_Ref_star', 'errorRV_dist_loc', 'errorRV_dist_scale', 'errorRV_mean', 'errorRV_median', 'errorRV_stdev', 'Tobs', 'Ps_mean', 'Ps_median', 'Ps_stdev', 'NumPoints', 'S4_file', 'S3_file', 'VALID_RECORD', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000

In [73]:
s4_val = s4_val_l[S4_ADD_COLUMNS + csf_list].copy()
s4_val.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.0,0.0,0.0,2457432.0,0.0,116,J11511+352,11.0,...,0.94804,0.809789,0.008037,-0.000402,0.95239,0.823571,1.387755,1.673469,2.091815,0.732392
1,B_Star-00001,False,0.0,0.0,0.0,2457487.0,0.0,29,J20336+617,8.0,...,0.980112,0.604269,0.250163,0.000595,1.387666,0.716612,1.701031,1.134021,2.363048,0.372797
2,B_Star-00002,False,0.0,0.0,0.0,2457417.0,0.0,156,J08402+314,11.0,...,1.131387,0.234854,-0.01206,0.002152,2.565635,0.785639,3.083333,2.152778,2.362236,0.11448
3,B_Star-00003,False,0.0,0.0,0.0,2457431.0,0.0,180,J05421+124,12.0,...,0.957384,0.539035,0.130598,-5.1e-05,1.186332,0.588031,1.942197,1.00578,2.343527,0.457471
4,B_Star-00004,False,0.0,0.0,0.0,2461026.0,0.0,67,J17052-050,12.0,...,0.926403,0.625322,-0.00145,-0.000311,1.389174,0.576179,2.175,1.0,2.457354,0.486447


#### Save the initial VALIDATION S4 sample dataset

In [109]:
s4_val.to_csv(VAL_S4B_OUT, sep=',', decimal='.', index=False)

## Check and understand features for imputing and decide strategy

We now check which features have values to be imputed, and depending on the definition of those features, we decide upon an imputing strategy.

### Check for `NaN` values

#### Overall impact

We first look at the overall `NaN` impact (i.e., how many records have at least one `NaN` value).

##### In CARMENES

In [75]:
count_anynan_carm = carm[csf_list].isna().any(axis=1).sum()
print("CARMENES records with any `NaN` value: %d" %count_anynan_carm)
print("Ratio: %.2f%%" %(100.0 * count_anynan_carm / len(carm)))

CARMENES records with any `NaN` value: 20
Ratio: 8.58%


##### In TRAINING S4 sample

In [76]:
count_anynan_s4_tr = s4_tr[csf_list].isna().any(axis=1).sum()
print("TRAINING S4 sample records with any `NaN` value: %d" %count_anynan_s4_tr)
print("Ratio: %.2f%%" %(100.0 * count_anynan_s4_tr / len(s4_tr)))

TRAINING S4 sample records with any `NaN` value: 2992
Ratio: 8.03%


##### In VALIDATION S4 sample

In [77]:
count_anynan_s4_val = s4_val[csf_list].isna().any(axis=1).sum()
print("VALIDATION S4 sample records with any `NaN` value: %d" %count_anynan_s4_val)
print("Ratio: %.2f%%" %(100.0 * count_anynan_s4_val / len(s4_val)))

VALIDATION S4 sample records with any `NaN` value: 316
Ratio: 7.90%


<font color='blue'>**SIMILAR TO THE VERY ORIGINAL CASE**</font>
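
The three checks above repeat the same computation on different DataFrames. As a sketch only, it could be wrapped in a small function; the name `nan_row_impact` and the toy data are assumptions introduced here, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def nan_row_impact(df, features):
    """Hypothetical helper: (count, percentage) of rows with any NaN in `features`."""
    n_rows = int(df[features].isna().any(axis=1).sum())
    pct = 100.0 * n_rows / len(df) if len(df) else 0.0
    return n_rows, pct

# Toy data: rows 0 and 1 each contain at least one NaN
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [0.0, 1.0, 2.0, 3.0],
})
count, pct = nan_row_impact(toy, ["a", "b", "c"])
print("Records with any NaN value: %d (%.2f%%)" % (count, pct))
```

In the notebook this would be called as, for example, `nan_row_impact(carm, csf_list)`.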

#### CARMENES

In [78]:
carm_nan_count = carm[csf_list].isna().sum().copy()
carm_nan_count

all_times_nhist_numpeaks       0
all_times_nhist_peak1_bin      0
all_times_nhist_peak2_bin      3
all_times_nhist_peak3_bin      8
all_times_nhist_peak4_bin     16
                              ..
p2p_scatter_2praw              0
p2p_scatter_over_mad           0
p2p_scatter_pfold_over_mad     0
p2p_ssqr_diff_over_var         0
scatter_res_raw                0
Length: 112, dtype: int64

In [79]:
carm_nan_count[carm_nan_count > 0]

all_times_nhist_peak2_bin       3
all_times_nhist_peak3_bin       8
all_times_nhist_peak4_bin      16
all_times_nhist_peak_1_to_2     3
all_times_nhist_peak_1_to_3     8
all_times_nhist_peak_1_to_4    16
all_times_nhist_peak_2_to_3     8
all_times_nhist_peak_2_to_4    16
all_times_nhist_peak_3_to_4    16
fold2P_slope_10percentile       5
fold2P_slope_90percentile       5
medperc90_2p_p                  5
dtype: int64

In [80]:
carm_nan_count[carm_nan_count > 0].max()

16

In [81]:
carm_nan_list = carm_nan_count[carm_nan_count > 0].index.to_list()
carm_nan_list

['all_times_nhist_peak2_bin',
 'all_times_nhist_peak3_bin',
 'all_times_nhist_peak4_bin',
 'all_times_nhist_peak_1_to_2',
 'all_times_nhist_peak_1_to_3',
 'all_times_nhist_peak_1_to_4',
 'all_times_nhist_peak_2_to_3',
 'all_times_nhist_peak_2_to_4',
 'all_times_nhist_peak_3_to_4',
 'fold2P_slope_10percentile',
 'fold2P_slope_90percentile',
 'medperc90_2p_p']

#### TRAINING S4 sample

In [82]:
s4_tr_nan_count = s4_tr[csf_list].isna().sum().copy()
s4_tr_nan_count

all_times_nhist_numpeaks         0
all_times_nhist_peak1_bin        0
all_times_nhist_peak2_bin      480
all_times_nhist_peak3_bin     1280
all_times_nhist_peak4_bin     2560
                              ... 
p2p_scatter_2praw                0
p2p_scatter_over_mad             0
p2p_scatter_pfold_over_mad       0
p2p_ssqr_diff_over_var           0
scatter_res_raw                 16
Length: 112, dtype: int64

In [83]:
s4_tr_nan_count[s4_tr_nan_count > 0]

all_times_nhist_peak2_bin       480
all_times_nhist_peak3_bin      1280
all_times_nhist_peak4_bin      2560
all_times_nhist_peak_1_to_2     480
all_times_nhist_peak_1_to_3    1280
all_times_nhist_peak_1_to_4    2560
all_times_nhist_peak_2_to_3    1280
all_times_nhist_peak_2_to_4    2560
all_times_nhist_peak_3_to_4    2560
fold2P_slope_10percentile       485
fold2P_slope_90percentile       485
freq1_amplitude1                 16
freq1_amplitude2                 16
freq1_amplitude3                 16
freq1_amplitude4                 16
freq1_rel_phase2                 16
freq1_rel_phase3                 16
freq1_rel_phase4                 16
freq2_amplitude1                 16
freq2_amplitude2                 16
freq2_amplitude3                 16
freq2_amplitude4                 16
freq2_rel_phase2                 16
freq2_rel_phase3                 16
freq2_rel_phase4                 16
freq3_amplitude1                 16
freq3_amplitude2                 16
freq3_amplitude3            

**OBSERVATION:** there are more features with `NaN` values in the TRAINING S4 sample than in the CARMENES subsample, but the features with the most `NaN` values are the same. The TRAINING S4 sample also has many more records, so a larger number of `NaN` values is to be expected.

In [84]:
s4_tr_nan_count[s4_tr_nan_count > 0].max()

2560

In [85]:
s4_tr_nan_list = s4_tr_nan_count[s4_tr_nan_count > 0].index.to_list()
s4_tr_nan_list

['all_times_nhist_peak2_bin',
 'all_times_nhist_peak3_bin',
 'all_times_nhist_peak4_bin',
 'all_times_nhist_peak_1_to_2',
 'all_times_nhist_peak_1_to_3',
 'all_times_nhist_peak_1_to_4',
 'all_times_nhist_peak_2_to_3',
 'all_times_nhist_peak_2_to_4',
 'all_times_nhist_peak_3_to_4',
 'fold2P_slope_10percentile',
 'fold2P_slope_90percentile',
 'freq1_amplitude1',
 'freq1_amplitude2',
 'freq1_amplitude3',
 'freq1_amplitude4',
 'freq1_rel_phase2',
 'freq1_rel_phase3',
 'freq1_rel_phase4',
 'freq2_amplitude1',
 'freq2_amplitude2',
 'freq2_amplitude3',
 'freq2_amplitude4',
 'freq2_rel_phase2',
 'freq2_rel_phase3',
 'freq2_rel_phase4',
 'freq3_amplitude1',
 'freq3_amplitude2',
 'freq3_amplitude3',
 'freq3_amplitude4',
 'freq3_rel_phase2',
 'freq3_rel_phase3',
 'freq3_rel_phase4',
 'freq_amplitude_ratio_21',
 'freq_amplitude_ratio_31',
 'freq_model_max_delta_mags',
 'freq_model_min_delta_mags',
 'freq_signif_ratio_21',
 'freq_signif_ratio_31',
 'freq_varrat',
 'freq_y_offset',
 'linear_trend',


In [86]:
carm_nan_list == s4_tr_nan_list

False

In [87]:
len(carm_nan_list)

12
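
Comparing two lists with `==` only says whether they are identical, not where they differ. A minimal sketch of a set-based comparison follows. The function name `compare_feature_lists` and the toy feature names are assumptions for illustration, not the actual lists above.

```python
def compare_feature_lists(list_a, list_b):
    """Hypothetical helper: split two feature lists into common and exclusive parts."""
    set_a, set_b = set(list_a), set(list_b)
    return {
        "common": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }

# Toy lists (made-up names) standing in for two NaN feature lists
toy_a = ["peak2_bin", "peak3_bin", "feature_x"]
toy_b = ["peak2_bin", "peak3_bin", "feature_y"]
diff = compare_feature_lists(toy_a, toy_b)
print(diff)
```

Applied to the notebook's variables, `compare_feature_lists(carm_nan_list, s4_tr_nan_list)` would show which features have `NaN` values in only one of the two datasets.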

<font color='blue'>**SIMILAR TO S4 SAMPLE**</font>

**OBSERVATION:** there are $12$ features for which a relatively high number of `NaN` values is observed in the ALTERNATIVE (TRAINING) S4 sample, and those are the same features for which the CARMENES ML subsample shows `NaN` values.

#### VALIDATION S4 sample

In [88]:
s4_val_nan_count = s4_val[csf_list].isna().sum().copy()
s4_val_nan_count

all_times_nhist_numpeaks        0
all_times_nhist_peak1_bin       0
all_times_nhist_peak2_bin      54
all_times_nhist_peak3_bin     126
all_times_nhist_peak4_bin     265
                             ... 
p2p_scatter_2praw               0
p2p_scatter_over_mad            0
p2p_scatter_pfold_over_mad      0
p2p_ssqr_diff_over_var          0
scatter_res_raw                 6
Length: 112, dtype: int64

In [89]:
s4_val_nan_count[s4_val_nan_count > 0]

all_times_nhist_peak2_bin       54
all_times_nhist_peak3_bin      126
all_times_nhist_peak4_bin      265
all_times_nhist_peak_1_to_2     54
all_times_nhist_peak_1_to_3    126
all_times_nhist_peak_1_to_4    265
all_times_nhist_peak_2_to_3    126
all_times_nhist_peak_2_to_4    265
all_times_nhist_peak_3_to_4    265
std_double_to_single_step        7
fold2P_slope_10percentile       49
fold2P_slope_90percentile       49
freq1_amplitude1                 6
freq1_amplitude2                 6
freq1_amplitude3                 6
freq1_amplitude4                 6
freq1_rel_phase2                 6
freq1_rel_phase3                 6
freq1_rel_phase4                 6
freq2_amplitude1                 6
freq2_amplitude2                 6
freq2_amplitude3                 6
freq2_amplitude4                 6
freq2_rel_phase2                 6
freq2_rel_phase3                 6
freq2_rel_phase4                 6
freq3_amplitude1                 6
freq3_amplitude2                 6
freq3_amplitude3    

**OBSERVATION:** there are more features with `NaN` values in the VALIDATION S4 sample than in the CARMENES subsample, but the features with the most `NaN` values are the same. The VALIDATION S4 sample also has many more records, so a larger number of `NaN` values is to be expected.

In [90]:
s4_val_nan_count[s4_val_nan_count > 0].max()

265

In [91]:
s4_val_nan_list = s4_val_nan_count[s4_val_nan_count > 0].index.to_list()
s4_val_nan_list

['all_times_nhist_peak2_bin',
 'all_times_nhist_peak3_bin',
 'all_times_nhist_peak4_bin',
 'all_times_nhist_peak_1_to_2',
 'all_times_nhist_peak_1_to_3',
 'all_times_nhist_peak_1_to_4',
 'all_times_nhist_peak_2_to_3',
 'all_times_nhist_peak_2_to_4',
 'all_times_nhist_peak_3_to_4',
 'std_double_to_single_step',
 'fold2P_slope_10percentile',
 'fold2P_slope_90percentile',
 'freq1_amplitude1',
 'freq1_amplitude2',
 'freq1_amplitude3',
 'freq1_amplitude4',
 'freq1_rel_phase2',
 'freq1_rel_phase3',
 'freq1_rel_phase4',
 'freq2_amplitude1',
 'freq2_amplitude2',
 'freq2_amplitude3',
 'freq2_amplitude4',
 'freq2_rel_phase2',
 'freq2_rel_phase3',
 'freq2_rel_phase4',
 'freq3_amplitude1',
 'freq3_amplitude2',
 'freq3_amplitude3',
 'freq3_amplitude4',
 'freq3_rel_phase2',
 'freq3_rel_phase3',
 'freq3_rel_phase4',
 'freq_amplitude_ratio_21',
 'freq_amplitude_ratio_31',
 'freq_model_max_delta_mags',
 'freq_model_min_delta_mags',
 'freq_signif_ratio_21',
 'freq_signif_ratio_31',
 'freq_varrat',
 'fre

In [92]:
s4_tr_nan_list == s4_val_nan_list

False

In [93]:
len(s4_tr_nan_list)

43

In [94]:
len(s4_val_nan_list)

44

<font color='blue'>**SIMILAR TO S4 SAMPLE**</font>

**OBSERVATION:** there are $12$ features for which a relatively high number of `NaN` values is observed in the VALIDATION S4 sample, and those are the same features for which the CARMENES ML subsample shows `NaN` values.
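
Because the datasets differ greatly in size, raw `NaN` counts are hard to compare directly. One possible way to put them on the same scale is a table of `NaN` percentages per feature, sketched below. The function name `nan_percentage_table` and the toy DataFrames are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def nan_percentage_table(datasets, features):
    """Hypothetical helper: NaN percentage per feature, one column per dataset.

    `datasets` maps a label to a DataFrame; only features with at least one
    NaN in some dataset are kept.
    """
    table = pd.DataFrame({
        name: 100.0 * df[features].isna().mean()
        for name, df in datasets.items()
    })
    return table[(table > 0).any(axis=1)]

# Toy datasets of different sizes (values are made up)
toy_small = pd.DataFrame({"f1": [1.0, np.nan], "f2": [1.0, 2.0]})
toy_large = pd.DataFrame({"f1": [np.nan, np.nan, 1.0, 2.0],
                          "f2": [1.0, 2.0, 3.0, 4.0]})
table = nan_percentage_table({"small": toy_small, "large": toy_large},
                             ["f1", "f2"])
print(table)
```

In the notebook, the call could be `nan_percentage_table({'CARMENES': carm, 'TRAIN_S4': s4_tr, 'VAL_S4': s4_val}, csf_list)`.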

### Check for `inf` values

#### CARMENES subsample

In [95]:
carm_inf_count = np.isinf(carm[csf_list]).sum()
carm_inf_count

all_times_nhist_numpeaks      0
all_times_nhist_peak1_bin     0
all_times_nhist_peak2_bin     0
all_times_nhist_peak3_bin     0
all_times_nhist_peak4_bin     0
                             ..
p2p_scatter_2praw             0
p2p_scatter_over_mad          0
p2p_scatter_pfold_over_mad    0
p2p_ssqr_diff_over_var        0
scatter_res_raw               0
Length: 112, dtype: int64

In [96]:
carm_inf_count[carm_inf_count > 0]

Series([], dtype: int64)

In [97]:
carm_inf_list = carm_inf_count[carm_inf_count > 0].index.to_list()
carm_inf_list

[]

#### TRAINING S4 sample

In [98]:
s4_tr_inf_count = np.isinf(s4_tr[csf_list]).sum()
s4_tr_inf_count

all_times_nhist_numpeaks      0
all_times_nhist_peak1_bin     0
all_times_nhist_peak2_bin     0
all_times_nhist_peak3_bin     0
all_times_nhist_peak4_bin     0
                             ..
p2p_scatter_2praw             0
p2p_scatter_over_mad          0
p2p_scatter_pfold_over_mad    0
p2p_ssqr_diff_over_var        0
scatter_res_raw               0
Length: 112, dtype: int64

In [99]:
s4_tr_inf_count[s4_tr_inf_count > 0]

Series([], dtype: int64)

In [100]:
s4_tr_inf_list = s4_tr_inf_count[s4_tr_inf_count > 0].index.to_list()
s4_tr_inf_list

[]

**OBSERVATION:** no `inf` values were observed in the CARMENES subsample or in the TRAINING S4 sample, so no strategy to impute `inf` values is needed for them.

#### VALIDATION S4 sample

In [101]:
s4_val_inf_count = np.isinf(s4_val[csf_list]).sum()
s4_val_inf_count

all_times_nhist_numpeaks      0
all_times_nhist_peak1_bin     0
all_times_nhist_peak2_bin     0
all_times_nhist_peak3_bin     0
all_times_nhist_peak4_bin     0
                             ..
p2p_scatter_2praw             0
p2p_scatter_over_mad          0
p2p_scatter_pfold_over_mad    0
p2p_ssqr_diff_over_var        0
scatter_res_raw               0
Length: 112, dtype: int64

In [102]:
s4_val_inf_count[s4_val_inf_count > 0]

avg_double_to_single_step    7
dtype: int64

In [103]:
s4_val_inf_list = s4_val_inf_count[s4_val_inf_count > 0].index.to_list()
s4_val_inf_list

['avg_double_to_single_step']

<font color='red'>**THIS WILL BE A PROBLEM**</font>

**OBSERVATION:** `inf` values were observed in 7 records, all of them in the `avg_double_to_single_step` feature.

We will investigate briefly, but the impact is very low (only 7 out of 4,000 records), so we could simply drop those records.

In [104]:
s4_val[np.isinf(s4_val['avg_double_to_single_step'])]

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
454,B_Star-00454,False,0.0,0.0,0.0,2457461.0,0.0,170,J07274+052,9.0,...,0.825166,0.966598,-0.004244,0.000168,1.039408,0.954902,1.578947,1.578947,2.015612,0.962357
494,B_Star-00494,False,0.0,0.0,0.0,2457445.0,0.0,170,J07274+052,9.0,...,0.908908,0.97232,-0.002758,7.9e-05,0.985232,1.022895,1.466667,1.6,2.000693,0.919786
738,B_Star-00738,False,0.0,0.0,0.0,2457493.0,0.0,170,J07274+052,9.0,...,0.909144,0.966162,0.001957,-9.3e-05,0.975167,0.983752,1.52381,1.5,2.03397,0.910079
1273,B_Star-01273,False,0.0,0.0,0.0,2457396.0,0.0,170,J07274+052,9.0,...,0.931958,0.971093,0.006811,-0.0001,0.967565,0.999757,1.440994,1.465839,2.008721,0.945826
2256,B_Star-02256,False,0.0,0.0,0.0,2460047.0,0.0,170,J07274+052,9.0,...,0.918588,0.965348,-0.023179,6.9e-05,1.006096,1.06131,1.375723,1.468208,1.945585,0.964282
2519,B_Star-02519,False,0.0,0.0,0.0,2457396.0,0.0,170,J07274+052,9.0,...,0.761764,0.965708,0.001983,0.000119,0.960475,0.983414,1.54321,1.54321,1.926551,0.954005
3496,B_Star-03496,False,0.0,0.0,0.0,2457463.0,0.0,170,J07274+052,9.0,...,0.855133,0.970577,-0.003226,8e-06,0.97046,1.006511,1.655629,1.602649,2.007968,0.953881


In [106]:
s4_val.loc[s4_val['CARMENES_Ref_star'] == "J07274+052",
           ['ID', 'Pulsating', 'avg_double_to_single_step']]

Unnamed: 0,ID,Pulsating,avg_double_to_single_step
79,B_Star-00079,False,-184923.8
436,B_Star-00436,False,-184725.0
454,B_Star-00454,False,inf
494,B_Star-00494,False,inf
733,B_Star-00733,False,-184923.0
738,B_Star-00738,False,inf
1273,B_Star-01273,False,inf
1601,B_Star-01601,False,-185143.9
2060,B_Star-02060,False,-184726.5
2083,B_Star-02083,False,-184725.0


**OBSERVATION:** it is curious that only some of the synthetic stars derived from the same CARMENES reference star yield `inf`, while others do not.

In any case, we will drop the synthetic stars with `inf` values.
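
One generic way for a ratio-type cadence feature to become `inf` is a zero time step in the denominator, for example when two consecutive time stamps coincide. This notebook does not verify whether that is the cause here, so the sketch below is only a possible diagnostic. The function `mean_step_ratio` is hypothetical and does not reproduce cesium's exact formula; the time arrays are made-up toy data.

```python
import numpy as np


def mean_step_ratio(times):
    """Mean of (t[i+2] - t[i]) / (t[i+1] - t[i]).

    Generic illustration of a step-ratio feature, NOT cesium's implementation.
    """
    t = np.asarray(times, dtype=float)
    double_step = t[2:] - t[:-2]
    single_step = t[1:-1] - t[:-2]
    # Silence the divide-by-zero warning so that the inf shows up in the result.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.mean(double_step / single_step)


regular = [0.0, 1.0, 2.0, 4.0]    # all time steps are non-zero
repeated = [0.0, 1.0, 1.0, 2.0]   # the time stamp 1.0 appears twice

print(mean_step_ratio(regular))
print(mean_step_ratio(repeated))
# Quick check for zero steps in a time series:
print(bool(np.any(np.diff(repeated) == 0)))
```

Running the `np.diff(...) == 0` check on the time stamps of the affected synthetic stars would confirm or rule out this explanation.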

##### Remove records with `inf` and save the new VALIDATION S4 sample

In [108]:
s4_val = s4_val[~np.isinf(s4_val['avg_double_to_single_step'])].copy()
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.00,0.00,0.0,2.457432e+06,0.00,116,J11511+352,11.0,...,0.948040,0.809789,0.008037,-0.000402,0.952390,0.823571,1.387755,1.673469,2.091815,0.732392
1,B_Star-00001,False,0.00,0.00,0.0,2.457487e+06,0.00,29,J20336+617,8.0,...,0.980112,0.604269,0.250163,0.000595,1.387666,0.716612,1.701031,1.134021,2.363048,0.372797
2,B_Star-00002,False,0.00,0.00,0.0,2.457417e+06,0.00,156,J08402+314,11.0,...,1.131387,0.234854,-0.012060,0.002152,2.565635,0.785639,3.083333,2.152778,2.362236,0.114480
3,B_Star-00003,False,0.00,0.00,0.0,2.457431e+06,0.00,180,J05421+124,12.0,...,0.957384,0.539035,0.130598,-0.000051,1.186332,0.588031,1.942197,1.005780,2.343527,0.457471
4,B_Star-00004,False,0.00,0.00,0.0,2.461026e+06,0.00,67,J17052-050,12.0,...,0.926403,0.625322,-0.001450,-0.000311,1.389174,0.576179,2.175000,1.000000,2.457354,0.486447
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,B_Star-03995,False,0.00,0.00,0.0,2.459911e+06,0.00,53,J18165+048,7.0,...,0.871491,0.508868,0.113752,0.002208,0.670102,1.184699,1.859155,1.267606,1.190609,0.517696
3996,B_Star-03996,False,0.00,0.00,0.0,2.457428e+06,0.00,8,J23216+172,4.0,...,1.080301,0.727763,-0.132918,-0.000010,1.251124,0.839217,1.426966,1.202247,1.971105,0.446290
3997,B_Star-03997,False,0.00,0.00,0.0,2.458409e+06,0.00,3,J23419+441,10.0,...,1.016076,0.752359,-0.028916,0.000088,0.993836,0.703634,1.431034,1.275862,2.133724,0.686681
3998,B_Star-03998,False,0.00,0.00,0.0,2.457468e+06,0.00,181,J05415+534,10.0,...,0.998230,0.806347,-0.023802,-0.000209,0.898179,0.768453,1.546392,1.206186,2.199315,0.621301


In [110]:
s4_val.to_csv(VAL_S4B_OUT, sep=',', decimal='.', index=False)
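
The filter above targets the single column known to contain `inf`. If several feature columns had to be checked at once, the same idea could be generalized as in the following sketch. The helper `drop_inf_rows` and the toy DataFrame are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd


def drop_inf_rows(df, feature_columns):
    """Return a copy of `df` without the rows holding +/-inf in any of `feature_columns`."""
    has_inf = np.isinf(df[feature_columns]).any(axis=1)
    return df.loc[~has_inf].copy()


# Toy data: rows "b" and "c" contain an infinite value, row "d" only a NaN.
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "feat_1": [1.0, np.inf, 3.0, np.nan],
    "feat_2": [0.5, 0.1, -np.inf, 0.2],
})

clean = drop_inf_rows(toy, ["feat_1", "feat_2"])
print(clean)
```

Rows with `NaN` are deliberately kept, since those are handled later by the imputing step rather than dropped.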

### Imputing strategy

We will not repeat the imputing strategy analysis here; we will just save the list of features to be imputed, for later reference.

In [116]:
imputing_features = cs_f[cs_f['Feature'].isin(nan_list)]
imputing_features

Unnamed: 0,Type,Feature,Description
2,Cadence/Error,all_times_nhist_peak2_bin,Return the (bin) index of the ith largest peak...
3,Cadence/Error,all_times_nhist_peak3_bin,Return the (bin) index of the ith largest peak...
4,Cadence/Error,all_times_nhist_peak4_bin,Return the (bin) index of the ith largest peak...
5,Cadence/Error,all_times_nhist_peak_1_to_2,Compute the ratio of the values of the ith and...
6,Cadence/Error,all_times_nhist_peak_1_to_3,Compute the ratio of the values of the ith and...
7,Cadence/Error,all_times_nhist_peak_1_to_4,Compute the ratio of the values of the ith and...
8,Cadence/Error,all_times_nhist_peak_2_to_3,Compute the ratio of the values of the ith and...
9,Cadence/Error,all_times_nhist_peak_2_to_4,Compute the ratio of the values of the ith and...
10,Cadence/Error,all_times_nhist_peak_3_to_4,Compute the ratio of the values of the ith and...
39,Cadence/Error,std_double_to_single_step,Standard deviation of ratios (t[i+2] - t[i]) /...


#### Save the list of features to be imputed

In [117]:
# Use a context manager so that the file is always closed after writing.
with open(IMPUTED_FEATURES_OUT, "wb") as f:
    pickle.dump(nan_list, f)
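
A later pipeline step can read the saved list back with `pickle.load`. The following round-trip sketch uses a temporary file and a shortened, hypothetical feature list instead of `IMPUTED_FEATURES_OUT` and `nan_list`, so that it runs on its own.

```python
import os
import pickle
import tempfile

# Hypothetical, shortened stand-in for the list of features to be imputed.
features_to_impute = ["all_times_nhist_peak2_bin", "std_double_to_single_step"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "imputed_features_list.pickle")

    # Save the list.
    with open(out_path, "wb") as f:
        pickle.dump(features_to_impute, f)

    # Reload it, as a later step would do.
    with open(out_path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded == features_to_impute)
```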

#### Print and save a LaTeX table

In [121]:
# Print to screen:
print(imputing_features.to_latex(index=False, longtable=False,
                                 caption=("Cesium features with NaN values.",
                                          "Cesium features with NaN values.")))

\begin{table}
\centering
\caption[Cesium features with NaN values.]{Cesium features with NaN values.}
\begin{tabular}{lll}
\toprule
                   Type &                     Feature &                                        Description \\
\midrule
          Cadence/Error &   all\_times\_nhist\_peak2\_bin & Return the (bin) index of the ith largest peak.... \\
          Cadence/Error &   all\_times\_nhist\_peak3\_bin & Return the (bin) index of the ith largest peak.... \\
          Cadence/Error &   all\_times\_nhist\_peak4\_bin & Return the (bin) index of the ith largest peak.... \\
          Cadence/Error & all\_times\_nhist\_peak\_1\_to\_2 & Compute the ratio of the values of the ith and ... \\
          Cadence/Error & all\_times\_nhist\_peak\_1\_to\_3 & Compute the ratio of the values of the ith and ... \\
          Cadence/Error & all\_times\_nhist\_peak\_1\_to\_4 & Compute the ratio of the values of the ith and ... \\
          Cadence/Error & all\_times\_nhist\_peak\_2\_to\_3



In [122]:
# Write to file, printing directly into the file handle instead of swapping sys.stdout:
with open(NAN_FEATURES_TABLE, 'w') as f:
    print(imputing_features.to_latex(index=False, longtable=False,
                                     caption=("Cesium features with NaN values.",
                                              "Cesium features with NaN values.")),
          file=f)



## Summary

<font color='blue'>**SIMILAR TO S4 SAMPLE**</font>

**CONCLUSION:**

- We extracted and saved the list of _cesium_ features that show some `NaN` values and hence need an imputing strategy.
- We removed 7 synthetic stars from the VALIDATION S4 sample because of an `inf` value in one feature (`avg_double_to_single_step`). The impact is low: only one feature and only 7 stars (out of 4,000) are affected, all of them derived from the CARMENES star "J07274+052". This is strange, as this CARMENES star does not seem to have anything special (such as too few points).

- **IMPORTANT:**
  - As the customized `NaN` imputing algorithm we have chosen depends on distances computed from feature values, scaling is mandatory: the imputing strategy can only be applied after feature scaling has been carried out.
  - Additionally, for a tree-based ML algorithm (decision tree or random forest), scaling is not needed and can even make interpretability harder to achieve, as we would need to revert the feature scaling before the results can be interpreted directly.
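
To illustrate why scaling must come before a distance-based imputation, here is a minimal NumPy sketch of a nearest-neighbour (1NN) imputer. It is not the customized algorithm referred to above, which is not shown in this section: the functions `zscore_ignore_nan` and `impute_1nn`, the distance definition (root-mean-square difference over the features observed in both rows) and the toy data are all assumptions made for the example.

```python
import numpy as np


def zscore_ignore_nan(X):
    """Standardize each column using only its observed (non-NaN) values."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X - mean) / std


def impute_1nn(X):
    """Fill each NaN with the value of the nearest row that has that feature observed.

    The distance between two rows is the root-mean-square difference over
    the features observed in both rows.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i in range(X.shape[0]):
        for j in np.flatnonzero(np.isnan(X[i])):
            best_dist, best_value = np.inf, np.nan
            for k in range(X.shape[0]):
                if k == i or np.isnan(X[k, j]):
                    continue  # a donor row must have feature j observed
                shared = ~np.isnan(X[i]) & ~np.isnan(X[k])
                if not shared.any():
                    continue  # no common features to measure a distance
                dist = np.sqrt(np.mean((X[i, shared] - X[k, shared]) ** 2))
                if dist < best_dist:
                    best_dist, best_value = dist, X[k, j]
            filled[i, j] = best_value
    return filled


# Toy data: the second column has a much larger numeric range than the first.
raw = np.array([
    [1.0, 1000.0, 5.0],
    [1.1, 5000.0, np.nan],
    [9.0, 1100.0, 50.0],
])

scaled = zscore_ignore_nan(raw)
imputed_scaled = impute_1nn(scaled)   # neighbour chosen on comparable scales
imputed_raw = impute_1nn(raw)         # neighbour chosen by the large-range column

print(imputed_scaled[1, 2], imputed_raw[1, 2])
```

In this toy case the donor row changes with scaling: on raw values the large-range second column dominates the distance and row 2 is selected, whereas after standardization row 0 is the nearest neighbour.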
