#  FEATURE ENGINEERING - `NaN` VALUES IMPUTING - samples for ALTERNATIVE 1-NN METHODOLOGY

In this Notebook we apply the chosen `NaN` imputing strategy. The imputing operation is carried out on the scaled datasets.

We will use a k-NN algorithm (weighted by distance) to assign values to the `NaN` entries present in a few of the features. For each feature to be imputed, we use all the remaining non-`NaN` features, even those that do not belong to the same category.
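To illustrate how distance-weighted k-NN imputing behaves, here is a minimal sketch on a toy table. The data, the column names `f1` to `f3` and the choice `n_neighbors=2` are illustrative assumptions, not values taken from this project.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical): row 3 has no value for f2.
toy = pd.DataFrame({
    "f1": [0.0, 0.1, 1.0, 1.1],
    "f2": [0.0, 0.2, 1.0, np.nan],
    "f3": [0.5, 0.4, 2.0, 2.1],
})

# Distances are computed on the features observed in both rows;
# weights="distance" weights each neighbour by the inverse of its distance.
imputer = KNNImputer(n_neighbors=2, weights="distance")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Row 3 is much closer to row 2 than to any other row on the observed features, so its imputed `f2` ends up near the value of row 2. Observed values are passed through unchanged.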

**IMPORTANT:** Notice that, contrary to the scaling operation, in this case we fit a separate k-NN imputer for each dataset (the TRAINING S4 sample, the VALIDATION S4 sample and the CARMENES ML subsample), so as to keep the specific, independent characteristics of each one (remember the differences between the value distributions of many features).
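The effect of fitting one imputer per dataset can be sketched with two synthetic samples whose features follow different distributions. Everything in this example (the samples, the names `sample_a` and `sample_b`, the query row) is a made-up assumption for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
cols = ["f1", "f2", "f3"]

# Two hypothetical samples with different feature distributions.
sample_a = pd.DataFrame(rng.normal(0.0, 1.0, size=(50, 3)), columns=cols)
sample_b = pd.DataFrame(rng.normal(3.0, 1.0, size=(50, 3)), columns=cols)

# One imputer per sample, each fitted only on its own data.
imputers = {
    "a": KNNImputer(weights="distance").fit(sample_a),
    "b": KNNImputer(weights="distance").fit(sample_b),
}

# The same incomplete row gets a different estimate from each imputer,
# because the neighbours are drawn from different samples.
row = pd.DataFrame({"f1": [1.5], "f2": [np.nan], "f3": [1.5]})
est_a = imputers["a"].transform(row)[0, 1]
est_b = imputers["b"].transform(row)[0, 1]

print(est_a, est_b)
```

Each estimate is a weighted average of `f2` values from the sample the imputer was fitted on, so an imputer never borrows values from the other sample.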


## Modules and configuration

### Modules

In [2]:
import pandas as pd

from sklearn.impute import KNNImputer

import pickle

#import matplotlib.pyplot as plt
#import seaborn as sns
#sns.set_style("white", {'figure.figsize':(15,10)})

### Configuration

In [48]:
CARMENES_IN = "../data/DATASETS_ML/1NN/1NN_CARMENES_01_DS_AfterScaling.csv"
TRAIN_S4B_IN = "../data/DATASETS_ML/1NN/1NN_TRAIN_S4B_01_DS_AfterScaling.csv"
VAL_S4B_IN = "../data/DATASETS_ML/1NN/1NN_VAL_S4B_01_DS_AfterScaling.csv"

CARMENES_IMP_FILE = "../data/ML_MODELS/ML_pipeline_steps/1NN/1NN_CARMENES_imputer.pickle"
# Will store the fitted imputer object
TRAIN_S4B_IMP_FILE = "../data/ML_MODELS/ML_pipeline_steps/1NN/1NN_TRAIN_S4B_imputer.pickle"
# Will store the fitted imputer object
VAL_S4B_IMP_FILE = "../data/ML_MODELS/ML_pipeline_steps/1NN/1NN_VAL_S4B_imputer.pickle"
# Will store the fitted imputer object

IMPUTED_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/1NN/1NN_imputed_features_list.pickle"

CARMENES_OUT = "../data/DATASETS_ML/1NN/1NN_CARMENES_02_DS_AfterImputing.csv"
# The scaled and imputed features of the CARMENES subsample.
TRAIN_S4B_OUT = "../data/DATASETS_ML/1NN/1NN_TRAIN_S4B_02_DS_AfterImputing.csv"
# The scaled and imputed features of the TRAINING S4 sample
VAL_S4B_OUT = "../data/DATASETS_ML/1NN/1NN_VAL_S4B_02_DS_AfterImputing.csv"
# The scaled and imputed features of the VALIDATION S4 sample

ML_ADD_COLUMNS = ['Karmn'] # Only cesium features and this column will be kept.
S4_ADD_COLUMNS = ['ID', 'Pulsating', 'frequency', 'amplitudeRV',
                  'offsetRV', 'refepochRV', 'phase',
                  'CARMENES_source_idx', 'CARMENES_Ref_star'] # Only cesium features and these columns will be kept.


### Functions

## Load data

We load the data, which are the time series as previously featurized by _cesium_ and scaled.

### Load the list of features to impute

In [49]:
with open(IMPUTED_FEATURES_IN, "rb") as f:  # close the file handle after loading
    impute_f_list = pickle.load(f)
impute_f_list

['all_times_nhist_peak2_bin',
 'all_times_nhist_peak3_bin',
 'all_times_nhist_peak4_bin',
 'all_times_nhist_peak_1_to_2',
 'all_times_nhist_peak_1_to_3',
 'all_times_nhist_peak_1_to_4',
 'all_times_nhist_peak_2_to_3',
 'all_times_nhist_peak_2_to_4',
 'all_times_nhist_peak_3_to_4',
 'fold2P_slope_10percentile',
 'fold2P_slope_90percentile',
 'medperc90_2p_p',
 'freq1_amplitude1',
 'freq1_amplitude2',
 'freq1_amplitude3',
 'freq1_amplitude4',
 'freq1_rel_phase2',
 'freq1_rel_phase3',
 'freq1_rel_phase4',
 'freq2_amplitude1',
 'freq2_amplitude2',
 'freq2_amplitude3',
 'freq2_amplitude4',
 'freq2_rel_phase2',
 'freq2_rel_phase3',
 'freq2_rel_phase4',
 'freq3_amplitude1',
 'freq3_amplitude2',
 'freq3_amplitude3',
 'freq3_amplitude4',
 'freq3_rel_phase2',
 'freq3_rel_phase3',
 'freq3_rel_phase4',
 'freq_amplitude_ratio_21',
 'freq_amplitude_ratio_31',
 'freq_model_max_delta_mags',
 'freq_model_min_delta_mags',
 'freq_signif_ratio_21',
 'freq_signif_ratio_31',
 'freq_varrat',
 'freq_y_offset'

### Load the TRAINING S4 sample data

In [50]:
s4_tr = pd.read_csv(TRAIN_S4B_IN, sep=',', decimal='.')
s4_tr

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.00,0,J23505-095,0.205178,...,-0.908970,0.541171,-0.669931,-0.015526,-0.019047,-0.007352,-0.442166,-0.134743,-0.219987,0.356795
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.00,0,J23505-095,0.205178,...,-0.725712,0.832929,0.631644,-0.009818,-0.021695,0.552345,-0.429826,0.007825,-0.005592,0.720584
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,0.205178,...,-0.411204,0.857062,0.541050,-0.005791,-0.035445,0.013431,-0.327874,-0.179274,-0.090747,0.234164
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.00,0,J23505-095,0.205178,...,-0.022857,1.028396,-0.061844,-0.002265,-0.031985,0.275371,-0.217636,0.168502,0.170924,0.373533
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.50,0,J23505-095,0.205178,...,-0.272705,0.962090,-0.223832,-0.008516,-0.043389,-0.593592,0.261842,-0.171281,0.590502,1.231367
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37275,ALT-B_Star-37275,False,0.0,0.0,0.0,0.0,0.00,232,J00051+457,0.830097,...,-0.394117,0.405540,1.174777,-0.032514,-0.030659,-0.363602,-0.042368,-0.039153,0.551196,0.088020
37276,ALT-B_Star-37276,True,64.0,1.6,0.0,0.0,0.50,232,J00051+457,0.830097,...,0.288502,0.461546,0.193656,-0.023099,-0.028984,0.259205,-0.075329,-0.205573,-0.414141,0.392452
37277,ALT-B_Star-37277,False,0.0,0.0,0.0,0.0,0.00,232,J00051+457,0.830097,...,-0.876158,0.530909,0.215261,-0.027492,-0.033688,0.460585,-0.247941,-0.138908,-0.166431,0.054261
37278,ALT-B_Star-37278,True,64.0,1.6,0.0,0.0,0.75,232,J00051+457,0.830097,...,-0.487701,0.343003,-1.656181,-0.023002,-0.000361,0.384265,-0.588162,-0.164535,-0.621599,0.013055


In [51]:
print(list(s4_tr.columns))

['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 'CARMENES_source_idx', 'CARMENES_Ref_star', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude

### Load the VALIDATION S4 sample data

In [52]:
s4_val = pd.read_csv(VAL_S4B_IN, sep=',', decimal='.')
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.00,0.00,0.0,2.457432e+06,0.00,116,J11511+352,0.517637,...,-0.134573,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807
1,B_Star-00001,False,0.00,0.00,0.0,2.457487e+06,0.00,29,J20336+617,-0.419742,...,0.125231,0.579019,1.116863,0.000807,-0.002270,-0.173628,0.037728,-0.114574,0.549077,-0.031212
2,B_Star-00002,False,0.00,0.00,0.0,2.457417e+06,0.00,156,J08402+314,0.517637,...,1.350668,-0.929902,-0.047202,0.016980,0.060763,0.030306,1.365530,0.577156,0.547318,-0.942817
3,B_Star-00003,False,0.00,0.00,0.0,2.457431e+06,0.00,180,J05421+124,0.830097,...,-0.058875,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605
4,B_Star-00004,False,0.00,0.00,0.0,2.461026e+06,0.00,67,J17052-050,0.830097,...,-0.309842,0.665012,-0.000102,-0.008592,-0.002190,-0.588534,0.493010,-0.205573,0.753534,0.369862
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3988,B_Star-03995,False,0.00,0.00,0.0,2.459911e+06,0.00,53,J18165+048,-0.732202,...,-0.754672,0.189341,0.511304,0.017560,-0.040668,1.209315,0.189618,-0.023871,-1.992779,0.480138
3989,B_Star-03996,False,0.00,0.00,0.0,2.457428e+06,0.00,8,J23216+172,-1.669582,...,0.936833,1.083443,-0.583714,-0.005475,-0.009577,0.188601,-0.225531,-0.068249,-0.300657,0.228148
3990,B_Star-03997,False,0.00,0.00,0.0,2.458409e+06,0.00,3,J23419+441,0.205178,...,0.416565,1.183910,-0.122027,-0.004456,-0.023344,-0.211972,-0.221623,-0.018265,0.051901,1.076491
3991,B_Star-03998,False,0.00,0.00,0.0,2.457468e+06,0.00,181,J05415+534,0.205178,...,0.272003,1.404428,-0.099327,-0.007536,-0.028463,-0.020467,-0.110814,-0.065574,0.194104,0.845764


In [53]:
print(list(s4_val.columns))

['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 'CARMENES_source_idx', 'CARMENES_Ref_star', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude

### Load the CARMENES subsample data

In [54]:
carm = pd.read_csv(CARMENES_IN, sep=',', decimal='.')
carm

Unnamed: 0,Karmn,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.683950,0.963624,-0.495737,...,-1.224593,0.375234,2.177405,0.038241,-0.007191,1.954330,-0.711851,-0.142823,-2.452548,0.223693
1,J23492+024,-0.732202,-0.712088,-0.352490,-0.029602,0.452062,-0.327090,-0.312174,1.382266,-0.135084,...,-2.085847,1.704012,-4.034985,0.003083,-0.020609,0.663629,-0.697608,-0.249360,-1.881067,1.582213
2,J23431+365,0.517637,-0.073948,0.633426,1.786876,-1.009783,-0.530743,-0.698876,-0.615420,-0.528415,...,0.220901,-0.980754,0.372899,0.003590,-0.021224,-0.824111,-0.427596,-0.448866,-0.437768,-1.022645
3,J23419+441,0.205178,1.415047,0.330067,1.181383,-1.009783,-0.267795,-0.401839,-0.534849,-0.323011,...,-1.875500,0.357589,0.817097,-0.017334,-0.020135,0.580317,-0.710630,-0.208997,-1.996610,0.580664
4,J23381-162,-0.107282,-0.357566,-0.807528,-0.635095,1.452272,-0.010941,0.893571,0.260497,1.064851,...,-0.748389,0.907884,-0.169309,0.008196,-0.017682,0.696254,-0.378200,-0.098799,-0.068227,0.267257
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
228,J00184+440,1.142557,-0.782993,-0.731688,-1.164901,-1.779174,-0.638288,-0.632154,-0.729271,-0.298703,...,-2.414100,0.391792,-0.375132,0.024843,-0.020359,3.991311,-0.772912,0.053760,-3.177844,1.308460
229,J00183+440,0.205178,-0.570279,-0.124971,-0.029602,-0.009573,5.974603,3.848480,2.752565,-0.279287,...,-2.683037,0.329324,-0.084762,-0.007546,-0.022694,1.122675,-1.010299,-0.427458,-3.290765,0.321142
230,J00162+198E,1.455017,-0.215757,0.178388,0.727264,-1.009783,-0.638288,-0.645770,-0.650724,-0.323011,...,-0.377701,-1.007479,-1.456254,0.033312,0.052000,0.151270,0.487492,0.227423,-0.913661,-0.749305
231,J00067-075,-1.044662,0.776906,0.481747,-0.635095,1.452272,-0.271944,-0.255353,-0.340739,-0.111941,...,0.535083,0.943346,-0.096890,-0.020070,-0.027295,1.539187,-0.336831,0.130348,-1.344769,0.805214


#### Extract the _cesium_ feature list

In [55]:
cs_f_list = list(s4_tr.drop(columns=S4_ADD_COLUMNS).columns)
print(cs_f_list)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude', 'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50', 'flux_percentile_ratio_mid65',
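Because the feature list is derived from one dataset and then reused to index the others, it can be worth confirming that every feature is present everywhere. The following is a minimal sketch with toy frames; `meta_cols`, `sample` and `other` are hypothetical stand-ins, not objects from this notebook.

```python
import pandas as pd

meta_cols = ["ID", "label"]  # hypothetical stand-ins for the additional columns
sample = pd.DataFrame({"ID": ["a"], "label": [True], "f1": [0.1], "f2": [0.2]})
other = pd.DataFrame({"name": ["x"], "f1": [0.3], "f2": [0.4]})

# Everything that is not a metadata column is treated as a feature.
feature_cols = list(sample.drop(columns=meta_cols).columns)

# Features missing from the other dataset would make other[feature_cols] fail.
missing = [c for c in feature_cols if c not in other.columns]
print(feature_cols, missing)
```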

## Define and train the imputers

### Imputer for TRAINING S4 sample

In [56]:
fit_data_s4_tr = s4_tr[cs_f_list].copy()
fit_data_s4_tr.head()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.68395,0.963624,-0.495737,2.096091,...,-0.90897,0.541171,-0.669931,-0.015526,-0.019047,-0.007352,-0.442166,-0.134743,-0.219987,0.356795
1,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.68395,0.963624,-0.495737,2.096091,...,-0.725712,0.832929,0.631644,-0.009818,-0.021695,0.552345,-0.429826,0.007825,-0.005592,0.720584
2,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.68395,0.963624,-0.495737,2.096091,...,-0.411204,0.857062,0.54105,-0.005791,-0.035445,0.013431,-0.327874,-0.179274,-0.090747,0.234164
3,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.68395,0.963624,-0.495737,2.096091,...,-0.022857,1.028396,-0.061844,-0.002265,-0.031985,0.275371,-0.217636,0.168502,0.170924,0.373533
4,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.68395,0.963624,-0.495737,2.096091,...,-0.272705,0.96209,-0.223832,-0.008516,-0.043389,-0.593592,0.261842,-0.171281,0.590502,1.231367


In [57]:
imputer_s4_tr = KNNImputer(weights='distance')
imputer_s4_tr.fit(fit_data_s4_tr)

#### Save the trained imputer (TRAINING S4 sample)

In [58]:
with open(TRAIN_S4B_IMP_FILE, 'wb') as f:  # close the file handle after writing
    pickle.dump(imputer_s4_tr, f)
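As a sanity check on this kind of persistence step, a pickled imputer should reproduce the output of the original object once reloaded. This is a minimal round-trip sketch on toy data; the temporary directory and the file name `toy_imputer.pickle` are assumptions made for the example.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical) with one missing value.
toy = pd.DataFrame({"f1": [0.0, 1.0, 2.0, 3.0], "f2": [0.0, np.nan, 2.0, 3.0]})
imputer = KNNImputer(n_neighbors=2, weights="distance").fit(toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_imputer.pickle")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(imputer, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# The reloaded imputer should give the same result as the original one.
same = np.allclose(imputer.transform(toy), reloaded.transform(toy))
print(same)
```

Note that pickled scikit-learn objects are generally only safe to reload with the same library version that wrote them.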

### Imputer for VALIDATION S4 sample

In [59]:
fit_data_s4_val = s4_val[cs_f_list].copy()
fit_data_s4_val.head()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,0.517637,-1.137515,-0.580009,-1.089214,0.144305,0.266356,-0.19947,-0.372978,-0.484844,-0.634592,...,-0.134573,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807
1,-0.419742,0.280575,0.860945,0.727264,1.60615,-0.58553,-0.662208,-0.547754,-0.411041,-0.317319,...,0.125231,0.579019,1.116863,0.000807,-0.00227,-0.173628,0.037728,-0.114574,0.549077,-0.031212
2,0.517637,-1.137515,0.405907,0.575891,1.760028,-0.638288,-0.760832,-0.827243,-0.528415,-0.752524,...,1.350668,-0.929902,-0.047202,0.01698,0.060763,0.030306,1.36553,0.577156,0.547318,-0.942817
3,0.830097,-0.357566,-0.049131,-1.240587,-1.2406,-0.268982,-0.367065,-0.51203,-0.272801,-0.511149,...,-0.058875,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605
4,0.830097,-0.570279,0.330067,-0.256662,-0.240391,-0.573261,-0.703301,-0.789417,-0.494181,-0.73058,...,-0.309842,0.665012,-0.000102,-0.008592,-0.00219,-0.588534,0.49301,-0.205573,0.753534,0.369862


In [60]:
imputer_s4_val = KNNImputer(weights='distance')
imputer_s4_val.fit(fit_data_s4_val)

#### Save the trained imputer (VALIDATION S4 sample)

In [61]:
with open(VAL_S4B_IMP_FILE, 'wb') as f:  # close the file handle after writing
    pickle.dump(imputer_s4_val, f)

### Imputer for CARMENES subsample

In [62]:
fit_data_carm = carm[cs_f_list].copy()
fit_data_carm.head()

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.68395,0.963624,-0.495737,2.096091,...,-1.224593,0.375234,2.177405,0.038241,-0.007191,1.95433,-0.711851,-0.142823,-2.452548,0.223693
1,-0.732202,-0.712088,-0.35249,-0.029602,0.452062,-0.32709,-0.312174,1.382266,-0.135084,2.224898,...,-2.085847,1.704012,-4.034985,0.003083,-0.020609,0.663629,-0.697608,-0.24936,-1.881067,1.582213
2,0.517637,-0.073948,0.633426,1.786876,-1.009783,-0.530743,-0.698876,-0.61542,-0.528415,-0.47604,...,0.220901,-0.980754,0.372899,0.00359,-0.021224,-0.824111,-0.427596,-0.448866,-0.437768,-1.022645
3,0.205178,1.415047,0.330067,1.181383,-1.009783,-0.267795,-0.401839,-0.534849,-0.323011,-0.543324,...,-1.8755,0.357589,0.817097,-0.017334,-0.020135,0.580317,-0.71063,-0.208997,-1.99661,0.580664
4,-0.107282,-0.357566,-0.807528,-0.635095,1.452272,-0.010941,0.893571,0.260497,1.064851,0.268721,...,-0.748389,0.907884,-0.169309,0.008196,-0.017682,0.696254,-0.3782,-0.098799,-0.068227,0.267257


In [63]:
imputer_carm = KNNImputer(weights='distance')
imputer_carm.fit(fit_data_carm)

#### Save the trained imputer (CARMENES subsample)

In [64]:
with open(CARMENES_IMP_FILE, 'wb') as f:  # close the file handle after writing
    pickle.dump(imputer_carm, f)

## Reload and apply the imputers

### Imputer for TRAINING S4

#### Reload imputer

In [65]:
with open(TRAIN_S4B_IMP_FILE, 'rb') as f:  # close the file handle after loading
    ld_imputer_s4_tr = pickle.load(f)
ld_imputer_s4_tr

#### Impute the features in TRAINING S4 sample

In [66]:
imputed_s4_tr = s4_tr.copy()  # start from the loaded TRAINING S4 data
imputed_s4_tr.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.0,0,J23505-095,0.205178,...,-0.90897,0.541171,-0.669931,-0.015526,-0.019047,-0.007352,-0.442166,-0.134743,-0.219987,0.356795
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,0.205178,...,-0.725712,0.832929,0.631644,-0.009818,-0.021695,0.552345,-0.429826,0.007825,-0.005592,0.720584
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,0.205178,...,-0.411204,0.857062,0.54105,-0.005791,-0.035445,0.013431,-0.327874,-0.179274,-0.090747,0.234164
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,0.205178,...,-0.022857,1.028396,-0.061844,-0.002265,-0.031985,0.275371,-0.217636,0.168502,0.170924,0.373533
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.5,0,J23505-095,0.205178,...,-0.272705,0.96209,-0.223832,-0.008516,-0.043389,-0.593592,0.261842,-0.171281,0.590502,1.231367


In [67]:
imputed_s4_tr.loc[:, cs_f_list] = ld_imputer_s4_tr.transform(imputed_s4_tr.loc[:, cs_f_list])
imputed_s4_tr.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.0,0,J23505-095,0.205178,...,-0.90897,0.541171,-0.669931,-0.015526,-0.019047,-0.007352,-0.442166,-0.134743,-0.219987,0.356795
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,0.205178,...,-0.725712,0.832929,0.631644,-0.009818,-0.021695,0.552345,-0.429826,0.007825,-0.005592,0.720584
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,0.205178,...,-0.411204,0.857062,0.54105,-0.005791,-0.035445,0.013431,-0.327874,-0.179274,-0.090747,0.234164
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.0,0,J23505-095,0.205178,...,-0.022857,1.028396,-0.061844,-0.002265,-0.031985,0.275371,-0.217636,0.168502,0.170924,0.373533
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.5,0,J23505-095,0.205178,...,-0.272705,0.96209,-0.223832,-0.008516,-0.043389,-0.593592,0.261842,-0.171281,0.590502,1.231367
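The pattern used above (copy the dataset, then overwrite only the feature columns with the imputer output) can be sketched on a toy table as follows. The frame `df`, the `ID` column and `feature_cols` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "ID": ["s0", "s1", "s2", "s3"],  # metadata column, must stay untouched
    "f1": [0.0, 0.1, 1.0, 1.1],
    "f2": [0.3, np.nan, 1.2, 1.0],
})
feature_cols = ["f1", "f2"]

imputer = KNNImputer(n_neighbors=2, weights="distance").fit(df[feature_cols])

# Work on a copy so the original frame stays available for comparison.
imputed = df.copy()
imputed.loc[:, feature_cols] = imputer.transform(imputed.loc[:, feature_cols])

print(imputed)
```

Only the `NaN` entry in `f2` is filled in; the metadata column and all observed feature values remain as they were.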


#### Check correct imputing

In [68]:
s4_tr[cs_f_list].describe().loc[['count', 'mean', 'std', '50%']]

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,37280.0,37280.0,36800.0,36000.0,34720.0,36800.0,36000.0,34720.0,36000.0,34720.0,...,37264.0,37264.0,37264.0,37264.0,36778.0,37280.0,37280.0,37280.0,37280.0,37264.0
mean,-4.574309e-17,-1.036843e-16,6.255865000000001e-17,-4.1053580000000005e-17,-3.929269e-17,9.267949e-17,-3.315866e-17,1.015061e-16,2.178998e-16,5.2390250000000004e-17,...,2.7457640000000003e-17,-2.684747e-16,6.101698e-18,2.2309330000000003e-17,1.816059e-17,-8.729307000000001e-17,-9.453573000000001e-17,7.738207000000001e-17,-5.56541e-16,-1.830509e-16
std,1.000013,1.000013,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,...,1.000013,1.000013,1.000013,1.000013,1.000014,1.000013,1.000013,1.000013,1.000013,1.000013
50%,-0.1072824,-0.3575656,-0.2008103,-0.02960186,-0.009573079,-0.2887663,-0.2775723,-0.2976853,-0.2669921,-0.2917182,...,-0.0393412,-0.03299619,0.006336377,-0.005353901,-0.01708591,-0.1141285,-0.1272274,-0.06110645,-0.1001276,-0.1495433


In [69]:
imputed_s4_tr[cs_f_list].describe().loc[['count', 'mean', 'std', '50%']]

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,...,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0
mean,-4.574309e-17,-1.036843e-16,0.02237,-0.005005,0.015581,0.004847,0.015257,0.011455,0.023351,0.017819,...,0.001183,-1.6e-05,4.6e-05,0.000108,0.002971,-8.729307000000001e-17,-9.453573000000001e-17,7.738207000000001e-17,-5.56541e-16,0.000476
std,1.000013,1.000013,1.013986,0.994641,0.988737,0.9963,0.990477,0.979001,0.996694,0.981413,...,1.001618,0.999856,0.999839,1.00136,1.051442,1.000013,1.000013,1.000013,1.000013,1.00041
50%,-0.1072824,-0.3575656,-0.20081,-0.029602,-0.009573,-0.288766,-0.277572,-0.297685,-0.24085,-0.291718,...,-0.038922,-0.032865,0.006336,-0.005353,-0.01679,-0.1141285,-0.1272274,-0.06110645,-0.1001276,-0.149042


In [70]:
s4_tr.isna().sum()[s4_tr.isna().sum() > 0]

all_times_nhist_peak2_bin       480
all_times_nhist_peak3_bin      1280
all_times_nhist_peak4_bin      2560
all_times_nhist_peak_1_to_2     480
all_times_nhist_peak_1_to_3    1280
all_times_nhist_peak_1_to_4    2560
all_times_nhist_peak_2_to_3    1280
all_times_nhist_peak_2_to_4    2560
all_times_nhist_peak_3_to_4    2560
fold2P_slope_10percentile       485
fold2P_slope_90percentile       485
freq1_amplitude1                 16
freq1_amplitude2                 16
freq1_amplitude3                 16
freq1_amplitude4                 16
freq1_rel_phase2                 16
freq1_rel_phase3                 16
freq1_rel_phase4                 16
freq2_amplitude1                 16
freq2_amplitude2                 16
freq2_amplitude3                 16
freq2_amplitude4                 16
freq2_rel_phase2                 16
freq2_rel_phase3                 16
freq2_rel_phase4                 16
freq3_amplitude1                 16
freq3_amplitude2                 16
freq3_amplitude3            

In [71]:
imputed_s4_tr.isna().sum()[imputed_s4_tr.isna().sum() > 0]

Series([], dtype: int64)

In [72]:
s4_tr[impute_f_list].describe()

Unnamed: 0,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,all_times_nhist_peak_3_to_4,fold2P_slope_10percentile,...,freq_amplitude_ratio_31,freq_model_max_delta_mags,freq_model_min_delta_mags,freq_signif_ratio_21,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,scatter_res_raw,std_double_to_single_step
count,36800.0,36000.0,34720.0,36800.0,36000.0,34720.0,36000.0,34720.0,34720.0,36795.0,...,37264.0,37264.0,37264.0,37264.0,37264.0,37264.0,37264.0,37264.0,37264.0,37280.0
mean,6.255865000000001e-17,-4.1053580000000005e-17,-3.929269e-17,9.267949e-17,-3.315866e-17,1.015061e-16,2.178998e-16,5.2390250000000004e-17,2.521281e-16,-3.39871e-17,...,4.3474600000000004e-17,5.2245790000000005e-17,5.1101720000000005e-17,-8.511869e-16,2.7457640000000003e-17,-2.684747e-16,6.101698e-18,2.2309330000000003e-17,-1.830509e-16,4.574309e-17
std,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,1.000014,...,1.000013,1.000013,1.000013,1.000013,1.000013,1.000013,1.000013,1.000013,1.000013,1.000013
min,-1.565925,-1.84608,-2.009992,-0.6382879,-0.7608323,-0.8272426,-0.5284149,-0.7525235,-0.7034631,-104.7253,...,-0.02164458,-0.007337622,-0.008894588,-7.160046,-6.306264,-1.88919,-6.489361,-54.73265,-1.345553,-0.06778085
25%,-0.8833674,-0.7864677,-0.9328434,-0.5388684,-0.4943731,-0.562464,-0.4941809,-0.535674,-0.6330852,-0.09534969,...,-0.02164458,-0.007337576,-0.008894376,-0.5211363,-0.5613034,-0.8982544,-0.4321774,-0.01037192,-0.8336991,-0.06771991
50%,-0.2008103,-0.02960186,-0.009573079,-0.2887663,-0.2775723,-0.2976853,-0.2669921,-0.2917182,-0.3813333,0.3492266,...,-0.02164458,-0.007337493,-0.008894011,-0.0281692,-0.0393412,-0.03299619,0.006336377,-0.005353901,-0.1495433,-0.06749057
75%,0.7851056,0.8029505,0.836758,0.110687,0.06697425,0.2318719,0.1130764,0.1690871,0.332071,0.3909024,...,-0.02164458,-0.007337345,-0.008893426,0.4811128,0.5166123,0.8490798,0.4297225,-0.000288742,0.6874313,-0.0667942
max,2.15022,1.862563,1.760028,8.153267,9.187762,6.586559,11.50672,6.961699,5.560173,0.5043307,...,151.7005,181.5094,124.4601,8.296186,3.327655,2.181878,8.600693,42.87256,40.10389,15.23114


In [73]:
imputed_s4_tr[impute_f_list].describe()

Unnamed: 0,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,all_times_nhist_peak_3_to_4,fold2P_slope_10percentile,...,freq_amplitude_ratio_31,freq_model_max_delta_mags,freq_model_min_delta_mags,freq_signif_ratio_21,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,scatter_res_raw,std_double_to_single_step
count,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,...,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0,37280.0
mean,0.02237,-0.005005,0.015581,0.004847,0.015257,0.011455,0.023351,0.017819,0.00451,0.002958,...,4e-05,0.002054,0.001795,0.001218,0.001183,-1.6e-05,4.6e-05,0.000108,0.000476,4.574309e-17
std,1.013986,0.994641,0.988737,0.9963,0.990477,0.979001,0.996694,0.981413,0.974633,0.994538,...,0.999818,1.046046,1.019847,1.001745,1.001618,0.999856,0.999839,1.00136,1.00041,1.000013
min,-1.565925,-1.84608,-2.009992,-0.638288,-0.760832,-0.827243,-0.528415,-0.752524,-0.703463,-104.725287,...,-0.021645,-0.007338,-0.008895,-7.160046,-6.306264,-1.88919,-6.489361,-54.732654,-1.345553,-0.06778085
25%,-0.883367,-0.786468,-0.855904,-0.538425,-0.492354,-0.55765,-0.486313,-0.522121,-0.627077,-0.085988,...,-0.021645,-0.007338,-0.008894,-0.520969,-0.560841,-0.898124,-0.432064,-0.010372,-0.83362,-0.06771991
50%,-0.20081,-0.029602,-0.009573,-0.288766,-0.277572,-0.297685,-0.24085,-0.291718,-0.355483,0.349134,...,-0.021645,-0.007337,-0.008894,-0.02776,-0.038922,-0.032865,0.006336,-0.005353,-0.149042,-0.06749057
75%,0.860945,0.802951,0.836758,0.114528,0.205688,0.231872,0.190498,0.169087,0.340476,0.390902,...,-0.021645,-0.007337,-0.008893,0.482182,0.517508,0.848637,0.429438,-0.000285,0.688447,-0.0667942
max,2.15022,1.862563,1.760028,8.153267,9.187762,6.586559,11.506715,6.961699,5.560173,0.504331,...,151.700547,181.509409,124.460051,8.296186,3.327655,2.181878,8.600693,42.872563,40.103892,15.23114


Now no feature contains `NaN` values, and the statistics of the imputed features remain similar to those before imputing. Although the mean and standard deviation vary slightly, both remain very close to 0 and 1 respectively (or, in any case, close to their previous values).

The statistics of the non-imputed features are unchanged.
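The visual comparison above could also be complemented with an automated check. The helper below is a hypothetical sketch (the name `check_imputation` and the toy data are assumptions): it verifies that no `NaN` remains and that every value observed before imputing is still the same afterwards.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def check_imputation(before: pd.DataFrame, after: pd.DataFrame) -> None:
    """Raise AssertionError if NaNs remain or observed values were altered."""
    assert not after.isna().any().any(), "NaN values remain after imputing"
    observed = before.notna().to_numpy()
    assert np.allclose(
        before.to_numpy()[observed], after.to_numpy()[observed]
    ), "observed values were modified"

# Toy data (hypothetical) with one missing value.
before = pd.DataFrame({"f1": [0.0, 0.1, 1.0, 1.1], "f2": [0.3, np.nan, 1.2, 1.0]})
after = pd.DataFrame(
    KNNImputer(n_neighbors=2, weights="distance").fit_transform(before),
    columns=before.columns,
)

check_imputation(before, after)  # passes silently when both conditions hold
```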

#### Save the imputed TRAINING S4 sample dataset

In [74]:
imputed_s4_tr.to_csv(TRAIN_S4B_OUT, sep=',', decimal='.', index=False)

### Imputer for VALIDATION S4

#### Reload imputer

In [75]:
with open(VAL_S4B_IMP_FILE, 'rb') as f:  # close the file handle after loading
    ld_imputer_s4_val = pickle.load(f)
ld_imputer_s4_val

#### Impute the features in VALIDATION S4 sample

In [76]:
imputed_s4_val = s4_val.copy()
imputed_s4_val.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.0,0.0,0.0,2457432.0,0.0,116,J11511+352,0.517637,...,-0.134573,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807
1,B_Star-00001,False,0.0,0.0,0.0,2457487.0,0.0,29,J20336+617,-0.419742,...,0.125231,0.579019,1.116863,0.000807,-0.00227,-0.173628,0.037728,-0.114574,0.549077,-0.031212
2,B_Star-00002,False,0.0,0.0,0.0,2457417.0,0.0,156,J08402+314,0.517637,...,1.350668,-0.929902,-0.047202,0.01698,0.060763,0.030306,1.36553,0.577156,0.547318,-0.942817
3,B_Star-00003,False,0.0,0.0,0.0,2457431.0,0.0,180,J05421+124,0.830097,...,-0.058875,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605
4,B_Star-00004,False,0.0,0.0,0.0,2461026.0,0.0,67,J17052-050,0.830097,...,-0.309842,0.665012,-0.000102,-0.008592,-0.00219,-0.588534,0.49301,-0.205573,0.753534,0.369862


In [77]:
imputed_s4_val.loc[:, cs_f_list] = ld_imputer_s4_val.transform(imputed_s4_val.loc[:, cs_f_list])
imputed_s4_val.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.0,0.0,0.0,2457432.0,0.0,116,J11511+352,0.517637,...,-0.134573,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807
1,B_Star-00001,False,0.0,0.0,0.0,2457487.0,0.0,29,J20336+617,-0.419742,...,0.125231,0.579019,1.116863,0.000807,-0.00227,-0.173628,0.037728,-0.114574,0.549077,-0.031212
2,B_Star-00002,False,0.0,0.0,0.0,2457417.0,0.0,156,J08402+314,0.517637,...,1.350668,-0.929902,-0.047202,0.01698,0.060763,0.030306,1.36553,0.577156,0.547318,-0.942817
3,B_Star-00003,False,0.0,0.0,0.0,2457431.0,0.0,180,J05421+124,0.830097,...,-0.058875,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605
4,B_Star-00004,False,0.0,0.0,0.0,2461026.0,0.0,67,J17052-050,0.830097,...,-0.309842,0.665012,-0.000102,-0.008592,-0.00219,-0.588534,0.49301,-0.205573,0.753534,0.369862


#### Check correct imputing

In [78]:
s4_val[cs_f_list].describe().loc[['count', 'mean', 'std', '50%']]

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,3993.0,3993.0,3939.0,3867.0,3728.0,3939.0,3867.0,3728.0,3867.0,3728.0,...,3987.0,3987.0,3987.0,3987.0,3938.0,3993.0,3993.0,3993.0,3993.0,3987.0
mean,0.004227,0.000437,0.016774,0.014123,-0.006663,-0.021442,0.005803,0.001576,0.024496,0.017752,...,0.081534,0.026774,0.019621,-0.011142,-0.004974,-0.02544,-0.008408,-0.016568,0.000833,0.004148
std,1.010121,0.994044,1.007434,1.004587,1.004368,0.969563,1.026238,1.006839,1.052484,1.045343,...,0.884413,1.008406,0.994507,0.774763,0.052022,0.947391,0.697579,0.466829,0.974282,0.949807
50%,-0.107282,-0.357566,-0.20081,-0.029602,-0.086512,-0.309973,-0.277572,-0.297685,-0.268383,-0.291718,...,-0.003054,-0.007826,0.006365,-0.005383,-0.017604,-0.128495,-0.125971,-0.068249,-0.100736,-0.168325


In [79]:
imputed_s4_val[cs_f_list].describe().loc[['count', 'mean', 'std', '50%']]

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,...,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0
mean,0.004227,0.000437,0.034947,0.00994,0.013156,-0.014491,0.018311,0.011398,0.039985,0.022895,...,0.086101,0.026586,0.019553,-0.015139,-0.004614,-0.02544,-0.008408,-0.016568,0.000833,0.005616
std,1.010121,0.994044,1.015055,0.996017,0.985942,0.966574,1.016337,0.987415,1.045123,1.016448,...,0.891655,1.007851,0.993798,0.785755,0.051971,0.947391,0.697579,0.466829,0.974282,0.950607
50%,-0.107282,-0.357566,-0.20081,-0.100236,-0.009573,-0.288766,-0.277572,-0.297685,-0.24085,-0.291718,...,-0.000923,-0.003964,0.00641,-0.0054,-0.017261,-0.128495,-0.125971,-0.068249,-0.100736,-0.166726


In [80]:
s4_val.isna().sum()[s4_val.isna().sum() > 0]

all_times_nhist_peak2_bin       54
all_times_nhist_peak3_bin      126
all_times_nhist_peak4_bin      265
all_times_nhist_peak_1_to_2     54
all_times_nhist_peak_1_to_3    126
all_times_nhist_peak_1_to_4    265
all_times_nhist_peak_2_to_3    126
all_times_nhist_peak_2_to_4    265
all_times_nhist_peak_3_to_4    265
fold2P_slope_10percentile       49
fold2P_slope_90percentile       49
freq1_amplitude1                 6
freq1_amplitude2                 6
freq1_amplitude3                 6
freq1_amplitude4                 6
freq1_rel_phase2                 6
freq1_rel_phase3                 6
freq1_rel_phase4                 6
freq2_amplitude1                 6
freq2_amplitude2                 6
freq2_amplitude3                 6
freq2_amplitude4                 6
freq2_rel_phase2                 6
freq2_rel_phase3                 6
freq2_rel_phase4                 6
freq3_amplitude1                 6
freq3_amplitude2                 6
freq3_amplitude3                 6
freq3_amplitude4    

In [81]:
imputed_s4_val.isna().sum()[imputed_s4_val.isna().sum() > 0]

Series([], dtype: int64)

In [82]:
s4_val[impute_f_list].describe()

Unnamed: 0,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,all_times_nhist_peak_3_to_4,fold2P_slope_10percentile,...,freq_amplitude_ratio_31,freq_model_max_delta_mags,freq_model_min_delta_mags,freq_signif_ratio_21,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,scatter_res_raw,std_double_to_single_step
count,3939.0,3867.0,3728.0,3939.0,3867.0,3728.0,3867.0,3728.0,3728.0,3944.0,...,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3987.0,3993.0
mean,0.016774,0.014123,-0.006663,-0.021442,0.005803,0.001576,0.024496,0.017752,-0.0074,0.023494,...,-0.015434,-0.007337412,-0.008893802,0.083113,0.081534,0.026774,0.019621,-0.011142,0.004148,-0.013986
std,1.007434,1.004587,1.004368,0.969563,1.026238,1.006839,1.052484,1.045343,0.977518,0.807887,...,0.137604,2.55033e-07,7.347424e-07,0.875585,0.884413,1.008406,0.994507,0.774763,0.949807,0.879818
min,-1.565925,-1.84608,-2.009992,-0.638288,-0.760832,-0.827243,-0.528415,-0.752524,-0.703463,-18.685852,...,-0.021645,-0.007337622,-0.008894588,-5.045137,-4.47657,-1.839401,-5.768502,-20.751503,-1.315462,-0.067781
25%,-0.883367,-0.786468,-0.932843,-0.545082,-0.51842,-0.562464,-0.495737,-0.535674,-0.633085,-0.065811,...,-0.021645,-0.007337577,-0.008894375,-0.448198,-0.491335,-0.876238,-0.390831,-0.010355,-0.813875,-0.067728
50%,-0.20081,-0.029602,-0.086512,-0.309973,-0.277572,-0.297685,-0.268383,-0.291718,-0.381333,0.353264,...,-0.021645,-0.007337496,-0.00889401,0.014706,-0.003054,-0.007826,0.006365,-0.005383,-0.168325,-0.067491
75%,0.785106,0.878637,0.836758,0.092596,0.138256,0.231872,0.125142,0.169087,0.332071,0.390902,...,-0.021645,-0.007337354,-0.008893411,0.508181,0.527522,0.872792,0.437119,-0.000329,0.675763,-0.066794
max,2.15022,1.862563,1.760028,8.153267,9.187762,6.586559,11.506715,6.961699,5.560173,0.514521,...,7.168335,-0.007335418,-0.008889638,3.459856,3.327655,2.166616,6.469539,34.152774,9.344787,15.231474


In [83]:
imputed_s4_val[impute_f_list].describe()

Unnamed: 0,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,all_times_nhist_peak_3_to_4,fold2P_slope_10percentile,...,freq_amplitude_ratio_31,freq_model_max_delta_mags,freq_model_min_delta_mags,freq_signif_ratio_21,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,scatter_res_raw,std_double_to_single_step
count,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,...,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0,3993.0
mean,0.034947,0.00994,0.013156,-0.014491,0.018311,0.011398,0.039985,0.022895,-0.006584,0.025045,...,-0.015444,-0.007337412,-0.008893803,0.087871,0.086101,0.026586,0.019553,-0.015139,0.005616,-0.013986
std,1.015055,0.996017,0.985942,0.966574,1.016337,0.987415,1.045123,1.016448,0.950944,0.804416,...,0.137501,2.549365e-07,7.345775e-07,0.88358,0.891655,1.007851,0.993798,0.785755,0.950607,0.879818
min,-1.565925,-1.84608,-2.009992,-0.638288,-0.760832,-0.827243,-0.528415,-0.752524,-0.703463,-18.685852,...,-0.021645,-0.007337622,-0.008894588,-5.045137,-4.47657,-1.839401,-5.768502,-20.751503,-1.315462,-0.067781
25%,-0.807528,-0.786468,-0.855904,-0.541329,-0.51087,-0.55765,-0.490719,-0.522121,-0.624887,-0.061483,...,-0.021645,-0.007337577,-0.008894377,-0.447739,-0.490642,-0.8762,-0.390817,-0.010374,-0.813766,-0.067728
50%,-0.20081,-0.100236,-0.009573,-0.288766,-0.277572,-0.297685,-0.24085,-0.291718,-0.355483,0.352584,...,-0.021645,-0.007337496,-0.008894012,0.017045,-0.000923,-0.003964,0.00641,-0.0054,-0.166726,-0.067491
75%,0.860945,0.802951,0.835478,0.110687,0.205688,0.231872,0.190498,0.169087,0.316982,0.390902,...,-0.021645,-0.007337354,-0.008893412,0.511255,0.531346,0.871478,0.435942,-0.000329,0.676512,-0.066794
max,2.15022,1.862563,1.760028,8.153267,9.187762,6.586559,11.506715,6.961699,5.560173,0.514521,...,7.168335,-0.007335418,-0.008889638,3.459856,3.327655,2.166616,6.469539,34.152774,9.344787,15.231474


OK, now no feature contains `NaN` values any longer, and the statistics of the imputed features remain similar to the original ones. Although the mean and standard deviation vary slightly, both remain very close to 0 and 1 respectively (or, in any case, close to their previous values).

For the non-imputed features, the statistics are unchanged.
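Beyond comparing summary statistics, one can check cell by cell that the imputer only touched the positions that were `NaN`. This is a sketch under assumptions: the helper `check_imputation` and the toy frames are hypothetical and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

def check_imputation(before, after, columns):
    """Count NaNs before/after and verify observed cells are unchanged (hypothetical helper)."""
    b = before[columns].to_numpy(dtype=float)
    a = after[columns].to_numpy(dtype=float)
    was_nan = np.isnan(b)
    return {
        "nan_before": int(was_nan.sum()),
        "nan_after": int(np.isnan(a).sum()),
        # Cells that held a value before imputing must hold the same value afterwards.
        "observed_unchanged": bool(np.isclose(b[~was_nan], a[~was_nan]).all()),
    }

# Toy data: two NaNs that are filled in, everything else left alone.
before = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 0.5, 0.7]})
after = before.copy()
after.loc[1, "x"] = 2.0
after.loc[0, "y"] = 0.6

report = check_imputation(before, after, ["x", "y"])
print(report)
```

With the variables of this section, the call would be along the lines of `check_imputation(s4_val, imputed_s4_val, cs_f_list)`.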

#### Save the imputed VALIDATION S4 sample dataset

In [84]:
imputed_s4_val.to_csv(VAL_S4B_OUT, sep=',', decimal='.', index=False)

### Imputer for CARMENES subsample

#### Reload imputer

In [85]:
# Use a context manager so the file handle is closed after loading
with open(CARMENES_IMP_FILE, 'rb') as f:
    ld_imputer_carm = pickle.load(f)
ld_imputer_carm

#### Impute the features in CARMENES ML sample

In [86]:
imputed_carm = carm.copy()
imputed_carm.head()

Unnamed: 0,Karmn,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.68395,0.963624,-0.495737,...,-1.224593,0.375234,2.177405,0.038241,-0.007191,1.95433,-0.711851,-0.142823,-2.452548,0.223693
1,J23492+024,-0.732202,-0.712088,-0.35249,-0.029602,0.452062,-0.32709,-0.312174,1.382266,-0.135084,...,-2.085847,1.704012,-4.034985,0.003083,-0.020609,0.663629,-0.697608,-0.24936,-1.881067,1.582213
2,J23431+365,0.517637,-0.073948,0.633426,1.786876,-1.009783,-0.530743,-0.698876,-0.61542,-0.528415,...,0.220901,-0.980754,0.372899,0.00359,-0.021224,-0.824111,-0.427596,-0.448866,-0.437768,-1.022645
3,J23419+441,0.205178,1.415047,0.330067,1.181383,-1.009783,-0.267795,-0.401839,-0.534849,-0.323011,...,-1.8755,0.357589,0.817097,-0.017334,-0.020135,0.580317,-0.71063,-0.208997,-1.99661,0.580664
4,J23381-162,-0.107282,-0.357566,-0.807528,-0.635095,1.452272,-0.010941,0.893571,0.260497,1.064851,...,-0.748389,0.907884,-0.169309,0.008196,-0.017682,0.696254,-0.3782,-0.098799,-0.068227,0.267257


In [87]:
imputed_carm.loc[:, cs_f_list] = ld_imputer_carm.transform(imputed_carm.loc[:, cs_f_list])
imputed_carm.head()

Unnamed: 0,Karmn,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,0.205178,-0.570279,0.102548,-0.937841,0.375123,-0.538868,-0.68395,0.963624,-0.495737,...,-1.224593,0.375234,2.177405,0.038241,-0.007191,1.95433,-0.711851,-0.142823,-2.452548,0.223693
1,J23492+024,-0.732202,-0.712088,-0.35249,-0.029602,0.452062,-0.32709,-0.312174,1.382266,-0.135084,...,-2.085847,1.704012,-4.034985,0.003083,-0.020609,0.663629,-0.697608,-0.24936,-1.881067,1.582213
2,J23431+365,0.517637,-0.073948,0.633426,1.786876,-1.009783,-0.530743,-0.698876,-0.61542,-0.528415,...,0.220901,-0.980754,0.372899,0.00359,-0.021224,-0.824111,-0.427596,-0.448866,-0.437768,-1.022645
3,J23419+441,0.205178,1.415047,0.330067,1.181383,-1.009783,-0.267795,-0.401839,-0.534849,-0.323011,...,-1.8755,0.357589,0.817097,-0.017334,-0.020135,0.580317,-0.71063,-0.208997,-1.99661,0.580664
4,J23381-162,-0.107282,-0.357566,-0.807528,-0.635095,1.452272,-0.010941,0.893571,0.260497,1.064851,...,-0.748389,0.907884,-0.169309,0.008196,-0.017682,0.696254,-0.3782,-0.098799,-0.068227,0.267257


#### Check correct imputing

In [88]:
carm[cs_f_list].describe().loc[['count', 'mean', 'std', '50%']]

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,233.0,233.0,230.0,225.0,217.0,230.0,225.0,217.0,225.0,217.0,...,233.0,233.0,233.0,233.0,228.0,233.0,233.0,233.0,233.0,233.0
mean,7.623849000000001e-17,-7.623849000000001e-17,7.337126e-17,-4.7369520000000006e-17,-8.185976e-18,1.081261e-16,-4.1448330000000005e-17,1.227896e-16,2.368476e-16,3.2743900000000003e-17,...,-0.271566,-0.230007,0.204185,-0.062786,-0.004035,0.576555,-0.192094,0.024057,-0.945874,-0.130916
std,1.002153,1.002153,1.002181,1.00223,1.002312,1.002181,1.00223,1.002312,1.00223,1.002312,...,1.128264,0.855456,2.128576,0.655705,0.038165,1.282657,1.174251,0.817402,1.235812,0.815313
50%,-0.1072824,-0.3575656,-0.2008103,-0.02960186,-0.009573079,-0.2887663,-0.2775723,-0.2976853,-0.2669921,-0.2917182,...,-0.243356,-0.334207,0.057353,-0.001805,-0.013939,0.417663,-0.39373,-0.085134,-1.074622,-0.199803


In [89]:
imputed_carm[cs_f_list].describe().loc[['count', 'mean', 'std', '50%']]

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
count,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,...,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0
mean,7.623849000000001e-17,-7.623849000000001e-17,0.007979,0.007233,0.00478,-0.00313,-0.001616,0.009433,0.00484,0.007673,...,-0.271566,-0.230007,0.204185,-0.062786,-0.003598,0.576555,-0.192094,0.024057,-0.945874,-0.130916
std,1.002153,1.002153,0.998942,0.989065,0.973791,0.996213,0.985729,0.976627,0.986461,0.96985,...,1.128264,0.855456,2.128576,0.655705,0.037986,1.282657,1.174251,0.817402,1.235812,0.815313
50%,-0.1072824,-0.3575656,-0.20081,-0.029602,-0.009573,-0.288766,-0.277572,-0.297685,-0.24085,-0.279264,...,-0.243356,-0.334207,0.057353,-0.001805,-0.012902,0.417663,-0.39373,-0.085134,-1.074622,-0.199803


In [90]:
carm.isna().sum()[carm.isna().sum() > 0]

all_times_nhist_peak2_bin       3
all_times_nhist_peak3_bin       8
all_times_nhist_peak4_bin      16
all_times_nhist_peak_1_to_2     3
all_times_nhist_peak_1_to_3     8
all_times_nhist_peak_1_to_4    16
all_times_nhist_peak_2_to_3     8
all_times_nhist_peak_2_to_4    16
all_times_nhist_peak_3_to_4    16
fold2P_slope_10percentile       5
fold2P_slope_90percentile       5
medperc90_2p_p                  5
dtype: int64

In [91]:
imputed_carm.isna().sum()[imputed_carm.isna().sum() > 0]

Series([], dtype: int64)

In [92]:
carm[impute_f_list].describe()

Unnamed: 0,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,all_times_nhist_peak_3_to_4,fold2P_slope_10percentile,...,freq_amplitude_ratio_31,freq_model_max_delta_mags,freq_model_min_delta_mags,freq_signif_ratio_21,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,scatter_res_raw,std_double_to_single_step
count,230.0,225.0,217.0,230.0,225.0,217.0,225.0,217.0,217.0,228.0,...,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0
mean,7.337126e-17,-4.7369520000000006e-17,-8.185976e-18,1.081261e-16,-4.1448330000000005e-17,1.227896e-16,2.368476e-16,3.2743900000000003e-17,2.783232e-16,-0.084037,...,-0.02164458,0.550467,3.173603,-0.133645,-0.271566,-0.230007,0.204185,-0.062786,-0.130916,1e-06
std,1.002181,1.00223,1.002312,1.002181,1.00223,1.002312,1.00223,1.002312,1.002312,1.263167,...,4.290984e-10,7.187458,34.032918,1.082541,1.128264,0.855456,2.128576,0.655705,0.815313,1.002168
min,-1.565925,-1.84608,-2.009992,-0.6382879,-0.7608323,-0.8272426,-0.5284149,-0.7525235,-0.7034631,-10.187285,...,-0.02164458,-0.007338,-0.008895,-2.943778,-3.099925,-1.684524,-6.111399,-9.827589,-1.276013,-0.067781
25%,-0.8833674,-0.7864677,-0.9328434,-0.5387575,-0.4943731,-0.562464,-0.4941809,-0.535674,-0.6330852,0.14841,...,-0.02164458,-0.007338,-0.008894,-0.742544,-0.953505,-0.980227,-0.694258,-0.022823,-0.874639,-0.06772
50%,-0.2008103,-0.02960186,-0.009573079,-0.2887663,-0.2775723,-0.2976853,-0.2669921,-0.2917182,-0.3813333,0.356422,...,-0.02164458,-0.007337,-0.008893,-0.090383,-0.243356,-0.334207,0.057353,-0.001805,-0.199803,-0.067491
75%,0.7851056,0.8029505,0.836758,0.1065261,0.06697425,0.2318719,0.1130764,0.1690871,0.332071,0.390902,...,-0.02164458,-0.007337,-0.008892,0.455428,0.342597,0.397944,1.192571,0.012517,0.390301,-0.066794
max,2.15022,1.862563,1.760028,8.153267,9.187762,6.586559,11.50672,6.961699,5.560173,0.395996,...,-0.02164458,107.620232,465.799387,3.459848,3.327644,2.187744,8.245187,0.56332,3.290234,15.231372


In [93]:
imputed_carm[impute_f_list].describe()

Unnamed: 0,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,all_times_nhist_peak_3_to_4,fold2P_slope_10percentile,...,freq_amplitude_ratio_31,freq_model_max_delta_mags,freq_model_min_delta_mags,freq_signif_ratio_21,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,scatter_res_raw,std_double_to_single_step
count,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,...,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0,233.0
mean,0.007979,0.007233,0.00478,-0.00313,-0.001616,0.009433,0.00484,0.007673,0.010287,-0.082035,...,-0.02164458,0.550467,3.173603,-0.133645,-0.271566,-0.230007,0.204185,-0.062786,-0.130916,1e-06
std,0.998942,0.989065,0.973791,0.996213,0.985729,0.976627,0.986461,0.96985,0.973593,1.250777,...,4.290984e-10,7.187458,34.032918,1.082541,1.128264,0.855456,2.128576,0.655705,0.815313,1.002168
min,-1.565925,-1.84608,-2.009992,-0.638288,-0.760832,-0.827243,-0.528415,-0.752524,-0.703463,-10.187285,...,-0.02164458,-0.007338,-0.008895,-2.943778,-3.099925,-1.684524,-6.111399,-9.827589,-1.276013,-0.067781
25%,-0.883367,-0.786468,-0.778965,-0.538425,-0.492354,-0.547754,-0.484844,-0.518682,-0.615087,0.135099,...,-0.02164458,-0.007338,-0.008894,-0.742544,-0.953505,-0.980227,-0.694258,-0.022823,-0.874639,-0.06772
50%,-0.20081,-0.029602,-0.009573,-0.288766,-0.277572,-0.297685,-0.24085,-0.279264,-0.355483,0.353642,...,-0.02164458,-0.007337,-0.008893,-0.090383,-0.243356,-0.334207,0.057353,-0.001805,-0.199803,-0.067491
75%,0.785106,0.802951,0.759819,0.094043,0.087471,0.231872,0.125142,0.169087,0.340476,0.390902,...,-0.02164458,-0.007337,-0.008892,0.455428,0.342597,0.397944,1.192571,0.012517,0.390301,-0.066794
max,2.15022,1.862563,1.760028,8.153267,9.187762,6.586559,11.506715,6.961699,5.560173,0.395996,...,-0.02164458,107.620232,465.799387,3.459848,3.327644,2.187744,8.245187,0.56332,3.290234,15.231372


Again, no feature contains `NaN` values any longer, and the statistics of the imputed features remain similar to the original ones. Although the mean and standard deviation vary slightly, both remain very close to 0 and 1 respectively (or, in any case, close to their previous values).

For the non-imputed features, the statistics are unchanged.

#### Save the imputed ML subsample dataset

In [94]:
imputed_carm.to_csv(CARMENES_OUT, sep=',', decimal='.', index=False)

## Summary

**RESULTS:**

- We have created and saved the fitted `KNNImputer` objects, one file per dataset (TRAINING S4, VALIDATION S4 and CARMENES ML subsample). Notice that, unlike a scaler, a `KNNImputer` does not maintain the `NaN` values during the transform: it replaces them.
- We have reloaded those imputers and applied each of them to the _cesium_ features of its own dataset.
- The imputing operation has not greatly affected the statistics of the features of each dataset.