# PoC - FEATURE EXTRACTION WITH `cesium`

In this notebook we test feature extraction with `cesium`, simply to show how the process works.

We use a few individual RV curves of _GTO_ objects from the _Carmencita_ database.

**IMPORTANT NOTE:** this code is not very efficient (for example, it performs many DataFrame `append` operations, which are costly because each one copies the whole table), but there is no need to optimize it at the moment. A possible improvement is to collect the rows in a 2D numpy array and create the DataFrame only once, at the end.
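The alternative mentioned in the note can be sketched as follows. This is only a minimal illustration, assuming each object yields one record of metadata plus feature values; the record contents and the helper name `build_feature_table` are hypothetical and not part of the notebook.

```python
import pandas as pd


def build_feature_table(rows):
    """Create the DataFrame in a single step from a list of per-object dicts."""
    return pd.DataFrame(rows)


# Hypothetical per-object records: an ID plus two made-up feature values.
rows = []
for i in range(3):
    rows.append({"ID": f"RV-{i}", "amplitude": float(i), "n_epochs": 10 + i})

# One DataFrame construction at the end, instead of one append per object.
features_df = build_feature_table(rows)
print(features_df.shape)
```

Appending to a Python list is cheap, so the cost of building the table no longer grows with every added row.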

## Modules and configuration

### Modules

In [1]:
# Module import:
#import warnings

import pandas as pd

from cesium.data_management import TimeSeries
from cesium.featurize import featurize_single_ts

### Configuration

In [2]:
GTO_FILE = "../data/SELECTION_GTO_objects_with_PG.csv"
RV_FOLDER = "../data/CARMENES_GTO_RVs/"

SINTHETIC_FOLDER = "../data/RV_DATASETS/"
SYNTHETIC_DB_FILE = "RV_All_GTO_SyntheticDatasets.csv"


CS_FEATURES_FOLDER = "../data/DATASETS_CESIUM/"
CS_FEATURES_FILE = "TEST_RV_cesium_DS.csv"

IMAGE_FOLDER = "./img/"

# A LIST OF ALL THE FEATURES CESIUM CAN EXTRACT (FOR REFERENCE PURPOSES)
ALL_CS_FEATURES = ['all_times_nhist_numpeaks',
                   'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin',
                   'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4',
                   'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4',
                   'all_times_nhist_peak_3_to_4',
                   'all_times_nhist_peak_val',
                   'avg_double_to_single_step', 'avg_err', 'avgt',
                   'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50',
                   'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000',
                   'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000',
                   'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000',
                   'cads_avg', 'cads_med', 'cads_std', 'mean',
                   'med_double_to_single_step', 'med_err',
                   'n_epochs', 'std_double_to_single_step', 'std_err',
                   'total_time', 'amplitude',
                   'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50',
                   'flux_percentile_ratio_mid65', 'flux_percentile_ratio_mid80',
                   'max_slope', 'maximum', 'median', 'median_absolute_deviation', 'minimum',
                   'percent_amplitude', 'percent_beyond_1_std', 'percent_close_to_median', 'percent_difference_flux_percentile',
                   'period_fast', 'qso_log_chi2_qsonu', 'qso_log_chi2nuNULL_chi2nu', 'skew', 'std',
                   'stetson_j', 'stetson_k', 'weighted_average', 'fold2P_slope_10percentile', 'fold2P_slope_90percentile',
                   'freq1_amplitude1', 'freq1_amplitude2', 'freq1_amplitude3', 'freq1_amplitude4',
                   'freq1_freq', 'freq1_lambda', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq1_signif',
                   'freq2_amplitude1', 'freq2_amplitude2', 'freq2_amplitude3', 'freq2_amplitude4',
                   'freq2_freq', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4',
                   'freq3_amplitude1', 'freq3_amplitude2', 'freq3_amplitude3', 'freq3_amplitude4',
                   'freq3_freq', 'freq3_rel_phase2', 'freq3_rel_phase3', 'freq3_rel_phase4',
                   'freq_amplitude_ratio_21', 'freq_amplitude_ratio_31',
                   'freq_frequency_ratio_21', 'freq_frequency_ratio_31',
                   'freq_model_max_delta_mags', 'freq_model_min_delta_mags', 'freq_model_phi1_phi2',
                   'freq_n_alias', 'freq_signif_ratio_21', 'freq_signif_ratio_31',
                   'freq_varrat', 'freq_y_offset', 'linear_trend', 'medperc90_2p_p',
                   'p2p_scatter_2praw', 'p2p_scatter_over_mad', 'p2p_scatter_pfold_over_mad', 'p2p_ssqr_diff_over_var',
                   'scatter_res_raw']


## Load GTO information table

In [3]:
gto = pd.read_csv(GTO_FILE, sep=',', decimal='.')
gto.head()

Unnamed: 0,Karmn,Name,Comp,GJ,RA_J2016_deg,DE_J2016_deg,RA_J2000,DE_J2000,l_J2016_deg,b_J2016_deg,...,T0_PG_TESS,e_T0_PG_TESS,offset_PG_TESS,e_offset_PG_TESS,FAP_PG_TESS,valid_PG_TESS,error_PG_TESS,elapsed_time_PG_TESS,fits_file_TESS,fig_file_TESS
0,J23548+385,RX J2354.8+3831,-,,358.713658,38.52634,23:54:51.46,+38:31:36.2,110.941908,-23.024449,...,1764.609498,0.010704,8262.219751,1.365667,1.0,1.0,,344.002685,../data/CARMENES_GTO_TESS_PGs/J23548+385_TESS_...,../data/CARMENES_GTO_TESS_PGs/figures/J23548+3...
1,J23505-095,LP 763-012,-,4367.0,357.634705,-9.560964,23:50:31.64,-09:33:32.7,80.777067,-67.303426,...,1354.108815,0.001261,7767.134654,0.094298,0.064148,1.0,,473.533042,../data/CARMENES_GTO_TESS_PGs/J23505-095_TESS_...,../data/CARMENES_GTO_TESS_PGs/figures/J23505-0...
2,J23431+365,GJ 1289,-,1289.0,355.781509,36.53631,23:43:06.31,+36:32:13.1,107.922839,-24.336479,...,1764.717539,0.00372,16158.288258,0.164698,0.002785,1.0,,352.262793,../data/CARMENES_GTO_TESS_PGs/J23431+365_TESS_...,../data/CARMENES_GTO_TESS_PGs/figures/J23431+3...
3,J23381-162,G 273-093,-,4352.0,354.532687,-16.236514,23:38:08.16,-16:14:10.2,61.845437,-69.82522,...,1354.111098,0.000422,30353.1479,0.175123,0.031223,1.0,,485.008036,../data/CARMENES_GTO_TESS_PGs/J23381-162_TESS_...,../data/CARMENES_GTO_TESS_PGs/figures/J23381-1...
4,J23245+578,BD+57 2735,-,895.0,351.126628,57.853057,23:24:30.51,+57:51:15.5,111.552287,-3.085183,...,1955.800582,0.00142,84823.865767,0.391298,0.799167,1.0,,476.798646,../data/CARMENES_GTO_TESS_PGs/J23245+578_TESS_...,../data/CARMENES_GTO_TESS_PGs/figures/J23245+5...


In [4]:
print(list(gto.columns))

['Karmn', 'Name', 'Comp', 'GJ', 'RA_J2016_deg', 'DE_J2016_deg', 'RA_J2000', 'DE_J2000', 'l_J2016_deg', 'b_J2016_deg', 'Ref01', 'SpT', 'SpTnum', 'Ref02', 'Teff_K', 'eTeff_K', 'logg', 'elogg', '[Fe/H]', 'e[Fe/H]', 'Ref03', 'L_Lsol', 'eL_Lsol', 'Ref04', 'R_Rsol', 'eR_Rsol', 'Ref05', 'M_Msol', 'eM_Msol', 'Ref06', 'muRA_masa-1', 'emuRA_masa-1', 'muDE_masa-1', 'emuDE_masa-1', 'Ref07', 'pi_mas', 'epi_mas', 'Ref08', 'd_pc', 'ed_pc', 'Ref09', 'Vr_kms-1', 'eVr_kms-1', 'Ref10', 'ruwe', 'Ref11', 'U_kms-1', 'eU_kms-1', 'V_kms-1', 'eV_kms-1', 'W_kms-1', 'eW_kms-1', 'Ref12', 'sa_m/s/a', 'esa_m/s/a', 'Ref13', 'SKG', 'Ref14', 'SKG_lit', 'Ref14_lit', 'Pop', 'Ref15', 'vsini_flag', 'vsini_kms-1', 'evsini_kms-1', 'Ref16', 'P_d', 'eP_d', 'Ref17', 'pEWHalpha_A', 'epEWHalpha_A', 'Ref18', 'log(LHalpha/Lbol)', 'elog(LHalpha/Lbol)', 'Ref19', '1RXS', 'CRT_s-1', 'eCRT_s-1', 'HR1', 'eHR1', 'HR2', 'eHR2', 'Flux_X_E-13_ergcm-2s-1', 'eFlux_X_E-13_ergcm-2s-1', 'LX/LJ', 'eLX/LJ', 'Ref20', 'Activity', 'Ref21', 'FUV_mag',

## Feature extraction with `cesium` for a real RV curve

Let's now test feature extraction with the `cesium` library.

We will work with a single time series for each sample, so the `cesium` multichannel feature will not be needed.

At the moment, the metafeatures will only include the star ID (the `Karmn` ID for the real data, or the synthetic dataset star ID, "RV-<i>", for the synthetic data) and one of the other parameters (any one will do).

### First element

In [5]:
rv_idx = 0

In [6]:
rv = pd.read_csv(gto.loc[rv_idx, 'rv_file'], sep=' ', decimal='.', names=['time', 'rv', 'error_rv'])
rv.head()

Unnamed: 0,time,rv,error_rv
0,2457593.0,-32.99589,2.712948
1,2457604.0,6.867186,4.044076
2,2457623.0,65.265606,5.040799
3,2457644.0,1.339671,3.315438
4,2457650.0,-41.292629,2.576929


We first need to create the `TimeSeries` object in the `cesium` format, with two example metafeatures. (It is not clear how `cesium` handles this metadata in this case; we had to add the values "manually" to the final features record.)

In [7]:
ts_object = TimeSeries(t=rv['time'], m=rv['rv'], e=rv['error_rv'], label=gto.loc[rv_idx, 'Name'],
                       meta_features={'Karmn': gto.loc[rv_idx, 'Karmn'], 'Teff_K': gto.loc[rv_idx, 'Teff_K']})
ts_object

<cesium.time_series.TimeSeries at 0x1e47226d190>

In [8]:
cs_rv_real = featurize_single_ts(ts_object,
                                 features_to_use=ALL_CS_FEATURES)
cs_rv_real

feature                     channel
all_times_nhist_numpeaks    0          15.000000
all_times_nhist_peak1_bin   0          13.000000
all_times_nhist_peak2_bin   0          16.000000
all_times_nhist_peak3_bin   0          23.000000
all_times_nhist_peak4_bin   0          30.000000
                                         ...    
p2p_scatter_2praw           0           1.077280
p2p_scatter_over_mad        0           1.567387
p2p_scatter_pfold_over_mad  0           1.073014
p2p_ssqr_diff_over_var      0           1.917366
scatter_res_raw             0           0.169913
Length: 112, dtype: float64

In [9]:
metadata_real_idx = ['Karmn', 'Teff_K']
metadata_real_idx

['Karmn', 'Teff_K']

In [10]:
metadata_real_values = list(gto.loc[rv_idx, metadata_real_idx])
metadata_real_values

['J23548+385', 3263.0]

In [11]:
type(cs_rv_real)

pandas.core.series.Series

In [12]:
cs_rv_real.values

array([ 1.50000000e+01,  1.30000000e+01,  1.60000000e+01,  2.30000000e+01,
        3.00000000e+01,  1.00000000e+00,  1.00000000e+00,  1.00000000e+00,
        1.00000000e+00,  1.00000000e+00,  1.00000000e+00,  1.88875396e-02,
        1.44548128e+00,  3.04445125e+00,  2.45768080e+06,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
        8.33333333e-02,  1.66666667e-01,  1.00000000e+00,  1.00000000e+00,
        1.00000000e+00,  1.00000000e+00,  1.00000000e+00,  1.00000000e+00,
        1.41412817e+01,  1.39400150e+01,  7.53370321e+00, -1.09917488e+01,
        2.88841069e+00,  2.71294826e+00,  1.30000000e+01,  7.23753469e+00,
        8.87987383e-01,  1.69695380e+02,  6.92130139e+01,  3.27419058e-17,
        1.68022081e-16,  2.14845982e-16,  4.02799125e-13,  1.56674462e-09,
        1.99072146e+01,  6.52656057e+01, -6.22750513e+00,  2.64673326e+01,
       -7.31604222e+01,  

In [13]:
print(list(cs_rv_real.index.get_level_values('feature')))

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude', 'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50', 'flux_percentile_ratio_mid65',

We then combine metadata with `cesium` features to create a DataFrame record.
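The combination step can also be wrapped in a small helper. The sketch below is an assumption-laden illustration: `make_record` is a hypothetical name, and the input Series is a hand-built stand-in with the same `(feature, channel)` index layout as the output shown above, so that the example runs without `cesium`.

```python
import pandas as pd


def make_record(features, metadata):
    """Return a one-row DataFrame: metadata columns first, then the features."""
    flat = features.copy()
    # Drop the 'channel' level: only a single channel is used here.
    flat.index = flat.index.get_level_values("feature")
    row = {**metadata, **flat.to_dict()}
    return pd.DataFrame([row])


# Stand-in for the featurizer output (two values copied from the output above).
idx = pd.MultiIndex.from_tuples(
    [("all_times_nhist_numpeaks", 0), ("p2p_scatter_2praw", 0)],
    names=["feature", "channel"],
)
features = pd.Series([15.0, 1.077280], index=idx)

record = make_record(features, {"Karmn": "J23548+385", "Teff_K": 3263.0})
print(record)
```

Building the row from a dictionary keeps each value attached to its column name, which avoids relying on two separate lists staying in the same order.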

In [14]:
indices = metadata_real_idx + list(cs_rv_real.index.get_level_values('feature'))
print(indices)

['Karmn', 'Teff_K', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude', 'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50', 'flux_perce

In [15]:
values = metadata_real_values + list(cs_rv_real.values)
print(values)

['J23548+385', 3263.0, 15.0, 13.0, 16.0, 23.0, 30.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.01888753957314272, 1.4454812791997642, 3.0444512508953845, 2457680.798544615, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.08333333333333334, 0.16666666666666669, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 14.141281666660992, 13.940015000058338, 7.533703209309434, -10.991748818471537, 2.888410687201187, 2.71294825508, 13.0, 7.237534688840733, 0.8879873829277999, 169.6953799999319, 69.2130139475, 3.274190577580113e-17, 1.6802208092468987e-16, 2.148459815751033e-16, 4.027991250916157e-13, 1.5667446248154175e-09, 19.907214565170477, 65.2656056761, -6.22750512816, 26.46733261404, -73.1604222189, 5.93153143426635e+26, 0.3076923076923077, 0.3076923076923077, 2.372612576494331e+26, 148.59490367769868, 4.7616056683497465, 0.46100524301749446, 0.3906843032252637, 35.98339767635544, 300.546519694236, 0.9971493403891576, -18.363863772657183, nan, nan, 16.147488576551304, 0.9008404039827114, 0.05744092748669481, 0.0447747384

In [16]:
df_real = pd.DataFrame(data=[values], columns=indices)
df_real.head()

Unnamed: 0,Karmn,Teff_K,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23548+385,3263.0,15.0,13.0,16.0,23.0,30.0,1.0,1.0,1.0,...,0.948789,0.34326,-2.149188,-0.054085,,1.07728,1.567387,1.073014,1.917366,0.169913


### Add another element

In [17]:
rv_idx = 1
# Get metafeatures values:
metadata_real_idx = ['Karmn', 'Teff_K'] # TO BE DEFINED AT THE START OF THE LOOP
metadata_real_values = list(gto.loc[rv_idx, metadata_real_idx])


In [18]:
gto.loc[rv_idx, 'rv_file']

'../data/CARMENES_GTO_RVs/J23505-095.dat'

In [19]:
# load RV file:
rv = pd.read_csv(gto.loc[rv_idx, 'rv_file'], sep=' ', decimal='.',
                 names=['time', 'rv', 'error_rv'])


In [20]:
# Create TimeSeries object:
ts = TimeSeries(t=rv['time'], m=rv['rv'], e=rv['error_rv'])


In [21]:
# Featurize the time series:
cs = featurize_single_ts(ts, features_to_use=ALL_CS_FEATURES)


In [22]:
# Join metadata and features for the dataframe:
indices = metadata_real_idx + list(cs.index.get_level_values('feature'))
values = metadata_real_values + list(cs.values)



In [23]:
# Create a new DataFrame:
new_df = pd.DataFrame(data=[values], columns=indices)
new_df.head()

Unnamed: 0,Karmn,Teff_K,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23505-095,3377.0,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,...,0.809719,0.498257,-2.800944,9.6e-05,0.793332,1.495594,1.052292,1.496514,0.894181,0.684958


In [24]:
# Append the new dataframe to the existing one
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
df_real = pd.concat([df_real, new_df], ignore_index=True)
df_real

Unnamed: 0,Karmn,Teff_K,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23548+385,3263.0,15.0,13.0,16.0,23.0,30.0,1.0,1.0,1.0,...,0.948789,0.34326,-2.149188,-0.054085,,1.07728,1.567387,1.073014,1.917366,0.169913
1,J23505-095,3377.0,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,...,0.809719,0.498257,-2.800944,9.6e-05,0.793332,1.495594,1.052292,1.496514,0.894181,0.684958


### Add yet another element (this would go in the loop)

In [25]:
rv_idx = 2
# Get metafeatures values:
metadata_real_idx = ['Karmn', 'Teff_K'] # TO BE DEFINED AT THE START OF THE LOOP
metadata_real_values = list(gto.loc[rv_idx, metadata_real_idx])
# load RV file:
rv = pd.read_csv(gto.loc[rv_idx, 'rv_file'], sep=' ', decimal='.',
                 names=['time', 'rv', 'error_rv'])
# Create TimeSeries object:
ts = TimeSeries(t=rv['time'], m=rv['rv'], e=rv['error_rv'])
# Featurize the time series:
cs = featurize_single_ts(ts, features_to_use=ALL_CS_FEATURES)
# Join metadata and features for the dataframe:
indices = metadata_real_idx + list(cs.index.get_level_values('feature'))
values = metadata_real_values + list(cs.values)
# Create a new DataFrame:
new_df = pd.DataFrame(data=[values], columns=indices)
# Append the new dataframe to the existing one
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
df_real = pd.concat([df_real, new_df], ignore_index=True)


In [26]:
# See the result:
df_real

Unnamed: 0,Karmn,Teff_K,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,J23548+385,3263.0,15.0,13.0,16.0,23.0,30.0,1.0,1.0,1.0,...,0.948789,0.34326,-2.149188,-0.054085,,1.07728,1.567387,1.073014,1.917366,0.169913
1,J23505-095,3377.0,10.0,10.0,22.0,12.0,31.0,1.071111,1.095455,4.381818,...,0.809719,0.498257,-2.800944,9.6e-05,0.793332,1.495594,1.052292,1.496514,0.894181,0.684958
2,J23431+365,3301.0,11.0,17.0,29.0,48.0,13.0,1.076923,1.076923,1.4,...,0.925715,0.129382,0.438334,0.001309,0.618542,1.042247,0.917263,1.269583,1.426457,0.235807


### Save the result

In [27]:
df_real.to_csv("TEST_REAL_featurized.csv", sep=',', decimal='.', index=False)

## Load the synthetic RV curves database

In [28]:
synth = pd.read_csv(SINTHETIC_FOLDER + SYNTHETIC_DB_FILE, sep=',', decimal='.')
synth.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,D1_Ps,D1_Tobs,D2_noiseRV_mean,...,D3_PsRV_median,D3_PsRV_stdev,D3_NumRV,D4_noiseRV_mean,D4_noiseRV_median,D4_noiseRV_stdev,ds1_file,ds2_file,ds3_file,ds4_file
0,RV-0,True,54.703173,1.133098,640.476258,2457653.0,0.737927,0.0025,0.25,0.24406,...,9.07285,83.352561,52,-0.416842,-0.659304,2.989282,./RV_DATASETS/DS1_ts_files/DS1-RV-RV-0.dat,./RV_DATASETS/DS2_ts_files/DS2-RV-RV-0.dat,./RV_DATASETS/DS3_ts_files/DS3-RV-RV-0.dat,./RV_DATASETS/DS4_ts_files/DS4-RV-RV-0.dat
1,RV-1,True,45.870515,0.945738,1392.825171,2458556.0,0.604099,0.0025,0.25,-0.064906,...,11.596005,161.691684,9,0.779823,1.449257,2.049067,./RV_DATASETS/DS1_ts_files/DS1-RV-RV-1.dat,./RV_DATASETS/DS2_ts_files/DS2-RV-RV-1.dat,./RV_DATASETS/DS3_ts_files/DS3-RV-RV-1.dat,./RV_DATASETS/DS4_ts_files/DS4-RV-RV-1.dat
2,RV-2,False,0.0,0.0,1799.268703,2457479.0,0.0,0.0025,0.25,0.299835,...,17.86763,200.840308,26,-0.334425,-0.653148,2.803447,./RV_DATASETS/DS1_ts_files/DS1-RV-RV-2.dat,./RV_DATASETS/DS2_ts_files/DS2-RV-RV-2.dat,./RV_DATASETS/DS3_ts_files/DS3-RV-RV-2.dat,./RV_DATASETS/DS4_ts_files/DS4-RV-RV-2.dat
3,RV-3,False,0.0,0.0,918.007885,2457612.0,0.0,0.0025,0.25,0.032798,...,24.89643,66.240066,6,0.854171,1.319561,3.02131,./RV_DATASETS/DS1_ts_files/DS1-RV-RV-3.dat,./RV_DATASETS/DS2_ts_files/DS2-RV-RV-3.dat,./RV_DATASETS/DS3_ts_files/DS3-RV-RV-3.dat,./RV_DATASETS/DS4_ts_files/DS4-RV-RV-3.dat
4,RV-4,True,69.503227,1.447038,-1375.321599,2460177.0,0.96217,0.0025,0.25,-0.403387,...,63.321765,303.823082,7,0.591993,-0.473597,3.343537,./RV_DATASETS/DS1_ts_files/DS1-RV-RV-4.dat,./RV_DATASETS/DS2_ts_files/DS2-RV-RV-4.dat,./RV_DATASETS/DS3_ts_files/DS3-RV-RV-4.dat,./RV_DATASETS/DS4_ts_files/DS4-RV-RV-4.dat


In [30]:
print(synth.columns)

Index(['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV',
       'phase', 'D1_Ps', 'D1_Tobs', 'D2_noiseRV_mean', 'D2_noiseRV_median',
       'D2_noiseRV_stdev', 'D3_samplingRV_idx', 'D3_PsRV_mean',
       'D3_PsRV_median', 'D3_PsRV_stdev', 'D3_NumRV', 'D4_noiseRV_mean',
       'D4_noiseRV_median', 'D4_noiseRV_stdev', 'ds1_file', 'ds2_file',
       'ds3_file', 'ds4_file'],
      dtype='object')


## Feature extraction with `cesium` for a synthetic RV curve

### First feature extraction for synthetic RV curves, pulsating star

In [88]:
rv_idx = 0
# Get metafeatures values:
metadata_synth_idx = ['ID', 'Pulsating', 'frequency', 'amplitudeRV'] # TO BE DEFINED AT THE START OF THE LOOP
metadata_synth_values = list(synth.loc[rv_idx, metadata_synth_idx])
# load RV file:
rv_filename = SINTHETIC_FOLDER + synth.loc[rv_idx, 'ds1_file'].replace("./RV_DATASETS/", "") # TRICKY...
rv = pd.read_csv(rv_filename, sep=' ', decimal='.',
                 names=['time', 'rv'])
# Create TimeSeries object:
ts = TimeSeries(t=rv['time'], m=rv['rv'])
# Featurize the time series:
cs = featurize_single_ts(ts, features_to_use=ALL_CS_FEATURES)
# Join metadata and features for the dataframe:
indices = metadata_synth_idx + list(cs.index.get_level_values('feature'))
values = metadata_synth_values + list(cs.values)
# Create a new DataFrame (with the first item):
synth_df = pd.DataFrame(data=[values], columns=indices)


  return (cads[2:] + cads[:-2]) / (cads[1:-1] - cads[:-2])
  x = asanyarray(arr - arrmean)


In [89]:
synth_df

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,RV-0,True,54.703173,1.133098,6.0,4.0,8.0,21.0,25.0,1.115607,...,1.008862,2e-06,-2.656407e-09,-0.247608,1.079738,3.690841,0.843676,1.563795,0.703888,0.99711
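The path handling marked `# TRICKY...` in the cell above relies on a string replacement. A slightly more explicit alternative could look like the sketch below; `resolve_ds_path` and its `stored_root` parameter are hypothetical names, and the sketch assumes the stored paths always contain the `RV_DATASETS` folder as one of their components.

```python
from pathlib import PurePosixPath


def resolve_ds_path(base_folder, stored_path, stored_root="RV_DATASETS"):
    """Re-root a path stored in the synthetic database under base_folder."""
    parts = PurePosixPath(stored_path).parts
    if stored_root in parts:
        # Keep only the components that come after the stored root folder.
        parts = parts[parts.index(stored_root) + 1:]
    return str(PurePosixPath(base_folder).joinpath(*parts))


rv_filename = resolve_ds_path(
    "../data/RV_DATASETS/",
    "./RV_DATASETS/DS1_ts_files/DS1-RV-RV-0.dat",
)
print(rv_filename)
```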


### Add feature extraction for a second synthetic RV curve, non-pulsating star

In [90]:
rv_idx = 2
# Get metafeatures values:
metadata_synth_idx = ['ID', 'Pulsating', 'frequency', 'amplitudeRV'] # TO BE DEFINED AT THE START OF THE LOOP
metadata_synth_values = list(synth.loc[rv_idx, metadata_synth_idx])
# load RV file:
rv_filename = SINTHETIC_FOLDER + synth.loc[rv_idx, 'ds1_file'].replace("./RV_DATASETS/", "") # TRICKY...
rv = pd.read_csv(rv_filename, sep=' ', decimal='.',
                 names=['time', 'rv'])
# Create TimeSeries object:
ts = TimeSeries(t=rv['time'], m=rv['rv'])
# Featurize the time series:
cs = featurize_single_ts(ts, features_to_use=ALL_CS_FEATURES)
# Join metadata and features for the dataframe:
indices = metadata_synth_idx + list(cs.index.get_level_values('feature'))
values = metadata_synth_values + list(cs.values)
# Create a new DataFrame (with the new item):
new_df = pd.DataFrame(data=[values], columns=indices)
# Append the new dataframe to the existing one
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
synth_df = pd.concat([synth_df, new_df], ignore_index=True)


  return (y_high - y_low) / (y_95 - y_5)
  return max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med))
  return (y_95 - y_5) / y_50
  out_dict['psd'] = psd[j] * 0.5 / varcn
  prob = stats.f.sf(0.5 * (ntime - 1. - detrend_order) * (1. -out_dict['chi2'] / out_dict['chi0']), 2, ntime - 1 - detrend_order)
  out_dict['chi2qso_nu_nuNULL_ratio'] = out_dict['chi2_qso/nu'] / out_dict['chi2_qso/nu_NULL']
  out_dict['log_chi2nuNULL_chi2nu'] = np.log(out_dict['chi2_qso/nu_NULL'] / out_dict['chi2_qso/nu'])
  cutoff = (a['alpha_1'] / np.power(np.abs(period - a['per']), 0.25)
  return (cads[2:] + cads[:-2]) / (cads[1:-1] - cads[:-2])
  out_dict['scatter_2praw'] = sumsqr_diff_2per_fold / sumsqr_diff_unfold
  out_dict['scatter_over_mad'] = median_diff / mad
  out_dict['scatter_pfold_over_mad'] = median_1per_fold_diff / mad
  x = asanyarray(arr - arrmean)


In [91]:
synth_df

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,RV-0,True,54.703173,1.133098,6.0,4.0,8.0,21.0,25.0,1.115607,...,1.008862,1.552702e-06,-2.656407e-09,-0.2476085,1.079738,3.690841,0.843676,1.563795,0.703888,0.99711
1,RV-2,False,0.0,0.0,1.0,,,,,,...,,1.596939e-32,0.0,7.677770000000001e-23,,,,,0.0,


### Add feature extraction for a third synthetic RV curve: the same non-pulsating star, noisy version (DS2)

In [92]:
rv_idx = 2
# Get metafeatures values:
metadata_synth_idx = ['ID', 'Pulsating', 'frequency', 'amplitudeRV'] # TO BE DEFINED AT THE START OF THE LOOP
metadata_synth_values = list(synth.loc[rv_idx, metadata_synth_idx])
# load RV file:
rv_filename = SINTHETIC_FOLDER + synth.loc[rv_idx, 'ds2_file'].replace("./RV_DATASETS/", "") # TRICKY...
rv = pd.read_csv(rv_filename, sep=' ', decimal='.',
                 names=['time', 'rv'])
# Create TimeSeries object:
ts = TimeSeries(t=rv['time'], m=rv['rv'])
# Featurize the time series:
cs = featurize_single_ts(ts, features_to_use=ALL_CS_FEATURES)
# Join metadata and features for the dataframe:
indices = metadata_synth_idx + list(cs.index.get_level_values('feature'))
values = metadata_synth_values + list(cs.values)
# Create a new DataFrame (with the new item):
new_df = pd.DataFrame(data=[values], columns=indices)
# Append the new dataframe to the existing one
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
synth_df = pd.concat([synth_df, new_df], ignore_index=True)


  return (y_high - y_low) / (y_95 - y_5)
  return max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med))
  return (y_95 - y_5) / y_50
  cutoff = (a['alpha_1'] / np.power(np.abs(period - a['per']), 0.25)
  return (cads[2:] + cads[:-2]) / (cads[1:-1] - cads[:-2])
  x = asanyarray(arr - arrmean)


In [93]:
synth_df

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,RV-0,True,54.703173,1.133098,6.0,4.0,8.0,21.0,25.0,1.115607,...,1.008862,1.552702e-06,-2.656407e-09,-0.2476085,1.079738,3.690841,0.843676,1.563795,0.703888,0.99711
1,RV-2,False,0.0,0.0,1.0,,,,,,...,,1.596939e-32,0.0,7.677770000000001e-23,,,,,0.0,
2,RV-2,False,0.0,0.0,1.0,,,,,,...,0.91941,3.334243e-06,0.004095163,-0.8241427,1.004691,1.0,1.850241,1.865297,1.983014,1.018985


### Add the other two versions (DS3 and DS4) of the non-pulsating star

In [94]:
rv_idx = 2
# Get metafeatures values:
metadata_synth_idx = ['ID', 'Pulsating', 'frequency', 'amplitudeRV'] # TO BE DEFINED AT THE START OF THE LOOP
metadata_synth_values = list(synth.loc[rv_idx, metadata_synth_idx])
# load RV file:
rv_filename = SINTHETIC_FOLDER + synth.loc[rv_idx, 'ds3_file'].replace("./RV_DATASETS/", "") # TRICKY...
rv = pd.read_csv(rv_filename, sep=' ', decimal='.',
                 names=['time', 'rv'])
# Create TimeSeries object:
ts = TimeSeries(t=rv['time'], m=rv['rv'])
# Featurize the time series:
cs = featurize_single_ts(ts, features_to_use=ALL_CS_FEATURES)
# Join metadata and features for the dataframe:
indices = metadata_synth_idx + list(cs.index.get_level_values('feature'))
values = metadata_synth_values + list(cs.values)
# Create a new DataFrame (with the new item):
new_df = pd.DataFrame(data=[values], columns=indices)
# Append the new dataframe to the existing one
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
synth_df = pd.concat([synth_df, new_df], ignore_index=True)


  return (y_high - y_low) / (y_95 - y_5)
  return max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med))
  return (y_95 - y_5) / y_50
  power = (YC * YC / CC + YS * YS / SS) / YY
  power = (YC * YC / CC + YS * YS / SS) / YY
  power = (YC * YC / CC + YS * YS / SS) / YY
  power = (YC * YC / CC + YS * YS / SS) / YY
  power = (YC * YC / CC + YS * YS / SS) / YY
  power = (YC * YC / CC + YS * YS / SS) / YY
  out_dict['chi2qso_nu_nuNULL_ratio'] = out_dict['chi2_qso/nu'] / out_dict['chi2_qso/nu_NULL']
  out_dict['log_chi2nuNULL_chi2nu'] = np.log(out_dict['chi2_qso/nu_NULL'] / out_dict['chi2_qso/nu'])
  return (cf.median_absolute_deviation(lomb_resid) /
  out_dict['scatter_2praw'] = sumsqr_diff_2per_fold / sumsqr_diff_unfold
  out_dict['scatter_over_mad'] = median_diff / mad
  out_dict['scatter_pfold_over_mad'] = median_1per_fold_diff / mad


In [95]:
rv_idx = 2
# Get metafeatures values:
metadata_synth_idx = ['ID', 'Pulsating', 'frequency', 'amplitudeRV'] # TO BE DEFINED AT THE START OF THE LOOP
metadata_synth_values = list(synth.loc[rv_idx, metadata_synth_idx])
# load RV file:
rv_filename = SINTHETIC_FOLDER + synth.loc[rv_idx, 'ds4_file'].replace("./RV_DATASETS/", "") # TRICKY...
rv = pd.read_csv(rv_filename, sep=' ', decimal='.',
                 names=['time', 'rv'])
# Create TimeSeries object:
ts = TimeSeries(t=rv['time'], m=rv['rv'])
# Featurize the time series:
cs = featurize_single_ts(ts, features_to_use=ALL_CS_FEATURES)
# Join metadata and features for the dataframe:
indices = metadata_synth_idx + list(cs.index.get_level_values('feature'))
values = metadata_synth_values + list(cs.values)
# Create a new DataFrame (with the new item):
new_df = pd.DataFrame(data=[values], columns=indices)
# Append the new dataframe to the existing one
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
synth_df = pd.concat([synth_df, new_df], ignore_index=True)


  return (y_high - y_low) / (y_95 - y_5)
  return max(abs((y_max - y_med) / y_med), abs((y_med - y_min) / y_med))
  return (y_95 - y_5) / y_50


In [96]:
synth_df

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,RV-0,True,54.703173,1.133098,6.0,4.0,8.0,21.0,25.0,1.115607,...,1.008862,1.552702e-06,-2.656407e-09,-0.2476085,1.079738,3.690841,0.843676,1.563795,0.703888,0.99711
1,RV-2,False,0.0,0.0,1.0,,,,,,...,,1.596939e-32,0.0,7.677770000000001e-23,,,,,0.0,
2,RV-2,False,0.0,0.0,1.0,,,,,,...,0.91941,3.334243e-06,0.004095163,-0.8241427,1.004691,1.0,1.850241,1.865297,1.983014,1.018985
3,RV-2,False,0.0,0.0,7.0,9.0,16.0,26.0,43.0,1.5,...,0.234772,2.705583e-29,2.533625e-09,-4.249476e-15,0.231209,,,,0.0,inf
4,RV-2,False,0.0,0.0,7.0,9.0,16.0,26.0,43.0,1.5,...,1.054615,7.665963e-07,-0.2784362,-2.093587e-05,2.378014,0.525475,1.635725,1.185163,1.966647,0.141466


**OBSERVATION:** be careful, some curves yield infinite (`inf`) or missing (`NaN`) feature values.
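One possible way to locate and neutralize these values before modelling is sketched below. The helper names `report_non_finite` and `inf_to_nan` are hypothetical, and the small table reuses two columns from the output above only as illustrative input.

```python
import numpy as np
import pandas as pd


def report_non_finite(df):
    """Count infinite and missing values in each numeric column."""
    numeric = df.select_dtypes(include=[np.number])
    return pd.DataFrame({
        "n_inf": np.isinf(numeric).sum(),
        "n_nan": numeric.isna().sum(),
    })


def inf_to_nan(df):
    """Return a copy in which +inf and -inf are replaced by NaN."""
    return df.replace([np.inf, -np.inf], np.nan)


# Two feature columns taken from the table above.
demo = pd.DataFrame({
    "ID": ["RV-0", "RV-2", "RV-2", "RV-2", "RV-2"],
    "p2p_ssqr_diff_over_var": [0.703888, 0.0, 1.983014, 0.0, 1.966647],
    "scatter_res_raw": [0.99711, np.nan, 1.018985, np.inf, 0.141466],
})

report = report_non_finite(demo)
clean = inf_to_nan(demo)
print(report)
```

Converting `inf` to `NaN` leaves a single kind of invalid value to handle later (by imputation or by dropping rows or columns), which is a design choice rather than a requirement.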

### Save the result

In [97]:
synth_df.to_csv("TEST_SYNTH_featurized.csv", sep=',', decimal='.', index=False)

## Summary

**CONCLUSIONS:**
- Extracting the features from the RV curves seems pretty straightforward.
- Some aspects must be remembered, though:
  - Warnings must be silenced; otherwise, a lot of output is generated.
  - Be aware that some features take infinite values in some cases; this must be taken into account in the future ML models.
  - We should also include a "valid/not_valid" flag and add the record anyway when something goes wrong. For DS1-DS4 this is important in order to keep exactly the same number of objects, in the same order, across the four synthetic datasets.
  - Use the same approach as we did with the GLS: save the file very often, so that if something goes wrong we can resume later just by changing the loop range.
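
The points above (silenced warnings, a validity flag, and frequent saving) could be combined in a loop along the following lines. This is only a sketch under stated assumptions: `safe_featurize`, `featurize_all`, `toy_featurizer`, `checkpoint_file` and `save_every` are hypothetical names, and the featurizer is assumed to be any callable returning a mapping keyed by feature name. The `cesium` output, which is indexed by `(feature, channel)`, would first need to be flattened to that form.

```python
import warnings

import numpy as np
import pandas as pd


def safe_featurize(featurizer, ts, feature_names):
    """Run featurizer(ts) with warnings silenced; return (values, valid flag)."""
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            result = featurizer(ts)
        return {name: result[name] for name in feature_names}, True
    except Exception:
        # Keep the record, filled with NaN, so row count and order are preserved.
        return {name: np.nan for name in feature_names}, False


def featurize_all(items, featurizer, feature_names,
                  checkpoint_file=None, save_every=50):
    """Featurize (id, time series) pairs, saving partial results periodically."""
    rows = []
    for i, (obj_id, ts) in enumerate(items, start=1):
        values, valid = safe_featurize(featurizer, ts, feature_names)
        rows.append({"ID": obj_id, "valid": valid, **values})
        if checkpoint_file is not None and i % save_every == 0:
            pd.DataFrame(rows).to_csv(checkpoint_file, index=False)
    return pd.DataFrame(rows)


def toy_featurizer(values):
    """Stand-in featurizer: fails on empty input, like a broken RV file might."""
    arr = np.asarray(values, dtype=float)
    if arr.size == 0:
        raise ValueError("empty time series")
    return {"mean": arr.mean(), "std": arr.std()}


table = featurize_all(
    [("RV-0", [1.0, 2.0, 3.0]), ("RV-1", [])],
    toy_featurizer,
    ["mean", "std"],
)
print(table)
```

Because a failed object still produces a row (with `valid` set to `False`), the four synthetic datasets would keep the same number of records in the same order.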
