# BULK FEATURE EXTRACTION OF THE ML SYNTHETIC RV CURVES WITH `cesium`

In this notebook we do the bulk feature extraction with `cesium` for all the stars in one of the ML synthetic samples (S1, S2, S3 or S4, configurable in the second cell of the notebook).

**NOTE:** for the S1 and S2 samples, it was necessary to reduce the number of RV data points to just the first $26,251$ points, as using the full time series with $62,521$ points crashed the notebook. This should not be a major problem: that time span still includes at least 336 full cycles of the underlying signals, and the frequency resolution is still fine enough ($\sim0.024\;d^{-1}$).
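The figures quoted in the note above can be checked with a few lines of arithmetic. This is only a sketch: it assumes that the time span is $(N-1)\,\Delta t$ and that the frequency resolution can be approximated by the inverse of the time span (a common rule of thumb, not something stated in this notebook). The variable names are illustrative.

```python
# Values quoted in this notebook (see the comments in the configuration cell):
n_points = 26251       # number of RV points kept for S1 and S2
sampling = 0.0016      # time sampling, in days
min_frequency = 8.0    # lowest signal frequency, in d^-1

# Assumption: the time span is (N - 1) times the sampling interval.
time_span = (n_points - 1) * sampling
# Number of cycles of the slowest signal contained in that span.
full_cycles = time_span * min_frequency
# Assumption: frequency resolution approximated as 1 / time span.
freq_resolution = 1.0 / time_span

print(f"Time span: {time_span:.1f} d")
print(f"Full cycles at {min_frequency} d^-1: {full_cycles:.0f}")
print(f"Frequency resolution: {freq_resolution:.4f} d^-1")
```

Any signal with a frequency above the minimum completes more cycles in the same span, which is why 336 is a lower bound.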

**IMPORTANT NOTE:** this code is probably not very efficient (for example, it performs one DataFrame `append` operation per record, which is costly), but there is no special need at the moment to be more efficient. A possible solution is to accumulate the rows in a 2D numpy array and create the DataFrame only once, at the end.
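A minimal sketch of that idea follows, using a plain Python list of rows rather than a 2D numpy array, because the rows mix strings, booleans and floats. The helper `build_feature_table` and the toy values are hypothetical and are not part of this notebook.

```python
import numpy as np
import pandas as pd

def build_feature_table(records, columns):
    """Collect one row per record and build the DataFrame only once."""
    rows = []
    for record in records:
        # Appending to a Python list is cheap compared to growing a DataFrame.
        rows.append(list(record))
    return pd.DataFrame(data=rows, columns=columns)

# Toy data standing in for metadata + cesium features (hypothetical values):
toy_columns = ['ID', 'VALID_RECORD', 'mean', 'std']
toy_records = [
    ('B_Star-00000', True, 0.12, 1.5),
    ('B_Star-00001', False, np.nan, np.nan),
]
toy_df = build_feature_table(toy_records, toy_columns)
print(toy_df)
```

The trade-off is that the loop in this notebook writes the table to disk after every record so that an interrupted run can be resumed. Building the DataFrame only at the end would lose that property unless the rows are also saved periodically.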

## Modules and configuration

### Modules

In [1]:
# Module import:
from IPython.display import clear_output
import warnings
import time

import pandas as pd
import numpy as np

from cesium.data_management import TimeSeries
from cesium.featurize import featurize_single_ts

### Configuration

In [17]:
FILE_ID = "S4" # For file name
CASE_ID = "S4" # For column name, metadata, etc.

#POINTS_LIMIT = 26251 # INTENDED FOR S1 AND S2, SO THAT THE CESIUM CALCULATION DOES NOT CRASH THE KERNEL.
    # 26251 POINTS SHOULD BE ENOUGH, AS THEY SPAN 42 d (TIME SAMPLING IS 0.0016 d), WHICH IN TURN
    # INCLUDES 336 COMPLETE CYCLES OF THE LOWEST FREQUENCY (8.0 d^{-1}), i.e. LONGEST PERIOD (0.125 d).
POINTS_LIMIT = None # USE THIS 'None' VALUE FOR S3 AND S4 - THE CARMENES SAMPLING ALREADY REDUCES THE
    # NUMBER OF POINTS TO A REASONABLE VALUE.

GTO_FILE = "../data/RV_FINAL_ML_SyntheticDatasets_without_PG.csv"
#RV_FOLDER = "../data/ML_RVs/"

CS_FEATURES_FOLDER = "../data/DATASETS_CESIUM/"
OUT_DATASET_FILE = "cesium_ML_FINAL_" + FILE_ID + ".csv"



# LIST OF STAR METADATA TO ADD (FROM CARMENCITA DATABASE):
METADATA = {
    'S1': ['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase',
           'S1_Ps', 'S1_Tobs'],
    'S2': ['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase',
           'S1_Ps', 'S1_Tobs',
           'S2_errorRV_dist_idx', 'S2_errorRV_dist_name', 'S2_errorRV_dist_loc', 'S2_errorRV_dist_scale',
           'S2_errorRV_mean', 'S2_errorRV_median', 'S2_errorRV_stdev'],
    'S3': ['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase',
           'S3_sampling_idx', 'S3_Tobs', 'S3_Ps_mean', 'S3_Ps_median', 'S3_Ps_stdev', 'S3_NumPoints'],
#    'S4': ['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase',
#           'S3_sampling_idx', 'S3_Tobs', 'S3_Ps_mean', 'S3_Ps_median', 'S3_Ps_stdev', 'S3_NumPoints',
#           'S2_errorRV_dist_idx', 'S2_errorRV_dist_name', 'S2_errorRV_dist_loc', 'S2_errorRV_dist_scale',
#           'S4_errorRV_mean', 'S4_errorRV_median', 'S4_errorRV_stdev'],
    'S4': ['ID', 'Pulsating',
           'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 
           'CARMENES_source_idx', 'CARMENES_Ref_star',
           'errorRV_dist_loc', 'errorRV_dist_scale', 'errorRV_mean', 'errorRV_median', 'errorRV_stdev',
           'Tobs', 'Ps_mean', 'Ps_median', 'Ps_stdev', 'NumPoints',
           'S4_file', 'S3_file']
}

# A LIST OF ALL THE FEATURES CESIUM CAN EXTRACT (FOR REFERENCE PURPOSES)
ALL_CS_FEATURES = ['all_times_nhist_numpeaks',
                   'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin',
                   'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4',
                   'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4',
                   'all_times_nhist_peak_3_to_4',
                   'all_times_nhist_peak_val',
                   'avg_double_to_single_step', 'avg_err', 'avgt',
                   'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50',
                   'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000',
                   'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000',
                   'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000',
                   'cads_avg', 'cads_med', 'cads_std', 'mean',
                   'med_double_to_single_step', 'med_err',
                   'n_epochs', 'std_double_to_single_step', 'std_err',
                   'total_time', 'amplitude',
                   'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50',
                   'flux_percentile_ratio_mid65', 'flux_percentile_ratio_mid80',
                   'max_slope', 'maximum', 'median', 'median_absolute_deviation', 'minimum',
                   'percent_amplitude', 'percent_beyond_1_std', 'percent_close_to_median', 'percent_difference_flux_percentile',
                   'period_fast', 'qso_log_chi2_qsonu', 'qso_log_chi2nuNULL_chi2nu', 'skew', 'std',
                   'stetson_j', 'stetson_k', 'weighted_average', 'fold2P_slope_10percentile', 'fold2P_slope_90percentile',
                   'freq1_amplitude1', 'freq1_amplitude2', 'freq1_amplitude3', 'freq1_amplitude4',
                   'freq1_freq', 'freq1_lambda', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq1_signif',
                   'freq2_amplitude1', 'freq2_amplitude2', 'freq2_amplitude3', 'freq2_amplitude4',
                   'freq2_freq', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4',
                   'freq3_amplitude1', 'freq3_amplitude2', 'freq3_amplitude3', 'freq3_amplitude4',
                   'freq3_freq', 'freq3_rel_phase2', 'freq3_rel_phase3', 'freq3_rel_phase4',
                   'freq_amplitude_ratio_21', 'freq_amplitude_ratio_31',
                   'freq_frequency_ratio_21', 'freq_frequency_ratio_31',
                   'freq_model_max_delta_mags', 'freq_model_min_delta_mags', 'freq_model_phi1_phi2',
                   'freq_n_alias', 'freq_signif_ratio_21', 'freq_signif_ratio_31',
                   'freq_varrat', 'freq_y_offset', 'linear_trend', 'medperc90_2p_p',
                   'p2p_scatter_2praw', 'p2p_scatter_over_mad', 'p2p_scatter_pfold_over_mad', 'p2p_ssqr_diff_over_var',
                   'scatter_res_raw']


## Load ML subsample information table

In [18]:
gto = pd.read_csv(GTO_FILE, sep=',', decimal='.')
gto.head()

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,errorRV_dist_loc,...,errorRV_mean,errorRV_median,errorRV_stdev,Tobs,Ps_mean,Ps_median,Ps_stdev,NumPoints,S4_file,S3_file
0,B_Star-00000,False,0.0,0.0,0.0,2457432.0,0.0,116,J11511+352,1.469807,...,1.561607,1.505,0.495668,1265.696252,11.402669,3.072043,30.706143,112,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...,../data/SYNTH_RV_SAMPLES/S3B_ts_files/S3-RV_B_...
1,B_Star-00001,False,0.0,0.0,0.0,2457487.0,0.0,29,J20336+617,0.882503,...,1.588462,1.53,0.439122,1564.855816,30.683447,9.07285,83.352561,52,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...,../data/SYNTH_RV_SAMPLES/S3B_ts_files/S3-RV_B_...
2,B_Star-00002,False,0.0,0.0,0.0,2457417.0,0.0,156,J08402+314,1.093215,...,1.265714,1.28,0.079796,665.040019,110.840003,40.398042,121.050265,7,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...,../data/SYNTH_RV_SAMPLES/S3B_ts_files/S3-RV_B_...
3,B_Star-00003,False,0.0,0.0,0.0,2457431.0,0.0,180,J05421+124,1.213872,...,1.2322,1.22,0.257987,1678.220745,34.249403,10.960469,56.986056,50,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...,../data/SYNTH_RV_SAMPLES/S3B_ts_files/S3-RV_B_...
4,B_Star-00004,False,0.0,0.0,0.0,2461026.0,0.0,67,J17052-050,1.381081,...,1.5752,1.47,0.301631,1644.61359,33.563543,17.070651,63.454219,50,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...,../data/SYNTH_RV_SAMPLES/S3B_ts_files/S3-RV_B_...


In [19]:
print(list(gto.columns))

['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 'CARMENES_source_idx', 'CARMENES_Ref_star', 'errorRV_dist_loc', 'errorRV_dist_scale', 'errorRV_mean', 'errorRV_median', 'errorRV_stdev', 'Tobs', 'Ps_mean', 'Ps_median', 'Ps_stdev', 'NumPoints', 'S4_file', 'S3_file']


In [20]:
#gto[['rv_file']]
gto[[CASE_ID + '_file']]

Unnamed: 0,S4_file
0,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...
1,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...
2,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...
3,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...
4,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...
...,...
3995,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...
3996,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...
3997,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...
3998,../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_...


In [21]:
gto.loc[0, CASE_ID + '_file']

'../data/SYNTH_RV_SAMPLES/S4B_ts_files/S4-RV_B_Star-00000.dat'

## Feature extraction with `cesium` for the synthetic RV curves

In [23]:
# DISABLE WARNINGS:
warnings.filterwarnings('ignore')
# Batch processing:
lapse_list = []
median_lapse = None
# Initialize features dataframe and metafeatures (from disk, or new):
try:
    df = pd.read_csv(CS_FEATURES_FOLDER + OUT_DATASET_FILE, sep=',', decimal='.')
    i0 = len(df)
    print("Previous result found, will continue at record %d..." %i0)
except (FileNotFoundError, pd.errors.EmptyDataError):
    # No previous data stored on disk (or the file is empty), initialize the DataFrame:
    print("No previous results found, initializing dataframe...")
    df = None
    i0 = 0
#metadata_idx = METADATA
metadata_idx = METADATA[CASE_ID]

#for i in range(i0, 3): # TEST
for i in range(i0, len(gto)):
    clear_output(wait=True)
    start_time = time.time()
    print("Record: %d, started at %s..."
          %(i, time.strftime('%d/%m/%Y, %H:%M:%S', time.localtime(start_time))))
    if median_lapse is None:
        print("Previous median lapse time: %s" %median_lapse)
    else:
        print("Previous median lapse time: %.2f seconds" %median_lapse)
    # Get metafeatures values:
    metadata_values = list(gto.loc[i, metadata_idx])
    try:
        print("PROCESSING STAR %s..." %gto.loc[i, 'ID'])
        # Load the RV file (columns: time, RV, RV error):
        print("Loading RV file...")
        rv = pd.read_csv(gto.loc[i, CASE_ID + '_file'], sep=' ', decimal='.',
                         names=['time', 'rv', 'error_rv'])
        # Limiting the number of points (if applicable):
        if POINTS_LIMIT is None:
            print("\tNote: No limit - took all the %d points in the series." %len(rv))
        else:
            print("\tNote: Limitation - took only the first %d points in the series" %POINTS_LIMIT)
            rv = rv.head(n=POINTS_LIMIT)
        # Create TimeSeries object:
        print("Creating time series...")
        ts = TimeSeries(t=rv['time'], m=rv['rv'], e=rv['error_rv'])
        # Featurize the time series:
        print("Calculating cesium features...")
        cs = featurize_single_ts(ts, features_to_use=ALL_CS_FEATURES)
        # Join metadata and features for the dataframe:
        indices = metadata_idx + ['VALID_RECORD'] + list(cs.index.get_level_values('feature'))
        values = metadata_values + [True] + list(cs.values)
    except Exception as e:
        # An exception was raised: mark the record as invalid and set the features to NaN:
        print("***ERROR: some error happened in record %d, marking the record as invalid. Error: %s" %(i, str(e)))
        indices = metadata_idx + ['VALID_RECORD'] + ALL_CS_FEATURES
        values = metadata_values + [False] + [np.nan] * len(ALL_CS_FEATURES)
    if df is None:
        # Initialize DataFrame (with the first item):
        df = pd.DataFrame(data=[values], columns=indices)
    else:
        # Create a new DataFrame (with the new item):
        new_df = pd.DataFrame(data=[values], columns=indices)
        # Concatenate it to the existing one (`DataFrame.append` was removed in pandas 2.0):
        df = pd.concat([df, new_df], ignore_index=True)
    # UPDATE THE MEDIAN RECORD PROCESSING TIME:
    lapse = time.time() - start_time
    lapse_list.append(lapse)
    median_lapse = np.nanmedian(lapse_list)
    # Save the results:
    df.to_csv(CS_FEATURES_FOLDER + OUT_DATASET_FILE, sep=',', decimal='.', index=False)

print("--- FINISHED ---")

Record: 3999, started at 19/02/2023, 13:57:05...
Previous median lapse time: 0.17 seconds
PROCESSING STAR B_Star-03999...
Loading RV file...
	Note: No limit - took all the 79 points in the series.
Creating time series...
Calculating cesium features...
--- FINISHED ---


### Next steps are to be executed only if the cell execution is user-interrupted

For example, if the user decided to interrupt the cell execution because it got stuck in some record, the next cells update the info for that record with an "invalid record" mark.

Afterwards, the loop (previous cell) can be executed again and it will start from the record following the problematic one.
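The update described above could look like the following sketch. It assumes the same row layout used by the loop (metadata columns, then `VALID_RECORD`, then the `cesium` feature columns); the helper `invalid_record_row` and the toy values are hypothetical and are not part of this notebook.

```python
import numpy as np
import pandas as pd

def invalid_record_row(metadata_columns, metadata_values, feature_names):
    """Return a one-row DataFrame flagging a record as invalid (hypothetical helper)."""
    columns = list(metadata_columns) + ['VALID_RECORD'] + list(feature_names)
    # All features are set to NaN because they could not be computed.
    values = list(metadata_values) + [False] + [np.nan] * len(feature_names)
    return pd.DataFrame(data=[values], columns=columns)

# Toy example (hypothetical values). In this notebook the inputs would be
# METADATA[CASE_ID], the metadata values of the stuck record and ALL_CS_FEATURES.
done = pd.DataFrame({'ID': ['B_Star-00000'], 'VALID_RECORD': [True],
                     'mean': [0.1], 'std': [1.2]})
stuck = invalid_record_row(['ID'], ['B_Star-00001'], ['mean', 'std'])
done = pd.concat([done, stuck], ignore_index=True)
print(done)
```

Once the extended table is saved to the output file, the loop finds one more stored record than before and therefore resumes at the record after the problematic one.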

## Review the records with errors

In [24]:
df[df['VALID_RECORD'] == False]

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,errorRV_dist_loc,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw


In [25]:
print(list(df.loc[df['VALID_RECORD'] == False, 'ID']))

[]


## Summary

**CONCLUSIONS:**
- Completed the `cesium` feature extraction for all the stars in the ML subsample (four runs of the notebook).
- None of these objects yielded any error during the calculation.