# BULK FEATURE EXTRACTION OF THE SYNTHETIC RV CURVES WITH `cesium`

In this notebook we run the bulk feature extraction with `cesium` on the RV curves of the synthetic DS4 datasets, which will be used for validation.

**IMPORTANT NOTE:** this code is probably not very efficient (for example, it appends to a DataFrame once per record, which is costly because each append copies the whole DataFrame), but there is no particular need to optimize it at the moment. A possible improvement is to accumulate the rows in a list or a 2D numpy array and create the DataFrame only once, at the end.
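
The alternative mentioned in the note could look roughly like the sketch below: collect one row per record in a plain Python list and build the DataFrame a single time after the loop. This is only an illustration under assumed names; `rows`, `columns` and the toy values are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical column layout: two metadata columns, a validity flag, two features.
columns = ["ID", "Pulsating", "VALID_RECORD", "mean", "std"]

rows = []  # one list of values per record
for i in range(3):
    if i == 1:
        # Simulate a failed record: features are NaN and the flag is False.
        rows.append(["star_%d" % i, False, False, np.nan, np.nan])
    else:
        rows.append(["star_%d" % i, True, True, 1.0 * i, 0.5 * i])

# Build the DataFrame once, instead of appending inside the loop.
df = pd.DataFrame(rows, columns=columns)
```

Appending to a list does not copy the data already collected, so the cost per record stays roughly constant as the table grows.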

## Modules and configuration

### Modules

In [1]:
# Module import:
import warnings
import time

import pandas as pd
import numpy as np

from cesium.data_management import TimeSeries
from cesium.featurize import featurize_single_ts

### Configuration

In [2]:
NUM_VAL_DS = 20  # Number of validation datasets


#SYNTH_FILE = "../data/RV_DATASETS/RV_All_GTO_SyntheticDatasets.csv"
RV_DS_FOLDER = "../data/VAL_DATASETS/"

CS_FEATURES_FOLDER = "../data/DATASETS_CESIUM/"
OUT_DATASET_GEN_FILE = "cesium_VAL_DS<number>_Dataset.csv"

# LIST OF STAR METADATA TO ADD (DIFFERENT FOR EACH DATASET):
METADATA = {
    1: ['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase',
        'D1_Ps', 'D1_Tobs'],
    2: ['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase',
        'D1_Ps', 'D1_Tobs',
        'D2_noiseRV_mean', 'D2_noiseRV_median', 'D2_noiseRV_stdev'],
    3: ['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase',
        'D3_samplingRV_idx', 'D3_PsRV_mean', 'D3_PsRV_median', 'D3_PsRV_stdev', 'D3_NumRV'],
    4: ['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase',
        'D3_samplingRV_idx', 'D3_PsRV_mean', 'D3_PsRV_median', 'D3_PsRV_stdev', 'D3_NumRV',
        'D4_noiseRV_mean', 'D4_noiseRV_median', 'D4_noiseRV_stdev']
}

# A LIST OF ALL THE FEATURES CESIUM CAN EXTRACT (FOR REFERENCE PURPOSES)
ALL_CS_FEATURES = ['all_times_nhist_numpeaks',
                   'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin',
                   'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4',
                   'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4',
                   'all_times_nhist_peak_3_to_4',
                   'all_times_nhist_peak_val',
                   'avg_double_to_single_step', 'avg_err', 'avgt',
                   'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50',
                   'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000',
                   'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000',
                   'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000',
                   'cads_avg', 'cads_med', 'cads_std', 'mean',
                   'med_double_to_single_step', 'med_err',
                   'n_epochs', 'std_double_to_single_step', 'std_err',
                   'total_time', 'amplitude',
                   'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50',
                   'flux_percentile_ratio_mid65', 'flux_percentile_ratio_mid80',
                   'max_slope', 'maximum', 'median', 'median_absolute_deviation', 'minimum',
                   'percent_amplitude', 'percent_beyond_1_std', 'percent_close_to_median', 'percent_difference_flux_percentile',
                   'period_fast', 'qso_log_chi2_qsonu', 'qso_log_chi2nuNULL_chi2nu', 'skew', 'std',
                   'stetson_j', 'stetson_k', 'weighted_average', 'fold2P_slope_10percentile', 'fold2P_slope_90percentile',
                   'freq1_amplitude1', 'freq1_amplitude2', 'freq1_amplitude3', 'freq1_amplitude4',
                   'freq1_freq', 'freq1_lambda', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq1_signif',
                   'freq2_amplitude1', 'freq2_amplitude2', 'freq2_amplitude3', 'freq2_amplitude4',
                   'freq2_freq', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4',
                   'freq3_amplitude1', 'freq3_amplitude2', 'freq3_amplitude3', 'freq3_amplitude4',
                   'freq3_freq', 'freq3_rel_phase2', 'freq3_rel_phase3', 'freq3_rel_phase4',
                   'freq_amplitude_ratio_21', 'freq_amplitude_ratio_31',
                   'freq_frequency_ratio_21', 'freq_frequency_ratio_31',
                   'freq_model_max_delta_mags', 'freq_model_min_delta_mags', 'freq_model_phi1_phi2',
                   'freq_n_alias', 'freq_signif_ratio_21', 'freq_signif_ratio_31',
                   'freq_varrat', 'freq_y_offset', 'linear_trend', 'medperc90_2p_p',
                   'p2p_scatter_2praw', 'p2p_scatter_over_mad', 'p2p_scatter_pfold_over_mad', 'p2p_ssqr_diff_over_var',
                   'scatter_res_raw']
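
As a small illustration of the naming scheme, the `<number>` placeholder in `OUT_DATASET_GEN_FILE` is replaced in the extraction loop by the validation dataset index and the dataset number, joined by an underscore. The helper name `build_output_path` is hypothetical, and the two constants are repeated only so that the snippet runs on its own.

```python
CS_FEATURES_FOLDER = "../data/DATASETS_CESIUM/"
OUT_DATASET_GEN_FILE = "cesium_VAL_DS<number>_Dataset.csv"


def build_output_path(val_idx, ds):
    """Return the output CSV path for validation dataset `val_idx`, dataset `ds`."""
    tag = str(val_idx) + "_" + str(ds)
    return CS_FEATURES_FOLDER + OUT_DATASET_GEN_FILE.replace("<number>", tag)


print(build_output_path(0, 4))
# prints: ../data/DATASETS_CESIUM/cesium_VAL_DS0_4_Dataset.csv
```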


## Feature extraction with `cesium` for the synthetic RV curves of Datasets 4

In [4]:
# DISABLE WARNINGS:
warnings.filterwarnings('ignore')

#CS_FEATURES_FOLDER = "../data/DATASETS_CESIUM/"
#OUT_DATASET_GEN_FILE = "cesium_DS<number>_Dataset.csv"

#for j in range(0, 2): # TEST
for j in range(0, NUM_VAL_DS):
    print("Processing validation dataset %d" %j)
    ds_subfolder = "VAL_DS-" + str(j) + "/"
    # Load dataset info table:
    synth_file = RV_DS_FOLDER + ds_subfolder + "VAL_DS-" + str(j) + "_SynthDatasets.csv"
    synth = pd.read_csv(synth_file, sep=',', decimal='.')
    # Initialize features dataframes and metafeatures:
    df = {
        1: None,
        2: None,
        3: None,
        4: None
    }
    i0 = 0  # Index of the first record to process (0 = start from the beginning)

    # Batch processing:
    lapse_list = []
    median_lapse = None
    # Metadata columns to add for each dataset:
    metadata_idx = METADATA

    #for i in range(0, 3): # TEST
    for i in range(i0, len(synth)):
        start_time = time.time()
        if i % 100 == 0:
            print("Record: %d, started at %s..."
                  %(i, time.strftime('%d/%m/%Y, %H:%M:%S', time.localtime(start_time))))
            if median_lapse is None:
                print("Previous median lapse time: %s" %median_lapse)
            else:
                print("Previous median lapse time: %.2f seconds" %median_lapse)
        for ds in [4]: # Only DS4
        #for ds in [1, 2, 3, 4]:
            # For each dataset:
            # Get metafeatures values:
            metadata_values = list(synth.loc[i, metadata_idx[ds]])
            try:
                # load RV file:
                filename = synth.loc[i, 'ds' + str(ds) + '_file']
                rv = pd.read_csv(filename, sep=' ', decimal='.',
                                 names=['time', 'rv'])
                # Create TimeSeries object:
                ts = TimeSeries(t=rv['time'], m=rv['rv'])
                # Featurize the time series:
                cs = featurize_single_ts(ts, features_to_use=ALL_CS_FEATURES)
                # Join metadata and features for the dataframe:
                indices = metadata_idx[ds] + ['VALID_RECORD'] + list(cs.index.get_level_values('feature'))
                values = metadata_values + [True] + list(cs.values)
            except Exception as e:
                # An exception was found, mark the record as invalid and set the features to 'nan':
                print("***ERROR: some error happened in record %d, dataset %d, " \
                      "marking the record as invalid. Error: %s"
                      %(i, ds, str(e)))
                indices = metadata_idx[ds] + ['VALID_RECORD'] + ALL_CS_FEATURES
                # One NaN per feature (avoids hard-coding the number of features):
                values = metadata_values + [False] + [np.nan] * len(ALL_CS_FEATURES)
            if df[ds] is None:
                # Initialize DataFrame (with the first item):
                df[ds] = pd.DataFrame(data=[values], columns=indices)
            else:
                # Create a new DataFrame (with the new item):
                new_df = pd.DataFrame(data=[values], columns=indices)
                # Append the new dataframe to the existing one
                # (DataFrame.append was removed in pandas 2.0; pd.concat replaces it):
                df[ds] = pd.concat([df[ds], new_df], ignore_index=True)
            # UPDATE THE MEDIAN RECORD PROCESSING TIME:
            lapse = time.time() - start_time
            lapse_list.append(lapse)
            median_lapse = np.nanmedian(lapse_list)
            # Save the results:
            df[ds].to_csv(CS_FEATURES_FOLDER + OUT_DATASET_GEN_FILE.replace("<number>", str(j) + "_" + str(ds)),
                          sep=',', decimal='.', index=False)


Processing validation dataset 0
Record: 0, started at 04/05/2022, 17:35:50...
Previous median lapse time: None
Record: 100, started at 04/05/2022, 17:36:24...
Previous median lapse time: 0.15 seconds
Record: 200, started at 04/05/2022, 17:37:04...
Previous median lapse time: 0.16 seconds
Record: 300, started at 04/05/2022, 17:37:40...
Previous median lapse time: 0.17 seconds
Processing validation dataset 1
Record: 0, started at 04/05/2022, 17:38:06...
Previous median lapse time: None
Record: 100, started at 04/05/2022, 17:38:45...
Previous median lapse time: 0.17 seconds
Record: 200, started at 04/05/2022, 17:39:12...
Previous median lapse time: 0.16 seconds
Record: 300, started at 04/05/2022, 17:39:47...
Previous median lapse time: 0.16 seconds
Processing validation dataset 2
Record: 0, started at 04/05/2022, 17:40:28...
Previous median lapse time: None
Record: 100, started at 04/05/2022, 17:41:02...
Previous median lapse time: 0.18 seconds
Record: 200, started at 04/05/2022, 17:41:32
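
The error-handling pattern used in the loop above (keep the metadata, flag the record as invalid and fill the features with NaN) can be isolated in a small helper. The following is a minimal sketch, not the notebook's code: `build_row` is a hypothetical name, and `extract_features` is a hypothetical zero-argument callable standing in for the `cesium` call.

```python
import math


def build_row(metadata_values, feature_names, extract_features):
    """Return (values, is_valid) for one record.

    `extract_features` is a hypothetical callable returning one value per
    entry of `feature_names`; it stands in for the real featurization call.
    """
    try:
        features = list(extract_features())
        valid = True
    except Exception as exc:
        # Any failure marks the record as invalid and pads the features with NaN.
        print("Record marked as invalid: %s" % exc)
        features = [math.nan] * len(feature_names)
        valid = False
    return metadata_values + [valid] + features, valid


def failing():
    raise ValueError("empty RV file")


names = ["mean", "std"]
good, ok = build_row(["star_0"], names, lambda: [1.0, 0.1])
bad, ok_bad = build_row(["star_1"], names, failing)
```

Both rows have the same length, so valid and invalid records can share one table layout.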

### Next steps (to be executed only if the cell execution was interrupted by the user)

For example, if the user interrupts the cell execution because it got stuck on some record, the next cells update the information for that record and mark it as an invalid record.

Afterwards, the loop (previous cell) can be executed again and it will start from the record following the problematic one.
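
A possible shape for that recovery step is sketched below. It assumes that the partial results are available as a DataFrame and that the loop's start index `i0` is then set by hand; `mark_record_invalid` and the toy data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def mark_record_invalid(df, metadata_values, metadata_cols, feature_cols):
    """Return a new frame with a NaN-filled row flagged VALID_RECORD=False appended."""
    columns = metadata_cols + ["VALID_RECORD"] + feature_cols
    values = metadata_values + [False] + [np.nan] * len(feature_cols)
    new_row = pd.DataFrame([values], columns=columns)
    return pd.concat([df, new_row], ignore_index=True)


# Toy partial result: two records were processed before the interruption.
partial = pd.DataFrame(
    [["star_0", True, 1.0], ["star_1", True, 2.0]],
    columns=["ID", "VALID_RECORD", "mean"],
)
# The third record (index 2) is the one that got stuck.
partial = mark_record_invalid(partial, ["star_2"], ["ID"], ["mean"])

# The loop would then restart at the record after the problematic one.
i0 = len(partial)
```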

## Review the records with errors
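
One way to carry out such a review, shown here as a hedged sketch on toy data rather than on the real output files, is to filter on the `VALID_RECORD` column written by the extraction loop. The `features` table below is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy features table; in practice it would be read from one of the output CSVs.
features = pd.DataFrame({
    "ID": ["star_0", "star_1", "star_2"],
    "VALID_RECORD": [True, False, True],
    "mean": [1.0, np.nan, 3.0],
})

# Keep only the records that were flagged as invalid.
invalid = features[~features["VALID_RECORD"]]
print("Invalid records: %d of %d" % (len(invalid), len(features)))
print(invalid["ID"].tolist())
```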

## Summary

**CONCLUSIONS:**
- Completed the `cesium` feature extraction of the DS4 synthetic datasets.
- No problems were found in the calculations.