# ALTERNATIVE ML MODEL - 1-NN WITH ALTERNATIVE S4B AND ALT_S4B SAMPLES

This notebook tests an alternative ML classifier: a 1-NN classifier trained on the ALT_S4B sample (37,280 stars) and validated against the S4B sample (4,000 stars).

- Training set: ALT_S4B sample, 37,280 stars. From each CARMENES star we extract the sampling pattern and noise characteristics. We then create 80 non-pulsating stars and 80 pulsating stars based on that CARMENES star, with the pulsating stars covering every point of the following parameter grid of amplitudes (A), frequencies (f), and phases (p): A = \[0.1, 0.2, 0.4, 0.8, 1.6\] $m\;s^{-1}$ x f = \[8.0, 16.0, 32.0, 64.0\] $d^{-1}$ x p = \[0.00, 0.25, 0.50, 0.75\]
- Validation set: S4B sample. 4,000 stars, created with random amplitude, frequency, phase and reference epoch, and incorporating the sampling pattern and noise characteristics of a random CARMENES star (both characteristics taken from the same random star).
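As a quick sanity check of the training-set size, the grid quoted above can be enumerated directly. This is only a sketch: the variable names are illustrative, and the number of CARMENES stars is not stated in the text but inferred from 37,280 stars at 160 synthetic stars per source star.

```python
from itertools import product

# Parameter grid quoted in the training-set description above.
amplitudes = [0.1, 0.2, 0.4, 0.8, 1.6]    # m/s
frequencies = [8.0, 16.0, 32.0, 64.0]     # 1/d
phases = [0.00, 0.25, 0.50, 0.75]

grid = list(product(amplitudes, frequencies, phases))
n_pulsating = len(grid)                    # one pulsating star per grid point
n_per_star = 2 * n_pulsating               # plus as many non-pulsating stars
n_carmenes = 37280 // n_per_star           # implied number of CARMENES stars (inferred)

print(n_pulsating, n_per_star, n_carmenes)
```

The grid has 5 x 4 x 4 = 80 points, which matches the 80 pulsating stars per CARMENES star described above.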

Then, a 1-NN classifier is trained on the training set (ALT_S4B) and applied to the validation set (S4B), which gives us an idea of the classifier performance. With this, we can estimate what to expect when the classifier is applied to the CARMENES stars.

**Reasoning:** with this method the synthetic samples are much more under our control, and it is also easier to gain insight into which amplitudes, frequencies or phases can be detected.

Another possibility is:
- Instead of designing just one overall classifier, take each individual CARMENES star and find the synthetic star closest to it among its siblings.
- And, instead of using the 1-NN classifier for this, we could simply compare the CARMENES time series directly with each of its siblings, i.e. without using the _cesium_ features, to search for the closest one.
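The direct comparison described in the second bullet could look roughly like the sketch below. It is not part of this notebook's pipeline: the function `closest_sibling`, the toy curves and the choice of an L1 distance are all assumptions made for illustration, and it presumes that every sibling shares the sampling times of the target star.

```python
import numpy as np

def closest_sibling(rv_target, rv_siblings):
    """Return (index of the closest sibling, L1 distances to all siblings)."""
    # rv_target: shape (n_epochs,); rv_siblings: shape (n_siblings, n_epochs).
    # Assumes all curves are sampled at the same epochs (same source star).
    distances = np.abs(rv_siblings - rv_target).sum(axis=1)
    return int(np.argmin(distances)), distances

# Toy data (hypothetical): three siblings sampled at the same 50 epochs.
rng = np.random.default_rng(11)
t = np.sort(rng.uniform(0.0, 1.0, size=50))           # days
siblings = np.stack([
    np.zeros_like(t),                                  # non-pulsating
    0.8 * np.sin(2 * np.pi * 8.0 * t),                 # A = 0.8 m/s, f = 8 1/d
    1.6 * np.sin(2 * np.pi * 16.0 * t),                # A = 1.6 m/s, f = 16 1/d
])
# Target: a noisy copy of sibling 1 (noise well below the amplitude).
target = siblings[1] + rng.normal(0.0, 0.05, size=t.size)

best, dist = closest_sibling(target, siblings)
print(best, dist)
```

In this toy case the noise is far smaller than the pulsation amplitude, so the match is easy; with noise comparable to the amplitude the distances to different siblings would become much harder to tell apart.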


## Modules and configuration

### Modules

In [71]:
import pandas as pd
import numpy as np

import pickle

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, \
    f1_score, log_loss, matthews_corrcoef, classification_report, \
    get_scorer_names, confusion_matrix

from sklearn.model_selection import train_test_split

### Configuration

In [56]:
RANDOM_STATE = 11 # For reproducibility

TRAIN_S4B_IN = "../data/DATASETS_ML/1NN_TRAIN_S4B_02_DS_AfterImputing.csv"
VAL_S4B_IN = "../data/DATASETS_ML/1NN_VAL_S4B_02_DS_AfterImputing.csv"

REL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Reliable_features.pickle"
#UNREL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Unreliable_features.pickle"



## Load data

### Load reliable feature list

In [3]:
# Use a context manager so the file handle is closed after loading
with open(REL_FEATURES_IN, 'rb') as f:
    rel_features = pickle.load(f)
print(rel_features)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_rel_phase3', '

### Load TRAINING S4 data

In [4]:
s4_tr = pd.read_csv(TRAIN_S4B_IN, sep=',', decimal='.')
s4_tr

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,True,8.0,0.1,0.0,0.0,0.00,0,J23505-095,0.205178,...,-0.908970,0.541171,-0.669931,-0.015526,-0.019047,-0.007352,-0.442166,-0.134743,-0.219987,0.356795
1,ALT-B_Star-00001,False,0.0,0.0,0.0,0.0,0.00,0,J23505-095,0.205178,...,-0.725712,0.832929,0.631644,-0.009818,-0.021695,0.552345,-0.429826,0.007825,-0.005592,0.720584
2,ALT-B_Star-00002,True,8.0,0.1,0.0,0.0,0.25,0,J23505-095,0.205178,...,-0.411204,0.857062,0.541050,-0.005791,-0.035445,0.013431,-0.327874,-0.179274,-0.090747,0.234164
3,ALT-B_Star-00003,False,0.0,0.0,0.0,0.0,0.00,0,J23505-095,0.205178,...,-0.022857,1.028396,-0.061844,-0.002265,-0.031985,0.275371,-0.217636,0.168502,0.170924,0.373533
4,ALT-B_Star-00004,True,8.0,0.1,0.0,0.0,0.50,0,J23505-095,0.205178,...,-0.272705,0.962090,-0.223832,-0.008516,-0.043389,-0.593592,0.261842,-0.171281,0.590502,1.231367
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37275,ALT-B_Star-37275,False,0.0,0.0,0.0,0.0,0.00,232,J00051+457,0.830097,...,-0.394117,0.405540,1.174777,-0.032514,-0.030659,-0.363602,-0.042368,-0.039153,0.551196,0.088020
37276,ALT-B_Star-37276,True,64.0,1.6,0.0,0.0,0.50,232,J00051+457,0.830097,...,0.288502,0.461546,0.193656,-0.023099,-0.028984,0.259205,-0.075329,-0.205573,-0.414141,0.392452
37277,ALT-B_Star-37277,False,0.0,0.0,0.0,0.0,0.00,232,J00051+457,0.830097,...,-0.876158,0.530909,0.215261,-0.027492,-0.033688,0.460585,-0.247941,-0.138908,-0.166431,0.054261
37278,ALT-B_Star-37278,True,64.0,1.6,0.0,0.0,0.75,232,J00051+457,0.830097,...,-0.487701,0.343003,-1.656181,-0.023002,-0.000361,0.384265,-0.588162,-0.164535,-0.621599,0.013055


#### Set `Pulsating` field to `0` / `1`

In [5]:
s4_tr['Pulsating'] = s4_tr['Pulsating'].map(lambda x: 1 if x == True else 0)
s4_tr

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,ALT-B_Star-00000,1,8.0,0.1,0.0,0.0,0.00,0,J23505-095,0.205178,...,-0.908970,0.541171,-0.669931,-0.015526,-0.019047,-0.007352,-0.442166,-0.134743,-0.219987,0.356795
1,ALT-B_Star-00001,0,0.0,0.0,0.0,0.0,0.00,0,J23505-095,0.205178,...,-0.725712,0.832929,0.631644,-0.009818,-0.021695,0.552345,-0.429826,0.007825,-0.005592,0.720584
2,ALT-B_Star-00002,1,8.0,0.1,0.0,0.0,0.25,0,J23505-095,0.205178,...,-0.411204,0.857062,0.541050,-0.005791,-0.035445,0.013431,-0.327874,-0.179274,-0.090747,0.234164
3,ALT-B_Star-00003,0,0.0,0.0,0.0,0.0,0.00,0,J23505-095,0.205178,...,-0.022857,1.028396,-0.061844,-0.002265,-0.031985,0.275371,-0.217636,0.168502,0.170924,0.373533
4,ALT-B_Star-00004,1,8.0,0.1,0.0,0.0,0.50,0,J23505-095,0.205178,...,-0.272705,0.962090,-0.223832,-0.008516,-0.043389,-0.593592,0.261842,-0.171281,0.590502,1.231367
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37275,ALT-B_Star-37275,0,0.0,0.0,0.0,0.0,0.00,232,J00051+457,0.830097,...,-0.394117,0.405540,1.174777,-0.032514,-0.030659,-0.363602,-0.042368,-0.039153,0.551196,0.088020
37276,ALT-B_Star-37276,1,64.0,1.6,0.0,0.0,0.50,232,J00051+457,0.830097,...,0.288502,0.461546,0.193656,-0.023099,-0.028984,0.259205,-0.075329,-0.205573,-0.414141,0.392452
37277,ALT-B_Star-37277,0,0.0,0.0,0.0,0.0,0.00,232,J00051+457,0.830097,...,-0.876158,0.530909,0.215261,-0.027492,-0.033688,0.460585,-0.247941,-0.138908,-0.166431,0.054261
37278,ALT-B_Star-37278,1,64.0,1.6,0.0,0.0,0.75,232,J00051+457,0.830097,...,-0.487701,0.343003,-1.656181,-0.023002,-0.000361,0.384265,-0.588162,-0.164535,-0.621599,0.013055


In [6]:
print(list(s4_tr.columns))

['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 'CARMENES_source_idx', 'CARMENES_Ref_star', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude

### Load VALIDATION S4 data

In [7]:
s4_val = pd.read_csv(VAL_S4B_IN, sep=',', decimal='.')
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,False,0.00,0.00,0.0,2.457432e+06,0.00,116,J11511+352,0.517637,...,-0.134573,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807
1,B_Star-00001,False,0.00,0.00,0.0,2.457487e+06,0.00,29,J20336+617,-0.419742,...,0.125231,0.579019,1.116863,0.000807,-0.002270,-0.173628,0.037728,-0.114574,0.549077,-0.031212
2,B_Star-00002,False,0.00,0.00,0.0,2.457417e+06,0.00,156,J08402+314,0.517637,...,1.350668,-0.929902,-0.047202,0.016980,0.060763,0.030306,1.365530,0.577156,0.547318,-0.942817
3,B_Star-00003,False,0.00,0.00,0.0,2.457431e+06,0.00,180,J05421+124,0.830097,...,-0.058875,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605
4,B_Star-00004,False,0.00,0.00,0.0,2.461026e+06,0.00,67,J17052-050,0.830097,...,-0.309842,0.665012,-0.000102,-0.008592,-0.002190,-0.588534,0.493010,-0.205573,0.753534,0.369862
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3988,B_Star-03995,False,0.00,0.00,0.0,2.459911e+06,0.00,53,J18165+048,-0.732202,...,-0.754672,0.189341,0.511304,0.017560,-0.040668,1.209315,0.189618,-0.023871,-1.992779,0.480138
3989,B_Star-03996,False,0.00,0.00,0.0,2.457428e+06,0.00,8,J23216+172,-1.669582,...,0.936833,1.083443,-0.583714,-0.005475,-0.009577,0.188601,-0.225531,-0.068249,-0.300657,0.228148
3990,B_Star-03997,False,0.00,0.00,0.0,2.458409e+06,0.00,3,J23419+441,0.205178,...,0.416565,1.183910,-0.122027,-0.004456,-0.023344,-0.211972,-0.221623,-0.018265,0.051901,1.076491
3991,B_Star-03998,False,0.00,0.00,0.0,2.457468e+06,0.00,181,J05415+534,0.205178,...,0.272003,1.404428,-0.099327,-0.007536,-0.028463,-0.020467,-0.110814,-0.065574,0.194104,0.845764


#### Set `Pulsating` field to `0` / `1`

In [8]:
s4_val['Pulsating'] = s4_val['Pulsating'].map(lambda x: 1 if x == True else 0)
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,0,0.00,0.00,0.0,2.457432e+06,0.00,116,J11511+352,0.517637,...,-0.134573,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807
1,B_Star-00001,0,0.00,0.00,0.0,2.457487e+06,0.00,29,J20336+617,-0.419742,...,0.125231,0.579019,1.116863,0.000807,-0.002270,-0.173628,0.037728,-0.114574,0.549077,-0.031212
2,B_Star-00002,0,0.00,0.00,0.0,2.457417e+06,0.00,156,J08402+314,0.517637,...,1.350668,-0.929902,-0.047202,0.016980,0.060763,0.030306,1.365530,0.577156,0.547318,-0.942817
3,B_Star-00003,0,0.00,0.00,0.0,2.457431e+06,0.00,180,J05421+124,0.830097,...,-0.058875,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605
4,B_Star-00004,0,0.00,0.00,0.0,2.461026e+06,0.00,67,J17052-050,0.830097,...,-0.309842,0.665012,-0.000102,-0.008592,-0.002190,-0.588534,0.493010,-0.205573,0.753534,0.369862
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3988,B_Star-03995,0,0.00,0.00,0.0,2.459911e+06,0.00,53,J18165+048,-0.732202,...,-0.754672,0.189341,0.511304,0.017560,-0.040668,1.209315,0.189618,-0.023871,-1.992779,0.480138
3989,B_Star-03996,0,0.00,0.00,0.0,2.457428e+06,0.00,8,J23216+172,-1.669582,...,0.936833,1.083443,-0.583714,-0.005475,-0.009577,0.188601,-0.225531,-0.068249,-0.300657,0.228148
3990,B_Star-03997,0,0.00,0.00,0.0,2.458409e+06,0.00,3,J23419+441,0.205178,...,0.416565,1.183910,-0.122027,-0.004456,-0.023344,-0.211972,-0.221623,-0.018265,0.051901,1.076491
3991,B_Star-03998,0,0.00,0.00,0.0,2.457468e+06,0.00,181,J05415+534,0.205178,...,0.272003,1.404428,-0.099327,-0.007536,-0.028463,-0.020467,-0.110814,-0.065574,0.194104,0.845764


In [9]:
print(list(s4_val.columns))

['ID', 'Pulsating', 'frequency', 'amplitudeRV', 'offsetRV', 'refepochRV', 'phase', 'CARMENES_source_idx', 'CARMENES_Ref_star', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'avg_err', 'avgt', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'mean', 'med_double_to_single_step', 'med_err', 'n_epochs', 'std_double_to_single_step', 'std_err', 'total_time', 'amplitude

## Train a 1-NN classifier

In [45]:
clf = KNeighborsClassifier(n_neighbors=1, p=1)
clf

### Fit with TRAINING S4 sample

In [46]:
clf.fit(s4_tr[rel_features], s4_tr['Pulsating'])

### Measure performance in TRAINING S4 sample

In [47]:
s4_tr_true = s4_tr['Pulsating']

In [48]:
s4_tr_pred = clf.predict(s4_tr[rel_features])

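**NOTE:** performance should be perfect here: with a 1-NN classifier, the nearest neighbour of every training star is the star itself (distance zero), so each star gets its own label back. This says nothing about how well the classifier generalizes.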
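The reason a 1-NN classifier scores perfectly on its own training data can be shown with a small stand-in. The sketch below is a brute-force NumPy re-implementation written for illustration only (it is not scikit-learn's code), with the helper `predict_1nn` and the random toy data as assumptions. Even with labels that are pure noise, the training accuracy is 1.0.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_query):
    """Brute-force 1-NN with Manhattan (p=1) distance."""
    # Pairwise L1 distances, shape (n_query, n_train).
    d = np.abs(X_query[:, None, :] - X_train[None, :, :]).sum(axis=2)
    # Each query takes the label of its single closest training point.
    return y_train[np.argmin(d, axis=1)]

rng = np.random.default_rng(11)
X = rng.normal(size=(200, 5))            # random features, no real structure
y = rng.integers(0, 2, size=200)         # random labels, unrelated to X

# Predicting the training points themselves: each one is its own neighbour.
train_acc = (predict_1nn(X, y, X) == y).mean()
print(train_acc)
```

A perfect training score is therefore a property of the method, not evidence that the features separate the classes.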

In [49]:
print(confusion_matrix(y_true=s4_tr_true, y_pred=s4_tr_pred))

[[18640     0]
 [    0 18640]]


In [50]:
print(classification_report(y_true=s4_tr_true, y_pred=s4_tr_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     18640
           1       1.00      1.00      1.00     18640

    accuracy                           1.00     37280
   macro avg       1.00      1.00      1.00     37280
weighted avg       1.00      1.00      1.00     37280



## Predict VALIDATION S4 sample

In [51]:
s4_val_true = s4_val['Pulsating']

In [52]:
s4_val_pred = clf.predict(s4_val[rel_features])

### Measure performance

In [53]:
print(confusion_matrix(y_true=s4_val_true, y_pred=s4_val_pred))

[[1819 1783]
 [ 206  185]]


In [54]:
print(classification_report(y_true=s4_val_true, y_pred=s4_val_pred))

              precision    recall  f1-score   support

           0       0.90      0.50      0.65      3602
           1       0.09      0.47      0.16       391

    accuracy                           0.50      3993
   macro avg       0.50      0.49      0.40      3993
weighted avg       0.82      0.50      0.60      3993



<font color='red'>**RESULTS ARE STILL BAD**</font>
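To put the numbers above in context, the usual metrics can be recomputed by hand from the printed confusion matrix and compared with the fraction of pulsating stars in the validation sample. The sketch below uses only the four counts shown above; the variable names are illustrative.

```python
import math

# Confusion matrix printed above: rows = true class, columns = predicted class.
tn, fp = 1819, 1783
fn, tp = 206, 185

total = tn + fp + fn + tp
prevalence = (tp + fn) / total                 # fraction of pulsating stars
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)                        # true positive rate
specificity = tn / (tn + fp)                   # true negative rate
balanced_accuracy = (recall + specificity) / 2
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"prevalence={prevalence:.3f} precision={precision:.3f}")
print(f"balanced_accuracy={balanced_accuracy:.3f} mcc={mcc:.3f}")
```

The precision for the pulsating class is close to the prevalence of pulsating stars, the balanced accuracy is close to 0.5 and the Matthews correlation coefficient is close to zero, which is what a classifier assigning labels at random would be expected to produce.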

## Play only with TRAINING S4 sample

The idea is to separate the TRAINING S4 sample into training and validation sets.

### Train / test split

In [58]:
X_train, X_test, y_train, y_test = train_test_split(
    s4_tr[rel_features], s4_tr['Pulsating'], test_size=0.25, random_state=RANDOM_STATE)

In [59]:
X_train

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq1_rel_phase2,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2
6251,-0.419742,-0.641184,1.164304,-0.483721,-1.779174,-0.638288,-0.731001,-0.430075,-0.475162,-0.061316,...,0.470717,1.444513,-1.619419,-0.775016,-1.213945,0.525860,-0.336030,-0.283972,-0.581092,-0.851403
3823,-1.357122,1.911379,-0.580009,1.408443,0.529001,-0.539599,-0.006810,2.085322,0.634082,3.921359,...,1.201811,-1.587610,-0.685743,0.198864,1.009546,0.581340,0.980758,-0.358097,0.036167,0.208330
26163,-0.107282,1.698665,-0.428329,0.197458,1.452272,-0.504558,-0.561468,-0.562464,-0.329112,-0.412405,...,-0.931495,-0.805541,0.546852,-0.049608,-0.833383,-0.844392,-1.013059,0.812555,-1.263493,0.350246
10131,0.830097,1.556856,-0.731688,0.273144,0.298184,-0.172259,-0.492354,-0.650724,-0.528415,-0.752524,...,1.511909,1.283164,-0.366209,1.346654,-0.966745,1.458410,1.689501,0.581498,-1.133650,0.431541
16257,-1.669582,1.698665,2.150220,-1.391960,0.065769,0.060755,0.850034,0.025824,0.909410,0.411752,...,-0.201084,1.589649,1.510891,1.579019,-1.346943,1.009694,0.434195,1.411591,1.523785,-0.618738
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32081,0.517637,-0.499375,-0.504169,0.046085,-1.779174,-0.363664,-0.566932,-0.647193,-0.475162,-0.641930,...,-0.188166,1.504825,0.289116,1.133467,1.189776,-0.311121,-1.313601,-1.485219,0.452718,-1.489118
7259,1.142557,0.422384,1.998540,0.727264,-1.702235,-0.538425,-0.467947,-0.474204,-0.136281,-0.240518,...,-1.083901,1.240981,-1.529864,-0.586108,-0.064865,0.720048,-0.334110,-1.151119,-1.359780,0.051597
21584,-0.107282,0.564193,-0.731688,1.786876,-0.932843,-0.079053,-0.156757,0.496651,-0.168959,0.629892,...,-0.740913,-0.828039,0.949395,1.515150,1.470903,0.355586,1.480949,1.464846,-0.652079,-0.162682
36543,1.142557,-0.782993,-0.731688,-1.164901,-1.779174,-0.638288,-0.632154,-0.729271,-0.298703,-0.582020,...,1.423179,-1.455470,0.409953,0.758012,1.269449,1.427370,0.421573,0.299977,-0.155388,1.421975


In [60]:
X_test

Unnamed: 0,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,all_times_nhist_peak_2_to_4,...,freq1_rel_phase2,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2
14593,1.455017,-1.279324,-0.959207,-1.694707,0.605940,-0.438561,-0.277572,-0.509508,0.046715,-0.383879,...,1.126917,-1.405911,-0.715785,0.057072,0.684620,0.969989,-0.298310,-1.389999,-1.128451,0.108514
31567,-1.669582,1.202334,0.026709,1.862563,1.760028,0.759799,0.044601,-0.827243,-0.528415,-0.752524,...,-0.520804,-0.254642,-1.570554,-1.646547,1.288711,0.380138,-1.655664,1.263372,0.383331,-1.637511
37148,0.830097,-1.137515,0.330067,-0.180975,-1.548357,-0.018882,-0.195827,0.015235,-0.270894,-0.019424,...,-0.784986,-0.186787,1.416679,-1.224422,-0.388261,-0.489972,0.056282,-0.293078,1.498969,-0.277611
2161,0.830097,-0.782993,-0.200810,-0.332348,1.067576,3.452410,1.595806,1.316203,-0.528415,-0.489206,...,1.096592,-0.411749,-0.961004,0.704154,-0.886104,0.694836,-1.020631,1.070909,0.541768,0.256799
31928,-1.044662,-0.782993,-0.580009,0.878637,1.221454,-0.638288,-0.760832,-0.827243,-0.528415,-0.752524,...,-1.721346,-0.006808,-1.730393,1.470101,-0.559932,-1.538848,-1.577593,1.544863,1.574900,2.100883
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18691,0.517637,-1.137515,-0.580009,-1.089214,0.144305,0.266356,-0.199470,-0.372978,-0.484844,-0.634592,...,-0.426338,-0.579885,0.317304,0.805193,0.545774,1.444448,-1.079305,-0.347010,-1.717975,-0.221826
6244,-0.419742,-0.641184,1.164304,-0.483721,-1.779174,-0.638288,-0.731001,-0.430075,-0.475162,-0.061316,...,-1.019752,-0.493274,1.538604,0.004575,1.170522,-1.602590,1.481350,-1.151206,-0.644879,0.149955
16102,0.517637,-0.712088,-0.352490,-0.029602,-1.009783,-0.045279,-0.008403,-0.208199,-0.013486,-0.270526,...,-1.209915,-0.396115,-1.123548,1.253577,-1.246568,1.027455,-0.407534,0.145617,1.557368,0.194906
36891,1.455017,-0.215757,0.178388,0.727264,-1.009783,-0.638288,-0.645770,-0.650724,-0.323011,-0.445320,...,0.442024,1.214635,0.944415,1.679382,1.211632,0.916082,1.711454,-1.268229,0.257459,-0.040189


In [61]:
y_train

6251     0
3823     0
26163    0
10131    0
16257    0
        ..
32081    0
7259     0
21584    1
36543    0
10137    0
Name: Pulsating, Length: 27960, dtype: int64

In [62]:
y_test

14593    0
31567    0
37148    1
2161     0
31928    1
        ..
18691    0
6244     1
16102    1
36891    0
2794     1
Name: Pulsating, Length: 9320, dtype: int64

### Fit classifier on the train set

In [63]:
clf = KNeighborsClassifier(n_neighbors=1, p=1)
clf

In [64]:
clf.fit(X_train, y_train)

### Predict labels in test set

In [65]:
y_true = y_test

In [66]:
y_pred = clf.predict(X_test)

### Performance on the test set

In [67]:
print(confusion_matrix(y_true=y_true, y_pred=y_pred))

[[2212 2444]
 [2372 2292]]


In [68]:
print(classification_report(y_true=y_true, y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.48      0.48      0.48      4656
           1       0.48      0.49      0.49      4664

    accuracy                           0.48      9320
   macro avg       0.48      0.48      0.48      9320
weighted avg       0.48      0.48      0.48      9320



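**OBSERVATION:** This is interesting. Over the test set the errors are balanced between both classes, but precision, recall and accuracy are all about 0.48, i.e. at the level of random guessing for a sample with balanced classes.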
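A rough way to judge the test-set accuracy is to compare it with what a coin-flip classifier would achieve on the same number of stars. The sketch below uses the confusion matrix printed above and a normal approximation to the binomial distribution. It treats the test stars as independent, which is an assumption (siblings created from the same CARMENES star share their sampling pattern), so the z-score is only indicative.

```python
import math

# Confusion matrix of the train/test split printed above.
tn, fp = 2212, 2444
fn, tp = 2372, 2292

n = tn + fp + fn + tp
accuracy = (tn + tp) / n
# Standard error of the accuracy of a coin-flip classifier on n samples.
se_chance = math.sqrt(0.5 * 0.5 / n)
z = (accuracy - 0.5) / se_chance

print(f"n={n} accuracy={accuracy:.4f} z={z:.2f}")
```

The accuracy does not exceed 0.5, so on this split the classifier shows no ability to separate pulsating from non-pulsating stars.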

## Use one different classifier for each CARMENES star

We now try a different approach: for each synthetic star in the VALIDATION S4 sample, we get the source CARMENES star. Then we retrieve all the synthetic stars in the TRAINING S4 sample created from that CARMENES star (160 stars: 80 pulsating and 80 non-pulsating). We then use those 160 TRAINING S4 stars to fit a 1-NN classifier, which in turn we use to predict the label of the VALIDATION S4 star under analysis.

### Training and prediction

In [69]:
s4_val_w_pred = s4_val.copy()
s4_val

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,B_Star-00000,0,0.00,0.00,0.0,2.457432e+06,0.00,116,J11511+352,0.517637,...,-0.134573,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807
1,B_Star-00001,0,0.00,0.00,0.0,2.457487e+06,0.00,29,J20336+617,-0.419742,...,0.125231,0.579019,1.116863,0.000807,-0.002270,-0.173628,0.037728,-0.114574,0.549077,-0.031212
2,B_Star-00002,0,0.00,0.00,0.0,2.457417e+06,0.00,156,J08402+314,0.517637,...,1.350668,-0.929902,-0.047202,0.016980,0.060763,0.030306,1.365530,0.577156,0.547318,-0.942817
3,B_Star-00003,0,0.00,0.00,0.0,2.457431e+06,0.00,180,J05421+124,0.830097,...,-0.058875,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605
4,B_Star-00004,0,0.00,0.00,0.0,2.461026e+06,0.00,67,J17052-050,0.830097,...,-0.309842,0.665012,-0.000102,-0.008592,-0.002190,-0.588534,0.493010,-0.205573,0.753534,0.369862
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3988,B_Star-03995,0,0.00,0.00,0.0,2.459911e+06,0.00,53,J18165+048,-0.732202,...,-0.754672,0.189341,0.511304,0.017560,-0.040668,1.209315,0.189618,-0.023871,-1.992779,0.480138
3989,B_Star-03996,0,0.00,0.00,0.0,2.457428e+06,0.00,8,J23216+172,-1.669582,...,0.936833,1.083443,-0.583714,-0.005475,-0.009577,0.188601,-0.225531,-0.068249,-0.300657,0.228148
3990,B_Star-03997,0,0.00,0.00,0.0,2.458409e+06,0.00,3,J23419+441,0.205178,...,0.416565,1.183910,-0.122027,-0.004456,-0.023344,-0.211972,-0.221623,-0.018265,0.051901,1.076491
3991,B_Star-03998,0,0.00,0.00,0.0,2.457468e+06,0.00,181,J05415+534,0.205178,...,0.272003,1.404428,-0.099327,-0.007536,-0.028463,-0.020467,-0.110814,-0.065574,0.194104,0.845764


In [87]:
s4_val_w_pred['pred_multiclf'] = -1
s4_val_w_pred

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw,pred_multiclf
0,B_Star-00000,0,0.00,0.00,0.0,2.457432e+06,0.00,116,J11511+352,0.517637,...,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807,-1
1,B_Star-00001,0,0.00,0.00,0.0,2.457487e+06,0.00,29,J20336+617,-0.419742,...,0.579019,1.116863,0.000807,-0.002270,-0.173628,0.037728,-0.114574,0.549077,-0.031212,-1
2,B_Star-00002,0,0.00,0.00,0.0,2.457417e+06,0.00,156,J08402+314,0.517637,...,-0.929902,-0.047202,0.016980,0.060763,0.030306,1.365530,0.577156,0.547318,-0.942817,-1
3,B_Star-00003,0,0.00,0.00,0.0,2.457431e+06,0.00,180,J05421+124,0.830097,...,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605,-1
4,B_Star-00004,0,0.00,0.00,0.0,2.461026e+06,0.00,67,J17052-050,0.830097,...,0.665012,-0.000102,-0.008592,-0.002190,-0.588534,0.493010,-0.205573,0.753534,0.369862,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3988,B_Star-03995,0,0.00,0.00,0.0,2.459911e+06,0.00,53,J18165+048,-0.732202,...,0.189341,0.511304,0.017560,-0.040668,1.209315,0.189618,-0.023871,-1.992779,0.480138,-1
3989,B_Star-03996,0,0.00,0.00,0.0,2.457428e+06,0.00,8,J23216+172,-1.669582,...,1.083443,-0.583714,-0.005475,-0.009577,0.188601,-0.225531,-0.068249,-0.300657,0.228148,-1
3990,B_Star-03997,0,0.00,0.00,0.0,2.458409e+06,0.00,3,J23419+441,0.205178,...,1.183910,-0.122027,-0.004456,-0.023344,-0.211972,-0.221623,-0.018265,0.051901,1.076491,-1
3991,B_Star-03998,0,0.00,0.00,0.0,2.457468e+06,0.00,181,J05415+534,0.205178,...,1.404428,-0.099327,-0.007536,-0.028463,-0.020467,-0.110814,-0.065574,0.194104,0.845764,-1


In [92]:
for i in s4_val_w_pred.index:
    # Get the source CARMENES star of this VALIDATION star:
    karmn_id = s4_val_w_pred.loc[i, 'CARMENES_Ref_star']
    # Extract the TRAINING samples created from that star:
    siblings = s4_tr['CARMENES_Ref_star'] == karmn_id
    X_train = s4_tr.loc[siblings, rel_features]
    y_train = s4_tr.loc[siblings, 'Pulsating']
    # Fit a classifier on those TRAINING samples:
    clf = KNeighborsClassifier(n_neighbors=1, p=1)
    clf.fit(X_train, y_train)
    # Predict the class of the VALIDATION star
    # (a one-row DataFrame keeps the column names and numeric dtypes):
    X_test = s4_val.loc[[i], rel_features]
    # predict() returns an array; store its single element as a scalar:
    s4_val_w_pred.loc[i, 'pred_multiclf'] = clf.predict(X_test)[0]


In [93]:
s4_val_w_pred

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,CARMENES_source_idx,CARMENES_Ref_star,all_times_nhist_numpeaks,...,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw,pred_multiclf
0,B_Star-00000,0,0.00,0.00,0.0,2.457432e+06,0.00,116,J11511+352,0.517637,...,1.418486,0.042013,-0.009545,-0.025562,0.142376,-0.263196,0.251708,-0.038958,1.237807,0
1,B_Star-00001,0,0.00,0.00,0.0,2.457487e+06,0.00,29,J20336+617,-0.419742,...,0.579019,1.116863,0.000807,-0.002270,-0.173628,0.037728,-0.114574,0.549077,-0.031212,0
2,B_Star-00002,0,0.00,0.00,0.0,2.457417e+06,0.00,156,J08402+314,0.517637,...,-0.929902,-0.047202,0.016980,0.060763,0.030306,1.365530,0.577156,0.547318,-0.942817,0
3,B_Star-00003,0,0.00,0.00,0.0,2.457431e+06,0.00,180,J05421+124,0.830097,...,0.312563,0.586087,-0.005898,-0.013044,-0.553518,0.269385,-0.201648,0.506757,0.267605,1
4,B_Star-00004,0,0.00,0.00,0.0,2.461026e+06,0.00,67,J17052-050,0.830097,...,0.665012,-0.000102,-0.008592,-0.002190,-0.588534,0.493010,-0.205573,0.753534,0.369862,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3988,B_Star-03995,0,0.00,0.00,0.0,2.459911e+06,0.00,53,J18165+048,-0.732202,...,0.189341,0.511304,0.017560,-0.040668,1.209315,0.189618,-0.023871,-1.992779,0.480138,0
3989,B_Star-03996,0,0.00,0.00,0.0,2.457428e+06,0.00,8,J23216+172,-1.669582,...,1.083443,-0.583714,-0.005475,-0.009577,0.188601,-0.225531,-0.068249,-0.300657,0.228148,0
3990,B_Star-03997,0,0.00,0.00,0.0,2.458409e+06,0.00,3,J23419+441,0.205178,...,1.183910,-0.122027,-0.004456,-0.023344,-0.211972,-0.221623,-0.018265,0.051901,1.076491,1
3991,B_Star-03998,0,0.00,0.00,0.0,2.457468e+06,0.00,181,J05415+534,0.205178,...,1.404428,-0.099327,-0.007536,-0.028463,-0.020467,-0.110814,-0.065574,0.194104,0.845764,0


### Performance

In [94]:
y_true = s4_val_w_pred['Pulsating']

In [95]:
y_pred = s4_val_w_pred['pred_multiclf']

In [96]:
print(confusion_matrix(y_true=y_true, y_pred=y_pred))

[[1819 1783]
 [ 205  186]]


In [97]:
print(classification_report(y_true=y_true, y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.90      0.50      0.65      3602
           1       0.09      0.48      0.16       391

    accuracy                           0.50      3993
   macro avg       0.50      0.49      0.40      3993
weighted avg       0.82      0.50      0.60      3993



<font color='red'>**CLEARLY, THIS IS NOT WORKING WELL, EITHER. THERE MUST BE A PROBLEM WITH THIS APPROACH:**</font>

<font color='red'>**- Maybe _cesium_ is not capturing well the characteristics of the curves.**</font>

<font color='red'>**- Maybe the high noise level (comparable to the amplitude of pulsations) is masking the pulsations with lower amplitudes. If these low-amplitude pulsations are present in the training set, they are probably introducing a confusion factor for the training of the model.**</font>


<font color='blue'>**SOME POSSIBLE ALTERNATIVES COULD BE:**</font>

<font color='blue'>**- Infer a reasonable `predict_proba` value beyond which we can be sure that the star is pulsating.**</font>

<font color='blue'>**- Remove the low amplitude pulsators from the training set.**</font>

<font color='blue'>**- Drop the _cesium_ approach and work directly with the values in the time series.**</font>
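One caveat on the first alternative: for a k-NN classifier with uniform weights, the predicted probability is the fraction of the k nearest neighbours that belong to the class, so with a single neighbour it can only be 0 or 1 and cannot be thresholded. The sketch below illustrates this with a brute-force NumPy stand-in. The helper `knn_vote_fraction`, the toy data, `k=15` and the 0.8 threshold are all hypothetical choices, not values from this notebook.

```python
import numpy as np

def knn_vote_fraction(X_train, y_train, X_query, k=15):
    """Fraction of the k nearest training neighbours (L1 distance) labelled 1."""
    d = np.abs(X_query[:, None, :] - X_train[None, :, :]).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]          # indices of the k closest points
    return y_train[nearest].mean(axis=1)

rng = np.random.default_rng(11)
X_train = rng.normal(size=(300, 4))                 # toy features (hypothetical)
y_train = rng.integers(0, 2, size=300)              # toy 0/1 labels
X_query = rng.normal(size=(10, 4))

frac_k1 = knn_vote_fraction(X_train, y_train, X_query, k=1)    # only 0.0 or 1.0
frac_k15 = knn_vote_fraction(X_train, y_train, X_query, k=15)  # multiples of 1/15
threshold = 0.8                                     # hypothetical cut
flagged = frac_k15 >= threshold                     # stars called "pulsating"
print(frac_k1, frac_k15, flagged)
```

Any probability cut would therefore require a larger number of neighbours (or a different classifier), and the cut itself would have to be tuned on data not used for the final evaluation.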


## Summary:

**CONCLUSIONS:**

- We have seen that this approach does not seem to work either: it has similar problems (low precision and overfitting) to the other approach.
- **At the moment, we will just conclude the analysis of the previous results before investigating these other ways.**