# FEATURE ENGINEERING - FEATURE SELECTION

In this notebook we investigate a feature selection step. Our earlier investigation of the features that best match between the S4 sample and the ML subsample already led us to discard $64$ of the $112$ features, so it is probably not advisable to discard more features. Nevertheless, we apply a `SelectKBest` feature selector to gauge the importance of the features. As the scoring function we use `mutual_info_classif`, a k-NN-based estimate of mutual information, because it is able to identify non-linear relationships between a feature and the target variable.

Obviously, we can only apply it to the S4 sample, since we do not have the true labels of the objects in the ML subsample.
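To illustrate why a mutual-information score was chosen, the following minimal sketch compares `mutual_info_classif` and `f_classif` on synthetic data (not the S4 sample). The data, the variable names and the seeds are assumptions made for illustration only: the label depends on `|x|`, a purely non-linear relationship that leaves both class means close to zero.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)            # illustrative seed (assumption)
n_samples = 1000

x_nonlinear = rng.normal(size=n_samples)  # the label depends on |x| only
x_noise = rng.normal(size=n_samples)      # unrelated to the label
y = (np.abs(x_nonlinear) > 1.0).astype(int)

X = np.column_stack([x_nonlinear, x_noise])

# k-NN based mutual information estimate (non-negative, in nats)
mi_scores = mutual_info_classif(X, y, random_state=0)

# ANOVA F-test: compares class means, so it mainly reacts to linear shifts
f_scores, p_values = f_classif(X, y)

print("mutual information:", mi_scores)
print("F-scores:          ", f_scores)
print("p-values:          ", p_values)
```

On data like this one would expect a clearly larger mutual information for the first column than for the noise column, whereas the F-test is not designed to detect this kind of dependence.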

## Modules and configuration

### Modules

In [1]:
import pandas as pd
import numpy as np

import warnings

from sklearn.feature_selection import SelectKBest, mutual_info_classif, f_classif

import pickle

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
plt.rcParams['figure.figsize'] = (15, 10) # set_style() ignores non-style keys such as figure.figsize

from IPython.display import display

### Configuration

In [2]:
RANDOM_STATE = 11 # For reproducibility

ML_INPUT_SCALED = "../data/DATASETS_ML/ML_02_DS_AfterImputing.csv"
S4_INPUT_SCALED = "../data/DATASETS_ML/S4_02_DS_AfterImputing.csv"

REL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Reliable_features.pickle"
UNREL_FEATURES_IN = "../data/ML_MODELS/ML_pipeline_steps/Unreliable_features.pickle"

SEL_FEATURE_A_LIST_OUT = "../data/ML_MODELS/ML_pipeline_steps/SelectedFeatures_mic.pickle"
SEL_FEATURE_B_LIST_OUT = "../data/ML_MODELS/ML_pipeline_steps/SelectedFeatures_fc.pickle"

ML_ADD_COLUMNS = ['Karmn'] # Only cesium features and this column will be kept.
S4_ADD_COLUMNS = ['ID', 'Pulsating', 'frequency', 'amplitudeRV',
                  'offsetRV', 'refepochRV', 'phase'] # Only cesium features and these columns will be kept.

IMAGE_FOLDER = './img/'

### Functions

## Load data

We load the data, which are the time series as previously featurized by _cesium_, scaled, and with `NaN` values imputed by a `KNNImputer`.

### Load reliable features list

In [3]:
with open(REL_FEATURES_IN, 'rb') as fh:
    rel_features = pickle.load(fh)
print(rel_features)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_rel_phase3', '

### Load unreliable features list

In [4]:
with open(UNREL_FEATURES_IN, 'rb') as fh:
    unrel_features = pickle.load(fh)
print(unrel_features)

['avg_err', 'avgt', 'mean', 'med_err', 'std_err', 'amplitude', 'flux_percentile_ratio_mid20', 'flux_percentile_ratio_mid35', 'flux_percentile_ratio_mid50', 'flux_percentile_ratio_mid65', 'flux_percentile_ratio_mid80', 'max_slope', 'maximum', 'median', 'median_absolute_deviation', 'minimum', 'percent_amplitude', 'percent_close_to_median', 'percent_difference_flux_percentile', 'period_fast', 'qso_log_chi2_qsonu', 'qso_log_chi2nuNULL_chi2nu', 'skew', 'std', 'stetson_j', 'stetson_k', 'weighted_average', 'fold2P_slope_10percentile', 'fold2P_slope_90percentile', 'freq1_amplitude1', 'freq1_amplitude2', 'freq1_amplitude3', 'freq1_amplitude4', 'freq1_freq', 'freq1_lambda', 'freq1_signif', 'freq2_amplitude1', 'freq2_amplitude2', 'freq2_amplitude3', 'freq2_amplitude4', 'freq2_freq', 'freq3_amplitude1', 'freq3_amplitude2', 'freq3_amplitude3', 'freq3_amplitude4', 'freq3_freq', 'freq_amplitude_ratio_21', 'freq_amplitude_ratio_31', 'freq_frequency_ratio_21', 'freq_frequency_ratio_31', 'freq_model_max

### Read the S4 sample data (scaled and imputed features)

In [5]:
s4 = pd.read_csv(S4_INPUT_SCALED, sep=',', decimal='.')
s4

Unnamed: 0,ID,Pulsating,frequency,amplitudeRV,offsetRV,refepochRV,phase,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,...,freq_signif_ratio_31,freq_varrat,freq_y_offset,linear_trend,medperc90_2p_p,p2p_scatter_2praw,p2p_scatter_over_mad,p2p_scatter_pfold_over_mad,p2p_ssqr_diff_over_var,scatter_res_raw
0,Star-00000,True,10.33,1.14,0.0,2.457582e+06,0.90,-1.944263,2.120058,1.900503,...,0.175639,-1.179443,0.567705,0.100876,-0.114591,-0.216210,0.045386,-0.543694,1.459163,-1.311671
1,Star-00001,True,14.92,1.30,0.0,2.457522e+06,0.02,1.231081,0.171155,0.089360,...,0.281290,0.536314,0.053099,0.191840,-0.184775,1.237821,-0.589916,-0.372454,-1.663617,-0.333991
2,Star-00002,False,0.00,0.00,0.0,2.457549e+06,0.00,0.596012,-0.664089,-0.212498,...,-0.307759,0.629682,-0.137934,0.052383,-0.134684,-0.332168,0.316178,-0.194267,-0.020196,0.651418
3,Star-00003,False,0.00,0.00,0.0,2.457460e+06,0.00,2.501218,-0.664089,-1.495391,...,-0.076569,0.420813,-0.544527,0.065099,-0.073476,0.021419,-0.521209,-0.244714,0.081570,-0.272277
4,Star-00004,True,28.74,0.90,0.0,2.457451e+06,0.29,-0.039057,1.076003,0.391217,...,-0.390230,-1.060328,2.146213,0.010452,-0.401182,-0.849181,0.067740,-0.433568,0.438785,-0.757588
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,Star-00995,False,0.00,0.00,0.0,2.457504e+06,0.00,0.913546,-0.037656,0.089360,...,0.842777,-0.334041,-1.827351,-0.257412,-0.616404,-1.315109,0.886334,0.702323,-0.270151,0.406118
996,Star-00996,False,0.00,0.00,0.0,2.457673e+06,0.00,-1.626729,1.632833,2.051432,...,1.724900,-0.463711,-1.194149,0.021936,0.815043,0.401297,-0.372013,-0.028165,1.854842,-1.064909
997,Star-00997,False,0.00,0.00,0.0,2.458634e+06,0.00,-0.356591,-0.803296,-0.514355,...,-0.486141,1.825751,-0.108959,0.070002,-0.285029,0.368291,-0.238927,0.264164,-0.103797,2.159378
998,Star-00998,False,0.00,0.00,0.0,2.457397e+06,0.00,-0.356591,-0.037656,-0.891677,...,-0.288419,1.285370,0.083736,0.024976,-0.290043,0.132179,-0.323890,-0.160471,0.052323,1.120890


#### Filter the relevant columns only

We now keep only the reliable features plus the `Pulsating` target column.

In [6]:
s4_rel = s4[['Pulsating'] + rel_features].copy()
s4_rel

Unnamed: 0,Pulsating,all_times_nhist_numpeaks,all_times_nhist_peak1_bin,all_times_nhist_peak2_bin,all_times_nhist_peak3_bin,all_times_nhist_peak4_bin,all_times_nhist_peak_1_to_2,all_times_nhist_peak_1_to_3,all_times_nhist_peak_1_to_4,all_times_nhist_peak_2_to_3,...,freq1_rel_phase2,freq1_rel_phase3,freq1_rel_phase4,freq2_rel_phase2,freq2_rel_phase3,freq2_rel_phase4,freq3_rel_phase2,freq3_rel_phase3,freq3_rel_phase4,freq_model_phi1_phi2
0,True,-1.944263,2.120058,1.900503,0.032450,0.058428,-0.175766,0.142098,0.142904,0.146788,...,0.585852,0.760826,0.461844,1.375471,-1.544456,-0.315672,1.078066,1.258823,-1.522632,-0.302747
1,True,1.231081,0.171155,0.089360,-1.630917,-0.684446,-0.341852,-0.577907,-0.567365,-0.512199,...,0.378288,-1.366258,-1.617260,1.566127,0.098099,0.141892,-1.445867,0.086891,-0.557137,-0.738010
2,False,0.596012,-0.664089,-0.212498,0.391732,0.087724,1.196750,2.120285,1.085438,0.930900,...,0.297032,-1.118182,1.496541,1.537578,-1.054755,-1.021736,-0.545619,-1.173806,0.715002,-1.630625
3,False,2.501218,-0.664089,-1.495391,0.166993,-1.533833,-0.462828,-0.578686,-0.708032,-0.419026,...,-1.437135,0.551779,0.448023,-1.655605,-0.854472,0.534856,1.277610,1.484307,-0.080514,1.992878
4,True,-0.039057,1.076003,0.391217,-1.256352,-0.993314,0.012129,-0.287835,-0.534947,-0.391603,...,-0.182144,0.544265,-1.012917,0.668693,-1.428288,-1.588016,0.644710,1.079623,-1.606967,-1.430279
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,False,0.913546,-0.037656,0.089360,-1.331265,-0.993314,-0.692478,-0.368627,-0.586661,0.156860,...,-1.699344,-1.574378,0.220713,0.238513,0.122477,-0.827172,1.420243,-0.877574,-1.223558,1.185163
996,False,-1.626729,1.632833,2.051432,-1.406178,-0.151283,0.082590,0.843252,-0.068494,0.842438,...,-1.000715,0.015125,-0.882959,0.522536,0.093065,-1.035738,-0.495161,-0.582121,1.205916,0.425432
997,False,-0.356591,-0.803296,-0.514355,-0.132658,0.242158,-0.328457,-0.425885,0.292986,-0.312971,...,1.377438,-0.094915,-1.389352,1.686697,-0.906353,0.705300,-1.653687,-0.368324,0.225336,2.247789
998,False,-0.356591,-0.037656,-0.891677,0.316819,0.087724,-0.677428,-0.686637,-0.777780,-0.397438,...,0.139384,0.388687,-0.000302,1.710044,0.342520,0.410963,-0.811926,-0.752704,-1.525543,-0.335513


In [7]:
print(list(s4_rel.columns))

['Pulsating', 'all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak2_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_2', 'all_times_nhist_peak_1_to_3', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_3', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'all_times_nhist_peak_val', 'avg_double_to_single_step', 'cad_probs_1', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100', 'cad_probs_500', 'cad_probs_1000', 'cad_probs_5000', 'cad_probs_10000', 'cad_probs_50000', 'cad_probs_100000', 'cad_probs_500000', 'cad_probs_1000000', 'cad_probs_5000000', 'cad_probs_10000000', 'cads_avg', 'cads_med', 'cads_std', 'med_double_to_single_step', 'n_epochs', 'std_double_to_single_step', 'total_time', 'percent_beyond_1_std', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq1_rel_phase4', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq3_r

## Feature selection (with `mutual_info_classif` score)

We define and run the feature selection. Because the mutual information estimate has a random component, we run $50$ iterations and check which features remain among the $24$ most relevant ones (half of the $48$) in every run.

In [8]:
np.random.seed(RANDOM_STATE) # For reproducibility
common_features_mic = set(rel_features)
for _ in range(50):
    fsel = SelectKBest(score_func=mutual_info_classif, k=24)
    fsel.fit(X=s4[rel_features], y=s4['Pulsating'])
    sel_idx = fsel.get_support(indices=True).tolist()
    # Names of the features selected in this run
    sel_features = {rel_features[j] for j in sel_idx}
    # Keep only the features that were selected in every run so far
    common_features_mic &= sel_features

common_features_mic

{'freq1_rel_phase2', 'freq2_rel_phase2'}

In [9]:
len(common_features_mic)

2

**OBSERVATION:** as we see, out of the $24$ features we established as the threshold, only $2$ features are selected in all $50$ runs.
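The run-to-run variation comes from the small random noise that `mutual_info_classif` adds to continuous features, which is controlled by its `random_state` parameter. As a hedged sketch of an alternative to seeding NumPy globally, each run can be given its own explicit seed with `functools.partial`. The synthetic data, the sizes and the seeds below are assumptions for illustration, not the S4 sample.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(11)
n_samples, n_noise = 300, 8

# Synthetic stand-in for the S4 features (assumption, not the real data)
y = rng.integers(0, 2, size=n_samples)
informative = np.column_stack([
    y + 0.1 * rng.normal(size=n_samples),
    -y + 0.1 * rng.normal(size=n_samples),
])
noise = rng.normal(size=(n_samples, n_noise))
X = np.hstack([informative, noise])

k, n_runs = 4, 5
common = set(range(X.shape[1]))
for seed in range(n_runs):
    # Fix the estimator's internal noise for this run via random_state
    score_func = partial(mutual_info_classif, random_state=seed)
    fsel = SelectKBest(score_func=score_func, k=k).fit(X, y)
    # Keep only the column indices selected in every run so far
    common &= set(fsel.get_support(indices=True).tolist())

print(sorted(common))
```

With explicit seeds, any single run can be reproduced in isolation, which makes it easier to inspect why a given feature dropped out of the top-`k` list.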

Let's try it another way: counting how many times each feature enters the top-$24$ list.

In [10]:
np.random.seed(RANDOM_STATE) # For reproducibility
count_features_mic = {}
for _ in range(50):
    fsel = SelectKBest(score_func=mutual_info_classif, k=24)
    fsel.fit(X=s4[rel_features], y=s4['Pulsating'])
    sel_idx = fsel.get_support(indices=True).tolist()
    # Names of the features selected in this run
    sel_features = [rel_features[j] for j in sel_idx]
    for f in sel_features:
        count_features_mic[f] = count_features_mic.get(f, 0) + 1


In [11]:
print(count_features_mic)

{'all_times_nhist_peak1_bin': 13, 'all_times_nhist_peak4_bin': 12, 'all_times_nhist_peak_1_to_2': 48, 'all_times_nhist_peak_1_to_4': 31, 'all_times_nhist_peak_2_to_4': 20, 'all_times_nhist_peak_val': 42, 'avg_double_to_single_step': 31, 'cad_probs_1': 27, 'cad_probs_10': 24, 'cad_probs_30': 23, 'cad_probs_100': 32, 'cad_probs_1000': 36, 'cad_probs_5000': 35, 'cad_probs_100000': 45, 'cad_probs_1000000': 27, 'cad_probs_10000000': 22, 'cads_std': 16, 'med_double_to_single_step': 42, 'n_epochs': 47, 'total_time': 48, 'percent_beyond_1_std': 45, 'freq1_rel_phase2': 50, 'freq2_rel_phase2': 50, 'freq3_rel_phase2': 49, 'all_times_nhist_numpeaks': 31, 'all_times_nhist_peak_2_to_3': 16, 'cad_probs_40': 32, 'cad_probs_50': 32, 'cad_probs_10000': 6, 'cads_avg': 35, 'freq3_rel_phase4': 15, 'freq_model_phi1_phi2': 25, 'all_times_nhist_peak2_bin': 10, 'all_times_nhist_peak_3_to_4': 38, 'cad_probs_20': 22, 'cad_probs_5000000': 19, 'all_times_nhist_peak3_bin': 14, 'cad_probs_500': 26, 'cads_med': 22, '

Sort descending:

In [12]:
sorted_count_features_mic = [(key, value) for key, value in count_features_mic.items()]
sorted_count_features_mic.sort(key=lambda x: x[1], reverse=True)
sorted_count_features_mic

[('freq1_rel_phase2', 50),
 ('freq2_rel_phase2', 50),
 ('freq3_rel_phase2', 49),
 ('all_times_nhist_peak_1_to_2', 48),
 ('total_time', 48),
 ('n_epochs', 47),
 ('cad_probs_100000', 45),
 ('percent_beyond_1_std', 45),
 ('all_times_nhist_peak_val', 42),
 ('med_double_to_single_step', 42),
 ('all_times_nhist_peak_3_to_4', 38),
 ('cad_probs_1000', 36),
 ('cad_probs_5000', 35),
 ('cads_avg', 35),
 ('cad_probs_100', 32),
 ('cad_probs_40', 32),
 ('cad_probs_50', 32),
 ('all_times_nhist_peak_1_to_4', 31),
 ('avg_double_to_single_step', 31),
 ('all_times_nhist_numpeaks', 31),
 ('cad_probs_1', 27),
 ('cad_probs_1000000', 27),
 ('cad_probs_500', 26),
 ('freq_model_phi1_phi2', 25),
 ('cad_probs_10', 24),
 ('cad_probs_30', 23),
 ('cad_probs_10000000', 22),
 ('cad_probs_20', 22),
 ('cads_med', 22),
 ('all_times_nhist_peak_2_to_4', 20),
 ('cad_probs_5000000', 19),
 ('cads_std', 16),
 ('all_times_nhist_peak_2_to_3', 16),
 ('freq3_rel_phase4', 15),
 ('all_times_nhist_peak3_bin', 14),
 ('all_times_nhist

In [13]:
len(sorted_count_features_mic)

46

**OBSERVATION:** out of the $48$ initial features, $46$ have appeared in the top-$24$ list in at least one of the $50$ trials, and only about $10$ to $15$ features appear consistently in it. Only $2$ features always appear in the top-$24$ list.

This seems to suggest that this feature selection is not reliable and that we should not use it, leaving the feature selection task to the ML model itself during optimization and training with cross-validation.
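The "about $10$ to $15$" estimate can be checked directly from the counts. The sketch below copies the first $20$ entries of the sorted output above and counts how many features were selected in at least a given share of the $50$ runs. The helper `stable_features` and the $80\%$ and $70\%$ thresholds are assumptions chosen for illustration.

```python
N_RUNS = 50

# Selection counts copied from the sorted output above (first 20 entries only)
counts = {
    'freq1_rel_phase2': 50, 'freq2_rel_phase2': 50, 'freq3_rel_phase2': 49,
    'all_times_nhist_peak_1_to_2': 48, 'total_time': 48, 'n_epochs': 47,
    'cad_probs_100000': 45, 'percent_beyond_1_std': 45,
    'all_times_nhist_peak_val': 42, 'med_double_to_single_step': 42,
    'all_times_nhist_peak_3_to_4': 38, 'cad_probs_1000': 36,
    'cad_probs_5000': 35, 'cads_avg': 35, 'cad_probs_100': 32,
    'cad_probs_40': 32, 'cad_probs_50': 32,
    'all_times_nhist_peak_1_to_4': 31, 'avg_double_to_single_step': 31,
    'all_times_nhist_numpeaks': 31,
}


def stable_features(counts, n_runs, min_percent):
    """Return the features selected in at least `min_percent` % of the runs."""
    # Integer arithmetic avoids floating-point surprises at the threshold
    return [name for name, c in counts.items() if c * 100 >= min_percent * n_runs]


stable_80 = stable_features(counts, N_RUNS, 80)
stable_70 = stable_features(counts, N_RUNS, 70)
print(len(stable_80), len(stable_70))  # prints: 10 14
```

The entries omitted here all have counts below $31$, so they do not affect either threshold.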

Anyway, we store this information for later use, if needed.

### Save selected features info

In [14]:
with open(SEL_FEATURE_A_LIST_OUT, 'wb') as fh:
    pickle.dump(sorted_count_features_mic, fh)
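For later use, the stored list can be read back with `pickle.load`. The following round-trip sketch uses a temporary directory and a three-entry stand-in list, both assumptions for illustration, so that it does not touch the real output file.

```python
import os
import pickle
import tempfile

# Stand-in for sorted_count_features_mic (first three entries of the output above)
sorted_counts = [('freq1_rel_phase2', 50), ('freq2_rel_phase2', 50),
                 ('freq3_rel_phase2', 49)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'SelectedFeatures_mic.pickle')  # illustrative path
    with open(path, 'wb') as fh:
        pickle.dump(sorted_counts, fh)
    with open(path, 'rb') as fh:
        restored = pickle.load(fh)

# The list is sorted by count, so slicing gives the most frequently selected features
top_features = [name for name, _ in restored[:2]]
print(top_features)
```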

## Feature selection (with `f_classif` score)

Let's try another score function, `f_classif`, which focuses more on linear relationships between the features and the target variable. We again limit the selection to just $24$ features and run $50$ iterations, to see if the same features appear.

In [15]:
warnings.filterwarnings(action='ignore')
np.random.seed(RANDOM_STATE) # For reproducibility
count_features_fc = {}
for _ in range(50):
    fsel = SelectKBest(score_func=f_classif, k=24)
    fsel.fit(X=s4[rel_features], y=s4['Pulsating'])
    sel_idx = fsel.get_support(indices=True).tolist()
    # Names of the features selected in this run
    sel_features = [rel_features[j] for j in sel_idx]
    for f in sel_features:
        count_features_fc[f] = count_features_fc.get(f, 0) + 1


In [16]:
print(count_features_fc)

{'all_times_nhist_numpeaks': 50, 'all_times_nhist_peak1_bin': 50, 'all_times_nhist_peak3_bin': 50, 'all_times_nhist_peak4_bin': 50, 'all_times_nhist_peak_1_to_4': 50, 'all_times_nhist_peak_2_to_4': 50, 'all_times_nhist_peak_3_to_4': 50, 'cad_probs_10': 50, 'cad_probs_20': 50, 'cad_probs_30': 50, 'cad_probs_40': 50, 'cad_probs_50': 50, 'cad_probs_100000': 50, 'cad_probs_1000000': 50, 'cads_avg': 50, 'n_epochs': 50, 'total_time': 50, 'freq1_rel_phase2': 50, 'freq1_rel_phase3': 50, 'freq2_rel_phase2': 50, 'freq2_rel_phase3': 50, 'freq2_rel_phase4': 50, 'freq3_rel_phase2': 50, 'freq_model_phi1_phi2': 50}


Sort descending:

In [17]:
sorted_count_features_fc = [(key, value) for key, value in count_features_fc.items()]
sorted_count_features_fc.sort(key=lambda x: x[1], reverse=True)
sorted_count_features_fc

[('all_times_nhist_numpeaks', 50),
 ('all_times_nhist_peak1_bin', 50),
 ('all_times_nhist_peak3_bin', 50),
 ('all_times_nhist_peak4_bin', 50),
 ('all_times_nhist_peak_1_to_4', 50),
 ('all_times_nhist_peak_2_to_4', 50),
 ('all_times_nhist_peak_3_to_4', 50),
 ('cad_probs_10', 50),
 ('cad_probs_20', 50),
 ('cad_probs_30', 50),
 ('cad_probs_40', 50),
 ('cad_probs_50', 50),
 ('cad_probs_100000', 50),
 ('cad_probs_1000000', 50),
 ('cads_avg', 50),
 ('n_epochs', 50),
 ('total_time', 50),
 ('freq1_rel_phase2', 50),
 ('freq1_rel_phase3', 50),
 ('freq2_rel_phase2', 50),
 ('freq2_rel_phase3', 50),
 ('freq2_rel_phase4', 50),
 ('freq3_rel_phase2', 50),
 ('freq_model_phi1_phi2', 50)]

In [18]:
len(sorted_count_features_fc)

24

**OBSERVATION:** results are stable with this `f_classif` score function: the same $24$ features are selected in all $50$ runs. This is expected, because `f_classif` is a deterministic ANOVA F-test with no random component.
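The stability is easier to see from what the score computes. As a minimal pure-Python sketch, the one-way ANOVA F-statistic of a feature is the ratio of between-class to within-class variance, a fixed function of the data. The helper `anova_f` and the toy values are hypothetical and written for illustration; this is not scikit-learn's implementation.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature against class labels."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)

    n_total = len(values)
    grand_mean = sum(values) / n_total

    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    if ss_within == 0:
        # A constant feature gives 0/0, reported here as nan
        return float('nan') if ss_between == 0 else float('inf')
    return (ss_between / df_between) / (ss_within / df_within)


values = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]
labels = [False, False, False, True, True, True]
print(anova_f(values, labels))      # 1.5
print(anova_f([0.0] * 6, labels))   # nan
```

Calling the function twice on the same data necessarily returns the same value, which is why repeating the selection $50$ times cannot change the selected set.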

Let's run it one more time and check the scores and p-values of all the features.

In [19]:
fsel = SelectKBest(score_func=f_classif, k=24)
fsel.fit(X=s4[rel_features], y=s4['Pulsating'])
sel_idx = fsel.get_support(indices=True).tolist()
sel_features = [rel_features[j] for j in sel_idx]
print(sel_features)

['all_times_nhist_numpeaks', 'all_times_nhist_peak1_bin', 'all_times_nhist_peak3_bin', 'all_times_nhist_peak4_bin', 'all_times_nhist_peak_1_to_4', 'all_times_nhist_peak_2_to_4', 'all_times_nhist_peak_3_to_4', 'cad_probs_10', 'cad_probs_20', 'cad_probs_30', 'cad_probs_40', 'cad_probs_50', 'cad_probs_100000', 'cad_probs_1000000', 'cads_avg', 'n_epochs', 'total_time', 'freq1_rel_phase2', 'freq1_rel_phase3', 'freq2_rel_phase2', 'freq2_rel_phase3', 'freq2_rel_phase4', 'freq3_rel_phase2', 'freq_model_phi1_phi2']


In [20]:
len(sel_features)

24

In [21]:
scores = fsel.scores_.tolist()
print(scores)

[2.3699766197518413, 4.496974881126946, 0.1986572123634698, 0.3813400031556547, 2.9777779690358943, 0.003806909728075224, 0.2652077401070679, 0.5809279626642733, 0.13956413446540564, 1.854567966581379, 1.3124597203167931, 0.2585632616698999, 0.2081726841161018, nan, 0.4575666030869316, 0.9415903673421611, 1.2877160517888129, 0.31966513729425833, 0.3630054873475876, 0.13038712519792328, 0.026694518833785194, 0.027755316370961684, 0.17043184860791924, 0.09090892035803584, 0.22508677606681918, 1.118691675200149, 0.010652963397768169, 0.3106154581057065, nan, nan, 0.385254703338002, 0.18571279636497912, 0.001321981306001428, 0.30802292840684653, 2.2244265803520373, 0.23069256401428595, 0.7020683202586707, 0.06734418002145164, 2.5424674528027587, 0.5223392771119948, 0.009219515919619756, 0.8919170768091889, 1.1215080882939141, 2.221924809721255, 0.4441610272126721, 0.1783551037913882, 0.2598396455280381, 0.4339040347861714]


In [22]:
print(fsel.pvalues_)

[0.12400589 0.03420101 0.65590357 0.53702783 0.08472463 0.95081397
 0.60667877 0.44612891 0.70879405 0.17356036 0.25222443 0.61122132
 0.64830269        nan 0.49892016 0.33210428 0.25674175 0.57193603
 0.54697865 0.71810741 0.87024873 0.86771912 0.67981697 0.76308763
 0.63529488 0.29045629 0.9178145  0.57742773        nan        nan
 0.53494605 0.66660209 0.97100333 0.5790203  0.13615909 0.63111604
 0.4022902  0.79529716 0.1111381  0.47001484 0.92352531 0.3451869
 0.2898502  0.13637931 0.50527494 0.67288163 0.61034304 0.51023143]


In [23]:
len(scores)

48

How large are the p-values of each feature? How many are above 0.95?

In [24]:
(fsel.pvalues_ > 0.95).sum()

2

In [25]:
features_to_drop = [(i,rel_features[i]) for i in range(0, len(rel_features)) if fsel.pvalues_[i] > 0.95]
features_to_drop

[(5, 'all_times_nhist_peak_1_to_2'), (32, 'cads_std')]

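That is to say, only $2$ features have a p-value above $0.95$, meaning that the F-test finds essentially no evidence of a linear relationship between them and the target. Strictly speaking, a high p-value does not prove statistical independence, but under this criterion these are the only $2$ features we could consider dropping.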
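The p-values printed above can be tallied in plain Python to put the $0.95$ threshold in context. The values are copied from the `fsel.pvalues_` output. The $0.05$ significance level is a conventional choice and an assumption here, since the notebook itself only uses $0.95$.

```python
import math

nan = float('nan')

# p-values copied from the fsel.pvalues_ output above (48 features)
p_values = [
    0.12400589, 0.03420101, 0.65590357, 0.53702783, 0.08472463, 0.95081397,
    0.60667877, 0.44612891, 0.70879405, 0.17356036, 0.25222443, 0.61122132,
    0.64830269, nan, 0.49892016, 0.33210428, 0.25674175, 0.57193603,
    0.54697865, 0.71810741, 0.87024873, 0.86771912, 0.67981697, 0.76308763,
    0.63529488, 0.29045629, 0.9178145, 0.57742773, nan, nan,
    0.53494605, 0.66660209, 0.97100333, 0.5790203, 0.13615909, 0.63111604,
    0.4022902, 0.79529716, 0.1111381, 0.47001484, 0.92352531, 0.3451869,
    0.2898502, 0.13637931, 0.50527494, 0.67288163, 0.61034304, 0.51023143,
]

ALPHA = 0.05  # conventional significance level (assumption, not used in the notebook)
HIGH = 0.95   # threshold used in the notebook

n_undefined = sum(math.isnan(p) for p in p_values)
n_significant = sum(p < ALPHA for p in p_values)  # comparisons with nan are False
n_high = sum(p > HIGH for p in p_values)

print(len(p_values), n_undefined, n_significant, n_high)  # prints: 48 3 1 2
```

The three undefined entries are the features with `nan` scores, and only one feature falls below the conventional $0.05$ level. This suggests that the univariate F-test separates the two classes only weakly on these features, whichever threshold is used.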

## Summary

**RESULTS:**

- We got inconsistent top-$24$ lists of selected features when using the `mutual_info_classif` score.
- Using `f_classif`, the results are consistent, but according to the p-values only two features have p-values above $0.95$, which means that we could discard at most $2$ features.

**CONCLUSIONS:**

- In line with the results shown by the PCA, it is not advisable to apply feature selection in our case.