## Goal : find the better set of features

--

Thanks to all the feature importances given by all the ML algorithms we test.


#### TODO :
- create a importance score based on the importance from the csv

In [1]:
import numpy as np
import pandas as pd
import tuxml

### I- First we have to import all the datasets with features importances

The first tree + some data processing => keep all

In [2]:
fi_tree = np.array(pd.read_csv("feature_net.csv", skiprows = 1, names = ['features', 'imp'])['features'])
len(fi_tree)

1118

The actual tree => keep all

In [3]:
fi_tree2 = np.array(pd.read_csv("feature_importance.csv", skiprows = 1, names = ['features', 'imp'])['features'])
len(fi_tree2)

1051

Maybe another Decision Tree? => keep all

In [4]:
fi_tree3 = np.array(pd.read_csv("feature_importanceDT.csv", skiprows = 1, names = ['features', 'imp'])['features'])
len(fi_tree3)

369

Random forest => keep 99.9% of the predictive power

In [5]:
# we select the features based on random forest importance
df_selection = pd.read_csv('feature_importanceRF.csv', names = ['features', 'imp'], skiprows = 0)
selection = df_selection['features']
imp = df_selection['imp']

# we select the more important features until we get more than alpha % of the 'predictive power'
alpha = 0.999
sorted_imp = sorted(imp, reverse = True)
sum_imp = 0
i = 0

while sum_imp < alpha:
    sum_imp+=sorted_imp[i]
    i+=1
threshold_imp = sorted_imp[i]

fi_rf = selection[imp>threshold_imp]

len(fi_rf)

4229

Probably ElasticNet => keep the 1000 first features

In [6]:
fi_elan = np.array(pd.read_csv("feature_importanceEN.csv", skiprows = 1, names = ['features', 'imp'])['features'])[0:1000]
len(fi_elan)

1000

Probably Gradient Boosting => keep them all

In [7]:
fi_gb = np.array(pd.read_csv("feature_importanceGB.csv", skiprows = 1, names = ['features', 'imp'])['features'])
len(fi_gb)

160

I guess Linear Regression => keep the 1000 first features

In [8]:
fi_linreg = np.array(pd.read_csv("feature_importanceLR.csv", skiprows = 1, names = ['features', 'imp'])['features'])[0:1000]
len(fi_linreg)

1000

Lasso => keep the 1000 first features

In [9]:
fi_lasso= np.array(pd.read_csv("feature_importanceLasso.csv", skiprows = 1, names = ['features', 'imp'])['features'])[0:1000]
len(fi_lasso)

1000

Ridge => keep the 1000 first features

In [10]:
fi_ridge = np.array(pd.read_csv("feature_importanceRidge.csv", skiprows = 1, names = ['features', 'imp'])['features'])[0:1000]
len(fi_ridge)

1000

Correlation => we keep the 1000 with vmlinux highest correlated features (absolute value)

In [11]:
fi_correlation = pd.read_csv("correlations_vmlinux.csv", skiprows= 1, names = ['features', 'imp'])
fi_correlation['imp'] =  np.abs(fi_correlation['imp'])
fi_corr = np.array(fi_correlation.sort_values(by = 'imp', ascending = False)['features'][0:1000])
len(fi_corr)

1000

Welch test

In [12]:
fi_welch = pd.read_csv("welch_test_output.csv", skiprows = 1, names = ['features','imp'])['features']

## II- Let's the ML vote!

For each feature, we count how many time it's been considered as important for one ml algo

In [13]:
features = tuxml.load_dataset()

list_features = features.columns

In [14]:
count = []
for f in list_features:
    sum_ml = 0
    if f in fi_tree:
        sum_ml+=1
    if f in fi_tree2:
        sum_ml+=1
    if f in fi_tree3:
        sum_ml+=1
    if f in fi_rf:
        sum_ml+=1
    if f in fi_elan:
        sum_ml+=1
    if f in fi_gb:
        sum_ml+=1
    if f in fi_linreg:
        sum_ml+=1
    if f in fi_lasso:
        sum_ml+=1
    if f in fi_ridge:
        sum_ml+=1
    if f in fi_corr:
        sum_ml+=1
    if f in fi_welch:
        sum_ml+=1
    count.append(sum_ml)

  after removing the cwd from sys.path.


In [15]:
vote = pd.DataFrame(np.transpose([list_features, count]), columns = ['features', 'count'])
vote.head()

Unnamed: 0,features,count
0,X86_LOCAL_APIC,1
1,OPENVSWITCH,1
2,TEXTSEARCH_FSM,1
3,LOCKDEP_SUPPORT,1
4,GENERIC_CLOCKEVENTS_MIN_ADJUST,1


In [16]:
res_vote = vote.sort_values(by='count', ascending = False)
res_vote[0:20]

Unnamed: 0,features,count
7111,DEBUG_INFO_SPLIT,7
12438,GCOV_PROFILE_ALL,7
7828,KASAN_OUTLINE,7
5822,RANDOMIZE_BASE,7
618,KASAN,7
2691,UBSAN_SANITIZE_ALL,7
7084,DEBUG_INFO_REDUCED,7
6131,X86_VSMP,6
5829,X86_NEED_RELOCS,6
5622,DEBUG_INFO,6


Maybe the 20 most important features?

Small set

In [17]:
small_feature_set = np.array(res_vote['features'])[np.where(res_vote['count']>2)]
print(len(small_feature_set))

273


In [18]:
#np.savetxt("small_feature_set.csv", big_feature_set, delimiter=",", fmt='%s')

Bigger set

In [19]:
big_feature_set = np.array(res_vote['features'])[np.where(res_vote['count']>1)]
print(len(big_feature_set))

1376


In [20]:
#np.savetxt("big_feature_set.csv", big_feature_set, delimiter=",", fmt='%s')