# Linux Transfer Learning notebook - Feature Ranking List

The goal of this notebook is to produce a reliable list of features ranked by their importance for predicting the size of the kernel (vmlinux).

Importing the dataset

In [1]:
import pandas as pd
df_413 = pd.read_pickle("datasets/dataset_413.pkl")

In [2]:
df_413

,X86_LOCAL_APIC,OPENVSWITCH,TEXTSEARCH_FSM,NETFILTER_XT_MATCH_TCPMSS,MPLS,NFC_HCI,NETFILTER_XT_MATCH_TIME,NET_MPLS_GSO,NFC_SHDLC,NETFILTER_XT_MATCH_U32,...,XZ-bzImage,XZ-vmlinux,XZ,LZO-bzImage,LZO-vmlinux,LZO,LZ4-bzImage,LZ4-vmlinux,LZ4,cid
0,1,0,0,0,1,0,0,1,0,0,...,5178320.0,7264848,4980068,8922064.0,11008072,8734199,9839568.0,11925896,9638560,30000
1,1,0,0,0,0,0,0,0,0,0,...,2840016.0,4924448,2695928,4519376.0,6603288,4385061,4838864.0,6923096,4693085,30001
2,1,0,0,0,0,0,0,0,0,0,...,8496592.0,10581024,8351248,12391888.0,14475800,12256864,13362640.0,15446872,13214970,30002
3,1,0,0,0,0,0,0,0,0,0,...,6304720.0,8390008,6156724,8782800.0,10867576,8647251,9302992.0,11388080,9155423,30003
4,1,0,0,0,0,1,0,0,1,0,...,12321744.0,14407032,12176312,17933264.0,20018040,17796721,19346384.0,21431472,19197696,30004
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92557,1,0,0,0,0,0,0,0,0,0,...,552400.0,2638880,411384,691664.0,2777624,558713,724432.0,2810712,578376,126756
92558,1,0,0,0,0,0,0,0,0,0,...,552400.0,2638880,411312,691664.0,2777624,558713,724432.0,2810712,578376,126757
92559,1,0,0,0,0,0,0,0,0,0,...,552400.0,2638880,411328,691664.0,2777624,558713,724432.0,2810712,578376,126758
92560,1,0,0,0,0,0,0,0,0,0,...,552400.0,2638880,411336,691664.0,2777624,558713,724432.0,2810712,578376,126759


In [3]:
size_columns = ["GZIP-bzImage", "GZIP-vmlinux", "GZIP", "BZIP2-bzImage", "vmlinux", 
              "BZIP2-vmlinux", "BZIP2", "LZMA-bzImage", "LZMA-vmlinux", "LZMA", "XZ-bzImage", "XZ-vmlinux", "XZ", 
              "LZO-bzImage", "LZO-vmlinux", "LZO", "LZ4-bzImage", "LZ4-vmlinux", "LZ4"]

Splitting the dataset into a training set and a testing set. We use most of the dataset (90%) for training.

In [4]:
from sklearn import ensemble, tree
from sklearn.model_selection import train_test_split

train_size = 0.9
X_train, X_test, y_train, y_test = train_test_split(df_413.drop(columns=size_columns+["cid"], errors="ignore"), df_413["vmlinux"], train_size=train_size)
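
The split above removes every size measurement and the `cid` identifier from the inputs, so the model cannot read the target from a sibling column. Below is a minimal sketch of the same pattern on a made-up frame. The columns `OPT_A` and `OPT_B`, the values, and the `random_state` argument are assumptions added for illustration, not part of the dataset or of the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: two binary options, one target, one id.
toy = pd.DataFrame({
    "OPT_A": [0, 1] * 50,
    "OPT_B": [1, 1, 0, 0] * 25,
    "vmlinux": range(1000, 1100),
    "cid": range(100),
})

# errors="ignore" lets the list name columns that are absent from this frame.
targets = ["vmlinux", "GZIP-vmlinux"]
X = toy.drop(columns=targets + ["cid"], errors="ignore")
y = toy["vmlinux"]

# random_state is an addition here; it makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.9, random_state=0)
print(X.columns.tolist(), len(X_tr), len(X_te))
```

With 100 rows and `train_size=0.9`, 90 rows go to training and 10 to testing.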

Training a single Random Forest on the training set:

In [5]:
# Setting hyperparameters for the Random Forest
reg = ensemble.RandomForestRegressor(n_estimators=48, max_depth=20, min_samples_split=10, n_jobs=8)

# Fitting the model
reg.fit(X_train, y_train)

# Predicting the testing set and computing the Mean Absolute Percentage Error (MAPE)
y_pred = reg.predict(X_test)
dfErrorsFold = pd.DataFrame({"% error":((y_pred - y_test)/y_test).abs()*100})

print("MAPE : ", dfErrorsFold["% error"].mean())

MAPE :  8.332935361473846


A MAPE of about 8% is accurate enough for the feature importances to be worth ranking.
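
For reference, the error computed above is the mean of |prediction - truth| / truth, expressed in percent. Here is a small self-contained sketch with invented numbers; the helper name `mape_percent` is hypothetical.

```python
import numpy as np

def mape_percent(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes no zero in y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_pred - y_true) / y_true)) * 100)

# Invented sizes: relative errors of 10%, 5% and 0%, so the mean is about 5%.
result = mape_percent([100, 200, 400], [110, 190, 400])
print(result)
```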

In [6]:
feature_importance = pd.Series(reg.feature_importances_, X_train.columns)
feature_importance.sort_values()

X86_LOCAL_APIC          0.000000
PNFS_FLEXFILE_LAYOUT    0.000000
IPW2200_PROMISCUOUS     0.000000
USB_F_TCM               0.000000
NFT_CHAIN_NAT_IPV6      0.000000
                          ...   
X86_NEED_RELOCS         0.082012
DEBUG_INFO_SPLIT        0.083730
DEBUG_INFO_REDUCED      0.112955
active_options          0.189966
DEBUG_INFO              0.333074
Length: 9468, dtype: float64

In [9]:
list(feature_importance.sort_values(ascending=False).index)[:20]

['DEBUG_INFO',
 'active_options',
 'DEBUG_INFO_REDUCED',
 'DEBUG_INFO_SPLIT',
 'X86_NEED_RELOCS',
 'RANDOMIZE_BASE',
 'UBSAN_SANITIZE_ALL',
 'KASAN',
 'KASAN_OUTLINE',
 'UBSAN_ALIGNMENT',
 'GCOV_PROFILE_ALL',
 'DRM_NOUVEAU',
 'XFS_DEBUG',
 'XFS_FS',
 'DRM_RADEON',
 'KCOV_INSTRUMENT_ALL',
 'DRM_AMDGPU',
 'BLK_MQ_PCI',
 'MAXSMP',
 'UBSAN_NULL']

Creating another Random Forest with the same hyperparameters to compare the feature lists:

In [11]:
# Setting hyperparameters for the Random Forest
reg = ensemble.RandomForestRegressor(n_estimators=48, max_depth=20, min_samples_split=10, n_jobs=8)

# Fitting the model
reg.fit(X_train, y_train)

# Predicting the testing set and computing the Mean Absolute Percentage Error (MAPE)
y_pred = reg.predict(X_test)
dfErrorsFold = pd.DataFrame({"% error":((y_pred - y_test)/y_test).abs()*100})

print("MAPE : ", dfErrorsFold["% error"].mean())

MAPE :  8.329455250921


In [12]:
feature_importance_2 = pd.Series(reg.feature_importances_, X_train.columns)
feature_importance_2.sort_values()

X86_LOCAL_APIC            0.000000
RT2500PCI                 0.000000
DWMAC_DWC_QOS_ETH         0.000000
ATH10K_SDIO               0.000000
ROADRUNNER_LARGE_RINGS    0.000000
                            ...   
DEBUG_INFO_SPLIT          0.084633
RANDOMIZE_BASE            0.088638
DEBUG_INFO_REDUCED        0.112856
active_options            0.190650
DEBUG_INFO                0.334011
Length: 9468, dtype: float64

If we compare the top 300 features of each model, fewer than half of them appear in both lists.

In [14]:
len(set(list(feature_importance.sort_values(ascending=False).index)[:300]).intersection(set(list(feature_importance_2.sort_values(ascending=False).index)[:300])))

130
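
The comparison above can be written as a small reusable helper. This is a sketch on invented importances; the name `top_k_overlap` and the features `F1` to `F4` are hypothetical. Note that `nlargest` keeps the first occurrence among tied values, which may order ties differently from `sort_values`.

```python
import pandas as pd

def top_k_overlap(imp_a, imp_b, k):
    """Number of features present in the top-k of both importance Series."""
    top_a = set(imp_a.nlargest(k).index)
    top_b = set(imp_b.nlargest(k).index)
    return len(top_a & top_b)

# Two invented importance vectors over the same four features.
a = pd.Series({"F1": 0.50, "F2": 0.30, "F3": 0.15, "F4": 0.05})
b = pd.Series({"F1": 0.45, "F2": 0.05, "F3": 0.20, "F4": 0.30})

print(top_k_overlap(a, b, 2))  # top-2 sets are {F1, F2} and {F1, F4}
print(top_k_overlap(a, b, 3))  # top-3 sets are {F1, F2, F3} and {F1, F4, F3}
```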

Despite being trained on the exact same dataset with the same hyperparameters, the two models do not produce a consistent list.
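
The difference between the two runs comes from the random draws made while building each forest, such as the bootstrap sample used for each tree. The sketch below uses synthetic data from `make_regression` (an assumption, not the kernel dataset) to show that the `random_state` parameter controls those draws. Fixing it makes one run repeatable, but it does not make the resulting ranking any more trustworthy.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the kernel configurations.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)

def importances(seed):
    # The seed controls the random draws made while fitting the forest.
    reg = RandomForestRegressor(n_estimators=10, random_state=seed)
    return reg.fit(X, y).feature_importances_

same = np.allclose(importances(1), importances(1))
different = not np.allclose(importances(1), importances(2))
print(same, different)
```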

A common way to reduce the impact of randomness on an experiment is to repeat the same operation multiple times and take the average.

Running 20 Random Forests:

In [18]:
importance_rows = []

for _ in range(20):
    reg = ensemble.RandomForestRegressor(n_estimators=48, max_depth=20, min_samples_split=10, n_jobs=8)

    reg.fit(X_train, y_train)

    y_pred = reg.predict(X_test)

    dfErrorsFold = pd.DataFrame({"% error":((y_pred - y_test)/y_test).abs()*100})
    print("MAPE", dfErrorsFold["% error"].mean())

    # Collect one row of importances per forest (DataFrame.append was removed in pandas 2.0)
    importance_rows.append(reg.feature_importances_)

# One row per forest, one column per feature
df_importance = pd.DataFrame(importance_rows, columns=X_train.columns)

MAPE 8.285511414502052
MAPE 8.274145726672739
MAPE 8.242216798686933
MAPE 8.244932803008837
MAPE 8.242479623075914
MAPE 8.29850871228274
MAPE 8.309497755535356
MAPE 8.252582547160882
MAPE 8.279412430914705
MAPE 8.27467161620697
MAPE 8.296036435366942
MAPE 8.287716557760033
MAPE 8.312320296409563
MAPE 8.253339028699319
MAPE 8.25276033316606
MAPE 8.306490357823227
MAPE 8.275159235832254
MAPE 8.272655252871742
MAPE 8.293269522151649
MAPE 8.29017433648496


In [20]:
df_importance

,X86_LOCAL_APIC,OPENVSWITCH,TEXTSEARCH_FSM,NETFILTER_XT_MATCH_TCPMSS,MPLS,NFC_HCI,NETFILTER_XT_MATCH_TIME,NET_MPLS_GSO,NFC_SHDLC,NETFILTER_XT_MATCH_U32,...,APDS9960,ARCH_SUPPORTS_INT128,SLABINFO,MICROCODE_AMD,ISDN_DRV_HISAX,CHARGER_BQ24190,SND_SOC_NAU8825,BH1750,NETWORK_FILESYSTEMS,active_options
0,1.94777e-08,2.711327e-06,1.421777e-07,1.819928e-06,8.6e-05,1.882985e-05,8.295066e-07,2e-05,1.515036e-07,1.904537e-06,...,7e-06,1.43532e-08,4e-06,1.004575e-06,7.572362e-08,8e-06,2e-06,6.802712e-06,2.2e-05,0.190339
1,1.15361e-08,1.032557e-05,3.966991e-07,2.237942e-05,4.8e-05,1.055633e-05,0.0,5e-06,5.804795e-07,2.178889e-08,...,4e-05,0.0,1.3e-05,3.04646e-06,2.990701e-07,1.7e-05,1e-06,3.082698e-06,1.4e-05,0.189601
2,0.0,3.704904e-05,3.995143e-06,1.099438e-09,0.000114,1.253737e-05,0.0,7e-06,1.048344e-05,4.418103e-08,...,6.8e-05,1.139941e-08,4e-06,3.743083e-06,2.548254e-08,1.3e-05,2e-06,7.553984e-07,9e-06,0.188627
3,2.778242e-08,6.919974e-06,1.065871e-07,5.452853e-08,5.1e-05,5.265264e-06,0.0,7e-06,2.684252e-06,2.216115e-08,...,2.5e-05,0.0,3e-06,5.998434e-06,3.07944e-08,1e-05,2e-06,3.046843e-06,2.2e-05,0.189715
4,3.306345e-08,7.985219e-05,1.335731e-07,0.0,0.000132,5.909183e-07,0.0,5e-06,3.887822e-06,8.63615e-09,...,3.1e-05,0.0,5e-06,7.140426e-06,2.454632e-08,1.7e-05,2e-06,2.469203e-06,3.4e-05,0.188967
5,0.0,4.307789e-05,7.111024e-08,9.971875e-09,7.5e-05,7.542196e-06,0.0,1.5e-05,7.721866e-07,3.961225e-09,...,3.9e-05,1.061883e-08,1.7e-05,2.969906e-05,1.136279e-07,5e-06,2e-06,3.999839e-06,3.2e-05,0.188504
6,0.0,4.773652e-05,1.305358e-07,2.772411e-08,2.1e-05,4.478896e-06,2.495846e-08,2e-06,2.038826e-06,2.916973e-08,...,2.4e-05,1.084466e-08,3e-06,3.755865e-06,8.237768e-09,2.9e-05,3e-06,2.0744e-06,2.2e-05,0.189207
7,1.584593e-08,1.688e-05,7.127066e-08,8.882972e-09,0.00011,6.217624e-06,2.131213e-08,9e-06,2.461935e-06,3.321757e-06,...,1.9e-05,7.974841e-09,1.4e-05,4.614817e-06,7.286357e-09,2.1e-05,2e-06,4.135085e-06,5e-06,0.189837
8,0.0,2.351858e-05,8.218472e-08,5.528946e-07,4.7e-05,9.299145e-06,0.0,4e-06,2.53119e-07,8.179744e-08,...,2.8e-05,2.634271e-08,7e-06,1.510898e-06,1.130277e-08,2e-06,2e-06,4.302821e-06,4.1e-05,0.187515
9,0.0,5.638109e-05,1.273168e-07,5.862267e-08,6.4e-05,8.311707e-06,1.617544e-08,1.2e-05,2.619293e-06,4.555317e-08,...,2.5e-05,9.290595e-09,3e-06,5.52794e-06,3.843503e-09,1.1e-05,2e-06,2.442374e-06,2.6e-05,0.189877


Adding a row holding the average feature importance over all Random Forests.

In [21]:
df_importance.loc["mean"] = df_importance.mean()
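
As an illustration of this step, here is the same assignment on three invented runs over three hypothetical features (`F1`, `F2`, `F3`).

```python
import pandas as pd

# Three invented runs (rows) over three hypothetical features (columns).
runs = pd.DataFrame(
    [[0.6, 0.3, 0.1],
     [0.5, 0.4, 0.1],
     [0.7, 0.2, 0.1]],
    columns=["F1", "F2", "F3"],
)

# mean() averages each column over the runs; .loc["mean"] stores it as a new row.
runs.loc["mean"] = runs.mean()
print(runs)
```

The new row is labelled `"mean"` and sits next to the integer-labelled runs, which is why the ranking column derived from it is later called `ranking-mean`.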

Creating a rank for each feature instead of relying on the raw feature importance value:

In [22]:
df_values = df_importance.T
for i in df_values.columns:
    # rank() aligns on the index, so no prior sort is needed; rank 1 is the most important feature
    df_values["ranking-"+str(i)] = df_values[i].rank(method="min", ascending=False)
df_values.sort_values("ranking-mean")

,0,1,2,3,4,5,6,7,8,9,...,ranking-11,ranking-12,ranking-13,ranking-14,ranking-15,ranking-16,ranking-17,ranking-18,ranking-19,ranking-mean
DEBUG_INFO,0.332192,0.333460,0.333127,0.334529,0.334833,0.334251,0.333904,0.333460,0.332828,0.333958,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
active_options,0.190339,0.189601,0.188627,0.189715,0.188967,0.188504,0.189207,0.189837,0.187515,0.189877,...,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
DEBUG_INFO_REDUCED,0.114519,0.113372,0.114125,0.112747,0.112122,0.112506,0.112742,0.113206,0.114613,0.112829,...,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
DEBUG_INFO_SPLIT,0.085347,0.085117,0.085581,0.084328,0.084896,0.085363,0.084545,0.085443,0.087078,0.084405,...,4.0,4.0,4.0,4.0,5.0,4.0,4.0,4.0,4.0,4.0
RANDOMIZE_BASE,0.065824,0.077844,0.081720,0.079105,0.069965,0.077121,0.078492,0.070109,0.088558,0.073020,...,5.0,5.0,5.0,6.0,4.0,5.0,5.0,5.0,5.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
SND_SBAWE_SEQ,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,8953.0,8958.0,8952.0,8969.0,8949.0,8950.0,8940.0,8946.0,8966.0,9334.0
MIXCOMWD,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,8953.0,8958.0,8952.0,8969.0,8949.0,8950.0,8940.0,8946.0,8966.0,9334.0
PCWATCHDOG,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,8953.0,8958.0,8952.0,8969.0,8949.0,8950.0,8940.0,8946.0,8966.0,9334.0
SND_SB8,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,8953.0,8958.0,8952.0,8969.0,8949.0,8950.0,8940.0,8946.0,8966.0,9334.0
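
The `rank(method="min", ascending=False)` call used above gives tied features the same rank, namely the lowest rank of the tied group. This is why all zero-importance features share a single rank at the bottom of the table. A minimal sketch on invented values (features `F1` to `F5` are hypothetical):

```python
import pandas as pd

# Invented importances with two ties: F2/F3 and F4/F5.
imp = pd.Series({"F1": 0.5, "F2": 0.2, "F3": 0.2, "F4": 0.0, "F5": 0.0})

# ascending=False gives rank 1 to the largest value; method="min" shares the lowest rank among ties.
ranks = imp.rank(method="min", ascending=False)
print(ranks.to_dict())
# {'F1': 1.0, 'F2': 2.0, 'F3': 2.0, 'F4': 4.0, 'F5': 4.0}
```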


Here is the resulting feature ranking list; we display its top 20:

In [23]:
feature_ranking_list = list(df_values.sort_values("ranking-mean")["ranking-mean"].index)
feature_ranking_list[:20]

['DEBUG_INFO',
 'active_options',
 'DEBUG_INFO_REDUCED',
 'DEBUG_INFO_SPLIT',
 'RANDOMIZE_BASE',
 'X86_NEED_RELOCS',
 'UBSAN_SANITIZE_ALL',
 'KASAN',
 'KASAN_OUTLINE',
 'UBSAN_ALIGNMENT',
 'GCOV_PROFILE_ALL',
 'DRM_NOUVEAU',
 'XFS_DEBUG',
 'XFS_FS',
 'KCOV_INSTRUMENT_ALL',
 'UBSAN_NULL',
 'MAXSMP',
 'DRM_RADEON',
 'DRM_AMDGPU',
 'BLK_MQ_PCI']

Repeating the operation with 20 new Random Forests:

In [19]:
importance_rows_2 = []

for _ in range(20):
    reg = ensemble.RandomForestRegressor(n_estimators=48, max_depth=20, min_samples_split=10, n_jobs=8)

    reg.fit(X_train, y_train)

    y_pred = reg.predict(X_test)

    dfErrorsFold = pd.DataFrame({"% error":((y_pred - y_test)/y_test).abs()*100})
    print("MAPE", dfErrorsFold["% error"].mean())

    # Collect one row of importances per forest (DataFrame.append was removed in pandas 2.0)
    importance_rows_2.append(reg.feature_importances_)

# One row per forest, one column per feature
df_importance_2 = pd.DataFrame(importance_rows_2, columns=X_train.columns)

MAPE 8.287202180072523
MAPE 8.268799452000907
MAPE 8.269959063607905
MAPE 8.230731081761608
MAPE 8.25857992098621
MAPE 8.272208631122407
MAPE 8.256220039656345
MAPE 8.27052677409201
MAPE 8.309480808728713
MAPE 8.258349634800531
MAPE 8.321599424168372
MAPE 8.266521859342685
MAPE 8.242671467628158
MAPE 8.263093950175165
MAPE 8.289319022203797
MAPE 8.244905469823037
MAPE 8.27170823500447
MAPE 8.297003691270122
MAPE 8.291465965584294
MAPE 8.284461482230416


In [24]:
df_importance_2.loc["mean"] = df_importance_2.mean()

df_values = df_importance_2.T
for i in df_values.columns:
    # rank() aligns on the index, so no prior sort is needed; rank 1 is the most important feature
    df_values["ranking-"+str(i)] = df_values[i].rank(method="min", ascending=False)
df_values.sort_values("ranking-mean")

feature_ranking_list_2 = list(df_values.sort_values("ranking-mean")["ranking-mean"].index)
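
Sorting by `ranking-mean` amounts to ordering the features by decreasing mean importance. As a sketch, the averaging and ranking steps repeated in this notebook could be factored into one helper. The name `ranking_from_runs` and the toy values are hypothetical, and the order of tied features may differ from the cells above.

```python
import pandas as pd

def ranking_from_runs(df_runs):
    """Order features by decreasing mean importance (rows = runs, columns = features)."""
    mean_importance = df_runs.mean()
    # A stable sort is requested so that tied features are ordered deterministically.
    return list(mean_importance.sort_values(ascending=False, kind="stable").index)

# Two invented runs over three hypothetical features.
runs = pd.DataFrame([[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]], columns=["F1", "F2", "F3"])
ranking = ranking_from_runs(runs)
print(ranking)
# ['F2', 'F3', 'F1']
```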


On the top 300, the two averaged lists are much more consistent:

In [25]:
len(set(feature_ranking_list[:300]).intersection(set(feature_ranking_list_2[:300])))

249

Exporting the feature ranking list:

In [27]:
import json
with open("feature_ranking_list.json","w") as f:
    json.dump(feature_ranking_list, f)
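
Because the list is stored as a plain JSON array, its order survives a round trip through the file. A minimal check, written to a temporary directory so that no real file is touched; the three-element list is only an excerpt used for illustration.

```python
import json
import os
import tempfile

# Excerpt of a ranking, used here only as sample data.
ranking = ["DEBUG_INFO", "active_options", "DEBUG_INFO_REDUCED"]

# A temporary directory keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_ranking_list.json")
    with open(path, "w") as f:
        json.dump(ranking, f)
    with open(path) as f:
        loaded = json.load(f)

print(loaded == ranking)
```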

Repeating the process for the 4.15 dataset:

In [28]:
df_415 = pd.read_pickle("datasets/dataset_415.pkl")

In [29]:
train_size = 0.9
X_train, X_test, y_train, y_test = train_test_split(df_415.drop(columns=size_columns+["cid"], errors="ignore"), df_415["vmlinux"], train_size=train_size)

In [36]:
importance_rows_415 = []

for _ in range(20):
    reg = ensemble.RandomForestRegressor(n_estimators=48, max_depth=20, min_samples_split=10, n_jobs=8)

    reg.fit(X_train, y_train)

    y_pred = reg.predict(X_test)

    dfErrorsFold = pd.DataFrame({"% error":((y_pred - y_test)/y_test).abs()*100})
    print("MAPE", dfErrorsFold["% error"].mean())

    # Collect one row of importances per forest (DataFrame.append was removed in pandas 2.0)
    importance_rows_415.append(reg.feature_importances_)

# One row per forest, one column per feature
df_importance_415 = pd.DataFrame(importance_rows_415, columns=X_train.columns)

MAPE 9.679986650260258
MAPE 9.745662923246531
MAPE 9.64720082211712
MAPE 9.676742159201646
MAPE 9.69135123212296
MAPE 9.712152074772114
MAPE 9.763815968122232
MAPE 9.738455313284804
MAPE 9.736208395714042
MAPE 9.659228226023227
MAPE 9.705100769878776
MAPE 9.749813265843999
MAPE 9.71220417157918
MAPE 9.732416282881617
MAPE 9.699609358357515
MAPE 9.711474810673192
MAPE 9.682008213397783
MAPE 9.716561093573524
MAPE 9.675108662698907
MAPE 9.716774405109081


In [37]:
df_importance_415.loc["mean"] = df_importance_415.mean()

df_values = df_importance_415.T
for i in df_values.columns:
    # rank() aligns on the index, so no prior sort is needed; rank 1 is the most important feature
    df_values["ranking-"+str(i)] = df_values[i].rank(method="min", ascending=False)
df_values.sort_values("ranking-mean")

feature_ranking_list_415 = list(df_values.sort_values("ranking-mean")["ranking-mean"].index)


In [38]:
with open("feature_ranking_list_415.json","w") as f:
    json.dump(feature_ranking_list_415, f)