## Goal: compare the documentation with the data collected from kernel compilations

--

Can we recover the information given in the documentation from the dataset?

### Results: for 70 of the 107 comparable features, the dataset agrees with the documentation

### TODO:
- check if each modality is well represented
- read the doc again?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Import the different datasets

#### I - Extracted from the documentation:

In [2]:
doc = pd.read_csv(r'C:\Users\llesoil\Documents\tuxml\features_details.csv', index_col=0)
doc.head()

features,keyword,yes_increase_size,size_diff
ZRAM,compress,False,
KERNEL_GZIP,compress,False,
KERNEL_BZIP2,compress,False,
KERNEL_LZMA,compress,False,
KERNEL_XZ,compress,False,


- features => name of the feature
- keyword => the CSV file the result comes from (not needed here)
- yes_increase_size => boolean answering the question "Does the kernel size increase (True) or decrease (False) if I activate this feature?"
- size_diff => if it is mentioned in the documentation, the size difference expected between a kernel with and without this feature activated

In [3]:
feat_doc = np.array(doc.index)

#### II - Collected data:

In [4]:
features = pd.read_csv("dataset_before_encoding.csv", low_memory = False)

#### III - Some tests/cleaning

##### a] Are all the features from the documentation in our dataset?

In [5]:
all_features = features.columns
f_in_dataset = []
for f in feat_doc:
    if f in all_features:
        f_in_dataset.append(f)
print(int(len(f_in_dataset)/len(feat_doc)*100), "%  of the features of the documentation are present in the dataset")

78 %  of the features of the documentation are present in the dataset
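
To see which documented features are missing, and not only how many, the same check can be written with pandas index set operations. This is a minimal sketch on made-up stand-ins: `feat_doc_demo` and `dataset_columns_demo` are hypothetical replacements for `feat_doc` and `features.columns`, and `NOT_IN_DATASET` is an invented name.

```python
import pandas as pd

# Hypothetical stand-ins for feat_doc and features.columns
feat_doc_demo = pd.Index(["ZRAM", "KERNEL_GZIP", "NOT_IN_DATASET"])
dataset_columns_demo = pd.Index(["ZRAM", "KERNEL_GZIP", "KERNEL_XZ", "vmlinux"])

# Documented features that are also columns of the dataset
present = feat_doc_demo.intersection(dataset_columns_demo)
# Documented features with no matching column in the dataset
missing = feat_doc_demo.difference(dataset_columns_demo)

coverage = 100 * len(present) / len(feat_doc_demo)
print(f"{coverage:.0f} % of the documented features are present in the dataset")
print("Missing features:", list(missing))
```

Listing the missing names may help decide whether they are spelled differently in the dataset or are simply absent.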


In [20]:
features_common = features[f_in_dataset]

We keep these features in a new DataFrame.

##### b] We remove all the features with the wrong type (i.e. anything other than strings)

In [21]:
wrong_type_f = features_common.columns[np.where(features_common.dtypes != object)]
features_common = features_common.drop(wrong_type_f , axis=1)
features_common.head()

,ZRAM,KERNEL_GZIP,KERNEL_BZIP2,KERNEL_LZMA,KERNEL_XZ,KERNEL_LZO,KERNEL_LZ4,INITRAMFS_COMPRESSION_NONE,INITRAMFS_COMPRESSION_GZIP,INITRAMFS_COMPRESSION_BZIP2,...,XEN_STUB,CAN_CALC_BITTIMING,ADVISE_SYSCALLS,SLOB,BATMAN_ADV_BLA,BATMAN_ADV_DAT,BATMAN_ADV_NC,SYSFS,SYSTEM_EXTRA_CERTIFICATE,X86_FEATURE_NAMES
0,n,n,n,n,n,n,y,n,n,n,...,n,n,n,n,n,n,n,y,n,y
1,n,n,n,n,n,n,y,n,n,n,...,n,n,n,n,n,n,n,y,n,y
2,y,n,n,n,n,n,y,n,n,n,...,n,n,y,n,n,n,n,y,n,y
3,n,n,n,n,n,n,y,n,n,n,...,n,n,n,n,n,n,n,y,n,y
4,n,n,n,n,n,n,y,n,n,n,...,n,y,n,n,n,n,n,y,n,n
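
A possible alternative to indexing with `np.where` is `DataFrame.select_dtypes`. The sketch below runs on a toy DataFrame; `SOME_NUMERIC_OPTION` is a hypothetical column name, not one from the dataset. Note that it excludes numeric columns rather than keeping only `object` ones, so it is a close variant of the filter above, not an exact equivalent.

```python
import pandas as pd

# Toy data: two string-valued features and one hypothetical numeric feature
demo = pd.DataFrame({
    "ZRAM": ["n", "y", "m"],
    "KERNEL_XZ": ["n", "n", "y"],
    "SOME_NUMERIC_OPTION": [0, 64, 128],
})

# Drop every numeric column and keep the rest
string_only = demo.select_dtypes(exclude="number")
print(list(string_only.columns))
```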


##### c] We remove all the features that never take the 'y' modality or that only take the 'y' value

In [22]:
to_drop_f = []

for f in features_common.columns:
    tab = features_common[f]
    unique_values = tab.unique()
    if 'y' not in unique_values or len(unique_values) == 1:
        to_drop_f.append(f)

features_common = features_common.drop(to_drop_f , axis=1)

# we add the kernel size
features_common['vmlinux'] = features['vmlinux']
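
The same filter can be sketched without an explicit loop, using boolean masks over the columns. The toy DataFrame and its column names (`ALWAYS_YES`, `NEVER_YES`, `MIXED`) are invented for illustration; `dropna=False` is used on the assumption that a missing value should count as a distinct value, as it does with `unique()`.

```python
import pandas as pd

demo = pd.DataFrame({
    "ALWAYS_YES": ["y", "y", "y"],  # a single value: dropped
    "NEVER_YES": ["n", "m", "n"],   # never takes 'y': dropped
    "MIXED": ["y", "n", "m"],       # kept
})

# True for the columns where 'y' appears at least once
has_yes = (demo == "y").any()
# True for the columns with more than one distinct value
varies = demo.nunique(dropna=False) > 1

kept = demo.loc[:, has_yes & varies]
print(list(kept.columns))
```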

### Comparison doc vs data: does the feature increase the kernel size?

##### I - On the first example:

In [31]:
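# all the columns except the last one ('vmlinux', the kernel size)
f_com = features_common.columns[:-1]

# average kernel size for each modality of the first feature;
# selecting 'vmlinux' explicitly avoids averaging the string columns
size_mod = features_common.groupby(f_com[0])[['vmlinux']].mean()
size_mod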

ZRAM,vmlinux
m,43568160.0
n,55431320.0
y,69300370.0


For each feature, we compute the average kernel size for each modality ('y', 'm', 'n').
We keep the modality corresponding to the largest average.

In [32]:
size_mod.index[np.argmax(np.array(size_mod))]

'y'

In our case, the 'y' modality has a larger average size than the others, which means that activating the feature 'ZRAM' is correlated with a bigger kernel size. This does not prove causality (i.e. that choosing 'y' always implies a bigger size).

At the very least, we want to find correlations corresponding to the documentation.
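
Since a mean computed over very few kernels is fragile, it may help to look at how many configurations fall into each modality next to the average size. This is a minimal sketch on made-up numbers, not values from the dataset.

```python
import pandas as pd

# Toy data: one feature and invented kernel sizes
demo = pd.DataFrame({
    "ZRAM": ["y", "n", "n", "m", "n", "y"],
    "vmlinux": [70, 50, 60, 40, 55, 66],
})

# Average size and number of configurations for each modality
summary = demo.groupby("ZRAM")["vmlinux"].agg(["mean", "count"])
print(summary)

# Modality with the largest average size
print(summary["mean"].idxmax())
```

A modality with a very small `count` (here 'm', with a single kernel) gives little support to any conclusion about its average size.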

We repeat this for the rest of the features:

In [33]:
mod = []
for f in f_com:
    # average kernel size for each modality of feature f
    size_mod = features_common.groupby(f)['vmlinux'].mean()
    # modality with the largest average size
    mod.append(size_mod.idxmax())

We join the results with the documentation DataFrame.

In [34]:
res = pd.DataFrame(np.transpose([f_com ,  mod]), columns = ["features", "biggest_size_mod"]).set_index("features").join(doc)
res.head()

features,biggest_size_mod,keyword,yes_increase_size,size_diff
ZRAM,y,compress,False,
MODULE_COMPRESS,n,compress,False,
ZSWAP,y,compress,False,
ZPOOL,y,compress,False,
ZBUD,y,compress,False,


### Results

In [40]:
print("For", np.sum((res["biggest_size_mod"]=='y')==res["yes_increase_size"]), "features on the", res.shape[0], 
      "we can compare, the documentation corresponds to the dataset.")

For 70 features on the 107 we can compare, the documentation corresponds to the dataset.
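
Beyond the raw count, the comparison can also report the agreement rate and list the features where documentation and data disagree. The sketch below uses a small invented table `res_demo` with the same two columns as `res`; the feature names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for res
res_demo = pd.DataFrame(
    {
        "biggest_size_mod": ["y", "n", "y", "m"],
        "yes_increase_size": [True, True, False, False],
    },
    index=pd.Index(
        ["FEATURE_A", "FEATURE_B", "FEATURE_C", "FEATURE_D"], name="features"
    ),
)

# The data suggests an increase when 'y' has the largest average size
data_says_increase = res_demo["biggest_size_mod"] == "y"
# Agreement between the data and the documentation, feature by feature
agrees = data_says_increase == res_demo["yes_increase_size"]

print(f"{agrees.sum()} / {len(res_demo)} features agree ({agrees.mean():.0%})")
print("Disagreements:", list(res_demo.index[~agrees]))
```

Inspecting the disagreements one by one is a natural starting point for the "read the doc again?" item of the TODO list.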
