## Goal: compare the documentation with the data collected from kernel compilations

--

Can we recover the information given by the documentation from the dataset?

### Results: for 51 of the 73 comparable features, the documentation agrees with the data => DEAD END

### TODO:
- check if each modality is well represented => done, it's worse
- read the doc again?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tuxml
from scipy.stats import chi2

### Import the different datasets

#### I - Extracted from the documentation:

In [2]:
doc = pd.read_csv(r'C:\Users\llesoil\Documents\tuxml\features_details.csv', index_col=0)
doc.head()

| features | keyword | yes_increase_size | size_diff |
|---|---|---|---|
| ZRAM | compress | False | |
| KERNEL_GZIP | compress | False | |
| KERNEL_BZIP2 | compress | False | |
| KERNEL_LZMA | compress | False | |
| KERNEL_XZ | compress | False | |


- features => name of the feature
- keyword => which csv gives the result (useless here)
- yes_increase_size => boolean answering the question "Does the kernel size increase (True) or decrease (False) if I activate this feature?"
- size_diff => if the documentation mentions it, the size difference expected between a kernel with and without this feature activated

In [3]:
feat_doc = np.array(doc.index)

#### II - Collected data:

In [4]:
#features = pd.read_csv("dataset_before_encoding.csv", low_memory = False)

In [5]:
features = tuxml.load_dataset()
#features = features.replace([2,-1],0)

#### III - Some tests/cleaning

##### a] Are all the features from the documentation in our dataset?

In [6]:
all_features = features.columns
f_in_dataset = []
for f in feat_doc:
    if f in all_features:
        f_in_dataset.append(f)
print(int(len(f_in_dataset)/len(feat_doc)*100), "%  of the features of the documentation are present in the dataset")

77 %  of the features of the documentation are present in the dataset
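
The same check can be sketched with `Index.isin`, which also makes it easy to list the documented features that are absent from the dataset. This is a minimal sketch on hypothetical stand-ins: `doc_features` plays the role of `doc.index`, `dataset_columns` the role of `features.columns`, and `NOT_IN_DATASET` is an invented name.

```python
import pandas as pd

# Hypothetical stand-ins for doc.index and features.columns
doc_features = pd.Index(["ZRAM", "KERNEL_GZIP", "NOT_IN_DATASET"])
dataset_columns = pd.Index(["ZRAM", "KERNEL_GZIP", "vmlinux"])

# Boolean mask: True where a documented feature is also a dataset column
in_dataset = doc_features.isin(dataset_columns)

present = doc_features[in_dataset]    # documented features we can study
missing = doc_features[~in_dataset]   # documented features we cannot study
coverage = 100 * len(present) / len(doc_features)

print(int(coverage), "% of the documented features are present")
print("missing:", list(missing))
```

Printing `missing` on the real data would show which documented features the 77 % figure leaves out.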


In [7]:
features_common = features[f_in_dataset]

We keep these common features in a new dataset.

##### b] We remove all the features which aren't well represented (lack of 'no'/'module' values)

We define our uniformity test. It is deliberately very loose so that enough features survive: we keep every feature whose distribution over the three modalities is even vaguely uniform.
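
Written out, the statistic used by the test below is Pearson's chi-square distance to a uniform split. The symbols $O_y$, $O_n$, $O_m$ (observed counts of 'yes', 'no' and 'module') are notation introduced here for illustration; $n$ is the number of kernels.

$$
D = \sum_{k \in \{y,\, n,\, m\}} \frac{\left(O_k - \frac{n}{3}\right)^2}{\frac{n}{3}},
\qquad \text{keep the feature if } D < 1.5 \, \chi^2_{0.95}(n-1)
$$

As a worked case, a feature that takes the same value in every kernel has one count equal to $n$ and two equal to $0$, so

$$
D = \frac{\left(n - \frac{n}{3}\right)^2 + 2\left(\frac{n}{3}\right)^2}{\frac{n}{3}} = \frac{4n}{3} + \frac{2n}{3} = 2n .
$$

Note that the textbook test for three categories would use $3 - 1 = 2$ degrees of freedom; using $n - 1$ and the factor $1.5$ is what makes the threshold so permissive.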

In [8]:
n = features_common.shape[0]
col = features_common.columns

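def is_nearly_uniform(feature_name):
    """Loose Pearson chi-square test against a uniform yes/no/module split."""
    my_feature = features_common[feature_name]
    # Encoding: 1 = 'yes', 0 = 'no', 2 = 'module'
    sum_yes = sum(my_feature==1)
    sum_no = sum(my_feature==0)
    sum_module = sum(my_feature==2)
    # Under uniformity, each modality is expected n/3 times
    d_chi_square = ((sum_yes-n/3)**2 + (sum_no-n/3)**2 + (sum_module-n/3)**2)/(n/3)
    #d_g_test = 2*(sum_yes*np.log(sum_yes/(n/3)) + sum_no*np.log(sum_no/(n/3)) + sum_module*np.log(sum_module/(n/3)))
    # NB: n-1 degrees of freedom (a textbook test on 3 categories would use 2)
    # and the 1.5 factor keep the threshold deliberately loose
    return d_chi_square < 1.5*chi2.ppf(0.95, n-1)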

unif = []
for i in range(len(col)):
    if is_nearly_uniform(col[i]):
        unif.append(i)

In [9]:
len(unif)

73
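
To get a feel for how permissive this threshold is, here is a minimal sketch on hypothetical toy data (the column names and counts are invented, with 0 = 'no', 1 = 'yes', 2 = 'module' as in the cells above). It computes the same statistic with `value_counts` and compares it with both the loose threshold used here and the textbook one with 2 degrees of freedom.

```python
import pandas as pd
from scipy.stats import chi2

# Hypothetical toy dataset with 300 kernels
toy = pd.DataFrame({
    "BALANCED": [0, 1, 2] * 100,                   # 100 of each modality
    "MOSTLY_NO": [0] * 261 + [1] * 24 + [2] * 15,  # skewed, but all modalities present
    "ALWAYS_NO": [0] * 300,                        # a single modality
})

def chi_square_uniform(column, levels=(0, 1, 2)):
    """Pearson statistic of `column` against a uniform split over `levels`."""
    counts = column.value_counts().reindex(list(levels), fill_value=0).to_numpy()
    expected = len(column) / len(levels)
    return float(((counts - expected) ** 2).sum() / expected)

d_values = {name: chi_square_uniform(toy[name]) for name in toy.columns}

n_toy = len(toy)
loose_threshold = 1.5 * chi2.ppf(0.95, n_toy - 1)  # threshold used in this notebook
textbook_threshold = chi2.ppf(0.95, 2)             # 3 categories => 2 degrees of freedom

for name, d in d_values.items():
    print(name, round(d, 2), "kept (loose):", d < loose_threshold,
          "| kept (textbook):", d < textbook_threshold)
```

On this toy data the skewed `MOSTLY_NO` column passes the loose test but would be rejected by the textbook one, while `ALWAYS_NO` is rejected by both.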

#### A few examples of poor representation:

In [10]:
num_f = 1
print("feature", num_f)
my_feature = features_common[features_common.columns[num_f]]
sum_yes = sum(my_feature==1)
print("yes :",sum_yes)
sum_no = sum(my_feature==0)
print("no :",sum_no)
sum_module = sum(my_feature==2)
print("module :",sum_module)
print("should we keep it?", is_nearly_uniform(features_common.columns[num_f]))

feature 1
yes : 0
no : 92471
module : 0
should we keep it? False


In [11]:
num_f = 55
print("feature", num_f)
my_feature = features_common[features_common.columns[num_f]]
sum_yes = sum(my_feature==1)
print("yes :",sum_yes)
sum_no = sum(my_feature==0)
print("no :",sum_no)
sum_module = sum(my_feature==2)
print("module :",sum_module)
print("should we keep it?", is_nearly_uniform(features_common.columns[num_f]))

feature 55
yes : 92044
no : 427
module : 0
should we keep it? False


#### A few "good enough" examples

In [12]:
num_f = 0
print("feature", num_f)
my_feature = features_common[features_common.columns[num_f]]
sum_yes = sum(my_feature==1)
print("yes :",sum_yes)
sum_no = sum(my_feature==0)
print("no :",sum_no)
sum_module = sum(my_feature==2)
print("module :",sum_module)
print("should we keep it?", is_nearly_uniform(features_common.columns[num_f]))

feature 0
yes : 7538
no : 80425
module : 4508
should we keep it? True


In [13]:
num_f = 14
print("feature", num_f)
my_feature = features_common[features_common.columns[num_f]]
sum_yes = sum(my_feature==1)
print("yes :",sum_yes)
sum_no = sum(my_feature==0)
print("no :",sum_no)
sum_module = sum(my_feature==2)
print("module :",sum_module)
print("should we keep it?", is_nearly_uniform(features_common.columns[num_f]))

feature 14
yes : 22911
no : 69560
module : 0
should we keep it? True
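
The four inspection cells above repeat the same counting code. A small helper could replace them; the sketch below uses a hypothetical toy frame (`FEAT_A`, `FEAT_B`) in place of `features_common`, and `modality_counts` is a name invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for features_common
# (0 = 'no', 1 = 'yes', 2 = 'module')
toy = pd.DataFrame({"FEAT_A": [1, 0, 0, 2, 0], "FEAT_B": [0, 0, 0, 0, 0]})

def modality_counts(frame, position):
    """Return the yes/no/module counts of the column at integer `position`."""
    column = frame.iloc[:, position]
    # reindex guarantees that absent modalities show up with a count of 0
    counts = column.value_counts().reindex([1, 0, 2], fill_value=0)
    return {"yes": int(counts.loc[1]),
            "no": int(counts.loc[0]),
            "module": int(counts.loc[2])}

for position in range(toy.shape[1]):
    print("feature", position, modality_counts(toy, position))
```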


#### Then we keep the rest

In [14]:
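# .copy() gives an independent frame, so adding a column below
# does not trigger pandas' SettingWithCopyWarning
features_common = features_common[features_common.columns[unif]].copy()

# we add the kernel size
features_common['vmlinux'] = features['vmlinux']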

In [15]:
len(features_common.columns)

74

### Comparison doc vs data : Does the feature increase the kernel size?

##### I - On the first example:

In [16]:
f_com = features_common.columns
# drop the last column ('vmlinux'), which is the target and not a feature
f_com = f_com[:-1]

size_mod = features_common.groupby(f_com[0])['vmlinux'].mean()
size_mod

ZRAM
0    4.899161e+07
1    6.269718e+07
2    3.960957e+07
Name: vmlinux, dtype: float64

For each feature, we compute the average kernel size for each modality ('y', 'm', 'n').
We keep the modality corresponding to the largest average.

In [17]:
size_mod.index[np.argmax(np.array(size_mod))]

1

In our case, the 'yes' modality has the largest average size, so activating the feature 'ZRAM' is correlated with a bigger kernel. This does not prove causality (i.e. that choosing 'yes' always yields a bigger kernel).

At the very least, we want to find correlations that correspond to the documentation.

We apply the same computation to the rest of the dataset:

In [18]:
mod = []
for f in f_com:
    size_mod = features_common.groupby(f)['vmlinux'].mean()
    mod.append(size_mod.index[np.argmax(np.array(size_mod))])
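
The loop above can also be sketched with `Series.idxmax`, which returns the index label of the largest value directly. The toy data below is hypothetical (invented feature names and sizes), with the same 0/1/2 encoding as the dataset.

```python
import pandas as pd

# Hypothetical toy data; the 'vmlinux' sizes are made up for illustration
toy = pd.DataFrame({
    "FEAT_A": [0, 0, 1, 1, 2, 2],
    "FEAT_B": [0, 1, 0, 1, 0, 1],
    "vmlinux": [44, 40, 64, 60, 34, 30],
})

feature_names = [c for c in toy.columns if c != "vmlinux"]

# For each feature: mean size per modality, then the modality with the largest mean
biggest = {f: toy.groupby(f)["vmlinux"].mean().idxmax() for f in feature_names}
print(biggest)
```

Here `FEAT_A` has its largest mean size for modality 1 and `FEAT_B` for modality 0.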

We join the results with the documentation dataframe:

In [19]:
# Build the frame from a dict so that biggest_size_mod keeps its integer dtype
res = (pd.DataFrame({"features": f_com, "biggest_size_mod": mod})
       .set_index("features")
       .join(doc))
res.head()

| features | biggest_size_mod | keyword | yes_increase_size | size_diff |
|---|---|---|---|---|
| ZRAM | 1 | compress | False | |
| MODULE_COMPRESS | 0 | compress | False | |
| ZPOOL | 1 | compress | False | |
| ZBUD | 1 | compress | False | |
| Z3FOLD | 1 | compress | False | |


### Results

In [20]:
print("For", np.sum((res["biggest_size_mod"]==1)==res["yes_increase_size"]), "of the", res.shape[0],
      "features we can compare, the documentation agrees with the dataset.")

For 51 of the 73 features we can compare, the documentation agrees with the dataset.
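
To see which features disagree rather than only how many, the comparison can be kept as a boolean Series. The sketch below reuses only the five rows displayed by `res.head()`; the other 68 rows are not reproduced, so the per-row result says nothing about the full dataset. The overall rate is just the arithmetic 51/73.

```python
import pandas as pd

# The five rows shown by res.head(); the remaining rows are not reproduced here
sample = pd.DataFrame(
    {
        "biggest_size_mod": [1, 0, 1, 1, 1],
        "yes_increase_size": [False, False, False, False, False],
    },
    index=pd.Index(["ZRAM", "MODULE_COMPRESS", "ZPOOL", "ZBUD", "Z3FOLD"],
                   name="features"),
)

# True where "biggest average size is for 'yes'" matches the documentation's claim
agrees = (sample["biggest_size_mod"] == 1) == sample["yes_increase_size"]
disagreeing = sample[~agrees].index.tolist()

# Agreement rate reported for the full comparison
overall_rate = 51 / 73

print("disagreeing features in the sample:", disagreeing)
print("overall agreement: {:.0%}".format(overall_rate))
```

On these five rows only `MODULE_COMPRESS` agrees with the documentation; over the whole comparison the agreement is about 70 %.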
