## Goal: patch the Linux documentation

The patch documents the effect of each configuration option on the binary size of the Linux kernel.

The goal is to produce a text following a structure like this on a per-option basis:

    ----
    OPTION_NAME
    ----

    OPTION_NAME has a [negligible|weak|strong] effect on the binary size of the resulting kernel.
    In general, activating OPTION_NAME [increases|reduces] the binary size by XXX MB ([weak|medium|strong] representativity score). 
    ...
    


### Import libs & data

See https://zenodo.org/record/4943884#.YqG5cTlByV4

In [1]:
import pandas as pd
import json
import numpy as np

# Load the list of option names (the columns of the dataset)
with open("Linux_options.json", "r") as f:
    linux_options = json.load(f)

# Load the CSV, storing each option as int8 to save a lot of memory
df = pd.read_csv("../Linux.csv", dtype={opt: np.int8 for opt in linux_options})
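
To give a rough idea of why the `int8` dtype matters, here is a minimal sketch on a hypothetical toy table (the `OPT_*` column names and the table size are made up for illustration, not taken from the dataset). It compares the memory footprint of the same 0/1 values stored as `int64` and as `int8`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1,000 configurations x 50 binary options.
rng = np.random.default_rng(0)
toy_options = [f"OPT_{i}" for i in range(50)]
toy = pd.DataFrame(rng.integers(0, 2, size=(1000, 50)), columns=toy_options)

# Same values, two different storage types.
as_int64 = toy.astype(np.int64)
as_int8 = toy.astype(np.int8)

# Bytes used by the column data only (the index is excluded).
bytes_int64 = int(as_int64.memory_usage(index=False).sum())
bytes_int8 = int(as_int8.memory_usage(index=False).sum())
print(bytes_int64, bytes_int8, bytes_int64 / bytes_int8)
```

Each value takes 8 bytes as `int64` and 1 byte as `int8`, so the column data shrinks by a factor of 8.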

### Data overview

Linux options are in columns and Linux configurations are in rows; `perf` is the binary size of the kernel built from that configuration.
An option's value is 1 if the option is activated in the Linux configuration, 0 otherwise.


In [2]:
df

Unnamed: 0,X86_LOCAL_APIC,OPENVSWITCH,TEXTSEARCH_FSM,NETFILTER_XT_MATCH_TCPMSS,MPLS,NFC_HCI,NETFILTER_XT_MATCH_TIME,NET_MPLS_GSO,NFC_SHDLC,NETFILTER_XT_MATCH_U32,...,ARCH_SUPPORTS_INT128,SLABINFO,MICROCODE_AMD,ISDN_DRV_HISAX,CHARGER_BQ24190,SND_SOC_NAU8825,BH1750,NETWORK_FILESYSTEMS,active_options,perf
0,1,0,0,0,1,0,0,1,0,0,...,1,0,0,0,1,0,0,0,1435,50222120
1,1,0,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,1382,16660024
2,1,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,1626,43080856
3,1,0,0,0,0,0,0,0,0,0,...,1,1,1,0,1,0,1,0,2140,27261672
4,1,0,0,0,0,1,0,0,1,0,...,1,0,1,0,1,0,1,1,2651,58769440
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92557,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,240,7317008
92558,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,240,7317008
92559,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,240,7317008
92560,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,240,7317008


### Average binary size difference when activating the option

For each option, we take the average performance of the configurations with the option enabled, subtract the average performance of the configurations with the option disabled, and report the sign and the value of the difference.

In [28]:
def avg_diff(option_name):
    # input :
    # option_name: a string with the name of an option (a column in our dataset)
    # output :
    # the average difference in perf between the configurations with the option
    # activated and the configurations without it
    enabled = df.loc[df[option_name] == 1, 'perf'].mean()
    disabled = df.loc[df[option_name] == 0, 'perf'].mean()
    return enabled - disabled

In [29]:
avg_diff("DEBUG_INFO")

137832303.34279796
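
As a worked example of the same computation, here is a small sketch on a hypothetical four-row dataset (the names `OPT_A`, `OPT_B`, `toy` and `avg_diff_toy` are illustrative only). The numbers are small enough to check by hand.

```python
import pandas as pd

# Hypothetical toy dataset: two options and a 'perf' column (binary size).
toy = pd.DataFrame({
    "OPT_A": [1, 1, 0, 0],
    "OPT_B": [1, 0, 1, 0],
    "perf":  [30, 20, 12, 10],
})

def avg_diff_toy(data, option_name):
    # Mean perf with the option enabled minus mean perf with it disabled.
    enabled = data.loc[data[option_name] == 1, "perf"].mean()
    disabled = data.loc[data[option_name] == 0, "perf"].mean()
    return float(enabled - disabled)

diffs = {opt: avg_diff_toy(toy, opt) for opt in ["OPT_A", "OPT_B"]}
print(diffs)  # {'OPT_A': 14.0, 'OPT_B': 6.0}
```

For `OPT_A`, the enabled mean is (30 + 20) / 2 = 25 and the disabled mean is (12 + 10) / 2 = 11, hence a difference of 14. Note that if an option is enabled in every configuration (or in none), one of the two groups is empty, its mean is NaN, and so is the difference.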

### Confidence score of the previous indicator

This score is an index of how well the two values of the option are represented in the dataset. Why is this needed? The idea is that the previous indicator depends not only on the total number of configurations, but also on the representativeness of the option. If we have few cases where the option is activated, we have more chances to make mistakes when calculating the average performance associated with it.

I propose to use `I = 1-2*|0.5-x|`, with x the proportion of configurations with the option activated (x between 0 and 1). The ideal case is 50% of configurations with the option activated and 50% without, which gives a confidence of 1. The extreme case is no configuration (or all configurations) with the option activated, which gives an index of 0.

Since we are addressing the user, we simplify further: depending on the value of I, we create three categories of confidence, [0, 0.33] => low confidence, (0.33, 0.67] => medium confidence and (0.67, 1] => high confidence.

In [36]:
def representativity_index(option_name):
    # input :
    # option_name a string with the name of an option (a column in our dataset)
    # output :
    # the index, as defined above
    # active_prop: the proportion of configurations with the option activated (between 0 and 1)
    active_prop = (df[option_name] == 1).mean()
    return 1-2*np.abs(0.5-active_prop)

In [38]:
representativity_index("DEBUG_INFO")

0.1923035370886541
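
To make the behaviour of the index more concrete, the sketch below evaluates it for a few proportions and maps each value to a confidence category. The helper names `representativity` and `confidence_category` are hypothetical, and assigning a value that falls exactly on a boundary to the lower category is an assumption.

```python
def representativity(x):
    """I = 1 - 2*|0.5 - x| for a proportion x in [0, 1]."""
    return 1 - 2 * abs(0.5 - x)

def confidence_category(index):
    # Cut-offs at one third and two thirds of the [0, 1] range.
    # Assumption: a value exactly on a cut-off goes to the lower category.
    if index > 2 / 3:
        return "high"
    if index > 1 / 3:
        return "medium"
    return "low"

for x in [0.0, 0.1, 0.25, 0.5, 0.9, 1.0]:
    i = representativity(x)
    print(f"x={x:.2f}  I={i:.2f}  {confidence_category(i)}")
```

The index is symmetric around x = 0.5: an option activated in 10% of the configurations gets the same score as one activated in 90% of them.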

### Average effect on binary size

We use the list of feature importances, set thresholds and assign a size effect category to each option (below 0.1% => negligible effect, above 1% => strong effect, in between => weak effect). The thresholds are arbitrary, but the resulting categories make sense from a user's point of view.

In [87]:
feat_imp = pd.read_csv("feature_importanceRF.csv", index_col=0, header=None)
feat_imp.columns = ['importance']
feat_imp

0,importance
DEBUG_INFO,0.337879
nbyes,0.190736
DEBUG_INFO_REDUCED,0.113583
DEBUG_INFO_SPLIT,0.086327
X86_NEED_RELOCS,0.077693
...,...
USB_GSPCA_VC032X,0.000000
HISAX_ENTERNOW_PCI,0.000000
CRYPTO_HASH2,0.000000
HISAX_DEBUG,0.000000


In [88]:
feat_imp.loc["DEBUG_INFO", "importance"]

0.3378785492683889
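
The thresholds described above can be sketched as a small stand-alone function. The name `effect_category` is hypothetical; the two importance values for `DEBUG_INFO` and `CRYPTO_HASH2` are the ones displayed in the table above, while `HYPOTHETICAL_MID` is a made-up entry added only to exercise the middle category.

```python
def effect_category(importance):
    # Importances are fractions: 0.01 is 1% and 0.001 is 0.1%.
    if importance > 0.01:
        return "strong"
    if importance > 0.001:
        return "weak"
    return "negligible"

samples = {
    "DEBUG_INFO": 0.337879,     # value shown in the table above
    "HYPOTHETICAL_MID": 0.005,  # made-up value, for illustration only
    "CRYPTO_HASH2": 0.0,        # value shown in the table above
}
categories = {name: effect_category(value) for name, value in samples.items()}
print(categories)
```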

### Generate documentation

In [89]:
def generate_doc(option_name):
    # input :
    # option_name a string with the name of an option (a column in our dataset)
    # output :
    # the lines of the documentation patch
    
    lines = ['----', option_name, '----', '']

    current_line = option_name+' has a '
    
    if option_name in feat_imp.index:
        option_imp = feat_imp.loc[option_name, 'importance']
    else:
        option_imp = 0
    option_conf_score = representativity_index(option_name)
    option_avg_diff = avg_diff(option_name)
    
    if option_imp > 0.01:
        current_line+='strong'
    elif option_imp > 0.001:
        current_line+='weak'
    else:
        current_line+='negligible'

    current_line+=' effect on the binary size of the resulting kernel.'
    lines.append(current_line)

    current_line = 'In general, activating '+option_name+ ' '

    if option_avg_diff < 0:
        current_line+='reduces'
    else:
        current_line+='increases'

    current_line+=' the binary size by '

    # Report the magnitude only: the sign is already carried by 'reduces'/'increases'
    size = np.abs(option_avg_diff)
    if size > 1e9:
        current_line+=str(int(size/1e9))+' GB'
    elif size > 1e6:
        current_line+=str(int(size/1e6))+' MB'
    else:
        current_line+=str(int(size/1e3))+' kB'

    current_line+=' ('

    if option_conf_score > 2/3:
        current_line+='strong'
    elif option_conf_score > 1/3:
        current_line+='medium'
    else:
        current_line+='weak'

    current_line+=' representativity score).'

    lines.append(current_line)
    lines.append('')
    
    return lines

In [90]:
doc = generate_doc("DEBUG_INFO")

for line in doc:
    print(line)

----
DEBUG_INFO
----

DEBUG_INFO has a strong effect on the binary size of the resulting kernel.
In general, activating DEBUG_INFO increases the binary size by 137 MB (weak representativity score).
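
The unit selection used when building the second sentence can be isolated in a small helper, which makes it easy to check on a few values. This is only a sketch: the name `format_size` is hypothetical, it assumes the size is expressed in bytes with decimal units (1 MB = 10^6 bytes, as in the thresholds of `generate_doc`), and it formats the absolute value so that the direction of the change is expressed only by the verb ("increases" or "reduces").

```python
def format_size(num_bytes):
    # Work on the magnitude; the sign is reported separately in the sentence.
    magnitude = abs(num_bytes)
    if magnitude > 1e9:
        return f"{int(magnitude / 1e9)} GB"
    if magnitude > 1e6:
        return f"{int(magnitude / 1e6)} MB"
    # int() truncates, so anything below 1 kB is reported as "0 kB".
    return f"{int(magnitude / 1e3)} kB"

print(format_size(137832303.34279796))  # 137 MB
print(format_size(-2500000))            # 2 MB
print(format_size(4200))                # 4 kB
```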



In [None]:
all_doc = []
for option in linux_options:
    all_doc.extend(generate_doc(option))

### Export to txt file

In [101]:
with open('doc.txt', 'w') as file:
    for line in all_doc:
        file.write(line+"\n")

In [100]:
all_doc

['----',
 'X86_LOCAL_APIC',
 '----',
 '',
 'X86_LOCAL_APIC has a negligible effect on the binary size of the resulting kernel.',
 'In general, activating X86_LOCAL_APIC increases the binary size by 40 MB (weak representativity score).',
 '',
 '----',
 'OPENVSWITCH',
 '----',
 '',
 'OPENVSWITCH has a negligible effect on the binary size of the resulting kernel.',
 'In general, activating OPENVSWITCH increases the binary size by 17 MB (weak representativity score).',
 '',
 '----',
 'TEXTSEARCH_FSM',
 '----',
 '',
 'TEXTSEARCH_FSM has a negligible effect on the binary size of the resulting kernel.',
 'In general, activating TEXTSEARCH_FSM increases the binary size by 13 MB (weak representativity score).',
 '',
 '----',
 'NETFILTER_XT_MATCH_TCPMSS',
 '----',
 '',
 'NETFILTER_XT_MATCH_TCPMSS has a negligible effect on the binary size of the resulting kernel.',
 'In general, activating NETFILTER_XT_MATCH_TCPMSS increases the binary size by 18 MB (weak representativity score).',
 '',
 '----',