This notebook studies security vulnerabilities and how their fixes relate to backported updates.

The data we relied on are subject to a non-disclosure agreement. That means we are not allowed to share these data, so you'll have to trust us ;)

In [1]:
import pandas
import numpy as np
import matplotlib
import seaborn

from IPython.display import display

%matplotlib inline

In [2]:
FIG_SIZE = (8, 3)
FIG_SIZE_WIDE = (8, 2.5)

ECOSYSTEMS = ['NPM', 'Rubygems']
DATE_RANGE = pandas.to_datetime('2015-01-01'), pandas.to_datetime('2020-01-01')
CENSOR_DATE = pandas.to_datetime('2020-01-12')

PALETTE = seaborn.color_palette()
PAL_REL = np.take(seaborn.color_palette('muted'), [3, 8, 2, 0], axis=0)
COLORS = {'NPM': PALETTE[1], 'Rubygems': PALETTE[3]}

matplotlib.rcParams['figure.figsize'] = FIG_SIZE
matplotlib.rcParams['legend.framealpha'] = 1
matplotlib.rcParams['text.latex.preamble'] = r'\usepackage{amsmath}'

SAVEFIG = False

def _savefig(fig, name):
    import os
    fig.savefig(
        os.path.join('..', 'figures', '{}.pdf'.format(name)),
        bbox_inches='tight'
    )
    
savefig = _savefig if SAVEFIG else lambda x, y: None

# Dataset

In [58]:
df_vuln = (
    pandas.read_csv(
        '../data-raw/vulnerabilities.csv.gz',
        index_col=0,
        # `infer_datetime_format` is omitted: it is deprecated (and has no effect) since pandas 2.0.
        parse_dates=['published', 'disclosed'],
    )
    .rename(columns={
        'Id': 'id',
        'vuln_name': 'vulnerability', 
        'base': 'ecosystem', 
        'cvssScore': 'score',
        'fixedIn': 'fixed', 
        'affecting': 'affected',
    })
    .replace({'ecosystem': {'npm': 'NPM', 'RubyGems': 'Rubygems'}})
)

The dataset uses version range expressions to describe which versions are affected by a vulnerability and in which versions it was fixed.
We'll parse these expressions and convert them to intervals, so we can manipulate them more easily.
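
To make the idea concrete, here is a minimal sketch of such a conversion. It is not the `parsers` module used in this notebook: `to_tuple` and `comparators_to_interval` are hypothetical helpers, and the sketch assumes the raw expressions are space-separated comparators such as `<8.2.0` or `>=1.0.0 <1.2.3` on plain `x.y.z` versions, which may not cover the syntax actually found in the dataset.

```python
import re

# Hypothetical helpers for illustration only; the notebook relies on its own parsers.
INF = (float('inf'),)  # sorts after any version tuple


def to_tuple(version):
    """Turn '8.2.0' into (8, 2, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split('.'))


def comparators_to_interval(expr):
    """Intersect simple comparators into a single interval.

    Returns (lower, lower_closed, upper, upper_closed).
    Assumption: only <, <=, > and >= on plain x.y.z versions are supported.
    """
    lower, lower_closed = (0, 0, 0), True
    upper, upper_closed = INF, False
    for op, version in re.findall(r'(<=|>=|<|>)\s*(\d+(?:\.\d+)*)', expr):
        v = to_tuple(version)
        if op in ('>', '>='):
            # Keep the tightest lower bound seen so far.
            if v > lower or (v == lower and op == '>'):
                lower, lower_closed = v, op == '>='
        else:
            # Keep the tightest upper bound seen so far.
            if v < upper or (v == upper and op == '<'):
                upper, upper_closed = v, op == '<='
    return lower, lower_closed, upper, upper_closed


only_upper = comparators_to_interval('<8.2.0')
both_bounds = comparators_to_interval('>=1.0.0 <1.2.3')
```

Under these assumptions, `<8.2.0` becomes the half-open interval from `0.0.0` (included) to `8.2.0` (excluded), the same shape as the `[0.0.0,8.2.0)` values in the `affected` column.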

In [12]:
import sys

sys.path.append('../data')

from parsers import parse_or_empty, NPMParser

parser = NPMParser()

affected = dict()
fixed = dict()

for expr in df_vuln.affected.drop_duplicates():
    affected[expr] = parse_or_empty(parser, expr)
    
for expr in df_vuln.fixed.drop_duplicates():
    fixed[expr] = parse_or_empty(parser, expr)

In [59]:
df_vuln = (
    df_vuln
    .replace({
        'affected': {k: str(v) for k,v in affected.items()},
        'fixed': {k: str(v) for k,v in fixed.items()},
    })
    # An empty interval '()' means no fixed version is known, so the vulnerability is still open.
    .assign(status=lambda d: np.where(d['fixed'] == '()', 'open', 'closed'))
)

In [61]:
df_vuln.head()

Unnamed: 0,id,package,published,disclosed,severity,vulnerability,ecosystem,score,fixed,affected,status
0,SNYK-JS-MBACKDOOR-565090,m-backdoor,2020-04-12,2020-04-10,critical,Malicious Package,NPM,9.8,(),"[0.0.0,+inf)",open
1,SNYK-JS-PAYPALADAPTIVE-565089,paypal-adaptive,2020-04-12,2020-04-12,medium,Prototype Pollution,NPM,4.2,(),"[0.0.0,+inf)",open
2,SNYK-JS-GRUNTUTILPROPERTY-565088,grunt-util-property,2020-04-12,2020-04-12,medium,Prototype Pollution,NPM,4.0,(),"[0.0.0,+inf)",open
3,SNYK-JS-ELECTRON-565052,electron,2020-04-10,2020-03-06,high,Out-of-bounds Read,NPM,7.3,[8.2.0],"[0.0.0,8.2.0)",closed
4,SNYK-JS-ELECTRON-565051,electron,2020-04-10,2020-04-02,high,Heap Overflow,NPM,8.8,[8.2.1],"[0.0.0,8.2.1)",closed


**TODO**

 - Quantify the dataset to give an overview of what we have;
 - Ignore vulnerabilities that are not yet fixed;
 - For each vulnerability, tag each release of the vulnerable package as "affected" (yes/no) and "fixed" (yes/no). This will allow us to identify whether a fix was also deployed in a previous major version. Keep in mind that this should be done per vulnerability, not per package, since the same package can have more than one vulnerability;
 - Quantify the number of backported fixes;
 - (possibly) Quantify the number of dependents that benefited from such backports;
 - (possibly) Quantify the number of dependents that could benefit from a backport.
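
The tagging and backport steps above could be sketched as follows. This is only an illustration under stated assumptions: the `releases` table, the `example-pkg` package and its versions are made up (no release data appears in this notebook), the helpers `to_tuple`, `in_interval` and `in_any` are hypothetical, versions are assumed to be plain `x.y.z`, and each vulnerability is assumed to carry a list of interval strings in the notation shown in the table above. A fix is counted as a backport when the fixed release belongs to a major version lower than the highest major version that received the fix, which is one possible definition among others.

```python
import pandas


def to_tuple(version):
    """Turn '8.2.0' into (8, 2, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split('.'))


def in_interval(version, interval):
    """Check membership in one interval such as '[0.0.0,8.2.0)', '[8.2.0]' or '()'."""
    body = interval[1:-1]
    if not body:  # '()' is the empty interval
        return False
    bounds = body.split(',')
    lower, upper = bounds[0], bounds[-1]  # '[8.2.0]' has a single bound
    v = to_tuple(version)
    if lower != '-inf':
        low = to_tuple(lower)
        if v < low or (v == low and interval[0] == '('):
            return False
    if upper != '+inf':
        up = to_tuple(upper)
        if v > up or (v == up and interval[-1] == ')'):
            return False
    return True


def in_any(version, intervals):
    """Check membership in a union of intervals."""
    return any(in_interval(version, interval) for interval in intervals)


# Made-up releases and vulnerability, for illustration only.
releases = pandas.DataFrame({
    'package': 'example-pkg',
    'version': ['1.4.1', '1.4.2', '2.0.0', '2.1.0'],
})
vulnerability = {
    'id': 'EXAMPLE-1',
    'package': 'example-pkg',
    'affected': ['[0.0.0,1.4.2)', '[2.0.0,2.1.0)'],
    'fixed': ['[1.4.2]', '[2.1.0]'],
}

# One row per (vulnerability, release), tagged as affected and/or fixed.
tagged = (
    releases[releases['package'] == vulnerability['package']]
    .assign(
        vulnerability=vulnerability['id'],
        affected=lambda d: d['version'].apply(in_any, intervals=vulnerability['affected']),
        fixed=lambda d: d['version'].apply(in_any, intervals=vulnerability['fixed']),
        major=lambda d: d['version'].apply(lambda v: to_tuple(v)[0]),
    )
)

# A fixed release in an older major than the latest fixed major counts as a backport.
latest_fixed_major = tagged.loc[tagged['fixed'], 'major'].max()
tagged['backport'] = tagged['fixed'] & (tagged['major'] < latest_fixed_major)
```

In this toy example, `1.4.2` is tagged as a backported fix because the same vulnerability is also fixed in `2.1.0`. Repeating the tagging for every row of `df_vuln` and concatenating the results would give one table per ecosystem on which backports can be counted.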