# Intro 
This notebook serves as a visual reference for what the [Draper dataset](https://osf.io/d45bw/) contains.

In [1]:
import pandas
import numpy
import h5py
import os

In [2]:
train_path      = "../draper_dataset/VDISC_train.hdf5"
test_path       = "../draper_dataset/VDISC_test.hdf5"
validation_path = "../draper_dataset/VDISC_validate.hdf5"

All three files share the same structure and were generated with a classic 80:10:10 split.

In [3]:
train_f = h5py.File(train_path, 'r')

In [4]:
print(type(train_f))

<class 'h5py._hl.files.File'>


# Data

Five labels are considered: four for the most frequently found CWEs, and a fifth that groups all the other, less frequent CWEs.
From the dataset README:

    CWE-120 (3.7% of functions)
    CWE-119 (1.9% of functions)
    CWE-469 (0.95% of functions)
    CWE-476 (0.21% of functions)
    CWE-other (2.7% of functions)

The source code of each function is represented by a single UTF-8 string.

In [5]:
for key in train_f.keys(): print(key)

CWE-119
CWE-120
CWE-469
CWE-476
CWE-other
functionSource


In [6]:
CWE_119        = train_f['CWE-119']   # Improper Restriction of Operations within the Bounds of a Memory Buffer
CWE_120        = train_f['CWE-120']   # Buffer Overflow
CWE_469        = train_f['CWE-469']   # Use of Pointer Subtraction to Determine Size
CWE_476        = train_f['CWE-476']   # NULL Pointer Dereference
CWE_other      = train_f['CWE-other'] # Multiple other CWEs
functionSource = train_f['functionSource']

In [7]:
print(type(CWE_119))

<class 'h5py._hl.dataset.Dataset'>


## Labels

Each CWE dataset contains only boolean values: entry `i` indicates whether the function at index `i` of `functionSource` is affected by that specific kind of vulnerability.

In [8]:
for i in range(11):
    print("{}\t{}\t{}\t{}\t{}".format(CWE_119[i], CWE_120[i], CWE_469[i], CWE_476[i], CWE_other[i]))

False	False	False	False	False
False	False	False	False	False
False	False	False	False	False
False	False	False	False	False
True	True	False	False	True
False	False	False	False	False
False	False	False	False	False
False	False	False	False	False
False	False	False	False	False
False	False	False	False	False
True	True	False	False	False
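Because the five label arrays are index-aligned, each row can be read as one multi-label tuple. As a minimal sketch, the snippet below tallies the label combinations in the 11 rows printed above. The `rows` list is copied by hand from that output, and `combo_counts` is a name chosen here for illustration.

```python
from collections import Counter

# The 11 rows printed above, in the order
# (CWE-119, CWE-120, CWE-469, CWE-476, CWE-other).
clean = (False, False, False, False, False)
rows = (
    [clean] * 4
    + [(True, True, False, False, True)]
    + [clean] * 5
    + [(True, True, False, False, False)]
)

# Count how often each combination of labels occurs.
combo_counts = Counter(rows)

for combo, count in combo_counts.most_common():
    print(count, combo)
```

Even in this small sample, the two flagged functions carry more than one label each, which is why the per-CWE counts cannot simply be added up.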


Let's now see the number of vulnerable functions per CWE type:

In [9]:
CWE_119_u,   CWE_119_c   = numpy.unique(CWE_119, return_counts=True)
CWE_120_u,   CWE_120_c   = numpy.unique(CWE_120, return_counts=True)
CWE_469_u,   CWE_469_c   = numpy.unique(CWE_469, return_counts=True)
CWE_476_u,   CWE_476_c   = numpy.unique(CWE_476, return_counts=True)
CWE_other_u, CWE_other_c = numpy.unique(CWE_other, return_counts=True)
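Reading the positive count as element `[1]` of the counts array assumes that both `False` and `True` occur in every label array. If a label had no positives, `numpy.unique` would return a single count and `[1]` would raise an `IndexError`. A possible alternative, sketched here on small synthetic arrays rather than the real datasets, is `numpy.count_nonzero`, which simply returns 0 in that case.

```python
import numpy

# Synthetic label arrays (hypothetical values, not taken from the dataset).
some_positives = numpy.array([False, True, False, True, True])
no_positives = numpy.array([False, False, False])

# numpy.unique only reports the values that actually occur,
# so an all-False array yields a single count.
values, counts = numpy.unique(no_positives, return_counts=True)
print(values, counts)

# count_nonzero counts the True entries directly and copes with the all-False case.
n_some = int(numpy.count_nonzero(some_positives))
n_none = int(numpy.count_nonzero(no_positives))
print(n_some, n_none)
```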

In [10]:
print("Total functions in the dataset:", len(CWE_119))
print("Total functions vulnerable to CWE_119:", CWE_119_c[1])
print("Total functions vulnerable to CWE_120:", CWE_120_c[1])
print("Total functions vulnerable to CWE_469:", CWE_469_c[1])
print("Total functions vulnerable to CWE_476:", CWE_476_c[1])
print("Total functions vulnerable to CWE_other:", CWE_other_c[1])

Total functions in the dataset: 1019471
Total functions vulnerable to CWE_119: 19286
Total functions vulnerable to CWE_120: 38019
Total functions vulnerable to CWE_469: 2095
Total functions vulnerable to CWE_476: 9694
Total functions vulnerable to CWE_other: 27959
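To compare these counts with the README percentages, they can be turned into shares of the training file. The sketch below only redoes the arithmetic on the numbers printed above; `vulnerable_counts` and `shares` are names chosen here for illustration.

```python
# Counts copied from the output above (training file only).
total_functions = 1019471
vulnerable_counts = {
    "CWE-119": 19286,
    "CWE-120": 38019,
    "CWE-469": 2095,
    "CWE-476": 9694,
    "CWE-other": 27959,
}

# Share of functions flagged with each label, as a percentage.
shares = {
    name: 100.0 * count / total_functions
    for name, count in vulnerable_counts.items()
}

for name, share in shares.items():
    print("{}: {:.2f}%".format(name, share))
```

On these counts CWE-476 comes out at about 0.95% and CWE-469 at about 0.21%, the reverse of how those two figures are attached in the README excerpt; the other three shares line up with it. The README figures presumably refer to the whole dataset rather than the training file alone, so small differences are to be expected.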


A single function, however, can be vulnerable to multiple CWEs, so the per-CWE counts cannot simply be summed: to obtain the total number of vulnerable functions we must check all the CWE datasets row by row.

To make data access faster, we build a pandas DataFrame that wraps the datasets. The default import (presumably `pandas.read_hdf`) does not work because the file contains variable-size strings, which PyTables does not handle.

In [11]:
train_df = pandas.DataFrame({
    "CWE-119": CWE_119,
    "CWE-120": CWE_120,
    "CWE-469": CWE_469,
    "CWE-476": CWE_476,
    "CWE-other": CWE_other,
    "functionSource": functionSource
})

In [12]:
train_df.head()

| | CWE-119 | CWE-120 | CWE-469 | CWE-476 | CWE-other | functionSource |
|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | `clear_area(int startx, int starty, int xsize, ...` |
| 1 | False | False | False | False | False | `ReconstructDuList(Statement* head)\n{\n Sta...` |
| 2 | False | False | False | False | False | `free_speaker(void)\n{\n if(Lengths)\n ...` |
| 3 | False | False | False | False | False | `mlx4_register_device(struct mlx4_dev *dev)\n{\...` |
| 4 | True | True | False | False | True | `Parse_Env_Var(void)\n{\n char *p = getenv("LI...` |
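The `functionSource` column is expected to hold text. Depending on the installed h5py version, variable-length strings may be returned as `bytes` rather than `str`. The output above shows `str` values, so this is only a precaution. A minimal sketch, where `to_text` is a hypothetical helper and the sample values are made up:

```python
def to_text(value):
    """Return value as str, decoding UTF-8 bytes if necessary."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


# Made-up samples: one bytes value and one str value.
samples = [b"free_speaker(void)", "clear_area(int startx)"]
decoded = [to_text(sample) for sample in samples]
print(decoded)
```

On the real frame the same helper could be applied with `train_df["functionSource"].map(to_text)`.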


In [13]:
tot_vul = 0

In [14]:
%%time
for i in range(len(CWE_119)):
    if True in {CWE_119[i], CWE_120[i], CWE_469[i], CWE_476[i], CWE_other[i]}:
        tot_vul += 1

CPU times: user 5min 6s, sys: 15 ms, total: 5min 6s
Wall time: 5min 6s


In [15]:
print("Total vulnerable functions in the dataset:", tot_vul)

Total vulnerable functions in the dataset: 65904


In [16]:
%%time
vuln_df = (train_df[[ 'CWE-119', 'CWE-120', 'CWE-469', 'CWE-476', 'CWE-other']] == True).any(axis=1)

CPU times: user 50.8 ms, sys: 3.99 ms, total: 54.8 ms
Wall time: 54.2 ms
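The gap between the two timings most likely comes from where the data lives: the loop indexes the h5py datasets one element at a time, while the DataFrame version works on arrays that are already in memory. The same reduction can be sketched with plain NumPy. The arrays below are synthetic stand-ins; on the real file each one would presumably be loaded once, for example with `train_f['CWE-119'][:]`.

```python
import numpy

# Synthetic stand-ins for the five label datasets (hypothetical values).
labels = {
    "CWE-119":   numpy.array([False, True,  False, False]),
    "CWE-120":   numpy.array([False, True,  False, True]),
    "CWE-469":   numpy.array([False, False, False, False]),
    "CWE-476":   numpy.array([False, False, False, False]),
    "CWE-other": numpy.array([False, True,  False, False]),
}

# Element-wise OR across all label arrays: True where any label is set.
any_vulnerable = numpy.logical_or.reduce(list(labels.values()))

# Number of functions flagged by at least one label.
total_vulnerable = int(any_vulnerable.sum())
print(any_vulnerable, total_vulnerable)
```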


In [17]:
vuln_df.value_counts()

False    953567
True      65904
dtype: int64

In [20]:
vuln_df.describe(include='all')

count     1019471
unique          2
top         False
freq       953567
dtype: object

The dataset contains $1019471$ functions, $953567$ ($93.54\%$) non-vulnerable and $65904$ ($6.46\%$) vulnerable.
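The percentages follow directly from the counts reported by `value_counts()`. As a quick check, the arithmetic can be reproduced as follows; it also gives the ratio of non-vulnerable to vulnerable functions. The variable names are chosen here for illustration.

```python
# Counts copied from the value_counts() output above.
total = 1019471
non_vulnerable = 953567
vulnerable = 65904

# The two classes should partition the dataset.
assert non_vulnerable + vulnerable == total

pct_non_vulnerable = 100.0 * non_vulnerable / total
pct_vulnerable = 100.0 * vulnerable / total

# How many non-vulnerable functions there are per vulnerable one.
imbalance_ratio = non_vulnerable / vulnerable

print("non-vulnerable: {:.2f}%".format(pct_non_vulnerable))
print("vulnerable:     {:.2f}%".format(pct_vulnerable))
print("ratio:          {:.2f} to 1".format(imbalance_ratio))
```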

## Source

The `functionSource` dataset contains the source text of each function.

In [21]:
for elem in functionSource[:3]: print("{}\n{}".format("-"*80, elem))

--------------------------------------------------------------------------------
clear_area(int startx, int starty, int xsize, int ysize)
{
  int x;

  TRACE_LOG("Clearing area %d,%d / %d,%d\n", startx, starty, xsize, ysize);

  while (ysize > 0)
  {
    x = xsize;
    while (x > 0)
    {
      mvaddch(starty + ysize - 2, startx + x - 2, ' ');
      x--;
    }
    ysize--;
  }
}
--------------------------------------------------------------------------------
ReconstructDuList(Statement* head)
{
    Statement* spt;

    for (spt = head; spt != NULL; spt = spt->next) {
	delete_def_use_list(spt->use_var_list);
	delete_def_use_list(spt->def_var_list);
	delete_def_use_list(spt->use_array_list);
	delete_def_use_list(spt->def_array_list);
	spt->def_var_list = NULL;
	spt->use_var_list = NULL;
	spt->def_array_list = NULL;
	spt->use_array_list = NULL;
    }
    def_use_statement(head);
}
--------------------------------------------------------------------------------
free_speaker(void)
{
   if(L