# Software Vulnerability Detection using Deep Learning (Experiment Replication)

### Multicolumn experiment (All CWEs and others)

* This replicates the work of Russell et al. (Automated Vulnerability Detection in Source Code Using Deep Representation Learning): https://arxiv.org/abs/1807.04320
* Datasets downloaded from https://osf.io/d45bw/
* Dataset split: training (80%), validation (10%), testing (10%)
* The dataset consists of the source code of 1.27 million functions mined from open source software, labeled by static analysis for potential vulnerabilities.
* Each function's raw source code, starting from the function name, is stored as a variable-length UTF-8 string. Five binary 'vulnerability' labels are provided for each function, corresponding to the four most common CWEs in the data plus all others:
  * CWE-120 (3.7% of functions)
  * CWE-119 (1.9% of functions)
  * CWE-469 (0.95% of functions)
  * CWE-476 (0.21% of functions)
  * CWE-other (2.7% of functions)
* Functions may have more than one detected CWE each.
* Python 3.6 and TensorFlow 2.x (the version check further down reports 2.4.1)
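The label percentages listed above can be re-checked once a split is loaded into a DataFrame. The following is a minimal sketch on a toy frame with made-up labels (not taken from the dataset); `label_prevalence` is a hypothetical helper name, and the column names are assumed to match the ones used later in this notebook.

```python
import pandas as pd

LABEL_COLS = ["CWE-119", "CWE-120", "CWE-469", "CWE-476", "CWE-other"]

def label_prevalence(df, label_cols=LABEL_COLS):
    """Return the share of positive functions per label, in percent."""
    return df[label_cols].mean() * 100

# Toy frame with made-up labels, only to show the computation.
toy = pd.DataFrame({
    "functionSource": ["f(){}", "g(){}", "h(){}", "k(){}"],
    "CWE-119": [True, False, False, False],
    "CWE-120": [True, True, False, False],
    "CWE-469": [False, False, False, False],
    "CWE-476": [False, False, False, True],
    "CWE-other": [True, False, False, False],
})

prevalence = label_prevalence(toy)
# A function may carry several labels at once, so count those rows too.
multi_label_rows = int((toy[LABEL_COLS].sum(axis=1) > 1).sum())

print(prevalence)
print("functions with more than one CWE:", multi_label_rows)
```

Because every label is rare, a model that always predicts "not vulnerable" already scores high accuracy, which is worth keeping in mind when reading the accuracy figures later on.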

In [2]:
#training distribution
!wget https://osf.io/6fexn/download

--2021-04-30 04:36:25--  https://osf.io/6fexn/download
Resolving osf.io (osf.io)... 35.190.84.173
Connecting to osf.io (osf.io)|35.190.84.173|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://files.osf.io/v1/resources/d45bw/providers/osfstorage/5bf34ee71f01ef00170e4a90?action=download&direct&version=1 [following]
--2021-04-30 04:36:27--  https://files.osf.io/v1/resources/d45bw/providers/osfstorage/5bf34ee71f01ef00170e4a90?action=download&direct&version=1
Resolving files.osf.io (files.osf.io)... 35.186.214.196
Connecting to files.osf.io (files.osf.io)|35.186.214.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862111179 (822M) [application/octet-stream]
Saving to: ‘download’


2021-04-30 04:37:11 (18.8 MB/s) - ‘download’ saved [862111179/862111179]



In [3]:
ls

download  sample_data/


In [4]:
#renaming
!mv download VDISC_train.hdf5

In [5]:
#Testing distribution

!wget https://osf.io/f9t6z/download

--2021-04-30 04:37:11--  https://osf.io/f9t6z/download
Resolving osf.io (osf.io)... 35.190.84.173
Connecting to osf.io (osf.io)|35.190.84.173|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://files.osf.io/v1/resources/d45bw/providers/osfstorage/5bf34e965603840019b1bdd2?action=download&direct&version=1 [following]
--2021-04-30 04:37:13--  https://files.osf.io/v1/resources/d45bw/providers/osfstorage/5bf34e965603840019b1bdd2?action=download&direct&version=1
Resolving files.osf.io (files.osf.io)... 35.186.214.196
Connecting to files.osf.io (files.osf.io)|35.186.214.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 107870135 (103M) [application/octet-stream]
Saving to: ‘download’


2021-04-30 04:37:20 (15.8 MB/s) - ‘download’ saved [107870135/107870135]



In [6]:
mv download VDISC_test.hdf5

In [7]:
#validate distribution

!wget https://osf.io/43mzd/download

--2021-04-30 04:37:21--  https://osf.io/43mzd/download
Resolving osf.io (osf.io)... 35.190.84.173
Connecting to osf.io (osf.io)|35.190.84.173|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://files.osf.io/v1/resources/d45bw/providers/osfstorage/5bf34e961f01ef00170e4a01?action=download&direct&version=1 [following]
--2021-04-30 04:37:21--  https://files.osf.io/v1/resources/d45bw/providers/osfstorage/5bf34e961f01ef00170e4a01?action=download&direct&version=1
Resolving files.osf.io (files.osf.io)... 35.186.214.196
Connecting to files.osf.io (files.osf.io)|35.186.214.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 108035556 (103M) [application/octet-stream]
Saving to: ‘download’


2021-04-30 04:37:29 (15.9 MB/s) - ‘download’ saved [108035556/108035556]



In [8]:
mv download VDISC_validate.hdf5

In [9]:
ls

sample_data/  VDISC_test.hdf5  VDISC_train.hdf5  VDISC_validate.hdf5


## Pre-processing

Convert the HDF5 files for the training, validation, and testing datasets to Python pickle files for easier reuse.

In [10]:
import h5py
import pandas as pd

In [13]:
# 3 datasets available

data = h5py.File("VDISC_train.hdf5",'r')


In [21]:
dataval = h5py.File("VDISC_validate.hdf5",'r')
datatest = h5py.File("VDISC_test.hdf5",'r')

In [14]:
# List all groups
data.visit(print)

CWE-119
CWE-120
CWE-469
CWE-476
CWE-other
functionSource


Create a new dataframe from the HDF5 file

In [None]:
mydf = pd.DataFrame(list(data['functionSource']))

In [11]:
# Attach the five binary label columns
for col in ['CWE-119', 'CWE-120', 'CWE-469', 'CWE-476', 'CWE-other']:
    mydf[col] = list(data[col])

In [12]:
mydf.rename(columns={0:'functionSource'},inplace=True)

In [13]:
mydf.iloc[0:5,0:]

Unnamed: 0,functionSource,CWE-119,CWE-120,CWE-469,CWE-476,CWE-other
0,"clear_area(int startx, int starty, int xsize, ...",False,False,False,False,False
1,ReconstructDuList(Statement* head)\n{\n Sta...,False,False,False,False,False
2,free_speaker(void)\n{\n if(Lengths)\n ...,False,False,False,False,False
3,mlx4_register_device(struct mlx4_dev *dev)\n{\...,False,False,False,False,False
4,"Parse_Env_Var(void)\n{\n char *p = getenv(""LI...",True,True,False,False,True


In [15]:
mydf.to_pickle("VDISC_train.pickle")
#mydf.to_pickle("VDISC_validate.pickle")
#mydf.to_pickle("VDISC_test.pickle")
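The cells above build and pickle only the training split; the validation and test lines are commented out and would write the same `mydf` again. One way to handle all three splits is a small helper, sketched here under assumptions: `split_to_dataframe` is a hypothetical name, and it accepts any mapping of dataset name to array so that it can be tried without the downloads. An open `h5py.File` supports the same indexing.

```python
import numpy as np
import pandas as pd

LABEL_COLS = ["CWE-119", "CWE-120", "CWE-469", "CWE-476", "CWE-other"]

def split_to_dataframe(h5_file, label_cols=LABEL_COLS):
    """Build a DataFrame from an open h5py.File (or any mapping of name -> array)."""
    frame = pd.DataFrame({"functionSource": list(h5_file["functionSource"])})
    for col in label_cols:
        frame[col] = list(h5_file[col])
    return frame

# Stand-in for an open HDF5 file so the sketch runs without the real data.
fake_split = {"functionSource": np.array(["f(void){}", "g(void){}"], dtype=object)}
fake_split.update({col: np.array([False, True]) for col in LABEL_COLS})

frame = split_to_dataframe(fake_split)
print(frame.shape)

# With the real files (file names as used earlier in this notebook):
# import h5py
# for split in ("train", "validate", "test"):
#     with h5py.File(f"VDISC_{split}.hdf5", "r") as f:
#         split_to_dataframe(f).to_pickle(f"VDISC_{split}.pickle")
```

Naming the source column up front also removes the need for the later `rename` step.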

<b>The pickled datasets are stored on Google Drive; simply import them from there to save time.</b>

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [19]:
cp /content/VDISC_train.pickle "/content/drive/MyDrive/Colab Notebooks"

## Exploratory Data Analysis

### Importing processed datasets

In [12]:
cd "/content/drive/MyDrive/Colab Notebooks"

/content/drive/MyDrive/Colab Notebooks


In [13]:
train=pd.read_pickle("VDISC_train.pickle")

In [None]:

#validate=pd.read_pickle("VDISC_validate.pickle")
#test=pd.read_pickle("VDISC_test.pickle")

In [None]:
### CONTINUE LATER

## Learning Phase

### Importing libraries

In [14]:
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics
import pickle

print("TensorFlow version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")

TensorFlow version:  2.4.1
Eager mode:  True
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
GPU is NOT AVAILABLE


### Setting static and global variables

In [15]:
# Generate random seed
#myrand=np.random.randint(1, 99999 + 1)
myrand=71926
np.random.seed(myrand)
tf.random.set_seed(myrand)
print("Random seed is:",myrand)

Random seed is: 71926


In [16]:
# Set the global value
WORDS_SIZE=10000
INPUT_SIZE=500
NUM_CLASSES=2
MODEL_NUM=0
EPOCHS=10

### Importing processed datasets

In [None]:
"""train=pd.read_pickle("VDISC_train.pickle")
validate=pd.read_pickle("VDISC_validate.pickle")
test=pd.read_pickle("VDISC_test.pickle")

for dataset in [train,validate,test]:
    for col in range(1,6):
        dataset.iloc[:,col] = dataset.iloc[:,col].map({False: 0, True: 1})

# Create source code data for tokenization
x_all = train['functionSource']
#x_all = x_all.append(validate['functionSource'])
#x_all = x_all.append(test['functionSource'])
"""

In [17]:
# Overview of the datasets
train.head()

Unnamed: 0,functionSource,CWE-119,CWE-120,CWE-469,CWE-476,CWE-other
0,"clear_area(int startx, int starty, int xsize, ...",False,False,False,False,False
1,ReconstructDuList(Statement* head)\n{\n Sta...,False,False,False,False,False
2,free_speaker(void)\n{\n if(Lengths)\n ...,False,False,False,False,False
3,mlx4_register_device(struct mlx4_dev *dev)\n{\...,False,False,False,False,False
4,"Parse_Env_Var(void)\n{\n char *p = getenv(""LI...",True,True,False,False,True


In [18]:
x_all = train['functionSource']

In [19]:
x1 = x_all

In [20]:
len(x_all)

1019471

<b>Because of the heavy computation, I take only the first 1000 samples.</b>

In [21]:
# Slice the first 1000 samples (x_all[1000] would return a single function)
x1_all = x_all[:1000]

### Tokenizing the source codes

In [22]:
# Tokenizer with word-level
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=False)
tokenizer.fit_on_texts(list(x_all))
del(x_all)
print('Number of tokens: ',len(tokenizer.word_counts))

Number of tokens:  1094129


In [23]:
# Reducing to top N words
tokenizer.num_words = WORDS_SIZE

In [24]:
# Top 10 words
sorted(tokenizer.word_counts.items(), key=lambda x:x[1], reverse=True)[0:10]

[('if', 3126441),
 ('0', 2106459),
 ('return', 1745333),
 ('i', 1375259),
 ('1', 1186857),
 ('int', 1016932),
 ('null', 975347),
 ('the', 791897),
 ('t', 733766),
 ('n', 716010)]
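Two details of the word-level tokenizer help when reading the list above. The default Keras filter set includes the backslash, so an escaped `\n` or `\t` inside C string literals is split into a standalone `n` or `t` token, which would explain why those appear among the top words. Setting `num_words` keeps only words whose frequency rank is below that value, so at most `num_words - 1` distinct words survive. The following is a simplified pure-Python imitation of that behaviour, not the Keras implementation itself; all function names are hypothetical.

```python
from collections import OrderedDict

# Default filter characters of the Keras Tokenizer; note the backslash.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = OrderedDict()
    for text in texts:
        for word in to_words(text):
            counts[word] = counts.get(word, 0) + 1
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose rank is below num_words."""
    ids = (word_index.get(word) for word in to_words(text))
    return [i for i in ids if i is not None and i < num_words]

# The second snippet contains a literal backslash followed by 'n'.
corpus = ['if (x) return 0;', 'if (y) printf("done\\n");', 'if (z) return 1;']
word_index = fit_word_index(corpus)

print(to_words(corpus[1]))
print(to_sequence(corpus[0], word_index, num_words=3))
```

In this toy corpus only `if` (rank 1) and `return` (rank 2) pass a limit of 3; every other word is dropped from the sequence rather than mapped to a placeholder.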

### Create sequence files from the tokens

<b>NOTE:</b> Only the first 1000 samples are used, because training on the entire dataset takes days.

In [25]:
## Tokenizing the train data and creating the matrix
list_tokenized_train = tokenizer.texts_to_sequences(train['functionSource'][:1000])
x_train = tf.keras.preprocessing.sequence.pad_sequences(list_tokenized_train, 
                                  maxlen=INPUT_SIZE,
                                  padding='post')
x_train = x_train.astype(np.int64)
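The call above sets `padding='post'` but leaves `truncating` at its default, `'pre'`. Short functions therefore get trailing zeros, while functions longer than `INPUT_SIZE` tokens lose their beginning and keep only the last `INPUT_SIZE` tokens. A minimal NumPy sketch of that combination (the helper name is hypothetical):

```python
import numpy as np

def pad_post_truncate_pre(sequences, maxlen):
    """Imitate pad_sequences(padding='post') with the default truncating='pre'."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]          # 'pre' truncation keeps the last maxlen tokens
        out[row, :len(kept)] = kept   # 'post' padding leaves zeros at the end
    return out

padded = pad_post_truncate_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded)
```

If the start of a function (signature and declarations) matters more than its tail, passing `truncating='post'` to `pad_sequences` would keep the first tokens instead; whether that helps here is untested.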

In [26]:
x_train

array([[ 270,  650,    6, ...,    0,    0,    0],
       [1306,  242, 1306, ...,    0,    0,    0],
       [  45,   72,    1, ...,    0,    0,    0],
       ...,
       [4204,  172,  244, ...,    0,    0,    0],
       [1539, 1138,   19, ...,    0,    0,    0],
       [ 780,   52,   54, ...,    0,    0,    0]])

<b>The code below is commented out because the testing and validation datasets were not loaded.</b>

In [None]:
## Tokenizing the test data and creating the matrix (commented out, see the note above)
"""
list_tokenized_test = tokenizer.texts_to_sequences(test['functionSource'])
x_test = tf.keras.preprocessing.sequence.pad_sequences(list_tokenized_test, 
                                 maxlen=INPUT_SIZE,
                                 padding='post')
x_test = x_test.astype(np.int64)
"""

In [None]:
## Tokenizing the validation data and creating the matrix
"""
list_tokenized_validate = tokenizer.texts_to_sequences(validate['functionSource'])
x_validate = tf.keras.preprocessing.sequence.pad_sequences(list_tokenized_validate, 
                                 maxlen=INPUT_SIZE,
                                 padding='post')
x_validate = x_validate.astype(np.int64)
"""

In [None]:
# Example data
#test.iloc[0:5,1:6]

In [27]:
train1 = train
train = train[:1000]

### One-Hot Encoding (OHE) on the datasets

In [28]:
y_train=[]
#y_test=[]
#y_validate=[]

for col in range(1,6):
    y_train.append(tf.keras.utils.to_categorical(train.iloc[:,col], num_classes=NUM_CLASSES).astype(np.int64))
    #y_test.append(tf.keras.utils.to_categorical(test.iloc[:,col], num_classes=NUM_CLASSES).astype(np.int64))
    #y_validate.append(tf.keras.utils.to_categorical(validate.iloc[:,col], num_classes=NUM_CLASSES).astype(np.int64))

In [29]:
y_train[0]

array([[1, 0],
       [1, 0],
       [1, 0],
       ...,
       [1, 0],
       [1, 0],
       [1, 0]])
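Each label column becomes a two-column array: `[1, 0]` for a non-vulnerable function and `[0, 1]` for a vulnerable one, which is why the output above shows mostly `[1, 0]` rows. The same encoding can be sketched in plain NumPy; `one_hot` is a hypothetical helper, not the Keras function.

```python
import numpy as np

def one_hot(labels, num_classes=2):
    """Map boolean/integer labels to one-hot rows: False -> [1, 0], True -> [0, 1]."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.eye(num_classes, dtype=np.int64)[labels]

encoded = one_hot([False, True, False])
print(encoded)
```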

In [40]:
# Example data
#y_test[0][1:10]

### Model Definition (CNN with Dropout and 5 Output Splits)

In [30]:
# Create a random weights matrix

random_weights = np.random.normal(size=(WORDS_SIZE, 13),scale=0.01)

In [31]:
# Must use non-sequential model building to create branches in the output layer
inp_layer = tf.keras.layers.Input(shape=(INPUT_SIZE,))
mid_layers = tf.keras.layers.Embedding(input_dim = WORDS_SIZE,
                                    output_dim = 13,
                                    weights=[random_weights],
                                    input_length = INPUT_SIZE)(inp_layer)
mid_layers = tf.keras.layers.Convolution1D(filters=512, kernel_size=(9), padding='same', activation='relu')(mid_layers)
mid_layers = tf.keras.layers.MaxPool1D(pool_size=5)(mid_layers)
mid_layers = tf.keras.layers.Dropout(0.5)(mid_layers)
mid_layers = tf.keras.layers.Flatten()(mid_layers)
mid_layers = tf.keras.layers.Dense(64, activation='relu')(mid_layers)
mid_layers = tf.keras.layers.Dense(16, activation='relu')(mid_layers)
output1 = tf.keras.layers.Dense(2, activation='softmax')(mid_layers)
output2 = tf.keras.layers.Dense(2, activation='softmax')(mid_layers)
output3 = tf.keras.layers.Dense(2, activation='softmax')(mid_layers)
output4 = tf.keras.layers.Dense(2, activation='softmax')(mid_layers)
output5 = tf.keras.layers.Dense(2, activation='softmax')(mid_layers)
model = tf.keras.Model(inp_layer,[output1,output2,output3,output4,output5])

# Define custom optimizers
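# `learning_rate` replaces the deprecated `lr` alias.
# epsilon=1 is far larger than the Keras default (1e-7) and strongly damps the Adam updates.
adam = tf.keras.optimizers.Adam(learning_rate=0.005, beta_1=0.9, beta_2=0.999, epsilon=1, decay=0.0, amsgrad=False)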

## Compile model with metrics
model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
print("CNN model built: ")
model.summary()

CNN model built: 
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 500)]        0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 500, 13)      130000      input_1[0][0]                    
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 500, 512)     60416       embedding[0][0]                  
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D)    (None, 100, 512)     0           conv1d[0][0]                     
____________________________________________________________________________
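The parameter counts in the summary can be reproduced by hand from the layer definitions. The sketch below recomputes them; the embedding and convolution figures match the rows shown above, while the remaining values are derived from the code and do not appear in the truncated summary.

```python
WORDS_SIZE, INPUT_SIZE, EMBED_DIM = 10000, 500, 13
FILTERS, KERNEL, POOL = 512, 9, 5

embedding = WORDS_SIZE * EMBED_DIM               # one 13-d vector per vocabulary id
conv = KERNEL * EMBED_DIM * FILTERS + FILTERS    # kernel weights plus one bias per filter
flat = (INPUT_SIZE // POOL) * FILTERS            # 100 pooled steps x 512 channels
dense64 = flat * 64 + 64
dense16 = 64 * 16 + 16
heads = 5 * (16 * 2 + 2)                         # five independent 2-way softmax heads
total = embedding + conv + dense64 + dense16 + heads

print("embedding:", embedding)
print("conv1d:", conv)
print("flatten width:", flat)
print("total:", total)
```

Nearly all of the weights sit in the first dense layer after `Flatten`, so the pooling size and the filter count dominate the model size.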

In [49]:
pwd

'/content/drive/My Drive/Colab Notebooks'

### Tensorboard Callbacks

### Model Training

In [32]:
# NOTE: class_weights is defined but never passed to fit(), so it has no effect on training.
class_weights = [{0: 1., 1: 5.},{0: 1., 1: 5.},{0: 1., 1: 5.},{0: 1., 1: 5.},{0: 1., 1: 5.}]

# epochs is hard-coded to 40 here; the global EPOCHS=10 is not used.
history = model.fit(x = x_train,
          y = [y_train[0], y_train[1], y_train[2], y_train[3], y_train[4]],
          #validation_data = (x_validate, [y_validate[0], y_validate[1], y_validate[2], y_validate[3], y_validate[4]]),
          epochs = 40,
          batch_size = 128,
          verbose =2)



Epoch 1/40
8/8 - 8s - loss: 3.4622 - dense_2_loss: 0.6924 - dense_3_loss: 0.6928 - dense_4_loss: 0.6926 - dense_5_loss: 0.6911 - dense_6_loss: 0.6934 - dense_2_accuracy: 0.7510 - dense_3_accuracy: 0.6220 - dense_4_accuracy: 0.6900 - dense_5_accuracy: 0.8970 - dense_6_accuracy: 0.4060
Epoch 2/40
8/8 - 6s - loss: 3.4494 - dense_2_loss: 0.6906 - dense_3_loss: 0.6902 - dense_4_loss: 0.6897 - dense_5_loss: 0.6862 - dense_6_loss: 0.6927 - dense_2_accuracy: 0.9330 - dense_3_accuracy: 0.9510 - dense_4_accuracy: 0.9930 - dense_5_accuracy: 0.9920 - dense_6_accuracy: 0.7180
Epoch 3/40
8/8 - 6s - loss: 3.4286 - dense_2_loss: 0.6884 - dense_3_loss: 0.6861 - dense_4_loss: 0.6845 - dense_5_loss: 0.6771 - dense_6_loss: 0.6925 - dense_2_accuracy: 0.9790 - dense_3_accuracy: 0.9650 - dense_4_accuracy: 0.9960 - dense_5_accuracy: 0.9920 - dense_6_accuracy: 0.7600
Epoch 4/40
8/8 - 6s - loss: 3.4027 - dense_2_loss: 0.6859 - dense_3_loss: 0.6810 - dense_4_loss: 0.6779 - dense_5_loss: 0.6653 - dense_6_loss: 0.
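The first-epoch numbers in this log are close to what an untrained model should produce. A two-class softmax that outputs 0.5 for both classes has a cross-entropy of ln 2 (about 0.693), and the reported total loss is the sum over the five heads. The `8/8` step counter follows from 1000 samples at a batch size of 128. A short sanity check:

```python
import math

NUM_HEADS = 5
SAMPLES, BATCH_SIZE = 1000, 128

per_head = math.log(2)                 # loss of a 50/50 guess on two classes
total = NUM_HEADS * per_head           # Keras sums the per-output losses
steps = math.ceil(SAMPLES / BATCH_SIZE)

print(round(per_head, 4), round(total, 4), steps)
```

Since the per-head losses move only slightly away from this value in the epochs shown, the rising accuracy most likely reflects the heads settling on the majority (non-vulnerable) class rather than learning to separate the classes.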

In [33]:
model.save("svd.h5")

In [34]:
pwd

'/content/drive/My Drive/Colab Notebooks'

### Model Evaluation using Testing Set

In [35]:
# Load model
model1 = tf.keras.models.load_model("svd.h5")

In [None]:
"""results = model.evaluate(x_test, y_test, batch_size=128)
for num in range(0,len(model.metrics_names)):
    print(model.metrics_names[num]+': '+str(results[num]))
"""

In [1]:
pwd 

'/content'

In [65]:
cd ..

/content


In [66]:
ls

drive/                  sample_data/    train/            VDISC_train.pickle
projector_config.pbtxt  test_sample.py  VDISC_train.hdf5


In [70]:
train['functionSource'][0]

'clear_area(int startx, int starty, int xsize, int ysize)\n{\n  int x;\n\n  TRACE_LOG("Clearing area %d,%d / %d,%d\\n", startx, starty, xsize, ysize);\n\n  while (ysize > 0)\n  {\n    x = xsize;\n    while (x > 0)\n    {\n      mvaddch(starty + ysize - 2, startx + x - 2, \' \');\n      x--;\n    }\n    ysize--;\n  }\n}'

In [36]:
tt = """
int checkPrimeNumber(int n) {
   int i, flag = 1, squareRoot;

   // computing the square root
   squareRoot = sqrt(n);
   for (i = 2; i <= squareRoot; ++i) {
 
      if (n % i == 0) {
         flag = 0;
         break;
      }
   }
   return flag;
}
"""

In [37]:
tt

'\nint checkPrimeNumber(int n) {\n   int i, flag = 1, squareRoot;\n\n   // computing the square root\n   squareRoot = sqrt(n);\n   for (i = 2; i <= squareRoot; ++i) {\n \n      if (n % i == 0) {\n         flag = 0;\n         break;\n      }\n   }\n   return flag;\n}\n'

In [38]:
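## Tokenizing the test snippet and creating the matrix
# texts_to_sequences expects a list of texts; a bare string is iterated
# character by character (the 252-row output recorded below came from that call).
list_tokenized_test = tokenizer.texts_to_sequences([tt])
x_test = tf.keras.preprocessing.sequence.pad_sequences(list_tokenized_test, 
                                 maxlen=INPUT_SIZE,
                                 padding='post')
x_test = x_test.astype(np.int64)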
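`texts_to_sequences` iterates over its argument, so a single string is treated as a sequence of one-character texts rather than as one function. The snippet `tt` is 252 characters long, which is consistent with the 252 rows in the recorded predictions. A small guard, sketched with a hypothetical helper name, makes the intended batch shape explicit:

```python
def as_text_batch(texts):
    """Wrap a single string so it is treated as one text, not as characters."""
    return [texts] if isinstance(texts, str) else list(texts)

snippet = "int main(void) {\n   return 0;\n}\n"

chars_seen = len(list(snippet))           # what iterating a bare string yields
texts_seen = len(as_text_batch(snippet))  # one text, as intended

print(chars_seen, texts_seen)
```

Passing `as_text_batch(tt)` (or simply `[tt]`) to the tokenizer yields one padded row, and therefore one prediction per output head.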

### Check The Evaluation Metrics

In [39]:
predicted = model.predict(x_test)

In [40]:
print(predicted)

[array([[0.98364174, 0.0163583 ],
       [0.9836375 , 0.01636249],
       [0.98363966, 0.01636031],
       [0.98364025, 0.01635981],
       [0.98364174, 0.0163583 ],
       [0.9836353 , 0.01636469],
       [0.98363584, 0.01636407],
       [0.98363596, 0.01636398],
       [0.9836353 , 0.01636469],
       [0.9836405 , 0.01635953],
       [0.98363656, 0.01636344],
       [0.9836324 , 0.01636759],
       [0.9836375 , 0.01636249],
       [0.9836394 , 0.01636056],
       [0.98363596, 0.01636398],
       [0.98363966, 0.01636031],
       [0.98363805, 0.0163619 ],
       [0.9836394 , 0.01636056],
       [0.98364097, 0.01635907],
       [0.98363596, 0.01636398],
       [0.9836324 , 0.01636759],
       [0.98364174, 0.0163583 ],
       [0.9836375 , 0.01636249],
       [0.98363966, 0.01636031],
       [0.98364025, 0.01635981],
       [0.98364174, 0.0163583 ],
       [0.98363966, 0.01636031],
       [0.98364174, 0.0163583 ],
       [0.98364174, 0.0163583 ],
       [0.98364174, 0.0163583 ],
       [0

In [41]:
pred_test = [[],[],[],[],[]]

for col in range(0,len(predicted)):
    for row in predicted[col]:
        if row[0] >= row[1]:
            pred_test[col].append(0)
        else:
            pred_test[col].append(1)
            
for col in range(0,len(predicted)):
    print(pd.value_counts(pred_test[col]))

0    252
dtype: int64
0    252
dtype: int64
0    252
dtype: int64
0    252
dtype: int64
0    252
dtype: int64
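The nested loop above amounts to an arg-max over the two softmax columns of each head, with ties going to class 0. A compact NumPy sketch of the same decision rule, run on made-up probabilities rather than the model output (helper and variable names are hypothetical):

```python
import numpy as np

def heads_to_labels(predicted):
    """Pick the most likely class per row for every output head (ties -> class 0)."""
    return [np.argmax(np.asarray(head), axis=1) for head in predicted]

# Made-up probabilities for two heads and three inputs.
toy_predicted = [
    np.array([[0.98, 0.02], [0.40, 0.60], [0.50, 0.50]]),
    np.array([[0.10, 0.90], [0.70, 0.30], [0.55, 0.45]]),
]

labels = heads_to_labels(toy_predicted)
# Count class 0 and class 1 per head, like the value_counts output above.
counts = [np.bincount(head, minlength=2).tolist() for head in labels]

print(labels)
print(counts)
```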
