# Artifacts Evaluation Instructions: #506   Combining Static and Dynamic Code Information for Software Vulnerability Prediction

## Preliminaries

This interactive Jupyter notebook provides a small-scale demo of the program representation, vulnerability detection, and deployment steps discussed in the paper.

The main results of our CCS 2023 paper involve comparing the performance of our vulnerability detection with prior machine learning-based approaches. The evaluation presented in our paper was conducted on a much larger dataset and for a longer duration. The intention of this notebook is to provide minimal working examples that can be evaluated within a reasonable time frame.

## Instructions for Experimental Workflow:

Before you start, make a copy of the notebook: go to the landing page, select the checkbox next to the notebook titled *main.ipynb*, and click "**Duplicate**".

Click the name of the newly created Jupyter Notebook, e.g. **AE-Copy1.ipynb**. Next, select "**Kernel**" > "**Restart & Clear Output**". Then, repeatedly press the play button (the tooltip is "run cell, select below") to step through each cell of the notebook.

Alternatively, select each cell in turn and use "**Cell**" > "**Run Cell**" from the menu to run specific cells. Note that some cells depend on previous cells being executed. If any errors occur, ensure all previous cells have been executed.

## Important Notes

**Some cells can take more than half an hour to complete; please wait for the results before stepping to the next cell.**

High load can lead to a long wait for results. This may occur if multiple reviewers are simultaneously trying to generate results. 

The experiments are customisable as the code provided in the Jupyter Notebook can be edited on the spot. Simply type your changes into the code blocks and re-run using **Cell > Run Cells** from the menu.

## Links to The Paper

For each step, we note the section number of the submitted version where the relevant technique is described or data is presented.

The main results are presented in Figures 9-12 of the submitted paper.

# Demo 1: The Concoction Model Architecture

This demo corresponds to the architecture given in Figure 3. Note that this is a small-scale demo for vulnerability detection. The full-scale evaluation used in the paper takes over 24 hours to run.

## Step 1. Program representation

The program representation component maps the input source code and dynamic symbolic execution traces of the target function into a numerical embedding vector.

#### *Static representation model*:

In [1]:
#  Extract Program Information (To show static code information like sec 3.3.1.)
import os
root=os.getcwd()
projectPath=os.path.join(os.getcwd(),'getFeature/data/file')
os.chdir(os.path.join(os.getcwd(),'getFeature/staticFeature'))
!python GetStatic.py $projectPath

/home/CONCOCTION/getFeature/data/file
/home/CONCOCTION/getFeature/staticFeature
finished:0.00%   Batch 1
Get function segmentation ! 
Start of enhancement: io.shiftleft.semanticcpg.passes.languagespecific.fuzzyc.TypeDeclStubCreator
End of enhancement: io.shiftleft.semanticcpg.passes.languagespecific.fuzzyc.TypeDeclStubCreator, after 20ms
Start of enhancement: io.shiftleft.semanticcpg.passes.languagespecific.fuzzyc.MethodStubCreator
End of enhancement: io.shiftleft.semanticcpg.passes.languagespecific.fuzzyc.MethodStubCreator, after 44ms
Start of enhancement: io.shiftleft.semanticcpg.passes.methoddecorations.MethodDecoratorPass
End of enhancement: io.shiftleft.semanticcpg.passes.methoddecorations.MethodDecoratorPass, after 7ms
Start of enhancement: io.shiftleft.semanticcpg.passes.linking.capturinglinker.CapturingLinker
End of enhancement: io.shiftleft.semanticcpg.passes.linking.capturinglinker.CapturingLinker, after 3ms
Start of enhancement: io.shiftleft.semanticcpg.passes.linking.linker

In [11]:
#  Pretrain Representation Models (here we use GraphCodeBERT, see Sec. 3.4)
#preprocess
import os
print(os.getcwd())
current=os.getcwd()
path=os.path.join(current,'model/data/feature')
outputPath=os.path.join(current,'model/data/output_static.txt')
dir=os.path.join(current,'model/pretrainedModel/staticRepresentation')
!cd $dir && python preprocess.py --data_path $path --output_path $outputPath
#pretrain
storedPath=os.path.join(os.getcwd(),'model/pretrainedModel/staticRepresentation/trainedModel')
!cd $dir && python train.py --model_name_or_path graphcodebert-base --train_data_file $outputPath --per_device_train_batch_size 8 --do_train --output_dir $storedPath --mlm --overwrite_output_dir --line_by_line

/home/CONCOCTION
dirs: 1it [00:00, 24.76it/s]
  return torch._C._cuda_getDeviceCount() > 0
09/12/2023 06:18:12 - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir=/home/CONCOCTION/model/pretrainedModel/staticRepresentation/trainedModel, overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Sep12_06-18-12_5ab788861ade, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False,

In [12]:
#  Show the Trained Model (like the example in https://github.com/microsoft/CodeBERT)
from transformers import AutoTokenizer, AutoModel
import torch
storedPath=os.path.join(os.getcwd(),'model/pretrainedModel/staticRepresentation/trainedModel')
tokenizer = AutoTokenizer.from_pretrained(storedPath)
model = AutoModel.from_pretrained(storedPath)
nl_tokens=tokenizer.tokenize("return maximum value")
code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.eos_token]
tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
print(context_embeddings)

Some weights of RobertaModel were not initialized from the model checkpoint at /home/CONCOCTION/model/pretrainedModel/staticRepresentation/trainedModel and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tensor([[[ 0.2573,  0.2688, -0.1238,  ..., -0.1123, -0.1479,  0.3731],
         [ 0.5010, -0.2858,  0.6845,  ...,  0.2044,  0.8592,  0.0892],
         [ 0.1018,  0.6084,  0.0398,  ...,  0.5097, -0.3028,  0.1666],
         ...,
         [ 0.2921,  0.1633,  0.7146,  ...,  0.7883,  0.5627, -0.1219],
         [ 0.2624,  0.6262, -0.0925,  ...,  1.7611, -0.5254,  0.1670],
         [ 0.2574,  0.2688, -0.1238,  ..., -0.1123, -0.1481,  0.3735]]],
       grad_fn=<NativeLayerNormBackward>)
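
The tensor printed above holds one embedding per token, while the program representation step is described as producing a single numerical embedding vector. The following is a minimal sketch of two common ways to pool per-token embeddings into one vector. It uses plain Python lists and toy values as a stand-in for the tensor; the helper names `cls_pool` and `mean_pool` are hypothetical, and which pooling Concoction applies here is an assumption (the dynamic model's training command does pass `--pooler_type cls`).

```python
from typing import List


def cls_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Return the embedding of the first token (the [CLS] position)."""
    return list(token_embeddings[0])


def mean_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Average the token embeddings dimension by dimension."""
    n_tokens = len(token_embeddings)
    return [sum(column) / n_tokens for column in zip(*token_embeddings)]


# Toy stand-in for one sequence: 3 tokens, 4-dimensional embeddings.
# These numbers are illustrative only, not model output.
toy_embeddings = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]

print(cls_pool(toy_embeddings))
print(mean_pool(toy_embeddings))
```

With the real tensor, the same two choices would correspond to indexing the first token or averaging over the token dimension of `context_embeddings`.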



#### *Dynamic representation model*:


In [None]:
#  Extract Program Information (To show dynamic code information like sec 3.3.2.)



In [13]:
#  Pretrain Representation Models (here we use SimCSE, see Sec. 3.4)
#preprocess
import os
print(os.getcwd())
current=os.getcwd()
path=os.path.join(current,'model/data/feature')
outputPath=os.path.join(current,'model/data/output_dynamic.txt')
dir=os.path.join(current,'model/pretrainedModel/dynamicRepresentation')
!cd $dir && python preprocess.py --data_path $path --output_path $outputPath
#training
! cd $dir && python train.py --model_name_or_path bert-base-uncased     --train_file $outputPath   --output_dir ./result    --num_train_epochs 1     --per_device_train_batch_size 32     --learning_rate 3e-5     --max_seq_length 32      --metric_for_best_model stsb_spearman  --load_best_model_at_end     --eval_steps 2     --pooler_type cls     --mlp_only_train     --overwrite_output_dir     --temp 0.05     --do_train

/home/CONCOCTION
dirs: 1it [00:00, 25.94it/s]
cpu
device
09/12/2023 06:19:28 - INFO - __main__ -   Training/evaluation parameters OurTrainingArguments(output_dir='./result', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=0, logging_dir='runs/Sep12_06-19-28_5ab788861ade', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, de

In [15]:
#  Show the Trained Model (like the example in https://github.com/microsoft/CodeBERT)
from transformers import AutoTokenizer, AutoModel
import torch,os
storedPath=os.path.join(os.getcwd(),'model/pretrainedModel/dynamicRepresentation/result')
tokenizer = AutoTokenizer.from_pretrained(storedPath)
model = AutoModel.from_pretrained(storedPath)
nl_tokens=tokenizer.tokenize("return maximum value")
code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
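# BERT tokenizers define no eos_token (it is None, which raises the TypeError recorded in the saved output below),
# so close the sequence with sep_token instead.
tokens=[tokenizer.cls_token]+nl_tokens+[tokenizer.sep_token]+code_tokens+[tokenizer.sep_token]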
tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
print(context_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at /home/CONCOCTION/model/pretrainedModel/dynamicRepresentation/result and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using eos_token, but it is not set yet.


TypeError: Can't convert None to PyString



## Step 2. Vulnerability Detection

 Concoction’s detection component takes the joint embedding as input to predict the presence of vulnerabilities.  Our current implementation only identifies whether a function may contain a vulnerability or bug and does not specify the type of vulnerability. Here we use SARD benchmarks.

**approximate runtime ~ 30 minutes (please wait before moving to the next cell)**
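
To make the idea of predicting from a joint embedding concrete, here is a minimal sketch under stated assumptions: the static and dynamic embeddings are fused by simple concatenation, and a single logistic unit produces the vulnerability score. The function names, weights, and vectors are hypothetical and chosen for illustration; the detection model trained in this step is more elaborate.

```python
import math
from typing import List, Sequence


def joint_embedding(static_vec: Sequence[float], dynamic_vec: Sequence[float]) -> List[float]:
    """Fuse the two views by concatenation (an assumption made for this sketch)."""
    return list(static_vec) + list(dynamic_vec)


def vulnerability_score(joint: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic score in (0, 1); higher means 'more likely vulnerable'."""
    z = sum(w * x for w, x in zip(weights, joint)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy inputs, not real embeddings.
static_vec = [0.5, -0.25]
dynamic_vec = [1.0, 0.0]
joint = joint_embedding(static_vec, dynamic_vec)

# Hypothetical classifier parameters.
weights = [1.0, 2.0, -0.5, 0.3]
score = vulnerability_score(joint, weights, bias=0.0)
label = int(score >= 0.5)  # 1 = flagged as potentially vulnerable

print(len(joint), round(score, 4), label)
```

The output is binary, which matches the statement above that the current implementation flags whether a function may contain a vulnerability without naming its type.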

#### *Dynamic representation model*:


In [None]:
#  Extract Program Information and Generate Joint Embedding (To show dynamic code information like sec 3.3 and sec 3.5.1)



In [26]:
#  Train Detection Model (here we use SimCSE, see Sec. 3.5.2)
import os
print(os.getcwd())
current=os.getcwd()
path=os.path.join(current,'model/data/feature')
dir=os.path.join(current,'model/detectionModel')
!cd $dir && python evaluation_bug.py --path_to_data /home/model/data/sard/cwe416 --mode train


/home/CONCOCTION
0
2023-09-12 06:45:51,369 : Load data from exist npy
[2023-09-12 06:45:51] INFO (root/MainThread) Load data from exist npy
total train files is  146
2023-09-12 06:45:51,529 : Computing embedding for train
[2023-09-12 06:45:51] INFO (root/MainThread) Computing embedding for train
100%|█████████████████████████████████████████████| 3/3 [00:07<00:00,  2.52s/it]
2023-09-12 06:45:59,099 : Computed train embeddings
[2023-09-12 06:45:59] INFO (root/MainThread) Computed train embeddings
2023-09-12 06:45:59,099 : Computing embedding for dev
[2023-09-12 06:45:59] INFO (root/MainThread) Computing embedding for dev
100%|█████████████████████████████████████████████| 1/1 [00:02<00:00,  2.49s/it]
2023-09-12 06:46:01,588 : Computed dev embeddings
[2023-09-12 06:46:01] INFO (root/MainThread) Computed dev embeddings
2023-09-12 06:46:01,589 : Computing embedding for test
[2023-09-12 06:46:01] INFO (root/MainThread) Computing embedding for test
100%|██████████████████████████████████████

2023-09-12 06:46:04,415 : Training for 50 epochs: f1 = 0.9795917957735685, precision = 0.9599999785423279,recall = 1.0,accuracy = 0.9795918367346939
[2023-09-12 06:46:04] INFO (root/MainThread) Training for 50 epochs: f1 = 0.9795917957735685, precision = 0.9599999785423279,recall = 1.0,accuracy = 0.9795918367346939
2023-09-12 06:46:04,434 : Training for 60 epochs: f1 = 0.9795917957735685, precision = 0.9599999785423279,recall = 1.0,accuracy = 0.9795918367346939
[2023-09-12 06:46:04] INFO (root/MainThread) Training for 60 epochs: f1 = 0.9795917957735685, precision = 0.9599999785423279,recall = 1.0,accuracy = 0.9795918367346939
2023-09-12 06:46:04,452 : Training for 70 epochs: f1 = 0.9795917957735685, precision = 0.9599999785423279,recall = 1.0,accuracy = 0.9795918367346939
[2023-09-12 06:46:04] INFO (root/MainThread) Training for 70 epochs: f1 = 0.9795917957735685, precision = 0.9599999785423279,recall = 1.0,accuracy = 0.9795918367346939
2023-09-12 06:46:04,470 : Training for 80 epochs:
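
As a sanity check on the logged metrics, the sketch below recomputes precision, recall, F1, and accuracy from confusion-matrix counts. The logs do not print the confusion matrix, so the counts used here (24 true positives, 1 false positive, 0 false negatives, 24 true negatives) are a hypothetical set chosen because it is consistent with the logged values; the function name is likewise an assumption.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts that reproduce the logged numbers up to float32 rounding.
metrics = classification_metrics(tp=24, fp=1, fn=0, tn=24)
for name, value in metrics.items():
    print(f"{name} = {value:.6f}")
```

With perfect recall and a single false positive, F1 and accuracy coincide at 48/49, which is why the log reports nearly identical values for both.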


In [None]:
#  Show the Trained Model (Load trained model and test on test case)
import os
print(os.getcwd())
current=os.getcwd()
path=os.path.join(current,'model/data/feature')
dir=os.path.join(current,'model/detectionModel')
!cd $dir && python evaluation_bug.py  --model_to_load /home/CONCOCTION/model/detectionModel/f1_0.9795917957735685_2023-09-12.h5 --path_to_data $path  --mode test

/home/CONCOCTION
{'nhid': 2, 'optim': 'adam', 'batch_size': 64, 'tenacity': 5, 'epoch_size': 200}
0
2023-09-12 06:46:41,392 : Load data from exist npy
[2023-09-12 06:46:41] INFO (root/MainThread) Load data from exist npy
  0%|                                                     | 0/1 [00:00<?, ?it/s]

## Step 3. Deployment

This demo shows how to deploy our trained model on a real-world project. Here we use xx as our test project.

#### *Path Selection for Symbolic Execution*:
After training the end-to-end model, we develop a path selection component to automatically select a subset of important paths whose dynamic traces are likely to improve prediction accuracy during deployment.

*approximate runtime ~ 30 minutes*
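
One way to picture "selecting a subset of important paths" is uncertainty sampling, a common active-learning criterion: prefer the paths on which the current model is least certain. The sketch below ranks paths by the binary entropy of a predicted probability and keeps the top `budget` entries. The criterion, the function names, and the probabilities are assumptions for illustration and are not taken from the Concoction implementation.

```python
import math
from typing import Dict, List


def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a Bernoulli prediction; 0 when the model is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_paths(path_probs: Dict[str, float], budget: int) -> List[str]:
    """Return the `budget` path ids with the most uncertain predictions."""
    ranked = sorted(path_probs, key=lambda pid: binary_entropy(path_probs[pid]), reverse=True)
    return ranked[:budget]


# Hypothetical per-path probabilities of "vulnerable" from the current model.
probs = {"path_0": 0.98, "path_1": 0.55, "path_2": 0.10, "path_3": 0.40}
selected = select_paths(probs, budget=2)
print(selected)
```

Only the selected paths would then be handed to symbolic execution, which is what keeps the deployment cost bounded.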

In [None]:
# Execution path representation (shown as Sec. 3.6.1)





In [None]:
# Active learning for path selection (Sec. 3.6.2)
#data preprocess

import os
print(os.getcwd())
current=os.getcwd()
path=os.path.join(current,'model/data/feature')
storedDir=os.path.join(current,'model/data/feature_path')
final_storedDir=os.path.join(current,'model/data/feature_path_text')
dir=os.path.join(current,'model/pathSelection')
os.makedirs(storedDir, exist_ok=True)
! cd $dir && python preprocess.py --data_path $path --stored_path $storedDir
#select path
!cd $dir --data_path $storedDir --stored_path $final_storedDir

In [None]:
# Symbolic execution for chosen paths (Sec. 3.6.3)

#### *Fuzzing for Test Case Generation*:

 We use fuzzing techniques to generate test cases for functions predicted to contain potential vulnerabilities, aiming to automate the testing process and minimize the need for manual inspection.
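
As a rough illustration of the fuzzing step, the sketch below assembles an `afl-fuzz` command line in Python. The flags `-i` (seed directory), `-o` (output directory), `--`, and the `@@` input-file placeholder are standard AFL++ usage; the directory names, the harness path, and the helper `build_afl_command` are hypothetical, and the harness is assumed to have been compiled with AFL++ instrumentation beforehand.

```python
import shlex
from typing import List


def build_afl_command(seed_dir: str, out_dir: str, target: str,
                      file_input: bool = True) -> List[str]:
    """Build an afl-fuzz argument list for one target binary."""
    cmd = ["afl-fuzz", "-i", seed_dir, "-o", out_dir, "--", target]
    if file_input:
        # AFL++ substitutes @@ with the path of each generated test case.
        cmd.append("@@")
    return cmd


# Hypothetical paths for a function flagged by the detection model.
afl_cmd = build_afl_command("seeds", "findings", "./target_harness")
print(shlex.join(afl_cmd))

# To launch the fuzzer from the notebook, one could pass the list to
# subprocess.run(afl_cmd, check=True); it is left out here because
# afl-fuzz runs until interrupted.
```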

In [None]:
# Apply AFL++ to the target project (see Sec. 3.7)

## Demo 2: Experimental Evaluation

Here, we provide a small-scale evaluation to showcase the working mechanism of Concoction's bug detection. A full-scale evaluation, which takes more than a day to run, is provided through the Docker image (with detailed instructions on our project's GitHub page).

### Large-scale Testing (Section 5.1)

This part (add a web link) gives a quantified summary of Concoction for detecting function-level code vulnerabilities across the 20 projects listed in Table 1 of our paper.

This demo corresponds to Table 5 of the submitted manuscript.


### Evaluation on Open Dataset (Section  5.2)

We now evaluate our vulnerability detection model on the SARD and CVE datasets listed in Table 2 of the paper.

This demo corresponds to Figures 9 and 10 of the submitted manuscript.

*approximate runtime = 10 minutes for one benchmark*

#### *SARD Dataset*:


In [None]:
# Prepare the dataset and preprocess it (use parameters to change the CWE type and method,
#                                         like FUN A( dataset = 'CWE-123', method = 'Vuldeepecker' ))

# Train and test.

#### *GitHub Dataset*:


In [None]:
# Prepare the dataset and preprocess it (use parameters to select the method, like FUN A( dataset = 'Github', method = 'Vuldeepecker' ))

# Train and test.


#### Full-scale evaluation data

We now plot the diagrams using full-scale evaluation data (it would take too long to run the experiment live). The results correspond to Figures 9 and 10 (Section 5.2) of the submitted manuscript.
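
Figure 9 is described as using min-max bars across vulnerability types. As a sketch of how such a bar can be derived, the helper below reduces per-type scores to a minimum, mean, and maximum, and then to the lower and upper offsets that an error-bar plot typically expects. The helper names are hypothetical and the scores are made-up placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, Tuple


def min_mean_max(scores_by_type: Dict[str, float]) -> Tuple[float, float, float]:
    """Summarise one method's scores across vulnerability types."""
    values = list(scores_by_type.values())
    return min(values), mean(values), max(values)


def error_bar_offsets(summary: Tuple[float, float, float]) -> Tuple[float, float]:
    """Distances from the mean down to the minimum and up to the maximum."""
    lowest, centre, highest = summary
    return centre - lowest, highest - centre


# Placeholder values for illustration only.
placeholder_scores = {"type_a": 0.5, "type_b": 1.0, "type_c": 0.75}
summary = min_mean_max(placeholder_scores)
offsets = error_bar_offsets(summary)
print(summary, offsets)
```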

In [None]:
# Output Figure 9: Evaluation on standard vulnerability databases. Min-max bars show performance across vulnerability types.
# Output Figure 10: Evaluation on the CVE dataset. Concoction gives the best performance across evaluation metrics.

### Evaluation on Open-source Projects (Section 5.3)

We now compare Concoction with the baseline methods by applying them to the three open-source projects in Table 3, which have a total of 35 CVEs reported by independent users.

This demo corresponds to Figure 11 of the submitted manuscript.

*approximate runtime = xx minutes for one benchmark*

#### Performance evaluation on benchmarks
Benchmarks: SQLite, LibTIFF, libpng.

*approximate runtime ~ 20 minutes*

In [None]:
# Load model and test benchmarks (input the benchmark and method, like FUN A( project = 'Sqlite', method = 'Concoction' )).

#### Full-scale evaluation data
We now generate the table using full-scale evaluation data (it would take too long to run the experiment live). The results correspond to Figure 11 (Section 5.3) of the submitted manuscript.

In [None]:
# Output Figure 11: The number of vulnerabilities identified by Concoction and other methods for the open-source projects in Table 3.

### Further Analysis (Alternative)

####  DL model implementation choices.

We evaluate several variants of Concoction on the CVE dataset.

This demo corresponds to Figure 12 of the submitted manuscript.

*approximate runtime = xx minutes for one benchmark*

In [None]:
# Load model and test CVE benchmarks (input the method, like FUN A( method = 'Static' )).

#### Full-scale evaluation data
We now generate the figure using full-scale evaluation data (it would take too long to run the experiment live). The results correspond to Figure 12 (Section 5.4) of the submitted manuscript.


In [None]:
# Output Figure 12:  Comparing implementation variants of Concoction on the CVE dataset.