Reproduction package for "Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection"
This artifact includes the code to reproduce our experiments on DeepDFA, LineVul, and LineVul+DeepDFA from our paper accepted at ICSE 2024. It constitutes a large part of the research prototype; the remaining experiments in the paper consist of running other models and other datasets.
Links:
- Paper: "Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection" https://dl.acm.org/doi/10.1145/3597503.3623345
- ArXiv preprint: https://arxiv.org/abs/2212.08108
- Data package: https://doi.org/10.6084/m9.figshare.21225413
- GitHub repo: https://github.com/ISU-PAAL/DeepDFA
- Initial data package creation: September 20, 2023
- Cleanup for artifact evaluation: January 04, 2024
  - Add full usage instructions and scripts
  - Fix bugs
- Integrate feedback from artifact evaluation: January 10, 2024
  - Add documentation and convenience scripts
  - Add Docker container
- Link to dataset: https://doi.org/10.6084/m9.figshare.21225413
- Link to Big-Vul which we used:
- Link to LineVul's version of Big-Vul dataset which we used to run LineVul for baseline and DeepDFA+LineVul experiments:
This section describes the organization and contents of the datasets and preprocessed files. Datasets and preprocessed files must be downloaded from Figshare and unpacked into specific locations in the repository (use `scripts/download_all.sh` or follow the download instructions below). The total storage requirements are about 8 GB of download bandwidth and 45 GB of unzipped storage space.
- Repository organization:
  - `CodeT5`: the code for the CodeT5 and CodeT5+DeepDFA models.
  - `DDFA`: the code for the DeepDFA model.
  - `LineVul`: the code for the LineVul and LineVul+DeepDFA models.
  - `scripts`: miscellaneous scripts we used to report the results.
  - `README.md`: this file.
- Datasets in `DDFA/storage/external` and `LineVul/data/MSR`:
  - `DDFA/storage/external/MSR_data_cleaned.csv`: the original Big-Vul dataset. To obtain this file, unzip `MSR_data_cleaned.zip` in `DDFA/storage/external/`.
  - `DDFA/storage/external/linevul_splits.csv`: a manifest of which dataset partition (training, validation, or test) contains each example in the Big-Vul dataset, according to LineVul's random split, which we reused for our experiments.
  - `LineVul/data/MSR/{train,val,test}.csv`: the Big-Vul dataset, preprocessed for consumption by LineVul. To obtain these files, unzip `MSR_LineVul.zip` in `LineVul/data/MSR/`.
- Cache files in `DDFA/storage/cache`:
  - `DDFA/storage/cache/bigvul_valid_False.csv`: the CFGs produced by Joern can have some irregularities which make them invalid for use in DeepDFA. This file caches a manifest of which CFG files are valid (see the inspection sketch after the preprocessed-files list below). Its columns:
    - `id`: the ID of the graph being referenced.
    - `valid`: True if the graph is valid, False otherwise.
- Preprocessed files in `DDFA/storage/processed`:
  - `DDFA/storage/processed/bigvul/before`: the CFGs produced by Joern, organized as a collection of files where `<id>` is the numeric ID of the corresponding dataset example (a short inspection sketch follows this list):
    - `DDFA/storage/processed/bigvul/before/<id>.c`: source code of the example.
    - `DDFA/storage/processed/bigvul/before/<id>.c.nodes.json`: serialized list of AST nodes in the example's CFG. Each object in the array contains information about one node:
      - `id`: a number uniquely identifying the node within a graph.
      - `_label`: the type of the node, assigned by Joern.
      - `columnNumber`: the column number this node occurs on.
      - `lineNumber`: the line number this node occurs on.
      - `order`: the left-to-right order of the node, relative to any sibling nodes.
      - `code`: the source code of the node.
      - `name`: the identifier, if any, referenced in this node.
      - `typeFullName`: the type declared in the node.
    - `DDFA/storage/processed/bigvul/before/<id>.c.edges.json`: serialized list of edges in the example's CFG. Each element of the array describes one directed edge:
      - Index `0`: the ID of the in-node of this edge.
      - Index `1`: the ID of the out-node of this edge.
      - Index `2`: the type of the edge, e.g. AST, CFG, ...
      - Index `3`: dataflow information attached to this edge.
    - `DDFA/storage/processed/bigvul/before/<id>.c.cpg.bin`: saved CFG generated by Joern, in Joern's binary format.
    - To unpack the above CFGs, unzip `before.zip` in `DDFA/storage/processed/bigvul`.
  - `DDFA/storage/processed/bigvul/eval/statement_labels.pkl`: a list of which lines are dependent on the lines added in Big-Vul, used to compute the line-level vulnerability labels.
  - `DDFA/storage/processed/bigvul/nodes.csv`: a CSV version of the CFG nodes in `DDFA/storage/processed/bigvul/before`.
  - `DDFA/storage/processed/bigvul/edges.csv`: a CSV version of the CFG edges in `DDFA/storage/processed/bigvul/before`.
  - `DDFA/storage/processed/bigvul/graphs.bin`: a single file containing the DGL-format graphs, produced by `DDFA/sastvd/scripts/dbize_graphs.py` and loaded by `DDFA/sastvd/linevd/graphmogrifier.py`.
  - `DDFA/storage/processed/bigvul/nodes_feat__ABS_DATAFLOW_<settings>.csv`: abstract dataflow information for one feature. Some of these files will be combined to form the optimal feature set. Each row contains information about the abstract dataflow features attached to a node:
    - `node_id`: the ID of the node which this row references.
    - `graph_id`: the ID of the graph which this row references.
    - `_ABS_DATAFLOW_<feature>_all_limitall_<all-limit>_limitsubkeys_<subkey-limit>`: the index of the abstract dataflow feature, where:
      - `<feature>`: which of the four features is being referenced; one of `api`, `datatype`, `literal`, or `operator`.
      - `<all-limit>`: the maximum number of indices in the abstract dataflow embedding; all less-frequent items are given the `unknown` index as a placeholder.
      - `<subkey-limit>`: the maximum number of indices in any individual abstract dataflow feature (i.e. subkey); all less-frequent items are given the `unknown` index as a placeholder.
  - `DDFA/storage/processed/bigvul/abstract_dataflow_hash_api_datatype_literal_operator.csv`: the selected abstract dataflow features with the optimal settings, which will be used for DeepDFA. Each row contains information about the abstract dataflow embedding attached to a node:
    - `node_id`: the ID of the node which this row references.
    - `graph_id`: the ID of the graph which this row references.
    - `hash`: the combination of abstract dataflow features occurring in this node.
  - To unpack the above preprocessed and cache files, unzip `preprocessed_data.zip` in the repository root.
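For orientation, here is a minimal Python sketch that inspects a few of the files described above. It is not part of the pipeline; run it from the repository root after unpacking the data, and substitute a real `<id>` from `DDFA/storage/processed/bigvul/before` for the hypothetical `example_id`.

```python
# Minimal inspection sketch (not part of the pipeline); run from the repository root
# after unpacking preprocessed_data.zip and before.zip.
import json

import pandas as pd

BEFORE = "DDFA/storage/processed/bigvul/before"
example_id = 12345  # hypothetical -- substitute any <id> present in BEFORE

# Manifest of which CFGs are valid for DeepDFA (columns: id, valid).
valid = pd.read_csv("DDFA/storage/cache/bigvul_valid_False.csv")
print(valid["valid"].value_counts())

# AST/CFG nodes: one JSON object per node, with fields such as id, _label, code.
with open(f"{BEFORE}/{example_id}.c.nodes.json") as f:
    nodes = json.load(f)
print(len(nodes), "nodes; first node label:", nodes[0]["_label"])

# Edges: each entry is [in-node id, out-node id, edge type (e.g. AST, CFG), dataflow info].
with open(f"{BEFORE}/{example_id}.c.edges.json") as f:
    edges = json.load(f)
print(sum(1 for e in edges if e[2] == "CFG"), "CFG edges out of", len(edges))

# Selected abstract dataflow features (columns: node_id, graph_id, hash).
hashes = pd.read_csv(
    "DDFA/storage/processed/bigvul/abstract_dataflow_hash_api_datatype_literal_operator.csv"
)
print(hashes[["node_id", "graph_id", "hash"]].head())
```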
- Hardware: We ran the experiments on an AMD Ryzen 5 1600 3.2 GHz processor with 48GB of RAM and an Nvidia 3090 GPU with 24GB of GPU memory.
- Software:
- Linux operating system (we tested on Ubuntu version 22.04).
- Anaconda3 (we tested on version 23.11.0).
- CUDA and CUDA toolkit (we tested on version 11.8).
We also provide a Docker container which fulfills the software requirements. To use the container with CUDA, please ensure you have installed the NVIDIA Container Toolkit. We also provide a script to run without CUDA (and without installing the NVIDIA Container Toolkit), if desired.
Use these scripts to run the main performance experiments from the paper (Table 3b).
These instructions start a Docker container with the required software.
# Download the code
git clone https://github.com/ISU-PAAL/DeepDFA
cd DeepDFA
# Start the Docker container with an interactive shell; requires NVIDIA Container Toolkit to be installed.
bash scripts/docker_run.sh # may take up to 10 minutes to download the Docker container
# Optional: run without CUDA; this removes the requirement for NVIDIA Container Toolkit.
bash scripts/docker_run_cpu.sh # may take up to 10 minutes to download the Docker container
# Now inside the Docker container shell
# Download the data
bash scripts/download_all.sh # takes about 30 minutes
# Run the main experiments
bash scripts/performance_evaluation.sh # Takes about 20 hours to run all experiments; see below for individual experiments
# Optional: run without CUDA; this removes the requirement for NVIDIA Container Toolkit.
bash scripts/performance_evaluation_cpu.sh
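If you are unsure whether the container (or your local environment) can see the GPU, a quick check like the sketch below can help; it assumes PyTorch is available in the environment (the models are PyTorch-based).

```python
# Optional GPU visibility check; assumes PyTorch is available in the environment.
import torch

if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device visible; use the *_cpu.sh variants of the scripts.")
```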
We also provide wrapper scripts to execute the various steps without Docker.
# Download the code
git clone https://github.com/ISU-PAAL/DeepDFA
cd DeepDFA
# Setup the environment
. scripts/setup_environment.sh # may take up to 20 minutes to download the packages
# Download the data
bash scripts/download_all.sh
# Run the main experiments
bash scripts/performance_evaluation.sh # Takes about 20 hours to run all experiments; see below for individual experiments
The following steps set up the environment step by step, including the optional Joern installation (used by the preprocessing scripts).
git clone https://github.com/ISU-PAAL/DeepDFA
cd DeepDFA
# In repository root directory
. scripts/setup_environment.sh
# Optional: install joern and add it to the executable path. Takes about 10 minutes
bash scripts/install_joern.sh
export PATH="$PWD/joern/joern-cli:$PATH"
Our code assumes that the datasets and preprocessed data files are placed in the following locations:
- Unzip `MSR_data_cleaned.zip` in `DDFA/storage/external/` (1.44 GB, 10.79 GB unzipped).
- Unzip `MSR_LineVul.zip` in `LineVul/data/MSR/` (1.72 GB, 10.96 GB unzipped).
- Unzip `preprocessed_data.zip` in the repository root (1.67 GB, 8.87 GB unzipped).
- Unzip `before.zip` in `DDFA/storage/processed/bigvul` (3.38 GB, 14.70 GB unzipped).
The following script downloads and unpacks all the data to the proper locations.
# In repository root directory
bash scripts/download_all.sh
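After the download finishes, a short sanity check such as the sketch below (run from the repository root) can confirm that the archives were unpacked into the locations listed above; the paths come from this README.

```python
# Sanity check: verify the data landed in the locations described above.
# Run from the repository root.
from pathlib import Path

expected = [
    "DDFA/storage/external/MSR_data_cleaned.csv",
    "DDFA/storage/external/linevul_splits.csv",
    "LineVul/data/MSR/train.csv",
    "LineVul/data/MSR/val.csv",
    "LineVul/data/MSR/test.csv",
    "DDFA/storage/cache/bigvul_valid_False.csv",
    "DDFA/storage/processed/bigvul/before",
]
for path in expected:
    print("OK     " if Path(path).exists() else "MISSING", path)
```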
This script trains DeepDFA based on the dataset of source code already processed into CFGs. It reports the performance at the end, comparable to Table 3b in our paper.
cd DDFA
# Train DeepDFA
bash scripts/train.sh --seed_everything 1 # takes about 30 minutes
These scripts report performance of the LineVul and DeepDFA+LineVul models, comparable to Table 3b in our paper.
cd LineVul/linevul
# Train LineVul
bash scripts/msr_train_linevul.sh 1 MSR # takes about 10 hours
# Train DeepDFA+LineVul
bash scripts/msr_train_combined.sh 1 MSR # takes about 10 hours
The above scripts use the preprocessed data included in our data archive, for ease of replication. The instructions below show how to run the code end-to-end. Because the current prototype scripts take some time to process the data into the format for our dataset, we provide instructions for both sample mode and full data mode.
cd DDFA
bash scripts/preprocess.sh --sample # may take several hours. Requires Joern to be installed
# Train DeepDFA
bash scripts/train.sh --seed_everything 1 --data.sample_mode True
cd ../LineVul/linevul
# Train DeepDFA+LineVul
bash scripts/msr_train_combined.sh 1 MSR --sample --epochs 1
To run the full preprocessing, from the raw MSR dataset to training DeepDFA, unpack only `storage/external/MSR_data_cleaned.csv` (skipping the preprocessed data such as the CFGs in `storage/processed`) and run these steps.
cd DDFA
bash scripts/preprocess.sh # may take several hours. Requires Joern to be installed
# Train DeepDFA
bash scripts/train.sh --seed_everything 1
This section contains extended instructions for running additional experiments reported in our paper.
For the experiments below, see our extended data package at https://doi.org/10.6084/m9.figshare.21225413.v1, file `data.zip`.
We forked this portion of our implementation from LineVD. We used their code to generate CFGs with Joern v1.1.1072 and load the dataset.
Scripts for running the different experiments:
bash scripts/train.sh --seed_everything 1 # train on MSR, see Table 3b
bash scripts/run_profiling.sh <checkpoint_from_training> # run profiling on trained checkpoint, see Table 5
bash scripts/run_cross_project.sh # train on mixed-project, evaluate on mixed- and cross-project, see Table 7
The coverage of the abstract dataflow embedding (running with `--analyze_dataset`) is logged in `logs/1.Effectiveness/DDFA/analyze_dataset.log`.
We forked this portion of our implementation from LineVul. We used their code to evaluate LineVul and combine with DDFA.
# run DDFA preprocessing before running any *_combined.sh
cd LineVul/linevul
# MSR
bash scripts/msr_train_linevul.sh 1 MSR # original LineVul model (without DeepDFA), see Table 3b
bash scripts/msr_train_combined.sh 1 MSR # LineVul + DeepDFA, see Table 3b
# cross project, see Table 7
bash scripts/cross_project_train_linevul.sh 1 cross_project/fold_0_holdout
bash scripts/cross_project_eval_linevul.sh 1 saved_models/cross_project-fold_0_dataset/checkpoint-best-f1/1_linevul.bin cross_project/fold_0_holdout
bash scripts/cross_project_train_combined.sh 1 cross_project/fold_0_holdout
bash scripts/cross_project_eval_combined.sh 1 saved_models/cross_project-fold_0_dataset/checkpoint-best-f1/1_combined.bin cross_project/fold_0_holdout
`missing_ids.txt` is a list of the 7% of the dataset examples which could not be parsed by DDFA.
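If you want to exclude those examples in your own analysis, a sketch like the following works, assuming the file contains one numeric ID per line (check the actual format of your copy):

```python
# Read the IDs that could not be parsed (assumes one numeric ID per line).
with open("missing_ids.txt") as f:
    missing_ids = {int(line) for line in f if line.strip()}
print(len(missing_ids), "examples could not be parsed by DDFA")
```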
Use `scripts/report_profiling.py`. For example:
python scripts/report_profiling.py --profiledata logs/2.Efficiency/profiling/DDFA/gpu_bs1/profiledata.jsonl
gflops: 763.31798 average: 0.04079514617070173
gmacs: 763.31798 average: 0.04079514617070173
For LineVul and LineVul+DDFA, GMACs are printed in `profile.log` (consider the first occurrence of `XXX GMACs`).
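To pull that number out of the log programmatically, a small sketch like the one below can work; the path to `profile.log` and the exact regex are assumptions to adapt to your run.

```python
# Extract the first "XXX GMACs" occurrence from a LineVul profiling log.
# The log path and the regex are assumptions -- adjust them to your run.
import re

with open("profile.log") as f:
    match = re.search(r"([\d.]+)\s*GMACs", f.read())
print("GMACs:", match.group(1) if match else "not found")
```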
The scripts for training ablation models and evaluating them on DbgBench are in `ablation_study/scripts`.
The scripts, notebooks, and model outputs for extracting performance metrics are in `ablation_study/logs_bigvul` and `ablation_study/logs_dbgbench`.
The scripts for training CodeT5 and DeepDFA+CodeT5 are in `CodeT5/code/CodeT5/sh`.
The logs of running the scripts are in `CodeT5/logs/*/train.log`.
The scripts for training UniXcoder and DeepDFA+UniXcoder are in `UniXcoder/scripts`.
This includes an updated script, `linevul_main.py`, for running the LineVul model with a UniXcoder backbone; it can be placed into the `LineVul/linevul` source directory to run with the code in the main data package.
The logs of running the scripts are in the various `logs_*` directories.
Notebooks are included to summarize model performance in `logs_size` and `logs_crossproject`, and a script is included for DbgBench in `logs_dbgbench`.
The scripts and model outputs for running statistical tests on LLM vs. DeepDFA+LLM are in `statistical_test`.
Please see `linevul.ipynb` to run LineVul vs. DeepDFA+LineVul, and `unixcoder.ipynb` to run UniXcoder vs. DeepDFA+UniXcoder.
If you use our code in your research, please consider citing our paper:
Benjamin Steenhoek, Hongyang Gao, and Wei Le. 2024. Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE ’24), April 14–20, 2024, Lisbon, Portugal. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3597503.3623345