These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Install the necessary dependencies before running the project. All these softwares and libraries are needed in our project during implementing the graph processing and model training module.
For graph-based method, please reference
Graph-based Method/requirements.txtfor more details.
Clone this project, deploy on your own local machine, and get ready for your first running.
$
$ pip install -r requirements.txt
$ cd EdgeProcessing/slicec_7edges_funcblock
$ javac -cp lib/org/eclipse/cdt.core/5.6.0.201402142303/*:lib/org/eclipse/equinox.common/3.6.200.v20130402-1505/*:lib/com.ibm.icu-4.4.2.jar -d bin src/main/java/slice/*.java src/main/java/sevenEdges/*.java src/main/java/sevenEdges/treeview/*.java src/main/java/sevenEdges/treeview/ast/*.java src/main/java/sevenEdges/nodeTraversal/*.java
This part is an instruction of our graph-based model, which leverages the advances in graph neural networks(GNNs) to develop a learning method to capture the potential code patterns under composite graph structures.
The basic idea of this part is to combine different types of code graph structures, to find out if the composite graphs outperform single graph structures and how much. We selected AST, CFG, PDG and their various combinations as our experimental targets, and we extract these graph structures from C/C++ source code utilizing the code analysis tool Joern.
├── README.md <- The top-level README for developers using this project.
├── requirements.txt <- The necessary python environments.
|
├── GraphModelClient <- The model training module
│ ├── cli
│ │ ├── train.py <- The entrance of training models.
│ │ ├── test.py <- Testing the specified model using data.
│ │ ├── __init__.py
│ ├── cli_utils
│ │ ├── default_hypers
│ │ ├── dataset_utils.py
│ │ ├── model_utils.py
│ │ ├── param_helpers.py
│ │ ├── task_utils.py
│ │ ├── training_utils.py
│ │ ├── __init__.py
│ ├── data
│ │ ├── data
│ │ │ ├── data_preprocess.py
│ │ │ ├── our_map_all.txt
│ │ │ ├── __init__.py
│ │ ├── graph_dataset.py
│ │ ├── jsonl_graph_dataset.py
│ │ ├── jsonl_graph_property_dataset.py
│ │ ├── __init__.py
│ ├── layers
│ │ ├── message_passing
│ │ │ ├── ggnn.py
│ │ │ ├── gnn_edge_mlp.py
│ │ │ ├── gnn_film.py
│ │ │ ├── message_passing.py
│ │ │ ├── __init__.py
│ │ ├── gnn.py
│ │ ├── graph_global_exchange.py
│ │ ├── nodes_to_graph_representation.py
│ │ ├── __init__.py
│ ├── models
│ │ ├── graph_binary_classification_task.py
│ │ ├── graph_regression_task.py
│ │ ├── graph_task_model.py
│ │ ├── node_multiclass_task.py
│ │ ├── __init__.py
│ ├── utils
│ │ ├── activation.py
│ │ ├── constants.py
│ │ ├── gather_dense_gradient.py
│ │ ├── param_helpers.py
│ └── └── __init__.py
│ └── __init__.py
|
├── EdgeProcessing <- The graph processing module.
│ ├── README.md <- The README for developers extracting various graphs.
│ ├── joern-cli <- Use Joern library to extract different graph edges.
│ │ ├── bin
│ │ ├── lib
│ │ ├── graph <- Scripts to generate various graphs.
│ │ │ ├── ast.sc <- Scripts to generate AST graph.
│ │ │ ├── ast+cfg.sc <- Scripts to generate AST+CFG combined graph.
│ │ │ ├── all.sc <- Scripts to generate AST+CFG+PDG composite graph.
│ │ ├── joern
│ │ ├── joern-parse
│ ├── slicec_7edges_funcblock <- Utils for extracting graphs.
│ │ ├── bin
│ │ ├── lib
│ │ ├── src
To construct various graph structures form C/C++, we use Eclipse CDT library and Joern to extract and integrate different graphs.
For more details, please reference EdgeProcessing/README.
We use CDT library to extract the target functions in sample files.
$ cd EdgeProcessing/slicec_7edges_funcblock
- Run slice.ClassifyFileOfProject to extract all the C file from the SARD dataset / Run slice.NvdClassifyFile for the NVD dataset.
- Run slice.Main to slice SARD data in function level / slice.NvdMain for the NVD dataset.
We use Joern to extract edges of specific graphs and classify them by types. Then we traverse the source codes' AST nodes parsed by CDT library, and integrate edges with AST nodes to generate target graphs.
$ cd EdgeProcessing/joern-cli
- Use Joern to get all the specific edge relationships(i.e. control flows and data flows)
$ cd EdgeProcessing/slicec_7edges_funcblock
- Run sevenEdges.Main to extract source codes' AST nodes from SARD / sevenEdges.NvdMain for the NVD dataset.
- Run sevenEdges.concateJoern to integrate nodes with edges.
This part of the code provides a user-friendly interface. For a quick start, you can simply run the command below:
$ cd GraphModelClient/cli
$ CUDA_VISIBLE_DEVICES=0 python train.py GGNN GraphBinaryClassification ../data/data/ast_graph
Train/Val/Test ratios - 0.8/0.1/0.1
Example results of training on the SARD dataset with composite AST+CFG+PDG graph.
Saved Model checkpoint at 78 epochs.
Dataset parameters used: {
"max_nodes_per_batch": 128, "num_fwd_edge_types": 7, "add_self_loop_edges": true, "tie_fwd_bkwd_edges": true, "threshold_for_classification": 0.5
}
Model parameters used: {
"gnn_aggregation_function": "sum", "gnn_message_activation_function": "ReLU", "gnn_hidden_dim": 128, "gnn_use_target_state_as_input": false, "gnn_normalize_by_num_incoming": true, "gnn_num_edge_MLP_hidden_layers": 1, "gnn_message_calculation_class": "GGNN", "gnn_initial_node_representation_activation": "tanh", "gnn_dense_intermediate_layer_activation": "tanh", "gnn_num_layers": 4, "gnn_dense_every_num_layers": 10000, "gnn_residual_every_num_layers": 2, "gnn_use_inter_layer_layernorm": true, "gnn_layer_input_dropout_rate": 0.2, "gnn_global_exchange_mode": "gru", "gnn_global_exchange_every_num_layers": 10000, "gnn_global_exchange_weighting_fun": "softmax", "gnn_global_exchange_num_heads": 4, "gnn_global_exchange_dropout_rate": 0.2, "optimizer": "Adam", "learning_rate": 0.001, "learning_rate_decay": 0.98, "momentum": 0.85, "gradient_clip_value": 1.0, "use_intermediate_gnn_results": false, "graph_aggregation_num_heads": 128, "graph_aggregation_hidden_layers": [4], "graph_aggregation_dropout_rate": 0.1, "gnn_num_aggr_MLP_hidden_layers": null
}
== Running on test dataset
Loading data from ../data/data/tem_all_graph/new/ast.
Restoring best model state from trained_model/GGNN_GraphBinaryClassification__2022-03-21_12-38-21_best.pkl.
CP_test Accuracy = 0.750|precision = 0.583 | recall = 0.875 | f1 = 0.700 |TPR = 0.875 | FPR = 0.312 | TNR = 0.688 | FNR = 0.125 |
Guidance and ideas for some parts from:
Graph-based method project based on: