CodeSem is a dataset of programs extracted from real-world flagship software codebases (e.g., the Linux Kernel, GCC, and MySQL) and manually validated for equivalence prediction and alias prediction. This repository includes not only the CodeSem dataset but also re-implementations of the CuBERT, CodeBERT, GGNN, and Graph Sandwiches models, as well as scripts for converting program source code into the corresponding model inputs.
Here is a quick start for reproducing our experiments.

```bash
# clone this repository
git clone https://github.com/CodeSemDataset/CodeSem.git
cd CodeSem

# build the docker image
docker build --network=host -t "tf:2.6" .

# run a new docker container
# $Dir is the path on your local machine to the directory containing the cloned repository
docker run --gpus all -it --rm -v $Dir/CodeSem:/tmp tf:2.6 bash
```
To generate model inputs for the sequence models and the graph models, refer to the READMEs in `scripts/tokens` and `scripts/graph`, respectively.
To train and test the models, refer to the README of each code model under `models/`.
CodeSem is a dataset built upon the source code of real-world flagship software (e.g., the Linux Kernel, GCC, and MySQL) and has been manually validated for two prediction tasks: (1) alias prediction, in which models predict whether two pointers must alias, may alias, or must not alias; and (2) equivalence prediction, in which models predict whether two programs are semantically equivalent. We re-designed four influential code models, CuBERT, CodeBERT, GGNN, and Graph Sandwiches, for alias and equivalence prediction, and performed a head-to-head comparison on CodeSem.
Fourteen open-source projects were selected to construct CodeSem: Linux Kernel, GCC, MySQL, Git, tmux, Redis, curl, LevelDB, H2O, libgit2, The Silver Searcher, Protocol Buffers, aria2, and fish. These are all mature projects, most of which are large-scale (with hundreds of thousands or even millions of lines of code) and have a decades-long history. The source code of these projects implements a broad range of functionalities, such as data transmission, memory management, and cross-compilation, which in turn makes CodeSem diverse. Details of the selected projects are given in the table below.
Projects | Description | Version | Size (KLoC) |
---|---|---|---|
Linux Kernel | The main component of a Linux operating system. | 5.3.6 | 18,724 |
MySQL | A relational database management system. | 8.0.25 | 3,595 |
GCC | A compiler for the GNU operating system. | 10.3.0 | 9,173 |
Git | A distributed version control system. | f443b2 | 684 |
tmux | A terminal multiplexer. | 5071b82 | 68 |
Redis | An in-memory key-value database. | 6.2.6 | 206 |
curl | A data transferring tool. | 7.79.0 | 254 |
LevelDB | A key-value storage library. | 1.23 | 22 |
H2O | An optimized HTTP/1, HTTP/2, HTTP/3 server. | 3e4b697 | 284 |
libgit2 | A cross-platform, linkable library implementation of Git. | 2fc0fcb | 236 |
The Silver Searcher | A code-searching tool. | a61f178 | 6 |
Protocol Buffers | A cross-platform data format used to serialize structured data. | 3.20.0 | 650 |
aria2 | A utility for downloading files. | 1.36.0 | 136 |
fish | A user-friendly command line shell. | 3.4.1 | 407 |
We assemble datasets for the following two tasks: alias prediction and equivalence prediction. The structure of the dataset is shown below:
```
.
├── alias_prediction
│   ├── fine-tune
│   │   ├── curl.csv
│   │   ├── gcc.csv
│   │   └── ...
│   └── specialized_pre-train
│       ├── gcc.csv
│       └── ...
└── equivalence_prediction
    ├── fine-tune
    │   ├── aria2_compare_equal_pairs.csv
    │   ├── aria2_compare_inequal_pairs.csv
    │   ├── curl_compare_equal_pairs.csv
    │   ├── curl_compare_inequal_pairs.csv
    │   └── ...
    ├── specialized_pre-train
    │   ├── aria2_compare_equal_pairs.csv
    │   ├── aria2_compare_inequal_pairs.csv
    │   ├── curl_compare_equal_pairs.csv
    │   ├── curl_compare_inequal_pairs.csv
    │   └── ...
    └── AllProjects
        ├── aria2_compare
        │   ├── 1
        │   │   ├── wslay_event_recv_508-135788_607-140735.foo.c
        │   │   ├── wslay_event_recv_508-135788_696-145433.foo.c
        │   │   ├── wslay_event_recv_508-135788_703-145874.foo.c
        │   │   ├── wslay_event_recv_519-136415_607-140735.foo.c
        │   │   └── ...
        │   ├── 1_1090
        │   │   ├── wslay_event_recv_596-139874_703-145874.foo.c
        │   │   └── ...
        │   ├── ...
        │   └── include
        │       ├── dycfoo.h
        │       ├── uri_split.i.hd.c.h
        │       └── ...
        ├── curl_compare
        └── ...
```
There are two subdatasets under `datasets`: `alias_prediction` and `equivalence_prediction`, one for each prediction task. The dataset for each task is further divided into `specialized_pre-train` and `fine-tune`, corresponding to the different training stages we designed for the models.
For alias prediction, each `csv` file is formatted as `['name1', 'path1', 'def_line1', 'name2', 'path2', 'def_line2', 'fine_grained_label']`. The first three columns give the name of the first variable, the relative path to the file in which it is located, and the line of its definition point; columns four through six give the same information for the second variable. The seventh column is the label, indicating that the two variables must alias (label 1), may alias (label 2), or must-not alias (label 0).
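As a rough illustration of how an alias-prediction CSV might be consumed, here is a minimal Python sketch. The example file path is just one of the CSVs listed in the tree above, and we assume the files carry no header row; adjust the sketch if they do.

```python
import csv

# Human-readable meanings of the fine-grained labels described above.
LABELS = {0: "must-not alias", 1: "must alias", 2: "may alias"}

# Any CSV under alias_prediction/ follows the same seven-column layout.
with open("datasets/alias_prediction/fine-tune/gcc.csv", newline="") as f:
    reader = csv.reader(f)
    # If the released CSVs contain a header row, skip it here:
    # next(reader)
    for name1, path1, def_line1, name2, path2, def_line2, label in reader:
        print(f"{name1} ({path1}:{def_line1}) vs "
              f"{name2} ({path2}:{def_line2}) -> {LABELS[int(label)]}")
```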
For equivalence prediction, each `csv` file is named `'projectName_compare_dataType_pairs.csv'` and is formatted as `['file1', 'file2']`. `file1` and `file2` are the relative paths of the two files within the corresponding `projectName_compare` subdirectory under `AllProjects`. When `dataType` is `equal`, the two files are semantically equivalent and the label is 1; when `dataType` is `inequal`, the two files are inequivalent and the label is 0. Each `.foo.c` file under `equivalence_prediction` can be compiled to `.ast` by using `clang`. (The header files required by each project are in that project's `include` folder, so just keep the current directory structure and run the command, e.g., `clang -emit-ast -c ./datasets/equivalence_prediction/AllProjects/gcc_compare/6/byte_re_match_2_internal_6583-212619_6851-219281.foo.c`.)
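For illustration, the sketch below shows one way the equivalence pairs could be assembled into labeled examples and compiled to ASTs. The function names are ours (not part of the released scripts), the paths follow the tree above, and we again assume header-less CSVs.

```python
import csv
import subprocess
from pathlib import Path

ROOT = Path("datasets/equivalence_prediction")

def load_pairs(project, split="fine-tune"):
    """Yield (path1, path2, label) triples for one project and split."""
    src_dir = ROOT / "AllProjects" / f"{project}_compare"
    for data_type, label in (("equal", 1), ("inequal", 0)):
        csv_path = ROOT / split / f"{project}_compare_{data_type}_pairs.csv"
        with open(csv_path, newline="") as f:
            # Skip a header row here if the released CSVs contain one.
            for file1, file2 in csv.reader(f):
                yield src_dir / file1, src_dir / file2, label

def compile_to_ast(foo_c):
    """Run the clang command described above on one .foo.c file."""
    # Keeping the released directory structure intact lets clang find the
    # headers shipped in each project's include/ folder.
    subprocess.run(["clang", "-emit-ast", "-c", str(foo_c)], check=True)

for f1, f2, label in load_pairs("aria2"):
    compile_to_ast(f1)
    compile_to_ast(f2)
```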
The `models/` directory contains the four models used in our evaluation: CuBERT, CodeBERT, GGNN, and Graph Sandwiches.
Please refer to this document for how to use the sequence models. For the graph models, please refer to the README in the specific task subdirectory; for example, for the equivalence prediction task of GGNN, please refer to this README.
The `scripts/` directory contains the scripts we used to convert source code into model inputs: the code in `scripts/tokens` is for the sequence models, and the code in `scripts/graph` is for the graph models.