
CodeSem

CodeSem is a dataset of programs extracted from real-world flagship software codebases (e.g., the Linux Kernel, GCC, and MySQL) and manually validated for equivalence prediction and alias prediction. This repository includes not only the CodeSem dataset but also our re-implementations of the CuBERT, CodeBERT, GGNN, and Graph Sandwiches models, as well as scripts for converting program source code into the corresponding model inputs.

Table of Contents

- Quickstart
- Introduction
- Datasets
- Data Details
- Models
- Scripts

Quickstart

Here's a quick start to reproduce our experiments.

# clone this repository
git clone https://github.com/CodeSemDataset/CodeSem.git
cd CodeSem

# build the docker image
docker build --network=host -t "tf:2.6" .

# run a new docker container
# $Dir is the directory on your local machine that contains the CodeSem checkout
docker run --gpus all -it --rm -v $Dir/CodeSem:/tmp tf:2.6 bash

To generate model inputs for the sequence models and the graph models, refer to the READMEs in scripts/tokens and scripts/graph, respectively.

To train and test the models, refer to the README of each model under models/.

Introduction

CodeSem is a dataset built from the source code of real-world flagship software (e.g., the Linux Kernel, GCC, and MySQL) and manually validated for two prediction tasks: (1) alias prediction, in which models predict whether two pointers must alias, may alias, or must not alias; and (2) equivalence prediction, in which models predict whether two programs are semantically equivalent. We re-designed four influential code models for alias and equivalence prediction (CuBERT, CodeBERT, GGNN, and Graph Sandwiches) and performed a head-to-head comparison on CodeSem.

Datasets

Fourteen open-source projects were selected to construct CodeSem: the Linux Kernel, GCC, MySQL, Git, tmux, Redis, curl, LevelDB, H2O, libgit2, The Silver Searcher, Protocol Buffers, aria2, and fish. These are all mature projects, most of which are large-scale (hundreds of thousands or even millions of lines of code) and have decades-long histories. Together, these projects implement a broad range of functionality, such as data transmission, memory management, and cross-compilation, which in turn makes CodeSem diverse. The table below gives the details of the projects selected for CodeSem.

| Project | Description | Version | Size (KLoC) |
| --- | --- | --- | --- |
| Linux Kernel | The main component of a Linux operating system. | 5.3.6 | 18,724 |
| MySQL | A relational database management system. | 8.0.25 | 3,595 |
| GCC | A compiler for the GNU operating system. | 10.3.0 | 9,173 |
| Git | A distributed version control system. | f443b2 | 684 |
| tmux | A terminal multiplexer. | 5071b82 | 68 |
| Redis | An in-memory key-value database. | 6.2.6 | 206 |
| curl | A data transferring tool. | 7.79.0 | 254 |
| LevelDB | A key-value storage library. | 1.23 | 22 |
| H2O | An optimized HTTP/1, HTTP/2, HTTP/3 server. | 3e4b697 | 284 |
| libgit2 | A cross-platform, linkable library implementation of Git. | 2fc0fcb | 236 |
| The Silver Searcher | A code-searching tool. | a61f178 | 6 |
| Protocol Buffers | A cross-platform data format used to serialize structured data. | 3.20.0 | 650 |
| aria2 | A utility for downloading files. | 1.36.0 | 136 |
| fish | A user-friendly command line shell. | 3.4.1 | 407 |

Data Details

We assembled datasets for two tasks: alias prediction and equivalence prediction. The structure of the dataset is shown below:

.
├── alias_prediction
│   ├── fine-tune
│   │   ├── curl.csv
│   │   ├── gcc.csv
│   │   └── ...
│   └── specialized_pre-train
│       ├── gcc.csv
│       └── ...
└── equivalence_prediction
    ├── fine-tune
    │   ├── aria2_compare_equal_pairs.csv
    │   ├── aria2_compare_inequal_pairs.csv
    │   ├── curl_compare_equal_pairs.csv
    │   ├── curl_compare_inequal_pairs.csv
    │   └── ...
    ├── specialized_pre-train
    │   ├── aria2_compare_equal_pairs.csv
    │   ├── aria2_compare_inequal_pairs.csv
    │   ├── curl_compare_equal_pairs.csv
    │   ├── curl_compare_inequal_pairs.csv
    │   └── ...
    └── AllProjects
        ├── aria2_compare
        │   ├── 1
        │   │   ├── wslay_event_recv_508-135788_607-140735.foo.c
        │   │   ├── wslay_event_recv_508-135788_696-145433.foo.c
        │   │   ├── wslay_event_recv_508-135788_703-145874.foo.c
        │   │   ├── wslay_event_recv_519-136415_607-140735.foo.c
        │   │   └── ...
        │   ├── 1_1090
        │   │   ├── wslay_event_recv_596-139874_703-145874.foo.c
        │   │   └── ...
        │   ├── ...
        │   └── include
        │       ├── dycfoo.h
        │       ├── uri_split.i.hd.c.h
        │       └── ...
        ├── curl_compare
        └── ...

There are two subdatasets under datasets/, one per prediction task: alias_prediction and equivalence_prediction. The dataset for each task is further divided into specialized_pre-train and fine-tune, corresponding to the two training stages we designed for the models.

For alias prediction, each CSV file has the columns ['name1', 'path1', 'def_line1', 'name2', 'path2', 'def_line2', 'fine_grained_label']. The first three columns give the name of the first variable, the relative path to the file containing it, and the line number of its definition; columns four through six give the same information for the second variable. The seventh column is the label: the two variables must alias (label 1), may alias (label 2), or must not alias (label 0).
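
The snippet below is a minimal reading sketch, not part of the dataset tooling. It assumes the CSV files carry no header row and that pandas is installed, and it uses gcc.csv purely as an example:

# Minimal sketch for reading an alias-prediction CSV (assumptions: no header
# row in the file, pandas installed; gcc.csv is just an example).
import pandas as pd

LABELS = {0: "must-not alias", 1: "must alias", 2: "may alias"}

df = pd.read_csv(
    "datasets/alias_prediction/fine-tune/gcc.csv",
    header=None,
    names=["name1", "path1", "def_line1",
           "name2", "path2", "def_line2", "fine_grained_label"],
)
for row in df.itertuples(index=False):
    print(f"{row.name1} ({row.path1}:{row.def_line1}) vs "
          f"{row.name2} ({row.path2}:{row.def_line2}): "
          f"{LABELS[row.fine_grained_label]}")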

For equivalence prediction, each CSV file is named 'projectName_compare_dataType_pairs.csv' and has the columns ['file1', 'file2']. file1 and file2 are the relative paths of the two files within the projectName_compare subdirectory of AllProjects. When dataType is equal, the two files are semantically equivalent and the label is 1; when dataType is inequal, the two files are inequivalent and the label is 0. Each .foo.c file under equivalence_prediction can be compiled to a .ast file using Clang. (The header files required by each project are in that project's include folder, so keep the current directory structure and run the command, e.g., clang -emit-ast -c ./datasets/equivalence_prediction/AllProjects/gcc_compare/6/byte_re_match_2_internal_6583-212619_6851-219281.foo.c.)
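
As an end-to-end sketch, the snippet below reads one pair file and compiles both members of each pair to .ast with the clang command shown above (assumptions: no header row in the CSVs, pandas installed, clang on PATH, and curl_compare chosen purely as an example; run it from the repository root so the relative include paths resolve):

# Sketch: read equivalence pairs and emit a Clang AST for each file.
# Assumptions: no CSV header row, pandas installed, clang on PATH.
import subprocess
import pandas as pd

root = "datasets/equivalence_prediction/AllProjects/curl_compare"
pairs = pd.read_csv(
    "datasets/equivalence_prediction/fine-tune/curl_compare_equal_pairs.csv",
    header=None, names=["file1", "file2"],
)
label = 1  # *_equal_pairs.csv -> equivalent (1); *_inequal_pairs.csv -> 0
for row in pairs.itertuples(index=False):
    for rel in (row.file1, row.file2):
        # Same invocation as the example above; writes a .ast file.
        subprocess.run(["clang", "-emit-ast", "-c", f"{root}/{rel}"], check=True)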

Models

The models used in our evaluation are CuBERT, CodeBERT, GGNN, and Graph Sandwiches.

Please refer to this document for how to use the sequence models. For the graph models, refer to the README in the corresponding task subdirectory; for example, for GGNN's equivalence prediction task, see this README.

Scripts

These are the scripts we used to convert source code into model inputs.

Code in scripts/tokens is for sequence models and code in scripts/graph is for graph models.
