CodeSem is a dataset of programs extracted from real-world flagship software codebases (e.g., the Linux Kernel, GCC, and MySQL) and manually validated for equivalence prediction and alias prediction. This repository includes not only the CodeSem dataset but also re-implementations of the CuBERT, CodeBERT, GGNN, and Graph Sandwiches models, as well as scripts for converting program source code into the corresponding model inputs.
Here is a quick start for reproducing our experiments.

```bash
# clone this repository
git clone https://github.com/CodeSemDataset/CodeSem.git
cd CodeSem

# build the docker image
docker build --network=host -t "tf:2.6" .

# run a new docker container
# $Dir is the path on your local machine to the directory containing the cloned repository
docker run --gpus all -it --rm -v $Dir/CodeSem:/tmp tf:2.6 bash
```
To generate model inputs for the sequence models and the graph models, refer to the READMEs in `scripts/tokens` and `scripts/graph`, respectively.
To train and test the models, refer to the README of each code model under `models/`.
CodeSem is a dataset built upon the source code of real-world flagship software (e.g., the Linux Kernel, GCC, and MySQL) and has been manually validated for two prediction tasks: (1) alias prediction, in which models predict whether two pointers must alias, may alias, or must not alias; and (2) equivalence prediction, in which models predict whether two programs are semantically equivalent. We re-designed four influential code models, CuBERT, CodeBERT, GGNN, and Graph Sandwiches, for alias and equivalence prediction, and performed a head-to-head comparison on CodeSem.
Fourteen open-source projects were selected to construct CodeSem: Linux Kernel, GCC, MySQL, Git, tmux, Redis, curl, LevelDB, H2O, libgit2, The Silver Searcher, Protocol Buffers, aria2, and fish. These are all mature projects, most of which are large-scale (with hundreds of thousands or even millions of lines of code) and have a decades-long history. The source code of these projects implements a broad range of functionalities, such as data transmission, memory management, and cross-compilation, which in turn makes CodeSem diverse. Details of the selected projects are given in the table below.
Projects | Description | Version | Size (KLoC) |
---|---|---|---|
Linux Kernel | The main component of a Linux operating system. | 5.3.6 | 18,724 |
MySQL | A relational database management system. | 8.0.25 | 3,595 |
GCC | A compiler for the GNU operating system. | 10.3.0 | 9,173 |
Git | A distributed version control system. | f443b2 | 684 |
tmux | A terminal multiplexer. | 5071b82 | 68 |
Redis | An in-memory key-value database. | 6.2.6 | 206 |
curl | A data transferring tool. | 7.79.0 | 254 |
LevelDB | A key-value storage library. | 1.23 | 22 |
H2O | An optimized HTTP/1, HTTP/2, HTTP/3 server. | 3e4b697 | 284 |
libgit2 | A cross-platform, linkable library implementation of Git. | 2fc0fcb | 236 |
The Silver Searcher | A code-searching tool. | a61f178 | 6 |
Protocol Buffers | A cross-platform data format used to serialize structured data. | 3.20.0 | 650 |
aria2 | A utility for downloading files. | 1.36.0 | 136 |
fish | A user-friendly command line shell. | 3.4.1 | 407 |
We assemble datasets for the following two tasks: alias prediction and equivalence prediction. The structure of the dataset is shown below:
```
.
├── alias_prediction
│   ├── fine-tune
│   │   ├── curl.csv
│   │   ├── gcc.csv
│   │   └── ...
│   └── specialized_pre-train
│       ├── gcc.csv
│       └── ...
└── equivalence_prediction
    ├── fine-tune
    │   ├── aria2_compare_equal_pairs.csv
    │   ├── aria2_compare_inequal_pairs.csv
    │   ├── curl_compare_equal_pairs.csv
    │   ├── curl_compare_inequal_pairs.csv
    │   └── ...
    ├── specialized_pre-train
    │   ├── aria2_compare_equal_pairs.csv
    │   ├── aria2_compare_inequal_pairs.csv
    │   ├── curl_compare_equal_pairs.csv
    │   ├── curl_compare_inequal_pairs.csv
    │   └── ...
    └── AllProjects
        ├── aria2_compare
        │   ├── 1
        │   │   ├── wslay_event_recv_508-135788_607-140735.foo.c
        │   │   ├── wslay_event_recv_508-135788_696-145433.foo.c
        │   │   ├── wslay_event_recv_508-135788_703-145874.foo.c
        │   │   ├── wslay_event_recv_519-136415_607-140735.foo.c
        │   │   └── ...
        │   ├── 1_1090
        │   │   ├── wslay_event_recv_596-139874_703-145874.foo.c
        │   │   └── ...
        │   ├── ...
        │   └── include
        │       ├── dycfoo.h
        │       ├── uri_split.i.hd.c.h
        │       └── ...
        ├── curl_compare
        └── ...
```
There are two subdatasets under `datasets`: `alias_prediction` and `equivalence_prediction`, one for each prediction task. The dataset for each task is further divided into `specialized_pre-train` and `fine-tune`, corresponding to the different training stages we designed for the models.
For alias prediction, each `csv` file is formatted as `['name1', 'path1', 'def_line1', 'name2', 'path2', 'def_line2', 'fine_grained_label']`. The first three columns give the name of the first variable, the relative path to the file in which it is located, and the line of its definition point; columns four through six give the same information for the second variable. The seventh column is the label, indicating that the two variables must alias (label 1), may alias (label 2), or must-not alias (label 0).
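As a rough illustration of how an alias-prediction CSV might be consumed, here is a minimal Python sketch. The example file path is just one of the CSVs listed in the tree above, and we assume the files carry no header row; adjust the sketch if they do.

```python
import csv

# Human-readable meanings of the fine-grained labels described above.
LABELS = {0: "must-not alias", 1: "must alias", 2: "may alias"}

# Any CSV under alias_prediction/ follows the same seven-column layout.
with open("datasets/alias_prediction/fine-tune/gcc.csv", newline="") as f:
    reader = csv.reader(f)
    # If the released CSVs contain a header row, skip it here:
    # next(reader)
    for name1, path1, def_line1, name2, path2, def_line2, label in reader:
        print(f"{name1} ({path1}:{def_line1}) vs "
              f"{name2} ({path2}:{def_line2}) -> {LABELS[int(label)]}")
```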
For equivalence prediction, each `csv` file is named `'projectName_compare_dataType_pairs.csv'` and is formatted as `['file1', 'file2']`. `file1` and `file2` are the relative paths of the two files within the corresponding `projectName_compare` subdirectory under `AllProjects`. When `dataType` is `equal`, the two files are semantically equivalent and the label is 1; when `dataType` is `inequal`, the two files are inequivalent and the label is 0. Each `.foo.c` file under `equivalence_prediction` can be compiled to `.ast` by using `clang`. (The header files required by each project are in that project's `include` folder, so just keep the current directory structure and run the command, e.g., `clang -emit-ast -c ./datasets/equivalence_prediction/AllProjects/gcc_compare/6/byte_re_match_2_internal_6583-212619_6851-219281.foo.c`.)
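For illustration, the sketch below shows one way the equivalence pairs could be assembled into labeled examples and compiled to ASTs. The function names are ours (not part of the released scripts), the paths follow the tree above, and we again assume header-less CSVs.

```python
import csv
import subprocess
from pathlib import Path

ROOT = Path("datasets/equivalence_prediction")

def load_pairs(project, split="fine-tune"):
    """Yield (path1, path2, label) triples for one project and split."""
    src_dir = ROOT / "AllProjects" / f"{project}_compare"
    for data_type, label in (("equal", 1), ("inequal", 0)):
        csv_path = ROOT / split / f"{project}_compare_{data_type}_pairs.csv"
        with open(csv_path, newline="") as f:
            # Skip a header row here if the released CSVs contain one.
            for file1, file2 in csv.reader(f):
                yield src_dir / file1, src_dir / file2, label

def compile_to_ast(foo_c):
    """Run the clang command described above on one .foo.c file."""
    # Keeping the released directory structure intact lets clang find the
    # headers shipped in each project's include/ folder.
    subprocess.run(["clang", "-emit-ast", "-c", str(foo_c)], check=True)

for f1, f2, label in load_pairs("aria2"):
    compile_to_ast(f1)
    compile_to_ast(f2)
```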
The `models/` directory contains the four models used in our evaluation: CuBERT, CodeBERT, GGNN, and Graph Sandwiches.
Please refer to this document for how to use the sequence models. For the graph models, please refer to the README in the specific task subdirectory; for example, for the equivalence prediction task of GGNN, please refer to this README.
The `scripts/` directory contains the scripts we used to convert source code into model inputs: the code in `scripts/tokens` is for the sequence models, and the code in `scripts/graph` is for the graph models.