
CodeGen4Libs

This repository accompanies the ASE 2023 paper "CodeGen4Libs: A Two-stage Approach for Library-oriented Code Generation".

Updates


  • 2023-09-10: Initial Benchmark Release
  • 2023-10-04: Add Hugging Face support

TODO


  • Model Implementations

Hugging Face Support

Hugging Face Datasets

Usage

from datasets import load_dataset
dataset = load_dataset("FudanSELab/CodeGen4Libs")

Dataset Structure

DatasetDict({
    train: Dataset({
        features: ['id', 'method', 'clean_method', 'doc', 'comment', 'method_name', 'extra', 'imports_info', 'libraries_info', 'input_str', 'input_ids', 'tokenized_input_str', 'input_token_length', 'labels', 'tokenized_labels_str', 'labels_token_length', 'retrieved_imports_info', 'retrieved_code', 'imports', 'cluster_imports_info', 'libraries', 'attention_mask'],   
        num_rows: 391811
    })
    validation: Dataset({
        features: ['id', 'method', 'clean_method', 'doc', 'comment', 'method_name', 'extra', 'imports_info', 'libraries_info', 'input_str', 'input_ids', 'tokenized_input_str', 'input_token_length', 'labels', 'tokenized_labels_str', 'labels_token_length', 'retrieved_imports_info', 'retrieved_code', 'imports', 'cluster_imports_info', 'libraries', 'attention_mask'],   
        num_rows: 5967
    })
    test: Dataset({
        features: ['id', 'method', 'clean_method', 'doc', 'comment', 'method_name', 'extra', 'imports_info', 'libraries_info', 'input_str', 'input_ids', 'tokenized_input_str', 'input_token_length', 'labels', 'tokenized_labels_str', 'labels_token_length', 'retrieved_imports_info', 'retrieved_code', 'imports', 'cluster_imports_info', 'libraries', 'attention_mask'],   
        num_rows: 6002
    })
})
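
Once loaded, individual records can be indexed directly from any split. The following is a minimal sketch that prints a few of the fields listed in the features above:

from datasets import load_dataset

dataset = load_dataset("FudanSELab/CodeGen4Libs")

# inspect one training example; the field names are those listed in the features above
example = dataset["train"][0]
print("comment:", example["comment"])
print("libraries:", example["libraries"])
print("imports_info:", example["imports_info"])
print("clean_method:", example["clean_method"])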

Benchmark Format


The benchmark is structured and saved in the DatasetDict format, accessible at Dataset and Models of CodeGen4Libs. The data fields of each tuple are as follows (a short example of reading these fields is sketched after the list):

  • id: the unique identifier for each tuple.

  • method: the original method-level code for each tuple.

  • clean_method: the ground-truth method-level code for each task.

  • doc: the document of method-level code for each tuple.

  • comment: the natural language description for each tuple.

  • method_name: the name of the method.

  • extra: extra information on the code repository to which the method-level code belongs.

    • license: the license of the code repository.
    • path: the path within the code repository.
    • repo_name: the name of the code repository.
    • size: the size of the code repository.
  • imports_info: the import statements for each tuple.

  • libraries_info: the library information for each tuple.

  • input_str: the constructed model input string.

  • input_ids: the token ids of the tokenized input.

  • tokenized_input_str: the tokenized input.

  • input_token_length: the length of the tokenized input.

  • labels: the token ids of the tokenized output.

  • tokenized_labels_str: the tokenized output.

  • labels_token_length: the length of the tokenized output.

  • retrieved_imports_info: the retrieved import statements for each tuple.

  • retrieved_code: the retrieved method-level code for each tuple.

  • imports: the imported packages of each import statement.

  • cluster_imports_info: the clustered import information of the code.

  • libraries: libraries used by the code.

  • attention_mask: attention mask for the input.
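
The tokenized fields can be decoded back to text to see how the model input is assembled. The sketch below assumes the Salesforce/codet5-base tokenizer (with the <code>/</code> special tokens added, as in the Usage section below):

from datasets import load_dataset
from transformers import RobertaTokenizer

dataset = load_dataset("FudanSELab/CodeGen4Libs")
tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
tokenizer.add_special_tokens(
    {"additional_special_tokens": tokenizer.special_tokens_map["additional_special_tokens"] + ["<code>", "</code>"]}
)

example = dataset["train"][0]
# the raw model input and its token length are stored alongside the token ids
print("input_str:", example["input_str"])
print("input_token_length:", example["input_token_length"])
# decode the stored ids to confirm they correspond to input_str
print("decoded input_ids:", tokenizer.decode(example["input_ids"], skip_special_tokens=True))
# the tokenized target code is also stored as a string
print("tokenized_labels_str:", example["tokenized_labels_str"])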

Models Download


NL+Libs+Imports(Ret)->Imports

NL+Libs->Imports

NL+Libs->Code

NL+Libs+Imports(Gen)->Code

NL+Libs+Code(Ret)->Code

NL+Libs+Imports(Gen)+Code(Ret)->Code

Usage


  1. Environment Setup
import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration
  2. Load Model
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
# add <code>, </code> as special tokens
tokenizer.add_special_tokens(
    {"additional_special_tokens": tokenizer.special_tokens_map["additional_special_tokens"] + ["<code>", "</code>"]}
)
# load the fine-tuned model; PathUtil and version are utilities from this repository
# that resolve the local directory of the downloaded checkpoint
model_name = "codegen4lib_base"
model_dir = PathUtil.finetune_model(f"{version}/best_{model_name}")
model = T5ForConditionalGeneration.from_pretrained(model_dir).to("cuda")
  3. Generate Example
input_str = "Gets the detailed information for a given agent pool"
input_ids = tokenizer(input_str, return_tensors="pt").input_ids
input_ids = torch.as_tensor(input_ids).to("cuda")

outputs = model.generate(input_ids, max_length=512)
print("output_str: ", tokenizer.decode(outputs[0], skip_special_tokens=True))

Citation

@inproceedings{ase2023codegen4libs,
  author       = {Mingwei Liu and Tianyong Yang and Yiling Lou and Xueying Du and Ying Wang and Xin Peng},
  title        = {{CodeGen4Libs}: A Two-stage Approach for Library-oriented Code Generation},
  booktitle    = {38th {IEEE/ACM} International Conference on Automated Software Engineering,
                  {ASE} 2023, Kirchberg, Luxembourg, September 11-15, 2023},
  pages        = {0--0},
  publisher    = {{IEEE}},
  year         = {2023},
}
