
CodeS: Natural Language to Code Repository via Multi-Layer Sketch

Paper | Video Demo

What is this about?

The impressive performance of large language models (LLMs) on code-related tasks has shown the potential of fully automated software development. In light of this, we introduce a new software engineering task, namely Natural Language to code Repository (NL2Repo). This task aims to generate an entire code repository from its natural language requirements. To address this task, we propose a simple yet effective framework CodeS, which decomposes NL2Repo into multiple sub-tasks by a multi-layer sketch. Specifically, CodeS includes three modules: RepoSketcher, FileSketcher, and SketchFiller. RepoSketcher first generates a repository's directory structure for given requirements; FileSketcher then generates a file sketch for each file in the generated structure; SketchFiller finally fills in the details for each function in the generated file sketch. To rigorously assess CodeS on the NL2Repo task, we carry out evaluations through both automated benchmarking and manual feedback analysis. For benchmark-based evaluation, we craft a repository-oriented benchmark, SketchEval, and design an evaluation metric, SketchBLEU. For feedback-based evaluation, we develop a VSCode plugin for CodeS and engage 30 participants in conducting empirical studies. Extensive experiments prove the effectiveness and practicality of CodeS on the NL2Repo task.
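To make the decomposition concrete, the sketch below strings the three modules together in plain Python. It is a minimal illustration under assumed names: llm_generate stands in for whatever model backend is used, the prompts are invented, and the actual SketchFiller works function by function rather than one call per file.

from typing import Callable

# Hypothetical sketch of the CodeS three-layer decomposition (names and prompts
# are illustrative assumptions, not the repository's actual code).
def generate_repository(requirements: str,
                        llm_generate: Callable[[str], str]) -> dict[str, str]:
    # 1) RepoSketcher: requirements -> repository directory structure
    repo_sketch = llm_generate(
        "Write the directory structure for a repository that satisfies:\n" + requirements)

    repo: dict[str, str] = {}
    # Assume the repository sketch lists one relative file path per line.
    for file_path in [p.strip() for p in repo_sketch.splitlines() if p.strip().endswith(".py")]:
        # 2) FileSketcher: requirements + structure -> per-file sketch
        #    (imports, class/function signatures, docstrings, bodies left empty)
        file_sketch = llm_generate(
            "Requirements:\n" + requirements + "\nRepository sketch:\n" + repo_sketch
            + "\nWrite a file sketch for " + file_path + ":")
        # 3) SketchFiller: complete the unfinished functions in the file sketch
        #    (collapsed here into one call per file for brevity)
        repo[file_path] = llm_generate(
            "Fill in the bodies of all functions in this sketch:\n" + file_sketch)
    return repo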

Project Directory

.
├── assets
├── clean_repo.py # ./repos/ -> ./cleaned_repos/
├── cleaned_repos
├── craft_train_data.py # ./outputs/ -> ./training_data/
├── extract_sketch.py # ./cleaned_repos/ -> ./outputs/
├── outputs
├── projects # two projects
├── prompt_construction_utils.py
├── repos
├── requirements.txt
├── run_step1_clean.sh # runs ./clean_repo.py
├── run_step2_extract_sketch.sh # runs ./extract_sketch.py
├── run_step3_make_data.sh # runs ./craft_train_data.py
├── scripts
├── train # scripts to *train the CodeS model*
├── training_data
└── validation # *evaluation* scripts

Creating Instruction Data for 100 Repositories

  1. Download the selected repositories to the ./repos directory and unzip them.
  2. Preprocess the repositories:
bash run_step1_clean.sh
  3. Extract instruction training data for RepoSketcher, FileSketcher, and SketchFiller (an illustrative record is sketched after these steps):
bash run_step2_extract_sketch.sh
bash run_step3_make_data.sh
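For orientation, a single RepoSketcher training record might look roughly like the following. The field names, prompt wording, and contents are assumptions for illustration, not the exact format produced by craft_train_data.py.

# Hypothetical shape of one RepoSketcher instruction record (field names assumed).
example_record = {
    "instruction": "Generate the repository sketch for the following requirements.",
    "input": "A command-line tool that converts Markdown files to HTML ...",
    "output": ".\n|-- README.md\n|-- setup.py\n|-- mdconv/\n|   |-- cli.py\n|   `-- converter.py",
}
# FileSketcher and SketchFiller records follow the same instruction/input/output
# layout, conditioned on the repository sketch and file sketch respectively.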

Training

  1. Place the created instruction data into ./train/data and configure dataset_info.json following the structure described at https://github.com/hiyouga/LLaMA-Factory/tree/main/data (a minimal example is sketched after these steps).

  2. Adjust the launch script as needed, then start training:

vim ./train/run_train_multi_gpu.sh
bash ./train/run_train_multi_gpu.sh
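For reference, a minimal dataset_info.json entry in the alpaca-style layout that LLaMA-Factory expects could be written as follows; the dataset name and file name are placeholders for the created CodeS data.

import json

# Minimal dataset_info.json entry; see the LLaMA-Factory data README for the
# authoritative schema. Dataset and file names below are placeholders.
dataset_info = {
    "codes_instructions": {
        "file_name": "codes_instructions.json",
        "columns": {"prompt": "instruction", "query": "input", "response": "output"},
    }
}

with open("./train/data/dataset_info.json", "w") as f:
    json.dump(dataset_info, f, indent=2)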

Evaluation

  1. Install SketchBLEU; its setup mirrors that of CodeBLEU.

  2. Perform inference on SketchEval:

python ./codes/validation/evaluation-scripts/from_scratch_inference.py
  3. Assemble the inference results into a complete repository:
python ./codes/validation/evaluation-scripts/transfer_output_to_repo.py
  4. Evaluate the generated repository, as with CodeBLEU (a sketch of the composite score follows):
python ./codes/validation/evaluation-scripts/batch_eval/get_metric.py
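Since SketchBLEU follows CodeBLEU's composite design, the overall score can be pictured as a weighted sum of four sub-scores. The component names and equal default weights below mirror CodeBLEU and are an assumption about get_metric.py, not a transcript of it.

# Hedged sketch of a CodeBLEU-style composite score; the four sub-scores are
# assumed to be computed elsewhere (e.g. by the batch_eval scripts) in [0, 1].
def sketch_bleu(ngram: float, weighted_ngram: float,
                syntax: float, dataflow: float,
                weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    a, b, c, d = weights
    return a * ngram + b * weighted_ngram + c * syntax + d * dataflow

# Arbitrary placeholder sub-scores, only to show the call shape.
score = sketch_bleu(ngram=0.41, weighted_ngram=0.44, syntax=0.62, dataflow=0.58)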
