SenseLLM/StructureCoder

Alignment with Fill-In-the-Middle for Enhancing Code Generation

📄 Paper | 🏠 Repo | 🤖 Models

Introduction

StructureCoder splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Additionally, we introduce Abstract Syntax Tree (AST) splitting and a curriculum training method to enhance DPO training. Please refer to our paper for more details!
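To make the block-splitting idea concrete, here is a minimal sketch of AST-based fill-in-the-middle (FIM) splitting using Python's standard `ast` module. It is not the repository's implementation: the function name `split_fim_blocks` and the choice to treat every statement node as one block are assumptions made for illustration.

```python
import ast


def split_fim_blocks(source: str):
    """Return (prefix, middle, suffix) triples, one per statement node.

    Hypothetical helper: each statement found in the AST becomes the
    "middle" that a model would be asked to fill in.
    """
    lines = source.splitlines(keepends=True)
    triples = []
    for node in ast.walk(ast.parse(source)):
        # Only statements span whole lines; skip the module root and expressions.
        if not isinstance(node, ast.stmt):
            continue
        start, end = node.lineno - 1, node.end_lineno  # 0-based start, exclusive end
        triples.append(
            ("".join(lines[:start]), "".join(lines[start:end]), "".join(lines[end:]))
        )
    return triples


example = (
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
blocks = split_fim_blocks(example)
for prefix, middle, suffix in blocks:
    print(repr(middle))
```

Because nested statements (the `for` loop and the `s += x` inside it) each produce their own triple, a single function yields several blocks of different granularity. The sketch assumes one statement per line, which is what a formatter such as Black produces.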


Models

| Model | Checkpoint | Size |
| StructureCoder-1.5B | 🤗 HF Link | 1.5B |
| StructureCoder-3B | 🤗 HF Link | 3B |
| StructureCoder-7B | 🤗 HF Link | 7B |

Training and Evaluation

Download Data

cd data
python download.py

Train process

Black Format

cd data
python process.py -t format_code -i input_file -o output_file

Check Label

cd data
# input file: file after black format
python check.py -i input_file -o output_file

Extract Block

cd data
# input file: file after check label
python process.py -t fim -i input_file -o output_file
python process.py -t full -i input_file -o output_file

Generation

python construct.py -t fim -p model_path -i input_file -o output_file
python construct.py -t full -p model_path -i input_file -o output_file

Check Generation Result

python check.py -i input_file -o output_file

Process Check Result; Output Training Data

python process -i input_dir -o output_file --epoch 3 -t fim fulls
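As a rough illustration of how checked generations could be turned into DPO training pairs, the sketch below pairs every completion that passed its tests with every completion that failed for the same prompt. The record fields (`prompt`, `completion`, `passed`) and the function `build_dpo_pairs` are hypothetical; the actual file format used by the scripts above is not shown in this README.

```python
from collections import defaultdict
from itertools import product


def build_dpo_pairs(records):
    """Group records by prompt and pair passing with failing completions.

    Assumed record shape (hypothetical):
    {"prompt": str, "completion": str, "passed": bool}
    """
    by_prompt = defaultdict(lambda: {"chosen": [], "rejected": []})
    for rec in records:
        key = "chosen" if rec["passed"] else "rejected"
        by_prompt[rec["prompt"]][key].append(rec["completion"])

    pairs = []
    for prompt, group in by_prompt.items():
        # A prompt with only passing or only failing completions yields no pairs.
        for chosen, rejected in product(group["chosen"], group["rejected"]):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


records = [
    {"prompt": "p1", "completion": "good_a", "passed": True},
    {"prompt": "p1", "completion": "bad_a", "passed": False},
    {"prompt": "p1", "completion": "bad_b", "passed": False},
    {"prompt": "p2", "completion": "good_b", "passed": True},
]
pairs = build_dpo_pairs(records)
print(len(pairs))
```

Prompts whose completions all pass or all fail contribute nothing, which is one reason splitting code into smaller blocks can yield more usable pairs from the same test cases.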

Train

You can directly use the open-sourced training data to train the model.

torchrun --nproc_per_node 8 train_dpo.py \
  --seed 3407 \
  --report_to tensorboard \
  --dataloader_num_workers 8 \
  --remove_unused_columns False \
  --save_steps 100 \
  --max_len 2048 \
  --warmup_ratio 0.05 \
  --logging_steps 10 \
  --num_train_epochs 1 \
  --lr_scheduler_type cosine_with_min_lr \
  --lr_scheduler_kwargs '{"min_lr_rate": 0.1}' \
  --optim rmsprop \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --bf16 \
  --do_train \
  --save_only_model \
  --save_safetensors \
  --gradient_checkpointing \
  --deepspeed config/stage_1.json \
  --learning_rate 1e-6 \
  --model_cfg model_path \
  --train_file train_file \
  --output_dir output_dir
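The flags above imply an effective batch size and a learning-rate curve that are easy to check by hand. The sketch below computes both from the values in the command. The schedule is an approximation of linear warmup followed by cosine decay to a floor of `min_lr_rate * learning_rate`; the library's exact implementation may differ in details, and `total_steps` is a placeholder since the real number depends on the dataset size.

```python
import math


def effective_batch_size(num_gpus: int, per_device: int, grad_accum: int) -> int:
    """Sequences consumed per optimizer step across all devices."""
    return num_gpus * per_device * grad_accum


def lr_at_step(step, total_steps, base_lr=1e-6, warmup_ratio=0.05, min_lr_rate=0.1):
    """Approximate learning rate: linear warmup, then cosine decay to a floor."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Rescale the cosine factor from [0, 1] to [min_lr_rate, 1].
    return base_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


# Values taken from the command: 8 processes, batch size 1, 16 accumulation steps.
batch = effective_batch_size(num_gpus=8, per_device=1, grad_accum=16)
print(batch)

total_steps = 1000  # placeholder, not a value from the repository
for step in (0, 25, 50, 500, 1000):
    print(step, lr_at_step(step, total_steps))
```

With these settings each optimizer step sees 8 x 1 x 16 = 128 sequences, and the learning rate ends near 1e-7 rather than decaying to zero.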

Test

python test -p output_dir/checkpoint-final

Acknowledgments

We thank the following amazing projects that truly inspired us:
