StructureCoder splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases. Additionally, we introduce an Abstract Syntax Tree (AST) splitting and curriculum training method to enhance DPO training. Please refer to our paper for more details!
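The AST-splitting idea can be sketched with Python's stdlib `ast` module. This is a minimal illustration, not the released implementation: the function name `split_blocks` and the choice of split granularity (top-level statements inside each function body) are our own assumptions.

```python
import ast


def split_blocks(source: str):
    """Split a solution into (prefix, block, suffix) triples, one per
    top-level statement inside each function body (illustrative only)."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    triples = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for stmt in node.body:
                # ast line numbers are 1-based; end_lineno is inclusive
                start, end = stmt.lineno - 1, stmt.end_lineno
                prefix = "".join(lines[:start])
                block = "".join(lines[start:end])
                suffix = "".join(lines[end:])
                triples.append((prefix, block, suffix))
    return triples


code = """def add(a, b):
    s = a + b
    return s
"""
for prefix, block, suffix in split_blocks(code):
    print(repr(block))
```

Each triple yields its own training example from the same solution and test cases, which is how one snippet can produce several DPO pairs.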
| Model | Checkpoint | Size |
| --- | --- | --- |
| StructureCoder-1.5B | 🤗 HF Link | 1.5B |
| StructureCoder-3B | 🤗 HF Link | 3B |
| StructureCoder-7B | 🤗 HF Link | 7B |
```shell
cd data
python download.py
```
```shell
cd data
python process.py -t format_code -i input_file -o output_file
```
```shell
cd data
# input file: file after black format
python check.py -i input_file -o output_file
```
```shell
cd data
# input file: file after check label
python process.py -t fim -i input_file -o output_file
python process.py -t full -i input_file -o output_file
```
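A fill-in-the-middle (FIM) example pairs a masked code block with its surrounding context. The sentinel tokens below follow a common FIM convention and are an assumption; the actual tokens depend on the base model's tokenizer, and the function name here is ours:

```python
def make_fim_example(prefix: str, middle: str, suffix: str) -> dict:
    """Build a FIM prompt/target pair. Sentinel tokens are illustrative;
    the real ones depend on the base model's tokenizer."""
    prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
    return {"prompt": prompt, "target": middle}


ex = make_fim_example("def add(a, b):\n", "    return a + b\n", "")
print(ex["prompt"])
```

The `full` mode would instead keep the whole solution as the target, with an empty suffix.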
```shell
python construct.py -t fim -p model_path -i input_file -o output_file
python construct.py -t full -p model_path -i input_file -o output_file
```
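Conceptually, construction samples completions from the model and then pairs them by test outcome to form preference data. A hypothetical sketch (the pairing rule and all names are ours, not from `construct.py`):

```python
def build_dpo_pairs(prompt: str, candidates: list) -> list:
    """Pair each test-passing completion with each failing one to form
    (chosen, rejected) DPO examples. `candidates` is a list of
    (completion, passed_tests) tuples -- illustrative only."""
    passed = [c for c, ok in candidates if ok]
    failed = [c for c, ok in candidates if not ok]
    return [
        {"prompt": prompt, "chosen": p, "rejected": f}
        for p in passed
        for f in failed
    ]


pairs = build_dpo_pairs(
    "def add(a, b):\n",
    [("    return a + b\n", True), ("    return a - b\n", False)],
)
```

Because every split block gets its own prompt, the same set of test cases can yield many such pairs.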
#### Check Generation Result
```shell
python check.py -i input_file -o output_file
```
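The check step can be approximated by executing each candidate together with its test cases in a fresh interpreter. This is a minimal, sandbox-free sketch (real checking needs isolation and resource limits); the function name is ours:

```python
import subprocess
import sys


def passes_tests(candidate: str, test_code: str, timeout: float = 5.0) -> bool:
    """Run candidate + tests in a subprocess; True iff it exits with code 0.
    No sandboxing here -- never run untrusted code like this in production."""
    program = candidate + "\n" + test_code
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = "assert add(2, 3) == 5"
```

The pass/fail labels produced here are what separates chosen from rejected completions in the next step.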
#### Process Check Results and Output Training Data
```shell
python process.py -i input_dir -o output_file --epoch 3 -t fim full
```
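The curriculum idea pairs naturally with the `--epoch` flag: present easier examples in early epochs and harder ones later. A hypothetical sketch of such scheduling (the difficulty proxy, target length, and all names are our own assumptions, not the paper's exact criterion):

```python
def schedule_curriculum(examples: list, num_epochs: int) -> list:
    """Sort examples by a difficulty proxy (target length here) and slice
    them into per-epoch buckets, easy to hard -- illustrative only."""
    ordered = sorted(examples, key=lambda ex: len(ex["target"]))
    n = len(ordered)
    buckets = []
    for e in range(num_epochs):
        lo = e * n // num_epochs
        hi = (e + 1) * n // num_epochs
        buckets.append(ordered[lo:hi])
    return buckets


examples = [{"target": "x" * k} for k in (5, 1, 3)]
buckets = schedule_curriculum(examples, 3)
```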
You can directly use the open-sourced training data to train the model.
```shell
torchrun --nproc_per_node 8 train_dpo.py \
    --seed 3407 \
    --report_to tensorboard \
    --dataloader_num_workers 8 \
    --remove_unused_columns False \
    --save_steps 100 \
    --max_len 2048 \
    --warmup_ratio 0.05 \
    --logging_steps 10 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine_with_min_lr \
    --lr_scheduler_kwargs '{"min_lr_rate": 0.1}' \
    --optim rmsprop \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --bf16 \
    --do_train \
    --save_only_model \
    --save_safetensors \
    --gradient_checkpointing \
    --deepspeed config/stage_1.json \
    --learning_rate 1e-6 \
    --model_cfg model_path \
    --train_file train_file \
    --output_dir output_dir
```
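For reference, the objective that DPO training optimizes (per the original DPO formulation) can be written out numerically. `beta` is the usual DPO temperature; the script's actual flag name for it may differ:

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (policy_margin - ref_margin)),
    where each margin is log p(chosen) - log p(rejected)."""
    logits = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with log1p
    return math.log1p(math.exp(-logits))
```

When the policy matches the reference, the logits are zero and the loss is log 2; widening the chosen-vs-rejected margin relative to the reference drives the loss toward zero.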
```shell
python test.py -p output_dir/checkpoint-final
```
We thank the following amazing projects that truly inspired us: