Our code is implemented on top of OpenRLHF. Please follow OpenRLHF's guidance to configure the required environment, then run pip install -r requirements.txt
For one training cycle, run the commands below; then adjust the temperature in step 1 and start a new collect-train cycle.
# 1. collect 8K math data
bash sh/collect_data.sh
# 2. make VR pairs dataset for DPO
bash sh/make_vr_pairs.sh
# 3. train the dpo model
bash sh/train_dpo.sh
# adjust the temperature in 1., then start a new collect-train cycle.

We used Qwen Math's codebase for evaluation (i.e., pass@1 accuracy):

bash sh/evaluate_all_bench.sh
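The collect-train cycle above can be sketched as a single loop. This is a dry-run sketch, not the repo's actual driver: the TEMPERATURE variable and the temperature schedule are hypothetical assumptions (adapt them to however sh/collect_data.sh actually takes its temperature), and `run` only echoes each command so the sketch is safe to execute as-is.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run of the collect-train cycle described in this README.
# "run" only echoes each command; swap the echo for real execution once the
# sh/*.sh scripts are configured for your environment.
set -euo pipefail

run() { echo "+ $*"; }

for TEMPERATURE in 0.7 0.9 1.1; do  # hypothetical temperature schedule
  export TEMPERATURE                # assumption: collect_data.sh reads this
  run bash sh/collect_data.sh       # 1. collect 8K math data
  run bash sh/make_vr_pairs.sh      # 2. make VR pairs dataset for DPO
  run bash sh/train_dpo.sh          # 3. train the DPO model
done
```

Replacing `echo "+ $*"` with `"$@"` in `run` turns the dry run into the real loop while keeping the per-step logging.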