Thecommonirin/CASPO

🔧 Quick Start

Installation

Our code is built on OpenRLHF. Follow OpenRLHF's setup instructions to configure the required environment, then install the Python dependencies:

pip install -r requirements.txt

Reproduce the Project

For one training cycle, run the commands below in order. Then adjust the temperature used in step 1 and start a new collect-train cycle.

# 1. collect 8K math data
bash sh/collect_data.sh
# 2. make VR pairs dataset for DPO
bash sh/make_vr_pairs.sh
# 3. train the dpo model
bash sh/train_dpo.sh
# adjust the temperature in step 1, then start a new collect-train cycle.
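
The three steps above can also be wrapped in a small helper that stops at the first failing step, so a broken data collection run does not feed into DPO training. This is a minimal sketch, not part of the repository: the function name run_cycle and the SCRIPT_DIR override are hypothetical, and it assumes the three scripts live in sh/ as shown above and are run from the repository root.

```shell
#!/usr/bin/env bash
# Sketch only: run_cycle and SCRIPT_DIR are hypothetical names, not part of the repo.
# Runs one collect-train cycle and aborts at the first step that fails.
run_cycle() {
  # Directory holding the step scripts; defaults to the repo's sh/ folder.
  local script_dir="${SCRIPT_DIR:-sh}"
  local step
  for step in collect_data make_vr_pairs train_dpo; do
    echo "[cycle] running ${step}.sh"
    bash "${script_dir}/${step}.sh" || {
      echo "[cycle] ${step}.sh failed, stopping" >&2
      return 1
    }
  done
  echo "[cycle] done; adjust the temperature in step 1 before the next cycle"
}
```

After sourcing the snippet, call run_cycle once per cycle. The temperature still has to be changed by hand between cycles, since the source does not say where that setting lives.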

Evaluation of Math Reasoning

We use Qwen Math's codebase for evaluation and report pass@1 accuracy.

bash sh/evaluate_all_bench.sh

About

Step-level Confidence-aware Optimization for Large Reasoning Models
