This repository supports PPO-style and VGPO reinforcement learning training for text and multimodal models.
The structure below matches the current repository layout:
├── .vscode/
├── data/ # Data directory (train/val/infer)
├── docker/ # Docker environments
├── docs/ # Documentation
├── outputs/ # Training / inference outputs (logs, ckpts, jsonl)
├── recipe/ # Training recipe configs (if any)
├── reward/
│ └── vgpo_reward.py # ✅ Custom reward function for VERL training
├── scripts/ # Helper scripts (optional)
├── src/
│ ├── cogflow_process_data.py # ✅ Data preprocessing script
│ ├── infer.py # ✅ Inference entry (with VSR gating)
│ ├── infer.sh # ✅ Inference launch script
│ └── train_vgpo.sh # ✅ VGPO training launch script
├── tests/
├── verl/ # ✅ VERL source code (your local rollout modifications are here)
├── verl.egg-info/
├── pyproject.toml
├── requirements.txt
├── requirements_sglang.txt
├── requirements-npu.txt
├── setup.py
└── README.md
Recommended:
- Python >= 3.10
- CUDA >= 11.8 (depends on your torch/vLLM versions)
- Multi-GPU training: 8 GPUs (adjustable)
Install base dependencies:
pip install -r requirements.txt

Preprocessing script:
src/cogflow_process_data.py

Example usage:
python src/cogflow_process_data.py \
    --input data/raw \
    --output data/processed

Training uses parquet by default:
- data/train.parquet
- data/val.parquet
Inference often uses jsonl (or any custom format):
- data/infer.jsonl
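jsonl is just one JSON object per line; a minimal writer/reader sketch follows (field names like "prompt" and "images" are illustrative, not a format this repo mandates):

```python
import json
from pathlib import Path

# Illustrative records -- adapt the fields to your own inference format.
records = [
    {"id": 0, "prompt": "Solve: 2 + 2 = ?", "images": []},
    {"id": 1, "prompt": "What is shown in the figure?", "images": ["fig1.png"]},
]

path = Path("infer_sample.jsonl")
with path.open("w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")  # one object per line

# Reading is symmetric:
loaded = [json.loads(line) for line in path.open()]
print(len(loaded))  # 2
```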
Training entry script:
bash src/train_vgpo.sh

Custom reward file:
reward/vgpo_reward.py

It is passed to training via:
custom_reward_function.path=reward/vgpo_reward.py
custom_reward_function.name=compute_score
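For context, these overrides are plain Hydra-style flags on the VERL trainer command line. A sketch follows (other flags elided; the entry point and data paths may differ in your train_vgpo.sh):

```shell
# Sketch only -- not the repo's exact launch line.
python -m verl.trainer.main_ppo \
    data.train_files=data/train.parquet \
    data.val_files=data/val.parquet \
    actor_rollout_ref.model.path="$MODEL_DIR" \
    custom_reward_function.path=reward/vgpo_reward.py \
    custom_reward_function.name=compute_score
```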
Inside compute_score() you can implement:
- rule-based reward
- model-based reward (e.g., scoring with the IntlzR reward model)
- tool-based reward (e.g., sandbox test cases)
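A minimal rule-based variant might look like this. The signature follows VERL's custom reward interface; the answer-extraction logic is purely illustrative, not this repo's actual scoring:

```python
import re

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Rule-based reward sketch: 1.0 if the last number in the model's
    response matches the ground truth, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_str)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == str(ground_truth).strip() else 0.0

print(compute_score("math", "So the answer is 42.", "42"))  # 1.0
```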
Inference entry points:
src/infer.py
src/infer.sh
Run:
bash src/infer.sh

Outputs are saved to:
outputs/infer/<experiment_name>/*.jsonl

This repo is responsible for:
- loading Swift-exported checkpoints as MODEL_DIR
- optionally calling the Swift-trained reward model during reward computation or inference scoring
Artifacts usage:
- SFT checkpoint: used as actor_rollout_ref.model.path=$MODEL_DIR
- IntlzR reward model: can be invoked inside reward/vgpo_reward.py or src/infer.py
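One common pattern for the second bullet is to blend a rule-based score with a reward-model score inside compute_score. Everything below (helper names, stub value, weights) is a hypothetical sketch, since the actual IntlzR call depends on how that model is served:

```python
def rule_score(solution_str, ground_truth):
    # Exact-match rule reward (illustrative).
    return 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0

def model_score(solution_str, ground_truth):
    # Placeholder: a real implementation would query the Swift-trained
    # IntlzR reward model here (e.g., over an inference endpoint).
    return 0.5  # stub value

def compute_score(data_source, solution_str, ground_truth, extra_info=None,
                  w_rule=0.7, w_model=0.3):
    # Weighted blend; weights are illustrative defaults, not repo settings.
    return (w_rule * rule_score(solution_str, ground_truth)
            + w_model * model_score(solution_str, ground_truth))
```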
Training outputs include:
- logs: outputs/ or exp_log/
- checkpoints: outputs/ckpt/ (depends on your trainer configs)
This project contains ByteDance VERL code and follows the Apache 2.0 License.

Citation:

@article{chen2026cogflow,
title = {CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving},
author = {Chen, Shuhang and Xu, Yunqiu and Xie, Junjie and Lu, Aojun and Feng, Tao and Huang, Zeying and Zhang, Ning and Sun, Yi and Yang, Yi and Yuan, Hangjie},
journal = {arXiv preprint arXiv:2601.01874},
year = {2026}
}