Flow-OPD integrates On-Policy Distillation into the Flow Matching pipeline, replacing sparse scalar rewards with dense, trajectory-level, multi-teacher vector field supervision. Evaluated on SD-3.5-Medium, Flow-OPD achieves +18pt average improvement over vanilla GRPO and surpasses individual teacher models on OCR and DeQA.
Clone this repository and install packages.
git clone https://github.com/CostaliyA/Flow-OPD.git
cd Flow_OPD
conda create -n flow_grpo python=3.10.16
pip install -e .To avoid redundant downloads and potential storage waste during multi-GPU training, please pre-download the required models in advance.
Models
We optimized the OCR and Deqa experts to achieve better training results.
- SD3.5:
stabilityai/stable-diffusion-3.5-medium - GenEval Teacher:
CostaliyA/SD3.5M-FlowGRPO-OCR-3240orjieliu/SD3.5M-FlowGRPO-GenEval - OCR Teacher:
jieliu/SD3.5M-FlowGRPO-Text - PickScore Teacher:
jieliu/SD3.5M-FlowGRPO-PickScore - Deqa Teacher(optional):
CostaliyA/SD3.5M-FlowGRPO-Deqa-Mix
Reward Models
- PickScore:
laion/CLIP-ViT-H-14-laion2B-s32B-b79Kyuvalkirstain/PickScore_v1
- CLIPScore:
openai/clip-vit-large-patch14 - Aesthetic Score:
openai/clip-vit-large-patch14
The steps above only install the current repository. Since each reward model may rely on different versions, combining them in one Conda environment can cause version conflicts. To avoid this, we adopt a remote server setup inspired by ddpo-pytorch. You only need to install the specific reward model you plan to use.
Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.
Please install paddle-ocr:
pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-LevenshteinThen, pre-download the model using the Python command line:
from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)PickScore requires no additional installation. Note that the original pickscore dataset corresponds to dataset/pickscore in this repository, containing some NSFW prompts. We strongly recommend using pickapic_v1_no_images_training_sfw, the SFW version of the Pick-a-Pic dataset, which corresponds to dataset/pickscore_sfw in this repository.
Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.
Since sglang may conflict with other environments, we recommend creating a new conda environment.
conda create -n sglang python=3.10.16
conda activate sglang
pip install "sglang[all]"We use sglang to deploy the reward service. After installing sglang, please run the following command to launch UnifiedReward:
python -m sglang.launch_server --model-path CodeGoat24/UnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85Please install imagereward:
pip install image-reward
pip install git+https://github.com/openai/CLIP.gitPlease create a new Conda virtual environment with vllm:
pip install vllm
bash scripts/single_node/run_qwen_model.shand then change Line 130 (base_url) in rewards.py
Note: All training and evaluation prompts are located in the
dataset/folder. Training prompts follow the format used in flow-grpo, and evaluation prompts follow T2I-CompBench.
Before training, you can merge multiple expert LoRAs into a single cold-start LoRA to accelerate convergence:
bash scripts/single_node/merge.shAfter merging, set the merged LoRA path in the training config:
config.train.lora_path = "path/to/merged/lora"First, the GenEval rewarder and deqa services need to be deployed on other nodes.
# Master node
bash scripts/multi_node/sd3_mix.sh 0
# Other nodes
bash scripts/multi_node/sd3_mix.sh 1
bash scripts/multi_node/sd3_mix.sh 2
bash scripts/multi_node/sd3_mix.sh 3# Single-teacher OPD (local single-node, single GPU or multi-GPU)
bash scripts/single_node/sd3_opd_example.sh
# Multi-teacher OPD (local single-node, multi-GPU)
bash scripts/single_node/sd3_opd_mix_local.sh
# reward_mode: kl_only or gkd(sft)
# train.mar_lora: The lora path for full-process supervision defaults to the base model.- Single-teacher: Uses a single
kl_ref_lora_pathreference for OPD KL reward. - Multi-teacher: Uses
alternatetraining mode with per-datasetkl_ref_lora_pathβ each dataset (e.g., OCR, GenEval) uses its own teacher LoRA. Currently configured with 8 GPUs inmix_opd_8gpu. Reducenum_processesin the shell script and adjust batch sizes inconfig/grpo.py:mix_opd_8gpufor fewer GPUs.
This section describes how to evaluate your trained LoRA model on T2I-CompBench, based on the evaluation pipeline from STAGE.
First, run run_eval.sh to generate images for all T2I-CompBench categories:
bash scripts/single_node/run_eval.shModify run_eval.sh to set your LoRA path and output directory:
torchrun --nproc_per_node=8 scripts/eval_t2icompbench.py \
--lora "path/to/your/lora" \
--benchmark t2i_compbench \
--output_dir ./eval_results/compbench_imagesImages will be saved under {output_dir}/{category}/samples/.
Clone the T2I-CompBench repository and install its dependencies:
git clone https://github.com/Karine-Huang/T2I-CompBench.git
cd T2I-CompBench
# Follow the installation instructions in their repositorySet T2I_COMP_CODE_ROOT in cal_t2i_compbench_value.sh to point to the cloned T2I-CompBench folder:
T2I_COMP_CODE_ROOT="/path/to/T2I-CompBench"Then run the scoring script:
bash cal_t2i_compbench_value.shResults for each category will be saved as txt files under the corresponding annotation directories.
| Model | GenEval | OCR Acc. | DeQA | PickScore | Average |
|---|---|---|---|---|---|
| SD-3.5-M (base) | 0.63 | 0.59 | 4.07 | 21.64 | 0.72 |
| GRPO-Mix (best baseline) | 0.73 | 0.83 | 4.33 | 21.84 | 0.82 |
| Flow-OPD (Merge Init) | 0.92 | 0.94 | 4.35 | 23.08 | 0.90 |
- β¨ +18pt average improvement over base model
- π +8pt improvement over GRPO-Mix (best baseline)
- π 0.92 GenEval score (base: 0.63)
- π 0.94 OCR accuracy (base: 0.59)
Flow-OPD decouples expertise acquisition from model unification through a two-stage process:
- π§ Cold Start Initialization β SFT or Model Merging to initialize the student model
- π¨βπ« Multi-Teacher On-Policy Distillation β Dense vector field supervision from multiple teachers
The key innovations include:
- β‘ On-Policy Sampling (SDE): Stochastic exploration via SDE for diverse trajectory sampling
- π Multi-Teacher Dense Labeling: Each teacher (GenEval, OCR, DeQA, PickScore) acts as a Generative Reward Model returning a full vector field
- π¨ MAR (Manifold Anchor Regularization): KL regularization from a frozen aesthetic teacher prevents aesthetic degradation
The code is being gradually open-sourced, optimized, and refactored. Please feel free to contact me if you have any questions.
- Extend the training code
- Release the training code
- Release model weights (HuggingFace)
- Release paper (arXiv)
@article{fang2026flow,
title={Flow-OPD: On-Policy Distillation for Flow Matching Models},
author={Fang, Zhen and Huang, Wenxuan and Zeng, Yu and Zhao, Yiming and Chen, Shuang and Feng, Kaituo and Lin, Yunlong and Chen, Lin and Chen, Zehui and Cao, Shaosheng and others},
journal={arXiv preprint arXiv:2605.08063},
year={2026}
}This repo is based on flow-grpo. We also build upon STAGE for T2I-CompBench evaluation. We thank the authors for their valuable contributions to the AIGC community.




