Skip to content

CostaliyA/Flow-OPD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

57 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Flow-OPD: On-Policy Distillation for Flow Matching Models

Project Page Paper Code Model

Flow-OPD integrates On-Policy Distillation into the Flow Matching pipeline, replacing sparse scalar rewards with dense, trajectory-level, multi-teacher vector field supervision. Evaluated on SD-3.5-Medium, Flow-OPD achieves +18pt average improvement over vanilla GRPO and surpasses individual teacher models on OCR and DeQA.


πŸš€ Quick Started

1. Environment Set Up

Clone this repository and install packages.

git clone https://github.com/CostaliyA/Flow-OPD.git
cd Flow_OPD
conda create -n flow_grpo python=3.10.16
pip install -e .

2. Model Download

To avoid redundant downloads and potential storage waste during multi-GPU training, please pre-download the required models in advance.

Models

We optimized the OCR and Deqa experts to achieve better training results.

  • SD3.5: stabilityai/stable-diffusion-3.5-medium
  • GenEval Teacher: CostaliyA/SD3.5M-FlowGRPO-OCR-3240 or jieliu/SD3.5M-FlowGRPO-GenEval
  • OCR Teacher: jieliu/SD3.5M-FlowGRPO-Text
  • PickScore Teacher: jieliu/SD3.5M-FlowGRPO-PickScore
  • Deqa Teacher(optional): CostaliyA/SD3.5M-FlowGRPO-Deqa-Mix

Reward Models

  • PickScore:
    • laion/CLIP-ViT-H-14-laion2B-s32B-b79K
    • yuvalkirstain/PickScore_v1
  • CLIPScore: openai/clip-vit-large-patch14
  • Aesthetic Score: openai/clip-vit-large-patch14

3. Reward Preparation

The steps above only install the current repository. Since each reward model may rely on different versions, combining them in one Conda environment can cause version conflicts. To avoid this, we adopt a remote server setup inspired by ddpo-pytorch. You only need to install the specific reward model you plan to use.

GenEval

Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.

OCR

Please install paddle-ocr:

pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-Levenshtein

Then, pre-download the model using the Python command line:

from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)

Pickscore

PickScore requires no additional installation. Note that the original pickscore dataset corresponds to dataset/pickscore in this repository, containing some NSFW prompts. We strongly recommend using pickapic_v1_no_images_training_sfw, the SFW version of the Pick-a-Pic dataset, which corresponds to dataset/pickscore_sfw in this repository.

DeQA

Please create a new Conda virtual environment and install the corresponding dependencies according to the instructions in reward-server.

UnifiedReward

Since sglang may conflict with other environments, we recommend creating a new conda environment.

conda create -n sglang python=3.10.16
conda activate sglang
pip install "sglang[all]"

We use sglang to deploy the reward service. After installing sglang, please run the following command to launch UnifiedReward:

python -m sglang.launch_server --model-path CodeGoat24/UnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85

ImageReward

Please install imagereward:

pip install image-reward
pip install git+https://github.com/openai/CLIP.git

QwenVL score

Please create a new Conda virtual environment with vllm:

pip install vllm
bash scripts/single_node/run_qwen_model.sh

and then change Line 130 (base_url) in rewards.py

4. Dataset Preparation

Note: All training and evaluation prompts are located in the dataset/ folder. Training prompts follow the format used in flow-grpo, and evaluation prompts follow T2I-CompBench.

5. Start Training

5.0 Cold Start (optional)

Before training, you can merge multiple expert LoRAs into a single cold-start LoRA to accelerate convergence:

bash scripts/single_node/merge.sh

After merging, set the merged LoRA path in the training config:

config.train.lora_path = "path/to/merged/lora"

5.1 GRPO-mix

First, the GenEval rewarder and deqa services need to be deployed on other nodes.

# Master node
bash scripts/multi_node/sd3_mix.sh 0
# Other nodes
bash scripts/multi_node/sd3_mix.sh 1
bash scripts/multi_node/sd3_mix.sh 2
bash scripts/multi_node/sd3_mix.sh 3

5.2 Flow-OPD

# Single-teacher OPD (local single-node, single GPU or multi-GPU)
bash scripts/single_node/sd3_opd_example.sh

# Multi-teacher OPD (local single-node, multi-GPU)
bash scripts/single_node/sd3_opd_mix_local.sh
# reward_mode: kl_only or gkd(sft)
# train.mar_lora: The lora path for full-process supervision defaults to the base model.
  • Single-teacher: Uses a single kl_ref_lora_path reference for OPD KL reward.
  • Multi-teacher: Uses alternate training mode with per-dataset kl_ref_lora_path β€” each dataset (e.g., OCR, GenEval) uses its own teacher LoRA. Currently configured with 8 GPUs in mix_opd_8gpu. Reduce num_processes in the shell script and adjust batch sizes in config/grpo.py:mix_opd_8gpu for fewer GPUs.

πŸ“Š Evaluation

This section describes how to evaluate your trained LoRA model on T2I-CompBench, based on the evaluation pipeline from STAGE.

1. Generate Images

First, run run_eval.sh to generate images for all T2I-CompBench categories:

bash scripts/single_node/run_eval.sh

Modify run_eval.sh to set your LoRA path and output directory:

torchrun --nproc_per_node=8 scripts/eval_t2icompbench.py \
    --lora "path/to/your/lora" \
    --benchmark t2i_compbench \
    --output_dir ./eval_results/compbench_images

Images will be saved under {output_dir}/{category}/samples/.

2. Install T2I-CompBench

Clone the T2I-CompBench repository and install its dependencies:

git clone https://github.com/Karine-Huang/T2I-CompBench.git
cd T2I-CompBench
# Follow the installation instructions in their repository

3. Score Images

Set T2I_COMP_CODE_ROOT in cal_t2i_compbench_value.sh to point to the cloned T2I-CompBench folder:

T2I_COMP_CODE_ROOT="/path/to/T2I-CompBench"

Then run the scoring script:

bash cal_t2i_compbench_value.sh

Results for each category will be saved as txt files under the corresponding annotation directories.


🎯 Key Results

Model GenEval OCR Acc. DeQA PickScore Average
SD-3.5-M (base) 0.63 0.59 4.07 21.64 0.72
GRPO-Mix (best baseline) 0.73 0.83 4.33 21.84 0.82
Flow-OPD (Merge Init) 0.92 0.94 4.35 23.08 0.90
  • ✨ +18pt average improvement over base model
  • πŸš€ +8pt improvement over GRPO-Mix (best baseline)
  • πŸ“Š 0.92 GenEval score (base: 0.63)
  • πŸ“ 0.94 OCR accuracy (base: 0.59)

πŸ”¬ Method Overview

Flow-OPD decouples expertise acquisition from model unification through a two-stage process:

  1. 🧊 Cold Start Initialization β€” SFT or Model Merging to initialize the student model
  2. πŸ‘¨β€πŸ« Multi-Teacher On-Policy Distillation β€” Dense vector field supervision from multiple teachers

The key innovations include:

  • ⚑ On-Policy Sampling (SDE): Stochastic exploration via SDE for diverse trajectory sampling
  • πŸ”€ Multi-Teacher Dense Labeling: Each teacher (GenEval, OCR, DeQA, PickScore) acts as a Generative Reward Model returning a full vector field
  • 🎨 MAR (Manifold Anchor Regularization): KL regularization from a frozen aesthetic teacher prevents aesthetic degradation

πŸ“‹ Todo List

The code is being gradually open-sourced, optimized, and refactored. Please feel free to contact me if you have any questions.

πŸ”„ In Progress

  • Extend the training code

βœ… Completed

  • Release the training code
  • Release model weights (HuggingFace)
  • Release paper (arXiv)

🎨 Qualitative Results

Overview

Teaser

Comparison

Comparison

More Results (1/3)

More Results 1

More Results (2/3)

More Results 2

More Results (3/3)

More Results 3


πŸ“š Citation

@article{fang2026flow,
  title={Flow-OPD: On-Policy Distillation for Flow Matching Models},
  author={Fang, Zhen and Huang, Wenxuan and Zeng, Yu and Zhao, Yiming and Chen, Shuang and Feng, Kaituo and Lin, Yunlong and Chen, Lin and Chen, Zehui and Cao, Shaosheng and others},
  journal={arXiv preprint arXiv:2605.08063},
  year={2026}
}

πŸ™ Acknowledgements

This repo is based on flow-grpo. We also build upon STAGE for T2I-CompBench evaluation. We thank the authors for their valuable contributions to the AIGC community.

About

Official Repo of "Flow-OPD: On-Policy Distillation for Flow Matching Models"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors