SII-sc22mc/PA-BDM

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

PA-BDM is a document-recognition model built on Qwen2.5-VL and block diffusion. It targets text, formula, table, and diagram recognition. It is designed to keep the flexible-length generation and KV-cache benefits of Block Diffusion Models (BDMs) while improving their structural consistency and inference speed.

The paper identifies two bottlenecks in standard BDM decoding: tokens are committed only after a full block is finished, and bidirectional intra-block denoising conflicts with left-to-right inter-block generation. PA-BDM addresses them with:

  • Causal intra-block denoising: attention inside a candidate block runs from prefix to suffix, so each token is conditioned on the tokens before it, matching autoregressive structural order.
  • Confidence-gated Structural Loss (CSL): training supervision focuses on the longest reliable masked prefix and avoids noisy gradients from unstable continuations.
  • Progressive Prefix Commitment (PPC): inference treats block size as a maximum candidate range, commits the longest reliable prefix into KV cache, and resets the unresolved suffix as a new candidate range.
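
The prefix-commitment idea behind PPC can be illustrated with a toy sketch. This is not the repository's implementation: `longest_reliable_prefix`, `ppc_decode`, and `toy_scores` are hypothetical names, the confidences come from a made-up scoring function rather than a model, and forcing at least one committed token per step is an assumption added only so the loop always terminates. The sketch shows how a candidate range of at most `block_size` positions is scored, how the longest prefix meeting the threshold is committed, and how the unresolved suffix becomes the start of the next candidate range.

```python
from typing import Callable, List, Sequence


def longest_reliable_prefix(confidences: Sequence[float], threshold: float) -> int:
    """Length of the longest prefix whose confidences all meet the threshold."""
    length = 0
    for c in confidences:
        if c < threshold:
            break
        length += 1
    return length


def ppc_decode(
    score_fn: Callable[[int, int], List[float]],
    total_len: int,
    block_size: int,
    threshold: float = 0.95,
) -> List[int]:
    """Toy commit loop: returns how many positions were committed at each step.

    score_fn(start, width) is a hypothetical stand-in for one denoising pass
    over the candidate range [start, start + width); it returns one
    confidence value per position in that range.
    """
    committed = 0
    commits: List[int] = []
    while committed < total_len:
        # Block size is only an upper bound on the candidate range.
        width = min(block_size, total_len - committed)
        confidences = score_fn(committed, width)
        n = longest_reliable_prefix(confidences, threshold)
        n = max(n, 1)  # assumption: commit at least one token to guarantee progress
        commits.append(n)
        committed += n  # a real decoder would append this prefix to the KV cache
    return commits


def toy_scores(start: int, width: int) -> List[float]:
    # Made-up confidences: absolute positions 4, 9, 14, ... are unreliable.
    return [0.5 if (start + i) % 5 == 4 else 0.99 for i in range(width)]
```

With these made-up scores, `ppc_decode(toy_scores, total_len=12, block_size=8)` returns `[4, 1, 4, 1, 2]`: each step commits everything up to the first low-confidence position, and the next candidate range starts there.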

Repository Layout

PA-BDM/
|-- infer.ipynb                                      # Notebook inference demo
|-- main.py                                         # CLI inference demo
|-- run_train.sh                                    # One-command training launcher
|-- requirements.txt                                # Pip dependencies except CUDA torch stack
|-- environment.yml                                 # Conda environment with CUDA torch stack
|-- init_env.sh                                     # Editable installs for train/ and eval/
|-- docs/
|   |-- INSTALLATION.md
|   |-- INFERENCE.md
|   `-- TRAINING_EVALUATION.md
|-- train/
|   |-- llava/                                      # PA-BDM / DiffusionVL training code
|   `-- scripts/
|       `-- diffusionvl_qwenvl_finetune_causal_64_include_tr.sh
`-- eval/
    |-- scripts/                                    # lmms-eval launchers
    `-- lmms-eval/

Quick Start

1. Create Environment

Conda is recommended for GPU training:

conda env create -f environment.yml
conda activate pa-bdm
bash init_env.sh

Alternatively, create a bare Conda environment and install the packages with pip. Install the CUDA-matched PyTorch stack first, then install the remaining packages:

conda create -n pa-bdm python=3.10 -y
conda activate pa-bdm
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
bash init_env.sh

2. Download Models

For inference, download the released PA-BDM checkpoint:

huggingface-cli download MingxuChai/PA-BDM --local-dir /path/to/models/PA-BDM

For training from Qwen2.5-VL, convert the base model to the DiffusionVL-QwenVL format:

python scripts/diffusionvl_prepare/convert_qwen2.5vl_to_diffusionvl.py \
  --source_path Qwen/Qwen2.5-VL-3B-Instruct \
  --dest_path /path/to/models/DiffusionVL-Qwen2.5VL-3B-causal

3. Run Inference

Notebook:

jupyter lab infer.ipynb

CLI:

python main.py \
  --model-path /path/to/models/PA-BDM \
  --image example_formula.jpg \
  --task formula \
  --gen-length 1024 \
  --steps 32 \
  --confidence-threshold 0.95

Task prompts:

  • text: Text Recognition.
  • formula: Formula Recognition.
  • table: Table Recognition.
  • diagram: Diagram Recognition.
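
When calling the model from your own scripts, the task-to-prompt mapping above can be kept in one place. The helper below is a minimal sketch: `TASK_PROMPTS` and `build_prompt` are hypothetical names that are not taken from `main.py`, and prepending `<image>\n` is an assumption that mirrors the training sample format shown in the data preparation step.

```python
# Prompt strings copied from the task list above.
TASK_PROMPTS = {
    "text": "Text Recognition.",
    "formula": "Formula Recognition.",
    "table": "Table Recognition.",
    "diagram": "Diagram Recognition.",
}


def build_prompt(task: str, with_image_token: bool = True) -> str:
    """Return the prompt for a task, optionally prefixed with the image placeholder."""
    try:
        prompt = TASK_PROMPTS[task]
    except KeyError:
        valid = ", ".join(sorted(TASK_PROMPTS))
        raise ValueError(f"unknown task {task!r}; expected one of: {valid}") from None
    # Assumption: the image placeholder precedes the prompt, as in the training samples.
    return f"<image>\n{prompt}" if with_image_token else prompt
```

Rejecting unknown task names early avoids silently sending a prompt the model was not trained on.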

4. Prepare Training Data

The loader accepts a single JSON/JSONL file or a YAML file that mixes multiple datasets. Each sample follows the LLaVA-style multimodal conversation format:

{
  "id": "sample-000001",
  "image": "images/sample.png",
  "conversations": [
    {"from": "human", "value": "<image>\nFormula Recognition."},
    {"from": "gpt", "value": "\\frac{x}{y}"}
  ]
}
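
Before launching a long run, it can help to check samples against this format. The checker below is a sketch under assumptions drawn only from the example above: turns alternate `human` then `gpt`, and the first human turn carries the `<image>` placeholder. The actual loader may be stricter or more lenient, and `validate_sample` is a hypothetical helper, not part of the repository.

```python
from typing import Any, Dict, List


def validate_sample(sample: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one sample; an empty list means none were found."""
    problems: List[str] = []
    for key in ("id", "image", "conversations"):
        if key not in sample:
            problems.append(f"missing key: {key}")

    turns = sample.get("conversations")
    if not isinstance(turns, list) or len(turns) < 2:
        problems.append("conversations must be a list with at least two turns")
        return problems

    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} is not an object")
            continue
        # Assumption: speakers alternate, starting with the human prompt.
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected from={expected!r}, got {turn.get('from')!r}")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: value must be a string")

    first = turns[0]
    if isinstance(first, dict) and "<image>" not in str(first.get("value", "")):
        problems.append("first turn is missing the <image> placeholder")
    return problems


example = {
    "id": "sample-000001",
    "image": "images/sample.png",
    "conversations": [
        {"from": "human", "value": "<image>\nFormula Recognition."},
        {"from": "gpt", "value": "\\frac{x}{y}"},
    ],
}
```

Calling `validate_sample(example)` returns an empty list for the sample shown above.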

YAML example:

datasets:
  - json_path: /path/to/formula_train.json
    image_root: /path/to/formula_images
    sampling_strategy: all
  - json_path: /path/to/table_train.jsonl
    image_root: /path/to/table_images
    sampling_strategy: random:50000

Supported sampling strategies are all, first:N, end:N, random:N, and percentage forms such as random:20%.
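
One plausible reading of these strategy strings is sketched below. The semantics are assumptions rather than the loader's confirmed behavior: `first:N` takes the first N samples, `end:N` the last N, `random:N` draws N samples without replacement, a percentage is taken of the dataset size and rounded down, and counts larger than the dataset are clamped. `apply_sampling_strategy` is a hypothetical name.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def apply_sampling_strategy(
    items: Sequence[T], strategy: str, seed: Optional[int] = None
) -> List[T]:
    """Select a subset of items according to a strategy string such as 'random:20%'."""
    pool = list(items)
    if strategy == "all":
        return pool

    mode, sep, amount = strategy.partition(":")
    if not sep or mode not in {"first", "end", "random"}:
        raise ValueError(f"unsupported sampling strategy: {strategy!r}")

    if amount.endswith("%"):
        # Assumption: percentages refer to the dataset size and round down.
        count = int(len(pool) * float(amount[:-1]) / 100)
    else:
        count = int(amount)
    count = max(0, min(count, len(pool)))  # assumption: clamp to the dataset size

    if mode == "first":
        return pool[:count]
    if mode == "end":
        return pool[len(pool) - count:]
    # mode == "random": sample without replacement, reproducible when seed is set
    return random.Random(seed).sample(pool, count)
```

For example, `apply_sampling_strategy(list(range(10)), "first:3")` returns `[0, 1, 2]`, and `"random:20%"` on the same list returns two distinct elements.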

5. Run Training

The root launcher wraps train/scripts/diffusionvl_qwenvl_finetune_causal_64_include_tr.sh and exposes its paths and main settings as environment variables:

PRETRAINED_CHECKPOINT=/path/to/models/DiffusionVL-Qwen2.5VL-3B-causal \
DATA_PATH=/path/to/train_data.yaml \
IMAGE_FOLDER=/path/to/images \
OUTPUT_DIR=/path/to/outputs \
GPU_NUM=4 \
RUN_NAME=pa-bdm-csl-64 \
BD3LM_BLOCK_SIZE=64 \
bash run_train.sh

For multi-node training, set the standard torchrun variables:

NUM_NODES=4 \
GPU_NUM=8 \
MASTER_ADDR=10.0.0.1 \
MASTER_PORT=29199 \
RANK=0 \
bash run_train.sh

The paper reports experiments with PA-BDM-1.2B and PA-BDM-3B, using a confidence threshold of 0.95 for both CSL and PPC and a maximum candidate block size of 32 unless otherwise specified. This repository's causal_64_include_tr training script defaults to BD3LM_BLOCK_SIZE=64; set BD3LM_BLOCK_SIZE=32 to match the paper's default block size.

Documentation

| Document | Description |
|---|---|
| Installation | Environment, model download, and editable package setup |
| Training & Evaluation | Data format, one-command training, and evaluation scripts |
| Inference | Notebook and CLI inference |

Acknowledgements

This repo is mainly built on Qwen2.5-VL, LLaDA-V, BD3LMs, and DiffusionVL. We thank the authors for their open-source contributions.

Citation

If you find this work useful, please cite:

@misc{chai2026prefixadaptiveblockdiffusionefficient,
      title={Prefix-Adaptive Block Diffusion for Efficient Document Recognition},
      author={Mingxu Chai and Ziyu Shen and Chenyu Liu and Kaidi Zhang and Jiazheng Zhang and Dingwei Zhu and Zhiheng Xi and Ruoyu Chen and Jun Long and Jihua Kang and Tao Gui and Qi Zhang},
      year={2026},
      eprint={2605.16861},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.16861},
}
