Official PyTorch implementation of our cross-modal on-policy self-distillation paper.
Code release for Visual-OPSD, a cross-modal on-policy self-distillation framework that transfers the visual generation knowledge of a unified multimodal model (UMM) into its text-only understanding pathway, without paying the inference-time cost of generating intermediate "visual thought" (VT) images.
*Figure 1 of the paper. (a) Visual-OPSD (green) matches or exceeds its generative teacher (purple) on 6 of 9 benchmarks. (b) Per-benchmark gains over the teacher; VT-helpful spatial tasks dominate. (c) Pareto-dominant at 74.03% / 10.0 s, a 14× speedup over the teacher.*

Status. Pre-print release alongside the arXiv submission. The author list, project page, and pre-trained Visual-OPSD student checkpoints will be released here as they become public.
Unified multimodal models such as BAGEL and ThinkMorph can interleave generated visual thoughts (VTs) with text reasoning: actual images produced by multi-step diffusion that are spliced into the chain-of-thought before the final answer. This improves spatial reasoning substantially (+75 pp on VSP over BAGEL), but each VT costs ~50 diffusion steps, inflating per-sample inference latency by roughly 14×.
A pilot intervention on the frozen ThinkMorph teacher (paper Section 2) shows that removing or corrupting the intermediate VTs barely changes accuracy across all nine benchmarks, while a complementary attention analysis on V* shows that once a VT is rendered the subsequent reasoning attends almost exclusively to that generated image and ignores the original input.
Visual-OPSD therefore asks a different question: does the generation pathway encode reasoning knowledge that the understanding pathway cannot recover on its own, and if so, can it be distilled cross-modally inside the same model?
The paper's answer is yes:
- Teacher and student share identical weights, but they differ in conditioning context. The teacher is fed a strictly visual-only privileged trace: the intermediate VT images, encoded by the same ViT the model already uses for understanding. The student sees only the problem image and the question.
- Distillation is on-policy. The student samples completions from its own current policy, and the teacher rescores the same tokens under the privileged VT context. The loss is a token-level Generalized Jensen-Shannon divergence on the shared completion span.
- No new parameters, no architectural changes, only a training objective. At inference the student is text-only and never invokes the diffusion pathway.
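The token-level loss described above can be sketched in plain Python. This is a minimal illustration and not the repository's opsd_loss.py: the function names are hypothetical, the β-weighted form follows the generalized JSD definition used in GKD, and treating the teacher as P and the student as Q is an assumption. Top-K restriction and per-token clipping are left out.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; entries with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(teacher_probs, student_probs, beta=0.5):
    """beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta*P + (1-beta)*Q.

    Assumption: P is the teacher distribution and Q the student distribution.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    mix = [beta * p + (1.0 - beta) * q for p, q in zip(teacher_probs, student_probs)]
    return beta * kl(teacher_probs, mix) + (1.0 - beta) * kl(student_probs, mix)


def sequence_jsd(teacher_logits, student_logits, beta=0.5):
    """Mean per-token JSD over the completion span both models score."""
    per_token = [
        generalized_jsd(softmax(t), softmax(s), beta)
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)
```

With β=0.5 this reduces to the standard Jensen-Shannon divergence, which is zero when the two distributions agree and at most log 2 nats per token.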
| Method | Avg (9 benchmarks) ↑ | Latency (s / sample) ↓ |
|---|---|---|
| ThinkMorph (VT teacher) | 70.63 | 142.8 |
| Text-only SFT | 63.75 | 28.5 |
| Visual-OPSD-Noise (control) | 64.15 | 14.9 |
| Visual-OPSD (Ours) | 74.03 | 10.0 |
| Δ vs ThinkMorph | +3.40 pp | 14.3× faster |
| Δ vs Text-only SFT | +10.28 pp | 2.9× faster |
The Visual-OPSD-Noise control (real VT replaced with Gaussian noise of matching shape) yields only +0.40 pp over Text-only SFT, while Visual-OPSD with real VT yields +10.28 pp. A post-distillation KL analysis shows that Visual-OPSD closes 58.4% of the teacher-student distributional gap, against 3.5% for the noise control. Together these two diagnostics confirm that the transferred signal originates from the generation pathway's VT semantic content, not from generic JSD regularization or from the surrounding privileged structure.
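The deltas and speedups quoted above follow directly from the table rows. The short check below recomputes them; the dictionary layout and the compare helper are illustrative only, and the numbers are copied from the results table.

```python
# (average accuracy over 9 benchmarks, latency in seconds per sample)
results = {
    "ThinkMorph (VT teacher)": (70.63, 142.8),
    "Text-only SFT": (63.75, 28.5),
    "Visual-OPSD-Noise (control)": (64.15, 14.9),
    "Visual-OPSD (Ours)": (74.03, 10.0),
}


def compare(method, baseline):
    """Return (accuracy gain in pp, latency speedup factor) of method over baseline."""
    acc, lat = results[method]
    base_acc, base_lat = results[baseline]
    return round(acc - base_acc, 2), round(base_lat / lat, 1)


for baseline in ("ThinkMorph (VT teacher)", "Text-only SFT"):
    gain, speedup = compare("Visual-OPSD (Ours)", baseline)
    print(f"Visual-OPSD vs {baseline}: {gain:+.2f} pp, {speedup:.1f}x faster")
```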
See INSTALL.md for a full walk-through. Short version:
git clone <repo-url> Visual-OPSD && cd Visual-OPSD
python -m pip install -U uv
uv venv --python 3.10
source .venv/bin/activate
uv pip install -r requirements.txt
# flash-attn must be installed as a prebuilt wheel matching your
# CUDA / PyTorch version, e.g. CUDA 12.6 + PyTorch 2.5:
pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.0.8/flash_attn-2.5.9+cu126torch2.5-cp310-cp310-linux_x86_64.whl
python download_model.py  # -> models/ThinkMorph-7B/

The base UMM (BAGEL-7B-MoT fine-tuned on interleaved CoT traces) is the shared starting point for the Visual-OPSD teacher and student.
Visual-OPSD trains on the four reasoning datasets released with
ThinkMorph (Visual_Search / Spatial_Navigation / Jigsaw_Assembly
/ Chart_Refocus; 24,990 samples total). Each parquet record carries
problem_image, question, reasoning_thought_*, reasoning_image_*
(the VT images that become the teacher's privileged context), answer,
and full_text_only_thought.
Download from the ThinkMorph Hugging Face
hub and place them under datasets/
(the default expected by data/dataset_info.py):
datasets/
├── Visual_Search/data/*.parquet
├── Spatial_Navigation/data/*.parquet
├── Jigsaw_Assembly/data/*.parquet
└── Chart_Refocus/data/*.parquet
Override the root by setting VISUAL_OPSD_DATA_ROOT=/path/to/datasets, or edit
the path helpers in data/dataset_info.py.
Each dataset folder must contain a data/ subdirectory with parquet
shards.
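Before launching training it can help to confirm that the layout above is in place. The following is a standalone sketch using only the standard library; it does not reproduce data/dataset_info.py, and the helper names (resolve_data_root, find_shards, missing_datasets) are hypothetical. Only the VISUAL_OPSD_DATA_ROOT variable, the four dataset names, and the data/*.parquet layout come from this README.

```python
import os
from pathlib import Path

DATASETS = ("Visual_Search", "Spatial_Navigation", "Jigsaw_Assembly", "Chart_Refocus")


def resolve_data_root(default="datasets"):
    """Return the dataset root, honouring VISUAL_OPSD_DATA_ROOT when it is set."""
    return Path(os.environ.get("VISUAL_OPSD_DATA_ROOT", default))


def find_shards(root):
    """Map each dataset name to its sorted parquet shards under <root>/<name>/data/."""
    shards = {}
    for name in DATASETS:
        data_dir = Path(root) / name / "data"
        # A missing data/ directory simply yields an empty shard list.
        shards[name] = sorted(data_dir.glob("*.parquet")) if data_dir.is_dir() else []
    return shards


def missing_datasets(root):
    """List the datasets that have no parquet shard in the expected location."""
    return [name for name, shards in find_shards(root).items() if not shards]


if __name__ == "__main__":
    root = resolve_data_root()
    missing = missing_datasets(root)
    print(f"data root: {root}")
    print("all datasets found" if not missing else f"missing: {', '.join(missing)}")
```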
| Step | Command | Notes |
|---|---|---|
| KL diagnostic (Section 2.2) | bash scripts/visual_opsd/run_kl_diagnostic.sh | Verifies that the VT context measurably shifts the frozen base model's completion distribution (paper: Kgen ≈ 4.64 nats/token across 1k samples). |
| Text-only SFT baseline | bash scripts/visual_opsd/run_sft_baseline.sh | Reproduces the Text-only SFT row of Table 2. |
| Visual-OPSD (main) | bash scripts/visual_opsd/run_visual_opsd.sh | On-policy Visual-OPSD with EMA teacher, β=0.5, top-K=256, token-clip=0.05. |
| Visual-OPSD-Noise (ablation) | bash scripts/visual_opsd/run_visual_opsd_noise.sh | Replaces real VT images with Gaussian noise in the teacher context. |
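The KL diagnostic reports a per-token divergence in nats. A rough sketch of that kind of measurement is shown below, assuming the quantity is the mean over completion tokens of KL(with-VT || without-VT) computed from next-token logits. The direction of the KL, the averaging, and all function names are assumptions; the actual script may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(logits_p, logits_q):
    """KL(p || q) in nats for one token position, computed from raw logits."""
    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def mean_kl_per_token(with_vt_logits, without_vt_logits):
    """Average the per-position KL over all scored completion tokens."""
    kls = [token_kl(p, q) for p, q in zip(with_vt_logits, without_vt_logits)]
    return sum(kls) / len(kls)
```

A value near zero would mean the VT context leaves the completion distribution unchanged; larger values indicate that conditioning on the VT images shifts it.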
All scripts launch with torchrun and assume an Arnold-style worker
environment (ARNOLD_WORKER_* env vars). Adapt the launcher header for
your scheduler. The paper's reported checkpoint is at 1,000 steps;
the default TOTAL_STEPS in the launchers is 2,000, so pass 1000 as
the sixth positional argument to reproduce the paper exactly:
bash scripts/visual_opsd/run_visual_opsd.sh models/ThinkMorph-7B ema 0.5 1.0 1.0 1000

See TRAIN.md for the full hyperparameter table and a reproduction protocol.
Visual-OPSD students are pure text-only models: they never invoke the diffusion pathway at inference, so you do not need the VAE.
python examples/inference_demo.py \
--model_path results/visual-opsd-ema-beta0.5/checkpoints/0001000 \
--image path/to/problem.jpg \
--question "..."

For full interleaved inference (text + generated VT) with the base
ThinkMorph teacher, see inferencer.py; the
InterleaveInferencer.interleave_inference entry point matches the
original BAGEL / ThinkMorph API.
We use the open-source
VLMEvalKit-ThinkMorph
evaluation harness, which already supports all nine benchmarks reported
in the paper (VSP, VisPuzzle, ChartQA, VStar, BLINK-J, MMVP,
SAT, BLINK, CV-Bench). See EVAL.md for the protocol
used in the paper (greedy decoding, max 1,024 tokens, single H800,
batch size 1).
- The teacher's privileged channel is strictly visual-only: only the VT images appear, not the text thoughts or the ground-truth answer. The teacher does not need to be "stronger"; it merely possesses more visual information.
- student_completion is sampled from the current student weights every step (on-policy), using FSDP.summon_full_params so each rank can run BAGEL's batched inference primitives.
- The teacher is the same model as the student, conditioned differently. Three teacher modes are supported via --teacher_mode: self (no-grad twin), ema (default, decay 0.995), or fixed (frozen initial checkpoint).
- JSD supports β interpolation, top-K restriction (paper uses K=256), per-token clipping (paper uses 0.05), and an optional Tinker reverse-KL variant, all in scripts/visual_opsd/opsd_loss.py.
A self-contained walk-through with launcher options is in
scripts/visual_opsd/README.md, and the step-by-step
pseudocode is in Algorithm 1 (Appendix C) of the paper.
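The three teacher modes listed above differ only in how the teacher weights track the student between steps. The toy sketch below uses plain dictionaries of floats in place of model parameters. The update_teacher helper is hypothetical, and reading self as "copy the current student weights" is an interpretation of "no-grad twin" rather than something this README states.

```python
def update_teacher(mode, teacher, student, decay=0.995):
    """Refresh teacher weights in place after a student optimizer step.

    mode="self":  teacher mirrors the current student (assumed meaning of
                  "no-grad twin").
    mode="ema":   teacher <- decay * teacher + (1 - decay) * student.
    mode="fixed": teacher stays at the initial checkpoint.
    """
    if mode == "self":
        teacher.update(student)
    elif mode == "ema":
        for name, value in student.items():
            teacher[name] = decay * teacher[name] + (1.0 - decay) * value
    elif mode != "fixed":
        raise ValueError(f"unknown teacher mode: {mode}")
    return teacher


teacher = {"w": 1.0}
student = {"w": 0.0}
update_teacher("ema", teacher, student)  # teacher["w"] moves 0.5% of the way to the student
```

With decay 0.995 the EMA teacher lags the student with an effective averaging horizon of roughly 1 / (1 - 0.995) = 200 steps.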
- 8 × NVIDIA H800 (80 GB) for full-rank on-policy Visual-OPSD training; FSDP HYBRID_SHARD with activation checkpointing and gradient accumulation keeps peak memory under 70 GB / GPU.
- The on-policy sampler runs on each rank inside FSDP.summon_full_params, so the effective student copy must fit in a single GPU's memory at sampling time (it does, with the optimizer state CPU-offloaded; see Appendix C of the paper).
- A single 80 GB GPU is enough for inference and the KL diagnostic.
This repository builds on the open-source releases of BAGEL and ThinkMorph, which provide the unified-model architecture and the interleaved reasoning trace datasets that Visual-OPSD trains on. The on-policy self-distillation recipe belongs to the OPSD family (text-only OPSD, long-context OPSDL, and visual-crop Vision-OPD); Visual-OPSD differs from these by being the first OPSD instance to bridge a generation-to-understanding gap within a single unified multimodal model. The generalized JSD formulation is the one popularised by GKD (Agarwal et al.). All upstream licences (Apache 2.0) are preserved.
We also thank Jialong Wu for valuable discussions.
Apache License 2.0; see LICENSE.
@article{li2026visual,
title={Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning},
author={Li, Pengyu and Gao, Zhitao and Zhang, Lingling and Huang, Muye and Li, Yuanming and Xu, Fangzhi and Liu, Jun},
journal={arXiv preprint arXiv:2606.18974},
year={2026}
}




