Beichen Zhang* · Yuhong Liu* · Jinsong Li · Yuhang Zang† · Jiaqi Wang† · Dahua Lin†
*Equal Contribution †Corresponding authors.
📖Paper | 🏠Homepage | 🤗ETCHR-FLUX.2-klein-9B Model | 🤗ETCHR SFT-400K Dataset | 🤗ETCHR GRPO-10K Dataset | 🤗DL3DV-2K Benchmark
- 🚀 [2026/05/24] We have released the training and evaluation code of ETCHR.
- 🚀 [2026/05/21] We have released the ETCHR-FLUX.2-klein-9B Model, ETCHR-SFT-400K Dataset and ETCHR GRPO-10K Dataset.
We are thrilled to introduce ETCHR (Editing To Clarify and Harness Reasoning), a novel question-conditioned, reasoning-aware image editor designed to serve as a decoupled visual reasoning assistant for Multimodal Large Language Models (MLLMs).
By decoupling the specialized image editor from the downstream understanding model, ETCHR addresses a key bottleneck: a purely textual chain of thought fails on tasks that need fine-grained focus or complex spatial transformations.
- 🔥 Decoupled & Plug-and-Play: ETCHR functions as a separate module, allowing it to assist diverse downstream MLLMs (such as Qwen3-VL-8B, Gemini-3.1-Flash-Lite, or Kimi K2.5) without requiring any task-specific fine-tuning on the understanding models themselves.
- 🔥 Naturally Reflective Pipeline: ETCHR introduces an Edit-Verify-Reason inference mechanism in which the understanding model filters out noisy or flawed edits and safely reverts to the original image when verification fails.
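The Edit-Verify-Reason flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: `edit_fn`, `verify_fn`, and `reason_fn` are hypothetical callables standing in for the ETCHR editor and the understanding model, and their signatures are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ReasonResult:
    answer: str          # final answer from the understanding model
    image_used: Any      # the image the answer was based on
    edit_accepted: bool  # True if the edited image passed verification


def edit_verify_reason(
    image: Any,
    question: str,
    edit_fn: Callable[[Any, str], Any],         # hypothetical: question-conditioned editor
    verify_fn: Callable[[Any, Any, str], bool],  # hypothetical: accepts or rejects the edit
    reason_fn: Callable[[Any, str], str],        # hypothetical: answers from an image
) -> ReasonResult:
    # Edit: produce a question-conditioned edit of the input image.
    edited = edit_fn(image, question)
    # Verify: let the understanding model judge whether the edit is usable.
    accepted = verify_fn(image, edited, question)
    # Reason: answer from the edit if accepted, else fall back to the original.
    chosen = edited if accepted else image
    return ReasonResult(
        answer=reason_fn(chosen, question),
        image_used=chosen,
        edit_accepted=accepted,
    )
```

Because the three stages are passed in as plain callables, the same loop works with any understanding model: only `verify_fn` and `reason_fn` change, while the editor stays fixed.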
We evaluate ETCHR across five distinct task families spanning fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding. Across all evaluated backbones, ETCHR consistently yields major improvements in Pass@1 accuracy.
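For readers unfamiliar with the metric, Pass@1 is commonly read as the fraction of questions answered correctly with a single attempt per question. The sketch below assumes exact-match scoring after simple normalization. That scoring rule and the `pass_at_1` helper are assumptions for illustration only; the evaluation scripts in this repository may match answers differently.

```python
from typing import Callable, Sequence


def pass_at_1(
    predictions: Sequence[str],
    references: Sequence[str],
    normalize: Callable[[str], str] = lambda s: s.strip().lower(),
) -> float:
    """Fraction of questions whose single prediction matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    # Assumption: an answer counts as correct on exact match after normalization.
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

Comparing `pass_at_1` on answers produced with and without the edited image gives the per-backbone improvement for a given benchmark.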
Prepare your environment:
git clone https://github.com/InternLM/ETCHR.git
conda create -n ETCHR python==3.11
conda activate ETCHR
cd RL/Pref-GRPO
bash env_setup.sh fastvideo
pip install "vllm>=0.11.0"
pip install qwen-vl-utils==0.0.14
We provide example code for running ETCHR on the DL3DV-2K Benchmark in Evaluation/inference_dl3dv.py. You can start the evaluation with the following two steps:
Step 1: Start a vLLM server for an understanding model (e.g., Qwen3-VL-8B, Kimi K2.5, ...).
cd Evaluation
bash launch_vllm.sh
Step 2: Run ETCHR atop any understanding model.
python inference_dl3dv.py
We adopt a two-stage training pipeline. See SFT.md and RL.md for further details.
ETCHR can assist with a broad spectrum of understanding tasks, including fine-grained perception, chart reasoning, maze navigation, jigsaw puzzles, and 3D spatial understanding.
If you find this project useful, please kindly cite:
Our work is based on FLUX.2-klein-base-9B, so please follow the FLUX Non-Commercial License.
This work is built upon DiffSynth-Studio and Pref-GRPO, two excellent codebases for training diffusion models!






