Johannes Schusterbauer* · Ming Gui* · Yusong Li · Pingchuan Ma · Felix Krause · Björn Ommer
CompVis Group @ LMU Munich, Munich Center for Machine Learning (MCML)
CVPR 2026
Patch Forcing turns denoising into a spatially adaptive process. During training, different image patches receive heterogeneous timesteps. While conceptually straightforward, this only works well with a dedicated timestep sampler that controls how much clean information is exposed per sample, closing the train–test gap where inference starts from pure noise. This framework enables dynamic sampling strategies, where easy regions can be denoised faster and provide cleaner context for harder ones.
🔥 Contributions
- Patch-wise timesteps
$\rightarrow$ enables heterogeneous denoising - LTG timestep sampler
$\rightarrow$ fixes train-test mismatch - Patch difficulty-guided sampling
$\rightarrow$ allocates compute adaptively
Natural images are highly spatially heterogeneous: some regions (e.g. backgrounds) are easy to denoise, while others (e.g. fine structures, text) require more refinement and context. However, standard diffusion and flow-based models treat all regions equally, applying the same timestep and compute everywhere.
Key idea: move from global to patch-wise denoising, where different regions follow different noise trajectories.
Naively assigning random timesteps per patch does not work. When timesteps are sampled independently and uniformly, most training samples contain a mix of noisy and already partially clean regions. As a result, the model learns to rely on this implicit context, even though such states never occur at inference, where generation starts from pure noise. This creates a clear train–test mismatch.
Prior work (SRM) addresses this by controlling the average amount of information per sample. While this partially mitigates the issue, it does not fully resolve it: even if the average is well-behaved, individual patches can still be nearly clean. In practice, this means that almost every training example still contains highly informative regions.
Our key idea is to instead control the maximum information available in each sample. Concretely, we first sample a maximum timestep and then restrict all patch-wise timesteps to lie below it. This prevents any region from becoming too clean during training and ensures that the model consistently operates in regimes that match inference.
With this simple change, heterogeneous patch-wise denoising works! Even without any adaptive sampling at inference, this training strategy already improves generation quality over standard diffusion models with uniform timesteps.
To fully leverage patch-wise denoising at inference, we need to decide which regions should be denoised faster and which require more refinement. For this, we augment the model with a lightweight uncertainty (difficulty) head that predicts, for each patch, how reliable the current denoising velocity prediction is.
With heterogeneous denoising and the uncertainty head, we base our adaptive samplers on three key findings:
-
context helps denoising
$\rightarrow$ advancing confident (easy) regions provides cleaner context that improves predictions in harder regions -
uncertainty reflects patch difficulty
$\rightarrow$ higher uncertainty correlates with higher validation loss -
more context reduces uncertainty
$\rightarrow$ cleaner neighboring regions make difficult patches easier to denoise
These findings naturally lead to adaptive sampling strategies that allocate compute where it is most useful. Instead of denoising all patches uniformly, we use the predicted uncertainty to guide the process: easy regions are advanced more aggressively, while difficult ones receive additional refinement.
- The dual-loop sampler alternates between quickly advancing confident patches and refining uncertain ones with smaller steps.
- The look-ahead sampler goes one step further by explicitly advancing confident patches into the future and using their cleaner states as context for denoising harder regions.
Together, these strategies turn patch-wise heterogeneity into adaptive inference, improving generation quality under the same compute budget by focusing effort where it matters most.
Please refer to our paper for a more detailed description of our framework. 😉
This codebase is based on Python 3.12 and the packages listed in requirements.txt.
First, clone the repository:
git clone git@github.com:CompVis/patch-forcing.git
cd patch-forcingThen create the environment and install the dependencies:
conda create -n pft python=3.12
conda activate pft
pip install -r requirements.txtIf the default install fails on your machine, follow the safer install order noted in ?requirements.txt: install torch and torchvision first, then flash-attn, then the remaining requirements.
We release two Patch Forcing checkpoints: PFT-B and PFT-XL. The checkpoints contain the EMA weights, as well as the model config.
To generate class-conditional samples use:
python scripts/sample.py \
--ckpt /path/to/model.ckpt \
--sample-fn-config configs/sampler/dual-loop.yaml \
--num-sampling-steps 100 \
--cfg-scale 4.0
# ... you can add sampler specific args via dot-notationFor FID samples use scripts/sample_ddp.py:
torchrun --standalone --nproc_per_node=8 scripts/sample_ddp.py \
--ckpt /path/to/model.ckpt \
--sample-fn-config configs/sampler/euler-pf.yaml \
--per-proc-batch-size 64 \
--num-fid-samples 50000 \
--num-sampling-steps 100 \
--cfg-scale 1.0If your checkpoint comes from training and does not already contain the compact config + state_dict format expected by the samplers, convert it first:
python scripts/convert_ckpt.py /path/to/training.ckptYou can train new models via train.py. The repository is based on hydra, and the base config lives in configs/config.yaml. Experiments in configs/experiment overwrite this base config. Use CLI overrides to swap configs or change individual fields, for example python train.py experiment=imnet-pft-b name=imnet/my-run data=dummy256 train_params.max_steps=10000.
To directly use the ImageNet-256 webdataset file, configure the ImageNet-256 shard locations in configs/data/imagenet256.yaml.
For debugging, use you can use configs/data/dummy256.yaml.
Train the main class-conditional experiments with:
python train.py experiment=imnet-pft-b
python train.py experiment=imnet-pft-xlIf you want to use your own dataloader, make sure it returns a dictionary with image (bchw tensor normalized to label.
For text-to-image training, first fill in the gaps in configs/data/t2i-256.yaml and then you can train them with
python train.py experiment=t2i-pft1.2b-qwenThe batch should contain a dict with image, text (set corresponding text key in trainer), and img_meta if you want to include crop size conditioning via RoPE (see patch_flow/data_utils.py for more info). The default loader uses random caption sampling (as we used multiple caption lengths during training).
You can use scripts/t2i_sample.py to sample images based on a text prompt.
If you use our work in your research, please use the following BibTeX entry. 🙂
@InProceedings{schusterbauer2025patchforcing,
title={Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation},
author={Johannes Schusterbauer and Ming Gui and Yusong Li and Pingchuan Ma and Felix Krause and Björn Ommer},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2026}
}



