Stitch: Training-Free Position Control in Multimodal Diffusion Transformers

Jessica Bader*, Mateusz Pach*, María A Bravo, Serge Belongie, Zeynep Akata
* denotes equal contribution

Abstract

Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like “above” or “to the right of” poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free.

PosEval Setup

The prompts for PosEval can be found in the PosEval/ folder.

PosEval extends GenEval [1], and follows GenEval’s evaluation setup with minimal modifications to support our additional tasks.

To run the evaluation with these changes:

Either use the modified version provided at PosEval/evaluate_images.py, where our additions are clearly marked with ######,
or manually add those marked lines to GenEval’s evaluate_images.py.

We recommend using the same evaluation settings as GenEval, specifically 4 images per prompt.
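
As a rough illustration of the 4-images-per-prompt setting, the sketch below writes one folder per prompt with four samples each. It assumes the folder layout described in GenEval's own documentation (a zero-padded folder per prompt containing metadata.jsonl and a samples/ subfolder) and that each metadata row carries a "prompt" field; please check both against GenEval before relying on them. The names write_prompt_folders and generate_image are hypothetical placeholders, not part of Stitch, PosEval, or GenEval.

```python
import json
from pathlib import Path
from typing import Callable, Iterable

IMAGES_PER_PROMPT = 4  # the setting recommended above, matching GenEval


def write_prompt_folders(
    metadata_rows: Iterable[dict],
    out_dir: str,
    generate_image: Callable[[str, int], bytes],
) -> None:
    """Write one folder per prompt, each holding IMAGES_PER_PROMPT samples.

    generate_image(prompt, index) is a hypothetical stand-in for your
    text-to-image model; it should return encoded PNG bytes.
    """
    root = Path(out_dir)
    for i, row in enumerate(metadata_rows):
        prompt_dir = root / f"{i:05d}"       # assumed layout: 00000/, 00001/, ...
        sample_dir = prompt_dir / "samples"  # assumed layout: samples/0000.png, ...
        sample_dir.mkdir(parents=True, exist_ok=True)
        # Store the prompt's metadata row next to its samples.
        (prompt_dir / "metadata.jsonl").write_text(
            json.dumps(row) + "\n", encoding="utf-8"
        )
        for j in range(IMAGES_PER_PROMPT):
            image_bytes = generate_image(row["prompt"], j)
            (sample_dir / f"{j:04d}.png").write_bytes(image_bytes)
```

The resulting root folder is what you would then pass to evaluate_images.py; the rows themselves would come from the prompt files in the PosEval/ folder.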

Code

Code for Stitch will be available shortly.

Works Cited

[1] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. NeurIPS, 2023.

Citation

@article{bader2025stitch,
  title={Stitch: Training-Free Position Control in Multimodal Diffusion Transformers}, 
  author={Jessica Bader and Mateusz Pach and María Bravo and Serge Belongie and Zeynep Akata},
  journal={arxiv},
  year={2025}
}
