ExplainableML/Stitch

Stitch: Training-Free Position Control in Multimodal Diffusion Transformers

Abstract

Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like “above” or “to the right of” poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free.


Method

PosEval Setup

The prompts for PosEval can be found in the PosEval/ folder.

PosEval extends GenEval [1], and follows GenEval’s evaluation setup with minimal modifications to support our additional tasks.

To run the evaluation with these changes:

  • Use the modified script provided at PosEval/evaluate_images.py, where our additions are clearly marked with ######, or
  • manually add the marked lines to GenEval’s evaluate_images.py.
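If you take the manual route, it can help to list the marked lines first. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes only that each addition sits on or next to a line containing the ###### marker.

```python
from pathlib import Path

# Marker named in this README. How exactly it brackets each addition
# (inline, before, or around a block) is an assumption to check by eye.
MARKER = "######"


def find_marked_lines(source: str, marker: str = MARKER) -> list[tuple[int, str]]:
    """Return (1-based line number, stripped line) for every line containing the marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if marker in line
    ]


def find_marked_lines_in_file(path: str, marker: str = MARKER) -> list[tuple[int, str]]:
    """Convenience wrapper (hypothetical helper) that reads a script from disk."""
    return find_marked_lines(Path(path).read_text(encoding="utf-8"), marker)
```

For example, `find_marked_lines_in_file("PosEval/evaluate_images.py")` would list the locations to compare against your copy of GenEval’s script.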

We recommend using the same evaluation settings as GenEval, in particular 4 images per prompt.
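Before running the evaluation, you may want to confirm that every prompt actually has 4 generated images. The sketch below is illustrative only: the function names are hypothetical, and it assumes one subfolder per prompt containing a samples/ directory of PNG files. Adjust the paths to whatever layout your evaluation script expects.

```python
from pathlib import Path


def count_samples(image_root: str) -> dict[str, int]:
    """Map each prompt subfolder name to the number of PNG files in its samples/ directory."""
    counts: dict[str, int] = {}
    for prompt_dir in sorted(Path(image_root).iterdir()):
        if not prompt_dir.is_dir():
            continue
        samples_dir = prompt_dir / "samples"  # assumed layout, see note above
        counts[prompt_dir.name] = (
            len(list(samples_dir.glob("*.png"))) if samples_dir.is_dir() else 0
        )
    return counts


def folders_with_wrong_count(image_root: str, expected: int = 4) -> list[str]:
    """Return the prompt subfolders whose image count differs from the expected value."""
    return [name for name, n in count_samples(image_root).items() if n != expected]
```

An empty list from `folders_with_wrong_count(...)` means every prompt folder holds the recommended number of images.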

Code

Code for Stitch will be available shortly.

Works Cited

[1] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. NeurIPS, 2023.

Citation

@article{bader2025stitch,
  title={Stitch: Training-Free Position Control in Multimodal Diffusion Transformers}, 
  author={Jessica Bader and Mateusz Pach and María Bravo and Serge Belongie and Zeynep Akata},
  journal={arxiv},
  year={2025}
}
