Yunxiang Zhang, Nan Wu, Connor Lin, Gordon Wetzstein, Qi Sun
Published in ACM Transactions on Applied Perception 2024
Presented at ACM Symposium on Applied Perception 2024 (Best Paper Award and Best Presentation Award)
Diffusion models offer unprecedented image generation power given just a text prompt. While emerging approaches for controlling diffusion models have enabled users to specify the desired spatial layouts of the generated content, they cannot predict or control where viewers will pay more attention due to the complexity of human vision. Recognizing the significance of attention-controllable image generation in practical applications, we present a saliency-guided framework to incorporate the data priors of human visual attention mechanisms into the generation process. Given a user-specified viewer attention distribution, our control module conditions a diffusion model to generate images that attract viewers’ attention toward the desired regions. To assess the efficacy of our approach, we performed an eye-tracked user study and a large-scale model-based saliency analysis. The results evidence that both the cross-user eye gaze distributions and the saliency models’ predictions align with the desired attention distributions. Lastly, we outline several applications, including interactive design of saliency guidance, attention suppression in unwanted regions, and adaptive generation for varied display/viewing conditions.
First create a dedicated conda environment:
conda env create -f environment.yml
conda activate gazefusion
- Place the images that will be used for building the training dataset under the data/raw-images/ folder;
- Download the pre-trained EML-Net modules (res_imagenet.pth, res_places.pth, and res_decoder.pth) for saliency map generation and place them under the models/emlnet/ folder;
- Build the training dataset with prompt-smap-image pairs:
  python process.py
- Check the command-line arguments in process.py for more data preparation options.
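Before running the dataset build, it can help to confirm that the inputs listed above are in place. The following is a minimal sketch and not part of the GazeFusion code base: the function name `check_dataset_inputs` and the set of accepted image extensions are assumptions made for illustration, while the folder and weight file names are taken from the steps above.

```python
from pathlib import Path

# Weight file names come from the steps above.
EMLNET_WEIGHTS = ("res_imagenet.pth", "res_places.pth", "res_decoder.pth")
# Assumption: which extensions count as images is not stated in this README.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def check_dataset_inputs(repo_root="."):
    """Return a list of human-readable problems; an empty list means ready."""
    root = Path(repo_root)
    problems = []

    # The raw training images are expected under data/raw-images/.
    raw_dir = root / "data" / "raw-images"
    if not raw_dir.is_dir():
        problems.append(f"missing folder: {raw_dir}")
    elif not any(p.suffix.lower() in IMAGE_EXTENSIONS for p in raw_dir.iterdir()):
        problems.append(f"no images found in: {raw_dir}")

    # The three pre-trained EML-Net modules are expected under models/emlnet/.
    for name in EMLNET_WEIGHTS:
        weight = root / "models" / "emlnet" / name
        if not weight.is_file():
            problems.append(f"missing EML-Net weight: {weight}")

    return problems


if __name__ == "__main__":
    for problem in check_dataset_inputs():
        print(problem)
```

Run from the repository root, the script prints one line per missing input and prints nothing when everything is in place.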
- Download the untrained GazeFusion model from OneDrive and place it under the models/ folder;
- Train the GazeFusion model with prompt-smap-image pairs:
  python train.py
- Check the command-line arguments in train.py for more training options.
- Download the trained GazeFusion model from OneDrive (or use your own trained one) and place it under the models/ folder;
- Place your custom saliency map files under the assets/smaps/ folder (or use a provided one);
- Generate images with saliency guidance:
  python generate.py --smap your_smap --prompt your_prompt
- Check the command-line arguments in generate.py for more generation options.
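A custom saliency map can be as simple as a single bright blob marking where viewers should look. The sketch below is an illustration only and uses just the Python standard library. The helper names `gaussian_smap` and `write_pgm`, the Gaussian shape, and the PGM output format are all assumptions: this README does not state the resolution or file format that generate.py expects, so the result may need to be resized or converted (for example to PNG) to match the provided maps in assets/smaps/.

```python
import math


def gaussian_smap(width, height, cx, cy, sigma):
    """Return rows of 0-255 ints with a Gaussian peak centered at pixel (cx, cy)."""
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            # Squared distance from the desired attention center.
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            row.append(round(255 * math.exp(-d2 / (2 * sigma ** 2))))
        rows.append(row)
    return rows


def write_pgm(path, rows):
    """Write the map as a binary PGM (P5) 8-bit grayscale file."""
    height, width = len(rows), len(rows[0])
    with open(path, "wb") as f:
        f.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in rows:
            f.write(bytes(row))
```

For example, `write_pgm("my_smap.pgm", gaussian_smap(512, 512, cx=384, cy=192, sigma=60))` would write a map whose peak sits in the upper-right area of a 512x512 canvas; the size and file name here are placeholders.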
We would like to thank Saining Xie, Anyi Rao, and Zoya Bylinskii for fruitful early discussion, and the authors of Stable Diffusion, ControlNet, BLIP-2, EML-Net, and Text2Video-Zero for their great work, based on which GazeFusion was developed.
If you find this work helpful to your research, please consider citing it with the following BibTeX entry:
@article{zhang2024gazefusion,
title={GazeFusion: Saliency-guided Image Generation},
author={Zhang, Yunxiang and Wu, Nan and Lin, Connor Z and Wetzstein, Gordon and Sun, Qi},
journal={ACM Transactions on Applied Perception},
year={2024},
publisher={ACM New York, NY}
}