Latent Diffusion and Stable Diffusion Implementation
This code was tested with Python 3.8 and PyTorch 1.11, using pre-trained models through Hugging Face diffusers. Specifically, we implemented our method over Latent Diffusion and Stable Diffusion. Additional required packages are listed in the requirements file. The code was tested on a Tesla V100 16GB but should work on other cards with at least 12GB VRAM.
In order to get started, we recommend taking a look at our notebooks: prompt-to-prompt_ldm and prompt-to-prompt_stable. The notebooks contain end-to-end examples of usage of prompt-to-prompt on top of Latent Diffusion and Stable Diffusion respectively. Take a look at these notebooks to learn how to use the different types of prompt edits and understand the API.
In our notebooks, we perform our main logic by implementing the abstract class AttentionControl, of the following form:
import abc

class AttentionControl(abc.ABC):
    @abc.abstractmethod
    def forward(self, attn, is_cross: bool, place_in_unet: str):
        raise NotImplementedError
The forward method is called in each attention layer of the diffusion model during image generation, and we use it to modify the attention weights. Our method (see Section 3 of our paper) edits images with the procedure above, and each prompt edit type modifies the attention weights in a different manner.
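As a minimal sketch of how a concrete controller might plug into this interface, the example below subclasses AttentionControl with a hypothetical CountingControl that leaves the attention untouched and only counts calls. The assumption that forward returns the (possibly modified) attention, the toy nested-list attention map, and the place_in_unet values "down" and "up" are illustrative and not taken from the notebooks.

```python
import abc


class AttentionControl(abc.ABC):
    """Interface from the text above: one hook called per attention layer."""

    @abc.abstractmethod
    def forward(self, attn, is_cross: bool, place_in_unet: str):
        raise NotImplementedError


class CountingControl(AttentionControl):
    """Hypothetical controller: passes attention through and counts calls."""

    def __init__(self):
        self.calls = {"cross": 0, "self": 0}

    def forward(self, attn, is_cross: bool, place_in_unet: str):
        self.calls["cross" if is_cross else "self"] += 1
        # Assumption: an editing controller would return modified weights here.
        return attn


controller = CountingControl()
attn = [[0.7, 0.3], [0.2, 0.8]]  # toy 2x2 attention map, not a real tensor
out = controller.forward(attn, is_cross=True, place_in_unet="down")
out = controller.forward(out, is_cross=False, place_in_unet="up")
print(controller.calls)
```

Because AttentionControl is abstract, instantiating it directly raises a TypeError; only a subclass that implements forward can be constructed.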
The general flow of our code is as follows, with variations based on the attention control type:
prompts = ["A painting of a squirrel eating a burger", ...]
controller = AttentionControl(prompts, ...)
run_and_display(prompts, controller, ...)
In a replacement edit, the user swaps tokens of the original prompt with others, e.g., editing the prompt "A painting of a squirrel eating a burger" to "A painting of a squirrel eating a lasagna" or "A painting of a lion eating a burger". For this we define the class AttentionReplace.
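To make the notion of a token swap concrete, the sketch below finds the positions at which two prompts differ. It is only an illustration: swapped_positions is a hypothetical helper, and it splits on whitespace, whereas the actual code presumably works on tokenizer output.

```python
def swapped_positions(source: str, target: str):
    """Return the word indices at which two equal-length prompts differ."""
    src, tgt = source.split(), target.split()
    if len(src) != len(tgt):
        raise ValueError("a word swap expects prompts with the same number of words")
    return [i for i, (a, b) in enumerate(zip(src, tgt)) if a != b]


base = "A painting of a squirrel eating a burger"
print(swapped_positions(base, "A painting of a squirrel eating a lasagna"))  # [7]
print(swapped_positions(base, "A painting of a lion eating a burger"))  # [4]
```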
In a refinement edit, the user adds new tokens to the prompt, e.g., editing the prompt "A painting of a squirrel eating a burger" to "A watercolor painting of a squirrel eating a burger". For this we define the class AttentionEditRefine.
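When tokens are added, the words shared by both prompts need to be matched up so that their attention can be carried over. The sketch below shows one way to compute such a mapping, using Python's difflib as a stand-in for whatever alignment the notebooks use; align_words is a hypothetical helper and operates on whitespace-separated words rather than tokenizer output.

```python
from difflib import SequenceMatcher


def align_words(source: str, target: str):
    """Map each target word index to its source word index, or None if new."""
    src, tgt = source.split(), target.split()
    mapping = [None] * len(tgt)  # None marks a word with no source counterpart
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.b + k] = block.a + k
    return mapping


mapping = align_words(
    "A painting of a squirrel eating a burger",
    "A watercolor painting of a squirrel eating a burger",
)
print(mapping)  # [0, None, 1, 2, 3, 4, 5, 6, 7]
```

Here the None entry marks "watercolor", the only word without a counterpart in the original prompt.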
In a re-weighting edit, the user changes the weight of certain tokens in the prompt, e.g., for the prompt "A photo of a poppy field at night", strengthening or weakening the extent to which the word "night" affects the resulting image. For this we define the class AttentionReweight.
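The following sketch illustrates the idea of re-weighting on a toy example. build_equalizer and reweight are hypothetical helpers: they assume one coefficient per whitespace-separated word and a plain list as the attention row, and they do not renormalize the result, which may differ from the actual implementation.

```python
def build_equalizer(prompt: str, word: str, scale: float):
    """One coefficient per word: `scale` for the selected word, 1.0 elsewhere."""
    return [scale if w == word else 1.0 for w in prompt.split()]


def reweight(attn_row, equalizer):
    """Multiply each cross-attention weight by its coefficient."""
    return [a * c for a, c in zip(attn_row, equalizer)]


prompt = "A photo of a poppy field at night"
equalizer = build_equalizer(prompt, "night", 2.0)  # strengthen "night"
row = [0.125] * len(prompt.split())  # toy uniform attention over 8 words
boosted = reweight(row, equalizer)
print(boosted)
```

A scale above 1.0 strengthens the word's influence and a scale below 1.0 weakens it.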
The attention controllers accept the following options:
cross_replace_steps: specifies the fraction of steps to edit the cross attention maps. Can also be set to a dictionary[str:float] which specifies fractions for different words in the prompt.
self_replace_steps: specifies the fraction of steps to replace the self attention maps.
local_blend (optional): LocalBlend object which is used to make local edits. LocalBlend is initialized with the words from each prompt that correspond to the region in the image we want to edit.
equalizer: used for attention re-weighting only. A vector of coefficients to multiply each cross-attention weight.
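As a rough illustration of what a "fraction of steps" means, the sketch below converts a fraction into a step cutoff and looks up a per-word fraction from a dictionary. The helper names, the use of int() truncation, the assumption that edits apply to the earliest steps, and the default of 1.0 for words missing from the dictionary are all assumptions made for this example.

```python
def fraction_for(word: str, cross_replace_steps, default: float = 1.0) -> float:
    """Resolve the fraction for a word from a float or a {word: fraction} dict."""
    if isinstance(cross_replace_steps, dict):
        return cross_replace_steps.get(word, default)  # assumed fallback
    return cross_replace_steps


def is_edited(step: int, num_steps: int, fraction: float) -> bool:
    """True while `step` falls within the first `fraction` of the steps."""
    return step < int(num_steps * fraction)


num_steps = 50
print(is_edited(39, num_steps, 0.8), is_edited(40, num_steps, 0.8))  # True False
print(fraction_for("lasagna", {"lasagna": 0.4}))  # 0.4
```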
@article{hertz2022prompt,
title = {Prompt-to-Prompt Image Editing with Cross Attention Control},
author = {Hertz, Amir and Mokady, Ron and Tenenbaum, Jay and Aberman, Kfir and Pritch, Yael and Cohen-Or, Daniel},
journal = {arXiv preprint arXiv:2208.01626},
year = {2022},
}
Null-text inversion enables intuitive text-based editing of real images with the Stable Diffusion model. We use an initial DDIM inversion as an anchor for our optimization which only tunes the null-text embedding used in classifier-free guidance.
Prompt-to-Prompt editing of real images by first using Null-text inversion is provided in this notebook.
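For context, the sketch below shows the standard classifier-free guidance combination, which is where the null-text embedding enters: it produces the unconditional noise prediction. guided_noise is a hypothetical helper that uses plain lists in place of tensors; it does not reproduce the optimization itself.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale: float):
    """Classifier-free guidance: move from the unconditional prediction
    toward the text-conditioned one, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]  # toy prediction using the null-text embedding
eps_cond = [1.0, 3.0]    # toy prediction using the prompt embedding
print(guided_noise(eps_uncond, eps_cond, 7.5))  # [7.5, 16.0]
```

With a guidance scale of 1.0 the result equals the conditional prediction, so the unconditional branch, and therefore the null-text embedding, only matters at larger scales.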
@article{mokady2022null,
title={Null-text Inversion for Editing Real Images using Guided Diffusion Models},
author={Mokady, Ron and Hertz, Amir and Aberman, Kfir and Pritch, Yael and Cohen-Or, Daniel},
journal={arXiv preprint arXiv:2211.09794},
year={2022}
}
This is not an officially supported Google product.