# Stable Diffusion

Code from [huggingface](https://huggingface.co/docs/diffusers/using-diffusers/conditional_image_generation)

[Tutorial on diffusion models](https://huggingface.co/docs/diffusers/tutorials/tutorial_overview)

Works on a T4 GPU.

In [None]:
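# StableDiffusionPipeline also needs transformers for the CLIP text encoder and tokenizer
!pip install --upgrade diffusers[torch] transformers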

## Load Pipeline and generate image

In [None]:
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

Details are in this [guide](https://huggingface.co/docs/diffusers/using-diffusers/write_own_pipeline).

For prompts see [here](https://www.mage.space/explore).

In [None]:
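# Each assignment overwrites the previous one, so only the last prompt is used.
# Comment out or reorder lines to try a different prompt.
prompt = "a photo of a ship with a lion and daffodils"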
prompt = "stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k"
prompt = "a snowcapped mountain in the alps topped by a full moon, backlight, centered composition, masterpiece, photorealistic, 8k"
prompt = "confused hippie man with tattoos on his face"
prompt = "portrait+ style happy, hand on ear, Elderly person stock photo."
prompt = "<lora:meganRain:0.5> (close-up, pro photograph of a 21 year old woman), (highly detailed face:1.4) (smile:0.7) (background inside dark, moody, private study:1.3) , POV, nikon d850, film stock photograph ,4 kodak portra 400 ,camera f1.6 lens ,rich colors ,hyper realistic ,lifelike texture, dramatic lighting , cinestill 800,"
prompt = "RAW photo, portrait of handsome blond man from Norway, punk hair, smiling eyes, highly detailed textures, skin pores, nose piercing, perfect lighting, photorealism, photo realistic, hard focus, smooth, depth of field, 8K UHD, photo taken by a Sony Alpha 1 , 85mm lens, f/1. 4 aperture, 1/500 shutter speed, ISO 100 film, neutral colors, muted colors"

image = pipe(prompt).images[0]
image

### Deconstruct a Basic Pipeline: Generate an Image from Noise



Unlike the Stable Diffusion pipeline above, a basic unconditional pipeline such as `DDPMPipeline` contains only a [UNet2DModel](https://huggingface.co/docs/diffusers/v0.26.3/en/api/models/unet2d#diffusers.UNet2DModel) model and a [DDPMScheduler](https://huggingface.co/docs/diffusers/v0.26.3/en/api/schedulers/ddpm#diffusers.DDPMScheduler). The pipeline denoises an image by taking random noise the size of the desired output and passing it through the model several times. At each timestep, the model predicts the noise residual and the scheduler uses it to predict a less noisy image. The pipeline repeats this process until it reaches the end of the specified number of inference steps.

To recreate the pipeline with the model and scheduler separately, let's write our own denoising process.
* Load the model and scheduler:

In [None]:
from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")

* Set the number of timesteps to run the denoising process for:

In [None]:
scheduler.set_timesteps(50)

* Setting the scheduler timesteps creates a tensor with evenly spaced elements in it, 50 in this example. Each element corresponds to a timestep at which the model denoises an image. When you create the denoising loop later, you’ll iterate over this tensor to denoise an image:

In [None]:
scheduler.timesteps
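
To make the idea of evenly spaced timesteps concrete, here is a minimal sketch in plain Python. It assumes 1000 training timesteps and a "leading" spacing rule (both are assumptions about the scheduler's configuration, not values read from this notebook), and the helper name `leading_timesteps` is hypothetical.

```python
def leading_timesteps(num_train_timesteps: int, num_inference_steps: int) -> list:
    """Return evenly spaced timesteps in descending order (assumed "leading" spacing)."""
    step_ratio = num_train_timesteps // num_inference_steps
    # Ascending multiples of step_ratio, reversed so denoising starts at the noisiest step.
    return [i * step_ratio for i in range(num_inference_steps)][::-1]


timesteps = leading_timesteps(1000, 50)
print(timesteps[:3], "...", timesteps[-3:])  # [980, 960, 940] ... [40, 20, 0]
```

If the scheduler is configured this way, the values printed by `scheduler.timesteps` should follow the same pattern.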

* Create some random noise with the same shape as the desired output:

In [None]:
import torch

sample_size = model.config.sample_size
noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")

* Now write a loop to iterate over the timesteps. At each timestep, the model does a [UNet2DModel.forward()](https://huggingface.co/docs/diffusers/v0.26.3/en/api/models/unet2d#diffusers.UNet2DModel.forward) pass and returns the noise residual (named `noisy_residual` in the code). The scheduler's [step()](https://huggingface.co/docs/diffusers/v0.26.3/en/api/schedulers/ddpm#diffusers.DDPMScheduler.step) method takes the noise residual, the timestep, and the input, and predicts the image at the previous timestep. This output becomes the next input to the model in the denoising loop, which repeats until it reaches the end of the `timesteps` array.

In [None]:
input = noise

for t in scheduler.timesteps:
    with torch.no_grad():
        noisy_residual = model(input, t).sample
    previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
    input = previous_noisy_sample
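
As a rough guide to what `scheduler.step()` computes, the standard DDPM reverse update can be written as below. The notation is an assumption introduced here, not taken from this notebook, and the library organizes the computation differently: $x_t$ is the current noisy sample, $\epsilon_\theta(x_t, t)$ is the predicted noise residual, $\bar\alpha_t$ is the cumulative product of the noise schedule, and $\alpha_t$ is the ratio of $\bar\alpha$ between the current and the previous timestep in the schedule.

$$
x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar\alpha_t}} \, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)
$$

The noise term $\sigma_t z$ is dropped at the final step, so the last output is the denoised estimate itself.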

* The last step is to convert the denoised output into an image:

In [None]:
from PIL import Image
import numpy as np

image = (input / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
image = Image.fromarray(image)
image
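
The conversion above maps model outputs from the range $[-1, 1]$ to 8-bit pixel values. The following sketch applies the same arithmetic to single numbers in plain Python; the helper name `to_uint8` is hypothetical and only mirrors the tensor operations in the cell above.

```python
def to_uint8(value: float) -> int:
    """Map a model output in [-1, 1] to an 8-bit pixel value in [0, 255]."""
    scaled = value / 2 + 0.5              # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)  # clip values that fall outside [0, 1]
    return round(clamped * 255)           # [0, 1] -> [0, 255]


for v in (-1.0, -0.5, 0.5, 1.0, 1.7):
    print(v, "->", to_uint8(v))
```

Values outside $[-1, 1]$, such as `1.7`, are clipped to the nearest valid pixel value instead of wrapping around.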

### Deconstruct Stable Diffusion pipeline: Generate an Image from Text

In [None]:
# @title
!wget https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/stable_diffusion.png
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
plt.figure(figsize = (10,10))
img = mpimg.imread('stable_diffusion.png')
plt.imshow(img, interpolation='nearest')
plt.axis('off')
plt.show()

This is only an overview; details are in the vision course.

The stable diffusion model takes both a latent seed and a text prompt as an input.
1. The latent seed is then used to generate random latent image representations of size $64\times64$.
1. The text prompt is transformed to text embeddings of size $77\times 768$ via **CLIP's text encoder**.
1. Next the **U-Net** iteratively denoises the random latent image representations while being conditioned on the text embeddings. The output of the U-Net, the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. <br>
The **denoising process** is repeated about 50 times, retrieving a better latent image representation at each step.

1. Once complete, the latent image representation is decoded by the decoder part of the **variational auto encoder**.

Stable Diffusion is a text-to-image **latent diffusion** model. It is called a latent diffusion model because it works with a **lower-dimensional representation of the image** instead of the actual pixel space, which makes it more memory efficient. An encoder compresses the image into a smaller representation, and a decoder converts the compressed representation back into an image. For text-to-image models, you'll need a tokenizer and an encoder to generate text embeddings. From the previous example, you already know you need a UNet model and a scheduler.
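
To get a feel for the memory saving, the sketch below compares the number of values in a pixel-space image with the number in its latent representation. It assumes a $512\times512$ RGB output, 4 latent channels, and a VAE downsampling factor of 8; these figures are typical for Stable Diffusion v1.x but are assumptions here, and the helper `latent_shape` is hypothetical.

```python
import math


def latent_shape(height: int, width: int, latent_channels: int = 4, vae_scale_factor: int = 8) -> tuple:
    """Shape of the latent tensor for one image (assumed channel count and scale factor)."""
    return (1, latent_channels, height // vae_scale_factor, width // vae_scale_factor)


pixel_shape = (1, 3, 512, 512)    # one RGB image in pixel space
latents = latent_shape(512, 512)  # (1, 4, 64, 64)

ratio = math.prod(pixel_shape) / math.prod(latents)
print(latents, ratio)  # (1, 4, 64, 64) 48.0
```

Under these assumptions the U-Net works on about 48 times fewer values than it would in pixel space.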

More info on the Stable Diffusion model is [here](https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work). There are three main components in latent diffusion.
1. An **autoencoder** (VAE). <br>
   The VAE model has two parts, an encoder and a decoder.
  * The **encoder** is used to convert the image into a low dimensional latent representation, which will serve as the input to the U-Net model.
  * The **decoder**, conversely, transforms the latent representation back into an image.
  * During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which ***applies more and more noise*** at each step.
  * During inference, the denoised latents generated by the reverse diffusion process are ***converted back into images*** using the VAE decoder. As we will see, only the VAE decoder is needed during inference.
1. A [**U-Net**](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb#scrollTo=wW8o1Wp0zRkq). <br>
  The U-Net has an encoder part and a decoder part both comprised of ResNet blocks.
  * The **encoder** compresses an image representation into a lower resolution image representation.
  * The **decoder** decodes the lower resolution image representation back to the original higher resolution image representation that is supposedly less noisy. More specifically, the U-Net output predicts the noise residual which can be used to compute the predicted denoised image representation.

1. A **text-encoder**, e.g. [CLIP's Text Encoder](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel). <br>
  * The text-encoder is responsible for transforming the input prompt, e.g. "An astronaut riding a horse" into an embedding vector that can be understood by the U-Net.
  * It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text-embeddings.
  * Stable Diffusion does **not train** the text-encoder during training and simply uses CLIP's already trained text encoder, [CLIPTextModel](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel).
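
The fixed embedding size of $77\times768$ comes from padding or truncating every prompt to 77 tokens before it reaches the text encoder. The sketch below illustrates that step in plain Python without loading a tokenizer. The special token ids are assumed to match the CLIP vocabulary used by Stable Diffusion v1.x, the word ids are hypothetical placeholders, and `pad_token_ids` is not a library function.

```python
MAX_LENGTH = 77  # context length of CLIP's text encoder
BOS_ID = 49406   # start-of-text token id (assumed)
EOS_ID = 49407   # end-of-text token id, assumed to double as the padding token


def pad_token_ids(word_ids) -> list:
    """Wrap token ids in BOS/EOS, then truncate or pad to MAX_LENGTH."""
    body = list(word_ids)[: MAX_LENGTH - 2]  # leave room for BOS and EOS
    ids = [BOS_ID] + body + [EOS_ID]
    return ids + [EOS_ID] * (MAX_LENGTH - len(ids))


hypothetical_ids = [101, 102, 103, 104, 105]  # placeholder ids, not real tokens
ids = pad_token_ids(hypothetical_ids)
print(len(ids), ids[:8])
```

Each of the 77 positions is then mapped to a 768-dimensional vector by the text encoder, which gives the $77\times768$ embedding that conditions the U-Net.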

[Here](https://huggingface.co/docs/diffusers/using-diffusers/write_own_pipeline#deconstruct-the-stable-diffusion-pipeline) is the code to explore the process.

In a notebook, `pipe?` prints the pipeline's arguments:

```
Args:
    vae ([`AutoencoderKL`]):
        Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
    text_encoder ([`~transformers.CLIPTextModel`]):
        Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)).
    tokenizer ([`~transformers.CLIPTokenizer`]):
        A `CLIPTokenizer` to tokenize text.
    unet ([`UNet2DConditionModel`]):
        A `UNet2DConditionModel` to denoise the encoded image latents.
    scheduler ([`SchedulerMixin`]):
        A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
        [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
    safety_checker ([`StableDiffusionSafetyChecker`]):
        Classification module that estimates whether generated images could be considered offensive or harmful.
        Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
        about a model's potential harms.
    feature_extractor ([`~transformers.CLIPImageProcessor`]):
        A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
```
There is an extensive [tutorial](https://huggingface.co/docs/diffusers/index) on Stable Diffusion.

## Image-to-Image

Image-to-image is similar to text-to-image, but in addition to a prompt, you can also pass an initial image as a starting point for the diffusion process. This is described in a [notebook](https://huggingface.co/docs/diffusers/using-diffusers/img2img).
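
One way to picture the "starting point" is that noise is added to the initial image and denoising then begins partway through the schedule instead of from pure noise. The sketch below shows how a `strength` value between 0 and 1 could decide how many denoising steps actually run. The `strength` parameter is not mentioned in this notebook, and the helper `img2img_steps` is a simplified, hypothetical model of that logic, not the library's code.

```python
def img2img_steps(num_inference_steps: int, strength: float) -> tuple:
    """Return (index of the first step that runs, number of steps that run)."""
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_timestep, 0)
    return t_start, num_inference_steps - t_start


for strength in (0.5, 0.8, 1.0):
    print(strength, img2img_steps(50, strength))
```

In this model a low strength skips most of the schedule and stays close to the initial image, while a strength of 1.0 runs every step and largely ignores it.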

## Inpainting

Inpainting replaces or edits specific areas of an image. This makes it a useful tool for image restoration like removing defects and artifacts, or even replacing an image area with something entirely new. Inpainting relies on a mask to determine which regions of an image to fill in; the area to inpaint is represented by white pixels and the area to keep is represented by black pixels. The white pixels are filled in by the prompt.

This is performed in this [notebook](https://huggingface.co/docs/diffusers/using-diffusers/inpaint).
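
The mask convention can be illustrated with a tiny grayscale example in plain Python. This is only a sketch of the convention (white means replace, black means keep); it is not how the pipeline works internally, since the real process operates on latents, and the helper `composite` is hypothetical.

```python
def composite(original, generated, mask):
    """Use `generated` where the mask is white (> 127) and `original` where it is black."""
    return [
        [gen if m > 127 else orig for orig, gen, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [[10, 10, 10], [10, 10, 10]]   # pixels to keep
generated = [[99, 99, 99], [99, 99, 99]]  # stand-in for pixels produced from the prompt
mask = [[0, 255, 0], [255, 255, 0]]       # white = inpaint, black = keep

result = composite(original, generated, mask)
print(result)  # [[10, 99, 10], [99, 99, 10]]
```

Only the positions under white mask pixels change; everything under black pixels is copied from the original.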