# **Stable Diffusion** 🎨
*...using `🧨diffusers`*

Stable Diffusion is a text-to-image AI model that generates images from written descriptions. Unlike traditional diffusion models that work directly with pixels (which is computationally expensive), Stable Diffusion uses a clever approach called "latent diffusion" that works in a compressed representation space, making it much faster and more efficient.

The model has three key components: a VAE (autoencoder) that compresses images into a smaller latent space and reconstructs them back to full images, a U-Net that learns to remove noise step-by-step in this compressed space, and a text encoder (CLIP) that converts your text prompt into numerical embeddings the model can understand.


During inference, Stable Diffusion starts with random noise in the latent space, then uses the U-Net to gradually "denoise" this randomness over ~50 steps, guided by your text prompt. Each step refines the image until you get a final result that matches your description. The VAE decoder then converts this latent representation back into a viewable image. This process allows you to generate high-quality 512×512 images quickly, even on consumer GPUs.

Let's get started!

**Stable Diffusion during inference**



This diagram shows Stable Diffusion's inference pipeline. It starts with two inputs: random Gaussian noise (latent seed) and a user prompt. The text prompt gets converted into numerical embeddings by the frozen CLIP text encoder.

The core process happens in the Text-conditioned latent UNet, which takes the noisy 64×64 latents and the text embeddings as inputs. The UNet predicts what noise to remove, guided by the text description. A scheduler algorithm uses this prediction to "reconstruct" (denoise) the latents step by step.

This denoising process repeats N times (typically ~50 steps), with each iteration producing cleaner latents that better match the text prompt. Finally, the Variational Autoencoder Decoder converts the denoised 64×64 latents back into a full 512×512 output image.

The key insight is that by working in the compressed latent space (64×64) instead of pixel space (512×512), each denoising step operates on 64 times fewer spatial positions, which greatly reduces memory and compute while still producing high-quality results.
<p align="left">
<img src="https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/stable_diffusion.png" alt="sd-pipeline" width="500"/>
</p>
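
To make the size comparison concrete, here is a small arithmetic sketch. The 64× figure counts spatial positions only. The channel counts used below (3 for RGB pixels, 4 for the latents) are assumptions based on the usual Stable Diffusion v1 configuration and are not stated in the text above.

```python
# Shapes are (channels, height, width). The channel counts are assumptions:
# 3 for an RGB image and 4 for Stable Diffusion v1 latents.
pixel_shape = (3, 512, 512)
latent_shape = (4, 64, 64)


def spatial_positions(shape):
    """Number of (row, column) positions, ignoring channels."""
    _, height, width = shape
    return height * width


def num_elements(shape):
    """Total number of values, including channels."""
    channels, height, width = shape
    return channels * height * width


spatial_ratio = spatial_positions(pixel_shape) // spatial_positions(latent_shape)
element_ratio = num_elements(pixel_shape) // num_elements(latent_shape)

print(f"Spatial positions: {spatial_ratio}x fewer in latent space")
print(f"Total values:      {element_ratio}x fewer in latent space")
```

Under these assumptions the U-Net sees 64 times fewer positions per step, and about 48 times fewer values once channels are counted.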


## Denoising Process

Starting from random Gaussian noise, the U-Net predicts the noise present in the current latents, and the scheduler removes it step by step until a clean, good-quality image emerges.

![Alt text](https://i.redd.it/a84scuqybtfc1.gif)
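
The loop structure can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: `toy_noise_predictor` plays the role of the U-Net and `toy_scheduler_step` the role of the scheduler, and the "latents" are a short list of numbers rather than a 64×64 tensor. It is not how `diffusers` implements either component; it only shows why more steps yield a cleaner result.

```python
import random


def toy_noise_predictor(latents, target):
    """Hypothetical stand-in for the U-Net: the 'noise' is the gap to a known target."""
    return [x - t for x, t in zip(latents, target)]


def toy_scheduler_step(latents, noise_pred, step_size=0.1):
    """Hypothetical stand-in for the scheduler: remove a fraction of the predicted noise."""
    return [x - step_size * n for x, n in zip(latents, noise_pred)]


def toy_denoise(target, num_steps, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise, as the real pipeline does in latent space
    latents = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(num_steps):
        noise_pred = toy_noise_predictor(latents, target)
        latents = toy_scheduler_step(latents, noise_pred)
    return latents


def distance(a, b):
    """Euclidean distance between two equally long lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


target = [0.5, -1.0, 2.0, 0.0]  # pretend this is the 'clean' latent
few = toy_denoise(target, num_steps=5)
many = toy_denoise(target, num_steps=50)

print(f"error after  5 steps: {distance(few, target):.4f}")
print(f"error after 50 steps: {distance(many, target):.4f}")
```

In this toy setting each step shrinks the remaining error by a constant factor, so the 50-step run ends much closer to the target than the 5-step run.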

## How to use `StableDiffusionPipeline`

In this section, we show how you can run text-to-image inference in just a few lines of code!

### Setup

First, please make sure you are using a GPU runtime to run this notebook, so inference is much faster. If the following command fails, use the `Runtime` menu above and select `Change runtime type`.

In [None]:
import torch

In [None]:
if torch.cuda.is_available():
    # Shows the NVIDIA GPUs, if this system has any
    !nvidia-smi

Next, install `diffusers` as well as `scipy`, `ftfy`, and `transformers`. `accelerate` is used to make model loading much faster.

In [None]:
!pip install diffusers==0.11.1
!pip install transformers scipy ftfy accelerate

In [None]:
# Added to work around issues with Torch not loading models correctly (tested on macOS and Kubuntu Linux)
!pip install --upgrade huggingface-hub==0.26.2 transformers==4.46.1 tokenizers==0.20.1 diffusers==0.31.0
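
Since two install cells pin different versions, it can help to confirm what actually ended up installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(
    ["diffusers", "transformers", "tokenizers", "huggingface-hub"]
)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

If the printed versions do not match the pinned ones, restarting the runtime after installation is usually worth trying.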

### Stable Diffusion Pipeline


First, we load the pre-trained weights of all components of the model. In this notebook we use Stable Diffusion version 1.4 ([CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4)).

In [None]:
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
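
The loaded pipeline bundles the components described earlier. The sketch below assumes they are exposed as the attributes `vae`, `unet`, `text_encoder`, `tokenizer`, and `scheduler`; the helper `describe_components` is hypothetical. To keep the example runnable without downloading weights, it is demonstrated on a stand-in object rather than on `pipe` itself.

```python
from types import SimpleNamespace

# Attribute names assumed to exist on the loaded pipeline object
COMPONENT_NAMES = ("vae", "unet", "text_encoder", "tokenizer", "scheduler")


def describe_components(pipeline):
    """Map each expected component name to the class name of the object found."""
    summary = {}
    for name in COMPONENT_NAMES:
        component = getattr(pipeline, name, None)
        summary[name] = type(component).__name__ if component is not None else "missing"
    return summary


# Stand-in object so the sketch runs without downloading model weights
fake_pipe = SimpleNamespace(vae=object(), unet=object(), text_encoder=object())
print(describe_components(fake_pipe))
```

In the notebook, calling `describe_components(pipe)` should list the class behind each component of the real pipeline.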

Next, let's move the pipeline to GPU to have faster inference.

In [None]:
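# Pick the fastest available device: NVIDIA GPU, then Apple Silicon (MPS), then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    # Fallback so `device` is always defined; float16 inference on CPU is slow and may not be supported
    device = torch.device("cpu")

pipe = pipe.to(device)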

And we are ready to generate images:

In [None]:
prompt = "Craftsman, Moroccan Ghibli studio style"
image = pipe(prompt).images[0]  # a PIL image (https://pillow.readthedocs.io/en/stable/)

# To keep the image, save it to disk:
image.save("image1.png")

# In a notebook such as Google Colab, a bare expression on the last line displays it:
image

Running the above cell multiple times gives a different image each time. If you want deterministic output, pass a seeded random generator to the pipeline. Every time you use the same seed, you get the same image.

In [None]:
generator = torch.Generator(device).manual_seed(1024)

image = pipe(prompt, generator=generator).images[0]

image

You can change the number of inference steps using the `num_inference_steps` argument. In general, more steps give better results, at the cost of longer generation time. Stable Diffusion works well with a relatively small number of steps, so we recommend using the default of `50`. If you want faster results, you can use a smaller number.

The following cell uses the same seed as before, but with fewer steps. Note how fine details are less realistic and less defined than in the previous image:

In [None]:
generator = torch.Generator(device).manual_seed(1024)

image = pipe(prompt, num_inference_steps=15, generator=generator).images[0]

image
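
To measure the speed difference rather than estimate it, a small timing helper can wrap any call. This is a sketch: `timed` and `fake_generation` are hypothetical names, and `fake_generation` is only a stand-in workload whose cost grows with the step count, so the example runs without the model.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_generation(num_inference_steps):
    """Stand-in workload: one fixed chunk of work per 'denoising step'."""
    total = 0
    for _ in range(num_inference_steps):
        total += sum(range(20_000))
    return total


_, fast = timed(fake_generation, num_inference_steps=15)
_, slow = timed(fake_generation, num_inference_steps=50)
print(f"15 steps: {fast:.4f}s, 50 steps: {slow:.4f}s")
```

In the notebook, the same helper could wrap the pipeline call, for example `output, seconds = timed(pipe, prompt, num_inference_steps=15, generator=generator)`.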

To generate multiple images for the same prompt, we simply use a list with the same prompt repeated several times. We'll send the list to the pipeline instead of the string we used before.



Let's first write a helper function to display a grid of images. Just run the following cell to create the `image_grid` function, or expand the code if you are interested in how it's done.

In [None]:
from PIL import Image

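def image_grid(imgs, rows, cols):
    """Paste equally sized PIL images into a rows x cols grid, filling row by row."""
    assert len(imgs) == rows * cols, "need exactly rows * cols images"

    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))

    for i, img in enumerate(imgs):
        # Column index is i % cols, row index is i // cols
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid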
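
The paste positions in `image_grid` come from applying modulo and integer division to the image index. The sketch below isolates that arithmetic; `grid_positions` is a hypothetical helper and does not need PIL.

```python
def grid_positions(n_images, cols, w, h):
    """Top-left (x, y) paste position for each image index, filling row by row."""
    return [((i % cols) * w, (i // cols) * h) for i in range(n_images)]


positions = grid_positions(6, cols=3, w=512, h=512)
print(positions)
# [(0, 0), (512, 0), (1024, 0), (0, 512), (512, 512), (1024, 512)]
```

The first three images fill the top row from left to right, and the fourth wraps to the start of the second row.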

Now we can run the pipeline with a list of 3 prompts and display the results as a grid.

In [None]:
num_images = 3
prompt = ["Craftsman, Moroccan Ghibli studio style"] * num_images

images = pipe(prompt).images

grid = image_grid(images, rows=1, cols=num_images)
grid

And here's how to generate a grid of `n × m` images.

In [None]:
num_cols = 3
num_rows = 4

prompt = ["Craftsman, Moroccan Ghibli studio style"] * num_cols

all_images = []
for _ in range(num_rows):
    images = pipe(prompt).images
    all_images.extend(images)

grid = image_grid(all_images, rows=num_rows, cols=num_cols)
grid

As you may observe, the results are not perfect, which indicates that some fine-tuning (retraining) could help improve them.

In the next challenge, you will see how to fine-tune Stable Diffusion using LoRA for fast and efficient fine-tuning.

<h1>Exercise</h1>

1. In the notebook, you used the parameter `num_inference_steps=15` to generate faster results. Based on what you learned about the diffusion process, explain why reducing the number of steps makes generation faster but potentially affects image quality.

2. The notebook mentions that Stable Diffusion works in 'latent space' rather than directly with pixels. In simple terms, explain why this approach makes the model faster and what component is responsible for converting between latent space and the final image you see.