# How to use a Stable Diffusion Model

Install the libraries the notebook depends on. The cell below pins diffusers to version 0.2.4 and also installs transformers, scipy, ftfy, and ipywidgets.

In [None]:
#install libraries
!pip install diffusers==0.2.4
!pip install transformers scipy ftfy
!pip install "ipywidgets>=7,<8"

Next, import everything the notebook needs and log in to Hugging Face, a platform where developers share AI and ML models and datasets. The login token is used later when the model weights are downloaded:

In [2]:
import os
from PIL import Image, ImageDraw
import cv2
import numpy as np
from IPython.display import HTML
from base64 import b64encode

import torch
from torch import autocast
from torch.nn import functional as F
from diffusers import StableDiffusionPipeline, AutoencoderKL
from diffusers import UNet2DConditionModel, PNDMScheduler, LMSDiscreteScheduler
from diffusers.schedulers.scheduling_ddim import DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from tqdm.auto import tqdm
from huggingface_hub import notebook_login
from google.colab import output

device = 'cuda'

output.enable_custom_widget_manager()
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


`StableDiffusionPipeline` is an end-to-end inference pipeline that generates images from text. An inference pipeline wraps a trained model so that you can feed it new inputs, in this case text prompts, and get the desired outputs, in this case images.

In [None]:
#this code will be used to load up the ML/stable diffusion model
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16)  

#sending model to GPU
pipe = pipe.to("cuda")


When we call `pipe` and pass it a `prompt`, the text is fed into the `StableDiffusionPipeline`. By default, the pipeline then runs 50 denoising steps. Because the model is already trained, it starts from pure random noise in latent space and removes a little more noise at each step until only the final latents remain. These latents are passed through a decoder, which turns them into the final image.

For a start-to-finish explanation of how Stable Diffusion works, watch this video from 2:31 to 4:29: https://www.youtube.com/watch?v=ltLNYA3lWAQ
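The idea of starting from noise and removing a little of it at each step can be illustrated with a toy sketch. This is not how the pipeline is implemented: the function `toy_denoise` and its oracle "noise prediction" are hypothetical stand-ins for the trained model and the scheduler, and the "image" is just a short list of numbers.

```python
import random

def toy_denoise(target, num_steps=50, seed=0):
    """Toy illustration only: walk from random noise to `target` in num_steps steps."""
    rng = random.Random(seed)
    # start from pure Gaussian noise, one value per "pixel"
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(num_steps):
        remaining = num_steps - step
        # oracle noise estimate (hypothetical stand-in for the trained model)
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # remove an equal share of the noise at every step
        x = [xi - ni / remaining for xi, ni in zip(x, predicted_noise)]
    return x

target = [0.2, 0.8, 0.5]
result = toy_denoise(target, num_steps=50)
print([round(v, 6) for v in result])
```

In the real pipeline the noise estimate comes from the trained model rather than from knowing the target in advance.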

In [None]:
#prompt for our stable diffusion model (what the image will be based off of)


prompt = "A futuristic city with flying cars"

#run the prompt through the pipeline we just loaded; .images is a list of generated images, so [0] picks the first one
image = pipe(prompt).images[0]

#display/print the image below
image


Running the above cell multiple times gives you a different image each time. If you want reproducible output, pass a seeded generator to the pipeline. Using the same seed with the same prompt and settings produces the same image every time.

In [None]:
#seed the generator manually so that every run of the model displays the same image
generator = torch.Generator("cuda").manual_seed(1024)

#same pipeline call as before, except that the 'generator' parameter now passes our seeded generator to the model, so the same image is generated every time
image = pipe(prompt, generator=generator).images[0]

#display/print the image below
image
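The effect of a seed can be seen without loading the model. The sketch below uses Python's standard `random` module as a stand-in for `torch.Generator` (an assumption made so the example runs anywhere); the helper name `seeded_noise` is hypothetical.

```python
import random

def seeded_noise(seed, n=4):
    """Return n Gaussian samples drawn from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

same_a = seeded_noise(1024)
same_b = seeded_noise(1024)
other = seeded_noise(2048)

print(same_a == same_b)  # identical seed, identical starting noise
print(same_a == other)   # different seed, different starting noise
```

Because the diffusion process starts from this noise, fixing the seed fixes the starting point and therefore the resulting image.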

The next parameter that can be changed is the number of diffusion steps, set with the `num_inference_steps` argument. In general, the more steps you use, the better and more detailed the images are, but fewer steps make the diffusion process much faster. The cell below uses the same seed as the one above but fewer diffusion steps, so you can see the change for yourself.

In [None]:
#same seed as above so we can compare this image, made with fewer diffusion steps, against the previous one
generator = torch.Generator("cuda").manual_seed(1024)

#same model as before but with a new parameter resulting in 15 steps of diffusion rather than the default 50
image = pipe(prompt, num_inference_steps=15, generator=generator).images[0]

#display/print the image below
image
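To get a feel for why fewer steps are faster but coarser, the sketch below spreads a given number of inference steps evenly over the 1000 training timesteps that the scheduler in the custom pipeline section is configured with. Even spacing is an assumption made for illustration, and `evenly_spaced_timesteps` is a hypothetical helper, not part of diffusers.

```python
def evenly_spaced_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Evenly spaced timesteps from the noisiest one down to 0."""
    last = num_train_timesteps - 1
    if num_inference_steps == 1:
        return [float(last)]
    return [last * (num_inference_steps - 1 - i) / (num_inference_steps - 1)
            for i in range(num_inference_steps)]

for steps in (50, 15):
    ts = evenly_spaced_timesteps(steps)
    # gap between consecutive timesteps: larger gaps mean bigger denoising jumps
    print(steps, "steps -> gap of about", round(ts[0] - ts[1], 2))
```

With fewer steps, each step has to bridge a larger gap, which is why very low step counts tend to lose detail.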

It is also possible to change the parameters so that the images produced are not square. By default, Stable Diffusion produces 512 x 512 images, but this is easy to change: set `height` and `width` to create portrait or landscape images, for example. Both values must be multiples of 8, otherwise the pipeline will not work.

In [None]:
#same model but with changed width parameters so that the image displayed by the model is in landscape
image = pipe(prompt, height=512, width=768).images[0]

#display/print the image below
image
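Since height and width must be multiples of 8, a small helper can snap arbitrary sizes to valid ones. The sketch below is a hypothetical convenience (`snap_to_multiple` and `latent_size` are not part of diffusers); the division by 8 mirrors the `height // 8, width // 8` latent shape used in the custom pipeline later in this notebook.

```python
def snap_to_multiple(value, multiple=8):
    """Round value down to the nearest multiple, but never below one multiple."""
    return max(multiple, (value // multiple) * multiple)

def latent_size(height, width, factor=8):
    """Spatial size of the latents for a given image size."""
    return height // factor, width // factor

height, width = snap_to_multiple(515), snap_to_multiple(770)
print(height, width)               # 512 768
print(latent_size(height, width))  # (64, 96)
```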

It is also possible to generate several images for the same prompt. Instead of a single string, pass the pipeline a list containing the prompt repeated once per image. To display the results side by side, we first define our own helper function, `image_grid`:


In [None]:
#import from the Python Imaging Library
from PIL import Image

#arrange a list of equally sized images into a rows x cols grid
def image_grid(imgs, rows, cols):
    #sanity check: the number of images must match the number of grid cells
    assert len(imgs) == rows*cols

    #store the width and height of one image in 'w' and 'h'
    w, h = imgs[0].size

    #create one big blank image large enough to fit every image in the list
    grid = Image.new('RGB', size=(cols*w, rows*h))

    #paste each image into its cell: column i%cols, row i//cols
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))

    #return the finished grid to the caller
    return grid
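The paste coordinates in `image_grid` come from integer division and remainder on the image index. A minimal sketch that computes only the coordinates, without PIL (`grid_positions` is a hypothetical helper for illustration):

```python
def grid_positions(num_images, cols, w, h):
    """Top-left (x, y) paste position of each image in a row-major grid."""
    return [(i % cols * w, i // cols * h) for i in range(num_images)]

# six 512x512 images laid out in 3 columns (so 2 rows)
for i, pos in enumerate(grid_positions(6, cols=3, w=512, h=512)):
    print(i, pos)
```

Images 0 to 2 land in the first row and images 3 to 5 in the second, which is why the list passed to `image_grid` must contain exactly `rows*cols` images.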

Now we can run the pipeline with a list containing the same prompt 3 times and display the results as a grid.

In [None]:
#number of images we want to generate
num_images = 3

#this is the prompt our images will be based on and it is multiplied by the number of images we want to generate
prompt = ["A city in the future with flying cars"] * num_images

#run the list of prompts through the pipeline we loaded previously; .images holds one generated image per prompt
images = pipe(prompt).images

#arrange the images in a single row
grid = image_grid(images, rows=1, cols=num_images)

#display the grid below
grid

To generate a grid with several rows, run the pipeline once per row and collect all the images:

In [None]:
#declare how many columns and rows of images you would like
num_cols = 3
num_rows = 4

#prompt that the images will be based off multiplied by the number of columns of images we want to generate
prompt = ["A city in the future with flying cars"] * num_cols

#create an empty list where the images will be stored
all_images = []

#loop once per row; each pass generates num_cols new images (one per prompt in the list) and adds them to all_images
for i in range(num_rows):
  images = pipe(prompt).images
  all_images.extend(images)

#arrange all the images in a num_rows x num_cols grid
grid = image_grid(all_images, rows=num_rows, cols=num_cols)

#display the grid below
grid

# Custom Pipeline from Scratch

To build a custom pipeline, each component of the model needs to be loaded individually: the autoencoder (VAE), the tokenizer and text encoder, the UNet, and the scheduler.

In [None]:
# 1. Load the autoencoder model which will be used to decode the latents into image space. 
vae = AutoencoderKL.from_pretrained(
    'CompVis/stable-diffusion-v1-4', subfolder='vae', use_auth_token=True)
vae = vae.to(device)

# 2. Load the tokenizer and text encoder to tokenize and encode the text. 
tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14')
text_encoder = CLIPTextModel.from_pretrained('openai/clip-vit-large-patch14')
text_encoder = text_encoder.to(device)

# 3. The UNet model for generating the latents.
unet = UNet2DConditionModel.from_pretrained(
    'CompVis/stable-diffusion-v1-4', subfolder='unet', use_auth_token=True)
unet = unet.to(device)

# 4. Create a scheduler for inference
scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012,
    beta_schedule='scaled_linear', num_train_timesteps=1000)

Now that all the necessary components have been loaded, the text prompt needs to be turned into embeddings that the model can work with. The tokenizer first splits the text into tokens, and the text encoder then turns those tokens into embeddings. The same is done for an empty prompt, which gives the unconditional embeddings used for guidance later.

In [4]:
def get_text_embeds(prompt):
  # Tokenize text and get embeddings
  text_input = tokenizer(
      prompt, padding='max_length', max_length=tokenizer.model_max_length,
      truncation=True, return_tensors='pt')
  with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(device))[0]

  # Do the same for unconditional embeddings
  uncond_input = tokenizer(
      [''] * len(prompt), padding='max_length',
      max_length=tokenizer.model_max_length, return_tensors='pt')
  with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]

  # Concatenate for final embeddings
  text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
  return text_embeddings
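The `padding='max_length'` and `truncation=True` arguments make every prompt the same length in tokens, so prompts can be stacked into one batch. The sketch below shows that idea on plain lists of token ids. The ids, the `pad_id` value, and the helper `pad_or_truncate` are made up for illustration and do not reflect the real CLIP vocabulary; 77 is the maximum length of the CLIP tokenizer.

```python
def pad_or_truncate(token_ids, max_length=77, pad_id=0):
    """Force a list of token ids to exactly max_length entries."""
    clipped = token_ids[:max_length]                         # truncation
    return clipped + [pad_id] * (max_length - len(clipped))  # padding

short_prompt = [101, 7, 42]        # made-up ids for a short prompt
long_prompt = list(range(1, 200))  # made-up ids for an overly long prompt

batch = [pad_or_truncate(p) for p in (short_prompt, long_prompt)]
print([len(row) for row in batch])  # [77, 77]
```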

The next step is to generate random latents: the latent space is filled with noise so that it can later be denoised based on our prompt. The denoising is driven by something called a scheduler, which spaces out the denoising process so that it completes in the requested number of inference steps while keeping image quality as high as possible for that number of steps.

Once the latents are pure noise, the inference steps begin. At each step, the UNet predicts the noise present in the current latents, conditioned on the text embeddings. The scheduler then uses that prediction to compute slightly less noisy latents for the next step. This repeats until the final step, leaving the latents of the finished image.

In [5]:
def produce_latents(text_embeddings, height=512, width=512,
                    num_inference_steps=50, guidance_scale=7.5, latents=None):
  if latents is None:
    #Generate random latents in order to fill the latent space with noise
    latents = torch.randn((text_embeddings.shape[0] // 2, unet.in_channels,
                           height // 8, width // 8))
  latents = latents.to(device)

  scheduler.set_timesteps(num_inference_steps)
  latents = latents * scheduler.sigmas[0]

  with autocast('cuda'):
    for i, t in tqdm(enumerate(scheduler.timesteps)):
      # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
      latent_model_input = torch.cat([latents] * 2)
      sigma = scheduler.sigmas[i]
      latent_model_input = latent_model_input / ((sigma**2 + 1) ** 0.5)

      # predict the noise residual
      with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)['sample']

      # perform guidance
      noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
      noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

      # compute the previous noisy sample x_t -> x_t-1
      latents = scheduler.step(noise_pred, i, latents)['prev_sample']
  
  return latents
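The guidance step in the loop is a simple extrapolation from the unconditional prediction toward the text-conditioned one. A minimal sketch on plain Python lists instead of tensors (`apply_guidance` is a hypothetical name; the numbers are made up):

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale=7.5):
    """Classifier-free guidance: uncond + scale * (text - uncond), element-wise."""
    return [u + guidance_scale * (t - u)
            for u, t in zip(noise_uncond, noise_text)]

uncond = [0.0, 1.0]
text = [1.0, 3.0]

print(apply_guidance(uncond, text, guidance_scale=1.0))  # [1.0, 3.0], same as text
print(apply_guidance(uncond, text, guidance_scale=7.5))  # [7.5, 16.0]
```

A scale of 1 reproduces the text-conditioned prediction; larger values push the result further in the direction the prompt suggests.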

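Once the denoising loop has produced the final latents, they are rescaled (divided by the latent scaling factor 0.18215) and passed through the VAE decoder to obtain the image. The pixel values are then mapped to the range 0 to 255 and converted to PIL images.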

In [6]:
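def decode_img_latents(latents):
  # Undo the latent scaling before decoding
  latents = 1 / 0.18215 * latents

  # Decode the latents into image space
  with torch.no_grad():
    imgs = vae.decode(latents)

  # Map values from [-1, 1] to [0, 1], then to 8-bit pixels in (batch, height, width, channel) layout
  imgs = (imgs / 2 + 0.5).clamp(0, 1)
  imgs = imgs.detach().cpu().permute(0, 2, 3, 1).numpy()
  imgs = (imgs * 255).round().astype('uint8')
  pil_images = [Image.fromarray(image) for image in imgs]
  return pil_images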

# imgs = decode_img_latents(test_latents)
# imgs[0]
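The post-processing in `decode_img_latents` maps decoder output, which lies roughly in the range [-1, 1], to 8-bit pixel values. The same arithmetic on a single number (`to_uint8` is a hypothetical helper for illustration):

```python
def to_uint8(value):
    """Map a decoder output value in [-1, 1] to an integer pixel in [0, 255]."""
    scaled = value / 2 + 0.5              # [-1, 1] -> [0, 1]
    clamped = min(1.0, max(0.0, scaled))  # clip anything outside the range
    return int(round(clamped * 255))      # [0, 1] -> [0, 255]

for v in (-1.0, 0.5, 1.0, 2.0):
    print(v, "->", to_uint8(v))
```

Values outside [-1, 1], such as 2.0 here, are clipped to the nearest valid pixel value instead of wrapping around.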

Now that the helper functions are defined, a main function can call them in order to run the custom Stable Diffusion pipeline end to end.

In [7]:
def prompt_to_img(prompts, height=512, width=512, num_inference_steps=50,
                  guidance_scale=7.5, latents=None):
  if isinstance(prompts, str):
    prompts = [prompts]

  # Prompts -> text embeds
  text_embeds = get_text_embeds(prompts)

  # Text embeds -> img latents
  latents = produce_latents(
      text_embeds, height=height, width=width, latents=latents,
      num_inference_steps=num_inference_steps, guidance_scale=guidance_scale)
  
  # Img latents -> imgs
  imgs = decode_img_latents(latents)

  return imgs

All that is left to do is give a prompt to the model and it will produce the desired image.

In [None]:
prompt_to_img('A futuristic city with flying cars', height=512, width=512,
              num_inference_steps=20)[0]

The custom pipeline is now complete!