## Stable Diffusion with Diffusers

Stable Diffusion is a text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512x512 images from a subset of the LAION-5B database. The model uses a frozen CLIP ViT-L/14 text encoder to condition generation on text prompts. With its 860M-parameter UNet and 123M-parameter text encoder, the model is relatively lightweight and can run on many consumer GPUs.

LAION-5B is the largest freely accessible multi-modal dataset that currently exists.

Here we use Stable Diffusion with the 🤗 diffusers library, explain how the model works, and finally dive a bit deeper into how diffusers allows one to customize the image generation pipeline.

### How to use StableDiffusionPipeline

First, please make sure you are using a GPU runtime to run this notebook, so that inference is much faster. If the following command fails, use the `Runtime` menu above and select `Change runtime type`.

In [None]:
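import os
import platform

# On Kaggle (detected through the KAGGLE_URL_BASE environment variable) use the GPU;
# elsewhere fall back to Apple's MPS backend on macOS, or to the CPU.
if 'kaggle' in os.environ.get('KAGGLE_URL_BASE', 'localhost'):
    torch_device = 'cuda'
else:
    torch_device = 'mps' if platform.system() == 'Darwin' else 'cpu'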

In [None]:
torch_device

In [None]:
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
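
The check above keys off the Kaggle environment. As a more general sketch, assuming PyTorch is already installed, the device could instead be chosen by asking PyTorch which backends are available. The helper name `pick_device` is illustrative and not part of any library.

```python
def pick_device(has_cuda: bool, has_mps: bool) -> str:
    """Prefer an NVIDIA GPU, then Apple's MPS backend, then the CPU."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


try:
    import torch

    detected_device = pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
except (ImportError, AttributeError):
    # PyTorch is missing, or too old to expose the MPS check: assume CPU.
    detected_device = "cpu"

print(detected_device)
```

Keeping the decision in a small pure function makes the priority order (CUDA, then MPS, then CPU) easy to read and to change.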

Next, you should install `diffusers` as well as `scipy`, `ftfy` and `transformers`. `accelerate` is used to achieve much faster loading.

In [None]:
!pip install --upgrade git+https://github.com/huggingface/diffusers.git

In [None]:
!pip install transformers scipy ftfy accelerate

### Stable Diffusion Pipeline

`StableDiffusionPipeline` is an end-to-end inference pipeline that you can use to generate images from text with just a few lines of code.

In [None]:
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

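repo_id = 'stabilityai/stable-diffusion-2-1'
# Half precision saves memory on GPU/MPS; on CPU it is often unsupported or slow, so use float32 there.
dtype = torch.float32 if torch_device == 'cpu' else torch.float16
pipe = StableDiffusionPipeline.from_pretrained(repo_id, torch_dtype=dtype)
# Swap in the DPM-Solver multistep scheduler, reusing the default scheduler's configuration.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)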

In [None]:
pipe = pipe.to(torch_device)

In [None]:
prompt = 'house, shot 35 mm, realism, octane render, 8k, trending on artstation, 35 mm camera, unreal engine, hyper detailed, photo - realistic maximum detail, volumetric light, realistic matte painting, hyper photorealistic, trending on artstation, ultra - detailed, realistic'
negative_prompt = 'BadDream, (UnrealisticDream:1.3)'

image = pipe(prompt=prompt, negative_prompt=negative_prompt).images[0]  # a PIL image (https://pillow.readthedocs.io/en/stable/)

# In a notebook, evaluating `image` on the last line of a cell displays it; to keep a copy, call image.save('house.png').
image

### [Seed](https://aisuko.gitbook.io/wiki/ai-techniques/stable-diffusion/the-important-parameters-for-stunning-ai-image#seed)

Running the above cell multiple times will give you a different image every time. If you want deterministic output, you can pass a seeded random generator to the pipeline. Every time you use the same seed, you will get the same image.

In [None]:
import torch

generator = torch.Generator(torch_device).manual_seed(5775709)

image = pipe(prompt=prompt, negative_prompt=negative_prompt, generator=generator).images[0]

image
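
A seeded generator is stateful: it reproduces the same noise only when it starts from the same seed, which is why the cells in this notebook create a fresh generator before each call. The sketch below illustrates this with Python's standard `random.Random` as a stand-in for `torch.Generator`; the helper `sample_latents` is hypothetical and only mimics drawing the initial latent noise.

```python
import random


def sample_latents(rng, n=4):
    """Draw n Gaussian values, standing in for the initial latent noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Two fresh generators with the same seed produce identical noise.
first = sample_latents(random.Random(5775709))
second = sample_latents(random.Random(5775709))
print(first == second)

# Reusing one generator advances its state, so the second draw differs.
rng = random.Random(5775709)
draw_a = sample_latents(rng)
draw_b = sample_latents(rng)
print(draw_a == first, draw_a == draw_b)
```

The same reasoning applies to the pipeline: passing a generator that has already been used yields a different image than a freshly seeded one.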

### num_inference_steps

We can change the number of inference steps using the `num_inference_steps` argument. In general, results are better the more steps you use, at the cost of longer generation time. The default value is 50. If you want faster results, you can use a smaller number.

In [None]:
import torch

generator = torch.Generator(torch_device).manual_seed(5775709)

image = pipe(prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=30, generator=generator).images[0]

image
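
To give a feel for what the step count changes, the sketch below picks evenly spaced timesteps out of a training schedule. It assumes a model trained with 1000 timesteps and a simple "every k-th step" selection; the function name is illustrative, and real schedulers (including the DPM-Solver one used above) may space their timesteps differently.

```python
def spaced_timesteps(num_train_timesteps=1000, num_inference_steps=50):
    """Return evenly spaced timesteps, from the noisiest down to 0."""
    step_ratio = num_train_timesteps // num_inference_steps
    ascending = [i * step_ratio for i in range(num_inference_steps)]
    return ascending[::-1]


steps_50 = spaced_timesteps(num_inference_steps=50)
steps_30 = spaced_timesteps(num_inference_steps=30)

print(len(steps_50), steps_50[:3])  # 50 [980, 960, 940]
print(len(steps_30), steps_30[:3])  # 30 [957, 924, 891]
```

Fewer inference steps means larger jumps between consecutive timesteps, which is faster but gives the model fewer chances to refine the image.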

### Generating multiple images

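To generate several images for the same prompt, pass the pipeline a list that repeats the prompt once per image; the cells at the end of this section do this and arrange the results in a grid.

How closely each image follows the prompt is governed by classifier-free guidance, which is set through the `guidance_scale` argument. Guidance is a way to increase adherence to the conditioning signal (in this case, the text) as well as overall sample quality. In simple terms, classifier-free guidance forces the generation to match the prompt more closely. Values such as 7 or 8.5 give good results; if you use a very large value, the images might still look good but will be less diverse.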
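
Classifier-free guidance is usually described as follows: at each step the model predicts the noise twice, once with the text prompt and once without it, and the two predictions are combined. Below is a minimal sketch of that combination on plain Python lists. The function name is illustrative; the pipeline performs the equivalent operation internally on tensors.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Move the unconditional prediction toward (and past) the text-conditioned one."""
    return [
        uncond + guidance_scale * (text - uncond)
        for uncond, text in zip(noise_uncond, noise_text)
    ]


noise_uncond = [0.0, 0.5]   # toy prediction without the prompt
noise_text = [1.0, 0.25]    # toy prediction with the prompt

for scale in (0.0, 1.0, 7.5):
    print(scale, apply_guidance(noise_uncond, noise_text, scale))
```

A scale of 1 simply returns the text-conditioned prediction; larger values exaggerate the difference between the two predictions, which is what pushes the image toward the prompt.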

In [None]:
from PIL import Image

def image_grid(imgs, rows, cols):
    """Paste equally sized PIL images into a rows x cols grid, filling row by row."""
    assert len(imgs) == rows * cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols * w, rows * h))

    for i, img in enumerate(imgs):
        # Column index is i % cols, row index is i // cols.
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid

In [None]:
num_images = 3
prompt1 = ["a photograph of an astronaut riding a horse"] * num_images

images = pipe(prompt1).images

grid = image_grid(images, rows=1, cols=num_images)
grid

### Generating non-square images

Let's create rectangular images in portrait or landscape ratios. Here are some recommendations for choosing good image sizes:
* Make sure `height` and `width` are both multiples of 8
* Going below 512 might result in lower quality images
* Going over 512 in both directions will repeat image areas (global coherence is lost)
* The best way to create non-square images is to use `512` in one dimension, and a value larger than that in the other one

In [None]:
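# Re-create the seeded generator: the one from the earlier cell has already been consumed.
generator = torch.Generator(torch_device).manual_seed(5775709)
image = pipe(prompt=prompt, num_inference_steps=30, generator=generator, height=512, width=768).images[0]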
image
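
The multiple-of-8 rule follows from the autoencoder, which in Stable Diffusion downsamples each spatial dimension by a factor of 8 into 4 latent channels. A small sketch of the resulting latent size (the helper name `latent_shape` is illustrative, not a diffusers function):

```python
def latent_shape(height, width, channels=4, downsample=8):
    """Return (channels, latent_height, latent_width) for a requested image size."""
    if height % downsample or width % downsample:
        raise ValueError(
            f"height and width must be multiples of {downsample}, got {height}x{width}"
        )
    return (channels, height // downsample, width // downsample)


print(latent_shape(512, 512))  # (4, 64, 64)
print(latent_shape(512, 768))  # (4, 64, 96)
```

A size that is not divisible by 8 has no whole-number latent grid, which is why such values are rejected here.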

#### Stable Diffusion during inference

Putting it together, let's now take a closer look at how the model works during inference by illustrating the logical flow.

<p align="center">
    <img src="https://hostux.social/system/media_attachments/files/110/683/631/285/614/442/original/6a9f7fecd5e3949b.png" width="1000" />
</p>


The Stable Diffusion model takes both a latent seed and a text prompt as input. The latent seed is used to generate random latent image representations of size 64x64, whereas the text prompt is transformed to text embeddings of size 77x768 via CLIP's text encoder.

The DPM-Solver multistep scheduler (`DPMSolverMultistepScheduler`) is able to achieve great quality in fewer steps, such as 25.
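
The flow described above can be sketched as a toy loop. Every function below is a hypothetical stand-in written for illustration: the "text encoder" returns zeros, the "UNet" pretends that 10% of each latent value is noise, and the "scheduler" simply subtracts it. Only the overall structure (encode the prompt, draw seeded latents, denoise step by step) is meant to mirror the real pipeline.

```python
import random


def encode_prompt(prompt, tokens=77, dim=768):
    """Stand-in for the text encoder: one dim-sized vector per token slot."""
    return [[0.0] * dim for _ in range(tokens)]


def initial_latents(seed, channels=4, height=64, width=64):
    """Seeded Gaussian noise, flattened, standing in for the 64x64 latents."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(channels * height * width)]


def predict_noise(latents, timestep, text_embeddings):
    """Stand-in for the UNet: treats 10% of every latent value as noise."""
    return [0.1 * value for value in latents]


def scheduler_step(latents, noise_pred):
    """Stand-in for the scheduler: subtracts the predicted noise."""
    return [value - noise for value, noise in zip(latents, noise_pred)]


def generate_latents(prompt, seed, num_inference_steps=25):
    text_embeddings = encode_prompt(prompt)
    latents = initial_latents(seed)
    for timestep in reversed(range(num_inference_steps)):
        noise_pred = predict_noise(latents, timestep, text_embeddings)
        latents = scheduler_step(latents, noise_pred)
    return latents  # the real pipeline would now decode these with the VAE


result = generate_latents("a photograph of an astronaut riding a horse", seed=5775709)
print(len(result))  # 16384 = 4 * 64 * 64
```

In the real pipeline the final latents are decoded into an image by the autoencoder; that step is omitted here because the toy latents carry no image content.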