# 02. Basic Image Generation + Functions

## 01. Basic SDXL Pipeline
#### Content

1.  [Load SDXL Pipeline](#basicsdxl)
2.  [SDXL Architecture Components](#sdxlarchitecture)
3.  [SDXL Base Model](#sdxlbase)
4.  [SDXL Base + Refiner](#sdxlbaserefiner)

---

**Reduce memory usage** (to avoid out-of-memory errors), see:
https://huggingface.co/docs/diffusers/optimization/memory

---
## Description + Links

If you want to know more about the Stable Diffusion pipeline, check out the notebook [<u>SDXL - Explained</u>](../1.0_general/02_definitions.ipynb).


**Documentation**

https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl

**Paper**

[Podell, D., et al. (2023): SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis](https://arxiv.org/abs/2307.01952)


[Rombach, R., et al. (2021): High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)


## Setup

In [None]:
%env HF_HOME=/cluster/user/ehoemmen/.cache
%env HF_DATASETS_CACHE=/cluster/user/ehoemmen/.cache
%env TRANSFORMERS_CACHE=/cluster/user/ehoemmen/.cache

In [None]:
%pip install -U diffusers invisible_watermark transformers accelerate safetensors

<a id="basicsdxl"></a>

## 01. Load SDXL Pipeline

The Stable Diffusion XL pipeline consists of two stages: the **base model** and the **refiner model**. The base model can also be run standalone; in the two-stage setup it is the expert for the high-noise diffusion stage. The refiner is the expert for the low-noise diffusion stage and adds high-quality details. The output of the base model can be passed to the refiner to add more detail. Using the base and refiner models together to generate an image is known as an **ensemble of expert denoisers**.

In [None]:
# load both the base and refiner model

from diffusers import DiffusionPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", 
    torch_dtype=torch.float16, 
    variant="fp16", 
    cache_dir="/cluster/user/ehoemmen/.cache"
)
# Normally you would call base.to("cuda"); to avoid out-of-memory errors we use
# enable_model_cpu_offload() (or enable_sequential_cpu_offload()) instead
base.enable_model_cpu_offload()

refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
    variant="fp16",
)

# As above: use enable_model_cpu_offload() (or enable_sequential_cpu_offload())
# instead of moving the whole pipeline to CUDA, to avoid out-of-memory errors
refiner.enable_model_cpu_offload()

<a id="sdxlarchitecture"></a>

## 02. SDXL Architecture Components

To learn more about the SDXL architecture, check out [SDXL Architecture - Explained](../1.0_General/02_Definitions.ipynb) for definitions, or the [Hugging Face - Diffusion Models Class (DM from Scratch)](../HF_Diffusion%20Models%20Class) to learn diffusion models from scratch.

You can access any part of the pipeline directly, for example `base.unet` or `base.scheduler`, or go even deeper with `base.unet.parameters()`.

In [None]:
# Pipeline components
print(list(base.components.keys()))

#### UNet - Number of Parameters

Access the UNet and show its number of trainable parameters.

In [None]:
base.unet.num_parameters(only_trainable=True)

#### The Tokenizer and Text Encoder

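The text encoder **turns an input string (the prompt) into a numerical representation that can be fed to the UNet as conditioning**. The text is first split into a series of tokens by the pipeline's tokenizer; the token IDs are then passed through the text encoder. SDXL has two text encoders (`text_encoder` and `text_encoder_2`); the example below uses the first one.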

In [None]:
# Tokenizing and encoding an example prompt manually

# Tokenize
input_ids = base.tokenizer(["A painting of a flooble"])['input_ids']
print("Input ID -> decoded token")
for input_id in input_ids[0]:
  print(f"{input_id} -> {base.tokenizer.decode(input_id)}")

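# Feed through the CLIP text encoder.
# No explicit .to(device) here: with model CPU offload enabled, the offload
# hook is expected to move the inputs to the execution device.
input_ids = torch.tensor(input_ids)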
with torch.no_grad():
  text_embeddings = base.text_encoder(input_ids)['last_hidden_state']
print("Text embeddings shape:", text_embeddings.shape)
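
A made-up word such as "flooble" is not a single entry in the vocabulary, so a subword tokenizer has to split it into smaller known pieces. The following is a toy sketch of that idea only: it uses greedy longest-match splitting on a hypothetical mini vocabulary, whereas CLIP's tokenizer uses byte-pair encoding, and the function name `toy_subword_tokenize` is made up for illustration.

```python
def toy_subword_tokenize(word, vocab):
    """Greedy longest-match split of a word into known pieces.

    Toy illustration only: CLIP uses byte-pair encoding, not this algorithm.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: keep the character on its own
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical mini vocabulary, not the real CLIP vocabulary
vocab = {"painting", "of", "a", "flo", "ob", "le"}

print(toy_subword_tokenize("painting", vocab))  # known word stays whole
print(toy_subword_tokenize("flooble", vocab))   # unknown word is split up
```

In the real tokenizer output above, each piece is additionally mapped to an integer ID, which is what the text encoder consumes.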

In [None]:
base.text_encoder

<a id="sdxlbase"></a>
## 03. SDXL Base Model 

In [None]:
# base model image generation
prompt = "delicious risotto in a pan, food photography, realistic, top view, professional lighting"

# all generation parameters keep their default values here;
# variant and torch_dtype are from_pretrained() arguments, not call arguments
images = base(prompt=prompt).images[0]

images

### SDXL Parameters
Key arguments to tweak in the pipeline:

* **width** and **height** specify the size of the generated image. They must be **divisible by 8 for the VAE** to work
* the **number of steps** (`num_inference_steps`) influences the generation quality. The default (50) works well, but in some cases you can get away with as few as 20 steps, which is handy for experimentation
* the **negative prompt** is used during the classifier-free guidance process and can be a useful way to add additional control
* the `guidance_scale` argument determines how strong the **classifier-free guidance (CFG)** is. Higher scales push the generated images to **better match the prompt**, but if the scale is too high the results can become over-saturated and unpleasant
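
The guidance step itself is a simple extrapolation between two noise predictions. Here is a minimal sketch in plain Python, assuming the standard classifier-free guidance formula; the function name `apply_cfg` and the toy numbers are made up for illustration, and the pipeline works on noise-prediction tensors rather than lists.

```python
def apply_cfg(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions.

    Hypothetical helper: uncond + scale * (text - uncond), element-wise.
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


# Toy values standing in for two noise predictions
uncond = [0.0, 1.0, 2.0]
text = [1.0, 1.0, 0.0]

print(apply_cfg(uncond, text, 1.0))  # scale 1 reproduces the conditioned prediction
print(apply_cfg(uncond, text, 8.0))  # larger scales move further away from uncond
```

When a negative prompt is supplied, its prediction takes the place of the unconditional one, so the result is pushed away from the negative prompt and toward the prompt.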

Here you can find all the different [**SDXL Parameters**](../1.0_general/03_parameters.ipynb).


In [None]:
# base model image generation 
prompt = "delicious risotto in a pan, food photography, realistic, top view, professional lighting"
generator = torch.Generator().manual_seed(33)

images = base(
        prompt=prompt,             # What to generate
        negative_prompt="oversaturated, blurry, low quality", # What NOT to generate
        height=1024, width=1024,   # Specify the image size
        guidance_scale=8,          # How strongly to follow the prompt
        num_inference_steps=35,    # How many steps to take
        generator=generator        # Fixed random seed
        ).images[0]

images
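
Because width and height must be divisible by 8, it can help to snap a requested size to a valid value before calling the pipeline. A small sketch follows; `snap_to_multiple` is a hypothetical helper, not part of diffusers.

```python
def snap_to_multiple(value, multiple=8):
    """Round value down to the nearest multiple (hypothetical helper)."""
    if value < multiple:
        raise ValueError(f"value must be at least {multiple}")
    return (value // multiple) * multiple


# 1000 is already divisible by 8; 770 is rounded down to the next valid size
width, height = snap_to_multiple(1000), snap_to_multiple(770)
print(width, height)
```

The snapped values can then be passed as `width` and `height` in the call above.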

<a id="sdxlbaserefiner"></a>
## 04. SDXL Base + Refiner

To use this approach, you define which share of the denoising process each model handles. For the base model, this is controlled by the `denoising_end` parameter; for the refiner model, by the `denoising_start` parameter. Both are floats between 0 and 1.

Let's set `denoising_end=0.8` so that the base model performs the first 80% of the denoising (the high-noise timesteps), and `denoising_start=0.8` so that the refiner model performs the last 20% (the low-noise timesteps). The base model must return its output in **latent space** (`output_type="latent"`) instead of as a PIL image.
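
To get a feeling for what the split means in steps, here is a simplified sketch in plain Python. It is not the library's implementation: `split_steps` is a hypothetical helper, and it assumes 1000 training timesteps with evenly spaced inference timesteps (offset 1), which depends on the scheduler configuration.

```python
def split_steps(num_inference_steps, split, num_train_timesteps=1000):
    """Count how many steps fall before and after the split point.

    Simplified sketch: assumes evenly spaced timesteps with an offset of 1,
    which is an assumption about the scheduler configuration.
    """
    step_ratio = num_train_timesteps // num_inference_steps
    # Timesteps run from high noise (large t) to low noise (small t)
    timesteps = [i * step_ratio + 1 for i in reversed(range(num_inference_steps))]
    # Timestep value that corresponds to the requested fraction
    cutoff = round(num_train_timesteps - split * num_train_timesteps)
    base_steps = sum(1 for t in timesteps if t >= cutoff)
    refiner_steps = len(timesteps) - base_steps
    return base_steps, refiner_steps


base_steps, refiner_steps = split_steps(40, 0.8)
print(base_steps, refiner_steps)
```

Under these assumptions, 40 steps with a split at 0.8 give 32 steps for the base model and 8 for the refiner. Both pipeline calls still receive the same `num_inference_steps`; each model only runs its own share.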


In [None]:
# the base model's latent output is passed to the refiner pipeline after 80% of the denoising steps

prompt = "delicious risotto in a pan, food photography, realistic, top view, professional lighting"

image = base(
    prompt=prompt,
    num_inference_steps=40,
    denoising_end=0.8,
    output_type="latent",
).images
image = refiner(
    prompt=prompt,
    num_inference_steps=40,
    denoising_start=0.8,
    image=image,
).images[0]
image