# 02. Basic Image Generation + Functions

## 02. SDXL Features + Image Control

#### Content:
0.  [SDXL - General](#sdxlgeneral)
1.  [Basic Image Generation](#basicgeneration)
2.  [Guidance Scale](#guidance)
3.  [Prompt Weighting](#weighting)
4.  [Image Grid](#grid)
5.  [Deterministic Generation](#deterministic)
6.  [Image Diffusion Process Grid](#processgrid)
7.  [Key Findings](#keyfind)

## Description + Links

**Documentation**

https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl

In [None]:
%env HF_HOME=/cluster/user/ehoemmen/.cache
%env HF_DATASETS_CACHE=/cluster/user/ehoemmen/.cache
%env TRANSFORMERS_CACHE=/cluster/user/ehoemmen/.cache

In [None]:
pip install -U diffusers invisible_watermark transformers accelerate safetensors datasets scipy torchsde compel mediapipe

<a id="sdxlgeneral"></a>
## SDXL - General Generation


In [None]:
import torch
from diffusers import DiffusionPipeline

# compel for prompt weighting
from compel import Compel, ReturnedEmbeddingsType

# for image grid
from PIL import Image

In [None]:
# always set cache_dir to your own folder to avoid "out of memory" errors
pipeline = DiffusionPipeline.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  cache_dir="/cluster/user/ehoemmen/.cache",
  variant="fp16",
  use_safetensors=True,
  torch_dtype=torch.float16,
)
#.to("cuda")

#enable_sequential_cpu_offload() to avoid "out of memory"-errors - then don't move the pipe to CUDA
pipeline.enable_sequential_cpu_offload()

#Prompt Weighting -set up compel
compel = Compel(
  tokenizer=[pipeline.tokenizer, pipeline.tokenizer_2] ,
  text_encoder=[pipeline.text_encoder, pipeline.text_encoder_2],
  returned_embeddings_type=ReturnedEmbeddingsType.PENULTIMATE_HIDDEN_STATES_NON_NORMALIZED,
  requires_pooled=[False, True]
)

#Image Grid
def image_grid(imgs, rows, cols):
    assert len(imgs) == rows*cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    grid_w, grid_h = grid.size
    
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

<a id="basicgeneration"></a>

## 1. Basic Image Generation

SDXL uses **two text encoders** (CLIP ViT-L and OpenCLIP ViT-bigG). You can address them separately with `prompt` and `prompt_2`, and likewise with `negative_prompt` and `negative_prompt_2`.
* `prompt` is used for the primary description of the image
* `prompt_2` is used for additional information like the style, atmosphere or background details

In [None]:
n_steps=30

prompt = "mockup of an organic milk package with a cow on it, organic aesthetic"
# prompt2 = "organic aesthetic, green and brown colors"

# neg_prompt = "red ball"
# neg_prompt2 = None

# generate image // set manual seed
generator = torch.Generator().manual_seed(33)
image = pipeline(prompt=prompt, 
                 # prompt_2=prompt2,  # enable together with the prompt2 definition above
                 # negative_prompt=neg_prompt,
                 # negative_prompt_2=neg_prompt2,
                 generator=generator, 
                 height=1024,
                 width=1024,
                 num_inference_steps=n_steps,
                ).images[0]

image

<a id="guidance"></a>
## 2. Guidance Scale
The guidance scale sets the strength of **classifier-free guidance (CFG)**.

A higher guidance scale pushes the model to generate images that follow the text prompt more closely, usually at the expense of image quality. The default value in the SDXL pipeline is **5.0** (**7.5** is the default of the original Stable Diffusion pipeline).
Here we create an image grid with four different CFG values to compare the results.
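
As a rough sketch of what the guidance scale does inside the sampler: classifier-free guidance combines an unconditional and a text-conditioned noise prediction, and the scale controls how far the result is pushed towards the text-conditioned one. The helper name `apply_guidance` and the toy numbers are illustrative assumptions; the real pipeline applies the same arithmetic to latent tensors.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine two noise predictions element-wise.

    result = uncond + guidance_scale * (text - uncond)
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# toy values standing in for the two noise predictions (illustrative only)
noise_uncond = [0.0, 1.0, 2.0]
noise_text = [1.0, 1.0, 0.0]

for scale in [1, 7, 12, 17]:
    print(scale, apply_guidance(noise_uncond, noise_text, scale))
```

With a scale of 1 the result equals the text-conditioned prediction, so no extra guidance is applied; larger values extrapolate further away from the unconditional prediction.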

In [None]:
prompt = "a red cat playing with a ball"

cfg_values = [1, 7, 12, 17]

images = []
for cfg_value in cfg_values:
    # re-seed for every run so that only the guidance scale differs between images
    generator = torch.Generator().manual_seed(33)
    # one image per CFG value is enough for the comparison grid
    image = pipeline(prompt=prompt, 
                     generator=generator, 
                     num_inference_steps=20,
                     #height=512,width=512,
                     guidance_scale=cfg_value
                    ).images[0]
    images.append(image)

# create grid
grid = image_grid(images, rows=1, cols=len(cfg_values))
grid

<a id="weighting"></a>
## 3. Prompt weighting

Prompt weighting gives you more control by emphasizing or de-emphasizing specific parts of the prompt. It works by scaling up or down the text embedding vectors that correspond to a concept in the prompt, since you may not want the model to focus on all concepts equally. Use **Compel** for prompt weighting.

To **increase or decrease the weight** of a word, append `+` or `-` to it. The factors compound: `+` corresponds to 1.1, `++` to 1.1² = 1.21, and so on. Similarly, `-` corresponds to 0.9 and `--` to 0.9² = 0.81.

Alternatively, put the word or phrase in parentheses followed by the weight. E.g. `(red ball)1.5` upweights "red ball" by a factor of 1.5 and `(cat)0.5` downweights "cat" by a factor of 0.5.
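
To get a feeling for what a run of signs amounts to, here is a minimal sketch that converts a `+`/`-` suffix into a numeric weight. It assumes the convention described in the Compel documentation, where every `+` multiplies the weight by 1.1 and every `-` by 0.9; the helper `sign_weight` is hypothetical and not part of Compel.

```python
def sign_weight(suffix):
    """Return the weight implied by a run of '+' or '-' characters.

    Assumption: each '+' multiplies by 1.1 and each '-' by 0.9.
    """
    if not suffix:
        return 1.0  # no sign means the default weight
    if set(suffix) == {"+"}:
        return 1.1 ** len(suffix)
    if set(suffix) == {"-"}:
        return 0.9 ** len(suffix)
    raise ValueError("suffix must consist only of '+' or only of '-'")


for suffix in ["+", "++", "-", "--", "--------"]:
    print(repr(suffix), round(sign_weight(suffix), 3))
```

Under this assumption, eight minus signs still leave a weight above 0.4, so an explicit number such as `(ball)0.4` expresses the same intent more compactly.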

In [None]:
#example 1
prompt = "a cat playing with a (red ball)1.5"
conditioning, pooled = compel(prompt)

# generate image
generator = torch.Generator().manual_seed(33)
image = pipeline(prompt_embeds=conditioning, 
                 pooled_prompt_embeds=pooled, 
                 generator=generator, 
                 num_inference_steps=30,
                ).images[0]
image

In [None]:
# example 2
prompt = "a cat playing with a red ball--------"
conditioning, pooled = compel(prompt)

# generate image
generator = torch.Generator().manual_seed(33)
image = pipeline(prompt_embeds=conditioning, 
                 pooled_prompt_embeds=pooled, 
                 generator=generator, 
                 num_inference_steps=30,
                ).images[0]

image

<a id="grid"></a>
## 4. Image Grid

Creating an image grid makes it much easier to handle the generation and evaluate the results than looking at single image outputs.
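
The `image_grid` helper defined at the top fills the grid row by row. The following sketch shows only the offset arithmetic it relies on, without any actual images; `grid_boxes` is a hypothetical helper introduced for illustration.

```python
def grid_boxes(n_images, cols, w, h):
    """Return the (x, y) paste offset for each image index, filled row by row."""
    return [((i % cols) * w, (i // cols) * h) for i in range(n_images)]


# four 1024x1024 images arranged in a 2x2 grid
for index, box in enumerate(grid_boxes(4, cols=2, w=1024, h=1024)):
    print(index, box)
```

`i % cols` selects the column and `i // cols` the row, which is why the number of images has to equal `rows * cols`.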

In [None]:
# create several images with the same prompt as a grid (here 3)

num_images = 3
prompt = ["a photograph of an astronaut riding a horse"] * num_images
conditioning, pooled = compel(prompt)

images = pipeline(prompt_embeds=conditioning, 
                  pooled_prompt_embeds=pooled, 
                  num_inference_steps=30,
                 ).images

grid = image_grid(images, rows=1, cols=3)

grid

In [None]:
# several images with the same prompt as a grid + prompt weighting

num_images = 4
prompt = ["a photograph of an (astronaut)0.5 riding a (horse)1.5"] * num_images
conditioning, pooled = compel(prompt)

images = pipeline(prompt_embeds=conditioning, 
                  pooled_prompt_embeds=pooled, 
                  num_inference_steps=30,
                 ).images

grid = image_grid(images, rows=1, cols=4)
grid

In [None]:
# several images with the same prompt as a grid
# each + multiplies the weight by 1.1, each - by 0.9

num_images = 4
prompt = ["a photograph of an astronaut----- riding a horse+++++"] * num_images
conditioning, pooled = compel(prompt)

images = pipeline(prompt_embeds=conditioning, 
                  pooled_prompt_embeds=pooled, 
                  num_inference_steps=30,
                 ).images

grid = image_grid(images, rows=1, cols=4)

grid

<a id="deterministic"></a>
## 5. Deterministic generation

You can test and improve the image quality by iterating over different settings, such as a more detailed prompt or a different seed. **Batch generation** makes it easy to compare results and see what impact the changes have.
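
The idea behind seeding can be shown without a GPU. The sketch below uses Python's standard `random.Random` as a stand-in for `torch.Generator` (an analogy, not what the pipeline uses): the same seed reproduces the same starting noise, while a different seed gives different noise. The helper `sample_noise` is hypothetical.

```python
import random


def sample_noise(seed, n=4):
    """Draw n Gaussian values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


same_a = sample_noise(10)
same_b = sample_noise(10)
other = sample_noise(11)

print(same_a == same_b)  # same seed: the two draws are compared for equality
print(same_a == other)   # different seed: compared against the first draw
```

This is why a fixed seed lets you attribute differences between images to the prompt or parameter you changed rather than to the random starting point.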

In [None]:
# Generate 4 images with different seeds

n_images = 4
prompt = ["a red cat playing with a ball"] 
conditioning, pooled = compel(prompt)

generator = [torch.Generator(device="cuda").manual_seed(i) for i in range(n_images)]

images = pipeline(prompt_embeds=conditioning, 
                 pooled_prompt_embeds=pooled, 
                 generator=generator, 
                 num_inference_steps=30,
                 num_images_per_prompt=n_images,
                ).images

grid = image_grid(images, rows=1, cols=4)
grid

In [None]:
# improve one image with 4 different additional prompts

n_images = 4
prompt = ["a red cat playing with a ball" + t for t in [", highly realistic", ", artsy", ", trending", ", colorful"]]

conditioning, pooled = compel(prompt)

# generate image
generator = [torch.Generator().manual_seed(10) for i in range(n_images)]

images = pipeline(prompt_embeds=conditioning, 
                 pooled_prompt_embeds=pooled, 
                 generator=generator, 
                 num_inference_steps=30,
                ).images

grid = image_grid(images, rows=1, cols=4)
grid

<a id="processgrid"></a>
## 6. Image Diffusion Process Grid
This grid illustrates the diffusion process: generation starts from random noise, which is removed step by step until a high-quality image remains.
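
To show the process, each image in the row is stopped at a different fraction of the denoising schedule via `denoising_end`. A minimal sketch of how evenly spaced fractions can be computed; the helper name `denoising_fractions` is an assumption made for this example.

```python
def denoising_fractions(n_images):
    """Evenly spaced denoising_end values, the last one being 1.0."""
    return [(i + 1) / n_images for i in range(n_images)]


print(denoising_fractions(5))
```

Dividing `(i + 1)` by `n_images` directly, rather than multiplying a precomputed step size, avoids small floating-point drift in the intermediate values.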

In [None]:
# one image series for a single prompt

n_images = 5
prompt = ["a red cat playing with a ball"] 
conditioning, pooled = compel(prompt)

# Create a list of denoising_end values
end = 1.0 / n_images
denoising_ends = [end * (step + 1) for step in range(n_images)]

# set the seed
generator = torch.Generator(device="cuda").manual_seed(2147483647) 

# Initialize images list
images = []

# Generate images for each denoising_end value
for i in range(n_images):
    generator.manual_seed(2147483647)
    image = pipeline(prompt_embeds=conditioning, 
                     pooled_prompt_embeds=pooled, 
                     generator=generator, 
                     num_inference_steps=30,
                     denoising_end=denoising_ends[i]
                    ).images[0]
    images.append(image)

grid = image_grid(images, rows=1, cols=n_images)

grid

In [None]:
# several image series for several prompts

n_images = 5
prompts = [
    "a red cat playing with a ball",
    "a blue dog chasing its tail",
    # ...
]

# set the seed
generator = torch.Generator(device="cuda").manual_seed(2147483647)

# Create a list of denoising_end values
end = 1.0 / n_images
denoising_ends = [end * (step + 1) for step in range(n_images)]

# list that collects the generated images
all_images = [] 

for prompt in prompts:
    conditioning, pooled = compel([prompt]) 
    images = []

    # Generate images for each denoising_end value
    # the manual seed has to be set again here
    for i in range(n_images):
        generator.manual_seed(2147483647)
        image = pipeline(prompt_embeds=conditioning, 
                         pooled_prompt_embeds=pooled, 
                         generator=generator, 
                         num_inference_steps=30,
                         denoising_end=denoising_ends[i]
                        ).images[0]
        images.append(image)
    
    all_images.extend(images)

grid = image_grid(all_images, rows=len(prompts), cols=n_images)

grid


<a id="keyfind"></a>
## 7. Key Findings

#### Prompt_2

No matter what I tested with the prompt for the **second text encoder** (`prompt_2`), the results were always **worse** when it was used. In theory, secondary information about the background or the style of the image should be specified there. In practice, I therefore **only used the simple `prompt`** and left `prompt_2` empty (when `prompt_2` is not set, the pipeline passes `prompt` to both text encoders).
* All information regarding style, coloring etc. should therefore always be appended to the normal **`prompt`, separated by commas**, e.g.:
  
  `"mockup of an organic milk package with a cow and the text 'Organic', organic aesthetic, natural green and brown colors"`
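
A small sketch of this convention: keep the main description and the style modifiers as separate values in code and join them into one comma-separated `prompt`. The helper `build_prompt` is hypothetical and only illustrates the string handling.

```python
def build_prompt(subject, modifiers=()):
    """Append style and color modifiers to the main description, comma-separated."""
    parts = [subject.strip()] + [m.strip() for m in modifiers if m.strip()]
    return ", ".join(parts)


prompt = build_prompt(
    "mockup of an organic milk package with a cow and the text 'Organic'",
    ["organic aesthetic", "natural green and brown colors"],
)
print(prompt)
```

Keeping the modifiers in a list makes it easy to swap styles between runs while the subject stays fixed.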

#### Guidance Scale

It has proven useful to start from a moderate value such as **7.5** (the SDXL pipeline itself defaults to 5.0). If the results do not meet your expectations, the `guidance_scale` parameter is worth readjusting, since it has a big effect on the generated image.

#### Prompt Weighting


There are two notations for prompt weighting: `(word)1.5` or `(word)+++++`. I find the decimal notation more suitable, as it makes weighting word combinations straightforward and keeps the prompt from growing unnecessarily long with many + or - characters. For example:

`"cat playing with a (red ball)1.6"`

Sometimes it is necessary to experiment with the weighting (which word or word combination, and how high the weight) in order to achieve the desired result.
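
One way to organize this kind of experimentation is to generate a list of prompt variants that differ only in the weight and render them as a grid. The sketch below only builds the prompt strings; `weighted` and `weight_sweep` are hypothetical helpers, and the weights are arbitrary example values.

```python
def weighted(phrase, weight):
    """Wrap a phrase in the explicit-weight notation, e.g. (red ball)1.6."""
    return f"({phrase}){weight}"


def weight_sweep(template, phrase, weights):
    """Fill the `{}` placeholder in the template with the phrase at each weight."""
    return [template.format(weighted(phrase, w)) for w in weights]


prompts = weight_sweep("cat playing with a {}", "red ball", [1.2, 1.4, 1.6])
for p in prompts:
    print(p)
```

The resulting list can be passed to `compel(...)` in the same way as the prompt lists in the grid examples above, with one fixed seed per image so that only the weight changes.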
