# 03. Fast Image Generation

## 03. Würstchen
#### Content

1. [Würstchen - General Image Generation](#wuerstchen)
2. [Honey Cornflakes Test](#cornflakes)
3. [Text Generation](#text)
4. [Key Findings](#keyfind)

## Description + Links

* Würstchen works in a highly compressed latent space of images, which reduces computational cost for both training and inference.
* It consists of three stages: Stage C, Stage B and Stage A.
    * Stage C first generates latents in a very compressed latent space (`prior pipeline`).
    * The generated latents are then passed to Stage B, which decompresses them into a larger latent space.
    * Stage A then decodes these latents into pixel space (Stages B and A are both encapsulated in the `decoder_pipeline`).
* The model employs a two-stage compression: Stage A is a VQGAN and Stage B is a Diffusion Autoencoder.
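
The stage split described above can be made explicit by calling the prior and decoder pipelines separately instead of the combined `AutoPipelineForText2Image`. The following is a minimal sketch adapted from the diffusers Würstchen documentation, not the code used in this notebook. The names `generate_two_stage` and `stage_c_latent_grid` are hypothetical, and the compression factor of 42.67 is an assumption (roughly the 42x spatial compression reported in the paper) that should be checked against the installed diffusers version.

```python
from math import ceil

# Assumed spatial compression factor of Stage C (roughly 42x per the paper).
STAGE_C_COMPRESSION = 42.67


def stage_c_latent_grid(width, height, factor=STAGE_C_COMPRESSION):
    """Return the approximate (width, height) of the Stage C latent grid."""
    return ceil(width / factor), ceil(height / factor)


def generate_two_stage(prompt, width=1024, height=1536):
    """Run Stage C (prior) and Stages B + A (decoder) as separate pipelines."""
    # Heavy imports stay local so the helper above works without a GPU stack.
    import torch
    from diffusers import WuerstchenDecoderPipeline, WuerstchenPriorPipeline
    from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

    prior_pipeline = WuerstchenPriorPipeline.from_pretrained(
        "warp-ai/wuerstchen-prior", torch_dtype=torch.float16
    ).to("cuda")
    decoder_pipeline = WuerstchenDecoderPipeline.from_pretrained(
        "warp-ai/wuerstchen", torch_dtype=torch.float16
    ).to("cuda")

    # Stage C: text -> highly compressed image latents
    prior_output = prior_pipeline(
        prompt=prompt,
        width=width,
        height=height,
        timesteps=DEFAULT_STAGE_C_TIMESTEPS,
        guidance_scale=4.0,
    )

    # Stages B + A: compressed latents -> larger latent space -> pixels
    return decoder_pipeline(
        image_embeddings=prior_output.image_embeddings,
        prompt=prompt,
        guidance_scale=0.0,
        output_type="pil",
    ).images


# Latent grid for the 1024 x 1536 images generated in this notebook
print(stage_c_latent_grid(1024, 1536))  # -> (24, 36)
```

Under these assumptions, a 1024 x 1536 image is represented by a Stage C latent grid of only 24 x 36 positions, which is where the savings in training and inference cost come from.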

---
**Documentation**

https://huggingface.co/docs/diffusers/v0.27.2/en/api/pipelines/wuerstchen#w%C3%BCrstchen-overview

**Paper**

[Pernias, P., et al. (2023): Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models](https://arxiv.org/abs/2306.00637)

## Setup

In [None]:
%env HF_HOME=/cluster/user/ehoemmen/.cache
%env HF_DATASETS_CACHE=/cluster/user/ehoemmen/.cache
%env TRANSFORMERS_CACHE=/cluster/user/ehoemmen/.cache

In [None]:
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS
from PIL import Image
from IPython.display import display

pipe = AutoPipelineForText2Image.from_pretrained(
    "warp-ai/wuerstchen",
    cache_dir="/cluster/user/ehoemmen/.cache",
    torch_dtype=torch.float16).to("cuda")

#pipe.enable_model_cpu_offload()
#pipe.enable_sequential_cpu_offload()

<a id="wuerstchen"></a>
## 01. Würstchen - General Image Generation

In [None]:
prompt = "Red Cat playing with a ball"

#pipe.enable_model_cpu_offload()

images = pipe(
    prompt, 
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    prior_guidance_scale=4.0,
    num_images_per_prompt=3,
).images

# Stitch the images together side-by-side
total_width = sum(img.width for img in images)
max_height = max(img.height for img in images)
stitched_image = Image.new('RGB', (total_width, max_height))

x_offset = 0
for img in images:
    stitched_image.paste(img, (x_offset, 0))
    x_offset += img.width

# Display the stitched image
display(stitched_image)
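
The stitching steps above are repeated in every generation cell of this notebook. As a sketch of how they could be factored out, the helper below separates the layout arithmetic from the actual pasting. The names `side_by_side_layout` and `stitch_images` are hypothetical and not part of the original notebook.

```python
def side_by_side_layout(sizes):
    """Compute canvas size and x-offsets for placing images left to right.

    `sizes` is a list of (width, height) tuples.
    """
    offsets = []
    x_offset = 0
    for width, _ in sizes:
        offsets.append(x_offset)
        x_offset += width
    max_height = max((height for _, height in sizes), default=0)
    return (x_offset, max_height), offsets


def stitch_images(images):
    """Paste PIL images side by side onto one RGB canvas."""
    from PIL import Image  # local import: only needed when actually stitching

    canvas_size, offsets = side_by_side_layout([img.size for img in images])
    stitched = Image.new("RGB", canvas_size)
    for img, x_offset in zip(images, offsets):
        stitched.paste(img, (x_offset, 0))
    return stitched


# Layout for three 1024 x 1536 images, as generated above
print(side_by_side_layout([(1024, 1536)] * 3))  # -> ((3072, 1536), [0, 1024, 2048])
```

With such a helper, the display step of each cell would reduce to `display(stitch_images(images))`.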

#### Prompt Understanding - Assigning Two Colors

In [None]:
prompt = "Red Cat playing with a green ball"

#pipe.enable_model_cpu_offload()

images = pipe(
    prompt, 
    width=1024,
    height=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    prior_guidance_scale=4.0,
    num_images_per_prompt=3,
).images

# Stitch the images together side-by-side
total_width = sum(img.width for img in images)
max_height = max(img.height for img in images)
stitched_image = Image.new('RGB', (total_width, max_height))

x_offset = 0
for img in images:
    stitched_image.paste(img, (x_offset, 0))
    x_offset += img.width

# Display stitched image
display(stitched_image)

<a id="cornflakes"></a>
## 02. Honey Cornflakes - Test

In [None]:
prompt = "Honey Flavoured Cornflakes, Yellow Packaging design, cute bees, food photography, mockup"
#negative_prompt = "realistic photo"

n_images = 3

#pipe.enable_model_cpu_offload()

images = pipe(
    prompt, 
    width=1024,
    height=1536,
    #negative_prompt = negative_prompt,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    prior_guidance_scale=10.0, 
    num_images_per_prompt=n_images,
).images

# Stitch the images together
total_width = sum(img.width for img in images)
max_height = max(img.height for img in images)
stitched_image = Image.new('RGB', (total_width, max_height))

x_offset = 0
for img in images:
    stitched_image.paste(img, (x_offset, 0))
    x_offset += img.width

# Display stitched images
display(stitched_image)

<a id="text"></a>
## 03. Text Generation

In [None]:
prompt = "street sign that reads 'Welcome to New York'"
negative_prompt = "realistic photo"

n_images = 3

#pipe.enable_model_cpu_offload()

images = pipe(
    prompt, 
    width=1024,
    height=1536,
    negative_prompt=negative_prompt,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
    prior_guidance_scale=10.0, 
    num_images_per_prompt=n_images,
).images

# Stitch the images together
total_width = sum(img.width for img in images)
max_height = max(img.height for img in images)
stitched_image = Image.new('RGB', (total_width, max_height))

x_offset = 0
for img in images:
    stitched_image.paste(img, (x_offset, 0))
    x_offset += img.width

# Display stitched images
display(stitched_image)

<a id="keyfind"></a>

## 04. Key Findings

Würstchen produces very good, aesthetically pleasing results. It is a good model for generating ideas, e.g. in a practical workshop setting. The cornflakes packaging is of much higher quality, especially compared to SDXL-Turbo.

Overall, however, I have tested Würstchen only briefly. In addition, a follow-up version already exists: [<u>**Stable Cascade**</u> (Würstchen V3)](../3.0_fast_image_generation/04_stable_cascade.ipynb)
