# GPU-vRAM Usage Estimation for Diffusion Models
## Objective
Derive an analytical equation to estimate peak vRAM usage during inference for the `stable-diffusion-v1-5/stable-diffusion-v1-5` for arbitrary input image sizes.

## Background
vRAM consumption during diffusion model inference differs significantly from model size on disk. Peak memory depends on:
 - Model weights (fixed)
 - Intermediate activations (vary with image dimensions and prompt length)
 - Framework overhead (CUDA kernels, workspace buffers)
 - Attention mechanism memory scaling (O(N²) with sequence length)

Where:
 - `H`, `W` = input image height and width
 - `prompt_length` = tokenized prompt length
 - Identify any additional factors affecting vRAM

## Requirements
 - Analyze the architecture: Understand UNet, VAE, CLIP text encoder, and how tensors flow through the pipeline
 - Account for precision: Assume `FP16` (2 bytes/parameter)
 - Model fully on GPU: Ignore pipeline.enable_model_cpu_offload() in your equation
 - Peak, not average: Find the stage with maximum memory allocation
 - Document assumptions: Clearly state what you include/exclude (e.g., gradient storage, optimizer states)

## Deliverables
 - Equation with explanation of each term
 - Derivation notes showing how you arrived at each component
 - Validation (optional but encouraged): Compare equation predictions against actual nvidia-smi measurements using the provided test code

In [1]:
# cuda torch verification
import torch
print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(0))

True
0
NVIDIA GeForce RTX 4060 Laptop GPU


In [2]:
# pip install torch torchvision diffusers['torch'] transformers accelerate

from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline = pipeline.to("cuda" if torch.cuda.is_available() else "cpu")

# Uncomment this if you have limited GPU vRAM (although, this assignment can be done without any GPU use!)
pipeline.enable_model_cpu_offload()

# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()

# prepare image
img_src = [{
    "url": "./data/balloon--low-res.jpeg",
    "prompt": "aerial view, colorful hot air balloon, lush green forest canopy, springtime, warm climate, vibrant foliage, soft sunlight, gentle shadow, white birds flying alongside, harmony, freedom, bright natural colors, serene atmosphere, highly detailed, realistic, photorealistic, cinematic lighting"
}, {
    'url': "./data/bench--high-res.jpg",
    'prompt': "photorealistic, high resolution, realistic lighting, natural shadows, detailed textures, lush green grass, wooden bench with grain detail, expansive valley, agricultural fields, blue-toned mountains, fluffy cumulus clouds, wispy cirrus clouds, bright blue sky, clear sunny day, soft sunlight, tranquil atmosphere, cinematic realism"
}, {
    'url': "./data/groceries--low-res.jpg",
    'prompt': "cartoon style, bold outlines, simplified shapes, vibrant colors, playful atmosphere, exaggerated proportions, stylized SUV trunk, whimsical paper grocery bags, fresh produce with bright highlights, baguette with cartoon detail, cheerful parking area, greenery with simplified textures, sunny day, lighthearted mood, 2D illustration, animated landscape aesthetic"
}, {
    'url': "./data/truck--high-res.jpg",
    'prompt': "Michelangelo style, Renaissance painting, classical composition, rich earthy tones, detailed brushwork, divine atmosphere, expressive lighting, monumental presence, artistic grandeur, fresco-inspired texture, high contrast shadows, timeless aesthetic"
    #quirk noticed, both images very dissimiliar, neither does a truck fit the description, objectively performs worse than every other prompt/image pair.
}]

results = list()

# This for loop is meant to demonstrate that the models' vRAM usage depends
# on Image-size and prompt length (among other factors). You may observe the
# vRAM usage while the model is running by executing the following command
# in a separate terminal and monitoring the changes in vRAM usage:
#    ```shell
#    watch -n 1.0 nvidia-smi
#    ```
#
# You may modify this for loop according to your needs.
for _src in img_src:
    init_image = load_image(_src.get('url'))
    prompt = _src.get('prompt')

    # pass prompt and image to pipeline
    image = pipeline(prompt, image=init_image, guidance_scale=5.0).images[0]
    results.append(make_image_grid([init_image, image], rows=1, cols=2))

results[0].show()

  from .autonotebook import tqdm as notebook_tqdm
Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading pipeline components...: 100%|██████████| 7/7 [00:00<00:00,  8.84it/s]
100%|██████████| 40/40 [00:04<00:00,  8.51it/s]
100%|██████████| 40/40 [04:16<00:00,  6.40s/it]
100%|██████████| 40/40 [00:10<00:00,  3.81it/s]
100%|██████████| 40/40 [01:28<00:00,  2.22s/it]


In [6]:
import os

os.makedirs("output", exist_ok=True)

for i, res in enumerate(results):
    res.save(f"./output/result_{i}.png")

In [7]:
text_encoder=pipeline.text_encoder
unet=pipeline.unet
vae=pipeline.vae

In [8]:
# total parameters of each model
def count_params(model):
    return sum(p.numel() for p in model.parameters())

clip_params=count_params(text_encoder)
unet_params=count_params(unet)
vae_params=count_params(vae)

clip_weights_mb=(clip_params*2)/(1024*1024)
unet_weights_mb=(unet_params*2)/(1024*1024)
vae_weights_mb=(vae_params*2)/(1024*1024)

total_weights_mb=clip_weights_mb+unet_weights_mb+vae_weights_mb

In [9]:
# referencing vae configuration values below
vae_config=vae.config
latent_channels=vae_config.latent_channels #verified as 4

In [10]:
scaling_factor=vae.config.scaling_factor
vae_scale_factor=1/scaling_factor #used as 8 in papers, but here choosing from model actual config

vae_block_channels=vae.config.block_out_channels

In [11]:
vae_block_channels

[128, 256, 512, 512]

In [12]:
# referencing config of clip below
clip_config=text_encoder.config
clip_hidden_dim=clip_config.hidden_size
clip_num_layers=clip_config.num_hidden_layers

In [13]:
unet_config=unet.config
unet_in_channels=unet_config.in_channels #4 for latent space
unet_block_out_channels=unet_config.block_out_channels
unet_down_block_types=unet_config.down_block_types
unet_up_block_types=unet_config.up_block_types
unet_attention_head_dim=unet_config.attention_head_dim

In [14]:
# def f(h: int, w: int, prompt_length: int, guidance_scale: float=7.5, **kwargs):
#     """
#     :param h: height of input image in pixels
#     :param w: width of input image in pixels
#     :param prompt_length: length of input prompt in number tokens generated after tokenizing the input-prompt.
#     :param kwargs: any additional factors needed for this computation (this is for your use)

#     Args:
#         guidance_scale (float, optional): The guidance scale as defined in Classifier-Free Diffusion Guidance. Defaults to 7.5.
#     """
#     latent_h=h//vae_scale_factor
#     latent_w=w//vae_scale_factor

#     cfg_multiplier=2 if guidance_scale>1.0 else 1
#     text_embeddings_mb=(cfg_multiplier*prompt_length*clip_hidden_dim*2)/(1024*1024)

#     # clip attention o(prompt length squared) for 12 layers
#     clip_attention_mb=(12*prompt_length*prompt_length*2)/(1024*1024)
#     clip_activations_mb=text_embeddings_mb+clip_attention_mb

#     # coarse approximation of unet feature maps at 4 levels
#     scales=[
#         (latent_h, latent_w, unet_block_out_channels[0]),
#         (latent_h//2, latent_w//2, unet_block_out_channels[1]),
#         (latent_h//4, latent_w//4, unet_block_out_channels[2]),
#         (latent_h//8, latent_w//8, unet_block_out_channels[3])
#     ]

#     unet_features_mb=0
#     for sh, sw, sc in scales:
#         f_m=cfg_multiplier*sh*sw*sc
#         unet_features_mb+=(f_m*2)/(1024*1024)

#     cross_attention_mb=0
#     for sh, sw, sc in scales:
#         spatial_tokens=sh*sw

#         num_head=sc//unet_attention_head_dim
#         attention_map=cfg_multiplier*spatial_tokens*prompt_length*num_head
#         cross_attention_mb+=(attention_map*2)/(1024*1024)

#     self_attention_mb=0
#     for sh, sw, sc in scales[:2]: # only last two levels have self-attention layers
#         spatial_tokens=sh*sw

#         num_head=sc//unet_attention_head_dim
#         attention_map=cfg_multiplier*spatial_tokens*spatial_tokens*num_head
#         self_attention_mb+=(attention_map*2)/(1024*1024)

#     unet_activations_mb=unet_features_mb+cross_attention_mb+self_attention_mb

#     vae_scales=[
#     (latent_h, latent_w, vae_block_channels[0]),
#     (latent_h*2, latent_w*2, vae_block_channels[1]),
#     (latent_h*4, latent_w*4, vae_block_channels[2]),
#     (latent_h*8, latent_w*8, vae_block_channels[3]),
#     (h, w, 3)
#     ]

#     vae_activations_mb=0
#     for sh, sw, sc in vae_scales:
#         f_m=2*sh*sw*sc
#         vae_activations_mb+=(f_m*2)/(1024*1024)
    
#     #not including in pytorch and cuda overhead as unreliable and no direct calculation from pytorch

#     peak_vram_mb=(
#         total_weights_mb+
#         clip_activations_mb+
#         unet_activations_mb
#         # (vae_activations_mb) vae decode step happens after denoising completes and not during it, so cannot contribute to peak memory so only activates after Unet has been cleared
#     )

#     return peak_vram_mb


# Massively incorrect results even if accounting for optimizations and xformers.

## Your Task
Derive a formula:

---

In [None]:
# def f(h: int, w: int, prompt_length: int, guidance_scale: float=5.0, **kwargs):
#     """
#     :param h: height of input image in pixels
#     :param w: width of input image in pixels
#     :param prompt_length: length of input prompt in number tokens generated after tokenizing the input-prompt.
#     :param kwargs: any additional factors needed for this computation (this is for your use)

#     Args:
#         guidance_scale (float, optional): The guidance scale as defined in Classifier-Free Diffusion Guidance. Defaults to 5.0.
#     """
    
#     #Model works well for small images, but fails for larger images, thus pointing to growth w.r.t size and unrealistic constant vram demand w.r.t size.

#     latent_h=h//vae_scale_factor
#     latent_w=w//vae_scale_factor

#     cfg_multiplier=2 if guidance_scale>1.0 else 1
#     text_embeddings_mb=(cfg_multiplier*prompt_length*clip_hidden_dim*2)/(1024*1024)

#     #removing CLIP attention, happens once at the start, doesnt exist simultaneously with unet which is likely the bottleneck
#     #adding UNET features only at the peak resolution level   ----  REDACTED change --- after test run and review of results

#     scales=[
#         (latent_h, latent_w, unet_block_out_channels[0]),
#         (latent_h//2, latent_w//2, unet_block_out_channels[1]),
#         (latent_h//4, latent_w//4, unet_block_out_channels[2]),
#         (latent_h//8, latent_w//8, unet_block_out_channels[3])
#     ]
#     unet_features_mb=0
#     for sh, sw, sc in scales: # only first two levels should likely be contributing to peak
#         f_m=cfg_multiplier*sh*sw*sc
#         unet_features_mb+=(f_m*2)/(1024*1024)

#     peak_h=scales[0][0]
#     peak_w=scales[0][1]
#     peak_c=scales[0][2]

#     #removing cross attention and self attention at all scales, accounting for xformers change o(nxm) instead of o(n^2) for cross attention, and o(n) self attention instead of o(n2)
#     # adding in number of heads for cross attention calculations
#     spatial_tokens=peak_h*peak_w
#     num_heads=peak_c//unet_attention_head_dim
#     cross_attention_mb=(cfg_multiplier*spatial_tokens*prompt_length*2*num_heads)*(2)/(1024*1024)
#     self_attention_mb=(cfg_multiplier*spatial_tokens*2)/(1024*1024)

#     #accounting for residual connections in the UNET
#     residual_activations_mb=unet_features_mb*0.3

#     unet_activations_mb=unet_features_mb+cross_attention_mb+self_attention_mb+residual_activations_mb

#     #remove VAE entirely, but retain latent encoding by VAE
#     vae_latent_mb=(2*latent_h*latent_w*latent_channels)/(1024*1024)

#     #last-ditch effort, accounting for overhead that scales non linearly with activation size, perhaps explains failure to grow for larger images, from cuda context, caching memory allocator, etc, arbitarily choosing value of 0.3
#     overhead_mb=(unet_activations_mb+text_embeddings_mb)*0.3

#     peak_vram_mb=(
#         total_weights_mb+
#         text_embeddings_mb+
#         unet_activations_mb+
#         vae_latent_mb+
#         overhead_mb
#     )

#     return peak_vram_mb

In [33]:
def f(h: int, w: int, prompt_length: int, guidance_scale: float=5.0, **kwargs):
    """
    :param h: height of input image in pixels
    :param w: width of input image in pixels
    :param prompt_length: length of input prompt in number tokens generated after tokenizing the input-prompt.
    :param kwargs: any additional factors needed for this computation (this is for your use)

    Args:
        guidance_scale (float, optional): The guidance scale as defined in Classifier-Free Diffusion Guidance. Defaults to 5.0.
    """
    
    #Model works well for small images, but fails for larger images, thus pointing to growth w.r.t size and unrealistic constant vram demand w.r.t size.

    latent_h=h//vae_scale_factor
    latent_w=w//vae_scale_factor

    cfg_multiplier=2 if guidance_scale>1.0 else 1
    text_embeddings_mb=(cfg_multiplier*prompt_length*clip_hidden_dim*2)/(1024*1024)

    #removing CLIP attention, happens once at the start, doesnt exist simultaneously with unet which is likely the bottleneck
    #adding UNET features only at the peak resolution level   ----  REDACTED change --- after test run and review of results

    scales=[
        (latent_h, latent_w, unet_block_out_channels[0]),
        (latent_h//2, latent_w//2, unet_block_out_channels[1]),
        (latent_h//4, latent_w//4, unet_block_out_channels[2]),
        (latent_h//8, latent_w//8, unet_block_out_channels[3])
    ]
    unet_features_mb=0
    for sh, sw, sc in scales: # only first two levels should likely be contributing to peak
        f_m=cfg_multiplier*sh*sw*sc
        unet_features_mb+=(f_m*2*3)/(1024*1024) # ACCOUNTING FOR RESIDUAL CONNECTIONS HERE INPUT, OUTPUT, SKIP

    peak_h=scales[0][0]
    peak_w=scales[0][1]
    peak_c=scales[0][2]

    # ---- REDACTED ---- removing cross attention and self attention at all scales, accounting for xformers change o(nxm) instead of o(n^2) for cross attention, and o(n) self attention instead of o(n2)
    # adding in number of heads for cross attention calculations
    spatial_tokens=peak_h*peak_w
    num_heads=peak_c//unet_attention_head_dim


    # ACCOUNT FOR QKV CACHE and MATRICES
    total_attention_mb = 0
    for sh, sw, sc in scales[:2]:  # First 2 scales typically have attention blocks
        spatial_tokens = sh * sw
        
        # Cross-attention: needs Q (from spatial), K, V (from text), and output
        # Q: [cfg_mult, spatial_tokens, channels]
        # K, V: [cfg_mult, prompt_length, channels] 
        # Output: [cfg_mult, spatial_tokens, channels]
        cross_attention_mb = (
            cfg_multiplier * spatial_tokens * sc +           # Q projection
            cfg_multiplier * prompt_length * sc * 2 +        # K, V projections
            cfg_multiplier * spatial_tokens * sc             # Attention output
        ) * 2 / (1024 * 1024)
        
        # Self-attention: Q, K, V all from spatial tokens
        # Even with xformers/flash attention, we need Q, K, V matrices
        self_attention_mb = (
            cfg_multiplier * 3 * spatial_tokens * sc +       # Q, K, V projections
            cfg_multiplier * spatial_tokens * sc             # Attention output
        ) * 2 / (1024 * 1024)
        
        total_attention_mb += cross_attention_mb + self_attention_mb


    unet_activations_mb=unet_features_mb+total_attention_mb

    #remove VAE entirely, but retain latent encoding by VAE
    vae_latent_mb=(2*latent_h*latent_w*latent_channels)/(1024*1024)

    #last-ditch effort, accounting for overhead that scales non linearly with activation size, perhaps explains failure to grow for larger images, from cuda context, caching memory allocator, etc, arbitarily choosing value of 0.3
    overhead_mb=(unet_activations_mb+text_embeddings_mb)*0.3

    peak_vram_mb=(
        total_weights_mb+
        text_embeddings_mb+
        unet_activations_mb+
        vae_latent_mb+
        overhead_mb
    )

    return peak_vram_mb

---

## Verification

In [34]:
results=[]

for idx in range(len(img_src)):
    # clear cache before measurement
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    test_case=img_src[idx]
    init_image = load_image(test_case.get('url'))
    prompt=test_case.get('prompt')

    h, w=init_image.size[1], init_image.size[0]
    tokenized_prompt=pipeline.tokenizer(prompt, return_tensors="pt")
    prompt_length=tokenized_prompt.input_ids.shape[1]

    image=pipeline(prompt, image=init_image, guidance_scale=7.5).images[0]

    actual_peak_mb=torch.cuda.max_memory_allocated()/ (1024*1024)
    predicted_peak_mb=f(h, w, prompt_length)
    delta=actual_peak_mb - predicted_peak_mb

    results.append({
    "idx": idx,
    "h": h,
    "w": w,
    "prompt_length": prompt_length,
    "predicted_peak_mb": predicted_peak_mb,
    "actual_peak_mb": actual_peak_mb,
    "delta_mb": delta
    })

100%|██████████| 40/40 [00:04<00:00,  9.18it/s]
100%|██████████| 40/40 [04:21<00:00,  6.53s/it]
100%|██████████| 40/40 [00:10<00:00,  3.79it/s]
100%|██████████| 40/40 [01:28<00:00,  2.22s/it]


In [35]:
results

[{'idx': 0,
  'h': 380,
  'w': 396,
  'prompt_length': 55,
  'predicted_peak_mb': 2147.5687465667725,
  'actual_peak_mb': 2490.1533203125,
  'delta_mb': 342.58457374572754},
 {'idx': 1,
  'h': 2048,
  'w': 2048,
  'prompt_length': 64,
  'predicted_peak_mb': 5215.882328414917,
  'actual_peak_mb': 6906.06591796875,
  'delta_mb': 1690.1835895538334},
 {'idx': 2,
  'h': 534,
  'w': 800,
  'prompt_length': 66,
  'predicted_peak_mb': 2354.8342205047607,
  'actual_peak_mb': 2619.16748046875,
  'delta_mb': 264.33325996398935},
 {'idx': 3,
  'h': 1200,
  'w': 1800,
  'prompt_length': 43,
  'predicted_peak_mb': 3664.287310409546,
  'actual_peak_mb': 5771.58837890625,
  'delta_mb': 2107.301068496704}]

### **model consistently fails to grow enough for larger predictions, near margin of error for smaller images** 
- temporarily resolved by unbacked CUDA+PyTorch overhead assumption (resolved to some extent, issue still exists)

### **Issues**
- Rectangular images might need some form of padding for better performance
- Unsure of perhaps existence of inverse relationship between prompt token length vs vram consumption
- above 2 points, focused on idx 3, as it breaks the previously thought issue of predictions growing faultier as image gets larger

---

## Tips
- Although no GPU is needed to accomplish this task (analyze code/architecture)
- Use PyTorch documentation and model architecture inspection

# Evaluation Criteria
- Correctness: Formula accounts for major memory consumers
- Completeness: All image-dependent and prompt-dependent factors identified
- Rigor: Derivation shows understanding of PyTorch memory model and diffusion architecture
- Clarity: Equation is readable and well-documented

---
### Research insights into building equation
Diffusion model consists of
    - CLIP encoder
    - UNet
    - VAE Decoder

These have the highest number of parameters thus affect the vram occupancy the most

1. CLIP encoder
- Convert text into features
- So basically context embeddings, like used in RNNs and LSTMs
- standard transformer architecture with mhsa (multihead self attention)
- 123M parameters

2. UNet
- Starts via noise
- denoises image via attention with text embeddings
- (not the standard conv UNet as used for convolution)
- 859M parameters
- likely consumes the most amount of vram

3. VAE decoder
- intractible density function functioning on latent space as produced by Unet
- guides it back into pixel space
- 83.65M parameters

---
Rough work

In [18]:
print(pipeline)

StableDiffusionImg2ImgPipeline {
  "_class_name": "StableDiffusionImg2ImgPipeline",
  "_diffusers_version": "0.35.2",
  "_name_or_path": "stable-diffusion-v1-5/stable-diffusion-v1-5",
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "image_encoder": [
    null,
    null
  ],
  "requires_safety_checker": true,
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}



In [19]:
print(vae.config)

FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types', ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']), ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2), ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 512), ('scaling_factor', 0.18215), ('shift_factor', None), ('latents_mean', None), ('latents_std', None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True), ('mid_block_add_attention', True), ('_use_default_values', ['latents_mean', 'mid_block_add_attention', 'scaling_factor', 'shift_factor', 'use_quant_conv', 'latents_std', 'force_upcast', 'use_post_quant_conv']), ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.6.0'), ('_name_or_path', '/home/arnab/.cache/huggingface/hub/models--stable-diffusion-v1-5--stable-diffusion-v1-5/snaps

In [20]:
print(text_encoder.config)

CLIPTextConfig {
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "dtype": "float16",
  "eos_token_id": 2,
  "hidden_act": "quick_gelu",
  "hidden_size": 768,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 77,
  "model_type": "clip_text_model",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "projection_dim": 768,
  "transformers_version": "4.57.1",
  "vocab_size": 49408
}



In [21]:
print(unet.config)

FrozenDict([('sample_size', 64), ('in_channels', 4), ('out_channels', 4), ('center_input_sample', False), ('flip_sin_to_cos', True), ('freq_shift', 0), ('down_block_types', ['CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D']), ('mid_block_type', 'UNetMidBlock2DCrossAttn'), ('up_block_types', ['UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D']), ('only_cross_attention', False), ('block_out_channels', [320, 640, 1280, 1280]), ('layers_per_block', 2), ('downsample_padding', 1), ('mid_block_scale_factor', 1), ('dropout', 0.0), ('act_fn', 'silu'), ('norm_num_groups', 32), ('norm_eps', 1e-05), ('cross_attention_dim', 768), ('transformer_layers_per_block', 1), ('reverse_transformer_layers_per_block', None), ('encoder_hid_dim', None), ('encoder_hid_dim_type', None), ('attention_head_dim', 8), ('num_attention_heads', None), ('dual_cross_attention', False), ('use_linear_projection', False), ('class_embed_type', None), ('addition_emb

In [22]:
unet_config.layers_per_block

2