# Lab 4.1.2: Image Generation with Diffusion Models

**Module:** 4.1 - Multimodal AI  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how diffusion models generate images from noise
- [ ] Use SDXL to generate high-quality images from text prompts
- [ ] Apply ControlNet for guided image generation
- [ ] Use image-to-image translation to transform existing images
- [ ] Optimize generation for DGX Spark's capabilities

---

## 📚 Prerequisites

- Completed: Lab 4.1.1 (Vision-Language Models)
- Knowledge of: Basic neural networks, PyTorch
- Running in: NGC PyTorch container

---

## 🌍 Real-World Context

Image generation has transformed creative industries:

- **Design**: Rapid prototyping of product concepts
- **Marketing**: Generating campaign imagery at scale
- **Gaming**: Creating game assets and concept art
- **Architecture**: Visualizing building designs
- **Fashion**: Designing and previewing clothing

---

## 🧒 ELI5: How Do Diffusion Models Work?

> **Imagine you have a beautiful painting, and you slowly add TV static noise to it** - like when an old TV loses signal. After adding lots of noise, the painting just looks like pure static.
>
> Diffusion models learn to **reverse this process**! They start with pure noise (static) and gradually remove it, step by step, until a beautiful image appears.
>
> The magic part: By telling the model what image you want ("a cat wearing sunglasses"), it removes the noise in a way that reveals that specific image!
>
> **In AI terms:** The model is trained on millions of images that have been progressively noised. It learns to predict and remove noise at each step, conditioned on a text description that guides what image should emerge.
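
The "add static, then learn to remove it" idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the four numbers stand in for pixel values, the linear beta schedule is illustrative rather than SDXL's actual schedule, and the helper names (`make_alpha_bars`, `add_noise`, `remove_noise`) are hypothetical. The last step recovers the clean signal only because it is handed the exact noise, which is the quantity the trained network has to predict.

```python
import math
import random
from typing import List

def make_alpha_bars(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Cumulative fraction of the original signal kept after each noising step."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        # Linear schedule: a little noise early on, more noise at later steps
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0: List[float], noise: List[float], alpha_bar: float) -> List[float]:
    """Forward process: blend the clean signal with Gaussian noise."""
    keep, mix = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [keep * x + mix * n for x, n in zip(x0, noise)]

def remove_noise(xt: List[float], noise: List[float], alpha_bar: float) -> List[float]:
    """Invert add_noise, given the exact noise that was mixed in."""
    keep, mix = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - mix * n) / keep for x, n in zip(xt, noise)]

rng = random.Random(0)
x0 = [0.8, -0.3, 0.5, 0.1]                      # stand-in for four pixel values
noise = [rng.gauss(0.0, 1.0) for _ in x0]
alpha_bars = make_alpha_bars(1000)

# The signal fades as t grows; by the last step almost nothing of x0 is left
for t in (0, 499, 999):
    xt = add_noise(x0, noise, alpha_bars[t])
    print(f"t={t:3d}  signal kept={alpha_bars[t]:.4f}  x_t={[round(v, 2) for v in xt]}")

# Knowing the noise is enough to get the clean values back
x_mid = add_noise(x0, noise, alpha_bars[499])
recovered = remove_noise(x_mid, noise, alpha_bars[499])
print("recovered:", [round(v, 6) for v in recovered])
```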

---

## Part 1: Environment Setup

Let's set up our environment for image generation.

In [None]:
# Check GPU and environment
import torch
import gc

print("=" * 50)
print("DGX Spark Environment Check")
print("=" * 50)

if torch.cuda.is_available():
    device = torch.cuda.get_device_properties(0)
    print(f"GPU: {device.name}")
    print(f"Total Memory: {device.total_memory / 1024**3:.1f} GB")
    print(f"Compute Capability: {device.major}.{device.minor}")
    
    # Check for Blackwell (compute capability 10.x or higher;
    # the DGX Spark's GB10 is expected to report 12.x)
    if device.major >= 10:
        print("✅ Blackwell GPU detected - optimal for bfloat16!")
else:
    print("❌ No GPU detected!")

In [None]:
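# Install dependencies (run once)
# Quote the version specifiers so the shell does not treat ">" as a redirect
# Part 5 also imports cv2 (the opencv-python package) if your container lacks it
# !pip install "diffusers>=0.27.0" "transformers>=4.45.0" "accelerate>=0.27.0" safetensors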

In [None]:
# Import libraries
import time
from pathlib import Path
from typing import Optional, List

import torch
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

# Helper function to display images
def show_image(image: Image.Image, title: str = ""):
    """Display a single image."""
    plt.figure(figsize=(10, 10))
    plt.imshow(image)
    plt.axis('off')
    if title:
        plt.title(title, fontsize=12)
    plt.tight_layout()
    plt.show()

def show_images(images: List[Image.Image], titles: Optional[List[str]] = None, cols: int = 3):
    """Display multiple images in a grid."""
    n = len(images)
    rows = (n + cols - 1) // cols
    
    # squeeze=False always returns a 2D array of axes, even for a single image
    fig, axes = plt.subplots(rows, cols, figsize=(5*cols, 5*rows), squeeze=False)
    axes = axes.flatten()
    
    for i, (ax, img) in enumerate(zip(axes, images)):
        ax.imshow(img)
        ax.axis('off')
        if titles and i < len(titles):
            ax.set_title(titles[i], fontsize=10)
    
    # Hide empty subplots
    for ax in axes[n:]:
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()

print("✅ Libraries imported!")

---

## Part 2: Basic Image Generation with SDXL

Stable Diffusion XL (SDXL) is an open-weights text-to-image model designed to generate high-quality images at a native resolution of 1024x1024.

### 🧒 ELI5: What is SDXL?

> **SDXL is like a super talented artist** who has studied millions of images and can create almost anything you describe.
>
> When you give it a "prompt" (text description), it:
> 1. **Understands** what you want using two CLIP-based text encoders
> 2. **Starts** with random noise
> 3. **Refines** the noise over many steps, guided by your description
> 4. **Produces** a detailed 1024x1024 image
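
The four steps above map onto a short sampling loop. The sketch below is a toy, not SDXL: it uses a DDIM-style deterministic update on a three-number "image", and an oracle that already knows the target stands in for the trained, text-conditioned noise predictor. The function name `toy_sampler` and the noise schedule are assumptions made for illustration. Because the oracle is perfect, the loop lands on the target exactly; a real model only approximates the noise, which is why step count affects quality.

```python
import math
import random
from typing import List

def toy_sampler(target: List[float], num_steps: int = 30, seed: int = 42) -> List[float]:
    """Denoise from random noise toward `target` with a DDIM-style update."""
    rng = random.Random(seed)
    # Fraction of clean signal at each step: almost none at first, all of it at the end
    alpha_bars = [0.001 + 0.999 * (i / num_steps) ** 2 for i in range(num_steps)] + [1.0]

    x = [rng.gauss(0.0, 1.0) for _ in target]            # start with random noise
    for i in range(num_steps):
        ab_t, ab_next = alpha_bars[i], alpha_bars[i + 1]
        # Oracle noise prediction (a real pipeline calls the UNet with the text embedding here)
        eps = [(xi - math.sqrt(ab_t) * ti) / math.sqrt(1.0 - ab_t) for xi, ti in zip(x, target)]
        # Estimate the clean image implied by that noise prediction
        x0_pred = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t) for xi, e in zip(x, eps)]
        # Step to the next, less noisy level
        x = [math.sqrt(ab_next) * p + math.sqrt(1.0 - ab_next) * e for p, e in zip(x0_pred, eps)]
    return x

target = [0.8, -0.3, 0.5]
result = toy_sampler(target)
print("target:", target)
print("result:", [round(v, 6) for v in result])
```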

In [None]:
from diffusers import StableDiffusionXLPipeline

# Load SDXL - uses about 8GB VRAM
print("Loading SDXL...")
print("(First run downloads ~6GB model)")
start_time = time.time()

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,  # Optimal for Blackwell
    use_safetensors=True,
)

# Move to GPU
pipe = pipe.to("cuda")

# Enable VAE slicing: decode batched latents one image at a time to lower peak memory
pipe.enable_vae_slicing()

print(f"\n✅ Loaded in {time.time() - start_time:.1f}s")
print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

In [None]:
# Your first image generation!
prompt = "A majestic mountain landscape at golden hour, with a crystal-clear lake reflecting snow-capped peaks, photorealistic, 8k"

# Negative prompt helps avoid unwanted elements
negative_prompt = "blurry, low quality, distorted, ugly, watermark, text"

print(f"🎨 Generating: '{prompt[:50]}...'")
start_time = time.time()

# Generate with fixed seed for reproducibility
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,  # More steps = better quality
    guidance_scale=7.5,      # How closely to follow the prompt
    generator=generator,
).images[0]

elapsed = time.time() - start_time
print(f"\n⏱️ Generated in {elapsed:.1f}s")

show_image(image, prompt[:80] + "...")

### 🔍 Understanding the Parameters

| Parameter | What it does | Typical values |
|-----------|--------------|----------------|
| `num_inference_steps` | More steps = finer details, slower | 20-50 |
| `guidance_scale` | Higher = follows prompt more strictly | 5-15 |
| `generator` | Seeded `torch.Generator`; the same seed and settings reproduce the same image | Any integer seed |
| `negative_prompt` | What to avoid in the image | "blurry, ugly..." |

---

In [None]:
# Let's explore different prompts!

prompts = [
    "A cyberpunk city at night, neon lights, rain-slicked streets, cinematic",
    "A cozy cottage in an enchanted forest, fairy lights, magical atmosphere",
    "An astronaut riding a horse on Mars, digital art, highly detailed",
]

images = []
for i, prompt in enumerate(prompts):
    print(f"Generating {i+1}/{len(prompts)}: {prompt[:40]}...")
    
    generator = torch.Generator(device="cuda").manual_seed(42 + i)
    
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=25,
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    
    images.append(image)

# Display all images
show_images(images, [p[:40] + "..." for p in prompts])

### ✋ Try It Yourself

Generate an image with your own prompt! Try:
- Different artistic styles ("oil painting", "watercolor", "anime")
- Different subjects (animals, people, landscapes, abstract)
- Different moods (dark, cheerful, mysterious, serene)

In [None]:
# Your turn! Modify the prompt below
your_prompt = "YOUR CREATIVE PROMPT HERE"

# Uncomment to generate:
# generator = torch.Generator(device="cuda").manual_seed(123)
# your_image = pipe(
#     prompt=your_prompt,
#     negative_prompt=negative_prompt,
#     num_inference_steps=30,
#     guidance_scale=7.5,
#     generator=generator,
# ).images[0]
# show_image(your_image, your_prompt)

---

## Part 3: Understanding Guidance Scale and Steps

Let's visualize how different parameters affect the output.

In [None]:
# Compare different guidance scales
prompt = "A red apple on a wooden table, studio lighting"

guidance_scales = [3.0, 7.5, 12.0, 20.0]
images = []

print("Comparing guidance scales...")
for gs in guidance_scales:
    generator = torch.Generator(device="cuda").manual_seed(42)
    
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=25,
        guidance_scale=gs,
        generator=generator,
    ).images[0]
    
    images.append(image)
    print(f"  Guidance {gs}: Done")

show_images(images, [f"Guidance = {gs}" for gs in guidance_scales], cols=4)

### 🔍 What Just Happened?

- **Low guidance (3.0)**: More creative/random, may drift from the prompt
- **Medium guidance (7.5)**: Good balance - recommended default
- **High guidance (12-20)**: Very literal interpretation, often looks oversaturated or artificial
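
Under the hood, guidance scale is a simple extrapolation known as classifier-free guidance. At every step the model makes two noise predictions, one conditioned on the prompt and one on the negative (or empty) prompt, and the final prediction is pushed away from the second toward the first. The sketch below applies that formula to made-up three-element vectors; `apply_guidance` is a hypothetical helper name, and real predictions are full latent tensors.

```python
from typing import List

def apply_guidance(noise_uncond: List[float], noise_text: List[float], guidance_scale: float) -> List[float]:
    """Classifier-free guidance: uncond + scale * (text - uncond), element by element."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]

# Made-up noise predictions for illustration only
noise_uncond = [0.10, -0.20, 0.05]   # from the negative/empty prompt
noise_text = [0.30, -0.10, 0.00]     # from the actual prompt

# scale = 1 returns the text prediction unchanged; larger scales exaggerate the difference
for scale in (1.0, 3.0, 7.5, 20.0):
    guided = apply_guidance(noise_uncond, noise_text, scale)
    print(f"guidance_scale={scale:>4}: {[round(v, 3) for v in guided]}")
```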

---

In [None]:
# Compare different step counts
step_counts = [10, 20, 30, 50]
images = []
times = []

print("Comparing inference steps...")
for steps in step_counts:
    generator = torch.Generator(device="cuda").manual_seed(42)
    
    start = time.time()
    image = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=steps,
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    
    elapsed = time.time() - start
    images.append(image)
    times.append(elapsed)
    print(f"  Steps {steps}: {elapsed:.1f}s")

show_images(images, [f"Steps = {s} ({t:.1f}s)" for s, t in zip(step_counts, times)], cols=4)

### 🔍 Steps vs Quality Trade-off

- **10 steps**: Fast but rough - good for previews
- **20-30 steps**: Sweet spot for most uses
- **50+ steps**: Diminishing returns, only for final output
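
To put a number on the speed side of this trade-off, you can fit a straight line to the timings collected above. Each step is one more pass through the denoiser, so total time should grow roughly linearly with step count, plus a fixed overhead for text encoding and VAE decoding. The sketch below is a plain least-squares fit; `fit_step_cost` is a hypothetical helper and the timings are made-up placeholders, so substitute your own `step_counts` and `times` lists.

```python
from typing import List, Tuple

def fit_step_cost(step_counts: List[int], times: List[float]) -> Tuple[float, float]:
    """Least-squares fit of time = overhead + per_step * steps."""
    n = len(step_counts)
    mean_x = sum(step_counts) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in step_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(step_counts, times))
    per_step = sxy / sxx
    overhead = mean_y - per_step * mean_x
    return overhead, per_step

# Placeholder measurements (seconds), not real DGX Spark results
example_steps = [10, 20, 30, 50]
example_times = [2.5, 4.5, 6.5, 10.5]

overhead, per_step = fit_step_cost(example_steps, example_times)
print(f"Fixed overhead: {overhead:.2f}s, cost per step: {per_step:.3f}s")
print(f"Predicted time for 40 steps: {overhead + per_step * 40:.1f}s")
```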

---

## Part 4: Image-to-Image Translation

Transform existing images while keeping their structure!

### 🧒 ELI5: Image-to-Image

> **Imagine tracing over a photo with colored pencils in a new style.** The basic shapes stay the same, but the style completely changes.
>
> That's img2img! You give it a starting image and a prompt, and it transforms the image to match the new description while keeping the original composition.

In [None]:
from diffusers import StableDiffusionXLImg2ImgPipeline

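# Load img2img pipeline. Note: from_pretrained loads a second copy of the base weights;
# StableDiffusionXLImg2ImgPipeline(**pipe.components) would reuse the ones already on the GPU.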
print("Loading img2img pipeline...")

img2img_pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
)
img2img_pipe = img2img_pipe.to("cuda")
img2img_pipe.enable_vae_slicing()

print("✅ Ready!")

In [None]:
# First, let's create a base image
base_prompt = "A simple sketch of a house with a tree"

generator = torch.Generator(device="cuda").manual_seed(42)
base_image = pipe(
    prompt=base_prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    guidance_scale=7.5,
    generator=generator,
).images[0]

show_image(base_image, "Base Image")

In [None]:
# Now transform it with different styles!
style_prompts = [
    "A house and tree in the style of Van Gogh's Starry Night, oil painting",
    "A house and tree as a Japanese anime scene, Studio Ghibli style",
    "A house and tree in cyberpunk style, neon lights, futuristic",
]

transformed_images = [base_image]  # Include original
titles = ["Original"]

for prompt in style_prompts:
    generator = torch.Generator(device="cuda").manual_seed(42)
    
    transformed = img2img_pipe(
        prompt=prompt,
        image=base_image,
        strength=0.75,  # How much to change (0-1)
        num_inference_steps=30,
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    
    transformed_images.append(transformed)
    titles.append(prompt.split(",")[0][:30])

show_images(transformed_images, titles, cols=2)

### 🔍 Understanding Strength Parameter

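The `strength` parameter controls how much noise is added to the input before denoising, and therefore how far the output can drift from it. It also scales the work done: roughly `num_inference_steps * strength` denoising steps actually run.
- **Near 0.0**: Almost no change (output stays close to the input)
- **0.5**: Moderate changes, keeps composition
- **0.75**: Significant transformation
- **1.0**: Complete reimagining (the input is fully noised, so its structure is mostly ignored)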
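
One practical consequence of `strength` is that img2img does not run every requested step. The pipeline skips the early part of the schedule and only denoises from the noise level that `strength` selects. The sketch below reproduces that arithmetic as I understand it from the diffusers img2img pipelines; `img2img_steps` is a hypothetical helper, not a library function.

```python
def img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps that actually run for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # The schedule is truncated: only the last `strength` fraction of steps is executed
    return min(int(num_inference_steps * strength), num_inference_steps)

for strength in (0.25, 0.5, 0.75, 1.0):
    steps = img2img_steps(30, strength)
    print(f"strength={strength:<4} -> {steps} of 30 steps run")
```

If a low strength leaves too few steps for a clean result, raising `num_inference_steps` compensates without changing how much of the input is preserved.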

---

## Part 5: ControlNet - Guided Generation

ControlNet lets you guide image generation using edges, poses, depth maps, and more!

### 🧒 ELI5: What is ControlNet?

> **Imagine you're giving an artist specific instructions:** "Paint a dog, but it should be in exactly THIS pose" (shows a stick figure).
>
> ControlNet lets you give the AI similar "guides":
> - **Edge detection**: The outlines of objects
> - **Pose estimation**: Where people/animals should be positioned
> - **Depth maps**: What should be in front vs background

In [None]:
# Clean up the img2img pipeline to free memory
del img2img_pipe
gc.collect()              # Drop lingering Python references first
torch.cuda.empty_cache()  # Then release PyTorch's cached GPU blocks

print(f"Memory after cleanup: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

In [None]:
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
import cv2

# Load ControlNet for edge detection (Canny)
print("Loading ControlNet...")

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0",
    torch_dtype=torch.bfloat16,
)

controlnet_pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.bfloat16,
)

controlnet_pipe = controlnet_pipe.to("cuda")
controlnet_pipe.enable_vae_slicing()

print(f"✅ Loaded! Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

In [None]:
def get_canny_edges(image: Image.Image, low_threshold: int = 100, high_threshold: int = 200) -> Image.Image:
    """
    Extract edge map from an image using Canny edge detection.
    """
    # Convert to numpy array
    img_array = np.array(image)
    
    # Convert to grayscale
    gray = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
    
    # Apply Canny edge detection
    edges = cv2.Canny(gray, low_threshold, high_threshold)
    
    # Convert back to RGB PIL Image
    edges_rgb = np.stack([edges] * 3, axis=-1)
    return Image.fromarray(edges_rgb)

# Use our previously generated image
control_image = get_canny_edges(base_image)

# Display edge map
show_images([base_image, control_image], ["Original", "Edge Map (Canny)"], cols=2)

In [None]:
# Generate new images guided by the edge map
controlnet_prompts = [
    "A Victorian mansion with a cherry blossom tree, sunset lighting",
    "A futuristic building with a holographic tree, sci-fi",
    "A gingerbread house with a candy cane tree, fantasy",
]

controlled_images = [control_image]
titles = ["Edge Map"]

for prompt in controlnet_prompts:
    print(f"Generating: {prompt[:40]}...")
    generator = torch.Generator(device="cuda").manual_seed(42)
    
    result = controlnet_pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        image=control_image,
        controlnet_conditioning_scale=0.5,  # How strongly to follow edges
        num_inference_steps=30,
        guidance_scale=7.5,
        generator=generator,
    ).images[0]
    
    controlled_images.append(result)
    titles.append(prompt.split(",")[0][:25])

show_images(controlled_images, titles, cols=2)

### 🔍 What Just Happened?

Notice how all generated images follow the same basic structure (house + tree) but with completely different styles! The edge map acted as a "skeleton" for the AI to paint over.

**Use cases for ControlNet:**
- Convert sketches to realistic images
- Maintain pose/composition when changing styles
- Generate variations that keep the same layout

---

## Part 6: DGX Spark Optimization Tips

Let's explore how to get the best performance from your DGX Spark!
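
Memory figures like the ones printed in the next cell can be sanity-checked from parameter counts: in bfloat16 every parameter takes 2 bytes. The sketch below does that arithmetic. The parameter counts are rough public figures assumed for illustration, and `weights_gib` is a hypothetical helper. The result covers weights only; activations, the CUDA context, and cached memory come on top, so observed usage is higher.

```python
def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold model weights, in GiB (2 bytes per parameter for bfloat16)."""
    return num_params * bytes_per_param / 1024**3

# Approximate parameter counts (assumptions, not measured values)
approx_params = {
    "SDXL base (all components)": 3.5e9,
    "Flux.1-dev (transformer only)": 12e9,
}

for name, count in approx_params.items():
    print(f"{name}: ~{weights_gib(count):.1f} GiB in bfloat16, "
          f"~{weights_gib(count, bytes_per_param=4):.1f} GiB in float32")
```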

In [None]:
# Memory and speed optimization techniques
print("📊 DGX Spark Image Generation Optimization Guide")
print("=" * 60)

# Current memory usage
allocated = torch.cuda.memory_allocated() / 1024**3
total = torch.cuda.get_device_properties(0).total_memory / 1024**3

print(f"\nCurrent Memory: {allocated:.1f}GB / {total:.1f}GB ({allocated/total*100:.1f}%)")

print("""
✅ OPTIMIZATION TECHNIQUES:

1. Use bfloat16 (native Blackwell support):
   pipe.to(torch.bfloat16)  # Already enabled!

2. VAE slicing for lower memory:
   pipe.enable_vae_slicing()  # Already enabled!

3. VAE tiling for very large images (2048x2048+):
   pipe.enable_vae_tiling()

4. Model CPU offload (if needed):
   pipe.enable_model_cpu_offload()  # Slower but saves VRAM

5. Compile with torch.compile() for a typical 10-30% speedup (varies by model; the first call is slower while it compiles):
   pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead")

6. Use smaller batch sizes:
   num_images_per_prompt=1  # Default

📈 WHAT FITS ON DGX SPARK (128GB unified memory, approximate figures):

| Model               | Memory | Fits?     |
|---------------------|--------|-----------|
| SDXL Base           | ~8GB   | ✅ Easily |
| SDXL + Refiner      | ~16GB  | ✅ Yes    |
| Flux.1-dev          | ~24GB  | ✅ Yes    |
| SD 3.5 Large        | ~20GB  | ✅ Yes    |
| Multiple models     | ~50GB  | ✅ Yes    |
""")

---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting Negative Prompt
```python
# ❌ Wrong: No negative prompt
image = pipe(prompt="A cat").images[0]  # May include artifacts

# ✅ Right: Always include negative prompt
image = pipe(
    prompt="A cat",
    negative_prompt="blurry, low quality, distorted, ugly"
).images[0]
```
**Why:** Negative prompts often improve quality by steering the model away from common failure modes. They only take effect when `guidance_scale` is above 1.

---

### Mistake 2: Using Wrong Image Size
```python
# ❌ Wrong: Non-standard size
image = pipe(prompt="...", width=1000, height=1000).images[0]  # Bad quality!

# ✅ Right: Use sizes the model was trained on
# SDXL optimal sizes: 1024x1024, 1152x896, 896x1152, 1216x832, etc.
image = pipe(prompt="...", width=1024, height=1024).images[0]
```
**Why:** SDXL was trained on a specific set of resolutions and aspect ratios, each about one megapixel. Other sizes can cause artifacts, and width and height must both be divisible by 8.
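
If your target size is not on the list, one option is to generate at the closest supported size and resize afterwards. The sketch below picks the listed size whose aspect ratio best matches a request. `nearest_sdxl_size` is a hypothetical helper; the first four sizes come from the example above, and 832x1216 is added by symmetry as an assumption.

```python
import math
from typing import List, Tuple

# (width, height) pairs; the last entry is an assumed mirror of 1216x832
SDXL_SIZES: List[Tuple[int, int]] = [
    (1024, 1024),
    (1152, 896),
    (896, 1152),
    (1216, 832),
    (832, 1216),
]

def nearest_sdxl_size(width: int, height: int) -> Tuple[int, int]:
    """Return the listed size whose aspect ratio is closest to width:height."""
    # Compare ratios on a log scale so 2:1 and 1:2 are treated symmetrically
    target = math.log(width / height)
    return min(SDXL_SIZES, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))

for request in [(1000, 1000), (1920, 1080), (1080, 1920)]:
    print(f"{request} -> {nearest_sdxl_size(*request)}")
```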

---

### Mistake 3: Too Many Steps Without Benefit
```python
# ❌ Wrong: 100 steps takes forever with minimal quality gain
image = pipe(prompt="...", num_inference_steps=100).images[0]

# ✅ Right: Sweet spot is 20-30 steps
image = pipe(prompt="...", num_inference_steps=30).images[0]
```
**Why:** Quality gains diminish rapidly after ~30 steps. Use fewer steps for previews, more for final output.

---

## 🎉 Checkpoint

You've learned:
- ✅ How diffusion models generate images from noise
- ✅ Using SDXL for high-quality text-to-image generation
- ✅ Understanding guidance scale and inference steps
- ✅ Image-to-image translation for style transfer
- ✅ ControlNet for guided generation with edge maps
- ✅ Optimization techniques for DGX Spark

---

## 🚀 Challenge (Optional)

Build a **Style Transfer Pipeline** that:
1. Takes a reference image
2. Extracts its edge map using Canny
3. Generates 5 different style variations
4. Creates a comparison grid
5. Saves all images with metadata (prompt, seed, parameters)
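
As a starting point for step 5, one simple approach is to write a JSON "sidecar" file next to each image. The sketch below is one possible layout, not a required format: `save_metadata`, the field names, and the file naming are all assumptions. It writes to a temporary directory here so it can run anywhere; in your pipeline you would pass `output_dir` and call `image.save(...)` with the same stem.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict

def save_metadata(output_dir: str, stem: str, prompt: str, seed: int, **params: Any) -> Path:
    """Write generation settings to <output_dir>/<stem>.json and return the path."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record: Dict[str, Any] = {"prompt": prompt, "seed": seed, "parameters": params}
    path = out_dir / f"{stem}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path

with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = save_metadata(
        tmp_dir,
        "style_00",
        prompt="A house and tree, watercolor",
        seed=42,
        num_inference_steps=30,
        guidance_scale=7.5,
    )
    saved = json.loads(meta_path.read_text(encoding="utf-8"))
    print(json.dumps(saved, indent=2))
```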

In [None]:
# Challenge: Your code here!

def style_transfer_pipeline(
    reference_image: Image.Image,
    styles: List[str],
    output_dir: str = "outputs"
) -> List[Image.Image]:
    """
    Apply multiple styles to a reference image using ControlNet.
    
    Args:
        reference_image: Input image to transform
        styles: List of style prompts
        output_dir: Directory to save results
        
    Returns:
        List of generated images
    """
    # Your implementation here!
    pass

---

## 📖 Further Reading

- [Stable Diffusion Paper](https://arxiv.org/abs/2112.10752)
- [SDXL Technical Report](https://arxiv.org/abs/2307.01952)
- [ControlNet Paper](https://arxiv.org/abs/2302.05543)
- [Diffusers Documentation](https://huggingface.co/docs/diffusers)
- [Prompt Engineering Guide](https://stable-diffusion-art.com/prompt-guide/)

---

## 🧹 Cleanup

In [None]:
# Clean up GPU memory
if 'pipe' in dir():
    del pipe
if 'controlnet_pipe' in dir():
    del controlnet_pipe
if 'controlnet' in dir():
    del controlnet

gc.collect()              # Drop lingering Python references first
torch.cuda.empty_cache()  # Then release PyTorch's cached GPU blocks

print("✅ Cleanup complete!")
print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")

---

## Next Steps

In the next lab, we'll build a **Multimodal RAG System** that can search across both images and text using CLIP embeddings!

➡️ Continue to [Lab 03: Multimodal RAG](./03-multimodal-rag.ipynb)