![image.png](https://i.imgur.com/a3uAqnb.png)


# Text-to-Image Generation using Stable Diffusion - Homework Assignment

![Stable Diffusion Architecture](https://miro.medium.com/v2/resize:fit:1400/1*NpQ282NJdOfxUsYlwLJplA.png)

In this homework, you will finetune a **Stable Diffusion** model to generate Naruto-style images from text descriptions. You will work with every stage of the diffusion pipeline: the VAE, the UNet, the text encoder, and the noise scheduler.

## 📌 Project Overview
- **Task**: Text-to-Naruto image generation
- **Architecture**: Stable Diffusion with UNet diffusion model
- **Dataset**: Naruto-style dataset with text descriptions
- **Goal**: Generate Naruto-style images that match the text prompts

## 📚 Learning Objectives
By completing this assignment, you will:
- Understand diffusion models and the stable diffusion pipeline
- Learn to finetune pre-trained diffusion models
- Work with VAE, UNet, text encoders, and schedulers
- Practice text-to-image generation techniques
- Handle memory constraints with large models

## 1️⃣ Dataset Setup (PROVIDED)

The Naruto-style dataset has been loaded for you. The dataset contains:
- 1,221 training images with corresponding text descriptions
- Each sample has an 'image' and 'text' field
- Images come in various sizes and must be resized to 512x512


In [2]:
from datasets import load_dataset

# Dataset already loaded
ds = load_dataset("Alex-0402/naruto-style-dataset-with-text")
print("Dataset info:", ds)
print("Number of training samples:", len(ds['train']))

# Display a sample
sample = ds['train'][0]
print("\nSample text:", sample['text'])
print("Image size:", sample['image'].size)

Dataset info: DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 1221
    })
})
Number of training samples: 1221

Sample text: a man with dark hair and brown eyes, naruto style
Image size: (1080, 1080)


## 2️⃣ Import Libraries and Configuration

**Task**: Import all necessary libraries and set up configuration parameters.

**Requirements**:
- Import diffusers, transformers, and related libraries
- Import PyTorch, PIL, numpy, and other utilities
- Set random seeds for reproducibility
- Configure hyperparameters for stable diffusion training

In [3]:
# TODO: Import all necessary libraries:
#       - torch, torch.nn, torch.optim
#       - diffusers (UNet2DConditionModel, AutoencoderKL, PNDMScheduler, DDPMScheduler, StableDiffusionPipeline)
#       - transformers (CLIPTextModel, CLIPTokenizer)
#       - PIL, numpy, matplotlib
#       - torchvision.transforms
#       - tqdm for progress bars

# TODO: Set random seeds for reproducibility (use seed=42)

# TODO: Check device availability and print

# Configuration parameters (starter values provided; adjust if you run out of memory):
MODEL_ID = "OFA-Sys/small-stable-diffusion-v0"  # Smaller stable diffusion model
IMG_SIZE = 512  # Image resolution
BATCH_SIZE = 1  # Small batch size for memory constraints
LEARNING_RATE = 1e-5  # Learning rate for finetuning
NUM_EPOCHS = 3  # Number of training epochs
INFERENCE_STEPS = 100  # Number of denoising steps during inference
GUIDANCE_SCALE = 7.5  # Classifier-free guidance scale
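
The seeding and device TODOs in this cell could be handled along the following lines. This is a minimal sketch, not the required solution: the helper name `set_seed` is an assumption introduced here for illustration.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs (set_seed is a hypothetical helper name)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        # Seed every visible GPU as well
        torch.cuda.manual_seed_all(seed)


set_seed(42)

# Prefer the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
```

Calling `set_seed` again with the same value before sampling should reproduce the same random tensors on the same hardware and library versions.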

## 3️⃣ Load Pre-trained Stable Diffusion Components

**Task**: Load all components of the stable diffusion pipeline.

**Requirements**:
- Load VAE (Variational Autoencoder) for image encoding/decoding
- Load UNet for the diffusion process
- Load text encoder and tokenizer for text conditioning
- Load noise scheduler for the diffusion process


In [4]:
# TODO: Load stable diffusion components:
#       - vae = AutoencoderKL.from_pretrained(MODEL_ID, subfolder="vae")
#       - unet = UNet2DConditionModel.from_pretrained(MODEL_ID, subfolder="unet")
#       - text_encoder = CLIPTextModel.from_pretrained(MODEL_ID, subfolder="text_encoder")
#       - tokenizer = CLIPTokenizer.from_pretrained(MODEL_ID, subfolder="tokenizer")
#       - scheduler = PNDMScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")

# TODO: Move models to device
# TODO: Freeze the VAE and text encoder (eval mode, requires_grad_(False)); only the UNet will be trained
# TODO: Print model information and parameter counts
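
For the freezing and parameter-count TODOs, one possible approach is sketched below. The helper names `count_parameters` and `freeze` are assumptions made for this example, and a tiny `nn.Linear` layer stands in for the real VAE, text encoder, or UNet so that the snippet runs without downloading any weights.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


def freeze(model: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode (hypothetical helper)."""
    model.requires_grad_(False)
    return model.eval()


# Stand-in module; in the notebook this would be vae, text_encoder, or unet
demo = nn.Linear(4, 2)
print("before freeze:", count_parameters(demo))
freeze(demo)
print("after freeze:", count_parameters(demo))
```

In the assignment, the same two helpers would be applied to `vae` and `text_encoder`, while `unet` stays trainable.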

## 4️⃣ Data Preprocessing and Custom Dataset

**Task**: Create custom dataset class and preprocessing pipeline.

**Requirements**:
- Resize images to 512x512 resolution
- Normalize images to the [-1, 1] range expected by the VAE
- Tokenize text descriptions
- Apply light data augmentation (random horizontal flip)

In [5]:
# TODO: Create NarutoDataset class inheriting from torch.utils.data.Dataset
# TODO: In __init__(self, dataset, tokenizer, size=512):
#       - Store dataset, tokenizer, and image size
#       - Define image transforms:
#         * Resize to (size, size)
#         * Random horizontal flip for augmentation
#         * ToTensor()
#         * Normalize with mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]

# TODO: Implement __len__ to return dataset length
# TODO: Implement __getitem__ to:
#       - Get image and text from dataset
#       - Apply transforms to image
#       - Tokenize text with padding and truncation
#       - Return dict with 'pixel_values' and 'input_ids'

# TODO: Create train_dataset and train_dataloader
# TODO: Print dataset info and test with one sample

## 5️⃣ Training Setup and Loss Function

**Task**: Set up the training components including optimizer and loss function.

**Requirements**:
- Create optimizer for UNet parameters only
- Implement the diffusion loss (noise prediction loss)
- Set up proper gradient scaling and mixed precision if needed
- Configure learning rate scheduling

In [6]:
# TODO: Create optimizer for UNet parameters only:
#       - optimizer = torch.optim.AdamW(unet.parameters(), lr=LEARNING_RATE)

# TODO: Create a separate noise scheduler for training (DDPM adds the noise; the PNDM scheduler loaded in Section 3 is kept for inference)
#       - noise_scheduler = DDPMScheduler.from_pretrained(MODEL_ID, subfolder="scheduler")

# TODO: Define helper functions:
#       - encode_text(text_input): tokenize and encode text to embeddings
#       - encode_image(image): encode image to latent space using VAE
#       - decode_latent(latent): decode latent back to image using VAE

# TODO: Set up gradient scaler for mixed precision training (optional)
# TODO: Initialize training tracking variables
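
The three helper functions could be sketched as follows. Unlike the signatures listed in the TODOs, this version passes the models in as arguments so the snippet is self-contained; that choice is an assumption of the example. It relies on the `diffusers` convention that latents are multiplied by `vae.config.scaling_factor` after encoding and divided by it before decoding.

```python
import torch


@torch.no_grad()
def encode_image(vae, pixel_values):
    """Encode images in [-1, 1] to scaled VAE latents."""
    latents = vae.encode(pixel_values).latent_dist.sample()
    return latents * vae.config.scaling_factor


@torch.no_grad()
def decode_latent(vae, latents):
    """Decode scaled latents back to images in [0, 1]."""
    images = vae.decode(latents / vae.config.scaling_factor).sample
    return (images / 2 + 0.5).clamp(0, 1)


@torch.no_grad()
def encode_text(tokenizer, text_encoder, prompts, device):
    """Tokenize prompts and return the text encoder's last hidden states."""
    tokens = tokenizer(
        prompts,
        padding="max_length",
        truncation=True,
        max_length=tokenizer.model_max_length,
        return_tensors="pt",
    )
    return text_encoder(tokens.input_ids.to(device))[0]
```

Because the VAE and text encoder are frozen, wrapping these helpers in `torch.no_grad()` avoids storing activations for them and saves memory during training.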

## 6️⃣ Training Loop Implementation

**Task**: Implement the main training loop for diffusion model finetuning.

**Requirements**:
- Encode images to latent space using VAE
- Add noise to latents according to diffusion schedule
- Predict noise using UNet conditioned on text
- Compute loss between predicted and actual noise
- Update UNet parameters via backpropagation

In [7]:
# TODO: Implement training loop:
#       
#       For each epoch:
#         For each batch in train_dataloader:
#           - Get images and text from batch
#           - Encode images to latent space using VAE
#           - Encode text to embeddings using text encoder
#           - Sample random timesteps for diffusion
#           - Add noise to latents according to schedule
#           - Predict noise using UNet with text conditioning
#           - Compute MSE loss between predicted and actual noise
#           - Backpropagate and update UNet parameters
#           - Track and display training progress
#
# TODO: Use torch.no_grad() for VAE and text encoder operations
# TODO: Implement proper error handling and memory management
# TODO: Save model checkpoints periodically
# TODO: Display loss curves and training statistics
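
The "add noise" and "compute loss" steps of the loop follow the standard forward-diffusion formula, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The sketch below spells that out in plain PyTorch so the arithmetic is visible. It uses a simple linear beta schedule and a zero tensor in place of the UNet output, both of which are assumptions for illustration; in the notebook, `noise_scheduler.add_noise(latents, noise, timesteps)` performs the noising step with the model's own schedule.

```python
import torch
import torch.nn.functional as F

# Illustrative linear beta schedule (assumption; the real schedule comes from the scheduler config)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def add_noise(latents, noise, timesteps):
    """Forward diffusion: mix clean latents with noise according to each sample's timestep."""
    sqrt_ab = alphas_cumprod[timesteps].sqrt().view(-1, 1, 1, 1)
    sqrt_one_minus_ab = (1.0 - alphas_cumprod[timesteps]).sqrt().view(-1, 1, 1, 1)
    return sqrt_ab * latents + sqrt_one_minus_ab * noise


# Dummy latents with the shape a 512x512 image has after VAE encoding
latents = torch.randn(2, 4, 64, 64)
noise = torch.randn_like(latents)
timesteps = torch.randint(0, T, (latents.shape[0],))
noisy_latents = add_noise(latents, noise, timesteps)

# Stand-in for: unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
noise_pred = torch.zeros_like(noise)

# The training target is the noise that was added
loss = F.mse_loss(noise_pred, noise)
print("loss:", loss.item())
```

In the real loop, `loss.backward()` followed by `optimizer.step()` and `optimizer.zero_grad()` would update only the UNet, since the other components are frozen.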

## 7️⃣ Inference Pipeline Setup

**Task**: Create inference pipeline for text-to-image generation.

**Requirements**:
- Set up complete diffusion pipeline with trained UNet
- Configure scheduler for inference (100 steps)
- Implement text-to-image generation function
- Handle classifier-free guidance

In [8]:
# TODO: Create inference pipeline:
#       - Set all models to eval mode
#       - Create StableDiffusionPipeline with trained components
#       - Configure scheduler for inference

# TODO: Implement generate_image function that:
#       - Takes text prompt as input
#       - Encodes text to embeddings
#       - Starts with random noise
#       - Performs denoising for specified number of steps
#       - Decodes final latent to image
#       - Returns PIL image

# TODO: Set up proper inference configuration:
#       - num_inference_steps = INFERENCE_STEPS
#       - guidance_scale = GUIDANCE_SCALE
#       - Enable safety checker if desired
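
Classifier-free guidance combines two noise predictions at every denoising step: one conditioned on the prompt and one on an empty prompt. The combination itself is a one-liner, sketched below with dummy tensors in place of UNet outputs. The function name `apply_guidance` is an assumption of this example.

```python
import torch


def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Push the prediction away from the unconditional output, toward the text-conditioned one."""
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)


# Dummy predictions standing in for two UNet forward passes
noise_uncond = torch.zeros(1, 4, 64, 64)
noise_text = torch.ones(1, 4, 64, 64)

guided = apply_guidance(noise_uncond, noise_text, guidance_scale=7.5)
print(guided.mean().item())
```

A scale of 1.0 reduces to the text-conditioned prediction alone, and larger values weight the prompt more heavily. When using `StableDiffusionPipeline`, passing `guidance_scale` to the pipeline call handles this step internally.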

## 8️⃣ Generate Images with Dataset Prompts

**Task**: Generate images using 5 prompts from the training dataset.

**Requirements**:
- Select 5 different text prompts from the dataset
- Generate images for each prompt
- Display results in a grid format
- Show prompt text alongside generated images

In [9]:
# TODO: Select 5 prompts from training dataset:
#       - Use different indices to get variety
#       - Extract text descriptions

# TODO: Generate images for each dataset prompt:
#       - Use generate_image function
#       - Set random seed for reproducibility
#       - Save generated images

# TODO: Create visualization:
#       - Display each prompt text
#       - Show corresponding generated image
#       - Use matplotlib subplot for clean layout
#       - Add titles and proper formatting

# TODO: Display results in a 2x3 grid or similar arrangement (with 5 images, hide the unused sixth cell)
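
One way to lay out five images with their prompts is sketched below. The function name `show_image_grid` is an assumption, and random arrays stand in for generated images so the snippet runs on its own; PIL images returned by the pipeline can be passed to `imshow` in the same way.

```python
import textwrap

import matplotlib.pyplot as plt
import numpy as np


def show_image_grid(images, prompts, ncols=3):
    """Plot images in a grid with wrapped prompt titles; unused cells stay blank."""
    nrows = (len(images) + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 4.5 * nrows))
    axes = np.atleast_1d(axes).ravel()
    for ax in axes:
        ax.axis("off")  # hides ticks everywhere, including any leftover cell
    for ax, image, prompt in zip(axes, images, prompts):
        ax.imshow(image)
        ax.set_title(textwrap.fill(prompt, width=30), fontsize=9)
    fig.tight_layout()
    return fig


# Placeholder data standing in for generated images and dataset prompts
dummy_images = [np.random.rand(64, 64, 3) for _ in range(5)]
dummy_prompts = [f"prompt {i}, naruto style" for i in range(5)]
fig = show_image_grid(dummy_images, dummy_prompts)
```

The same function can be reused for the custom prompts in the next section so both result grids share one format.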

## 9️⃣ Generate Images with Custom Prompts

**Task**: Generate images using 5 custom prompts that you create.

**Requirements**:
- Write 5 creative prompts in Naruto style
- Test different types of descriptions (characters, scenes, actions)
- Generate and display results
- Compare quality with dataset prompt results

In [10]:
# TODO: Define 5 custom prompts, for example:
#       - "A ninja with orange hair performing a jutsu"
#       - "A village hidden in the leaves at sunset"
#       - "A powerful chakra aura surrounding a young ninja"
#       - "A battle scene with multiple ninjas using different techniques"
#       - "A peaceful training ground with cherry blossoms"

# TODO: Generate images for each custom prompt:
#       - Use same generation parameters as before
#       - Ensure consistent quality

# TODO: Create visualization for custom prompts:
#       - Similar layout to dataset prompts
#       - Show prompt text and generated image
#       - Use consistent formatting

# TODO: Display all 5 custom prompt results

## 🔟 Model Evaluation and Comparison

**Task**: Evaluate and compare your results.

**Requirements**:
- Compare generated images with original dataset images
- Evaluate image quality, style consistency, and prompt adherence
- Plot training progress and loss convergence


In [None]:
# TODO: Create comparison visualization:
#       - Show original dataset images alongside generated ones
#       - Compare style consistency
#       - Evaluate prompt adherence

# TODO: Plot training loss curve:
#       - Show loss progression over epochs
#       - Analyze convergence behavior
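
Per-step diffusion losses tend to be noisy because each batch is trained at randomly sampled timesteps, so a smoothed curve is usually easier to read than the raw values. The helper below is a minimal sketch of a trailing moving average; the name `moving_average` and the window size are assumptions, and the short loss list is made-up demonstration data rather than real training output.

```python
def moving_average(values, window=50):
    """Trailing mean over the last `window` values (shorter at the start)."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Made-up per-step losses for demonstration only
step_losses = [1.0, 0.5, 0.9, 0.3, 0.7, 0.2]
print(moving_average(step_losses, window=2))
```

Plotting the raw losses with low opacity and the smoothed series on top, for example with two `plt.plot` calls, makes the overall trend visible without hiding the variance.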

## 📝 Evaluation Criteria

Your homework will be evaluated based on:

1. **Implementation Correctness (40%)**
   - Proper stable diffusion pipeline setup
   - Correct training loop implementation
   - Working inference pipeline
   - Appropriate use of VAE, UNet, text encoder, and scheduler

2. **Training and Results (30%)**
   - Model trains without errors
   - Reasonable loss convergence
   - Generated images show Naruto style characteristics
   - Successful generation from both dataset and custom prompts

3. **Code Quality (30%)**
   - Clean, readable code with proper comments
   - Efficient memory usage and error handling
   - Proper tensor operations and device management
   - Good visualization and presentation