
# **Fine-Tuning Stable Diffusion with Low-Rank Adaptation (LoRA)**

## Project Overview
This project explores parameter-efficient fine-tuning of Stable Diffusion models using Low-Rank Adaptation (LoRA) on the Flickr8k dataset. We compare two approaches:

1. **Model 1 (Minimal)**: Training on 100 images with attention-only LoRA adaptation
2. **Model 2 (Comprehensive)**: Training on 3,000 images with expanded LoRA targeting both attention and feedforward networks

Our goal is to demonstrate how LoRA enables efficient adaptation of large diffusion models even with limited computational resources, and to analyze the impact of dataset size and architecture choices on generation quality. The code cells below show the Model 2 configuration.

## Theoretical Background
### Stable Diffusion
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images from text prompts. The model works by:
1. Encoding the input text with a CLIP text encoder
2. Running the diffusion process in the latent space of a variational autoencoder (VAE) rather than in pixel space
3. Gradually denoising random latent noise with a UNet, guided by the text embedding, and decoding the final latent into an image with the VAE decoder
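
To make the latent-space step concrete, the sketch below computes the shape of the latent tensor for a given image size. It assumes the Stable Diffusion v1 autoencoder (VAE), which downsamples each spatial side by a factor of 8 and uses 4 latent channels. `latent_shape` and `num_elements` are hypothetical helpers written for this illustration, not library functions.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the VAE latent for an RGB image.

    Assumes the Stable Diffusion v1 autoencoder: 8x spatial downsampling
    and 4 latent channels. Other model families may differ.
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def num_elements(shape):
    """Total number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixel_shape = (3, 512, 512)          # RGB image at the training resolution
z_shape = latent_shape(512, 512)     # latent the UNet actually denoises
ratio = num_elements(pixel_shape) / num_elements(z_shape)
print(z_shape, ratio)
```

Under these assumptions the UNet operates on far fewer values than a pixel-space model would, which is a large part of why latent diffusion is tractable on modest hardware.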

### Low-Rank Adaptation (LoRA)
LoRA is a parameter-efficient fine-tuning technique that:
- Freezes pre-trained model weights
- Injects trainable rank decomposition matrices into layers
- Approximates weight updates with a low-rank product: ΔW = AB, where A ∈ ℝᵐˣʳ, B ∈ ℝʳˣⁿ, and the rank r is much smaller than m and n
- Reduces trainable parameters by ~97% compared to full fine-tuning

LoRA enables adaptation of large models on consumer hardware while maintaining performance.
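
The idea can be sketched in a few lines of PyTorch. The class below is a minimal illustration of the ΔW = AB decomposition, not the implementation used by the `peft` library. `LoRALinear`, the 320-unit layer size, and the initialization scale are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank update dW = A @ B."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        m, n = base.out_features, base.in_features
        # A starts at zero so that dW = 0 and the wrapped layer initially
        # behaves exactly like the base layer; B gets a small random init.
        self.A = nn.Parameter(torch.zeros(m, r))
        self.B = nn.Parameter(torch.randn(r, n) * 0.01)
        self.scaling = alpha / r

    def forward(self, x):
        delta_w = self.A @ self.B  # shape (m, n), rank at most r
        return self.base(x) + self.scaling * nn.functional.linear(x, delta_w)


base = nn.Linear(320, 320)  # illustrative layer size
layer = LoRALinear(base, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} of {total} ({100 * trainable / total:.1f}%)")
```

Only `A` and `B` receive gradients, so the optimizer state and the saved checkpoint scale with r(m + n) per layer rather than with mn.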


## **Fine-Tune Stable Diffusion with Flickr8k**

### **Import Libraries**

In [1]:
import os
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
from transformers import CLIPTokenizer
from diffusers import StableDiffusionPipeline, DDPMScheduler
from peft import get_peft_model, LoraConfig
from datasets import Dataset as HFDataset
from tqdm import tqdm



### **Configuration parameters for model fine-tuning**

1. `image_dir`: Directory containing the Flickr8k images
2. `captions_file`: Path to the captions file
3. `pretrained_model`: Base Stable Diffusion model to fine-tune
4. `output_dir`: Where to save the fine-tuned model
5. `image_size`: Resolution for training images (512x512)
6. `batch_size`: Batch size for training (larger for Model 2)
7. `num_epochs`: Number of training epochs
8. `lr`: Learning rate (lower for Model 2 for stability)

In [None]:
image_dir = "C:/Users/molavade.s/Latent_Diffusion_model/flickr8k/Images"
captions_file = "C:/Users/molavade.s/Latent_Diffusion_model/flickr8k/captions.txt"
pretrained_model = "CompVis/stable-diffusion-v1-4"
output_dir = "./sd-fine-tuned-lora-updated"
image_size = 512
batch_size = 4
num_epochs = 20
lr = 5e-6
device = "cuda" if torch.cuda.is_available() else "cpu"
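
As a rough sanity check on these settings, the sketch below estimates how many optimizer steps the run performs. It assumes the 3,000 caption pairs loaded later in this notebook and a `DataLoader` that keeps the last partial batch (the PyTorch default). `training_steps` is a hypothetical helper.

```python
import math


def training_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total steps) when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


steps_per_epoch, total_steps = training_steps(num_examples=3000, batch_size=4, num_epochs=20)
print(steps_per_epoch, total_steps)
```

Knowing the total step count helps when judging whether a small learning rate such as `5e-6` has enough updates to move the adapter weights.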

### **Load Captions**

In [5]:
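def load_captions(captions_path):
    """Read 'image,caption' lines and return a list of {'image', 'caption'} dicts.

    Image paths are resolved against the global `image_dir`; entries whose
    image file is missing are skipped with a message.
    """
    pairs = []
    with open(captions_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            # Skip blank lines and the CSV header
            if not line or line.startswith("image,caption"):
                continue
            try:
                # Split on the first comma only, since captions may contain commas
                img, caption = line.split(',', 1)
            except ValueError as e:
                print(f"Error processing line: {line} - {e}")
                continue
            img = img.split('#')[0]  # Drop a '#<index>' suffix if present
            full_img_path = os.path.join(image_dir, img)
            if os.path.exists(full_img_path):
                pairs.append({'image': full_img_path, 'caption': caption.strip()})
            else:
                print(f"Image file not found: {full_img_path}")
    return pairs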


# Load and convert to HuggingFace Dataset
# Keep the first 3,000 caption pairs. Flickr8k lists several captions per image,
# so this covers fewer than 3,000 distinct images.
pairs = load_captions(captions_file)[:3000]
hf_dataset = HFDataset.from_list(pairs)
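
The parsing rule inside `load_captions` can be checked in isolation. The sketch below applies the same two splits to made-up lines; the file names and captions are invented for illustration and are not taken from Flickr8k, and `parse_caption_line` is a hypothetical helper.

```python
def parse_caption_line(line):
    """Split one 'image,caption' line into (image_name, caption)."""
    # Split on the first comma only, so commas inside the caption survive
    image_name, caption = line.strip().split(",", 1)
    # Drop a '#<index>' suffix if the file uses that caption-numbering style
    image_name = image_name.split("#")[0]
    return image_name, caption.strip()


lines = [
    "example_001.jpg,A dog runs across the grass , chasing a ball .",
    "example_002.jpg#0,Two children play on a beach .",
]
for line in lines:
    print(parse_caption_line(line))
```

Splitting with `maxsplit=1` matters here: a plain `split(",")` would break the first caption into several pieces.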

### **Tokenizer and Transform** 

In [6]:
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
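transform = transforms.Compose([
    transforms.Resize((image_size, image_size)),  # Resize to a square; aspect ratio is not preserved
    transforms.ToTensor(),                        # PIL image -> float tensor in [0, 1]
    transforms.Normalize([0.5], [0.5])            # Rescale to [-1, 1], the range the VAE expects
])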
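
The effect of `Normalize([0.5], [0.5])` is easy to verify by hand: it computes `(x - mean) / std` on values that `ToTensor` has already scaled to [0, 1]. The scalar sketch below mirrors that arithmetic. `normalize` and `denormalize` are hypothetical helpers, and the inverse is only needed if you want to view a normalized tensor as an image again.

```python
def normalize(value, mean=0.5, std=0.5):
    """Map a pixel value in [0, 1] to [-1, 1]."""
    return (value - mean) / std


def denormalize(value, mean=0.5, std=0.5):
    """Inverse of normalize: map a value in [-1, 1] back to [0, 1]."""
    return value * std + mean


for v in (0.0, 0.5, 1.0):
    print(v, "->", normalize(v))
```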

### **Custom Dataset**

In [7]:
class FlickrDataset(Dataset):
    """Return a normalized image tensor and a tokenized caption for each example."""
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        example = self.data[idx]
        image = Image.open(example['image']).convert('RGB')
        pixel_values = transform(image)
        text_inputs = tokenizer(example['caption'], padding='max_length', truncation=True, max_length=77, return_tensors='pt')
        return {
            'pixel_values': pixel_values,
            'input_ids': text_inputs.input_ids.squeeze(0),
            'attention_mask': text_inputs.attention_mask.squeeze(0)
        }

dataset = FlickrDataset(hf_dataset)

### **Load Pipeline and Freeze**

In [None]:
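# Use half precision on GPU to save memory; fall back to float32 on CPU
pipe = StableDiffusionPipeline.from_pretrained(
    pretrained_model,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
)
pipe.to(device)
# Freeze the VAE and text encoder; only the LoRA weights added to the UNet are trained
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)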

### **Apply LoRA to the UNet**

In [9]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["to_q", "to_k", "to_v", "to_out.0", "ff.net.0.proj", "ff.net.2"],
    bias="none"
)

pipe.unet = get_peft_model(pipe.unet, lora_config)
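
To get a feel for what this configuration adds, the sketch below estimates the extra parameters for one transformer block and the scaling factor `lora_alpha / r` applied to the low-rank update in the standard LoRA formulation. The layer shapes are illustrative assumptions for a 320-channel block, not values read from the loaded UNet, and `lora_params` is a hypothetical helper.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one rank-r LoRA adapter on a linear layer."""
    return r * (in_features + out_features)


r, lora_alpha = 8, 32
scaling = lora_alpha / r  # factor applied to the low-rank update

# Assumed (in_features, out_features) shapes for one block; illustrative only
example_layers = {
    "to_q": (320, 320),
    "to_k": (320, 320),
    "to_v": (320, 320),
    "to_out.0": (320, 320),
    "ff.net.0.proj": (320, 2560),
    "ff.net.2": (1280, 320),
}

attention_only = sum(
    lora_params(i, o, r) for name, (i, o) in example_layers.items() if name.startswith("to_")
)
with_ff = sum(lora_params(i, o, r) for i, o in example_layers.values())
print(scaling, attention_only, with_ff)
```

Under these assumed shapes, adding the feedforward targets more than doubles the adapter size per block, which is the main cost difference between the attention-only and expanded configurations. On the real model, `pipe.unet.print_trainable_parameters()` reports the actual counts.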


### **Training**

In [None]:
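# Optimize only the trainable LoRA parameters; the frozen base weights are left out
trainable_params = [p for p in pipe.unet.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=lr)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
weight_dtype = pipe.vae.dtype  # float16 on GPU, float32 on CPU

for epoch in range(num_epochs):
    pipe.unet.train()
    for batch in tqdm(dataloader, desc=f'Epoch {epoch+1}/{num_epochs}'):
        images = batch['pixel_values'].to(device, dtype=weight_dtype)
        input_ids = batch['input_ids'].to(device)
        # The VAE and text encoder are frozen, so no gradients are needed here
        with torch.no_grad():
            # 0.18215 is the latent scaling factor used by Stable Diffusion v1
            latents = pipe.vae.encode(images).latent_dist.sample() * 0.18215
            encoder_hidden_states = pipe.text_encoder(input_ids)[0].to(dtype=weight_dtype)
        # Forward diffusion: add noise at a random timestep for each sample
        noise = torch.randn_like(latents)
        timesteps = torch.randint(0, pipe.scheduler.config.num_train_timesteps, (latents.shape[0],), device=device).long()
        noisy_latents = pipe.scheduler.add_noise(latents, noise, timesteps)
        # The UNet is trained to predict the noise that was added
        model_pred = pipe.unet(noisy_latents, timesteps, encoder_hidden_states).sample
        loss = torch.nn.functional.mse_loss(model_pred.float(), noise.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # This is the loss of the last batch only, not an epoch average
    print(f'Epoch {epoch+1}: Loss (last batch) = {loss.item()}')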
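
The `add_noise` call in the loop mixes each clean latent with Gaussian noise according to the cumulative noise schedule: x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε. The pure-Python sketch below reproduces that formula on a toy vector. It assumes a linear beta schedule with commonly used DDPM endpoints; the scheduler configured for this model may use a different schedule, so the numbers are illustrative only.

```python
import math


def alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    out, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(x0, eps, t, acp):
    """Forward diffusion: x_t = sqrt(acp[t]) * x0 + sqrt(1 - acp[t]) * eps."""
    signal = math.sqrt(acp[t])
    sigma = math.sqrt(1.0 - acp[t])
    return [signal * x + sigma * e for x, e in zip(x0, eps)]


acp = alphas_cumprod()
x0 = [1.0, -0.5, 0.25]   # a toy "latent"
eps = [0.3, -1.2, 0.8]   # a fixed noise sample, so the run is reproducible
for t in (0, 500, 999):
    print(t, [round(v, 3) for v in add_noise(x0, eps, t, acp)])
```

At small `t` the output stays close to `x0`; at large `t` it is dominated by `eps`. The training loss asks the UNet to recover `eps` from the noisy latent at whichever timestep was sampled.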

### **Save final LoRA fine-tuned UNet**

In [11]:
pipe.unet.save_pretrained(output_dir)
print(f"Fine-tuned U-Net saved to: {output_dir}")

Fine-tuned U-Net saved to: ./sd-fine-tuned-lora-updated


### **Load the Fine-Tuned Model for Inference**

In [None]:
import torch
from diffusers import StableDiffusionPipeline
from peft import PeftModel

# Load the original Stable Diffusion pipeline
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe.to("cuda")

# Wrap the base UNet with the saved LoRA adapter weights
pipe.unet = PeftModel.from_pretrained(pipe.unet, "./sd-fine-tuned-lora-updated")
pipe.unet.eval()

# Reduce peak memory use during generation (attention slicing trades a little speed for memory)
pipe.enable_attention_slicing()

## Conclusions

1. LoRA enables efficient fine-tuning of large diffusion models
2. Even modest datasets (100-3,000 samples) can produce meaningful adaptations
3. Strategic choices in LoRA configuration significantly impact results
4. The approach scales well from small experiments to larger training sets