# 1. Introduction – Let’s Customize Diffusion!

Welcome to the Diffusion + LoRA Fine-Tuning Workshop! 
In this tutorial, we’ll show you how to fine-tune a powerful diffusion model using just a few images of yourself (or your cat, or your coffee mug—no judgment!).


## What you’ll learn:

* How diffusion models work (briefly!)
* How to prepare your own dataset
* How to caption it like a pro
* How to fine-tune using LoRA
* And finally… see your AI clone generate magic

Before we begin: upload 10–15 photos of yourself. Face visibility helps!
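A quick way to confirm the upload is to count the image files in your folder once Drive is mounted. This is only a sketch: the helper names `count_image_names` and `photo_count_status` and the list of accepted extensions are assumptions made for illustration, and the folder path is the one used later in this notebook.

```python
import os

# Assumed set of image extensions; adjust to match your own files.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")

def count_image_names(filenames):
    """Count the names that end in one of the assumed image extensions."""
    return sum(1 for name in filenames if name.lower().endswith(IMAGE_EXTENSIONS))

def photo_count_status(n, low=10, high=15):
    """Compare a photo count with the suggested 10-15 range."""
    if n < low:
        return "too few"
    if n > high:
        return "more than suggested"
    return "ok"

image_dir = "/content/drive/MyDrive/ICOICT_demo/images"
# Fall back to an empty list when the folder is not available (e.g. Drive not mounted yet).
names = os.listdir(image_dir) if os.path.isdir(image_dir) else []
n = count_image_names(names)
print(f"{n} photo(s) found: {photo_count_status(n)}")
```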

## Credits:

This code was adapted from: [Fine Tuning SDXL on a Free T4 Google Colab GPU](https://medium.com/@ravi.kaskuser/fine-tuning-sdxl-on-a-free-t4-google-colab-gpu-41ca2cd3cec8)

All credit goes to [Ravi Adi Prakoso](https://medium.com/@ravi.kaskuser)

# Setup

Install the required dependencies

In [None]:
!nvidia-smi

In [None]:
!pip install git+https://github.com/huggingface/diffusers.git@v0.32.1

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content

# Captioning (optional)

For diffusion models to “know” what they’re generating, they need text-image pairs. That’s where captioning comes in.

We’ll assign a special token to represent your concept (e.g., "a photo of TOK man").
Think of this as your model's personalized vocabulary word!

Tip: Use consistent phrasing across captions. The simpler, the better.
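To make "consistent phrasing" concrete, here is a small sketch of a caption builder that always produces the same `prefix, description` shape. The function name `build_caption` is hypothetical and not part of the original notebook; it only normalizes commas and whitespace.

```python
def build_caption(prefix, description):
    """Join a fixed prefix and a free-form description into one tidy caption."""
    # Drop surrounding spaces and a trailing comma so the separator is added exactly once.
    prefix = prefix.strip().rstrip(",")
    # Collapse repeated whitespace in the description.
    description = " ".join(description.split())
    if not description:
        return prefix
    return f"{prefix}, {description}"

print(build_caption("a photo of TOK man, ", "a man  standing on a beach"))
# a photo of TOK man, a man standing on a beach
```

Every caption then starts with the same trigger phrase, no matter how the description was produced.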

In [None]:
import torch
from transformers import AutoProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# load the processor and the captioning model
blip_processor = AutoProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base", torch_dtype=torch.float16
).to(device)

# captioning utility
def caption_images(input_image):
    inputs = blip_processor(images=input_image, return_tensors="pt").to(device, torch.float16)
    pixel_values = inputs.pixel_values

    generated_ids = blip_model.generate(pixel_values=pixel_values, max_length=50)
    generated_caption = blip_processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return generated_caption


In [None]:

import glob
from PIL import Image

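# create a list of (path, PIL.Image) pairs
local_dir = "/content/drive/MyDrive/ICOICT_demo/images"
# the "/" matters: without it the pattern matches ".../images*.jpg" and finds nothing in the folder
imgs_and_paths = [(path, Image.open(path)) for path in glob.glob(f"{local_dir}/*.jpg")]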

imgs_and_paths

In [None]:
import json

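caption_prefix = "a photo of TOK man, " #@param

# Write one JSON object per image: {"file_name": ..., "prompt": ...}.
# local_dir has no trailing slash, so this file is created beside the image
# folder (".../imagesmetadata.jsonl"), not inside it.
with open(f'{local_dir}metadata.jsonl', 'w') as outfile:
    for img in imgs_and_paths:
        caption = caption_prefix + caption_images(img[1]).split("\n")[0]
        entry = {"file_name": img[0].split("/")[-1], "prompt": caption}
        json.dump(entry, outfile)
        outfile.write('\n')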
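The cell above writes one JSON object per line with `file_name` and `prompt` keys. A quick read-back check can catch empty or malformed entries before training. The sketch below assumes that layout; `parse_metadata_lines` is a hypothetical helper, and the sample line is made up for illustration.

```python
import json

REQUIRED_KEYS = ("file_name", "prompt")  # keys written by the captioning cell

def parse_metadata_lines(lines):
    """Parse JSONL lines into dicts, skipping blank lines and checking the keys."""
    rows = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in row]
        if missing:
            raise ValueError(f"line {number} is missing {missing}")
        rows.append(row)
    return rows

# Illustrative sample; real lines come from the metadata file.
sample = ['{"file_name": "img1.jpg", "prompt": "a photo of TOK man, smiling"}\n']
for row in parse_metadata_lines(sample):
    print(row["file_name"], "->", row["prompt"])
```

To check the real file, pass an open file handle instead of `sample`, for example `with open(path) as f: rows = parse_metadata_lines(f)`, where `path` is wherever the metadata file was written.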

In [None]:
import gc

# delete the BLIP pipelines and free up some memory
del blip_processor, blip_model
gc.collect()
torch.cuda.empty_cache()

# Training – Time to Teach the Model

Now the fun part: let’s fine-tune the model using LoRA — a lightweight way to inject new knowledge into a huge model without retraining the whole thing.


What’s happening under the hood:

* LoRA adds a few trainable adapters to the model
* It’s fast and cheap, and it leaves the core weights untouched
* Expect roughly 30–90 minutes of training time, depending on settings and hardware
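To see why the adapters are cheap, here is a toy sketch of the LoRA idea in plain Python: a frozen weight matrix `W` gets a low-rank update `B·A`, scaled by `alpha / rank`, and only `A` and `B` are trained. The 1280×1280 layer size and rank 4 are assumptions picked for illustration, not values read from the training script, and the helper names are hypothetical.

```python
def full_params(d_in, d_out):
    """Number of weights in a full d_out x d_in matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Number of trainable weights in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # low-rank path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

print(full_params(1280, 1280))     # 1638400
print(lora_params(1280, 1280, 4))  # 10240

# With B set to zeros (the usual LoRA starting point), the layer output is unchanged.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
B_zero = [[0], [0]]
print(lora_forward([2, 3], W, A, B_zero, alpha=1, rank=1))  # [2.0, 3.0]
```

Under these assumed sizes the adapter trains well under 1% of the weights in that layer, which is what keeps memory use and training time manageable on a free Colab GPU.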

In [None]:
!pip install bitsandbytes transformers accelerate peft -q

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!accelerate config default

### Download the DreamBooth LoRA training script

In [None]:

!wget https://raw.githubusercontent.com/huggingface/diffusers/v0.32.1/examples/dreambooth/train_dreambooth_lora_sdxl.py


Adjust your parameters:

* instance_data_dir: location of your images
* output_dir: where the trained LoRA weights will be saved
* resolution: try different resolutions (256, 512, 1024); the higher the resolution, the longer training takes
* instance_prompt: the trigger phrase associated with your photos



In [None]:
#!/usr/bin/env bash
!accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --instance_data_dir="/content/drive/MyDrive/ICOICT_demo/images" \
  --output_dir="model_LoRA" \
  --resolution=256 \
  --instance_prompt="a photo of TOK man" \
  --caption_column="prompt" \
  --mixed_precision="fp16" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=3 \
  --gradient_checkpointing \
  --learning_rate=1e-4 \
  --snr_gamma=5.0 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --use_8bit_adam \
  --max_train_steps=600 \
  --checkpointing_steps=100 \
  --seed="0"
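To get a feel for what these numbers mean, the sketch below estimates how many images the model sees during training. It assumes that `max_train_steps` counts optimizer updates, so each step consumes `train_batch_size × gradient_accumulation_steps` images, and it assumes a dataset of 12 photos. The function names are hypothetical.

```python
def samples_seen(max_train_steps, train_batch_size, gradient_accumulation_steps):
    """Images processed in total, assuming one optimizer update per training step."""
    effective_batch_size = train_batch_size * gradient_accumulation_steps
    return max_train_steps * effective_batch_size

def passes_over_dataset(total_samples, num_images):
    """Approximate number of times each image is revisited."""
    return total_samples / num_images

# Values taken from the training command above; 12 images is an assumed dataset size.
total = samples_seen(max_train_steps=600, train_batch_size=1, gradient_accumulation_steps=3)
print(total)                           # 1800
print(passes_over_dataset(total, 12))  # 150.0
```

With a small dataset each photo is revisited many times, so if the results look overfitted (every output resembles a training photo), lowering `max_train_steps` is a reasonable first thing to try.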


# Inference

Your model is now trained. Let’s put it to the test!

Try generating images with creative prompts that include your special token.

### Load the pretrained model

In [None]:
import torch
from diffusers import DiffusionPipeline, AutoencoderKL

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
)
_ = pipe.to("cuda")

In [None]:
import os

output_dir = "/content/drive/MyDrive/ICOICT_demo/output_images"
os.makedirs(output_dir, exist_ok=True)

def save_image_incremental(image, output_dir, prefix="image", ext=".jpg"):
    os.makedirs(output_dir, exist_ok=True)

    # List files with the given prefix and extension
    existing_files = [f for f in os.listdir(output_dir) if f.startswith(prefix) and f.endswith(ext)]

    # Extract numbers from filenames
    existing_nums = []
    for f in existing_files:
        try:
            num = int(f.replace(prefix + "-", "").replace(ext, ""))
            existing_nums.append(num)
        except ValueError:
            continue

    # Get the next number
    next_num = max(existing_nums, default=0) + 1
    filename = f"{prefix}-{next_num}{ext}"

    # Save the image
    image.save(os.path.join(output_dir, filename))
    print(f"Saved: {filename}")
    return filename


## Load your own fine-tuned model

If you haven't done so already, download your fine-tuned model and upload it to the directory specified below.
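Instead of downloading and re-uploading by hand, the weights can also be copied to Drive from inside Colab. The sketch below is one way to do that. The source path is an assumption: it supposes training wrote `pytorch_lora_weights.safetensors` into the `model_LoRA` output folder under `/content`. The helper `copy_lora_weights` is hypothetical, and the destination name matches the file loaded in the next cell.

```python
import os
import shutil

def copy_lora_weights(src_file, dst_dir, dst_name):
    """Copy a LoRA weight file into dst_dir under dst_name and return the new path."""
    if not os.path.isfile(src_file):
        raise FileNotFoundError(f"No weight file at {src_file}")
    os.makedirs(dst_dir, exist_ok=True)
    dst_path = os.path.join(dst_dir, dst_name)
    shutil.copy2(src_file, dst_path)
    return dst_path

# Assumed locations; adjust them to where your training run saved its output.
src = "/content/model_LoRA/pytorch_lora_weights.safetensors"
dst_dir = "/content/drive/MyDrive/ICOICT_demo/models"

if os.path.isfile(src):
    print("Copied to:", copy_lora_weights(src, dst_dir, "pytorch_lora_weights-600.safetensors"))
else:
    print(f"Nothing to copy: {src} not found")
```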

In [None]:
repo_id = "/content/drive/MyDrive/ICOICT_demo/models/pytorch_lora_weights-600.safetensors"
pipe.load_lora_weights(repo_id)
_ = pipe.to("cuda")

## Generate your images

Experiment with different prompts


In [None]:
import random

trigger = "a photo of TOK man, "

# Example prompts for generating images
example_prompts = [
    "futuristic cyberpunk style, glowing neon lights in the background, blue and pink lighting, wearing a cyber visor, moody expression, rain falling, ultra-detailed face, stylized realism",
    "hiking through a misty forest, wearing a hooded jacket, slight rain on his face, atmospheric background, realistic style, cinematic mood, sharp face details",
    "standing confidently at the head of a meeting table, arms crossed, business casual attire, glass-walled office background, leadership presence, natural lighting, high-resolution detail",
    "smiling confidently against a white studio background, wearing a light blue shirt and blazer, well-lit for media and marketing use, modern corporate photography style, photorealistic",
    "laughing and engaging with coworkers in a creative office space, casual but polished look, natural interaction, modern workplace environment, clear facial detail, warm and friendly tone",
    "standing under colorful neon signs in a busy night market, lively background blur, side lighting casting soft shadows, reflective surfaces, vibrant city energy, highly detailed portrait",
    "standing under a temple in Japan, lively background blur, daylight, reflective surfaces, vibrant city energy, highly detailed portrait",
    "standing in a desert at sunset, windswept scarf, rugged face with sun-kissed skin, dramatic sky in the background, realistic shadows, adventure mood, cinematic detail",
    "working at a desk in a bright open-plan office, laptop in front of him, natural daylight from large windows, smart-casual outfit, candid professional moment, realistic lighting and detail",
    "looking out a bright  window, soft indoor lighting, glossy glass in focus, smile, bright background, ultra-realistic lighting",
    "professional studio headshot, plain dark background, soft diffused lighting, confident expression, detailed facial features, symmetrical composition, ultra-realistic skin texture",
    "looking out a rainy window, soft indoor lighting, raindrops on glass in focus, melancholic expression, warm cozy background, ultra-realistic lighting, shallow depth of field",
    "wearing a navy business suit and white shirt, standing in front of a modern office backdrop, confident and approachable expression, clean lighting, realistic skin texture, professional portrait style",
    "standing on a rooftop at sunset, cinematic lighting, wearing a black leather jacket, bokeh city lights in the background, shallow depth of field, dramatic sky, realistic style, high detail, wide angle",
]

# Pick a random example prompt and prepend the trigger phrase
prompt = f"{trigger}{random.choice(example_prompts)}"
print(prompt)

# Generate an image using the prompt and save
image = pipe(prompt=prompt, num_inference_steps=25).images[0]
save_image_incremental(image, output_dir)
image
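When comparing prompts, it helps to make the random choice repeatable. The sketch below uses a locally seeded `random.Random` so the same seed always picks the same scene. `pick_prompt` is a hypothetical helper and the two scenes are shortened stand-ins for the list above. The commented line shows how a seeded `torch.Generator` could be passed to the pipeline through its `generator` argument to fix the image noise as well; it is left as a comment because it needs the loaded pipeline.

```python
import random

def pick_prompt(trigger, scenes, seed):
    """Pick one scene reproducibly and prepend the trigger phrase."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    scene = rng.choice(scenes)
    return f"{trigger.strip().rstrip(',')}, {scene}"

# Shortened stand-ins for the example prompts above.
scenes = [
    "hiking through a misty forest, cinematic mood",
    "professional studio headshot, plain dark background",
]

prompt = pick_prompt("a photo of TOK man, ", scenes, seed=0)
print(prompt)

# Assumed usage with the pipeline loaded earlier (not run here):
# generator = torch.Generator("cuda").manual_seed(0)
# image = pipe(prompt=prompt, num_inference_steps=25, generator=generator).images[0]
```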
