## Overview

Unconditional image generation is a popular application of diffusion models that generates images resembling those in the training dataset. Typically, the best results are obtained by fine-tuning a pretrained model on a specific dataset, and you can find many such checkpoints on the Hugging Face Hub. You can also train your own model on a custom dataset. Here we train a `UNet2DModel` from scratch on a subset of the Smithsonian Butterflies dataset to generate images of butterflies.

## Preparing the environment

Before we start, make sure we have Datasets installed to load and preprocess image datasets, and Accelerate to simplify training on any number of GPUs.
The following commands also install TensorBoard to visualize training metrics.

In [None]:
# Pin an accelerate version that supports multi-process training
!pip install accelerate==0.20.3
!pip install "diffusers[training]"

### Log in to your account to share the model (optional)

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## The training process

### Training configuration

For convenience, create a TrainingConfig class containing the training **[hyperparameters](https://aisuko.gitbook.io/wiki/ai-techniques/large-language-model/ggml#hyperparameters)**.

In [None]:
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    # Type annotations make these real dataclass fields (with defaults)
    image_size: int = 128  # the generated image resolution
    train_batch_size: int = 16
    eval_batch_size: int = 16  # how many images to sample for evaluation
    num_epochs: int = 50
    gradient_accumulation_steps: int = 1
    learning_rate: float = 1e-4
    lr_warmup_steps: int = 500
    save_image_epochs: int = 10
    save_model_epochs: int = 30
    mixed_precision: str = "fp16"  # `no` for float32, `fp16` for automatic mixed precision
    output_dir: str = "ddpm-butterflies-128"  # the model name locally and on the HF Hub
    push_to_hub: bool = False  # whether to upload the model to the HF Hub
    hub_private_repo: bool = False
    overwrite_output_dir: bool = True  # overwrite the old model when re-running the notebook
    seed: int = 0

config = TrainingConfig()

### Load the dataset

In [None]:
from datasets import load_dataset

config.dataset_name = "huggan/smithsonian_butterflies_subset"
dataset = load_dataset(config.dataset_name, split="train")

Datasets uses the `Image` feature to automatically decode the image data and load it as a `PIL.Image`, which we can visualize:

In [None]:
import matplotlib.pyplot as plt

fig, axs=plt.subplots(1,4,figsize=(16,4))
for i, image in enumerate(dataset[:4]["image"]):
    axs[i].imshow(image)
    axs[i].set_axis_off()
fig.show()

The images are all different sizes though, so we'll need to preprocess them first:
* `Resize` changes the image size to the one defined in `config.image_size`.
* `RandomHorizontalFlip` augments the dataset by randomly mirroring the images.
* `RandomRotation` augments the dataset further by rotating each image by a random angle of up to 10 degrees.
* [`Normalize`](https://aisuko.gitbook.io/wiki/ai-techniques/framework/ml_training_components#normalization) rescales the pixel values into a [-1, 1] range, which is what the model expects.

In [None]:
from torchvision import transforms

preprocess = transforms.Compose(
    [
        transforms.Resize((config.image_size, config.image_size)),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(degrees=10),  # rotate by a random angle in [-10, 10] degrees
        transforms.ToTensor(),
        transforms.Normalize([0.5], [0.5]),
    ]
)
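
To make the effect of `Normalize([0.5], [0.5])` concrete, here is a minimal pure-Python sketch of the per-pixel arithmetic. It assumes the pixel has already been scaled to [0, 1] (which is what `ToTensor` does for 8-bit images); the helper names `normalize` and `denormalize` are hypothetical and are not torchvision functions.

```python
# Sketch of the arithmetic behind Normalize([0.5], [0.5]).
# normalize/denormalize are hypothetical helpers, not torchvision APIs.

def normalize(pixel: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Map a pixel in [0, 1] to [-1, 1] via (pixel - mean) / std."""
    return (pixel - mean) / std


def denormalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Invert normalize(), e.g. before displaying a generated image."""
    return value * std + mean


for pixel in (0.0, 0.25, 1.0):
    print(pixel, "->", normalize(pixel))
```

The inverse mapping is the one to apply whenever a model output in [-1, 1] has to be shown as an image.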

Use Datasets' `set_transform` method to apply the `preprocess` function on the fly during training:

In [None]:
def transform(examples):
    images = [preprocess(image.convert("RGB")) for image in examples["image"]]
    return {"images": images}

dataset.set_transform(transform)

(Optional) Visualize the images again to confirm that they've been resized. Now you're ready to wrap the dataset in a DataLoader for training:

In [None]:
import torch

train_dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=config.train_batch_size,
    shuffle=True
)

## Create a UNet2DModel

Models in Diffusers are easily created from their model class with the parameters you want. For example, to create a `UNet2DModel`:

In [None]:
from diffusers import UNet2DModel

model=UNet2DModel(
    sample_size=config.image_size, # the target image resolution
    in_channels=3, # the number of input channels, 3 for RGB images
    out_channels=3, # the number of output channels
    layers_per_block=2, # how many ResNet layers to use per UNet block
    block_out_channels=(128,128,256,256,512,512), # the number of output channels for each UNet block
    down_block_types=(
        "DownBlock2D", # a regular ResNet downsampling block
        "DownBlock2D",
        "DownBlock2D",
        "DownBlock2D",
        "AttnDownBlock2D", # a ResNet downsampling block with spatial self-attention
        "DownBlock2D",
    ),
    up_block_types=(
        "UpBlock2D", # a regular ResNet upsampling block
        "AttnUpBlock2D", # a ResNet upsampling block with spatial self-attention
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
    ),
)

Check that the sample image shape matches the model output shape:

In [None]:
sample_image = dataset[0]["images"].unsqueeze(0)
print("Input shape:", sample_image.shape)

In [None]:
print("Output shape:", model(sample_image, timestep=0).sample.shape)
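
The input and output shapes match because the up blocks undo the downsampling of the down blocks. As a rough sketch of what happens in between, the snippet below lists the spatial size each down block would see, under the assumption that every down block except the last one halves the resolution. The helper `down_block_resolutions` is hypothetical and only illustrates the arithmetic; it does not inspect the real model.

```python
from typing import List

# Hypothetical helper: spatial size entering each down block, assuming every
# block except the last one halves the resolution (an assumption about
# how UNet2DModel is built, not something read from the model itself).
def down_block_resolutions(sample_size: int, num_blocks: int) -> List[int]:
    return [sample_size // (2 ** i) for i in range(num_blocks)]


block_types = ("DownBlock2D",) * 4 + ("AttnDownBlock2D", "DownBlock2D")
for block, size in zip(block_types, down_block_resolutions(128, len(block_types))):
    print(f"{block:16s} sees {size}x{size} feature maps")
```

Under that assumption, the self-attention block operates on small 8x8 feature maps, which keeps the cost of attention manageable.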

## Add some noise to the image

### Create a scheduler

The scheduler behaves differently depending on whether you're using the model for training or inference.

* During inference, the scheduler generates an image from noise.
* During training, the scheduler takes a model output, or a sample, from a specific point in the diffusion process and applies noise to the image according to a *noise schedule* and an *update rule*.

In [None]:
import torch

from PIL import Image
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)
noise=torch.randn(sample_image.shape)
timesteps=torch.LongTensor([50])
noisy_image=noise_scheduler.add_noise(sample_image, noise, timesteps)

Image.fromarray(((noisy_image.permute(0,2,3,1)+1.0)*127.5).type(torch.uint8).numpy()[0])
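
For intuition about what `add_noise` computes, here is a minimal pure-Python sketch of the closed-form forward diffusion step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. It assumes a linear beta schedule from 1e-4 to 0.02 over 1000 steps; these values and the helper names are assumptions made for illustration, and the code works on single floats rather than image tensors.

```python
import math
from typing import List

# Assumed linear beta schedule; the defaults here are illustrative assumptions.
def linear_alphas_cumprod(num_steps: int = 1000,
                          beta_start: float = 1e-4,
                          beta_end: float = 0.02) -> List[float]:
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    alphas_cumprod = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def add_noise(x0: float, eps: float, t: int, alphas_cumprod: List[float]) -> float:
    """Closed-form forward step: mix the clean value x0 with noise eps."""
    a = alphas_cumprod[t]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps


schedule = linear_alphas_cumprod()
for t in (0, 50, 999):
    signal, noise_weight = math.sqrt(schedule[t]), math.sqrt(1.0 - schedule[t])
    print(f"t={t}: signal weight {signal:.4f}, noise weight {noise_weight:.4f}")
```

At small timesteps the image dominates, while at the last timestep the sample is almost pure noise.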

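The training objective of the model is to predict the noise added to the image. The loss at this step can be calculated as shown below. The cell also computes an L1 loss and a weighted combination of the two; note that the training loop later in this notebook uses only the MSE loss.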

In [None]:
import torch.nn.functional as F

noise_pred = model(noisy_image, timesteps).sample
mse_loss = F.mse_loss(noise_pred, noise)
l1_loss = F.l1_loss(noise_pred, noise)
combined_loss = 0.7 * mse_loss + 0.3 * l1_loss  # Combine MSE and L1 losses
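
To see what the three quantities above measure, here is a small pure-Python sketch on plain lists of floats. The function names are hypothetical stand-ins for the tensor versions, and the 0.7 / 0.3 weights are simply the ones used in the cell above.

```python
from typing import Sequence

# Hypothetical list-based stand-ins for the tensor losses used above.
def mse_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean of squared differences; penalizes large errors more heavily."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean of absolute differences; less sensitive to outliers."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def combined_loss(pred: Sequence[float], target: Sequence[float],
                  mse_weight: float = 0.7, l1_weight: float = 0.3) -> float:
    """Weighted sum of the two losses."""
    return mse_weight * mse_loss(pred, target) + l1_weight * l1_loss(pred, target)


pred, target = [0.0, 2.0], [0.0, 0.0]
print(mse_loss(pred, target), l1_loss(pred, target), combined_loss(pred, target))
```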
 

## Train the model

### First, we will need an optimizer and a learning rate scheduler:

In [None]:
from diffusers.optimization import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=config.lr_warmup_steps,
    num_training_steps=len(train_dataloader) * config.num_epochs,
)
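
The shape of a cosine schedule with warmup can be sketched in a few lines of pure Python. This is an illustration of the general technique (linear warmup followed by a half-cosine decay), not a copy of the library's implementation; the function name and the total of 3000 steps are hypothetical.

```python
import math

# Hypothetical helper returning the multiplier applied to the base learning rate.
def cosine_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    if step < warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, warmup_steps)
    # Half-cosine decay from 1 to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 1e-4
for step in (0, 250, 500, 1750, 3000):
    lr = base_lr * cosine_warmup_factor(step, warmup_steps=500, total_steps=3000)
    print(f"step {step}: lr = {lr:.2e}")
```

The warmup avoids large updates while the optimizer statistics are still unreliable, and the decay lets training settle toward the end.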

### Second, we will need a way to evaluate the model.

For evaluation, we can use the DDPMPipeline to generate a batch of sample images and save it as a grid:

In [None]:
from diffusers import DDPMPipeline
import math
import os

def make_grid(images, rows, cols):
    w, h =images[0].size
    grid = Image.new('RGB', size=(cols * w, rows * h))
    for i, image in enumerate(images):
        grid.paste(image, box=(i % cols * w, i // cols * h))
    return grid

def evaluate(config, epoch, pipeline):
    # Sample some images from random noise (this is the backward diffusion process).
    # The default pipeline output type is `List[PIL.Image]`
    images = pipeline(
        batch_size=config.eval_batch_size,
        generator=torch.manual_seed(config.seed),
    ).images

    # Make a grid out of the images
    image_grid = make_grid(images, rows=4, cols=4)

    # Save the images
    test_dir=os.path.join(config.output_dir, "samples")
    os.makedirs(test_dir, exist_ok=True)
    image_grid.save(f"{test_dir}/{epoch:04d}.png")
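
The placement logic in `make_grid` relies on integer arithmetic: the column is `i % cols` and the row is `i // cols`. The sketch below reproduces only that coordinate computation, without PIL, so the layout can be checked on its own. The helper name `grid_boxes` is hypothetical.

```python
from typing import List, Tuple

# Hypothetical helper mirroring the paste coordinates computed in make_grid.
def grid_boxes(num_images: int, cols: int, w: int, h: int) -> List[Tuple[int, int]]:
    """Top-left corner of each image when filling a grid row by row."""
    return [(i % cols * w, i // cols * h) for i in range(num_images)]


boxes = grid_boxes(num_images=16, cols=4, w=128, h=128)
print(boxes[0], boxes[5], boxes[15])
```

With 16 images of 128x128 pixels and 4 columns, the grid is 512x512 pixels and every image gets a distinct cell.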

### Wrapping all in a training loop

We can use Hugging Face Accelerate for TensorBoard logging, gradient accumulation, and mixed precision training.

In [None]:
from __future__ import annotations
from accelerate import Accelerator
from huggingface_hub import HfFolder, Repository, whoami
from tqdm.auto import tqdm
from pathlib import Path
import os


def get_full_repo_name(model_id:str, organization:str=None, token:str=None):
    if token is None:
        token=HfFolder.get_token()
    if organization is None:
        username = whoami(token=token)["name"]
        return f"{username}/{model_id}"
    else:
        return f"{organization}/{model_id}"
    
def train_loop(config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler):
    # Initialize accelerator and TensorBoard logging
    accelerator = Accelerator(
        mixed_precision=config.mixed_precision,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        log_with="tensorboard",
        project_dir=os.path.join(config.output_dir, "logs"),
    )
    if accelerator.is_main_process:
        if config.push_to_hub:
            repo_name=get_full_repo_name(Path(config.output_dir).name)
            repo=Repository(config.output_dir, clone_from=repo_name)
        elif config.output_dir is not None:
            os.makedirs(config.output_dir, exist_ok=True)
        accelerator.init_trackers("train_example")
    
    # Prepare all the components
    # There is no specific order to remember, you just need to unpack the objects in the same order you gave them to the prepare method
    model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
        model,
        optimizer,
        train_dataloader,
        lr_scheduler,
    )

    global_step = 0 #the global step across all epochs

    # Now we train the model
    for epoch in range(config.num_epochs):
        progress_bar =tqdm(total=len(train_dataloader), disable=not accelerator.is_local_main_process)
        progress_bar.set_description(f"Epoch {epoch}")

        for step, batch in enumerate(train_dataloader):
            clean_images=batch["images"]
            # Sample noise to add to the images
            noise=torch.randn(clean_images.shape).to(clean_images.device)
            bs=clean_images.shape[0]

            # Sample a random timestep for each image
            timesteps=torch.randint(0, noise_scheduler.config.num_train_timesteps, size=(bs,), device=clean_images.device).long()

            # Add noise to the clean images according to the noise magnitude at each timestep (this is the forward diffusion process)
            noisy_images=noise_scheduler.add_noise(clean_images, noise, timesteps)

            with accelerator.accumulate(model):
                # Predict the noise residual
                noise_pred = model(noisy_images, timesteps, return_dict=False)[0]
                loss=F.mse_loss(noise_pred, noise)
                accelerator.backward(loss)

                accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
            
            progress_bar.update(1)
            logs={"loss": loss.detach().item(),"lr": lr_scheduler.get_last_lr()[0],"step": global_step}
            progress_bar.set_postfix(**logs)
            accelerator.log(logs,step=global_step)
            global_step += 1

        
        # After each epoch, optionally sample some demo images with evaluate() and save the model
        if accelerator.is_main_process:
            pipeline = DDPMPipeline(unet=accelerator.unwrap_model(model), scheduler=noise_scheduler)

            if (epoch + 1) % config.save_image_epochs == 0 or epoch == config.num_epochs - 1:
                evaluate(config, epoch, pipeline)

            if (epoch + 1) % config.save_model_epochs == 0 or epoch == config.num_epochs - 1:
                # Save locally first so that there is a checkpoint to push
                pipeline.save_pretrained(config.output_dir)
                if config.push_to_hub:
                    repo.push_to_hub(commit_message=f"Epoch {epoch}", blocking=True)
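
The two `if` conditions at the end of the loop decide in which epochs samples and checkpoints are written. The sketch below evaluates the same condition outside the training loop; the helper name `trigger_epochs` is hypothetical, and the numbers are the defaults from `TrainingConfig`.

```python
from typing import List

# Hypothetical helper applying the same condition as the training loop.
def trigger_epochs(num_epochs: int, every: int) -> List[int]:
    """Zero-based epochs where (epoch + 1) % every == 0, plus the final epoch."""
    return [
        epoch
        for epoch in range(num_epochs)
        if (epoch + 1) % every == 0 or epoch == num_epochs - 1
    ]


print("sample images at epochs:", trigger_epochs(num_epochs=50, every=10))
print("save model at epochs:   ", trigger_epochs(num_epochs=50, every=30))
```

Because the final epoch always triggers, the last checkpoint is saved even when `num_epochs` is not a multiple of the interval.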

### Launch the training with Accelerate's notebook_launcher function

Pass the training loop function, all the training arguments, and the number of processes (set this value to the number of GPUs available to you) to use for training:

In [None]:
from accelerate import notebook_launcher

args = (config, model, noise_scheduler, optimizer, train_dataloader, lr_scheduler)

# For example: num_processes=8 on a TPU, num_processes=2 on multiple CPUs
notebook_launcher(train_loop, args, num_processes=1)

### Checking the final images generated by the model

In [None]:
import glob

sample_images=sorted(glob.glob(f"{config.output_dir}/samples/*.png"))
Image.open(sample_images[-1])
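
Picking `sample_images[-1]` after `sorted(...)` works because the file names are zero-padded with `{epoch:04d}`, so lexicographic order matches numeric order. A short sketch with made-up epoch numbers shows the difference:

```python
# Illustrative epoch numbers; the file names follow the f"{epoch:04d}.png" pattern.
epochs = (9, 19, 49)

padded = [f"{epoch:04d}.png" for epoch in epochs]
unpadded = [f"{epoch}.png" for epoch in epochs]

latest_padded = sorted(padded)[-1]      # the highest epoch sorts last
latest_unpadded = sorted(unpadded)[-1]  # string order puts "9.png" after "49.png"

print(latest_padded, latest_unpadded)
```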

## Evaluate Model on Test Data

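Let's evaluate the trained model on held-out test data and visualize some of the reconstructed images. This gives us an idea of how well the model has learned to predict the noise.

We'll use the same process as in training: add noise to clean images, predict that noise with the model, and then invert the forward diffusion step to estimate the clean images. Subtracting the predicted noise directly would ignore the scaling applied by the noise schedule, so the estimate rescales by the cumulative alpha of each timestep.

Let's proceed with the evaluation:

In [None]:
# Evaluate the model on test data
# Assumption: the dataset provides a "test" split. If it does not, hold out part
# of the training data instead (for example with dataset.train_test_split).
test_dataset = load_dataset(config.dataset_name, split="test")
test_dataset.set_transform(transform)  # same preprocessing; produces the "images" key
test_dataloader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=config.eval_batch_size,
    shuffle=False
)

device = next(model.parameters()).device  # the model may have been moved to a GPU during training
model.eval()

# Noise the test images, predict the noise, and estimate the clean images
test_samples = []
with torch.no_grad():
    for batch in test_dataloader:
        clean_images = batch["images"].to(device)
        noise = torch.randn(clean_images.shape, device=device)
        timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, size=(clean_images.shape[0],), device=device).long()

        noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
        predicted_noise = model(noisy_images, timesteps, return_dict=False)[0]

        # Invert the forward process: x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
        alpha_bar = noise_scheduler.alphas_cumprod.to(device)[timesteps].view(-1, 1, 1, 1)
        generated_images = (noisy_images - (1 - alpha_bar).sqrt() * predicted_noise) / alpha_bar.sqrt()
        test_samples.append(torch.clamp(generated_images, -1.0, 1.0).float().cpu())

# Display a few reconstructed images from the test dataset
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(16, 4))
for i, generated_image in enumerate(test_samples[0][:4]):
    # Rescale from [-1, 1] to [0, 1] for display
    axs[i].imshow((generated_image.permute(1, 2, 0).numpy() + 1.0) / 2.0)
    axs[i].set_axis_off()
fig.show()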
