# Scalable Diffusion Models with Transformers (DiT)

This notebook samples from pre-trained DiT models. DiTs are class-conditional latent diffusion models trained on ImageNet that use transformers in place of U-Nets as the DDPM backbone. According to the paper, DiT-XL/2 outperformed all prior diffusion models on the class-conditional ImageNet 256x256 and 512x512 benchmarks.

[Project Page](https://www.wpeebles.com/DiT) | [HuggingFace Space](https://huggingface.co/spaces/wpeebles/DiT) | [Paper](http://arxiv.org/abs/2212.09748) | [GitHub](https://github.com/facebookresearch/DiT)

# 1. Setup

We recommend using a GPU (Runtime > Change runtime type > Hardware accelerator > GPU). Run the cells below to clone the GitHub repo and set up PyTorch. You only have to run them once.

In [87]:
%cd /content/Diffusion




/content/Diffusion


In [85]:
!git clone https://github.com/Riccardo582/Diffusion.git

Cloning into 'Diffusion'...
remote: Enumerating objects: 122, done.
remote: Counting objects: 100% (122/122), done.
remote: Compressing objects: 100% (77/77), done.
remote: Total 122 (delta 52), reused 113 (delta 43), pack-reused 0 (from 0)
Receiving objects: 100% (122/122), 4.80 MiB | 10.37 MiB/s, done.
Resolving deltas: 100% (52/52), done.


In [70]:
!ls
import os, sys
os.chdir("Diffusion")
sys.path.append(os.getcwd())


#pip install diffusers timm --upgrade
# DiT imports:
import torch
from torchvision.utils import save_image
from diffusion import create_diffusion
from diffusers.models import AutoencoderKL
from download import find_model
from models import DiT, DiT_XL_2
from PIL import Image
from IPython.display import display
torch.set_grad_enabled(False)
device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cpu":
    print("GPU not found. Using CPU instead.")

Diffusion  sample_data


# Download DiT-XL/2 Models

You can choose between a 512x512 model and a 256x256 model. You can also swap out the LDM VAE.
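
For reference, the latent resolution follows from the image size. The sketch below assumes the Stable Diffusion VAEs listed in the next cell downsample each spatial side by a factor of 8 and use 4 latent channels (consistent with the `torch.randn(n, 4, latent_size, latent_size)` call used during sampling). `latent_shape` is a hypothetical helper for illustration, not part of the repo.

```python
VAE_DOWNSAMPLE = 8   # assumption: spatial downsampling factor of the SD VAE
LATENT_CHANNELS = 4  # matches the 4-channel noise tensor used for sampling


def latent_shape(batch_size, image_size):
    """Return the (N, C, H, W) shape of the latent noise for a square image."""
    if image_size % VAE_DOWNSAMPLE != 0:
        raise ValueError("image_size must be divisible by 8")
    side = image_size // VAE_DOWNSAMPLE
    return (batch_size, LATENT_CHANNELS, side, side)


print(latent_shape(8, 256))  # (8, 4, 32, 32)
print(latent_shape(8, 512))  # (8, 4, 64, 64)
```

Under these assumptions, switching from the 256x256 to the 512x512 model quadruples the number of latent positions per sample.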

In [None]:
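image_size = 256 #@param [256, 512]
vae_model = "stabilityai/sd-vae-ft-ema" #@param ["stabilityai/sd-vae-ft-mse", "stabilityai/sd-vae-ft-ema"]
latent_size = int(image_size) // 8  # the VAE downsamples each side by 8
# Load model:
model = DiT_XL_2(input_size=latent_size).to(device)
state_dict = find_model(f"DiT-XL-2-{image_size}x{image_size}.pt")
model.load_state_dict(state_dict)
model.eval() # important!
# Load the VAE used to decode latents into images:
vae = AutoencoderKL.from_pretrained(vae_model).to(device)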


# 2. Sample from Pre-trained DiT Models

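You can customize several sampling options: the random seed, the number of sampling steps, the classifier-free guidance scale, and the class labels to generate. For the full list of ImageNet classes, see [this gist](https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a).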
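
The `cfg_scale` option controls classifier-free guidance. As a rough sketch of the idea, assuming the standard formulation in which the guided noise prediction extrapolates from the unconditional prediction toward the class-conditional one, the blend can be written in plain Python. `apply_cfg` is a hypothetical helper for illustration only; the notebook itself delegates this step to `model.forward_with_cfg`, which receives a doubled batch whose second half carries the null label `1000`.

```python
def apply_cfg(cond_eps, uncond_eps, cfg_scale):
    """Blend class-conditional and unconditional noise predictions element-wise.

    cfg_scale = 1 returns the conditional prediction unchanged; larger values
    push the result further away from the unconditional prediction.
    """
    return [u + cfg_scale * (c - u) for c, u in zip(cond_eps, uncond_eps)]


# Toy predictions for a two-element "latent" (illustrative numbers only).
cond = [1.0, 2.0]
uncond = [0.0, 1.0]
print(apply_cfg(cond, uncond, 1.0))  # [1.0, 2.0]
print(apply_cfg(cond, uncond, 4.0))  # [4.0, 5.0]
```

Higher guidance scales typically trade sample diversity for closer adherence to the requested class.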

In [None]:
# Set user inputs:
seed = 0 #@param {type:"number"}
torch.manual_seed(seed)
num_sampling_steps = 250 #@param {type:"slider", min:0, max:1000, step:1}
cfg_scale = 4 #@param {type:"slider", min:1, max:10, step:0.1}
class_labels = 207, 360, 387, 974, 88, 979, 417, 279 #@param {type:"raw"}
samples_per_row = 4 #@param {type:"number"}

# Create diffusion object:
diffusion = create_diffusion(str(num_sampling_steps))

# Create sampling noise:
n = len(class_labels)
z = torch.randn(n, 4, latent_size, latent_size, device=device)
y = torch.tensor(class_labels, device=device)

# Setup classifier-free guidance:
z = torch.cat([z, z], 0)
y_null = torch.tensor([1000] * n, device=device)
y = torch.cat([y, y_null], 0)
model_kwargs = dict(y=y, cfg_scale=cfg_scale)

# Sample images:
samples = diffusion.p_sample_loop(
    model.forward_with_cfg, z.shape, z, clip_denoised=False, 
    model_kwargs=model_kwargs, progress=True, device=device
)
samples, _ = samples.chunk(2, dim=0)  # Remove null class samples
samples = vae.decode(samples / 0.18215).sample

# Save and display images:
save_image(samples, "sample.png", nrow=int(samples_per_row), 
           normalize=True, value_range=(-1, 1))
samples = Image.open("sample.png")
display(samples)

In [71]:
import torch
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())


CUDA available: True
GPU count: 1


In [88]:
!torchrun --nnodes=1 --nproc_per_node=1 train.py \
  --model DiT-XL/2 \
  --data-path data/datasets/data/burgers_train_16.pt \
  --image-size 16 \
  --cx 1 \
  --cy 1 \
  --epochs 5 \
  --global-batch-size 8 \
  --num-workers 2


  self.setter(val)
Starting rank=0, seed=0, world_size=1.
[2026-01-29 16:11:48] Experiment directory created at results/000-DiT-XL-2
[2026-01-29 16:12:01] DiT Parameters: 673,641,220
[2026-01-29 16:12:01] Dataset contains 800 PDE samples (data/datasets/data/burgers_train_16.pt)
[2026-01-29 16:12:01] Training for 5 epochs...
[2026-01-29 16:12:01] Beginning epoch 0...
[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/Diffusion/train.py", line 362, in <module>
[rank0]:     main(args)
[rank0]:   File "/content/Diffusion/train.py", line 261, in main
[rank0]:     for x_cond, y, phys in loader:
[rank0]:                            ^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 732, in __next__
[rank0]:     data = self._next_data()
[rank0]:            ^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1506, in _next

In [67]:
import os
print("Current working directory:")
print(os.getcwd())


Current working directory:


FileNotFoundError: [Errno 2] No such file or directory

In [78]:
!ls data/datasets/data


burgers_test_16.pt   darcy_test_16.pt  mini_car.pt
burgers_train_16.pt  darcy_test_32.pt
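
Because `--data-path` is given relative to the working directory, it can help to confirm that the file resolves before launching `torchrun`. The following is a minimal sketch using only the standard library; `check_data_path` is a hypothetical helper, not part of the repo, and it only verifies that the file exists, not that its contents are valid.

```python
from pathlib import Path


def check_data_path(relative_path, base_dir=None):
    """Resolve relative_path against base_dir (default: cwd) and verify it is a file."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    full_path = (base / relative_path).resolve()
    if not full_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {full_path}")
    return full_path
```

For example, calling `check_data_path("data/datasets/data/burgers_train_16.pt")` from the repo root should return the absolute path if the file listed above is present, and raise `FileNotFoundError` with the resolved path otherwise.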
