Yongsheng Yu1,2 Wei Xiong1† Weili Nie1 Yichen Sheng1 Shiqiu Liu1 Jiebo Luo2
1NVIDIA 2University of Rochester
†Project Lead and Main Advising
PixelDiT is a single-stage, end-to-end pixel-space diffusion transformer that eliminates the VAE entirely. It uses a dual-level architecture — a patch-level DiT for global semantics plus a pixel-level DiT for texture details — to generate images directly in pixel space.
- 1.61 FID on ImageNet 256×256
- 0.74 GenEval / 83.5 DPG-Bench on text-to-image at 1024×1024
- No VAE, no latent space
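The dual-level design above can be sketched as a data-flow skeleton. Everything below is illustrative: the block internals are stand-in identity functions, and the names (`patch_level_blocks`, `pixel_level_blocks`) and patch size 16 are assumptions, not the released implementation.

```python
import numpy as np

def patchify(img, p=16):
    # (H, W, C) -> (num_patches, p*p*C) token sequence
    h, w, c = img.shape
    x = img.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * c)

def unpatchify(tokens, h, w, p=16, c=3):
    # Inverse of patchify: tokens back to an (H, W, C) image.
    x = tokens.reshape(h // p, w // p, p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)

def patch_level_blocks(tokens):
    # Stand-in for the patch-level DiT: attends across patch tokens
    # to model global semantics. Identity here.
    return tokens

def pixel_level_blocks(tokens, p=16, c=3):
    # Stand-in for the pixel-level DiT: views each patch token as its
    # p*p pixels and refines texture within the patch. Identity here.
    pixels = tokens.reshape(tokens.shape[0], p * p, c)
    return pixels.reshape(tokens.shape[0], p * p * c)

def pixeldit_forward(img):
    tokens = patchify(img)               # global stage input
    tokens = patch_level_blocks(tokens)
    tokens = pixel_level_blocks(tokens)  # local texture stage
    return unpatchify(tokens, *img.shape[:2])

img = np.random.rand(256, 256, 3).astype(np.float32)
out = pixeldit_forward(img)
assert out.shape == img.shape
```

With identity blocks the pipeline is a round-trip, which makes the patchify/unpatchify bookkeeping easy to verify before swapping in real transformer blocks.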
- [2025/11] Paper, training & inference code, and pre-trained models are released.
All evaluations use FlowDPMSolver with 100 sampling steps and 50K generated samples; metrics follow the ADM evaluation protocol.
**Class-conditional ImageNet 256×256**

| Epoch | gFID↓ | CFG Scale | Steps | Sampler | Time Shift | CFG Interval |
|---|---|---|---|---|---|---|
| 80 | 2.36 | 3.25 | 100 | FlowDPMSolver | 1.0 | [0.1, 1.0] |
| 160 | 1.97 | 3.25 | 100 | FlowDPMSolver | 1.0 | [0.1, 1.0] |
| 320 | 1.61 | 2.75 | 100 | FlowDPMSolver | 1.0 | [0.1, 0.9] |
**Class-conditional ImageNet 512×512**

| Resolution | gFID↓ | CFG Scale | Steps | Sampler | Time Shift | CFG Interval |
|---|---|---|---|---|---|---|
| 512×512 | 1.81 | 3.5 | 100 | FlowDPMSolver | 2.0 | [0.1, 1.0] |
**Text-to-image generation**

| Resolution | GenEval↑ | DPG-Bench↑ |
|---|---|---|
| 512×512 | 0.78 | 83.7 |
| 1024×1024 | 0.74 | 83.5 |
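The table columns map directly onto sampler settings. A hypothetical config fragment for the best ImageNet 256×256 entry (epoch 320); the key names are illustrative, not the repo's actual config schema:

```python
# Hypothetical sampler settings (key names are my own, not the repo's).
sampler_cfg = {
    "sampler": "FlowDPMSolver",
    "steps": 100,
    "cfg_scale": 2.75,
    "time_shift": 1.0,
    "cfg_interval": (0.1, 0.9),  # CFG applied only for t in this range
}

def use_cfg(t, cfg):
    # CFG is enabled only when the normalized timestep falls inside
    # the configured interval; outside it, a plain conditional pass runs.
    lo, hi = cfg["cfg_interval"]
    return lo <= t <= hi

print(use_cfg(0.5, sampler_cfg))   # True: inside the interval
print(use_cfg(0.95, sampler_cfg))  # False: outside the interval
```

Restricting CFG to an interval is a common trick to avoid over-guidance at the ends of the trajectory; note the best 256×256 result uses `[0.1, 0.9]` while earlier epochs use `[0.1, 1.0]`.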
Docker image (recommended): `nvcr.io/nvidia/pytorch:24.09-py3`

```shell
pip install -r requirements.txt
```

Note: our training jobs are resumed every 4 hours, using the timestamp as the random seed on each resume. As a result, the final training outcome may differ slightly from a continuous run without intermediate restarts.
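The timestamp-as-seed resume pattern described above can be sketched as follows; the helper name is mine, not from the repo:

```python
import random
import time

def reseed_on_resume(timestamp=None):
    # Derive the RNG seed from the wall-clock time, as described for
    # each 4-hour resume. Passing an explicit (logged) timestamp
    # reproduces the same RNG stream later.
    seed = int(timestamp if timestamp is not None else time.time())
    random.seed(seed)
    return seed

# Each resume gets its own seed; replaying a logged timestamp
# reproduces the draws from that segment of training.
reseed_on_resume(1700000000)
a = random.random()
reseed_on_resume(1700000000)
assert random.random() == a
```

Logging the seed at every resume is what keeps each 4-hour segment individually reproducible even though the overall run differs from an uninterrupted one.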
Training and evaluation instructions for class-conditioned generation on ImageNet 256×256 and 512×512.
Multi-stage training (512px → 1024px) and inference for text-to-image generation.
```
├── pixdit_core/   # Shared PixelDiT model definitions (c2i & t2i)
├── tools/         # Shared utilities (checkpoint download, GFLOPs computation)
├── c2i/           # Class-to-image
└── t2i/           # Text-to-image
```
Measure single-forward-pass GFLOPs for any PixelDiT model (run from project root):
```shell
# C2I (ImageNet 256x256, default resolution)
python tools/compute_flops.py --config c2i/configs/pix256_xl.yaml

# T2I at 1024x1024
python tools/compute_flops.py --config t2i/configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml --height 1024 --width 1024
```

If you find this work useful, please cite:
```bibtex
@inproceedings{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}
```