A minimal implementation of Denoising Diffusion Probabilistic Models (DDPM) built entirely from scratch in PyTorch. The model learns to generate images through an iterative denoising process. Currently supports MNIST (28x28 grayscale) and CIFAR-10 (32x32 RGB).
DDPM works in two phases:
- Forward process (training): Gradually add Gaussian noise to real images over T=1000 timesteps until they become pure noise.
- Reverse process (inference): Starting from pure noise, the trained UNet predicts and removes the noise step by step, recovering a clean image.
Pure Noise (t=999) --> ... --> Partially Denoised (t=500) --> ... --> Clean Image (t=0)
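To make the diagram concrete, here is a minimal plain-Python sketch (mirroring, but not copied from, `pipelines/ddpm_scheduler.py`) of the linear beta schedule and the cumulative signal fraction `alpha_bar_t`, which decays from nearly 1 at t=0 to nearly 0 at t=999:

```python
import math

def linear_beta_schedule(beta_start=0.0001, beta_end=0.02, num_timesteps=1000):
    # Evenly spaced noise variances from beta_start to beta_end
    step = (beta_end - beta_start) / (num_timesteps - 1)
    return [beta_start + i * step for i in range(num_timesteps)]

def alpha_bar(t, betas):
    # alpha_bar_t: cumulative product of (1 - beta) up to timestep t;
    # it measures how much of the original signal survives at step t
    prod = 1.0
    for beta in betas[:t + 1]:
        prod *= 1.0 - beta
    return prod

betas = linear_beta_schedule()
# Signal fraction at the start, middle, and end of the forward process:
# close to 1 at t=0 (clean image), close to 0 at t=999 (pure noise)
for t in (0, 500, 999):
    print(t, alpha_bar(t, betas))
```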
.
├── configs/
│ ├── config_mnist.py # Hyperparameters for MNIST
│ └── config_cifar10.py # Hyperparameters for CIFAR-10
├── models/
│ └── unet.py # UNet architecture with time embeddings & attention
├── pipelines/
│ └── ddpm_scheduler.py # Linear noise scheduler (forward + reverse diffusion)
├── utils/
│ ├── mnist_dataset.py # CSV dataset loader for MNIST
│ ├── mnist_training.py # MNIST training loop
│ ├── cifar10_dataset.py # Folder-based image dataset loader for CIFAR-10
│ └── cifar10_training.py # CIFAR-10 training loop
├── train.py # Entry point for training
├── sample_mnist.py # Generate a single MNIST image
├── sample_cifar10.py # Generate a single CIFAR-10 image
├── app.py # Gradio web app for interactive generation
├── environment.yml # Conda environment file
└── README.md
The denoising network follows the standard UNet encoder-decoder structure:
- Encoder (DownBlocks): 3 blocks with channels [32, 64, 128, 256], each containing ResNet layers, self-attention, and optional spatial downsampling
- Bottleneck (MidBlocks): 2 blocks with channels [256, 256, 128], each with ResNet layers and self-attention
- Decoder (UpBlocks): 3 blocks mirroring the encoder with skip connections, ResNet layers, self-attention, and upsampling
Time conditioning: Sinusoidal positional embeddings encode the diffusion timestep, projected through an MLP and injected into every ResNet block.
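A plain-Python sketch of the standard transformer-style sinusoidal embedding, with `dim=128` matching `time_emb_dim` in the configs; the exact frequency layout in `models/unet.py` may differ slightly:

```python
import math

def sinusoidal_time_embedding(t, dim=128):
    # Embed a scalar timestep t as a dim-length vector:
    # first half of the vector holds sines, second half cosines,
    # at geometrically decreasing frequencies
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = sinusoidal_time_embedding(500, dim=128)
print(len(emb))  # 128
```

The resulting vector is then projected through an MLP before being added into each ResNet block's activations.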
Uses a linear beta schedule from beta_start=0.0001 to beta_end=0.02 across 1000 timesteps.
- `add_noise(x0, noise, t)` -- forward process: `x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise`
- `sample_prev_timestep(xt, noise_pred, t)` -- reverse step using the DDPM mean formula with posterior variance
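The two scheduler methods can be sketched on a single pixel value in plain Python (a simplified, self-contained analogue of the tensor versions in `pipelines/ddpm_scheduler.py`):

```python
import math
import random

def cumulative_alpha_bars(betas):
    # Running product of (1 - beta_t), i.e. alpha_bar for every timestep
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(x0, noise, t, abars):
    # Forward process, closed form:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
    return math.sqrt(abars[t]) * x0 + math.sqrt(1.0 - abars[t]) * noise

def sample_prev_timestep(xt, noise_pred, t, betas, abars):
    # Reverse step: DDPM posterior mean, plus Gaussian noise scaled by
    # the posterior variance for every step except the last (t == 0)
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    mean = (xt - beta_t / math.sqrt(1.0 - abars[t]) * noise_pred) / math.sqrt(alpha_t)
    if t == 0:
        return mean
    var = beta_t * (1.0 - abars[t - 1]) / (1.0 - abars[t])
    return mean + math.sqrt(var) * random.gauss(0.0, 1.0)

betas = [0.0001 + i * (0.02 - 0.0001) / 999 for i in range(1000)]
abars = cumulative_alpha_bars(betas)
xt = add_noise(0.5, 1.0, 999, abars)  # almost pure noise at t=999
```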
- Python 3.9+
- PyTorch 2.0+ (with CUDA or MPS support recommended)
conda env create -f environment.yml
conda activate ddpm
- Download the MNIST CSV files from Kaggle: MNIST in CSV
- Create a `data/` directory in the project root and place the downloaded CSV files there:
mkdir -p data
# Move the downloaded files into the data directory
mv /path/to/mnist_train.csv data/train.csv
mv /path/to/mnist_test.csv data/test.csv  # optional, not used for training
Your project should look like:
UnconditionalDDPM/
├── data/
│ └── train.csv <-- required
├── configs/
├── models/
└── ...
The expected CSV format is:
label,1x1,1x2,...,28x28
7,0,0,...,0
2,0,0,...,255
First column is the digit label (unused for unconditional training), remaining 784 columns are pixel values (0-255) for a 28x28 image. The config at configs/config_mnist.py points to data/train.csv by default -- update csv_path there if your file is named differently.
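As a sanity check for this format, here is a small plain-Python parser for one CSV row; the actual loader in `utils/mnist_dataset.py` may differ in details, and the [-1, 1] scaling shown here is an assumption based on the common diffusion-training convention:

```python
def parse_mnist_row(line):
    # Split one CSV row: first field is the digit label (ignored by the
    # unconditional model), the remaining 784 fields are pixels in [0, 255]
    fields = line.strip().split(",")
    label = int(fields[0])
    pixels = [int(v) for v in fields[1:]]
    assert len(pixels) == 28 * 28, "expected a 28x28 image"
    # Scale to [-1, 1], the range diffusion models typically train on
    image = [p / 127.5 - 1.0 for p in pixels]
    return label, image

label, image = parse_mnist_row("7," + ",".join("0" for _ in range(784)))
print(label, image[0])  # 7 -1.0
```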
- Download the CIFAR-10 images dataset from Kaggle: CIFAR-10 Images (or any source that provides CIFAR-10 as individual PNG files).
- Place all training images (`.png` files) into a `data/train/` directory in the project root:
mkdir -p data/train
# Move/copy all CIFAR-10 PNG images into data/train/
cp /path/to/cifar10-images/*.png data/train/
Your project should look like:
UnconditionalDDPM/
├── data/
│ └── train/
│ ├── 0.png
│ ├── 1.png
│ ├── 2.png
│ └── ... <-- 32x32 RGB PNG images
├── configs/
├── models/
└── ...
The dataset loader reads every image file in the folder, converts it to RGB, and normalizes pixel values to [-1, 1]. The config at configs/config_cifar10.py points to data/train by default -- update folder_path there if your images are elsewhere.
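A hypothetical sketch of the two pieces of that behavior (the real loader in `utils/cifar10_dataset.py` additionally decodes each image, e.g. via PIL, which is omitted here):

```python
from pathlib import Path

def list_training_images(folder="data/train"):
    # Collect every .png file in the folder, in deterministic sorted order
    return sorted(Path(folder).glob("*.png"))

def to_model_range(pixel):
    # Map an 8-bit value (0-255) into the [-1, 1] range the model trains on
    return pixel / 127.5 - 1.0
```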
To train on a specific dataset, update the import in train.py to use the desired config and training module:
MNIST:
# train.py should import from configs.config_mnist and utils.mnist_training
python train.py
Trains the UNet for 40 epochs on 1000 randomly sampled images from the CSV. Checkpoints are saved to `mnist/ddpm_ckpt.pth`.
CIFAR-10:
# train.py should import from configs.config_cifar10 and utils.cifar10_training
python train.py
Trains the UNet for 20 epochs on 40,000 images from the image folder. Checkpoints are saved to `cifar10/cifar10_ckpt.pth`.
Training progress is printed per epoch:
Epoch 1/40: 100%|██████████| 15/15 [00:05<00:00]
Finished epoch:1 | Loss : 0.8226
Finished epoch:2 | Loss : 0.5288
...
Done Training ...
MNIST:
python sample_mnist.py
Generates one digit image by running the full 1000-step reverse diffusion from pure noise. Output is saved to `default/sample/generated_sample.png`.
CIFAR-10:
python sample_cifar10.py
Generates one 32x32 RGB image via 1000-step reverse diffusion. Output is saved to `cifar10/sample/cifar10_sample.png`.
python app.py
Opens a Gradio interface at http://127.0.0.1:7860 where you can:
- Set a random seed (or -1 for random)
- Generate a single digit and view the result
- See the denoising progression as a horizontal strip (noise to clean image)
Each dataset has its own config file under configs/.
MNIST (`configs/config_mnist.py`):

| Parameter | Value | Description |
|---|---|---|
| `num_timesteps` | 1000 | Number of diffusion steps |
| `beta_start` / `beta_end` | 0.0001 / 0.02 | Linear noise schedule range |
| `im_channels` | 1 | Grayscale |
| `im_size` | 28 | Image resolution (28x28) |
| `down_channels` | [32, 64, 128, 256] | Feature channels per encoder level |
| `time_emb_dim` | 128 | Timestep embedding dimension |
| `num_heads` | 2 | Attention heads per block |
| `batch_size` | 64 | Training batch size |
| `num_epochs` | 40 | Training epochs |
| `lr` | 0.0001 | Adam learning rate |
| `subset_size` | 1000 | Number of images to sample from the CSV |
CIFAR-10 (`configs/config_cifar10.py`):

| Parameter | Value | Description |
|---|---|---|
| `num_timesteps` | 1000 | Number of diffusion steps |
| `beta_start` / `beta_end` | 0.0001 / 0.02 | Linear noise schedule range |
| `im_channels` | 3 | RGB |
| `im_size` | 32 | Image resolution (32x32) |
| `down_channels` | [32, 64, 128, 256] | Feature channels per encoder level |
| `time_emb_dim` | 128 | Timestep embedding dimension |
| `num_heads` | 2 | Attention heads per block |
| `num_down_layers` | 2 | Layers per downsample block (deeper than MNIST) |
| `batch_size` | 64 | Training batch size |
| `num_epochs` | 20 | Training epochs |
| `lr` | 0.0001 | Adam learning rate |
| `subset_size` | 40000 | Number of images to use from the folder |
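A hypothetical sketch of what `configs/config_cifar10.py` might contain, assembled from the tables above; the real file may structure these values differently (e.g. as a dict or class):

```python
# Sketch of a CIFAR-10 config module; values taken from the tables above
num_timesteps = 1000                # diffusion steps
beta_start = 0.0001                 # linear schedule start
beta_end = 0.02                     # linear schedule end
im_channels = 3                     # RGB
im_size = 32                        # 32x32 images
down_channels = [32, 64, 128, 256]  # encoder feature channels
time_emb_dim = 128                  # timestep embedding dimension
num_heads = 2                       # attention heads per block
num_down_layers = 2                 # layers per downsample block
batch_size = 64
num_epochs = 20
lr = 0.0001                         # Adam learning rate
subset_size = 40000                 # images used from the folder
folder_path = "data/train"          # update if your images live elsewhere
```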
- Denoising Diffusion Probabilistic Models (Ho et al., 2020)
- The Annotated Diffusion Model (Hugging Face)