Vision-only x4 latent diffusion super-resolution experiments. Report source: paper/main.tex.
This is a public source-available, non-commercial research project. The goal is to train an SR model directly, without using a pretrained text-to-image diffusion model. The intended final model handles photo and anime/illustration domains in one codebase with domain conditioning.
This repository is not OSI-approved open source because commercial use is not permitted.
Target task:
LR 128x128 -> HR 512x512
LR 192x192 -> HR 768x768 later
Planned model:
HR image
-> factor-4 VAE / autoencoder
-> HR latent
LR image
-> condition encoder
-> multi-scale LR features
noisy HR latent + LR features + timestep + domain embedding
-> conditional diffusion U-Net
-> denoised HR latent
-> VAE decoder
-> x4 SR output
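As a quick sanity check on the shapes in the pipeline above, here is a minimal sketch. The helper name `sr_shapes` is hypothetical and not part of the repository; the 16 latent channels come from the Stage 1 VAE shape reported in this README.

```python
def sr_shapes(lr_size, scale=4, vae_factor=4, latent_channels=16):
    """Return (channels, height, width) for the LR input, HR target, and HR latent."""
    hr_size = lr_size * scale
    if hr_size % vae_factor != 0:
        raise ValueError("HR size must be divisible by the VAE downsampling factor")
    latent_size = hr_size // vae_factor
    return {
        "lr": (3, lr_size, lr_size),
        "hr": (3, hr_size, hr_size),
        "latent": (latent_channels, latent_size, latent_size),
    }


if __name__ == "__main__":
    for size in (128, 192):
        print(size, sr_shapes(size))
```

Because the SR scale and the VAE factor are both 4, the HR latent grid has the same spatial size as the LR image.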
Constraints:
- PyTorch first.
- ROCm/GPU primary.
- No custom CUDA/ROCm ops.
- TPU/XLA compatibility is a later consideration, so code should stay close to standard PyTorch where practical.
- No pretrained T2I model dependency.
The first passes of Stage 1 (VAE / autoencoder), Stage 2 (deterministic LR -> HR latent pretraining), and Stage 3 (conditional latent diffusion) are complete. The photo100k scale-up has completed through Stage 4 condition-start, and the current best sampled photo100k checkpoint is the Stage 4 condition-start checkpoint initialized from Stage 3:
Stage 2 photo100k: latent_pretrain_photo100k_b64, finished step 30000
Stage 3 photo100k: diffusion_photo100k_b32, finished step 60000
Stage 4 photo100k: diffusion_photo100k_b32_stage4_condition, finished step 5000
sampled val100: Stage3 25.3745 PSNR, Stage4 25.4072 PSNR
For denoise/sharpening work, photo_v2 degradation is implemented and the
Stage 2/3 photo100k v2 fine-tunes have completed:
Stage 2 photo100k v2: latent_pretrain_photo100k_v2_b64, finished step 20000
Stage 3 photo100k v2: diffusion_photo100k_b32_v2, finished step 20000
Stage 3 v2 sampled val100: SR 22.6699 PSNR, bicubic 22.4103 PSNR, delta +0.2595
Stage 4 photo100k v2: diffusion_photo100k_b32_stage4_condition_v2, finished step 5000
Stage 4 v2 sampled val100: SR 22.8426 PSNR, bicubic 22.4103 PSNR, delta +0.4323
Stage 4 v2 vs Stage 3 v2: +0.1727 PSNR, wins 81 / losses 19
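A mean delta and a win/loss count like the ones above can be derived from per-image PSNR values. The following is a minimal sketch under that assumption; `compare_psnr` is a hypothetical helper, not the actual logic of compare_eval_samples.py.

```python
def compare_psnr(baseline, candidate):
    """Compare two per-image PSNR lists that are aligned by image index."""
    if not baseline or len(baseline) != len(candidate):
        raise ValueError("expected two non-empty lists of equal length")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    return {
        "mean_delta": sum(deltas) / len(deltas),
        "wins": sum(d > 0 for d in deltas),    # candidate better on this image
        "losses": sum(d < 0 for d in deltas),  # baseline better on this image
        "ties": sum(d == 0 for d in deltas),
    }


if __name__ == "__main__":
    # Toy numbers for illustration only, not measured results.
    print(compare_psnr([25.0, 24.0, 23.0], [25.5, 23.5, 23.5]))
```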
Stage 4 v2 improves the Stage 3 v2 sampled result and usually stabilizes the
denoise/sharpening output, but some color/contrast overshoot and small
cyan/green sampling artifacts remain. A stronger photo_v3_noise_mix Stage 2
small run was stopped at step 12700 after eval plateaued around
eval/latent_loss 0.282. The 500M-class Stage 2 XL condition encoder then
completed 80000 steps and beat the small v3 condition encoder:
XL Stage 2 condition encoder: configs/latent_pretrain_photo100k_v3_noise_xl.yaml
XL Stage 2 best latent: step 66000, eval/latent_loss 0.27230
XL Stage 2 best PSNR proxy: step 72000, decoded_psnr 21.52
XL Stage 4 condition-start: configs/diffusion_photo100k_xl_stage4_condition_v3.yaml
XL full inference params: 509.658M
Stage 4 XL has not been started yet. The next step is to compare the Stage 2 XL candidate condition encoders, then run the 469.6M U-Net with partial init from Stage 4 v2.
For VM migration and continuation context, read:
Implemented:
- Project scaffold and config loading.
- Manifest-based dataset loader with photo/anime domain IDs.
- On-the-fly x4 degradation pipeline.
- Factor-4 AutoencoderKL.
- Autoencoder training loop with bf16 autocast.
- W&B online/offline logging.
- Fixed validation sample logging for Stage 1: samples/LR, samples/GT, samples/HR
- Validation eval during training: eval/loss, eval/recon, eval/kl, eval/mse, eval/psnr, eval/num_images
- Standalone checkpoint eval script.
- Scratch recovery scripts for ephemeral VM storage.
- Stage 2 LR-to-latent predictor and training loop.
- Stage 3 conditional U-Net, noise scheduler, and diffusion training loop.
- Stage 3 DDIM/img2img inference and sampled validation eval.
- Stage 4-lite low-timestep and condition-start fine-tuning.
- photo_v2 degradation for stronger blur/noise/compression/ringing/color shift/banding experiments.
- photo_v3_noise_mix degradation and XL configs for 500M-class denoise/sharpening experiments.
- Partial checkpoint initialization for widened/deepened Stage 2 and diffusion models via --partial-init.
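One common way to initialize a widened or deepened model from a smaller checkpoint is to copy only tensors whose name and shape both match. The sketch below illustrates that idea on plain name-to-shape mappings. `select_compatible` is a hypothetical helper, and the real --partial-init behavior may differ.

```python
def select_compatible(source_shapes, target_shapes):
    """Split target parameter names into those that can be copied and those that cannot.

    Both arguments map parameter names to shape tuples. A parameter is copied
    only when the source has the same name with exactly the same shape.
    """
    loaded, skipped = [], []
    for name, shape in target_shapes.items():
        if source_shapes.get(name) == shape:
            loaded.append(name)
        else:
            skipped.append(name)  # missing in the source or resized in the target
    return loaded, skipped


if __name__ == "__main__":
    small = {"conv_in.weight": (64, 16, 3, 3), "mid.weight": (64, 64)}
    large = {"conv_in.weight": (64, 16, 3, 3), "mid.weight": (128, 128), "extra.bias": (128,)}
    print(select_compatible(small, large))
```

With PyTorch, the shapes would come from the two state dicts, and the filtered tensors could then be passed to `load_state_dict(..., strict=False)` so that skipped parameters keep their fresh initialization.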
Stage 1 training config:
configs/autoencoder_photo10k.yaml
Stage 1 run name:
autoencoder_photo10k_b16_eval_online
Selected Stage 1 VAE checkpoint:
/home/jwheojjang/scratch/sr-diffusion/runs/autoencoder_photo10k_b16_eval_online/checkpoints/best_eval_recon.pt
Stage 1 VAE shape:
HR 512x512 -> latent 128x128
latent channels: 16
batch size: 16
max steps: 100000
train set: 10000 photo images
val set: 100 photo images
eval: every 1000 steps
fixed sample logging: every 500 steps
The first Stage 1 pass was stopped at step 50000. The selected checkpoint is
best_eval_recon.pt, which matched the 50k checkpoint in the current run:
eval/recon: 0.01198
eval/kl: 9.38684
eval/psnr: 40.19
Current Stage 2 config:
configs/latent_pretrain_photo10k.yaml
Stage 2 photo100k scale-up config:
configs/latent_pretrain_photo100k.yaml
This run uses batch size 64 on MI300X and max_steps: 30000, which is about
18.6 passes over the 103,450-image training split.
Current Stage 2 run name:
latent_pretrain_photo10k_b16
Selected Stage 2 checkpoint:
/home/jwheojjang/scratch/sr-diffusion/runs/latent_pretrain_photo10k_b16/checkpoints/best_eval_latent.pt
Stage 2 final result:
finished step: 50000
best eval latent loss: step 48000, eval/latent_loss 0.21775
best decoded PSNR proxy: step 47000, eval/decoded_psnr 23.89
Stage 2 photo100k scale-up result:
run name: latent_pretrain_photo100k_b64
finished step: 30000
selected checkpoint: /home/jwheojjang/scratch/sr-diffusion/runs/latent_pretrain_photo100k_b64/checkpoints/best_eval_latent.pt
best eval latent loss: step 28000, eval/latent_loss 0.21230
best decoded PSNR proxy: step 22000, eval/decoded_psnr 23.93
final eval: step 30000, eval/latent_loss 0.21267, eval/decoded_psnr 23.88
Current Stage 3 config:
configs/diffusion_photo10k_b32.yaml
Next scale-up Stage 3 config:
configs/diffusion_photo100k_b32.yaml
Current Stage 3 model:
conditional U-Net params: 76.6M
frozen Stage 2 condition encoder params: 2.4M
latent shape: 16 x 128 x 128
batch size: 32
max steps: 25000
Selected Stage 3 checkpoint:
/home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32/checkpoints/best_eval_noise.pt
Stage 3 training result:
finished step: 25000
best eval noise/x0: step 24000, eval/noise_mse 0.00766, eval/x0_mse 0.09063
best decoded PSNR diagnostic: step 25000, eval/decoded_psnr 24.10
Sampled Stage 3 eval, using --init condition, --start-timestep 50,
and 32 DDIM steps on 32 fixed validation images:
mean bicubic PSNR: 24.66
mean SR PSNR: 25.55
mean delta: +0.89 dB
Sampled Stage 3 eval on all 100 validation images:
mean bicubic PSNR: 24.478
mean SR PSNR: 25.222
mean delta: +0.744 dB
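For reference, PSNR is 10 * log10(MAX^2 / MSE). A minimal sketch follows, assuming pixel values in [0, 1] and flat value lists; the function names are hypothetical and the repository's eval code may use a different value range or color handling.

```python
import math


def psnr_from_mse(mse, max_value=1.0):
    """PSNR in dB for a given mean squared error."""
    if mse <= 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr(pred, target, max_value=1.0):
    """PSNR in dB between two equally sized flat sequences of pixel values."""
    if len(pred) != len(target) or not pred:
        raise ValueError("expected two non-empty sequences of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return psnr_from_mse(mse, max_value)


if __name__ == "__main__":
    print(psnr_from_mse(0.01))
```

The "mean delta" rows are then the mean SR PSNR minus the mean bicubic PSNR over the same images.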
Stage 4-lite low-timestep fine-tune result:
config: configs/diffusion_photo10k_b32_stage4_lowt.yaml
initialized from: Stage 3 best checkpoint
train timesteps: 0..100
finished step: 5000
best eval/x0_mse: step 5000, eval/x0_mse 0.01186
best decoded PSNR diagnostic: step 4500, eval/decoded_psnr 32.74
sampled val32 SR PSNR: 25.5493
sampled val32 delta vs Stage 3: -0.0037 dB
decision: do not promote; keep Stage 3 as current best sampled checkpoint
Stage 4 condition-start fine-tune result:
config: configs/diffusion_photo10k_b32_stage4_condition.yaml
selected checkpoint: /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32_stage4_condition/checkpoints/best_eval_condition_decoded.pt
initialized from: Stage 3 best checkpoint
train timesteps: 25..100
stopped early: step 2500, best checkpoint at step 1000
best one-step condition diagnostic: step 1000, eval/decoded_psnr 23.78
best sampled setting: --init condition --start-timestep 25 --steps 32
sampled val32 SR PSNR: 25.660
sampled val100 SR PSNR: 25.293
sampled val100 delta vs Stage 3: +0.071 dB
decision: promote as current best sampled checkpoint
This trains the low-timestep path from the Stage 2 condition latent instead of from a noised ground-truth latent, matching the current inference initialization more closely.
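The idea can be sketched with the standard DDPM forward process, x_t = sqrt(alpha_bar_t) * z + sqrt(1 - alpha_bar_t) * eps, applied to the Stage 2 condition latent instead of the ground-truth latent. The linear beta schedule, its endpoints, and the function names below are assumptions for illustration; the repository's scheduler may differ.

```python
import math
import random


def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    bars, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        bars.append(prod)
    return bars


def noise_from_condition(cond_latent, t, bars, rng):
    """Noise a flat condition latent to timestep t with the DDPM forward process."""
    signal = math.sqrt(bars[t])
    noise = math.sqrt(1.0 - bars[t])
    return [signal * z + noise * rng.gauss(0.0, 1.0) for z in cond_latent]


if __name__ == "__main__":
    rng = random.Random(0)
    bars = alpha_bars()
    t = rng.randint(25, 100)  # train timesteps 25..100, inclusive
    print(t, noise_from_condition([0.5, -0.25, 1.0], t, bars, rng))
```

Under this assumed schedule the signal term stays dominant across 25..100, which is consistent with describing the inference start as a condition latent with light noise added.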
At batch_size=32, one epoch is:
10000 images / 32 = 312.5 steps
So the Stage 3 25000 step config is about 80 epochs.
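The same arithmetic as a small helper (the name `epochs` is hypothetical), checked against the two step budgets quoted in this README:

```python
def epochs(steps, batch_size, num_images):
    """Number of passes over the training set for a given step budget."""
    return steps * batch_size / num_images


if __name__ == "__main__":
    # Stage 3 on photo10k: 25000 steps at batch size 32 over 10000 images.
    print(epochs(25000, 32, 10000))
    # Stage 2 on photo100k: 30000 steps at batch size 64 over 103,450 images.
    print(round(epochs(30000, 64, 103450), 1))
```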
The current photo training manifest is:
/home/jwheojjang/scratch/sr-diffusion/data/manifest_photo10k.csv
It contains:
photo/train: 10000
photo/val: 100
The 10k photo set is built from:
- DIV2K HR.
- Flickr2K HR.
- A deterministic subset of COCO train2017.
The active scale-up target is:
/home/jwheojjang/scratch/sr-diffusion/data/manifest_photo100k.csv
It is built from DF2K plus 100,000 deterministic COCO train2017 images selected
with short side >=320, for about 103,450 training images and 100 validation
images. COCO only has 45,897 train2017 images with short side >=480, so the
stricter high-resolution-only variant is closer to photo50k.
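One way to make such a subset deterministic is to filter by short side and then order the survivors by a hash of the file name, so the result does not depend on directory listing order. This is only a sketch of that approach. `deterministic_subset`, the record keys, and the seed string are hypothetical, and the actual selection rule in the dataset scripts may be different.

```python
import hashlib


def deterministic_subset(records, count, min_short_side=320, seed="photo100k"):
    """Pick `count` records with a large enough short side, in a stable order.

    Each record is a dict with "file_name", "width", and "height" keys.
    """
    eligible = [r for r in records if min(r["width"], r["height"]) >= min_short_side]

    def sort_key(record):
        # Hashing the name gives a fixed pseudo-random order for a given seed.
        return hashlib.sha256(f"{seed}:{record['file_name']}".encode()).hexdigest()

    return sorted(eligible, key=sort_key)[:count]


if __name__ == "__main__":
    demo = [{"file_name": f"{i:06d}.jpg", "width": 300 + i, "height": 640} for i in range(50)]
    print([r["file_name"] for r in deterministic_subset(demo, 5)])
```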
LR images are not stored. They are generated on the fly from HR crops by the
degradation pipeline. The current mild degradation already includes light LR
noise:
- Gaussian noise with probability 0.25, sigma [0.0, 4.0].
- Poisson noise with probability 0.05.
- JPEG/WebP compression.
- blur, color jitter, sharpening, and mild banding.
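The two noise steps in the list above could look roughly like the sketch below. It assumes pixel values on a 0..255 scale (so sigma is in those units) and uses NumPy; `add_mild_lr_noise` is a hypothetical function, not the repository's degradation code, and the compression, blur, and color steps are omitted.

```python
import numpy as np


def add_mild_lr_noise(lr, rng, gauss_p=0.25, gauss_sigma=(0.0, 4.0), poisson_p=0.05):
    """Randomly apply Gaussian and Poisson noise to an LR image in [0, 255]."""
    out = lr.astype(np.float64)
    if rng.random() < gauss_p:
        sigma = rng.uniform(*gauss_sigma)  # one sigma per image
        out = out + rng.normal(0.0, sigma, size=out.shape)
    if rng.random() < poisson_p:
        # Treat clipped pixel values as Poisson rates (signal-dependent noise).
        out = rng.poisson(np.clip(out, 0.0, 255.0)).astype(np.float64)
    return np.clip(out, 0.0, 255.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lr = np.full((128, 128, 3), 128.0)
    print(add_mild_lr_noise(lr, rng).std())
```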
The photo_v2 degradation preset is available for denoise/sharpening work. It
adds stronger blur, LR blur, signal-dependent sensor noise, heavier
Gaussian/Poisson noise, stronger JPEG/WebP artifacts, edge ringing,
oversharpen halos, color shift, and stronger banding. Because it changes the LR
distribution seen by the condition encoder, the recommended path is to
fine-tune Stage 2 on photo_v2 before running Stage 3/4 experiments that use
the same preset.
The photo_v3_noise_mix preset is a stronger denoise-focused curriculum. It
mixes photo_v2, photo_v3_noise, and mild degradations so the condition
encoder sees heavy Gaussian/sensor/chroma noise without losing cleaner inputs
entirely. photo_v3_noise adds explicit chroma/color noise and stronger
compression/noise ranges while keeping oversharpen/ringing probabilities
moderate to avoid reinforcing the cyan/green dot artifacts observed in v2.
For VAE training, LR is only used for visual logging. The VAE loss is:
HR -> encode -> latent -> decode -> reconstructed HR
LR degradation becomes a core training signal in Stage 2/3.
See docs/DATASETS.md for dataset notes and licensing caveats.
This VM exposes an ext4 scratch partition labeled DOSCRATCH. Mount it before
large datasets or long training runs:
bash scripts/mount_doscratch.sh
The default mount point is:
/home/jwheojjang/scratch
The scratch volume is treated as ephemeral. After a VM restart, recover the scratch layout and development datasets with:
bash scripts/recover_scratch.sh
That recreates:
- scratch directories
- toy dataset
- DIV2K
- Flickr2K
- COCO train2017 subset
- manifest_photo10k.csv
To recover the larger photo100k setup after scratch loss:
bash scripts/recover_scratch.sh --coco-count 100000
To recover only the smaller DIV2K seed dataset:
bash scripts/recover_scratch.sh --skip-flickr2k --skip-coco
Hugging Face is used as persistent checkpoint storage because scratch can be lost after VM restarts. The current target is a public model repository:
jwheo/sr-diffusion
Upload only selected checkpoints/configs/metrics, not raw datasets. See docs/HUGGINGFACE.md for the exact upload commands.
The public Hugging Face prototype can be downloaded and run from a fresh clone.
The default inference config still points at the smaller 10k Stage 4
condition-start checkpoint for faster setup. The Colab notebook now also lets
you select the larger photo100k Stage 4 checkpoint or the experimental
photo100k photo_v2 Stage 3 checkpoint for denoise/sharpening review.
For a click-to-run demo, open the Colab notebook:
notebooks/sr_diffusion_colab_demo.ipynb
Install PyTorch for the target machine first. For this ROCm VM:
python3 -m venv .venv
source .venv/bin/activate
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm7.2
pip install -e .
Download the selected public checkpoints from Hugging Face:
python scripts/download_hf_checkpoints.py
Download the larger photo100k/v2 artifact set:
python scripts/download_hf_checkpoints.py --preset photo100k
Run x4 SR from an LR image. The default HF config expects a 128x128 LR crop and writes a 512x512 output:
python infer_diffusion.py \
--input-lr /path/to/lr_128.png \
--output-dir outputs/demo
Run tiled x4 SR from a larger LR image:
python infer_diffusion.py \
--input-lr /path/to/larger_lr.png \
--output-dir outputs/tiled_demo \
--tile \
--tile-overlap 32
Tiled inference splits the LR image into overlapping 128x128 tiles, samples each tile, and feather-blends the 512x512 tile outputs back into one x4 image. It is slower than single-tile inference because diffusion sampling runs per tile. Start with small LR images, for example 256x256 or 384x384, when using Colab.
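To illustrate the tiling and feather-blending idea, here is a one-dimensional sketch in LR pixel units. The function names and the linear ramp are assumptions and not the actual infer_diffusion.py implementation; in 2D the same weights would be applied along both axes, with offsets multiplied by 4 in the output.

```python
def tile_starts(size, tile=128, overlap=32):
    """Start offsets of overlapping tiles covering [0, size); needs size >= tile."""
    if size < tile:
        raise ValueError("pad the image to at least one tile first")
    stride = tile - overlap
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # clamp the last tile to the image edge
    return starts


def feather_weights(tile=128, overlap=32):
    """Linear ramp up over `overlap` pixels at each tile border, 1.0 in the middle."""
    return [min(i + 1, tile - i, overlap) / overlap for i in range(tile)]


def blend_1d(size, tiles, starts, weights):
    """Weighted average of overlapping tile outputs along one axis."""
    acc = [0.0] * size
    norm = [0.0] * size
    for values, start in zip(tiles, starts):
        for i, value in enumerate(values):
            acc[start + i] += weights[i] * value
            norm[start + i] += weights[i]
    return [a / n for a, n in zip(acc, norm)]


if __name__ == "__main__":
    print(tile_starts(256), tile_starts(384))
```

Normalizing by the accumulated weights keeps flat regions flat even where more than two tiles overlap.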
For a controlled smoke test from an HR image, let the script center-crop HR and create the degraded LR input first:
python infer_diffusion.py \
--input-hr /path/to/hr_image.png \
--output-dir outputs/demo_from_hr \
--seed 123
The output is sr_00.png. The default config is
configs/hf/diffusion_stage4_condition.yaml, which points at:
checkpoints/stage1_autoencoder_best_eval_recon.pt
checkpoints/stage2_latent_pretrain_best_eval_latent.pt
checkpoints/stage4_condition_b32_best_eval_condition_decoded.pt
For the current photo100k Stage 4 checkpoint:
python infer_diffusion.py \
--config configs/hf/diffusion_photo100k_stage4_condition.yaml \
--input-lr /path/to/lr_128.png \
--output-dir outputs/photo100k_stage4
For the current experimental photo100k photo_v2 Stage 4 checkpoint:
python infer_diffusion.py \
--config configs/hf/diffusion_photo100k_stage4_condition_v2.yaml \
--input-lr /path/to/lr_128.png \
--output-dir outputs/photo100k_v2_stage4
For the earlier photo100k photo_v2 Stage 3 checkpoint:
python infer_diffusion.py \
--config configs/hf/diffusion_photo100k_v2.yaml \
--input-lr /path/to/lr_128.png \
--output-dir outputs/photo100k_v2_stage3
To compare the earlier Stage 3 baseline instead:
python infer_diffusion.py \
--config configs/hf/diffusion_stage3_baseline.yaml \
--input-lr /path/to/lr_128.png \
--output-dir outputs/stage3_demo
These are research checkpoints under a non-commercial license. They are useful for inspecting the prototype behavior, not yet a polished production SR model.
Code is released under the PolyForm Noncommercial License 1.0.0. Model checkpoints and generated artifacts are released under CC BY-NC 4.0.
Commercial use is not permitted without separate written permission. This includes paid hosted inference, resale, or integration into commercial products.
Run the current Stage 1 VAE training config:
/home/jwheojjang/venvs/rocm/bin/python train_autoencoder.py \
--config configs/autoencoder_photo10k.yaml
Recommended long-running launch through tmux:
tmux new-session -d -s sr_ae10k \
'cd /home/jwheojjang/sr-diffusion && env PYTHONUNBUFFERED=1 /home/jwheojjang/venvs/rocm/bin/python train_autoencoder.py --config configs/autoencoder_photo10k.yaml > /home/jwheojjang/scratch/sr-diffusion/runs/autoencoder_photo10k_b16_eval_online/train_tmux.log 2>&1'
Watch the training log:
tail -f /home/jwheojjang/scratch/sr-diffusion/runs/autoencoder_photo10k_b16_eval_online/train_tmux.log
Run the current Stage 2 deterministic latent pretraining config:
/home/jwheojjang/venvs/rocm/bin/python train_latent_pretrain.py \
--config configs/latent_pretrain_photo10k.yaml
Run the photo100k Stage 2 scale-up from the selected 10k checkpoint:
/home/jwheojjang/venvs/rocm/bin/python train_latent_pretrain.py \
--config configs/latent_pretrain_photo100k.yaml \
--init-checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/latent_pretrain_photo10k_b16/checkpoints/best_eval_latent.pt
Run the XL photo100k Stage 2 condition encoder for the 500M-class path:
/home/jwheojjang/venvs/cuda/bin/python train_latent_pretrain.py \
--config configs/latent_pretrain_photo100k_v3_noise_xl.yaml
This run has completed. Important checkpoints are available on Hugging Face:
checkpoints/stage2_photo100k_v3_noise_xl_b64_best_eval_latent.pt
checkpoints/stage2_photo100k_v3_noise_xl_b64_step_0072000.pt
checkpoints/stage2_photo100k_v3_noise_xl_b64_latest.pt
metrics/stage2_photo100k_v3_noise_xl_b64_summary.json
Recommended Stage 2 tmux launch:
tmux new-session -d -s sr_stage2 \
'cd /home/jwheojjang/sr-diffusion && env PYTHONUNBUFFERED=1 /home/jwheojjang/venvs/rocm/bin/python train_latent_pretrain.py --config configs/latent_pretrain_photo10k.yaml > /home/jwheojjang/scratch/sr-diffusion/runs/latent_pretrain_photo10k_b16/train_tmux.log 2>&1'
Watch the Stage 2 log:
tail -f /home/jwheojjang/scratch/sr-diffusion/runs/latent_pretrain_photo10k_b16/train_tmux.log
Run the current Stage 3 conditional diffusion config:
/home/jwheojjang/venvs/rocm/bin/python train_diffusion.py \
--config configs/diffusion_photo10k_b32.yaml
After Stage 2 photo100k finishes, run the photo100k Stage 3 config:
/home/jwheojjang/venvs/rocm/bin/python train_diffusion.py \
--config configs/diffusion_photo100k_b32.yaml \
--init-checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32/checkpoints/best_eval_noise.pt
After comparing the XL Stage 2 condition encoder candidates, run the 500M-class condition-start U-Net. It can reuse shape-compatible tensors from the smaller Stage 4 v2 checkpoint. This has not been started yet:
/home/jwheojjang/venvs/cuda/bin/python train_diffusion.py \
--config configs/diffusion_photo100k_xl_stage4_condition_v3.yaml \
--init-checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo100k_b32_stage4_condition_v2/checkpoints/best_eval_condition_decoded.pt \
--partial-init
Recommended Stage 3 tmux launch:
tmux new-session -d -s sr_stage3 \
'cd /home/jwheojjang/sr-diffusion && env PYTHONUNBUFFERED=1 /home/jwheojjang/venvs/rocm/bin/python train_diffusion.py --config configs/diffusion_photo10k_b32.yaml > /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32/train_tmux.log 2>&1'
Watch the Stage 3 log:
tail -f /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32/train_tmux.log
Watch GPU usage:
watch -n 1 rocm-smi --showuse --showmemuse --showtemp --showpower
Attach to the tmux session:
tmux attach -t sr_ae10k
Detach without stopping training:
Ctrl-b d
Training eval is enabled in configs/autoencoder_photo10k.yaml:
eval:
  enabled: true
  split: val
  limit: 100
  batch_size: 16
  every: 1000
  run_at_start: true
This means:
- eval at step 1
- eval at step 1000
- eval at step 2000
- and so on
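The schedule described above can be expressed as a small predicate. This is a sketch of the stated behavior only; `should_eval` is a hypothetical name and the training loop may implement it differently.

```python
def should_eval(step, every=1000, run_at_start=True):
    """Return True on the first step (if enabled) and on every `every`-th step."""
    return (run_at_start and step == 1) or step % every == 0


if __name__ == "__main__":
    print([s for s in range(1, 3001) if should_eval(s)])
```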
The best checkpoint by eval/recon is written to:
checkpoints/best_eval_recon.pt
Manual checkpoint eval:
/home/jwheojjang/venvs/rocm/bin/python eval_autoencoder.py \
--config configs/autoencoder_photo10k.yaml \
--checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/autoencoder_photo10k_b16_eval_online/checkpoints/latest.pt \
--split val \
--limit 100
The current config logs to W&B online:
logging:
  wandb:
    project: sr-diffusion
    name: autoencoder_photo10k_b16_eval_online
    mode: online
Image logging uses fixed validation images so improvements are comparable over time:
logging:
  samples:
    split: val
    count: 4
    indices: [0, 1, 2, 3]
Logged image keys:
- samples/LR: degraded LR, upsampled for viewing.
- samples/GT: original HR target.
- samples/HR: VAE reconstruction.
The name samples/HR currently means reconstructed HR output. If this becomes
confusing, rename it to samples/Recon before the next large run.
Stage 0: scaffold and data pipeline
- Done.
- Repo scaffold, configs, manifests, degradation pipeline, smoke tests.
Stage 1: VAE / Autoencoder
- Done for the first pass.
- Train factor-4 VAE on 512 HR crops.
- Select checkpoint using fixed visual samples plus eval/recon, eval/psnr, and residual qualitative checks.
- Possible improvements before moving on:
  - LPIPS/perceptual eval.
  - perceptual training loss.
  - KL weight sweep.
  - larger or domain-balanced data.
  - rename samples/HR to samples/Recon for clarity.
Stage 2: deterministic LR -> HR latent pretrain
- Done for the first 10k pass and for the photo100k scale-up (finished at step 30000).
- Freeze the selected Stage 1 VAE.
- Train an LR-to-latent predictor that maps degraded LR inputs to HR VAE encoder means.
- This is where LR degradation quality starts to matter directly.
- Log fixed validation samples/LR, samples/GT, and samples/Pred to W&B.
Run the current Stage 2 pretraining config:
/home/jwheojjang/venvs/rocm/bin/python train_latent_pretrain.py \
--config configs/latent_pretrain_photo10k.yaml
Run the photo100k Stage 2 scale-up:
/home/jwheojjang/venvs/rocm/bin/python train_latent_pretrain.py \
--config configs/latent_pretrain_photo100k.yaml \
--init-checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/latent_pretrain_photo10k_b16/checkpoints/best_eval_latent.pt
Stage 3: conditional latent diffusion
- First pass complete. It is the baseline for the current Stage 4 condition checkpoint.
- Train diffusion U-Net over HR latents.
- Conditioning:
  - frozen Stage 2 LR-to-latent condition encoder
  - timestep embedding
  - photo/anime domain embedding
- Initial model size is 76.6M trainable U-Net parameters.
- Target model size is roughly 250M-500M parameters after the pipeline is stable.
Stage 4: perceptual / GAN fine-tune
- Current stage.
- First conservative Stage 4-lite low-timestep fine-tune is complete. It improved one-step diagnostics but not the fixed 32-step sampled eval, so it is not promoted over Stage 3.
- Condition-start fine-tuning, initialized from the Stage 3 best checkpoint, is the current best sampled SR checkpoint. It trains low timesteps 25..100, but starts the training noisy latent from the Stage 2 condition latent so the train path better matches infer_diffusion.py --init condition.
- It uses a small effective-noise loss plus a stronger x0 latent reconstruction loss to preserve fidelity. The best sampled setting so far is --start-timestep 25.
- Use carefully, because later perceptual/GAN tuning can improve apparent sharpness while hurting fidelity.
Run the Stage 4-lite low-timestep fine-tune:
/home/jwheojjang/venvs/rocm/bin/python train_diffusion.py \
--config configs/diffusion_photo10k_b32_stage4_lowt.yaml \
--init-checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32/checkpoints/best_eval_noise.pt
Run the Stage 4 condition-start fine-tune:
/home/jwheojjang/venvs/rocm/bin/python train_diffusion.py \
--config configs/diffusion_photo10k_b32_stage4_condition.yaml \
--init-checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32/checkpoints/best_eval_noise.pt
Stage 5: few-step distillation
- Distill the diffusion model for faster inference.
Stage 6: preference eval
- Fixed private eval set.
- Generate outputs from multiple checkpoints/settings.
- A/B comparisons.
- Accumulate Elo separately for photo and anime.
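A minimal sketch of how A/B outcomes could be accumulated into per-domain Elo tables, using the standard Elo update. The starting rating of 1500, the K-factor of 32, and the function names are assumptions; Stage 6 is not implemented yet.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0, start=1500.0):
    """Update one domain's rating table in place after a single A/B comparison."""
    r_win = ratings.get(winner, start)
    r_lose = ratings.get(loser, start)
    gain = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + gain
    ratings[loser] = r_lose - gain


if __name__ == "__main__":
    # One table per domain, so photo and anime preferences never mix.
    tables = {"photo": {}, "anime": {}}
    update_elo(tables["photo"], winner="stage4_condition", loser="stage3")
    print(tables)
```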
configs/                     experiment configs
docs/                        dataset and project notes
scripts/                     dataset, scratch, and utility scripts
src/sr_diffusion/            package code
  datasets/                  manifest dataset
  degradations/              x4 LR degradation pipeline
  eval/                      eval helpers
  losses/                    reconstruction/KL losses
  models/                    AutoencoderKL, LR predictor, diffusion U-Net
train_autoencoder.py         Stage 1 training entrypoint
train_latent_pretrain.py     Stage 2 deterministic latent pretraining entrypoint
train_diffusion.py           Stage 3 conditional diffusion training entrypoint
infer_diffusion.py           Stage 3 DDIM/img2img SR sampling entrypoint
eval_diffusion_samples.py    sampled diffusion validation eval
compare_eval_samples.py      sampled eval comparison contact sheets
eval_autoencoder.py          standalone VAE eval entrypoint
infer_reconstruct.py         reconstruction smoke/inference
tests/                       unit tests
Create a toy dataset:
/home/jwheojjang/venvs/rocm/bin/python scripts/make_toy_dataset.py \
--output runs/toy_data \
--count 16
Train a tiny autoencoder for a few steps:
/home/jwheojjang/venvs/rocm/bin/python train_autoencoder.py \
--config configs/autoencoder_tiny.yaml \
--limit-steps 10
Run unit tests:
/home/jwheojjang/venvs/rocm/bin/python -m pytest
Run a tiny Stage 3 smoke test:
/home/jwheojjang/venvs/rocm/bin/python train_diffusion.py \
--config configs/diffusion_scratch_tiny.yaml \
--limit-steps 1
Run current best Stage 4 condition-start sampling from an HR image by creating a controlled LR input first:
/home/jwheojjang/venvs/rocm/bin/python infer_diffusion.py \
--config configs/diffusion_photo10k_b32_stage4_condition.yaml \
--checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32_stage4_condition/checkpoints/best_eval_condition_decoded.pt \
--input-hr /path/to/hr_image.png \
--output-dir /home/jwheojjang/scratch/sr-diffusion/runs/infer_diffusion_stage4_condition \
--steps 32 \
--seed 123
The Stage 4 condition config sets sampling.start_timestep: 25, so the command
above uses the best sampled setting found so far unless --start-timestep is
passed explicitly.
Run Stage 3 baseline sampling from an HR image by creating a controlled LR input first:
/home/jwheojjang/venvs/rocm/bin/python infer_diffusion.py \
--config configs/diffusion_photo10k_b32.yaml \
--checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32/checkpoints/best_eval_noise.pt \
--input-hr /path/to/hr_image.png \
--output-dir /home/jwheojjang/scratch/sr-diffusion/runs/infer_diffusion_sample \
--steps 32 \
--seed 123
Run Stage 3 baseline sampling from an existing LR image:
/home/jwheojjang/venvs/rocm/bin/python infer_diffusion.py \
--config configs/diffusion_photo10k_b32.yaml \
--checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32/checkpoints/best_eval_noise.pt \
--input-lr /path/to/lr_128.png \
--output-dir /home/jwheojjang/scratch/sr-diffusion/runs/infer_diffusion_sample \
--steps 32 \
--seed 123
The default sampler starts from the Stage 2 condition latent with light noise
added (--init condition). If a config has sampling.start_timestep, that
value is used when --start-timestep is omitted. Otherwise condition sampling
falls back to 50. Pure noise sampling is available with --init noise, but
the current checkpoints are more stable in condition-initialized mode.
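The precedence described above for condition-initialized sampling (explicit flag, then config value, then 50) can be sketched as follows. `resolve_start_timestep` is a hypothetical helper, and the config is assumed to be a nested dict as loaded from YAML.

```python
def resolve_start_timestep(cli_value, config, fallback=50):
    """Pick the start timestep for --init condition sampling.

    Order: explicit --start-timestep, then sampling.start_timestep from the
    config, then the fallback.
    """
    if cli_value is not None:
        return cli_value
    configured = config.get("sampling", {}).get("start_timestep")
    if configured is not None:
        return configured
    return fallback


if __name__ == "__main__":
    stage4 = {"sampling": {"start_timestep": 25}}
    print(resolve_start_timestep(None, stage4))  # config value
    print(resolve_start_timestep(None, {}))      # fallback
    print(resolve_start_timestep(40, stage4))    # explicit flag wins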
Run a small sampled validation sweep and compare against bicubic:
/home/jwheojjang/venvs/rocm/bin/python eval_diffusion_samples.py \
--config configs/diffusion_photo10k_b32_stage4_condition.yaml \
--checkpoint /home/jwheojjang/scratch/sr-diffusion/runs/diffusion_photo10k_b32_stage4_condition/checkpoints/best_eval_condition_decoded.pt \
--output-dir /home/jwheojjang/scratch/sr-diffusion/runs/eval_diffusion_stage4_condition_val8_32step \
--split val \
--limit 8 \
--steps 32 \
--seed 1337
The sampled eval grid is written as grid_lr_bicubic_sr_gt.png, with columns in
this order: LR nearest, bicubic, SR, GT.
Compare two sampled eval directories and create top win/loss contact sheets:
/home/jwheojjang/venvs/rocm/bin/python compare_eval_samples.py \
--baseline-dir /home/jwheojjang/scratch/sr-diffusion/runs/eval_diffusion_stage3_val100_t50_32step \
--candidate-dir /home/jwheojjang/scratch/sr-diffusion/runs/eval_diffusion_stage4_condition_val100_t25_32step \
--output-dir /home/jwheojjang/scratch/sr-diffusion/runs/compare_stage3_vs_stage4_condition_val100 \
--baseline-label stage3 \
--candidate-label stage4cond
Reconstruct one image:
/home/jwheojjang/venvs/rocm/bin/python infer_reconstruct.py \
--config configs/autoencoder_tiny.yaml \
--checkpoint runs/autoencoder_tiny/checkpoints/latest.pt \
--input runs/toy_data/images/0000.png \
--output-dir runs/reconstruct_smoke