A flow matching-based diffusion transformer for anime image generation.
This project is for research purposes only.
- GitHub: https://github.com/FREEANIMA/diffusion_model_sampling
- Hugging Face: https://huggingface.co/honghong3/diffusion-transformer
This project is licensed under CC BY-NC 4.0.
For research and non-commercial use only.
- GPU: NVIDIA A100 40GB (Google Colab)
- Dataset: ~4.8M anime images
- Processed: ~1.8M images (epoch 0, ongoing)
- Throughput: ~1.3 it/s
- Samples below are from intermediate checkpoints; quality is expected to improve as training continues.
| 12k images | 600k images | 1.2M images | 1.8M images |
|---|---|---|---|
| ![]() | ![]() | ![]() | ![]() |
# sampler conditional
prompt = "1girl, red hair, school uniform, happy, red eyes, open mouth, detailed face"
steps = 100
cfg_scale = 2.0
seed = 1234
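
How these settings might feed a sampler is sketched below. This is a minimal illustration, not the code in `sampling.py`: it assumes a plain Euler integrator running from t = 1 (noise) to t = 0 (image) and the standard classifier-free guidance combination. `velocity_fn` is a hypothetical stand-in for the model, and plain Python lists stand in for latent tensors.

```python
import random

def cfg_velocity(v_cond, v_uncond, cfg_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move cfg_scale times the (conditional - unconditional) difference."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def euler_sample(velocity_fn, dim, steps=100, cfg_scale=2.0, seed=1234):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data).

    velocity_fn(x, t, conditional) is a hypothetical model wrapper that
    returns a velocity with the same length as x.
    """
    rng = random.Random(seed)                       # seed fixes the initial noise
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = cfg_velocity(velocity_fn(x, t, True),
                         velocity_fn(x, t, False),
                         cfg_scale)
        x = [xi - dt * vi for xi, vi in zip(x, v)]  # step toward t = 0
    return x
```

With `cfg_scale = 1.0` the combination reduces to the conditional prediction alone; larger values push samples harder toward the prompt. Each step costs two model evaluations.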
- Backbone: Diffusion Transformer (DiT) with adaLN modulation
- Parameters: ~550M
- Framework: Flow Matching (velocity prediction)
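
The two ideas named above can be illustrated in a few lines. This is a hedged sketch, not the code in `model.py`: it assumes the common straight-line interpolation between image latents and noise, and it uses scalar `shift` and `scale` values where a real DiT block would use per-channel vectors predicted from the conditioning. Function names are hypothetical.

```python
import math

def flow_matching_pair(x0, noise, t):
    """Build one training pair on the straight path x_t = (1 - t) * x0 + t * noise.

    The regression target is the constant velocity d x_t / d t = noise - x0.
    """
    x_t = [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]
    v_target = [n - a for a, n in zip(x0, noise)]
    return x_t, v_target

def adaln_modulate(x, shift, scale, eps=1e-6):
    """Layer-normalize x without a learned affine, then apply a
    conditioning-driven scale and shift: normed * (1 + scale) + shift."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    normed = [(xi - mean) / math.sqrt(var + eps) for xi in x]
    return [n * (1.0 + scale) + shift for n in normed]
```

Training would then minimize the mean squared error between the model's output at `(x_t, t)` and `v_target`.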
| Component | Model |
|---|---|
| VAE | stabilityai/sd-vae-ft-mse |
| Text Encoder | openai/clip-vit-large-patch14 |
| Tokenizer | openai/clip-vit-large-patch14 |
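
For reference, the components in the table can be loaded with the stock `diffusers` and `transformers` classes, as sketched below. This is an assumption about usage: the repo ships its own `clip.py` and `sd_vae.py` wrappers, which may load them differently. `load_components` and `latent_shape` are hypothetical helper names.

```python
def load_components(device="cpu"):
    """Load the pretrained VAE, text encoder, and tokenizer from the Hugging Face Hub."""
    # Imported lazily so the shape helper below works without these packages.
    from diffusers import AutoencoderKL
    from transformers import CLIPTextModel, CLIPTokenizer

    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device)
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    return vae, text_encoder, tokenizer

def latent_shape(resolution, channels=4, downsample=8):
    """Latent tensor shape (C, H, W) for a square image.

    This VAE has 4 latent channels and downsamples each side by 8.
    """
    side = resolution // downsample
    return (channels, side, side)
```

At 512 × 512 the transformer therefore operates on 4 × 64 × 64 latents rather than on pixels.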
- Resolution: 512 × 512 (single bucket)
- Noise Schedule: Log-SNR uniform sampling with resolution-dependent shift
- CFG: Classifier-free guidance
- Prompts are tag-based (comma-separated danbooru-style tags)
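
One way to read the noise schedule entry is sketched below. It is a guess at the mechanism, not the repo's training code: it assumes the straight-line path with SNR = ((1 - t) / t)², draws log-SNR uniformly from a fixed range, and applies the shift as an offset of -2·log(shift) in log-SNR space. The range limits, the base resolution of 256, and all function names are assumed values chosen for illustration.

```python
import math
import random

def resolution_shift(resolution, base_resolution=256):
    """Shift factor for a square image: the square root of the pixel-count
    ratio. base_resolution is an assumed value, not taken from the repo."""
    return resolution / base_resolution

def logsnr_to_t(logsnr, shift=1.0):
    """Map a log-SNR value to a timestep t in (0, 1).

    With logsnr = 2 * log((1 - t) / t), subtracting 2 * log(shift) is the same
    as the timestep shift t' = shift * t / (1 + (shift - 1) * t).
    """
    shifted = logsnr - 2.0 * math.log(shift)
    return 1.0 / (1.0 + math.exp(shifted / 2.0))

def sample_timestep(rng, shift=1.0, logsnr_min=-10.0, logsnr_max=10.0):
    """Draw log-SNR uniformly (assumed range) and convert it to a timestep."""
    return logsnr_to_t(rng.uniform(logsnr_min, logsnr_max), shift)
```

A shift above 1 moves sampled timesteps toward the noisy end, which is the usual motivation for raising it with resolution: larger images need more noise to destroy the same amount of structure.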
pip install torch transformers diffusers accelerate torchvision tqdm

python main.py

C:.
│ main.py
│ output.png
│ README.md
│ requirements.txt
│
├─app
│ │ clip.py
│ │ config.json
│ │ config.py
│ │ model.py
│ │ sampling.py
│ │ sd_vae.py
│ └─ __init__.py
│
├─assets
│ 100k.png
│ 150k.png
│ 1k.png
│ 50k.png
│
└─weights
    image.pth




