---
title: Diffusion and its variants
layout: default
parent: DL Basics
nav_order: 3
---
Unlike text, image and video data are typically high-dimensional and carry strong spatial structure, so applying LLMs to them directly is not straightforward.
In this blog, we introduce the diffusion model and its variants, which are widely used in image and video generation.
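Before diving into the papers, the core idea can be made concrete. Below is a minimal sketch of the DDPM forward (noising) process with a linear beta schedule; the function and variable names here are illustrative choices, not from any particular library:

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; returns betas and the cumulative \bar{alpha}_t."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t, strictly decreasing in t
    return betas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # a denoiser is trained to predict eps from (xt, t)

# Toy usage: noise an 8x8 "image" halfway through the schedule.
rng = np.random.default_rng(0)
_, alpha_bars = make_schedule()
x0 = rng.standard_normal((8, 8))
xt, eps = q_sample(x0, 500, alpha_bars, rng)
```

Training then amounts to minimizing the mean-squared error between the true `eps` and the network's prediction at random timesteps; sampling runs the learned reverse process from pure noise.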
Although 2D synthesis has made significant progress, view consistency remains a challenge. To bridge this gap, some works propose to generate 3D objects directly from text descriptions.
3D synthesis is generally challenging because of
- Limited high-quality 3D training data
- The difficulty of aligning different modalities in 3D space
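A common way around the scarcity of 3D data, used by DreamFusion-style text-to-3D methods, is Score Distillation Sampling (SDS): a pretrained 2D diffusion model supervises renders of a 3D representation, so no 3D training set is needed. A minimal sketch of the SDS gradient, where `eps_phi` is a stand-in for a pretrained text-conditioned noise predictor (hypothetical, not a real API):

```python
import numpy as np

def sds_gradient(rendered, t, alpha_bar_t, eps_phi, rng, w=1.0):
    """Score Distillation Sampling gradient (sketch).

    `rendered` is an image rendered from the 3D representation and
    `eps_phi(x_t, t)` is a pretrained noise predictor.  The returned
    gradient (w.r.t. the rendered pixels) pushes the render toward
    images the 2D model considers likely; backpropagating it through
    a differentiable renderer updates the 3D parameters.
    """
    eps = rng.standard_normal(rendered.shape)
    # Noise the render exactly as in the diffusion forward process.
    xt = np.sqrt(alpha_bar_t) * rendered + np.sqrt(1.0 - alpha_bar_t) * eps
    return w * (eps_phi(xt, t) - eps)

# Toy usage with a zero "noise predictor" standing in for a real model.
rng = np.random.default_rng(1)
img = rng.standard_normal((8, 8))
g = sds_gradient(img, 500, 0.5, lambda x, t: np.zeros_like(x), rng)
```

In practice the expectation is taken over random timesteps `t` and noise draws each optimization step; several of the text-to-3D papers listed below build on this objective.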
- [Blog] What are Diffusion Models?, Lilian Weng
- [Blog] Diffusion Models for Video Generation, Lilian Weng
- [ICML'21] Zero-Shot Text-to-Image Generation, OpenAI
- [CVPR'22] High-Resolution Image Synthesis with Latent Diffusion Models, Heidelberg University
- [arXiv 2024.03] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, Stability AI
- [arXiv 2023.11] Stable Video Diffusion, Stability AI
- [arXiv 2024.03] Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation, Stability AI
- [ICCV'23] Scalable Diffusion Models with Transformers, UC Berkeley
- [arXiv 2023.11] RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D, Alibaba
- [arXiv 2024.03] DreamReward: Text-to-3D Generation with Human Preference, Tsinghua University
- [CVPR'24 Highlight] HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting, The Chinese University of Hong Kong
- [NeurIPS'23] DreamHuman: Animatable 3D Avatars from Text, Google Research