[Project Page] | [Paper (arXiv)]
We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.
- Streaming Generation: Support for continuous motion generation with changing text conditions
- Latent Diffusion Forcing: Efficient generation using a compressed latent space with diffusion
- Real-time Capable: Optimized for streaming inference with ~50 FPS model output
# Create conda environment
conda create -n motion_gen python=3.10
conda activate motion_gen
# Install PyTorch
pip install torch torchvision torchaudio
# Install dependencies
pip install -r requirements.txt
# Install Flash Attention
conda install -c nvidia cuda-toolkit
export CUDA_HOME=$CONDA_PREFIX
pip install flash-attn --no-build-isolation
If you only need to generate motions and don't plan to train or evaluate models, you can use our standalone model on Hugging Face:
This version requires no dataset downloads and works out-of-the-box for inference:
from transformers import AutoModel
# Load model
model = AutoModel.from_pretrained(
"ShandaAI/FloodDiffusion",
trust_remote_code=True
)
# Generate motion from text
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}") # (~240, 263)
# Generate as joint coordinates for visualization
motion_joints = model("a person walking forward", length=60, output_joints=True)
print(f"Generated joints: {motion_joints.shape}") # (~240, 22, 3)
# Multi-text transitions
motion = model(
text=[["walk forward", "turn around", "run back"]],
length=[120],
text_end=[[40, 80, 120]]
)
For detailed API documentation, see the model card.
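If you want to keep generated sequences around (for example, to render them later), saving them as NumPy arrays works; this is a minimal sketch assuming the returned motion is a CPU tensor or array that np.asarray can convert:
```python
import numpy as np

# Assumption: `motion` from the calls above is a CPU tensor/array of shape (T, 263)
np.save("generated_motion.npy", np.asarray(motion))
```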
Note: For training, evaluation, or using the scripts in this repository, continue with the Data Preparation section below.
To reproduce our results from scratch, follow the original data preparation pipelines:
HumanML3D:
- Follow the instructions in the HumanML3D repository
- Extract 263D motion features using their processing pipeline
- Place the processed data in raw_data/HumanML3D/
BABEL:
- Download from the BABEL website
- Process the motion sequences to extract 263D features
- For streaming generation, segment and process according to the frame-level annotations
- Place the processed data in raw_data/BABEL_streamed/
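After processing, each sample should be a single (T, 263) feature array. A minimal sanity check, assuming the directory layout described later in this README and .npy files for both datasets:
```python
import glob
import numpy as np

# Directory names follow the dataset layout described below; adjust if yours differs.
for pattern in ("raw_data/HumanML3D/new_joint_vecs/*.npy",
                "raw_data/BABEL_streamed/motions/*.npy"):
    files = sorted(glob.glob(pattern))
    print(f"{pattern}: {len(files)} files")
    if files:
        sample = np.load(files[0])
        # Each processed sample should be a 2D array with 263 features per frame
        assert sample.ndim == 2 and sample.shape[1] == 263, sample.shape
```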
Dependencies:
- Download T5 encoder weights from Hugging Face
- Download T2M evaluation models from the text-to-motion repository
- Download GloVe embeddings
We provide all necessary data (datasets, dependencies, and pretrained models) on Hugging Face: 🤗 ShandaAI/FloodDiffusion
Downloads:
For inference only (downloads deps/ and outputs/):
pip install huggingface_hub
python download_assets.py
For training/evaluation (also downloads datasets in raw_data/):
pip install huggingface_hub
python download_assets.py --with-dataset
This will automatically download and extract files into the correct directories.
After downloading or preparing the data, your project should have the following structure:
Dependencies Directory:
deps/
├── t2m/                      # Text-to-Motion evaluation models
│   ├── humanml3d/            # HumanML3D evaluator
│   ├── kit/                  # KIT-ML evaluator
│   └── meta/                 # Statistics (mean.npy, std.npy)
├── glove/                    # GloVe word embeddings
│   ├── our_vab_data.npy
│   ├── our_vab_idx.pkl
│   └── our_vab_words.pkl
└── t5_umt5-xxl-enc-bf16/     # T5 text encoder
Dataset Directory:
raw_data/
├── HumanML3D/
│   ├── new_joint_vecs/       # 263D motion features (required)
│   ├── texts/                # Text annotations
│   ├── train.txt             # Training split
│   ├── val.txt               # Validation split
│   ├── test.txt              # Test split
│   ├── all.txt               # All samples
│   ├── Mean.npy              # Dataset mean
│   ├── Std.npy               # Dataset std
│   ├── TOKENS_*/             # Pretokenized features (auto-generated)
│   └── animations/           # Rendered videos (optional)
│
└── BABEL_streamed/
    ├── motions/              # 263D motion features (required)
    ├── texts/                # Text annotations
    ├── frames/               # Frame-level annotations
    ├── train_processed.txt   # Training split
    ├── val_processed.txt     # Validation split
    ├── test_processed.txt    # Test split
    ├── TOKENS_*/             # Pretokenized features (auto-generated)
    └── animations/           # Rendered videos (optional)
Pretrained Models Directory:
outputs/ # Pretrained model checkpoints
├── vae_1d_z4_step=300000.ckpt       # VAE model (1D, z_dim=4)
├── 20251106_063218_ldf/
│   └── step_step=50000.ckpt         # LDF model checkpoint (HumanML3D)
└── 20251107_021814_ldf_stream/
    └── step_step=240000.ckpt        # LDF streaming model checkpoint (BABEL)
Note: If you downloaded the models using the script above, the paths are already correctly configured. Otherwise, update test_ckpt and test_vae_ckpt in your config files to point to your checkpoint locations.
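As a quick sanity check after downloading, a short script like this (paths taken from the layout above, assumed to be relative to the project root) can confirm the expected checkpoints and dependencies are in place:
```python
import os

# Paths from the directory layout above
expected = [
    "deps/t5_umt5-xxl-enc-bf16",
    "deps/glove/our_vab_data.npy",
    "outputs/vae_1d_z4_step=300000.ckpt",
    "outputs/20251106_063218_ldf/step_step=50000.ckpt",
    "outputs/20251107_021814_ldf_stream/step_step=240000.ckpt",
]
for path in expected:
    print(f"{'ok' if os.path.exists(path) else 'MISSING':8s} {path}")
```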
Create configs/paths.yaml from the example:
cp configs/paths_default.yaml configs/paths.yaml
# Edit paths.yaml to point to your data directories
The other configuration files are:
- vae_wan_1d.yaml - VAE training configuration
- ldf.yaml - LDF training on HumanML3D
- ldf_babel.yaml - LDF training on BABEL
- stream.yaml - Streaming generation config
- ldf_generate.yaml - Generation-only config
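To sanity-check your paths before launching training, a small script along these lines can help; it does not assume any particular key names in paths.yaml and simply walks whatever string entries the file contains:
```python
# Illustrative sketch: verify that every path listed in configs/paths.yaml exists.
import os
import yaml  # PyYAML

with open("configs/paths.yaml") as f:
    paths = yaml.safe_load(f)

def walk(node, prefix=""):
    """Recursively visit every string value in the config and report whether it exists."""
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{prefix}{key}.")
    elif isinstance(node, str):
        status = "ok" if os.path.exists(node) else "MISSING"
        print(f"{prefix[:-1]:40s} {status:8s} {node}")

walk(paths)
```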
# Train VAE
python train_vae.py --config configs/vae_wan_1d.yaml --override train=True
# Test VAE
python train_vae.py --config configs/vae_wan_1d.yaml
Precompute VAE tokens for diffusion training:
python pretokenize_vae.py --config configs/vae_wan_1d.yaml
# Train on HumanML3D
python train_ldf.py --config configs/ldf.yaml --override train=True
# Train on BABEL (streaming)
python train_ldf.py --config configs/ldf_babel.yaml --override train=True
# Test/Evaluate
python train_ldf.py --config configs/ldf.yaml
Generate motions with the streaming config:
python generate_ldf.py --config configs/stream.yaml
Render motion files to videos:
python visualize_motion.py
This script:
- Reads 263D motion features from disk
- Renders to MP4 videos with skeleton visualization
- Supports batch processing of directories
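The 263D features are a pose representation, not joint coordinates. If you want joint positions yourself (outside the model's output_joints=True path), the conversion usually goes through the standard HumanML3D recovery function. A hedged sketch, assuming utils/motion_process.py follows the usual HumanML3D interface (recover_from_ric) and using a placeholder file path:
```python
import numpy as np
import torch

from utils.motion_process import recover_from_ric  # assumed standard HumanML3D helper

# Placeholder path: any (T, 263) feature file. If loading normalized model outputs,
# de-normalize with Mean.npy/Std.npy first.
features = np.load("raw_data/HumanML3D/new_joint_vecs/000001.npy")
joints = recover_from_ric(torch.from_numpy(features).float(), 22)
print(joints.shape)  # expected (T, 22, 3)
```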
For real-time interactive demo with streaming generation, see web_demo/README.md.
- Input: T × 263 motion features
- Latent: (T/4) × 4 tokens
- Architecture: Causal encoder and decoder based on WAN2.2
- Backbone: DiT based on WAN2.2
- Text Encoder: T5
- Diffusion Schedule: Triangular noise schedule
- Streaming: Autoregressive latent generation
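To make the data flow concrete, here is a minimal sketch of how the pieces fit together, assuming only the shapes stated above (T × 263 features, (T/4) × 4 latent tokens); the functions are stand-ins, not the actual FloodDiffusion API:
```python
import torch

# Each autoregressive step denoises one 4-dim latent token; the causal VAE decoder
# then expands every token back into 4 frames of 263D motion features.
def denoise_next_token(latent_history, text):
    return torch.randn(1, 4)                        # stand-in for the DiT denoising step

def decode(latents):
    return torch.randn(latents.shape[0] * 4, 263)   # stand-in for the causal VAE decoder

latents = torch.zeros(0, 4)
for step in range(60):                              # 60 latent tokens -> 240 frames
    text = "walk forward" if step < 30 else "turn around"  # condition may change mid-stream
    latents = torch.cat([latents, denoise_next_token(latents, text)], dim=0)

motion = decode(latents)
print(motion.shape)                                 # torch.Size([240, 263])
```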
pl_train/
├── configs/                     # Configuration files
│   ├── vae_wan_1d.yaml          # VAE training config
│   ├── ldf.yaml                 # LDF training (HumanML3D)
│   ├── ldf_babel.yaml           # LDF training (BABEL)
│   ├── stream.yaml              # Streaming generation
│   └── paths.yaml               # Data paths (create from paths_default.yaml)
│
├── datasets/                    # Dataset loaders
│   ├── humanml3d.py             # HumanML3D dataset
│   └── babel.py                 # BABEL dataset
│
├── models/                      # Model implementations
│   ├── vae_wan_1d.py            # VAE encoder-decoder
│   └── diffusion_forcing_wan.py # LDF diffusion model
│
├── metrics/                     # Evaluation metrics
│   ├── t2m.py                   # Text-to-Motion metrics
│   └── mr.py                    # Motion reconstruction metrics
│
├── utils/                       # Utilities
│   ├── initialize.py            # Config & model loading
│   ├── motion_process.py        # Motion data processing
│   └── visualize.py             # Rendering utilities
│
├── train_vae.py                 # VAE training script
├── train_ldf.py                 # LDF training script
├── pretokenize_vae.py           # Dataset pretokenization
├── generate_ldf.py              # Motion generation
├── visualize_motion.py          # Batch visualization
├── requirements.txt             # Python dependencies
└── web_demo/                    # Real-time web demo (separate)
External Dependencies:
<project_root>/
├── deps/                        # Model dependencies
└── raw_data/                    # Motion datasets
- 2025/12/8: Added EMA smoothing option for joint positions during rendering
If you use this code in your research, please cite:
@article{cai2025flooddiffusion,
title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
journal={arXiv preprint arXiv:2512.03520},
year={2025}
}
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Copyright (c) 2025 Shanda AI Research Tokyo
Note: This project includes code from third-party sources with separate licenses. See THIRD_PARTY_LICENSES.md for details.
- HumanML3D - Dataset
- text-to-motion - Evaluation metrics
- BABEL - Dataset for streaming motion generation
- AMASS - Source motion capture data
- PyTorch Lightning - Training framework
- VideoPose3D - Quaternion operations code
- Hugging Face Transformers - T5 model implementation
- Alibaba Wan Team - WAN model architecture and components
The preprocessed datasets we provide contain extracted motion features (263-dim) and text annotations derived from HumanML3D and BABEL, which are built upon AMASS and HumanAct12. We do not distribute raw AMASS data (SMPL parameters/meshes). This follows standard practice in the motion generation research community. If you require raw motion data or plan to use it for commercial purposes, please register and agree to the licenses on the AMASS website.