FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

[Project Page] | [Paper (arXiv)]

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.

Features

  • 🔄 Streaming Generation: Support for continuous motion generation with text condition changes
  • 🚀 Latent Diffusion Forcing: Efficient generation using compressed latent space with diffusion
  • ⚡ Real-time Capable: Optimized for streaming inference with ~50 FPS model output

Installation

Environment Setup

# Create conda environment
conda create -n motion_gen python=3.10
conda activate motion_gen

# Install PyTorch
pip install torch torchvision torchaudio

# Install dependencies
pip install -r requirements.txt

# Install Flash Attention
conda install -c nvidia cuda-toolkit
export CUDA_HOME=$CONDA_PREFIX
pip install flash-attn --no-build-isolation
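
Optionally, verify the environment from Python (a generic check, not specific to this repository):

# Quick environment check
import torch
import flash_attn

print(torch.__version__, torch.cuda.is_available())
print(flash_attn.__version__)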

Quick Inference (No Data Required)

If you only need to generate motions and don't plan to train or evaluate models, you can use our standalone model on Hugging Face:

🤗 ShandaAI/FloodDiffusion

This version requires no dataset downloads and works out-of-the-box for inference:

from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True
)

# Generate motion from text
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate as joint coordinates for visualization
motion_joints = model("a person walking forward", length=60, output_joints=True)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)

# Multi-text transitions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]
)

For detailed API documentation, see the model card.

Note: For training, evaluation, or using the scripts in this repository, continue with the Data Preparation section below.

Data Preparation

Prepare Data from Original Sources

To reproduce our results from scratch, follow the original data preparation pipelines:

HumanML3D:

  • Follow the instructions in the HumanML3D repository
  • Extract 263D motion features using their processing pipeline
  • Place the processed data in raw_data/HumanML3D/

BABEL:

  • Download from the BABEL website
  • Process the motion sequences to extract 263D features
  • For streaming generation, segment and process according to the frame-level annotations
  • Place the processed data in raw_data/BABEL_streamed/
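
After processing either dataset, you can sanity-check a feature file. This is a minimal sketch assuming per-sequence features are saved as .npy arrays of shape (T, 263); the file name below is only an example:

import numpy as np

# Pick any sequence from new_joint_vecs/ (HumanML3D) or motions/ (BABEL_streamed)
feats = np.load("raw_data/HumanML3D/new_joint_vecs/000000.npy")
print(feats.shape)  # expected: (T, 263)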

Dependencies:

  • Download T5 encoder weights from Hugging Face
  • Download T2M evaluation models from the text-to-motion repository
  • Download GloVe embeddings

Quick Start: Download Preprocessed Data (Recommended)

We provide all necessary data (datasets, dependencies, and pretrained models) on Hugging Face: 🤗 ShandaAI/FloodDiffusionDownloads

For inference only (downloads deps/ and outputs/):

pip install huggingface_hub
python download_assets.py

For training/evaluation (also downloads datasets in raw_data/):

pip install huggingface_hub
python download_assets.py --with-dataset

This will automatically download and extract files into the correct directories.
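
If you prefer to fetch the assets manually, the same files can be pulled with huggingface_hub. This is a sketch only: the correct repo_type (model vs. dataset) and archive layout depend on how the repository is hosted, and you may need to extract the downloaded archives into deps/, outputs/, and raw_data/ yourself:

from huggingface_hub import snapshot_download

# Download the assets repository into a local folder (check the repo page if this fails)
snapshot_download(repo_id="ShandaAI/FloodDiffusionDownloads", local_dir="hf_assets")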

Directory Structure

After downloading or preparing the data, your project should have the following structure:

Dependencies Directory:

deps/
├── t2m/                     # Text-to-Motion evaluation models
│   ├── humanml3d/           # HumanML3D evaluator
│   ├── kit/                 # KIT-ML evaluator
│   └── meta/                # Statistics (mean.npy, std.npy)
├── glove/                   # GloVe word embeddings
│   ├── our_vab_data.npy
│   ├── our_vab_idx.pkl
│   └── our_vab_words.pkl
└── t5_umt5-xxl-enc-bf16/    # T5 text encoder

Dataset Directory:

raw_data/
├── HumanML3D/
│   ├── new_joint_vecs/      # 263D motion features (required)
│   ├── texts/               # Text annotations
│   ├── train.txt            # Training split
│   ├── val.txt              # Validation split
│   ├── test.txt             # Test split
│   ├── all.txt              # All samples
│   ├── Mean.npy             # Dataset mean
│   ├── Std.npy              # Dataset std
│   ├── TOKENS_*/            # Pretokenized features (auto-generated)
│   └── animations/          # Rendered videos (optional)
│
└── BABEL_streamed/
    ├── motions/             # 263D motion features (required)
    ├── texts/               # Text annotations
    ├── frames/              # Frame-level annotations
    ├── train_processed.txt  # Training split
    ├── val_processed.txt    # Validation split
    ├── test_processed.txt   # Test split
    ├── TOKENS_*/            # Pretokenized features (auto-generated)
    └── animations/          # Rendered videos (optional)

Pretrained Models Directory:

outputs/                     # Pretrained model checkpoints
├── vae_1d_z4_step=300000.ckpt          # VAE model (1D, z_dim=4)
├── 20251106_063218_ldf/
│   └── step_step=50000.ckpt            # LDF model checkpoint (HumanML3D)
└── 20251107_021814_ldf_stream/
    └── step_step=240000.ckpt           # LDF streaming model checkpoint (BABEL)

Note: If you downloaded the models using the script above, the paths are already correctly configured. Otherwise, update test_ckpt and test_vae_ckpt in your config files to point to your checkpoint locations.

Configuration

Create configs/paths.yaml from the example:

cp configs/paths_default.yaml configs/paths.yaml
# Edit paths.yaml to point to your data directories

Available Configs

  • vae_wan_1d.yaml - VAE training configuration
  • ldf.yaml - LDF training on HumanML3D
  • ldf_babel.yaml - LDF training on BABEL
  • stream.yaml - Streaming generation config
  • ldf_generate.yaml - Generation-only config

Training

1. Train VAE (Motion Encoder)

# Train VAE
python train_vae.py --config configs/vae_wan_1d.yaml --override train=True

# Test VAE
python train_vae.py --config configs/vae_wan_1d.yaml

2. Pretokenize Dataset

Precompute VAE tokens for diffusion training:

python pretokenize_vae.py --config configs/vae_wan_1d.yaml
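
Conceptually, this step runs the frozen VAE encoder over every training sequence once and caches the resulting latent tokens (the TOKENS_*/ directories), so LDF training never has to run the encoder online. The sketch below is illustrative pseudocode with hypothetical names (load_vae, iter_motion_files, save_tokens), not the script's actual API:

# Illustrative only -- hypothetical helper names, not the real pretokenize_vae.py API
vae = load_vae("outputs/vae_1d_z4_step=300000.ckpt")     # frozen motion VAE
for path, motion in iter_motion_files("raw_data/HumanML3D/new_joint_vecs"):
    tokens = vae.encode(motion)                          # (T, 263) -> (T/4, 4) latent tokens
    save_tokens(path, tokens)                            # cached under TOKENS_*/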

3. Train Latent Diffusion Forcing (Flood Diffusion)

# Train on HumanML3D
python train_ldf.py --config configs/ldf.yaml --override train=True

# Train on BABEL (streaming)
python train_ldf.py --config configs/ldf_babel.yaml --override train=True

# Test/Evaluate
python train_ldf.py --config configs/ldf.yaml

Generation

Interactive Generation

python generate_ldf.py --config configs/stream.yaml

Visualization

Render motion files to videos:

python visualize_motion.py

This script:

  • Reads 263D motion features from disk
  • Renders to MP4 videos with skeleton visualization
  • Supports batch processing of directories
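
If you generated motion with the standalone Hugging Face model, one way to get it into a renderable form is to save the joint output as .npy first. A minimal sketch; the directory visualize_motion.py expects, and whether the returned joints need a .cpu().numpy() conversion, should be checked against the script and the model card:

import os
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("ShandaAI/FloodDiffusion", trust_remote_code=True)
joints = model("a person walking forward", length=60, output_joints=True)  # (~240, 22, 3)

os.makedirs("my_motions", exist_ok=True)                  # hypothetical output directory
np.save("my_motions/walking_forward.npy", np.asarray(joints))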

Web Real-time Demo

For real-time interactive demo with streaming generation, see web_demo/README.md.

Model Architecture

VAE (Variational Autoencoder)

  • Input: T × 263 motion features
  • Latent: (T/4) × 4 tokens
  • Architecture: Causal encoder and decoder based on WAN2.2
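
To make the compression ratio concrete, here is a shape-only sketch (pure bookkeeping, not the repository's VAE code):

import torch

T = 240                                   # motion length in frames
motion = torch.randn(T, 263)              # T x 263 motion features
latent_shape = (T // 4, 4)                # 4x temporal compression, z_dim = 4
print(tuple(motion.shape), "->", latent_shape)  # (240, 263) -> (60, 4)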

LDF (Latent Diffusion Forcing)

  • Backbone: DiT based on WAN2.2
  • Text Encoder: T5
  • Diffusion Schedule: Triangular noise schedule
  • Streaming: Autoregressive latent generation
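
As an illustration of the diffusion-forcing idea behind this: each latent token in a window gets its own noise level, ramping from nearly clean (oldest token) to fully noised (newest), which is what allows generation to proceed autoregressively over the stream. Below is a minimal sketch of such a per-token ramp; the exact triangular schedule used by FloodDiffusion may differ:

import torch

def per_token_noise_levels(num_tokens: int) -> torch.Tensor:
    """Noise level per latent token, increasing linearly from ~0 (oldest, almost clean)
    to 1 (newest, pure noise). Illustrative only -- not the paper's exact schedule."""
    return torch.linspace(0.0, 1.0, num_tokens)

print(per_token_noise_levels(6))  # tensor([0.0000, 0.2000, 0.4000, 0.6000, 0.8000, 1.0000])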

Project Structure

pl_train/
├── configs/                        # Configuration files
│   ├── vae_wan_1d.yaml             # VAE training config
│   ├── ldf.yaml                    # LDF training (HumanML3D)
│   ├── ldf_babel.yaml              # LDF training (BABEL)
│   ├── stream.yaml                 # Streaming generation
│   └── paths.yaml                  # Data paths (create from paths_default.yaml)
│
├── datasets/                       # Dataset loaders
│   ├── humanml3d.py                # HumanML3D dataset
│   └── babel.py                    # BABEL dataset
│
├── models/                         # Model implementations
│   ├── vae_wan_1d.py               # VAE encoder-decoder
│   └── diffusion_forcing_wan.py    # LDF diffusion model
│
├── metrics/                        # Evaluation metrics
│   ├── t2m.py                      # Text-to-Motion metrics
│   └── mr.py                       # Motion reconstruction metrics
│
├── utils/                          # Utilities
│   ├── initialize.py               # Config & model loading
│   ├── motion_process.py           # Motion data processing
│   └── visualize.py                # Rendering utilities
│
├── train_vae.py                    # VAE training script
├── train_ldf.py                    # LDF training script
├── pretokenize_vae.py              # Dataset pretokenization
├── generate_ldf.py                 # Motion generation
├── visualize_motion.py             # Batch visualization
├── requirements.txt                # Python dependencies
└── web_demo/                       # Real-time web demo (separate)

External Dependencies:

<project_root>/
├── deps/                           # Model dependencies
└── raw_data/                       # Motion datasets

Update History

  • 2025/12/8: Added EMA smoothing option for joint positions during rendering

Citation

If you use this code in your research, please cite:

@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Copyright (c) 2025 Shanda AI Research Tokyo

Note: This project includes code from third-party sources with separate licenses. See THIRD_PARTY_LICENSES.md for details.

Acknowledgments

Data License Notice

The preprocessed datasets we provide contain extracted motion features (263-dim) and text annotations derived from HumanML3D and BABEL, which are built upon AMASS and HumanAct12. We do not distribute raw AMASS data (SMPL parameters/meshes). This follows standard practice in the motion generation research community. If you require raw motion data or plan to use it for commercial purposes, please register and agree to the licenses on the AMASS website.
