Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers

Minghao Yin¹, Wenbo Hu²†, Jiale Xu², Ying Shan², Kai Han¹‡

¹The University of Hong Kong ²ARC Lab, Tencent PCG

†Project lead ‡Corresponding author

Official code release for Sculpt4D, a native 4D generative framework that integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D-2.1). Given an image sequence of an animated object, Sculpt4D generates a temporally coherent sequence of 3D meshes — handling complex motions and topological changes while keeping object identity stable across time.

Abstract

Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demands. We present Sculpt4D, a native 4D generative framework that integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D-2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design models complex spatiotemporal dependencies with high fidelity while sidestepping the quadratic overhead of full attention, reducing the network's total computation by 56%.

Method Highlights

Block Sparse Attention. A first-frame anchor locks object identity, and a time-decaying diagonal mask preserves spatial correspondence while pruning uncorrelated frame pairs — cutting ~56% of compute relative to full temporal attention. See hy3dshape/models/denoisers/hunyuandit_4d_blockmask.py.
Consistent Surface Sampling. Barycentric propagation from a rest-pose canonical mapping, plus projection onto per-frame watertight meshes, yields temporally coherent point-cloud inputs to the VAE. See tools/4dsampling/.
Shared-Noise VAE. A single noise vector is broadcast across frames, so latent dynamics are driven purely by deterministic per-frame statistics — eliminating temporal jitter in latent space. See --noise_share_alpha in inference_4d.py.
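To make the Block Sparse Attention idea more concrete, here is a minimal sketch of a block mask that combines a first-frame anchor with a diagonal band that narrows as frames get further apart. The function names, the `base_width` parameter, and the rule that halves the band per frame of distance are assumptions chosen for illustration; the actual mask lives in hy3dshape/models/denoisers/hunyuandit_4d_blockmask.py and may be built differently.

```python
def build_block_mask(num_frames, blocks_per_frame, base_width=4):
    """Return an N x N boolean mask (list of lists), N = num_frames * blocks_per_frame.

    Illustrative rule (an assumption, not the repository's exact mask):
      * every query block attends to all blocks of its own frame,
      * every query block attends to all blocks of frame 0 (identity anchor),
      * for other frame pairs only a diagonal band of blocks is kept, and
        the band width halves with each frame of temporal distance.
    """
    n = num_frames * blocks_per_frame
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        qf, qb = divmod(q, blocks_per_frame)  # query frame, block index in frame
        for k in range(n):
            kf, kb = divmod(k, blocks_per_frame)
            if kf == qf or kf == 0:
                mask[q][k] = True
            else:
                width = base_width >> abs(qf - kf)  # shrinks to 0 with distance
                mask[q][k] = abs(qb - kb) < width
    return mask


def kept_fraction(mask):
    """Fraction of block pairs that would actually be computed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


mask = build_block_mask(num_frames=8, blocks_per_frame=4)
print(f"kept {kept_fraction(mask):.0%} of block pairs")
```

The kept fraction printed by this toy depends entirely on the assumed `base_width` and block counts, so it should not be compared with the compute savings reported for the real model.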

Pipeline

Data Preparation — Render multi-view images from animated GLB assets and sample temporally consistent 4D surface point clouds (with normals).
Training — Train a 4D diffusion transformer (DiT) on top of a frozen ShapeVAE encoder/decoder, using flow matching and block-mask temporal attention. Only the lightweight temporal adapter is trained; the 3D backbone stays frozen.
Inference — Generate a mesh sequence from an input image sequence via iterative denoising.

Project Structure

sculpt4d/
├── configs/                          # Training configs
│   ├── 4d_config_8.yaml              # Default training config (block-mask attention)
│   ├── 4d_config_16.yaml             # Longer-sequence variant
│   ├── 4d_config_radialattn.yaml     # Reference: radial sparse attention variant
│   ├── 4d_config_framepack.yaml      # Reference: frame-packing variant
│   └── 4d_config_rope.yaml           # Reference: RoPE variant
├── demos/                            # Example image sequences (one folder per case)
│   └── door/  chilun10/  angle/  box/  fatman/   (32 frames each)
├── hy3dshape/                        # Core library
│   ├── data/                         # Data loading and preprocessing
│   ├── models/
│   │   ├── autoencoders/             # ShapeVAE (encoder, decoder, surface extraction)
│   │   ├── denoisers/                # HunYuanDiT-4D temporal attention variants
│   │   ├── diffusion/                # Flow matching training & transport
│   │   └── conditioner_4d.py         # Video-conditioned DINOv2 encoder
│   ├── pipelines.py                  # Mesh export utilities
│   ├── schedulers.py                 # Flow-match Euler scheduler
│   └── utils/                        # Training utilities, EMA, logging
├── scripts/                          # DeepSpeed launch scripts
├── tools/
│   ├── 4dsampling/                   # Consistent 4D point cloud sampling pipeline
│   ├── render/                       # Blender rendering scripts
│   ├── dataset/                      # Data verification & visualization
│   ├── watertight/                   # Watertight mesh processing
│   ├── mini_trainset/                # Minimal demo training data
│   └── pipeline.sh                   # End-to-end rendering pipeline
├── inference_4d.py                   # Inference script
├── main.py                           # Training entry point
└── run_train.sh                      # Launch training (config-driven)

Requirements

Python 3.10+
PyTorch 2.5.1+ with CUDA 12.4
8x GPUs with 80GB+ VRAM (e.g. H20, A100) for training
Blender 4.4+ for data preparation

Installation

# 1. Install PyTorch (CUDA 12.4)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# 2. Install Python dependencies
pip install -r requirements.txt

# 3. Install Block Sparse Attention (required for temporal attention)
#    See: https://github.com/mit-han-lab/Block-Sparse-Attention
git clone https://github.com/mit-han-lab/Block-Sparse-Attention.git
cd Block-Sparse-Attention
pip install packaging ninja
python setup.py install
cd ..

Pretrained models are downloaded automatically from Hugging Face (tencent/Hunyuan3D-2.1 and facebook/dinov2-large) on first run.

The reference radialattn variant additionally requires flashinfer; it is optional and not needed for the default block-mask model.

Pretrained Checkpoint

The pretrained model is hosted on the Hugging Face Hub at TencentARC/Sculpt4D, under the blockmask_bf16/ subfolder. It is a sharded bf16 checkpoint (~8 GB) in the standard pytorch_model-*.bin + pytorch_model.bin.index.json format. Download it into a local folder:

huggingface-cli download TencentARC/Sculpt4D --include "blockmask_bf16/*" --local-dir checkpoints/sculpt4d

or in Python:

from huggingface_hub import snapshot_download
snapshot_download("TencentARC/Sculpt4D", allow_patterns="blockmask_bf16/*", local_dir="checkpoints/sculpt4d")

Pass the downloaded subfolder (checkpoints/sculpt4d/blockmask_bf16) to --ckpt_path.
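Before launching inference it can be useful to confirm that every shard arrived. The sketch below assumes the index file follows the usual Hugging Face sharded-checkpoint layout, in which a `weight_map` entry maps each parameter name to the shard file that stores it; `missing_shards` is a hypothetical helper, not part of this repository.

```python
import json
from pathlib import Path


def missing_shards(ckpt_dir, index_name="pytorch_model.bin.index.json"):
    """Return the sorted shard filenames listed in the index but absent on disk."""
    ckpt_dir = Path(ckpt_dir)
    index = json.loads((ckpt_dir / index_name).read_text())
    # Assumption: the index uses the usual Hugging Face layout, where
    # "weight_map" maps each parameter name to the shard file that stores it.
    shards = set(index["weight_map"].values())
    return sorted(name for name in shards if not (ckpt_dir / name).is_file())


if __name__ == "__main__":
    ckpt = Path("checkpoints/sculpt4d/blockmask_bf16")
    if ckpt.is_dir():
        missing = missing_shards(ckpt)
        print("all shards present" if not missing else f"missing: {missing}")
```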

Inference

Each case under demos/ is an image sequence (one image per frame) of a single object. Foreground RGBA images are recommended (the alpha channel is used to segment the object). Generate a mesh sequence with:

python inference_4d.py \
    --config configs/4d_config_8.yaml \
    --ckpt_path checkpoints/sculpt4d/blockmask_bf16 \
    --input_dir demos/door \
    --output_dir ./inference_output/door \
    --num_steps 50 \
    --guidance_scale 5.0

The model reads the first sequence_length frames from the folder (each demo ships 32 frames). To try the other bundled cases, just change --input_dir to demos/chilun10, demos/angle, demos/box, or demos/fatman.
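The roles of --num_steps and --guidance_scale can be illustrated with a generic flow-matching Euler loop combined with classifier-free guidance. This is a toy sketch on a single scalar: the time direction (0 to 1), the uniform step spacing, and the `toy_velocity` stand-in for the DiT are all assumptions, and the real scheduler in hy3dshape/schedulers.py may use different conventions.

```python
def guided_velocity(v_cond, v_uncond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def euler_sample(x, velocity_fn, num_steps=50, guidance_scale=5.0):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps.

    velocity_fn(x, t) must return a (conditional, unconditional) pair.
    """
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v_cond, v_uncond = velocity_fn(x, t)
        x = x + dt * guided_velocity(v_cond, v_uncond, guidance_scale)
    return x


def toy_velocity(x, t, target=2.0):
    """Stand-in for the DiT: both branches pull a scalar toward a fixed value."""
    pull = (target - x) / (1.0 - t)  # straight-line flow toward `target`
    return pull, pull  # identical branches, so guidance has no effect here


print(euler_sample(x=-1.0, velocity_fn=toy_velocity))
```

In the real pipeline `x` would be the latent for the whole frame sequence and the two branches would come from running the denoiser with and without the image conditioning; more steps trade runtime for a finer integration of the same trajectory.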

Arguments

Argument	Description	Default
`--config`	Training config YAML	(required)
`--ckpt_path`	Sharded checkpoint directory	(required)
`--input_dir`	Folder with the object's image sequence (one image per frame)	(required)
`--output_dir`	Output directory for `.obj` meshes	`./inference_output`
`--num_steps`	Denoising steps	50
`--guidance_scale`	CFG scale	5.0
`--noise_share_alpha`	Noise sharing across frames (0=independent, 1=shared)	1.0
`--mc_algo`	Mesh extraction: `mc`, `dmc`, or `fc`	`mc`
`--dtype`	Data type: `bf16` or `fp32`	`bf16`

Input Folder Convention

demos/door/
├── frame_0000.png
├── frame_0001.png
├── ...
└── frame_0031.png   # frames are read in sorted filename order
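The frame-selection rule (sorted filename order, first sequence_length frames) can be sketched as follows. `select_frames` and the set of accepted suffixes are hypothetical and only illustrate the convention; the loader in inference_4d.py may filter files differently.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}  # assumed set of accepted types


def select_frames(input_dir, sequence_length):
    """Return the first `sequence_length` image paths in sorted filename order."""
    frames = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if len(frames) < sequence_length:
        raise ValueError(
            f"{input_dir} has {len(frames)} images, need {sequence_length}"
        )
    return frames[:sequence_length]
```

Because the order is lexicographic, zero-padded names such as frame_0002.png matter: an unpadded frame_10.png would sort before frame_2.png.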

Output

One .obj mesh file per generated frame:

inference_output/door/
├── frame_0000.obj
├── frame_0001.obj
├── ...
└── frame_0007.obj

Training

Configuration

The default config is configs/4d_config_8.yaml, which uses block-mask temporal attention. To train on longer sequences, use configs/4d_config_16.yaml instead (it differs only in sequence_length).

Three additional reference configs implement alternative temporal-attention designs. They share the same VAE and diffusion settings and are provided for exploration:

Variant	Config	Denoiser
Radial sparse attention	`configs/4d_config_radialattn.yaml`	`hunyuandit_4d_radial.py`
FramePack	`configs/4d_config_framepack.yaml`	`hunyuandit_4d_framepack.py`
RoPE	`configs/4d_config_rope.yaml`	`hunyuandit_4d_rope.py`

Key parameters in the config:

dataset.params.train_data_list / val_data_list — Path to your preprocessed dataset
dataset.params.sequence_length — Number of frames per sequence (default: 8)
dataset.params.pc_size — Point cloud size for VAE encoding (40960)
training.steps — Total training steps (default: 20,000)
training.base_lr — Learning rate (default: 5e-5)

Launch Training

Edit run_train.sh to set your conda environment and working directory:

# In run_train.sh, uncomment and update:
# source /path/to/conda/bin/activate your_env
# cd /path/to/sculpt4d

Update dataset.params.train_data_list in the config to point to your dataset.
Run (the config is the single source of truth — no separate scripts):

bash run_train.sh                              # default (configs/4d_config_8.yaml)
bash run_train.sh configs/4d_config_16.yaml    # or pass another config

Training uses 8 GPUs by default with DeepSpeed ZeRO. Checkpoints are written to output_folder/dit/<config_name>/.

Data Preparation

The pipeline takes animated GLB assets and produces, for each object, (1) per-frame multi-view renders used as image conditioning, and (2) temporally consistent surface point clouds used to train the model. Each animation is resampled to a fixed 32 frames.

Step 1: Render Multi-View Images

bash tools/pipeline.sh /path/to/blender /path/to/input.glb object_name [output_folder]

This runs tools/render/render4d_32.py in Blender (geometry mode, 512×512). For each of the 32 frames it renders 24 views (random Hammersley camera placement) as RGBA images, exports the per-frame triangle mesh, and writes the camera parameters. The output base folder defaults to dataset/preprocessed.
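For readers unfamiliar with Hammersley placement, the sketch below generates 24 well-spread view directions on the unit sphere from a Hammersley point set. Interpreting "random" as a random toroidal shift of the point set is an assumption, as are the function names; camera distance, field of view, and the exact mapping used by render4d_32.py are not covered here.

```python
import math
import random


def radical_inverse_base2(i):
    """Van der Corput sequence: mirror the binary digits of i around the point."""
    result, fraction = 0.0, 0.5
    while i > 0:
        if i & 1:
            result += fraction
        i >>= 1
        fraction *= 0.5
    return result


def hammersley_sphere(n, offset=(0.0, 0.0)):
    """Return n unit vectors from a Hammersley set mapped onto the sphere."""
    points = []
    for i in range(n):
        u = (i / n + offset[0]) % 1.0                     # first coordinate: i / n
        v = (radical_inverse_base2(i) + offset[1]) % 1.0  # second: radical inverse
        z = 1.0 - 2.0 * u                # uniform in height gives uniform area
        r = math.sqrt(max(0.0, 1.0 - z * z))
        phi = 2.0 * math.pi * v
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


rng = random.Random(0)
views = hammersley_sphere(24, offset=(rng.random(), rng.random()))
```

Each returned vector can be scaled by a camera radius and used as a camera position looking at the origin.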

Output structure (one folder per frame, frame_0000 … frame_0031):

dataset/preprocessed/object_name/
└── render_cond/
    ├── frame_0000/
    │   ├── 000.png  001.png  …  023.png   # 24 multi-view RGBA renders
    │   ├── mesh.ply                        # per-frame triangle mesh
    │   └── transforms.json                 # per-view camera parameters
    ├── frame_0001/
    └── …

Step 2: Sample 4D Point Clouds

parallel_process_v3.py scans a dataset root for object folders (each containing the render_cond/ from Step 1) and, for every object, writes geo_data/4d_sample_new.npz alongside that object's render_cond/. Internally it samples points once on a rest-pose mesh, propagates them to every frame via barycentric coordinates, and projects them onto each frame's watertight surface, which yields point clouds whose ordering is consistent across time.

python tools/4dsampling/parallel_process_v3.py \
    --input_mode render \
    --output_root_1 ./dataset/preprocessed \
    --python_exec /path/to/python \
    --workers 8 \
    --grid_res 256

--output_root_1: dataset root to scan (the folder produced by Step 1). 4d_sample_new.npz is written into each object's geo_data/.
--python_exec: Python interpreter that has torch, cubvh, pytorch3d, and trimesh installed (used by the per-object worker).
--workers: number of parallel workers (defaults to the GPU count).
--grid_res: marching-cubes resolution for watertighting (default 256).
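The barycentric propagation step can be sketched in a few lines: points are drawn once on the rest pose as (face index, barycentric weights) pairs, and the same weights are then applied to each frame's vertex positions. The sketch assumes every frame shares the rest-pose face indexing, and all function names are hypothetical; area-weighted face selection and the projection onto the watertight surface performed by the real pipeline are omitted.

```python
import math
import random


def sample_barycentric(rng):
    """Draw barycentric weights that are uniform over a triangle's area."""
    r1, r2 = rng.random(), rng.random()
    s = math.sqrt(r1)
    return (1.0 - s, s * (1.0 - r2), s * r2)


def apply_barycentric(weights, triangle):
    """Combine the three vertices (xyz tuples) of `triangle` with `weights`."""
    return tuple(
        sum(w * vertex[axis] for w, vertex in zip(weights, triangle))
        for axis in range(3)
    )


def propagate(samples, frames):
    """samples: list of (face_index, weights) drawn once on the rest pose.
    frames: per-frame list of triangles sharing the rest-pose face indexing.
    Returns one point list per frame; index i is the same surface point in all.
    """
    return [
        [apply_barycentric(w, faces[f]) for f, w in samples]
        for faces in frames
    ]


rng = random.Random(0)
rest = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]]
moved = [[(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0)]]
samples = [(0, sample_barycentric(rng)) for _ in range(5)]
clouds = propagate(samples, [rest, moved])
```

Because each sample keeps its face index and weights, point i in one frame and point i in another refer to the same location on the surface, which is what makes the ordering consistent across time.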

Processing modes (--input_mode, default hybrid):

Mode	Rest pose source	Requires Blender + GLB?
`render`	`frame_0000/mesh.ply` (simplest, no Blender)	No
`hybrid`	Blender exports the rest pose; frames reuse `render_cond` meshes	Yes
`export`	Full Blender export of the rest pose and all frames	Yes

Output: geo_data/4d_sample_new.npz, keyed by frame. Each frame stores two surface point clouds, random_surface (uniformly sampled) and sharp_surface (concentrated near sharp edges), as float16 arrays of shape (N, 6) holding xyz coordinates plus surface normals.

Step 3: Verify Data

Sanity-check a processed object. The script writes a report and multi-view point-cloud snapshots, and checks that the points are centered and normalized to [-0.5, 0.5]:

python tools/dataset/verify_data.py \
    --data_path ./dataset/preprocessed/OBJECT_ID/geo_data/ \
    --output_dir ./verification_output/
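The centering and range check can be approximated by hand as in the sketch below. Treating "centered" as "bounding-box center at the origin", the tolerance value, and both function names are assumptions; verify_data.py may apply a different criterion.

```python
def normalize_to_unit_box(points):
    """Shift and scale xyz points so the bounding box is centered at the origin
    and its longest side spans exactly [-0.5, 0.5]."""
    lo = [min(p[a] for p in points) for a in range(3)]
    hi = [max(p[a] for p in points) for a in range(3)]
    center = [(l + h) / 2.0 for l, h in zip(lo, hi)]
    extent = max(h - l for l, h in zip(lo, hi)) or 1.0  # guard degenerate input
    return [tuple((p[a] - center[a]) / extent for a in range(3)) for p in points]


def is_centered_and_normalized(points, tol=1e-3):
    """True if all coordinates lie in [-0.5, 0.5] and the bbox center is ~0."""
    lo = [min(p[a] for p in points) for a in range(3)]
    hi = [max(p[a] for p in points) for a in range(3)]
    in_range = all(l >= -0.5 - tol and h <= 0.5 + tol for l, h in zip(lo, hi))
    centered = all(abs((l + h) / 2.0) <= tol for l, h in zip(lo, hi))
    return in_range and centered


raw = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 4.0, 0.0), (0.0, 4.0, 1.0)]
print(is_centered_and_normalized(raw))
print(is_centered_and_normalized(normalize_to_unit_box(raw)))
```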

Expected Dataset Layout

merged_dataset/
├── object_id_1/
│   ├── render_cond/
│   │   ├── frame_0000/
│   │   │   ├── 000.png … 023.png    # 24 views
│   │   │   ├── mesh.ply
│   │   │   └── transforms.json
│   │   └── …                         # frame_0001 … frame_0031
│   └── geo_data/
│       └── 4d_sample_new.npz
├── object_id_2/
└── …
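A quick structural check of one object folder against this layout might look like the sketch below. The frame and view counts come from the layout above; `check_object` is a hypothetical helper rather than a script shipped in tools/dataset/.

```python
from pathlib import Path

NUM_FRAMES = 32
NUM_VIEWS = 24


def check_object(obj_dir):
    """Return a list of human-readable problems found under one object folder."""
    obj_dir = Path(obj_dir)
    problems = []
    if not (obj_dir / "geo_data" / "4d_sample_new.npz").is_file():
        problems.append("missing geo_data/4d_sample_new.npz")
    for f in range(NUM_FRAMES):
        frame = obj_dir / "render_cond" / f"frame_{f:04d}"
        if not frame.is_dir():
            problems.append(f"missing {frame.name}/")
            continue
        for name in ("mesh.ply", "transforms.json"):
            if not (frame / name).is_file():
                problems.append(f"{frame.name}: missing {name}")
        views = [v for v in range(NUM_VIEWS) if (frame / f"{v:03d}.png").is_file()]
        if len(views) != NUM_VIEWS:
            problems.append(f"{frame.name}: {len(views)}/{NUM_VIEWS} views")
    return problems
```

Running it over every subfolder of the dataset root before training gives an early warning about objects whose rendering or sampling step did not finish.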

Acknowledgments

This project builds upon:

Hunyuan3D-2.1 — Base 3D generation framework
Block-Sparse-Attention — Sparse attention kernels for temporal attention
DINOv2 — Visual feature encoder

Citation

If you find this work useful, please cite:

@inproceedings{sculpt4d2026,
  title={Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers},
  author={Yin, Minghao and Hu, Wenbo and Xu, Jiale and Shan, Ying and Han, Kai},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
