Hongfei Zhang1*, Harold H. Chen1,2*, Chenfei Liao1*, Jing He1*, Zixin Zhang1, Haodong Li3, Yihao Liang4,
Kanghao Chen1, Bin Ren5, Xu Zheng1, Shuai Yang1, Kun Zhou6, Yinchuan Li7, Nicu Sebe8,
Ying-Cong Chen1,2†
*Equal Contribution; †Corresponding Author
1HKUST(GZ), 2HKUST, 3UCSD, 4Princeton University, 5MBZUAI, 6SZU, 7Knowin, 8UniTrento
Welcome to the official repository for DVD: Deterministic Video Depth!
While recent generative foundation models have shown remarkable potential in zero-shot depth estimation, they inherently suffer from the ambiguity-hallucination dilemma due to their stochastic sampling process. DVD fundamentally shifts this paradigm. We present the first deterministic framework that elegantly adapts pre-trained Video Diffusion Models (like WanV2.1) into single-pass depth regressors.
By cleanly stripping away generative stochasticity, DVD unites the semantic richness of generative foundation models with the structural stability of discriminative regressors.
- 🚀 Extreme Data Efficiency: DVD effectively unlocks profound generative priors using only 367K frames—which is 163× less task-specific training data than leading discriminative baselines like VDA (60M frames).
- ⏱️ Deterministic & Fast: Bypasses iterative ODE integration. Inference is performed in a single forward pass, ensuring absolute temporal stability without generative hallucinations.
- 📐 Unparalleled Structural Fidelity: Powered by our Latent Manifold Rectification (LMR), DVD achieves state-of-the-art high-frequency boundary precision (Boundary Recall & F1) compared to overly smoothed baselines.
- 🎥 Long-Video Inference: Equipped with our training-free Global Affine Coherence module, DVD seamlessly stitches sliding windows to support long-video rollouts with negligible scale drift.
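Boundary Recall/F1 measures how well depth discontinuities are preserved. As a rough illustration only (not the benchmark's exact definition, which typically matches edges within a pixel tolerance), edge precision/recall on thresholded relative depth differences can be computed like this; all names here are ours:

```python
import numpy as np

def depth_edges(depth, thresh=0.1):
    """Boolean edge map: relative depth change above thresh between neighbors."""
    d = depth
    dx = np.abs(np.diff(d, axis=1)) / np.minimum(d[:, :-1], d[:, 1:])
    dy = np.abs(np.diff(d, axis=0)) / np.minimum(d[:-1], d[1:])
    e = np.zeros(d.shape, dtype=bool)
    e[:, :-1] |= dx > thresh
    e[:-1, :] |= dy > thresh
    return e

def boundary_f1(pred, gt, thresh=0.1):
    """F1 between predicted and ground-truth depth edge maps (no tolerance)."""
    ep, eg = depth_edges(pred, thresh), depth_edges(gt, thresh)
    inter = (ep & eg).sum()
    prec = inter / max(ep.sum(), 1)
    rec = inter / max(eg.sum(), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)
```

An over-smoothed prediction erases edges, so its recall (and hence F1) drops even when per-pixel error is low.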
TL;DR: If you want state-of-the-art video depth estimation that is highly detailed, temporally stable across long videos, and exceptionally data-efficient, DVD is what you need.
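To make the sliding-window idea concrete, here is a minimal sketch of affine (scale + shift) alignment between overlapping depth windows. The function names and the simple left-to-right chaining are ours for illustration, not the repo's actual Global Affine Coherence implementation:

```python
import numpy as np

def fit_scale_shift(pred, ref):
    """Least-squares scale s and shift t such that s * pred + t ≈ ref."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
    return s, t

def stitch_windows(windows, overlap):
    """Align each window to the previous one on their overlapping frames."""
    out = [windows[0]]
    for w in windows[1:]:
        s, t = fit_scale_shift(w[:overlap], out[-1][-overlap:])
        out.append(s * w + t)
    # Keep the first window, then only the non-overlapping tail of each successor
    return np.concatenate([out[0]] + [w[overlap:] for w in out[1:]])
```

Because each window is corrected onto the previous one's affine frame, per-window scale/shift ambiguity does not accumulate into visible drift.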
- [2026.03.13] 📄 Paper is available on arXiv.
- [2026.03.12] 🌐 Project page is live.
- [2026.03.11] 🤗 Pre-trained weights released on Hugging Face.
- [2026.03.10] 🔥 Repository initialized and training & inference code released.
To help you navigate the codebase quickly, we have divided the core directories into two main categories based on what you want to do: Inference (just using the model) or Training (fine-tuning or training from scratch).
If you just want to generate depth maps from your own videos or reproduce our paper's results, focus on these folders:
- `infer_bash/` (The Launchpad): Ready-to-use shell scripts (e.g., `openworld.sh`). This is the easiest way to run the model on your data without writing any code.
- `ckpt/` (The Vault): Place the pre-trained model weights downloaded from Hugging Face here.
- `inference_results/` (The Output Bin): Generated depth maps and visualizer videos appear here after you run an inference script.
- `demo/`: Quick-start examples and sample inputs to help you verify that your environment is set up correctly.
If you want to train DVD on your own datasets or modify the architecture, these are your go-to folders:
- `train_config/` (The Control Center): YAML configuration files. Tweak hyperparameters (e.g., learning rate, batch size) and dataset paths here.
- `train_script/` (The Engine): Contains the training launch scripts.
- `diffsynth/pipelines/wan_video_new_determine.py` (The Brain): The core DVD model architecture. Look here to understand or modify how the generative stochasticity is stripped away to build the deterministic forward pass.
- `infer_bash/` & `test_script/` (The Evaluator): Scripts for evaluating newly trained checkpoints against standard benchmarks during or after training.
- `examples/dataset/`: Code for dataset construction.
git clone https://github.com/EnVision-Research/DVD.git
cd DVD
conda create -n dvd python=3.10 -y
conda activate dvd
pip install -e .
pip install sageattention # DO NOT USE THIS FOR TRAINING!!!
huggingface-cli login # Or hf auth login
2. Download the checkpoint from the Hugging Face repo
huggingface-cli download FayeHongfeiZhang/DVD --revision main --local-dir ckpt
DVD
├── ckpt/
├──── model_config.yaml
├──── model.safetensors
├── configs/
├── examples/
├── ...
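After downloading, you can peek at the checkpoint without loading any deep-learning framework: the safetensors format stores an 8-byte little-endian header length followed by a JSON header describing every tensor. A stdlib-only reader (an illustrative helper of ours, not part of the repo):

```python
import json
import struct

def safetensors_header(path):
    """Return {tensor_name: shape} from a .safetensors file's JSON header."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))   # header length, little-endian u64
        header = json.loads(f.read(n))
    header.pop("__metadata__", None)            # optional metadata block
    return {name: meta["shape"] for name, meta in header.items()}
```

Running it on `ckpt/model.safetensors` lists tensor names and shapes, which is a quick way to confirm the download is intact.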
💅🏻 Potential Issue (from DiffSynth Studio)
If you encounter issues during installation, it may be caused by the packages we depend on. Please refer to the documentation of the package that caused the problem.
bash infer_bash/openworld.sh
You can also put your own videos in the demo/ directory and update the video path in the script to run on them!
1.1. Download the KITTI Dataset, Bonn Dataset, ScanNet Dataset.
1.2. Organize the datasets into the structures below
kitti_depth
├── rgb
├──── 2011_09_26
├──── ...
├── depth
├──── train
├──── val
rgbd_bonn_dataset
├── rgbd_bonn_balloon
├── rgbd_bonn_balloon_tracking
├── ...
scannet
├── scene0000_00
├── scene0000_01
├── ...
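A quick, framework-free way to sanity-check the layout before launching inference (directory names are taken from the trees above; the helper itself is ours, not part of the repo):

```python
from pathlib import Path

# Expected top-level layout per benchmark; per-sequence folders under Bonn
# and per-scene folders under ScanNet vary, so only fixed subdirs are listed.
EXPECTED = {
    "kitti_depth": ["rgb", "depth"],
    "rgbd_bonn_dataset": [],
    "scannet": [],
}

def missing_dirs(base_dir):
    """Return the expected directories that are absent under base_dir."""
    base = Path(base_dir)
    required = [Path(name) for name in EXPECTED]
    required += [Path(name) / sub for name, subs in EXPECTED.items() for sub in subs]
    return [str(p) for p in required if not (base / p).is_dir()]
```

An empty return value means the layout matches; otherwise the listed paths need to be created or re-downloaded.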
1.3. Set $VIDEO_BASE_DATA_DIR in the script to your dataset root, then run the video inference script
bash infer_bash/video.sh
2.1. Download the evaluation datasets (depth) using the following commands (following Marigold).
2.2. Set $IMAGE_BASE_DATA_DIR in the script, then run the image inference script
bash infer_bash/image.sh
Please refer to this document for details on training.
We sincerely thank the authors of Depth Anything and RollingDepth for providing their implementation details. We also thank the contributors of DiffSynth, from which we borrow code.
If you find our work useful in your research, please consider citing:
@article{zhang2026dvd,
title={DVD: Deterministic Video Depth Estimation with Generative Priors},
author={Zhang, Hongfei and Chen, Harold Haodong and Liao, Chenfei and He, Jing and Zhang, Zixin and Li, Haodong and Liang, Yihao and Chen, Kanghao and Ren, Bin and Zheng, Xu and Yang, Shuai and Zhou, Kun and Li, Yinchuan and Sebe, Nicu and Chen, Ying-Cong},
journal={arXiv preprint arXiv:2603.12250},
year={2026}
}