Hongfei Zhang1*, Harold H. Chen1,2*, Chenfei Liao1*, Jing He1*, Zixin Zhang1, Haodong Li3, Yihao Liang4,
Kanghao Chen1, Bin Ren5, Xu Zheng1, Shuai Yang1, Kun Zhou6, Yinchuan Li7, Nicu Sebe8,
Ying-Cong Chen1,2†
*Equal Contribution; †Corresponding Author
1HKUST(GZ), 2HKUST, 3UCSD, 4Princeton University, 5MBZUAI, 6SZU, 7Knowin, 8UniTrento
Welcome to the official repository for DVD: Deterministic Video Depth!
While recent generative foundation models have shown remarkable potential in zero-shot depth estimation, they inherently suffer from the ambiguity-hallucination dilemma due to their stochastic sampling process. DVD fundamentally shifts this paradigm. We present the first deterministic framework that elegantly adapts pre-trained Video Diffusion Models (like WanV2.1) into single-pass depth regressors.
By cleanly stripping away generative stochasticity, DVD unites the semantic richness of generative foundation models with the structural stability of discriminative regressors.
- 🚀 Extreme Data Efficiency: DVD effectively unlocks profound generative priors using only 367K frames—which is 163× less task-specific training data than leading discriminative baselines like VDA (60M frames).
- ⏱️ Deterministic & Fast: Bypasses iterative ODE integration. Inference is performed in a single forward pass, ensuring absolute temporal stability without generative hallucinations.
- 📐 Unparalleled Structural Fidelity: Powered by our Latent Manifold Rectification (LMR), DVD achieves state-of-the-art high-frequency boundary precision (Boundary Recall & F1) compared to overly smoothed baselines.
- 🎥 Long-Video Inference: Equipped with our training-free Global Affine Coherence module, DVD seamlessly stitches sliding windows to support long-video rollouts with negligible scale drift.
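Boundary Recall/F1 measures how well depth discontinuities are preserved. As a rough illustration only (not the benchmark's exact definition, which typically matches edges within a pixel tolerance), edge precision/recall on thresholded relative depth differences can be computed like this; all names here are ours:

```python
import numpy as np

def depth_edges(depth, thresh=0.1):
    """Boolean edge map: relative depth change above thresh between neighbors."""
    d = depth
    dx = np.abs(np.diff(d, axis=1)) / np.minimum(d[:, :-1], d[:, 1:])
    dy = np.abs(np.diff(d, axis=0)) / np.minimum(d[:-1], d[1:])
    e = np.zeros(d.shape, dtype=bool)
    e[:, :-1] |= dx > thresh
    e[:-1, :] |= dy > thresh
    return e

def boundary_f1(pred, gt, thresh=0.1):
    """F1 between predicted and ground-truth depth edge maps (no tolerance)."""
    ep, eg = depth_edges(pred, thresh), depth_edges(gt, thresh)
    inter = (ep & eg).sum()
    prec = inter / max(ep.sum(), 1)
    rec = inter / max(eg.sum(), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)
```

An over-smoothed prediction erases edges, so its recall (and hence F1) drops even when per-pixel error is low.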
TL;DR: If you want state-of-the-art video depth estimation that is highly detailed, temporally stable across long videos, and exceptionally data-efficient, DVD is what you need.
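To make the sliding-window idea concrete, here is a minimal sketch of affine (scale + shift) alignment between overlapping depth windows. The function names and the simple left-to-right chaining are ours for illustration, not the repo's actual Global Affine Coherence implementation:

```python
import numpy as np

def fit_scale_shift(pred, ref):
    """Least-squares scale s and shift t such that s * pred + t ≈ ref."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
    return s, t

def stitch_windows(windows, overlap):
    """Align each window to the previous one on their overlapping frames."""
    out = [windows[0]]
    for w in windows[1:]:
        s, t = fit_scale_shift(w[:overlap], out[-1][-overlap:])
        out.append(s * w + t)
    # Keep the first window, then only the non-overlapping tail of each successor
    return np.concatenate([out[0]] + [w[overlap:] for w in out[1:]])
```

Because each window is corrected onto the previous one's affine frame, per-window scale/shift ambiguity does not accumulate into visible drift.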
- [2026.03.13] 📄 Paper is available on arXiv.
- [2026.03.12] 🌐 Project page is live.
- [2026.03.11] 🤗 Pre-trained weights released on Hugging Face.
- [2026.03.10] 🔥 Repository initialized and training & inference code released.
To help you navigate the codebase quickly, we have divided the core directories into two main categories based on what you want to do: Inference (just using the model) or Training (fine-tuning or training from scratch).
If you just want to generate depth maps from your own videos or reproduce our paper's results, focus on these folders:
- `infer_bash/` (The Launchpad): Ready-to-use shell scripts (e.g., `openworld.sh`). This is the easiest way to run the model on your data without writing any code.
- `ckpt/` (The Vault): Place the pre-trained model weights downloaded from Hugging Face here.
- `inference_results/` (The Output Bin): Generated depth maps and visualizer videos appear here after you run an inference script.
- `demo/`: Quick-start examples and sample inputs to help you verify that your environment is set up correctly.
If you want to train DVD on your own datasets or modify the architecture, these are your go-to folders:
- `train_config/` (The Control Center): YAML configuration files. Tweak hyperparameters (e.g., learning rate, batch size) and dataset paths here.
- `train_script/` (The Engine): Contains the training launch scripts.
- `diffsynth/pipelines/wan_video_new_determine.py` (The Brain): The core DVD model architecture. Look here to understand or modify how the generative stochasticity is stripped away to build the deterministic forward pass.
- `infer_bash/` & `test_script/` (The Evaluator): Scripts for evaluating newly trained checkpoints against standard benchmarks during or after training.
- `examples/dataset/`: Code for dataset construction.
git clone https://github.com/EnVision-Research/DVD.git
cd DVD
conda create -n dvd python=3.10 -y
conda activate dvd
pip install -e .
pip install sageattention # DO NOT USE THIS FOR TRAINING!!!
huggingface-cli login # Or hf auth login
2. Download the checkpoint from the Hugging Face repo
huggingface-cli download FayeHongfeiZhang/DVD --revision main --local-dir ckpt
DVD
├── ckpt/
├──── model_config.yaml
├──── model.safetensors
├── configs/
├── examples/
├── ...
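After downloading, you can peek at the checkpoint without loading any deep-learning framework: the safetensors format stores an 8-byte little-endian header length followed by a JSON header describing every tensor. A stdlib-only reader (an illustrative helper of ours, not part of the repo):

```python
import json
import struct

def safetensors_header(path):
    """Return {tensor_name: shape} from a .safetensors file's JSON header."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))   # header length, little-endian u64
        header = json.loads(f.read(n))
    header.pop("__metadata__", None)            # optional metadata block
    return {name: meta["shape"] for name, meta in header.items()}
```

Running it on `ckpt/model.safetensors` lists tensor names and shapes, which is a quick way to confirm the download is intact.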
💅🏻 Potential Issue (from DiffSynth Studio)
If you encounter issues during installation, it may be caused by the packages we depend on. Please refer to the documentation of the package that caused the problem.
bash infer_bash/openworld.sh
You can also put your own videos in the demo/ directory and update the video path in the script to run on them!
1.1. Download the KITTI Dataset, Bonn Dataset, ScanNet Dataset.
1.2. Organize the datasets into the structures below
kitti_depth
├── rgb
├──── 2011_09_26
├──── ...
├── depth
├──── train
├──── val
rgbd_bonn_dataset
├── rgbd_bonn_balloon
├── rgbd_bonn_balloon_tracking
├── ...
scannet
├── scene0000_00
├── scene0000_01
├── ...
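A quick, framework-free way to sanity-check the layout before launching inference (directory names are taken from the trees above; the helper itself is ours, not part of the repo):

```python
from pathlib import Path

# Expected top-level layout per benchmark; per-sequence folders under Bonn
# and per-scene folders under ScanNet vary, so only fixed subdirs are listed.
EXPECTED = {
    "kitti_depth": ["rgb", "depth"],
    "rgbd_bonn_dataset": [],
    "scannet": [],
}

def missing_dirs(base_dir):
    """Return the expected directories that are absent under base_dir."""
    base = Path(base_dir)
    required = [Path(name) for name in EXPECTED]
    required += [Path(name) / sub for name, subs in EXPECTED.items() for sub in subs]
    return [str(p) for p in required if not (base / p).is_dir()]
```

An empty return value means the layout matches; otherwise the listed paths need to be created or re-downloaded.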
1.3. Set $VIDEO_BASE_DATA_DIR in the script to your dataset root, then run the video inference script
bash infer_bash/video.sh
2.1. Download the evaluation datasets (depth) using the following commands (following Marigold).
2.2. Set $IMAGE_BASE_DATA_DIR in the script, then run the image inference script
bash infer_bash/image.sh
Please refer to this document for details on training.
We sincerely thank the authors of Depth Anything and RollingDepth for providing their implementation details. We also thank the contributors of DiffSynth, from which we borrow code.
If you find our work useful in your research, please consider citing:
@article{zhang2026dvd,
title={DVD: Deterministic Video Depth Estimation with Generative Priors},
author={Zhang, Hongfei and Chen, Harold Haodong and Liao, Chenfei and He, Jing and Zhang, Zixin and Li, Haodong and Liang, Yihao and Chen, Kanghao and Ren, Bin and Zheng, Xu and Yang, Shuai and Zhou, Kun and Li, Yinchuan and Sebe, Nicu and Chen, Ying-Cong},
journal={arXiv preprint arXiv:2603.12250},
year={2026}
}