DVD: Deterministic Video Depth Estimation with Generative Priors

Hongfei Zhang1*, Harold H. Chen1,2*, Chenfei Liao1*, Jing He1*, Zixin Zhang1, Haodong Li3, Yihao Liang4,
Kanghao Chen1, Bin Ren5, Xu Zheng1, Shuai Yang1, Kun Zhou6, Yinchuan Li7, Nicu Sebe8,
Ying-Cong Chen1,2†


*Equal Contribution; †Corresponding Author
1HKUST(GZ), 2HKUST, 3UCSD, 4Princeton University, 5MBZUAI, 6SZU, 7Knowin, 8UniTrento

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

Project Page | Paper | Model

[Framework overview figure]

👋 Introduction

Welcome to the official repository for DVD: Deterministic Video Depth!

While recent generative foundation models have shown remarkable potential in zero-shot depth estimation, they inherently suffer from the ambiguity-hallucination dilemma due to their stochastic sampling process. DVD fundamentally shifts this paradigm. We present the first deterministic framework that elegantly adapts pre-trained Video Diffusion Models (like WanV2.1) into single-pass depth regressors.

By cleanly stripping away generative stochasticity, DVD unites the semantic richness of generative foundation models with the structural stability of discriminative regressors.

✨ Key Highlights

  • 🚀 Extreme Data Efficiency: DVD effectively unlocks profound generative priors using only 367K frames, 163× less task-specific training data than leading discriminative baselines such as VDA (60M frames).
  • ⏱️ Deterministic & Fast: Bypasses iterative ODE integration. Inference is performed in a single forward pass, ensuring absolute temporal stability without generative hallucinations.
  • 📐 Unparalleled Structural Fidelity: Powered by our Latent Manifold Rectification (LMR), DVD achieves state-of-the-art high-frequency boundary precision (Boundary Recall & F1) compared to overly smoothed baselines.
  • 🎥 Long-Video Inference: Equipped with our training-free Global Affine Coherence module, DVD seamlessly stitches sliding windows to support long-video rollouts with negligible scale drift.

TL;DR: If you want state-of-the-art video depth estimation that is highly detailed, temporally stable across long videos, and exceptionally data-efficient, DVD is what you need.


📢 News

  • [2026.03.13] 📄 Paper is available on arXiv.
  • [2026.03.12] 🌐 Project page is live.
  • [2026.03.11] 🤗 Pre-trained weights released on Hugging Face.
  • [2026.03.10] 🔥 Repository initialized and training & inference code released.

📂 Core Folders & Files Overview

To help you navigate the codebase quickly, we have divided the core directories into two main categories based on what you want to do: Inference (just using the model) or Training (fine-tuning or training from scratch).

🎬 For Inference (Testing & Using the Model)

If you just want to generate depth maps from your own videos or reproduce our paper's results, focus on these folders:

  • infer_bash/ (The Launchpad): Ready-to-use shell scripts (e.g., openworld.sh). This is the easiest way to run the model on your data without writing any code.
  • ckpt/ (The Vault): This is where you should place our pre-trained model weights downloaded from Hugging Face.
  • inference_results/ (The Output Bin): Once you run an inference script, your generated depth maps and visualizer videos will appear here.
  • demo/ : Quick-start examples and sample inputs to help you verify that your environment is set up correctly.

🏋️‍♂️ For Training (Fine-tuning & Development)

If you want to train DVD on your own datasets or modify the architecture, these are your go-to folders:

  • train_config/ (The Control Center): YAML configuration files. You can easily tweak hyperparameters (like learning rate, batch size) and dataset paths here.
  • train_script/ (The Engine): Contains the training bash scripts.
  • diffsynth/pipelines/wan_video_new_determine.py (The Brain): The core DVD model architecture. If you want to understand or modify how we stripped away the generative noise to build the deterministic forward pass, look here.
  • infer_bash/ & test_script/ (The Evaluator): Scripts used to evaluate your newly trained checkpoints against standard benchmarks during or after training.
  • examples/dataset/: Dataset construction code.

🛠️ Installation

📦 Install from source (basic dependencies):

git clone https://github.com/EnVision-Research/DVD.git
cd DVD
conda create -n dvd python=3.10 -y 
conda activate dvd 
pip install -e .

🏃 Install SageAttention (For Speedup):

pip install sageattention # DO NOT USE THIS FOR TRAINING!!!

🤗 Download the checkpoint from Hugging Face

1. Log in to Hugging Face

huggingface-cli login # Or hf auth login 

2. Download the checkpoint from the Hugging Face repo

huggingface-cli download FayeHongfeiZhang/DVD --revision main --local-dir ckpt

3. The final directory structure should look like this:

DVD
├── ckpt/
│   ├── model_config.yaml
│   └── model.safetensors
├── configs/
├── examples/
└── ...
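After downloading, you can sanity-check the layout with a small helper (a sketch; the expected file names are taken from the tree above and may change in future releases):

```shell
# Sketch: verify that the checkpoint files from the expected layout exist.
check_ckpt() {
  local root="${1:-.}"   # repo root; defaults to the current directory
  local f
  for f in model_config.yaml model.safetensors; do
    if [ ! -f "$root/ckpt/$f" ]; then
      echo "missing: ckpt/$f"
      return 1
    fi
  done
  echo "ckpt layout OK"
}
```

Run `check_ckpt .` from the repository root; it prints the first missing file if the download is incomplete.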

💅🏻 Potential Issue (from DiffSynth Studio)

If you encounter issues during installation, they are most likely caused by one of the packages we depend on. Please refer to the documentation of the package that caused the problem.


🕹️ Inference

🤹🏼‍♂️ Quick Start with Demo Videos

bash infer_bash/openworld.sh

You may also put more videos in the demo/ directory and adjust the video path in the script to get more results!
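If you have many clips, a small wrapper can enumerate them (a sketch; the VIDEO_PATH variable name is an assumption — inspect infer_bash/openworld.sh for the variable it actually reads before uncommenting the call):

```shell
# Sketch: run the demo script once per video in a directory.
# NOTE: VIDEO_PATH is a hypothetical variable name, not confirmed
# against infer_bash/openworld.sh.
run_all_demos() {
  local dir="${1:-demo}"
  local v
  for v in "$dir"/*.mp4; do
    [ -e "$v" ] || continue          # skip if no .mp4 files match
    echo "processing $v"
    # VIDEO_PATH="$v" bash infer_bash/openworld.sh
  done
}
```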


👩🏼‍🏫 For Academic Purpose (Paper Setting)

1. Video Inference

1.1. Download the KITTI Dataset, Bonn Dataset, and ScanNet Dataset.

1.2. Format the datasets according to the structures below:

kitti_depth
├── rgb
│   ├── 2011_09_26
│   └── ...
└── depth
    ├── train
    └── val

rgbd_bonn_dataset
├── rgbd_bonn_balloon
├── rgbd_bonn_balloon_tracking
└── ...

scannet
├── scene0000_00
├── scene0000_01
└── ...
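A quick way to catch layout mistakes before launching inference is a small checker (a sketch; the subdirectory names mirror the trees above, and the paths in the example calls are placeholders for wherever you extracted the data):

```shell
# Sketch: check that a dataset root contains the expected subdirectories.
check_dataset_root() {
  local root="$1"; shift
  local d
  for d in "$@"; do
    if [ ! -d "$root/$d" ]; then
      echo "missing: $root/$d"
      return 1
    fi
  done
  echo "layout OK: $root"
}

# Example calls (placeholder paths):
# check_dataset_root kitti_depth rgb depth
# check_dataset_root kitti_depth/depth train val
# check_dataset_root rgbd_bonn_dataset rgbd_bonn_balloon
```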

1.3. Set $VIDEO_BASE_DATA_DIR in the script and run the video inference script:

bash infer_bash/video.sh

2. Image Inference

2.1. Download the evaluation (depth) datasets, following Marigold.

2.2. Set $IMAGE_BASE_DATA_DIR in the script and run the image inference script:

bash infer_bash/image.sh

🔥 Training

Please refer to this document for details on training.

👏 Acknowledgement

We sincerely thank the authors of Depth Anything and RollingDepth for sharing their implementation details. We also thank the contributors of DiffSynth, from which we borrow code.

👾 Reference

If you find our work useful in your research, please consider citing:

@article{zhang2026dvd,
  title={DVD: Deterministic Video Depth Estimation with Generative Priors},
  author={Zhang, Hongfei and Chen, Harold Haodong and Liao, Chenfei and He, Jing and Zhang, Zixin and Li, Haodong and Liang, Yihao and Chen, Kanghao and Ren, Bin and Zheng, Xu and Yang, Shuai and Zhou, Kun and Li, Yinchuan and Sebe, Nicu and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2603.12250},
  year={2026}
}
