SparseVideoNav Logo

SparseVideoNav: Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

Hai Zhang*, Siqi Liang*, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li

The University of Hong Kong

Project Page | Repo | arXiv | License

📖 Introduction

SparseVideoNav introduces video generation models to real-world beyond-the-view vision-language navigation for the first time. It achieves sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon, yielding a remarkable 27× speed-up over the unoptimized counterpart. Real-world zero-shot experiments show a 2.5× higher success rate than state-of-the-art LLM baselines and mark the first realization of the task in challenging night scenes.
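Since the code is not yet released, the following is only a rough illustrative sketch of the inference flow just described, assuming the two-stage design implied by the TODO list below (a distilled sparse video generator followed by a continuous action head). Every module name, signature, and number here is a hypothetical placeholder, not the actual API.

import numpy as np

# Hypothetical sketch only -- all names, signatures, and numbers are
# placeholders, not the released SparseVideoNav interfaces.

def sample_sparse_timestamps(horizon_s=20.0, num_keyframes=8):
    # Pick a handful of future timestamps spanning the full horizon,
    # rather than generating every frame as in continuous generation.
    return np.linspace(horizon_s / num_keyframes, horizon_s, num_keyframes)

def navigate_step(video_model, action_head, obs_rgb, instruction):
    # One planning step: (1) generate a sparse future over a 20 s
    # horizon, (2) decode a continuous trajectory guided by it.
    times = sample_sparse_timestamps()            # e.g. 8 keyframes
    keyframes = video_model.generate(             # placeholder call
        image=obs_rgb, text=instruction, timestamps=times)
    return action_head.decode(obs_rgb, keyframes) # waypoints for the robot

The intuition behind the speed-up: covering the same 20-second horizon with a few keyframes instead of a dense frame sequence cuts generation cost enough to make sub-second trajectory inference possible.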

Developers: Hai Zhang and Siqi Liang

📢 News

Important

🌟 Stay up to date at opendrivelab.com!

📬 Contact

For further inquiries or assistance, please contact zhanghenryhai12138@gmail.com or liangsiqi@connect.hku.hk.

📌 Table of Contents

  • 🔥 Highlights
  • 📝 TODO List
  • 📄 License and Citation

🔥 Highlights

  • We investigate beyond-the-view navigation tasks in the real world, introducing video generation models to this field for the first time.
  • We pioneer a paradigm shift from continuous to sparse video generation for a longer prediction horizon.
  • We achieve sub-second trajectory inference guided by a generated sparse future spanning a 20-second horizon, yielding a remarkable 27× speed-up over the unoptimized counterpart.
  • We achieve the first realization of beyond-the-view navigation in challenging night scenes, with a 17.5% success rate.

📝 TODO List

  • SparseVideoNav Paper Release.
    • arXiv preprint is now available!
  • SparseVideoNav Code Release.
    • Inference code and model checkpoint of the distilled video generation model (estimated March 2026).
    • Inference code and model checkpoint of the continuous action head (estimated Q3 2026).
  • SparseVideoNav Dataset Release.
    • ~140 h of real-world VLN data (estimated Q3 2026).

📄 License and Citation

All the data and code within this repo are under CC BY-NC-SA 4.0.

  • Please consider citing our work if it helps your research.
@article{zhang2026sparse,
  title={Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation},
  author={Zhang, Hai and Liang, Siqi and Chen, Li and Li, Yuxian and Xu, Yukuan and Zhong, Yichao and Zhang, Fu and Li, Hongyang},
  journal={arXiv preprint arXiv:2602.05827},
  year={2026}
}
