[CVPR 2025] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

LunarShen/DsicoVLA

The official implementation of DiscoVLA [Paper].

If you find this project helpful, you might also be interested in our previous work:

[ICLR 2025] TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval [Code] [Paper]

Dataset

Prepare the dataset by following the instructions from CLIP4Clip.

For MSRVTT, the official data and video links can be found in link.

For convenience, the splits and captions are available from the CLIP4Clip release:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

The raw videos are available from Frozen in Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
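
The two downloads above can be wrapped in one helper. This is a minimal sketch, not part of the repository: `DATA_DIR` and `fetch_msrvtt` are hypothetical names, and the extracted layout may need adjusting to match the paths expected by the training script.

```shell
# Sketch: fetch the MSRVTT splits/captions and raw videos into one directory.
# DATA_DIR is an assumption; the repository does not define it.
DATA_DIR="${DATA_DIR:-./data/MSRVTT}"

ANNOTATIONS_URL="https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip"
VIDEOS_URL="https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip"

# The body runs in a subshell, so `set -eu` does not leak into the calling shell.
fetch_msrvtt() (
  set -eu
  mkdir -p "$DATA_DIR"
  # -c resumes a partial download; -P sets the target directory.
  wget -c -P "$DATA_DIR" "$ANNOTATIONS_URL"
  wget -c -P "$DATA_DIR" "$VIDEOS_URL"
  # -n skips files that already exist; -d sets the extraction directory.
  unzip -n "$DATA_DIR/${ANNOTATIONS_URL##*/}" -d "$DATA_DIR"
  unzip -n "$DATA_DIR/${VIDEOS_URL##*/}" -d "$DATA_DIR"
)
```

Calling `fetch_msrvtt` (optionally with `DATA_DIR=/path/to/data` set beforehand) performs the download and extraction; rerunning it resumes interrupted downloads instead of starting over.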

Train

Please download the Pseudo Image Captions of MSRVTT from Baidu Cloud or Hugging Face. For more details, please refer to our paper.

We conduct the MSRVTT experiments on four A100 (40 GB) GPUs. To set up the environment and run the experiments, execute the following commands:

bash scripts/create_env.sh
bash scripts/MSRVTT.sh
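
Since the reported setup uses four GPUs, it can help to check how many are visible before launching. The following is a hedged sketch: `EXPECTED_GPUS`, `count_gpus`, and `run_discovla` are hypothetical names not found in the repository, and whether the scripts can run on fewer GPUs is not stated in the source.

```shell
# Sketch: confirm enough GPUs are visible before starting a run.
# EXPECTED_GPUS is an assumption; 4 matches the setup reported for MSRVTT.
EXPECTED_GPUS="${EXPECTED_GPUS:-4}"

# Count devices listed by `nvidia-smi -L` (one "GPU n: ..." line per device).
# Prints 0 when nvidia-smi is missing or reports no devices.
count_gpus() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L | grep -c '^GPU '
  else
    echo 0
  fi
}

run_discovla() {
  found="$(count_gpus)"
  if [ "$found" -lt "$EXPECTED_GPUS" ]; then
    echo "Found $found GPU(s), expected $EXPECTED_GPUS; the scripts may need adjusting." >&2
    return 1
  fi
  # Same two commands as above, run in order; stop if environment setup fails.
  bash scripts/create_env.sh && bash scripts/MSRVTT.sh
}
```

Running `run_discovla` from the repository root starts training only when the GPU check passes; set `EXPECTED_GPUS` to a different value to relax or tighten the check.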

Acknowledgement

This project builds upon the open-source work DRL.

Citation

@inproceedings{shen2025discovla,
  title={DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval},
  author={Shen, Leqi and Gong, Guoqiang and Hao, Tianxiang and He, Tao and Zhang, Yifeng and Liu, Pengzhang and Zhao, Sicheng and Han, Jungong and Ding, Guiguang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={19702--19712},
  year={2025}
}
