[CVPR 2025] DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval
The official implementation of DiscoVLA [Paper].
If you find this project helpful, you might also be interested in our previous work:
[ICLR 2025] TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval [Code] [Paper]
Prepare the dataset by following the instructions from CLIP4Clip.
For MSRVTT, the official data and video links can be found in link.
For convenience, the splits and captions are shared by CLIP4Clip:
wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip
The raw videos are shared by Frozen in Time:
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
Please download the Pseudo Image Captions of MSRVTT from Baidu Cloud or Hugging Face. For more details, please refer to our paper.
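The two MSRVTT downloads can also be wrapped in a small helper script. This is only a minimal sketch: the `DATA_DIR` location and the function names `fetch_and_unzip` and `download_msrvtt` are assumptions for illustration, not part of this repository, and the training scripts may expect a different directory layout. The Pseudo Image Captions are not covered here and still need to be downloaded manually.

```shell
#!/usr/bin/env bash
# Hypothetical target directory (an assumption, not defined by this repository).
DATA_DIR="${DATA_DIR:-./data/MSRVTT}"

# Archive URLs as given in the instructions above.
SPLITS_URL="https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip"
VIDEOS_URL="https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip"

# Download one archive into DATA_DIR (resuming a partial download if present)
# and unpack it there without overwriting files that already exist.
fetch_and_unzip() {
  local url="$1"
  local archive
  archive="${DATA_DIR}/$(basename "$url")"
  mkdir -p "$DATA_DIR"
  wget -c -P "$DATA_DIR" "$url" && unzip -q -n "$archive" -d "$DATA_DIR"
}

# Fetch the splits/captions first, then the raw videos.
download_msrvtt() {
  fetch_and_unzip "$SPLITS_URL" && fetch_and_unzip "$VIDEOS_URL"
}
```

After sourcing the script, running `download_msrvtt` would fetch both archives. Setting `DATA_DIR` beforehand redirects the downloads to another location.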
We conduct the MSRVTT experiments on 4 A100 (40 GB) GPUs. To set up the environment and run the experiments, execute the following commands:
bash scripts/create_env.sh
bash scripts/MSRVTT.sh
This project builds upon the following open-source work: DRL.
@inproceedings{shen2025discovla,
title={DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval},
author={Shen, Leqi and Gong, Guoqiang and Hao, Tianxiang and He, Tao and Zhang, Yifeng and Liu, Pengzhang and Zhao, Sicheng and Han, Jungong and Ding, Guiguang},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={19702--19712},
year={2025}
}