To promote the development of Vision-Language Pre-training (VLP) and multimodal Large Language Model (LLM) in the Chinese community, we firstly release the largest public Chinese high-quality video-language dataset named Youku-mPLUG, which is collected from Youku, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training. In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can enable researchers to conduct more in-depth multimodal research and develop better applications in the future. Furthermore, we release popular video-language pre-training models, ALPRO and mPLUG-2, and our proposed modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1% improvement in video category classification. Besides, mPLUG-video achieves a new state-of-the-art result on these benchmarks with 80.5% top-1 accuracy in video category classification and 68.9 CIDEr score in video captioning, respectively. Finally, we scale up mPLUG-video based on the frozen Bloomz with only 1.7% trainable parameters as Chinese multimodal LLM, and demonstrate impressive instruction and video understanding ability. The zero-shot instruction understanding experiment indicates that pretraining with Youku-mPLUG can enhance the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge.
AkihikoWatanabe
Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks, Haiyang Xu+, N/A, arXiv'23
Jun 16, 2023
URL
Affiliations
Abstract
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)