Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks, Haiyang Xu+, N/A, arXiv'23 #713

AkihikoWatanabe · 2023-06-16T12:13:08Z

URL

https://arxiv.org/abs/2306.04362

Affiliations

Haiyang Xu, N/A
Qinghao Ye, N/A
Xuan Wu, N/A
Ming Yan, N/A
Yuan Miao, N/A
Jiabo Ye, N/A
Guohai Xu, N/A
Anwen Hu, N/A
Yaya Shi, N/A
Guangwei Xu, N/A
Chenliang Li, N/A
Qi Qian, N/A
Maofei Que, N/A
Ji Zhang, N/A
Xiao Zeng, N/A
Fei Huang, N/A

Abstract

To promote the development of Vision-Language Pre-training (VLP) and multimodal Large Language Model (LLM) in the Chinese community, we firstly release the largest public Chinese high-quality video-language dataset named Youku-mPLUG, which is collected from Youku, a well-known Chinese video-sharing website, with strict criteria of safety, diversity, and quality. Youku-mPLUG contains 10 million Chinese video-text pairs filtered from 400 million raw videos across a wide range of 45 diverse categories for large-scale pre-training. In addition, to facilitate a comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks covering three popular video-language tasks of cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can enable researchers to conduct more in-depth multimodal research and develop better applications in the future. Furthermore, we release popular video-language pre-training models, ALPRO and mPLUG-2, and our proposed modularized decoder-only model mPLUG-video pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG gain up to 23.1% improvement in video category classification. Besides, mPLUG-video achieves a new state-of-the-art result on these benchmarks with 80.5% top-1 accuracy in video category classification and 68.9 CIDEr score in video captioning, respectively. Finally, we scale up mPLUG-video based on the frozen Bloomz with only 1.7% trainable parameters as Chinese multimodal LLM, and demonstrate impressive instruction and video understanding ability. The zero-shot instruction understanding experiment indicates that pretraining with Youku-mPLUG can enhance the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge.
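As a reading aid, here is a minimal sketch of two quantities the abstract reports: top-1 accuracy for category classification, and the share of raw videos kept after filtering. The function names and the toy labels are hypothetical and are not taken from the paper or its released code. Only the 10 million and 400 million figures come from the abstract.

```python
def top1_accuracy(predicted, gold):
    """Fraction of examples whose single top-ranked prediction equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def retention_rate(kept, raw):
    """Share of raw items that survive filtering."""
    return kept / raw


# Toy, made-up category labels (assumption: not real benchmark data).
preds = ["sports", "food", "music", "travel"]
gold = ["sports", "food", "games", "travel"]

print(top1_accuracy(preds, gold))               # 0.75
print(retention_rate(10_000_000, 400_000_000))  # 0.025
```

By this arithmetic, the 10 million released pairs correspond to 2.5% of the 400 million raw videos.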

Translation (by gpt-3.5-turbo)

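To promote the development of Vision-Language Pre-training (VLP) and multimodal large language models (LLMs) in the Chinese community, we first release Youku-mPLUG, the largest public high-quality Chinese video-language dataset. Youku-mPLUG is collected from Youku, a well-known Chinese video-sharing website, under strict criteria for safety, diversity, and quality. It contains 10 million Chinese video-text pairs filtered from 400 million raw videos and can be used for large-scale pre-training. In addition, to support comprehensive evaluation of video-language models, we carefully build the largest human-annotated Chinese benchmarks, covering three popular video-language tasks: cross-modal retrieval, video captioning, and video category classification. Youku-mPLUG can help researchers conduct deeper multimodal research and develop better applications in the future. We also release the popular video-language pre-training models ALPRO and mPLUG-2, together with mPLUG-video, a modularized decoder-only model pre-trained on Youku-mPLUG. Experiments show that models pre-trained on Youku-mPLUG improve by up to 23.1% in video category classification. Moreover, mPLUG-video achieves new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in video category classification and a CIDEr score of 68.9 in video captioning. Finally, we scale up mPLUG-video on top of a frozen Bloomz with only 1.7% trainable parameters as a Chinese multimodal LLM, and it shows impressive instruction-following and video understanding ability. The zero-shot instruction understanding experiment indicates that pre-training with Youku-mPLUG improves the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge.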

Summary (by gpt-3.5-turbo)

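To promote the development of Vision-Language Pre-training (VLP) and multimodal large language models (LLMs) in the Chinese community, the authors release Youku-mPLUG, the largest public high-quality Chinese video-language dataset. The dataset can be used for large-scale pre-training, and the authors also carefully build the largest human-annotated Chinese benchmarks, covering three popular video-language tasks: cross-modal retrieval, video captioning, and video category classification. Models pre-trained on Youku-mPLUG improve by up to 23.1% in video category classification, and mPLUG-video achieves new state-of-the-art results on these benchmarks, with 80.5% top-1 accuracy in video category classification and a CIDEr score of 68.9 in video captioning. A zero-shot instruction understanding experiment also indicates that pre-training with Youku-mPLUG improves the ability to comprehend overall and detailed visual semantics, recognize scene text, and leverage open-domain knowledge.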

AkihikoWatanabe added the Pocket label Jun 16, 2023

AkihikoWatanabe changed the title あ Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks, Haiyang Xu+, N/A, arXiv'23 Jun 16, 2023
