Efficient End-to-End Video-Question Answering with Pyramidal Multimodal Transformer - AAAI23
This is the PyTorch implementation of our paper "Efficient End-to-End Video-Question Answering with Pyramidal Multimodal Transformer" (accepted by AAAI’23).
Download the dataset
MSVD-QA: link
MSRVTT-QA: link
TGIF-QA: link
ActivityNet-QA: link
Youtube2Text-QA: please refer to link
For the text-to-video retrieval task in our ablation study, please refer to link
GloVe Word Embeddings and Video Frame Extraction
- To extract GloVe embeddings for the questions and answers, please refer here.
  Taking the Action task of the TGIF-QA dataset as an example, the question features are stored under /inputdata:
  TGIF/word/Action/TGIF_Action_train_questions.pt
  TGIF/word/Action/TGIF_Action_test_questions.pt
  TGIF/word/Action/TGIF_Action_vocab.json
- To extract video frames, we use the skvideo.io module to extract the images and then convert them to .npz format. Taking the Action task of the TGIF-QA dataset as an example, the .npz files are stored under /inputdata:
  TGIF/video/Action/tumblr_no00ddSlG31t34v14o1_250.npz
  TGIF/video/Action/tumblr_nd24xaX8d11qkb1azo1_250.npz
  ...
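The frame-extraction step described above could look roughly like the sketch below. It is only an illustration, not the script used for the paper: the function names, the `frames` key inside the archive, and the optional `num_frames` subsampling are all assumptions made for this example. Only `skvideo.io.vread` and the NumPy calls are real library functions.

```python
import os
import numpy as np


def sample_frames(frames, num_frames=None):
    """Optionally keep `num_frames` evenly spaced frames from a (T, H, W, C) array.

    The subsampling policy is an assumption of this sketch.
    """
    if num_frames is None or num_frames >= len(frames):
        return frames
    idx = np.linspace(0, len(frames) - 1, num_frames).round().astype(int)
    return frames[idx]


def save_frames_npz(frames, out_path, key="frames"):
    """Write the frame array to a compressed .npz archive under `key` (hypothetical key name)."""
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)
    np.savez_compressed(out_path, **{key: frames})


def video_to_npz(video_path, out_dir, num_frames=None):
    """Decode one video with skvideo.io and store its frames as <video name>.npz."""
    import skvideo.io  # imported lazily; needs scikit-video and FFmpeg installed

    frames = skvideo.io.vread(video_path)  # array of shape (T, H, W, C)
    name = os.path.splitext(os.path.basename(video_path))[0]
    out_path = os.path.join(out_dir, name + ".npz")
    save_frames_npz(sample_frames(frames, num_frames), out_path)
    return out_path
```

Under these assumptions, calling `video_to_npz` on each GIF or video of a task with `out_dir` set to, for example, `/inputdata/TGIF/video/Action` would produce one `.npz` file per clip, matching the layout listed above. The frames can be read back with `np.load(path)["frames"]`.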
@article{peng2022PMT,
title={Efficient End-to-End Video-Question Answering with Pyramidal Multimodal Transformer},
author={Peng, Min and Wang, Chongyang and Shi, Yu and Zhou, Xiang-Dong},
journal={Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI)},
year={2023}}