
Lightweight Recurrent Cross-modal Encoder

(Figure: overall architecture of LRCE)

Setup and Configurations

Environment

Install all the dependencies using a conda environment by running:

conda env create -f env.yaml
conda activate lrce
pip install 'git+https://github.com/katsura-jp/pytorch-cosine-annealing-with-warmup'
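
After activating the environment, a quick sanity check can confirm that PyTorch and the scheduler package installed above import cleanly (a minimal sketch, not part of the project code):

# Sanity check: verify PyTorch and the warm-restart scheduler package import cleanly.
import torch
from cosine_annealing_warmup import CosineAnnealingWarmupRestarts  # installed via pip above

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")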

Dataset

MSVD-QA

Download the annotations and videos. Extract them into a single directory and place all the videos under a folder named video. Download idx-video-mapping.pkl and place it in the same directory. The dataset directory should look as follows:

MSVD-QA
├── idx-video-mapping.pkl
├── readme.txt
├── test_qa.json
├── train_qa.json
├── val_qa.json
└── video
    ├── 00jrXRMlZOY_0_10.avi
    ├── 02Z-kuB3IaM_2_13.avi
    ...
    └── zzit5b_-ukg_5_20.avi
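
Before training, a short script like the following can verify the layout (a minimal sketch; the file names match the tree above, and idx-video-mapping.pkl is assumed to be a plain pickle):

import os
import pickle

root = "MSVD-QA"  # adjust to your dataset path
assert os.path.isdir(os.path.join(root, "video")), "missing video/ folder"
for name in ["idx-video-mapping.pkl", "train_qa.json", "val_qa.json", "test_qa.json"]:
    assert os.path.isfile(os.path.join(root, name)), f"missing {name}"

with open(os.path.join(root, "idx-video-mapping.pkl"), "rb") as fh:
    mapping = pickle.load(fh)
print(len(mapping), "index-to-video entries")  # assumes the pickle holds a dict or list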

MSRVTT-QA

Download the annotations and videos. Extract them into a single directory and place all the videos under a folder named video. Download idx-video-mapping.pkl and place it in the same directory. The dataset directory should look as follows:

MSRVTT-QA
├── category.txt
├── idx-video-mapping.pkl
├── readme.txt
├── test_qa.json
├── train_qa.json
├── val_qa.json
└── video
    ├── video0.mp4
    ├── video1000.mp4
    ...
    └── video9.mp4
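
The *_qa.json annotation files can be inspected directly; a minimal sketch, assuming the standard MSRVTT-QA format (a JSON list of question/answer records):

import json
import os

root = "MSRVTT-QA"  # adjust to your dataset path
with open(os.path.join(root, "train_qa.json")) as fh:
    qa = json.load(fh)
print(len(qa), "training QA pairs")
print(qa[0])  # one record, e.g. question, answer, and video id fields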

TGIF-QA

Download the annotations and GIFs from the official repo. Combine all the files into a single directory and restructure it as follows:

TGIF-QA
├── annotations
│   ├── README.md
│   ├── Test_action_question.csv
│   ├── Test_count_question.csv
│   ├── Test_frameqa_question.csv
│   ├── Test_transition_question.csv
│   ├── Total_action_question.csv
│   ├── Total_count_question.csv
│   ├── Total_frameqa_question.csv
│   ├── Total_transition_question.csv
│   ├── Train_action_question.csv
│   ├── Train_count_question.csv
│   ├── Train_frameqa_question.csv
│   └── Train_transition_question.csv
└── gifs
    ├── tumblr_ku4lzkM5fg1qa47qco1_250.gif
    ├── tumblr_ky2syrOMmW1qawjc8o1_250.gif
    ...
    └── tumblr_nrlo5nKKip1uz642so1_400.mp4
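
Note that the annotation files from the official repo are tab-separated despite the .csv extension; a minimal pandas sketch for loading one split:

import pandas as pd

# TGIF-QA annotation files use tabs as delimiters despite the .csv extension.
df = pd.read_csv("TGIF-QA/annotations/Train_frameqa_question.csv", sep="\t")
print(df.columns.tolist())  # inspect the column names (gif_name, question, answer, ...)
print(len(df), "training questions")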

GPU

By default, this code utilizes all of the GPUs on your machine. To use only some of them, set the CUDA_VISIBLE_DEVICES environment variable. For example, to use only the first GPU, type:

export CUDA_VISIBLE_DEVICES=0
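
You can confirm which devices a process actually sees from Python (a minimal check, separate from the training script):

import os
import torch

# With CUDA_VISIBLE_DEVICES=0, only one device should be reported.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Visible GPUs:", torch.cuda.device_count())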

Feature Extractor

Download the pre-trained Video Swin Transformer here. Then, place it under the pretrained_models directory of this project.
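
To make sure the download is intact, the checkpoint can be loaded standalone (a minimal sketch; the file name swin.pth is a placeholder for whatever the download provides):

import torch

# Placeholder file name -- substitute the actual checkpoint you downloaded.
ckpt = torch.load("pretrained_models/swin.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print("checkpoint keys:", list(ckpt.keys())[:5])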

Performance

(Figure: performance results)

Training

To see all the available training arguments and their explanations, run:

python train.py -h

Below are the arguments we used to reproduce the performance reported in the paper; a sketch of how the scheduler flags map onto the installed cosine scheduler follows the list:

  • MSVD-QA
python train.py --dataset msvd-qa-oe \
--dataset-dir <path/to/dataset> --ckpt-interval 2 --batch-size 10 \
--epoch 8 --drop-out-rate 0.1 --lr 5e-5 --reg-strength 0.001 --num-workers 4 \
--use-cosine-scheduler --lr-restart-epoch 1 --lr-restart-mul 2 \
--lr-decay-factor 0.5 --lr-warm-up 0.1 --min-lr 1e-8 \
--temporal-scale 3 --eval-per-epoch 3
  • MSRVTT-QA
python train.py --dataset msrvtt-qa-oe \
--dataset-dir <path/to/dataset> --ckpt-interval 2 --batch-size 10 \
--epoch 7 --drop-out-rate 0.1 --lr 2e-5 --reg-strength 0.001 --num-workers 4 \
--use-cosine-scheduler --lr-restart-epoch 1 --lr-restart-mul 2 \
--lr-decay-factor 1 --lr-warm-up 0.05 --min-lr 1e-8 \
--temporal-scale 3 --eval-per-epoch 3
  • TGIF-FrameQA
python train.py --dataset tgif-frameqa \
--dataset-dir <path/to/dataset> --ckpt-interval 3 --batch-size 10 \
--epoch 15 --drop-out-rate 0.1 --lr 1e-4 --reg-strength 0.001 --num-workers 4 \
--use-cosine-scheduler --lr-restart-epoch 1 --lr-restart-mul 2 \
--lr-decay-factor 0.5 --lr-warm-up 0.1 --min-lr 1e-8 \
--temporal-scale 3 --eval-per-epoch 3
  • TGIF-Transition
python train.py --dataset tgif-transition \
--dataset-dir <path/to/dataset> --ckpt-interval 3 --batch-size 9 \
--epoch 5 --drop-out-rate 0.1 --lr 2e-5 --reg-strength 0.001 --num-workers 4 \
--use-cosine-scheduler --lr-restart-epoch 1 --lr-restart-mul 2 \
--lr-decay-factor 1 --lr-warm-up 0 --min-lr 1e-8 \
--temporal-scale 3 --eval-per-epoch 3
  • TGIF-Action
python train.py --dataset tgif-action \
--dataset-dir <path/to/dataset> --ckpt-interval 3 --batch-size 16 \
--epoch 10 --drop-out-rate 0.1 --lr 3e-5 --reg-strength 0.001 --num-workers 4 \
--use-cosine-scheduler --lr-restart-epoch 1 --lr-restart-mul 2 \
--lr-decay-factor 1 --lr-warm-up 0.1 --min-lr 1e-8 \
--temporal-scale 3 --eval-per-epoch 3
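
The scheduler flags above correspond to the CosineAnnealingWarmupRestarts class installed in the Environment step; below is a minimal sketch of how we read that mapping, using the MSVD-QA values (the per-epoch step count and the exact wiring inside train.py are assumptions):

import torch
from cosine_annealing_warmup import CosineAnnealingWarmupRestarts

model = torch.nn.Linear(8, 8)                    # stand-in for the real model
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)

steps_per_epoch = 100                            # illustrative; depends on dataset and batch size
scheduler = CosineAnnealingWarmupRestarts(
    opt,
    first_cycle_steps=1 * steps_per_epoch,       # --lr-restart-epoch 1 (in epochs)
    cycle_mult=2.0,                              # --lr-restart-mul 2
    max_lr=5e-5,                                 # --lr 5e-5
    min_lr=1e-8,                                 # --min-lr 1e-8
    warmup_steps=int(0.1 * steps_per_epoch),     # --lr-warm-up 0.1 (assumed fraction of a cycle)
    gamma=0.5,                                   # --lr-decay-factor 0.5
)
for _ in range(steps_per_epoch):                 # one scheduler step per training step
    opt.step()
    scheduler.step()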

Note: We trained our models on 4 GPUs with the DDP training strategy, so the results might vary when the models are trained on a different number of GPUs due to the change in effective batch size.

Evaluation

To perform evaluation on a trained model, run:

python eval.py --dataset <dataset/name> \
--dataset-dir <path/to/dataset> \
--batch-size 32 --num-workers 4 --temporal-scale 3 \
--model-path <path/to/model>

The --dataset argument can be one of:

  • msvd-qa-oe for MSVD-QA
  • msrvtt-qa-oe for MSRVTT-QA
  • tgif-frameqa for TGIF-FrameQA
  • tgif-transition for TGIF-Transition
  • tgif-action for TGIF-Action

To get our reported performance, download our best training checkpoints here.

Citation

@article{Immanuel2023,
    author  = {S. A. Immanuel and C. Jeong},
    title   = {Lightweight recurrent cross-modal encoder for video question answering},
    journal = {Knowledge-Based Systems},
    month   = {6},
    year    = {2023},
}

Credits