Progressive Spatio-temporal Perception for Audio-Visual Question Answering (ACMMM'23) [arXiv]

This repository contains the PyTorch code accompanying our PSTP-Net.

Guangyao Li, Wenxuan Hou, Di Hu


Requirements

Python 3.6+
PyTorch 1.6.0
tensorboardX
ffmpeg
numpy
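
A minimal environment setup might look as follows (a sketch assuming a fresh conda environment; the exact install commands are not prescribed by the authors):

    conda create -n pstp python=3.6 -y
    conda activate pstp
    pip install torch==1.6.0 tensorboardX numpy
    # ffmpeg is a system-level tool, e.g. on Ubuntu:
    sudo apt-get install ffmpeg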

Usage

  1. Clone this repo

    git clone https://github.com/GeWu-Lab/PSTP-Net.git
  2. Download data

    MUSIC-AVQA: https://gewu-lab.github.io/MUSIC-AVQA/

    AVQA: http://mn.cs.tsinghua.edu.cn/avqa/

  3. Feature extraction (a hedged sketch of this step appears after this list)

    cd feat_script/extract_clip_feat
    python extract_patch-level_feat.py
  4. Training (the selection flags below are illustrated in a sketch after this list)

    python main_train.py \
    --temp_select True --segs 12 --top_k 2 \
    --spat_select True --top_m 25 \
    --a_guided_attn True \
    --global_local True \
    --batch-size 64 --epochs 30 --lr 1e-4 --gpu 0 \
    --checkpoint PSTP_Net \
    --model_save_dir models_pstp
  5. Testing

    python main_test.py
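
Step 3's extraction builds on CLIP visual features. As a hedged illustration (not the repo's actual script), the sketch below encodes pre-extracted video frames with the openai clip package; the model choice, paths, and frame sampling are assumptions, and true patch-level features would additionally require reading the ViT's patch tokens before pooling, which is omitted here:

    # Hedged sketch of CLIP feature extraction over video frames.
    # Paths, model choice, and sampling are hypothetical.
    import os
    import torch
    import clip                        # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    frame_dir = "frames/sample_video"  # hypothetical folder of extracted frames
    feats = []
    with torch.no_grad():
        for name in sorted(os.listdir(frame_dir)):
            img = preprocess(Image.open(os.path.join(frame_dir, name)))
            feats.append(model.encode_image(img.unsqueeze(0).to(device)))  # (1, 512)

    torch.save(torch.cat(feats, dim=0), "sample_video_clip.pt")  # (num_frames, 512)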
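
The step 4 flags map onto PSTP-Net's progressive perception: --temp_select with --segs 12 --top_k 2 keeps the two most question-relevant temporal segments out of twelve, and --spat_select with --top_m 25 presumably keeps the 25 most relevant spatial patches. Below is a minimal sketch of question-guided top-k temporal segment selection; the module name, scoring function, and dimensions are illustrative assumptions, not the authors' implementation:

    import torch
    import torch.nn as nn

    class TempSegmentSelect(nn.Module):
        """Hedged sketch: score segments against the question, keep top-k."""
        def __init__(self, d_model=512):
            super().__init__()
            self.scorer = nn.Linear(d_model, d_model)

        def forward(self, seg_feats, q_feat, top_k=2):
            # seg_feats: (B, num_segs, d_model); q_feat: (B, d_model)
            scores = torch.bmm(self.scorer(seg_feats),
                               q_feat.unsqueeze(-1)).squeeze(-1)     # (B, num_segs)
            idx = scores.topk(top_k, dim=1).indices                  # (B, top_k)
            idx = idx.unsqueeze(-1).expand(-1, -1, seg_feats.size(-1))
            return torch.gather(seg_feats, 1, idx)                   # (B, top_k, d_model)

    # With the defaults above: 12 segments in, top 2 kept.
    sel = TempSegmentSelect()
    out = sel(torch.randn(4, 12, 512), torch.randn(4, 512), top_k=2)
    print(out.shape)  # torch.Size([4, 2, 512])

The same top-k gather pattern would apply spatially over patch features with --top_m.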

Citation

If you find this work useful, please consider citing it.

coming soon!

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.