Progressive Spatio-temporal Perception for Audio-Visual Question Answering (ACMMM'23) [arXiv]

This repository contains the PyTorch code accompanying our PSTP-Net.

Guangyao Li, Wenxuan Hou, Di Hu


Requirements

Python 3.6+
PyTorch 1.6.0
tensorboardX
ffmpeg
numpy
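
A minimal environment setup might look as follows (a sketch assuming a fresh conda environment; the exact install commands are not prescribed by the authors):

    conda create -n pstp python=3.6 -y
    conda activate pstp
    pip install torch==1.6.0 tensorboardX numpy
    # ffmpeg is a system-level tool, e.g. on Ubuntu:
    sudo apt-get install ffmpeg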

Usage

  1. Clone this repo

    git clone https://github.com/GeWu-Lab/PSTP-Net.git
  2. Download data

    MUSIC-AVQA: https://gewu-lab.github.io/MUSIC-AVQA/

    AVQA: http://mn.cs.tsinghua.edu.cn/avqa/

  3. Feature extraction (a hedged sketch of this step appears after this list)

    cd feat_script/extract_clip_feat
    python extract_patch-level_feat.py
  4. Training (the selection flags below are illustrated in a sketch after this list)

    python main_train.py \
    --temp_select True --segs 12 --top_k 2 \
    --spat_select True --top_m 25 \
    --a_guided_attn True \
    --global_local True \
    --batch-size 64 --epochs 30 --lr 1e-4 --gpu 0 \
    --checkpoint PSTP_Net \
    --model_save_dir models_pstp
  5. Testing

    python main_test.py
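
Step 3's extraction builds on CLIP visual features. As a hedged illustration (not the repo's actual script), the sketch below encodes pre-extracted video frames with the openai clip package; the model choice, paths, and frame sampling are assumptions, and true patch-level features would additionally require reading the ViT's patch tokens before pooling, which is omitted here:

    # Hedged sketch of CLIP feature extraction over video frames.
    # Paths, model choice, and sampling are hypothetical.
    import os
    import torch
    import clip                        # pip install git+https://github.com/openai/CLIP.git
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    frame_dir = "frames/sample_video"  # hypothetical folder of extracted frames
    feats = []
    with torch.no_grad():
        for name in sorted(os.listdir(frame_dir)):
            img = preprocess(Image.open(os.path.join(frame_dir, name)))
            feats.append(model.encode_image(img.unsqueeze(0).to(device)))  # (1, 512)

    torch.save(torch.cat(feats, dim=0), "sample_video_clip.pt")  # (num_frames, 512)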
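
The step 4 flags map onto PSTP-Net's progressive perception: --temp_select with --segs 12 --top_k 2 keeps the two most question-relevant temporal segments out of twelve, and --spat_select with --top_m 25 presumably keeps the 25 most relevant spatial patches. Below is a minimal sketch of question-guided top-k temporal segment selection; the module name, scoring function, and dimensions are illustrative assumptions, not the authors' implementation:

    import torch
    import torch.nn as nn

    class TempSegmentSelect(nn.Module):
        """Hedged sketch: score segments against the question, keep top-k."""
        def __init__(self, d_model=512):
            super().__init__()
            self.scorer = nn.Linear(d_model, d_model)

        def forward(self, seg_feats, q_feat, top_k=2):
            # seg_feats: (B, num_segs, d_model); q_feat: (B, d_model)
            scores = torch.bmm(self.scorer(seg_feats),
                               q_feat.unsqueeze(-1)).squeeze(-1)     # (B, num_segs)
            idx = scores.topk(top_k, dim=1).indices                  # (B, top_k)
            idx = idx.unsqueeze(-1).expand(-1, -1, seg_feats.size(-1))
            return torch.gather(seg_feats, 1, idx)                   # (B, top_k, d_model)

    # With the defaults above: 12 segments in, top 2 kept.
    sel = TempSegmentSelect()
    out = sel(torch.randn(4, 12, 512), torch.randn(4, 512), top_k=2)
    print(out.shape)  # torch.Size([4, 2, 512])

The same top-k gather pattern would apply spatially over patch features with --top_m.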

Citation

If you find this work useful, please consider citing it.

coming soon!

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.