This is the PyTorch implementation of our paper "Siamese Vision Transformers are Scalable Audio-visual Learners":
Yan-Bo Lin and Gedas Bertasius
Our Method
pip3 install -r requirement
- Download AudioSet and VGGSound
- Download jx_vit_base_patch16_224_in21k-e5005f0a.pth and save it at ./src/adapt_weights
- (Optional, but it can slightly affect results.) Download the sqlite3 files and save them wherever you want. Reading annotations from sqlite3 instead of a CSV file avoids running out of CPU memory.
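The memory saving comes from querying annotations on demand rather than holding an entire CSV in RAM. A minimal stdlib-only sketch of the idea (the table name and columns here are hypothetical, not the repo's actual schema):

```python
import sqlite3

# Hypothetical schema for illustration; the downloaded sqlite3 files may differ.
conn = sqlite3.connect(":memory:")  # in practice, pass the path to the downloaded .db file
conn.execute("CREATE TABLE annotations (video_id TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?)",
    [("vid_001", "dog"), ("vid_002", "speech")],
)
conn.commit()

# Fetch one sample's annotation on demand inside the dataloader;
# only the queried row is materialized in memory.
row = conn.execute(
    "SELECT label FROM annotations WHERE video_id = ?", ("vid_001",)
).fetchone()
print(row[0])  # -> dog
conn.close()
```

A dataloader's __getitem__ can issue one such query per sample, so memory use stays flat regardless of the annotation file's size.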
- Edit ./src/dataloader.py and ./src/dataloader_ft.py to make sure your video path and sql path are correct.
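Since the paths are hard-coded, a quick existence check before launching a long training job can save time. A small stdlib-only sketch (the path values and helper name are hypothetical, not part of the repo):

```python
from pathlib import Path

# Hypothetical placeholders: substitute the values you set in dataloader.py.
video_root = Path("/data/audioset/videos")
sql_path = Path("/data/audioset/annotations.db")

def check_paths(*paths: Path) -> list:
    """Return the subset of the given paths that are missing on disk."""
    return [p for p in paths if not p.exists()]

missing = check_paths(video_root, sql_path)
if missing:
    print("Missing paths:", ", ".join(str(p) for p in missing))
```

Running this once before each training script catches a mistyped path immediately instead of after the dataloader workers start failing.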
- Pre-training: run ./egs/audioset/run_pretrain_base.sh
- AudioSet 2M:
run ./egs/audioset/run_base_ft_2m.sh
- AudioSet 20K:
run ./egs/audioset/run_base_ft.sh
- VGGSound:
run ./egs/vggsound/run_base_ft.sh
If you use this code in your research, please cite:
@article{lin2024siamese,
title={Siamese Vision Transformers are Scalable Audio-visual Learners},
author={Lin, Yan-Bo and Bertasius, Gedas},
journal={arXiv preprint arXiv:2403.19638},
year={2024}
}
Our code is based on CAV-MAE.
More checkpoints and training scripts will be made available.
Base | Base+ | Large | Huge |
---|---|---|---|
PT AS-2M | PT AS-2M+VGG+ACAV2.4M | | |