
[NeurIPS 2023] VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset


Building Environment

VAST is implemented in PyTorch. We use Python 3.9 and CUDA 11.7; other versions may also be compatible. The remaining required packages are listed in preinstall.sh.

conda create -n vast python=3.9
conda activate vast
sh preinstall.sh
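
To confirm the environment is usable (a quick sanity check only, not a required step), the installed PyTorch version, its CUDA build, and GPU visibility can be printed with:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"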

Download the basic encoders' pretrained checkpoints

Make a directory named pretrained_weights under the main working directory.

1. Download the EVA-CLIP weights:

wget -P pretrained_weights/clip/ https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA01_CLIP_g_14_psz14_s11B.pt

2. Download the BEATs weights from https://github.com/microsoft/unilm/tree/master/beats.
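
The BEATs page above does not give a single download command, so the following is only a sketch: it assumes you have already downloaded BEATs_iter3_plus_AS2M.pt (the file name expected by the directory tree below) and simply moves it into place.

mkdir -p pretrained_weights/beats
mv /PATH/TO/BEATs_iter3_plus_AS2M.pt pretrained_weights/beats/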

3. Download the BERT weights:

from transformers import BertModel, BertTokenizer

# download bert-base-uncased from the Hugging Face Hub
bert = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# save the model weights and tokenizer under pretrained_weights/bert
bert.save_pretrained('pretrained_weights/bert/bert-base-uncased')
bert_tokenizer.save_pretrained('pretrained_weights/bert/bert-base-uncased')

The prepared pretrained_weights directory should look as follows:

    ├── pretrained_weights
    │   ├── beats
    │   │   └── BEATs_iter3_plus_AS2M.pt
    │   ├── bert
    │   │   └── bert-base-uncased
    │   ├── clip
    │   │   └── EVA01_CLIP_g_14_psz14_s11B.pt
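
As a quick sanity check (assuming exactly the layout above), the three entries can be listed before moving on; a missing path means one of the download steps was skipped:

ls pretrained_weights/clip/EVA01_CLIP_g_14_psz14_s11B.pt \
   pretrained_weights/beats/BEATs_iter3_plus_AS2M.pt \
   pretrained_weights/bert/bert-base-uncased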

Download VAST models and captioners (for labeling your own data)

Make a directory named output under the main working directory.

1. Download the VAST model (optional, for finetuning):

[Google Drive Link] [Baidu Cloud Link]

2. Download the vision captioner (optional, for labeling images/videos):

[Google Drive Link] [Baidu Cloud Link]

3. Download the audio captioner (optional, for labeling audio):

[Google Drive Link] [Baidu Cloud Link]

The prepared output directory should look as follows:

    ├── output
    │   ├── vast
    │   │   ├── pretrain_vast
    │   │   ├── vision_captioner
    │   │   └── audio_captioner

Download VAST-27M annotations for pretraining

[Google Drive Link] [Baidu Cloud Link]

Raw videos can be downloaded from YouTube.
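
For example (a sketch only: yt-dlp is not part of this repository and VIDEO_ID is a placeholder; any YouTube downloader works), a clip can be fetched and named after its YouTube id with:

yt-dlp -o '%(id)s.%(ext)s' https://www.youtube.com/watch?v=VIDEO_ID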

Download downstream dataset annotations for finetuning

Make a directory named datasets under the main working directory.

[Google Drive Link] [Baidu Cloud Link]

The prepared datasets directory should look as follows:

    ├── datasets
    │   ├── annotations
    │   │   ├── msrvtt
    │   │   ├── ...
    │   │   └── msvd
    │   ├── srcdata
    │   │   ├── msrvtt
    │   │   ├── ...
    │   │   └── msvd

The srcdata (images/videos/audio) must be collected by yourself.

Finetune Model

  • Finetune retrieval tasks:
    sh scripts/vast/finetune_ret.sh
  • Finetune captioning tasks:
    sh scripts/vast/finetune_cap.sh
  • Finetune QA tasks:
    sh scripts/vast/finetune_qa.sh

Pretrain Model

sh scripts/pretrain_vast.sh

Test your finetuned model

For example, if the command for finetuning the retrieval model is as follows:

python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $output_dir \
--output_dir $output_dir/downstream/retrieval-msrvtt \

If you want to test the model, just add the following two lines to the command:

--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt
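
Put together (a sketch only: it simply reuses the finetuning command above and appends the two testing lines; /PATH/TO/SAVED_CHECKPOINT.pt remains a placeholder), the full testing command looks like:

python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $output_dir \
--output_dir $output_dir/downstream/retrieval-msrvtt \
--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt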

Labeling your own data with VAST's captioners

You need to prepare:

1) a folder containing all of the videos/images or audio files;

2) a meta.json composed of [{'video_id':'09WssDay9FE_1'},{'video_id':'09WssDay9FE_2'},...] (a sketch for generating it is shown below);

and then write the config file.
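
The following is only a sketch for building meta.json: it assumes every clip sits directly in the folder and is named <video_id>.mp4, so the file stem is used as video_id (adjust the pattern for images or audio):

cd /PATH/TO/YOUR_FOLDER
ls *.mp4 | sed 's/\.mp4$//' | \
  awk 'BEGIN{printf "["} NR>1{printf ","} {printf "{\"video_id\":\"%s\"}", $0} END{print "]"}' > meta.json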

sh scripts/vast/vision_captioner.sh
sh scripts/vast/audio_captioner.sh

Common controllable options that can be passed on the command line to override the config files (an example follows the list):

--train_vision_sample_num
--test_vision_sample_num
--train_audio_sample_num
--test_audio_sample_num
--train_task
--test_task
--learning_rate
--train_batch_size
--test_batch_size
--train_epoch
--train_steps
--checkpointing
--frozen_vision
--valid_freq
--beam_size
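
For example (a sketch only: the flag names come from the list above, but the values are illustrative assumptions, not recommended settings), these options can be appended to a finetuning or testing command in the same way as the testing flags above:

--learning_rate 1e-5 \
--train_batch_size 64 \
--frozen_vision true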

Citation

If you find this code useful for your research, please consider citing:

@article{chen2024vast,
  title={Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset},
  author={Chen, Sihan and Li, Handong and Wang, Qunbo and Zhao, Zijia and Sun, Mingzhen and Zhu, Xinxin and Liu, Jing},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

License

This project is released under the MIT license.

Third-Party Licenses

For the full list of third-party licenses used in this project, please see the THIRD_PARTY_LICENSES.md file.