[Interspeech2024] AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

This repository contains the source code of our paper AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning (Interspeech 2024). The code was developed with reference to three repositories: V-ACT, GiT, and CAV-MAE. Special thanks to the authors of these excellent prior works.

If you are interested in audio-visual representation learning, please also check out our other work, EquiAV. We greatly appreciate your interest!

Abstract

In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions. Leveraging these advancements, we propose AVCap, an Audio-Visual Captioning framework, a simple yet powerful baseline approach applicable to audio-visual captioning. AVCap utilizes audio-visual features as text tokens, which has many advantages not only in performance but also in the extensibility and scalability of the model. AVCap is designed around three pivotal dimensions: the exploration of optimal audio-visual encoder architectures, the adaptation of pre-trained models according to the characteristics of generated text, and the investigation into the efficacy of modality fusion in captioning.

Getting Started

1. Prepare Environment & Dataset

Create conda environment

conda env create -f avcap.yaml
conda activate avcap
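
As a quick sanity check (not part of the repository), you can confirm that the new environment provides a CUDA-enabled PyTorch build before moving on:

# Minimal environment check; assumes the avcap environment installs PyTorch with CUDA support.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())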

To work with the AudioCaps dataset in this project, download and organize the files in the structure below within the dataset folder. Additionally, download the pretrained models from the GiT and CAV-MAE repositories and place them in the pretrained_weights folder. The subfolders should be organized as follows:

AVCap/
├── dataset/
│   ├── audiocaps/
│   │   ├── train/
│   │   │   ├── frames/
│   │   │   └── waveforms/
│   │   └── test/
│   │       ├── frames/
│   │       └── waveforms/
│   ├── train.json
│   ├── test.json
│   ├── test_coco.json
│   └── torch2iid.json
├── pretrained_weights/
│   ├── cav-mae-scale++.pth
│   ├── model_git_base.pt
│   ├── model_git_large.pt
│   └── model_git_large_r.pt
├── dataloader.py
├── run_avcap.py
├── scheduler.py
├── trainer.py
└── utils.py

Our preprocessed AudioCaps dataset can be downloaded from this link.
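
Before training, it can help to verify that the files landed where the loaders expect them. The following sketch is not part of the repository; it simply checks for the paths shown in the tree above (run it from the directory containing AVCap/, or adjust root accordingly):

# Hypothetical layout check based on the directory tree above (not part of the repo).
from pathlib import Path

root = Path("AVCap")  # adjust if your checkout lives elsewhere
expected = [
    "dataset/audiocaps/train/frames",
    "dataset/audiocaps/train/waveforms",
    "dataset/audiocaps/test/frames",
    "dataset/audiocaps/test/waveforms",
    "dataset/train.json",
    "dataset/test.json",
    "dataset/test_coco.json",
    "dataset/torch2iid.json",
    "pretrained_weights/cav-mae-scale++.pth",
    "pretrained_weights/model_git_base.pt",
]
missing = [p for p in expected if not (root / p).exists()]
print("All expected paths found." if not missing else f"Missing paths: {missing}")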

2. Training

We train our model on a single node with 8 RTX 4090 GPUs. We use the CAV-MAE scale++ model as the audio-visual encoder and the text decoder from the GiT model as the text decoder. You can choose whether to train each part using the freeze_av and freeze_text options.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -u -W ignore run_avcap.py \
    --lr 1e-4 --batch-size 8 \
    --num_frames 4 \
    --target_length 1024 --noise True \
    --av_pretrain_path pretrained_weights/cav-mae-scale++.pth \
    --text_pretrain_path pretrained_weights/model_git_base.pt \
    --save_path [path to save model] \
    --freeze_text \
    --modality_specific_depth 11 \
    --no_wandb

If you wish to use a pretrained audio-visual encoder, you must set modality_specific_depth to the value appropriate for that model. For the pretrained text decoder, you can download and use the git_base, git_large, and git_large_r models.
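
For reference, "freezing" a sub-module in PyTorch usually just means disabling gradients on its parameters. The sketch below illustrates the general pattern behind options like freeze_av and freeze_text; the names model.av_encoder and model.text_decoder are placeholders and do not necessarily match the modules in this codebase:

# Illustrative only: how freeze-style flags commonly translate into PyTorch.
# model.av_encoder / model.text_decoder are hypothetical attribute names.
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    # Stop gradient updates for every parameter in this sub-module.
    for p in module.parameters():
        p.requires_grad = False
    # Keep dropout/batch-norm in inference mode while frozen.
    module.eval()

# if args.freeze_av:
#     freeze(model.av_encoder)
# if args.freeze_text:
#     freeze(model.text_decoder)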
