Audio Captioning Transformer

This repository contains source code for our paper Audio Captioning Transformer.

Set up environment

  • Create a conda environment with the required dependencies: conda env create -f environment.yml -n name, then activate it with conda activate name.
  • All of our experiments were run on RTX 30-series GPUs with CUDA 11, so this environment may only work on RTX 30-series GPUs (a quick check of your setup is sketched below).
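If you want to confirm that the environment can see your GPU and CUDA runtime before training, a quick check is shown below (this assumes the environment installs PyTorch, which the code relies on):

    import torch

    # Quick sanity check of the GPU / CUDA setup (assumes PyTorch was installed
    # from environment.yml).
    print('CUDA available:', torch.cuda.is_available())
    print('CUDA version used by PyTorch:', torch.version.cuda)
    if torch.cuda.is_available():
        print('GPU:', torch.cuda.get_device_name(0))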

Set up dataset

All experiments were carried out on the AudioCaps dataset, which is sourced from AudioSet. Because some clips are no longer available for download, our version contains 49274 of the 49837 audio clips in the training set, 494 of 495 in the validation set, and 957 of 975 in the test set.

For reproducibility, our downloaded version can be accessed at:

To prepare the dataset:

  • Put the downloaded zip files under the data directory and run data_unzip.sh to extract them.
  • Run python data_prep.py to create HDF5 (h5py) files for the dataset (a quick sanity check of the output is sketched below).
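After data_prep.py finishes, you can inspect the generated files with h5py. This is only an illustrative sketch: the file path below is an assumption, not the repository's actual output layout.

    import h5py

    # Inspect one of the HDF5 files written by data_prep.py.
    # The path below is an assumption; use the path data_prep.py actually writes to.
    with h5py.File('data/train.h5', 'r') as f:
        for name, item in f.items():
            # Datasets have a shape; groups do not.
            print(name, getattr(item, 'shape', '(group)'))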

Prepare evaluation tool

  • Run coco_caption/get_stanford_models.sh to download the Stanford models required by the caption evaluation metrics.

Experiments

Training

  • The default configuration is for 'ACT_m_scratch'.
  • Run experiments: python train.py -n exp_name
  • Set the parameters you want in settings/settings.yaml (a sketch for inspecting this file follows this list).
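The exact structure of settings/settings.yaml is defined by the repository, so the snippet below is only a sketch: it loads the file with PyYAML and prints the top-level fields referenced in this README (mode, encoder, path). Other key names may differ from what is shown here.

    import yaml

    # Load the experiment configuration and print the fields mentioned in this
    # README before launching a run (PyYAML must be installed).
    with open('settings/settings.yaml') as f:
        cfg = yaml.safe_load(f)

    for key in ('mode', 'encoder', 'path'):
        print(key, '->', cfg.get(key))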

Pretrained encoder

We provide two pretrained encoders: an off-the-shelf DeiT model, and a DeiT model further pretrained on AudioSet.

  1. DeiT model
  2. DeiT model pretrained on AudioSet

To use pretrained encoder:

  • Download the pretrained encoder models and put them under the pretrained_models directory
  • Set the following in settings/settings.yaml (a scripted sketch follows this list):
    • set encoder.model to 'deit' or 'audioset'
    • set encoder.pretrained to 'Yes'
    • set path.encoder to the model path, e.g. 'pretrained_models/deit.pth'
  • Run experiments
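If you prefer to set these options from a script rather than editing the file by hand, a minimal sketch follows. It assumes the key names listed in the bullets above and that PyYAML is available; verify both against the actual settings file.

    import yaml

    # Point the config at a downloaded pretrained encoder. The key names follow
    # the README bullets above; check them against the actual settings file.
    with open('settings/settings.yaml') as f:
        cfg = yaml.safe_load(f)

    cfg['encoder']['model'] = 'deit'                      # or 'audioset'
    cfg['encoder']['pretrained'] = 'Yes'
    cfg['path']['encoder'] = 'pretrained_models/deit.pth'

    with open('settings/settings.yaml', 'w') as f:
        yaml.safe_dump(cfg, f)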

Reproducible results

As we refactored the code and made some improvements after the DCASE workshop, there are small differences between the reproducible results and those reported in the paper (the metrics are now slightly higher), but the conclusions are the same.

We provide three pretrained models, all of which were trained using a pretrained encoder (SPIDEr, the average of CIDEr and SPICE, is reported for each):

  1. ACT_s (SPIDEr: 0.4244)
  2. ACT_m (SPIDEr: 0.4178)
  3. ACT_l (SPIDEr: 0.4257)

To get the reproducible results:

  • Download the pretrained models and put them under the pretrained_models directory (a quick check that a checkpoint loads is sketched after this list)
  • Set the following in settings/settings.yaml:
    • set mode to 'eval'
    • set path.eval_model to the model path
  • Run experiments
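Before running evaluation, you can verify that a downloaded checkpoint loads at all. The file name and checkpoint structure below are assumptions for illustration only:

    import torch

    # Load a downloaded checkpoint on CPU just to confirm the file is intact
    # (the file name is an assumption; use whichever model you downloaded).
    ckpt = torch.load('pretrained_models/ACT_m.pth', map_location='cpu')
    if isinstance(ckpt, dict):
        print('Top-level keys:', list(ckpt.keys())[:10])
    else:
        print('Loaded object of type:', type(ckpt))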

Cite

If you wish to cite this work, please cite the following paper:

@inproceedings{Mei2021act,
    author = "Mei, Xinhao and Liu, Xubo and Huang, Qiushi and Plumbley, Mark D. and Wang, Wenwu",
    title = "Audio Captioning Transformer",
    booktitle = "Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)",
    address = "Barcelona, Spain",
    month = "November",
    year = "2021",
    pages = "211--215",
    isbn = "978-84-09-36072-7",
    doi = "10.5281/zenodo.5770113"
}
