
ARCH: Audio Representations benCHmark


This repository contains the code for the ARCH benchmark. It is designed to evaluate audio representations on a wide range of datasets and tasks, and it aims to be easy to use while allowing the comparison of different audio representations.

The main features of ARCH are:

  • Plug and play: the benchmark is designed to be easy to use. It provides a unified interface to load the datasets and to evaluate audio representations.
  • Extensibility: the benchmark is designed to be easy to extend. It is possible to add new datasets and tasks, as well as new models whose audio representations should be evaluated.
  • Standardization: the benchmark aims to standardize the evaluation of audio representations. The plethora of audio representation learning (ARL) models and datasets makes them hard to compare; the benchmark provides a standard way to evaluate them.

The main components and their interactions are illustrated in the following figure:


[Figure: overview of the main ARCH components and their interactions]

Installation

ARCH can be installed by just cloning the repository and installing it with pip:

git clone https://github.com/MorenoLaQuatra/ARCH.git
cd ARCH
pip install -e .
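
After installation, a quick sanity check is to import the evaluation package (assuming it is exposed as arch_eval, the module used in the usage example below):

# minimal sanity check: this import should succeed if the editable install worked
import arch_eval
print(arch_eval.__name__)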

Reproducing the results provided in the first release

The benchmark can be used by importing the arch_eval module. The script evaluate_hf_models.py provides an example of how to use the benchmark and exposes the following parameters to configure the evaluation (a sketch of how they might be set is shown after the list):

  • model: the name of the model to evaluate. It can be any model from the HuggingFace model hub or a local model exposing the same interface.
  • device: the device to use for the evaluation. It can be cpu or cuda.
  • max_epochs: the maximum number of epochs to train the linear classifier.
  • verbose: if True, it prints the results of the evaluation and other information on the standard output.
  • tsv_logging_file: the file where to save the results of the evaluation in TSV format.
  • n_iters: the number of times to repeat the evaluation; it can be used to compute the average and standard deviation over multiple runs.
  • data_config_file: the file containing the configuration of the datasets to use for the evaluation (you can find it at configs/data_config.json)
  • enabled_datasets: the list of datasets to use for the evaluation. It can be any of the following: esc50, us8k, fsd50k, vivae, fma_small, magna_tag_a_tune, irmas, medleydb, ravdess, audio_mnist, slurp, emovo.
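
The snippet below is a minimal sketch of how these parameters might look when set in Python; the names mirror the list above and the values are illustrative only, so the actual evaluate_hf_models.py may define or parse them differently (e.g., via command-line arguments).

# Hypothetical configuration sketch: variable names mirror the list above,
# values are illustrative only.
model = "facebook/wav2vec2-base"              # any HF hub model or a local path
device = "cuda"                               # or "cpu"
max_epochs = 200                              # training budget for the linear classifier
verbose = True                                # print progress and results to stdout
tsv_logging_file = "results.tsv"              # where results are saved in TSV format
n_iters = 1                                   # repeat the evaluation to average over runs
data_config_file = "configs/data_config.json" # dataset configuration file
enabled_datasets = ["esc50", "us8k"]          # subset of the supported datasets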

Datasets and tasks

The benchmark includes multiple datasets and, at the moment, only classification tasks. The following table lists the datasets and tasks currently supported by the benchmark.

Dataset | Task | Type | Reference | Version
ESC-50 | Single-label classification | Sound events | ESC: Dataset for Environmental Sound Classification | Version 1
US8K | Single-label classification | Sound events | A Dataset and Taxonomy for Urban Sound Research | Version 1
FSD50K | Single-label classification | Sound events | FSD50K: An Open Dataset of Human-Labeled Sound Events | Version 1
VIVAE | Single-label classification | Sound events | The Variably Intense Vocalizations of Affect and Emotion (VIVAE) corpus prompts new perspective on nonspeech perception | Version 1
FMA-small | Single-label classification | Music | FMA: A Dataset For Music Analysis | Version 1
MagnaTagATune | Multi-label classification | Music | Evaluation of algorithms using games: the case of music annotation | Version 1
IRMAS | Multi-label classification | Music | A Comparison of Sound Segregation Techniques for Predominant Instrument Recognition in Musical Audio Signals | Version 1
Medley-solos-DB | Single-label classification | Music | Deep convolutional networks on the pitch spiral for musical instrument recognition | Version 1
RAVDESS | Single-label classification | Speech | The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English | Version 1
AudioMNIST | Single-label classification | Speech | Interpreting and Explaining Deep Neural Networks for Classification of Audio Signals | Version 1
SLURP | Single-label classification | Speech | SLURP: A Spoken Language Understanding Resource Package | Version 1
EMOVO | Single-label classification | Speech | EMOVO: A Dataset for Emotion Recognition in Spontaneous Speech | Version 1

Version 1: 2022-02-26 - the first released version of the benchmark. The table above indicates which datasets are included in each version of the benchmark.

The instructions to download the datasets are available in the data_download/README.md file.

Detailed information and results for the first version of the benchmark are available on the dedicated 🤗 space (https://huggingface.co/spaces/ALM/ARCH). The results include both the numbers reported in the paper and the specific versions of the models evaluated.

Models

The models evaluated so far are summarized in the following table, which reports the name of the model, the number of parameters, and the number of GFLOPs. The results are reported on the dedicated 🤗 space: https://huggingface.co/spaces/ALM/ARCH.

Model # Params GFLOPs
facebook/wav2vec2-base ~90M ~70
microsoft/wavlm-base ~90M ~70
microsoft/wavlm-base-plus ~90M ~70
facebook/hubert-base-ls960 ~90M ~70
facebook/data2vec-audio-base ~90M ~70
ALM/wav2vec2-base-audioset (new) ~90M ~70
ALM/hubert-base-audioset (new) ~90M ~70
facebook/wav2vec2-large-robust ~300M ~190
facebook/wav2vec2-xls-r-300m ~300M ~190
microsoft/wavlm-large ~300M ~190
facebook/hubert-large-ll60k ~300M ~190
facebook/data2vec-audio-large ~300M ~190
ALM/wav2vec2-large-audioset (new) ~300M ~190
ALM/hubert-large-audioset (new) ~300M ~190
facebook/wav2vec2-xls-r-1b ~1B ~530
facebook/hubert-xlarge-ll60k ~1B ~530

Usage

The framework is designed to evaluate the performance of a model on the desired dataset. If the model follows one of the available architectures (see the models section above), you can simply import it and run the evaluation. The following example shows how to evaluate a Wav2Vec2-style model on the ESC-50 dataset.

import json
import torch
from configs.w2v2_wrapper import Wav2Vec2ModelWrapper

from arch_eval import Model, ClassificationModel, ClassificationDataset
from arch_eval import ESC50
from transformers import AutoModel, AutoFeatureExtractor


device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME_OR_PATH = "facebook/wav2vec2-base"
MAX_EPOCHS = 200
# load the dataset information - update this file according to the downloaded dataset(s)
with open("configs/data_config.json") as f:
    datasets_info = json.load(f)
dataset_name = "esc50"

audio_model = AutoModel.from_pretrained(MODEL_NAME_OR_PATH)
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_NAME_OR_PATH)
audio_model = audio_model.to(device)
# create the model wrapper
model = Wav2Vec2ModelWrapper(
    audio_model, 
    feature_extractor, 
    device, 
    max_length=datasets_info[dataset_name]["max_length_seconds"]*16_000
)
# evaluator for the ESC-50 dataset
evaluator = ESC50(datasets_info[dataset_name]["path"], verbose=True)
res_dataset = evaluator.evaluate(
    model, 
    mode="linear", 
    device=device, 
    batch_size=8, 
    max_num_epochs=MAX_EPOCHS
)

for metric, value in res_dataset.items():
    print(f"{metric}: {value}")

In the example above, the model is evaluated on the ESC-50 dataset, and the benchmark uses the configuration file configs/data_config.json to retrieve the path to the dataset and the maximum length of the audio files in seconds. The configuration file is a JSON file that contains the information for each dataset. The following is an example entry for the ESC-50 dataset.

{
    "esc50": {
        "path": "PATH_TO_AUDIO_DATASETS/esc50/",
        "max_length_seconds": 5,
        "is_multilabel": false
    }
}
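
Building on the usage example above, the sketch below shows one way to append the metrics returned by evaluator.evaluate(...) to a TSV file, mirroring the tsv_logging_file parameter of evaluate_hf_models.py. Note that log_results_tsv is a hypothetical helper written for illustration, not part of the arch_eval API.

import csv

def log_results_tsv(path, model_name, dataset_name, results):
    # results is the dictionary of metrics returned by evaluator.evaluate(...)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for metric, value in results.items():
            writer.writerow([model_name, dataset_name, metric, value])

# reusing the variables defined in the usage example above
log_results_tsv("results.tsv", MODEL_NAME_OR_PATH, dataset_name, res_dataset)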

Contributing

We welcome contributions to the benchmark. If you want to add a dataset or a model, please follow the instructions in the CONTRIBUTING.md file. If you want to add new features, fix bugs, improve the documentation, or just add new results, please open an issue or a pull request.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Bow icons in the logo created by Slidicon - Flaticon

Authors

Moreno La Quatra

Alkis Koudounas

Lorenzo Vaiani

Acknowledgments

This work could not have been possible without the support of the authors of the datasets and the models used in the benchmark. We would like to thank them for their work and for making their datasets and models publicly available.

References

The table above contains the references of the datasets used in the benchmark; if you use them in your work, please cite them accordingly.

The specific models evaluated for each version of the benchmark are reported on the results page; if you use them in your work, please cite them accordingly.

If you use the benchmark in your work, please cite the following paper:

Version 1:

@INPROCEEDINGS{ARCH,
  author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Garza, Paolo and Cagliero, Luca and Siniscalchi, Sabato Marco},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)}, 
  title={Benchmarking Representations for Speech, Music, and Acoustic Events}, 
  year={2024}
}