data2vec-aqc

Paper: data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup, at ICASSP 2023 (arXiv link).

data2vec-aqc is a Self-Supervised Learning (SSL) algorithm for learning speech representations from unlabeled speech data. Our goal is to improve SSL for speech in domains where both unlabeled and labeled data are limited. Building on the recently introduced data2vec, we add modules to the data2vec framework that leverage data augmentations, quantized representations, and clustering. The interaction between these modules makes it possible to solve a cross-contrastive loss as an additional self-supervised objective.

Primary Contributions:

  • We make data2vec simultaneously solve a masked acoustic modeling based cross-contrastive task between the student and teacher networks by passing randomly augmented versions of the same audio sample through each network (a minimal sketch of this objective follows this list).
  • We add a quantizer module similar to that of wav2vec 2.0, as sampling negatives from the quantized representations has been shown to be effective.
  • Additionally, we introduce the clustering module from ccc-wav2vec 2.0, which clusters the quantized representations and diminishes the effect of those negatives in the contrastive loss computation that fall into the same cluster as the positive.
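
The following sketch illustrates the cross-contrastive idea in isolation. It is not the repository's implementation: the student/teacher outputs, the stand-in negatives, the shapes, and the temperature are all illustrative assumptions.

# Minimal sketch of a cross-contrastive objective between a student and a
# teacher network given two augmentations of the same utterances.  Names,
# shapes, and the temperature are illustrative assumptions, not the
# repository's actual implementation.
import torch
import torch.nn.functional as F

def info_nce(pred, target, negatives, temperature=0.1):
    # pred/target: (B, T, C); negatives: (B, T, N, C).  The positive is the
    # target frame at the same time step; negatives stand in for samples drawn
    # from the quantized representations.
    pos = F.cosine_similarity(pred, target, dim=-1).unsqueeze(-1)        # (B, T, 1)
    neg = F.cosine_similarity(pred.unsqueeze(2), negatives, dim=-1)      # (B, T, N)
    logits = torch.cat([pos, neg], dim=-1) / temperature                 # (B, T, 1 + N)
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

B, T, C, N = 4, 50, 256, 100
student_view1, student_view2 = torch.randn(B, T, C), torch.randn(B, T, C)   # student outputs on view 1 / view 2
teacher_view1, teacher_view2 = torch.randn(B, T, C), torch.randn(B, T, C)   # teacher outputs on view 1 / view 2
negatives = torch.randn(B, T, N, C)                                         # stand-in for quantized negatives

# Cross terms: each student view is contrasted against the teacher's output on
# the *other* augmented view of the same audio.
cross_contrastive_loss = (info_nce(student_view1, teacher_view2, negatives)
                          + info_nce(student_view2, teacher_view1, negatives))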

SUPERB Benchmark

The data2vec-aqc BASE model pre-trained on LibriSpeech-960h has been evaluated on multiple downstream tasks of the SUPERB benchmark. The proposed method comprehensively outperforms the baseline data2vec BASE model across the array of downstream tasks presented in SUPERB.

Models

The reported WERs are without the use of any language model.

Model | Pre-training data | Fine-tuning data | Model Link | WER (test-clean / test-other)
wav2vec Base | LibriSpeech-360h | No fine-tuning | download | ---
wav2vec Base | LibriSpeech-360h | LibriSpeech-100h | download | 7.5 / 20.2
data2vec Base | LibriSpeech-360h | No fine-tuning | download | ---
data2vec Base | LibriSpeech-360h | LibriSpeech-100h | download | 6.4 / 17.7
data2vec-aqc Base | LibriSpeech-360h | No fine-tuning | download | ---
data2vec-aqc Base | LibriSpeech-360h | LibriSpeech-100h | download | 5.5 / 14.0
data2vec-aqc Base | LibriSpeech-960h | No fine-tuning | download | ---
data2vec-aqc Base | LibriSpeech-960h | LibriSpeech-100h | download | 4.8 / 9.5
data2vec-aqc Base (SUPERB) | LibriSpeech-960h | No fine-tuning | SUPERB benchmark submission | ---

  • Pre-training and fine-tuning procedures can be found here.
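
To sanity-check a downloaded checkpoint, the sketch below loads it with fairseq's checkpoint utilities and extracts frame-level features from a dummy waveform. The checkpoint path is a placeholder, and the wav2vec 2.0-style extract_features call and its output key are assumptions about the model interface; follow the linked procedures for the actual pre-training and fine-tuning recipes.

# Sketch: load a downloaded pre-trained checkpoint and extract features.
# The path is a placeholder; extract_features is assumed to follow the
# wav2vec 2.0-style interface of fairseq audio models.
import torch
from fairseq import checkpoint_utils

ckpt_path = "/path/to/data2vec_aqc_base.pt"  # placeholder for a downloaded checkpoint
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0].eval()

wav = torch.randn(1, 16000)  # one second of dummy 16 kHz audio, shape (batch, samples)
with torch.no_grad():
    out = model.extract_features(source=wav, padding_mask=None, mask=False)
frame_features = out["x"]  # (batch, frames, dim) for wav2vec 2.0-style models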

Requirements and Installation

  • PyTorch version >= 1.10.0
  • Python version >= 3.8
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • To install fairseq with data2vec-aqc and develop locally:
git clone https://github.com/Speech-Lab-IITM/data2vec-aqc
cd data2vec-aqc
pip install --editable ./
  • For faster training install NVIDIA's apex library:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
  • For large datasets install PyArrow: pip install pyarrow

  • If you use Docker, make sure to increase the shared memory size, either with --ipc=host or --shm-size, as command-line options to nvidia-docker run.

  • For the augmentations to work, install torchaudio-augmentations:

git clone https://github.com/Speech-Lab-IITM/torchaudio-augmentations
cd torchaudio-augmentations
pip install --editable ./
  • The clustering module runs on GPU and requires fast-pytorch-kmeans: pip install fast-pytorch-kmeans (see the sketch below).
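
To illustrate the role of fast-pytorch-kmeans in the clustering module, the following sketch clusters a set of quantized representations on the available device and builds a boolean mask marking negatives that fall into the same cluster as their positive, which is the condition under which their contribution to the contrastive loss is diminished. The shapes, cluster count, and index sampling are illustrative assumptions, not the repository's code.

# Sketch: cluster quantized vectors with fast-pytorch-kmeans and flag negatives
# sharing a cluster with the positive.  Sizes and cluster count are placeholders;
# the repository's knobs for this module are cluster_factor and scale_factor.
import torch
from fast_pytorch_kmeans import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"
quantized = torch.randn(2000, 256, device=device)               # (frames, dim) quantized vectors
labels = KMeans(n_clusters=50, mode="euclidean").fit_predict(quantized)  # cluster id per frame

pos_idx = torch.randint(0, 2000, (128,), device=device)         # positive index per anchor
neg_idx = torch.randint(0, 2000, (128, 100), device=device)     # 100 sampled negatives per anchor

# Negatives in the same cluster as the positive are treated as "weak" negatives,
# and their similarity scores can be scaled down before the contrastive softmax.
same_cluster = labels[neg_idx] == labels[pos_idx].unsqueeze(1)  # (128, 100) boolean mask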

Parameters of interest

  • The cluster_factor and scale_factor parameters of the clustering module can be modified in the model section of the pre-training configs.
  • The augmentations used in data2vec-aqc require the noise subset of the MUSAN dataset. Its path must be specified in the path_to_musan_noise_set variable of the __getitem__ method of the raw_audio_dataset file (an illustrative sketch follows).
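
As a purely illustrative example of what the MUSAN noise set is used for, the sketch below mixes a randomly chosen noise clip into an utterance at a random signal-to-noise ratio using plain torchaudio. The function name, directory layout, and SNR range are hypothetical; in the repository the augmentation is wired through torchaudio-augmentations and the path_to_musan_noise_set variable mentioned above.

# Hypothetical illustration of additive MUSAN-noise augmentation: mix a randomly
# chosen noise clip into the utterance at a random SNR.  The directory layout,
# SNR range, and helper function are placeholders, not the repository's pipeline.
import glob
import os
import random
import torch
import torchaudio

def add_musan_noise(wav, musan_noise_dir, snr_db_range=(5.0, 15.0)):
    # wav: (1, T) waveform tensor; returns the waveform with noise mixed in.
    noise_files = glob.glob(os.path.join(musan_noise_dir, "**", "*.wav"), recursive=True)
    noise, _ = torchaudio.load(random.choice(noise_files))
    noise = noise[:1, :]                               # keep a single channel
    if noise.shape[1] < wav.shape[1]:                  # loop the noise if it is too short
        noise = noise.repeat(1, wav.shape[1] // noise.shape[1] + 1)
    noise = noise[:, : wav.shape[1]]
    snr_db = random.uniform(*snr_db_range)
    wav_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(wav_power / (noise_power * 10 ** (snr_db / 10)))
    return wav + scale * noise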

Reference Code

  1. fairseq: Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
