
Improving Large-Scale Speech Recognition Robustness by Language Specialization

IPOL DEMO

Latest version: 1.0

The recently released Whisper model proposes multi-task, weakly supervised training on a large-scale dataset collected from the internet. Although the model presents important gains, especially in English, its robustness and limitations in multilingual and low-resource scenarios have not yet been sufficiently explored. In this work, we present a detailed description of this new model; we propose a method to mitigate the performance gap observed in low-resource languages; and we perform an in-depth study of its robustness across different noise types and languages. The results show the effectiveness of language specialization in reducing the performance gap in speech transcription and even in increasing the model's robustness on noisy data.

The code is divided mainly into two scripts: train_eval.py contains the main pipeline for fine-tuning and evaluation, and data_utils.py contains the preprocessing functions as well as the DataCollator.
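For orientation, a DataCollator in a Whisper fine-tuning pipeline typically pads the log-Mel input features and the tokenized labels separately before batching. The sketch below shows this standard pattern, assuming a Hugging Face WhisperProcessor; the class name and details are illustrative and may differ from the actual data_utils.py.

# Standard speech-to-text data collator pattern (illustrative; the exact
# implementation in data_utils.py may differ).
from dataclasses import dataclass
from typing import Any, Dict, List

import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any              # e.g., transformers.WhisperProcessor
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Pad the log-Mel input features and the label token IDs separately,
        # since they have different lengths and padding rules.
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding with -100 so padded positions are ignored by the loss.
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If the start-of-transcript token was already added during preprocessing,
        # drop it here; the model prepends it again at training time.
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch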

Prerequisites

pip install -r requirements.txt

Finetune Whisper

Usage

python train_eval.py --size tiny --dataset google/fleurs --lang fr --output_dir tiny_fr  --normalize "lower" --train 1 
Arguments

  • size: Whisper model size to use (e.g., tiny, small)
  • finetuned: Whether to use the pre-trained model (0) or the fine-tuned one (1)
  • tokenizer_name: The initial tokenizer
  • dataset: The dataset used to run the code (e.g., google/fleurs)
  • lang: The language code
  • task: The task on which the model is fine-tuned
  • output_dir: The directory where the model is saved
  • cpu_mode: Runs on CPU if set to 1
  • per_device_train_batch_size: Batch size per device during training
  • gradient_accumulation_steps: Number of gradient accumulation steps
  • learning_rate: Learning rate
  • warmup_steps: Number of warmup steps
  • weight_decay: Weight decay
  • max_steps: Maximum number of training steps
  • fp16: Uses mixed precision if set to 1
  • per_device_eval_batch_size: Batch size per device during evaluation
  • eval_steps: Number of steps between evaluations
  • logging_steps: Number of steps between log messages
  • fix_forced_decoder_ids: Fixes the language token when using the model
  • train: Fine-tunes if set to 1, evaluates only if set to 0
  • patience: Patience before early stopping
  • use_peft: Uses PEFT if set to 1
  • early_stopping_threshold: Threshold for applying early stopping
  • eval_robustness: Runs multiple evaluations if set to 1
  • degradations_path: Path to the JSON file specifying the degradations
  • debug: Debug mode on 10 samples if set to 1
  • normalize: Text normalization to apply (e.g., lower)
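The training flags above follow the naming of the Hugging Face Seq2SeqTrainingArguments API. As a rough, hedged illustration of how such flags are typically wired up (not the exact code in train_eval.py, and with placeholder values):

# Rough illustration of how the CLI flags above typically map onto
# Hugging Face Seq2SeqTrainingArguments; values are placeholders and
# the exact wiring in train_eval.py may differ.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="tiny_fr",                  # --output_dir
    per_device_train_batch_size=16,        # --per_device_train_batch_size
    gradient_accumulation_steps=1,         # --gradient_accumulation_steps
    learning_rate=1e-5,                    # --learning_rate
    warmup_steps=500,                      # --warmup_steps
    weight_decay=0.0,                      # --weight_decay
    max_steps=4000,                        # --max_steps
    fp16=True,                             # --fp16 1
    per_device_eval_batch_size=8,          # --per_device_eval_batch_size
    evaluation_strategy="steps",           # evaluate every eval_steps
    eval_steps=500,                        # --eval_steps
    logging_steps=25,                      # --logging_steps
    predict_with_generate=True,            # needed to compute WER during evaluation
    load_best_model_at_end=True,           # required for early stopping (--patience)
)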

Evaluate Whisper with Degradations

You can easily apply various degradations to a dataset to evaluate Whisper's robustness.

Prerequisites

The sox library should be installed separately with apt-get.

apt-get update && apt-get install -y sox

Usage

Single Evaluation

This is an example of the command line for a single evaluation of Whisper on a dataset.

python train_eval.py  --lang fr --output_dir results_fr --degradations degradations.json  --normalize lower --train 0 --fix_forced_decoder_ids 1

The degradations should be specified in a JSON file, similar to degradations.json. It contains a list with a single dictionary, giving the degradation string and the probability of applying it to the audio in the dataset, which should be 1 in this case. More degradation strings can be found in the audio_degrader repository.

[
  {
    "degradation": ["mix,sounds/ambience-pub.wav,6"],
    "prob": 1
  }
]
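In the example above, the last field of the mix degradation (6) is the target signal-to-noise ratio in dB: the noise file is scaled and added to the speech so that the mixture reaches that SNR. Below is a conceptual sketch of this operation, assuming NumPy arrays; the actual processing in this repo is done by the audio_degrader package.

# Conceptual sketch of mixing a noise file into speech at a target SNR (dB);
# the actual degradations in this repo are applied by audio_degrader.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Loop or trim the noise so it matches the length of the speech signal.
    noise = np.resize(noise, speech.shape)

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12

    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise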
Multiple Evaluations

For multiple evaluations, the option eval_robustness should be set to 1.

python train_eval.py  --lang fr --output_dir results_dr --degradations evaluate_robustness.json --eval_robustness 1 --normalize "lower" --train 0  --fix_forced_decoder_ids 0

For this task, the JSON file should have a different format, similar to evaluate_robustness.json. Here the degradation string is split on its commas into a name, a param1 and, when present, a param2 (a sketch of how this grid is expanded follows the example).

[
  {
    "name": "mix",
    "param1":{
        "values": ["sounds/white-noise.wav",
                  "sounds/ambience-pub.wav", 
                  "sounds/helen.wav"],
        "name": "noise"
    },
    "param2":{
        "values": [100,40,35,30,25,20,15,10,5,0,-5,-10],
        "name": "snr"
    }
  }
]
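Put differently, every combination of param1 and param2 values is turned back into a comma-separated degradation string (e.g. mix,sounds/white-noise.wav,100) and evaluated in turn. A rough sketch of that expansion, for illustration only (not the exact code in train_eval.py):

# Illustrative expansion of evaluate_robustness.json into individual
# degradation strings; the exact logic in train_eval.py may differ.
import itertools
import json

with open("evaluate_robustness.json") as f:
    config = json.load(f)

degradation_strings = []
for entry in config:
    param1_values = entry["param1"]["values"]
    param2_values = entry.get("param2", {}).get("values", [None])
    for p1, p2 in itertools.product(param1_values, param2_values):
        parts = [entry["name"], str(p1)] + ([str(p2)] if p2 is not None else [])
        degradation_strings.append(",".join(parts))

# e.g. ["mix,sounds/white-noise.wav,100", "mix,sounds/white-noise.wav,40", ...]
print(degradation_strings[:3])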

Authors

  • Thomas Palmeira Ferraz - thomas [dot] palmeira [at] telecom [hyphen] paris [dot] fr
  • Helene Maxcici - helene [dot] maxcici [at] ens [hyphen] paris [hyphen] saclay [dot] fr
  • Teysir Baoueb - teysir [dot] baoueb [at] ensta [hyphen] paris [dot] fr

Copyright and license information

Copyright (c) 2023 Thomas Palmeira Ferraz, Helene Maxcici, Teysir Baoueb

This code is licensed under the Apache License, Version 2.0 (the "License"); you may not use this code except in compliance with the License.

You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
