# Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

## Introduction

Whisper is a pre-trained model for automatic speech recognition (ASR) published in [September 2022](https://openai.com/blog/whisper/) by Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as [Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained on unlabelled audio data, Whisper is pre-trained on a vast quantity of __labelled__ audio-transcription data: 680,000 hours, to be precise. This is an order of magnitude more data than the unlabelled audio used to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this pre-training data is multilingual ASR data. This results in checkpoints that can be applied to over 96 languages, many of which are considered _low-resource_.

This quantity of labelled data enables Whisper to be pre-trained directly on the _supervised_ task of speech recognition, learning a speech-to-text mapping from the labelled audio-transcription pre-training data<sup>1</sup>. As a consequence, Whisper requires little additional fine-tuning to yield a performant ASR model. This is in contrast to Wav2Vec 2.0, which is pre-trained on the _unsupervised_ task of masked prediction. Here, the model learns an intermediate mapping from speech to hidden states using only unlabelled audio. While unsupervised pre-training yields high-quality representations of speech, it does __not__ learn a speech-to-text mapping. This mapping is only learned during fine-tuning, so more fine-tuning is required to reach competitive performance.

------------------------------------------------------------------------

\\({}^1\\) The name Whisper follows from the acronym “WSPSR”, which stands for “Web-scale Supervised Pre-training for Speech Recognition”.

When scaled to 680,000 hours of labelled pre-training data, Whisper models demonstrate a strong ability to generalise to many datasets and domains. The pre-trained checkpoints achieve results competitive with state-of-the-art ASR systems, with near 3% word error rate (WER) on the test-clean subset of LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (_cf._ Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)). The extensive multilingual ASR knowledge acquired by Whisper during pre-training can be leveraged for other low-resource languages; through fine-tuning, the pre-trained checkpoints can be adapted for specific datasets and languages to further improve upon these results.
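Since WER is the metric quoted throughout, it helps to see how it is computed. The snippet below is a minimal sketch in pure Python: it counts the word-level substitutions, deletions and insertions needed to turn the prediction into the reference, and divides by the number of reference words. The function name `word_error_rate` and the example sentences are our own illustrations, and real evaluations usually normalise the text (casing, punctuation) before scoring, which we skip here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words (word-level Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from any dataset).
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}")  # one substitution + one deletion over six words -> WER: 33.3%
```

A WER of 3% therefore means roughly three word-level errors per hundred reference words.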

Whisper is a Transformer-based encoder-decoder model, also referred to as a _sequence-to-sequence_ model. It maps a _sequence_ of audio spectrogram features to a _sequence_ of text tokens. First, the feature extractor converts the raw audio inputs to a log-Mel spectrogram. The Transformer encoder then encodes the spectrogram to form a sequence of encoder hidden states. Finally, the decoder autoregressively predicts text tokens, conditioned on both the previous tokens and the encoder hidden states. Figure 1 summarises the Whisper model.

<figure>
<img src="./img/whisper-architecture.svg" alt="Whisper architecture" style="width:100%">
<figcaption align = "center"><b>Figure 1:</b> Whisper model. The architecture follows the standard Transformer-based encoder-decoder model. A log-Mel spectrogram is input to the encoder. The last encoder hidden states are input to the decoder via cross-attention mechanisms. The decoder autoregressively predicts text tokens, jointly conditional on the encoder hidden states and previously predicted tokens. Figure source: <a href="https://openai.com/blog/whisper/">OpenAI Whisper Blog</a>.</figcaption>
</figure>
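To make "autoregressively predicts text tokens" concrete, here is a minimal sketch of a greedy decoding loop. Everything in it is a toy stand-in: `toy_decoder_step`, the special-token ids `BOS` and `EOS`, and the integer "encoder states" are hypothetical and bear no relation to Whisper's real vocabulary or weights. Greedy selection is also just one possible decoding strategy, chosen here for simplicity. Only the control flow is the point: each new token is predicted from the encoder states plus all tokens generated so far.

```python
from typing import Callable, List, Sequence

BOS, EOS = 0, 1  # hypothetical start / end-of-sequence token ids


def greedy_decode(
    encoder_states: Sequence[int],
    decoder_step: Callable[[Sequence[int], Sequence[int]], List[float]],
    max_new_tokens: int = 10,
) -> List[int]:
    """Generate tokens one at a time, feeding previous predictions back in."""
    tokens = [BOS]
    for _ in range(max_new_tokens):
        # The decoder sees the encoder states and every token predicted so far.
        logits = decoder_step(encoder_states, tokens)
        # Greedy choice: pick the highest-scoring token in the vocabulary.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


def toy_decoder_step(encoder_states: Sequence[int], tokens: Sequence[int]) -> List[float]:
    """Toy stand-in for a decoder: copies the encoder 'states' in order, then emits EOS."""
    vocab_size = 6
    position = len(tokens) - 1  # number of tokens generated after BOS
    target = encoder_states[position] if position < len(encoder_states) else EOS
    logits = [0.0] * vocab_size
    logits[target] = 1.0
    return logits


encoder_states = [3, 5, 2]  # pretend output of the encoder
decoded = greedy_decode(encoder_states, toy_decoder_step)
print(decoded)  # [0, 3, 5, 2, 1]
```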

In a sequence-to-sequence model, the encoder transforms the audio inputs into a set of hidden state representations, extracting important features from the speech. The decoder plays the role of a language model, processing the hidden state representations and generating the corresponding text transcriptions. Incorporating a language model __internally__ in the system architecture is termed _deep fusion_. This is in contrast to _shallow fusion_, where a language model is combined __externally__ with an encoder, such as with CTC + $n$-gram (_cf._ [Internal Language Model Estimation](https://arxiv.org/pdf/2011.01991.pdf)). With deep fusion, the entire system can be trained end-to-end with the same training data and loss function, giving greater flexibility and generally superior performance (_cf._ [ESB Benchmark](https://arxiv.org/abs/2210.13352)).
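For contrast, shallow fusion is commonly formulated as re-scoring candidate transcriptions with a weighted sum of the acoustic model's log-probability and an external language model's log-probability. The sketch below illustrates that idea only. The probabilities, the candidate sentences and the weight `lm_weight` are made-up values for illustration, not outputs of any real system.

```python
import math


def shallow_fusion_score(asr_log_prob: float, lm_log_prob: float, lm_weight: float = 0.5) -> float:
    """Combine the ASR score with an external LM score at inference time."""
    return asr_log_prob + lm_weight * lm_log_prob


# Made-up scores for two candidate transcriptions of the same audio.
hypotheses = {
    "recognise speech": {"asr": math.log(0.40), "lm": math.log(0.30)},
    "wreck a nice beach": {"asr": math.log(0.45), "lm": math.log(0.01)},
}

# Without the external LM, the acoustically likelier candidate wins.
best_asr_only = max(hypotheses, key=lambda h: hypotheses[h]["asr"])

# With shallow fusion, the external LM down-weights the implausible sentence.
best_fused = max(
    hypotheses,
    key=lambda h: shallow_fusion_score(hypotheses[h]["asr"], hypotheses[h]["lm"]),
)

print(best_asr_only)  # wreck a nice beach
print(best_fused)     # recognise speech
```

Note that the two models here are trained separately and only meet at inference time, whereas a deep-fusion system such as Whisper learns the acoustic and linguistic components jointly.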

Whisper is pre-trained and fine-tuned using the cross-entropy objective function, a standard objective function for training sequence-to-sequence systems on classification tasks. Here, the system is trained to correctly classify the target text token from a pre-defined vocabulary of text tokens.
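As a rough illustration of this objective, the sketch below computes the cross-entropy loss by hand for a two-token target over a hypothetical four-token vocabulary. The logits and target ids are invented for the example. In practice a deep learning framework computes this loss over the full vocabulary, but the arithmetic is the same: take the softmax of the logits at each step, and average the negative log-probability assigned to the correct token.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw scores into a probability distribution over the vocabulary."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_step: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Mean negative log-probability of the correct token at each decoding step."""
    losses = [
        -math.log(softmax(logits)[target])
        for logits, target in zip(logits_per_step, target_ids)
    ]
    return sum(losses) / len(losses)


# Invented decoder outputs: one row of logits per target token, vocabulary of size 4.
logits_per_step = [
    [2.0, 0.5, 0.1, 0.1],  # step 1: the model favours token 0
    [0.1, 0.2, 3.0, 0.1],  # step 2: the model favours token 2
]
target_ids = [0, 2]  # the correct tokens

loss = cross_entropy(logits_per_step, target_ids)

# Baseline: a model that guesses uniformly scores log(vocab_size).
uniform_loss = cross_entropy([[0.0] * 4, [0.0] * 4], target_ids)

print(f"loss: {loss:.3f}, uniform baseline: {uniform_loss:.3f}")
```

The loss falls towards zero as the model puts more probability mass on the correct token at every step.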

The Whisper checkpoints come in five model sizes. The smallest four are available in both English-only and multilingual versions. The largest size (large, large-v2 and large-v3) is multilingual only. All 11 of the pre-trained checkpoints are available on the [Hugging Face Hub](https://huggingface.co/models?search=openai/whisper). The checkpoints are summarised in the following table with links to the models on the Hub:

|   Size   | Layers | Width | Heads | Parameters |                     English-only                     |                    Multilingual                     |
|:--------:|:------:|:-----:|:-----:|:----------:|:----------------------------------------------------:|:---------------------------------------------------:|
|   tiny   |   4    |  384  |   6   |    39 M    |  [✓](https://huggingface.co/openai/whisper-tiny.en)  |   [✓](https://huggingface.co/openai/whisper-tiny)   |
|   base   |   6    |  512  |   8   |    74 M    |  [✓](https://huggingface.co/openai/whisper-base.en)  |   [✓](https://huggingface.co/openai/whisper-base)   |
|  small   |   12   |  768  |  12   |   244 M    | [✓](https://huggingface.co/openai/whisper-small.en)  |  [✓](https://huggingface.co/openai/whisper-small)   |
|  medium  |   24   | 1024  |  16   |   769 M    | [✓](https://huggingface.co/openai/whisper-medium.en) |  [✓](https://huggingface.co/openai/whisper-medium)  |
|  large   |   32   | 1280  |  20   |   1550 M   |                          x                           |  [✓](https://huggingface.co/openai/whisper-large)   |
| large-v2 |   32   | 1280  |  20   |   1550 M   |                          x                           | [✓](https://huggingface.co/openai/whisper-large-v2) |
| large-v3 |   32   | 1280  |  20   |   1550 M   |                          x                           | [✓](https://huggingface.co/openai/whisper-large-v3) |

For demonstration purposes, we'll fine-tune the multilingual version of the [small](https://huggingface.co/openai/whisper-small) checkpoint with 244M parameters (≈ 1 GB). As for our data, we'll train and evaluate our system on a low-resource language taken from the [Common Voice dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0). We'll show that with as little as 8 hours of fine-tuning data, we can achieve strong performance in this language.
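The roughly 1 GB figure follows from simple arithmetic if we assume the weights are stored in 32-bit floating point (4 bytes per parameter). The sketch below applies that assumption to the parameter counts from the table. The helper `weight_memory_gb` is our own, and it covers the weights only: activations, gradients and optimizer states during fine-tuning need additional memory.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 4) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes).

    Assumes float32 storage (4 bytes per parameter) unless told otherwise.
    """
    return num_parameters * bytes_per_parameter / 1e9


# Parameter counts taken from the checkpoint table.
parameter_counts = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large": 1550e6,
}

for size, count in parameter_counts.items():
    print(f"{size:>6}: {weight_memory_gb(count):.2f} GB")

small_gb = weight_memory_gb(parameter_counts["small"])  # 244e6 * 4 bytes = 0.976 GB
```

Under the same assumption, storing the weights in half precision (2 bytes per parameter) would halve these figures.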