# **Automatic Speech Recognition (ASR) Model Selection**

### In this notebook, I apply targeted tests, written against our project's objectives and standards, to a list of the top ASR models in order to nominate the best ones for us. If the best performance isn't good enough, we will need to fine-tune the models and test them again in another notebook.

### But before doing so, I will research each model in sufficient depth.

## **0️⃣ Table Of Contents**
1. [Standards](#standards)  
2. [ASR Models](#asr-models)  
    2.1 [Whisper (OpenAI)](#asr-models-whisper)  
    2.2 [Wav2Vec 2.0 (Meta)](#asr-models-wav2vec)  
    2.3 [Parakeet TDT 0.6B v2 (Nvidia)](#asr-models-parakeet)  
    2.4 [Canary 1B (Nvidia)](#asr-models-canary)  
3. [Datasets](#datasets)  
    3.1 [Common Voice](#datasets-common-voice)  
    3.2 [Librispeech](#datasets-librispeech)  
    3.3 [TED-LIUM](#datasets-tedlium)  
    3.4 [MathSpeech](#datasets-mathspeech)  
4. [Preprocessing](#preprocessing)  
5. [Evaluation Metrics](#evaluation-metrics)  
6. [Preparing The Testing Codes](#preparing-the-testing-codes)  
    6.1 [Login To Hugging Face](#preparing-the-testing-codes-login-to-hugging-face)  
    6.2 [Create Classes For The ASR Models](#preparing-the-testing-codes-create-classes-for-the-asr-models)  
    6.3 [Dataset Class](#preparing-the-testing-codes-datasets-class)  
    6.4 [Preprocessors](#preparing-the-testing-codes-preprocessors)  
    6.5 [Testing Loop](#preparing-the-testing-codes-testing-loop)  
    6.6 [Start The Tests](#preparing-the-testing-codes-start-the-tests)  
    6.7 [Important Notes](#preparing-the-testing-codes-important-notes)  
7. [Results](#results)  

## **1️⃣ Standards** <a id="standards"></a>

##### <font color='#D55'>Note: The following standards are not final, and they are open to discussion and change</font>

### The following are the standards that models will be competing for:
- #### Robust to noise
- #### Immune to accents (even poor ones)
- #### High-accuracy transcription
- #### Handles specialized topics and irregular language, such as mathematical expressions, code, etc.
- #### Not expensive for us
- #### Multilingual <font color='#080'># This is ignored for now</font>
- #### Fine-tunable <font color='#080'># This one will be ignored if the models perform well enough</font>

### All models will be tested based on the previous standards as shown in the "Models Evaluation" section

## **2️⃣ ASR Models** <a id="asr-models"></a>

### This section provides a list of the most famous ASR models each with a detailed table and a brief description

#### The nominated models must be open-weight and have competitive performance. Older and poorly performing ASR models were left out.

### 1. Whisper (OpenAI) <a id="asr-models-whisper"></a>
> Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

| Feature | Details |
|----------|----------|
| License  | MIT |
| Weights  | open-weights, <font color='#D55'>need other forks for fine-tuning and<br> it is resource-intensive</font> |
| Strengths  | High-accuracy transcription, multilingual, translator,<br> robust to accents, good with non-English, trained on noise,<br> supports timestamps |
| Weaknesses  | not fine-tuned on special topics |
| Parameters  | 39 M, 74 M, 244 M, 769 M, 809 M, 1550 M |

### Sizes

| Size | Parameters | Relative speed |
|----------|:----------:|:----------:|
| tiny | 39 M | ~10x |
| base | 74 M | ~7x |
| small | 244 M | ~4x |
| medium | 769 M | ~2x |
| turbo | 809 M | ~8x |
| large | 1550 M | 1x |

### References: 
- https://openai.com/index/whisper/
- https://github.com/openai/whisper?tab=readme-ov-file
- https://huggingface.co/blog/fine-tune-whisper#closing-remarks

### 2. Wav2Vec 2.0 (Meta) <a id="asr-models-wav2vec"></a>
> A Framework for Self-Supervised Learning of Speech Representations.

> We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

| Feature | Details |
|----------|----------|
| License  | MIT |
| Weights  | open-weights, fine-tunable |
| Strengths  | requires little to no transcribed data (self-supervision), so it<br> can learn on unlabeled data |
| Weaknesses  | Originally English-only |
| Parameters  | 95 M, 317 M |

### Variations

| Model | Parameters | Training | Link |
|----------|:----------:|----------|:----------:|
| facebook/wav2vec2-base | 95 M | Pretrained on unlabeled data, no fine-tuning. | [Link](https://huggingface.co/facebook/wav2vec2-base) |
| facebook/wav2vec2-base-960h | 95 M | Fine-tuned wav2vec2-base on 960 hours of LibriSpeech | [Link](https://huggingface.co/facebook/wav2vec2-base-960h) |
| facebook/wav2vec2-large | 317 M | Pretrained on unlabeled data, no fine-tuning. | [Link](https://huggingface.co/facebook/wav2vec2-large) |
| facebook/wav2vec2-large-960h | 317 M | Fine-tuned wav2vec2-large on 960 hours of LibriSpeech | [Link](https://huggingface.co/facebook/wav2vec2-large-960h) |
| facebook/wav2vec2-large-960h-lv60-self | 317 M | Self-trained on Libri-Light 60k hours + fine-tuned on 960h | [Link](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self) |
| facebook/wav2vec2-large-xlsr-53 | 317 M | Trained on 56k hours of multilingual unlabeled audio (53 languages),<br>no fine-tuning | [Link](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) |

### References: 
- https://ai.meta.com/research/impact/wav2vec/
- https://arxiv.org/abs/2006.11477
- https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec#wav2vec-20

### 3. Parakeet TDT (Nvidia) <a id="asr-models-parakeet"></a>
> An automatic speech recognition (ASR) model designed for high-quality English transcription, featuring support for punctuation, capitalization, and accurate timestamp prediction.

| Feature | Details |
|----------|----------|
| License  | CC‑BY‑4.0 |
| Weights  | open-weights |
| Strengths  | Efficiently transcribes long audio segments, robust<br> performance on spoken numbers and song lyrics transcription,<br> robust to noise |
| Weaknesses  | English-only |
| Parameters  | 600 M, 1.1 B |

### Variations

| Model | Parameters | Link |
|----------|:----------:|:----------:|
| nvidia/parakeet-tdt-0.6b-v2 | 0.6 B | [Link](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) |
| nvidia/parakeet-tdt-1.1b | 1.1 B | [Link](https://huggingface.co/nvidia/parakeet-tdt-1.1b) |
| nvidia/parakeet-rnnt-1.1b | 1.1 B | [Link](https://huggingface.co/nvidia/parakeet-rnnt-1.1b) |
| nvidia/parakeet-ctc-0.6b | 0.6 B | [Link](https://huggingface.co/nvidia/parakeet-ctc-0.6b) |
| nvidia/parakeet-ctc-1.1b | 1.1 B | [Link](https://huggingface.co/nvidia/parakeet-ctc-1.1b) |


### References: 
- https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2
- https://huggingface.co/nvidia/parakeet-tdt-1.1b
- https://huggingface.co/nvidia/parakeet-rnnt-1.1b
- https://huggingface.co/nvidia/parakeet-ctc-0.6b
- https://huggingface.co/nvidia/parakeet-ctc-1.1b

### 4. Canary (Nvidia) <a id="asr-models-canary"></a>

### <font color='#D55'>This one will be ignored for now. If the other models don't perform well enough, I will try it again. I spent a lot of valuable time on it, but ran into endless compatibility issues with every API I tried</font>

> A multilingual model that transcribes speech in English, Spanish, German, and French with punctuation and capitalization. Canary also provides bi-directional translation, between English and the three other supported languages.

| Feature | Details |
|----------|----------|
| License  | CC BY-NC 4.0 |
| Weights  | open-weights |
| Strengths  | 4 languages, translator, high accuracy |
| Weaknesses  | GPU-heavy |
| Parameters  | 883 M, 1 B |

### Variations

| Model | Parameters | Link |
|----------|:----------:|:----------:|
| nvidia/canary-1b | 1 B | [Link](https://huggingface.co/nvidia/canary-1b) |
| nvidia/canary-1b-flash | 883 M | [Link](https://huggingface.co/nvidia/canary-1b-flash) |

### References: 
- https://developer.nvidia.com/blog/new-standard-for-speech-recognition-and-translation-from-the-nvidia-nemo-canary-model/

## **3️⃣ Datasets** <a id="datasets"></a>

### These are the datasets that I am going to test the models on. I keep them small because this is only a test, and there is no need to spend much time on it.

### **1. Common Voice** <a id="datasets-common-voice"></a>

The Common Voice dataset is an open-source collection of voice recordings contributed by volunteers from around the world. Created by Mozilla, it includes transcribed speech data in dozens of languages and accents, making it one of the largest multilingual datasets available for training automatic speech recognition (ASR) systems. The dataset is freely available under the CC-0 license, encouraging both academic and commercial use. It aims to support inclusive and diverse voice technology by representing different accents, demographics, and speaking styles.

The dataset I will be using is a subset of the Common Voice 13.0 dataset that I created. The subset contains only the English test split, which shrinks the data significantly: 16,372 samples with a total size of 672 MB. Since I am currently using the data only for testing, this speeds up the process and minimizes the resources needed. For now I only want some data to test the models on, so there is no need for a large dataset.

The subset can be found here: https://huggingface.co/datasets/BinAlsadiq/common_voice_13_0_en_test

The original dataset: https://commonvoice.mozilla.org/en/datasets

### **Features & Characteristics**

- **Sourced from**<br>
    Volunteers are free to contribute by recording themselves reading public domain (CC0) sentences in their languages and accents. The public domain (CC0) sentences are typically sourced from:
    - Wikipedia
    - Public domain books
    - Government websites
    - News articles
    - User-submitted content
    - Other public domain sources
    
- **Typical Topics**<br>
    Technology, Science, Weather, History, Education, Geography, Daily Life, Health, Transportation, etc.
- **Contains a huge, unbalanced variety of accents and supports many languages**<br>
    - Supports 137 different languages
    - A single sample can have multiple accents because speakers self-report their accent(s), and the system allows tagging multiple — even if not all are equally reflected in each recording.
    - Illustrative Example: A bilingual speaker raised in different regions might reflect a mix.
- **Noise**<br>
    The Common Voice dataset intentionally includes a variety of real-world background noises to make automatic speech recognition (ASR) more robust.<br>
    
    The accepted noises might include:
    - Traffic noise
    - Indoor chatter or distant conversations
    - Household sounds (TV, cooking, machinery)
    - Quiet background music
    - Natural ambient sounds (birds, wind, etc.)

    The unaccepted noises might include:
    - Excessive loud background noise
    - Overlapping conversations interfering with the clip
    - Cracking, distortion, static, or dropouts that obscure speech 
- **Censorship**<br>
    - Sentences are reviewed and curated
    - No hate speech, obscenity, or private data is allowed

- **Average Clip Duration**<br>
    The average clip duration is approximately 5s.
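
An average like this can be estimated from the decoded audio itself, since a clip's length in seconds is its number of samples divided by its sampling rate. The following is a minimal sketch, assuming each sample exposes an `audio` dictionary with `array` and `sampling_rate` keys; the helper names `clip_duration` and `average_duration` and the toy clips are illustrative and not part of the dataset tooling:

```Python
def clip_duration(audio):
    """Return the clip length in seconds: number of samples / sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


def average_duration(samples):
    """Average clip duration in seconds over an iterable of samples."""
    durations = [clip_duration(sample["audio"]) for sample in samples]
    return sum(durations) / len(durations)


# Toy samples (assumed data): silent clips of 4 s and 6 s at 16 kHz.
toy_samples = [
    {"audio": {"array": [0.0] * 64000, "sampling_rate": 16000}},
    {"audio": {"array": [0.0] * 96000, "sampling_rate": 16000}},
]
print(average_duration(toy_samples))  # 5.0
```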

### **Some Issues To Consider**
- **capitalization:** Not all sentences contain correct capitalization. For example:<br>
    - A sentence starts with a capital letter in: common_voice_en_18276812.mp3 : "What a strange mauve colour!"
    - But here it doesn't: common_voice_en_93276.mp3 : "a boy enjoys a rain shower."

    A more serious example, where all letters are written in capitals: <br>
    - common_voice_en_572372.mp3 : "YOU WANNA TAKE THIS OUTSIDE?"

These issues are serious and will affect how the models' predictions are scored. Thus, I will need to deal with them, as will be shown later.
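
One common way to keep such casing differences from being counted as errors is to normalize both the reference transcript and the model output before comparing them. The following is a minimal sketch, assuming that lowercasing and removing punctuation are acceptable for this project; `normalize_text` is a hypothetical helper, not necessarily the preprocessing this notebook ends up using:

```Python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(text):
    """Lowercase, drop ASCII punctuation, and collapse repeated whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("What a strange mauve colour!"))  # what a strange mauve colour
print(normalize_text("YOU WANNA TAKE THIS OUTSIDE?"))  # you wanna take this outside
print(normalize_text("a boy enjoys a rain shower."))   # a boy enjoys a rain shower
```

Note that this sketch also removes apostrophes (for example, "I'd" becomes "id"), which may or may not be desirable depending on how contractions should be scored.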

### **Missing Features**
The dataset doesn't contain irregular language such as mathematical expressions.

### **Dataset Structure & Details**<br>
Everything I mention here is about my version of the dataset, and it might be slightly different from the original.

The dataset contains 16,372 samples, each sample contains the following values:
- **path:** The audio file name corresponding to this sample. For example, common_voice_en_57.mp3
- **audio:** A dictionary containing three values:
    - **path:** The absolute path to the corresponding audio file
    - **array:** An array of floats representing the loaded audio file
    - **sampling_rate:** The sampling rate of the corresponding audio file
- **sentence:** Transcription
- **up_votes:** Total number of up votes made by reviewers
- **down_votes:** Total number of down votes made by reviewers
- **age:** Contributor's age
- **gender:** Contributor's gender
- **accents:** Contributor's accents<br>
    As previously mentioned:
    - A single sample can have multiple accents because speakers self-report their accent(s), and the system allows tagging multiple — even if not all are equally reflected in each recording
    - Illustrative Example: A bilingual speaker raised in different regions might reflect a mix
- **locale:** The language code of the audio sample, such as en for English, and et for Estonian
- **segment:** marks whether a clip belongs to a specific special-purpose sub-corpus beyond the main dataset

The values path, audio, sentence, up_votes, and down_votes are always present, while the rest may be empty.

Reviewers can check the samples and vote on each one, either up, indicating that the transcription is correct, or down, indicating that the transcription is incorrect or the clip has poor quality. Every sample in this dataset has at least one more up vote than down votes.
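
The vote rule can be written as a simple predicate. The following is a minimal sketch, assuming samples are dictionaries with `up_votes` and `down_votes` fields as described above; `is_validated`, its `min_margin` parameter, and the toy rows are illustrative, and this is not necessarily how the subset was actually built:

```Python
def is_validated(sample, min_margin=1):
    """True when up votes exceed down votes by at least `min_margin`."""
    return sample["up_votes"] - sample["down_votes"] >= min_margin


# Toy rows (assumed data) mimicking the vote fields of the dataset.
rows = [
    {"path": "a.mp3", "up_votes": 3, "down_votes": 1},
    {"path": "b.mp3", "up_votes": 2, "down_votes": 2},
    {"path": "c.mp3", "up_votes": 2, "down_votes": 0},
]
kept = [row["path"] for row in rows if is_validated(row)]
print(kept)  # ['a.mp3', 'c.mp3']
```

With the Hugging Face `datasets` library, the same predicate could be passed to `Dataset.filter` to check or enforce the rule on the real data.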

### **References**
https://github.com/common-voice/sentence-collector<br>
https://www.mozillafoundation.org/en/blog/the-new-common-voice-sentence-collector/<br>
https://www.mozillafoundation.org/en/blog/guidance-for-splinter-datasets-on-mcv/<br>

### **Exploring The Dataset**

- #### **Download It**
```Python
from datasets import load_dataset

# download the dataset from hugging face
dataset = load_dataset("BinAlsadiq/common_voice_13_0_en_test", split='test')
dataset
```
**output:**
```Python
Dataset({
    features: ['path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accents', 'locale', 'segment'],
    num_rows: 16372
})
```

- #### **Dataset Format**
```Python
sample = dataset[0]  # index the row first so only this clip's audio is decoded

print(f"path: {sample['path']}\n")
print(f"audio: {sample['audio']}\n")
print(f"sentence: {sample['sentence']}\n")
print(f"up_votes: {sample['up_votes']}\n")
print(f"down_votes: {sample['down_votes']}\n")
print(f"age: {sample['age']}\n")
print(f"gender: {sample['gender']}\n")
print(f"accents: {sample['accents']}\n")
print(f"locale: {sample['locale']}\n")
print(f"segment: {sample['segment']}")
```
**output:**
```Python
path: common_voice_en_27710027.mp3

audio: {'path': '/root/.cache/huggingface/datasets/downloads/extracted/d79bd49d7b0d227879d17d015d61a097f0548d853ad4253867049733836db92d/data/clips/common_voice_en_27710027.mp3', 'array': array([-2.09547579e-09, -1.39698386e-09, -1.86264515e-09, ...,
        7.56020654e-07,  8.43494490e-07,  1.62231663e-06]), 'sampling_rate': 16000}

sentence: Joe Keaton disapproved of films, and Buster also had reservations about the medium.

up_votes: 3

down_votes: 1

age: 

gender: 

accents: 

locale: en

segment: 
```

- #### **Accents Variety & Distribution**
How many accent values are there, and what are they:
```Python
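df = dataset.remove_columns("audio").to_pandas()  # assumed definition of `df`; audio dropped to skip decoding
print(len(df['accents'].unique()))
print(df['accents'].unique())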
```
**output:**
```Python
93
['' 'United States English,American Midwestern' 'Hong Kong English'
 'Filipino' 'United States English' 'United States English,wolof'
 'England English' 'Australian English'
 'Southern African (South Africa, Zimbabwe, Namibia)'
 'India and South Asia (India, Pakistan, Sri Lanka)'
 'United States English,Puerto Rican,Latin American English,Florida,New York,Long Island,Savannah, Georgia'
 'Canadian English' 'Polish English,England English' 'Scottish English'
 'England English,Swedish' 'German English,"denglish"'
 'United States English,Hong Kong English'
 'Upstate New York,United States English' 'Singaporean English' 'Russian'
 'Chichester' 'United States English,Gay' 'Welsh English'
 'West Indies and Bermuda (Bahamas, Bermuda, Jamaica, Trinidad)'
 'Non native' 'Deutsch English'
 'I think mine accent is influenced by Indian Accent ,Yes Please. ,India and South Asia (India, Pakistan, Sri Lanka)'
 'Irish English' 'Malaysian English' 'Singaporean English,Thai English'
 'England English,United States English'
 'United States English,Southern African (South Africa, Zimbabwe, Namibia)'
 'United States English,West Coast' 'Brooklyn '
 'England English,Midlands English'
 'United States English,Southern New England English (Boston, Worcester, Lowell Area)'
 'United States English,England English'
 'India and South Asia (India, Pakistan, Sri Lanka),United States English'
 'English (Native Greek speaker)' 'United States English,Dutch' 'Catalan'
 'indian' 'Irish English,English' 'United States English,Midwestern'
 'United States English,Filipino' 'United States English,Californian'
 'United States English,Mid Atlantica ,African American Vernacular '
 'french accent'
 'United States English,Born and lived in eastern VA for 8 years. Then lived in southern CA for 13 years.  Lived in MD, NC, WA, HI  for 1-3 years each.  Spent 30 years in Washington DC area and 17 years in Northern KY/Cincinnati OH area'
 'United States English,Northeast US' 'I have none that I can tell.'
 'Russian English' 'Scottish English,Glaswegian '
 'England English,Bedford English,Cambridge English'
 'United States English,India and South Asia (India, Pakistan, Sri Lanka)'
 'United States English,United States-West Coast-Alaska,United States-Midwestern'
 'Neutral,indian,slow' 'England English,Hong Kong English'
 'New Zealand English' 'Argentinian English'
 'Canadian English,Welsh English' 'Filipino,Bisaya' 'french english'
 'Haitian Creole' 'French'
 "New Zealand English,I don't really speak english, just practicing"
 'United States English,Canadian English,Indo-Canadian English'
 'English (UK)' 'United States English,Slight Dutch accent'
 'United States English,Chicago ,Midwestern,Gen Z'
 'United States English,Scandinavian' 'Western Europe'
 'United States English,Scottish English,Irish English,England English'
 'United States English,England English,Hong Kong English'
 'Southern African (South Africa, Zimbabwe, Namibia),Durban'
 'United States English,English Second Language'
 'United States English,country'
 'slightly slurred due to age and alcohol consumption.'
 'Canadian English,United States English' 'South Australia'
 'northern cali' 'England English,Received Pronunciation' 'Polish English'
 'I was born in England and have lived in Australia, Canada and France.'
 "Israeli's accent "
 'England English,Southern African (South Africa, Zimbabwe, Namibia)'
 'United States English,Australian English' 'Second tongue'
 'Australian English,New Zealand English' 'Italian,England English'
 'very slight Russian accent,Standard American English,Boston influence'
 'Southern Texas Accent,United States English' 'serbian']

```

How many samples don't have a specified accent value:
```Python
len(df[df['accents'] == ''])
```
**output:**
```Python
14822
```

Of the 16,372 samples, 14,822 have no value in the accents column. Let's look at the distribution of the top 10 accent values among the remaining samples:
```Python
df[df['accents'] != '']['accents'].value_counts(normalize=True)[0:10]
```
**output:**
```Python
accents
United States English                                 0.434839
India and South Asia (India, Pakistan, Sri Lanka)     0.226452
England English                                       0.119355
Canadian English                                      0.042581
Australian English                                    0.019355
Southern African (South Africa, Zimbabwe, Namibia)    0.016129
Hong Kong English                                     0.010968
Irish English                                         0.009677
New Zealand English                                   0.008387
Scottish English                                      0.007097
Name: proportion, dtype: float64
```
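
Because the accents field is a single comma-separated string, multi-accent values such as those listed above are counted as separate categories. The following is a minimal sketch of counting individual tags instead, assuming that commas inside parentheses belong to the tag; `split_accents` is a hypothetical helper, and free-text entries that contain their own commas would still be split imperfectly:

```Python
from collections import Counter


def split_accents(field):
    """Split an accents string on commas that are not inside parentheses."""
    tags, current, depth = [], [], 0
    for char in field:
        if char == "(":
            depth += 1
        elif char == ")":
            depth = max(depth - 1, 0)
        if char == "," and depth == 0:
            # A top-level comma ends the current tag.
            tags.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    tags.append("".join(current).strip())
    return [tag for tag in tags if tag]  # drop empty tags


# Values copied from the output above.
fields = [
    "",
    "United States English",
    "India and South Asia (India, Pakistan, Sri Lanka),United States English",
    "Scottish English,Glaswegian ",
]
counts = Counter(tag for field in fields for tag in split_accents(field))
print(counts.most_common(1))  # [('United States English', 2)]
```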

### **2. Librispeech** <a id="datasets-librispeech"></a>

> LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.

The dataset can be found at: https://www.openslr.org/12 or https://huggingface.co/datasets/openslr/librispeech_asr

Note: I will use both the "clean" and "other" data, as will be shown later on. Continue reading to learn what "clean" and "other" mean.

### **Features & Characteristics**
- **Sourced from**<br>
    The data is derived from read audiobooks from the [LibriVox](https://librivox.org/) project, and has been carefully segmented and aligned.
- **Typical Topics**<br>
    Since it is sourced from books, it contains a wide variety of topics, including classical literature, religion, science, etc.
- **Accents & Languages**<br>
    - Only English.
    - Includes different English accents, but consists mainly of American accents.
        - There is no official labeling of accent type in the metadata
- **Noise**<br>
    The samples tagged as clean don't contain much noise, while the other samples might contain echo, background noise, poor microphone quality, etc.
- **Average Clip Duration**<br>
    The average clip duration is approximately 12.5s.

### **Some Issues To Consider**
The texts are normalized, which means:
- All letters are capitalized
- There is no punctuation

### **Missing Features**
The dataset doesn't contain irregular language such as mathematical expressions.

### **Dataset Structure & Details**<br>
Each sample contains the following values:
- **chapter_id:** A unique ID for the chapter that is being read from
- **file:** The absolute path to the audio file, in .flac format
- **audio:** A dictionary containing three values:
    - **path:** The absolute path to the audio file, in .flac format
    - **array:** An array of floats representing the loaded audio file
    - **sampling_rate:** The sampling rate of the corresponding audio file
- **id:** A unique ID for the sample
- **speaker_id:** A unique ID for the speaker
- **text:** Transcription

The data samples are split as follows:
|  | Train.500 | Train.360 | Train.100 | Valid | Test |
|----------|:----------:|:----------:|:----------:|:----------:|:----------:|
| clean | - | 104014 | 28539 | 2703 | 2620 |
| other | 148688 | - | - | 2864 | 2939 |

The 500, 360, and 100 represent the total hours of audio in each training subset.

> The audio is in English. There are two configurations: clean and other. The speakers in the corpus were ranked according to the WER of the transcripts of a model trained on a different dataset, and were divided roughly in the middle, with the lower-WER speakers designated as "clean" and the higher WER speakers designated as "other".
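
The split described in the quote relies on word error rate (WER): the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that metric and of a rank-and-halve split, assuming per-speaker WER values are already available; the function names and the toy speaker scores are hypothetical, and this is not the LibriSpeech authors' actual pipeline:

```Python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row for an empty reference prefix: j insertions produce hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i deletions turn ref[:i] into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def split_clean_other(speaker_wer):
    """Rank speakers by WER and cut the ranking roughly in the middle."""
    ranked = sorted(speaker_wer, key=speaker_wer.get)
    middle = len(ranked) // 2
    return ranked[:middle], ranked[middle:]


print(word_error_rate("no land in sight", "no land in side"))  # 0.25

# Toy per-speaker WER values (assumed, not real LibriSpeech numbers).
clean, other = split_clean_other({"s1": 0.04, "s2": 0.21, "s3": 0.09, "s4": 0.15})
print(clean, other)  # ['s1', 's3'] ['s4', 's2']
```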

### **References**
https://www.openslr.org/12<br>
https://ieeexplore.ieee.org/document/7178964<br>
https://huggingface.co/datasets/openslr/librispeech_asr<br>

### **Exploring The Dataset**

- #### **Download It**
```Python
from datasets import load_dataset

# download the dataset from hugging face
dataset = load_dataset("BinAlsadiq/librispeech_test_clean", split='test')
dataset
```
**output:**
```Python
Dataset({
    features: ['chapter_id', 'file', 'audio', 'id', 'speaker_id', 'text'],
    num_rows: 2620
})
```

- #### **Dataset Format**
```Python
sample = dataset[0]  # index the row first so only this clip's audio is decoded

print(f"chapter_id: {sample['chapter_id']}\n")
print(f"file: {sample['file']}\n")
print(f"audio: {sample['audio']}\n")
print(f"id: {sample['id']}\n")
print(f"speaker_id: {sample['speaker_id']}\n")
print(f"text: {sample['text']}\n")
```
**output:**
```Python
chapter_id: 123286

file: /root/.cache/huggingface/datasets/downloads/extracted/f26361781344d4881219a6e350ef1f761e4945e8b4c0e7118ec8800a7cfc9351/data/260/123286/260-123286-0000.flac

audio: {'path': '/root/.cache/huggingface/datasets/downloads/extracted/f26361781344d4881219a6e350ef1f761e4945e8b4c0e7118ec8800a7cfc9351/data/260/123286/260-123286-0000.flac', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -9.15527344e-05, -9.15527344e-05, -9.15527344e-05]), 'sampling_rate': 16000}

id: 0

speaker_id: 260

text: SATURDAY AUGUST FIFTEENTH THE SEA UNBROKEN ALL ROUND NO LAND IN SIGHT
```

### **3. TED-LIUM** <a id="datasets-tedlium"></a>

The TED-LIUM dataset is a large-scale corpus of English-language speech collected from TED Talks, created by the LIUM Spoken Language Processing Group. It includes professionally recorded audio files, aligned transcripts, and speaker metadata, making it a valuable resource for training and evaluating automatic speech recognition (ASR) systems. TED-LIUM captures a wide variety of speaking styles, accents, and topics such as technology, science, education, and culture. Multiple versions of the dataset (v1, v2, v3) have been released, with v3 offering over 450 hours of speech. The dataset is licensed under Creative Commons BY-NC-ND 3.0, which allows non-commercial use but prohibits derivative works.

The dataset can be found at: https://huggingface.co/datasets/LIUM/tedlium/tree/main/TEDLIUM_release3/speaker-adaptation

### **Features & Characteristics**
- **Sourced from**<br>
    Audio recordings of TED conferences.
- **Typical Topics**<br>
    Wide variety of specialized topics.
- **Accents & Languages**<br>
    - Only English.
    - Supports different English accents since speakers are from all over the world.
        - There is no official labeling of accent type in the metadata
- **Noise**<br>
    - people clapping
    - music <br>
        In the test subset, the clip "AimeeMullins_2009P" starts with music. I fed that music segment to the Whisper model and got the output "Yes", even though no one says anything in that segment
    - Audience reactions
        - Laughter
        - applause
        - cheering
        - murmuring during or after a speaker’s statement
    - Microphone noises: Occasional clipping, pops, or breath sounds due to microphone proximity or handling
    - Ambient room noise: Subtle echo or reverberation from large auditoriums or conference rooms
    - Slide click sounds
    - Speech overlaps: In rare Q&A segments or dialogue-style talks, slight overlaps may occur.

### **Some Issues To Consider**
- **Typos**<br>
    In the test subset, specifically in AimeeMullins_2009P, the first transcription was:<br>
    "i 'd like to share with you a discovery that i made a few months ago while writing an article for italian wired i always keep my thesaurus handy whenever i 'm writing anything but"<br>
    Meanwhile, the Whisper model predicted the transcription as:<br>
    "I'd like to share with you a discovery that I made a few months ago while writing an article for Italian Wired. I always keep my bizarre as handy whenever I'm writing anything but..."<br>
    The original transcription writes "i 'd", while the model predicted "I'd". This would be counted as a prediction error even though the prediction is in fact correct
- **Incomplete sentences**: as is clear from the previous example, the reference sentence is cut off after the word "but", which led the model to add "..." at the end. This issue can be solved by normalizing
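
A normalization step along these lines could reconcile the two conventions before scoring. The following is a minimal sketch, assuming that re-attaching detached apostrophes, lowercasing, and dropping the remaining punctuation is acceptable; `normalize_tedlium` is a hypothetical helper rather than the preprocessing used later in this notebook, and it only handles ASCII letters and digits:

```Python
import re


def normalize_tedlium(text):
    """Map TED-LIUM references and model outputs to a comparable form."""
    text = text.lower()
    # Re-attach contractions written with a detached apostrophe: "i 'd" -> "i'd".
    text = re.sub(r"\s+'(?=[a-z])", "'", text)
    # Replace everything except letters, digits, apostrophes, and whitespace
    # with a space, which also removes a trailing "..." added by the model.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


reference = "i 'd like to share with you a discovery"
prediction = "I'd like to share with you a discovery..."
print(normalize_tedlium(reference) == normalize_tedlium(prediction))  # True
```

One limitation of this sketch is that an opening single quote preceded by a space (as in quoted speech) would also be attached to the previous word, so real transcripts may need a more careful rule.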

### **Missing Features**
The dataset doesn't contain irregular language such as mathematical expressions.

### **Dataset Structure & Details**<br>
Each sample contains the following values:
- **audio:** A dictionary containing three values:
    - **path:** The path to the corresponding audio file
    - **array:** An array of floats representing the loaded audio file
    - **sampling_rate:** The sampling rate of the corresponding audio file
- **file:** A path to the downloaded audio file in .sph format
- **text:** Transcription
- **gender:** The gender of the speaker
- **id:** A unique ID for the sample
- **speaker_id:** A unique ID for the speaker

The data samples are split as follows:
| Split | Release 1 | Release 2 | Release 3 |
|----------|:----------:|:----------:|:----------:|
| Train | 56,803 | 92,973 | 268,263 |
| Validation | 591 | 591 | 591 |
| Test | 1,469 | 1,469 | 1,469 |

### **References**
https://www.openslr.org/51/<br>
https://www.innovatiana.com/en/datasets/ted-lium-dataset<br>
https://huggingface.co/datasets/LIUM/tedlium<br>

### **Exploring The Dataset**
Unfortunately, to get even a small subset of the dataset you need to download the whole 164.029 GBytes. Repackaging or modifying it in any way is also prohibited by the [CC BY-NC-ND](https://creativecommons.org/licenses/by-nc-nd/4.0/deed.en) license. Thus, I won't explore it like I did with the other datasets. I <ins>**might**</ins> just create my own subset, modify it, and use it privately for testing later. But I <ins>**might not**</ins> do this since it is illegal 😉

Here is an example (taken from https://huggingface.co/datasets/LIUM/tedlium#example) showing how to officially and legally download the dataset:
```Python
from datasets import load_dataset

tedlium = load_dataset("LIUM/tedlium", "release1") # for Release 1

# see structure
print(tedlium)

# load audio sample on the fly
audio_input = tedlium["train"][0]["audio"]  # first decoded audio sample
transcription = tedlium["train"][0]["text"]  # first transcription
```

### **4. MathSpeech** <a id="datasets-mathspeech"></a>

This dataset contains 1101 audio clips sourced from [MIT OpenCourseWare lectures](https://ocw.mit.edu/about/), where mathematical expressions are spoken aloud in natural English (e.g., “e to the power of i x equals cosine of x plus i sine of x”). Each audio file is paired with two transcripts: one represents the mathematical content in plain English words, making it directly compatible with standard ASR models, and the other represents the mathematical content in LaTeX. This makes MathSpeech useful for tasks like spoken equation recognition, speech-to-LaTeX conversion, and evaluating ASR systems on non-standard, symbol-heavy language expressed verbally. The dataset includes diverse speakers and real lecture-style noise, offering a realistic testbed for robust math-aware speech models.

The dataset can be found here: https://huggingface.co/datasets/AAAI2025/MathSpeech

### **Features & Characteristics**
- **Sourced from**<br>
    [MIT OpenCourseWare lectures](https://ocw.mit.edu/about/).
- **Accents & Languages**<br>
    - U.S. English.
- **Noise**<br>
    - Room reverberation
    - Breath sounds
    - Microphone artifacts
    - Background hum: Low-frequency static or electronic hum from recording equipment
    - Page flipping
    - chalkboard use
    - Speech disfluencies
- **Transcripts**<br>
    Each sample comes with two types of transcripts:
    - Plain English: "ax plus by plus cz equals d"
    - LaTeX: "$ax + by + cz = d$"

### **Some Issues To Consider**
- Inconsistent letter case: Most of the plain English transcripts contain only lowercase letters, but a few contain capital letters.
- Inconsistent punctuation: Most of the plain English transcripts contain no punctuation, but a very small number do.

### **Dataset Structure & Details**<br>
Each sample contains the following values:
- **audio:** A dictionary containing three values:
    - **path:** The path to the audio file
    - **array:** An array of floats representing the loaded audio file
    - **sampling_rate:** The sampling rate of the audio file. By default it is 48000 Hz, so you might need to downsample it to 16000 Hz to be compatible with some models
- **transcription:** Transcription in plain English
- **LaTeX:** Transcription in LaTeX
- **Source:** A YouTube link to the source lecture

There is only one split, which is the training split.

### **References**
https://huggingface.co/datasets/AAAI2025/MathSpeech<br>
https://arxiv.org/html/2412.15655v1<br>
https://github.com/hyeonsieun/mathspeech?tab=readme-ov-file<br>

### **Exploring The Dataset**
- #### **Download It**
```Python
from datasets import load_dataset, Audio

dataset = load_dataset("AAAI2025/MathSpeech", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

dataset
```
**output:**
```Python
Dataset({
    features: ['audio', 'transcription', 'LaTeX', 'Source'],
    num_rows: 1101
})
```

- #### **Dataset Format**
```Python
dataset[0]
```
**output:**
```Python
{'audio': {'path': '1.mp3',
  'array': array([-2.32830644e-10, -2.32830644e-10, -1.39698386e-09, ...,
          5.56293176e-03,  4.59455093e-03,  2.39031739e-03]),
  'sampling_rate': 16000},
 'transcription': 'ax plus by plus cz equals d',
 'LaTeX': '$ax + by + cz = d$',
 'Source': 'https://www.youtube.com/watch?v=YBajUR3EFSM'}
```

## **4️⃣ Preprocessing** <a id="preprocessing"></a>

### Some common issues emerged while I was testing the models and exploring the datasets. These issues are not caused by the models or the datasets themselves, but by the nature of human language, specifically the English language (since I didn't use any other language).

### In this section, I will define these issues and show whether they affect the evaluation process. This matters because such issues might produce misleading results, leading us to think that the scores reflect the performance of the models and datasets, while the real problem lies in the nature of human language.

### All the transcriptions used in the following examples are normalized first (i.e. all letters are converted to one case, and all punctuation is removed), and every predicted word is 100% accurate, meaning that it represents the correct word (e.g. '2' and 'two' are the same word). These preparations ensure that any error is caused not by the model or the dataset, but by the nature of human language itself.

### To normalize the transcriptions, I used the following code:
```Python
from jiwer import Compose, ToLowerCase, RemovePunctuation, RemoveMultipleSpaces, Strip

transform = Compose([
    ToLowerCase(),
    RemovePunctuation(), # this can produce wrong results in some cases, I will discuss this later
    RemoveMultipleSpaces(),
    Strip()
])
```

### Example:
```Python
transform("In these cases, skin tone and hair color are not so important.")
```
**output:**
```Python
'in these cases skin tone and hair color are not so important'
```

### **The Issues**
- **Spelling for different accents**: The same word might be spelled differently in different varieties of English, and all spellings are correct. For example, the word color/colour:<br>
    - In the Common Voice dataset:
        - Spelled color in:
            - common_voice_en_27638325.mp3 : "Its shape and coloration is reminiscent of a brown trout."
            - common_voice_en_19757703.mp3 : "Coalitions in the Bundestag and state legislators are often described by party colors."
            - common_voice_en_25467856.mp3 : "In these cases, skin tone and hair color are not so important."
            - common_voice_en_31206949.mp3 : "The color is mostly red with yellow highlights near the crown."
            - common_voice_en_28527878.mp3 : "Legal specification for the shades of the national colors has also changed with time."
            - common_voice_en_32638824.mp3 : "Studies have found that students of color are disproportionately affected."
            - common_voice_en_31159638.mp3 : "The film was shot in color with mono sound."
        - Spelled colour in:
            - common_voice_en_18276812.mp3 : "What a strange mauve colour!"
            - common_voice_en_17272411.mp3 : "Somehow, the purple colour faded to gray."

    The same thing occurs in the other datasets.

- **Homophones involving a symbol or numeral**: For example, "two" and "2" are pronounced the same. The ASR model might output "2" where the reference says "two", and vice versa.<br>
I used the ASR model "Whisper" to test this, and it really happened:
    - In MathSpeech the sample 2.mp3:
        - Prediction: "x plus 5y plus 10z equals 0."
        - Reference : "x plus 5y plus 10z equals zero"<br>

    While "0" and "zero" are the same, they are written differently.<br>
    The same thing occurs in the other datasets.

### **Will These Issues Affect The Evaluation Results?**
The following code tests the previously mentioned issues using the WER metric (which is explained in the next section):
```Python
from jiwer import wer

# test for homophone involving a symbol or numeral
Ex1_A = 'Saturday, August 15th. The sea unbroken all round. No land in sight.'
Ex1_B = 'SATURDAY AUGUST FIFTEENTH THE SEA UNBROKEN ALL ROUND NO LAND IN SIGHT'
Ex1_C = 'Saturday, August fifteenth. The sea unbroken all round. No land in sight.'

print('test for homophone involving a symbol or numeral')
print(f'With the issue: {wer(transform(Ex1_A), transform(Ex1_B))}')
print(f'Without the issue: {wer(transform(Ex1_B), transform(Ex1_C))}\n')

# test for different spelling
Ex2_A = 'Words was it their colors'
Ex2_B = 'words, was. It; their colours'
Ex2_C = 'Words was it their colours'

print('test for different spelling')
print(f'With the issue: {wer(transform(Ex2_A), transform(Ex2_B))}')
print(f'Without the issue: {wer(transform(Ex2_B), transform(Ex2_C))}')
```
**output:**
```Python
test for homophone involving a symbol or numeral
With the issue: 0.08333333333333333
Without the issue: 0.0

test for different spelling
With the issue: 0.2
Without the issue: 0.0
```
### Sure enough, this will affect the evaluation results.
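
The two non-zero scores can be checked by hand. After normalization, each pair differs in exactly one word ("15th" vs "fifteenth", and "colors" vs "colours"), which counts as a single substitution, and the references contain 12 and 5 words respectively:

$$\text{WER}_{\text{numeral}} = \frac{1}{12} \approx 0.0833, \qquad \text{WER}_{\text{spelling}} = \frac{1}{5} = 0.2$$

So a single harmless spelling variant in a short sentence already costs 20 WER points.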

### **Preprocessing**
Here are some steps that I hope could minimize the problem:

### **1. Unify Spellings**
#### To solve the spelling problem, I will use a function that unifies the spellings:
```Python
from breame.spelling import get_american_spelling # or get_british_spelling

get_american_spelling('DISCOLOURED')
```
**output:**
```Python
'discolored'
```

#### Notice that the function didn't just solve the spelling problem, it also converted all the letters to lowercase. Thus, there is no need to use "ToLowerCase()" anymore.

#### But this function is not quite ready yet: it only works on a single word. If you pass it a full sentence, the spelling doesn't change.

### **2. Combine With Compose**
```Python
from breame.spelling import get_american_spelling # or get_british_spelling
from jiwer import Compose, RemovePunctuation, RemoveMultipleSpaces, Strip

def american_spelling(text):
    # get_american_spelling handles one word at a time, so convert word by word
    return ' '.join(get_american_spelling(word) for word in text.split())

transform = Compose([
    american_spelling,
    RemovePunctuation(),
    RemoveMultipleSpaces(),
    Strip()
])

text = 'IN, THE. it\'s LIGHT OF THE MOON I SAW A KNIFE RED WITH BLOOD AND MY HAND TOO WAS ALSO DISCOLOURED'

transform(text)
```
**output:**
```Python
'in the its light of the moon i saw a knife red with blood and my hand too was also discolored'
```

#### Notice how the third word changed from "it's" into "its", which has a different meaning. This happened because the ' was deleted by the RemovePunctuation(). This can be solved as follows:

### **3. ExpandCommonEnglishContractions()**
```Python
from breame.spelling import get_american_spelling # or get_british_spelling
from jiwer import Compose, RemovePunctuation, RemoveMultipleSpaces, Strip, ExpandCommonEnglishContractions

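def american_spelling(text):
    # get_american_spelling handles one word at a time, so convert word by word
    return ' '.join(get_american_spelling(word) for word in text.split())

transform = Compose([
    american_spelling,
    ExpandCommonEnglishContractions(), # must run before RemovePunctuation(), which deletes the apostrophe
    RemovePunctuation(),
    RemoveMultipleSpaces(),
    Strip()
])

text = 'IN, THE. it\'s LIGHT OF THE MOON I SAW A KNIFE RED WITH BLOOD AND MY HAND TOO WAS ALSO DISCOLOURED'

transform(text)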
```
**output:**
```Python
'in the it is light of the moon i saw a knife red with blood and my hand too was also discolored'
```

#### Now "it's" is expanded into "it is", solving the issue.

#### I will stop here and won't tackle the second issue discussed previously (numerals), because time is running out.

#### My preprocessing might also introduce new issues! For example, ExpandCommonEnglishContractions() appears to expand every 's into "is", so a possessive such as "john's" would be rewritten as well. I'll check whether the error rate goes down this way; if it does, I win! The examples I showed were carefully picked to prove my points, and I am not sure how things will turn out when testing full datasets.
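
#### For reference, the numeral issue that is left unaddressed here could be approached along the following lines. This is only a minimal sketch and not part of the tested pipeline: `spell_out_numerals` and the `DIGITS_TO_WORDS` table are hypothetical names introduced for illustration, and the table only covers 0 to 10.

```python
import re

# Hypothetical lookup table (an assumption for illustration). A real normalizer
# would also need ordinals ("15th"), decimals, and larger numbers.
DIGITS_TO_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "10": "ten",
}

def spell_out_numerals(text):
    """Replace standalone numerals found in the table with their spelled-out form."""
    def replace(match):
        token = match.group(0)
        return DIGITS_TO_WORDS.get(token, token)  # numbers outside the table stay as they are

    # \b\d+\b only matches digit runs that are not glued to letters
    return re.sub(r"\b\d+\b", replace, text)

print(spell_out_numerals("x plus 5y plus 10z equals 0"))  # x plus 5y plus 10z equals zero
```

#### Numerals glued to letters ("5y", "15th") are deliberately left alone. That is enough for the MathSpeech example shown earlier, but it is far from a complete solution.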

## **5️⃣ Evaluation Metrics** <a id="evaluation-metrics"></a>

### I am using only one metric, since it does the job well and I have no more time to spend on such details. The metric is called Word Error Rate (WER), a common metric for evaluating automatic speech recognition (ASR) systems. It measures how accurately a system transcribes spoken language into text by comparing the system's output to a correct version (the reference). WER is particularly useful because it provides a straightforward, interpretable score: the lower the WER, the better the transcription quality. This metric accounts for various types of errors, including incorrect words, missing words, and extra words.

### The formula for WER is:

### $$\text{WER} = \frac{S + D + I}{N}$$

### Where:
#### - S = the number of substitutions (wrong words),
#### - D = the number of deletions (missing words),
#### - I = the number of insertions (extra words),
#### - N = the total number of words in the reference.

### This equation essentially calculates the minimum number of operations required to transform the hypothesis into the reference, normalized by the total number of words in the reference. WER can exceed 1.0 (or 100%) if the number of errors is greater than the number of reference words.

### I won't implement this myself; instead, I'll use a simple and useful function as follows:
```Python
from jiwer import wer

wer('This is a test for...', 'that was A for...')
```
**output:**
```Python
0.8
```
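
### To make the formula concrete, here is a minimal from-scratch sketch of what such a function computes: a word-level edit distance divided by the reference length. `word_error_rate` is a hypothetical name used for illustration; it is not how jiwer is implemented internally, and it applies no normalization at all.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = minimum edits between the reference words seen so far
    # and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # against an empty hypothesis, all i reference words are deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)  # costs 0 on a match
            deletion = previous[j] + 1       # reference word missing from the hypothesis
            insertion = current[j - 1] + 1   # extra word in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref)


print(word_error_rate("This is a test for...", "that was A for..."))  # 0.8
```

### In this example the reference has 5 words, and the cheapest alignment needs 3 substitutions ("This"/"that", "is"/"was", "a"/"A") plus 1 deletion ("test"), so WER = 4 / 5 = 0.8. Note that "a" and "A" count as different words, which is exactly why normalization matters.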

## **6️⃣ Preparing The Testing Codes** <a id="preparing-the-testing-codes"></a>

### In this section, I will lay out all the code needed to conduct the experiment.

### Note that this code was originally written in a notebook, which is why you might see some repeated lines: I sometimes execute cells separately and out of order, so I might need to repeat things.

### **1. Login To Hugging Face** <a id="preparing-the-testing-codes-login-to-hugging-face"></a>

#### Most datasets and models will be downloaded from Hugging Face, so you need to log in first to be able to download them (private or gated repositories cannot be accessed otherwise). This is done with the following code:

```Python
from huggingface_hub import login

login("Your access token's value")
```
#### Access tokens can be created in your Hugging Face account settings, under "Access Tokens".

### **2. Create Classes For The ASR Models** <a id="preparing-the-testing-codes-create-classes-for-the-asr-models"></a>

#### I created a class for each ASR model family, regardless of its size/version, to simplify things later:

#### 1. Whisper

```Python
%pip install openai-whisper

import whisper
import gc
import torch

# sizes:
# tiny
# base
# small
# medium
# turbo
# large

class Whisper:
    def __init__(self, size):
        self.model = whisper.load_model(size)

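    def transcribe(self, audio_array, sample_rate):
        # sample_rate is unused: Whisper assumes the array is already sampled at 16 kHz
        return self.model.transcribe(audio_array, language="en", fp16=False)["text"]

    def __del__(self):
        del self.model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # the parentheses are needed to actually call it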
```

#### 2. Wav2Vec2

```Python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import gc

# versions:
# wav2vec2-base
# wav2vec2-base-960h
# wav2vec2-large
# wav2vec2-large-960h
# wav2vec2-large-960h-lv60-self
# wav2vec2-large-xlsr-53

class Wav2Vec2:
    def __init__(self, version):
        self.processor = Wav2Vec2Processor.from_pretrained("facebook/" + version)
        self.model = Wav2Vec2ForCTC.from_pretrained("facebook/" + version)
        self.model.eval()

    def transcribe(self, audio_array, sample_rate):
        input_values = self.processor(audio_array, return_tensors="pt", sampling_rate=sample_rate).input_values

        with torch.no_grad():
            logits = self.model(input_values).logits
        
        predicted_ids = torch.argmax(logits, dim=-1)
        
        return self.processor.decode(predicted_ids[0])

    def __del__(self):
        del self.model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # the parentheses are needed to actually call it
```
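
#### The `argmax` followed by `decode` in the class above is greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The following toy sketch shows the idea in plain Python. The vocabulary, `blank_id`, and `ctc_greedy_decode` are assumptions made up for illustration and do not reflect the real Wav2Vec2 tokenizer.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Merge consecutive repeated frame predictions, then drop the blank token."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)

# Toy vocabulary (hypothetical, for illustration only)
vocab = {0: "<blank>", 1: "h", 2: "e", 3: "l", 4: "o"}

# Per-frame argmax output for the word "hello". The blank between the two
# runs of "l" is what keeps them from being merged into a single letter.
frames = [1, 1, 0, 2, 2, 3, 3, 0, 3, 4, 4, 0]
print(ctc_greedy_decode(frames, vocab))  # hello
```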

#### 3. Parakeet

```Python
%pip install -U nemo_toolkit["asr"]

import nemo.collections.asr as nemo_asr
import gc
import torch

# versions:
# parakeet-tdt-0.6b-v2
# parakeet-tdt-1.1b
# parakeet-rnnt-1.1b
# parakeet-ctc-0.6b
# parakeet-ctc-1.1b

class Parakeet:
    def __init__(self, version):
        self.model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/" + version)

    def transcribe(self, audio_array, sample_rate):
        return self.model.transcribe(audio_array, verbose=False)[0].text

    def __del__(self):
        del self.model
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```

### **3. Dataset Class** <a id="preparing-the-testing-codes-datasets-class"></a>

#### Since datasets have different formats and feature names, I created the following class to provide a uniform interface to them:

#### See the "[Important Notes](#preparing-the-testing-codes-important-notes)" subsection, where I talk about the "BinAlsadiq/tedlium_3_test_segments" dataset, which doesn't exist 😉

```Python
from datasets import load_dataset, Audio

class DatasetAPI:
    def __init__(self, dataset_name):
        self.dataset_name = dataset_name
        
        if dataset_name == 'tedlium_3_test_segments':
            self.dataset = load_dataset("BinAlsadiq/tedlium_3_test_segments", split='test', trust_remote_code=True)
        elif dataset_name == 'librispeech_test_clean':
            self.dataset = load_dataset("BinAlsadiq/librispeech_test_clean", split='test', trust_remote_code=True)
        elif dataset_name == 'librispeech_test_other':
            self.dataset = load_dataset("BinAlsadiq/librispeech_test_other", split='test', trust_remote_code=True)
        elif dataset_name == 'common_voice_13_0_en_test':
            self.dataset = load_dataset("BinAlsadiq/common_voice_13_0_en_test", split='test', trust_remote_code=True)
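        elif dataset_name == 'MathSpeech':
            self.dataset = load_dataset("AAAI2025/MathSpeech", split="train", trust_remote_code=True)
            self.dataset = self.dataset.cast_column("audio", Audio(sampling_rate=16000))
        else:
            # fail early instead of hitting an AttributeError on self.dataset later
            raise ValueError(f"Unknown dataset name: {dataset_name}")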

    def __iter__(self):
        self._index = 0
        return self

    def __next__(self):
        if self._index < len(self.dataset):
            sample = self.dataset[self._index]
            self._index += 1
            transcription_key = 'transcription' if self.dataset_name == 'MathSpeech' else 'sentence' if self.dataset_name == 'common_voice_13_0_en_test' else 'text'
            
            return {
                'audio_array' : sample["audio"]["array"].astype("float32"),
                'sampling_rate' : sample["audio"]["sampling_rate"],
                'transcription' : sample[transcription_key]
            }
        else:
            raise StopIteration

    def __len__(self):
        return len(self.dataset)
```
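
#### One design note on the iterator: because `__iter__` returns `self` and keeps a single `_index`, two loops over the same object at once (for example nested loops) would share one position. A possible alternative, sketched below under assumptions, is to write `__iter__` as a generator and to look up the transcription column in a dictionary. `InMemoryDataset` is a hypothetical stand-in that wraps a plain list so that the snippet runs without downloading anything.

```python
class InMemoryDataset:
    """Hypothetical stand-in for DatasetAPI that wraps a plain list of samples."""

    # Transcription column per dataset; names not listed fall back to "text"
    TRANSCRIPTION_KEYS = {
        "MathSpeech": "transcription",
        "common_voice_13_0_en_test": "sentence",
    }

    def __init__(self, dataset_name, samples):
        self.dataset_name = dataset_name
        self.samples = samples

    def __iter__(self):
        # A generator gives every loop its own position, so nested or
        # repeated loops over the same object do not interfere.
        key = self.TRANSCRIPTION_KEYS.get(self.dataset_name, "text")
        for sample in self.samples:
            yield {"transcription": sample[key]}

    def __len__(self):
        return len(self.samples)


demo = InMemoryDataset("MathSpeech", [{"transcription": "a plus b"}, {"transcription": "c"}])
first_pass = [s["transcription"] for s in demo]
second_pass = [s["transcription"] for s in demo]
print(first_pass == second_pass)  # True
```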

### **4. Preprocessors** <a id="preparing-the-testing-codes-preprocessors"></a>

#### I created two preprocessors, shown in the code below. Remember their names, because they will be used later on.

```Python
%pip install breame

from breame.spelling import get_american_spelling # or get_british_spelling
from jiwer import Compose, RemovePunctuation, RemoveMultipleSpaces, Strip, ExpandCommonEnglishContractions, ToLowerCase

def american_spelling(text):
    # get_american_spelling handles one word at a time, so convert word by word
    return ' '.join(get_american_spelling(word) for word in text.split())

preprocessorA = Compose([
    ToLowerCase(),
    RemovePunctuation(),
    RemoveMultipleSpaces(),
    Strip()
])

preprocessorB = Compose([
    american_spelling,
    ExpandCommonEnglishContractions(),
    RemovePunctuation(),
    RemoveMultipleSpaces(),
    Strip()
])
```

### **5. Testing Loop** <a id="preparing-the-testing-codes-testing-loop"></a>

#### The following code is a testing loop that takes two inputs: a model to test, and a list of datasets to test it on. For each dataset, the loop produces three evaluation results: the WER on the raw text, the WER after `preprocessorA`, and the WER after `preprocessorB`, each averaged over the samples of that dataset.

```Python
from jiwer import wer
from tqdm import tqdm

def test(model, datasets):
    wer_list = []
    Awer_list = []
    Bwer_list = []
    
    for dataset in datasets:
        wer_sum = 0
        Awer_sum = 0
        Bwer_sum = 0
        
        for sample in tqdm(dataset):
            prediction = model.transcribe(sample["audio_array"], sample["sampling_rate"])
            reference = sample["transcription"]
        
            wer_sum += wer(reference, prediction)
            Awer_sum += wer(preprocessorA(reference), preprocessorA(prediction))
            Bwer_sum += wer(preprocessorB(reference), preprocessorB(prediction))

        wer_list.append({ dataset.dataset_name : wer_sum / len(dataset)})
        Awer_list.append({ dataset.dataset_name : Awer_sum / len(dataset)})
        Bwer_list.append({ dataset.dataset_name : Bwer_sum / len(dataset)})

    return {
        'WER without preprocessing' : wer_list, 
        'WER with type A preprocessing' : Awer_list, 
        'WER with type B preprocessing' : Bwer_list
    }
```
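
#### Keep in mind that the loop above reports the mean of the per-sample WERs (`wer_sum / len(dataset)`), which is not the same as a corpus-level WER (total errors divided by total reference words). The toy sketch below illustrates the difference with made-up error counts. The function names are hypothetical, and the numbers are not taken from any of the datasets.

```python
def mean_utterance_wer(counts):
    """Average of per-utterance WERs: every utterance weighs the same."""
    return sum(errors / ref_len for errors, ref_len in counts) / len(counts)

def corpus_wer(counts):
    """Total errors over total reference words: every word weighs the same."""
    total_errors = sum(errors for errors, _ in counts)
    total_words = sum(ref_len for _, ref_len in counts)
    return total_errors / total_words

# (word errors, reference length) per utterance; made-up values:
# a 1-word utterance that is completely wrong and a 9-word one that is perfect
counts = [(1, 1), (0, 9)]

print(mean_utterance_wer(counts))  # 0.5
print(corpus_wer(counts))          # 0.1
```

#### Short utterances therefore have a large influence on the averaged score. If a corpus-level number is wanted, jiwer's `wer` should also accept a list of references and a list of predictions instead of single strings.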

### **6. Start The Tests** <a id="preparing-the-testing-codes-start-the-tests"></a>

#### Now all you need to do is test the model you want.

#### The following code does so, then displays the results and stores them in a ".csv" file:

```Python
from pandas import DataFrame
import os

datasets = [
    DatasetAPI('tedlium_3_test_segments'),
    DatasetAPI('librispeech_test_clean'),
    DatasetAPI('librispeech_test_other'),
    DatasetAPI('common_voice_13_0_en_test'),
    DatasetAPI('MathSpeech')
]

model = Whisper('tiny')

results = DataFrame(test(model, datasets))
results.to_csv('whisper-tiny.csv')
print(results)

del model
```
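
#### Since `test()` returns a dictionary of lists of one-entry dictionaries, it can be handy to flatten it into one row per dataset before saving. The following is a minimal sketch using only the standard library. `results_to_rows` is a hypothetical helper, and the values are made up just to show the shape.

```python
import csv
import io

def results_to_rows(results):
    """Turn {metric: [{dataset: value}, ...]} into one row (dict) per dataset."""
    rows = {}
    for metric, entries in results.items():
        for entry in entries:
            for dataset_name, value in entry.items():
                rows.setdefault(dataset_name, {"dataset": dataset_name})[metric] = value
    return list(rows.values())

# Made-up values in the same nested shape that test() returns
example = {
    "WER without preprocessing": [{"MathSpeech": 0.5}, {"librispeech_test_clean": 0.75}],
    "WER with type A preprocessing": [{"MathSpeech": 0.25}, {"librispeech_test_clean": 0.125}],
}

rows = results_to_rows(example)

buffer = io.StringIO()  # swap for open("results.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```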

### **7. Important Notes** <a id="preparing-the-testing-codes-important-notes"></a>

- #### **The wav2vec2-base model is only pretrained for feature extraction, and it is not ready to be used as an ASR model. This makes its performance very poor, as will be shown in the next section. It should be fine-tuned first, but for now I am only testing the models as they are; later I will decide whether fine-tuning is necessary.**

- #### **The wav2vec2-large and wav2vec2-large-xlsr-53 models can't be downloaded the same way as the other versions. There is a reason for that, but I don't have time to go into it, and it doesn't really matter. Thus, these two models will be ignored.**

- #### **The "BinAlsadiq/tedlium_3_test_segments" dataset doesn't really exist, thus all the related results are fake 😉. I can't make my own subset out of the TED-LIUM dataset since it is illegal.**

## **7️⃣ Results** <a id="results"></a>

### **whisper-tiny**:
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.425206</td>
      <td>0.254331</td>
      <td>0.223201</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.993131</td>
      <td>0.100179</td>
      <td>0.098831</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.995785</td>
      <td>0.190091</td>
      <td>0.184871</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.472529</td>
      <td>0.423461</td>
      <td>0.424334</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.333016</td>
      <td>0.203956</td>
      <td>0.204865</td>
    </tr>
  </tbody>
</table>
</div>

### **whisper-base**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.428846</td>
      <td>0.256111</td>
      <td>0.213590</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.982150</td>
      <td>0.086610</td>
      <td>0.082629</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.991771</td>
      <td>0.160075</td>
      <td>0.155004</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.367940</td>
      <td>0.309804</td>
      <td>0.311929</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.275864</td>
      <td>0.148422</td>
      <td>0.149331</td>
    </tr>
  </tbody>
</table>
</div>

### **whisper-small**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.529205</td>
      <td>0.357324</td>
      <td>0.314501</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.978684</td>
      <td>0.053592</td>
      <td>0.050267</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.986757</td>
      <td>0.098921</td>
      <td>0.092799</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.304672</td>
      <td>0.246675</td>
      <td>0.240676</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.264789</td>
      <td>0.137663</td>
      <td>0.137663</td>
    </tr>
  </tbody>
</table>
</div>

### **whisper-medium**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.492625</td>
      <td>0.312032</td>
      <td>0.268383</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.980107</td>
      <td>0.051063</td>
      <td>0.050851</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.981331</td>
      <td>0.071449</td>
      <td>0.066030</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.236713</td>
      <td>0.172767</td>
      <td>0.164299</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.245444</td>
      <td>0.105051</td>
      <td>0.109210</td>
    </tr>
  </tbody>
</table>
</div>

### **whisper-turbo**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.561731</td>
      <td>0.378410</td>
      <td>0.335652</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.981520</td>
      <td>0.042134</td>
      <td>0.041912</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.985048</td>
      <td>0.056626</td>
      <td>0.051830</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.247533</td>
      <td>0.191593</td>
      <td>0.187984</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.293834</td>
      <td>0.159705</td>
      <td>0.162614</td>
    </tr>
  </tbody>
</table>
</div>

### **whisper-large**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.520963</td>
      <td>0.344325</td>
      <td>0.301300</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.977587</td>
      <td>0.038080</td>
      <td>0.037809</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.987934</td>
      <td>0.058438</td>
      <td>0.051702</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.242730</td>
      <td>0.183063</td>
      <td>0.176834</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.260550</td>
      <td>0.105496</td>
      <td>0.106405</td>
    </tr>
  </tbody>
</table>
</div>

### **wav2vec2-base**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>1.276548</td>
      <td>1.276148</td>
      <td>1.277075</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.999677</td>
      <td>0.999677</td>
      <td>0.999687</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.999778</td>
      <td>0.999778</td>
      <td>0.999778</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>1.211980</td>
      <td>1.211980</td>
      <td>1.218914</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>1.003636</td>
      <td>1.003110</td>
      <td>1.004019</td>
    </tr>
  </tbody>
</table>
</div>

### **wav2vec2-base-960h**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>1.005146</td>
      <td>0.249087</td>
      <td>0.210693</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.031998</td>
      <td>0.031998</td>
      <td>0.029351</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.088620</td>
      <td>0.088494</td>
      <td>0.082035</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>1.066622</td>
      <td>0.492667</td>
      <td>0.487650</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>1.046253</td>
      <td>0.626175</td>
      <td>0.628175</td>
    </tr>
  </tbody>
</table>
</div>

### **wav2vec2-large-960h**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>1.027285</td>
      <td>0.261928</td>
      <td>0.222394</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.032082</td>
      <td>0.032082</td>
      <td>0.030782</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.056778</td>
      <td>0.055647</td>
      <td>0.050764</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>1.064227</td>
      <td>0.422171</td>
      <td>0.416975</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>1.042963</td>
      <td>0.588449</td>
      <td>0.598544</td>
    </tr>
  </tbody>
</table>
</div>

### **wav2vec2-large-960h-lv60-self**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>1.005705</td>
      <td>0.225617</td>
      <td>0.186261</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.023617</td>
      <td>0.023617</td>
      <td>0.022159</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.047817</td>
      <td>0.047817</td>
      <td>0.041849</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>1.030573</td>
      <td>0.293778</td>
      <td>0.288143</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>1.051637</td>
      <td>0.341368</td>
      <td>0.341368</td>
    </tr>
  </tbody>
</table>
</div>

### **parakeet-tdt-0.6b-v2**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.406494</td>
      <td>0.218562</td>
      <td>0.176007</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>0.978238</td>
      <td>0.013921</td>
      <td>0.013879</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>0.979126</td>
      <td>0.039340</td>
      <td>0.032721</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.193538</td>
      <td>0.128687</td>
      <td>0.121887</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.292922</td>
      <td>0.118508</td>
      <td>0.121297</td>
    </tr>
  </tbody>
</table>
</div>

### **parakeet-tdt-1.1b**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.215337</td>
      <td>0.215337</td>
      <td>0.172384</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>1.001905</td>
      <td>0.014003</td>
      <td>0.013061</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>1.000312</td>
      <td>0.034371</td>
      <td>0.027682</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.392436</td>
      <td>0.102604</td>
      <td>0.102109</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.227052</td>
      <td>0.216707</td>
      <td>0.220373</td>
    </tr>
  </tbody>
</table>
</div>

### **parakeet-rnnt-1.1b**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.216413</td>
      <td>0.216413</td>
      <td>0.174144</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>1.001905</td>
      <td>0.015180</td>
      <td>0.013923</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>1.001860</td>
      <td>0.034105</td>
      <td>0.025794</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.389216</td>
      <td>0.094002</td>
      <td>0.092888</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.240727</td>
      <td>0.230382</td>
      <td>0.234048</td>
    </tr>
  </tbody>
</table>
</div>

### **parakeet-ctc-0.6b**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.211909</td>
      <td>0.211909</td>
      <td>0.169717</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>1.001905</td>
      <td>0.020159</td>
      <td>0.019471</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>1.001339</td>
      <td>0.058955</td>
      <td>0.051514</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.439564</td>
      <td>0.178469</td>
      <td>0.174057</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.232928</td>
      <td>0.218615</td>
      <td>0.222293</td>
    </tr>
  </tbody>
</table>
</div>

### **parakeet-ctc-1.1b**:
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>WER without preprocessing</th>
      <th>WER with type A preprocessing</th>
      <th>WER with type B preprocessing</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>'tedlium_3_test_segments'</th>
      <td>0.213947</td>
      <td>0.213947</td>
      <td>0.171533</td>
    </tr>
    <tr>
      <th>'librispeech_test_clean'</th>
      <td>1.003260</td>
      <td>0.018914</td>
      <td>0.016592</td>
    </tr>
    <tr>
      <th>'librispeech_test_other'</th>
      <td>1.001884</td>
      <td>0.052315</td>
      <td>0.046274</td>
    </tr>
    <tr>
      <th>'common_voice_13_0_en_test'</th>
      <td>0.416557</td>
      <td>0.142409</td>
      <td>0.140728</td>
    </tr>
    <tr>
      <th>'MathSpeech'</th>
      <td>0.255524</td>
      <td>0.244322</td>
      <td>0.247666</td>
    </tr>
  </tbody>
</table>
</div>
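
Several "WER without preprocessing" cells above sit at or slightly above 1.0 and then drop sharply once preprocessing is applied. A plausible cause (my assumption, not something the tables themselves confirm) is a casing and punctuation mismatch between the reference transcripts and the model output, since WER compares words as exact strings. WER can also exceed 1.0, because inserted words count as errors while the denominator is only the number of reference words. The sketch below illustrates both effects with a plain word-level edit distance. The `simple_normalize` helper and the example sentences are hypothetical, and the helper is only a stand-in, not the notebook's type A or type B preprocessor.

```python
import re


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def simple_normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase, then drop everything except letters, digits, apostrophes, spaces."""
    return re.sub(r"[^a-z0-9' ]+", "", text.lower()).strip()


# Hypothetical example: an all-caps reference versus a cased, punctuated hypothesis.
reference = "HE HOPED THERE WOULD BE STEW FOR DINNER"
hypothesis = "He hoped there would be stew for dinner."

raw_wer = word_error_rate(reference, hypothesis)
normalized_wer = word_error_rate(simple_normalize(reference), simple_normalize(hypothesis))
print(raw_wer, normalized_wer)

# Insertions alone can push WER above 1.0: three extra words against a two-word reference.
print(word_error_rate("a b", "a b c d e"))
```

If this explanation holds, the raw column mostly measures formatting conventions rather than recognition quality, so the preprocessed columns are the ones to compare across models.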