# Automatic Speech Recognition (ASR)

Welcome to this introductory guide to Automatic Speech Recognition (ASR)! ASR is a fascinating field within Artificial Intelligence (AI) that focuses on converting spoken language into written text. Think about voice assistants like Siri or Alexa, dictation software, or automatic captioning on videos: these are all powered by ASR technology.

**What is ASR?**

At its core, ASR systems take an audio waveform (your voice) as input and produce a sequence of words (text) as output. This process involves several complex steps, including:

1.  **Signal Processing:** Cleaning the audio signal, removing noise, and extracting relevant features.
2.  **Acoustic Modeling:** Mapping the audio features to basic units of sound, such as the phonemes /a/ and /o/.
3.  **Language Modeling:** Understanding the probability of sequences of words occurring in a given language. This helps the system choose the most likely words.
4.  **Decoding:** Combining the acoustic and language models to find the most probable sequence of words corresponding to the input audio.

Modern ASR heavily relies on deep learning techniques, which have significantly improved accuracy over the past decade.

## Key ASR Models: Wav2Vec 2.0 and Whisper

Two prominent models have significantly advanced the field of ASR: Wav2Vec 2.0 and Whisper. These models leverage large amounts of data and sophisticated deep learning architectures to achieve state-of-the-art performance.

**Wav2Vec 2.0 (from Meta AI):** This model relies on self-supervised learning. Instead of needing vast amounts of transcribed audio (audio paired with text), Wav2Vec 2.0 learns powerful representations directly from raw, unlabeled audio. During pre-training it masks spans of its latent speech representations and learns to pick the correct quantized representation for each masked position from a set of distractors, using the surrounding context. The idea is similar to how masked language models like BERT work with text. Once pre-trained, Wav2Vec 2.0 can be fine-tuned with a relatively small amount of labeled data for specific ASR tasks and languages, making it very versatile.
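The masking idea can be sketched in a few lines. This is a simplified illustration, not the model's actual implementation: each frame independently becomes the start of a masked span with probability `mask_prob`, and spans may overlap. The defaults (0.065 and a span of 10 frames) follow the values reported in the Wav2Vec 2.0 paper; the function name and the independent sampling are assumptions made for this sketch.

```python
import random

def sample_span_mask(num_frames, mask_prob=0.065, span_length=10, seed=0):
    """Return a list of booleans marking which frames are masked."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        # Each frame may start a span; the span covers the next span_length frames.
        if rng.random() < mask_prob:
            for i in range(start, min(start + span_length, num_frames)):
                mask[i] = True
    return mask

mask = sample_span_mask(num_frames=100)
# Print the mask as a string: '#' = masked frame, '.' = visible frame.
print("".join("#" if m else "." for m in mask))
print(f"Masked {sum(mask)} of {len(mask)} frames")
```

During pre-training, the model only sees the unmasked context and has to identify what belongs at the masked positions.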

**Whisper (from OpenAI):** Whisper takes a different approach. It's trained on a massive and diverse dataset comprising 680,000 hours of multilingual and multitask supervised data collected from the web. This extensive training allows Whisper to perform remarkably well across a wide range of languages, accents, and noisy conditions, often without needing specific fine-tuning (a capability known as zero-shot performance). It's designed as an end-to-end system, directly mapping audio to text.

## Practical Examples: Arabic and Moroccan Darija ASR

Let's see how we can use pre-trained models for ASR tasks, specifically focusing on Arabic and Moroccan Darija. We will use models available on the Hugging Face Hub, a platform hosting thousands of pre-trained models.

First, we need to install the necessary libraries. If you haven't already, run the following cell. Note: Installation might take a few minutes, and Whisper might require `ffmpeg` to be installed on your system (`sudo apt update && sudo apt install ffmpeg`).

In [None]:
!pip install transformers torch soundfile librosa speechbrain
!sudo apt update && sudo apt install ffmpeg


In [None]:
# Common Voice 11 is loaded through a dataset script, which datasets 4.x no longer supports
!pip install -U "datasets<4"



### Example 1: Arabic ASR with Wav2Vec 2.0

We will use a Wav2Vec 2.0 model fine-tuned for Arabic. The `transformers` library from Hugging Face makes it easy to load and use these models. We'll need an audio file in Arabic to test this. For demonstration purposes, we'll load a sample from Common Voice through the `datasets` library, but you can replace the dataset and the sample index with your own audio file after loading it appropriately (e.g., using `librosa` or `soundfile`). Remember that the audio needs to be sampled at 16 kHz for most Wav2Vec 2.0 models.
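Before feeding your own recording to the model, it can help to check its sampling rate and channel count. The sketch below uses only Python's standard `wave` module, so it works for uncompressed PCM WAV files only; the helper names `describe_wav` and `is_model_ready` are hypothetical. For other formats, or to actually resample, a library such as `librosa` or `soundfile` would be needed.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Read basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        return {
            "sampling_rate": rate,
            "channels": wav_file.getnchannels(),
            "duration_s": wav_file.getnframes() / rate,
        }

def is_model_ready(path, expected_rate=16000):
    """True if the file is mono and already at the expected sampling rate."""
    info = describe_wav(path)
    return info["sampling_rate"] == expected_rate and info["channels"] == 1

# Demo: write one second of 16-bit silence at 16 kHz, then inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)

print(describe_wav(demo_path))
print(is_model_ready(demo_path))
```

If the check fails, resample the audio to 16 kHz mono before passing it to the processor.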

In [None]:
import torch
import librosa
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a pre-trained Arabic ASR model and processor
model_name_wav2vec = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
processor_wav2vec = Wav2Vec2Processor.from_pretrained(model_name_wav2vec) # Processor for Audio
model_wav2vec = Wav2Vec2ForCTC.from_pretrained(model_name_wav2vec)

# Load a small slice of the Arabic Common Voice test split for demonstration.
# The dataset ships a loading script, so trust_remote_code=True is passed to
# skip the interactive confirmation prompt.
common_voice_ar = load_dataset(
    "mozilla-foundation/common_voice_11_0", "ar", split="test[:1%]", trust_remote_code=True
)
# Resample the audio to 16 kHz as required by the model
common_voice_ar = common_voice_ar.cast_column("audio", Audio(sampling_rate=16000))
# Decode the first sample once and reuse it
sample = common_voice_ar[0]
arabic_audio_sample = sample["audio"]["array"]
sampling_rate = sample["audio"]["sampling_rate"]
original_sentence = sample["sentence"]
print(f"Loaded sample audio with rate: {sampling_rate} Hz")

Loaded sample audio with rate: 16000 Hz


In [None]:
# Preprocess the audio
input_values = processor_wav2vec(arabic_audio_sample, sampling_rate=sampling_rate, return_tensors="pt").input_values

In [None]:
# Perform inference
with torch.no_grad():
    logits = model_wav2vec(input_values).logits

In [None]:
print(logits)

tensor([[[ 15.0634, -19.0045, -18.8847,  ...,  -6.0619,  -6.0984,  -4.9515],
         [ 15.1378, -19.2175, -19.1233,  ...,  -5.9245,  -5.6953,  -4.9738],
         [ 15.0448, -17.9181, -17.7681,  ...,  -4.5779,  -5.6573,  -3.7451],
         ...,
         [ 16.6778, -19.3526, -19.1507,  ...,  -6.0263,  -5.4574,  -3.7518],
         [ 16.2758, -19.1083, -18.9090,  ...,  -6.1576,  -4.7993,  -4.2137],
         [  2.0743,  -6.9718,  -6.9368,  ...,  -1.6385,  -1.6517,   0.4072]]])
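The logits tensor has shape (batch, frames, vocabulary size): one row of scores per audio frame. The number of frames follows from the model's convolutional feature encoder. The sketch below assumes the default Wav2Vec 2.0 kernel sizes and strides, which are assumed here to apply to the checkpoint used in this example; `num_output_frames` is a hypothetical helper name.

```python
# Default Wav2Vec 2.0 feature-encoder configuration (assumed for this checkpoint).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_output_frames(num_samples):
    """Number of frames produced for a waveform with num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Standard output length of a 1-D convolution without padding.
        length = (length - kernel) // stride + 1
    return length

# One second of 16 kHz audio yields 49 frames, i.e. roughly one frame every 20 ms.
print(num_output_frames(16000))  # 49
```

You can compare this with `logits.shape[1]` for your own audio by passing `input_values.shape[1]` as `num_samples`.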


In [None]:
# Decode the predicted IDs
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor_wav2vec.batch_decode(predicted_ids)[0]

In [None]:
print(f"Original Sentence (from dataset): {original_sentence}")
print(f"Wav2Vec Transcription: {transcription}")

Original Sentence (from dataset): زارني في أوائل الشهر بدري
Wav2Vec Transcription: زارني في أوائل الشهر بدري
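The decoding cell above takes the most likely token per frame and lets the processor turn those IDs into text. The core of that step is greedy CTC decoding: collapse consecutive repeats, then drop the blank token. The sketch below shows only this collapse on a toy vocabulary; the vocabulary, the `blank_id=0` choice, and the function name are assumptions for illustration, and the real `batch_decode` also handles details such as word delimiters.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    """Collapse repeated frame predictions, then remove blank tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)

# Toy vocabulary; ID 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "c", 2: "a", 3: "t", 4: "o"}

# Per-frame argmax IDs: repeats are merged, blanks are dropped.
print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3], vocab))  # cat

# A blank between two identical tokens keeps both of them.
print(ctc_greedy_decode([1, 4, 4, 0, 4], vocab))  # coo
```

The blank token is what allows the model to emit genuine double letters, as the second call shows.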


### Example 2: Multilingual ASR with Whisper

Whisper models are known for their strong multilingual capabilities out-of-the-box. We can easily use them via the `transformers` pipeline. This pipeline handles the pre-processing, model inference, and post-processing for us. We can test it on the same Arabic audio sample loaded earlier, or you can provide a path to any audio file (Whisper handles various formats if `ffmpeg` is installed). Whisper automatically detects the language, but you can also specify it for potentially better results.

In [None]:
from transformers import pipeline
import numpy as np

# Load the ASR pipeline with a Whisper model
# You can choose different sizes like 'whisper-tiny', 'whisper-base', 'whisper-small', 'whisper-medium', 'whisper-large'
# Larger models are more accurate but require more resources.
whisper_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-base")
print("Whisper pipeline loaded successfully.")

# Use the Arabic audio sample loaded in Example 1 if available
if arabic_audio_sample is not None:
    print("Transcribing Arabic sample with Whisper...")
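    # Ensure the audio is in the format Whisper expects (numpy float32 array).
    # A bare NumPy array is treated as audio already at the model's sampling rate
    # (16 kHz for Whisper), which holds here after the resampling in Example 1.
    audio_input_whisper = np.array(arabic_audio_sample, dtype=np.float32)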

    # Perform transcription (language can be auto-detected or specified)
    # result = whisper_pipeline(audio_input_whisper, generate_kwargs={"language": "arabic"}) # Specify language
    result = whisper_pipeline(audio_input_whisper) # Auto-detect language
    transcription_whisper = result["text"]

    print(f"Original Sentence (from dataset): {original_sentence}")
    print(f"Whisper Transcription: {transcription_whisper}")

Whisper pipeline loaded successfully.
Transcribing Arabic sample with Whisper...




Original Sentence (from dataset): زارني في أوائل الشهر بدري
Whisper Transcription:  زارني في أواقل الشهر بدري
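To quantify how close a transcription is to the reference, a common choice is the word error rate (WER) or character error rate (CER), both based on edit distance. The following is a minimal pure-Python sketch; the function names are hypothetical, and dedicated evaluation libraries usually also normalize the text (punctuation, diacritics) before scoring, which is skipped here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def wer(reference, hypothesis):
    """Word error rate; the reference must contain at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate; the reference must not be empty."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One wrong word out of six, and one wrong character out of four.
print(wer("the cat sat on the mat", "the cat sit on the mat"))
print(cer("abcd", "abed"))  # 0.25
```

Applied to `original_sentence` and `transcription_whisper` from the cell above, the two sentences differ in one of five words, which corresponds to a WER of 0.2.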


### Example 3: Moroccan Darija ASR with SpeechBrain [Optional]

For Moroccan Darija, we can use a model specifically trained for it, available through the `speechbrain` library, which also integrates with Hugging Face. This example demonstrates how to load the `speechbrain/asr-wav2vec2-dvoice-darija` model and use it for transcription. Similar to the previous example, you'll need a Darija audio file (sampled at 16 kHz). We'll use a placeholder here; you should replace `'path/to/your/darija_audio.wav'` with the actual path to your audio file.

Challenge the model by intentionally creating a sample that leads to poor transcription results. Think about factors that could make transcription difficult.
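Background noise is one such factor. As a hedged starting point, the sketch below adds white Gaussian noise to a waveform at a chosen signal-to-noise ratio (SNR). It works on a plain list of float samples and uses a synthetic tone as a stand-in for real speech; the function name and the tone are assumptions for illustration. Other factors, such as overlapping speakers, code-switching, or fast speech, are not covered by this sketch.

```python
import math
import random

def add_white_noise(samples, snr_db, seed=0):
    """Return a noisy copy of samples with roughly the requested SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, noise_std) for s in samples]

# Stand-in signal: one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

# Lower SNR means louder noise relative to the signal.
noisy = add_white_noise(clean, snr_db=5)
print(len(noisy))  # 16000
```

To test the model, load your own recording as an array, add noise at decreasing SNR values, save the result as a 16 kHz WAV file (for example with `soundfile.write`), and observe where the transcription starts to break down.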

In [None]:
# SpeechBrain 1.0 moved the inference interfaces from speechbrain.pretrained
# to speechbrain.inference
from speechbrain.inference.ASR import EncoderASR

# Load the pre-trained Darija ASR model
# This will download the model from Hugging Face Hub if not already cached
asr_model_sb = EncoderASR.from_hparams(source="speechbrain/asr-wav2vec2-dvoice-darija", savedir="pretrained_models/asr-wav2vec2-dvoice-darija")
print("Darija ASR model loaded successfully.")


Darija ASR model loaded successfully.


In [None]:
# Placeholder: replace with the path to your own 16 kHz Darija recording
darija_audio_file = 'path/to/your/darija_audio.wav'

# Perform transcription
print(f"Transcribing {darija_audio_file}...")
transcription_sb = asr_model_sb.transcribe_file(darija_audio_file)
print(f"SpeechBrain Transcription: {transcription_sb}")

Transcribing /content/sample_data/0.wav...




SpeechBrain Transcription: اسلام عليكم ورحممرحبميكم في المدراسهشييي


## Finding and Using Moroccan Darija Datasets

While pre-trained models offer convenience, you may want to fine-tune a model on specific data (we'll cover that in a different notebook). For Moroccan Darija, having access to the right datasets is essential. One great place to start is the Hugging Face Hub.

For example, the `speechbrain/asr-wav2vec2-dvoice-darija` model we used was trained on the **DVoice Darija** dataset, which is available [on Zenodo](https://zenodo.org/records/6342622) and may also have versions on Hugging Face.

Currently, two main datasets are commonly used for Darija:

* **DVoice** (available on Zenodo)
* **DODA**, which integrates smoothly with the Hugging Face datasets library and will be used for fine-tuning a Wav2Vec2 model in a separate notebook.

If your goal is to use ASR purely for inference, several pre-trained models for Darija are readily available:

* [`speechbrain/asr-wav2vec2-dvoice-darija`](https://huggingface.co/speechbrain/asr-wav2vec2-dvoice-darija)
* [`boumehdi/wav2vec2-large-xlsr-moroccan-darija`](https://huggingface.co/boumehdi/wav2vec2-large-xlsr-moroccan-darija)
* [`KandirResearch/Whisper-Small-Darija`](https://huggingface.co/KandirResearch/Whisper-Small-Darija)

For additional guidance or questions related to your project, feel free to contact **Yassine El Kheir**.


## 🧠 Understanding Check

1. Outline all the major steps to run inference on an existing ASR model.
2. Identify two potential challenges specific to Darija ASR.