# Section 1: Welcome to the World of Speech Recognition!

Hello and welcome to this interactive introduction to Automatic Speech Recognition (ASR), also known as Speech-to-Text (STT)!

ASR is a fascinating field of artificial intelligence that enables computers to understand and transcribe human speech into text. It's a technology that bridges the gap between human language and machine comprehension, unlocking a vast array of applications that make our lives easier, more productive, and more accessible.

In this Colab notebook, our primary objective is to explore the world of speech recognition:
  1. You'll learn how to handle speech data and understand its characteristics through visualizations.
  2. We will then dive into a powerful model called Wav2Vec2, understanding its architecture and the techniques, such as self-supervised and contrastive learning, that make it so effective.
  3. We will fine-tune this pre-trained model on a specific dataset of Darija (Moroccan Arabic) speech.
  4. Finally, you'll learn how to test your model's performance and even share it with the world by deploying it on the Hugging Face Hub.

# Section 2: The Journey from Sound to Text - Understanding the Pipeline and Its Hurdles

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau.

Using a novel contrastive pretraining objective, Wav2Vec2 learns powerful speech representations from more than 50,000 hours of unlabeled speech. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/wav2vec2.png)




Wav2Vec2 is fine-tuned using **Connectionist Temporal Classification (CTC)**, an algorithm for training neural networks on sequence-to-sequence problems in which the alignment between input and output is unknown, most notably Automatic Speech Recognition and handwriting recognition.

I highly recommend reading [Sequence Modeling with CTC (2017)](https://distill.pub/2017/ctc/), a very well-written blog post by Awni Hannun.
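To make the role of CTC more concrete, here is a minimal sketch of greedy CTC decoding: take the most likely token at every frame, merge consecutive repeats, then drop the blank token. The frame sequence and the `_` blank symbol are made up for illustration; they are not taken from the model used in this notebook.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, used only in this sketch


def ctc_greedy_collapse(frame_tokens, blank=BLANK):
    """Collapse a per-frame token sequence the way greedy CTC decoding does."""
    # 1) merge runs of identical consecutive tokens
    merged = [token for token, _ in groupby(frame_tokens)]
    # 2) remove blanks; a blank between two equal letters keeps them distinct
    return "".join(token for token in merged if token != blank)


# one predicted token per audio frame (toy example)
frames = list("hh_e_ll_llo__")
print(ctc_greedy_collapse(frames))
```

Note how the blank between the two `l` runs is what allows the double letter to survive the merge step.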

First, let's make sure you have a GPU set up ...

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sat May 24 17:25:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   60C    P8             13W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## No GPU? No problem!  
Just activate the GPU in Colab (ask a mentor if you need help), then start from here.

## 2.1 Preparing for Fine-tuning: Data is Key

Before we dive into the code, let's discuss data preparation for fine-tuning Wav2Vec2:

1.  **Labeled Dataset:** You need a dataset where each audio sample has an accurate corresponding text transcription. We will continue using a subset of the `atlasia/DODa-audio-dataset` for this demonstration.
2.  **Consistent Sampling Rate:** Wav2Vec2 models are pre-trained with audio at a specific sampling rate (commonly 16 kHz). All audio in your fine-tuning dataset *must* be resampled to match this rate. The `datasets` library can help with this.
3.  **Vocabulary Definition:** The model needs a vocabulary, which is the set of all possible characters (or subwords) it can predict. This vocabulary is created from the transcriptions in your training dataset.
4.  **Processor/Tokenizer:** A `Wav2Vec2Processor` wraps a `Wav2Vec2FeatureExtractor`, which handles audio preprocessing (normalization and padding), and a `Wav2Vec2CTCTokenizer`, which handles text tokenization (converting transcriptions into sequences of IDs based on the vocabulary).
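To illustrate what "resampling to 16 kHz" means in point 2, here is a deliberately naive sketch that resamples by linear interpolation. It is an assumption-laden toy (no anti-aliasing filter, plain Python lists) and not what the libraries do internally. In practice you would let the `datasets` library handle this by casting the audio column with `Audio(sampling_rate=16000)`.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Naive linear-interpolation resampler (illustration only, no filtering)."""
    if orig_sr == target_sr:
        return list(samples)
    n_out = int(len(samples) * target_sr / orig_sr)
    resampled = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr          # position in the original signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        resampled.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return resampled


one_second_48k = [0.0] * 48000                  # one second of silence at 48 kHz
one_second_16k = resample_linear(one_second_48k, 48000, 16000)
print(len(one_second_48k), "->", len(one_second_16k))
```

The duration stays the same (one second); only the number of samples per second changes.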

## 2.2 The Fine-tuning Process: An Overview

The fine-tuning process generally involves these steps:

1.  **Set up the Environment:** Install necessary libraries like `transformers`, `datasets`, `evaluate`, and `accelerate` (for efficient training).
2.  **Load Dataset and Processor:** Load your speech dataset and a pre-trained Wav2Vec2 processor (which includes a tokenizer).
3.  **Preprocess Data:** Resample audio, tokenize transcriptions, and prepare the data in a format suitable for the model.
4.  **Load Pre-trained Model:** Load a pre-trained Wav2Vec2 model suitable for ASR (e.g., `Wav2Vec2ForCTC`).
5.  **Define Training Configuration:** Specify training arguments like learning rate, batch size, number of epochs, and evaluation strategy.
6.  **Define Evaluation Metrics:** Choose metrics to evaluate your model, typically Word Error Rate (WER) and Character Error Rate (CER).
7.  **Define Data Collator:** A data collator is responsible for batching your processed data samples and padding them so that all sequences in a batch have the same length.
8.  **Instantiate and Run Trainer:** Use the Hugging Face `Trainer` class, which simplifies the training loop, handles evaluation, and saves checkpoints.

Let's get to the code!

In [None]:
!pip install transformers[torch] evaluate accelerate torchaudio librosa
!pip install -U datasets

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0->transformers[torch])
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0->transformers[torch])
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=2.0->transformers[torch])
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=2.0->transformers[torch])
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=2.0->transformers[torch])
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-c

In [None]:
!pip install hf_xet

Collecting hf_xet
  Downloading hf_xet-1.1.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (879 bytes)
Downloading hf_xet-1.1.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)
Installing collected packages: hf_xet
Successfully installed hf_xet-1.1.2


## 2.3. Load Dataset and Create Vocabulary/Processor

We will use the `atlasia/DODa-audio-dataset` again. We need to extract all unique characters from our transcriptions to build a vocabulary.



In [None]:
from datasets import load_dataset, Audio
import re
from huggingface_hub import login


# Placeholder: replace with your own Hugging Face access token (never share a real token)
token = "hgface_token"

# Authenticate your session
login(token=token)

# --- 1. Load a subset of the dataset for this demo ---
dataset_name = "atlasia/DODa-audio-dataset"
raw_dataset = load_dataset(dataset_name, split="train[:30%]", token=token) # Using only the first 30% of the train split for speed
print("Raw dataset loaded:", raw_dataset)

README.md:   0%|          | 0.00/5.36k [00:00<?, ?B/s]

data/train-00000-of-00005.parquet:   0%|          | 0.00/333M [00:00<?, ?B/s]

data/train-00001-of-00005.parquet:   0%|          | 0.00/279M [00:00<?, ?B/s]

data/train-00002-of-00005.parquet:   0%|          | 0.00/237M [00:00<?, ?B/s]

data/train-00003-of-00005.parquet:   0%|          | 0.00/226M [00:00<?, ?B/s]

data/train-00004-of-00005.parquet:   0%|          | 0.00/210M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12743 [00:00<?, ? examples/s]

Raw dataset loaded: Dataset({
    features: ['audio', 'darija_Latn', 'darija_Arab_new', 'english', 'darija_Arab_old'],
    num_rows: 3823
})


We'll use the ***`darija_Arab_new`*** column as our target — it's already preprocessed (Yaaay! 🎉... faster coding ahead 🚀).


In [None]:
# --- 2. Define a function to extract all characters from the transcriptions ---

def extract_all_chars(batch):
    transcription_key = "darija_Arab_new"
    all_text = " ".join(t for t in batch[transcription_key] if t is not None)
    vocab = list(set(all_text.lower())) # Convert to lowercase and get unique chars
    return {"vocab": [vocab], "all_text": [all_text]}

In [None]:
# --- 3. Create the vocabulary ---

vocabs = raw_dataset.map(extract_all_chars, batched=True, batch_size=8, keep_in_memory=True, remove_columns=raw_dataset.column_names)
vocab_list = list(set(c for vocab_item in vocabs["vocab"] for c in vocab_item))

Map:   0%|          | 0/3823 [00:00<?, ? examples/s]

In [None]:
vocab_list

[':',
 'ء',
 'ا',
 'ى',
 'ض',
 '؟',
 '-',
 '1',
 'ب',
 'آ',
 'ؤ',
 'و',
 'ن',
 'ل',
 'ث',
 ',',
 '،',
 'ج',
 'م',
 'ي',
 'د',
 'إ',
 'ڤ',
 'س',
 '!',
 'ئ',
 'ق',
 'ص',
 '.',
 '9',
 'ك',
 'غ',
 'ح',
 'ه',
 'ز',
 'ة',
 'ش',
 'خ',
 ' ',
 'ف',
 'ّ',
 'پ',
 'ر',
 'ُ',
 'ظ',
 'ڭ',
 'ت',
 '6',
 'ط',
 'ع',
 'ذ',
 'أ',
 '?']

## 🧠 Let's Talk About Vocabulary!

When you look at the list below, you’ll see lots of strange or unnecessary characters — things like punctuation marks, special diacritics, and even numbers.

```python
# This is an example of a "messy" vocabulary extracted directly from raw text data
messy_vocabulary_example = ['’','ش','ر','َ','؟','ب','0','ص','ؤ','ث','پ','9',',','3','ء','?',
 'ة','خ','غ','ي','ف','-','ى','2','أ','ك','ظ','و','1','ّ','ُ','ت',
 'ض','ح','س','ئ','د','5','ه','ل','ڤ','ط','ڭ','إ','!','"',' ','ا',
 ':','6','ق','ذ','ز','ن','ج','.','،','ع','ً','م','آ']
```

### ❓ But wait... why does our vocabulary look like this?
Because it was built automatically from data that was neither fully cleaned nor filtered. So, yes:

✅ it includes Arabic letters (like ش, ر, ب)

✅ but also punctuation (like ؟, ,, .), numbers (0-9), and even other symbols (like ’, ،, َ, ً, ُ, ّ)

### 💡 Hint for You & Your Thoughts
Try to think about:

Which of these symbols are actually important for recognizing spoken words?
For example: do we need to distinguish between "hello.", "hello!" and "hello,"? Do they sound different? Do we need punctuation at all?

Which symbols might be useless or even harmful during training of a speech recognition model?
Could they confuse the model? => Can you select some?

If you were to clean this vocabulary:

What would you keep, and what would you remove or replace?
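One way to start answering these questions is to sort the characters by their Unicode category. The sketch below uses Python's standard `unicodedata` module; the `group_by_category` helper and the small `sample` list are made up for illustration.

```python
import unicodedata
from collections import defaultdict


def group_by_category(chars):
    """Group characters into coarse classes based on their Unicode category."""
    names = {"L": "letter", "N": "digit", "M": "diacritic", "P": "punctuation", "Z": "space"}
    groups = defaultdict(list)
    for ch in chars:
        major = unicodedata.category(ch)[0]   # first letter of e.g. "Lo", "Nd", "Mn", "Po", "Zs"
        groups[names.get(major, "other")].append(ch)
    return dict(groups)


# "\u0651" is the shadda diacritic, written as an escape so it stays visible
sample = ["ش", "\u0651", "؟", "9", " ", ",", "ڭ"]
for name, members in group_by_category(sample).items():
    print(name, members)
```

Letters are clearly needed; whether digits, diacritics and punctuation carry audible information is the judgment call you are asked to make.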

Let's do some further cleaning ...

In [None]:
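import re
# Raw string so the backslashes reach the regex engine unchanged
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

# "خاوي" ("empty" in Darija) is the placeholder text used for missing transcriptions

def remove_special_characters(batch):
    if batch["darija_Arab_new"] is None:
      batch["text"] = "خاوي"
    else:
      # Strip the punctuation listed above and append a trailing space
      batch["text"] = re.sub(chars_to_ignore_regex, '', batch["darija_Arab_new"]).lower() + " "
    return batch

def remove_digits(batch):
    if batch["darija_Arab_new"] is None:
      batch["text"] = "خاوي"
    else:
      batch["text"] = re.sub(r'\d+', '', batch["text"])
    return batch

def normalize_hamza(batch):
    if batch["darija_Arab_new"] is None:
      batch["text"] = "خاوي"
    else:
      # NOTE: this character class also contains 'ي' (ya), so every ya is replaced by 'ء' too;
      # remove it from the class if only hamza forms should be merged.
      hamza_pattern = re.compile(r'[ءأإؤئيآ]')
      batch["text"] = re.sub(hamza_pattern, 'ء', batch["text"])
    return batch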


In [None]:
raw_dataset = raw_dataset.map(remove_special_characters)
raw_dataset = raw_dataset.map(remove_digits)
raw_dataset = raw_dataset.map(normalize_hamza)

Map:   0%|          | 0/3823 [00:00<?, ? examples/s]

In [None]:
raw_dataset

Dataset({
    features: ['audio', 'darija_Latn', 'darija_Arab_new', 'english', 'darija_Arab_old', 'text'],
    num_rows: 3823
})

Do you see a new column named `"text"`? Make sure you do.
Now let's build a cleaner vocabulary ...

In [None]:
def extract_all_chars(batch):
    transcription_key = "text"
    all_text = " ".join(t for t in batch[transcription_key] if t is not None)
    vocab = list(set(all_text.lower())) # Convert to lowercase and get unique chars
    return {"vocab": [vocab], "all_text": [all_text]}

vocabs = raw_dataset.map(extract_all_chars, batched=True, batch_size=8, keep_in_memory=True, remove_columns=raw_dataset.column_names)
vocab_list = list(set(c for vocab_item in vocabs["vocab"] for c in vocab_item))

Map:   0%|          | 0/3823 [00:00<?, ? examples/s]

In [None]:
## Build the character-to-ID mapping (this dictionary is saved as JSON later)
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}

To make it clearer that `" "` has its own token class, we give it a more visible character `|`. In addition, we also add an "unknown" token so that the model can later deal with characters not encountered in our training set.

Finally, we also add a padding token that corresponds to CTC's "*blank token*". The "blank token" is a core component of the CTC algorithm. For more information, please take a look at the "Alignment" section [here](https://distill.pub/2017/ctc/).

In [None]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

In [None]:
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

49
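To see what such a mapping does to a transcription, here is a small sketch with a toy vocabulary. The `toy_vocab` dictionary and the `encode` helper are hypothetical and much smaller than the real `vocab_dict`; the real encoding is done later by the tokenizer.

```python
# Toy vocabulary: a few letters plus the special tokens discussed above
toy_vocab = {"ا": 0, "ب": 1, "س": 2, "|": 3, "[UNK]": 4, "[PAD]": 5}


def encode(text, vocab):
    """Map each character to its ID: spaces become '|', unseen characters become [UNK]."""
    unk_id = vocab["[UNK]"]
    return [vocab.get("|" if ch == " " else ch, unk_id) for ch in text]


print(encode("با س", toy_vocab))   # space is mapped to the '|' token
print(encode("بx", toy_vocab))     # 'x' is not in the vocabulary
```

The `[PAD]` ID never appears in an encoded transcription; it is reserved for padding and for CTC's blank token.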

## 2.4. Create Tokenizer -- Wav2Vec2CTCTokenizer

**Cool**, now our vocabulary is complete and consists of 49 tokens, which means that the linear layer that we will add on top of the pretrained Wav2Vec2 checkpoint will have an output dimension of 49 (the vocabulary size).

Let's now save the vocabulary as a json file.

In [None]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

In a final step, we use the json file to instantiate an object of the `Wav2Vec2CTCTokenizer` class.

In [None]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

## 2.5. Create Feature Extractor -- Wav2Vec2FeatureExtractor

*A* Wav2Vec2 feature extractor object requires the following parameters to be instantiated:

- `feature_size`: Speech models take a sequence of feature vectors as an input. While the length of this sequence obviously varies, the feature size should not. In the case of Wav2Vec2, the feature size is 1 because the model was trained on the raw speech signal.
- `sampling_rate`: The sampling rate of the audio the model was trained on.
- `padding_value`: For batched inference, shorter inputs need to be padded with a specific value.
- `do_normalize`: Whether the input should be *zero-mean-unit-variance* normalized or not. Usually, speech models perform better when normalizing the input.
- `return_attention_mask`: Whether the model should make use of an `attention_mask` for batched inference. In general, models should **always** make use of the `attention_mask` to mask padded tokens. However, due to a very specific design choice of `Wav2Vec2`'s "base" checkpoint, better results are achieved when using no `attention_mask`. This is **not** recommended for other speech models. For more information, one can take a look at [this](https://github.com/pytorch/fairseq/issues/3227) issue. **Important** If you want to use this notebook to fine-tune [large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60), this parameter should be set to `True`.
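As a rough illustration of what `do_normalize` refers to, the sketch below applies zero-mean-unit-variance normalization to a short list of samples. The `zero_mean_unit_var` helper and the small `eps` constant are assumptions made for this example; the feature extractor does the equivalent work on NumPy arrays for you.

```python
import math


def zero_mean_unit_var(samples, eps=1e-7):
    """Shift the samples to mean 0 and scale them to (approximately) variance 1."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]


raw = [1.0, 2.0, 3.0, 4.0]
normalized = zero_mean_unit_var(raw)
print(normalized)
```

After normalization, quiet and loud recordings end up on a comparable scale, which is why this step usually helps.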

In [None]:
from transformers import Wav2Vec2FeatureExtractor

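# return_attention_mask=True because Section 3.3 fine-tunes a "large" XLSR checkpoint (see the note above);
# switch it back to False only when fine-tuning a "base" checkpoint
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)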

## 2.6. Create Processor -- Wav2Vec2Processor


Great, Wav2Vec2's feature extraction pipeline is thereby fully defined!

To make the usage of Wav2Vec2 as user-friendly as possible, the feature extractor and tokenizer are *wrapped* into a single `Wav2Vec2Processor` class so that one only needs a `model` and `processor` object.

In [None]:
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

## 2.7. Execute everything on our Data ...

Finally, we can process the dataset to the format expected by the model for training. We will make use of the `map(...)` function.

First, we load the audio data by accessing `batch["audio"]`, which decodes the file into an array together with its sampling rate. The feature extractor expects 16 kHz audio, so cast the column with `Audio(sampling_rate=16000)` first if your data uses another rate.
Second, we extract the `input_values` from the loaded audio. In our case, the `Wav2Vec2Processor` only normalizes the data.

Third, we encode the transcriptions to label ids using the tokenizer, turning text into numbers the model can work with.

**Note**: This mapping function is a good example of how the `Wav2Vec2Processor` class should be used. In a "normal" context, calling `processor(...)` is redirected to `Wav2Vec2FeatureExtractor`'s call method. When wrapping the processor into the `as_target_processor` context, however, the same method is redirected to `Wav2Vec2CTCTokenizer`'s call method. Recent `transformers` releases deprecate `as_target_processor` in favor of passing the transcription through the `text` argument, as in `processor(text=...)`.
For more information please check the [docs](https://huggingface.co/transformers/master/model_doc/wav2vec2.html#transformers.Wav2Vec2Processor.__call__).

In [None]:
def prepare_dataset(batch):
    ## get speech arrays
    audio = batch["audio"]

    # batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    with processor.as_target_processor():
        batch["labels"] = processor(batch["text"]).input_ids
    return batch

Let's apply the data preparation function to all examples.

In [None]:
## it should take 6 minutes Max
raw_dataset = raw_dataset.map(prepare_dataset)

Map:   0%|          | 0/3823 [00:00<?, ? examples/s]



Awesome, now we are ready to start training!

# Section 3: Training & Evaluation

The data is processed so that we are ready to start setting up the training pipeline. We will make use of 🤗's [Trainer](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer) for which we essentially need to do the following:

- Define a data collator. In contrast to most NLP models, Wav2Vec2 has a much larger input length than output length. *E.g.*, a sample of input length 50000 has an output length of no more than 100. Given the large input sizes, it is much more efficient to pad the training batches dynamically, meaning that all training samples are only padded to the longest sample in their batch and not to the overall longest sample. Therefore, fine-tuning Wav2Vec2 requires a special padding data collator, which we will define below.

- Evaluation metric. During training, the model should be evaluated on the word error rate. We should define a `compute_metrics` function accordingly

- Load a pretrained checkpoint. We need to load a pretrained checkpoint and configure it correctly for training.

- Define the training configuration.

After having fine-tuned the model, we will correctly evaluate it on the test data and verify that it has indeed learned to correctly transcribe speech.
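The dynamic padding idea from the first bullet can be sketched in a few lines. This is a toy version working on plain Python lists with a hypothetical `pad_batch` helper; the real collator operates on tensors and goes through the processor.

```python
def pad_batch(sequences, pad_value):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]


# toy "audio" inputs of different lengths, padded with 0.0
inputs = pad_batch([[0.1, 0.2, 0.3], [0.5]], pad_value=0.0)
# toy label ids, padded with -100 so the padded positions are ignored by the loss
labels = pad_batch([[4, 7], [9, 2, 5, 1]], pad_value=-100)

print(inputs)
print(labels)
```

Inputs and labels are padded separately because they have different lengths and different padding values.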

### 3.1 Define Data Collator

This class will take care of padding our inputs and labels dynamically per batch.


In [None]:
import torch
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        labels_batch = self.processor.pad(
            labels=label_features,
            padding=self.padding,
            return_tensors="pt",
        )

        # Replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels
        return batch

if processor:
    data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
    print("Data collator defined.")
else:
    data_collator = None
    print("Some Problem is there, call Mentor.")


Data collator defined.


### 3.2 Define Evaluation Metrics (WER & CER)

Word Error Rate (WER) and Character Error Rate (CER) are standard metrics for ASR.


In [None]:
!pip install jiwer

Collecting jiwer
  Downloading jiwer-3.1.0-py3-none-any.whl.metadata (2.6 kB)
Collecting rapidfuzz>=3.9.7 (from jiwer)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading jiwer-3.1.0-py3-none-any.whl (22 kB)
Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: rapidfuzz, jiwer
Successfully installed jiwer-3.1.0 rapidfuzz-3.13.0


We introduced WER earlier, but let's delve deeper. It is the standard metric for evaluating the performance of an ASR system.

*   **Word Error Rate (WER):** This metric measures errors at the word level. It's calculated by comparing the predicted sequence of words with the reference (ground truth) transcription. The formula is:

    `WER = (S + D + I) / N`

    Where:
    *   `S` is the number of substitutions (words in the prediction that are different from the reference at the same position, e.g., reference "hello world", prediction "hallo world" -> 1 substitution).
    *   `D` is the number of deletions (words in the reference that are missing in the prediction, e.g., reference "hello brave new world", prediction "hello new world" -> 1 deletion, "brave").
    *   `I` is the number of insertions (words in the prediction that are not in the reference, e.g., reference "hello world", prediction "hello there world" -> 1 insertion, "there").
    *   `N` is the total number of words in the reference transcription.

    A lower WER is better, with 0% being a perfect transcription. WER can sometimes be greater than 100% if the prediction is much longer than the reference and has many errors.
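The formula above can be checked with a short, self-contained sketch. It computes `S + D + I` as the word-level edit distance between reference and prediction; the function names are made up for this example, and in the notebook itself the `evaluate` and `jiwer` libraries do this work.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (or match)
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, with N the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


print(word_error_rate("hello world", "hallo world"))                # one substitution
print(word_error_rate("hello brave new world", "hello new world"))  # one deletion
print(word_error_rate("hello world", "hello there world"))          # one insertion
```

Calling `edit_distance` on two strings instead of two word lists compares them character by character, which is the idea behind the Character Error Rate (CER).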


In [None]:
import evaluate
import numpy as np  # needed for np.argmax below

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    # Decode predictions
    pred_str = processor.batch_decode(pred_ids)
    label_ids_cleaned = []
    for label_seq in pred.label_ids:
        label_ids_cleaned.append([token_id for token_id in label_seq if token_id != -100 and token_id != processor.tokenizer.pad_token_id])
    label_str = processor.batch_decode(label_ids_cleaned, group_tokens=False) # group_tokens=False for char-level

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

print("Metrics functions defined.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

Metrics functions defined.


### 3.3 Load Pre-trained Model

We'll load a pre-trained Wav2Vec2 model designed for CTC (Connectionist Temporal Classification), which is the common loss function for this type of ASR.


In [None]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",  # XLSR-53 checkpoint already fine-tuned on Arabic
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer), # Ensure vocab size matches our tokenizer
    # If loading fails with a size-mismatch error on the CTC head, add ignore_mismatched_sizes=True
)
# Freeze feature encoder layers if you want to train only the top layers (common practice)
model.freeze_feature_encoder()
print("Pre-trained model loaded.")

config.json:   0%|          | 0.00/1.56k [00:00<?, ?B/s]



pytorch_model.bin:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

Pre-trained model loaded.


### 3.4 Define Training Arguments

These arguments control the training process.


In [None]:
from transformers import TrainingArguments

# Define a directory for saving model outputs (checkpoints, logs)
output_dir = "./wav2vec2-finetuned-hackai-demo"

# These are example arguments. Adjust them based on your resources and dataset size.
# For a quick demo, we use very few steps.
training_args = TrainingArguments(
    output_dir=output_dir,
    group_by_length=True, # Speeds up training by batching similar length inputs
    per_device_train_batch_size=2, # Reduce if OOM, increase if GPU memory allows
    per_device_eval_batch_size=2,
    eval_strategy="steps",
    num_train_epochs=3, # For demo.
    save_steps=50, # Save checkpoint every N steps (adjust based on training length)
    eval_steps=50, # Evaluate every N steps
    logging_steps=50, # Log metrics every N steps
    learning_rate=3e-4, # learning rate to adapt
    report_to="none",
)
print("Training arguments defined.")

Training arguments defined.
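A quick back-of-the-envelope calculation shows what these settings imply for the length of a full run. It assumes a single GPU and no gradient accumulation (neither is set above), and it uses the 3823 rows reported when the dataset was loaded.

```python
import math

num_examples = 3823   # rows in the 30% subset loaded earlier
batch_size = 2        # per_device_train_batch_size
num_epochs = 3        # num_train_epochs
eval_every = 50       # eval_steps

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
num_evaluations = total_steps // eval_every

print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
print("evaluations:", num_evaluations)
```

Every evaluation runs over the whole evaluation set, so raising `eval_steps` and `save_steps` is a cheap way to speed up a demo run.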


### 3.5 Instantiate the Trainer

Now, we bring everything together using the `Trainer` class.


In [None]:
from transformers import Trainer
import numpy as np # ensure numpy is imported
import dataclasses # ensure dataclasses is imported
from typing import Dict, List, Optional, Union # ensure typing is imported

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=raw_dataset, # Use the small preprocessed dataset
    eval_dataset=raw_dataset,  # For demo, using same small set for eval. Ideally, use a separate validation set.
    tokenizer=processor.feature_extractor, # Important for the Trainer to handle feature extraction correctly
)

model.safetensors:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

  trainer = Trainer(


### 3.6 Start Fine-tuning!

This is where the actual training happens.

You don't need to finish the full fine-tuning. Just show the mentor the training progress and the results of a few evaluation steps (two are enough).


In [None]:
trainer.train()
print("Fine-tuning completed.")
# Save the final model and processor
model_save_path = f"{output_dir}/final_model"
processor.save_pretrained(model_save_path)
trainer.save_model(model_save_path)
print(f"Final model and processor saved to {model_save_path}")

Step,Training Loss,Validation Loss,Wer
50,3.8282,3.56341,1.000519
100,3.1373,3.662459,0.999793


KeyboardInterrupt: 

In [None]:
# trainer.train()  # training was interrupted above, so this cell only saves the current, partially trained weights
print("Fine-tuning completed.")
# Save the final model and processor
model_save_path = f"{output_dir}/final_model"
processor.save_pretrained(model_save_path)
trainer.save_model(model_save_path)
print(f"Final model and processor saved to {model_save_path}")

Fine-tuning completed.
Final model and processor saved to ./wav2vec2-finetuned-hackai-demo/final_model


# Challenge - Section 4: Final Exercise – Show Us Your Acoustic Talents

Your task is to create three speech samples and test them using a pretrained Arabic model from Hugging Face. The challenge? Push the system to its limits and try to generate samples that result in a Word Error Rate (WER) above 50%.

How you challenge the model is entirely up to you — be creative! You might try background noise, strong accents, unusual phrasing, or any technique that makes recognition more difficult. We're here to see how well you can break the system and showcase your audio manipulation skills.
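As one possible starting point for the background-noise idea, the sketch below mixes white noise into a signal at a chosen signal-to-noise ratio (SNR). The `add_white_noise` helper, the 440 Hz test tone and the 10 dB target are all assumptions made for this example. It works on plain Python lists; apply the same arithmetic to your own audio arrays.

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Return a noisy copy of `samples` with roughly the requested SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in samples]


# one second of a 440 Hz tone at 16 kHz stands in for a speech recording
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noisy = add_white_noise(tone, snr_db=10)
print(len(noisy))
```

Lower SNR values mean more noise; try several values and watch how the WER reacts.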

In [None]:
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datasets import Dataset, Audio
from jiwer import wer
import os

# Load processor and model
processor = Wav2Vec2Processor.from_pretrained("boumehdi/wav2vec2-large-xlsr-moroccan-darija")
model = Wav2Vec2ForCTC.from_pretrained("boumehdi/wav2vec2-large-xlsr-moroccan-darija").cuda()

# Function to map each audio sample to predicted text
def map_to_result(batch):
    with torch.no_grad(), torch.cuda.amp.autocast():
        # Convert audio array to tensor and process
        input_values = processor(
            batch["audio"]["array"],
            sampling_rate=batch["audio"]["sampling_rate"],
            return_tensors="pt"
        ).input_values.to('cuda')

        # Inference
        logits = model(input_values).logits
        pred_ids = torch.argmax(logits, dim=-1)

        # Decode with processor's vocabulary
        batch["pred_str"] = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]

    return batch

In [None]:
# NB: Change these functions as you see fit

# Load audio files into a Dataset object
def load_audio_files(audio_paths):
    data = {"audio": audio_paths}
    dataset = Dataset.from_dict(data).cast_column("audio", Audio(sampling_rate=16000))
    return dataset

# Main function to process files and compute WER
def evaluate_samples(audio_folder, references):

    # List audio file paths
    audio_files = [os.path.join(audio_folder, f"{i}.wav") for i in range(3)]

    # Load dataset
    dataset = load_audio_files(audio_files)

    # Transcribe
    results = dataset.map(map_to_result)

    # Compute WERs
    wers = []
    for i, ref in enumerate(references):
        hyp = results[i]["pred_str"]
        error = wer(ref, hyp)
        wers.append((i, error, ref, hyp))

    return wers

In [None]:
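# Placeholder references (each string says "write the first/second/third correct transcript here");
# replace them with the exact transcripts of your recordings
references = [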
    "سجل النص الصحيح الأول هنا",
    "سجل النص الصحيح الثاني هنا",
    "سجل النص الصحيح الثالث هنا"
]

results = evaluate_samples("./audio_folder", references)

# Print out WER results
for idx, error, ref, hyp in results:
    print(f"Sample {idx} - WER: {error:.2f}")
    print(f"REF: {ref}")
    print(f"HYP: {hyp}")
    print("-----------")