# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NouamaneTazi/hackai-challenges/blob/main/new_notebooks/speech_wav2vec_dodaaudio.ipynb)
# ---
# # 🎤 Welcome to Speech Recognition with AI! 🇲🇦
#
# In this notebook, you'll learn how computers turn Moroccan Darija speech into text using AI. No experience needed! We'll guide you step by step, explain every new word, and show you how to use powerful models like Wav2Vec2.
#
# **What will you do?**
# 1. Listen to and visualize real Moroccan speech data.
# 2. Learn what a speech recognition model is, and how it works.
# 3. Try out a pre-trained model (so you don't need a big computer!).
# 4. (Optional) See how to train your own model if you want to go further.
#
# **All code runs on free Google Colab!** If training is slow, we'll show you how to use a ready-made model.
#
# Let's get started! 🚀
#
# ---
#
# ## What is Speech Recognition?
#
# **Speech Recognition** (also called Automatic Speech Recognition, or ASR) is when a computer listens to your voice and writes down what you said. It's used in voice assistants, subtitles, and more.
#
# ## What is Wav2Vec2?
#
# **Wav2Vec2** is a special AI model that can turn raw audio (your voice) into text. It was trained on thousands of hours of speech, so it can understand many accents and ways of speaking.
#
# ---
#
# ## Moroccan Context
#
# In this notebook, we'll use a dataset of Moroccan Darija speech. This is perfect for local projects and helps AI understand our unique way of speaking!
#
# ---
#
# ## How to Use This Notebook
#
# - Run each cell one by one (Shift+Enter in Colab)
# - Read the short explanations before each code cell
# - If you see a word you don't know, check the explanation or ask your mentor
#
# ---

# Section 1: Welcome to the World of Speech Recognition!

Hello and welcome to this interactive introduction to Automatic Speech Recognition (ASR), also known as Speech-to-Text (STT)!

ASR is a fascinating field of artificial intelligence that enables computers to understand and transcribe human speech into text. It's a technology that bridges the gap between human language and machine comprehension, unlocking a vast array of applications that make our lives easier, more productive, and more accessible.

In this Colab notebook, our primary objective is to explore the world of speech recognition:
  1. You'll learn how to handle speech data and understand its characteristics through visualizations.
  2. We will then dive into a powerful model called Wav2Vec2, looking at its architecture and at the techniques, such as self-supervised and contrastive learning, that make it so effective.
  3. We will prepare to fine-tune this pre-trained model on a specific dataset (Darija speech).
  4. Finally, you'll learn how to test a model's performance and even share it with the world by deploying it on the Hugging Face Hub.

# 👋 Need Help with Anything Speech-Related?
If you're interested in incorporating text-to-speech or any speech component into your project, feel free to reach out to Yassine El Kheir at yassine.el_kheir@dfki.de.
I'd be happy to give you a hand and share some helpful tips!

# Section 2: Getting Our Hands Dirty - Exploring Speech Data

## 2.1 Setting Up Our Toolkit: Installing Libraries

We'll use a few Python libraries to work with speech data. If you're running this on Colab, just run the cell below to install them. (If you see an error about a missing library, run this cell again!)

- `datasets`: lets us easily download and use speech datasets
- `torchaudio`: helps us load and process audio files
- `librosa`: for audio analysis and making cool visualizations
- `matplotlib`: for plotting graphs and images
- `IPython`: lets us play audio directly in the notebook

**Run this cell if you need to install the libraries:**
```python
!pip install -U datasets torchaudio librosa matplotlib IPython
```

In [None]:
!pip install -U datasets
!pip install torchaudio
!pip install librosa
!pip install matplotlib
!pip install IPython

## 2.2 Loading a Speech Dataset from Hugging Face

The Hugging Face Hub is an amazing resource that hosts thousands of datasets and pre-trained models. For this tutorial, we'll use the `atlasia/DODa-audio-dataset`. This dataset contains Moroccan Darija audio samples that we can use to explore speech data and, later, to fine-tune our model. To load it, we will use the `load_dataset` function from the `datasets` library.

---
## Step 2: Load Moroccan Speech Data

A **dataset** is just a big collection of data. Here, we'll use a dataset of Moroccan Darija speech from Hugging Face. Each entry has:
- The audio (what was said)
- The text in Darija (Latin and Arabic letters)
- The English translation

We'll use the `atlasia/DODa-audio-dataset`, which has real Moroccan voices. This helps our AI learn how we really speak!

---

In [None]:
from datasets import load_dataset
from huggingface_hub import login

# Authenticate your session. login() with no arguments prompts for your own
# Hugging Face access token; never hard-code a token in a shared notebook.
login()

# Load the dataset
dataset_name = "atlasia/DODa-audio-dataset"
dataset = load_dataset(dataset_name, split="train[:50%]")  # first 50% of the train split
print(dataset)

# Show an example
if 'dataset' in locals() and dataset:
    print("\nExample entry:")
    print(dataset[0])

This snippet loads the first half of the training split and prints its structure, allowing you to inspect the available fields for each audio sample. Each entry includes:

- **`audio`**: the raw waveform (`array`), file path (`path`), and sampling rate (`sampling_rate`)
- **`darija_Latn`**: the transcription of the speech in Darija using Latin characters  
- **`darija_Arab_new`**: the processed Arabic script version of the transcription  
- **`darija_Arab_old`**: an earlier version of the Arabic transcription  
- **`english`**: the English translation of the utterance

These fields provide a rich representation of the data across scripts and languages.

## 2.3 Listening to Speech 🎧

Before training any AI, it's good to listen to your data! This helps you understand:
- How people really speak (accents, speed, background noise)
- If the recordings are clear or noisy

Let's listen to a few Moroccan Darija samples from our dataset. Try to notice different accents or any background sounds!

In [None]:
import IPython.display as ipd
import numpy as np

if 'dataset' in locals() and dataset and len(dataset) >= 3:
    for i in range(3):
        sample = dataset[i]
        audio_data = np.array(sample["audio"]["array"], dtype=np.float32)
        sampling_rate = sample["audio"]["sampling_rate"]

        print(f"\n--- Sample {i + 1} ---")
        print(f"Darija (Latin): {sample['darija_Latn']}")
        print(f"Darija (Arabic - new): {sample['darija_Arab_new']}")
        print(f"Darija (Arabic - old): {sample['darija_Arab_old']}")
        print(f"English translation: {sample['english']}")
        print(f"Sampling Rate: {sampling_rate} Hz")
        display(ipd.Audio(audio_data, rate=sampling_rate))
else:
    print("Dataset not loaded or contains fewer than 3 samples.")

## 2.4 Visualizing Speech: Waveforms and Spectrograms

Listening to audio is intuitive, but visualizing it can reveal patterns and characteristics that are not immediately obvious to the ear. Two common ways to visualize audio are waveforms and spectrograms.

**Waveforms:** A waveform is a graphical representation of an audio signal that shows the changes in air pressure over time. It's the most direct visual representation of the raw audio data. From a waveform, you can get a sense of the loudness and a rough idea of the speech segments versus silence.

**Spectrograms:** A spectrogram is a more informative visualization. It shows how the frequency content of a signal varies with time. In a typical spectrogram, the horizontal axis represents time, the vertical axis represents frequency, and the intensity or color of each point represents the energy (amplitude) of that frequency at that time. Spectrograms are incredibly useful in speech recognition because they highlight the phonetic content of speech (the individual sounds that letters stand for), making different sounds visually distinct. They essentially show *what* frequencies are present, *when*, and *how strongly*.
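
To make "what frequencies are present and how strongly" concrete, here is a minimal sketch that analyses a synthetic tone with a naive discrete Fourier transform in pure Python. The tone, the 8 kHz sampling rate, and the helper name `dft_magnitudes` are illustrative assumptions; libraries such as `librosa` use much faster FFT routines and repeat this analysis over short, overlapping windows to build a spectrogram.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitude of each frequency bin below the Nyquist limit (naive O(N^2) DFT)."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2):
        total = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        magnitudes.append(abs(total))
    return magnitudes

sampling_rate = 8000   # samples per second (illustrative value)
n_samples = 400        # 0.05 seconds of audio at 8 kHz
tone_hz = 440          # frequency of the synthetic tone

# A pure sine wave: the simplest possible "waveform"
signal = [math.sin(2 * math.pi * tone_hz * t / sampling_rate) for t in range(n_samples)]

magnitudes = dft_magnitudes(signal)
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * sampling_rate / n_samples  # each bin spans sampling_rate / n_samples Hz
print(f"Strongest frequency: {peak_hz:.0f} Hz")
```

Real speech contains many frequencies at once, which is why a spectrogram shows several bands rather than a single line.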

Let's generate and display these visualizations for one of our audio samples.



## 🔍 Instructions

1. Play the audio using the embedded player.
2. Observe the **waveform** — it shows amplitude over time.
3. Look at the **Mel spectrogram** — it shows how energy is distributed over frequency and time.
4. Try to **link sounds** to parts of the spectrogram:
   - Vowels (like /a/, /i/) → smooth bands in low-to-mid frequencies
   - Fricatives (like /s/, /sh/) → noise in high frequencies
   - Stops (like /b/, /t/) → sudden bursts or gaps


In [None]:
import librosa
import librosa.display
import matplotlib.pyplot as plt

if 'dataset' in locals() and dataset and len(dataset) > 0:
    sample = dataset[0] # Take the first sample again or a different one
    audio_data = np.array(sample["audio"]["array"], dtype=np.float32)
    sampling_rate = sample["audio"]["sampling_rate"]

    # 1. Visualize the Waveform
    plt.figure(figsize=(14, 5))
    librosa.display.waveshow(audio_data, sr=sampling_rate)
    plt.title(f'Waveform (Sampling Rate: {sampling_rate} Hz)')
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.tight_layout()
    plt.show()

    # 2. Visualize the Spectrogram (Mel Spectrogram is common)
    # A Mel spectrogram uses the Mel scale for frequencies, which is closer to human auditory perception.
    S = librosa.feature.melspectrogram(y=audio_data, sr=sampling_rate, n_mels=128, fmax=8000)
    S_dB = librosa.power_to_db(S, ref=np.max)

    plt.figure(figsize=(14, 5))
    librosa.display.specshow(S_dB, sr=sampling_rate, x_axis='time', y_axis='mel', fmax=8000)
    plt.colorbar(format='%+2.0f dB')
    plt.title(f'Mel Spectrogram (Sampling Rate: {sampling_rate} Hz)')
    plt.xlabel('Time (s)')
    plt.ylabel('Frequency (Mel)')
    plt.tight_layout()
    plt.show()

else:
    print("Dataset not loaded or empty, cannot visualize audio.")

Try to **link what you hear with what you see** in the spectrogram. This is the core of understanding sounds visually!

If you would like to understand frequencies further, please contact yassine.el_Kheir@dfki.de.


# Section 3: The Journey from Sound to Text - Understanding the Pipeline and Its Hurdles

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR) and was released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau.

Using a novel contrastive pretraining objective, Wav2Vec2 learns powerful speech representations from more than 50,000 hours of unlabeled speech. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/wav2vec2.png)

For the first time, it has been shown that pretraining followed by fine-tuning on very little labeled speech data achieves results competitive with state-of-the-art ASR systems. Using as little as 10 minutes of labeled data, Wav2Vec2 yields a word error rate (WER) of less than 5% on the clean test set of [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) (*cf.* Table 9 of the [paper](https://arxiv.org/pdf/2006.11477.pdf)).



Wav2Vec2 is fine-tuned using Connectionist Temporal Classification (CTC), an algorithm for training neural networks on sequence-to-sequence problems, used mainly in Automatic Speech Recognition and handwriting recognition.

I highly recommend reading [Sequence Modeling with CTC (2017)](https://distill.pub/2017/ctc/), a very well-written blog post by Awni Hannun.
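
As a rough illustration of the decoding side of CTC, the sketch below shows greedy decoding: the model emits one token per audio frame, repeated tokens are merged, and the special blank token is dropped. The function name and the `_` blank symbol are assumptions made for this example; they are not part of any library.

```python
def ctc_greedy_decode(frame_tokens, blank="_"):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    decoded = []
    previous = None
    for token in frame_tokens:
        # Keep a token only when it differs from the previous frame and is not blank
        if token != previous and token != blank:
            decoded.append(token)
        previous = token
    return "".join(decoded)

# One predicted token per audio frame (hypothetical model output)
frames = ["h", "h", "_", "e", "l", "l", "_", "l", "o", "o"]
print(ctc_greedy_decode(frames))
```

Note how the blank between the two runs of `l` is what lets the decoder keep a genuine double letter instead of merging it.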

First, let's make sure you have a GPU enabled.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

## No GPU? No problem!  
Just activate the GPU in Colab (ask a mentor if you need help), then start from here.

## 3.1 Preparing for Fine-tuning: Data is Key

Before we dive into the code, let's discuss data preparation for fine-tuning Wav2Vec2:

1.  **Labeled Dataset:** You need a dataset where each audio sample has an accurate corresponding text transcription. We will continue using a subset of the `atlasia/DODa-audio-dataset` for this demonstration.
2.  **Consistent Sampling Rate:** Wav2Vec2 models are pre-trained with audio at a specific sampling rate (commonly 16 kHz). All audio in your fine-tuning dataset *must* be resampled to match this rate. The `datasets` library can help with this.
3.  **Vocabulary Definition:** The model needs a vocabulary, which is the set of all possible characters (or subwords) it can predict. This vocabulary is created from the transcriptions in your training dataset.
4.  **Processor/Tokenizer:** A `Wav2Vec2Processor` bundles a `Wav2Vec2FeatureExtractor`, which handles audio preprocessing (normalization and padding), with a `Wav2Vec2CTCTokenizer`, which handles text tokenization (converting transcriptions into sequences of IDs based on the vocabulary).

## 3.2 The Fine-tuning Process: An Overview

The fine-tuning process generally involves these steps:

1.  **Set up the Environment:** Install necessary libraries like `transformers`, `datasets`, `evaluate`, and `accelerate` (for efficient training).
2.  **Load Dataset and Processor:** Load your speech dataset and a pre-trained Wav2Vec2 processor (which includes a tokenizer).
3.  **Preprocess Data:** Resample audio, tokenize transcriptions, and prepare the data in a format suitable for the model.
4.  **Load Pre-trained Model:** Load a pre-trained Wav2Vec2 model suitable for ASR (e.g., `Wav2Vec2ForCTC`).
5.  **Define Training Configuration:** Specify training arguments like learning rate, batch size, number of epochs, and evaluation strategy.
6.  **Define Evaluation Metrics:** Choose metrics to evaluate your model, typically Word Error Rate (WER) and Character Error Rate (CER).
7.  **Define Data Collator:** A data collator is responsible for batching your processed data samples and padding them so that all sequences in a batch have the same length.
8.  **Instantiate and Run Trainer:** Use the Hugging Face `Trainer` class, which simplifies the training loop, handles evaluation, and saves checkpoints.
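
To make step 7 more tangible, here is a minimal pure-Python sketch of what a data collator does: it pads every sequence in a batch to the length of the longest one. The function name `pad_batch` and the toy values are hypothetical; the label padding value of `-100` follows the common convention of marking positions that the loss should ignore. A real collator would also convert the result to tensors.

```python
def pad_batch(features, input_pad_value=0.0, label_pad_value=-100):
    """Pad audio inputs and label IDs so every example in the batch has equal length."""
    max_input_len = max(len(f["input_values"]) for f in features)
    max_label_len = max(len(f["labels"]) for f in features)

    batch = {"input_values": [], "labels": []}
    for f in features:
        input_padding = [input_pad_value] * (max_input_len - len(f["input_values"]))
        label_padding = [label_pad_value] * (max_label_len - len(f["labels"]))
        batch["input_values"].append(list(f["input_values"]) + input_padding)
        batch["labels"].append(list(f["labels"]) + label_padding)
    return batch

# Two toy examples of different lengths
features = [
    {"input_values": [0.1, -0.2, 0.3], "labels": [5, 2]},
    {"input_values": [0.4, 0.0, -0.1, 0.2, 0.6], "labels": [7, 1, 3]},
]
batch = pad_batch(features)
print(batch["input_values"])
print(batch["labels"])
```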

Let's get to the code!

In [None]:
!pip install transformers[torch] evaluate accelerate torchaudio librosa jiwer
!pip install -U datasets

### 3.2. Load Dataset and Create Vocabulary/Processor

We will use the `atlasia/DODa-audio-dataset` again. We need to extract all unique characters from our transcriptions to build a vocabulary.



In [None]:
from datasets import load_dataset, Audio
from huggingface_hub import login

# Authenticate your session. login() with no arguments prompts for your own
# Hugging Face access token; never hard-code a token in a shared notebook.
login()

# --- 1. Load a subset of the dataset for this demo ---
dataset_name = "atlasia/DODa-audio-dataset"
try:
    raw_dataset = load_dataset(dataset_name, split="train[:50%]")  # first 50% of the train split
    print("Raw dataset loaded:", raw_dataset)
except Exception as e:
    print(f"Error loading dataset: {e}. Please ensure authentication if needed.")
    raw_dataset = None

We'll use the ***`darija_Arab_new`*** column as our target — it's already preprocessed (Yaaay! 🎉... faster coding ahead 🚀).


In [None]:
# --- 2. Define a function to extract all characters from the transcriptions ---

def extract_all_chars(batch):
    transcription_key = "darija_Arab_new"
    all_text = " ".join(t for t in batch[transcription_key] if t is not None)
    vocab = list(set(all_text.lower())) # Convert to lowercase and get unique chars
    return {"vocab": [vocab], "all_text": [all_text]}

In [None]:
# --- 3. Create the vocabulary ---

vocabs = raw_dataset.map(extract_all_chars, batched=True, batch_size=8, keep_in_memory=True, remove_columns=raw_dataset.column_names)
vocab_list = list(set(c for vocab_item in vocabs["vocab"] for c in vocab_item))

In [None]:
vocab_list

### 🧠 Let's Talk About Vocabulary!

When you look at the list below, you'll see lots of strange or unnecessary characters — things like punctuation marks, special diacritics, and even numbers.

```python
# This is an example of a "messy" vocabulary extracted directly from raw text data
messy_vocabulary_example = ['','ش','ر','َ','؟','ب','0','ص','ؤ','ث','پ','9',',','3','ء','?',
 'ة','خ','غ','ي','ف','-','ى','2','أ','ك','ظ','و','1','ّ','ُ','ت',
 'ض','ح','س','ئ','د','5','ه','ل','ڤ','ط','ڭ','إ','!','"',' ','ا',
 ':','6','ق','ذ','ز','ن','ج','.','،','ع','ً','م','آ']
```

### ❓ But wait... why does our vocabulary look like this?
Because it was built automatically from data that has not been fully cleaned or filtered. So, yes:

✅ it includes Arabic letters (like ش, ر, ب)
✅ but also punctuation (like ؟, ،, and .), digits (0-9), and diacritics (like َ, ً, ُ, ّ)

### 💡 Hint for You & Your Thoughts
Try to think about:

Which of these symbols are actually important for recognizing spoken words?
For example: do we need to distinguish between "hello.", "hello!" and "hello,"? Do they sound different? Do we need punctuation at all?

Which symbols might be useless or even harmful when training a speech recognition model?
Could they confuse the model? Can you pick out some of them?

If you were to clean this vocabulary:

What would you keep, and what would you remove or replace?

Let's do some further cleaning ...

In [None]:
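import re

# Punctuation to strip, including the Arabic question mark, comma and semicolon
chars_to_ignore_regex = r'[,?.!\-;:"؟،؛]'

def remove_special_characters(batch):
    if batch["darija_Arab_new"] is None:
        # Placeholder text ("empty") for entries without a transcription
        batch["text"] = "خاوي"
    else:
        batch["text"] = re.sub(chars_to_ignore_regex, '', batch["darija_Arab_new"]).lower() + " "
    return batch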

def remove_digits(batch):
    ## to be implemented ...
    return batch

def normalize_hamza(batch):
    ## to be implemented ...
    return batch
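
If you get stuck on the two stubs, the sketch below shows one possible approach on plain strings. The helper names `strip_digits` and `unify_alef` are hypothetical, and the choices themselves are assumptions: you might prefer to spell digits out as words instead of deleting them, or to keep the hamza variants distinct. To use them with `map`, apply them to `batch["text"]` inside the stub functions.

```python
import re

# ASCII digits plus Arabic-Indic digits (U+0660 to U+0669)
DIGITS_REGEX = re.compile(r"[0-9\u0660-\u0669]")

# Map alef variants carrying a hamza or madda to a bare alef
ALEF_MAP = str.maketrans({"أ": "ا", "إ": "ا", "آ": "ا"})

def strip_digits(text):
    """Remove every digit character from the text."""
    return DIGITS_REGEX.sub("", text)

def unify_alef(text):
    """Replace alef variants with a bare alef."""
    return text.translate(ALEF_MAP)

print(strip_digits("salam 123"))
print(unify_alef("أنا"))
```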

In [None]:
raw_dataset = raw_dataset.map(remove_special_characters)
# raw_dataset = raw_dataset.map(remove_digits)

In [None]:
raw_dataset

Do you see a new column named `"text"`? Make sure you do before moving on.
Now, let's build a cleaner vocabulary.

In [None]:
def extract_all_chars(batch):
    transcription_key = "text"
    all_text = " ".join(t for t in batch[transcription_key] if t is not None)
    vocab = list(set(all_text.lower())) # Convert to lowercase and get unique chars
    return {"vocab": [vocab], "all_text": [all_text]}

vocabs = raw_dataset.map(extract_all_chars, batched=True, batch_size=8, keep_in_memory=True, remove_columns=raw_dataset.column_names)
vocab_list = list(set(c for vocab_item in vocabs["vocab"] for c in vocab_item))

In [None]:
# Build the dictionary that maps each character to an integer ID
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))}

To make it clearer that `" "` has its own token class, we give it a more visible character `|`. In addition, we also add an "unknown" token so that the model can later deal with characters not encountered in our training set.

Finally, we also add a padding token that corresponds to CTC's "*blank token*". The "blank token" is a core component of the CTC algorithm. For more information, please take a look at the "Alignment" section [here](https://distill.pub/2017/ctc/).

In [None]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

In [None]:
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)

### 3.3. Create Tokenizer -- Wav2Vec2CTCTokenizer

**Cool**, now our vocabulary is complete and consists of fewer than 63 tokens, which means that the linear layer that we will add on top of the pretrained Wav2Vec2 checkpoint will have an output dimension equal to the vocabulary size, `len(vocab_dict)`.

Let's now save the vocabulary as a json file.

In [None]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

In a final step, we use the json file to instantiate an object of the `Wav2Vec2CTCTokenizer` class.

In [None]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

### 3.3. Create Feature Extractor -- Wav2Vec2FeatureExtractor

*A* Wav2Vec2 feature extractor object requires the following parameters to be instantiated:

- `feature_size`: Speech models take a sequence of feature vectors as an input. While the length of this sequence obviously varies, the feature size should not. In the case of Wav2Vec2, the feature size is 1 because the model was trained on the raw speech signal.
- `sampling_rate`: The sampling rate of the audio the model was trained on.
- `padding_value`: For batched inference, shorter inputs need to be padded with a specific value.
- `do_normalize`: Whether the input should be *zero-mean-unit-variance* normalized or not. Usually, speech models perform better when normalizing the input.
- `return_attention_mask`: Whether the model should make use of an `attention_mask` for batched inference. In general, models should **always** make use of the `attention_mask` to mask padded tokens. However, due to a very specific design choice of `Wav2Vec2`'s "base" checkpoint, better results are achieved when using no `attention_mask`. This is **not** recommended for other speech models. For more information, one can take a look at [this](https://github.com/pytorch/fairseq/issues/3227) issue. **Important** If you want to use this notebook to fine-tune [large-lv60](https://huggingface.co/facebook/wav2vec2-large-lv60), this parameter should be set to `True`.
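
To show what *zero-mean-unit-variance* normalization means for `do_normalize`, here is a minimal pure-Python sketch. The function name and the small `eps` constant (added to avoid dividing by zero on silent clips) are assumptions for illustration; the feature extractor performs the equivalent computation on arrays.

```python
import math

def zero_mean_unit_variance(samples, eps=1e-7):
    """Shift samples to mean 0 and scale them to (approximately) variance 1."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]

raw = [0.1, 0.4, -0.2, 0.3, -0.1]  # toy waveform values
normalized = zero_mean_unit_variance(raw)

new_mean = sum(normalized) / len(normalized)
new_variance = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(f"mean={new_mean:.6f}, variance={new_variance:.6f}")
```

Normalizing this way means a quiet recording and a loud recording of the same sentence look similar to the model.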

In [None]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)

### 3.4. Create Processor -- Wav2Vec2Processor


Great, Wav2Vec2's feature extraction pipeline is thereby fully defined!

To make the usage of Wav2Vec2 as user-friendly as possible, the feature extractor and tokenizer are *wrapped* into a single `Wav2Vec2Processor` class so that one only needs a `model` and `processor` object.

In [None]:
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

### 4. Execute everything on our Data ...

Finally, we can process the dataset to the format expected by the model for training. We will make use of the `map(...)` function.

First, we load the audio data, simply by calling `batch["audio"]`; if the `audio` column has been cast to a target sampling rate, it is resampled on the fly at this point.
Second, we extract the `input_values` from the loaded audio file. In our case, the `Wav2Vec2Processor` only normalizes the data.

Third, we encode the transcriptions to label ids using the tokenizer, which turns text into numbers that computers can understand.

**Note**: This mapping function is a good example of how the `Wav2Vec2Processor` class should be used. Calling `processor(...)` with audio is redirected to `Wav2Vec2FeatureExtractor`'s call method, while passing the `text=` argument redirects the call to `Wav2Vec2CTCTokenizer`'s call method. Older tutorials wrap the text call in the `as_target_processor` context manager instead, which recent `transformers` releases deprecate.
For more information please check the [docs](https://huggingface.co/transformers/master/model_doc/wav2vec2.html#transformers.Wav2Vec2Processor.__call__).

In [None]:
def prepare_dataset(batch):
    # Get the decoded speech array and its sampling rate
    audio = batch["audio"]

    # The processor returns a batch of size 1, so we take element [0]
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["input_length"] = len(batch["input_values"])

    # Encode the transcription into label IDs with the tokenizer
    batch["labels"] = processor(text=batch["text"]).input_ids
    return batch

Let's apply the data preparation function to all examples.

In [None]:
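from datasets import Audio

# Wav2Vec2 expects 16 kHz audio; casting the column resamples each clip on access
raw_dataset = raw_dataset.cast_column("audio", Audio(sampling_rate=16000))

## it should take 6 minutes Max
raw_dataset = raw_dataset.map(prepare_dataset)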

Awesome, now we are ready to start training!

# Section 5: Final Exercise – Show Us Your Acoustic Talents

Your task is to create three speech samples and test them using a pretrained Arabic model from Hugging Face. The challenge? Push the system to its limits and try to generate samples that result in a Word Error Rate (WER) above 50%.

How you challenge the model is entirely up to you — be creative! You might try background noise, strong accents, unusual phrasing, or any technique that makes recognition more difficult. We're here to see how well you can break the system and showcase your audio manipulation skills.
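
To see what a WER above 50% means in practice, here is a minimal sketch of how Word Error Rate can be computed by hand: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name `word_error_rate` and the toy sentences are illustrative; the evaluation cell in this section relies on the `jiwer` library for the actual scores.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("The reference must contain at least one word.")

    # previous[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)

# One substituted word and one missing word out of four reference words
score = word_error_rate("a b c d", "a x c")
print(f"WER: {score:.2f}")
```

Because insertions count as errors too, WER can exceed 100% when the model outputs many extra words.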

In [None]:
import os

import torch
from datasets import Dataset, Audio
from jiwer import wer  # install with `pip install jiwer` if it is missing
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load processor and model
model_name = "boumehdi/wav2vec2-large-xlsr-moroccan-darija"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device).eval()

# Function to map each audio sample to predicted text
def map_to_result(batch):
    with torch.no_grad():
        # Preprocess input audio
        input_values = processor(batch["audio"]["array"], sampling_rate=16000).input_values[0]
        input_values = torch.tensor(input_values).to(device)

        # Forward pass through the model
        logits = model(input_values.unsqueeze(0)).logits

        # Decode predictions
        pred_ids = torch.argmax(logits, dim=-1)
        batch["pred_str"] = processor.batch_decode(pred_ids)[0]

    return batch

# Load audio files into a Dataset object, resampled to 16 kHz
def load_audio_files(audio_paths):
    data = {"audio": audio_paths}
    dataset = Dataset.from_dict(data).cast_column("audio", Audio(sampling_rate=16000))
    return dataset

# Main function to process files and compute WER
def evaluate_samples(audio_folder, references):
    # Expects files named 0.wav, 1.wav, ... with one file per reference
    audio_files = [os.path.join(audio_folder, f"{i}.wav") for i in range(len(references))]

    # Load dataset
    dataset = load_audio_files(audio_files)

    # Transcribe
    results = dataset.map(map_to_result)

    # Compute WERs
    wers = []
    for i, ref in enumerate(references):
        hyp = results[i]["pred_str"]
        error = wer(ref, hyp)
        wers.append((i, error, ref, hyp))

    return wers

In [None]:
references = [
    "سجل النص الصحيح الأول هنا",
    "سجل النص الصحيح الثاني هنا",
    "سجل النص الصحيح الثالث هنا"
]

results = evaluate_samples("./audio_folder", references)

# Print out WER results
for idx, error, ref, hyp in results:
    print(f"Sample {idx} - WER: {error:.2f}")
    print(f"REF: {ref}")
    print(f"HYP: {hyp}")
    print("-----------")

---
🎉 **Congrats! You've completed the speech-to-text challenge.**

- Try more Moroccan audio, or even your friends' voices!
- Share your results with your classmates or mentors.
- If you're curious, explore other Moroccan datasets or models on [Hugging Face](https://huggingface.co/models?language=ar&search=moroccan).

If you have questions, don't hesitate to ask. Keep experimenting and have fun with AI!
---