# **Wav2Vec2 Model Training and Evaluation in Persian Speech Recognition**
This notebook fine-tunes a Wav2Vec2 model for Persian speech recognition: it loads the Common Voice dataset, preprocesses the transcripts and audio, configures and trains the model, and evaluates it by word error rate (WER).

# Installing and Configuring Required Libraries

This section installs and configures the necessary libraries for training and evaluating a Wav2Vec2 model for Persian speech recognition.


In [1]:
!pip install hazm
!pip uninstall -y pyarrow requests datasets
!pip install pyarrow==14.0.1 requests==2.31.0 "datasets>=1.18.3"
!pip install transformers==4.11.3
!pip install torch torchaudio
!pip install librosa
!pip install jiwer
!pip install accelerate
!pip install Num2fawords


Found existing installation: pyarrow 14.0.1
Uninstalling pyarrow-14.0.1:
  Successfully uninstalled pyarrow-14.0.1
Found existing installation: requests 2.31.0
Uninstalling requests-2.31.0:
  Successfully uninstalled requests-2.31.0
Collecting transformers==4.11.3
  Using cached transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting sacremoses (from transformers==4.11.3)
  Using cached sacremoses-0.1.1-py3-none-any.whl (897 kB)
Collecting tokenizers<0.11,>=0.10.1 (from transformers==4.11.3)
  Using cached tokenizers-0.10.3.tar.gz (212 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: tokenizers
  error: subprocess-exited-with-error
  
  × Building wheel for tokenizers (pyproject.toml) did not run successfully.
  │ exit code: 1

## Logging into Hugging Face CLI

This command logs you into the Hugging Face CLI, allowing you to access and manage Hugging Face resources, such as datasets and models, which are necessary for training and evaluating a Wav2Vec2 model for Persian speech recognition.


In [2]:
!huggingface-cli login



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: write).
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your ter

# Importing Libraries

This section imports the necessary libraries for loading datasets, preprocessing audio data, and configuring and training a Wav2Vec2 model for Persian speech recognition.


In [3]:
import torch
from datasets import load_dataset, load_metric, Audio, ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML
import re
from hazm import Normalizer
import torchaudio
import time
import torch.multiprocessing as mp
import IPython.display as ipd
import numpy as np
from transformers import (
    Wav2Vec2Processor, Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC,
    TrainingArguments, Trainer, AdamW, get_linear_schedule_with_warmup,
    Wav2Vec2CTCTokenizer,
)
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union


# Setting Up Device for Model Training

This section checks for the availability of a GPU and sets the device accordingly to either CUDA (if a GPU is available) or CPU, ensuring efficient model training and evaluation.


In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


Using device: cuda


# Loading Common Voice Dataset

This section loads the Persian (`fa`) configuration of Common Voice 6.1 with the `datasets` library. The `train` and `validation` splits are combined for training, and the `test` split is held out for evaluation.


In [46]:
common_voice_train = load_dataset("mozilla-foundation/common_voice_6_1", "fa", trust_remote_code=True, split="train+validation")
common_voice_test = load_dataset("mozilla-foundation/common_voice_6_1", "fa", trust_remote_code=True, split="test")

# common_voice_train = load_dataset("mozilla-foundation/common_voice_6_1", "fa",trust_remote_code=True, split="train[:1%]")
# common_voice_test = load_dataset("mozilla-foundation/common_voice_6_1", "fa",trust_remote_code=True, split="test[:1%]")


# Preparing Common Voice Dataset for Training

This section preprocesses the Common Voice dataset by removing unnecessary columns to streamline training and testing processes.


In [47]:
common_voice_train = common_voice_train.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "segment", "up_votes"])
common_voice_test = common_voice_test.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "segment", "up_votes"])


In [48]:
def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))
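
The retry loop above keeps drawing indices until it has `num_examples` distinct ones. A minimal alternative sketch, using only the standard library, draws distinct indices in one call with `random.sample`. The helper name `pick_unique_indices` and the `seed` parameter are hypothetical and not part of the notebook.

```python
import random

def pick_unique_indices(dataset_len, num_examples=10, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_len)."""
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # passing a seed makes the selection reproducible
    return rng.sample(range(dataset_len), num_examples)

# 12806 is the training-set size shown in the progress bars of this notebook.
picks = pick_unique_indices(12806, num_examples=10, seed=0)
print(len(picks), len(set(picks)))
```

The resulting list could be passed to `dataset[picks]` in the same way as the `picks` list built by the loop.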


In [49]:
show_random_elements(common_voice_train.remove_columns(["path", "audio"]), num_examples=10)


|  | sentence |
|---|---|
| 0 | پنجاه و هفت، پنجاه و هشت، پنجاه و نه. |
| 1 | او اینجا در تعطیلات است. |
| 2 | چطور ميتونم راحت باشم؟ |
| 3 | از این گوش میگیره از اون گوش در میکنه |
| 4 | برنشست |
| 5 | برای دیگران توضیح دهد |
| 6 | مصر |
| 7 | این قطار مستقیم است؟ |
| 8 | بیماری کم خونی داسی شکل |
| 9 | هیچ دستمالی دارید؟ |


In [50]:
normalizer = Normalizer()

# Character mapping
chars_to_mapping = {
'ك': 'ک', 'دِ': 'د', 'بِ': 'ب', 'زِ': 'ز', 'ذِ': 'ذ', 'شِ': 'ش', 'سِ': 'س', 'ى': 'ی',
'ي': 'ی', 'أ': 'ا', 'ؤ': 'و', "ے": "ی", "ۀ": "ه", "ﭘ": "پ", "ﮐ": "ک", "ﯽ": "ی",
"ﺎ": "ا", "ﺑ": "ب", "ﺘ": "ت", "ﺧ": "خ", "ﺩ": "د", "ﺱ": "س", "ﻀ": "ض", "ﻌ": "ع",
"ﻟ": "ل", "ﻡ": "م", "ﻢ": "م", "ﻪ": "ه", "ﻮ": "و", 'ﺍ': "ا", 'ة': "ه",
'ﯾ': "ی", 'ﯿ': "ی", 'ﺒ': "ب", 'ﺖ': "ت", 'ﺪ': "د", 'ﺮ': "ر", 'ﺴ': "س", 'ﺷ': "ش",
'ﺸ': "ش", 'ﻋ': "ع", 'ﻤ': "م", 'ﻥ': "ن", 'ﻧ': "ن", 'ﻭ': "و", 'ﺭ': "ر", "ﮔ": "گ",
"ها": "  ها", "ئ": "ی",
"a": " ای ", "b": " بی ", "c": " سی ", "d": " دی ", "e": " ایی ", "f": " اف ",
"g": " جی ", "h": " اچ ", "i": " آی ", "j": " جی ", "k": " کی ", "l": " ال ",
"m": " ام ", "n": " ان ", "o": " او ", "p": " پی ", "q": " کیو ", "r": " آر ",
"s": " اس ", "t": " تی ", "u": " یو ", "v": " وی ", "w": " دبلیو ", "x": " اکس ",
"y": " وای ", "z": " زد ",
"\u200c": " ", "\u200d": " ", "\u200e": " ", "\u200f": " ", "\ufeff": " ",
}

chars_to_ignore = [
    '!', '#', '�', "'", '’', '%', ':', ';', '-', '!', '.', '?', ',', '؟',
    '!', '.', '?', ',', 'ٔ', '٬', 'ٔ', '؛', '(', ')', '،', '«', '»',
    ';', ':', '”', '‘‘', '%', '‘', '=', '–', '…', '_', '‘', '‘', '„', 'ā', 'š'
]


def preprocess_text(text):
    # Normalize Persian text
    text = normalizer.normalize(text)

    for char, replacement in chars_to_mapping.items():
        text = text.replace(char, replacement)

    # Remove special characters
    text = re.sub('|'.join(map(re.escape, chars_to_ignore)), '', text)

    return text


# Apply preprocessing to the sampled datasets
common_voice_train = common_voice_train.map(lambda batch: {"sentence": preprocess_text(batch["sentence"])})
common_voice_test = common_voice_test.map(lambda batch: {"sentence": preprocess_text(batch["sentence"])})


Map:   0%|          | 0/12806 [00:00<?, ? examples/s]

Map:   0%|          | 0/5213 [00:00<?, ? examples/s]

In [51]:
show_random_elements(common_voice_train.remove_columns(["path","audio"]))


|  | sentence |
|---|---|
| 0 | او تمام امتحانات را با درخشش گذراند |
| 1 | شما به مقصد رسیدید |
| 2 | پسر ها قبولش داشتند و می گفتند اگر فلانی بود جلسه بهتر می شد |
| 3 | من لثه متورم دارم |
| 4 | چیزی که فکر می کنید وقتی پیر شدید میشید |
| 5 | برو بابا دلت خوشه شوهرم کجا بود |
| 6 | یه مهمونی باکلاس باشه |
| 7 | این بخاطر تو و من هست |
| 8 | نظرت درباره دی جی چیست |
| 9 | آوا های مشترک در تلگرام عضو شوید |


In [52]:
def extract_all_chars(batch):
  all_text = " ".join(batch["sentence"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}


In [53]:
vocab_train = common_voice_train.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=common_voice_train.column_names)
vocab_test = common_voice_test.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=common_voice_test.column_names)


Map:   0%|          | 0/12806 [00:00<?, ? examples/s]

Map:   0%|          | 0/5213 [00:00<?, ? examples/s]

In [56]:
vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0]))


In [57]:
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict


{'ج': 0,
 ' ': 1,
 'آ': 2,
 'T': 3,
 '&': 4,
 'غ': 5,
 'ی': 6,
 'ه': 7,
 'F': 8,
 'ژ': 9,
 'H': 10,
 'G': 11,
 'ز': 12,
 'ث': 13,
 'K': 14,
 'A': 15,
 'E': 16,
 'M': 17,
 'S': 18,
 '"': 19,
 'I': 20,
 'ا': 21,
 'ل': 22,
 'D': 23,
 'ش': 24,
 'ص': 25,
 'Q': 26,
 'ط': 27,
 'س': 28,
 'Z': 29,
 'ض': 30,
 'C': 31,
 'م': 32,
 'ب': 33,
 'پ': 34,
 'ر': 35,
 'ت': 36,
 'ح': 37,
 'P': 38,
 'د': 39,
 'ع': 40,
 'ذ': 41,
 'ف': 42,
 'B': 43,
 'ظ': 44,
 'ء': 45,
 'ق': 46,
 'چ': 47,
 'خ': 48,
 'و': 49,
 'ن': 50,
 'گ': 51,
 'U': 52,
 'ک': 53}
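
The vocabulary above still contains uppercase Latin letters and symbols such as `&` and `"`. The mapping in `chars_to_mapping` only covers lowercase Latin letters, so these characters pass through `preprocess_text` unchanged. A minimal sketch for auditing leftover characters is shown below. The allowed set is an assumption (the Persian letters that appear in the vocabulary above plus a space), `unexpected_chars` is a hypothetical helper, and the toy sentences are made up for illustration.

```python
from collections import Counter

# Assumption: the Persian letters seen in the vocabulary above are the expected alphabet.
PERSIAN_LETTERS = set("آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهیء")

def unexpected_chars(sentences, allowed=PERSIAN_LETTERS | {" "}):
    """Count characters that fall outside the allowed set."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ch for ch in sentence if ch not in allowed)
    return counts

# Toy sentences, not taken from Common Voice.
sample = ["نظرت درباره DJ چیست", "تو & من"]
print(unexpected_chars(sample))
```

Running the same function over `common_voice_train["sentence"]` would list which characters to add to `chars_to_mapping` or `chars_to_ignore`.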

In [58]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]


In [59]:
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
len(vocab_dict)


56
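
The vocabulary construction in the cells above can be sketched end to end on toy data. This is an illustration rather than the notebook's exact code: it sorts the characters for a stable id order (the notebook uses the set's iteration order), and `build_ctc_vocab` is a hypothetical helper name. It assumes the text contains at least one space.

```python
def build_ctc_vocab(sentences):
    """Map every character to an id, then add the CTC helper tokens."""
    chars = sorted(set(" ".join(sentences)))  # sorted for a stable id order
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Represent the word boundary with a visible "|" instead of " ".
    vocab["|"] = vocab.pop(" ")
    # Special tokens take the next free ids.
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab

toy_vocab = build_ctc_vocab(["ab", "ba c"])
print(toy_vocab)
# {'a': 1, 'b': 2, 'c': 3, '|': 0, '[UNK]': 4, '[PAD]': 5}
```

The same bookkeeping explains the count above: 54 characters, with the space renamed to `|`, plus `[UNK]` and `[PAD]` gives 56 entries.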

In [60]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)


In [61]:
# read audio files with torchaudio
def read_audio_file(batch):
    speech, sampling_rate = torchaudio.load(batch["path"])
    # map the 48kHz frequency to 16kHz
    speech = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)(speech)
    batch["speech"] = speech
    batch["sampling_rate"] = 16000
    return batch


common_voice_train = common_voice_train.map(read_audio_file)
common_voice_test = common_voice_test.map(read_audio_file)


Map:   0%|          | 0/12806 [00:00<?, ? examples/s]

Map:   0%|          | 0/5213 [00:00<?, ? examples/s]

In [62]:
tokenizer = Wav2Vec2CTCTokenizer(
    "./vocab.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|"
)

tokenizer.add_tokens(vocab_list)


1
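
The `1` returned by `add_tokens` is the number of tokens that were not already in the tokenizer's vocabulary. A plausible explanation, sketched below with toy data, is the space character: `vocab_list` still contains `" "`, while `vocab.json` stores the word delimiter as `"|"`. This is an inference from the cells above, not something the notebook states.

```python
# Toy stand-ins for the notebook's `vocab_list` and `vocab_dict`.
vocab_list = ["a", "b", " "]
vocab_dict = {ch: idx for idx, ch in enumerate(vocab_list)}
vocab_dict["|"] = vocab_dict.pop(" ")  # same swap as in the notebook
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)

# Characters that a later add_tokens(vocab_list) call would treat as new.
missing = [tok for tok in vocab_list if tok not in vocab_dict]
print(missing)
# [' ']
```

If that is the cause, the tokenizer ends up with a separate space token alongside `|`. Skipping the `add_tokens` call would avoid the extra entry, since every other character is already in `vocab.json`.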

In [63]:
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)


In [64]:
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)


In [65]:
common_voice_train[0]["path"]


'/root/.cache/huggingface/datasets/downloads/extracted/c5031a95ac0e40545f16661cce5727f9d20ab85ee4a9fa43234b6306ea2568e3/cv-corpus-6.1-2020-12-11/fa/clips/common_voice_fa_18202376.mp3'

In [66]:
common_voice_train[0]["audio"]


{'path': '/root/.cache/huggingface/datasets/downloads/extracted/c5031a95ac0e40545f16661cce5727f9d20ab85ee4a9fa43234b6306ea2568e3/cv-corpus-6.1-2020-12-11/fa/clips/common_voice_fa_18202376.mp3',
 'array': array([ 0.00000000e+00,  3.80888761e-14,  2.51364142e-15, ...,
        -2.76742480e-06, -5.66327799e-06, -4.90683669e-06]),
 'sampling_rate': 48000}

In [67]:
common_voice_train = common_voice_train.cast_column("audio", Audio(sampling_rate=16_000))
common_voice_test = common_voice_test.cast_column("audio", Audio(sampling_rate=16_000))


In [68]:
common_voice_train[0]["audio"]


{'path': '/root/.cache/huggingface/datasets/downloads/extracted/c5031a95ac0e40545f16661cce5727f9d20ab85ee4a9fa43234b6306ea2568e3/cv-corpus-6.1-2020-12-11/fa/clips/common_voice_fa_18202376.mp3',
 'array': array([ 9.09494702e-13, -1.81898940e-12,  1.00044417e-11, ...,
         5.28323289e-06,  4.76915011e-06, -1.47444371e-06]),
 'sampling_rate': 16000}

In [69]:
rand_int = random.randint(0, len(common_voice_train)-1)

ipd.Audio(data=common_voice_train[rand_int]["audio"]["array"], autoplay=True, rate=16000)


In [70]:
rand_int = random.randint(0, len(common_voice_train)-1)

print("Target text:", common_voice_train[rand_int]["sentence"])
print("Input array shape:", common_voice_train[rand_int]["audio"]["array"].shape)
print("Sampling rate:", common_voice_train[rand_int]["audio"]["sampling_rate"])


Target text: به شوهرجان گفتم این سری به کسی نگیم و خودم خونه رو تمیز کنم و ذره ذره وسایلارو بچینیم
Input array shape: (127488,)
Sampling rate: 16000


In [28]:
def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched"
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]

    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch


In [71]:
common_voice_train = common_voice_train.map(prepare_dataset, remove_columns=common_voice_train.column_names, num_proc=4)
common_voice_test = common_voice_test.map(prepare_dataset, remove_columns=common_voice_test.column_names, num_proc=4)


  self.pid = os.fork()


Map (num_proc=4):   0%|          | 0/12806 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/5213 [00:00<?, ? examples/s]

In [72]:
def filter_datasets_by_duration(train_dataset, test_dataset):
    def add_duration(example):
        # Access input_values and sampling_rate directly
        input_values = example['input_values']
        sampling_rate = processor.feature_extractor.sampling_rate # Assuming processor is available in scope
        duration = len(input_values) / sampling_rate
        example['duration'] = duration
        return example

    # Add duration to datasets
    train_dataset = train_dataset.map(add_duration)
    test_dataset = test_dataset.map(add_duration)

    # Filter train dataset (4s to 6s)
    filtered_train = train_dataset.filter(lambda x: 4 <= x['duration'] <= 6)

    # Filter test dataset (0s to 15s)
    filtered_test = test_dataset.filter(lambda x: 0 <= x['duration'] <= 15)

    return filtered_train, filtered_test


common_voice_train, common_voice_test = filter_datasets_by_duration(common_voice_train, common_voice_test)
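
The duration used by the filter is the number of samples divided by the sampling rate. A small sketch of that arithmetic follows, using the 127488-sample clip printed earlier as an example. The helper names are hypothetical, and the 4 s to 6 s window is copied from the function above.

```python
SAMPLING_RATE = 16_000  # Hz, matching the feature extractor configured above

def duration_seconds(num_samples, sampling_rate=SAMPLING_RATE):
    """Clip length in seconds for a mono waveform."""
    return num_samples / sampling_rate

def keep_for_training(num_samples, min_s=4.0, max_s=6.0):
    """Mirror the 4 s to 6 s training window used above."""
    return min_s <= duration_seconds(num_samples) <= max_s

print(duration_seconds(127_488))   # 7.968
print(keep_for_training(127_488))  # False
print(keep_for_training(80_000))   # True (exactly 5 s)
```

The window is narrow: according to the outputs in this notebook, it keeps 4321 of the 12806 training clips, while the 0 s to 15 s test window keeps 5212 of 5213.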


Map:   0%|          | 0/12806 [00:00<?, ? examples/s]

Map:   0%|          | 0/5213 [00:00<?, ? examples/s]

Filter:   0%|          | 0/12806 [00:00<?, ? examples/s]

Filter:   0%|          | 0/5213 [00:00<?, ? examples/s]

In [73]:
@dataclass
class DataCollatorCTCWithPadding:

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None


    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # if torch.cuda.is_available():
        #     batch = {k: v.to(device) for k, v in batch.items()}

        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        # Move batch to device
        batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}

        return batch
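
The label handling in the collator can be illustrated without tensors. The sketch below is a pure-Python stand-in for the `pad` call followed by `masked_fill`. The function name `pad_labels` and the pad id `55` are illustrative assumptions.

```python
def pad_labels(label_seqs, pad_id, ignore_index=-100):
    """Right-pad label id lists to equal length, then mask the padding."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, mask = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    # Positions where the mask is 0 are padding; the loss should skip them.
    return [
        [tok if keep else ignore_index for tok, keep in zip(row, m)]
        for row, m in zip(padded, mask)
    ]

print(pad_labels([[5, 9, 2], [7]], pad_id=55))
# [[5, 9, 2], [7, -100, -100]]
```

Masking by attention mask rather than by token value means a genuine label that happens to equal the pad id would not be masked by mistake.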


In [74]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)


In [75]:
wer_metric = load_metric("wer")


You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [76]:
def compute_metrics(pred):
    pred_ids = pred.predictions.argmax(-1)
    pred_str = processor.batch_decode(pred_ids)
    label_ids = pred.label_ids
    # replace -100 (ignored positions) with the pad token id so the labels can be decoded
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    # we do not want to group tokens when decoding the reference labels
    label_str = processor.batch_decode(label_ids, group_tokens=False)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}
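
The predictions are per-frame token ids, so decoding them merges consecutive repeats and drops the CTC blank, which Wav2Vec2 represents with the pad token. Reference labels are already final character sequences, which is why they are decoded with `group_tokens=False`: merging repeats there would collapse genuinely doubled letters. A minimal sketch of the greedy collapse, with toy ids and a hypothetical function name:

```python
from itertools import groupby

def ctc_greedy_collapse(ids, blank_id):
    """Merge consecutive repeats, then drop the blank (padding) token."""
    merged = [key for key, _ in groupby(ids)]
    return [i for i in merged if i != blank_id]

BLANK = 0
frames = [3, 3, 0, 3, 4, 4, 0, 0, 5]
print(ctc_greedy_collapse(frames, BLANK))
# [3, 3, 4, 5]
```

The blank between the two runs of `3` is what lets the output keep a doubled token.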


In [77]:
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
).to(device)


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-large-xlsr-53 and are newly initialized: ['lm_head.bias', 'lm_head.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [78]:
model.freeze_feature_extractor()




In [79]:
model.gradient_checkpointing_enable()


In [80]:
mp.set_start_method('spawn', force=True)

training_args = TrainingArguments(
  output_dir="./wav2vec2-large-xlsr-persian-demo",
  group_by_length=True,
  per_device_train_batch_size=12,
  gradient_accumulation_steps=2,
  evaluation_strategy="steps",
  num_train_epochs=5,
  fp16=True,
  no_cuda=not torch.cuda.is_available(),
  save_strategy="epoch",
  save_steps=150,
  eval_steps=150,
  # save_steps=1,
  # eval_steps=1,
  logging_steps=100,
  learning_rate=1e-4,
  warmup_steps=1000,
  save_total_limit=2,
  dataloader_num_workers=0,
  dataloader_pin_memory=False,
)




In [81]:
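# Use AdamW optimizer with weight decay
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Create a learning rate scheduler.
# The number of optimizer steps depends on the batch size as well as on gradient
# accumulation: batches per epoch // accumulation steps (assumes a single device).
batches_per_epoch = -(-len(common_voice_train) // training_args.per_device_train_batch_size)  # ceiling division
steps_per_epoch = batches_per_epoch // training_args.gradient_accumulation_steps
num_training_steps = steps_per_epoch * int(training_args.num_train_epochs)
# Note: num_warmup_steps should be smaller than num_training_steps for the decay phase to run.
lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=num_training_steps)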

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=common_voice_train,
    eval_dataset=common_voice_test,
    tokenizer=processor.feature_extractor,
    optimizers=(optimizer, lr_scheduler) # Pass the optimizer and lr_scheduler as a tuple
)
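
A rough sketch of the step arithmetic behind the scheduler, using values from this notebook: 4321 filtered training clips (reported further down), batch size 12, gradient accumulation 2, 5 epochs, and 1000 warmup steps. It assumes a single device. The function names are hypothetical, and `linear_warmup_decay` re-implements the usual linear warmup and decay rule rather than calling `transformers`.

```python
import math

def total_update_steps(num_examples, batch_size, grad_accum, epochs):
    """Optimizer updates for a single-device run (assumption: one GPU)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = batches_per_epoch // grad_accum
    return updates_per_epoch * epochs

def linear_warmup_decay(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

steps = total_update_steps(4321, batch_size=12, grad_accum=2, epochs=5)
print(steps)                                   # 900
print(linear_warmup_decay(steps, 1000, steps)) # 0.9
```

Under these assumptions the run has about 900 updates, which matches the last step in the training log. With a 1000-step warmup the multiplier is still rising at the final step, so the learning rate never reaches its configured peak. A shorter warmup, or more epochs, would be needed for the decay phase to take effect.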




In [82]:
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total number of trainable parameters: {total_params}")
print(f"Total number of parameters: {sum(p.numel() for p in model.parameters())}")


Total number of trainable parameters: 311289019
Total number of parameters: 315499195
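
The gap between the two counts is the set of parameters with gradients disabled, presumably the convolutional feature encoder frozen by `model.freeze_feature_extractor()`. A short worked calculation from the printed numbers:

```python
total_params = 315_499_195      # from the output above
trainable_params = 311_289_019  # from the output above

frozen_params = total_params - trainable_params
print(f"Frozen parameters: {frozen_params:,}")
# Frozen parameters: 4,210,176
print(f"Trainable share: {trainable_params / total_params:.2%}")
```

Freezing removes only a small fraction of the model from training, so most of the memory cost of fine-tuning remains.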


In [83]:
# number of training datapoints
print(f"Number of training files: {len(common_voice_train)}")


Number of training files: 4321


In [84]:
start_time = time.time()

trainer.train()

end_time = time.time()
training_time = end_time - start_time
print(f"Training time: {training_time} seconds")




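| Step | Training Loss | Validation Loss | Wer |
|------|---------------|-----------------|-----|
| 150 | 25.8841 | 26.647411 | 1.0 |
| 300 | 5.9368 | 3.844015 | 0.998649 |
| 450 | 3.4101 | 3.11899 | 0.998649 |
| 600 | 2.9987 | 3.025712 | 0.998649 |
| 750 | 2.9667 | 2.975458 | 0.998649 |
| 900 | 2.937 | 2.967196 | 0.998649 |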




Training time: 4511.272331476212 seconds


In [85]:
print(f"Number of testing files: {len(common_voice_test)}")


Number of testing files: 5212


In [86]:
# testing
trainer.evaluate()




{'eval_loss': 2.967195510864258,
 'eval_wer': 0.9986490205398915,
 'eval_runtime': 360.7933,
 'eval_samples_per_second': 14.446,
 'eval_steps_per_second': 1.807,
 'epoch': 4.986149584487535}
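
An `eval_wer` close to 1.0 means that almost every reference word counted as an error. For reference, the sketch below shows what the metric measures: the word-level edit distance divided by the number of reference words. It is a per-sentence illustration with a hypothetical function name, not the implementation behind `load_metric("wer")`, and it assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))
# 0.5
```

An empty hypothesis scores exactly 1.0 under this definition, so a WER that stays near 1.0 is consistent with a model that emits few or no usable words.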

In [88]:
# Save the final model
model.save_pretrained("./wav2vec2-large-xlsr-persian-demo-final")
processor.save_pretrained("./wav2vec2-large-xlsr-persian-demo-final")


[]

In [93]:
# mount Google Drive so the saved model can be copied there
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [94]:
# copy the saved model directory to Google Drive

!cp -r ./wav2vec2-large-xlsr-persian-demo-final /content/drive/MyDrive/
