### Training A Neural-Network-Based Speech-To-Text Model

Now that processed.tsv is ready with audio paths, transcripts, tokens and token indices, the next step is to train a neural-network-based (here, transformer-based) speech-to-text model.

One example of a neural speech-to-text model is Wav2Vec 2.0, developed by Facebook AI (now Meta AI) and available through the Hugging Face Transformers library. We take a pre-trained checkpoint, which is a strong starting point for speech recognition, and train it on the __Large Sinhala ASR dataset__ in the /audio folder. Then we check the trained model's performance and tune its parameters afterwards.

The model needs a __loss function__, which measures how wrong its predictions are so that it can learn from its mistakes. A widely used loss function for speech recognition is Connectionist Temporal Classification (CTC). CTC lets the model learn even when the lengths of the input (audio frames) and the output (tokens) don't match. So, when the model listens to a sentence and tries to guess the words, CTC measures how far off its guesses are from the correct words and guides it on how to improve.
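To make the length mismatch concrete, here is a minimal sketch of the rule CTC uses to turn a per-frame label sequence into a shorter output: merge consecutive repeats, then drop the blank symbol. The `BLANK` character and the `ctc_collapse` helper are hypothetical names chosen for illustration; this shows the collapsing rule only, not the loss computation itself.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real vocabularies reserve a token index for it


def ctc_collapse(frame_labels):
    """Collapse one label per audio frame into the final output string."""
    # Step 1: merge runs of identical consecutive labels ("hh" -> "h")
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop blanks; a blank between two equal letters keeps both ("l_l" -> "ll")
    return "".join(label for label in merged if label != BLANK)


frames = list("hh_e_ll_l_oo")  # 12 frames of (made-up) model output
print(ctc_collapse(frames))    # hello
```

Twelve frames collapse to five characters, which is why the audio and the transcript do not need to have the same length.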

### First
We need to resample all of the audio to 16 kHz using torchaudio, because Wav2Vec 2.0 expects speech input sampled at 16 kHz.

In [1]:
import pandas as pd
import torchaudio
from datasets import Dataset, load_metric
import soundfile as sf
import torch

# Load the TSV file into a pandas DataFrame
df = pd.read_csv("./tsv_files/processed.tsv", delimiter='\t')

# Convert the DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(df)

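# Function to load and preprocess the audio
def preprocess_audio(batch):
    # Load the audio file as float32 (soundfile returns float64 by default)
    speech_array, sample_rate = sf.read(batch['audio_path'], dtype='float32')
    waveform = torch.from_numpy(speech_array)
    # Average the channels if the file is not mono (soundfile returns frames x channels)
    if waveform.ndim > 1:
        waveform = waveform.mean(dim=1)
    # Resample to 16 kHz only if necessary
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=16000)
    batch['speech'] = waveform.numpy()
    return batch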

# Apply the preprocessing function to the dataset
dataset = dataset.map(preprocess_audio)



Then, we load the pre-trained Wav2Vec 2.0 model and processor. The model (the brain) does the learning and prediction, i.e. the neural network part. The processor (the brain's helper) prepares the input audio for the model and decodes the model's output, acting as the medium between us and the model.

In [None]:
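from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Count the vocabulary entries (one token per line)
with open("vocabulary.txt", encoding="utf-8") as f:
    vocab_size = sum(1 for _ in f)

# Load the pre-trained Wav2Vec2 model and processor; the CTC output layer is
# re-initialised because its size no longer matches the checkpoint's vocabulary
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h", vocab_size=vocab_size, ignore_mismatched_sizes=True)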


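Then we prepare the inputs and labels: the processor turns each waveform into input values for the model, and the token indices of each transcript become the labels. This is where the loss function comes in, since CTC maps the features the model finds in the audio to the tokens of the transcript.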

In [None]:
def tokenize_text(batch):
    batch['input_values'] = processor(batch['speech'], sampling_rate=16000).input_values[0]
    batch['labels'] = batch['token_indices']
    return batch

# Apply tokenization
dataset = dataset.map(tokenize_text)
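One thing to check at this step: a TSV file has no list type, so a column of token indices read back with pandas arrives as text unless it is parsed. A minimal sketch of that parsing, assuming the `token_indices` cells look like `"[12, 5, 33]"` (the exact format written to processed.tsv is an assumption, and `parse_token_indices` is a hypothetical helper):

```python
import ast


def parse_token_indices(value):
    """Return a list of ints from a cell that holds a list or its string form."""
    if isinstance(value, str):
        # literal_eval only accepts Python literals, so it is safer than eval()
        value = ast.literal_eval(value)
    return [int(i) for i in value]


print(parse_token_indices("[12, 5, 33]"))  # [12, 5, 33]
```

If the column is stored differently (for example space-separated numbers), the parsing line would need to change accordingly.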


We define the hyperparameters used to train the pre-trained model on the Sinhala dataset. These can be adjusted later while fine-tuning it for the language.

In [None]:
from transformers import TrainingArguments, Trainer

# Define training arguments
training_args = TrainingArguments(
    output_dir="./wav2vec2-finetuned",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=3,
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=dataset,  # Replace with a validation set if available
    tokenizer=processor.feature_extractor,
)
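The Trainer setup above does not specify a `data_collator`. Clips and transcripts differ in length, so each batch has to be padded before it can be stacked into tensors. Below is a minimal sketch of that padding step in plain Python, for illustration only: `pad_batch` is a hypothetical helper, not a Transformers API, and a real collator would return torch tensors and be passed to `Trainer` through its `data_collator` argument. Labels are padded with -100, the value the Hugging Face CTC examples use so that padded positions are ignored by the loss.

```python
def pad_batch(features, input_pad=0.0, label_pad=-100):
    """Pad input_values and labels in a list of examples to a common length."""
    max_input = max(len(f["input_values"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_values": [], "labels": []}
    for f in features:
        inputs, labels = list(f["input_values"]), list(f["labels"])
        # Right-pad the audio with silence (zeros)
        batch["input_values"].append(inputs + [input_pad] * (max_input - len(inputs)))
        # Right-pad the labels with the ignore value
        batch["labels"].append(labels + [label_pad] * (max_label - len(labels)))
    return batch


examples = [
    {"input_values": [0.1, 0.2, 0.3], "labels": [4, 7]},
    {"input_values": [0.5], "labels": [9, 2, 3]},
]
padded = pad_batch(examples)
print(padded["labels"])  # [[4, 7, -100], [9, 2, 3]]
```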


Now we start training the model.

In [None]:
trainer.train()


Measuring the performance of the trained model is now fairly easy.

We measure the model's performance with two metrics: Word Error Rate (WER) and Character Error Rate (CER).
In simple terms:
- WER tells you how many words the model got wrong in its transcription.
- CER tells you how many characters the model got wrong in its transcription.

#### WER 
WER measures the difference between the predicted text and the actual text at the word level. It is the number of word-level errors (substitutions, deletions and insertions) divided by the total number of words in the reference (correct) transcript. __LOWER WER = MORE ACCURATE.__

$\text{WER} = \frac{S + D + I}{N}$
- **S** = Number of substitutions (wrong word instead of the correct one)
- **D** = Number of deletions (missed words)
- **I** = Number of insertions (extra words added)
- **N** = Total number of words in the reference transcript
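The formula above can be worked through by hand with a small edit-distance table. The sketch below is a plain-Python illustration, not the implementation used by any metric library; `word_errors` and `wer` are hypothetical helper names. It counts S, D and I for the cheapest alignment between reference and prediction.

```python
def word_errors(reference, hypothesis):
    """Return (S, D, I, N) for the minimum-edit alignment of two sentences."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total, S, D, I) for ref[:i] against hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # insert every predicted word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match, no cost
                continue
            t, s, d, n_ins = table[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n_ins)
            t, s, d, n_ins = table[i - 1][j]
            deletion = (t + 1, s, d + 1, n_ins)
            t, s, d, n_ins = table[i][j - 1]
            insertion = (t + 1, s, d, n_ins + 1)
            table[i][j] = min(substitution, deletion, insertion)
    _, s, d, n_ins = table[len(ref)][len(hyp)]
    return s, d, n_ins, len(ref)


def wer(reference, hypothesis):
    s, d, n_ins, n = word_errors(reference, hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + n_ins) / n


print(word_errors("the cat sat on the mat", "the cat sit on mat"))  # (1, 1, 0, 6)
```

Here one word is substituted ("sat" became "sit") and one is deleted (the second "the"), so WER = (1 + 1 + 0) / 6, roughly 0.33.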

#### CER
Character Error Rate is the same as WER but computed over individual characters. It is useful when the text contains a lot of short words, or when you want finer granularity in the error analysis. __LOWER CER = MORE ACCURATE.__

$\text{CER} = \frac{S + D + I}{N}$
- **S** = Number of substitutions (wrong character instead of the correct one)
- **D** = Number of deletions (missed characters)
- **I** = Number of insertions (extra characters added)
- **N** = Total number of characters in the reference transcript
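As a rough illustration of the CER formula, the sketch below computes the character-level edit distance and divides it by the reference length. `char_error_rate` is a hypothetical helper, and two choices in it are assumptions rather than part of the original pipeline: text is NFC-normalised first, and "character" means a Unicode code point. For Sinhala this matters, because a consonant with a vowel sign is more than one code point, so one wrong syllable can count as more than one error.

```python
import unicodedata


def char_error_rate(reference, hypothesis):
    """Character-level (S + D + I) / N, counted over Unicode code points."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    # Two-row Levenshtein distance: previous[j] = distance(ref[:i-1], hyp[:j])
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


print(char_error_rate("kitten", "sitting"))  # 0.5
```

Spaces are counted as characters here; whether to strip them before scoring is a design choice that should match whichever CER implementation the results are compared against.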



In [None]:
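import evaluate

# Load the WER metric (load_metric is no longer available in recent versions of datasets)
wer_metric = evaluate.load("wer")

# Decode the predictions and score them against the reference transcripts
def compute_metrics(pred):
    pred_ids = pred.predictions.argmax(-1)
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    # Compute WER; the references line up because eval_dataset is the full dataset
    wer = wer_metric.compute(predictions=pred_str, references=dataset['transcript'])
    return {"wer": wer}

# Register the metric function with the Trainer, then evaluate the model
trainer.compute_metrics = compute_metrics
results = trainer.evaluate()
print(f"Word Error Rate (WER): {results['eval_wer']}")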
