## **Downloading the Dataset**

In [None]:
!pip install --upgrade --no-cache-dir gdown
!gdown 1pUcYT_Tmy58guECl87b3-AUw0ygGCsaz

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Downloading...
From: https://drive.google.com/uc?id=1pUcYT_Tmy58guECl87b3-AUw0ygGCsaz
To: /content/common_voice_12.zip
100% 59.7M/59.7M [00:01<00:00, 33.6MB/s]


We can now uncompress it:

In [None]:
%%capture
!unzip common_voice_12.zip -d data

We also have to install the needed libraries:

In [None]:
%%capture
!pip install transformers
!git clone https://github.com/speechbrain/speechbrain.git
%cd speechbrain
!pip install -r requirements.txt
!pip install .
%cd ..

## **Step 1: Data Preparation**

In [None]:
import json
import torchaudio
from tqdm.contrib import tzip
import re

# Create the data-manifest files
def create_json(tsv_file, data_folder, json_file, max_duration):
  """Build a JSON manifest from a Common Voice TSV file.

  Reading stops once the cumulative duration exceeds max_duration (seconds).
  """
  json_dict = {}

  # Read the TSV file, skipping the header line
  with open(tsv_file, "r", encoding="utf-8") as tsv_f:
    loaded_tsv = tsv_f.readlines()[1:]

  # Set the torchaudio backend to sox_io (needed to read mp3 files)
  if torchaudio.get_audio_backend() != "sox_io":
    torchaudio.set_audio_backend("sox_io")

  # Cumulative duration (in seconds) of all the lines read so far
  total_duration = 0
  for line in tzip(loaded_tsv):
    line = line[0]

    # Get path and identifier
    # NOTE: the clips folder is hardcoded here; data_folder is not used
    line_members = line.split("\t")
    mp3_path = "/content/data/common_voice_12/sr/clips/" + line_members[1]
    snt_id = line_members[1].split(".")[0]

    # Get duration
    audioinfo = torchaudio.info(mp3_path)
    duration = audioinfo.num_frames / audioinfo.sample_rate

    # Get words
    words = line_members[2]

    # Remove multiple spaces
    removed_mul_spaces = re.sub(" +", " ", words)
    # Remove spaces at the beginning and the end of the sentence
    words = removed_mul_spaces.strip()

    # Stop once the cumulative duration exceeds max_duration
    total_duration += duration
    if total_duration > max_duration:
      break

    # Create an entry for this utterance, skipping too short sentences (< 3 words)
    if len(words.split(" ")) > 2:
      json_dict[snt_id] = {
          "path": mp3_path,
          "duration": duration,
          "words": words,
      }

  # Write the dictionary to the JSON file
  print(total_duration)
  with open(json_file, mode="w") as json_f:
    json.dump(json_dict, json_f, indent=2)




# Set up data folder
data_folder='data/common_voice_12/sr'
max_durations=[3600,400,600]

# Create json files
create_json(data_folder+'/train.tsv',data_folder,'train.json',max_durations[0])
create_json(data_folder+'/dev.tsv',data_folder,'valid.json',max_durations[1])
create_json(data_folder+'/test.tsv',data_folder,'test.json',max_durations[2])

  0%|          | 0/1380 [00:00<?, ?it/s]

3603.8520000000053


  0%|          | 0/1037 [00:00<?, ?it/s]

401.0039999999997


  0%|          | 0/1112 [00:00<?, ?it/s]

601.8120000000004
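
To sanity-check a manifest before training, it can help to count its utterances and sum their durations. The sketch below assumes the JSON layout written by `create_json` (one entry per utterance with `path`, `duration`, and `words`); the helper name `summarize_manifest` and the toy entries are hypothetical.

```python
import json
import os
import tempfile


def summarize_manifest(json_file):
    """Return (number of utterances, total duration in seconds) of a manifest."""
    with open(json_file, "r") as json_f:
        manifest = json.load(json_f)
    total_duration = sum(entry["duration"] for entry in manifest.values())
    return len(manifest), total_duration


# Toy manifest with made-up entries, in the same layout as train.json
toy_manifest = {
    "utt_a": {"path": "clips/utt_a.mp3", "duration": 3.5, "words": "jedan dva tri"},
    "utt_b": {"path": "clips/utt_b.mp3", "duration": 4.0, "words": "dobar dan svima"},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.json")
    with open(toy_path, "w") as json_f:
        json.dump(toy_manifest, json_f)
    n_utts, seconds = summarize_manifest(toy_path)

print(n_utts, seconds)
```

On the real files, `summarize_manifest("train.json")` should report a total below the value printed by `create_json`, since that printed value also counts the utterance that crossed the limit and any sentences skipped for having fewer than three words.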


## **Step 2: Speech recognition with Whisper**


**Hyperparameters**:


In [None]:
%%file hparams_sr_whisper.yaml


# #################################
# Basic training parameters for fine-tuning Whisper on Serbian
# speech recognition (Common Voice 12)
#
# #################################

# Seed needs to be set at top of yaml, before objects with parameters are made
seed: 1986
__set_seed: !!python/object/apply:torch.manual_seed [!ref <seed>]

# Folder containing the Common Voice data (set on the command line)
data_folder: !PLACEHOLDER  # e.g., /path/to/data
output_folder: !ref results/train_with_whisper/sr/<seed>
save_folder: !ref <output_folder>/save
train_log: !ref <output_folder>/train_log.txt

# Path where data manifest files are stored
train_annotation: train.json
valid_annotation: valid.json
test_annotation: test.json

locale: serbian


whisper_hub: openai/whisper-tiny
whisper_folder: !ref <save_folder>/whisper_checkpoint

# Normalize inputs with the same normalization done in the paper (https://cdn.openai.com/papers/whisper.pdf). Refer to Appendix C for further information.
normalized_transcripts: True

# We remove utterances longer than 10s in the train/dev/test sets, as
# longer sentences certainly correspond to "open microphones".
avoid_if_longer_than: 10.0



# Training parameters
number_of_epochs: 1
lr_whisper: 0.00003
sorting: ascending
auto_mix_prec: False
sample_rate: 16000

# With data_parallel batch_size is split into N jobs
# With DDP batch_size is multiplied by N jobs
batch_size: 12
test_batch_size: 8

# These values are only used for the searchers.
# They need to be hardcoded and should not be changed with Whisper.
# They are used as part of the searching process.
# The bos token of the searcher will be timestamp_index
# and will be concatenated with the bos, language and task tokens.
timestamp_index: 50363
eos_index: 50257
bos_index: 50258

# Decoding parameters
min_decode_ratio: 0.0
max_decode_ratio: 1.0
test_beam_size: 8

# Model parameters
freeze_whisper: False
freeze_encoder: True

train_loader_kwargs:
    batch_size: !ref <batch_size>

valid_loader_kwargs:
    batch_size: !ref <batch_size>

test_loader_kwargs:
    batch_size: !ref <test_batch_size>


#
# Functions and classes
#
epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

augmentation: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]


whisper: !new:speechbrain.lobes.models.huggingface_whisper.HuggingFaceWhisper
    source: !ref <whisper_hub>
    freeze: !ref <freeze_whisper>
    freeze_encoder: !ref <freeze_encoder>
    save_path: !ref <whisper_folder>
    encoder_only: False

log_softmax: !new:speechbrain.nnet.activations.Softmax
    apply_log: True

nll_loss: !name:speechbrain.nnet.losses.nll_loss

modules:
    whisper: !ref <whisper>

whisper_opt_class: !name:torch.optim.AdamW
    lr: !ref <lr_whisper>
    weight_decay: 0.000000001

valid_greedy_searcher: !new:speechbrain.decoders.seq2seq.S2SWhisperGreedySearch
    model: !ref <whisper>
    bos_index: !ref <timestamp_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>

test_beam_searcher: !new:speechbrain.decoders.seq2seq.S2SWhisperBeamSearch
    module: [!ref <whisper>]
    bos_index: !ref <timestamp_index>
    eos_index: !ref <eos_index>
    min_decode_ratio: !ref <min_decode_ratio>
    max_decode_ratio: !ref <max_decode_ratio>
    beam_size: !ref <test_beam_size>

lr_annealing_whisper: !new:speechbrain.nnet.schedulers.NewBobScheduler
    initial_value: !ref <lr_whisper>
    improvement_threshold: 0.0025
    annealing_factor: 0.9
    patient: 0

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        whisper: !ref <whisper>
        scheduler_whisper: !ref <lr_annealing_whisper>
        counter: !ref <epoch_counter>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

error_rate_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats

cer_computer: !name:speechbrain.utils.metric_stats.ErrorRateStats
    split_tokens: True

Overwriting hparams_sr_whisper.yaml
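
The `lr_annealing_whisper` entry above configures a NewBob-style schedule: after each validation pass, the learning rate is multiplied by `annealing_factor` whenever the relative improvement of the validation loss falls below `improvement_threshold`. The function below is a simplified sketch of that rule, written from the parameter names in the YAML. It ignores the `patient` counter, the name `newbob_step` is hypothetical, and it is not SpeechBrain's implementation.

```python
def newbob_step(lr, prev_loss, new_loss,
                improvement_threshold=0.0025, annealing_factor=0.9):
    """Return the learning rate to use after one validation pass.

    Simplified sketch: no patience counter. prev_loss is None on the first
    pass, when there is nothing to compare against and lr is kept.
    """
    if prev_loss is None:
        return lr
    # Relative improvement of the loss (0 if the previous loss was 0)
    improvement = 0.0 if prev_loss == 0 else (prev_loss - new_loss) / prev_loss
    if improvement < improvement_threshold:
        return lr * annealing_factor
    return lr


lr = 0.00003  # lr_whisper from the YAML file
first = newbob_step(lr, None, 1.2)      # first pass: lr kept
improved = newbob_step(lr, 1.2, 1.0)    # clear improvement: lr kept
stalled = newbob_step(lr, 1.0, 0.999)   # improvement below threshold: lr * 0.9
print(first, improved, stalled)
```

Under this reading, a run with `number_of_epochs: 1` has a single validation pass and no previous loss to compare against, so the learning rate should stay at `lr_whisper` throughout.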


**Training script**:

In [None]:
%%file train_with_whisper.py



#!/usr/bin/env python3
"Recipe for fine-tuning Whisper for speech recognition."
import os
import sys
import torch
import torchaudio
import speechbrain as sb
from speechbrain.utils.data_utils import undo_padding
from hyperpyyaml import load_hyperpyyaml
from speechbrain.utils.distributed import run_on_main


# Brain class for speech recognition training
class ASR(sb.Brain):
    def compute_forward(self, batch, stage):
        """Forward computations from the waveform batches to the output probabilities."""
        
        batch = batch.to(self.device)
        wavs, wav_lens = batch.sig
        bos_tokens, bos_tokens_lens = batch.tokens_bos

        
        # Add augmentation if specified
        if stage == sb.Stage.TRAIN:
            if hasattr(self.hparams, "augmentation"):
                wavs = self.hparams.augmentation(wavs, wav_lens)

        # We compute the padding mask and replace the values with the pad_token_id
        # that the Whisper decoder expects to see.
        abs_tokens_lens = (bos_tokens_lens * bos_tokens.shape[1]).long()
        pad_mask = (
            torch.arange(abs_tokens_lens.max(), device=self.device)[None, :]
            < abs_tokens_lens[:, None]
        )
        bos_tokens[~pad_mask] = self.tokenizer.pad_token_id


        
        # Forward encoder + decoder
        enc_out, logits, _ = self.modules.whisper(wavs, bos_tokens)

        
        # Generate hypotheses for validation and test using the greedy searcher
        hyps = None
        if stage != sb.Stage.TRAIN:
            hyps, _ = self.hparams.valid_greedy_searcher(enc_out, wav_lens)

        return logits, hyps, wav_lens

    def compute_objectives(self, predictions, batch, stage):
        """Computes the loss NLL given predictions and targets."""

        
        # Apply log-softmax and compute the NLL loss
        logits, hyps, wav_lens = predictions
        batch = batch.to(self.device)
        ids = batch.id
        tokens_eos, tokens_eos_lens = batch.tokens_eos

        log_probs = self.hparams.log_softmax(logits)
        loss = self.hparams.nll_loss(log_probs, tokens_eos, length=tokens_eos_lens,)

        if stage != sb.Stage.TRAIN:
            tokens, tokens_lens = batch.tokens

            
            # Decode the predicted token ids to text
            predicted_words = self.tokenizer.batch_decode(hyps, skip_special_tokens=True)

            
            # Remove padding and decode the target token ids to text
            target_words = undo_padding(tokens, tokens_lens)
            target_words = self.tokenizer.batch_decode(target_words, skip_special_tokens=True)

            if getattr(self.hparams, "normalized_transcripts", False):
                predicted_words = [
                    self.tokenizer._normalize(text).split(" ")
                    for text in predicted_words
                ]

                target_words = [
                    self.tokenizer._normalize(text).split(" ")
                    for text in target_words
                ]
            else:
                predicted_words = [text.split(" ") for text in predicted_words]

                target_words = [text.split(" ") for text in target_words]
            self.wer_metric.append(ids, predicted_words, target_words)
            self.cer_metric.append(ids, predicted_words, target_words)

        return loss

    def on_stage_start(self, stage, epoch):
        """Gets called at the beginning of each stage (train, valid, or test)."""
        if stage != sb.Stage.TRAIN:
            self.cer_metric = self.hparams.cer_computer()
            self.wer_metric = self.hparams.error_rate_computer()

    def on_stage_end(self, stage, stage_loss, epoch):
        """Gets called at the end of an epoch."""
        # Compute/store important stats
        stage_stats = {"loss": stage_loss}
        if stage == sb.Stage.TRAIN:
            self.train_stats = stage_stats
        else:
            stage_stats["CER"] = self.cer_metric.summarize("error_rate")
            stage_stats["WER"] = self.wer_metric.summarize("error_rate")

        # Perform end-of-iteration things, like annealing, logging, etc.
        if stage == sb.Stage.VALID:

            old_lr_whisper, new_lr_whisper = self.hparams.lr_annealing_whisper(
                stage_stats["loss"]
            )

            sb.nnet.schedulers.update_learning_rate(
                self.optimizer, new_lr_whisper
            )
            self.hparams.train_logger.log_stats(
                stats_meta={"epoch": epoch, "lr_whisper": old_lr_whisper},
                train_stats=self.train_stats,
                valid_stats=stage_stats,
            )
            self.checkpointer.save_and_keep_only(
                meta={"WER": stage_stats["WER"]}, min_keys=["WER"],
            )
        elif stage == sb.Stage.TEST:
            self.hparams.train_logger.log_stats(
                stats_meta={"Epoch loaded": self.hparams.epoch_counter.current},
                test_stats=stage_stats,
            )
            with open(self.hparams.wer_file, "w") as w:
                self.wer_metric.write_stats(w)


def dataio_prepare(hparams, tokenizer):
    """This function prepares the datasets to be used in the brain class.
    It also defines the data processing pipeline through user-defined functions."""
    data_folder = hparams["data_folder"]

    train_data = sb.dataio.dataset.DynamicItemDataset.from_json(
        json_path=hparams["train_annotation"], replacements={"data_root": data_folder},
    )

    if hparams["sorting"] == "ascending":
        # we sort training data to speed up training and get better results.
        train_data = train_data.filtered_sorted(sort_key="duration")
        # When sorting, do not shuffle in the dataloader; otherwise sorting is pointless
        hparams["train_loader_kwargs"]["shuffle"] = False

    elif hparams["sorting"] == "descending":
        train_data = train_data.filtered_sorted(
            sort_key="duration", reverse=True
        )
        # When sorting, do not shuffle in the dataloader; otherwise sorting is pointless
        hparams["train_loader_kwargs"]["shuffle"] = False

    elif hparams["sorting"] == "random":
        pass

    else:
        raise NotImplementedError(
            "sorting must be random, ascending or descending"
        )

    valid_data = sb.dataio.dataset.DynamicItemDataset.from_json(
        json_path=hparams["valid_annotation"], replacements={"data_root": data_folder},
    )
    valid_data = valid_data.filtered_sorted(sort_key="duration")

    # test is separate
    test_data = sb.dataio.dataset.DynamicItemDataset.from_json(
        json_path=hparams["test_annotation"], replacements={"data_root": data_folder},
    )

    datasets = [train_data, valid_data, test_data]

    # 2. Define audio pipeline:
    @sb.utils.data_pipeline.takes("path")
    @sb.utils.data_pipeline.provides("sig")
    def audio_pipeline(path):
        
        # Load and resample audio
        info = torchaudio.info(path)
        sig = sb.dataio.dataio.read_audio(path)
        resampled = torchaudio.transforms.Resample(
            info.sample_rate, hparams["sample_rate"],
        )(sig)
        return resampled

    sb.dataio.dataset.add_dynamic_item(datasets, audio_pipeline)

    # 3. Define text pipeline:
    @sb.utils.data_pipeline.takes("words")
    @sb.utils.data_pipeline.provides(
        "words", "tokens_list", "tokens_bos", "tokens_eos", "tokens"
    )
    def text_pipeline(words):
        
        yield words
        tokens_list = tokenizer.encode(words)
        # Drop the first and last tokens (BOS and EOS) added by the tokenizer
        tokens_list = tokens_list[1:-1]
        yield tokens_list
        tokens_bos = torch.LongTensor([hparams["bos_index"]] + tokens_list)
        yield tokens_bos
        tokens_eos = torch.LongTensor(tokens_list + [hparams["eos_index"]])
        yield tokens_eos
        tokens = torch.LongTensor(tokens_list)
        yield tokens

    sb.dataio.dataset.add_dynamic_item(datasets, text_pipeline)

    # 4. Set output:
    sb.dataio.dataset.set_output_keys(
        datasets,
        ["id", "sig", "tokens_list", "tokens_bos", "tokens_eos", "tokens"],
    )

    return train_data, valid_data, test_data


# Recipe begins!
if __name__ == "__main__":

    # Reading command line arguments.
    hparams_file, run_opts, overrides = sb.parse_arguments(sys.argv[1:])


    # Load hyperparameters file with command-line overrides.
    with open(hparams_file) as fin:
        hparams = load_hyperpyyaml(fin, overrides)

    # Create experiment directory
    sb.create_experiment_directory(
        experiment_directory=hparams["output_folder"],
        hyperparams_to_save=hparams_file,
        overrides=overrides,
    )

    # Defining tokenizer and loading it
    tokenizer = hparams["whisper"].tokenizer
    language = hparams["locale"]
    tokenizer.set_prefix_tokens(language, "transcribe", False)


    # Create dataset objects "train", "valid", and "test".
    train_data, valid_data, test_data = dataio_prepare(hparams,tokenizer)


    # We need to prepare the tokens for the searchers
    hparams["valid_greedy_searcher"].set_decoder_input_tokens(
        tokenizer.prefix_tokens
    )
    hparams["valid_greedy_searcher"].set_language_token(
        tokenizer.prefix_tokens[1]
    )

    hparams["test_beam_searcher"].set_decoder_input_tokens(
        tokenizer.prefix_tokens
    )
    hparams["test_beam_searcher"].set_language_token(tokenizer.prefix_tokens[1])



    # Trainer initialization
    asr_brain = ASR(
        modules=hparams["modules"],
        hparams=hparams,
        run_opts=run_opts,
        checkpointer=hparams["checkpointer"],
        opt_class=hparams["whisper_opt_class"],
    )

    # We load the pretrained whisper model
    if "pretrainer" in hparams.keys():
        run_on_main(hparams["pretrainer"].collect_files)
        hparams["pretrainer"].load_collected(asr_brain.device)

    # We dynamically add the tokenizer to our brain class.
    # NB: This tokenizer corresponds to the one used for Whisper.
    asr_brain.tokenizer = tokenizer
    asr_brain.fit(
        asr_brain.hparams.epoch_counter,
        train_data,
        valid_data,
        train_loader_kwargs=hparams["train_loader_kwargs"],
        valid_loader_kwargs=hparams["valid_loader_kwargs"],
    )


    # Load the best checkpoint (lowest validation WER) and evaluate on the test set
    asr_brain.hparams.wer_file = hparams["output_folder"] + "/wer_test.txt"
    asr_brain.evaluate(
        test_data,
        min_key="WER",
        test_loader_kwargs=hparams["test_loader_kwargs"],
    )


Overwriting train_with_whisper.py
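
The text pipeline and `compute_forward` in the script prepare the decoder input and the loss target for teacher forcing: `tokens_bos` is the token sequence with the start token prepended, `tokens_eos` is the same sequence with the end token appended, and padded positions of the input are overwritten with a pad id. The plain-Python sketch below illustrates this with made-up token ids. The names `make_decoder_io` and `mask_padding` and the pad id `-1` are hypothetical and chosen for illustration; the script itself works on torch tensors and uses the Whisper tokenizer's pad id.

```python
BOS_INDEX = 50258  # bos_index from the YAML file
EOS_INDEX = 50257  # eos_index from the YAML file
PAD_ID = -1        # hypothetical pad id, chosen only to stand out in the output


def make_decoder_io(tokens_list, bos_index=BOS_INDEX, eos_index=EOS_INDEX):
    """Return (decoder input, loss target) for one utterance."""
    tokens_bos = [bos_index] + tokens_list  # what the decoder reads
    tokens_eos = tokens_list + [eos_index]  # what the decoder should predict
    return tokens_bos, tokens_eos


def mask_padding(batch, rel_lens, pad_id=PAD_ID):
    """Overwrite the positions past each sequence's true length with pad_id.

    batch holds equal-length rows; rel_lens holds each row's true length as a
    fraction of the row length (the convention used for batch.tokens_bos).
    """
    max_len = len(batch[0])
    masked = []
    for row, rel_len in zip(batch, rel_lens):
        abs_len = round(rel_len * max_len)
        masked.append([tok if i < abs_len else pad_id for i, tok in enumerate(row)])
    return masked


# Two made-up utterances with 3 and 1 text tokens
bos_a, eos_a = make_decoder_io([11, 12, 13])
bos_b, eos_b = make_decoder_io([21])

# Batching pads the shorter row (here with 0) and records relative lengths
batch = [bos_a, bos_b + [0, 0]]
rel_lens = [1.0, 0.5]
masked = mask_padding(batch, rel_lens)

print(bos_a, eos_a)
print(masked[1])
```

At every position the target is the input shifted by one token, which is why the loss in `compute_objectives` compares the decoder output against `tokens_eos`.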


**Run the code below** to train the model.

Training takes around 20-30 minutes on a GPU.

In [None]:
# Delete the output folder to start training from scratch
# (and not from a previous checkpoint).
!rm -rf ./results/train_with_whisper/sr/1986

# Run Training
!python train_with_whisper.py hparams_sr_whisper.yaml  --data_folder='data/common_voice_12/sr' --device='cuda:0' --number_of_epochs=1 --seed=1986

Downloading (…)rocessor_config.json: 100% 185k/185k [00:00<00:00, 278kB/s]
Downloading (…)lve/main/config.json: 100% 1.96k/1.96k [00:00<00:00, 314kB/s]
Downloading pytorch_model.bin: 100% 151M/151M [00:00<00:00, 364MB/s]
speechbrain.lobes.models.huggingface_whisper - whisper encoder is frozen.
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/train_with_whisper/sr/1986
speechbrain.core - Info: auto_mix_prec arg from hparam file is used
speechbrain.core - 29.6M trainable parameters in ASR
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
100% 109/109 [00:41<00:00,  2.60it/s, train_loss=1.08]
  0% 0/13 [00:00<?, ?it/s]2023-03-17 18:12:31.704361: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enab

The expected test error rate is about 60%.
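
For reference, a word error rate is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words; a character error rate applies the same computation to characters. The function below is a minimal sketch of that definition, not the `ErrorRateStats` implementation used by the script. The name `error_rate` and the example sentences are made up.

```python
def error_rate(reference, hypothesis):
    """Edit distance between two token lists, divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] = edit distance between the reference prefix read so far
    # and the first j hypothesis tokens
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Word level: one substitution and one deletion over six reference words
wer = error_rate("the cat sat on the mat".split(), "the cat sit on mat".split())
# Character level: three edits over six reference characters
cer = error_rate(list("kitten"), list("sitting"))
print(wer, cer)
```

The function returns a fraction; multiply by 100 to express it as a percentage such as the figure quoted above.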

We can observe that an encoder-decoder model already trained for ASR gives far better results than a self-supervised encoder with a CTC decoder, even with the whisper-tiny model and a single epoch of fine-tuning. Much better results can be obtained by fine-tuning the Whisper large model on a larger dataset.