# Team Bucephalus NSU - Training Notebook - Whisper

- Team Members: `Mohammed Rakib` | `Ismail Hossain`

## Table of Contents

- [1. Introduction](#0)
- [2. Installing Dependencies](#1)
- [3. Training Configuration](#2)
- [4. Function and Class definitions](#3)
  - [4.1 Filtering Dataset](#3-1)
  - [4.2 Audio Loader](#3-2)
- [5. Main](#4)
  - [5.1 Building LM](#4-1)
  - [5.2 Loading Base Wav2Vec2 Model](#4-2)
  - [5.3 Adapting Base Wav2Vec2Model](#4-3)

<a id="0"></a> <br>
## Introduction

- This notebook contains the steps followed for adapting a pretrained Wav2Vec2 model on the provided dataset.

- We adapted a Wav2Vec2 model that was already pretrained on the OpenSLR53 Bengali dataset and is available on HuggingFace (https://huggingface.co/arijitx/wav2vec2-xls-r-300m-bengali). The model checkpoint was downloaded and added as a dataset for this notebook so that it can be run without internet enabled.

- We employed several audio augmentation techniques during training, including:
  - Speed perturbation (0.9x, 1.0x or 1.1x)
  - Volume perturbation (0.125x ~ 2.0x)
  - Adding background noise using the other audio files from the dataset.
  - Adding reverberation.

- We first trained the base model for 100.8k steps (stopped early) with an initial learning rate of 3e-5, following a cosine decay schedule. We enabled a small amount of several types of dropout during this stage.

- Next, we trained the model for 40k steps at a constant learning rate of 1e-7; this time we disabled all dropout.

- The best checkpoint from the second stage of training was paired with a 5-gram language model built from the training text (and validation text when inferring on test set).

- The best model, with decoding parameters tuned on the validation set, achieved the following LB score (mean Levenshtein distance):
  - Public:  1.46527
  - Private: 1.47186
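
The leaderboard metric named above can be sketched in a few lines. This is a minimal illustration, assuming the score is the character-level edit distance between each predicted and reference transcript, averaged over utterances. The competition's exact scoring code is not shown in this notebook, and `levenshtein` and `mean_levenshtein` are hypothetical helper names.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute all cost 1)."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


def mean_levenshtein(preds: List[str], refs: List[str]) -> float:
    """Average edit distance over paired predictions and references."""
    if len(preds) != len(refs) or not refs:
        raise ValueError("preds and refs must be non-empty and of equal length")
    return sum(levenshtein(p, r) for p, r in zip(preds, refs)) / len(refs)
```

Note that Python strings are indexed by code point, so each Bengali combining mark counts as its own character in this sketch.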

<a id="1"></a> <br>
## Installing and Importing Dependencies

In [1]:
!cp -r ../input/csefest2022dlsprintdeps ./deps

In [None]:
!pip install ./deps/pygtrie-2.5.0/pygtrie-2.5.0
!pip install ./deps/exceptiongroup-1.0.0rc8-py3-none-any.whl
!pip install ./deps/hypothesis-6.54.4-py3-none-any.whl
!pip install ./deps/pyctcdecode-0.4.0-py2.py3-none-any.whl
!pip install ./deps/pypi-kenlm-0.1.20220713/pypi-kenlm-0.1.20220713
!pip install ./deps/bnunicodenormalizer-0.0.23/bnunicodenormalizer-0.0.23
!pip install ./deps/python-Levenshtein-0.12.2/python-Levenshtein-0.12.2
!pip install ./deps/jiwer-2.3.0-py3-none-any.whl

!chmod +x ./deps/kenlm/kenlm/bin/lmplz

In [3]:
from typing import Dict, List, Tuple, Any, Union, Optional

import os
import re
import json
import random
from pprint import pprint

import unicodedata
from bnunicodenormalizer import Normalizer 

import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
from pandarallel import pandarallel
from tqdm.auto import tqdm

import torch
import torchaudio
import torchaudio.functional as F
import torchaudio.transforms as T
from torch.utils.data import Dataset, DataLoader, IterableDataset

import transformers
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2Processor, Wav2Vec2ForCTC
from transformers import TrainingArguments, Trainer

from datasets import load_dataset, load_metric
from dataclasses import dataclass, field

from IPython.display import display, Audio, HTML, Markdown

bnorm = Normalizer()
pandarallel.initialize(progress_bar=True,nb_workers=os.cpu_count())
tqdm.pandas()

INFO: Pandarallel will run on 2 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [4]:
os.environ["WANDB_DISABLED"] = "true"

<a id="2"></a> <br>
## Configuration

In [28]:
# Training config class.
class Config:
    # Path to audio data directories. Change to directory containing pre-converted and
    # resampled wav files if available, along with audio_ext.
    train_path = "../input/dlsprint/train_files"
    valid_path = "../input/dlsprint/validation_files"
    audio_ext = "mp3"
    sample_rate = 16000
    
    # Path to csv metadata files.
    train_csv_path = "../input/dlsprint/train.csv"
    valid_csv_path = "../input/dlsprint/validation.csv"
    
    # Directory where language models are written to.
    language_model_dir = "./langauge_models"
    
    # n-gram order of language model.
    ngram_order = 5

    # If True, enables all audio augmentations.
    use_augmentation = True
    
    # Dropout configs for pretrained wav2vec2 model.
    attention_dropout = 0.1
    hidden_dropout = 0.1
    feat_proj_dropout = 0.1
    mask_time_prob = 0.05
    layerdrop = 0.1
        
    # Early stopping.
    early_stopping_patience = 10

    # Trainer arguments.
    trainer = TrainingArguments(
      output_dir="./run-003-wav2vec2-fulldata-cosine-lr3e-5",
      group_by_length=False,
      per_device_train_batch_size=16,
      per_device_eval_batch_size=16,
      gradient_accumulation_steps=1,
      evaluation_strategy="steps",
      num_train_epochs=10,
      gradient_checkpointing=True,
      fp16=True,
      save_steps=400,
      eval_steps=400,
      logging_steps=400,
      learning_rate=3e-5,
      dataloader_num_workers=os.cpu_count(),
      warmup_steps=300,
      save_total_limit=10,
      push_to_hub=False,
      run_name="run-003-wav2vec2-fulldata-cosine-lr3e-5",
      load_best_model_at_end=True,
      lr_scheduler_type="cosine",
      resume_from_checkpoint=True,
    )

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
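
The `learning_rate=3e-5`, `warmup_steps=300` and `lr_scheduler_type="cosine"` settings above describe a linear warmup followed by a half-cosine decay towards zero. The sketch below reproduces that shape in plain Python so the learning rate at any step can be inspected. `cosine_lr` is a hypothetical helper, not the Trainer's own code, and `TOTAL_STEPS` is an assumption: one device, no gradient accumulation, and the 197,710 filtered training utterances reported later in this notebook.

```python
import math


def cosine_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step` for linear warmup followed by half-cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


# Assumed schedule length: ceil(197710 / 16) steps per epoch for 10 epochs.
TOTAL_STEPS = math.ceil(197710 / 16) * 10

for step in (0, 150, 300, 50_000, 100_800, TOTAL_STEPS):
    print(step, cosine_lr(step, 3e-5, 300, TOTAL_STEPS))
```

Under these assumptions the run that stopped at 100.8k steps would have ended with the learning rate already well below its peak.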


In [None]:
import json
# Vocabulary of the pretrained model we use.
with open("vocab_dict.json", encoding="utf-8") as f:
    vocab_dict_cn = json.load(f)
vocab_dict_cn['\x93'] = vocab_dict_cn.pop('\\x93')
vocab_dict_cn['\x94'] = vocab_dict_cn.pop('\\x94')
vocab_dict_cn = dict(sorted(vocab_dict_cn.items(), key=lambda x: x[1]))
vocab_dict_cn

In [38]:
skipFiles = open("corrupt_files.txt").read().splitlines()
skipFiles = skipFiles[3:]
length = len(skipFiles)
first = skipFiles[0]
last = skipFiles[-1]
length, first, last

(3715, 'common_voice_bn_30665285', 'common_voice_bn_31832651')

In [39]:
skipFiles = [ f"{p}.mp3" for p in skipFiles ]
print(f"Skipping {len(skipFiles)} files from training set")
skipFiles[0]

Skipping 3715 files from training set


'common_voice_bn_30665285.mp3'

<a id="3"></a> <br>
## Function Definitions

<a id="3-1"></a> <br>
### Functions for filtering and normalizing the dataset

In [8]:
def filterVotes(x: pd.Series):
    """
    This function returns whether x should be filtered based on
    the ratio of up to down votes. If it should be filtered, the
    function returns None, otherwise 1. We later use pandas' dropna()
    method to drop rows containing None.
    """
    up = x["up_votes"]
    down = x["down_votes"]
    if down > 0 and up / down < 1:
        return None
    return 1

# Regex for matching zero width joiner variations.
STANDARDIZE_ZW = re.compile(r'(?<=\u09b0)[\u200c\u200d]+(?=\u09cd\u09af)')

# Regex for removing standardized zero width joiner, except in edge cases.
DELETE_ZW = re.compile(r'(?<!\u09b0)[\u200c\u200d](?!\u09cd\u09af)')

# Regex matching punctuation to remove.
# PUNC = re.compile(r'([\?\.।;:,!"\'])')
# Keeps full stop (.), dari (।), comma (,), exclamation mark (!) and question mark (?);
# removes semicolon (;), colon (:), double quote (") and single quote (').
PUNC = re.compile(r'([;:"\'])')

def removeOptionalZW(text):
    """
    Removes all optional occurrences of ZWNJ or ZWJ from Bangla text.
    """
    text = STANDARDIZE_ZW.sub('\u200D', text)
    text = DELETE_ZW.sub('', text)
    return text

def removePunc(text):
    """
    Removes the punctuation marks matched by PUNC (semicolon, colon,
    double quote and single quote) from text.
    """
    text = PUNC.sub(r"", text)
    return text

def normalizeUnicode(text, normalize_nukta=True):
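    """
    Normalizes unicode strings using the Normalization Form Canonical
    Composition (NFC) scheme where we first decompose all characters and then
    re-compose combining sequences in a specific order as defined by the
    standard in unicodedata module. Finally, all optional zero-width joiners
    and the punctuation marks matched by PUNC are removed.

    If normalize_nukta is True, each word is first passed through
    bnunicodenormalizer; words for which it returns None are dropped.
    """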
    if normalize_nukta:
        words = [ bnorm(word)['normalized']  for word in text.split() ]
        text = " ".join([word for word in words if word is not None])
        text = text.replace("\u2047", "-")

    text = text.replace(u"\u098c", u"\u09ef")
    text = unicodedata.normalize("NFC", text)
    text = removeOptionalZW(text)
    text = removePunc(text)

    return text
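
To make the effect of these helpers concrete, here is a small standalone check. It is only a sketch: the two patterns are copied from the cell above so the block runs on its own, `strip_optional_zw` is a hypothetical stand-in for `removeOptionalZW`, and the sample strings are made up for illustration.

```python
import re
import unicodedata

# Same patterns as in the cell above, repeated so this sketch is self-contained.
STANDARDIZE_ZW = re.compile(r'(?<=\u09b0)[\u200c\u200d]+(?=\u09cd\u09af)')
DELETE_ZW = re.compile(r'(?<!\u09b0)[\u200c\u200d](?!\u09cd\u09af)')


def strip_optional_zw(text: str) -> str:
    """Standardize the joiner between ra and ya-phala, drop joiners elsewhere."""
    text = STANDARDIZE_ZW.sub('\u200d', text)
    return DELETE_ZW.sub('', text)


# NFC re-composes the split vowel signs e-kar + aa-kar into the single o-kar.
decomposed = "\u0995\u09c7\u09be"          # ka + e-kar + aa-kar
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# A ZWNJ between ra and ya-phala becomes a ZWJ and is kept.
kept = strip_optional_zw("\u09b0\u200c\u09cd\u09af")

# A joiner between two ordinary consonants is removed.
dropped = strip_optional_zw("\u0995\u200d\u0996")
```

Comparing string lengths before and after, as in the `print` call, is a quick way to confirm that normalization changed the code point sequence even when the rendered text looks identical.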

<a id="3-2"></a> <br>
### Audio Loader Class and PyTorch Dataset

- This class transcodes and augments audio files and can be used with PyTorch dataloaders.
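
The noise augmentation in this class mixes a noise clip into the speech at a target signal-to-noise ratio. The arithmetic behind that scaling can be sketched with plain Python lists, assuming the usual definition SNR(dB) = 10 * log10(signal power / noise power). The helper names and the synthetic signals below are hypothetical and only illustrate the calculation.

```python
import math
from typing import List


def mean_power(x: List[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def noise_scale_for_snr(signal: List[float], noise: List[float], snr_db: float) -> float:
    """Amplitude factor for `noise` so that signal / scaled noise hits `snr_db`."""
    power_ratio = mean_power(signal) / mean_power(noise)
    # Dividing by 10 ** (snr_db / 10) gives the power coefficient;
    # its square root is the amplitude coefficient.
    return math.sqrt(power_ratio / (10 ** (snr_db / 10.0)))


def measured_snr_db(signal: List[float], noise: List[float]) -> float:
    """SNR in dB of `signal` relative to `noise`."""
    return 10.0 * math.log10(mean_power(signal) / mean_power(noise))


# Synthetic stand-ins for a speech clip and a noise clip.
signal = [math.sin(0.01 * i) for i in range(1000)]
noise = [((i * 37) % 11 - 5) / 10 for i in range(1000)]

scale = noise_scale_for_snr(signal, noise, snr_db=20.0)
scaled_noise = [scale * n for n in noise]
print(measured_snr_db(signal, scaled_noise))
```

Measuring the SNR of the scaled noise against the clean signal, as done in the last line, is a cheap sanity check for any implementation of this step.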

In [9]:
class AudioConverter:
    """
    AudioConverter offers methods to load, transcode and augment
    audio data in various ways.
    """

    # Configurations for parameters used in torchaudio's resampling kernel.
    resampleFilterParams = {
        "fast": {  # Fast and less accurate but still MSE = ~2e-5 compared to librosa.
            "lowpass_filter_width": 16,
            "rolloff": 0.85,
            "resampling_method": "kaiser_window",
            "beta": 8.555504641634386,
        },
        "best": { # Twice as slow, and a little bit more accurate.
            "lowpass_filter_width": 64,
            "rolloff": 0.9475937167399596,
            "resampling_method": "kaiser_window",
            "beta": 14.769656459379492,       
        },
    }

    def __init__(
        self,
        sampleRate: int,
        disableAug: bool = False,
        speedAugProb: float = 0.5,
        volAugProb: float = 0.5,
        reverbAugProb: float = 0.25,
        noiseAugProb: float = 0.25,
        speedFactors: List[float] = None,
        volScaleMinMax: Tuple[float, float] = None,
        reverbRoomScaleMinMax: Tuple[float, float] = None,
        reverbHFDampingMinMax: Tuple[float, float] = None,
        reverbSustainMinMax: Tuple[float, float] = None,
        noiseSNRMinMax: Tuple[float, float] = None,
        noiseFileList: List[str] = None,
    ):
        """
        Initializes AudioConverter.

        Parameters
        ----------
        sampleRate: int
            Sampling rate to convert audio to, if required.

        disableAug: bool, optional
            If True, overrides all other augmentation configs and
            disables all augmentations.

        speedAugProb: float, optional
            Probability that speed augmentation will be applied.
            If <= 0, speed augmentation is disabled.

        volAugProb: float, optional
            Probability that volume augmentation will be applied.
            If <= 0, volume augmentation is disabled.

        reverbAugProb: float, optional
            Probability that reverberation augmentation will be applied.
            If <= 0, reverberation augmentation is disabled.

        noiseAugProb: float, optional
            Probability that noise augmentation will be applied.
            If <= 0, noise augmentation is disabled.

        speedFactors: List[float], optional
            List of factors by which to speed up (>1) or slow down (<1)
            the audio. One factor is chosen randomly from the list. If not
            provided, the default speed factors are [0.9, 1.0, 1.1].
            
        volScaleMinMax: Tuple[float, float], optional
            [Min, Max] range for volume scale factors. One factor is
            chosen randomly with uniform probability from this range.
            Default range is [0.125, 2.0].

        reverbRoomScaleMinMax: Tuple[float, float], optional
            [Min, Max] range for room size percentage. Values must be
            between 0 and 100. Larger room size results in more reverb.
            Default range is [25, 75].

        reverbHFDampingMinMax: Tuple[float, float], optional
            [Min, Max] range for high frequency damping percentage. Values must
            be between 0 and 100. More damping results in muffled sound.
            Default range is [25, 75].
        
        reverbSustainMinMax: Tuple[float, float], optional
            [Min, Max] range for reverberation sustain percentage. Values must
            be between 0 and 100. More sustain results in longer lasting echoes.
            Default range is [25, 75].
            
        noiseSNRMinMax: Tuple[float, float], optional
            [Min, Max] range for signal-to-noise ratio when adding noise. One
            factor is chosen randomly with uniform probability from this range.
            Lower SNR results in louder noise. Default range is [10.0, 30.0].

        noiseFileList: List[str], optional
            List of paths to audio files to use as noise samples. If None is provided,
            noise augmentation will be disabled. Otherwise, the audio files will be assumed
            to be sources of noise, and be mixed in with speech audio on-the-fly.
        """
        self.sampleRate = sampleRate
        
        enableAug = not disableAug
        self.speedAugProb = speedAugProb if enableAug else -1
        self.volAugProb = volAugProb if enableAug else -1
        self.reverbAugProb = reverbAugProb if enableAug else -1
        self.noiseAugProb = noiseAugProb if enableAug else -1
        
        # Factors by which audio speed is perturbed.
        self.speedFactors = speedFactors
        if speedFactors is None:
            self.speedFactors = [0.9, 1.0, 1.1]
        
        # [Min, Max] Volume scale range.
        self.volScaleRange = volScaleMinMax
        if volScaleMinMax is None:
            self.volScaleRange = [0.125, 2.0]
        
        # [Min, Max] Room size as a percentage, higher = more reverb
        self.reverbRoomScaleRange = reverbRoomScaleMinMax
        if reverbRoomScaleMinMax is None:
            self.reverbRoomScaleRange = [25, 75]
        
        # [Min, Max] High frequency damping as a percentage, higher = more damping.
        self.reverbHFDampingRange = reverbHFDampingMinMax
        if reverbHFDampingMinMax is None:
            self.reverbHFDampingRange = [25, 75]
        
        # [Min, Max] How long reverb is sustained as a percentage, higher = lasts longer.
        self.reverbSustainRange = reverbSustainMinMax 
        if reverbSustainMinMax is None:
            self.reverbSustainRange = [25, 75]       

        # Audio files to use as source of noise.
        self.noiseFiles = noiseFileList
        if self.noiseFiles is None or len(self.noiseFiles) == 0:
            self.noiseAugProb = -1

        # [Min, Max] Signal to noise ratio range for adding noise to audio.
        # Lower SNR = noise is more prominent, i.e. speech is more noisy.
        self.noiseSNRRange = noiseSNRMinMax
        if noiseSNRMinMax is None:
            self.noiseSNRRange = [10.0, 30.0]
        
        self.validateConfig()
        
    def validateConfig(self):
        """
        Checks configured options and raises an error if they
        are not consistent with what is expected.
        """
        if len(self.volScaleRange) != 2:
            raise ValueError("volume scale range must be provided as [min, max]")
        if len(self.reverbRoomScaleRange) != 2:
            raise ValueError("reverb room scale range must be provided as [min, max]")
        if len(self.reverbHFDampingRange) != 2:
            raise ValueError("reverb high frequency damping range must be provided as [min, max]")
        if len(self.reverbSustainRange) != 2:
            raise ValueError("reverb sustain range must be provided as [min, max]")
        if len(self.noiseSNRRange) != 2:
            raise ValueError("noise SNR range must be provided as [min, max]")
            
        for v in self.reverbRoomScaleRange:
            if v > 100 or v < 0:
                raise ValueError("reverb room scale must be between 0 and 100")
        for v in self.reverbHFDampingRange:
            if v > 100 or v < 0:
                raise ValueError("reverb high frequency damping must be between 0 and 100")
        for v in self.reverbSustainRange:
            if v > 100 or v < 0:
                raise ValueError("reverb sustain range must be between 0 and 100")

    @classmethod
    def loadAudio(
        cls, audioPath: str, sampleRate: int = None, returnTensor: bool = True, resampleType: str = "fast",
    ) -> Union[torch.Tensor, np.ndarray]:
        """
        Uses torchaudio to load and resample (if necessary) audio files and returns
        audio samples as either a numpy.float32 array or a torch.Tensor.
        
        Parameters
        ----------
        audioPath: str
            Path to audio file (wav / mp3 / flac).
        
        sampleRate: int, optional
            Sampling rate to convert audio to. If None,
            audio is not resampled.
        
        returnTensor: bool, optional
            If True, the audio samples are returned as a torch.Tensor.
            Otherwise, the samples are returned as a numpy.float32 array.
            
        resampleType: str, optional
            Either "fast" or "best" - sets the quality of resampling.
            "best" is twice as slow as "fast" but more accurate. "fast"
            is still comparable to librosa's resampled output though,
            in terms of MSE.

        Returns
        -------
        Union[torch.Tensor, np.ndarray]
            Audio waveform scaled between +/- 1.0 as either a numpy.float32 array,
            or torch.Tensor, with shape (channels, numSamples)
        """
        x, sr = torchaudio.load(audioPath)
        # Resample only when a target rate is given and it differs from the file's rate.
        if sampleRate is not None and sr != sampleRate:
            x = F.resample(x, sr, sampleRate, **cls.resampleFilterParams[resampleType])
        
        if returnTensor:
            return x
        
        return x.numpy()

    def getAudio(self, audioPath: str, returnTensor: bool = False) -> Union[np.ndarray, torch.Tensor]:
        """
        Loads audio from specified path and applies augmentations randomly
        on-the-fly. Audio samples scaled between -1.0 and +1.0 are returned
        as a numpy.float32 array or torch.Tensor with shape (channels, numSamples).

        Parameters
        ----------
        audioPath: str
            Path to audio file (wav / mp3 / flac).
        
        returnTensor: bool, optional
            If True, the audio samples are returned as a torch.Tensor.
            Otherwise, the samples are returned as a numpy.float32 array.
        
        Returns
        ------- 
        Union[torch.Tensor, np.ndarray]
            Audio waveform scaled between +/- 1.0 as either a numpy.float32 array,
            or torch.Tensor, with shape (channels, numSamples)
        """
        wav = self.loadAudio(
            audioPath, sampleRate=self.sampleRate, returnTensor=True, resampleType="fast",
        )

        # Applying sox-based effects first.
        effects = []
        
        if random.uniform(0, 1) <= self.speedAugProb:
            effects.extend([
                ["speed", f"{random.choice(self.speedFactors)}"],
                ["rate", f"{self.sampleRate}"],
            ])

        if random.uniform(0, 1) <= self.reverbAugProb:
            effects.append([
                "reverb",
                f"{random.uniform(*self.reverbSustainRange)}",
                f"{random.uniform(*self.reverbHFDampingRange)}",
                f"{random.uniform(*self.reverbRoomScaleRange)}",
            ])
        
        # If no effects are selected, this is a no-op.
        wav = self.applySoxEffects(wav, effects)

        if random.uniform(0, 1) <= self.noiseAugProb:
            noiseFile = random.choice(self.noiseFiles)
            noiseSNR = random.uniform(*self.noiseSNRRange)
            wav = self.addNoiseFromFile(wav, noiseFile, noiseSNR)

        if random.uniform(0, 1) <= self.volAugProb:
            volScale = random.uniform(*self.volScaleRange)
            wav = self.scaleVolume(wav, volScale)
        
        if returnTensor:
            return wav
        
        return wav.numpy()


    def scaleVolume(self, wav: Union[np.ndarray, torch.Tensor], scale: float) -> torch.Tensor:
        """
        Scales the amplitude (with clipping) of the provided audio signal
        by the given scale factor.
        
        Parameters
        ----------
        wav: Union[np.ndarray, torch.Tensor]
             Audio samples scaled between -1.0 and +1.0, with shape
             (channels, numSamples).

        scale: float
            Factor by which the amplitude is multiplied.

        Returns
        -------
        torch.Tensor
            Audio samples with perturbed volume.
        """
        if scale == 1.0:
            return wav

        return torch.clamp(wav * scale, -1.0, 1.0)

    def addNoiseFromFile(
        self, wav: Union[np.ndarray, torch.Tensor], noiseFile: str, snr: float,
    ) -> torch.Tensor:
        """
        Adds noise signal from provided noise audio file at the 
        specified SNR to the speech signal.
        
        Parameters
        ----------
        wav: Union[np.ndarray, torch.Tensor]
             Audio samples scaled between -1.0 and +1.0, with shape
             (channels, numSamples).

        noiseFile: str
            Path to the audio file used as the noise source.

        snr: float
            Signal-to-Noise ratio at which to mix in the noise signal.
        
        Returns
        -------
        torch.Tensor
            Audio samples with noise added at specified SNR.
        """
        # Loading noise signal.
        noiseSig = self.loadAudio(
            noiseFile, sampleRate=self.sampleRate, returnTensor=True, resampleType="fast",
        )

        # Computing noise power.
        noisePower = torch.mean(torch.pow(noiseSig, 2))
        
        # Computing signal power.
        signalPower = torch.mean(torch.pow(wav, 2))

        # Noise coefficient for target SNR (in dB). The power coefficient uses
        # 10 ** (snr / 10); the amplitude coefficient is its square root.
        noiseScale = torch.sqrt((signalPower / noisePower) / (10 ** (snr / 10.0)))
        
        # Add noise at random location in speech signal.
        nWav, nNoise = wav.shape[-1], noiseSig.shape[-1]

        if nWav < nNoise:
            a = random.randint(0, nNoise-nWav)
            b = a + nWav
            return wav + (noiseSig[..., a:b] * noiseScale)
        
        a = random.randint(0, nWav-nNoise)
        b = a + nNoise          
        wav[..., a:b] += (noiseSig * noiseScale)

        return wav
    
        
    def applySoxEffects(self, wav: Union[np.ndarray, torch.Tensor], effects: List[List[str]]) -> torch.Tensor:
        """
        Applies different audio manipulation effects to provided audio, like
        speed and volume perturbation, reverberation etc. For a full list of
        supported effects, check torchaudio.sox_effects.

        Parameters
        ----------
        wav: Union[np.ndarray, torch.Tensor]
             Audio samples scaled between -1.0 and +1.0, with shape
             (channels, numSamples).
        
        effects: List[List[str]]
            List of sox effects and associated arguments, example:
            '[ ["speed", "1.2"], ["vol", "0.5"] ]'

        Returns
        -------
        torch.Tensor
            Audio samples with effects applied. May not be the same
            number of samples as input sample array, depending on types
            of effects applied (e.g. speed perturbation may reduce or
            increase the number of samples).
        """
        if effects is None or len(effects) == 0:
            return wav

        wav, _ = torchaudio.sox_effects.apply_effects_tensor(
            wav, sample_rate=self.sampleRate, effects=effects,
        )

        return wav
    
    def perturbSpeed(self, wav: Union[np.ndarray, torch.Tensor], factor: float) -> torch.Tensor:
        """
        Perturbs the speed of the provided audio signal by the given factor.
        
        Parameters
        ----------
        wav: Union[np.ndarray, torch.Tensor]
             Audio samples scaled between -1.0 and +1.0, with shape
             (channels, numSamples).

        factor: float
            Speed factor; values > 1 speed the audio up, values < 1 slow it down.

        Returns
        -------
        torch.Tensor
            Audio samples with perturbed speed. Will have more or less
            samples than input depending on whether slowed down or
            sped up.
        """
        effects = [
            ["speed", f"{factor}"],
            ["rate", f"{self.sampleRate}"],
        ]
        
        return self.applySoxEffects(wav, effects)
    
    def addReverb(
        self, wav: Union[np.ndarray, torch.Tensor], roomSize: float, hfDamping: float, sustain: float,
    ) -> torch.Tensor:
        """
        Adds reverberation to the provided audio signal using given parameters.
        
        Parameters
        ----------
        wav: Union[np.ndarray, torch.Tensor]
             Audio samples scaled between -1.0 and +1.0, with shape
             (channels, numSamples).
        
        roomSize: float
            Room size as a percentage between 0 and 100,
            higher = more reverb

        hfDamping: float
            High Frequency damping as a percentage between 0 and 100,
            higher = more damping.

        sustain: float
            How long reverb is sustained as a percentage between 0 and 100,
            higher = lasts longer.

        Returns
        -------
        torch.Tensor
            Audio samples with reverberated audio.
        """
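        # sox expects: reverb <reverberance> <HF-damping> <room-scale>,
        # the same order used in getAudio above.
        effects = [["reverb", f"{sustain}", f"{hfDamping}", f"{roomSize}"]]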
        return self.applySoxEffects(wav, effects)

- This PyTorch Dataset class uses the AudioConverter class to load audio files in parallel.
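
The `loopDataset` argument used by the class below multiplies the reported length and wraps indices back onto the underlying rows. Since each access re-applies random augmentation, one plausible reading is that looping yields differently augmented copies of the same clips within an epoch. The wrap-around itself can be sketched without PyTorch; `LoopedList` is a hypothetical name used only for this illustration.

```python
from typing import Sequence


class LoopedList:
    """Exposes `items` as if it were repeated `loop` times back to back."""

    def __init__(self, items: Sequence, loop: int = 1):
        self.items = items
        self.length = len(items) * loop

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, idx: int):
        if idx < 0 or idx >= self.length:
            raise IndexError("index out of range")
        # Map the virtual index back onto the real items.
        return self.items[idx % len(self.items)]


looped = LoopedList(["a", "b", "c"], loop=2)
print(len(looped), list(looped))
```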

In [10]:
class SprintDataset(Dataset):
        
    def __init__(self, df, processor, audioConverter, loopDataset=1):
        self.df = df
        self.paths = df['path']
        self.sentences = df['sentence']
        self.len = len(self.df) * loopDataset

        self.processor = processor
        self.ac = audioConverter

    def __len__(self):
        return self.len

    def loadSample(self, idx):
        idx %= len(self.df)
        audio_path = self.paths[idx]
        sentence = self.sentences[idx]

        wave = self.ac.getAudio(audio_path)[0]
        # Use the processor and sample rate held by this dataset, not globals.
        input_values = self.processor(wave, sampling_rate=self.ac.sampleRate).input_values[0]

        input_length = len(input_values)
        with self.processor.as_target_processor():
            labels = self.processor(sentence).input_ids

        return {
            'input_values':input_values,
            'input_length':input_length,
            'labels':labels
        }

    def __getitem__(self, idx): 
        if idx >= self.len:
            raise IndexError('index out of range')
        return self.loadSample(idx)

In [11]:
@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch
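
The label handling in the collator above comes down to padding every label sequence to the longest one in the batch and writing -100 over the padded positions so that they are ignored by the loss. A minimal sketch with plain lists follows; `pad_labels` and the token ids are hypothetical, and the real collator works on tensors through the processor.

```python
from typing import List

IGNORE_INDEX = -100  # value written over padded label positions


def pad_labels(label_seqs: List[List[int]], ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad each label sequence to the batch maximum with ignore_index."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]


batch = pad_labels([[5, 6, 7], [8]])
print(batch)
```

Audio inputs and labels are padded separately because their lengths are unrelated: a long clip can carry a short transcript and vice versa.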

<a id="4"></a> <br>
## Main

In [12]:
# Testing punctuation removal
text = "আমার আ,ম; ;?অনেক?! .ভাল: 'লাগে"
display(HTML(f"original: {text}"))
display(HTML(f"puncs removed: {removePunc(text)}"))

In [13]:
# Loading metadata files.
train_df = pd.read_csv(Config.train_csv_path)
valid_df = pd.read_csv(Config.valid_csv_path)

print(f"Train utts before filtering: {len(train_df)}")

# Removing files with an up-to-down vote ratio of less than 1.
print("Filtering training dataset ... ")
train_df["up_votes"] = train_df.apply(filterVotes, axis=1)
train_df.dropna(subset = ['up_votes'], inplace=True)

# Removing files that raised errors in previous runs.
train_df = train_df[~train_df.path.isin(skipFiles)]
train_df = train_df.reset_index()
valid_df = valid_df.reset_index()

print(f"Train utts after filtering: {len(train_df)}")

Train utts before filtering: 206950
Filtering training dataset ... 
Train utts after filtering: 197710
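
The vote rule applied above can be checked in isolation. The sketch below restates the condition from `filterVotes` as a boolean predicate on made-up rows; `keep_by_votes` and the sample data are hypothetical and are not part of the competition metadata.

```python
def keep_by_votes(up_votes: int, down_votes: int) -> bool:
    """True if a clip stays: it is dropped only when down votes outnumber up votes."""
    return not (down_votes > 0 and up_votes / down_votes < 1)


# Made-up rows for illustration only.
rows = [
    {"path": "a.mp3", "up_votes": 2, "down_votes": 0},  # no down votes: kept
    {"path": "b.mp3", "up_votes": 1, "down_votes": 2},  # ratio 0.5: dropped
    {"path": "c.mp3", "up_votes": 2, "down_votes": 2},  # ratio 1.0: kept
]
kept = [r["path"] for r in rows if keep_by_votes(r["up_votes"], r["down_votes"])]
print(kept)
```

For non-negative vote counts the predicate is equivalent to `up_votes >= down_votes`, which also avoids the division.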


<a id="4-1"></a> <br>
### Build Language Model

- The language model is built using the KenLM toolkit (https://github.com/kpu/kenlm).
- The text used to create the final language model combines the training and validation transcripts.
- During model selection based on validation results, the validation transcripts were left out of the language model text.
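
As a rough picture of what an n-gram language model is built from, the sketch below counts n-grams over whitespace-tokenized sentences, padding each sentence with `<s>` and `</s>` markers. It only illustrates the counting idea. KenLM's `lmplz` goes further and estimates smoothed probabilities with modified Kneser-Ney, which this sketch does not attempt. `count_ngrams` and the sample sentence are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(sentences: Iterable[str], order: int) -> Counter:
    """Counts n-grams of the given order over whitespace-separated tokens."""
    counts: Counter = Counter()
    for sentence in sentences:
        # Sentence boundary markers let the model learn how sentences start and end.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


bigrams = count_ngrams(["a b a b"], order=2)
print(bigrams.most_common(3))
```

With `order=5`, as configured in this notebook, most 5-grams occur only once, which is why smoothing and back-off to shorter contexts matter in practice.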

In [14]:
lmDir = Config.language_model_dir
order = Config.ngram_order

os.makedirs(lmDir, exist_ok=True)

with open(f"{lmDir}/train-text.txt", 'w') as f:
    for line in train_df["sentence"]:
        # Keeping the variants of letters where the nukta is part of the character
        # and where it's a separate joining character since the base model was trained
        # this way.
        line = normalizeUnicode(line.strip(), normalize_nukta=False)
        f.write(f"{line}\n")

    for line in valid_df["sentence"]:
        # Keeping the variants of letters where the nukta is part of the character
        # and where it's a separate joining character since the base model was trained
        # this way.
        line = normalizeUnicode(line.strip(), normalize_nukta=False)
        f.write(f"{line}\n")

!echo -e "LM training text stats [lines words chars]:\n$(wc {lmDir}/train-text.txt)"
!./deps/kenlm/kenlm/bin/lmplz -o "{order}" --discount_fallback < "{lmDir}/train-text.txt" > "{lmDir}/dlsprint-data-lm.{order}.arpa"

LM training text stats [lines words chars]:
  205457  1923329 34704335 ./langauge_models/train-text.txt
=== 1/5 Counting and sorting n-grams ===
Reading /kaggle/working/langauge_models/train-text.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 1923329 types 50780
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:609360 2:1310169216 3:2456567296 4:3930507520 5:5731990528
Statistics:
1 50780 D1=0.659444 D2=0.513934 D3+=0.826866
2 611193 D1=0.80271 D2=1.14673 D3+=1.37313
3 963343 D1=0.875048 D2=1.33804 D3+=1.49194
4 1008162 D1=0.886956 D2=1.48785 D3+=1.82227
5 931390 D1=0.395871 D2=1.8116 D3+=2.60684
Memory estimate for binary LM:
type    MB
probing 76 assuming -p 1.5
probing 91 assuming -r models -p 1.5
trie    36 without quantization
trie    19 assuming -q 8 -b 8 quantization 
trie    32 assuming -a 
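
An ARPA file starts with a `\data\` header that lists the number of n-grams per order, so the counts in the statistics above can be cross-checked against the generated file. The helper below is a minimal sketch; `read_arpa_ngram_counts` and the tiny example header are hypothetical and not part of the notebook.

```python
import re

def read_arpa_ngram_counts(lines):
    """Return {order: count} parsed from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if in_header:
            match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                # First section marker (for example \1-grams:) ends the header.
                break
    return counts

# Tiny made-up header used only to exercise the parser.
example_header = [
    "\\data\\",
    "ngram 1=4",
    "ngram 2=5",
    "",
    "\\1-grams:",
]
ngram_counts = read_arpa_ngram_counts(example_header)
print(ngram_counts)
```

For the real model the function can be fed an open file handle, for example `read_arpa_ngram_counts(open(arpa_path, encoding="utf-8"))`, where `arpa_path` points at the `.arpa` file written by `lmplz`.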

In [15]:
# Apply unicode normalization.
print("Normalizing training and validation transcripts... ")
# train_df["sentence"] = train_df.apply(lambda x: normalizeUnicode(x["sentence"]), axis=1)
# valid_df["sentence"] = valid_df.apply(lambda x: normalizeUnicode(x["sentence"]), axis=1)
train_df["sentence"] = [ normalizeUnicode(x) for x in tqdm(train_df["sentence"]) ]
valid_df["sentence"] = [ normalizeUnicode(x) for x in tqdm(valid_df["sentence"]) ]

# Updating paths to point to data directories.
train_df["path"] = [ os.path.join(Config.train_path, x.replace("mp3", Config.audio_ext)) for x in train_df['path'] ]
valid_df["path"] = [ os.path.join(Config.valid_path, x.replace("mp3", Config.audio_ext)) for x in valid_df['path'] ]

# Keeping only audio filename and transcript columns.
train_df = train_df[["path","sentence"]]
valid_df = valid_df[["path","sentence"]]

Normalizing training and validation transcripts... 


  0%|          | 0/197710 [00:00<?, ?it/s]

  0%|          | 0/7747 [00:00<?, ?it/s]

In [None]:
# Checking a sample from the train dataframe.
# random.randint is inclusive at both ends and could index one past the last row.
sample = train_df.iloc[random.randrange(len(train_df))]

display(HTML(sample['sentence']))

x = AudioConverter.loadAudio(sample['path'], sampleRate=Config.sample_rate, returnTensor=False)
display(Audio(x, rate=Config.sample_rate))

In [16]:
vocab_dict_nc={v:k for k,v in vocab_dict_cn.items()}
vocab = list(vocab_dict_cn.keys())

with open('vocab.json', 'w') as f:
    json.dump(vocab_dict_cn, f)
    
print(f"vocabulary size = {len(vocab)}")

vocabulary size = 112


In [21]:
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained(
    "./",
    unk_token="[UNK]",
    pad_token="[PAD]",
    word_delimiter_token="|",
    bos_token="<s>",
    eos_token="</s>",
)

feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=Config.sample_rate,
    padding_value=0.0,
    padding_side="right",
    do_normalize=True,
    return_attention_mask=True,
)

processor = Wav2Vec2Processor(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer,
)

In [22]:
# Creating dataset objects for train and val.
disable_aug = not Config.use_augmentation
# Loop factors passed to SprintDataset; the training log reports 395420 = 2 x 197710 examples.
loop_train_dataset = 2
loop_val_dataset = 1

# Training audio is augmented (noise is sampled from the training files); validation audio is not.
train_ac = AudioConverter(sampleRate=Config.sample_rate, disableAug=disable_aug, noiseFileList=train_df['path'].tolist())
val_ac = AudioConverter(sampleRate=Config.sample_rate, disableAug=True)

train_dataset = SprintDataset(train_df, processor, train_ac, loop_train_dataset)
valid_dataset = SprintDataset(valid_df, processor, val_ac, loop_val_dataset)

data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

In [23]:
# Checking sample from dataset.
sample = train_dataset[123]
pprint(sample)

x = sample['input_values']
# Map label ids back to characters (loop variable renamed so it does not shadow x).
y = [ vocab_dict_nc[i] for i in sample['labels'] ]

display(HTML(str(y)))
display(Audio(x, rate=Config.sample_rate))

{'input_length': 99072,
 'input_values': array([ 0.00016054,  0.00016054,  0.00016054, ..., -0.00492946,
       -0.00243762,  0.00338007], dtype=float32),
 'labels': [65, 78, 45, 79, 75, 88, 60, 78, 64, 79, 0, 67, 78, 76, 79, 64, 80,
            45, 84, 0, 35, 60, 88, 69, 75, 69, 71, 88, 65, 59, 84, 71, 0, 52,
            64, 88, 70, 0, 67, 79, 69, 78, 64, 0, 61, 84, 45, 84, 0, 50, 79,
            56, 79, ...

### Callback for computing metrics

In [24]:
wer_metric = load_metric("../input/csefest2022dlsprintdeps/metrics/metrics/wer.py")
cer_metric = load_metric("../input/csefest2022dlsprintdeps/metrics/metrics/cer.py")

def compute_metrics(pred):
    # Greedy CTC decoding: take the most likely token at every frame.
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    # Label positions padded with -100 (ignored by the loss) are mapped back to the pad token.
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    # Predictions are collapsed by the CTC rules (repeated tokens merged).
    # The reference labels must not be grouped, or genuine repeated letters would be merged.
    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer, "cer": cer}
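
Both WER and CER are edit-distance ratios, and the leaderboard score is itself a mean Levenshtein distance. The following is a minimal pure-Python sketch of the idea, assuming corpus-level aggregation (total edits divided by total reference length). `edit_distance` and `error_rate` are hypothetical helpers and are not the implementation inside `wer.py` or `cer.py`.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(references, predictions, tokenize):
    """Total edits divided by total reference length, for the given tokenization."""
    errors = sum(edit_distance(tokenize(r), tokenize(p))
                 for r, p in zip(references, predictions))
    total = sum(len(tokenize(r)) for r in references)
    return errors / total

refs = ["the cat sat", "hello world"]
hyps = ["the cat sit", "hello word"]

wer = error_rate(refs, hyps, str.split)  # word-level tokens
cer = error_rate(refs, hyps, list)       # character-level tokens
print(f"wer = {wer:.3f}, cer = {cer:.3f}")
```

Here one word is wrong in each reference, giving 2 errors over 5 reference words, while only 2 of the 22 reference characters differ, which is why CER is usually much lower than WER.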

<a id="4-2"></a> <br>
### Loading base Wav2Vec2 model from HuggingFace

- We use a pretrained Wav2Vec2 model from Hugging Face as a starting point.
- Model Link: https://huggingface.co/arijitx/wav2vec2-xls-r-300m-bengali
- The model was trained on the OpenSLR SLR53 Bengali dataset: https://www.openslr.org/53/

In [25]:
base_model = "../input/csefest2022dlsprintbasemodels/models/arijitx/wav2vec2-xls-r-300m-bengali"

# Loading model.
model = Wav2Vec2ForCTC.from_pretrained(
    base_model, 
    ignore_mismatched_sizes=False,
    attention_dropout=Config.attention_dropout,
    hidden_dropout=Config.hidden_dropout,
    feat_proj_dropout=Config.feat_proj_dropout,
    mask_time_prob=Config.mask_time_prob,
    layerdrop=Config.layerdrop,
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

# Freezing the convolutional feature encoder; the transformer layers stay trainable.
model.freeze_feature_encoder()

# Printing stats.
total_param = sum(p.numel() for p in model.parameters())
trainable_param = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total_param = {total_param}")
print(f"trainable = {trainable_param}")

total_param = 315553520
trainable = 311343344
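
The two numbers printed above show how small the frozen part is. A short arithmetic sketch using only those printed values (the variable names are illustrative):

```python
total_param = 315_553_520      # printed as total_param above
trainable_param = 311_343_344  # printed as trainable above

# Parameters that no longer receive gradients after freeze_feature_encoder().
frozen_param = total_param - trainable_param
frozen_share = frozen_param / total_param

print(f"frozen = {frozen_param} ({frozen_share:.2%} of all parameters)")
```

Only the convolutional front end is frozen, so nearly all of the model is still updated during adaptation.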


<a id="4-3"></a> <br>
### Adapting Wav2Vec2 Model

- First run: cosine learning-rate decay schedule with dropout enabled.

In [29]:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=Config.trainer,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,   
    tokenizer=processor.feature_extractor,
    callbacks=[transformers.EarlyStoppingCallback(early_stopping_patience=Config.early_stopping_patience)],
)

Using cuda_amp half precision backend


In [30]:
trainer.train()

***** Running training *****
  Num examples = 395420
  Num Epochs = 10
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 247140
The following columns in the training set don't have a corresponding argument in `Wav2Vec2ForCTC.forward` and have been ignored: input_length. If input_length are not expected by `Wav2Vec2ForCTC.forward`,  you can safely ignore this message.


Step,Training Loss,Validation Loss


KeyboardInterrupt: 
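
The step count in the log above follows from the dataset size, the loop factor, and the batch size. A small sketch of that arithmetic, assuming the last partial batch of each epoch is kept (variable names are illustrative; the values are taken from the logs in this notebook):

```python
import math

num_transcripts = 197_710  # training transcripts (progress bar above)
loop_train_dataset = 2     # loop factor passed to SprintDataset
batch_size = 16            # per-device batch size, single device
grad_accum_steps = 1
num_epochs = 10

num_examples = num_transcripts * loop_train_dataset
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum_steps))
total_steps = steps_per_epoch * num_epochs

# Position of the best first-stage checkpoint (step 100800) in epochs.
epochs_at_best = 100_800 / steps_per_epoch

print(num_examples, steps_per_epoch, total_steps, round(epochs_at_best, 2))
```

The result matches `Num examples = 395420` and `Total optimization steps = 247140` in the log, and places the best checkpoint a little past the fourth epoch of the planned ten.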

- Second run: starts from the best checkpoint of the first run (ckpt-100800).
- Constant learning-rate schedule with a much lower learning rate (1e-7) and all dropout disabled.
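
The difference between the two schedules can be sketched in a few lines. This is a simplified illustration of half-cosine decay with optional linear warmup, not the scheduler code used by `Trainer`. `cosine_lr` and `constant_lr` are hypothetical helpers, and the warmup length of the first run is not shown in this section, so it is left as a parameter.

```python
import math

def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup followed by half-cosine decay from base_lr to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def constant_lr(step, base_lr):
    """The learning rate stays fixed for the whole run."""
    return base_lr

total_steps = 247_140  # planned length of the first run
for step in (0, 100_800, total_steps):
    stage1 = cosine_lr(step, total_steps, base_lr=3e-5)
    stage2 = constant_lr(step, base_lr=1e-7)
    print(f"step {step:>6}: cosine = {stage1:.3e}, constant = {stage2:.1e}")
```

The second stage therefore continues with a learning rate more than two orders of magnitude below the first stage's initial value, which makes it a gentle refinement of the selected checkpoint rather than further large updates.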

In [31]:
# Updating trainer config.
Config.trainer = TrainingArguments(
  output_dir="./run-004-wav2vec2-fulldata-constant-lr1e-7",
  group_by_length=False,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  gradient_accumulation_steps=1,
  evaluation_strategy="steps",
  num_train_epochs=15,
  gradient_checkpointing=True,
  fp16=True,
  save_steps=400,
  eval_steps=400,
  logging_steps=400,
  learning_rate=1e-7,
  dataloader_num_workers=os.cpu_count(),
  warmup_steps=0,
  save_total_limit=10,
  push_to_hub=False,
  run_name="run-004-wav2vec2-fulldata-constant-lr1e-7" ,
  load_best_model_at_end=True,
  lr_scheduler_type="constant",
  resume_from_checkpoint=True,  # Not used by Trainer directly; the weights are loaded explicitly in the next cell.
)

# Disabling dropout, LayerDrop and time masking for the second stage.
Config.attention_dropout = 0
Config.hidden_dropout = 0
Config.feat_proj_dropout = 0
Config.mask_time_prob = 0
Config.layerdrop = 0

# Increasing patience.
Config.early_stopping_patience = 15

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
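
With `eval_steps=400` and a patience of 15, training stops after 15 consecutive evaluations without improvement. The sketch below is a simplified model of patience-based early stopping, assuming a lower-is-better metric and no improvement threshold. `stopping_eval` is a hypothetical helper, not the `EarlyStoppingCallback` implementation.

```python
def stopping_eval(eval_losses, patience):
    """Return the 1-based index of the evaluation that triggers the stop, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            bad_evals = 0  # improvement resets the counter
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

eval_steps = 400
patience = 15
# Training steps tolerated without any improvement before stopping.
steps_without_improvement = eval_steps * patience

# Made-up validation losses: the best value is at the 2nd evaluation.
stop_at = stopping_eval([1.0, 0.9, 0.95, 0.93, 0.92], patience=3)
print(steps_without_improvement, stop_at)
```

A larger patience suits the second stage because improvements at a learning rate of 1e-7 are small and can take many evaluations to show up.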


In [32]:
# Loading best model from first run.
model_path = "./run-003-wav2vec2-fulldata-cosine-lr3e-5/ckpt-100800"

model = Wav2Vec2ForCTC.from_pretrained(
    model_path,
    ignore_mismatched_sizes=False,
    attention_dropout=Config.attention_dropout,
    hidden_dropout=Config.hidden_dropout,
    feat_proj_dropout=Config.feat_proj_dropout,
    mask_time_prob=Config.mask_time_prob,
    layerdrop=Config.layerdrop,
    ctc_loss_reduction="mean", 
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like ./run-003-wav2vec2-fulldata-cosine-lr3e-5/ckpt-100800 is not the path to a directory containing a config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.

In [None]:
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=Config.trainer,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,   
    tokenizer=processor.feature_extractor,
    callbacks=[transformers.EarlyStoppingCallback(early_stopping_patience=Config.early_stopping_patience)],
)

In [None]:
trainer.train()