# Fine-tuning XLS-R 300M model on OpenSLR

## Initial Setup on Google Drive

### Installing required libraries

In [None]:
%%capture
!pip install transformers[torch]
!pip install accelerate -U
!pip install datasets
!pip install transformers
!pip install torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
!pip install jiwer
!pip install evaluate

### Creating the working directory 'Nepali_ASR_Final_Model'

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Using my own Google Drive during the experiment to save all checkpoints and training logs.
path = '/content/drive/MyDrive/Nepali_ASR_Final_Model/'

import os

def create_and_set_working_directory(path: str):
    # check if your project folder exists. if not, it will be created.
    if not os.path.isdir(path):
        os.makedirs(path)
        print(path + ' did not exist but was created.')

    # change the OS to use your project folder as the working directory
    os.chdir(path)

    print('Working directory changed to: \n' + path)

create_and_set_working_directory(path)
!pwd

Working directory changed to: 
/content/drive/MyDrive/Nepali_ASR_Final_Model/
/content/drive/MyDrive/Nepali_ASR_Final_Model


### Cloning the model repo into the drive to resume training

In [None]:
!git lfs install

Git LFS initialized.


In [None]:
!git clone https://huggingface.co/iamTangsang/test-wav2vec2-large-xls-r-300m-nepali-openslr

Cloning into 'test-wav2vec2-large-xls-r-300m-nepali-openslr'...
remote: Enumerating objects: 489, done.
remote: Counting objects: 100% (486/486), done.
remote: Compressing objects: 100% (484/484), done.
remote: Total 489 (delta 166), reused 0 (delta 0), pack-reused 3 (from 1)
Receiving objects: 100% (489/489), 141.83 KiB | 811.00 KiB/s, done.
Resolving deltas: 100% (166/166), done.
Filtering content: 100% (7/7), 4.67 GiB | 8.85 MiB/s, done.
fatal: cannot exec '/content/drive/MyDrive/Nepali_ASR/test-wav2vec2-large-xls-r-300m-nepali-openslr/.git/hooks/post-checkout': Permission denied


### Hugging Face Token authentication

Obtain a Hugging Face access token with both read and write permissions. Write access is needed to push models to the Hub later, and authentication is also required to access some datasets.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


### Install Git-LFS to upload model checkpoints

In [None]:
%%capture
!apt install git-lfs

## Preparing Dataset, Tokenizer, Feature Extractor

### Downloading OpenSLR dataset

In [None]:
from datasets import load_dataset

DATASET_TYPE = 'original'  # change to `original` or `cleaned` for downloading original or cleaned version of openslr dataset

dataset = load_dataset("spktsagar/openslr-nepali-asr-cleaned", name=DATASET_TYPE, split='train')
dataset = dataset.shuffle(seed=42)
dataset

openslr-nepali-asr-cleaned.py:   0%|          | 0.00/6.72k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/4.73k [00:00<?, ?B/s]

The repository for spktsagar/openslr-nepali-asr-cleaned contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/spktsagar/openslr-nepali-asr-cleaned.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


utt_spk_text_orig.tsv:   0%|          | 0.00/11.8M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/157905 [00:00<?, ? examples/s]

Dataset({
    features: ['utterance_id', 'speaker_id', 'utterance', 'transcription', 'num_frames'],
    num_rows: 157905
})

### Removing Numerals

We removed every example whose transcription contains numerals, because the spoken utterances and the written transcriptions disagree in many of them. For example, '००७' is pronounced 'सुन्य सुन्य सात' in one recording, 'जिरो जिरो सात' in another, and just 'सात' in a third. Such inconsistent targets would confuse the model.

In [None]:
# Define Nepali numerals (Unicode characters)

nepali_numerals = '०१२३४५६७८९'

# `string` is only used to print the ASCII letters further down
import string
init_len = len(dataset)

def check_nepali_numerals(text):
    """Return True if the text contains any Nepali numeral."""
    return any(c in text for c in nepali_numerals)

# Use dataset filter to remove examples with above function
dataset = dataset.filter(
    lambda ex: not check_nepali_numerals(ex),
    input_columns=['transcription',],
    with_indices=False, batched=False, batch_size=0,
)
dataset

count_removed = init_len - len(dataset)

# Display the results
print(f"Number of items containing Nepali numerals: {count_removed}")
print(f"Number of items after filtering: {len(dataset)}")
print(string.ascii_letters)
print(dataset[0])

Filter:   0%|          | 0/157905 [00:00<?, ? examples/s]

Number of items containing Nepali numerals: 9717
Number of items after filtering: 148188
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
{'utterance_id': '65e8d9e90c', 'speaker_id': '8b798', 'utterance': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/afa156b38476608da77244b7268d7d8da29e00ff524cd03185e6925ee8d42d31/asr_nepali/data/65/65e8d9e90c.flac', 'array': array([ 0.00128174,  0.00030518,  0.00170898, ..., -0.00027466,
        0.00012207, -0.00015259]), 'sampling_rate': 16000}, 'transcription': 'तर उनी कुनै', 'num_frames': 54400}


In [None]:
# After removing, first ten data items
for i in range(10):
  print(f"{i+1}. {dataset[i]['transcription']}")

1. तर उनी कुनै
2. तपाईँले सायद राम्रै
3. पन्तले आफूले जानेको
4. कोली शब्द जातिविशेष
5. यो रोगलाई मृगी
6. व्यक्तित्व रहँदै आएकी छन्
7. दहीचिउरे स्वभावको बनाइदिन्छ
8. यो साह्रै किफायती
9. मुक्तेश्वर महादेवको आज्ञाअनुसार
10. यसले अस्पतालाहरूको धेरै
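
As a rough illustration of the filter above, the sketch below applies the same kind of digit check to three hand-written strings rather than to the dataset. The helper name `has_nepali_digit` and the sample list are hypothetical and exist only for this example.

```python
# Hypothetical stand-alone sketch; names and samples are illustrative only.
NEPALI_DIGITS = set('०१२३४५६७८९')

def has_nepali_digit(text: str) -> bool:
    """Return True if the text contains at least one Nepali digit."""
    return any(ch in NEPALI_DIGITS for ch in text)

samples = ['००७ मिलको दूरीमा', 'तर उनी कुनै', 'सुन्य सुन्य सात']

# Keep only the transcriptions that are free of digits.
kept = [s for s in samples if not has_nepali_digit(s)]
print(kept)
```

Only the first sample is dropped; the spelled-out form 'सुन्य सुन्य सात' passes the check because it contains no digit characters.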


### Removing any Non-Nepali Characters

We want the vocabulary to contain only Nepali characters, so we drop every example whose transcription contains ASCII (English) letters.

In [None]:
# To remove any non-Nepali Characters
import string

def check_english_chars(text):
    """Return True if the text contains any English (ASCII) letter."""
    return any(c in text for c in string.ascii_letters)

# Use dataset filter to remove examples with above function
dataset = dataset.filter(
    lambda ex: not check_english_chars(ex),
    input_columns=['transcription',],
    with_indices=False, batched=False, batch_size=0,
)
dataset

Filter:   0%|          | 0/148188 [00:00<?, ? examples/s]

Dataset({
    features: ['utterance_id', 'speaker_id', 'utterance', 'transcription', 'num_frames'],
    num_rows: 148187
})
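
The check above only rejects ASCII letters. A stricter variant, which this notebook does not use, could test that every character lies in the Unicode Devanagari block (U+0900 to U+097F). The sketch below assumes that plain spaces are also allowed; `only_devanagari` is a hypothetical helper.

```python
# Hypothetical stricter check; not part of the notebook's pipeline.
def only_devanagari(text: str) -> bool:
    """Return True if every character is a space or in the Devanagari block."""
    return all(ch == ' ' or '\u0900' <= ch <= '\u097f' for ch in text)

print(only_devanagari('तर उनी कुनै'))
print(only_devanagari('abc तर'))
```

Such a test would also reject transcriptions containing ASCII punctuation like `!` or `?`, which the notebook strips in a later step instead of dropping the whole example.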

## Mapping Numerals to Words `(Not needed anymore)`

`IMPORTANT` **Start from the Vocabulary section. Ignore this section.**

**IMPORTANT:** We first trained the model on the assumption that mapping numerals to words was the right approach. On closer inspection, however, the dataset turned out to treat numerals inconsistently, so we removed every example containing numerals instead. The mapping code is kept because it may be useful later for post-processing.

The cells below first convert Nepali numerals to English (ASCII) digits and then convert those digits to Nepali words with the `NepaliNumberingSystem` class. **Don't run these cells.**

In [None]:
# Mapping of Nepali numerals to English numerals
import pandas as pd
nepali_to_english_numerals = {
    '०': '0',
    '१': '1',
    '२': '2',
    '३': '3',
    '४': '4',
    '५': '5',
    '६': '6',
    '७': '7',
    '८': '8',
    '९': '9'
}


# Function to replace Nepali numerals with English numerals in a text
def replace_nepali_numerals(text):
    return ''.join(nepali_to_english_numerals.get(char, char) for char in text)

# Function to apply replacement to each example in the dataset
def replace_numerals_in_transcription(example):
    example['transcription'] = replace_nepali_numerals(example['transcription'])
    return example

# Apply the function to the dataset
dataset = dataset.map(replace_numerals_in_transcription)

dataset[0]['transcription']

In [None]:
dataset[1]

{'utterance_id': '86e521554a',
 'speaker_id': '1f680',
 'utterance': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/849cb6a83cd0b47ab5a37038d7457f710008228e3c65a8d21077786bfb20d5d0/asr_nepali/data/86/86e521554a.flac',
  'array': array([-0.00128174, -0.00177002, -0.00186157, ...,  0.00134277,
          0.00054932,  0.00057983]),
  'sampling_rate': 16000},
 'transcription': '007 मिलको दूरीमा',
 'num_frames': 62400}

The code below works, but closer inspection showed that the dataset contains many noisy examples, so don't run these cells again. `Not required`.

In [None]:
class NepaliNumberingSystem:
    def __init__(self):
        # Nepali number names for 0-99
        self.np = [
            'सुन्य', 'एक', 'दुई', 'तिन', 'चार', 'पाँच', 'छ', 'सात', 'आठ', 'नौ', 'दश', 'एघार', 'बाह्र', 'तेह्र', 'चौध', 'पन्ध्र', 'सोह्र',
            'सत्र', 'अठार', 'उन्नाइस', 'बिस', 'एक्काइस', 'बाइस', 'तेइस', 'चौबिस', 'पच्चीस', 'छब्बीस', 'सत्ताइस', 'अठाइस', 'उनन्तीस', 'तिस',
            'एकतिस', 'बत्तीस', 'तेत्तीस', 'चाैतीस', 'पैतिस', 'छत्तीस', 'सरतीस', 'अरतीस', 'उननचालीस', 'चालीस', 'एकचालीस', 'बयालिस',
            'तीरचालीस', 'चौवालिस', 'पैंतालिस', 'छयालिस', 'सरचालीस', 'अरचालीस', 'उननचास', 'पचास', 'एकाउन्न', 'बाउन्न', 'त्रिपन्न',
            'चौवन्न', 'पच्पन्न', 'छपन्न', 'सन्ताउन्न', 'अन्ठाउँन्न', 'उनन्नसाठी', 'साठी', 'एकसट्ठी', 'बयसट्ठी', 'त्रिसट्ठी', 'चौंसट्ठी',
            'पैंसट्ठी', 'छयसट्ठी', 'सतसट्ठी', 'अठसट्ठी', 'उनन्नसत्तरी', 'सत्तरी', 'एकहत्तर', 'बहत्तर', 'त्रिहत्तर', 'चौहत्तर', 'पचहत्तर',
            'छहत्तर', 'सत्हत्तर', 'अठ्हत्तर', 'उनास्सी', 'अस्सी', 'एकासी', 'बयासी', 'त्रीयासी', 'चौरासी', 'पचासी', 'छयासी', 'सतासी', 'अठासी',
            'उनान्नब्बे', 'नब्बे', 'एकान्नब्बे', 'बयान्नब्बे', 'त्रियान्नब्बे', 'चौरान्नब्बे', 'पंचान्नब्बे',
            'छयान्नब्बे', 'सन्तान्नब्बे', 'अन्ठान्नब्बे', 'उनान्सय', 'सय'
        ]
        # Nepali large numbers
        self.nns_np = [
            '', 'हजार', 'लाख', 'करोड', 'अर्ब', 'खर्ब', 'नील', 'पद्म'
        ]

    def less_than_100(self, number):
        """Convert numbers less than 100 to Nepali words."""
        return self.np[number] if number > 0 else self.np[0]

    def less_than_1000(self, number):
        """Convert numbers less than 1000 to Nepali words."""
        if number < 100:
            return self.less_than_100(number)

        hundreds, below_hundred = divmod(number, 100)
        in_words = ''
        if hundreds > 0:
            in_words = self.np[hundreds] + ' सय'
        if below_hundred > 0:
            in_words += ' ' + self.less_than_100(below_hundred)
        return in_words.strip()

    def output(self, number):
        # Work on the string form so that leading zeros can be counted
        number_str = str(number)

        # Initialize a list to hold parts of the final output
        integer_in_words = []

        # Count leading zeros and add 'सुन्य' for each leading zero
        leading_zeros = len(number_str) - len(number_str.lstrip('0'))
        integer_in_words.extend(['सुन्य'] * leading_zeros)
        number=number_str.lstrip('0')

        # Convert the rest of the number to words
        number = int(number_str)  # Convert back to integer to process
        parts = []
        place_value = 0

        while number > 0:
            if place_value == 0:  # First group (up to 3 digits)
                chunk = number % 1000
                number //= 1000
            else:  # Subsequent groups (2 digits each)
                chunk = number % 100
                number //= 100

            if chunk > 0:
                words = self.less_than_1000(chunk) if place_value == 0 else self.less_than_100(chunk)
                if place_value > 0:
                    words += ' ' + self.nns_np[place_value]
                parts.append(words)

            place_value += 1

        # Add the processed number parts to the integer_in_words list
        integer_in_words.extend(reversed(parts))

        # Build the final result from parts
        final_output = ' '.join(integer_in_words).strip()
        return final_output

# Example usage
num = '00100007'  # If you want to see the effect of leading zeros, try num = '00123'
nns = NepaliNumberingSystem()
print("Nepali Numbering System (Integer in Words):")
print(nns.output(num))  # For '00123', Output: 'सुन्य सुन्य एक सय तेइस'

Nepali Numbering System (Integer in Words):
सुन्य सुन्य एक लाख सात


In [None]:
def convert_numerals_to_words(text):
    # Extract and replace numerals with words
    import re
    nns = NepaliNumberingSystem()

    # Function to replace each numeral in a string with its word form
    def replace_numerals(match):
        return nns.output(match.group())

    # Regex to find numerals (sequences of digits)
    return re.sub(r'\d+', replace_numerals, text)

In [None]:
def apply_conversion(example):
    example['transcription'] = convert_numerals_to_words(example['transcription'])
    return example

# Apply the function to each item in the dataset
dataset = dataset.map(apply_conversion)

# Check the first few entries to verify the change
# print(dataset.features)  # or 'test', depending on your split


Map:   0%|          | 0/157904 [00:00<?, ? examples/s]

In [None]:
print(dataset[1]['transcription'])  # or 'test', depending on your split

सुन्य सुन्य सात मिलको दूरीमा


## Vocabulary

In [None]:
# Define a global variable to store our sampling rate
SPEECH_SAMPLING_RATE = 16000

Long input sequences require a lot of memory. Because XLS-R is based on self-attention, its memory requirement scales quadratically with the input length. To avoid an `out-of-memory` error, we use the following code to filter out all sequences of 5 seconds or longer before training.

### All characters in our dataset

Only clips shorter than 5 seconds are kept.

In [None]:
# Cap clips at 5 seconds to avoid out-of-memory errors.
MAX_FRAMES = SPEECH_SAMPLING_RATE*5  # 5 sec

dataset = dataset.filter(
    lambda ex: ex < MAX_FRAMES,
    input_columns=['num_frames',],
    with_indices=False, batched=False, batch_size=0,
)

dataset

Filter:   0%|          | 0/148187 [00:00<?, ? examples/s]

Dataset({
    features: ['utterance_id', 'speaker_id', 'utterance', 'transcription', 'num_frames'],
    num_rows: 136095
})

After filtering, 136,095 examples remain.
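
To give a feel for why the 5-second cap helps, the sketch below estimates how many frames reach the self-attention layers for a clip of a given length. It assumes the default wav2vec 2.0 convolutional feature encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, without padding); these values are not stated in this notebook and should be checked against the model config. `encoder_frames` is a hypothetical helper.

```python
# Assumed wav2vec 2.0 feature-encoder layout (check against the model config).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def encoder_frames(num_samples: int) -> int:
    """Number of frames left after the unpadded 1-D convolution stack."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length

SAMPLING_RATE = 16000
for seconds in (5, 10):
    frames = encoder_frames(SAMPLING_RATE * seconds)
    # Each attention head scores every frame against every other frame.
    print(f"{seconds} s -> {frames} frames, {frames * frames} attention scores per head")
```

Under these assumptions, doubling the clip length roughly doubles the number of frames and therefore roughly quadruples the size of each attention score matrix.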

In [None]:
# All characters in our dataset now:
''.join(sorted(set([c for s in dataset['transcription'] for c in s])))

' !;?\\ँंःअआइईउऊऋएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरऱलवशषसह़ािीुूृेैॉॊोौ्ॐॠ।\u200c\u200d\u200e\u200f“'

### Removing examples containing irrelevant symbols

In [None]:
# Remove examples that contain rare or irrelevant symbols
import string
remove_irr_chars = ['ऱ', 'ॊ', '॰', '॑']
remove_irr_chars = ''.join(remove_irr_chars)

def check_irr_symb(text):
    """Return True if the text contains any of the irrelevant symbols."""
    return any(c in text for c in remove_irr_chars)

# Use dataset filter to remove examples with above function
dataset = dataset.filter(
    lambda ex: not check_irr_symb(ex),
    input_columns=['transcription',],
    with_indices=False, batched=False, batch_size=0,
)
print(dataset)
print(remove_irr_chars)
print(string.ascii_letters)

Filter:   0%|          | 0/136095 [00:00<?, ? examples/s]

Dataset({
    features: ['utterance_id', 'speaker_id', 'utterance', 'transcription', 'num_frames'],
    num_rows: 136083
})
ऱॊ॰॑
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ


In [None]:
for i in range(10):
  print(dataset[i]['transcription'])

तर उनी कुनै
तपाईँले सायद राम्रै
पन्तले आफूले जानेको
यो रोगलाई मृगी
व्यक्तित्व रहँदै आएकी छन्
दहीचिउरे स्वभावको बनाइदिन्छ
यो साह्रै किफायती
यसले अस्पतालाहरूको धेरै
ग्याँसहरूलाई निस्कने
व्यापक शब्द हो


### Removing punctuation and irrelevant symbols

We want the transcriptions to contain only letters. Punctuation marks and symbols carry no sound: people do not pronounce commas, periods, or question marks, and speech is a continuous flow without explicit punctuation. A speech recognition system captures what is said, not how it might be punctuated in writing.

- Removing punctuation simplifies the processing and interpretation of spoken language.
- The primary goal of our speech recognition system is to accurately transcribe the spoken words. Predicting punctuation would add another layer of complexity that is not necessary here.
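
One way to sketch this cleanup on plain strings is with `str.translate`, shown below on two made-up sentences. The shortened character list and the helper `strip_symbols` are assumptions for illustration; the notebook's own cell uses a list comprehension over its full `remove_chars` list.

```python
# Illustrative subset of symbols to drop (the notebook's list is longer).
REMOVE_CHARS = ['!', '?', ';', '।', '\u200c', '\u200d']

# str.maketrans with a third argument maps each listed character to None.
REMOVE_TABLE = str.maketrans('', '', ''.join(REMOVE_CHARS))

def strip_symbols(text: str) -> str:
    """Delete the listed symbols and trim surrounding whitespace."""
    return text.translate(REMOVE_TABLE).strip()

print(strip_symbols('के छ?'))
print(strip_symbols('नमस्ते।'))
```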

In [None]:
# Removing irrelevant and unused characters
remove_chars = ['!', '%', '.', ';', '?', '\\', '।', '\xa0', '\u200c', '\u200d', '\u200e', '\u200f', '“', 'ऱ', "ॊ", '॰', ' ॑']
# remove_chars = ['!', '%', '.', ';', '?', '\\', '।', '\xa0', '\u200c', '\u200d', '\u200e', '\u200f', '“']

def remove_special_characters(row):
    row['transcription'] = ''.join(
        [c for c in row['transcription'] if c not in remove_chars]
    ).strip()
    return row

dataset = dataset.map(remove_special_characters)
''.join(sorted(set([c for s in dataset['transcription'] for c in s])))

Map:   0%|          | 0/136083 [00:00<?, ? examples/s]

' ँंःअआइईउऊऋएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह़ािीुूृेैॉोौ्ॐॠ'

In [None]:
print(len(dataset))

136083


In [None]:
for i in range(10):
  print(dataset[i]['transcription'])

तर उनी कुनै
तपाईँले सायद राम्रै
पन्तले आफूले जानेको
यो रोगलाई मृगी
व्यक्तित्व रहँदै आएकी छन्
दहीचिउरे स्वभावको बनाइदिन्छ
यो साह्रै किफायती
यसले अस्पतालाहरूको धेरै
ग्याँसहरूलाई निस्कने
व्यापक शब्द हो


### Building the vocabulary: `use this` if there is no vocab yet, skip if already saved

- In CTC, it is common to classify speech chunks into letters, so we do the same here. First, we extract all distinct letters of the data and build our vocabulary from this set.
- To do so, we write a mapping function that concatenates all transcriptions into one long string and then converts that string into a set of characters. It is important to pass `batched=True` (here together with `batch_size=-1`) to `map(...)` so that the mapping function has access to all transcriptions at once.
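
On a toy batch, the idea can be sketched without the `datasets` library. The two short transcriptions are made up, and `toy_batch` is a hypothetical stand-in for the single full batch that `map(..., batched=True)` would pass to the function.

```python
def extract_all_chars(batch):
    """Join all transcriptions and return the set of distinct characters."""
    all_text = " ".join(batch["transcription"])
    return {"vocab": [sorted(set(all_text))]}

# Hypothetical two-example batch in the column-oriented layout used by `map`.
toy_batch = {"transcription": ["तर उनी", "उनी कुनै"]}

vocab = extract_all_chars(toy_batch)["vocab"][0]
print(len(vocab), vocab)
```

The space character ends up in the set as well, which is why the notebook later gives it a visible replacement.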

In [None]:
def extract_all_chars(batch):
    all_text = " ".join(batch["transcription"])
    vocab = list(set(all_text))
    return {"vocab": [vocab]}

vocab_all = dataset.map(extract_all_chars, batched=True,
                        batch_size=-1, keep_in_memory=True,
                        remove_columns=dataset.column_names)

Map:   0%|          | 0/136083 [00:00<?, ? examples/s]

In [None]:
vocab_list = sorted(list(set(vocab_all["vocab"][0])))

Finally, we also add a padding token that corresponds to CTC’s “blank token”. The “blank token” is a core component of the CTC algorithm.
- It lets the model output 'hello' instead of 'helo' when repeated characters are collapsed.
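
The role of the blank token can be sketched with a greedy CTC collapse on made-up frame sequences: repeated symbols are merged first, and blanks are dropped afterwards. Here `_` stands in for the blank (the notebook uses `__PAD__`), and `ctc_collapse` is a hypothetical helper, not part of any library.

```python
from itertools import groupby

BLANK = '_'  # stand-in for the CTC blank token

def ctc_collapse(frames):
    """Merge consecutive duplicates, then remove blank symbols."""
    merged = [symbol for symbol, _ in groupby(frames)]
    return ''.join(symbol for symbol in merged if symbol != BLANK)

# A blank between the two runs of 'l' keeps them apart.
print(ctc_collapse(list('hheel_llo')))
# Without the blank, the two runs merge into a single 'l'.
print(ctc_collapse(list('hheelllo')))
```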

In [None]:
# Blank Token for CTC
UNK_TOKEN = '__UNK__'
PAD_TOKEN = '__PAD__'

vocab_list = [PAD_TOKEN, UNK_TOKEN, *vocab_list]

In [None]:
# Enumerated Dictionary
vocab_dict = {v: k for k, v in enumerate(vocab_list)}

# for printing vocab in single line
', '.join([f"{k}: {v}" for k, v in (vocab_dict.items())])

'__PAD__: 0, __UNK__: 1,  : 2, ँ: 3, ं: 4, ः: 5, अ: 6, आ: 7, इ: 8, ई: 9, उ: 10, ऊ: 11, ऋ: 12, ए: 13, ऐ: 14, ओ: 15, औ: 16, क: 17, ख: 18, ग: 19, घ: 20, ङ: 21, च: 22, छ: 23, ज: 24, झ: 25, ञ: 26, ट: 27, ठ: 28, ड: 29, ढ: 30, ण: 31, त: 32, थ: 33, द: 34, ध: 35, न: 36, प: 37, फ: 38, ब: 39, भ: 40, म: 41, य: 42, र: 43, ल: 44, व: 45, श: 46, ष: 47, स: 48, ह: 49, ़: 50, ा: 51, ि: 52, ी: 53, ु: 54, ू: 55, ृ: 56, े: 57, ै: 58, ॉ: 59, ो: 60, ौ: 61, ्: 62, ॐ: 63, ॠ: 64'

To make it clearer that " " has its own token class, we give it a more visible character, '|'. The “unknown” token added above lets the model later deal with characters not encountered in the training dataset.

In [None]:
# Adding '|' instead of " " for more visual information.
WORD_DELIMITER = '|'

vocab_dict[WORD_DELIMITER] = vocab_dict[" "]
del vocab_dict[" "]
len(vocab_dict)

65

So, the vocabulary has 65 entries: 63 characters (including the word delimiter) plus the padding and unknown tokens.
Now we save it.

In [None]:
# Saving the vocabulary as a json file
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file, ensure_ascii=False)

Loading the Vocab

In [None]:
# Loading the vocabulary to Wav2Vec2CTCTokenizer instance
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("./", unk_token=UNK_TOKEN, pad_token=PAD_TOKEN, word_delimiter_token=WORD_DELIMITER)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]



In [None]:
# # Push the tokenizer to hub for easiness later.
# tokenizer.push_to_hub(repo_name)

### `Directly loading` the tokenizer from the Hub (from a previously trained model)

In [None]:
from transformers import Wav2Vec2CTCTokenizer
tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR")

tokenizer_config.json:   0%|          | 0.00/1.10k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/844 [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/30.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/552 [00:00<?, ?B/s]

### Saving and Loading dataset for further use `Use if saving locally`.

When working on a local machine, it is better to save the dataset once and load it from disk than to download it every time the notebook runs.

In [None]:
# # Save the dataset to a local directory
# dataset.save_to_disk('./local_dataset_path')  # Replace with your desired path
# from datasets import load_from_disk

# # Load the dataset from the local directory
# dataset = load_from_disk('./local_dataset_path')  # Replace with your local path


In [None]:
import random
SPEECH_SAMPLING_RATE=16000
import IPython.display as ipd

sample_idx = random.randrange(len(dataset))  # randint() includes the upper bound and could go out of range
# sample_idx = 0

print(dataset[sample_idx]['transcription'])
ipd.Audio(dataset[sample_idx]['utterance']["array"], autoplay=True, rate=SPEECH_SAMPLING_RATE)

म नभएमा त्यहाँ


### Feature Extractor

In [None]:
# Feature Extractor
SPEECH_SAMPLING_RATE=16000
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=SPEECH_SAMPLING_RATE,
                                             padding_value=0.0, do_normalize=True,
                                             return_attention_mask=True)

In [None]:
# Feature Extractor and Tokenizer wrapped to single processor
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(
    feature_extractor=feature_extractor,
    tokenizer=tokenizer
)

This is the repo name; modify it accordingly. The repo will be created on Hugging Face, and model checkpoints will be pushed to it.

Run the two cells below only once at the beginning, and skip them afterwards.

In [None]:
repo_name = "Wav2Vec2_XLS-R-300m_Nepali_ASR"

Push tokenizer and processor config to repo if not pushed already

In [None]:
tokenizer.push_to_hub(repo_name)

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/iamTangsang/Final_XLS-R-300m_Nepali/commit/42204dae8adf0f036f31f998154fd0fb93174f02', commit_message='Upload tokenizer', commit_description='', oid='42204dae8adf0f036f31f998154fd0fb93174f02', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
processor.push_to_hub(repo_name)

In [None]:
# Random inspection of dataset item
dataset[45]

{'utterance_id': '2a360a0dfb',
 'speaker_id': '99942',
 'utterance': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/2773ff6d739daaa0871c0821d6e667225e93578c2d9e93baa986f68f12a7d2f8/asr_nepali/data/2a/2a360a0dfb.flac',
  'array': array([ 0.00109863, -0.01113892, -0.01919556, ..., -0.00549316,
         -0.00271606, -0.00640869]),
  'sampling_rate': 16000},
 'transcription': 'तँलाई धेरै मलाई',
 'num_frames': 46400}

### Lazy Loading

**`IMPORTANT: Need more documentation. Will add later.`** We wrap the Hugging Face `Dataset` in a PyTorch `Dataset` so that each audio file is processed only when it is requested, because disk space and memory are limited.

In [None]:
# Lazy loading:
# wrap the Hugging Face dataset in a PyTorch dataset so that audio files are
# processed lazily, because disk space and memory are limited.
import torch

class NepaliASRProcessedDataset(torch.utils.data.Dataset):
    """Takes an HF dataset and a processor, and processes the audio files
    and transcriptions with the processor only when items are requested.
    """
    def __init__(
        self,
        dataset,
        processor,
    ):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        """Length of dataset"""
        return len(self.dataset)

    def __getitem__(self, idx):
        """Return processed data at `idx` index."""
        example = self.dataset[idx]

        # Return dict
        return_dict = {}

        # first, process the audio with Wav2Vec2 feature extractor
        return_dict['input_values'] = self.processor(
            audio=example['utterance']['array'],
            sampling_rate=example['utterance']['sampling_rate'],
            return_attention_mask=False,  # will be calculated during batching
        )['input_values'][0]
        # add the length of extracted features of audio
        return_dict['input_length'] = len(return_dict['input_values'])

        # second, process the transcription with Wav2Vec2 tokenizer
        return_dict['labels'] = self.processor(
            text=example['transcription'],
            return_attention_mask=False,  # will be calculated during batching
        )['input_ids']
        return return_dict
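
The lazy pattern itself does not depend on audio. The stripped-down sketch below uses plain Python lists and a hypothetical `LazyDataset` class with a call counter to show that no processing happens until an item is requested.

```python
class LazyDataset:
    """Minimal stand-in: applies `process_fn` only when an item is accessed."""

    def __init__(self, records, process_fn):
        self.records = records
        self.process_fn = process_fn
        self.num_processed = 0  # counts how many items were actually processed

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        self.num_processed += 1
        return self.process_fn(self.records[idx])

lazy = LazyDataset(['a', 'bb', 'ccc'], process_fn=len)
print(lazy.num_processed)  # nothing has been processed at construction time
item = lazy[2]             # processing happens here, for this item only
print(item, lazy.num_processed)
```

The class above follows the same `__len__` and `__getitem__` protocol as `NepaliASRProcessedDataset`, with `len` standing in for the processor.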

We use 80% of the data for training, 10% for validation, and 10% for testing.

In [None]:
# Split the dataset
train_size = int(0.8 * len(dataset))
validation_size = int(0.1 * len(dataset))
test_size = len(dataset) - train_size - validation_size
train_dataset = dataset.select(range(0, train_size))
validation_dataset = dataset.select(range(train_size, train_size + validation_size))
test_dataset = dataset.select(range(train_size + validation_size, len(dataset)))


In [None]:
train_dataset, test_dataset, validation_dataset

(Dataset({
     features: ['utterance_id', 'speaker_id', 'utterance', 'transcription', 'num_frames'],
     num_rows: 108866
 }),
 Dataset({
     features: ['utterance_id', 'speaker_id', 'utterance', 'transcription', 'num_frames'],
     num_rows: 13609
 }),
 Dataset({
     features: ['utterance_id', 'speaker_id', 'utterance', 'transcription', 'num_frames'],
     num_rows: 13608
 }))
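
The split sizes shown above follow directly from the arithmetic in the split cell. The sketch below reproduces it for the 136,083 examples; `split_sizes` is a hypothetical helper.

```python
def split_sizes(n, train_frac=0.8, val_frac=0.1):
    """Return (train, validation, test) sizes; the test split takes the remainder."""
    train = int(train_frac * n)
    val = int(val_frac * n)
    return train, val, n - train - val

print(split_sizes(136083))  # (108866, 13608, 13609)
```

Because the fractions are truncated with `int`, the leftover example goes to the test split, which is why it has one more row than the validation split.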

In [None]:
# test_dataset.push_to_hub('iamTangsang/OPENSLR-TEST')

### Pushing dataset to Hub

In [None]:
# Create a DatasetDict containing your splits
from datasets import DatasetDict
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset,
    'test': test_dataset
})

# Specify the dataset repository name on Hugging Face Hub.
# A separate variable keeps the model `repo_name` defined earlier intact.
dataset_repo_name = "iamTangsang/OpenSLR54-Nepali-ASR"  # Replace with your username and desired repo name

# Push the dataset to the Hugging Face Hub
dataset_dict.push_to_hub(dataset_repo_name)

CommitInfo(commit_url='https://huggingface.co/datasets/iamTangsang/OpenSLR54-Nepali-ASR/commit/2382bbae14eb4d04527efd09d1cdd87290337e3d', commit_message='Upload dataset', commit_description='', oid='2382bbae14eb4d04527efd09d1cdd87290337e3d', pr_url=None, pr_revision=None, pr_num=None)

## Training

In [None]:
# Convert the Hugging Face train/validation/test datasets to PyTorch datasets

train_dataset = NepaliASRProcessedDataset(train_dataset, processor)
test_dataset = NepaliASRProcessedDataset(test_dataset, processor)
validation_dataset = NepaliASRProcessedDataset(validation_dataset, processor)

In [None]:
import torch

from dataclasses import dataclass
from typing import Dict, List, Union

from transformers import Wav2Vec2Processor


# Label value used to mask padded label positions so the loss ignores them
LARGE_NEG = -100

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a
              single sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'`: No padding (i.e., can output a batch with sequences of
              different lengths).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths,
        # and need different padding methods
        batch = {}
        input_features = [{"input_values": feature["input_values"]} for feature in features if 'input_values' in feature]
        label_features = [{"input_ids": feature["labels"]} for feature in features if 'labels' in feature]

        if input_features:
            batch.update(self.processor.pad(
                input_features,
                padding=self.padding,
                return_tensors="pt",
            ))
        if label_features:
            labels_batch = self.processor.tokenizer.pad(
                label_features,
                padding=self.padding,
                return_tensors="pt",
            )

            # replace label padding with -100 so that those positions are ignored when the loss is computed
            labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), LARGE_NEG)

            batch["labels"] = labels

        return batch
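
The label-masking step in the collator above can be illustrated without PyTorch. The following is a minimal pure-Python sketch, not the collator itself: `pad_labels` is a hypothetical helper, and the token ids are arbitrary values picked from the vocabulary printed later in this notebook.

```python
IGNORE_INDEX = -100  # same value as LARGE_NEG in the collator above


def pad_labels(label_seqs, pad_id=0, ignore_index=IGNORE_INDEX):
    """Pad label id lists to the longest one, then mask the padded positions."""
    max_len = max(len(seq) for seq in label_seqs)
    padded, mask = [], []
    for seq in label_seqs:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    # Plain-list equivalent of masked_fill(attention_mask.ne(1), -100)
    return [
        [tok if m == 1 else ignore_index for tok, m in zip(row, row_mask)]
        for row, row_mask in zip(padded, mask)
    ]


labels = pad_labels([[17, 51, 2, 36], [41, 51]])
print(labels)  # [[17, 51, 2, 36], [41, 51, -100, -100]]
```

Masking through the attention mask, rather than by comparing against the pad id, means only positions that were actually added as padding get replaced.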

In [None]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

In [None]:
import evaluate
import numpy as np

wer_metric = evaluate.load("wer")
# cer_metric = evaluate.load('cer')

In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == LARGE_NEG] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    # cer = cer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
    # return {"wer": wer, "cer": cer}
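
To make the decoding step in `compute_metrics` concrete, here is a rough pure-Python sketch of greedy CTC decoding. It assumes the pad token (id 0) doubles as the CTC blank and that `|` is the word delimiter, as in the vocabulary shown in this notebook; `ctc_decode` is a hypothetical helper and does not reproduce everything `processor.batch_decode` does.

```python
# A few entries copied from the tokenizer vocabulary printed in this notebook
ID_TO_TOKEN = {0: "__PAD__", 2: "|", 17: "क", 36: "न", 51: "ा"}


def ctc_decode(ids, id_to_token, pad_id=0, delimiter="|", group_tokens=True):
    """Collapse repeated ids (optional), drop blanks, and join into text."""
    if group_tokens:
        ids = [tok for k, tok in enumerate(ids) if k == 0 or tok != ids[k - 1]]
    tokens = [id_to_token[tok] for tok in ids if tok != pad_id]
    return "".join(tokens).replace(delimiter, " ").strip()


# Frame-level argmax output: repeats are merged, blanks separate true doubles
frame_ids = [17, 17, 0, 51, 2, 2, 36, 0, 36]
print(ctc_decode(frame_ids, ID_TO_TOKEN))  # का नन

# Label ids contain no blanks, so grouping would wrongly merge the double "न"
label_ids = [17, 51, 2, 36, 36]
print(ctc_decode(label_ids, ID_TO_TOKEN, group_tokens=False))  # का नन
print(ctc_decode(label_ids, ID_TO_TOKEN, group_tokens=True))   # का न
```

This is why the reference labels are decoded with `group_tokens=False`: they are already the final character sequence, and collapsing repeats would delete genuine doubled characters.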

In [None]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    # "iamTangsang/test-wav2vec2-large-xls-r-300m-nepali-openslr",
    # '/content/drive/MyDrive/Nepali_ASR_New_LR/Wav2vec2-large-xls-r-300m-Nepali-New/checkpoint-55200',
    'iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR',
    ignore_mismatched_sizes=False,
    attention_dropout=0.15, # init 0.1
    hidden_dropout=0.15, # init 0.1
    feat_proj_dropout=0.1,
    mask_time_prob=0.075,
    layerdrop=0.1,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

config.json:   0%|          | 0.00/2.10k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

### Checking Tokenizer

In [None]:
special_tokens = processor.tokenizer.special_tokens_map
print(special_tokens)

{'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '__UNK__', 'pad_token': '__PAD__'}


In [None]:
# Get the full vocabulary from the tokenizer
vocab_dict = processor.tokenizer.get_vocab()

# Get all the special tokens from the tokenizer
special_tokens = list(processor.tokenizer.special_tokens_map_extended.values())

# Filter out special tokens to get only the non-special tokens
non_special_tokens = {token: index for token, index in vocab_dict.items() if token not in special_tokens}

# To see the non-special tokens
print(non_special_tokens)


{'__PAD__': 0, '__UNK__': 1, '|': 2, 'ँ': 3, 'ं': 4, 'ः': 5, 'अ': 6, 'आ': 7, 'इ': 8, 'ई': 9, 'उ': 10, 'ऊ': 11, 'ऋ': 12, 'ए': 13, 'ऐ': 14, 'ओ': 15, 'औ': 16, 'क': 17, 'ख': 18, 'ग': 19, 'घ': 20, 'ङ': 21, 'च': 22, 'छ': 23, 'ज': 24, 'झ': 25, 'ञ': 26, 'ट': 27, 'ठ': 28, 'ड': 29, 'ढ': 30, 'ण': 31, 'त': 32, 'थ': 33, 'द': 34, 'ध': 35, 'न': 36, 'प': 37, 'फ': 38, 'ब': 39, 'भ': 40, 'म': 41, 'य': 42, 'र': 43, 'ल': 44, 'व': 45, 'श': 46, 'ष': 47, 'स': 48, 'ह': 49, '़': 50, 'ा': 51, 'ि': 52, 'ी': 53, 'ु': 54, 'ू': 55, 'ृ': 56, 'े': 57, 'ै': 58, 'ॉ': 59, 'ो': 60, 'ौ': 61, '्': 62, 'ॐ': 63, 'ॠ': 64, '<s>': 65, '</s>': 66}


In [None]:
# Get all tokens using the tokenizer directly
all_tokens = processor.tokenizer.convert_ids_to_tokens(range(len(processor.tokenizer)))

# Print out the tokens
print(f"All tokens in processor.tokenizer: {all_tokens}")
print(f"Total tokens reported by tokenizer: {len(all_tokens)}")

# Compare with vocab_dict tokens
vocab_tokens = list(vocab_dict.keys())
print(f"Tokens in vocab_dict: {vocab_tokens}")
print(f"Total tokens in vocab_dict: {len(vocab_tokens)}")


All tokens in processor.tokenizer: ['__PAD__', '__UNK__', '|', 'ँ', 'ं', 'ः', 'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड', 'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न', 'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ल', 'व', 'श', 'ष', 'स', 'ह', '़', 'ा', 'ि', 'ी', 'ु', 'ू', 'ृ', 'े', 'ै', 'ॉ', 'ो', 'ौ', '्', 'ॐ', 'ॠ', '<s>', '</s>']
Total tokens reported by tokenizer: 67
Tokens in vocab_dict: ['__PAD__', '__UNK__', '|', 'ँ', 'ं', 'ः', 'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ए', 'ऐ', 'ओ', 'औ', 'क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड', 'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न', 'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ल', 'व', 'श', 'ष', 'स', 'ह', '़', 'ा', 'ि', 'ी', 'ु', 'ू', 'ृ', 'े', 'ै', 'ॉ', 'ो', 'ौ', '्', 'ॐ', 'ॠ', '<s>', '</s>']
Total tokens in vocab_dict: 67


In [None]:
model.freeze_feature_encoder()

In [None]:
len(processor.tokenizer)

67

In [None]:
tokenizer.push_to_hub(repo_name)

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/iamTangsang/Wav2vec2-large-xls-r-300m-Nepali-New/commit/60df95a290fcdcaaac542cf3c250e8ade2206c76', commit_message='Upload tokenizer', commit_description='', oid='60df95a290fcdcaaac542cf3c250e8ade2206c76', pr_url=None, pr_revision=None, pr_num=None)

### Training Arguments

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir=repo_name,
  group_by_length=True,
  per_device_train_batch_size=16,
  gradient_accumulation_steps=2,
  evaluation_strategy="steps",
  num_train_epochs=10,
  gradient_checkpointing=True,
  fp16=True,
  save_steps=800,
  eval_steps=800,
  logging_steps=100,
  learning_rate=1e-6, # from 3e-4 to 2e-5 to 1e-6
  # warmup_steps=500, # This was done for all previous models
  warmup_steps=500, # New
  save_total_limit=1,
  push_to_hub=True,
  hub_strategy='checkpoint',
  report_to="tensorboard",
  logging_dir='./logs_final',
  # resume_from_checkpoint='last-checkpoint',
  load_best_model_at_end=True
)
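
As a sanity check on the arguments above, the sketch below works out the effective batch size and the learning rate under a linear warmup followed by linear decay, which is the default scheduler type in `TrainingArguments` as far as I know. `TOTAL_STEPS` is a hypothetical value chosen only for illustration; the real number depends on the dataset size.

```python
import math


def effective_batch_size(per_device, grad_accum, n_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum * n_devices


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 10_000  # hypothetical; not taken from this notebook

print(effective_batch_size(16, 2))  # 32
for step in (0, 250, 500, 5_250, 10_000):
    print(step, linear_schedule_lr(step, 1e-6, 500, TOTAL_STEPS))
```

With `warmup_steps=500` the learning rate only reaches its peak of `1e-6` after 500 optimizer steps, each of which covers 32 samples on a single GPU.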



In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=processor.feature_extractor,
)



In [None]:
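# freeze_feature_extractor() is deprecated in recent transformers releases; freeze_feature_encoder() is the equivalent call
model.freeze_feature_encoder()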



### Learning about Model Architecture (Final inspection before starting training)

1. Using print(model)

In [None]:
from transformers import Wav2Vec2ForCTC
model = Wav2Vec2ForCTC.from_pretrained('iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR')

config.json:   0%|          | 0.00/2.11k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

In [None]:
print(model)

Wav2Vec2ForCTC(
  (wav2vec2): Wav2Vec2Model(
    (feature_extractor): Wav2Vec2FeatureEncoder(
      (conv_layers): ModuleList(
        (0): Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (1-4): 4 x Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (5-6): 2 x Wav2Vec2LayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,))
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
      )
    )
    (feature_projection): Wav2Vec2FeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (projec

2. Model config detail

In [None]:
print(model.config)

Wav2Vec2Config {
  "_name_or_path": "iamTangsang/XLS-R-300m-Nepali-CommonVoice",
  "activation_dropout": 0.0,
  "adapter_attn_dim": null,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForCTC"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 768,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": true,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": false,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": true,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "layer",
  "feat_proj_dropout": 0.0,
  "feat_quantizer_dropout": 0.0,
  "fina

3. Explore layer by layer

In [None]:
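# Freeze the convolutional feature encoder so that its weights are not updated during fine-tuning
# (freeze_feature_extractor() is the older, deprecated name for the same call)
model.freeze_feature_encoder()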

In [None]:
# Print the names and shapes of all parameters in the model
for name, param in model.named_parameters():
    print(f"Layer: {name} | Shape: {param.shape} | Requires grad: {param.requires_grad}")


Layer: wav2vec2.masked_spec_embed | Shape: torch.Size([1024]) | Requires grad: True
Layer: wav2vec2.feature_extractor.conv_layers.0.conv.weight | Shape: torch.Size([512, 1, 10]) | Requires grad: True
Layer: wav2vec2.feature_extractor.conv_layers.0.conv.bias | Shape: torch.Size([512]) | Requires grad: True
Layer: wav2vec2.feature_extractor.conv_layers.0.layer_norm.weight | Shape: torch.Size([512]) | Requires grad: True
Layer: wav2vec2.feature_extractor.conv_layers.0.layer_norm.bias | Shape: torch.Size([512]) | Requires grad: True
Layer: wav2vec2.feature_extractor.conv_layers.1.conv.weight | Shape: torch.Size([512, 512, 3]) | Requires grad: True
Layer: wav2vec2.feature_extractor.conv_layers.1.conv.bias | Shape: torch.Size([512]) | Requires grad: True
Layer: wav2vec2.feature_extractor.conv_layers.1.layer_norm.weight | Shape: torch.Size([512]) | Requires grad: True
Layer: wav2vec2.feature_extractor.conv_layers.1.layer_norm.bias | Shape: torch.Size([512]) | Requires grad: True
Layer: wav2ve

4. Explore specific parts of the model

In [None]:
# Access the feature extractor and print its architecture
print(model.wav2vec2.feature_extractor)


Wav2Vec2FeatureEncoder(
  (conv_layers): ModuleList(
    (0): Wav2Vec2LayerNormConvLayer(
      (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,))
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (activation): GELUActivation()
    )
    (1-4): 4 x Wav2Vec2LayerNormConvLayer(
      (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,))
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (activation): GELUActivation()
    )
    (5-6): 2 x Wav2Vec2LayerNormConvLayer(
      (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,))
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (activation): GELUActivation()
    )
  )
)
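
The kernel sizes and strides printed above determine how many feature frames the encoder produces for a given waveform. A small sketch, assuming unpadded 1-D convolutions (which is what the printed `Conv1d` layers show) and 16 kHz input audio:

```python
# Kernel sizes and strides of the seven conv layers, as printed above
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def conv_output_length(n_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Number of frames left after each unpadded Conv1d is applied in turn."""
    length = n_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


total_stride = 1
for stride in CONV_STRIDES:
    total_stride *= stride

print(conv_output_length(16000))    # 49 frames for one second of 16 kHz audio
print(total_stride)                 # 320 samples between consecutive frames
print(1000 * total_stride / 16000)  # 20.0 ms per frame
```

Each frame passed to the transformer therefore corresponds to roughly 20 ms of audio, which is the time resolution at which the CTC head emits characters.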


In [None]:
# Access the encoder layers and print the number of transformer layers
print(f"Number of transformer layers: {len(model.wav2vec2.encoder.layers)}")

# Print the details of one transformer layer (e.g., the first one)
print(model.wav2vec2.encoder.layers[0])


Number of transformer layers: 24
Wav2Vec2EncoderLayerStableLayerNorm(
  (attention): Wav2Vec2SdpaAttention(
    (k_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (v_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (q_proj): Linear(in_features=1024, out_features=1024, bias=True)
    (out_proj): Linear(in_features=1024, out_features=1024, bias=True)
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  (feed_forward): Wav2Vec2FeedForward(
    (intermediate_dropout): Dropout(p=0.0, inplace=False)
    (intermediate_dense): Linear(in_features=1024, out_features=4096, bias=True)
    (intermediate_act_fn): GELUActivation()
    (output_dense): Linear(in_features=4096, out_features=1024, bias=True)
    (output_dropout): Dropout(p=0.1, inplace=False)
  )
  (final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
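
The layer shapes printed above are enough to estimate where the "300M" in the model name comes from. The following back-of-the-envelope count uses only the dimensions shown in the printout (hidden size 1024, feed-forward size 4096, 24 layers) and ignores the feature encoder, positional convolution, and CTC head:

```python
HIDDEN = 1024
FFN = 4096
NUM_LAYERS = 24


def linear_params(in_features, out_features):
    """Weights plus bias of one nn.Linear layer."""
    return in_features * out_features + out_features


attention = 4 * linear_params(HIDDEN, HIDDEN)  # k_proj, v_proj, q_proj, out_proj
feed_forward = linear_params(HIDDEN, FFN) + linear_params(FFN, HIDDEN)
layer_norms = 2 * (2 * HIDDEN)                 # weight and bias of both LayerNorms

per_layer = attention + feed_forward + layer_norms
print(per_layer)               # 12596224
print(per_layer * NUM_LAYERS)  # 302309376
```

The 24 transformer layers alone account for about 302 million parameters, so the bulk of the model sits in the encoder stack rather than in the convolutional front end.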


5. Visualization

In [None]:
!pip install torch-summary

Collecting torch-summary
  Downloading torch_summary-1.4.5-py3-none-any.whl.metadata (18 kB)
Downloading torch_summary-1.4.5-py3-none-any.whl (16 kB)
Installing collected packages: torch-summary
Successfully installed torch-summary-1.4.5


In [None]:
from torchsummary import summary

# Use torchsummary to print a summary of the model
summary(model, input_size=(1, 16000))  # input_size=(1, seq_len) for audio


### TensorBoard for Training Visualization

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./logs_final

In [None]:
trainer.train(
    resume_from_checkpoint='/content/drive/MyDrive/Nepali_ASR_Final_Model/Final_XLS-R-300m_Nepali/checkpoint-4800',  # Set to False (or omit) to start training from the beginning
)
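
Hard-coding the checkpoint path works, but it has to be edited after every interrupted session. Below is a minimal sketch of looking up the newest `checkpoint-<step>` folder instead. `find_latest_checkpoint` is a hypothetical helper written for this illustration; `transformers.trainer_utils.get_last_checkpoint` offers similar functionality.

```python
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demonstration on a throwaway directory rather than the real Drive folder
with tempfile.TemporaryDirectory() as tmp:
    for step in (800, 1600, 4800):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    latest = find_latest_checkpoint(tmp)
    empty = find_latest_checkpoint(os.path.join(tmp, "checkpoint-800"))

print(os.path.basename(latest))  # checkpoint-4800
```

Steps are compared as integers because a plain string sort would rank `checkpoint-800` after `checkpoint-4800`.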

In [None]:
print(train_dataset)

<__main__.NepaliASRProcessedDataset object at 0x7e332a612470>


Push to hub once the model has converged

In [None]:
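# Trainer.push_to_hub takes a commit message as its first argument; the target repo is determined by training_args
trainer.push_to_hub(commit_message='Upload final model')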

# **Ignore the code below for now and refer to the Inference_Test_OpenSLR notebook instead. The code below will be updated later.**

## Evaluation

In [None]:
# model = Wav2Vec2ForCTC.from_pretrained(repo_name).to("cuda")
model = Wav2Vec2ForCTC.from_pretrained(repo_name)
processor = Wav2Vec2Processor.from_pretrained(repo_name)

In [None]:
from transformers import Wav2Vec2ForCTC
model = Wav2Vec2ForCTC.from_pretrained('/content/drive/MyDrive/Nepali_ASR/results/checkpoint-800/')

In [None]:
from transformers import AutoProcessor
# model_name = 'Harveenchadha/vakyansh-wav2vec2-nepali-nem-130'
model_name = 'iamTangsang/test-wav2vec2-large-xls-r-300m-nepali-openslr'
model = Wav2Vec2ForCTC.from_pretrained(model_name)
# processor = AutoProcessor.from_pretrained("Harveenchadha/vakyansh-wav2vec2-nepali-nem-130")
# tokenizer = Wav2Vec2CTCTokenizer.from_pretrained('iamTangsang/Final_XLS-R-300m_Nepali')
processor = Wav2Vec2Processor.from_pretrained('iamTangsang/XLS-R-300m-Nepali-CommonVoice')
# processor = Wav2Vec2Processor(
#     feature_extractor=Wav2Vec2FeatureExtractor.from_pretrained(model_name),
#     tokenizer=tokenizer
# )


### On Validation Set

In [None]:
import random
# Optional: predict on a small random subset instead (commented out); the full validation set is used below

# pred = trainer.predict(
#     torch.utils.data.Subset(
#         test_dataset,
#         random.sample(list(range(len(test_dataset))), 10)
#     )
# )
pred = trainer.predict(validation_dataset)
pred_logits = pred.predictions
pred_ids = np.argmax(pred_logits, axis=-1)

pred.label_ids[pred.label_ids == LARGE_NEG] = processor.tokenizer.pad_token_id

pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
# we do not want to group tokens when computing the metrics
label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

In [None]:
list(zip(label_str, pred_str))[:5]

[('परिवारसहित मेरो पिता', 'परिवारसहित मेरो पिता'),
 ('योगी महा सिद्धको', 'योगी महा सिद्धको'),
 ('यहाँजीलाई धेरै', 'यहाँजीलाई धेरै'),
 ('उनलाई आफ्नो करियरको', 'उनलाई आफ्नो करियरको'),
 ('यो कुवाको पानी', 'यो कुवाको पानी')]

In [None]:
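# cer_metric was commented out when the metrics were first loaded, so load it here before use
cer_metric = evaluate.load("cer")

wer = wer_metric.compute(predictions=pred_str, references=label_str)
cer = cer_metric.compute(predictions=pred_str, references=label_str)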

In [None]:
# wer_score = wer(label_str, pred_str)
# cer_score = cer(label_str, pred_str)
# wer_score = wer(references, corrected_predictions) # 0.58
# cer_score = cer(references, corrected_predictions) # 0.18

print(f"WER: {wer}")
print(f"CER: {cer}")

WER: 0.16815561211239807
CER: 0.02720467865488672


In [None]:
test_dataset

<__main__.NepaliASRProcessedDataset at 0x7f29ad137070>

### On Test Set

In [None]:
import random
# Optional: predict on a small random subset instead (commented out); the full test set is used below

# pred = trainer.predict(
#     torch.utils.data.Subset(
#         test_dataset,
#         random.sample(list(range(len(test_dataset))), 10)
#     )
# )
pred = trainer.predict(test_dataset)
pred_logits = pred.predictions
pred_ids = np.argmax(pred_logits, axis=-1)

pred.label_ids[pred.label_ids == LARGE_NEG] = processor.tokenizer.pad_token_id

pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
# we do not want to group tokens when computing the metrics
label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

In [None]:
list(zip(label_str, pred_str))[:5]

[('यिनका छोरा थिए', 'यिनका छोरा थिए'),
 ('गर्न साँगाचोकको सुकुटेमा', 'गर्न साघाचोको सुकुटेमा'),
 ('जोडिने पुरुषको नाता', 'जोडिने पुरुषको नाता'),
 ('एक सशस्त्र समूह', 'एक सशस्त्र समूह'),
 ('राजाले जिब्रो लर्बराउँदै', 'राजाले जिब्रो लर्बराउँदै')]

In [None]:
wer = wer_metric.compute(predictions=pred_str, references=label_str)
cer = cer_metric.compute(predictions=pred_str, references=label_str)

In [None]:
# wer_score = wer(label_str, pred_str)
# cer_score = cer(label_str, pred_str)
# wer_score = wer(references, corrected_predictions) # 0.58
# cer_score = cer(references, corrected_predictions) # 0.18

print(f"WER: {wer}")
print(f"CER: {cer}")

WER: 0.16251845329051307
CER: 0.02692532332081769
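
For intuition about what these two numbers measure, here is a small pure-Python sketch of WER and CER based on Levenshtein distance. It is not the `evaluate` implementation: it assumes whitespace tokenization for words and counts Unicode code points for characters, which is a simplification. The example pair is the second row of the test-set sample above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Total word-level edits divided by the total number of reference words."""
    edits = sum(edit_distance(r.split(), p.split()) for r, p in zip(references, predictions))
    return edits / sum(len(r.split()) for r in references)


def char_error_rate(references, predictions):
    """Total character-level edits divided by the total reference length."""
    edits = sum(edit_distance(r, p) for r, p in zip(references, predictions))
    return edits / sum(len(r) for r in references)


reference = ["गर्न साँगाचोकको सुकुटेमा"]
prediction = ["गर्न साघाचोको सुकुटेमा"]

print(word_error_rate(reference, prediction))  # one wrong word out of three
print(char_error_rate(reference, prediction))
```

A single misrecognized word costs a third of the WER for this utterance but only a few characters of the CER, which is why the reported CER is so much lower than the WER.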


### Creating Final Repository for Final Model

In [None]:
new_repo_name = 'Wav2Vec2_XLS-R-300m_Nepali_ASR'
model.push_to_hub(new_repo_name, commit_message="Upload Final Trained Model")
tokenizer.push_to_hub(new_repo_name, commit_message='Upload Final Tokenizer')
processor.push_to_hub(new_repo_name, commit_message='Upload Final Processor')
# OR
# trainer.push_to_hub(new_repo_name, commit_message='Upload Final Trainer Config')

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR/commit/8c487560a9e781aa50c9b8cb7695267c06d9765a', commit_message='Upload Final Processor', commit_description='', oid='8c487560a9e781aa50c9b8cb7695267c06d9765a', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
trainer.push_to_hub('Upload Trainer Arguments')

training_args.bin:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR/commit/4b8d9d8aaa4cf2a0f8c2808715faa06a3ed608d9', commit_message='Upload Trainer Arguments', commit_description='', oid='4b8d9d8aaa4cf2a0f8c2808715faa06a3ed608d9', pr_url=None, pr_revision=None, pr_num=None)

### Evaluation `(Rubbish)`. Don't use this. This was only for testing.

In [None]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("./", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

In [None]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)

In [None]:
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

In [None]:
model = Wav2Vec2ForCTC.from_pretrained('/content/drive/MyDrive/Nepali_ASR/results/checkpoint-800/').to("cuda")


In [None]:
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)

In [None]:
# Define proportions for training, validation, and test sets
train_ratio = 0.8  # 80% for training
validation_ratio = 0.1  # 10% for validation
test_ratio = 0.1  # 10% for testing
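from datasets import load_dataset  # needed here if it was not imported earlier in the notebook

# token=True replaces the deprecated use_auth_token=True in recent versions of datasets
common_voice_train_evaluation = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP", split="validated+other", token=True)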
# Calculate sizes based on ratios
num_samples = len(common_voice_train_evaluation)
print(f'Original: {num_samples}')
train_size = int(train_ratio * num_samples)
validation_size = int(validation_ratio * num_samples)
test_size = num_samples - train_size - validation_size

# Split the dataset
train_dataset = common_voice_train_evaluation.select(range(0, train_size))
common_voice_dev_evaluation = common_voice_train_evaluation.select(range(train_size, train_size + validation_size))
common_voice_test_evaluation = common_voice_train_evaluation.select(range(train_size + validation_size, train_size + validation_size + test_size))

common_voice_train_evaluation = train_dataset

# Print the sizes of each dataset
print(f"Training dataset size: {len(common_voice_train_evaluation)} samples")
print(f"Validation dataset size: {len(common_voice_dev_evaluation)} samples")
print(f"Test dataset size: {len(common_voice_test_evaluation)} samples")



Original: 1337
Training dataset size: 1069 samples
Validation dataset size: 133 samples
Test dataset size: 135 samples
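
The split sizes printed above follow directly from the integer truncation in the cell. A short sketch that reproduces the arithmetic (`split_sizes` is a hypothetical helper, not part of the notebook):

```python
def split_sizes(num_samples, train_ratio=0.8, validation_ratio=0.1):
    """Truncate the train/validation sizes; the test split takes the remainder."""
    train_size = int(train_ratio * num_samples)
    validation_size = int(validation_ratio * num_samples)
    test_size = num_samples - train_size - validation_size
    return train_size, validation_size, test_size


train_size, validation_size, test_size = split_sizes(1337)
print(train_size, validation_size, test_size)  # 1069 133 135

# Index ranges that would be passed to Dataset.select for each split
train_range = range(0, train_size)
validation_range = range(train_size, train_size + validation_size)
test_range = range(train_size + validation_size, 1337)
```

Because the ranges are contiguous, the split is sequential rather than random; shuffling the dataset first would be needed if the recordings are ordered by speaker or session.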


In [None]:
# Pass sampling_rate explicitly (16 kHz, matching the feature extractor) so that a rate mismatch raises an error instead of failing silently
input_dict = processor(common_voice_test[111]["input_values"], sampling_rate=16000, return_tensors="pt", padding=True)

# Run inference without tracking gradients, on whichever device the model is on
with torch.no_grad():
    logits = model(input_dict.input_values.to(model.device)).logits

pred_ids = torch.argmax(logits, dim=-1)[0]


In [None]:
# OPEN SLR
# Pass sampling_rate explicitly; this assumes the OpenSLR audio is already at 16 kHz
input_dict = processor(dataset[69]['utterance']['array'], sampling_rate=16000, return_tensors="pt", padding=True)

# Run inference without tracking gradients, on whichever device the model is on
with torch.no_grad():
    logits = model(input_dict.input_values.to(model.device)).logits

pred_ids = torch.argmax(logits, dim=-1)[0]


In [None]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("/content/drive/MyDrive/Nepali_ASR/results/checkpoint-800", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

In [None]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)

In [None]:
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

In [None]:
# OPEN SLR
print("Prediction:")
print(processor.decode(pred_ids))

print("\nReference:")
print(dataset[69]["transcription"].lower())

Prediction:
एकषेतिम वैसाक पुनी माना ।

Reference:
१०३ वैशाख पूर्णिमामा


In [None]:
print("Prediction:")
print(processor.decode(pred_ids))

print("\nReference:")
print(common_voice_test_evaluation[111]["sentence"].lower())

Prediction:
हो भाइ अबट्यो पनि गर्नुपर्छ ।

Reference:
हो भाइ, अब त्यो पनि गर्नु पर्छ ।


## Another Evaluation Run

### On Validation Set

In [None]:
import random
# Optional: predict on a small random subset instead (commented out); the full validation set is used below

# pred = trainer.predict(
#     torch.utils.data.Subset(
#         test_dataset,
#         random.sample(list(range(len(test_dataset))), 10)
#     )
# )
pred = trainer.predict(validation_dataset)
pred_logits = pred.predictions
pred_ids = np.argmax(pred_logits, axis=-1)

pred.label_ids[pred.label_ids == LARGE_NEG] = processor.tokenizer.pad_token_id

pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
# we do not want to group tokens when computing the metrics
label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

In [None]:
list(zip(label_str, pred_str))[:5]

[('परिवारसहित मेरो पिता', 'परिवारसहित मेरो पिता'),
 ('योगी महा सिद्धको', 'योगी महा सिद्धको'),
 ('यहाँजीलाई धेरै', 'यहाँजीलाई धेरै'),
 ('उनलाई आफ्नो करियरको', 'उनलाई आफ्नो करियरको'),
 ('यो कुवाको पानी', 'यो कुवाको पानी')]

In [None]:
wer = wer_metric.compute(predictions=pred_str, references=label_str)
# cer = cer_metric.compute(predictions=pred_str, references=label_str)

In [None]:
# wer_score = wer(label_str, pred_str)
# cer_score = cer(label_str, pred_str)
# wer_score = wer(references, corrected_predictions) # 0.58
# cer_score = cer(references, corrected_predictions) # 0.18

print(f"WER: {wer}")
# print(f"CER: {cer}")

WER: 0.16815561211239807


In [None]:
test_dataset

<__main__.NepaliASRProcessedDataset at 0x7f29ad137070>