# **Fine-tuning XLS-R for Multi-Lingual ASR with 🤗 Transformers**

***New (11/2021)***: *This blog post has been updated to feature XLSR's successor, called [XLS-R](https://huggingface.co/models?other=xls_r)*.

## Notebook Setup

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if 'failed' in gpu_info or 'not found' in gpu_info:
  print('Not connected to a GPU')
else:
  print(gpu_info)

/bin/bash: line 1: nvidia-smi: command not found


In [None]:
!pip install wandb



In [None]:
import wandb
wandb.login()
# Enter your API key at the prompt; do not store it in the notebook.

wandb: Currently logged in as: kyagabajonah (asr-africa-research-team). Use `wandb login --relogin` to force relogin


True

Before we start, let's install `datasets` and `transformers`. We also need `torchaudio` to load audio files and `jiwer` to evaluate our fine-tuned model using the [word error rate (WER)](https://huggingface.co/metrics/wer) metric ${}^1$.

In [None]:
%%capture
!pip install datasets
!pip install transformers==4.11.3
!pip install torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
!pip install jiwer

In [None]:
!pip install accelerate -U



In [None]:
!pip install transformers[torch]




Then you need to install Git-LFS to upload your model checkpoints:

In [None]:
%%capture
!apt install git-lfs




---

${}^1$ In the [paper](https://arxiv.org/pdf/2006.13979.pdf), the model was evaluated using the phoneme error rate (PER), but by far the most common metric in ASR is the word error rate (WER). To keep this notebook as general as possible we decided to evaluate the model using WER.
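
As a rough illustration of what the metric measures: WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below is a plain-Python version written only for this explanation; the function name `simple_wer` is hypothetical, and the notebook itself relies on `jiwer` for the actual evaluation.

```python
def simple_wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words)


# One substituted word and one missing word out of four reference words
print(simple_wer("iyo nzu ni nto", "iyo inzu ni"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.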

## Prepare Data, Tokenizer, Feature Extractor

### Create `Wav2Vec2CTCTokenizer`

Many ASR datasets only provide the target text, `'sentence'`, for each audio array `'audio'` and file `'path'`. Common Voice actually provides much more information about each audio file, such as the `'accent'`. To keep the notebook as general as possible, we only consider the transcribed text for fine-tuning.



In [None]:
!git clone https://huggingface.co/datasets/mbazaNLP/kinyarwanda-tts-dataset

fatal: destination path 'kinyarwanda-tts-dataset' already exists and is not an empty directory.


In [None]:
%cd kinyarwanda-tts-dataset

/content/kinyarwanda-tts-dataset


In [None]:

# Install dependencies
!pip install wget
!pip install datasets
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install text-unidecode
!pip install matplotlib

## Install NeMo
BRANCH = 'r2.0.0rc0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

exit()

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libsndfile1 is already the newest version (1.0.31-2ubuntu0.1).
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
libsox-fmt-mp3 is already the newest version (14.4.2+git20190427-2+deb11u2ubuntu0.22.04.1).
sox is already the newest version (14.4.2+git20190427-2+deb11u2ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
DEPRECATION: git+https://github.com/NVIDIA/NeMo.git@r2.0.0rc0#egg=nemo_toolkit[all] contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617
Collecting nemo_toolkit[all]
  Cloning https://github.com/NVIDIA/NeMo.git (to revision r2.0.0rc0) to /tmp/pip-install-wte8u3rf/nemo-toolkit_b5f06bf0cf314d71b4f5f431749556f3
  Running command

In [None]:
import pandas as pd
import os
import librosa
import numpy as np
from collections import Counter
import json

# Path to the dataset repository and CSV file
repo_path = "/content/kinyarwanda-tts-dataset"
csv_path = os.path.join(repo_path, "tts-dataset.csv")

# Load transcriptions into a DataFrame
transcriptions_df = pd.read_csv(csv_path, header=None, names=["combined"])
transcriptions_df[['file_name', 'transcription']] = transcriptions_df['combined'].str.extract(r'([^ ]+) "(.*)"')
transcriptions_df.drop(columns=['combined'], inplace=True)
print(transcriptions_df.head())

# Directory where the audio files are stored
audio_dir = os.path.join(repo_path, "audio")

# List all audio files
audio_files = [f for f in os.listdir(audio_dir) if f.endswith('.wav')]

# Create a DataFrame for the audio files
audio_df = pd.DataFrame(audio_files, columns=['file_name'])

# Adjust file names to add missing underscore and remove any whitespace
audio_df['file_name'] = audio_df['file_name'].apply(lambda x: "TTS_" + x.split('TTS')[1].replace('.wav', '').replace(' ', ''))

# Print the first few rows to inspect
print(audio_df.head())

# Match audio files with transcriptions
matched_records = []
transcriptions_set = set(transcriptions_df['file_name'].values)
for audio_file in audio_df['file_name']:
    if audio_file in transcriptions_set:
        audio_path = os.path.join(audio_dir, audio_file + '.wav')
        audio_path = audio_path.replace("_", " ", 1)  # Replace only the first underscore with a space
        transcription = transcriptions_df[transcriptions_df['file_name'] == audio_file]['transcription'].values[0]

        # Load the audio at its native sampling rate and derive the duration in seconds
        audio_data, sr = librosa.load(audio_path, sr=None)
        duration = len(audio_data) / sr

        matched_records.append({'file_name': audio_file, 'audio_path': audio_path, 'transcription': transcription, 'duration': duration})
    else:
        print(f"File {audio_file} not found in transcriptions")

# Convert matched records to a DataFrame
matched_df = pd.DataFrame(matched_records)

# Print the first few rows of the matched DataFrame to inspect
print(matched_df.head())
print(matched_df.shape)



  file_name                                      transcription
0   TTS_1_2  ntivuga ko nehemiya yakoreye umugabo wa esiter...
1   TTS_1_3     iyo nzu ni nto ariko ni nini bihagije kuri twe
2   TTS_1_4  amaze kubona izi mbwa ngo yageze ikigali ateke...
3   TTS_1_5  abana banjye batandatu umugabo wanjye n'abavan...
4   TTS_1_6  seyoboka yavuze ko adahakana ko bariya bantu b...
    file_name
0   TTS_4_141
1   TTS_6_132
2  TTS_17_139
3  TTS_17_262
4  TTS_17_138
    file_name                                         audio_path  \
0   TTS_4_141  /content/kinyarwanda-tts-dataset/audio/TTS 4_1...   
1   TTS_6_132  /content/kinyarwanda-tts-dataset/audio/TTS 6_1...   
2  TTS_17_139  /content/kinyarwanda-tts-dataset/audio/TTS 17_...   
3  TTS_17_262  /content/kinyarwanda-tts-dataset/audio/TTS 17_...   
4  TTS_17_138  /content/kinyarwanda-tts-dataset/audio/TTS 17_...   

                                       transcription  
0  kuko abana bo mu murenge wa ntongwe n'uwa kina...  
1  kuburyo habonets

In [None]:

matched_df = pd.DataFrame(matched_records)

# Print the first few rows of the matched DataFrame to inspect
print(matched_df.head())
print(matched_df.shape)


    file_name                                         audio_path  \
0   TTS_4_141  /content/kinyarwanda-tts-dataset/audio/TTS 4_1...   
1   TTS_6_132  /content/kinyarwanda-tts-dataset/audio/TTS 6_1...   
2  TTS_17_139  /content/kinyarwanda-tts-dataset/audio/TTS 17_...   
3  TTS_17_262  /content/kinyarwanda-tts-dataset/audio/TTS 17_...   
4  TTS_17_138  /content/kinyarwanda-tts-dataset/audio/TTS 17_...   

                                       transcription  duration  
0  kuko abana bo mu murenge wa ntongwe n'uwa kina...  5.166667  
1  kuburyo habonetse amahirwe yo gukina filime zo...  5.500000  
2  ibi byabaye mu mpera z'icyumweru gishize nk'uk...  4.750000  
3  ibyo yabivuze yirengagije ko uwari umukunzi we...  8.250000  
4  ibi byabaye k'ubuyobozi bw'uwo muperezida kand...  6.500000  
(3992, 4)


In [None]:

# Function to select a subset of the dataset
def select_subset(df, target_samples):
    total_samples = 0
    selected_indices = []

    for i, row in df.iterrows():
        # Use this row's path; without this line the loop reuses a stale global `audio_path`
        audio_path = row['audio_path']

        # Check if the file exists
        if os.path.exists(audio_path):
            try:
                audio, sampling_rate = librosa.load(audio_path, sr=None)
                audio_length = len(audio)
                if total_samples + audio_length > target_samples:
                    break
                total_samples += audio_length
                selected_indices.append(i)
            except Exception as e:
                print(f"Error loading {audio_path}: {e}")
        else:
            print(f"File not found: {audio_path}")

    subset = df.loc[selected_indices]  # iterrows() yields index labels, so select by label
    return subset

# Define target samples for train, validation, and test sets (1 hour each)
target_samples = 16_000 * 3600 * 1 # 1 hour of audio

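# Select subsets for train, validation, and test.
# Shuffle once, then draw each split from the rows not used by the previous one,
# so that the three splits do not share utterances.
shuffled_df = matched_df.sample(frac=1, random_state=42).reset_index(drop=True)
subset_train = select_subset(shuffled_df, target_samples)
remaining_df = shuffled_df.drop(subset_train.index).reset_index(drop=True)
subset_val = select_subset(remaining_df, target_samples)
remaining_df = remaining_df.drop(subset_val.index).reset_index(drop=True)
subset_test = select_subset(remaining_df, target_samples)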

print(f"Train subset size: {len(subset_train)} samples")
print(f"Validation subset size: {len(subset_val)} samples")
print(f"Test subset size: {len(subset_test)} samples")

# Function to write a subset as a manifest with one JSON object per line
def generate_json(subset, output_path):
    with open(output_path, 'w', encoding='utf-8') as json_file:
        for _, row in subset.iterrows():
            record = {
                "audio_filepath": row['audio_path'],
                "text": row['transcription'],
            }
            # Include the duration when the DataFrame provides it
            if 'duration' in row:
                record["duration"] = float(row['duration'])
            # NeMo parses manifests line by line, so do not pretty-print a JSON array
            json_file.write(json.dumps(record, ensure_ascii=False) + "\n")

    print(f"JSON file saved to {output_path}")
# Paths to save the JSON files
train_json_path = os.path.join(repo_path, "train_records.json")
val_json_path = os.path.join(repo_path, "val_records.json")
test_json_path = os.path.join(repo_path, "test_records.json")

# Generate JSON files for train, validation, and test sets
generate_json(subset_train, train_json_path)
generate_json(subset_val, val_json_path)
generate_json(subset_test, test_json_path)

Train subset size: 163 samples
Validation subset size: 163 samples
Test subset size: 163 samples
JSON file saved to /content/kinyarwanda-tts-dataset/train_records.json
JSON file saved to /content/kinyarwanda-tts-dataset/val_records.json
JSON file saved to /content/kinyarwanda-tts-dataset/test_records.json


In [None]:
import os
import librosa
import numpy as np
from collections import Counter

def get_audio_statistics(df):
    durations = []
    vocabulary = Counter()

    for row in df.itertuples():
        # Use this row's path; without this line every row reuses a stale global `audio_path`
        audio_path = row.audio_path
        # Check if the file exists
        if os.path.exists(audio_path):
            try:
                audio, sampling_rate = librosa.load(audio_path, sr=None)
                duration = len(audio) / sampling_rate
                durations.append(duration)

                transcript = row.transcription
                tokens = transcript.split()
                vocabulary.update(tokens)
            except Exception as e:
                print(f"Error loading {audio_path}: {e}")
        else:
            print(f"File not found: {audio_path}")

    return {
        'mean_duration': np.mean(durations),
        'std_duration': np.std(durations),
        'max_duration': np.max(durations),
        'vocab_size': len(vocabulary),
        'unique_words': len(set(vocabulary.keys())),
        'token_freq': vocabulary,
        'durations': durations
    }



audio_stats_train = get_audio_statistics(subset_train)
print(audio_stats_train)


{'mean_duration': 4.0, 'std_duration': 0.0, 'max_duration': 4.0, 'vocab_size': 2275, 'unique_words': 2275, 'token_freq': Counter({'mu': 140, 'ko': 83, 'ku': 49, 'ngo': 40, 'ni': 39, 'muri': 34, 'no': 25, 'na': 25, 'kuko': 25, 'kandi': 23, 'ariko': 21, 'ari': 20, 'igihe': 17, 'kugira': 17, 'hari': 17, 'icyo': 17, 'ati': 17, 'we': 17, 'cyane': 17, 'ya': 17, 'kuri': 17, 'rwanda': 16, 'yo': 16, 'ibyo': 15, 'yari': 14, 'uyu': 14, 'iyo': 14, 'abantu': 13, 'gihe': 13, 'iki': 10, 'iyi': 10, 'cyangwa': 10, 'cyo': 10, 'wa': 9, 'buryo': 9, 'avuga': 9, 'nta': 9, 'ibi': 9, 'bamwe': 9, 'bari': 8, 'nka': 8, 'uko': 8, 'bwo': 8, 'bafite': 8, 'kubera': 8, 'wo': 8, 'cya': 8, 'neza': 8, 'abo': 7, 'izi': 7, 'rya': 7, 'be': 7, 'abandi': 7, 'yavuze': 7, 'uwo': 7, 'aba': 7, 'kujya': 7, 'imana': 7, 'i': 7, 'bo': 6, 'gihugu': 6, 'kuba': 6, 'kumenya': 6, 'zo': 6, 'nabi': 6, 'arimo': 6, 'mbere': 6, 'buri': 6, 'agaciro': 6, 'aho': 6, 'ubwo': 6, 'yanjye': 6, 'yabo': 6, 'agira': 6, 'ikintu': 5, 'umwe': 5, 'bavuga': 

In [None]:

audio_stats_test = get_audio_statistics(subset_test)
print(audio_stats_test)


{'mean_duration': 4.0, 'std_duration': 0.0, 'max_duration': 4.0, 'vocab_size': 2275, 'unique_words': 2275, 'token_freq': Counter({'mu': 140, 'ko': 83, 'ku': 49, 'ngo': 40, 'ni': 39, 'muri': 34, 'no': 25, 'na': 25, 'kuko': 25, 'kandi': 23, 'ariko': 21, 'ari': 20, 'igihe': 17, 'kugira': 17, 'hari': 17, 'icyo': 17, 'ati': 17, 'we': 17, 'cyane': 17, 'ya': 17, 'kuri': 17, 'rwanda': 16, 'yo': 16, 'ibyo': 15, 'yari': 14, 'uyu': 14, 'iyo': 14, 'abantu': 13, 'gihe': 13, 'iki': 10, 'iyi': 10, 'cyangwa': 10, 'cyo': 10, 'wa': 9, 'buryo': 9, 'avuga': 9, 'nta': 9, 'ibi': 9, 'bamwe': 9, 'bari': 8, 'nka': 8, 'uko': 8, 'bwo': 8, 'bafite': 8, 'kubera': 8, 'wo': 8, 'cya': 8, 'neza': 8, 'abo': 7, 'izi': 7, 'rya': 7, 'be': 7, 'abandi': 7, 'yavuze': 7, 'uwo': 7, 'aba': 7, 'kujya': 7, 'imana': 7, 'i': 7, 'bo': 6, 'gihugu': 6, 'kuba': 6, 'kumenya': 6, 'zo': 6, 'nabi': 6, 'arimo': 6, 'mbere': 6, 'buri': 6, 'agaciro': 6, 'aho': 6, 'ubwo': 6, 'yanjye': 6, 'yabo': 6, 'agira': 6, 'ikintu': 5, 'umwe': 5, 'bavuga': 

In [None]:
audio_stats_val = get_audio_statistics(subset_val)
print(audio_stats_val)

{'mean_duration': 4.0, 'std_duration': 0.0, 'max_duration': 4.0, 'vocab_size': 2275, 'unique_words': 2275, 'token_freq': Counter({'mu': 140, 'ko': 83, 'ku': 49, 'ngo': 40, 'ni': 39, 'muri': 34, 'no': 25, 'na': 25, 'kuko': 25, 'kandi': 23, 'ariko': 21, 'ari': 20, 'igihe': 17, 'kugira': 17, 'hari': 17, 'icyo': 17, 'ati': 17, 'we': 17, 'cyane': 17, 'ya': 17, 'kuri': 17, 'rwanda': 16, 'yo': 16, 'ibyo': 15, 'yari': 14, 'uyu': 14, 'iyo': 14, 'abantu': 13, 'gihe': 13, 'iki': 10, 'iyi': 10, 'cyangwa': 10, 'cyo': 10, 'wa': 9, 'buryo': 9, 'avuga': 9, 'nta': 9, 'ibi': 9, 'bamwe': 9, 'bari': 8, 'nka': 8, 'uko': 8, 'bwo': 8, 'bafite': 8, 'kubera': 8, 'wo': 8, 'cya': 8, 'neza': 8, 'abo': 7, 'izi': 7, 'rya': 7, 'be': 7, 'abandi': 7, 'yavuze': 7, 'uwo': 7, 'aba': 7, 'kujya': 7, 'imana': 7, 'i': 7, 'bo': 6, 'gihugu': 6, 'kuba': 6, 'kumenya': 6, 'zo': 6, 'nabi': 6, 'arimo': 6, 'mbere': 6, 'buri': 6, 'agaciro': 6, 'aho': 6, 'ubwo': 6, 'yanjye': 6, 'yabo': 6, 'agira': 6, 'ikintu': 5, 'umwe': 5, 'bavuga': 


In [None]:
# import os
# import json
# import nemo.collections.asr as nemo_asr
# from datasets import load_dataset
# from nemo.collections.asr.models import EncDecCTCModel
# from omegaconf import OmegaConf
# from pytorch_lightning import Trainer
# # Load pretrained model
# model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_conformer_ctc_small", map_location='cpu')
# # Configure training
# config = model.cfg
# config.train_ds.manifest_filepath = '/content/kinyarwanda-tts-dataset/train_records.json'
# config.train_ds.batch_size = 16

# config.validation_ds.manifest_filepath = '/content/kinyarwanda-tts-dataset/val_records.json'
# config.validation_ds.batch_size = 16

# config.test_ds.manifest_filepath = '/content/kinyarwanda-tts-dataset/test_records.json'
# config.test_ds.batch_size = 16

# config.optim.lr = 0.001

# # Update model configuration
# model.setup_training_data(train_data_config=config.train_ds)
# model.setup_validation_data(val_data_config=config.validation_ds)
# model.setup_test_data(test_data_config=config.test_ds)

# # Train the model
# trainer = Trainer(max_epochs=10, gpus=1)
# trainer.fit(model)

# # Save the fine-tuned model
# model.save_to("fine_tuned_parakeet_ctc.nemo")


    The secret `HF_TOKEN` does not exist in your Colab secrets.
    To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
    You will be able to reuse this secret in all of your notebooks.
    Please note that authentication is recommended but still optional to access public models or datasets.
    


[NeMo I 2024-07-10 06:36:20 mixins:172] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-07-10 06:36:20 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/NeMo_ASR_SET/English/v2.0/train/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 64
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 20.0
    min_duration: 0.1
    shuffle_n: 2048
    is_tarred: true
    tarred_audio_filepaths: /data/NeMo_ASR_SET/English/v2.0/train/audio__OP_0..4095_CL_.tar
    
[NeMo W 2024-07-10 06:36:20 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath:
    - /data/ASR/LibriSpeech/librispeech_withs

[NeMo I 2024-07-10 06:36:20 features:305] PADDING: 0
[NeMo I 2024-07-10 06:36:21 save_restore_connector:263] Model EncDecCTCModelBPE was successfully restored from /root/.cache/huggingface/hub/models--nvidia--stt_en_conformer_ctc_small/snapshots/f879b51de584983383de815ce87d25469b2abbf3/stt_en_conformer_ctc_small.nemo.


[NeMo W 2024-07-10 06:36:21 audio_to_text_dataset:343] dataset does not have explicitly defined labels
[NeMo E 2024-07-10 06:36:21 manifest:99] Failed to parse 817 lines from manifest file: /content/kinyarwanda-tts-dataset/train_records.json
[NeMo E 2024-07-10 06:36:21 manifest:101] -- Failed to parse line: `[`
[NeMo E 2024-07-10 06:36:21 manifest:101] -- Failed to parse line: `{`
[NeMo E 2024-07-10 06:36:21 manifest:101] -- Failed to parse line: `"audio_filepath": "/content/kinyarwanda-tts-dataset/audio/TTS 4_141.wav",`
[NeMo E 2024-07-10 06:36:21 manifest:101] -- Failed to parse line: `"text": "kuko abana bo mu murenge wa ntongwe n'uwa kinazi batangira gukorera amafaranga bakiri batoya",`
[NeMo E 2024-07-10 06:36:21 manifest:101] -- Failed to parse line: `"duration": 5.166666666666667`
[NeMo E 2024-07-10 06:36:21 manifest:101] -- Failed to parse line: `},`
[NeMo E 2024-07-10 06:36:21 manifest:101] -- Failed to parse line: `{`
[NeMo E 2024-07-10 06:36:21 manifest:101] -- Failed to par

RuntimeError: Failed to parse some lines from manifest files. See logs for more details.
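
The log above shows the manifest being read one line at a time, and the lines that fail to parse (`[`, `{`, `"audio_filepath": ...`) are fragments of a pretty-printed JSON array. NeMo manifests are expected to contain one complete JSON object per line. A minimal sketch of a converter, assuming the existing file was saved as a single JSON array; the helper name `convert_array_to_jsonl` and the output file name in the usage comment are hypothetical:

```python
import json


def convert_array_to_jsonl(array_path, jsonl_path):
    """Rewrite a JSON-array file as one JSON object per line; return the record count."""
    with open(array_path, "r", encoding="utf-8") as src:
        records = json.load(src)  # the whole file is a single JSON list

    with open(jsonl_path, "w", encoding="utf-8") as dst:
        for record in records:
            # ensure_ascii=False keeps any non-ASCII characters readable
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")

    return len(records)


# Hypothetical usage:
# convert_array_to_jsonl(
#     "/content/kinyarwanda-tts-dataset/train_records.json",
#     "/content/kinyarwanda-tts-dataset/train_manifest.json",
# )
```

Each entry still needs the `audio_filepath`, `text`, and `duration` keys; the converter only changes the file layout.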

In [None]:
import numpy as np
import pandas as pd

def get_transcript_statistics(df):
    lengths = [len(row.transcription.split()) for row in df.itertuples()]
    # Compute the outlier threshold once rather than once per element
    threshold = np.mean(lengths) + 2 * np.std(lengths)

    return {
        'max_length': np.max(lengths),
        'mean_length': np.mean(lengths),
        'outliers': [length for length in lengths if length > threshold],
        'lengths': lengths
    }
transcript_stats = get_transcript_statistics(matched_df)
print(transcript_stats)

# outliers
mean_length = np.mean(transcript_stats['lengths'])
std_length = np.std(transcript_stats['lengths'])
outlier_threshold = mean_length + 2 * std_length


matched_df['length'] = matched_df['transcription'].apply(lambda x: len(x.split()))
matched_df['is_outlier'] = matched_df['length'] > outlier_threshold

# Drop rows with outliers
filtered_df = matched_df[~matched_df['is_outlier']].drop(columns=['length', 'is_outlier'])


print(filtered_df)


{'max_length': 24, 'mean_length': 11.88376753507014, 'outliers': [18, 17, 17, 18, 17, 17, 18, 17, 18, 23, 18, 17, 17, 17, 17, 17, 18, 17, 20, 17, 18, 17, 18, 17, 17, 17, 18, 17, 17, 17, 22, 17, 20, 18, 17, 17, 17, 18, 17, 19, 22, 17, 17, 18, 17, 18, 18, 17, 17, 18, 17, 18, 17, 17, 19, 17, 17, 17, 17, 17, 17, 19, 17, 17, 17, 17, 17, 17, 17, 17, 18, 17, 22, 17, 19, 17, 17, 19, 17, 18, 17, 17, 17, 18, 18, 17, 17, 17, 18, 18, 18, 19, 18, 17, 17, 17, 17, 18, 17, 17, 18, 17, 17, 19, 17, 18, 17, 19, 18, 19, 17, 18, 20, 17, 17, 17, 17, 17, 18, 17, 17, 19, 17, 17, 17, 24, 17, 18, 17, 18, 19, 17, 17, 20, 19], 'lengths': [10, 13, 11, 11, 12, 12, 13, 13, 8, 13, 13, 18, 13, 11, 15, 14, 16, 15, 12, 15, 10, 12, 13, 12, 13, 10, 15, 13, 10, 13, 13, 12, 11, 14, 12, 11, 14, 12, 16, 13, 12, 7, 14, 13, 7, 16, 16, 11, 11, 12, 12, 13, 11, 12, 13, 14, 10, 11, 12, 16, 12, 10, 12, 8, 12, 8, 12, 9, 14, 10, 12, 13, 10, 14, 11, 14, 8, 17, 14, 11, 11, 13, 10, 13, 10, 15, 9, 15, 12, 11, 16, 8, 8, 10, 13, 16, 12, 17,

In [None]:
filtered_df.head()

Unnamed: 0,file_name,audio_path,transcription
0,TTS_9_117,/content/kinyarwanda-tts-dataset/audio/TTS_9_1...,azanasaba ko abaturage bahabwa murandasi nk'uk...
1,TTS_15_59,/content/kinyarwanda-tts-dataset/audio/TTS_15_...,batwitse amacumbi arindwi y'iri shuri banagera...
2,TTS_15_26,/content/kinyarwanda-tts-dataset/audio/TTS_15_...,baramukanze uburibwe bugera ku magufa banamwin...
3,TTS_17_35,/content/kinyarwanda-tts-dataset/audio/TTS_17_...,hari n'abanyeshuri bifuje ko byavanwaho nk'aha...
4,TTS_15_12,/content/kinyarwanda-tts-dataset/audio/TTS_15_...,bamwe muri leta y'u burundi bavuga ko ari imyi...


In [None]:
import os
import librosa
import numpy as np

def get_speech_rate_statistics(df):
    cps = []
    wps = []

    for row in df.itertuples():
        audio_path = row.audio_path.replace("_", " ", 1)  # Replace only the first underscore with a space

        # Check if the file exists
        if os.path.exists(audio_path):
            try:
                audio, sampling_rate = librosa.load(audio_path, sr=None)
                duration = len(audio) / sampling_rate

                transcript = row.transcription

                characters_per_sec = len(transcript) / duration
                words_per_sec = len(transcript.split()) / duration

                cps.append(characters_per_sec)
                wps.append(words_per_sec)
            except Exception as e:
                print(f"Error loading {audio_path}: {e}")
        else:
            print(f"File not found: {audio_path}")

    return {
        'mean_cps': np.mean(cps),
        'std_cps': np.std(cps),
        'mean_wps': np.mean(wps),
        'std_wps': np.std(wps),
        'cps': cps,
        'wps': wps
    }

# Assuming 'filtered_df' is your DataFrame
speech_rate_stats = get_speech_rate_statistics(filtered_df)
print(speech_rate_stats)


{'mean_cps': 14.761801597467896, 'std_cps': 1.8646373729501364, 'mean_wps': 2.0805969613857767, 'std_wps': 0.3529380392587765, 'cps': [14.545454545454545, 15.703703703703704, 15.609790630585985, 15.157894736842104, 16.842105263157894, 16.0, 14.339647182042839, 13.877576708920744, 14.857142857142858, 14.04, 12.275862068965518, 14.76923076923077, 15.299999999999999, 18.387096774193548, 15.454545454545455, 13.67741935483871, 17.05263157894737, 14.363636363636363, 12.823529411764707, 12.96, 15.789473684210526, 16.545454545454547, 16.105263157894736, 16.142857142857142, 16.0, 14.153846153846153, 15.0, 14.166666666666666, 14.476190476190476, 15.833333333333334, 13.957473744340703, 15.75, 13.0, 15.627874011761827, 16.17391304347826, 15.86882886312202, 16.666666666666668, 14.933333333333334, 13.263157894736842, 16.680883255431574, 13.575794889836844, 15.418156391413909, 13.655172413793103, 12.75, 15.500043934364893, 13.882352941176471, 13.166666666666666, 15.461538461538463, 13.692307692307692

In [None]:


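def calculate_snr(audio):
    # Caveat: mean(audio**2) equals var(audio) + mean(audio)**2, so for zero-mean audio
    # this ratio is always close to 0 dB and says little about background noise.
    signal = np.mean(audio**2)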
    noise = np.var(audio)
    snr = 10 * np.log10(signal / noise)
    return snr

def get_snr_statistics(df):
    snr_values = []

    for row in df.itertuples():
        audio_path = row.audio_path.replace("_", " ", 1)

        if os.path.exists(audio_path):
            try:
                audio, sampling_rate = librosa.load(audio_path, sr=None)
                snr = calculate_snr(audio)
                snr_values.append(snr)
            except Exception as e:
                print(f"Error loading {audio_path}: {e}")
        else:
            print(f"File not found: {audio_path}")

    return {
        'mean_snr': np.mean(snr_values),
        'std_snr': np.std(snr_values),
        'max_snr': np.max(snr_values),
        # 'snr_values': snr_values
    }

# Compute SNR statistics on the outlier-filtered DataFrame
snr_stats = get_snr_statistics(filtered_df)
print(snr_stats)


{'mean_snr': 2.908094771356415e-06, 'std_snr': 1.664322619502992e-05, 'max_snr': 0.0009964953642338514}
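
The values above are all close to 0 dB, which is expected when signal power and noise power are both computed from the whole waveform. One common heuristic, sketched below, is to treat the quietest frames of a recording as the noise floor. This is only an approximation, not a calibrated measurement; the function name `estimate_snr_db`, the 400-sample frame length (25 ms at 16 kHz), and the 10th-percentile noise floor are all assumptions chosen for illustration.

```python
import numpy as np


def estimate_snr_db(audio, frame_length=400, noise_percentile=10):
    """Heuristic SNR in dB: average frame power relative to the quietest frames."""
    audio = np.asarray(audio, dtype=np.float64)
    n_frames = len(audio) // frame_length
    if n_frames < 2:
        raise ValueError("audio is too short to split into frames")

    # Split into non-overlapping frames and compute the power of each frame
    frames = audio[: n_frames * frame_length].reshape(n_frames, frame_length)
    frame_power = np.mean(frames ** 2, axis=1)

    # Assumption: the quietest frames contain only background noise
    noise_power = np.percentile(frame_power, noise_percentile)
    signal_power = np.mean(frame_power)

    eps = 1e-12  # avoids division by zero on digital silence
    return 10 * np.log10((signal_power + eps) / (noise_power + eps))
```

The estimate is only meaningful for clips that contain some pauses; a clip that is speech from start to finish has no quiet frames to act as a noise reference.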


In [None]:
# Install wandb, transformers, and datasets
!pip install wandb transformers datasets

# Uninstall conflicting packages
!pip uninstall -y pyarrow requests

# Install compatible versions of required packages
!pip install pyarrow==14.0.1 requests==2.31.0

# Restart the kernel
import os
# os.kill(os.getpid(), 9)

Found existing installation: pyarrow 16.1.0
Uninstalling pyarrow-16.1.0:
  Successfully uninstalled pyarrow-16.1.0
Found existing installation: requests 2.32.3
Uninstalling requests-2.32.3:
  Successfully uninstalled requests-2.32.3
Collecting pyarrow==14.0.1
  Using cached pyarrow-14.0.1-cp310-cp310-manylinux_2_28_x86_64.whl (38.0 MB)
Collecting requests==2.31.0
  Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Installing collected packages: requests, pyarrow
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.20.0 requires pyarrow>=15.0.0, but you have pyarrow 14.0.1 which is incompatible.
datasets 2.20.0 requires requests>=2.32.2, but you have requests 2.31.0 which is incompatible.
Successfully installed pyarrow-14.0.1 requests-2.31.0


In [None]:
import os
import glob
import subprocess
import tarfile
import wget
import copy
from omegaconf import OmegaConf, open_dict

In [None]:
import nemo
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.metrics.wer import word_error_rate
from nemo.utils import logging, exp_manager

# Preparing the dataset for training

Before we start training the model on the above unprocessed manifest files, we need to analyze the data. Data pre-processing is perhaps the most essential task, and often requires moderate expertise in the language.

While we could technically use the manifests above to train a model, the results would potentially be abysmal. Let's dive a little deeper into what challenges this dataset poses to our models.

In [None]:
# Manifest Utils
from tqdm.auto import tqdm
from nemo.collections.asr.parts.utils.manifest_utils import read_manifest, write_manifest
import json
def write_processed_manifest(data, original_path):
    original_manifest_name = os.path.basename(original_path)
    new_manifest_name = original_manifest_name.replace(".json", "_processed.json")

    manifest_dir = os.path.split(original_path)[0]
    filepath = os.path.join(manifest_dir, new_manifest_name)
    write_manifest(filepath, data)
    print(f"Finished writing manifest: {filepath}")
    return filepath

In [None]:
# Replace with the path to your manifest file
manifest_path = "/content/train_records (1).json"

try:
    manifest_data = read_manifest(manifest_path)
    print("Manifest loaded successfully. Number of entries:", len(manifest_data))
    # Process the manifest data as needed
    # For example, filtering out invalid entries
    valid_data = [entry for entry in manifest_data if 'audio_filepath' in entry and 'duration' in entry and 'text' in entry]
    # Write the processed manifest
    write_processed_manifest(valid_data, manifest_path)
except Exception as e:
    print("An error occurred while reading or writing the manifest:", e)


[NeMo E 2024-07-10 08:01:00 manifest_utils:494] 76 Errors encountered while reading manifest file: <class 'nemo.utils.data_utils.DataStoreObject'>: store_path=/content/train_records (1).json, local_path=/content/train_records (1).json
[NeMo E 2024-07-10 08:01:00 manifest_utils:496] -- Failed to parse line: `[`
[NeMo E 2024-07-10 08:01:00 manifest_utils:496] -- Failed to parse line: `{`
[NeMo E 2024-07-10 08:01:00 manifest_utils:496] -- Failed to parse line: `"audio_filepath": "/content/kinyarwanda-tts-dataset/audio/TTS_18_57.wav",`
[NeMo E 2024-07-10 08:01:00 manifest_utils:496] -- Failed to parse line: `"text": "igihe utekereza ko uzi ikintu ukwiye kubanza kureba mu kindi cyerekezo",`
[NeMo E 2024-07-10 08:01:00 manifest_utils:496] -- Failed to parse line: `"duration": 4.6249886621315195`
[NeMo E 2024-07-10 08:01:00 manifest_utils:496] -- Failed to parse line: `},`
[NeMo E 2024-07-10 08:01:00 manifest_utils:496] -- Failed to parse line: `{`
[NeMo E 2024-07-10 08:01:00 manifest_utils:4

An error occurred while reading or writing the manifest: Errors encountered while reading manifest file: <class 'nemo.utils.data_utils.DataStoreObject'>: store_path=/content/train_records (1).json, local_path=/content/train_records (1).json


Next, we extract just the text corpus from the manifest.

In [None]:
train_text = [data['text'] for data in train_manifest_data]
# dev_text = [data['text'] for data in dev_manifest_data]
test_text = [data['text'] for data in test_manifest_data]

## Character set

Let us calculate the character set - which is the set of unique tokens that exist within the text manifests.

In [None]:
from collections import defaultdict

def get_charset(manifest_data):
    charset = defaultdict(int)
    for row in tqdm(manifest_data, desc="Computing character set"):
        text = row['text']
        for character in text:
            charset[character] += 1
    return charset

In [None]:
train_charset = get_charset(train_manifest_data)
# dev_charset = get_charset(dev_manifest_data)
test_charset = get_charset(test_manifest_data)

Computing character set:   0%|          | 0/686 [00:00<?, ?it/s]

Computing character set:   0%|          | 0/674 [00:00<?, ?it/s]

In [None]:
# Count the number of unique tokens that exist within this dataset
# train_dev_set = set.union(set(train_charset.keys()), set(dev_charset.keys()))
train_dev_set = set(train_charset.keys())
test_set = set(test_charset.keys())

In [None]:
print(f"Number of tokens in train+dev set : {len(train_dev_set)}")
print(f"Number of tokens in test set : {len(test_set)}")

Number of tokens in train+dev set : 27
Number of tokens in test set : 26


In [None]:
# OOV tokens in test set
train_test_common = set.intersection(train_dev_set, test_set)
test_oov = test_set - train_test_common
print(f"Number of OOV tokens in test set : {len(test_oov)}")
print()
print(test_oov)

Number of OOV tokens in test set : 0

set()


In [None]:
# Drop tokens that appear only in the test set (OOV) from the vocabulary
all_tokens = set.union(train_dev_set, test_set)
print(f"Original train+dev+test vocab size : {len(all_tokens)}")

extra_kanji = set(test_oov)  # test-only (OOV) tokens; empty for this dataset
train_token_set = all_tokens - extra_kanji
print(f"New train vocab size : {len(train_token_set)}")

Original train+dev+test vocab size : 27
New train vocab size : 27


In [None]:
# Processing pipeline
def apply_preprocessors(manifest, preprocessors):
    for processor in preprocessors:
        for idx in tqdm(range(len(manifest)), desc=f"Applying {processor.__name__}"):
            manifest[idx] = processor(manifest[idx])

    print("Finished processing manifest !")
    return manifest

In [None]:

PREPROCESSORS = []
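
`PREPROCESSORS` is left empty in this run, so the manifests pass through unchanged. As an illustration of what an entry could look like, here is a hypothetical preprocessor (`collapse_whitespace` is not part of NeMo) that follows the contract `apply_preprocessors` relies on: take one manifest record and return the updated record.

```python
import re


def collapse_whitespace(record):
    """Hypothetical preprocessor: trim the transcript and collapse whitespace runs."""
    record["text"] = re.sub(r"\s+", " ", record["text"]).strip()
    return record


# Example record in the manifest format used above (values are made up).
example = {"audio_filepath": "sample.wav", "text": "  habari   ya  asubuhi ", "duration": 1.2}
print(collapse_whitespace(example)["text"])  # habari ya asubuhi
```

To enable it, the list would become `PREPROCESSORS = [collapse_whitespace]`.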

In [None]:
# Load manifests
train_data = read_manifest(train_manifest)
# dev_data = read_manifest(dev_manifest)
test_data = read_manifest(test_manifest)

# Apply preprocessing
train_data_processed = apply_preprocessors(train_data, PREPROCESSORS)
# dev_data_processed = apply_preprocessors(dev_data, PREPROCESSORS)
test_data_processed = apply_preprocessors(test_data, PREPROCESSORS)

# Write new manifests
train_manifest_cleaned = write_processed_manifest(train_data_processed, train_manifest)
# dev_manifest_cleaned = write_processed_manifest(dev_data_processed, dev_manifest)
test_manifest_cleaned = write_processed_manifest(test_data_processed, test_manifest)

Finished processing manifest !
Finished processing manifest !
Finished writing manifest: /content/datasets/default/KasuleTrevor/cleaned-common-voice-1hr-split-sw/default/train/train_KasuleTrevor_cleaned-common-voice-1hr-split-sw_manifest_processed.json
Finished writing manifest: /content/datasets/default/KasuleTrevor/cleaned-common-voice-1hr-split-sw/default/test/test_KasuleTrevor_cleaned-common-voice-1hr-split-sw_manifest_processed.json


In [None]:
train_manifest_data = read_manifest(train_manifest_cleaned)
train_charset = get_charset(train_manifest_data)

# dev_manifest_data = read_manifest(dev_manifest_cleaned)
# dev_charset = get_charset(dev_manifest_data)

# train_dev_set = set.union(set(train_charset.keys()), set(dev_charset.keys()))
train_dev_set = set(train_charset.keys())

Computing character set:   0%|          | 0/686 [00:00<?, ?it/s]

In [None]:
print(f"Number of tokens in preprocessed train+dev set : {len(train_dev_set)}")

Number of tokens in preprocessed train+dev set : 27


# Sub-word Encoding CTC Model

Sub-word encoding models are nearly identical to character encoding models. The primary difference is that a sub-word encoding model accepts a sub-word tokenized text corpus and emits sub-word tokens in its decoding step. The following section details how we prepare a CTC model that uses a sub-word encoding scheme.

For this section, we will use the pre-trained Parakeet CTC 0.6B checkpoint (`nvidia/parakeet-ctc-0.6b`), an English speech recognition model, as the base model. We will modify the decoder layer (thereby changing the model's vocabulary) and then train for a small number of epochs.
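
To make the character versus sub-word distinction concrete, here is a toy byte-pair-encoding sketch in plain Python. It is an illustration only, not SentencePiece's implementation; `learn_bpe_merges`, `merge_pair` and `encode` are hypothetical helper names.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Greedily learn `num_merges` merges of the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Break ties on the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["kata", "kaka", "kati"], num_merges=1)
print(encode("kaka", merges))  # ['ka', 'ka']
```

With no merges the output is the character sequence itself; every learned merge shortens the target sequence the model has to emit.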

## Prepare Tokenizer

Before we update the vocabulary of the model, we first need to construct a tokenizer. NeMo supports both the WordPiece tokenizer (via HuggingFace) and the SentencePiece tokenizer (via the Google SentencePiece library). We will use the SentencePiece tokenizer in this tutorial.

-----
Preparation of the tokenizer is made simple by the `process_asr_text_tokenizer.py` script in NeMo. We will leverage this script to build the text corpus from the manifest directly, then create a tokenizer using that corpus.

**Note**: For languages with very large character sets, there is usually no significant benefit in constructing a sub-word vocabulary. In Natural Language Processing, vocabularies of 10,000+ tokens are common, but that is infeasible for CTC loss training of ASR models.

Here, we construct a sub-word tokenizer whose vocabulary size matches the character set computed above (27 symbols), plus the few tokens that SentencePiece requires to perform tokenization. You can experiment with the effect of larger vocabularies by editing `VOCAB_SIZE` below.

In [None]:
BRANCH = "r2.0.0rc0"
if not os.path.exists("scripts/process_asr_text_tokenizer.py"):
  !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py

--2024-06-19 08:50:56--  https://raw.githubusercontent.com/NVIDIA/NeMo/r2.0.0rc0/scripts/tokenizers/process_asr_text_tokenizer.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16631 (16K) [text/plain]
Saving to: ‘scripts/process_asr_text_tokenizer.py’


2024-06-19 08:50:57 (4.63 MB/s) - ‘scripts/process_asr_text_tokenizer.py’ saved [16631/16631]



In [None]:
#@title Tokenizer Config { display-mode: "form" }
TOKENIZER_TYPE = "bpe" #@param ["bpe", "unigram"]

In [None]:
# << VOCAB SIZE can be changed to any value larger than (len(train_dev_set) + 2)! >>
VOCAB_SIZE = len(train_dev_set) + 2

In [None]:
VOCAB_SIZE

29

In [None]:
tokenizer_dir = os.path.join('tokenizers', LANGUAGE)
tokenizer_dir

'tokenizers/default'

In [None]:
# IPython substitutes $name with the value of the Python variable before running the command,
# so the tokenizer follows VOCAB_SIZE, TOKENIZER_TYPE and tokenizer_dir defined above.
!python scripts/process_asr_text_tokenizer.py \
  --manifest="$train_manifest_cleaned" \
  --vocab_size=$VOCAB_SIZE \
  --data_root="$tokenizer_dir" \
  --tokenizer="spe" \
  --spe_type="$TOKENIZER_TYPE" \
  --spe_character_coverage=1.0 \
  --no_lower_case \
  --log

INFO:root:Finished extracting manifest : /content/datasets/default/KasuleTrevor/cleaned-common-voice-1hr-split-sw/default/train/train_KasuleTrevor_cleaned-common-voice-1hr-split-sw_manifest_processed.json
INFO:root:Finished extracting all manifests ! Number of sentences : 686
[NeMo I 2024-06-19 08:51:13 sentencepiece_tokenizer:317] Processing tokenizers/default/text_corpus/document.txt and store at tokenizers/default/tokenizer_spe_bpe_v29
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=tokenizers/default/text_corpus/document.txt --model_prefix=tokenizers/default/tokenizer_spe_bpe_v29/tokenizer --vocab_size=29 --shuffle_input_sentence=true --hard_vocab_limit=false --model_type=bpe --character_coverage=1.0 --bos_id=-1 --eos_id=-1
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: tokenizers/default/text_corpus/document.txt
  input_format: 
  model_prefix: tokenizers/default/tokenizer_spe_bpe_v29/tokenizer
  model_type: BPE
  vocab_size

In [None]:
train_manifest_cleaned

'/content/datasets/default/KasuleTrevor/cleaned-common-voice-1hr-split-sw/default/train/train_KasuleTrevor_cleaned-common-voice-1hr-split-sw_manifest_processed.json'

In [None]:
TOKENIZER_DIR = f"{tokenizer_dir}/tokenizer_spe_{TOKENIZER_TYPE}_v{VOCAB_SIZE}/"
print("Tokenizer directory :", TOKENIZER_DIR)

Tokenizer directory : tokenizers/default/tokenizer_spe_bpe_v29/


In [None]:
# Number of tokens in tokenizer -
with open(os.path.join(TOKENIZER_DIR, 'tokenizer.vocab')) as f:
  tokens = f.readlines()

num_tokens = len(tokens)
print("Number of tokens : ", num_tokens)

Number of tokens :  29
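
Beyond counting lines, it can help to look at the learned pieces themselves. SentencePiece writes `tokenizer.vocab` as one `piece<TAB>score` pair per line; the sketch below parses that layout. `read_spe_vocab` is a hypothetical helper, not a NeMo or SentencePiece function.

```python
def read_spe_vocab(path):
    """Parse a SentencePiece .vocab file into a list of (piece, score) tuples."""
    pieces = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line is "<piece>\t<score>"; partition keeps pieces that lack a score.
            piece, _, score = line.partition("\t")
            pieces.append((piece, float(score) if score else 0.0))
    return pieces
```

In this notebook it could be called as `read_spe_vocab(os.path.join(TOKENIZER_DIR, 'tokenizer.vocab'))`.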


In [None]:
if num_tokens < VOCAB_SIZE:
    print(
        f"The text in this dataset is too small to construct a tokenizer "
        f"with vocab size = {VOCAB_SIZE}. Current number of tokens = {num_tokens}. "
        f"Please reconstruct the tokenizer with fewer tokens"
    )

## Load pre-trained model

Here we load the pre-trained `nvidia/parakeet-ctc-0.6b` checkpoint. As the logs in this notebook show, it is restored as an `EncDecCTCModelBPE` with a 1024-token SentencePiece tokenizer, and its `ConformerEncoder` holds roughly 607M parameters.

In [None]:
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-0.6b", map_location='cpu')
# nvidia/parakeet-ctc-0.6b

    The secret `HF_TOKEN` does not exist in your Colab secrets.
    To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
    You will be able to reuse this secret in all of your notebooks.
    Please note that authentication is recommended but still optional to access public models or datasets.
    


parakeet-ctc-0.6b.nemo:   0%|          | 0.00/2.44G [00:00<?, ?B/s]

[NeMo I 2024-06-19 08:53:21 mixins:172] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-06-19 08:53:22 modelPT:176] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /disk1/NVIDIA/datasets/LibriSpeech_NeMo/librivox-train-all.json
    sample_rate: 16000
    batch_size: 16
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 16.7
    min_duration: 0.1
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    bucketing_strategy: fully_randomized
    bucketing_batch_size: null
    
[NeMo W 2024-06-19 08:53:22 modelPT:183] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /disk1/NVIDIA/datasets/LibriSpee

[NeMo I 2024-06-19 08:53:22 features:305] PADDING: 0
[NeMo I 2024-06-19 08:53:31 save_restore_connector:263] Model EncDecCTCModelBPE was successfully restored from /root/.cache/huggingface/hub/models--nvidia--parakeet-ctc-0.6b/snapshots/097ffc5b027beabc73acb627def2d1d278e774e9/parakeet-ctc-0.6b.nemo.


## Preserving decoder initialization for sub-word models

Subword tokenization has an interesting phenomenon. In many languages, the base character set is small enough that many sub-words can be computed to produce a finite-sized tokenizer vocabulary (the model above has a vocabulary size of 1024 Byte Pair subwords). When preparing the tokenizer on a fine-tuning corpus, it might be possible to once again prepare a tokenizer with exactly the same number of tokens (say 1024).

In such a case, the weight matrices of the decoder match exactly, and therefore the pre-trained weights of the original model can be loaded onto the new model! This is treated only as a good initialization, since further gradient updates will significantly change the alignments of the decoder. However, we find that such an initialization sometimes significantly improves word error rate and slightly improves convergence speed.

In [None]:
# Preserve the decoder parameters in case weight matching can be done later
pretrained_decoder = model.decoder.state_dict()

## Update the vocabulary

Changing the vocabulary of a sub-word encoding ASR model is as simple as passing the path of the tokenizer dir to `change_vocabulary()`.

In [None]:
model.change_vocabulary(new_tokenizer_dir=TOKENIZER_DIR, new_tokenizer_type="bpe")

[NeMo W 2024-06-19 08:53:31 modelPT:272] You tried to register an artifact under config key=tokenizer.model_path but an artifact for it has already been registered.
[NeMo W 2024-06-19 08:53:31 modelPT:272] You tried to register an artifact under config key=tokenizer.vocab_path but an artifact for it has already been registered.
[NeMo W 2024-06-19 08:53:31 modelPT:272] You tried to register an artifact under config key=tokenizer.spe_tokenizer_vocab but an artifact for it has already been registered.


[NeMo I 2024-06-19 08:53:31 mixins:172] Tokenizer SentencePieceTokenizer initialized with 29 tokens
[NeMo I 2024-06-19 08:53:32 ctc_bpe_models:265] 
    Replacing old number of classes (1024) with new number of classes - 29
[NeMo I 2024-06-19 08:53:32 ctc_bpe_models:307] Changed tokenizer to ['<unk>', '▁k', 'a', '▁', 'i', 'n', 'k', 'u', 'm', 'e', 'w', 'h', 'o', 't', 'l', 'y', 's', 'r', 'd', 'z', 'b', 'g', 'j', 'f', 'p', 'c', 'v', 'q', 'x'] vocabulary.


In [None]:
# Insert preserved model weights if shapes match
if model.decoder.decoder_layers[0].weight.shape == pretrained_decoder['decoder_layers.0.weight'].shape:
    model.decoder.load_state_dict(pretrained_decoder)
    logging.info("Decoder shapes matched - restored weights from pre-trained model")
else:
    logging.info("\nDecoder shapes did not match - could not restore decoder weights from pre-trained model.")

[NeMo I 2024-06-19 08:53:32 <ipython-input-49-a62237b63b26>:6] 
    Decoder shapes did not match - could not restore decoder weights from pre-trained model.
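
The mismatch is expected here: the pre-trained decoder was sized for 1024 sub-words, the new one for 29. A back-of-the-envelope check, assuming the CTC decoder is a single 1x1 convolution with bias from the encoder width to `vocab_size + 1` classes (the extra class being the CTC blank) and assuming an encoder width of 1024. Both are assumptions inferred from the 30.8 K decoder parameters reported in the training summary, not values read from the NeMo source; `ctc_decoder_param_count` is a hypothetical helper.

```python
def ctc_decoder_param_count(feat_in, vocab_size):
    """Parameters of a 1x1 conv decoder: weights plus biases for vocab_size + 1 classes."""
    num_classes = vocab_size + 1  # +1 for the CTC blank token (assumption)
    return feat_in * num_classes + num_classes


print(ctc_decoder_param_count(1024, 29))    # 30750
print(ctc_decoder_param_count(1024, 1024))  # 1050625
```

Under these assumptions the new decoder has 30,750 parameters, in line with the 30.8 K reported by the trainer, while the original 1024-token decoder would have about 1.05M, so the two weight tensors cannot be copied across.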


## Frozen Encoder - Unfrozen Batch Normalization

In [None]:
#@title Freeze Encoder { display-mode: "form" }
freeze_encoder = True #@param ["False", "True"] {type:"raw"}
freeze_encoder = bool(freeze_encoder)

In [None]:
import torch
import torch.nn as nn

def enable_bn_se(m):
    # Re-enable training for BatchNorm and Squeeze-Excite modules so that they
    # can adapt to the new data even though the rest of the encoder is frozen.
    if isinstance(m, nn.BatchNorm1d) or 'SqueezeExcite' in type(m).__name__:
        m.train()
        for param in m.parameters():
            param.requires_grad_(True)

In [None]:
if freeze_encoder:
  model.encoder.freeze()
  model.encoder.apply(enable_bn_se)
  logging.info("Model encoder has been frozen")
else:
  model.encoder.unfreeze()
  logging.info("Model encoder has been un-frozen")

[NeMo I 2024-06-19 08:53:32 <ipython-input-52-d6e18de51153>:4] Model encoder has been frozen


## Update config

Similar to the character encoding CTC model above, we will update the config for the sub-word encoding model.

It is primarily the data loaders that will be affected by the switch from character encoding to sub-word encoding.

In [None]:
cfg = copy.deepcopy(model.cfg)

### Setup tokenizer

This step is merely for demonstration - when we updated the tokenizer previously using `change_vocabulary()`, it internally performed this step as well.

In [None]:
# Setup new tokenizer
cfg.tokenizer.dir = TOKENIZER_DIR
cfg.tokenizer.type = "bpe"

# Set tokenizer config
model.cfg.tokenizer = cfg.tokenizer

### Setup data loaders

While significant sections remain the same between character-based and sub-word-based model configs - the data loaders are the main area where they diverge.

The sub-word encoding models do not require a `model.cfg.labels` section. In fact, their data loaders do not require `labels` at all! The labels are automatically extracted from the provided tokenizer, and the data loaders are updated implicitly.

In [None]:
# Setup train/val/test configs
print(OmegaConf.to_yaml(cfg.train_ds))

manifest_filepath: /disk1/NVIDIA/datasets/LibriSpeech_NeMo/librivox-train-all.json
sample_rate: 16000
batch_size: 16
shuffle: true
num_workers: 8
pin_memory: true
use_start_end_token: false
trim_silence: false
max_duration: 16.7
min_duration: 0.1
is_tarred: false
tarred_audio_filepaths: null
shuffle_n: 2048
bucketing_strategy: fully_randomized
bucketing_batch_size: null



In [None]:
# Setup train, validation, test configs
with open_dict(cfg):
  # Train dataset
  cfg.train_ds.manifest_filepath = f"{train_manifest_cleaned}"
  cfg.train_ds.batch_size = 8
  cfg.train_ds.num_workers = 0
  cfg.train_ds.pin_memory = True
  cfg.train_ds.use_start_end_token = True
  cfg.train_ds.trim_silence = True

  # Validation dataset
  cfg.validation_ds.manifest_filepath = test_manifest_cleaned
  cfg.validation_ds.batch_size = 8
  cfg.validation_ds.num_workers = 0
  cfg.validation_ds.pin_memory = True
  cfg.validation_ds.use_start_end_token = True
  cfg.validation_ds.trim_silence = True

  # Test dataset
  cfg.test_ds.manifest_filepath = test_manifest_cleaned
  cfg.test_ds.batch_size = 8
  cfg.test_ds.num_workers = 0
  cfg.test_ds.pin_memory = True
  cfg.test_ds.use_start_end_token = True
  cfg.test_ds.trim_silence = True

In [None]:
# setup model with new configs
model.setup_training_data(cfg.train_ds)
model.setup_multiple_validation_data(cfg.validation_ds)
model.setup_multiple_test_data(cfg.test_ds)

[NeMo I 2024-06-19 08:53:32 collections:196] Dataset loaded with 686 files totalling 1.00 hours
[NeMo I 2024-06-19 08:53:32 collections:197] 0 files were filtered totalling 0.00 hours
[NeMo I 2024-06-19 08:53:32 collections:196] Dataset loaded with 674 files totalling 1.00 hours
[NeMo I 2024-06-19 08:53:32 collections:197] 0 files were filtered totalling 0.00 hours
[NeMo I 2024-06-19 08:53:32 collections:196] Dataset loaded with 674 files totalling 1.00 hours
[NeMo I 2024-06-19 08:53:32 collections:197] 0 files were filtered totalling 0.00 hours


### Examine dataset outliers

In general, there are minor differences between the character encoding and sub-word encoding models. Since sub-words can encode a longer sequence of characters into a single token, they substantially reduce the target sequence length.

Encoders that downsample the input aggressively (Citrinet, for example, downsamples three times for a total of 8x) take advantage of this reduction. At this level of downsampling, it is possible to encounter a specific limitation of CTC loss.

-----

CTC loss works under the assumption that $T$ (the acoustic model's output sequence length) $> U$ (the target sequence length). If this criterion is violated, CTC loss is practically set to $\infty$ (which is then forced to $0$ by PyTorch's `zero_infinity` flag), and its gradient is set to 0.

Therefore it is essential to inspect the ratio of $\frac{T}{U}$ and ensure that it's reasonably close to 1 or higher.
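
The length condition can be sketched for a single sample as follows. Strictly speaking, CTC needs one output frame per target token plus one blank frame between every pair of identical neighbouring tokens, so the minimum $T$ can be slightly larger than $U$. `min_ctc_frames` and `is_ctc_feasible` are hypothetical helper names for illustration.

```python
def min_ctc_frames(target):
    """Fewest output frames with which CTC can emit `target` (a list of token ids)."""
    # Identical neighbours must be separated by a blank frame, or they collapse into one.
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def is_ctc_feasible(num_frames, target):
    """True if an output sequence of `num_frames` frames can align to `target`."""
    return num_frames >= min_ctc_frames(target)


print(min_ctc_frames([1, 2, 2, 3]))      # 5
print(is_ctc_feasible(4, [1, 2, 2, 3]))  # False
```

The check in the next cell compares only the two lengths, which is a slightly looser version of the same idea.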

In [None]:
def analyse_ctc_failures_in_model(model):
    count_ctc_failures = 0
    am_seq_lengths = []
    target_seq_lengths = []

    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model = model.to(device)
    mode = model.training

    train_dl = model.train_dataloader()

    with torch.no_grad():
      model = model.eval()
      for batch in tqdm(train_dl, desc='Checking for CTC failures'):
          x, x_len, y, y_len = batch
          x, x_len = x.to(device), x_len.to(device)
          x_logprobs, x_len, greedy_predictions = model(input_signal=x, input_signal_length=x_len)

          # Find how many CTC loss computation failures will occur
          for xl, yl in zip(x_len, y_len):
              if xl <= yl:
                  count_ctc_failures += 1

          # Record acoustic model lengths
          am_seq_lengths.extend(x_len.to('cpu').numpy().tolist())

          # Record target sequence lengths
          target_seq_lengths.extend(y_len.to('cpu').numpy().tolist())

          del x, x_len, y, y_len, x_logprobs, greedy_predictions

    if mode:
      model = model.train()

    return count_ctc_failures, am_seq_lengths, target_seq_lengths

In [None]:
results = analyse_ctc_failures_in_model(model)

Checking for CTC failures:   0%|          | 0/86 [00:00<?, ?it/s]

In [None]:
num_ctc_failures, am_seq_lengths, target_seq_lengths = results

In [None]:
if num_ctc_failures > 0:
  logging.warning(f"\nCTC loss will fail for {num_ctc_failures} samples ({num_ctc_failures * 100./ float(len(am_seq_lengths))} % of samples)!\n"
                  f"Increase the vocabulary size of the tokenizer so that this number becomes close to zero !")
else:
  logging.info("No CTC failure cases !")

[NeMo W 2024-06-19 08:54:05 <ipython-input-61-101d07b4ff93>:2] 
    CTC loss will fail for 294 samples (42.857142857142854 % of samples)!
    Increase the vocabulary size of the tokenizer so that this number becomes close to zero !


In [None]:
# Compute average ratio of T / U
avg_T = sum(am_seq_lengths) / float(len(am_seq_lengths))
avg_U = sum(target_seq_lengths) / float(len(target_seq_lengths))

avg_length_ratio = 0
for am_len, tgt_len in zip(am_seq_lengths, target_seq_lengths):
  avg_length_ratio += (am_len / float(tgt_len))
avg_length_ratio = avg_length_ratio / len(am_seq_lengths)

print(f"Average Acoustic model sequence length = {avg_T}")
print(f"Average Target sequence length = {avg_U}")
print()
print(f"Ratio of Average AM sequence length to target sequence length = {avg_length_ratio}")

Average Acoustic model sequence length = 56.475218658892125
Average Target sequence length = 52.559766763848394

Ratio of Average AM sequence length to target sequence length = 1.1229405264655743
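
A ratio this close to 1 leaves very little slack, which is consistent with the large share of failing samples reported above. A rough way to reason about it is to ask how many output frames the encoder produces per second of audio. The sketch below assumes a 10 ms feature hop and 8x encoder subsampling; both values are assumptions about this checkpoint rather than numbers taken from its config, and the helper names are hypothetical.

```python
def output_frames_per_second(hop_ms=10, subsampling=8):
    """Encoder output frames per second of audio (both defaults are assumptions)."""
    return 1000 / (hop_ms * subsampling)


def min_seconds_for_target(num_tokens, hop_ms=10, subsampling=8):
    """Shortest audio duration whose output is at least `num_tokens` frames long."""
    return num_tokens / output_frames_per_second(hop_ms, subsampling)


print(output_frames_per_second())  # 12.5
print(min_seconds_for_target(50))  # 4.0
```

Under these assumptions, a 50-token transcript needs at least 4 seconds of audio. A larger sub-word vocabulary shortens the target sequence instead, which is why the warning above recommends increasing the tokenizer's vocabulary size.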


### Setup optimizer and scheduler

Similar to the character encoding model, we adjust the learning rate and warmup for fine-tuning. The checkpoint's default optimizer config is printed first for reference.

In [None]:
print(OmegaConf.to_yaml(cfg.optim))

name: adamw
lr: 0.001
betas:
- 0.9
- 0.98
weight_decay: 0.001
sched:
  name: CosineAnnealing
  warmup_steps: 15000
  warmup_ratio: null
  min_lr: 0.0001



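Adjust the learning rate and warmup if required. Note that the value set below (`lr = 0.025`) is higher than the default `lr` of `0.001` printed above, so consider lowering it if training turns out to be unstable.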

Optimizer and scheduler will be automatically instantiated from this config during training.

In [None]:
with open_dict(model.cfg.optim):
  model.cfg.optim.lr = 0.025
  model.cfg.optim.weight_decay = 0.001
  model.cfg.optim.sched.warmup_steps = None  # Remove default number of steps of warmup
  model.cfg.optim.sched.warmup_ratio = 0.10  # 10 % warmup
  model.cfg.optim.sched.min_lr = 1e-9
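
To see what these settings imply, the sketch below approximates a cosine schedule with linear warmup. It is an illustration, not NeMo's `CosineAnnealing` implementation, which may differ in details; `lr_at_step` is a hypothetical helper and `max_steps=4300` is the value reported in the training log.

```python
import math


def lr_at_step(step, max_steps=4300, warmup_ratio=0.10, max_lr=0.025, min_lr=1e-9):
    """Approximate learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    warmup_steps = int(warmup_ratio * max_steps)
    if step < warmup_steps:
        # Linear ramp towards max_lr during warmup
        return max_lr * (step + 1) / (warmup_steps + 1)
    # Cosine decay from max_lr (progress = 0) down to min_lr (progress = 1)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 215, 430, 2365, 4300):
    print(step, lr_at_step(step))
```

With a 10 % warmup ratio and 4300 total steps, the peak learning rate is reached after 430 steps, that is, after the first 5 epochs of 86 steps each.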

In [None]:
### Setup data augmentation

# We also increase the SpecAugment masks to prevent overfitting (since it is a larger model).

with open_dict(model.cfg.spec_augment):
  model.cfg.spec_augment.freq_masks = 2
  model.cfg.spec_augment.freq_width = 25
  model.cfg.spec_augment.time_masks = 10
  model.cfg.spec_augment.time_width = 0.05

model.spec_augmentation = model.from_config_dict(model.cfg.spec_augment)

In [None]:
#@title Metric
use_cer = False #@param ["False", "True"] {type:"raw"}
log_prediction = True #@param ["False", "True"] {type:"raw"}

In [None]:
model.wer.use_cer = use_cer
model.wer.log_prediction = log_prediction
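
For reference, WER is the word-level edit distance between reference and prediction divided by the number of reference words, and CER is the same quantity computed over characters. A minimal pure-Python sketch (not NeMo's implementation; `edit_distance`, `wer` and `cer` are hypothetical helpers):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# A reference/prediction pair taken from the validation log of this run.
ref = "mwenyewe aliuawa kwa tendo hilo"
hyp = "maniewe ali wauwa katendo hilom"
print(wer(ref, hyp))  # 1.0
print(cer(ref, hyp))
```

This pair illustrates why WER can stay high while predictions already look roughly right: every word contains at least one error, so the WER is 1.0 even though most characters match. Setting `use_cer = True` reports the character-level metric instead.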

## Setup Trainer and Experiment Manager

And that's it! Now we can train the model by simply using the PyTorch Lightning Trainer and NeMo Experiment Manager as always.

For demonstration purposes, the number of epochs can be reduced. Around 100 epochs should give better results, at the cost of a longer run: in the logs below, successive validation passes (every 10 epochs) are about 11.5 minutes apart on a Colab GPU.

In [None]:
repo_name = "parakeet-0.6b-sw-1hr"
wandb.init(project="ASR Africa", name=repo_name)


from pytorch_lightning.loggers import WandbLogger
# Setup wandb logger
wandb_logger = WandbLogger(project="ASR Africa", name=repo_name, log_model=True)

[34m[1mwandb[0m: Currently logged in as: [33mkasulejohntrevor[0m ([33masr-africa-research-team[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
import torch
import pytorch_lightning as ptl

if torch.cuda.is_available():
  accelerator = 'gpu'
else:
  accelerator = 'cpu'

EPOCHS = 50  # 100 epochs would provide better results

trainer = ptl.Trainer(devices=1,
                      accelerator=accelerator,
                      max_epochs=EPOCHS,
                      accumulate_grad_batches=1,
                      enable_checkpointing=False,
                      logger=False,
                      log_every_n_steps=30,
                      check_val_every_n_epoch=10)

# Setup model with the trainer
model.set_trainer(trainer)

# finally, update the model's internal config
model.cfg = model._cfg

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
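
The trainer settings determine the step counts that appear in the training logs. A small arithmetic sketch, using the dataset size (686 training files) and batch size (8) from the cells above; `training_schedule` is a hypothetical helper.

```python
import math


def training_schedule(num_samples, batch_size, epochs, val_every_n_epochs):
    """Return steps per epoch, total steps, and the global steps at which validation runs."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    val_steps = [steps_per_epoch * epoch
                 for epoch in range(val_every_n_epochs, epochs + 1, val_every_n_epochs)]
    return steps_per_epoch, total_steps, val_steps


print(training_schedule(686, 8, 50, 10))
# (86, 4300, [860, 1720, 2580, 3440, 4300])
```

These values match the logs of this run: 86 batches per pass over the training set, an effective maximum of 4300 scheduler steps, and checkpoints reported at global steps 860 and 1720.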


In [None]:
from nemo.utils import exp_manager
from nemo.utils.exp_manager import ExpManagerConfig, CallbackParams

# Setup NeMo's exp_manager configuration
config = ExpManagerConfig(
    exp_dir=f'experiments/lang-sw/',  # Adjust the path as needed
    name=repo_name,
    create_wandb_logger=True,
    wandb_logger_kwargs={
        "project": "ASR Africa",
        "name": repo_name
    },
    checkpoint_callback_params=CallbackParams(
        monitor="val_wer",
        mode="min",
        always_save_nemo=False,
        save_best_model=True,
    ),
)

config = OmegaConf.structured(config)

logdir = exp_manager.exp_manager(trainer, config)

[NeMo I 2024-06-19 08:54:12 exp_manager:396] Experiments will be logged at experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12
[NeMo I 2024-06-19 08:54:12 exp_manager:856] TensorboardLogger has been set up
[NeMo I 2024-06-19 08:54:12 exp_manager:871] WandBLogger has been set up


In [None]:
%%time
trainer.fit(model)

[NeMo W 2024-06-19 08:54:18 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loggers/wandb.py:396: There is a wandb run already in progress and newly created instances of `WandbLogger` will reuse this run. If this is not desired, call `wandb.finish()` before instantiating `WandbLogger`.
    
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2024-06-19 08:54:18 modelPT:770] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.98]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 0.025
        maximize: False
        weight_decay: 0.001
    )
[NeMo I 2024-06-19 08:54:18 lr_scheduler:923] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7d4f4c0d66e0>" 
    will be used during training (effective maximum steps = 4300) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: 0.1
    min_lr: 1.0e-09
    max_steps: 4300
    )


INFO:pytorch_lightning.callbacks.model_summary:
  | Name              | Type                              | Params | Mode 
--------------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0      | train
1 | encoder           | ConformerEncoder                  | 607 M  | train
2 | spec_augmentation | SpectrogramAugmentation           | 0      | train
3 | wer               | WER                               | 0      | train
4 | decoder           | ConvASRDecoder                    | 30.8 K | train
5 | loss              | CTCLoss                           | 0      | train
--------------------------------------------------------------------------------
79.9 K    Trainable params
607 M     Non-trainable params
607 M     Total params
2,431.119 Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

[NeMo W 2024-06-19 08:54:19 ctc_greedy_decoding:168] CTC decoding strategy 'greedy' is slower than 'greedy_batch', which implements the same exact interface. Consider changing your strategy to 'greedy_batch' for a free performance improvement.


[NeMo I 2024-06-19 08:54:19 wer:334] 
    
[NeMo I 2024-06-19 08:54:19 wer:335] reference:wanatokea katika afrika kusini kwa sahara tu mara nyingi milimani
[NeMo I 2024-06-19 08:54:19 wer:336] predicted:o kfvnxnoioinqroyoixnrim ⁇ er ⁇ trtomozd kt ko kovxf ⁇ o k ⁇ or cmog
[NeMo I 2024-06-19 08:54:19 wer:334] 
    
[NeMo I 2024-06-19 08:54:19 wer:335] reference:maziwa makubwa mengine upande wa magharibi ni ziwa albert na ziwa edward
[NeMo I 2024-06-19 08:54:19 wer:336] predicted:ov kji ⁇ eoj ⁇ df ⁇ jmznvzr kjfedjoji ⁇ eojoengvjiqeoto ⁇ cro


Training: |          | 0/? [00:00<?, ?it/s]

[NeMo I 2024-06-19 08:54:19 preemption:56] Preemption requires torch distributed to be initialized, disabling preemption
[NeMo I 2024-06-19 08:54:37 wer:334] 
    
[NeMo I 2024-06-19 08:54:37 wer:335] reference:lugha rasmi na ya kawaida ni kifaransa
[NeMo I 2024-06-19 08:54:37 wer:336] predicted:aa a ai a a
[NeMo I 2024-06-19 08:54:53 wer:334] 
    
[NeMo I 2024-06-19 08:54:53 wer:335] reference:ni moja kati ya majimbo kumi na mawili ya uchaguzi katika kaunti ya kiambu
[NeMo I 2024-06-19 08:54:53 wer:336] predicted:na aiaum inamlai u ai a u 
[NeMo I 2024-06-19 08:55:27 wer:334] 
    
[NeMo I 2024-06-19 08:55:27 wer:335] reference:uwanja huu unapatikana huko cape town nchini afrika kusini
[NeMo I 2024-06-19 08:55:27 wer:336] predicted:a anau auo kat ni yia i
[NeMo I 2024-06-19 08:55:46 wer:334] 
    
[NeMo I 2024-06-19 08:55:46 wer:335] reference:mwaka huohuo halmashauri mpya ilitangaza katiba mpya na kuondoa urais wa maisha
[NeMo I 2024-06-19 08:55:46 wer:336] predicted:a hu ah ieltna 

Validation: |          | 0/? [00:00<?, ?it/s]

[NeMo I 2024-06-19 09:03:10 wer:334] 
    
[NeMo I 2024-06-19 09:03:10 wer:335] reference:wanatokea katika afrika kusini kwa sahara tu mara nyingi milimani
[NeMo I 2024-06-19 09:03:10 wer:336] predicted:wana tokea katika afrika kusini kwa saharatu mara ningi imani
[NeMo I 2024-06-19 09:03:10 wer:334] 
    
[NeMo I 2024-06-19 09:03:10 wer:335] reference:maziwa makubwa mengine upande wa magharibi ni ziwa albert na ziwa edward
[NeMo I 2024-06-19 09:03:10 wer:336] predicted:havasi wakuba anin pand amagareibi niw apanaiwa awada
[NeMo I 2024-06-19 09:03:11 wer:334] 
    
[NeMo I 2024-06-19 09:03:11 wer:335] reference:mwenyewe aliuawa kwa tendo hilo
[NeMo I 2024-06-19 09:03:11 wer:336] predicted:maniewe ali wauwa katendo hilom
[NeMo I 2024-06-19 09:03:11 wer:334] 
    
[NeMo I 2024-06-19 09:03:11 wer:335] reference:idadi ya watu ni waisilamu lakini pia ina wakatoliki na wahuishaji
[NeMo I 2024-06-19 09:03:11 wer:336] predicted:i di ya atonima slamulaini pia ia atliki na wa hsajin
[NeMo I 2024

INFO:pytorch_lightning.utilities.rank_zero:Epoch 9, global step 860: 'val_wer' reached 0.95823 (best 0.95823), saving model to '/content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr--val_wer=0.9582-epoch=9.ckpt' as top 3


[NeMo I 2024-06-19 09:05:03 nemo_model_checkpoint:219] New best .nemo model saved to: /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo
[NeMo I 2024-06-19 09:06:04 wer:334] 
    
[NeMo I 2024-06-19 09:06:04 wer:335] reference:vyakula vilivyo na mafuta mengi ni kwa mfano
[NeMo I 2024-06-19 09:06:04 wer:336] predicted:zia kula nia nafata menini ueou ane
[NeMo I 2024-06-19 09:06:24 wer:334] 
    
[NeMo I 2024-06-19 09:06:24 wer:335] reference:anasema kuwa karibu kila mji na jiji lina angalau sehemu moja iliyoshangiliwa
[NeMo I 2024-06-19 09:06:24 wer:336] predicted:msmumaiumjinjiena imuj isangini
[NeMo I 2024-06-19 09:07:00 wer:334] 
    
[NeMo I 2024-06-19 09:07:00 wer:335] reference:alirudi algeria miezi michache kabla ya uvamizi wa kifaransa
[NeMo I 2024-06-19 09:07:00 wer:336] predicted:alivrui ajeraizimchche kabla wuwazwa kifasa
[NeMo I 2024-06-19 09:07:18 wer:334] 
    
[NeMo I 2024-06-19 09:07:18 wer:335] reference:hata hivy

Validation: |          | 0/? [00:00<?, ?it/s]

[NeMo I 2024-06-19 09:14:44 wer:334] 
    
[NeMo I 2024-06-19 09:14:44 wer:335] reference:wanatokea katika afrika kusini kwa sahara tu mara nyingi milimani
[NeMo I 2024-06-19 09:14:44 wer:336] predicted:wana toki a katika afrka kusini kwaaharatubara ningiwi mani
[NeMo I 2024-06-19 09:14:44 wer:334] 
    
[NeMo I 2024-06-19 09:14:44 wer:335] reference:maziwa makubwa mengine upande wa magharibi ni ziwa albert na ziwa edward
[NeMo I 2024-06-19 09:14:44 wer:336] predicted:iai a kuamaigi apada a magribi niw apa naiwa arwadi
[NeMo I 2024-06-19 09:14:45 wer:334] 
    
[NeMo I 2024-06-19 09:14:45 wer:335] reference:mwenyewe aliuawa kwa tendo hilo
[NeMo I 2024-06-19 09:14:45 wer:336] predicted:manjya yali wawawatindohio
[NeMo I 2024-06-19 09:14:45 wer:334] 
    
[NeMo I 2024-06-19 09:14:45 wer:335] reference:idadi ya watu ni waisilamu lakini pia ina wakatoliki na wahuishaji
[NeMo I 2024-06-19 09:14:45 wer:336] predicted:idadi ya atinimaislang lakini pia ina watodiki na wahihadi

INFO:pytorch_lightning.utilities.rank_zero:Epoch 19, global step 1720: 'val_wer' reached 0.95118 (best 0.95118), saving model to '/content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr--val_wer=0.9512-epoch=19.ckpt' as top 3


[NeMo I 2024-06-19 09:15:43 nemo_model_checkpoint:299] /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo already exists, moving existing checkpoint to /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo
[NeMo I 2024-06-19 09:16:35 nemo_model_checkpoint:219] New best .nemo model saved to: /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo
[NeMo I 2024-06-19 09:16:35 nemo_model_checkpoint:228] Removing old .nemo backup /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo
[NeMo I 2024-06-19 09:17:38 wer:334] 
    
[NeMo I 2024-06-19 09:17:38 wer:335] reference:makao makuu yalikuwa mjini nyeri
[NeMo I 2024-06-19 09:17:38 wer:336] predicted:na kpuomak kkuiyalikwambaenaori
[NeMo I 2024-06-19 09:17:59 wer:334] 
    
[NeMo I 2024-06-19 09:17:59 wer:335] reference:la

Validation: |          | 0/? [00:00<?, ?it/s]

[NeMo I 2024-06-19 09:26:18 wer:334] 
    
[NeMo I 2024-06-19 09:26:18 wer:335] reference:wanatokea katika afrika kusini kwa sahara tu mara nyingi milimani
[NeMo I 2024-06-19 09:26:18 wer:336] predicted:wana tokea katika afrika kusini kwa saharatumara ningiilimani
[NeMo I 2024-06-19 09:26:18 wer:334] 
    
[NeMo I 2024-06-19 09:26:18 wer:335] reference:maziwa makubwa mengine upande wa magharibi ni ziwa albert na ziwa edward
[NeMo I 2024-06-19 09:26:18 wer:336] predicted:asi amubaingin upad wa magribi ne ziwapa naiwa a wata
[NeMo I 2024-06-19 09:26:18 wer:334] 
    
[NeMo I 2024-06-19 09:26:18 wer:335] reference:mwenyewe aliuawa kwa tendo hilo
[NeMo I 2024-06-19 09:26:18 wer:336] predicted:manya aliua wa kwatedo ilo
[NeMo I 2024-06-19 09:26:19 wer:334] 
    
[NeMo I 2024-06-19 09:26:19 wer:335] reference:idadi ya watu ni waisilamu lakini pia ina wakatoliki na wahuishaji
[NeMo I 2024-06-19 09:26:19 wer:336] predicted:di ya watunimwaislamu lakini ia ina watudliki na wahuichaji

INFO:pytorch_lightning.utilities.rank_zero:Epoch 29, global step 2580: 'val_wer' reached 0.91817 (best 0.91817), saving model to '/content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr--val_wer=0.9182-epoch=29.ckpt' as top 3


[NeMo I 2024-06-19 09:28:49 nemo_model_checkpoint:299] /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo already exists, moving existing checkpoint to /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo
[NeMo I 2024-06-19 09:30:17 nemo_model_checkpoint:219] New best .nemo model saved to: /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo
[NeMo I 2024-06-19 09:30:17 nemo_model_checkpoint:228] Removing old .nemo backup /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo
[NeMo I 2024-06-19 09:32:11 wer:334] 
    
[NeMo I 2024-06-19 09:32:11 wer:335] reference:kuzingatia mtindo bora kunaweza kupunguza hatari kwa zaidi ya nusu
[NeMo I 2024-06-19 09:32:11 wer:336] predicted:kwzya tio urauiez ufumua achari kkazainana
[NeMo I 2024-06-19 09:32:31 wer:334] 
    

Validation: |          | 0/? [00:00<?, ?it/s]

[NeMo I 2024-06-19 09:40:53 wer:334] 
    
[NeMo I 2024-06-19 09:40:53 wer:335] reference:wanatokea katika afrika kusini kwa sahara tu mara nyingi milimani
[NeMo I 2024-06-19 09:40:53 wer:336] predicted:wana tokea katika afrika kusini kwa sahara tuo mra nini milimani 
[NeMo I 2024-06-19 09:40:53 wer:334] 
    
[NeMo I 2024-06-19 09:40:53 wer:335] reference:maziwa makubwa mengine upande wa magharibi ni ziwa albert na ziwa edward
[NeMo I 2024-06-19 09:40:53 wer:336] predicted:maiwam ku aengin pandewa magribi ne iw apa naziwa a wato
[NeMo I 2024-06-19 09:40:53 wer:334] 
    
[NeMo I 2024-06-19 09:40:53 wer:335] reference:mwenyewe aliuawa kwa tendo hilo
[NeMo I 2024-06-19 09:40:53 wer:336] predicted:mnyaewa aliwa wa kwa tendo hilo
[NeMo I 2024-06-19 09:40:54 wer:334] 
    
[NeMo I 2024-06-19 09:40:54 wer:335] reference:idadi ya watu ni waisilamu lakini pia ina wakatoliki na wahuishaji
[NeMo I 2024-06-19 09:40:54 wer:336] predicted:idi ya wato ni wa islamu lakinipia ina wa katoliki na wahui

INFO:pytorch_lightning.utilities.rank_zero:Epoch 39, global step 3440: 'val_wer' reached 0.86849 (best 0.86849), saving model to '/content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr--val_wer=0.8685-epoch=39.ckpt' as top 3


[NeMo I 2024-06-19 09:42:41 nemo_model_checkpoint:299] /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo already exists, moving existing checkpoint to /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo
[NeMo I 2024-06-19 09:43:49 nemo_model_checkpoint:219] New best .nemo model saved to: /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo
[NeMo I 2024-06-19 09:43:49 nemo_model_checkpoint:228] Removing old .nemo backup /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo
[NeMo I 2024-06-19 09:45:37 wer:334] 
    
[NeMo I 2024-06-19 09:45:37 wer:335] reference:yenyewe hayana rangi wala ladha au harufu
[NeMo I 2024-06-19 09:45:37 wer:336] predicted:ihya ana rangi wada a a ru
[NeMo I 2024-06-19 09:45:58 wer:334] 
    
[NeMo I 2024-06-19 09:45:58 wer:335] referenc

Validation: |          | 0/? [00:00<?, ?it/s]

[NeMo I 2024-06-19 09:54:17 wer:334] 
    
[NeMo I 2024-06-19 09:54:17 wer:335] reference:wanatokea katika afrika kusini kwa sahara tu mara nyingi milimani
[NeMo I 2024-06-19 09:54:17 wer:336] predicted:wana tokea katika afrika kusini kwa sahara tumra ninii milimani 
[NeMo I 2024-06-19 09:54:17 wer:334] 
    
[NeMo I 2024-06-19 09:54:17 wer:335] reference:maziwa makubwa mengine upande wa magharibi ni ziwa albert na ziwa edward
[NeMo I 2024-06-19 09:54:17 wer:336] predicted:maiwamkua mengin pande wamagarib ni ziwapa naziwa ai wado
[NeMo I 2024-06-19 09:54:18 wer:334] 
    
[NeMo I 2024-06-19 09:54:18 wer:335] reference:mwenyewe aliuawa kwa tendo hilo
[NeMo I 2024-06-19 09:54:18 wer:336] predicted:mnyaewa aliwa wa kwa tedohilo
[NeMo I 2024-06-19 09:54:18 wer:334] 
    
[NeMo I 2024-06-19 09:54:18 wer:335] reference:idadi ya watu ni waisilamu lakini pia ina wakatoliki na wahuishaji
[NeMo I 2024-06-19 09:54:18 wer:336] predicted:idi ya wato ni maislamulaki ia ina wa katoliki na wahishaji

INFO:pytorch_lightning.utilities.rank_zero:Epoch 49, global step 4300: 'val_wer' reached 0.86574 (best 0.86574), saving model to '/content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr--val_wer=0.8657-epoch=49.ckpt' as top 3


[NeMo I 2024-06-19 09:56:42 nemo_model_checkpoint:299] /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo already exists, moving existing checkpoint to /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo
[NeMo I 2024-06-19 09:58:12 nemo_model_checkpoint:219] New best .nemo model saved to: /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo
[NeMo I 2024-06-19 09:58:12 nemo_model_checkpoint:228] Removing old .nemo backup /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=50` reached.
INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr--val_wer=0.8657-epoch=49.ckpt
INFO:pytorch_lightning.utilities.rank_zero:Restored all states from the checkpoint at /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr--val_wer=0.8657-epoch=49.ckpt


[NeMo I 2024-06-19 10:00:30 nemo_model_checkpoint:299] /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr.nemo already exists, moving existing checkpoint to /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo
[NeMo I 2024-06-19 10:01:17 nemo_model_checkpoint:271] Removing old .nemo backup /content/experiments/lang-sw/parakeet-0.6b-sw-1hr/2024-06-19_08-54-12/checkpoints/parakeet-0.6b-sw-1hr-v1.nemo
CPU times: user 42min 55s, sys: 6min 32s, total: 49min 28s
Wall time: 1h 7min 5s
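
The `val_wer` values logged above are word error rates: the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. As a minimal sketch (a plain-Python re-implementation for illustration, not NeMo's own metric code; the function name `word_error_rate` is hypothetical), one of the reference/predicted pairs logged at 09:40:53 can be scored like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Pair copied from the validation log above.
reference = "mwenyewe aliuawa kwa tendo hilo"
predicted = "mnyaewa aliwa wa kwa tendo hilo"
print(f"WER = {word_error_rate(reference, predicted):.2f}")
```

For this pair, two substitutions and one insertion against five reference words give 3/5 = 0.6. The logged `val_wer` is computed over the whole validation set rather than over a single utterance.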


In [None]:
# Finish the run
wandb.finish()


metric,history
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
learning_rate,▂▃▅▇███████▇▇▇▇▆▆▆▆▅▅▅▄▄▄▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁
train_backward_timing in s,▂▂▂▄▃▅▅▄▄▂▂▆▄▃▁▅▂▄█▃▂▅▆▄▄▅▂▂▄▅▆▃▆▄▇▄▆▂▃█
train_loss,▃█▄▄▃▅▆▃▂▄▅▄▃▃▅▄▃▃▃▂▃▄▆▂▃▃▃▂▃▂▄▃▄▃▃▃▁▂▁▅
train_step_timing in s,▂▂▃▄▂▄▅▄▅▂▂▆▅▃▁▄▂▄█▂▂▅▆▃▄▄▂▂▄▄▆▂▅▄█▄▆▂▃█
trainer/global_step,▁▂▁▁▁▁▁▁▃▄▁▁▁▁▁▁▄▅▁▁▁▁▁▁▆▆▁▁▁▂▂▂▇█▂▂▂▂▂▂
training_batch_wer,▅▄▄▄▄▃▄▄▄▄█▄▄▄▄▄▄▄▃▄▂▃▂▃▃▃▂▃▄▁▁▃▁▃▃▂▁▃▂▃
val_loss,██▃▂▁
val_wer,█▇▅▁▁

metric,final value
epoch,49.0
global_step,4300.0
learning_rate,0.0
train_backward_timing in s,0.10896
train_loss,1.35328
train_step_timing in s,0.47262
trainer/global_step,4299.0
training_batch_wer,0.96923
val_loss,1.15724
val_wer,0.86574
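
The final `val_wer` of 0.86574 is easier to judge next to the earlier checkpoints. The following is a small sketch, assuming the Lightning checkpoint messages have been copied out of the training log (shortened here to the part that matters); it extracts the logged values and computes the overall reduction. The variable names are illustrative only.

```python
import re

# Shortened copies of the Lightning checkpoint messages from the training log.
log_lines = [
    "Epoch 9, global step 860: 'val_wer' reached 0.95823 (best 0.95823)",
    "Epoch 19, global step 1720: 'val_wer' reached 0.95118 (best 0.95118)",
    "Epoch 29, global step 2580: 'val_wer' reached 0.91817 (best 0.91817)",
    "Epoch 39, global step 3440: 'val_wer' reached 0.86849 (best 0.86849)",
    "Epoch 49, global step 4300: 'val_wer' reached 0.86574 (best 0.86574)",
]

pattern = re.compile(r"Epoch (\d+), global step (\d+): 'val_wer' reached ([0-9.]+)")

# Collect (epoch, global_step, val_wer) for every matching line.
history = []
for line in log_lines:
    match = pattern.search(line)
    if match:
        history.append((int(match.group(1)), int(match.group(2)), float(match.group(3))))

first_wer, last_wer = history[0][2], history[-1][2]
absolute_drop = first_wer - last_wer
relative_drop = absolute_drop / first_wer

for epoch, step, wer in history:
    print(f"epoch {epoch:>2}  step {step:>4}  val_wer {wer:.5f}")
print(f"absolute reduction: {absolute_drop:.5f}")
print(f"relative reduction: {relative_drop:.1%}")
```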


In [None]:
import os

LANGUAGE = "sw"
save_path = f"Model-{LANGUAGE}-1hr.nemo"
# Package the trained model (config + weights) into a single .nemo file.
model.save_to(save_path)
print(f"Model saved at path : {os.path.join(os.getcwd(), save_path)}")

Model saved at path : /content/Model-sw-1hr.nemo
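
A `.nemo` file is a tar archive that bundles the model configuration with the weights, so the export can be sanity-checked without loading NeMo. The helper below is a sketch using only the standard library; `list_nemo_contents` is a hypothetical name, and the exact member names inside the archive depend on the NeMo version.

```python
import os
import tarfile


def list_nemo_contents(nemo_path: str) -> list[str]:
    """Return the member names stored in a .nemo archive."""
    # "r:*" lets tarfile auto-detect whether the archive is compressed.
    with tarfile.open(nemo_path, mode="r:*") as archive:
        return archive.getnames()


# Path taken from the cell output above; skipped if the file is not present.
nemo_path = "Model-sw-1hr.nemo"
if os.path.exists(nemo_path):
    for name in list_nemo_contents(nemo_path):
        print(name)
```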


In [None]:
from huggingface_hub import HfApi, login, Repository

In [None]:
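LANGUAGE = "sw"
save_path = f"Model-{LANGUAGE}-1hr.nemo"  # local checkpoint written by model.save_to
# `repo_name` is expected to be defined in an earlier cell of this notebook.
username = "KasuleTrevor"
repo_id = f"{username}/{repo_name}"  # "<user>/<repo>" identifier on the Hub
repo_dir = f"./{repo_name}"          # local working directory for the repo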