## For CUDA OOM

In [1]:
import torch
torch.cuda.empty_cache()

# Transfer Learning

Transfer learning is an important machine learning technique that uses a model's knowledge of one task to make it perform better on another. Fine-tuning is one of the techniques to perform transfer learning. It is an essential part of the recipe for many state-of-the-art results where a base model is first pretrained on a task with abundant training data and then fine-tuned on different tasks of interest where the training data is less abundant or even scarce.

In [2]:
# NeMo's "core" package
import nemo
# NeMo's ASR collection - this collection contains complete ASR models and
# building blocks (modules) for ASR
import nemo.collections.asr as nemo_asr

[NeMo W 2021-08-09 23:59:25 optimizers:47] Apex was not found. Using the lamb optimizer will error out.
[NeMo W 2021-08-09 23:59:27 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.


# Instantiate a pre-trained NeMo model

The `from_pretrained(...)` API downloads and initializes a model directly from the cloud.

Alternatively, `restore_from(...)` loads a model from disk.

To list the pre-trained models available in the cloud, use the `list_available_models()` method.

In [3]:
nemo_asr.models.EncDecCTCModel.list_available_models()

[PretrainedModelInfo(
 	pretrained_model_name=QuartzNet15x5Base-En,
 	description=QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other. Please visit https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels for further details.,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo
 ),
 PretrainedModelInfo(
 	pretrained_model_name=stt_en_quartznet15x5,
 	description=For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_quartznet15x5,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_quartznet15x5/versions/1.0.0rc1/files/stt_en_quartznet15x5.nemo
 ),
 PretrainedModelInfo(
 	pre

## Let's load a base English QuartzNet15x5 model

In [4]:
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='QuartzNet15x5Base-En')

[NeMo I 2021-08-09 11:56:53 cloud:56] Found existing object /home/hood/.cache/torch/NeMo/NeMo_1.2.0/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.
[NeMo I 2021-08-09 11:56:53 cloud:62] Re-using file from: /home/hood/.cache/torch/NeMo/NeMo_1.2.0/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo
[NeMo I 2021-08-09 11:56:53 common:676] Instantiating model from pre-trained checkpoint
[NeMo I 2021-08-09 11:56:53 features:252] PADDING: 16
[NeMo I 2021-08-09 11:56:53 features:269] STFT using torch
[NeMo I 2021-08-09 11:56:56 modelPT:438] Model EncDecCTCModel was successfully restored from /home/hood/.cache/torch/NeMo/NeMo_1.2.0/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.


## Restore the model

In [3]:
asr_model = nemo_asr.models.EncDecCTCModel.restore_from("./asr_quartznet_5epochs_finetune.nemo")

[NeMo W 2021-08-09 23:59:33 modelPT:138] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/dl_old_train_resample/dl_old_train_resample.json
    sample_rate: 16000
    labels: /home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt
    batch_size: 4
    trim_silence: true
    max_duration: 50.9
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    
[NeMo W 2021-08-09 23:59:33 modelPT:145] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /home/hood/KK/MediaAnalysis/ASR

[NeMo I 2021-08-09 23:59:33 features:252] PADDING: 16
[NeMo I 2021-08-09 23:59:33 features:269] STFT using torch
[NeMo I 2021-08-09 23:59:35 modelPT:438] Model EncDecCTCModel was successfully restored from ./asr_quartznet_5epochs_finetune.nemo.


### Specifying Our Model with a YAML Config File

In [4]:
# --- Config Information ---#
try:
    from ruamel.yaml import YAML
except ModuleNotFoundError:
    from ruamel_yaml import YAML
config_path = '/home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/examples/asr/conf/quartznet/quartznet_15x5.yaml'

yaml = YAML(typ='safe')
with open(config_path) as f:
    params = yaml.load(f)
print(params)

{'name': 'QuartzNet15x5', 'model': {'sample_rate': 16000, 'repeat': 5, 'dropout': 0.0, 'separable': True, 'labels': '/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt', 'train_ds': {'manifest_filepath': '???', 'sample_rate': 16000, 'labels': '/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt', 'batch_size': 32, 'trim_silence': True, 'max_duration': 16.7, 'shuffle': True, 'is_tarred': False, 'tarred_audio_filepaths': None, 'tarred_shard_strategy': 'scatter'}, 'validation_ds': {'manifest_filepath': '???', 'sample_rate': 16000, 'labels': '/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt', 'batch_size': 32, 'shuffle': False}, 'test_ds': {'manifest_filepath': None, 'sample_rate': 16000, 'labels': '/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt', 'batch_size':
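The printed config still contains `'???'` for both `manifest_filepath` entries. That string is the marker OmegaConf uses for a mandatory value that has not been filled in. A minimal sketch of how one could list every such placeholder before training; the helper `find_missing` and the cut-down `example_params` dict are hypothetical and only stand in for the real `params`:

```python
from typing import Any, Iterator

MISSING = "???"  # OmegaConf's marker for a mandatory value that is not set yet


def find_missing(node: Any, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every value equal to '???' in a nested dict/list."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_missing(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_missing(value, f"{prefix}[{index}]")
    elif node == MISSING:
        yield prefix


# Hypothetical, cut-down stand-in for the `params` dict printed above.
example_params = {
    "model": {
        "train_ds": {"manifest_filepath": "???", "batch_size": 32},
        "validation_ds": {"manifest_filepath": "???", "batch_size": 32},
        "test_ds": {"manifest_filepath": None},
    }
}

missing_paths = list(find_missing(example_params))
print(missing_paths)
# ['model.train_ds.manifest_filepath', 'model.validation_ds.manifest_filepath']
```

Running the same helper on the real `params` should show which entries need to be filled in before the data loaders are set up.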

### Training with PyTorch Lightning
NeMo models and modules can be used in any PyTorch code where torch.nn.Module is expected.

However, NeMo's models are based on PyTorch Lightning's LightningModule, and we recommend using PyTorch Lightning for training and fine-tuning because it makes mixed precision and distributed training very easy. So to start, let's create a Trainer instance for training on one GPU for 20 epochs with 16-bit mixed precision.

In [5]:
import pytorch_lightning as pl

# trainer = pl.Trainer(gpus=1, max_epochs=2)

# for fast training
trainer = pl.Trainer(gpus=1, max_epochs=20, amp_level='O1', precision=16)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.


Next, we prepare the configuration for our ASR model based on the YAML file. Note that this is the stage at which we also specify where our training and validation manifests are.

In [6]:
from omegaconf import DictConfig

In [7]:
# Update data input path
params['model']['train_ds']['manifest_filepath'] = "/home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/dl_old_train_resample/dl_old_train_resample.json"
params['model']['validation_ds']['manifest_filepath'] = "/home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/dl_old_train_resample/dl_old_train_resample.json"

# Update data batch_size
params['model']['train_ds']['batch_size'] = 4
params['model']['validation_ds']['batch_size'] = 4

# Update data sample_rate
# params['model']['train_ds']['sample_rate'] = 16000

# Update max duration
params['model']['train_ds']['max_duration'] = 50.9
params['model']['validation_ds']['max_duration'] = 50.9
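The manifests referenced above are JSON-lines files: one JSON object per utterance with `audio_filepath`, `duration` (seconds) and `text` keys. A rough sketch of how `max_duration` interacts with such a file, assuming utterances longer than the limit are dropped; the helper `summarize_manifest` and the three sample entries are hypothetical:

```python
import json
import os
import tempfile


def summarize_manifest(path, max_duration=None):
    """Return (kept_files, kept_hours, dropped_files) for a JSON-lines manifest."""
    kept, kept_seconds, dropped = 0, 0.0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            if max_duration is not None and entry["duration"] > max_duration:
                dropped += 1
                continue
            kept += 1
            kept_seconds += entry["duration"]
    return kept, kept_seconds / 3600.0, dropped


# Hypothetical three-utterance manifest, written to a temp file for illustration.
entries = [
    {"audio_filepath": "a.wav", "duration": 12.0, "text": "first"},
    {"audio_filepath": "b.wav", "duration": 48.0, "text": "second"},
    {"audio_filepath": "c.wav", "duration": 60.0, "text": "third"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    for entry in entries:
        tmp.write(json.dumps(entry) + "\n")
    manifest_path = tmp.name

summary = summarize_manifest(manifest_path, max_duration=50.9)
os.remove(manifest_path)
print(summary)  # 2 files kept, 1 dropped (the 60-second utterance)
```

A check like this on the real manifest helps explain the "files were filtered" line that NeMo logs when the data loader is created.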



Let's say we wanted to change the learning rate. To do so, we can create a new_opt dict and set our desired learning rate, then call <model>.setup_optimization() with the new optimization parameters.

In [8]:
import copy

new_opt = copy.deepcopy(params['model']['optim'])
new_opt['lr'] = 0.001

In [9]:
asr_model.setup_optimization(optim_config=DictConfig(new_opt))

[NeMo W 2021-08-09 23:59:55 modelPT:642] Trainer wasn't specified in model constructor. Make sure that you really wanted it.


[NeMo I 2021-08-09 23:59:55 modelPT:750] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.001
        weight_decay: 0.001
    )


[NeMo W 2021-08-09 23:59:55 lr_scheduler:604] Neither `max_steps` nor `iters_per_batch` were provided to `optim.sched`, cannot compute effective `max_steps` !
    Scheduler will not be instantiated !


(Novograd (
 Parameter Group 0
     amsgrad: False
     betas: [0.8, 0.5]
     eps: 1e-08
     grad_averaging: False
     lr: 0.001
     weight_decay: 0.001
 ),
 None)

Change the model's vocabulary

In [10]:
# pretrained_model_vocab
print(asr_model.decoder.vocabulary)

/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt


In [16]:
print(params['model']['labels'])
asr_model.change_vocabulary(
    new_vocabulary="/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt"
)

[NeMo W 2021-08-09 19:35:55 ctc_models:302] Old /home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt and new /home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt match. Not changing anything.


/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt


In [17]:
# new_model_vocab
print(asr_model.decoder.vocabulary)

/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt


After this, the decoder is rebuilt for the new vocabulary, while the encoder (which is where most of the weights are) remains intact. In the run above the old and new vocabularies matched, so nothing was changed. Let's fine-tune this model for 20 epochs on our Urdu dataset. We will also use the learning rate set in `new_opt`.
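The cells above pass a file path as `new_vocabulary`. As far as we know, the character-based `change_vocabulary` takes the labels themselves as a list of strings, so if that is the intent, the file has to be read first. A minimal sketch, assuming the vocabulary file holds one label per line; the helper `load_vocabulary` and the sample labels are hypothetical:

```python
import os
import tempfile


def load_vocabulary(path):
    """Read one label per line, preserving order and whitespace-only labels."""
    labels = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label = line.rstrip("\r\n")
            if label != "":  # skip truly empty lines, keep a single-space label
                labels.append(label)
    return labels


# Hypothetical vocabulary file, written to a temp file for illustration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a\nb\n \n##c\n\n")
    vocab_path = tmp.name

labels = load_vocabulary(vocab_path)
os.remove(vocab_path)
print(labels)  # ['a', 'b', ' ', '##c']

# Assumed usage with the model from this notebook (not executed here):
# asr_model.change_vocabulary(new_vocabulary=labels)
```

Printing `len(labels)` and comparing it with the decoder's `num_classes` is a quick sanity check that the model sees the vocabulary you expect.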

In [11]:
print(params['model']['train_ds'])

{'manifest_filepath': '/home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/dl_old_train_resample/dl_old_train_resample.json', 'sample_rate': 16000, 'labels': '/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt', 'batch_size': 4, 'trim_silence': True, 'max_duration': 50.9, 'shuffle': True, 'is_tarred': False, 'tarred_audio_filepaths': None, 'tarred_shard_strategy': 'scatter'}


In [12]:
# Point to the data we'll use for fine-tuning as the training set
asr_model.setup_training_data(train_data_config=params['model']['train_ds'])

# Point to the new validation data for fine-tuning
asr_model.setup_validation_data(val_data_config=params['model']['validation_ds'])

[NeMo I 2021-08-10 00:00:02 audio_to_text_dataset:36] Model level config does not container `sample_rate`, please explicitly provide `sample_rate` to the dataloaders.
[NeMo I 2021-08-10 00:00:02 audio_to_text_dataset:36] Model level config does not container `labels`, please explicitly provide `labels` to the dataloaders.
[NeMo I 2021-08-10 00:00:05 collections:173] Dataset loaded with 78133 files totalling 268.27 hours
[NeMo I 2021-08-10 00:00:05 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2021-08-10 00:00:05 audio_to_text_dataset:36] Model level config does not container `sample_rate`, please explicitly provide `sample_rate` to the dataloaders.
[NeMo I 2021-08-10 00:00:05 audio_to_text_dataset:36] Model level config does not container `labels`, please explicitly provide `labels` to the dataloaders.
[NeMo I 2021-08-10 00:00:08 collections:173] Dataset loaded with 78133 files totalling 268.27 hours
[NeMo I 2021-08-10 00:00:08 collections:174] 0 files were filter
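The earlier scheduler warning says that `max_steps` could not be computed. With the dataset size reported in the log above (78133 files), a batch size of 4 and 20 epochs on a single GPU, the step count can be worked out by hand. A small sketch of that arithmetic, assuming no gradient accumulation and that the last partial batch is kept; the helper name `total_training_steps` is hypothetical:

```python
import math


def total_training_steps(num_samples, batch_size, max_epochs, drop_last=False):
    """Optimizer steps for a single-device run without gradient accumulation."""
    if drop_last:
        steps_per_epoch = num_samples // batch_size
    else:
        steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * max_epochs


steps = total_training_steps(num_samples=78133, batch_size=4, max_epochs=20)
print(steps)  # 390680  (19534 steps per epoch * 20 epochs)
```

This is the kind of number a learning-rate scheduler needs in order to know how long the run is.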

In [13]:
print(params['model']['decoder'])
print(asr_model.decoder.vocabulary)
print(params['model'])

{'_target_': 'nemo.collections.asr.modules.ConvASRDecoder', 'feat_in': 1024, 'num_classes': 105, 'vocabulary': '/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt'}
/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt
{'sample_rate': 16000, 'repeat': 5, 'dropout': 0.0, 'separable': True, 'labels': '/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt', 'train_ds': {'manifest_filepath': '/home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/dl_old_train_resample/dl_old_train_resample.json', 'sample_rate': 16000, 'labels': '/home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt', 'batch_size': 4, 'trim_silence': True, 'max_duration': 50.9, 'shuffle': True, 'is_tarred': False, 'tarred_audio_filepaths': None, 'tarred_shard_strategy': 'scatter'}, 'validation_ds': {'manifest_filepa

In [None]:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/

In [None]:
# Run fine-tuning with the PyTorch Lightning trainer created above, then save the result.

trainer.fit(asr_model)
asr_model.save_to("asr_quartznet_20epochs_finetune.nemo")

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2021-08-10 00:00:11 modelPT:642] Trainer wasn't specified in model constructor. Make sure that you really wanted it.


[NeMo I 2021-08-10 00:00:11 modelPT:750] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.001
        weight_decay: 0.001
    )


[NeMo W 2021-08-10 00:00:11 lr_scheduler:604] Neither `max_steps` nor `iters_per_batch` were provided to `optim.sched`, cannot compute effective `max_steps` !
    Scheduler will not be instantiated !

  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 18.9 M
2 | decoder           | ConvASRDecoder                    | 110 K 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WER                               | 0     
------------------------------------------------------------------------
19.0 M    Trainable params
0         Non-trainable params
19.0 M    Total params
76.021    Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

[NeMo W 2021-08-10 00:00:11 patch_utils:49] torch.stft() signature has been updated for PyTorch 1.7+
    Please update PyTorch to remain compatible with later versions of NeMo.
    To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
      return torch.floor_divide(self, other)

Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Save the model easily along with the tokenizer using `save_to`. 

Later, we can use `restore_from` to restore the model. It will also reinitialize the tokenizer.

In [18]:
asr_model.save_to("asr_quartznet_20epochs_finetune.nemo")

## Inference

Let's have a quick look at how one could run inference with NeMo's ASR model.

``EncDecCTCModel`` (the class used in this notebook) and its subclasses such as ``EncDecCTCModelBPE`` contain a handy ``transcribe`` method which can be used to obtain transcriptions of audio files. It also has a ``batch_size`` argument to improve throughput.

In [23]:
# Bigger batch-size = bigger throughput
params['model']['validation_ds']['batch_size'] = 16
params['model']['validation_ds']['manifest_filepath'] = "/home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/dl_old/test_data/test_manifest.json"

# Setup the test data loader and make sure the model is on GPU
asr_model.setup_test_data(test_data_config=params['model']['validation_ds'])
asr_model.cuda()
asr_model.eval()

# We remove some preprocessing artifacts which benefit training
asr_model.preprocessor.featurizer.pad_to = 0
asr_model.preprocessor.featurizer.dither = 0.0

# We will be computing the Word Error Rate (WER) metric between our predictions (hypotheses) and the reference transcripts.
# WER is computed as numerator/denominator.
# We'll gather all the test batches' numerators and denominators.
wer_nums = []
wer_denoms = []

# Loop over all test batches.
# Iterating over the model's `test_dataloader` will give us:
# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)
# See the AudioToCharDataset for more details.
# Inference only: disable autograd so activations are not kept for backprop,
# which lowers GPU memory use.
with torch.no_grad():
    for test_batch in asr_model.test_dataloader():
        test_batch = [x.cuda() for x in test_batch]
        targets = test_batch[2]
        targets_lengths = test_batch[3]
        log_probs, encoded_len, greedy_predictions = asr_model(
            input_signal=test_batch[0], input_signal_length=test_batch[1]
        )
        # Notice the model has a helper object to compute WER
        asr_model._wer.update(greedy_predictions, targets, targets_lengths)
        _, wer_num, wer_denom = asr_model._wer.compute()
        # Reset the metric so each batch contributes only its own counts
        asr_model._wer.reset()
        wer_nums.append(wer_num.detach().cpu().numpy())
        wer_denoms.append(wer_denom.detach().cpu().numpy())

# We need to sum all numerators and denominators first. Then divide.
print(f"WER = {sum(wer_nums)/sum(wer_denoms)}")


[NeMo I 2021-08-09 19:36:24 audio_to_text_dataset:36] Model level config does not container `sample_rate`, please explicitly provide `sample_rate` to the dataloaders.
[NeMo I 2021-08-09 19:36:24 audio_to_text_dataset:36] Model level config does not container `labels`, please explicitly provide `labels` to the dataloaders.
[NeMo I 2021-08-09 19:36:24 collections:173] Dataset loaded with 200 files totalling 0.64 hours
[NeMo I 2021-08-09 19:36:24 collections:174] 0 files were filtered totalling 0.00 hours


RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 9.17 GiB already allocated; 25.44 MiB free; 9.22 GiB reserved in total by PyTorch)

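The cell above is an example of a simple inference loop in pure PyTorch. It also shows how one can compute the Word Error Rate (WER) metric between predictions and references.

In the run recorded here, the loop stopped with a CUDA out-of-memory error before any WER was printed; a smaller `batch_size` for the test data loader is one thing to try. If the WER you obtain is not particularly impressive, it could likely be improved by training longer (try 100 epochs).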
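To make the "sum all numerators and denominators first, then divide" step concrete, here is a self-contained sketch of WER on plain strings. It does not use NeMo's metric object; the helper `word_errors` and the sample sentences are hypothetical:

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1], len(ref)


# Two hypothetical batches of (reference, hypothesis) pairs.
batches = [
    [("the cat sat", "the cat sat"), ("a quick brown fox", "a quick fox")],
    [("hello world", "hello there world")],
]

wer_nums, wer_denoms = [], []
for batch in batches:
    errors = [word_errors(ref, hyp) for ref, hyp in batch]
    wer_nums.append(sum(num for num, _ in errors))
    wer_denoms.append(sum(denom for _, denom in errors))

corpus_wer = sum(wer_nums) / sum(wer_denoms)
mean_of_batch_wers = sum(n / d for n, d in zip(wer_nums, wer_denoms)) / len(batches)
print(f"WER = {corpus_wer:.4f}")                       # WER = 0.2222
print(f"mean of batch WERs = {mean_of_batch_wers:.4f}")  # mean of batch WERs = 0.3214
```

The two numbers differ because batches contain different numbers of reference words, which is why the counts are summed before dividing.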

### Fast Training

Last but not least, we could simply speed up training our model! If you have the resources, you can speed up training by splitting the workload across multiple GPUs. Otherwise (or in addition), there's always mixed precision training, which allows you to increase your batch size.

You can use [PyTorch Lightning's Trainer object](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html?highlight=Trainer) to handle mixed-precision and distributed training for you. Below are some examples of flags you would pass to the `Trainer` to use these features:

```python
# Mixed precision:
trainer = pl.Trainer(amp_level='O1', precision=16)

# Trainer with a distributed backend:
trainer = pl.Trainer(gpus=2, num_nodes=2, accelerator='ddp')

# Of course, you can combine these flags as well.
```
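When moving from one GPU to several, keep in mind that with data-parallel training each process loads its own batch, so the number of samples per optimizer step grows with the number of devices. A small sketch of that arithmetic; the helper name `effective_batch_size` is hypothetical, and the batch size of 4 is the one used earlier in this notebook:

```python
def effective_batch_size(per_gpu_batch, gpus, num_nodes=1, accumulate_grad_batches=1):
    """Samples consumed per optimizer step under data-parallel training."""
    return per_gpu_batch * gpus * num_nodes * accumulate_grad_batches


single_gpu = effective_batch_size(4, gpus=1)
two_by_two = effective_batch_size(4, gpus=2, num_nodes=2)
print(single_gpu)  # 4
print(two_by_two)  # 16
```

A larger effective batch size usually calls for revisiting the learning rate as well.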

Finally, have a look at [example scripts in NeMo repository](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_to_text_bpe.py) which can handle mixed precision and distributed training using command-line arguments.