## Training from scratch

To train from scratch, you need to prepare your training data in the right format and specify your model's architecture.
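
As a rough illustration of "the right format": NeMo ASR manifests are text files with one JSON object per line, carrying the keys `audio_filepath`, `duration` (in seconds) and `text`. The sketch below writes and reads such a file using only the standard library. The file paths, transcripts and helper names (`write_manifest`, `read_manifest`) are made up for this example.

```python
import json
import os
import tempfile

# Hypothetical utterances: (audio path, duration in seconds, transcript).
utterances = [
    ("audio/utt_0001.wav", 3.2, "hello world"),
    ("audio/utt_0002.wav", 5.75, "speech recognition with nemo"),
]


def write_manifest(path, items):
    """Write one JSON object per line (JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for audio_filepath, duration, text in items:
            entry = {
                "audio_filepath": audio_filepath,
                "duration": duration,
                "text": text,
            }
            f.write(json.dumps(entry) + "\n")


def read_manifest(path):
    """Read the manifest back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


manifest_path = os.path.join(tempfile.mkdtemp(), "train_manifest.json")
write_manifest(manifest_path, utterances)

entries = read_manifest(manifest_path)
total_hours = sum(e["duration"] for e in entries) / 3600
print(f"{len(entries)} files totalling {total_hours:.4f} hours")
```

The resulting path is what you would assign to `manifest_filepath` in the `train_ds` and `validation_ds` sections of the config.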

# For CUDA OOM

In [12]:
import torch
torch.cuda.empty_cache()

In [2]:
# NeMo's "core" package
import nemo
# NeMo's ASR collection - this collection contains complete ASR models and
# building blocks (modules) for ASR
import nemo.collections.asr as nemo_asr

[NeMo W 2021-08-05 22:24:02 optimizers:47] Apex was not found. Using the lamb optimizer will error out.
[NeMo W 2021-08-05 22:24:04 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.


In [3]:

# Some utility imports
import os
from omegaconf import OmegaConf
# Load the Jasper model config; the full config is printed below
config_path = "/home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/examples/asr/conf/jasper/jasper_10x5dr.yaml"
config = OmegaConf.load(config_path)
config = OmegaConf.to_container(config, resolve=True)
config = OmegaConf.create(config)
print(OmegaConf.to_yaml(config))

name: Jasper10x5
model:
  sample_rate: 16000
  labels: /home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/nemo/nemo/tokenizer_wpe_v1024/vocab.txt
  train_ds:
    manifest_filepath: /home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/nemo/nemo.json
    sample_rate: 16000
    labels: /home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/nemo/nemo/tokenizer_wpe_v1024/vocab.txt
    batch_size: 32
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
  validation_ds:
    manifest_filepath: /home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/nemo/nemo.json
    sample_rate: 16000
    labels: /home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/nemo/nemo/tokenizer_wpe_v1024/vocab.txt
    batch_size: 32
    shuffle: false
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    normalize: per_feature
    window_size: 0.02
    sample_rate: 16000
    window_stride: 0.01
    window:

In [4]:
print(OmegaConf.to_yaml(config.model.train_ds))

manifest_filepath: /home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/nemo/nemo.json
sample_rate: 16000
labels: /home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/nemo/nemo/tokenizer_wpe_v1024/vocab.txt
batch_size: 32
trim_silence: true
max_duration: 16.7
shuffle: true
is_tarred: false
tarred_audio_filepaths: null
tarred_shard_strategy: scatter



In [5]:
train_dataset = "/home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/nemo/nemo.json"
val_dataset = "/home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/nemo/nemo.json"


In [6]:
config.model.train_ds.manifest_filepath = train_dataset
config.model.validation_ds.manifest_filepath = val_dataset


In [7]:
import torch
import pytorch_lightning as pl

In [8]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

Trainer config - 

gpus: 0
max_epochs: 5
max_steps: null
num_nodes: 1
accelerator: ddp
accumulate_grad_batches: 1
checkpoint_callback: false
logger: false
log_every_n_steps: 1
val_check_interval: 1.0



In [9]:
# Let's modify some trainer configs for this demo
# Checks if we have GPU available and uses it
cuda = 1 if torch.cuda.is_available() else 0
config.trainer.gpus = cuda

# Reduces maximum number of epochs to 5 for quick demonstration
config.trainer.max_epochs = 5

# Remove distributed training flags
config.trainer.accelerator = None

In [10]:
trainer = pl.Trainer(**config.trainer)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores


In [11]:
from nemo.utils.exp_manager import exp_manager

exp_dir = exp_manager(trainer, config.get("exp_manager", None))

[NeMo I 2021-08-04 18:59:48 exp_manager:219] Experiments will be logged at /home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/tutorials/asr/nemo_experiments/Jasper10x5/2021-08-04_18-59-48
[NeMo I 2021-08-04 18:59:48 exp_manager:568] TensorboardLogger has been set up


In [12]:
# The exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir


'/home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/tutorials/asr/nemo_experiments/Jasper10x5/2021-08-04_18-59-48'

In [13]:
## Building the Jasper Model

asr_model = nemo_asr.models.EncDecCTCModel(cfg=config.model, trainer=trainer)

[NeMo I 2021-08-04 18:59:49 collections:173] Dataset loaded with 10 files totalling 0.02 hours
[NeMo I 2021-08-04 18:59:49 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2021-08-04 18:59:49 collections:173] Dataset loaded with 10 files totalling 0.02 hours
[NeMo I 2021-08-04 18:59:49 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2021-08-04 18:59:49 features:252] PADDING: 16
[NeMo I 2021-08-04 18:59:49 features:269] STFT using torch


In [14]:
trainer.fit(asr_model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2021-08-04 18:59:55 modelPT:750] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.01
        weight_decay: 0.001
    )
[NeMo I 2021-08-04 18:59:55 lr_scheduler:621] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f4beb353a00>" 
    will be used during training (effective maximum steps = 5) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: null
    min_lr: 0.0
    last_epoch: -1
    max_steps: 5
    )



  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 332 M 
2 | decoder           | ConvASRDecoder                    | 90.2 K
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WER                               | 0     
------------------------------------------------------------------------
332 M     Trainable params
0         Non-trainable params
332 M     Total params
1,330.771 Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

      rank_zero_warn(
    
[NeMo W 2021-08-04 18:59:55 patch_utils:49] torch.stft() signature has been updated for PyTorch 1.7+
    Please update PyTorch to remain compatible with later versions of NeMo.
    To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
      return torch.floor_divide(self, other)
    
      rank_zero_warn(
    


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch 0, global step 0: val_loss reached 1718.45935 (best 1718.45935), saving model to "/home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/tutorials/asr/nemo_experiments/Jasper10x5/2021-08-04_18-59-48/checkpoints/Jasper10x5--val_loss=1718.46-epoch=0.ckpt" as top 3


Validating: 0it [00:00, ?it/s]

Epoch 1, global step 1: val_loss reached 1607.70251 (best 1607.70251), saving model to "/home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/tutorials/asr/nemo_experiments/Jasper10x5/2021-08-04_18-59-48/checkpoints/Jasper10x5--val_loss=1607.70-epoch=1.ckpt" as top 3


Validating: 0it [00:00, ?it/s]

Epoch 2, global step 2: val_loss reached 1546.29919 (best 1546.29919), saving model to "/home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/tutorials/asr/nemo_experiments/Jasper10x5/2021-08-04_18-59-48/checkpoints/Jasper10x5--val_loss=1546.30-epoch=2.ckpt" as top 3


Validating: 0it [00:00, ?it/s]

Epoch 3, global step 3: val_loss reached 1522.49414 (best 1522.49414), saving model to "/home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/tutorials/asr/nemo_experiments/Jasper10x5/2021-08-04_18-59-48/checkpoints/Jasper10x5--val_loss=1522.49-epoch=3.ckpt" as top 3


Validating: 0it [00:00, ?it/s]

Epoch 4, global step 4: val_loss reached 1542.80298 (best 1522.49414), saving model to "/home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/tutorials/asr/nemo_experiments/Jasper10x5/2021-08-04_18-59-48/checkpoints/Jasper10x5--val_loss=1542.80-epoch=4.ckpt" as top 3
Saving latest checkpoint...


# PyTorch Lightning

### Specifying Our Model with a YAML Config File

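This tutorial text describes building a *Citrinet* model with a *greedy CTC decoder*, using the configuration found in `./configs/citrinet_bpe.yaml`. The cells below load the Jasper configuration `jasper_10x5dr.yaml` instead, which follows the same overall layout.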

If we open this config file, we find a `model` section that describes the architecture of our model. The model contains an entry labeled `encoder`, with a field called `jasper` that contains a list with multiple entries. Each member of this list specifies one block in our model and looks something like this:
```
- filters: 192
  repeat: 5
  kernel: [11]
  stride: [1]
  dilation: [1]
  dropout: 0.0
  residual: false
  separable: true
  se: true
  se_context_size: -1
```
The first member of the list corresponds to the first block in the QuartzNet/Citrinet architecture diagram. 

Some entries at the top of the file specify how we will handle training (`train_ds`) and validation (`validation_ds`) data.

Using a YAML config such as this helps get a quick and human-readable overview of what your architecture looks like, and allows you to swap out model and run configurations easily without needing to change your code.

In [3]:
from omegaconf import OmegaConf, open_dict

In [4]:
params = OmegaConf.load("/home/hood/KK/MediaAnalysis/Code Repos/kashbah_ncai/NeMo/examples/asr/conf/jasper/jasper_10x5dr.yaml")

Let us make the network smaller since `AN4` is a particularly small dataset and does not need the capacity of the general config.

In [5]:
print(OmegaConf.to_yaml(params))

name: Jasper10x5
model:
  sample_rate: 16000
  labels: /home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt
  train_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels: /home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt
    batch_size: 4
    trim_silence: true
    max_duration: 50.9
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
  validation_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels: /home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt
    batch_size: 4
    shuffle: false
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    normalize: per_feature
    window_size: 0.02
    sample_rate: 16000
    window_stride: 0.01
    window: hann
    features: 64
    n_fft: 512
    frame_splicing: 1


### Training with PyTorch Lightning

NeMo models and modules can be used in any PyTorch code where `torch.nn.Module` is expected.

However, NeMo's models are based on [PyTorch Lightning's](https://github.com/PyTorchLightning/pytorch-lightning) `LightningModule`, and we recommend using PyTorch Lightning for training and fine-tuning because it makes mixed precision and distributed training very easy. To start, let's create a `Trainer` instance for training on a GPU for 10 epochs with 16-bit mixed precision.

In [6]:
import pytorch_lightning as pl
trainer = pl.Trainer(gpus=1, max_epochs=10, amp_level='O1', precision=16)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores


Next, we instantiate an ASR model based on the config loaded in the previous section.
This is also the stage at which we tell the model where our training and validation manifests are.

In [7]:
# Update paths to dataset
params.model.train_ds.manifest_filepath = "/home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/dl_old_train_resample/dl_old_train_resample.json"
params.model.validation_ds.manifest_filepath = "/home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/dl_old_train_resample/dl_old_train_resample.json"

# remove spec augment for this dataset
params.model.spec_augment.rect_masks = 0

params.model.train_ds.batch_size = 4
params.model.validation_ds.batch_size = 4

In [8]:
first_asr_model = nemo_asr.models.EncDecCTCModel(cfg=params.model, trainer=trainer)

[NeMo I 2021-08-05 22:25:16 collections:173] Dataset loaded with 78133 files totalling 268.27 hours
[NeMo I 2021-08-05 22:25:16 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2021-08-05 22:25:19 collections:173] Dataset loaded with 78133 files totalling 268.27 hours
[NeMo I 2021-08-05 22:25:19 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2021-08-05 22:25:19 features:252] PADDING: 16
[NeMo I 2021-08-05 22:25:19 features:269] STFT using torch


In [None]:
%load_ext tensorboard
%tensorboard --logdir lightning_logs/

With that, we can start training with just one line!

In [9]:
torch.cuda.empty_cache()

In [10]:
# Start training!!!
trainer.fit(first_asr_model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2021-08-05 22:25:25 modelPT:750] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.01
        weight_decay: 0.001
    )
[NeMo I 2021-08-05 22:25:25 lr_scheduler:621] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f615a9ef940>" 
    will be used during training (effective maximum steps = 195340) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: null
    min_lr: 0.0
    last_epoch: -1
    max_steps: 195340
    )



  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 332 M 
2 | decoder           | ConvASRDecoder                    | 110 K 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WER                               | 0     
------------------------------------------------------------------------
332 M     Trainable params
0         Non-trainable params
332 M     Total params
1,330.853 Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

      rank_zero_warn(
    
[NeMo W 2021-08-05 22:25:25 patch_utils:49] torch.stft() signature has been updated for PyTorch 1.7+
    Please update PyTorch to remain compatible with later versions of NeMo.
    To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
      return torch.floor_divide(self, other)
    
      rank_zero_warn(
    


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

    


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

      rank_zero_warn('Detected KeyboardInterrupt, attempting graceful shutdown...')
    


In [11]:
torch.cuda.memory_summary()



Save the model, along with its tokenizer, using `save_to`.

Later, we can use `restore_from` to restore the model; this also reinitializes the tokenizer.

In [11]:
first_asr_model.save_to("jasper_asr_03epoch_model.nemo")

There we go! We've put together a full training pipeline for the model, with the trainer configured for 10 epochs.

If you'd like to save this model checkpoint for loading later (e.g. for fine-tuning, or for continuing training), you can simply call `first_asr_model.save_to(<checkpoint_path>)`. Then, to restore your weights, you can rebuild the model using the config (let's say you call it `first_asr_model_continued` this time) and call `first_asr_model_continued.restore_from(<checkpoint_path>)`.

We could improve this model by playing with hyperparameters. We can look at the current hyperparameters with the following:

In [11]:
print(params.model.optim)

{'name': 'novograd', 'lr': 0.01, 'betas': [0.8, 0.5], 'weight_decay': 0.001, 'sched': {'name': 'CosineAnnealing', 'warmup_steps': None, 'warmup_ratio': None, 'min_lr': 0.0, 'last_epoch': -1}}
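
The training log above reports `effective maximum steps = 195340`. That figure can be reproduced from the dataset size, batch size and epoch count, assuming a single GPU, no gradient accumulation, and that the last partial batch of each epoch is kept. The sketch also evaluates the textbook cosine annealing schedule (no warmup) at a few steps. Both helper functions are hypothetical illustrations, not NeMo's scheduler code.

```python
import math


def effective_max_steps(num_files, batch_size, max_epochs):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_files / batch_size)
    return steps_per_epoch * max_epochs


def cosine_annealing_lr(step, max_steps, base_lr, min_lr=0.0):
    """Textbook cosine annealing from base_lr down to min_lr, without warmup."""
    progress = min(step, max_steps) / max_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Figures from the logs above: 78133 files, batch_size=4, max_epochs=10.
max_steps = effective_max_steps(78133, 4, 10)
print(max_steps)

# Learning rate at the start, middle and end of training for lr=0.01.
for step in (0, max_steps // 2, max_steps):
    print(step, cosine_annealing_lr(step, max_steps, base_lr=0.01))
```

The same arithmetic explains the earlier toy run: 10 files with a batch size of 32 give one step per epoch, so 5 epochs yield 5 steps.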


### After training and hyperparameter tuning

Let's say we wanted to change the learning rate. To do so, we can create a `new_opt` dict and set our desired learning rate, then call `<model>.setup_optimization()` with the new optimization parameters.

In [12]:
import copy
new_opt = copy.deepcopy(params.model.optim)
new_opt.lr = 0.1
first_asr_model.setup_optimization(optim_config=new_opt);
# And then you can invoke trainer.fit(first_asr_model)

[NeMo I 2021-08-04 19:17:35 modelPT:750] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.1
        weight_decay: 0.001
    )
[NeMo I 2021-08-04 19:17:35 lr_scheduler:621] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f54b43fa430>" 
    will be used during training (effective maximum steps = 50) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: null
    min_lr: 0.0
    last_epoch: -1
    max_steps: 50
    )


## Inference

Let's have a quick look at how one could run inference with NeMo's ASR model.


Below is an example of a simple inference loop in pure PyTorch. It also shows how to compute the Word Error Rate (WER) metric between predictions and references.

In [13]:
# Bigger batch-size = bigger throughput
params['model']['validation_ds']['batch_size'] = 16

# Setup the test data loader and make sure the model is on GPU
first_asr_model.setup_test_data(test_data_config=params['model']['validation_ds'])
first_asr_model.cuda()
first_asr_model.eval()

# We remove some preprocessing artifacts which benefit training
first_asr_model.preprocessor.featurizer.pad_to = 0
first_asr_model.preprocessor.featurizer.dither = 0.0

# We will be computing the Word Error Rate (WER) metric between our predictions and the reference transcripts.
# WER is computed as numerator/denominator.
# We'll gather all the test batches' numerators and denominators.
wer_nums = []
wer_denoms = []

# Loop over all test batches.
# Iterating over the model's `test_dataloader` will give us:
# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)
# See the AudioToCharDataset for more details.
for test_batch in first_asr_model.test_dataloader():
    test_batch = [x.cuda() for x in test_batch]
    targets = test_batch[2]
    targets_lengths = test_batch[3]
    log_probs, encoded_len, greedy_predictions = first_asr_model(
        input_signal=test_batch[0], input_signal_length=test_batch[1]
    )
    # Notice the model has a helper object to compute WER
    first_asr_model._wer.update(greedy_predictions, targets, targets_lengths)
    _, wer_num, wer_denom = first_asr_model._wer.compute()
    # Reset the metric so that each batch's counts are added only once
    first_asr_model._wer.reset()
    wer_nums.append(wer_num.detach().cpu().numpy())
    wer_denoms.append(wer_denom.detach().cpu().numpy())

# We need to sum all numerators and denominators first. Then divide.
print(f"WER = {sum(wer_nums)/sum(wer_denoms)}")

[NeMo I 2021-08-04 19:17:55 collections:173] Dataset loaded with 10 files totalling 0.02 hours
[NeMo I 2021-08-04 19:17:55 collections:174] 0 files were filtered totalling 0.00 hours
WER = 4.5


This WER is not particularly impressive and could be significantly improved. You could train longer (try 100 epochs) to get a better number.
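
To see what the numerator and denominator stand for, here is a minimal standard-library sketch of corpus-level WER: the word-level edit distance summed over all utterances, divided by the total number of reference words. The function names and toy sentences are hypothetical. This is not NeMo's `WER` class, only the same idea in plain Python.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Sum word errors and reference lengths over all pairs, then divide once."""
    errors = 0
    ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        errors += word_edit_distance(ref_tokens, hyp_tokens)
        ref_words += len(ref_tokens)
    return errors / ref_words


refs = ["the cat sat", "hello"]
hyps = ["the cat sat down", "a b c d e"]
print(corpus_wer(refs, hyps))
```

Because insertions count as errors, WER is not bounded by 1: a model that emits many more words than the reference contains can score well above 100%, which is how a value such as 4.5 can arise early in training.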

## Model Improvements

You already have all you need to create your own ASR model in NeMo, but there are a few more tricks that you can employ if you so desire. In this section, we'll briefly cover a few possibilities for improving an ASR model.

### Data Augmentation

There exist several ASR data augmentation methods that can increase the size of our training set.

For example, we can perform augmentation on the spectrograms by zeroing out specific frequency segments ("frequency masking") or time segments ("time masking") as described by [SpecAugment](https://arxiv.org/abs/1904.08779), or zero out rectangles on the spectrogram as in [Cutout](https://arxiv.org/pdf/1708.04552.pdf). In NeMo, we can do all three of these by simply adding a `SpectrogramAugmentation` neural module. (As of now, it does not perform the time warping from the SpecAugment paper.)

Our toy model disables spectrogram augmentation, because it is not significantly beneficial for the short demo.

In [None]:
print(OmegaConf.to_yaml(first_asr_model._cfg['spec_augment']))

If you want to enable SpecAugment in your model, make sure your `.yaml` config file contains a `model/spec_augment` section that looks like the one above.
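
The masking idea itself is simple enough to sketch without any libraries. The function below zeroes out randomly placed frequency bands and time spans in a toy spectrogram stored as a list of rows. The name `apply_masks` and its parameters are hypothetical; this is not the `SpectrogramAugmentation` implementation, and the Cutout-style rectangles are omitted.

```python
import random


def apply_masks(spec, freq_masks=1, time_masks=1, freq_width=2, time_width=3, rng=None):
    """Return a copy of spec (frequency rows x time steps) with random
    frequency bands and time spans set to zero."""
    if rng is None:
        rng = random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    # Frequency masking: zero a band of consecutive frequency rows.
    for _ in range(freq_masks):
        width = rng.randint(0, freq_width)
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            out[f] = [0.0] * n_time

    # Time masking: zero a span of consecutive time steps in every row.
    for _ in range(time_masks):
        width = rng.randint(0, time_width)
        start = rng.randint(0, n_time - width)
        for row in out:
            for t in range(start, start + width):
                row[t] = 0.0
    return out


# Toy "spectrogram": 8 frequency bins x 20 time steps, all ones.
spec = [[1.0] * 20 for _ in range(8)]
masked = apply_masks(spec, rng=random.Random(0))
for row in masked:
    print("".join("#" if v else "." for v in row))
```

Since the masks are drawn afresh on every call, the model sees a differently corrupted version of each utterance in every epoch.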

### Transfer learning

Transfer learning is an important machine learning technique that uses a model’s knowledge of one task to perform better on another. Fine-tuning is one of the techniques to perform transfer learning. It is an essential part of the recipe for many state-of-the-art results where a base model is first pretrained on a task with abundant training data and then fine-tuned on different tasks of interest where the training data is less abundant or even scarce.

In ASR you might want to do fine-tuning in multiple scenarios, for example, when you want to improve your model's performance on a particular domain (medical, financial, etc.) or accented speech. You can even transfer learn from one language to another! Check out [this paper](https://arxiv.org/abs/2005.04290) for examples.

Transfer learning with NeMo is simple. Let's demonstrate how we could fine-tune the model we trained earlier on AN4 data. (NOTE: this is a toy example). And, while we are at it, we will change the model's vocabulary to demonstrate how it's done.

-----
First, let's create another tokenizer, perhaps using a larger vocabulary size than the small tokenizer we created earlier. We also swap out the `sentencepiece` tokenizer for a `BERT Word Piece` tokenizer.

In [None]:
!python ./scripts/process_asr_text_tokenizer.py \
  --manifest="{data_dir}/an4/train_manifest.json" \
  --data_root="{data_dir}/tokenizers/an4/" \
  --vocab_size=64 \
  --tokenizer="wpe" \
  --no_lower_case \
  --log
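
For intuition about what a word-piece vocabulary does at tokenization time, here is a minimal sketch of BERT-style greedy longest-match splitting. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are hypothetical and are not part of NeMo; the script above is what builds the real vocabulary file.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"[UNK]", "play", "##ing", "##ed", "re", "##play"}
print(wordpiece_tokenize("playing", toy_vocab))
print(wordpiece_tokenize("replayed", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A larger `--vocab_size` lets more whole words and long fragments into the vocabulary, so utterances are encoded with fewer, longer pieces.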

Now let's load the previously trained model so that we can fine-tune it:

In [15]:
restored_model = nemo_asr.models.EncDecCTCModel.restore_from("./jasper_asr_03epoch_model.nemo")

[NeMo W 2021-08-06 19:35:11 modelPT:138] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /home/hood/KK/MediaAnalysis/ASRdatasetDL/ASR_DL/dl_old_train_resample/dl_old_train_resample.json
    sample_rate: 16000
    labels: /home/hood/KK/MediaAnalysis/Code%20Repos/kashbah_ncai/NeMo/vocab/dl_old_train/tokenizer_wpe_v1024/vocab.txt
    batch_size: 4
    trim_silence: true
    max_duration: 50.9
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    
[NeMo W 2021-08-06 19:35:11 modelPT:145] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /home/hood/KK/MediaAnalysis/ASR

[NeMo I 2021-08-06 19:35:11 features:252] PADDING: 16
[NeMo I 2021-08-06 19:35:11 features:269] STFT using torch
[NeMo I 2021-08-06 19:35:15 modelPT:438] Model EncDecCTCModel was successfully restored from ./jasper_asr_03epoch_model.nemo.


In [17]:
import copy
new_opt = copy.deepcopy(params.model.optim)
new_opt.lr = 0.1

Now let's update the vocabulary in this model.

After this, our decoder has completely changed, but our encoder (where most of the weights are) remains intact. Let's fine-tune this model on the AN4 dataset; the trainer cell below runs a single epoch (`max_epochs=1`). We will also use the learning rate from `new_opt` (0.1, which is larger than the original 0.01; see the "After training" section).

**Note**: For this demonstration, the encoder can also be frozen to speed up fine-tuning (since both tokenizers are built on the same train set); the `freeze()` call is left commented out in the cell below. In general, freezing should not be done for proper training on a new language (or on a different corpus than the original train corpus).

In [None]:
# Check what kind of vocabulary/alphabet the model has right now
print(restored_model.decoder.vocabulary)

# Let's change the tokenizer vocabulary by passing the path to the new directory,
# and also change the type
restored_model.change_vocabulary(
    new_tokenizer_dir=data_dir + "/tokenizers/an4/tokenizer_wpe_v64/",
    new_tokenizer_type="wpe"
)

In [18]:
# Use the learning rate we set in new_opt
restored_model.setup_optimization(optim_config=new_opt)

# Point to the data we'll use for fine-tuning as the training set
restored_model.setup_training_data(train_data_config=params['model']['train_ds'])

# Point to the new validation data for fine-tuning
restored_model.setup_validation_data(val_data_config=params['model']['validation_ds'])

# Freeze the encoder layers (should not be done for finetuning, only done for demo)
# restored_model.encoder.freeze()

[NeMo W 2021-08-06 19:36:00 modelPT:642] Trainer wasn't specified in model constructor. Make sure that you really wanted it.


[NeMo I 2021-08-06 19:36:00 modelPT:750] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.1
        weight_decay: 0.001
    )


[NeMo W 2021-08-06 19:36:00 lr_scheduler:604] Neither `max_steps` nor `iters_per_batch` were provided to `optim.sched`, cannot compute effective `max_steps` !
    Scheduler will not be instantiated !


[NeMo I 2021-08-06 19:36:03 collections:173] Dataset loaded with 78133 files totalling 268.27 hours
[NeMo I 2021-08-06 19:36:03 collections:174] 0 files were filtered totalling 0.00 hours
[NeMo I 2021-08-06 19:36:06 collections:173] Dataset loaded with 78133 files totalling 268.27 hours
[NeMo I 2021-08-06 19:36:06 collections:174] 0 files were filtered totalling 0.00 hours


In [None]:
# Load the TensorBoard notebook extension

%load_ext tensorboard
%tensorboard --logdir lightning_logs/


In [None]:
# And now we can create a PyTorch Lightning trainer and call `fit` again.
trainer = pl.Trainer(gpus=1, max_epochs=1, amp_level='O1', precision=16)
trainer.fit(restored_model)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[NeMo W 2021-08-06 19:43:53 modelPT:642] Trainer wasn't specified in model constructor. Make sure that you really wanted it.


[NeMo I 2021-08-06 19:43:53 modelPT:750] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.1
        weight_decay: 0.001
    )


[NeMo W 2021-08-06 19:43:53 lr_scheduler:604] Neither `max_steps` nor `iters_per_batch` were provided to `optim.sched`, cannot compute effective `max_steps` !
    Scheduler will not be instantiated !

  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 332 M 
2 | decoder           | ConvASRDecoder                    | 110 K 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WER                               | 0     
------------------------------------------------------------------------
332 M     Trainable params
0         Non-trainable params
332 M     Total params
1,330.853 Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

So we get fast convergence even though the decoder vocabulary is double the size and we freeze the encoder.

### Fast Training

Last but not least, we could simply speed up training our model! If you have the resources, you can speed up training by splitting the workload across multiple GPUs. Otherwise (or in addition), there's always mixed precision training, which allows you to increase your batch size.

You can use [PyTorch Lightning's Trainer object](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html?highlight=Trainer) to handle mixed-precision and distributed training for you. Below are some examples of flags you would pass to the `Trainer` to use these features:

```python
# Mixed precision:
trainer = pl.Trainer(amp_level='O1', precision=16)

# Trainer with a distributed backend:
trainer = pl.Trainer(gpus=2, num_nodes=2, accelerator='ddp')

# Of course, you can combine these flags as well.
```
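
When combining these flags, it helps to keep track of how many samples contribute to each optimizer step. The sketch below assumes data-parallel training in which every GPU process loads its own batches of `batch_size` samples; the helper `global_batch_size` is hypothetical.

```python
def global_batch_size(per_gpu_batch_size, gpus=1, num_nodes=1, accumulate_grad_batches=1):
    """Samples contributing to one optimizer step under data-parallel training."""
    return per_gpu_batch_size * gpus * num_nodes * accumulate_grad_batches


# batch_size=4 as in this notebook, with the distributed flags shown above.
print(global_batch_size(4, gpus=2, num_nodes=2))
```

A larger global batch usually calls for revisiting the learning rate, so it is worth recomputing this number whenever the hardware setup changes.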

Finally, have a look at [example scripts in NeMo repository](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_to_text_bpe.py) which can handle mixed precision and distributed training using command-line arguments.

## Under the Hood

NeMo is open-source and we do all our model development in the open, so you can inspect our code if you wish.

In particular, ``nemo_asr.models.EncDecCTCModelBPE`` is an encoder-decoder model constructed from several ``Neural Modules`` taken from ``nemo_asr.modules``. Here is what its forward pass looks like:
```python
def forward(self, input_signal, input_signal_length):
    processed_signal, processed_signal_len = self.preprocessor(
        input_signal=input_signal, length=input_signal_length,
    )
    # Spec augment is not applied during evaluation/testing
    if self.spec_augmentation is not None and self.training:
        processed_signal = self.spec_augmentation(input_spec=processed_signal)
    encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_len)
    log_probs = self.decoder(encoder_output=encoded)
    greedy_predictions = log_probs.argmax(dim=-1, keepdim=False)
    return log_probs, encoded_len, greedy_predictions
```
Here:

* ``self.preprocessor`` is an instance of ``nemo_asr.modules.AudioToMelSpectrogramPreprocessor``, a neural module that takes an audio signal and converts it into a Mel spectrogram.
* ``self.spec_augmentation`` is a neural module of type ``nemo_asr.modules.SpectrogramAugmentation``, which implements data augmentation.
* ``self.encoder`` is a convolutional Jasper, QuartzNet or Citrinet-like encoder of type ``nemo_asr.modules.ConvASREncoder``.
* ``self.decoder`` is a ``nemo_asr.modules.ConvASRDecoder``, which simply projects into the target alphabet (vocabulary).

Also, ``EncDecCTCModelBPE`` uses the audio dataset class ``nemo_asr.data.AudioToBPEDataset`` and CTC loss implemented in ``nemo_asr.losses.CTCLoss``.

You can use these and other neural modules (or create new ones yourself!) to construct new ASR models.
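
The `greedy_predictions` returned by the forward pass are per-frame token ids, so one more step is needed to obtain a transcript: the CTC collapse, which merges consecutive repeats and then drops blanks. The sketch below shows that step on a toy example. It assumes the blank id is the index just past the end of the vocabulary; the function name and the three-letter vocabulary are hypothetical.

```python
def ctc_greedy_collapse(frame_ids, blank_id):
    """Merge consecutive repeated ids, then drop blank tokens."""
    decoded = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            decoded.append(token_id)
        previous = token_id
    return decoded


vocabulary = ["c", "a", "t"]
blank_id = len(vocabulary)  # assumption: blank sits right after the vocabulary

# Per-frame argmax ids, as greedy_predictions would hold for one utterance.
frames = [0, 0, 3, 1, 1, 3, 3, 2, 2]
ids = ctc_greedy_collapse(frames, blank_id)
print("".join(vocabulary[i] for i in ids))
```

A blank between two identical ids keeps them separate, which is how CTC can emit doubled letters.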