"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.
!pip install wget
!pip install git+https://github.com/NVIDIA/apex.git
!pip install nemo-toolkit
!pip install nemo-asr
!pip install unidecode

!mkdir configs
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/master/examples/asr/configs/quartznet_speech_commands_3x1_v1.yaml
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/master/examples/asr/configs/quartznet_speech_commands_3x1_v2.yaml

In [1]:
# Import some necessary libraries
import os
import argparse
import copy
import math
import glob
from functools import partial
from datetime import datetime
from ruamel.yaml import YAML

# Introduction

This Speech Command recognition tutorial is based on the QuartzNet model from the paper "[QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions](https://arxiv.org/pdf/1910.10261.pdf)" with a modified decoder head to suit classification tasks.

The notebook will follow the steps below:

 - Dataset preparation: Preparing Google Speech Commands dataset

 - Audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC)

 - Data augmentation using SpecAugment "[SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)" to increase the effective number of training samples.
 
 - Develop a small Neural classification model which can be trained efficiently.
 
 - Model training on the Google Speech Commands dataset in NeMo.
 
 - Evaluation of error cases of the model by listening to the samples
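
The SpecAugment step in the list above can be illustrated with a minimal sketch of frequency and time masking. This is not NeMo's `SpectrogramAugmentation` implementation; the function name `mask_spectrogram`, its parameters, and the plain nested-list spectrogram are assumptions made for illustration only.

```python
import random


def mask_spectrogram(spec, freq_masks=1, time_masks=1,
                     max_freq_width=2, max_time_width=2, rng=None):
    """Return a copy of `spec` ([freq][time] nested lists) with random bands zeroed.

    Hypothetical helper: a simplified take on SpecAugment-style masking.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    # Frequency masking: zero out a band of consecutive frequency bins.
    for _ in range(freq_masks):
        width = rng.randint(0, min(max_freq_width, n_freq))
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            out[f] = [0.0] * n_time

    # Time masking: zero out a band of consecutive time steps.
    for _ in range(time_masks):
        width = rng.randint(0, min(max_time_width, n_time))
        start = rng.randint(0, n_time - width)
        for row in out:
            for t in range(start, start + width):
                row[t] = 0.0
    return out


# Toy 4x6 "spectrogram" filled with ones.
toy_spec = [[1.0] * 6 for _ in range(4)]
masked = mask_spectrogram(toy_spec, rng=random.Random(0))
```

Because the masks are drawn at random on every call, the same audio clip yields a slightly different input each epoch, which is how masking acts as data augmentation.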

# This is where the Google Speech Commands directory will be placed.
# Change this if you don't want the data to be extracted in the current directory.
# Select the version of the dataset required as well (can be 1 or 2)
DATASET_VER = 2
data_dir = './google_dataset_v{0}/'.format(DATASET_VER)


In [2]:
data_dir = '/home/fjia/data/freesound_resampled'

# Data Preparation

We will be using the open-source Google Speech Commands dataset. The tutorial uses V1 of the dataset, but only very minor changes are needed to support V2. The scripts below download the dataset and convert it to a format suitable for use with `nemo_asr`.

## Download the dataset

The dataset must be prepared using the scripts provided under the `{NeMo root directory}/scripts` sub-directory. 

Run the commands below to create the data directory and execute the data preparation script `process_speech_commands_data.py`.

**NOTE**: You should have at least 4GB of disk space available if you’ve used --data_version=1; and at least 6GB if you used --data_version=2. Also, it will take some time to download and process, so go grab a coffee.

**NOTE**: You may additionally pass a `--rebalance` flag at the end of the `process_speech_commands_data.py` script to rebalance the class samples in the manifest.

!mkdir -p {data_dir}
!python process_speech_commands_data.py --data_root={data_dir} --data_version={DATASET_VER}
print("Dataset ready !")

## Prepare the path to manifest files

dataset_path = 'google_speech_recognition_v{0}'.format(DATASET_VER)
dataset_basedir = os.path.join(data_dir, dataset_path)

train_dataset = os.path.join(dataset_basedir, 'train_manifest.json')
val_dataset = os.path.join(dataset_basedir, 'validation_manifest.json')
test_dataset = os.path.join(dataset_basedir, 'validation_manifest.json')

# COMBO 2 
## Background + Speech Command 57k, 7k, 7k

In [3]:

dataset_basedir = data_dir


train_dataset = './old_manifest/background_training_manifest.json,./old_manifest/2balanced_sc_train_manifest.json'
# test_dataset  = './manifest/background_testing_manifest.json,./manifest/2balanced_sc_test_manifest.json'
test_dataset  = './old_manifest/all_test.json'

## Read a few rows of the manifest file 

Manifest files are the data structure NeMo uses to declare a few important details about the data:

1) `audio_filepath`: the path to the raw audio file <br>
2) `label`: the class label (or speech command) of this sample <br>
3) `duration`: the length of the audio file, in seconds.
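
As a rough illustration of the manifest layout, the sketch below reads a manifest as one JSON object per line. The helper name `read_manifest` and the demo rows (paths and the `background` label) are hypothetical; only the one-object-per-line layout is taken from the manifest rows printed in this notebook.

```python
import json
import os
import tempfile


def read_manifest(path, limit=None):
    """Read a JSON-lines manifest: one JSON object per non-empty line."""
    entries = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entries.append(json.loads(line))
            if limit is not None and len(entries) >= limit:
                break
    return entries


# Demo on a throwaway file with two made-up rows (hypothetical values).
rows = [
    {"audio_filepath": "/tmp/a.wav", "duration": 1.0, "label": "commands"},
    {"audio_filepath": "/tmp/b.wav", "duration": 0.8, "label": "background"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    for row in rows:
        tmp.write(json.dumps(row) + "\n")

demo_entries = read_manifest(tmp.name)
os.remove(tmp.name)
print(sorted({entry["label"] for entry in demo_entries}))
```

Note that a variable such as `train_dataset` may hold several comma-separated manifest paths, so each path would need to be split out and read separately.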

In [4]:
!tail -n 10 {test_dataset}

{"audio_filepath": "/home/fjia/data/google_dataset_v2/google_speech_recognition_v2/up/bfd26d6b_nohash_2.wav", "duration": 1.0, "label": "commands", "text": "_", "offset": 0.0}
{"audio_filepath": "/home/fjia/data/google_dataset_v2/google_speech_recognition_v2/up/b83c1acf_nohash_3.wav", "duration": 1.0, "label": "commands", "text": "_", "offset": 0.0}
{"audio_filepath": "/home/fjia/data/google_dataset_v2/google_speech_recognition_v2/up/2fa39636_nohash_2.wav", "duration": 0.8359375, "label": "commands", "text": "_", "offset": 0.0}
{"audio_filepath": "/home/fjia/data/google_dataset_v2/google_speech_recognition_v2/up/b83c1acf_nohash_2.wav", "duration": 1.0, "label": "commands", "text": "_", "offset": 0.0}
{"audio_filepath": "/home/fjia/data/google_dataset_v2/google_speech_recognition_v2/up/bfd26d6b_nohash_3.wav", "duration": 1.0, "label": "commands", "text": "_", "offset": 0.0}
{"audio_filepath": "/home/fjia/data/google_dataset_v2/google_speech_recognition_v2/up/e0c782d5_nohash_1.wav",

# Training - Preparation

We will be training a QuartzNet model from the paper "[QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions](https://arxiv.org/pdf/1910.10261.pdf)". The benefit of QuartzNet over Jasper models is that it uses separable convolutions, which greatly reduce the number of parameters required to reach good model accuracy.

QuartzNet models generally follow the model definition pattern QuartzNet-[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks. Each sub-block contains a 1-D masked convolution, batch normalization, ReLU, and dropout.
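
To make the parameter saving concrete, here is a small arithmetic sketch comparing a regular 1-D convolution with a depthwise-plus-pointwise (separable) one. The channel counts and kernel size are illustrative assumptions, not values read from the config file, and bias terms are ignored.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights in a regular 1-D convolution (bias ignored)."""
    return kernel_size * c_in * c_out


def separable_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a depthwise conv followed by a 1x1 pointwise conv (bias ignored)."""
    depthwise = kernel_size * c_in  # one length-k filter per input channel
    pointwise = c_in * c_out        # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative sizes only.
c_in, c_out, k = 64, 64, 13
regular = conv1d_params(c_in, c_out, k)               # 53248
separable = separable_conv1d_params(c_in, c_out, k)   # 4928
print(f"regular: {regular}, separable: {separable}, ratio: {regular / separable:.1f}x")
```

For these sizes the separable layer needs roughly a tenth of the weights, and the gap widens as the kernel size grows.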


In [5]:
FEAT_VERSION = 'v1'
COMBO_VERSION = 'combo_balanced_all_Mfcc_prob0.8'

In [6]:
# Let's load the config file for the QuartzNet 3x1 model
# Here we will be using separable convolutions
# with 3 blocks (B=3), each repeated once (R=1)
yaml = YAML(typ="safe")
with open("configs/quartznet_vad_3x1_{0}.yaml".format(FEAT_VERSION)) as f:
    jasper_params = yaml.load(f)

# Pre-define a set of labels that this model must learn to predict
labels = jasper_params['labels']

# Get the sampling rate of the data
sample_rate = jasper_params['sample_rate']

In [7]:
# Import NeMo core functionality
# NeMo's "core" package
import nemo
# NeMo's ASR collection
import nemo.collections.asr as nemo_asr
# NeMo's learning rate policy
from nemo.utils.lr_policies import CosineAnnealing
from nemo.collections.asr.helpers import (
    monitor_classification_training_progress,
    process_classification_evaluation_batch,
    process_classification_evaluation_epoch,
)
from nemo.collections.asr.metrics import classification_accuracy

logging = nemo.logging



## Define some model hyper parameters

In [8]:
# Let's define some hyperparameters
lr = 0.05
num_epochs = 5
batch_size = 128
weight_decay = 0.001

## Define the NeMo components

In [9]:
# Create a Neural Factory
# It creates log files and tensorboard writers for us among other functions
neural_factory = nemo.core.NeuralModuleFactory(
    log_dir='./{0}/quartznet-3x1-{1}{2}'.format(dataset_basedir, FEAT_VERSION, COMBO_VERSION),
    create_tb_writer=True)
tb_writer = neural_factory.tb_writer

In [10]:
# Check if data augmentation such as white noise and time shift augmentation should be used
audio_augmentor = jasper_params.get('AudioAugmentor', None)

# Build the input data layer and the preprocessing layers for the train set
train_data_layer = nemo_asr.AudioToSpeechLabelDataLayer(
    manifest_filepath=train_dataset,
    labels=labels,
    sample_rate=sample_rate,
    batch_size=batch_size,
    num_workers=os.cpu_count(),
    augmentor=audio_augmentor,
    shuffle=True
)

# Build the input data layer and the preprocessing layers for the test set
eval_data_layer = nemo_asr.AudioToSpeechLabelDataLayer(
    manifest_filepath=test_dataset,
    sample_rate=sample_rate,
    labels=labels,
    batch_size=batch_size,
    num_workers=os.cpu_count(),
    shuffle=False,
)

# We will convert the raw audio data into MelSpectrogram features to feed as input to our model
data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
    sample_rate=sample_rate, **jasper_params["AudioToMelSpectrogramPreprocessor"],
)

# data_preprocessor = nemo_asr.AudioToMFCCPreprocessor(
#     sample_rate=sample_rate, **jasper_params["AudioToMFCCPreprocessor"],
# )



# Compute the total number of samples and the number of training steps per epoch
N = len(train_data_layer)
steps_per_epoch = math.ceil(N / float(batch_size) + 1)

logging.info("Steps per epoch : {0}".format(steps_per_epoch))
logging.info('Have {0} examples to train on.'.format(N))

# Here we begin defining all of the augmentations we want
# We will pad the preprocessed spectrogram image to have a certain number of timesteps
# This centers the generated spectrogram and adds black boundaries to either side
# of the padded image.
crop_pad_augmentation = nemo_asr.CropOrPadSpectrogramAugmentation(audio_length=128)

# We also optionally add `SpecAugment` augmentations based on the config file
# SpecAugment has various possible augmentations to the generated spectrogram
# 1) Frequency band masking
# 2) Time band masking
# 3) Rectangular cutout
spectr_augment_config = jasper_params.get('SpectrogramAugmentation', None)

if spectr_augment_config:
    data_spectr_augmentation = nemo_asr.SpectrogramAugmentation(**spectr_augment_config)

# Build the QuartzNet Encoder model
# The config defines the layers as a list of dictionaries
# The first and last two blocks are not considered when we say QuartzNet-[BxR]
# B is counted as the number of blocks after the first layer and before the penultimate layer.
# R is defined as the number of repetitions of each block in B.
# Note: We can scale the convolution kernels size by the float parameter `kernel_size_factor`
jasper_encoder = nemo_asr.JasperEncoder(**jasper_params["JasperEncoder"])

# We then define the QuartzNet decoder.
# This decoder head is specialized for the task for classification, such that it
# accepts a set of `N-feat` per timestep of the model, and averages these features
# over all the timesteps, before passing a Linear classification layer on those features.
jasper_decoder = nemo_asr.JasperDecoderForClassification(
    feat_in=jasper_params["JasperEncoder"]["jasper"][-1]["filters"],
    num_classes=len(labels),
    **jasper_params['JasperDecoderForClassification'],
)

# We can easily apply cross entropy loss to train this model
ce_loss = nemo_asr.CrossEntropyLossNM()

[NeMo I 2020-05-08 17:41:26 collections:222] Filtered duration for loading collection is 7.351812.
[NeMo I 2020-05-08 17:41:27 collections:222] Filtered duration for loading collection is 1.173313.
[NeMo I 2020-05-08 17:41:27 features:144] PADDING: 16
[NeMo I 2020-05-08 17:41:27 features:152] STFT using conv
[NeMo I 2020-05-08 17:41:29 <ipython-input-10-09d345f4e405>:40] Steps per epoch : 894
[NeMo I 2020-05-08 17:41:29 <ipython-input-10-09d345f4e405>:41] Have 114200 examples to train on.


In [11]:
# Let's print out the number of parameters of this model
logging.info('================================')
logging.info(f"Number of parameters in encoder: {jasper_encoder.num_weights}")
logging.info(f"Number of parameters in decoder: {jasper_decoder.num_weights}")
logging.info(
    f"Total number of parameters in model: " f"{jasper_decoder.num_weights + jasper_encoder.num_weights}"
)
logging.info('================================')

[NeMo I 2020-05-08 17:41:33 <ipython-input-11-6805b5462cf6>:3] Number of parameters in encoder: 73344
[NeMo I 2020-05-08 17:41:33 <ipython-input-11-6805b5462cf6>:4] Number of parameters in decoder: 258
[NeMo I 2020-05-08 17:41:33 <ipython-input-11-6805b5462cf6>:6] Total number of parameters in model: 73602


## Compile the Training Graph for NeMo

In [12]:
# Now we have all of the components that are required to build the NeMo execution graph!
## Build the training data loaders and preprocessors first
audio_signal, audio_signal_len, labels, label_len = train_data_layer()
processed_signal, processed_signal_len = data_preprocessor(input_signal=audio_signal, length=audio_signal_len)
processed_signal, processed_signal_len = crop_pad_augmentation(
    input_signal=processed_signal,
    # Use the length of the preprocessed features, not of the raw audio
    length=processed_signal_len
)

## Augment the dataset for training
if spectr_augment_config:
    processed_signal = data_spectr_augmentation(input_spec=processed_signal)

## Define the model
encoded, encoded_len = jasper_encoder(audio_signal=processed_signal, length=processed_signal_len)
decoded = jasper_decoder(encoder_output=encoded)

## Obtain the train loss
train_loss = ce_loss(logits=decoded, labels=labels)


## Compile the Test Graph for NeMo

In [13]:
# Now we build the test graph in a similar way, reusing the above components
## Build the test data loader and preprocess same way as train graph
## But note, we do not add the spectrogram augmentation to the test graph !
test_audio_signal, test_audio_signal_len, test_labels, test_label_len = eval_data_layer()
test_processed_signal, test_processed_signal_len = data_preprocessor(
    input_signal=test_audio_signal, length=test_audio_signal_len
)
test_processed_signal, test_processed_signal_len = crop_pad_augmentation(
    input_signal=test_processed_signal, length=test_processed_signal_len
)

# Pass the test data through the model encoder and decoder
test_encoded, test_encoded_len = jasper_encoder(
    audio_signal=test_processed_signal, length=test_processed_signal_len
)
test_decoded = jasper_decoder(encoder_output=test_encoded)

# Compute test loss for visualization
test_loss = ce_loss(logits=test_decoded, labels=test_labels)

## Setting up callbacks for training and test set evaluation, and checkpoint saving

In [14]:
# Now that we have our training and evaluation graphs built,
# we can focus on a few callbacks to help us save the model checkpoints
# during training, as well as display train and test metrics

# Callbacks needed to print train info to console and Tensorboard
train_callback = nemo.core.SimpleLossLoggerCallback(
    # Notice that we pass in loss, predictions, and the labels.
    # Of course we would like to see our training loss, but we need the
    # other arguments to calculate the accuracy.
    tensors=[train_loss, decoded, labels],
    # The print_func defines what gets printed.
    print_func=partial(monitor_classification_training_progress, eval_metric=None),
    get_tb_values=lambda x: [("loss", x[0])],
    tb_writer=neural_factory.tb_writer,
)

# Callbacks needed to print test info to console and Tensorboard
tagname = 'TestSet'
eval_callback = nemo.core.EvaluatorCallback(
    eval_tensors=[test_loss, test_decoded, test_labels],
    user_iter_callback=partial(process_classification_evaluation_batch, top_k=1),
    user_epochs_done_callback=partial(process_classification_evaluation_epoch, eval_metric=1, tag=tagname),
    eval_step=200,  # How often we evaluate the model on the test set
    tb_writer=neural_factory.tb_writer,
)

# Callback to save model checkpoints
chpt_callback = nemo.core.CheckpointCallback(
    folder=neural_factory.checkpoint_dir,
    step_freq=1000,
)

# Prepare a list of callbacks to pass to the engine
callbacks = [train_callback, eval_callback, chpt_callback]

# Training the model

Even with such a small model (about 74k parameters, per the parameter count logged above) and just 5 epochs (which should take only a few minutes to train), you should be able to get a test set accuracy in the range 85 - 90% on the Speech Commands task. Not bad for a 30-way (v1) or 35-way (v2) classification problem!

Experiment with increasing the number of epochs or with batch size to see how much you can improve the score!
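
Changing the number of epochs or the batch size also changes the total number of training steps, which in turn stretches or compresses the learning rate schedule. The sketch below shows a generic cosine schedule with linear warmup; it is not NeMo's `CosineAnnealing` implementation, and the function name `cosine_lr_with_warmup` is an assumption. The numeric values mirror the hyperparameters used in this notebook (`lr = 0.05`, `min_lr = 0.001`, `warmup_ratio = 0.05`, 5 epochs of 894 steps).

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr, min_lr=0.0, warmup_ratio=0.0):
    """Generic cosine decay with linear warmup (illustrative, not NeMo's code)."""
    warmup_steps = int(warmup_ratio * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from a small value up to base_lr.
        return base_lr * (step + 1) / (warmup_steps + 1)
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


total_steps = 5 * 894  # num_epochs * steps_per_epoch from this notebook
schedule = [
    cosine_lr_with_warmup(s, total_steps, base_lr=0.05, min_lr=0.001, warmup_ratio=0.05)
    for s in range(total_steps + 1)
]
print(f"first: {schedule[0]:.5f}, peak: {max(schedule):.5f}, last: {schedule[-1]:.5f}")
```

With more epochs the decay phase becomes longer, so the model spends more steps at intermediate learning rates before reaching `min_lr`.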

In [21]:
import time

In [22]:
start = time.time()

# Now we have all the components required to train the model

# Define a learning rate schedule
lr_policy = CosineAnnealing(
    total_steps=num_epochs * steps_per_epoch,
    warmup_ratio=0.05,
    min_lr=0.001,
)

logging.info(f"Using `{lr_policy}` Learning Rate Scheduler")

# Finally, let's train this model!
neural_factory.train(
    tensors_to_optimize=[train_loss],
    callbacks=callbacks,
    lr_policy=lr_policy,
    optimizer="novograd",
    optimization_params={
        "num_epochs": num_epochs,
        "max_steps": None,
        "lr": lr,
        "momentum": 0.95,
        "betas": (0.98, 0.5),
        "weight_decay": weight_decay,
        "grad_norm_clip": None,
    },
    batches_per_step=1,
)

end = time.time()

[NeMo I 2020-05-07 15:13:54 <ipython-input-22-689768946717>:13] Using `<nemo.utils.lr_policies.CosineAnnealing object at 0x7f29045a0b50>` Learning Rate Scheduler
[NeMo I 2020-05-07 15:13:54 callbacks:187] Starting .....
[NeMo I 2020-05-07 15:13:54 callbacks:359] Found 2 modules with weights:
[NeMo I 2020-05-07 15:13:54 callbacks:361] JasperEncoder
[NeMo I 2020-05-07 15:13:54 callbacks:361] JasperDecoderForClassification
[NeMo I 2020-05-07 15:13:54 callbacks:362] Total model parameters: 73602
[NeMo I 2020-05-07 15:13:54 callbacks:311] Found checkpoint folder .//home/fjia/data/freesound_resampled/quartznet-3x1-v1combo_balanced_all_Mfcc_prob0.8/checkpoints. Will attempt to restore checkpoints from it.


[NeMo W 2020-05-07 15:13:54 callbacks:328] For module JasperDecoderForClassification, no file matches  in .//home/fjia/data/freesound_resampled/quartznet-3x1-v1combo_balanced_all_Mfcc_prob0.8/checkpoints
[NeMo W 2020-05-07 15:13:54 callbacks:330] Checkpoint folder .//home/fjia/data/freesound_resampled/quartznet-3x1-v1combo_balanced_all_Mfcc_prob0.8/checkpoints was present but nothing was restored. Continuing training from random initialization.


[NeMo I 2020-05-07 15:13:54 callbacks:199] Starting epoch 0
[NeMo I 2020-05-07 15:13:55 callbacks:224] Step: 0
[NeMo I 2020-05-07 15:13:55 helpers:104] Loss: 0.7503116726875305
[NeMo I 2020-05-07 15:13:55 helpers:110] training_batch_top@1:  46.0938
[NeMo I 2020-05-07 15:13:55 callbacks:239] Step time: 0.20498371124267578 seconds
[NeMo I 2020-05-07 15:13:55 callbacks:445] Doing Evaluation ..............................


	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha)


[NeMo I 2020-05-07 15:13:57 callbacks:450] Evaluation time: 1.4857428073883057 seconds
[NeMo I 2020-05-07 15:13:58 callbacks:224] Step: 25
[NeMo I 2020-05-07 15:13:58 helpers:104] Loss: 0.3551883101463318
[NeMo I 2020-05-07 15:13:58 helpers:110] training_batch_top@1:  86.7188
[NeMo I 2020-05-07 15:13:58 callbacks:239] Step time: 0.04430961608886719 seconds
[NeMo I 2020-05-07 15:13:59 callbacks:224] Step: 50
[NeMo I 2020-05-07 15:13:59 helpers:104] Loss: 0.2272559553384781
[NeMo I 2020-05-07 15:13:59 helpers:110] training_batch_top@1:  90.6250
[NeMo I 2020-05-07 15:13:59 callbacks:239] Step time: 0.04366898536682129 seconds
[NeMo I 2020-05-07 15:14:00 callbacks:224] Step: 75
[NeMo I 2020-05-07 15:14:00 helpers:104] Loss: 0.25523388385772705
[NeMo I 2020-05-07 15:14:00 helpers:110] training_batch_top@1:  90.6250
[NeMo I 2020-05-07 15:14:00 callbacks:239] Step time: 0.04039311408996582 seconds
[NeMo I 2020-05-07 15:14:01 callbacks:224] Step: 100
[NeMo I 2020-05-07 15:14:01 helpers:104] Lo

[NeMo I 2020-05-07 15:14:29 callbacks:224] Step: 650
[NeMo I 2020-05-07 15:14:29 helpers:104] Loss: 0.1436813473701477
[NeMo I 2020-05-07 15:14:29 helpers:110] training_batch_top@1:  96.0938
[NeMo I 2020-05-07 15:14:29 callbacks:239] Step time: 0.04389381408691406 seconds
[NeMo I 2020-05-07 15:14:30 callbacks:224] Step: 675
[NeMo I 2020-05-07 15:14:30 helpers:104] Loss: 0.2161560356616974
[NeMo I 2020-05-07 15:14:30 helpers:110] training_batch_top@1:  94.5312
[NeMo I 2020-05-07 15:14:30 callbacks:239] Step time: 0.04391002655029297 seconds
[NeMo I 2020-05-07 15:14:31 callbacks:224] Step: 700
[NeMo I 2020-05-07 15:14:31 helpers:104] Loss: 0.22580504417419434
[NeMo I 2020-05-07 15:14:31 helpers:110] training_batch_top@1:  92.1875
[NeMo I 2020-05-07 15:14:31 callbacks:239] Step time: 0.040617942810058594 seconds
[NeMo I 2020-05-07 15:14:32 callbacks:224] Step: 725
[NeMo I 2020-05-07 15:14:32 helpers:104] Loss: 0.14303067326545715
[NeMo I 2020-05-07 15:14:32 helpers:110] training_batch_top

[NeMo I 2020-05-07 15:15:00 callbacks:239] Step time: 0.044036149978637695 seconds
[NeMo I 2020-05-07 15:15:01 callbacks:224] Step: 1275
[NeMo I 2020-05-07 15:15:01 helpers:104] Loss: 0.061141736805438995
[NeMo I 2020-05-07 15:15:01 helpers:110] training_batch_top@1:  99.2188
[NeMo I 2020-05-07 15:15:01 callbacks:239] Step time: 0.043936967849731445 seconds
[NeMo I 2020-05-07 15:15:02 callbacks:224] Step: 1300
[NeMo I 2020-05-07 15:15:02 helpers:104] Loss: 0.11661899089813232
[NeMo I 2020-05-07 15:15:02 helpers:110] training_batch_top@1:  97.6562
[NeMo I 2020-05-07 15:15:02 callbacks:239] Step time: 0.04396557807922363 seconds
[NeMo I 2020-05-07 15:15:04 callbacks:224] Step: 1325
[NeMo I 2020-05-07 15:15:04 helpers:104] Loss: 0.09770214557647705
[NeMo I 2020-05-07 15:15:04 helpers:110] training_batch_top@1:  97.6562
[NeMo I 2020-05-07 15:15:04 callbacks:239] Step time: 0.04267072677612305 seconds
[NeMo I 2020-05-07 15:15:05 callbacks:224] Step: 1350
[NeMo I 2020-05-07 15:15:05 helpers:

[NeMo I 2020-05-07 15:15:34 callbacks:224] Step: 1900
[NeMo I 2020-05-07 15:15:34 helpers:104] Loss: 0.2002420425415039
[NeMo I 2020-05-07 15:15:34 helpers:110] training_batch_top@1:  92.9688
[NeMo I 2020-05-07 15:15:34 callbacks:239] Step time: 0.04331159591674805 seconds
[NeMo I 2020-05-07 15:15:35 callbacks:224] Step: 1925
[NeMo I 2020-05-07 15:15:35 helpers:104] Loss: 0.2031973898410797
[NeMo I 2020-05-07 15:15:35 helpers:110] training_batch_top@1:  93.7500
[NeMo I 2020-05-07 15:15:35 callbacks:239] Step time: 0.043143272399902344 seconds
[NeMo I 2020-05-07 15:15:36 callbacks:224] Step: 1950
[NeMo I 2020-05-07 15:15:36 helpers:104] Loss: 0.1090618371963501
[NeMo I 2020-05-07 15:15:36 helpers:110] training_batch_top@1:  96.0938
[NeMo I 2020-05-07 15:15:36 callbacks:239] Step time: 0.040833234786987305 seconds
[NeMo I 2020-05-07 15:15:37 callbacks:224] Step: 1975
[NeMo I 2020-05-07 15:15:37 helpers:104] Loss: 0.08569393306970596
[NeMo I 2020-05-07 15:15:37 helpers:110] training_batch

[NeMo I 2020-05-07 15:16:05 callbacks:224] Step: 2525
[NeMo I 2020-05-07 15:16:05 helpers:104] Loss: 0.10689946264028549
[NeMo I 2020-05-07 15:16:05 helpers:110] training_batch_top@1:  95.3125
[NeMo I 2020-05-07 15:16:05 callbacks:239] Step time: 0.040480613708496094 seconds
[NeMo I 2020-05-07 15:16:06 callbacks:224] Step: 2550
[NeMo I 2020-05-07 15:16:06 helpers:104] Loss: 0.08988360315561295
[NeMo I 2020-05-07 15:16:06 helpers:110] training_batch_top@1:  96.0938
[NeMo I 2020-05-07 15:16:06 callbacks:239] Step time: 0.04315996170043945 seconds
[NeMo I 2020-05-07 15:16:07 callbacks:224] Step: 2575
[NeMo I 2020-05-07 15:16:07 helpers:104] Loss: 0.05460293963551521
[NeMo I 2020-05-07 15:16:07 helpers:110] training_batch_top@1:  99.2188
[NeMo I 2020-05-07 15:16:07 callbacks:239] Step time: 0.04502248764038086 seconds
[NeMo I 2020-05-07 15:16:08 callbacks:224] Step: 2600
[NeMo I 2020-05-07 15:16:08 helpers:104] Loss: 0.08232153952121735
[NeMo I 2020-05-07 15:16:08 helpers:110] training_bat

[NeMo I 2020-05-07 15:16:36 callbacks:239] Step time: 0.046907663345336914 seconds
[NeMo I 2020-05-07 15:16:37 callbacks:224] Step: 3150
[NeMo I 2020-05-07 15:16:37 helpers:104] Loss: 0.13977688550949097
[NeMo I 2020-05-07 15:16:37 helpers:110] training_batch_top@1:  95.3125
[NeMo I 2020-05-07 15:16:37 callbacks:239] Step time: 0.04143500328063965 seconds
[NeMo I 2020-05-07 15:16:38 callbacks:224] Step: 3175
[NeMo I 2020-05-07 15:16:38 helpers:104] Loss: 0.14391139149665833
[NeMo I 2020-05-07 15:16:38 helpers:110] training_batch_top@1:  95.3125
[NeMo I 2020-05-07 15:16:38 callbacks:239] Step time: 0.04433631896972656 seconds
[NeMo I 2020-05-07 15:16:39 callbacks:224] Step: 3200
[NeMo I 2020-05-07 15:16:39 helpers:104] Loss: 0.0497245192527771
[NeMo I 2020-05-07 15:16:39 helpers:110] training_batch_top@1:  99.2188
[NeMo I 2020-05-07 15:16:39 callbacks:239] Step time: 0.049695491790771484 seconds
[NeMo I 2020-05-07 15:16:39 callbacks:445] Doing Evaluation ..............................
[

[NeMo I 2020-05-07 15:17:09 callbacks:224] Step: 3775
[NeMo I 2020-05-07 15:17:09 helpers:104] Loss: 0.09113907068967819
[NeMo I 2020-05-07 15:17:09 helpers:110] training_batch_top@1:  96.0938
[NeMo I 2020-05-07 15:17:09 callbacks:239] Step time: 0.04052901268005371 seconds
[NeMo I 2020-05-07 15:17:10 callbacks:224] Step: 3800
[NeMo I 2020-05-07 15:17:10 helpers:104] Loss: 0.10832063108682632
[NeMo I 2020-05-07 15:17:10 helpers:110] training_batch_top@1:  96.8750
[NeMo I 2020-05-07 15:17:10 callbacks:239] Step time: 0.04409003257751465 seconds
[NeMo I 2020-05-07 15:17:10 callbacks:445] Doing Evaluation ..............................
[NeMo I 2020-05-07 15:17:12 callbacks:450] Evaluation time: 1.4619972705841064 seconds
[NeMo I 2020-05-07 15:17:13 callbacks:224] Step: 3825
[NeMo I 2020-05-07 15:17:13 helpers:104] Loss: 0.07323834300041199
[NeMo I 2020-05-07 15:17:13 helpers:110] training_batch_top@1:  97.6562
[NeMo I 2020-05-07 15:17:13 callbacks:239] Step time: 0.04062795639038086 secon

[NeMo I 2020-05-07 15:17:41 callbacks:224] Step: 4400
[NeMo I 2020-05-07 15:17:41 helpers:104] Loss: 0.06995581835508347
[NeMo I 2020-05-07 15:17:41 helpers:110] training_batch_top@1:  97.6562
[NeMo I 2020-05-07 15:17:41 callbacks:239] Step time: 0.040346622467041016 seconds
[NeMo I 2020-05-07 15:17:41 callbacks:445] Doing Evaluation ..............................
[NeMo I 2020-05-07 15:17:42 callbacks:450] Evaluation time: 1.4363057613372803 seconds
[NeMo I 2020-05-07 15:17:43 callbacks:224] Step: 4425
[NeMo I 2020-05-07 15:17:43 helpers:104] Loss: 0.09613458812236786
[NeMo I 2020-05-07 15:17:43 helpers:110] training_batch_top@1:  95.3125
[NeMo I 2020-05-07 15:17:43 callbacks:239] Step time: 0.04565882682800293 seconds
[NeMo I 2020-05-07 15:17:44 callbacks:224] Step: 4450
[NeMo I 2020-05-07 15:17:44 helpers:104] Loss: 0.10379315912723541
[NeMo I 2020-05-07 15:17:44 helpers:110] training_batch_top@1:  97.6562
[NeMo I 2020-05-07 15:17:44 callbacks:239] Step time: 0.04911184310913086 seco

In [23]:
dur = end - start

In [24]:
dur

232.0180184841156

## Inference

In [15]:
# Paths to the local checkpoint files (the download commands below are commented out)

base_checkpoint_path = '/home/fjia/code/NeMo-fei/examples/asr/quartznet_VAD_2balanced_o0_200ep_mfcc_prob88/results/quartznet_VAD_2balanced_o0_200ep_mfcc_prob88/1'
# base_checkpoint_path = './quartznet_VAD_2balanced_o0_200ep_mfcc_prob88/results/quartznet_VAD_2balanced_o0_200ep_mfcc_prob88/1'
CHECKPOINT_ENCODER = os.path.join(base_checkpoint_path, 'JasperEncoder-STEP-87000.pt')
CHECKPOINT_DECODER = os.path.join(base_checkpoint_path, 'JasperDecoderForClassification-STEP-87000.pt')

# if not os.path.exists(base_checkpoint_path):
#     os.makedirs(base_checkpoint_path)
    
# if not os.path.exists(CHECKPOINT_ENCODER):
#     !wget https://api.ngc.nvidia.com/v2/models/nvidia/google_speech_commands_v2___matchboxnet_3x1x1/versions/1/files/JasperEncoder-STEP-89000.pt -P {base_checkpoint_path};

# if not os.path.exists(CHECKPOINT_DECODER):
#     !wget https://api.ngc.nvidia.com/v2/models/nvidia/google_speech_commands_v2___matchboxnet_3x1x1/versions/1/files/JasperDecoderForClassification-STEP-89000.pt -P {base_checkpoint_path};

In [16]:
# Path to the local checkpoint directory

base_checkpoint_path = '/home/fjia/code/NeMo-fei/examples/asr/quartznet_VAD_2balanced_o0_200ep_mels_prob88/results/quartznet_VAD_2balanced_o0_200ep_mels_prob88/1'


In [17]:
model_path  = base_checkpoint_path

# Evaluation of incorrectly predicted samples

Given that we have a trained model that performs reasonably well, let's listen to the samples where the model is least confident in its predictions.

**NOTE**: The following code depends on the librosa library. To install it, run the following code block first.

!pip install librosa

In [25]:
# Let's add a path to the checkpoint dir
model_path = neural_factory.checkpoint_dir

In [18]:
model_path

'/home/fjia/code/NeMo-fei/examples/asr/quartznet_VAD_2balanced_o0_200ep_mels_prob88/results/quartznet_VAD_2balanced_o0_200ep_mels_prob88/1'

## Extract the predictions from the model

We want the actual logits of the model instead of just the final evaluation score, so we use `NeuralModuleFactory.infer(...)` to extract the logits for each batch of samples provided.
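
Once the logits are available, a per-sample confidence can be derived with a softmax, and the least confident samples can be picked out for listening. The sketch below uses plain Python lists as a stand-in for the tensors returned by inference; the helper names `softmax` and `least_confident` and the toy logits are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def least_confident(batch_logits, k=3):
    """Return (indices of the k lowest top-1 probabilities, all confidences)."""
    confidences = [max(softmax(row)) for row in batch_logits]
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return order[:k], confidences


# Toy logits for three samples and two classes (made-up values).
toy_logits = [[4.0, 0.0], [0.1, 0.0], [0.0, 2.0]]
worst, confidences = least_confident(toy_logits, k=2)
print(worst, [round(c, 3) for c in confidences])
```

Sample 1 has nearly equal logits, so its top-1 probability is close to 0.5 and it is ranked as the least confident.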

In [19]:
# --- Inference Only --- #
# We've already built the inference DAG above, so all we need is to call infer().
evaluated_tensors = neural_factory.infer(
    # These are the tensors we want to get from the model.
    tensors=[test_loss, test_decoded, test_labels],
    # checkpoint_dir specifies where the model params are loaded from.
    checkpoint_dir=model_path,
)

[NeMo I 2020-05-08 17:42:02 actions:1493] Restoring JasperEncoder from /home/fjia/code/NeMo-fei/examples/asr/quartznet_VAD_2balanced_o0_200ep_mels_prob88/results/quartznet_VAD_2balanced_o0_200ep_mels_prob88/1/JasperEncoder-STEP-89400.pt
[NeMo I 2020-05-08 17:42:02 actions:1493] Restoring JasperDecoderForClassification from /home/fjia/code/NeMo-fei/examples/asr/quartznet_VAD_2balanced_o0_200ep_mels_prob88/results/quartznet_VAD_2balanced_o0_200ep_mels_prob88/1/JasperDecoderForClassification-STEP-89400.pt
[NeMo I 2020-05-08 17:42:02 actions:734] Evaluating batch 0 out of 109




[NeMo I 2020-05-08 17:42:03 actions:734] Evaluating batch 10 out of 109
[NeMo I 2020-05-08 17:42:03 actions:734] Evaluating batch 20 out of 109
[NeMo I 2020-05-08 17:42:03 actions:734] Evaluating batch 30 out of 109
[NeMo I 2020-05-08 17:42:03 actions:734] Evaluating batch 40 out of 109
[NeMo I 2020-05-08 17:42:03 actions:734] Evaluating batch 50 out of 109
[NeMo I 2020-05-08 17:42:04 actions:734] Evaluating batch 60 out of 109
[NeMo I 2020-05-08 17:42:04 actions:734] Evaluating batch 70 out of 109
[NeMo I 2020-05-08 17:42:04 actions:734] Evaluating batch 80 out of 109
[NeMo I 2020-05-08 17:42:04 actions:734] Evaluating batch 90 out of 109
[NeMo I 2020-05-08 17:42:04 actions:734] Evaluating batch 100 out of 109


## Accuracy calculation

In [20]:
correct_count = 0
total_count = 0

for batch_idx, (logits, labels) in enumerate(zip(evaluated_tensors[1], evaluated_tensors[2])):
    acc = classification_accuracy(
        logits=logits,
        targets=labels,
        top_k=[1]
    )

    # Select top 1 accuracy only
    acc = acc[0]

    # Since accuracy here is "per batch", we simply denormalize it by multiplying
    # by batch size to recover the count of correct samples.
    # Round before casting so floating-point error cannot drop a correct sample.
    correct_count += int(round(float(acc) * logits.size(0)))
    total_count += logits.size(0)

logging.info(f"Total correct / Total count : {correct_count} / {total_count}")
logging.info(f"Final accuracy : {correct_count / float(total_count)}")

[NeMo I 2020-05-08 17:42:05 <ipython-input-20-674fb7de9132>:19] Total correct / Total count : 13740 / 13861
[NeMo I 2020-05-08 17:42:05 <ipython-input-20-674fb7de9132>:20] Final accuracy : 0.9912704711059808
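
The per-batch accuracy above is converted back into a count of correct samples. As a rough alternative sketch, the matches can be counted directly from the logits, which sidesteps floating-point rounding. It assumes each batch has `logits` of shape `(batch, num_classes)` and integer `labels` of shape `(batch,)`. The helper `count_correct` and the toy tensors are illustrative only and are not part of NeMo.

```python
import torch

def count_correct(logits: torch.Tensor, labels: torch.Tensor) -> int:
    """Return the number of samples whose top-1 prediction matches the label."""
    preds = logits.argmax(dim=-1)
    return int((preds == labels.view(-1)).sum().item())

# Toy batches standing in for evaluated_tensors[1] and evaluated_tensors[2]
toy_logits = [
    torch.tensor([[2.0, 0.1], [0.3, 1.5], [0.9, 0.2]]),
    torch.tensor([[0.1, 0.7], [1.2, 0.4]]),
]
toy_labels = [torch.tensor([0, 1, 1]), torch.tensor([1, 1])]

correct = sum(count_correct(lg, lb) for lg, lb in zip(toy_logits, toy_labels))
total = sum(lb.numel() for lb in toy_labels)
print(f"{correct} / {total} = {correct / total:.2f}")
```

Summing integer counts per batch also stays exact when the last batch is smaller than the others.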


## Precision, Recall, and F1

In [21]:
import torch

In [22]:
from typing import List, Optional, Tuple

def binary_classification_confusion_matrix(
    logits: torch.Tensor, targets: torch.Tensor, top_k: Optional[List[int]] = None
) -> Tuple[int, int, int, int]:
    """
    Count the confusion-matrix entries of a binary classifier for one batch.

    Class index 1 is counted as positive (speech command) and class index 0
    as negative (background).

    Returns:
        (true_negative, false_negative, false_positive, true_positive)
    """
    if top_k is None:
        top_k = [1]
    max_k = max(top_k)

    true_negative, false_negative, false_positive, true_positive = 0, 0, 0, 0

    with torch.no_grad():
        _, predictions = logits.topk(max_k, dim=1, largest=True, sorted=True)
        # Keep only the top-1 prediction of each sample, as a flat vector
        predictions = predictions[:, 0]

        for i in range(predictions.size(0)):
            pred = int(predictions[i])
            targ = int(targets[i])
            if pred == 0 and targ == 0:
                true_negative += 1
            elif pred == 0 and targ == 1:
                false_negative += 1
            elif pred == 1 and targ == 0:
                false_positive += 1
            elif pred == 1 and targ == 1:
                true_positive += 1
            else:
                raise ValueError('Predictions or targets not in 0/1')

    return true_negative, false_negative, false_positive, true_positive
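
The per-sample loop above can also be written with boolean masks. Here is a minimal sketch, assuming a two-class problem where index 1 is the positive class and `targets` is a 1-D integer tensor. The name `confusion_counts` is hypothetical and the tensors are toy values.

```python
import torch

def confusion_counts(logits: torch.Tensor, targets: torch.Tensor):
    """Vectorized (TN, FN, FP, TP) counts for a two-class problem (1 = positive)."""
    preds = logits.argmax(dim=1)
    targets = targets.view(-1)
    tn = int(((preds == 0) & (targets == 0)).sum())
    fn = int(((preds == 0) & (targets == 1)).sum())
    fp = int(((preds == 1) & (targets == 0)).sum())
    tp = int(((preds == 1) & (targets == 1)).sum())
    return tn, fn, fp, tp

# Toy batch: the predictions are [0, 1, 1, 0]
toy_logits = torch.tensor([[2.0, 0.5], [0.1, 1.0], [0.3, 0.9], [1.5, 0.2]])
toy_targets = torch.tensor([0, 1, 0, 1])
counts = confusion_counts(toy_logits, toy_targets)
print(counts)
```

The return order matches the loop-based function, so the two can be swapped in the accumulation cell.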


In [23]:
correct_count = 0
# total_count = 0

total_true_negative, total_false_negative , total_false_positive, total_true_positive = 0, 0, 0, 0
for batch_idx, (logits, labels) in enumerate(zip(evaluated_tensors[1], evaluated_tensors[2])):
    true_negative, false_negative , false_positive, true_positive = binary_classification_confusion_matrix(
        logits=logits,
        targets=labels,
        top_k=[1]
    )

    total_true_negative += true_negative
    total_false_negative += false_negative
    total_false_positive += false_positive
    total_true_positive  += true_positive

logging.info(f" TN : {total_true_negative}")
logging.info(f" FN : {total_false_negative}")
logging.info(f" FP : {total_false_positive}")
logging.info(f" TP : {total_true_positive}")
precision = total_true_positive / (total_true_positive + total_false_positive)
recall = total_true_positive / (total_true_positive + total_false_negative)
f1_score =  2 * precision * recall / (precision + recall)
logging.info(f"Final Precision: {precision}")
logging.info(f"Final Recall : {recall}")
logging.info(f"Final F1 score : {f1_score}")

[NeMo I 2020-05-08 17:42:48 <ipython-input-23-3fdd580e5c72>:17]  TN : 6814
[NeMo I 2020-05-08 17:42:48 <ipython-input-23-3fdd580e5c72>:18]  FN : 75
[NeMo I 2020-05-08 17:42:48 <ipython-input-23-3fdd580e5c72>:19]  FP : 46
[NeMo I 2020-05-08 17:42:48 <ipython-input-23-3fdd580e5c72>:20]  TP : 6926
[NeMo I 2020-05-08 17:42:48 <ipython-input-23-3fdd580e5c72>:24] Final Precision: 0.9934021801491681
[NeMo I 2020-05-08 17:42:48 <ipython-input-23-3fdd580e5c72>:25] Final Recall : 0.9892872446793315
[NeMo I 2020-05-08 17:42:48 <ipython-input-23-3fdd580e5c72>:26] Final F1 score : 0.991340442281543


## Filtering out incorrect samples
Let us now pick out the samples in the test set that the model labeled incorrectly.

In [28]:
import librosa
import json
import IPython.display as ipd

In [29]:
# First, let's create a utility class to remap the integer class labels to the actual string labels
class ReverseMapLabel:
    def __init__(self, data_layer: nemo_asr.AudioToSpeechLabelDataLayer):
        self.label2id = dict(data_layer._dataset.label2id)
        self.id2label = dict(data_layer._dataset.id2label)

    def __call__(self, pred_idx, label_idx):
        return self.id2label[pred_idx], self.id2label[label_idx]

In [37]:
# Next, let's get the indices of all the incorrectly labeled samples
sample_idx = 0
incorrect_preds = []
rev_map = ReverseMapLabel(eval_data_layer)

# Remember, evaluated_tensors = (loss, logits, labels)
for batch_idx, (logits, labels) in enumerate(zip(evaluated_tensors[1], evaluated_tensors[2])):
    probs = torch.softmax(logits, dim=-1)
    probas, preds = torch.max(probs, dim=-1)

    incorrect_ids = (preds != labels).nonzero()
    for idx in incorrect_ids:
        proba = float(probas[idx][0])
        pred = int(preds[idx][0])
        label = int(labels[idx][0])
        idx = int(idx[0]) + sample_idx

        incorrect_preds.append((idx, *rev_map(pred, label), proba))

    sample_idx += labels.size(0)

logging.info(f"Num test samples : {total_count}")
logging.info(f"Num errors : {len(incorrect_preds)}")

# Sort by prediction confidence, most confident mistakes first
incorrect_preds = sorted(incorrect_preds, key=lambda x: x[-1], reverse=True)

[NeMo I 2020-05-04 19:20:18 <ipython-input-37-6b4c4e5eff4b>:22] Num test samples : 13861
[NeMo I 2020-05-04 19:20:18 <ipython-input-37-6b4c4e5eff4b>:23] Num errors : 208
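
To make the index bookkeeping above easier to follow, here is a small self-contained sketch of the same idea on toy data. It keeps a running offset so that per-batch positions become global test-set indices. The function name `collect_errors` and the toy tensors are illustrative, and integer class ids are returned instead of string labels.

```python
import torch

def collect_errors(batches):
    """Return (global_index, predicted_id, label_id, confidence) for each mistake."""
    errors, offset = [], 0
    for logits, labels in batches:
        probs = torch.softmax(logits, dim=-1)
        confidences, preds = probs.max(dim=-1)
        # Positions inside this batch where the prediction is wrong
        wrong = torch.nonzero(preds != labels).flatten()
        for i in wrong.tolist():
            errors.append((offset + i, int(preds[i]), int(labels[i]), float(confidences[i])))
        offset += labels.size(0)
    # Most confident mistakes first
    return sorted(errors, key=lambda e: e[-1], reverse=True)

toy_batches = [
    (torch.tensor([[3.0, 0.0], [0.0, 1.0]]), torch.tensor([1, 1])),
    (torch.tensor([[0.0, 0.5]]), torch.tensor([0])),
]
errors = collect_errors(toy_batches)
for e in errors:
    print(e)
```

The single mistake in the second batch is reported with global index 2, because the first batch contributed two samples.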


In [38]:
# Now, let's get the indices of all the correctly labeled samples
sample_idx = 0
correct_preds = []
rev_map = ReverseMapLabel(eval_data_layer)

# Remember, evaluated_tensors = (loss, logits, labels)
for batch_idx, (logits, labels) in enumerate(zip(evaluated_tensors[1], evaluated_tensors[2])):
    probs = torch.softmax(logits, dim=-1)
    probas, preds = torch.max(probs, dim=-1)

    correct_ids = (preds == labels).nonzero()
    for idx in correct_ids:
        proba = float(probas[idx][0])
        pred = int(preds[idx][0])
        label = int(labels[idx][0])
        idx = int(idx[0]) + sample_idx

        correct_preds.append((idx, *rev_map(pred, label), proba))

    sample_idx += labels.size(0)

logging.info(f"Num test samples : {total_count}")
logging.info(f"Num correct : {len(correct_preds)}")

# Sort by prediction confidence, most confident predictions first
correct_preds = sorted(correct_preds, key=lambda x: x[-1], reverse=True)

[NeMo I 2020-05-04 19:20:20 <ipython-input-38-aa5f2c5e1dea>:22] Num test samples : 13861
[NeMo I 2020-05-04 19:20:20 <ipython-input-38-aa5f2c5e1dea>:23] Num correct : 13653


## Examine a subset of incorrect samples
Let's print the (test id, predicted label, ground truth label, confidence) tuple of the first 50 incorrectly labeled samples.

In [39]:
for incorrect_sample in incorrect_preds[:50]:
    
#     if incorrect_sample[2] == 'background':
#         print(incorrect_sample)
    logging.info(str(incorrect_sample))

[NeMo I 2020-05-04 19:20:26 <ipython-input-39-e74830bb82ce>:5] (5887, 'commands', 'background', 0.999923825263977)
[NeMo I 2020-05-04 19:20:26 <ipython-input-39-e74830bb82ce>:5] (1716, 'commands', 'background', 0.9997192025184631)
[NeMo I 2020-05-04 19:20:26 <ipython-input-39-e74830bb82ce>:5] (2720, 'commands', 'background', 0.9995511174201965)
[NeMo I 2020-05-04 19:20:26 <ipython-input-39-e74830bb82ce>:5] (7094, 'background', 'commands', 0.9978386759757996)
[NeMo I 2020-05-04 19:20:26 <ipython-input-39-e74830bb82ce>:5] (8307, 'background', 'commands', 0.9977680444717407)
[NeMo I 2020-05-04 19:20:26 <ipython-input-39-e74830bb82ce>:5] (10848, 'background', 'commands', 0.9977318644523621)
[NeMo I 2020-05-04 19:20:26 <ipython-input-39-e74830bb82ce>:5] (9008, 'background', 'commands', 0.9976598024368286)
[NeMo I 2020-05-04 19:20:26 <ipython-input-39-e74830bb82ce>:5] (12281, 'background', 'commands', 0.9975294470787048)
[NeMo I 2020-05-04 19:20:26 <ipython-input-39-e74830bb82ce>:5] (13719, 

## Define a threshold below which we designate a model's prediction as "low confidence"

In [None]:
# Count the incorrect predictions whose confidence is at or below the threshold
low_confidence_threshold = 0.55
count_low_confidence = len(list(filter(lambda x: x[-1] <= low_confidence_threshold, incorrect_preds)))
logging.info(f"Number of low confidence predictions : {count_low_confidence}")

In [None]:
# Count the incorrect predictions whose confidence is at or above the threshold
high_confidence_threshold = 0.99
count_high_confidence = len(list(filter(lambda x: x[-1] >= high_confidence_threshold, incorrect_preds)))
logging.info(f"Number of high confidence predictions : {count_high_confidence}")
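
The two counts can also be obtained in one pass that keeps the matching tuples around for later inspection. This is a minimal sketch, assuming each prediction is a tuple whose last element is the confidence, as in `incorrect_preds`. The function name `split_by_confidence` and the toy tuples are made up for illustration.

```python
def split_by_confidence(preds, low=0.55, high=0.99):
    """Partition prediction tuples by their last element (the confidence)."""
    low_conf = [p for p in preds if p[-1] <= low]
    high_conf = [p for p in preds if p[-1] >= high]
    return low_conf, high_conf

# Toy tuples in the (test id, predicted label, ground truth label, confidence) layout
toy_preds = [
    (3, 'commands', 'background', 0.995),
    (7, 'background', 'commands', 0.51),
    (9, 'background', 'commands', 0.80),
]
low_conf, high_conf = split_by_confidence(toy_preds)
print(len(low_conf), len(high_conf))
```

Predictions between the two thresholds fall into neither list, which mirrors the two separate filters above.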

# Let's hear the samples in which the model has the least confidence!

In [None]:
# First, let's create a helper function to parse the manifest files
def parse_manifest(manifest):
    data = []
    for line in manifest:
        line = json.loads(line)
        data.append(line)

    return data
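
For reference, here is a hedged sketch of what `parse_manifest` consumes: an iterable of lines, each holding one JSON object. The keys `audio_filepath`, `offset` and `duration` are the ones read later in this notebook. The `label` key and the file paths below are assumptions made for this example.

```python
import io
import json

def parse_manifest(manifest):
    """Parse an iterable of JSON lines into a list of dicts."""
    data = []
    for line in manifest:
        data.append(json.loads(line))
    return data

# Two hypothetical manifest lines; the second has no 'offset' field
toy_manifest = io.StringIO(
    '{"audio_filepath": "/data/example/clip_0.wav", "offset": 0.0, "duration": 0.63, "label": "commands"}\n'
    '{"audio_filepath": "/data/example/clip_1.wav", "duration": 0.63, "label": "background"}\n'
)
samples = parse_manifest(toy_manifest)
print(len(samples), samples[0]["audio_filepath"])
```

Because the function accepts any iterable of strings, it works on an open file handle as well as on a list returned by `readlines()`.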

In [None]:
# Next, let's create a helper function to actually listen to certain samples
def listen_to_file(sample_id, pred=None, label=None, proba=None):
    # Load the audio waveform using librosa
    filepath = test_samples[sample_id]['audio_filepath']
    if 'offset' in test_samples[sample_id]:
        # The manifest entry points at a segment of a longer file
        audio, sample_rate = librosa.load(filepath,
                                          offset=test_samples[sample_id]['offset'],
                                          duration=test_samples[sample_id]['duration'])
    else:
        audio, sample_rate = librosa.load(filepath)

    if pred is not None and label is not None and proba is not None:
        logging.info(f"filepath: {filepath}, Sample : {sample_id} Prediction : {pred} Label : {label} Confidence = {proba: 0.4f}")
    else:
        logging.info(f"Sample : {sample_id}")

    return ipd.Audio(audio, rate=sample_rate)


In [None]:
# Now let's load the test manifest(s) into memory.
# `test_dataset` may hold several comma-separated manifest paths.
all_test_samples = []
for manifest_path in test_dataset.split(','):
    with open(manifest_path, 'r') as test_f:
        manifest_lines = test_f.readlines()
        print(manifest_path, len(manifest_lines))
        all_test_samples.extend(manifest_lines)

test_samples = parse_manifest(all_test_samples)
print(len(test_samples))

In [None]:
incorrect_preds

In [None]:
# Finally, let's listen to the audio samples where the model made a mistake
# Note: This list of incorrect samples may be quite large, so you may choose to subsample `incorrect_preds`

exist = set()  # file paths already played (the duplicate check below is disabled)
for sample_id, pred, label, proba in incorrect_preds[:200]:
    filepath = test_samples[sample_id]['audio_filepath']

    print(test_samples[sample_id])
#     if filepath not in exist:
    ipd.display(listen_to_file(sample_id, pred=pred, label=label, proba=proba))
    exist.add(filepath)

# Inference

In [41]:
data_json = test_dataset
data = []
# Read the manifest line by line and close the file afterwards
with open(data_json, 'r') as f:
    for line in f:
        data.append(json.loads(line))


In [47]:
import scipy.io.wavfile as wave
sample_rate, signal = wave.read(data[0]['audio_filepath'])

# make sure that sample rate is the same as expected by Jasper
# assert sample_rate == model_definition['sample_rate']

In [52]:
eval_data_layer2 = nemo_asr.AudioToSpeechLabelDataLayer(
    manifest_filepath=test_dataset,
    sample_rate=sample_rate,
    labels=labels,
    batch_size=batch_size,
    num_workers=os.cpu_count(),
    shuffle=False,
)

[NeMo I 2020-05-05 15:38:20 collections:222] Filtered duration for loading collection is 1.173313.


In [53]:
# Now we build the test graph in a similar way, reusing the above components
## Build the test data loader and preprocess same way as train graph
## But note, we do not add the spectrogram augmentation to the test graph !
test_audio_signal, test_audio_signal_len, test_labels, test_label_len = eval_data_layer2()
test_processed_signal, test_processed_signal_len = data_preprocessor(
    input_signal=test_audio_signal, length=test_audio_signal_len
)
test_processed_signal, test_processed_signal_len = crop_pad_augmentation(
    input_signal=test_processed_signal, length=test_processed_signal_len
)

# Pass the test data through the model encoder and decoder
test_encoded, test_encoded_len = jasper_encoder(
    audio_signal=test_processed_signal, length=test_processed_signal_len
)
test_decoded = jasper_decoder(encoder_output=test_encoded)

# Compute test loss for visualization
# test_loss = ce_loss(logits=test_decoded, labels=test_labels)

In [54]:
evaluated_tensors = neural_factory.infer(
    # These are the tensors we want to get from the model.
    tensors=[test_decoded, test_labels],
    # checkpoint_dir specifies where the model params are loaded from.
    checkpoint_dir=model_path
    )

[NeMo I 2020-05-05 15:38:32 actions:1493] Restoring JasperEncoder from .//home/fjia/data/freesound_resampled/quartznet-3x1-v1combo_balanced_all_Mfcc/checkpoints/JasperEncoder-STEP-4465.pt
[NeMo I 2020-05-05 15:38:32 actions:1493] Restoring JasperDecoderForClassification from .//home/fjia/data/freesound_resampled/quartznet-3x1-v1combo_balanced_all_Mfcc/checkpoints/JasperDecoderForClassification-STEP-4465.pt


KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/fjia/anaconda3/envs/vad/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/fjia/anaconda3/envs/vad/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/fjia/anaconda3/envs/vad/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/fjia/anaconda3/envs/vad/lib/python3.7/site-packages/nemo/collections/asr/parts/dataset.py", line 402, in __getitem__
    t = self.label2id[sample.label]
KeyError: 'background'
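
The `KeyError: 'background'` above is raised at `self.label2id[sample.label]`, which suggests that the manifest contains a label missing from the `labels` list passed to the data layer. Below is a hedged sketch of a pre-flight check that could be run before building the data layer. The helper `find_unknown_labels`, the `label` manifest key and the toy data are assumptions for illustration and are not part of NeMo.

```python
import json

def find_unknown_labels(manifest_lines, labels, label_key="label"):
    """Return the sorted manifest labels that are absent from `labels`."""
    known = set(labels)
    unknown = set()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if entry[label_key] not in known:
            unknown.add(entry[label_key])
    return sorted(unknown)

# Toy manifest lines and a label list that lacks 'background'
toy_lines = [
    '{"audio_filepath": "a.wav", "duration": 0.63, "label": "commands"}',
    '{"audio_filepath": "b.wav", "duration": 0.63, "label": "background"}',
]
missing = find_unknown_labels(toy_lines, labels=["commands"])
print(missing)
```

An empty result means every manifest label has an id; a non-empty one lists the labels that would trigger the lookup error.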
