In [19]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.

## Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install unidecode

# ## Install NeMo
!python -m pip install --upgrade git+https://github.com/NVIDIA/NeMo.git@candidate#egg=nemo_toolkit[asr]

## Install TorchAudio
# Quote the requirement so the shell does not treat `>=0.6.0` as an output redirect
!pip install 'torchaudio>=0.6.0' -f https://download.pytorch.org/whl/torch_stable.html

## Grab the config we'll use in this example
!mkdir configs

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
Collecting nemo_toolkit[asr]
  Cloning https://github.com/NVIDIA/NeMo.git (to revision candidate) to /tmp/pip-install-67y1lgyr/nemo-toolkit
  Running command git clone -q https://github.com/NVIDIA/NeMo.git /tmp/pip-install-67y1lgyr/nemo-toolkit
  Running command git checkout -b candidate --track origin/candidate
  Switched to a new branch 'candidate'
  Branch 'candidate' set up to track remote branch 'candidate' from 'origin'.


Building wheels for collected packages: nemo-toolkit
  Building wheel for nemo-toolkit (setup.py) ... done
  Created wheel for nemo-toolkit: filename=nemo_toolkit-0.88.1b0-py3-none-any.whl size=358253 sha256=d4dec7c364d16d5333568649b7b5ac0fd6089a43d859b75d3f2399dde403dac9
  Stored in directory: /tmp/pip-ephem-wheel-cache-_inurkyq/wheels/de/df/80/16e33a10e05d7769182e3cab4910a5fb304706a3133df2716b
Successfully built nemo-toolkit
Installing collected packages: nemo-toolkit
  Attempting uninstall: nemo-toolkit
    Found existing installation: nemo-toolkit 0.88.1b0
    Uninstalling nemo-toolkit-0.88.1b0:
      Successfully uninstalled nemo-toolkit-0.88.1b0
Successfully installed nemo-toolkit-0.88.1b0
mkdir: cannot create directory ‘configs’: File exists


# Introduction

This VAD tutorial is based on the MatchboxNet model from the paper "[MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition](https://arxiv.org/abs/2004.08531)" with a modified decoder head to suit classification tasks.

The notebook will follow the steps below:

 - Dataset preparation: instructions for downloading the datasets and converting them to a format suitable for use with nemo_asr.
 - Audio preprocessing (feature extraction): signal normalization, windowing, and (log) spectrogram (or mel-scale spectrogram, or MFCC) computation.
 - Data augmentation using SpecAugment ("[SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)") to increase the number of data samples.
 - Development of a small neural classification model that can be trained efficiently.
 - Model training on the Google Speech Commands dataset and the Freesound dataset in NeMo.
 - Evaluation of the model's error cases by listening to the samples.
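
To make the windowing step concrete, here is a minimal sketch of the framing arithmetic. It assumes a 16 kHz sample rate and a 10 ms window stride (the values in the config loaded later in this notebook) and the common centered-STFT convention of `1 + n_samples // hop_length` frames; NeMo's preprocessor may differ in its details. The helper name `num_frames` is hypothetical.

```python
def num_frames(duration_s, sample_rate=16000, window_stride_s=0.01):
    """Approximate number of feature frames for a clip, assuming centered framing."""
    n_samples = int(round(duration_s * sample_rate))
    hop_length = int(round(window_stride_s * sample_rate))
    return 1 + n_samples // hop_length


# A 0.63 s segment (the segment duration used in the manifests of this tutorial)
print(num_frames(0.63))  # 64
```

Under these assumptions a 0.63 s segment yields 64 frames, which is consistent with the `timesteps: 64` entry in the model config.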

In [5]:
# Some utility imports
import os
from omegaconf import OmegaConf

# Data Preparation

## Download the background data
We suggest using the background categories of the [freesound](https://freesound.org/) dataset as our non-speech/background data.
We provide scripts for downloading and resampling it. Please have a look at [NeMo docs VAD Data Preparation](https://docs.nvidia.com/deeplearning/nemo/developer_guide/en/v0.11.0/voice_activity_detection/tutorial.html#data-preparation). Note that downloading this dataset may take hours.

**NOTE:** This tutorial serves as a demonstration of how to train and evaluate VAD models using NeMo, so we avoid the freesound dataset here and instead use the `_background_noise_` category of the Google Speech Commands Dataset as non-speech/background data.

## Download the speech data
   
We will use the open-source Google Speech Commands Dataset as our speech data. This tutorial uses V2 of the dataset, but only very minor changes are required to support V1. Google Speech Commands Dataset V2 takes roughly 6GB of disk space. The scripts below download the dataset and convert it to a format suitable for use with nemo_asr.


**NOTE**: You may additionally pass the `--test_size` or `--val_size` flag to control how the data is split into train, val and test sets.

**NOTE**: You may additionally pass `--rebalance_method='fixed|over|under'` at the end of the script call to rebalance the class samples in the manifest.
* 'fixed': A fixed number of samples for each class: train 5000, val 1000, and test 1000. (Change the numbers in the script if you want.)
* 'over': Oversampling rebalance method
* 'under': Undersampling rebalance method
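
As a rough illustration of what these three rebalancing strategies usually mean, here is a minimal sketch on toy manifest entries. It is not the implementation in `process_vad_data.py`; the helper `rebalance` and its arguments are hypothetical, and it assumes 'over' grows every class to the size of the largest class while 'under' shrinks every class to the size of the smallest.

```python
import random
from collections import defaultdict


def rebalance(entries, method, fixed_n=5000, seed=0):
    """Return a class-balanced copy of `entries` (dicts with a 'label' key)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for entry in entries:
        by_label[entry["label"]].append(entry)

    sizes = [len(items) for items in by_label.values()]
    if method == "fixed":
        target = fixed_n
    elif method == "over":
        target = max(sizes)
    elif method == "under":
        target = min(sizes)
    else:
        raise ValueError(f"unknown method: {method}")

    balanced = []
    for items in by_label.values():
        if len(items) >= target:
            # Subsample without replacement
            balanced.extend(rng.sample(items, target))
        else:
            # Keep every sample and repeat random ones to reach the target
            balanced.extend(items + rng.choices(items, k=target - len(items)))
    return balanced


toy = [{"label": "speech"}] * 10 + [{"label": "background"}] * 3
for method in ("over", "under"):
    out = rebalance(toy, method)
    n_speech = sum(e["label"] == "speech" for e in out)
    n_background = sum(e["label"] == "background" for e in out)
    print(method, n_speech, n_background)
```

With the toy data above, 'over' yields 10 samples per class and 'under' yields 3 per class.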

**NOTE**: The `_background_noise_` category only has 6 audio files, so we generate more samples from these files to enlarge our background training data. If you want to use your own background noise data, just change `background_data_root` and delete `--generate`.
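
One plausible way to get many training samples out of a handful of long recordings is to slide a fixed-length window over each file and describe every window by an offset and a duration. The sketch below shows that idea only; it is not necessarily what `--generate` does, and the helper `segment_entries`, the file path, and the stride value are hypothetical.

```python
def segment_entries(filepath, total_duration, seg_duration=0.63, stride=0.63, label="background"):
    """Describe consecutive fixed-length windows of one long file as manifest-style entries."""
    entries = []
    n = 0
    # The small epsilon guards against floating-point error at the end of the file
    while n * stride + seg_duration <= total_duration + 1e-9:
        entries.append({
            "audio_filepath": filepath,
            "offset": round(n * stride, 2),
            "duration": seg_duration,
            "label": label,
        })
        n += 1
    return entries


segments = segment_entries("noise/example_noise.wav", total_duration=2.0)
print(len(segments), [s["offset"] for s in segments])
```

A smaller `stride` produces overlapping windows and therefore more (but more correlated) samples from the same recording.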


In [6]:
tmp = 'src'
data_folder = 'data'
os.makedirs(tmp, exist_ok=True)
os.makedirs(data_folder, exist_ok=True)

In [7]:
script = os.path.join(tmp, 'process_vad_data.py')
if not os.path.exists(script):
    !wget -P $tmp https://raw.githubusercontent.com/NVIDIA/NeMo/candidate/scripts/process_vad_data.py

--2020-08-26 14:08:53--  https://raw.githubusercontent.com/NVIDIA/NeMo/candidate/scripts/process_vad_data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 199.232.36.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|199.232.36.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18334 (18K) [text/plain]
Saving to: ‘src/process_vad_data.py’


2020-08-26 14:08:53 (282 KB/s) - ‘src/process_vad_data.py’ saved [18334/18334]



In [8]:
speech_data_root = os.path.join(data_folder, 'google_dataset_v2')
background_data_root = os.path.join(data_folder, 'google_dataset_v2/google_speech_recognition_v2/_background_noise_')  # or your <resampled freesound data directory>
out_dir = os.path.join(data_folder, 'manifest')
if not os.path.exists(speech_data_root):
    os.mkdir(speech_data_root)

In [11]:
!python $script \
    --out_dir={out_dir} \
    --speech_data_root={speech_data_root} \
    --background_data_root={background_data_root}\
    --log \
    --generate \
    --rebalance_method='fixed' 

INFO:root:Working on: google_speech_recognition_v2
INFO:root:Split speech data!
INFO:root:Overall: 105829, Train: 84663, Validatoin: 10583, Test: 10583
INFO:root:Finish spliting train, val and test for speech. Write to files!
INFO:root:Split background data!
INFO:root:Overall: 6, Train: 4, Validatoin: 1, Test: 1
INFO:root:Finish spliting train, val and test for background. Write to files!
INFO:root:=== Write speech data to manifest!
INFO:root:Val: Skip 140 samples. Get 10009 segments! => data/manifest/speech_validation_manifest.json 
INFO:root:Test: Skip 94 samples. Get 167331 segments! => data/manifest/speech_testing_manifest.json
INFO:root:Train: Skip 887 samples. Get 80497 segments!=> data/manifest/speech_training_manifest.json
INFO:root:Start generating more background noise data
INFO:root:Generate more background for data/google_dataset_v2/google_speech_recognition_v2/_background_noise_/background_validation_list.txt. => data/google_dataset_v2/google_speech_recognition_v2/_backgro

## Preparing the manifest file

Manifest files are the data structure used by NeMo to declare a few important details about the data:

1) `audio_filepath`: Refers to the path to the raw audio file <br>
2) `label`: The class label (speech or background) of this sample <br>
3) `duration`: The length of the audio file, in seconds.<br>
4) `offset`: The start of the segment, in seconds.
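
A manifest is stored with one JSON object per line. The following minimal sketch builds two such lines in memory and parses them back; the file paths and values are made up for illustration.

```python
import json

# Hypothetical entries using the four fields described above
entries = [
    {"audio_filepath": "clips/example_speech.wav", "duration": 0.63, "label": "speech", "offset": 0.0},
    {"audio_filepath": "clips/example_noise.wav", "duration": 0.63, "label": "background", "offset": 0.19},
]

# Serialize: one JSON object per line
manifest_text = "\n".join(json.dumps(entry) for entry in entries)

# Parse: read the manifest back line by line
parsed = [json.loads(line) for line in manifest_text.splitlines() if line.strip()]
for row in parsed:
    end = row["offset"] + row["duration"]
    print(f'{row["label"]}: {row["audio_filepath"]} [{row["offset"]:.2f}s - {end:.2f}s]')
```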

In [12]:
# change below if you don't have or don't want to use rebalanced data
train_dataset = 'data/manifest/balanced_background_training_manifest.json,data/manifest/balanced_speech_training_manifest.json' 
val_dataset = 'data/manifest/background_validation_manifest.json,data/manifest/speech_validation_manifest.json' 
test_dataset = 'data/manifest/balanced_background_testing_manifest.json,data/manifest/balanced_speech_testing_manifest.json' 

## Read a few rows of the manifest file 

Each row of a manifest file is a JSON object. For this tutorial, the rows contain:

1) `audio_filepath`: Refers to the path to the raw audio file <br>
2) `label`: The class label (speech or background) of this sample <br>
3) `duration`: The length of the audio segment, in seconds <br>
4) `offset`: The start of the segment, in seconds.

The rows printed below also carry a `text` field, which is set to `_`.

In [13]:
sample_test_dataset =  test_dataset.split(',')[0]

In [14]:
!head -n 5 {sample_test_dataset}

{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/white_noise.wav_90000.wav", "duration": 0.63, "label": "background", "text": "_", "offset": 0.19000000000000003}
{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/white_noise.wav_933000.wav", "duration": 0.63, "label": "background", "text": "_", "offset": 0.09}
{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/white_noise.wav_174000.wav", "duration": 0.63, "label": "background", "text": "_", "offset": 0.09}
{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/white_noise.wav_920000.wav", "duration": 0.63, "label": "background", "text": "_", "offset": 0.20000000000000004}
{"audio_filepath": "data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/white_noise.wav_209000.wav", "duration": 0.63, "label": "background", "text": "_", "offset": 0.3

# Training - Preparation

We will be training a MatchboxNet model from the paper "[MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition](https://arxiv.org/abs/2004.08531)", which evolved from the [QuartzNet](https://arxiv.org/pdf/1910.10261.pdf) model. The benefit of QuartzNet over JASPER models is that it uses separable convolutions, which greatly reduce the number of parameters required to get good model accuracy.

MatchboxNet models generally follow the model definition pattern MatchboxNet-[BxRxC], where B is the number of blocks, R is the number of convolutional sub-blocks per block, and C is the number of channels in these blocks. Each sub-block contains a 1-D masked convolution, batch normalization, ReLU, and dropout.
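
To see why separable convolutions save parameters, compare the weight counts of a regular 1-D convolution and a depthwise-plus-pointwise (separable) one. This is a back-of-the-envelope sketch that ignores biases; the kernel size of 13 is only an illustrative value, and the helper names are hypothetical.

```python
def regular_conv1d_params(c_in, c_out, kernel_size):
    # Every output channel has one filter of length `kernel_size` per input channel
    return c_in * c_out * kernel_size


def separable_conv1d_params(c_in, c_out, kernel_size):
    depthwise = c_in * kernel_size  # one filter of length `kernel_size` per input channel
    pointwise = c_in * c_out        # 1x1 convolution that mixes the channels
    return depthwise + pointwise


channels, kernel_size = 64, 13
print(regular_conv1d_params(channels, channels, kernel_size))    # 53248
print(separable_conv1d_params(channels, channels, kernel_size))  # 4928
```

For 64 channels the separable layer needs roughly a tenth of the weights, and the saving grows with the kernel size.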


In [20]:
# NeMo's "core" package
import nemo
# NeMo's ASR collection - this collection contains complete ASR models and
# building blocks (modules) for ASR
import nemo.collections.asr as nemo_asr

ImportError: cannot import name '_run_hydra' from 'hydra._internal.utils' (/home/fjia/anaconda3/envs/vad/lib/python3.7/site-packages/hydra/_internal/utils.py)

## Model Configuration
The MatchboxNet Model is defined in a config file which declares multiple important sections.

They are:

1) `model`: All arguments that will relate to the Model - preprocessors, encoder, decoder, optimizer and schedulers, datasets and any other related information

2) `trainer`: Any argument to be passed to PyTorch Lightning

In [12]:
MODEL_CONFIG = "matchboxnet_3x1x64_vad.yaml"

if not os.path.exists(f"configs/{MODEL_CONFIG}"):
  !wget -P configs/ "https://raw.githubusercontent.com/NVIDIA/NeMo/candidate/examples/asr/conf/{MODEL_CONFIG}"

In [13]:
# This line will print the entire config of the MatchboxNet model
config_path = f"configs/{MODEL_CONFIG}"
config = OmegaConf.load(config_path)
print(OmegaConf.to_yaml(config))

name: MatchboxNet-3x1x64-vad
model:
  sample_rate: 16000
  timesteps: 64
  repeat: 1
  dropout: 0.0
  kernel_size_factor: 1.0
  labels:
  - background
  - speech
  train_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: true
    augmentor:
      shift:
        prob: 1.0
        min_shift_ms: -5.0
        max_shift_ms: 5.0
      white_noise:
        prob: 1.0
        min_level: -90
        max_level: -46
  validation_ds:
    manifest_filepath: ???
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    val_loss_idx: 0
  test_ds:
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0
  preprocessor:
    cls: nemo.collections.asr.modules.AudioToMFCCPreprocessor
    params:
      window_size: 0.025
      window_stride: 0.01
      window: hann
      n_mels: 64
     

In [14]:
# Preserve some useful parameters
labels = config.model.labels
sample_rate = config.model.sample_rate

### Setting up the datasets within the config

Notice that there are a few config dictionaries called `train_ds`, `validation_ds` and `test_ds`. These are the configurations used to set up the Dataset and DataLoader of the corresponding split.



In [15]:
print(OmegaConf.to_yaml(config.model.train_ds))

manifest_filepath: ???
sample_rate: 16000
labels:
- background
- speech
batch_size: 128
shuffle: true
augmentor:
  shift:
    prob: 1.0
    min_shift_ms: -5.0
    max_shift_ms: 5.0
  white_noise:
    prob: 1.0
    min_level: -90
    max_level: -46



### `???` inside configs

You will often notice that some configs have `???` in place of paths. This is OmegaConf's placeholder for a mandatory value: it marks a field that the user must fill in before the value is used.
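
Before training, it can help to list which fields are still unset. For real configs, OmegaConf offers `OmegaConf.is_missing(cfg, key)` for a single key; the sketch below shows the same idea on a plain nested dict so that it runs without any dependencies. The helper `find_missing` and the toy config are hypothetical.

```python
def find_missing(cfg, prefix=""):
    """Return the dotted paths of all values that are still the '???' placeholder."""
    missing = []
    for key, value in cfg.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            missing.extend(find_missing(value, path))
        elif value == "???":
            missing.append(path)
    return missing


toy_cfg = {
    "model": {
        "train_ds": {"manifest_filepath": "???", "batch_size": 128},
        "validation_ds": {"manifest_filepath": "???", "batch_size": 128},
        "test_ds": {"manifest_filepath": None, "batch_size": 128},
    }
}
print(find_missing(toy_cfg))
```

Note that `test_ds.manifest_filepath` is `None` rather than `???`, so it counts as an optional value that was deliberately left empty, not as a missing one.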

Let's add the paths to the manifests to the config above.

In [16]:
config.model.train_ds.manifest_filepath = train_dataset
config.model.validation_ds.manifest_filepath = val_dataset
config.model.test_ds.manifest_filepath = test_dataset

## Building the PyTorch Lightning Trainer

NeMo models are primarily PyTorch Lightning modules - and therefore are entirely compatible with the PyTorch Lightning ecosystem!

Let's first instantiate a Trainer object!

In [17]:
import torch
import pytorch_lightning as pl

In [18]:
print("Trainer config - \n")
print(OmegaConf.to_yaml(config.trainer))

Trainer config - 

gpus: 0
max_epochs: 200
max_steps: null
num_nodes: 1
distributed_backend: ddp
accumulate_grad_batches: 1
checkpoint_callback: false
logger: false
row_log_interval: 1
val_check_interval: 1.0



In [19]:
# Let's modify some trainer configs for this demo
# Checks if we have GPU available and uses it
cuda = 1 if torch.cuda.is_available() else 0
config.trainer.gpus = cuda

# Reduces maximum number of epochs to 2 for quick demonstration
config.trainer.max_epochs = 2

# Remove distributed training flags
config.trainer.distributed_backend = None

In [20]:
trainer = pl.Trainer(**config.trainer)

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]


## Setting up a NeMo Experiment

NeMo has an experiment manager that handles logging and checkpointing for us, so let's use it!

In [21]:
from nemo.utils.exp_manager import exp_manager

In [22]:
exp_dir = exp_manager(trainer, config.get("exp_manager", None))

[NeMo I 2020-08-25 23:00:55 exp_manager:170] Experiments will be logged at /home/fjia/code/NeMo-fei/tutorials/asr/nemo_experiments/MatchboxNet-3x1x64-vad/2020-08-25_23-00-55
[NeMo I 2020-08-25 23:00:55 exp_manager:504] TensorboardLogger has been set up


[NeMo W 2020-08-25 23:00:55 exp_manager:538] trainer had a weights_save_path of cwd(). This was ignored.


In [23]:
# The exp_dir provides a path to the current experiment for easy access
exp_dir = str(exp_dir)
exp_dir

'/home/fjia/code/NeMo-fei/tutorials/asr/nemo_experiments/MatchboxNet-3x1x64-vad/2020-08-25_23-00-55'

## Building the MatchboxNet Model

MatchboxNet is an ASR model with a classification task - it generates one label for the entire provided audio stream. Therefore we encapsulate it inside the `EncDecClassificationModel` as follows.

In [24]:
asr_model = nemo_asr.models.EncDecClassificationModel(cfg=config.model, trainer=trainer)

[NeMo I 2020-08-25 23:00:55 collections:253] Filtered duration for loading collection is 0.000000.
[NeMo I 2020-08-25 23:00:55 collections:256] # 20000 files loaded accounting to # 2 labels
[NeMo I 2020-08-25 23:00:55 collections:253] Filtered duration for loading collection is 0.000000.
[NeMo I 2020-08-25 23:00:55 collections:256] # 15205 files loaded accounting to # 2 labels
[NeMo I 2020-08-25 23:00:55 collections:253] Filtered duration for loading collection is 0.000000.
[NeMo I 2020-08-25 23:00:55 collections:256] # 4000 files loaded accounting to # 2 labels


    Config key 'cls' is deprecated since Hydra 1.0 and will be removed in Hydra 1.1.
    Use '_target_' instead of 'cls'.
    See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/object_instantiation_changes
    
    Field 'params' is deprecated since Hydra 1.0 and will be removed in Hydra 1.1.
    Inline the content of params directly at the containing node.
    See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/object_instantiation_changes
    


# Training a MatchboxNet Model

As MatchboxNet is inherently a PyTorch Lightning Model, it can easily be trained in a single line - `trainer.fit(model)`!

### Monitoring training progress

Before we begin training, let's first create a TensorBoard visualization to monitor progress.


In [25]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

In [26]:
%tensorboard --logdir {exp_dir}

### Training for 2 epochs
We see below that the model begins to get modest scores on the validation set after just 2 epochs of training.

In [27]:
trainer.fit(asr_model)

    GeForce GT 710 with CUDA capability sm_35 is not compatible with the current PyTorch installation.
    The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
    If you want to use the GeForce GT 710 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
    
    


[NeMo I 2020-08-25 23:00:58 modelPT:465] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.95, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.05
        weight_decay: 0.001
    )
[NeMo I 2020-08-25 23:00:58 lr_scheduler:545] Scheduler "<nemo.core.optim.lr_scheduler.PolynomialHoldDecayAnnealing object at 0x7f5c69d034d0>" will be used during training (effective maximum steps = 312) - Parameters : ({'power': 2.0, 'warmup_ratio': 0.05, 'hold_ratio': 0.45, 'min_lr': 0.001, 'last_epoch': -1, 'max_steps': 312})



  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | preprocessor      | AudioToMFCCPreprocessor      | 0     
1 | encoder           | ConvASREncoder               | 73 K  
2 | decoder           | ConvASRDecoderClassification | 258   
3 | loss              | CrossEntropyLoss             | 0     
4 | spec_augmentation | SpectrogramAugmentation      | 0     
5 | _accuracy         | TopKClassificationAccuracy   | 0     
    






1

### Evaluation on the Test set

Let's compute the final score on the test set via `trainer.test(model)`.

In [28]:
trainer.test(asr_model, ckpt_path=None)

[NeMo I 2020-08-25 23:01:56 modelPT:465] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.95, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.05
        weight_decay: 0.001
    )
[NeMo I 2020-08-25 23:01:56 lr_scheduler:545] Scheduler "<nemo.core.optim.lr_scheduler.PolynomialHoldDecayAnnealing object at 0x7f5c691cbd10>" will be used during training (effective maximum steps = 312) - Parameters : ({'power': 2.0, 'warmup_ratio': 0.05, 'hold_ratio': 0.45, 'min_lr': 0.001, 'last_epoch': -1, 'max_steps': 312})


    



--------------------------------------------------------------------------------
TEST RESULTS
{'test_epoch_top@1': tensor(0.9902),
 'test_loss': tensor(0.0510, device='cuda:0')}
--------------------------------------------------------------------------------



{'test_loss': 0.050965823233127594, 'test_epoch_top@1': 0.9902499914169312}

# Fast Training

We can dramatically reduce the time taken to train this model by using multi-GPU training along with mixed precision.

For multi-GPU training, take a look at [the PyTorch Lightning Multi-GPU training section](https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html)

For mixed-precision training, take a look at [the PyTorch Lightning Mixed-Precision training section](https://pytorch-lightning.readthedocs.io/en/latest/apex.html)

```python
# Mixed precision:
trainer = pl.Trainer(amp_level='O1', precision=16)

# Trainer with a distributed backend:
trainer = pl.Trainer(gpus=2, num_nodes=2, distributed_backend='ddp')

# Of course, you can combine these flags as well.
```

# Evaluation of incorrectly predicted samples

Given that we have a trained model, which performs reasonably well, let's try to listen to the samples where the model is least confident in its predictions.

## Extract the predictions from the model

We want the actual logits of the model instead of just the final evaluation score, so we define a function that performs the forward step without computing the final loss and collects the logits for each batch of samples.
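
As a reminder of how logits turn into a prediction and a confidence value, here is a minimal sketch in plain Python for a single sample. The logit values are made up, and the label order follows the `labels` list of the config (`background`, `speech`).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["background", "speech"]
example_logits = [1.2, 0.2]  # hypothetical logits for one sample

probs = softmax(example_logits)
confidence = max(probs)
prediction = class_names[probs.index(confidence)]
print(prediction, round(confidence, 4))
```

The confidence is simply the probability of the winning class, so a value close to 0.5 in this two-class setting means the model was nearly undecided.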

## Accessing the data loaders

We can utilize the `setup_test_data` method in order to instantiate a data loader for the dataset we want to analyze.

For convenience, we can access these instantiated data loaders using the following accessors - `asr_model._train_dl`, `asr_model._validation_dl` and `asr_model._test_dl`.

In [29]:
asr_model.setup_test_data(config.model.test_ds)
test_dl = asr_model._test_dl

[NeMo I 2020-08-25 23:01:58 collections:253] Filtered duration for loading collection is 0.000000.
[NeMo I 2020-08-25 23:01:58 collections:256] # 4000 files loaded accounting to # 2 labels


## Partial Test Step

Below we define a utility function to perform most of the test step. For reference, the test step is defined as follows:

```python
    def test_step(self, batch, batch_idx, dataloader_idx=0):
        audio_signal, audio_signal_len, labels, labels_len = batch
        logits = self.forward(input_signal=audio_signal, input_signal_length=audio_signal_len)
        loss_value = self.loss(logits=logits, labels=labels)
        correct_counts, total_counts = self._accuracy(logits=logits, labels=labels)
        return {'test_loss': loss_value, 'test_correct_counts': correct_counts, 'test_total_counts': total_counts}
```

In [73]:
@torch.no_grad()
def extract_logits(model, dataloader):
    logits_buffer = []
    label_buffer = []

    # Follow the above definition of the test_step
    for batch in dataloader:
        audio_signal, audio_signal_len, labels, labels_len = batch
        logits = model(input_signal=audio_signal, input_signal_length=audio_signal_len)

        logits_buffer.append(logits)
        label_buffer.append(labels)
        print(".", end='')
    print()

    print("Finished extracting logits !")
    logits = torch.cat(logits_buffer, 0)
    labels = torch.cat(label_buffer, 0)
    return logits, labels


In [74]:
cpu_model = asr_model.cpu()
cpu_model.eval()
logits, labels = extract_logits(cpu_model, test_dl)
print("Logits:", logits.shape, "Labels :", labels.shape)

................................
Finished extracting logits !
Logits: torch.Size([4000, 2]) Labels : torch.Size([4000])


In [32]:
# Compute accuracy - `_accuracy` is a PyTorch Lightning Metric!
correct_count, total_count = cpu_model._accuracy(logits=logits, labels=labels)
print("Accuracy : ", float(correct_count * 100.) / float(total_count))

Accuracy :  99.025


# Add evaluation metrics

Here is an example of how to use more metrics (e.g. from pytorch_lightning) to evaluate your result.

**Note:** If you would like to add metrics for training and testing, have a look at `NeMo/nemo/collections/common/metrics`.


In [2]:
from pytorch_lightning.metrics.functional import confusion_matrix

In [89]:
logits.shape

torch.Size([4000, 2])

In [90]:
labels.shape

torch.Size([4000])

In [87]:
# `logits` has shape [N, 2] while `labels` has shape [N], so reduce the logits
# to one predicted class index per sample before building the confusion matrix
pred = logits.argmax(dim=1)
confusion_matrix(pred, labels)
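
For a two-class VAD model, precision and recall on the `speech` class are often more informative than accuracy alone. The following dependency-free sketch computes them from lists of predicted and true class indices. It assumes that index 1 is `speech`, as in the `labels` list of the config; the helper `binary_metrics` and the toy data are hypothetical.

```python
def binary_metrics(preds, targets, positive=1):
    """Confusion counts plus precision, recall and F1 for the `positive` class."""
    pairs = list(zip(preds, targets))
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    tn = sum(p != positive and t != positive for p, t in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


toy_preds = [1, 1, 0, 0, 1, 0]
toy_targets = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(toy_preds, toy_targets)
print(metrics)
```

With tensors such as `pred` and `labels`, the same function can be applied after converting them with `.tolist()`.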

## Filtering out incorrect samples
Let us now filter out the incorrectly predicted samples from the total set of samples in the test set.

In [33]:
import librosa
import json
import IPython.display as ipd

In [34]:
# First, let's create a utility class to remap the integer class labels to the actual string labels
class ReverseMapLabel:
    def __init__(self, data_loader):
        self.label2id = dict(data_loader.dataset.label2id)
        self.id2label = dict(data_loader.dataset.id2label)

    def __call__(self, pred_idx, label_idx):
        return self.id2label[pred_idx], self.id2label[label_idx]

In [35]:
# Next, let's get the indices of all the incorrectly predicted samples
sample_idx = 0
incorrect_preds = []
rev_map = ReverseMapLabel(test_dl)

# Convert the logits to probabilities and take the most likely class per sample
probs = torch.softmax(logits, dim=-1)
probas, preds = torch.max(probs, dim=-1)

# as_tuple=False keeps the [num_incorrect, 1] index tensor and avoids the
# deprecation warning raised by a bare nonzero() call
incorrect_ids = (preds != labels).nonzero(as_tuple=False)
for idx in incorrect_ids:
    proba = float(probas[idx][0])
    pred = int(preds[idx][0])
    label = int(labels[idx][0])
    idx = int(idx[0]) + sample_idx

    incorrect_preds.append((idx, *rev_map(pred, label), proba))

print(f"Num test samples : {total_count.item()}")
print(f"Num errors : {len(incorrect_preds)}")

# Sort by confidence of prediction, least confident first
incorrect_preds = sorted(incorrect_preds, key=lambda x: x[-1], reverse=False)


Num test samples : 4000.0
Num errors : 39


## Examine a subset of incorrect samples
Let's print out the (test id, predicted label, ground truth label, confidence) tuples of the first 20 incorrectly predicted samples.

In [43]:
for incorrect_sample in incorrect_preds[:20]:
    print(str(incorrect_sample))

(3156, 'background', 'speech', 0.7270654439926147)
(3557, 'background', 'speech', 0.7274996638298035)
(3979, 'background', 'speech', 0.7600748538970947)
(3599, 'background', 'speech', 0.9448409676551819)
(3370, 'background', 'speech', 0.9485548138618469)
(3027, 'background', 'speech', 0.9552394151687622)
(3999, 'background', 'speech', 0.9569714665412903)
(3456, 'background', 'speech', 0.9575883746147156)
(3054, 'background', 'speech', 0.9596822261810303)
(3614, 'background', 'speech', 0.9645829200744629)
(3265, 'background', 'speech', 0.9671289920806885)
(3330, 'background', 'speech', 0.9673179388046265)
(3267, 'background', 'speech', 0.9676496386528015)
(3925, 'background', 'speech', 0.9678117632865906)
(3608, 'background', 'speech', 0.9701297283172607)
(3118, 'background', 'speech', 0.972405195236206)
(3989, 'background', 'speech', 0.9731476902961731)
(3793, 'background', 'speech', 0.9735272526741028)
(3272, 'background', 'speech', 0.974219560623169)
(3702, 'background', 'speech', 0.

## Define a threshold below which we designate a model's prediction as "low confidence"

In [64]:
# Count how many such samples exist
low_confidence_threshold = 0.8
count_low_confidence = len(list(filter(lambda x: x[-1] <= low_confidence_threshold, incorrect_preds)))
print(f"Number of low confidence predictions : {count_low_confidence}")

Number of low confidence predictions : 3


## Let's listen to the samples in which the model has the least confidence!

In [65]:
# First, let's create a helper function to parse the manifest files
def parse_manifest(manifest):
    data = []
    for line in manifest:
        line = json.loads(line)
        data.append(line)

    return data

In [66]:
# Next, let's create a helper function to actually listen to certain samples
def listen_to_file(sample_id, pred=None, label=None, proba=None):
    # Load the audio waveform using librosa; if the manifest entry has an
    # offset, load only the segment it describes
    sample = test_samples[sample_id]
    filepath = sample['audio_filepath']
    if 'offset' in sample:
        audio, sample_rate = librosa.load(filepath,
                                          offset=sample['offset'],
                                          duration=sample['duration'])
    else:
        audio, sample_rate = librosa.load(filepath)

    if pred is not None and label is not None and proba is not None:
        print(f"filepath: {filepath}, Sample : {sample_id} Prediction : {pred} Label : {label} Confidence = {proba: 0.4f}")
    else:
        print(f"Sample : {sample_id}")

    return ipd.Audio(audio, rate=sample_rate)


In [67]:
# Now let's load the test manifests into memory
all_test_samples = []
for manifest_path in test_dataset.split(','):
    print(manifest_path)
    with open(manifest_path, 'r') as test_f:
        all_test_samples.extend(test_f.readlines())
print(len(all_test_samples))
test_samples = parse_manifest(all_test_samples)

data/manifest/balanced_background_testing_manifest.json
data/manifest/balanced_speech_testing_manifest.json
4000


In [68]:
# Finally, let's listen to all the audio samples where the model made a mistake
# Note: This list of incorrect samples may be quite large, so you may choose to subsample `incorrect_preds`
count = min(count_low_confidence, 20)  # replace this line with just `count_low_confidence` to listen to all samples with low confidence

for sample_id, pred, label, proba in incorrect_preds[:count]:
    ipd.display(listen_to_file(sample_id, pred=pred, label=label, proba=proba))

filepath: data/google_dataset_v2/google_speech_recognition_v2/_background_noise_more/running_tap.wav_713000.wav, Sample : 3156 Prediction : background Label : speech Confidence =  0.7271


filepath: data/google_dataset_v2/google_speech_recognition_v2/marvin/e1936ce8_nohash_0.wav, Sample : 3557 Prediction : background Label : speech Confidence =  0.7275


filepath: data/google_dataset_v2/google_speech_recognition_v2/marvin/e1936ce8_nohash_0.wav, Sample : 3979 Prediction : background Label : speech Confidence =  0.7601


# Transfer Learning & Fine-tuning on a new dataset
For transfer learning, please refer to the [**Transfer learning** part of the ASR tutorial](https://github.com/NVIDIA/NeMo/blob/candidate/tutorials/asr/01_ASR_with_NeMo.ipynb).

For more details on saving and restoring checkpoints and exporting a model in its entirety, please refer to the [**Fine-tuning on a new dataset** and **Advanced Usage** parts of the Speech Commands tutorial](https://github.com/NVIDIA/NeMo/blob/candidate/tutorials/asr/02_Speech_Commands.ipynb).





# Inference and more
If you are interested in **pretrained** models and **streaming inference**, please have a look at [Offline_and_Online_VAD_Demo](https://github.com/NVIDIA/NeMo/blob/candidate/tutorials/asr/07_Online_Voice_Activity_Detection_Demo.ipynb).

