In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.
!pip install wget
!pip install git+https://github.com/NVIDIA/apex.git
!pip install nemo-toolkit
!pip install nemo-asr
!pip install unidecode
# [TODO]
!mkdir configs
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/master/examples/asr/configs/quartznet_speech_commands_3x1_v1.yaml
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/master/examples/asr/configs/quartznet_speech_commands_3x1_v2.yaml

In [1]:
# Import some necessary libraries
import os
import argparse

import copy
import math
import glob
from functools import partial
from datetime import datetime
from ruamel.yaml import YAML

# Introduction

This VAD tutorial is based on the MatchboxNet model from the paper "[MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition](https://arxiv.org/abs/2004.08531)" with a modified decoder head to suit classification tasks.

The notebook will follow the steps below:

 - Dataset preparation: Preparing Google Speech Commands dataset

 - Audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC)

 - Data augmentation using SpecAugment "[SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)" to increase the number of data samples.
 
 - Develop a small Neural classification model which can be trained efficiently.
 
 - Model training on the Google Speech Commands dataset and Freesound dataset in NeMo.
 
 - Evaluation of the model's error cases by listening to the samples
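
To make the SpecAugment step in the list above more concrete, here is a minimal pure-Python sketch of frequency and time masking on a toy spectrogram, stored as a list of frequency bins over time. The function name `mask_spectrogram` and its parameters are hypothetical and exist only for illustration; the notebook itself relies on NeMo's `SpectrogramAugmentation` module, whose options may differ.

```python
import random

def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Return a copy of `spec` with one frequency band and one time band zeroed.

    `spec` is a list of rows; each row is one frequency bin over time.
    All names here are illustrative and not part of NeMo.
    """
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy rows so the input stays untouched

    # Frequency masking: zero out `f_width` consecutive frequency bins.
    f_width = rng.randint(0, max_freq_width)
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time

    # Time masking: zero out `t_width` consecutive time steps in every bin.
    t_width = rng.randint(0, max_time_width)
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return out

rng = random.Random(0)
toy_spec = [[1.0] * 10 for _ in range(8)]  # 8 frequency bins, 10 time steps
masked = mask_spectrogram(toy_spec, max_freq_width=3, max_time_width=4, rng=rng)
for row in masked:
    print(row)
```

Because the mask positions are drawn at random on every call, the same audio clip yields a different masked spectrogram each time it is seen during training.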

In [3]:
speech_data_root = '/home/fjia/data/google_dataset_v2'
background_data_root = '/home/fjia/data/freesound_resampled_background'

# Data Preparation

## Download the Freesound dataset
Note that downloading this dataset may take hours. We provide scripts that you can customize to suit your needs.

1. Download the provided download and resampling scripts

In [3]:
# [TODO]
# !wget https://raw.githubusercontent.com/NVIDIA/NeMo/master/scripts/proceta.py

# download_resample_freesound.sh
# freesound_download.py
# freesound_private_apikey.py
# freesound_resample.py
# freesound_requirements.txt

2. The scripts require several packages, including freesound, requests, requests_oauthlib, joblib, librosa and sox. If they are not installed, please run:

In [None]:
!pip install -r freesound_requirements.txt

3. Create an API key for freesound.org at https://freesound.org/help/developers/ and paste the client_id and api_key into freesound_private_apikey.py
4. Authorize by running `python freesound_download.py --authorize`, visiting the website it points you to, and pasting the response code

In [5]:
!python freesound_download.py --authorize


5. Feel free to change any arguments for freesound_download.py in **download_resample_freesound.sh** such as max_samples and max_filesize
6. Run `bash download_resample_freesound.sh <max number of samples you want> <download data directory> <resampled data directory> `

In [None]:
!bash download_resample_freesound.sh 4000 ./freesound  ./freesound_resampled_background

We will be using the open-source Google Speech Commands Dataset (this tutorial uses V1 of the dataset, but only very minor changes are required to support V2). The scripts below download the dataset and convert it to a format suitable for use with nemo_asr.

## Download the dataset

The dataset must be prepared using the scripts provided under the `{NeMo root directory}/scripts` sub-directory. 

Run the commands below to execute the data preparation script (`process_speech_commands_data.py`, found in the scripts directory mentioned above).

**NOTE**: You should have at least 4GB of disk space available if you use `--data_version=1`, and at least 6GB if you use `--data_version=2`. Downloading and processing the data also take some time, so go grab a coffee.

**NOTE**: You may additionally pass a `--rebalance` flag at the end of the `process_speech_commands_data.py` script to rebalance the class samples in the manifest.

!mkdir {data_dir}
!python process_speech_commands_data.py --data_root={data_dir} --data_version={DATASET_VER}
print("Dataset ready !")

## Prepare the path to manifest files

dataset_path = 'google_speech_recognition_v{0}'.format(DATASET_VER)
dataset_basedir = os.path.join(data_dir, dataset_path)

train_dataset = os.path.join(dataset_basedir, 'train_manifest.json')
val_dataset = os.path.join(dataset_basedir, 'validation_manifest.json')
# NOTE: the validation manifest is also used as the test set here
test_dataset = os.path.join(dataset_basedir, 'validation_manifest.json')

In [35]:
!python process_vad_data.py --speech_data_root={speech_data_root} --background_data_root={background_data_root} --log=False

INFO:root:Working on: google_speech_recognition_v2
./frame_manifest/speech_testing_manifest.json
./frame_manifest/background_testing_manifest.json


# COMBO 2 
## Background + Speech Command 57k, 7k, 7k

In [6]:
train_dataset='./manifest/background_training_manifest.json,./manifest/speech_training_manifest.json' 
test_dataset='./manifest/all_test.json' 



## Read a few rows of the manifest file 

Manifest files are the data structure NeMo uses to declare a few important details about the data:

1) `audio_filepath`: Refers to the path to the raw audio file <br>
2) `label`: The class label (speech or background) of this sample <br>
3) `duration`: The length of the audio file, in seconds.<br>
4) `offset`: The start of the segment, in seconds.
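
Each manifest line is a standalone JSON object, so the file can be inspected with the standard library alone. The sketch below parses two entries shaped like the rows printed in this notebook and summarizes them; the helper name `read_manifest` is hypothetical and not part of NeMo.

```python
import json
from collections import Counter

def read_manifest(lines):
    """Parse manifest lines (one JSON object per line) into dictionaries."""
    return [json.loads(line) for line in lines if line.strip()]

# Two sample entries in the same shape as the manifest rows shown in this notebook.
sample_lines = [
    '{"audio_filepath": "/home/fjia/data/freesound_resampled_background/Bus/id_353178 buspass.wav", '
    '"duration": 1.0, "label": "background", "text": "_", "offset": 20.0}',
    '{"audio_filepath": "/home/fjia/data/freesound_resampled_background/Bus/id_353178 buspass.wav", '
    '"duration": 1.0, "label": "background", "text": "_", "offset": 21.0}',
]

entries = read_manifest(sample_lines)
label_counts = Counter(entry["label"] for entry in entries)
total_seconds = sum(entry["duration"] for entry in entries)
print(label_counts, total_seconds)
```

To run the same summary on a real manifest, pass an open file object instead of the list, for example `read_manifest(open(path))` for a single manifest path.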

In [7]:
!tail -n 10 {test_dataset}

{"audio_filepath": "/home/fjia/data/freesound_resampled_background/Bus/id_353178 buspass.wav", "duration": 1.0, "label": "background", "text": "_", "offset": 20.0}
{"audio_filepath": "/home/fjia/data/freesound_resampled_background/Bus/id_353178 buspass.wav", "duration": 1.0, "label": "background", "text": "_", "offset": 21.0}
{"audio_filepath": "/home/fjia/data/freesound_resampled_background/Bus/id_353178 buspass.wav", "duration": 1.0, "label": "background", "text": "_", "offset": 22.0}
{"audio_filepath": "/home/fjia/data/freesound_resampled_background/Bus/id_353178 buspass.wav", "duration": 1.0, "label": "background", "text": "_", "offset": 23.0}
{"audio_filepath": "/home/fjia/data/freesound_resampled_background/Bus/id_353178 buspass.wav", "duration": 1.0, "label": "background", "text": "_", "offset": 24.0}
{"audio_filepath": "/home/fjia/data/freesound_resampled_background/Bus/id_353178 buspass.wav", "duration": 1.0, "label": "background", "text": "_", "offset": 25.0}

# Training - Preparation

We will be training a QuartzNet model from the paper "[QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions](https://arxiv.org/pdf/1910.10261.pdf)". The benefit of QuartzNet over Jasper models is that it uses separable convolutions, which greatly reduce the number of parameters required to get good model accuracy.

QuartzNet models generally follow the model definition pattern QuartzNet-[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within each block. Each sub-block contains a 1-D masked convolution, batch normalization, ReLU, and dropout.
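
To see why separable convolutions save parameters, the sketch below compares the weight count of a standard 1D convolution with that of a depthwise convolution followed by a pointwise (1x1) convolution. Biases are ignored, and the channel and kernel sizes are illustrative assumptions rather than values read from the notebook's config file.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights in a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel_size

def separable_conv1d_params(c_in, c_out, kernel_size):
    """Weights in a depthwise (per-channel) conv followed by a 1x1 pointwise conv."""
    depthwise = c_in * kernel_size
    pointwise = c_in * c_out
    return depthwise + pointwise

# Illustrative sizes only; they are not read from the notebook's config file.
c_in, c_out, k = 64, 64, 13
standard = conv1d_params(c_in, c_out, k)
separable = separable_conv1d_params(c_in, c_out, k)
print(standard, separable, round(standard / separable, 1))
```

The saving grows with the kernel size, because the kernel width multiplies only the small depthwise term instead of the full `c_in * c_out` product.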


In [8]:
FEAT_VERSION = 'v1'
COMBO_VERSION = 'combo_balanced_all_Mfcc_prob0.8'

In [9]:
# Let's load the config file for the QuartzNet 3x1 model
# Here we will be using separable convolutions
# with 3 blocks (B=3), each repeated once (R=1)
yaml = YAML(typ="safe")
with open("configs/quartznet_vad_3x1_{0}.yaml".format(FEAT_VERSION)) as f:
    jasper_params = yaml.load(f)

# Pre-define a set of labels that this model must learn to predict
labels = jasper_params['labels']

# Get the sampling rate of the data
sample_rate = jasper_params['sample_rate']

In [10]:
# Import NeMo core functionality
# NeMo's "core" package
import nemo
# NeMo's ASR collection
import nemo.collections.asr as nemo_asr
# NeMo's learning rate policy
from nemo.utils.lr_policies import CosineAnnealing
from nemo.collections.asr.helpers import (
    monitor_classification_training_progress,
    process_classification_evaluation_batch,
    process_classification_evaluation_epoch,
)
from nemo.collections.asr.metrics import classification_accuracy, classification_confusion_matrix

logging = nemo.logging




## Define some model hyperparameters

In [11]:
# Let's define some hyperparameters
lr = 0.05
num_epochs = 5
batch_size = 128
weight_decay = 0.001

## Define the NeMo components

In [12]:
result_dir = '.'

In [13]:
# Create a Neural Factory
# It creates log files and tensorboard writers for us among other functions
neural_factory = nemo.core.NeuralModuleFactory(
    log_dir='./{0}/quartznet-3x1-{1}{2}'.format(result_dir, FEAT_VERSION, COMBO_VERSION),
    create_tb_writer=True)
tb_writer = neural_factory.tb_writer

In [14]:
# Check if data augmentation such as white noise and time shift augmentation should be used
audio_augmentor = jasper_params.get('AudioAugmentor', None)

# Build the input data layer and the preprocessing layers for the train set
train_data_layer = nemo_asr.AudioToSpeechLabelDataLayer(
    manifest_filepath=train_dataset,
    labels=labels,
    sample_rate=sample_rate,
    batch_size=batch_size,
    num_workers=os.cpu_count(),
    augmentor=audio_augmentor,
    shuffle=True
)

# Build the input data layer and the preprocessing layers for the test set
eval_data_layer = nemo_asr.AudioToSpeechLabelDataLayer(
    manifest_filepath=test_dataset,
    sample_rate=sample_rate,
    labels=labels,
    batch_size=batch_size,
    num_workers=os.cpu_count(),
    shuffle=False,
)

# Alternatively, convert the raw audio data into mel spectrogram features:
# data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
#     sample_rate=sample_rate, **jasper_params["AudioToMelSpectrogramPreprocessor"],
# )

# We convert the raw audio data into MFCC features to feed as input to our model
data_preprocessor = nemo_asr.AudioToMFCCPreprocessor(
    sample_rate=sample_rate, **jasper_params["AudioToMFCCPreprocessor"],
)



# Compute the total number of samples and the number of training steps per epoch
N = len(train_data_layer)
steps_per_epoch = math.ceil(N / float(batch_size) + 1)

logging.info("Steps per epoch : {0}".format(steps_per_epoch))
logging.info('Have {0} examples to train on.'.format(N))

# Here we begin defining all of the augmentations we want
# We will pad the preprocessed spectrogram image to have a certain number of timesteps
# This centers the generated spectrogram and adds black boundaries to either side
# of the padded image.
crop_pad_augmentation = nemo_asr.CropOrPadSpectrogramAugmentation(audio_length=128)

# We also optionally add `SpecAugment` augmentations based on the config file
# SpecAugment has various possible augmentations to the generated spectrogram
# 1) Frequency band masking
# 2) Time band masking
# 3) Rectangular cutout
spectr_augment_config = jasper_params.get('SpectrogramAugmentation', None)

if spectr_augment_config:
    data_spectr_augmentation = nemo_asr.SpectrogramAugmentation(**spectr_augment_config)

# Build the QuartzNet Encoder model
# The config defines the layers as a list of dictionaries
# The first and last two blocks are not considered when we say QuartzNet-[BxR]
# B is counted as the number of blocks after the first layer and before the penultimate layer.
# R is defined as the number of repetitions of each block in B.
# Note: We can scale the convolution kernels size by the float parameter `kernel_size_factor`
jasper_encoder = nemo_asr.JasperEncoder(**jasper_params["JasperEncoder"])

# We then define the QuartzNet decoder.
# This decoder head is specialized for the task of classification: it
# accepts a set of `N-feat` features per timestep from the encoder, averages them
# over all the timesteps, and then applies a linear classification layer to the result.
jasper_decoder = nemo_asr.JasperDecoderForClassification(
    feat_in=jasper_params["JasperEncoder"]["jasper"][-1]["filters"],
    num_classes=len(labels),
    **jasper_params['JasperDecoderForClassification'],
)

# We can easily apply cross entropy loss to train this model
ce_loss = nemo_asr.CrossEntropyLossNM()

[NeMo I 2020-05-26 15:45:20 collections:232] Filtered duration for loading collection is 8.211812.
[NeMo I 2020-05-26 15:45:20 collections:235] # 142681 files loaded accounting to # 2 labels
[NeMo I 2020-05-26 15:45:20 data_layer:960] # of classes :2
[NeMo I 2020-05-26 15:45:20 collections:232] Filtered duration for loading collection is 1.507312.
[NeMo I 2020-05-26 15:45:20 collections:235] # 17999 files loaded accounting to # 2 labels
[NeMo I 2020-05-26 15:45:20 data_layer:960] # of classes :2
[NeMo I 2020-05-26 15:45:22 <ipython-input-14-af3cadd591b3>:40] Steps per epoch : 1116
[NeMo I 2020-05-26 15:45:22 <ipython-input-14-af3cadd591b3>:41] Have 142681 examples to train on.
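
The step count in the log above follows directly from the dataset size and the batch size. The sketch below reproduces that arithmetic with values taken from this notebook (142681 examples, batch size 128, 5 epochs). Note that the `+ 1` inside the `ceil` yields one step more per epoch than the number of batches strictly needed to cover the data once.

```python
import math

num_examples = 142681  # "Have 142681 examples to train on." in the log above
batch_size = 128
num_epochs = 5

# Same formula as the notebook cell: note the "+ 1" inside the ceil.
steps_per_epoch = math.ceil(num_examples / float(batch_size) + 1)
# Number of batches needed to see every example exactly once.
batches_per_epoch = math.ceil(num_examples / batch_size)
# Total number of steps handed to the learning rate schedule.
total_steps = num_epochs * steps_per_epoch
print(steps_per_epoch, batches_per_epoch, total_steps)
```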


In [15]:
# Let's print out the number of parameters of this model
logging.info('================================')
logging.info(f"Number of parameters in encoder: {jasper_encoder.num_weights}")
logging.info(f"Number of parameters in decoder: {jasper_decoder.num_weights}")
logging.info(
    f"Total number of parameters in model: {jasper_decoder.num_weights + jasper_encoder.num_weights}"
)
logging.info('================================')

[NeMo I 2020-05-26 15:45:22 <ipython-input-15-6805b5462cf6>:3] Number of parameters in encoder: 73344
[NeMo I 2020-05-26 15:45:22 <ipython-input-15-6805b5462cf6>:4] Number of parameters in decoder: 258
[NeMo I 2020-05-26 15:45:22 <ipython-input-15-6805b5462cf6>:6] Total number of parameters in model: 73602
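
The decoder described earlier averages the encoder features over time and then applies a linear layer. The pure-Python sketch below mirrors that idea on toy data; it is an illustration, not NeMo's `JasperDecoderForClassification`. The logged decoder size of 258 parameters is consistent with a single linear layer mapping 128 features to 2 classes with a bias (128 * 2 + 2), but the value 128 is inferred from that count and not read from the config file.

```python
def mean_pool_over_time(features):
    """Average a [num_features][num_timesteps] nested list over the time axis."""
    return [sum(channel) / len(channel) for channel in features]

def linear(inputs, weights, bias):
    """Apply a dense layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]

def linear_param_count(feat_in, num_classes):
    """Weights plus biases of a single dense layer."""
    return feat_in * num_classes + num_classes

# Toy example: 3 feature channels over 4 timesteps, 2 output classes.
features = [[1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 2.0], [4.0, 4.0, 4.0, 4.0]]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.5, -0.5]

pooled = mean_pool_over_time(features)
logits = linear(pooled, weights, bias)
print(pooled, logits, linear_param_count(128, 2))
```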


## Compile the Training Graph for NeMo

In [16]:
# Now we have all of the components that are required to build the NeMo execution graph!
## Build the training data loaders and preprocessors first
audio_signal, audio_signal_len, labels, label_len = train_data_layer()
processed_signal, processed_signal_len = data_preprocessor(input_signal=audio_signal, length=audio_signal_len)
processed_signal, processed_signal_len = crop_pad_augmentation(
    input_signal=processed_signal,
    length=processed_signal_len  # length of the preprocessed features, as in the test graph
)

## Augment the dataset for training
if spectr_augment_config:
    processed_signal = data_spectr_augmentation(input_spec=processed_signal)

## Define the model
encoded, encoded_len = jasper_encoder(audio_signal=processed_signal, length=processed_signal_len)
decoded = jasper_decoder(encoder_output=encoded)

## Obtain the train loss
train_loss = ce_loss(logits=decoded, labels=labels)


## Compile the Test Graph for NeMo

In [18]:
# Now we build the test graph in a similar way, reusing the above components
## Build the test data loader and preprocess same way as train graph
## But note, we do not add the spectrogram augmentation to the test graph !
test_audio_signal, test_audio_signal_len, test_labels, test_label_len = eval_data_layer()
test_processed_signal, test_processed_signal_len = data_preprocessor(
    input_signal=test_audio_signal, length=test_audio_signal_len
)
test_processed_signal, test_processed_signal_len = crop_pad_augmentation(
    input_signal=test_processed_signal, length=test_processed_signal_len
)

# Pass the test data through the model encoder and decoder
test_encoded, test_encoded_len = jasper_encoder(
    audio_signal=test_processed_signal, length=test_processed_signal_len
)
test_decoded = jasper_decoder(encoder_output=test_encoded)

# Compute test loss for visualization
test_loss = ce_loss(logits=test_decoded, labels=test_labels)

## Setting up callbacks for training and test set evaluation, and checkpoint saving

In [19]:
# Now that we have our training and evaluation graphs built,
# we can focus on a few callbacks to help us save the model checkpoints
# during training, as well as display train and test metrics

# Callbacks needed to print train info to console and Tensorboard
train_callback = nemo.core.SimpleLossLoggerCallback(
    # Notice that we pass in loss, predictions, and the labels.
    # Of course we would like to see our training loss, but we need the
    # other arguments to calculate the accuracy.
    tensors=[train_loss, decoded, labels],
    # The print_func defines what gets printed.
    print_func=partial(monitor_classification_training_progress, eval_metric=None),
    get_tb_values=lambda x: [("loss", x[0])],
    tb_writer=neural_factory.tb_writer,
)

# Callbacks needed to print test info to console and Tensorboard
tagname = 'TestSet'
eval_callback = nemo.core.EvaluatorCallback(
    eval_tensors=[test_loss, test_decoded, test_labels],
    user_iter_callback=partial(process_classification_evaluation_batch, top_k=1),
    user_epochs_done_callback=partial(process_classification_evaluation_epoch, eval_metric=1, tag=tagname),
    eval_step=200,  # How often we evaluate the model on the test set
    tb_writer=neural_factory.tb_writer,
)

# Callback to save model checkpoints
chpt_callback = nemo.core.CheckpointCallback(
    folder=neural_factory.checkpoint_dir,
    step_freq=1000,
)

# Prepare a list of callbacks to pass to the engine
callbacks = [train_callback, eval_callback, chpt_callback]

# Training the model

Even with such a small model (about 73k parameters) and just 5 epochs (which should take only a few minutes to train), the model learns quickly: in the training log below, the training batch top-1 accuracy rises from about 34% at step 0 to roughly 85-95% within a few hundred steps. Note that VAD here is a two-class (speech vs. background) classification problem.

Experiment with increasing the number of epochs or with batch size to see how much you can improve the score!

If you are interested in a **pretrained** model, please have a look at [7_VAD_Offline_Online_Microphone_Demo.ipynb](todo)
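
The training cell that follows uses a `CosineAnnealing` learning rate policy with `warmup_ratio=0.05` and `min_lr=0.001`. As a rough illustration of what such a schedule does, the sketch below implements a generic linear warmup followed by cosine decay. The function `cosine_with_warmup` is hypothetical, and NeMo's own implementation may differ in its details.

```python
import math

def cosine_with_warmup(step, total_steps, base_lr, min_lr, warmup_ratio):
    """Learning rate at `step`: linear warmup, then cosine decay to `min_lr`.

    A generic illustration; NeMo's `CosineAnnealing` may differ in detail.
    """
    warmup_steps = round(warmup_ratio * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 5 * 1116  # num_epochs * steps_per_epoch, using the logged step count
for step in (0, 279, total_steps // 2, total_steps):
    print(step, round(cosine_with_warmup(step, total_steps, 0.05, 0.001, 0.05), 5))
```

The warmup phase avoids large updates while the weights are still random, and the cosine decay lowers the rate smoothly toward `min_lr` by the final step.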

In [23]:
# Now we have all the components required to train the model

# Define a learning rate schedule
lr_policy = CosineAnnealing(
    total_steps=num_epochs * steps_per_epoch,
    warmup_ratio=0.05,
    min_lr=0.001,
)

logging.info(f"Using `{lr_policy}` Learning Rate Scheduler")

# Finally, let's train this model!
neural_factory.train(
    tensors_to_optimize=[train_loss],
    callbacks=callbacks,
    lr_policy=lr_policy,
    optimizer="novograd",
    optimization_params={
        "num_epochs": num_epochs,
        "max_steps": None,
        "lr": lr,
        "momentum": 0.95,
        "betas": (0.98, 0.5),
        "weight_decay": weight_decay,
        "grad_norm_clip": None,
    },
    batches_per_step=1,
)



[NeMo I 2020-05-12 14:49:22 <ipython-input-23-689768946717>:13] Using `<nemo.utils.lr_policies.CosineAnnealing object at 0x7f7190d79910>` Learning Rate Scheduler
[NeMo I 2020-05-12 14:49:22 callbacks:187] Starting .....
[NeMo I 2020-05-12 14:49:22 callbacks:359] Found 2 modules with weights:
[NeMo I 2020-05-12 14:49:22 callbacks:361] JasperEncoder
[NeMo I 2020-05-12 14:49:22 callbacks:361] JasperDecoderForClassification
[NeMo I 2020-05-12 14:49:22 callbacks:362] Total model parameters: 73602
[NeMo I 2020-05-12 14:49:22 callbacks:311] Found checkpoint folder .//home/fjia/data/freesound_resampled/quartznet-3x1-v1combo_balanced_all_Mfcc_prob0.8/checkpoints. Will attempt to restore checkpoints from it.


[NeMo W 2020-05-12 14:49:22 callbacks:328] For module JasperDecoderForClassification, no file matches  in .//home/fjia/data/freesound_resampled/quartznet-3x1-v1combo_balanced_all_Mfcc_prob0.8/checkpoints
[NeMo W 2020-05-12 14:49:22 callbacks:330] Checkpoint folder .//home/fjia/data/freesound_resampled/quartznet-3x1-v1combo_balanced_all_Mfcc_prob0.8/checkpoints was present but nothing was restored. Continuing training from random initialization.


[NeMo I 2020-05-12 14:49:22 callbacks:199] Starting epoch 0
[NeMo I 2020-05-12 14:49:25 callbacks:224] Step: 0
[NeMo I 2020-05-12 14:49:25 helpers:104] Loss: 0.9745737314224243
[NeMo I 2020-05-12 14:49:25 helpers:110] training_batch_top@1:  34.3750
[NeMo I 2020-05-12 14:49:25 callbacks:239] Step time: 0.1588578224182129 seconds
[NeMo I 2020-05-12 14:49:25 callbacks:445] Doing Evaluation ..............................




[NeMo I 2020-05-12 14:49:27 callbacks:450] Evaluation time: 1.740051031112671 seconds
[NeMo I 2020-05-12 14:49:28 callbacks:224] Step: 25
[NeMo I 2020-05-12 14:49:28 helpers:104] Loss: 0.5065323710441589
[NeMo I 2020-05-12 14:49:28 helpers:110] training_batch_top@1:  75.0000
[NeMo I 2020-05-12 14:49:28 callbacks:239] Step time: 0.05025506019592285 seconds
[NeMo I 2020-05-12 14:49:30 callbacks:224] Step: 50
[NeMo I 2020-05-12 14:49:30 helpers:104] Loss: 0.31578725576400757
[NeMo I 2020-05-12 14:49:30 helpers:110] training_batch_top@1:  86.7188
[NeMo I 2020-05-12 14:49:30 callbacks:239] Step time: 0.1203145980834961 seconds
[NeMo I 2020-05-12 14:49:32 callbacks:224] Step: 75
[NeMo I 2020-05-12 14:49:32 helpers:104] Loss: 0.31505733728408813
[NeMo I 2020-05-12 14:49:32 helpers:110] training_batch_top@1:  85.9375
[NeMo I 2020-05-12 14:49:32 callbacks:239] Step time: 0.04457831382751465 seconds
[NeMo I 2020-05-12 14:49:33 callbacks:224] Step: 100

[NeMo I 2020-05-12 14:50:14 callbacks:224] Step: 650
[NeMo I 2020-05-12 14:50:14 helpers:104] Loss: 0.23157186806201935
[NeMo I 2020-05-12 14:50:14 helpers:110] training_batch_top@1:  90.6250
[NeMo I 2020-05-12 14:50:14 callbacks:239] Step time: 0.14571404457092285 seconds
[NeMo I 2020-05-12 14:50:16 callbacks:224] Step: 675
[NeMo I 2020-05-12 14:50:16 helpers:104] Loss: 0.35731738805770874
[NeMo I 2020-05-12 14:50:16 helpers:110] training_batch_top@1:  89.0625
[NeMo I 2020-05-12 14:50:16 callbacks:239] Step time: 0.04394674301147461 seconds
[NeMo I 2020-05-12 14:50:17 callbacks:224] Step: 700
[NeMo I 2020-05-12 14:50:17 helpers:104] Loss: 0.3463391959667206
[NeMo I 2020-05-12 14:50:17 helpers:110] training_batch_top@1:  84.3750
[NeMo I 2020-05-12 14:50:17 callbacks:239] Step time: 0.04420042037963867 seconds
[NeMo I 2020-05-12 14:50:19 callbacks:224] Step: 725
[NeMo I 2020-05-12 14:50:19 helpers:104] Loss: 0.33758291602134705

[NeMo I 2020-05-12 14:51:00 callbacks:239] Step time: 0.08862161636352539 seconds
[NeMo I 2020-05-12 14:51:02 callbacks:224] Step: 1275
[NeMo I 2020-05-12 14:51:02 helpers:104] Loss: 0.2592397630214691
[NeMo I 2020-05-12 14:51:02 helpers:110] training_batch_top@1:  88.2812
[NeMo I 2020-05-12 14:51:02 callbacks:239] Step time: 0.044200897216796875 seconds
[NeMo I 2020-05-12 14:51:04 callbacks:224] Step: 1300
[NeMo I 2020-05-12 14:51:04 helpers:104] Loss: 0.17464350163936615
[NeMo I 2020-05-12 14:51:04 helpers:110] training_batch_top@1:  90.6250
[NeMo I 2020-05-12 14:51:04 callbacks:239] Step time: 0.045468807220458984 seconds
[NeMo I 2020-05-12 14:51:05 callbacks:224] Step: 1325
[NeMo I 2020-05-12 14:51:05 helpers:104] Loss: 0.18225739896297455
[NeMo I 2020-05-12 14:51:05 helpers:110] training_batch_top@1:  92.1875
[NeMo I 2020-05-12 14:51:05 callbacks:239] Step time: 0.07424306869506836 seconds
[NeMo I 2020-05-12 14:51:07 callbacks:224] Step: 1350

[NeMo I 2020-05-12 14:51:48 helpers:110] training_batch_top@1:  85.1562
[NeMo I 2020-05-12 14:51:48 callbacks:239] Step time: 0.04893946647644043 seconds
[NeMo I 2020-05-12 14:51:50 callbacks:224] Step: 1925
[NeMo I 2020-05-12 14:51:50 helpers:104] Loss: 0.3184409737586975
[NeMo I 2020-05-12 14:51:50 helpers:110] training_batch_top@1:  87.5000
[NeMo I 2020-05-12 14:51:50 callbacks:239] Step time: 0.07735824584960938 seconds
[NeMo I 2020-05-12 14:51:52 callbacks:224] Step: 1950
[NeMo I 2020-05-12 14:51:52 helpers:104] Loss: 0.2662186324596405
[NeMo I 2020-05-12 14:51:52 helpers:110] training_batch_top@1:  89.8438
[NeMo I 2020-05-12 14:51:52 callbacks:239] Step time: 0.044324636459350586 seconds
[NeMo I 2020-05-12 14:51:53 callbacks:224] Step: 1975
[NeMo I 2020-05-12 14:51:53 helpers:104] Loss: 0.16672399640083313
[NeMo I 2020-05-12 14:51:53 helpers:110] training_batch_top@1:  93.7500
[NeMo I 2020-05-12 14:51:53 callbacks:239] Step time: 0.049155235290527344 seconds

[NeMo I 2020-05-12 14:52:36 callbacks:224] Step: 2525
[NeMo I 2020-05-12 14:52:36 helpers:104] Loss: 0.19708772003650665
[NeMo I 2020-05-12 14:52:36 helpers:110] training_batch_top@1:  93.7500
[NeMo I 2020-05-12 14:52:36 callbacks:239] Step time: 0.05688118934631348 seconds
[NeMo I 2020-05-12 14:52:38 callbacks:224] Step: 2550
[NeMo I 2020-05-12 14:52:38 helpers:104] Loss: 0.3506413698196411
[NeMo I 2020-05-12 14:52:38 helpers:110] training_batch_top@1:  84.3750
[NeMo I 2020-05-12 14:52:38 callbacks:239] Step time: 0.06837344169616699 seconds
[NeMo I 2020-05-12 14:52:39 callbacks:224] Step: 2575
[NeMo I 2020-05-12 14:52:39 helpers:104] Loss: 0.22137701511383057
[NeMo I 2020-05-12 14:52:39 helpers:110] training_batch_top@1:  95.3125
[NeMo I 2020-05-12 14:52:39 callbacks:239] Step time: 0.044213294982910156 seconds
[NeMo I 2020-05-12 14:52:41 callbacks:224] Step: 2600
[NeMo I 2020-05-12 14:52:41 helpers:104] Loss: 0.1954980045557022

[NeMo I 2020-05-12 14:53:22 helpers:104] Loss: 0.21518684923648834
[NeMo I 2020-05-12 14:53:22 helpers:110] training_batch_top@1:  90.6250
[NeMo I 2020-05-12 14:53:22 callbacks:239] Step time: 0.11887121200561523 seconds
[NeMo I 2020-05-12 14:53:24 callbacks:224] Step: 3175
[NeMo I 2020-05-12 14:53:24 helpers:104] Loss: 0.10242561995983124
[NeMo I 2020-05-12 14:53:24 helpers:110] training_batch_top@1:  97.6562
[NeMo I 2020-05-12 14:53:24 callbacks:239] Step time: 0.058302879333496094 seconds
[NeMo I 2020-05-12 14:53:25 callbacks:224] Step: 3200
[NeMo I 2020-05-12 14:53:25 helpers:104] Loss: 0.2946646809577942
[NeMo I 2020-05-12 14:53:25 helpers:110] training_batch_top@1:  91.4062
[NeMo I 2020-05-12 14:53:25 callbacks:239] Step time: 0.12248516082763672 seconds
[NeMo I 2020-05-12 14:53:25 callbacks:445] Doing Evaluation ..............................
[NeMo I 2020-05-12 14:53:27 callbacks:450] Evaluation time: 1.4922308921813965 seconds

[NeMo I 2020-05-12 14:54:10 helpers:110] training_batch_top@1:  89.0625
[NeMo I 2020-05-12 14:54:10 callbacks:239] Step time: 0.05255293846130371 seconds
[NeMo I 2020-05-12 14:54:12 callbacks:224] Step: 3800
[NeMo I 2020-05-12 14:54:12 helpers:104] Loss: 0.12022773176431656
[NeMo I 2020-05-12 14:54:12 helpers:110] training_batch_top@1:  95.3125
[NeMo I 2020-05-12 14:54:12 callbacks:239] Step time: 0.07635164260864258 seconds
[NeMo I 2020-05-12 14:54:12 callbacks:445] Doing Evaluation ..............................
[NeMo I 2020-05-12 14:54:13 callbacks:450] Evaluation time: 1.5617625713348389 seconds
[NeMo I 2020-05-12 14:54:15 callbacks:224] Step: 3825
[NeMo I 2020-05-12 14:54:15 helpers:104] Loss: 0.17067106068134308
[NeMo I 2020-05-12 14:54:15 helpers:110] training_batch_top@1:  93.7500
[NeMo I 2020-05-12 14:54:15 callbacks:239] Step time: 0.046558380126953125 seconds
[NeMo I 2020-05-12 14:54:16 callbacks:224] Step: 3850
[NeMo I 2020-05-12 14:54:16 helpers:104] Loss: 0.22782959043979

[NeMo I 2020-05-12 14:54:58 callbacks:224] Step: 4400
[NeMo I 2020-05-12 14:54:58 helpers:104] Loss: 0.14522118866443634
[NeMo I 2020-05-12 14:54:58 helpers:110] training_batch_top@1:  92.1875
[NeMo I 2020-05-12 14:54:58 callbacks:239] Step time: 0.05823016166687012 seconds
[NeMo I 2020-05-12 14:54:58 callbacks:445] Doing Evaluation ..............................
[NeMo I 2020-05-12 14:54:59 callbacks:450] Evaluation time: 1.553720474243164 seconds
[NeMo I 2020-05-12 14:55:01 callbacks:224] Step: 4425
[NeMo I 2020-05-12 14:55:01 helpers:104] Loss: 0.14171554148197174
[NeMo I 2020-05-12 14:55:01 helpers:110] training_batch_top@1:  95.3125
[NeMo I 2020-05-12 14:55:01 callbacks:239] Step time: 0.056873321533203125 seconds
[NeMo I 2020-05-12 14:55:02 callbacks:224] Step: 4450
[NeMo I 2020-05-12 14:55:02 helpers:104] Loss: 0.2346191257238388
[NeMo I 2020-05-12 14:55:02 helpers:110] training_batch_top@1:  89.8438
[NeMo I 2020-05-12 14:55:02 callbacks:239] Step time: 0.05523681640625 seconds

KeyboardInterrupt: 

# Evaluating the model


In [27]:
# Locate the checkpoint files (point this at your own checkpoint directory)


base_checkpoint_path = '/home/fjia/code/something/night_unba/05-18-2020 -- 01-10-37'
CHECKPOINT_ENCODER = os.path.join(base_checkpoint_path, 'JasperEncoder-STEP-111600.pt')
CHECKPOINT_DECODER = os.path.join(base_checkpoint_path, 'JasperDecoderForClassification-STEP-111600.pt')

# if not os.path.exists(base_checkpoint_path):
#     os.makedirs(base_checkpoint_path)
    
# if not os.path.exists(CHECKPOINT_ENCODER):
#     !wget https://api.ngc.nvidia.com/v2/models/nvidia/google_speech_commands_v2___matchboxnet_3x1x1/versions/1/files/JasperEncoder-STEP-89000.pt -P {base_checkpoint_path};

# if not os.path.exists(CHECKPOINT_DECODER):
#     !wget https://api.ngc.nvidia.com/v2/models/nvidia/google_speech_commands_v2___matchboxnet_3x1x1/versions/1/files/JasperDecoderForClassification-STEP-89000.pt -P {base_checkpoint_path};

In [31]:
model_path = base_checkpoint_path

# Evaluation of incorrectly predicted samples

Given that we have a trained model that performs reasonably well, let's try to listen to the samples where the model is least confident in its predictions.

**NOTE**: The following code depends on the librosa library. To install it, run the following code block first:

!pip install librosa

In [26]:
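# Let's add a path to the checkpoint dir
# NOTE: this overrides the `model_path` set above; skip this cell to evaluate
# the checkpoint stored in `base_checkpoint_path` instead
model_path = neural_factory.checkpoint_dir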


## Extract the predictions from the model

We want the actual logits of the model rather than just the final evaluation score, so we use `NeuralModuleFactory.infer(...)` to extract the logits for each batch of samples.
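
Once logits are available, a softmax turns them into class probabilities, and the largest probability can serve as a simple confidence score for finding uncertain samples. The sketch below shows this in pure Python. The logits are hypothetical, and the class order `["background", "speech"]` is an assumption that should match the `labels` list loaded from the config.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, class_names):
    """Return (predicted class name, probability of that class)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Hypothetical logits for two samples; the class order here is an assumption
# and should match the `labels` list loaded from the config.
class_names = ["background", "speech"]
for sample_logits in ([2.0, -1.0], [0.1, 0.3]):
    name, confidence = predict(sample_logits, class_names)
    print(name, round(confidence, 3))
```

A prediction whose top probability is close to 0.5 in this two-class setting is one the model is unsure about, which makes it a good candidate for listening.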

In [32]:
# --- Inference Only --- #
# We've already built the inference DAG above, so all we need is to call infer().
evaluated_tensors = neural_factory.infer(
    # These are the tensors we want to get from the model.
    tensors=[test_loss, test_decoded, test_labels],
    # checkpoint_dir specifies where the model params are loaded from.
    checkpoint_dir=model_path,
)

[NeMo I 2020-05-26 15:49:12 actions:1510] Restoring JasperEncoder from /home/fjia/code/something/night_unba/05-18-2020 -- 01-10-37/JasperEncoder-STEP-111600.pt
[NeMo I 2020-05-26 15:49:12 actions:1510] Restoring JasperDecoderForClassification from /home/fjia/code/something/night_unba/05-18-2020 -- 01-10-37/JasperDecoderForClassification-STEP-111600.pt
[NeMo I 2020-05-26 15:49:12 actions:763] Evaluating batch 0 out of 141
[NeMo I 2020-05-26 15:49:12 actions:763] Evaluating batch 14 out of 141
[NeMo I 2020-05-26 15:49:12 actions:763] Evaluating batch 28 out of 141
[NeMo I 2020-05-26 15:49:13 actions:763] Evaluating batch 42 out of 141
[NeMo I 2020-05-26 15:49:13 actions:763] Evaluating batch 56 out of 141
[NeMo I 2020-05-26 15:49:13 actions:763] Evaluating batch 70 out of 141
[NeMo I 2020-05-26 15:49:13 actions:763] Evaluating batch 84 out of 141
[NeMo I 2020-05-26 15:49:13 actions:763] Evaluating batch 98 out of 141
[NeMo I 2020-05-26 15:49:13 actions:763] Evaluating batch 112 out of 14

## Accuracy calculation

In [33]:
correct_count = 0
total_count = 0

for batch_idx, (logits, labels) in enumerate(zip(evaluated_tensors[1], evaluated_tensors[2])):
    acc = classification_accuracy(
        logits=logits,
        targets=labels,
        top_k=[1]
    )

    # Select top 1 accuracy only
    acc = acc[0]

    # Since accuracy here is "per batch", we denormalize it by multiplying by the
    # batch size (rounding to avoid float truncation) to recover the correct count.
    correct_count += int(round(float(acc) * logits.size(0)))
    total_count += logits.size(0)

logging.info(f"Total correct / Total count : {correct_count} / {total_count}")
logging.info(f"Final accuracy : {correct_count / float(total_count)}")

[NeMo I 2020-05-26 15:49:16 <ipython-input-33-674fb7de9132>:19] Total correct / Total count : 17952 / 17999
[NeMo I 2020-05-26 15:49:16 <ipython-input-33-674fb7de9132>:20] Final accuracy : 0.997388743819101
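
To see why the per-batch accuracies are converted back to counts instead of being averaged directly, here is a minimal sketch with made-up numbers. The function name `pooled_accuracy` and the batch values are hypothetical; the product is rounded before the integer conversion to guard against floating-point error.

```python
def pooled_accuracy(batch_accuracies, batch_sizes):
    """Recover correct/total counts from per-batch accuracies."""
    correct = 0
    total = 0
    for acc, size in zip(batch_accuracies, batch_sizes):
        # acc * size is the number of correct samples in this batch
        correct += int(round(acc * size))
        total += size
    return correct, total


# Hypothetical example: a full batch of 128 and a final batch of only 2 samples
accs = [1.0, 0.5]
sizes = [128, 2]

correct, total = pooled_accuracy(accs, sizes)
naive_mean = sum(accs) / len(accs)   # treats both batches as equally important
pooled = correct / total             # weights each batch by its size

print(f"naive mean of batch accuracies: {naive_mean:.4f}")
print(f"pooled accuracy: {pooled:.4f} ({correct} / {total})")
```

The naive mean understates the accuracy here because the small final batch counts as much as the full one.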


## Precision, recall, and F1 score calculation

In [34]:
total_true_negative, total_false_negative, total_false_positive, total_true_positive = 0, 0, 0, 0

for batch_idx, (logits, labels) in enumerate(zip(evaluated_tensors[1], evaluated_tensors[2])):

    tn, fp, fn, tp = classification_confusion_matrix(
        logits=logits,
        targets=labels).ravel()
    
    total_true_negative += tn
    total_false_negative += fn
    total_false_positive += fp
    total_true_positive += tp



logging.info(f" True Positive: {total_true_positive}")
logging.info(f" False Positive : {total_false_positive}")
logging.info(f" False Negative : {total_false_negative}")
logging.info(f" True Negative : {total_true_negative}")
accuracy = (total_true_positive + total_true_negative) \
                / (total_true_positive + total_true_negative + total_false_negative + total_false_positive)
precision = total_true_positive / (total_true_positive + total_false_positive)
recall = total_true_positive / (total_true_positive + total_false_negative)
f1_score =  2 * precision * recall / (precision + recall)

logging.info(f"Final Accuracy: {accuracy}")
logging.info(f"Final Precision: {precision}")
logging.info(f"Final Recall : {recall}")
logging.info(f"Final F1 score : {f1_score}")

[NeMo I 2020-05-26 15:49:18 <ipython-input-34-f3320ed7ac71>:16]  True Positive: 10561
[NeMo I 2020-05-26 15:49:18 <ipython-input-34-f3320ed7ac71>:17]  False Positive : 25
[NeMo I 2020-05-26 15:49:18 <ipython-input-34-f3320ed7ac71>:18]  False Negative : 22
[NeMo I 2020-05-26 15:49:18 <ipython-input-34-f3320ed7ac71>:19]  True Negative : 7391
[NeMo I 2020-05-26 15:49:18 <ipython-input-34-f3320ed7ac71>:26] Final Accuracy: 0.997388743819101
[NeMo I 2020-05-26 15:49:18 <ipython-input-34-f3320ed7ac71>:27] Final Precision: 0.9976383903268468
[NeMo I 2020-05-26 15:49:18 <ipython-input-34-f3320ed7ac71>:28] Final Recall : 0.9979211943683266
[NeMo I 2020-05-26 15:49:18 <ipython-input-34-f3320ed7ac71>:29] Final F1 score : 0.9977797723085644
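
The reported metrics can be reproduced from the four confusion-matrix counts alone. The helper below is a small sketch, not part of NeMo; the name `binary_metrics` is hypothetical, and the zero-denominator guards are an addition for degenerate cases such as a test set with no positive predictions.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts taken from the log output above
metrics = binary_metrics(tp=10561, fp=25, fn=22, tn=7391)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```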


## Filtering out incorrect samples
Let us now pick out the incorrectly predicted samples from the full test set.

In [47]:
import librosa
import json
import IPython.display as ipd
import torch

In [38]:
# First, let's create a utility class to map the integer class labels back to their string labels
class ReverseMapLabel:
    def __init__(self, data_layer: nemo_asr.AudioToSpeechLabelDataLayer):
        self.label2id = dict(data_layer._dataset.label2id)
        self.id2label = dict(data_layer._dataset.id2label)

    def __call__(self, pred_idx, label_idx):
        return self.id2label[pred_idx], self.id2label[label_idx]

In [40]:
# Next, let's get the indices of all the incorrectly predicted samples
sample_idx = 0
incorrect_preds = []
rev_map = ReverseMapLabel(eval_data_layer)

# Remember, evaluated_tensors = (loss, logits, labels)
for batch_idx, (logits, labels) in enumerate(zip(evaluated_tensors[1], evaluated_tensors[2])):
    probs = torch.softmax(logits, dim=-1)
    probas, preds = torch.max(probs, dim=-1)

    incorrect_ids = (preds != labels).nonzero()
    for idx in incorrect_ids:
        proba = float(probas[idx][0])
        pred = int(preds[idx][0])
        label = int(labels[idx][0])
        idx = int(idx[0]) + sample_idx

        incorrect_preds.append((idx, *rev_map(pred, label), proba))

    sample_idx += labels.size(0)

logging.info(f"Num test samples : {total_count}")
logging.info(f"Num errors : {len(incorrect_preds)}")

# Sort by prediction confidence, least confident first
incorrect_preds = sorted(incorrect_preds, key=lambda x: x[-1], reverse=False)

[NeMo I 2020-05-18 14:05:04 <ipython-input-40-432d15bd17f2>:24] Num errors : 47
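
The filtering logic can be illustrated on toy data without any tensors. The sketch below mirrors the softmax, argmax, and mismatch steps in plain Python; the logits, the labels, and the assignment of `background` to id 0 and `speech` to id 1 are assumptions made for illustration only.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def find_errors(logits, labels, id2label, offset=0):
    """Return (sample index, predicted label, true label, confidence) per mistake."""
    errors = []
    for i, (row, label) in enumerate(zip(logits, labels)):
        probs = softmax(row)
        pred = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if pred != label:
            errors.append((offset + i, id2label[pred], id2label[label], probs[pred]))
    return errors


id2label = {0: "background", 1: "speech"}          # assumed mapping
logits = [[2.0, -2.0], [0.1, 0.0], [-3.0, 3.0]]    # toy batch of three samples
labels = [0, 1, 0]

# Sort so that the least confident mistakes come first
errors = sorted(find_errors(logits, labels, id2label), key=lambda e: e[-1])
for error in errors:
    print(error)
```

The `offset` argument plays the role of `sample_idx` in the cell above: it converts a position within a batch into a position within the whole test set.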


## Examine a subset of incorrect samples
Let's print the (test id, predicted label, ground truth label, confidence) tuple for the first 20 incorrectly predicted samples.

In [50]:
for incorrect_sample in incorrect_preds[:20]:
    logging.info(str(incorrect_sample))

[NeMo I 2020-05-18 02:54:24 <ipython-input-50-631305d430a9>:2] (13772, 'speech', 'background', 0.5008878707885742)
[NeMo I 2020-05-18 02:54:24 <ipython-input-50-631305d430a9>:2] (13173, 'speech', 'background', 0.5154367089271545)
[NeMo I 2020-05-18 02:54:24 <ipython-input-50-631305d430a9>:2] (8506, 'background', 'speech', 0.5175673365592957)
[NeMo I 2020-05-18 02:54:24 <ipython-input-50-631305d430a9>:2] (5901, 'background', 'speech', 0.5276477932929993)
[NeMo I 2020-05-18 02:54:24 <ipython-input-50-631305d430a9>:2] (16000, 'speech', 'background', 0.5597944259643555)
[NeMo I 2020-05-18 02:54:24 <ipython-input-50-631305d430a9>:2] (15355, 'speech', 'background', 0.5611239075660706)
[NeMo I 2020-05-18 02:54:24 <ipython-input-50-631305d430a9>:2] (10704, 'speech', 'background', 0.5677288174629211)
[NeMo I 2020-05-18 02:54:24 <ipython-input-50-631305d430a9>:2] (10705, 'speech', 'background', 0.6267029047012329)
[NeMo I 2020-05-18 02:54:24 <ipython-input-50-631305d430a9>:2] (14504, 'speech', '

## Define a threshold below which we designate a model's prediction as "low confidence"

In [61]:
# Count how many such samples exist
low_confidence_threshold = 1.0  # 1.0 keeps every incorrect prediction; lower it to narrow the set
count_low_confidence = len(list(filter(lambda x: x[-1] <= low_confidence_threshold, incorrect_preds)))
logging.info(f"Number of low confidence predictions : {count_low_confidence}")

[NeMo I 2020-05-18 02:56:22 <ipython-input-61-d1dec2ec09f9>:4] Number of low confidence predictions : 47
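
Because softmax confidences never exceed 1.0, a threshold of 1.0 selects every mistake. The sketch below shows how the count changes with the threshold, using the eight lowest confidences from the log output above (rounded to four decimals). The helper name `count_below` and the alternative thresholds are hypothetical.

```python
# The eight lowest confidences among the incorrect predictions, taken from
# the log output above and rounded to four decimals.
confidences = [0.5009, 0.5154, 0.5176, 0.5276, 0.5598, 0.5611, 0.5677, 0.6267]


def count_below(confidences, threshold):
    """Count predictions whose confidence is at or below the threshold."""
    return sum(1 for c in confidences if c <= threshold)


for threshold in (0.55, 0.6, 1.0):
    print(f"threshold={threshold}: {count_below(confidences, threshold)} samples")
```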


# Let's listen to the samples the model is least confident in!

In [62]:
# First, let's create a helper function to parse the manifest files
def parse_manifest(manifest):
    data = []
    for line in manifest:
        line = json.loads(line)
        data.append(line)

    return data

In [63]:
# Next, let's create a helper function to listen to a given sample
def listen_to_file(sample_id, pred=None, label=None, proba=None):
    # Load the audio waveform using librosa (`test_samples` is the parsed test manifest, loaded below)
    filepath = test_samples[sample_id]['audio_filepath']
    if 'offset' in test_samples[sample_id]:
        audio, sample_rate = librosa.load(filepath,
                                          offset=test_samples[sample_id]['offset'],
                                          duration=test_samples[sample_id]['duration'])
    else:
        audio, sample_rate = librosa.load(filepath)

    if pred is not None and label is not None and proba is not None:
        logging.info(f"filepath: {filepath}, Sample : {sample_id} Prediction : {pred} Label : {label} Confidence = {proba: 0.4f}")
    else:
        logging.info(f"Sample : {sample_id}")

    return ipd.Audio(audio, rate=sample_rate)


In [59]:
import json
# Now let's load the test manifest(s) into memory
all_test_samples = []
for manifest_path in test_dataset.split(','):
    with open(manifest_path, 'r') as test_f:
        manifest_lines = test_f.readlines()
        print(manifest_path, len(manifest_lines))
        all_test_samples.extend(manifest_lines)

test_samples = parse_manifest(all_test_samples)

./manifest/all_test.json 18028
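
Each manifest line is a standalone JSON object, which is why `parse_manifest` decodes the file line by line. The sketch below parses a tiny in-memory manifest. The keys `audio_filepath`, `offset` and `duration` are the ones read by `listen_to_file`; the file paths, the values, and the `label` key are assumptions for illustration.

```python
import io
import json


def parse_manifest(manifest):
    """Decode one JSON object per non-empty line."""
    return [json.loads(line) for line in manifest if line.strip()]


# Hypothetical two-line manifest; the second entry has no 'offset' key
manifest_file = io.StringIO(
    '{"audio_filepath": "clips/a.wav", "offset": 0.0, "duration": 0.63, "label": "speech"}\n'
    '{"audio_filepath": "clips/b.wav", "duration": 1.0, "label": "background"}\n'
)

samples = parse_manifest(manifest_file)
for sample in samples:
    # Entries without 'offset' would be loaded from the start of the file
    print(sample["audio_filepath"], sample.get("offset", "no offset"), sample["duration"])
```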


In [64]:
# Finally, let's listen to all the audio samples where the model made a mistake
# Note: This list of incorrect samples may be quite large, so you may choose to subsample `incorrect_preds`
for sample_id, pred, label, proba in incorrect_preds[:count_low_confidence]:
    ipd.display(listen_to_file(sample_id, pred=pred, label=label, proba=proba))

[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Ship/id_420716 shiphornveryc.wav, Sample : 13772 Prediction : speech Label : background Confidence =  0.5009


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Bus/id_392257 Bus(Ambiente01).wav, Sample : 13173 Prediction : speech Label : background Confidence =  0.5154


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/follow/5f1253e9_nohash_0.wav, Sample : 8506 Prediction : background Label : speech Confidence =  0.5176


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/happy/3659fc1c_nohash_1.wav, Sample : 5901 Prediction : background Label : speech Confidence =  0.5276


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Vibration/id_501900 Vibrations_01_12_Sec.wav, Sample : 16000 Prediction : speech Label : background Confidence =  0.5598


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Bus/id_451699 schoolbustruckintidlefrontfootareaperspective-.wav, Sample : 15355 Prediction : speech Label : background Confidence =  0.5611


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Train/id_157873 Trainuponus.wav, Sample : 10704 Prediction : speech Label : background Confidence =  0.5677


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Train/id_157873 Trainuponus.wav, Sample : 10705 Prediction : speech Label : background Confidence =  0.6267


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Skateboard/id_262082 SkateboardRollStereo.wav, Sample : 14504 Prediction : speech Label : background Confidence =  0.6790


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/down/8ca3b1db_nohash_0.wav, Sample : 759 Prediction : background Label : speech Confidence =  0.6870


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/down/0165e0e8_nohash_0.wav, Sample : 7355 Prediction : background Label : speech Confidence =  0.6972


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Motorcycle/id_450016 motorcycledirtbikemotocrossidlestationaryrevsvariousonstandniceleftnearbyrightclose.wav, Sample : 17778 Prediction : speech Label : background Confidence =  0.7359


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/house/ad526ada_nohash_0.wav, Sample : 6844 Prediction : background Label : speech Confidence =  0.7491


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Truck/id_62619 truck.wav, Sample : 12237 Prediction : speech Label : background Confidence =  0.7581


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/six/99b05bcf_nohash_0.wav, Sample : 534 Prediction : background Label : speech Confidence =  0.7723


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/dog/6347b393_nohash_0.wav, Sample : 5402 Prediction : background Label : speech Confidence =  0.7848


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/six/475b61f1_nohash_0.wav, Sample : 5603 Prediction : background Label : speech Confidence =  0.8060


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/happy/5f1253e9_nohash_0.wav, Sample : 8844 Prediction : background Label : speech Confidence =  0.8440


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/down/196e84b7_nohash_0.wav, Sample : 8648 Prediction : background Label : speech Confidence =  0.8810


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/down/cd1baff6_nohash_0.wav, Sample : 10447 Prediction : background Label : speech Confidence =  0.8999


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/three/e638109b_nohash_0.wav, Sample : 6307 Prediction : background Label : speech Confidence =  0.9175


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/zero/d1bf406b_nohash_1.wav, Sample : 7942 Prediction : background Label : speech Confidence =  0.9276


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/up/9205fb3c_nohash_1.wav, Sample : 8246 Prediction : background Label : speech Confidence =  0.9291


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Bus/id_392257 Bus(Ambiente01).wav, Sample : 13170 Prediction : speech Label : background Confidence =  0.9486


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Car/id_174833 MITSUBISHIIMIEVelectriccarChargeplugcover.wav, Sample : 13176 Prediction : speech Label : background Confidence =  0.9505


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/two/a9f54d8d_nohash_0.wav, Sample : 2886 Prediction : background Label : speech Confidence =  0.9598


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/five/990ebd1f_nohash_0.wav, Sample : 7372 Prediction : background Label : speech Confidence =  0.9606


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Car/id_174833 MITSUBISHIIMIEVelectriccarChargeplugcover.wav, Sample : 13177 Prediction : speech Label : background Confidence =  0.9611


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/tree/446a3161_nohash_0.wav, Sample : 1334 Prediction : background Label : speech Confidence =  0.9652


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Traffic_noise/id_63748 traffic.wav, Sample : 10890 Prediction : speech Label : background Confidence =  0.9755


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Truck/id_62619 truck.wav, Sample : 12277 Prediction : speech Label : background Confidence =  0.9756


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/tree/475b61f1_nohash_0.wav, Sample : 5772 Prediction : background Label : speech Confidence =  0.9784


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/stop/7622d95b_nohash_0.wav, Sample : 4266 Prediction : background Label : speech Confidence =  0.9841


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Helicopter/id_72377 LAYERS003-HelicopterView-C#4-.wav, Sample : 13856 Prediction : speech Label : background Confidence =  0.9853


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Car/id_237381 CarWipersInte.wav, Sample : 15993 Prediction : speech Label : background Confidence =  0.9889


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Helicopter/id_72377 LAYERS003-HelicopterView-C#4-.wav, Sample : 13855 Prediction : speech Label : background Confidence =  0.9890


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Car/id_174833 MITSUBISHIIMIEVelectriccarChargeplugcover.wav, Sample : 13175 Prediction : speech Label : background Confidence =  0.9901


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Ship/id_420716 shiphornveryc.wav, Sample : 13771 Prediction : speech Label : background Confidence =  0.9922


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Helicopter/id_72377 LAYERS003-HelicopterView-C#4-.wav, Sample : 13854 Prediction : speech Label : background Confidence =  0.9965


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Ship/id_222594 shipshape.wav, Sample : 11745 Prediction : speech Label : background Confidence =  0.9979


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/wow/446a3161_nohash_0.wav, Sample : 3604 Prediction : background Label : speech Confidence =  0.9987


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/stop/7464b22e_nohash_0.wav, Sample : 6517 Prediction : background Label : speech Confidence =  0.9993


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/google_dataset_v2/google_speech_recognition_v2/zero/30f6e665_nohash_0.wav, Sample : 2056 Prediction : background Label : speech Confidence =  0.9995


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Car/id_237381 CarWipersInte.wav, Sample : 15994 Prediction : speech Label : background Confidence =  0.9995


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Bus/id_392257 Bus(Ambiente01).wav, Sample : 13174 Prediction : speech Label : background Confidence =  1.0000


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Aircraft/id_437793 G04-26-WWIIFighterStarts-Taxis-Stops.wav, Sample : 11742 Prediction : speech Label : background Confidence =  1.0000


[NeMo I 2020-05-18 02:56:26 <ipython-input-63-3789ce96c2a9>:13] filepath: /home/fjia/data/freesound_resampled_background/Vibration/id_501900 Vibrations_01_12_Sec.wav, Sample : 15996 Prediction : speech Label : background Confidence =  1.0000


# Inference and more
If you are interested in a **pretrained** model and **streaming inference**, please have a look at [7_VAD_Offline_Online_Microphone_Demo.ipynb](todo).

