In [1]:
"""
You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.
!pip install wget
!apt-get install sox
!pip install nemo_toolkit[asr]==0.10.0b10
!pip install unidecode

!mkdir configs
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/master/examples/speaker_recognition/configs/quartznet_spkr_3x2x512_xvector.yaml
!wget https://raw.githubusercontent.com/NVIDIA/NeMo/master/scripts/scp_to_manifest.py

Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... [?25l[?25hdone
  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=d69c9698696a555da36c83dbfbb4a9892524957bf414d8ffddf644b8df672b17
  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0 libsox-fmt-alsa
  libsox-fmt-base libsox3
Suggested packages:
  file libsox-fmt-all
The following NEW packages will be installed:
  libmagic-mgc libmagic1 libopencore-amrnb0 libopencore-amrwb0

# **SPEAKER RECOGNITION** 

Speaker Recognition (SR) is an broad research area which solves two major tasks: speaker identification (who is speaking?) and speaker verification (is the speaker who she claims to be?). In this work, we focmus on the far-field, text-independent speaker recognition when the identity of the speaker is based on how speech is spoken, not necessarily in what is being said. Typically such SR systems operate on unconstrained speech utterances, which are converted into vector of fixed length, called speaker embedding. Speaker embedding is also used in automatic speech recognition (ASR) and speech synthesis.

As goal of most speaker related systems is to get good speaker level embeddings that could help distinguish from other speakers, we shall first train these embeddings in end-to-end manner optimizing the [QuatzNet](https://arxiv.org/abs/1910.10261) based encoder model on cross-entropy loss. We modify the original quartznet based decoder to get these fixed size embeddings irrespective of length of input audio. We employ mean and variance based statistics pooling method to grab these embeddings.

In this tutorial we shall first train these embeddings on speaker related datasets and then get speaker embeddings from a pretrained network for a new dataset. Since Google Colab has very slow read-write speeds I'll be demonstarting this tutorial using [an4](http://www.speech.cs.cmu.edu/databases/an4/). 

Instead if you'd like to try on a bigger dataset like [hi-mia](https://arxiv.org/abs/1912.01231) use the [get_hi-mia-data.py](https://github.com/NVIDIA/NeMo/blob/master/scripts/get_hi-mia_data.py) script to download the necessary files, extract them, also re-sample to 16Khz if any of the sample is not at 16Khz. This will take a while so grap a large coffee. We do also provide scripts to score these embeddings for a speaker-verification task for hi-mia dataset. To do that follow this detailed [tutorial](https://nvidia.github.io/NeMo/). 

In [2]:
import os
print(os.getcwd())
data_dir = 'data'
!mkdir $data_dir
import glob
import subprocess
import tarfile
import wget

# Download the dataset. This will take a few moments...
print("******")
if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):
    an4_url = 'http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz'
    an4_path = wget.download(an4_url, data_dir)
    print(f"Dataset downloaded at: {an4_path}")
else:
    print("Tarfile already exists.")
    an4_path = data_dir + '/an4_sphere.tar.gz'

# Untar and convert .sph to .wav (using sox)
tar = tarfile.open(an4_path)
tar.extractall(path=data_dir)

print("Converting .sph to .wav...")
sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)
for sph_path in sph_list:
    wav_path = sph_path[:-4] + '.wav'
    cmd = ["sox", sph_path, wav_path]
    subprocess.run(cmd)
print("Finished conversion.\n******")

/content
******
Dataset downloaded at: data/an4_sphere.tar.gz
Converting .sph to .wav...
Finished conversion.
******


Since an4 is not designed for speaker recognition, this facilitates the oppurtunity to demostrate how you can generate manifest files that are necessary for training. These methods can be applied to any dataset to get similar training manifest files. 

First get a scp file(s) which has all the wav files with absolute path for each of train,dev and test set. This can be easily done by `find` bash command

In [0]:
!find $PWD/data/an4/wav/an4_clstk  -iname "*.wav" > data/an4/wav/an4_clstk/train_all.scp

Let's look at first 3 lines of scp file for train. 

In [4]:
!head -n 3 data/an4/wav/an4_clstk/train_all.scp

/content/data/an4/wav/an4_clstk/fsrb/an169-fsrb-b.wav
/content/data/an4/wav/an4_clstk/fsrb/an166-fsrb-b.wav
/content/data/an4/wav/an4_clstk/fsrb/an170-fsrb-b.wav


Since we created scp file for train, we use `scp_to_manifest.py` to convert this scp file to manifest and then optionally split the files to train \& dev for evaluating the models while training by using `--split` flag. So as you guessed we wouldn't be needing `--split` option for test folder. 

In [5]:
!python scp_to_manifest.py --scp data/an4/wav/an4_clstk/train_all.scp --id -2 --out data/an4/wav/an4_clstk/all_manifest.json --split

100% 948/948 [01:16<00:00, 12.39it/s]
853
wrote data/an4/wav/an4_clstk/train.json
wrote data/an4/wav/an4_clstk/dev.json


Generating scp for test folder and then converting to manifest type. 

In [6]:
!find $PWD/data/an4/wav/an4test_clstk  -iname "*.wav" > data/an4/wav/an4test_clstk/test_all.scp
!python scp_to_manifest.py --scp data/an4/wav/an4test_clstk/test_all.scp --id -2 --out data/an4/wav/an4test_clstk/test.json

100% 130/130 [00:10<00:00, 12.38it/s]


In [7]:
!git clone https://github.com/NVIDIA/NeMo.git
os.chdir('NeMo')
!bash reinstall.sh

Cloning into 'NeMo'...
remote: Enumerating objects: 212, done.[K
remote: Counting objects: 100% (212/212), done.[K
remote: Compressing objects: 100% (147/147), done.[K
remote: Total 27862 (delta 114), reused 126 (delta 65), pack-reused 27650[K
Receiving objects: 100% (27862/27862), 109.50 MiB | 32.59 MiB/s, done.
Resolving deltas: 100% (19493/19493), done.
Uninstalling stuff
Uninstalling nemo-toolkit-0.10.0b10:
  Successfully uninstalled nemo-toolkit-0.10.0b10
Installing stuff
Obtaining file:///content/NeMo
Collecting html2text
  Downloading https://files.pythonhosted.org/packages/ae/88/14655f727f66b3e3199f4467bafcc88283e6c31b562686bf606264e09181/html2text-2020.1.16-py3-none-any.whl
Collecting ipdb
  Downloading https://files.pythonhosted.org/packages/2c/bb/a3e1a441719ebd75c6dac8170d3ddba884b7ee8a5c0f9aefa7297386627a/ipdb-0.13.2.tar.gz
Collecting jupyterlab
[?25l  Downloading https://files.pythonhosted.org/packages/ec/30/03638fbb348e55af6375916962ddbfca786bd31cff9899b86162e2fc0cda

Import necessary packages

In [8]:
from ruamel.yaml import YAML

import nemo
import nemo.collections.asr as nemo_asr
import copy
from functools import partial

################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################



# Building Training and Evaluation DAGs with NeMo
Building a model using NeMo consists of 

1.  Instantiating the neural modules we need
2.  specifying the DAG by linking them together.

In NeMo, the training and inference pipelines are managed by a NeuralModuleFactory, which takes care of checkpointing, callbacks, and logs, along with other details in training and inference. We set its log_dir argument to specify where our model logs and outputs will be written, and can set other training and inference settings in its constructor. For instance, if we were resuming training from a checkpoint, we would set the argument checkpoint_dir=`<path_to_checkpoint>`.

Along with logs in NeMo, you can optionally view the tensorboard logs with the create_tb_writer=True argument to the NeuralModuleFactory. By default all the tensorboard log files will be stored in {log_dir}/tensorboard, but you can change this with the tensorboard_dir argument. One can load tensorboard logs through tensorboard by running tensorboard --logdir=`<path_to_tensorboard dir>` in the terminal.

In [0]:
exp_name = 'quartznet3x1_an4'
work_dir = './myExps/'
neural_factory = nemo.core.NeuralModuleFactory(
    log_dir=work_dir+"/as4_logdir/",
    checkpoint_dir="./myExps/checkpoints/" + exp_name,
    create_tb_writer=True,
    random_seed=42,
    tensorboard_dir=work_dir+'/tensorboard/',
)

Now that we have our neural module factory, we can specify our **neural modules and instantiate them**. Here, we load the parameters for each module from the configuration file. 

In [20]:
logging = nemo.logging
yaml = YAML(typ="safe")
with open('../configs/quartznet_spkr_3x1x512_xvector.yaml') as f:
    spkr_params = yaml.load(f)

sample_rate = spkr_params["sample_rate"]
time_length = spkr_params.get("time_length", 8)
logging.info("max time length considered for each file is {} sec".format(time_length))

[NeMo I 2020-05-11 02:21:15 <ipython-input-20-4537deb08e36>:8] max time length considered for each file is 8 sec


Instantiating train data_layer using config arguments. `labels = None` automatically creates output labels from manifest files, if you would like to pass those speaker names you can use the labels option. So while instatilatin eval data_layer we can use data_layer labels as it should both match to same speaker output labels. This comes handy while training on multiple datasets with more than one manifest file. 

In [21]:
train_dl_params = copy.deepcopy(spkr_params["AudioToSpeechLabelDataLayer"])
train_dl_params.update(spkr_params["AudioToSpeechLabelDataLayer"]["train"])
del train_dl_params["train"]
del train_dl_params["eval"]

batch_size=64
data_layer_train = nemo_asr.AudioToSpeechLabelDataLayer(
        manifest_filepath='../data/an4/wav/an4_clstk/train.json',
        labels=None,
        batch_size=batch_size,
        time_length=time_length,
        **train_dl_params,
    )

eval_dl_params = copy.deepcopy(spkr_params["AudioToSpeechLabelDataLayer"])
eval_dl_params.update(spkr_params["AudioToSpeechLabelDataLayer"]["eval"])
del eval_dl_params["train"]
del eval_dl_params["eval"]

data_layer_eval = nemo_asr.AudioToSpeechLabelDataLayer(
    manifest_filepath="../data/an4/wav/an4_clstk/dev.json",
    labels=data_layer_train.labels,
    batch_size=batch_size,
    time_length=time_length,
    **eval_dl_params,
)

data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
        sample_rate=sample_rate, **spkr_params["AudioToMelSpectrogramPreprocessor"],
    )
encoder = nemo_asr.JasperEncoder(**spkr_params["JasperEncoder"],)

decoder = nemo_asr.JasperDecoderForSpkrClass(
        feat_in=spkr_params["JasperEncoder"]["jasper"][-1]["filters"],
        num_classes=data_layer_train.num_classes,
        pool_mode=spkr_params["JasperDecoderForSpkrClass"]['pool_mode'],
        emb_sizes=spkr_params["JasperDecoderForSpkrClass"]["emb_sizes"].split(","),
    )

xent_loss = nemo_asr.CrossEntropyLossNM(weight=None)

[NeMo I 2020-05-11 02:21:16 collections:234] Filtered duration for loading collection is 0.000000.
[NeMo I 2020-05-11 02:21:16 collections:237] # 853 files loaded accounting to # 74 labels
[NeMo I 2020-05-11 02:21:16 data_layer:962] # of classes :74
[NeMo I 2020-05-11 02:21:17 collections:234] Filtered duration for loading collection is 0.000000.
[NeMo I 2020-05-11 02:21:17 collections:237] # 95 files loaded accounting to # 74 labels
[NeMo I 2020-05-11 02:21:17 data_layer:962] # of classes :74
[NeMo I 2020-05-11 02:21:17 features:144] PADDING: 16
[NeMo I 2020-05-11 02:21:17 features:165] STFT using torch


The next step is to assemble our training DAG by specifying the inputs to each neural module.

In [22]:
audio_signal, audio_signal_len, label, label_len = data_layer_train()
processed_signal, processed_signal_len = data_preprocessor(input_signal=audio_signal, length=audio_signal_len)
encoded, encoded_len = encoder(audio_signal=processed_signal, length=processed_signal_len)
logits, _ = decoder(encoder_output=encoded)
loss = xent_loss(logits=logits, labels=label)

[NeMo W 2020-05-11 02:21:18 graph_outputs:167] Setting unigue name of the default output port `audio_signal` produced in step 10 by `audiotospeechlabeldatalayer2` to `10_audiotospeechlabeldatalayer2_audio_signal`
[NeMo W 2020-05-11 02:21:18 graph_outputs:167] Setting unigue name of the default output port `a_sig_length` produced in step 10 by `audiotospeechlabeldatalayer2` to `10_audiotospeechlabeldatalayer2_a_sig_length`
[NeMo W 2020-05-11 02:21:18 graph_outputs:167] Setting unigue name of the default output port `label` produced in step 10 by `audiotospeechlabeldatalayer2` to `10_audiotospeechlabeldatalayer2_label`
[NeMo W 2020-05-11 02:21:18 graph_outputs:167] Setting unigue name of the default output port `label_length` produced in step 10 by `audiotospeechlabeldatalayer2` to `10_audiotospeechlabeldatalayer2_label_length`
[NeMo W 2020-05-11 02:21:18 graph_outputs:167] Setting unigue name of the default output port `processed_signal` produced in step 11 by `audiotomelspectrogramprep

We would like to be able to evaluate our model on the dev set, as well, so let's set up the evaluation DAG.

Our evaluation DAG will reuse most of the parts of the training DAG with the exception of the data layer, since we are loading the evaluation data from a different file but evaluating on the same model. Note that if we were using data augmentation in training, we would also leave that out in the evaluation DAG.

In [23]:
audio_signal_test, audio_len_test, label_test, _ = data_layer_eval()
processed_signal_test, processed_len_test = data_preprocessor(
            input_signal=audio_signal_test, length=audio_len_test
        )
encoded_test, encoded_len_test = encoder(audio_signal=processed_signal_test, length=processed_len_test)
logits_test, _ = decoder(encoder_output=encoded_test)
loss_test = xent_loss(logits=logits_test, labels=label_test)

[NeMo W 2020-05-11 02:21:19 graph_outputs:167] Setting unigue name of the default output port `audio_signal` produced in step 15 by `audiotospeechlabeldatalayer3` to `15_audiotospeechlabeldatalayer3_audio_signal`
[NeMo W 2020-05-11 02:21:19 graph_outputs:167] Setting unigue name of the default output port `a_sig_length` produced in step 15 by `audiotospeechlabeldatalayer3` to `15_audiotospeechlabeldatalayer3_a_sig_length`
[NeMo W 2020-05-11 02:21:19 graph_outputs:167] Setting unigue name of the default output port `label` produced in step 15 by `audiotospeechlabeldatalayer3` to `15_audiotospeechlabeldatalayer3_label`
[NeMo W 2020-05-11 02:21:19 graph_outputs:167] Setting unigue name of the default output port `label_length` produced in step 15 by `audiotospeechlabeldatalayer3` to `15_audiotospeechlabeldatalayer3_label_length`
[NeMo W 2020-05-11 02:21:19 graph_outputs:167] Setting unigue name of the default output port `processed_signal` produced in step 16 by `audiotomelspectrogramprep

# Creating CallBacks

We would like to be able to monitor our model while it's training, so we use callbacks. In general, callbacks are functions that are called at specific intervals over the course of training or inference, such as at the start or end of every n iterations, epochs, etc. The callbacks we'll be using for this are the SimpleLossLoggerCallback, which reports the training loss (or another metric of your choosing, such as \% accuracy for speaker recognition tasks), and the EvaluatorCallback, which regularly evaluates the model on the dev set. Both of these callbacks require you to pass in the tensors to be evaluated--these would be the final outputs of the training and eval DAGs above.

Another useful callback is the CheckpointCallback, for saving checkpoints at set intervals. We create one here just to demonstrate how it works.

In [0]:
from nemo.collections.asr.helpers import (
    monitor_classification_training_progress,
    process_classification_evaluation_batch,
    process_classification_evaluation_epoch,
)
from nemo.utils.lr_policies import CosineAnnealing

train_callback = nemo.core.SimpleLossLoggerCallback(
        tensors=[loss, logits, label],
        print_func=partial(monitor_classification_training_progress, eval_metric=[1]),
        step_freq=40,
        get_tb_values=lambda x: [("train_loss", x[0])],
        tb_writer=neural_factory.tb_writer,
    )

callbacks = [train_callback]

chpt_callback = nemo.core.CheckpointCallback(
            folder="./myExps/checkpoints/" + exp_name,
            load_from_folder="./myExps/checkpoints/" + exp_name,
            step_freq=100,
        )
callbacks.append(chpt_callback)

tagname = "an4_dev"
eval_callback = nemo.core.EvaluatorCallback(
            eval_tensors=[loss_test, logits_test, label_test],
            user_iter_callback=partial(process_classification_evaluation_batch, top_k=1),
            user_epochs_done_callback=partial(process_classification_evaluation_epoch, tag=tagname),
            eval_step=100,  # How often we evaluate the model on the test set
            tb_writer=neural_factory.tb_writer,
        )

callbacks.append(eval_callback)

Now that we have our model and callbacks set up, how do we run it?

Once we create our neural factory and the callbacks for the information that we want to see, we can start training by simply calling the train function on the tensors we want to optimize and our callbacks! Since this notebook is for you to get started, by an4 as dataset is small it would quickly get higher accuracies. For better models use bigger datasets

In [25]:
# train model
num_epochs=100
N = len(data_layer_train)
steps_per_epoch = N // batch_size

logging.info("Number of steps per epoch {}".format(steps_per_epoch))

neural_factory.train(
        tensors_to_optimize=[loss],
        callbacks=callbacks,
        lr_policy=CosineAnnealing(
            num_epochs * steps_per_epoch, warmup_steps=0.1 * num_epochs * steps_per_epoch,
        ),
        optimizer="novograd",
        optimization_params={
            "num_epochs": num_epochs,
            "lr": 0.01,
            "betas": (0.95, 0.5),
            "weight_decay": 0.001,
            "grad_norm_clip": None,
        }
    )

[NeMo I 2020-05-11 02:21:22 <ipython-input-25-e501be8813ec>:5] Number of steps per epoch 26
[NeMo I 2020-05-11 02:21:22 callbacks:187] Starting .....
[NeMo I 2020-05-11 02:21:22 callbacks:359] Found 2 modules with weights:
[NeMo I 2020-05-11 02:21:22 callbacks:361] JasperEncoder
[NeMo I 2020-05-11 02:21:22 callbacks:361] JasperDecoderForSpkrClass
[NeMo I 2020-05-11 02:21:22 callbacks:362] Total model parameters: 11400194
[NeMo I 2020-05-11 02:21:22 callbacks:311] Found checkpoint folder ./myExps/checkpoints/quartznet3x1_an4. Will attempt to restore checkpoints from it.


[NeMo W 2020-05-11 02:21:22 callbacks:328] Error(s) in loading state_dict for JasperEncoder:
    	Unexpected key(s) in state_dict: "encoder.1.mconv.4.conv.weight", "encoder.1.mconv.5.weight", "encoder.1.mconv.5.bias", "encoder.1.mconv.5.running_mean", "encoder.1.mconv.5.running_var", "encoder.1.mconv.5.num_batches_tracked", "encoder.2.mconv.4.conv.weight", "encoder.2.mconv.5.weight", "encoder.2.mconv.5.bias", "encoder.2.mconv.5.running_mean", "encoder.2.mconv.5.running_var", "encoder.2.mconv.5.num_batches_tracked", "encoder.3.mconv.4.conv.weight", "encoder.3.mconv.5.weight", "encoder.3.mconv.5.bias", "encoder.3.mconv.5.running_mean", "encoder.3.mconv.5.running_var", "encoder.3.mconv.5.num_batches_tracked". 
[NeMo W 2020-05-11 02:21:22 callbacks:330] Checkpoint folder ./myExps/checkpoints/quartznet3x1_an4 was present but nothing was restored. Continuing training from random initialization.


[NeMo I 2020-05-11 02:21:22 callbacks:199] Starting epoch 0
[NeMo I 2020-05-11 02:21:22 callbacks:224] Step: 0
[NeMo I 2020-05-11 02:21:22 helpers:104] Loss: 5.108186721801758
[NeMo I 2020-05-11 02:21:22 helpers:110] training_batch_top@1:  3.1250
[NeMo I 2020-05-11 02:21:22 callbacks:239] Step time: 0.32698655128479004 seconds
[NeMo I 2020-05-11 02:21:22 callbacks:445] Doing Evaluation ..............................
[NeMo I 2020-05-11 02:21:22 callbacks:450] Evaluation time: 0.36254286766052246 seconds
[NeMo I 2020-05-11 02:21:30 callbacks:207] Finished epoch 0 in 0:00:08.636425
[NeMo I 2020-05-11 02:21:30 callbacks:199] Starting epoch 1
[NeMo I 2020-05-11 02:21:35 callbacks:224] Step: 40
[NeMo I 2020-05-11 02:21:35 helpers:104] Loss: 4.767226696014404
[NeMo I 2020-05-11 02:21:35 helpers:110] training_batch_top@1:  0.0000
[NeMo I 2020-05-11 02:21:35 callbacks:239] Step time: 0.2993435859680176 seconds
[NeMo I 2020-05-11 02:21:39 callbacks:207] Finished epoch 1 in 0:00:08.375817
[NeMo I

Now that we trained our embeddings, we shall extract these embeddings from using pretrained checkpoint present at `checkpoint_dir`. As we can see from the neural architecture, we extract the embeddings after `emb1` layer. 
![Speaker Recognition Layers](./speaker_reco.png)

Now use test manifest to get the embeddings. As we saw before let's create a new data\_layer for test. And use previously instiated models and attach the DAGs

In [26]:
eval_dl_params = copy.deepcopy(spkr_params["AudioToSpeechLabelDataLayer"])
eval_dl_params.update(spkr_params["AudioToSpeechLabelDataLayer"]["eval"])
del eval_dl_params["train"]
del eval_dl_params["eval"]
eval_dl_params['shuffle'] = False  # To grab  the file names without changing data_layer

test_dataset = '../data/an4/wav/an4test_clstk/test.json'
data_layer_test = nemo_asr.AudioToSpeechLabelDataLayer(
        manifest_filepath=test_dataset,
        labels=None,
        batch_size=batch_size,
        **eval_dl_params,
    )

audio_signal_test, audio_len_test, label_test, _ = data_layer_test()
processed_signal_test, processed_len_test = data_preprocessor(
    input_signal=audio_signal_test, length=audio_len_test)
encoded_test, _ = encoder(audio_signal=processed_signal_test, length=processed_len_test)
_, embeddings = decoder(encoder_output=encoded_test)

[NeMo I 2020-05-11 02:36:00 collections:234] Filtered duration for loading collection is 0.000000.
[NeMo I 2020-05-11 02:36:00 collections:237] # 130 files loaded accounting to # 10 labels
[NeMo I 2020-05-11 02:36:00 data_layer:962] # of classes :10


[NeMo W 2020-05-11 02:36:00 graph_outputs:167] Setting unigue name of the default output port `audio_signal` produced in step 20 by `audiotospeechlabeldatalayer4` to `20_audiotospeechlabeldatalayer4_audio_signal`
[NeMo W 2020-05-11 02:36:00 graph_outputs:167] Setting unigue name of the default output port `a_sig_length` produced in step 20 by `audiotospeechlabeldatalayer4` to `20_audiotospeechlabeldatalayer4_a_sig_length`
[NeMo W 2020-05-11 02:36:00 graph_outputs:167] Setting unigue name of the default output port `label` produced in step 20 by `audiotospeechlabeldatalayer4` to `20_audiotospeechlabeldatalayer4_label`
[NeMo W 2020-05-11 02:36:00 graph_outputs:167] Setting unigue name of the default output port `label_length` produced in step 20 by `audiotospeechlabeldatalayer4` to `20_audiotospeechlabeldatalayer4_label_length`
[NeMo W 2020-05-11 02:36:00 graph_outputs:167] Setting unigue name of the default output port `processed_signal` produced in step 21 by `audiotomelspectrogramprep

Now get the embeddings using neural_factor infer command, that just does forward pass of all our modules. And save our embeddings in `<work_dir>/embeddings`

In [27]:
import numpy as np
import json
eval_tensors = neural_factory.infer(tensors=[embeddings, label_test], checkpoint_dir="./myExps/checkpoints/" + exp_name)
    # inf_loss , inf_emb, inf_logits, inf_label = eval_tensors
inf_emb, inf_label = eval_tensors
whole_embs = []
whole_labels = []
manifest = open(test_dataset, 'r').readlines()

for line in manifest:
    line = line.strip()
    dic = json.loads(line)
    filename = dic['audio_filepath'].split('/')[-1]
    whole_labels.append(filename)

for idx in range(len(inf_label)):
    whole_embs.extend(inf_emb[idx].numpy())

embedding_dir = './myExps/embeddings/'
if not os.path.exists(embedding_dir):
    os.mkdir(embedding_dir)

filename = os.path.basename(test_dataset).split('.')[0]
name = embedding_dir + filename

np.save(name + '.npy', np.asarray(whole_embs))
np.save(name + '_labels.npy', np.asarray(whole_labels))
logging.info("Saved embedding files to {}".format(embedding_dir))


[NeMo I 2020-05-11 02:36:07 actions:1533] Restoring JasperEncoder from ./myExps/checkpoints/quartznet3x1_an4/JasperEncoder-STEP-2700.pt
[NeMo I 2020-05-11 02:36:07 actions:1533] Restoring JasperDecoderForSpkrClass from ./myExps/checkpoints/quartznet3x1_an4/JasperDecoderForSpkrClass-STEP-2700.pt
[NeMo I 2020-05-11 02:36:07 actions:759] Evaluating batch 0 out of 5
[NeMo I 2020-05-11 02:36:07 actions:759] Evaluating batch 1 out of 5
[NeMo I 2020-05-11 02:36:07 actions:759] Evaluating batch 2 out of 5
[NeMo I 2020-05-11 02:36:08 actions:759] Evaluating batch 3 out of 5
[NeMo I 2020-05-11 02:36:08 actions:759] Evaluating batch 4 out of 5
[NeMo I 2020-05-11 02:36:08 <ipython-input-27-e3f164d28ce7>:28] Saved embedding files to ./myExps/embeddings/


In [28]:
ls myExps/embeddings/

test_labels.npy  test.npy
