# NeMo offline ASR

This notebook demonstrates how to:

* transcribe an audio file (offline ASR) with a greedy decoder
* extract timestamp information from the model to split audio into separate words
* use a beam search decoder with N-gram language model re-scoring

For more information on training and using language models with ASR models, see:
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html


## Installation
NeMo can be installed with a simple pip command.

The optional CTC beam search decoder might require a restart of the Colab runtime after installation.

In [2]:
BRANCH = 'r1.6.1'
try:
    # Import NeMo Speech Recognition collection
    import nemo.collections.asr as nemo_asr
except ModuleNotFoundError:
    !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
    # import again now that the package is installed
    import nemo.collections.asr as nemo_asr

# check if we have optional Plotly for visualization
try:
    from plotly import graph_objects as go
except ModuleNotFoundError:
    !pip install plotly
    from plotly import graph_objects as go

# check if we have optional ipywidgets for tqdm/notebook
try:
    import ipywidgets
except ModuleNotFoundError:
    !pip install ipywidgets

# check if CTC beam decoders are installed
try:
    import ctc_decoders
except ModuleNotFoundError:
    # install beam search decoder
    !apt-get install -y swig
    !git clone https://github.com/NVIDIA/NeMo -b "$BRANCH"
    !cd NeMo && bash scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh
    print('Restarting Colab runtime to successfully import built module.')
    print('Please re-run the notebook.')
    import os
    os.kill(os.getpid(), 9)

[NeMo W 2022-02-02 18:38:29 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
################################################################################
###          (please add 'export KALDI_ROOT=<your_path>' in your $HOME/.profile)
###          (or run as: KALDI_ROOT=<your_path> python <your_script>.py)
################################################################################

[NeMo W 2022-02-02 18:38:29 experimental:28] Module <function get_argmin_mat at 0x7f0527a71e60> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-02-02 18:38:29 experimental:28] Module <function getMultiScaleCosAffinityMatrix at 0x7f0527a71ef0> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-02-02 18:38:29 experimental:28] Module <function parse_scale_configs at 0x7f0527a71b00> is experimental, not ready for production and is not fully supported. Us

In [1]:
import numpy as np
# Import audio processing library
import librosa
# We'll use this to listen to audio
from IPython.display import Audio, display

## Instantiate pre-trained NeMo model
The ``from_pretrained(...)`` API downloads and initializes a model directly from the cloud.

Alternatively, ``restore_from(...)`` loads a model from disk.

To display the pre-trained models available from the cloud, use the ``list_available_models()`` method.

In [3]:
nemo_asr.models.EncDecCTCModel.list_available_models()

[PretrainedModelInfo(
 	pretrained_model_name=QuartzNet15x5Base-En,
 	description=QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other. Please visit https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels for further details.,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo
 ), PretrainedModelInfo(
 	pretrained_model_name=stt_en_quartznet15x5,
 	description=For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_quartznet15x5,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_quartznet15x5/versions/1.0.0rc1/files/stt_en_quartznet15x5.nemo
 ), PretrainedModelInfo(
 	pretr

Let's load a base English QuartzNet15x5 model.

In [4]:
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='QuartzNet15x5Base-En', strict=False)

[NeMo I 2022-02-02 18:38:38 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo to /root/.cache/torch/NeMo/NeMo_1.6.1/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo
[NeMo I 2022-02-02 18:38:41 common:728] Instantiating model from pre-trained checkpoint
[NeMo I 2022-02-02 18:38:42 features:264] PADDING: 16
[NeMo I 2022-02-02 18:38:42 features:281] STFT using torch
[NeMo I 2022-02-02 18:38:53 save_restore_connector:154] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.6.1/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.


## Get test audio clip

Let's download and analyze a test audio signal.

In [5]:
# Download audio sample which we'll try
# This is a sample from LibriSpeech dev clean subset - the model hasn't seen it before
AUDIO_FILENAME = '1919-142785-0028.wav'
!wget https://dldata-public.s3.us-east-2.amazonaws.com/1919-142785-0028.wav

# load audio signal with librosa
signal, sample_rate = librosa.load(AUDIO_FILENAME, sr=None)

# display audio player for the signal
display(Audio(data=signal, rate=sample_rate))

# plot the signal in time domain
fig_signal = go.Figure(
    go.Scatter(x=np.arange(signal.shape[0])/sample_rate,
               y=signal, line={'color': 'green'},
               name='Waveform',
               hovertemplate='Time: %{x:.2f} s<br>Amplitude: %{y:.2f}<br><extra></extra>'),
    layout={
        'height': 300,
        'xaxis': {'title': 'Time, s'},
        'yaxis': {'title': 'Amplitude'},
        'title': 'Audio Signal',
        'margin': dict(l=0, r=0, t=40, b=0, pad=0),
    }
)
fig_signal.show()

# calculate amplitude spectrum
time_stride=0.01
hop_length = int(sample_rate*time_stride)
n_fft = 512
# linear scale spectrogram
s = librosa.stft(y=signal,
                 n_fft=n_fft,
                 hop_length=hop_length)
s_db = librosa.power_to_db(np.abs(s)**2, ref=np.max, top_db=100)

# plot the signal in frequency domain
fig_spectrum = go.Figure(
    go.Heatmap(z=s_db,
               colorscale=[
                   [0, 'rgb(30,62,62)'],
                   [0.5, 'rgb(30,128,128)'],
                   [1, 'rgb(30,255,30)'],
               ],
               colorbar=dict(
                   ticksuffix=' dB'
               ),
               dx=time_stride, dy=sample_rate/n_fft/1000,
               name='Spectrogram',
               hovertemplate='Time: %{x:.2f} s<br>Frequency: %{y:.2f} kHz<br>Magnitude: %{z:.2f} dB<extra></extra>'),
    layout={
        'height': 300,
        'xaxis': {'title': 'Time, s'},
        'yaxis': {'title': 'Frequency, kHz'},
        'title': 'Spectrogram',
        'margin': dict(l=0, r=0, t=40, b=0, pad=0),
    }
)
fig_spectrum.show()

--2022-02-02 18:39:04--  https://dldata-public.s3.us-east-2.amazonaws.com/1919-142785-0028.wav
Resolving dldata-public.s3.us-east-2.amazonaws.com (dldata-public.s3.us-east-2.amazonaws.com)... 52.219.98.234
Connecting to dldata-public.s3.us-east-2.amazonaws.com (dldata-public.s3.us-east-2.amazonaws.com)|52.219.98.234|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 165164 (161K) [audio/wav]
Saving to: ‘1919-142785-0028.wav’


2022-02-02 18:39:05 (837 KB/s) - ‘1919-142785-0028.wav’ saved [165164/165164]
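
The spectrogram cell above derives its frame step and frequency resolution from the sample rate. As a rough sketch of that arithmetic, assuming a 16 kHz sample rate (an assumption made here for illustration; the real value is whatever `librosa.load` returns for the file):

```python
# Assumed sample rate; in the notebook this value comes from librosa.load(...)
sample_rate = 16000
time_stride = 0.01   # seconds between consecutive STFT frames
n_fft = 512          # FFT window size in samples

# number of samples between consecutive frames
hop_length = int(sample_rate * time_stride)
# number of frequency rows in the STFT matrix
n_freq_bins = n_fft // 2 + 1
# width of one frequency bin in kHz (the `dy` passed to the heatmap)
bin_width_khz = sample_rate / n_fft / 1000

print(hop_length, n_freq_bins, bin_width_khz)
```

Under this assumption each spectrogram column covers a 10 ms step and each row covers 31.25 Hz.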



## Offline inference
If we have an entire audio clip available, then we can do offline inference with a pre-trained model to transcribe it.

The easiest way is to call the ASR model's ``transcribe(...)`` method, which can transcribe multiple files in a batch.

In [6]:
# Convert our audio sample to text
files = [AUDIO_FILENAME]
transcript = asr_model.transcribe(paths2audio_files=files)[0]
print(f'Transcript: "{transcript}"')

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

Transcript: "boil them before they are put into the soup or other dish they may be intended for"


## Extract timestamps and split words
``transcribe()`` generates text by applying a CTC greedy decoder to the raw probability distribution over the alphabet's characters that the ASR model produces at each timestep. We can get those raw per-timestep outputs with the ``logprobs=True`` argument.

In [7]:
# softmax implementation in NumPy
def softmax(logits):
    # subtract the per-row maximum for numerical stability
    e = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# let's do inference once again but without decoder
logits = asr_model.transcribe(files, logprobs=True)[0]
probs = softmax(logits)
print(probs)

# 20ms is duration of a timestep at output of the model
time_stride = 0.02

# get model's alphabet
labels = list(asr_model.decoder.vocabulary) + ['blank']
labels[0] = 'space'

# plot probability distribution over characters for each timestep
fig_probs = go.Figure(
    go.Heatmap(z=probs.transpose(),
               colorscale=[
                   [0, 'rgb(30,62,62)'],
                   [1, 'rgb(30,255,30)'],
               ],
               y=labels,
               dx=time_stride,
               name='Probs',
               hovertemplate='Time: %{x:.2f} s<br>Character: %{y}<br>Probability: %{z:.2f}<extra></extra>'),
    layout={
        'height': 300,
        'xaxis': {'title': 'Time, s'},
        'yaxis': {'title': 'Characters'},
        'title': 'Character Probabilities',
        'margin': dict(l=0, r=0, t=40, b=0, pad=0),
    }
)
fig_probs.show()

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

[[1.12356240e-08 6.21769871e-08 7.64101884e-08 ... 1.46942580e-09
  4.43366694e-08 9.99999166e-01]
 [2.82938597e-08 1.12658874e-07 1.37950153e-07 ... 3.10133075e-09
  9.42624325e-08 9.99998331e-01]
 [3.02687830e-09 1.15469119e-08 4.49922908e-08 ... 6.33829544e-10
  1.30119187e-08 9.99999702e-01]
 ...
 [3.78954160e-11 6.40935785e-12 5.61221002e-16 ... 3.62100190e-15
  1.58321099e-12 1.00000000e+00]
 [2.22706187e-09 1.72049198e-11 1.10573053e-14 ... 1.42608093e-13
  5.05930575e-11 1.00000000e+00]
 [1.34128464e-09 9.75943562e-11 2.50892819e-14 ... 9.27765352e-13
  5.40039506e-11 1.00000000e+00]]
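
For reference, greedy CTC decoding can be sketched in a few lines of NumPy: take the most likely symbol at each timestep, merge consecutive repeats, and drop the blank symbol. The function `ctc_greedy_decode` and the toy vocabulary below are hypothetical illustrations, not NeMo's implementation. The sketch assumes the blank symbol is the last column, as in the `labels` list above.

```python
import numpy as np

def ctc_greedy_decode(probs, vocabulary):
    """Greedy CTC decoding of a [time, len(vocabulary) + 1] matrix.

    Assumes the last column corresponds to the CTC blank symbol.
    """
    blank_idx = len(vocabulary)
    best_path = np.argmax(probs, axis=-1)
    chars = []
    prev = blank_idx
    for idx in best_path:
        # emit a symbol only when it differs from the previous frame
        # and is not the blank
        if idx != prev and idx != blank_idx:
            chars.append(vocabulary[idx])
        prev = idx
    return ''.join(chars)

# toy example: 3-symbol vocabulary plus blank (index 3)
toy_vocab = [' ', 'a', 'b']
toy_probs = np.eye(4)[[1, 1, 3, 1, 0, 2, 2, 3]]
decoded = ctc_greedy_decode(toy_probs, toy_vocab)
print(repr(decoded))
```

Note how the blank between the two `a` frames keeps them from being merged into one character.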


It is easy to identify the timesteps that correspond to the space character.

In [9]:
# get timestamps for space symbols
spaces = []

state = ''
idx_state = 0

if np.argmax(probs[0]) == 0:
    state = 'space'

for idx in range(1, probs.shape[0]):
    current_char_idx = np.argmax(probs[idx])
    if state == 'space' and current_char_idx != 0 and current_char_idx != 28:
        spaces.append([idx_state, idx-1])
        state = ''
    if state == '':
        if current_char_idx == 0:
            state = 'space'
            idx_state = idx

if state == 'space':
    # close a span that runs to the last timestep
    spaces.append([idx_state, len(probs)-1])
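
The cell above hard-codes `0` for the space index and `28` for the blank index. A minimal sketch of the same scan with those indices passed in as parameters is shown below. The function name `find_space_spans` is hypothetical, and the toy input stands in for `np.argmax(probs, axis=-1)`.

```python
def find_space_spans(best_path, space_idx, blank_idx):
    """Return (first_frame, last_frame) for each run of space frames.

    A run starts at a space frame and extends over following space or
    blank frames until the next non-space, non-blank symbol.
    """
    spans = []
    start = None
    for t, c in enumerate(best_path):
        if start is None:
            if c == space_idx:
                start = t
        elif c != space_idx and c != blank_idx:
            spans.append((start, t - 1))
            start = None
    if start is not None:
        # the clip ends while still inside a space run
        spans.append((start, len(best_path) - 1))
    return spans

# toy best path: 0 = space, 28 = blank, other values = letters
toy_path = [2, 2, 0, 0, 28, 3, 28, 0, 28]
toy_spans = find_space_spans(toy_path, space_idx=0, blank_idx=28)
print(toy_spans)
```

With a real model, the blank index would typically be `len(asr_model.decoder.vocabulary)`, assuming the blank is appended after the vocabulary as in the `labels` list built earlier.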

Then we can split the original audio signal into separate words. It is worth mentioning that all timestamps have a delay (or an offset) that depends on the model. We need to take it into account for alignment.

In [10]:
# calibration offset for timestamps: 180 ms
offset = -0.18

# split the transcript into words
words = transcript.split()

# cut words
pos_prev = 0
for j, spot in enumerate(spaces):
    display(words[j])
    pos_end = offset + (spot[0]+spot[1])/2*time_stride
    display(Audio(signal[int(pos_prev*sample_rate):int(pos_end*sample_rate)],
                 rate=sample_rate))
    pos_prev = pos_end

display(words[j+1])
display(Audio(signal[int(pos_prev*sample_rate):],
        rate=sample_rate))

'boil'

'them'

'before'

'they'

'are'

'put'

'into'

'the'

'soup'

'or'

'other'

'dish'

'they'

'may'

'be'

'intended'

'for'
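
The cutting loop above places each word boundary at the middle of a space span, converts it to seconds with `time_stride`, and shifts it by `offset`. A small sketch of that calculation, returning `(word, start, end)` tuples instead of audio players, might look like this. The helper `word_segments` and the toy values are hypothetical; clamping to the clip duration is an addition that the loop above does not perform.

```python
def word_segments(words, spaces, time_stride, offset, duration):
    """Pair each word with its (start, end) time in seconds.

    `spaces` holds [first_frame, last_frame] for every gap between words.
    """
    if len(words) != len(spaces) + 1:
        raise ValueError('expected exactly one space span between adjacent words')
    # middle of each gap, converted to seconds and shifted by the offset
    cuts = [offset + (first + last) / 2 * time_stride for first, last in spaces]
    # keep cut points inside the clip
    cuts = [min(max(c, 0.0), duration) for c in cuts]
    bounds = [0.0] + cuts + [duration]
    return [(w, bounds[i], bounds[i + 1]) for i, w in enumerate(words)]

# toy example: two words separated by one space span covering frames 20..24
segments = word_segments(['boil', 'them'], [[20, 24]],
                         time_stride=0.02, offset=-0.18, duration=1.0)
print(segments)
```

Multiplying the start and end times by `sample_rate` gives the sample indices used for slicing `signal`.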

## Offline inference with beam search decoder and N-gram language model re-scoring

It is possible to use an external [KenLM](https://kheafield.com/code/kenlm/)-based N-gram language model to rescore multiple transcription candidates. 

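Let's download and preprocess the LibriSpeech 3-gram language model. The downloaded file is uppercase, so we convert it to lowercase to match the lowercase transcripts produced by the ASR model.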

In [1]:
import gzip
import os, shutil, wget

lm_gzip_path = '3-gram.pruned.1e-7.arpa.gz'
if not os.path.exists(lm_gzip_path):
    print('Downloading pruned 3-gram model.')
    lm_url = 'http://www.openslr.org/resources/11/3-gram.pruned.1e-7.arpa.gz'
    lm_gzip_path = wget.download(lm_url)
    print('Downloaded the 3-gram language model.')
else:
    print('Pruned .arpa.gz already exists.')

uppercase_lm_path = '3-gram.pruned.1e-7.arpa'
if not os.path.exists(uppercase_lm_path):
    with gzip.open(lm_gzip_path, 'rb') as f_zipped:
        with open(uppercase_lm_path, 'wb') as f_unzipped:
            shutil.copyfileobj(f_zipped, f_unzipped)
    print('Unzipped the 3-gram language model.')
else:
    print('Unzipped .arpa already exists.')

lm_path = 'lowercase_3-gram.pruned.1e-7.arpa'
if not os.path.exists(lm_path):
    with open(uppercase_lm_path, 'r') as f_upper:
        with open(lm_path, 'w') as f_lower:
            for line in f_upper:
                # write the lowercased line; printing every line would flood the output
                f_lower.write(line.lower())
print('Converted language model file to lowercase.')

ModuleNotFoundError: ignored

Let's instantiate the ``BeamSearchDecoderWithLM`` module.

In [12]:
beam_search_lm = nemo_asr.modules.BeamSearchDecoderWithLM(
    vocab=list(asr_model.decoder.vocabulary),
    beam_width=16,
    alpha=2, beta=1.5,
    lm_path=lm_path,
    num_cpus=os.cpu_count() or 1,  # os.cpu_count() may return None
    input_tensor=False)
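
The `alpha` and `beta` arguments are commonly interpreted as the language model weight and a per-word bonus in a combined score of the form `acoustic log-prob + alpha * LM log-prob + beta * word count`. The sketch below assumes that interpretation and is not taken from the decoder's source. The function `fused_score` and all log-probability values are hypothetical, chosen only to show how the terms combine.

```python
def fused_score(acoustic_logp, lm_logp, text, alpha, beta):
    """Combine acoustic and LM log-probabilities with a word-count bonus."""
    return acoustic_logp + alpha * lm_logp + beta * len(text.split())

# (text, acoustic log-prob, LM log-prob): made-up numbers for illustration
candidates = [
    ('intended for', -4.0, -6.0),
    ('intended fer', -3.5, -12.0),
]

ranked = sorted(
    candidates,
    key=lambda c: fused_score(c[1], c[2], c[0], alpha=2, beta=1.5),
    reverse=True,
)
best_text = ranked[0][0]
print(best_text)
```

In this toy case the second candidate has the better acoustic score, but the language model term outweighs it, so the first candidate ranks higher. A larger `alpha` trusts the language model more, and a larger `beta` favors hypotheses with more words.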

Now we can check all transcription candidates along with their scores.

In [13]:
beam_search_lm.forward(log_probs=np.expand_dims(probs, axis=0), log_probs_length=None)

[[(-53.303104400634766,
   'boil them before they are put into the soup or other dish they may be intended for'),
  (-62.76603317260742,
   'boil them before they are put into the soup or other dish they may be intended for '),
  (-65.28871154785156,
   'boil them before they are put into the soup or other dish they may be intended or'),
  (-65.70912170410156,
   'boil them before they are put into the soup or other dish they may be intend for'),
  (-68.9410171508789,
   'boil them before they are put into the soup or other dish they may be intended far'),
  (-70.1639175415039,
   'boil them before they are put into the soup or other dish they may be intended fer'),
  (-70.34444427490234,
   'boil them before they are put into the soup or other dish they may be intended fort'),
  (-70.37554168701172,
   'boil them before they are put into the soup or other dish they may be intended form'),
  (-70.41539764404297,
   'boil them before they are put into the soup or other dish they may be 