# NeMo offline ASR

This notebook demonstrates how to

* transcribe an audio file (offline ASR) with a greedy decoder

For more information on training and using language models with ASR models, see:
https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html



In [None]:
BRANCH = 'main'
try:
    # Import NeMo Speech Recognition collection
    import nemo.collections.asr as nemo_asr
except ModuleNotFoundError:
    !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
    # import again now that the package is installed
    import nemo.collections.asr as nemo_asr

# check if we have optional Plotly for visualization
try:
    from plotly import graph_objects as go
except ModuleNotFoundError:
    !pip install plotly
    from plotly import graph_objects as go

# check if we have optional ipywidgets for tqdm/notebook
try:
    import ipywidgets
except ModuleNotFoundError:
    !pip install ipywidgets

# check if CTC beam decoders are installed
try:
    import ctc_decoders
except ModuleNotFoundError:
    # install beam search decoder
    !apt-get update && apt-get install -y swig
    !git clone https://github.com/NVIDIA/NeMo -b "$BRANCH"
    !cd NeMo && bash scripts/asr_language_modeling/ngram_lm/install_beamsearch_decoders.sh
    # import os
    # os.kill(os.getpid(), 9)

In [2]:
!pip install gradio

In [6]:
import numpy as np
# Import audio processing library
import librosa
# We'll use this to listen to audio
from IPython.display import Audio, display

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Instantiate fine-tuned NeMo model
The ``from_pretrained(...)`` API downloads and initializes a model directly from the cloud.

Alternatively, ``restore_from(...)`` loads a model from a ``.nemo`` file on disk, which is how the fine-tuned model is loaded here.

To list the pre-trained models available in the cloud, use the ``list_available_models()`` method.

In [None]:
nemo_asr.models.EncDecCTCModel.list_available_models()

In [None]:
asr_model = nemo_asr.models.EncDecCTCModel.restore_from('/content/drive/MyDrive/Model-te.nemo', strict=False)
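
The choice between the two loading methods can be sketched as a small helper. This is a minimal sketch, not part of NeMo: ``choose_loader`` and ``load_ctc_model`` are hypothetical names, and treating any source that ends in ``.nemo`` as a local checkpoint is an assumption made for illustration.

```python
from pathlib import Path


def choose_loader(source: str) -> str:
    """Return the name of the class method suited to `source`.

    Assumption: a source ending in '.nemo' is a checkpoint on disk;
    anything else is treated as a pre-trained model name in the cloud.
    """
    return "restore_from" if Path(source).suffix == ".nemo" else "from_pretrained"


def load_ctc_model(source: str):
    """Load an EncDecCTCModel from a local file or from the cloud."""
    # Imported lazily so choose_loader() can be used without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model_cls = nemo_asr.models.EncDecCTCModel
    if choose_loader(source) == "restore_from":
        return model_cls.restore_from(source)
    return model_cls.from_pretrained(model_name=source)


print(choose_loader("/content/drive/MyDrive/Model-te.nemo"))  # prints "restore_from"
```

Any name passed to ``from_pretrained`` should come from the output of ``list_available_models()``.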

## Get test audio clip

Let's load and analyze a test audio signal.

In [9]:
# Path to the audio sample we'll transcribe
AUDIO_FILENAME = '/content/tel_0001.wav'
# Alternatively, download a sample from the LibriSpeech dev-clean subset:
# !wget https://dldata-public.s3.us-east-2.amazonaws.com/1919-142785-0028.wav

# load audio signal with librosa
signal, sample_rate = librosa.load(AUDIO_FILENAME, sr=None)

# display audio player for the signal
display(Audio(data=signal, rate=sample_rate))

# plot the signal in time domain
fig_signal = go.Figure(
    go.Scatter(x=np.arange(signal.shape[0])/sample_rate,
               y=signal, line={'color': 'green'},
               name='Waveform',
               hovertemplate='Time: %{x:.2f} s<br>Amplitude: %{y:.2f}<br><extra></extra>'),
    layout={
        'height': 300,
        'xaxis': {'title': 'Time, s'},
        'yaxis': {'title': 'Amplitude'},
        'title': 'Audio Signal',
        'margin': dict(l=0, r=0, t=40, b=0, pad=0),
    }
)
fig_signal.show()

# calculate amplitude spectrum
time_stride = 0.01
hop_length = int(sample_rate*time_stride)
n_fft = 512
# linear scale spectrogram
s = librosa.stft(y=signal,
                 n_fft=n_fft,
                 hop_length=hop_length)
s_db = librosa.power_to_db(np.abs(s)**2, ref=np.max, top_db=100)

# plot the signal in frequency domain
fig_spectrum = go.Figure(
    go.Heatmap(z=s_db,
               colorscale=[
                   [0, 'rgb(30,62,62)'],
                   [0.5, 'rgb(30,128,128)'],
                   [1, 'rgb(30,255,30)'],
               ],
               colorbar=dict(
                   ticksuffix=' dB'
               ),
               dx=time_stride, dy=sample_rate/n_fft/1000,
               name='Spectrogram',
               hovertemplate='Time: %{x:.2f} s<br>Frequency: %{y:.2f} kHz<br>Magnitude: %{z:.2f} dB<extra></extra>'),
    layout={
        'height': 300,
        'xaxis': {'title': 'Time, s'},
        'yaxis': {'title': 'Frequency, kHz'},
        'title': 'Spectrogram',
        'margin': dict(l=0, r=0, t=40, b=0, pad=0),
    }
)
fig_spectrum.show()
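
The axis scaling of the spectrogram follows from a few lines of arithmetic, sketched below. The helper name ``stft_axes`` is hypothetical, and the 16 kHz sample rate in the example is an assumption, since the clip is loaded with ``sr=None`` and keeps whatever rate the file has. The frame count assumes ``librosa.stft`` is called with its default ``center=True``, as in the cell above.

```python
def stft_axes(num_samples: int, sample_rate: int,
              time_stride: float = 0.01, n_fft: int = 512) -> dict:
    """Compute the time and frequency grid of the spectrogram."""
    hop_length = int(sample_rate * time_stride)   # samples between frames
    num_frames = 1 + num_samples // hop_length    # with center=True padding
    num_bins = 1 + n_fft // 2                     # one-sided spectrum
    bin_width_khz = sample_rate / n_fft / 1000    # the `dy` passed to the heatmap
    return {
        "hop_length": hop_length,
        "num_frames": num_frames,
        "num_bins": num_bins,
        "bin_width_khz": bin_width_khz,
        "max_freq_khz": (num_bins - 1) * bin_width_khz,  # Nyquist frequency
    }


# One second of audio at an assumed 16 kHz sample rate
for name, value in stft_axes(num_samples=16000, sample_rate=16000).items():
    print(f"{name}: {value}")
```

The top row of the heatmap sits at half the sample rate, which is why the frequency axis ends at the Nyquist frequency.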

## Offline inference
If we have an entire audio clip available, then we can do offline inference with a pre-trained model to transcribe it.

The easiest way is to call the ASR model's ``transcribe(...)`` method, which can transcribe multiple files in a batch.

In [None]:
# Convert our audio sample to text
files = [AUDIO_FILENAME]
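# The list is passed positionally because the keyword name differs across NeMo
# versions (`paths2audio_files` in older releases, `audio` in newer ones)
transcript = asr_model.transcribe(files)[0]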
print(f'Transcript: "{transcript}"')
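
To illustrate what greedy decoding means for a CTC model, here is a minimal NumPy sketch of the idea: take the most likely class in each frame, collapse consecutive repeats, and drop blanks. It is not NeMo's implementation. The function name ``ctc_greedy_decode``, the toy vocabulary, and the convention that the blank symbol is the last class are assumptions for illustration.

```python
from typing import Sequence

import numpy as np


def ctc_greedy_decode(scores: np.ndarray, vocabulary: Sequence[str], blank_id: int) -> str:
    """Greedy (best-path) CTC decoding for a single utterance.

    scores: array of shape [time, num_classes] with per-frame class scores.
    """
    best_ids = scores.argmax(axis=-1)  # most likely class in each frame
    decoded = []
    previous = blank_id
    for idx in best_ids:
        # Keep a symbol only if it is not blank and not a repeat of the previous frame
        if idx != blank_id and idx != previous:
            decoded.append(vocabulary[idx])
        previous = idx
    return "".join(decoded)


# Toy example: hypothetical 3-symbol vocabulary, blank assumed to be index 3
vocab = ["a", "b", "c"]
frames = np.array([
    [0.9, 0.0, 0.0, 0.1],  # a
    [0.8, 0.1, 0.0, 0.1],  # a again (repeat, collapsed)
    [0.1, 0.0, 0.0, 0.9],  # blank
    [0.7, 0.1, 0.1, 0.1],  # a (new symbol because a blank separates it)
    [0.1, 0.8, 0.0, 0.1],  # b
])
print(ctc_greedy_decode(frames, vocab, blank_id=3))  # prints "aab"
```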

## A Gradio interface for demonstration purposes

In [None]:
import gradio as gr

# Load the fine-tuned NeMo CTC model from disk
asr_model = nemo_asr.models.EncDecCTCModel.restore_from('/content/drive/MyDrive/Model-te.nemo', strict=False)

def transcribe_audio(file_path):
    # Positional argument: the keyword name differs across NeMo versions
    transcript = asr_model.transcribe([file_path])[0]
    print(transcript)
    return transcript

# `gr.inputs` / `gr.outputs` were removed in recent Gradio releases; use the components directly
file_input = gr.Textbox(label="Enter File Path")
text_output = gr.Textbox(label="Transcript")

# Create a Gradio interface
gr.Interface(fn=transcribe_audio, inputs=file_input, outputs=text_output, title="Audio to Text Converter").launch()
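
Because the text box accepts any string, it can help to check the path before it reaches the model. The wrapper below is a minimal sketch under that assumption. ``make_safe_transcriber`` is a hypothetical helper, not a Gradio or NeMo API, and the stand-in transcriber only exists so the example runs without a model.

```python
from pathlib import Path
from typing import Callable, List


def make_safe_transcriber(transcribe_fn: Callable[[List[str]], list]) -> Callable[[str], str]:
    """Wrap a batch transcription function with a file-existence check."""

    def safe_transcribe(file_path: str) -> str:
        path = Path(file_path.strip())  # tolerate stray whitespace from the text box
        if not path.is_file():
            return f"File not found: {path}"
        # transcribe_fn takes a list of paths and returns one result per path
        return str(transcribe_fn([str(path)])[0])

    return safe_transcribe


# Stand-in transcriber used only for this demonstration
demo_fn = make_safe_transcriber(lambda paths: ["dummy transcript"])
print(demo_fn("/path/that/does/not/exist.wav"))
```

In the notebook, the wrapped function would be built from the loaded model, for example ``make_safe_transcriber(asr_model.transcribe)``, and passed as ``fn`` to ``gr.Interface``.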
