## Introduction

This notebook demonstrates real-time ("streaming") speech recognition on audio recorded from your microphone, using a NeMo chunk-aware FastConformer model with caching enabled.

## Installation

The notebook requires the PyAudio library, which is used to capture an audio stream from your machine. You therefore need to run this notebook locally: it cannot record your audio when run in Google Colab or in a Docker container.

For Ubuntu, please run the following commands to install it:

```
sudo apt install python3-pyaudio
pip install pyaudio
```
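
Before continuing, it can help to confirm that PyAudio is importable from the kernel this notebook runs in. The following is a minimal sketch using only the standard library; the helper name `has_module` is ours for illustration and is not part of PyAudio or NeMo.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


if has_module("pyaudio"):
    print("PyAudio found.")
else:
    print("PyAudio not found; install it before running the cells below.")
```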


In [None]:
## Install dependencies
!pip install wget
!apt-get install -y sox libsndfile1 ffmpeg portaudio19-dev
!pip install text-unidecode
!pip install pyaudio

In [None]:
# ## Uncomment this cell to install NeMo if it has not been installed
# BRANCH = 'main'
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[asr]

In [None]:
# import dependencies
import copy
import time
import pyaudio as pa
import numpy as np
import torch

from omegaconf import OmegaConf, open_dict

import nemo.collections.asr as nemo_asr
from nemo.collections.asr.models.ctc_bpe_models import EncDecCTCModelBPE
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer
from nemo.collections.asr.parts.utils.rnnt_utils import Hypothesis

# specify sample rate we will use for recording audio
SAMPLE_RATE = 16000 # Hz

## Cache-aware streaming FastConformer
In this tutorial, we perform streaming transcription using NeMo models that were trained specifically for use in streaming applications. These models are described in a paper released by the NeMo team: [*Noroozi et al.* "Stateful FastConformer with Cache-based Inference for Streaming Automatic Speech Recognition"](https://arxiv.org/abs/2312.17279) (accepted to ICASSP 2024).

These models have the following features:
* They were trained so that, at each timestep, the decoder (either RNNT or CTC) receives a limited amount of context on the left and (most importantly) the right side. Keeping the right-side context small means that, in a real-time streaming scenario, we do not need to keep recording for very long before we can compute the output token at that timestep, so we get transcriptions with low latency.
* The model implementation has **caching** enabled, meaning we do not need to recalculate activations that were obtained in previous timesteps, which reduces latency further.
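
The caching idea can be illustrated with a toy example that has nothing to do with NeMo's actual implementation: an operation with limited left context (here, a sliding-window sum) gives the same result whether it is run over the whole signal at once or chunk by chunk, provided the last few inputs are cached between chunks. All names below are hypothetical.

```python
def offline_window_sum(samples, left_context):
    """Reference: each output sums the current sample and up to `left_context` previous ones."""
    out = []
    for i in range(len(samples)):
        start = max(0, i - left_context)
        out.append(sum(samples[start:i + 1]))
    return out


class StreamingWindowSum:
    """Chunk-by-chunk version that caches the last `left_context` inputs."""

    def __init__(self, left_context):
        self.left_context = left_context
        self.cache = []

    def step(self, chunk):
        # Prepend the cached inputs so the first outputs of this chunk see their left context.
        buffered = self.cache + list(chunk)
        offset = len(self.cache)
        out = []
        for i in range(offset, len(buffered)):
            start = max(0, i - self.left_context)
            out.append(sum(buffered[start:i + 1]))
        # Keep only what the next chunk will need.
        self.cache = buffered[-self.left_context:] if self.left_context > 0 else []
        return out


signal = list(range(1, 11))
streamer = StreamingWindowSum(left_context=2)
streamed = []
for start in range(0, len(signal), 3):
    streamed.extend(streamer.step(signal[start:start + 3]))

print(streamed == offline_window_sum(signal, left_context=2))
```

Only the new samples of each chunk are processed; the cache supplies the context that would otherwise have to be recomputed from the start of the recording.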


## Model checkpoints
The following checkpoints of these models are currently available, and are compatible with this notebook. The meaning of "lookahead" and "chunk size" is described in the following section.

1) [`stt_en_fastconformer_hybrid_large_streaming_80ms`](https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_fastconformer_hybrid_large_streaming_80ms) - 80ms lookahead / 160ms chunk size

2) [`stt_en_fastconformer_hybrid_large_streaming_480ms`](https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_fastconformer_hybrid_large_streaming_480ms) - 480ms lookahead / 560ms chunk size

3) [`stt_en_fastconformer_hybrid_large_streaming_1040ms`](https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_fastconformer_hybrid_large_streaming_1040ms) - 1040ms lookahead / 1120ms chunk size

4) [`stt_en_fastconformer_hybrid_large_streaming_multi`](https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_fastconformer_hybrid_large_streaming_multi) - 0ms, 80ms, 480ms, 1040ms lookahead / 80ms, 160ms, 560ms, 1120ms chunk size

## Model inference explanation
We run inference by continuously recording our audio in chunks, and feeding the chunks into the chosen ASR model. In this notebook we use `pyaudio` to open an audio input stream, which passes the audio to a callback function (registered through the `stream_callback` argument) each time one chunk's worth of audio has been recorded. In the callback, we pass the audio signal to a `transcribe_chunk` function (which we define later in this notebook), and print the resulting transcription.
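
The control flow described above can be sketched without a microphone or a model. In the following illustration, a pre-recorded list of samples stands in for the audio stream, and `CountingTranscriber` is a hypothetical stand-in for the real transcription function; it only reports how much audio it has received so far.

```python
def iter_chunks(samples, samples_per_chunk):
    """Yield consecutive chunks of `samples`; the last chunk may be shorter."""
    for start in range(0, len(samples), samples_per_chunk):
        yield samples[start:start + samples_per_chunk]


class CountingTranscriber:
    """Hypothetical stand-in for an ASR model: keeps state between chunks."""

    def __init__(self):
        self.seen = 0

    def __call__(self, chunk):
        self.seen += len(chunk)
        return f"{self.seen} samples heard"


def run_stream(samples, samples_per_chunk, transcribe):
    """Feed every chunk to `transcribe` and collect the running outputs."""
    return [transcribe(chunk) for chunk in iter_chunks(samples, samples_per_chunk)]


one_second_of_silence = [0] * 16000
outputs = run_stream(one_second_of_silence, 2560, CountingTranscriber())
print(len(outputs), outputs[-1])
```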

As mentioned, the "chunk size" is the duration of audio that we feed into the ASR model at a time (we do this continuously, which allows real-time, streaming speech recognition).

The "lookahead" size is the "chunk size" minus the duration of a single output timestep from the decoder. For FastConformer models, the duration of an output timestep is always 80ms, so in this notebook `lookahead size = chunk size - 80 ms` always holds.

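As a quick worked example of this relationship, the sketch below computes the chunk size in milliseconds and in audio samples for each lookahead supported by the multi-lookahead model. The helper names are ours, and the 16 kHz sample rate is the one used for recording in this notebook.

```python
ENCODER_STEP_MS = 80  # duration of one FastConformer output timestep
SAMPLE_RATE = 16000   # Hz, the recording rate used in this notebook


def chunk_size_ms(lookahead_ms: int) -> int:
    """chunk size = lookahead size + one output timestep."""
    return lookahead_ms + ENCODER_STEP_MS


def samples_per_chunk(lookahead_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of audio samples contained in one chunk."""
    return sample_rate * chunk_size_ms(lookahead_ms) // 1000


for lookahead in (0, 80, 480, 1040):
    print(lookahead, chunk_size_ms(lookahead), samples_per_chunk(lookahead))
```
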
## Model selection
In the next cell, you can select which pretrained `model_name` and `lookahead_size` you would like to try.

Additionally, note that all of the available models are [Hybrid RNNT-CTC models](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#hybrid-transducer-ctc). Inference is by default done using the RNNT decoder (which tends to produce a higher transcription accuracy), but you may choose to use the CTC decoder instead. For this, we also provide a `decoder_type` variable in the cell below.
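
The rules for combining `model_name`, `lookahead_size` and `decoder_type` can be summarised in a small validation helper. This is only a sketch of the constraints stated in this notebook; `resolve_lookahead` is a hypothetical name and is not part of NeMo.

```python
MULTI_MODEL = "stt_en_fastconformer_hybrid_large_streaming_multi"
MULTI_LOOKAHEADS = (0, 80, 480, 1040)  # ms

# Single-lookahead checkpoints: the lookahead is fixed by the checkpoint itself.
SINGLE_LOOKAHEAD_MODELS = {
    "stt_en_fastconformer_hybrid_large_streaming_80ms": 80,
    "stt_en_fastconformer_hybrid_large_streaming_480ms": 480,
    "stt_en_fastconformer_hybrid_large_streaming_1040ms": 1040,
}


def resolve_lookahead(model_name, lookahead_size, decoder_type="rnnt"):
    """Validate the selection and return the lookahead (ms) that will actually be used."""
    if decoder_type not in ("rnnt", "ctc"):
        raise ValueError(f"decoder_type must be 'rnnt' or 'ctc', got {decoder_type!r}")
    if model_name == MULTI_MODEL:
        if lookahead_size not in MULTI_LOOKAHEADS:
            raise ValueError(f"lookahead_size must be one of {MULTI_LOOKAHEADS}")
        return lookahead_size
    if model_name in SINGLE_LOOKAHEAD_MODELS:
        # The requested lookahead_size is ignored for these checkpoints.
        return SINGLE_LOOKAHEAD_MODELS[model_name]
    raise ValueError(f"unknown model_name: {model_name!r}")


print(resolve_lookahead(MULTI_MODEL, 480))
```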

In [None]:
# You may wish to try different values of model_name and lookahead_size

# Choose the name of a model to use.
# Currently available options:
# 1) "stt_en_fastconformer_hybrid_large_streaming_multi"
# 2) "stt_en_fastconformer_hybrid_large_streaming_80ms"
# 3) "stt_en_fastconformer_hybrid_large_streaming_480ms"
# 4) "stt_en_fastconformer_hybrid_large_streaming_1040ms"

model_name = "stt_en_fastconformer_hybrid_large_streaming_multi"

# Specify the lookahead_size.
# If model_name == "stt_en_fastconformer_hybrid_large_streaming_multi" then
# lookahead_size can be 0, 80, 480 or 1040 (ms)
# Else, lookahead_size should be whatever is written in the model_name:
# "stt_en_fastconformer_hybrid_large_streaming_<lookahead_size>ms"

lookahead_size = 80 # in milliseconds

# Specify the decoder to use.
# Can be "rnnt" or "ctc"
decoder_type = "rnnt"

## Model set-up
Next we:
* set up the `asr_model` according to the chosen `model_name` and `lookahead_size`
* make sure we use the specified `decoder_type`
* make sure the model's decoding strategy has suitable parameters
* instantiate a `CacheAwareStreamingAudioBuffer`
* get some parameters to use as the initial cache state
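
For the multi-lookahead model, the chosen lookahead is expressed to the encoder as a number of right-context frames, obtained by dividing the lookahead by the 80 ms step length. A minimal sketch of that conversion follows; the left-context value of 70 frames is an illustrative assumption, since in the notebook it is read from the loaded model rather than hard-coded.

```python
ENCODER_STEP_LENGTH = 80  # ms per encoder output frame


def att_context_for(lookahead_ms: int, left_context_frames: int) -> list:
    """Return [left, right] attention context sizes in frames for a given lookahead."""
    if lookahead_ms % ENCODER_STEP_LENGTH != 0:
        raise ValueError("lookahead must be a multiple of the encoder step length")
    return [left_context_frames, lookahead_ms // ENCODER_STEP_LENGTH]


# 70 is a placeholder for the model's own left context size.
for lookahead in (0, 80, 480, 1040):
    print(lookahead, att_context_for(lookahead, left_context_frames=70))
```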

In [None]:
# setting up model and validating the choice of model_name and lookahead size
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)


# specify ENCODER_STEP_LENGTH (which is 80 ms for FastConformer models)
ENCODER_STEP_LENGTH = 80 # ms

# update att_context_size if using multi-lookahead model
# (for single-lookahead models, the default context size will be used and the
# `lookahead_size` variable will be ignored)
if model_name == "stt_en_fastconformer_hybrid_large_streaming_multi":
    # check that lookahead_size is one of the valid ones
    if lookahead_size not in [0, 80, 480, 1040]:
        raise ValueError(
            f"specified lookahead_size {lookahead_size} is not one of the "
            "allowed lookaheads (can select 0, 80, 480 or 1040 ms)"
        )

    # update att_context_size
    left_context_size = asr_model.encoder.att_context_size[0]
    asr_model.encoder.set_default_att_context_size([left_context_size, int(lookahead_size / ENCODER_STEP_LENGTH)])


# make sure we use the specified decoder_type
asr_model.change_decoding_strategy(decoder_type=decoder_type)

# make sure the model's decoding strategy is optimal
decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    # save time by doing greedy decoding and not trying to record the alignments
    decoding_cfg.strategy = "greedy"
    decoding_cfg.preserve_alignments = False
    if hasattr(asr_model, 'joint'):  # if an RNNT model
        # restrict max_symbols to make sure not stuck in infinite loop
        decoding_cfg.greedy.max_symbols = 10
        # sensible default parameter, but not necessary since batch size is 1
        decoding_cfg.fused_batch_size = -1
    asr_model.change_decoding_strategy(decoding_cfg)


# set model to eval mode
asr_model.eval()


# get parameters to use as the initial cache state
cache_last_channel, cache_last_time, cache_last_channel_len = asr_model.encoder.get_initial_cache_state(
    batch_size=1
)

## Transcribing a single chunk
In the following code block we specify the `transcribe_chunk` function that transcribes a single chunk.
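
One detail of that function is the pre-encode cache: the last few feature frames of each chunk are kept and prepended to the next chunk, with zeros used before any audio has arrived. The sketch below shows only this bookkeeping on plain Python lists, where each list element stands for one feature frame along the time axis. `PreEncodeCache` is a hypothetical name, and the real code operates on tensors instead.

```python
class PreEncodeCache:
    """Keeps the last `cache_size` frames so they can be prepended to the next chunk."""

    def __init__(self, cache_size, pad_frame=0.0):
        self.cache_size = cache_size
        # Start with padding so the very first chunk is zero-padded on the left.
        self.frames = [pad_frame] * cache_size

    def extend(self, new_frames):
        """Return cached frames + new frames, then update the cache for the next call."""
        buffered = self.frames + list(new_frames)
        self.frames = buffered[-self.cache_size:] if self.cache_size > 0 else []
        return buffered


cache = PreEncodeCache(cache_size=3)
first = cache.extend([1, 2, 3, 4])
second = cache.extend([5, 6])
print(first)
print(second)
```

Note that the buffered sequence is longer than the new chunk by exactly the cache size, measured along the time axis.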

In [None]:
# init params we will use for streaming
previous_hypotheses = None
pred_out_stream = None
step_num = 0
pre_encode_cache_size = asr_model.encoder.streaming_cfg.pre_encode_cache_size[1]
# cache-aware models require some small section of the previous processed_signal to
# be fed in at each timestep - we initialize this to a tensor filled with zeros
# so that we will do zero-padding for the very first chunk(s)
num_channels = asr_model.cfg.preprocessor.features
cache_pre_encode = torch.zeros((1, num_channels, pre_encode_cache_size), device=asr_model.device)


# helper function for extracting transcriptions
def extract_transcriptions(hyps):
    """
    CTC and RNNT models return their transcriptions in different forms.
    This function extracts and returns the text of each hypothesis.
    """
    if isinstance(hyps[0], Hypothesis):
        # Hypothesis objects carry the text in their `text` attribute
        return [hyp.text for hyp in hyps]
    # otherwise the transcriptions are already plain text
    return hyps

# define functions to init audio preprocessor and to
# preprocess the audio (ie obtain the mel-spectrogram)
def init_preprocessor(asr_model):
    cfg = copy.deepcopy(asr_model._cfg)
    OmegaConf.set_struct(cfg.preprocessor, False)

    # some changes for streaming scenario
    cfg.preprocessor.dither = 0.0
    cfg.preprocessor.pad_to = 0
    cfg.preprocessor.normalize = "None"
    
    preprocessor = EncDecCTCModelBPE.from_config_dict(cfg.preprocessor)
    preprocessor.to(asr_model.device)
    
    return preprocessor

preprocessor = init_preprocessor(asr_model)

def preprocess_audio(audio, asr_model):
    device = asr_model.device

    # doing audio preprocessing
    audio_signal = torch.from_numpy(audio).unsqueeze_(0).to(device)
    audio_signal_len = torch.Tensor([audio.shape[0]]).to(device)
    processed_signal, processed_signal_length = preprocessor(
        input_signal=audio_signal, length=audio_signal_len
    )
    return processed_signal, processed_signal_length


def transcribe_chunk(new_chunk):
    
    global cache_last_channel, cache_last_time, cache_last_channel_len
    global previous_hypotheses, pred_out_stream, step_num
    global cache_pre_encode
    
    # new_chunk is provided as np.int16, so we convert it to np.float32
    # as that is what our ASR models expect
    audio_data = new_chunk.astype(np.float32)
    audio_data = audio_data / 32768.0

    # get mel-spectrogram signal & length
    processed_signal, processed_signal_length = preprocess_audio(audio_data, asr_model)
     
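    # prepend with cache_pre_encode (concatenated along the time axis, dim=-1)
    processed_signal = torch.cat([cache_pre_encode, processed_signal], dim=-1)
    # the length grows by the number of cached frames, i.e. the size of the time axis
    processed_signal_length += cache_pre_encode.shape[-1]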
    
    # save cache for next time
    cache_pre_encode = processed_signal[:, :, -pre_encode_cache_size:]
    
    with torch.no_grad():
        (
            pred_out_stream,
            transcribed_texts,
            cache_last_channel,
            cache_last_time,
            cache_last_channel_len,
            previous_hypotheses,
        ) = asr_model.conformer_stream_step(
            processed_signal=processed_signal,
            processed_signal_length=processed_signal_length,
            cache_last_channel=cache_last_channel,
            cache_last_time=cache_last_time,
            cache_last_channel_len=cache_last_channel_len,
            keep_all_outputs=False,
            previous_hypotheses=previous_hypotheses,
            previous_pred_out=pred_out_stream,
            drop_extra_pre_encoded=None,
            return_transcription=True,
        )
    
    final_streaming_tran = extract_transcriptions(transcribed_texts)
    step_num += 1
    
    return final_streaming_tran[0]

## Simple streaming with microphone
We use `pyaudio` to record audio from an input audio device on your local machine. The function passed as `stream_callback` is called each time `frames_per_buffer` new frames have been recorded; it transcribes the audio and prints the transcription in the output of the cell below.
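
The callback receives the recorded audio as raw bytes of 16-bit PCM, which must be scaled to floating point values in the range [-1, 1) before being passed to the model. The notebook does this with NumPy. The following standard-library sketch shows the same conversion in isolation, assuming the bytes are in the machine's native byte order; `pcm16_to_float` is a hypothetical helper name.

```python
from array import array


def pcm16_to_float(in_data: bytes) -> list:
    """Convert raw 16-bit PCM bytes to floats in [-1.0, 1.0)."""
    samples = array("h")  # signed 16-bit integers, native byte order
    samples.frombytes(in_data)
    return [s / 32768.0 for s in samples]


raw = array("h", [0, 16384, -32768, 32767]).tobytes()
print(pcm16_to_float(raw))
```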

In [None]:
# calculate chunk_size in milliseconds
chunk_size = lookahead_size + ENCODER_STEP_LENGTH

p = pa.PyAudio()
print('Available audio input devices:')
input_devices = []
for i in range(p.get_device_count()):
    dev = p.get_device_info_by_index(i)
    if dev.get('maxInputChannels'):
        input_devices.append(i)
        print(i, dev.get('name'))

if len(input_devices):
    dev_idx = -2
    while dev_idx not in input_devices:
        print('Please type input device ID:')
        dev_idx = int(input())

    def callback(in_data, frame_count, time_info, status):
        signal = np.frombuffer(in_data, dtype=np.int16)
        text = transcribe_chunk(signal)
        print(text, end='\r')
        return (in_data, pa.paContinue)

    stream = p.open(format=pa.paInt16,
                    channels=1,
                    rate=SAMPLE_RATE,
                    input=True,
                    input_device_index=dev_idx,
                    stream_callback=callback,
                    frames_per_buffer=int(SAMPLE_RATE * chunk_size / 1000) - 1
                   )

    print('Listening...')

    stream.start_stream()
    
    # To exit the PyAudio loop, interrupt the kernel and then keep speaking for a few more words
    try:
        while stream.is_active():
            time.sleep(0.1)
    finally:        
        stream.stop_stream()
        stream.close()
        p.terminate()

        print()
        print("PyAudio stopped")
    
else:
    print('ERROR: No audio input device found.')