Adding cache-aware streaming Conformer with look-ahead support #3888

Merged: 247 commits, Aug 3, 2022
Changes from 238 commits
Commits
247 commits
072d377
added causal conv.
VahidooX Nov 16, 2021
ad18c12
added causal conv.
VahidooX Nov 16, 2021
584b1bf
added causal conv.
VahidooX Nov 16, 2021
9911571
added causal conv.
VahidooX Nov 16, 2021
e3c68b1
added causal conv.
VahidooX Nov 16, 2021
ca03d82
added causal conv.
VahidooX Nov 16, 2021
6c3d968
added causal conv.
VahidooX Nov 16, 2021
ee51fd8
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_casual…
VahidooX Nov 20, 2021
059d0c3
added caching. made convolutions causal.
VahidooX Nov 20, 2021
0d0cbb6
separated caches.
VahidooX Nov 20, 2021
6c1aa9a
separated caches.
VahidooX Nov 21, 2021
281ae4f
moved caching outside downsampling convs.
VahidooX Nov 22, 2021
bc6139e
made paddings non-symmetric.
VahidooX Nov 22, 2021
35dfb42
made paddings non-symmetric.
VahidooX Nov 22, 2021
f7788fa
added streaming script
VahidooX Nov 22, 2021
52be159
added streaming script
VahidooX Nov 22, 2021
f798133
added streaming script
VahidooX Nov 22, 2021
30679c9
added clone.
VahidooX Nov 22, 2021
55e656d
add streaming mode coversion.
VahidooX Nov 22, 2021
11c9a84
add streaming mode coversion.
VahidooX Nov 22, 2021
60c7d19
add streaming mode coversion.
VahidooX Nov 22, 2021
be191a0
added next_cache.
VahidooX Dec 2, 2021
89ffacb
added next_cache.
VahidooX Dec 2, 2021
31878aa
added next_cache.
VahidooX Dec 3, 2021
dd08bf2
added next_cache.
VahidooX Dec 3, 2021
0af0d7d
added next_cache.
VahidooX Dec 3, 2021
23191fa
added next_cache.
VahidooX Dec 3, 2021
26f9b10
added next_cache.
VahidooX Dec 3, 2021
3a54675
added next_cache.
VahidooX Dec 3, 2021
9690652
added max_cache_len to attention.
VahidooX Dec 3, 2021
a214f06
added max_cache_len to attention.
VahidooX Dec 3, 2021
b05fcbc
added max_cache_len to attention.
VahidooX Dec 4, 2021
a378cab
fixed the bug.
VahidooX Dec 15, 2021
adcae5e
Merge branch 'main' of https://github.com/NVIDIA/NeMo into add_casual…
VahidooX Jan 20, 2022
6062ff5
fixed bugs.
VahidooX Feb 3, 2022
afb0c3c
added att_context_style param.
VahidooX Feb 4, 2022
a74c6cc
added att_context_style param.
VahidooX Feb 4, 2022
8792142
added att_context_style param.
VahidooX Feb 4, 2022
47e5743
added stacking downsampling.
VahidooX Feb 4, 2022
34003ee
added stacking downsampling.
VahidooX Feb 4, 2022
59e2f0b
added conv_context_size
VahidooX Feb 4, 2022
bd2672c
added conv_context_size
VahidooX Feb 4, 2022
e0e9d19
added conv_context_size
VahidooX Feb 4, 2022
7c55a96
added conv_context_size
VahidooX Feb 4, 2022
238bab4
added conv_context_size
VahidooX Feb 4, 2022
237571b
added conv_context_size
VahidooX Feb 4, 2022
9e2080b
bug fixed for -1 context.
VahidooX Feb 8, 2022
88c2e48
bug fixed for -1 context.
VahidooX Feb 8, 2022
7e41d3c
Added look ahead support.
VahidooX Feb 12, 2022
77da09f
fixed transducer.
VahidooX Feb 12, 2022
812592f
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_con…
VahidooX Feb 12, 2022
b926606
dropped pre_encode cache.
VahidooX Feb 12, 2022
6a6365f
trying to fix rnnt decoder.
VahidooX Feb 13, 2022
d3eaa68
reverted fixed the bug for bidirectional.
VahidooX Feb 13, 2022
8e642fc
reverted fixed the bug for bidirectional.
VahidooX Feb 14, 2022
0de8496
fixed the decoder bug.
VahidooX Feb 14, 2022
f127049
CLEANUP.
VahidooX Feb 14, 2022
d0429db
CLEANUP.
VahidooX Feb 14, 2022
a56f9c2
CLEANUP.
VahidooX Feb 14, 2022
021bbec
fixed.
VahidooX Feb 14, 2022
08ea9e9
fixed.
VahidooX Feb 14, 2022
5bfd301
fixed the bug.
VahidooX Feb 15, 2022
3578015
Trying to fix onnx.
VahidooX Feb 15, 2022
15b5729
FIXED.
VahidooX Feb 15, 2022
c6fbafc
added support for onnx.
VahidooX Feb 17, 2022
28aed44
added support for onnx.
VahidooX Feb 17, 2022
5399b66
enabled ctc output print.
VahidooX Feb 17, 2022
6d492e3
moved the triu.
VahidooX Feb 18, 2022
69be838
moved the triu.
VahidooX Feb 18, 2022
be62bb6
trying to add onnx runtime support.
VahidooX Feb 23, 2022
c18dab8
Update to new design.
VahidooX Mar 1, 2022
aa8f501
Update to new design.
VahidooX Mar 1, 2022
8ee8b6d
Update to new design.
VahidooX Mar 2, 2022
94290cd
Update to new design.
VahidooX Mar 2, 2022
db6d256
added FramewiseStreamingAudioBuffer
VahidooX Mar 2, 2022
d6986aa
added FramewiseStreamingAudioBuffer
VahidooX Mar 4, 2022
9ae111e
added FramewiseStreamingAudioBuffer
VahidooX Mar 4, 2022
5bb3d46
added FramewiseStreamingAudioBuffer
VahidooX Mar 4, 2022
39466de
added FramewiseStreamingAudioBuffer
VahidooX Mar 4, 2022
6ade6f6
added FramewiseStreamingAudioBuffer
VahidooX Mar 4, 2022
a16649c
added FramewiseStreamingAudioBuffer
VahidooX Mar 4, 2022
6da65d8
added FramewiseStreamingAudioBuffer
VahidooX Mar 4, 2022
9215604
added step_stream
VahidooX Mar 4, 2022
5c58517
added step_stream
VahidooX Mar 4, 2022
a54447d
added step_stream
VahidooX Mar 4, 2022
a28b0f8
cleaned up.
VahidooX Mar 4, 2022
55e75c8
cleaned up.
VahidooX Mar 5, 2022
10b9258
added refiner_shuffle_rate
VahidooX Mar 6, 2022
e57dcbf
added batch support.
VahidooX Mar 12, 2022
c070041
added batch support.
VahidooX Mar 12, 2022
18fbb6f
added batch support.
VahidooX Mar 12, 2022
444c322
fixed the bugs for lengths.
VahidooX Mar 13, 2022
f8afb91
fixed the bugs for lengths.
VahidooX Mar 13, 2022
e40bcca
added rnnt support.
VahidooX Mar 13, 2022
a3d1518
added rnnt support.
VahidooX Mar 13, 2022
897672c
added rnnt support.
VahidooX Mar 13, 2022
6c1f9ae
added rnnt support.
VahidooX Mar 13, 2022
6e00c11
fixed the bug.
VahidooX Mar 13, 2022
df0f169
added verbose.
VahidooX Mar 13, 2022
ec2fa9e
added verbose.
VahidooX Mar 14, 2022
3d31a55
added verbose.
VahidooX Mar 14, 2022
4f41299
added verbose.
VahidooX Mar 14, 2022
a0bf383
fixed rnnt models.
VahidooX Mar 14, 2022
6fa9c58
fixed rnnt models.
VahidooX Mar 14, 2022
79d0865
fixed rnnt models.
VahidooX Mar 14, 2022
e3ec0aa
fixed rnnt models.
VahidooX Mar 14, 2022
fced645
fixed rnnt models.
VahidooX Mar 14, 2022
6e87ef3
fixed rnnt models.
VahidooX Mar 14, 2022
88a2062
cleaned code.
VahidooX Mar 14, 2022
6915d3b
fixed ctc code.
VahidooX Mar 14, 2022
edbd032
fixed ctc code.
VahidooX Mar 14, 2022
2724e3f
fixed ctc code.
VahidooX Mar 14, 2022
7be3466
cleaned the code
VahidooX Mar 14, 2022
d2f81ca
cleaned the code
VahidooX Mar 15, 2022
01ba464
cleaned the code
VahidooX Mar 15, 2022
b121043
cleaned the code
VahidooX Mar 15, 2022
5575769
added wer calc.
VahidooX Mar 15, 2022
17a2d54
added wer calc.
VahidooX Mar 15, 2022
b19b876
added wer calc.
VahidooX Mar 15, 2022
420c68e
added wer calc.
VahidooX Mar 15, 2022
fe5d2b6
added wer calc.
VahidooX Mar 15, 2022
7b93de3
added wer calc.
VahidooX Mar 15, 2022
c5d9525
added wer calc.
VahidooX Mar 15, 2022
ecda220
added wer calc.
VahidooX Mar 15, 2022
ec5bc10
added wer calc.
VahidooX Mar 15, 2022
4bd5f3d
added wer calc.
VahidooX Mar 15, 2022
5e376c0
FIXED class names.
VahidooX Mar 15, 2022
5065073
FIXED class names.
VahidooX Mar 16, 2022
755cbb7
FIXED class names.
VahidooX Mar 23, 2022
775e3cf
dropped init_vars.
VahidooX Mar 23, 2022
febfa52
added pre_pad.
VahidooX Mar 24, 2022
5565202
added pre_pad.
VahidooX Mar 24, 2022
424c674
added pre_pad.
VahidooX Mar 24, 2022
2e4e06a
fixed added pre_pad.
VahidooX Mar 24, 2022
56b5703
added skip_nan_grad.
VahidooX Mar 25, 2022
03783c2
merged main.
VahidooX Mar 26, 2022
9cfffe6
fixed online normalization
VahidooX Mar 30, 2022
dc61970
added timeing.
VahidooX Mar 31, 2022
a12fa24
fixed layer norm typ checking.
VahidooX Mar 31, 2022
3c00c83
adding onnx support.
VahidooX Apr 1, 2022
339ea0a
pull from main.
VahidooX Apr 3, 2022
cbef980
pull from main.
VahidooX Apr 3, 2022
de7b3ae
pull from main.
VahidooX Apr 3, 2022
b83caf0
fixing stacking.
VahidooX Apr 4, 2022
0495c52
added support for mean value.
VahidooX Apr 5, 2022
40a5783
added support for mean value.
VahidooX Apr 5, 2022
0a8ec53
added support for mean value.
VahidooX Apr 5, 2022
87c1807
added support for mean value.
VahidooX Apr 6, 2022
774410c
added support for mean value.
VahidooX Apr 6, 2022
b56178c
disabled nan_grad.
VahidooX Apr 8, 2022
995d08e
added onnx to rnnt.
VahidooX Apr 22, 2022
c91a1e7
fixed the bug in buffer streamer.
VahidooX Apr 23, 2022
9e0d4f9
fixed the bug in buffer streamer.
VahidooX Apr 24, 2022
d67ba58
fixed the bug in buffer streamer.
VahidooX Apr 26, 2022
7b0902f
fixed the bug in buffer streamer.
VahidooX Apr 27, 2022
811c774
fixed the bug in buffer streamer.
VahidooX Apr 27, 2022
1b90e2f
fixed the bug in buffer streamer.
VahidooX Apr 27, 2022
ff6ece2
fixed the bug in stacking.
VahidooX Apr 28, 2022
7c1a46c
fixed the bug in stacking.
VahidooX Apr 28, 2022
946c928
fixed the conv chachinf bug.
VahidooX Apr 29, 2022
2367120
fixed the conv chachinf bug.
VahidooX Apr 29, 2022
0f714da
add do_caching.
VahidooX Apr 29, 2022
f1b47ea
add calc_drop_extra_pre_encoded.
VahidooX Apr 29, 2022
e436ee5
add calc_drop_extra_pre_encoded.
VahidooX Apr 30, 2022
303aea4
added group norm.
VahidooX May 1, 2022
87f001a
added group norm.
VahidooX May 2, 2022
2e02b1b
added group norm.
VahidooX May 5, 2022
73557b3
added group norm.
VahidooX May 5, 2022
be675ec
pulled from main.
VahidooX May 5, 2022
6e91f96
added prenorm
VahidooX May 9, 2022
c1af1aa
added prenorm
VahidooX May 9, 2022
75c1d7c
moved cache_last_channel_next to stream step.
VahidooX May 11, 2022
860995f
moved prenorm
VahidooX May 11, 2022
c1bc3a5
moved prenorm, fixed the style
VahidooX May 12, 2022
99ea203
moved prenorm, fixed the style
VahidooX May 12, 2022
3883967
moved prenorm, fixed the style
VahidooX May 15, 2022
5ad6bde
added mlp for prenorm.
VahidooX May 15, 2022
93585bd
added mlp for prenorm.
VahidooX May 16, 2022
dc1fb84
added mlp for prenorm.
VahidooX May 17, 2022
4c56214
added onnx script.
VahidooX May 17, 2022
879517d
added onnx script.
VahidooX May 17, 2022
14180a8
added onnx script.
VahidooX May 17, 2022
6390a3e
added onnx script.
VahidooX May 17, 2022
73a3e7c
makde drop_extra_pre_encode to integer.
VahidooX May 17, 2022
a87554c
cleaned the code.
VahidooX May 17, 2022
e9251eb
cleaned the code.
VahidooX May 17, 2022
e358708
cleaned the code.
VahidooX May 17, 2022
391a259
cleaned the code.
VahidooX May 17, 2022
f656fa2
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_con…
VahidooX May 17, 2022
7aa519b
cleaned the code.
VahidooX May 17, 2022
dbc409d
cleaned the code.
VahidooX May 18, 2022
2bc6fd8
fixed the onnx conversion without caching.
VahidooX May 18, 2022
1065c66
fixed the onnx conversion without caching.
VahidooX May 18, 2022
a420306
fixed the onnx conversion without caching.
VahidooX May 18, 2022
232a2ad
fixed the bug.
VahidooX May 18, 2022
ac1a5b1
dropped onnx support.
VahidooX May 18, 2022
11fc147
cleaned.
VahidooX May 18, 2022
44579cc
cleaned.
VahidooX May 18, 2022
87f5ea3
cleaned.
VahidooX May 18, 2022
02a4591
cleaned.
VahidooX May 18, 2022
a52d483
cleaned.
VahidooX May 18, 2022
fe70bf1
adding docs
VahidooX May 19, 2022
2798fe6
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_con…
VahidooX May 19, 2022
2673650
adding docs
VahidooX May 19, 2022
a11961b
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_con…
VahidooX May 21, 2022
7e446a4
disbbled mlp for stacking.
VahidooX May 21, 2022
8ce2804
added back eval_streaming_onxx.py.
VahidooX May 23, 2022
467c399
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_con…
VahidooX May 24, 2022
eec50d4
added back eval_streaming_onxx.py.
VahidooX May 24, 2022
c93f772
cleaned.
VahidooX May 24, 2022
5a4fecd
cleaned.
VahidooX May 24, 2022
3ebd18a
fixed jenkins.
VahidooX May 24, 2022
0c481d2
fixed jenkins.
VahidooX May 24, 2022
b9eb558
fixed att_mask.
VahidooX May 24, 2022
af1f578
MADE valid_out_len a single number.
VahidooX May 24, 2022
0940a18
MADE valid_out_len a single number.
VahidooX May 24, 2022
2448484
changed valid_out_len to keep_all_outputs.
VahidooX Jun 2, 2022
735ab05
addressed comments.
VahidooX Jun 21, 2022
ff5d55f
addressed comments.
VahidooX Jun 21, 2022
53a628f
addressed comments.
VahidooX Jun 21, 2022
34e4922
pulled from main.
VahidooX Jun 21, 2022
b6fb106
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_con…
VahidooX Jun 21, 2022
c19f435
pulled from main.
VahidooX Jun 21, 2022
e80a727
fixed bugs.
VahidooX Jun 21, 2022
e73eabe
pulled from main.
VahidooX Jun 22, 2022
f8aa6a1
fixed style
VahidooX Jun 22, 2022
5881b46
Fix ctc decoding.
VahidooX Aug 1, 2022
0b00676
addressed comments.
VahidooX Aug 1, 2022
fc747b9
addressed comments.
VahidooX Aug 1, 2022
89a5d28
addressed comments.
VahidooX Aug 1, 2022
affbc73
fixed style. dropped extra export_forward.
VahidooX Aug 1, 2022
0e959a3
pulled from main.
VahidooX Aug 1, 2022
20d0b49
pulled from main.
VahidooX Aug 1, 2022
59b7c62
fixed online normazliation.
VahidooX Aug 1, 2022
b7eb7c8
fixed onnx conversion.
VahidooX Aug 2, 2022
d55cca9
fixed onnx conversion.
VahidooX Aug 2, 2022
a5be0bf
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_con…
VahidooX Aug 2, 2022
e30aa01
addrressed comments.
VahidooX Aug 2, 2022
c2cfe4e
fixed sampling.
VahidooX Aug 2, 2022
7589f88
fixed bug in depthwise conv.
VahidooX Aug 2, 2022
0bde720
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_con…
VahidooX Aug 2, 2022
090f838
fixed bug in depthwise conv.
VahidooX Aug 2, 2022
268a639
fixed bug in depthwise conv.
VahidooX Aug 3, 2022
fdcb3a1
cleaned docs.
VahidooX Aug 3, 2022
463aed6
cleaned docs.
VahidooX Aug 3, 2022
b5d8306
cleaned docs.
VahidooX Aug 3, 2022
194581f
Merge branch 'main' of https://github.com/NVIDIA/NeMo into casual_con…
VahidooX Aug 3, 2022
2 changes: 1 addition & 1 deletion Jenkinsfile
@@ -1504,7 +1504,7 @@ pipeline {
model.data_dir=/home/TestData/nlp/new_multiatis \
model.validation_ds.prefix=dev \
model.test_ds.prefix=dev \
- trainer.gpus=[0] \
+ trainer.devices=[0] \
+trainer.fast_dev_run=true \
exp_manager.exp_dir=checkpoints2'
sh 'rm -rf checkpoints2'
56 changes: 56 additions & 0 deletions docs/source/asr/models.rst
@@ -127,6 +127,62 @@ You may find the example config files of Conformer-Transducer model with charact
``<NeMo_git_root>/examples/asr/conf/conformer/conformer_transducer_char.yaml`` and
with sub-word encoding at ``<NeMo_git_root>/examples/asr/conf/conformer/conformer_transducer_bpe.yaml``.

Streaming Conformer
-------------------

Streaming Conformer models are variants of Conformer trained with limited right context, which enables them to be used very efficiently for frame-wise streaming.
Three categories of layers in Conformer have access to right tokens: 1) depthwise convolutions, 2) self-attention, and 3) the convolutions in the downsampling layers.
Streaming Conformer models use causal convolutions, or convolutions with a smaller right context, together with self-attention restricted to a limited right context, to bound the effective right context of the input.
A model trained with such limitations can be used in streaming mode and gives exactly the same output and accuracy as when the whole audio is given to the model in offline mode.
These models can use a caching mechanism to store and reuse activations during streaming inference, avoiding duplicated computations as much as possible.
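
As a minimal sketch of this caching idea (a single causal convolution in plain PyTorch, not the actual NeMo encoder), streaming with a small cache of past frames reproduces the offline output exactly, without recomputing past frames:

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Toy causal 1D convolution (kernel size 3, left padding only).
    conv = nn.Conv1d(1, 1, kernel_size=3, bias=False)

    x = torch.randn(1, 1, 12)                    # [batch, channels, time]
    offline = conv(F.pad(x, (2, 0)))             # offline: pad kernel_size - 1 frames on the left

    cache = torch.zeros(1, 1, 2)                 # cache of the last kernel_size - 1 frames
    streamed = []
    for chunk in x.split(4, dim=-1):             # same audio processed in 3 chunks of 4 frames
        inp = torch.cat([cache, chunk], dim=-1)  # prepend the cached context
        streamed.append(conv(inp))
        cache = chunk[..., -2:]                  # keep the last 2 frames for the next chunk
    streamed = torch.cat(streamed, dim=-1)

    print(torch.allclose(offline, streamed, atol=1e-6))   # True: streaming matches offline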

We support the following three approaches to right-context modeling:
* fully causal model with zero look-ahead: tokens do not see any future tokens. All convolution layers are causal, and right tokens are masked for self-attention.
It gives zero latency but limited accuracy.
To train such a model, you need to set `encoder.att_context_size=[left_context, 0]` and `encoder.conv_context_size=causal` in the config.

* regular look-ahead: convolutions are able to see a few future frames, and self-attention also sees the same number of future tokens.
In this approach, the activations for the look-ahead part are not cached and are recalculated in the next chunks. The right context in each layer should be a small number, as multiple layers compound the effective context size and therefore the look-ahead size and latency.
For example, for a model of 17 layers with 4x downsampling and a 10ms window shift, even a right context of 2 in each layer means 17*2*10*4=1360ms of look-ahead (see the latency sketch after this list). Each step after the downsampling corresponds to 4*10=40ms.

* chunk-aware look-ahead: the input is split into equal chunks. Convolutions are fully causal, while self-attention layers are able to see all the tokens in their corresponding chunk.
For example, in a model with a chunk size of 20 tokens, a token at the first position of a chunk sees the next 19 tokens, while the last token of the chunk sees zero future tokens.
This approach is more efficient than regular look-ahead in terms of computation, as the activations for most of the look-ahead part are cached and there is close to zero duplication in the calculations.
In terms of accuracy, this approach gives similar or even better results than regular look-ahead, as each token in each layer has access to more tokens on average. That is why we recommend this approach for streaming.
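
As a back-of-the-envelope check of the latency figures above (using only the example numbers from the text; the chunk size and the worst-case chunked latency are assumptions of this sketch, not measured values):

.. code-block:: python

    # Example values from the text above; they are illustrative, not defaults.
    num_layers = 17
    downsampling = 4
    window_shift_ms = 10
    right_context_per_layer = 2            # tokens of right context in each layer

    ms_per_token = downsampling * window_shift_ms           # 4 * 10 = 40 ms per encoder step

    # Regular look-ahead: the right context compounds across layers.
    regular_lookahead_ms = num_layers * right_context_per_layer * ms_per_token
    print(regular_lookahead_ms)            # 1360 ms, as in the example above

    # Chunk-aware look-ahead: attention is confined to the chunk, so the look-ahead
    # does not grow with depth; the worst case is waiting for the rest of one chunk.
    chunk_size_tokens = 20                 # hypothetical chunk size
    worst_case_chunked_ms = (chunk_size_tokens - 1) * ms_per_token
    print(worst_case_chunked_ms)           # 760 ms worst case for this chunk size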


**Note:** Latencies are based on the assumption that the forward time of the network is zero.

Approaches with non-zero look-ahead can give significantly better accuracy by sacrificing latency. The latency can be controlled by the left context size.


In all modes, the left context is controlled by the number of tokens visible to self-attention and by the kernel size of the convolutions.
For example, if the left context of self-attention in each layer is set to 20 tokens and there are 10 layers of Conformer, the effective left context is 20*10=200 tokens.
The left context of self-attention for regular look-ahead can be set to any number, while for chunk-aware look-ahead it should be a multiple of the right context.
For convolutions, if we use a left context of 30 in such a model, the effective left context is 30*10=300 tokens.
The left context of the convolutions depends on their kernel size, while for self-attention layers it can be any number. A larger left context for self-attention means a larger cache and more computation in self-attention.
A self-attention left context of around 6 seconds gives results close to those with unlimited left context. For a model with 4x downsampling and a 10ms window shift in the preprocessor, each token corresponds to 4*10=40ms.
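
A small sketch of the same arithmetic for the left context (again using only the illustrative numbers from this paragraph):

.. code-block:: python

    # Illustrative numbers from the paragraph above.
    num_layers = 10
    att_left_context = 20          # per-layer self-attention left context (tokens)
    conv_left_context = 30         # per-layer convolution left context (tokens)
    ms_per_token = 4 * 10          # 4x downsampling * 10 ms window shift = 40 ms

    print(num_layers * att_left_context)    # 200 tokens of effective self-attention left context
    print(num_layers * conv_left_context)   # 300 tokens of effective convolution left context

    # ~6 seconds of audio corresponds to this many encoder tokens:
    print(6000 // ms_per_token)             # 150 tokens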

If the striding approach is used for downsampling, all the convolutions in the downsampling layers are fully causal and do not see future tokens.
It is recommended to use stacking downsampling for streaming models, as it is significantly faster and uses less memory.

Conformer-Transducer is the Conformer model introduced in :cite:`asr-models-gulati2020conformer` and uses RNNT/Transducer loss/decoder.
It has the same encoder as Conformer-CTC but utilizes RNNT/Transducer loss/decoder which makes it an autoregressive model.

Most of the config for Conformer-Transducer models is similar to Conformer-CTC, except for the sections related to the decoder and loss: decoder, loss, joint, and decoding.
You may take a look at our `tutorials page <../starthere/tutorials.html>` on Transducer models to become familiar with their configs:
`Introduction to Transducers <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/Intro_to_Transducers.ipynb>` and `ASR with Transducers <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_Transducers.ipynb>`
You can find more details on the config files for the Conformer-Transducer models at `Conformer-CTC <./configs.html#conformer-ctc>`.

This model supports both the sub-word level and character level encodings. The variant with sub-word encoding is a BPE-based model
which can be instantiated using the :class:`~nemo.collections.asr.models.EncDecRNNTBPEModel` class, while the
character-based variant is based on :class:`~nemo.collections.asr.models.EncDecRNNTModel`.
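
As a minimal usage sketch (the pretrained model name below is an assumption and may differ from what is published on NGC):

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # Sub-word (BPE) Conformer-Transducer; the model name is an example, not a guarantee.
    asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
        model_name="stt_en_conformer_transducer_large"
    )
    transcripts = asr_model.transcribe(["audio_sample.wav"])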

You may find the example config files of Conformer-Transducer model with character-based encoding at
``<NeMo_git_root>/examples/asr/conf/conformer/conformer_transducer_char.yaml`` and
with sub-word encoding at ``<NeMo_git_root>/examples/asr/conf/conformer/conformer_transducer_bpe.yaml``.


.. _LSTM-Transducer_model:

LSTM-Transducer
@@ -98,7 +98,7 @@
help="Model downsampling factor, 8 for Citrinet models and 4 for Conformer models",
)
parser.add_argument(
- '--max_steps_per_timestep', type=int, default=5, help='Maximum number of tokens decoded per acoustic timestepB'
+ '--max_steps_per_timestep', type=int, default=5, help='Maximum number of tokens decoded per acoustic timestep'
)
parser.add_argument('--stateful_decoding', action='store_true', help='Whether to perform stateful decoding')
parser.add_argument('--device', default=None, type=str, required=False)
@@ -175,10 +175,10 @@ def main(args):
torch.set_grad_enabled(False)
if args.asr_model.endswith('.nemo'):
logging.info(f"Using local ASR model from {args.asr_model}")
- asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(restore_path=args.asr_model)
+ asr_model = nemo_asr.models.ASRModel.restore_from(restore_path=args.asr_model)
else:
logging.info(f"Using NGC cloud ASR model {args.asr_model}")
- asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name=args.asr_model)
+ asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=args.asr_model)

cfg = copy.deepcopy(asr_model._cfg)
OmegaConf.set_struct(cfg.preprocessor, False)