Exportable mel spectrogram preprocessor #5508

Closed. Wants to merge 21 commits.

Changes from all commits (21 commits):
9ccbf56
initial commit
1-800-BAD-CODE Nov 25, 2022
d2bd936
add featurizer property
1-800-BAD-CODE Nov 25, 2022
225931b
use torchaudio argument instead of separate top-level NeuralModule
1-800-BAD-CODE Nov 26, 2022
052e4ee
Refactor/unify ASR offline and buffered inference (#5440)
fayejf Nov 19, 2022
6831170
Update docs with Comparison tool info (#5182)
Jorjeous Nov 21, 2022
ac58218
Standalone diarization+ASR evaluation script (#5439)
tango4j Nov 21, 2022
92306b5
Fix for prompt table restore error (#5393) (#5408)
github-actions[bot] Nov 21, 2022
203ab44
Radtts 1.13 plus (#5457) (#5471)
github-actions[bot] Nov 21, 2022
51384bf
[TN] raise NotImplementedError for unsupported languages and other mi…
XuesongYang Nov 21, 2022
f6a7bff
Add num layers check (#5470) (#5473)
github-actions[bot] Nov 22, 2022
669a0ea
Add float32 type casting for get_samples function (#5399)
tango4j Nov 22, 2022
38a9349
Change to kwargs (#5475) (#5477)
github-actions[bot] Nov 22, 2022
8fa52b1
Transcribe for multi-channel signals (#5479)
anteju Nov 22, 2022
072e10b
Megatron Export Update (#5343) (#5423)
github-actions[bot] Nov 23, 2022
d397b9a
export_utils bugfix (#5482)
github-actions[bot] Nov 23, 2022
04662c7
Add missing import (#5487)
jonghwanhyeon Nov 23, 2022
7e0507c
Add Silence Augmentation (#5476)
fayejf Nov 23, 2022
76efef6
Bug Fix for bert to run on entire validation dataset (#5493)
shanmugamr1992 Nov 23, 2022
5552d57
Export fixes for Riva (#5496) (#5497)
github-actions[bot] Nov 24, 2022
b5f2cb0
Add auto-labeler (#5498)
SeanNaren Nov 25, 2022
a00508f
Make arguments match
1-800-BAD-CODE Nov 26, 2022
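Judging by the PR title and the author's commit messages above (initial commit, add featurizer property, use torchaudio argument instead of a separate top-level NeuralModule), the change makes the mel spectrogram preprocessor exportable by building it on torchaudio components. As a rough, illustrative sketch of that idea (not the actual NeMo implementation; the class and parameter names below are invented):

```python
# Illustrative sketch only -- not the code added by this PR.
import torch
import torchaudio


class ExportableMelPreprocessor(torch.nn.Module):
    """Log-mel feature extractor built from torchaudio, exportable via TorchScript."""

    def __init__(self, sample_rate: int = 16000, n_fft: int = 512,
                 hop_length: int = 160, n_mels: int = 80):
        super().__init__()
        # torchaudio's MelSpectrogram is itself an nn.Module, so the whole
        # preprocessor can be scripted and exported without custom ops.
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft,
            hop_length=hop_length, n_mels=n_mels,
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, samples) -> log-mel features: (batch, n_mels, frames)
        return torch.log(self.mel(audio) + 1e-9)


# Example export:
scripted = torch.jit.script(ExportableMelPreprocessor())
scripted.save("mel_preprocessor.pt")
```

With the front end expressed as a standard module like this, the feature extraction can in principle be exported together with the acoustic model instead of being re-implemented in client code.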
5 changes: 5 additions & 0 deletions .github/labeler.yml
@@ -0,0 +1,5 @@
ASR:
- nemo/collections/asr/**/*

NLP:
- nemo/collections/nlp/**/*
14 changes: 14 additions & 0 deletions .github/workflows/labeler.yml
@@ -0,0 +1,14 @@
name: "Pull Request Labeler"
on:
- pull_request_target

jobs:
triage:
permissions:
contents: read
pull-requests: write
runs-on: ubuntu-latest
steps:
- uses: actions/labeler@v4
with:
repo-token: "${{ secrets.GITHUB_TOKEN }}"
4 changes: 2 additions & 2 deletions docs/source/asr/examples/kinyarwanda_asr.rst
@@ -483,7 +483,7 @@ The figure below shows the training dynamics when we train Kinyarwanda models **
.. image:: ../images/kinyarwanda_from_scratch.png
:align: center
:alt: Training dynamics of Kinyarwanda models trained from scratch
:scale: 50%
:width: 800px

Finetuning from another model
#############################
@@ -530,7 +530,7 @@ The figure below compares the training dynamics for three Conformer-Transducer m
.. image:: ../images/kinyarwanda_finetuning.png
:align: center
:alt: Training dynamics of Kinyarwanda models trained from scratch and finetuned from different pretrained checkpoints
:scale: 50%
:width: 800px

************************
Inference and evaluation
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -72,7 +72,7 @@ NVIDIA NeMo User Guide


.. toctree::
:maxdepth: 2
:maxdepth: 3
:caption: Tools
:name: Tools

1 change: 1 addition & 0 deletions docs/source/nlp/dialogue.rst
@@ -13,6 +13,7 @@ In particular, we wanted to decouple the task-dependent, model-independent compo

.. image:: dialogue_UML.png
:alt: Dialogue-UML
:width: 800px

**Supported Tasks**

1 change: 1 addition & 0 deletions docs/source/nlp/entity_linking.rst
@@ -15,6 +15,7 @@ be used to build a knowledge base embedding index.

.. image:: https://github.com/NVIDIA/NeMo/blob/entity-linking-documentation/docs/source/nlp/entity_linking_overview.jpg
:alt: Entity-Linking-Overview
:width: 800px

Our BERT-base + Self Alignment Pretraining implementation allows you to train an entity linking encoder. We also provide example code
on building an index with `Medical UMLS <https://www.nlm.nih.gov/research/umls/index.html>`_ concepts `NeMo/examples/nlp/entity_linking/build_index.py <https://github.com/NVIDIA/NeMo/tree/stable/examples/nlp/entity_linking/build_index.py>`__.
5 changes: 5 additions & 0 deletions docs/source/nlp/nemo_megatron/parallelisms.rst
@@ -10,6 +10,7 @@ Distributed Data parallelism

.. image:: images/ddp.gif
:align: center
:width: 800px
:alt: Distributed Data Parallel


@@ -18,20 +19,23 @@ Tensor Parallelism

.. image:: images/tp.gif
:align: center
:width: 800px
:alt: Tensor Parallel

Pipeline Parallelism
^^^^^^^^^^^^^^^^^^^^

.. image:: images/pp.gif
:align: center
:width: 800px
:alt: Pipeline Parallel

Sequence Parallelism
^^^^^^^^^^^^^^^^^^^^

.. image:: images/sp.gif
:align: center
:width: 800px
:alt: Sqeuence Parallel

Parallelism nomenclature
@@ -41,4 +45,5 @@ When reading and modifying NeMo Megatron code you will encounter the following t

.. image:: images/pnom.gif
:align: center
:width: 800px
:alt: Parallelism nomenclature
1 change: 1 addition & 0 deletions docs/source/nlp/question_answering.rst
@@ -72,6 +72,7 @@ Similarly, the BaseQAModel module handles common model tasks like creating datal

.. image:: question_answering_arch.png
:alt: Question-Answerin-Architecture
:width: 800px

Configuration
=============
@@ -182,4 +182,4 @@ References
.. bibliography:: ../tn_itn_all.bib
:style: plain
:labelprefix: TEXTPROCESSING-NORM
:keyprefix: textprocessing-norm-
:keyprefix: textprocessing-norm-
@@ -96,4 +96,4 @@ References
.. bibliography:: ../tn_itn_all.bib
:style: plain
:labelprefix: TEXTPROCESSING-DEPLOYMENT
:keyprefix: textprocessing-deployment-
:keyprefix: textprocessing-deployment-
153 changes: 153 additions & 0 deletions docs/source/tools/comparison_tool.rst
@@ -0,0 +1,153 @@
Comparison tool for ASR Models
==============================

The Comparison Tool (CT) allows you to compare the predictions of different ASR models at the word accuracy level.

+--------------------------------------------------------------------------------------------------------------------------+
| **Comparison tool features:** |
+--------------------------------------------------------------------------------------------------------------------------+
| navigation across dataset's vocabulary using an interactive datatable that supports sorting and filtering |
+--------------------------------------------------------------------------------------------------------------------------+
| interactive visualization of model's accuracy |
+--------------------------------------------------------------------------------------------------------------------------+
| visual comparison of predictions of different models |
+--------------------------------------------------------------------------------------------------------------------------+

Getting Started
---------------
The Comparison Tool is integrated into the NeMo Speech Data Explorer (SDE), which can be found at `NeMo/tools/speech_data_explorer <https://github.com/NVIDIA/NeMo/tree/main/tools/speech_data_explorer>`__.

Please install the SDE requirements:

.. code-block:: bash

pip install -r tools/speech_data_explorer/requirements.txt

Then run:

.. code-block:: bash

python tools/speech_data_explorer/data_explorer.py -h

usage: data_explorer.py [-h] [--vocab VOCAB] [--port PORT] [--disable-caching-metrics] [--estimate-audio-metrics] [--debug] manifest

Speech Data Explorer

positional arguments:
manifest path to JSON manifest file

optional arguments:
-h, --help show this help message and exit
--vocab VOCAB optional vocabulary to highlight OOV words
--port PORT serving port for establishing connection
--disable-caching-metrics
disable caching metrics for errors analysis
--estimate-audio-metrics, -a
estimate frequency bandwidth and signal level of audio recordings
--debug, -d enable debug mode
--audio-base-path A base path for the relative paths in manifest. It defaults to manifest path.
--names_compared, -nc names of the two fields that will be compared, example: pred_text_contextnet pred_text_conformer.
--show_statistics, -shst field name for which you want to see statistics (optional). Example: pred_text_contextnet.
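
For example, a comparison session for two prediction fields could be launched along these lines (the manifest path and field names are placeholders, and the exact argument format should be checked against the help above):

.. code-block:: bash

    python tools/speech_data_explorer/data_explorer.py manifest.json \
        --names_compared pred_text_contextnet pred_text_conformer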

CT takes as input a JSON manifest file (the format that describes speech datasets in NeMo). It should contain the following fields (an example entry is shown below):

* `audio_filepath` (path to audio file)
* `duration` (duration of the audio file in seconds)
* `text` (reference transcript)
* `pred_text_<model_1_name>`
* `pred_text_<model_2_name>`

SDE supports any extra custom fields in the JSON manifest. If the field is numeric, then SDE can visualize its distribution across utterances.

If the JSON manifest has a `pred_text` attribute, SDE interprets it as a predicted ASR transcript and computes error analysis metrics.
If you want SDE to analyze a different prediction field, use the `--show_statistics` argument.
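
For illustration, a single manifest line that compares two models could look like the following (the file path and model names are just examples, following the ``pred_text_<model_name>`` convention; each entry occupies one line):

.. code-block:: json

    {"audio_filepath": "audio/sample_0001.wav", "duration": 3.2, "text": "the quick brown fox", "pred_text_contextnet": "the quick brown fox", "pred_text_conformer": "the quick brow fox"}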

User Interface
--------------

SDE has three pages if the `--names_compared` argument is not empty:

* `Statistics` (to display global statistics and aggregated error metrics)

.. image:: images/sde_base_stats.png
:align: center
:width: 800px
:alt: SDE Statistics


* `Samples` (to allow navigation across the entire dataset and exploration of individual utterances)

.. image:: images/sde_player.png
:align: center
:width: 800px
:alt: SDE Statistics

* `Comparison tool` (to explore predictions at word level)

.. image:: images/scrsh_2.png
:align: center
:width: 800px
:alt: Comparison tool


CT has an interactive datatable for the dataset's vocabulary that supports navigation, filtering, and sorting:


* Data (visualizes all of the dataset's words along with each word's accuracy)

.. image:: images/scrsh_3.png
:align: center
:width: 800px
:alt: Data

CT supports all of the operations present in SDE and allows filtering expressions to be combined with "or" and "and" operators (an example is given after the screenshot below):

* filtering (by entering a filtering expression in a cell below the header's cell)

.. image:: images/scrsh_4.png
:align: center
:width: 800px
:alt: Filtering
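
For example, entering ``contains tion`` in the cell under the word column keeps only words containing that substring, and an expression such as ``> 10 and < 100`` under a numeric column keeps only rows in that range (the column names and thresholds here are illustrative).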


Analysis of Speech Datasets
---------------------------

If there is a pre-trained ASR model, then the JSON manifest file can be extended with ASR predicted transcripts:

.. code-block:: bash

python examples/asr/transcribe_speech.py pretrained_name=<ASR_MODEL_NAME> dataset_manifest=<JSON_FILENAME> append_pred=False pred_name_postfix=<model_name_1>
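
To obtain the two prediction fields that CT compares, the same script can simply be run once per model. A hypothetical second pass might look like this (model and field names are placeholders, and the flag semantics should be verified against the script itself):

.. code-block:: bash

    # hypothetical second pass for another model; verify flag semantics in transcribe_speech.py
    python examples/asr/transcribe_speech.py pretrained_name=<ASR_MODEL_NAME_2> dataset_manifest=<JSON_FILENAME> append_pred=True pred_name_postfix=<model_name_2>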


More information about transcribe_speech parameters is available in the code: `NeMo/examples/asr/transcribe_speech.py <https://github.com/NVIDIA/NeMo/blob/main/examples/asr/transcribe_speech.py>`__.

.. image:: images/scrsh_2.png
:align: center
:width: 800px
:alt: fields

Fields 1 and 2 control what is displayed on the horizontal and vertical axes.

Fields 3 and 4 allow you to map any available numeric parameter to point color and size, respectively.

Fields 5 and 6 control point spacing. Some data points may have identical coordinates on both axes and therefore overlap; to make each point explorable, an option to spread them apart was added.

.. image:: images/scrsh_5.png
:align: center
:width: 800px
:alt: dot spacing

Point spacing works as follows: a small random value, bounded by the manually adjustable "radius" parameter, is added to every point's coordinates.
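
Conceptually the spreading is plain coordinate jitter; a minimal sketch of the idea (not the tool's actual code) is:

.. code-block:: python

    import numpy as np

    def spread_points(x, y, radius):
        # Add a small uniform random offset, bounded by `radius`, to each coordinate.
        x = np.asarray(x, dtype=float) + np.random.uniform(-radius, radius, size=len(x))
        y = np.asarray(y, dtype=float) + np.random.uniform(-radius, radius, size=len(y))
        return x, y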

.. image:: images/scrsh_9.png
:align: center
:width: 800px
:alt: Example

In this case, all points lying above the diagonal are recognized more accurately by the model on the vertical axis, and all points below the diagonal are recognized better by the model on the horizontal axis.

Points marked with circles should be explored first.

Words in the first quadrant were recognized well by both models; conversely, words in the third quadrant were recognized poorly by both models.
Binary file added docs/source/tools/images/scrsh_2.png
Binary file added docs/source/tools/images/scrsh_3.png
Binary file added docs/source/tools/images/scrsh_4.png
Binary file added docs/source/tools/images/scrsh_5.png
Binary file added docs/source/tools/images/scrsh_9.png
1 change: 1 addition & 0 deletions docs/source/tools/intro.rst
@@ -9,5 +9,6 @@ NeMo provides a set of tools useful for developing Automatic Speech Recognitions

ctc_segmentation
speech_data_explorer
comparison_tool


30 changes: 20 additions & 10 deletions docs/source/tools/speech_data_explorer.rst
@@ -83,15 +83,17 @@ SDE application has two pages:

.. image:: images/sde_base_stats.png
:align: center
:width: 800px
:alt: SDE Statistics
:scale: 50%


* `Samples` (to allow navigation across the entire dataset and exploration of individual utterances)

.. image:: images/sde_player.png
:align: center
:width: 800px
:alt: SDE Statistics
:scale: 50%


Plotly Dash Datatable provides core SDE's interactive features (navigation, filtering, and sorting).
SDE has two datatables:
@@ -100,38 +102,43 @@ SDE has two datatables:

.. image:: images/sde_words.png
:align: center
:width: 800px
:alt: Vocabulary
:scale: 50%


* Data (that visualizes all dataset's utterances on `Samples` page)

.. image:: images/sde_utterances.png
:align: center
:width: 800px
:alt: Data
:scale: 50%


Every column of the DataTable has the following interactive features:

* toggling off (by clicking on the `eye` icon in the column's header cell) or on (by clicking on the `Toggle Columns` button below the table)

.. image:: images/datatable_toggle.png
:align: center
:width: 800px
:alt: Toggling
:scale: 80%


* sorting (by clicking on small triangle icons in the column's header cell): unordered (two triangles point up and down), ascending (a triangle points up), descending (a triangle points down)

.. image:: images/datatable_sort.png
:align: center
:width: 800px
:alt: Sorting
:scale: 80%


* filtering (by entering a filtering expression in a cell below the header's cell): SDE supports ``<``, ``>``, ``<=``, ``>=``, ``=``, ``!=``, and ``contains`` operators; to match a specific substring, the quoted substring can be used as a filtering expression

.. image:: images/datatable_filter.png
:align: center
:width: 800px
:alt: Filtering
:scale: 80%



Analysis of Speech Datasets
@@ -154,22 +161,25 @@ After that it is worth to check words with zero accuracy.

.. image:: images/sde_mls_words.png
:align: center
:width: 800px
:alt: MLS Words
:scale: 50%


And then look at high CER utterances.

.. image:: images/sde_mls_cer.png
:align: center
:width: 800px
:alt: MLS CER
:scale: 50%


Listening to the audio recording helps to validate the corresponding reference transcript.

.. image:: images/sde_mls_player.png
:align: center
:width: 800px
:alt: MLS Player
:scale: 50%




2 changes: 2 additions & 0 deletions docs/update_docs_docker.sh
@@ -1,3 +1,5 @@
cd ../
docker run --rm -v $PWD:/workspace python:3.8 /bin/bash -c "cd /workspace && \
pip install -r requirements/requirements_docs.txt && cd docs/ && rm -rf build && make clean && make html && make html"
echo "To start web server just run in docs directory:"
echo "python3 -m http.server 8000 --directory ./build/html/"