Add SDP documentation (NVIDIA#5274) (NVIDIA#5376)
* Add details to SDP README.md

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Add docstring to WriteManifest processor

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Add docstring to CreateInitialManifestMLS

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Add ModifyManifestTextProcessor docstring

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Add ASRInference docstring

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Add base_processor docstrings

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Add minimal SDP docs page

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Update tools/speech_dataset_processor/README.md

Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com>

* Write simple README for SDP and move complex explanations to docs

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Remove incorrect type hints

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Make config example less confusing

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Fix typo

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Clarify that YAML file is config file in README

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Remove unused imports

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Remove SDP docs for now

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Remove links to docs in SDP README

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
Signed-off-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
Signed-off-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com>
Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: 1-800-bad-code <shane.carroll@utsa.edu>
3 people authored and 1-800-BAD-CODE committed Nov 13, 2022
1 parent b86f4f8 commit e9afd2f
Showing 6 changed files with 136 additions and 14 deletions.
68 changes: 65 additions & 3 deletions tools/speech_dataset_processor/README.md
@@ -1,7 +1,69 @@
# Speech Dataset Processor

Toolkit to make it easy to write and share the steps for processing a speech dataset.
Speech Dataset Processor (SDP) is a toolkit to make it easy to:
1. write code to process a new dataset, minimizing the amount of boilerplate code required.
2. share the steps for processing a speech dataset. Sharing processing steps can be as easy as sharing a YAML file.

This toolkit contains many of the most common speech dataset processing operations. To process a new dataset, you simply need to write a YAML file containing the parameters needed for dataset processing. It is also easy to add your own code for various speech dataset processing steps if needed.
SDP's philosophy is to represent processing operations as 'processor' classes. Many common processing operations are provided, and it is easy to add your own. In some cases, all you need to do to process a new dataset is write a YAML file containing the parameters needed to process your dataset.

TBD
SDP is specifically intended for the use case where you have an existing dataset with the audio & text pairs already specified in some form, and you wish to create a JSON manifest suitable for use with NeMo. SDP allows for intermediate cleaning and filtering steps which involve amending the 'ground truth' `"text"` or dropping utterances deemed too inaccurate to train on.
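
For reference, a NeMo-style manifest is a text file with one JSON object per line describing a single utterance. A minimal sketch of writing one entry, using the standard NeMo field names (the file path below is hypothetical):

```python
import json

# One NeMo-style manifest entry: "audio_filepath", "duration", and "text"
# are the standard NeMo manifest keys; the path is purely illustrative.
entry = {
    "audio_filepath": "/data/audio/utterance_0001.wav",
    "duration": 3.27,  # audio length in seconds
    "text": "the quick brown fox",  # ground-truth transcript
}

with open("manifest.json", "w") as fout:
    fout.write(json.dumps(entry) + "\n")
```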

## Quick intro to Speech Dataset Processor

* The steps to process a dataset are specified by a YAML config file.
* The YAML config file contains a list of processor classes and the args to pass into their constructors.
* Each processor class takes an existing manifest as input (except for classes which create an 'initial' manifest from some external transcript file) and outputs a modified version of the manifest. It may also change other files in the process, e.g. resample audio.
* To process a manifest, you need to list the chain of processors you wish to use.
* If a processor you need is not included, you can write your own (see the sketch below).
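
For illustration, a custom processor might subclass `BaseParallelProcessor` (shown later in this diff) and override `process_dataset_entry`. This is a minimal sketch, assuming the hook receives one parsed manifest line as a dict and that returning an empty list drops the utterance:

```python
from sdp.processors.base_processor import BaseParallelProcessor, DataEntry


class DropEmptyText(BaseParallelProcessor):
    """Sketch of a custom processor that drops utterances with empty text."""

    def process_dataset_entry(self, data_entry):
        # data_entry is assumed to be one parsed manifest line (a dict);
        # returning an empty list is assumed to drop the utterance.
        if not data_entry["text"].strip():
            return []
        return [DataEntry(data=data_entry)]
```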

## YAML config file layout
A simplified version of an SDP config file looks like this:

```yaml
processors:

# use existing classes for popular datasets or make your own class
- _target_: sdp.processors.CreateInitialManifestMLS
output_manifest_file: ...
download_dir: ...
...

# use existing classes for common operations or write your own
- _target_: sdp.processors.SubSubstringToSubstring

substring_pairs: {
# specify the parameters needed for your use case
" mr ": " mister ",
" misteak ": " mistake ",
...
}

- _target_: sdp.processors.DropNonAlphabet
alphabet: " abcdefghijklmnopqrstuvwxyz"
output_manifest_file: ...
...
```
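
The `_target_` keys follow the Hydra convention, so each processor entry can be instantiated directly with `hydra.utils.instantiate`. A minimal sketch of a runner, assuming each instantiated processor exposes a `process()` method:

```python
import hydra.utils
from omegaconf import OmegaConf

# Sketch: instantiate and run each processor listed in an SDP config file.
# Assumes each processor object exposes a process() method that reads its
# input manifest and writes its output manifest.
config = OmegaConf.load("config.yaml")

for processor_cfg in config.processors:
    processor = hydra.utils.instantiate(processor_cfg)
    processor.process()
```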

## Existing processor classes
In addition to those mentioned in the example config file, many more classes are already included in Speech Dataset Processor, for example:
* `sdp.processors.ASRInference` will run inference on the manifest using a specified `pretrained_model`.
* `sdp.processors.DropHighWER` will compute WER between `text` and `pred_text` of each utterance and remove the utterance if WER is greater than the specified `wer_threshold`.
* `sdp.processors.DropHighLowCharrate` will compute the character rate of the utterance from `text` and `duration`, and drop the utterance if the rate falls outside the bounds of the specified `low_charrate_threshold` and `high_charrate_threshold`. Carefully chosen thresholds allow you to drop utterances with incorrect ground truth `text` (see the sketch below).
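
For intuition, the character rate is just the transcript length divided by the audio duration; a sketch of the check, with illustrative threshold values:

```python
def character_rate(text: str, duration: float) -> float:
    """Characters of ground-truth text per second of audio."""
    return len(text) / duration

# Illustrative thresholds: a rate far outside the normal range for speech
# usually means the ground-truth text does not match the audio.
low_charrate_threshold = 2.0
high_charrate_threshold = 21.0

rate = character_rate("the quick brown fox", duration=1.5)
keep = low_charrate_threshold <= rate <= high_charrate_threshold
```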

## Processor test cases
You can add test cases to verify you have specified your desired changes correctly and to help document why you are making these changes.

For example:
```yaml
processors:
...
- _target_: sdp.processors.DropIfRegexInAttribute
attribute_to_regex:
"text" : ["(\\D ){5,20}"] # looks for between 4 and 19 characters surrounded by spaces

test_cases:
- {input: {text: "some s p a c e d out letters"}, output: null}
- {input: {text: "normal words only"}, output: {text: "normal words only"}}
- {input: {text: "three a b c spaced out letters"}, output: {text: "three a b c spaced out letters"}}
- {input: {text: "four a b c d spaced out letters"}, output: null}
...
```
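
To see why the comment above says 4 to 19 characters: the first `(\D )` repetition consumes the final character of the word before the spaced-out run, so 5-20 repetitions correspond to 4-19 single characters fully surrounded by spaces. The test cases can be checked directly:

```python
import re

pattern = re.compile(r"(\D ){5,20}")

# The leading/trailing spaces mimic the padding that
# ModifyManifestTextProcessor adds around each sentence.
assert pattern.search(" some s p a c e d out letters ")        # dropped
assert not pattern.search(" normal words only ")               # kept
assert not pattern.search(" three a b c spaced out letters ")  # kept
assert pattern.search(" four a b c d spaced out letters ")     # dropped
```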
13 changes: 11 additions & 2 deletions tools/speech_dataset_processor/sdp/processors/asr_inference.py
@@ -20,7 +20,14 @@


class ASRInference(BaseProcessor):
"""This processor perforce ASR inference.
"""This processor performs ASR inference on the input manifest.
Args:
output_manifest_file: the path to the output manifest. It will be the same as the input manifest, but will
also have "pred_text" entries for every utterance.
input_manifest_file: the path to the input manifest which will be transcribed.
pretrained_model: the name of the pretrained NeMo ASR model which will be used to do inference.
batch_size: the batch size to use for ASR inference.
Note that it does not re-use the base parallel implementation, since the ASR
inference is already run in batches.
@@ -29,7 +36,9 @@ class ASRInference(BaseProcessor):
parallelization, but that needs to be tested.
"""

def __init__(self, output_manifest_file, input_manifest_file, pretrained_model, batch_size=32):
def __init__(
self, output_manifest_file: str, input_manifest_file: str, pretrained_model: str, batch_size: int = 32
):
self.output_manifest_file = output_manifest_file
self.input_manifest_file = input_manifest_file
self.script_path = Path(__file__).parents[4] / "examples" / "asr" / "transcribe_speech.py"
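
Putting the new docstring together, a hypothetical invocation might look as follows; the manifest paths and model name are illustrative, and the `process()` call is an assumption based on the `BaseProcessor` interface:

```python
from sdp.processors.asr_inference import ASRInference

# Sketch: transcribe every utterance in a manifest with a pretrained NeMo
# model, adding "pred_text" fields alongside the original entries.
asr = ASRInference(
    output_manifest_file="manifest_with_preds.json",
    input_manifest_file="manifest.json",
    pretrained_model="stt_en_conformer_ctc_large",  # illustrative model name
    batch_size=32,
)
asr.process()  # assumed entry point inherited from BaseProcessor
```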
21 changes: 17 additions & 4 deletions tools/speech_dataset_processor/sdp/processors/base_processor.py
@@ -34,6 +34,17 @@ class DataEntry:


class BaseProcessor(ABC):
"""
Abstract class for SDP processors.
Args:
output_manifest_file: path of where the output manifest file will be located.
input_manifest_file: path of where the input manifest file is located. This arg
is optional - some processors may not take in an input manifest because they
need to create an initial manifest from scratch (i.e. from some transcript file
that is in a format different from the NeMo manifest format).
"""

def __init__(self, output_manifest_file, input_manifest_file=None):
self.output_manifest_file = output_manifest_file
self.input_manifest_file = input_manifest_file
@@ -55,13 +66,15 @@ def test(self):

class BaseParallelProcessor(BaseProcessor):
"""
TBD
Processor class which allows operations on each utterance to be parallelized. Parallelization
is done using tqdm.contrib.concurrent.process_map.
input_manifest_file should always be specified unless it's the first
processor, which reads from the original dataset representation.
Args:
max_workers: maximum number of workers that will be spawned during parallel processing.
chunksize: the size of the chunks that will be sent to worker processes.
"""

def __init__(self, max_workers=-1, chunksize=100, **kwargs):
def __init__(self, max_workers: int = -1, chunksize: int = 100, **kwargs):
super().__init__(**kwargs)
if max_workers == -1:
max_workers = multiprocessing.cpu_count()
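
For context, `tqdm.contrib.concurrent.process_map` behaves like `multiprocessing.Pool.map` with a progress bar, exposing the same `max_workers` and `chunksize` knobs that `BaseParallelProcessor` forwards. A standalone sketch (the `square` function is purely illustrative):

```python
from tqdm.contrib.concurrent import process_map


def square(x: int) -> int:
    return x * x


# process_map fans the calls out to a process pool and shows a tqdm
# progress bar; these are the same max_workers/chunksize parameters
# that BaseParallelProcessor accepts.
results = process_map(square, range(1000), max_workers=4, chunksize=100)
```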
tools/speech_dataset_processor/sdp/processors/create_initial_manifest_mls.py
@@ -25,8 +25,27 @@


class CreateInitialManifestMLS(BaseParallelProcessor):
"""
Downloads and unzips raw MLS data for the specified language, and creates an initial manifest using
the transcripts provided in the raw data.
Args:
language: the language of the data you wish to download. This will be used to format the
URL from which we attempt to download the data.
download_dir: the directory where the downloaded data will be saved.
data_split: the data split for which the initial manifest will be created.
resampled_audio_dir: the directory where the resampled (16kHz) wav files will be stored.
use_test_data: if `True`, will use the test data manifest located at `TEST_DATA_PATH` to carry out tests.
"""

def __init__(
self, language, download_dir, resampled_audio_dir, data_split, use_test_data=False, **kwargs,
self,
language: str,
download_dir: str,
resampled_audio_dir: str,
data_split: str,
use_test_data: bool = False,
**kwargs,
):
super().__init__(**kwargs)
self.language = language
@@ -65,7 +84,7 @@ def read_manifest(self):

return dataset_entries

def process_dataset_entry(self, data_entry):
def process_dataset_entry(self, data_entry: str):
if len(data_entry.split("\t")) != 2:
raise RuntimeError(f"have more than one tab in line {data_entry}")

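
For reference, each line of a raw MLS transcript file pairs an utterance ID with its transcript, separated by a single tab, which is what the `len(data_entry.split("\t")) != 2` guard above enforces. A small sketch (the exact ID layout is an assumption):

```python
# Sketch: one raw MLS transcript line maps to (utterance id, transcript),
# separated by a single tab. The "<speaker>_<book>_<utt>" ID layout is
# an assumption for illustration.
line = "1001_1234_000001\tthe quick brown fox jumped over the lazy dog"

fields = line.split("\t")
if len(fields) != 2:
    raise RuntimeError(f"expected exactly one tab in line {line}")
utt_id, transcript = fields
```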
tools/speech_dataset_processor/sdp/processors/modify_manifest.py
@@ -23,12 +23,20 @@
class ModifyManifestTextProcessor(BaseParallelProcessor):
"""Base class useful for most "text-only" modifications of the manifest.
Will add the following functionality:
- Add space in the beginning and end of sentence for easier regex-based
This adds the following functionality on top of BaseParallelProcessor:
- Adds a space at the beginning and end of each sentence for easier regex-based
processing.
- Automatically handles common test cases by comparing input to output
values.
Args:
test_cases: an optional list of dicts containing test cases for checking
that the processor makes the changes that we are expecting.
The dicts must have a key 'input', the value of which is a dictionary
containing data which is our test input manifest line, and a key
'output', the value of which is a dictionary containing data which is
the expected output manifest line.
.. note::
This class only supports one-to-one or one-to-none mappings.
"""
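
A sketch of how such test cases might be exercised; the `test()` hook appears earlier in this diff, but the comparison logic here is an assumption:

```python
# Sketch: verify a processor against its declared test cases. An expected
# output of None means the utterance should be dropped entirely.
def run_test_cases(processor):
    for test_case in processor.test_cases:
        result = processor.process_dataset_entry(test_case["input"])
        expected = test_case["output"]
        if expected is None:
            assert result == [], "expected the utterance to be dropped"
        else:
            assert result[0].data == expected
```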
13 changes: 12 additions & 1 deletion tools/speech_dataset_processor/sdp/processors/write_manifest.py
@@ -13,13 +13,24 @@
# limitations under the License.

import json
from typing import List

from sdp.processors.base_processor import BaseProcessor
from tqdm import tqdm


class WriteManifest(BaseProcessor):
def __init__(self, output_manifest_file, input_manifest_file, fields_to_save):
"""
Saves a copy of a manifest but only with the fields specified in fields_to_save.
Args:
output_manifest_file: path of where the output file will be saved.
input_manifest_file: path of where the input file that we will be copying is saved.
fields_to_save: list of the fields in the input manifest that we want to copy over.
The output file will only contain these fields.
"""

def __init__(self, output_manifest_file: str, input_manifest_file: str, fields_to_save: List[str]):
self.output_manifest_file = output_manifest_file
self.input_manifest_file = input_manifest_file
self.fields_to_save = fields_to_save
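
Based on the docstring and the imports above, the core of `WriteManifest.process` is plausibly a line-by-line copy that keeps only the requested fields. A standalone sketch, assuming every input line contains all of `fields_to_save`:

```python
import json

from tqdm import tqdm


def write_manifest(input_manifest_file, output_manifest_file, fields_to_save):
    """Standalone sketch of WriteManifest's core loop: copy the input
    manifest line by line, keeping only the fields in fields_to_save."""
    with open(input_manifest_file) as fin, open(output_manifest_file, "w") as fout:
        for line in tqdm(fin):
            entry = json.loads(line)
            kept = {field: entry[field] for field in fields_to_save}
            fout.write(json.dumps(kept) + "\n")
```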
