Add NemoAsrReader handling of UTF8 text #2358

Merged (4 commits) on Oct 16, 2020

jantonguirao
Contributor

Signed-off-by: Joaquin Anton janton@nvidia.com

Why we need this PR?

  • It adds a new feature needed to integrate with NeMo ASR training on non-English datasets.

What happened in this PR?

  • What solution was applied:
    NeMo manifests are encoded in UTF-8. The reader now outputs the text as the expected UTF-8 encoded string.
    Added tests.
    Removed the trailing 0 from text outputs.
    Removed text normalization, which is left to the Python code.
  • Affected modules and functionalities:
    NemoAsrReader
  • Key points relevant for the review:
    N/A
  • Validation and testing:
    New tests added
  • Documentation (including examples):
    Docstring updated
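
For context on the text handling described above, here is a minimal sketch of what decoding and normalizing the text on the Python side could look like. It assumes the reader's text output for one sample is available as a sequence of uint8 values (the UTF-8 bytes, with no trailing 0). `decode_text` and `normalize_text` are hypothetical helpers, and the normalization shown (lowercasing, collapsing whitespace) is only an illustration, not necessarily what the removed option did.

```python
def decode_text(text_u8):
    """Decode a sequence of UTF-8 byte values (e.g. a uint8 array) into a str."""
    # bytes() accepts a list of ints in 0..255 or any uint8 buffer.
    return bytes(text_u8).decode("utf-8")


def normalize_text(text):
    """Hypothetical normalization: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


# Stand-in for one text output of the reader: the raw UTF-8 bytes of a transcript.
sample = list("  Grüße  aus  Köln ".encode("utf-8"))
text = normalize_text(decode_text(sample))
```

Because the bytes are decoded as UTF-8 before any normalization, multi-byte characters are handled as whole characters rather than byte by byte.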

JIRA TASK: [DALI-1635]

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao
Contributor Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [1700932]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1700932]: BUILD FAILED

@dali-automaton
Collaborator

CI MESSAGE: [1703805]: BUILD STARTED

Signed-off-by: Joaquin Anton <janton@nvidia.com>
Signed-off-by: Joaquin Anton <janton@nvidia.com>

{
std::stringstream ss;
ss << R"code({"audio_filepath": "path/to/audio1.wav", "duration": 1.45, "text": "这是一个测试"})code" << std::endl;
Contributor

Perhaps we should get the UTF-8 bytes and store them as a char array (providing the characters' numerical values); otherwise the test is prone to breaking when the file is transcoded.
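
One way to read this suggestion, sketched in Python rather than in the C++ test under review: spell the text out as numeric byte escapes so that the source file stays pure ASCII and survives transcoding. The byte values below are the UTF-8 encoding of the sample text in the manifest line above. `manifest_line` is a hypothetical helper for illustration, not part of the PR.

```python
import json

# UTF-8 bytes of the sample text, written as numeric escapes (ASCII-only source).
TEXT_UTF8 = (
    b"\xe8\xbf\x99\xe6\x98\xaf\xe4\xb8\x80"
    b"\xe4\xb8\xaa\xe6\xb5\x8b\xe8\xaf\x95"
)


def manifest_line(audio_filepath, duration, text_utf8):
    """Build one JSON manifest line (as UTF-8 bytes) from raw text bytes."""
    entry = {
        "audio_filepath": audio_filepath,
        "duration": duration,
        "text": text_utf8.decode("utf-8"),
    }
    # ensure_ascii=False keeps the text as UTF-8 instead of \uXXXX escapes.
    return json.dumps(entry, ensure_ascii=False).encode("utf-8")


line = manifest_line("path/to/audio1.wav", 1.45, TEXT_UTF8)
```

The same bytes can then serve as the expected reader output, so the test never depends on how the editor or build system encodes the source file.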

@@ -85,14 +85,6 @@ in seconds, of the audio samples.

Samples with a duration longer than this value will be ignored.)code",
0.0f)
.AddOptionalArg("normalize_text",
Contributor

Deprecate this?

Contributor Author

Done

nemo_asr_manifest = os.path.join(tmp_dir.name, "nemo_asr_manifest.json")
create_manifest_file(nemo_asr_manifest, names, lengths, rates, ref_text_literal)
ref_text = [np.frombuffer(bytes(s, "utf8"), dtype=np.uint8).reshape(-1) for s in ref_text_literal]
Contributor

What is .reshape(-1) for?

Contributor Author

flattening

Contributor Author

Checked, not needed
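
A small check of why the flattening is redundant, assuming NumPy is available: `np.frombuffer` already returns a one-dimensional array, so `.reshape(-1)` leaves it unchanged. `text_to_uint8` is a hypothetical wrapper around the expression from the test line above.

```python
import numpy as np


def text_to_uint8(s):
    """Return the UTF-8 bytes of s as a 1-D uint8 array."""
    return np.frombuffer(bytes(s, "utf8"), dtype=np.uint8)


arr = text_to_uint8("dali")
flat = arr.reshape(-1)

# Both arrays have the same shape and contents, so the reshape is a no-op.
same = arr.shape == flat.shape and np.array_equal(arr, flat)
```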

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao
Contributor Author

!build

@dali-automaton
Collaborator

CI MESSAGE: [1707805]: BUILD STARTED

@dali-automaton
Collaborator

CI MESSAGE: [1707805]: BUILD PASSED

@jantonguirao jantonguirao merged commit 14ed736 into NVIDIA:master Oct 16, 2020
5 participants