Update text processing in Magpietts eval #15608

Merged: rlangman merged 8 commits into main from magpie_eval on Apr 29, 2026

Collaborator

@rlangman rlangman commented Apr 15, 2026

What does this PR do ?

Adds an interface for doing language-specific text processing for TTS.

Collection: [TTS]

Changelog

  • Adds an interface, TextProcessor, which can be used for normalization of training/eval data and for processing text for WER calculation.
  • Adds English implementation of TextProcessor to Magpie evaluation code. This reduces false positives in TTS WER caused by ASR models producing text in written (instead of spoken) form.
  • Updated punctuation removal to use isalnum(). Previous logic using string.punctuation only removes a limited set of English punctuation.
  • Added datasets_base_path to allow doing inference and evaluation using a config file with relative paths.
  • Modified Longform CI to use a manifest with a "normalized_text" field. Previously, the tests had higher WER because they fed Magpie raw text containing several digits.
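
For readers who want a concrete picture of the first three changelog items, here is a minimal sketch of what such an interface could look like. Only the names TextProcessor and get_text_processor(language: str) come from this PR; the method names, the EnglishTextProcessor and PassthroughTextProcessor classes, and the toy digit table are assumptions made for illustration and are not the code that was merged.

```python
import re
from abc import ABC, abstractmethod


class TextProcessor(ABC):
    """Hypothetical interface; the method names in the PR may differ."""

    @abstractmethod
    def normalize(self, text: str) -> str:
        """Convert written-form text (e.g. digits) to spoken form."""

    def remove_punctuation(self, text: str) -> str:
        # Keep alphanumeric characters and whitespace, drop everything else.
        # str.isalnum() is Unicode-aware, so accented letters are kept.
        return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

    def process_for_wer(self, text: str) -> str:
        # Normalize, strip punctuation, lowercase, collapse whitespace.
        text = self.remove_punctuation(self.normalize(text)).lower()
        return " ".join(text.split())


class EnglishTextProcessor(TextProcessor):
    # Toy digit table standing in for a real text normalizer (assumption).
    _DIGITS = {
        "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
        "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    }

    def normalize(self, text: str) -> str:
        # Spell out each ASCII digit individually, padded with spaces.
        return re.sub(r"[0-9]", lambda m: f" {self._DIGITS[m.group()]} ", text)


class PassthroughTextProcessor(TextProcessor):
    """Fallback for languages without a dedicated normalizer (assumption)."""

    def normalize(self, text: str) -> str:
        return text


def get_text_processor(language: str) -> TextProcessor:
    # The signature matches the one shown in the review; the body is a guess.
    if language == "en":
        return EnglishTextProcessor()
    return PassthroughTextProcessor()
```

With this sketch, a reference such as "I have 2 cats." and an ASR transcript "I have two cats" both reduce to the same string, which is the kind of false positive the English implementation is meant to remove.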

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

@rlangman rlangman self-assigned this Apr 15, 2026
@rlangman rlangman added the TTS label Apr 15, 2026
Collaborator

@rfejgin rfejgin left a comment

Overall looks good to me. Left some minor comments.

Comment thread nemo/collections/tts/parts/utils/tts_dataset_utils.py Outdated
Comment thread nemo/collections/tts/parts/utils/tts_dataset_utils.py
Comment thread nemo/collections/tts/parts/utils/tts_dataset_utils.py
Comment thread nemo/collections/tts/parts/utils/tts_dataset_utils.py
Collaborator

@rfejgin rfejgin left a comment

All looks good. One question: have we tested the updated code at all on non-English? At least to the extent that we'd expect WERs to not be higher than before the change, or even to be lower. One possible way to test that while removing variability due to the TTS model would be to run evaluation on a pre-populated directory of Magpie outputs (if the code allows doing that easily), once before the changes and once after.
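
The before/after comparison suggested here can be sketched as follows. This is an illustration only: the function names, the two processing functions, and the sample pairs are hypothetical, and the WER is a plain word-level edit distance rather than whatever metric implementation the evaluation code uses. The point is that the (reference, ASR transcript) pairs are fixed, so any change in WER comes from the text processing and not from the TTS model.

```python
import string
from typing import Callable, Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    # Standard Levenshtein distance over tokens, computed row by row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]], process: Callable[[str], str]) -> float:
    # Corpus-level WER: total word errors divided by total reference words.
    errors, total = 0, 0
    for reference, transcript in pairs:
        ref_words = process(reference).split()
        hyp_words = process(transcript).split()
        errors += edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return errors / total if total else 0.0


def old_process(text: str) -> str:
    # Stand-in for the previous logic: strip ASCII punctuation only.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def new_process(text: str) -> str:
    # Stand-in for the updated logic: keep alphanumerics and whitespace.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())


# Made-up pairs standing in for a pre-populated directory of outputs.
fixed_pairs = [
    ("¿Cómo estás?", "cómo estás"),
    ("Guten Morgen!", "guten morgen"),
]

old_wer = corpus_wer(fixed_pairs, old_process)
new_wer = corpus_wer(fixed_pairs, new_process)
print(f"WER before: {old_wer:.3f}, after: {new_wer:.3f}")
```

On these toy pairs the old processing leaves "¿" attached to the first word and counts it as an error, while the new processing does not, so the WER can only stay the same or drop.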

blisc previously requested changes Apr 22, 2026
return text


def get_text_processor(language: str) -> TextProcessor:
Collaborator

This function and https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/tts/parts/utils/helpers.py#L810 need to be de-duplicated before we merge. The functionality of process_text_for_cer() should be merged in, and GRPO should use this text processor.

Collaborator Author

@rlangman rlangman Apr 22, 2026

That method is used by 3 different recipes: Magpie GPRO, Easy Magpie, and Magpie GRPO, and the refactoring needed for each of those is complicated both to implement and to test. It is too large and dangerous a change to couple with this PR, which is relatively small and required for us to proceed with updating our evaluation datasets.

I could implement something and rely on CI tests to validate each recipe still runs, but that is not safe because it is changing the GRPO criteria itself.

@blisc blisc requested review from blisc and shehzeen April 22, 2026 14:32
@rlangman
Collaborator Author

> All looks good. One question: have we tested the updated code at all on non-English? At least to the extent that we'd expect WERs to not be higher than before the change, or even to be lower. One possible way to test that while removing variability due to the TTS model would be to run evaluation on a pre-populated directory of Magpie outputs (if the code allows doing that easily), once before the changes and once after.

Using the same Magpie outputs, I ran evaluation on CML test (French, German, Italian, Spanish) and all metrics were identical, except that CER on French dropped from 0.022 to 0.019 because the updated logic with isalnum() removes French quotation marks before CER calculation.
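
To illustrate why only French moved, here is a small sketch of the two stripping approaches on a made-up sentence. The helper names and the sample text are hypothetical, and keeping whitespace in the isalnum() variant is an assumption about how the filter is applied. string.punctuation covers only ASCII punctuation, so the guillemets « and » survive the old filter and are removed by the new one.

```python
import string


def strip_with_string_punctuation(text: str) -> str:
    # Old approach: delete only the ASCII characters in string.punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))


def strip_with_isalnum(text: str) -> str:
    # New approach: keep alphanumerics (Unicode-aware) and whitespace.
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


sample = "« Très bien », dit Marie."
old_result = strip_with_string_punctuation(sample)
new_result = strip_with_isalnum(sample)

print(repr(old_result))  # guillemets are still present
print(repr(new_result))  # guillemets are gone, accented letters are kept
```

The leftover quotation marks count as character errors against an ASR transcript that never contains them, which is consistent with a small CER drop once they are filtered out.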

@github-actions github-actions Bot removed the Run CICD label Apr 23, 2026
@blisc blisc dismissed their stale review April 23, 2026 17:25

Unblock

blisc previously approved these changes Apr 23, 2026
@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rfejgin previously approved these changes Apr 23, 2026
@github-actions github-actions Bot removed the Run CICD label Apr 24, 2026
Signed-off-by: Ryan <rlangman@nvidia.com>
Signed-off-by: Ryan Langman <rlangman@nvidia.com>
@rlangman
Collaborator Author

/ok to test 418f0d3

blisc previously approved these changes Apr 28, 2026
Comment thread .github/workflows/cicd-main-speech.yml
Signed-off-by: Ryan Langman <rlangman@nvidia.com>
@github-actions
Contributor

[🤖]: Hi @rlangman 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@rlangman
Collaborator Author

/ok to test d85344d

@github-actions
Contributor

[🤖]: Hi @rlangman 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@rlangman rlangman merged commit 056d937 into main Apr 29, 2026
136 checks passed
@rlangman rlangman deleted the magpie_eval branch April 29, 2026 22:40

3 participants