Skip to content

[TTS] MagpieTTS Inference Refactoring#15178

Merged
blisc merged 54 commits intomainfrom
magpietts_opensource
Dec 13, 2025
Merged

[TTS] MagpieTTS Inference Refactoring#15178
blisc merged 54 commits intomainfrom
magpietts_opensource

Conversation

@subhankar-ghosh
Copy link
Copy Markdown
Collaborator

Important

The Update branch button must only be pressed in very rare occassions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Copy of [TTS] MagpieTTS Inference Changes #15158
Because after rebasing with main git showed the PR had 231 files changed, which was wrong. So I created a fresh PR. This shows the correct number of files changed.

Collection: [Note which collection this PR will affect]

Changelog

  1. Refactored Inference script
    a) Modularized code
    b) Separated inference and evaluation logic
    c) Separated different configs
    d) Lays the groundwork for longform inference

  2. infer_batch returns dataclass object now

  3. Moved inference code from scripts to examples

  4. Added fixed context embeddings instead of getting it through context encoder, in case checkpoint does not have context encoder

  5. Removed evalset_config.py and replaced with evalset.json

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

subhankar-ghosh and others added 30 commits December 7, 2025 01:19
… example

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Removed multiple long manifest configurations from evalset_config.py.

Signed-off-by: Subhankar Ghosh <subhankarg@nvidia.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
… example

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
@@ -16,24 +16,36 @@
"""
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you do a find and replace inside of this file of all print statements and switch them to logging?

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Comment on lines +101 to +102
Returns:
Dictionary mapping metric names to formatted "mean ± CI" strings.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should now say the map contains tuples

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

self.model_type == 'decoder_ce'
and hasattr(self, 'baked_context_embedding')
and self.baked_context_embedding is not None
and self.baked_context_embedding.numel() > 0
Copy link
Copy Markdown
Collaborator

@rlangman rlangman Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last condition should probably throw an error, as it implies someone tried to load an embedding but did it incorrectly.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has_baked_context_embedding property is used as a check in many places. In cases where the condition should absolutely be true it does throw an error. Hence I think this method just returns if baked_context_embedding is present and properly loaded.

True if baked_context_embedding buffer is set, not None, and has elements.
"""
return (
self.model_type == 'decoder_ce'
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to enforce that baked embedding can only be used with decoder_ce? Its not a big deal since I imagine we will be removing most of the other model_type code paths later. Just as first glance, it looks like the implementation would be easier/simpler if it was not gated by model type.

Comment on lines +547 to +550
# Index into embeddings: (N, T, D) -> (B, T, D)
embeddings = self.baked_context_embedding[indices] # (B, T, D)
lengths = self.baked_context_embedding_len[indices] # (B,)
return embeddings, lengths
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A benefit of using an embedding layer is you could validate the shape of the embedding during model initialization. The value of N, num_baked_speakers, would need to be provided as a buffer from the checkpoint, replacing the current definition of it as an @property

context_input_embedded, context_mask, cond=None, cond_mask=None
)['output']
# Check for baked context embedding first
if self.has_baked_context_embedding:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have this branch decided by whether speaker_indices is None or not. If speaker indices is provided, then use this path, else use the context encoder. An error should be thrown if someone provides neither or both context_audio and speaker_indices.

Copy link
Copy Markdown
Collaborator Author

@subhankar-ghosh subhankar-ghosh Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of N, num_baked_speakers, would need to be provided as a buffer from the checkpoint

Why do we need to have number of baked speakers saved as a buffer? We need to save T and D sure, but I am not sure why we need to save N as a buffer. We can get the num embeddings by embedding_layer.weight.shape[0]
Nonetheless, we are changing the implementation to torch.nn.embedding

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have this branch decided by whether speaker_indices is None or not. If speaker indices is provided, then use this path, else use the context encoder. An error should be thrown if someone provides neither or both context_audio and speaker_indices.

The idea is if baked_embeddings are there the user should only use that and not context encoder. If speaker_indices are not provided and baked_embeddings are present, the default behavior is to use the first speaker embedding. I thought this is what we had decided - baked_embeddings so we do not let users use any other voice.

Copy link
Copy Markdown
Collaborator

@rlangman rlangman Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value of N, num_baked_speakers, would need to be provided as a buffer from the checkpoint

Why do we need to have number of baked speakers saved as a buffer? We need to save T and D sure, but I am not sure why we need to save N as a buffer. We can get the num embeddings by embedding_layer.weight.shape[0]

The request from Jason was to make the parameters a nn.Embedding layer. Creating this layer requires knowing the value of N ahead of time. Like:

num_baked_embeddings = self.register_buffer('num_baked_embeddings ', 0)
self.embedding_layer = torch.nn.Embedding(num_embeddings=num_baked_embeddings, embedding_dim=max_context_len * context_dim, _freeze=True)

This is a clean approach that guarantees the embedding is the correct dimensions when the model is first loaded. Otherwise if the embedding_layer happens to be wrong in the model checkpoint, then you would only know by hitting errors at inference time. Granted this is a pretty minor concern.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have this branch decided by whether speaker_indices is None or not. If speaker indices is provided, then use this path, else use the context encoder. An error should be thrown if someone provides neither or both context_audio and speaker_indices.

The idea is if baked_embeddings are there the user should only use that and not context encoder. If speaker_indices are not provided and baked_embeddings are present, the default behavior is to use the first speaker embedding. I thought this is what we had decided - baked_embeddings so we do not let users use any other voice.

I think we assume, and we should add this as a check to ensure that our assumption is correct, that a model can either have backed context embeddings OR a context encoder. Not both. So the path where speaker_indices is None is ok with me

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this release, there should be some inference path that allows a user to pre-compute and reuse context embeddings. So that when hosting the model, it is not necessary to recompute the same context for every request. It would be nice if that use case and the baked context embeddings could be consolidated, but we can revisit that in the later PR which defines the actual inference functions/APIs.

… print->logging

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
@github-actions github-actions bot removed the Run CICD label Dec 13, 2025
@github-actions
Copy link
Copy Markdown
Contributor

[🤖]: Hi @subhankar-ghosh 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

@blisc blisc merged commit 8df83ac into main Dec 13, 2025
135 checks passed
@blisc blisc deleted the magpietts_opensource branch December 13, 2025 17:39
AkCodes23 pushed a commit to AkCodes23/NeMo that referenced this pull request Jan 28, 2026
* Modularize magpie inference code, move inference code from scripts to example

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Modify magpie CI with inference changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Renaming magpietts inference scripts from magpie to magpietts

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* infer_batch returns dataclass object

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Fixed context embedding without context encoder

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Remove unnecessary configurations

Removed multiple long manifest configurations from evalset_config.py.

Signed-off-by: Subhankar Ghosh <subhankarg@nvidia.com>

* Removing unused imports

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Copilot suggested changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Copilot suggested changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Move inference helper modules from examples to tts collection

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Review changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Changes suggested in compute_mean_with_confidence_interval

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Linting issue

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* support multiple voices - baked context embeddings

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* move evalset_config to json

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Modularize magpie inference code, move inference code from scripts to example

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Modify magpie CI with inference changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Renaming magpietts inference scripts from magpie to magpietts

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* infer_batch returns dataclass object

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Fixed context embedding without context encoder

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Removing unused imports

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Remove unnecessary configurations

Removed multiple long manifest configurations from evalset_config.py.

Signed-off-by: Subhankar Ghosh <subhankarg@nvidia.com>

* Copilot suggested changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Copilot suggested changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Move inference helper modules from examples to tts collection

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Review changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Changes suggested in compute_mean_with_confidence_interval

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Linting issue

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* support multiple voices - baked context embeddings

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* move evalset_config to json

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* simplifying get_baked_context_embeddings_batch

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Fix logging

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Remove unused imports

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* logging changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Changed baked_context_embeddings from tensor to flattened embeddings, print->logging

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Removed comments and print statements

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

---------

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Signed-off-by: Subhankar Ghosh <subhankarg@nvidia.com>
Co-authored-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Signed-off-by: Akhil Varanasi <akhilvaranasi23@gmail.com>
nune-tadevosyan pushed a commit to nune-tadevosyan/NeMo that referenced this pull request Mar 13, 2026
* Modularize magpie inference code, move inference code from scripts to example

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Modify magpie CI with inference changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Renaming magpietts inference scripts from magpie to magpietts

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* infer_batch returns dataclass object

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Fixed context embedding without context encoder

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Remove unnecessary configurations

Removed multiple long manifest configurations from evalset_config.py.

Signed-off-by: Subhankar Ghosh <subhankarg@nvidia.com>

* Removing unused imports

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Copilot suggested changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Copilot suggested changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Move inference helper modules from examples to tts collection

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Review changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Changes suggested in compute_mean_with_confidence_interval

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Linting issue

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* support multiple voices - baked context embeddings

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* move evalset_config to json

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Modularize magpie inference code, move inference code from scripts to example

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Modify magpie CI with inference changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Renaming magpietts inference scripts from magpie to magpietts

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* infer_batch returns dataclass object

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Fixed context embedding without context encoder

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Removing unused imports

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Remove unnecessary configurations

Removed multiple long manifest configurations from evalset_config.py.

Signed-off-by: Subhankar Ghosh <subhankarg@nvidia.com>

* Copilot suggested changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Copilot suggested changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Move inference helper modules from examples to tts collection

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Review changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Changes suggested in compute_mean_with_confidence_interval

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Linting issue

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* support multiple voices - baked context embeddings

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* move evalset_config to json

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* simplifying get_baked_context_embeddings_batch

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Fix logging

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Apply isort and black reformatting

Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>

* Remove unused imports

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* logging changes

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Changed baked_context_embeddings from tensor to flattened embeddings, print->logging

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

* Removed comments and print statements

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>

---------

Signed-off-by: subhankar-ghosh <subhankar2321@gmail.com>
Signed-off-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Signed-off-by: Subhankar Ghosh <subhankarg@nvidia.com>
Co-authored-by: subhankar-ghosh <subhankar-ghosh@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants