MagpieTTS refactor by paarthneekhara · Pull Request #15504 · NVIDIA-NeMo/NeMo

paarthneekhara · 2026-03-16T21:12:54Z

This change is needed mainly because EasyMagpie will reuse functionality shared with Magpie. To avoid code duplication, we are moving the common code together.

After this, I will raise a separate PR for the EasyMagpie changes.

rfejgin

Overall looks good! Glad for the code reuse across architectures.

Left some comments.

An overall question - did you test that existing checkpoints are able to be loaded with the updated code? I guess it's not a strict requirement, but if not, we'd need to adapt existing checkpoints e.g. in CI. (But it does look like you considered this, e.g. in the design of the LT helper.)
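A minimal sketch of the kind of compatibility check the question above is asking about, assuming the key names of an existing checkpoint and of the refactored model are both available as plain lists of strings. The function name and the example keys are hypothetical and are not taken from the PR. The two lists it returns correspond to the missing and unexpected keys that PyTorch's `load_state_dict` reports.

```python
def diff_state_dict_keys(checkpoint_keys, model_keys):
    """Return (missing, unexpected) key lists, sorted for stable output.

    missing: keys the refactored model expects but the checkpoint lacks.
    unexpected: keys the checkpoint has but the refactored model no longer uses.
    """
    checkpoint_keys = set(checkpoint_keys)
    model_keys = set(model_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical key names, for illustration only.
old_checkpoint = ["decoder.weight", "local_transformer.layers.0.weight"]
same_layout = ["decoder.weight", "local_transformer.layers.0.weight"]
moved_layout = ["decoder.weight", "lt_helper.local_transformer.layers.0.weight"]

# Identical layouts: nothing missing, nothing unexpected.
print(diff_state_dict_keys(old_checkpoint, same_layout))
# A moved submodule shows up as one missing key and one unexpected key.
print(diff_state_dict_keys(old_checkpoint, moved_layout))
```

If both lists come back empty for every checkpoint used in CI, the refactor has not changed the key layout.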

rfejgin · 2026-03-17T23:02:31Z

+    return transcripts
+
+
+def get_speaker_embeddings_from_filepaths(filepaths, speaker_verification_model, device):


In nemo/collections/tts/modules/magpietts_inference/evaluate_generated_audio.py we have a function extract_embedding(). Consider combining the two.

Hmm. Keeping it separate for now: this one is batched and does resampling. The other one supports both WavLM and TitaNet.

rfejgin

Comments have been addressed, see remaining optional comment (on speaker embedding function).

rfejgin · 2026-03-17T23:54:07Z

+    return transcripts
+
+
+def get_speaker_embeddings_from_filepaths(filepaths, speaker_verification_model, device):


They both do resampling, I think (in the other function it is done through the target SR passed to librosa.load()). extract_embedding also pads short signals to avoid crashing the speaker embedding model, which might be useful here too. But the interaction with the embedding model itself looks different: this function uses the SV model's forward(), whereas extract_embedding uses get_embedding (not sure what the differences are). So the common part is load-resample-pad. Up to you whether you want to combine parts or not.
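A rough sketch of the "pad" part of the shared load-resample-pad step, assuming signals are already loaded and resampled and are represented as plain lists of floats. `pad_batch` and `min_len` are hypothetical names; the actual minimum length a speaker embedding model tolerates is not stated in this thread.

```python
def pad_batch(signals, min_len, pad_value=0.0):
    """Right-pad variable-length signals to a common length.

    Returns (padded, lengths). The common length is at least `min_len`, so
    very short clips are never passed on below that size. `lengths` keeps the
    original sample counts so a model can mask out the padding.
    """
    lengths = [len(s) for s in signals]
    target = max([min_len] + lengths)
    padded = [list(s) + [pad_value] * (target - len(s)) for s in signals]
    return padded, lengths


padded, lengths = pad_batch([[0.1, 0.2], [0.3]], min_len=4)
print(padded)   # [[0.1, 0.2, 0.0, 0.0], [0.3, 0.0, 0.0, 0.0]]
print(lengths)  # [2, 1]
```

A batched function and a single-file function could both call a helper like this while keeping their different calls into the embedding model.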

rlangman · 2026-03-19T18:30:49Z

+    This is a plain Python class (not ``nn.Module``) that holds *references*
+    to nn.Module sub-modules owned by the parent model.  Keeping it non-Module
+    preserves checkpoint key compatibility.
+


I am not opposed to trying this approach of deduping code since it is easily reversible, but it's not too clear what the benefit is.

Deduping is mainly helpful if you want modifications to this function to automatically propagate back to the current Magpie model class. If our intention is to put the previous iteration of Magpie into maintenance mode, then it could be better for the new version to use a different implementation of LT in order to make backwards compatibility easier. This is irrelevant if we have no intention of ever cleaning up this logic.

Usually, if we refactor something to be reusable, it should be a generic implementation that can be reused in any system. For the local transformer, this would mean creating something like an nn.Module with a generic interface that does not rely on implementation details that only make sense for this specific model (e.g. the concepts of an audio_embeddings table, frame stacking, and special EOS/BOS ID tokens have nothing to do with a local transformer). So if we wanted to reuse the local transformer in something else, such as our codec or S2S models, this would not help us.

rlangman · 2026-03-19T18:35:23Z

+    text = text.replace("h t t p", "http")
+    text = text.replace("w w w", "www")


I was considering doing this for common normalization problems like "mr" and "mrs" (without periods) in English, but I thought it would make more sense to require these text fixes outside of the CER function. Would we want to add that here?

blisc

Approved from my end

blisc · 2026-03-19T20:10:57Z

+    text = text.replace("h t t p", "http")
+    text = text.replace("w w w", "www")


If we want to correct for ASR transcription errors, it would be useful to make this a helper function. Not all scripts should use this helper function.
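One way the opt-in helper suggested above could look, as a sketch only. The function and constant names are hypothetical; the two replacement pairs are the ones from the diff hunk. Keeping the fixes in a table that callers pass in explicitly lets the CER function stay free of normalization, so scripts that do not want these corrections simply never call the helper.

```python
# Replacement pairs taken from the diff hunk above; the names are hypothetical.
ASR_SPELLING_FIXES = (
    ("h t t p", "http"),
    ("w w w", "www"),
)


def fix_spelled_out_tokens(text, fixes=ASR_SPELLING_FIXES):
    """Join tokens that an ASR model transcribed letter by letter."""
    for spelled, joined in fixes:
        text = text.replace(spelled, joined)
    return text


print(fix_spelled_out_tokens("visit w w w dot example dot com"))
# visit www dot example dot com
```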

blisc · 2026-03-19T20:14:02Z

+    This is a plain Python class (not ``nn.Module``) that holds *references*
+    to nn.Module sub-modules owned by the parent model.  Keeping it non-Module
+    preserves checkpoint key compatibility.
+


> For local transformer, this would mean creating something like a nn.Module

It is a fair point that we should consider making this an nn.Module. Let's make a point of updating this in another PR.

> (e.g. the concept of audio_embeddings table, frame stacking, special EOS/BOS ID tokens, have nothing to do with local transformer).

On this point, I would disagree. The main point of this module is to transform continuous embedding spaces into a stack of audio tokens. I'm not sure it's worth trying to make it more generic than that. At that point, you might as well just instantiate a transformer module.
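The docstring's claim that a non-Module helper preserves checkpoint keys can be illustrated with a toy stand-in for module registration. This is not PyTorch and not the PR's code: `ToyModule`, `PlainHelper`, `ModuleHelper`, and `lt_helper` are hypothetical names, and the sketch only mimics the rule that a module assigned as an attribute of another module is registered under that attribute name.

```python
class ToyModule:
    """Toy stand-in for nn.Module: ToyModule attributes are registered as children."""

    def __init__(self, params=()):
        object.__setattr__(self, "_children", {})
        self._params = list(params)

    def __setattr__(self, name, value):
        if isinstance(value, ToyModule):
            self._children[name] = value
        object.__setattr__(self, name, value)

    def state_dict_keys(self, prefix=""):
        keys = [prefix + p for p in self._params]
        for name, child in self._children.items():
            keys += child.state_dict_keys(prefix + name + ".")
        return keys


class PlainHelper:
    """Plain class: holds a reference only, so nothing new is registered."""

    def __init__(self, local_transformer):
        self.local_transformer = local_transformer


class ModuleHelper(ToyModule):
    """Module-style helper: the shared submodule is registered a second time."""

    def __init__(self, local_transformer):
        super().__init__()
        self.local_transformer = local_transformer


def build_parent(helper_cls):
    parent = ToyModule()
    parent.local_transformer = ToyModule(params=["weight"])
    parent.lt_helper = helper_cls(parent.local_transformer)
    return parent


print(build_parent(PlainHelper).state_dict_keys())
# ['local_transformer.weight']
print(build_parent(ModuleHelper).state_dict_keys())
# ['local_transformer.weight', 'lt_helper.local_transformer.weight']
```

Under this toy model, the plain helper leaves the key set unchanged, while the module-style helper introduces an extra prefixed key that older checkpoints would not contain. Moving to an nn.Module in a follow-up PR would therefore presumably need either key remapping or ownership of the submodules by the helper alone.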

XuesongYang · 2026-03-25T04:39:44Z

-                if (
-                    self.training
-                    and batch_size > 1
-                    and self.train_shuffle_context_embedding_prob > 0
-                    and random.random() < self.train_shuffle_context_embedding_prob
-                ):
-                    shift = random.randint(1, batch_size - 1)
-                    context_embeddings = context_input_embedded.roll(shift, dims=0)
-                    context_mask = context_mask.roll(shift, dims=0)


@paarthneekhara Not sure whether this was removed on purpose. Did you try an alternative solution that bypasses the context encoder instead?
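For readers without the surrounding file, here is a list-based sketch of what the removed lines appear to do: with some probability during training, cyclically shift the context along the batch dimension so that each item is paired with another item's context. Plain Python lists stand in for tensors, and `roll_batch` and `maybe_shuffle_context` are hypothetical names. Because the shift is drawn from 1 to batch_size - 1, no item keeps its own context when the shuffle fires.

```python
import random


def roll_batch(items, shift):
    """Cyclically shift a batch, mirroring Tensor.roll(shift, dims=0) on a list."""
    if not items:
        return []
    shift %= len(items)
    return items[-shift:] + items[:-shift] if shift else list(items)


def maybe_shuffle_context(context, prob, training=True, rng=random):
    """Roll the batch with probability `prob`; otherwise return it unchanged."""
    batch_size = len(context)
    if training and batch_size > 1 and prob > 0 and rng.random() < prob:
        shift = rng.randint(1, batch_size - 1)  # never 0, never a full cycle
        return roll_batch(context, shift)
    return list(context)


print(roll_batch(["ctx_a", "ctx_b", "ctx_c"], 1))
# ['ctx_c', 'ctx_a', 'ctx_b']
```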

github-actions Bot added the TTS label Mar 16, 2026

blisc reviewed Mar 16, 2026

Comment thread nemo/collections/tts/models/magpietts.py

Comment thread nemo/collections/tts/models/magpietts.py

Comment thread nemo/collections/tts/modules/magpietts_modules.py

Comment thread nemo/collections/tts/modules/magpietts_modules.py

paarthneekhara requested review from rfejgin and rlangman March 17, 2026 17:44

blisc added the Run CICD label Mar 17, 2026

blisc had a problem deploying to test March 17, 2026 20:07 — with GitHub Actions Error

chtruong814 added Run CICD and removed Run CICD labels Mar 17, 2026

github-advanced-security AI found potential problems Mar 17, 2026

Comment thread nemo/collections/tts/parts/utils/helpers.py Fixed

Comment thread nemo/collections/tts/parts/utils/helpers.py Fixed

Comment thread nemo/collections/tts/parts/utils/helpers.py Fixed

chtruong814 added Run CICD and removed Run CICD labels Mar 17, 2026

chtruong814 had a problem deploying to test March 17, 2026 21:36 — with GitHub Actions Error

rfejgin requested changes Mar 17, 2026

chtruong814 added Run CICD and removed Run CICD labels Mar 17, 2026

chtruong814 temporarily deployed to test March 17, 2026 23:24 — with GitHub Actions Inactive

rfejgin previously approved these changes Mar 18, 2026

paarthneekhara dismissed rfejgin’s stale review via 48d9a4c March 18, 2026 18:18

chtruong814 added Run CICD and removed Run CICD labels Mar 18, 2026

chtruong814 had a problem deploying to test March 18, 2026 18:20 — with GitHub Actions Error

chtruong814 added Run CICD and removed Run CICD labels Mar 18, 2026

chtruong814 temporarily deployed to test March 18, 2026 18:38 — with GitHub Actions Inactive

chtruong814 added Run CICD and removed Run CICD labels Mar 19, 2026

chtruong814 temporarily deployed to test March 19, 2026 09:17 — with GitHub Actions Inactive

paarthneekhara force-pushed the magpietts_refactor_pr branch from 2c10060 to 399453d Compare March 19, 2026 16:01

github-actions Bot added core Changes to NeMo Core CI labels Mar 19, 2026

chtruong814 added Run CICD and removed Run CICD labels Mar 19, 2026

paarthneekhara force-pushed the magpietts_refactor_pr branch from 7390bcb to 1353aa3 Compare March 19, 2026 16:27

github-actions Bot removed core Changes to NeMo Core CI labels Mar 19, 2026

chtruong814 added Run CICD and removed Run CICD labels Mar 19, 2026

chtruong814 temporarily deployed to test March 19, 2026 16:30 — with GitHub Actions Inactive

rlangman reviewed Mar 19, 2026

blisc approved these changes Mar 19, 2026

rlangman approved these changes Mar 20, 2026

chtruong814 added Run CICD and removed Run CICD labels Mar 20, 2026

chtruong814 had a problem deploying to test March 20, 2026 18:14 — with GitHub Actions Error

paarthneekhara and others added 2 commits March 20, 2026 14:41

clean changes

d88f34b

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>

Apply isort and black reformatting

c83c82f

Signed-off-by: paarthneekhara <paarthneekhara@users.noreply.github.com>

paarthneekhara force-pushed the magpietts_refactor_pr branch from 33ef845 to c83c82f Compare March 20, 2026 18:42

chtruong814 added Run CICD and removed Run CICD labels Mar 20, 2026

chtruong814 temporarily deployed to test March 20, 2026 18:44 — with GitHub Actions Inactive

blisc enabled auto-merge (squash) March 20, 2026 20:47

blisc merged commit b3e8d2d into NVIDIA-NeMo:main Mar 21, 2026
131 checks passed

XuesongYang reviewed Mar 25, 2026
