Skip to content

MagpieTTS refactor #15504

Merged
blisc merged 2 commits into
NVIDIA-NeMo:mainfrom
paarthneekhara:magpietts_refactor_pr
Mar 21, 2026
Merged

MagpieTTS refactor #15504
blisc merged 2 commits into
NVIDIA-NeMo:mainfrom
paarthneekhara:magpietts_refactor_pr

Conversation

@paarthneekhara
Copy link
Copy Markdown
Collaborator

This change is mainly done because EasyMagpie will be reusing some shared functionalities with Magpie. So to avoid code duplication, we are moving common things together.

After this, i will raise a separate PR for EasyMagpie changes.

@github-actions github-actions Bot added the TTS label Mar 16, 2026
Comment thread nemo/collections/tts/models/magpietts.py
Comment thread nemo/collections/tts/models/magpietts.py
Comment thread nemo/collections/tts/modules/magpietts_modules.py
Comment thread nemo/collections/tts/modules/magpietts_modules.py
Comment thread nemo/collections/tts/parts/utils/helpers.py Fixed
Comment thread nemo/collections/tts/parts/utils/helpers.py Fixed
Comment thread nemo/collections/tts/parts/utils/helpers.py Fixed
Copy link
Copy Markdown
Collaborator

@rfejgin rfejgin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good! Glad for the code reuse across architectures.

Left some comments.

An overall question - did you test that existing checkpoints are able to be loaded with the updated code? I guess it's not a strict requirement, but if not, we'd need to adapt existing checkpoints e.g. in CI. (But it does look like you considered this, e.g. in the design of the LT helper.)

Comment thread nemo/collections/tts/parts/utils/helpers.py Outdated
Comment thread nemo/collections/tts/parts/utils/helpers.py Outdated
Comment thread nemo/collections/tts/models/magpietts.py
Comment thread nemo/collections/tts/parts/utils/helpers.py
return transcripts


def get_speaker_embeddings_from_filepaths(filepaths, speaker_verification_model, device):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In nemo/collections/tts/modules/magpietts_inference/evaluate_generated_audio.py we have a function extract_embedding() - consider combining

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Keeping it separate for now - this one is batched, and has resampling going on. The other one is supporting both WavLM and titanet.

Copy link
Copy Markdown
Collaborator

@rfejgin rfejgin Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They both do resampling I think (in the other function it's done through the target SR passed to librosa.load()). There's also the padding short signals to avoid crashing the speaker embedding model which we handle in extract_embedding which might be useful here too. But then the interaction with the embedding model itself looks different, this function uses the sv model's forward() whereas extract_audio uses get_embedding (not sure what the differences are). So the common part is the load-resample-pad. Up to you if you want to combine parts or not.

Comment thread nemo/collections/tts/modules/magpietts_modules.py
rfejgin
rfejgin previously approved these changes Mar 18, 2026
Copy link
Copy Markdown
Collaborator

@rfejgin rfejgin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments have been addressed, see remaining optional comment (on speaker embedding function).

return transcripts


def get_speaker_embeddings_from_filepaths(filepaths, speaker_verification_model, device):
Copy link
Copy Markdown
Collaborator

@rfejgin rfejgin Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They both do resampling I think (in the other function it's done through the target SR passed to librosa.load()). There's also the padding short signals to avoid crashing the speaker embedding model which we handle in extract_embedding which might be useful here too. But then the interaction with the embedding model itself looks different, this function uses the sv model's forward() whereas extract_audio uses get_embedding (not sure what the differences are). So the common part is the load-resample-pad. Up to you if you want to combine parts or not.

Comment thread nemo/collections/tts/modules/magpietts_modules.py
Comment thread nemo/collections/tts/parts/utils/helpers.py Outdated
Comment thread nemo/collections/tts/parts/utils/helpers.py Outdated
@github-actions github-actions Bot added core Changes to NeMo Core CI labels Mar 19, 2026
@paarthneekhara paarthneekhara force-pushed the magpietts_refactor_pr branch from 7390bcb to 1353aa3 Compare March 19, 2026 16:27
@github-actions github-actions Bot removed core Changes to NeMo Core CI labels Mar 19, 2026
This is a plain Python class (not ``nn.Module``) that holds *references*
to nn.Module sub-modules owned by the parent model. Keeping it non-Module
preserves checkpoint key compatibility.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed to trying this approach of deduping code since it is easily reversible, but its not too clear what the benefit is.

Deduping is mainly helpful if you want modifications to this function to automatically back-propagate to the current Magpie model class. If our intention is to put the previous iteration of Magpie into maintenance mode, then it could be better for the new version to use a different implementation of LT in order to make backwards compatibility easier. This is irrelevant if we have no intention of ever cleaning up this logic.

Usually if we refactor something to be reusable, it should be a generic implementation that can be reused in any system. For local transformer, this would mean creating something like a nn.Module with a generic interface that does not rely on knowing implementation details that only make sense for this specific model implementation (e.g. the concept of audio_embeddings table, frame stacking, special EOS/BOS ID tokens, have nothing to do with local transformer). So if we wanted to reuse local transformer in something else like in our codec or S2S models, this would not help us.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For local transformer, this would mean creating something like a nn.Module

It is a fair point that we should consider making this a nn.Module. Let's make a point of trying to update this in another PR

(e.g. the concept of audio_embeddings table, frame stacking, special EOS/BOS ID tokens, have nothing to do with local transformer).

On this point, I would disagree. The main point of this module is to transform continuous embedding spaces into a stack of audio tokens. I'm not sure if it's worth trying to make it more generic than that. At that point, you might as well just instantiate a transformer module.

Comment on lines +820 to +821
text = text.replace("h t t p", "http")
text = text.replace("w w w", "www")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering doing this for common normalization problems like "mr" and "mrs" (without periods) in English, but I thought it would make more sense to require these text fixes outside of the CER function. Would we want to add that here?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to correct for ASR transcription errors, it would be useful to make this helper function. Not all scripts should use this helper function.

Copy link
Copy Markdown
Collaborator

@blisc blisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved from my end

Comment on lines +820 to +821
text = text.replace("h t t p", "http")
text = text.replace("w w w", "www")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to correct for ASR transcription errors, it would be useful to make this helper function. Not all scripts should use this helper function.

This is a plain Python class (not ``nn.Module``) that holds *references*
to nn.Module sub-modules owned by the parent model. Keeping it non-Module
preserves checkpoint key compatibility.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For local transformer, this would mean creating something like a nn.Module

It is a fair point that we should consider making this a nn.Module. Let's make a point of trying to update this in another PR

(e.g. the concept of audio_embeddings table, frame stacking, special EOS/BOS ID tokens, have nothing to do with local transformer).

On this point, I would disagree. The main point of this module is to transform continuous embedding spaces into a stack of audio tokens. I'm not sure if it's worth trying to make it more generic than that. At that point, you might as well just instantiate a transformer module.

paarthneekhara and others added 2 commits March 20, 2026 14:41
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Signed-off-by: paarthneekhara <paarthneekhara@users.noreply.github.com>
@paarthneekhara paarthneekhara force-pushed the magpietts_refactor_pr branch from 33ef845 to c83c82f Compare March 20, 2026 18:42
@blisc blisc enabled auto-merge (squash) March 20, 2026 20:47
@blisc blisc merged commit b3e8d2d into NVIDIA-NeMo:main Mar 21, 2026
131 checks passed
Comment on lines -2311 to -2319
if (
self.training
and batch_size > 1
and self.train_shuffle_context_embedding_prob > 0
and random.random() < self.train_shuffle_context_embedding_prob
):
shift = random.randint(1, batch_size - 1)
context_embeddings = context_input_embedded.roll(shift, dims=0)
context_mask = context_mask.roll(shift, dims=0)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paarthneekhara not sure if this is removed on purpose. Did you try alternative solution to bypass context encoder instead?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants