MagpieTTS decoder model on top of NeMo main branch#15277

Merged
paarthneekhara merged 30 commits into NVIDIA-NeMo:main from paarthneekhara:magpietts_decoderonly_2601
Apr 1, 2026

Conversation

@paarthneekhara
Collaborator

No description provided.

Comment on lines +42 to +50
from nemo.collections.tts.modules.nemotron_h_decoder import (
HybridMambaAttentionDynamicCache,
NemotronHConfig,
NemotronHForCausalLM,
NemotronHMLP,
NemotronHModel,
NemotronHMOE,
NemotronHTopkRouter,
)

Check notice — Code scanning / CodeQL: Unused import (Note, test)

Import of 'NemotronHMLP' is not used.
Collaborator

@blisc left a comment


Some more comments from WIP review

@shehzeen force-pushed the magpietts_decoderonly_2601 branch from 54d6283 to 06c516f on February 12, 2026 00:12
@github-actions bot added the "core Changes to NeMo Core" label on Feb 17, 2026
@paarthneekhara force-pushed the magpietts_decoderonly_2601 branch from f684fc3 to eeac2ce on March 9, 2026 16:32
@shehzeen force-pushed the magpietts_decoderonly_2601 branch from 81af95a to c8ad57a on March 10, 2026 23:04
@github-actions bot removed the "core Changes to NeMo Core" label on Mar 11, 2026
@paarthneekhara marked this pull request as ready for review on March 14, 2026 19:21

return loss

def validation_step(self, batch, batch_idx):
Collaborator


@XuesongYang Can you take a look at the dataset setup and validation logging, especially since you recently made changes to these inside of the magpie class?

Collaborator


I checked the model class and the Lhotse YAML config, and I can confirm that this model doesn’t currently support multiple validation dataloaders.

Since this PR is already quite large and aggregates a significant number of changes, let’s avoid adding multiple dataloader support here. Instead, we can plan to implement that in a separate PR (similar to #15348).
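For reference, multiple validation dataloader support in Lightning-style models typically dispatches on a `dataloader_idx` argument passed to `validation_step`. A minimal sketch of the deferred feature (class and metric names are illustrative, not from this PR):

```python
class MultiValMixin:
    """Hypothetical sketch: route metrics per validation dataloader so that
    curves from different val sets don't collide in the logger."""

    VAL_SET_NAMES = ["val_clean", "val_other"]  # illustrative dataset names

    def validation_step(self, batch, batch_idx, dataloader_idx=0):
        # Lightning passes dataloader_idx when val_dataloaders is a list.
        name = self.VAL_SET_NAMES[dataloader_idx]
        loss = self.compute_val_loss(batch)
        # Prefix each metric with the dataloader's name.
        return {f"{name}/loss": loss}

    def compute_val_loss(self, batch):
        # Stand-in for the real loss computation.
        return sum(batch) / len(batch)
```

In a real Lightning module, `val_dataloader()` would return a list of dataloaders, one per entry in `VAL_SET_NAMES`.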

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
@paarthneekhara force-pushed the magpietts_decoderonly_2601 branch from 9353e11 to c538c7f on March 22, 2026 23:36
self.use_multilingual_asr = cfg.get('use_multilingual_asr', False)
if self.run_val_inference:
logging.info("Loading eval models for validation inference (ASR and speaker verification)...")
if self.use_multilingual_asr:
Collaborator


Why do we have these imports in the constructor? These imports might cause issues during model loading while doing inference. IMO we should not include evaluation metrics in the class. Keep it separate like we do in MagpieTTS.

Collaborator Author


Moved them above.
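For illustration, another common way to keep heavy evaluation dependencies out of model construction is to resolve them lazily on first use. A sketch, not the NeMo implementation; all names below are hypothetical:

```python
class LazyEvalModels:
    """Hypothetical sketch: evaluation models (ASR, speaker verification)
    are resolved on first access, so constructing or restoring the TTS
    model never triggers their imports or checkpoint loads."""

    def __init__(self, use_multilingual_asr: bool = False):
        self.use_multilingual_asr = use_multilingual_asr
        self._asr = None  # nothing loaded at construction time

    def _load_asr(self):
        # The real code would import and restore an ASR model here;
        # a string sentinel stands in for the loaded model object.
        return "multilingual-asr" if self.use_multilingual_asr else "english-asr"

    @property
    def asr(self):
        if self._asr is None:
            self._asr = self._load_asr()
        return self._asr
```

This keeps inference-time model loading cheap while still making the eval models available during validation.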

'_utmos_calculator',
]

def compute_loss(self, logits, audio_codes, audio_codes_lens):
Collaborator


Is it possible to reuse the compute_loss method from MagpieTTS? I think this is calculating the same loss if we pass frame_stacking_factor=1.

Collaborator Author


I'd rather not reuse it because it uses several self variables. @rlangman suggested deduping code only if the function is reusable across other models as well. I don't think it is very reusable, and it's just additive cross-entropy loss. Also, the function is fairly small (~20 lines), so I am not too inclined towards a shared implementation.
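For readers following along, the additive cross-entropy described here (with frame_stacking_factor == 1) can be sketched in NumPy. Shapes and names are assumptions for illustration, not the PR's actual code:

```python
import numpy as np

def compute_loss(logits, codes, code_lens):
    """Additive cross-entropy over codebooks. Assumed shapes:
      logits:    (B, C, T, V)  per-codebook logits over the code vocabulary
      codes:     (B, C, T)     target code indices
      code_lens: (B,)          number of valid frames per example
    """
    B, C, T, V = logits.shape
    # Log-softmax over the vocabulary axis (numerically stable).
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target code.
    nll = -np.take_along_axis(log_probs, codes[..., None], axis=-1)[..., 0]
    # Mask out padded frames beyond each example's length.
    mask = np.arange(T)[None, :] < code_lens[:, None]  # (B, T)
    nll = nll * mask[:, None, :]
    # Sum over codebooks (additive), average over valid frame/codebook slots.
    return nll.sum() / (mask.sum() * C)
```

With uniform logits the loss reduces to log(V), which is a handy sanity check for this kind of implementation.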

total_phoneme_loss = total_phoneme_loss / self.phoneme_stacking_factor
return total_phoneme_loss, loss_mask

def log_val_audio_example(
Collaborator


This change can be part of a later PR. Can we move log_audio_* and log_plot_* to helper files and reuse them across models? I think we can reuse a lot of these plot and log methods from MagpieTTS.

Collaborator Author


That makes sense to me. I'll make another PR once this is merged to unify the audio/image logging.

if predicted_audio_paths and context_audio_paths:
with torch.no_grad():
# ASR transcription for CER/WER
if self.use_multilingual_asr:
Collaborator


Do we really need CER/WER and SSIM during the validation step? It might make sense for experimentation, but for open-sourcing it does not make sense, especially once we have a working recipe. It makes the val step heavier.

Collaborator Author


I find it quite helpful to track CER/SSIM during training to know if the model is working and to compare different experiments. Val loss comparison often doesn't translate to which experiment is doing well. The overhead of this step is not much, since it is spread across 4 nodes, so fewer than 10 iterations of infer_batch cover the entire val set.
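As a concrete illustration of the CER metric discussed here, a minimal character-level edit-distance implementation (a sketch only; the PR presumably relies on an existing metric utility rather than this):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance between the reference
    text and the ASR hypothesis, normalized by reference length."""
    # Rolling-row dynamic programming over the edit-distance matrix.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(
                prev[j] + 1,            # deletion
                cur[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h)  # substitution (0 if chars match)
            ))
        prev = cur
    return prev[-1] / max(len(reference), 1)
```

Lower is better; identical strings give 0.0, and one substitution in a five-character reference gives 0.2.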

Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
selected_training_mode: Optional[str]


class EasyMagpieTTSModel(EasyMagpieTTSInferenceModel):
Collaborator


Any reason why the inference class is subclassed for pretraining? Aren't we expecting the inference class to subclass the pretraining class?

Collaborator

@XuesongYang left a comment


I generally worked through the overall design: the new class hierarchy (ModelPT -> EasyMagpieTTSInferenceModel -> EasyMagpieTTSModel -> EasyMagpieTTSModelOnlinePO) operates independently from the existing MagpieTTS class.

Because it doesn’t subclass MagpieTTS, any features added to MagpieTTS won’t be inherited without careful cherry-picking, and I already noticed a few manually copy-pasted code blocks. If this implementation is meant to be a backbone for future models rather than a one-off recipe, it is worth rethinking the architecture now to ensure better long-term maintainability.

batch_duration: ??? # recommend to use smaller batch_duration for validation dataset than training dataset.
quadratic_duration: ${quadratic_duration}
use_bucketing: false
force_finite: true
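For context, `???` is OmegaConf's marker for a mandatory value the user must supply. A hedged sketch of the recommendation above (all values are illustrative, not taken from the PR):

```yaml
# Illustrative only: give validation a smaller batch_duration than training.
train_ds:
  batch_duration: 600        # total seconds of audio per training batch
validation_ds:
  batch_duration: 200        # smaller than training, per the comment above
  quadratic_duration: ${quadratic_duration}
  use_bucketing: false
  force_finite: true
```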
Collaborator


@paarthneekhara Let's make these changes.

Collaborator Author


Thanks for pointing these out. I have made these changes.



Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
9 participants