MagpieTTS refactor #15504
Conversation
There was a problem hiding this comment.
Overall looks good! Glad for the code reuse across architectures.
Left some comments.
An overall question - did you test that existing checkpoints are able to be loaded with the updated code? I guess it's not a strict requirement, but if not, we'd need to adapt existing checkpoints e.g. in CI. (But it does look like you considered this, e.g. in the design of the LT helper.)
| return transcripts | ||
|
|
||
|
|
||
| def get_speaker_embeddings_from_filepaths(filepaths, speaker_verification_model, device): |
There was a problem hiding this comment.
In nemo/collections/tts/modules/magpietts_inference/evaluate_generated_audio.py we have a function extract_embedding() - consider combining
There was a problem hiding this comment.
Hmm. Keeping it separate for now - this one is batched, and has resampling going on. The other one is supporting both WavLM and titanet.
There was a problem hiding this comment.
They both do resampling I think (in the other function it's done through the target SR passed to librosa.load()). There's also the padding short signals to avoid crashing the speaker embedding model which we handle in extract_embedding which might be useful here too. But then the interaction with the embedding model itself looks different, this function uses the sv model's forward() whereas extract_audio uses get_embedding (not sure what the differences are). So the common part is the load-resample-pad. Up to you if you want to combine parts or not.
rfejgin
left a comment
There was a problem hiding this comment.
Comments have been addressed, see remaining optional comment (on speaker embedding function).
| return transcripts | ||
|
|
||
|
|
||
| def get_speaker_embeddings_from_filepaths(filepaths, speaker_verification_model, device): |
There was a problem hiding this comment.
They both do resampling I think (in the other function it's done through the target SR passed to librosa.load()). There's also the padding short signals to avoid crashing the speaker embedding model which we handle in extract_embedding which might be useful here too. But then the interaction with the embedding model itself looks different, this function uses the sv model's forward() whereas extract_audio uses get_embedding (not sure what the differences are). So the common part is the load-resample-pad. Up to you if you want to combine parts or not.
2c10060 to
399453d
Compare
7390bcb to
1353aa3
Compare
| This is a plain Python class (not ``nn.Module``) that holds *references* | ||
| to nn.Module sub-modules owned by the parent model. Keeping it non-Module | ||
| preserves checkpoint key compatibility. | ||
|
|
There was a problem hiding this comment.
I am not opposed to trying this approach of deduping code since it is easily reversible, but its not too clear what the benefit is.
Deduping is mainly helpful if you want modifications to this function to automatically back-propagate to the current Magpie model class. If our intention is to put the previous iteration of Magpie into maintenance mode, then it could be better for the new version to use a different implementation of LT in order to make backwards compatibility easier. This is irrelevant if we have no intention of ever cleaning up this logic.
Usually if we refactor something to be reusable, it should be a generic implementation that can be reused in any system. For local transformer, this would mean creating something like a nn.Module with a generic interface that does not rely on knowing implementation details that only make sense for this specific model implementation (e.g. the concept of audio_embeddings table, frame stacking, special EOS/BOS ID tokens, have nothing to do with local transformer). So if we wanted to reuse local transformer in something else like in our codec or S2S models, this would not help us.
There was a problem hiding this comment.
For local transformer, this would mean creating something like a nn.Module
It is a fair point that we should consider making this a nn.Module. Let's make a point of trying to update this in another PR
(e.g. the concept of audio_embeddings table, frame stacking, special EOS/BOS ID tokens, have nothing to do with local transformer).
On this point, I would disagree. The main point of this module is to transform continuous embedding spaces into a stack of audio tokens. I'm not sure if it's worth trying to make it more generic than that. At that point, you might as well just instantiate a transformer module.
| text = text.replace("h t t p", "http") | ||
| text = text.replace("w w w", "www") |
There was a problem hiding this comment.
I was considering doing this for common normalization problems like "mr" and "mrs" (without periods) in English, but I thought it would make more sense to require these text fixes outside of the CER function. Would we want to add that here?
There was a problem hiding this comment.
If we want to correct for ASR transcription errors, it would be useful to make this helper function. Not all scripts should use this helper function.
| text = text.replace("h t t p", "http") | ||
| text = text.replace("w w w", "www") |
There was a problem hiding this comment.
If we want to correct for ASR transcription errors, it would be useful to make this helper function. Not all scripts should use this helper function.
| This is a plain Python class (not ``nn.Module``) that holds *references* | ||
| to nn.Module sub-modules owned by the parent model. Keeping it non-Module | ||
| preserves checkpoint key compatibility. | ||
|
|
There was a problem hiding this comment.
For local transformer, this would mean creating something like a nn.Module
It is a fair point that we should consider making this a nn.Module. Let's make a point of trying to update this in another PR
(e.g. the concept of audio_embeddings table, frame stacking, special EOS/BOS ID tokens, have nothing to do with local transformer).
On this point, I would disagree. The main point of this module is to transform continuous embedding spaces into a stack of audio tokens. I'm not sure if it's worth trying to make it more generic than that. At that point, you might as well just instantiate a transformer module.
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
Signed-off-by: paarthneekhara <paarthneekhara@users.noreply.github.com>
33ef845 to
c83c82f
Compare
| if ( | ||
| self.training | ||
| and batch_size > 1 | ||
| and self.train_shuffle_context_embedding_prob > 0 | ||
| and random.random() < self.train_shuffle_context_embedding_prob | ||
| ): | ||
| shift = random.randint(1, batch_size - 1) | ||
| context_embeddings = context_input_embedded.roll(shift, dims=0) | ||
| context_mask = context_mask.roll(shift, dims=0) |
There was a problem hiding this comment.
@paarthneekhara not sure if this is removed on purpose. Did you try alternative solution to bypass context encoder instead?
This change is mainly done because EasyMagpie will be reusing some shared functionalities with Magpie. So to avoid code duplication, we are moving common things together.
After this, i will raise a separate PR for EasyMagpie changes.