[FastPitch] Why do you hierarchically predict the variance features (pitch and energy)? #1357

Thank you, as always, for sharing such carefully written code.

As the FastPitch code shows, the pitch embedding is added to the encoder output before that output is passed to the energy predictor.

https://github.com/NVIDIA/DeepLearningExamples/blob/da7e1a701bd44885c5537afa7974be391f82401e/PyTorch/SpeechSynthesis/FastPitch/fastpitch/model.py#L300

Why did you choose hierarchical variance feature prediction (energy predicted from pitch-conditioned features) instead of parallel prediction, as in FastSpeech 2 (paper version)?
Does the hierarchical ordering offer any performance advantages?
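
To make the question concrete, here is a minimal, framework-free sketch of the two orderings I mean. It is not the FastPitch implementation: every function name below (`pitch_predictor`, `energy_predictor`, `embed`, `hierarchical`, `parallel`) is a hypothetical stand-in, the "predictors" are fixed arithmetic rather than learned layers, and the parallel variant reflects only my reading of the FastSpeech 2 paper.

```python
# Toy sketch in plain Python; all names are hypothetical, not FastPitch code.
# Sequences are plain lists of floats, one value per encoder position.

def add(a, b):
    """Element-wise sum of two equal-length sequences."""
    return [x + y for x, y in zip(a, b)]

def pitch_predictor(h):
    # Stand-in for a learned pitch predictor.
    return [0.5 * x for x in h]

def energy_predictor(h):
    # Stand-in for a learned energy predictor.
    return [0.1 * x + 1.0 for x in h]

def embed(v):
    # Stand-in for the embedding that maps a predicted value back
    # into the hidden space so it can be added to the encoder output.
    return [2.0 * x for x in v]

def hierarchical(enc_out):
    """Energy is predicted from features that already contain pitch."""
    pitch = pitch_predictor(enc_out)
    h = add(enc_out, embed(pitch))      # pitch embedding added first
    energy = energy_predictor(h)        # energy predictor sees pitch info
    h = add(h, embed(energy))
    return pitch, energy, h

def parallel(enc_out):
    """Pitch and energy are both predicted from the raw encoder output."""
    pitch = pitch_predictor(enc_out)
    energy = energy_predictor(enc_out)  # independent of predicted pitch
    h = add(add(enc_out, embed(pitch)), embed(energy))
    return pitch, energy, h

if __name__ == "__main__":
    enc_out = [1.0, 2.0]
    print("hierarchical:", hierarchical(enc_out))
    print("parallel:    ", parallel(enc_out))
```

In both variants the pitch prediction is identical; only the input to the energy predictor differs, which is the design choice I am asking about.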
