Thank you always for sharing your thoughtful code.
As we can see in the FastPitch code, you add the pitch embedding to the encoder output before passing it to the energy predictor:
    enc_out = enc_out + pitch_emb.transpose(1, 2)
Why did you choose hierarchical variance feature prediction, where the energy predictor sees the pitch-conditioned encoder output, instead of parallel prediction as in FastSpeech 2 (paper version)?
Are there any performance advantages?
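
To make the question concrete, here is a toy sketch of the two orderings as I understand them. It is not the FastPitch or FastSpeech 2 code: each frame is reduced to a single float, and `predict_pitch`, `predict_energy`, and `embed` are hypothetical stand-ins for the learned modules. Only the data flow is meant to match.

```python
# Toy sketch only: one scalar per frame, hypothetical stand-in modules.

def predict_pitch(hidden):
    # Stand-in for a learned pitch predictor.
    return [0.5 * h for h in hidden]

def predict_energy(hidden):
    # Stand-in for a learned energy predictor.
    return [h + 1.0 for h in hidden]

def embed(values, scale):
    # Stand-in for a learned embedding layer.
    return [scale * v for v in values]

def parallel_variance(enc_out):
    """Both predictors read the same, unmodified encoder output."""
    pitch = predict_pitch(enc_out)
    energy = predict_energy(enc_out)
    pitch_emb = embed(pitch, 2.0)
    energy_emb = embed(energy, 3.0)
    out = [e + p + g for e, p, g in zip(enc_out, pitch_emb, energy_emb)]
    return pitch, energy, out

def hierarchical_variance(enc_out):
    """The energy predictor reads the encoder output after the pitch
    embedding has been added, so energy is conditioned on pitch."""
    pitch = predict_pitch(enc_out)
    enc_out = [e + p for e, p in zip(enc_out, embed(pitch, 2.0))]
    energy = predict_energy(enc_out)
    out = [e + g for e, g in zip(enc_out, embed(energy, 3.0))]
    return pitch, energy, out

frames = [1.0, 2.0]
pitch_par, energy_par, out_par = parallel_variance(frames)
pitch_hier, energy_hier, out_hier = hierarchical_variance(frames)
```

In this sketch the pitch predictions are identical in both variants, while the energy predictions differ because only the hierarchical variant feeds pitch information into the energy predictor.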