
Cannot change speaker for interpolation #35

Open
DamienToomey opened this issue Jun 18, 2020 · 5 comments

Comments

@DamienToomey

Hello,

I am trying to interpolate between two speakers. I am using the model pretrained on LibriTTS.

I have read the issue "How is interpolation between speakers performed?" #33 but I still cannot manage to make it work.

Here are the steps I have followed (a rough code sketch is given after this list):

  • set gate_threshold = 1 (as mentioned in How is interpolation between speakers performed? #33)
  • set `dummy_speaker_embedding = True` in config.json, since the paper says: "For the experiment without speaker embeddings we interpolate between Sally and Helen using the phrase 'We are testing this model.'"
  • removed the seeding calls torch.manual_seed(seed) and torch.cuda.manual_seed(seed) from inference.py
  • sample z_1 ∼ N(0, 0.5) (as in the paper)
  • sample z_2 ∼ N(0, 0.5) (as in the paper)
  • interpolate between z_1 and z_2
  • reset gate_threshold = 0.5
  • run model.infer
  • run waveglow.infer
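
Roughly, this is what I am doing (a sketch, not the repo's exact inference.py; the model.infer / waveglow.infer signatures, the latent shape, and the sigma handling below are my assumptions):

```python
import torch

sigma = 0.5          # scale used to sample z, as in the paper's N(0, 0.5)
n_frames = 400       # assumed maximum number of mel frames

# sample the two latents
z_1 = torch.randn(1, 80, n_frames).cuda() * sigma
z_2 = torch.randn(1, 80, n_frames).cuda() * sigma

# linear interpolation between the two latents
alpha = 0.5
z = alpha * z_1 + (1 - alpha) * z_2

with torch.no_grad():
    # `speaker_vecs` and `text` are prepared as in inference.py
    mels, attentions = model.infer(z, speaker_vecs, text)   # assumed signature
    audio = waveglow.infer(mels, sigma=0.8)                 # assumed signature
```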

But even when I sample z_1 and z_2 multiple times (and they do take different values), after generating the spectrogram with the pretrained Flowtron and the audio with the pretrained WaveGlow, the speaker sounds the same; only the audio quality seems to vary.

  • Could you tell me which of the above steps I have done wrong or if I have forgotten any steps?
  • Once I have found z_1 and z_2 that I want to interpolate, do I have to reset gate_threshold = 0.5 before interpolation?
  • Why did we have to set gate_threshold = 1 in the first place when looking for z_1 and z_2?

Thanks

@rafaelvalle
Contributor

You need to make sure z_1 and z_2 produce samples from different speakers.
Sample z_1 once, perform inference and memorize the speaker's voice.
Keep sampling z_2, performing inference and listening to the samples produced with z_2 until the speaker you hear is different from the speaker produced with z_1.
You can interpolate once you have z_1 and z_2 values associated with different speakers.
It is safer to leave gate_threshold = 1 and prune the audio later.
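
A minimal sketch of this search, assuming a hypothetical synthesize(z) helper that wraps the Flowtron + WaveGlow calls and returns a NumPy waveform (the sample rate 22050 is also an assumption):

```python
import torch
from scipy.io.wavfile import write

sigma = 0.5
z_1 = torch.randn(1, 80, 400).cuda() * sigma
write("candidate_z1.wav", 22050, synthesize(z_1))   # memorize this voice

# audition several z_2 candidates; keep the one whose voice differs from z_1
for i in range(10):
    z_2 = torch.randn(1, 80, 400).cuda() * sigma
    write(f"candidate_z2_{i}.wav", 22050, synthesize(z_2))
    torch.save(z_2, f"z2_{i}.pt")   # save the latent so it can be reused for interpolation
```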

@DamienToomey
Author

I have also set model_config['dummy_speaker_embedding'] = True.

I keep sampling z_2, performing inference, and listening to the samples produced with z_2, but the speaker's voice sounds the same as the voice produced with z_1. By the way, it is always a female voice. Do you have any idea why this might be happening?

@rafaelvalle
Contributor

Are you using the LibriTTS model?

@DamienToomey
Author

Yes, I am using the LibriTTS model.

@rafaelvalle
Contributor

Hey Damien, the pre-trained LibriTTS model available in our repo has speaker embeddings.

You need to train a model without speaker embeddings, i.e. model_config['dummy_speaker_embedding'] = True, to be able to interpolate between speakers in the latent space.

You can warm-start from the pre-trained LibriTTS model with speaker embeddings.
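
A minimal sketch of flipping that flag in config.json before retraining; the nesting of the key under "model_config" is an assumption, so check the repo's config.json (warm-starting itself is done through train.py as described in the README):

```python
import json

# enable dummy speaker embeddings before (re)training
with open("config.json") as f:
    config = json.load(f)

config["model_config"]["dummy_speaker_embedding"] = True   # assumed key nesting

with open("config.json", "w") as f:
    json.dump(config, f, indent=4)
```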
