384 repeatable voice cloning #432

Merged
17 commits merged into from Jul 22, 2020

@ghost ghost commented Jul 19, 2020

See #384. This PR adds a "--seed" option to make the output of the toolbox repeatable. It also implements a workaround for #53 by adding an option to trim silences in the vocoder output (caused by gaps in the spectrograms created during synthesizing).
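As a rough illustration of what a `--seed` option involves, here is a minimal sketch using only the Python standard library. The `--seed` flag and its help text are taken from this PR's `demo_cli.py` diff; `seed_everything` and `draw_samples` are hypothetical names, and a real pipeline would presumably also need to seed whichever numerical frameworks the synthesizer and vocoder rely on.

```python
import argparse
import random


def build_parser():
    parser = argparse.ArgumentParser()
    # Same flag as in this PR's demo_cli.py diff.
    parser.add_argument("--seed", type=int, default=None,
                        help="Optional random number seed value for repeatable output.")
    return parser


def seed_everything(seed):
    """Hypothetical helper: seed the RNGs in use, or do nothing when seed is None."""
    if seed is not None:
        random.seed(seed)
        # Assumption: a real pipeline would also seed its numerical
        # frameworks here (for example NumPy or the deep learning backend).


def draw_samples(seed, n=3):
    """Stand-in for a stochastic synthesis step."""
    seed_everything(seed)
    return [random.random() for _ in range(n)]


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--seed", "42"])
first = draw_samples(args.seed)
second = draw_samples(args.seed)  # identical to `first` because the seed is reused
```

Leaving `--seed` out keeps the default of `None`, in which case nothing is seeded and each run draws different values.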

demo_toolbox.py Outdated
@ghost ghost requested a review from CorentinJ July 19, 2020 18:06
@ghost ghost marked this pull request as draft July 19, 2020 23:30
@ghost
Author

ghost commented Jul 19, 2020

Making some more improvements; will mark as ready when complete.

@CorentinJ
Owner

CorentinJ commented Jul 20, 2020

Ensuring that the dropout in the prenet is set to inference mode at inference time would work too; it's the only source of randomness in Tacotron.
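To illustrate why switching dropout to inference mode removes the randomness, here is a toy dropout in plain Python. It is not the repository's prenet code; `dropout` and its parameters are illustrative names chosen for this sketch.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of floats.

    In training mode each value is zeroed with probability `rate` and the
    survivors are rescaled; in inference mode the input passes through unchanged.
    """
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [0.5, 1.0, 1.5, 2.0]
rng = random.Random()  # deliberately unseeded

# Inference mode never consults the RNG, so the output is always the same.
out_a = dropout(activations, rate=0.5, training=False, rng=rng)
out_b = dropout(activations, rate=0.5, training=False, rng=rng)

# Training mode consults the RNG, so it only repeats when the seed is fixed.
train_a = dropout(activations, rate=0.5, training=True, rng=random.Random(0))
train_b = dropout(activations, rate=0.5, training=True, rng=random.Random(0))
```

With `training=False` the output no longer depends on the generator state at all, whereas with `training=True` the result only repeats when the seed is controlled.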

@ghost
Author

ghost commented Jul 20, 2020

Ensuring that the dropout in the prenet is set to inference mode at inference time would work too; it's the only source of randomness in Tacotron.

Thanks for the info, I'll implement it and test it out. Much better than the brute force approach.

Edit: Will this suggestion make the synthesizer output unaffected by the state of the random number generator?

Because this tacotron is liable to produce gaps in the output (#53), I think it is preferable to keep the randomness, and allow for controlling it by setting the seed. When using the toolbox I repeatedly click "synthesize only" until a spectrogram with no large gaps appears. Vocoding is reserved for good spectrograms, especially since it is very slow with CPU inference.

@ghost
Author

ghost commented Jul 20, 2020

I have performed additional experimentation to identify the minimum change needed for repeatability. Ready for review.

Some thoughts:

  1. Tacotron's randomness is a feature that is useful for fixing the large gaps that it sometimes creates. I find it useful to control the synthesizer output by adjusting the seed.
  2. It would be nice to get repeatable output without reloading the synthesizer and vocoder models on every use, but the current approach works.
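The manual workflow of re-synthesizing until no large gap appears could in principle be automated by sweeping over seeds. The sketch below assumes a per-frame energy representation and uses a hypothetical `toy_synthesize` stand-in instead of the real synthesizer; the threshold and gap values are arbitrary.

```python
import random


def longest_quiet_run(frame_energies, threshold):
    """Length of the longest run of consecutive frames below `threshold`."""
    longest = current = 0
    for energy in frame_energies:
        current = current + 1 if energy < threshold else 0
        longest = max(longest, current)
    return longest


def toy_synthesize(seed, n_frames=50):
    """Hypothetical stand-in for the synthesizer: random per-frame energies."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_frames)]


def first_seed_without_gap(seeds, threshold=0.3, max_gap=2):
    """Return the first seed whose output has no quiet run longer than `max_gap`."""
    for seed in seeds:
        if longest_quiet_run(toy_synthesize(seed), threshold) <= max_gap:
            return seed
    return None  # no candidate seed produced an acceptable output


chosen_seed = first_seed_without_gap(range(100))
```

Because each seed reproduces the same output, a seed found this way can be reused later to regenerate the same gap-free result.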

@ghost ghost marked this pull request as ready for review July 20, 2020 10:07
@ghost
Author

ghost commented Jul 20, 2020

This PR resolves #384, and introduces a workaround for the problem identified in #53.
User interface after the proposed changes. "Random seed" and "Enhance vocoder output" are new.

screenshot
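The PR text does not spell out how silences are trimmed, so the following is only a sketch of one simple amplitude-threshold approach, not the implementation merged here. The name `shorten_silences` and the parameters `threshold` and `max_silence` are hypothetical.

```python
def shorten_silences(samples, threshold=0.01, max_silence=3):
    """Cap every run of quiet samples at `max_silence` samples.

    A sample counts as quiet when its absolute value is below `threshold`.
    """
    output = []
    quiet_run = 0
    for sample in samples:
        if abs(sample) < threshold:
            quiet_run += 1
            if quiet_run > max_silence:
                continue  # drop the excess silence
        else:
            quiet_run = 0
        output.append(sample)
    return output


audio = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, -0.4, 0.0, 0.3]
trimmed = shorten_silences(audio, threshold=0.01, max_silence=2)
# trimmed == [0.5, 0.0, 0.0, -0.4, 0.0, 0.3]
```

Short pauses are kept intact, so natural gaps between words survive while long stretches of silence are shortened.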

@mbdash
Collaborator

mbdash commented Jul 20, 2020

+1 for repeatability feature. Proposed changes look neat.

@blue-fish I just want to say thank you for your work; you are adding features and fixing bugs that are really helpful and appreciated.
Thank you to @CorentinJ as well for allowing blue to make and integrate all the updates.
(Also thanks to the original author of the feature. If I am not mistaken, someone else made the initial code change suggestion and blue is pimping it out.)

@ghost
Author

ghost commented Jul 20, 2020

Thank you for the kind words @mbdash , it is nice knowing that others also find these improvements worthwhile. Feel free to provide feedback to help guide development, though as usual we find ourselves long on ideas and short on developers.

@ghost
Author

ghost commented Jul 20, 2020

Just pushed a fix for a small bug found during testing. Tacotron was incorrectly retaining the seed after the "random seed" checkbox transitioned from a checked to unchecked state. No further changes are expected.
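The fix itself is not shown in this conversation. As a generic illustration of this class of bug, the sketch below shows a generator that must be explicitly re-seeded from fresh entropy when a "use seed" option is switched off; `configure_rng` and its parameters are hypothetical names, and the standard library `random.Random` stands in for the synthesizer's RNG.

```python
import random


def configure_rng(rng, use_seed, seed):
    """Seed `rng` when the checkbox is ticked; otherwise re-seed from fresh entropy.

    Re-seeding in the unchecked case matters: without it, the generator keeps
    the state left behind by the previous fixed seed.
    """
    if use_seed:
        rng.seed(seed)
    else:
        rng.seed()  # no argument: system entropy or current time


rng = random.Random()

configure_rng(rng, use_seed=True, seed=123)
seeded_first = rng.getrandbits(64)
configure_rng(rng, use_seed=True, seed=123)
seeded_second = rng.getrandbits(64)  # same value as seeded_first

configure_rng(rng, use_seed=False, seed=123)
unseeded = rng.getrandbits(64)  # the stored seed is ignored here
```

If the unchecked branch did nothing, output would still be a deterministic continuation of the last seeded state, which matches the symptom described above.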

@mbdash
Collaborator

mbdash commented Jul 20, 2020

I would not have dared to ask for anything, but since you mentioned it...

If I may ask for your opinion on 2 questions I have been thinking about:
(and I hope these are not stupid questions)

Q1

Do you see a way in the future to reduce / tweak the minimum output audio length below the minimum 5 sec?

For example,
Something that would allow input text lengths as low as single words such as:

  • Hi
  • Hi your-name-here
  • How are you
  • I'm fine thank you
  • yes
  • no
  • thank you

My understanding is that the minimum audio output length is around 5 sec.
I have experimented with 90, 70, 60, 50 and 40 characters of input text.
The minimum workable input seems to be 60-70 chars to fill that 5 sec of audio;
below that, the audio output is just weird / creepy.
The sweet spot seems to be a minimum of 80-90 characters to nicely fill the minimum 5 sec audio output.

Q2
This one is a weird one and might go against the design itself...

Would using a dataset generated purely by a single actor result in better audio output when reproducing solely that actor's voice?

And if so,

do you have any guess as to how big a dataset would be required to reproduce the voice of a single voice actor, 1 to 1?
That would essentially remove the capacity to reproduce any other voices properly when using that specific model,
for the purpose of achieving better cloning accuracy for a single voice.

I.e.:
a single voice actor reads 12h of transcript (or more),
then we can generate higher quality TTS for that single actor.

Thank you for any feedback.

@ghost
Author

ghost commented Jul 20, 2020

@mbdash Opened #433 to discuss your questions. Let's continue the conversation there.

demo_cli.py Outdated
@@ -32,12 +32,13 @@
"overhead but allows to save some GPU memory for lower-end GPUs.")
parser.add_argument("--no_sound", action="store_true", help=\
"If True, audio won't be played.")
parser.add_argument("--seed", type=int, default=None, help=\
"Optional random number seed value for repeatable output.")
Owner

Change "repeatable" to "deterministic" everywhere; that's more precise.

@ghost ghost merged commit eaf5ec4 into CorentinJ:master Jul 22, 2020
@ghost ghost deleted the 384_repeatable_voice_cloning branch July 22, 2020 10:58
@ghost ghost mentioned this pull request Aug 17, 2020
12 tasks
This pull request was closed.