384 repeatable voice cloning #432
Conversation
…and tf.layers.dense()
Making some more improvements; will mark as ready when complete.
Start implementing toolbox UI updates. Add some code for #53 trim silences (still in progress).
Ensuring that the dropout in the prenet is set to inference mode at inference time would work too; it's the only source of randomness in Tacotron.
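As a minimal sketch of the suggestion above (assuming a PyTorch-style prenet with the usual two linear-plus-dropout layers; the hyperparameters here are illustrative, not the repo's actual values), switching the module to inference mode turns `Dropout` into the identity and removes the randomness:

```python
import torch
import torch.nn as nn

# Hypothetical Tacotron-style prenet: two linear layers, each followed by
# ReLU and dropout. Layer sizes are illustrative assumptions.
prenet = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.5),
)

x = torch.randn(1, 80)

# In train mode, dropout samples a new mask per call, so repeated
# forward passes on the same input differ.
prenet.train()
a, b = prenet(x), prenet(x)

# eval() switches Dropout to inference mode (identity), so repeated
# forward passes are identical -- no randomness left in the prenet.
prenet.eval()
c, d = prenet(x), prenet(x)
assert torch.equal(c, d)
```

As the follow-up comment notes, Tacotron's prenet dropout is often kept active at inference on purpose, so this trades output diversity for determinism.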
Thanks for the info, I'll implement it and test it out. Much better than the brute-force approach. Edit: Will this suggestion make the synthesizer output unaffected by the state of the random number generator? Because this Tacotron is liable to produce gaps in the output (#53), I think it is preferable to keep the randomness and allow controlling it by setting the seed. When using the toolbox, I repeatedly click "Synthesize only" until a spectrogram with no large gaps appears. Vocoding is reserved for good spectrograms, especially since it is very slow with CPU inference.
I have performed additional experimentation to identify the minimum change needed for repeatability. Ready for review. Some thoughts:
+1 for the repeatability feature. The proposed changes look neat. @blue-fish I just want to say thank you for your work, you are adding features and fixing bugs that are really helpful and appreciated.
Thank you for the kind words @mbdash , it is nice knowing that others also find these improvements worthwhile. Feel free to provide feedback to help guide development, though as usual we find ourselves long on ideas and short on developers.
Just pushed a fix for a small bug found during testing. Tacotron was incorrectly retaining the seed after the "random seed" checkbox transitioned from a checked to unchecked state. No further changes are expected.
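The bug described above can be sketched as follows. All names here are hypothetical (the toolbox's real widget code differs); the point is only that the seed must be recomputed from the checkbox state on every synthesis, not cached:

```python
# Hypothetical sketch of the seed-selection logic: when the "random seed"
# checkbox is unchecked, the synthesizer must receive None (fresh
# randomness). The bug was reusing the previously stored seed instead.
def get_seed(checkbox_checked: bool, textbox_value: str):
    """Return the seed to pass to the synthesizer, or None for random."""
    if not checkbox_checked:
        return None  # buggy version returned the stale cached seed here
    return int(textbox_value)

assert get_seed(True, "42") == 42
assert get_seed(False, "42") is None  # unchecked box => no seed retained
```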
I would not have dared asking for anything, but since you mentioned it... if I may ask your opinion on two questions I have been thinking about:
Q1: Do you see a way in the future to reduce or tweak the minimum output audio length below 5 seconds? For example, my understanding is that the minimum audio output length is around 5 seconds.
Q2: Would using a dataset generated purely by a single actor result in better audio output when reproducing solely that actor's voice? And if so, do you have any guess how big a dataset would be required to reproduce the voice of a single voice actor? Thank you for any feedback.
demo_cli.py
Outdated
@@ -32,12 +32,13 @@
         "overhead but allows to save some GPU memory for lower-end GPUs.")
     parser.add_argument("--no_sound", action="store_true", help=\
         "If True, audio won't be played.")
+    parser.add_argument("--seed", type=int, default=None, help=\
+        "Optional random number seed value for repeatable output.")
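A sketch of how the parsed `--seed` value might be applied, assuming the standard approach of seeding Python's and NumPy's RNGs (the `set_seed` helper is hypothetical; the PR's actual wiring, and any PyTorch seeding it does, may differ):

```python
import argparse
import random

import numpy as np

def set_seed(seed):
    """Seed the RNGs that affect synthesis; a no-op when seed is None."""
    if seed is None:
        return
    random.seed(seed)
    np.random.seed(seed)
    # If PyTorch is in use, torch.manual_seed(seed) would go here as well.

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, default=None, help=\
    "Optional random number seed value for repeatable output.")
args = parser.parse_args(["--seed", "1234"])

# Same seed, same draws: the output becomes repeatable.
set_seed(args.seed)
first = np.random.rand(3)
set_seed(args.seed)
second = np.random.rand(3)
assert np.allclose(first, second)
```

With `--seed` omitted, `args.seed` is `None` and `set_seed` leaves the RNG state untouched, preserving the default random behavior.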
Change "repeatable" to "deterministic" everywhere; that's more precise.
See #384. This PR adds a "--seed" option to make the toolbox output repeatable. It also implements a workaround for #53 by adding an option to trim silences in the vocoder output (caused by gaps in the spectrograms produced during synthesis).
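The silence-trimming idea can be sketched with a simple amplitude threshold on the waveform (a minimal illustration only; the PR's actual method, thresholds, and handling of interior gaps are not shown here and likely differ):

```python
import numpy as np

def trim_silence(wav: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Drop leading/trailing samples whose amplitude is below threshold.

    Minimal sketch: keeps everything between the first and last sample
    that exceeds the threshold. Gaps in the middle are left untouched.
    """
    voiced = np.where(np.abs(wav) >= threshold)[0]
    if voiced.size == 0:
        return wav[:0]  # all silence
    return wav[voiced[0]:voiced[-1] + 1]

# Synthetic example: a short tone padded with silence on both sides.
tone = 0.5 * np.sin(np.linspace(0, 2 * np.pi * 10, 1000))
wav = np.concatenate([np.zeros(500), tone, np.zeros(500)])

trimmed = trim_silence(wav)
assert len(trimmed) < len(wav)  # the silent padding is gone
```

A dB-based trim (e.g. `librosa.effects.trim` with a `top_db` cutoff) is the more common choice in practice, since raw-amplitude thresholds are sensitive to recording level.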