blog/tortoise-fine-tuning/ #4
Replies: 8 comments 22 replies
-
Your assumption is correct: it's possible to train Tortoise by training only a few of the models, namely the VQVAE and the transformer model. You'd also need to update the CLVP vocab.
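Concretely, the split could look something like this sketch; the sub-model handles here are hypothetical stand-ins, not Tortoise's actual loader API:

```python
# Illustrative sketch only -- the handles below are assumptions, not
# Tortoise's real module names. The idea: leave gradients on only for
# the VQVAE and the autoregressive (GPT) transformer during fine-tuning.
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def freeze_for_finetune(vqvae, gpt, diffusion, clvp, vocoder):
    set_trainable(vqvae, True)       # learns the discrete mel codes
    set_trainable(gpt, True)         # autoregressive text -> code model
    set_trainable(diffusion, False)  # left frozen in this scheme
    set_trainable(clvp, False)       # frozen; only its vocab needs updating
    set_trainable(vocoder, False)
```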
-
Awesome. I wouldn't put it past the author to omit or alter some details, since he was trying to discourage training the whole time.
But this checks out regarding the lexman post: this is obviously what he meant when he said that the part was removed from the public release even though he did train with it.
On Wed, Feb 15, 2023, 4:23 p.m., 152334H wrote:
Found the problem: I was using a dataset smaller than a single batch size, lol.
I did a lot of things to get it to load; I'll push my fork once I verify that a very basic autoregressive fine-tune step works.
-
Awesome, how is the VRAM usage?
On Wed, Feb 15, 2023, 4:46 p.m., 152334H wrote:
23-02-16 08:25:48.710 - INFO: Start training from epoch: 0, iter: -1
0%| | 0/34 [00:00<?, ?it/s]
23-02-16 08:26:00.964 - INFO: [epoch: 0, iter: 0, lr:(1.000e-05,1.000e-05,)] step: 0.0000e+00 samples: 1.2800e+02 megasamples: 1.2800e-04 iteration_rate: 6.6430e-02 loss_text_ce: 3.5616e+00 loss_mel_ce: 3.1480e+00 loss_gpt_total: 3.1837e+00 grad_scaler_scale: 1.0000e+00 learning_rate_gpt_0: 1.0000e-05 learning_rate_gpt_1: 1.0000e-05 total_samples_loaded: 1.2800e+02 percent_skipped_samples: 1.5385e-02 percent_conditioning_is_self: 9.8462e-01 gpt_conditioning_encoder: 6.2724e+00 gpt_gpt: 3.1667e+00 gpt_heads: 3.4476e-01
< ... omitted ... >
23-02-16 08:42:06.027 - INFO: [epoch: 2, iter: 100, lr:(1.000e-05,1.000e-05,)] step: 1.0000e+02 samples: 1.2928e+04 megasamples: 1.2928e-02 iteration_rate: 7.8301e-02 loss_text_ce: 2.7187e+00 loss_mel_ce: 2.2782e+00 loss_gpt_total: 2.3053e+00 grad_scaler_scale: 1.0000e+00 learning_rate_gpt_0: 1.0000e-05 learning_rate_gpt_1: 1.0000e-05 total_samples_loaded: 1.2928e+04 percent_skipped_samples: 9.7281e-03 percent_conditioning_is_self: 9.9027e-01 gpt_conditioning_encoder: 4.2794e-02 gpt_gpt: 2.0582e-01 gpt_heads: 1.2382e-01
Made it to 100 steps on bs=128. Loss is *allegedly* going down, although I have no idea whether this will cook well or not (I set save_checkpoint far too high).
Will fork with instructions soon.
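As a quick sanity check, the sample counters in the quoted log are consistent with bs=128 (variable names below are mine):

```python
# At iter 100 the trainer has completed 101 optimizer steps (iters 0..100),
# and each step loads one batch, so:
batch_size = 128
steps_completed = 101
assert batch_size * steps_completed == 12_928  # matches samples: 1.2928e+04
```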
-
Interesting, you're using his trainer? I wonder if we can get better performance using stuff like accelerate or fp16 etc.; it will be interesting regardless.
Do you think we need to train anything other than the AR to get custom voices?
On Wed, Feb 15, 2023, 5:08 p.m., 152334H wrote:
Oh, and these statements only apply to GPT training. I haven't done any experiments with training, e.g., the diffusion model or CLVP yet.
-
Super awesome!
Gonna take it for a test later tonight :)
A couple of questions though: any tips on how you organized your dataset, and how much data was it? (You said 4k wav files, but what duration etc.?)
Also, your instructions seem to be missing the requirements.laxed.txt file; using the non-laxed one gives errors (on Windows and WSL at least), with pyworld not being installed etc.
On Wed, Feb 15, 2023 at 5:30 PM, 152334H wrote:

> I wonder if we can get better performance using stuff like accelerate or fp16 etc.

DLAS can do a surprising amount of tricks: fp16, grad accum, any kind of optimizer, ZeRO, etc. I doubt porting to a different trainer will be necessary, but who knows?

> Do you think we need to train anything other than the AR to get custom voices?

It would be for the best. The diffusion model reads the conditioning latents as well...
In fact, if you had the resources you might even want to fine-tune UnivNet as well. All steps would improve voice-cloning fidelity.
But that is all in the future. For now, here's my fork:
https://github.com/152334H/DL-Art-School
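To make the "the diffusion model reads the conditioning latents as well" point concrete, here is a sketch against the Tortoise inference API as published in the main repo (treat the exact method signature as an assumption):

```python
# Sketch: the AR model and the diffusion model each get their own
# conditioning latent computed from the reference clips, which is why
# fine-tuning only the AR model leaves the diffusion conditioning untouched.
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()
voice_samples = [load_audio(f"voices/myvoice/{i}.wav", 22050) for i in range(3)]

# Assumed to return one latent consumed by the autoregressive model and
# another consumed by the diffusion decoder.
auto_latent, diffusion_latent = tts.get_conditioning_latents(voice_samples)
print(auto_latent.shape, diffusion_latent.shape)
```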
-
Awesome, thanks!
How did you construct your dataset? I assume you didn't manually transcribe and cut 4k files.
On Wed, Feb 15, 2023, 6:44 p.m., 152334H wrote:
The missing requirements file has been added, apologies.
Re: dataset, the "LJSpeech format" refers to a directory of:
dataset/
├── val.txt
├── train.txt
└── wavs/
where the text files contain lines that look like:
wavs/A.wav|this is a spoken line
wavs/B.wav|this is another spoken line
...
and the wav files contain wavs. I'm pretty sure they'll automatically be resampled to the right sampling rate, but if they aren't, use 22.05 kHz.
Duration-wise: short clips of 3-20s will probably do well. But experimentation would be great; I've only done one training run on a bunch of configs I picked at random, so people should try whatever they can to figure out what works.
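A minimal sketch for generating those two index files from a folder of clips with known transcripts (the paths, the transcripts dict, and the 90/10 split below are illustrative assumptions, not part of the instructions above):

```python
# Sketch: write LJSpeech-style train.txt / val.txt for a folder of wavs.
# Assumes you already have a transcript for every clip; how you produce
# those (e.g. an ASR pass) is up to you.
import random
from pathlib import Path

dataset = Path("dataset")
(dataset / "wavs").mkdir(parents=True, exist_ok=True)
transcripts = {"A.wav": "this is a spoken line",
               "B.wav": "this is another spoken line"}  # filename -> text

lines = [f"wavs/{name}|{text}" for name, text in sorted(transcripts.items())
         if (dataset / "wavs" / name).exists()]
random.shuffle(lines)

split = max(1, int(0.1 * len(lines)))  # hold out ~10% for validation
(dataset / "val.txt").write_text("\n".join(lines[:split]) + "\n")
(dataset / "train.txt").write_text("\n".join(lines[split:]) + "\n")
```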
-
Someone linked me here. I have to say this is really well put together! I was honestly hoping someone would make the reverse-engineering approach you describe work, and I'm pretty sure it will without too much effort. The quality would likely not be as good as if you had the real VQ, but it'd likely be damned close. I somewhat suspect the folks who have built things on top of Tortoise (and who figured out how to do fine-tunes before the VQ was recently found in an old repo) did this. None of them ever confessed, though. :)
-
I think you could bootstrap the VQVAE training using the method you're suggesting. Part of the VQVAE training is learning the codebook, and that's what would fundamentally make a freshly trained VQVAE incompatible with the trained GPT model. But you know the codes (from the GPT output), so you could pretrain to get a model that uses the right codes, then set the codes to not be updated and continue training on real data. Right?
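A sketch of that two-phase idea against a toy VQ-VAE; the class, layer sizes, and attribute names below are invented for illustration, and Tortoise's actual DVAE differs:

```python
# Phase 1: train the full VQ-VAE against code targets recovered from GPT
# outputs, so the codebook converges to the codes the GPT already expects.
# Phase 2: freeze the codebook and keep training encoder/decoder on real data.
import torch.nn as nn

class ToyVQVAE(nn.Module):
    def __init__(self, n_codes=8192, dim=512):
        super().__init__()
        self.encoder = nn.Linear(80, dim)         # stand-in encoder
        self.codebook = nn.Embedding(n_codes, dim)
        self.decoder = nn.Linear(dim, 80)         # stand-in decoder

def freeze_codebook(model: ToyVQVAE) -> None:
    # After bootstrapping, the codebook must stay aligned with the GPT's
    # vocabulary, so exclude it from further gradient updates.
    model.codebook.weight.requires_grad = False

model = ToyVQVAE()
# ... phase-1 bootstrap training here ...
freeze_codebook(model)
# ... phase-2 training on real data, codebook now fixed ...
```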
-
blog/tortoise-fine-tuning/
TorToiSe 🐢 is an open-source Text-To-Speech (TTS) neural network that creates fairly authentic & realistic voices. Checkpoints for local inference have been available since April last year, but its users are seemingly unable to fine-tune the model with additional voice data.
Why is this the case, and how could it be fixed?
https://152334h.github.io/blog/tortoise-fine-tuning/