blog/tortoise-fine-tuning/ #4
Replies: 8 comments 22 replies
-
Your assumption is correct: it's possible to train Tortoise by training only a few of the models, namely the VQVAE and the transformer model. You'd also need to update the CLVP vocab.
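Concretely, the split could look something like this sketch; the sub-model handles here are hypothetical stand-ins, not Tortoise's actual loader API:

```python
# Illustrative sketch only -- the handles below are assumptions, not
# Tortoise's real module names. The idea: leave gradients on only for
# the VQVAE and the autoregressive (GPT) transformer during fine-tuning.
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def freeze_for_finetune(vqvae, gpt, diffusion, clvp, vocoder):
    set_trainable(vqvae, True)       # learns the discrete mel codes
    set_trainable(gpt, True)         # autoregressive text -> code model
    set_trainable(diffusion, False)  # left frozen in this scheme
    set_trainable(clvp, False)       # frozen; only its vocab needs updating
    set_trainable(vocoder, False)
```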
-
Awesome. I wouldn't put it past the author to omit or alter some details, since he was trying to discourage training the whole time.
But this checks out regarding the lexman post: this is obviously what he meant when he said that the part was removed from the public release even though he did train with it.
On Wed, Feb 15, 2023, 4:23 p.m., 152334H wrote:
Found the problem: I was using a dataset smaller than a single batch size, lol.
I did a lot of things to get it to load; I'll push my fork once I verify that a very basic autoregressive fine-tune step works.
-
Awesome, how is the VRAM usage?
On Wed, Feb 15, 2023, 4:46 p.m., 152334H wrote:
23-02-16 08:25:48.710 - INFO: Start training from epoch: 0, iter: -1
0%| | 0/34 [00:00<?, ?it/s]
23-02-16 08:26:00.964 - INFO: [epoch: 0, iter: 0, lr:(1.000e-05,1.000e-05,)] step: 0.0000e+00 samples: 1.2800e+02 megasamples: 1.2800e-04 iteration_rate: 6.6430e-02 loss_text_ce: 3.5616e+00 loss_mel_ce: 3.1480e+00 loss_gpt_total: 3.1837e+00 grad_scaler_scale: 1.0000e+00 learning_rate_gpt_0: 1.0000e-05 learning_rate_gpt_1: 1.0000e-05 total_samples_loaded: 1.2800e+02 percent_skipped_samples: 1.5385e-02 percent_conditioning_is_self: 9.8462e-01 gpt_conditioning_encoder: 6.2724e+00 gpt_gpt: 3.1667e+00 gpt_heads: 3.4476e-01
< ... omitted ... >
23-02-16 08:42:06.027 - INFO: [epoch: 2, iter: 100, lr:(1.000e-05,1.000e-05,)] step: 1.0000e+02 samples: 1.2928e+04 megasamples: 1.2928e-02 iteration_rate: 7.8301e-02 loss_text_ce: 2.7187e+00 loss_mel_ce: 2.2782e+00 loss_gpt_total: 2.3053e+00 grad_scaler_scale: 1.0000e+00 learning_rate_gpt_0: 1.0000e-05 learning_rate_gpt_1: 1.0000e-05 total_samples_loaded: 1.2928e+04 percent_skipped_samples: 9.7281e-03 percent_conditioning_is_self: 9.9027e-01 gpt_conditioning_encoder: 4.2794e-02 gpt_gpt: 2.0582e-01 gpt_heads: 1.2382e-01
Made it to 100 steps on bs=128. Loss is *allegedly* going down, although I have no idea whether this will cook well or not (I set save_checkpoint far too high).
Will fork with instructions soon.
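As a quick sanity check, the sample counters in the quoted log are consistent with bs=128 (variable names below are mine):

```python
# At iter 100 the trainer has completed 101 optimizer steps (iters 0..100),
# and each step loads one batch, so:
batch_size = 128
steps_completed = 101
assert batch_size * steps_completed == 12_928  # matches samples: 1.2928e+04
```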
-
Interesting, you're using his trainer? I wonder if we can get better performance using stuff like accelerate or fp16 etc.; it will be interesting regardless.
Do you think we need to train anything other than the AR to get custom voices?
On Wed, Feb 15, 2023, 5:08 p.m., 152334H wrote:
Oh, and these statements only apply to GPT training. I haven't done any experiments with training, e.g., the diffusion model or CLVP yet.
-
Super awesome!
Gonna take it for a test later tonight :)
A couple of questions though: any tips on how you organized your dataset, and how much data was it? (You said 4k wav files, but what duration etc.?)
Also, your instructions seem to be missing the requirements.laxed.txt file; using the non-laxed one gives errors (on Windows and WSL at least), with pyworld not being installed etc.
On Wed, Feb 15, 2023 at 5:30 PM, 152334H wrote:

> I wonder if we can get better performance using stuff like accelerate or fp16 etc.

DLAS can do a surprising amount of tricks: fp16, grad accum, any kind of optimizer, ZeRO, etc. I doubt porting to a different trainer will be necessary, but who knows?

> Do you think we need to train anything other than the AR to get custom voices?

It would be for the best. The diffusion model reads the conditioning latents as well...
In fact, if you had the resources you might even want to fine-tune UnivNet as well. All steps would improve voice-cloning fidelity.
But that is all in the future. For now, here's my fork:
https://github.com/152334H/DL-Art-School
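To make the "the diffusion model reads the conditioning latents as well" point concrete, here is a sketch against the Tortoise inference API as published in the main repo (treat the exact method signature as an assumption):

```python
# Sketch: the AR model and the diffusion model each get their own
# conditioning latent computed from the reference clips, which is why
# fine-tuning only the AR model leaves the diffusion conditioning untouched.
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio

tts = TextToSpeech()
voice_samples = [load_audio(f"voices/myvoice/{i}.wav", 22050) for i in range(3)]

# Assumed to return one latent consumed by the autoregressive model and
# another consumed by the diffusion decoder.
auto_latent, diffusion_latent = tts.get_conditioning_latents(voice_samples)
print(auto_latent.shape, diffusion_latent.shape)
```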
-
Awesome, thanks!
How did you construct your dataset? I assume you didn't manually transcribe and cut 4k files.
On Wed, Feb 15, 2023, 6:44 p.m., 152334H wrote:
The missing requirements file has been added, apologies.
Re: dataset, the "LJSpeech format" refers to a directory of:
dataset/
├── val.txt
├── train.txt
└── wavs/
where the text files contain lines that look like:
wavs/A.wav|this is a spoken line
wavs/B.wav|this is another spoken line
...
and the wav files contain wavs. I'm pretty sure they'll automatically be resampled to the right sampling rate, but if they aren't, use 22.05 kHz.
Duration-wise: short clips of 3-20s will probably do well. But experimentation would be great; I've only done one training run on a bunch of configs I picked at random, so people should try whatever they can to figure out what works.
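A minimal sketch for generating those two index files from a folder of clips with known transcripts (the paths, the transcripts dict, and the 90/10 split below are illustrative assumptions, not part of the instructions above):

```python
# Sketch: write LJSpeech-style train.txt / val.txt for a folder of wavs.
# Assumes you already have a transcript for every clip; how you produce
# those (e.g. an ASR pass) is up to you.
import random
from pathlib import Path

dataset = Path("dataset")
(dataset / "wavs").mkdir(parents=True, exist_ok=True)
transcripts = {"A.wav": "this is a spoken line",
               "B.wav": "this is another spoken line"}  # filename -> text

lines = [f"wavs/{name}|{text}" for name, text in sorted(transcripts.items())
         if (dataset / "wavs" / name).exists()]
random.shuffle(lines)

split = max(1, int(0.1 * len(lines)))  # hold out ~10% for validation
(dataset / "val.txt").write_text("\n".join(lines[:split]) + "\n")
(dataset / "train.txt").write_text("\n".join(lines[split:]) + "\n")
```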
-
Someone linked me here. I have to say this is really well put together! I was honestly hoping someone would make the reverse-engineering approach you describe work, and I'm pretty sure it will without too much effort. The quality would likely not be as good as if you had the real VQ, but it'd likely be damned close. I somewhat suspect the folks who have built things on top of Tortoise (and who figured out how to do fine-tunes before the VQ was recently found in an old repo) did this. None of them ever confessed, though. :)
-
I think you could bootstrap the VQVAE training using the method you're suggesting. Part of the VQVAE training is learning the codebook, and that's what would fundamentally make a freshly trained VQVAE incompatible with the trained GPT model. But you know the codes (from the GPT output), so you could pretrain to get a model that uses the right codes, then set the codes to not be updated and continue training on real data. Right?
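A sketch of that two-phase idea against a toy VQ-VAE; the class, layer sizes, and attribute names below are invented for illustration, and Tortoise's actual DVAE differs:

```python
# Phase 1: train the full VQ-VAE against code targets recovered from GPT
# outputs, so the codebook converges to the codes the GPT already expects.
# Phase 2: freeze the codebook and keep training encoder/decoder on real data.
import torch.nn as nn

class ToyVQVAE(nn.Module):
    def __init__(self, n_codes=8192, dim=512):
        super().__init__()
        self.encoder = nn.Linear(80, dim)         # stand-in encoder
        self.codebook = nn.Embedding(n_codes, dim)
        self.decoder = nn.Linear(dim, 80)         # stand-in decoder

def freeze_codebook(model: ToyVQVAE) -> None:
    # After bootstrapping, the codebook must stay aligned with the GPT's
    # vocabulary, so exclude it from further gradient updates.
    model.codebook.weight.requires_grad = False

model = ToyVQVAE()
# ... phase-1 bootstrap training here ...
freeze_codebook(model)
# ... phase-2 training on real data, codebook now fixed ...
```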
-
blog/tortoise-fine-tuning/
TorToiSe 🐢 is an open-source Text-To-Speech (TTS) neural network that creates fairly authentic & realistic voices. Checkpoints for local inference have been available since April last year, but its users are seemingly unable to fine-tune the model with additional voice data.
Why is this the case, and how could it be fixed?
https://152334h.github.io/blog/tortoise-fine-tuning/