
Train other models in the pipeline #3

Open
152334H opened this issue Feb 16, 2023 · 6 comments

@152334H (Owner) commented Feb 16, 2023

Apart from the GPT model (which has already been implemented), there are four other models in TorToiSe that could be fine-tuned (a rough sketch of how they chain together follows the list):

  • the VQVAE, which learns how to encode the training data into discrete speech tokens,
  • CLVP, which scores how closely candidate speech tokens match the input text,
  • the diffuser, which learns how to decompress speech latents into spectrograms,
  • UnivNet, the vocoder, which converts spectrograms to sound.
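
For orientation, here is a rough sketch of how these stages chain together at inference time. Every name below is an illustrative placeholder rather than the actual tortoise-tts API, with the models passed in as callables:

```python
from typing import Callable, Sequence

def tts_pipeline(
    text: str,
    reference_clips: Sequence,
    compute_cond_latents: Callable,  # reference audio -> conditioning latents
    gpt_sample: Callable,            # (text, cond latents) -> candidate speech tokens
    clvp_score: Callable,            # (text, tokens) -> text/speech match score
    diffusion_decode: Callable,      # (speech latents, cond latents) -> mel spectrogram
    vocode: Callable,                # mel spectrogram -> waveform (UnivNet's job)
    num_candidates: int = 16,
):
    """Rough inference-time dataflow of TorToiSe. The VQVAE is absent here
    because it works at training time: it turns audio into the discrete
    speech tokens that the GPT learns to predict."""
    cond = compute_cond_latents(reference_clips)
    candidates = [gpt_sample(text, cond) for _ in range(num_candidates)]
    # CLVP reranks the autoregressive samples against the input text.
    best = max(candidates, key=lambda toks: clvp_score(text, toks))
    # In the real pipeline the diffuser consumes the GPT's latents for the
    # winning tokens, not the tokens themselves; `best` stands in for both.
    return vocode(diffusion_decode(best, cond))
```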

IMO, the diffusion model + vocoder are obvious targets. Vocoders are often fine-tuned in other TTS pipelines, and the diffusion model serves roughly the same purpose...

...but, the diffusion model is the only other model that takes the conditioning latents into account. I suspect that fine-tuning both the autoregressive and diffusion models on a single speaker would lead to a kind of 'mode collapse' (bear with the inaccurate phrasing), where the conditioning latents stop affecting the output speech substantially. Ideally, some form of mixed-speaker training would account for this, but I'm not sure how to accomplish that yet.
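
One plausible form of mixed-speaker training would be rehearsal-style sampling: keep drawing some fraction of every batch from the original multi-speaker corpus so the conditioning latents stay informative. A minimal sketch, noting that this Dataset wrapper is hypothetical and not part of this repo's pipeline:

```python
import random
from torch.utils.data import Dataset

class MixedSpeakerDataset(Dataset):
    """Yields a fine-tune-speaker clip with probability `target_ratio`,
    otherwise a clip from the original multi-speaker corpus, so fine-tuning
    still sees varied speakers. (Hypothetical; the ratio is a guess.)"""

    def __init__(self, target_ds: Dataset, multi_ds: Dataset, target_ratio: float = 0.5):
        self.target_ds = target_ds
        self.multi_ds = multi_ds
        self.target_ratio = target_ratio

    def __len__(self):
        return len(self.target_ds) + len(self.multi_ds)

    def __getitem__(self, idx):
        if random.random() < self.target_ratio:
            return self.target_ds[random.randrange(len(self.target_ds))]
        return self.multi_ds[random.randrange(len(self.multi_ds))]
```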

Training the VQVAE could be good for datasets that are emotional and substantially different from the usual LJSpeech + LibriTTS + CommonVoice + VoxPopuli + ... pile of monotonic speech. But I think it would necessitate retraining the GPT and CLVP models in parallel as well, to account for the change in the tokens it outputs.
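
To make the coupling concrete: the GPT's training targets are the VQVAE's discrete code indices, so a retrained VQVAE reassigns those indices and silently invalidates what the GPT (and CLVP) learned. A toy illustration, with the encoders passed in as callables because the exact interface isn't assumed here:

```python
from typing import Callable, Sequence

def token_mismatch_rate(mel, old_encode: Callable, new_encode: Callable) -> float:
    """old_encode / new_encode map a mel spectrogram to a sequence of VQVAE
    code indices (placeholder interface). After retraining the VQVAE, the
    same audio maps to different indices, so a GPT trained against the old
    codebook predicts the wrong vocabulary entries."""
    old_ids: Sequence[int] = old_encode(mel)
    new_ids: Sequence[int] = new_encode(mel)
    return sum(a != b for a, b in zip(old_ids, new_ids)) / len(old_ids)
```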

I also think that leaving the CLVP model untouched could be a good idea, to retain the power of the conditioning latents. Fine-tuning it on a single voice would bias it toward scoring that specific speaker as more likely than other speakers.

@Ryu1845 commented Feb 19, 2023

Might be relevant https://github.com/yuan1615/AdaVocoder

@152334H (Owner, Author) commented Feb 20, 2023

DIFFUSION TRAINING PROGRESS

  • figure out how to switch it on
  • add integration with tortoise-tts-fast
  • add basic training config to the repo
  • ❌ check that the results are better at all
  • figure out why the results are so bad (why even the 0.pth checkpoint is completely wrong) -- this happened because the defined config file was wrong.
  • test on a non-messed-up GPT model
  • find the best parameters for training -- in particular, do the losses really require warmup periods?
  • add integration with the Colab notebook
  • contact @devilismyfriend w.r.t. the local training UI
  • fix this issue with unifiedvoice
  • fix the `Eval loss_upper_quantile_mse_loss: 0.0` / `Eval loss_mid_upper_quantile_mse_loss: 0.0` thing
  • figure out how to reduce VRAM -- I had to substantially reduce the batch size (relative to the GPT model) and increase gradient checkpointing to get it running under 16GB. I suspect it will crash on Colab due to random spikes on the current default configs anyway. I'm 90% certain you could preprocess the dataset to bring VRAM down substantially (see the sketch below), but that will take some time to cobble together.
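
On that last point, one plausible preprocessing step (hypothetical; the current configs do not do this) is to cache mel spectrograms to disk once, so the training loop can skip the on-the-fly audio processing entirely:

```python
# Hypothetical offline preprocessing; paths and mel parameters are
# illustrative and would need to match whatever the training config expects.
from pathlib import Path

import torch
import torchaudio

mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

def cache_mels(wav_dir: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        wav, sr = torchaudio.load(str(wav_path))
        if sr != 22050:
            wav = torchaudio.functional.resample(wav, sr, 22050)
        # One .mel.pt per clip; the dataloader then loads tensors directly.
        torch.save(mel_fn(wav), out / (wav_path.stem + ".mel.pt"))
```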

@devilismyfriend (Collaborator) commented

Did the new configs and changes improve the diffusion model training?

@152334H (Owner, Author) commented Feb 25, 2023

What I did was try to train the diffusion model on top of a fairly broken GPT fine-tune... which was evidently a bad idea; I couldn't tell whether it was significantly better or not. I vaguely think "it works", but honestly I should figure out how to enable the FID eval metrics first.
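
For reference, FID reduces to a closed form over the Gaussian statistics of two feature sets; a generic implementation looks like the sketch below. How DL-Art-School actually wires this into its eval loop is the part I still need to figure out.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """FID between two (N, D) feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 * (C_a @ C_b)^(1/2))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerics
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```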

@caffeinetoomuch commented

Hi, is this still ongoing? I was trying to train the diffusion model from the template yaml (../experiments/FIXED_diff.yml), but it was throwing "unexpected keys" when loading the gpt_latent model. I gave the path of the autoregressive model for the produce_latents section. Should I be passing a different model?
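
A generic way to debug this kind of mismatch (plain PyTorch, nothing specific to FIXED_diff.yml) is to load the checkpoint with strict=False and print which keys are unexpected or missing; that quickly shows whether the wrong model file was supplied:

```python
import torch

def inspect_checkpoint(model: torch.nn.Module, ckpt_path: str) -> None:
    """Report key mismatches between a checkpoint and a model, e.g. to check
    whether an autoregressive checkpoint is being loaded where a
    latent-producing variant is expected."""
    state = torch.load(ckpt_path, map_location="cpu")
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]  # some trainers wrap the weights
    result = model.load_state_dict(state, strict=False)
    print("missing keys:", result.missing_keys)
    print("unexpected keys:", result.unexpected_keys)
```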

@152334H (Owner, Author) commented Sep 19, 2023

Nope, this entire repo + project is dead (I got poached).

XTTS seems at least marginally better; I'd just ask around Coqui about how to train stuff.
