Fine tuning with custom (multilingual) data #82

ukemamaster · 2023-10-19T08:58:07Z

Hi @OlaWod, I appreciate your work.

I am trying to fine-tune the FreeVC model on my custom multilingual data (using an already trained speaker encoder model), without SR augmentation. After about 300k steps (batch size 32), it gives fair conversion outputs. However, I have some questions:

Unseen-to-seen and unseen-to-unseen conversions seem to have poor quality. Will adding more data to the training set improve these cases?
Is it necessary to train the WavLM and HiFi-GAN models on the custom dataset, or are the pre-trained models OK to use?
Is it possible to train the FreeVC model with mel-spectrograms fed directly to the Bottleneck Extractor instead of SSL features (i.e., skipping the HiFi-GAN and WavLM models)? Have you tried it? Is it worth a try?
Does the 24 kHz training recipe perform better than the 16 kHz one?
Does SR augmentation have a big effect on performance?
Can the conversion run in real time? That is, can we convert the source audio frame by frame rather than as a whole?
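
To make the last question concrete, frame-by-frame conversion would mean something like the sketch below: the source waveform is cut into fixed-size chunks and each chunk is converted on its own. This is only an illustration of the idea, not FreeVC code. `convert_chunk` and `identity_convert` are hypothetical placeholders for a model call, and the sketch ignores the context or overlap a real model would likely need across chunk boundaries.

```python
from typing import Callable, Iterator, List, Sequence


def iter_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive, non-overlapping chunks; the last one may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def convert_streaming(
    samples: Sequence[float],
    chunk_size: int,
    convert_chunk: Callable[[Sequence[float]], List[float]],
) -> List[float]:
    """Convert the audio chunk by chunk and concatenate the results."""
    output: List[float] = []
    for chunk in iter_chunks(samples, chunk_size):
        # Each chunk is converted independently of its neighbours.
        output.extend(convert_chunk(chunk))
    return output


def identity_convert(chunk: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for a voice conversion model call (assumption)."""
    return list(chunk)


if __name__ == "__main__":
    audio = [0.0] * 16000  # one second of silence at 16 kHz
    converted = convert_streaming(audio, chunk_size=320, convert_chunk=identity_convert)
    print(len(converted))
```

Whether the output of such a loop matches whole-utterance conversion depends on how much surrounding context the model uses, which is the point of the question.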

Any other tips for improving the conversion quality are appreciated.

Thanks

Xmiler · 2023-10-21T09:09:29Z

Hi @ukemamaster,

I am new here and will follow this topic with interest. Could you please share some audio samples generated by your model, so we can hear the quality you have achieved?

Thanks in advance.
