
Single speaker fine-tuning process and results #437

Closed
ghost opened this issue Jul 22, 2020 · 72 comments


ghost commented Jul 22, 2020

Summary

A relatively easy way to improve the quality of the toolbox output is through fine-tuning of the multispeaker pretrained models on a dataset of a single target speaker. Although it is no longer voice cloning, it is a shortcut for obtaining a single-speaker TTS model with far less training data than training from scratch would require. This idea is not original, but a sample single-speaker model is presented along with the process and data needed to replicate it.

Improvement in quality is obtained by taking the pretrained synthesizer model and training a few thousand steps on a single-speaker dataset. This amount of training can be done in less than a day on a CPU, and even faster with a GPU.

Procedure

Pretrained models and all files and commands needed to replicate this training can be found here: https://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=0

  1. First, create a dataset of a single speaker from LibriSpeech. All embeddings are updated to reference the same file; a sketch of this overwrite is shown after this list. (I'm not sure if this helps or not, but the idea is to get it to converge faster.)
    • It doesn't have to be LibriSpeech. This demonstrates the concept with minimal changes to existing files.
    • Total of 13.28 minutes (train-clean-100/211/122425/*)
  2. Next, continue training of the pretrained synthesizer model using the restricted dataset. Running overnight on a CPU, loss decreased from 0.70 to 0.50 over 2,600 steps. I plan to go further in subsequent tests.
  3. Generate new training data for the vocoder using the updated synthesizer model.
  4. Continue training of the pretrained vocoder. I only added 1,000 steps for now because I was eager to see if it worked, but the difference is noticeable even with a little fine-tuning.
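
For anyone replicating step 1, here is a minimal sketch of the embedding overwrite. It assumes the dataset has already been preprocessed into the SV2TTS layout with an embeds folder of per-utterance .npy files; the exact paths and filenames below are hypothetical and should be adjusted to your dataset.

from pathlib import Path
import shutil

# Hypothetical paths; adjust to your preprocessed dataset layout.
embeds_dir = Path("datasets_root/SV2TTS/synthesizer/embeds")
reference = embeds_dir / "embed-211-122425-0001.npy"  # the utterance every embedding should point to

for npy_file in embeds_dir.glob("*.npy"):
    if npy_file != reference:
        shutil.copyfile(reference, npy_file)  # overwrite each embedding with the reference copy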

Results

Download audio samples: samples.zip

These are generated with demo_toolbox.py and demonstrate the effect of synthesizer fine-tuning. "Pretrained" uses the original models, and "singlespeaker" uses the fine-tuned synthesizer model with the original vocoder model. I found the #432 changes helpful for benchmarking: all samples are generated with seed=1 and no trimming of silences. The single-speaker model is noticeably better, with fewer long gaps and artifacts for short utterances. However, gaps still occur sometimes: one example is "this is a big red apple." Output is also somewhat better with a fine-tuned vocoder model, though no samples with the new vocoder are shared at this time.

Discussion

This work helps to demonstrate the following points:

  1. Deficiencies with the synthesizer and its pretrained model can be compensated for, to some extent, by fine-tuning on a single speaker. This is much easier than implementing a new synthesizer and requires far less training.
  2. A small dataset of 0.2 hours is sufficient for fine-tuning the synthesizer.
  3. Better single-speaker performance can be obtained with just a few thousand steps of additional synthesizer training.

The major obstacle preventing single-speaker fine-tuning is the lack of a suitable tool for creating a custom dataset. The existing preprocessing scripts are suited to batch processing of organized, labeled datasets and are not helpful unless the target speaker is already part of a supported dataset. The preprocessing does not need to be fully automated, because a small dataset on the order of 100 utterances is sufficient for fine-tuning. I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and facilitate transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository.

Acknowledgements


ghost commented Jul 23, 2020

Pretrained synthesizer + 200 steps of training on VCTK p240 samples (0.34 hours of speech). Still using the original vocoder model. This is just a few minutes of CPU time for fine-tuning. It is remarkable that the synthesizer is already imparting the accent to the result. This is good news for anyone who is fine-tuning an accent: it should not take too long, even for a multispeaker model.

I did notice a lot more gaps and sound artifacts than usual with the finetuned model (this result is cherry-picked). Is it because I did not hardcode all the samples to a single utterance embedding?

samples_vctkp240_200steps.zip


ghost commented Jul 24, 2020

Single-speaker finetuning using VCTK dataset: samples_vctkp240.zip

Here are some samples from the latest experiment. VCTK p240 is used to add 4.4k steps to the synthesizer, and 1.0k to the vocoder. Synthesized audios have filename speaker_utterance_SYN_VOC.wav and use all combinations of pretrained ("pre") and finetuned ("fin") models for the synthesizer and vocoder, respectively.

Synthesized utterances using speaker p240's hardcoded embedding (derived from p240_001_mic1.flac) show that finetuning succeeds in matching the voice, including the accent. Samples made from speaker p260's embedding demonstrate how much cloning quality is lost for other voices when finetuning a single-speaker model.

In these samples, the synthesizer has far more impact on quality than the vocoder, though this result could be due to insufficient finetuning of the vocoder. While the finetuned vocoder has only a slight advantage over the original for p240, it severely degrades voice cloning quality for p260.

Also compare to the samples for p240 and p260 in the Google SV2TTS paper: https://google.github.io/tacotron/publications/speaker_adaptation/

Replicating this experiment

Here is a preprocessed p240 dataset if you would like to repeat this experiment. The embeds for utterances 002-380 are overwritten with the one for 001, as the hardcoding makes for a more consistent result. Use the audio file p240_001.flac to generate embeddings for inference. The audios are not included to keep the file size down, so if you care to do vocoder training you will need to get and preprocess VCTK.
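
If you prefer to generate the inference embedding outside the toolbox, here is a short sketch using the repo's encoder module. It assumes the default pretrained encoder weights path and the dataset layout above; adjust paths as needed.

from pathlib import Path
import numpy as np
from encoder import inference as encoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
wav = encoder.preprocess_wav(Path("dataset_p240/p240_001.flac"))
embed = encoder.embed_utterance(wav)
np.save("p240_001_embed.npy", embed)  # reuse this embedding when synthesizing with the finetuned model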

Directions:

  1. Copy the folder synthesizer/saved_models/logs-pretrained to logs-vctkp240 in the same location. This will make a copy of your pretrained model to be finetuned.
  2. Unzip the dataset files to dataset_p240 in your Real-Time-Voice-Cloning folder (or somewhere else if you desire)
  3. Train the model: python synthesizer_train.py vctkp240 dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100
  4. Let it run for 200 to 400 iterations, then stop the program.
    • This should complete in a reasonable amount of time even on CPU.
    • You can safely stop and resume training at any time, though you will lose any progress made since the last checkpoint
  5. Test the finetuned model in the toolbox using dataset_p240/p240_001.flac to generate the embedding


mbdash commented Jul 24, 2020

Wow that is amazing... I only asked your opinion and you actually did it!

The difference is incredible.

Now I just need to dumb down all you wrote to be able to reproduce it.

Also try your_input_text.replace('hi', 'eye'); it is a little cheat that I find gives better results currently, at least in the multispeaker model.


ghost commented Jul 24, 2020

Now I just need to dumb down all you wrote to be able to reproduce it.

@mbdash In the first post I included a dropbox link that has fairly detailed instructions for the single-speaker LibriSpeech example. You can try that and ask if you have any trouble reproducing the results. If you want VCTKp240 I can make a zip file for you tomorrow.

This was much easier and faster than expected. I am sharing the results to generate interest, so we can collaborate on how much training is needed, best values of hparams, etc.


mbdash commented Jul 24, 2020

Thank you,
I will look at it tomorrow morning. I am only staying up for a few more minutes; I am a bit too tired to think straight right now.

Tonight I am trying to keep it simple and see if I can jam a regular "hand modeled" 3D head mesh into VOCA (Voice Operated Character Animation, another GitHub project).

Update: nope it exploded.


ghost commented Jul 26, 2020

Some general observations to share:

  1. Finetuning improves both quality and similarity with the target voice, and transfers accent.
  2. Decent single-speaker models require as little as 5 min of audio and 400 steps of synthesizer training.
  3. Finetuning the vocoder is not as impactful as finetuning the synthesizer. In fact, given the quality limitations of the underlying models (see #411, "Poor performance in compare to the main paper?"), I would not bother with additional vocoder training.

Also I did another experiment and trained the synthesizer for about 5,000 additional steps on the entire VCTK dataset (trying to help out on #388). The accent still does not transfer for zero-shot cloning. I suspect the synthesizer needs to be trained from scratch if that is the goal.

P.S. @mbdash I updated the VCTKp240 post with a single-speaker dataset if you would like to try that out. #437 (comment)


ghost commented Jul 28, 2020

Also I did another experiment and trained the synthesizer for about 5,000 additional steps on the entire VCTK dataset (trying to help out on #388). The accent still does not transfer for zero-shot cloning. I suspect the synthesizer needs to be trained from scratch if that is the goal.

Changing my mind on training from scratch: I think we just need to add an extra input parameter to the synthesizer which indicates the accent, or more accurately, the dataset it is trained on. A simple implementation might be a single bit representing LibriSpeech or VCTK. Next, finetune the existing models on VCTK with the added parameter. Then for inference, specify the dataset that you want the result to sound like. I'm at a loss as to how to implement this with the current set of models, but I think this repo will have clues: https://github.com/Tomiinek/Multilingual_Text_to_Speech

I'm all done with accent experiments for now but I hope this is helpful to anyone who wants to continue this work.


Adam-Mortimer commented Jul 29, 2020

"I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and facilitate transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository."

Thank you for all your hard work on this repo - even as an almost complete newcomer to deep learning, I've been able to decipher some things, but I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?

@Ori-Pixel

@blue-fish Any reason why I'm getting the following error: "synthesizer_train.py: error: the following arguments are required: synthesizer_root"? I'm trying to run:

synthesizer_train.py H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240\SV2TTS\synthesizer --checkpoint_interval 100

the second argument is the folder that contains embeds, mels, and train.txt

Never mind, I fixed it while writing this. The argument isn't --synthesizer_root like the other arguments, but just the positional synthesizer_root. Also, the above testing instructions are thus wrong (or at least not working for me). The command should be:

python synthesizer_train.py synthesizer_root dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100

(it at least bumped me to a DLL error - still working through that one)


ghost commented Jul 30, 2020

I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?

Hi @Adam-Mortimer. The custom dataset tool is still planned, but currently on hold as I've just started working on #447 (switching out the synthesizer for fatchord's Tacotron). #447 will be bigger than all of my existing pull requests combined, if it ever gets finished. In other words, it's going to take quite some time.

I started writing the custom dataset tool for a voice cloning experiment. I didn't get very far with the tool before I added LibriTTS support in #441, which makes it much easier to create a dataset by putting your data in this kind of directory structure:

datasets_root
    * LibriTTS
        * train-clean-100
            * speaker-001
                * book-001
                    * utterance-001.wav
                    * utterance-001.txt
                    * utterance-002.wav
                    * utterance-002.txt
                    * utterance-003.wav
                    * utterance-003.txt

Where each utterance-###.wav is a short utterance (2-10 sec) and the utterance-###.txt contains the corresponding transcript. Then you can process this dataset using:

python synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments

When this completes, your dataset is in the SV2TTS format and subsequent preprocessing commands (synthesizer_preprocess_embeds.py, vocoder_preprocess.py) will work as described on the training wiki page.
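
If your recordings start out as a flat folder of paired .wav and .txt files, a small script along these lines can lay them out in the structure above. The my_recordings folder and the speaker/book names are hypothetical placeholders.

from pathlib import Path
import shutil

src = Path("my_recordings")  # flat folder of utterance-###.wav / utterance-###.txt pairs
dst = Path("datasets_root/LibriTTS/train-clean-100/speaker-001/book-001")
dst.mkdir(parents=True, exist_ok=True)

for wav in sorted(src.glob("*.wav")):
    txt = wav.with_suffix(".txt")  # transcript must accompany each audio file
    if txt.exists():
        shutil.copy2(wav, dst / wav.name)
        shutil.copy2(txt, dst / txt.name)
    else:
        print(f"Skipping {wav.name}: no transcript found")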

I would still like to write the custom dataset tool but I think #447 is a more pressing matter since the toolbox is incompatible with Python 3.8 due to our reliance on Tensorflow 1.x.


ghost commented Jul 30, 2020

@Ori-Pixel There was a problem with my command and I fixed it. If you are following everything to the letter it should be:

python synthesizer_train.py vctkp240 dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100

Where the first arg vctkp240 describes the path to the model you are training (in this case, it tells the script to look for the model in synthesizer/saved_models/logs-vctkp240), and the second arg is the path to the location containing train.txt and the mels and embeds folders. Please share your results and feel free to ask for help if you get stuck.


Ori-Pixel commented Jul 30, 2020

@blue-fish Thanks. Yeah, I can see that it's saving to a new directory; I'll run it again with the correct params and post results.

Also, thanks for the preprocessing tips you gave to @Adam-Mortimer. I was not looking forward to custom labeling, but it doesn't seem that bad if I only have ~200 lines / ~34 minutes. I'm trying to make a fake (semi-Gaelic) accent video game character say some lines, so I'll probably scrape the audio files from the wiki site, slap them into a folder structure like the above with a simple script, and then run this single-speaker fine-tuning again. And for the accent, I think I can just find a semi-close one in the VCTK dataset (although a 10 GB download will take me a few days, sadly).


ghost commented Jul 30, 2020

@Ori-Pixel If you have a GPU you can quickly run a few experiments to see how far you can trim the dataset before the audio quality breaks down. Simply delete lines from train.txt and they won't be used.
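
A rough sketch of that trimming step, assuming train.txt lists one utterance per line (a backup copy is written first; the counts are only examples):

from pathlib import Path
import random

train_txt = Path("dataset_p240/SV2TTS/synthesizer/train.txt")
lines = train_txt.read_text(encoding="utf-8").splitlines()

train_txt.with_name("train_full.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")  # backup
keep = random.sample(lines, k=min(80, len(lines)))  # keep roughly 80 utterances (~5 minutes)
train_txt.write_text("\n".join(keep) + "\n", encoding="utf-8")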

One of my experiments involved re-recording some of the VCTK p240 utterances with a different voice. 5 minutes of mediocre data (80 utterances) still resulted in a half-decent model. If the labeling is extremely tedious you can try training a model on part of it while continuing to label.

I have preprocessed VCTK, if you can make your decision based on a single recording, request up to 3 speakers and I'll put them on dropbox for you. https://www.dropbox.com/s/6ve00tjjaab4aqj/VCTK_samples.zip?dl=0


ghost commented Jul 30, 2020

Oh, and just to be clear, you cannot train the voice and accent independently at this time. The accent is associated with the voice via the speaker embedding. After #447, we will work on #230 to add the Mozilla TTS implementation of GSTs. That should allow us to generalize accents to new voices.

@Ori-Pixel

@blue-fish

I have preprocessed VCTK, if you can make your decision based on a single recording, request up to 3 speakers and I'll put them on dropbox for you. https://www.dropbox.com/s/6ve00tjjaab4aqj/VCTK_samples.zip?dl=0

Is there a list of their speakers somewhere? I was only able to find the 10 GB file, with not even a magnet link or anything denoting samples or file structure. Realistically, anything Irish, Scottish, or Gaelic would work. I may also look into downloading it directly to Drive (if possible) and even training there (as far as I'm aware, you can mount the drive and run bash).

Oh, and just to be clear, you cannot train the voice and accent independently at this time. The accent is associated with the voice via the speaker embedding. After #447, we will work on #230 to add the Mozilla TTS implementation of GSTs. That should allow us to generalize accents to new voices.

Yeah, I just meant using a pretrained VCTK speaker whose accent wasn't horribly inconsistent with my single speaker's accent, and then fine-tuning with my custom labeled lines on top.

I also have a couple of idle GPUs in my machine, but I always run into venv issues with GPU training, so I'll just use Colab if I really need a GPU. Too bad downloading from a link to


ghost commented Jul 30, 2020

Is there a list of their speakers somewhere?

The zip file I uploaded includes speaker-log.txt (also included in the full VCTK dataset), which lists speaker metadata such as:

ID    AGE  GENDER  ACCENTS  REGION            COMMENTS
p225  23   F       English  Southern England
p226  22   M       English  Surrey
p227  38   M       English  Cumbria
p228  22   F       English  Southern England

@Ori-Pixel

Ah, I see. I'll give it a look tomorrow along with the results and let you know then. Thanks again for being so active!


Ori-Pixel commented Jul 30, 2020

@blue-fish p261 is relatively close. If I could get that slice, that would be very helpful (my internet at my current house is sadly 1 MB/s).

I trained as per the instructions above. Sadly, I didn't get to see the console output as my power went out after about an hour or so, but I did get this in the training logs, so I think this is as far as it trained.

[2020-07-30 01:36:31.676] Step 278202 [28.894 sec/step, loss=0.64379, avg_loss=0.64339]

Also, just to make sure I did the test correctly, this is the command I used:

python demo_toolbox.py -d H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240

Where random seed = 1, enhanced vocoder output is checked, and the embedding was from p240_1.flac.

resulting audio: https://raw.githubusercontent.com/Ori-Pixel/files/master/welcome_to_toolbox_fine_tuned.flac


ghost commented Jul 30, 2020

@Ori-Pixel

Here is the dataset in the same format as p240 (embeds overwritten with the one corresponding to p261_001.flac): https://www.dropbox.com/s/o6fz2r6w56djwkf/dataset_p261.zip?dl=0

resulting audio: https://raw.githubusercontent.com/Ori-Pixel/files/master/welcome_to_toolbox_fine_tuned.flac

Your results sound American to me. Check that you are using the new synthesizer model, then try this text: "Take a look at these pages for crooked creek drive."
And compare to my results for 200 steps: #437 (comment)


Ori-Pixel commented Jul 30, 2020

Check that you are using the new synthesizer model

Ah, I didn't have that dropdown selected. My results are then this, with the same settings:

https://raw.githubusercontent.com/Ori-Pixel/files/master/take%20a%20look%20at%20these%20pages%20for%20crooked%20creek%20drive%20fine%20tuned.flac

I'm also taking your comment above and trying to train my own dataset, but at first I got a "datasets_root folder doesn't exist" error, so I made the folder and added my files. But when I run the preprocessing, I get:

Arguments:
datasets_root:   datasets_root
out_dir:         datasets_root\SV2TTS\synthesizer
n_processes:     None
skip_existing:   False
hparams:
no_alignments:   False
datasets_name:   LibriTTS
subfolders:      train-clean-100
Using data from:
datasets_root\LibriTTS\train-clean-100
LibriTTS:   0%|                                                                            | 0/1 [00:00<?, ?speakers/s]2

gpu warnings here

LibriTTS: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.20s/speakers]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
  File "synthesizer_preprocess_audio.py", line 59, in <module>
    preprocess_dataset(**vars(args))
  File "H:\ttss\Real-Time-Voice-Cloning-master\synthesizer\preprocess.py", line 49, in preprocess_dataset
    print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence


Utterance .txt files have just the text that was spoken, so utterance-000.txt contains "Let's have some fun, shall we..."

Edit: I assume I will need to go through the training docs and start by training the encoder?


ghost commented Jul 30, 2020

@Ori-Pixel You also need to add the --no_alignments option to use a non-LibriSpeech dataset that doesn't have an alignments file. I've also fixed the command in the instructions above. Sorry for leaving that out earlier.

python synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments

Edit: If preprocessing completes without finding a wav file, we should remind the user to pass the --no_alignments flag. Or possibly default it to True if the datasets_name is not LibriSpeech.
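
A rough sketch of that suggested default. This mirrors, rather than copies, the argument parsing in synthesizer_preprocess_audio.py, so treat it as an illustration of the idea, not the script's actual code.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("datasets_root")
parser.add_argument("--datasets_name", default="LibriSpeech")
parser.add_argument("--no_alignments", action="store_true")
args = parser.parse_args()

# Suggested default: alignment files only ship with LibriSpeech, so assume
# every other dataset needs --no_alignments.
if args.datasets_name.lower() != "librispeech" and not args.no_alignments:
    print("Dataset is not LibriSpeech; enabling --no_alignments by default.")
    args.no_alignments = True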

@Ori-Pixel

@blue-fish Okay, so I got it to train, and I can also train my own dataset for the synthesizer. Really thankful for the help. Here are results from 200 steps of training, if you're interested:

https://raw.githubusercontent.com/Ori-Pixel/files/master/crooked_creek_dw.flac

https://raw.githubusercontent.com/Ori-Pixel/files/master/biggest_oversight.flac


ghost commented Jul 31, 2020

@Ori-Pixel Nice! It's remarkable how much that voice comes through after 200 steps of finetuning. In my own experiments going up to 400 steps yields a noticeable improvement in the voice quality. More than 400 doesn't seem to help, though it doesn't hurt either.

Edit: You trained on CPU right? How long did it take?


Ori-Pixel commented Jul 31, 2020

@blue-fish I did train on CPU (I always have issues with GPU setup; luckily I'm building a new PC when the 30xx cards drop with the new Zen 2 AMD CPUs). Training from step 200 to 400, it takes ~25 s per step after the first 20 steps, so around 2 hours for 200 steps on an i5 4690K.

The next steps for me would be encoder/vocoder training, but I don't want to invest the compute power since I'm working on another NLP problem for my actual research (sentiment analysis). I'll let it run overnight again and this time see how far it gets :)

Edit: as @blue-fish said, training to 400 steps made a large difference. Here's an example of the same voice as above, but with 400 steps of training the p261 set on my own collected voice samples:

original voice: https://raw.githubusercontent.com/Ori-Pixel/files/master/Vo_dark_willow_sylph_attack_14.mp3
200 steps: https://raw.githubusercontent.com/Ori-Pixel/files/master/biggest_oversight.flac
400 steps: https://raw.githubusercontent.com/Ori-Pixel/files/master/dark%20willow%20400.flac


adfost commented Aug 3, 2020

@blue-fish I did exactly what you said. After over 10,000 steps with the synthesizer, I open the toolbox, type the text to convert into the box, and the output is unrelated, almost incomprehensible rambling.


ghost commented Aug 3, 2020

@adfost Which set of instructions are you following? LibriSpeech (#437 (comment)) or VCTKp240 (#437 (comment))?

Most likely, when you run synthesizer_train.py it cannot find the pretrained model so it starts training a new synthesizer model from scratch. Please make sure you copied the entire contents of synthesizer/saved_models/logs-pretrained to another "logs-XXXX" folder in the same location, and specify the name (XXXX) to synthesizer_train.py as the first argument.
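
For reference, a minimal sketch of that copy step in Python (an ordinary file-manager copy works just as well; "vctkp240" stands in for whatever run name you pass to synthesizer_train.py):

import shutil

# Copy the pretrained run so finetuning resumes from its checkpoint instead of
# training a new synthesizer from scratch. Fails if the destination already exists.
shutil.copytree("synthesizer/saved_models/logs-pretrained",
                "synthesizer/saved_models/logs-vctkp240")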

@tiomaldy

Can this be done for another language with a single speaker?


ghost commented Oct 7, 2021

I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?

For those recording their own utterances, this is a useful tool: https://github.com/MycroftAI/mimic-recording-studio

Another dataset recording tool: https://github.com/babua/TTSDatasetRecorder

@prince6635

Does anyone have the Dropbox links? They're invalid right now.


maophp commented Jul 2, 2022

the "https ://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=" lost, please fix it tks guys.

@samoliverschumacher

I've made public a repo with a workflow for creating a dataset for synthesizer fine-tuning.

Not sure if this is the best place to let people know, but hopefully it helps someone.

This issue was closed.