Training from scratch #126

Closed
sberryman opened this issue Sep 8, 2019 · 105 comments

Comments

@sberryman

sberryman commented Sep 8, 2019

Thanks for publishing the code and basic training instructions!

Environment

Datasets: (9,063 speakers)

  • LibriTTS (train-other-500)
  • VoxCeleb1
  • VoxCeleb2
  • OpenSLR (42-44, 61-66, 69-80)
  • VCTK

I'm working on adding TEDLIUM_release-3 which would add 1,925 new speakers and potentially SLR68 which would add 1,017 Chinese speakers but would require some clean up as there is a lot of silence in the audio files.

Hyper Parameters:
Left all parameters untouched.

Encoder training:

39,300 steps:
image

115,900 steps: (almost exactly 24 hours of training)
image

Typical step

Step 115950   Loss: 0.9941   EER: 0.0717   Step time:  mean:   889ms  std:  1320ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:  449ms   std: 1317ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:    8ms   std:    2ms
  Loss (10/10):                                    mean:   67ms   std:    7ms
  Backward pass (10/10):                           mean:  237ms   std:   26ms
  Parameter update (10/10):                        mean:  118ms   std:    3ms
  Extras (visualizations, saving) (10/10):         mean:    6ms   std:   18ms

Questions

  1. Will adding an additional ~2,900 speakers make much of a difference for the encoder?
    1. Will adding the remaining LibriTTS datasets (train-clean-100, train-clean-360, dev-clean, dev-other) with 1,221 speakers have any adverse effects training the synthesizer and vocoder?
  2. Does using different languages in the encoder help or hurt?
  3. Does my encoder training thus far look okay? It appears it will take me roughly 7 days to train the encoder up to 846,000 steps.
  4. Can I train the encoder using 16,000Hz while training the synthesizer and vocoder using 24,000Hz? Or do I need to restart and train the encoder on 24,000Hz mel spectrograms?
  5. I've downloaded the source videos for TEDLIUM-3 so I can extract audio at up to 44,100Hz allowing me to expand the synthesizer and vocoder training dataset to TEDLIUM + LibriTTS at 24,000Hz.
  6. Based on other issues I've read, it appears you would like to use fatchord's taco1 implementation. Would you advise I go that route vs nvidia's taco2 pytorch implementation?
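
For reference, the 7-day figure in question 3 follows from the throughput so far. A quick back-of-the-envelope check, assuming the rate of 115,900 steps per 24 hours holds for the rest of training (the variable names are only for illustration):

```python
# Throughput reported above: 115,900 steps in almost exactly 24 hours.
steps_done = 115_900
hours_elapsed = 24
target_steps = 846_000

steps_per_hour = steps_done / hours_elapsed
total_days = target_steps / steps_per_hour / 24
remaining_days = (target_steps - steps_done) / steps_per_hour / 24

print(f"{steps_per_hour:.0f} steps/hour")
print(f"{total_days:.1f} days in total, {remaining_days:.1f} days remaining")
```
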
@CorentinJ
Owner

Great work and great questions! I'll pin this issue for others in need of help.

Firstly, one thing I notice from your profiler output is that you would benefit from a 2x speedup by putting your data on a faster disk (or maybe increasing the number of threads in the DataLoader if you set them too low)
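
The 2x figure can be read off the profiler output in the first post: roughly half of each step is spent waiting for the next batch. A small sketch of that arithmetic, using the mean timings reported at step 115950. It gives an upper bound only, since it assumes the wait could be removed entirely:

```python
# Mean time per stage in milliseconds, copied from the profiler output above.
timings_ms = {
    "waiting for batch": 449,
    "data to cuda": 3,
    "forward pass": 8,
    "loss": 67,
    "backward pass": 237,
    "parameter update": 118,
    "extras": 6,
}

total_ms = sum(timings_ms.values())
wait_ms = timings_ms["waiting for batch"]

# Share of each step spent blocked on the data loader.
wait_fraction = wait_ms / total_ms
# Best-case speedup if the data loader never made the GPU wait.
max_speedup = total_ms / (total_ms - wait_ms)

print(f"waiting: {wait_fraction:.0%} of the step, best-case speedup: {max_speedup:.2f}x")
```
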

  1. Yes, adding more speakers is always good. Not including the entire LibriSpeech dataset was, I believe, a deliberate choice of the SV2TTS authors to highlight the transfer learning aspect of their framework i.e. that the speaker encoder trained on some data will perform well on entirely new data (and a different purpose too).
  2. That's a difficult question. Ideally you would have English-only speakers with a wide range of accents. I can't say that I have a definitive answer; however, if you were to include a wide variety of languages I would recommend moving the speaker embedding size from 256 to 768 (as is done in SV2TTS). You could also do that for English-only speakers; I have simply found 256 to work well so far. A formal evaluation would require computing the EER, and that is still a grey area for me (see the end of section 3.3.3 of my thesis)
  3. Yes, your training looks like mine. You will see the clusters get tighter over time and the loss will continue decreasing steadily. If you have time, you can train for longer than I did (as I did not converge to 100%)
  4. You're technically perfectly fine with different sample rates. Simply, any 24kHz audio you load/generate can be resampled (using librosa's resample function) to 16kHz for the encoder. I haven't tested the repo with different sampling rates, but I think I tried to make it possible to have different ones. I know there's an issue with that in the toolbox (https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/toolbox/__init__.py#L215), we can try to fix it when you need it.
  5. Ok, I didn't know about that dataset but it seems promising.
  6. I would greatly appreciate if someone were to replace entirely the synthesizer with a pytorch one. Both fatchord's and nvidia's would be fine.

@CorentinJ CorentinJ pinned this issue Sep 8, 2019
@sberryman
Author

sberryman commented Sep 8, 2019

Thanks for the quick reply!

I also noticed the blocking operation taking a long time and found it very strange, as the mel spectrograms are stored on a Samsung 960 EVO 1TB NVMe drive and SpeakerVerificationDataLoader has num_workers=16. CPU bounces around from about 50-80% utilization and the disk is showing 4-18% busy. nvidia-smi is showing low utilization. Maybe I completely glossed over the code where you are reading from the wav audio files during training? That would explain it, as the wavs are sitting on a slow spinning disk.

  1. Thanks, I'll work on adding in TEDLIUM-3 into the encoder training set.
  2. I'll restart training with an embedding size of 768 by adjusting model_embedding_size = 768 in https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/params_model.py#L4. Would you adjust the model_hidden_size or any other parameters?
  3. I have noticed them continue to tighten up even with multiple (very diverse) languages and an embedding size of 256.
  4. Good to know on different sampling rates. Do you think I would be better off up-sampling the 16kHz to 24kHz for the embedding and down-sampling the remaining to 24kHz? VoxCeleb(1/2) and VCTK are in 16 kHz while the remaining speakers are in 24 kHz or ~44kHz.
  5. It is a great dataset with a wide range of accents, they only provide the data in 16kHz but it is easy to find the source videos and extract 44kHz audio that aligns perfectly.
  6. Once I get to synthesizer training I'll replace your code with fatchord's or nvidia's.

Edit:
The other thing I thought about for speeding up IO would be stacking the numpy files for each speaker into a single file, as sequential reading is much faster. I would only have to open 10 files per step vs 100. I have plenty of memory in the computer I'm using for training, so maybe that won't be an optimization many others could benefit from?

Edit 2:
I've gone through all the numpy files for each speaker and saved them into a combined file using np.savez, and adjusted the code in encoder/data_objects/speaker.py and encoder/data_objects/utterance.py. I'm now getting a much more consistent and lower load time for the data. Obviously increasing the embedding size from 256 to 768 has almost tripled the backward pass duration. Funny enough, my overall step time has remained about the same even though the embedding size tripled. So I consider that a win!

Step   1030   Loss: 3.2002   EER: 0.2662   Step time:  mean:   871ms  std:    58ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:  103ms   std:   26ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:    7ms   std:    1ms
  Loss (10/10):                                    mean:   73ms   std:    3ms
  Backward pass (10/10):                           mean:  569ms   std:   67ms
  Parameter update (10/10):                        mean:  116ms   std:    3ms
  Extras (visualizations, saving) (10/10):         mean:    1ms   std:    4ms
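
A minimal sketch of the per-speaker packing described in Edit 2, not the actual changes made to speaker.py and utterance.py. It assumes each utterance is stored as a `(frames, mel_channels)` array in its own .npy file; `pack_speaker`, `random_partial` and the `combined.npz` file name are hypothetical:

```python
import tempfile
from pathlib import Path

import numpy as np


def pack_speaker(speaker_dir, out_name="combined.npz"):
    """Combine every .npy utterance of one speaker into a single .npz file."""
    speaker_dir = Path(speaker_dir)
    mels = {p.stem: np.load(p) for p in sorted(speaker_dir.glob("*.npy"))}
    out_path = speaker_dir / out_name
    np.savez(out_path, **mels)  # one archive entry per utterance
    return out_path


def random_partial(npz_path, n_frames, rng):
    """Pick a random utterance from the archive and crop n_frames from it."""
    with np.load(npz_path) as archive:
        name = str(rng.choice(archive.files))
        mel = archive[name]
    start = int(rng.integers(0, mel.shape[0] - n_frames + 1))
    return mel[start:start + n_frames]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rng = np.random.default_rng(0)
        # Fake utterances: 200 frames of 40 mel channels each.
        for i in range(3):
            fake_mel = rng.standard_normal((200, 40)).astype(np.float32)
            np.save(Path(tmp) / f"utt_{i}.npy", fake_mel)
        combined = pack_speaker(tmp)
        print(random_partial(combined, 160, rng).shape)
```

Only one file is opened per speaker, while the random choice of utterance and crop keeps the variety within a batch.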

Edit 3:
I wasn't happy with the backward pass duration so I made the backwards pass run on the GPU. This is what I'm looking at now...

Step    310   Loss: 3.6576   EER: 0.3275   Step time:  mean:   425ms  std:   233ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:  104ms   std:  122ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:   39ms   std:    1ms
  Loss (10/10):                                    mean:   23ms   std:    1ms
  Backward pass (10/10):                           mean:   80ms   std:    5ms
  Parameter update (10/10):                        mean:  121ms   std:    2ms
  Extras (visualizations, saving) (10/10):         mean:    1ms   std:    3ms

..........
Step    320   Loss: 3.6723   EER: 0.3339   Step time:  mean:   322ms  std:    98ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:   60ms   std:   97ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:   39ms   std:    0ms
  Loss (10/10):                                    mean:   22ms   std:    1ms
  Backward pass (10/10):                           mean:   77ms   std:    4ms
  Parameter update (10/10):                        mean:  121ms   std:    2ms
  Extras (visualizations, saving) (10/10):         mean:    2ms   std:    4ms

..........
Step    330   Loss: 3.6419   EER: 0.3309   Step time:  mean:   362ms  std:   140ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:   97ms   std:  139ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:   39ms   std:    1ms
  Loss (10/10):                                    mean:   24ms   std:    3ms
  Backward pass (10/10):                           mean:   78ms   std:    4ms
  Parameter update (10/10):                        mean:  121ms   std:    1ms
  Extras (visualizations, saving) (10/10):         mean:    1ms   std:    3ms

@csu-xiao-an

thank you

@CorentinJ
Owner

CorentinJ commented Sep 11, 2019

  1. Yes, sorry, you should adjust the hidden layer size as well. The way it is done in the GE2E paper is that all recurrent layers have an output of 768, but are projected down to 256 dimensions before being fed to the next. If you want to implement that you'll have to change the network architecture; but if it trains fast enough with 768 as hidden size, then you're fine.
  2. Oh it's definitely going to work fine on different languages. The question is whether you'll manage to achieve an EER as low as on a single language dataset, and by extension a voice transfer that is just as good.
  3. Hmm, you can give that a shot. You should listen to the quality of downsampled/upsampled audios to see what gives (you can do that in a REPL prompt with sounddevice)
  4. I disagree. A whole lot of the source videos were removed from youtube. I know because I tried to guess the source language from the source videos.
  5. Great. I personally recommend fatchord's (I have played around and analyzed both repos already). If you feel like Tacotron 1 might be a downgrade from Tacotron 2, know that it isn't - Tacotron 1 is still used more often than Tacotron 2 in the literature. Fatchord's samples are also great. Know that if you reimplement the synthesizer, you will probably have to change some things so that the data format on the vocoder side is good. We can talk about that again then.

There are quite a few ways to gain disk reading speedups for the encoder, but don't forget that you still need variety in the samples/batches. Another bottleneck is the GPU VRAM not being entirely used. Since the complexity of the forward/backward pass is cubic w.r.t the batch size, you would need to put multiple batches in parallel on the same GPU rather than putting a larger batch size. It's something worth looking into.

I had no idea you could specify to run the backward pass on the gpu, how did you do that?

@sberryman
Author

Thanks for the continuous feedback.

  1. Unfortunately I didn't have the patience to wait for your response so it has been training with the model and data parameters shown below.
## Model parameters:
learning_rate_init: 0.0001
model_embedding_size: 768
model_hidden_size: 256
model_num_layers: 3
speakers_per_batch: 64
utterances_per_speaker: 10

## Data parameters:
audio_norm_target_dBFS: -30
inference_n_frames: 80
mel_n_channels: 40
mel_window_length: 25
mel_window_step: 10
partials_n_frames: 160
sampling_rate: 16000
vad_max_silence_length: 6
vad_moving_average_width: 8
vad_window_length: 30
  1. I trained with ~9,000 speakers (mixed languages but mostly English) through step 352,600 and included the UMAP projections for that below. I then remembered the Common Voice project from Mozilla and downloaded the entire thing. Then I placed all the individual speakers into unique folders and pruned all the speakers that didn't have 10 or more utterances. I then resumed training with the combined datasets bringing the total speakers to 25,668.
    stack_run_umap_358500
    stack_run_umap_358600

  2. Thanks but I'll hold off on changing sample rate for now, already adjusting a lot.

  3. I didn't download them from YouTube, they are available for download from TED.com at https://www.ted.com/talks/quick-list?page=1 and the alignments match TEDLIUM-3. The transcripts available from TED are of higher quality than the ones in TEDLIUM-3 dataset but alignments don't match due to the TED splash screen/banner that plays in the beginning.

  4. Sounds good, Fatchord's version it is! Perfect timing as another person using this repository (@TheButlah) has just made a lot of improvements and included multi-gpu training.
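
The pruning step mentioned in item 1 (dropping speakers with fewer than 10 utterances) could look roughly like this. It is a sketch that assumes one folder per speaker with one file per utterance; the function name and the `*.npy` pattern are hypothetical:

```python
from pathlib import Path


def speakers_with_enough_utterances(dataset_root, min_utterances=10, pattern="*.npy"):
    """Return the names of speaker folders holding at least min_utterances files."""
    kept = []
    for speaker_dir in sorted(Path(dataset_root).iterdir()):
        if not speaker_dir.is_dir():
            continue
        n_utterances = sum(1 for _ in speaker_dir.glob(pattern))
        if n_utterances >= min_utterances:
            kept.append(speaker_dir.name)
    return kept
```
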

The combined npz files have been working great for me, it will load all the utterances per speaker and still uses your same sampling code to grab a random sample per speaker. The only thing I removed is loading from individual npy files.

I assume I changed the backwards pass to GPU; either way the GPU utilization is much higher and the profiler is showing significantly lower mean durations for "Backward pass". I changed the loss_device to run on the GPU.

Then on https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/model.py#L27-L28

self.similarity_weight = nn.Parameter(torch.tensor([10.]).to(loss_device))
self.similarity_bias = nn.Parameter(torch.tensor([-5.]).to(loss_device))

Simply moved the tensor not the parameter to the GPU and changed the GPU sync in train.py to:

def sync(device: torch.device):
    # FIXME
    # return
    # For correct profiling (cuda operations are async)
    if device.type == "cuda":
        # torch.cuda.synchronize(device)
        torch.cuda.synchronize()

I'm now up to step 447,200 and included the loss and UMAP to show progress. I also changed the UMAP visualization to show 30 speakers by adding more colors to the color map.

37b6c18ff7d7d4
37b6c19208f994

cv_run_umap_447200
cv_run_umap_447300

New color map

colormap = np.array([
    [32, 25, 35],
    [255, 255, 255],
    [252, 255, 93],
    [125, 252, 0],
    [14, 196, 52],
    [34, 140, 104],
    [138, 216, 232],
    [35, 91, 84],
    [41, 189, 171],
    [57, 152, 245],
    [55, 41, 79],
    [39, 125, 167],
    [55, 80, 219],
    [242, 32, 32],
    [153, 25, 25],
    [255, 203, 165],
    [230, 143, 102],
    [197, 97, 51],
    [150, 52, 28],
    [99, 40, 25],
    [255, 196, 19],
    [244, 122, 34],
    [47, 42, 160],
    [183, 50, 204],
    [119, 43, 157],
    [240, 124, 171],
    [211, 11, 148],
    [237, 239, 243],
    [195, 165, 180],
    [148, 106, 162],
    [93, 76, 134],
    [0, 0, 0],
    [183, 183, 183],
], dtype=float) / 255  # np.float was removed in NumPy 1.24; the builtin float is equivalent

@CorentinJ
Owner

CorentinJ commented Sep 11, 2019

Ah, I had put a warning not to compute the loss on GPU because for some reason it wasn't working (either it was some intricacies with torch or I forgot to enable grad on some tensor) and would return None. If that works, then I should update the repo to make it the default and have only 1 device for the encoder.

@sberryman
Author

You are correct, it was not working until I changed the two lines to move the tensor to the GPU not the parameter. That was all I had to change (I believe, if not I can dig through all my changes and help you isolate that fix.) Technically I changed loss_device to loss_device = device just so I didn't miss anything in train.py. Either way, only one GPU is exposed to my docker container used for training.

Also in the sync function, I had to remove the device parameter and simply use torch.cuda.synchronize()
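
Why the order matters, as far as I understand it: calling .to() on an nn.Parameter that actually has to be converted returns a plain, non-leaf tensor, so the module no longer registers it as a parameter and its .grad stays None. The sketch below reproduces the effect on CPU, using a dtype conversion as a stand-in for the move to the GPU. The two class names are made up for illustration:

```python
import torch
from torch import nn


class ParamThenMove(nn.Module):
    """Wraps in nn.Parameter first, converts second (the problematic order)."""

    def __init__(self, dtype):
        super().__init__()
        # .to() returns a new non-leaf tensor, not a Parameter.
        self.similarity_weight = nn.Parameter(torch.tensor([10.])).to(dtype)


class MoveThenParam(nn.Module):
    """Converts the raw tensor first, then wraps it (the working order)."""

    def __init__(self, dtype):
        super().__init__()
        self.similarity_weight = nn.Parameter(torch.tensor([10.]).to(dtype))


broken = ParamThenMove(torch.float64)
fixed = MoveThenParam(torch.float64)

# Number of registered parameters, and whether the attribute is a leaf tensor.
print(len(list(broken.parameters())), broken.similarity_weight.is_leaf)
print(len(list(fixed.parameters())), fixed.similarity_weight.is_leaf)
```

With the first order the optimizer never sees the weight, which would be consistent with the missing gradient mentioned above.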

Clusters are getting tighter but I plan on training until at least 700-900k steps. I'm also tempted to train an English only model to compare.

@TheButlah

TheButlah commented Sep 11, 2019

@sberryman will you be submitting a pull request? I'd be very interested to see the results using more data for the speaker encoder - the GE2E paper demonstrated that having more data for the encoder is critical to getting the similarity of the cloned speaker close to the original.

Also, in my own experience, the compatibility of Fatchord's Taco1 with WaveRNN makes it a great candidate, and the codebase is easy to work with. I still believe that Taco2 would be an upgrade in terms of quality of the inflection of the speaker, but the out-of-the-box compatibility of Fatchord's synthesizer with the vocoder makes it a natural choice.

Do note that Fatchord's synthesizer does not support multiple speakers, so you would need to add that capability yourself (and a PR on Fatchord's repo would be especially appreciated for adding that capability :) )

@ViktorAlm

ViktorAlm commented Sep 12, 2019

I'm also very interested in the results. I'm currently training the encoder on about 2k speakers in Swedish and about 4k mixed, mainly English. I would really like to see examples from your encoder model on multiple languages to see if it's worth crawling radio and TV shows with Resemblyzer's diarization to create a fully Swedish dataset, or if 6k with 1/3 being Swedish can compare to 25k mixed, mainly English, for Swedish voice cloning. My hunch is m0ar data

@sberryman
Author

Current:

I'm at ~700k steps and still seeing quite a few tight clusters. I'm not sure if this is due to the fact that I trained for 350k steps on 9,000 speakers prior to adding 16,668 more speakers (which also introduced quite a few more languages). I'm going to continue training for another 200k steps, which will be done this time tomorrow morning.

image
image
image
image

To-Do:

  1. I'm going to start a new training for English only (there are a few non-english speakers) with ~17,680 speakers using 768/768 (hidden/embedding) size.
  2. Once the mixed set reaches ~900k steps I will stop it and start it over from scratch with 768/768 as it is currently training on 256/768 (256 hidden and 768 embedding size) as I wasn't aware I had to bump both to 768.

Comments

@TheButlah

First, thanks for the massive PR that landed on Fatchord's WaveRNN 4 days ago, really excited you added multi-gpu training and mels in numpy format! To your question on a PR, I can certainly submit PRs to this repo and WaveRNN. The code to utilize most of the datasets from OpenSLR and Common Voice is a bit of a hack, but if people want it I'm open to working on a PR for that as well.

Thanks for the feedback on Taco1 and WaveRNN from Fatchord's repo, that will be the route I will go. I will most likely run into issues adding multi-speaker but I will start an issue in that repo when I get there.

@ViktorAlm

Great to hear about someone else testing multiple languages! Have you changed any of the data or model parameters? Funny you mentioned using Resemble's diarization as I've had a tab open to that code for a few days and planned on using it against 7,000 hours of local (English) news video I have. That is once I finished training a new model.

As far as sharing the models I'm training, I'm open to it. Here is the model trained to 697,500 steps (768 model embedding size and 256 hidden layers.)
https://www.dropbox.com/s/2b5g2rt4vypx9qq/cv_run_bak_697500.pt?dl=0

Would be interested to know how it performs against your Swedish data @ViktorAlm.

@ViktorAlm

ViktorAlm commented Sep 12, 2019

Thanks! I have not changed any params. I was on step 150k with my data to try and do a real run with all the models. I did one where I only did 100k steps on each model with about 900 Swedish speakers and about 90 GB of data in total. It did not clone the voice but produced good audio quality, and at least a male voice came out when I ran my own voice. I paused it and did a quick test with yours, and the encoding result is way better than the small test run I did.

Swedish and Norwegian are pretty similar. I didn't see any specific Swedish/Norwegian cluster gathering, but I only did two tests and UMAP might remove any visible difference, I guess.

Here's a converter if you wish to add Norwegian, Danish and Swedish data to your mix:
https://github.com/ViktorAlm/Nasjonalbank-converter

I also added some results from your encoder in /Results.

When I've played around a bit more I might make a script that evaluates different languages better.

@sberryman
Author

@ViktorAlm Thanks for sharing!

Is your Swedish and Norwegian dataset private? I'm up for including those speakers in the next training run where I use 768 for hidden/embedding size if you can share. There are only 20 Swedish voices in the 25,668 speakers I am training on and zero Norwegian. Common voice had 44 speakers for Swedish but I filtered those down to 20 as I had a floor of 12 unique utterances per speaker.

Other updates

  1. I started the English only 768/768 training which takes significantly longer per step (about 4x) so don't expect those results for a while. Progress looks good so far though and it is only on step 8,100!
    image
    image
    image
    image
  2. I've reduced the learning rate from 1e-4 to 1e-5 on the mixed dataset which seems to help. I'll probably drop it down to 1e-6 around step 800-850k.

If anyone else is aware of other datasets I can include please let me know!

@ViktorAlm

ViktorAlm commented Sep 12, 2019

Nice!

I edited my old comment because I did not want to clutter your thread with my bad screenshots. I added my converter with links to the datasets. It's very hacky, and if you want to add them I really should clean up the code some. I think a simple merge of the folders, then looping through to get the spls (files with info on location etc.) and loading the files, would be the best way instead of my weird way of scanning the folders. I was testing on just one of the extracted folders and the speech folders did not contain the wavs which were specified in the spl file. Then everything went weird from there.

https://github.com/ViktorAlm/Nasjonalbank-converter

@CorentinJ
Owner

Just in case this wasn't clear, Resemblyzer is also my project and is merely an interface to the speaker encoder of this repo. You can replace the pretrained model in the package and put yours instead. I could also distribute models that you provide me for other languages.

@CorentinJ
Owner

I also would like to leave my script for evaluating the EER over the test set. It's not clean and I'm not sure if it's correct either (given that you won't find anywhere the right procedure to evaluate the EER over a dataset). You should use this if you want to formally evaluate the performance of the speaker encoder.

If someone manages to make it better then I would gladly include it in the repo

from encoder.data_objects import SpeakerVerificationDataLoader, SpeakerVerificationDataset
from encoder.model import SpeakerEncoder
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import torch


# This is my script for computing the test EER.
dataset_root = r"E:\Datasets\SV2TTS\encoder_test"

if __name__ == '__main__':
    speakers_per_batch = 32
    steps = 100
    
    dataset = SpeakerVerificationDataset(Path(dataset_root))
    
    model = SpeakerEncoder(torch.device("cuda"), torch.device("cpu"))
    checkpoint = torch.load("saved_models/pretrained.pt")
    model.load_state_dict(checkpoint["model_state"])
    model.eval()
    
    results = []
    for utterances_per_speaker in range(6, 8):
        loader = SpeakerVerificationDataLoader(
            dataset,
            speakers_per_batch=speakers_per_batch,
            utterances_per_speaker=utterances_per_speaker,
            num_workers=8,
        )
        with torch.no_grad():
            eers = []
            for step, speaker_batch in zip(range(1, steps + 1), loader):
                inputs = torch.from_numpy(speaker_batch.data).cuda()
                embeds = model(inputs)
                embeds_loss = embeds.view((speakers_per_batch, utterances_per_speaker, -1)).cpu()
                _, eer = model.loss(embeds_loss)
                
                eers.append(eer)
                print("Step %d    EER: %.3f" % (step, np.mean(eers)))
        results.append(np.mean(eers))
        
    plt.plot(range(6, 8), results)  # x values must cover the same range as the loop above
    plt.xlabel("Enrollment utterances")
    plt.ylabel("Equal Error Rate")
    plt.show()
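
For anyone who wants to sanity-check the EER outside of the training code, here is a generic threshold-sweep sketch. It is not the procedure used by `model.loss` in this repo; it assumes you already have a similarity score and a same-speaker label for each trial pair, and `equal_error_rate` is a hypothetical helper name:

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a threshold over the observed scores.

    scores: similarity per trial (higher means more likely the same speaker).
    labels: True/1 for same-speaker trials, False/0 for different speakers.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)

    # A trial is accepted when its score is >= the threshold.
    false_accept = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    false_reject = np.array([(scores[labels] < t).mean() for t in thresholds])

    # The EER is where both error rates meet; take the closest threshold.
    idx = np.argmin(np.abs(false_accept - false_reject))
    return (false_accept[idx] + false_reject[idx]) / 2


if __name__ == "__main__":
    print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

The sweep only visits thresholds that occur in the data, so with few trials the result is coarse.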

@CorentinJ
Owner

Also I don't know about that:

I've reduced the learning rate from 1e-4 to 1e-5 on the mixed dataset which seems to help. I'll probably drop it down to 1e-6 around step 800-850k.

  1. I've left my lr to 1e-4 all along, I think you should be fine with that same value as well

  2. Don't forget that I never managed to fully train my speaker encoder. I trained it for 1M steps but the authors of sv2tts trained it for 50M steps. You should aim for more if you can.

@sberryman
Author

Thanks @CorentinJ

Well aware Resemblyzer is your project, that is how I ended up finding it. Thanks for open sourcing that project as well. Looking forward to seeing what your next project is!

Thanks for the test script, I was thinking about how I was going to evaluate the models I'm training and would be great to compare these to your public model. Originally I was just going to plot a random 5-10 utterances for every single speaker to get an idea of the overall distribution.

Interesting on not adjusting the learning rate; I'm more accustomed to training image classification models where reducing/decaying the learning rate is almost a requirement. I will not adjust the learning rate any further then.

I was not aware the SV2TTS authors trained for 50M steps, obviously it is time for me to read their paper.

Also, this is turning into more of a discussion than an "issue". I'm happy to move it to another location or can continue using GitHub issues; completely up to you.

Thanks again!

@CorentinJ
Owner

Nah it's common for issues to serve a broader purpose than just solving bugs. I don't decay the learning rate simply because it's not a necessity with Adam. The original authors did not use Adam and they did decay the learning rate by the way. Also, you will have to read GE2E to know more about the speaker encoder, because there isn't much info in SV2TTS about how they train or evaluate it.

@slavaGanzin

slavaGanzin commented Sep 13, 2019

@sberryman Shaun, it would be awesome if you'd create a PR. If you don't feel it's polished enough, just mark it WIP. That way it won't be merged, but it will still be an inspiration for others :)

@sberryman
Author

sberryman commented Sep 13, 2019

@slavaGanzin I have pushed my work in progress to my own fork. There are hard coded paths and changes related to grouping all the .npy files into a single .npz for each speaker. I also use docker and volume mappings so I left the basic Dockerfile in there. I don't plan on ever submitting a PR for that branch as I'm still experimenting quite heavily. Basically, feel free to use any of the scripts as a starting point but don't count on them working out of the box.

https://github.com/sberryman/Real-Time-Voice-Cloning/tree/wip

Other updates

  1. Mixed model with 256 hidden and 768 embedding size has finally hit 1,000,000 steps. Based on feedback from @CorentinJ I'm going to let that continue training for a while longer.
    image

Model trained to 1,005,000 is available on my dropbox account now. https://www.dropbox.com/s/69wv21ajt6l2pag/cv_run_bak_1005000.pt?dl=0

  1. English only model is progressing VERY slowly!
    image

@Jessicamat777

Hi sberryman, can you tell me which languages the trained model you put on Dropbox supports?

@Jessicamat777

I need Chinese pretrained models for a project in grad school. Can you guide me on that?

@sberryman
Author

sberryman commented Sep 16, 2019

@Jessicamat777 the models I have uploaded to Dropbox are all for experimentation and I have NOT trained the synthesizer or vocoder on them yet. So they will be of little value unless you want to use them with CorentinJ's Resemblyzer.

That being said, the models on dropbox are from the following datasets.

  1. LibriTTS (train-other-500)
  2. VoxCeleb1
  3. VoxCeleb2
  4. OpenSLR (42-44, 61-66, 69-80)
  5. VCTK
  6. Common Voice

A vast majority of the speakers are English. Based on a very tiny sampling against languages it has NOT been trained on, it doesn't appear the foreign speakers make much of a difference. That is most likely due to the unbalanced training set and extremely small number of speakers per additional language. I just wanted to see if it made a difference including foreign languages while training. Meaning the clusters for foreign languages are okay but nowhere near as well defined as English speakers.

Look at this issue where I show how my model(s) perform against the one trained by CorentinJ on Swedish and Norwegian.
resemble-ai/Resemblyzer#9

I haven't made an effort to train on Chinese but it shouldn't be difficult if you have enough data. CorentinJ has done a great job of documenting the training process and answering questions on what size dataset you would need to train from scratch.

@Jessicamat777

Jessicamat777 commented Sep 16, 2019 via email

@sberryman
Author

@Jessicamat777 multi-gpu training is NOT implemented. If you do implement it, can you please submit a pull-request to this repository so others can benefit?

@sberryman
Author

Training is still progressing on the mixed and english models. This is just to update anyone if they are following this issue.

Mixed

image

English

image

@shawwn

shawwn commented Sep 19, 2019

Training the encoder is interesting, but I'm not entirely convinced that the problem is the encoder. (Where "problem" is defined as "the current model has a lot of trouble reproducing female voices accurately.")

Are we certain that for every possible human voice, there exists an embedding which allows tacotron2 to produce spectrograms indistinguishable from that voice?

If not, then it seems beneficial if tacotron2 were trained on the new diverse speech dataset in addition to the encoder.

For example, in my experiments it has seemed impossible to generate spectrograms with cartoon-style inflections: lots of expressive vocalizations, rapid pitch changes, and so on.

If that's how a speaker sounds normally, then it seems like it's impossible for the encoder to generate any latent vector that would cause tacotron2 to produce spectrograms that sound anything like the speaker.

Perhaps I am confused, but just to confirm: there are three separate things that need to be trained, right? The encoder, the synthesizer (text to spectrogram), and the vocoder (spectrogram to wav). This training process is focusing entirely on the encoder. How is the loss being calculated? If the loss is calculated in terms of "tacotron2 is able to generate spectrograms that sound more like this speaker," then the training here will not have a huge impact on overall quality or diversity. The training would need to be done on the synth, then the encoder.

Do I have this backwards? Is it true that the encoder's final quality is bounded by the expressiveness of the synth? If that's correct, then the synth is what would benefit from the larger dataset.
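On the question of how the encoder loss is calculated: in the SV2TTS paper the speaker encoder is trained on its own, as a speaker-verification model with the GE2E loss, so the synthesizer does not appear in its objective. The sketch below is a simplified pure-Python illustration of the softmax variant of that loss, not the repository's implementation. The function names are made up for this example, and `w` and `b` are fixed to the initial values from the GE2E paper (10 and -5), whereas in real training they are learned.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def ge2e_softmax_loss(embeds, w=10.0, b=-5.0):
    """embeds[j][i] is the embedding of utterance i of speaker j.

    Every speaker needs at least two utterances, because an utterance is
    left out of its own speaker's centroid when it is scored against it.
    """
    centroids = [mean_vector(utterances) for utterances in embeds]
    total, count = 0.0, 0
    for j, utterances in enumerate(embeds):
        for i, e in enumerate(utterances):
            # Centroid of the same speaker, computed without utterance i.
            own_centroid = mean_vector(utterances[:i] + utterances[i + 1:])
            logits = [
                w * cosine(e, own_centroid if k == j else c) + b
                for k, c in enumerate(centroids)
            ]
            # Cross-entropy with the true speaker j as the target class.
            peak = max(logits)
            log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
            total += log_sum_exp - logits[j]
            count += 1
    return total / count


# Toy 2-D embeddings: two speakers, two utterances each (made-up values).
toy_batch = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
print(ge2e_softmax_loss(toy_batch))
```

Each utterance embedding is pulled toward its own speaker's centroid and pushed away from the other centroids. Nothing about spectrogram quality enters this objective.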

@CorentinJ
Owner

Training the encoder is interesting, but I'm not entirely convinced that the problem is the encoder. (Where "problem" is defined as "the current model has a lot of trouble reproducing female voices accurately.")

It's not intuitive, I agree. However, this is clearly the conclusion the authors of the SV2TTS paper reached. They argue that most of the ability to clone voices lies in the training of the encoder. They also clearly show that the framework has limitations (which we observe in this repo as well):

An additional limitation lies in the model’s inability to transfer accents. Given sufficient training data, this could be addressed by conditioning the synthesizer on independent speaker and accent embeddings. Finally, we note that the model is also not able to completely isolate the speaker voice from the prosody of the reference audio, ...

If you listen to their LibriSpeech samples, you will notice that as well.

@sberryman
Author

Training updates

Encoder

I've stopped training both the mixed and English encoders. The mixed encoder reached just over 2.1 million steps with 27,432 speakers.

Synthesizer

Since I'm using LibriTTS, I had to make some changes to the code base. First I used the Montreal Forced Aligner to generate the alignments. Then I realized Google had already normalized the audio and removed the leading and trailing silence. So at this point I skipped the alignment portion of preprocessing and used the original transcript (as opposed to the normalized one, which is also provided) with all punctuation and capitalization left in place. I know the English cleaner converts everything to lowercase, though.

I started training last night across two GTX 1080 Ti's and GPU utilization bounces between 20% and 93%.

Overridden hparams:

  • tacotron_num_gpus=2
  • tacotron_batch_size=64
  • sample_rate=24000
  • win_size=1200
  • hop_size=300
  • n_fft=2048
  • speaker_embedding_size=768
  • rescale=False
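For orientation, the framing these overrides imply can be worked out directly, assuming `win_size`, `hop_size` and `n_fft` are all measured in samples (as is usual for STFT parameters). The helper name below is made up for this sketch:

```python
def stft_framing(sample_rate, win_size, hop_size, n_fft):
    """Derive frame timing from STFT hyperparameters given in samples."""
    if n_fft < win_size:
        raise ValueError("n_fft must be at least win_size")
    return {
        "window_ms": 1000.0 * win_size / sample_rate,
        "hop_ms": 1000.0 * hop_size / sample_rate,
        "frames_per_second": sample_rate / hop_size,
        # Number of bins in a one-sided FFT of length n_fft.
        "linear_freq_bins": n_fft // 2 + 1,
    }


framing = stft_framing(sample_rate=24000, win_size=1200, hop_size=300, n_fft=2048)
print(framing)
```

With the values above this gives a 50 ms window, a 12.5 ms hop (80 frames per second) and 1025 linear frequency bins before the mel projection.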

Training progress

TensorBoard

image
image

Stdout

Step   27753 [1.664 sec/step, loss=0.68117, avg_loss=0.67622]
Step   27754 [1.690 sec/step, loss=0.64809, avg_loss=0.67585]
Step   27755 [1.687 sec/step, loss=0.68754, avg_loss=0.67603]
Step   27756 [1.686 sec/step, loss=0.67575, avg_loss=0.67593]
Step   27757 [1.675 sec/step, loss=0.65758, avg_loss=0.67573]
Step   27758 [1.684 sec/step, loss=0.66391, avg_loss=0.67550]
Step   27759 [1.687 sec/step, loss=0.66689, avg_loss=0.67528]
Step   27760 [1.710 sec/step, loss=0.66279, avg_loss=0.67525]
Step   27761 [1.681 sec/step, loss=0.69119, avg_loss=0.67565]
Step   27762 [1.679 sec/step, loss=0.67129, avg_loss=0.67552]
Step   27763 [1.677 sec/step, loss=0.69174, avg_loss=0.67563]
Step   27764 [1.693 sec/step, loss=0.65657, avg_loss=0.67544]
Step   27765 [1.692 sec/step, loss=0.66381, avg_loss=0.67518]
Step   27766 [1.672 sec/step, loss=0.70290, avg_loss=0.67546]
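A rough way to summarize output like this is to parse the lines and average the reported step time. The sketch below assumes the log format is exactly as shown above; the regular expression and function name are written for this example and are not part of the training code.

```python
import re
from statistics import mean

# Matches lines such as:
# Step   27753 [1.664 sec/step, loss=0.68117, avg_loss=0.67622]
LOG_LINE = re.compile(
    r"Step\s+(\d+)\s+\[([\d.]+) sec/step, loss=([\d.]+), avg_loss=([\d.]+)\]"
)


def parse_log(text):
    """Return a list of (step, sec_per_step, loss, avg_loss) tuples."""
    return [
        (int(m.group(1)), float(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in LOG_LINE.finditer(text)
    ]


sample = """\
Step   27753 [1.664 sec/step, loss=0.68117, avg_loss=0.67622]
Step   27754 [1.690 sec/step, loss=0.64809, avg_loss=0.67585]
Step   27755 [1.687 sec/step, loss=0.68754, avg_loss=0.67603]
"""

rows = parse_log(sample)
sec_per_step = mean(row[1] for row in rows)
steps_per_hour = 3600.0 / sec_per_step
print(f"{sec_per_step:.3f} sec/step, about {steps_per_hour:.0f} steps/hour")
```

At roughly 1.68 seconds per step, this run advances a little over 2,100 steps per hour.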

Plots

step-22000-align
step-22000-mel-spectrogram
step-24000-align
step-24000-mel-spectrogram
step-26000-align
step-26000-mel-spectrogram

WAVs

wavs.zip

Questions:

  1. Is it normal for the max_gradient_norm, stop_token_loss and regularization_loss to be increasing? Basically, do the tensorboard plots look okay?
  2. How many steps did you train the synthesizer?
  3. How many steps did you train the vocoder?

@ghost

ghost commented Jun 27, 2020

Thanks for the feedback, @LordBaaa. I generated that sample five times on the 428k model trying to get that pop to go away before I became convinced that it was a feature of the model.

@Oktai15

Oktai15 commented Jun 29, 2020

Hello @sberryman! Could you provide the pretrained weights from #126 (comment) for the mixed version?

@Liujingxiu23

@blue-fish The wavs that you shared sound good! Are they just the output of the vocoder, or end-to-end results that use the encoder to predict the embedding and then the tacotron and vocoder models to synthesize?

@ghost

ghost commented Jul 6, 2020

@Liujingxiu23 They are end-to-end results where I replicate the audio samples of the SV2TTS paper: https://google.github.io/tacotron/publications/speaker_adaptation/

I use the reference audio from VCTK p240 and p260 to create the embedding and generate synthesized samples #0 and #1 using tacotron and the vocoder model.

@sberryman
Author

@Oktai15 I thought I had posted the links to the encoder for the mixed version. The tacotron and vocoder weights that I trained are useless. However, the encoder is quite good.
https://www.dropbox.com/s/xl2wr13nza10850/encoder.zip?dl=0

@ghost

ghost commented Jul 7, 2020

@Oktai15 I think these are the settings you need to use @sberryman 's mixed encoder: #126 (comment)

I have not tried it though. Please let us know if it works for you.

@Liujingxiu23

Liujingxiu23 commented Jul 7, 2020

@blue-fish
Thank you for your reply!

Did you train the encoder, synthesizer, and vocoder yourself, as follows?
Encoder: trained 1.56M steps (20 days with a single GPU) with a batch size of 64
Synthesizer: trained 256k steps (1 week with 4 GPUs) with a batch size of 144
Vocoder: trained 428k steps (4 days with a single GPU) with a batch size of 100
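As a rough sense of scale, the figures quoted above can be turned into total batch elements processed and average wall-clock time per step. This is plain arithmetic on the quoted numbers (taking "1 week" as 7 days). What one batch element represents differs per model, and the dictionary layout is just for this sketch:

```python
SECONDS_PER_DAY = 86400

# Steps, batch size and training duration as quoted in the list above.
runs = {
    "encoder": {"steps": 1_560_000, "batch_size": 64, "days": 20},
    "synthesizer": {"steps": 256_000, "batch_size": 144, "days": 7},
    "vocoder": {"steps": 428_000, "batch_size": 100, "days": 4},
}


def summarize(run):
    """Return (batch elements processed, average seconds per step)."""
    elements = run["steps"] * run["batch_size"]
    sec_per_step = run["days"] * SECONDS_PER_DAY / run["steps"]
    return elements, sec_per_step


for name, run in runs.items():
    elements, sec_per_step = summarize(run)
    print(f"{name}: {elements:,} batch elements, {sec_per_step:.2f} s/step")
```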

I trained the encoder and synthesizer on a Chinese corpus, but the results are not as good as yours.

For the encoder, did you remove the ReLU activation function from the last linear layer?
For the synthesizer, did you use the same data (VCTK + LibriSpeech) as the paper?

@ghost

ghost commented Jul 7, 2020

@Liujingxiu23 The info about the model training comes from this page: https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Pretrained-models

The encoder and synthesizer are the original models by @CorentinJ. All I did was take his original vocoder model and continue training it to see what would result. I didn't even change any parameters except to cut the batch size in half (100 to 50) so it would fit in my GPU's limited memory.

Edit: In case it is not clear, I used the training code in the repo without modification. I also used the same datasets (LibriSpeech train-clean-100 and -360) and processed them following these instructions: https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training

Also, since Chinese is your target language, you should see @kuangdd 's work here: #30 (comment) if you haven't already.

@Liujingxiu23

@blue-fish I see. Thank you very much.

@mennatallah644

@shawwn I've uploaded the models to my dropbox. The vocoder is still training and will be for another 24-48 hours. Please share whatever you end up making with them!

Encoder

https://www.dropbox.com/s/xl2wr13nza10850/encoder.zip?dl=0

Synthesizer (Tacotron)

https://www.dropbox.com/s/t7qk0aecpps7842/tacotron.zip?dl=0

Vocoder

https://www.dropbox.com/s/bgzeaid0nuh7val/vocoder.zip?dl=0

Dear All,
I've downloaded the models from @sberryman and adapted the hyperparameters accordingly.
I created a few examples with them. I observe the following:

1. the sound quality is pretty good (clearly understandable, no bleeps or blops etc.)

2. the voice does not resemble the reference embedding. it's like a 'generic' voice.

I wonder why that is. Did anybody else experience this?
Thanks!

I am also experiencing that.
Did you solve this issue?
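One way to put a number on "does not resemble the reference" is to embed both the reference clip and the synthesized clip with the encoder and compare the two vectors by cosine similarity. A minimal sketch, assuming the embeddings have already been computed and are available as plain lists of floats (the function name and the vectors below are made up for illustration):

```python
import math


def embedding_similarity(reference, generated):
    """Cosine similarity between two speaker embeddings of equal length."""
    if len(reference) != len(generated):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(r * g for r, g in zip(reference, generated))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_gen = math.sqrt(sum(g * g for g in generated))
    if norm_ref == 0.0 or norm_gen == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_ref * norm_gen)


# Made-up 4-dimensional vectors standing in for real speaker embeddings.
reference_embed = [0.2, 0.7, 0.1, 0.4]
generated_embed = [0.3, 0.6, 0.2, 0.4]
print(embedding_similarity(reference_embed, generated_embed))
```

If synthesized clips for very different reference speakers all score about the same against their references, that would be consistent with the "generic voice" observation. It would not by itself say whether the synthesizer or the hyperparameter setup is responsible.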

This issue was closed.