
Pretrained fastspeech2 libritts model for testing? #325

Closed
ronggong opened this issue Oct 23, 2020 · 38 comments
Assignees: dathudeptrai
Labels: question ❓ Further information is requested, wontfix

Comments

@ronggong

Hi,

Thanks for the nice work. Is there a pretrained FastSpeech2 LibriTTS model for testing, like the one trained on LJSpeech data? https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing

@dathudeptrai dathudeptrai self-assigned this Oct 24, 2020
@dathudeptrai dathudeptrai added the question ❓ Further information is requested label Oct 24, 2020
@dathudeptrai
Collaborator

@ronggong I don't have enough resources to train LibriTTS; I hope someone can try and make a pull request :D

@ronggong
Author

@dathudeptrai Hi, do you have an estimate of how many GPUs and how much training time would be needed?

@dathudeptrai
Collaborator

> @dathudeptrai Hi, do you have an estimate of how many GPUs and how much training time would be needed?

With a 2080 Ti, you can finish training FastSpeech2 in around 8 hours :D

@ronggong
Author

ronggong commented Oct 25, 2020

@dathudeptrai I tried training with 2 GPUs and batch_size 16, which is slower than using 1 GPU with the same batch_size. Both GPUs consume the same amount of memory. Does the batch_size in the YAML apply per GPU, so the actual batch size is 32?

@dathudeptrai
Collaborator

> @dathudeptrai I tried training with 2 GPUs and batch_size 16, which is slower than using 1 GPU with the same batch_size. Both GPUs consume the same amount of memory. Does the batch_size in the YAML apply per GPU, so the actual batch size is 32?

Yes :)). See here: https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2#step-2-training-from-scratch

If you want to use multi-GPU training, you can replace CUDA_VISIBLE_DEVICES=0 with, for example, CUDA_VISIBLE_DEVICES=0,1,2,3. You also need to tune the batch_size per GPU (in the config file) yourself to maximize performance. Note that multi-GPU is now supported for training but not yet for decoding.
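A minimal sketch (not the repo's training script) of what this means for the effective batch size, assuming a MirroredStrategy-style multi-GPU setup:

```python
import os

# Make two GPUs visible before TensorFlow is imported (illustrative device ids).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf

# The batch_size in the config YAML is applied per replica (per GPU), so the
# global batch size scales with the number of visible GPUs.
strategy = tf.distribute.MirroredStrategy()
per_gpu_batch_size = 16  # value from the config file
global_batch_size = per_gpu_batch_size * strategy.num_replicas_in_sync
print(global_batch_size)  # -> 32 with two GPUs
```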

@ronggong
Author

@dathudeptrai Thanks! Here is an error during training:

  File "/raid/venvs/tf-23/lib/python3.6/site-packages/tensorflow_tts/trainers/base_trainer.py", line 164, in _check_eval_interval
    self._eval_epoch()
  File "/raid/venvs/tf-23/lib/python3.6/site-packages/tensorflow_tts/trainers/base_trainer.py", line 765, in _eval_epoch
    self.generate_and_save_intermediate_result(batch)
  File "examples/fastspeech2_libritts/train_fastspeech2.py", line 176, in generate_and_save_intermediate_result
    utt_id = utt_ids[idx]
IndexError: index 16 is out of bounds for axis 0 with size 16

The error is around here:

for idx, (mel_gt, mel_before, mel_after) in enumerate(
            zip(mel_gts, mels_before, mels_after), 1
        ):
            if self.use_griffin:
                utt_id = utt_ids[idx]

Why do we shift the enumerate counter to start from 1?
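A minimal standalone reproduction of that off-by-one (dummy data, not the repo's code): with 16 items, enumerate(..., 1) yields indices 1 through 16, and index 16 is out of range.

```python
import numpy as np

# 16 dummy utterance ids and 16 dummy mels, mirroring the eval batch above.
utt_ids = np.arange(16)
mels = np.arange(16)

try:
    # Starting the counter at 1 means idx runs 1..16, overrunning utt_ids.
    for idx, mel in enumerate(mels, 1):
        utt_id = utt_ids[idx]
except IndexError as err:
    print(err)  # index 16 is out of bounds for axis 0 with size 16

# With the default start of 0, idx stays within 0..15 and indexing is safe:
for idx, mel in enumerate(mels):
    utt_id = utt_ids[idx]
```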

@ronggong
Author

@dathudeptrai @ZDisket I finished the LibriTTS model training; however, I am still using the LJSpeech multi-band MelGAN vocoder. I also tried the Tucker Carlson vocoder from the TensorVox demo. Neither of them sounds good to me. Do you have a good vocoder to use? I found multiband_melgan.v1_24k somewhat interesting, but it's trained with 2048 FFT and a 300 hop size, although its 24k sampling rate corresponds well to the LibriTTS training data.

@ZDisket
Collaborator

ZDisket commented Oct 28, 2020

@ronggong The Tucker Carlson vocoder is multiband_melgan.v1_24k; it's the LibriTTS MB-MelGAN. Although the last time I heard its eval samples on its own dataset they didn't sound very good, it still works fine as a universal vocoder.

@ronggong
Author

@ZDisket Your Tucker Carlson model sounds much better than my LibriTTS model with the same vocoder. Is your Tucker FastSpeech2 model trained with data preprocessed using 2048 FFT and a 300 hop size? My assumption is that the mismatch of FFT and hop size is the issue.

@ZDisket
Collaborator

ZDisket commented Oct 28, 2020

@ronggong

> Is your Tucker FastSpeech2 model trained with data preprocessed using 2048 FFT and a 300 hop size?

Yes. First I upsampled LJSpeech to 24 kHz and trained on the same config, then took my 24 kHz Tucker Carlson data and fine-tuned the 24 kHz LJSpeech model.
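For reference, a rough sketch of that upsampling step (paths are illustrative, not the exact script used):

```python
import librosa
import soundfile as sf

# Resample one LJSpeech wav from its native 22.05 kHz to 24 kHz so it matches
# the 24 kHz / 2048-FFT / 300-hop vocoder configuration discussed above.
wav, sr = librosa.load("LJSpeech-1.1/wavs/LJ001-0001.wav", sr=None)
wav_24k = librosa.resample(wav, orig_sr=sr, target_sr=24000)
sf.write("LJSpeech-24k/LJ001-0001.wav", wav_24k, 24000)
```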

@ronggong
Author

@ZDisket OK, I will retrain my LibriTTS model on this config.

@ronggong
Author

ronggong commented Oct 28, 2020

trim_silence: false # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048 # Frame size in trimming.
trim_hop_size: 512 # Hop size in trimming.
format: "npy" # Feature file format. Only "npy" is supported.
trim_mfa: false

@ZDisket I am curious why you set trim_silence and trim_mfa both to false. Does this work better than setting them to true? Btw, what does trim_mfa mean, trimming based on the alignment results?

@ZDisket
Collaborator

ZDisket commented Oct 28, 2020

@ronggong MFA trimming trims based on forced alignment results, and I turned both trims off because I found that MFA-aligned FS2 performs better without trimming (as the audio would end too abruptly), and due to the nature of my data collection scripts, most of my datasets already come pre-trimmed.

@ronggong
Author

ronggong commented Oct 31, 2020

@ZDisket The LibriTTS voice sounds a lot better now with FFT 2048 in preprocessing, which matches the vocoder's FFT size. However, the phrase ends very abruptly, as you mentioned. I already disabled trim_silence and trim_mfa in preprocessing. Do you have an idea why?
If I set both trim_silence and trim_mfa to true, does that have the same effect of trimming the start and end silence according to the MFA alignment, as you did in this script? https://github.com/ZDisket/TensorflowTTS/blob/master/examples/fastspeech2/mfa/postmfa.py

This part is confusing to me: https://github.com/ZDisket/TensorflowTTS/blob/e2660d50dd6baf43e83e0534e2595b95cf0e507d/examples/fastspeech2/mfa/postmfa.py#L135
May I ask why you skip the start silence? We are supposed to trim the start silence, no?

@ZDisket
Collaborator

ZDisket commented Nov 1, 2020

@ronggong My script, which is entirely separate from the current MFA stuff, loads the sound file and exports trimmed WAVs, although I think there's a flaw with it.

However, the phrase ends very abruptly as you mentioned. I already disabled trim_silence and trim_mfa in preprocess. Do you have an idea why?

Add SIL to the end of your input. I think the current inference only adds END to the end of the input; I use SIL END so it doesn't end abruptly.
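Roughly what that means in code, as a sketch (the id values are hypothetical; take the real ones from your processor's symbol mapping):

```python
# Hypothetical symbol ids; look the real ones up in your processor's mapper.
SIL_ID = 76
END_ID = 77

def with_clean_ending(phoneme_ids):
    """Append SIL and END ids so the synthesized utterance does not cut off abruptly."""
    return list(phoneme_ids) + [SIL_ID, END_ID]

print(with_clean_ending([12, 34, 56]))  # -> [12, 34, 56, 76, 77]
```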

@ronggong
Author

ronggong commented Nov 1, 2020

@ZDisket I added SIL and END, but it still ends abruptly. I will probably check the MFA alignment to see whether the last phoneme is usually well aligned. Another guess is that when generating the data, maybe we didn't put END at the end of the sentence as a label to differentiate the final phoneme from phonemes in the middle of a sentence.
I also found another problem: synthesizing the first sentence after loading the model is always slow. Is this because of model warmup?

@dathudeptrai
Collaborator

> I also found another problem: synthesizing the first sentence after loading the model is always slow. Is this because of model warmup?

yes :D

@ronggong
Author

ronggong commented Nov 2, 2020

Is there a way to work around the warmup? If not, the workaround is to preload the model and do a dummy inference.
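A sketch of such a dummy warm-up call (the inference() argument names follow the LJSpeech Colab linked earlier; verify them against your installed TensorFlowTTS version):

```python
import numpy as np

def warm_up(fastspeech2, speaker_id=0):
    """Run one dummy inference right after loading the model, so the first
    real request does not pay the graph-tracing cost."""
    dummy_ids = np.array([[1, 2, 3]], dtype=np.int32)
    fastspeech2.inference(
        input_ids=dummy_ids,
        speaker_ids=np.array([speaker_id], dtype=np.int32),
        speed_ratios=np.array([1.0], dtype=np.float32),
        f0_ratios=np.array([1.0], dtype=np.float32),
        energy_ratios=np.array([1.0], dtype=np.float32),
    )
```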

@ronggong
Author

ronggong commented Nov 2, 2020

@ZDisket It seems that keeping punctuation and adding silence after it can mitigate the abrupt-ending problem. I am not sure whether the punctuation is used as phonemes in training; I will check.
Update: the MFA alignment doesn't have punctuation, so the training data doesn't have punctuation either. I think to solve the problem, we could extend the right-side boundary of the last phoneme before the "END" in a training utterance.
@dathudeptrai I think the LibriTTS model is done. Do you want me to create a Colab demo? I think it would be interesting for those who want to test a multi-speaker model.

@dathudeptrai
Collaborator

@ronggong Yeah :D I'm interested in your LibriTTS demo :D

@machineko
Contributor

machineko commented Nov 4, 2020

@ronggong If you used the multispeaker-based preprocessing, there is a step that trims END and SIL from the end based on the MFA alignment, and in my case the models worked a lot better when trimming based on both the END and SIL tokens at the end of the sentence. Also, it wasn't trimmed with librosa but only based on the MFA phonemes :)

If you use this trimming, then at inference time you should also remove SIL and END from the end of the sentence; with this configuration the abrupt-ending problem didn't occur on LibriTTS + my dataset.


@ronggong
Author

ronggong commented Nov 4, 2020

@machineko ok, thanks for the tips, let me try another trimmed version.

@machineko
Contributor

@ronggong Also this => #296 (comment)

@OscarVanL
Contributor

I made the changes mentioned earlier in this thread and my model performs way better, and the end of the speech is not cut off after adding SIL and END to the end of my input ids :)

@OscarVanL
Contributor

For those training FS2 on LibriTTS, I tried switching my LibriTTS subset from train-clean-100 to train-clean-360 and it made a big improvement with the same number of speakers.

My speaker selection involved picking the 100 speakers with the most speech from the subset. In train-clean-100 they have an average of 17.5 mins (some speakers with as little as 12 minutes). In train-clean-360 they have an average of 20 minutes.

I think perhaps adding speakers with small amounts of data (<20 mins) does not help the model.
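For reference, a rough sketch of this kind of selection (paths are illustrative, not the exact script used):

```python
import os
from collections import defaultdict

import soundfile as sf

# Total up the speech per speaker in a LibriTTS subset and keep the 100
# speakers with the most audio. Layout assumed: <root>/<speaker>/<chapter>/*.wav
root = "LibriTTS/train-clean-360"
seconds_per_speaker = defaultdict(float)

for speaker in os.listdir(root):
    for dirpath, _, filenames in os.walk(os.path.join(root, speaker)):
        for name in filenames:
            if name.endswith(".wav"):
                info = sf.info(os.path.join(dirpath, name))
                seconds_per_speaker[speaker] += info.frames / info.samplerate

top_100 = sorted(seconds_per_speaker, key=seconds_per_speaker.get, reverse=True)[:100]
```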

@ronggong
Author

ronggong commented Nov 6, 2020

@dathudeptrai Here is the Colab notebook with the FastSpeech2 model pretrained on LibriTTS train-clean-100 data. Any suggestions? https://colab.research.google.com/drive/1K-4KxwUGaElMxcLzbFXX1oGAaAJXROpf?usp=sharing

@ronggong
Author

ronggong commented Nov 6, 2020

@machineko @OscarVanL
Here is an experiment with a model trained on trimmed LibriTTS train-clean-100 data. At inference, from top to bottom:
(1) no SIL, no END at the end
(2) only SIL
(3) both SIL and END
(4) only END

When there is no SIL and END (1), the phrase ends abruptly. In case (2), the last vowel is prolonged. In case (3), a vowel-like sound is added at the end. In case (4), a short pause is added at the end. This is only one example, so it can't be generalized to other examples.

image

Then, with the model trained on non-trimmed data:
(1) no SIL, no END: the audio gets trimmed at the end
(2) only SIL: the last vowel is prolonged
(3) both SIL and END: a little pause at the end

image

So for me, with either trimmed or non-trimmed data, adding SIL at the end is helpful.

@OscarVanL
Contributor

OscarVanL commented Nov 6, 2020

Yes, this is what I have been doing: I add SIL and END to the end of every utterance :) Thank you for making a comparison so I know this is the right approach.

@ronggong
Author

ronggong commented Nov 6, 2020

@OscarVanL I am not sure if adding END makes sense. If we use MFA trimming, the text ids of both SIL and END are also trimmed from the training text, I suppose:

text_ids = text_ids[idx_start:idx_end]

which might mean that in training the model has probably never seen the END symbol.
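A tiny illustration of that point (ids and boundaries are made up):

```python
# Phoneme id sequence for one utterance: leading SIL, three phonemes, then
# trailing SIL and END (76 and 77 are hypothetical ids).
text_ids = [76, 12, 34, 56, 76, 77]

# Boundaries produced by the MFA-based trimming (illustrative values):
idx_start, idx_end = 1, 4

print(text_ids[idx_start:idx_end])  # -> [12, 34, 56]; SIL and END are dropped
```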

@OscarVanL
Contributor

OscarVanL commented Nov 6, 2020

Interesting, maybe the END is unnecessary then, but there is clearly a difference in your charts 2 and 3, so END must be doing something?

@ronggong
Author

ronggong commented Nov 6, 2020

@OscarVanL In the example, it adds some vowel-like noise.

@aragorntheking

aragorntheking commented Nov 16, 2020

@ronggong MFA stripping punctuation before training has side effects; for example, question marks no longer make the sentence sound like a question. Would somehow putting the punctuation back after MFA alignment make sense?
Is there a better way to fix this?

@ronggong
Author

@aragorntheking I think you are right: the punctuation has been removed from the MFA alignment and also from the training labels, since the labels are generated from the MFA alignment. I think it's worth investigating the effect of putting the punctuation back.

@ronggong
Author

@aragorntheking I am going to have time to investigate the punctuation in the MFA alignment. Do you want to collaborate on this?

@ronggong
Author

@OscarVanL @aragorntheking Adding the punctuation back to the training transcriptions has an effect, although a minor one. We can hear the pitch change:

  1. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania?
  2. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania!
  3. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania.

Below are the spectrograms of the three sentences; pay attention to the ending pitch.

image

Audio examples

test.zip

@ronggong
Author

@dathudeptrai Just a reminder: the pretrained multi-speaker LibriTTS FastSpeech2 Colab is here https://colab.research.google.com/drive/1K-4KxwUGaElMxcLzbFXX1oGAaAJXROpf?usp=sharing if we want to share it with others.

@stale

stale bot commented Mar 1, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix label Mar 1, 2021
@stale stale bot closed this as completed Mar 8, 2021
@MostafaAlaviyan

MostafaAlaviyan commented Mar 13, 2022

> @OscarVanL @aragorntheking Adding the punctuation back to the training transcriptions has an effect, although a minor one. We can hear the pitch change:
>
>   1. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania?
>   2. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania!
>   3. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania.
>
> Below are the spectrograms of the three sentences; pay attention to the ending pitch.
>
> image
>
> Audio examples
>
> test.zip

@ronggong Where do you add the punctuation? Do you change the MFA alignments, or configure MFA to keep the punctuation? Can you explain clearly what you did to solve this problem?
