
Pretrained fastspeech2 libritts model for testing? #325

Closed
ronggong opened this issue Oct 23, 2020 · 38 comments
Assignees: dathudeptrai
Labels: question ❓ Further information is requested, wontfix

Comments

@ronggong

Hi,

Thanks for the nice work. Is there a pretrained FastSpeech2 LibriTTS model for testing, like the one trained on LJSpeech data? https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing

@dathudeptrai dathudeptrai self-assigned this Oct 24, 2020
@dathudeptrai dathudeptrai added the question ❓ Further information is requested label Oct 24, 2020
@dathudeptrai
Collaborator

@ronggong I don't have enough resources to train LibriTTS; I hope someone can try and make a pull request :D

@ronggong
Author

@dathudeptrai Hi, do you have an estimate of how many GPUs and how much training time would be needed?

@dathudeptrai
Collaborator

> @dathudeptrai Hi, do you have an estimate of how many GPUs and how much training time would be needed?

With a 2080 Ti, you can finish training FastSpeech2 in around 8 hours :D

@ronggong
Author

ronggong commented Oct 25, 2020

@dathudeptrai I tried training with 2 GPUs and batch_size 16, which is slower than using 1 GPU with the same batch_size. Both GPUs consume the same amount of memory. Does the batch_size in the YAML apply per GPU, so the actual batch size is 32?

@dathudeptrai
Collaborator

> @dathudeptrai I tried training with 2 GPUs and batch_size 16, which is slower than using 1 GPU with the same batch_size. Both GPUs consume the same amount of memory. Does the batch_size in the YAML apply per GPU, so the actual batch size is 32?

Yes :)). See here: https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/fastspeech2#step-2-training-from-scratch

If you want to use multi-GPU training, you can replace CUDA_VISIBLE_DEVICES=0 with, for example, CUDA_VISIBLE_DEVICES=0,1,2,3. You also need to tune the batch_size per GPU (in the config file) yourself to maximize performance. Note that multi-GPU is now supported for training but not yet for decoding.
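A minimal sketch (not the repo's training script) of what this means for the effective batch size, assuming a MirroredStrategy-style multi-GPU setup:

```python
import os

# Make two GPUs visible before TensorFlow is imported (illustrative device ids).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import tensorflow as tf

# The batch_size in the config YAML is applied per replica (per GPU), so the
# global batch size scales with the number of visible GPUs.
strategy = tf.distribute.MirroredStrategy()
per_gpu_batch_size = 16  # value from the config file
global_batch_size = per_gpu_batch_size * strategy.num_replicas_in_sync
print(global_batch_size)  # -> 32 with two GPUs
```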

@ronggong
Author

@dathudeptrai Thanks! Here is an error during training:

  File "/raid/venvs/tf-23/lib/python3.6/site-packages/tensorflow_tts/trainers/base_trainer.py", line 164, in _check_eval_interval
    self._eval_epoch()
  File "/raid/venvs/tf-23/lib/python3.6/site-packages/tensorflow_tts/trainers/base_trainer.py", line 765, in _eval_epoch
    self.generate_and_save_intermediate_result(batch)
  File "examples/fastspeech2_libritts/train_fastspeech2.py", line 176, in generate_and_save_intermediate_result
    utt_id = utt_ids[idx]
IndexError: index 16 is out of bounds for axis 0 with size 16

The error is around here:

for idx, (mel_gt, mel_before, mel_after) in enumerate(
            zip(mel_gts, mels_before, mels_after), 1
        ):
            if self.use_griffin:
                utt_id = utt_ids[idx]

Why do we shift the enumerate counter to start from 1?
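A minimal standalone reproduction of that off-by-one (dummy data, not the repo's code): with 16 items, enumerate(..., 1) yields indices 1 through 16, and index 16 is out of range.

```python
import numpy as np

# 16 dummy utterance ids and 16 dummy mels, mirroring the eval batch above.
utt_ids = np.arange(16)
mels = np.arange(16)

try:
    # Starting the counter at 1 means idx runs 1..16, overrunning utt_ids.
    for idx, mel in enumerate(mels, 1):
        utt_id = utt_ids[idx]
except IndexError as err:
    print(err)  # index 16 is out of bounds for axis 0 with size 16

# With the default start of 0, idx stays within 0..15 and indexing is safe:
for idx, mel in enumerate(mels):
    utt_id = utt_ids[idx]
```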

@ronggong
Author

@dathudeptrai @ZDisket I finished the LibriTTS model training; however, I am still using the LJSpeech multi-band MelGAN vocoder. I also tried the Tucker Carlson vocoder from the TensorVox demo. Neither of them sounds good to me. Do you have a good vocoder to use? I found multiband_melgan.v1_24k somewhat interesting, but it's trained with 2048 FFT and a 300 hop size, although its 24k sampling rate corresponds well to the LibriTTS training data.

@ZDisket
Collaborator

ZDisket commented Oct 28, 2020

@ronggong The Tucker Carlson vocoder is multiband_melgan.v1_24k; it's the LibriTTS MB-MelGAN. Although the last time I heard its eval samples on its own dataset they didn't sound very good, it still works fine as a universal vocoder.

@ronggong
Author

@ZDisket Your Tucker Carlson model sounds much better than my LibriTTS model with the same vocoder. Is your Tucker FastSpeech2 model trained with data preprocessed using 2048 FFT and a 300 hop size? My assumption is that the mismatch of FFT and hop size is the issue.

@ZDisket
Collaborator

ZDisket commented Oct 28, 2020

@ronggong

> Is your Tucker FastSpeech2 model trained with data preprocessed using 2048 FFT and a 300 hop size?

Yes. First I upsampled LJSpeech to 24 kHz and trained on the same config, then took my 24 kHz Tucker Carlson data and fine-tuned the 24 kHz LJSpeech model.
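For reference, a rough sketch of that upsampling step (paths are illustrative, not the exact script used):

```python
import librosa
import soundfile as sf

# Resample one LJSpeech wav from its native 22.05 kHz to 24 kHz so it matches
# the 24 kHz / 2048-FFT / 300-hop vocoder configuration discussed above.
wav, sr = librosa.load("LJSpeech-1.1/wavs/LJ001-0001.wav", sr=None)
wav_24k = librosa.resample(wav, orig_sr=sr, target_sr=24000)
sf.write("LJSpeech-24k/LJ001-0001.wav", wav_24k, 24000)
```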

@ronggong
Author

@ZDisket OK, I will retrain my LibriTTS model on this config.

@ronggong
Author

ronggong commented Oct 28, 2020

trim_silence: false # Whether to trim the start and end of silence.
trim_threshold_in_db: 60 # Need to tune carefully if the recording is not good.
trim_frame_size: 2048 # Frame size in trimming.
trim_hop_size: 512 # Hop size in trimming.
format: "npy" # Feature file format. Only "npy" is supported.
trim_mfa: false

@ZDisket I am curious why you set trim_silence and trim_mfa both to false. Does this work better than setting them to true? Btw, what does trim_mfa mean, trimming based on the alignment results?

@ZDisket
Collaborator

ZDisket commented Oct 28, 2020

@ronggong MFA trimming trims based on forced alignment results, and I turned both trims off because I found that MFA-aligned FS2 performs better without trimming (as the audio would end too abruptly), and due to the nature of my data collection scripts, most of my datasets already come pre-trimmed.

@ronggong
Author

ronggong commented Oct 31, 2020

@ZDisket The LibriTTS voice sounds a lot better now with FFT 2048 in preprocessing, which matches the vocoder's FFT size. However, the phrase ends very abruptly, as you mentioned. I already disabled trim_silence and trim_mfa in preprocessing. Do you have an idea why?
If I set both trim_silence and trim_mfa to true, does that have the same effect of trimming the start and end silence according to the MFA alignment, as you did in this script? https://github.com/ZDisket/TensorflowTTS/blob/master/examples/fastspeech2/mfa/postmfa.py

This part is confusing to me: https://github.com/ZDisket/TensorflowTTS/blob/e2660d50dd6baf43e83e0534e2595b95cf0e507d/examples/fastspeech2/mfa/postmfa.py#L135
May I ask why you skip the start silence? We are supposed to trim the start silence, no?

@ZDisket
Collaborator

ZDisket commented Nov 1, 2020

@ronggong My script, which is entirely separate from the current MFA stuff, loads the sound file and exports trimmed WAVs, although I think there's a flaw with it.

However, the phrase ends very abruptly as you mentioned. I already disabled trim_silence and trim_mfa in preprocess. Do you have an idea why?

Add SIL to the end of your input. I think the current inference only adds END to the end of the input; I use SIL END so it doesn't end abruptly.
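Roughly what that means in code, as a sketch (the id values are hypothetical; take the real ones from your processor's symbol mapping):

```python
# Hypothetical symbol ids; look the real ones up in your processor's mapper.
SIL_ID = 76
END_ID = 77

def with_clean_ending(phoneme_ids):
    """Append SIL and END ids so the synthesized utterance does not cut off abruptly."""
    return list(phoneme_ids) + [SIL_ID, END_ID]

print(with_clean_ending([12, 34, 56]))  # -> [12, 34, 56, 76, 77]
```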

@ronggong
Author

ronggong commented Nov 1, 2020

@ZDisket I added SIL and END, but it still ends abruptly. I will probably check the MFA alignment to see whether the last phoneme is usually well aligned. Another guess is that when generating the data, maybe we didn't put END at the end of the sentence as a label to differentiate the final phoneme from phonemes in the middle of a sentence.
I also found another problem: synthesizing the first sentence after loading the model is always slow. Is this because of model warmup?

@dathudeptrai
Collaborator

> I also found another problem: synthesizing the first sentence after loading the model is always slow. Is this because of model warmup?

yes :D

@ronggong
Author

ronggong commented Nov 2, 2020

Is there a way to work around the warmup? If not, the workaround is to preload the model and do a dummy inference.
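A sketch of such a dummy warm-up call (the inference() argument names follow the LJSpeech Colab linked earlier; verify them against your installed TensorFlowTTS version):

```python
import numpy as np

def warm_up(fastspeech2, speaker_id=0):
    """Run one dummy inference right after loading the model, so the first
    real request does not pay the graph-tracing cost."""
    dummy_ids = np.array([[1, 2, 3]], dtype=np.int32)
    fastspeech2.inference(
        input_ids=dummy_ids,
        speaker_ids=np.array([speaker_id], dtype=np.int32),
        speed_ratios=np.array([1.0], dtype=np.float32),
        f0_ratios=np.array([1.0], dtype=np.float32),
        energy_ratios=np.array([1.0], dtype=np.float32),
    )
```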

@ronggong
Author

ronggong commented Nov 2, 2020

@ZDisket It seems that keeping punctuation and adding silence after it can mitigate the abrupt-ending problem. I am not sure whether the punctuation is used as phonemes in training; I will check.
Update: the MFA alignment doesn't have punctuation, so the training data doesn't have punctuation either. I think to solve the problem, we could extend the right-side boundary of the last phoneme before the "END" in a training utterance.
@dathudeptrai I think the LibriTTS model is done. Do you want me to create a Colab demo? I think it would be interesting for those who want to test a multi-speaker model.

@dathudeptrai
Collaborator

@ronggong Yeah :D I'm interested in your LibriTTS demo :D

@machineko
Contributor

machineko commented Nov 4, 2020

@ronggong If you used the multispeaker-based preprocessing, there is a step that trims END and SIL from the end based on the MFA alignment, and in my case the models worked a lot better when trimming based on both the END and SIL tokens at the end of the sentence. Also, it wasn't trimmed with librosa but only based on the MFA phonemes :)

If you use this trimming, then at inference time you should also remove SIL and END from the end of the sentence; with this configuration the abrupt-ending problem didn't occur on LibriTTS + my dataset.


@ronggong
Author

ronggong commented Nov 4, 2020

@machineko ok, thanks for the tips, let me try another trimmed version.

@machineko
Contributor

@ronggong Also this => #296 (comment)

@OscarVanL
Contributor

I made the changes mentioned earlier in this thread and my model performs way better, and the end of the speech is not cut off after adding SIL and END to the end of my input ids :)

@OscarVanL
Contributor

For those training FS2 on LibriTTS, I tried switching my LibriTTS subset from train-clean-100 to train-clean-360 and it made a big improvement with the same number of speakers.

My speaker selection involved picking the 100 speakers with the most speech from the subset. In train-clean-100 they have an average of 17.5 mins (some speakers with as little as 12 minutes). In train-clean-360 they have an average of 20 minutes.

I think perhaps adding speakers with small amounts of data (<20 mins) does not help the model.
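For reference, a rough sketch of this kind of selection (paths are illustrative, not the exact script used):

```python
import os
from collections import defaultdict

import soundfile as sf

# Total up the speech per speaker in a LibriTTS subset and keep the 100
# speakers with the most audio. Layout assumed: <root>/<speaker>/<chapter>/*.wav
root = "LibriTTS/train-clean-360"
seconds_per_speaker = defaultdict(float)

for speaker in os.listdir(root):
    for dirpath, _, filenames in os.walk(os.path.join(root, speaker)):
        for name in filenames:
            if name.endswith(".wav"):
                info = sf.info(os.path.join(dirpath, name))
                seconds_per_speaker[speaker] += info.frames / info.samplerate

top_100 = sorted(seconds_per_speaker, key=seconds_per_speaker.get, reverse=True)[:100]
```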

@ronggong
Author

ronggong commented Nov 6, 2020

@dathudeptrai Here is the Colab notebook with the FastSpeech2 model pretrained on LibriTTS train-clean-100 data. Any suggestions? https://colab.research.google.com/drive/1K-4KxwUGaElMxcLzbFXX1oGAaAJXROpf?usp=sharing

@ronggong
Author

ronggong commented Nov 6, 2020

@machineko @OscarVanL
Here is an experiment with a model trained on trimmed LibriTTS train-clean-100 data. At inference, from top to bottom:
(1) no SIL, no END at the end
(2) only SIL
(3) both SIL and END
(4) only END

When there is no SIL and END (1), the phrase ends abruptly. In case (2), the last vowel is prolonged. In case (3), a vowel-like sound is added at the end. In case (4), a short pause is added at the end. This is only one example, so it can't be generalized to other examples.

image

Then, with the model trained on non-trimmed data:
(1) no SIL, no END: the audio gets trimmed at the end
(2) only SIL: the last vowel is prolonged
(3) both SIL and END: a little pause at the end

image

So for me, with either trimmed or non-trimmed data, adding SIL at the end is helpful.

@OscarVanL
Contributor

OscarVanL commented Nov 6, 2020

Yes, this is what I have been doing: I add SIL and END to the end of every utterance :) Thank you for making a comparison so I know this is the right approach.

@ronggong
Author

ronggong commented Nov 6, 2020

@OscarVanL I am not sure if adding END makes sense. If we use MFA trimming, the text ids of both SIL and END are also trimmed from the training text, I suppose:

text_ids = text_ids[idx_start:idx_end]

which might mean that in training the model has probably never seen the END symbol.
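A tiny illustration of that point (ids and boundaries are made up):

```python
# Phoneme id sequence for one utterance: leading SIL, three phonemes, then
# trailing SIL and END (76 and 77 are hypothetical ids).
text_ids = [76, 12, 34, 56, 76, 77]

# Boundaries produced by the MFA-based trimming (illustrative values):
idx_start, idx_end = 1, 4

print(text_ids[idx_start:idx_end])  # -> [12, 34, 56]; SIL and END are dropped
```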

@OscarVanL
Contributor

OscarVanL commented Nov 6, 2020

Interesting, maybe the END is unnecessary then, but there is clearly a difference in your charts 2 and 3, so END must be doing something?

@ronggong
Author

ronggong commented Nov 6, 2020

@OscarVanL In the example, it adds some vowel-like noise.

@aragorntheking

aragorntheking commented Nov 16, 2020

@ronggong MFA stripping punctuation before training has side effects; for example, question marks no longer make the sentence sound like a question. Would somehow putting the punctuation back after MFA alignment make sense?
Is there a better way to fix this?

@ronggong
Author

@aragorntheking I think you are right: the punctuation has been removed from the MFA alignment and also from the training labels, since the labels are generated from the MFA alignment. I think it's worth investigating the effect of putting the punctuation back.

@ronggong
Author

@aragorntheking I am going to have time to investigate the punctuation in the MFA alignment. Do you want to collaborate on this?

@ronggong
Author

@OscarVanL @aragorntheking Adding the punctuation back to the training transcriptions has an effect, although a minor one. We can hear the pitch change:

  1. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania?
  2. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania!
  3. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania.

Below are the spectrograms of the three sentences; pay attention to the ending pitch.

image

Audio examples

test.zip

@ronggong
Author

@dathudeptrai Just a reminder: the pretrained multi-speaker LibriTTS FastSpeech2 Colab is here https://colab.research.google.com/drive/1K-4KxwUGaElMxcLzbFXX1oGAaAJXROpf?usp=sharing if we want to share it with others.

@stale

stale bot commented Mar 1, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the wontfix label Mar 1, 2021
@stale stale bot closed this as completed Mar 8, 2021
@MostafaAlaviyan

MostafaAlaviyan commented Mar 13, 2022

> @OscarVanL @aragorntheking Adding the punctuation back to the training transcriptions has an effect, although a minor one. We can hear the pitch change:
>
>   1. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania?
>   2. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania!
>   3. Georgia recount confirmed as Biden builds lead over Trump in Pennsylvania.
>
> Below are the spectrograms of the three sentences; pay attention to the ending pitch.
>
> image
>
> Audio examples
>
> test.zip

@ronggong Where do you add the punctuation? Do you change the MFA alignments, or configure MFA to keep the punctuation? Can you explain clearly what you did to solve this problem?
