scaling Mel Spectrogram output for Wavenet Vocoder #24

Closed
G-Wang opened this issue May 21, 2018 · 20 comments

@G-Wang

G-Wang commented May 21, 2018

Hello,

First of all thanks for the nice Tacotron 2 implementation.

I'm trying to use the trained Tacotron 2 outputs as inputs to r9y9's WaveNet vocoder. However, his pre-trained WaveNet expects mel spectrograms scaled to [0, 1].

What is the range for this Tacotron 2 implementation? I'm having a hard time finding this out so I can scale the outputs accordingly.

For reference, this is r9y9's normalization function, applied to the mel spectrogram before training, which scales it to [0, 1]:

def _normalize(S):
    return np.clip((S - hparams.min_level_db) / -hparams.min_level_db, 0, 1)

@rafaelvalle
Contributor

rafaelvalle commented May 21, 2018

Our [dynamic range compression](https://github.com/NVIDIA/tacotron2/blob/master/audio_processing.py#L78) just applies a log to clamped values. We also provide a dynamic range decompression function in the same file.
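
In essence, the pair looks something like this (a minimal sketch based on the description above; the exact defaults for C and clip_val live in audio_processing.py):

import torch

def dynamic_range_compression(x, C=1, clip_val=1e-5):
    # log of clamped values: the clamp keeps log() away from zero on silent frames
    return torch.log(torch.clamp(x, min=clip_val) * C)

def dynamic_range_decompression(x, C=1):
    # inverse of the compression above (exact up to the clamping)
    return torch.exp(x) / C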

I think the code below is what you're looking for.

# load the saved Tacotron 2 mel output and undo the log compression
mel = torch.load(conditional_path)
mel = dynamic_range_decompression(mel)
mel = mel.cpu().numpy()
mel = mel.transpose()

# re-normalize to the [0, 1] range that r9y9's vocoder expects
mel = audio._amp_to_db(mel) - hparams.ref_level_db
if not hparams.allow_clipping_in_normalization:
    assert mel.max() <= 0 and mel.min() - hparams.min_level_db >= 0
mel = audio._normalize(mel)
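
For readers without r9y9's repo handy, the audio helpers used above are roughly the following (a sketch only; min_level_db = -100 and ref_level_db = 20 are that repo's default hparams and are assumptions here, check its audio.py for the exact code):

import numpy as np

min_level_db = -100  # assumed default from r9y9's hparams
ref_level_db = 20    # assumed default from r9y9's hparams

def _amp_to_db(x):
    # amplitude -> decibels, floored so log10 never sees zero
    min_level = np.exp(min_level_db / 20 * np.log(10))
    return 20 * np.log10(np.maximum(min_level, x))

def _normalize(S):
    # map [min_level_db, 0] dB onto [0, 1], clipping anything outside
    return np.clip((S - min_level_db) / -min_level_db, 0, 1)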

@rafaelvalle
Contributor

Curious to hear your samples and to know if you trained with the most recent code that updates the input of the attention and decoder.

@G-Wang
Author

G-Wang commented May 22, 2018

Thanks, that worked (my model still needs some more training). I'm starting another run with the updated attention code, and will post back here with the entire pipeline (+ WaveNet vocoder) once that's done.

@G-Wang G-Wang closed this as completed May 22, 2018
@yliess86

Hi @G-Wang! Have you gone further with your training? I am currently doing the same thing as you, and the voice I get from WaveNet sounds as if it has the flu. Did you manage to get good results?

@rafaelvalle
Contributor

@yliess86 can you share the audio and the mel-spectrogram that sounds like it has the flu here?

@G-Wang
Author

G-Wang commented Jun 12, 2018

Hello @yliess86, due to my limited compute I had to set the batch size pretty low to fit everything on the GPU (batch size of 18). The network's loss was stuck around 0.63 for quite a while, so I didn't continue; however, that was an older version of the code. I've been running a new training with the latest code since yesterday.

But note that the solution @rafaelvalle provided works no problem once your network has been trained: I took the ground-truth mel spectrogram data that Tacotron 2 trains on, and it produces very good quality on r9y9's WaveNet vocoder (using the code above).

@yliess86

@G-Wang Ok thank you.
@rafaelvalle Sure! I will share audio and mel-spec today.

@yliess86

yliess86 commented Jun 13, 2018

Here are the text, the corresponding audio, and the corresponding mel-spectrogram:

  • 'This is an example of text to speech synthesis after 9 days training. This may sound awful, but it is a start.'
  • Audio
  • Mel-spectrogram (sorry, forgot to invert the vertical axis):
    Mel-Spec

@rafaelvalle
Contributor

rafaelvalle commented Jun 14, 2018

@yliess86 Can you share the mel-spectrogram file?
Using the WaveNet decoder is essential for good audio quality!

@yliess86

yliess86 commented Jun 14, 2018

@rafaelvalle I plugged the output of Tacotron 2 into the conversion pipeline you described above and then fed it into the r9y9 WaveNet. The image you can see is the plt plot I made just before the conversion. Do you want me to give you the output (mel-spec) as a .npy file, or something else?

# Tacotron 2
mel = taco(sentence)[0]

# Conversion
mel = dynamic_range_decompression(mel)
mel = mel.data.cpu().numpy()
mel = mel.transpose()
mel = audio._amp_to_db(mel) - hparams.ref_level_db
if not hparams.allow_clipping_in_normalization:
    assert mel.max() <= 0 and mel.min() - hparams.min_level_db >= 0
mel = audio._normalize(mel)

# WaveNet vocoder: conditioning features must be (T, num_mels)
if mel.shape[1] != hparams.num_mels:
    mel = np.swapaxes(mel, 0, 1)  # swapaxes is not in-place; assign the result
waveform = wavegen(self.model, c=mel, fast=True, tqdm=tqdm)
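
A quick sanity check before calling wavegen catches both failure modes discussed in this thread (a sketch; hparams.num_mels follows r9y9's naming):

# conditioning features should be (T, num_mels) and normalized to [0, 1]
assert mel.ndim == 2 and mel.shape[1] == hparams.num_mels, mel.shape
assert mel.min() >= 0.0 and mel.max() <= 1.0, (mel.min(), mel.max())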

@rafaelvalle
Contributor

Yes, please do share the mel-spec as a torch or npy file.

@yliess86

yliess86 commented Jun 14, 2018

Here is the (.npy) file: Mel-Spec

@rafaelvalle
Contributor

@yliess86 The model that produced this mel-spectrogram was not trained on LJ Speech dataset, right?

@yliess86

It was. I trained the model on the LJ Speech dataset. This mel spec is the result of 70,000 iterations on that dataset.

@rafaelvalle
Contributor

That's unexpected. Did you train using the default params?

@yliess86

yliess86 commented Jun 14, 2018

Yes, I just changed the batch size to 24. I'll try to download it again and retrain the model. It was the LJ Speech dataset from the link given in the Rayhane repository, so maybe it is not exactly the same.

@rafaelvalle
Contributor

What were the training and validation loss of the model used to produce that mel spectrogram?

@yliess86

The training loss was between 0.3 and 0.5, and the validation loss was 0.46.

@rafaelvalle
Contributor

rafaelvalle commented Jun 14, 2018

Don't retrain the model; the problem is with the mel-spectrogram representation.
Can you please submit a new issue? We'll post a solution there.

@mrgloom

mrgloom commented Mar 24, 2019

Is it possible to use the pretrained WaveNet models from https://github.com/r9y9/wavenet_vocoder with https://github.com/NVIDIA/tacotron2, or do they need to be retrained?
