scaling Mel Spectrogram output for Wavenet Vocoder #24
Our [dynamic range compression](https://github.com/NVIDIA/tacotron2/blob/master/audio_processing.py#L78) just applies a log to clamped values. We also provide a dynamic range decompression function there. I think the code below is what you're looking for.
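For readers following along, the compression/decompression pair can be sketched in numpy (a minimal sketch of the idea described above, assuming defaults of `C=1` and `clip_val=1e-5`; the functions in the linked `audio_processing.py` operate on torch tensors, but the math is the same):

```python
import numpy as np

def dynamic_range_compression(x, C=1.0, clip_val=1e-5):
    # log of clamped values, as described above
    return np.log(np.clip(x, clip_val, None) * C)

def dynamic_range_decompression(x, C=1.0):
    # inverse operation: exponentiate, then undo the scale factor
    return np.exp(x) / C

mel = np.array([0.5, 1.0, 2.0])
restored = dynamic_range_decompression(dynamic_range_compression(mel))
# round-trips exactly for values above clip_val; anything below
# clip_val comes back as clip_val because of the clamp
```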
Curious to hear your samples and to know whether you trained with the most recent code, which updates the input of the attention and decoder.
Thanks, that worked (my model still needs some more training). I'm starting another run with the updated attention repository, and will post back here with the entire pipeline (+ WaveNet vocoder) once that's done.
Hi @G-Wang! Have you gone further with your training? I am currently doing the same thing as you, and the voice I get from WaveNet sounds as if it has the flu. Did you manage to get good results?
@yliess86 Can you share the audio and the mel-spectrogram that sounds like it has the flu here?
Hello @yliess86, due to my limited compute, I had to set the batch size pretty low to fit everything into the GPU (batch_size of 18). The network's loss was stuck around 0.63 for quite a while, so I didn't continue; however, that was an older version of the code. I've been running a new training with the latest code since yesterday. But note that the solution @rafaelvalle provided works no problem once your network has been trained: I took the ground-truth mel spectrogram data that Tacotron 2 trains on, and it gives very good quality with r9y9's WaveNet vocoder (using the above code).
@G-Wang Ok thank you. |
Here are the text, the corresponding audio and the corresponding mel-spectrogram:
@yliess86 Can you share the mel-spectrogram file? |
@rafaelvalle I plugged the output of Tacotron 2 into the conversion pipeline you described above and then fed it into the r9y9 WaveNet. The image is the plt plot I made just before the conversion. Do you want me to give you the output (mel-spec) as a .npy file, or something else?

```python
# Tacotron2
mel = taco(sentence)[0]
# Conversion
mel = dynamic_range_decompression(mel)  # decompress the tensor produced above
mel = mel.data.cpu().numpy()
mel = mel.transpose()
mel = audio._amp_to_db(mel) - hparams.ref_level_db
if not hparams.allow_clipping_in_normalization:
    assert mel.max() <= 0 and mel.min() - hparams.min_level_db >= 0
mel = audio._normalize(mel)
# Wavenet Vocoder
if mel.shape[1] != hparams.num_mels:
    mel = np.swapaxes(mel, 0, 1)  # swapaxes returns a new array, so the result must be assigned
waveform = wavegen(self.model, c=mel, fast=True, tqdm=tqdm)
```
Yes, please do share the mel-spec as a torch or npy file.
Here is the (.npy) file: Mel-Spec |
@yliess86 The model that produced this mel-spectrogram was not trained on LJ Speech dataset, right? |
It was. I trained the model with the LJ Speech dataset. This mel Spec is the result of 70000 iterations on this dataset. |
That's unexpected. Did you train using the default params? |
Yes, I just changed the batch size to 24. I'll try to download the dataset again and retrain the model. It was the LJ Speech dataset from the link given in the Rayhane repository, so maybe it is not exactly the same.
What were the training and validation loss of the model used to produce that mel spectrogram? |
The training loss was between 0.3 and 0.5, and the validation loss was 0.46.
Don't retrain the model, the problem is with the mel-spectrogram representation. |
Is it possible to use pretrained WaveNet models from https://github.com/r9y9/wavenet_vocoder with https://github.com/NVIDIA/tacotron2, or does the WaveNet need to be retrained?
Hello,
First of all thanks for the nice Tacotron 2 implementation.
I'm trying to use the trained Tacotron 2 outputs as inputs to r9y9's WaveNet vocoder. However, his pre-trained WaveNet works on mel spectrograms scaled to [0, 1].
What is the range for this Tacotron 2 implementation? I'm having a hard time finding it out in order to do the scaling.
For reference, this is r9y9's normalization function, applied to the mel spectrogram before training, which scales it to between 0 and 1:

```python
def _normalize(S):
    return np.clip((S - hparams.min_level_db) / -hparams.min_level_db, 0, 1)
```