Hoarseness in synthesised voice #297
Comments
These are my findings. You can try them.
@sharathadavanne, for me it sounds as if it is under-trained. @cjerry1243 and @Yeongtae, on the other hand, suggest adding some silence at the end; @Yeongtae is deliberately adding silence in his fork. @rafaelvalle, can you please comment on this one?
Hi @ksaidin and @cjerry1243, sorry for the late reply. I found out that the problem was with the WaveGlow model and had nothing to do with Tacotron. The WaveGlow model trained with zero-mean unit-variance normalized mel features resulted in noisy synthesis, while training with raw un-normalized mel features resulted in clean synthesis. Obviously, in the above comparison, the Tacotron models were trained with the corresponding normalization. Did you face this kind of issue with WaveGlow, @rafaelvalle?
@ksaidin, @sharathadavanne If you train WaveGlow with silence padding, the loss explodes.
You are right, @Yeongtae, but in my scenario the hoarseness was not a result of silence in the training data. It was mainly a result of normalizing the mel features to zero mean and unit variance. When trained with these normalized features the model converged neatly during training, yet during inference, even on training data, it produced hoarse output similar to the recordings attached in the issue above. If I instead trained WaveGlow on the same features without normalization, it produced clean synthesis.
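To make the mismatch concrete, here is a minimal sketch of what the zero-mean unit-variance normalization and its inverse look like (the array shapes and statistics here are made up for illustration). The key point is that a vocoder trained on raw mels must receive raw mels at inference time; if the acoustic model emits normalized features, they have to be denormalized first.

```python
import numpy as np

def normalize_mel(mel, mean, std, eps=1e-8):
    """Zero-mean unit-variance normalization, per mel band."""
    return (mel - mean) / (std + eps)

def denormalize_mel(mel_norm, mean, std, eps=1e-8):
    """Invert the normalization before feeding mels to a vocoder
    trained on raw (un-normalized) features."""
    return mel_norm * (std + eps) + mean

# toy 80-band mel spectrogram with per-band statistics (fake data)
mel = np.random.randn(80, 100) * 2.0 - 3.0
mean = mel.mean(axis=1, keepdims=True)
std = mel.std(axis=1, keepdims=True)

mel_norm = normalize_mel(mel, mean, std)
mel_back = denormalize_mel(mel_norm, mean, std)
assert np.allclose(mel, mel_back)  # the round trip is lossless
```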
@sharathadavanne, glad to hear that; please share the final audio of this text for comparison.
@ksaidin I don't have the exact same recordings as above, but I have audio examples for other text, which you can hear on SoundCloud. You can hear that with un-normalized features the WaveGlow output is identical to the original, whereas with normalized features the WaveGlow output is noisy. Both models were trained for about 500k iterations.
@sharathadavanne @ksaidin Hey, I have also been trying this out with a Hindi dataset. What I'm confused about is whether to use the pretrained WaveGlow model or to train WaveGlow afresh. P.S.: I have around 23+ hours of data. Also, please tell me the other changes required to make this work, e.g. in symbols.py, the cleaners, etc. And how did you handle English words during inference? Please note that the speaker in my data uses English words quite often, which I have transliterated to Devanagari script for training purposes.
@AnkurDebnath35 you definitely need to train a WaveGlow model with your data. Since you have 23+ hours, you can also train it from scratch, though it might take longer. In my experience, it is much faster to train starting from the pretrained WaveGlow model (of any language/gender). And you don't really need to worry about your transcript for training WaveGlow; it only needs your audio, not your transcript.
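Starting from a pretrained checkpoint usually boils down to copying over every parameter whose name and shape match the fresh model and keeping the fresh initialization for the rest. A minimal sketch of that warm-start recipe, using plain dicts of NumPy arrays to stand in for a real (e.g. PyTorch) state dict:

```python
import numpy as np

def warm_start_state(model_state, pretrained_state):
    """Copy pretrained weights into a model's state dict, keeping only
    entries whose names and shapes match; a common warm-start recipe
    when fine-tuning from a published checkpoint."""
    merged = dict(model_state)
    for name, weight in pretrained_state.items():
        if name in merged and merged[name].shape == weight.shape:
            merged[name] = weight
    return merged

# fake state dicts: "conv.weight" matches, "embed.weight" does not
model = {"conv.weight": np.zeros((4, 2)), "embed.weight": np.zeros((10, 3))}
pretrained = {"conv.weight": np.ones((4, 2)), "embed.weight": np.ones((99, 3))}
merged = warm_start_state(model, pretrained)
assert merged["conv.weight"].sum() == 8   # copied from the checkpoint
assert merged["embed.weight"].sum() == 0  # kept the fresh initialization
```

The names `conv.weight` and `embed.weight` are hypothetical placeholders, not actual WaveGlow parameter names.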
@sharathadavanne Thanks. So suppose there is this published pretrained WaveGlow model. Are you suggesting that I train WaveGlow starting from that pretrained model and then train Tacotron? You also said something about feature normalization. Can you explain it? Details like where and what to change in the WaveGlow/Tacotron code.
Hi @AnkurDebnath35, before any of the improvements, please train the base frameworks (Tacotron and WaveGlow) from scratch with your data. Once you get acceptable results, you can think about improvements. For now, you can follow these steps.
@AnkurDebnath35 it is not always necessary to train Tacotron or WaveGlow from scratch, especially if your voice is female, because fine-tuning the existing model can give you good results.
Yes! But my dataset is a male one. Any suggestions? |
You can get better results by training from scratch while we prepare a male speaker baseline model. |
I read an issue here about training with 4-5 hours of data. I have around 4.5 hours of data and want to fine-tune the published model.
@sharathadavanne Can you share the vocab that you used for Hindi? I am not sure whether just adding the Hindi vocab characters to symbols.py is enough, because in Devanagari some characters (the ones with a vowel sign) take multiple bytes, and in general what I saw is that Python splits them into two characters. Does this cause any issue while training?
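For what it's worth, Python 3 strings are sequences of Unicode code points, not bytes, so what looks like one Devanagari syllable splits into several code points (base letter, virama, vowel sign). That split is consistent, so a code-point-level vocab in symbols.py still works as long as it includes the combining marks. A small check:

```python
import unicodedata

word = "क्या"  # Hindi "kya" (what)

# renders as one syllable, but is 4 code points: KA, VIRAMA, YA, vowel sign AA
assert len(word) == 4
assert len(word.encode("utf-8")) == 12  # each code point is 3 bytes in UTF-8

for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

So iterating a Devanagari string yields each matra and virama as its own symbol; include them all in the symbol set and the model treats them like any other character.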
Hi, sorry, I won't be able to share the code. But you can use any of the publicly available Hindi word-to-phoneme converters. For instance, you can use the IIT Madras group's code available here - https://www.iitm.ac.in/donlab/tts/unified.php
@sharathadavanne Thanks for pointing to the resource. Just correct me if my understanding is wrong: for Hindi, we can keep the vocab in symbols.py (all English characters present in the default code), and as a first experiment keep most other parameters as they are, and you changed your Hindi input text into a somewhat Hinglish form, e.g. one particular input could be "aapka naam kya hai"?
You could do as you suggested. A better way is to use the phoneme sequence you get out of the word2phoneme model directly with your Tacotron, i.e., skipping the text-preprocessing part. Obviously, since Tacotron deals with numbers and not phones directly, you will still have to convert the phoneme sequence to a number/symbol sequence, as is done for regular text.
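A minimal sketch of that phoneme-to-ID mapping, mirroring what the repo's text_to_sequence does for characters. The phone inventory here is entirely made up; the real one would come from your word2phoneme tool:

```python
# hypothetical phone inventory; in practice, collect it from the
# output of your word2phoneme converter over the whole corpus
PHONES = ["pad", "sil", "aa", "ai", "ch", "h", "k", "m", "n", "p", "y"]

_phone_to_id = {p: i for i, p in enumerate(PHONES)}
_id_to_phone = {i: p for p, i in _phone_to_id.items()}

def phonemes_to_sequence(phones):
    """Map a phoneme list to the integer IDs Tacotron consumes."""
    return [_phone_to_id[p] for p in phones]

def sequence_to_phonemes(seq):
    """Inverse mapping, handy for debugging."""
    return [_id_to_phone[i] for i in seq]

seq = phonemes_to_sequence(["k", "y", "aa"])  # e.g. for "kya"
assert sequence_to_phonemes(seq) == ["k", "y", "aa"]
```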
Can someone help me cite this repository in a paper? I don't know whom to cite it against, i.e., who is the creator of this repository?
After trying multiple experiments with annealing, fine-tuning on English, and fine-tuning on English+Hindi, we are still getting a reverb effect/hoarseness in the synthesized samples for male speakers. It works so well for female speakers, but it just doesn't cut it for male speakers: a Hindi male speaker and a Tamil male speaker. @rafaelvalle, it would be great if we could get a baseline for a male speaker, as you've mentioned.
I have experimented with fine-tuning Hindi on English, with a male Hindi speaker at that. The results were just fine. How many hours of data did you have?
@AnkurDebnath35 @sharathadavanne Hi, I am retraining the WaveGlow model with male-speaker data using warm_start. What is the minimum amount of audio you would recommend for training? And is there any constraint on using data from multiple male speakers?
Hi @sharathadavanne sir, I am using the unified-parser as you suggested, but how do I convert numbers to Hindi/English-based text? @sharathadavanne @srijan14 sir, did you face this issue, and how did you solve it? @rafaelvalle sir, can I fine-tune my Hindi female voice (~25 hours) from the LJSpeech female pretrained model, or should I train from scratch? With which approach will I get good results? Thanks.
Hi @MuruganR96, you might need to write your own text normalization script for numbers/dates etc. AFAIK there is no out-of-the-box code available publicly. Regarding fine-tuning, it will only help if both your training and fine-tuning data share a common phone set. If they have different phone sets, you will have to train from scratch. Your 25 hours should be more than enough to train a model from scratch.
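As a starting point, a toy sketch of digit expansion: each digit is replaced by a romanized Hindi word before the text reaches the phonemizer. The romanized spellings are illustrative only; a real normalizer would emit Devanagari, handle place values ("42" as "bayalis", not "char do"), and cover dates, currency, and ordinals:

```python
import re

# romanized Hindi digit names (illustrative spellings)
_HINDI_DIGITS = {"0": "shunya", "1": "ek", "2": "do", "3": "teen",
                 "4": "char", "5": "paanch", "6": "chhah",
                 "7": "saat", "8": "aath", "9": "nau"}

def _expand_number(match):
    # naive digit-by-digit reading; a full normalizer would use place values
    return " ".join(_HINDI_DIGITS[d] for d in match.group(0))

def normalize_numbers(text):
    """Replace every run of digits with its spoken (digit-wise) form."""
    return re.sub(r"\d+", _expand_number, text)

print(normalize_numbers("kamra 42"))  # → "kamra char do"
```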
Has this been made public? @rafaelvalle |
Hi, did you change the cleaners for it as well or just symbols.py? |
Hi, so we have been training both Tacotron2 and WaveGlow models on clean male speech (Hindi language) of about 10+ hours at a 16 kHz sampling rate, using phonemic text. We keep the window parameters and mel-band parameters unchanged from the 22.05 kHz setup in the original repo. Both models were trained from scratch in a distributed manner on 8 V100s for over 2 days. The resulting synthesis produces a hoarse voice. We tried synthesizing with models from different iterations; the hoarseness remains the same regardless. Visualizing the spectrograms of the original (left side of the image) and synthesized audio (right side of the image) for the same text, we observe that the spectrogram of the synthesized audio is smudged in the middle frequencies compared to the crisp harmonics of the original audio.
Any suggestions on how to train better?
Left: Original (Audio) , Right: Synthesized (Audio)
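One thing worth checking: keeping the 22.05 kHz STFT settings at 16 kHz changes the window and hop durations in milliseconds, which can hurt quality. A quick sketch of scaling the repo's default STFT parameters proportionally to the new sampling rate (the defaults below are the usual 1024/256/1024; in practice you would round to convenient values and also cap mel_fmax below the new Nyquist, e.g. 8000 Hz for 16 kHz audio):

```python
# scale 22.05 kHz STFT settings to 16 kHz so the window/hop durations
# (in milliseconds) stay roughly the same
orig_sr, new_sr = 22050, 16000
filter_length, hop_length, win_length = 1024, 256, 1024  # repo defaults

scale = new_sr / orig_sr
new_filter = round(filter_length * scale)  # ≈ 743; round to e.g. 800 or 1024
new_hop = round(hop_length * scale)        # ≈ 186; round to e.g. 200
new_win = round(win_length * scale)        # ≈ 743

print(new_filter, new_hop, new_win)
```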