Hoarseness in synthesised voice #297

Closed
sharathadavanne opened this issue Jan 21, 2020 · 28 comments

@sharathadavanne commented Jan 21, 2020

Hi, so we have been training both Tacotron2 and WaveGlow models on clean male speech (Hindi language), about 10+ hours at a 16 kHz sampling rate, using phonemic text. We kept the window and mel-band parameters unchanged from the 22.05 kHz setup in the original repo. Both models were trained from scratch in a distributed manner on 8 V100 GPUs for over 2 days. The resulting synthesis produces a hoarse voice. We tried synthesizing with models from different iterations; the hoarseness remains the same regardless. Visualizing the spectrograms of the original (left side of the image) and synthesized audio (right side of the image) for the same text, we observe that the spectrogram of the synthesized audio is smudged in the middle frequencies compared to the crisp harmonics of the original audio.

Any suggestions on how to train better?

Left: Original (audio), Right: Synthesized (audio)

Spectrogram - left: Original, right: Synthesized
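
For reference, these are the audio-related settings being referred to. The values below reflect the repo's default 22.05 kHz hparams.py as I recall them (treat exact names and values as assumptions to verify against your checkout); they are the ones usually revisited when moving to 16 kHz data:

```python
# Default audio settings in hparams.py (22.05 kHz LJSpeech setup; verify against your checkout).
sampling_rate = 22050   # set to 16000 for 16 kHz data
filter_length = 1024    # STFT size: ~46 ms at 22.05 kHz, but 64 ms at 16 kHz
hop_length = 256        # frame shift: ~11.6 ms at 22.05 kHz, 16 ms at 16 kHz
win_length = 1024
n_mel_channels = 80
mel_fmin = 0.0
mel_fmax = 8000.0       # 8 kHz is exactly the Nyquist limit of 16 kHz audio
```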

@cjerry1243

@sharathadavanne

  1. Did you resample your data to 22.05 kHz? The pretrained weights expect 22.05 kHz.
  2. It's necessary to trim leading and trailing silence from each utterance, since you don't want Tacotron to learn the alignment of silence, and WaveGlow shouldn't learn the distribution of silence either.
  3. Audio normalization may help if the loudness of your dataset varies widely across clips.
  4. Do you have good phonetic/character embeddings for the Hindi language?

These are my findings; you can try them. A rough preprocessing sketch for points 1-3 follows below.
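
A minimal preprocessing sketch for points 1-3, assuming librosa and soundfile are installed; target_sr and top_db are tunable, and the peak normalization here is only one simple way to even out levels:

```python
import librosa
import soundfile as sf

def preprocess(in_path, out_path, target_sr=22050, top_db=40):
    """Resample, trim leading/trailing silence, and peak-normalize one clip."""
    audio, _ = librosa.load(in_path, sr=target_sr)          # resamples on load
    audio, _ = librosa.effects.trim(audio, top_db=top_db)   # strips head/tail silence
    audio = audio / max(abs(audio).max(), 1e-5)             # simple peak normalization
    sf.write(out_path, audio, target_sr)
```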

@ksaidin
Copy link

ksaidin commented Apr 26, 2020

@sharathadavanne, to me it sounds as if it is under-trained.

@cjerry1243, @Yeongtae, on the other hand, suggests adding some silence at the end:
#269 (comment)

He deliberately adds silence in his fork:
https://github.com/Yeongtae/tacotron2/blob/master/preprocess_audio.py

@rafaelvalle, can you please comment on this one?

@sharathadavanne (Author)

Hi @ksaidin and @cjerry1243, sorry for the late reply. I found out that the problem was with the WaveGlow model and had nothing to do with Tacotron. The WaveGlow model trained with zero-mean, unit-variance normalized mel features was producing noisy synthesis, while training with raw, un-normalized mel features resulted in clean synthesis. Obviously, in the comparison above, the Tacotron models were trained with the corresponding normalization. Did you face this kind of issue with WaveGlow, @rafaelvalle?

@Yeongtae

@ksaidin, @sharathadavanne If you train WaveGlow with silence padding, the loss explodes.

@sharathadavanne (Author)

You are right, @Yeongtae, but in my scenario the hoarseness was not caused by silence in the training data. It was mainly caused by normalizing the mel features to zero mean and unit variance. When trained with these normalized features, the model converged neatly during training; however, during inference, even on training data, it produced hoarse output similar to the recordings attached in the issue above. If I instead trained WaveGlow on the same features without normalization, the synthesis came out clean.
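
For context, this is the kind of transform being described (hypothetical helpers, not code from the repo). Whichever convention you pick, Tacotron's outputs and WaveGlow's inputs have to use the same one, otherwise the vocoder sees mels on a scale it never saw during training:

```python
def normalize_mel(mel, mean, std, eps=1e-5):
    """Zero-mean, unit-variance normalization; mel, mean, std are tensors/arrays
    with per-mel-bin statistics computed over the training set."""
    return (mel - mean) / (std + eps)

def denormalize_mel(mel_norm, mean, std, eps=1e-5):
    """Inverse transform, needed before a vocoder that was trained on raw mels."""
    return mel_norm * (std + eps) + mean
```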

@ksaidin commented Apr 27, 2020

@sharathadavanne, glad to hear that. Please share the final audio for this text for comparison.

@sharathadavanne (Author)

@ksaidin I don't have the exact same recordings as above, but I have audio examples for other text, which you can hear on SoundCloud. You can hear that with unnormalized features the WaveGlow output is nearly identical to the original, whereas with normalized features the WaveGlow output is noisy. Both models were trained for about 500k iterations.

@AnkurDebnath35

@sharathadavanne @ksaidin Hey, I have also been experimenting with a Hindi dataset. What I'm confused about is whether to use the pretrained WaveGlow model or to train WaveGlow from scratch.
If needed, could you guide me through how you trained WaveGlow?

P.S.: I have around 23+ hours of data.

Also, please tell me the other changes required to make things work, e.g., in symbols.py, the cleaners, etc. And how did you handle English words during inference? Note that the speaker in my data uses English words quite often, which I have transliterated to Devanagari script for training purposes.

@sharathadavanne (Author)

@AnkurDebnath35 you definitely need to train a WaveGlow model with your data. Since you have 23+ hours, you could also train it from scratch, though it might take longer. In my experience, it is much faster to train starting from the pretrained WaveGlow model (of any language/gender). And you don't really need to worry about your transcript for training WaveGlow; it only needs your audio, not your transcript.
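
As a rough sketch of warm-starting from the published checkpoint (file names here are placeholders, the checkpoint layout varies between releases, and the repo's own train.py handles this via its checkpoint_path config, so treat this as an assumption rather than the repo's exact procedure):

```python
import json
import torch
from glow import WaveGlow  # model class from the NVIDIA/waveglow repo

# Build the model from the repo's config.json.
with open("config.json") as f:
    waveglow_config = json.load(f)["waveglow_config"]
model = WaveGlow(**waveglow_config)

# Load the published weights; some releases store the full model object
# under a "model" key, others a plain state_dict, so handle both cases.
ckpt = torch.load("waveglow_256channels.pt", map_location="cpu")  # placeholder path
if isinstance(ckpt, dict) and "model" in ckpt:
    ckpt = ckpt["model"]
state = ckpt.state_dict() if hasattr(ckpt, "state_dict") else ckpt
model.load_state_dict(state, strict=False)  # ignore any mismatched keys
# ...then continue training on your own audio.
```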

@AnkurDebnath35

@sharathadavanne Thanks. So suppose I take the published pretrained WaveGlow model. Are you suggesting that I train WaveGlow from that pretrained model and then train Tacotron?

You also mentioned feature normalisation. Can you explain it? For example, where and what to change in the WaveGlow/Tacotron code.
Please also suggest the required changes in the Tacotron code. Thank you again for helping me out here.

@sharathadavanne (Author)

Hi @AnkurDebnath35, before any improvements, please train the base frameworks - Tacotron and WaveGlow - from scratch with your data. Once you get acceptable results, you can think about improvements. For now, you can follow these steps.
If you face any specific problems in the above experiments, you can look through the already extensive issues of this repo and the waveglow repo. If you don't find your problem, raise an issue. Good luck!

@rafaelvalle (Contributor) commented May 5, 2020

@AnkurDebnath35 it is not always necessary to train a Tacotron or a WaveGlow from scratch, especially if your voice is female, because fine-tuning the existing model can give you good results.

@AnkurDebnath35

> @AnkurDebnath35 it is not always necessary to train a Tacotron or a WaveGlow from scratch, especially if your voice is female, because fine-tuning the existing model can give you good results.

Yes! But my dataset is a male one. Any suggestions?

@rafaelvalle (Contributor)

You can get better results by training from scratch while we prepare a male speaker baseline model.

@AnkurDebnath35

I read an issue here about training with 4-5 hours of data. I have around 4.5 hours of data and want to train on top of the published model.

@srijan14

@sharathadavanne Can you share the vocab that you used for Hindi? I am not sure whether just adding the Hindi characters to symbols.py is enough, because in Devanagari some characters (the ones with a vowel sign) take up to 8 bytes, and from what I saw in Python they get split into two characters, as illustrated below. Does this cause any issue while training?
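
To illustrate the splitting mentioned above (a standalone snippet, not repo code): a consonant plus a dependent vowel sign is two Unicode code points, so any code that iterates over the string character by character, symbols.py included, sees them as two separate symbols:

```python
import unicodedata

word = "की"  # consonant KA followed by the dependent vowel sign II
print(len(word))                             # 2 -> two code points, not one "character"
print([unicodedata.name(c) for c in word])
# ['DEVANAGARI LETTER KA', 'DEVANAGARI VOWEL SIGN II']
```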

@sharathadavanne (Author)

Hi, sorry, I won't be able to share the code. But you can use any of the publicly available Hindi word-to-phoneme converters. For instance, you can use the IIT Madras group's code available here: https://www.iitm.ac.in/donlab/tts/unified.php

@srijan14

@sharathadavanne Thanks for pointing to the resource. Correct me if my understanding is wrong: for Hindi, we keep the vocab in symbols.py (all the English characters present in the default code), and as a first experiment keep most other parameters as-is, and you changed your Hindi input text into a romanized ("Hinglish") form, for example "aapka naam kya hai"?

@sharathadavanne (Author)

You could do as you suggested. A better way, though, is to use the phoneme sequence you get out of the word-to-phoneme model directly with your Tacotron, i.e., skipping the text preprocessing part.

Obviously, since Tacotron deals with numbers and not phones directly, you will still have to convert the phoneme sequence to a number/symbol sequence, as is done here for regular text.
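
A minimal sketch of that conversion, assuming a space-separated phoneme string; the inventory and names below are hypothetical stand-ins for whatever phone set your word-to-phoneme converter produces (they are not from the repo):

```python
# Hypothetical phoneme inventory; in practice this comes from your
# word-to-phoneme converter's symbol set, listed in symbols.py.
PHONEMES = ["a", "aa", "i", "ii", "k", "kh", "n", "m", "y", "h", "sil"]
PHONEME_TO_ID = {p: i for i, p in enumerate(PHONEMES)}

def phonemes_to_sequence(phoneme_string):
    """Map a space-separated phoneme string to the integer IDs Tacotron consumes."""
    return [PHONEME_TO_ID[p] for p in phoneme_string.split()]

print(phonemes_to_sequence("n aa m"))  # [6, 1, 7]
```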

@AnkurDebnath35

Can someone help me cite this repository in a paper? I don't know whom to credit; who is the creator of this repository?

@Varshul commented Sep 2, 2020

After trying multiple experiments - annealing, fine-tuning on English, fine-tuning on English+Hindi - we are still getting a reverb effect/hoarseness in the synthesized samples for male speakers.
You can check the alignments along with the samples here: https://trello.com/b/l5YRrfg1/tacotron2-detailing

However, while it works just fine for female speakers, it just doesn't cut it for male speakers (a Hindi male speaker and a Tamil male speaker).

@rafaelvalle it would be great if we could have a baseline for a male speaker, as you've mentioned.

@AnkurDebnath35

@Varshul

I have experimented with fine-tuning Hindi on English, and with a male Hindi speaker at that. The results were just fine. How many hours of data did you have?

@raikarsagar

@AnkurDebnath35 @sharathadavanne Hi, I am retraining the WaveGlow model on male speaker data with warm_start. What is the minimum amount of audio you would recommend for training? And is there any constraint on using data from multiple male speakers?

@MuruganR96 commented Nov 2, 2020

Hi @sharathadavanne, I am using the unified-parser as you suggested, but how do I convert numbers into Hindi/English text?

@sharathadavanne @srijan14, did you face this issue, and how did you solve it?

@rafaelvalle, can I fine-tune my Hindi female voice (~25 hours) from the LJSpeech female pretrained model, or should I train from scratch? With which approach can I get good results?

Thanks
Murugan Rajenthiran

@sharathadavanne (Author)

Hi @MuruganR96, you might need to write your own text normalization script for numbers, dates, etc. AFAIK there is no out-of-the-box code available publicly.

Regarding fine-tuning, it will only help if your pretraining and fine-tuning data share a common phone set. If they have different phone sets, you will have to train from scratch. Your 25 hours should be more than enough to train a model from scratch.
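
For the number question above, a naive digit-by-digit expansion can serve as a starting point before phoneme conversion (a sketch only; proper normalization needs real number words, e.g. "42" should become "बयालीस", plus handling of dates and units):

```python
import re

# Hindi words for single digits (illustrative only).
DIGIT_WORDS = {
    "0": "शून्य", "1": "एक", "2": "दो", "3": "तीन", "4": "चार",
    "5": "पाँच", "6": "छह", "7": "सात", "8": "आठ", "9": "नौ",
}

def expand_digits(text):
    """Replace each run of digits with space-separated Hindi digit words."""
    return re.sub(r"\d+", lambda m: " ".join(DIGIT_WORDS[d] for d in m.group(0)), text)

print(expand_digits("कमरा 42"))  # कमरा चार दो
```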

@DanRuta commented Jan 29, 2021

> You can get better results by training from scratch while we prepare a male speaker baseline model.

Has this been made public? @rafaelvalle

@Aasthaengg

Hi @sharathadavanne, did you change the cleaners for it as well, or just symbols.py?
