Hoarseness in synthesised voice #297
Comments
These are my findings. You can try them.
@sharathadavanne, for me it sounds as if it is under-trained. @cjerry1243 and @Yeongtae, on the other hand, suggest adding some silence at the end; @Yeongtae is deliberately adding silence in his fork. @rafaelvalle, can you please comment on this one?
Hi @ksaidin and @cjerry1243, sorry for the late reply. I found out that the problem was with the WaveGlow model and had nothing to do with Tacotron. The WaveGlow model trained with zero-mean unit-variance normalized mel features resulted in noisy synthesis, while training with raw un-normalized mel features resulted in clean synthesis. Obviously, in the above comparison, the Tacotron models were trained with the corresponding normalization. Did you face this kind of issue with WaveGlow, @rafaelvalle?
@ksaidin, @sharathadavanne If you train WaveGlow with silence padding, the loss explodes.
You are right, @Yeongtae, but in my scenario the hoarseness was not a result of silence in the training data. It was mainly a result of normalizing the mel features to zero mean and unit variance. When trained with these normalized features the model converged neatly during training, yet during inference, even on training data, it produced hoarse output similar to the recordings attached in the issue above. If I instead trained WaveGlow on the same features without normalization, it produced clean synthesis.
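To make the mismatch concrete, here is a minimal sketch of what the zero-mean unit-variance normalization and its inverse look like (the array shapes and statistics here are made up for illustration). The key point is that a vocoder trained on raw mels must receive raw mels at inference time; if the acoustic model emits normalized features, they have to be denormalized first.

```python
import numpy as np

def normalize_mel(mel, mean, std, eps=1e-8):
    """Zero-mean unit-variance normalization, per mel band."""
    return (mel - mean) / (std + eps)

def denormalize_mel(mel_norm, mean, std, eps=1e-8):
    """Invert the normalization before feeding mels to a vocoder
    trained on raw (un-normalized) features."""
    return mel_norm * (std + eps) + mean

# toy 80-band mel spectrogram with per-band statistics (fake data)
mel = np.random.randn(80, 100) * 2.0 - 3.0
mean = mel.mean(axis=1, keepdims=True)
std = mel.std(axis=1, keepdims=True)

mel_norm = normalize_mel(mel, mean, std)
mel_back = denormalize_mel(mel_norm, mean, std)
assert np.allclose(mel, mel_back)  # the round trip is lossless
```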
@sharathadavanne, glad to hear that; please share the final audio of this text for comparison.
@ksaidin I don't have the exact same recordings as above, but I have audio examples for other text, which you can hear on SoundCloud. You can hear that with un-normalized features the WaveGlow output is identical to the original, whereas with normalized features the WaveGlow output is noisy. Both models were trained for about 500k iterations.
@sharathadavanne @ksaidin Hey, I have also been trying this out with a Hindi dataset. What I'm confused about is whether to use the pretrained WaveGlow model or to train WaveGlow afresh. P.S.: I have around 23+ hours of data. Also, please tell me the other changes required to make this work, e.g. in symbols.py, the cleaners, etc. And how did you handle English words during inference? Please note that the speaker in my data uses English words quite often, which I have transliterated to Devanagari script for training purposes.
@AnkurDebnath35 you definitely need to train a WaveGlow model with your data. Since you have 23+ hours, you can also train it from scratch, though it might take longer. In my experience, it is much faster to train starting from the pretrained WaveGlow model (of any language/gender). And you don't really need to worry about your transcript for training WaveGlow; it only needs your audio, not your transcript.
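Starting from a pretrained checkpoint usually boils down to copying over every parameter whose name and shape match the fresh model and keeping the fresh initialization for the rest. A minimal sketch of that warm-start recipe, using plain dicts of NumPy arrays to stand in for a real (e.g. PyTorch) state dict:

```python
import numpy as np

def warm_start_state(model_state, pretrained_state):
    """Copy pretrained weights into a model's state dict, keeping only
    entries whose names and shapes match; a common warm-start recipe
    when fine-tuning from a published checkpoint."""
    merged = dict(model_state)
    for name, weight in pretrained_state.items():
        if name in merged and merged[name].shape == weight.shape:
            merged[name] = weight
    return merged

# fake state dicts: "conv.weight" matches, "embed.weight" does not
model = {"conv.weight": np.zeros((4, 2)), "embed.weight": np.zeros((10, 3))}
pretrained = {"conv.weight": np.ones((4, 2)), "embed.weight": np.ones((99, 3))}
merged = warm_start_state(model, pretrained)
assert merged["conv.weight"].sum() == 8   # copied from the checkpoint
assert merged["embed.weight"].sum() == 0  # kept the fresh initialization
```

The names `conv.weight` and `embed.weight` are hypothetical placeholders, not actual WaveGlow parameter names.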
@sharathadavanne Thanks. So suppose there is this published pretrained WaveGlow model. Are you suggesting that I train WaveGlow starting from that pretrained model and then train Tacotron? You also said something about feature normalization. Can you explain it? Details like where and what to change in the WaveGlow/Tacotron code.
Hi @AnkurDebnath35, before any of the improvements, please train the base frameworks (Tacotron and WaveGlow) from scratch with your data. Once you get acceptable results, you can think about improvements. For now, you can follow these steps.
@AnkurDebnath35 it is not always necessary to train Tacotron or WaveGlow from scratch, especially if your voice is female, because fine-tuning the existing model can give you good results.
Yes! But my dataset is a male one. Any suggestions? |
You can get better results by training from scratch while we prepare a male speaker baseline model. |
I read an issue here about training with 4-5 hours of data. I have around 4.5 hours of data and want to fine-tune the published model.
@sharathadavanne Can you share the vocab that you used for Hindi? I am not sure whether just adding the Hindi vocab characters to symbols.py is enough, because in Devanagari some characters (the ones with a vowel sign) take multiple bytes, and in general what I saw is that Python splits them into two characters. Does this cause any issue while training?
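For what it's worth, Python 3 strings are sequences of Unicode code points, not bytes, so what looks like one Devanagari syllable splits into several code points (base letter, virama, vowel sign). That split is consistent, so a code-point-level vocab in symbols.py still works as long as it includes the combining marks. A small check:

```python
import unicodedata

word = "क्या"  # Hindi "kya" (what)

# renders as one syllable, but is 4 code points: KA, VIRAMA, YA, vowel sign AA
assert len(word) == 4
assert len(word.encode("utf-8")) == 12  # each code point is 3 bytes in UTF-8

for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

So iterating a Devanagari string yields each matra and virama as its own symbol; include them all in the symbol set and the model treats them like any other character.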
Hi, sorry, I won't be able to share the code. But you can use any of the publicly available Hindi word-to-phoneme converters. For instance, you can use the IIT Madras group's code available here - https://www.iitm.ac.in/donlab/tts/unified.php
@sharathadavanne Thanks for pointing to the resource. Just correct me if my understanding is wrong: for Hindi, we can keep the vocab in symbols.py (all English characters present in the default code), and as a first experiment keep most other parameters as they are, and you changed your Hindi input text into a somewhat Hinglish form, e.g. one particular input could be "aapka naam kya hai"?
You could do as you suggested. A better way is to use the phoneme sequence you get out of the word2phoneme model directly with your Tacotron, i.e., skipping the text-preprocessing part. Obviously, since Tacotron deals with numbers and not phones directly, you will still have to convert the phoneme sequence to a number/symbol sequence, as is done for regular text.
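A minimal sketch of that phoneme-to-ID mapping, mirroring what the repo's text_to_sequence does for characters. The phone inventory here is entirely made up; the real one would come from your word2phoneme tool:

```python
# hypothetical phone inventory; in practice, collect it from the
# output of your word2phoneme converter over the whole corpus
PHONES = ["pad", "sil", "aa", "ai", "ch", "h", "k", "m", "n", "p", "y"]

_phone_to_id = {p: i for i, p in enumerate(PHONES)}
_id_to_phone = {i: p for p, i in _phone_to_id.items()}

def phonemes_to_sequence(phones):
    """Map a phoneme list to the integer IDs Tacotron consumes."""
    return [_phone_to_id[p] for p in phones]

def sequence_to_phonemes(seq):
    """Inverse mapping, handy for debugging."""
    return [_id_to_phone[i] for i in seq]

seq = phonemes_to_sequence(["k", "y", "aa"])  # e.g. for "kya"
assert sequence_to_phonemes(seq) == ["k", "y", "aa"]
```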
Can someone help me cite this repository in a paper? I don't know whom to cite it against, i.e., who is the creator of this repository?
After trying multiple experiments with annealing, fine-tuning on English, and fine-tuning on English+Hindi, we are still getting a reverb effect/hoarseness in the synthesized samples for male speakers. It works so well for female speakers, but it just doesn't cut it for male speakers: a Hindi male speaker and a Tamil male speaker. @rafaelvalle, it would be great if we could get a baseline for a male speaker, as you've mentioned.
I have experimented with fine-tuning Hindi on English, with a male Hindi speaker at that. The results were just fine. How many hours of data did you have?
@AnkurDebnath35 @sharathadavanne Hi, I am retraining the WaveGlow model with male-speaker data using warm_start. What is the minimum amount of audio you would recommend for training? And is there any constraint on using data from multiple male speakers?
Hi @sharathadavanne sir, I am using the unified-parser as you suggested, but how do I convert numbers to Hindi/English-based text? @sharathadavanne @srijan14 sir, did you face this issue, and how did you solve it? @rafaelvalle sir, can I fine-tune my Hindi female voice (~25 hours) from the LJSpeech female pretrained model, or should I train from scratch? With which approach will I get good results? Thanks.
Hi @MuruganR96, you might need to write your own text normalization script for numbers/dates etc. AFAIK there is no out-of-the-box code available publicly. Regarding fine-tuning, it will only help if both your training and fine-tuning data share a common phone set. If they have different phone sets, you will have to train from scratch. Your 25 hours should be more than enough to train a model from scratch.
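As a starting point, a toy sketch of digit expansion: each digit is replaced by a romanized Hindi word before the text reaches the phonemizer. The romanized spellings are illustrative only; a real normalizer would emit Devanagari, handle place values ("42" as "bayalis", not "char do"), and cover dates, currency, and ordinals:

```python
import re

# romanized Hindi digit names (illustrative spellings)
_HINDI_DIGITS = {"0": "shunya", "1": "ek", "2": "do", "3": "teen",
                 "4": "char", "5": "paanch", "6": "chhah",
                 "7": "saat", "8": "aath", "9": "nau"}

def _expand_number(match):
    # naive digit-by-digit reading; a full normalizer would use place values
    return " ".join(_HINDI_DIGITS[d] for d in match.group(0))

def normalize_numbers(text):
    """Replace every run of digits with its spoken (digit-wise) form."""
    return re.sub(r"\d+", _expand_number, text)

print(normalize_numbers("kamra 42"))  # → "kamra char do"
```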
Has this been made public? @rafaelvalle |
Hi, did you change the cleaners for it as well or just symbols.py? |
Hi, so we have been training both Tacotron2 and WaveGlow models on clean male speech (Hindi language) of about 10+ hours at a 16 kHz sampling rate, using phonemic text. We keep the window parameters and mel-band parameters unchanged from the 22.05 kHz setup in the original repo. Both models were trained from scratch in a distributed manner on 8 V100s for over 2 days. The resulting synthesis produces a hoarse voice. We tried synthesizing with models from different iterations; the hoarseness remains the same regardless. Visualizing the spectrograms of the original (left side of the image) and synthesized audio (right side of the image) for the same text, we observe that the spectrogram of the synthesized audio is smudged in the middle frequencies compared to the crisp harmonics of the original audio.
Any suggestions on how to train better?
Left: Original (Audio) , Right: Synthesized (Audio)
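One thing worth checking: keeping the 22.05 kHz STFT settings at 16 kHz changes the window and hop durations in milliseconds, which can hurt quality. A quick sketch of scaling the repo's default STFT parameters proportionally to the new sampling rate (the defaults below are the usual 1024/256/1024; in practice you would round to convenient values and also cap mel_fmax below the new Nyquist, e.g. 8000 Hz for 16 kHz audio):

```python
# scale 22.05 kHz STFT settings to 16 kHz so the window/hop durations
# (in milliseconds) stay roughly the same
orig_sr, new_sr = 22050, 16000
filter_length, hop_length, win_length = 1024, 256, 1024  # repo defaults

scale = new_sr / orig_sr
new_filter = round(filter_length * scale)  # ≈ 743; round to e.g. 800 or 1024
new_hop = round(hop_length * scale)        # ≈ 186; round to e.g. 200
new_win = round(win_length * scale)        # ≈ 743

print(new_filter, new_hop, new_win)
```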