Description
I am trying to understand whether CosyVoice 2.0 can generate speech that combines the following:
- The speech style and speaker characteristics taken from the prompt speech, as in zero-shot inference (which is very accurate).
- An instruct text for further speaker variability.
The issue I'm facing with the `inference_instruct2` call is that it doesn't really capture the characteristics of the prompt speech, although it synthesizes the emotion very well: the synthesized speech is wildly off from the prompt speech's characteristics. For example:
```python
# Load the prompt as a 16 kHz waveform (load_wav from cosyvoice.utils.file_utils),
# rather than passing the file path string directly.
prompt_speech_16k = load_wav('some_audio_file_with_a_certain_speech_style.wav', 16000)

for i, j in enumerate(cosyvoice.inference_instruct2(
        'This is a synthesized speech in a certain speech style',
        'Speak in a happy tone',
        prompt_speech_16k,
        stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```
The speech style of the prompt isn't captured; the emotion (happy tone) is, but with a completely different speech style that doesn't resemble the prompt speech at all.
Am I missing something here?