Description
I am trying to understand whether CosyVoice 2.0 can generate speech that combines the following:
- The speech style and speaker characteristics taken from the prompt speech, as in zero-shot inference (which is very accurate).
- An instruct text for further speaker variability.
The issue I'm facing with the `inference_instruct2` call is that it doesn't really capture the characteristics of the prompt speech, although it synthesizes the emotion very well: the synthesized speech is wildly off from the prompt speech's characteristics. For example:
```python
# Load the prompt as a 16 kHz waveform (load_wav from cosyvoice.utils.file_utils),
# rather than passing the file path string directly.
prompt_speech_16k = load_wav('some_audio_file_with_a_certain_speech_style.wav', 16000)

for i, j in enumerate(cosyvoice.inference_instruct2(
        'This is a synthesized speech in a certain speech style',
        'Speak in a happy tone',
        prompt_speech_16k,
        stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```
The speech style of the prompt isn't captured; the emotion (happy tone) is, but with a completely different speech style that doesn't resemble the prompt speech at all.
Am I missing something here?