
Is it possible to use instruction call with the zero-shot inference? #1314


Description

@ashaydave

I am trying to understand whether CosyVoice 2.0 can generate speech from both of the following:

1. The speech style and speaker characteristics taken from the prompt speech, as in zero-shot inference, which captures them very accurately (see the sketch after this list), and
2. an instruct text for further speaker variability.
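For reference, this is roughly the zero-shot call I am comparing against (a minimal sketch; the model path, prompt wav, and prompt transcript below are placeholders):

```python
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, fp16=False)

# 16 kHz prompt audio that carries the target speech style and speaker identity
prompt_speech_16k = load_wav('some_audio_file_with_a_certain_speech_style.wav', 16000)

for i, j in enumerate(cosyvoice.inference_zero_shot(
        'This is a synthesized speech in a certain speech style',  # text to synthesize
        'transcript of the prompt audio',                          # prompt text (placeholder)
        prompt_speech_16k,
        stream=False)):
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```

This zero-shot call reproduces the prompt speaker very closely, which is the behavior I would like to keep while also applying an instruct text.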

The issue I'm facing with the instruct2 call is that it synthesizes the instructed emotion very well but does not capture the characteristics of the prompt speech: the synthesized voice is wildly off from the prompt speaker. For example:

```python
from cosyvoice.utils.file_utils import load_wav

# load the 16 kHz prompt audio that carries the target speech style
prompt_speech_16k = load_wav('some_audio_file_with_a_certain_speech_style.wav', 16000)

for i, j in enumerate(cosyvoice.inference_instruct2('This is a synthesized speech in a certain speech style',
                                                    'Speak in a happy tone', prompt_speech_16k, stream=False)):
    torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```

The prompt's speech style isn't captured; the emotion (happy tone) comes through, but in a completely different voice that does not resemble the prompt speech.

Am I missing something here?
