Hi, thanks for open-sourcing this nice work.
The CosyVoice decoder uses a speaker embedding, whereas many other works (Voicebox, SoundStorm, E2-TTS, F5-TTS, etc.) do not use a speaker embedding on the decoder side.
In CosyVoice2, the speech tokenizer has improved considerably. If the speech tokens carry very little speaker information, relying on the prefix prompt alone should work well for zero-shot voice cloning. I assume your team has already run some experiments on dropping the speaker embedding. Is there a good reason to keep the speaker embedding in the flow matching decoder? I hope such results will be included in the CosyVoice2 tech report, or in the tech report for the next version of the CosyVoice model.