Questions about flow matching decoders #905

Hi, thanks for open-sourcing this nice work.

In the CosyVoice decoder, a speaker embedding is used, while many other works (Voicebox, SoundStorm, E2-TTS, F5-TTS, etc.) do not use a speaker embedding on the decoder side.

In CosyVoice2, the speech tokenizer has been improved quite a lot. If the speech tokens carry very little speaker information, relying only on the prefix prompt should work well for zero-shot cloning. I assume your team has already run some experiments on dropping the speaker embedding. Is there a good reason to keep the speaker embedding in the flow matching decoder? I hope such results will appear in the CosyVoice2 tech report, or in the tech report for the next version of the CosyVoice model.
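
To make the question concrete, here is a minimal sketch of the two conditioning setups I have in mind. This is not CosyVoice's actual code: the function name `build_decoder_condition`, the argument names, and the choice to concatenate everything along the feature axis are all assumptions made for illustration.

```python
import numpy as np

def build_decoder_condition(token_feats, prompt_mel, spk_emb=None):
    """Assemble per-frame conditioning for a flow matching decoder.

    Hypothetical layout, for illustration only.
    token_feats: (T, D) speech-token features at the mel frame rate
    prompt_mel:  (T_p, M) mel frames of the prompt, with T_p <= T
    spk_emb:     (S,) utterance-level speaker embedding, or None to drop it
    """
    T = token_feats.shape[0]
    T_p, M = prompt_mel.shape
    # Prefix prompt: known mel frames in the first T_p positions, zeros elsewhere.
    prefix = np.zeros((T, M), dtype=token_feats.dtype)
    prefix[:T_p] = prompt_mel
    parts = [token_feats, prefix]
    if spk_emb is not None:
        # Repeat the single speaker vector on every frame.
        parts.append(np.tile(spk_emb[None, :], (T, 1)))
    return np.concatenate(parts, axis=-1)

# Toy shapes: 10 frames, 4-dim token features, 3 prompt frames of 5 mel bins.
tokens = np.ones((10, 4))
prompt = np.full((3, 5), 2.0)
speaker = np.array([0.5, -0.5])

with_spk = build_decoder_condition(tokens, prompt, speaker)  # prefix + speaker embedding
prefix_only = build_decoder_condition(tokens, prompt)        # prefix prompt only
```

In the `prefix_only` case the decoder can only get speaker identity from the prompt mel frames (and from whatever leaks through the speech tokens), which is the setup I am asking about.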
