Questions about flow matching decoders #905

@sphmel

Hi, thanks for open-sourcing this nice work.

In the CosyVoice decoder, a speaker embedding is used, while many other works (Voicebox, SoundStorm, E2 TTS, F5-TTS, etc.) do not use a speaker embedding on the decoder side.

In CosyVoice 2, the speech tokenizer has improved quite a lot. If the speech tokens carry very little speaker information, relying only on the prefix prompt should work well for zero-shot cloning. I assume your team has already run some experiments on dropping the speaker embedding. Is there a good reason to use a speaker embedding in the flow matching decoder? I hope such results will appear in the CosyVoice 2 tech report, or in the tech report for the next version of the CosyVoice model.
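To make the two options concrete, here is a rough sketch of what I mean. It is not taken from the CosyVoice code: `build_decoder_input` and all of its parameter names are hypothetical, and I am assuming the decoder input is formed by concatenating conditioning channels per frame. With `speaker_emb=None` the decoder sees only the prefix prompt, as in the prompt-only models mentioned above.

```python
from typing import List, Optional

Frame = List[float]


def build_decoder_input(
    noisy_mel: List[Frame],
    token_feats: List[Frame],
    prompt_mel: List[Frame],
    prompt_len: int,
    speaker_emb: Optional[Frame] = None,
) -> List[Frame]:
    """Hypothetical helper: concatenate per-frame conditioning channels."""
    n_mel = len(noisy_mel[0])
    frames = []
    for t, (x_t, mu_t) in enumerate(zip(noisy_mel, token_feats)):
        # Prefix prompt: real mel for the first prompt_len frames, zeros after.
        cond_t = prompt_mel[t] if t < prompt_len else [0.0] * n_mel
        frame = x_t + mu_t + cond_t
        if speaker_emb is not None:
            # Optional utterance-level speaker embedding, repeated on every frame.
            frame = frame + speaker_emb
        frames.append(frame)
    return frames


# Toy sizes: 4 frames, 2 mel bins, 3 token-feature dims, 5 speaker dims.
noisy = [[0.1, 0.2] for _ in range(4)]
tokens = [[1.0, 2.0, 3.0] for _ in range(4)]
prompt = [[0.5, 0.6] for _ in range(2)]

with_spk = build_decoder_input(noisy, tokens, prompt, 2, [0.0] * 5)
prompt_only = build_decoder_input(noisy, tokens, prompt, 2)
print(len(with_spk[0]), len(prompt_only[0]))
```

My question is essentially whether the second call (prompt only) was tried, and how it compared.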
