Thank you for the excellent work! I have a few questions regarding the top-k KV cache selection mechanism, and I would greatly appreciate your clarification.
- Is the top-k mechanism applied during training as well, or is it only used during inference?
- As I understand it, the top-k mechanism is only used to select which tokens participate in the actual attention computation, while the historical KV cache itself still remains in memory without eviction. If so, the memory usage would still grow linearly with the generated sequence length. Is this understanding accurate?
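To make my understanding concrete, here is a minimal NumPy sketch of what I believe happens at each decoding step (all names and shapes here are hypothetical, not from your implementation): top-k only restricts which cached tokens enter the attention computation, while the caller keeps the full K/V cache resident.

```python
import numpy as np

def topk_attention(q, K, V, k):
    """Attend only to the top-k cached tokens by query-key score.

    Note: the full K/V cache is passed in (and retained by the caller),
    so memory still grows linearly with sequence length; top-k only
    limits which tokens participate in the attention computation.
    """
    scores = K @ q                             # (seq_len,)
    idx = np.argpartition(scores, -k)[-k:]     # indices of the top-k tokens
    sel = scores[idx] / np.sqrt(q.shape[0])    # scaled scores of selected tokens
    w = np.exp(sel - sel.max())
    w /= w.sum()                               # softmax over the k selected tokens
    return w @ V[idx]                          # (d,) attention output

# Hypothetical sizes for illustration only.
d, seq_len, k = 16, 128, 8
rng = np.random.default_rng(0)
q = rng.standard_normal(d)
K = rng.standard_normal((seq_len, d))          # entire historical cache stays resident
V = rng.standard_normal((seq_len, d))
out = topk_attention(q, K, V, k)
print(out.shape)  # (16,)
```

If this matches your design, then the compute per step is bounded by k, but the cache itself is never evicted. Please correct me if the selection works differently.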
Thank you for your time.