
Keep kv cache as list of tensors maybe better than one tensor #1562

Open
lingzhi98 opened this issue Apr 8, 2024 · 3 comments

Comments


lingzhi98 commented Apr 8, 2024

Describe the bug
If we keep the kv cache as a list of tensors, there is no need to concatenate the kv caches of each decoder block (https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/gemma/gemma_causal_lm.py#L225). This would help model performance.
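For illustration, here is a minimal sketch of the two layouts; the shapes and variable names below are hypothetical, not the actual keras_nlp cache shapes:

```python
import jax.numpy as jnp

num_layers, batch, max_len, num_heads, head_dim = 4, 2, 128, 8, 64

# Per-layer caches, one tensor per decoder block (hypothetical shape).
per_layer = [jnp.zeros((batch, 2, max_len, num_heads, head_dim))
             for _ in range(num_layers)]

# Current layout: a single tensor holding every block's cache, built by
# stacking/concatenating the per-layer results on each generation step.
stacked_cache = jnp.stack(per_layer, axis=1)  # (batch, layers, 2, len, heads, dim)

# Proposed layout: keep the caches as a list (or tuple) of per-layer tensors,
# so each block reads and writes only its own tensor and no concatenation
# is needed.
list_cache = per_layer
```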

Expected behavior
Remove the unnecessary concatenation to improve performance.

github-actions bot added the Gemma (Gemma model specific issues) label on Apr 8, 2024
lingzhi98 changed the title from "Keep kv cache as list of tensors maybe better than a one tensor" to "Keep kv cache as list of tensors maybe better than one tensor" on Apr 8, 2024
lingzhi98 (Author)

Splitting the kv cache into separate key and value caches is also important (https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/gemma/gemma_attention.py#L166).
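As a rough illustration (again with hypothetical shapes), a combined layout forces attention to split the tensor on every call, whereas separate key/value caches do not:

```python
import jax.numpy as jnp

batch, max_len, num_heads, head_dim = 2, 128, 8, 64

# Combined layout: one tensor with a size-2 axis that attention must
# slice apart into keys and values on every call.
kv_cache = jnp.zeros((batch, 2, max_len, num_heads, head_dim))
key_cache, value_cache = kv_cache[:, 0], kv_cache[:, 1]

# Split layout: independent key and value tensors, so each cache update can
# be a plain in-place write with no preceding slice.
key_cache = jnp.zeros((batch, max_len, num_heads, head_dim))
value_cache = jnp.zeros((batch, max_len, num_heads, head_dim))
```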

SuryanarayanaY added the type:feature (New feature or request) label on Apr 8, 2024
mattdangerw (Member) commented Apr 8, 2024

@lingzhi98 thanks! We are planning some generation improvements, so we will definitely check this out. Agreed that we can let performance be our guide, probably JAX-compiled performance in particular.

Were you thinking of a specific backend/compiled with XLA/not compiled? What's motivating the suggestion?

lingzhi98 (Author) commented Apr 9, 2024

I use JAX as the Keras backend. I have seen the concatenation become the main overhead as batch size increases. Because the kv caches are kept as one tensor, we need to slice the kv cache to get the corresponding key/value cache, compute the attention output, and then update the cache. The dynamic-update-slice fusion is blocked by this slice op (https://github.com/openxla/xla/blob/main/xla/service/gpu/ir_emission_utils.cc#L472), which hurts performance again.
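The pattern being described looks roughly like this in JAX (hypothetical shapes and helper names, not the actual keras_nlp code):

```python
import jax
import jax.numpy as jnp

batch, num_layers, max_len, num_heads, head_dim = 2, 4, 128, 8, 64
cache = jnp.zeros((batch, num_layers, 2, max_len, num_heads, head_dim))
new_key = jnp.ones((batch, 1, num_heads, head_dim))  # key for one new token

def update_combined(cache, new_key, layer, index):
    # With a single big cache tensor, we first slice out this layer's key
    # cache ...
    key_cache = cache[:, layer, 0]
    # ... then write the new token with dynamic_update_slice and scatter the
    # result back into the big tensor. The leading slice is what prevents XLA
    # from fusing the dynamic-update-slice in place.
    key_cache = jax.lax.dynamic_update_slice(key_cache, new_key, (0, index, 0, 0))
    return cache.at[:, layer, 0].set(key_cache)

def update_per_layer(key_cache, new_key, index):
    # With a per-layer (and per key/value) cache there is no preceding slice,
    # so the update stays a single in-place dynamic-update-slice.
    return jax.lax.dynamic_update_slice(key_cache, new_key, (0, index, 0, 0))
```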

Labels
Gemma (Gemma model specific issues), stat:awaiting keras-eng, type:feature (New feature or request)
Projects
None yet
Development

No branches or pull requests

3 participants