
Use KV cache till input seq len for prefill phase #154

Merged
merged 5 commits into HabanaAI:habana-main on Apr 11, 2024

Conversation

puneeshkhanna

Pad the KV cache to the full input + new tokens length for the decode phase. Delete the KV cache tensors used as inputs by HPU graphs after the full prompt has been processed, and ensure the KV cache is not returned as an output tensor during the decode phase. Deletion of the KV cache input tensors used by HPU graphs must be guarded by the PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable.
All of these changes are protected by the bucket_internal flag.

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Pad KV cache to full input + new tokens len for decode phase.
Delete the KV cache used as inputs by HPU graphs after full prompt generation.
Ensure KV cache is not returned as output tensor during decode phase.
Deletion of KV cache input tensor used by HPU graphs needs to be protected by
PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable.
All the changes are protected by bucket internal flag.

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
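
To make the bucketed KV cache handling described above concrete, here is a minimal sketch. The helper names (`allocate_kv_cache`, `pad_kv_cache_for_decode`), the `(batch, heads, seq, head_dim)` cache layout, and the bucket rounding are assumptions for illustration only, not the actual optimum-habana implementation; the idea shown is sizing the cache to the input sequence length for prefill and padding it to input + new tokens length for decode.

```python
import torch
import torch.nn.functional as F

def allocate_kv_cache(batch, num_heads, head_dim, input_len, max_new_tokens,
                      bucket_size, bucket_internal, dtype=torch.bfloat16,
                      device="cpu"):  # "hpu" on Gaudi
    """Size the KV cache for the prefill phase.

    With bucket_internal, prefill only needs room for the input sequence
    (rounded up to a bucket boundary); otherwise allocate the full
    input + new tokens length up front.
    """
    if bucket_internal:
        prefill_len = ((input_len + bucket_size - 1) // bucket_size) * bucket_size
    else:
        prefill_len = input_len + max_new_tokens
    shape = (batch, num_heads, prefill_len, head_dim)
    key = torch.zeros(shape, dtype=dtype, device=device)
    value = torch.zeros(shape, dtype=dtype, device=device)
    return key, value

def pad_kv_cache_for_decode(key, value, input_len, max_new_tokens):
    """Before the decode phase, pad the prefill-sized cache along the sequence
    dimension to the full input + new tokens length so decode steps can write
    new entries into it."""
    target_len = input_len + max_new_tokens
    pad = target_len - key.shape[-2]
    if pad > 0:
        key = F.pad(key, (0, 0, 0, pad))     # pads dim -2 (sequence length)
        value = F.pad(value, (0, 0, 0, pad))
    return key, value

# Example: 2048-token prompt, 2048 new tokens, bucket_size=128 (as in the command below)
k, v = allocate_kv_cache(1, 8, 128, 2048, 2048, 128, bucket_internal=True)
k, v = pad_kv_cache_for_decode(k, v, 2048, 2048)
```
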
@puneeshkhanna (Author) commented Apr 10, 2024

Updated command (remove --reuse_cache; setting PT_HPUGRAPH_DISABLE_TENSOR_CACHE=1 is taken care of automatically):

python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf/ --use_hpu_graphs --use_kv_cache --max_input_tokens 2048 --max_new_tokens 2048 --batch_size 200 --attn_softmax_bf16 --trim_logits --bf16 --warmup 2 --n_iterations 2 --limit_hpu_graphs --bucket_internal --bucket_size 128

Also requires the pytorch-integration patch: https://gerrit.habana-labs.com/#/c/408363/
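
For illustration, a hedged sketch of what the automatic flag handling might look like. The helper name and its placement are assumptions and not the actual run_generation.py code; only the env variable name and the CLI flags come from this PR.

```python
import os

def maybe_disable_hpu_graph_tensor_cache(args):
    """Hypothetical sketch: with --bucket_internal (and without --reuse_cache),
    the KV cache tensors captured as HPU graph inputs are deleted after the
    prompt is processed; that deletion is only safe when
    PT_HPUGRAPH_DISABLE_TENSOR_CACHE is set, so enable it automatically."""
    if getattr(args, "bucket_internal", False) and not getattr(args, "reuse_cache", False):
        os.environ.setdefault("PT_HPUGRAPH_DISABLE_TENSOR_CACHE", "1")
```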

@dvarshney-habana dvarshney-habana merged commit 60b5d9b into HabanaAI:habana-main Apr 11, 2024
sushildubey171 pushed a commit that referenced this pull request Apr 12, 2024
* Use KV cache till input seq len for prefill phase.

Pad KV cache to full input + new tokens len for decode phase.
Delete the KV cache used as inputs by HPU graphs after full prompt generation.
Ensure KV cache is not returned as output tensor during decode phase.
Deletion of KV cache input tensor used by HPU graphs needs to be protected by
PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable.
All the changes are protected by bucket internal flag.

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>

* Revert initialization of KV cache

* Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE flag

* remove os import

* remove commented print

---------

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
astachowiczhabana pushed a commit that referenced this pull request Apr 19, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 22, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 24, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 24, 2024
puneeshkhanna pushed a commit to puneeshkhanna/optimum-habana-fork that referenced this pull request May 2, 2024
@astachowiczhabana

huggingface#1028
