update llama flash attention #646

Merged · 4 commits merged into OptimalScale:main on Sep 19, 2023
Conversation

yaoguany (Collaborator)

1. Update the llama flash attention code (past_key_value and attention_mask handling).
2. Flash attention + kv_cache is now supported (see the sketch below).
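For readers of this thread, here is a minimal sketch, not the code actually merged in this PR, of what a flash-attention forward patch with past_key_value and attention_mask support can look like. It assumes the Hugging Face LlamaAttention layout (q_proj/k_proj/v_proj/o_proj, num_heads, head_dim) and flash_attn_func from the flash-attn package; rotary embeddings are omitted and the cache layout is simplified to (batch, seq, heads, head_dim).

```python
import torch
from flash_attn import flash_attn_func


def patched_llama_attention_forward(
    self,
    hidden_states,
    attention_mask=None,
    position_ids=None,
    past_key_value=None,
    output_attentions=False,
    use_cache=False,
):
    # hidden_states: (batch, q_len, hidden_size)
    bsz, q_len, _ = hidden_states.size()

    # Project to (batch, seq, num_heads, head_dim), the layout flash_attn_func expects.
    query = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
    key = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
    value = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
    # (Rotary position embeddings are applied to query/key here in the real model;
    # omitted for brevity.)

    # Reuse cached key/value tensors from earlier decoding steps (kv_cache) and
    # append the new ones.
    if past_key_value is not None:
        key = torch.cat([past_key_value[0], key], dim=1)
        value = torch.cat([past_key_value[1], value], dim=1)
    present = (key, value) if use_cache else None

    # Causal flash attention over the full (cached + new) sequence; the explicit
    # attention_mask is not needed for the dense causal case.
    # (flash_attn_func expects fp16/bf16 CUDA tensors.)
    attn_output = flash_attn_func(query, key, value, causal=True)

    attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.head_dim)
    return self.o_proj(attn_output), None, present
```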

yaoguany (Collaborator, Author) commented Sep 19, 2023

A40 train

dataset: alpaca
llama2-7b, block size 2048, batch size 10, gradient checkpointing

  • with flash attention: memory 19672 MB
  • without flash attention: memory 42700 MB

A40 eval

dataset: MedQA-USMLE validation, block size 4096, batch size 10

accuracy: same on the first 200 samples with and without flash attention (0.41)

memory:

  • with flash attention: 39752 MB
  • without flash attention: 34876 MB
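The comment does not state how these memory figures were collected (they may simply come from nvidia-smi); as an assumed reproduction aid, one common way is to read PyTorch's peak-allocation counter around a single step:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training or evaluation step here ...
peak_mb = torch.cuda.max_memory_allocated() / 1024**2
print(f"peak GPU memory allocated: {peak_mb:.0f} MB")
```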

yaoguany (Collaborator, Author) commented Sep 19, 2023

Bugs to be fixed in the future:

  • Eval with flash attention still occupies a lot of memory (the eval results are the same).
  • use_cache should be set to True when using flash attention, otherwise the eval process may get stuck (see the example below).
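The LMFlow evaluation entry point is not shown in this thread, so as a hypothetical illustration with plain transformers, the second point amounts to ensuring use_cache=True is passed when generating; the model name and prompt below are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype="auto", device_map="auto"
)
# ...apply the flash-attention patch here (repo-specific step, omitted)...

inputs = tokenizer("Q: ...", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,  # reported requirement when flash attention is enabled
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```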

research4pan (Contributor) left a comment

LGTM. This updates the llama flash attention implementation to make it compatible with the latest versions of transformers and flash_attention. The major changes are:

  • Support the past_key_value cache optimization, reusing key/value tensor caches during token-by-token decoding.
  • Flash attention is not enabled during token-by-token (incremental) decoding; see the sketch below.
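A minimal sketch, assumed rather than taken from the merged commit, of that second point: route single-token incremental decoding through plain scaled-dot-product attention and everything else through flash attention.

```python
import torch
from flash_attn import flash_attn_func


def attention_core(query, key, value, past_key_value=None):
    # query/key/value: (batch, seq_len, num_heads, head_dim)
    q_len = query.size(1)
    incremental_decoding = past_key_value is not None and q_len == 1
    if incremental_decoding:
        # Token-by-token decoding: fall back to plain scaled-dot-product
        # attention, where flash attention offers little benefit.
        q = query.transpose(1, 2)                     # (batch, heads, 1, head_dim)
        k = key.transpose(1, 2)
        v = value.transpose(1, 2)
        scores = torch.matmul(q, k.transpose(-1, -2)) / (q.size(-1) ** 0.5)
        probs = torch.softmax(scores, dim=-1)
        out = torch.matmul(probs, v).transpose(1, 2)  # back to (batch, 1, heads, head_dim)
    else:
        # Prefill / training: dense causal flash attention.
        out = flash_attn_func(query, key, value, causal=True)
    return out
```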

research4pan merged commit c44175b into OptimalScale:main on Sep 19, 2023
1 check failed