update llama flash attention #646

Merged · 4 commits merged into OptimalScale:main on Sep 19, 2023
Conversation

yaoguany (Collaborator)

1. Update the llama flash attention code (past_key_value and attention_mask handling).
2. Flash attention + kv_cache is now supported (see the sketch below).
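For readers of this thread, here is a minimal sketch, not the code actually merged in this PR, of what a flash-attention forward patch with past_key_value and attention_mask support can look like. It assumes the Hugging Face LlamaAttention layout (q_proj/k_proj/v_proj/o_proj, num_heads, head_dim) and flash_attn_func from the flash-attn package; rotary embeddings are omitted and the cache layout is simplified to (batch, seq, heads, head_dim).

```python
import torch
from flash_attn import flash_attn_func


def patched_llama_attention_forward(
    self,
    hidden_states,
    attention_mask=None,
    position_ids=None,
    past_key_value=None,
    output_attentions=False,
    use_cache=False,
):
    # hidden_states: (batch, q_len, hidden_size)
    bsz, q_len, _ = hidden_states.size()

    # Project to (batch, seq, num_heads, head_dim), the layout flash_attn_func expects.
    query = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
    key = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
    value = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim)
    # (Rotary position embeddings are applied to query/key here in the real model;
    # omitted for brevity.)

    # Reuse cached key/value tensors from earlier decoding steps (kv_cache) and
    # append the new ones.
    if past_key_value is not None:
        key = torch.cat([past_key_value[0], key], dim=1)
        value = torch.cat([past_key_value[1], value], dim=1)
    present = (key, value) if use_cache else None

    # Causal flash attention over the full (cached + new) sequence; the explicit
    # attention_mask is not needed for the dense causal case.
    # (flash_attn_func expects fp16/bf16 CUDA tensors.)
    attn_output = flash_attn_func(query, key, value, causal=True)

    attn_output = attn_output.reshape(bsz, q_len, self.num_heads * self.head_dim)
    return self.o_proj(attn_output), None, present
```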

yaoguany (Collaborator, Author) commented Sep 19, 2023

A40 train

dataset: alpaca
llama2-7b, block size 2048, batch size 10, gradient checkpointing

  • with flash attention: memory 19672 MB
  • without flash attention: memory 42700 MB

A40 eval

dataset: MedQA-USMLE validation, block size 4096, batch size 10

accuracy: same on the first 200 samples with and without flash attention (0.41)

memory:

  • with flash attention: 39752 MB
  • without flash attention: 34876 MB
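The comment does not state how these memory figures were collected (they may simply come from nvidia-smi); as an assumed reproduction aid, one common way is to read PyTorch's peak-allocation counter around a single step:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training or evaluation step here ...
peak_mb = torch.cuda.max_memory_allocated() / 1024**2
print(f"peak GPU memory allocated: {peak_mb:.0f} MB")
```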

yaoguany (Collaborator, Author) commented Sep 19, 2023

Bugs to be fixed in the future:

  • Eval with flash attention still occupies a lot of memory (the eval results are the same).
  • use_cache should be set to True when using flash attention, otherwise the eval process may get stuck (see the example below).
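The LMFlow evaluation entry point is not shown in this thread, so as a hypothetical illustration with plain transformers, the second point amounts to ensuring use_cache=True is passed when generating; the model name and prompt below are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype="auto", device_map="auto"
)
# ...apply the flash-attention patch here (repo-specific step, omitted)...

inputs = tokenizer("Q: ...", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    use_cache=True,  # reported requirement when flash attention is enabled
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```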

research4pan (Contributor) left a comment

LGTM. This updates the llama flash attention implementation to make it compatible with the latest versions of transformers and flash_attention. The major changes are:

  • Support the past_key_value cache optimization, reusing key/value tensor caches during token-by-token decoding.
  • Flash attention is not enabled during token-by-token (incremental) decoding; see the sketch below.
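A minimal sketch, assumed rather than taken from the merged commit, of that second point: route single-token incremental decoding through plain scaled-dot-product attention and everything else through flash attention.

```python
import torch
from flash_attn import flash_attn_func


def attention_core(query, key, value, past_key_value=None):
    # query/key/value: (batch, seq_len, num_heads, head_dim)
    q_len = query.size(1)
    incremental_decoding = past_key_value is not None and q_len == 1
    if incremental_decoding:
        # Token-by-token decoding: fall back to plain scaled-dot-product
        # attention, where flash attention offers little benefit.
        q = query.transpose(1, 2)                     # (batch, heads, 1, head_dim)
        k = key.transpose(1, 2)
        v = value.transpose(1, 2)
        scores = torch.matmul(q, k.transpose(-1, -2)) / (q.size(-1) ** 0.5)
        probs = torch.softmax(scores, dim=-1)
        out = torch.matmul(probs, v).transpose(1, 2)  # back to (batch, 1, heads, head_dim)
    else:
        # Prefill / training: dense causal flash attention.
        out = flash_attn_func(query, key, value, causal=True)
    return out
```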

research4pan merged commit c44175b into OptimalScale:main on Sep 19, 2023
1 check failed