Add support for Flash attention #725

Merged

merged 5 commits into EleutherAI:main on Dec 10, 2022

Conversation

VHellendoorn
Contributor

This PR adds Tri Dao's Flash Attention as an optional backend for the global attention operation, enabled by setting the attention_config to [[["flash"], ...]]. I've tested the changes in my own environment and consistently see a 2x speedup at 4K sequence lengths in models ranging from 100M to 3B parameters.
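For reference, the shorthand expands to one attention type per transformer layer. Below is a rough sketch of that mapping, for illustration only: the expand_attention_config helper and the [types, repeat_count] pair format are assumptions, not the code added in this PR.

```python
# Illustration only: how a shorthand like [[["flash"], 24]] can expand into a
# per-layer list of attention types. See the actual config handling in the
# repo for the real logic.
from itertools import cycle

def expand_attention_config(shorthand, num_layers):
    per_layer = []
    for types, repeats in shorthand:
        type_cycle = cycle(types)  # alternates types if more than one is listed
        per_layer.extend(next(type_cycle) for _ in range(repeats))
    assert len(per_layer) == num_layers, "config must cover every layer"
    return per_layer

print(expand_attention_config([[["flash"], 24]], num_layers=24))
# prints a list of 24 "flash" entries, one per transformer layer
```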

Maybe relevant: @tridao @lucidrains

@Quentin-Anthony
Member

Quentin-Anthony commented Nov 30, 2022

@VHellendoorn -- Thanks so much for adding this support! I'll run some tests myself sometime this week. I'll report back here whether I can reproduce your speedup.

@dashstander
Contributor

Here is a side-by-side comparison of a 1.3B model on an 80GB A100 with and without Flash Attention. So far it seems pretty stable at 180 vs. 130 TFLOPS, respectively.

@tridao

tridao commented Dec 7, 2022

This is awesome, thanks @VHellendoorn for the integration and @dashstander for the comparison runs!

@Quentin-Anthony
Member

Thanks for testing this out @dashstander!

@VHellendoorn -- I'll review and approve this once you have run our pre-commit checks. Can you run them and commit the formatting updates with:

pip install pre-commit
cd /path/to/gpt-neox
pre-commit install
pre-commit run --all-files # You will probably have to run twice so that formatting changes can be automatically applied. Make sure all checks pass.
<git commit and push>

@StellaAthena
Member

That is quite impressive! Let’s make sure to leave the runs going and do some downstream evals once they’ve finished training @dashstander, but this looks compelling enough to me to merge.

@VHellendoorn
Contributor Author

Agree, thanks @dashstander! Out of curiosity, what sequence length did you run this experiment with?

@Quentin-Anthony: I ran the pre-commit checks and pushed the corresponding fixes. They made changes to quite a few files outside of the ones I changed; I hope you don't mind that I left those out.

@StellaAthena
Member

@VHellendoorn you can find the configs for the run here. The sequence length is 2048, which is EleutherAI’s default value.

@VHellendoorn
Contributor Author

Great, thanks! FWIW, I've been training with both 4K and 8K sequence lengths and it's really very fast there too -- the 8K model is barely slower than the 4K one for a 2.7B model. No formal benchmarks to share, but just noting that for others looking for info. All props to @tridao for making this possible, of course!

@tridao

tridao commented Dec 7, 2022

Great, thanks! FWIW, I've been training with both 4K and 8K sequence lengths and it's really very fast there too -- the 8K model is barely slower than the 4K one for a 2.7B model. No formal benchmarks to share, but just noting that for others looking for info. All props to @tridao for making this possible, of course!

Ha, I've been training 8k models as well (GPT 1.3B and 2.7B on the Pile)! I'll put out some speed numbers in a blog post this week.

@StellaAthena
Member

@VHellendoorn @tridao this is very interesting info! In your experience, do longer sequence lengths noticeably improve performance in the areas you care about? My recollection (though I would have to go hunt down the papers) is that many people have found 2048 tokens to be sufficient for most NLP tasks.

@VHellendoorn
Contributor Author

I can't speak for regular NLP tasks, but in my current experiments on code, the difference between 2K and 4K (and above) is surprisingly large, at least in terms of loss. In several of my runs, the larger context window reduces loss by 10-20%, especially past 100B tokens or so, at least for smaller models. Bigger model sweeps are still going, but as a first datapoint, the 8K model I mention above has just passed the loss of a concurrently running 2K context-window equivalent in less than half the steps (~23B and ~52B tokens in, resp.). Seems on track to converge quite a bit lower still.

This might be a code-specific phenomenon (though @tridao can hopefully tell us otherwise!). Code files are typically quite large -- 4K tokens is not a bad guess for the median in many languages. I'll also add the caveat that I'm still working on this sweep, but the results have been pretty consistent so far. Definitely makes me see why Codex was trained with 4K tokens.

@tridao

tridao commented Dec 8, 2022

I think most current NLP benchmark tasks don't have long sequences.
Right now I'm interested in how to equip models with memory (e.g., ChatGPT claims to "remember what user said earlier in the conversation" and is rumored to have context length 8k).
I've also been working with some collaborators on multi-turn user interaction and they said they need/want 8k or even 16k context length.

@StellaAthena
Member

@Quentin-Anthony are there any concerns about this PR? It looks good to merge to me.

@Quentin-Anthony merged commit efd5911 into EleutherAI:main on Dec 10, 2022
@StellaAthena added this to the Release V2 milestone on Dec 20, 2022
@chuanli11
Contributor

chuanli11 commented Jan 3, 2023

Thanks for adding this support. I just tried it out, and in my case (an 8xA100 40GB server) I was able to get a 15-50% improvement in TFLOPS with Flash Attention, depending on the model size.

Also, in my tests the memory usage seems to be roughly the same with and without Flash Attention. This study seems to suggest there is about a 15% memory reduction for some of the models.

I wonder if anyone can shed light on how much improvement to expect from adding Flash Attention to GPT-NeoX models.

@dashstander
Contributor

@chuanli11 what model sizes did you use for your tests? I ran some tests at 2.7B and 6.7B parameters. For 2.7B there was a ~20% improvement in peak memory usage, but for 6.7B there wasn't. I'm not sure why that might be though.

Also pinging @Quentin-Anthony for visibility.

@chuanli11
Contributor

chuanli11 commented Jan 6, 2023

Hey @dashstander, these are some of the tests I ran (all on 8xA100 40GB):

| Model | non-flash (activation checkpointing = false) | non-flash (activation checkpointing = true) | flash | pipeline parallel | model parallel | micro_bs |
| --- | --- | --- | --- | --- | --- | --- |
| 19M_pythia | OOM | 80 TFLOPS / 35GB | 90 TFLOPS / 30GB | 1 | 1 | 32 |
| 2.7B | OOM | 96.2 TFLOPS / 31GB | 146.2 TFLOPS / 31GB | 1 | 1 | 4 |
| 6.7B | OOM | 67.4 TFLOPS / 40GB | 84 TFLOPS / 40GB | 2 | 1 | 4 |

Activation checkpointing has a huge impact on memory usage; see the comment here.

Further data points for non-flash with activation checkpointing = false at different micro_bs_per_gpu:

| Model | micro_bs = 1 | micro_bs = 2 | micro_bs = 4 | micro_bs = 8 | micro_bs = 16 | micro_bs = 32 | pipeline parallel | model parallel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 19M_pythia | 48 TFLOPS / 5GB | 74 TFLOPS / 7GB | 102 TFLOPS / 9.5GB | 121.5 TFLOPS / 15GB | 64.6 TFLOPS / 27GB | OOM | 1 | 1 |
| 2.7B | 90 TFLOPS / 40GB | OOM | OOM | OOM | OOM | OOM | 1 | 1 |
| 6.7B | OOM | OOM | OOM | OOM | OOM | OOM | 2 | 1 |

@tridao

tridao commented Jan 6, 2023

Do you have activation checkpointing on? In that case, for the non-flash runs, even though the attention matrix is materialized, it's not saved for the backward (but instead recomputed). So you're trading off compute to reduce memory.
With FlashAttention we found we didn't need to do activation checkpointing (e.g. for 2.7B or 6.7B models on 80GB cards).
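As a generic illustration of that trade-off, here is a small sketch using plain torch.utils.checkpoint on a toy attention module (not the gpt-neox checkpointing path; NaiveAttention and the shapes below are made up for illustration):

```python
# Sketch: activation checkpointing trades extra compute for lower memory.
import torch
from torch.utils.checkpoint import checkpoint

class NaiveAttention(torch.nn.Module):
    """Materializes the full (seq x seq) attention matrix."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, q, k, v):
        scores = (q @ k.transpose(-2, -1)) * self.scale  # O(seq^2) memory
        return torch.softmax(scores, dim=-1) @ v

attn = NaiveAttention(dim=64)
q = k = v = torch.randn(1, 8, 2048, 64, requires_grad=True)

# Without checkpointing: the seq x seq score matrix is kept for the backward pass.
out = attn(q, k, v)

# With checkpointing: activations inside `attn` are dropped after the forward
# pass and recomputed during backward -- less memory, extra compute.
out_ckpt = checkpoint(attn, q, k, v, use_reentrant=False)
```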

@chuanli11
Contributor

Do you have activation checkpointing on? In that case, for the non-flash runs, even though the attention matrix is materialized, it's not saved for the backward (but instead recomputed). So you're trading off compute to reduce memory. With FlashAttention we found we didn't need to do activation checkpointing (e.g. for 2.7B or 6.7B models on 80GB cards).

Thanks for the tip! Indeed, activation checkpointing was on and it had a huge impact. I've updated my original post.

@dashstander
Contributor

@tridao and is this why gradient checkpointing is obviated by FlashAttention?

@tridao

tridao commented Jan 6, 2023

I'd say FlashAttention reduces the need for gradient checkpointing, since attention no longer takes up quadratic memory. Of course, there are cases (GPUs with small memory, large models) where one would still need gradient checkpointing regardless.
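As a rough back-of-the-envelope for that quadratic term (batch size, head count, and fp16 precision below are assumed for illustration, not taken from the runs in this thread):

```python
# Memory needed to materialize one layer's (seq x seq) attention scores.
def attn_matrix_gib(batch, heads, seq_len, bytes_per_elem=2):
    return batch * heads * seq_len**2 * bytes_per_elem / 1024**3

for seq in (2048, 8192):
    print(f"seq={seq}: {attn_matrix_gib(8, 32, seq):.1f} GiB per layer")
# seq=2048:  2.0 GiB per layer
# seq=8192: 32.0 GiB per layer  (4x longer sequence -> 16x more memory)
```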

@greg1232

Now that this code is available, are there any plans to train Pythia etc. with longer sequence lengths, e.g. 32k (using sparse Flash Attention)?

@Quentin-Anthony
Member

Now that this code is available, are there any plans to train Pythia etc. with longer sequence lengths, e.g. 32k (using sparse Flash Attention)?

We intend to finetune some of the Pythia models on longer context lengths as part of the INCITE grant: https://twitter.com/BlancheMinerva/status/1593725723352539136?s=20&t=E-NvaQqiMS7IgmN3uS-zpg

If you're interested in contributing, please join the discord at https://discord.gg/hrcJTaSDeC

@guozhiyao

Hey @dashstander, these are some of the tests I ran (all on 8xA100 40GB):

| Model | non-flash (activation checkpointing = false) | non-flash (activation checkpointing = true) | flash | pipeline parallel | model parallel | micro_bs |
| --- | --- | --- | --- | --- | --- | --- |
| 19M_pythia | OOM | 80 TFLOPS / 35GB | 90 TFLOPS / 30GB | 1 | 1 | 32 |
| 2.7B | OOM | 96.2 TFLOPS / 31GB | 146.2 TFLOPS / 31GB | 1 | 1 | 4 |
| 6.7B | OOM | 67.4 TFLOPS / 40GB | 84 TFLOPS / 40GB | 2 | 1 | 4 |

Activation checkpointing has a huge impact on memory usage; see the comment here.

Further data points for non-flash with activation checkpointing = false at different micro_bs_per_gpu:

| Model | micro_bs = 1 | micro_bs = 2 | micro_bs = 4 | micro_bs = 8 | micro_bs = 16 | micro_bs = 32 | pipeline parallel | model parallel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 19M_pythia | 48 TFLOPS / 5GB | 74 TFLOPS / 7GB | 102 TFLOPS / 9.5GB | 121.5 TFLOPS / 15GB | 64.6 TFLOPS / 27GB | OOM | 1 | 1 |
| 2.7B | 90 TFLOPS / 40GB | OOM | OOM | OOM | OOM | OOM | 1 | 1 |
| 6.7B | OOM | OOM | OOM | OOM | OOM | OOM | 2 | 1 |

@chuanli11 Hi. Does this test use fp16 or bf16? Is ZeRO used?
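For context, I mean the DeepSpeed-style settings along these lines. This is only a sketch with standard DeepSpeed key names, not the configuration actually used in the runs above:

```python
# Sketch of the standard DeepSpeed-style keys in question; not the config
# from the runs above.
ds_config = {
    "fp16": {"enabled": True},          # or "bf16": {"enabled": True}
    "zero_optimization": {"stage": 1},  # stage 0 disables ZeRO; stages 1-3 shard more state
}
```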
