Add flash attention to Transformers #342
Conversation
Signed-off-by: Walter Hugo Lopez Pinaya <ianonimato@hotmail.com>
Hi Walter,
I've noticed some discrepancies between the two implementations. It would also be good to add tests for use_flash_attention, and maybe even a test that compares the outputs of the calculation with and without flash attention, e.g. something along the lines of the sketch below.
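A sketch of such a test, assuming a CUDA device with xformers available and the SABlock signature used later in this thread (the test name and tolerances are illustrative, not part of the PR):

import torch

from generative.networks.blocks import SABlock


def test_flash_attention_matches_manual_attention():
    # Run the same input through the manual and flash-attention code paths.
    device = torch.device("cuda")
    block = SABlock(hidden_size=4, num_heads=2, dropout_rate=0.0, use_flash_attention=False).to(device)
    block.eval()
    x = torch.randn(1, 3, 4, device=device)
    with torch.no_grad():
        expected = block(x)
        block.use_flash_attention = True
        actual = block(x)
    torch.testing.assert_close(actual, expected, atol=1e-4, rtol=1e-4)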
import torch.nn as nn
from torch.nn import functional as F

if importlib.util.find_spec("xformers") is not None:
Currently, if the code does not find xformers but use_flash_attention is set to True, the code errors out. I think we need to set self.use_flash_attention = False in the init if has_xformers is False, and ideally raise a warning too (see the sketch below).
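A minimal sketch of that suggestion, reusing the has_xformers flag from the diff context above; the class name, constructor signature, and warning wording are illustrative assumptions, not the PR's actual code:

import importlib.util
import warnings

import torch.nn as nn

has_xformers = importlib.util.find_spec("xformers") is not None


class SABlockSketch(nn.Module):
    # Illustration only: degrade gracefully when xformers is unavailable.
    def __init__(self, use_flash_attention: bool = False) -> None:
        super().__init__()
        if use_flash_attention and not has_xformers:
            warnings.warn(
                "use_flash_attention=True but xformers is not installed; "
                "falling back to the manual attention implementation."
            )
            use_flash_attention = False
        self.use_flash_attention = use_flash_attention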
Added an error message for the case where the user wants to use flash attention but xformers is not installed.
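For concreteness, the guard that resolution describes could look like the variant below of the __init__ sketch above (the exception type and message wording are assumptions, not the exact code merged in the PR):

if use_flash_attention and not has_xformers:
    raise ValueError("use_flash_attention is True, but xformers is not installed.")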
if self.use_flash_attention:
    query = query.contiguous()
    key = key.contiguous()
    value = value.contiguous()
    y = xops.memory_efficient_attention(
        query, key, value, attn_bias=xops.LowerTriangularMask() if self.causal else None
    )

else:
    # manual implementation of attention
    attention_scores = (query @ key.transpose(-2, -1)) * self.scale

    if self.causal:
        attention_scores = attention_scores.masked_fill(self.causal_mask[:, :, :t, :kv_t] == 0, float("-inf"))
These two methods give different values. scale isn't currently passed to memory_efficient_attention, nor is the dropout probability, but even accounting for that, it looks like that isn't the root of the difference. The Python implementation of memory_efficient_attention looks a little different from the manual implementation here.
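For reference, a sketch of how both could be forwarded to xformers (the wrapper name is hypothetical; p is xformers' dropout argument, while an explicit scale is only accepted by more recent xformers releases):

import xformers.ops as xops


def xformers_attention(query, key, value, causal=False, dropout_p=0.0, scale=None):
    # Inputs are expected as (batch, seq_len, num_heads, head_dim) tensors.
    query, key, value = (t.contiguous() for t in (query, key, value))
    return xops.memory_efficient_attention(
        query,
        key,
        value,
        attn_bias=xops.LowerTriangularMask() if causal else None,
        p=dropout_p,  # dropout probability, matching the manual path's dropout
        scale=scale,  # assumption: requires an xformers version that exposes `scale`
    )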
Thanks for pointing this out, I fixed the problem. The two paths now give matching results:
import torch

from generative.networks.blocks import SABlock

device = torch.device("cuda")
sab = SABlock(hidden_size=4, num_heads=2, dropout_rate=0.0, use_flash_attention=False)
sab = sab.to(device)
sab.eval()
with torch.no_grad():
    x = torch.randn(1, 3, 4).to(device)
    result = sab(x)
    sab.use_flash_attention = True
    result_flash = sab(x)
torch.isclose(result, result_flash)
returning
tensor([[[True, True, True, True],
[True, True, True, True],
[True, True, True, True]]], device='cuda:0')
Signed-off-by: Walter Hugo Lopez Pinaya <ianonimato@hotmail.com>
@marksgraham I was thinking, since PyTorch 2.0 now has native flash attention, do you think it is worth abandoning the xformers implementation?
Hi @Warvito, it would be great to use the native flash attention, but should we keep the current xformers implementation for users that are on PyTorch < 2.0?
Yes, I guess it would be okay.
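A sketch of how the two backends could coexist, assuming torch.nn.functional.scaled_dot_product_attention is used on PyTorch >= 2.0 and xformers remains the fallback for older versions (the helper name and version check are assumptions, not the repository's final code):

import torch.nn.functional as F

# PyTorch >= 2.0 ships a native fused attention kernel.
_HAS_NATIVE_FLASH = hasattr(F, "scaled_dot_product_attention")


def flash_attention(query, key, value, causal=False, dropout_p=0.0):
    # Expects (batch, num_heads, seq_len, head_dim) tensors.
    if _HAS_NATIVE_FLASH:
        return F.scaled_dot_product_attention(
            query, key, value, dropout_p=dropout_p, is_causal=causal
        )
    # Fall back to xformers for PyTorch < 2.0; it expects (batch, seq, heads, dim).
    import xformers.ops as xops

    q, k, v = (t.transpose(1, 2).contiguous() for t in (query, key, value))
    out = xops.memory_efficient_attention(
        q, k, v, attn_bias=xops.LowerTriangularMask() if causal else None, p=dropout_p
    )
    return out.transpose(1, 2)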
Implements #339