Support returning attention weights in naive attention modules #589
Adds a `return_attn_weights` option to the `forward` method of the `SelfAttention` and `CrossAttention` modules, as well as to `MHA` and indirectly to `Block`.

Motivation
Since FlashAttention does not explicitly materialize the full attention matrix, it does not provide access to the attention weights. However, these weights are useful, or even required, for many applications. As a workaround, FlashAttention can be disabled on a (possibly already pretrained) model, and the option proposed here can then be used to access the attention weights.
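For illustration, a minimal sketch of that workaround, assuming the proposed `return_attn_weights` option lands on `MHA.forward` and that the weights are returned alongside the output (the exact return signature may differ in the final implementation):

```python
import torch
from flash_attn.modules.mha import MHA

# Disable FlashAttention so the naive (non-fused) attention path is used,
# which is the only path that can materialize the attention matrix.
mha = MHA(
    embed_dim=512,
    num_heads=8,
    use_flash_attn=False,
)

x = torch.randn(2, 128, 512)  # (batch, seqlen, embed_dim)

# return_attn_weights is the option proposed in this PR; whether the weights
# come back as a second tuple element is an assumption for this sketch.
out, attn_weights = mha(x, return_attn_weights=True)

# Expected shape: (batch, num_heads, seqlen, seqlen)
print(attn_weights.shape)
```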
Discussion
Since the attention modules are usually deeply nested within Transformer-like architectures, the `return_attn_weights` argument and the corresponding return values have to be propagated through the whole layer chain. This leads to strong entanglement between the modules, as can be seen in the changes.

An alternative could be to implement the option only for the low-level attention modules, i.e. `CrossAttention` and `SelfAttention`, similar to what PyTorch does in its MHA implementation, but not expose it in any upper layer. The attention weights would then need to be extracted via a forward hook (see the sketch below). While this would complicate the process of extracting attention maps, it would reduce inter-layer dependencies and thus improve maintainability. Since extracting attention maps does not appear to be a highly requested feature, this could be seen as the preferred solution.
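As a rough sketch of that hook-based alternative: the low-level module would be told locally to produce weights, and callers would collect them with a forward hook instead of threading an argument through every layer. The module-level flag (`return_attn_weights` as an attribute) and the `(output, attn_weights)` tuple below are assumptions about what such an implementation might look like, not the current API; only `mha.inner_attn` and `register_forward_hook` exist today.

```python
import torch
from flash_attn.modules.mha import MHA

captured = {}

def save_attn_weights(module, inputs, output):
    # Assumes a patched SelfAttention that returns (context, attn_weights)
    # when its module-level flag is enabled; otherwise output is a tensor.
    if isinstance(output, tuple) and len(output) == 2:
        captured[module] = output[1].detach()

mha = MHA(embed_dim=512, num_heads=8, use_flash_attn=False)

# Enable weight output on the inner attention module and attach the hook.
mha.inner_attn.return_attn_weights = True  # hypothetical flag
handle = mha.inner_attn.register_forward_hook(save_attn_weights)

x = torch.randn(2, 128, 512)
_ = mha(x)

handle.remove()
for module, weights in captured.items():
    print(type(module).__name__, weights.shape)
```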
Todos