
Conversation

Drajan commented on May 3, 2019

Multi-head attention weights are computed with two feed-forward (FF) layers, a ReLU activation in between, and a softmax to turn the scores into probabilities: softmax(FF(ReLU(FF(input)))). Tensor shapes (see the sketch below):
input = [batch_size, num_words, embed_size]
attention = [batch_size, num_words, num_heads]
input_attention_weighted = [batch_size, num_heads, embed_size]
output = [batch_size, num_heads * embed_size] ==> concatenation of the per-head representations
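
A minimal PyTorch sketch of the computation and shapes described above. The class name, `hidden_size`, and the example values are hypothetical and not taken from the PR; only the softmax(FF(ReLU(FF(input)))) structure and the tensor shapes follow the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    """Sketch: attention weights via softmax(FF(ReLU(FF(input))))."""

    def __init__(self, embed_size, hidden_size, num_heads):
        super().__init__()
        self.ff1 = nn.Linear(embed_size, hidden_size)  # first FF layer
        self.ff2 = nn.Linear(hidden_size, num_heads)   # second FF layer

    def forward(self, x):
        # x: [batch_size, num_words, embed_size]
        scores = self.ff2(F.relu(self.ff1(x)))         # [batch_size, num_words, num_heads]
        # Softmax over the words, so each head is a distribution over the sequence.
        attention = F.softmax(scores, dim=1)           # [batch_size, num_words, num_heads]
        # Attention-weighted sum of the input for each head:
        # [batch, heads, words] @ [batch, words, embed] -> [batch, heads, embed]
        weighted = torch.bmm(attention.transpose(1, 2), x)
        # Concatenate the per-head representations.
        output = weighted.reshape(weighted.size(0), -1)  # [batch_size, num_heads * embed_size]
        return output, attention


# Example (hypothetical sizes):
x = torch.randn(8, 20, 128)                # [batch_size=8, num_words=20, embed_size=128]
attn = MultiHeadSelfAttention(embed_size=128, hidden_size=64, num_heads=4)
out, weights = attn(x)
print(out.shape, weights.shape)            # torch.Size([8, 512]) torch.Size([8, 20, 4])
```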

tkornuta-ibm merged commit f6c6d1e into IBM:develop on May 3, 2019
