## Encoder and Decoder
![](https://www.tensorflow.org/images/tutorials/transformer/transformer.png)

The transformer model follows the same general pattern as a standard sequence-to-sequence model with attention.  
- The input sentence is passed through N encoder layers that generate an output for each word/token in the sequence.  
- The decoder attends to the encoder's output and its own input (self-attention) to predict the next word.

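A closer look at the figure:
- In the encoder, the multi-head attention inputs q, k, and v are all the same tensor (self-attention).
- In the decoder, the second multi-head attention block takes q from the target sentence and k, v from the source sentence, so the final output has `output.shape == (batch, q_len, d_model)`.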
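The shape claim above can be checked with a small sketch. It uses plain NumPy rather than the `MultiHeadAttention` layer from this notebook, and it is simplified to a single head with no learned projections and no mask. The function name `scaled_dot_product_attention` and the toy sizes are assumptions made for illustration only.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Single-head attention without masking: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)    # (batch, q_len, k_len)
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights                           # (batch, q_len, d_model)

rng = np.random.default_rng(0)
batch, src_len, tgt_len, d_model = 2, 7, 5, 16            # toy sizes (assumed)
source = rng.normal(size=(batch, src_len, d_model))       # stands in for encoder-side input
target = rng.normal(size=(batch, tgt_len, d_model))       # stands in for decoder-side input

# Encoder-style self-attention: q, k, v are the same tensor.
self_out, self_w = scaled_dot_product_attention(source, source, source)

# Decoder-style attention over the encoder: q from the target, k and v from the source.
cross_out, cross_w = scaled_dot_product_attention(target, source, source)

print(self_out.shape)   # (2, 7, 16)
print(cross_out.shape)  # (2, 5, 16)
print(cross_w.shape)    # (2, 5, 7)
```

The output length always follows the query: the source length only shows up in the attention weights, not in the output.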


### Encoder layer

Each encoder layer consists of two sublayers:

- Multi-head attention (with padding mask)  
- Point-wise feed-forward network

Each of these sublayers has a residual connection around it, followed by layer normalization. Residual connections help avoid the vanishing gradient problem in deep networks.

The output of each sublayer is `LayerNorm(x + Sublayer(x))`. The normalization is done on the `d_model` (last) axis. There are N encoder layers in the transformer.
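As a rough illustration of `LayerNorm(x + Sublayer(x))`, here is a NumPy sketch. It is not the Keras implementation: `layer_norm` omits the learned scale and offset that `tf.keras.layers.LayerNormalization` carries, and `feed_forward` assumes the sublayer is a two-layer ReLU network applied to each position independently. All function names and sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalize over the last (d_model) axis; no learned scale/offset in this sketch."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def feed_forward(x, w1, b1, w2, b2):
    """Assumed form of the point-wise network: d_model -> dff (ReLU) -> d_model."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2

def residual_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x)); dropout is left out here."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
batch, seq_len, d_model, dff = 2, 4, 8, 32                # toy sizes (assumed)
x = rng.normal(size=(batch, seq_len, d_model))
w1 = rng.normal(scale=0.1, size=(d_model, dff))
b1 = np.zeros(dff)
w2 = rng.normal(scale=0.1, size=(dff, d_model))
b2 = np.zeros(d_model)

out = residual_sublayer(x, lambda t: feed_forward(t, w1, b1, w2, b2))

print(out.shape)                                          # (2, 4, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))     # True
```

Because the sublayer output is added to `x`, it must keep the `d_model` width, which is why the feed-forward network projects back down from `dff`.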

import tensorflow as tf
from multi_head_attention import MultiHeadAttention, point_wise_feed_forward_network


class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
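        # Self-attention: the same tensor x is passed for all three attention inputs.
        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)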

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2

if __name__ == "__main__":
    sample_encoder_layer = EncoderLayer(512, 8, 2048)

    sample_encoder_layer_output = sample_encoder_layer(
        tf.random.uniform((64, 43, 512)), False, None)

    print(sample_encoder_layer_output.shape)  # (batch_size, input_seq_len, d_model)

(64, 43, 512)


### Decoder layer

Each decoder layer consists of three sublayers:

- Masked multi-head attention (with look-ahead mask and padding mask)  
- Multi-head attention (with padding mask). V (value) and K (key) receive the encoder output as inputs. Q (query) receives the output from the masked multi-head attention sublayer.  
- Point-wise feed-forward network

The decoder differs from the encoder in having two attention modules:
- Masked multi-head attention (q, k, and v are the same here), which uses a look-ahead mask in addition to the padding mask.
- Multi-head attention, where the query is the output of the preceding masked multi-head attention sublayer and the key and value are the encoder output, so only a padding mask is applied here.
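The two kinds of mask described above can be sketched in NumPy as follows. The helper names, the pad id of 0, and the convention that `1.0` marks a blocked position are assumptions for this example; the notebook's own mask helpers may differ.

```python
import numpy as np

def create_padding_mask(token_ids, pad_id=0):
    """1.0 where the token is padding; shape (batch, 1, 1, seq_len) for broadcasting."""
    mask = (token_ids == pad_id).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]

def create_look_ahead_mask(size):
    """1.0 above the diagonal, so position i cannot attend to any j > i."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

# Toy token ids (assumed); 0 is treated as padding.
target_ids = np.array([[5, 8, 3, 0, 0],
                       [7, 2, 0, 0, 0]])
source_ids = np.array([[4, 9, 6, 1, 0, 0],
                       [3, 3, 0, 0, 0, 0]])

# First decoder attention: look-ahead mask and target padding mask combined.
look_ahead = create_look_ahead_mask(target_ids.shape[1])   # (5, 5)
target_padding = create_padding_mask(target_ids)           # (2, 1, 1, 5)
combined = np.maximum(look_ahead, target_padding)          # (2, 1, 5, 5)

# Second decoder attention: only the padding mask over the source (encoder) positions.
source_padding = create_padding_mask(source_ids)           # (2, 1, 1, 6)

print(combined.shape)        # (2, 1, 5, 5)
print(source_padding.shape)  # (2, 1, 1, 6)
print(combined[0, 0])
```

Taking the element-wise maximum blocks a position if either mask blocks it. No look-ahead mask is needed in the second attention because the whole source sentence is available at every decoding step.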