# Implement a transformer layer with a single head of attention in TensorFlow

The formula is:

$$\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where
- $Q$, $K$, $V$ are the query matrix, key matrix and value matrix
- $d_k$ is the dimension of the query and key vectors
- $d_v$ is the dimension of the value vectors

The **query** is the representation of the word we want to calculate self-attention for. For example, in a sequence containing the words “fluffy” and “pancakes”, if we want the self-attention for “fluffy” we only consider its query, not that of “pancakes”. As soon as we have finished calculating the self-attention for “fluffy”, we can discard its query vector.

The **key** is a representation of each word in the sequence and is used to match against the query of the word for which we currently want to calculate self-attention.

The **value** is the actual representation of each word in a sequence, the representation we really care about. Multiplying the query and key gives us a score that tells us how much weight each value (and thus, its corresponding word) obtains in the self-attention vector. Note that the value is not multiplied by the raw score: the scores are first divided by the square root of d_k, the dimension of the key vector, and then passed through a softmax.
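
To make the roles of query, key and value concrete, here is a minimal NumPy sketch for a two-word sequence. The vectors are made-up numbers chosen only for illustration (they are not real embeddings), and the variable names are hypothetical. It computes the self-attention vector for “fluffy” only, so a single query is matched against every key.

```python
import numpy as np

# Hypothetical 2-dimensional vectors for the words "fluffy" and "pancakes".
q_fluffy = np.array([1.0, 0.0])          # query of the word we attend from
keys = np.array([[1.0, 0.0],             # key of "fluffy"
                 [0.0, 1.0]])            # key of "pancakes"
values = np.array([[10.0, 0.0],          # value of "fluffy"
                   [0.0, 10.0]])         # value of "pancakes"

d_k = keys.shape[-1]
scores = keys @ q_fluffy / np.sqrt(d_k)  # one scaled score per word
weights = np.exp(scores - scores.max())  # numerically stable softmax
weights /= weights.sum()                 # weights now sum to 1
fluffy_attention = weights @ values      # weighted sum of the value vectors

print(weights)
print(fluffy_attention)
```

Because the query of “fluffy” matches its own key more closely than the key of “pancakes”, the first weight is the larger one and the output leans towards the value of “fluffy”.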

How it works:
- compute the dot product between the query matrix and the transposed key matrix
- divide the result by the square root of the query and key dimension (sqrt(d_k)). This is needed because the dot products can grow large, which pushes the softmax into regions with very small gradients and makes learning unstable
- pass the scaled scores through a softmax to obtain the attention weights, which are then used to weight the value matrix and produce the attention output
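
The three steps above can be sketched in plain NumPy as follows. This is an illustrative version under stated assumptions, not the solution code: the function names and the random batched inputs of shape (batch, seq_len, dim) are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2)  # step 1: Q K^T
    scaled = scores / np.sqrt(d_k)       # step 2: divide by sqrt(d_k)
    weights = softmax(scaled, axis=-1)   # step 3: attention weights
    return weights @ V, weights          # weighted sum of the values

# Hypothetical inputs: batch of 2, sequence length 5, d_k = 8, d_v = 16.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 5, 8))
K = rng.normal(size=(2, 5, 8))
V = rng.normal(size=(2, 5, 16))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (2, 5, 16) (2, 5, 5)
```

Each row of `weights` sums to 1, and the output has one d_v-dimensional vector per query position.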

source(s): https://medium.com/analytics-vidhya/understanding-q-k-v-in-transformer-self-attention-9a5eddaa5960

https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/


## Solution

import numpy as np
import tensorflow as tf

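class SingleHeadTransformer(tf.keras.layers.Layer):
    """Single head of scaled dot-product attention."""

    def __init__(self):
        super().__init__()

    def call(self, Q, K, V):
        # Similarity scores Q K^T, shape (batch, seq_len, seq_len)
        scores = tf.matmul(Q, K, transpose_b=True)
        # Scale by sqrt(d_k) so the softmax does not saturate
        d_k = tf.cast(tf.shape(K)[-1], scores.dtype)
        weights = tf.nn.softmax(scores / tf.math.sqrt(d_k), axis=-1)
        # Weighted sum of the values (a matrix product, not an elementwise one)
        return tf.matmul(weights, V)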

class ResidualA(tf.keras.layers.Layer):
    def __init__(self, h_size, use_head=False) -> None:
        super().__init__()
        self.query = tf.keras.layers.Dense(h_size)
        self.key = tf.keras.layers.Dense(h_size)
        self.value = tf.keras.layers.Dense(h_size)
        self.pos_enc = PositionalEncoding(h_size)  # encoding dimension set to h_size
        self.t_head = None
        self.norm = tf.keras.layers.LayerNormalization()  # LayerNormalization rather than Normalization
        if use_head:
            self.t_head = SingleHeadTransformer()

    def call(self, inputs):
        # Add positional information; the residual sum below needs the
        # last input dimension to equal h_size
        inputs = self.pos_enc(inputs)
        if self.t_head is not None:
            Q = self.query(inputs)
            K = self.key(inputs)
            V = self.value(inputs)
            # Residual connection around the attention head
            inputs += self.t_head(Q, K, V)
        inputs = self.norm(inputs)
        return inputs

class ResidualB(tf.keras.layers.Layer):
    def __init__(self, h_size, use_res=False):
        super().__init__()
        self.use_res = use_res
        self.fc1 = tf.keras.layers.Dense(4 * h_size, activation=tf.keras.activations.relu)
        self.fc2 = tf.keras.layers.Dense(h_size)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, inputs):
        x = inputs
        if self.use_res:
            x = self.fc1(x)
            x = self.fc2(x)
            x += inputs
        x = self.norm(x)
        return x

class PositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, d, n=10000):
        super().__init__()
        self.n = n
        self.d = d

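    def call(self, inputs):
        # inputs: (batch, seq_len, d); axis 1 is the sequence axis
        seq_len = inputs.shape[1]
        P = np.zeros((seq_len, self.d))
        for k in range(seq_len):
            for i in np.arange(int(self.d / 2)):
                denominator = np.power(self.n, 2 * i / self.d)
                P[k, 2 * i] = np.sin(k / denominator)      # even indices: sine
                P[k, 2 * i + 1] = np.cos(k / denominator)  # odd indices: cosine
        # NumPy defaults to float64, so match the input dtype before adding
        return inputs + tf.cast(P, inputs.dtype)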

class Transformer(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        self.block_a = ResidualA(128, True)
        self.block_b = ResidualB(128, True)

    def call(self, inputs):
        x = self.block_a(inputs)
        x = self.block_b(x)
        return x


- `SingleHeadTransformer`: defines a single attention head, where query, key and value go through scaled dot-product attention with a softmax activation. This layer is used in `ResidualA`, where its output is added back to the input as a residual.
- `ResidualA`: implements the first block type of the Transformer. The input goes through positional encoding, then self-attention is applied using `SingleHeadTransformer`, added to the input as a residual, and layer-normalized.
- `ResidualB`: implements the second block type of the Transformer. It is a FFNN (two dense layers) with a residual connection, followed by layer normalization.
- `PositionalEncoding`: implements the encoding of input positions in the Transformer. It computes the encoding with the sinusoidal formula (sine on even indices, cosine on odd indices) and adds it to the input.
- `Transformer`: defines the entire architecture, combining the two block types. `block_a` implements the self-attention, while `block_b` implements the FFNN.
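
The formula mentioned for `PositionalEncoding` is, judging from the loop in the code, the sinusoidal one: P(k, 2i) = sin(k / n^(2i/d)) and P(k, 2i+1) = cos(k / n^(2i/d)). As a rough illustration, the same table can be built without Python loops. The function name `positional_encoding` is hypothetical, and the sketch assumes `d` is even.

```python
import numpy as np

def positional_encoding(seq_len, d, n=10000):
    """Sinusoidal encoding table of shape (seq_len, d); assumes d is even."""
    positions = np.arange(seq_len)[:, np.newaxis]  # (seq_len, 1)
    i = np.arange(d // 2)[np.newaxis, :]           # (1, d/2)
    angles = positions / np.power(n, 2 * i / d)    # (seq_len, d/2)
    P = np.zeros((seq_len, d))
    P[:, 0::2] = np.sin(angles)                    # even indices: sine
    P[:, 1::2] = np.cos(angles)                    # odd indices: cosine
    return P

P = positional_encoding(seq_len=4, d=6)
print(P.shape)  # (4, 6)
print(P[0])     # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Every position gets a distinct row, and the table depends only on `seq_len`, `d` and `n`, so it can be computed once and reused across batches.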