## Section 05: Positional Encoding Layer

In this section, we use the positional encoding matrix to build a positional encoding layer in TensorFlow. The positional encoding layer sits at the entry point of a transformer model. Keras does not provide one, so we create a custom layer to implement it.

This custom Keras layer takes three parameters:
- *sequence_length*: the maximum length of the input sequence.
- *vocab_size*: the size of the vocabulary used to generate the token embeddings.
- *embed_dim*: the dimension of the embedding vector.

The layer has two sub-layers:
- *token_embeddings*: an Embedding layer from Keras that converts the input integer tokens to D-dimensional float vectors (where D is equal to embed_dim).
- *position_embeddings*: a matrix of hard-coded sine and cosine values that is used to add positional information to the token embeddings.

When the layer is called on an input, it first generates the token embeddings using the token_embeddings layer, then it adds the position embeddings to the token embeddings and returns the result.

In [7]:
import tensorflow as tf

class PositionalEmbedding(tf.keras.layers.Layer):
    """Positional embedding layer. Assumes tokenized input, transforms it into
    embeddings, and returns the positionally encoded output."""
    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        """
        Args:
            sequence_length: Input sequence length
            vocab_size: Input vocab size, for setting up embedding matrix
            embed_dim: Embedding vector size, for setting up embedding matrix
        """
        super().__init__(**kwargs)
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim     # d_model in paper
        # token embedding layer: Convert integer token to D-dim float vector
        self.token_embeddings = tf.keras.layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim, mask_zero=True
        )
        # positional embedding layer: a matrix of hard-coded sine and cosine values
        matrix = pos_enc_matrix(sequence_length, embed_dim, n=10000)
        self.position_embeddings = tf.constant(matrix, dtype="float32")
 
    def call(self, inputs):
        """Convert input tokens into embedding vectors, then add the
        position vectors"""
        embedded_tokens = self.token_embeddings(inputs)
        return embedded_tokens + self.position_embeddings
 
    # this layer is using an Embedding layer, which can take a mask
    # see https://www.tensorflow.org/guide/keras/masking_and_padding#passing_mask_tensors_directly_to_layers
    def compute_mask(self, *args, **kwargs):
        return self.token_embeddings.compute_mask(*args, **kwargs)
 
    def get_config(self):
        # to make save and load a model using custom layer possible
        config = super().get_config()
        config.update({
            "sequence_length": self.sequence_length,
            "vocab_size": self.vocab_size,
            "embed_dim": self.embed_dim,
        })
        return config

This layer combines an embedding layer with positional encoding. The embedding layer creates word embeddings: it converts each integer token label from the vectorized sentence into a vector that can carry the meaning of the word. With the embedding, you can tell how close in meaning two different words are (see Section 04).

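The embedding output depends on the tokenized input sentence, but the positional encoding is a constant matrix because it depends only on the position. Hence you create a constant tensor for it when you create this layer. In the call() function, the embedding output has shape (batch size, sequence_length, embed_dim) while the positional encoding matrix has shape (sequence_length, embed_dim); TensorFlow broadcasts the matrix across the batch dimension when the two are added. This requires every input to be padded or truncated to exactly sequence_length tokens.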
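
The addition in call() can be illustrated with a small sketch. It uses NumPy arrays as stand-ins for the TensorFlow tensors (TensorFlow follows the same broadcasting rules for this case), and the sizes and variable contents are made up for illustration only; they are not the values used in the layer above.

```python
import numpy as np

batch_size, seq_len, embed_dim = 2, 4, 6   # illustrative sizes only

# Stand-in for the Embedding output: one vector per token per sample
embedded_tokens = np.ones((batch_size, seq_len, embed_dim), dtype="float32")

# Stand-in for the constant positional matrix: one vector per position
position_embeddings = np.arange(
    seq_len * embed_dim, dtype="float32"
).reshape(seq_len, embed_dim)

# Shapes are aligned from the right: (2, 4, 6) + (4, 6) -> (2, 4, 6)
out = embedded_tokens + position_embeddings

print(out.shape)                        # (2, 4, 6)
# Every sample in the batch receives the same positional offsets
print(np.array_equal(out[0], out[1]))   # True
```

The positional matrix carries no batch dimension, so a single copy is reused for every sample in the batch.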

Two additional functions are defined in the layer above. The compute_mask() function delegates to the embedding layer, which (because of mask_zero=True) marks the positions that hold the padding token 0. Keras uses this mask internally to tell which positions of the output are padding. The get_config() function returns all the parameters that were passed to the constructor. This is standard practice in Keras so that a model using the custom layer can be saved and loaded.

We will now pass the training dataset to the PositionalEmbedding layer and print one output.
First, however, we need to reload the functions from Sections 3 and 4 and re-create the training dataset (see Section 3 for details).

In [11]:
def format_dataset(eng, ita):
    """Take an English and an Italian sentence pair and convert them into input
    and target. The input is a dict with keys `encoder_inputs` and
    `decoder_inputs`, each a vector corresponding to the English and Italian
    sentence respectively. The target is also a vector of the Italian sentence,
    advanced by 1 token. All vectors have the same length.
 
    The output will be used for training the transformer model. In the model we
    will create, the input tensors are named `encoder_inputs` and `decoder_inputs`,
    which must match the keys in the dictionary for the source part.
    """
    eng = eng_vectorizer(eng)
    ita = ita_vectorizer(ita)
    source = {"encoder_inputs": eng,
              "decoder_inputs": ita[:, :-1]}  # drop the last position
    target = ita[:, 1:]  # drop the first token: decoder input shifted by one
    return (source, target)
  

def make_dataset(pairs, batch_size=64):
    """Create TensorFlow Dataset for the sentence pairs"""
    # aggregate sentences using zip(*pairs)
    eng_texts, ita_texts = zip(*pairs)
    # convert them into list, and then create tensors
    dataset = tf.data.Dataset.from_tensor_slices((list(eng_texts), list(ita_texts)))
    return dataset.shuffle(2048) \
                  .batch(batch_size).map(format_dataset) \
                  .prefetch(16).cache()


import numpy as np


def pos_enc_matrix(L, d, n=10000):
    assert d % 2 == 0, "Output dimension needs to be an even integer"
    d2 = d//2
    P = np.zeros((L, d))
    k = np.arange(L).reshape(-1, 1)     # column vector of L positions
    i = np.arange(d2).reshape(1, -1)    # row vector of d/2 indices
    denom = np.power(n, -i/d2)          # n**(-2*i/d)
    args = k * denom                    # (L, d/2) matrix
    P[:, ::2] = np.sin(args)            # even columns: sine
    P[:, 1::2] = np.cos(args)           # odd columns: cosine
    return P
  
  
  
import pickle 
from tensorflow.keras.layers import TextVectorization

with open("key_vals.pickle", "rb") as fp:
    key_vals = pickle.load(fp)

with open(f"vectorized_ENGvoc_{key_vals['vocab_size_eng']}_ITAvoc_{key_vals['vocab_size_ita']}_seqLen_{key_vals['seq_length']}.pickle", "rb") as fp:
    data = pickle.load(fp)

# create new instances of the English and Italian vectorizers using the configurations that were saved previously.
# The from_config() method allows for recreating the same TextVectorization layer from a previously saved configuration.
eng_vectorizer = TextVectorization.from_config(data["engvec_config"])
eng_vectorizer.set_weights(data["engvec_weights"])
eng_vectorizer.set_vocabulary(data["engvec_vocabulary"])
ita_vectorizer = TextVectorization.from_config(data["itavec_config"])
ita_vectorizer.set_weights(data["itavec_weights"])
ita_vectorizer.set_vocabulary(data["itavec_vocabulary"])
  
train_pairs = data["train"]

train_ds = make_dataset(train_pairs)

# test the dataset
for inputs, targets in train_ds.take(1):
    print(inputs["encoder_inputs"])
    embed_en = PositionalEmbedding(key_vals["seq_length"], key_vals["vocab_size_eng"], embed_dim=512)
    en_emb = embed_en(inputs["encoder_inputs"])
    print(en_emb.shape)
    print(en_emb._keras_mask)

tf.Tensor(
[[ 36  13   4 ...   0   0   0]
 [ 42 518  34 ...   0   0   0]
 [  3 158  12 ...   0   0   0]
 ...
 [  3  24   8 ...   0   0   0]
 [298  32   4 ...   0   0   0]
 [  5 458 279 ...   0   0   0]], shape=(64, 20), dtype=int64)
(64, 20, 512)
tf.Tensor(
[[ True  True  True ... False False False]
 [ True  True  True ... False False False]
 [ True  True  True ... False False False]
 ...
 [ True  True  True ... False False False]
 [ True  True  True ... False False False]
 [ True  True  True ... False False False]], shape=(64, 20), dtype=bool)


The first tensor printed above is one batch (64 samples) of the vectorized input sentences, padded with zeros to length seq_len (=20). Each token is an integer that is converted into an embedding of dimension d (=512). Hence the shape of en_emb above is (batch size, seq_len, d) = (64, 20, 512).

The last tensor printed above is the mask: it is True wherever the input token is non-zero and False at the padded positions. When we compute the accuracy, the padded locations must not be counted.
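
To see why padding matters for accuracy, here is a minimal sketch using NumPy. The token ids and predictions are made-up toy values, and the mask is derived directly from the labels; this is one possible way to exclude padding, not necessarily the metric used later in this tutorial.

```python
import numpy as np

# Hypothetical toy batch: 2 sentences, padded with 0 to length 5
labels = np.array([[5, 8, 2, 0, 0],
                   [7, 3, 0, 0, 0]])
# Hypothetical predicted token ids (e.g. argmax of a model output)
preds = np.array([[5, 8, 9, 0, 0],
                  [7, 3, 0, 0, 0]])

mask = labels != 0                       # True at real tokens, False at padding
correct = (preds == labels) & mask       # count matches only at real tokens

naive_acc = (preds == labels).mean()     # padded positions inflate this number
masked_acc = correct.sum() / mask.sum()  # accuracy over real tokens only

print(naive_acc, masked_acc)             # 0.9 0.8
```

Counting the padded positions reports 9 matches out of 10, whereas only 4 of the 5 real tokens were predicted correctly.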

Finally, *pos_enc_matrix* and *PositionalEmbedding* were saved in the *positional_encoding* file.