# Table of Contents

- [1. A quick overview of Transformer](#1-a-quick-overview-of-transformer)
- [2. Implementation of a Small Transformer in TensorFlow](#2-implementation-of-a-small-transformer-in-tensorflow)
  - [2.1 Pre-processing of the raw data](#21-pre-processing-of-the-raw-data)
  - [2.2 Attention and Multi-Head Attention](#22-attention-and-multi-head-attention)
  - [2.3 Encoder and Decoder](#23-encoder-and-decoder)

---






# 1. A quick overview of Transformer

The goal of this blog post is to introduce the core ideas behind the transformer without drowning in technical details. I want to provide a skeleton implementation of the transformer so the reader can see the big picture of what it does before reading the original paper, "Attention Is All You Need". The code is in TensorFlow, but it is easy to find PyTorch code if you prefer that. (At the time of doing this exercise, I was more familiar with TensorFlow.) Before we dive into MHA (multi-head attention), we need to transform text into lists of numbers that a neural network can work with. This involves tokenization, embedding, and positional encoding. I will explain what these do, but they are less important details for understanding the attention mechanism. Modern language models have evolved well beyond plain MHA and, from what I have learned, there is no strict rule that says you have to use a certain architecture. This kind of flexibility was a culture shock to me as a physicist. I know there are better blog posts out there, but this is my first post, and I need to write a rough one first before I can write better stuff.

If we think of the transformer as a black box that performs a certain task, we don't really want it to process raw data. Tokenization, embedding, and positional encoding are just the pre-processing of the data so that the actual black box can perform better. The components that do the job are the encoder and the decoder. The encoder is the piece that learns patterns from the data, and the decoder is the part that generates output. An encoder-only model has just the encoder and is suitable for tasks such as classification, which are all about understanding patterns in the data. A decoder-only model has just the decoder and is great for next-token generation. A sequence-to-sequence model combines both. There are subtle differences between the decoder in a decoder-only model and in a sequence-to-sequence model, which will be briefly mentioned at the end of the post and addressed in detail in the next post.



# 2. Implementation of a Small Transformer in TensorFlow

We first import the necessary packages. To pre-process the data, I use BERT, which is a pretrained model.

In [None]:
import tensorflow as tf
from transformers import BertTokenizer,TFBertModel


## 2.1 Pre-processing of the raw data


We first define the tokenizer, which turns input strings into TensorFlow tensors. All tokenization does is map an input word/symbol to a number. Some tokenizers map whole words to numbers (one word -> one token), while others, BERT for example, do sub-word tokenization. Again, how you do it is flexible as long as it works and performs well for your model.


The maximum length I defined here is 10. With `padding='max_length'`, an input with fewer than 10 tokens has its remaining slots filled with zeros; this process is called padding. With `truncation=True`, an input with more than 10 tokens is cut off at 10. Setting `return_attention_mask=True` also returns the padding mask, a tensor that remembers whether a given position holds meaningful information (1) or a padded zero (0). In the output below, 101 and 102 are the special `[CLS]` and `[SEP]` tokens that BERT adds at the start and end of the input.

In [7]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenization_test(input_test):
    inputs = tokenizer(input_test, 
                       return_tensors='tf',
                       max_length=10,
                       return_attention_mask=True,
                       padding='max_length',
                       truncation=True)
    ID_test=inputs['input_ids']
    mask=inputs["attention_mask"]
    print(ID_test)
    print(mask)

if __name__ == '__main__':
    tokenization_test("My dog is cute")

tf.Tensor([[  101  2026  3899  2003 10140   102     0     0     0     0]], shape=(1, 10), dtype=int32)
tf.Tensor([[1 1 1 1 1 1 0 0 0 0]], shape=(1, 10), dtype=int32)
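
To make padding, truncation, and the mask concrete without downloading BERT, here is a minimal pure-Python sketch. The helper `pad_or_truncate` and its `pad_id` argument are hypothetical names for illustration, not part of the `transformers` API, and the token IDs are simply copied from the output above. The real tokenizer is a bit smarter when truncating (it keeps the closing `[SEP]` token), which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, mask), both lists of length max_length."""
    kept = token_ids[:max_length]             # truncation: drop everything past max_length
    n_pad = max_length - len(kept)
    ids = kept + [pad_id] * n_pad             # padding: fill the remaining slots with pad_id
    mask = [1] * len(kept) + [0] * n_pad      # 1 = real token, 0 = padding
    return ids, mask


# token IDs for "My dog is cute", including [CLS]=101 and [SEP]=102
ids, mask = pad_or_truncate([101, 2026, 3899, 2003, 10140, 102], max_length=10)
print(ids)   # [101, 2026, 3899, 2003, 10140, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The mask is exactly what the attention layer needs later in order to ignore the padded slots.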


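The next step is embedding, which maps numbers (token IDs) into vectors. Alternatively, you can think of token IDs as one-dimensional projections of some higher-dimensional vectors that encode richer information. Here I reuse the pretrained BERT model for this map, so nothing is trained at this step. Note that `last_hidden_state` is the output of the full BERT network, so these vectors already depend on the surrounding tokens rather than being a plain per-token lookup.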

In [8]:
# The embedding is the process where you map the token ID at each position to a vector of size D
# Model output is a tensor of size 1 X N_id X D

Embedding_model=TFBertModel.from_pretrained('bert-base-uncased')

def embedding_test(input_test):
    inputs = tokenizer(input_test, 
                       return_tensors='tf',
                       max_length=10,
                       return_attention_mask=True,
                       padding='max_length',
                       truncation=True)
    Embedding_tmp=Embedding_model(inputs)
    output_embedding=Embedding_tmp.last_hidden_state

    print(f"output_shape={output_embedding.shape}")

if __name__ == '__main__':
    embedding_test("My dog is cute") 

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

output_shape=(1, 10, 768)


We can see from the output that the model maps each token ID (a scalar) to a 768-dimensional vector; 768 is the embedding dimension, commonly denoted $d_{model}$ in the literature. TensorFlow also has a pre-defined embedding layer, for which you can choose a custom embedding dimension.

In [9]:
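# TensorFlow offers an embedding layer for custom embeddings
# input_dim is the vocabulary size (number of distinct token IDs), not the sequence length N_id
# embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)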
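
Conceptually, a trainable embedding layer is just a lookup table with one row per token ID. The sketch below mimics that in pure Python. The names `make_embedding_table` and `embed`, the toy vocabulary size, and the standard deviation of 0.05 are all assumptions made for illustration; a real layer would learn the rows during training.

```python
import random


def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """One row of embedding_dim random numbers per token ID (untrained weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]


def embed(token_ids, table):
    """Replace each token ID by its row of the table: N_id -> N_id X embedding_dim."""
    return [table[token_id] for token_id in token_ids]


table = make_embedding_table(vocab_size=20, embedding_dim=4)
vectors = embed([3, 7, 3, 0], table)
print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[0] == vectors[2])       # True: the same ID always gives the same vector
```

Because the same ID always gives the same vector regardless of where it appears, the lookup alone carries no information about word order.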

As for positional encoding, we simply add information about the location of each token into the same vector space, so that the same word at different positions in a sentence can be distinguished. In the original paper, "Attention Is All You Need", the authors used a fixed map. Let us first denote the embedding as the map
\begin{align}
 N_\alpha \rightarrow A^\mu_\alpha \notag
\end{align}
where $N_{\alpha}$ is the token ID at position $\alpha$ and $A^\mu_\alpha$ is the corresponding embedded vector with $\mu=0,1,\ldots,d_{model}-1$. It is common to choose $d_{model}$ to be an even number. For the token at position $\alpha$, the positional encoding is the map $\alpha \rightarrow P^\mu_\alpha$,
\begin{equation}
\begin{split}\notag
P_\alpha^{2i} &= \sin\Big(\frac{\alpha}{10^{4 \times \frac{2i}{d_{model}}}} \Big) \\
P_\alpha^{2i+1}&= \cos\Big(\frac{\alpha}{10^{4 \times \frac{2i}{d_{model}}}} \Big)
\end{split}
\end{equation}
with $i=0,1,2,\ldots,\frac{d_{model}}{2}-1$. Here $P_\alpha^\mu$ is another vector living in the same space, encoding only the information about the position $\alpha$. What we feed into the transformer is $P_\alpha^\mu+A_\alpha^\mu$.
Again, one can use a fixed map or define a positional encoding layer with learnable parameters.
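
To check the formula by hand, here is a small pure-Python sketch that evaluates it entry by entry. The function name `positional_encoding` and the toy sizes are my own choices for illustration; it is meant for reading alongside the equations, not as an efficient implementation.

```python
import math


def positional_encoding(n_positions, d_model):
    """Return a table P with P[alpha][mu] following the sin/cos formula above."""
    assert d_model % 2 == 0, "d_model must be even"
    table = []
    for alpha in range(n_positions):
        row = []
        for i in range(d_model // 2):
            angle = alpha / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))  # P_alpha^{2i}
            row.append(math.cos(angle))  # P_alpha^{2i+1}
        table.append(row)
    return table


P = positional_encoding(n_positions=4, d_model=6)
print(P[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

At position 0 every angle is zero, so the row alternates between $\sin 0 = 0$ and $\cos 0 = 1$; later positions rotate each sin/cos pair at a different frequency.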

In [None]:
# Fixed positional encoding is included for demonstration purposes
# In practice, we can use a positional encoding layer that is optimized during training
# N_id_max is the maximum sequence length; tokenization pads or truncates every input to this length
# Later we will add an attention mask to deal with the padded positions when computing attention scores
class fixed_pos_encod_layer(tf.keras.layers.Layer):
      def __init__(self, N_id_max, embedding_dim):
            assert embedding_dim % 2 == 0, "embedding_dim must be even"
            super(fixed_pos_encod_layer, self).__init__()

            self.N_id_max=N_id_max
            self.embedding_dim=embedding_dim
    
      def _pos_encod(self):
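            # position ranges from 0 to N_id_max-1
            # 2i ranges over 0, 2, ..., embedding_dim-2
            # Due to the alternating pattern of sin and cos, by convention, embedding_dim is even
            # Position indices, shape N_id_max X 1
            pos_index = tf.range(self.N_id_max, dtype=tf.float32)[:, tf.newaxis]
            # frequency factors 10000^(-2i/embedding_dim); tf.range(0, embedding_dim, 2) already yields 2i
            # note that it is more efficient to use tf.exp than tf.pow
            omega = tf.exp(-tf.range(0,self.embedding_dim,2,dtype=tf.float32)/self.embedding_dim *tf.math.log(10000.0))
            
            # shape N_id_max X embedding_dim/2
            angles=pos_index*omega 

            # interleave sin and cos so that even indices hold sin and odd indices hold cos
            pos_encoding=tf.reshape(tf.stack([tf.sin(angles),tf.cos(angles)],axis=-1),
                                    (self.N_id_max,self.embedding_dim))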
            
            # add batch dimension
            # prepare for broadcasting
            return tf.expand_dims(pos_encoding,0)
      
      def call(self, inputs):
            # inputs is a tensor of size B X N_id X D
            N_id=tf.shape(inputs)[1]
            return inputs + self._pos_encod()[:,:N_id,:]
      
class pos_encod_layer(tf.keras.layers.Layer):
      def __init__(self, N_id_max, embedding_dim):
            assert embedding_dim % 2 == 0, "embedding_dim must be even"
            super(pos_encod_layer, self).__init__()

            self.N_id_max=N_id_max
            self.embedding_dim=embedding_dim
            
            # for trainable positional encoding, we want to make sure weights can be called 
            self.pos_encoding=self.add_weight(name="pos_encoding",
                                              shape=(1,N_id_max,embedding_dim),
                                              initializer='random_normal',
                                              trainable=True)
            
      def call(self,inputs):
            # inputs is a tensor of size B X N_id X D
            N_id=tf.shape(inputs)[1]
            # pos_encoding already has a leading batch dimension of 1, so slice along the position axis
            return inputs + self.pos_encoding[:,:N_id,:]
      


## 2.2 Attention and Multi-Head Attention

Now we come to the main topic, the multi-head attention layer. To understand what's going on, it's better to start with single-head attention. The math here is done for an encoder model; I will talk about the differences with other models later. From our pre-processing steps, the input has size $B \times N_{id} \times d_{model}$, where $B$ is the batch size, $N_{id}$ is the input sequence length, and $d_{model}$ is the embedding dimension. The attention computation is carried out independently for each element of the batch, so let's set the batch size to 1. Let's denote the data as the tensor $M_{ij}$, where $i=0,1,\ldots,N_{id}-1$ and $j=0,1,\ldots,d_{model}-1$. Attention introduces three learnable projection matrices of size $d_{model} \times d_{model}$: $W^q$, $W^k$, $W^v$ (I put the label in the superscript because I want to consistently put indices as subscripts). Q stands for query, K stands for key, and V stands for value. We act with these projections on the original input $M$ to obtain three tensors of size $N_{id} \times d_{model}$,
\begin{equation}
\begin{split}\notag
Q_{ij}&=M_{il} W^q_{lj}\\
K_{ij}&=M_{il} W^k_{lj} \\
V_{ij}&=M_{il} W^v_{lj}
\end{split}
\end{equation}
where repeated indices are summed over, e.g. $\vec{A} \cdot \vec{B}\equiv A_iB_i$. We then proceed to calculate the attention score, the overlap between the query and the key. I will try my best to give some intuition about this later, but let me first lay out exactly which mathematical operation happens in this step. The attention score is denoted $S$ and has size $N_{id} \times N_{id}$,
\begin{equation}
S_{ij}=(Q K^T)_{ij}/\sqrt{d_{model}}=Q_{il} (K^T)_{lj}/\sqrt{d_{model}}=Q_{il}K_{jl}/\sqrt{d_{model}} \notag
\end{equation}
Perhaps a better way to see what this is doing is to write Q, K, and V as lists of vectors of size $d_{model}$. For example, $Q=[\vec q_0,\vec q_1,\ldots,\vec q_{N_{id}-1}]$. In this way, we can clearly see that
\begin{align} \notag
S_{ij}=\vec q_i \cdot \vec k_j /\sqrt{d_{model}}
\end{align}
The overlap cannot be used directly as a weight, so we need to turn it into some kind of probability. The goal is to write something like
\begin{align}
\notag
\vec o_i=P_{ij} \vec v_j
\end{align}
The reason I didn't apply the softmax right away is that I want to show clearly along which dimension it is applied. To produce the output at position $i$, you apply the softmax along the last index $j$. In other words, for each query position $i$, one builds a new vector as a weighted sum of the value vectors $\vec v_j$ with weights $P_{ij}$, where
\begin{equation}
P_{ij}=\frac{e^{S_{ij}}}{\sum_l e^{S_{il}}} \notag
\end{equation}
This whole operation would be trivial if all the tensors were fixed; it only becomes meaningful once $W^q$, $W^k$, and $W^v$ are learned. After training, and assuming perfect training, my picture is that $Q$ is the optimal representation of the data, $K$ is the optimal organization of the data, and $V$ is an efficient representation of the output data. To be more specific, positional encoding plus embedding might not be enough to represent the data in an efficient way, especially if you are using the fixed map as in our example, so the learned projections give the model room to re-represent it. Meanwhile, $K$ has to learn to extract the relevant features of the input data. Splitting the above procedure into multiple heads just means doing this calculation independently in $h$ subspaces of dimension $d_{head}=d_{model}/h$, where $h$ is the number of heads (so the scaling factor becomes $\sqrt{d_{head}}$, as in the code below), and then putting the results back together. By convention, one applies a final projection $W^o$ to the concatenated result, which increases the expressiveness of the model.
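
Before moving to the TensorFlow layer, it may help to run the single-head formulas on tiny numbers. The sketch below is pure Python and takes $Q$, $K$, $V$ as given, skipping the projections; the function names and the toy values are assumptions for illustration only.

```python
import math


def softmax(row):
    """Softmax over one row of scores, i.e. over the last index j."""
    shift = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(Q, K, V):
    """Q, K: N vectors of size d; V: N vectors. Returns the weights P and outputs O."""
    d = len(Q[0])
    # S_ij = q_i . k_j / sqrt(d)
    S = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K] for q in Q]
    # P_ij: softmax over j for each fixed i
    P = [softmax(row) for row in S]
    # o_i = sum_j P_ij v_j
    O = [[sum(p * v[c] for p, v in zip(row, V)) for c in range(len(V[0]))] for row in P]
    return P, O


Q = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
P, O = single_head_attention(Q, K, V)
for row in P:
    print([round(p, 3) for p in row])
```

Each row of `P` sums to 1. The last query is the zero vector, so it overlaps equally with every key and its output is simply the average of the value vectors.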

In [None]:
# Multi-head attention layer
# input is a tensor of size B X N_id X d_model
# each token is a vector of dimension D = d_model
# W_q, W_k, W_v => Q, K, V, e.g. Q = inputs \cdot W_q
# attention score: Q \cdot K^T
# Q is the information that is in your input
# K is the information that you are comparing to
# D = d_head * Nof_heads
# e.g. 1000 = 10 * 100
class MHA(tf.keras.layers.Layer):
        def __init__(self, embedding_dim, Nof_heads):
            super(MHA,self).__init__()
            assert embedding_dim % Nof_heads == 0, "embedding_dim must be divisible by num_heads"

            self.embedding_dim=embedding_dim
            self.Nof_heads=Nof_heads  
            self.key_dim= embedding_dim // Nof_heads
 
            # weight matrices for Q, K, V and the output projection
            self.W_q=self.add_weight(name="W_q",
                                     shape=(embedding_dim, embedding_dim),
                                     initializer="random_normal",
                                     trainable=True)
            
            self.W_k=self.add_weight(name="W_k",
                                     shape=(embedding_dim, embedding_dim),
                                     initializer="random_normal",
                                     trainable=True)
            
            self.W_v=self.add_weight(name="W_v",
                                     shape=(embedding_dim, embedding_dim),
                                     initializer="random_normal",
                                     trainable=True)
            
            self.W_out=self.add_weight(name="W_out",
                                     shape=(embedding_dim, embedding_dim),
                                     initializer="random_normal",
                                     trainable=True)
            
        def split_heads(self,Vector):
            
            # input vector is a tensor of size B X N_id X D
            # name vector emphasizes that it is a vector in the embedding space that we are splitting
            # split the last dimension into (Nof_heads, key_dim)
            # Output should be (B, Nof_heads, N_id, key_dim)


            # Split the last dimension into (Nof_heads, key_dim)
            input_reshaped=tf.reshape(Vector,(tf.shape(Vector)[0],tf.shape(Vector)[1],self.Nof_heads,self.key_dim))
            # Transpose to get the standard format 
            
            return tf.transpose(input_reshaped,perm=[0,2,1,3]) # B X Nof_heads X N_id  X key_dim
        
        def call(self,Q,K,V,mask=None):
                
                # Project Q,K,V
                Q=tf.matmul(Q,self.W_q) # B X N_id X D
                K=tf.matmul(K,self.W_k) # B X N_id X D
                V=tf.matmul(V,self.W_v) # B X N_id X D
                
                # split the vector into different heads B X Nof_heads X N_id  X key_dim
                Q=self.split_heads(Q)
                K=self.split_heads(K)
                V=self.split_heads(V)

                # calculate the attention scores for each heads
                # keep in mind that we have to use a tf tensor for the denominator 
                Scores= tf.matmul(Q,K,transpose_b=True) /tf.math.sqrt(tf.cast(self.key_dim,tf.float32))
                # at each batch, head, Q=size of N_id X key_dim, K^T=size of key_dim X N_id 
                # Scores is a tensor of size B X Nof_heads X N_id X N_id
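                if mask is not None:
                      # mask uses 1 = keep, 0 = masked, and must be broadcastable to
                      # B X Nof_heads X N_id X N_id; e.g. a B X N_id padding mask from the
                      # tokenizer should be passed as mask[:, tf.newaxis, tf.newaxis, :]
                      mask=tf.where(mask == 0, -1e9, 0.0)
                      Scores= Scores + mask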

                Soft_max=tf.nn.softmax(Scores,axis=-1)
                 
                # output per head is B X Nof_heads X N_id X key_dim
                Output= tf.matmul(Soft_max,V) 
                # transpose back to the original format
                Output=tf.transpose(Output,perm=[0,2,1,3])
                # concatenate the heads
                Output=tf.reshape(Output,(tf.shape(Q)[0], tf.shape(Q)[2], self.embedding_dim))
                # alternatively but less explicitly one can do tf.reshape(Output,tf.shape(inputs))
                # final projection
                O=tf.matmul(Output, self.W_out)      
               
                return O 
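
The reshape and transpose in `split_heads` can be hard to picture, so here is a pure-Python sketch of the same index bookkeeping for a single sequence (no batch dimension). The helper `merge_heads` and the toy numbers are assumptions for illustration; the TensorFlow layer does the same thing with `tf.reshape` and `tf.transpose`.

```python
def split_heads(X, n_heads):
    """X: N_id X d_model (list of lists) -> n_heads X N_id X key_dim."""
    d_model = len(X[0])
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    key_dim = d_model // n_heads
    # head h sees the contiguous slice [h*key_dim, (h+1)*key_dim) of every token vector
    return [[row[h * key_dim:(h + 1) * key_dim] for row in X] for h in range(n_heads)]


def merge_heads(heads):
    """Inverse of split_heads: n_heads X N_id X key_dim -> N_id X d_model."""
    n_tokens = len(heads[0])
    return [[x for head in heads for x in head[n]] for n in range(n_tokens)]


X = [[0, 1, 2, 3, 4, 5],
     [10, 11, 12, 13, 14, 15]]
heads = split_heads(X, n_heads=3)
print(heads[0])                 # [[0, 1], [10, 11]]
print(heads[2])                 # [[4, 5], [14, 15]]
print(merge_heads(heads) == X)  # True
```

Each head works on its own slice of the embedding dimension, and concatenating the heads restores the original layout before the final projection.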
        


You might have noticed that we added a mask to the attention scores. Its job is to remove the contribution of certain positions. We do this by adding a huge negative number, $-10^{9}$, to the corresponding attention scores: since $e^{-10^9} \sim 0$, the softmax weight of those positions is effectively zero. One type of mask is the padding mask, which records whether a position holds real information or a padded zero. Another is the causal mask, which removes the effect of later tokens on earlier tokens. We will talk about them further after we have the code for the encoder layer and the decoder layer.
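
As a rough illustration of both masks, the pure-Python sketch below builds a causal mask, combines it with a padding mask, and applies the result to a row-wise softmax. The function names are hypothetical, the convention is 1 = keep and 0 = masked as in the tokenizer output, and the scores are all set to zero so that only the mask matters.

```python
import math

NEG_INF = -1e9  # the same large negative number as in the MHA layer


def causal_mask(n):
    """n X n mask: position i may attend to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def combine(causal, padding):
    """Keep key position j only if it is both visible and not padding."""
    return [[c * p for c, p in zip(row, padding)] for row in causal]


def masked_softmax(scores, mask):
    """Add NEG_INF to masked scores, then apply softmax along each row."""
    out = []
    for s_row, m_row in zip(scores, mask):
        shifted = [s + (0.0 if m else NEG_INF) for s, m in zip(s_row, m_row)]
        top = max(shifted)
        exps = [math.exp(s - top) for s in shifted]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


padding = [1, 1, 1, 0]                  # the last position is padding
mask = combine(causal_mask(4), padding)
scores = [[0.0] * 4 for _ in range(4)]  # equal scores, so only the mask matters
for row in masked_softmax(scores, mask):
    print(row)
```

The first row puts all its weight on position 0, the second row splits it evenly between positions 0 and 1, and no row gives any weight to the padded position.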

To increase the expressiveness of the model, we add a feed-forward layer made of two dense layers: the first takes the output to a higher dimension and the second reduces it back. Usually one takes the higher dimension to be some integer multiple of the embedding dimension.

In [10]:
# Feed forward network layer
class Feed_forward_network(tf.keras.layers.Layer):
        def __init__(self,embedding_dim, expanding_dim):
            super(Feed_forward_network,self).__init__()
            self.embedding_dim=embedding_dim
            self.expanding_dim=expanding_dim
            self.ffn=tf.keras.Sequential([
                tf.keras.layers.Dense(expanding_dim,activation='relu'),
                tf.keras.layers.Dense(embedding_dim)])
        
        def call(self,inputs):
              # named inputs to avoid shadowing the Python builtin input
              return self.ffn(inputs)


## 2.3 Encoder and Decoder

The transformer consists of an encoder and a decoder, which are simply stacks of multiple encoder layers and decoder layers. The encoder layer is: MHA layer -> residual connection + layer normalization -> feed-forward layer -> residual connection + layer normalization. We only need the padding mask (source mask) here. Query, key, and value are all calculated from the same input data (self-attention).

In [None]:
class encoder_layer(tf.keras.layers.Layer):
        # the input is result of positional encoding namely B X N_id X D
        # in parameter needs to specify MHA and FFN
        def __init__(self,embedding_dim,Nof_heads,expanding_dim):
            super(encoder_layer,self).__init__()
            #initialize the paramters
            self.embedding_dim=embedding_dim
            self.Nof_heads=Nof_heads
            self.expanding_dim=expanding_dim
            # create the layers
            self.MHA=MHA(embedding_dim,Nof_heads)
            self.FFN=Feed_forward_network(embedding_dim,expanding_dim)
            self.LN1=tf.keras.layers.LayerNormalization(epsilon=1e-6)
            self.LN2=tf.keras.layers.LayerNormalization(epsilon=1e-6)

        def call(self,inputs,source_mask=None):
              # MHA
              Out_MHA=self.MHA(inputs,inputs,inputs,mask=source_mask)
              # residual connection+ layer normalization
              Out1=self.LN1(inputs+Out_MHA)       
              # Feed forward network
              Out_ffn=self.FFN(Out1)
              # residual connection+ layer normalization
              return self.LN2(Out1+Out_ffn)     
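
The "residual connection + layer normalization" step appears twice in the encoder layer, so here is what it does to a single token vector, as a pure-Python sketch. It ignores the learnable scale and offset of the Keras layer (which start at 1 and 0), and the function names and numbers are made up for illustration.

```python
import math


def layer_norm(x, epsilon=1e-6):
    """Normalize one token vector to zero mean and unit variance along the feature axis."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + epsilon) for v in x]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization: LN(x + sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


x = [1.0, 2.0, 3.0, 4.0]               # input to the sublayer (one token)
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # stand-in for the MHA or FFN output
y = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in y])        # [-1.0, -1.0, 1.0, 1.0]
```

The residual keeps the original signal around, while the normalization keeps every token vector on a comparable scale as layers are stacked.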


The decoder layer, on the other hand, is slightly more flexible. It has one self-attention layer and one cross-attention layer. For the self-attention layer, we always apply the causal mask (target mask) on the target data. For the cross-attention layer, Q is computed from the result of the first self-attention layer, which acted on the target sequence. As for K and V, you use the output of the encoder if it's a sequence-to-sequence model. In a decoder-only model there is no encoder output, so there is no need for cross-attention, and K and V come from the target data as well.

In [None]:

class decoder_layer(tf.keras.layers.Layer):
      def __init__(self, embedding_dim, Nof_heads,expanding_dim):
            super(decoder_layer,self).__init__()
            # initialize the paramters 
            self.embedding_dim=embedding_dim
            self.Nof_heads=Nof_heads
            self.expanding_dim=expanding_dim

            # create the layers
            self.MHA_masked=MHA(embedding_dim,Nof_heads)
            self.MHA_cross= MHA(embedding_dim,Nof_heads)
            self.FFN=Feed_forward_network(embedding_dim,expanding_dim)
            self.LN1=tf.keras.layers.LayerNormalization(epsilon=1e-6)
            self.LN2=tf.keras.layers.LayerNormalization(epsilon=1e-6)
            self.LN3=tf.keras.layers.LayerNormalization(epsilon=1e-6)

      def call(self,target,encoder_output,source_mask=None,target_mask=None):
            #inputs could be encoder output or directly from the positional encoding
            # MHA masked
            Out_MHA_masked=self.MHA_masked(target,target,target,mask=target_mask)
            # residual connection+ layer normalization
            Out1=self.LN1(target+Out_MHA_masked)
            # MHA cross
            Out_MHA_cross=self.MHA_cross(Out1,encoder_output,encoder_output,mask=source_mask)
            # residual connection+ layer normalization
            Out2=self.LN2(Out1+Out_MHA_cross)
            # Feed forward network
            Out_ffn=self.FFN(Out2)
            # residual connection+ layer normalization
            return self.LN3(Out2+Out_ffn)

In [None]:
class Encoder(tf.keras.layers.Layer):

      def __init__(self, embedding_dim, Nof_heads,expanding_dim, N_layers):
            super(Encoder,self).__init__()
            #initialize the parameters
            self.embedding_dim=embedding_dim
            self.Nof_heads=Nof_heads
            self.expanding_dim=expanding_dim
            self.N_layers=N_layers
            # create the layers
            self.encoder_layers=[encoder_layer(embedding_dim,Nof_heads,expanding_dim) for _ in range(N_layers)]
            
      def call(self,inputs,source_mask=None):
            #inputs is the output of the positional encoding
            for i in range(self.N_layers):
                  inputs=self.encoder_layers[i](inputs,source_mask)
            return inputs
      
class Decoder(tf.keras.layers.Layer):
      def __init__(self, embedding_dim, Nof_heads, expanding_dim, N_layers):
            super(Decoder,self).__init__()
            #initialize the parameters
            self.embedding_dim=embedding_dim
            self.Nof_heads=Nof_heads
            self.expanding_dim=expanding_dim
            self.N_layers=N_layers
            # create the layers
            self.decoder_layers=[decoder_layer(embedding_dim,Nof_heads,expanding_dim) for _ in range(N_layers)]
            
      def call(self,target,encoder_output,source_mask=None,target_mask=None):
            #inputs is the output of the positional encoding
            for i in range(self.N_layers):
                  target=self.decoder_layers[i](target,encoder_output,source_mask,target_mask)
            return target


Here we have included the most general architecture, without details like dropout layers. In the next post, we will write code for the decoder, encoder, and sequence-to-sequence models and demonstrate how inference works.