1  Generating (Shakespearean) Text with a GPT-like Transformer

In this exercise we are going to build a GPT-like transformer. Such a transformer is a decoder-only transformer and hence doesn't include a cross-attention layer.

As a practical application of the transformer, we will train it to generate Shakespearean text.

**Note**: the following video [1] from Andrej Karpathy explains in detail how to build a GPT-like decoder. The video uses the PyTorch framework, but the concepts are identical. The video also follows the "Attention Is All You Need" paper, apart from the placement of the *LayerNormalization* layers.

The overall architecture of a **decoder-only transformer** is shown in Figure 1.

1.1  Implement the FeedForward Layer

We start by implementing the *FeedForward* layer. According to equation (2) of the "Attention Is All You Need" paper, this layer performs the following calculation:
$$
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$
which is applied to each position (i.e. each time step) $x$ independently. The dimensions of the input and output are identical, but the layer in between is 4 times wider.
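To make the formula concrete, here is a minimal NumPy sketch of the position-wise computation. The helper name `ffn`, the random weights, and the small embedding size are illustrative assumptions and not part of the exercise's starter code. The sketch only checks that the output keeps the input shape and that each time step is transformed independently of the others.

```python
import numpy as np

def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward: max(0, x W1 + b1) W2 + b2."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU on the wider hidden layer
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
embed_size, factor = 8, 4                  # hidden layer is 4 times wider
w1 = rng.normal(size=(embed_size, factor * embed_size))
b1 = np.zeros(factor * embed_size)
w2 = rng.normal(size=(factor * embed_size, embed_size))
b2 = np.zeros(embed_size)

x = rng.normal(size=(2, 10, embed_size))   # (batch, time steps, embedding)
y = ffn(x, w1, b1, w2, b2)
print(y.shape)                             # same shape as the input

# Feeding a single time step gives the same result as slicing the full
# output, because no information flows between positions.
single = ffn(x[:, 3:4, :], w1, b1, w2, b2)
print(np.allclose(single, y[:, 3:4, :]))
```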

For this exercise, we are going to implement this layer as a subclass of *keras.layers.Layer*, but we will **not** use any other layers; instead you should create the weights and perform the calculation yourself.

[1] [https://www.youtube.com/watch?v=kCc8FmEb1nY](https://www.youtube.com/watch?v=kCc8FmEb1nY)


In [1]:
%pip install --upgrade pip --quiet
%pip install keras --quiet
%pip install tensorflow-metal --quiet
%pip install tensorflow-macos --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Simple FeedForward layer
import keras

@keras.saving.register_keras_serializable()
class FeedForward(keras.layers.Layer):

    def __init__(self, factor=4, **kwargs):
        super().__init__(**kwargs)
        self.factor = factor

    def build(self, batch_input_shape):
        embed_size = batch_input_shape[-1]
        hidden_size = self.factor * embed_size
        #! YOUR CODE HERE:
        self.w1 = self.add_weight(shape=(embed_size, hidden_size))
        self.w2 = self.add_weight(shape=(hidden_size, embed_size))
        # Bias shapes must be tuples, hence the trailing comma
        self.b1 = self.add_weight(shape=(hidden_size,))
        self.b2 = self.add_weight(shape=(embed_size,))

    #? Override `call`, not `__call__`: Keras's own `__call__` runs `build`
    #? first and then `call`, so the layer is still used as `FeedForward()(inputs)`
    def call(self, inputs):
        #! YOUR CODE HERE:
        #! Perform calculation on inputs and return result
        x = keras.ops.matmul(inputs, self.w1) + self.b1
        x = keras.ops.relu(x)  # max(0, .) without using another layer
        return keras.ops.matmul(x, self.w2) + self.b2

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "factor": self.factor}

In [None]:
# Simple test code
TEST_SHAPE = (2, 10, 32)  # Batch size 2, 10 time steps, embedding dimension 32
X = keras.random.normal(shape=TEST_SHAPE)
ff = FeedForward()
print(f"Shape of output {ff(X).shape}")  # Should print (2, 10, 32)
for w in ff.get_weights():  # Check that the shapes are what you expect
    print(w.shape)


1.2  Implement a GPT Decoder Block

Next, implement a *GPTDecoderBlock* class as a subclass of *keras.layers.Layer*. This layer represents one decoder block. It consists of
- Masked (or causal) multi-head attention.
- Layer normalization (and a skip connection)
- A feed forward layer (which was implemented in the previous step)
- A second layer normalization step (and a skip connection)

You can see the starter code for the *GPTDecoderBlock* class in Figure 3.

Note the following:
1. Keras provides a *MultiHeadAttention* class that you can use. In the "Attention Is All You Need" paper it is mentioned at the end of section 3.2.2 that the dimensions for the keys and values are
   $$
   d_k = d_v = d_{model}/h,
   $$
   where $h$ denotes the number of heads and $d_{model}$ is the dimension of the embeddings.
2. In the *call* method you need to make sure to apply the causal masking. If you don't, the model will seem to learn very quickly, but at test time it will not do anything useful.
3. In section 3.1 of the "Attention Is All You Need" paper you can see that the skip connection and the layer normalisation are implemented as follows:
   $$
   \text{LayerNorm}(x + \text{Sublayer}(x)).
   $$
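The first two notes can be illustrated with a small NumPy sketch: the per-head dimension for a hypothetical $d_{model} = 32$ with $h = 4$ heads, and a causal mask applied to attention scores before the softmax. The helper names `causal_mask` and `masked_softmax` and the random scores are assumptions made for illustration. In the exercise itself the masking is handled by the Keras attention layer.

```python
import numpy as np

d_model, h = 32, 4
d_k = d_v = d_model // h           # per-head key/value dimension
print(d_k)                         # 8

def causal_mask(seq_length):
    """Lower-triangular boolean mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_length, seq_length), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis, with masked-out entries set to -inf first."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))   # hypothetical attention scores for 5 tokens
weights = masked_softmax(scores, causal_mask(5))
print(np.round(weights, 2))        # entries above the diagonal are zero
```

Without the mask, every position could read the tokens that come after it, which is exactly the information the model is asked to predict during training.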

In [None]:
# A GPT decoder block (without cross-attention)
@keras.saving.register_keras_serializable()
class GPTDecoderBlock(keras.layers.Layer):
    def __init__(self, num_heads, embed_size, **kwargs):
        super().__init__(**kwargs)
        self.num_heads = num_heads
        self.embed_size = embed_size
        #! YOUR CODE HERE
        #! Add needed layers (either from Keras or your own custom layer)
        self.attention = keras.layers.MultiHeadAttention(num_heads=num_heads,
                                                         key_dim=(embed_size//num_heads))
        self.normalization_1 = keras.layers.LayerNormalization()
        self.normalization_2 = keras.layers.LayerNormalization()
        self.feed_forward_network = FeedForward()

    def call(self, inputs):
        #! YOUR CODE HERE
        #! Perform the computation on inputs and return result
        skip_connection_1 = inputs # (batch, seq_length, embed_size)

        #* Output of the causally masked self-attention
        x = self.attention(inputs, inputs, use_causal_mask=True)
        #* Skip connection followed by layer normalization
        x = self.normalization_1(x + skip_connection_1)

        skip_connection_2 = x
        #* Output of the feed-forward network
        x = self.feed_forward_network(x)
        #* Skip connection followed by layer normalization
        x = self.normalization_2(x + skip_connection_2)
        return x # (batch, seq_length, embed_size)

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "num_heads": self.num_heads, "embed_size": self.embed_size}


In the expression LayerNorm(x + Sublayer(x)) from note 3 above, Sublayer is either the multi-head attention layer or the feed-forward layer.
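The pattern LayerNorm(x + Sublayer(x)) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the `layer_norm` helper omits the learnable scale and shift that the Keras *LayerNormalization* layer has, its `eps` value is chosen arbitrarily, and the stand-in sublayer is just a random linear map.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last (embedding) axis; learnable scale/shift omitted."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Skip connection followed by normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 10, 32))            # (batch, time steps, embedding)
w = rng.normal(size=(32, 32))
out = residual_block(x, lambda t: t @ w)    # stand-in for attention or FFN
print(out.shape)                            # shape is preserved
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))
```

Because the sublayer output is added to its input, the sublayer must return a tensor with the same shape as `x`, which is why both the attention layer and the feed-forward layer keep the embedding dimension unchanged.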

Run the following code to perform a simple test of your class:

In [None]:
# Simple test code
TEST_SHAPE = (2, 10, 32) # Batch size 2, 10 time steps, embedding dimension 32

#? What is my context window (aka length)? 10
X = keras.random.normal(shape=TEST_SHAPE)
gpt_block = GPTDecoderBlock(num_heads=4, embed_size=32)
print(f"Shape of output {gpt_block(X).shape}") # Should print (2,10,32)


1.3  Implement an EmbeddingWithPosition Layer

The decoder starts with an embedding layer which embeds the (integer) tokens into a vector space. Next, a positional embedding is added to the token embeddings. In the "Attention Is All You Need" paper it is mentioned that these embeddings can either be "fixed", or they can be learned.

We will implement a class *EmbeddingWithPosition* which combines the token embedding with a learnable positional embedding. We will **not** make use of the *Embedding* layer to implement the class. Implementing this "by hand" will help you gain a better understanding of what embeddings are actually doing.

Start from the code in Figure 4 to implement this class.

Hints:
1. In the *build* method you should add two learnable weight matrices. The dimensions of these matrices depend on the values of the arguments that were passed to the constructor:
   - *num_tokens*: the number of tokens in the vocabulary.
   - *max_seq_length*: the maximum length of any sequence. The model will not work if sequences longer than this maximum length are used.
   - *embed_size*: the dimension of the embeddings.
2. In the *call* method, you can use *keras.ops.take* to select rows from the embedding matrix.
3. The positional embeddings are (by definition) the first *length* rows from the positional embedding matrix.
4. Rely on the + operator to perform the broadcasting between the token embeddings and the positional embeddings.
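The hints can be tried out on plain arrays first. The NumPy sketch below uses `np.take` as a stand-in for *keras.ops.take*. The table names and the small sizes are illustrative assumptions, and the random tables play the role of the two learnable weight matrices.

```python
import numpy as np

num_tokens, max_seq_length, embed_size = 10, 5, 4
rng = np.random.default_rng(0)
token_table = rng.normal(size=(num_tokens, embed_size))         # one row per token
position_table = rng.normal(size=(max_seq_length, embed_size))  # one row per position

tokens = np.array([[1, 3, 5], [0, 2, 4]])         # (batch=2, length=3)
length = tokens.shape[1]

token_emb = np.take(token_table, tokens, axis=0)  # (2, 3, 4): row lookup per token
position_emb = position_table[:length]            # (3, 4): first `length` rows
combined = token_emb + position_emb               # broadcast over the batch axis
print(combined.shape)                             # (2, 3, 4)
```

The same positional rows are added to every sequence in the batch, so token 4 at position 2 always ends up as `token_table[4] + position_table[2]`, regardless of which sequence it appears in.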

In [None]:
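# Simple test code. EmbeddingWithPosition is the custom class defined in this
# exercise (it is not part of keras.layers), so run its definition cell first.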

tokens = keras.ops.convert_to_tensor([[1,3,5],[0,2,4]])
embed_layer = EmbeddingWithPosition(num_tokens=10, max_seq_length=5, embed_size=32)
print(embed_layer(tokens).shape) # Should print (2,3,32)
for w in embed_layer.get_weights(): # Check that this is what you would expect
    print(w.shape)


1.4  Build the Complete Model

Write a method *get_model* that returns a complete GPT-like decoder. Since we have all the necessary layers, this is now a simple sequential model. Complete the code in Figure 5.

As you can see in Figure 1, there is a linear layer after the last decoder block. This linear layer works independently for each token. For this exercise we will output the logits for the tokens instead of the token probabilities. Stated otherwise, the last layer in our model does not include an activation function.
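The difference between logits and probabilities can be sketched in NumPy. The final linear layer is modelled here by a hypothetical weight matrix `w` and bias `b`, and the sizes are arbitrary. The sketch is not the `get_model` solution, only an illustration of what the last layer produces. When a model outputs logits, the loss has to be told so, for example with `keras.losses.SparseCategoricalCrossentropy(from_logits=True)`.

```python
import numpy as np

def softmax(logits):
    """Convert logits to probabilities along the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
embed_size, num_tokens = 32, 10
hidden = rng.normal(size=(2, 3, embed_size))    # output of the last decoder block
w = rng.normal(size=(embed_size, num_tokens))   # final linear layer, no activation
b = np.zeros(num_tokens)

logits = hidden @ w + b                         # (2, 3, 10): one score per token
probs = softmax(logits)                         # only needed when sampling text
print(logits.shape, np.allclose(probs.sum(axis=-1), 1.0))
```

The logits are unnormalized scores that can be negative or larger than one. Applying the softmax afterwards turns each position's scores into a distribution over the vocabulary.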

In [None]:
@keras.saving.register_keras_serializable()
class EmbeddingWithPosition(keras.layers.Layer):
    def __init__(self, num_tokens, max_seq_length, embed_size, **kwargs):
        super().__init__(**kwargs)
        #! YOUR CODE HERE
        #! Save constructor arguments
        self.num_tokens = num_tokens
        self.max_seq_length = max_seq_length
        self.embed_size = embed_size
    def build(self, batch_input_shape):
        #! Shape not actually needed!!
        #! YOUR CODE HERE
        #! Add the weights for the two embeddings
        #? How do we convert a token to an embedding?
        #? token 0 (the)
        #? --> embedding [30,45,29,..., 223,45] # 512
        #? token 2 (or)
        #? --> embedding [12,34,56,...,78] # 512
        #? [
        #? (0): [30,45,29,...,223,45],
        #? ...
        #? (2): [12,34,56,...,78]
        #? ]
        self.embedding_lookup_table = self.add_weight(shape=(self.num_tokens, self.embed_size))
        self.position_lookup_table = self.add_weight(shape=(self.max_seq_length, self.embed_size))

    # Override `call` (not `__call__`) so that Keras runs `build` first
    def call(self, inputs):
        length = keras.ops.shape(inputs)[1]
        # YOUR CODE HERE
        # Get both embeddings and add them.
        token_embeddings = keras.ops.take(self.embedding_lookup_table, inputs, axis=0)
        position_embeddings = self.position_lookup_table[:length]
        return token_embeddings + position_embeddings

    def get_config(self):
        base_config = super().get_config()
        return {**base_config, "num_tokens": self.num_tokens,
                "max_seq_length": self.max_seq_length,
                "embed_size": self.embed_size}
