<a href="https://colab.research.google.com/github/AdamClarkStandke/GenerativeDeepLearning/blob/main/GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generative Pre-trained Transformer (i.e., GPT)

---

This is a basic generative pre-trained transformer (i.e. GPT) to generate text. As David Foster states in his book [Generative Deep Learning - 2nd Edition](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=asc_df_1098134184/?tag=hyprod-20&linkCode=df0&hvadid=632163212339&hvpos=&hvnetw=g&hvrand=18153244410548753671&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9006620&hvtargid=pla-1852750701094&psc=1&mcid=e1431dcee2ae37808a9d62e277627154&gclid=CjwKCAiAp5qsBhAPEiwAP0qeJhrNONFHcyhwaOsNRPnS2wgtXzPPZGM9Wm7zaXDn87j1IGA0UT5sfRoCz7cQAvD_BwE):


> In this chapter, we are going to delve into how modern text generation models make use of the *Transformer* architecture to reach state-of-the-art performance on text generation challenges. In particular, we will explore a type of autoregressive model known as the generative pre-trained transformer (GPT), *which powers OpenAI’s GPT-4 model*, widely considered to be the current state of the art for text generation.

Furthermore, as the authors of [Improving Language Understanding
by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) state:

> Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.

Accordingly, *this notebook focuses on the first stage* described in the OpenAI paper, in which unsupervised learning is used to pre-train a general *decoder transformer*. It implements the key components found in most state-of-the-art generative pre-trained transformers:

*   Multi-Head Attention
*   Causal Masking
*   Transformer Block
*   Positional Encoding

I will be using the [reddit dataset](https://www.tensorflow.org/datasets/overview) to train the language model.






In [1]:
import numpy as np
import json
import re
import string
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses, preprocessing
import tensorflow_datasets as tfds
from tensorflow.keras.layers import TextVectorization

In [24]:
VOCAB_SIZE = 15000
MAX_LEN = 80
EMBEDDING_DIM = 512
KEY_DIM = 64
N_HEADS = 8
FEED_FORWARD_DIM =  2048
VALIDATION_SPLIT = 0.2
SEED = 42
BATCH_SIZE = 64
EPOCHS = 80

In [None]:
ds = tfds.load('reddit')

Downloading and preparing dataset 2.93 GiB (download: 2.93 GiB, generated: 18.09 GiB, total: 21.01 GiB) to /root/tensorflow_datasets/reddit/1.0.0...



Dataset reddit downloaded and prepared to /root/tensorflow_datasets/reddit/1.0.0. Subsequent calls will reuse this data.


In [None]:
x=ds["train"]

In [None]:
train_text = x.map(lambda text: text['normalizedBody'])

In [None]:
train_text_2 = train_text.take(1000000) # take 1 million samples from the reddit dataset to train on

In [None]:
# MAX_LEN + 1 tokens per example, so the input and the shifted target are each MAX_LEN long
vectorize_layer = TextVectorization(max_tokens=VOCAB_SIZE, output_mode='int', output_sequence_length=MAX_LEN + 1)

In [None]:
vectorize_layer.adapt(train_text_2)

In [None]:
vocab = vectorize_layer.get_vocabulary()

In [None]:
# Display some token:word mappings
for i, word in enumerate(vocab[:20]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: the
3: i
4: to
5: and
6: a
7: of
8: that
9: in
10: it
11: my
12: is
13: for
14: was
15: you
16: with
17: but
18: me
19: this


In [None]:
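def prepare_inputs(text):
    # text is a batch of raw strings with shape (batch,)
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)  # (batch, MAX_LEN + 1)
    x = tokenized_sentences[:, :-1]  # inputs: every token except the last
    y = tokenized_sentences[:, 1:]   # targets: the inputs shifted left by one token
    return x, y

# Batch before mapping so that each training step sees BATCH_SIZE sequences
train_ds = train_text_2.batch(BATCH_SIZE).map(prepare_inputs)
train_ds = train_ds.cache().prefetch(tf.data.AUTOTUNE)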

In [None]:
example_input_output = train_ds.take(1).get_single_element()

In [None]:
# Example Input
example_input_output[0][0]

<tf.Tensor: shape=(80,), dtype=int64, numpy=
array([   3,   14,  266,    4,   19,  206,   96,  365,  393,    5,   40,
         25,   24,  169,   44,  304,   10,   14,  367,  456,   29,   83,
        277,  198,  307,  163,  266,   16,   29,    5,   28,  309,   38,
          6,  543,  213,    4,  712,   36,   16,    3,  638,  244,    8,
          3,   77,    4,  187,    4,   22,  102,  121,   16,   29,   27,
        845,  129,    5,  502,    6,  117,    5,  118,    3,  355,   44,
        304,    3,  258,    5,   28,  437,   18,   29,    1,   27, 2016,
          4,  465,  188])>

In [None]:
# Example Output (shifted by one token)
example_input_output[1][0]

<tf.Tensor: shape=(80,), dtype=int64, numpy=
array([  14,  266,    4,   19,  206,   96,  365,  393,    5,   40,   25,
         24,  169,   44,  304,   10,   14,  367,  456,   29,   83,  277,
        198,  307,  163,  266,   16,   29,    5,   28,  309,   38,    6,
        543,  213,    4,  712,   36,   16,    3,  638,  244,    8,    3,
         77,    4,  187,    4,   22,  102,  121,   16,   29,   27,  845,
        129,    5,  502,    6,  117,    5,  118,    3,  355,   44,  304,
          3,  258,    5,   28,  437,   18,   29,    1,   27, 2016,    4,
        465,  188,   76])>
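
The input/target pair above follows the usual next-token setup: the target is the input shifted left by one position. As a minimal sketch of the slicing done in `prepare_inputs`, here is the same operation on a made-up five-token sequence (the toy length and the NumPy array standing in for the vectorized text are assumptions for illustration):

```python
import numpy as np

# One "vectorized sentence" of 5 tokens (toy stand-in for MAX_LEN + 1 tokens)
tokens = np.array([[3, 14, 266, 4, 19]])

x = tokens[:, :-1]  # model input: every token except the last
y = tokens[:, 1:]   # training target: every token except the first

print(x.tolist())  # [[3, 14, 266, 4]]
print(y.tolist())  # [[14, 266, 4, 19]]
```

At every position `t`, the model reads `x[:, :t + 1]` and is trained to predict `y[:, t]`, which is the next token in the original sequence.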

# Creating Causal Mask

As David Foster states in his book [Generative Deep Learning - 2nd Edition](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=asc_df_1098134184/?tag=hyprod-20&linkCode=df0&hvadid=632163212339&hvpos=&hvnetw=g&hvrand=18153244410548753671&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9006620&hvtargid=pla-1852750701094&psc=1&mcid=e1431dcee2ae37808a9d62e277627154&gclid=CjwKCAiAp5qsBhAPEiwAP0qeJhrNONFHcyhwaOsNRPnS2wgtXzPPZGM9Wm7zaXDn87j1IGA0UT5sfRoCz7cQAvD_BwE):


> [W]e want our GPT model to be able to handle a group of query vectors in parallel (i.e., a matrix)...we need to apply a mask to the query/key dot product, to avoid information from future words leaking through. This is known as *causal masking.*

The `causal_attention_mask` function implements *causal masking*. It takes the batch size, the destination and source sequence lengths, and a data type. Step 1 builds the destination (query) positions as a column of shape (n_dest, 1), and step 2 builds the source (key) positions as a 1D range of shape (n_src,). Step 3 compares the two with broadcasting to produce a boolean matrix of shape (n_dest, n_src) that is true wherever a query position is allowed to attend to a key position, i.e. wherever the key is not in the future. Step 4 casts that matrix to the requested data type, and step 5 reshapes it to (1, n_dest, n_src). Step 6 concatenates the batch size with [1, 1] to form the 1D tensor [batch_size, 1, 1], which step 7 uses to tile the mask into the final 3D causal mask of shape (batch_size, n_dest, n_src). In this notebook n_dest and n_src are both equal to the sequence length.



In [21]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    i = tf.range(n_dest)[:, None] # 1
    j = tf.range(n_src)    # 2
    m = i >= j - n_src + n_dest # 3
    mask = tf.cast(m, dtype) # 4
    mask = tf.reshape(mask, [1, n_dest, n_src]) # 5
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    ) # 6
    return tf.tile(mask, mult) # 7
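
To see what the mask looks like, the following is a NumPy-only sketch of the same steps for a sequence of four tokens. The helper name `causal_mask_np` is hypothetical and exists only for this illustration; it mirrors the comparison used in `causal_attention_mask` but is not part of the model.

```python
import numpy as np

def causal_mask_np(n_dest, n_src):
    """Steps 1-4 in NumPy: entry [i, j] is 1 when query i may attend to key j."""
    i = np.arange(n_dest)[:, None]  # query positions as a column, shape (n_dest, 1)
    j = np.arange(n_src)            # key positions as a row, shape (n_src,)
    return (i >= j - n_src + n_dest).astype(np.int32)

mask = causal_mask_np(4, 4)
print(mask)
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

# Steps 5-7: add a leading axis and repeat the mask once per batch element
batch_mask = np.tile(mask[None, :, :], (2, 1, 1))
print(batch_mask.shape)  # (2, 4, 4)
```

Each row is one query position: the first token can only see itself, while the last token can see the whole sequence.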

# Creating Transformer Block

As David Foster states in his book [Generative Deep Learning - 2nd Edition](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=asc_df_1098134184/?tag=hyprod-20&linkCode=df0&hvadid=632163212339&hvpos=&hvnetw=g&hvrand=18153244410548753671&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9006620&hvtargid=pla-1852750701094&psc=1&mcid=e1431dcee2ae37808a9d62e277627154&gclid=CjwKCAiAp5qsBhAPEiwAP0qeJhrNONFHcyhwaOsNRPnS2wgtXzPPZGM9Wm7zaXDn87j1IGA0UT5sfRoCz7cQAvD_BwE):


>  A Transformer block is a single component within a Transformer that applies some *skip connections*, *feed-forward (dense) layers*, and *layer normalization* around the *multihead attention layer*...[where] [i]n Keras, we can build a *MultiHeadAttention layer that concatenates the output from multiple attention heads*, allowing each to learn a distinct attention mechanism so that the layer as a whole can learn more complex relationships.

The Transformer block is the core building block of modern language models. Before it was introduced, autoregressive recurrent networks such as LSTMs were the state of the art in text generation. Because attention processes all positions of a sequence in parallel rather than one step at a time, Transformer blocks made it practical to train much larger language models. In the paper [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) the authors stacked 12 of these blocks (i.e. a 12-layer decoder-only transformer). To speed up training, I will stack 6 Transformer blocks, matching the depth of the base model in [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf).


In [22]:
class TransformerBlock(layers.Layer):
    def __init__(self, num_heads, key_dim, embed_dim, ff_dim, dropout_rate=0.1):# defining sublayers of the TransformerBlock
        super(TransformerBlock, self).__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.embed_dim = embed_dim
        self.ff_dim = ff_dim
        self.dropout_rate = dropout_rate
        self.attn = layers.MultiHeadAttention(
            num_heads, key_dim, output_shape=embed_dim
        ) # multi-head Attention layer
        self.dropout_1 = layers.Dropout(self.dropout_rate)
        self.ln_1 = layers.LayerNormalization(epsilon=1e-6)
        self.ffn_1 = layers.Dense(self.ff_dim, activation="relu")
        self.ffn_2 = layers.Dense(self.embed_dim)
        self.dropout_2 = layers.Dropout(self.dropout_rate)
        self.ln_2 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(
            batch_size, seq_len, seq_len, tf.bool
        ) # causal attention mask
        attention_output, attention_scores = self.attn(
            inputs,
            inputs,
            attention_mask=causal_mask,
            return_attention_scores=True,
        ) # multi-head attention
        attention_output = self.dropout_1(attention_output)
        out1 = self.ln_1(inputs + attention_output) # 1st skip connection and layer normalization
        ffn_1 = self.ffn_1(out1) # feed-forward network
        ffn_2 = self.ffn_2(ffn_1)
        ffn_output = self.dropout_2(ffn_2)
        return self.ln_2(out1 + ffn_output) # 2nd skip connection and layer normalization
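
To make the role of the causal mask inside the attention layer more concrete, here is a single-head, NumPy-only sketch of scaled dot-product attention. It is an illustration under simplifying assumptions, not how `layers.MultiHeadAttention` is implemented: there are no learned projections, no multiple heads, and no dropout. The function name `masked_attention` and the random inputs are made up for this example.

```python
import numpy as np

def masked_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask (illustrative)."""
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)                        # (seq_len, seq_len) similarities
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(allowed, scores, -1e9)               # block attention to future positions
    scores = scores - scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 tokens with 8-dimensional vectors (made-up data)
out, weights = masked_attention(x, x, x)

print(np.round(weights, 2))      # the upper triangle is all zeros
print(out.shape)                 # (4, 8)
```

Because the first token can only attend to itself, its output row equals its own value vector; later rows are weighted averages over the current and earlier tokens only.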

# Creating Token and Positional Embedding

Because word order matters when predicting the next word in a sequence, the model uses both a token embedding and a positional embedding.

As David Foster states in his book [Generative Deep Learning - 2nd Edition](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=asc_df_1098134184/?tag=hyprod-20&linkCode=df0&hvadid=632163212339&hvpos=&hvnetw=g&hvrand=18153244410548753671&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9006620&hvtargid=pla-1852750701094&psc=1&mcid=e1431dcee2ae37808a9d62e277627154&gclid=CjwKCAiAp5qsBhAPEiwAP0qeJhrNONFHcyhwaOsNRPnS2wgtXzPPZGM9Wm7zaXDn87j1IGA0UT5sfRoCz7cQAvD_BwE):

> [I]n the multihead attention layer, there is nothing that cares about the ordering of the keys. The dot product between each key and the query is calculated in parallel, not sequentially...We use a technique called positional encoding when creating the inputs to the initial Transformer block. Instead of only encoding each token using a token embedding, we also encode the position of the token, using a position embedding.

In [23]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, max_len, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.max_len = max_len
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.token_emb = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.pos_emb = layers.Embedding(input_dim=max_len, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
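
As a rough sketch of what this layer computes, the lookup-and-add can be written with two NumPy tables. The toy sizes and the random tables standing in for the learned `token_emb` and `pos_emb` weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, embed_dim = 10, 6, 4   # toy sizes, not the notebook's settings
token_table = rng.normal(size=(vocab_size, embed_dim))  # stands in for token_emb weights
pos_table = rng.normal(size=(max_len, embed_dim))       # stands in for pos_emb weights

tokens = np.array([3, 7, 3])            # token 3 appears at positions 0 and 2
positions = np.arange(len(tokens))      # [0, 1, 2]
embedded = token_table[tokens] + pos_table[positions]

print(embedded.shape)                         # (3, 4)
# Same token at different positions gives different final vectors
print(np.allclose(embedded[0], embedded[2]))  # False
```

The two occurrences of token 3 share the same token vector, and only the added position vector tells the attention layers where each one sits in the sequence.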

# Constructing Model and Training

In [25]:
inputs = layers.Input(shape=(None,), dtype=tf.int32)
# Token and positional embedding
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, EMBEDDING_DIM)(inputs)
# 6-layer Transformer decoder
x = TransformerBlock(N_HEADS, KEY_DIM, EMBEDDING_DIM, FEED_FORWARD_DIM)(x)
x = TransformerBlock(N_HEADS, KEY_DIM, EMBEDDING_DIM, FEED_FORWARD_DIM)(x)
x = TransformerBlock(N_HEADS, KEY_DIM, EMBEDDING_DIM, FEED_FORWARD_DIM)(x)
x = TransformerBlock(N_HEADS, KEY_DIM, EMBEDDING_DIM, FEED_FORWARD_DIM)(x)
x = TransformerBlock(N_HEADS, KEY_DIM, EMBEDDING_DIM, FEED_FORWARD_DIM)(x)
x = TransformerBlock(N_HEADS, KEY_DIM, EMBEDDING_DIM, FEED_FORWARD_DIM)(x)
# output
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
gpt = models.Model(inputs=inputs, outputs=outputs)
# The model has a single output, so a single loss is passed
gpt.compile("adam", loss=losses.SparseCategoricalCrossentropy())
gpt.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 512)         7720960   
 ng (TokenAndPositionEmbedd                                      
 ing)                                                            
                                                                 
 transformer_block (Transfo  (None, None, 512)         3152384   
 rmerBlock)                                                      
                                                                 
 transformer_block_1 (Trans  (None, None, 512)         3152384   
 formerBlock)                                                    
                                                                 
 transformer_block_2 (Trans  (None, None, 512)         315238

In [None]:
# Create a TextGenerator callback that samples text at the end of each epoch
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        super().__init__()
        self.index_to_word = index_to_word
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        }

    def sample_from(self, probs, temperature):
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            # The model has a single output: next-token probabilities per position
            y = self.model.predict(x, verbose=0)
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append(
                {
                    "prompt": start_prompt,
                    "word_probs": probs,
                }
            )
            start_tokens.append(sample_token)
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("did you hear", max_tokens=80, temperature=1.0)

# Create the text-generation callback and train the model
text_generator = TextGenerator(vocab)
gpt.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator],
)
# Save the final model
gpt.save("/content/gpt")
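
The `sample_from` method rescales the predicted distribution with a temperature before sampling. The sketch below applies the same rescaling to a made-up three-token distribution to show its effect; the helper name `apply_temperature` and the example probabilities are assumptions for illustration.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector as sample_from does, then renormalize."""
    scaled = np.asarray(probs, dtype=np.float64) ** (1.0 / temperature)
    return scaled / scaled.sum()

probs = [0.5, 0.3, 0.2]  # made-up next-token distribution

print(np.round(apply_temperature(probs, 1.0), 3))  # unchanged
print(np.round(apply_temperature(probs, 0.5), 3))  # sharper: more mass on the top token
print(np.round(apply_temperature(probs, 2.0), 3))  # flatter: closer to uniform
```

A temperature below 1 makes generation more deterministic, while a temperature above 1 makes it more varied.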