<a href="https://colab.research.google.com/github/AdamClarkStandke/GenerativeDeepLearning/blob/main/GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generative Pre-trained Transformer (GPT)

---

This is a basic generative pre-trained transformer (GPT) used to generate text. As David Foster states in his book [Generative Deep Learning - 2nd Edition](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=asc_df_1098134184/?tag=hyprod-20&linkCode=df0&hvadid=632163212339&hvpos=&hvnetw=g&hvrand=18153244410548753671&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9006620&hvtargid=pla-1852750701094&psc=1&mcid=e1431dcee2ae37808a9d62e277627154&gclid=CjwKCAiAp5qsBhAPEiwAP0qeJhrNONFHcyhwaOsNRPnS2wgtXzPPZGM9Wm7zaXDn87j1IGA0UT5sfRoCz7cQAvD_BwE):


> In this chapter, we are going to delve into how modern text generation models make use of the *Transformer* architecture to reach state-of-the-art performance on text generation challenges. In particular, we will explore a type of autoregressive model known as the generative pre-trained transformer (GPT), *which powers OpenAI’s GPT-4 model*, widely considered to be the current state of the art for text generation.

Furthermore, as the authors of [Improving Language Understanding
by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) state:

> Our training procedure consists of two stages. The first stage is learning a high-capacity language model on a large corpus of text. This is followed by a fine-tuning stage, where we adapt the model to a discriminative task with labeled data.

Accordingly, *this notebook will focus on the first stage* of the OpenAI paper, in which unsupervised learning was used to pre-train a general-purpose *decoder transformer.* This includes implementing the following components found in (most) state-of-the-art GPTs:

*   Multi-Head Attention
*   Causal Masking
*   Transformer Block
*   Positional Encoding

I will be using the [reddit dataset](https://www.tensorflow.org/datasets/overview) to train the language model. This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.






In [1]:
import numpy as np
import json
import re
import string
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses, preprocessing
import tensorflow_datasets as tfds
from tensorflow.keras.layers import TextVectorization

In [90]:
'''
These are the default hyperparameters (that can be changed) used to train
the language model where:

VOCAB_SIZE: number of words that make up the dictionary
MAX_LEN: maximum number of tokens in a training sequence
EMBEDDING_DIM: size of the token and position embedding vectors
KEY_DIM: embedding size for the query and key tensors found in each attention head
N_HEADS: number of attention heads
N: number of transformer blocks to use
FEED_FORWARD_DIM: number of hidden units in the feed-forward layers of the transformer block
SEED: random seed
BATCH_SIZE: number of posts per training batch
EPOCHS: the total number of times to cycle through the entire training set
'''

VOCAB_SIZE = 30000
MAX_LEN = 80
EMBEDDING_DIM = 1024
KEY_DIM = 1024
N_HEADS = 12
N = 1
FEED_FORWARD_DIM =  2048
SEED = 42
BATCH_SIZE = 64
EPOCHS = 5

# Loading, Pre and Post-processing Reddit Dataset

I followed the tutorial [Load Text](https://www.tensorflow.org/tutorials/load_data/text#download_more_datasets_using_tensorflow_datasets_tfds) to load the Reddit dataset. As the output below shows, after the dataset was downloaded and extracted it was saved in the directory /root/tensorflow_datasets/reddit/1.0.0. If you are using Colab to download this dataset you will need at least 51 GB of system RAM; otherwise you will not be able to download the dataset with the tfds.load function.

I decided to take 1 million posts from the normalized body of the Reddit dataset, batch them using the BATCH_SIZE hyperparameter, and randomly shuffle the resulting batches with a buffer size of 1000. The raw text was converted into tokens using [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization). After doing so, the vocab, which maps token indices to words, was generated, and the training set was created by shifting each token sequence by one position: the input X holds every token except the last, and the target Y holds every token except the first.
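
To make the shift concrete, here is a minimal pure-Python sketch of what happens to one tokenized post. The helpers pad_or_truncate and make_input_target are hypothetical names used only for this illustration; the notebook itself does the same thing with TextVectorization and tensor slicing.

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Mimic a fixed output length: cut long sequences, right-pad short ones with pad_id."""
    return (token_ids + [pad_id] * length)[:length]


def make_input_target(token_ids):
    """Input is every token but the last; target is every token but the first."""
    return token_ids[:-1], token_ids[1:]


max_len = 4  # toy value; the notebook uses MAX_LEN = 80
tokens = pad_or_truncate([3, 21, 66], max_len + 1)  # [3, 21, 66, 0, 0]
x, y = make_input_target(tokens)
print(x)  # [3, 21, 66, 0]
print(y)  # [21, 66, 0, 0]
```

Each position in x is paired with the token that follows it in y, which is why the sequences are vectorized to MAX_LEN + 1 tokens before slicing.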

In [3]:
ds = tfds.load('reddit')

Downloading and preparing dataset 2.93 GiB (download: 2.93 GiB, generated: 18.09 GiB, total: 21.01 GiB) to /root/tensorflow_datasets/reddit/1.0.0...


Dataset reddit downloaded and prepared to /root/tensorflow_datasets/reddit/1.0.0. Subsequent calls will reuse this data.


In [4]:
x=ds["train"]

In [5]:
train_text = x.map(lambda text: text['normalizedBody'])

In [57]:
train_text_2 = train_text.take(1000000).batch(BATCH_SIZE).shuffle(1000) # take 1mil samples from approx. 4 million Reddit comments for training

In [72]:
vectorize_layer = TextVectorization(max_tokens=VOCAB_SIZE, output_mode='int',output_sequence_length=MAX_LEN+ 1)

In [73]:
vectorize_layer.adapt(train_text_2)

In [74]:
vocab = vectorize_layer.get_vocabulary()

In [75]:
# Display some token:word mappings
for i, word in enumerate(vocab[:20]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: the
3: i
4: to
5: and
6: a
7: of
8: that
9: in
10: it
11: my
12: is
13: for
14: was
15: you
16: with
17: but
18: me
19: this


In [76]:
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y

train_ds = train_text_2.map(prepare_inputs)
train_ds = train_ds.cache().prefetch(tf.data.AUTOTUNE)

In [77]:
example_input_output = train_ds.take(1).get_single_element()

In [78]:
# Example Input
example_input_output[0][0]

<tf.Tensor: shape=(80,), dtype=int64, numpy=
array([    3,    21,    66, 13429,   497,     2,   357,   129,     5,
        1583,  2587,    73,  4713,     9,    11,  2627,     4,    47,
          11,   302,   346,   248,  3745,  9232,    37,    21,   519,
        1209,     2,   461,     7,    57,    37,   114,    10,    48,
         138,     5,     3,    21,    66,  5916,     5,   599,   110,
          73,    85, 13429,   416,  1828,    50,    66,  1468,   634,
         246,   135,    11,  3745,  2180,    30,     2,  7351,    13,
        6032,     3,    21,   274,  1214,     5,   958,    17,     3,
         113,    47,    91,   616,     8,   534,     2,   201])>

In [79]:
# Example Output (shifted by one token)
example_input_output[1][0]

<tf.Tensor: shape=(80,), dtype=int64, numpy=
array([   21,    66, 13429,   497,     2,   357,   129,     5,  1583,
        2587,    73,  4713,     9,    11,  2627,     4,    47,    11,
         302,   346,   248,  3745,  9232,    37,    21,   519,  1209,
           2,   461,     7,    57,    37,   114,    10,    48,   138,
           5,     3,    21,    66,  5916,     5,   599,   110,    73,
          85, 13429,   416,  1828,    50,    66,  1468,   634,   246,
         135,    11,  3745,  2180,    30,     2,  7351,    13,  6032,
           3,    21,   274,  1214,     5,   958,    17,     3,   113,
          47,    91,   616,     8,   534,     2,   201,    57])>

# Creating Causal Mask

As David Foster states in his book [Generative Deep Learning - 2nd Edition](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=asc_df_1098134184/?tag=hyprod-20&linkCode=df0&hvadid=632163212339&hvpos=&hvnetw=g&hvrand=18153244410548753671&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9006620&hvtargid=pla-1852750701094&psc=1&mcid=e1431dcee2ae37808a9d62e277627154&gclid=CjwKCAiAp5qsBhAPEiwAP0qeJhrNONFHcyhwaOsNRPnS2wgtXzPPZGM9Wm7zaXDn87j1IGA0UT5sfRoCz7cQAvD_BwE):


> [W]e want our GPT model to be able to handle a group of query vectors in parallel (i.e., a matrix)...we need to apply a mask to the query/key dot product, to avoid information from future words leaking through. This is known as *causal masking.*

The causal_attention_mask function implements *causal masking*. It takes the batch size, the destination and source sequence lengths, and the data type. In step 1 the destination index range is reshaped from (n_dest,) to (n_dest, 1), while in step 2 the source index range keeps its shape of (n_src,). In steps 3 and 4 the broadcast comparison produces a boolean matrix of shape (n_dest, n_src), which is then cast to the requested data type. In step 5 the mask is reshaped to the 3D shape (1, n_dest, n_src). In step 6 a concatenation builds the 1D tensor [batch_size, 1, 1], and in step 7 the mask is tiled by it to produce the final 3D causal mask of shape (batch_size, n_dest, n_src). In this notebook n_dest and n_src are both equal to the sequence length.



In [85]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    i = tf.range(n_dest)[:, None] # 1
    j = tf.range(n_src)    # 2
    m = i >= j - n_src + n_dest # 3
    mask = tf.cast(m, dtype) # 4
    mask = tf.reshape(mask, [1, n_dest, n_src]) # 5
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    ) # 6
    return tf.tile(mask, mult) # 7
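
As a quick check of the comparison in step 3, the sketch below rebuilds a single (unbatched) mask in plain Python. The function name causal_mask_rows is made up for this illustration; it reuses the same condition as the TensorFlow code but skips the batching in steps 5 to 7.

```python
def causal_mask_rows(n_dest, n_src):
    """Return an n_dest x n_src grid where 1 means 'query i may attend to key j'."""
    return [
        [1 if i >= j - n_src + n_dest else 0 for j in range(n_src)]
        for i in range(n_dest)
    ]


for row in causal_mask_rows(4, 4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row i is the query position and column j is the key position, so the lower-triangular pattern means each token can see itself and earlier tokens, but nothing that comes later.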

# Creating Transformer Block

As David Foster states in his book [Generative Deep Learning - 2nd Edition](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=asc_df_1098134184/?tag=hyprod-20&linkCode=df0&hvadid=632163212339&hvpos=&hvnetw=g&hvrand=18153244410548753671&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9006620&hvtargid=pla-1852750701094&psc=1&mcid=e1431dcee2ae37808a9d62e277627154&gclid=CjwKCAiAp5qsBhAPEiwAP0qeJhrNONFHcyhwaOsNRPnS2wgtXzPPZGM9Wm7zaXDn87j1IGA0UT5sfRoCz7cQAvD_BwE):


>  A Transformer block is a single component within a Transformer that applies some *skip connections*, *feed-forward (dense) layers*, and *layer normalization* around the *multihead attention layer*...[where] [i]n Keras, we can build a *MultiHeadAttention layer that concatenates the output from multiple attention heads*, allowing each to learn a distinct attention mechanism so that the layer as a whole can learn more complex relationships.

The Transformer block is one of the most important components of today's language models. Before Transformer blocks were introduced, bidirectional LSTMs were the state of the art for text generation. It was the introduction of the Transformer block that made large language models possible. In the paper [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) the authors stacked 12 of these blocks (i.e. a 12-layer decoder-only transformer). To speed up training I used 1 Transformer block and, for the other hyperparameters, most of the base model parameters detailed in [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf).


In [86]:
class TransformerBlock(layers.Layer):
    def __init__(self, num_heads, key_dim, embed_dim, ff_dim, dropout_rate=0.1):# defining sublayers of the TransformerBlock
        super(TransformerBlock, self).__init__()
        self.num_heads = num_heads
        self.key_dim = key_dim
        self.embed_dim = embed_dim
        self.ff_dim = ff_dim
        self.dropout_rate = dropout_rate
        self.attn = layers.MultiHeadAttention(
            num_heads, key_dim, output_shape=embed_dim
        ) # multi-head Attention layer
        self.dropout_1 = layers.Dropout(self.dropout_rate)
        self.ln_1 = layers.LayerNormalization(epsilon=1e-6)
        self.ffn_1 = layers.Dense(self.ff_dim, activation="relu")
        self.ffn_2 = layers.Dense(self.embed_dim)
        self.dropout_2 = layers.Dropout(self.dropout_rate)
        self.ln_2 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(
            batch_size, seq_len, seq_len, tf.bool
        ) # causal attention mask
        attention_output, attention_scores = self.attn(
            inputs,
            inputs,
            attention_mask=causal_mask,
            return_attention_scores=True,
        ) # multi-head Attention
        attention_output = self.dropout_1(attention_output)
        out1 = self.ln_1(inputs + attention_output) # 1st skip connection and layer normalization
        ffn_1 = self.ffn_1(out1) # feed-forward network
        ffn_2 = self.ffn_2(ffn_1)
        ffn_output = self.dropout_2(ffn_2)
        return self.ln_2(out1 + ffn_output) # 2nd skip connection and layer normalization
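
To show what the causal mask does inside the attention layer, here is a simplified, framework-free sketch of single-head scaled dot-product attention. It is only an illustration under stated assumptions: the Keras MultiHeadAttention layer additionally applies learned query, key, value, and output projections and runs several heads in parallel, all of which are left out here. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Position i may only look at positions 0..i; later ones get -inf,
        # which softmax turns into a weight of exactly 0.
        scores = [
            sum(a * b for a, b in zip(q, k)) / scale if j <= i else float("-inf")
            for j, k in enumerate(keys)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        )
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-token sequence, dimension 2
out, attn = causal_self_attention(tokens, tokens, tokens)
for row in attn:
    print([round(w, 3) for w in row])
```

The first row of weights puts all of its mass on the first token, because that is the only position the first query is allowed to see.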

# Creating Token and Positional Embedding

Because the ordering of words in a sentence is important for semantic meaning and for predicting the next word in a sequence, a token and position embedding was used.

As David Foster states in his book [Generative Deep Learning - 2nd Edition](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=asc_df_1098134184/?tag=hyprod-20&linkCode=df0&hvadid=632163212339&hvpos=&hvnetw=g&hvrand=18153244410548753671&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9006620&hvtargid=pla-1852750701094&psc=1&mcid=e1431dcee2ae37808a9d62e277627154&gclid=CjwKCAiAp5qsBhAPEiwAP0qeJhrNONFHcyhwaOsNRPnS2wgtXzPPZGM9Wm7zaXDn87j1IGA0UT5sfRoCz7cQAvD_BwE):

> [I]n the multihead attention layer, there is nothing that cares about the ordering of the keys. The dot product between each key and the query is calculated in parallel, not sequentially...We use a technique called positional encoding when creating the inputs to the initial Transformer block. Instead of only encoding each token using a token embedding, we also encode the position of the token, using a position embedding.

In [87]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, max_len, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.max_len = max_len
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.token_emb = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        )
        self.pos_emb = layers.Embedding(input_dim=max_len, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions
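
The effect of adding the two embeddings can be seen with a small pure-Python sketch. The random tables below are stand-ins for the learned Embedding weights, and the names random_table and embed are invented for this example.

```python
import random

random.seed(0)  # fixed seed so the toy tables are reproducible


def random_table(rows, dim):
    """Stand-in for a learned embedding matrix: one random vector per row."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


vocab_size, max_len, embed_dim = 10, 5, 4  # toy sizes
token_table = random_table(vocab_size, embed_dim)
position_table = random_table(max_len, embed_dim)


def embed(token_ids):
    """Add the position vector for index p to the token vector found at index p."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]


vectors = embed([7, 2, 7])
# Token 7 appears twice, but its two vectors differ because the positions differ.
print(vectors[0] != vectors[2])  # True
```

Without the position term both occurrences of token 7 would be identical, and the attention layer would have no way to tell them apart.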

# Constructing the Model

In [92]:
inputs = layers.Input(shape=(None,), dtype=tf.int32)
# Token and positional embedding
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, EMBEDDING_DIM)(inputs)
# Stack of N Transformer blocks (N = 1 in this notebook)
for _ in range(N):
  x = TransformerBlock(N_HEADS, KEY_DIM, EMBEDDING_DIM, FEED_FORWARD_DIM)(x)
# output
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
gpt = models.Model(inputs=inputs, outputs=outputs)
gpt.compile("adam", loss=[losses.SparseCategoricalCrossentropy(), None])
gpt.summary()

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_7 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddi  (None, None, 1024)        30801920  
 ng_6 (TokenAndPositionEmbe                                      
 dding)                                                          
                                                                 
 transformer_block_24 (Tran  (None, None, 1024)        54571008  
 sformerBlock)                                                   
                                                                 
 dense_56 (Dense)            (None, None, 30000)       30750000  
                                                                 
Total params: 116122928 (442.97 MB)
Trainable params: 116122928 (442.97 MB)
Non-trainable params: 0 (0.00 Byte)
_____________
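
The parameter counts in the summary can be reproduced by hand from the hyperparameters. The arithmetic below is a sketch that assumes the value projection in the attention layer has the same size as the key projection (the Keras default when no value dimension is given) and that every dense projection has a bias.

```python
vocab_size, max_len, embed_dim = 30000, 80, 1024
key_dim, n_heads, ff_dim = 1024, 12, 2048

# Token table plus position table.
embedding = vocab_size * embed_dim + max_len * embed_dim

# Query, key, and value projections: embed_dim -> (n_heads * key_dim), each with a bias.
qkv = 3 * (embed_dim * n_heads * key_dim + n_heads * key_dim)
# Output projection: (n_heads * key_dim) -> embed_dim, with a bias.
attn_out = n_heads * key_dim * embed_dim + embed_dim
# Two dense layers of the feed-forward network.
ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
# Two layer normalizations, each with a scale and an offset vector.
layer_norms = 2 * (2 * embed_dim)
block = qkv + attn_out + ffn + layer_norms

# Final dense layer onto the vocabulary.
head = embed_dim * vocab_size + vocab_size

print(embedding, block, head, embedding + block + head)
# 30801920 54571008 30750000 116122928
```

Most of the block's parameters come from the attention projections, because KEY_DIM = 1024 is applied per head rather than split across the 12 heads.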

# Training GPT

I trained GPT on one V100 GPU for five epochs. Training took approximately 8 hours, during which the Sparse Categorical Cross-Entropy loss fell from an initial value of about 7 to about 4. This end result is far from optimal: the loss has a lower bound of 0, and a value of 4 leaves a lot of room for improvement. Because of this, the text generated by the model (as seen in the next section) is not always semantically correct.
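
One way to read these loss values is to convert them to perplexity, the exponential of the cross-entropy (which Keras computes with the natural logarithm). The sketch below is only a rough guide, since the loss in this notebook is also averaged over padded positions.

```python
import math

vocab_size = 30000

# A model that guesses uniformly over the vocabulary has loss ln(vocab_size).
uniform_loss = math.log(vocab_size)

for loss in (uniform_loss, 7.0, 4.0):
    print(f"loss {loss:.2f} -> perplexity {math.exp(loss):.0f}")
```

A loss of 7 corresponds to a perplexity of roughly 1,100 and a loss of 4 to roughly 55, compared with 30,000 for uniform guessing over the vocabulary.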

To improve the generated text, some of the following changes are worth trying:

1.   use the full Reddit dataset instead of a sub-sample as I did
2.   increase the VOCAB_SIZE
3.   *train for at least 100 epochs*
4.   increase the number of Transformer blocks and the number of attention heads
5.   choose a different learning rate for the Adam algorithm


In [None]:
# Callback that prints a sample of generated text at the end of every epoch
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):  # top_k is accepted but not used
        super().__init__()
        self.index_to_word = index_to_word
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        }

    def sample_from(self, probs, temperature):
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
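        # Words missing from the vocabulary map to index 1 ([UNK]). The vocabulary
        # is lowercase, so a capitalized word such as "I" is treated as unknown.
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]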
        sample_token = None
        info = []
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            y = self.model.predict(x, verbose=0)
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append(
                {
                    "prompt": start_prompt,
                    "word_probs": probs,
                }
            )
            start_tokens.append(sample_token)
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("did you", max_tokens=20, temperature=1.0)

# Train the model, printing a generated sample after every epoch
text_generator = TextGenerator(vocab)
gpt.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[text_generator],
)
# Save the final model
gpt.save("/content/gpt")
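
The temperature argument used by sample_from can be explored on its own. The following sketch applies the same formula (raise each probability to the power 1 / temperature, then renormalize) to a toy distribution; the function name apply_temperature is hypothetical.

```python
def apply_temperature(probs, temperature):
    """Raise each probability to 1/temperature and renormalize to sum to 1."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


probs = [0.5, 0.3, 0.2]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Temperatures below 1 sharpen the distribution toward the most likely word, a temperature of 1 leaves it unchanged, and temperatures above 1 flatten it so that less likely words are sampled more often.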

# Generating Text



In [96]:
# generating text using the prompt: should I go
text_generator.generate("should I go", max_tokens=20, temperature=1.0)


generated text:
should I go down let away with mild munchies and hello im a real lightweight and looking for a lightweight



[{'prompt': 'should I go',
  'word_probs': array([3.9252677e-04, 2.4097566e-02, 7.1504386e-03, ..., 1.2247952e-07,
         2.1028454e-07, 1.1844580e-07], dtype=float32)},
 {'prompt': 'should I go down',
  'word_probs': array([1.5663232e-03, 1.8423527e-02, 7.0624501e-02, ..., 4.6719705e-07,
         2.5653665e-07, 1.5967171e-07], dtype=float32)},
 {'prompt': 'should I go down let',
  'word_probs': array([2.1322555e-06, 9.7094597e-03, 3.7058368e-02, ..., 1.5525318e-08,
         3.6275036e-08, 7.6346582e-09], dtype=float32)},
 {'prompt': 'should I go down let away',
  'word_probs': array([3.8629423e-05, 3.1404667e-02, 6.4428546e-02, ..., 2.7698659e-08,
         5.3843504e-08, 3.9898399e-08], dtype=float32)},
 {'prompt': 'should I go down let away with',
  'word_probs': array([3.1660475e-05, 2.9596783e-02, 2.1106137e-01, ..., 2.2095466e-07,
         1.5075646e-07, 6.4391621e-08], dtype=float32)},
 {'prompt': 'should I go down let away with mild',
  'word_probs': array([1.3377670e-05, 1.34

In [98]:
# generating text using the prompt: this guy
text_generator.generate("this guy", max_tokens=20, temperature=1)


generated text:
this guy has never experienced a really big breakup for the first week of university he came out of nowhere



[{'prompt': 'this guy',
  'word_probs': array([4.17970004e-05, 1.12000285e-02, 1.00543676e-03, ...,
         1.57740523e-08, 2.12044256e-07, 1.86644940e-08], dtype=float32)},
 {'prompt': 'this guy has',
  'word_probs': array([1.8237852e-06, 4.9863472e-03, 1.2692408e-02, ..., 5.0823417e-09,
         5.2507780e-09, 1.6172116e-08], dtype=float32)},
 {'prompt': 'this guy has never',
  'word_probs': array([3.7056052e-06, 1.7005111e-03, 1.9142291e-04, ..., 2.8955099e-10,
         3.3867982e-09, 2.5159232e-09], dtype=float32)},
 {'prompt': 'this guy has never experienced',
  'word_probs': array([2.0916112e-05, 9.3607400e-03, 3.1279765e-02, ..., 4.1899160e-09,
         4.2624958e-08, 1.8873907e-07], dtype=float32)},
 {'prompt': 'this guy has never experienced a',
  'word_probs': array([1.8404297e-06, 2.8257452e-02, 3.7658425e-05, ..., 3.5909318e-09,
         5.4782652e-07, 2.5327545e-06], dtype=float32)},
 {'prompt': 'this guy has never experienced a really',
  'word_probs': array([5.8553519e-