# 🚀 GPT

In this notebook, we'll walk through the steps required to train your own GPT model on the wine review dataset.

The code is adapted from the excellent [GPT tutorial](https://keras.io/examples/generative/text_generation_with_miniature_gpt/) created by Apoorv Nandan available on the Keras website.

In [34]:
%load_ext autoreload
%autoreload 2
import numpy as np
import json
import re
import string
from IPython.display import display, HTML

%cd /home/clachris/Documents/projects/Generative_Deep_Learning_2nd_Edition/notebooks

import tensorflow as tf
from tensorflow.keras import layers, models, losses, callbacks

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
/home/clachris/Documents/projects/Generative_Deep_Learning_2nd_Edition/notebooks


## 0. Parameters <a name="parameters"></a>

In [35]:
VOCAB_SIZE = 10000
MAX_LEN = 80
EMBEDDING_DIM = 256
KEY_DIM = 256
N_HEADS = 2
FEED_FORWARD_DIM = 256
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 5

## 1. Load the data <a name="load"></a>

In [36]:
# Load the full dataset
with open("/home/clachris/Documents/projects/Generative_Deep_Learning_2nd_Edition/data/wine-reviews/winemag-data-130k-v2.json") as json_data:
    wine_data = json.load(json_data)

In [37]:
wine_data[10]

{'points': '87',
 'title': 'Kirkland Signature 2011 Mountain Cuvée Cabernet Sauvignon (Napa Valley)',
 'description': 'Soft, supple plum envelopes an oaky structure in this Cabernet, supported by 15% Merlot. Coffee and chocolate complete the picture, finishing strong at the end, resulting in a value-priced wine of attractive flavor and immediate accessibility.',
 'taster_name': 'Virginie Boone',
 'taster_twitter_handle': '@vboone',
 'price': 19,
 'designation': 'Mountain Cuvée',
 'variety': 'Cabernet Sauvignon',
 'region_1': 'Napa Valley',
 'region_2': 'Napa',
 'province': 'California',
 'country': 'US',
 'winery': 'Kirkland Signature'}

In [38]:
# Filter the dataset
filtered_data = [
    "wine review : "
    + x["country"]
    + " : "
    + x["province"]
    + " : "
    + x["variety"]
    + " : "
    + x["description"]
    for x in wine_data
    if x["country"] is not None
    and x["province"] is not None
    and x["variety"] is not None
    and x["description"] is not None
]

In [39]:
# Count the wine reviews
n_wines = len(filtered_data)
print(f"{n_wines} wine reviews loaded")

129907 wine reviews loaded


In [40]:
example = filtered_data[25]
print(example)

wine review : US : California : Pinot Noir : Oak and earth intermingle around robust aromas of wet forest floor in this vineyard-designated Pinot that hails from a high-elevation site. Small in production, it offers intense, full-bodied raspberry and blackberry steeped in smoky spice and smooth texture.


## 2. Tokenize the data <a name="tokenize"></a>

In [41]:
# Pad the punctuation, to treat them as separate 'words'
def pad_punctuation(s):
    s = re.sub(f"([{string.punctuation}, '\n'])", r" \1 ", s)
    s = re.sub(" +", " ", s)
    return s

text_data = [pad_punctuation(x) for x in filtered_data]

In [42]:
# Display an example of a wine review
example_data = text_data[25]
example_data

'wine review : US : California : Pinot Noir : Oak and earth intermingle around robust aromas of wet forest floor in this vineyard - designated Pinot that hails from a high - elevation site . Small in production , it offers intense , full - bodied raspberry and blackberry steeped in smoky spice and smooth texture . '
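
To see the effect of the padding step in isolation, here is a small standalone sketch. It builds the character class with `re.escape`, which is a variation on the cell above rather than the notebook's exact pattern, and the helper name `pad_punctuation_sketch` is made up for illustration.

```python
import re
import string


def pad_punctuation_sketch(s):
    # Surround every punctuation character (or newline) with spaces.
    pattern = f"([{re.escape(string.punctuation)}\n])"
    s = re.sub(pattern, r" \1 ", s)
    # Collapse the runs of spaces that the padding creates.
    return re.sub(" +", " ", s)


sample = "full-bodied, smoky spice."
padded = pad_punctuation_sketch(sample)
tokens = padded.split()
print(padded)
print(tokens)
```

After padding, a plain whitespace split treats the hyphen, comma, and full stop as tokens of their own, which is what the vectorization layer relies on later.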

In [43]:
# Convert to a TensorFlow Dataset
text_ds = (
    tf.data.Dataset.from_tensor_slices(text_data)
    .batch(BATCH_SIZE)
    .shuffle(1000)
)

In [44]:
# Create a vectorisation layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [45]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()



In [46]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: :
3: ,
4: .
5: and
6: the
7: wine
8: a
9: of


In [47]:
# Display the same example converted to ints
example_tokenised = vectorize_layer(example_data)
print(example_tokenised.numpy())

[   7   10    2   20    2   29    2   43   62    2   55    5  243 4145
  453  634   26    9  497  499  667   17   12  142   14 2214   43   25
 2484   32    8  223   14 2213  948    4  594   17  987    3   15   75
  237    3   64   14   82   97    5   74 2633   17  198   49    5  125
   77    4    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0]
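
The mapping from words to integers can be sketched in plain Python. This is only a toy approximation of what a vectorization layer does (lowercase, split on whitespace, reserve index 0 for padding and index 1 for unknown words, pad to a fixed length); the function names `build_vocab` and `vectorize` are hypothetical, and the real layer may order tied words differently.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    # Count lowercase, whitespace-separated tokens across all texts.
    counts = Counter(word for t in texts for word in t.lower().split())
    # Reserve two slots: "" for padding (0) and "[UNK]" for unknown words (1).
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common


def vectorize(text, vocab, seq_len):
    word_to_index = {w: i for i, w in enumerate(vocab)}
    # Unknown words map to index 1.
    ids = [word_to_index.get(w, 1) for w in text.lower().split()]
    # Truncate, then pad with zeros up to the fixed sequence length.
    ids = ids[:seq_len]
    return ids + [0] * (seq_len - len(ids))


toy_texts = ["wine review : us", "wine review : italy", "wine : us"]
toy_vocab = build_vocab(toy_texts, max_tokens=6)
toy_ids = vectorize("wine review : germany", toy_vocab, seq_len=6)
print(toy_vocab)
print(toy_ids)
```

In this toy run, "germany" is not in the vocabulary and so maps to 1, and the trailing zeros play the same role as the padding at the end of the array above.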


## 3. Create the Training Set <a name="create"></a>

In [48]:
# Create the training set of wine reviews and the same text shifted by one word
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y

train_ds = text_ds.map(prepare_inputs)

In [49]:
example_input_output = train_ds.take(1).get_single_element()

In [50]:
# Example Input
example_input_output[0][0]

<tf.Tensor: shape=(80,), dtype=int64, numpy=
array([   7,   10,    2,   20,    2,   29,    2,   61,    2,  541,   17,
        347,   55,    5,  878,  796,    3,   12, 5033,   53,   13,  663,
          5,  125,    3,   11,  266,    9, 3354,   69,   23,  827,    4,
        396,   30,  755,   15,   73,    5,  247,    3,  452,   79,   18,
         21,    8,  108,    9,  734,   55,    4,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0])>

In [51]:
# Example Output (shifted by one token)
example_input_output[1][0]

<tf.Tensor: shape=(80,), dtype=int64, numpy=
array([  10,    2,   20,    2,   29,    2,   61,    2,  541,   17,  347,
         55,    5,  878,  796,    3,   12, 5033,   53,   13,  663,    5,
        125,    3,   11,  266,    9, 3354,   69,   23,  827,    4,  396,
         30,  755,   15,   73,    5,  247,    3,  452,   79,   18,   21,
          8,  108,    9,  734,   55,    4,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0])>
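
The one-token shift is easy to check on a short list. The sketch below uses a made-up toy sequence (not taken from the dataset) and plain Python slicing to mirror `[:, :-1]` and `[:, 1:]` on a single row.

```python
# Toy token sequence; 0 is the padding index.
tokens = [7, 10, 2, 20, 2, 29, 0, 0]

x = tokens[:-1]  # model input: everything except the last token
y = tokens[1:]   # target: everything except the first token

# At each position the target is simply the next input token.
pairs = list(zip(x, y))
for inp, target in pairs:
    print(f"{inp:>3} -> {target}")
```

Each input position is paired with the token that follows it, so a single sequence of length `MAX_LEN + 1` yields `MAX_LEN` next-token prediction examples.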

## 5. Create the causal attention mask function <a name="causal"></a>

## Mathematical Overview

### Attention

One of the main advantages of a *Transformer* model is that it is easy to parallelize, unlike, say, RNNs: we feed the entire string in at once. However, consider the following sentence: "The dog likes to play ....". The obvious next word is "fetch", but why? Presumably because of the word "dog" (if it were "cat", we would expect a different game) and the word "likes" (if it were "hates", the following word would also change). So we need to figure out where the model should focus its *Attention*.

To do so, we follow the standard text preprocessing seen above: tokenizing, projecting into an embedding space, and so on. We first describe a single *Self-Attention Head* and then explain how to combine several of them. Each head has three matrices that are updated during training: the *Query*, *Key*, and *Value* matrices, $W_q, W_k, W_v$, respectively.

Then each token vector $\mathbf{x}^{(t)} \in X$, $t \in \{1, ..., T\}$, can be multiplied by each weight matrix to get:

$$ \mathbf{q}^{(t)} = W_q \mathbf{x}^{(t)} $$
$$ \mathbf{k}^{(t)} = W_k \mathbf{x}^{(t)} $$
$$ \mathbf{v}^{(t)} = W_v \mathbf{x}^{(t)} $$

We now treat $\mathbf{q}^{(t)}$ as the query associated with $\mathbf{x}^{(t)}$. We are ultimately going to put it in context with the other tokens in the sentence, which tells us how much attention to pay to each of them. The vectors $\mathbf{q}^{(t)}$ and $\mathbf{k}^{(t)}$ have the same length, call it $d_k$, while $\mathbf{v}^{(t)}$ has length $d_v$. Note that these could be the same, and often are. Also, recall that each token is a vector of length $d$ (after embedding). So $W_q$ and $W_k$ have size $d_k \times d$, and $W_v$ has size $d_v \times d$. We now calculate the scaled score for each pair $\mathbf{q}^{{(t)}^T} \mathbf{k}^{(j)}$ as $j$ ranges over the $m$ keys:

$$ \omega_{t,j} = \frac{\mathbf{q}^{{(t)}^T} \mathbf{k}^{(j)}}{\sqrt{d_k}} $$

We then apply the softmax function to get:

$$ \alpha_{t,i} = \alpha(\mathbf{q}^{(t)}, \mathbf{k}^{(1:m)}) = \mathrm{softmax}_i ([\omega_{t,1}, ..., \omega_{t,m}]) = \frac{\exp(\omega_{t,i})}{\sum_{j=1}^m \exp(\omega_{t,j})} \in \mathbb{R}$$

Finally, we arrive at the context vector for token $t$, called $\mathbf{z}_t$:

$$ \mathbf{z}_t = \mathrm{Attention}(\mathbf{q}^{(t)}, (\mathbf{k}^{(1)}, \mathbf{v}^{(1)}), ... , (\mathbf{k}^{(m)}, \mathbf{v}^{(m)})) = \sum_{j=1}^m \alpha_{t,j}\mathbf{v}^{(j)} \in \mathbb{R}^{d_v}$$

In other words, the context vector combines the values of all $m$ keys, each weighted by how well its key matches this query.
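
The single-query computation can be traced numerically with NumPy. This is a minimal sketch with random weights and toy sizes chosen only for illustration; it follows the column-vector convention of the equations above, and the maximum is subtracted before exponentiating purely for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, d_k, d_v = 5, 8, 4, 6  # toy sizes (assumed for illustration)

X = rng.normal(size=(T, d))      # one row per token vector x^(t)
W_q = rng.normal(size=(d_k, d))  # query matrix
W_k = rng.normal(size=(d_k, d))  # key matrix
W_v = rng.normal(size=(d_v, d))  # value matrix

t = 2
q_t = W_q @ X[t]   # query for token t, shape (d_k,)
K = X @ W_k.T      # row j is k^(j), shape (T, d_k)
V = X @ W_v.T      # row j is v^(j), shape (T, d_v)

# Scaled scores omega_{t,j} for every key j.
omega = K @ q_t / np.sqrt(d_k)

# Softmax over the keys gives the attention weights alpha_{t,j}.
alpha = np.exp(omega - omega.max())
alpha = alpha / alpha.sum()

# Context vector: weighted sum of the value vectors.
z_t = alpha @ V
print(alpha.round(3), z_t.shape)
```

Here the keys range over all $T$ tokens, so $m = T$; no masking is applied yet.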

Equivalently, we can stack $n$ query vectors, out of the $m$ (or $T$) tokens, into a matrix and process them together. In this case, we get:

$$ Q \in \mathbb{R}^{n \times d_k} $$
$$ K \in \mathbb{R}^{m \times d_k} $$
$$ V \in \mathbb{R}^{m \times d_v} $$

This gives us the following, where the softmax is applied to each row:

$$ \mathrm{Attention}(Q,K,V) = \mathrm{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right)V \in \mathbb{R}^{n \times d_v}$$

We now have the context for $n$ tokens, each of length $d_v$, and we have successfully parallelized the process. If $n=m=T$, then we simply have the context for all the tokens.
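
The matrix form translates almost line for line into NumPy. The sketch below is an illustrative implementation with toy shapes; the function names `softmax` and `attention` are our own and are not part of any library used in this notebook.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    # Q: (n, key size), K: (m, key size), V: (m, value size).
    key_size = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(key_size), axis=-1)  # (n, m)
    return weights @ V, weights                               # (n, value size)


rng = np.random.default_rng(0)
n, m = 3, 5
Q = rng.normal(size=(n, 4))
K = rng.normal(size=(m, 4))
V = rng.normal(size=(m, 6))

context, weights = attention(Q, K, V)
print(context.shape, weights.shape)
```

Each row of `weights` sums to one, so each row of `context` is a weighted average of the rows of `V`.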

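We can also do this with multiple heads, each with its own $W_q, W_k, W_v$. We concatenate the context vectors $\mathbf{z}^1_t, ..., \mathbf{z}^h_t$ produced by the $h$ heads and multiply the result by another matrix $W_o$ to get the desired output shape:

$$ \mathbf{c}_t = W_o \cdot \mathrm{concat}([\mathbf{z}^1_t, \mathbf{z}^2_t, ..., \mathbf{z}^h_t]) $$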
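
As a rough sketch of the concatenation step, the following NumPy code runs two heads over the same toy input and projects the concatenated result back to the embedding size. All sizes and the random weights are assumptions for illustration, and the code uses the row-vector convention (tokens as rows), so the matrices are the transposes of those in the equations.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_heads, d_k, d_v = 4, 8, 2, 3, 5  # toy sizes (assumed)
X = rng.normal(size=(T, d))

head_outputs = []
for _ in range(n_heads):
    # Each head has its own query, key, and value projections.
    W_q = rng.normal(size=(d, d_k))
    W_k = rng.normal(size=(d, d_k))
    W_v = rng.normal(size=(d, d_v))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    head_outputs.append(weights @ V)  # (T, d_v)

# Concatenate the per-head outputs along the feature axis.
concat = np.concatenate(head_outputs, axis=-1)  # (T, n_heads * d_v)

# The output projection maps back to the embedding size.
W_o = rng.normal(size=(n_heads * d_v, d))
out = concat @ W_o                               # (T, d)
print(concat.shape, out.shape)
```

Projecting back to size `d` is what allows the result to be added to the block's input in a skip connection.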

However, if we process several queries at once, we need to apply a mask, because we don't want future words to leak into the prediction at the current position. With one row per query and one column per key, the mask has ones on and below the diagonal and zeros above it (the cell below prints its transpose, so there the zeros appear below the diagonal). That way, when we apply $Attention(Q,K,V)$ to all the queries, each query only sees the words up to and including its own position; the keys from future words are ignored. For example, for "My dog likes to run.", the successive queries see "My", "My dog", "My dog likes", "My dog likes to", and so on.

Graphically, this looks like:

<div style='text-align: center;'>
    <img src='/home/clachris/Documents/projects/Generative_Deep_Learning_2nd_Edition/notebooks/Graphics/MultiHeaded_Attention.png' alt='MultiHeaded_Attention' width='500'>
</div>

In [52]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    i = tf.range(n_dest)[:, None]
    j = tf.range(n_src)
    m = i >= j - n_src + n_dest
    mask = tf.cast(m, dtype)
    mask = tf.reshape(mask, [1, n_dest, n_src])
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    )
    return tf.tile(mask, mult)

np.transpose(causal_attention_mask(1, 10, 10, dtype=tf.int32)[0]) # 10 is the sequence length here: QK^T is a
# (seq_len-by-seq_len) matrix, so the mask must have the same shape

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int32)
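
To see how such a mask changes the attention weights, here is a NumPy sketch that rebuilds the same comparison (`i >= j - n_src + n_dest`) and applies it before the softmax. The helper name `causal_mask_np` is hypothetical, and setting masked scores to `-inf` is one common way to apply a mask, not necessarily what the Keras layer does internally.

```python
import numpy as np


def causal_mask_np(n_dest, n_src):
    # Row i is a query position, column j is a key position.
    i = np.arange(n_dest)[:, None]
    j = np.arange(n_src)
    return i >= j - n_src + n_dest


mask = causal_mask_np(4, 4)

# Use all-zero scores so that the effect of the mask is easy to read.
scores = np.zeros((4, 4))
masked = np.where(mask, scores, -np.inf)

# exp(-inf) is 0, so masked positions receive no attention weight.
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)
print(mask.astype(int))
print(weights.round(2))
```

With equal scores, the query at position `t` spreads its weight evenly over positions `0..t` and puts none on later positions.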

## 6. Create a Transformer Block layer <a name="transformer"></a>

### Mathematical Overview of the Transformer

The transformer has the following structure:

<div style='text-align: center;'>
    <img src='/home/clachris/Documents/projects/Generative_Deep_Learning_2nd_Edition/notebooks/Graphics/Transformer_Block.png' alt='Transformer_Block' width='300'>
</div>

Essentially, $Q$, $K$, and $V$ are fed into the multi-headed attention layer. The $Q$ information also passes around the attention layer and is added to its output (a skip connection, as in residual blocks). The sum is then *Layer Normalized* before being passed to the feedforward network, which is followed by a second skip connection and layer normalization.

We note that batch normalization normalizes each feature independently across the mini-batch. Layer normalization normalizes each of the inputs in the batch independently across all features.

<div style='text-align: center;'>
    <img src='/home/clachris/Documents/projects/Generative_Deep_Learning_2nd_Edition/notebooks/Graphics/Normalization_Types.png' alt='Normalization_Types' width='750'>
</div>
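
The difference between the two normalizations comes down to which axis the statistics are taken over. The sketch below uses a tiny made-up batch and omits the learned scale and shift parameters (and the running statistics that batch normalization keeps for inference).

```python
import numpy as np

# Toy batch: 2 samples (rows) with 3 features (columns).
x = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
eps = 1e-6

# Batch normalization: statistics per feature, taken across the batch (axis 0).
batch_norm = (x - x.mean(axis=0, keepdims=True)) / np.sqrt(
    x.var(axis=0, keepdims=True) + eps
)

# Layer normalization: statistics per sample, taken across the features (axis 1).
layer_norm = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(
    x.var(axis=1, keepdims=True) + eps
)

print(batch_norm.round(3))
print(layer_norm.round(3))
```

Layer normalization gives the two rows nearly identical outputs here, because each row is normalized on its own and the second row is just a rescaled copy of the first.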

In [53]:
class TransformerBlock(layers.Layer):
    def __init__(self, num_heads, key_dim, embed_dim, ff_dim, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.num_heads = num_heads # Number of attention heads
        self.key_dim = key_dim # Key dimension
        self.embed_dim = embed_dim # Word embedding dimension
        self.ff_dim = ff_dim # Fully connected (feed-forward) layer's dimension
        self.dropout_rate = dropout_rate
        self.attn = layers.MultiHeadAttention(
            num_heads, key_dim, output_shape=embed_dim
        ) # Creating the multi-head attention layer
        self.dropout_1 = layers.Dropout(self.dropout_rate)
        self.ln_1 = layers.LayerNormalization(epsilon=1e-6) # The layer normalization
        self.ffn_1 = layers.Dense(self.ff_dim, activation="relu") # Dense layer
        self.ffn_2 = layers.Dense(self.embed_dim) # Another dense layer
        self.dropout_2 = layers.Dropout(self.dropout_rate)
        self.ln_2 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(
            batch_size, seq_len, seq_len, tf.bool
        ) # Creating the causal mask
        attention_output, attention_scores = self.attn(
            inputs,
            inputs,
            attention_mask=causal_mask,
            return_attention_scores=True,
        ) # Getting the attention scores
        attention_output = self.dropout_1(attention_output)
        out1 = self.ln_1(inputs + attention_output) # Skip connection adds the query and attention then applies layer normalization
        ffn_1 = self.ffn_1(out1) # Fully connected
        ffn_2 = self.ffn_2(ffn_1) # Second fully connected
        ffn_output = self.dropout_2(ffn_2)
        return (self.ln_2(out1 + ffn_output), attention_scores) # Second skip connection: adds the first layer normalization's output
        # to the result of passing it through the two feed-forward layers, then applies layer normalization

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "key_dim": self.key_dim,
                "embed_dim": self.embed_dim,
                "num_heads": self.num_heads,
                "ff_dim": self.ff_dim,
                "dropout_rate": self.dropout_rate,
            }
        )
        return config

## 7. Create the Token and Position Embedding <a name="embedder"></a>

### Positional Embedding

Before we can do any of this, we have to keep in mind that we are passing all the queries and keys in together, in parallel. This makes training faster, but we need a way to encode order in the inputs, as the meaning of a sentence changes significantly with word order. Consider "The dog looked at the boy and ..." versus "The boy looked at the dog and ...". They contain the same words in a different order, so we want the model to be able to distinguish between them. This is as simple as creating a second embedding layer whose `input_dim` is the maximum sentence length (as opposed to the vocabulary size, as for the token embedding) and adding the two embeddings.
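
A quick NumPy sketch shows why adding a position embedding makes word order visible. The lookup tables below are random stand-ins for learned embeddings, and the token IDs are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, embed_dim = 10, 6, 4  # toy sizes (assumed)

token_table = rng.normal(size=(vocab_size, embed_dim))  # one row per token ID
pos_table = rng.normal(size=(max_len, embed_dim))       # one row per position


def embed(token_ids):
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    # Look up both embeddings and add them element-wise.
    return token_table[token_ids] + pos_table[positions]


a = embed([3, 5, 7])  # e.g. "dog looked boy"
b = embed([7, 5, 3])  # the same tokens in reverse order

print(np.allclose(token_table[[3, 5, 7]].sum(axis=0),
                  token_table[[7, 5, 3]].sum(axis=0)))
print(np.allclose(a.sum(axis=0), b.sum(axis=0)))
```

The token embeddings alone form the same set of vectors for both orderings, but once the position embedding is added, the vector at each position depends on where the token sits in the sequence.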

In [54]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, max_len, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.max_len = max_len
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.token_emb = layers.Embedding(
            input_dim=vocab_size, output_dim=embed_dim
        ) # Token embedding
        self.pos_emb = layers.Embedding(input_dim=max_len, output_dim=embed_dim) # Position embedding

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1) # Getting a list of integer positions
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions # Just adding the two

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "max_len": self.max_len,
                "vocab_size": self.vocab_size,
                "embed_dim": self.embed_dim,
            }
        )
        return config

## 8. Build the Transformer model <a name="transformer_decoder"></a>

In [55]:
inputs = layers.Input(shape=(None,), dtype=tf.int32)
x = TokenAndPositionEmbedding(MAX_LEN, VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x, attention_scores = TransformerBlock(
    N_HEADS, KEY_DIM, EMBEDDING_DIM, FEED_FORWARD_DIM
)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x) # Final layer for outputting probabilities over all the vocabulary words
# At this point, we have (batch_size, sequence_length, 10000)
gpt = models.Model(inputs=inputs, outputs=[outputs, attention_scores])
gpt.compile("adam", loss=[losses.SparseCategoricalCrossentropy(), None])

In [56]:
gpt.summary()

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, None)]            0         
                                                                 
 token_and_position_embeddin  (None, None, 256)        2580480   
 g_5 (TokenAndPositionEmbedd                                     
 ing)                                                            
                                                                 
 transformer_block_5 (Transf  ((None, None, 256),      658688    
 ormerBlock)                  (None, 2, None, None))             
                                                                 
 dense_17 (Dense)            (None, None, 10000)       2570000   
                                                                 
Total params: 5,809,168
Trainable params: 5,809,168
Non-trainable params: 0
_________________________________________________
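
The parameter counts in the summary above can be reproduced with a little arithmetic. The sketch below assumes that every dense projection has a bias and that the value dimension of the attention layer equals `KEY_DIM`, which is our reading of the layer's defaults.

```python
VOCAB_SIZE, MAX_LEN, EMBEDDING_DIM = 10000, 80, 256
KEY_DIM, N_HEADS, FEED_FORWARD_DIM = 256, 2, 256

# Token embedding table plus position embedding table.
embedding = VOCAB_SIZE * EMBEDDING_DIM + MAX_LEN * EMBEDDING_DIM

# Query, key, and value projections: weights plus biases, for all heads.
proj = EMBEDDING_DIM * N_HEADS * KEY_DIM + N_HEADS * KEY_DIM
# Output projection back to the embedding size.
attn_out = N_HEADS * KEY_DIM * EMBEDDING_DIM + EMBEDDING_DIM
attention = 3 * proj + attn_out

# Two layer normalizations, each with a scale and a shift per feature.
layer_norms = 2 * (2 * EMBEDDING_DIM)

# Two dense layers in the feed-forward network.
ffn = (EMBEDDING_DIM * FEED_FORWARD_DIM + FEED_FORWARD_DIM) + (
    FEED_FORWARD_DIM * EMBEDDING_DIM + EMBEDDING_DIM
)

block = attention + layer_norms + ffn
head = EMBEDDING_DIM * VOCAB_SIZE + VOCAB_SIZE
total = embedding + block + head

print(embedding, block, head, total)
```

Most of the parameters sit in the token embedding and the final dense layer, both of which scale with the vocabulary size.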

In [57]:
if LOAD_MODEL:
    # model.load_weights('./models/model')
    gpt = models.load_model("./models/gpt", compile=True)

## 9. Train the Transformer <a name="train"></a>

In [58]:
# Create a TextGenerator callback
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        super().__init__()
        self.index_to_word = index_to_word
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        } # Reverses the vocab we created earlier to go from word to index

    def sample_from(self, probs, temperature):  # Once we have our probability vectors, this will sample from them
        probs = probs ** (1 / temperature) # Rescales the probabilities by a temperature: closer to 0 is more deterministic, 1 leaves them unchanged, larger values are more random
        probs = probs / np.sum(probs) # The probabilities no longer sum to one, so we normalize them
        return np.random.choice(len(probs), p=probs), probs # Sampling a single value from the probabilities

    def generate(self, start_prompt, max_tokens, temperature):
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            y, att = self.model.predict(x, verbose=0)
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append(
                {
                    "prompt": start_prompt,
                    "word_probs": probs,
                    "atts": att[0, :, -1, :],
                }
            )
            start_tokens.append(sample_token)
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("wine review", max_tokens=80, temperature=1.0)
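
The effect of the temperature in `sample_from` is easiest to see on a small made-up distribution. The sketch below applies the same rescaling (raise to the power `1 / temperature`, then renormalize) outside the callback; the helper name `apply_temperature` is our own.

```python
import numpy as np


def apply_temperature(probs, temperature):
    # Raise each probability to the power 1 / temperature, then renormalize.
    probs = np.asarray(probs, dtype=float) ** (1.0 / temperature)
    return probs / probs.sum()


base = np.array([0.5, 0.3, 0.2])  # toy next-token distribution

sharp = apply_temperature(base, 0.5)  # lower temperature: more peaked
same = apply_temperature(base, 1.0)   # temperature 1: unchanged
flat = apply_temperature(base, 2.0)   # higher temperature: flatter

for name, p in [("0.5", sharp), ("1.0", same), ("2.0", flat)]:
    print(name, p.round(3))
```

A lower temperature concentrates probability on the most likely token, which is why generations sampled at lower temperatures tend to be more conservative.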

In [59]:
# Create a model save checkpoint
model_checkpoint_callback = callbacks.ModelCheckpoint(
    filepath="./checkpoint/checkpoint.ckpt",
    save_weights_only=True,
    save_freq="epoch",
    verbose=0,
)

tensorboard_callback = callbacks.TensorBoard(log_dir="./logs")

# Create the text generator callback from the vocabulary
text_generator = TextGenerator(vocab)

In [60]:
gpt.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[model_checkpoint_callback, tensorboard_callback, text_generator],
)

Epoch 1/5




generated text:
wine review : us : california : merlot : this is a dense , rich , floral syrah , blended with 25 % syrah and 46 % petite sirah . many big and broad purple leather combine on the palate , with black cherry and baking spices , which match stick and cola flavors . 

Epoch 2/5
generated text:
wine review : morocco : zenata : viognier : there is a very nice burst of pineapple and freshly mowed lawn in this orange zest flavor and refreshing in acidity with floral , semisweet flavors of tart green apples . 

Epoch 3/5
generated text:
wine review : us : oregon : pinot noir : this pretty pale red , rich , berry fruit with accents of cherry and forest floor . the flavors are rounded , appealing , spicy , accented with a hint of pencil lead and baking spices . the subtle , euro style convinces , though not for you want in a wine that will probably be best enjoyed in balance . 

Epoch 4/5
generated text:
wine review : chile : apalta : carmenère : tarry , but it relies on alert cola

<keras.callbacks.History at 0x7f10a73a2920>

In [61]:
# Save the final model
gpt.save("./models/gpt")


INFO:tensorflow:Assets written to: ./models/gpt/assets




## 10. Generate text using the Transformer

In [62]:
def print_probs(info, vocab, top_k=5):
    for i in info:
        highlighted_text = []
        for word, att_score in zip(
            i["prompt"].split(), np.mean(i["atts"], axis=0)
        ):
            highlighted_text.append(
                '<span style="background-color:rgba(135,206,250,'
                + str(att_score / max(np.mean(i["atts"], axis=0)))
                + ');">'
                + word
                + "</span>"
            )
        highlighted_text = " ".join(highlighted_text)
        display(HTML(highlighted_text))

        word_probs = i["word_probs"]
        p_sorted = np.sort(word_probs)[::-1][:top_k]
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        for p, idx in zip(p_sorted, i_sorted):
            print(f"{vocab[idx]}:   \t{np.round(100*p,2)}%")
        print("--------\n")

In [63]:
info = text_generator.generate(
    "wine review : us", max_tokens=80, temperature=1.0
)


generated text:
wine review : us : new york : gewürztraminer : while voluptuous aromas of new french oak spice and ripe copper penny add rich notes of allspice , dried banana and honeycomb on this spicy expression . on the palate , it ' s soft and supple in mouthfeel , it ' s a rich but lengthy , spice flavors acidity and a vibrant spiciness that lingers long on the finish . 



In [64]:
info = text_generator.generate(
    "wine review : italy", max_tokens=80, temperature=0.5
)


generated text:
wine review : italy : northeastern italy : chardonnay : this is a beautiful wine with intense aromas of toasted oak , butterscotch , vanilla and butterscotch . the palate is rich and creamy , with a shot of acidity . 



In [65]:
info = text_generator.generate(
    "wine review : germany", max_tokens=80, temperature=0.5
)
print_probs(info, vocab)


generated text:
wine review : germany : mosel : riesling : this is a very minerally , refreshing riesling from start to finish in this dry , richly textured spätlese . it ' s a luscious , luscious peach and apricot flavors penetrate deeply on the palate , yet somehow manages to dazzle the palate . drink now through 2025 . 



::   	100.0%
-:   	0.0%
zealand:   	0.0%
grosso:   	0.0%
africa:   	0.0%
--------



mosel:   	99.37%
pfalz:   	0.36%
rheingau:   	0.19%
nahe:   	0.03%
rheinhessen:   	0.03%
--------



::   	99.99%
-:   	0.01%
noir:   	0.0%
blanc:   	0.0%
blend:   	0.0%
--------



riesling:   	100.0%
pinot:   	0.0%
chardonnay:   	0.0%
weissburgunder:   	0.0%
white:   	0.0%
--------



::   	100.0%
-:   	0.0%
blanc:   	0.0%
grosso:   	0.0%
blend:   	0.0%
--------



a:   	22.36%
while:   	19.02%
this:   	13.88%
the:   	7.98%
hints:   	4.46%
--------



is:   	36.63%
wine:   	18.94%
intensely:   	8.88%
riesling:   	6.73%
off:   	5.72%
--------



a:   	94.26%
an:   	5.01%
intensely:   	0.22%
off:   	0.09%
remarkably:   	0.06%
--------



bit:   	36.01%
straightforward:   	6.11%
delicate:   	4.33%
delightfully:   	4.15%
rich:   	3.1%
--------



dry:   	24.24%
fresh:   	10.89%
perfumed:   	10.35%
floral:   	6.81%
rich:   	5.1%
--------



,:   	92.58%
and:   	6.45%
yet:   	0.27%
wine:   	0.25%
riesling:   	0.17%
--------



dry:   	20.44%
deeply:   	15.16%
intensely:   	6.74%
mineral:   	6.16%
yet:   	5.6%
--------



wine:   	44.96%
riesling:   	44.51%
,:   	3.76%
and:   	2.91%
white:   	1.36%
--------



that:   	32.12%
.:   	31.94%
,:   	15.41%
with:   	14.12%
from:   	5.23%
--------



start:   	49.39%
the:   	23.2%
a:   	14.1%
nose:   	6.36%
mosel:   	3.89%
--------



to:   	100.0%
,:   	0.0%
.:   	0.0%
-:   	0.0%
it:   	0.0%
--------



finish:   	99.95%
the:   	0.04%
a:   	0.0%
nose:   	0.0%
this:   	0.0%
--------



.:   	61.49%
,:   	34.59%
with:   	1.63%
that:   	0.73%
in:   	0.72%
--------



this:   	62.83%
the:   	22.57%
a:   	11.26%
its:   	2.9%
an:   	0.1%
--------



dry:   	50.54%
feather:   	14.29%
off:   	13.85%
intensely:   	3.27%
lip:   	2.67%
--------



,:   	79.75%
riesling:   	12.15%
-:   	6.46%
and:   	0.47%
yet:   	0.29%
--------



intensely:   	22.85%
full:   	9.16%
medium:   	7.27%
yet:   	6.53%
crisp:   	4.26%
--------



textured:   	97.68%
fruity:   	1.36%
concentrated:   	0.52%
layered:   	0.18%
perfumed:   	0.07%
--------



riesling:   	48.48%
auslese:   	22.3%
spätlese:   	16.68%
wine:   	10.19%
kabinett:   	2.17%
--------



.:   	99.92%
,:   	0.03%
that:   	0.02%
is:   	0.01%
from:   	0.01%
--------



it:   	78.93%
the:   	4.53%
its:   	3.8%
hints:   	2.2%
flavors:   	1.86%
--------



':   	99.8%
has:   	0.09%
is:   	0.04%
offers:   	0.02%
shows:   	0.01%
--------



s:   	100.0%
ll:   	0.0%
[UNK]:   	0.0%
11:   	0.0%
hints:   	0.0%
--------



a:   	31.39%
intensely:   	25.69%
lusciously:   	9.84%
rich:   	3.59%
delicate:   	2.54%
--------



bit:   	42.21%
thrilling:   	10.74%
lush:   	5.19%
rich:   	4.05%
luscious:   	3.87%
--------



,:   	90.06%
mouthful:   	1.68%
and:   	1.47%
wine:   	1.28%
yet:   	0.99%
--------



luscious:   	16.12%
lush:   	12.95%
concentrated:   	9.63%
juicy:   	9.59%
lusciously:   	8.37%
--------



spätlese:   	34.75%
wine:   	10.49%
and:   	9.5%
,:   	9.01%
riesling:   	7.79%
--------



and:   	44.0%
-:   	41.44%
flavor:   	11.77%
,:   	1.87%
fruit:   	0.39%
--------



apricot:   	75.28%
honey:   	12.29%
tangerine:   	2.01%
melon:   	1.93%
pear:   	1.82%
--------



flavors:   	91.98%
flavor:   	3.19%
fruit:   	2.65%
nectar:   	1.3%
notes:   	0.58%
--------



penetrate:   	48.4%
are:   	45.7%
flood:   	2.34%
,:   	1.76%
[UNK]:   	0.38%
--------



deeply:   	86.44%
through:   	11.78%
the:   	1.24%
on:   	0.11%
throughout:   	0.1%
--------



,:   	69.28%
on:   	10.9%
.:   	9.01%
through:   	7.33%
into:   	1.49%
--------



the:   	99.86%
its:   	0.13%
a:   	0.01%
,:   	0.0%
every:   	0.0%
--------



palate:   	98.5%
midpalate:   	0.92%
finish:   	0.41%
long:   	0.09%
nose:   	0.05%
--------



,:   	80.79%
.:   	18.28%
and:   	0.34%
but:   	0.19%
yet:   	0.15%
--------



but:   	49.33%
finishing:   	18.56%
yet:   	9.71%
with:   	7.77%
accented:   	4.41%
--------



it:   	18.77%
vibrantly:   	12.44%
finishes:   	12.42%
the:   	8.53%
deeply:   	3.96%
--------



manages:   	82.86%
it:   	5.08%
weightless:   	2.67%
balanced:   	1.21%
reined:   	0.74%
--------



to:   	99.94%
mosel:   	0.04%
elegance:   	0.01%
that:   	0.0%
refreshment:   	0.0%
--------



be:   	39.83%
capture:   	13.84%
strike:   	8.65%
balance:   	7.91%
keep:   	5.58%
--------



a:   	31.31%
the:   	19.97%
and:   	11.91%
in:   	7.76%
with:   	7.74%
--------



palate:   	79.66%
mosel:   	13.94%
long:   	1.34%
midpalate:   	1.05%
unctuousness:   	0.94%
--------



.:   	69.72%
,:   	16.85%
and:   	6.87%
with:   	4.43%
that:   	0.65%
--------



it:   	63.6%
:   	20.9%
drink:   	7.09%
a:   	3.07%
the:   	1.8%
--------



now:   	78.28%
now–2016:   	10.35%
now–2025:   	3.92%
now–2030:   	2.39%
after:   	0.84%
--------



through:   	98.91%
or:   	0.65%
.:   	0.19%
and:   	0.14%
for:   	0.04%
--------



2020:   	48.0%
2025:   	42.97%
2018:   	4.02%
2021:   	1.28%
2019:   	1.05%
--------



.:   	97.76%
or:   	2.09%
,:   	0.09%
and:   	0.06%
to:   	0.0%
--------



:   	100.0%
imported:   	0.0%
.:   	0.0%
through:   	0.0%
[UNK]:   	0.0%
--------

