In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
import numpy as np
import os
import re
import string
import random


2023-04-19 11:31:32.900850: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


# WhackGPT

We can make a transformer-based model to generate ChatGPT-ish text responses. Ours will be far more stupid, but hey, it's a talking computer. 

## Transformer Architecture

What the heck is a transformer, what does it do, and why is it so cool? A transformer model is a type of neural network that was created in 2017 at Google. The core idea behind transformers is attention, which is detailed a little bit below. The diagrammed structure of a transformer model can be a little intimidating, but we can make sense of the critical parts without too much issue. 

![Transformer](images/transformer.png "Transformer")

A transformer model contains a few key parts, each of which is dealt with in more detail below.
<ul>
<li> Embedding - the embedding layer generates embeddings (vector representations) for each token. The embeddings are created for both the token itself and its position in the sequence. </li>
<li> Attention layers - the attention layers are the core of the transformer model. They are responsible for creating a representation of the input sequence that is used to generate the output sequence. </li>
<li> Encoder - the encoder is a stack of attention layers that are used to create a representation of the input sequence. </li>
<li> Decoder - the decoder is a stack of attention layers that are used to create a representation of the output sequence. </li>
</ul>

The attention part is the star of the show: it is a method for focusing the model on the critical portions of the input sequence so that it can generate contextually informed predictions for the output. Transformers also do all of this in a way that is more parallelizable than the LSTM-based models that were the state of the art only a few years before transformers. 

### Embedding

The embedding here has two parts:
<ul>
<li> Token embedding: This maps each token to a vector representation in N-dimensional space. This is what we are used to for embedding. The original transformer paper used a 512-dimensional embedding, so each token was represented by a vector of 512 values that position it on a 512D grid. 
<li> Positional embedding: This maps each token's position in the <i>sequence</i> to a vector. The position embedding can be thought of as an extension of simply tracking which word of a sentence each token is: 1, 2, 3...
</ul>

#### Token Embedding

Token embedding is something that we are used to from when we used word2vec to generate embeddings for classification models. We are translating each token into an N-dimensional representation in space. The big difference here is that our embedding space is being learned by the model during training, so we should expect that the model will be shifting each token around in space as it learns more about what that word means, or more accurately, how it is used in our training data. 

![Embedding](images/embedding.png "Embedding")
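
As a rough illustration of what an embedding layer does, the sketch below treats the embedding as a lookup table: a matrix with one row per vocabulary entry, indexed by token id. The tiny sizes and the names `embedding_table` and `token_ids` are made up for this example; in the real model the table is a trainable weight inside `layers.Embedding`.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_vocab_size = 6  # hypothetical tiny vocabulary
toy_embed_dim = 4   # hypothetical embedding size

# One row per token id. A real model starts from random values like these
# and adjusts them during training.
embedding_table = rng.normal(size=(toy_vocab_size, toy_embed_dim))

# A "sentence" of three token ids; id 2 appears twice.
token_ids = np.array([2, 5, 2])

# Looking up embeddings is just row indexing into the table.
token_vectors = embedding_table[token_ids]

print(token_vectors.shape)  # (3, 4)
```

Both occurrences of token 2 get the same vector here, which is exactly why the model also needs positional information.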

#### Positional Embedding

The need for the positional embedding is most clearly seen if we compare this to an LSTM. In an LSTM, the position of a token is always known, as we process the data sequentially. In the transformer model, the data is taken in parallel, so we don't have the sequence information built in. This has the benefit of allowing the model to process more of its work in parallel than an LSTM, but it also means that the model needs to be told where each token is in the sequence. What is the positional embedding? It follows the same concept as the token embedding: we are representing something with a vector of values. In the positional embedding, the math is a little involved, but it uses sine and cosine functions to represent the position of a token. 

![Positional Embedding](images/positional_emb.png "Positional Embedding")

Where:
<ul>
<li> <b>k:</b> position of the token. 
<li> <b>d:</b> dimension of the embedding.
<li> <b>i:</b> index of the sine/cosine pair within the embedding; even columns (2i) use sine and odd columns (2i + 1) use cosine.
</ul>
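
A minimal NumPy sketch of this kind of encoding is below. It assumes the standard formulation from the original transformer paper, with a base of 10000 (called `n` here) and an even embedding dimension; the function name is made up for this example. The notebook's own `TokenAndPositionEmbedding` layer does not use this formula, it learns the position vectors instead.

```python
import numpy as np


def sinusoidal_position_encoding(seq_len, d, n=10000.0):
    """Return a (seq_len, d) matrix whose row k encodes position k.

    Assumes d is even and uses the base n = 10000 from the original paper.
    """
    k = np.arange(seq_len)[:, None]      # token positions, shape (seq_len, 1)
    i = np.arange(d // 2)[None, :]       # sine/cosine pair index, shape (1, d/2)
    angles = k / np.power(n, 2 * i / d)  # shape (seq_len, d/2)

    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)  # even columns use sine
    pe[:, 1::2] = np.cos(angles)  # odd columns use cosine
    return pe


pe = sinusoidal_position_encoding(seq_len=80, d=256)
print(pe.shape)  # (80, 256)
```

Every value stays between -1 and 1 regardless of how long the sequence is, and the matrix has the same width as the token embedding so the two can be added together.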

This positional embedding uses the trig functions to introduce some additional capability to our embedding values. First, this helps if we encounter longer sentences later on: the sine and cosine values stay between -1 and 1 no matter how long the sequence is, whereas a simple word count number would keep growing. Second, the trig functions encode position in a way that makes relative offsets easy to work with: the encoding of a position a fixed distance away is a linear function of the current position's encoding. This means that the model can learn where tokens occur in relation to each other without being told explicitly. This is useful if you think of sentences such as:
<ul>
<li> I do not like the story of the movie, but I do like the cast.
<li> I do like the story of the movie, but I do not like the cast.
</ul>

These two sentences use the same words, but the meaning is opposite. The positional embedding helps capture the relationship between the words based on where they occur, and connect words that occur in certain "areas" to those in other "areas" of a sentence. This is really useful if you think of something like an adjective: that adjective modifies some noun, and understanding English requires that we are able to identify which noun it belongs to. Positional embedding with sine/cosine helps with that. The position is recorded not only in a way that tells us where a word sits in an absolute sense, but also in a way that tells us where that word sits relative to the other words it is with. This is one reason transformers are so useful for tasks like language: their ability to contextualize the relationships in parts of text surpasses that of other models that we have today. When generating text, this gives us the most natural sounding text, as the "next word" prediction is based on a more comprehensive understanding of the sentence. 

Notably, the positional embedding uses the word embedding dimension, d, as its own dimension. This is because the positional embedding is added to the token embedding, so the two need to be the same size. This means that the vector for each token can be quite large. It also means that the input to the rest of the model is a single high-dimensional vector per token that combines both pieces of information: what the token is, and where it is in the sequence.

<b>Note:</b> the code below takes a simpler route than the sine/cosine formula: it learns the positional embedding with a second `Embedding` layer indexed by position, and adds the result to the token embedding.

In [3]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        maxlen = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x + positions

## Transformer Construction

We can now create a function to construct the core piece of our model, the transformer. The transformer layer has a few parts, the critical one being the attention layer. 

<b>Note:</b> the declarations of the layers are slightly different in a subclassed layer. The sub-layers are created in `__init__`; each one is a function that takes an input tensor and returns an output tensor. They are then called in the `call` method of the layer.

### Pay (Multi-Head) Attention

The core piece of the transformer architecture is the attention mechanism. Attention serves as a way to focus on, or pay attention to, certain parts of the input.

The attention mechanism contains three key matrices that we'll ultimately use to calculate things:
<ul>
<li> Query
<li> Key
<li> Value
</ul>

The query, key, and value are commonly described as analogous to running a search. For example, when you search for videos on YouTube, the search engine will map your <b>query</b> (text in the search bar) against a set of <b>keys</b> (video title, description, etc.) associated with candidate videos in their database, then present you the best matched <b>values</b> (videos).  

Using the query, key, and value objects involves a multistep process. 
<ul>
<li> First, the query, key, and value each get a copy of the embedding (position and token) matrix, which is then multiplied by a set of weights that belong to a linear layer (no activation) for that Q/K/V input. 
<li> The value matrix is set aside for the moment. 
<li> The resulting query and key matrices are then multiplied by each other, which generates attention scores (in practice these are also scaled down by the square root of the key dimension). 
<li> The scores are then passed through a softmax function, which normalizes them into weights and generates the actual attention mask. 
<li> The normalized weights are then multiplied by the value matrix, which gives us the final output.
</ul>
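
The steps above can be sketched for a single attention head in plain NumPy. This is only an illustration under made-up sizes and random weights, not the internals of `layers.MultiHeadAttention`; the names `single_head_attention`, `w_q`, `w_k`, and `w_v` are invented for the example. It includes the usual scaling of the scores by the square root of the key dimension.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def single_head_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) embeddings; w_*: (d_model, d_head) linear weights."""
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values, set aside until the end
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4  # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))

output, weights = single_head_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (5, 4) (5, 5)
```

Row `t` of `weights` says how much token `t` draws from every token in the sequence when its output vector is assembled.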

To ultimately create the layer, we have several of these heads, similar to filters in a CNN. 

![Multi-Head Attention](images/multi_head_att.png "Multi-Head Attention")

Take an example sentence, "Anthony Hopkins admired Michael Bay as a great director"; the product of the query and key matrices would look something like this:

![Attention Mechanism](images/key_value.png "Attention Mechanism")

These attention scores are measures of how important each word in the input sequence is to each other word. We normally see each word being really important to itself, then as the similarity decreases, the importance decreases. In this example, "Hopkins" and "Anthony" have a high attention score with respect to each other, which makes sense! We would likely want to produce those two words in sequence. Given large amounts of data, the model can become very good at identifying what is important and what is not, and in particular, understanding context. Because the attention is based on the position and token embeddings, and we have multiple heads (see below) each homing in on some other aspect of the text, the model can learn relationships between parts of speech that are challenging for other types of models, such as a sentence that has a large independent clause in the middle of it or figures of speech that have little impact on the meaning of a sentence. Importantly, each token in a sentence is taken as the input, so we generate such a matrix for each "query" token.

![Attention Sequences](images/attention_seq.png "Attention Sequences")

#### Attention Masking

Once we get the attention mask (the softmax-normalized attention weights), we combine it with the value matrix to get the final output from our attention layer. The easiest way to think of applying an attention mask is with an example from computer vision. What we are trying to do in computer vision, say in image recognition, is capture information from the "important part" of the image; normally we don't want to focus on background stuff. The attention mask acts basically as a filter that blocks out the less important and lets through the more important, so we can think of the end result as input + mask = useful output.

The image below is a little blurry, but it shows the idea. If we have a model being trained to identify objects, we might end up with a mask that looks like this. Note the final result and the original (which has had the color space changed). The desired result is the bottom left, where the objects we want to identify are the focus. Applying the mask to the original does exactly that: it removes the less important stuff and emphasizes the more important stuff. With language, we get the same thing. We want to focus on the important parts of a sentence and ignore the less important parts, and that measure of importance is what we are learning during training. 

![Attention Mask](images/attention_mask.png "Attention Mask")

<b>Note:</b> we also have a causal mask, which is used to prevent the model from "cheating" by looking ahead in the input sequence. This effectively stops the model from just looking up the answer, which would let it sidestep learning. 
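
A small NumPy sketch of the causal mask idea follows. The scores are random stand-ins and the sizes are made up; for a square score matrix, the lower-triangular pattern is the same one that the `causal_attention_mask` function in this notebook builds.

```python
import numpy as np

seq_len = 4  # hypothetical toy sequence length

# True where a query position (row) may look at a key position (column):
# itself and everything before it, never anything after it.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # stand-in attention scores

# Blocked positions get -inf so that softmax assigns them zero weight.
masked_scores = np.where(causal_mask, scores, -np.inf)

shifted = masked_scores - masked_scores.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)

print(causal_mask.astype(int))
print(np.round(weights, 2))
```

The first token can only attend to itself, the second to the first two, and so on, so no prediction can depend on tokens that come later in the sequence.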

#### Multi-Head Attention

The layer that we are adding is called a multi-head attention layer, implying that we have multiple attention filters at once. This part works similarly to how the convolutional filters work in a CNN. Each filter in a CNN learns to identify some useful feature in that context - edges, colors, etc... Here, each attention head learns to focus on a different aspect of the input, language in our case. As our model is trained, each attention head will learn to focus on different aspects of the input. Recall that the weights for the filter are normally random initially, so the training process will cause each one to find its own thing to focus on as we shrink the loss. 
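
To make the "several heads at once" idea concrete, here is a hedged NumPy sketch: each head has its own randomly initialized Q/K/V weights, runs attention independently, and the head outputs are concatenated and passed through one more linear layer. All sizes and names (`multi_head_attention`, `heads`, `w_out`) are invented for the example, and the Keras `MultiHeadAttention` layer handles details such as biases and batching that are left out here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, heads, w_out):
    """heads: list of (w_q, w_k, w_v) weight triples, one triple per head."""
    head_outputs = []
    for w_q, w_k, w_v in heads:
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        head_outputs.append(weights @ v)  # (seq_len, d_head) per head
    concatenated = np.concatenate(head_outputs, axis=-1)
    return concatenated @ w_out  # project back to (seq_len, d_model)


rng = np.random.default_rng(0)
seq_len, d_model, num_heads, d_head = 5, 8, 2, 4  # hypothetical toy sizes
x = rng.normal(size=(seq_len, d_model))

# Every head starts from different random weights, so each one can end up
# focusing on a different aspect of the input as training proceeds.
heads = []
for _head in range(num_heads):
    w_q = rng.normal(size=(d_model, d_head))
    w_k = rng.normal(size=(d_model, d_head))
    w_v = rng.normal(size=(d_model, d_head))
    heads.append((w_q, w_k, w_v))
w_out = rng.normal(size=(num_heads * d_head, d_model))

out = multi_head_attention(x, heads, w_out)
print(out.shape)  # (5, 8)
```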

### Attention Magic

This is a very brief and high-level overview of attention and its application to our neural networks. There is a lot more to it, and it is a very interesting topic; based on what we know now (2023), transformer-based models will likely be exceedingly common in the near future. If you want to learn more, I recommend the following resources:
<ul>
<li> https://data-science-blog.com/blog/2021/04/07/multi-head-attention-mechanism/
<li> https://www.youtube.com/watch?v=6D4EWKJgNn0
<li> https://data-science-blog.com/blog/2021/04/22/positional-encoding-residual-connections-padding-masks-all-the-details-of-transformer-model/
</ul>

The ability of transformer models to learn, without external direction, what is important and what is not is what makes them both so powerful and so flexible. The examples of the GPT models accurately performing tasks that they weren't trained on are good examples of this flexibility. If we have training data to supply the transformer model, it can very accurately learn to extract what matters from what doesn't, irrespective of the specific task that it is working on, which makes learning that task much easier.

In [4]:
def causal_attention_mask(batch_size, n_dest, n_src, dtype):
    """
    Mask the upper half of the dot product matrix in self attention.
    This prevents flow of information from future tokens to current token.
    1's in the lower triangle, counting from the lower right corner.
    """
    i = tf.range(n_dest)[:, None]
    j = tf.range(n_src)
    m = i >= j - n_src + n_dest
    mask = tf.cast(m, dtype)
    mask = tf.reshape(mask, [1, n_dest, n_src])
    mult = tf.concat(
        [tf.expand_dims(batch_size, -1), tf.constant([1, 1], dtype=tf.int32)], 0
    )
    return tf.tile(mask, mult)


class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.att = layers.MultiHeadAttention(num_heads, embed_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size = input_shape[0]
        seq_len = input_shape[1]
        causal_mask = causal_attention_mask(batch_size, seq_len, seq_len, tf.bool)
        attention_output = self.att(inputs, inputs, attention_mask=causal_mask)
        attention_output = self.dropout1(attention_output)
        out1 = self.layernorm1(inputs + attention_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output)
        return self.layernorm2(out1 + ffn_output)

### Create Model

We can now create the model, and we will use the portions that we constructed above. The basic parts are:
<ul>
<li> Token and positional embedding - create representations of each sequence. 
<li> Transformer layers - the core of the model.
<li> Output layer - a dense layer to convert the output of the transformer layers to the output of the model (one score per vocabulary word).
</ul>

The basic structure of different varieties of neural networks is also seen here. We again have a dense neural network to generate predictions from inputs, and that network can be fed by either:
<ul>
<li> Our actual data, for normal regression or classification.
<li> The output of convolutional layers, for image processing. 
<li> The output of recurrent layers, for sequential data.
<li> The output of transformer layers, for a quickly expanding range of tasks. 
</ul>

No matter the specific implementation, the basic structure, and ability to learn, is the same in all neural networks. The ability to learn relationships that are complex, obscure, and impossible for a human to describe makes neural networks extremely powerful. If we can generate some architecture that is good at extracting features from some specific type of data, we can combine that with a regular neural network to make all kinds of predictions or generate new data. Our "predictor" dense model and the "extractor" early layers can then both learn epoch by epoch, together, to be as accurate as possible.

As the capacity of processors increases and the experience of researchers grows, we can expect to see more and more expansion in what neural networks can do. In particular, the increased ability to parallelize the processing of sequential data with the transformer architecture is massively helpful. We saw in the LSTM models that the depth of the sequences of calculations meant that growing models to be very powerful requires lots of processing, in a way that is extremely hard to parallelize, limiting the growth. Transformers can do more in parallel, and it is much easier to add another processor than it is to develop a processor that is twice as fast; these models will likely grow to more efficiently process data across large networks of worker machines, generating larger and more powerful models.

In [5]:
vocab_size = 20000  # Only consider the top 20k words
maxlen = 80  # Max sequence size
embed_dim = 256  # Embedding size for each token
num_heads = 2  # Number of attention heads
feed_forward_dim = 256  # Hidden layer size in feed forward network inside transformer


def create_model():
    inputs = layers.Input(shape=(maxlen,), dtype=tf.int32)
    embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
    x = embedding_layer(inputs)
    transformer_block = TransformerBlock(embed_dim, num_heads, feed_forward_dim)
    x = transformer_block(x)
    outputs = layers.Dense(vocab_size)(x)
    model = keras.Model(inputs=inputs, outputs=[outputs, x])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(
        "adam", loss=[loss_fn, None],
    )  # No loss and optimization based on word embeddings from transformer block
    return model


### Get Data and Prepare for Training

This example uses some movie reviews for source data. The dataset comes already split into positive and negative labels, for classification, and into training and testing sets. We don't need any of these divisions, we just need all the text for training, so the data preparation steps here are:
<ul>
<li> Download the data.
<li> Loop through all the files and generate a list of all the file names. 
<li> Create a dataset from all the files. 
<li> Clean the data by removing the html tags and punctuation.
<li> Tokenize the data by splitting the text into words and creating a vocabulary.
<li> Create training ready data by creating sequences of X = "up to the current word" and Y = "the next word".
<li> Set the dataset to be shuffled, batched, and prefetched.
</ul>
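
The "up to the current word" / "the next word" step is easy to see on a toy example. The sketch below uses plain Python lists and made-up token ids; `toy_maxlen` and `tokens` are illustrative names, and the real pipeline applies the same slicing to whole batches of tensors.

```python
# Hypothetical token ids for one review, already padded or truncated to
# toy_maxlen + 1 tokens (the extra token supplies the final target).
toy_maxlen = 5
tokens = [12, 218, 13, 22, 37, 2]

x = tokens[:-1]  # inputs: everything except the last token
y = tokens[1:]   # targets: the same sequence shifted left by one

# At each position, the model sees the tokens so far and must predict y[position].
for position in range(toy_maxlen):
    context = x[: position + 1]
    print(f"context={context} -> target={y[position]}")
```

One sequence of length `toy_maxlen + 1` therefore yields `toy_maxlen` training examples at once, and the causal mask is what keeps each position from seeing its own target.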

<b>Note:</b> there are a few odd [UNK] tokens, this is a placeholder for words that are not in the vocabulary. Were this a production model, we'd want to come up with some more sophisticated way of handling this, but for this example, we'll just leave it as is. When dealing with natural text, it is common to have things like this for unknown data, or other special tokens for the beginning or end of a sentence (e.g. [BOS] or [EOS]).  
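
As a small illustration of how the [UNK] placeholder comes about, the sketch below encodes a sentence against a made-up six-entry vocabulary. The layout (index 0 for padding, index 1 for [UNK]) mirrors what the `TextVectorization` layer produces in integer mode; the names `toy_vocab`, `encode`, and `decode` are hypothetical.

```python
# Hypothetical tiny vocabulary: index 0 is padding, index 1 is the unknown token.
toy_vocab = ["", "[UNK]", "the", "movie", "is", "great"]
toy_word_to_index = {word: index for index, word in enumerate(toy_vocab)}
UNK_INDEX = toy_word_to_index["[UNK]"]


def encode(sentence):
    """Map each lowercased word to its index, falling back to [UNK]."""
    return [toy_word_to_index.get(word, UNK_INDEX) for word in sentence.lower().split()]


def decode(indices):
    """Map indices back to words."""
    return " ".join(toy_vocab[i] for i in indices)


ids = encode("The movie is whack")
print(ids)          # [2, 3, 4, 1]
print(decode(ids))  # the movie is [UNK]
```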

In [6]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz



In [7]:

batch_size = 256
EPOCHS = 20

# The dataset contains each review in a separate text file
# The text files are present in four different folders
# Create a list of all files
filenames = []
directories = [
    "aclImdb/train/pos",
    "aclImdb/train/neg",
    "aclImdb/test/pos",
    "aclImdb/test/neg",
]
for directory in directories:  # avoid shadowing the built-in dir()
    for f in os.listdir(directory):
        filenames.append(os.path.join(directory, f))

print(f"{len(filenames)} files")

# Create a dataset from text files
random.shuffle(filenames)
text_ds = tf.data.TextLineDataset(filenames)
text_ds = text_ds.shuffle(buffer_size=256)
text_ds = text_ds.batch(batch_size)


def custom_standardization(input_string):
    """ Remove html line-break tags and handle punctuation """
    lowercased = tf.strings.lower(input_string)
    stripped_html = tf.strings.regex_replace(lowercased, "<br />", " ")
    return tf.strings.regex_replace(stripped_html, f"([{string.punctuation}])", r" \1")


# Create a vectorization layer and adapt it to the text
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size - 1,
    output_mode="int",
    output_sequence_length=maxlen + 1,
)
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()  # To get words back from token indices


def prepare_lm_inputs_labels(text):
    """
    Shift word sequences by 1 position so that the target for position (i) is
    word at position (i+1). The model will use all words up till position (i)
    to predict the next word.
    """
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


text_ds = text_ds.map(prepare_lm_inputs_labels)
text_ds = text_ds.prefetch(tf.data.AUTOTUNE)

50000 files




We can look at one example of the data below. 

In [8]:
tmp = text_ds.as_numpy_iterator()
x_tmp, y_tmp = next(tmp)
print(x_tmp.shape, y_tmp.shape)
samp_x = x_tmp[0]
samp_y = y_tmp[0]
print("Tokens:", samp_x, "\n\n", samp_y)
word = ""
for x_ in samp_x:
    word += vocab[x_] + " "
print("Sentence:", word, "\n\nNext word:", vocab[samp_y[-1]])

(256, 80) (256, 80)
Tokens: [   12   218    13    22    37     2  3545 13500  2055    22  1384     3
    13    22    74    84  1121    11     2     1  2028     8 16538    45
    75    12   439    45    43    12    16   192   289   994     8    13
    18     3    12    93   439  4251  1060     4 11159  7792   534     3
     3     3    43    12    32    74    57  1163  2449    19  1084   956
   329 13057   115     3   492     4    12   262    14 15583  3038   134
    28  1643     1    19   109     7   145   108] 

 [  218    13    22    37     2  3545 13500  2055    22  1384     3    13
    22    74    84  1121    11     2     1  2028     8 16538    45    75
    12   439    45    43    12    16   192   289   994     8    13    18
     3    12    93   439  4251  1060     4 11159  7792   534     3     3
     3    43    12    32    74    57  1163  2449    19  1084   956   329
 13057   115     3   492     4    12   262    14 15583  3038   134    28
  1643     1    19   109     7   145   108 

### Callback

To get our text out, we can use a callback that will be called at the end of each epoch. We can still get things from "predict" after the fact, but this will give us some step-by-step evidence of our program's smarts. We will make two instances of this callback, each with a different seed prompt. 

In [9]:
class TextGenerator(keras.callbacks.Callback):
    """A callback to generate text from a trained model.
    1. Feed some starting prompt to the model
    2. Predict probabilities for the next token
    3. Sample the next token and add it to the next input

    Arguments:
        max_tokens: Integer, the number of tokens to be generated after prompt.
        start_tokens: List of integers, the token indices for the starting prompt.
        index_to_word: List of strings, obtained from the TextVectorization layer.
        top_k: Integer, sample from the `top_k` token predictions.
        print_every: Integer, print after this many epochs.
        log_dir: String, directory where the generated text is logged for TensorBoard.
    """

    def __init__(self, max_tokens, start_tokens, index_to_word, top_k=10, print_every=1, log_dir="logs"):
        super().__init__()  # let the base Callback set up its own state
        self.max_tokens = max_tokens
        self.start_tokens = start_tokens
        self.index_to_word = index_to_word
        self.print_every = print_every
        self.k = top_k
        self.log_dir = log_dir

    def sample_from(self, logits):
        logits, indices = tf.math.top_k(logits, k=self.k, sorted=True)
        indices = np.asarray(indices).astype("int32")
        preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
        preds = np.asarray(preds).astype("float32")
        return np.random.choice(indices, p=preds)

    def detokenize(self, number):
        return self.index_to_word[number]

    def on_epoch_end(self, epoch, logs=None):
        # Only generate text every `print_every` epochs (the default of 1 means every epoch)
        if (epoch + 1) % self.print_every != 0:
            return
        start_tokens = list(self.start_tokens)  # work on a copy of the prompt
        num_tokens_generated = 0
        tokens_generated = []
        while num_tokens_generated < self.max_tokens:
            pad_len = maxlen - len(start_tokens)
            sample_index = len(start_tokens) - 1
            if pad_len < 0:
                # Too long for the model: keep only the most recent maxlen tokens
                x = start_tokens[-maxlen:]
                sample_index = maxlen - 1
            elif pad_len > 0:
                # Pad with 0s on the right; the causal mask keeps them from affecting sample_index
                x = start_tokens + [0] * pad_len
            else:
                x = start_tokens
            x = np.array([x])
            y, _ = self.model.predict(x, verbose=0)  # verbose=0 hides the per-call progress bar
            sample_token = self.sample_from(y[0][sample_index])
            tokens_generated.append(sample_token)
            start_tokens.append(sample_token)
            num_tokens_generated = len(tokens_generated)
        txt = " ".join(
            [self.detokenize(_) for _ in self.start_tokens + tokens_generated]
        )
        print(f"generated text:\n{txt}\n")
        # Log the generated text so it can be viewed in TensorBoard
        file_writer = tf.summary.create_file_writer(self.log_dir)
        with file_writer.as_default():
            tf.summary.text("Text Data", txt, step=epoch)
        file_writer.flush()

### Train, Run, Predict

Now that the model is created, we can fit it to the training data and then test out its abilities. Our prediction is an incremental process: we start with a seed, then we predict the next word, then we add that word to the seed and predict the next word, and so on. At each step, the model looks at the input to this point, calculates the attention, scores every word in the vocabulary, samples one of the highest-scoring (top-k) words, and moves one more step forward. 
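
The "pick a word from the best candidates" step can be sketched on its own with NumPy. This is a simplified stand-in for the `sample_from` method of the callback, using made-up logits and a hypothetical helper name `sample_top_k`: keep the `k` highest scores, turn them into probabilities with a softmax, and draw one index at random.

```python
import numpy as np


def sample_top_k(logits, k, rng):
    """Sample one index from the k highest-scoring entries of logits."""
    logits = np.asarray(logits, dtype=np.float64)
    top_indices = np.argsort(logits)[-k:]  # indices of the k largest logits
    top_logits = logits[top_indices]
    probs = np.exp(top_logits - top_logits.max())  # stable softmax over the top k
    probs /= probs.sum()
    return int(rng.choice(top_indices, p=probs))


rng = np.random.default_rng(0)
toy_logits = [0.1, 2.0, -1.0, 3.5, 0.3, 1.2]  # hypothetical scores for a 6-word vocabulary

samples = [sample_top_k(toy_logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices 1, 3 and 5 can ever be drawn
```

Sampling instead of always taking the single best word is what keeps the generated text from repeating the same phrase every time; a smaller `k` makes the output safer, a larger `k` makes it more varied.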

<b>Note:</b> Trying to train this on my laptop on CPU took forever; I didn't even get to the point where the first epoch gave me a time estimate. On GPU it is much, much faster.

In [10]:
# Tokenize starting prompt
word_to_index = {}
for index, word in enumerate(vocab):
    word_to_index[word] = index

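# The vocabulary is lowercased, so lowercase each prompt before looking words up.
# Words that are not in the vocabulary map to index 1, the [UNK] token.
start_prompt1 = "this movie is"
start_tokens1 = [word_to_index.get(_, 1) for _ in start_prompt1.lower().split()]
start_prompt2 = "Skiing fast makes me"
start_tokens2 = [word_to_index.get(_, 1) for _ in start_prompt2.lower().split()]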
num_tokens_generated = 40

log_dir = "logs/text"
text_gen_callback1 = TextGenerator(num_tokens_generated, start_tokens1, vocab, log_dir=log_dir)
text_gen_callback2 = TextGenerator(num_tokens_generated, start_tokens2, vocab, log_dir=log_dir)

In [None]:
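# Launch TensorBoard, pointing it at the directory stored in log_dir above.
# The magic takes a literal path, so the variable name itself cannot be passed.
%load_ext tensorboard
%tensorboard --logdir logs/text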


In [11]:
model = create_model()
model.fit(text_ds, verbose=0, epochs=EPOCHS, callbacks=[text_gen_callback1, text_gen_callback2])

KeyboardInterrupt: 

In [None]:
def indToSentence(ind, index_to_word):
    """Convert a list of token indices back into a space-separated string."""
    return " ".join(index_to_word[n_] for n_ in ind)


def sentenceToInd(sentence, word_to_index):
    """Convert a prompt into token indices, skipping words not in the vocabulary."""
    indices = []
    # The vocabulary is lowercased, so lowercase the prompt before the lookup
    for word in sentence.lower().split():
        index = word_to_index.get(word)
        if index is not None:
            indices.append(index)
    return indices


def sample_from(logits, k=10):
    """Sample one token index from the k highest-scoring logits."""
    logits, indices = tf.math.top_k(logits, k=k, sorted=True)
    indices = np.asarray(indices).astype("int32")
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    return np.random.choice(indices, p=preds)


def generateText(model, index_to_word, word_to_index, startPrompt, length=40):
    """Generate `length` tokens after startPrompt, one sampled token at a time."""
    tokens = sentenceToInd(startPrompt, word_to_index)
    if not tokens:
        tokens = [1]  # no prompt word was in the vocabulary; start from [UNK]
    for _step in range(length):
        pad_len = maxlen - len(tokens)
        if pad_len < 0:
            # Too long for the model: keep only the most recent maxlen tokens
            x = tokens[-maxlen:]
            sample_index = maxlen - 1
        else:
            # Pad on the right; the prediction is read at the last real token
            x = tokens + [0] * pad_len
            sample_index = len(tokens) - 1
        y, _ = model.predict(np.array([x]), verbose=0)
        # Use np.argmax(y[0][sample_index]) here instead for greedy decoding
        sample_token = sample_from(y[0][sample_index])
        tokens.append(int(sample_token))
    # tokens already holds the prompt followed by every generated token
    return indToSentence(tokens, index_to_word)

In [None]:
t1 = generateText(model, vocab, word_to_index, "this movie is")
t2 = generateText(model, vocab, word_to_index, "Skiing fast makes me")
t3 = generateText(model, vocab, word_to_index, "We are going to make this country great")
t4 = generateText(model, vocab, word_to_index, "Where my dogs at")

In [None]:
print(t1)
print(t2)
print(t3)
print(t4)