### Generative deep learning
#### Text generation
In this section, we’ll explore how recurrent neural networks can be used to generate sequence data. We’ll use text generation as an example, but the exact same techniques can be generalized to any kind of sequence data: you could apply them to sequences of musical notes in order to generate new music, to timeseries of brushstroke data (perhaps recorded while an artist paints on an iPad) to generate paintings stroke by stroke, and so on. <br>
Sequence data generation is in no way limited to artistic content generation. It has been successfully applied to speech synthesis and to dialogue generation for chatbots. The Smart Reply feature that Google released in 2016, capable of automatically generating a selection of quick replies to emails or text messages, is powered by similar techniques.

##### How do you generate sequence data?
The universal way to generate sequence data in deep learning is to train a model (usually a Transformer or an RNN) to predict the next token or next few tokens in a sequence, using the previous tokens as input. For instance, given the input “the cat is on the,” the model is trained to predict the target “mat,” the next word. As usual when working with text data, tokens are typically words or characters, and any network that can model the probability of the next token given the previous ones is called a language model. A language model captures the latent space of language: its statistical structure. <br>
Once you have such a trained language model, you can sample from it (generate new sequences): you feed it an initial string of text (called conditioning data), ask it to generate the next character or the next word (you can even generate several tokens at once), add the generated output back to the input data, and repeat the process many times (see figure 12.1). This loop allows you to generate sequences of arbitrary length that reflect the structure of the data on which the model was trained: sequences that look almost like human-written sentences.

![](./images/12.1.png)
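The generation loop described above can be sketched with a stand-in model. This is only an illustration, not the model built later in this section: `TOY_MODEL` and `toy_next_word_probs` are hypothetical, a hand-written lookup table playing the role of a trained language model, and the words and probabilities in it are made up.

```python
import random

# Hypothetical stand-in for a trained language model: maps the last word
# to a probability distribution over possible next words.
TOY_MODEL = {
    "the": {"cat": 0.5, "mat": 0.5},
    "cat": {"is": 1.0},
    "is": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"is": 1.0},
}

def toy_next_word_probs(words):
    """Return next-word probabilities given the words generated so far."""
    return TOY_MODEL[words[-1]]

def generate(prompt, num_words, rng):
    words = prompt.split()  # the conditioning data
    for _ in range(num_words):
        probs = toy_next_word_probs(words)
        candidates = list(probs)
        weights = [probs[w] for w in candidates]
        next_word = rng.choices(candidates, weights=weights, k=1)[0]
        words.append(next_word)  # feed the output back in as input
    return " ".join(words)

generated = generate("the cat", num_words=6, rng=random.Random(0))
print(generated)
```

A real language model would condition on the whole sequence rather than on the last word only, but the loop itself (predict, sample, append, repeat) has the same shape.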

##### The importance of the sampling strategy
When generating text, the way you choose the next token is crucially important. A naive approach is **greedy sampling**, consisting of always choosing the most likely next character. But such an approach results in repetitive, predictable strings that don’t look like coherent language. A more interesting approach makes slightly more surprising choices: it introduces randomness in the sampling process by sampling from the probability distribution for the next character. This is called **stochastic sampling** (recall that **stochasticity** is what we call **randomness** in this field). In such a setup, if a word has probability 0.3 of being next in the sentence according to the model, you’ll choose it 30% of the time. Note that greedy sampling can also be cast as sampling from a probability distribution: one where a certain word has probability 1 and all others have probability 0. <br>
Sampling probabilistically from the **softmax** output of the model is neat: it allows even unlikely words to be sampled some of the time, generating more interesting looking sentences and sometimes showing creativity by coming up with new, realistic sounding sentences that didn’t occur in the training data. But there’s one issue with this strategy: it doesn’t offer a way to control the amount of randomness in the sampling process. <br>
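To make the contrast between the two strategies concrete, here is a small sketch in plain Python. The vocabulary and the probabilities are invented for the example; in practice they would come from the model’s softmax output.

```python
import random
from collections import Counter

# Hypothetical next-word distribution, as a model's softmax might output it.
vocab = ["great", "good", "bad", "terrible"]
probs = [0.5, 0.3, 0.15, 0.05]

def greedy_sample(probs):
    """Always pick the index of the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def stochastic_sample(probs, rng):
    """Pick an index at random, in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(42)
greedy_picks = Counter(vocab[greedy_sample(probs)] for _ in range(10_000))
stochastic_picks = Counter(
    vocab[stochastic_sample(probs, rng)] for _ in range(10_000))

print(greedy_picks)      # only "great" ever appears
print(stochastic_picks)  # frequencies roughly follow probs
```

Over many draws, stochastic sampling picks each word about as often as its probability suggests, so even the unlikely "terrible" shows up occasionally, whereas greedy sampling never leaves the top choice.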

Why would you want more or less randomness? Consider an extreme case: pure random sampling, where you draw the next word from a uniform probability distribution, and every word is equally likely. This scheme has maximum randomness; in other words, this probability distribution has maximum entropy. Naturally, it won’t produce anything interesting. At the other extreme, greedy sampling doesn’t produce anything interesting, either, and has no randomness: the corresponding probability distribution has minimum entropy. Sampling from the “real” probability distribution— the distribution that is output by the model’s softmax function—constitutes an intermediate point between these two extremes. But there are many other intermediate points of higher or lower entropy that you may want to explore. Less entropy will give the generated sequences a more predictable structure (and thus they will potentially be more realistic looking), whereas more entropy will result in more surprising and creative sequences. When sampling from generative models, it’s always good to explore different amounts of randomness in the generation process. Because we— humans—are the ultimate judges of how interesting the generated data is, interestingness is highly subjective, and there’s no telling in advance where the point of optimal entropy lies. <br>
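The two extremes and the intermediate case can be compared by computing their Shannon entropy. The sketch below uses a made-up four-word distribution as the “model” output; the numbers are illustrative only.

```python
import math

def entropy(distribution):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return sum(-p * math.log2(p) for p in distribution if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # pure random sampling
one_hot = [1.0, 0.0, 0.0, 0.0]       # greedy sampling
model_like = [0.5, 0.3, 0.15, 0.05]  # hypothetical softmax output

h_uniform = entropy(uniform)
h_one_hot = entropy(one_hot)
h_model = entropy(model_like)
print(h_uniform, h_one_hot, h_model)
```

The uniform distribution reaches the maximum for four outcomes (2 bits), the one-hot distribution has zero entropy, and the model-like distribution falls in between.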

In order to control the amount of stochasticity in the sampling process, we’ll introduce a parameter called the **softmax temperature**, which characterizes the entropy of the probability distribution used for sampling: it characterizes how surprising or predictable the choice of the next word will be. Given a **temperature** value, a new probability distribution is computed from the original one (the softmax output of the model) by reweighting it in the following way.

##### Reweighting a probability distribution to a different temperature

In [2]:
import numpy as np
# original_distribution is a 1D NumPy array of probability values that must sum to 1. 
# temperature is a factor quantifying the entropy of the output distribution.
def reweight_distribution(original_distribution, temperature=0.5):
    distribution = np.log(original_distribution) / temperature
    distribution = np.exp(distribution)
    # Returns a reweighted version of the original distribution. 
    # The sum of the distribution may no longer be 1, so you divide it by its sum to obtain the new distribution.
    return distribution / np.sum(distribution)

Higher temperatures result in sampling distributions of higher entropy, which produce more surprising and unstructured data, whereas lower temperatures result in less randomness and much more predictable data (see figure 12.2).

![](./images/12.2.png)
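As a worked example of the reweighting formula, the sketch below restates it in dependency-free Python and applies it to an invented three-word distribution at three temperatures. Note that `exp(log(p) / T)` is the same as `p ** (1 / T)`, so a temperature below 1 sharpens the distribution and a temperature above 1 flattens it.

```python
import math

def reweight(distribution, temperature):
    """Pure-Python equivalent of exp(log(p) / T), renormalized to sum to 1."""
    scaled = [math.exp(math.log(p) / temperature) for p in distribution]
    total = sum(scaled)
    return [s / total for s in scaled]

original = [0.5, 0.3, 0.2]  # hypothetical softmax output
results = {t: reweight(original, t) for t in (0.5, 1.0, 2.0)}
for t, dist in results.items():
    print(t, [round(p, 3) for p in dist])
```

With these numbers, the top probability grows from 0.5 to roughly 0.66 at temperature 0.5 and shrinks to roughly 0.42 at temperature 2, while temperature 1 leaves the distribution unchanged.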

##### Implementing text generation with Keras

Let’s put these ideas into practice in a Keras implementation. The first thing you need is a lot of text data that you can use to learn a language model. You can use any sufficiently large text file or set of text files—Wikipedia, The Lord of the Rings, and so on. <br>
In this example, we’ll keep working with the IMDB movie review dataset from the last chapter, and we’ll learn to generate never-read-before movie reviews. As such, our language model will be a model of the style and topics of these movie reviews specifically, rather than a general model of the English language.

##### PREPARING THE DATA
Just like in the previous chapter, let’s download and uncompress the IMDB movie reviews dataset.

##### Downloading and uncompressing the IMDB movie reviews dataset

In [3]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

You’re already familiar with the structure of the data: we get a folder named aclImdb containing a train and a test subfolder, each of which holds one subfolder of negative-sentiment movie reviews and one of positive-sentiment reviews (the train folder also holds a set of unlabeled reviews). There’s one text file per review. We’ll call **text_dataset_from_directory** with **label_mode=None** to create a dataset that reads from these files and yields the text content of each file.

##### Creating a dataset from text files (one file = one sample)

In [4]:
import tensorflow as tf
from tensorflow import keras

dataset = keras.utils.text_dataset_from_directory(
    directory="aclImdb", label_mode=None, batch_size=256)
# Strip the <br /> HTML tag that occurs in many of the reviews. 
# This did not matter much for text classification, but we wouldn’t want to generate <br /> tags in this example!
dataset = dataset.map(lambda x: tf.strings.regex_replace(x, "<br />", " "))

Now let’s use a **TextVectorization** layer to compute the vocabulary we’ll be working with. We’ll only use the first **sequence_length** words of each review: our **TextVectorization** layer will cut off anything beyond that when vectorizing a text.

##### Preparing a TextVectorization layer

In [5]:
from tensorflow.keras.layers import TextVectorization

sequence_length = 100
# We’ll only consider the top 15,000 most common words—anything else will be treated as the out-of-vocabulary token, "[UNK]".
vocab_size = 15000
text_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int", # We want to return integer word index sequences.
    output_sequence_length=sequence_length, # We’ll work with inputs and targets of length 100 (but since we’ll offset the targets by 1, the model will actually see sequences of length 99).
)
text_vectorization.adapt(dataset)
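To build intuition for what the layer does with `max_tokens` and `output_sequence_length`, here is a rough pure-Python approximation. It is not the Keras implementation: the helper names are hypothetical, and details such as tie-breaking between equally frequent words may differ. It does follow the layer’s default conventions of lowercasing, stripping punctuation, reserving index 0 for padding, and reserving index 1 for "[UNK]".

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent words; index 0 is padding, index 1 is [UNK]."""
    counts = Counter(word for text in texts for word in standardize(text))
    kept = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + kept

def vectorize(text, vocabulary, sequence_length):
    """Map words to integer indices, then truncate or pad to a fixed length."""
    index = {word: i for i, word in enumerate(vocabulary)}
    ids = [index.get(word, 1) for word in standardize(text)]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

texts = ["This movie was great!", "This movie was bad.", "This movie!", "This?"]
vocabulary = build_vocabulary(texts, max_tokens=5)
truncated = vectorize("This movie was truly great", vocabulary, sequence_length=4)
padded = vectorize("Great movie", vocabulary, sequence_length=4)
print(vocabulary, truncated, padded)
```

In this toy run, "truly" and "great" fall outside the five-entry vocabulary and map to index 1, the longer text is cut to four tokens, and the shorter one is padded with zeros.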

Let’s use the layer to create a language modeling dataset where input samples are vectorized texts, and corresponding targets are the same texts offset by one word.

##### Setting up a language modeling dataset

In [6]:
def prepare_lm_dataset(text_batch):
    vectorized_sequences = text_vectorization(text_batch) # Convert a batch of texts (strings) to a batch of integer sequences.
    x = vectorized_sequences[:, :-1] # Create inputs by cutting off the last word of the sequences.
    y = vectorized_sequences[:, 1:] # Create targets by offsetting the sequences by 1.
    return x, y

lm_dataset = dataset.map(prepare_lm_dataset, num_parallel_calls=4)
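The effect of the two slices is easier to see on a single short sequence. The sketch below uses plain Python lists and made-up token indices in place of a real vectorized batch.

```python
# A made-up vectorized review of length 6 (index 0 would be padding).
sequence = [12, 7, 431, 9, 58, 3]

x = sequence[:-1]  # inputs: everything except the last token
y = sequence[1:]   # targets: the same tokens shifted left by one

# At each position t, the model sees x[: t + 1] and must predict y[t].
pairs = [(x[: t + 1], y[t]) for t in range(len(x))]
for context, target in pairs:
    print(context, "->", target)
```

Each target is simply the token that follows the corresponding input position, which is why a sequence of length 100 yields inputs and targets of length 99.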

##### A TRANSFORMER-BASED SEQUENCE-TO-SEQUENCE MODEL

We’ll train a model to predict a probability distribution over the next word in a sentence, given a number of initial words. When the model is trained, we’ll feed it with a prompt, sample the next word, add that word back to the prompt, and repeat, until we’ve generated a short paragraph. <br>
Like we did for temperature forecasting in chapter 10, we could train a model that takes as input a sequence of N words and simply predicts word N+1. However, there are several issues with this setup in the context of sequence generation. <br>
First, the model would only learn to produce predictions when N words were available, but it would be useful to be able to start predicting with fewer than N words. Otherwise we’d be constrained to only use relatively long prompts (in our implementation, N=100 words). We didn’t have this need in chapter 10. <br>
Second, many of our training sequences will be mostly overlapping. Consider N = 4. The text “A complete sentence must have, at minimum, three things: a subject, verb, and an object” would be used to generate the following training sequences:
- “A complete sentence must”
- “complete sentence must have”
- “sentence must have at”
- and so on, until “verb and an object”

A model that treats each such sequence as an independent sample would have to do a lot of redundant work, repeatedly re-encoding subsequences that it has largely seen before. In chapter 10, this wasn’t much of a problem, because we didn’t have that many training samples in the first place, and we needed to benchmark dense and convolutional models, for which redoing the work every time is the only option. We could try to alleviate this redundancy problem by using strides to sample our sequences, skipping a few words between two consecutive samples. But that would reduce our number of training samples while only providing a partial solution. <br>
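The overlap between windows, and the trade-off introduced by striding, can be checked on the example sentence. The helper `sliding_windows` is a hypothetical name used for this sketch only.

```python
import string

def sliding_windows(words, window_size, stride=1):
    """Return every window of `window_size` words, `stride` words apart."""
    last_start = len(words) - window_size
    return [words[i:i + window_size] for i in range(0, last_start + 1, stride)]

text = ("A complete sentence must have, at minimum, three things: "
        "a subject, verb, and an object")
words = text.translate(str.maketrans("", "", string.punctuation)).split()

dense = sliding_windows(words, window_size=4)              # stride 1
strided = sliding_windows(words, window_size=4, stride=3)  # skip ahead by 3
print(len(dense), len(strided))
```

With a stride of 1, the 15-word sentence yields 12 windows, each sharing three words with its neighbor. With a stride of 3, only 4 windows remain, and consecutive windows still share one word, so the redundancy is reduced rather than removed.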
To address these two issues, we’ll use a **sequence-to-sequence model**: we’ll feed sequences of N words (indexed from 0 to N - 1) into our model, and we’ll predict the sequence offset by one (from 1 to N). We’ll use **causal masking** to make sure that, for any i, the model will only be using words from 0 to i in order to predict the word i + 1. This means that we’re simultaneously training the model to solve N mostly overlapping but different problems: predicting the next word given a sequence of 1 <= i <= N prior words (see figure 12.3). At generation time, even if you only prompt the model with a single word, it will be able to give you a probability distribution for the next possible words.

![](./images/12.3.png)
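A causal mask is just a lower-triangular matrix of ones. The sketch below builds one with plain Python lists for a sequence of length 4. It follows the same convention as a mask built with `i >= j`, but it is only an illustration of the idea, not the layer code.

```python
def causal_mask(sequence_length):
    """mask[i][j] == 1 when position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(sequence_length)]
            for i in range(sequence_length)]

mask = causal_mask(4)
for row in mask:
    print(row)

# Row i contains i + 1 ones: position i only sees tokens 0..i.
visible_counts = [sum(row) for row in mask]
```

Each row corresponds to one of the simultaneous prediction problems: the first row predicts from a single word, the last row from all four.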

Note that we could have used a similar sequence-to-sequence setup on our temperature forecasting problem in chapter 10: given a sequence of 120 hourly data points, learn to generate a sequence of 120 temperatures offset by 24 hours in the future. You’d be not only solving the initial problem, but also solving the 119 related problems of forecasting temperature in 24 hours, given 1 <= i < 120 prior hourly data points. If you try to retrain the RNNs from chapter 10 in a sequence-to-sequence setup, you’ll find that you get similar but incrementally worse results, because the constraint of solving these additional 119 related problems with the same model interferes slightly with the task we actually do care about. <br>
In the previous chapter, you learned about the setup you can use for sequence-to-sequence learning in the general case: feed the source sequence into an encoder, and then feed both the encoded sequence and the target sequence into a decoder that tries to predict the same target sequence offset by one step. When you’re doing text generation, there is no source sequence: you’re just trying to predict the next tokens in the target sequence given past tokens, which we can do using only the decoder. And thanks to causal masking, the decoder will only look at words 0…N to predict the word N+1.

Let’s implement our model—we’re going to reuse the building blocks we created in chapter 11: **PositionalEmbedding** and **TransformerDecoder**.

In [7]:
import tensorflow as tf
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_config(self):
        config = super(TransformerDecoder, self).get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            # No padding mask was passed in: avoid an undefined variable below.
            padding_mask = mask
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)

##### A simple Transformer-based language model

In [8]:
from tensorflow.keras import layers
embed_dim = 256
latent_dim = 2048
num_heads = 2

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, x)
outputs = layers.Dense(vocab_size, activation="softmax")(x) # Softmax over possible vocabulary words, computed for each output sequence timestep.
model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")
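Because the model outputs one softmax distribution per timestep, the loss compares each of those distributions with the integer index of the true next word. The sketch below shows the underlying arithmetic for a single sequence in plain Python, ignoring batching and any special treatment of padded positions. The probabilities and targets are invented, and the function is a simplified stand-in, not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy(targets, predictions):
    """Mean of -log(p[target]) over all timesteps of one sequence."""
    losses = [-math.log(probs[t]) for t, probs in zip(targets, predictions)]
    return sum(losses) / len(losses)

# Hypothetical softmax outputs for a 3-step sequence over a 4-word vocabulary.
predictions = [
    [0.1, 0.7, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.05, 0.1, 0.8],
]
targets = [1, 2, 3]  # integer index of the correct next word at each step

loss = sparse_categorical_crossentropy(targets, predictions)
print(loss)
```

The loss is low when the model assigns high probability to the correct next word at every position, and it reaches zero only when every target gets probability 1.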

##### A TEXT-GENERATION CALLBACK WITH VARIABLE-TEMPERATURE SAMPLING
We’ll use a callback to generate text using a range of different temperatures after every epoch. This allows you to see how the generated text evolves as the model begins to converge, as well as the impact of temperature in the sampling strategy. To seed text generation, we’ll use the prompt “this movie”: all of our generated texts will start with this.

##### The text-generation callback

In [9]:
import numpy as np

tokens_index = dict(enumerate(text_vectorization.get_vocabulary())) # Dict that maps word indices back to strings, to be used for text decoding

def sample_next(predictions, temperature=1.0): # Implements variable temperature sampling from a probability distribution
    predictions = np.asarray(predictions).astype("float64")
    predictions = np.log(predictions) / temperature
    exp_preds = np.exp(predictions)
    predictions = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, predictions, 1)
    return np.argmax(probas)

class TextGenerator(keras.callbacks.Callback):
    def __init__(self,
                 prompt, # Prompt that we use to seed text generation
                 generate_length, # Number of words to generate
                 model_input_length,
                 temperatures=(1.,), # Range of temperatures to use for sampling
                 print_freq=1):
        self.prompt = prompt
        self.generate_length = generate_length
        self.model_input_length = model_input_length
        self.temperatures = temperatures
        self.print_freq = print_freq

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.print_freq != 0:
            return
        for temperature in self.temperatures:
            print("== Generating with temperature", temperature)
            sentence = self.prompt # When generating text, we start from our prompt
            for i in range(self.generate_length):
                # Feed the current sequence into our model.
                tokenized_sentence = text_vectorization([sentence])
                predictions = self.model(tokenized_sentence)
                # Find the last non-padding position (padding tokens have index 0).
                last_index = np.count_nonzero(tokenized_sentence[0].numpy()) - 1
                # Retrieve the predictions for that timestep, and use them to sample a new word at the current temperature.
                next_token = sample_next(predictions[0, last_index, :], temperature)
                sampled_token = tokens_index[next_token]
                # Append the new word to the current sequence and repeat.
                sentence += " " + sampled_token
            print(sentence)

prompt = "This movie"
text_gen_callback = TextGenerator(
    prompt,
    generate_length=50,
    model_input_length=sequence_length,
    temperatures=(0.2, 0.5, 0.7, 1., 1.5)) # We’ll use a diverse range of temperatures to sample text, to demonstrate the effect of temperature on text generation.

Let’s fit() this thing.

##### Fitting the language model

In [11]:
model.fit(lm_dataset, epochs=20, callbacks=[text_gen_callback])

#### Wrapping up
- You can generate discrete sequence data by training a model to predict the next token(s), given previous tokens.
- In the case of text, such a model is called a **language model**. It can be based on either words or characters.
- Sampling the next token requires a balance between adhering to what the model judges likely, and introducing randomness.
- One way to handle this is the notion of **softmax temperature**. Always experiment with different temperatures to find the right one.