# Generative networks

Recurrent Neural Networks (RNNs) and their gated cell variants, such as Long Short Term Memory Cells (LSTMs) and Gated Recurrent Units (GRUs), introduced a way to model language. In other words, they can learn the order of words and predict the next word in a sequence. This capability allows us to use RNNs for **generative tasks**, such as regular text generation, machine translation, and even image captioning.

In the RNN architecture we discussed in the previous unit, each RNN unit produced the next hidden state as its output. However, we can also add another output to each recurrent unit, enabling us to generate a **sequence** (of the same length as the original sequence). Additionally, we can use RNN units that do not take an input at every step but instead start with an initial state vector and then generate a sequence of outputs.

In this notebook, we will focus on simple generative models that help us create text. To keep things straightforward, we will build a **character-level network**, which generates text one letter at a time. During training, we need to take a text corpus and split it into sequences of letters.


In [1]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

ds_train, ds_test = tfds.load('ag_news_subset').values()

## Building character vocabulary

To create a character-level generative network, we need to split the text into individual characters instead of words. The `TextVectorization` layer we used earlier cannot handle this, so we have two options:

* Manually load the text and perform tokenization "by hand," as shown in [this official Keras example](https://keras.io/examples/generative/lstm_character_level_text_generation/)
* Use the `Tokenizer` class for character-level tokenization.

We will choose the second option. The `Tokenizer` class can also be used for word-level tokenization, making it relatively easy to switch between character-level and word-level tokenization.

To perform character-level tokenization, we need to pass the parameter `char_level=True`:


In [2]:
def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

tokenizer = keras.preprocessing.text.Tokenizer(char_level=True,lower=False)
tokenizer.fit_on_texts([x['title'].numpy().decode('utf-8') for x in ds_train])

We also want to use one special token to denote **end of sequence**, which we will call `<eos>`. Let's add it manually to the vocabulary:


In [3]:
eos_token = len(tokenizer.word_index)+1
tokenizer.word_index['<eos>'] = eos_token

vocab_size = eos_token + 1

Now, to encode text into sequences of numbers, we can use:


In [4]:
tokenizer.texts_to_sequences(['Hello, world!'])

[[48, 2, 10, 10, 5, 44, 1, 25, 5, 8, 10, 13, 78]]

## Training a generative RNN to generate titles

The method we will use to train an RNN to generate news titles is as follows. At each step, we will take one title, feed it into the RNN, and for each input character, we will ask the network to generate the next output character:

![Image showing an example RNN generation of the word 'HELLO'.](../../../../../translated_images/rnn-generate.56c54afb52f9781d63a7c16ea9c1b86cb70e6e1eae6a742b56b7b37468576b17.en.png)

For the last character in our sequence, we will ask the network to generate the `<eos>` token.

The key difference in the generative RNN we are using here is that we will take the output from each step of the RNN, not just from the final cell. This can be achieved by setting the `return_sequences` parameter in the RNN cell.

Therefore, during training, the input to the network will be a sequence of encoded characters of a certain length, and the output will be a sequence of the same length, but shifted by one element and ending with `<eos>`. A minibatch will consist of several such sequences, and we will need to use **padding** to align all sequences.

Let's create functions to transform the dataset for us. Since we want to pad sequences at the minibatch level, we will first batch the dataset by calling `.batch()`, and then use `map` to apply the transformation. This means the transformation function will take an entire minibatch as its parameter:


In [5]:
def title_batch(x):
    x = [t.numpy().decode('utf-8') for t in x]
    z = tokenizer.texts_to_sequences(x)
    z = tf.keras.preprocessing.sequence.pad_sequences(z)
    return tf.one_hot(z,vocab_size), tf.one_hot(tf.concat([z[:,1:],tf.constant(eos_token,shape=(len(z),1))],axis=1),vocab_size)

A few important things that we do here:
* We first extract the actual text from the string tensor
* `text_to_sequences` converts the list of strings into a list of integer tensors
* `pad_sequences` pads those tensors to their maximum length
* We finally one-hot encode all the characters, and also do the shifting and `<eos>` appending. We will soon see why we need one-hot-encoded characters

However, this function is **Pythonic**, meaning it cannot be automatically translated into Tensorflow's computational graph. If we try to use this function directly in the `Dataset.map` function, we will encounter errors. To resolve this, we need to wrap this Pythonic call using the `py_function` wrapper:


In [6]:
def title_batch_fn(x):
    x = x['title']
    a,b = tf.py_function(title_batch,inp=[x],Tout=(tf.float32,tf.float32))
    return a,b

> **Note**: Differentiating between Pythonic and TensorFlow transformation functions might seem overly complicated, and you may wonder why we don't transform the dataset using standard Python functions before passing it to `fit`. While this is certainly possible, using `Dataset.map` offers a significant advantage: the data transformation pipeline is executed within TensorFlow's computational graph, leveraging GPU computations and reducing the need to transfer data between the CPU and GPU.

Now we can construct our generator network and begin training. It can be based on any recurrent cell we discussed in the previous unit (simple, LSTM, or GRU). In this example, we will use LSTM.

Since the network takes characters as input and the vocabulary size is relatively small, we don't need an embedding layer—one-hot-encoded input can be fed directly into the LSTM cell. The output layer will be a `Dense` classifier that converts the LSTM output into one-hot-encoded token indices.

Additionally, because we are working with variable-length sequences, we can use a `Masking` layer to create a mask that ignores the padded portion of the string. This isn't strictly necessary, as we aren't particularly concerned with anything beyond the `<eos>` token, but we'll use it to gain some experience with this type of layer. The `input_shape` will be `(None, vocab_size)`, where `None` represents sequences of variable length, and the output shape will also be `(None, vocab_size)`, as shown in the `summary`:


In [7]:
model = keras.models.Sequential([
    keras.layers.Masking(input_shape=(None,vocab_size)),
    keras.layers.LSTM(128,return_sequences=True),
    keras.layers.Dense(vocab_size,activation='softmax')
])

model.summary()
model.compile(loss='categorical_crossentropy')

model.fit(ds_train.batch(8).map(title_batch_fn))

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking (Masking)            (None, None, 84)          0         
_________________________________________________________________
lstm (LSTM)                  (None, None, 128)         109056    
_________________________________________________________________
dense (Dense)                (None, None, 84)          10836     
Total params: 119,892
Trainable params: 119,892
Non-trainable params: 0
_________________________________________________________________


<tensorflow.python.keras.callbacks.History at 0x7fa40c1245e0>

## Generating output

Now that we have trained the model, we want to use it to generate some output. First, we need a way to decode text represented by a sequence of token numbers. To achieve this, we could use the `tokenizer.sequences_to_texts` function; however, it does not perform well with character-level tokenization. Therefore, we will take the dictionary of tokens from the tokenizer (called `word_index`), create a reverse map, and write our own decoding function:


In [10]:
reverse_map = {val:key for key, val in tokenizer.word_index.items()}

def decode(x):
    return ''.join([reverse_map[t] for t in x])

Now, let's begin the generation process. We start with a string `start`, encode it into a sequence `inp`, and then at each step, we use our network to predict the next character.

The network's output `out` is a vector with `vocab_size` elements, each representing the probability of a specific token. Using `argmax`, we can determine the most likely token number. This character is then added to the list of generated tokens, and the generation process continues. This character-by-character generation is repeated `size` times to produce the desired number of characters, but the process stops early if the `eos_token` is encountered.


In [12]:
def generate(model,size=100,start='Today '):
        inp = tokenizer.texts_to_sequences([start])[0]
        chars = inp
        for i in range(size):
            out = model(tf.expand_dims(tf.one_hot(inp,vocab_size),0))[0][-1]
            nc = tf.argmax(out)
            if nc==eos_token:
                break
            chars.append(nc.numpy())
            inp = inp+[nc]
        return decode(chars)
    
generate(model)

'Today #39;s lead to strike for the strike for the strike for the strike (AFP)'

## Sampling output during training

Since we don't have any useful metrics like *accuracy*, the only way to observe whether our model is improving is by **sampling** the generated text during training. To achieve this, we will use **callbacks**, which are functions that can be passed to the `fit` function and are called periodically during training.


In [13]:
sampling_callback = keras.callbacks.LambdaCallback(
  on_epoch_end = lambda batch, logs: print(generate(model))
)

model.fit(ds_train.batch(8).map(title_batch_fn),callbacks=[sampling_callback],epochs=3)

Epoch 1/3
Today #39;s a lead in the company for the strike
Epoch 2/3
Today #39;s the Market Service on Security Start (AP)
Epoch 3/3
Today #39;s a line on the strike to start for the start


<tensorflow.python.keras.callbacks.History at 0x7fa40c74e3d0>

This example already generates some pretty good text, but it can be further improved in several ways:
* **More text**. We have only used titles for our task, but you may want to experiment with full text. Remember that RNNs are not very good at handling long sequences, so it makes sense either to split them into shorter sentences or to always train on a fixed sequence length of some predefined value `num_chars` (for example, 256). You could try modifying the example above into such an architecture, using [official Keras tutorial](https://keras.io/examples/generative/lstm_character_level_text_generation/) as inspiration.
* **Multilayer LSTM**. It might be worth trying 2 or 3 layers of LSTM cells. As mentioned in the previous unit, each layer of LSTM extracts certain patterns from text, and in the case of a character-level generator, we can expect the lower LSTM level to focus on extracting syllables, while higher levels handle words and word combinations. This can be easily implemented by passing a number-of-layers parameter to the LSTM constructor.
* You may also want to experiment with **GRU units** to see which ones perform better, as well as with **different hidden layer sizes**. A hidden layer that is too large may lead to overfitting (e.g., the network will memorize the exact text), while a smaller size might not produce good results.


## Soft text generation and temperature

In the previous definition of `generate`, we always selected the character with the highest probability as the next character in the generated text. This often caused the text to "loop" through the same character sequences repeatedly, as shown in this example:
```
today of the second the company and a second the company ...
```

However, if we examine the probability distribution for the next character, we might find that the difference between the top probabilities is not significant. For instance, one character might have a probability of 0.2, while another has 0.19, and so on. For example, when determining the next character in the sequence '*play*', the next character could just as likely be a space or **e** (as in the word *player*).

This brings us to the conclusion that it is not always "fair" to select the character with the highest probability, as choosing the second-highest might still result in meaningful text. A better approach is to **sample** characters from the probability distribution provided by the network's output.

This sampling can be performed using the `np.multinomial` function, which implements the **multinomial distribution**. Below is a function that demonstrates this **soft** text generation:


In [33]:
def generate_soft(model,size=100,start='Today ',temperature=1.0):
        inp = tokenizer.texts_to_sequences([start])[0]
        chars = inp
        for i in range(size):
            out = model(tf.expand_dims(tf.one_hot(inp,vocab_size),0))[0][-1]
            probs = tf.exp(tf.math.log(out)/temperature).numpy().astype(np.float64)
            probs = probs/np.sum(probs)
            nc = np.argmax(np.random.multinomial(1,probs,1))
            if nc==eos_token:
                break
            chars.append(nc)
            inp = inp+[nc]
        return decode(chars)

words = ['Today ','On Sunday ','Moscow, ','President ','Little red riding hood ']
    
for i in [0.3,0.8,1.0,1.3,1.8]:
    print(f"\n--- Temperature = {i}")
    for j in range(5):
        print(generate_soft(model,size=300,start=words[j],temperature=i))


--- Temperature = 0.3
Today #39;s strike #39; to start at the store return
On Sunday PO to Be Data Profit Up (Reuters)
Moscow, SP wins straight to the Microsoft #39;s control of the space start
President olding of the blast start for the strike to pay &lt;b&gt;...&lt;/b&gt;
Little red riding hood ficed to the spam countered in European &lt;b&gt;...&lt;/b&gt;

--- Temperature = 0.8
Today countie strikes ryder missile faces food market blut
On Sunday collores lose-toppy of sale of Bullment in &lt;b&gt;...&lt;/b&gt;
Moscow, IBM Diffeiting in Afghan Software Hotels (Reuters)
President Ol Luster for Profit Peaced Raised (AP)
Little red riding hood dace on depart talks #39; bank up

--- Temperature = 1.0
Today wits House buiting debate fixes #39; supervice stake again
On Sunday arling digital poaching In for level
Moscow, DS Up 7, Top Proble Protest Caprey Mamarian Strike
President teps help of roubler stepted lessabul-Dhalitics (AFP)
Little red riding hood signs on cash in Carter-youb

---

KeyError: 0

We have introduced one more parameter called **temperature**, which is used to indicate how hard we should stick to the highest probability. If temperature is 1.0, we do fair multinomial sampling, and when temperature goes to infinity - all probabilities become equal, and we randomly select next character. In the example below we can observe that the text becomes meaningless when we increase the temperature too much, and it resembles "cycled" hard-generated text when it becomes closer to 0.



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
