This is the conspect of Chapter 16 in "Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow

# Generating Shakesperean Text Using a Character RNN

A look into how to build a Char-RNN

### Creating the Training Dataset

In [177]:
import keras

In [178]:
shakespeare_url = "https://homl.info/shakespeare" # shortcut URL, all Shakespeare's works
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

We encode every character as an integer. The tokenizer will find all the characters used in the text and map each of them to a
different character ID, from 1 to the number of distinct characters.

In [179]:
import tensorflow as tf

In [180]:
from tensorflow.keras.preprocessing.text import Tokenizer


In [181]:
# converts the text to lowercase by default 
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])

In [182]:
tokenizer.texts_to_sequences(["First"])
[[20, 6, 9, 8, 3]]


[[20, 6, 9, 8, 3]]

In [183]:
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])

['f i r s t']

In [184]:
max_id = len(tokenizer.word_index) # number of distinct characters
max_id

39

In [185]:
import numpy as np

In [186]:
# encode the full text so each character is represented by its ID
# subtract 1 to get IDs from 0 to 38, rather than from 1 to 39
encoded = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1


In [187]:
len(encoded[0])

1115394

In [188]:
dataset_size = tokenizer.document_count

### Splitting a Sequential Dataset

N.B.! a good idea is to leave a gap between these sets to avoid the risk of a paragraph overlapping over two sets.

In [189]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

### Chopping the Sequential Dataset into Multiple Windows

We will use the dataset’s window() method to convert this long sequence of characters into many smaller windows of text. Every instance in the dataset will be a fairly short substring of the whole text, and the RNN will be unrolled only over the length of these substrings. This
is called truncated backpropagation through time.

In [190]:

n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True) 
# drop_remainder=True to ensure to ensure that all windows are exactly 101 characters long
# the first window contains characters 0 to 100, the second contains characters 1 to 101, and so on

The window() method creates a dataset that contains windows, each of which is also represented as a dataset. It’s a nested dataset, analogous to a list of lists. But: we cannot use a nested dataset directly for training, as our model will expect tensors as input, not
datasets. So, we must call the flat_map() method: it converts a nested dataset into a flat dataset (one that does not contain datasets). The flat_map() method takes a function as an argument.

For example, if you pass the function lambda ds: ds.batch(2) to flat_map(), then it will transform the nested dataset {{1, 2},
{3, 4, 5, 6}} into the flat dataset {[1, 2], [3, 4], [5, 6]}: it’s a dataset of tensors of size 2. 

In [191]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

Since Gradient Descent works best when the instances in the training set are independent and identically distributed, we need to shuffle these windows. Then we can batch the windows and separate the inputs (the first 100 characters) from the target (the last character):

In [192]:
batch_size = 32

dataset = dataset.shuffle(10000).batch(batch_size) 
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))


Here, we will encode each character using a one-hot vector because there are fairly few distinct characters (only 39):

In [193]:
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch)
    )

In [194]:
# Add debugging prints in the data pipeline
for X_batch, Y_batch in dataset.take(1):
    print("Shape of X_batch:", X_batch.shape)  # Expected: (batch_size, sequence_length, max_id)
    print("Shape of Y_batch:", Y_batch.shape)  # Expected: (batch_size, sequence_length)

prefetch is a method used to improve the performance of data input pipelines. It allows the pipeline to fetch data in the background while the model is training, thus overlapping the data preprocessing and model execution.

The prefetch method takes an argument that specifies the number of batches to prefetch.

In [195]:
dataset = dataset.prefetch(1) 

In [196]:
dataset

<_PrefetchDataset element_spec=(TensorSpec(shape=(None, None, 1115394, 39), dtype=tf.float32, name=None), TensorSpec(shape=(None, None, 1115394), dtype=tf.int64, name=None))>

### Building and Training the Char-RNN Model 

In [197]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
        dropout=0.2, recurrent_dropout=0.2), # dropout and recurrent_dropout are applied to the inputs and the recurrent state, respectively
    keras.layers.GRU(128, return_sequences=True,
        dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
        activation="softmax")) # max_id was 39
        # we want to output a probability for each possible character (at each time step)
        # The output probabilities should sum up to 1 at each time step, so we apply the softmax
])

model.summary()


In [198]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")


In [199]:
history = model.fit(dataset, epochs=20)


Epoch 1/20


ValueError: Exception encountered when calling Sequential.call().

[1mInvalid input shape for input Tensor("data:0", shape=(None, None, 1115394, 39), dtype=float32). Expected shape (None, None, 39), but input has incompatible shape (None, None, 1115394, 39)[0m

Arguments received by Sequential.call():
  • inputs=tf.Tensor(shape=(None, None, 1115394, 39), dtype=float32)
  • training=True
  • mask=None