# RNN text generation

In this notebook, we'll have a look at simple character-level text generation with recurrent neural networks in Keras.

This is largely a simplified version of the TensorFlow tutorial [Text generation with an RNN](https://www.tensorflow.org/tutorials/text/text_generation), which in turn partly follows Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

## Load Shakespeare dataset

We'll use a dataset of texts from Shakespeare provided by Google. The approach would work with any reasonably-sized plain text file.

In [1]:
# Give -nc (--no-clobber) argument so that the file isn't downloaded multiple times 
!wget -nc https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt

File ‘shakespeare.txt’ already there; not retrieving.



This is a simple plain text file, so we'll just read it in directly.

In [2]:
with open('shakespeare.txt') as f:
    text = f.read()

print(len(text))

1115394


Note that the text includes punctuation and newline characters.

In [3]:
print(repr(text[:200]))
print(text[:200])

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you'
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you


## Vectorize text

First, find the set of unique characters in the text:

In [4]:
characters = sorted(set(text))

# What did we get?
print(len(characters))
print(characters)

65
['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


Create a mapping from characters to integers and the inverse mapping from those integers back to the characters.

In [5]:
char_to_int = { c: i for i, c in enumerate(characters) }
int_to_char = { i: c for c, i in char_to_int.items() }

Let's have a look at that mapping.

In [6]:
from itertools import islice
from pprint import pprint    # pretty-printer


def truncate_dict(d, count=10):
    # Return a new dict with at most `count` items from d, in insertion order.
    return dict(islice(d.items(), count))


pprint(truncate_dict(char_to_int, 20))

{'\n': 0,
 ' ': 1,
 '!': 2,
 '$': 3,
 '&': 4,
 "'": 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '3': 9,
 ':': 10,
 ';': 11,
 '?': 12,
 'A': 13,
 'B': 14,
 'C': 15,
 'D': 16,
 'E': 17,
 'F': 18,
 'G': 19}
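To make the two mappings concrete, here is a small round-trip sketch on a toy string. The names `sample`, `to_int`, `to_char`, `encode` and `decode` are illustrative only and are not used elsewhere in this notebook.

```python
# Toy vocabulary built the same way as above, from a short sample string
sample = "hello world"
chars = sorted(set(sample))
to_int = {c: i for i, c in enumerate(chars)}
to_char = {i: c for c, i in to_int.items()}


def encode(s):
    # Map each character to its integer index
    return [to_int[c] for c in s]


def decode(ids):
    # Map integer indices back to characters
    return ''.join(to_char[i] for i in ids)


ids = encode("hello")
print(ids)
print(decode(ids))
```

Encoding followed by decoding should return the original string for any text made up only of characters in the vocabulary. A character outside the vocabulary raises a `KeyError`.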


Map text characters to the (arbitrary) indices defined above:

In [7]:
data = [char_to_int[c] for c in text]

print(data[:100])

[18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49, 6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 37, 53, 59]


Next, let's split that data into fixed-length parts for training.

(Each part holds `sequence_length + 1` characters because the outputs are the inputs shifted by one character, so every part must supply one extra character.)

In [8]:
sequence_length = 100

sequences = []
for i in range(0, len(data), sequence_length+1):
    sequences.append(data[i:i+sequence_length+1])
# Drop the last sequence if it is shorter than the others
if len(sequences[-1]) < sequence_length + 1:
    sequences.pop()

    
for i in range(5):
    print(repr(''.join(int_to_char[c] for c in sequences[i])))

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


Split each sequence into an input and an output so that the output is the input shifted one character ahead: at every position, the target is the character that follows the corresponding input character.

In [9]:
inputs, outputs = [], []
for sequence in sequences:
    inputs.append(sequence[:-1])
    outputs.append(sequence[1:])
    
    
print(repr(''.join(int_to_char[c] for c in inputs[0])))
print(repr(''.join(int_to_char[c] for c in outputs[0])))

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
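The chunking and shifting steps are easier to see on a very short string. The following is a minimal sketch with hypothetical names (`toy_text`, `toy_length`, `chunks`, `pairs`) that mirrors the two cells above on ten characters.

```python
toy_text = "abcdefghij"
toy_length = 4    # plays the role of sequence_length

# Cut the text into chunks of toy_length + 1 characters
chunks = [
    toy_text[i:i + toy_length + 1]
    for i in range(0, len(toy_text), toy_length + 1)
]
# Keep only complete chunks
chunks = [c for c in chunks if len(c) == toy_length + 1]

# Input drops the last character, target drops the first
pairs = [(c[:-1], c[1:]) for c in chunks]
for inp, tgt in pairs:
    print(repr(inp), '->', repr(tgt))
```

Each target character is the one that follows the input character at the same position, which is exactly what the model is trained to predict.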


## Build model

We build a straightforward RNN model where the output at each time step is a vector of unnormalized predictions ("logits"):

* input: sequence of `sequence_length` integers corresponding to characters
* embedding: randomly initialized mapping to `embedding_dim`-dimensional vectors
* rnn: recurrent neural network with `rnn_units`-dimensional state
* output: `vocab_size`-dimensional fully connected layer with no activation, so it produces logits rather than softmax probabilities


In [10]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, GRU, Dense


vocab_size = len(characters)
embedding_dim = 256
rnn_units = 1024


input_ = Input(shape=(sequence_length,))
embedding = Embedding(vocab_size, embedding_dim)(input_)
# Note: return_sequences=True
rnn = GRU(rnn_units, return_sequences=True)(embedding)
# Note: no activation function (e.g. softmax)
output = Dense(vocab_size)(rnn)
model = Model(inputs=[input_], outputs=[output])

model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 100)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 256)          16640     
_________________________________________________________________
gru (GRU)                    (None, 100, 1024)         3935232   
_________________________________________________________________
dense (Dense)                (None, 100, 65)           66625     
Total params: 4,018,497
Trainable params: 4,018,497
Non-trainable params: 0
_________________________________________________________________
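As a sanity check, the parameter counts in the summary above can be reproduced by hand. This sketch assumes the GRU has three gates with one bias vector each, which matches the count printed here. Keras versions where `GRU` defaults to `reset_after=True` keep two bias vectors per gate and therefore report a somewhat larger number.

```python
vocab_size, embedding_dim, rnn_units = 65, 256, 1024

# Embedding: one embedding_dim-dimensional vector per character
embedding_params = vocab_size * embedding_dim

# GRU: 3 gates, each with input weights, recurrent weights and one bias
# (assumption: a single bias vector per gate, as in the summary above)
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)

# Dense: weight matrix plus bias
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```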


Define an appropriate loss function and compile the model.

In [11]:
from tensorflow.keras.losses import sparse_categorical_crossentropy


# This is just 'sparse_categorical_crossentropy' where from_logits=True
# specifies that the values are unnormalized (no softmax)
def sparse_categorical_crossentropy_with_logits(labels, logits):
    return sparse_categorical_crossentropy(labels, logits, from_logits=True)


model.compile(
    optimizer='adam',
    loss=sparse_categorical_crossentropy_with_logits,
)
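To illustrate what `from_logits=True` means, the sketch below computes the same kind of loss for a single position in plain Python. It is not the Keras implementation, and `cross_entropy_from_logits` is a hypothetical helper; it applies a log-softmax to the logits and returns the negative log-probability of the correct class.

```python
import math


def cross_entropy_from_logits(label, logits):
    # log(sum(exp(logits))), with the max subtracted for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # Negative log-probability of the correct class under softmax(logits)
    return log_sum_exp - logits[label]


logits = [2.0, 0.5, -1.0]
print(cross_entropy_from_logits(0, logits))    # correct class has the highest logit
print(cross_entropy_from_logits(2, logits))    # correct class has the lowest logit
```

The loss is small when the correct class has the largest logit and grows as the model assigns it less probability. With uniform logits over `n` classes the loss equals `log(n)`.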

Convert the inputs and outputs to NumPy arrays:

In [12]:
import numpy as np

X, Y = np.array(inputs), np.array(outputs)
print(len(inputs), len(inputs[0]))
print(X.shape)

11043 100
(11043, 100)


Train the model. (This will take a while unless running on a GPU.)

In [None]:
epochs = 1
batch_size = 16


history = model.fit(
    X,
    Y,
    epochs=epochs,
    batch_size=batch_size
)

## Generate text with trained model

Because we fixed the input length to `sequence_length` to keep things simple, we don't carry the RNN state over between predictions. Instead, for every new character we feed the RNN the concatenation of the seed text and all previously generated characters, padded or truncated to `sequence_length`.

In [None]:
import tensorflow as tf


def predict_one_character(input_, temperature=1.0):
    # Left-pad with spaces to sequence_length, then keep only the last
    # sequence_length characters if the input is longer
    input_ = ' ' * (sequence_length-len(input_)) + input_
    input_ = input_[-sequence_length:]
    X = np.array([[char_to_int[c] for c in input_]])    # note batch dim
    # The GRU is not stateful, so there is no state to reset between calls
    Y = model(X)
    Y = tf.squeeze(Y, 0)    # drop batch dim
    # Keep only the logits for the last position and scale by temperature
    logits = Y[-1:] / temperature
    # Sample from categorical distribution
    pred_id = tf.random.categorical(logits, num_samples=1)[0, 0].numpy()
    return int_to_char[pred_id]


seed = 'ROMEO'
string = seed
generated = []
for i in range(500):
    # Lower temperature gives more likely predictions, higher more surprising
    char = predict_one_character(string, temperature=1.0)
    generated.append(char)
    string += char


print(seed + ''.join(generated))
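The effect of the `temperature` argument can be seen without a trained model. The following plain-Python sketch uses made-up logits and hypothetical helpers (`softmax_with_temperature`, `sample_index`) to show how dividing logits by the temperature reshapes the distribution that the next character is sampled from.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    # Divide logits by the temperature, then apply a numerically stable softmax
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature=1.0, rng=random):
    # Draw one index according to the temperature-scaled probabilities
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]    # made-up values for illustration
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

Temperatures below 1 concentrate probability on the highest-scoring character, while temperatures above 1 flatten the distribution and make rarer characters more likely to be sampled.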