# Shakespeare RNN

In [14]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

Download the Shakespeare text dataset and read the text file.

In [2]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

Inspect the distinct characters in the lowercased text.

In [3]:
"".join(sorted(set(shakespeare_text.lower())))

"\n !$&',-.3:;?abcdefghijklmnopqrstuvwxyz"

We create a character-level tokenizer, so that each character is mapped to an integer ID.

In [4]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

An example of a phrase being converted to a sequence of character IDs and back.

In [9]:
tokenizer.texts_to_sequences(["hello world"])
# Note that the 1 in the output is the space between the words

[[7, 2, 12, 12, 4, 1, 17, 4, 9, 12, 13]]

In [10]:
tokenizer.sequences_to_texts([[7, 2, 12, 12, 4, 1, 17, 4, 9, 12, 13]])

['h e l l o   w o r l d']
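
To make the mapping concrete, here is a minimal pure-Python sketch of what a character-level tokenizer does. It assumes, as the Keras `Tokenizer` does by default, that the text is lowercased and that IDs start at 1 with the most frequent character first. `build_char_index` is a hypothetical helper written for illustration, not part of Keras, and its tie-breaking among equally frequent characters may differ.

```python
from collections import Counter

def build_char_index(text):
    """Map each distinct character to an integer ID, most frequent first, starting at 1."""
    counts = Counter(text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

# Fit the (hypothetical) index on a tiny corpus instead of the full Shakespeare text
char_index = build_char_index("hello world")
index_char = {i: ch for ch, i in char_index.items()}

# Encode a phrase to IDs, then decode it back
encoded = [char_index[ch] for ch in "hello world"]
decoded = "".join(index_char[i] for i in encoded)
print(encoded)
print(decoded)
```

The IDs differ from the notebook output because the index here is fitted on a tiny corpus. On the full text the space character receives ID 1, as noted in the cell above.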

In [12]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters
(max_id,dataset_size)

(39, 1115394)

Encode the full text as character IDs (shifted to start at 0) and use the first 90% of it as the training set.

In [31]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])

Split the dataset into windows of 101 characters. In each window, the first 100 characters are the input and the last 100 characters (the same sequence shifted one character ahead) are the target, so at every step the model learns to predict the next character. The first window covers characters 1 to 101, the second covers characters 2 to 102, and so on.

In [32]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.repeat().window(window_length, shift=1, drop_remainder=True)

In [33]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))
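
The windowing, and the input/target split applied to each window later on, can be illustrated on a toy sequence. This is a plain-Python sketch rather than the `tf.data` implementation; `sliding_windows` is a hypothetical helper that mimics `window(..., shift=1, drop_remainder=True)` followed by flattening each window.

```python
def sliding_windows(seq, window_length, shift=1):
    """Return every full window of `window_length` items, moving `shift` items each time."""
    last_start = len(seq) - window_length
    return [seq[i:i + window_length] for i in range(0, last_start + 1, shift)]

# Toy "encoded text" of 6 character IDs, with windows of 3 inputs + 1 extra character
ids = list(range(6))
windows = sliding_windows(ids, window_length=4)

# Inputs drop the last character; targets drop the first (inputs shifted one step ahead)
pairs = [(w[:-1], w[1:]) for w in windows]
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each target sequence has the same length as its input, so the model gets a training signal at every time step rather than only at the end of the window.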

Set random seeds to keep results reproducible

In [34]:
np.random.seed(42)
tf.random.set_seed(42)

Shuffle the windows, batch them into sets of 32, split each window into inputs and targets, and one-hot encode the characters in X_batch.

In [35]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

In [36]:
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))

In [37]:
dataset = dataset.prefetch(1)

In [38]:
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)
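
The shapes above read as 32 windows of 100 time steps. Each input step is a 39-dimensional one-hot vector, while the targets stay as integer IDs. The following sketch shows one-hot encoding in plain Python; `one_hot` here is a hypothetical stand-in for `tf.one_hot`, assuming IDs in the range 0 to `depth - 1`.

```python
def one_hot(ids, depth):
    """Turn each integer ID into a vector of length `depth` with a single 1.0."""
    return [[1.0 if j == i else 0.0 for j in range(depth)] for i in ids]

# Three character IDs drawn from a toy vocabulary of 4 characters
encoded_ids = [2, 0, 3]
vectors = one_hot(encoded_ids, depth=4)
for row in vectors:
    print(row)
```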


We use a GRU-based model with the Adam optimizer. Because the targets are integer character IDs rather than one-hot vectors, the sparse_categorical_crossentropy loss function is a natural fit.
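
As a rough illustration of why the sparse variant suits integer targets, the sketch below computes the loss for a single time step in plain Python. Both functions are simplified, hypothetical re-implementations for one prediction, not the Keras code, and the probabilities are made up for the example.

```python
import math

def sparse_categorical_crossentropy(target_id, probs):
    """Loss for one step when the target is an integer class ID."""
    return -math.log(probs[target_id])

def categorical_crossentropy(one_hot_target, probs):
    """Loss for one step when the target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

# Made-up predicted distribution over a 3-character vocabulary
probs = [0.1, 0.7, 0.2]
sparse_loss = sparse_categorical_crossentropy(1, probs)
dense_loss = categorical_crossentropy([0.0, 1.0, 0.0], probs)
print(sparse_loss, dense_loss)
```

Both give the same value, so the sparse form avoids materializing a one-hot target for every character without changing the loss.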

In [36]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history = model.fit(dataset, steps_per_epoch=train_size // batch_size,
                    epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


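We define some helper functions to make predictions easier. The next_char function samples the next character given the text so far, and the complete_text function repeatedly appends sampled characters; the number of characters to generate defaults to 1000. A seed text of at least one character must be supplied when calling complete_text.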

In [41]:
def preprocess(texts):
    # Shift the 1-based tokenizer IDs to 0-based before one-hot encoding
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

In [42]:
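def next_char(text, temperature=1):
    X_new = preprocess([text])
    # Keep only the predicted distribution for the last time step: shape (1, max_id)
    y_proba = model.predict(X_new)[0, -1:, :]
    # Dividing log-probabilities by the temperature flattens (>1) or sharpens (<1) the distribution
    rescaled_logits = tf.math.log(y_proba) / temperature
    # Sample one ID and add 1 to undo the shift applied in preprocess
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]

def complete_text(text, n_chars=1000, temperature=1):
    # Append one sampled character at a time, feeding the growing text back in
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text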
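
The effect of the temperature parameter can be seen on a small, made-up distribution. The sketch below is pure Python and only mirrors the idea of dividing log-probabilities by the temperature before sampling; `apply_temperature` and `sample_id` are hypothetical helpers, not the TensorFlow implementation.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale log-probabilities by 1/temperature and renormalize with a softmax."""
    logits = [math.log(p) / temperature for p in probs]
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_id(probs, temperature=1.0, rng=random):
    """Draw one class ID from the temperature-adjusted distribution."""
    rescaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(rescaled)), weights=rescaled, k=1)[0]

probs = [0.5, 0.3, 0.2]  # made-up next-character probabilities
for temperature in (0.2, 1.0, 5.0):
    rescaled = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
print(sample_id(probs, temperature=1.0))
```

A temperature below 1 concentrates probability on the most likely character, which tends to give more conservative text. A temperature above 1 moves the distribution toward uniform, which gives more varied but more error-prone text.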

Here is an example of generating text, starting from a randomly chosen initial letter:

In [45]:
from random import randint
letters='qwertyuiopasdfghjklzxcvbnm'
index=randint(0,25)

print(complete_text(letters[index], temperature=1))

wrong, the city.

hortensio:
give himself, receive the rightal blow the rest,
they burn users, that must tell this give be honour
and me all particially as for it;
which gives a suitor heard withal instrued
and make god pirped in althe takes oc very deeds.

painio:
alive to common; even are away, i may leave her
balls pass, since i proceed and live it find for
the stolence hath thus flound the state,
but knock you faol, was more a counsel for the good the belly
would be loud and so.
behive you what you will not hake? you danciy perhaps of me;
he'll part the worsporants ere it?

maniantio:
become it that shall socient the good master,
this good fyceinor, well
his faults, i have her colding me not in my tears,
and do i will content to be marcived by a fair in hell:
if i proceed us belly once.

bianca:
who shall be countent to her way to leave not
his own prince well; even be not rates, i did no foll
if you plead boar beneyous the ready us words:
why ched you state, the opentastal pains a