In [15]:
import random
import numpy as np
import tensorflow as tf

In [16]:
filepath = tf.keras.utils.get_file('shakespeare.txt',
        'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
text = open(filepath, 'rb').read().decode(encoding='utf-8')

# Preparing data #
The problem with our data right now is that it is text. We cannot train a neural network directly on letters or sentences, so we need to convert the text into numerical data. We therefore need a system that converts the text into numbers, predicts numbers based on that data, and then converts the resulting numbers back into text.

In [17]:
text = open(filepath, 'rb').read().decode(encoding='utf-8').lower()

For the sake of simplicity, I am going to modify the last line of code that we wrote. Here I immediately convert all of the text to lower case so that the model has fewer possible characters to choose from. I am also not going to use the whole text file as training data. If you have the capacity and the time to train your model on the whole text, do it! It will likely produce much better results. But if your machine is slow or your time is limited, consider using only a part of the text.

In [18]:
text = text[300000:800000]

Here we select the characters from index 300,000 up to, but not including, index 800,000. So we are processing a total of 500,000 characters, which should be enough for pretty decent results.

In [19]:
characters = sorted(set(text))

char_to_index = dict((c, i) for i, c in enumerate(characters))
index_to_char = dict((i, c) for i, c in enumerate(characters))

Now we create a sorted list of all the unique characters that occur in the text. In a set no value appears more than once, so this is a good way to collect the distinct characters. After that we define two structures for converting the values. Both are dictionaries built by enumerating the characters. In the first one, the characters are the keys and the indices are the values. In the second one it is the other way around. Now we can easily convert a character into a unique numerical representation and vice versa.
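
To make the round trip concrete, here is a minimal sketch on a toy string. The name `toy_text` and its content are made up for illustration; the tutorial itself builds the dictionaries from the Shakespeare text.

```python
# Toy stand-in for the corpus (an assumption for illustration only)
toy_text = "hello world"

# Same construction as in the tutorial, applied to the toy string
characters = sorted(set(toy_text))
char_to_index = {c: i for i, c in enumerate(characters)}
index_to_char = {i: c for i, c in enumerate(characters)}

# Encode every character as its index, then decode the indices again
encoded = [char_to_index[c] for c in toy_text]
decoded = "".join(index_to_char[i] for i in encoded)

print(characters)
print(encoded)
print(decoded)
```

Decoding the encoded indices gives back the original string, which is exactly the property we rely on later when we turn predictions back into text.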

In [20]:
SEQ_LENGTH = 40
STEP_SIZE = 3

sentences = []
next_char = []

In this next step, we define how long a sequence is (SEQ_LENGTH) and how many characters we step forward before starting the next one (STEP_SIZE). The idea is to take a sequence of characters, which we call a sentence here, as the input and to save the character that follows it as the target.

In [21]:
for i in range(0, len(text) - SEQ_LENGTH, STEP_SIZE):
    sentences.append(text[i: i + SEQ_LENGTH])
    next_char.append(text[i + SEQ_LENGTH])

We iterate through the whole text and gather all sentences and their next character. This is the training data for our neural network. Now we just need to convert it into a numerical format.
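
The sliding window is easier to see on a short string. The following sketch uses a made-up `toy_text` and much smaller values for `SEQ_LENGTH` and `STEP_SIZE` than the tutorial does, purely so that the output stays readable.

```python
# Toy values (assumptions for illustration; the tutorial uses 40 and 3)
toy_text = "to be or not to be"
SEQ_LENGTH = 5
STEP_SIZE = 3

sentences = []
next_char = []

# Same loop as in the tutorial: take a window, remember the character after it
for i in range(0, len(toy_text) - SEQ_LENGTH, STEP_SIZE):
    sentences.append(toy_text[i: i + SEQ_LENGTH])
    next_char.append(toy_text[i + SEQ_LENGTH])

for sentence, target in zip(sentences, next_char):
    print(repr(sentence), "->", repr(target))
```

Each printed line is one training example: the window on the left is the input and the single character on the right is what the network should learn to predict.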

In [24]:
import numpy as np

# Inputs: x[i, t, k] is True when position t of sentence i holds character k
x = np.zeros((len(sentences), SEQ_LENGTH, len(characters)), dtype=bool)
# Targets: y[i, k] is True when character k follows sentence i
y = np.zeros((len(sentences), len(characters)), dtype=bool)

for i, satz in enumerate(sentences):
    for t, char in enumerate(satz):
        x[i, t, char_to_index[char]] = True
    y[i, char_to_index[next_char[i]]] = True
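
To see what these arrays look like, here is a small sketch of the same one-hot encoding on hand-written toy data. The two sentences, their targets and the three-character alphabet are assumptions chosen only to keep the arrays tiny.

```python
import numpy as np

# Toy data (assumptions for illustration only)
characters = ["a", "b", "c"]
char_to_index = {c: i for i, c in enumerate(characters)}
sentences = ["ab", "bc"]
next_char = ["c", "a"]
SEQ_LENGTH = 2

# Same encoding as in the tutorial
x = np.zeros((len(sentences), SEQ_LENGTH, len(characters)), dtype=bool)
y = np.zeros((len(sentences), len(characters)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_to_index[char]] = True
    y[i, char_to_index[next_char[i]]] = True

print(x.astype(int))
print(y.astype(int))
```

Every position in `x` and every row in `y` contains exactly one True value, at the index of the character it stands for.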


# Building Recurrent Neural Network #

In [25]:
import random
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.layers import Activation, Dense, LSTM

Of course, you can also refer to these classes by their full paths (for example tf.keras.layers.LSTM) instead of importing them individually. We will use Sequential for our model, Activation, Dense and LSTM for our layers and RMSprop for optimization when we compile our model. LSTM stands for long short-term memory and is a type of recurrent neural network layer. You can think of it as the memory of our model. This is crucial, since we are dealing with sequential data.

In [26]:
model = Sequential()
model.add(LSTM(128,
               input_shape=(SEQ_LENGTH,
                            len(characters))))
model.add(Dense(len(characters)))
model.add(Activation('softmax'))

Our structure is simple! The inputs flow directly into our LSTM layer with 128 units. The input shape is (SEQ_LENGTH, len(characters)): one one-hot vector per position in the sentence. In the target vector, the character that follows the sentence is set to True (one). The LSTM layer is followed by a Dense layer with one neuron per possible character. This is the output layer of our network. Finally, we apply the softmax activation function so that the outputs add up to one. This gives us a probability for each possible next character.
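
What softmax does can be sketched in a few lines of NumPy. This is only an illustration of the formula, not the Keras implementation, and the `logits` values are made up.

```python
import numpy as np

def softmax(logits):
    # Subtracting the maximum does not change the result but avoids overflow
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up raw outputs of a Dense layer with three neurons
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print(probs)
print(probs.sum())
```

The largest raw output gets the largest probability, and all probabilities together add up to one.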

In [27]:
model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(learning_rate=0.01))

model.fit(x, y, batch_size=256, epochs=4)



Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4



Now we compile the model and train it with our training data that we prepared above. We choose a batch size of 256 (which you can change if you want) and four epochs. This means that our model is going to see the same data four times.
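
If you want a feeling for how much work this is, you can estimate the number of weight updates. The sketch below assumes that the slice from above really contains 500,000 characters; the constant names other than SEQ_LENGTH and STEP_SIZE are made up for this example.

```python
import math

TEXT_LENGTH = 500_000  # assumed length of text[300000:800000]
SEQ_LENGTH = 40
STEP_SIZE = 3
BATCH_SIZE = 256
EPOCHS = 4

# Number of (sentence, next character) pairs produced by the loop above
num_samples = len(range(0, TEXT_LENGTH - SEQ_LENGTH, STEP_SIZE))

# Each epoch processes all samples in batches; the last batch may be smaller
steps_per_epoch = math.ceil(num_samples / BATCH_SIZE)
total_updates = steps_per_epoch * EPOCHS

print(num_samples, steps_per_epoch, total_updates)
```

A larger batch size means fewer updates per epoch, while a smaller STEP_SIZE produces more training samples from the same text.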

# Helper Function #

Our model is now trained but it only outputs the probabilities for the next character. We therefore need some additional functions to make our script generate some reasonable text.

In [28]:
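def sample(preds, temperature=1.0):
    # Rescale the predicted probabilities by the temperature, then draw one index
    preds = np.asarray(preds).astype('float64')
    # Low temperatures sharpen the distribution, high temperatures flatten it
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # Renormalize so that the values add up to one again
    preds = exp_preds / np.sum(exp_preds)
    # Draw a single sample; probas is a one-hot row marking the chosen index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)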

# Generating Text #
Now we can get to the final function of our script: the function that generates the final text.

In [29]:
def generate_text(length, temperature):
    # Pick a random seed sentence of SEQ_LENGTH characters from the text
    start_index = random.randint(0, len(text) - SEQ_LENGTH - 1)
    generated = ''
    sentence = text[start_index: start_index + SEQ_LENGTH]
    generated += sentence
    for i in range(length):
        # One-hot encode the current sentence as a batch of size one
        x_predictions = np.zeros((1, SEQ_LENGTH, len(characters)))
        for t, char in enumerate(sentence):
            x_predictions[0, t, char_to_index[char]] = 1

        # Predict the probabilities of the next character and sample from them
        predictions = model.predict(x_predictions, verbose=0)[0]
        next_index = sample(predictions, temperature)
        next_character = index_to_char[next_index]

        # Append the new character and slide the window one step forward
        generated += next_character
        sentence = sentence[1:] + next_character
    return generated

Again, it is less complicated than it looks. We choose a random starting position within the text because we need some starting text in order to predict the “next” character. So the first SEQ_LENGTH characters are copied from the original text. We could cut them off afterwards and end up with text that is generated entirely by our neural network.

So we choose some random starting text and then run a for loop over the length that we want. We can generate a text with 100 characters or one with 20,000. In each iteration we convert our current sentence into the input format that we already talked about: an array with ones (or Trues) wherever a character occurs. Then we use the predict method of our model to get the probabilities of the possible next characters and pass them to our sample helper function, together with the temperature parameter. The temperature controls how adventurous the sampling is: low values make the function pick the most likely characters more often, while high values make less likely characters more probable. The index we get back still needs to be converted into a readable character. Once this is done, we add the character to our generated text, shift the sentence by one character and repeat the process until we reach the desired length.
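
The effect of the temperature is easiest to see on a small example. The sketch below repeats only the rescaling step of the sample helper, without the random draw. The function name `apply_temperature` and the probabilities are made up for illustration.

```python
import numpy as np

def apply_temperature(preds, temperature):
    # Same rescaling as in sample(), but returning the distribution itself
    preds = np.asarray(preds, dtype="float64")
    scaled = np.log(preds) / temperature
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()

# Made-up probabilities for three possible next characters
probs = np.array([0.6, 0.3, 0.1])

for temperature in (0.2, 1.0, 2.0):
    print(temperature, np.round(apply_temperature(probs, temperature), 3))
```

With a temperature of 1.0 the probabilities stay unchanged. Below 1.0 the most likely character dominates even more, and above 1.0 the distribution becomes flatter, so unlikely characters are chosen more often.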

# Results #
The results are actually quite good! Let’s take a look at some samples. I played around with the temperature in order to diversify the results. I am not going to post the full results here, just some interesting snippets.

In [30]:
print(generate_text(300, 0.2))
print(generate_text(300, 0.4))
print(generate_text(300, 0.5))
print(generate_text(300, 0.6))
print(generate_text(300, 0.7))
print(generate_text(300, 0.8))

ine own again, be gone,
that i might stre with the the the sord the garenoun the sore the the be the rous the sarent the mand the the cound with the sear the seand the sore the to the sores corenter seath me the beand the sour not the mare the cours the sing the seand sour be the surest the sore the sore the ment the sore the beast the so
ear her hence.

queen margaret:
so come my sing thy bears woll my seall the merend so for will,
in the buck shat the se preast the with the bewer ar and gead in the of my to butered she the nowes my the fore for the coust the sing the sand the but the sand she gord am the goond the ward the songer.

cours:
and ser:
wher and be wath at she
roat of death.
rescue, fair lord, or else the to mard to the gores:
en the sord mome to loms on the the hare roule dore the beath farle my to which rey and or she she congers.

lick of lile::
ce all thes shell wor mun the my wars deand the hane or thy if of the geallont
you soul the and you to nomen she for sing
cour