In this notebook, we'll train an RNN to predict the next character given a sequence of preceding characters from a text. Such models are frequently referred to as character-level "language models".

We'll use the book "Pride and Prejudice" from Project Gutenberg here to train our model:

In [None]:
!wget https://www.gutenberg.org/files/1342/1342-0.txt -O prideandprejudice.txt

--2020-07-21 13:35:41--  https://www.gutenberg.org/files/1342/1342-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 799738 (781K) [text/plain]
Saving to: ‘prideandprejudice.txt’


2020-07-21 13:35:43 (471 KB/s) - ‘prideandprejudice.txt’ saved [799738/799738]



In [None]:
%tensorflow_version 2.x

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Embedding, LSTM, Attention, Input, Flatten
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.datasets import imdb
from tensorflow.keras.utils import to_categorical
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing import sequence
import numpy as np
import random
import re
import io

Next, we read in the file and do some basic processing:

In [None]:
text = io.open('prideandprejudice.txt', encoding='utf-8').read().lower().replace('\n', ' ').replace('\ufeff', '')
text = re.compile(r"\s+").sub(" ", text).strip()

chars = sorted(list(set(text)))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

print('Corpus length   =', len(text))
print('Total chars     =', len(chars))
print('First ten chars =', chars[:10])

Corpus length   = 701860
Total chars     = 64
First ten chars = [' ', '!', '#', '$', '%', "'", '(', ')', '*', ',']
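
The two dictionaries are inverses of each other, so any string made of known characters can be mapped to indices and back. Here is a minimal sketch on a toy string; the `toy_*` names and the `encode`/`decode` helpers are illustrative and not part of the notebook:

```python
# Toy corpus standing in for `text`; the notebook builds the same tables from the full book.
toy_text = "pride and prejudice"

toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

def encode(s, table):
    """Map each character to its integer index."""
    return [table[c] for c in s]

def decode(ids, table):
    """Map integer indices back to characters."""
    return "".join(table[i] for i in ids)

ids = encode("pride", toy_char_indices)
roundtrip = decode(ids, toy_indices_char)
print(ids, "->", roundtrip)  # [8, 9, 5, 3, 4] -> pride
```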


Next, we will cut the text up into semi-redundant sequences of `maxlen` characters:

In [None]:
maxlen = 80
step   = 3

sentences  = []
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])

print('Nr. of sequences =', len(sentences))

Nr. of sequences = 233927
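
The sequence count follows directly from the loop bounds: `range(0, len(text) - maxlen, step)` yields one window per start index. A quick check with the corpus length printed earlier, plus the same windowing on a tiny string to make the overlap visible (the helper name `num_windows` and the toy string are ours):

```python
def num_windows(n_chars, maxlen, step):
    """Number of start indices produced by range(0, n_chars - maxlen, step)."""
    return len(range(0, n_chars - maxlen, step))

# Corpus length reported by the notebook, with maxlen = 80 and step = 3.
count = num_windows(701860, 80, 3)
print(count)  # 233927

# Same windowing on a tiny string: windows of 4 characters, moving 3 at a time.
toy = "abcdefghij"
windows = [(toy[i:i + 4], toy[i + 4]) for i in range(0, len(toy) - 4, 3)]
print(windows)  # [('abcd', 'e'), ('defg', 'h')]
```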


Let's take a look at a few of the (input, target) training pairs we have now:

In [None]:
for i in range(10):
  print(sentences[i], '-->', next_chars[i])

the project gutenberg ebook of pride and prejudice, by jane austen this ebook is -->  
 project gutenberg ebook of pride and prejudice, by jane austen this ebook is fo --> r
oject gutenberg ebook of pride and prejudice, by jane austen this ebook is for t --> h
ct gutenberg ebook of pride and prejudice, by jane austen this ebook is for the  --> u
gutenberg ebook of pride and prejudice, by jane austen this ebook is for the use -->  
enberg ebook of pride and prejudice, by jane austen this ebook is for the use of -->  
erg ebook of pride and prejudice, by jane austen this ebook is for the use of an --> y
 ebook of pride and prejudice, by jane austen this ebook is for the use of anyon --> e
ook of pride and prejudice, by jane austen this ebook is for the use of anyone a --> n
 of pride and prejudice, by jane austen this ebook is for the use of anyone anyw --> h


We still need to convert these to valid inputs, so let's do that now. Here, we're using a one-hot encoding given the low number of distinct characters, but an embedding would work as well, in which case we would convert each sentence to a sequence of integers instead.

In [None]:
# Use the builtin `bool`: the `np.bool` alias was deprecated and later removed from NumPy
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)),         dtype=bool)

for i, sentence in enumerate(sentences):
  for t, char in enumerate(sentence):
      X[i, t, char_indices[char]] = 1
  y[i, char_indices[next_chars[i]]] = 1
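
For comparison, the integer encoding mentioned above stores one index per character instead of a `len(chars)`-wide one-hot row. The following is a dependency-free sketch on toy data; all `toy_*` names are illustrative assumptions, not variables from this notebook:

```python
toy_sentences = ["abca", "bcab"]
toy_chars = sorted(set("".join(toy_sentences)))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}

# Integer encoding: shape (num_sentences, maxlen).
X_int = [[toy_char_indices[c] for c in s] for s in toy_sentences]

# One-hot encoding: shape (num_sentences, maxlen, num_chars).
X_onehot = [
    [[1 if j == idx else 0 for j in range(len(toy_chars))] for idx in row]
    for row in X_int
]

print(X_int)        # [[0, 1, 2, 0], [1, 2, 0, 1]]
print(X_onehot[0])  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

With integer inputs, the network's first layer would typically be an `Embedding` layer with `input_dim=len(chars)`. If the targets were kept as integers too, the loss could be switched to `sparse_categorical_crossentropy`.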

We can now define our network:

In [None]:
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy', optimizer='adam')

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
Total params: 107,072
Trainable params: 107,072
Non-trainable params: 0
_________________________________________________________________
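
The parameter counts in the summary can be reproduced by hand. A standard LSTM layer with biases has four weight sets (three gates plus the candidate cell state), each with input weights, recurrent weights and a bias. The helper names below are ours:

```python
def lstm_params(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

n_chars, n_units = 64, 128  # vocabulary size and LSTM width used in this notebook
lstm_total = lstm_params(n_chars, n_units)
dense_total = dense_params(n_units, n_chars)
print(lstm_total, dense_total, lstm_total + dense_total)  # 98816 8256 107072
```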


We're going to do something clever here: we'll use a Keras callback to generate some text every few epochs, so we can see how well our model is doing.

However, always picking the single most likely next character would be pretty boring, as the network is very likely to get stuck in loops. A better technique is to sample from the predicted distribution, reweighted by a "temperature" parameter. The lower the temperature, the more closely we stick to the most confident prediction; higher temperatures produce more varied (and more error-prone) text:

In [None]:
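def sample(preds, temperature=1.0):
  # Rescale the predicted distribution in log space:
  # temperature < 1 sharpens it, temperature > 1 flattens it
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  # Renormalize so the rescaled values sum to 1 again
  exp_preds = np.exp(preds)
  preds  = exp_preds / np.sum(exp_preds)
  # Draw one sample from the rescaled distribution and return its index
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)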
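
To see what the temperature does before any sampling happens, the rescaling step can be applied to a small hand-made distribution. This is a sketch in plain Python rather than NumPy; `apply_temperature` and the example probabilities are illustrative, not part of the notebook:

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability vector: p_i ** (1 / T), then renormalize."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.5, 0.3, 0.2]
cold = apply_temperature(probs, 0.2)  # sharpened: mass moves to the top choice
same = apply_temperature(probs, 1.0)  # unchanged
hot = apply_temperature(probs, 5.0)   # flattened: closer to uniform

for name, dist in [("T=0.2", cold), ("T=1.0", same), ("T=5.0", hot)]:
    print(name, [round(p, 3) for p in dist])
```

At a temperature of 1.0 the distribution is returned as-is, which is why the two temperatures used in the callback (0.2 and 1.0) contrast a near-greedy sampler with sampling from the model's raw predictions.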

We then define our callback function:

In [None]:
def on_epoch_end(epoch, logs):
  print()
  if epoch % 5 != 0: return

  print('Generating text after epoch {}'.format(epoch))

  # Start from a random piece of text:
  start_index = random.randint(0, len(text) - maxlen - 1)
  
  for temperature in [0.2, 1.0]:
      sentence = text[start_index: start_index + maxlen]
      
      print('(temperature = {})'.format(temperature), sentence, end='  -->  ')

      # Predict 400 characters
      for i in range(400):

          # Prepare input based on sentence so far
          X_pred = np.zeros((1, maxlen, len(chars)))
          for t, char in enumerate(sentence):
              X_pred[0, t, char_indices[char]] = 1.

          # Predict and sample
          preds      = model.predict(X_pred, verbose=0)[0]
          next_index = sample(preds, temperature)
          next_char  = indices_char[next_index]

          # Change the sentence so far (shifting the window to the right)
          sentence = sentence[1:] + next_char

          print(next_char, end='')

      print()
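
The sliding-window logic in the callback can be isolated from Keras. Below is a sketch with a hypothetical `generate` helper that accepts any `predict_next` function; the stub used here simply returns the next letter of the alphabet and stands in for `model.predict` followed by `sample`:

```python
def generate(seed, n_chars, predict_next):
    """Generate n_chars characters, feeding back a fixed-length window."""
    window = seed
    out = []
    for _ in range(n_chars):
        next_char = predict_next(window)
        out.append(next_char)
        # Drop the oldest character and append the new one: the length stays constant.
        window = window[1:] + next_char
    return "".join(out)

def stub_predict(window):
    """Stand-in for the model: return the letter after the window's last one."""
    last = window[-1]
    return chr((ord(last) - ord("a") + 1) % 26 + ord("a"))

generated = generate("abc", 5, stub_predict)
print(generated)  # defgh
```

Because the window always has the same length as the seed, the model sees an input of shape `(1, maxlen, len(chars))` at every step, exactly as during training.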

Now let's train. Note how the generated text differs between the two temperatures:

In [None]:
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
model.fit(X, y, batch_size=256, epochs=60, verbose=0, callbacks=[print_callback])


Generating text after epoch 0
(temperature = 0.2) ied only with a laugh; and as it had been asked without the least suspicion, she  -->   the the the the mared and and he the mist and an the here the the the coulling wat and and and the the the sound was ther and and the the ther the he the ther soung the her the her ind and and and and and the hat and and and and and in the hith and and ind the ther the the the sout and and the the the the lese the her ant and soull the the the the ther and and and and and and and and and and the 
(temperature = 1.0) ied only with a laugh; and as it had been asked without the least suspicion, she  -->   whas te mocitbesther bo estul, bve tichedurse ilsurtti kfithe lis hem, whe, onljiginot orethein , andan ithe, wasy is horr the ah orseveghos ithe nobewysun avet ahe, oblwidest. yom. frrelkavr ur axeg, qu rrenquiny bytit oud cthenl”d. —i vem e sougt, anwashe he cerithonth wh alcenerell tatheed them st culgcltady bed thowik cass benove s, axdef oi worea

<tensorflow.python.keras.callbacks.History at 0x7f80a03bf400>