## Recurrent Neural Network and Jane Austen

Author: Greg Strabel

This notebook trains a simple recurrent neural network (a single layer of LSTM units) to read a sequence of Unicode characters and predict the next character in the sequence. The training corpus consists of overlapping sequences of text from Jane Austen's Sense and Sensibility. The code in this notebook is based on the example found [here](https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py). Because training even a single epoch of the model on a CPU takes a long time, I have saved a copy of the neural network produced after 5 training epochs so that the reader does not have to retrain it.

In [14]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
import os
cwd = os.getcwd()

# Get the text of Jane Austen's Sense and Sensibility from Project Gutenberg
text = nltk.corpus.gutenberg.raw('austen-sense.txt')

print(text[:1000])

[Sense and Sensibility by Jane Austen 1811]

CHAPTER 1


The family of Dashwood had long been settled in Sussex.
Their estate was large, and their residence was at Norland Park,
in the centre of their property, where, for many generations,
they had lived in so respectable a manner as to engage
the general good opinion of their surrounding acquaintance.
The late owner of this estate was a single man, who lived
to a very advanced age, and who for many years of his life,
had a constant companion and housekeeper in his sister.
But her death, which happened ten years before his own,
produced a great alteration in his home; for to supply
her loss, he invited and received into his house the family
of his nephew Mr. Henry Dashwood, the legal inheritor
of the Norland estate, and the person to whom he intended
to bequeath it.  In the society of his nephew and niece,
and their children, the old Gentleman's days were
comfortably spent.  His attachment to them all increased.
The constant attention 

In [2]:
chars = sorted(list(set(text)))
char_to_ix = {ch:i for i,ch in enumerate(chars)}
ix_to_char = {i:ch for i,ch in enumerate(chars)}

In [4]:
print('Number of distinct characters in training corpus: {}'.format(len(chars)))

Number of distinct characters in training corpus: 78
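
To make the two lookup tables concrete, here is a minimal sketch that repeats the same construction on a short toy string. The string and the `toy_` names are assumptions for illustration only; they are not part of the notebook's corpus or variables.

```python
# Toy stand-in for the novel's text (an assumption for illustration only).
toy_text = "sense and sensibility"

# Same construction as in the notebook: sorted distinct characters,
# then a character-to-index table and its inverse.
toy_chars = sorted(set(toy_text))
toy_char_to_ix = {ch: i for i, ch in enumerate(toy_chars)}
toy_ix_to_char = {i: ch for i, ch in enumerate(toy_chars)}

# Encode the string as integer indices and decode it back again.
encoded = [toy_char_to_ix[ch] for ch in toy_text]
decoded = "".join(toy_ix_to_char[i] for i in encoded)

print(toy_chars)
print(encoded[:5])
print(decoded == toy_text)
```

Because the two dictionaries are inverses of each other, decoding the encoded indices recovers the original string exactly.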


I construct a simple recurrent neural network consisting of a single layer of 128 LSTM units, followed by a dense softmax layer that outputs one probability per character:

In [5]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
import random
import sys

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 50
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('Number of sequences:', len(sentences))

print('Vectorization...')
# one-hot arrays; the np.bool alias was removed in NumPy 1.24, so use the builtin bool
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_to_ix[char]] = 1
    y[i, char_to_ix[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

# recent Keras releases name this argument learning_rate; older ones spell it lr
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

Using TensorFlow backend.


Number of sequences: 224324
Vectorization...
Build model...


In [15]:
# Load the saved model
from keras.models import load_model
# os.path.join builds the path portably instead of hard-coding a Windows separator
model = load_model(os.path.join(cwd, 'JaneAustenKerasModel.h5'))

Next, we train the model for two more epochs, one per iteration. After each epoch, we give the RNN a seed of 50 Unicode characters and ask it to generate the next 400 characters. Rather than have the RNN do this just once per epoch, we have it generate text at four different levels of sampling diversity.

Given a sequence of characters, the RNN estimates the probability of each character coming next in the sequence, and generating text means sampling repeatedly from this distribution. The diversity parameter acts as a temperature: before sampling, the log-probabilities are divided by the diversity and renormalized, so values below 1 concentrate probability on the most likely characters and values above 1 flatten the distribution. The intuition is similar to that behind simulated annealing, where a temperature controls how freely a search space is explored. Low diversity yields conservative, repetitive text, while high diversity lets the RNN explore less likely continuations at the cost of more misspellings; testing several diversity levels shows this trade-off.
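
To illustrate what the diversity parameter does to a distribution, the sketch below applies the same log, divide, exponentiate, and renormalize steps as the `sample` helper to a made-up distribution over four characters. The function name `rescale` and the probabilities in `toy_preds` are assumptions for illustration; they do not come from the trained model.

```python
import numpy as np

def rescale(preds, temperature):
    """Rescale a probability vector the same way sample() does before drawing."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

# Hypothetical next-character distribution over four characters.
toy_preds = [0.5, 0.3, 0.15, 0.05]

# The same four diversity levels used in the notebook's generation loop.
for temperature in [0.2, 0.5, 1.0, 1.2]:
    print(temperature, np.round(rescale(toy_preds, temperature), 3))
```

At a diversity of 1.0 the distribution is unchanged. Lower values push most of the probability mass onto the most likely character, and values above 1.0 move mass toward the less likely ones.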

In [9]:
# train the model, output generated text after each iteration
for iteration in range(1, 3):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y,
              batch_size=128,
              epochs=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_to_ix[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = ix_to_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


--------------------------------------------------
Iteration 1
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "sewhere, it could not be
in THAT quarter.  'THERE,"
sewhere, it could not be
in THAT quarter.  'THERE, and the possible to be the world to her sister, and the same and some of the world to the same to be the more than that the could every thing to her sister, and the same time to be the same to be the particulare of the particular
of the same to the three of the same and the same to the comfort any one of the particular
of the conscier, and her sister,
and the particular sincered to the more than 

----- diversity: 0.5
----- Generating with seed: "sewhere, it could not be
in THAT quarter.  'THERE,"
sewhere, it could not be
in THAT quarter.  'THERE, and in the increasing to have herself, and the pardon accommonable, and in the every at her sister and declared to the conscience, her sister's disposition, and they were without her own than that I would be united to 



raged her o, she was peruacy in the bealment as my sisted, great continued, and could talked Willoughby was, where must been one early at her wos.
But our miam brother by my aparrponcy."

"I am
she could be so much hinces, to spirits fartable,
with

----- diversity: 1.2
----- Generating with seed: "sewhere, it could not be
in THAT quarter.  'THERE,"
sewhere, it could not be
in THAT quarter.  'THERE,
and was as no many rousingly
answer.
But well-esudeed risking hroming soatiwity
in sister's tably, "I go repity that seemed the glad them to
gisted themsemb reell a cautitomablesf from Ma. Oh, you rep as Miss Thousate, that Sir John."

"It cle;svis Edward
was,
unottered, "you hasmed, each mess of good
of her
dextreable
happinets--shendered it was heard, less to love aeac doaw trad!
For HESiMy I d

--------------------------------------------------
Iteration 2
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "rary accommodation to
yourself--such, in short, as"
rary accommodation t

In [11]:
# Save the model
model.save(os.path.join(cwd, 'JaneAustenKerasModel.h5'))  # portable path, no hard-coded backslashes

Should we be impressed by the results we see above? Certainly the generated text itself does not seem to make any sense. But think about what the RNN has managed to learn. The RNN has not been directly provided any notion of English words; we provide it sequences of Unicode characters rather than encodings of tokenized words. This means that when the RNN generates English words like 'the', 'should', 'sister', etc., it is not because the RNN was told that these sequences of characters are words; the RNN figured this out itself.

This recognition of English words occurs after only a few training epochs. The more epochs one uses in training the model, the better the model gets at learning parts of the English language. For instance, after several training epochs, the model appears to have learned that the article 'the' or the possessive pronoun 'her' commonly follows the preposition 'of', but that the reverse cases are unlikely. And although the generated sentences don't make any sense, it's not hard to find sentence fragments that are grammatically correct.

In conclusion, given the simplicity of the RNN (a single hidden layer of 128 LSTM units) and the small number of training epochs (I'm running this on a 2.7 GHz CPU rather than a GPU), I find the RNN's performance rather remarkable.
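
The observation about what follows 'of' can be checked informally by counting word pairs in the generated text. The sketch below does this for a short fragment of the diversity 0.2 output shown earlier, joined onto one line. The variable names are assumptions, and a fragment this small is only illustrative, not evidence about the model as a whole.

```python
import re
from collections import Counter

# Fragment of the diversity-0.2 output shown earlier, joined onto one line.
generated = (
    "to be the same to be the particulare of the particular "
    "of the same to the three of the same and the same to the comfort "
    "any one of the particular of the conscier, and her sister,"
)

# Split into lowercase words, ignoring punctuation.
words = re.findall(r"[a-z']+", generated.lower())
pairs = list(zip(words, words[1:]))

# Which words follow "of", and how often does "of" follow "the" or "her"?
after_of = Counter(b for a, b in pairs if a == "of")
before_of = Counter(a for a, b in pairs if b == "of" and a in {"the", "her"})

print(after_of)
print(before_of)
```

In this fragment every occurrence of 'of' is followed by 'the', and 'of' never follows 'the' or 'her'. Running the same count over a longer generated sample would give a fairer picture.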