# **Text generation in the style of Alice's Adventures in Wonderland** #

## **Import necessary dependencies** ##

In [2]:
import sys
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


## **Load and preprocess the data** ##
First, we need to load and preprocess the data. The data is the full text of the book Alice's Adventures in Wonderland by Lewis Carroll. The book is in the public domain and can be downloaded for free in multiple formats from the Project Gutenberg website: https://www.gutenberg.org/wiki/Main_Page

In [6]:
# load the text file and convert it to lower case
filename = "data/wonderland.txt"
raw_text = open(filename, encoding='utf-8').read()
raw_text = raw_text.lower()

# print out the first 1000 characters
raw_text[: 1000]

"\ufeffproject gutenberg's alice's adventures in wonderland, by lewis carroll\n\nthis ebook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  you may copy it, give it away or\nre-use it under the terms of the project gutenberg license included\nwith this ebook or online at www.gutenberg.org\n\n\ntitle: alice's adventures in wonderland\n\nauthor: lewis carroll\n\nposting date: june 25, 2008 [ebook #11]\nrelease date: march, 1994\n[last updated: december 20, 2011]\n\nlanguage: english\n\n\n*** start of this project gutenberg ebook alice's adventures in wonderland ***\n\n\n\n\n\n\n\n\n\n\nalice's adventures in wonderland\n\nlewis carroll\n\nthe millennium fulcrum edition 3.0\n\n\n\n\nchapter i. down the rabbit-hole\n\nalice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is

In [7]:
# get the list of all unique characters
chars = sorted(list(set(raw_text)))

# map between characters and their index, in both directions
# so we can encode and decode between them
char2id = dict((c, i) for i, c in enumerate(chars))
id2char = dict((i, c) for i, c in enumerate(chars))

# print out all unique characters
print(chars)

['\n', ' ', '!', '"', '#', '$', '%', "'", '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '\ufeff']


In [8]:
# number of raw characters and
# number of unique characters in the text
n_chars = len(raw_text)
n_vocab = len(chars)
print ("Total Characters: ", n_chars)
print ("Total Vocab: ", n_vocab)

Total Characters:  163781
Total Vocab:  59


Next, we split the book's text into subsequences of 100 consecutive characters. Each subsequence overlaps the previous one (and the next one) by 99 characters. In other words, we shift the window by 1 character to make each new subsequence, so the learning algorithm can learn to predict a character from the 100 characters that precede it (its context).
After that, we turn the characters into integers using the look-up table above.
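As a minimal sketch of this windowing on a toy scale, the snippet below uses a 10-character string and a window of 4 instead of the book and a window of 100. The string and the names `text` and `pairs` are illustrative assumptions, not part of the notebook.

```python
# Toy illustration of the sliding window (illustrative values, not the real data)
text = "wonderland"
seq_length = 4   # the notebook uses 100
skip = 1         # shift the window by one character each time

# each pair is (context of seq_length characters, the character that follows it)
pairs = [(text[i:i + seq_length], text[i + seq_length])
         for i in range(0, len(text) - seq_length, skip)]

for context, target in pairs:
    print(repr(context), "->", repr(target))
```

Consecutive contexts share all but one character, which is the overlap described above.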

In [9]:
# the number of characters in each subsequence
seq_length  = 100
# skip 1 character, meaning 2 consecutive subsequences
# will share 99 characters
skip = 1

dataX = []
dataY = []
for i in range(0, n_chars - seq_length, skip):
    # the context, which is the previous 100 characters
    seq_in = raw_text[i:i + seq_length]
    
    # the correct character we want to predict
    seq_out = raw_text[i + seq_length]
    
    # change them into indices
    dataX.append([char2id[char] for char in seq_in])
    dataY.append(char2id[seq_out])
    
# a pattern is a context to predict a character
n_patterns = len(dataX)
print ("Total Patterns: ", n_patterns)

Total Patterns:  163681


Because we chose a seq_length of 100 characters, the total number of patterns is 100 less than the total number of characters (163781 - 100 = 163681): the first 100 characters serve only as context and are never prediction targets.
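The count can be checked directly from the loop bounds. This is a small sketch; `count_patterns` is a hypothetical helper introduced here for illustration only.

```python
def count_patterns(n_chars, seq_length, skip=1):
    # number of iterations of: for i in range(0, n_chars - seq_length, skip)
    return len(range(0, n_chars - seq_length, skip))

# values taken from the output above
n_patterns = count_patterns(163781, 100)
print(n_patterns)
```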

In [10]:
# reshape x to be [batch_size=n_patterns, timesteps=seq_length, input_dim=1] to 
# feed into LSTM network
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize the data
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
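To make the shapes concrete, here is a small NumPy-only sketch of the same three steps on toy data. The toy values are assumptions, and `np.eye(n_vocab)[dataY]` is used as a stand-in for `np_utils.to_categorical`. Unlike that function, it takes the number of classes explicitly instead of inferring it from the data.

```python
import numpy as np

# toy data: 2 patterns of 3 timesteps each, vocabulary of 4 symbols
n_vocab = 4
dataX = [[0, 1, 2], [1, 2, 3]]
dataY = [3, 0]

# [n_patterns, timesteps, input_dim=1], scaled into the range [0, 1)
X = np.reshape(dataX, (len(dataX), 3, 1)) / float(n_vocab)

# one-hot encode the targets: row k of the identity matrix encodes class k
y = np.eye(n_vocab)[dataY]

print(X.shape, y.shape)
print(y)
```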

## **Model training** ##

Now we can build our deep learning model. It has a single LSTM layer with 256 memory units, followed by a Dropout layer with a rate of 20% to reduce overfitting. The output layer is a Dense layer with a softmax activation, which outputs a probability for each of the 59 characters.

The model is compiled with categorical cross-entropy (log loss) as the loss function and Adam as the optimizer.
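A useful reference point when reading the loss values during training is the loss of a model that knows nothing and assigns the same probability to every character. Assuming Keras reports categorical cross-entropy in natural-log units, that baseline is ln(59), about 4.08. The sketch below computes it. The helper `mean_target_probability` is a hypothetical name, and it returns the geometric-mean probability assigned to the correct character for a given loss.

```python
import math

n_vocab = 59  # vocabulary size from the preprocessing step

# cross-entropy of a uniform guess: -ln(1 / n_vocab) = ln(n_vocab)
uniform_loss = -math.log(1.0 / n_vocab)
print("loss of a uniform guess:", uniform_loss)

def mean_target_probability(loss):
    # exp(-loss) is the geometric mean of the probability given to the true character
    return math.exp(-loss)

print("uniform guess:", mean_target_probability(uniform_loss))
```

Any loss below this baseline means the model has learned something about the character distribution.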



In [11]:
# build the LSTM model
model = Sequential()

# input LSTM layer
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))

# dropout layer
model.add(Dropout(0.2))

# output layer
model.add(Dense(y.shape[1], activation='softmax'))

# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam')

# save the weights after each epoch in which the loss improved,
# so that the model can easily be reloaded or retrained later
filepath="model_checkpoints/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

It's time to train the model. This step is computationally very expensive: on my computer, with a fairly powerful NVIDIA GPU, each epoch took about 8 minutes, so a training session of 10 epochs took well over an hour!

In [8]:
# fit the model into the data
model.fit(X, y, epochs=10, batch_size=128, callbacks=callbacks_list, verbose=2)

Epoch 1/10
Epoch 00000: loss improved from inf to 2.98379, saving model to model_checkpoints/weights-improvement-00-2.9838.hdf5
513s - loss: 2.9838
Epoch 2/10
Epoch 00001: loss improved from 2.98379 to 2.80069, saving model to model_checkpoints/weights-improvement-01-2.8007.hdf5
474s - loss: 2.8007
Epoch 3/10
Epoch 00002: loss improved from 2.80069 to 2.71612, saving model to model_checkpoints/weights-improvement-02-2.7161.hdf5
458s - loss: 2.7161
Epoch 4/10
Epoch 00003: loss improved from 2.71612 to 2.64447, saving model to model_checkpoints/weights-improvement-03-2.6445.hdf5
447s - loss: 2.6445
Epoch 5/10
Epoch 00004: loss improved from 2.64447 to 2.59306, saving model to model_checkpoints/weights-improvement-04-2.5931.hdf5
452s - loss: 2.5931
Epoch 6/10
Epoch 00005: loss improved from 2.59306 to 2.54385, saving model to model_checkpoints/weights-improvement-05-2.5439.hdf5
431s - loss: 2.5439
Epoch 7/10
Epoch 00006: loss improved from 2.54385 to 2.49083, saving model to model_checkpo

<keras.callbacks.History at 0x9f38a81a90>

We can see that the loss decreased after every epoch, so the last epoch also gives the best weights. We will stop here and try to generate some text.

## **Using LSTM network to generate text** ##

In [12]:
# load the network weights from the best epoch
filename = "model_checkpoints/weights-improvement-09-2.3639.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

To generate new text, we randomly select a pattern (a sequence of 100 characters) from the dataset and use it as the context from which the model predicts the most likely next character. The next pattern consists of the last 99 characters of the previous pattern plus the newly predicted character, and from it the model predicts the second character, and so on. We generate 500 characters, converting each one from its integer index back into a human-readable character.
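The loop described above can be sketched independently of the trained network. In the snippet below, `toy_predict` is a hypothetical stand-in for the model (it simply returns the last index plus one, modulo 5), and `generate` is an illustrative helper, not a function from the notebook.

```python
def generate(seed, predict_next, n_steps):
    """Greedy generation with a fixed-length sliding context window."""
    window = list(seed)  # copy, so the caller's seed is left untouched
    generated = []
    for _ in range(n_steps):
        next_index = predict_next(window)
        generated.append(next_index)
        # drop the oldest element and append the prediction
        window = window[1:] + [next_index]
    return generated

def toy_predict(window):
    # hypothetical stand-in for model.predict followed by argmax
    return (window[-1] + 1) % 5

seed = [0, 1, 2]
generated = generate(seed, toy_predict, 4)
print(generated)
```

Each prediction is fed back into the context, so an early mistake influences every later step.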

In [40]:
# pick a random pattern (copied, so that dataX itself is not modified below)
start = numpy.random.randint(0, len(dataX))
pattern = list(dataX[start])

# print out the pattern in human-readable form
print ("The context:")
print ("\"", ''.join([id2char[index] for index in pattern]), "\"")
print("-------------------------------------------------------------------")
print("\nGenerated text: \n", "\"")


# AI writer in action!
for i in range(500):
    # reshape x to be [batch_size=1, timesteps=seq_length, input_dim=1] to 
    # feed into LSTM network
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    
    # normalize the data
    x = x / float(n_vocab)
    
    # predict the next character from its context x
    prediction = model.predict(x, verbose=0)
    
    # get the index of the most likely character from the softmax output
    index = numpy.argmax(prediction)
    
    # change index into letter
    result = id2char[index]
    
    # print the predicted character, then slide the context window:
    # append the new index and drop the oldest one
    sys.stdout.write(result)
    pattern.append(index)
    pattern = pattern[1:]
print ("\n\"", "\nDone.")

The context:
" do such a
thing before, and he wasn't going to begin at his time of life.

the king's argument was,  "
-------------------------------------------------------------------

Generated text: 
 "
 she soeen sai so the toree the was so tee to the three sare the was so tee to the three hare th the the wooee sas in the care an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade an the cate an the cade a
" 
Done.


## **Conclusion** ##

Well, right now the result looks very bad: it is all nonsense (well, just like the original Alice's Adventures in Wonderland anyway!). This is a very simple model, though, and it can be improved by:
- Building a more complex neural network, for example by adding another LSTM layer or more memory units to the existing layer.
- Training for more epochs (10 is very few, but already computationally costly) and with a larger batch size (which requires more memory).