# Text generation with (larger) LSTM Recurrent Neural Networks

By Alex Gascón Bononad - alexgascon.93@gmail.com

## 0. Introduction

### 0.1. Introduction to the Notebook
In this notebook we're going to finish what we started in the first notebook of this repository (#1 - Text generation with LSTM Recurrent Neural Networks). We'll keep following this tutorial: http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/ [1] to create an LSTM RNN capable of generating text.



## 1. Description of the problem

In the previous part we managed to do this, but the resulting network was too small and the generated text was hard to understand. That's why we're now going to take what we learned and tested there and scale it up to get a better working example.

Besides, in this case we're going to change the book we use to train our network: instead of Alice in Wonderland, we'll use "El ingenioso hidalgo don Quijote de la Mancha", one of the most famous books in Spanish literature. We also obtained it from [Project Gutenberg](http://www.gutenberg.org/cache/epub/2000/pg2000.txt), but you can find the version without headers and footers in the same folder as this notebook.

## 2. Develop an LSTM Recurrent Neural Network

Let's get started! 

One of the good things about this project is that most of what we have to do is exactly the same as in project #1 - Text generation with LSTM Recurrent Neural Networks, which you can also find in this repository, just at a larger scale. Therefore, most of the operations (such as the pre-processing ones) have already been explained there.

In [1]:
import sys
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

# Load the book and convert it to lowercase
filename = "El ingenioso hidalgo don Quijote de la Mancha (learning version).txt"
book = open(filename).read()
book = book.lower()

# Create mapping of unique chars to integers, and its reverse
chars = sorted(list(set(book)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))

# Summarizing the loaded data
n_chars = len(book)
n_vocab = len(chars)
print "Total Characters: ", n_chars
print "Total Vocab: ", n_vocab

# Prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []

# Iterating over the book
for i in range(0, n_chars - seq_length, 1):
    sequence_in = book[i:i + seq_length]
    sequence_out = book[i + seq_length]
    
    # Converting each char to its corresponding int
    sequence_in_int = [char_to_int[char] for char in sequence_in]
    sequence_out_int = char_to_int[sequence_out]

    # Appending the result to the current data 
    dataX.append(sequence_in_int)
    dataY.append(sequence_out_int)
n_patterns = len(dataX)
print "Total Patterns: ", n_patterns

# Reshaping X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# Normalizing
X = X / float(n_vocab)
# One hot encode the output variable
y = np_utils.to_categorical(dataY)


Using Theano backend.
Using gpu device 0: GeForce GTX 960M (CNMeM is disabled, cuDNN not available)


Total Characters:  169841
Total Vocab:  53
Total Patterns:  169741
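
To make the preprocessing above more concrete, here is a minimal sketch of the same sliding-window encoding applied to a toy string. The text `toy_text`, the window length of 3 and the helper name `make_windows` are illustrative assumptions and not part of the original notebook. The one-hot step uses plain NumPy instead of `np_utils.to_categorical`, so the sketch runs without Keras.

```python
import numpy

def make_windows(text, seq_length):
    """Return (X, y, chars) for next-character prediction on `text`."""
    chars = sorted(set(text))
    char_to_int = dict((c, i) for i, c in enumerate(chars))
    data_x, data_y = [], []
    # Each window of `seq_length` characters predicts the character after it
    for i in range(len(text) - seq_length):
        window = text[i:i + seq_length]
        target = text[i + seq_length]
        data_x.append([char_to_int[c] for c in window])
        data_y.append(char_to_int[target])
    n_vocab = len(chars)
    # Shape [samples, time steps, features], scaled to the range [0, 1)
    X = numpy.reshape(data_x, (len(data_x), seq_length, 1)) / float(n_vocab)
    # One-hot targets built by indexing rows of the identity matrix
    y = numpy.eye(n_vocab)[data_y]
    return X, y, chars

toy_text = "hola mundo"  # hypothetical stand-in for the book
X_toy, y_toy, toy_chars = make_windows(toy_text, 3)
```

The number of windows is always the text length minus the window length, which matches the figures printed above: 169841 characters with `seq_length = 100` give 169741 patterns.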


Here is where the changes start: as we saw in the previous notebook, our RNN didn't achieve great results. The output had some grammatical structure and some recognizable words, but the result was still far from what we'd expect of a good text.

In order to improve that, we're going to change the architecture of our model: instead of using a single LSTM layer, we're going to use two identical ones. 

In [2]:
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
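
To get a feeling for how much bigger this model is, the following sketch estimates its number of trainable weights. It assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights and a bias); the helper names `lstm_params` and `dense_params` are made up for this example. Dropout layers add no weights. Note that `return_sequences=True` makes the first LSTM emit its 256 outputs at every time step, which is why the second LSTM sees 256 input features.

```python
def lstm_params(units, input_dim):
    """Trainable weights of a standard LSTM layer: 4 gates, each with
    input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    """Weights plus biases of a fully connected layer."""
    return inputs * outputs + outputs

n_vocab = 53  # vocabulary size printed earlier in the notebook

first_lstm = lstm_params(256, 1)        # 1 feature per time step
second_lstm = lstm_params(256, 256)     # 256 features from the first LSTM
output_layer = dense_params(256, n_vocab)
total = first_lstm + second_lstm + output_layer
```

Under these assumptions the total comes to 803,125 trainable weights, roughly two thirds of them in the second LSTM layer. Calling `model.summary()` on the real model is the way to confirm the actual figure.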


However, this brings quite a big problem: if a neural network with a single LSTM layer took several hours to train, it's reasonable to assume that this one may take a few days (_note from the future: on my GTX 960M, each epoch took about 25 minutes_). So that we can work without worrying, we'll prepare our checkpoint snippet, which lets us resume training from wherever we left off.

In [3]:
# Starting from a checkpoint (if we set one)
checkpoint_file = "weights-improvement-04-1.4611.hdf5"
if checkpoint_file:
    model.load_weights(checkpoint_file)

# Epochs already run (summed across training sessions) and epochs still left out of 50
epochs_run = 10 + 15 + 8 + 3 + 6 + 5
epochs_left = 50 - epochs_run

# Define the checkpoint callback: save the weights whenever the loss improves
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
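
After several training sessions the folder fills up with checkpoint files, and the epoch number in each name starts over with every session, so the loss is the more useful part of the filename. The sketch below shows one possible way to pick the file with the lowest recorded loss. The helper `best_checkpoint` and the example file list are hypothetical; only the naming pattern comes from the `filepath` template used above, and "lowest loss" is just one assumption about which checkpoint to resume from.

```python
import re

# Same naming pattern as the `filepath` template: epoch, then loss
CHECKPOINT_RE = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    """Return the checkpoint filename with the lowest recorded loss, or None."""
    best_name, best_loss = None, None
    for name in filenames:
        match = CHECKPOINT_RE.search(name)
        if match is None:
            continue  # not a checkpoint file
        loss = float(match.group(2))
        if best_loss is None or loss < best_loss:
            best_name, best_loss = name, loss
    return best_name

# Hypothetical directory listing; in practice it could come from os.listdir(".")
example_files = [
    "weights-improvement-04-1.4611.hdf5",
    "weights-improvement-03-1.4625.hdf5",
    "notes.txt",
]
resume_from = best_checkpoint(example_files)
```

The returned name could then be passed to `model.load_weights`, instead of typing the filename by hand before each session.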


You may have noticed that we're now aiming for 50 epochs in total. This is another way of letting our RNN learn more from the training examples, and it should let us achieve better results than with only 20 epochs.


We've finally run out of things to explain, so let's get to the part everyone was waiting for: it's time to stop talking and start training!

![Let's get ready](https://media0.giphy.com/media/XTggTtrOfx2p2/200.gif#7)

In [4]:
# Compiling the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Fitting the model
model.fit(X, y, nb_epoch=epochs_left, batch_size=64, callbacks=callbacks_list)


Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14

(Output truncated to avoid excessive unnecessary repetition)

## 3. Conclusion

After what has probably been a very, very long training run, we finally have some good weights to use! In my case, after 50 epochs I reached a loss of 1.4625. Now it's time to use our model and those weights to predict results.
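
As a rough way to read that loss figure: categorical cross-entropy is measured with natural logarithms, so it can be converted into a perplexity and compared with a model that guesses uniformly among the 53 characters. The variable names below are illustrative. Bear in mind that this is the training loss, computed with dropout active, so it is only an approximate indicator of quality.

```python
import math

final_loss = 1.4625   # training loss reported above
n_vocab = 53          # vocabulary size printed earlier in the notebook

# Perplexity: the effective number of equally likely choices per character
perplexity = math.exp(final_loss)

# Geometric-mean probability assigned to the correct next character
mean_correct_prob = math.exp(-final_loss)

# Loss of a model that guesses uniformly over the whole vocabulary
uniform_loss = math.log(n_vocab)
```

This gives a perplexity of about 4.3, meaning that on average the network is about as uncertain as if it were choosing among four or five characters instead of 53. A uniform guesser would have a loss of about 3.97.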

The following steps are the same ones we used in notebook #1, so they don't need a very detailed explanation: we'll start by loading the weights, then we'll choose a random sequence from our book to use as the seed for our RNN, and finally we'll predict the output characters one at a time.

In [5]:
# Load the network weights
filename = "weights-improvement-03-1.4625.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Pick a random seed (copied, so that dataX itself is not modified below)
start = numpy.random.randint(0, len(dataX)-1)
pattern = list(dataX[start])
seed = ''.join([int_to_char[value] for value in pattern])
print "Seed:"
print "\"", seed, "\""
result_str = ""

# Generate characters
for i in range(500):
    # Shape the current window as [samples, time steps, features] and normalize it
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    # Predict the next character and keep the most likely one
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result_str += int_to_char[index]
    # Slide the window: append the new character and drop the oldest one
    pattern.append(index)
    pattern = pattern[1:]
print "\nDone."

Seed:
"  hincó de rodillas ante él, diciéndole:

-no me levantaré jamás de donde estoy, valeroso caball "

Done.


Finally, we'll output the result and see if we get what we expected.

In [6]:
print seed + result_str

 hincó de rodillas ante él, diciéndole:

-no me levantaré jamás de donde estoy, valeroso caballero, que estaba diciendo:

-ste que se dice -dijo el cura-, y el cual se dejare a su amo de la mancha, estaba diciendo:

-ste que se dice -dijo el cura-, y el cual se dejare a su amo de la mancha, estaba diciendo:

-ste que se dice -dijo el cura-, y el cual se dejare a su amo de la mancha, estaba diciendo:

-ste que se dice -dijo el cura-, y el cual se dejare a su amo de la mancha, estaba diciendo:

-ste que se dice -dijo el cura-, y el cual se dejare a su amo de la mancha, estaba dici


We can see that the result is much better than the one from the previous notebook: almost all the words make sense, the punctuation marks are correctly placed, the dashes are correctly opened and closed... It really looks like a real dialogue!

However, it has an obvious problem: the same sentence repeats over and over. The cause is still unknown (maybe the neural network has overfit the training examples), but we're definitely getting closer to a fully functional text generator!
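
One thing that can be said for certain is that the generation loop is deterministic: it always keeps the most likely character (`numpy.argmax`), so as soon as a 100-character window repeats, the output can only cycle. A commonly suggested alternative, which has not been tested in this notebook, is to sample the next character from the predicted distribution, optionally rescaled by a "temperature". The function name `sample_with_temperature`, the temperature value and the toy distribution below are assumptions made for this sketch.

```python
import numpy

def sample_with_temperature(probabilities, temperature=1.0, rng=None):
    """Draw one index from `probabilities` after rescaling by `temperature`.

    Low temperatures approach argmax; high ones flatten the distribution.
    """
    if rng is None:
        rng = numpy.random.RandomState()
    probs = numpy.asarray(probabilities, dtype="float64").ravel()
    # Rescale in log space; the small constant avoids log(0)
    logits = numpy.log(probs + 1e-12) / temperature
    logits -= logits.max()  # for numerical stability
    rescaled = numpy.exp(logits)
    rescaled /= rescaled.sum()
    return int(rng.choice(len(rescaled), p=rescaled))

# Toy distribution standing in for the output of model.predict(x)
toy_prediction = numpy.array([[0.6, 0.3, 0.1]])
rng = numpy.random.RandomState(0)
draws = [sample_with_temperature(toy_prediction, 0.8, rng) for _ in range(1000)]
```

In the generation loop, the line `index = numpy.argmax(prediction)` could be swapped for `index = sample_with_temperature(prediction, 0.8)`. Whether this actually removes the repetition for this model, and which temperature works best, would have to be checked experimentally.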

As this problem may be too complex to tackle in just a few lines, we'll close this notebook for now and try to find the cause and a fix in the future. I hope the tutorial has been useful!