# Text generation with LSTM Recurrent Neural Networks

By Alex Gascón Bononad - alexgascon.93@gmail.com

## 0. Introduction

### 0.1. Introduction to the Notebook
In this notebook we'll work through the following tutorial: http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/ [1]

As the title states, we're going to use Python and Keras to create an LSTM Recurrent Neural Network for text generation. We'll start by following what the tutorial specifies, and then we'll try some variations that let us experiment and get different results.


### 0.2. Introduction to the concepts

There are several concepts that will be useful to know during this tutorial.

#### Text generation
Text Generation or Natural language generation (NLG) is the natural language processing task of generating natural language from a machine representation system. It could be said an NLG system is like a translator that converts data into a natural language representation. 

NLG may be viewed as the opposite of natural language understanding: whereas in natural language understanding the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to make decisions about how to put a concept into words.

_More info:_
https://en.wikipedia.org/wiki/Natural_language_generation [2]

#### LSTM Recurrent Neural Networks
LSTM stands for Long Short-Term Memory, a type of Recurrent Neural Network capable of learning long-term dependencies. LSTMs are explicitly designed to mitigate the long-term dependency problem, so they are much better than plain RNNs at retaining information over long periods of time.

_More info:_ <br />
http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf [3] <br />
http://colah.github.io/posts/2015-08-Understanding-LSTMs/ [4]
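
For reference, one common way of writing a single LSTM step is sketched below, following the notation of [4] (the variant with a forget gate). The symbols are assumptions of this sketch: $x_t$ is the input at time step $t$, $h_t$ the hidden state, $C_t$ the cell state, $\sigma$ the sigmoid function and $\odot$ element-wise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{aligned}
$$

The additive update of $C_t$ is what allows information to be carried across many time steps.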



## 1. Description of the problem
As we've previously explained, the purpose of this notebook is to generate text using a Neural Network. 

To do this we need a large amount of text to train the NN; in this case we've chosen the book _Alice's Adventures in Wonderland_, by Lewis Carroll. It's no longer under copyright, and we can find it on [Project Gutenberg](https://www.gutenberg.org/). Specifically, we're going to use the plain-text version of the book.

First of all, we'll remove the header and footer of the book, as they contain generic information about Project Gutenberg that we don't want to include in the training set. This doesn't mean that it isn't important (in fact, as we're using text they've hosted and distributed, the least we can do for the project is to read this information), only that we don't want to train on it, as it could noticeably affect our output. You can find the edited .txt file in the same directory as this notebook, with the name "Alice's adventures in Wonderland.txt".
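
As a rough sketch of how this trimming could be automated instead of done by hand: Project Gutenberg plain-text files usually wrap the book between lines that begin with `*** START OF` and `*** END OF`. The helper name `strip_gutenberg` and the exact marker strings are assumptions made for illustration, so check them against the file you download.

```python
def strip_gutenberg(raw_text, start_marker="*** START OF", end_marker="*** END OF"):
    """Return the text between the start and end marker lines.

    The marker strings are assumptions; if either one is missing,
    the text is returned unchanged.
    """
    start = raw_text.find(start_marker)
    end = raw_text.find(end_marker)
    if start == -1 or end == -1 or end <= start:
        return raw_text
    # Skip the remainder of the start-marker line itself
    body_start = raw_text.find("\n", start)
    if body_start == -1 or body_start > end:
        return raw_text
    return raw_text[body_start:end].strip()


sample = (
    "Project Gutenberg licence text\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ***\n"
    "Alice was beginning to get very tired\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK ***\n"
    "More licence text\n"
)
print(strip_gutenberg(sample))
```

The file shipped with this notebook was edited manually, so this helper is only an alternative and not the procedure actually used here.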


## 2. Develop a Small LSTM Recurrent Neural Network

Now let's start coding! The first steps are to load the libraries we'll need in this notebook and to read the text file. We'll also convert the text to lowercase, which reduces the number of distinct characters the network has to learn.

In [1]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


In [2]:
# Reading the book and converting to lowercase
filename = "Alice's adventures in Wonderland.txt"
# The with-block closes the file as soon as it has been read
with open(filename) as f:
    book = f.read()
book = book.lower()

Now we'll prepare the data to be modelled by the Neural Network. To do this we'll map each unique character to an integer: its index in the sorted list of characters that appear in the book (not its ASCII code).

In [3]:
# Creating a sorted list where every char in the book appears just once
chars = sorted(list(set(book)))
# Getting a dictionary that represents the mapping of every char (key) to its corresponding int (value)
char_to_int = dict((c, i) for i, c in enumerate(chars))
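
To make the mapping concrete, here is a small sketch on a toy string. The names prefixed with `toy_` and the reverse dictionary `toy_int_to_char` are assumptions of this example (chosen so they don't overwrite the notebook's own variables). Note that the integers are positions in the sorted character list, not character codes.

```python
toy_text = "hello world"

# Sorted list of the distinct characters in the toy text
toy_chars = sorted(set(toy_text))
# Each character maps to its position in that sorted list
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
# Reverse mapping, useful later for turning predictions back into text
toy_int_to_char = {i: c for c, i in toy_char_to_int.items()}

toy_encoded = [toy_char_to_int[c] for c in toy_text]
toy_decoded = "".join(toy_int_to_char[i] for i in toy_encoded)

print(toy_chars)
print(toy_encoded)
print(toy_decoded)
```

Here "h" is encoded as 3 because it is the fourth distinct character in sorted order, whereas `ord("h")` would give 104.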

Up to this point, we have the following data:

In [4]:
print("Total characters in the book: ", len(book))
print("Unique characters in the book: ", len(chars))

Total characters in the book:  144586
Unique characters in the book:  46


In this step we'll start to define the training data for our LSTM RNN. Following the original tutorial, we'll split the whole book into sequences of 100 characters and take the character that follows each sequence as its expected output.

We will slide this 100-character window along the whole book one character at a time, so each character appears in up to 100 different input windows (fewer for the characters near the start and end of the book) and serves as the expected output once (with the exception of the first 100 characters).

To put it more simply: if we used a window of length 5 instead of 100, these would be some of the training pairs:

chapt --> e <br/>
hapte --> r <br/>
wonde --> r <br/>
onder --> l <br/>
nderl --> a
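
The length-5 window above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper name `sliding_pairs` is hypothetical and the input is a single toy word, not the book.

```python
def sliding_pairs(text, window):
    """Return (input, target) pairs: `window` characters and the one that follows."""
    return [(text[i:i + window], text[i + window])
            for i in range(len(text) - window)]


toy_pairs = sliding_pairs("chapter", 5)
for seq_in, seq_out in toy_pairs:
    print(seq_in, "-->", seq_out)
```

A text of 7 characters with a window of 5 yields 7 - 5 = 2 pairs, because the window has to leave room for one target character after it.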

In [10]:
# Preparing the dataset of input-output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []

# Iterating over the book
for i in range(0, len(book) - seq_length):
    sequence_in = book[i : i + seq_length]
    sequence_out = book[i + seq_length]
    
    # Converting each char to its corresponding int
    sequence_in_int = [char_to_int[char] for char in sequence_in]
    sequence_out_int = char_to_int[sequence_out]
    
    # Appending the result to the current data 
    dataX.append(sequence_in_int)
    dataY.append(sequence_out_int)
    
print("Total patterns: ", len(dataX))

Total patterns:  144486


As we can see, the total number of input patterns is exactly the total number of characters in the text minus 100. This is no coincidence: there is one pattern for each character to be predicted, and the first 100 characters (the value of our seq_length) can't be predicted because they don't have enough preceding characters.
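
The count can be checked with plain arithmetic, using the figures printed above. The variable names here are assumptions of this sketch.

```python
total_chars = 144586   # "Total characters in the book" printed earlier
window = 100           # the value of seq_length

# The loop visits start positions 0 .. total_chars - window - 1
expected_patterns = len(range(0, total_chars - window))
print(expected_patterns)
```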

The next step will be to transform the training data to make it suitable for its use in Keras.

We'll do this in three steps. The first is to transform the list of input sequences into the form _[samples, time steps, features]_ expected by an LSTM layer in Keras:
- **Samples**: the number of input sequences (here, one per pattern).
- **Time steps**: the length of each input sequence, i.e. how many consecutive characters the network reads for one sample (here seq_length, 100).
- **Features**: the number of values observed at each time step (here 1, the integer code of a single character).

In [11]:
# Reshaping X to be [samples, time_steps, features]
# Each time step carries one feature: the integer code of a character
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
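
To see what the reshape does, here is a minimal sketch on toy data: three integer-encoded windows of length 4 with arbitrary values. The `toy_` names are assumptions of this example.

```python
import numpy as np

# Three toy integer-encoded windows of length 4 (values are arbitrary)
toy_dataX = [[0, 1, 2, 3],
             [1, 2, 3, 4],
             [2, 3, 4, 5]]

# 3 samples, 4 time steps, 1 feature per time step
toy_X = np.reshape(toy_dataX, (len(toy_dataX), 4, 1))
print(toy_X.shape)
print(toy_X[0].ravel())
```

The values themselves are unchanged; each integer is simply wrapped in its own length-1 feature axis.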

The next step is to normalize the inputs to the range 0 to 1, which makes the patterns easier for the LSTM network to learn, since its gates use sigmoid-type activation functions by default.

In [12]:
# Normalizing between 0 and 1
X = X / float(len(chars))
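
As a quick check of the resulting range, the sketch below scales every possible integer code for a vocabulary of 46 characters (the count reported earlier). The variable names are assumptions of this example.

```python
import numpy as np

n_vocab = 46                      # number of unique characters reported earlier
codes = np.arange(n_vocab)        # every possible integer code: 0 .. 45
scaled = codes / float(n_vocab)   # same operation as applied to X

print(scaled.min(), scaled.max())
```

Because the largest code is 45, the scaled values start at 0 and stay strictly below 1.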

Finally, we need to convert the output patterns into a one-hot encoding. In this encoding each value is represented by a group of bits in which exactly one bit is 1 and all the others are 0.

Let's see some examples of binary to one-hot conversions:

_000 --> 00000001_ <br/>
_001 --> 00000010_ <br/>
_010 --> 00000100_ <br/>
_011 --> 00001000_ <br/>
_100 --> 00010000_ <br/>
_101 --> 00100000_ <br/>
_110 --> 01000000_ <br/>
_111 --> 10000000_

In our case, the one-hot encoding will be an array of 46 items (the number of unique characters in our text) in which exactly one item is 1. For example, the value of "n" (integer value 31) will be the following one:

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0. 0.  0.  0.  0.  0.  0.  0.  0.]

In [13]:
# One hot encode the output variable
y = np_utils.to_categorical(dataY)
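
For readers who want to see the encoding without Keras, the sketch below builds the same kind of matrix with NumPy. The helper `one_hot` is hypothetical; it assumes, like `to_categorical` with its default arguments, that the number of classes is the largest label plus one.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Return a 2-D array with one row per label and a single 1 in each row."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        # Assumption: classes are numbered 0 .. max(labels)
        num_classes = labels.max() + 1
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[labels]


toy_y = one_hot([0, 2, 1, 2])
print(toy_y)
```

Taking `argmax` along each row recovers the original integer labels, which is how predictions can later be mapped back to characters.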