<a href="https://colab.research.google.com/github/ShaunakSen/Deep-Learning/blob/master/Text_Generation_With_LSTM_Recurrent_Neural_Networks_in_Python_with_Keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Generation With LSTM Recurrent Neural Networks in Python with Keras

[tutorial link](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/)

Recurrent neural networks can also be used as generative models.

This means that, in addition to being used for predictive models (making predictions), they can learn the sequences of a problem and then generate entirely new plausible sequences for the problem domain.

Generative models like this are useful not only to study how well a model has learned a problem, but to learn more about the problem domain itself.

In this post you will discover how to create a generative model for text, character-by-character using LSTM recurrent neural networks in Python with Keras.


### Problem Description: Project Gutenberg


We are going to learn the dependencies between characters and the conditional probabilities of characters in sequences so that we can in turn generate wholly new and original sequences of characters.
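As a rough intuition for "conditional probabilities of characters", the sketch below estimates P(next character | current character) from simple pair counts. This is a deliberately simplified stand-in, not the LSTM approach used in this notebook: the LSTM conditions on a much longer window of preceding characters. The function name `next_char_probabilities` and the toy string are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def next_char_probabilities(text):
    """Estimate P(next character | current character) from adjacent pairs."""
    pair_counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        pair_counts[current][following] += 1
    return {
        current: {ch: n / sum(counts.values()) for ch, n in counts.items()}
        for current, counts in pair_counts.items()
    }

# Toy string chosen only for illustration; it is not the book text.
probs = next_char_probabilities("abab ac")
print(probs["a"])
```

In the toy string, "a" is followed by "b" twice and by "c" once, so the estimate for "b" after "a" is two thirds.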

These experiments are not limited to text. You can also experiment with other ASCII data, such as computer source code, marked-up documents in LaTeX, HTML or Markdown, and more.

### Develop a Small LSTM Recurrent Neural Network

In this section we will develop a simple LSTM network to learn sequences of characters from Alice in Wonderland. In the next section we will use this model to generate new sequences of characters.

Let’s start off by importing the classes and functions we intend to use to train our model.



In [0]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Next, we need to load the ASCII text for the book into memory and convert all of the characters to lowercase to reduce the vocabulary that the network must learn.



In [0]:
# load ASCII text and convert to lowercase

filename = 'alice_data'

# use a context manager so the file handle is closed after reading
with open(filename) as f:
    raw_text = f.read()
raw_text = raw_text.lower()


Now that the book is loaded, we must prepare the data for modeling by the neural network. We cannot model the characters directly; instead, we must convert the characters to integers.

We can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.

In [31]:
chars = sorted(list(set(raw_text)))


chars_to_int = dict((c,i) for i,c in enumerate(chars))

print (chars_to_int)

{'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, '(': 5, ')': 6, '*': 7, ',': 8, '-': 9, '.': 10, '0': 11, '3': 12, ':': 13, ';': 14, '?': 15, '[': 16, ']': 17, '_': 18, 'a': 19, 'b': 20, 'c': 21, 'd': 22, 'e': 23, 'f': 24, 'g': 25, 'h': 26, 'i': 27, 'j': 28, 'k': 29, 'l': 30, 'm': 31, 'n': 32, 'o': 33, 'p': 34, 'q': 35, 'r': 36, 's': 37, 't': 38, 'u': 39, 'v': 40, 'w': 41, 'x': 42, 'y': 43, 'z': 44}


You can see that there may be some characters that we could remove to further clean up the dataset, which would reduce the vocabulary and may improve the modeling process.

Now that the book has been loaded and the mapping prepared, we can summarize the dataset.

In [32]:
n_chars = len(raw_text)
n_vocab = len(chars)
print ("Total Characters: ", n_chars)
print ("Total Vocab: ", n_vocab)


Total Characters:  144408
Total Vocab:  45


Each training pattern of the network consists of 100 time steps of one character each (X), followed by one output character (y). When creating these sequences, we slide this window along the whole book one character at a time, giving each character a chance to be learned from the 100 characters that preceded it (except, of course, the first 100 characters).

For example, if the sequence length is 5 (for simplicity) then the first two training patterns would be as follows:

```
CHAPT -> E
HAPTE -> R
```

As we split up the book into these sequences, we convert the characters to integers using our lookup table we prepared earlier.


Basic idea of the data we want:


In [33]:
seq_length = 10


for i in range(0, 5):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	print (seq_in, seq_out)


alice's ad v
lice's adv e
ice's adve n
ce's adven t
e's advent u


So the `X` data should look like `["alice's ad", "lice's adv", ...]` and `Y` should look like `["v", "e", ...]`. However, `X` and `Y` should not hold the characters themselves; they should hold the integer representations of the characters.


Extracting a small part of the dataset for experimentation:

In [34]:
raw_text_sub = raw_text[:100]

print(raw_text_sub)

alice's adventures in wonderland

lewis carroll

the millennium fulcrum edition 3.0




chapter i. d


In [35]:
text_to_process = raw_text
seq_length = 100

dataX = []

dataY = []

n_chars = len(text_to_process)

for i in range (0, n_chars-seq_length):
  
  seq_in = text_to_process[i:i+seq_length]
  
  seq_out = text_to_process[i+seq_length]
  
  # print (seq_in, seq_out) 
  
  dataX.append([chars_to_int[char] for char in seq_in])
  
  dataY.append(chars_to_int[seq_out])
  
n_patterns = len(dataX)
print ("Total Patterns: ", n_patterns)
  

Total Patterns:  144308


In [0]:
# test to see if result matches with tutorial's data 
# print (dataX == dataX_new and dataY == dataY_new)

In [37]:
print (len(dataX), len(dataY))

print (len(dataX[0]))

144308 144308
100


Running the code to this point shows that, when we split the dataset into training data for the network to learn from, we have just under 150,000 training patterns. This makes sense: excluding the first 100 characters, we have one training pattern to predict each of the remaining characters.

Now that we have prepared our training data we need to transform it so that it is suitable for use with Keras.

First we must transform the list of input sequences into the form **[samples, time steps, features]** expected by an LSTM network.

Next we need to rescale the integers to the range 0-to-1 to make the patterns easier to learn by the LSTM network, whose gates use a sigmoid-style activation by default.

Finally, we need to convert the output patterns (single characters converted to integers) into a one-hot encoding. This lets us configure the network to predict the probability of each of the 45 different characters in the vocabulary (an easier representation) rather than trying to force it to predict precisely the next character. Each y value is converted into a sparse vector of length 45, full of zeros except for a 1 in the column for the character (integer) that the pattern represents.

For example, with the mapping printed above, when “n” (integer value 32) is one-hot encoded it looks as follows:

```
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.]
```
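As a rough illustration of what one-hot encoding does, independent of Keras, here is a minimal NumPy sketch. The helper name `one_hot` is hypothetical (it is not part of the notebook or of Keras), and the class count of 45 is taken from the vocabulary size reported earlier.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return a (len(indices), num_classes) array with a single 1 per row."""
    indices = np.asarray(indices)
    encoded = np.zeros((indices.size, num_classes), dtype=np.float32)
    # Set the column given by each index to 1, one row per index.
    encoded[np.arange(indices.size), indices] = 1.0
    return encoded

# Two example integer codes; the values are arbitrary illustrations.
sample = one_hot([32, 0], num_classes=45)
print(sample.shape)
```

Each row sums to 1, and the position of the 1 recovers the original integer code via `argmax`.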
  
We can implement these steps as below.

In [42]:
print (len(dataX))

print(len(dataX[0]))

print (n_vocab)

144308
100
45


In [0]:
# [samples, time steps, features]

X = numpy.reshape(dataX, (n_patterns, seq_length, 1))

# normalize

X = X/float(n_vocab)

# one hot encode the output variable

y = np_utils.to_categorical(dataY)

In [50]:
print (X.shape)

print (y.shape)

print (y[0])

print (X[0][:5])

(144308, 100, 1)
(144308, 45)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[[0.42222222]
 [0.66666667]
 [0.6       ]
 [0.46666667]
 [0.51111111]]
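
To make the reshape and rescaling steps concrete, here is a minimal sketch on toy data. The three short patterns and the `toy_` names are assumptions made for illustration; only the vocabulary size of 45 comes from the notebook output.

```python
import numpy as np

# Hypothetical toy data: 3 patterns of 4 integer-encoded characters each.
toy_patterns = [[19, 30, 27, 21], [30, 27, 21, 23], [27, 21, 23, 4]]
toy_vocab_size = 45  # vocabulary size reported earlier in the notebook

# Reshape to [samples, time steps, features], with one feature per time step.
toy_X = np.reshape(toy_patterns, (len(toy_patterns), 4, 1))

# Divide by the vocabulary size so every value falls in the range [0, 1).
toy_X = toy_X / float(toy_vocab_size)

print(toy_X.shape)
```

Because the largest integer code is one less than the vocabulary size, the rescaled values are always strictly below 1.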


We can now define our LSTM model. Here we define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 20%. The output layer is a Dense layer using the softmax activation function to output, for each of the 45 characters, a predicted probability between 0 and 1.


The problem is really a single-character classification problem with 45 classes, and as such is defined as optimizing the log loss (cross entropy), here using the Adam optimization algorithm for speed.
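
To give a feel for the log loss, the sketch below computes categorical cross entropy by hand with NumPy. It is an illustration only, not how Keras implements the loss internally; the function name `categorical_cross_entropy` and the uniform-guess baseline are assumptions added here, with the class count of 45 taken from the vocabulary size reported earlier.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean log loss between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

num_classes = 45
y_true = np.zeros((1, num_classes))
y_true[0, 32] = 1.0  # arbitrary target class for illustration

# A model that knows nothing assigns equal probability to every class.
uniform_guess = np.full((1, num_classes), 1.0 / num_classes)
baseline_loss = categorical_cross_entropy(y_true, uniform_guess)
print(round(baseline_loss, 4))  # ln(45), roughly 3.8067
```

A uniform guess gives a loss of ln(45), which is a useful reference point when reading the per-epoch loss values in the training log: anything below it means the network has learned something about the character distribution.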




In [51]:
# define the LSTM model

model = Sequential()

model.add(LSTM(units=256, input_shape = (X.shape[1], X.shape[2])))

model.add(Dropout(rate=0.2))

model.add(Dense(y.shape[1], activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')



There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.

We are not interested in the most accurate (classification accuracy) model of the training dataset. This would be a model that predicts each character in the training dataset perfectly. Instead we are interested in a generalization of the dataset that minimizes the chosen loss function. We are seeking a balance between generalization and overfitting but short of memorization.

The network is slow to train (about 300 seconds per epoch on an Nvidia K520 GPU). Because of the slowness and because of our optimization requirements, we will use model checkpointing to record all of the network weights to file each time an improvement in loss is observed at the end of the epoch. We will use the best set of weights (lowest loss) to instantiate our generative model in the next section.



In [0]:
# define the checkpoint

filepath = 'weights-improvement-{epoch:02d}-{loss:.4f}.hdf5'

checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')

callbacks_list = [checkpoint]


We can now fit our model to the data. Here we use a modest number of 20 epochs and a large batch size of 128 patterns.



In [54]:
model.fit(x=X, y=y, batch_size=128, epochs=20, callbacks=callbacks_list)

Epoch 1/20

Epoch 00001: loss improved from inf to 2.95602, saving model to weights-improvement-01-2.9560.hdf5
Epoch 2/20

Epoch 00002: loss improved from 2.95602 to 2.74594, saving model to weights-improvement-02-2.7459.hdf5
Epoch 3/20

Epoch 00003: loss improved from 2.74594 to 2.64284, saving model to weights-improvement-03-2.6428.hdf5
Epoch 4/20

Epoch 00004: loss improved from 2.64284 to 2.57104, saving model to weights-improvement-04-2.5710.hdf5
Epoch 5/20

Epoch 00005: loss improved from 2.57104 to 2.51109, saving model to weights-improvement-05-2.5111.hdf5
Epoch 6/20

Epoch 00006: loss improved from 2.51109 to 2.45242, saving model to weights-improvement-06-2.4524.hdf5
Epoch 7/20

Epoch 00007: loss improved from 2.45242 to 2.39778, saving model to weights-improvement-07-2.3978.hdf5
Epoch 8/20

Epoch 00008: loss improved from 2.39778 to 2.34780, saving model to weights-improvement-08-2.3478.hdf5
Epoch 9/20

Epoch 00009: loss improv

<keras.callbacks.History at 0x7f312d6dcdd8>

You will see different results because of the stochastic nature of the model, and because it is hard to fix the random seed for LSTM models to get 100% reproducible results. This is not a concern for this generative model.

After running the example, you should have a number of weight checkpoint files in the local directory.

You can delete them all except the one with the smallest loss value. For example, when I ran this example, below was the checkpoint with the smallest loss that I achieved.

`content/weights-improvement-20-1.9205.hdf5`
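
Picking the best checkpoint by eye works, but it can also be done from the filenames, since the loss is embedded in each name by the `filepath` template. The following is a minimal sketch under that assumption; the helper `best_checkpoint` and the regular expression are hypothetical additions, and the example names are taken from the training log above.

```python
import re

# Matches names produced by 'weights-improvement-{epoch:02d}-{loss:.4f}.hdf5'.
CHECKPOINT_PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    """Return the filename with the lowest loss encoded in its name, or None."""
    scored = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    return min(scored)[1] if scored else None

example_files = [
    "weights-improvement-01-2.9560.hdf5",
    "weights-improvement-02-2.7459.hdf5",
    "weights-improvement-20-1.9205.hdf5",
]
print(best_checkpoint(example_files))
```

In a real run, the list could come from `os.listdir(".")` instead of being typed out; files that do not match the pattern are ignored.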

The network loss decreased almost every epoch and I expect the network could benefit from training for many more epochs.

In the next section we will look at using this model to generate new text sequences.

### Generating Text with an LSTM Network

Generating text using the trained LSTM network is relatively straightforward.

Firstly, we load the data and define the network in exactly the same way, except the network weights are loaded from a checkpoint file and the network does not need to be trained.


In [0]:
# load the network weights

filename = 'weights-improvement-20-1.9205.hdf5'

model.load_weights(filepath=filename)

model.compile(loss='categorical_crossentropy', optimizer='adam')



When preparing the mapping of unique characters to integers, we must also create a reverse mapping that we can use to convert the integers back to characters so that we can understand the predictions.
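
A minimal sketch of such a reverse mapping, shown on a toy string so that it runs on its own. The `toy_` names are assumptions for illustration; in the notebook the same idea would be applied to the `chars` list built from `raw_text`.

```python
# Toy text used only for illustration; the notebook builds `chars` from raw_text.
toy_chars = sorted(set("alice"))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}

# Invert the mapping so integer predictions can be turned back into characters.
toy_int_to_char = {i: c for c, i in toy_char_to_int.items()}

encoded = [toy_char_to_int[c] for c in "alice"]
decoded = "".join(toy_int_to_char[i] for i in encoded)
print(encoded, decoded)
```

Encoding and then decoding returns the original string, which is a quick sanity check that the two dictionaries are consistent.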

