<a href="https://colab.research.google.com/github/ShaunakSen/Deep-Learning/blob/master/Text_Generation_With_LSTM_Recurrent_Neural_Networks_in_Python_with_Keras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Generation With LSTM Recurrent Neural Networks in Python with Keras

[tutorial link](https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/)

Recurrent neural networks can also be used as generative models.

This means that, in addition to being used as predictive models (making predictions), they can learn the sequences of a problem and then generate entirely new, plausible sequences for the problem domain.

Generative models like this are useful not only to study how well a model has learned a problem, but to learn more about the problem domain itself.

In this post you will discover how to create a generative model for text, character-by-character using LSTM recurrent neural networks in Python with Keras.


### Problem Description: Project Gutenberg


We are going to learn the dependencies between characters and the conditional probabilities of characters in sequences so that we can in turn generate wholly new and original sequences of characters.

These experiments are not limited to text; you can also experiment with other ASCII data, such as computer source code or marked-up documents in LaTeX, HTML, or Markdown.

### Develop a Small LSTM Recurrent Neural Network

In this section we will develop a simple LSTM network to learn sequences of characters from Alice in Wonderland. In the next section we will use this model to generate new sequences of characters.

Let’s start off by importing the classes and functions we intend to use to train our model.



In [0]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Next, we need to load the ASCII text for the book into memory and convert all of the characters to lowercase to reduce the vocabulary that the network must learn.



In [0]:
# load ASCII text and convert to lowercase

filename = 'alice_data'

# read the whole book; the with-block closes the file afterwards
with open(filename) as f:
    raw_text = f.read()
raw_text = raw_text.lower()


Now that the book is loaded, we must prepare the data for modeling by the neural network. We cannot model the characters directly; instead, we must convert the characters to integers.

We can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.

In [31]:
chars = sorted(list(set(raw_text)))


chars_to_int = dict((c,i) for i,c in enumerate(chars))

print (chars_to_int)

{'\n': 0, ' ': 1, '!': 2, '"': 3, "'": 4, '(': 5, ')': 6, '*': 7, ',': 8, '-': 9, '.': 10, '0': 11, '3': 12, ':': 13, ';': 14, '?': 15, '[': 16, ']': 17, '_': 18, 'a': 19, 'b': 20, 'c': 21, 'd': 22, 'e': 23, 'f': 24, 'g': 25, 'h': 26, 'i': 27, 'j': 28, 'k': 29, 'l': 30, 'm': 31, 'n': 32, 'o': 33, 'p': 34, 'q': 35, 'r': 36, 's': 37, 't': 38, 'u': 39, 'v': 40, 'w': 41, 'x': 42, 'y': 43, 'z': 44}


Some of the characters in this mapping could be removed to further clean up the dataset, which would reduce the vocabulary and may improve the modeling process.
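One way to decide which characters are worth removing is to look at how often each one occurs. The sketch below is an illustration only: `sample_text`, `to_drop`, and `cleaned` are hypothetical names, the sample string stands in for `raw_text`, and the choice of symbols to drop is an assumption rather than a step from the tutorial.

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be `raw_text`.
sample_text = "alice [was] beginning *to* get very tired of sitting by her sister"

counts = Counter(sample_text)

# List the least frequent characters first; rare symbols are candidates for removal.
for char, count in sorted(counts.items(), key=lambda kv: kv[1])[:5]:
    print(repr(char), count)

# Assumed cleanup: drop a hand-picked set of symbols and compare vocabulary sizes.
to_drop = set("[]*")
cleaned = "".join(c for c in sample_text if c not in to_drop)
print(len(set(sample_text)), "->", len(set(cleaned)))
```

Any cleanup like this has to happen before the character-to-integer mapping is built, since the mapping depends on the set of distinct characters.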

Now that the book has been loaded and the mapping prepared, we can summarize the dataset.

In [32]:
n_chars = len(raw_text)
n_vocab = len(chars)
print ("Total Characters: ", n_chars)
print ("Total Vocab: ", n_vocab)


Total Characters:  144408
Total Vocab:  45


Each training pattern for the network consists of 100 time steps of one character each (X), followed by one output character (y). When creating these sequences, we slide this window along the whole book one character at a time, so each character gets a chance to be learned from the 100 characters that preceded it (except, of course, the first 100 characters).

For example, if the sequence length is 5 (for simplicity) then the first two training patterns would be as follows:

```
CHAPT -> E
HAPTE -> R
```

As we split up the book into these sequences, we convert the characters to integers using the lookup table we prepared earlier.
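The `CHAPT -> E` example can be reproduced on a toy string. This is a minimal sketch, not the notebook's own cell: `toy_text`, `toy_len`, and `toy_map` are hypothetical names, and the mapping is built from the toy string alone, so its integer codes differ from those of the book.

```python
# Toy illustration only: a 7-character text and a window of 5.
toy_text = "CHAPTER"
toy_len = 5

# Map each distinct character to an integer, in sorted order.
toy_map = {c: i for i, c in enumerate(sorted(set(toy_text)))}

pairs = []    # (input window, next character) as text
encoded = []  # the same pairs as integer codes
for i in range(len(toy_text) - toy_len):
    seq_in = toy_text[i:i + toy_len]
    seq_out = toy_text[i + toy_len]
    pairs.append((seq_in, seq_out))
    encoded.append(([toy_map[c] for c in seq_in], toy_map[seq_out]))

print(pairs)
print(encoded)
```

A text of 7 characters with a window of 5 yields 7 - 5 = 2 patterns, one per position at which the window still has a following character.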


Basic idea of the data we want:


In [33]:
seq_length = 10


for i in range(0, 5):
	seq_in = raw_text[i:i + seq_length]
	seq_out = raw_text[i + seq_length]
	print (seq_in, seq_out)


alice's ad v
lice's adv e
ice's adve n
ce's adven t
e's advent u


So the `X` data should look like `["alice's ad", "lice's adv", ...]` and `Y` should look like `["v", "e", ...]`. However, `X` and `Y` should hold the integer representations of the characters, not the characters themselves.


Extracting a small part of the dataset to experiment with:

In [34]:
raw_text_sub = raw_text[:100]

print(raw_text_sub)

alice's adventures in wonderland

lewis carroll

the millennium fulcrum edition 3.0




chapter i. d


In [35]:
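text_to_process = raw_text
seq_length = 100

dataX = []  # input windows, each a list of seq_length integer codes
dataY = []  # integer code of the character that follows each window

n_chars = len(text_to_process)

# slide a window of seq_length characters over the text, one character at a time
for i in range(0, n_chars - seq_length):
  seq_in = text_to_process[i:i + seq_length]
  seq_out = text_to_process[i + seq_length]

  # print (seq_in, seq_out)

  # encode the window and its target character with the lookup table
  dataX.append([chars_to_int[char] for char in seq_in])
  dataY.append(chars_to_int[seq_out])

n_patterns = len(dataX)
print ("Total Patterns: ", n_patterns)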

Total Patterns:  144308


In [0]:
# test to see if result matches with tutorial's data 
# print (dataX == dataX_new and dataY == dataY_new)

In [37]:
print (len(dataX), len(dataY))

print (len(dataX[0]))

144308 144308
100
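These counts are consistent with the sliding-window construction: 144408 characters minus a window of 100 gives 144308 patterns. As a further sanity check, the first encoded pattern can be decoded back to text. The sketch below assumes a short stand-in string so that it runs on its own; `sample_text`, `sample_len`, `char_to_int`, and `int_to_char` are hypothetical names, and the inverse mapping is not defined in the cells above.

```python
# Hypothetical stand-in for the book text so the check runs on its own.
sample_text = "alice's adventures in wonderland"
sample_len = 10

sample_chars = sorted(set(sample_text))
char_to_int = {c: i for i, c in enumerate(sample_chars)}
int_to_char = {i: c for c, i in char_to_int.items()}  # inverse lookup for decoding

sample_X, sample_Y = [], []
for i in range(len(sample_text) - sample_len):
    sample_X.append([char_to_int[c] for c in sample_text[i:i + sample_len]])
    sample_Y.append(char_to_int[sample_text[i + sample_len]])

# One pattern per window start: len(text) - seq_length of them.
assert len(sample_X) == len(sample_Y) == len(sample_text) - sample_len

# Decoding the first pattern should give back the first window
# and the character that follows it.
decoded_in = "".join(int_to_char[i] for i in sample_X[0])
decoded_out = int_to_char[sample_Y[0]]
print(repr(decoded_in), "->", repr(decoded_out))
```

The same two checks could be run against `dataX` and `dataY` in the notebook, using an inverse of `chars_to_int`.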
