### Import necessary modules

In [1]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras import utils
import numpy as np

Using TensorFlow backend.


### Download text file and clean up
The downloaded file is plain text. We convert it to lowercase, drop any non-ASCII characters, and omit blank lines.

In [0]:
INPUT_FILE = utils.get_file("wonderland.txt", 
                            origin="http://www.gutenberg.org/files/11/11-0.txt")

In [3]:
print("Extracting text from input...")
fin = open(INPUT_FILE, 'rb')
lines = []
for line in fin:
    line = line.strip().lower()
    line = line.decode("ascii", "ignore")
    if len(line) == 0:  # omit blank lines
        continue
    lines.append(line)
fin.close()
text = " ".join(lines)  # join all cleaned lines into one long string

Extracting text from input...
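
To see what the clean-up step does, here is a minimal sketch that applies the same operations to a few made-up byte strings instead of the downloaded file. The sample lines and the helper name `clean_lines` are assumptions introduced only for illustration.

```python
# Hypothetical sample lines standing in for the raw bytes read from the file.
raw_lines = [
    b"  Alice was beginning to get very tired\r\n",
    b"\r\n",  # blank line, should be dropped
    "\u201cOh dear!\u201d said the Rabbit.\r\n".encode("utf-8"),  # curly quotes are non-ASCII
]


def clean_lines(byte_lines):
    """Strip, lowercase, drop non-ASCII bytes, and skip blank lines."""
    cleaned = []
    for raw in byte_lines:
        line = raw.strip().lower().decode("ascii", "ignore")
        if line:  # omit blank lines
            cleaned.append(line)
    return cleaned


sample_text = " ".join(clean_lines(raw_lines))
print(sample_text)
```

The blank line disappears and the curly quotes are silently discarded, which keeps the character vocabulary small.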


### Creating lookup tables
Because we are doing character-based text prediction, we need to preprocess the characters in our dataset. To turn characters into numbers for training, we create a dictionary that maps each character to an index. For prediction we need the reverse: a dictionary that maps each index back to its character.

In [0]:
chars = sorted(set(text))  # character vocabulary, sorted so indices are the same on every run
nb_chars = len(chars)  # total number of chars

# create a dictionary for finding an index given a character
char2index = dict((c, i) for i, c in enumerate(chars))

##-----> To do: create a dictionary for finding a character given its index
index2char = 

### Prepare for inputs and outputs
 The raw inputs are fixed-length strings of SEQLEN characters (here SEQLEN=10).
<br> The raw output for each input is the single character that immediately follows it in the text.
<br> For example, given the input text "The sky was falling", we would get the
<br> following 'input_chars' -> 'label_chars' pairs (first 2 only):
<br>   'The sky wa' -> 's'
<br>   'he sky was' -> ' '

In [5]:
print("Creating input and label text...")
SEQLEN = 10
STEP = 1
input_chars = []
label_chars = []
for i in range(0, len(text) - SEQLEN, STEP):
    input_chars.append(text[i:i + SEQLEN])
    label_chars.append(text[i + SEQLEN])

Creating input and label text...
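
As a quick check of the sliding window, the sketch below runs the same loop on the example sentence "The sky was falling". The helper name `make_windows` is an assumption made for this illustration; the loop body mirrors the cell above.

```python
def make_windows(text, seqlen, step=1):
    """Return (inputs, labels): each input is seqlen chars, each label the next char."""
    inputs, labels = [], []
    for i in range(0, len(text) - seqlen, step):
        inputs.append(text[i:i + seqlen])
        labels.append(text[i + seqlen])
    return inputs, labels


toy_inputs, toy_labels = make_windows("The sky was falling", seqlen=10)
for inp, lab in list(zip(toy_inputs, toy_labels))[:2]:
    print(repr(inp), "->", repr(lab))
print("number of pairs:", len(toy_inputs))
```

With STEP = 1 a text of length N yields N - SEQLEN pairs, so a larger STEP would give fewer, less overlapping training examples.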


### Vectorize inputs and outputs
 Each input string consists of 'SEQLEN' characters.
<br> We vectorize each character as a one-hot vector of size nb_chars, i.e. the vocabulary size len(chars).
<br> There are len(input_chars) such inputs, so the shape of X is (len(input_chars), SEQLEN, nb_chars).
<br> Similarly, the shape of y is (len(input_chars), nb_chars).

In [6]:
print("Vectorizing input and label text...")
# use the built-in bool: the np.bool alias was removed in NumPy 1.24
X = np.zeros((len(input_chars), SEQLEN, nb_chars), dtype=bool)
y = np.zeros((len(input_chars), nb_chars), dtype=bool)
for i, input_char in enumerate(input_chars):
    for j, ch in enumerate(input_char):
        X[i, j, char2index[ch]] = 1
    y[i, char2index[label_chars[i]]] = 1

Vectorizing input and label text...
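
The shapes are easier to verify on a tiny example. The sketch below one-hot encodes a made-up five-character text with a window of 2; all `demo_` names and the text itself are assumptions chosen for illustration.

```python
import numpy as np

demo_text = "abcab"   # made-up text
demo_seqlen = 2       # made-up window length
demo_chars = sorted(set(demo_text))                       # ['a', 'b', 'c']
demo_char2index = {c: i for i, c in enumerate(demo_chars)}

n = len(demo_text) - demo_seqlen
demo_inputs = [demo_text[i:i + demo_seqlen] for i in range(n)]  # 'ab', 'bc', 'ca'
demo_labels = [demo_text[i + demo_seqlen] for i in range(n)]    # 'c', 'a', 'b'

X_demo = np.zeros((n, demo_seqlen, len(demo_chars)), dtype=bool)
y_demo = np.zeros((n, len(demo_chars)), dtype=bool)
for i, seq in enumerate(demo_inputs):
    for j, ch in enumerate(seq):
        X_demo[i, j, demo_char2index[ch]] = True   # one True per time step
    y_demo[i, demo_char2index[demo_labels[i]]] = True  # one True per label

print(X_demo.shape, y_demo.shape)
print(X_demo[0].astype(int))  # encoding of 'ab'
```

Every time step of every input has exactly one entry set, so `X_demo.sum()` equals the number of inputs times the window length.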


### Build one layer LSTM model
Here we build a one-layer LSTM model for character prediction. We also set the hyperparameters: the hidden size, the batch size, the number of training iterations, and the number of characters to generate after each iteration.

In [0]:
HIDDEN_SIZE = 128  # number of units in the single LSTM layer
BATCH_SIZE = 128  # number of samples per training batch
NUM_ITERATIONS = 25  # number of train-then-generate rounds
NUM_EPOCHS_PER_ITERATION = 1  # number of passes over the whole dataset per iteration
NUM_PREDS_PER_EPOCH = 100  # number of characters generated after each iteration

model = Sequential()  # Sequential means a linear stack of layers
model.add(LSTM(HIDDEN_SIZE, input_shape=(SEQLEN, nb_chars)))  # add one LSTM layer
model.add(Dense(nb_chars))  # project the LSTM output to one score per character
model.add(Activation("softmax"))  # turn the scores into probabilities

model.compile(loss="categorical_crossentropy", optimizer="rmsprop")  # nb_chars categories
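
To get a feel for the size of this model, the sketch below counts its trainable weights by hand. It assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias, and it uses a hypothetical vocabulary size of 60 because the real `nb_chars` depends on the downloaded text. The function names are made up for this illustration.

```python
def lstm_param_count(input_dim, units):
    """Weights in one LSTM layer: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(input_dim, units):
    """Weights in a fully connected layer: kernel + bias."""
    return input_dim * units + units


ASSUMED_NB_CHARS = 60  # hypothetical; the real value comes from the text
HIDDEN = 128           # same as HIDDEN_SIZE above

lstm_params = lstm_param_count(ASSUMED_NB_CHARS, HIDDEN)
dense_params = dense_param_count(HIDDEN, ASSUMED_NB_CHARS)
print("LSTM :", lstm_params)
print("Dense:", dense_params)
print("Total:", lstm_params + dense_params)
```

On the real model, `model.summary()` reports the actual counts, which can be compared against this formula.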

### Train and test the LSTM model
In each iteration, we first train the model, then use it to generate a fixed number of characters.

In [8]:
for iteration in range(NUM_ITERATIONS):
    print("=" * 50)  # separators ======
    print("Iteration #: %d" % (iteration))
    model.fit(X, y, batch_size=BATCH_SIZE, epochs=NUM_EPOCHS_PER_ITERATION)  # start training
    
    # testing model
    # randomly choose a row from input_chars,  
    # then use it to generate text from model for next 100 chars
    test_idx = np.random.randint(len(input_chars))  # randomly choose an index
    test_chars = input_chars[test_idx]  # find the corresponding string
    print("Generating from seed: %s" % (test_chars))  # print that string
    print(test_chars, end="")  # beginning of output string 
    for i in range(NUM_PREDS_PER_EPOCH):  # generate 100 characters
        Xtest = np.zeros((1, SEQLEN, nb_chars))  # one-hot encode the current test input
        for j, ch in enumerate(test_chars):  # 'j' avoids shadowing the outer loop variable
            Xtest[0, j, char2index[ch]] = 1
        pred = model.predict(Xtest, verbose=0)[0]  # predict next character with trained model
        ##----> To do: 1) find the index which has the highest probability (refer to 'np.argmax');
        ##----> To do: 2) and find its corresponding character (refer to 'index2char').
        ypred = 
        print(ypred, end="")
        # move forward with 'test_chars + ypred' for next prediction
        test_chars = test_chars[1:] + ypred  
    print()

Iteration #: 0
Epoch 1/1
Generating from seed: see what w
see what whe she wand the said the said the said the said the said the said the said the said the said the sai
Iteration #: 1
Epoch 1/1
Generating from seed: lectronic 
lectronic to the was the was the was the was the was the was the was the was the was the was the was the was t
Iteration #: 2
Epoch 1/1
Generating from seed: jesty mean
jesty mean the gryphon and the rather the project gutenberg-tm alice of the reat of the ray the rather the pro
Iteration #: 3
Epoch 1/1
Generating from seed:  very like
 very like the gryphon a little said the gryphon a little said the gryphon a little said the gryphon a little 
Iteration #: 4
Epoch 1/1
Generating from seed: ing to dra
ing to drast her all the said the could the said the could the said the could the said the could the said the 
Iteration #: 5
Epoch 1/1
Generating from seed: white rabb
white rabbit the did a little bean the dormouse the door all the did a little bean the dormouse t

### Questions
- 1) Why do we need to clean up the downloaded text file?
- 2) Which parameters do you think affect the performance of the model we built?
- 3) What do you think we could do to get better results?

### Additional Exercise
Based on your answer to question 3, try modifying the code above to improve the model's performance.