## Understand how state is maintained in an LSTM network

- How to develop a naive LSTM network for a sequence prediction problem
- How to carefully manage state through batches and features with an LSTM network
- How to manually manage state in an LSTM network for stateful prediction


### Problem Description: Learn the Alphabet

In this note we are going to develop and contrast a number of different LSTM recurrent neural network models.

The context of these comparisons will be a simple sequence prediction problem: learning the alphabet. That is, given a letter of the alphabet, predict the next letter of the alphabet.


In [48]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils

In [49]:
# fix random seed for reproducibility
numpy.random.seed(7)

In [50]:
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

In [51]:
# note: print the (index, character) pairs that enumerate() yields
for x in enumerate("abc"):
    print(x)

(0, 'a')
(1, 'b')
(2, 'c')


Now we need to create our input and output pairs on which to train our neural network. We can do this by defining an input sequence length, then reading sequences from the input alphabet sequence.

For example, we use an input length of 1. Starting at the beginning of the raw input data, we can read off the first letter “A” and the next letter as the prediction “B”. We move along one character and repeat until we reach a prediction of “Z”.

In [52]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
	seq_in = alphabet[i:i + seq_length]
	seq_out = alphabet[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
	print(seq_in, '->', seq_out)

A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z


We need to reshape the NumPy array into a format expected by the LSTM networks, that is [samples, time steps, features].

In [53]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
print(X.shape)

(25, 1, 1)


In [54]:
# normalize
X = X / float(len(alphabet))

Finally, we can think of this problem as a sequence classification task, where each of the 26 letters represents a different class. As such, we can convert the output (y) to a one hot encoding, using the Keras built-in function **to_categorical()**.

In [55]:
# for i in range(5):
#     print(dataY[i])

y = np_utils.to_categorical(dataY)
# for i in range(5):
#     print(y[i])
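
To make the one hot encoding concrete, here is a minimal NumPy sketch of the same idea. It is not the Keras implementation; the three labels and the variable names `labels`, `num_classes`, and `one_hot` are chosen only for illustration.

```python
import numpy

# Integer class labels for "B", "C", "D" (A=0, B=1, ...), like the first entries of dataY.
labels = [1, 2, 3]
num_classes = 26  # one class per letter of the alphabet

# Row i of the identity matrix is the one hot vector for class i.
one_hot = numpy.eye(num_classes)[labels]

print(one_hot.shape)                   # (3, 26)
print(one_hot[0][:5])                  # [0. 1. 0. 0. 0.]
print(numpy.argmax(one_hot, axis=1))   # [1 2 3]
```

Taking `argmax` of a row recovers the integer label, which is the same step used later to turn a softmax prediction back into a letter.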


### Naive LSTM for Learning One-Char to One-Char Mapping

Let’s start off by designing a simple LSTM to learn how to predict the next character in the alphabet given the context of just one character.

We will frame the problem as a random collection of one-letter input to one-letter output pairs. As we will see, this is a difficult framing of the problem for the LSTM to learn.

Let’s define an LSTM network with 32 units and an output layer with one neuron per class (26 in total) and a softmax activation function for making predictions. Because this is a multi-class classification problem, we can use the log loss function (called **“categorical_crossentropy”** in Keras), and optimize the network using the Adam optimization algorithm.

The model is fit over 500 epochs with a batch size of 1.
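
As a rough sanity check on the log loss, the sketch below computes it by hand for a single pattern. The helper `categorical_crossentropy` here is a hand-written illustration of the formula, not the Keras function, and the two prediction vectors are made up. It suggests why the loss in the first training epochs sits near 3.26: a network that spreads its probability evenly over 26 letters scores ln(26).

```python
import numpy

def categorical_crossentropy(y_true, y_pred):
    """Log loss for one sample: -sum(y_true * log(y_pred))."""
    y_pred = numpy.clip(y_pred, 1e-7, 1.0)  # avoid log(0)
    return float(-numpy.sum(y_true * numpy.log(y_pred)))

num_classes = 26
y_true = numpy.zeros(num_classes)
y_true[1] = 1.0  # the correct next letter is "B"

# An untrained network that spreads probability evenly over all letters.
uniform = numpy.full(num_classes, 1.0 / num_classes)
uniform_loss = categorical_crossentropy(y_true, uniform)
print(round(uniform_loss, 4))  # 3.2581, i.e. ln(26)

# A confident, correct prediction: 0.9 on "B", the rest shared equally.
confident = numpy.full(num_classes, 0.004)
confident[1] = 0.9
confident_loss = categorical_crossentropy(y_true, confident)
print(round(confident_loss, 4))  # 0.1054
```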

In [56]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
0s - loss: 3.2661 - acc: 0.0000e+00
Epoch 2/500
0s - loss: 3.2582 - acc: 0.0000e+00
Epoch 3/500
0s - loss: 3.2551 - acc: 0.0400
Epoch 4/500
0s - loss: 3.2524 - acc: 0.0400
Epoch 5/500
0s - loss: 3.2495 - acc: 0.0400
Epoch 6/500
0s - loss: 3.2471 - acc: 0.0400
Epoch 7/500
0s - loss: 3.2440 - acc: 0.0400
Epoch 8/500
0s - loss: 3.2412 - acc: 0.0400
Epoch 9/500
0s - loss: 3.2378 - acc: 0.0400
Epoch 10/500
0s - loss: 3.2348 - acc: 0.0400
Epoch 11/500
0s - loss: 3.2313 - acc: 0.0400
Epoch 12/500
0s - loss: 3.2277 - acc: 0.0400
Epoch 13/500
0s - loss: 3.2237 - acc: 0.0400
Epoch 14/500
0s - loss: 3.2203 - acc: 0.0400
Epoch 15/500
0s - loss: 3.2160 - acc: 0.0400
Epoch 16/500
0s - loss: 3.2117 - acc: 0.0400
Epoch 17/500
0s - loss: 3.2066 - acc: 0.0400
Epoch 18/500
0s - loss: 3.2017 - acc: 0.0400
Epoch 19/500
0s - loss: 3.1969 - acc: 0.0400
Epoch 20/500
0s - loss: 3.1911 - acc: 0.0400
Epoch 21/500
0s - loss: 3.1854 - acc: 0.0400
Epoch 22/500
0s - loss: 3.1789 - acc: 0.0400
Epoch 23/50

<keras.callbacks.History at 0x7f4e5bc96208>

In [57]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 92.00%


We can then re-run the training data through the network and generate predictions, converting both the input and output pairs back into their original character format to get a visual idea of how well the network learned the problem.

In [58]:
# demonstrate some model predictions
for pattern in dataX:
	x = numpy.reshape(pattern, (1, len(pattern), 1))
	x = x / float(len(alphabet))
	prediction = model.predict(x, verbose=0)
	index = numpy.argmax(prediction)
	result = int_to_char[index]
	seq_in = [int_to_char[value] for value in pattern]
	print(seq_in, "->", result)

['A'] -> B
['B'] -> C
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> V
['V'] -> W
['W'] -> Y
['X'] -> Z
['Y'] -> Z


We can see that this problem is indeed difficult for the network to learn.

**The reason is, the poor LSTM units do not have any context to work with. Each input-output pattern is shown to the network in a random order and the state of the network is reset after each pattern (each batch where each batch contains one pattern).**

This is an abuse of the LSTM network architecture, treating it like a standard multilayer Perceptron.

Next, let’s try a different framing of the problem in order to provide more sequence to the network from which to learn.

### Naive LSTM for a Three-Char Feature Window to One-Char Mapping

A popular approach to adding more context to data for multilayer Perceptrons is to use the window method.

This is where previous steps in the sequence are provided as additional input features to the network. We can try the same trick to provide more context to the LSTM network.

Here, we increase the sequence length from 1 to 3, for example:

    # prepare the dataset of input to output pairs encoded as integers
    
    seq_length = 3

    ABC -> D
    BCD -> E
    CDE -> F

Each element in the sequence is then provided as a new input feature to the network. This requires a modification of how the input sequences are reshaped in the data preparation step:

    # reshape X to be [samples, time steps, features]
    X = numpy.reshape(dataX, (len(dataX), 1, seq_length))
    
It also requires a modification to how the sample patterns are reshaped when demonstrating predictions from the model.

    x = numpy.reshape(pattern, (1, 1, len(pattern))) 
    
(Note: in the previous example, the shape was defined as shown below.)

    X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
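
The difference between the two layouts is easier to see on a tiny array. The following is a small sketch with three hand-written windows (ABC, BCD, CDE); the names `as_features` and `as_time_steps` are illustrative only.

```python
import numpy

# Three windows of three integer-encoded letters: ABC, BCD, CDE.
dataX = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
seq_length = 3

# Window method used in this section: one time step carrying three features.
as_features = numpy.reshape(dataX, (len(dataX), 1, seq_length))
# Alternative layout: three time steps carrying one feature each.
as_time_steps = numpy.reshape(dataX, (len(dataX), seq_length, 1))

print(as_features.shape)          # (3, 1, 3)
print(as_time_steps.shape)        # (3, 3, 1)
print(as_features[0].tolist())    # [[0, 1, 2]]
print(as_time_steps[0].tolist())  # [[0], [1], [2]]
```

Both arrays hold the same numbers; only the axis on which the LSTM sees them differs. With the feature layout, the network receives each window in a single step.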
    
    
The entire code listing is below.

In [59]:
# Naive LSTM to learn three-char window to one-char mapping
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
# fix random seed for reproducibility
numpy.random.seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# prepare the dataset of input to output pairs encoded as integers
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
	seq_in = alphabet[i:i + seq_length]
	seq_out = alphabet[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
	print(seq_in, '->', seq_out)



ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


In [60]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), 1, seq_length))
print(X.shape)

(23, 1, 3)


In [61]:
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)

Epoch 1/500
0s - loss: 3.2651 - acc: 0.0000e+00
Epoch 2/500
0s - loss: 3.2527 - acc: 0.0435
Epoch 3/500
0s - loss: 3.2462 - acc: 0.0435
Epoch 4/500
0s - loss: 3.2402 - acc: 0.0000e+00
Epoch 5/500
0s - loss: 3.2339 - acc: 0.0435
Epoch 6/500
0s - loss: 3.2274 - acc: 0.0435
Epoch 7/500
0s - loss: 3.2209 - acc: 0.0435
Epoch 8/500
0s - loss: 3.2143 - acc: 0.0000e+00
Epoch 9/500
0s - loss: 3.2068 - acc: 0.0435
Epoch 10/500
0s - loss: 3.1993 - acc: 0.0435
Epoch 11/500
0s - loss: 3.1918 - acc: 0.0435
Epoch 12/500
0s - loss: 3.1839 - acc: 0.0000e+00
Epoch 13/500
0s - loss: 3.1757 - acc: 0.0435
Epoch 14/500
0s - loss: 3.1674 - acc: 0.0435
Epoch 15/500
0s - loss: 3.1587 - acc: 0.0000e+00
Epoch 16/500
0s - loss: 3.1499 - acc: 0.0435
Epoch 17/500
0s - loss: 3.1419 - acc: 0.0000e+00
Epoch 18/500
0s - loss: 3.1341 - acc: 0.0000e+00
Epoch 19/500
0s - loss: 3.1246 - acc: 0.0435
Epoch 20/500
0s - loss: 3.1168 - acc: 0.0435
Epoch 21/500
0s - loss: 3.1097 - acc: 0.0435
Epoch 22/500
0s - loss: 3.1018 - acc

<keras.callbacks.History at 0x7f4e5b3db978>

In [62]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
# demonstrate some model predictions
for pattern in dataX:
	x = numpy.reshape(pattern, (1, 1, len(pattern)))
	x = x / float(len(alphabet))
	prediction = model.predict(x, verbose=0)
	index = numpy.argmax(prediction)
	result = int_to_char[index]
	seq_in = [int_to_char[value] for value in pattern]
	print(seq_in, "->", result)

Model Accuracy: 86.96%
['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> V
['T', 'U', 'V'] -> X
['U', 'V', 'W'] -> Z
['V', 'W', 'X'] -> Z
['W', 'X', 'Y'] -> Z


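In this run the window model reaches 86.96% accuracy, compared with 92.00% for the one-character model, so the window method gives no clear lift; with so few patterns, the difference may or may not be real. This is a simple problem that we were still not able to learn fully with LSTMs, even with the window method.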

**Again, this is a misuse of the LSTM network through a poor framing of the problem. Indeed, the sequences of letters are time steps of one feature rather than one time step of separate features. We have given more context to the network, but not more sequence as it expects.**

In the next section, we will give more context to the network in the form of time steps.