# Stateful LSTM for a One-Char to One-Char Mapping

We have seen that we can break our raw data into fixed-size sequences and that the LSTM can learn this representation, but only as arbitrary mappings of 3 characters to 1 character.

We have also seen that we can misuse the batch size to offer more of the sequence to the network, but only during training.

Ideally, *we want to expose the network to the entire sequence and let it learn the inter-dependencies, rather than defining those dependencies explicitly ourselves in the framing of the problem.*

**We can do this in Keras by making the LSTM layers stateful and manually resetting the state of the network at the end of the epoch, which is also the end of the training sequence.**

**This is truly how LSTM networks are intended to be used. We find that by allowing the network itself to learn the dependencies between the characters, we need a smaller network (half the number of units) and fewer training epochs (almost half).**

*Why **reset**? If the model is stateless, the cell states are reset after each sequence. With a stateful model, **all the states are propagated to the next batch**, so we have to reset them ourselves when the sequence ends. Propagation means that the final state of the sample at index i, Xi, is used as the initial state for the sample Xi+bs in the next batch, where bs is the batch size (no shuffling).*
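
To make the indexing concrete, here is a minimal sketch in plain Python (no Keras involved) that lists which sample's final state each sample starts from, assuming the samples are consumed in order without shuffling. The helper name `state_sources` is hypothetical and exists only for this illustration.

```python
def state_sources(n_samples, batch_size):
    """Map each sample index to the index of the sample whose final state it inherits.

    Samples in the first batch start from the reset state, marked as None.
    """
    sources = {}
    for i in range(n_samples):
        prev = i - batch_size
        sources[i] = prev if prev >= 0 else None
    return sources


# With batch_size=1, every sample continues from its immediate predecessor,
# so the whole dataset behaves like one long sequence.
print(state_sources(5, 1))

# With batch_size=2, the state runs along two interleaved chains:
# 0 -> 2 -> 4 and 1 -> 3 -> 5.
print(state_sources(6, 2))
```

With a batch size of 1, as used in this section, the chain is simply 0 -> 1 -> 2 and so on, which is why the order of the samples matters.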

We first need to define our LSTM layer as stateful. **In so doing, we must explicitly specify the batch size as a dimension of the input shape. This also means that when we evaluate the network or make predictions, we must specify and adhere to this same batch size.** This is not a problem now, as we are using a batch size of 1. It could introduce difficulties when the batch size is not one, as predictions will then need to be made in batches and in sequence.

    batch_size = 1
    model.add(LSTM(16, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
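
As a rough illustration of what "in batch and in sequence" implies, the sketch below splits sample indices into consecutive, full, unshuffled batches. It assumes that a fixed batch size means every batch must be full, so the number of samples has to be a multiple of the batch size. The helper `ordered_batches` is hypothetical and is not part of Keras.

```python
def ordered_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs for consecutive, full, unshuffled batches."""
    if n_samples % batch_size != 0:
        raise ValueError(
            "%d samples cannot be split into full batches of %d"
            % (n_samples, batch_size)
        )
    for start in range(0, n_samples, batch_size):
        yield (start, start + batch_size)


# 25 samples with a batch size of 1 (the case in this section): always valid.
print(list(ordered_batches(25, 1))[:3])

# 24 samples with a batch size of 4: six full batches, in order.
print(list(ordered_batches(24, 4)))
```

With a batch size of 1 the constraint never bites, which is why it is not a problem here.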
    
An important difference in training the stateful LSTM is that we train it manually one epoch at a time and reset the state after each epoch. We can do this in a for loop. Again, we do not shuffle the input, preserving the sequence in which the input training data was created.

    for i in range(300):
        # one pass over the data, in order, then clear the carried-over state
        model.fit(X, y, epochs=1, batch_size=batch_size, verbose=2, shuffle=False)
        model.reset_states()
    
As mentioned, we specify the batch size when evaluating the performance of the network on the entire training dataset.  

    # summarize performance of the model
    scores = model.evaluate(X, y, batch_size=batch_size, verbose=0)
    model.reset_states()
    print("Model Accuracy: %.2f%%" % (scores[1]*100))
    
Finally, we can demonstrate that the network has indeed learned the entire alphabet. We can seed it with the first letter “A”, request a prediction, feed the prediction back in as an input, and repeat the process all the way to “Z”.


    # demonstrate some model predictions
    seed = [char_to_int[alphabet[0]]]
    for i in range(0, len(alphabet)-1):
        x = numpy.reshape(seed, (1, len(seed), 1))
        x = x / float(len(alphabet))
        prediction = model.predict(x, verbose=0)
        index = numpy.argmax(prediction)
        print(int_to_char[seed[0]], "->", int_to_char[index])
        seed = [index]
    model.reset_states()

We can also see if the network can make predictions starting from an arbitrary letter.

    # demonstrate a random starting point
    letter = "K"
    seed = [char_to_int[letter]]
    print("New start: ", letter)
    for i in range(0, 5):
        x = numpy.reshape(seed, (1, len(seed), 1))
        x = x / float(len(alphabet))
        prediction = model.predict(x, verbose=0)
        index = numpy.argmax(prediction)
        print(int_to_char[seed[0]], "->", int_to_char[index])
        seed = [index]
    model.reset_states()

In [1]:
# Stateful LSTM to learn one-char to one-char mapping
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
# fix random seed for reproducibility
numpy.random.seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
	seq_in = alphabet[i:i + seq_length]
	seq_out = alphabet[i + seq_length]
	dataX.append([char_to_int[char] for char in seq_in])
	dataY.append(char_to_int[seq_out])
	print(seq_in, '->', seq_out)


Using TensorFlow backend.


A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z


In [2]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
print(X.shape)

(25, 1, 1)


In [3]:
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
# create and fit the model
batch_size = 1
model = Sequential()
model.add(LSTM(16, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(batch_size, X.shape[1], X.shape[2])

1 1 1


In [None]:
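This setup looks similar to one-to-one training with shape (1, 1, 1), but because the layer is stateful, the state left by the previous sample is retained until the whole of X has been consumed.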

In [4]:
for i in range(300):
	model.fit(X, y, epochs=1, batch_size=batch_size, verbose=2, shuffle=False)
	model.reset_states()

Epoch 1/1
0s - loss: 3.2659 - acc: 0.0400
Epoch 1/1
0s - loss: 3.2538 - acc: 0.1600
Epoch 1/1
0s - loss: 3.2479 - acc: 0.1600
Epoch 1/1
0s - loss: 3.2422 - acc: 0.1200
Epoch 1/1
0s - loss: 3.2364 - acc: 0.1200
Epoch 1/1
0s - loss: 3.2302 - acc: 0.1200
Epoch 1/1
0s - loss: 3.2234 - acc: 0.0400
Epoch 1/1
0s - loss: 3.2156 - acc: 0.0800
Epoch 1/1
0s - loss: 3.2064 - acc: 0.1200
Epoch 1/1
0s - loss: 3.1950 - acc: 0.1200
Epoch 1/1
0s - loss: 3.1806 - acc: 0.1200
Epoch 1/1
0s - loss: 3.1615 - acc: 0.0800
Epoch 1/1
0s - loss: 3.1359 - acc: 0.0800
Epoch 1/1
0s - loss: 3.1030 - acc: 0.0800
Epoch 1/1
0s - loss: 3.0666 - acc: 0.0800
Epoch 1/1
0s - loss: 3.0357 - acc: 0.0800
Epoch 1/1
0s - loss: 3.0185 - acc: 0.1600
Epoch 1/1
0s - loss: 3.0204 - acc: 0.1200
Epoch 1/1
0s - loss: 3.0309 - acc: 0.1200
Epoch 1/1
0s - loss: 3.0175 - acc: 0.0800
Epoch 1/1
0s - loss: 2.9810 - acc: 0.0800
Epoch 1/1
0s - loss: 2.9444 - acc: 0.0800
Epoch 1/1
0s - loss: 2.9111 - acc: 0.0800
Epoch 1/1
0s - loss: 2.8777 - acc:

In [6]:
# summarize performance of the model
scores = model.evaluate(X, y, batch_size=batch_size, verbose=0)
model.reset_states()
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 72.00%


In [7]:
seed = [char_to_int[alphabet[0]]]
for i in range(0, len(alphabet)-1):
	x = numpy.reshape(seed, (1, len(seed), 1))
	x = x / float(len(alphabet))
	prediction = model.predict(x, verbose=0)
	index = numpy.argmax(prediction)
	print(int_to_char[seed[0]], "->", int_to_char[index])
	seed = [index]
model.reset_states()

A -> B
B -> C
C -> D
D -> E
E -> E
E -> G
G -> H
H -> I
I -> J
J -> J
J -> L
L -> M
M -> M
M -> N
N -> P
P -> Q
Q -> R
R -> S
S -> S
S -> U
U -> V
V -> V
V -> W
W -> Y
Y -> Z


In [8]:
# demonstrate a random starting point
letter = "K"
seed = [char_to_int[letter]]
print("New start: ", letter)
for i in range(0, 5):
	x = numpy.reshape(seed, (1, len(seed), 1))
	x = x / float(len(alphabet))
	prediction = model.predict(x, verbose=0)
	index = numpy.argmax(prediction)
	print(int_to_char[seed[0]], "->", int_to_char[index])
	seed = [index]
model.reset_states()

New start:  K
K -> B
B -> C
C -> C
C -> D
D -> E


We can see that in this run the network reached 72% accuracy on the training data, so it has learned most of the alphabet but not all of it. It used the context of the samples themselves and learned whatever dependency it needed to predict the next character in the sequence.

We can also see that if we seed the network with the first letter, it rattles off most of the rest of the alphabet correctly, occasionally repeating a letter before moving on (for example, “E -> E” followed by “E -> G”).

We can also see that it has only learned the full alphabet sequence, and only from a cold start. When asked to predict the next letter from “K”, it predicts “B” and falls back into regurgitating the alphabet from the beginning.
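
The cold-start behaviour follows from the fact that a stateful layer's output depends on its carried-over state as well as on the current input. The sketch below shows this with a toy one-unit recurrent cell in plain Python. `ToyStatefulCell`, its weights and the `encode` helper are hypothetical stand-ins, not the trained Keras model, so the numbers only illustrate the effect.

```python
import math

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


class ToyStatefulCell:
    """One-unit recurrent cell that keeps its hidden state between calls.

    The weights are arbitrary and untrained; only the state handling matters here.
    """

    def __init__(self, w_in=0.5, w_rec=0.9):
        self.w_in = w_in
        self.w_rec = w_rec
        self.h = 0.0

    def step(self, x):
        # the new state mixes the current input with the previous state
        self.h = math.tanh(self.w_in * x + self.w_rec * self.h)
        return self.h

    def reset_states(self):
        self.h = 0.0


def encode(letter):
    # same normalization as in this section: integer index / alphabet size
    return ALPHABET.index(letter) / float(len(ALPHABET))


cell = ToyStatefulCell()

# Cold start: "K" is the first input the cell sees after a reset.
cell.reset_states()
cold = cell.step(encode("K"))

# Warm start: feed "A" to "J" first, so the state reflects the preceding letters.
cell.reset_states()
for letter in ALPHABET[:ALPHABET.index("K")]:
    cell.step(encode(letter))
warm = cell.step(encode("K"))

print("cold:", round(cold, 4), "warm:", round(warm, 4))
```

The same input “K” gives two different outputs depending on what came before it. A network that has only ever seen “K” after ten earlier letters has no reason to respond sensibly when “K” arrives on a freshly reset state.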