# Language Modeling using LSTM in Keras/TensorFlow

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. 

We use TensorFlow as the backend.

In [7]:
# Import the libraries needed to build, train, and sample from the LSTM network
import sys
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.callbacks import TensorBoard
from keras.utils import np_utils
from keras import backend as K
K.clear_session()

Before running the following cell, make sure you have the file wonderland.txt in the data folder in your working directory.

The following snippet generates the data by splitting the raw text into a list of characters. A vocabulary **chars** is built from the unique characters present in the text. Since the model can take only numbers as input, the characters are mapped to integers, and a reverse mapping is created in order to map the output generated by the model back to characters.

In [2]:
# load the text (UTF-8 assumed) and convert to lowercase
filename = "data/wonderland.txt"
with open(filename, encoding="utf-8") as f:
    raw_text = f.read()
raw_text = raw_text.lower()
# create mapping of unique chars to integers, and a reverse mapping
chars = sorted(list(set(raw_text)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
# summarize the loaded data
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)

Total Characters:  163815
Total Vocab:  60
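To make the two mappings concrete, here is a minimal sketch that builds them for a short toy string and round-trips it through integers. The string `"hello world"` and the `toy_` names are assumptions chosen for illustration; they are not part of the notebook.

```python
# Toy text standing in for the novel (an assumption for illustration).
toy_text = "hello world"

# Build the vocabulary and both mappings the same way as in the cell above.
toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text as integers, then decode it back to characters.
encoded = [toy_char_to_int[c] for c in toy_text]
decoded = "".join(toy_int_to_char[i] for i in encoded)

print(toy_chars)  # the sorted vocabulary
print(encoded)    # one integer per character
print(decoded)    # identical to toy_text
```

Because the reverse mapping is built from the same enumeration, decoding always recovers the original text exactly.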


The dataset is split into input sequences of 100 characters, each paired with the single character that follows it as the output. For example, one training pair looks like this:

seq_in : project gutenberg’s alice’s adventures in wonderland, by lewis carroll

this ebook is for the use of

seq_out : a

All the input sequences (seq_in) are appended to dataX, and the corresponding output characters (seq_out) to dataY.

In [3]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)

Total Patterns:  163715
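The sliding window is easier to see on a tiny example. The sketch below uses a made-up 8-character string and a window of 3 (both assumptions for illustration) and keeps the pairs as text rather than integers.

```python
# Toy text and window length (assumptions for illustration).
toy_text = "abcdefgh"
toy_seq_length = 3

# Slide a window over the text one character at a time:
# the window is the input, the next character is the output.
toy_pairs = []
for i in range(0, len(toy_text) - toy_seq_length):
    seq_in = toy_text[i:i + toy_seq_length]
    seq_out = toy_text[i + toy_seq_length]
    toy_pairs.append((seq_in, seq_out))

for seq_in, seq_out in toy_pairs:
    print(seq_in, "->", seq_out)
```

The number of pairs is the text length minus the window length (8 - 3 = 5 here), which is the same arithmetic that gives 163815 - 100 = 163715 patterns above.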


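The data is then prepared to be fed into the model: the inputs are reshaped to [samples, time steps, features] and scaled by the vocabulary size, and the outputs are one-hot encoded. The LSTM model is defined with 1 LSTM layer, 1 dropout layer and a Dense layer with softmax activation.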
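The preparation step can be checked on toy data before running it on the full dataset. This is a sketch under assumptions: the `toy_` values are invented, and `np.eye` is used here as a stand-in for `np_utils.to_categorical` so the example depends only on NumPy.

```python
import numpy as np

# Two toy input sequences of length 3 and their target characters,
# already encoded as integers (values are assumptions for illustration).
toy_dataX = [[0, 1, 2], [1, 2, 3]]
toy_dataY = [3, 0]
toy_vocab = 4

# Reshape to [samples, time steps, features] and scale into the range 0-1.
toy_X = np.reshape(toy_dataX, (len(toy_dataX), 3, 1)) / float(toy_vocab)

# One-hot encode the targets: row i has a 1 in column toy_dataY[i].
toy_y = np.eye(toy_vocab)[toy_dataY]

print(toy_X.shape)  # (2, 3, 1)
print(toy_y)
```

Each input is a single feature per time step, which is why the last dimension of `toy_X` is 1, while each target becomes a vector with one entry per vocabulary character.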

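Dropout layer: during training, a dropout layer randomly sets a fraction of its inputs to zero (20% here), so the affected units do not contribute to that training step. This helps reduce overfitting. Dropout is not applied at prediction time.

Dense layer: a fully connected layer. Here it has one output per character in the vocabulary, and the softmax activation turns those outputs into probabilities.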
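As a rough illustration of what dropout does to a layer's activations, here is a minimal NumPy sketch of the common "inverted dropout" formulation, in which the surviving values are scaled up by 1 / (1 - rate) so their expected sum is unchanged. The function name `dropout_forward` and the toy array are assumptions; this is not the Keras implementation itself.

```python
import numpy as np

def dropout_forward(activations, rate, rng, training=True):
    """Hypothetical helper: zero a random fraction `rate` of the activations."""
    if not training:
        # At prediction time the activations pass through unchanged.
        return activations
    # Keep each unit with probability (1 - rate).
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the kept units so the expected total activation stays the same.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((4, 5))

dropped = dropout_forward(activations, rate=0.2, rng=rng)
unchanged = dropout_forward(activations, rate=0.2, rng=rng, training=False)

print(dropped)    # entries are either 0 or 1.25
print(unchanged)  # all ones
```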

Training is carried out for 1 epoch with a batch size of 128.

In [6]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (n_patterns, seq_length, 1))
# normalize
X = X / float(n_vocab)
# one hot encode the output variable
y = np_utils.to_categorical(dataY)
# define the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
# define the checkpoint
filepath="weights/weights-improvement-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
graph_vis = TensorBoard(log_dir='./logs')
callbacks_list = [checkpoint,graph_vis]
# fit the model
model.fit(X, y, epochs=1, batch_size=128, callbacks=callbacks_list)

Epoch 1/1

Epoch 00001: loss improved from inf to 2.99837, saving model to weights/weights-improvement-01-2.9984.hdf5


<keras.callbacks.History at 0x137809e80>

During training, the checkpoint callback saves the weights as an HDF5 file at the end of every epoch in which the loss improves. You can delete all except the one with the lowest loss, which can be used later for testing. Once training is done, we can test the model on a randomly chosen sequence of characters from the text.
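Since the loss is embedded in each checkpoint filename, the file with the lowest loss can be picked programmatically. The sketch below parses the `weights-improvement-{epoch}-{loss}.hdf5` pattern used in the training cell; the helper name `best_checkpoint` is hypothetical, and the second filename in the example list is made up for illustration.

```python
import re

def best_checkpoint(filenames):
    """Hypothetical helper: return the checkpoint name with the lowest loss."""
    pattern = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")
    scored = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            # group(2) is the loss written into the filename
            scored.append((float(match.group(2)), name))
    if not scored:
        return None
    return min(scored)[1]

example_files = [
    "weights/weights-improvement-01-2.9984.hdf5",
    "weights/weights-improvement-02-2.7500.hdf5",  # made-up entry for illustration
    "weights/notes.txt",                           # ignored: does not match the pattern
]
best = best_checkpoint(example_files)
print(best)

# The chosen file could then be loaded into a model with the same architecture:
# model.load_weights(best)
```

In practice the list of filenames could come from `os.listdir("weights")`, assuming the checkpoints were written to that folder.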

In [24]:
# pick a random seed
start = numpy.random.randint(0, len(dataX))
# copy the seed so the generation loop does not modify dataX
pattern = list(dataX[start])
print("Seed:")
print("\"", ''.join([int_to_char[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
    # reshape and normalize the current window the same way as the training data
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    # pick the most probable next character
    index = int(numpy.argmax(prediction))
    result = int_to_char[index]
    sys.stdout.write(result)
    # slide the window forward by one character
    pattern.append(index)
    pattern = pattern[1:]
print("\nDone.")

Seed:
" , two!’ said seven.

‘yes, it is his business!’ said five, ‘and i’ll tell him--it was for
bringing t "
oe tor toet toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe tore toe 
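
The generation loop always picks the single most probable character with `numpy.argmax`, which, together with the short training run, can lock the output into a repeating cycle like the one above. One commonly used alternative, which this notebook does not use, is to sample the next character from the predicted distribution, optionally sharpened or flattened by a temperature. The sketch below shows the idea on a made-up 3-character distribution; `sample_index` and the toy probabilities are assumptions for illustration.

```python
import numpy as np

def sample_index(probabilities, temperature, rng):
    """Hypothetical helper: draw an index from a temperature-scaled distribution."""
    probs = np.asarray(probabilities, dtype=np.float64)
    # Work in log space; a small constant avoids log(0).
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

rng = np.random.default_rng(0)
toy_prediction = [0.5, 0.3, 0.2]  # stand-in for one row of model.predict output

# Greedy decoding always returns the same index for the same prediction.
greedy_index = int(np.argmax(toy_prediction))

# Sampling returns different indices in proportion to their probability.
sampled = [sample_index(toy_prediction, 1.0, rng) for _ in range(1000)]
print(greedy_index)
print(np.bincount(sampled, minlength=3))
```

A temperature below 1 makes the choice closer to greedy decoding, while a temperature above 1 makes it more random. In the loop above, this would replace the `numpy.argmax` line with a call on `prediction[0]`.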