# How to Develop a Character-Based Neural Language Model in Keras

* Tutorial website: https://machinelearningmastery.com/develop-character-based-neural-language-model-keras/
* We will use an arbitrary input length of 10 characters for this model.
* The training text is short, and 10 characters covers only a few words.

In [39]:
import numpy as np
from pickle import dump
from pickle import load
from keras.models import load_model
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

## Load Text

In [1]:
# load doc into memory
def load_doc(filename):
    # open the file read-only; the with block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

In [2]:
# load text
raw_text = load_doc('data/rhyme.txt')
print(raw_text)

Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.

When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.

The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.

The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.


## Clean Text

In [3]:
# clean
# we will strip all of the new line characters 
# so that we have one long sequence of characters separated only by white space
tokens = raw_text.split()
raw_text = ' '.join(tokens)
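
As a quick illustration of the cleaning step, here is a minimal sketch on a hypothetical two-line sample (the `sample` string is an assumption standing in for the contents of `rhyme.txt`). It shows that `split()` followed by `' '.join()` collapses every run of whitespace, including blank lines, into a single space.

```python
# Hypothetical sample standing in for the file contents (not the full rhyme)
sample = "Sing a song of sixpence,\nA pocket full of rye.\n\n"

# split() with no arguments splits on any run of whitespace (spaces, tabs, newlines)
tokens = sample.split()

# join the pieces back together with single spaces
cleaned = ' '.join(tokens)

print(repr(cleaned))
```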

## Create Sequences

In [4]:
# organize into sequences of characters
# Each input sequence will be 10 characters 
# with one output character, making each sequence 11 characters long.
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 399
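
The sliding window is easier to see on a toy string. The sketch below wraps the same loop in a hypothetical helper, `make_sequences` (not part of the tutorial), and uses a window of 3 input characters instead of 10. Each sequence holds `length` input characters plus one output character, and the number of sequences is `len(text) - length`.

```python
def make_sequences(text, length):
    """Return every window of `length` input characters plus one output character."""
    return [text[i - length:i + 1] for i in range(length, len(text))]

toy_text = "Sing a song"  # 11 characters
toy_sequences = make_sequences(toy_text, 3)

# show the first few input -> output pairs
for seq in toy_sequences[:3]:
    print(repr(seq[:-1]), '->', repr(seq[-1]))

# one sequence per position from index `length` to the end of the text
print(len(toy_sequences) == len(toy_text) - 3)
```

By the same formula, 399 sequences with `length = 10` would imply a cleaned text of 409 characters.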


## Save Sequences

In [5]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    # the with block closes the file automatically
    with open(filename, 'w') as file:
        file.write(data)

In [6]:
# save sequences to file
out_filename = 'data/char_sequences.txt'
save_doc(sequences, out_filename)

## Train Language Model - Load Data

In [27]:
# load
in_filename = 'data/char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

## Encode Sequences

In [28]:
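# The sequences of characters must be encoded as integers.
# Note: raw_text here still contains the '\n' line separators,
# so the newline character is counted as part of the vocabulary.
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))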

In [29]:
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)
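
To make the integer encoding concrete, here is a small sketch on a toy string (the string `"a pie"` is an assumption chosen for illustration). Characters are sorted, so the space sorts first and receives index 0.

```python
toy_text = "a pie"

# unique characters in sorted order: [' ', 'a', 'e', 'i', 'p']
chars = sorted(set(toy_text))

# map each character to its position in the sorted list
mapping = {c: i for i, c in enumerate(chars)}

# encode the string one character at a time
encoded = [mapping[c] for c in toy_text]

print(mapping)
print(encoded)
```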

In [30]:
# vocabulary size = 38 unique characters
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 38


## Split Inputs and Output

In [31]:
# separate the columns into input and output sequences of characters.
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]

In [32]:
# we need to one hot encode each character. 
# That is, each character becomes a vector as long as 
# the vocabulary (38 elements) with a 1 marked for the specific character

sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = np.array(sequences)
y = to_categorical(y, num_classes=vocab_size)

In [33]:
print(X.shape)
print(y.shape)

(399, 10, 38)
(399, 38)
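
The shapes above read as (samples, time steps, features) for `X` and (samples, features) for `y`. The sketch below reproduces that layout in plain Python on toy data; the `one_hot` helper is a hypothetical stand-in for `to_categorical`, not the Keras implementation.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec

vocab_size = 5                    # toy vocabulary, not the 38 used above
X_int = [[1, 0, 4], [0, 4, 3]]    # 2 sequences of 3 integer-encoded characters
y_int = [3, 2]                    # the character that follows each sequence

X = [[one_hot(i, vocab_size) for i in seq] for seq in X_int]
y = [one_hot(i, vocab_size) for i in y_int]

# (samples, time steps, features) and (samples, features)
print((len(X), len(X[0]), len(X[0][0])))
print((len(y), len(y[0])))
```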


## Fit Model

The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial and error.
The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.
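
As a rough illustration of what the softmax activation does, the sketch below implements it in plain Python on three made-up scores (the input values are assumptions, not model outputs). The results are all positive and sum to 1, which is what lets them be read as a probability distribution over characters.

```python
import math

def softmax(scores):
    """Convert raw scores into positive values that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```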

In [34]:
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 75)                34200     
_________________________________________________________________
dense_2 (Dense)              (None, 38)                2888      
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
None
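
The parameter counts in the summary can be checked by hand. The sketch below uses the standard counting formulas for an LSTM layer and a dense layer; the helper names `lstm_params` and `dense_params` are hypothetical and exist only for this check.

```python
def lstm_params(units, input_dim):
    # four weight sets (three gates plus the cell candidate), each with
    # input weights, recurrent weights, and a bias per unit
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # one weight per input-output pair plus one bias per output
    return inputs * outputs + outputs

lstm = lstm_params(75, 38)
dense = dense_params(75, 38)
print(lstm, dense, lstm + dense)  # prints: 34200 2888 37088
```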


In [35]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, epochs=100, verbose=2)

Epoch 1/100
 - 2s - loss: 3.6204 - acc: 0.0877
Epoch 2/100
 - 0s - loss: 3.5290 - acc: 0.1905
Epoch 3/100
 - 0s - loss: 3.2407 - acc: 0.1905
Epoch 4/100
 - 0s - loss: 3.0754 - acc: 0.1905
Epoch 5/100
 - 0s - loss: 3.0108 - acc: 0.1905
Epoch 6/100
 - 0s - loss: 2.9843 - acc: 0.1905
Epoch 7/100
 - 0s - loss: 2.9653 - acc: 0.1905
Epoch 8/100
 - 0s - loss: 2.9482 - acc: 0.1905
Epoch 9/100
 - 0s - loss: 2.9237 - acc: 0.1905
Epoch 10/100
 - 0s - loss: 2.9027 - acc: 0.1905
Epoch 11/100
 - 0s - loss: 2.8897 - acc: 0.1905
Epoch 12/100
 - 0s - loss: 2.8442 - acc: 0.1930
Epoch 13/100
 - 0s - loss: 2.8124 - acc: 0.2030
Epoch 14/100
 - 0s - loss: 2.7739 - acc: 0.2030
Epoch 15/100
 - 0s - loss: 2.7452 - acc: 0.2657
Epoch 16/100
 - 0s - loss: 2.7333 - acc: 0.2130
Epoch 17/100
 - 0s - loss: 2.6580 - acc: 0.2807
Epoch 18/100
 - 0s - loss: 2.6057 - acc: 0.2481
Epoch 19/100
 - 0s - loss: 2.5606 - acc: 0.2882
Epoch 20/100
 - 0s - loss: 2.4961 - acc: 0.2832
Epoch 21/100
 - 0s - loss: 2.4541 - acc: 0.3183
...

<keras.callbacks.History at 0x7f7f7e0d0be0>

You will see that the model learns the problem well, perhaps too well for generating surprising sequences of characters.
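
One way to read the first loss value in the training log: a model that assigns equal probability to all 38 characters would have a categorical cross-entropy of ln(38). The short check below computes that baseline. It is only a reference point for interpreting the log, not something the tutorial itself computes.

```python
import math

vocab_size = 38

# cross-entropy of a uniform guess: -ln(1 / vocab_size) = ln(vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)

print(round(uniform_loss, 4))  # prints: 3.6376
```

The epoch 1 loss of 3.6204 sits just below this value, which is consistent with a model that starts out close to guessing uniformly.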

## Save Model

In [48]:
# save the model to file
model.save('data/model.h5')

In [49]:
# save the mapping
with open('data/mapping.pkl', 'wb') as f:
    dump(mapping, f)

## Generate Text - Load Model

In [50]:
# load the model
model = load_model('data/model.h5')

In [51]:
# load the mapping
with open('data/mapping.pkl', 'rb') as f:
    mapping = load(f)

## Generate Characters

In [46]:
# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length, keeping the most recent characters
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode; the result already has shape (1, seq_length, vocab size),
        # so no reshape is needed
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict character: take the index with the highest probability
        # (predict_classes is not available in newer Keras releases)
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append the predicted character to the input
        in_text += out_char
    return in_text
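
The two bookkeeping steps inside the loop, keeping only the most recent characters and mapping the predicted index back to a character, can be sketched without a model. Everything below is illustrative: the toy `mapping`, the window of 3, and the fixed `yhat` value are assumptions standing in for the real mapping and the model's prediction. A reverse dictionary is used here as an alternative to scanning `mapping.items()` on every step.

```python
# toy mapping (hypothetical; the real one is loaded from mapping.pkl)
mapping = {' ': 0, 'a': 1, 'e': 2, 'i': 3, 'p': 4}

# build the index -> character lookup once instead of searching each time
reverse_mapping = {index: char for char, index in mapping.items()}

seq_length = 3
in_text = 'a pie'

# keep only the last seq_length characters, as truncating='pre' does
# (unlike pad_sequences, this slice does not zero-pad shorter inputs)
window = in_text[-seq_length:]
encoded = [mapping[c] for c in window]

yhat = 0  # hypothetical predicted class index standing in for the model output
out_char = reverse_mapping[yhat]
in_text += out_char

print(encoded)
print(repr(in_text))
```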

In [47]:
# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 20))
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))

Sing a song of sixpence, A poc
king was in his counting house
hello worl, Whe kin  ais. Whee
