# Character Level Language Model

Using a poem to build a character-level language model.


## Language Model Design

A language model must be trained on the text, and in the case of a character-based language model, the input and output sequences must be characters.

The number of characters used as input also determines how many characters must be supplied to the model as a seed in order to elicit the first predicted character.

After the first character has been generated, it can be appended to the input sequence and used as input for the model to generate the next character.

Longer sequences offer more context for the model to learn what character to output next but take longer to train and impose more burden on seeding the model when generating text.
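The feed-back loop described above can be sketched without any neural network at all. The snippet below is only an illustration: `generate` and `echo_first` are hypothetical names, and `echo_first` is a trivial stand-in for a trained model that simply returns the first character of the window it is shown.

```python
def generate(predict_next, seed_text, seq_length, n_chars):
    """Extend seed_text by n_chars, feeding each prediction back in as input."""
    text = seed_text
    for _ in range(n_chars):
        # the predictor only ever sees the most recent seq_length characters
        window = text[-seq_length:]
        # append the predicted character so it becomes part of the next window
        text += predict_next(window)
    return text


def echo_first(window):
    # hypothetical stand-in for a trained model
    return window[0]


result = generate(echo_first, "abc", seq_length=3, n_chars=4)
print(result)  # abcabca
```

The seed must be at least `seq_length` characters long for the first window to be full, which is the seeding burden that grows with longer input sequences.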

We will use an arbitrary length of 10 characters for this model.

The rhyme is short, and 10 characters span only a few words.

We can now transform the raw text into a form that our model can learn; specifically, input and output sequences of characters.

In [28]:
import numpy as np
from pickle import dump, load
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.layers import LSTM
from keras.preprocessing.sequence import pad_sequences

### Load Text
We must load the text into memory so that we can work with it.

Below is a function named load_doc() that will load a text file given a filename and return the loaded text.

In [3]:
# load doc into memory
def load_doc(filename):
    # open the file as read only; the with block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text


In [4]:
# load text
raw_text = load_doc('data.txt')
print(raw_text)

Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.

When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.

The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.

The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.


### Clean Text

Next, we need to clean the loaded text.

We will not do much to it here. Specifically, we will strip all of the new line characters so that we have one long sequence of characters separated only by single spaces, and convert the text to lowercase to keep the vocabulary small.

In [5]:
# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens).lower()
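As a small illustration of what this cleaning step does, the sketch below applies the same split, join, and lowercase operations to a made-up string (the sample text and variable names are assumptions for the example only).

```python
# a made-up sample with newlines, a blank line, and repeated spaces
raw = "Sing a song\nof   sixpence,\n\nA pocket"

# split() with no arguments breaks on any run of whitespace, including newlines
tokens = raw.split()

# re-join with single spaces and lowercase everything
cleaned = ' '.join(tokens).lower()
print(cleaned)  # sing a song of sixpence, a pocket
```

Punctuation is left untouched, so commas and periods remain part of the character vocabulary.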

In [6]:
raw_text

"sing a song of sixpence, a pocket full of rye. four and twenty blackbirds, baked in a pie. when the pie was opened the birds began to sing; wasn't that a dainty dish, to set before the king. the king was in his counting house, counting out his money; the queen was in the parlour, eating bread and honey. the maid was in the garden, hanging out the clothes, when down came a blackbird and pecked off her nose."

### Create Sequences
Now that we have a long list of characters, we can create our input-output sequences used to train the model.

Each input sequence will be 10 characters with one output character, making each sequence 11 characters long.

We can create the sequences by enumerating the characters in the text, starting at the 11th character at index 10.

In [7]:
# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 399
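To make the windowing concrete, here is the same slicing applied to a made-up 12-character string. This is only a sketch; the sample text and the name `windows` are not part of the original notebook.

```python
text = "hello world!"  # 12 characters, made up for the example
length = 10

# each window holds 10 input characters followed by 1 output character
windows = [text[i - length:i + 1] for i in range(length, len(text))]

for w in windows:
    print(repr(w[:-1]), '->', repr(w[-1]))
# 'hello worl' -> 'd'
# 'ello world' -> '!'

print(len(windows))  # 2
```

The number of windows is always `len(text) - length`, so the 399 sequences reported here correspond to a cleaned text of 409 characters.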


### Save Sequences
Finally, we can save the prepared data to file so that we can load it later when we develop our model.

Below is a function save_doc() that, given a list of strings and a filename, will save the strings to file, one per line.

In [8]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    # the with block closes the file automatically
    with open(filename, 'w') as file:
        file.write(data)

In [9]:
# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

### Train Language Model
In this section, we will develop a neural language model for the prepared sequence data.

The model will read encoded characters and predict the next character in the sequence. A Long Short-Term Memory recurrent neural network hidden layer will be used to learn the context from the input sequence in order to make the predictions.

### Load Data
The first step is to load the prepared character sequence data from 'char_sequences.txt'.

We can use the same load_doc() function developed in the previous section. Once loaded, we split the text by new line to give a list of sequences ready to be encoded.

In [10]:
# load doc into memory
def load_doc(filename):
    # open the file as read only; the with block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

### Encode Sequences
The sequences of characters must be encoded as integers.

This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers.

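We can create the mapping given a sorted set of unique characters in the raw input data. The mapping is a dictionary of character values to integer values. Note that raw_text now holds the contents of char_sequences.txt, so the newline character that separates the sequences is also counted as part of the vocabulary.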

In [11]:
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))

In [12]:
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

In [13]:
# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 29
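The encoding can be traced by hand on a tiny made-up string. The sketch below also builds a reverse dictionary for decoding; `reverse_mapping` is an assumption added for illustration and does not appear in the notebook.

```python
sample = "a pie."  # made-up sample text

# sorted unique characters: space and punctuation sort before letters
chars = sorted(set(sample))
mapping = {c: i for i, c in enumerate(chars)}
print(mapping)  # {' ': 0, '.': 1, 'a': 2, 'e': 3, 'i': 4, 'p': 5}

# encode the text as integers
encoded = [mapping[c] for c in sample]
print(encoded)  # [2, 0, 5, 4, 3, 1]

# hypothetical inverse dictionary for turning integers back into characters
reverse_mapping = {i: c for c, i in mapping.items()}
decoded = ''.join(reverse_mapping[i] for i in encoded)
print(decoded)  # a pie.
```

Sorting the character set makes the integer assignment deterministic, so the same text always yields the same mapping.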


### Split Inputs and Output
Now that the sequences have been integer encoded, we can separate the columns into input and output sequences of characters.

We can do this using a simple array slice.

In [14]:
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
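The effect of the slice is easiest to see on a small array. The values below are made up; only the slicing expression matches the notebook.

```python
import numpy as np

# two made-up encoded sequences: three input values plus one output value each
sequences = np.array([[1, 2, 3, 4],
                      [5, 6, 7, 8]])

# every column except the last is input; the last column is the output
X, y = sequences[:, :-1], sequences[:, -1]

print(X.tolist())  # [[1, 2, 3], [5, 6, 7]]
print(y.tolist())  # [4, 8]
```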

Next, we need to one hot encode each character. That is, each character becomes a vector as long as the vocabulary (29 elements) with a 1 marked for the specific character. This provides a more precise input representation for the network. It also provides a clear objective for the network to predict, where a probability distribution over characters can be output by the model and compared to the ideal case of all 0 values with a 1 for the actual next character.

We can use the to_categorical() function in the Keras API to one hot encode the input and output sequences.

In [16]:
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = np.array(sequences)
y = to_categorical(y, num_classes=vocab_size)
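To show the shape change that one hot encoding produces, the sketch below uses plain NumPy indexing into an identity matrix instead of the Keras call. This is a swapped-in equivalent for illustration, with a made-up vocabulary of 4 and made-up integer sequences.

```python
import numpy as np

vocab_size = 4  # made-up vocabulary size for the example
X_int = np.array([[0, 2, 1],
                  [3, 3, 0]])  # 2 sequences of 3 integer-encoded characters

# row i of the identity matrix is the one hot vector for integer i
X_onehot = np.eye(vocab_size)[X_int]

print(X_onehot.shape)  # (2, 3, 4)
print(X_onehot[0, 1])  # [0. 0. 1. 0.]
```

By the same logic, 399 sequences of 10 characters over a 29-character vocabulary give an input array of shape (399, 10, 29).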

### Fit Model
The model is defined with an input layer that takes sequences that have 10 time steps and 29 features for the one hot encoded input sequences.

Rather than hard-code these numbers, we use the second and third dimensions of the X input data. This way, if we change the length of the sequences or the size of the vocabulary, we do not need to change the model definition.

The model has a single LSTM hidden layer with 75 memory cells.

The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.

In [17]:
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 75)                31500     
_________________________________________________________________
dense (Dense)                (None, 29)                2204      
Total params: 33,704
Trainable params: 33,704
Non-trainable params: 0
_________________________________________________________________
None
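The parameter counts in the summary can be checked by hand. The sketch below uses the standard counts for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and a dense layer; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and one bias per unit
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # one weight per input-output pair plus one bias per output
    return input_dim * units + units


lstm_count = lstm_params(input_dim=29, units=75)
dense_count = dense_params(input_dim=75, units=29)

print(lstm_count)                # 31500
print(dense_count)               # 2204
print(lstm_count + dense_count)  # 33704
```

These match the 31,500 and 2,204 parameters reported for the two layers, and the total of 33,704.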


The model is learning a multi-class classification problem, so we use the categorical cross-entropy (log) loss intended for this type of problem. The efficient Adam implementation of gradient descent is used to optimize the model, and accuracy is reported at the end of each epoch.
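As a rough guide to reading the loss values, the sketch below computes categorical cross-entropy in NumPy for a model that assigns equal probability to all 29 characters. It is an illustration only, not the Keras implementation: the function name and the `eps` clipping value are assumptions.

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # clip to avoid log(0); eps is an assumed value for this sketch
    y_pred = np.clip(y_pred, eps, 1.0)
    # negative log probability of the true class, averaged over samples
    return float(-np.sum(y_true * np.log(y_pred), axis=-1).mean())


vocab_size = 29
y_true = np.eye(vocab_size)[[3]]                      # one sample, true class 3
uniform = np.full((1, vocab_size), 1.0 / vocab_size)  # untrained-style guess

baseline = categorical_crossentropy(y_true, uniform)
print(round(baseline, 4))  # 3.3673
```

A uniform guess gives a loss of ln(29), about 3.37, which is close to the loss of 3.3358 reported for the first epoch in the training log. Values well below that indicate the model has learned something about the text.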

The model is fit for 100 training epochs.

In [18]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, epochs=100, verbose=2)

Epoch 1/100
13/13 - 2s - loss: 3.3358 - accuracy: 0.1328
Epoch 2/100
13/13 - 0s - loss: 3.2270 - accuracy: 0.1905
Epoch 3/100
13/13 - 0s - loss: 3.0330 - accuracy: 0.1905
Epoch 4/100
13/13 - 0s - loss: 2.9310 - accuracy: 0.1905
Epoch 5/100
13/13 - 0s - loss: 2.8996 - accuracy: 0.1905
Epoch 6/100
13/13 - 0s - loss: 2.8753 - accuracy: 0.1905
Epoch 7/100
13/13 - 0s - loss: 2.8566 - accuracy: 0.1905
Epoch 8/100
13/13 - 0s - loss: 2.8369 - accuracy: 0.1905
Epoch 9/100
13/13 - 0s - loss: 2.8257 - accuracy: 0.1905
Epoch 10/100
13/13 - 0s - loss: 2.8097 - accuracy: 0.1905
Epoch 11/100
13/13 - 0s - loss: 2.7898 - accuracy: 0.1905
Epoch 12/100
13/13 - 0s - loss: 2.7470 - accuracy: 0.1905
Epoch 13/100
13/13 - 0s - loss: 2.7159 - accuracy: 0.2005
Epoch 14/100
13/13 - 0s - loss: 2.6843 - accuracy: 0.2030
Epoch 15/100
13/13 - 0s - loss: 2.6463 - accuracy: 0.2180
Epoch 16/100
13/13 - 0s - loss: 2.6011 - accuracy: 0.2481
Epoch 17/100
13/13 - 0s - loss: 2.5599 - accuracy: 0.2682
Epoch 18/100
13/13 - 0s

<tensorflow.python.keras.callbacks.History at 0x7f48b829dfd0>

### Save Model
After the model is fit, we save it to file for later use.

The Keras model API provides the save() function that we can use to save the model to a single file, including weights and topology information.

In [19]:
# save the model to file
model.save('model.h5')

We also save the mapping from characters to integers that we will need to encode any input when using the model and decode any output from the model.

In [21]:
# save the mapping; the with block closes the file once it is written
with open('mapping.pkl', 'wb') as f:
    dump(mapping, f)

### Generate Text
We will use the learned language model to generate new sequences of text that have the same statistical properties as the source text.

In [26]:
# load the model
model = load_model('model.h5')
# load the mapping; the with block closes the file after reading
with open('mapping.pkl', 'rb') as f:
    mapping = load(f)

In [27]:
# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # keep only the last seq_length characters (pad at the front if shorter)
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict the most likely next character
        # (predict_classes() was removed in TensorFlow 2.6; argmax over predict() is equivalent)
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append the predicted character to the input
        in_text += out_char
    return in_text

In [32]:
# test start of rhyme
print(generate_seq(model, mapping, 10, 'sing a son', 30))
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 30))
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 30))



sing a song of sixpence, a pocket full o
king was in his counting out his conengt
hello worli  aas ingd in hi psnet wa l a
