- A language model predicts the next word in a sequence based on the words that precede it.
- It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train.
- Nevertheless, character-based neural language models show a lot of promise as a general, flexible, and powerful approach to language modeling.
- A language model must be trained on text; for a character-based model, both the input and output sequences are made of characters.
- The number of characters used as input also determines how many characters must be supplied as a seed to elicit the first predicted character.
- Once the first character has been generated, it is appended to the input sequence and fed back to the model to generate the next character.
- Longer sequences give the model more context for learning which character to output next, but they take longer to train and require a longer seed when generating text.
- We will use an arbitrary length of 10 characters for this model.
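
The seed-then-append loop described in the list above can be sketched without a neural network. This is only an illustration under stated assumptions: `build_table` and `generate` are hypothetical helpers, and a plain lookup table stands in for the trained model.

```python
def build_table(text, length):
    """Map every window of `length` characters to the character that follows it."""
    table = {}
    for i in range(length, len(text)):
        table[text[i - length:i]] = text[i]
    return table


def generate(table, seed, length, n_chars):
    """Repeatedly predict one character and append it to the running text."""
    out = seed
    for _ in range(n_chars):
        window = out[-length:]          # only the last `length` characters are used
        next_char = table.get(window)   # stand-in for a model prediction
        if next_char is None:           # unseen window: this toy "model" gives up
            break
        out += next_char
    return out


table = build_table("hello world", 3)
generated = generate(table, "hel", 3, 5)
print(generated)  # hello wo
```

The seed must contain at least `length` characters, which is the burden on seeding that longer sequences impose.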

In [47]:
from numpy import array
from pickle import dump
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

#### Load Text

In [1]:
def load_doc(filename):
    # open the file as read only; the with-block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)

Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.

When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.

The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.

The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.


#### Clean Text

We will strip all of the newline characters so that we have one long sequence of characters separated only by single spaces.

In [2]:
tokens = raw_text.split()
raw_text = ' '.join(tokens)
raw_text

"Sing a song of sixpence, A pocket full of rye. Four and twenty blackbirds, Baked in a pie. When the pie was opened The birds began to sing; Wasn't that a dainty dish, To set before the king. The king was in his counting house, Counting out his money; The queen was in the parlour, Eating bread and honey. The maid was in the garden, Hanging out the clothes, When down came a blackbird And pecked off her nose."

In [3]:
len(raw_text)

409

#### Create Sequences

Each input sequence will be 10 characters long, followed by one output character, so each stored sequence is 11 characters long.

In [5]:
length = 10
sequences = list()
for i in range(length,len(raw_text)):
    seq = raw_text[i-length:i+1]
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))    

Total Sequences: 399
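
The total follows from the loop bounds: one sequence is produced for every index from `length` up to the end of the text, so 409 characters yield 409 - 10 = 399 sequences. A minimal sketch on just the first line of the rhyme, where `make_sequences` is a hypothetical helper that wraps the same slicing:

```python
def make_sequences(text, length):
    """Return every window of `length` input characters plus one output character."""
    return [text[i - length:i + 1] for i in range(length, len(text))]


sample = "Sing a song of sixpence,"
seqs = make_sequences(sample, 10)

print(len(sample), len(seqs))  # 24 14
print(repr(seqs[0]))           # 'Sing a song'
print(repr(seqs[-1]))          # 'f sixpence,'
```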


In [6]:
sequences

['Sing a song',
 'ing a song ',
 'ng a song o',
 'g a song of',
 ' a song of ',
 'a song of s',
 ' song of si',
 'song of six',
 'ong of sixp',
 'ng of sixpe',
 'g of sixpen',
 ' of sixpenc',
 'of sixpence',
 'f sixpence,',
 ' sixpence, ',
 'sixpence, A',
 'ixpence, A ',
 'xpence, A p',
 'pence, A po',
 'ence, A poc',
 'nce, A pock',
 'ce, A pocke',
 'e, A pocket',
 ', A pocket ',
 ' A pocket f',
 'A pocket fu',
 ' pocket ful',
 'pocket full',
 'ocket full ',
 'cket full o',
 'ket full of',
 'et full of ',
 't full of r',
 ' full of ry',
 'full of rye',
 'ull of rye.',
 'll of rye. ',
 'l of rye. F',
 ' of rye. Fo',
 'of rye. Fou',
 'f rye. Four',
 ' rye. Four ',
 'rye. Four a',
 'ye. Four an',
 'e. Four and',
 '. Four and ',
 ' Four and t',
 'Four and tw',
 'our and twe',
 'ur and twen',
 'r and twent',
 ' and twenty',
 'and twenty ',
 'nd twenty b',
 'd twenty bl',
 ' twenty bla',
 'twenty blac',
 'wenty black',
 'enty blackb',
 'nty blackbi',
 'ty blackbir',
 'y blackbird',
 ' black

#### Save Sequences

In [7]:
def save_doc(lines, filename):
    # write one sequence per line; the with-block closes the file automatically
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

In [8]:
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

#### Train Language Model

The model will read encoded characters and predict the next character in the sequence. A Long Short-Term Memory recurrent neural network hidden layer will be used to learn the context from the input sequence in order to make the predictions.

#### Load Data

In [10]:
def load_doc(filename):
    # open the file as read only; the with-block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

#### Encode Sequences

The sequences of characters must be encoded as integers.

This means that each unique character is assigned a specific integer value, and each sequence of characters is encoded as a sequence of integers.

We can create the mapping from a sorted set of the unique characters in the raw input data. The mapping is a dictionary from characters to integer values.

In [15]:
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))

In [16]:
mapping

{'\n': 0,
 ' ': 1,
 "'": 2,
 ',': 3,
 '.': 4,
 ';': 5,
 'A': 6,
 'B': 7,
 'C': 8,
 'E': 9,
 'F': 10,
 'H': 11,
 'S': 12,
 'T': 13,
 'W': 14,
 'a': 15,
 'b': 16,
 'c': 17,
 'd': 18,
 'e': 19,
 'f': 20,
 'g': 21,
 'h': 22,
 'i': 23,
 'k': 24,
 'l': 25,
 'm': 26,
 'n': 27,
 'o': 28,
 'p': 29,
 'q': 30,
 'r': 31,
 's': 32,
 't': 33,
 'u': 34,
 'w': 35,
 'x': 36,
 'y': 37}
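
Going from integers back to characters needs the inverse of this dictionary. A minimal sketch, assuming a small mapping built the same way from a short string; the name `reverse_mapping` is hypothetical and is not used elsewhere in this notebook:

```python
# Build a character -> integer mapping the same way as above, on a tiny sample.
sample_text = "sing a song"
mapping_demo = {c: i for i, c in enumerate(sorted(set(sample_text)))}

# Invert it once so that decoding is a dictionary lookup rather than a search.
reverse_mapping = {i: c for c, i in mapping_demo.items()}

encoded = [mapping_demo[c] for c in "song"]
decoded = ''.join(reverse_mapping[i] for i in encoded)

print(encoded)  # [6, 5, 4, 2]
print(decoded)  # song
```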

In [31]:
sequences = list()
for line in lines:    
    encoded_seq = [mapping[char] for char in line]
    sequences.append(encoded_seq)

In [32]:
sequences

[[12, 23, 27, 21, 1, 15, 1, 32, 28, 27, 21],
 [23, 27, 21, 1, 15, 1, 32, 28, 27, 21, 1],
 [27, 21, 1, 15, 1, 32, 28, 27, 21, 1, 28],
 [21, 1, 15, 1, 32, 28, 27, 21, 1, 28, 20],
 [1, 15, 1, 32, 28, 27, 21, 1, 28, 20, 1],
 [15, 1, 32, 28, 27, 21, 1, 28, 20, 1, 32],
 [1, 32, 28, 27, 21, 1, 28, 20, 1, 32, 23],
 [32, 28, 27, 21, 1, 28, 20, 1, 32, 23, 36],
 [28, 27, 21, 1, 28, 20, 1, 32, 23, 36, 29],
 [27, 21, 1, 28, 20, 1, 32, 23, 36, 29, 19],
 [21, 1, 28, 20, 1, 32, 23, 36, 29, 19, 27],
 [1, 28, 20, 1, 32, 23, 36, 29, 19, 27, 17],
 [28, 20, 1, 32, 23, 36, 29, 19, 27, 17, 19],
 [20, 1, 32, 23, 36, 29, 19, 27, 17, 19, 3],
 [1, 32, 23, 36, 29, 19, 27, 17, 19, 3, 1],
 [32, 23, 36, 29, 19, 27, 17, 19, 3, 1, 6],
 [23, 36, 29, 19, 27, 17, 19, 3, 1, 6, 1],
 [36, 29, 19, 27, 17, 19, 3, 1, 6, 1, 29],
 [29, 19, 27, 17, 19, 3, 1, 6, 1, 29, 28],
 [19, 27, 17, 19, 3, 1, 6, 1, 29, 28, 17],
 [27, 17, 19, 3, 1, 6, 1, 29, 28, 17, 24],
 [17, 19, 3, 1, 6, 1, 29, 28, 17, 24, 19],
 [19, 3, 1, 6, 1, 29, 28, 17, 

In [19]:
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 38


#### Split Inputs and Output

Now that the sequences have been integer encoded, we can split each row into the input sequence (the first 10 columns) and the output character (the last column).

In [33]:
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]

In [34]:
X

array([[12, 23, 27, ..., 32, 28, 27],
       [23, 27, 21, ..., 28, 27, 21],
       [27, 21,  1, ..., 27, 21,  1],
       ...,
       [28, 20, 20, ...,  1, 27, 28],
       [20, 20,  1, ..., 27, 28, 32],
       [20,  1, 22, ..., 28, 32, 19]])

In [39]:
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)

In [40]:
X.shape

(399, 10, 38)

In [41]:
y.shape

(399, 38)
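
The shapes above come from one-hot encoding: each integer becomes a vector of length `vocab_size` with a single 1, so 399 rows of 10 integers become a (399, 10, 38) array. A plain-Python sketch of that idea, where `one_hot` is a hypothetical helper and not the Keras implementation:

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


vocab_size = 38
# First input row from the encoded sequences: 'Sing a son'
row = [12, 23, 27, 21, 1, 15, 1, 32, 28, 27]
encoded_row = [one_hot(i, vocab_size) for i in row]

print(len(encoded_row), len(encoded_row[0]))  # 10 38
print(encoded_row[0].index(1.0))              # 12
```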

#### Fit Model

In [42]:
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 75)                34200     
_________________________________________________________________
dense (Dense)                (None, 38)                2888      
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
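
The parameter counts in the summary can be checked by hand. This sketch assumes a standard LSTM layer with four gates, each with input weights, recurrent weights, and a bias, and a fully connected output layer with a bias; `lstm_params` and `dense_params` are hypothetical helpers:

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units


lstm_count = lstm_params(38, 75)    # 38 one-hot inputs, 75 memory cells
dense_count = dense_params(75, 38)  # 75 LSTM outputs, 38 characters

print(lstm_count)                # 34200
print(dense_count)               # 2888
print(lstm_count + dense_count)  # 37088
```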


In [43]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, epochs=100, verbose=2)

Epoch 1/100
13/13 - 0s - loss: 3.6055 - accuracy: 0.0977
Epoch 2/100
13/13 - 0s - loss: 3.4617 - accuracy: 0.1830
Epoch 3/100
13/13 - 0s - loss: 3.1319 - accuracy: 0.1905
Epoch 4/100
13/13 - 0s - loss: 3.0287 - accuracy: 0.1905
Epoch 5/100
13/13 - 0s - loss: 3.0118 - accuracy: 0.1905
Epoch 6/100
13/13 - 0s - loss: 2.9852 - accuracy: 0.1905
Epoch 7/100
13/13 - 0s - loss: 2.9691 - accuracy: 0.1905
Epoch 8/100
13/13 - 0s - loss: 2.9573 - accuracy: 0.1905
Epoch 9/100
13/13 - 0s - loss: 2.9357 - accuracy: 0.1905
Epoch 10/100
13/13 - 0s - loss: 2.9252 - accuracy: 0.1905
Epoch 11/100
13/13 - 0s - loss: 2.9039 - accuracy: 0.1905
Epoch 12/100
13/13 - 0s - loss: 2.8791 - accuracy: 0.1905
Epoch 13/100
13/13 - 0s - loss: 2.8510 - accuracy: 0.1905
Epoch 14/100
13/13 - 0s - loss: 2.8258 - accuracy: 0.2030
Epoch 15/100
13/13 - 0s - loss: 2.7999 - accuracy: 0.2030
Epoch 16/100
13/13 - 0s - loss: 2.7575 - accuracy: 0.2206
Epoch 17/100
13/13 - 0s - loss: 2.7273 - accuracy: 0.2356
Epoch 18/100
13/13 - 0s

<tensorflow.python.keras.callbacks.History at 0x1dc03027e10>
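
As a rough sanity check on the log, a model that assigns equal probability to all 38 characters would have a categorical cross-entropy of ln(38), which is close to the loss reported for the first epoch. A small sketch of that arithmetic:

```python
import math

vocab_size = 38
# Cross-entropy of a uniform prediction: -log(1 / vocab_size)
uniform_loss = -math.log(1.0 / vocab_size)

print(round(uniform_loss, 4))  # 3.6376
```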

In [44]:
model.save('model.h5')

In [45]:
# save the mapping; the with-block closes the file automatically
with open('mapping.pkl', 'wb') as f:
    dump(mapping, f)

#### Load Model and Generate Text

In [49]:
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # keep only the last seq_length characters (pads on the left if shorter)
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict the most likely next character; predict_classes is deprecated,
        # so take the argmax over the softmax output instead
        yhat = int(model.predict(encoded, verbose=0).argmax(axis=-1)[0])
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append the predicted character to the input
        in_text += out_char
    return in_text
 
# load the model
model = load_model('model.h5')
# load the mapping
with open('mapping.pkl', 'rb') as f:
    mapping = load(f)
 
# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 20))
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))

Sing a song of sixpence, A poc
king was in his counting house
hello worls Fmuet a aaiig donn
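
The `pad_sequences` call in `generate_seq` is what keeps the input at exactly 10 characters as the text grows. A plain-Python sketch of the behaviour relied on here (truncate from the front, pad on the left with 0), where `pad_or_truncate_pre` is a hypothetical helper and not the Keras function:

```python
def pad_or_truncate_pre(encoded, maxlen, value=0):
    """Keep the last `maxlen` items; left-pad with `value` if there are fewer."""
    trimmed = encoded[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


# A 12-item input keeps only its last 10 items.
long_input = list(range(12))
print(pad_or_truncate_pre(long_input, 10))  # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# A 3-item input is padded on the left with zeros.
short_input = [5, 6, 7]
print(pad_or_truncate_pre(short_input, 10))  # [0, 0, 0, 0, 0, 0, 0, 5, 6, 7]
```

In the mapping used in this notebook, index 0 is the newline character, so a seed shorter than 10 characters would effectively be padded with newlines. Supplying a full 10-character seed, as in the three tests above, avoids that.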
