# How to Develop a Word-Level Neural Language Model and Use it to Generate Text

* Tutorial website: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/

* In this tutorial, we will develop a model of the text that we can then use to generate new sequences of text.
* Based on the corpus/text "The Republic" by Plato: http://www.gutenberg.org/cache/epub/1497/pg1497.txt
* The cleaned version is called republic_clean.txt and can be found under "data" in this repo
* Salient characteristics of the text:
    * Book/Chapter headings (e.g. “BOOK I.”).
    * British English spelling (e.g. “honoured”)
    * Lots of punctuation (e.g. “–”, “;–”, “?–”, and more)
    * Strange names (e.g. “Polemarchus”).
    * Some long monologues that go on for hundreds of lines.
    * Some quoted dialog (e.g. ‘…’)
* We will use input sequences of 50 words, a somewhat arbitrary choice.
* Now that we have a model design, we can look at transforming the raw text into sequences of 50 input words to 1 output word, ready to fit a model.
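
The "50 input words to 1 output word" framing can be sketched on a toy token list. This is a minimal illustration only: it assumes a window of 3 input words instead of 50, and `toy_tokens`, `n_in` and `pairs` are made-up names for the example.

```python
# Toy illustration of "n input words -> 1 output word" windows.
# Assumption: n_in = 3 purely for readability; the tutorial uses 50.
toy_tokens = "i went down yesterday to the piraeus".split()
n_in = 3
length = n_in + 1  # the inputs plus the word to predict

pairs = []
# + 1 so that the final window (ending at the last token) is included
for i in range(length, len(toy_tokens) + 1):
    window = toy_tokens[i - length:i]
    pairs.append((window[:-1], window[-1]))

for inputs, target in pairs:
    print(inputs, '->', target)
```

Each printed line is the previous one shifted along by one word, which is the structure the sequence file is meant to have.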

In [1]:
import numpy as np
import string
import keras
from random import randint
from pickle import load
from pickle import dump
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# load doc into memory
def load_doc(filename):
    # open the file read-only; the with block closes it automatically
    with open(filename, 'r') as file:
        # read all text
        text = file.read()
    return text



In [2]:
# load document
in_filename = 'data/republic_clean.txt'
doc = load_doc(in_filename)
# print the first 200 characters
print(doc[:200])

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in wha


In [6]:
# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
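
To see what each cleaning step does, the sketch below applies the same operations as `clean_doc` one at a time. The `sample` line is made up for illustration (it is not a quote from the corpus), chosen to contain a heading, a double hyphen, brackets and a number.

```python
import string

# Made-up sample line containing the kinds of tokens listed in the notes above.
sample = "BOOK I. He was honoured--and so, (Bendis,) replied Polemarchus; 1497"

# Same steps as clean_doc, applied one at a time so each effect is visible.
tokens = sample.replace('--', ' ').split()
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]   # punctuation removed
alpha = [w for w in stripped if w.isalpha()]      # '1497' is dropped here
cleaned = [w.lower() for w in alpha]
print(cleaned)
```

Note that British spellings such as "honoured" and unusual names pass through unchanged; only punctuation, non-alphabetic tokens and case are normalized.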

In [7]:
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up', 'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what', 'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession', 'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished', 'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant', 'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting', 'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid', 'us', '

In [8]:
# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 118632
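
As a sanity check on the count: a list of N tokens contains N - length + 1 full windows of `length` tokens, while `range(length, N)` visits N - length of them, so the loop above stops one window short of the end. The sketch below uses a made-up list of 10 integer "tokens" to show the difference; whether that final window matters for training is a judgment call.

```python
# Hypothetical toy data: 10 integer "tokens", windows of 4.
tokens = list(range(10))
length = 4

as_written = [tokens[i - length:i] for i in range(length, len(tokens))]
all_windows = [tokens[i - length:i] for i in range(length, len(tokens) + 1)]

print(len(as_written))   # N - length
print(len(all_windows))  # N - length + 1
print(all_windows[-1])   # the window the first loop never reaches
```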


In [9]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    # the with block closes the file even if the write fails
    with open(filename, 'w') as file:
        file.write(data)

In [11]:
# save sequences to file
out_filename = 'data/republic_sequences.txt'
save_doc(sequences, out_filename)

# Each line is shifted along by one word, with a new word at the end to be predicted.

In [12]:
# load
in_filename = 'data/republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

In [19]:
# integer encode sequences of words
# https://keras.io/preprocessing/text/#tokenizer
# We can access the mapping of words to integers as a 
# dictionary attribute called word_index on the Tokenizer object.
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

In [20]:
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
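
To make the integer encoding concrete, here is a pure-Python approximation of the mapping: words are ranked by frequency and numbered from 1, which leaves index 0 unassigned and is why `vocab_size` adds 1. This is a sketch of the idea, not the Keras implementation; `build_word_index` and `encode` are hypothetical helper names.

```python
from collections import Counter

def build_word_index(lines):
    # Rank words by frequency; the most frequent word gets index 1.
    counts = Counter(word for line in lines for word in line.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(lines, word_index):
    # Replace each word with its integer index.
    return [[word_index[w] for w in line.split()] for line in lines]

toy_lines = ["the son of ariston", "the son of cephalus"]
word_index = build_word_index(toy_lines)
encoded = encode(toy_lines, word_index)
vocab_size = len(word_index) + 1  # +1 because index 0 is never assigned

print(word_index)
print(encoded)
print(vocab_size)
```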

In [26]:
# separate into input and output
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = keras.utils.to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
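
The split into inputs and one-hot targets can be illustrated without NumPy or Keras. The sketch below assumes two toy sequences of 3 input words plus 1 target word; `one_hot` is a hypothetical stand-in for what `to_categorical` does to each target.

```python
# Toy encoded sequences: 3 input words followed by 1 target word each.
sequences = [[1, 2, 3, 4], [2, 3, 4, 5]]
vocab_size = 6  # assumed: 5 distinct words + 1 for the unused index 0

# Everything but the last column is input; the last column is the target.
X = [seq[:-1] for seq in sequences]
y = [seq[-1] for seq in sequences]

def one_hot(index, num_classes):
    # Vector of zeros with a single 1 at position `index`.
    return [1 if i == index else 0 for i in range(num_classes)]

y_one_hot = [one_hot(t, vocab_size) for t in y]
seq_length = len(X[0])

print(X)
print(y_one_hot)
print(seq_length)
```

With the real data, each target becomes a vector of `vocab_size` entries, which is why the output layer of the model needs `vocab_size` units.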

In [29]:
# define model

model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 50, 50)            370500    
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_2 (Dense)              (None, 7410)              748410    
Total params: 1,269,810
Trainable params: 1,269,810
Non-trainable params: 0
_________________________________________________________________
None
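
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the sizes shown above; `lstm_params` and `dense_params` are helper names introduced only for this check.

```python
vocab_size, embed_dim, lstm_units, dense_units = 7410, 50, 100, 100

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # One weight per input-unit pair, plus one bias per unit.
    return input_dim * units + units

counts = {
    'embedding_1': vocab_size * embed_dim,
    'lstm_1': lstm_params(embed_dim, lstm_units),
    'lstm_2': lstm_params(lstm_units, lstm_units),
    'dense_1': dense_params(lstm_units, dense_units),
    'dense_2': dense_params(dense_units, vocab_size),
}
for name, n in counts.items():
    print(name, n)
print('total', sum(counts.values()))
```

Most of the parameters sit in the embedding and the final softmax layer, both of which scale with the vocabulary size.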


In [30]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=100)
# takes several hours on a GTX 1070

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


<keras.callbacks.History at 0x7fe790537358>

In [39]:
# save the model to file
model.save('data/model.h5')
# save the tokenizer
dump(tokenizer, open('data/tokenizer.pkl', 'wb'))
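
A variant of the pickle round trip that closes the files explicitly is sketched below. It uses a plain dictionary as a stand-in for the fitted tokenizer and a temporary directory instead of `data/`, both assumptions made so the example runs on its own.

```python
import os
import tempfile
from pickle import dump, load

# Stand-in object; in the notebook this would be the fitted tokenizer.
word_index = {'the': 1, 'son': 2, 'of': 3}

# Temporary location used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'tokenizer.pkl')

# `with` closes the file even if dump or load raises.
with open(path, 'wb') as f:
    dump(word_index, f)

with open(path, 'rb') as f:
    restored = load(f)

print(restored == word_index)
```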

### Load Data

In [3]:
# load cleaned text sequences
in_filename = 'data/republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

In [4]:
seq_length = len(lines[0].split()) - 1

### Load Model

In [2]:
# load the model
model = load_model('data/model.h5')

# load the tokenizer
tokenizer = load(open('data/tokenizer.pkl', 'rb'))

### Generate Text

In [5]:
# select a seed text
# randint includes both endpoints, so the upper bound must be the last valid index
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

no one will argue that there is any other method of comprehending by any regular process all true existence or of ascertaining what each thing is in its own nature for the arts in general are concerned with the desires or opinions of men or are cultivated with a view to
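
Because `randint(a, b)` can return `b` itself, an index drawn with it needs an upper bound of `len(lines) - 1`; `random.choice` sidesteps the bounds entirely. A small sketch with a made-up list (`toy_lines` is not data from the notebook):

```python
import random

toy_lines = ['first sequence', 'second sequence', 'third sequence']

random.seed(0)  # fixed seed only so the sketch is repeatable
# randint is inclusive at both ends, so the last valid index is len - 1.
indices = [random.randint(0, len(toy_lines) - 1) for _ in range(1000)]
print(min(indices), max(indices))

# random.choice picks an element directly, with no index arithmetic.
seed_text = random.choice(toy_lines)
print(seed_text)
```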



In [9]:
# [0] = get the first element of the list / flatten to one-dimensional list
encoded = tokenizer.texts_to_sequences([seed_text])[0]
print(len(encoded))
print(encoded)

51
[47, 33, 18, 977, 9, 36, 5, 45, 42, 1193, 3, 2978, 27, 45, 2605, 745, 37, 38, 383, 17, 3, 6335, 30, 188, 151, 5, 6, 699, 82, 94, 26, 1, 286, 6, 339, 14, 567, 35, 1, 233, 17, 1141, 3, 77, 17, 14, 2227, 35, 8, 230, 4]


In [12]:
encoded = np.array(encoded)
encoded = encoded[:-1]

In [13]:
# predict the index of the most likely next word
# Passing the 1-D array directly raises:
#   ValueError: Error when checking : expected embedding_1_input to have shape (50,) but got array with shape (1,)
# because Keras reads an array of shape (50,) as 50 samples of length 1.
# Adding a batch dimension gives the expected shape (1, 50).
yhat = model.predict_classes(encoded.reshape(1, -1), verbose=0)

In [None]:
out_word = ''
for word, index in tokenizer.word_index.items():
    if index == yhat:
        out_word = word
        break

In [None]:
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
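
What this call does to a single sequence can be approximated in plain Python: keep the last `maxlen` items, and if the sequence is shorter, fill with zeros on the left (the default padding side). `pad_or_truncate_pre` is a hypothetical helper written for illustration, not part of Keras, and it assumes `maxlen` is at least 1.

```python
def pad_or_truncate_pre(seq, maxlen, value=0):
    # Keep the last `maxlen` items; if shorter, fill with `value` on the left.
    # Assumes maxlen >= 1.
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad_or_truncate_pre([1, 2, 3, 4, 5], 3))  # too long: oldest items dropped
print(pad_or_truncate_pre([1, 2], 3))           # too short: padded on the left
```

Dropping items from the front is what lets the seed text grow during generation while the model keeps seeing only the most recent `seq_length` words.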

In [None]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the most likely next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
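
Two possible refinements of the loop are sketched below under stated assumptions: inverting `word_index` once so each lookup is a dictionary access rather than a scan, and taking the argmax over predicted probabilities, which is the usual replacement where `predict_classes` is not available in newer TensorFlow releases. `fake_predict` is a hypothetical stand-in for `model.predict` so the example runs without a trained model.

```python
# Toy vocabulary; in the notebook this would come from tokenizer.word_index.
word_index = {'the': 1, 'son': 2, 'of': 3, 'ariston': 4}

# Invert once so each lookup is a dictionary access instead of a scan.
index_word = {index: word for word, index in word_index.items()}

def fake_predict(encoded):
    # Hypothetical stand-in for model.predict: one probability per vocabulary
    # index (index 0 unused), putting all mass on the index after the last input.
    probs = [0.0] * (len(word_index) + 1)
    probs[encoded[-1] % len(word_index) + 1] = 1.0
    return probs

def generate(seed, n_words):
    encoded = [word_index[w] for w in seed.split()]
    result = []
    for _ in range(n_words):
        probs = fake_predict(encoded)
        # argmax: the index with the highest probability
        yhat = max(range(len(probs)), key=probs.__getitem__)
        result.append(index_word[yhat])
        encoded.append(yhat)
    return ' '.join(result)

print(generate('the son', 3))
```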

In [None]:
# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)