# Neural Language Modeling

### Problem of Modeling Language
Formal languages, like programming languages, can be fully specified. All the reserved words
can be defined and the valid ways that they can be used can be precisely defined. We cannot do
this with natural language. Natural languages are not designed; they emerge, and therefore
there is no formal specification.


There may be formal rules and heuristics for parts of the language, but as soon as rules
are defined, you will devise or encounter counter examples that contradict the rules. Natural
languages involve vast numbers of terms that can be used in ways that introduce all kinds of
ambiguities, yet can still be understood by other humans. Further, languages change, word usages change: it is a moving target. Nevertheless, linguists try to specify the language with
formal grammars and structures. It can be done, but it is very difficult and the results can
be fragile. An alternative approach to specifying the model of the language is to learn it from
examples.

## Statistical Language Modeling

Statistical Language Modeling, or Language Modeling and LM for short, is the development of
probabilistic models that are able to predict the next word in the sequence given the words that
precede it.


Language modeling is the task of assigning a probability to sentences in a language.
Besides assigning a probability to each sequence of words, the language models
also assigns a probability for the likelihood of a given word (or a sequence of words)
to follow a sequence of words



A language model learns the probability of word occurrence based on examples of text.
Simpler models may look at a context of a short sequence of words, whereas larger models may
work at the level of sentences or paragraphs. Most commonly, language models operate at the
level of words.
The notion of a language model is inherently probabilistic. A language model is a
function that puts a probability measure over strings drawn from some vocabulary.

A language model can be developed and used standalone, such as to generate new sequences
of text that appear to have come from the corpus. Language modeling is a root problem for a
large range of natural language processing tasks. More practically, language models are used
on the front-end or back-end of a more sophisticated model for a task that requires language
understanding.
... language modeling is a crucial component in real-world applications such as
machine-translation and automatic speech recognition, For these reasons, lan-
guage modeling plays a central role in natural-language processing, AI, and machine-
learning research.

A good example is speech recognition, where audio data is used as an input to the model
and the output requires a language model that interprets the input signal and recognizes each
new word within the context of the words already recognized.
Speech recognition is principally concerned with the problem of transcribing the
speech signal as a sequence of words. From this point of view, speech is assumed
to be a generated by a language model which provides estimates of Pr(w) for all word
strings w independently of the observed signal. The goal of speech recognition is
to find the most likely word sequence given the observed acoustic signal.

Similarly, language models are used to generate text in many similar natural language processing tasks, for example:
 Optical Character Recognition

 Handwriting Recognition.

 Machine Translation.

 Spelling Correction.

 Image Captioning.

 Text Summarization

 And much more.

## Neural Language Models


Recently, the use of neural networks in the development of language models has become very
popular, to the point that it may now be the preferred approach. The use of neural networks in
language modeling is often called Neural Language Modeling, or NLM for short. Neural network
approaches are achieving better results than classical methods both on standalone language
models and when models are incorporated into larger models on challenging tasks like speech
recognition and machine translation. A key reason for the leaps in improved performance may
be the method's ability to generalize.

Nonlinear neural network models solve some of the shortcomings of traditional
language models: they allow conditioning on increasingly large context sizes with
only a linear increase in the number of parameters, they alleviate the need for
manually designing backoff orders, and they support generalization across different
contexts.

| Page 109, Neural Network Methods in Natural Language Processing, 2017.


Specifically, a word embedding is adopted that uses a real-valued vector to represent each
word in a projected vector space. This learned representation of words based on their usage
allows words with a similar meaning to have a similar representation.
Neural Language Models (NLM) address the n-gram data sparsity issue through
parameterization of words as vectors (word embeddings) and using them as inputs to
a neural network. The parameters are learned as part of the training process. Word
embeddings obtained through NLMs exhibit the property whereby semantically close
words are likewise close in the induced vector space.

| Character-Aware Neural Language Model, 2015.


This generalization is something that the representation used in classical statistical language
models cannot easily achieve.
\True generalization" is difficult to obtain in a discrete word indice space, since
there is no obvious relation between the word indices.

| Connectionist language modeling for large vocabulary continuous speech recognition, 2002.



Further, the distributed representation approach allows the embedding representation to scale
better with the size of the vocabulary. Classical methods that have one discrete representation
per word fight the curse of dimensionality with larger and larger vocabularies of words that
result in longer and more sparse representations. The neural network approach to language
modeling can be described using the three following model properties, taken from A Neural
Probabilistic Language Model, 2003.


1. Associate each word in the vocabulary with a distributed word feature vector.
2. Express the joint probability function of word sequences in terms of the feature vectors of
these words in the sequence.
3. Learn simultaneously the word feature vector and the parameters of the probability
function.


This represents a relatively simple model where both the representation and probabilistic
model are learned together directly from raw text data. Recently, the neural based approaches
have started to outperform the classical statistical approaches.
We provide ample empirical evidence to suggest that connectionist language mod-
els are superior to standard n-gram techniques, except their high computational
(training) complexity.

Initially, feedforward neural network models were used to introduce the approach. More
recently, recurrent neural networks and then networks with a long-term memory like the Long
Short-Term Memory network, or LSTM, allow the models to learn the relevant context over
much longer input sequences than the simpler feedforward networks.


[an RNN language model] provides further generalization: instead of considering
just several preceding words, neurons with input from recurrent connections are
assumed to represent short term memory. The model learns itself from the data how
to represent memory. While shallow feedforward neural networks (those with just
one hidden layer) can only cluster similar words, recurrent neural network (which
can be considered as a deep architecture) can perform clustering of similar histories.
This allows for instance efficient representation of patterns with variable length.
| Extensions of recurrent neural network language model, 2011.


Recently, researchers have been seeking the limits of these language models. In the paper
Exploring the Limits of Language Modeling, evaluating language models over large datasets,
such as the corpus of one million words, the authors find that LSTM-based neural language
models out-perform the classical methods.

... we have shown that RNN LMs can be trained on large amounts of data, and
outperform competing models including carefully tuned N-grams.
| Exploring the Limits of Language Modeling, 2016.

Further, they propose some heuristics for developing high-performing neural language models
in general:

** 1. Size matters **. The best models were the largest models, specifically number of memory
units.

** 2. Regularization matters **. Use of regularization like dropout on input connections
improves results.

** 3. CNNs vs Embeddings **. Character-level Convolutional Neural Network (CNN) models
can be used on the front-end instead of word embeddings, achieving similar and sometimes
better results.

** 4. Ensembles matter **. Combining the prediction from multiple models can oer large
improvements in model performance.

## How to Develop a Character-Based Neural Language Model

A language model predicts the next word in the sequence based on the specic words that have
come before it in the sequence. It is also possible to develop language models at the character
level using neural networks. The benet of character-based language models is their small
vocabulary and 
exibility in handling any words, punctuation, and other document structure.
This comes at the cost of requiring larger models that are slower to train. Nevertheless, in the
field of neural language models, character-based models offer a lot of promise for a general,flexible and powerful approach to language modeling.

In [14]:
# create sequences

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)
# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)
# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))
# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

print("sequences\n:",sequences)

Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.
When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.
The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.
The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.
Total Sequences: 399
sequences
: ['Sing a song', 'ing a song ', 'ng a song o', 'g a song of', ' a song of ', 'a song of s', ' song of si', 'song of six', 'ong of sixp', 'ng of sixpe', 'g of sixpen', ' of sixpenc', 'of sixpence', 'f sixpence,', ' sixpence, ', 'sixpence, A', 'ixpence, A ', 'xpence, A p', 'pence, A po', 'ence, A poc', 'nce, A pock', 'ce, A pocke', 'e, A pocket', ', A pocket ', ' A pocket f', 'A pocket fu', ' pocket ful', 'pocket full', 'ocket full ', 'cket full o', 'ket full of', 'et full of ', 't full of r', ' full of ry', 'full of rye', 'ull of rye.', 'll of rye. ', '

In [29]:
from numpy import array
from pickle import dump
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# define the model
def define_model(X):
    print("X.shape", X.shape)
    model = Sequential()
    model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')
# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
# integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

    # vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)
# define model
model = define_model(X)
# fit model
model.fit(X, y, epochs=100, verbose=2)
# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

Vocabulary Size: 38
X.shape (399, 10, 38)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_2 (LSTM)                (None, 75)                34200     
_________________________________________________________________
dense_2 (Dense)              (None, 38)                2888      
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
 - 2s - loss: 3.6142 - acc: 0.0852
Epoch 2/100
 - 0s - loss: 3.5111 - acc: 0.1855
Epoch 3/100
 - 0s - loss: 3.1997 - acc: 0.1905
Epoch 4/100
 - 0s - loss: 3.0472 - acc: 0.1905
Epoch 5/100
 - 0s - loss: 3.0019 - acc: 0.1905
Epoch 6/100
 - 0s - loss: 2.9824 - acc: 0.1905
Epoch 7/100
 - 0s - loss: 2.9683 - acc: 0.1905
Epoch 8/100
 - 0s - loss: 2.9547 - acc: 0.1905
Epoch 9/100
 - 0s - loss: 2.9372 - acc: 0.1905
Epoch 10/100
 - 0s - loss: 2.9143 - acc: 0.1905
Epoch 11/100
 - 0s - 

In [28]:
## Generate Text
from pickle import load
from numpy import array
from keras.models import load_model
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict character
        yhat = model.predict_classes(encoded, verbose=0)
    
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
                # append to input
        in_text += out_char
    return in_text



# load the model
model = load_model('model.h5')
# load the mapping
mapping = load(open('mapping.pkl', 'rb'))
# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))
# test mid-line

print(generate_seq(model, mapping, 10, 'king was i', 20))

print(generate_seq(model, mapping, 10, 'Counting', 20))
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))

# test not in original
print(generate_seq(model, mapping, 10, 'green bay', 20))



Sing a song of sixpence, A poc
king was in his counting house
Counting out his money; The 
hello worlsmm oaa bbllaatdbhss
green bay ain  dAney bfeekdbd


## Develop a Word-Based Neural Language Model
Framing Language Modeling
A statistical language model is learned from raw text and predicts the probability of the next
word in the sequence given the words already present in the sequence. Language models are
a key component in larger models for challenging natural language processing problems, like
machine translation and speech recognition. They can also be developed as standalone models
and used for generating new sequences that have the same statistical properties as the source
text.
Language models both learn and predict one word at a time. The training of the network
involves providing sequences of words as input that are processed one at a time where a prediction
can be made and learned for each input sequence. Similarly, when making predictions, the
process can be seeded with one or a few words, then predicted words can be gathered and
presented as input on subsequent predictions in order to build up a generated output sequence
Therefore, each model will involve splitting the source text into input and output sequences,
such that the model can learn to predict words. There are many ways to frame thesequences
from a source text for language modeling. In this tutorial, we will explore 3 different ways of
developing word-based language models in the Keras deep learning library. There is no single
best approach, just different framings that may suit dierent applications.

### Dataset - Download The Republic by Plato.
http://www.gutenberg.org/cache/epub/1497/pg1497.txt

In [32]:
## Model 1: One-Word-In, One-Word-Out Sequences

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result
# define the model
def define_model(vocab_size):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model


# source text
data = """ Jack and Jill went up the hill\n
To fetch a pail of water\n
Jack fell down and broke his crown\n
And Jill came tumbling after\n """
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
    print('Total Sequences: %d' % len(sequences))
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0],sequences[:,1]
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)
# define model
model = define_model(vocab_size)
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate
print(generate_seq(model, tokenizer, 'Jack', 6))

Vocabulary Size: 22
Total Sequences: 1
Total Sequences: 2
Total Sequences: 3
Total Sequences: 4
Total Sequences: 5
Total Sequences: 6
Total Sequences: 7
Total Sequences: 8
Total Sequences: 9
Total Sequences: 10
Total Sequences: 11
Total Sequences: 12
Total Sequences: 13
Total Sequences: 14
Total Sequences: 15
Total Sequences: 16
Total Sequences: 17
Total Sequences: 18
Total Sequences: 19
Total Sequences: 20
Total Sequences: 21
Total Sequences: 22
Total Sequences: 23
Total Sequences: 24
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 10)             220       
_________________________________________________________________
lstm_3 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_3 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
N

Epoch 146/500
 - 0s - loss: 2.2530 - acc: 0.3333
Epoch 147/500
 - 0s - loss: 2.2383 - acc: 0.3333
Epoch 148/500
 - 0s - loss: 2.2235 - acc: 0.3750
Epoch 149/500
 - 0s - loss: 2.2087 - acc: 0.3750
Epoch 150/500
 - 0s - loss: 2.1938 - acc: 0.4167
Epoch 151/500
 - 0s - loss: 2.1788 - acc: 0.4167
Epoch 152/500
 - 0s - loss: 2.1637 - acc: 0.4167
Epoch 153/500
 - 0s - loss: 2.1486 - acc: 0.4167
Epoch 154/500
 - 0s - loss: 2.1334 - acc: 0.4167
Epoch 155/500
 - 0s - loss: 2.1182 - acc: 0.4583
Epoch 156/500
 - 0s - loss: 2.1029 - acc: 0.4583
Epoch 157/500
 - 0s - loss: 2.0875 - acc: 0.5000
Epoch 158/500
 - 0s - loss: 2.0721 - acc: 0.5000
Epoch 159/500
 - 0s - loss: 2.0567 - acc: 0.5417
Epoch 160/500
 - 0s - loss: 2.0413 - acc: 0.5417
Epoch 161/500
 - 0s - loss: 2.0258 - acc: 0.5417
Epoch 162/500
 - 0s - loss: 2.0102 - acc: 0.5417
Epoch 163/500
 - 0s - loss: 1.9947 - acc: 0.5417
Epoch 164/500
 - 0s - loss: 1.9791 - acc: 0.5417
Epoch 165/500
 - 0s - loss: 1.9636 - acc: 0.5833
Epoch 166/500
 - 0s 

Epoch 313/500
 - 0s - loss: 0.4802 - acc: 0.8750
Epoch 314/500
 - 0s - loss: 0.4765 - acc: 0.8750
Epoch 315/500
 - 0s - loss: 0.4728 - acc: 0.8750
Epoch 316/500
 - 0s - loss: 0.4692 - acc: 0.8750
Epoch 317/500
 - 0s - loss: 0.4656 - acc: 0.8750
Epoch 318/500
 - 0s - loss: 0.4621 - acc: 0.8750
Epoch 319/500
 - 0s - loss: 0.4586 - acc: 0.8750
Epoch 320/500
 - 0s - loss: 0.4552 - acc: 0.8750
Epoch 321/500
 - 0s - loss: 0.4518 - acc: 0.8750
Epoch 322/500
 - 0s - loss: 0.4485 - acc: 0.8750
Epoch 323/500
 - 0s - loss: 0.4452 - acc: 0.8750
Epoch 324/500
 - 0s - loss: 0.4419 - acc: 0.8750
Epoch 325/500
 - 0s - loss: 0.4387 - acc: 0.8750
Epoch 326/500
 - 0s - loss: 0.4355 - acc: 0.8750
Epoch 327/500
 - 0s - loss: 0.4324 - acc: 0.8750
Epoch 328/500
 - 0s - loss: 0.4293 - acc: 0.8750
Epoch 329/500
 - 0s - loss: 0.4263 - acc: 0.8750
Epoch 330/500
 - 0s - loss: 0.4232 - acc: 0.8750
Epoch 331/500
 - 0s - loss: 0.4203 - acc: 0.8750
Epoch 332/500
 - 0s - loss: 0.4173 - acc: 0.8750
Epoch 333/500
 - 0s 

 - 0s - loss: 0.2389 - acc: 0.8750
Epoch 481/500
 - 0s - loss: 0.2386 - acc: 0.8750
Epoch 482/500
 - 0s - loss: 0.2382 - acc: 0.8750
Epoch 483/500
 - 0s - loss: 0.2379 - acc: 0.8750
Epoch 484/500
 - 0s - loss: 0.2375 - acc: 0.8750
Epoch 485/500
 - 0s - loss: 0.2372 - acc: 0.8750
Epoch 486/500
 - 0s - loss: 0.2369 - acc: 0.8750
Epoch 487/500
 - 0s - loss: 0.2366 - acc: 0.8750
Epoch 488/500
 - 0s - loss: 0.2362 - acc: 0.8750
Epoch 489/500
 - 0s - loss: 0.2359 - acc: 0.8750
Epoch 490/500
 - 0s - loss: 0.2356 - acc: 0.8750
Epoch 491/500
 - 0s - loss: 0.2353 - acc: 0.8750
Epoch 492/500
 - 0s - loss: 0.2350 - acc: 0.8750
Epoch 493/500
 - 0s - loss: 0.2347 - acc: 0.8750
Epoch 494/500
 - 0s - loss: 0.2344 - acc: 0.8750
Epoch 495/500
 - 0s - loss: 0.2341 - acc: 0.8750
Epoch 496/500
 - 0s - loss: 0.2338 - acc: 0.8750
Epoch 497/500
 - 0s - loss: 0.2335 - acc: 0.8750
Epoch 498/500
 - 0s - loss: 0.2332 - acc: 0.8750
Epoch 499/500
 - 0s - loss: 0.2329 - acc: 0.8750
Epoch 500/500
 - 0s - loss: 0.2327

## Model 2: Line-by-Line Sequence

Another approach is to split up the source text line-by-line, then break each line down into a
series of words that build up.

This approach may allow the model to use the context of each line to help the model in those
cases where a simple one-word-in-and-out model creates ambiguity. In this case, this comes at
the cost of predicting words across lines, which might be fine for now if we are only interested
in modeling and generating lines of text. Note that in this representation, we will require a
padding of sequences to ensure they meet a fixed length input. This is a requirement when
using Keras. First, we can create the sequences of integers, line-by-line by using the Tokenizer
already fit on the source text.


![title](picture12.png)

In [36]:
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
    out_word = ''
    for word, index in tokenizer.word_index.items():
        if index == yhat:
            out_word = word
            break
    # append to input
    in_text += ' ' + out_word
    return in_text
# define the model
def define_model(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=max_length-1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
# source text
data = """ Jack and Jill went up the hill\n
To fetch a pail of water\n
Jack fell down and broke his crown\n
And Jill came tumbling after\n """
# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = define_model(vocab_size, max_length)
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack', 4))
print(generate_seq(model, tokenizer, max_length-1, 'Jill', 4))

Vocabulary Size: 22
Total Sequences: 21
Max Sequence Length: 7
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 6, 10)             220       
_________________________________________________________________
lstm_4 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_4 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
Epoch 1/500
 - 2s - loss: 3.0922 - acc: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.0907 - acc: 0.0476
Epoch 3/500
 - 0s - loss: 3.0892 - acc: 0.0476
Epoch 4/500
 - 0s - loss: 3.0877 - acc: 0.0476
Epoch 5/500
 - 0s - loss: 3.0861 - acc: 0.0476
Epoch 6/500
 - 0s - loss: 3.0846 - acc: 0.0952
Epoch 7/500
 - 0s - loss: 3.0830 - acc: 0.0952
Epoch 

 - 0s - loss: 0.8417 - acc: 0.8571
Epoch 155/500
 - 0s - loss: 0.8314 - acc: 0.8571
Epoch 156/500
 - 0s - loss: 0.8214 - acc: 0.8571
Epoch 157/500
 - 0s - loss: 0.8115 - acc: 0.8571
Epoch 158/500
 - 0s - loss: 0.8019 - acc: 0.8571
Epoch 159/500
 - 0s - loss: 0.7924 - acc: 0.8571
Epoch 160/500
 - 0s - loss: 0.7832 - acc: 0.8571
Epoch 161/500
 - 0s - loss: 0.7742 - acc: 0.8571
Epoch 162/500
 - 0s - loss: 0.7653 - acc: 0.8571
Epoch 163/500
 - 0s - loss: 0.7566 - acc: 0.8571
Epoch 164/500
 - 0s - loss: 0.7482 - acc: 0.9048
Epoch 165/500
 - 0s - loss: 0.7399 - acc: 0.8571
Epoch 166/500
 - 0s - loss: 0.7318 - acc: 0.9048
Epoch 167/500
 - 0s - loss: 0.7238 - acc: 0.9048
Epoch 168/500
 - 0s - loss: 0.7160 - acc: 0.9048
Epoch 169/500
 - 0s - loss: 0.7084 - acc: 0.9048
Epoch 170/500
 - 0s - loss: 0.7010 - acc: 0.9048
Epoch 171/500
 - 0s - loss: 0.6936 - acc: 0.9048
Epoch 172/500
 - 0s - loss: 0.6864 - acc: 0.9048
Epoch 173/500
 - 0s - loss: 0.6793 - acc: 0.9048
Epoch 174/500
 - 0s - loss: 0.6724

Epoch 322/500
 - 0s - loss: 0.2095 - acc: 0.9524
Epoch 323/500
 - 0s - loss: 0.2082 - acc: 0.9524
Epoch 324/500
 - 0s - loss: 0.2069 - acc: 0.9524
Epoch 325/500
 - 0s - loss: 0.2057 - acc: 0.9524
Epoch 326/500
 - 0s - loss: 0.2045 - acc: 0.9524
Epoch 327/500
 - 0s - loss: 0.2031 - acc: 0.9524
Epoch 328/500
 - 0s - loss: 0.2020 - acc: 0.9524
Epoch 329/500
 - 0s - loss: 0.2007 - acc: 0.9524
Epoch 330/500
 - 0s - loss: 0.1995 - acc: 0.9524
Epoch 331/500
 - 0s - loss: 0.1983 - acc: 0.9524
Epoch 332/500
 - 0s - loss: 0.1971 - acc: 0.9524
Epoch 333/500
 - 0s - loss: 0.1959 - acc: 0.9524
Epoch 334/500
 - 0s - loss: 0.1947 - acc: 0.9524
Epoch 335/500
 - 0s - loss: 0.1936 - acc: 0.9524
Epoch 336/500
 - 0s - loss: 0.1924 - acc: 0.9524
Epoch 337/500
 - 0s - loss: 0.1914 - acc: 0.9524
Epoch 338/500
 - 0s - loss: 0.1902 - acc: 0.9524
Epoch 339/500
 - 0s - loss: 0.1891 - acc: 0.9524
Epoch 340/500
 - 0s - loss: 0.1880 - acc: 0.9524
Epoch 341/500
 - 0s - loss: 0.1869 - acc: 0.9524
Epoch 342/500
 - 0s 

Epoch 489/500
 - 0s - loss: 0.1046 - acc: 0.9524
Epoch 490/500
 - 0s - loss: 0.1043 - acc: 0.9524
Epoch 491/500
 - 0s - loss: 0.1041 - acc: 0.9524
Epoch 492/500
 - 0s - loss: 0.1038 - acc: 0.9524
Epoch 493/500
 - 0s - loss: 0.1036 - acc: 0.9524
Epoch 494/500
 - 0s - loss: 0.1034 - acc: 0.9524
Epoch 495/500
 - 0s - loss: 0.1031 - acc: 0.9524
Epoch 496/500
 - 0s - loss: 0.1029 - acc: 0.9524
Epoch 497/500
 - 0s - loss: 0.1027 - acc: 0.9524
Epoch 498/500
 - 0s - loss: 0.1024 - acc: 0.9524
Epoch 499/500
 - 0s - loss: 0.1022 - acc: 0.9524
Epoch 500/500
 - 0s - loss: 0.1020 - acc: 0.9524
Jack and
Jill jill


In [38]:
# Model 3: Two-Words-In, One-Word-Out Sequence

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# define the model
def define_model(vocab_size, max_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=max_length-1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
# source text
data = """ Jack and Jill went up the hill\n
To fetch a pail of water\n
Jack fell down and broke his crown\n
And Jill came tumbling after\n """
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# retrieve vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = define_model(vocab_size, max_length)
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack and', 5))
print(generate_seq(model, tokenizer, max_length-1, 'And Jill', 3))
print(generate_seq(model, tokenizer, max_length-1, 'fell down', 5))
print(generate_seq(model, tokenizer, max_length-1, 'pail of', 5))

Vocabulary Size: 22
Total Sequences: 23
Max Sequence Length: 3
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 2, 10)             220       
_________________________________________________________________
lstm_5 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_5 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
Epoch 1/500
 - 2s - loss: 3.0903 - acc: 0.0435
Epoch 2/500
 - 0s - loss: 3.0895 - acc: 0.1304
Epoch 3/500
 - 0s - loss: 3.0887 - acc: 0.0870
Epoch 4/500
 - 0s - loss: 3.0878 - acc: 0.0870
Epoch 5/500
 - 0s - loss: 3.0870 - acc: 0.0870
Epoch 6/500
 - 0s - loss: 3.0861 - acc: 0.0870
Epoch 7/500
 - 0s - loss: 3.0853 - acc: 0.0870
Epoch 8/50

 - 0s - loss: 1.1950 - acc: 0.8261
Epoch 155/500
 - 0s - loss: 1.1707 - acc: 0.8261
Epoch 156/500
 - 0s - loss: 1.1466 - acc: 0.8261
Epoch 157/500
 - 0s - loss: 1.1227 - acc: 0.8261
Epoch 158/500
 - 0s - loss: 1.0990 - acc: 0.8261
Epoch 159/500
 - 0s - loss: 1.0755 - acc: 0.8261
Epoch 160/500
 - 0s - loss: 1.0523 - acc: 0.8261
Epoch 161/500
 - 0s - loss: 1.0293 - acc: 0.8261
Epoch 162/500
 - 0s - loss: 1.0065 - acc: 0.8261
Epoch 163/500
 - 0s - loss: 0.9841 - acc: 0.8261
Epoch 164/500
 - 0s - loss: 0.9619 - acc: 0.8261
Epoch 165/500
 - 0s - loss: 0.9399 - acc: 0.8261
Epoch 166/500
 - 0s - loss: 0.9183 - acc: 0.8261
Epoch 167/500
 - 0s - loss: 0.8970 - acc: 0.8696
Epoch 168/500
 - 0s - loss: 0.8760 - acc: 0.9565
Epoch 169/500
 - 0s - loss: 0.8553 - acc: 0.9565
Epoch 170/500
 - 0s - loss: 0.8349 - acc: 0.9565
Epoch 171/500
 - 0s - loss: 0.8149 - acc: 0.9565
Epoch 172/500
 - 0s - loss: 0.7952 - acc: 0.9565
Epoch 173/500
 - 0s - loss: 0.7759 - acc: 0.9565
Epoch 174/500
 - 0s - loss: 0.7570

Epoch 322/500
 - 0s - loss: 0.0942 - acc: 0.9565
Epoch 323/500
 - 0s - loss: 0.0939 - acc: 0.9565
Epoch 324/500
 - 0s - loss: 0.0935 - acc: 0.9565
Epoch 325/500
 - 0s - loss: 0.0932 - acc: 0.9565
Epoch 326/500
 - 0s - loss: 0.0928 - acc: 0.9565
Epoch 327/500
 - 0s - loss: 0.0925 - acc: 0.9565
Epoch 328/500
 - 0s - loss: 0.0921 - acc: 0.9565
Epoch 329/500
 - 0s - loss: 0.0918 - acc: 0.9565
Epoch 330/500
 - 0s - loss: 0.0915 - acc: 0.9565
Epoch 331/500
 - 0s - loss: 0.0912 - acc: 0.9565
Epoch 332/500
 - 0s - loss: 0.0909 - acc: 0.9565
Epoch 333/500
 - 0s - loss: 0.0906 - acc: 0.9565
Epoch 334/500
 - 0s - loss: 0.0903 - acc: 0.9565
Epoch 335/500
 - 0s - loss: 0.0900 - acc: 0.9565
Epoch 336/500
 - 0s - loss: 0.0897 - acc: 0.9565
Epoch 337/500
 - 0s - loss: 0.0894 - acc: 0.9565
Epoch 338/500
 - 0s - loss: 0.0891 - acc: 0.9565
Epoch 339/500
 - 0s - loss: 0.0888 - acc: 0.9565
Epoch 340/500
 - 0s - loss: 0.0886 - acc: 0.9565
Epoch 341/500
 - 0s - loss: 0.0883 - acc: 0.9565
Epoch 342/500
 - 0s 

 - 0s - loss: 0.0706 - acc: 0.9565
Epoch 490/500
 - 0s - loss: 0.0706 - acc: 0.9565
Epoch 491/500
 - 0s - loss: 0.0705 - acc: 0.9565
Epoch 492/500
 - 0s - loss: 0.0705 - acc: 0.9565
Epoch 493/500
 - 0s - loss: 0.0704 - acc: 0.9565
Epoch 494/500
 - 0s - loss: 0.0704 - acc: 0.9565
Epoch 495/500
 - 0s - loss: 0.0703 - acc: 0.9565
Epoch 496/500
 - 0s - loss: 0.0703 - acc: 0.9565
Epoch 497/500
 - 0s - loss: 0.0703 - acc: 0.9565
Epoch 498/500
 - 0s - loss: 0.0702 - acc: 0.9565
Epoch 499/500
 - 0s - loss: 0.0702 - acc: 0.9565
Epoch 500/500
 - 0s - loss: 0.0701 - acc: 0.9565
Jack and jill went up the hill
And Jill went up the
fell down and broke his crown and
pail of water jack fell down and


## Project: Develop a Neural Language Model for Text Generation



Clean Text
We need to transform the raw text into a sequence of tokens or words that we can use as a
source to train the model. Based on reviewing the raw text (above), below are some specific
operations we will perform to clean the text. 
 Replace `-' with a white space so we can split words better.

 Split words based on white space.

 Remove all punctuation from words to reduce the vocabulary size (e.g. `What?' becomes
`What').
                                                                   
 Remove all words that are not alphabetic to remove standalone punctuation tokens.
                                                                   
 Normalize all words to lowercase to reduce the vocabulary size.
                                                                   
Vocabulary size is a big deal with language modeling. A smaller vocabulary results in a
smaller model that trains faster. We can implement each of these cleaning operations in this
order in a function. 

In [41]:

import string
import re
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text


# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # prepare regex for char filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
# save tokens to file, one dialog per line


def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
    
# load document
in_filename = 'republic_clean.txt'
doc = load_doc(in_filename)
print(doc[:200])
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))
# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))
# save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)

The Project Gutenberg EBook of The Republic, by Plato

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it u
['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'republic', 'by', 'plato', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 'reuse', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'wwwgutenbergorg', 'title', 'the', 'republic', 'author', 'plato', 'translator', 'b', 'jowett', 'posting', 'date', 'august', 'ebook', 'release', 'date', 'october', 'last', 'updated', 'june', 'language', 'english', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'the', 'republic', 'produced', 'by', 'sue', 'asscher', 'the', 'republic', 'by', 'plato', 'tra

## Training a language model

Model Architecture

### ![title](picture14.png)

In [42]:
from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils.vis_utils import plot_model
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# define the model
def define_model(vocab_size, seq_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 50, input_length=seq_length))
    model.add(LSTM(100, return_sequences=True))
    model.add(LSTM(100))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
# load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
# define model
model = define_model(vocab_size, seq_length)
# fit model
model.fit(X, y, batch_size=128, epochs=100)
# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 50, 50)            532500    
_________________________________________________________________
lstm_6 (LSTM)                (None, 50, 100)           60400     
_________________________________________________________________
lstm_7 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_6 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_7 (Dense)              (None, 10650)             1075650   
Total params: 1,759,050
Trainable params: 1,759,050
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/10

Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78/100
Epoch 79/100
Epoch 80/100
Epoch 81/100
Epoch 82/100
Epoch 83/100
Epoch 84/100
Epoch 85/100
Epoch 86/100
Epoch 87/100
Epoch 88/100
Epoch 89/100
Epoch 90/100
Epoch 91/100
Epoch 92/100
Epoch 93/100
Epoch 94/100
Epoch 95/100
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100


In [43]:
# Generate Text - Predict 

from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
    # map predicted word index to word
    out_word = ''
    for word, index in tokenizer.word_index.items():
        if index == yhat:
            out_word = word
            break
    # append to input
    in_text += ' ' + out_word
    result.append(out_word)
    return ' '.join(result)
# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1
# load the model
model = load_model('model.h5')
# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

In [47]:
# select a seed text
seed_text = lines[randint(0,len(lines))]
print(seed_text + '\n')
# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)

of the future the society of the future have so absorbed their minds that they are unable to see in their true proportions the politics of today they have been intoxicated with great ideas such as liberty or equality or the greatest happiness of the greatest number or the brotherhood of

humanity
