# How to Develop Word-Based Neural Language Models in Python with Keras

Language modeling involves predicting the next word in a sequence given the sequence of words already present.<br>
 The choice of how the language model is framed must match how the language model is intended to be used.

In this tutorial, you will discover how the framing of a language model affects the skill of the model when generating short sequences from a nursery rhyme.

# Framing Language Modeling

<i>Jack and Jill went up the hill</i><br>
<i>To fetch a pail of water</i><br>
<i>Jack fell down and broke his crown</i><br>
<i>And Jill came tumbling after</i><br>

A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence.
<br>
They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.

Language models both learn and predict one word at a time. The training of the network involves providing sequences of words as input that are processed one at a time.

<b>There is no single best approach, just different framings that may suit different applications.<b/>

Methods for text sequences
<ul>
    <li>One-Word-In, One-Word-Out Sequences</li>
    <li>Line-by-Line Sequence</li>
    <li>Two-Words-In, One-Word-Out Sequence</li>
</ul>

# Model 1: One-Word-In, One-Word-Out Sequences

Given one word as input, the model will learn to predict the next word in the sequence.

In [1]:
from numpy import array
import tensorflow as tf
from tf import keras
from keras.preprocessing.text import Tokenizer # for encoding our text
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

Using TensorFlow backend.


The first step is to encode the text as integers.(<i>similar to variable encoding</i>)<br>
Keras provides the <a href="https://keras.io/preprocessing/text/#tokenizer">Tokenizer</a> class that can be used to perform this encoding.

In [2]:
# source text
data = """ Jack and Jill went up the hill\n
        To fetch a pail of water\n
        Jack fell down and broke his crown\n
        And Jill came tumbling after\n """

First, the Tokenizer is to `fit` on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the `texts_to_sequences()` function.

In [3]:
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

We will need to know the size of the vocabulary later for both defining the word embedding layer in the model, and for encoding output words using a one hot encoding.
<br>
The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the `word_index` attribute.

In [4]:
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 22


Next, we need to create sequences of words to fit the model with one word as input and one word as output.

In [5]:
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 24


We can then split the sequences into input (X) and output elements (y). This is straightforward as we only have two columns in the data.

In [6]:
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0],sequences[:,1]

We will fit our model to predict a probability distribution across all words in the vocabulary. That means that we need to turn the output element from a single integer into a one hot encoding with a 0 for every word in the vocabulary and a 1 for the actual word that the value. This gives the network a ground truth to aim for from which we can calculate error and update the model.
<br>
<br>
Keras provides the `to_categorical()` function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.

In [7]:
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)

#### Making our artificial neural network

In [8]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 10)             220       
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_1 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
None


In [9]:
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [10]:
# fit network
model.fit(X, y, epochs=0, verbose=1)

<keras.callbacks.History at 0x2e6d7540400>

After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. Here we pass in ‘Jack‘ by encoding it and calling `model.predict_classes()` to get the integer output for the predicted word. This is then looked up in the vocabulary mapping to give the associated word.

In [11]:
# evaluate
in_text = 'Jack'
print(in_text)

Jack


In [12]:
encoded = tokenizer.texts_to_sequences([in_text])[0]
encoded = array(encoded)
yhat = model.predict_classes(encoded, verbose=0)
for word, index in tokenizer.word_index.items():
    if index == yhat:
        print(word)

went


In [13]:
# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result

In [14]:
# evaluate
print(generate_seq(model, tokenizer, 'Jack', 6))

Jack went went went went went went


# Model 2: Line-by-Line Sequence

Another approach is to split up the source text line-by-line, then break each line down into a series of words that build up.

This approach may allow the model to use the context of each line to help the model in those cases where a simple one-word-in-and-out model creates ambiguity.
<br>
<br>
In this case, this comes at the cost of predicting words across lines, which might be fine for now if we are only interested in modeling and generating lines of text.
<br>
<br>
Note that in this representation, we will require a padding of sequences to ensure they meet a fixed length input. This is a requirement when using Keras.

In [15]:
from numpy import array
import tensorflow as tf
from tf import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

In [16]:

# source text
data = """ Jack and Jill went up the hill\n
        To fetch a pail of water\n
        Jack fell down and broke his crown\n
        And Jill came tumbling after\n """

In [17]:
# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

In [18]:
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 22


First, we can create the sequences of integers, line-by-line by using the Tokenizer already fit on the source text.

In [19]:
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 21


Next, we can pad the prepared sequences. We can do this using the `pad_sequences()` function provided in Keras. This first involves finding the longest sequence, then using that as the length by which to pad-out all other sequences.

In [20]:
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)

Max Sequence Length: 7


Next, we can split the sequences into input and output elements, much like before.

In [21]:
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are max_length-1 in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.

In [22]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 6, 10)             220       
_________________________________________________________________
lstm_2 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_2 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
None


In [23]:
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [24]:
# fit network
model.fit(X, y, epochs=500, verbose=2)

Epoch 1/500
 - 1s - loss: 3.0912 - acc: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.0898 - acc: 0.0476
Epoch 3/500
 - 0s - loss: 3.0886 - acc: 0.0952
Epoch 4/500
 - 0s - loss: 3.0871 - acc: 0.0952
Epoch 5/500
 - 0s - loss: 3.0857 - acc: 0.0952
Epoch 6/500
 - 0s - loss: 3.0842 - acc: 0.0952
Epoch 7/500
 - 0s - loss: 3.0827 - acc: 0.0952
Epoch 8/500
 - 0s - loss: 3.0812 - acc: 0.0952
Epoch 9/500
 - 0s - loss: 3.0796 - acc: 0.0952
Epoch 10/500
 - 0s - loss: 3.0780 - acc: 0.0952
Epoch 11/500
 - 0s - loss: 3.0762 - acc: 0.0952
Epoch 12/500
 - 0s - loss: 3.0744 - acc: 0.0952
Epoch 13/500
 - 0s - loss: 3.0725 - acc: 0.0952
Epoch 14/500
 - 0s - loss: 3.0705 - acc: 0.0952
Epoch 15/500
 - 0s - loss: 3.0683 - acc: 0.0952
Epoch 16/500
 - 0s - loss: 3.0661 - acc: 0.0952
Epoch 17/500
 - 0s - loss: 3.0637 - acc: 0.0952
Epoch 18/500
 - 0s - loss: 3.0612 - acc: 0.0952
Epoch 19/500
 - 0s - loss: 3.0585 - acc: 0.0952
Epoch 20/500
 - 0s - loss: 3.0557 - acc: 0.0952
Epoch 21/500
 - 0s - loss: 3.0526 - acc: 0.09

<keras.callbacks.History at 0x2e6dfc36390>

We can use the model to generate new sequences as before. The `generate_seq()` function can be updated to build up an input sequence by adding predictions to the list of input words each iteration.

In [25]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text


In [26]:
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack', 4))

Jack fell down and broke


In [27]:
print(generate_seq(model, tokenizer, max_length-1, 'Jill', 4))

Jill jill came tumbling after


# Model 3: Two-Words-In, One-Word-Out Sequence

In [28]:
from numpy import array
import tensorflow as tf
from tf import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

In [29]:
# source text
data = """ Jack and Jill went up the hill\n
        To fetch a pail of water\n
        Jack fell down and broke his crown\n
        And Jill came tumbling after\n """


In [30]:
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

In [31]:
# retrieve vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 22


We will use 3 words as input to predict one word as output. The preparation of the sequences is much like the first example, except with different offsets in the source sequence arrays

In [32]:
# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 23


In [33]:
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)

Max Sequence Length: 3


In [34]:
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

In [35]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 2, 10)             220       
_________________________________________________________________
lstm_3 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_3 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
None


In [36]:
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [37]:
# fit network
model.fit(X, y, epochs=500, verbose=2)

Epoch 1/500
 - 1s - loss: 3.0902 - acc: 0.0870
Epoch 2/500
 - 0s - loss: 3.0893 - acc: 0.0435
Epoch 3/500
 - 0s - loss: 3.0885 - acc: 0.1304
Epoch 4/500
 - 0s - loss: 3.0876 - acc: 0.0870
Epoch 5/500
 - 0s - loss: 3.0867 - acc: 0.0870
Epoch 6/500
 - 0s - loss: 3.0858 - acc: 0.0870
Epoch 7/500
 - 0s - loss: 3.0849 - acc: 0.0870
Epoch 8/500
 - 0s - loss: 3.0840 - acc: 0.0870
Epoch 9/500
 - 0s - loss: 3.0831 - acc: 0.0870
Epoch 10/500
 - 0s - loss: 3.0821 - acc: 0.0870
Epoch 11/500
 - 0s - loss: 3.0811 - acc: 0.0870
Epoch 12/500
 - 0s - loss: 3.0801 - acc: 0.0870
Epoch 13/500
 - 0s - loss: 3.0790 - acc: 0.0870
Epoch 14/500
 - 0s - loss: 3.0780 - acc: 0.0870
Epoch 15/500
 - 0s - loss: 3.0769 - acc: 0.0870
Epoch 16/500
 - 0s - loss: 3.0757 - acc: 0.0870
Epoch 17/500
 - 0s - loss: 3.0746 - acc: 0.0870
Epoch 18/500
 - 0s - loss: 3.0734 - acc: 0.0870
Epoch 19/500
 - 0s - loss: 3.0721 - acc: 0.0870
Epoch 20/500
 - 0s - loss: 3.0708 - acc: 0.0870
Epoch 21/500
 - 0s - loss: 3.0695 - acc: 0.0870
E

<keras.callbacks.History at 0x2e6e19b91d0>

In [38]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

In [39]:
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack and', 5))

Jack and jill went up the hill


In [40]:
print(generate_seq(model, tokenizer, max_length-1, 'And Jill', 3))

And Jill went up the


In [41]:
print(generate_seq(model, tokenizer, max_length-1, 'fell down', 5))

fell down and broke his crown and


In [42]:
print(generate_seq(model, tokenizer, max_length-1, 'pail of', 5))

pail of water jack fell down and


# Thank you

### Shout out to our sponsor



![Deep Analytics](img/Deep_Analytics.png)

visit their <a href="http://deepanalytics.ai/">website</a>

Like their <a href="https://www.facebook.com/DeepAnalyticsAI/">Facebook page</a>


![School of AI](img/School_of_ai_logo.png)

<a href="https://www.facebook.com/groups/harareschoolofai/">Join our Facebook Group</a>