# Text Generation with Keras

This notebook contains an example of text generation using Keras. The example uses the first four chapters of Moby Dick to train a text generation model.

### Define utility functions

In [1]:
def read_file(path):
    with open(path) as f:
        str_text = f.read()
    return str_text

In [2]:
print(read_file('../../datasets/moby_dick_four_chapters.txt'))

Call me Ishmael.  Some years ago--never mind how long
precisely--having little or no money in my purse, and nothing
particular to interest me on shore, I thought I would sail about a
little and see the watery part of the world.  It is a way I have of
driving off the spleen and regulating the circulation.  Whenever I
find myself growing grim about the mouth; whenever it is a damp,
drizzly November in my soul; whenever I find myself involuntarily
pausing before coffin warehouses, and bringing up the rear of every
funeral I meet; and especially whenever my hypos get such an upper
hand of me, that it requires a strong moral principle to prevent me
from deliberately stepping into the street, and methodically knocking
people's hats off--then, I account it high time to get to sea as soon
as I can.  This is my substitute for pistol and ball.  With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship.  There is nothing surprising in this.  If they but
knew it,

### Pre-process the text

Let's use spaCy for tokenization. Remember to download the model first; see https://spacy.io/usage/models/. We do not need parsing, tagging, or named-entity recognition, so we disable those pipeline components when loading the model.


In [4]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.3.1/en_core_web_md-2.3.1.tar.gz (50.8 MB)
     |████████████████████████████████| 50.8 MB 505 kB/s
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... done
  Created wheel for en-core-web-md: filename=en_core_web_md-2.3.1-py3-none-any.whl size=50916641 sha256=343caafa758d0b8d10ca956c261e5ba2cbefba1d281f8f7f1c733706d9a04d8d
  Stored in directory: /tmp/pip-ephem-wheel-cache-t23tpzpk/wheels/43/1d/c1/a0af68d0648debf57f875e9dda56bbac35cfc27bfa187ffc46
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [3]:
import spacy

nlp = spacy.load('en_core_web_md', disable=['parser', 'tagger', 'ner'])

Define a function to deal with punctuation and unnecessary characters.

In [5]:
def separate_punc(doc_text, skip_regex='\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '):
    return [token.text.lower() for token in nlp(doc_text) if token.text not in skip_regex]
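
Note that, despite its name, `skip_regex` is used as a plain string: `token.text not in skip_regex` is a substring test, so a token is dropped whenever its text occurs anywhere inside that string. A minimal sketch of this behaviour follows; the token list is hand-written (a hypothetical stand-in for spaCy's output) and `skip_chars` is a shortened, made-up version of the skip string.

```python
# Shortened, hypothetical stand-in for the skip string used above.
skip_chars = '--.,;!?\n '

# Hand-written tokens standing in for what a tokenizer might return.
raw_tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago',
              '--', 'never', 'mind', ',']

# `in` between two strings is a substring test, not a regular-expression match.
kept = [t.lower() for t in raw_tokens if t not in skip_chars]
print(kept)
```

Both the single-character tokens (`.`, `,`) and the two-character `--` are removed here because each appears verbatim inside the skip string.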

In [6]:
%%time
document = read_file('../../datasets/moby_dick_four_chapters.txt')
tokens = separate_punc(document)
print(f"Total tokens in the corpus: {len(tokens)}")

Total tokens in the corpus: 11338
CPU times: user 122 ms, sys: 80 µs, total: 122 ms
Wall time: 122 ms


### The challenge
The idea is to take a window of 25 words and predict the 26th word. We need to build the input data to match this setup.

In [7]:
train_len = 25 + 1
text_sequences = []
for i in range(train_len, len(tokens)):
    # starting from 0, grab windows of 25 + 1 tokens and append them to a sequence list.
    seq = tokens[i - train_len: i]
    text_sequences.append(seq)

Now let's see the resulting sequences. The idea is to predict the last word in the produced sequences using the previous 25 words as features/inputs.

In [8]:
print(text_sequences[0])

['call', 'me', 'ishmael', 'some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on']


In [9]:
print(text_sequences[1])

['me', 'ishmael', 'some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore']
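
The windowing step can be illustrated on a toy list. This is a minimal sketch, assuming a hypothetical helper `sliding_windows` that is not part of the notebook; it uses a window of 3 input words plus 1 target word instead of 25 + 1.

```python
def sliding_windows(items, window):
    """Return every contiguous run of `window` items, in order."""
    return [items[i:i + window] for i in range(len(items) - window + 1)]

toy_tokens = ['call', 'me', 'ishmael', 'some', 'years', 'ago']
windows = sliding_windows(toy_tokens, 4)  # 3 input words + 1 target word

for w in windows:
    # Everything but the last item is the input; the last item is the target.
    print(w[:-1], '->', w[-1])
```

This helper yields `len(items) - window + 1` windows. The loop in the notebook iterates over `range(train_len, len(tokens))`, which produces one window fewer because the final window of the corpus is never reached.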


### Modeling with Keras

First we need to transform the words into numbers that the neural network can understand. This basically means assigning a numeric ID to every token and using the numeric IDs instead of the text.

In [10]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences)
print(sequences[0])

[956, 14, 263, 51, 261, 408, 87, 219, 129, 111, 954, 260, 50, 43, 38, 315, 7, 23, 546, 3, 150, 259, 6, 2712, 14, 24]


We can double-check that every ID maps back to a word by inspecting the tokenizer.

In [11]:
for i in sequences[0]:
    print(f"{i}: {tokenizer.index_word[i]}")

956: call
14: me
263: ishmael
51: some
261: years
408: ago
87: never
219: mind
129: how
111: long
954: precisely
260: having
50: little
43: or
38: no
315: money
7: in
23: my
546: purse
3: and
150: nothing
259: particular
6: to
2712: interest
14: me
24: on
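
To make the mapping less of a black box, here is a plain-Python approximation of the idea: words are ranked by frequency and numbered from 1, leaving 0 unused. It is a sketch under that assumption, not the Keras `Tokenizer` implementation, and `build_word_index` is a hypothetical helper.

```python
from collections import Counter

def build_word_index(token_lists):
    """Rank words by frequency and number them from 1 (0 stays free for padding)."""
    counts = Counter(tok for seq in token_lists for tok in seq)
    # most_common() orders by descending count; ties keep first-seen order.
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

toy_sequences = [['call', 'me', 'ishmael'], ['me', 'and', 'me']]
word_index = build_word_index(toy_sequences)
index_word = {idx: word for word, idx in word_index.items()}

# Replace every token with its numeric ID.
encoded = [[word_index[tok] for tok in seq] for seq in toy_sequences]
print(word_index)
print(encoded)
```

This is consistent with the output above, where very common words such as `and`, `to`, and `in` receive small IDs.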


Let's check the vocabulary size.

In [12]:
vocab_size = len(tokenizer.word_counts)
print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 2717


It is better to work with the sequences as a numpy array.

In [13]:
import numpy as np

sequences = np.array(sequences)
sequences

array([[ 956,   14,  263, ..., 2712,   14,   24],
       [  14,  263,   51, ...,   14,   24,  957],
       [ 263,   51,  261, ...,   24,  957,    5],
       ...,
       [ 952,   12,  166, ...,  262,   53,    2],
       [  12,  166, 2711, ...,   53,    2, 2717],
       [ 166, 2711,    3, ...,    2, 2717,   26]])

### Prepare the data for training
We need to separate the first 25 columns from the last one to get the features and labels we want to train with. Ideally we would also make a train/test split for our modeling; note that this notebook fits the model on all sequences without holding any out.

In [14]:
X = sequences[:,:-1]
y = sequences[:,-1]
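
The notebook trains on every row, so no held-out data is available to check for overfitting. One possible way to carve out a hold-out set is sketched below. It is an illustration only: `chronological_split` is a hypothetical helper, and the toy rows stand in for the encoded windows. The split keeps the original order rather than shuffling, because consecutive windows overlap almost entirely and a random split would place near-identical rows on both sides.

```python
def chronological_split(rows, holdout_fraction=0.1):
    """Split rows into (train, holdout), keeping the original order."""
    cut = len(rows) - int(len(rows) * holdout_fraction)
    return rows[:cut], rows[cut:]

# Toy stand-in for the encoded windows: 10 rows of 3 ids each.
toy_rows = [[i, i + 1, i + 2] for i in range(10)]
train_rows, holdout_rows = chronological_split(toy_rows, holdout_fraction=0.2)

# Same feature/label separation as above: last column is the label.
X_train = [r[:-1] for r in train_rows]
y_train = [r[-1] for r in train_rows]
print(len(train_rows), len(holdout_rows))
```

The same slicing (`rows[:cut]`, `rows[cut:]`) works on a NumPy array such as `sequences`.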

Our current `y` holds one integer word ID per sample; we should one-hot encode it to match the softmax output of the neural network. Since the target classes are actual words, the number of classes is the vocabulary size + 1: Keras word IDs start at 1, and index 0 is reserved for padding.

In [15]:
from keras.utils import to_categorical

y = to_categorical(y, num_classes=vocab_size+1)
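
The reason for the extra class can be seen in a small plain-Python sketch. The `one_hot` helper and the toy vocabulary size are assumptions for illustration; `to_categorical` does the equivalent work on arrays.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * num_classes
    row[index] = 1
    return row

toy_vocab_size = 4  # word ids run from 1 to 4
labels = [2, 4, 1]

# One extra class is needed so that the largest id (4) is a valid position.
encoded = [one_hot(i, toy_vocab_size + 1) for i in labels]
for row in encoded:
    print(row)
```

With only `toy_vocab_size` classes, the label `4` would fall outside the vector. Position 0 is never set because no word has ID 0.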

Also, we need to establish the sequence length from our input array. This is going to be an input for the Keras layers.

In [16]:
seq_len = X.shape[1]

### Build and train the model

In [24]:
from keras.layers import Dense, LSTM, Embedding
from keras.models import Sequential

def create_model(vocabulary_size, seq_len):
    
    model = Sequential()
    model.add(Embedding(vocabulary_size, seq_len, input_length=seq_len))
    model.add(LSTM(75, return_sequences=True))
    model.add(LSTM(50))
    model.add(Dense(256, activation='relu'))
    
    model.add(Dense(vocabulary_size, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    
    return model

In [25]:
model = create_model(vocab_size + 1, seq_len)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 25, 25)            67950     
_________________________________________________________________
lstm_5 (LSTM)                (None, 25, 75)            30300     
_________________________________________________________________
lstm_6 (LSTM)                (None, 50)                25200     
_________________________________________________________________
dense_5 (Dense)              (None, 256)               13056     
_________________________________________________________________
dense_6 (Dense)              (None, 2718)              698526    
Total params: 835,032
Trainable params: 835,032
Non-trainable params: 0
_________________________________________________________________
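
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical and the sizes are read off the summary above.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab, embed_dim = 2718, 25
counts = [
    embedding_params(vocab, embed_dim),  # 67950
    lstm_params(25, 75),                 # 30300
    lstm_params(75, 50),                 # 25200
    dense_params(50, 256),               # 13056
    dense_params(256, vocab),            # 698526
]
print(counts, sum(counts))
```

More than 80% of the 835,032 parameters sit in the final softmax layer, whose size is driven by the vocabulary.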


In [29]:
%%time
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='loss', patience=10, min_delta=0.01)
train_history = model.fit(X, y, batch_size=128, epochs=200, callbacks=[early_stopping])

Epoch 1/200
Epoch 2/200
...
Epoch 199/200
Epoch 200/200
CPU times: user 10min 46s, sys: 24.1 s, total: 11min 10s
Wall time: 7min 6s
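
The log runs through epoch 200/200, so the early-stopping rule did not cut training short. The patience rule itself can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras `EarlyStopping` code; `stopping_epoch` and both loss curves are made up.

```python
def stopping_epoch(losses, patience=10, min_delta=0.01):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:  # improvement larger than min_delta
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical loss curves: one keeps improving, one flattens out.
steadily_improving = [5.0 - 0.02 * i for i in range(200)]
plateau = [3.0, 2.0, 1.0] + [0.995] * 20

print(stopping_epoch(steadily_improving))
print(stopping_epoch(plateau))
```

A loss that keeps dropping by more than `min_delta` per epoch never triggers the rule, while ten consecutive epochs without such a drop end training.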


Training completed all 200 epochs. Let's save the model for later reference and use.

In [30]:
!mkdir -p models

In [31]:
from pickle import dump, load

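# This saves the full model (architecture, weights and optimizer state) in HDF5 format
model.save("models/text-generation.h5")
# This saves only the architecture definition as a YAML file
# (to_yaml is unavailable in recent TensorFlow/Keras releases; to_json is the usual substitute)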
with open("models/text-generation-def.yaml", "w") as file:
    yaml = model.to_yaml()
    file.write(yaml)

# Finally let's save the tokenizer
with open("models/text-generation-tokenizer.pkl", "wb") as file:
    dump(tokenizer, file)

### Model usage
Let's use the model to generate some text. The idea is to pass a sentence along with the model and tokenizer to predict the next word in the provided sentence. We can use this many times to complete text, as shown in the code below.

In [32]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    output_text = []
    input_text = seed_text
    
    for i in range(num_gen_words):
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        
        # Keep only the last seq_len ids when the input is longer than seq_len;
        # shorter inputs are left-padded with zeros
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        
        # Greedy decoding: pick the most probable class. Equivalent to
        # predict_classes, which newer Keras releases no longer provide.
        pred_word_index = np.argmax(model.predict(pad_encoded, verbose=0), axis=-1)[0]
        pred_word = tokenizer.index_word[pred_word_index]
        
        input_text += ' ' + pred_word
        output_text.append(pred_word)
        
    return ' '.join(output_text)
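
The padding and truncation step matters for what the model actually sees. The sketch below mimics, in plain Python, what `pad_sequences` does with its default left padding and `truncating='pre'`. The `pad_or_truncate` helper is hypothetical and assumes `maxlen` is positive.

```python
def pad_or_truncate(ids, maxlen):
    """Keep the last `maxlen` ids; left-pad with 0 when there are fewer."""
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

# Shorter than maxlen: zeros are added on the left.
print(pad_or_truncate([14, 24, 957], 5))
# Longer than maxlen: the oldest ids are dropped.
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5))
```

With `seq_len` equal to 25, a six-word seed therefore reaches the model as 19 zeros followed by 6 word IDs, an input pattern that never occurs in the training windows.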

In [33]:
generate_text(model, tokenizer, seq_len, "nothing particular to interest me on", 25)

'though it were is landlord liable to my criminal pocket and harpoons that really open to embark in their obstreperously passengers lost myself in confounding'

It does not look bad! It at least produced understandable text, although it is a bit strange when read as a whole. Let's try with one of the original sentences.

In [37]:
for i in range(5):
    sentence = ' '.join(text_sequences[i])
    print("#" * 15)
    print(f"Input sentence #{i+1}: {sentence}\n")
    print(f"Predicted text: {generate_text(model, tokenizer, seq_len, sentence, 5)}")
    print("#" * 15)

###############
Input sentence #1: call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on

Predicted text: shore i thought i would
###############
###############
Input sentence #2: me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore

Predicted text: i thought i would sail
###############
###############
Input sentence #3: ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore i

Predicted text: thought i would sail with
###############
###############
Input sentence #4: some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore i thought

Predicted text: i would sail with a
###############
###############
Input sentence #5: years ago never mind how long precisely

When we provide an input that was used for training, the output closely follows the original text, unlike the output for our shorter ad hoc seed. This suggests two possible explanations:
1. The model could be overfitting the training set.
2. In the ad hoc example we only provided a portion of a real input. Hence padding was required, and it probably hurts the prediction (not enough context from which to estimate next-word probabilities).