# Text Generation with an LSTM in Keras

## Load the data
We are going to use the text of Moby Dick for this.

In [1]:
def read_file(filepath):
    """Read file.
    Simple function to read all the text from a file.
    Do not use it with large text files."""
    with open(filepath) as f:
        str_text = f.read()
    return str_text

In [2]:
file_path = "../../datasets/moby_dick_four_chapters.txt"
corpus = read_file(file_path)
print(corpus[:500], "...")

Call me Ishmael.  Some years ago--never mind how long
precisely--having little or no money in my purse, and nothing
particular to interest me on shore, I thought I would sail about a
little and see the watery part of the world.  It is a way I have of
driving off the spleen and regulating the circulation.  Whenever I
find myself growing grim about the mouth; whenever it is a damp,
drizzly November in my soul; whenever I find myself involuntarily
pausing before coffin warehouses, and bringing up t ...


Let's import spaCy, disabling the pipeline components we do not need.

Remember that you need to download spacy data first:
* https://spacy.io/
* https://spacy.io/usage/models

In [3]:
!python -m spacy download en_core_web_sm


    Linking successful
    /Users/OhtarMac/anaconda3/envs/nlp_training/lib/python3.7/site-packages/en_core_web_sm
    -->
    /Users/OhtarMac/anaconda3/envs/nlp_training/lib/python3.7/site-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



In [3]:
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'tagger', 'ner'])
# This is needed in case we want to process a bigger text file
nlp.max_length = 1198623

Let's clean the text a little by removing punctuation and whitespace tokens.

In [4]:
def separate_punctuation(doc_text, black_list='\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '):
    return [token.text.lower() for token in nlp(doc_text) if token.text not in black_list]
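One detail worth keeping in mind: `black_list` is a plain string, so `token.text not in black_list` is a substring test rather than an exact match. The sketch below uses no spaCy, and `black_set` is a hypothetical alternative that is not part of the original notebook. It shows what that means in practice; whether spaCy ever produces an odd token such as `)--.` depends on the tokenizer.

```python
# The default black_list from the cell above.
black_list = '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n '

# `in` on a string checks for a contiguous substring, so multi-character
# tokens such as "--" match, but so does any other slice of the string.
substring_hits = {
    "--": "--" in black_list,
    "whale": "whale" in black_list,
    ")--.": ")--." in black_list,
}
print(substring_hits)

# Hypothetical alternative: a set of exact tokens (single characters plus
# the multi-character tokens we explicitly want to drop).
black_set = set(black_list) | {"--", "\n\n"}
exact_hits = {tok: tok in black_set for tok in substring_hits}
print(exact_hits)
```

With the string, all three punctuation-like candidates except `whale` are filtered; with the set, only the tokens listed explicitly are.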

In [5]:
tokens = separate_punctuation(corpus)
tokens

['call',
 'me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purse',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'i',
 'thought',
 'i',
 'would',
 'sail',
 'about',
 'a',
 'little',
 'and',
 'see',
 'the',
 'watery',
 'part',
 'of',
 'the',
 'world',
 'it',
 'is',
 'a',
 'way',
 'i',
 'have',
 'of',
 'driving',
 'off',
 'the',
 'spleen',
 'and',
 'regulating',
 'the',
 'circulation',
 'whenever',
 'i',
 'find',
 'myself',
 'growing',
 'grim',
 'about',
 'the',
 'mouth',
 'whenever',
 'it',
 'is',
 'a',
 'damp',
 'drizzly',
 'november',
 'in',
 'my',
 'soul',
 'whenever',
 'i',
 'find',
 'myself',
 'involuntarily',
 'pausing',
 'before',
 'coffin',
 'warehouses',
 'and',
 'bringing',
 'up',
 'the',
 'rear',
 'of',
 'every',
 'funeral',
 'i',
 'meet',
 'and',
 'especially',
 'whenever',
 'my',
 'hypos',
 'get',
 'such',
 'an',
 'upper',
 'hand',
 '

### Predict the next word
First, we want to predict a word given a sequence of previous words: say, given the first 25 tokens, we will try to predict token number 26. For this, we need to create token sequences to feed the neural network.

The number of words for the sequence may vary depending on the use case. This needs to be taken into consideration.

In [6]:
def create_token_sequences(train_len, tokens):
    text_sequences = []
    for i in range(train_len, len(tokens)):
        # basically go train_len tokens back
        seq = tokens[i - train_len: i]
        text_sequences.append(seq)
    return text_sequences

In [7]:
text_sequences = create_token_sequences(26, tokens)
text_sequences[:5]

[['call',
  'me',
  'ishmael',
  'some',
  'years',
  'ago',
  'never',
  'mind',
  'how',
  'long',
  'precisely',
  'having',
  'little',
  'or',
  'no',
  'money',
  'in',
  'my',
  'purse',
  'and',
  'nothing',
  'particular',
  'to',
  'interest',
  'me',
  'on'],
 ['me',
  'ishmael',
  'some',
  'years',
  'ago',
  'never',
  'mind',
  'how',
  'long',
  'precisely',
  'having',
  'little',
  'or',
  'no',
  'money',
  'in',
  'my',
  'purse',
  'and',
  'nothing',
  'particular',
  'to',
  'interest',
  'me',
  'on',
  'shore'],
 ['ishmael',
  'some',
  'years',
  'ago',
  'never',
  'mind',
  'how',
  'long',
  'precisely',
  'having',
  'little',
  'or',
  'no',
  'money',
  'in',
  'my',
  'purse',
  'and',
  'nothing',
  'particular',
  'to',
  'interest',
  'me',
  'on',
  'shore',
  'i'],
 ['some',
  'years',
  'ago',
  'never',
  'mind',
  'how',
  'long',
  'precisely',
  'having',
  'little',
  'or',
  'no',
  'money',
  'in',
  'my',
  'purse',
  'and',
  'nothing',
 

Notice that the result is a sliding window over the text: each sequence starts one word to the right of the previous one.
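To make the window easier to see, here is the same logic applied to a toy list of six tokens with a window of three. The names `toy_tokens` and `toy_sequences` are only for illustration.

```python
def create_token_sequences(train_len, tokens):
    # Same logic as the function above: one window of train_len tokens per step.
    return [tokens[i - train_len:i] for i in range(train_len, len(tokens))]

toy_tokens = ["call", "me", "ishmael", "some", "years", "ago"]
toy_sequences = create_token_sequences(3, toy_tokens)

for seq in toy_sequences:
    print(seq)

# The loop yields len(tokens) - train_len windows.
print(len(toy_sequences))
```

Note that with this loop the window ending on the very last token (`['some', 'years', 'ago']` here) is not emitted, since `i` stops at `len(tokens) - 1`.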

Now let's work with Keras.

First, neural networks work with numbers rather than text, so we need to convert the text sequences into numeric sequences. For this purpose, Keras has a built-in Tokenizer that we can use.

In [8]:
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences)
sequences[:5]

Using TensorFlow backend.


[[964,
  14,
  265,
  51,
  263,
  416,
  87,
  222,
  129,
  111,
  962,
  262,
  50,
  43,
  37,
  321,
  7,
  23,
  555,
  3,
  150,
  261,
  6,
  2704,
  14,
  24],
 [14,
  265,
  51,
  263,
  416,
  87,
  222,
  129,
  111,
  962,
  262,
  50,
  43,
  37,
  321,
  7,
  23,
  555,
  3,
  150,
  261,
  6,
  2704,
  14,
  24,
  965],
 [265,
  51,
  263,
  416,
  87,
  222,
  129,
  111,
  962,
  262,
  50,
  43,
  37,
  321,
  7,
  23,
  555,
  3,
  150,
  261,
  6,
  2704,
  14,
  24,
  965,
  5],
 [51,
  263,
  416,
  87,
  222,
  129,
  111,
  962,
  262,
  50,
  43,
  37,
  321,
  7,
  23,
  555,
  3,
  150,
  261,
  6,
  2704,
  14,
  24,
  965,
  5,
  60],
 [263,
  416,
  87,
  222,
  129,
  111,
  962,
  262,
  50,
  43,
  37,
  321,
  7,
  23,
  555,
  3,
  150,
  261,
  6,
  2704,
  14,
  24,
  965,
  5,
  60,
  5]]

What we basically obtained is the same word sequences but each word is encoded with a number.
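As a rough mental model of the encoding, the sketch below builds a similar index in plain Python: words are ranked by frequency and numbered from 1, so the most frequent word gets id 1. This is an approximation for illustration, not the Keras implementation, and `build_word_index` is a hypothetical helper.

```python
from collections import Counter

def build_word_index(sequences):
    # Count every word across all sequences.
    counts = Counter(word for seq in sequences for word in seq)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    # Ids start at 1, leaving 0 unused.
    return {word: idx for idx, word in enumerate(ranked, start=1)}

toy = [["the", "whale", "and", "the", "sea"], ["the", "sea", "and", "me"]]
word_index = build_word_index(toy)
encoded = [[word_index[w] for w in seq] for seq in toy]

print(word_index)
print(encoded)
```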

In [9]:
tokenizer.index_word

{1: 'the',
 2: 'a',
 3: 'and',
 4: 'of',
 5: 'i',
 6: 'to',
 7: 'in',
 8: 'it',
 9: 'that',
 10: 'he',
 11: 'his',
 12: 'was',
 13: 'but',
 14: 'me',
 15: 'with',
 16: 'as',
 17: 'you',
 18: 'this',
 19: 'at',
 20: 'is',
 21: 'all',
 22: 'for',
 23: 'my',
 24: 'on',
 25: 'be',
 26: "'s",
 27: 'not',
 28: 'from',
 29: 'there',
 30: 'one',
 31: 'up',
 32: 'what',
 33: 'him',
 34: 'so',
 35: 'bed',
 36: 'now',
 37: 'no',
 38: 'about',
 39: 'into',
 40: 'by',
 41: 'were',
 42: 'out',
 43: 'or',
 44: 'harpooneer',
 45: 'had',
 46: 'then',
 47: 'have',
 48: 'an',
 49: 'upon',
 50: 'little',
 51: 'some',
 52: 'old',
 53: 'like',
 54: 'if',
 55: 'they',
 56: 'would',
 57: 'do',
 58: 'over',
 59: 'landlord',
 60: 'thought',
 61: 'room',
 62: 'when',
 63: 'could',
 64: 'here',
 65: 'head',
 66: "n't",
 67: 'night',
 68: 'such',
 69: 'which',
 70: 'man',
 71: 'did',
 72: 'sea',
 73: 'though',
 74: 'time',
 75: 'other',
 76: 'very',
 77: 'go',
 78: 'these',
 79: 'more',
 80: 'first',
 81: 'sort',


In [10]:
vocabulary_size = len(tokenizer.word_counts)
print(f"We have {vocabulary_size} total different words.")

We have 2709 total different words.


Now, let's convert these sequences into a numpy array for easier handling.

In [11]:
import numpy as np

sequences_array = np.array(sequences)
sequences_array

array([[ 964,   14,  265, ..., 2704,   14,   24],
       [  14,  265,   51, ...,   14,   24,  965],
       [ 265,   51,  263, ...,   24,  965,    5],
       ...,
       [ 960,   12,  168, ...,  264,   53,    2],
       [  12,  168, 2703, ...,   53,    2, 2709],
       [ 168, 2703,    3, ...,    2, 2709,   26]])

As stated above, we want to predict the next word for a particular sequence. Right now each sequence has 26 words: the first 25 are the input and the 26th is the word we want to predict. For this, we need to separate the last column from the rest to get a clean $X$ and $y$ to work with.

In [12]:
from keras.utils import to_categorical

# take everything up to (but not including) the last column
X = sequences_array[:, :-1]
# take the last column separately
y = sequences_array[:, -1]
# and convert the integer ids into categorical values (one-hot encoding)
# we add one to num_classes because the Tokenizer ids start at 1 and Keras reserves index 0 (used for padding)
y = to_categorical(y, num_classes=vocabulary_size + 1)
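To see why the extra class is needed, here is a small plain-Python sketch of one-hot encoding. The `one_hot` helper and the toy vocabulary size are assumptions for illustration; `to_categorical` does the real work in the notebook.

```python
def one_hot(index, num_classes):
    # A vector of zeros with a single 1 at position `index`.
    vec = [0] * num_classes
    vec[index] = 1
    return vec

toy_vocabulary_size = 4   # word ids run from 1 to 4; id 0 is never used by a word
labels = [1, 4]

# num_classes must cover positions 0..4, hence vocabulary size + 1.
encoded = [one_hot(i, toy_vocabulary_size + 1) for i in labels]
print(encoded)

# With num_classes equal to the vocabulary size, the highest id does not fit.
try:
    one_hot(4, toy_vocabulary_size)
    fits = True
except IndexError:
    fits = False
print(fits)
```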

Also, let's review the input shape

In [13]:
X.shape

(11368, 25)

This means we have 11368 samples, each with 25 words (dimensions).

## Work with Keras

In [14]:
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

def create_model(vocabulary_size, seq_len):
    
    model = Sequential()
    model.add(Embedding(input_dim=vocabulary_size, output_dim=seq_len, input_length=seq_len))
    model.add(LSTM(50, return_sequences=True)) # use some multiple of the sequence length
    model.add(LSTM(50))
    model.add(Dense(50, activation="relu"))
    
    model.add(Dense(vocabulary_size, activation="softmax"))
    
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    
    model.summary()
    
    return model

In [15]:
seq_len = X.shape[1]
model = create_model(vocabulary_size+1, seq_len)

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 25, 25)            67750     
_________________________________________________________________
lstm_1 (LSTM)                (None, 25, 50)            15200     
_________________________________________________________________
lstm_2 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dense_1 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_2 (Dense)              (None, 2710)              138210    
Total params: 243,910
Trainable params: 243,910
Non-trainable params: 0
_________________________________________________________________
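The parameter counts in the summary can be checked by hand with the standard formulas for each layer type. The helper names below are only for illustration; the sizes (vocabulary of 2710 including index 0, embedding size 25, 50 LSTM units) are taken from the summary above.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

layer_params = {
    "embedding_1": embedding_params(2710, 25),
    "lstm_1": lstm_params(25, 50),
    "lstm_2": lstm_params(50, 50),
    "dense_1": dense_params(50, 50),
    "dense_2": dense_params(50, 2710),
}
total_params = sum(layer_params.values())

print(layer_params)
print(total_params)
```

The largest layer by far is the final softmax, which needs one output per vocabulary entry.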


In [None]:
model.fit(X, y, batch_size=64, epochs=10)

Instructions for updating:
Use tf.cast instead.
Epoch 1/10


Let's save the model.

In [None]:
from pickle import dump, load

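# This will save the whole model (architecture, weights and optimizer state) in a single HDF5 file
model.save("models/text-generation.h5")
# This will save only the architecture definition as a yaml file
# (to_yaml() exists in the Keras version used here; later TensorFlow releases dropped YAML support)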
with open("models/text-generation-def.yaml", "w") as file:
    yaml = model.to_yaml()
    file.write(yaml)

# Finally let's save the tokenizer
with open("models/text-generation-tokenizer.pkl", "wb") as file:
    dump(tokenizer, file)
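Once a model is trained, the next-word setup can be applied repeatedly to generate text: predict a word, append it, slide the window and predict again. The sketch below shows only that loop in plain Python. `generate_text`, `predict_next_id` and the toy dictionaries are hypothetical; in practice `predict_next_id` would wrap the trained model (for example, taking the most likely class from its prediction), and the two dictionaries would come from the saved tokenizer.

```python
def generate_text(seed_tokens, num_words, seq_len, word_index, index_word, predict_next_id):
    # Encode the seed, skipping words the tokenizer has never seen.
    ids = [word_index[w] for w in seed_tokens if w in word_index]
    generated = []
    for _ in range(num_words):
        # Keep the last seq_len ids and left-pad with 0 if the seed is short.
        window = ids[-seq_len:]
        window = [0] * (seq_len - len(window)) + window
        next_id = predict_next_id(window)
        ids.append(next_id)
        generated.append(index_word[next_id])
    return " ".join(generated)

# Stand-in for a trained model: a fixed rule over a three-word vocabulary.
toy_index_word = {1: "the", 2: "whale", 3: "sea"}
toy_word_index = {word: idx for idx, word in toy_index_word.items()}

def toy_predict(window):
    return window[-1] % 3 + 1

generated_text = generate_text(["the"], 4, 3, toy_word_index, toy_index_word, toy_predict)
print(generated_text)
```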