### Develop a Neural Language Model for Text Generation

The goal is to develop a language model that predicts the probability of the next word in a sequence, based on the
words already observed in that sequence. For this task we will use the book *The Prince* by Niccolò Machiavelli:
- http://www.gutenberg.org/ebooks/57037

- http://www.gutenberg.org/cache/epub/731/pg731.txt

- http://www.gutenberg.org/cache/epub/1497/pg1497.txt


The aim is to learn:
- How to prepare text for developing a word-based language model.
- How to design and fit a neural language model with a learned embedding and an LSTM
hidden layer.
- How to use the learned language model to generate new text with statistical
properties similar to those of the source text.

In [316]:
from numpy import array,iinfo
from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils.vis_utils import plot_model
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
import string
import re
import pandas as pd
from sys import getsizeof
import time
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

import os
from IPython.display import Image
os.environ["PATH"] += os.pathsep +  'C:\\Users/richard\\Anaconda3\\pkgs\\graphviz-2.38-hfd603c8_2\\Library\\bin\\graphviz'

### Load the Text



In [288]:
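def load_doc(filename):
    # No encoding is given, so the platform default is used. The mojibake at the
    # start of the printed text suggests the file is UTF-8 with a BOM;
    # encoding='utf-8-sig' would probably decode it cleanly.
    with open(filename, 'r') as file:
        text = file.read()
    return text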

In [289]:
filename = 'the_prince.txt'
doc = load_doc(filename)
print(doc[:400])

ï»¿NICCOLÃ’ MACHIAVELLI TO LORENZO THE MAGNIFICENT

1. The various kinds of Government and the ways by which they are
established.

2. Of Hereditary Monarchies.

3. Of Mixed Monarchies.

4. Why the Kingdom of Darius, occupied by Alexander, did not rebel
against the successors of the latter after his death.

5. The way to govern Cities or Dominions that, previous to being
occupied, lived under thei


### Clean the text
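Clean up the text and reduce the vocabulary size:
 - replace '-' with whitespace so that hyphenated words split into separate tokens
 - remove punctuation, e.g. 'Who?' becomes 'Who'
 - drop tokens that are not purely alphabetic (this also discards tokens containing non-printable characters)
 - convert to lowercase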

In [290]:
def clean_doc(doc):
    doc = doc.replace('-',' ')
    tokens = doc.split()
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('',w) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word.lower() for word in tokens]
    return tokens
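To make the effect of each cleaning step concrete, here is a small self-contained sketch that applies the same steps to a made-up sentence. The sample string is an assumption for illustration and is not taken from the book.

```python
import re
import string

def clean_doc(doc):
    # replace hyphens with spaces so hyphenated words split into two tokens
    doc = doc.replace('-', ' ')
    tokens = doc.split()
    # strip every ASCII punctuation character from each token
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in tokens]
    # keep purely alphabetic tokens only, then lowercase them
    tokens = [word for word in tokens if word.isalpha()]
    return [word.lower() for word in tokens]

# hypothetical sample sentence
sample = "Who? The well-known Prince (1513) said: it's done."
cleaned = clean_doc(sample)
print(cleaned)
# ['who', 'the', 'well', 'known', 'prince', 'said', 'its', 'done']
```

Note that the apostrophe is removed rather than split on, so "it's" becomes "its", and the number "1513" is dropped entirely by the alphabetic filter.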

In [291]:
# clean document
tokens = clean_doc(doc)
print(tokens[:300])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['machiavelli', 'to', 'lorenzo', 'the', 'magnificent', 'the', 'various', 'kinds', 'of', 'government', 'and', 'the', 'ways', 'by', 'which', 'they', 'are', 'established', 'of', 'hereditary', 'monarchies', 'of', 'mixed', 'monarchies', 'why', 'the', 'kingdom', 'of', 'darius', 'occupied', 'by', 'alexander', 'did', 'not', 'rebel', 'against', 'the', 'successors', 'of', 'the', 'latter', 'after', 'his', 'death', 'the', 'way', 'to', 'govern', 'cities', 'or', 'dominions', 'that', 'previous', 'to', 'being', 'occupied', 'lived', 'under', 'their', 'own', 'laws', 'of', 'new', 'dominions', 'which', 'have', 'been', 'acquired', 'by', 'ones', 'own', 'arms', 'and', 'powers', 'of', 'new', 'dominions', 'acquired', 'by', 'the', 'power', 'of', 'others', 'or', 'by', 'fortune', 'of', 'those', 'who', 'have', 'attained', 'the', 'position', 'of', 'prince', 'by', 'villainy', 'of', 'the', 'civic', 'principality', 'how', 'the', 'strength', 'of', 'all', 'states', 'should', 'be', 'measured', 'of', 'ecclesiastical', 'pr

### Save Clean Text
Generate sequences of tokens, each made of 50 input words and 1 output word, by sliding a window of length 51 across the tokens.

In [292]:
length = 50+1
sequences = list()
for i in range(length,len(tokens)):   #0-51,1-52,2-53
    seq = tokens[i-length:i]
    line = ' '.join(seq)
    sequences.append(line)
    #print(sequences)
    #break
print('Total Sequences: %d' % len(sequences))

Total Sequences: 30530
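The sliding window can be illustrated on a toy token list. The sketch below is an assumption-labelled variant rather than the notebook's exact loop: it uses `range(length, len(tokens) + 1)` so that the final window is included, whereas the loop above stops one window earlier. The helper name `make_windows` is hypothetical.

```python
def make_windows(tokens, length):
    # each window holds `length` consecutive tokens; the last token of the
    # window is the word to predict, the preceding ones are the input
    return [' '.join(tokens[i - length:i])
            for i in range(length, len(tokens) + 1)]

toy_tokens = 'a b c d e f'.split()
windows = make_windows(toy_tokens, length=3)
print(windows)
# ['a b c', 'b c d', 'c d e', 'd e f']
print(len(windows) == len(toy_tokens) - 3 + 1)
# True
```

With this range a list of N tokens yields N - length + 1 windows; the notebook's loop yields N - length.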


We now have a long list of sequences; save them to a file for later processing.


In [293]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)

In [294]:
# save sequences to file
out_filename = 'the_prince_sequences.txt'
save_doc(sequences, out_filename)

In [295]:
file = open('the_prince_sequences.txt','r')
text = file.read()
file.close()
print(text[:400])
print(type(text))

machiavelli to lorenzo the magnificent the various kinds of government and the ways by which they are established of hereditary monarchies of mixed monarchies why the kingdom of darius occupied by alexander did not rebel against the successors of the latter after his death the way to govern cities or dominions
to lorenzo the magnificent the various kinds of government and the ways by which they ar
<class 'str'>


### Load and Encode the Sequences

In [296]:
in_filename = 'the_prince_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

The word embedding layer expects input sequences to be comprised of integers. We can map each word in our vocabulary to a unique integer and encode our input sequences. Later, when we make predictions, we can convert the prediction to numbers and look up their associated words in the same mapping. To do this encoding, we will use the Tokenizer class in the Keras API.

First, the Tokenizer must be fit on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer. We can then use the fitted Tokenizer to encode all of the training sequences, converting each sequence from a list of words to a list of integers.
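The word-to-integer mapping can be sketched in plain Python. This is only an imitation of the idea, not the Keras implementation: it assumes indices start at 1 and are assigned in order of decreasing word frequency, and the helper names `fit_word_index` and `encode` are hypothetical.

```python
from collections import Counter

def fit_word_index(lines):
    # count every word, then number words from 1 by decreasing frequency
    # (ties keep first-seen order); index 0 is left unused
    counts = Counter(word for line in lines for word in line.split())
    return {word: i
            for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(lines, word_index):
    # replace each word with its integer index
    return [[word_index[w] for w in line.split()] for line in lines]

toy_lines = ['the prince and the state', 'the state and the law']
word_index = fit_word_index(toy_lines)
encoded = encode(toy_lines, word_index)
print(word_index)
# {'the': 1, 'and': 2, 'state': 3, 'prince': 4, 'law': 5}
print(encoded)
# [[1, 4, 2, 1, 3], [1, 3, 2, 1, 5]]
```

Because index 0 is never assigned to a word, the vocabulary size used for the embedding layer is the number of distinct words plus one.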

In [297]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

In [298]:
#print(dir(tokenizer))
print(sequences[:1])

[[1744, 3, 1743, 1, 1742, 1, 786, 466, 4, 239, 2, 1, 238, 10, 23, 17, 24, 306, 4, 365, 579, 4, 517, 579, 936, 1, 141, 4, 784, 413, 10, 120, 92, 11, 935, 69, 1, 934, 4, 1, 305, 129, 12, 167, 1, 111, 3, 933, 364, 31, 212]]


In [299]:
# The mapping of words to integers is available as the word_index dictionary
# attribute of the Tokenizer object. We need the size of the vocabulary to
# define the embedding layer later. Word indices start at 1, so add 1 to
# cover the largest index when it is used as an array position.
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

3234


### Sequence Inputs and Output
- split each sequence into input (first 50 words) and output (last word), then one-hot encode the output

In [300]:
#numpy array
sequences = array(sequences)
print(sequences.shape)
X = sequences[:,:-1]
print(X.shape)
y = sequences[:,-1]
print(y.shape)
y = to_categorical(y,num_classes=vocab_size)
seq_length = X.shape[1]

(30530, 51)
(30530, 50)
(30530,)
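The split and the one-hot encoding can be reproduced on a tiny array with NumPy alone. This is a minimal sketch under stated assumptions: `one_hot` is a hypothetical helper standing in for `to_categorical`, it returns `uint8` rather than the float array seen in this notebook, and the toy sequences are invented.

```python
import numpy as np

def one_hot(indices, num_classes):
    # one row per example, with a single 1 in the column of the class index
    indices = np.asarray(indices)
    encoded = np.zeros((indices.size, num_classes), dtype='uint8')
    encoded[np.arange(indices.size), indices] = 1
    return encoded

# two toy sequences of length 5 over a vocabulary of size 6 (indices 0..5)
toy_sequences = np.array([[1, 4, 2, 1, 3],
                          [1, 3, 2, 1, 5]])
X_toy = toy_sequences[:, :-1]                      # all but the last column
y_toy = one_hot(toy_sequences[:, -1], num_classes=6)
print(X_toy.shape, y_toy.shape)
# (2, 4) (2, 6)
print(y_toy.tolist())
# [[0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1]]
```

The output width equals the vocabulary size, which is why the one-hot target dominates memory use in the real dataset.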


### Optimising the datatypes
The values in the arrays are quite small, so they can be stored in smaller datatypes to reduce the size of the arrays; pandas is used here to inspect and downcast them. This reduces the memory footprint, which may also help the training time of the deep learning algorithm.

In [301]:
df1 = pd.DataFrame(X)
df2 = pd.DataFrame(y)

In [302]:
#get the info on the data
print(df1.info())
print(" ")
print(df2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30530 entries, 0 to 30529
Data columns (total 50 columns):
0     30530 non-null int32
1     30530 non-null int32
2     30530 non-null int32
3     30530 non-null int32
4     30530 non-null int32
5     30530 non-null int32
6     30530 non-null int32
7     30530 non-null int32
8     30530 non-null int32
9     30530 non-null int32
10    30530 non-null int32
11    30530 non-null int32
12    30530 non-null int32
13    30530 non-null int32
14    30530 non-null int32
15    30530 non-null int32
16    30530 non-null int32
17    30530 non-null int32
18    30530 non-null int32
19    30530 non-null int32
20    30530 non-null int32
21    30530 non-null int32
22    30530 non-null int32
23    30530 non-null int32
24    30530 non-null int32
25    30530 non-null int32
26    30530 non-null int32
27    30530 non-null int32
28    30530 non-null int32
29    30530 non-null int32
30    30530 non-null int32
31    30530 non-null int32
32    30530 non-null int32


We can see that X uses int32, and the one-hot target y uses float64 for values that are only 0 or 1. Do we actually need these ranges?

In [303]:
#some examples of datatypes unsigned and signed
data_types = ["uint16","int16","int32"]
for it in data_types:
    print(iinfo(it))

Machine parameters for uint16
---------------------------------------------------------------
min = 0
max = 65535
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------



The X dataframe contains only int32 columns. Do we need such a large datatype?

In [304]:
df1.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
count,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0,...,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0,30530.0
mean,318.335899,318.282673,318.287717,318.336259,318.341173,318.389781,318.392237,318.366951,318.368588,318.374582,...,318.970259,318.960432,318.986898,319.092368,319.086931,319.192794,319.189584,319.189846,319.168752,319.182607
std,605.751158,605.697274,605.69529,605.868836,605.86687,606.040597,606.039463,606.036057,606.036524,606.034319,...,607.320677,607.32333,607.332657,607.558896,607.561,607.787108,607.788466,607.788331,607.778203,607.784047
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,...,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
50%,49.0,49.0,49.0,49.0,49.0,49.0,49.5,49.0,49.0,49.5,...,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
75%,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,...,299.0,299.0,299.0,299.0,299.0,299.75,299.75,299.75,299.0,299.0
max,3224.0,3224.0,3224.0,3225.0,3225.0,3226.0,3226.0,3226.0,3226.0,3226.0,...,3231.0,3231.0,3231.0,3232.0,3232.0,3233.0,3233.0,3233.0,3233.0,3233.0


We can see that the values range from 1 to 3233, which fits comfortably in uint16 (maximum 65535). Convert the integer types using a simple function to see how much memory we could save:

In [305]:
def mem_usage(pandas_obj):
    usage_b = pandas_obj.memory_usage(deep=True).sum()
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

df1_mem_int = df1.select_dtypes(include=['int32'])
converted_int = df1.apply(pd.to_numeric,downcast='unsigned')

print("Size of integer types before {}".format(mem_usage(df1_mem_int)))
print("Size of integer types after {}".format(mem_usage(converted_int)))

Size of integer types before 5.82 MB
Size of integer types after 2.91 MB


**The values have been converted to uint16, the smallest integer type that holds them, saving 50% in space.**

Repeat with the float columns of the target:

In [306]:
df2_mem_float = df2.select_dtypes(include=['float64'])
converted_float = df2.apply(pd.to_numeric,downcast='unsigned')

print("Size of float types before {}".format(mem_usage(df2_mem_float)))
print("Size of float types after {}".format(mem_usage(converted_float)))


Size of float types before 753.28 MB
Size of float types after 94.16 MB


**The zeros and ones in the target have been converted to type uint8, saving 87.5% in space (1 byte per value instead of 8).**

Using this information, let's convert the original numpy arrays:

In [307]:
print("original datatype of X array is {}".format(X.dtype))
print("original datatype of y array is {}".format(y.dtype))

original datatype of X array is int32
original datatype of y array is float64


In [308]:
print("X size of %f MB" % ((X.size * X.itemsize)/1024**2))
print("y size of %f MB" % ((y.size * y.itemsize)/1024**2))

X size of 5.823135 MB
y size of 753.280792 MB


In [309]:
X_new = X.astype('uint16')
y_new = y.astype('uint8')

print("New X size of %f MB" % ((X_new.size * X_new.itemsize)/1024**2))
print("New y size of %f MB" % ((y_new.size * y_new.itemsize)/1024**2))
print(" ")
print("New datatype of X array is {}".format(X_new.dtype))
print("New datatype of y array is {}".format(y_new.dtype))

New X size of 2.911568 MB
New y size of 94.160099 MB
 
New datatype of X array is uint16
New datatype of y array is uint8
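The sizes printed above follow directly from the array shapes and the bytes per item, so they can be checked without allocating the arrays. The sketch below is illustrative only: `array_mb` and `safe_downcast` are hypothetical helpers, and the range check is an added precaution, since `astype` on its own silently wraps values that do not fit the target type.

```python
import numpy as np

def array_mb(shape, dtype):
    # size in MB = number of items * bytes per item / 1024^2
    n_items = 1
    for dim in shape:
        n_items *= dim
    return n_items * np.dtype(dtype).itemsize / 1024 ** 2

def safe_downcast(arr, dtype):
    # refuse to convert if any value falls outside the target integer range
    info = np.iinfo(dtype)
    if arr.min() < info.min or arr.max() > info.max:
        raise ValueError("values do not fit in %s" % dtype)
    return arr.astype(dtype)

# shapes taken from the notebook: X is (30530, 50), y is (30530, 3234)
x_before = array_mb((30530, 50), 'int32')
x_after = array_mb((30530, 50), 'uint16')
y_before = array_mb((30530, 3234), 'float64')
y_after = array_mb((30530, 3234), 'uint8')
print("X: %.2f MB -> %.2f MB" % (x_before, x_after))
print("y: %.2f MB -> %.2f MB" % (y_before, y_after))

small = safe_downcast(np.array([1, 49, 3233], dtype='int32'), 'uint16')
print(small.dtype)
```

Halving the item size from 4 to 2 bytes halves X, and going from 8 bytes to 1 byte shrinks y to one eighth of its original size.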


### Fit the Model
The learned embedding needs to know the size of the vocabulary and the length of the input sequences. It also has a parameter that specifies how many dimensions are used to represent each word, i.e. the size of the embedding vector space.
Common values are 50, 100, and 300; here we use 50. We use two LSTM hidden layers with 100 memory cells each.

More memory cells and a deeper network may achieve better results. A dense fully connected layer with 100 neurons follows the LSTM hidden layers to interpret the features extracted from the sequence. The output layer predicts the next word as
a single vector the size of the vocabulary, with a probability for each word in the vocabulary. A softmax activation function is used to ensure the outputs have the characteristics of normalized probabilities.

In [310]:
# define the model
def define_model(vocab_size, seq_length):
    model = Sequential()
    model.add(Embedding(vocab_size, 50, input_length=seq_length))
    model.add(LSTM(100, return_sequences=True))
    model.add(LSTM(100))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile network
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='img/model20.png', show_shapes=True)
    return model

The model is compiled with the categorical cross-entropy loss.
Technically, the model is learning a multiclass classification, and this is the suitable loss function
for this type of problem. The Adam optimizer, an efficient variant of mini-batch gradient descent,
is used, and accuracy is reported during training. Finally, the model is fit on the data for 50
training epochs with a modest batch size of 200 to speed things up.

In [311]:
# define model
model = define_model(vocab_size, seq_length)
# fit model
model.fit(X_new, y_new, batch_size=200, epochs=50)
# save the model to file
model.save('model20.h5')
# save the tokenizer (the with-block closes the file once it is written)
with open('tokenizer20.pkl', 'wb') as f:
    dump(tokenizer, f)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_15 (Embedding)     (None, 50, 50)            161700    
_________________________________________________________________
lstm_29 (LSTM)               (None, 50, 100)           60400     
_________________________________________________________________
lstm_30 (LSTM)               (None, 100)               80400     
_________________________________________________________________
dense_29 (Dense)             (None, 100)               10100     
_________________________________________________________________
dense_30 (Dense)             (None, 3234)              326634    
Total params: 639,234
Trainable params: 639,234
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
E
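The parameter counts in the model summary above can be checked by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical, and the sizes (vocabulary 3234, embedding 50, 100 units) are taken from the notebook.

```python
def embedding_params(vocab_size, dim):
    # one vector of length `dim` per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size = 3234
layer_counts = [
    embedding_params(vocab_size, 50),   # embedding
    lstm_params(50, 100),               # first LSTM, fed by the embedding
    lstm_params(100, 100),              # second LSTM, fed by the first
    dense_params(100, 100),             # hidden dense layer
    dense_params(100, vocab_size),      # softmax output layer
]
total = sum(layer_counts)
print(layer_counts)
# [161700, 60400, 80400, 10100, 326634]
print(total)
# 639234
```

More than half of the parameters sit in the output layer, because its width is the vocabulary size.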

### Using the Language Model

In [314]:
# load cleaned text sequences
in_filename = 'the_prince_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

seq_length = len(lines[0].split()) - 1

In [317]:
# load the tokenizer
with open('tokenizer20.pkl', 'rb') as f:
    tokenizer = load(f)
# load the model
model = load_model('model20.h5')

In [318]:
# select a seed text
# randint is inclusive at both ends, so the upper bound must be len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')


family or else they are those of recent foundation the newly founded ones are either entirely new as was milan to francesco sforza or else they are as it were new members grafted on to the hereditary possessions of the prince that annexes them as is the kingdom of naples to



Define a function called generate_seq() that takes as input the model,
the tokenizer, the input sequence length, the seed text, and the number of words to generate. It
then returns a sequence of words generated by the model.

In [319]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # keep only the last seq_length tokens (pad on the left if shorter)
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # pick the most probable next word; predict_classes() was removed
        # from recent Keras releases, so take the argmax of predict() instead
        yhat = int(model.predict(encoded, verbose=0).argmax(axis=-1)[0])
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
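The structure of the generation loop can be seen without a trained network. In the sketch below, `toy_predict` is a hypothetical stand-in for the model (a fixed lookup from the last word to the next one), and `generate_greedy` is an illustrative helper, not part of Keras. It shows the two mechanics the function relies on: keeping only the last `seq_length` tokens as context, and always taking the single most likely next word.

```python
def generate_greedy(predict_next, seed_ids, seq_length, n_words):
    window = list(seed_ids)
    generated = []
    for _ in range(n_words):
        # keep only the most recent seq_length tokens as context
        context = window[-seq_length:]
        next_id = predict_next(context)
        generated.append(next_id)
        # feed the prediction back in as part of the next context
        window.append(next_id)
    return generated

# toy vocabulary and reverse mapping from index to word
word_index = {'his': 1, 'own': 2, 'laws': 3, 'and': 4}
index_word = {i: w for w, i in word_index.items()}

# hypothetical "model": a deterministic next-word table
toy_next = {1: 2, 2: 3, 3: 4, 4: 1}

def toy_predict(context):
    return toy_next[context[-1]]

ids = generate_greedy(toy_predict, seed_ids=[4, 1], seq_length=2, n_words=6)
words = ' '.join(index_word[i] for i in ids)
print(words)
# own laws and his own laws
```

Because the choice at each step is deterministic, the output falls into a cycle as soon as a context repeats. Building the `index_word` dictionary once also avoids scanning the whole vocabulary for every generated word.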

In [321]:
# select a seed text
# randint is inclusive at both ends, so the upper bound must be len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')
# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)


be opposed by those he has about him and he is easily diverted from his purpose hence it comes to pass that what he does one day he undoes the next no one ever understands what he wishes or intends to do and no reliance is to be placed on his

own laws and as he had not spent his own laws and as he had not spent his own laws and as he had not spent his own laws and as he had not spent his own laws and as he had not spent his own laws and as he


The generated text is not reasonable: it repeats the same phrase over and over. This is expected, as the accuracy from the training procedure is very low. Possible improvements:
- the text may be too small, so increase the size of the training corpus
- change the batch size and the number of epochs
- change the sequence length
- change the network configuration