# Text Generation with Neural Networks

## Functions for Processing Text

### Reading a File in as a String

In [1]:
def read_file(filepath):
    
    with open(filepath) as f:
        str_text = f.read()
    
    return str_text

In [2]:
read_file('moby_dick_four_chapters.txt')

'Call me Ishmael.  Some years ago--never mind how long\nprecisely--having little or no money in my purse, and nothing\nparticular to interest me on shore, I thought I would sail about a\nlittle and see the watery part of the world.  It is a way I have of\ndriving off the spleen and regulating the circulation.  Whenever I\nfind myself growing grim about the mouth; whenever it is a damp,\ndrizzly November in my soul; whenever I find myself involuntarily\npausing before coffin warehouses, and bringing up the rear of every\nfuneral I meet; and especially whenever my hypos get such an upper\nhand of me, that it requires a strong moral principle to prevent me\nfrom deliberately stepping into the street, and methodically knocking\npeople\'s hats off--then, I account it high time to get to sea as soon\nas I can.  This is my substitute for pistol and ball.  With a\nphilosophical flourish Cato throws himself upon his sword; I quietly\ntake to the ship.  There is nothing surprising in this.  If t

### Tokenize and Clean Text

In [3]:
#!python -m spacy download en_core_web_sm

In [4]:
import spacy
nlp = spacy.load('en_core_web_sm',disable=['parser', 'tagger','ner'])
nlp.max_length = 1198623

In [5]:
# Remove punctuation marks that are commonly used in English grammar

In [6]:
def separate_punc(doc_text):
    return [token.text.lower() for token in nlp(doc_text) if token.text not in '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@[\\]^_`{|}~\t\n ']
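
One detail of the filter above is easy to miss: `token.text not in '...'` is a substring test against one long string, not a membership test against a collection of characters. The sketch below shows the difference. `PUNC_STRING` is a shortened, hypothetical stand-in for the filter string, and the token list is made up for illustration; neither is part of the original pipeline.

```python
# Hypothetical, shortened stand-in for the filter string used in separate_punc
PUNC_STRING = '\n\n \n\n\n!"-#$%&()--.*+,-/:;<=>?@'

# Made-up tokens for illustration only
sample_tokens = ['call', '--', '.', '-/', 'me', '...']

# Substring test: drops any token that appears somewhere inside PUNC_STRING
kept_substring = [t for t in sample_tokens if t not in PUNC_STRING]

# Character-set test: drops a token only if all of its characters are in the set
punc_chars = set(PUNC_STRING)
kept_charset = [t for t in sample_tokens if not all(c in punc_chars for c in t)]

print(kept_substring)
print(kept_charset)
```

In this toy case the substring test keeps `'...'` (three dots never appear consecutively in the string), while the character-set test drops it. Which behavior is preferable depends on the text being cleaned.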

In [7]:
d = read_file('moby_dick_four_chapters.txt')
tokens = separate_punc(d)

In [8]:
tokens

['call',
 'me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purse',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'i',
 'thought',
 'i',
 'would',
 'sail',
 'about',
 'a',
 'little',
 'and',
 'see',
 'the',
 'watery',
 'part',
 'of',
 'the',
 'world',
 'it',
 'is',
 'a',
 'way',
 'i',
 'have',
 'of',
 'driving',
 'off',
 'the',
 'spleen',
 'and',
 'regulating',
 'the',
 'circulation',
 'whenever',
 'i',
 'find',
 'myself',
 'growing',
 'grim',
 'about',
 'the',
 'mouth',
 'whenever',
 'it',
 'is',
 'a',
 'damp',
 'drizzly',
 'november',
 'in',
 'my',
 'soul',
 'whenever',
 'i',
 'find',
 'myself',
 'involuntarily',
 'pausing',
 'before',
 'coffin',
 'warehouses',
 'and',
 'bringing',
 'up',
 'the',
 'rear',
 'of',
 'every',
 'funeral',
 'i',
 'meet',
 'and',
 'especially',
 'whenever',
 'my',
 'hypos',
 'get',
 'such',
 'an',
 'upper',
 'hand',
 '

In [9]:
len(tokens)

11338

In [10]:
11338/30

377.93333333333334

## Create Sequences of Tokens

In [11]:
# organize into sequences of tokens
train_len = 30+1 # 30 training words, then one target word

# Empty list of sequences
text_sequences = []

for i in range(train_len, len(tokens)):
    
    # Grab train_len tokens
    seq = tokens[i-train_len:i]
    
    # Add to list of sequences
    text_sequences.append(seq)
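
The loop above is a sliding window over the token list. A minimal sketch on a toy list (all `toy_*` names are made up for illustration) shows how many windows it yields:

```python
# Toy token list standing in for the real `tokens`
toy_tokens = ['a', 'b', 'c', 'd', 'e', 'f']
toy_train_len = 3 + 1  # 3 input words, then one target word

toy_sequences = []
for i in range(toy_train_len, len(toy_tokens)):
    # Same slicing as above: the window ends just before index i
    toy_sequences.append(toy_tokens[i - toy_train_len:i])

print(toy_sequences)
print(len(toy_sequences))
```

The loop produces `len(toy_tokens) - toy_train_len` windows. The same arithmetic on the real data gives 11338 - 31 = 11307. Because `range` stops before `len(tokens)`, the very last possible window (the one ending on the final token) is not generated. For a corpus of this size that is harmless.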

In [12]:
' '.join(text_sequences[0])

'call me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore i thought i would'

In [13]:
' '.join(text_sequences[1])

'me ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore i thought i would sail'

In [14]:
' '.join(text_sequences[2])

'ishmael some years ago never mind how long precisely having little or no money in my purse and nothing particular to interest me on shore i thought i would sail about'

In [15]:
' '.join(text_sequences[379])

'the green fields gone what do they here but look here come more crowds pacing straight for the water and seemingly bound for a dive strange nothing will content them but'

In [16]:
len(text_sequences)

11307

# Keras

### Keras Tokenization

In [17]:
!pip install keras



In [18]:
#!pip install tensorflow
import tensorflow as tf
tf.version.VERSION
#print(tensorflow.__version__)

'2.3.1'

In [19]:
!pip install ruamel.yaml



In [20]:
!pip install --upgrade tensorflow

Requirement already up-to-date: tensorflow in d:\conda_drive\lib\site-packages (2.3.1)


In [21]:
from keras.preprocessing.text import Tokenizer

In [22]:
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences)
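
To make the integer encoding less of a black box, here is a rough pure-Python sketch of frequency-ranked indexing. It only mimics the idea; it is not the Keras implementation, and the `toy_*` names and the mini corpus are made up for illustration.

```python
from collections import Counter

# Made-up mini corpus: a list of token lists, shaped like text_sequences
toy_texts = [['the', 'whale', 'the', 'sea'], ['the', 'sea', 'ship']]

# Count every occurrence across all token lists
counts = Counter(word for text in toy_texts for word in text)

# Rank words by descending count; indices start at 1, so 0 stays unused
toy_word_index = {
    word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

# Replace each word by its index
toy_encoded = [[toy_word_index[w] for w in text] for text in toy_texts]

print(toy_word_index)
print(toy_encoded)
```

Because the tokenizer here is fit on overlapping sequences, each token is counted once per sequence it appears in. That is why the counts reported by `word_counts` can exceed the number of tokens in the source text.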

In [40]:
print(type(sequences))
sequences[0]

<class 'numpy.ndarray'>


array([ 956,   14,  263,   51,  261,  408,   87,  219,  129,  111,  954,
        260,   50,   43,   38,  315,    7,   23,  546,    3,  150,  259,
          6, 2712,   14,   25, 2710,    5,   60,    5,   56])

In [24]:
tokenizer.index_word

{1: 'the',
 2: 'a',
 3: 'and',
 4: 'of',
 5: 'i',
 6: 'to',
 7: 'in',
 8: 'it',
 9: 'that',
 10: 'he',
 11: 'his',
 12: 'was',
 13: 'but',
 14: 'me',
 15: 'with',
 16: 'as',
 17: 'at',
 18: 'this',
 19: 'you',
 20: 'is',
 21: 'all',
 22: 'for',
 23: 'my',
 24: 'be',
 25: 'on',
 26: "'s",
 27: 'not',
 28: 'from',
 29: 'there',
 30: 'one',
 31: 'up',
 32: 'what',
 33: 'him',
 34: 'so',
 35: 'bed',
 36: 'now',
 37: 'about',
 38: 'no',
 39: 'into',
 40: 'by',
 41: 'were',
 42: 'out',
 43: 'or',
 44: 'harpooneer',
 45: 'had',
 46: 'then',
 47: 'have',
 48: 'an',
 49: 'upon',
 50: 'little',
 51: 'some',
 52: 'old',
 53: 'like',
 54: 'if',
 55: 'they',
 56: 'would',
 57: 'do',
 58: 'over',
 59: 'landlord',
 60: 'thought',
 61: 'room',
 62: 'when',
 63: 'could',
 64: "n't",
 65: 'night',
 66: 'here',
 67: 'head',
 68: 'such',
 69: 'which',
 70: 'man',
 71: 'did',
 72: 'sea',
 73: 'time',
 74: 'other',
 75: 'very',
 76: 'go',
 77: 'these',
 78: 'more',
 79: 'though',
 80: 'first',
 81: 'sort',


In [25]:
for i in sequences[0]:
    print(f'{i} : {tokenizer.index_word[i]}')

956 : call
14 : me
263 : ishmael
51 : some
261 : years
408 : ago
87 : never
219 : mind
129 : how
111 : long
954 : precisely
260 : having
50 : little
43 : or
38 : no
315 : money
7 : in
23 : my
546 : purse
3 : and
150 : nothing
259 : particular
6 : to
2712 : interest
14 : me
25 : on
2710 : shore
5 : i
60 : thought
5 : i
56 : would


In [26]:
tokenizer.word_counts

OrderedDict([('call', 32),
             ('me', 2941),
             ('ishmael', 158),
             ('some', 903),
             ('years', 160),
             ('ago', 99),
             ('never', 534),
             ('mind', 194),
             ('how', 381),
             ('long', 444),
             ('precisely', 42),
             ('having', 167),
             ('little', 912),
             ('or', 1130),
             ('no', 1193),
             ('money', 140),
             ('in', 6727),
             ('my', 2126),
             ('purse', 81),
             ('and', 11491),
             ('nothing', 331),
             ('particular', 177),
             ('to', 7742),
             ('interest', 24),
             ('on', 2041),
             ('shore', 27),
             ('i', 8521),
             ('thought', 804),
             ('would', 837),
             ('sail', 124),
             ('about', 1209),
             ('a', 12372),
             ('see', 496),
             ('the', 18525),
             ('watery', 31),


In [27]:
vocabulary_size = len(tokenizer.word_counts)

In [28]:
vocabulary_size

2717

### Convert to Numpy Matrix

In [32]:
import numpy as np

In [33]:
# Convert in place: the slicing below (sequences[:,:-1]) needs a NumPy array
sequences = np.array(sequences)

In [39]:
print(type(sequences))
sequences

<class 'numpy.ndarray'>


array([[ 956,   14,  263, ...,   60,    5,   56],
       [  14,  263,   51, ...,    5,   56,  316],
       [ 263,   51,  261, ...,   56,  316,   37],
       ...,
       [ 301,    1,  374, ...,  262,   53,    2],
       [   1,  374,    4, ...,   53,    2, 2717],
       [ 374,    4,   11, ...,    2, 2717,   26]])

# Creating an LSTM based model

In [41]:
import keras
from keras.models import Sequential
from keras.layers import Dense,LSTM,Embedding

In [42]:
def create_model(vocabulary_size, seq_len):
    model = Sequential()
    model.add(Embedding(vocabulary_size, 25, input_length=seq_len))
    model.add(LSTM(150, return_sequences=True))
    model.add(LSTM(150))
    model.add(Dense(150, activation='relu'))

    model.add(Dense(vocabulary_size, activation='softmax'))
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
   
    model.summary()
    
    return model

### Train / Test Split

In [43]:
from keras.utils import to_categorical

In [54]:
sequences

array([[ 956,   14,  263, ...,   60,    5,   56],
       [  14,  263,   51, ...,    5,   56,  316],
       [ 263,   51,  261, ...,   56,  316,   37],
       ...,
       [ 301,    1,  374, ...,  262,   53,    2],
       [   1,  374,    4, ...,   53,    2, 2717],
       [ 374,    4,   11, ...,    2, 2717,   26]])

In [45]:
# First 30 words (every column except the last)
sequences[:,:-1]

array([[ 956,   14,  263, ...,    5,   60,    5],
       [  14,  263,   51, ...,   60,    5,   56],
       [ 263,   51,  261, ...,    5,   56,  316],
       ...,
       [ 301,    1,  374, ...,   11,  262,   53],
       [   1,  374,    4, ...,  262,   53,    2],
       [ 374,    4,   11, ...,   53,    2, 2717]])

In [46]:
# Last word (the target)
sequences[:,-1]

array([  56,  316,   37, ...,    2, 2717,   26])

In [47]:
X = sequences[:,:-1]

In [48]:
y = sequences[:,-1]

In [49]:
y = to_categorical(y, num_classes=vocabulary_size+1)
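
The `+1` in `num_classes` is needed because the tokenizer's word indices start at 1, so the largest index equals `vocabulary_size` and index 0 is never assigned to a word. A small pure-Python sketch of one-hot encoding shows this (the `one_hot` helper and `toy_*` values are hypothetical and are not the Keras implementation):

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * num_classes
    row[index] = 1
    return row

toy_vocabulary_size = 4   # word indices run from 1 to 4
toy_targets = [1, 4, 2]   # made-up target word indices

# vocabulary_size + 1 columns, so the largest index (4) still fits
toy_y = [one_hot(t, toy_vocabulary_size + 1) for t in toy_targets]

print(toy_y)
```

With only `toy_vocabulary_size` columns, the target index 4 would fall outside the row.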

In [50]:
seq_len = X.shape[1]

In [51]:
seq_len

30

### Training the Model

In [52]:
# define model
model = create_model(vocabulary_size+1, seq_len)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 30, 25)            67950     
_________________________________________________________________
lstm (LSTM)                  (None, 30, 150)           105600    
_________________________________________________________________
lstm_1 (LSTM)                (None, 150)               180600    
_________________________________________________________________
dense (Dense)                (None, 150)               22650     
_________________________________________________________________
dense_1 (Dense)              (None, 2718)              410418    
Total params: 787,218
Trainable params: 787,218
Non-trainable params: 0
_________________________________________________________________
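
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are made up for illustration.

```python
vocab = 2717 + 1   # vocabulary_size + 1, as passed to create_model
embed_dim = 25     # embedding output dimension
units = 150        # units in each LSTM layer and in the first Dense layer

def lstm_params(input_dim, n_units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + n_units) * n_units + n_units)

def dense_params(input_dim, n_units):
    # Weight matrix plus bias vector
    return input_dim * n_units + n_units

param_counts = {
    'embedding': vocab * embed_dim,
    'lstm': lstm_params(embed_dim, units),
    'lstm_1': lstm_params(units, units),
    'dense': dense_params(units, units),
    'dense_1': dense_params(units, vocab),
}
total_params = sum(param_counts.values())

print(param_counts)
print(total_params)
```

The output layer accounts for more than half of the parameters, because its size scales with the vocabulary.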



In [57]:
from pickle import dump,load

In [59]:
# fit model
model.fit(X, y, batch_size=128, epochs=60,verbose=1)

Epoch 1/60
...
Epoch 60/60


<tensorflow.python.keras.callbacks.History at 0x4fb4ef48>

In [60]:
# save the model to file
model.save('epochBIG.h5')
# save the tokenizer (the context manager closes the file handle)
with open('epochBIG', 'wb') as f:
    dump(tokenizer, f)
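
For completeness, here is a minimal sketch of the pickle round trip used to restore a saved tokenizer later. A plain dictionary stands in for the fitted tokenizer, and the file name and temporary directory are hypothetical. The model itself would be restored separately with `load_model`.

```python
import os
import tempfile
from pickle import dump, load

# Stand-in for the fitted tokenizer; any picklable object behaves the same way
toy_tokenizer_state = {'word_index': {'the': 1, 'a': 2}}

# Hypothetical file location inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'toy_tokenizer')

# Write the object to disk
with open(path, 'wb') as f:
    dump(toy_tokenizer_state, f)

# Read it back
with open(path, 'rb') as f:
    restored = load(f)

print(restored == toy_tokenizer_state)
```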

# Generating New Text

In [61]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

In [62]:
def generate_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    '''
    INPUTS:
    model : model that was trained on text data
    tokenizer : tokenizer that was fit on text data
    seq_len : length of training sequence
    seed_text : raw string text to serve as the seed
    num_gen_words : number of words to be generated by model
    '''
    
    # Final Output
    output_text = []
    
    # Initial Seed Sequence
    input_text = seed_text
    
    # Create num_gen_words
    for i in range(num_gen_words):
        
        # Take the input text string and encode it to a sequence
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        
        # Pad (or truncate) to the sequence length the model was trained on
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        
        # Predict the index of the most probable next word
        # (argmax over predict() also works on TensorFlow releases without predict_classes)
        pred_word_ind = model.predict(pad_encoded, verbose=0).argmax(axis=-1)[0]
        
        # Grab word
        pred_word = tokenizer.index_word[pred_word_ind] 
        
        # Update the sequence of input text (shifting one over with the new word)
        input_text += ' ' + pred_word
        
        output_text.append(pred_word)
        
    # Make it look like a sentence.
    return ' '.join(output_text)
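
The padding step is what keeps the input window at a fixed length as generated words are appended. A rough pure-Python sketch of "pre" padding and "pre" truncation follows. `pad_pre` is a hypothetical helper that assumes `maxlen` is positive; it is not the Keras function.

```python
def pad_pre(encoded, maxlen, value=0):
    """Left-truncate or left-pad `encoded` so it has exactly `maxlen` items."""
    truncated = encoded[-maxlen:]  # keep only the most recent maxlen items
    return [value] * (maxlen - len(truncated)) + truncated

# Longer than maxlen: the oldest items are dropped from the front
print(pad_pre([1, 2, 3, 4, 5], maxlen=3))

# Shorter than maxlen: zeros are added at the front
print(pad_pre([7, 8], maxlen=3))
```

Since each generated word is appended to the input text, truncating from the front makes the model always see the latest `seq_len` words.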

### Grab a random seed sequence

In [64]:
text_sequences[1]

['me',
 'ishmael',
 'some',
 'years',
 'ago',
 'never',
 'mind',
 'how',
 'long',
 'precisely',
 'having',
 'little',
 'or',
 'no',
 'money',
 'in',
 'my',
 'purse',
 'and',
 'nothing',
 'particular',
 'to',
 'interest',
 'me',
 'on',
 'shore',
 'i',
 'thought',
 'i',
 'would',
 'sail']

In [76]:
import random
random.seed(101)
random_pick = random.randint(0,len(text_sequences)-1)  # randint includes both endpoints

In [77]:
random_seed_text = text_sequences[random_pick]

In [78]:
random_seed_text

['thought',
 'i',
 'to',
 'myself',
 'the',
 'man',
 "'s",
 'a',
 'human',
 'being',
 'just',
 'as',
 'i',
 'am',
 'he',
 'has',
 'just',
 'as',
 'much',
 'reason',
 'to',
 'fear',
 'me',
 'as',
 'i',
 'have',
 'to',
 'be',
 'afraid',
 'of',
 'him']

In [92]:
seed_text = ' '.join(random_seed_text)

In [93]:
seed_text

"thought i to myself the man 's a human being just as i am he has just as much reason to fear me as i have to be afraid of him"

In [71]:
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=100)

"as a country i began creaking and that proceeded of the room that and conclude to be sure jolly was anon a foot but a leaders i was thou unusual regarding but a right was was a great central parliament voyage with a world 's hear there be troubled been passengers steaks cutlery in the blackness and be been a shipmate charm had aside over the uses of the room i was there be a heathen look was a rejoinder i short on a dive england hags generally endeavored of misty spray and planing me a highest array of monstrous"

In [72]:
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=200)

"as a country i began creaking and that proceeded of the room that and conclude to be sure jolly was anon a foot but a leaders i was thou unusual regarding but a right was was a great central parliament voyage with a world 's hear there be troubled been passengers steaks cutlery in the blackness and be been a shipmate charm had aside over the uses of the room i was there be a heathen look was a rejoinder i short on a dive england hags generally endeavored of misty spray and planing me a highest array of monstrous clubs and spears me over with the arrantest cords of the manhattoes belted round and the chips like a midnight gale.--it squitchy ship weltering deck not had occurred to a room that a leaders i was thou unusual regarding but a right was was a great central parliament voyage with a world 's hear there be troubled been passengers steaks cutlery in the blackness and be been a shipmate charm had aside over the uses of the room i was there be a heathen look was a rejoinder i short 

In [94]:
generate_text(model,tokenizer,seq_len,seed_text=seed_text,num_gen_words=500)

"as a country i began creaking and that proceeded of the room that and conclude to be sure jolly was anon a foot but a leaders i was thou unusual regarding but a right was was a great central parliament voyage with a world 's hear there be troubled been passengers steaks cutlery in the blackness and be been a shipmate charm had aside over the uses of the room i was there be a heathen look was a rejoinder i short on a dive england hags generally endeavored of misty spray and planing me a highest array of monstrous clubs and spears me over with the arrantest cords of the manhattoes belted round and the chips like a midnight gale.--it squitchy ship weltering deck not had occurred to a room that a leaders i was thou unusual regarding but a right was was a great central parliament voyage with a world 's hear there be troubled been passengers steaks cutlery in the blackness and be been a shipmate charm had aside over the uses of the room i was there be a heathen look was a rejoinder i short 

### Exploring Generated Sequence

In [81]:
full_text = read_file('moby_dick_four_chapters.txt')

In [83]:
words = full_text.split()  # split once instead of on every match
for i,word in enumerate(words):
    if word == 'inkling':
        # Show up to 20 words of context on either side
        print(' '.join(words[max(0, i-20):i+20]))
        print('\n')

were stains of some sort or other. At first I knew not what to make of this; but soon an inkling of the truth occurred to me. I remembered a story of a white man--a whaleman too--who, falling among the




# Great Job!
