## **Description**

**Word-level text generation in Python using a Keras model on TensorFlow.**

A text document is uploaded, cleaned, split into words, and turned into overlapping sequences of 51 words (50 input words and 1 output word).
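
As a rough illustration of the windowing step, the sketch below slides a fixed-size window over a token list. The helper name `make_windows` and the toy window size of 4 (3 input words plus 1 output word) are assumptions chosen for readability; the notebook uses a window of 51. This sketch also emits the final window, whereas the notebook's loop (`range(length, len(tokens))`) stops one window earlier.

```python
def make_windows(tokens, length):
    """Return every run of `length` consecutive tokens."""
    return [tokens[i - length:i] for i in range(length, len(tokens) + 1)]

tokens = "in my younger and more vulnerable years".split()
windows = make_windows(tokens, 4)

# Each window is split into its input words and the single output word
for window in windows:
    print(window[:-1], "->", window[-1])
```

A list of `n` tokens yields `n - length + 1` windows here, so the seven toy tokens give four training examples.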

Words and sequences are tokenized using the Keras `Tokenizer`, which maps each unique word to an integer and encodes every input sequence as a list of those integers.
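
The word-to-integer mapping can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: `build_word_index` is a hypothetical helper, and it skips the lowercasing and character filtering that the real `Tokenizer` applies by default. It does mirror two properties of the Keras tokenizer: more frequent words get lower indices, and index 0 is never assigned to a word, which is why the notebook sets `vocab_size` to `len(tokenizer.word_index) + 1`.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for line in lines for word in line.split())
    # most_common() orders by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

lines = ["the cat sat", "the cat ran", "the dog ran"]
word_index = build_word_index(lines)

# Encode every line as a list of integers
encoded = [[word_index[w] for w in line.split()] for line in lines]

# Index 0 is reserved, so the model needs one more class than there are words
vocab_size = len(word_index) + 1
print(word_index, encoded, vocab_size)
```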

The encoded sequences are then used to train a sequential language model. Training accuracy depends on the number of epochs, the size of the embedding vector space, and the number of LSTM units.

A seed text is chosen at random from the saved sequences and encoded with the same tokenizer that was used for the input sequences. Each next word is predicted as an integer using `np.argmax(model.predict())`, which is then mapped back to a word through the tokenizer. The predicted word is appended to the input, and the process repeats until the requested number of words has been generated.
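
The generation loop amounts to greedy decoding: take the most probable next word, append it, and repeat. The sketch below shows that loop without Keras. `greedy_generate` and `toy_predict` are hypothetical names, and `toy_predict` is a made-up stand-in for `model.predict` that simply favours the next index in a three-word cycle.

```python
def greedy_generate(predict_probs, index_word, seed_ids, seq_length, n_words):
    """Repeatedly pick the most probable next word id and append it."""
    ids = list(seed_ids)
    generated = []
    for _ in range(n_words):
        # Only the last seq_length ids are passed to the model
        context = ids[-seq_length:]
        probs = predict_probs(context)
        # argmax over the probability list
        next_id = max(range(len(probs)), key=probs.__getitem__)
        ids.append(next_id)
        generated.append(index_word[next_id])
    return " ".join(generated)

index_word = {1: "the", 2: "cat", 3: "sat"}

def toy_predict(context):
    # Hypothetical model: favours the id after the last one, wrapping 3 -> 1
    favoured = context[-1] % 3 + 1
    probs = [0.0, 0.1, 0.1, 0.1]  # position 0 is the reserved index
    probs[favoured] = 0.7
    return probs

text = greedy_generate(toy_predict, index_word, seed_ids=[1], seq_length=2, n_words=4)
print(text)
```

Because the choice is always the single most probable word, the same seed and model always produce the same text.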

Depending on the accuracy of the model, some generated text is more realistic than other text. Even with high accuracy, the output can still be nonsensical, depending on the seed text and the model.

## **Upload File**

In [None]:
# upload a new file if necessary (Google Colab only)
# from google.colab import files
# uploaded = files.upload()

## **Imports**

In [None]:
import string
from pickle import dump, load
from random import randint

import numpy as np
from numpy import array
from keras.layers import Dense, Embedding, LSTM
from keras.models import Sequential, load_model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical


## **Load Text**

In [None]:
def load_text(fileName):
  # the context manager closes the file even if reading fails
  with open(fileName, 'r') as file:
    text = file.read()
  return text

In [None]:
textFile = 'gatsby.txt'
text = load_text(textFile)
# checks load
print(text[:200])

                                  I

In my younger and more vulnerable years my father gave me some advice
that I’ve been turning over in my mind ever since.

“Whenever you feel like criticizing any


## **Prepare Text**

In [None]:
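# cleans the raw text: splits on whitespace, strips ASCII punctuation,
# drops tokens that are not purely alphabetic, and lowercases the rest
def text_prep(text):
  # treat double hyphens as word separators
  text = text.replace('--', ' ')
  words = text.split()
  # remove ASCII punctuation from each word
  table = str.maketrans('', '', string.punctuation)
  words = [w.translate(table) for w in words]
  # keep alphabetic tokens only (this also drops words containing curly quotes)
  words = [word for word in words if word.isalpha()]
  words = [word.lower() for word in words]
  return words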

tokens = text_prep(text)

# quick functionality checks
print(tokens[:100])
print('Total words: %d' % len(tokens))
print('Unique words: %d' % len(set(tokens)))

['i', 'in', 'my', 'younger', 'and', 'more', 'vulnerable', 'years', 'my', 'father', 'gave', 'me', 'some', 'advice', 'that', 'been', 'turning', 'over', 'in', 'my', 'mind', 'ever', 'since', 'you', 'feel', 'like', 'criticizing', 'he', 'told', 'me', 'remember', 'that', 'all', 'the', 'people', 'in', 'this', 'world', 'had', 'the', 'advantages', 'that', 'he', 'say', 'any', 'more', 'but', 'always', 'been', 'unusually', 'communicative', 'in', 'a', 'reserved', 'way', 'and', 'i', 'understood', 'that', 'he', 'meant', 'a', 'great', 'deal', 'more', 'than', 'that', 'in', 'consequence', 'inclined', 'to', 'reserve', 'all', 'judgements', 'a', 'habit', 'that', 'has', 'opened', 'up', 'many', 'curious', 'natures', 'to', 'me', 'and', 'also', 'made', 'me', 'the', 'victim', 'of', 'not', 'a', 'few', 'veteran', 'bores', 'the', 'abnormal', 'mind']
Total words: 46724
Unique words: 5987
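
The token list above skips some words from the source text: "I’ve" is missing between 'that' and 'been', and "“Whenever" is missing before 'you'. The likely cause is that `string.punctuation` contains only ASCII punctuation, so curly quotes survive the translate step and the word then fails `isalpha()`. The sketch below reproduces this on three sample words and shows one possible variant. `CURLY_QUOTES`, `extended_table`, and `clean` are hypothetical names, and applying the variant would change the word counts printed above.

```python
import string

# Same translation table as text_prep: removes ASCII punctuation only
ascii_table = str.maketrans('', '', string.punctuation)

# Hypothetical extension: also remove curly single and double quotes
CURLY_QUOTES = '\u2018\u2019\u201c\u201d'
extended_table = str.maketrans('', '', string.punctuation + CURLY_QUOTES)

def clean(word, table):
    """Strip characters via `table`; return the lowercased word, or None if not alphabetic."""
    stripped = word.translate(table)
    return stripped.lower() if stripped.isalpha() else None

# A straight apostrophe, a curly apostrophe, and a curly opening quote
samples = ["I've", "I\u2019ve", "\u201cWhenever"]
ascii_result = [clean(w, ascii_table) for w in samples]
extended_result = [clean(w, extended_table) for w in samples]
print(ascii_result)
print(extended_result)
```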


In [None]:
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
  short = tokens[i-length:i]
  line = ' '.join(short)
  sequences.append(line)

# write a list of lines to a file, one per line
def save_doc(lines, filename):
	data = '\n'.join(lines)
	with open(filename, 'w') as file:
		file.write(data)
 
# save sequences to file
out_filename = 'text_sequences.txt'
save_doc(sequences, out_filename)

# check sequences
print('All Sequences: %d' % len(sequences))

All Sequences: 46673


## **Encoding**

In [None]:
# reload the saved sequences (same logic as load_text)
def load_doc(fileName):
  with open(fileName, 'r') as file:
    text = file.read()
  return text

in_filename = 'text_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# encode sequences as integer values
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
vocab_size = len(tokenizer.word_index) + 1

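# split each sequence into input (all but the last word) and output (the last word)
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
# one-hot encode the output word over the vocabulary
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]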
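
One-hot encoding turns each output index into a vector with one entry per vocabulary word. The sketch below shows the idea in plain Python with a hypothetical `one_hot` helper, then estimates the size of the resulting matrix for the figures reported in this notebook (46673 sequences, 5988 classes). The 4 bytes per element is an assumption (32-bit floats), not something the notebook states.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row

# Toy example: three output words over a vocabulary of six classes
y_ids = [3, 1, 4]
y_onehot = [one_hot(i, 6) for i in y_ids]
print(y_onehot[0])

# Rough size of the one-hot target matrix for this notebook's numbers
n_sequences, vocab_size = 46673, 5988
n_elements = n_sequences * vocab_size
approx_gib = n_elements * 4 / 2**30  # assumes 4-byte floats
print(n_elements, round(approx_gib, 2))
```

If memory becomes a constraint, one option is Keras's `sparse_categorical_crossentropy` loss, which takes the integer labels directly and so avoids building this matrix. The notebook does not do this.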

## **Train Model**

In [None]:
# define model
# 100-200 epochs gives functional generation quality; more epochs add
# training time for little additional accuracy
# Embedding size: typical values are 50, 100, 300, etc.
# LSTM units: more units generally help, 100 is a standard choice
model = Sequential()
model.add(Embedding(vocab_size, 150, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))

# summarize model
print(model.summary())

# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=100)

# save the model
model.save('model.h5')
# save the tokenizer (the context manager closes the file after writing)
with open('tokenizer.pkl', 'wb') as f:
  dump(tokenizer, f)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 50, 150)           898200    
                                                                 
 lstm (LSTM)                 (None, 50, 100)           100400    
                                                                 
 lstm_1 (LSTM)               (None, 100)               80400     
                                                                 
 dense (Dense)               (None, 100)               10100     
                                                                 
 dense_1 (Dense)             (None, 5988)              604788    
                                                                 
Total params: 1,693,888
Trainable params: 1,693,888
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch
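
The parameter counts in the model summary can be reproduced by hand from the layer sizes, which is a quick way to check how the embedding size and LSTM units affect model size. The helper names below are hypothetical. The formulas are the standard counts for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights, and a bias), and a dense layer.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size, embed_dim, units = 5988, 150, 100

counts = [
    embedding_params(vocab_size, embed_dim),  # embedding
    lstm_params(embed_dim, units),            # lstm
    lstm_params(units, units),                # lstm_1
    dense_params(units, units),               # dense
    dense_params(units, vocab_size),          # dense_1
]
total = sum(counts)
print(counts, total)
```

The embedding layer and the final dense layer both scale with the vocabulary size, and together they hold most of the parameters.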

## **Generate Text**

In [None]:
# load a text file into memory
def load_doc(filename):
	file = open(filename, 'r')
	text = file.read()
	file.close()
	return text

# load text sequences
in_filename = 'text_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# set sequence length
seq_length = len(lines[0].split()) - 1

# load saved model and tokenizer
model = load_model('model.h5')
with open('tokenizer.pkl', 'rb') as f:
	tokenizer = load(f)

In [None]:
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
	result = list()
	input_text = seed_text
	# generate a fixed number of words
	for _ in range(n_words):
		# encode text as integers
		encoded = tokenizer.texts_to_sequences([input_text])[0]
		# keep only the last seq_length words (pads on the left if shorter)
		encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
		# predict word probabilities and pick the most likely index
		yhat = int(np.argmax(model.predict(encoded, verbose=0)))
		# map predicted integer index to word using the tokenizer's reverse lookup
		output = tokenizer.index_word.get(yhat, '')
		# append to input
		input_text += ' ' + output
		result.append(output)
	return ' '.join(result)
 
# select and print seed text
# randint includes both endpoints, so the upper bound must be len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print("Seed text: " + seed_text + '\n')

# generate text
generated_text = generate_seq(model, tokenizer, seq_length, seed_text, 25)
print("Generated text: " + generated_text)

Seed text: tapped on the front glass want to get one of those she said earnestly want to get one for the apartment nice to we backed up to a grey old man who bore an absurd resemblance to john d rockefeller in a basket swung from his neck cowered a dozen very

Generated text: recent puppies of an indeterminate breed kind are often on the corrugated room of the ventura effort servants an overenlarged motor these cars moved here


## **Sources**

https://stackabuse.com/text-generation-with-python-and-tensorflow-keras/

https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/

https://www.thepythoncode.com/article/text-generation-keras-python

http://ethen8181.github.io/machine-learning/keras/rnn_language_model_basic_keras.html