In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### TensorFlow 2.3 and Keras 2.4 are needed for this code to execute.

In [None]:
!pip install keras==2.4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting keras==2.4
  Using cached Keras-2.4.0-py2.py3-none-any.whl (170 kB)
Installing collected packages: keras
  Attempting uninstall: keras
    Found existing installation: Keras 1.2.2
    Uninstalling Keras-1.2.2:
      Successfully uninstalled Keras-1.2.2
Successfully installed keras-2.4.0


In [None]:
!pip install tensorflow==2.3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import tensorflow as tf
print(tf.__version__)

2.3.0


In [None]:
import keras

keras.__version__

'2.4.0'

### Data Preparation
We will start by preparing the data for modeling.
The goal is to develop a model of the text that we can then use to generate new sequences of text.

The language model will be statistical and will predict the probability of each word given an input sequence of text. The predicted word will be fed in as input to in turn generate the next word.

#### Load Text:
The load_doc() function loads the entire text file into memory and returns it.

In [None]:
def load_doc(filename):
	# open the file as read only and read all text;
	# the with-block closes the file automatically
	with open(filename, 'r') as file:
		text = file.read()
	return text

In [None]:
# load document
in_filename = "/content/drive/My Drive/Colab Notebooks/NLP/republic_clean.txt"
doc = load_doc(in_filename)
print(doc[:200])

﻿INTRODUCTION AND ANALYSIS.

The Republic of Plato is the longest of his works with the exception
of the Laws, and is certainly the greatest of them. There are nearer
approaches to modern metaphysics 


#### Clean Text:
We clean the raw text with the following steps:

- Replace '--' with a white space so we can split words better.
- Split words based on white space.
- Remove all punctuation from words to reduce the vocabulary size (e.g. 'What?' becomes 'What').
- Remove all words that are not alphabetic to remove standalone punctuation tokens.
- Normalize all words to lowercase to reduce the vocabulary size.

The clean_doc() function takes a loaded document as an argument and returns a list of clean tokens.

In [None]:
import string
 
# turn a doc into clean tokens
def clean_doc(doc):
	# replace '--' with a space ' '
	doc = doc.replace('--', ' ')
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# make lower case
	tokens = [word.lower() for word in tokens]
	return tokens
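
As a quick sanity check of the cleaning steps, the sketch below applies the same rules to a made-up sentence. The `clean_tokens` name and the `sample` string are illustrative assumptions and are not part of the notebook.

```python
import string

def clean_tokens(text):
    # same steps as clean_doc(): replace '--', split on whitespace,
    # strip punctuation, keep alphabetic tokens, lowercase
    text = text.replace('--', ' ')
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in text.split()]
    return [w.lower() for w in stripped if w.isalpha()]

sample = "What? He said--no! 42 times -"   # hypothetical input
print(clean_tokens(sample))   # -> ['what', 'he', 'said', 'no', 'times']
```

The number `42` and the standalone `-` are dropped because neither is alphabetic once punctuation has been stripped.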

We can now run the cleaning operation on our loaded document.

In [None]:
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['and', 'analysis', 'the', 'republic', 'of', 'plato', 'is', 'the', 'longest', 'of', 'his', 'works', 'with', 'the', 'exception', 'of', 'the', 'laws', 'and', 'is', 'certainly', 'the', 'greatest', 'of', 'them', 'there', 'are', 'nearer', 'approaches', 'to', 'modern', 'metaphysics', 'in', 'the', 'philebus', 'and', 'in', 'the', 'sophist', 'the', 'politicus', 'or', 'statesman', 'is', 'more', 'ideal', 'the', 'form', 'and', 'institutions', 'of', 'the', 'state', 'are', 'more', 'clearly', 'drawn', 'out', 'in', 'the', 'laws', 'as', 'works', 'of', 'art', 'the', 'symposium', 'and', 'the', 'protagoras', 'are', 'of', 'higher', 'excellence', 'but', 'no', 'other', 'dialogue', 'of', 'plato', 'has', 'the', 'same', 'largeness', 'of', 'view', 'and', 'the', 'same', 'perfection', 'of', 'style', 'no', 'other', 'shows', 'an', 'equal', 'knowledge', 'of', 'the', 'world', 'or', 'contains', 'more', 'of', 'those', 'thoughts', 'which', 'are', 'new', 'as', 'well', 'as', 'old', 'and', 'not', 'of', 'one', 'age', 'only',

### Save Clean Text
We organize the long list of tokens into sequences of 51 input words and 1 output word.

That is, sequences of 52 words.

We do this by iterating over the list of tokens starting at index 52 and taking the 52 tokens before that index as a sequence, then repeating this process to the end of the list of tokens.

We transform the tokens into space-separated strings for later storage in a file.

The code to split the list of clean tokens into sequences with a length of 52 tokens is listed below.

In [None]:
# organize into sequences of tokens
length = 51 + 1
sequences = list()

for i in range(length, len(tokens)):
	# select sequence of tokens
	seq = tokens[i-length:i]
	# convert into a line
	line = ' '.join(seq)
	# store
	sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Total Sequences: 216638


Printing statistics on the list, we can see that we will have exactly 216638 training patterns to fit our model.
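
To make the sliding window concrete, here is a minimal sketch that runs the same loop on a tiny, made-up token list with a window of 4 instead of 52. The `make_sequences` helper and `toy_tokens` are hypothetical names used only for illustration.

```python
def make_sequences(tokens, length):
    # one window of `length` tokens ending just before position i,
    # mirroring the loop used in the notebook
    return [' '.join(tokens[i - length:i]) for i in range(length, len(tokens))]

toy_tokens = ['a', 'b', 'c', 'd', 'e', 'f']   # hypothetical tokens
toy_sequences = make_sequences(toy_tokens, length=4)
for line in toy_sequences:
    print(line)
# -> a b c d
# -> b c d e
```

With this loop the number of sequences is `len(tokens) - length`, so each window is shifted by one token relative to the previous one.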

In [None]:
print(sequences[0])

and analysis the republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them there are nearer approaches to modern metaphysics in the philebus and in the sophist the politicus or statesman is more ideal the form and institutions of the


Next, we have to save the sequences to a new file for later loading.

We have defined a new function for saving lines of text to a file. This new function is called save_doc(). It takes as input a list of lines and a filename. The lines are written as plain text, one per line.

In [None]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
	data = '\n'.join(lines)
	# the with-block closes the file automatically
	with open(filename, 'w') as file:
		file.write(data)

We can call this function and save our training sequences to the file ‘republic_sequences.txt‘.

In [None]:
# save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)

### Train Language Model
We can now train a statistical language model from the prepared data.

The model we will train is a neural language model. It has a few unique characteristics:

- It uses a distributed representation for words so that different words with similar meanings will have a similar representation.
- It learns the representation at the same time as learning the model.
- It learns to predict the probability for the next word using the context of the last 51 words.

Specifically, we have used an Embedding Layer to learn the representation of words, and a Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based on their context.

#### Load training data.

### **Load Sequences**
We can load our training data using the load_doc() function we developed in the previous section.

Once loaded, we can split the data into separate training sequences by splitting based on new lines.

The snippet below will load the ‘republic_sequences.txt‘ data file from the current working directory.

In [None]:
# load doc into memory
def load_doc(filename):
	# open the file as read only and read all text;
	# the with-block closes the file automatically
	with open(filename, 'r') as file:
		text = file.read()
	return text
 
# load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

In [None]:
print(lines[0:100])

['and analysis the republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them there are nearer approaches to modern metaphysics in the philebus and in the sophist the politicus or statesman is more ideal the form and institutions of the', 'analysis the republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them there are nearer approaches to modern metaphysics in the philebus and in the sophist the politicus or statesman is more ideal the form and institutions of the state', 'the republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them there are nearer approaches to modern metaphysics in the philebus and in the sophist the politicus or statesman is more ideal the form and institutions of the state are', 'republic of plato is the longest of his works with the exception of the laws and is certainly the greatest of them th

### **Encode Sequences**
The word embedding layer expects input sequences to be comprised of integers.

We can map each word in our vocabulary to a unique integer and encode our input sequences. Later, when we make predictions, we can convert the prediction to numbers and look up their associated words in the same mapping.

To do this encoding, we have used the Tokenizer class in the Keras API.

First, the Tokenizer must be trained on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer.

We can then use the fit Tokenizer to encode all of the training sequences, converting each sequence from a list of words to a list of integers.

In [None]:
import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# integer encode sequences of words
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

In [None]:
# print only the first encoded sequence; printing all of them
# exceeds the notebook's output data rate limit
print(sequences[0])



We can access the mapping of words to integers as a dictionary attribute called word_index on the Tokenizer object.

We need to know the size of the vocabulary for defining the embedding layer later. We can determine the vocabulary by calculating the size of the mapping dictionary.

Words are assigned values from 1 to the total number of words (10436). The Embedding layer needs to allocate a vector representation for each word in this vocabulary, from index 1 to the largest index. Because array indexing is zero-offset, the index of the word at the end of the vocabulary is 10436; that means the array must be 10436 + 1 in length.

Therefore, when specifying the vocabulary size to the Embedding layer, we specify it as 1 larger than the actual vocabulary. Here it equals **10437**.

In [None]:
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

10437
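
A small sketch of why the extra slot is needed, using a made-up four-word mapping and a plain list of rows in place of a real embedding matrix (both are assumptions for illustration):

```python
toy_word_index = {'the': 1, 'state': 2, 'of': 3, 'soul': 4}   # hypothetical mapping
toy_vocab_size = len(toy_word_index) + 1

# one row per possible index 0..4; row 0 is never assigned to a word
embedding_rows = [[0.0, 0.0] for _ in range(toy_vocab_size)]

largest_index = max(toy_word_index.values())
print(toy_vocab_size, largest_index)   # -> 5 4
row = embedding_rows[largest_index]    # valid only because of the + 1
```

Without the `+ 1`, the table would have four rows (indices 0 to 3) and looking up index 4 would fail.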


### **Sequence Inputs and Output**
**Now that we have encoded the input sequences, we need to separate them into input (X) and output (y) elements.**

We can do this with array slicing.

After separating, **we need to one hot encode the output word. This means converting it from an integer to a vector of 0 values, one for each word in the vocabulary, with a 1 to indicate the specific word at the index of the word's integer value.**

This is so that the model learns to predict the probability distribution for the next word; the ground truth to learn from is 0 for all words except the actual word that comes next.

Keras provides the **to_categorical()** function that can be used to one hot encode the output words for each input-output sequence pair.

**Finally, we need to specify to the Embedding layer** how long input sequences are. We know that there are 51 words because we designed the model, **but a good generic way to specify that is to use the second dimension (number of columns)** **of the input data’s shape.** That way, if you change the length of sequences when preparing data, you do not need to change this data loading code; it is generic.

In [None]:
# separate into input and output
from numpy import array
from keras.utils.np_utils import to_categorical
sequences = array(sequences)
# every column but the last is input; the last column is the word to predict
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

In [None]:
seq_length

51
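
The slicing and one hot encoding can be illustrated without Keras or NumPy. The sketch below uses plain lists and a hypothetical `one_hot` helper on two made-up encoded sequences; it mirrors the idea, not the library code.

```python
def one_hot(index, num_classes):
    # vector of zeros with a single 1 at position `index`
    vec = [0] * num_classes
    vec[index] = 1
    return vec

toy_sequences = [[1, 2, 3, 1], [2, 3, 1, 4]]   # hypothetical encoded lines
toy_vocab_size = 5

toy_X = [seq[:-1] for seq in toy_sequences]   # all but the last word
toy_y = [one_hot(seq[-1], toy_vocab_size) for seq in toy_sequences]
toy_seq_length = len(toy_X[0])

print(toy_X)            # -> [[1, 2, 3], [2, 3, 1]]
print(toy_y)            # -> [[0, 1, 0, 0, 0], [0, 0, 0, 0, 1]]
print(toy_seq_length)   # -> 3
```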

## **Fit Model**
We can now define and fit our language model on the training data.

The learned embedding needs to know the size of the vocabulary and the length of input sequences.

We have used two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.

A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to interpret the features extracted from the sequence. 

The output layer predicts the next word as a single vector the size of the vocabulary with a probability for each word in the vocabulary.

A softmax activation function is used to ensure the outputs have the characteristics of normalized probabilities.
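
As a minimal sketch of what softmax does, the function below turns arbitrary scores into positive values that sum to 1. The three input scores are made up; in the real model there would be one score per vocabulary word.

```python
import math

def softmax(scores):
    # subtract the max for numerical stability before exponentiating
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # hypothetical raw scores for 3 words
print(probs, sum(probs))
```

The largest score receives the largest probability, and the ordering of the scores is preserved.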

In [None]:
# LSTM network to generate text from Plato's Republic
from numpy import array
from pickle import dump
from keras.preprocessing.text import Tokenizer
#from keras.utils import to_categorical
from tensorflow.keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
# define model

model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 51, 50)            521850    
_________________________________________________________________
lstm_4 (LSTM)                (None, 51, 100)           60400     
_________________________________________________________________
lstm_5 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_4 (Dense)              (None, 100)               10100     
_________________________________________________________________
dense_5 (Dense)              (None, 10437)             1054137   
Total params: 1,726,887
Trainable params: 1,726,887
Non-trainable params: 0
_________________________________________________________________
None
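
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers with biases; the helper function names are hypothetical.

```python
def embedding_params(vocab, dim):
    # one vector of size `dim` per vocabulary index
    return vocab * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per unit
    return input_dim * units + units

counts = [
    embedding_params(10437, 50),
    lstm_params(50, 100),
    lstm_params(100, 100),
    dense_params(100, 100),
    dense_params(100, 10437),
]
print(counts, sum(counts))
# -> [521850, 60400, 80400, 10100, 1054137] 1726887
```

More than half of the parameters sit in the output layer, because it needs one weight vector per vocabulary word.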


Next, the model is compiled, specifying the categorical cross entropy loss needed to fit the model.

The model is learning a multi-class classification, and this is the suitable loss function for this type of problem. The efficient Adam optimizer is used, and the accuracy of the model is evaluated during training.
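
For a single training example, categorical cross entropy reduces to the negative log of the probability assigned to the true word. The sketch below shows this with made-up numbers; `categorical_crossentropy` here is a hand-written helper, not the Keras function.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    # only the position where y_true is 1 contributes to the sum
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0, 0, 1, 0]                   # hypothetical one hot target
confident = [0.05, 0.05, 0.85, 0.05]    # hypothetical predictions
unsure = [0.25, 0.25, 0.25, 0.25]

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The confident, correct prediction yields the lower loss, which is the behaviour training tries to encourage.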

Adam is an optimization algorithm that can be used instead of the classical stochastic gradient descent procedure to update network weights.

Adam benefits:

- Easy to implement.
- Well suited for problems that are large in terms of data and/or parameters.
- Uses features of the Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSProp).
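
To show how the two ideas combine, here is a minimal sketch of the Adam update rule for a single scalar weight. The objective `f(w) = w**2`, the starting point, and the learning rate of 0.1 are assumptions chosen for illustration; the other hyperparameter defaults are commonly cited values, and Keras may use slightly different ones.

```python
import math

def adam_minimize(grad_fn, w, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0
    history = []
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(w)
    return history

# hypothetical objective f(w) = w**2 with gradient 2*w, starting at w = 5
trajectory = adam_minimize(lambda w: 2 * w, w=5.0, steps=1000, lr=0.1)
print(trajectory[0], trajectory[-1])
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly the learning rate regardless of the gradient's magnitude.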

In [None]:
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f278a9c93d0>

An accuracy of just over 50% at predicting the next word in the sequence is a reasonable result. We are not aiming for 100% accuracy (e.g. a model that memorized the text), but rather a model that captures the essence of the text.

We could try to improve the model by adding more hidden layers, including more LSTM layers.

### Save Model

In [None]:
# save the model to file
model.save('model.h5')
# save the tokenizer
with open('tokenizer.pkl', 'wb') as f:
	dump(tokenizer, f)

## **Use Language Model**

Now that we have a trained language model, we can use it.

In this case, we can use it to generate new sequences of text that have the same statistical properties as the source text.

### Load Data

In [None]:
# load doc into memory
def load_doc(filename):
	# open the file as read only and read all text;
	# the with-block closes the file automatically
	with open(filename, 'r') as file:
		text = file.read()
	return text
 
# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

We need the text so that we can choose a source sequence as input to the model for generating a new sequence of text.

The model will require 51 words as input.

Later, we will need to specify the expected length of input. We can determine this from the input sequences by calculating the length of one line of the loaded data and subtracting 1 for the expected output word that is also on the same line.

### Load Model

We can now load the model from file.

Keras provides the load_model() function for loading the model

In [None]:
# load the model
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences


model = load_model('model.h5')

We can also load the tokenizer from file using the Pickle API.

In [None]:
# load the tokenizer
with open('tokenizer.pkl', 'rb') as f:
	tokenizer = load(f)

## **Generate Text**
The first step in generating text is preparing a seed input.

We will select a random line of text from the input text for this purpose. Once selected, we will print it so that we have some idea of what was used.

In [None]:
print(len(lines))

216638


In [None]:
# select a seed text
# randint includes both endpoints, so the upper bound is len(lines) - 1
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

intemperate and worthless subjects even though they might have made large fortunes out of them as to the story of pindar that asclepius was slain by a thunderbolt for restoring a rich man to life that is a lie following our old rule we must say either that he did not take



Next, we can generate new words, one at a time.

First, the seed text must be encoded to integers using the same tokenizer that we used when training the model.

In [None]:
encoded = tokenizer.texts_to_sequences([seed_text])[0]

In [None]:
print(encoded)
print(len(encoded))

[3409, 3, 3410, 433, 159, 240, 15, 168, 20, 133, 821, 2937, 106, 2, 25, 17, 4, 1, 1403, 2, 2234, 9, 1128, 50, 2938, 23, 7, 6859, 26, 3967, 7, 374, 54, 4, 76, 9, 5, 7, 457, 1060, 58, 198, 277, 21, 61, 71, 137, 9, 8, 278, 12, 149]
52


The model can predict the next word directly: model.predict() returns a probability for every word in the vocabulary, and taking the argmax gives the index of the word with the highest probability. (model.predict_classes() does the same in one call, but it is deprecated.)

The seed line holds 52 words (51 input words plus the output word), so we keep only the last seq_length integers and wrap them in a batch of one before predicting.

In [None]:
# predict the index of the most probable next word
import numpy as np

# shape (1, seq_length): a batch holding the last seq_length encoded words
x = np.array([encoded[-seq_length:]])
# argmax over the vocabulary axis, then take the single batch entry
yhat = int(np.argmax(model.predict(x, verbose=0), axis=-1)[0])

We can then look up the index in the Tokenizer's mapping to get the associated word.

In [None]:
out_word = ''
for word, index in tokenizer.word_index.items():
	if index == yhat:
		out_word = word
		break
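
Scanning the whole mapping for every generated word works, but the mapping can also be inverted once and reused. The sketch below shows this with a made-up mapping and a made-up predicted index. Recent versions of the Keras Tokenizer also appear to expose a similar `index_word` attribute, but check that your installed version provides it before relying on it.

```python
toy_word_index = {'the': 1, 'state': 2, 'of': 3, 'soul': 4}   # hypothetical mapping

# invert once: integer -> word
toy_index_word = {index: word for word, index in toy_word_index.items()}

predicted_index = 2   # hypothetical model prediction
out_word = toy_index_word.get(predicted_index, '')
print(out_word)   # -> state
```

Using `.get()` with a default of `''` keeps the behaviour of the loop version when the index (for example the padding index 0) has no associated word.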

Importantly, as generated words are appended to the seed, the input sequence grows longer than the model expects. We can truncate it to the desired length after the input sequence has been encoded to integers. Keras provides the pad_sequences() function that we can use to perform this truncation.

In [None]:
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
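
The effect of truncating from the front can be sketched in plain Python. The `pad_or_truncate_pre` helper below is a hypothetical, simplified stand-in for a single sequence. It also pads short sequences with zeros at the front, which is what pad_sequences() does with its default padding setting.

```python
def pad_or_truncate_pre(seq, maxlen, value=0):
    # keep the last `maxlen` items, dropping from the front if too long
    seq = seq[-maxlen:]
    # pad at the front if too short
    return [value] * (maxlen - len(seq)) + seq

print(pad_or_truncate_pre([7, 8, 9, 10, 11], maxlen=3))   # -> [9, 10, 11]
print(pad_or_truncate_pre([7, 8], maxlen=3))              # -> [0, 7, 8]
```

Dropping from the front keeps the most recent words, which are the ones the model should condition on when predicting the next word.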

We can wrap all of this into a function called generate_seq() that takes as input the model, the tokenizer, input sequence length, the seed text, and the number of words to generate. It then returns a sequence of words generated by the model.

In [None]:
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
 
# load doc into memory
def load_doc(filename):
	# open the file as read only and read all text;
	# the with-block closes the file automatically
	with open(filename, 'r') as file:
		text = file.read()
	return text
 
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
	result = list()
	in_text = seed_text
	# generate a fixed number of words
	for _ in range(n_words):
		# encode the text as integer
		encoded = tokenizer.texts_to_sequences([in_text])[0]
		# truncate sequences to a fixed length
		encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
		# predict the index of the most probable next word
		# (replaces the deprecated model.predict_classes)
		yhat = model.predict(encoded, verbose=0).argmax(axis=-1)[0]
		# map predicted word index to word
		out_word = ''
		for word, index in tokenizer.word_index.items():
			if index == yhat:
				out_word = word
				break
		# append to input
		in_text += ' ' + out_word
		result.append(out_word)
	return ' '.join(result)
 
# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1
 
# load the model
model = load_model('model.h5')
 
# load the tokenizer
with open('tokenizer.pkl', 'rb') as f:
	tokenizer = load(f)
 
# select a random seed text; randint includes both endpoints
seed_text = lines[randint(0, len(lines) - 1)]
print("text input : " , seed_text + '\n')
# override the random seed with a short custom prompt
seed_text = 'i am happy'
print("text input : " , seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print("Text Generated console: \n\n", generated)

text input :  abate until he have attained the knowledge of the true nature of every essence by a sympathetic and kindred power in the soul and by that power drawing near and mingling and becoming incorporate with very being having begotten mind and truth he will have knowledge and will live and grow truly

text input :  i am happy

Text Generated console: 

 and the state of the state of the state of the state of the state of the state of the state of the state of the state of the state of the state of the state of the state of the state of the state of the state of the
