## Project
#### Description:

* This project implements character-level text generation with an RNN (GRU layer), trained on the text of *Foundation* by Isaac Asimov.
* Based on the TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/text_generation#generate_text
* Details: see the comments and explanations in each block.




#### 1. Download the dataset.

In [1]:
!wget https://ia800708.us.archive.org/35/items/Foundation_201811/3%20Foundation_djvu.txt

--2020-08-14 19:46:44--  https://ia800708.us.archive.org/35/items/Foundation_201811/3%20Foundation_djvu.txt
Resolving ia800708.us.archive.org (ia800708.us.archive.org)... 207.241.230.78
Connecting to ia800708.us.archive.org (ia800708.us.archive.org)|207.241.230.78|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 423786 (414K) [text/plain]
Saving to: ‘3 Foundation_djvu.txt.2’


2020-08-14 19:46:45 (735 KB/s) - ‘3 Foundation_djvu.txt.2’ saved [423786/423786]



In [2]:
import tensorflow as tf

import numpy as np
import os
import time

In [3]:
# Deprecated: the warning printed below recommends tf.config.list_physical_devices('GPU').
tf.test.is_gpu_available()


Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True

In [4]:
# Read the file as bytes, then decode it as UTF-8.
with open('3 Foundation_djvu.txt', 'rb') as f:
  text = f.read().decode(encoding='utf-8')
# The length of the text is the number of characters in it.
print('Length of text: {} characters'.format(len(text)))

print(text[:500])
text.find('THE STORY')


Length of text: 411158 characters
ASI MOV 


THE FOUNDATION NOVELS 



FOUNDATION 
















FOUNDATION 


ISAAC ASIMOV 

Copyright © 1951 


To the memory of my mother (1895 — 1973) 


Contents 


THE STORY BEHIND THE “FOUNDATION” 

PART I - THE PSYCHOHISTORIANS 

PART II - THE ENCYCLOPEDISTS 

PART III - THE MAYORS 

PART IV - THE TRADERS 

PART V - THE MERCHANT PRINCES 








THE STORY BEHIND THE "FOUNDATION” 


By ISAAC ASIMOV 


The date was August 1, 1941. World War II had been raging for two years. 
France had fal


170

In [5]:
# Find where the running text starts (after the title page and the table of contents).
text.find('By ISAAC ASIMOV')


393

In [6]:
# Crop everything before the 'By ISAAC ASIMOV' marker (offset 393).
text = text[393:]
text[:500]

'By ISAAC ASIMOV \n\n\nThe date was August 1, 1941. World War II had been raging for two years. \nFrance had fallen, the Battle of Britain had been fought, and the Soviet Union \nhad just been invaded by Nazi Germany. The bombing of Pearl Harbor was four \nmonths in the future. \n\nBut on that day, with Europe in flames, and the evil shadow of Adolf \nHitler apparently falling over all the world, what was chiefly on my mind was a \nmeeting toward which I was hastening. \n\nI was 21 years old, a graduate stud'
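The crop uses the offset 393 as a literal copied from the `find` output. A minimal sketch of deriving the offset instead, so that a missing marker fails loudly; `crop_from_marker` is a hypothetical helper, not part of the notebook, and the sample string is a toy stand-in for the downloaded file:

```python
def crop_from_marker(text, marker):
    """Return the text starting at the first occurrence of marker."""
    start = text.find(marker)
    if start == -1:
        # str.find returns -1 when the marker is absent; slicing with -1
        # would silently keep only the last character of the text.
        raise ValueError('marker not found: {!r}'.format(marker))
    return text[start:]


# Toy input standing in for the downloaded file.
sample = 'FOUNDATION \n\nContents \n\nBy ISAAC ASIMOV \n\nThe date was August 1, 1941.'
cropped = crop_from_marker(sample, 'By ISAAC ASIMOV')
print(repr(cropped[:15]))
```

In the notebook this would presumably be called as `text = crop_from_marker(text, 'By ISAAC ASIMOV')`.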

In [7]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])


print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')


87 unique characters
{
  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '"' :   3,
  '$' :   4,
  '%' :   5,
  "'" :   6,
  '(' :   7,
  ')' :   8,
  '*' :   9,
  ',' :  10,
  '-' :  11,
  '.' :  12,
  '/' :  13,
  '0' :  14,
  '1' :  15,
  '2' :  16,
  '3' :  17,
  '4' :  18,
  '5' :  19,
  ...
}


In [8]:

print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))


'By ISAAC ASIM' ---- characters mapped to int ---- > [28 78  1 35 45 27 27 29  1 27 45 35 39]
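The mapping is reversible: `char2idx` encodes characters to integers and `idx2char` decodes them back. A small standalone sketch of that round trip on a toy string; the helper names `build_vocab`, `encode`, and `decode` are illustrative assumptions, and the ids differ from the notebook's because the toy vocabulary is much smaller:

```python
def build_vocab(text):
    """Sorted unique characters plus a character-to-index lookup."""
    vocab = sorted(set(text))
    char2idx = {ch: i for i, ch in enumerate(vocab)}
    return vocab, char2idx


def encode(text, char2idx):
    """Map each character to its integer id (KeyError if unseen)."""
    return [char2idx[ch] for ch in text]


def decode(ids, vocab):
    """Map integer ids back to a string."""
    return ''.join(vocab[i] for i in ids)


sample = 'By ISAAC ASIMOV'
vocab, char2idx = build_vocab(sample)
ids = encode(sample, char2idx)
print(ids)
print(decode(ids, vocab) == sample)
```

One consequence of the dictionary lookup: a character that never occurs in the training text has no id, so encoding it raises `KeyError`.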


In [9]:
# The maximum sequence length, in characters, for a single input
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])


B
y
 
I
S


In [10]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))


'By ISAAC ASIMOV \n\n\nThe date was August 1, 1941. World War II had been raging for two years. \nFrance h'
'ad fallen, the Battle of Britain had been fought, and the Soviet Union \nhad just been invaded by Nazi'
' Germany. The bombing of Pearl Harbor was four \nmonths in the future. \n\nBut on that day, with Europe '
'in flames, and the evil shadow of Adolf \nHitler apparently falling over all the world, what was chief'
'ly on my mind was a \nmeeting toward which I was hastening. \n\nI was 21 years old, a graduate student i'


In [11]:
def split_input_target(chunk):
  input_text = chunk[:-1]
  target_text = chunk[1:]
  return input_text, target_text

dataset = sequences.map(split_input_target)


In [12]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))


Input data:  'By ISAAC ASIMOV \n\n\nThe date was August 1, 1941. World War II had been raging for two years. \nFrance '
Target data: 'y ISAAC ASIMOV \n\n\nThe date was August 1, 1941. World War II had been raging for two years. \nFrance h'


In [13]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
  print("Step {:4d}".format(i))
  print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
  print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))


Step    0
  input: 28 ('B')
  expected output: 78 ('y')
Step    1
  input: 78 ('y')
  expected output: 1 (' ')
Step    2
  input: 1 (' ')
  expected output: 35 ('I')
Step    3
  input: 35 ('I')
  expected output: 45 ('S')
Step    4
  input: 45 ('S')
  expected output: 27 ('A')


In [14]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset


<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
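The dataset sizes follow from the numbers printed earlier. A short arithmetic sketch, assuming the reported length of 411158 characters was measured before the 393-character crop (as the cell order suggests):

```python
total_chars = 411158   # "Length of text" printed when the file was read
crop_offset = 393      # characters removed by the crop
seq_length = 100
batch_size = 64
buffer_size = 10000

chars_after_crop = total_chars - crop_offset
# Each training sequence holds seq_length + 1 characters; the last
# partial chunk is dropped (drop_remainder=True).
num_sequences = chars_after_crop // (seq_length + 1)
# The last partial batch is dropped as well.
steps_per_epoch = num_sequences // batch_size

print(chars_after_crop, num_sequences, steps_per_epoch)
print(buffer_size >= num_sequences)
```

This gives 4066 sequences and 63 batches per epoch. Because the shuffle buffer (10000) is larger than the number of sequences, the whole dataset fits in the buffer and the shuffle is effectively a full shuffle.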

In [15]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024


In [16]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model


In [17]:
model = build_model(
    vocab_size = len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)


In [18]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")


(64, 100, 87) # (batch_size, sequence_length, vocab_size)


In [19]:
model.summary()


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           22272     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 87)            89175     
Total params: 4,049,751
Trainable params: 4,049,751
Non-trainable params: 0
_________________________________________________________________
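The parameter counts in the summary can be reproduced by hand. A sketch of the arithmetic, assuming the GRU layout with two bias vectors per weight block (`reset_after=True`, which matches the reported count):

```python
vocab_size = 87
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: three weight blocks (update gate, reset gate, candidate state),
# each with an input kernel, a recurrent kernel, and two bias vectors.
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weight matrix plus one bias per output logit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```

The four values are 22272, 3938304, 89175, and 4049751, in line with the summary table.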


In [20]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices

array([13,  4, 32,  2, 56, 61,  8, 35, 70, 37, 75, 22, 70, 26, 18, 65, 65,
       36, 21, 31, 66, 60, 51, 84, 18, 40, 61, 48, 86,  7, 63,  6, 72, 57,
        0, 43, 39,  5,  5, 16, 47, 60,  6, 10, 49, 35, 73, 57, 78, 72, 53,
       22, 24, 80, 75, 41, 66, 50, 10, 80, 21, 56, 52, 53, 79,  2, 53, 68,
       27, 86, 59, 65, 25,  5, 17, 48,  1, 78, 23, 63, 42, 75, 62, 73, 33,
       58, 71, 48, 11, 61,  1, 14, 28, 22, 33,  5, 72, 56, 79,  9])

## Experiment.
To check how sampling works, decode the sampled indices back into characters. Gibberish is expected here, since the model is NOT trained yet.

In [21]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))


Input: 
 '. That was for the inhabitants of Terminus itself. The men of the Outer \nPlanets could hear only cen'

Next Char Predictions: 
 "/$F!ch)IqKv8q?4llJ7EmgY’4NhV”(j'sd\nQM%%2Ug',WItdys^8:~vOmX,~7cZ^z!^oA”fl;%3V y9jPvitGerV-h 0B8G%scz*"


In [22]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())


Prediction shape:  (64, 100, 87)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.464743
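As a sanity check, an untrained model should predict roughly uniformly over the vocabulary, and the cross-entropy of a uniform guess over `vocab_size` classes is `ln(vocab_size)`. A minimal sketch comparing that baseline with the loss printed above:

```python
import math

vocab_size = 87
observed_loss = 4.464743  # scalar_loss printed by the notebook

# Cross-entropy of a uniform prediction over vocab_size classes.
uniform_loss = math.log(vocab_size)

print(round(uniform_loss, 4))
print(abs(uniform_loss - observed_loss) < 0.01)
```

The baseline is about 4.466, so the initial loss of 4.4647 is what an untrained model should give. A much higher starting loss would suggest a problem with initialization or with the loss setup.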


In [23]:
model.compile(optimizer='adam', loss=loss)


In [24]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)


# Training

In [None]:
EPOCHS = 100
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])



In [26]:
tf.train.latest_checkpoint(checkpoint_dir)


'./training_checkpoints/ckpt_100'

In [27]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            22272     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 87)             89175     
Total params: 4,049,751
Trainable params: 4,049,751
Non-trainable params: 0
_________________________________________________________________


# Inference

In [28]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 280

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
    predictions = model(input_eval)
    # remove the batch dimension
    predictions = tf.squeeze(predictions, 0)

    # using a categorical distribution to predict the character returned by the model
    predictions = predictions / temperature
    predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

    # We pass the predicted character as the next input to the model
    # along with the previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)

    text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
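Dividing the logits by `temperature` before sampling is equivalent to sampling from `softmax(logits / temperature)`. A standalone sketch of that effect, using only the standard library; the logits are made-up values and `softmax_with_temperature` is a hypothetical helper, not part of the notebook:

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three characters
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Drawing one id from the distribution, as the sampling step does.
rng = random.Random(0)
probs = softmax_with_temperature(logits, 1.0)
sampled_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
print(sampled_id)
```

Lower temperatures concentrate probability on the highest-scoring character, while higher temperatures flatten the distribution toward uniform.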


In [29]:
print(generate_text(model, start_string='The Foundation'))

The Foundation series had been written at a bitor of 
religious projectation medital rights to bring inch of sacrilege of the highest order?” 

Wienis presented his lips might and said: “Well, now, I have come to a complacency. 

“Back to the Imperie mean that had 
shoulders. In no 



passeng


In [30]:
messages = ( 'Hi','How are you?',"What's your name?", 'Tell me about yourself',
            'Do you love me?', "What's the meaning of life?", "How is the weather today?",
            "Let's have a dinner", "Are you a bot?", "Why not?")

for m in messages:
  print('Input: '+m+'\n')
  print('output: '+generate_text(model, start_string=m))
  print('\n\n')


Input: Hi

output: His possible. How would you believe e important things of 
metallically: “I seem to Twer shortly and began a ship its railings with believe in’s interested in par 

“What is it?” replay in One hands with events 
of economics and so forth. Its accepted cigars and breat diw 
bourdedi



Input: How are you?

output: How are you? Atho-d most power plants on their active is a vouches of the old Imperial Nyak of 
by unstracted reverie. “No, milord, 
cause Empire action to better his 
view. He resistance had removed the “Action to Lupon ‘have half ones of the Foundation. The known prod Dorngray; his kings o



Input: What's your name?

output: What's your name? We know that some 
day we’re to write a small man; Mr. A side is necessary. You mean what 
you do that I know -- not expansion of his vigile seco-sure that 
will final and wondered, -vanious men of the Galaxy, if you 
lack the ancient days when the Galaxy for heavious capable on



Input: Tell me about yourself

outp