<a href="https://colab.research.google.com/github/Keenandrea/GRU/blob/master/GRU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Our Setup
---
## import TensorFlow, import libraries
---
TensorFlow is an open-source software library employed for machine learning applications that we'll be leveraging to construct our neural network.

Here the line **tf.enable_eager_execution()** (a TensorFlow 1.x call) turns on eager execution, an imperative programming environment that evaluates operations immediately instead of building a graph to run later. In short, it will equip us with a more interactive frontend to TensorFlow. 

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf
tf.enable_eager_execution()

from keras.utils import np_utils

import numpy as np
import os
import time

## download personal dataset
---
## mounting the *Google Drive*
---
To mount our *Google Drive* inside of a *Google Colab* instance, use the following commands:

In [0]:
from google.colab import drive
drive.mount('/content/drive')

All right. Now, with our *Google Drive* mounted, we can pull and push files between our *Drive* and our *Colab* instance with ease.

First, let's download a textfile. To add flavor, we'll choose a textfile whose author not only writes masterfully, but stylistically, too. Check out [Gutenberg](https://www.gutenberg.org/catalog/). At the link, we'll find some 59,000 books that we're invited to download freely as *.txt*.

Once we've chosen our textfile, we'll open it in a notepad editor and delete the leading and trailing paragraphs that either introduce the text or explain Project Gutenberg's terms.

With our textfile cleaned, we'll open our *Colab* and upload the textfile. To do this, click the **>** slider menu near the upper-lefthand corner of your *Colab* notebook instance. At the top of the menu, click the leftmost tab labeled **Files** and click **UPLOAD**. Our *File Explorer* will open. From it, select your textfile.

---
## read the data
---
Let's have a looksee at the text. We'll read it, decode it, and understand it.

In [0]:
# Read, then decode for py2 compatibility
text = open('metakaf.txt', 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

Length of text: 121108 characters


Have a look at the first 50 characters in the text:

In [0]:
print(text[:50])

One morning, when Gregor Samsa woke from troubled 


Check out the unique characters in the file:

In [0]:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

62 unique characters


## process the text

---
Next things next, we need to map strings to a numerical representation before training. How do we get this? We create two lookup tables:

In [0]:
# lookup table mapping from unique characters
# to numbers
char2idx = {u:i for i, u in enumerate(vocab)}
# lookup table mapping from numbers to characters
idx2char = np.array(vocab)
# the whole text encoded as an array of integers
text_as_int = np.array([char2idx[c] for c in text])

Run this cell to get the inside scoop on the integer mapping for each character:

In [0]:
print('{')
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

{
  '\n':   0,
  '\r':   1,
  ' ' :   2,
  '!' :   3,
  '"' :   4,
  "'" :   5,
  '(' :   6,
  ')' :   7,
  ',' :   8,
  '-' :   9,
  '.' :  10,
  ':' :  11,
  ';' :  12,
  '?' :  13,
  'A' :  14,
  'B' :  15,
  'C' :  16,
  'D' :  17,
  'E' :  18,
  'F' :  19,
  ...
}


And so on. And this cell to show how the first 13 characters from the text are mapped to integers:


In [0]:
print ('{} ---- characters mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

'One morning, ' ---- characters mapped to int ---- > [27 49 40  2 48 50 53 49 44 49 42  8  2]
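
The two tables are inverses of each other, so encoding a string and then decoding it should hand back the original. Here is a minimal, TensorFlow-free sketch of that round trip. The `toy_` names and the sample string are made up for illustration and are not part of the notebook.

```python
# Toy text standing in for the novel; any string works here.
toy_text = "One morning"

# Same recipe as above: sorted unique characters, then two lookup tables.
toy_vocab = sorted(set(toy_text))
toy_char2idx = {ch: i for i, ch in enumerate(toy_vocab)}
toy_idx2char = list(toy_vocab)

# Encode characters to integers, then decode the integers back to characters.
encoded = [toy_char2idx[ch] for ch in toy_text]
decoded = "".join(toy_idx2char[i] for i in encoded)

print(encoded)
print(decoded == toy_text)
```

If the round trip ever fails, the two tables were built from different vocabularies.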


## the prediction task

---
Let's say you're given a character, or even a sequence of characters. Based off that, if you're asked to predict the most probable next character, what sort of methodology would you employ?

Answer or not, we're going to show you how *RNNs* approach this problem in the upcoming sections. First, know that RNNs maintain an internal state that depends on their previously seen elements, and they use it to predict the next element.

---

## create training examples and targets

---
We divide the text into example sequences of a specified length. Each example sequence is paired with a target sequence of the same length, shifted one character to the right.

For instance, say our specified length is 4 and our text is 'Bathe'. Under these conditions, the input sequence would be 'Bath', and the target sequence: 'athe'. And so on.
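
The 'Bathe' example can be checked in a few lines of plain Python. This is only a sketch: `toy_split` and `toy_chunk` are hypothetical names, and the slicing mirrors the idea rather than any particular cell of the notebook.

```python
def toy_split(chunk):
    """Return (input, target): the chunk without its last / first character."""
    return chunk[:-1], chunk[1:]

toy_seq_length = 4
toy_chunk = "Bathe"[:toy_seq_length + 1]  # a chunk holds seq_length + 1 characters

input_text, target_text = toy_split(toy_chunk)
print(input_text)   # Bath
print(target_text)  # athe
```

Note that a chunk has to be one character longer than the sequence length, so that both halves come out at exactly the specified length.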


In [0]:
seq_length = 100
examples_per_epoch = len(text)//seq_length

# create training examples and targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

Instructions for updating:
Colocations handled automatically by placer.
O
n
e
 
m


Using the *batch* method, we easily convert these individual characters to sequences of the desired size.

In [0]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

'One morning, when Gregor Samsa woke from troubled dreams, he found\r\nhimself transformed in his bed in'
'to a horrible vermin.  He lay on\r\nhis armour-like back, and if he lifted his head a little he could\r\n'
'see his brown belly, slightly domed and divided by arches into stiff\r\nsections.  The bedding was hard'
'ly able to cover it and seemed ready\r\nto slide off any moment.  His many legs, pitifully thin compare'
'd\r\nwith the size of the rest of him, waved about helplessly as he\r\nlooked.\r\n\r\n"What\'s happened to me?'
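
If it helps to see what *batch* with *drop_remainder=True* is doing, here is a rough pure-Python equivalent on a toy list. The helper `chunk_with_drop_remainder` is our own invention for this sketch, not a TensorFlow function.

```python
def chunk_with_drop_remainder(items, size):
    """Cut items into consecutive chunks of exactly `size`, dropping any leftover tail."""
    n_full = len(items) // size
    return [items[i * size:(i + 1) * size] for i in range(n_full)]

toy_items = list(range(10))
chunks = chunk_with_drop_remainder(toy_items, 4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The last two items (8 and 9) do not fill a chunk, so they are dropped. The same thing happens to the tail of the text above.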


Using the *map* method, we apply a simple function to each sequence, which duplicates it and shifts it by one character to form the input and target.

In [0]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

For those visual learners out there:

In [0]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'One morning, when Gregor Samsa woke from troubled dreams, he found\r\nhimself transformed in his bed i'
Target data: 'ne morning, when Gregor Samsa woke from troubled dreams, he found\r\nhimself transformed in his bed in'


Each index of these vectors is processed as a single time step. Every time step runs the same next-index prediction; however, the *RNN* considers the context from previous steps as well as the current input character.

In [0]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 27 ('O')
  expected output: 49 ('n')
Step    1
  input: 49 ('n')
  expected output: 40 ('e')
Step    2
  input: 40 ('e')
  expected output: 2 (' ')
Step    3
  input: 2 (' ')
  expected output: 48 ('m')
Step    4
  input: 48 ('m')
  expected output: 50 ('o')


## training batches

---

Before the model architecture feeds on our data, we shuffle the data and pack it into batches. Why are we shuffling the data? I'm glad you asked. Without getting too technical, shuffling changes how the loss function **W** is evaluated on the training dataset **X**. 

When **X** is unchanged over training iterations, the value of **W** on **X** can be pictured as the elevation of a surface over the model's parameters. That surface will have numerous local minima. Gradient descent algorithms are susceptible to becoming stuck in these minima while better solutions may be nearby. 

By shuffling the rows and then training on a subset, or batch, of **X** during every iteration, the **X** we evaluate on changes with the iteration. The likely result is that no two iterations over the entire run of training iterations and epochs are performed on the exact same **X**.

The effect makes it easier to leap out of a local minimum and acquire a lower **W**, which is a definite characteristic of a viable model.
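
As a rough illustration of "shuffle, then batch", the sketch below reshuffles a toy dataset at the start of each epoch and cuts it into full batches. It assumes the shuffle buffer covers the whole dataset, so the shuffle is a full permutation. `shuffled_batches` is a hypothetical helper written with the standard library rather than *tf.data*.

```python
import random

def shuffled_batches(examples, batch_size, rng):
    """Shuffle a copy of the examples, then cut it into full batches only."""
    order = list(examples)
    rng.shuffle(order)
    n_full = len(order) // batch_size
    return [order[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]

rng = random.Random(0)           # seeded so the sketch is repeatable
toy_examples = list(range(12))   # stand-ins for (input, target) pairs

epoch_1 = shuffled_batches(toy_examples, 4, rng)
epoch_2 = shuffled_batches(toy_examples, 4, rng)
print(epoch_1)
print(epoch_2)
```

Both epochs see every example exactly once, but how the examples are grouped into batches typically differs from one epoch to the next.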

In [0]:
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch//BATCH_SIZE

BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<DatasetV1Adapter shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
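
To make the numbers concrete, here is the arithmetic behind *steps_per_epoch*, using the text length printed earlier. One caveat we're reading off the code above: each batched chunk holds *seq_length + 1* characters, so the real number of sequences is slightly lower than *examples_per_epoch*. For this text, both counts happen to give the same number of full batches.

```python
text_length = 121108   # from the "Length of text" output above
seq_length = 100
batch_size = 64

# The notebook's estimate.
examples_per_epoch = text_length // seq_length       # 1211
steps_per_epoch = examples_per_epoch // batch_size   # 18

# What batch(seq_length + 1, drop_remainder=True) actually yields.
actual_sequences = text_length // (seq_length + 1)   # 1199
actual_batches = actual_sequences // batch_size      # 18

print(examples_per_epoch, steps_per_epoch)
print(actual_sequences, actual_batches)
```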

## build that model architecture

---

Enough of that. Now let's get into the architecture of our network. Although you're welcome to add complexity, for this example, three layers define the architecture:

1.   **Embedding**: this is the input layer. Think of it as a trainable lookup table that maps the number of each character to a vector with *embedding_dim* dimensions. The dimensionality of the embeddings is the length of these character vectors.
2.   **GRU**: this is a type of RNN with a size *units=rnn_units*. Once again, to build a more complex architecture, you can employ *LSTM* layers here in place of the *GRU*.
3.   **Dense**: this is the output layer, with *vocab_size* outputs.



In [0]:
vocab_size = len(vocab)
# embedding dimension 
embedding_dim = 256
# number of RNN units
rnn_units = 1024


Next we define a function to build the architecture. We use *CuDNNGRU* when a *GPU* is available; otherwise we fall back to the standard *GRU* layer with a sigmoid recurrent activation.

In [0]:
if tf.test.is_gpu_available():
  rnn = tf.keras.layers.CuDNNGRU
else:
  import functools
  rnn = functools.partial(
    tf.keras.layers.GRU, recurrent_activation='sigmoid')

In [0]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, 
                              batch_input_shape=[batch_size, None]),
    rnn(rnn_units,
        return_sequences=True, 
        recurrent_initializer='glorot_uniform',
        stateful=True),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [0]:
model = build_model(
  vocab_size = len(vocab), 
  embedding_dim=embedding_dim, 
  rnn_units=rnn_units, 
  batch_size=BATCH_SIZE)

So, how does it work? For each character the model looks up the embedding, runs the *GRU* one timestep with the embedding as input, and then applies the dense layer to generate logits predicting the log-likelihood of the next character.

Let's look under the hood. First, let's check the shape of the output:

In [0]:
for input_example_batch, target_example_batch in dataset.take(1): 
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape)
  # (batch_size, sequence_length, vocab_size)

(64, 100, 62)


Now let's get a summary of the model architecture:

In [0]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           15872     
_________________________________________________________________
cu_dnngru (CuDNNGRU)         (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 62)            63550     
Total params: 4,017,726
Trainable params: 4,017,726
Non-trainable params: 0
_________________________________________________________________
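
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the usual layout for these layers: the *GRU* has three gates, each with an input kernel, a recurrent kernel, and (in the *CuDNNGRU* variant) two bias vectors, while the dense layer has one weight matrix plus a bias.

```python
vocab_size = 62
embedding_dim = 256
rnn_units = 1024

# One embedding_dim-long vector per character in the vocabulary.
embedding_params = vocab_size * embedding_dim

# Three gates, each with: input kernel + recurrent kernel + two bias vectors.
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Weight matrix from rnn_units to vocab_size, plus one bias per output.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 15872 3938304 63550 4017726
```

Nearly all of the weights sit in the recurrent layer, which is why changing *rnn_units* has by far the biggest effect on model size.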


To get actual predictions from the model, we need to sample from the output distribution to get actual character indices. This distribution is defined by the logits over the character vocabulary.

In [0]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

So we get, at each timestep, a prediction of the next character index:

In [0]:
sampled_indices

array([29, 40, 18, 45, 36,  7, 47, 17, 54, 36, 47, 39, 22,  3, 57,  5, 31,
       41, 32,  4, 28, 45,  9, 53, 31, 20,  9,  3, 25, 25, 11, 60, 42, 22,
       30, 22,  5, 46,  0, 32, 59, 39, 41, 49, 14, 19, 56, 25, 38, 24, 49,
        5, 56, 55, 41, 27, 20, 27, 21, 16, 37, 27, 52,  1, 44,  3, 20,  8,
       33, 34, 61, 51, 43, 35,  1, 54, 42, 57, 35, 30,  1, 16, 34, 36, 26,
       61, 49, 48, 59, 13, 25, 59, 45, 29, 40, 27, 51, 24, 23, 50])
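
To see what "sampling from the distribution defined by the logits" means, here is a small standard-library sketch that turns logits into probabilities with a softmax and then draws indices from them. The helpers `softmax` and `sample_index` and the toy logits are assumptions for illustration; the notebook itself relies on *tf.random.categorical*.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw one index with probability proportional to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.1]

probs = softmax(toy_logits)
samples = [sample_index(toy_logits, rng) for _ in range(1000)]
argmax_index = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

print(probs)
print([samples.count(i) for i in range(3)])
print(argmax_index)
```

Always taking the argmax would return the same index every time. Sampling still favors it but lets the less likely indices through now and then, which helps keep generated text from falling into a loop.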

Decoding these will show us the text predicted by this untrained model:

In [0]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 'would probably be the only one who would\r\ndare enter a room dominated by Gregor crawling about the b'

Next Char Predictions: 
 'QeEja)lDsaldI!v\'TfU"Pj-rTG-!MM:ygISI\'k\nUxdfnAFuMcLn\'utfOGOHCbOq\ri!G,VWzphY\rsgvYS\rCWaNznmx?MxjQeOpLJo'


## train the model

---

Our problem can be treated as a standard classification problem. Why? Because, given the previous *RNN* state and the input at the current time step, the task is to predict the **class** of the next character.

---

## optimizer and loss function attached

---

The standard *tf.keras.losses.sparse_categorical_crossentropy* loss function works in this case because it is applied across the last dimension of the predictions. Since our model returns logits, we set the *from_logits* flag.

In [0]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)") 
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 62)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.1263423
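
A handy sanity check on that number: an untrained model that spreads its probability evenly over 62 characters should score a loss of about ln(62) ≈ 4.127, close to the value printed above. The sketch below computes the per-character loss in plain Python. `sparse_ce_from_logits` is a hypothetical stand-in for illustration, not the Keras function.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one integer label: log-sum-exp(logits) - logits[label]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

vocab_size = 62

# All logits equal: every character is considered equally likely.
uniform_logits = [0.0] * vocab_size
loss_uniform = sparse_ce_from_logits(5, uniform_logits)

# One logit much larger than the rest: the model is confident and correct.
confident_logits = [0.0] * vocab_size
confident_logits[5] = 10.0
loss_confident = sparse_ce_from_logits(5, confident_logits)

print(loss_uniform, math.log(vocab_size))
print(loss_confident)
```

A starting loss far above ln(vocab_size) would suggest the initial logits are already badly skewed, so this is worth checking before a long training run.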


Using *Adam* as our optimizer, we configure the training procedure and compile:

In [0]:
model.compile(
    optimizer = tf.train.AdamOptimizer(),
    loss = loss)

## checkpoints

---

We will use a *ModelCheckpoint* callback to ensure that checkpoints are saved during training:

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
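
The *{epoch}* in the prefix is an ordinary Python format field, so it is easy to preview what the filenames will look like. A small sketch follows; the epoch numbers are picked arbitrarily for illustration.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# The callback fills in {epoch} when it saves; here we do it by hand.
paths = [checkpoint_prefix.format(epoch=e) for e in (1, 2, 100)]
for path in paths:
    print(path)
```

Because each epoch gets its own filename, earlier checkpoints are kept rather than overwritten.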

## execute the sucker

---

Define the number of epochs. Train network with *model.fit*. Save as *history* to compare metrics:

In [0]:
EPOCHS=100

In [0]:
history = model.fit(dataset.repeat(), epochs=EPOCHS, steps_per_epoch=steps_per_epoch, callbacks=[checkpoint_callback])

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

## generate text

---

To restore the latest checkpoint, look it up in the checkpoint directory:

In [0]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_100'

Once more, build the architecture, this time using the weights from the checkpoint directory:

In [0]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

## loop of generation

---



In [0]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 2000

  # Converting our start string to numbers (vectorizing) 
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 0.33

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # sample the next character id from the categorical distribution
      # defined by the temperature-scaled logits
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
      
      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)
      
      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
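
To get a feel for what dividing the logits by *temperature* does, the sketch below compares the resulting probabilities for three settings. It uses only the standard library. `softmax_with_temperature` and the toy logits are assumptions for illustration, not part of the notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature (shifted by the max for stability)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]

sharp = softmax_with_temperature(toy_logits, 0.33)  # the setting used above
plain = softmax_with_temperature(toy_logits, 1.0)   # logits left unchanged
flat = softmax_with_temperature(toy_logits, 3.0)    # a hotter setting

for name, probs in (("0.33", sharp), ("1.0", plain), ("3.0", flat)):
    print(name, [round(p, 3) for p in probs])
```

At a low temperature the most likely character takes almost all of the probability mass, which makes the text predictable and prone to repeating itself. A higher temperature flattens the distribution and lets rarer characters through.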

Prompt generation with a *start_string*:

In [0]:
print(generate_text(model, start_string=u"Time "))

Time on her face as if she had some tremendous
good news to report, but would only do it if she was clever; she was already
intend, the other food there staring
dry-eyed at the table.

Gregor hardly slept at all, either night or day.  Sometimes tice than a month, and his condition seemed serious
enough to remind even his father that Gregor, despite its breadth and its weight, the bulk of
his body eventually followed slowly in the dirtter wanted to spare them what distress she could as they were
indeed suffering enough.

It was impossible for her to bring her mother out of her face as if she had some tremendous
good news to report, but would only do it if she was clever; she was already
in hear not even touched at all as if it could not be used any more.
She quickly dropped it all into a bin, and a chair to the window, climbing up onto
the sill and, propped up in the chair where the gentlemen and sat - leaving
the chair where the gentlemen and sat - leaving
the chair whe

## customized training 

---



In [0]:
model = build_model(
  vocab_size = len(vocab), 
  embedding_dim=embedding_dim, 
  rnn_units=rnn_units, 
  batch_size=BATCH_SIZE)

In [0]:
optimizer = tf.train.AdamOptimizer()

In [0]:
EPOCHS = 1

for epoch in range(EPOCHS):
    start = time.time()
    
    # resetting the hidden state at the start of every epoch
    # (reset_states() returns None; the state lives inside the stateful layer)
    model.reset_states()
    
    for (batch_n, (inp, target)) in enumerate(dataset):
          with tf.GradientTape() as tape:
              # the stateful GRU carries its hidden state over from the
              # previous batch, so calling the model is all we need here
              predictions = model(inp)
              loss = tf.losses.sparse_softmax_cross_entropy(target, predictions)
              
          grads = tape.gradient(loss, model.trainable_variables)
          optimizer.apply_gradients(zip(grads, model.trainable_variables))

          if batch_n % 100 == 0:
              template = 'Epoch {} Batch {} Loss {:.4f}'
              print(template.format(epoch+1, batch_n, loss))

    # saving (checkpoint) the model every 5 epochs
    if (epoch + 1) % 5 == 0:
      model.save_weights(checkpoint_prefix.format(epoch=epoch))

    print ('Epoch {} Loss {:.4f}'.format(epoch+1, loss))
    print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

model.save_weights(checkpoint_prefix.format(epoch=epoch))

# the end.