#Natural Language Processing (Character Generation) with Recurrent Neural Network


We will generate text one character at a time using an RNN.  We will work with a dataset of Shakespeare's writing.  Given a sequence of characters from this data ("Shakespear"), we train a model to predict the next character in the sequence ("e"). Longer sequences of text can be generated by calling the model repeatedly.

- The model is character-based.
- The model is trained on small batches of text (100 characters each), and is able to generate a longer sequence of text with coherent structure.

*This guide is based on the following: https://www.tensorflow.org/tutorials/text/text_generation*


In [1]:
# %tensorflow_version 2.x
from keras.preprocessing import sequence
import keras
import tensorflow as tf
import os
import numpy as np

###Download the Shakespeare Dataset

Here, we use an extract of Shakespeare's plays for training.  You can also supply your own text file and train the network on it.

In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# If you load your own data file, use the following lines:
# from google.colab import files
# path_to_file = list(files.upload().keys())[0]

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


###Read Contents of File
Let's look at the contents of the file.

In [3]:
# Read the file as bytes, then decode it as UTF-8.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [4]:
# Take a look at the first 250 characters in text
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



In [5]:
# The unique characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

65 unique characters


###Encode
Encode each unique character as a different integer.

In [6]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

def text_to_int(text):
  return np.array([char2idx[c] for c in text])

text_as_int = text_to_int(text)

In [7]:
# Check how a part of the text is encoded
print("Text:", text[:13])
print("Encoded:", text_to_int(text[:13]))

Text: First Citizen
Encoded: [18 47 56 57 58  1 15 47 58 47 64 43 52]


Make a function that can convert the numeric values back to text.


In [8]:
def int_to_text(ints):
  # Accept either a tf.Tensor or a NumPy array of character indices
  try:
    ints = ints.numpy()
  except AttributeError:
    pass
  return ''.join(idx2char[ints])

print(int_to_text(text_as_int[:13]))

First Citizen


###Create Training Examples

Given a character, or a sequence of characters, what is the most probable next character? This is the task we are training the model to perform.

Our task is to feed the model a sequence and have it return the next character. This means we need to split our text data from above into many shorter sequences that we can pass to the model as training examples. 

The training examples we prepare use a *seq_length* sequence as input and a *seq_length* sequence as output, where the output is the input shifted one character to the right. For example:

```input: Hell | output: ello```
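
The shift-by-one pairing can be sketched in plain Python, without TensorFlow. The helper name `make_examples` and the toy string are assumptions made for illustration; the notebook itself builds the pairs with `tf.data`.

```python
def make_examples(text, seq_length):
    """Split text into (input, target) pairs of length seq_length.

    Each chunk holds seq_length + 1 characters; the target is the
    input shifted one character ahead. A trailing partial chunk is dropped.
    """
    chunk_size = seq_length + 1
    examples = []
    for start in range(0, len(text) - chunk_size + 1, chunk_size):
        chunk = text[start:start + chunk_size]
        examples.append((chunk[:-1], chunk[1:]))
    return examples


pairs = make_examples("Hello world", seq_length=4)
for input_text, target_text in pairs:
    print(repr(input_text), "->", repr(target_text))
```

With `seq_length=4` the eleven-character string yields two full chunks, and the leftover final character is discarded, much as `drop_remainder=True` does.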

In [9]:
# Create a stream of characters from our text data
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

# Use the batch method to turn this stream of characters into batches of desired length
seq_length = 100  # length of sequence for a training example
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
# Use these sequences of length 101 and split them into input and output.
def split_input_target(chunk):  # for the example: hello
    input_text = chunk[:-1]  # hell
    target_text = chunk[1:]  # ello
    return input_text, target_text  # hell, ello

dataset = sequences.map(split_input_target)  # we use map to apply the above function to every entry

In [10]:
for x, y in dataset.take(3):
  print("\n\nEXAMPLE\n")
  print("INPUT")
  print(int_to_text(x))
  print("\nOUTPUT")
  print(int_to_text(y))



EXAMPLE

INPUT
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You

OUTPUT
irst Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You 


EXAMPLE

INPUT
are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you 

OUTPUT
re all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you k


EXAMPLE

INPUT
now Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us k

OUTPUT
ow Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us ki


Make training batches.

In [11]:
BATCH_SIZE = 64
VOCAB_SIZE = len(vocab)  # vocab is number of unique characters
EMBEDDING_DIM = 256
RNN_UNITS = 1024

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

data = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
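
The buffer-based shuffling described in the comment above can be illustrated with a simplified sketch in plain Python. This is only an approximation of the idea, not TensorFlow's actual implementation; the function name `buffered_shuffle` is hypothetical, and it assumes `buffer_size` is at least 1.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

Because only `buffer_size` elements are held at once, an element can only be emitted once it has entered the buffer. A buffer larger than the dataset therefore gives a full shuffle, while a small buffer only mixes nearby elements.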

###Build the Model
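We use an embedding layer, a recurrent layer (a GRU here; an equivalent LSTM layer is left commented out in the code), and one dense layer that contains a node for each unique character in our training data.  The dense layer outputs one score (logit) per character, which can be turned into a probability distribution over the vocabulary.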

In [12]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    # tf.keras.layers.LSTM(rnn_units,
    #                     return_sequences=True,
    #                     stateful=True,
    #                     recurrent_initializer='glorot_uniform'),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),                        
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (64, None, 256)           16640     
                                                                 
 gru (GRU)                   (64, None, 1024)          3938304   
                                                                 
 dense (Dense)               (64, None, 65)            66625     
                                                                 
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________
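
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses the Keras default `reset_after=True`, which gives each of the three weight sets (update gate, reset gate, candidate state) two bias vectors; under that assumption the arithmetic matches the numbers printed above.

```python
vocab_size = 65
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-sized vector per character.
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets, each with an input kernel, a recurrent kernel,
# and two bias vectors (assumes reset_after=True).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: weights plus one bias per output character.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```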


###Create a Loss Function
Our model will output a (64, sequence_length, 65) shaped tensor that holds, for every sequence in the batch and every timestep, one unnormalized score (logit) per character. Applying softmax to these scores gives the probability distribution over the next character.
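
Because the final dense layer has no activation, its 65 outputs per timestep are raw scores, and softmax is what converts them into probabilities. Here is a minimal sketch in plain Python, using made-up scores for a three-character vocabulary:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.0]  # illustrative values, not model output
probs = softmax(toy_logits)
print([round(p, 3) for p in probs])
```

Larger logits map to larger probabilities, and the ordering of the characters is preserved.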

We will create the loss function.

Before we do that, let's have a look at a sample input and the output from the untrained model. 

In [13]:
for input_example_batch, target_example_batch in data.take(1):
  example_batch_predictions = model(input_example_batch)  # ask our model for a prediction on our first batch of training data (64 entries)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")  # print out the output shape

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [14]:
# we can see that the prediction is an array of 64 arrays, one for each entry in the batch
print(len(example_batch_predictions))
print(example_batch_predictions)

64
tf.Tensor(
[[[ 8.09970032e-03  6.13838434e-03 -2.38363072e-03 ...  1.46691210e-03
    7.20192678e-04  4.67452453e-03]
  [ 1.23554142e-03  1.27322841e-02  2.67284457e-03 ...  2.01242510e-03
   -4.22798982e-03 -1.06103823e-03]
  [-3.88011499e-03  1.67730395e-02 -1.95827754e-03 ... -9.32315737e-03
    3.42263468e-03 -3.96622624e-03]
  ...
  [-1.71232049e-03 -1.18810823e-02  6.50520157e-03 ...  3.91472783e-03
    5.92756830e-03 -1.13492040e-03]
  [ 5.86044509e-04  2.43921229e-03  5.86786447e-03 ...  4.49814368e-03
    1.59921544e-03  3.03998590e-04]
  [-9.37147066e-03 -1.19402912e-03  1.71975158e-02 ... -1.94741040e-03
   -5.59058972e-05  8.80604237e-03]]

 [[-1.40620470e-02  6.08991925e-03  4.92367242e-03 ... -3.65527114e-04
    1.73991208e-03  1.01860575e-02]
  [-1.35698952e-02  7.01499544e-03  1.98390372e-02 ... -6.37959456e-03
    4.81852936e-03  9.53868404e-03]
  [-8.55355151e-03  9.06031765e-03  1.38079906e-02 ... -3.15323332e-03
   -1.87813072e-03  8.18508305e-03]
  ...
  [-9.063

In [15]:
# Let's examine one prediction
pred = example_batch_predictions[0]
print(len(pred))
print(pred)
# notice this is a 2d array of length 100, where each interior array is the prediction for the next character at each time step

100
tf.Tensor(
[[ 8.0997003e-03  6.1383843e-03 -2.3836307e-03 ...  1.4669121e-03
   7.2019268e-04  4.6745245e-03]
 [ 1.2355414e-03  1.2732284e-02  2.6728446e-03 ...  2.0124251e-03
  -4.2279898e-03 -1.0610382e-03]
 [-3.8801150e-03  1.6773039e-02 -1.9582775e-03 ... -9.3231574e-03
   3.4226347e-03 -3.9662262e-03]
 ...
 [-1.7123205e-03 -1.1881082e-02  6.5052016e-03 ...  3.9147278e-03
   5.9275683e-03 -1.1349204e-03]
 [ 5.8604451e-04  2.4392123e-03  5.8678645e-03 ...  4.4981437e-03
   1.5992154e-03  3.0399859e-04]
 [-9.3714707e-03 -1.1940291e-03  1.7197516e-02 ... -1.9474104e-03
  -5.5905897e-05  8.8060424e-03]], shape=(100, 65), dtype=float32)


In [16]:
# and finally we look at a prediction at the first timestep
time_pred = pred[0]
print(len(time_pred))
print(time_pred)
# and its 65 values are the logits (unnormalized scores) for each character occurring next

65
tf.Tensor(
[ 8.0997003e-03  6.1383843e-03 -2.3836307e-03  8.2186554e-03
  1.1495135e-02 -2.6099742e-03  6.8107359e-03  3.5126626e-03
 -1.4112883e-02 -4.2046211e-03  5.3821774e-03 -2.8223924e-03
  8.2037058e-03 -4.2741410e-03  6.5277684e-03  1.2610498e-02
 -3.7535799e-03 -1.9975320e-02 -9.0720812e-03  8.9615834e-04
  2.0754093e-02 -1.1187958e-02 -3.5771986e-03 -4.4551301e-03
  1.2063186e-02 -3.9395876e-04 -1.7110255e-02 -6.2909843e-03
 -1.1072943e-02 -4.4826424e-04  8.8599790e-03 -7.7391379e-03
 -4.8631686e-03  1.7013170e-03 -1.3226453e-03 -2.2525289e-03
  1.4139873e-03  9.2694778e-03  1.4085907e-02 -4.6552117e-03
  5.0507062e-03 -6.2675080e-03  1.9518514e-03 -6.5165488e-03
  1.5347479e-03  2.2661393e-03 -1.5478481e-03  9.7501362e-03
  2.1289894e-03 -1.2230990e-02 -4.0325276e-03 -7.2775455e-03
 -1.5716385e-03  1.3202277e-03  1.7302515e-02 -4.2650620e-03
 -8.4259892e-03 -4.4300533e-03 -5.9992829e-03 -1.8494204e-05
  1.9781897e-03 -3.4665046e-03  1.4669121e-03  7.2019268e-04
  4.674524

In [17]:
# If we want to determine the predicted character we need to sample the output distribution (pick a value based on probability)
sampled_indices = tf.random.categorical(pred, num_samples=1)

# now we can reshape that array and convert the integers to characters to see the actual text
sampled_indices = np.reshape(sampled_indices, (1, -1))[0]
predicted_chars = int_to_text(sampled_indices)

predicted_chars  # and this is what the model predicted for training sequence 1

"nha!vPhOJ$!!'RIcnGLKDn!.Fd-QAqVhGtqzhqzauFpRDZZxehhVp,eZMgZ-!a&Wm,.WH,pnZguaqmnXYDVmRp:e:ZBXM tot.!H"

So, we need to create a loss function that can compare that output to the expected output and give us some numeric value representing how close the two were. 

In [18]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
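
To make the loss concrete, here is a rough sketch of what sparse categorical cross-entropy computes for a single timestep when given logits: the negative log of the softmax probability assigned to the correct character. The function name and toy values are assumptions for illustration; Keras applies the same idea across every timestep and batch entry.

```python
import math


def sparse_cross_entropy_from_logits(label, logits):
    """Cross-entropy for one timestep: -log(softmax(logits)[label])."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return log_sum_exp - logits[label]


toy_logits = [2.0, 1.0, 0.0]  # illustrative values, not model output
loss_best = sparse_cross_entropy_from_logits(0, toy_logits)   # correct class scored highest
loss_worst = sparse_cross_entropy_from_logits(2, toy_logits)  # correct class scored lowest
print(loss_best, loss_worst)

# With equal logits for all 65 characters the loss equals log(65).
uniform_loss = sparse_cross_entropy_from_logits(0, [0.0] * 65)
print(uniform_loss, math.log(65))
```

An untrained model produces logits close to zero, so its initial loss should be in the neighborhood of `log(65)`, which is a handy sanity check for the first epoch.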

###Compile the Model
We can think of our problem as a classification problem where the model predicts the probability of each unique letter coming next. 


In [19]:
model.compile(optimizer='adam', loss=loss)

###Create Checkpoints
Now we are going to set up and configure our model to save checkpoints as it trains. This will allow us to load our model from a checkpoint and continue training it.

In [20]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

###Train the Model
**If this is taking a while go to Runtime > Change Runtime Type and choose "GPU" under hardware accelerator.**

In [21]:
history = model.fit(data, epochs=12, callbacks=[checkpoint_callback])

Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12


###Load the Model
We'll rebuild the model from a checkpoint using a batch_size of 1 so that we can feed one piece of text to the model and have it make a prediction.

In [22]:
model = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, batch_size=1)

Once the model finishes training, we can find the **latest checkpoint** that stores the model's weights using the following line.

In [23]:
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

We can load **any checkpoint** we want by specifying the exact file to load.

In [24]:
# checkpoint_num = 10
# model.load_weights("./training_checkpoints/ckpt_" + str(checkpoint_num))
# model.build(tf.TensorShape([1, None]))

###Generate Text
Generate some text using any starting string we choose.

To generate text with this model, run it in a loop and keep track of the model's internal state as you go.  Each time you call the model you pass in some text and an internal state. The model returns a prediction for the next character and its new state. Pass the prediction and state back in to continue generating text. In the code below the recurrent layer is stateful, so Keras carries the state from one call to the next and `model.reset_states()` clears it before generation starts.

![alt text](https://www.tensorflow.org/text/tutorials/images/text_generation_sampling.png)


In [25]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 800

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
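
The effect of `temperature` can be seen in isolation with a small sketch in plain Python. The helper `sample_with_temperature` and the toy logits are assumptions for illustration; the function above does the equivalent with `tf.random.categorical` on the model's real logits.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)
    weights = [math.exp(value - largest) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.0]  # illustrative values, not model output
counts_by_temperature = {}
for temperature in (0.2, 1.0, 5.0):
    draws = [sample_with_temperature(toy_logits, temperature, rng) for _ in range(1000)]
    counts_by_temperature[temperature] = [draws.count(i) for i in range(len(toy_logits))]
    print(temperature, counts_by_temperature[temperature])
```

At a low temperature nearly every draw picks the highest-scoring index, while at a high temperature the draws spread out across all three indices.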

In [27]:
inp = input("Type a starting string: ")
print(generate_text(model, inp))

Type a starting string: are you not famished
are you not famished,
That you twice worship for every name and smile
TRANIO:
What, none of Mambert, his name boards vault
Shall be a rapher's slave. Camoriously I could I bear
Which 'longeth thyself, and thus I'll consented us:
Darthup on my graciou, lade; for he hath arrolp called sword
That I, that sweet desainted-steel'd without her:
Indeed, I rame, as we window upon yourself.
Where in this face the singled way?

Bes Romeo?

FRIAR JOHN:
Hath brought his friends with words, among myself;
I thank thee, for Cused, sir: and you, Montague is good, that away.

HENRY BOLINGBROKE:
My houses best.

LEONTES:
Nour enemies!
Now, in God's name, and ask grave-mourstaced fellows--
And beat them to yourse would set up speak
With all see hath named and me debasine from thine eyes; and if thou go, take me in
commonweaking 


##Sources

1. Chollet, François. Deep Learning with Python. Manning Publications Co., 2018.
2. “Text Generation with an RNN: TensorFlow Core.” TensorFlow, www.tensorflow.org/tutorials/text/text_generation.
3. “Understanding LSTM Networks.” Understanding LSTM Networks -- Colah's Blog, https://colah.github.io/posts/2015-08-Understanding-LSTMs/.