## Creating the dataset

In [None]:
import tensorflow as tf

In [None]:
shakespeare_url = "https://homl.info/shakespeare"
filepath = tf.keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
  shakespeare_text = f.read()

Downloading data from https://homl.info/shakespeare


In [None]:
print(shakespeare_text[:80])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.


Use a TextVectorization layer to encode this text. Split it by character to get character-level encoding rather than the default word-level encoding.

In [None]:
text_vec_layer = tf.keras.layers.TextVectorization(split="character",
                                                   standardize="lower") # make lowercase
text_vec_layer.adapt([shakespeare_text])
encoded = text_vec_layer([shakespeare_text])[0]

In [None]:
encoded

<tf.Tensor: shape=(1115394,), dtype=int64, numpy=array([21,  7, 10, ..., 22, 28, 12])>

Each character is now mapped to an integer, starting at 2. The TextVectorization layer reserves the value 0 for padding tokens and 1 for unknown characters. We don't need either of these tokens for now, so let's subtract 2 from the character IDs and compute the number of distinct characters and the total number of characters:

In [None]:
encoded -= 2 # drop tokens 0 (pad) and 1 (unknown), which we will not use for text generation
n_tokens = text_vec_layer.vocabulary_size() - 2 # num of distinct chars = 39
dataset_size = len(encoded) # total num of chars = 1,115,394
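To make the ID scheme concrete, here is a plain-Python sketch of the same idea on a short string. It only mimics the layer's behavior: the names `sample_text`, `vocab`, and `char_to_id` are illustrative, and ordering the characters by frequency (with arbitrary tie-breaking) is an assumption that may not match the exact order TextVectorization produces.

```python
from collections import Counter

sample_text = "To be or not to be".lower()  # mimic standardize="lower"

# Index 0 is reserved for padding and index 1 for unknown characters.
# Assumption: remaining characters are ordered from most to least frequent.
counts = Counter(sample_text)
vocab = ["", "[UNK]"] + [ch for ch, _ in counts.most_common()]
char_to_id = {ch: i for i, ch in enumerate(vocab)}

# Encode each character; anything outside the vocabulary would map to 1.
encoded_sample = [char_to_id.get(ch, 1) for ch in sample_text]

# Shift the IDs down by 2 so that they start at 0, as done above.
shifted = [i - 2 for i in encoded_sample]
n_distinct = len(vocab) - 2

print(min(encoded_sample), min(shifted), n_distinct)
```

Here the smallest raw ID is 2 and the smallest shifted ID is 0, which is why subtracting 2 lets the IDs index directly into an embedding or softmax layer of size `n_distinct`.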

As in many other NLP tasks, we can turn this very long sequence into a dataset of windows that we can then use to train a sequence-to-sequence RNN.
* The targets will be similar to the inputs, but shifted by one time step into the future.
* For example, one sample in the dataset may be a sequence of character IDs representing the text "to be or not to b" (without the final e), and the corresponding target is a sequence of character IDs representing the text "o be or not to be" (with the final e but without the leading t).
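The shift can be illustrated directly on the string from the example above, before any encoding is involved. This is only a sketch on raw characters; the variable names are illustrative.

```python
window = "to be or not to be"  # one window of length + 1 characters

# Inputs drop the last character; targets drop the first one.
inputs, targets = window[:-1], window[1:]

print(repr(inputs))   # 'to be or not to b'
print(repr(targets))  # 'o be or not to be'
```

At every position, the target character is the character that follows the corresponding input character.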

Write a small utility function to convert a long sequence of character IDs into a dataset of input/target window pairs:


In [None]:
def to_dataset(sequence, length, shuffle=False, seed=None, batch_size=32):
  ds = tf.data.Dataset.from_tensor_slices(sequence)
  ds = ds.window(length + 1, shift=1, drop_remainder=True)
  ds = ds.flat_map(lambda window_ds: window_ds.batch(length + 1))
  if shuffle:
    ds = ds.shuffle(buffer_size=100_000, seed=seed)
  ds = ds.batch(batch_size)
  return ds.map(lambda window: (window[:, :-1], window[:, 1:])).prefetch(1)

How this function works:
* It takes a sequence as input (the encoded text) and creates a dataset containing all the windows of the desired length.
* It increases the window length by one, since we need the next character for the target.
* It then shuffles the windows (optionally), batches them, splits them into input/target pairs, and activates prefetching.
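The windowing step can be mimicked in plain Python to check how many windows a sequence yields. This is a simplified stand-in, not the tf.data pipeline itself: `windows_sketch` is a hypothetical helper that skips shuffling, batching, and prefetching.

```python
def windows_sketch(sequence, length):
    """Plain-Python stand-in for the windowing in to_dataset (shift=1, remainder dropped)."""
    pairs = []
    # Each window needs length + 1 items, so there are len(sequence) - length windows.
    for start in range(len(sequence) - length):
        window = sequence[start:start + length + 1]
        pairs.append((window[:-1], window[1:]))  # (input, target)
    return pairs

pairs = windows_sketch(list(range(10)), length=4)
print(len(pairs))  # 6
print(pairs[0])    # ([0, 1, 2, 3], [1, 2, 3, 4])
print(pairs[-1])   # ([5, 6, 7, 8], [6, 7, 8, 9])
```

By the same count, a slice of 1,000,000 characters with a length of 100 yields 999,900 overlapping windows.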

### Split the dataset
Use roughly 90% of the text for training, 5% for validation, and 5% for testing:

In [None]:
length = 100

tf.random.set_seed(42)

train_set = to_dataset(encoded[:1_000_000], length=length, shuffle=True, seed=42)

valid_set = to_dataset(encoded[1_000_000:1_060_000], length=length)

test_set = to_dataset(encoded[1_060_000:], length=length)
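A quick arithmetic check of the slice boundaries used above shows how close the split is to 90/5/5. The total of 1,115,394 characters is the dataset size reported earlier; the dictionary names are illustrative.

```python
dataset_size = 1_115_394  # total number of characters in the encoded text

# Slice boundaries taken from the split above.
boundaries = {
    "train": (0, 1_000_000),
    "valid": (1_000_000, 1_060_000),
    "test": (1_060_000, dataset_size),
}

fractions = {name: (end - start) / dataset_size
             for name, (start, end) in boundaries.items()}

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
# train: 89.7%
# valid: 5.4%
# test: 5.0%
```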

### Create the model

Since the dataset is reasonably large, we need more than a simple RNN. Let's build and train a model with one GRU layer composed of 128 units.

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=n_tokens, output_dim=16),
    tf.keras.layers.GRU(128, return_sequences=True),
    tf.keras.layers.Dense(n_tokens, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])

# Note: recent Keras versions (Keras 3) require this path to end in ".keras" or ".h5".
model_ckpt = tf.keras.callbacks.ModelCheckpoint('my_shakespeare_model',
                                                monitor="val_accuracy",
                                                save_best_only=True)

history = model.fit(train_set,
                    validation_data=valid_set,
                    epochs=10,
                    callbacks=[model_ckpt])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
shakespeare_model = tf.keras.Sequential([
    text_vec_layer,
    tf.keras.layers.Lambda(lambda X: X - 2),
    model,
])

In [None]:
y_proba = shakespeare_model.predict(["To be or not to b"])[0, -1]
y_pred = tf.argmax(y_proba) # choose most probable character ID
text_vec_layer.get_vocabulary()[y_pred + 2]



'e'

The model predicted the next character correctly. Let's use it to pretend we're Shakespeare.

To generate new text using the char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it to the end of the text, then give the extended text to the model to guess the next letter, and so on.

This is called greedy decoding. In practice, however, it often leads to the same words being repeated over and over.

Instead, we can sample the next character randomly, with a probability equal to the estimated probability, using TensorFlow's tf.random.categorical() function. This generates more diverse and interesting text. The categorical() function samples random class indices, given the class log probabilities (logits).

In [None]:
log_probas = tf.math.log([[0.5, 0.4, 0.1]]) # probas 50%, 40%, 10%
tf.random.set_seed(42)
tf.random.categorical(log_probas, num_samples=8) # draw 8 samples

<tf.Tensor: shape=(1, 8), dtype=int64, numpy=array([[0, 0, 1, 1, 1, 0, 0, 0]])>
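To see that this kind of sampling follows the given probabilities, here is a NumPy stand-in that draws many samples and compares the empirical frequencies with the targets. It is not tf.random.categorical(): NumPy's `choice` takes probabilities directly rather than logits, and the sample count is an arbitrary choice.

```python
import numpy as np

probas = np.array([0.5, 0.4, 0.1])  # same probabilities as above
rng = np.random.default_rng(42)

# Draw many class indices, each with its own probability.
samples = rng.choice(len(probas), size=10_000, p=probas)

# The fraction of draws per class should be close to probas.
frequencies = np.bincount(samples, minlength=len(probas)) / len(samples)
print(frequencies)
```

Unlike greedy decoding, class 1 and class 2 are still picked regularly, just less often than class 0.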

To have more control over the text, we can divide the logits by a number called the temperature, which we can tweak as we wish.

A temperature close to 0 favors high-probability characters, while a high temperature gives all characters a nearly equal probability.

* Lower temperatures are typically preferred when generating fairly rigid and precise text, such as math equations.
* Higher temperatures are preferred when generating more diverse and creative text.
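The effect of the temperature can be checked numerically on the three-class example used earlier. The sketch below uses NumPy; `apply_temperature` is a hypothetical helper that divides the log probabilities by the temperature and then renormalizes them with a softmax.

```python
import numpy as np

def apply_temperature(probas, temperature):
    """Rescale a probability vector by dividing its log probabilities by temperature."""
    logits = np.log(np.asarray(probas, dtype=float)) / temperature
    logits -= logits.max()        # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()        # softmax turns the logits back into probabilities

probas = [0.5, 0.4, 0.1]
for temperature in (0.01, 1, 100):
    print(temperature, apply_temperature(probas, temperature).round(3))
```

With these numbers, a temperature of 0.01 puts nearly all of the probability on the first class, a temperature of 1 leaves the distribution unchanged, and a temperature of 100 gives each class roughly one third.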

The next_char function uses this approach to pick the next character to add to the input text:

In [None]:
def next_char(text, temperature=1):
  y_proba = shakespeare_model.predict([text])[0, -1:]
  rescaled_logits = tf.math.log(y_proba) / temperature
  char_id = tf.random.categorical(rescaled_logits, num_samples=1)[0, 0]
  return text_vec_layer.get_vocabulary()[char_id + 2]

Write a helper function that repeatedly calls next_char to get the next character and appends it to the given text:

In [None]:
def extend_text(text, n_chars=50, temperature=1):
  for _ in range(n_chars):
    text += next_char(text, temperature)
  return text

In [None]:
tf.random.set_seed(42)
print(extend_text("to be or not to be", temperature=0.01))

print(extend_text("to be or not to be", temperature=1))

print(extend_text("to be or not to be", temperature=100))


to be or not to be a shame,
and the duke is not the duke is not the 
to be or not to begin obs
do i cannot be a shop father, it is
resolv
to be or not to bepevicm-vilv!?$mz?gmjz :3?ljb'va;!td&
i.ur3l'-j!3eu


Shakespeare seems to be suffering from a heatwave in that last text. To generate more convincing text, a common technique is to sample only from the top k characters, or only from the smallest set of top characters whose total probability exceeds some threshold (this is called nucleus sampling).
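Both filters can be sketched on a plain probability vector with NumPy. This is an illustration under assumptions, not part of the model code above: `top_k_filter` and `nucleus_filter` are hypothetical helpers, and the nucleus version keeps entries until the cumulative probability reaches at least the threshold.

```python
import numpy as np

def top_k_filter(probas, k):
    """Keep the k most probable entries, zero out the rest, and renormalize."""
    probas = np.asarray(probas, dtype=float)
    keep = np.argsort(probas)[::-1][:k]
    filtered = np.zeros_like(probas)
    filtered[keep] = probas[keep]
    return filtered / filtered.sum()

def nucleus_filter(probas, threshold):
    """Keep the smallest set of top entries whose total probability reaches threshold."""
    probas = np.asarray(probas, dtype=float)
    order = np.argsort(probas)[::-1]          # indices from most to least probable
    cumulative = np.cumsum(probas[order])
    # Number of top entries needed for the running total to reach the threshold.
    cutoff = int(np.searchsorted(cumulative, threshold)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probas)
    filtered[keep] = probas[keep]
    return filtered / filtered.sum()

probas = [0.5, 0.3, 0.15, 0.05]
top2 = top_k_filter(probas, k=2)
nucleus = nucleus_filter(probas, threshold=0.9)
print(top2)
print(nucleus)

# Sample from the filtered distribution instead of the full one.
rng = np.random.default_rng(42)
char_id = rng.choice(len(probas), p=nucleus)
```

In a function like next_char, such a filter would be applied to the predicted probabilities before sampling, so that very unlikely characters can never be drawn.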

Alternatively, you could try using beam search, or using more GRU layers and more neurons per layer, training for longer, and adding some regularization if needed.

Also note that the model is incapable of learning patterns longer than `length`, which is just 100 characters. You could try making this window larger, but that would also make training harder, and even GRU and LSTM cells cannot handle very long sequences.

An alternative is to use a stateful RNN.

## Stateful RNN

Until now, we have only used stateless RNNs: at each training iteration, the model starts with a hidden state full of zeros, then it updates this state at each time step, and after the last time step it throws the state away, as it is no longer needed.

What if we instructed the RNN to preserve this final state after processing a training batch and use it as the initial state for the next training batch? This way, the model could learn long-term patterns despite only backpropagating through short sequences. This is called a **stateful RNN**.
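The difference between the two modes can be illustrated with a toy recurrent cell in NumPy. This is only a sketch of the forward pass under assumptions: the weights are random and untrained, `run_chunk` is a hypothetical helper, and nothing here uses the Keras API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_units = 3, 4
W_x = 0.3 * rng.normal(size=(n_features, n_units))  # input weights (random, untrained)
W_h = 0.3 * rng.normal(size=(n_units, n_units))     # recurrent weights (random, untrained)

def run_chunk(X, h):
    """Run a simple tanh RNN cell over the time steps of X, starting from state h."""
    for x_t in X:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

# Three consecutive chunks of one long sequence, 5 time steps each.
chunks = [rng.normal(size=(5, n_features)) for _ in range(3)]

# Stateless: every chunk starts from a state full of zeros.
stateless_finals = [run_chunk(X, np.zeros(n_units)) for X in chunks]

# Stateful: the final state of one chunk is the initial state of the next.
h = np.zeros(n_units)
stateful_finals = []
for X in chunks:
    h = run_chunk(X, h)
    stateful_finals.append(h)

# Carrying the state is equivalent to processing the whole sequence in one pass.
full_pass = run_chunk(np.concatenate(chunks), np.zeros(n_units))
print(np.allclose(stateful_finals[-1], full_pass))  # True
```

The equivalence only holds because the chunks are consecutive pieces of the same sequence, which is why a stateful RNN needs its batches to follow one another in order.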

