## RNN Shakespeare

The goal is to train a model that can predict the next character, given a sequence of characters from Shakespeare's work, and then use this model to generate Shakespeare-like text, one character at a time. This can be accomplished by giving the model the start of a text and asking it to predict the next character, then appending this character to the input sequence and asking the model for the next one, and so on.

### Setup

First, we will import libraries and define constants and functions that will help us during the examples in this notebook.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Is this notebook running on Colab or Kaggle?
IS_COLAB = "google.colab" in sys.modules
IS_KAGGLE = "kaggle_secrets" in sys.modules

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. LSTMs and CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")
    if IS_KAGGLE:
        print("Go to Settings > Accelerator and select GPU.")

# Common imports
import numpy as np
import os
from pathlib import Path

# to make this notebook's output stable across runs
my_seed = 42
np.random.seed(my_seed)
tf.random.set_seed(my_seed)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "nlp"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

No GPU was detected. LSTMs and CNNs can be very slow without a GPU.


To better understand how TensorFlow processes data, we will process the sequence 0 to 14 in the following way:

* Process each element of the sequence individually: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
* Split the data into windows of length 5 with a shift of 2:

       [0, 1, 2, 3, 4]

       [2, 3, 4, 5, 6]

       [4, 5, 6, 7, 8]

       [6, 7, 8, 9, 10]

       [8, 9, 10, 11, 12]

       [10, 11, 12, 13, 14]

* Create a 1D sequence of windows: [[0, 1, 2, 3, 4], [2, 3, 4, 5, 6], [4, 5, 6, 7, 8], [6, 7, 8, 9, 10], [8, 9, 10, 11, 12], [10, 11, 12, 13, 14]]

* Shuffle the windows and split each one into inputs (all but the last element) and targets (all but the first element). Without shuffling, we would see something like this:

       [([0, 1, 2, 3], [1, 2, 3, 4]),
    
        ([2, 3, 4, 5],  [3, 4, 5, 6]),
    
        ([4, 5, 6, 7],  [5, 6, 7, 8]),
    
        ([6, 7, 8, 9],  [7, 8, 9, 10]),
    
        ([8, 9, 10, 11],[9, 10, 11, 12]),
    
        ([10, 11, 12, 13],[11, 12, 13, 14])]
    
* Finally, create batches of size 3 (shown here without shuffling):

#### Batch 1

    [([0, 1, 2, 3], [1, 2, 3, 4]),
    
    ([2, 3, 4, 5],  [3, 4, 5, 6]),
    
    ([4, 5, 6, 7],  [5, 6, 7, 8]),
    
#### Batch 2
    
    ([6, 7, 8, 9],  [7, 8, 9, 10]),
    
    ([8, 9, 10, 11],[9, 10, 11, 12]),
    
    ([10, 11, 12, 13],[11, 12, 13, 14])]



In [2]:
np.random.seed(42)
tf.random.set_seed(42)

n_steps = 5
dataset = tf.data.Dataset.from_tensor_slices(tf.range(15))
dataset = dataset.window(n_steps, shift=2, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(n_steps))
dataset = dataset.shuffle(10).map(lambda window: (window[:-1], window[1:]))
dataset = dataset.batch(3).prefetch(1)
for index, (X_batch, Y_batch) in enumerate(dataset):
    print("_" * 20, "Batch", index, "\nX_batch")
    print(X_batch.numpy())
    print("=" * 5, "\nY_batch")
    print(Y_batch.numpy())

____________________ Batch 0 
X_batch
[[6 7 8 9]
 [2 3 4 5]
 [4 5 6 7]]
===== 
Y_batch
[[ 7  8  9 10]
 [ 3  4  5  6]
 [ 5  6  7  8]]
____________________ Batch 1 
X_batch
[[ 0  1  2  3]
 [ 8  9 10 11]
 [10 11 12 13]]
===== 
Y_batch
[[ 1  2  3  4]
 [ 9 10 11 12]
 [11 12 13 14]]




### Understanding the Data

First, we download Shakespeare's work:

In [3]:
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()

In [4]:
print(shakespeare_text[:148])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?



We can get the distinct characters used.

In [5]:
char_set = "".join(sorted(set(shakespeare_text.lower())))
print(char_set)
print("Number of chars: ", len(char_set))


 !$&',-.3:;?abcdefghijklmnopqrstuvwxyz
Number of chars:  39


The Keras `Tokenizer` class encodes every character as an integer: it finds all the characters used in the text and maps each of them to a different ID, from 1 to the number of distinct characters.

In [6]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)

We use `char_level=True` to get character-level encoding, rather than the default word-level encoding. The tokenizer converts all the text to lowercase by default.
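
The mapping built by the tokenizer can be approximated in plain Python. The following is a minimal sketch, not Keras's implementation: it assumes that IDs start at 1 and are assigned in order of decreasing character frequency, and the names `build_char_index` and `encode` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter

def build_char_index(text):
    """Map each distinct lowercase character to an integer ID starting at 1.

    Assumption: more frequent characters receive smaller IDs.
    """
    counts = Counter(text.lower())
    # most_common() lists (char, count) pairs by decreasing count
    return {char: i for i, (char, _) in enumerate(counts.most_common(), start=1)}

def encode(text, char_index):
    """Convert a string into a list of character IDs."""
    return [char_index[char] for char in text.lower()]

toy_text = "To be, or not to be"
toy_index = build_char_index(toy_text)
toy_encoded = encode("to be", toy_index)

# Invert the mapping to decode the IDs back into characters
toy_reverse_index = {i: char for char, i in toy_index.items()}
toy_decoded = "".join(toy_reverse_index[i] for i in toy_encoded)

print(toy_index)
print(toy_encoded, "->", repr(toy_decoded))
```

As with the real tokenizer, the input is lowercased first, so decoding returns lowercase text.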

In [7]:
tokenizer.texts_to_sequences(["Cheeto"])

[[19, 7, 2, 2, 3, 4]]

In [8]:
tokenizer.sequences_to_texts([[19, 7, 2, 2, 3, 4]])

['c h e e t o']

From the tokenizer, we can get the number of distinct characters, as well as the total number of characters in the text. Note that the number of distinct characters matches the previous result.

In [9]:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters

print("max_id = ", max_id)
print("dataset_size = ", dataset_size)

max_id =  39
dataset_size =  1115394


Now that we have the tokenizer, we proceed to encode the whole text. Subtracting 1 shifts the IDs so that they range from 0 to 38 instead of 1 to 39.

In [10]:
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1

Then we get the training set. It is important to avoid overlap between the training, validation, and test sets. When dealing with time series, you would in general split across time; for example, from a start date to an end date. It is often safer to split across time, but this assumes that the patterns the RNN can learn in the past will still exist in the future.

In short, splitting a time series into a training set, validation set, and test set is not a trivial task. 

For this example we will take the first 90% of the text for the training set.

In [11]:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
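
The last 10% of the text is left out of the training set. As a sketch only, one way to carve non-overlapping validation and test sets out of it is to split by position. The 90/5/5 proportions and the helper name `split_by_position` are assumptions for illustration, not something this notebook prescribes.

```python
def split_by_position(sequence, train_pct=90, valid_pct=5):
    """Split a sequence into contiguous train/validation/test parts.

    The parts do not overlap and keep the original order, so the
    validation and test parts always come after the training part.
    """
    n = len(sequence)
    train_end = n * train_pct // 100
    valid_end = train_end + n * valid_pct // 100
    return sequence[:train_end], sequence[train_end:valid_end], sequence[valid_end:]

toy_sequence = list(range(1000))  # stand-in for the encoded text
train_part, valid_part, test_part = split_by_position(toy_sequence)
print(len(train_part), len(valid_part), len(test_part))
```

Windows would then be built separately inside each part, so that no window spans two sets.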

Currently, the training set is a single sequence of over a million characters, so we can't feed it directly to the RNN. Instead, we split the data into windows.

In [12]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

By default, the `window` method creates non-overlapping windows, but to get the largest possible training set, we use `shift=1`. So the first window contains characters 0 to 100, the second window contains characters 1 to 101, and so on.
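
The effect of `shift=1` can be checked on plain Python lists. This sketch mimics, but does not call, `Dataset.window` with `drop_remainder=True`; `sliding_windows` is a hypothetical helper used only for illustration.

```python
def sliding_windows(sequence, window_length, shift=1):
    """Return all full windows of `window_length` items, `shift` items apart.

    Incomplete windows at the end are dropped, as with drop_remainder=True.
    """
    last_start = len(sequence) - window_length
    return [sequence[start:start + window_length]
            for start in range(0, last_start + 1, shift)]

toy_chars = list(range(105))  # stand-in for the encoded characters
toy_windows = sliding_windows(toy_chars, window_length=101, shift=1)

print(len(toy_windows))                        # number of full windows
print(toy_windows[0][0], toy_windows[0][-1])   # first and last item of window 0
print(toy_windows[1][0], toy_windows[1][-1])   # first and last item of window 1
```

With a shift of 1, consecutive windows share all but one item, which is why the number of windows is close to the number of characters.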

In [13]:
dataset = dataset.flat_map(lambda window: window.batch(window_length))

The `window` method creates a nested dataset (a dataset of datasets), but we can't use this for training, since the model expects tensors. So we use the `flat_map` method, which converts the nested dataset into a flat dataset; by calling the `batch` method on each window, we get a flat dataset of tensors of size `window_length`.

In [14]:
np.random.seed(my_seed)
tf.random.set_seed(my_seed)

Now, we shuffle the windows and batch the windows. Then, we can separate the targets from the inputs.

In [15]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))

When training a model, categorical features should be encoded, usually with one-hot encoding. Here there are only 39 distinct characters, so the one-hot vectors stay small.

In [16]:
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
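
For intuition, one-hot encoding turns each character ID into a vector with a single 1 at the index given by the ID. The sketch below uses NumPy rather than `tf.one_hot`, with a toy depth of 5 instead of `max_id`; the variable names are illustrative only.

```python
import numpy as np

toy_depth = 5                        # toy number of distinct characters
toy_batch = np.array([[0, 2, 4],     # 2 sequences of 3 character IDs each
                      [1, 1, 3]])

# Row i of the identity matrix is the one-hot vector for ID i;
# indexing with an integer array builds one such vector per ID.
toy_one_hot = np.eye(toy_depth, dtype=np.float32)[toy_batch]

print(toy_one_hot.shape)   # (batch size, sequence length, depth)
print(toy_one_hot[0, 1])   # one-hot vector for ID 2
```

Only the inputs are one-hot encoded in the notebook; the targets stay as integer IDs, which is what the `sparse_categorical_crossentropy` loss expects.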

Finally, we just need to add prefetching, which does its best to always stay one batch ahead: while the training algorithm is working on one batch, the dataset is already preparing the next batch in parallel.

In [17]:
dataset = dataset.prefetch(1)

In [18]:
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)

(32, 100, 39) (32, 100)


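To predict the next character based on the previous 100 characters, we can use an RNN with 2 GRU layers of 128 units each and 20% dropout on both the inputs (`dropout`) and the hidden states (`recurrent_dropout`). In the code below, `recurrent_dropout` is commented out, so only the input dropout is applied.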

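The last Dense layer must have 39 units (`max_id`), because there are 39 distinct characters, and we want to output a probability for each possible character. The probabilities must sum to 1 at each time step, so the layer uses the softmax activation.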

In [19]:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     #dropout=0.2, recurrent_dropout=0.2),
                     dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     #dropout=0.2, recurrent_dropout=0.2),
                     dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history = model.fit(dataset, epochs=10)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


The `preprocess` function applies the same tokenizer and one-hot encoding to new input text, so that it can be fed to the model for prediction.

In [26]:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)

Let's try a small example that consists of predicting the next character in a string. The result is the character with the highest probability.

In [27]:
X_new = preprocess(["How are yo"])
Y_pred = np.argmax(model(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence, last char

'u'

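Now, let's use the `tf.random.categorical` function to sample random class indices according to given class log-probabilities (logits). In the toy example below, classes 0, 1, and 2 have probabilities 0.5, 0.4, and 0.1.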

In [22]:
tf.random.set_seed(my_seed)
tf.random.categorical([[np.log(0.5), np.log(0.4), np.log(0.1)]], num_samples=40).numpy()

array([[0, 1, 0, 2, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 2, 1, 0, 2, 1,
        0, 1, 2, 1, 1, 1, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 2]])
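
Since `tf.random.categorical` expects logits, the call above should draw class 0 about 50% of the time, class 1 about 40%, and class 2 about 10%. The NumPy sketch below illustrates the same sampling idea with a larger number of draws. It uses `numpy.random.Generator.choice` as a stand-in, so it does not reproduce TensorFlow's random stream, and the variable names are illustrative.

```python
import numpy as np

toy_logits = np.log([0.5, 0.4, 0.1])   # same log-probabilities as above
# Softmax turns logits back into probabilities that sum to 1
toy_probas = np.exp(toy_logits) / np.exp(toy_logits).sum()

rng = np.random.default_rng(42)
toy_samples = rng.choice(len(toy_probas), size=10_000, p=toy_probas)

# Fraction of draws that landed on each class
toy_freqs = np.bincount(toy_samples, minlength=3) / len(toy_samples)
print(toy_freqs)
```

With only 40 samples, as in the cell above, the observed frequencies can deviate noticeably from the underlying probabilities.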

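The `temperature` parameter controls how random the sampling is: the logits are divided by the temperature before sampling, so a temperature close to 0 favors the most likely characters, while a high temperature gives all characters nearly the same probability.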

In [23]:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]
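
To see what the temperature does, the sketch below rescales the toy distribution used earlier (0.5, 0.4, 0.1) with NumPy. It follows the arithmetic of `next_char`: take the log, divide by the temperature, then apply a softmax, which is the normalization that sampling from logits implies. `apply_temperature` is a hypothetical helper introduced for illustration.

```python
import numpy as np

def apply_temperature(probas, temperature):
    """Return the distribution obtained by dividing log-probabilities by `temperature`."""
    rescaled_logits = np.log(probas) / temperature
    rescaled_logits = rescaled_logits - rescaled_logits.max()  # numerical stability
    exp_logits = np.exp(rescaled_logits)
    return exp_logits / exp_logits.sum()                        # softmax

toy_probas = np.array([0.5, 0.4, 0.1])
for temperature in (0.2, 1.0, 5.0):
    print(temperature, apply_temperature(toy_probas, temperature).round(3))
```

A temperature of 1 leaves the distribution unchanged, a low temperature concentrates the probability on the most likely class, and a high temperature flattens the distribution.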

The `complete_text` function generates the next `n_chars` characters (100 by default) from a given text by calling `next_char` repeatedly and appending each predicted character to the text.

In [42]:
def complete_text(text, n_chars=100, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text

In [43]:
tf.random.set_seed(my_seed)
print(complete_text("t", temperature=0.2))

t the maid of me;
the more that she may contributors and so good for me as free
for the maid and be t


In [44]:
print(complete_text("t", temperature=1))

ti no music, i putt you with
these mettle
shom his being unto the chrsem ypors
and seek
to mine own c


In [45]:
print(complete_text("hello", temperature=0.1))

hellow the men and see
that she may contrace to her father is a fitther for my hands.

gremio:
no, sir, i


In [46]:
print(complete_text("what", temperature=0.1))

what i will not have the head?

petruchio:
sir, i shall not be so so stand the men and see the state
and


In [48]:
print(complete_text("hi", temperature=1))

hio:
we would; or my mind hands; my father ever in my pethel? or to be not rap enough,
ded you at his 


In [50]:
print(complete_text("", temperature=1))

do:
her father comes your elders lucentio.

duke vincentio:
sir, gold the swelt and naw peruses a prov
