# Songwriting and Language Generation using TensorFlow
## Writing a sonnet in the style of William Shakespeare using an RNN

This project is based heavily upon the first lab exercise from MIT Deep Learning 6.S191, in which a Recurrent Neural Network (RNN) is used to generate synthetic music based upon Irish folk songs.

I have adapted the code to generate Shakespearean Sonnets rather than music, using the entirety of the sonnets of William Shakespeare to train the model.

For anyone else looking to learn more about Deep Learning I thoroughly recommend checking out the lectures and exercises from the aforementioned MIT open course at http://introtodeeplearning.com/

Sonnets sourced from http://www.shakespeares-sonnets.com/all.php

Source code used under the MIT License.
© MIT 6.S191: Introduction to Deep Learning

http://introtodeeplearning.com

In [2]:
import tensorflow as tf
import pandas as pd
import numpy as np
import re
import os
from tqdm import tqdm
from bs4 import BeautifulSoup
import requests

In [None]:
# Check that we are using a GPU, if not switch runtimes
#   using Runtime > Change Runtime Type > GPU
assert len(tf.config.list_physical_devices('GPU')) > 0

First, we need to obtain our input data - the complete collection of Shakespeare's sonnets.
We can scrape this information using the Beautiful Soup library.

In [3]:
url = 'http://www.shakespeares-sonnets.com/all.php'
page = requests.get(url)

# create an instance of the BeautifulSoup class, which will parse the html (content) from the requests response
soup = BeautifulSoup(page.content, 'html.parser')
alltext = soup.get_text()

In [4]:
# trim to retrieve only the sonnets from the webpage
alltext = alltext[alltext.index('All Sonnets'):alltext.index('Copyright')]
print(alltext)

All Sonnets
I.
From fairest creatures
we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And, tender churl, mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
II.
When
forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held: 
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days; 
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much mo

Now that we have the raw input data, we need to tidy it up a little. For example, each sonnet is labelled by its number in Roman numerals. We don't want to include these Roman numerals in the vocabulary when generating new sonnets, so we must remove them from the text.

In [5]:
# Raw string, so that \. and \n are passed to the regex engine unchanged
regex = r"[XICVL]+\.\n"
alltext = re.sub(regex, "", alltext)
print(alltext)

All Sonnets
From fairest creatures
we desire increase,
That thereby beauty's rose might never die,
But as the riper should by time decease,
His tender heir might bear his memory:
But thou contracted to thine own bright eyes,
Feed'st thy light's flame with self-substantial fuel,
Making a famine where abundance lies,
Thy self thy foe, to thy sweet self too cruel:
Thou that art now the world's fresh ornament,
And only herald to the gaudy spring,
Within thine own bud buriest thy content,
And, tender churl, mak'st waste in niggarding:
Pity the world, or else this glutton be,
To eat the world's due, by the grave and thee.
When
forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held: 
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days; 
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more prai
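
To make the effect of the pattern easier to see, here is a minimal sketch on a made-up four-line sample laid out like the scraped page. The names <code>sample</code>, <code>numeral_line</code> and <code>stripped</code> are illustrative and do not appear elsewhere in the notebook.

```python
import re

# Hypothetical sample in the same layout as the scraped page
sample = "I.\nFrom fairest creatures\nII.\nWhen forty winters\n"

# The pattern matches a run of numeral letters, a full stop and a newline
numeral_line = r"[XICVL]+\.\n"

stripped = re.sub(numeral_line, "", sample)
print(stripped)
```

Because the pattern is not anchored to the start of a line, it would also trim a verse line that happened to end in one of these capital letters followed by a full stop.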

Convert the text to lowercase and remove the newline characters, to allow us to count instances of the same word (regardless of their capitalisation).

In [6]:
alltext = alltext.replace("\n"," ").lower()
print(alltext)



Split the text by whitespace characters to generate a list of words.

In [7]:
alltext = alltext.split(' ')
print(alltext)
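
Splitting on a single space leaves an empty string wherever two spaces sit next to each other, which is why empty entries are filtered out later. A small sketch of the difference, using a made-up line (<code>line</code>, <code>by_space</code> and <code>by_whitespace</code> are illustrative names only):

```python
line = "from  fairest creatures "  # note the double space and the trailing space

# Splitting on a single space keeps an empty string for every extra space
by_space = line.split(" ")

# Splitting with no argument collapses runs of whitespace and trims the ends
by_whitespace = line.split()

print(by_space)
print(by_whitespace)
```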



A cursory glance at our body of text shows punctuation attached to many words, which would prevent us from creating a vocabulary of unique words.
For example, we don't want to distinguish between "joy" and "joy;", so we should remove punctuation and other stray characters from every word.

In [8]:
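# Characters to strip from every word (punctuation, quotes and whitespace debris)
char_to_remove = ["(", ")", "\"", "'", ":", ";", ",", ".", "!", "?", "“", "…", "<u+203d>", "\r", "\xa0", "-"]
# Deduplicate the words before cleaning; note that set() does not preserve word order
clean_text = list(set(alltext))
for character in char_to_remove:
    clean_text = [word.replace(character, "") for word in clean_text]
# Drop any words that are empty once the characters above have been removed
clean_text = [word for word in clean_text if word != ""]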
print(clean_text)
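
A sequence model learns from the order in which words appear, and <code>set()</code> discards that order. If the original word order is wanted for training, one possible variant is sketched below. It is only an illustration under assumptions: the sample list, the shortened character list and the names <code>words</code>, <code>chars_to_strip</code> and <code>ordered_words</code> are hypothetical and are not used elsewhere in the notebook.

```python
# Hypothetical sample; in the notebook this would be the list produced by split(' ')
words = ["joy;", "of", "", "joy", "(and", "delight)"]

# Build a translation table that deletes each unwanted character
chars_to_strip = "()\"':;,.!?-"
table = str.maketrans("", "", chars_to_strip)

# Clean every word in place, without set(), so duplicates and order survive
ordered_words = [w.translate(table) for w in words]
ordered_words = [w for w in ordered_words if w != ""]

print(ordered_words)
```

The vocabulary built in the next cell would be unaffected by this choice, since it is taken from the set of cleaned words either way.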



Now that we've cleaned up our input text, let's create a vocabulary of all the unique words in the text. This will be the set of words that our neural network will be able to draw from to create a new sonnet.

In [9]:
# Find all unique words in the cleaned text
vocab = sorted(set(clean_text))
print(vocab)
print("There are", len(vocab), "unique words in the sonnets")

There are 3196 unique words in the sonnets


We now create a mapping to represent each unique word in the vocabulary with its own integer value. We also create a reverse mapping to allow us to translate back from id numbers to words. This will allow the neural network to work with numerical representations of the words, rather than entire words themselves.

In [10]:
# Create a mapping of words to numbers
word2idx = {u:i for i, u in enumerate(vocab)}

# Reverse the mapping
idx2word = np.array(vocab)
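
As a quick illustration of how the two mappings work together, the sketch below builds them for a made-up three-word vocabulary and round-trips a short phrase. All names prefixed with <code>toy_</code> are hypothetical.

```python
import numpy as np

# Hypothetical three-word vocabulary, sorted as in the notebook
toy_vocab = sorted({"rose", "beauty", "time"})

toy_word2idx = {word: i for i, word in enumerate(toy_vocab)}
toy_idx2word = np.array(toy_vocab)

# Encode a phrase word by word
ids = np.array([toy_word2idx[w] for w in ["time", "rose", "time"]])

# Indexing the array with an array of ids decodes every id in one step
decoded = toy_idx2word[ids]

print(ids, decoded)
```

Storing the reverse mapping as a NumPy array rather than a dict is what allows a whole array of ids to be decoded with a single indexing operation.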

Using the mapping, we can convert sequences of words into vectors of integers, which will be used to train the model. This is a vital step, because the Embedding layer (the first layer of the neural network, which learns a dense vector representation for each word) requires integer ids as input. The vectorize_string function will also be useful at the end, when we will need to vectorize a seed phrase for the model to generate new text from.

In [11]:
# Function to vectorize a list of words into an array of integer ids
def vectorize_string(string):
  vectorized_words = [word2idx[word] for word in string]
  return np.array(vectorized_words)

To demonstrate this in action, let's see what a vectorized representation of the first 10 words of the input text would look like:

In [12]:
# Vectorize the whole cleaned text, so that the full sequence is available for training later
vectorized_words = vectorize_string(clean_text)
print(vectorized_words[:10])

[2604 1807  134 3135 2894 3073 2176  716 2116 2144]


To train the model, we need to break the training data into batches and feed them to the model one batch at a time.
Each training example is a sequence of <code>seq_length</code> time steps, each time step being one input word paired with one target word (the next word in the text). A batch is <code>batch_size</code> such sequences, each drawn from a random starting position and processed together in a single training step. Setting <code>seq_length</code> to a value greater than 1 means the network is unrolled over several time steps, and because we will be using LSTM units, which can retain and pass on information, the model will be able to learn to predict each word from the context of the preceding words in the sequence.

In [14]:
def get_batch(vectorized_words, seq_length, batch_size):
  # the number of words that are followed by another word (a valid target)
  n = vectorized_words.shape[0] - 1
  # randomly choose the starting indices for the examples in the training batch
  idx = np.random.choice(n-seq_length, batch_size)

  # construct a list of input sequences for the training batch
  input_batch = [vectorized_words[i:i+seq_length] for i in idx]
  # construct a list of output sequences for the training batch
  output_batch = [vectorized_words[i+1:i+1+seq_length] for i in idx]

  # x_batch, y_batch provide the true inputs and targets for network training
  x_batch = np.reshape(input_batch, [batch_size, seq_length])
  y_batch = np.reshape(output_batch, [batch_size, seq_length])

  return x_batch, y_batch

In [17]:
# Demonstrate the batching over the timesteps
x_batch, y_batch = get_batch(vectorized_words, seq_length=4, batch_size=1)
print(x_batch)
print(y_batch)
for i, (input_idx, target_idx) in enumerate(zip(np.squeeze(x_batch), np.squeeze(y_batch))):
    print("Step {:3d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2word[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2word[target_idx])))

[[1807  134 3135 2894]]
[[ 134 3135 2894 3073]]
Step   0
  input: 1807 ('none')
  expected output: 134 ('aprils')
Step   1
  input: 134 ('aprils')
  expected output: 3135 ('word')
Step   2
  input: 3135 ('word')
  expected output: 2894 ('unseen')
Step   3
  input: 2894 ('unseen')
  expected output: 3073 ('why')
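
The relationship between inputs and targets is easiest to check on data where the next id is known in advance. The sketch below assumes a hypothetical helper, <code>make_batch</code>, that mirrors <code>get_batch</code> but takes an explicit random generator so the result is repeatable; the ids 0 to 49 stand in for a vectorized text.

```python
import numpy as np

def make_batch(ids, seq_length, batch_size, rng):
    # Hypothetical stand-in for get_batch with an explicit random generator
    n = ids.shape[0] - 1
    starts = rng.choice(n - seq_length, batch_size)
    x = np.stack([ids[i:i + seq_length] for i in starts])
    y = np.stack([ids[i + 1:i + 1 + seq_length] for i in starts])
    return x, y

toy_ids = np.arange(50)  # word ids 0..49, so the next id is always id + 1
rng = np.random.default_rng(0)
x_toy, y_toy = make_batch(toy_ids, seq_length=4, batch_size=3, rng=rng)

print(x_toy)
print(y_toy)
```

Each row of the targets is the matching row of the inputs shifted along by one word, which is exactly the next-word prediction task the model is trained on.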


Next, we'll write a helper function that creates an LSTM layer.

In this notation, <code>rnn_units</code> refers to the number of LSTM units in the layer, i.e. the size of the hidden state that the layer carries from one time step to the next.

You can imagine each time step as having its own copy of the network, with the weights shared between copies, where each node in the hidden layer is an LSTM unit instead of a basic neuron.

In [18]:
def LSTM(rnn_units): 
  return tf.keras.layers.LSTM(
    rnn_units, 
    return_sequences=True, 
    recurrent_initializer='glorot_uniform',
    recurrent_activation='sigmoid',
    stateful=True,
  )

Now we can set out the structure of the network itself (i.e. the structure that is repeated at each of the <code>seq_length</code> time steps of the unrolled RNN), which will comprise:<br><br>
i) an Embedding layer, which takes the integer id of each word (e.g. <code>[213,32,4,492]</code> if <code>seq_length = 4</code>) and looks up a dense vector of dimension <code>embedding_dim</code> for each id. One word is passed in per time step, so the input <code>x_t</code> is a single integer id rather than a one-hot vector of length <code>vocab_size</code>. The lookup is equivalent to multiplying a one-hot vector by the embedding matrix, but because the embedding vectors are learned, the model can place related words close together, a notion of similarity that one-hot encoding alone does not provide.<br><br>
ii) an LSTM layer, containing <code>rnn_units</code> LSTM units (as explained in the previous cell).<br><br>
iii) an Output layer, with the number of nodes being equal to the vocabulary size (as this is a classification exercise: if we had a three-word vocabulary <code>['ONE','TWO','THREE']</code>, the word <code>'TWO'</code> would correspond to a target of <code>[0,1,0]</code>). This layer outputs one unnormalised score (logit) per word in the vocabulary.
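
The equivalence between an embedding lookup and a one-hot multiplication can be checked with plain NumPy. This is a sketch with toy sizes and a random matrix standing in for learned weights; it does not use the Keras layer itself, and every name in it is illustrative.

```python
import numpy as np

vocab_size, embedding_dim = 3, 4  # toy sizes, not the notebook's
rng = np.random.default_rng(0)

# Stands in for the learned embedding weights, one row per word
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_ids = np.array([1, 0, 2, 1])        # one integer id per time step
looked_up = embedding_matrix[word_ids]   # shape (4, embedding_dim)

one_hot = np.eye(vocab_size)[word_ids]   # shape (4, vocab_size)
via_matmul = one_hot @ embedding_matrix  # same result, with far more arithmetic

print(looked_up.shape, np.allclose(looked_up, via_matmul))
```

With a vocabulary of several thousand words, the lookup avoids building and multiplying large, mostly-zero vectors at every time step.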

In [19]:
### Defining the RNN Model ###

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    # Layer 1: Embedding layer to transform indices into dense vectors 
    #   of a fixed embedding size
    tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]),

    # Layer 2: LSTM with `rnn_units` number of units. 
    LSTM(rnn_units),

    # Layer 3: Dense (fully-connected) layer that transforms the LSTM output
    #   into the vocabulary size. 
    tf.keras.layers.Dense(vocab_size)
  ])

  return model

# Build a simple model with default hyperparameters. These are
#   set again in the hyperparameter cell before training.
model = build_model(len(vocab), embedding_dim=256, rnn_units=1024, batch_size=32)

In [21]:
print(len(vocab))
model.summary()

3196
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (32, None, 256)           818176    
_________________________________________________________________
lstm (LSTM)                  (32, None, 1024)          5246976   
_________________________________________________________________
dense (Dense)                (32, None, 3196)          3275900   
Total params: 9,341,052
Trainable params: 9,341,052
Non-trainable params: 0
_________________________________________________________________
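
The parameter counts in the summary can be reproduced by hand, which is a useful check on how each layer is sized. The sketch below uses the standard counts for an embedding table, an LSTM layer with one bias vector per gate, and a dense layer with bias.

```python
vocab_size, embedding_dim, rnn_units = 3196, 256, 1024

# One embedding vector per word in the vocabulary
embedding_params = vocab_size * embedding_dim

# Four gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * (rnn_units * (embedding_dim + rnn_units) + rnn_units)

# One weight per (LSTM unit, word) pair, plus one bias per word
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```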


In [None]:
x, y = get_batch(vectorized_words, seq_length=100, batch_size=32)
pred = model(x)
print("Input shape:      ", x.shape, " # (batch_size, sequence_length)")
print("Prediction shape: ", pred.shape, "# (batch_size, sequence_length, vocab_size)")

In [None]:
# obtain predictions from untrained model
sampled_indices = tf.random.categorical(pred[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices

In [None]:
# Decode predictions from untrained model, find they're a bit rubbish
print(x)
print("Input: \n", repr(" ".join(idx2word[x[0]])))
print()
print("Next Word Predictions: \n", repr(" ".join(idx2word[sampled_indices])))

In [None]:
### TRAINING THE MODEL: Part 1: Defining the loss function ###

# Define the loss function to compute and return the loss between the true labels and predictions (logits).
# from_logits=True is needed because the Dense output layer applies no softmax.
def compute_loss(labels, logits):
  loss = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
  return loss

# compute the loss using the true next words from the example batch
# and the predictions from the untrained model several cells above
example_batch_loss = compute_loss(y, pred)

print("Prediction shape: ", pred.shape, " # (batch_size, sequence_length, vocab_size)") 
print("scalar_loss:      ", example_batch_loss.numpy().mean())
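
To show what this loss measures, here is a minimal NumPy sketch of sparse categorical cross-entropy computed from logits. The function <code>sparse_crossentropy_from_logits</code> is a hypothetical re-implementation for illustration only, not the TensorFlow one, and the labels are made up.

```python
import numpy as np

def sparse_crossentropy_from_logits(labels, logits):
    # Hypothetical NumPy re-implementation, for illustration only
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each true label
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return -picked.squeeze(-1)

vocab_size = 3196
labels = np.array([[5, 17, 42]])               # (batch_size, seq_length)
uniform_logits = np.zeros((1, 3, vocab_size))  # every word equally likely
loss = sparse_crossentropy_from_logits(labels, uniform_logits)

print(loss.shape, loss.mean())
```

When every word is judged equally likely, the loss per word is the natural log of the vocabulary size, roughly 8.07 here, so an untrained model should score somewhere near that value.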

In [None]:
### Hyperparameter setting and optimization ###

# Optimization parameters:
num_training_iterations = 2000  # Increase this to train longer
batch_size = 4  # Experiment between 1 and 64
seq_length = 100  # Experiment between 50 and 500
learning_rate = 5e-3  # Experiment between 1e-5 and 1e-1

# Model parameters: 
vocab_size = len(vocab)
embedding_dim = 256 
rnn_units = 1024  # Experiment between 1 and 2048

# Checkpoint location: 
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "my_ckpt")

In [None]:
### Define optimizer and training operation ###

# Instantiate a new model for training using the `build_model`
# function and the hyperparameters created above.
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size)

# Instantiate the Adam optimizer with the chosen learning rate.
#   Check out the TensorFlow website for a list of supported optimizers:
#   https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/
optimizer = tf.keras.optimizers.Adam(learning_rate)


@tf.function
def train_step(x, y): 
  # Record the forward pass so that gradients can be computed afterwards
  with tf.GradientTape() as tape:

    # Feed the current input into the model and generate predictions
    y_hat = model(x)

    # compute the loss
    loss = compute_loss(y, y_hat)

  # Compute the gradients of the loss with respect to all of the
  #   model parameters, which are listed in `model.trainable_variables`
  grads = tape.gradient(loss, model.trainable_variables)

  # Apply the gradients using the optimizer so it can update the model accordingly
  optimizer.apply_gradients(zip(grads, model.trainable_variables))
  return loss

##################
# Begin training!#
##################

history = []
if hasattr(tqdm, '_instances'): tqdm._instances.clear() # clear if it exists

pbar = tqdm(range(num_training_iterations))
for iteration in pbar:

  # Grab a batch and propagate it through the network
  x_batch, y_batch = get_batch(vectorized_words, seq_length, batch_size)
  loss = train_step(x_batch, y_batch)

  # Update the progress bar
  history.append(loss.numpy().mean())
  pbar.set_description("loss: {}".format(loss.numpy().mean()))

  # Save a checkpoint of the current weights every 100 iterations
  if iteration % 100 == 0:
    model.save_weights(checkpoint_prefix)
    
    
# Save the final trained weights
model.save_weights(checkpoint_prefix)

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

# Restore the model weights for the last checkpoint after training
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))

model.summary()

In [None]:
def generate_text(model, start_string, generation_length=1000):
  # Evaluation step (generating text using the learned RNN model)

  # Convert the start string (a list of words) to numbers (vectorize)
  input_eval = vectorize_string(start_string)
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store our results
  text_generated = []

  # Here batch size == 1
  model.reset_states()
  if hasattr(tqdm, '_instances'): tqdm._instances.clear() # clear if it exists

  for i in tqdm(range(generation_length)):
      # evaluate the inputs and generate the next word predictions
      predictions = model(input_eval)
      
      # Remove the batch dimension
      predictions = tf.squeeze(predictions, 0)
      
      # use a multinomial distribution to sample
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
      
      # Pass the prediction along with the previous hidden state
      #   as the next inputs to the model
      input_eval = tf.expand_dims([predicted_id], 0)
      
      # add the predicted word to the generated text
      text_generated.append(idx2word[predicted_id])
    
  return ([start_string, text_generated])
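
The sampling step draws the next word at random in proportion to the model's scores, rather than always taking the highest-scoring word. The NumPy sketch below illustrates that idea on made-up logits for a three-word vocabulary; <code>sample_from_logits</code> is a hypothetical helper and not the TensorFlow call used above.

```python
import numpy as np

def sample_from_logits(logits, rng):
    # Softmax turns unnormalised scores into a probability distribution
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # Draw one index at random, weighted by those probabilities
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
toy_logits = np.array([2.0, 0.5, -1.0])  # hypothetical scores for 3 words
draws = [sample_from_logits(toy_logits, rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)

print(counts)
```

Sampling in this way lets lower-scoring words appear occasionally, which helps the generated text avoid repeating the single most likely word over and over.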

In [None]:
# Use the model and the function defined above to generate a sonnet-style text of 100 words
# Choose a word that appears in the vocabulary (lower case) to seed the generator
generated_text = generate_text(model, start_string=["shall"], generation_length=100)

In [None]:
print(generated_text)
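
Since the result is a pair of word lists, it can be easier to read once joined into lines of text. The sketch below assumes a hypothetical helper, <code>format_generated</code>, with an arbitrary <code>words_per_line</code> parameter; the example words are hand-written to show the layout and are not output from the model.

```python
def format_generated(result, words_per_line=8):
    # `result` mirrors the [start_string, text_generated] pair returned above
    start_words, generated_words = result
    words = list(start_words) + [str(w) for w in generated_words]
    # Group the words into fixed-size chunks, one chunk per line
    lines = [" ".join(words[i:i + words_per_line])
             for i in range(0, len(words), words_per_line)]
    return "\n".join(lines)

# Hand-written example in the same shape as the generator's return value
example = [["shall"], ["i", "compare", "thee", "to", "a", "summers", "day", "thou", "art"]]
formatted = format_generated(example, words_per_line=5)
print(formatted)
```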