<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.
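
One way to shape the goal above is a small function that takes a character count and a next-character sampler. The sketch below is a hypothetical interface, not the assignment's required solution: `generate` and `uniform_sampler` are illustrative names, and the uniform sampler merely stands in for a trained model's prediction step.

```python
import random

def generate(sample_next, seed, n_chars):
    """Return `seed` followed by `n_chars` sampled characters.

    `sample_next` is any callable mapping the text so far to one new
    character; with a trained RNN it would wrap the model's prediction.
    """
    out = list(seed)
    for _ in range(n_chars):
        out.append(sample_next("".join(out)))
    return "".join(out)

def uniform_sampler(alphabet, rng):
    """Stand-in for a model: ignores the context and picks uniformly."""
    return lambda context: rng.choice(alphabet)

rng = random.Random(0)
sample = generate(uniform_sampler("abc ", rng), seed="KING ", n_chars=20)
print(len(sample))  # 25: 5 seed characters plus 20 generated ones
```

Keeping the sampler separate from the loop means the same function can be reused once a real model is available.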

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!
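
For the tighter feedback loop mentioned above, one option is to train on a contiguous slice of the corpus first. This is only a sketch: `take_slice` is a hypothetical helper, the slice size is arbitrary, and the repeated sentence stands in for the real text.

```python
import random

def take_slice(text, n_chars, seed=0):
    """Return a contiguous window of at most `n_chars` characters.

    A contiguous window keeps local character order intact, which a
    character-level model needs; sampling scattered characters would not.
    """
    if n_chars >= len(text):
        return text
    start = random.Random(seed).randrange(len(text) - n_chars + 1)
    return text[start:start + n_chars]

corpus = "To be, or not to be, that is the question. " * 100
small = take_slice(corpus, n_chars=500)
print(len(small), len(corpus))
```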

In [9]:
import requests
import pandas as pd

In [10]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

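# Record the line index at which each title appears
# (a later match overwrites an earlier one)
for line_no, line in enumerate(data):
    for t, title in enumerate(toc):
        if title in line:
            locations[t]['start'] = line_no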

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)

In [11]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | THE TRAGEDY OF ANTONY AND CLEOPATRA | -99 | 14379 | |
| 1 | AS YOU LIKE IT | 14380 | 17171 | `AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r...` |
| 2 | THE COMEDY OF ERRORS | 17172 | 20372 | `THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r...` |
| 3 | THE TRAGEDY OF CORIOLANUS | 20373 | 30346 | `THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers...` |
| 4 | CYMBELINE | 30347 | 30364 | `CYMBELINE.\r\nLaud we the gods;\r\nAnd let our...` |


In [12]:
text = " ".join(data)



chars = list(set(text))

char_int = {c:i for i, c in enumerate(chars)}
int_char = {i: c for i, c in enumerate(chars)}

In [13]:
int_char

{0: 'Z',
 1: '_',
 2: 'E',
 3: '9',
 4: '|',
 5: '%',
 6: 'a',
 7: '5',
 8: '$',
 9: 'F',
 10: 'f',
 11: '`',
 12: '6',
 13: '&',
 14: 'z',
 15: '!',
 16: 'u',
 17: 'T',
 18: '0',
 19: 'b',
 20: '‘',
 21: 'A',
 22: ':',
 23: '’',
 24: 'î',
 25: 'â',
 26: '8',
 27: ')',
 28: '*',
 29: '“',
 30: '(',
 31: '-',
 32: 'à',
 33: 'H',
 34: '”',
 35: 'Æ',
 36: '1',
 37: 'R',
 38: 'G',
 39: 'o',
 40: 't',
 41: ',',
 42: '4',
 43: 'ê',
 44: 'N',
 45: 'y',
 46: 'W',
 47: 'O',
 48: 'Q',
 49: '"',
 50: 'i',
 51: '\\',
 52: 'B',
 53: '.',
 54: 'Y',
 55: ' ',
 56: '?',
 57: 'K',
 58: ';',
 59: '}',
 60: ']',
 61: 'é',
 62: '7',
 63: 'p',
 64: 'œ',
 65: 'É',
 66: 'j',
 67: "'",
 68: 'k',
 69: 'e',
 70: 's',
 71: 'J',
 72: 'V',
 73: '@',
 74: 'D',
 75: 'L',
 76: 'æ',
 77: 'w',
 78: 'q',
 79: '/',
 80: '—',
 81: 'm',
 82: 'n',
 83: 'U',
 84: 'M',
 85: 'g',
 86: 'x',
 87: '3',
 88: 'd',
 89: 'h',
 90: '\t',
 91: 'S',
 92: 'r',
 93: 'l',
 94: 'I',
 95: 'v',
 96: '2',
 97: 'ç',
 98: 'C',
 99: 'X',
 100: 'c
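
The exact index assigned to each character above depends on the iteration order of `set(text)`, which for strings can differ from one Python session to the next, so a mapping built this way may not match a model saved earlier. A minimal sketch of a reproducible alternative, assuming a sorted vocabulary is acceptable (`build_vocab`, `encode` and `decode` are illustrative names):

```python
def build_vocab(text):
    """Map each distinct character to an index in sorted order."""
    chars = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(chars)}
    int_to_char = dict(enumerate(chars))
    return char_to_int, int_to_char

def encode(text, char_to_int):
    return [char_to_int[c] for c in text]

def decode(indices, int_to_char):
    return "".join(int_to_char[i] for i in indices)

char_to_int, int_to_char = build_vocab("hello world")
ids = encode("hello", char_to_int)
print(ids)                       # [3, 2, 4, 4, 5]
print(decode(ids, int_to_char))  # hello
```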

In [14]:
maxlen = 25
step = 5

encoded = [char_int[c] for c in text]

sequences = []
next_char = []

for i in range(0, len(encoded)-maxlen, step):
    sequences.append(encoded[i:i+maxlen])
    next_char.append(encoded[i+maxlen])
    
print(sequences[:1])

[[55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 36, 55, 55, 9, 92]]
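
To make the windowing above concrete, here is the same slicing applied to a short string, with a smaller `maxlen` and `step` chosen only for readability (`make_windows` is an illustrative name). Each window is paired with the single character that follows it.

```python
def make_windows(items, maxlen, step):
    """Return (windows, targets): each window of `maxlen` items is paired
    with the item that immediately follows it."""
    windows, targets = [], []
    for i in range(0, len(items) - maxlen, step):
        windows.append(items[i:i + maxlen])
        targets.append(items[i + maxlen])
    return windows, targets

windows, targets = make_windows("shall i compare", maxlen=5, step=3)
for w, t in zip(windows, targets):
    print(repr(w), "->", repr(t))
```

A larger `step` produces fewer, less overlapping training examples, which trades coverage for speed.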


In [15]:
for i in sequences[0]:
  print(int_char[i])

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
 
 
F
r


In [16]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

In [17]:
next_char[0], int_char[next_char[0]]

(39, 'o')

In [18]:
import numpy as np

In [19]:
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)

In [20]:
y = np.zeros((len(sequences), len(chars)), dtype=bool)

In [21]:
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        x[i,t, char] = 1
    y[i, next_char[i]] = 1

In [22]:
x.shape

(1111149, 25, 104)

In [23]:
x[0]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
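
Each row of `x[0]` is one time step with a single `True` at the index of that step's character. With the shape shown above, (1111149, 25, 104), and assuming one byte per boolean, the input array alone occupies roughly 2.9 GB, which is one reason to start with a smaller sample. A toy-sized sketch of the same encoding (the four-character vocabulary is made up for illustration):

```python
import numpy as np

vocab = ["a", "b", "c", " "]                 # toy vocabulary (assumption)
char_to_int = {c: i for i, c in enumerate(vocab)}

sequence = "cab a"
one_hot = np.zeros((len(sequence), len(vocab)), dtype=bool)
for t, ch in enumerate(sequence):
    one_hot[t, char_to_int[ch]] = True       # exactly one True per time step

print(one_hot.astype(int))

# Recover the characters by taking the index of the True in each row.
decoded = "".join(vocab[i] for i in one_hot.argmax(axis=1))
print(decoded)  # cab a

# Memory estimate for the full array, at one byte per boolean.
print(1111149 * 25 * 104 / 1e9, "GB")
```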

In [24]:
loss = tf.keras.losses.CategoricalCrossentropy(from_logits = False)

In [29]:
model = Sequential([
    LSTM(32, input_shape=(maxlen, len(chars))),
    Dense(len(chars), activation='softmax')
])

model.compile(loss=loss, optimizer='adam')

In [30]:
def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    # temperature < 1 sharpens the distribution; temperature > 1 flattens it.
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
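
The log, divide, exponentiate, renormalize steps in `sample` amount to a temperature rescaling: dividing the log-probabilities by a value below 1 sharpens the distribution, and a value above 1 flattens it. A small sketch of that rescaling on a made-up distribution (`apply_temperature` is an illustrative name, not part of the notebook):

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector: log, divide by T, exponentiate, renormalize."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    exp = np.exp(logits - logits.max())      # subtract max for numerical stability
    return exp / exp.sum()

probs = [0.7, 0.2, 0.1]                      # made-up distribution
for t in (0.5, 1.0, 2.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

At a temperature of 1 the distribution comes back unchanged, which is why sampling at that setting follows the model's own probabilities.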

In [38]:
from tensorflow.keras.callbacks import LambdaCallback
from numpy import random
import sys
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [39]:
# fit the model

model.fit(x, y,
          batch_size=32,
          epochs=2,
          callbacks=[print_callback])

Epoch 1/2
----- Generating text after Epoch: 0
----- Generating with seed: "us far forth. By accident"
us far forth. By accidenten; for     A not                                        My stat-stapp I much will wibe; Asseatt? One then tense, if oul due, Jonrichter: Wear a     heresby and hor, the Moll-a meeppele; And the if you werriggly what mast if whenit! O gougle not bettaur?  BELESS. My leck it,     shored—servet her the fall this with othal-if of as the refeck bughing.   HOLRIBAND LLUFFO. I and hin yoush now     Stra
Epoch 2/2
----- Generating text after Epoch: 1
----- Generating with seed: " and fair, Anticipating t"
 and fair, Anticipating the bliting lote;   [_Exouftle have and strundst’s to presurcelin, he amad contherians.  MOPHILIZEPO. The pare sobly me serar-in befmeace brust firy Dut pracuss, frath duparice, Greact. COTHULET. I reines there trong; the evierd colve deet be inthurch complece shall you yoce a doad rreved as thine Drae, in trine; At wile and wove I had king t

<tensorflow.python.keras.callbacks.History at 0x7f5a3817d5f8>

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN

In [40]:
import tensorflow as tf

import numpy as np
import os
import time

In [41]:
# Length of text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))


Length of text: 5555770 characters


In [42]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))


104 unique characters


In [43]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])


In [44]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])


 
 
 
 
 


In [45]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))


'                    1  From fairest creatures we desire increase, That thereby beauty’s rose might ne'
'ver die, But as the riper should by time decease, His tender heir might bear his memory: But thou con'
'tracted to thine own bright eyes, Feed’st thy light’s flame with self-substantial fuel, Making a fami'
'ne where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world’s'
' fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And, '


In [46]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)


In [47]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))


Input data:  '                    1  From fairest creatures we desire increase, That thereby beauty’s rose might n'
Target data: '                   1  From fairest creatures we desire increase, That thereby beauty’s rose might ne'
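
The input and target are the same chunk offset by one character, so at every position the model is asked to predict the character that comes next. A plain-Python sketch of the pairing, using a short string in place of the 101-character chunks:

```python
def split_input_target(chunk):
    """Return (input, target) where target is input shifted left by one."""
    return chunk[:-1], chunk[1:]

inp, tgt = split_input_target("Hello")
print(inp, tgt)  # Hell ello

# At step t the model has seen inp[: t + 1] and should predict tgt[t].
for t, (seen, expected) in enumerate(zip(inp, tgt)):
    print(f"step {t}: input {seen!r} -> target {expected!r}")
```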


In [48]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
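
The buffer-based shuffling described in the comment above can be pictured with a small generator. This is a simplified sketch of the idea, not TensorFlow's implementation: fill a fixed-size buffer, then repeatedly emit a random buffered element and replace it with the next incoming one. `buffered_shuffle` is an illustrative name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield `items` in a locally shuffled order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)              # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                    # emit a random buffered element
        buffer[idx] = item                   # and replace it with the new one
    rng.shuffle(buffer)                      # drain what is left at the end
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
print(shuffled)
```

With a buffer much smaller than the dataset, an element can only move a limited distance toward the front, so the result is not a uniform shuffle. A buffer at least as large as the dataset gives a full shuffle.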

In [49]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024


In [50]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [51]:
# !rm -rf "training_checkpoints"

In [52]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)


In [53]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")


(64, 100, 104) # (batch_size, sequence_length, vocab_size)


In [54]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()


In [55]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))


Input: 
 'es err.   [_Exeunt King, Bertram, Helena, Lords, and Attendants._]  LAFEW. Do you hear, monsieur? A '

Next Char Predictions: 
 ';]sGçî—*[œ`)@|OÆ\\Rî)n*L/uED|qfç1}GrFÉ\tTFnP/dâNlW\tcZ]M%m”!GOQc\\\\?eàcyD7.d_D""klhUèÆj_]M:“dq/—Bæà;4l[’'


In [56]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())


Prediction shape:  (64, 100, 104)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.644499
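
A quick sanity check on that number: a model that spreads probability evenly over all 104 characters has a cross-entropy of ln(104), so an untrained network should start close to that value. The short calculation below reproduces it.

```python
import math

vocab_size = 104  # number of unique characters reported above

# Cross-entropy of a uniform prediction: -log(1 / vocab_size).
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 4))  # 4.6444
```

The reported initial loss of 4.644499 is close to this, which suggests the untrained logits are near uniform.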


In [57]:
model.compile(optimizer='adam', loss=loss)


In [58]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)


In [59]:
EPOCHS=10


In [60]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [61]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))


In [68]:
def generate_text(model, start_string, num_generate=5000, temperature=0.5):
  # Evaluation step (generating text using the learned model)
  #
  # num_generate: number of characters to generate.
  # temperature: low values give more predictable text, high values give
  # more surprising text. Experiment to find the best setting.

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store our results
  text_generated = []

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # scale the logits by the temperature, then sample the next character
      # from the resulting categorical distribution
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))


In [69]:
print(generate_text(model, start_string="KING "))


KING HENRY. What say you to him?   SPEED. The gods the man that thou art as full of them they are.                      Exit                                                                               Exit  SCENE II. The state I think that thou dost forth,     And then the heavens to thee, and rare not called me to the death of all these days.     A blushing cursed sight!                                                                                                         Exeunt PROTEUS and Tranio Emilia the two offers the business to their silent poison.                                   Exeunt PROTEUS      Sweet brother, I will not have thee a servant to her eyes.                                        Exit MARCUS, and attendants                                                                                                                                                                                                                                                               

In [64]:
# from google.colab import drive
# drive.mount('/content/drive')

KING HENRY. \
What say you to him?   

SPEED. \
The gods the man that thou art as full of them they are.                      Exit                                                                               
Exit  

SCENE II. 
The state I think that thou dost forth,     And then the heavens to thee, and rare not called me to the death of all these days.     A blushing cursed sight!                                                                                                         Exeunt 

PROTEUS and Tranio Emilia the two offers the business to their silent poison.                                   Exeunt 

PROTEUS\
      Sweet brother, I will not have thee a servant to her eyes.                                        Exit MARCUS, and attendants                                                                                                                                                                                                                                                                                                                                    Exeunt  SCENE II. 
The same. A Room in the Palace.   
      
Enter Cleopatra to the COUNTREYMAN What is the matter?         

BORACHIO.      \
 I pray thee, lady, I will not to my back.     The colour cannot hold the infant form of love,     When they are created on the children of the death of love.     I will not touch the walls of my desire is mine,     Then who shall not be sad sick of my head     As I see what they are return'd and sleep.     Are you best to be whip?   SILVIA. Alas, poor soul, the gods are fall'n off them.                                                                                                                                                         Exit                                                                                                                                Exeunt 
 
 ALCIBIADES  \  
 He that doth doubt the sun she lives.     What will you hear him sit, and then the place shall chance to cry.                                                                                                           [A plain blushing of the plain]   PROTEUS. Ay, but what think you, my lord?   TIMON. What said she hath been this thing that your blessing makes a man than the cause of the deed!     What say you to be so dear as I am a soldier than my breast.                          Exit   COSTARD. Is this the word?         DON JOHN.       If I do so, I will not change this work     With our discourse of his age, but stopp'd,     And when the cloud the old record of all the bed When they no more than the forest of the state Where the sport is that she swears he shall never love her.                                                                                 Exit  SCENE II. A Room in Olivia’s House.  ACT III  SCENE I. The same. A Room in the palace.   Enter Clown and Attendants.  DUKE. What say'st thou, man?   CORIOLANUS. What, will you love me?   TIMON. What say'st thou that shall I desire you to her so long?     And then I bid you hear the start of mine     Which makes you justice of my love.     What says my lord?   FIRST MURDERER. And I will go     To set them up to the consent of our soldiers,     And that which stands in vain, sir, the true love and discontent     To hear the break of honour, to be from the sea.     What can the slave of mine, this service is no more to be the day.     What should you not have seen, and this same save your pity,     And so will I rest to the clock.     
My hand hath sent to be the chamber of the state     To hear the morning leaves of the man doth shine   Which the next war shall stain the start of heart     She is the proceeding of the prime of honour in the matter for the people of her mortal father;     And with you to behold the prison a great company     To the tender time of nature bear the fact     With all the persons of the world.     My lord, let it be constant as a piece of sovereignty.     I ever saw him downe in the peril of her bearing.     What says my lord?   LAUNCE. Nay, that determine on the ballad makes me, and then do me good.     What shall be so bold to be to my body to his love,     The person of his father’s matter thence: He leave him to answer man’s painted butts A merry riast that the pale dog is made to be your cause.   [_Exeunt._]  SCENE III. The same. A Room in the Palace.   Enter Clowne, and the rest of the Count Olympus.  FRIAR LAWRENCE. Sometimes the world will take us. I will be as set the trumpet sounds, And make me live in her some pen to take a child.   [_Exit_.]  SCENE II. The same. A room in the Castle.  ACT II Scene I. Another part of the state where lies of mercy, Shall be as from the world to lose his face with him.  POLIXENES. If you will make a strange thing for your father’s house, Which sometime like a creature and the turtle of the world.                                                                                                                                    
