<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal is a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [28]:
import requests
import pandas as pd

In [29]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

# Record the line index at which each title occurs (the last match wins)
for line_no, line in enumerate(data):
    for t, title in enumerate(toc):
        if title in line:
            locations[t].update({'start': line_no})

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[ x['start'] : int(x['end']) ]), axis=1)

In [30]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | ALL’S WELL THAT ENDS WELL | 2777 | 7738 | ALL’S WELL THAT ENDS WELL\r\n\r\n\r\n\r\nConte... |
| 1 | THE TRAGEDY OF ANTONY AND CLEOPATRA | 7739 | 11840 | THE TRAGEDY OF ANTONY AND CLEOPATRA\r\n\r\nDRA... |
| 2 | AS YOU LIKE IT | 11841 | 14631 | AS YOU LIKE IT\r\n\r\nDRAMATIS PERSONAE.\r\n\r... |
| 3 | THE COMEDY OF ERRORS | 14632 | 17832 | THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r... |
| 4 | THE TRAGEDY OF CORIOLANUS | 17833 | 27806 | THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers... |


In [94]:
# Load the training corpus from Google Drive (the path assumes Drive is mounted)
with open("/content/drive/My Drive/All_the_Writing.txt", encoding="utf-8") as f:
    text = f.read()


chars = list(set(text))

char_int = {c:i for i, c in enumerate(chars)}
int_char = {i: c for i, c in enumerate(chars)}
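
One caveat with `list(set(text))`: the iteration order of a set of strings can change between Python sessions, because string hashing is randomized by default. The index assigned to each character may therefore differ after a kernel restart, which matters if weights are saved and reloaded. A minimal sketch, assuming a short stand-in string `sample_text` (hypothetical, not the notebook's corpus), shows a sorted vocabulary giving a reproducible mapping and a lossless round trip:

```python
# Stand-in corpus (hypothetical); the notebook uses the full `text` instead.
sample_text = "to be or not to be"

# Sorting makes the character -> index mapping identical on every run.
vocab_sorted = sorted(set(sample_text))
char_to_idx = {c: i for i, c in enumerate(vocab_sorted)}
idx_to_char = {i: c for c, i in char_to_idx.items()}

# Encode to integer ids, then decode back to text.
encoded_sample = [char_to_idx[c] for c in sample_text]
decoded_sample = "".join(idx_to_char[i] for i in encoded_sample)

print(vocab_sorted)
print(decoded_sample == sample_text)
```

The second half of this notebook already takes this route with `sorted(set(text))`.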

In [32]:
int_char

{0: 'h',
 1: 'î',
 2: '(',
 3: '8',
 4: '1',
 5: 'P',
 6: 'G',
 7: '!',
 8: 'w',
 9: 'j',
 10: 'e',
 11: ';',
 12: 'c',
 13: 'b',
 14: 'I',
 15: 'Q',
 16: '/',
 17: 'J',
 18: 'D',
 19: "'",
 20: 'R',
 21: 's',
 22: 'M',
 23: 'n',
 24: '`',
 25: 'q',
 26: '_',
 27: 'ê',
 28: '\\',
 29: 'x',
 30: ':',
 31: 'a',
 32: 'N',
 33: '[',
 34: 'æ',
 35: 'i',
 36: 'A',
 37: 'm',
 38: ' ',
 39: ']',
 40: 'œ',
 41: 'y',
 42: '2',
 43: 'C',
 44: 'W',
 45: 'f',
 46: 'B',
 47: '-',
 48: 't',
 49: '"',
 50: 'E',
 51: 'V',
 52: 'ç',
 53: '?',
 54: ')',
 55: '$',
 56: 'v',
 57: '9',
 58: 'r',
 59: 'â',
 60: '—',
 61: 'L',
 62: 'u',
 63: '7',
 64: 'O',
 65: 'T',
 66: 'k',
 67: 'É',
 68: 'H',
 69: '”',
 70: '0',
 71: 'S',
 72: 'l',
 73: '3',
 74: 'Z',
 75: 'g',
 76: 'U',
 77: '“',
 78: '}',
 79: '|',
 80: '.',
 81: '‘',
 82: 'é',
 83: 'Æ',
 84: 'd',
 85: 'à',
 86: '*',
 87: '@',
 88: 'o',
 89: 'X',
 90: '&',
 91: '%',
 92: 'p',
 93: 'F',
 94: 'è',
 95: '6',
 96: '’',
 97: '\t',
 98: '5',
 99: 'K',
 100: 'z

In [33]:
maxlen = 25
step = 5

encoded = [char_int[c] for c in text]

sequences = []
next_char = []

for i in range(0, len(encoded)-maxlen, step):
    sequences.append(encoded[i:i+maxlen])
    next_char.append(encoded[i+maxlen])
    
print(sequences[:1])

[[65, 68, 50, 38, 71, 64, 32, 32, 50, 65, 71, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38]]
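
To make the windowing easier to see, here is a small sketch of the same loop on a toy list. The helper name `make_windows` and the toy input are assumptions for illustration; the slicing is the same as in the cell above.

```python
def make_windows(encoded, maxlen, step):
    """Return (windows, targets): each window of `maxlen` ids and the id that follows it."""
    windows, targets = [], []
    for i in range(0, len(encoded) - maxlen, step):
        windows.append(encoded[i:i + maxlen])
        targets.append(encoded[i + maxlen])
    return windows, targets

toy = list(range(10))  # stand-in for the encoded corpus
windows, targets = make_windows(toy, maxlen=4, step=2)
for w, t in zip(windows, targets):
    print(w, "->", t)
```

A larger `step` yields fewer, less overlapping training examples; a smaller one yields more examples at the cost of memory and training time.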


In [34]:
for i in sequences[0]:
  print(int_char[i])

T
H
E
 
S
O
N
N
E
T
S
 
 
 
 
 
 
 
 
 
 
 
 
 
 


In [35]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

In [36]:
next_char[0], int_char[next_char[0]]

(38, ' ')

In [37]:
import numpy as np

In [38]:
# np.bool was removed in NumPy 1.24; the built-in bool is the equivalent dtype
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)

In [39]:
y = np.zeros((len(sequences), len(chars)), dtype=bool)

In [40]:
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        x[i,t, char] = 1
    y[i, next_char[i]] = 1
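
The nested loop above can be slow for over a million sequences. One possible alternative, sketched here on toy data, builds the same one-hot arrays by indexing an identity matrix. The names `toy_sequences`, `toy_next`, and `n_chars` are hypothetical stand-ins for `sequences`, `next_char`, and `len(chars)`.

```python
import numpy as np

# Toy stand-ins (hypothetical values) for `sequences`, `next_char`, and len(chars).
toy_sequences = [[0, 2, 1], [2, 1, 3]]
toy_next = [3, 0]
n_chars = 4

seq_arr = np.asarray(toy_sequences)        # shape (n_samples, maxlen)
identity = np.eye(n_chars, dtype=bool)     # row k is the one-hot vector for id k
x_toy = identity[seq_arr]                  # shape (n_samples, maxlen, n_chars)
y_toy = identity[np.asarray(toy_next)]     # shape (n_samples, n_chars)

print(x_toy.shape, y_toy.shape)
```

Either way the dense array is large: at the shape reported below, (1114071, 25, 104), a boolean array takes roughly 2.9 GB, which is one reason to start with a smaller sample of the text.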

In [41]:
x.shape

(1114071, 25, 104)

In [42]:
x[0]

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [43]:
loss = tf.keras.losses.CategoricalCrossentropy(from_logits = False)

In [44]:
model = Sequential([
    LSTM(128, input_shape=(maxlen, len(chars))),
    Dense(len(chars), activation='softmax')
])

model.compile(loss=loss, optimizer='adam')

In [45]:
def sample(preds):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / 1
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
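
The `sample` helper divides the log-probabilities by 1, which amounts to a fixed temperature of 1. A hedged sketch of a variant with an explicit temperature follows; `sample_with_temperature`, the clipping epsilon, and the optional `rng` argument are assumptions added for illustration, not part of the original helper.

```python
import numpy as np

def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Sample an index from probabilities `preds` after temperature rescaling."""
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(preds, dtype="float64")
    # Clip to avoid log(0); the epsilon is an arbitrary small constant.
    logits = np.log(np.clip(preds, 1e-12, None)) / temperature
    logits -= logits.max()          # numerical stability before exponentiating
    probs = np.exp(logits)
    probs /= probs.sum()            # renormalize to a valid distribution
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
probs = [0.7, 0.2, 0.1]
# A very low temperature concentrates almost all mass on the most likely index.
cold = [sample_with_temperature(probs, 0.05, rng) for _ in range(200)]
print(sum(i == 0 for i in cold))
```

Temperatures below 1 sharpen the distribution toward the most likely character, and temperatures above 1 flatten it.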

In [46]:
import random
import sys

from tensorflow.keras.callbacks import LambdaCallback
def on_epoch_end(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    
    start_index = random.randint(0, len(text) - maxlen - 1)
    
    generated = ''
    
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    for i in range(400):
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1
            
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds)
        next_char = int_char[next_index]
        
        sentence = sentence[1:] + next_char
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()


print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

In [47]:
# fit the model

# model.fit(x, y,
#           batch_size=32,
#           epochs=10,
#           callbacks=[print_callback])

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN

In [48]:
import tensorflow as tf

import numpy as np
import os
import time

In [109]:
# The length of the text is the number of characters in it
print('Length of text: {} characters'.format(len(text)))


Length of text: 1755916 characters


In [110]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))


101 unique characters


In [111]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])


In [112]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])


﻿
 
P
s
y


In [113]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))


'\ufeff\xa0Psyche Revived\nI.\n“I have an almost morbid interest in everything queer and out of the way”\nI know '
'I’m in love with her the moment I see her. Of course I am. How could I not be? I’m in love with her e'
'ven though I haven’t said a word to her, because it’s the only possible reaction. I’m in love with he'
'r, and I never stood a chance.\nIt’s a Thursday. Nearly the end of second period. We’re reading The Me'
'tamorphoses of Apuleius or, as Ms. Spicer tells us in her breathy near-whisper, The Golden Ass. It’s '


In [114]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)


In [115]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))


Input data:  '\ufeff\xa0Psyche Revived\nI.\n“I have an almost morbid interest in everything queer and out of the way”\nI know'
Target data: '\xa0Psyche Revived\nI.\n“I have an almost morbid interest in everything queer and out of the way”\nI know '
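
The input and target are the same chunk shifted by one position, so at every time step the model is asked to predict the character that comes next. A tiny sketch on a plain string makes the alignment explicit; `shift_by_one` is a hypothetical name that mirrors the slicing in `split_input_target`.

```python
def shift_by_one(chunk):
    # Same slicing as split_input_target; works on strings, lists, and arrays.
    return chunk[:-1], chunk[1:]

inp, tgt = shift_by_one("Hello")
for step, (a, b) in enumerate(zip(inp, tgt)):
    print(f"step {step}: input {a!r} -> expected {b!r}")
```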


In [116]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

In [117]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024


In [118]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [119]:
# !rm -rf "training_checkpoints"

In [120]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)


In [121]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")


(64, 100, 101) # (batch_size, sequence_length, vocab_size)


In [122]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()


In [123]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0].numpy()])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))


Input: 
 'nd this deep-seated admiration was actually romantic attraction. But it hit me a few years ago that '

Next Char Predictions: 
 '\xa0N”Nw"d6gG3~’D]<5\ufeffEE_3BFg‘)(—z5\'U/;XY1>ONGExéZ’/8<S6Teué’2t?@k:E\\M3S),w→xr^‘\'NknB7;TpbR->uLpR7éA$-GX'


In [124]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())


Prediction shape:  (64, 100, 101)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.614859
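
As a sanity check, an untrained model should be close to guessing uniformly over the vocabulary, in which case the expected per-character cross-entropy is ln(vocab_size). The short calculation below compares that value with the loss printed above; the two constants are copied from the notebook outputs, and `n_vocab` is just a local name for illustration.

```python
import math

n_vocab = 101             # from the "101 unique characters" output above
reported_loss = 4.614859  # from the scalar_loss printed above

uniform_loss = math.log(n_vocab)  # cross-entropy of a uniform guess, in nats
print(f"ln({n_vocab}) = {uniform_loss:.4f}")
print(f"difference from reported loss: {abs(uniform_loss - reported_loss):.4f}")
```

A starting loss far from this value would suggest a mismatch somewhere, for example a wrong `from_logits` setting.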


In [125]:
model.compile(optimizer='adam', loss=loss)


In [126]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)


In [127]:
EPOCHS=10


In [None]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10

In [None]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))


In [None]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = .5

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
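
The assignment asks for a function that takes the size of the text as an argument, whereas `generate_text` fixes `num_generate` at 1000. One way to expose the length and temperature as parameters is sketched below in plain Python so it runs without a trained model. Everything here is an assumption for illustration: `generate`, the `next_logits` callable, and the toy stub stand in for the real model call.

```python
import math
import random

def generate(next_logits, seed, num_chars, vocab, temperature=1.0, rng=None):
    """Return `seed` followed by exactly `num_chars` sampled characters.

    `next_logits(context)` is a hypothetical callable returning one
    unnormalized score per entry of `vocab`; a trained model would go here.
    """
    rng = rng or random.Random()
    out = list(seed)
    for _ in range(num_chars):
        scores = [s / temperature for s in next_logits("".join(out))]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # softmax numerators
        out.append(rng.choices(vocab, weights=weights, k=1)[0])
    return "".join(out)

# Stub "model" (assumption): prefers the letter after the last one, cyclically.
toy_vocab = list("abc")

def toy_logits(context):
    nxt = (toy_vocab.index(context[-1]) + 1) % len(toy_vocab)
    return [3.0 if i == nxt else 0.0 for i in range(len(toy_vocab))]

sample_out = generate(toy_logits, seed="a", num_chars=20, vocab=toy_vocab,
                      temperature=0.5, rng=random.Random(0))
print(len(sample_out), sample_out)
```

In the notebook itself, the same effect could be had by turning `num_generate` and `temperature` into keyword arguments of `generate_text`.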


In [92]:
from pprint import pprint

print(generate_text(model, start_string=u"I "))


ROMEO: There is no true than thou art not to be so full of woman’s picture.   [_Exeunt._]  SCENE III. The same. A Street. Scene II. The same. Before PAGE THI So farewell with the Duke of Buckingham,     To make the cause all this to my poor father's death,     The spine of all the world, my boy shall make me say     That I may tell my fortune come to see     The sceptre's ministers of sweet beds.     For there is patiently not so much       To speak the accent of his father what he speak.   TAMORA. I will not well encourage him of such a thought     I understand not the trick of a condition of his wife,     The forest I may fear to slaughter him.     If you will know your mercy may be gentleman,     I am as agreed to pardon him.                                                                                                                                                                                                                                               Exit   TIMON. I do beli

In [93]:
# from google.colab import drive
# drive.mount('/content/drive')

