
## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [**infinite monkeys typing for an infinite amount of time**](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. So let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

We will focus specifically on Shakespeare's Sonnets to improve our model's learning ability from the data.

In [2]:
import random
import sys
import os

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import LambdaCallback

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Bidirectional
from tensorflow.keras.layers import LSTM

%matplotlib inline

# a custom data prep class that we'll be using 
from data_cleaning_toolkit_class import data_cleaning_toolkit

### Use requests to pull data from a URL

[**Read through the requests documentation**](https://requests.readthedocs.io/en/master/user/quickstart/#make-a-request) to learn how to download the Shakespeare Sonnets from the Gutenberg website. 

**Protip:** Do not overthink it.

In [3]:
# download all of Shakespeare's Sonnets from the Project Gutenberg website

# here's the link for the sonnets
url_shakespeare_sonnets = "https://www.gutenberg.org/cache/epub/1041/pg1041.txt"

# use requests and the url to download all of the sonnets - save the result to `r`



# YOUR CODE HERE
r = requests.get(url_shakespeare_sonnets)

In [4]:
# move the downloaded text out of the request object - save the result to `raw_text_data`
# hint: take a look at the attributes of `r`
# YOUR CODE HERE
raw_text_data = r.text

In [5]:
# check the data type of `raw_text_data`
type(raw_text_data)

str

### Data Cleaning

In [6]:
# as usual, we are tasked with cleaning up messy data
# Question: Do you see any characters that we could use to split up the text?
raw_text_data[:3000]

"\ufeffThe Project Gutenberg EBook of Shakespeare's Sonnets, by William Shakespeare\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.org\r\n\r\n\r\nTitle: Shakespeare's Sonnets\r\n\r\nAuthor: William Shakespeare\r\n\r\nPosting Date: April 7, 2014 [EBook #1041]\r\nRelease Date: September, 1997\r\nLast Updated: March 10, 2010\r\n\r\nLanguage: English\r\n\r\n\r\n*** START OF THIS PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***\r\n\r\n\r\n\r\n\r\nProduced by Joseph S. Miller and Embry-Riddle Aeronautical\r\nUniversity Library. HTML version by Al Haines.\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\nTHE SONNETS\r\n\r\nby William Shakespeare\r\n\r\n\r\n\r\n\r\n  I\r\n\r\n  From fairest creatures we desire increase,\r\n  That thereby beauty's rose might never die,\r\n  But as the riper

In [7]:
# split the text into lines and save the result to `split_data`
# YOUR CODE HERE
split_data = raw_text_data.split('\n')

In [8]:
# we need to drop all the boilerplate text (i.e., titles and descriptions) as well as white spaces
# so that we are left with only the sonnets themselves 
split_data[:20] 

["\ufeffThe Project Gutenberg EBook of Shakespeare's Sonnets, by William Shakespeare\r",
 '\r',
 'This eBook is for the use of anyone anywhere at no cost and with\r',
 'almost no restrictions whatsoever.  You may copy it, give it away or\r',
 're-use it under the terms of the Project Gutenberg License included\r',
 'with this eBook or online at www.gutenberg.org\r',
 '\r',
 '\r',
 "Title: Shakespeare's Sonnets\r",
 '\r',
 'Author: William Shakespeare\r',
 '\r',
 'Posting Date: April 7, 2014 [EBook #1041]\r',
 'Release Date: September, 1997\r',
 'Last Updated: March 10, 2010\r',
 '\r',
 'Language: English\r',
 '\r',
 '\r',
 "*** START OF THIS PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***\r"]

**Use list index slicing to remove the titles and descriptions, so we only have the sonnets.**


In [9]:
# in this download, the sonnets sit between indices 45 and 2661 of `split_data`
# titles and descriptions exist outside of these indices

# use index slicing to isolate the sonnet lines - save the result to `sonnets`

# YOUR CODE HERE
sonnets = split_data[45:2661]
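Hard-coded indices only work for one exact copy of the file. As a rough alternative, the sketch below slices between the Gutenberg marker lines instead. The function name `strip_gutenberg_boilerplate` is hypothetical, and the `*** END OF` marker is an assumption (only the `*** START OF` line is visible in the printout above). The result would still contain the credits and title lines that follow the start marker, so the length filter used later is still needed.

```python
def strip_gutenberg_boilerplate(lines):
    """Return only the lines between the '*** START OF' and '*** END OF' markers.

    Assumes both marker lines exist; `next` raises StopIteration otherwise.
    """
    start = next(i for i, line in enumerate(lines) if line.startswith("*** START OF"))
    end = next(i for i, line in enumerate(lines) if line.startswith("*** END OF"))
    return lines[start + 1:end]


# toy stand-in for `split_data` (not the real file contents)
sample_lines = [
    "Title: Shakespeare's Sonnets\r",
    "*** START OF THIS PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***\r",
    "\r",
    "  From fairest creatures we desire increase,\r",
    "  That thereby beauty's rose might never die,\r",
    "*** END OF THIS PROJECT GUTENBERG EBOOK SHAKESPEARE'S SONNETS ***\r",
    "License text ...\r",
]

body = strip_gutenberg_boilerplate(sample_lines)
print(body)
```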

In [10]:
sonnets[-2:]

['    Came there for cure and this by that I prove,\r',
 "    Love's fire heats water, water cools not love.\r"]

In [11]:
# notice how all non-sonnet lines have far fewer characters than the actual sonnet lines?
# well, let's use that observation to filter out all the non-sonnet lines
sonnets[200:240]

["    And nothing 'gainst Time's scythe can make defence\r",
 '    Save breed, to brave him when he takes thee hence.\r',
 '\r',
 '  XIII\r',
 '\r',
 '  O! that you were your self; but, love you are\r',
 '  No longer yours, than you your self here live:\r',
 '  Against this coming end you should prepare,\r',
 '  And your sweet semblance to some other give:\r',
 '  So should that beauty which you hold in lease\r',
 '  Find no determination; then you were\r',
 "  Yourself again, after yourself's decease,\r",
 '  When your sweet issue your sweet form should bear.\r',
 '  Who lets so fair a house fall to decay,\r',
 '  Which husbandry in honour might uphold,\r',
 "  Against the stormy gusts of winter's day\r",
 "  And barren rage of death's eternal cold?\r",
 '    O! none but unthrifts. Dear my love, you know,\r',
 '    You had a father: let your son say so.\r',
 '\r',
 '  XIV\r',
 '\r',
 '  Not from the stars do I my judgement pluck;\r',
 '  And yet methinks I have astronomy,\r',
 '  But 

In [12]:
# any string with n_chars characters or fewer will be filtered out - save results to `filtered_sonnets`

# YOUR CODE HERE
n_chars = 15
filtered_sonnets = [x.strip() for x in sonnets if len(x) > n_chars]

In [13]:
# ok - much better!
# but we still need to remove all the punctuation and case normalize the text
filtered_sonnets[2400:]

[]

### Use Custom Data Cleaning Tool 

Use one of the methods in the `data_cleaning_toolkit` to clean your data.

There is an example of this in the guided project.
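The internals of `data_cleaning_toolkit` are not shown in this notebook, so the following is only a guess at what a punctuation-stripping, case-normalizing step might look like. The function `clean_line` is hypothetical and is not the toolkit's actual method.

```python
import re


def clean_line(line):
    """Lowercase a line and keep only the letters a-z and single spaces."""
    line = line.lower()
    # drop everything that is not a lowercase letter or a space
    # (apostrophes, commas, hyphens, digits, carriage returns, ...)
    line = re.sub(r"[^a-z ]", "", line)
    # collapse runs of spaces into one and trim the ends
    line = re.sub(r" +", " ", line)
    return line.strip()


cleaned = clean_line("  That thereby beauty's rose might never die,\r")
print(cleaned)
```

Note that removing apostrophes and hyphens merges tokens such as `beauty's` into `beautys`, which matches the style of the cleaned lines printed further down.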

In [14]:
# instantiate the data_cleaning_toolkit class - save result to `dctk`

# YOUR CODE HERE
dctk = data_cleaning_toolkit()

In [15]:
# use data_cleaning_toolkit to remove punctuation and to case normalize - save results to `clean_sonnets`

# YOUR CODE HERE
clean_sonnets = [dctk.clean_data(x) for x in filtered_sonnets]

In [16]:
# much better!
clean_sonnets

['from fairest creatures we desire increase',
 'that thereby beautys rose might never die',
 'but as the riper should by time decease',
 'his tender heir might bear his memory',
 'but thou contracted to thine own bright eyes',
 'feedst thy lights flame with selfsubstantial fuel',
 'making a famine where abundance lies',
 'thy self thy foe to thy sweet self too cruel',
 'thou that art now the worlds fresh ornament',
 'and only herald to the gaudy spring',
 'within thine own bud buriest thy content',
 'and tender churl makst waste in niggarding',
 'pity the world or else this glutton be',
 'to eat the worlds due by the grave and thee',
 'when forty winters shall besiege thy brow',
 'and dig deep trenches in thy beautys field',
 'thy youths proud livery so gazed on now',
 'will be a tatterd weed of small worth held',
 'then being asked where all thy beauty lies',
 'where all the treasure of thy lusty days',
 'to say within thine own deep sunken eyes',
 'were an alleating shame and thriftl

In [17]:
# sort a copy of the lines from shortest to longest
lensort = sorted(clean_sonnets, key=len)

In [18]:
lensort

['if any be a satire to decay',
 'is poorly imitated after you',
 'o no it is an everfixed mark',
 'was usd in giving gentle doom',
 'i hate she alterd with an end',
 'without accusing you of injury',
 'o what a happy title do i find',
 'nor my beloved as an idol show',
 'that followed it as gentle day',
 'that use is not forbidden usury',
 'to tie up envy evermore enlargd',
 'i hate from hate away she threw',
 'and savd my life saying not you',
 'upon thy self thy beautys legacy',
 'leaving thee living in posterity',
 'the living record of your memory',
 'self so selfloving were iniquity',
 'for such a time do i now fortify',
 'o fearful meditation where alack',
 'and lace itself with his society',
 'now proud as an enjoyer and anon',
 'or gluttoning on all or all away',
 'and heavy ignorance aloft to fly',
 'and given grace a double majesty',
 'the injuries that to myself i do',
 'o benefit of ill now i find true',
 'so i return rebukd to my content',
 'at my abuses reckon up their o

### Use Your Data Tool to Create Character Sequences 

We'll need the `create_char_sequences` method for this task. However, this method requires a parameter called `maxlen`, which is responsible for setting the maximum sequence length. 

So what would be a good sequence length, exactly? 

To answer that question, let's do some statistics! 

In [19]:
def calc_stats(corpus):
    """
    Calculates statistics on the length of every line in the sonnets
    """

    # write a list comprehension that calculates each sonnet line's length - save the results to `doc_lens`

    # use NumPy to calculate and return the mean, median, std, min, max of the doc lens

    # YOUR CODE HERE
    doc_lens = [len(x) for x in corpus]

    mean = np.mean(doc_lens)
    median = np.median(doc_lens)
    std = np.std(doc_lens)
    # trailing underscores avoid shadowing the built-in min() and max()
    min_ = np.min(doc_lens)
    max_ = np.max(doc_lens)

    return mean, median, std, min_, max_

In [20]:
# sonnet line length statistics 
# calc_stats returns the minimum before the maximum
mean, med, std, min_, max_ = calc_stats(clean_sonnets)
mean, med, std, min_, max_

(40.8784222737819, 41.0, 4.04121152010423, 27, 57)

In [21]:
# using the results of the sonnet line length statistics, use your judgement and select a value for maxlen
# use .create_char_sequences() to create sequences

# YOUR CODE HERE
maxlen = 40
dctk.create_char_sequences(clean_sonnets, maxlen=maxlen)

Created 18042 sequences.
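How the toolkit builds its sequences is not shown here. A common approach for character-level models is a sliding window over the joined text: each window of `maxlen` characters is an input, and the character right after it is the target. The sketch below illustrates that idea; the function name `make_char_windows` and the `step` parameter are hypothetical and may not match `create_char_sequences`.

```python
def make_char_windows(text, maxlen, step=1):
    """Slide a window of `maxlen` chars over `text`, moving `step` chars at a time.

    Returns the windows and, for each window, the character that follows it.
    """
    sequences, next_chars = [], []
    for i in range(0, len(text) - maxlen, step):
        sequences.append(text[i:i + maxlen])
        next_chars.append(text[i + maxlen])
    return sequences, next_chars


toy_text = "from fairest creatures we desire increase"
seqs, nxt = make_char_windows(toy_text, maxlen=10, step=3)

print(len(seqs))
print(repr(seqs[0]), "->", repr(nxt[0]))
```

A larger `step` produces fewer, less overlapping samples, which trades training data for speed.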


Take a look at the `data_cleaning_toolkit_class.py` file. 

In the first four lines of code in the `create_char_sequences` method, the attributes `n_features` and `unique_chars` are created. Let's inspect them in the cells below.

In [22]:
# number of input features for our LSTM model
print(dctk.n_features)

27


In [23]:
# unique characters that appear in our sonnets 
dctk.unique_chars

['f',
 'p',
 'd',
 'c',
 'j',
 'q',
 'a',
 'r',
 'w',
 's',
 'z',
 'h',
 'g',
 't',
 'e',
 'm',
 'i',
 ' ',
 'v',
 'o',
 'l',
 'x',
 'u',
 'y',
 'k',
 'n',
 'b']

## Time for Questions 

----
**Question 1:** 

Why are the `number of unique characters` (i.e., the length of **dctk.unique_chars**) and the `number of model input features` (i.e., **dctk.n_features**) the same?

**Hint:** The model that we will shortly build here is very similar to the text generation model we built in the guided project.

**Answer 1:**

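Because the model generates text character by character rather than word by word. Each character is one-hot encoded, so every unique character gets its own input feature.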


**Question 2:**

Take a look at the printout of `dctk.unique_chars` one more time. Notice that there is a white space. 

Why is it desirable to have a white space as a possible character to predict?

**Answer 2:**

So that the model can learn where words end and generate spaces between words.

----

### Use Our Data Tool to Create X and Y Splits

You'll need the `create_X_and_Y` method for this task. 

In [24]:
# TODO: provide a walkthrough of data_cleaning_toolkit with unit tests that check for understanding 
X, y = dctk.create_X_and_Y()

![](https://miro.medium.com/max/891/0*jGB1CGQ9HdeUwlgB)

In [25]:
# notice that our input matrix isn't a matrix - it's a rank three tensor
X.shape

(18042, 40, 27)

In `X.shape`, we see three numbers (*n1*, *n2*, *n3*). What do these numbers mean?

Well, *n1* tells us the number of samples that we have. But what about the other two?

In [26]:
# the first index returns a single sample, which we can see is a sequence 
first_sample_index = 0 
X[first_sample_index]

array([[ True, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

Notice that each sequence (i.e., $X[i]$ where $i$ is some index value) is `maxlen` long and has `dctk.n_features` number of features. Let's try to understand this shape better.

In [27]:
# each sequence is maxlen long and has dctk.n_features number of features
X[first_sample_index].shape

(40, 27)

**Each row corresponds to a character vector,** and there is `maxlen` number of character vectors. 

**Each column corresponds to a unique character,** and there are `dctk.n_features` number of features. 


In [28]:
# let's index for a single character vector 
first_char_vect_index = 0
X[first_sample_index][first_char_vect_index]

array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False])

Notice that there is a single `TRUE` value, and all the rest of the values are `FALSE`. 

This is a one-hot encoding for which character appears at each index within a sequence. Specifically, the cell above is looking at the first character in the sequence.

Only a single character can appear as the first character in a sequence, so there will be a single `TRUE` value, and the rest will be `FALSE`. 

Let's say that `TRUE` appears at the $i$th index, where $i$ is an arbitrary index. How can we find out which character it corresponds to?

To answer this question, we need to use the character-to-integer look-up dictionaries. 

In [29]:
# take a look at the index-to-character dictionary
# if a TRUE appears in the 0th index of a character vector,
# then whatever char you see below next to the 0th key 
# is the character that the character vector is encoding
dctk.int_char

{0: 'f',
 1: 'p',
 2: 'd',
 3: 'c',
 4: 'j',
 5: 'q',
 6: 'a',
 7: 'r',
 8: 'w',
 9: 's',
 10: 'z',
 11: 'h',
 12: 'g',
 13: 't',
 14: 'e',
 15: 'm',
 16: 'i',
 17: ' ',
 18: 'v',
 19: 'o',
 20: 'l',
 21: 'x',
 22: 'u',
 23: 'y',
 24: 'k',
 25: 'n',
 26: 'b'}
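To make the look-up idea concrete, here is a small sketch that builds both dictionaries from a string and one-hot encodes a short sequence. The helper names `build_lookups` and `one_hot_encode` are hypothetical. Sorting the characters is my own choice for reproducibility; the printout above suggests the toolkit does not sort them.

```python
import numpy as np


def build_lookups(text):
    """Map each unique character to an integer index, and back again."""
    chars = sorted(set(text))  # sorted only so the mapping is reproducible
    char_int = {c: i for i, c in enumerate(chars)}
    int_char = {i: c for c, i in char_int.items()}
    return char_int, int_char


def one_hot_encode(sequence, char_int):
    """Return a (len(sequence), n_features) boolean array with one True per row."""
    encoded = np.zeros((len(sequence), len(char_int)), dtype=bool)
    for t, char in enumerate(sequence):
        encoded[t, char_int[char]] = True
    return encoded


char_int, int_char = build_lookups("from fairest creatures")
encoded = one_hot_encode("from", char_int)

print(encoded.shape)
print(encoded.astype(int))
```

Each row is one character vector, and the column holding the `True` is the integer that `char_int` assigns to that character.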

In [30]:
# let's look at an example to tie it all together

seq_len_counter = 0

# index for a single sample 
for seq_of_char_vects in X[first_sample_index]:
    
    # get index with max value, which will be the one TRUE value 
    index_with_TRUE_val = np.argmax(seq_of_char_vects)
    
    print(dctk.int_char[index_with_TRUE_val])
    
    seq_len_counter+=1
    
print("Sequence length: {}".format(seq_len_counter))

f
r
o
m
 
f
a
i
r
e
s
t
 
c
r
e
a
t
u
r
e
s
 
w
e
 
d
e
s
i
r
e
 
i
n
c
r
e
a
s
Sequence length: 40
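The loop above decodes one character vector at a time. As a sketch, the same decoding can be done in one step by taking `argmax` along the feature axis. The function `decode_one_hot` is a hypothetical helper, and the toy look-up and sample below are made up for illustration.

```python
import numpy as np


def decode_one_hot(sample, int_char):
    """Turn a (seq_len, n_features) one-hot array back into a string."""
    # argmax along axis 1 finds the position of the single True in each row
    indices = np.argmax(sample, axis=1)
    return "".join(int_char[int(i)] for i in indices)


# toy look-up and sample: the identity matrix encodes the characters 0, 1, 2, 3 in order
toy_int_char = {0: "f", 1: "r", 2: "o", 3: "m"}
toy_sample = np.eye(4, dtype=bool)

decoded = decode_one_hot(toy_sample, toy_int_char)
print(decoded)
```

With the notebook's objects, the equivalent call would be `decode_one_hot(X[first_sample_index], dctk.int_char)`.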


## Time for Questions 

----
**Question 1:** 

In your own words, how would you describe the numbers from the shape printout of `X.shape` to a classmate?


**Answer 1:**

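`X` is a rank-three tensor with shape (samples, sequence length, features). It holds 18042 sequences, each sequence is 40 characters long, and each character is a vector of 27 boolean values in which the single `TRUE` marks which letter it is.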

----


### Build a Text Generation Model

Now that we have prepped our data (and understood that process), let's finally build out our character generation model, similar to what we did in the guided project.

In [31]:
def sample(preds, temperature=1.0):
    """
    Helper function to sample an index from a probability array
    """
    # convert preds to array 
    preds = np.asarray(preds).astype('float64')
    # scale values 
    preds = np.log(preds) / temperature
    # exponentiate values
    exp_preds = np.exp(preds)
    # this equation should look familiar to you (hint: it's an activation function)
    preds = exp_preds / np.sum(exp_preds)
    # draw one sample from a multinomial distribution (returns a one-hot array)
    probas = np.random.multinomial(1, preds, 1)
    # return the index of the sampled character (the position of the single 1)
    return np.argmax(probas)

def on_epoch_end(epoch, _):
    """
    Function invoked at the end of each epoch. Prints the text generated by our model.
    """
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    

    # randomly pick a starting index 
    # will be used to take a random sequence of chars from `text`
    start_index = random.randint(0, len(text) - dctk.maxlen - 1)
    
    # this is our seed string (i.e. input sequence into the model)
    generated = ''

    # start the sentence at index `start_index` and include the next `dctk.maxlen` number of chars
    sentence = text[start_index: start_index + dctk.maxlen]

    # add to generated
    generated += sentence

    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    # use model to predict what the next 40 chars should be that follow the seed string
    for i in range(40):

        # shape of a single sample in a rank 3 tensor 
        x_dims = (1, dctk.maxlen, dctk.n_features)
        # create an array of zeros with shape x_dims
        # recall that Python treats 0 and False as equal
        x_pred = np.zeros(x_dims)

        # create a seq vector for our randomly selected sequence 
        # i.e. create a numerical encoding for each char in the sequence 
        for t, char in enumerate(sentence):
            # for sample 0 in seq index t and character `char` encode a 1 (which is the same as a TRUE)
            x_pred[0, t, dctk.char_int[char]] = 1

        # next, take the seq vector and pass into model to get a prediction of what the next char should be 
        preds = model.predict(x_pred, verbose=0)[0]
        # use the sample helper function to get index for next char 
        next_index = sample(preds)
        # use look up dict to get next char 
        next_char = dctk.int_char[next_index]

        # append next char to sequence 
        sentence = sentence[1:] + next_char 
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()
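
The `temperature` argument of `sample` rescales the predicted distribution before a character is drawn. The sketch below isolates that rescaling so its effect can be seen on a made-up distribution. The function name `apply_temperature` is hypothetical, and the small epsilon and the max-subtraction are my additions for numerical safety; they are not in the helper above.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector: low temperature sharpens it, high flattens it."""
    preds = np.asarray(preds, dtype="float64")
    # epsilon (an addition, not in the original helper) avoids log(0)
    logits = np.log(preds + 1e-12) / temperature
    # subtracting the max before exponentiating avoids overflow
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / np.sum(exp_logits)


probs = [0.7, 0.2, 0.1]  # made-up distribution over three characters
for temp in (0.5, 1.0, 2.0):
    print(temp, np.round(apply_temperature(probs, temp), 3))
```

Values below 1.0 make the most likely character even more likely (safer, more repetitive text), while values above 1.0 give rarer characters a better chance (more varied, more error-prone text).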

In [32]:
# need this for on_epoch_end()
text = " ".join(clean_sonnets)

In [33]:
# create callback object that will print out text generation at the end of each epoch 
# use for real-time monitoring of model performance
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

----
### Train Model

Build a text generation model using LSTMs. Feel free to reference the model used in the guided project. 

It is recommended that you train this model for at least 50 epochs (but more if your computer can handle it). 

You are free to change up the architecture as you wish. 

Just in case you have difficulty training a model, there is a pre-trained model saved to a file called `trained_text_gen_model.h5` that you can load in (the same way that you learned how to load in Keras models in Sprint 2 Module 4). 

In [34]:
from tensorflow.keras.optimizers import Adam

In [38]:
# build text generation model layer by layer 
# fit model

# YOUR CODE HERE
model = Sequential()
model.add(LSTM(
    256,
    input_shape=(dctk.maxlen, dctk.n_features),
    return_sequences=False
))
model.add(Dense(
    dctk.n_features,
    activation='softmax'
))
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=0.003))
model.fit(X, y, batch_size=128, epochs=200, callbacks=[print_callback])

Epoch 1/200

----- Generating text after Epoch: 0
----- Generating with seed: " contains and that is this and this with"
 contains and that is this and this withee uthe teali nau tbetmms os se di mngsa
Epoch 2/200

----- Generating text after Epoch: 1
----- Generating with seed: "alse womens fashion an eye more bright t"
alse womens fashion an eye more bright toor arayntut tort reter voruls wor nos d
Epoch 3/200

----- Generating text after Epoch: 2
----- Generating with seed: "dead since i left you mine eye is in my "
dead since i left you mine eye is in my des ewedrthka garess ow owler thaci thec
Epoch 4/200

----- Generating text after Epoch: 3
----- Generating with seed: " hungry ocean gain advantage on the king"
 hungry ocean gain advantage on the kinghs as in my tul thoustoutain nates dat i
Epoch 5/200

----- Generating text after Epoch: 4
----- Generating with seed: "eated till nature as she wrought thee fe"
eated till nature as she wrought thee feit they live i dey ale thit i n

<keras.callbacks.History at 0x7feca470c7c0>

In [39]:
# save trained model to file 
model.save("trained_text_gen_model200.h5")

### Let's Play With Our Trained Model 

Now that we have a trained model that, though far from perfect, can generate actual English words, we can look at the predictions to continue learning more about how a text generation model works.

We can also take this as an opportunity to unpack the `on_epoch_end` function to understand better how it works. 

In [None]:
# this is our joined clean sonnet data
text

In [41]:
# randomly pick a starting index 
# will be used to take a random sequence of chars from `text`
# run this cell a few times and you'll see `start_index` is random
start_index = random.randint(0, len(text) - dctk.maxlen - 1)
start_index

59429

In [42]:
# next use the randomly selected starting index to sample a sequence from the `text`

# this is our seed string (i.e., input sequence into the model)
generated = ''

# start the sentence at index `start_index` and include the next `dctk.maxlen` number of chars
sentence = text[start_index: start_index + dctk.maxlen]

# add to generated
generated += sentence

generated

'm long hence as he shows now my love is '

In [43]:
# this block of code lets us know what the seed string is 
# i.e., the input sequence into the model
print('----- Generating with seed: "' + sentence + '"')
sys.stdout.write(generated)

----- Generating with seed: "m long hence as he shows now my love is "
m long hence as he shows now my love is 

40

In [44]:
# use model to predict what the next 40 chars should be that follow the seed string
for i in range(40):

    # shape of a single sample in a rank 3 tensor 
    x_dims = (1, dctk.maxlen, dctk.n_features)
    # create an array of zeros with shape x_dims
    # recall that Python treats 0 and False as equal
    x_pred = np.zeros(x_dims)

    # create a seq vector for our randomly selected sequence 
    # i.e. create a numerical encoding for each char in the sequence 
    for t, char in enumerate(sentence):
        # for sample 0 in seq index t and character `char` encode a 1 (which is the same as a TRUE)
        x_pred[0, t, dctk.char_int[char]] = 1

    # next, take the seq vector and pass into model to get a prediction of what the next char should be 
    preds = model.predict(x_pred, verbose=0)[0]
    # use the sample helper function to get index for next char 
    next_index = sample(preds)
    # use look up dict to get next char 
    next_char = dctk.int_char[next_index]

    # append next char to sequence 
    sentence = sentence[1:] + next_char 

In [45]:
# this is the seed string
generated

'm long hence as he shows now my love is '

In [46]:
# these are the 40 chars that the model thinks should come after the seed string
sentence

'thy figuting me of should hath my crues '

In [47]:
# now put it all together
generated + sentence

'm long hence as he shows now my love is thy figuting me of should hath my crues '
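
The walkthrough above works because exactly 40 characters are generated, so the 40-character window ends up holding only new text. A more general sketch keeps the generated text in its own variable and accepts any prediction function, which also makes the loop testable without a trained model. Everything here is hypothetical: `generate_text`, the `predict_fn` argument, and the toy look-ups. It samples directly from the predicted probabilities rather than calling the `sample` helper.

```python
import numpy as np


def generate_text(predict_fn, seed, char_int, int_char, n_chars=40, rng=None):
    """Extend `seed` by `n_chars` characters, one sampled character at a time.

    `predict_fn` takes a (1, len(seed), n_features) one-hot array and returns
    a probability for each character index.
    """
    if rng is None:
        rng = np.random.default_rng()
    window, generated = seed, ""
    for _ in range(n_chars):
        # one-hot encode the current window
        x_pred = np.zeros((1, len(window), len(char_int)))
        for t, char in enumerate(window):
            x_pred[0, t, char_int[char]] = 1
        probs = np.asarray(predict_fn(x_pred), dtype="float64")
        probs = probs / probs.sum()  # guard against rounding drift
        next_char = int_char[int(rng.choice(len(probs), p=probs))]
        generated += next_char
        # slide the window: drop the oldest char, append the new one
        window = window[1:] + next_char
    return seed + generated


# toy stand-ins for dctk.char_int, dctk.int_char, and the model
toy_char_int = {"a": 0, "b": 1}
toy_int_char = {0: "a", 1: "b"}


def always_b(x_pred):
    """Fake model that always predicts 'b' with probability 1."""
    return [0.0, 1.0]


result = generate_text(always_b, "ab", toy_char_int, toy_int_char, n_chars=3)
print(result)
```

With the trained model, `predict_fn` could be something like `lambda x: model.predict(x, verbose=0)[0]`, with `dctk.char_int` and `dctk.int_char` as the look-ups.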

# Resources and Stretch Goals

## Stretch Goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g., plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://www.tensorflow.org/text/tutorials/text_generation) - code for training a character-level RNN to generate Shakespeare-style text
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN