
## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)


Our goal in this project is to build a Shakespeare Sonnet Generator using **Recurrent Neural Networks and LSTMs**.<br>
Given a prompt of a few words as input, the generator's task is to produce follow-on text that reads like a Shakespeare sonnet!<br>


To build our Sonnet Generator we will use a **sequence model**. Given a short sequence, a sequence model predicts the **most likely next item in the sequence**. Sequence models are astonishingly versatile and powerful, because the **sequence** we want to predict can be quite general! It can be composed of **words**, or of **characters**, or of **musical notes**, or of data points in a **time series** such as EKG voltages, or stock prices, or even a sequence of **DNA nucleotides**!

We will train our model on the entire corpus of Shakespeare's Sonnets, and the model will learn from that data the most likely patterns of characters.

# Imports

In [1]:
import random
import sys
import os

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.callbacks import LambdaCallback

from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Bidirectional
from tensorflow.keras.layers import LSTM

%matplotlib inline

# import a custom text data preparation class
!wget https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-3-Deep-Learning/main/module1-rnn-and-lstm/data_cleaning_toolkit_class.py
from data_cleaning_toolkit_class import data_cleaning_toolkit

--2022-06-01 14:37:54--  https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-3-Deep-Learning/main/module1-rnn-and-lstm/data_cleaning_toolkit_class.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6666 (6.5K) [text/plain]
Saving to: ‘data_cleaning_toolkit_class.py’


2022-06-01 14:37:54 (59.7 MB/s) - ‘data_cleaning_toolkit_class.py’ saved [6666/6666]



### Use [requests](https://requests.readthedocs.io/en/master/user/quickstart/#make-a-request) to pull data from a URL

Download the Shakespeare Sonnets from the Gutenberg website. 


In [2]:
# download all of Shakespeare's Sonnets from the Project Gutenberg website
url_shakespeare_sonnets = "https://www.gutenberg.org/cache/epub/1041/pg1041.txt"

# use requests and the url to download all of the sonnets
data = requests.get(url_shakespeare_sonnets)

In [3]:
# extract the downloaded text from the requests object and save to new variable
raw_text_data = data.text

In [4]:
# confirm the data type of `raw_text_data`
assert isinstance(raw_text_data, str)

### Data Cleaning

In [5]:
# preview data to get an idea of how to begin cleaning
raw_text_data[:3000]

'\ufeffThe Project Gutenberg eBook of The Sonnets, by William Shakespeare\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.\r\n\r\nTitle: The Sonnets\r\n\r\nAuthor: William Shakespeare\r\n\r\nRelease Date: September, 1997 [eBook #1041]\r\n[Most recently updated: November 25, 2021]\r\n\r\nLanguage: English\r\n\r\n\r\nProduced by:  the Project Gutenberg Shakespeare Team\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK THE SONNETS ***\r\n\r\n\r\n\r\n\r\nTHE SONNETS\r\n\r\nby William Shakespeare\r\n\r\n\r\n\r\n\r\n  I\r\n\r\n  From fairest creatures we desire increase,\r\n  That t

In [7]:
# split the text into separate lines and save the result to new variable
split_data = raw_text_data.split('\r\n')
split_data

['\ufeffThe Project Gutenberg eBook of The Sonnets, by William Shakespeare',
 '',
 'This eBook is for the use of anyone anywhere in the United States and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. You may copy it, give it away or re-use it under the terms',
 'of the Project Gutenberg License included with this eBook or online at',
 'www.gutenberg.org. If you are not located in the United States, you',
 'will have to check the laws of the country where you are located before',
 'using this eBook.',
 '',
 'Title: The Sonnets',
 '',
 'Author: William Shakespeare',
 '',
 'Release Date: September, 1997 [eBook #1041]',
 '[Most recently updated: November 25, 2021]',
 '',
 'Language: English',
 '',
 '',
 'Produced by:  the Project Gutenberg Shakespeare Team',
 '',
 '*** START OF THE PROJECT GUTENBERG EBOOK THE SONNETS ***',
 '',
 '',
 '',
 '',
 'THE SONNETS',
 '',
 'by William Shakespeare',
 '',
 '',
 '',
 '',
 '  I',
 '',
 '  From fairest crea
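The preview above shows two artifacts of the raw download: a leading byte-order mark (`\ufeff`) and Windows-style `\r\n` line endings. As a minimal sketch (the helper name `to_lines` is hypothetical and not part of the assignment), the mark can be stripped first and `str.splitlines()` used so the split does not depend on one particular line-ending convention:

```python
def to_lines(raw_text):
    """Strip a leading byte-order mark, then split on any line ending."""
    return raw_text.lstrip('\ufeff').splitlines()


# toy input that mimics the start of the downloaded file
sample_text = '\ufeffTitle: The Sonnets\r\n\r\n  I\r\n'
print(to_lines(sample_text))
```

Unlike `split('\r\n')`, `splitlines()` does not leave an empty string at the end when the text finishes with a line break.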

Drop all the boilerplate text (i.e. titles and descriptions) and extra whitespace so that we are left with only the sonnets themselves.

In [22]:
# find first line of sonnets
split_data[:80]

['\ufeffThe Project Gutenberg eBook of The Sonnets, by William Shakespeare',
 '',
 'This eBook is for the use of anyone anywhere in the United States and',
 'most other parts of the world at no cost and with almost no restrictions',
 'whatsoever. You may copy it, give it away or re-use it under the terms',
 'of the Project Gutenberg License included with this eBook or online at',
 'www.gutenberg.org. If you are not located in the United States, you',
 'will have to check the laws of the country where you are located before',
 'using this eBook.',
 '',
 'Title: The Sonnets',
 '',
 'Author: William Shakespeare',
 '',
 'Release Date: September, 1997 [eBook #1041]',
 '[Most recently updated: November 25, 2021]',
 '',
 'Language: English',
 '',
 '',
 'Produced by:  the Project Gutenberg Shakespeare Team',
 '',
 '*** START OF THE PROJECT GUTENBERG EBOOK THE SONNETS ***',
 '',
 '',
 '',
 '',
 'THE SONNETS',
 '',
 'by William Shakespeare',
 '',
 '',
 '',
 '',
 '  I',
 '',
 '  From fairest crea

In [20]:
# find last line of sonnets
split_data[-400:]

['  In loving thee thou know’st I am forsworn,',
 '  But thou art twice forsworn, to me love swearing;',
 '  In act thy bed-vow broke, and new faith torn,',
 '  In vowing new hate after new love bearing:',
 '  But why of two oaths’ breach do I accuse thee,',
 '  When I break twenty? I am perjur’d most;',
 '  For all my vows are oaths but to misuse thee,',
 '  And all my honest faith in thee is lost:',
 '  For I have sworn deep oaths of thy deep kindness,',
 '  Oaths of thy love, thy truth, thy constancy;',
 '  And, to enlighten thee, gave eyes to blindness,',
 '  Or made them swear against the thing they see;',
 '    For I have sworn thee fair; more perjur’d I,',
 '    To swear against the truth so foul a lie!',
 '',
 '  CLIII',
 '',
 '  Cupid laid by his brand and fell asleep:',
 '  A maid of Dian’s this advantage found,',
 '  And his love-kindling fire did quickly steep',
 '  In a cold valley-fountain of that ground;',
 '  Which borrow’d from this holy fire of Love,',
 '  A dateless 

**Use list index slicing to remove the titles and descriptions, so we only have the sonnets.**


In [9]:
# find first and last lines of sonnet
first_sonnet_line = '  From fairest creatures we desire increase,'
last_sonnet_line = '    Love’s fire heats water, water cools not love.'

# find index boundaries (start, end)
start_index = split_data.index(first_sonnet_line)
end_index = split_data.index(last_sonnet_line)

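# use index slicing to isolate the sonnet lines from the boilerplate text
# note: the slice end is exclusive, so `last_sonnet_line` itself is left out;
# slice to `end_index + 1` to keep it (the line counts below would then grow by one)
sonnets = split_data[start_index:end_index]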

In [17]:
# see how many lines were removed
print(len(split_data))
print(len(sonnets))

3004
2615
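Python slices exclude their end index, so a slice that ends at the index of the last wanted line stops one line short of it. A toy illustration (the list below is made up for this example):

```python
lines = ['header', 'first', 'middle', 'last', 'footer']
start = lines.index('first')
end = lines.index('last')

exclusive = lines[start:end]      # stops before 'last'
inclusive = lines[start:end + 1]  # keeps 'last'

print(exclusive)
print(inclusive)
```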


Many of the remaining lines are still not sonnet text: blank separator lines and Roman-numeral headings such as `XIII` are mixed in with the verse.

In [11]:
# these non-sonnet lines have far fewer characters than the actual sonnet lines
sonnets[200:240]

['    And nothing ’gainst Time’s scythe can make defence',
 '    Save breed, to brave him when he takes thee hence.',
 '',
 '  XIII',
 '',
 '  O! that you were your self; but, love you are',
 '  No longer yours, than you your self here live:',
 '  Against this coming end you should prepare,',
 '  And your sweet semblance to some other give:',
 '  So should that beauty which you hold in lease',
 '  Find no determination; then you were',
 '  Yourself again, after yourself’s decease,',
 '  When your sweet issue your sweet form should bear.',
 '  Who lets so fair a house fall to decay,',
 '  Which husbandry in honour might uphold,',
 '  Against the stormy gusts of winter’s day',
 '  And barren rage of death’s eternal cold?',
 '    O! none but unthrifts. Dear my love, you know,',
 '    You had a father: let your son say so.',
 '',
 '  XIV',
 '',
 '  Not from the stars do I my judgement pluck;',
 '  And yet methinks I have astronomy,',
 '  But not to tell of good or evil luck,',
 '  Of plagu

In [23]:
# use best judgement to decide on a good value for
# the minimum number of characters that a sonnet line should have
min_chars = 15

# use `min_chars` to filter out all the non-sonnet lines
filtered_sonnets = [line for line in sonnets if len(line) > min_chars]
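A length threshold works here because headings and blank lines are short, but it would also discard any genuine verse line that happened to be short. An alternative sketch filters on content instead, assuming that every heading consists only of Roman-numeral letters. The helper `is_sonnet_line` and the pattern `ROMAN_HEADING` are hypothetical names introduced for this illustration:

```python
import re

# a line made up only of Roman-numeral letters is treated as a heading
ROMAN_HEADING = re.compile(r'^[IVXLCDM]+$')


def is_sonnet_line(line):
    """Return True for non-blank lines that are not Roman-numeral headings."""
    stripped = line.strip()
    return bool(stripped) and not ROMAN_HEADING.match(stripped)


example_lines = [
    '',
    '  XIII',
    '',
    '  O! that you were your self; but, love you are',
    '  I',
]
kept = [line for line in example_lines if is_sonnet_line(line)]
print(kept)
```

The trade-off is that a verse line consisting solely of such letters would be dropped, which is unlikely in this corpus but worth checking on other texts.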

In [24]:
# view section of text to determine next cleaning steps
filtered_sonnets[:300]

['  From fairest creatures we desire increase,',
 '  That thereby beauty’s rose might never die,',
 '  But as the riper should by time decease,',
 '  His tender heir might bear his memory:',
 '  But thou, contracted to thine own bright eyes,',
 '  Feed’st thy light’s flame with self-substantial fuel,',
 '  Making a famine where abundance lies,',
 '  Thy self thy foe, to thy sweet self too cruel:',
 '  Thou that art now the world’s fresh ornament,',
 '  And only herald to the gaudy spring,',
 '  Within thine own bud buriest thy content,',
 '  And tender churl mak’st waste in niggarding:',
 '    Pity the world, or else this glutton be,',
 '    To eat the world’s due, by the grave and thee.',
 '  When forty winters shall besiege thy brow,',
 '  And dig deep trenches in thy beauty’s field,',
 '  Thy youth’s proud livery so gazed on now,',
 '  Will be a tatter’d weed of small worth held:',
 '  Then being asked, where all thy beauty lies,',
 '  Where all the treasure of thy lusty days;',
 ' 

### Use Custom Data Cleaning Tool 

We still need to remove all the punctuation and case-normalize the text.

Use the appropriate methods in the `data_cleaning_toolkit` to clean your data.


In [25]:
# instantiate the data_cleaning_toolkit class
dctk = data_cleaning_toolkit()

In [26]:
# use data_cleaning_toolkit to remove punctuation and to case normalize
clean_sonnets = [dctk.clean_data(text) for text in filtered_sonnets]

In [27]:
# view cleaned sonnets
display(clean_sonnets)
print(len(clean_sonnets))

['from fairest creatures we desire increase',
 'that thereby beautys rose might never die',
 'but as the riper should by time decease',
 'his tender heir might bear his memory',
 'but thou contracted to thine own bright eyes',
 'feedst thy lights flame with selfsubstantial fuel',
 'making a famine where abundance lies',
 'thy self thy foe to thy sweet self too cruel',
 'thou that art now the worlds fresh ornament',
 'and only herald to the gaudy spring',
 'within thine own bud buriest thy content',
 'and tender churl makst waste in niggarding',
 'pity the world or else this glutton be',
 'to eat the worlds due by the grave and thee',
 'when forty winters shall besiege thy brow',
 'and dig deep trenches in thy beautys field',
 'thy youths proud livery so gazed on now',
 'will be a tatterd weed of small worth held',
 'then being asked where all thy beauty lies',
 'where all the treasure of thy lusty days',
 'to say within thine own deep sunken eyes',
 'were an alleating shame and thriftl

2154
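The implementation of `clean_data` is not shown in this notebook. Judging only from the output above (lowercase text, no punctuation, no leading spaces), a rough stand-in could look like the sketch below. The function `simple_clean` is hypothetical and may differ from what the toolkit actually does:

```python
import re


def simple_clean(line):
    """Lowercase a line, drop everything except a-z and spaces, tidy spacing."""
    lowered = line.lower()
    letters_only = re.sub(r'[^a-z ]', '', lowered)
    return re.sub(r' +', ' ', letters_only).strip()


raw_line = '  Feed’st thy light’s flame with self-substantial fuel,'
print(simple_clean(raw_line))
```

Note that removing apostrophes and hyphens fuses word parts (`beauty’s` becomes `beautys`), which matches the cleaned output shown above.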


### Use Your Data Tool to Create Character Sequences for the LSTM model

The `create_char_sequences` method requires a parameter called `maxlen`, which is responsible for setting the maximum sequence length.

To determine a good max sequence length, first calculate some statistics! 

In [28]:
def calc_stats(corpus):
    """
    Calculates statistics on the length of every line in the sonnets
    """
    
    # calculates each sonnet's line length
    doc_lens = [len(line) for line in corpus]

    # calculate and return the mean, median, std, max, min of the doc lengths

    return [np.mean(doc_lens),
            np.median(doc_lens),
            np.std(doc_lens),
            np.max(doc_lens),
            np.min(doc_lens)]


In [29]:
# sonnet line length statistics 
mean, med, std, max_, min_ = calc_stats(clean_sonnets)
mean, med, std, max_, min_ 

(40.87743732590529, 41.0, 4.041890872647064, 57, 27)

In [30]:
# from the results of the sonnet line length statistics
# use judgement to select a value for maxlen

# a good value could be half the median length of a sonnet line
maxlen = 20
dctk.create_char_sequences(clean_sonnets, maxlen=maxlen)

Created 18037 sequences.
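The body of `create_char_sequences` is not reproduced here, but character-level sequence models are commonly prepared with a sliding window: take `maxlen` characters as the input and the character that follows as the target, then move the window forward by a fixed step. The sketch below illustrates that idea. The function name `make_char_sequences` and the `step` parameter are assumptions for illustration, not the toolkit's API:

```python
def make_char_sequences(text, maxlen, step):
    """Slide a window of `maxlen` chars over `text`, moving `step` chars at a time."""
    sequences = []   # input windows
    next_chars = []  # the character that follows each window
    for i in range(0, len(text) - maxlen, step):
        sequences.append(text[i:i + maxlen])
        next_chars.append(text[i + maxlen])
    return sequences, next_chars


toy_text = 'from fairest creatures'
toy_sequences, toy_next_chars = make_char_sequences(toy_text, maxlen=5, step=3)
for seq, nxt in zip(toy_sequences, toy_next_chars):
    print(repr(seq), '->', repr(nxt))
```

As a sanity check on this assumption, a window of 20 moved in steps of 5 over a text of 90203 characters (the length of the joined sonnets reported later in this notebook) yields `len(range(0, 90203 - 20, 5))`, which is 18037 windows. That agrees with the count printed above, though it does not prove the toolkit works this way.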


Take a look at the `data_cleaning_toolkit_class.py` file. 

In the first 4 lines of code in the `create_char_sequences` method, the attributes `n_features` and `unique_chars` are created. <br>
Inspect these two attributes in the cells below. The number of unique characters equals the number of input features for our model, because each character is one-hot encoded; it also equals the number of output classes, because each unique character is a possible prediction for this classification model.

In [31]:
# number of input features for our LSTM model
dctk.n_features

27

In [32]:
# unique characters that appear in our sonnets 
dctk.unique_chars

['t',
 ' ',
 'b',
 'k',
 'g',
 'r',
 'z',
 'u',
 'e',
 'i',
 'q',
 'y',
 'p',
 'a',
 'w',
 'd',
 'v',
 'l',
 'h',
 's',
 'm',
 'f',
 'c',
 'x',
 'n',
 'j',
 'o']

### Use Our Data Tool to Create X and Y Splits

TODO: provide a walkthrough of data_cleaning_toolkit with unit tests




In [33]:
# use data_cleaning_toolkit to separate X and y
X, y = dctk.create_X_and_Y()

In [34]:
# our input array isn't a matrix - it's a rank three tensor
X.shape

(18037, 20, 27)

`X.shape` contains three numbers (*n1*, *n2*, *n3*).

*n1* tells us the number of samples that we have. But what about the other two?

In [35]:
# first index returns a single sample, which we can see is a sequence 
first_sample_index = 0 
X[first_sample_index]

array([[False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False,  True, False, False, False, False, False],
       [False, False, False, False, False,  True, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False],
       [False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False,  True],
       [False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
        False, False,  True, False, False, False, False, False, False],
       [False,  True, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False, False,
  

Notice that each sequence (i.e., $X[i]$, where $i$ is some index value) is `maxlen` long and <br>
has a number of features equal to `dctk.n_features`.

In [36]:
# each sequence is maxlen long and has dctk.n_features number of features
X[first_sample_index].shape

(20, 27)

**Each row corresponds to a character vector,** and there are `maxlen` character vectors.

**Each column corresponds to a unique character,** and there are `dctk.n_features` features.
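To make the (samples, `maxlen`, features) layout concrete, here is a minimal sketch of how text windows can be turned into a boolean rank-3 tensor. It is an assumption about what `create_X_and_Y` does, chosen to be consistent with the shapes shown above; `one_hot_encode` and the toy vocabulary are hypothetical:

```python
import numpy as np


def one_hot_encode(sequences, next_chars, char_int):
    """Encode windows as X (n, maxlen, n_features) and targets as y (n, n_features)."""
    n_samples = len(sequences)
    maxlen = len(sequences[0])
    n_features = len(char_int)

    X = np.zeros((n_samples, maxlen, n_features), dtype=bool)
    y = np.zeros((n_samples, n_features), dtype=bool)

    for i, seq in enumerate(sequences):
        for t, char in enumerate(seq):
            X[i, t, char_int[char]] = True  # row t, column of this character
        y[i, char_int[next_chars[i]]] = True
    return X, y


toy_char_int = {' ': 0, 'a': 1, 'b': 2}
toy_X, toy_y = one_hot_encode(['ab ', 'b a'], ['a', 'b'], toy_char_int)
print(toy_X.shape, toy_y.shape)
```

Every row of every sample holds exactly one `True`, which is what makes the encoding "one-hot".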


In [37]:
# index for a single character vector 
first_char_vect_index = 0
X[first_sample_index][first_char_vect_index]

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False])

Notice that there is a single `True` value, and all the rest of the values are `False`. 

This is a one-hot encoding for which character appears at each index within a sequence. Specifically, the cell above is looking at the first character in the sequence.

Only a single character can appear as the first character in a sequence, so there will be a single `True` value, and the rest will be `False`. 

Say that `True` appears at index $i$ of a character vector, where $i$ can be any valid index. To find out which character the vector encodes, look up $i$ in the integer-to-character dictionary `dctk.int_char`.

In [38]:
# take a look at the index to character dictionary
# if a TRUE appears in the 0th index of a character vector,
# then we know that whatever char you see below next to the 0th key 
# is the character that character vector is encoding for
dctk.int_char

{0: 't',
 1: ' ',
 2: 'b',
 3: 'k',
 4: 'g',
 5: 'r',
 6: 'z',
 7: 'u',
 8: 'e',
 9: 'i',
 10: 'q',
 11: 'y',
 12: 'p',
 13: 'a',
 14: 'w',
 15: 'd',
 16: 'v',
 17: 'l',
 18: 'h',
 19: 's',
 20: 'm',
 21: 'f',
 22: 'c',
 23: 'x',
 24: 'n',
 25: 'j',
 26: 'o'}

In [39]:
# let's look at an example to tie it all together
seq_len_counter = 0

# loop over the character vectors of a single sample
for char_vect in X[first_sample_index]:

    # get index with max value, which will be the one TRUE value
    index_with_TRUE_val = np.argmax(char_vect)

    print(dctk.int_char[index_with_TRUE_val])

    seq_len_counter += 1

print("Sequence length: {}".format(seq_len_counter))

f
r
o
m
 
f
a
i
r
e
s
t
 
c
r
e
a
t
u
r
Sequence length: 20
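The same decoding can be done without an explicit loop by taking `np.argmax` along the feature axis. The helper `decode_sequence` and the toy dictionary below are hypothetical names used only for this sketch:

```python
import numpy as np


def decode_sequence(one_hot_seq, int_char):
    """Turn a (maxlen, n_features) one-hot array back into a string."""
    indices = np.argmax(one_hot_seq, axis=1)  # one index per character vector
    return ''.join(int_char[int(i)] for i in indices)


toy_int_char = {0: 'a', 1: 'b', 2: ' '}
toy_seq = np.array([[True, False, False],
                    [False, False, True],
                    [False, True, False]])
decoded = decode_sequence(toy_seq, toy_int_char)
print(repr(decoded))
```

Applied to the notebook's data as `decode_sequence(X[first_sample_index], dctk.int_char)`, this should return the same characters printed above as a single string.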


----


### Build a Shakespeare Sonnet Text Generation Model

Now that we have prepped our data, let's finally build out our character generation model.<br>

First, we'll create a callback to monitor the training -- by printing a sample of text generated by the model at the end of each epoch.

Helper function to generate a sample character:

In [40]:
def sample(preds, temperature=1.0):
    """
    Helper function to generate a sample character
    Input is a predictions vector from our model,
    for example a set of 27 character probabilities
    Output is the index of the generated character
    """
    # convert predictions to a float64 array
    preds = np.asarray(preds).astype('float64')

    # use the temperature hyper-parameter to "warp"
    # (sharpen or spread out) the probability distribution:
    # temperature < 1 sharpens it, temperature > 1 flattens it
    preds = np.log(preds) / temperature

    # re-normalize with a softmax to get a new list of probabilities
    # corresponding to the "warped" probability distribution
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)

    # Draw a single sample from a multinomial distribution, given these probabilities
    #   The sample will be a one-hot encoded character
    # Notes on the np.random.multinomial() function:
    #   The first argument is the number of "trials" we want: 1 in this case
    #   The second argument is the list of probabilities for each character
    #   The third argument is the number of sets of "trials" we want: 1 in this case
    # By analogy with a dice-rolling experiment:
    #   this "trial" consists of a single "throw" of a 27-sided die;
    #   each face corresponds to a character and its associated probability
    probas = np.random.multinomial(1, preds, 1)

    # return the index of the sampled character, i.e. the position of the
    # single 1 in the one-hot sample (not necessarily the most probable character)
    return np.argmax(probas)
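To see what the temperature does on its own, the warping step can be isolated from the random draw. The sketch below repeats the same log, divide, and re-normalize arithmetic on a made-up three-character distribution; `apply_temperature` is a hypothetical helper, not part of the assignment code:

```python
import numpy as np


def apply_temperature(preds, temperature):
    """Warp a probability vector: T < 1 sharpens it, T > 1 flattens it."""
    preds = np.asarray(preds, dtype='float64')
    warped = np.exp(np.log(preds) / temperature)
    return warped / np.sum(warped)


probs = [0.7, 0.2, 0.1]
sharp = apply_temperature(probs, 0.5)  # favors the most likely character more
same = apply_temperature(probs, 1.0)   # leaves the distribution unchanged
flat = apply_temperature(probs, 2.0)   # moves toward a uniform distribution

print(np.round(sharp, 3))
print(np.round(same, 3))
print(np.round(flat, 3))
```

Lower temperatures make generated text more repetitive but more conservative, while higher temperatures produce more varied and more error-prone text.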


Create the `on_epoch_end` function to be passed into `LambdaCallback()`

In [41]:
def on_epoch_end(epoch, _):
    """
    Function invoked at the end of each epoch.
    Prints the text generated by our model.
    Relies on the globals `text` and `model`, which are defined in later cells.
    """
    
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    

    # randomly pick a starting index 
    # will be used to take a random sequence of chars from `text`
    start_index = random.randint(0, len(text) - dctk.maxlen - 1)
    
    # this is our seed string (i.e. input sequence into the model)
    generated = ''

    # start the sentence at index `start_index` and 
    # include the next `dctk.maxlen` number of chars
    sentence = text[start_index: start_index + dctk.maxlen]

    # add to generated
    generated += sentence

    
    print('----- Generating with seed: "' + sentence + '"')
    sys.stdout.write(generated)
    
    # use model to predict what the next maxlen 
    # chars should be that follow the seed string
    for i in range(maxlen):

        # shape of a single sample in a rank 3 tensor 
        x_dims = (1, dctk.maxlen, dctk.n_features)
        # create an array of zeros with shape x_dims
        # recall that python considers zeros and boolean FALSE as the same
        x_pred = np.zeros(x_dims)

        # create a seq vector for our randomly selected sequence
        # i.e. create a numerical encoding for each char in the sequence 
        for t, char in enumerate(sentence):
            # for sample 0 in seq index t and character `char`
            # encode a 1 (which is the same as a TRUE)
            x_pred[0, t, dctk.char_int[char]] = 1

        # pass the sequence vector into the model to get
        # a prediction of what the next char should be 
        preds = model.predict(x_pred, verbose=0)[0]
        
        # use the sample helper function to get index for next char 
        next_index = sample(preds)
        # use look up dict to get next char 
        next_char = dctk.int_char[next_index]

        # append next char to sequence 
        sentence = sentence[1:] + next_char 
        
        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()

In [42]:
# need this for on_epoch_end()
text = " ".join(clean_sonnets)
print(f'All of Shakespeare\'s sonnets comprise about {len(text)} characters.')

All of Shakespeare's sonnets comprise about 90203 characters


Create the callback object

In [43]:
# create callback obj that will print text generation at the end of each epoch 
# use for real-time monitoring of model performance
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

----
### Build and Train Model

Build a text generation model using LSTMs.


In [45]:
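# build text generation model layer by layer
model = Sequential([
    # note: activation='relu' overrides the LSTM default of 'tanh'
    LSTM(128,
         input_shape=(dctk.maxlen, dctk.n_features),
         activation='relu',
         return_sequences=False),
    # one output unit per unique character; softmax gives next-char probabilities
    Dense(dctk.n_features,
          activation='softmax')
])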

model.compile(loss='categorical_crossentropy',
              optimizer='adam')

# fit model
history = model.fit(X, y,
          batch_size=200,
          epochs=150,
          callbacks=[print_callback])

Epoch 1/150
----- Generating text after Epoch: 0
----- Generating with seed: "curse being fond on "
curse being fond on tarilmhm  v c hlileh
Epoch 2/150
----- Generating text after Epoch: 1
----- Generating with seed: " woe and moan the ex"
 woe and moan the exkos mh cshe ail oo  
Epoch 3/150
----- Generating text after Epoch: 2
----- Generating with seed: "nd sorrows end thy b"
nd sorrows end thy bsit ygtinhf rey ra y
Epoch 4/150
----- Generating text after Epoch: 3
----- Generating with seed: "enough am i that vex"
enough am i that vexe meatr s erer vhir 
Epoch 5/150
----- Generating text after Epoch: 4
----- Generating with seed: " that said i hate to"
 that said i hate toug forens tof while 
Epoch 6/150
----- Generating text after Epoch: 5
----- Generating with seed: "w to greet i hate sh"
w to greet i hate shea din ar he thy lis
Epoch 7/150
----- Generating text after Epoch: 6
----- Generating with seed: " as he takes from yo"
 as he takes from yow leesound iy snyee 
Epoch 8/150
...
Epoch 83/150
----- Generating text after Epoch: 82
----- Generating with seed: "still farther off fr"
still farther off from have detay be and
Epoch 84/150
----- Generating text after Epoch: 83
----- Generating with seed: "vely heat still to e"
vely heat still to enrres lieven and thy
Epoch 85/150
----- Generating text after Epoch: 84
----- Generating with seed: " written embassage t"
 written embassage that dot recave is th
Epoch 86/150
----- Generating text after Epoch: 85
----- Generating with seed: "ase dost thou upon t"
ase dost thou upon thy grovsoup lors whl
Epoch 87/150
----- Generating text after Epoch: 86
----- Generating with seed: "t with mine compare "
t with mine compare ta the hark which it
Epoch 88/150
----- Generating text after Epoch: 87
----- Generating with seed: "ight but day by nigh"
ight but day by night off lif no dilight
Epoch 89/150
----- Generating text after Epoch: 88
----- Generating with seed: "ad and in my madness"
ad and in my madness the o

### Save the trained model to a file

In [46]:
# save trained model to file 
model.save("trained_text_gen_model.h5")

### Try Out the Trained Model 

Now that we have a trained model that, though far from perfect, can generate actual English words, we can look at the predictions to continue learning more about how a text generation model works.


In [47]:
# this is our joined clean sonnet data
text



In [86]:
# randomly pick a starting index 
# will be used to take a random sequence of chars from `text`
start_index = random.randint(0, len(text) - dctk.maxlen - 1)
start_index

69013

In [87]:
# use the randomly selected starting index to sample a sequence from the `text`

# this is our seed string (i.e., input sequence into the model)
generated = ''

# start the sentence at index `start_index` and
# include the next `dctk.maxlen` number of chars
sentence = text[start_index: start_index + dctk.maxlen]

# add to generated
generated += sentence

In [88]:
# display the "seed string" i.e. the input sequence into the model
print('----- Input seed: "' + sentence + '"')

----- Input seed: "did i frame my feedi"


In [89]:
# use model to predict what the next maxlen 
# chars should be that follow the seed string
for i in range(maxlen):

    # shape of a single sample in a rank 3 tensor 
    x_dims = (1, dctk.maxlen, dctk.n_features)
    # create an array of zeros with shape x_dims
    x_pred = np.zeros(x_dims)

    # create a seq vector for our randomly selected sequence 
    for t, char in enumerate(sentence):
        # for sample 0 in seq index t and character `char`
        # encode a 1 (which is the same as a TRUE)
        x_pred[0, t, dctk.char_int[char]] = 1

    # take the seq vector and pass into model to get
    # a prediction of what the next char should be 
    preds = model.predict(x_pred, verbose=0)[0]
    
    # use the sample helper function to get index for next char 
    next_index = sample(preds)
    # use look up dict to get next char 
    next_char = dctk.int_char[next_index]

    # append next char to sequence 
    sentence = sentence[1:] + next_char 

In [90]:
# this is the seed string
generated

'did i frame my feedi'

In [91]:
# these are the maxlen chars the model thinks should come after the seed string
sentence

'ng but sheeca blore '

In [92]:
# now put it all together
generated + sentence

'did i frame my feeding but sheeca blore '
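The generation steps above can be wrapped in one reusable function. The sketch below is an assumption-laden illustration: `generate_text` is a hypothetical helper, `predict_fn` stands in for a call such as `model.predict`, and the default `pick=np.argmax` is a greedy choice swapped in for the notebook's random `sample` so that the toy run is deterministic.

```python
import numpy as np


def generate_text(predict_fn, seed, n_chars, char_int, int_char, pick=np.argmax):
    """Extend `seed` by `n_chars` characters, one prediction at a time."""
    maxlen, n_features = len(seed), len(char_int)
    window = seed      # the last `maxlen` characters, fed to the model
    generated = seed   # seed plus everything generated so far

    for _ in range(n_chars):
        # one-hot encode the current window as a single rank-3 sample
        x_pred = np.zeros((1, maxlen, n_features))
        for t, char in enumerate(window):
            x_pred[0, t, char_int[char]] = 1

        preds = predict_fn(x_pred)[0]
        next_char = int_char[int(pick(preds))]

        generated += next_char
        window = window[1:] + next_char  # slide the window forward
    return generated


# toy stand-in for a trained model: always predicts the character
# that differs from the last one in the window
toy_char_int = {'a': 0, 'b': 1}
toy_int_char = {0: 'a', 1: 'b'}


def toy_predict(x):
    return np.array([1.0 - x[0, -1]])


result = generate_text(toy_predict, 'ab', 4, toy_char_int, toy_int_char)
print(result)
```

With the notebook's objects, a call along the lines of `generate_text(lambda x: model.predict(x, verbose=0), sentence, maxlen, dctk.char_int, dctk.int_char, pick=sample)` would restore the random sampling used above.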

## Stretch Goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g., plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data
