<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. Training at the character level keeps things simple and is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal is a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [0]:
# I'm gonna import with pandas - it'll be good practice :)
import pandas as pd

df = pd.read_fwf("https://www.gutenberg.org/files/100/100-0.txt").reset_index()

df = df.drop('Unnamed: 0', axis=1)  # Dropping the unnamed column
df = df.dropna().reset_index(drop=True)  # Dropping NaN values, resetting index.

df = df[63:138996].reset_index(drop=True)  # This contains the actual text

data = df['index'].tolist()  # Send to list.

In [0]:
import re  # Regex

new_data = " ".join(data) 

final_data = re.sub(r'[^a-zA-Z0-9]', ' ', new_data)  # Replace everything except letters and digits with a space


characters = sorted(set(final_data))  # Our final set of characters, sorted so the ordering is reproducible between runs

In [0]:
# Create character-integer lookup.

char_int = {character:integer for integer, character in enumerate(characters)}  # Assigns char to int for each integer and character in characters
int_char = {integer:character for integer, character in enumerate(characters)}  # And int to char for each integer and character in characters
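
As a quick sanity check, the two lookups should be inverses of each other. Here is a minimal sketch on a toy string. The `toy_*` names are made up for illustration and are not part of the cells above.

```python
# Toy corpus standing in for final_data (assumption for illustration).
toy_text = "to be or not to be"

# sorted() gives a reproducible ordering; a bare set() does not.
toy_chars = sorted(set(toy_text))
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for i, c in enumerate(toy_chars)}

# Encode the text to integers, then decode it back to characters.
toy_encoded = [toy_char_int[c] for c in toy_text]
toy_decoded = "".join(toy_int_char[i] for i in toy_encoded)

print(toy_chars)
print(toy_decoded == toy_text)
```

If the round trip does not reproduce the input, the two dictionaries were probably built from different orderings of the character set.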

In [4]:
# Create sequences

max_len = 40  # Our maximum sequence length.
step = 5  # Stride: start a new sequence every 5 characters.
sequences = []  # Empty list to populate with sequences.
next_char = []  # Empty list to populate with the next character in sequence.

encoded = [char_int[c] for c in final_data]  # This will encode values.

# Creating sequences.
for i in range(0, len(encoded) - max_len, step):
  sequences.append(encoded[i: i + max_len])
  next_char.append(encoded[i + max_len])
print('Sequences: ', len(sequences))

Sequences:  1055018
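
To make the windowing concrete, here is the same sliding-window scheme applied to a tiny list. This is only a sketch: `make_windows` is a hypothetical helper name, and the toy input stands in for the encoded corpus.

```python
import math


def make_windows(encoded, max_len, step):
    """Slide a window of max_len items over encoded, moving step items each time."""
    sequences, next_items = [], []
    for i in range(0, len(encoded) - max_len, step):
        sequences.append(encoded[i:i + max_len])   # the input window
        next_items.append(encoded[i + max_len])    # the item the model should predict
    return sequences, next_items


toy = list(range(12))  # stand-in for the encoded corpus
seqs, nxt = make_windows(toy, max_len=4, step=3)

print(seqs)
print(nxt)
# The number of windows is ceil((len(encoded) - max_len) / step).
print(len(seqs) == math.ceil((len(toy) - 4) / 3))
```

A smaller `step` produces more, heavily overlapping training examples. A larger `step` produces fewer examples and trains faster.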


In [5]:
# Now, we need to create our X and y variables.
import numpy as np

X = np.zeros((len(sequences), max_len, len(characters)), dtype=bool)  # One-hot inputs: (sequence, timestep, character).
y = np.zeros((len(sequences), len(characters)), dtype=bool)  # One-hot targets: (sequence, next character).

# Vectorizing sequences.
for i, sequence in enumerate(sequences):
  for t, char in enumerate(sequence):
    X[i, t, char] = 1
  y[i, next_char[i]] = 1

print('X shape:', X.shape)
print('y shape:', y.shape)

X shape: (1055018, 40, 63)
y shape: (1055018, 63)
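
These shapes are large enough that memory is worth checking before training. The arithmetic below uses the shapes printed above and is a back-of-the-envelope sketch: the float32 figure assumes the array gets converted to 32-bit floats at some point, which may not happen all at once in practice.

```python
import numpy as np

n_sequences, max_len, n_chars = 1055018, 40, 63  # shapes printed above

n_elements = n_sequences * max_len * n_chars

# A bool element takes one byte; a float32 element takes four.
bool_bytes = n_elements * np.dtype(bool).itemsize
float32_bytes = n_elements * np.dtype(np.float32).itemsize

print(f"X as bool:    {bool_bytes / 1e9:.2f} GB")
print(f"X as float32: {float32_bytes / 1e9:.2f} GB")
```

Stored as booleans, `X` takes about 2.66 GB. If that is too much for the machine at hand, a larger `step` or a slice of the corpus shrinks it proportionally.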


In [6]:
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.models import Sequential

# Build our model.
model = Sequential()  # Model instantiation.
model.add(LSTM(128, input_shape=(max_len, len(characters))))  # LSTM layer with input shape of the maximum length of sequence, length of characters
model.add(Dense(len(characters), activation='softmax', name='OutputLayer'))  # Dense layer (output) with the number of characters as number of nodes + softmax (because multi-class classification)

model.compile(optimizer='nadam', loss='categorical_crossentropy')  # Compile our model
model.summary()  # Let's take a look at the summary. It's good practice for making sure everything looks right!

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 128)               98304     
_________________________________________________________________
OutputLayer (Dense)          (None, 63)                8127      
Total params: 106,431
Trainable params: 106,431
Non-trainable params: 0
_________________________________________________________________
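
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer shapes are what we expect. This is a sketch using the standard LSTM and Dense parameter formulas; the two helper function names are made up for illustration.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(units, input_dim):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_dim * units + units


lstm_params = lstm_param_count(units=128, input_dim=63)
dense_params = dense_param_count(units=63, input_dim=128)

print(lstm_params, dense_params, lstm_params + dense_params)
```

The three printed values match the `Param #` column and the total in the summary above.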


In [0]:
import random
import sys
from tensorflow.keras.callbacks import LambdaCallback

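def sample(preds, temperature=1.0):
    # Helper function to sample an index from a probability array.
    preds = np.asarray(preds).astype('float64')  # Cast to float64 for numerical precision
    preds = np.log(preds) / temperature  # Rescale log-probabilities; a temperature of 1.0 leaves the distribution unchanged
    exp_preds = np.exp(preds)  # Back from log space
    preds = exp_preds / np.sum(exp_preds)  # Renormalize so the probabilities sum to 1
    probas = np.random.multinomial(1, preds, 1)  # Draw a single sample from the distribution
    return np.argmax(probas)  # Index of the sampled character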

def print_generated_text(epoch, _):
    # Function invoked at end of each epoch. Prints generated text.
    print()  # For readability

    start_index = random.randint(0, len(final_data) - max_len - 1)  # Randomly generate where to start
    
    generated = ''  # Empty string
    
    sentence = final_data[start_index: start_index + max_len]  # This will be our seed to generate text with.
    generated += sentence  # Add the sentence to our generated text.
    
    print('Generating with seed: "' + sentence + '"')  # Generating with seed output
    sys.stdout.write(generated)  # Write the seed without a trailing newline so the generated characters continue on the same line.
    

    number_of_characters = 125  # Number of characters to generate after the seed (hard-coded here).

    # Here's where the magic happens!
    for i in range(number_of_characters):
        x_pred = np.zeros((1, max_len, len(characters)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_int[char]] = 1.
            
        preds = model.predict(x_pred, verbose=0)[0]  # Make our prediction
        next_index = sample(preds)  # Picking the next index for our prediction.
        next_char = int_char[next_index]  # int_char lookup for our next index.
        
        sentence = sentence[1:] + next_char  # Adding next character to the sentence.
        
        sys.stdout.write(next_char)  # Write each character.
        sys.stdout.flush()
    print()  # For readability


print_callback = LambdaCallback(on_epoch_end=print_generated_text)  # This is our custom callback function.
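
The assignment asks for a function that takes a size and returns generated text, whereas the callback above only prints a fixed 125 characters. A minimal sketch of such a function follows. `generate_text`, `sample_with_temperature`, and the `predict_fn` argument are hypothetical names introduced here. `predict_fn` is assumed to wrap something like `model.predict`, and the uniform stand-in predictor exists only so the sketch runs without a trained model.

```python
import numpy as np


def sample_with_temperature(preds, temperature=1.0, rng=None):
    """Sample an index from a probability array, rescaled by temperature."""
    if rng is None:
        rng = np.random.default_rng()
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds + 1e-12) / temperature  # small epsilon avoids log(0)
    probs = np.exp(logits - logits.max())         # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))


def generate_text(predict_fn, seed, n_chars, char_int, int_char,
                  max_len=40, temperature=1.0, rng=None):
    """Return n_chars newly generated characters that continue seed."""
    window = seed[-max_len:]
    generated = []
    for _ in range(n_chars):
        # One-hot encode the current window, as in the training data.
        x_pred = np.zeros((1, max_len, len(char_int)))
        for t, char in enumerate(window):
            x_pred[0, t, char_int[char]] = 1.0
        preds = predict_fn(x_pred)[0]
        next_char = int_char[sample_with_temperature(preds, temperature, rng)]
        generated.append(next_char)
        window = (window + next_char)[-max_len:]  # slide the window forward
    return "".join(generated)


# Stand-in predictor (assumption): uniform probabilities over a toy vocabulary.
toy_chars = sorted(set("to be or not to be"))
toy_char_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_char = {i: c for i, c in enumerate(toy_chars)}


def uniform_predict(x):
    return np.full((1, len(toy_chars)), 1.0 / len(toy_chars))


text = generate_text(uniform_predict, "to be", 50, toy_char_int, toy_int_char,
                     max_len=5, rng=np.random.default_rng(0))
print(len(text))
```

With the trained model from this notebook, `predict_fn` could be `lambda x: model.predict(x, verbose=0)` and the seed any 40-character slice of `final_data`. Lower temperatures (for example 0.5) tend to make the output more conservative, and higher ones make it more varied.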

In [8]:
# Fit the model

model.fit(X, y,
          batch_size=32,
          epochs=25,
          callbacks=[print_callback])  # Usage of our callback here is what will print the text generated by the LSTM model.

Epoch 1/25
Generating with seed: "NT SIR WALTER HERBERT SIR WILLIAM BRANDO"
NT SIR WALTER HERBERT SIR WILLIAM BRANDO  Ney is I have sers  for his his wirbwer lard d the wauld  and bur sitson awaty En all our ush now zame  Which seed  but in 
Epoch 2/25
Generating with seed: "KING RICHARD  Well you deserve  They wel"
KING RICHARD  Well you deserve  They well shy wants in the Fremio s sut  Or lears  trush  and so  a BERORIA  ney much there  My but purse  Will say  for which common
Epoch 3/25
Generating with seed: "evenge will come  KING  Break not your s"
evenge will come  KING  Break not your seen famouns quintience The lood  the pell to the muster  Thy godage your good as everton to masy ney I  he made it is his vas
Epoch 4/25
Generating with seed: "w torments me to rehearse  I kill d a ma"
w torments me to rehearse  I kill d a madishanged moon  O encle his eyes of StYoung and RICEMON  PERCIBIAND  I will ever among  were now all  The shod make aboverave
Epoch 5/25
Generating with seed:

  


umpet to sit with sure  To the husband me upon
Epoch 22/25
Generating with seed: "man born  Master Parson  who writes hims"
man born  Master Parson  who writes himself  By only house of my is  if not obtwomb whom  yet s soverytapoous away  Are poor name and that parts spil  and danger off
Epoch 23/25
Generating with seed: "ure s wit  CELIA  Peradventure this is n"
ure s wit  CELIA  Peradventure this is no art their more  HORATIO  Seard such bend  God d merry colour d on thee so for a beart  And what men my in the hand bible ho
Epoch 24/25
Generating with seed: "leeding must be cur d   I am a Suitour  "
leeding must be cur d   I am a Suitour  CLEOPATRA OF PHICHILLA ULYTRA ENg  nor inflerf you keep from fix down  Winds fertlest thy both and witting da  O most a beaut
Epoch 25/25
Generating with seed: "y  I beseech thee  apparel thy head  And"
y  I beseech thee  apparel thy head  And you you and But with me the rade set Unto his mind  stwift  Enter FRETHe SALENTIUST  Your souss  may much 

<tensorflow.python.keras.callbacks.History at 0x7f07765af1d0>

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Tree Bank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN