<a href="https://colab.research.google.com/github/Collin-Campbell/DS-Unit-4-Sprint-3-Deep-Learning/blob/main/module1-rnn-and-lstm/LS_DS_431_RNN_and_LSTM_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## *Data Science Unit 4 Sprint 3 Assignment 1*

# Recurrent Neural Networks and Long Short-Term Memory (LSTM)

![Monkey at a typewriter](https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Chimpanzee_seated_at_typewriter.jpg/603px-Chimpanzee_seated_at_typewriter.jpg)

It is said that [infinite monkeys typing for an infinite amount of time](https://en.wikipedia.org/wiki/Infinite_monkey_theorem) will eventually type, among other things, the complete works of William Shakespeare. Let's see if we can get there a bit faster, with the power of Recurrent Neural Networks and LSTM.

This text file contains the complete works of Shakespeare: https://www.gutenberg.org/files/100/100-0.txt

Use it as training data for an RNN. You can keep it simple and train at the character level, which is the suggested initial approach.

Then, use that trained RNN to generate Shakespearean-ish text. Your goal: a function that takes the size of the text to generate (e.g. number of characters or lines) as an argument and returns generated text of that size.

Note - Shakespeare wrote an awful lot. It's OK, especially initially, to sample/use smaller data and parameters, so you can have a tighter feedback loop when you're trying to get things running. Then, once you've got a proof of concept - start pushing it more!

In [1]:
import requests
import pandas as pd

In [4]:
url = "https://www.gutenberg.org/files/100/100-0.txt"

r = requests.get(url)
r.encoding = r.apparent_encoding
data = r.text
data = data.split('\r\n')
toc = [l.strip() for l in data[44:130:2]]
# Skip the Table of Contents
data = data[135:]

# Fixing Titles
toc[9] = 'THE LIFE OF KING HENRY V'
toc[18] = 'MACBETH'
toc[24] = 'OTHELLO, THE MOOR OF VENICE'
toc[34] = 'TWELFTH NIGHT: OR, WHAT YOU WILL'

locations = {id_:{'title':title, 'start':-99} for id_,title in enumerate(toc)}

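# Record the line index where each title appears.
# Note: a later match overwrites an earlier one, so 'start' ends up at the
# last line containing the title, and stays -99 if the title is never found.
for e, line in enumerate(data):
    for t, title in enumerate(toc):
        if title in line:
            locations[t].update({'start': e})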

df_toc = pd.DataFrame.from_dict(locations, orient='index')
df_toc['end'] = df_toc['start'].shift(-1).apply(lambda x: x-1)
df_toc.loc[42, 'end'] = len(data)
df_toc['end'] = df_toc['end'].astype('int')

# 'end' is the last line of each work; add 1 because a slice excludes its stop index
df_toc['text'] = df_toc.apply(lambda x: '\r\n'.join(data[x['start'] : int(x['end']) + 1]), axis=1)
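
The start/end bookkeeping can be sketched on a toy example without pandas. This is a minimal illustration, not the notebook's method: `lines` and `starts` are made-up data. It treats each range as half-open (a work ends where the next one starts), which lines up with Python slicing, where the stop index is excluded.

```python
# Toy document: three "works", each beginning with its title line (hypothetical data)
lines = ["PLAY A", "a1", "a2", "PLAY B", "b1", "PLAY C", "c1", "c2", "c3"]
starts = [0, 3, 5]  # line index of each title

# Each work ends where the next one starts; the last one ends at len(lines)
ends = starts[1:] + [len(lines)]

# Half-open ranges [start, end) can be passed straight to a slice
works = ["\n".join(lines[s:e]) for s, e in zip(starts, ends)]

for text in works:
    print(repr(text))
```

With half-open ranges no line is lost between consecutive works, and no `- 1` / `+ 1` adjustments are needed.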

In [5]:
#Shakespeare Data Parsed by Play
df_toc.head()

| | title | start | end | text |
|---|---|---|---|---|
| 0 | THE TRAGEDY OF ANTONY AND CLEOPATRA | -99 | 14379 | |
| 1 | AS YOU LIKE IT | 14380 | 17171 | `AS YOU LIKE IT\r\n\r\n\r\nDRAMATIS PERSONAE.\r...` |
| 2 | THE COMEDY OF ERRORS | 17172 | 20372 | `THE COMEDY OF ERRORS\r\n\r\n\r\n\r\nContents\r...` |
| 3 | THE TRAGEDY OF CORIOLANUS | 20373 | 30346 | `THE TRAGEDY OF CORIOLANUS\r\n\r\nDramatis Pers...` |
| 4 | CYMBELINE | 30347 | 30364 | `CYMBELINE.\r\nLaud we the gods;\r\nAnd let our...` |


In [16]:
# Strip leading/trailing whitespace from every line
data2 = [s.strip() for s in data]

In [22]:
# Drop lines that consist only of digits
data3 = [s for s in data2 if not s.isdigit()]

In [24]:
# Drop empty lines
data4 = [s for s in data3 if s != '']

In [52]:
data4 = data4[:140036] # removing credits in the text 

In [57]:
# Join all lines into one string separated by single spaces
full_text = ' '.join(data4)


In [58]:
full_text = full_text.strip()

In [60]:
# Drop standalone numeric tokens, then rejoin with single spaces
new_list = [w for w in full_text.split() if not w.isdigit()]

final_text = ' '.join(new_list)

In [62]:
# Encode Data as Chars

# Find the unique characters (sorted so the mapping is identical on every run)
chars = sorted(set(final_text))

# Lookup tables
char_int = {c:i for i, c in enumerate(chars)} 
int_char = {i:c for i, c in enumerate(chars)}

print('The number of unique characters in the text:', len(chars))

The number of unique characters in the text: 99
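
The two lookup tables are inverses of each other, so encoding and then decoding should return the original string. A minimal sketch of that round trip, using a made-up sample string in place of `final_text`; the names `vocab`, `to_int`, `to_char`, `encode`, and `decode` are hypothetical:

```python
# Hypothetical sample text standing in for final_text
text = "to be or not to be"

# Sorting gives the same character-to-index mapping on every run
vocab = sorted(set(text))
to_int = {c: i for i, c in enumerate(vocab)}
to_char = {i: c for i, c in enumerate(vocab)}

def encode(s):
    """Map a string to a list of integer ids."""
    return [to_int[c] for c in s]

def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(to_char[i] for i in ids)

ids = encode("to be")
print(ids)
print(decode(ids))
```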


In [64]:
# Create the sequence data
maxlen = 30
step = 5

# Encode the characters using the lookup tables
encoded = [char_int[c] for c in final_text]

# Initialize empty lists to hold the sequences
sequences = [] # Each element is maxlen (30) chars long
next_char = [] # One element for each sequence

# Loop through the entire text
for i in range(0, len(encoded) - maxlen, step): 
    sequences.append(encoded[i : i + maxlen])
    next_char.append(encoded[i + maxlen])

print('sequences: ', len(sequences))

sequences:  1051537
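
The windowing loop can be checked on a small input. The sketch below wraps the same loop in a hypothetical helper, `make_windows`, and compares the number of windows against the closed form `ceil((n - maxlen) / step)`; the toy values for `maxlen` and `step` are chosen only to keep the output short.

```python
import math

def make_windows(ids, maxlen, step):
    """Return (windows, targets): each window of maxlen ids and the id that follows it."""
    windows, targets = [], []
    for i in range(0, len(ids) - maxlen, step):
        windows.append(ids[i:i + maxlen])
        targets.append(ids[i + maxlen])
    return windows, targets

ids = list(range(23))  # stand-in for the encoded text
windows, targets = make_windows(ids, maxlen=5, step=3)

# range(0, n - maxlen, step) yields ceil((n - maxlen) / step) start positions
expected = math.ceil((len(ids) - 5) / 3)
print(len(windows), expected)
print(windows[0], "->", targets[0])
```

A smaller `step` produces more, heavily overlapping training examples; a larger one trades coverage for speed.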


In [65]:
import numpy as np
import tensorflow as tf

# Every sequence already has length maxlen, so this mainly converts the
# list of lists into a 2-D integer array for the Embedding layer
seq = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=maxlen)

# Create x & y as arrays of zeros (False).
# np.bool was removed in NumPy 1.24; the builtin bool is the same dtype.
x = np.zeros((len(sequences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

# Turn on the location (set to True) when the character is present.
# Note: the Embedding model below is fit on seq, not on the one-hot x.
for i, encoded_seq in enumerate(sequences):
    for t, char in enumerate(encoded_seq):
        x[i, t, char] = 1

    y[i, next_char[i]] = 1
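
One-hot arrays grow quickly. A rough back-of-the-envelope check follows, using the sequence count (1,051,537), `maxlen` (30), and vocabulary size (99) reported earlier, and assuming one byte per boolean element as in NumPy. The integer-label variant at the end is an alternative, not what this notebook does; Keras provides the `sparse_categorical_crossentropy` loss for integer targets.

```python
n_sequences, maxlen, n_chars = 1_051_537, 30, 99

# One byte per element for a boolean NumPy array
x_bytes = n_sequences * maxlen * n_chars
y_bytes = n_sequences * n_chars

# Alternative (assumption): keep targets as one small integer id per sequence
y_sparse_bytes = n_sequences * 1  # e.g. uint8, enough for 99 classes

print(f"one-hot x: {x_bytes / 1e9:.2f} GB")
print(f"one-hot y: {y_bytes / 1e6:.1f} MB")
print(f"integer y: {y_sparse_bytes / 1e6:.1f} MB")
```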

In [66]:
# Build the model: Embedding -> Bidirectional LSTM -> softmax over characters
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.layers import Bidirectional, Embedding

model = Sequential()
model.add(Embedding(output_dim=64, input_dim=len(chars)))
model.add(Bidirectional(LSTM(64)))
model.add(Dense(len(chars), activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam')
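
As a sanity check on model size, the trainable parameter count can be worked out by hand. The sketch below assumes Keras defaults (biases enabled, four weight blocks per LSTM for the three gates plus the candidate cell state, forward and backward outputs concatenated by `Bidirectional`) and a vocabulary of 99 characters. `model.summary()` remains the authoritative source.

```python
vocab_size, embed_dim, units = 99, 64, 64

# Embedding: one vector of size embed_dim per character
embedding = vocab_size * embed_dim

# One LSTM direction: 4 blocks, each with input weights, recurrent weights, and a bias
lstm_one_direction = 4 * (embed_dim * units + units * units + units)
bidirectional = 2 * lstm_one_direction

# Dense: input is the concatenated forward and backward states (2 * units)
dense = (2 * units) * vocab_size + vocab_size

total = embedding + bidirectional + dense
print(embedding, bidirectional, dense, total)
```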

In [70]:
# Fit the model
model.fit(seq, y, batch_size=32,
          epochs=30, verbose=2)


Epoch 1/30
32861/32861 - 814s - loss: 1.6995
Epoch 2/30
32861/32861 - 817s - loss: 1.6685
Epoch 3/30
32861/32861 - 815s - loss: 1.6469
Epoch 4/30
32861/32861 - 816s - loss: 1.6303
Epoch 5/30
32861/32861 - 814s - loss: 1.6175
Epoch 6/30
32861/32861 - 813s - loss: 1.6064
Epoch 7/30
32861/32861 - 811s - loss: 1.5970
Epoch 8/30
32861/32861 - 815s - loss: 1.5889
Epoch 9/30
32861/32861 - 814s - loss: 1.5820
Epoch 10/30
32861/32861 - 812s - loss: 1.5760
Epoch 11/30
32861/32861 - 813s - loss: 1.5708
Epoch 12/30
32861/32861 - 813s - loss: 1.5659
Epoch 13/30
32861/32861 - 812s - loss: 1.5617
Epoch 14/30
32861/32861 - 815s - loss: 1.5581
Epoch 15/30
32861/32861 - 814s - loss: 1.5550
Epoch 16/30
32861/32861 - 811s - loss: 1.5515
Epoch 17/30
32861/32861 - 813s - loss: 1.5489
Epoch 18/30
32861/32861 - 812s - loss: 1.5467
Epoch 19/30
32861/32861 - 813s - loss: 1.5437
Epoch 20/30
32861/32861 - 814s - loss: 1.5417
Epoch 21/30
32861/32861 - 804s - loss: 1.5392
Epoch 22/30
32861/32861 - 805s - loss: 1.53
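
Two numbers in the log can be reproduced by hand. The step count per epoch is the number of sequences divided by the batch size, rounded up, and the cross-entropy loss (in nats, since Keras uses the natural log) converts to a per-character perplexity via `exp(loss)`. A small sketch using values from the output above:

```python
import math

n_sequences, batch_size = 1_051_537, 32

# The last batch is partial, so round up
steps_per_epoch = math.ceil(n_sequences / batch_size)
print(steps_per_epoch)

# Perplexity: the effective number of equally likely next characters
for epoch, loss in [(1, 1.6995), (21, 1.5392)]:
    print(f"epoch {epoch}: loss {loss:.4f} -> perplexity {math.exp(loss):.2f}")
```

Because only training loss is logged, these figures say nothing about generalization; passing `validation_split` to `model.fit` would add a held-out loss to each epoch line.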

In [None]:
model.save('ShakespeareBot')

In [None]:
from tensorflow import keras
model = keras.models.load_model('ShakespeareBot')
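
A reloaded model is only useful with the same character-to-index mapping it was trained with. One way to keep the two together is sketched below, under the assumption that the vocabulary is a list of single characters like `chars`. The file name `vocab.json` and the helper names `save_vocab` and `load_vocab` are hypothetical.

```python
import json
import os
import tempfile

def save_vocab(chars, path):
    """Write the ordered character list to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(list(chars), f, ensure_ascii=False)

def load_vocab(path):
    """Read the character list back and rebuild both lookup tables."""
    with open(path, encoding="utf-8") as f:
        chars = json.load(f)
    char_int = {c: i for i, c in enumerate(chars)}
    int_char = {i: c for i, c in enumerate(chars)}
    return chars, char_int, int_char

# Demo with a toy vocabulary written to a temporary directory
demo_chars = sorted(set("HAMLET: to be"))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.json")
    save_vocab(demo_chars, path)
    loaded_chars, char_int, int_char = load_vocab(path)

print(loaded_chars == demo_chars)
```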

In [71]:
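# Predict and convert text back into characters
def generate_text(model, seed, length):
    """Greedily generate `length` characters that continue `seed`."""
    encoded = [char_int[c] for c in seed]
    generated = seed
    model.reset_states()

    # Note: the context window here is 10 characters, although the model
    # was trained on sequences of maxlen (30) characters
    window = 10
    start_index = 0

    for _ in range(length):
        # Slide a fixed-size window over the encoded text
        sample = np.array(encoded[start_index:start_index + window])
        sample = np.expand_dims(sample, 0)  # add a batch dimension

        pred = model.predict(sample)
        pred = tf.squeeze(pred, 0)
        next_id = int(np.argmax(pred))  # most likely next character
        encoded.append(next_id)
        generated += int_char[next_id]

        start_index += 1

    return generated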

In [77]:
seed_text = "HAMLET: "

generate_text(model, seed_text, 30)

'HAMLET: So I wilt thee and the VALLET '
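
Always taking the most likely character (`argmax`) is deterministic and tends to produce repetitive text. A common alternative is to sample from the predicted distribution after rescaling it with a temperature. The helper below is a sketch, not part of the notebook: `sample_with_temperature` is a hypothetical name, and `probs` stands in for one row of softmax output.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after sharpening (T < 1) or flattening (T > 1)."""
    # Work in log space; the small constant avoids log(0)
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    # Subtract the max before exponentiating for numerical stability
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.1, 0.7, 0.2]  # hypothetical softmax output over 3 characters
rng = random.Random(0)

cold = [sample_with_temperature(probs, 0.01, rng) for _ in range(20)]
warm = [sample_with_temperature(probs, 1.0, rng) for _ in range(20)]
print(cold)  # very low temperature behaves like argmax
print(warm)  # temperature 1.0 samples from probs as-is
```

In `generate_text`, a call like this could take the place of `np.argmax(pred)`, with the temperature exposed as an extra argument.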

# Resources and Stretch Goals

## Stretch goals:
- Refine the training and generation of text to be able to ask for different genres/styles of Shakespearean text (e.g. plays versus sonnets)
- Train a classification model that takes text and returns which work of Shakespeare it is most likely to be from
- Make it more performant! Many possible routes here - lean on Keras, optimize the code, and/or use more resources (AWS, etc.)
- Revisit the news example from class, and improve it - use categories or tags to refine the model/generation, or train a news classifier
- Run on bigger, better data

## Resources:
- [The Unreasonable Effectiveness of Recurrent Neural Networks](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) - a seminal writeup demonstrating a simple but effective character-level NLP RNN
- [Simple NumPy implementation of RNN](https://github.com/JY-Yoon/RNN-Implementation-using-NumPy/blob/master/RNN%20Implementation%20using%20NumPy.ipynb) - Python 3 version of the code from "Unreasonable Effectiveness"
- [TensorFlow RNN Tutorial](https://github.com/tensorflow/models/tree/master/tutorials/rnn) - code for training an RNN on the Penn Treebank language dataset
- [4 part tutorial on RNN](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) - relates RNN to the vanishing gradient problem, and provides an example implementation
- [RNN training tips and tricks](https://github.com/karpathy/char-rnn#tips-and-tricks) - some rules of thumb for parameterizing and training your RNN