<div id="container" style="position:relative;">
<div style="float:left"><h1> Introduction to Recurrent Neural Networks </h1></div>
<div style="position:relative; float:right"><img style="height:65px" src ="https://drive.google.com/uc?export=view&id=1EnB0x-fdqMp6I5iMoEBBEuxB_s7AmE2k" />
</div>
</div>

So far we have seen standard Neural Networks (NNs/ANNs) as well as Convolutional Neural Networks (CNNs). These are powerful methods, but neither has the notion of a series where each value can affect the next. What if we have a set of inputs where a series of inputs could decide the next value? Examples include words in a sentence or essay, or price movements. This is where Recurrent Neural Networks (RNNs) come into play.

Required libraries:
1. `tensorflow`
2. `matplotlib`
3. `pydot`
4. `graphviz`

In [1]:
# Let's load up some libraries
import warnings
warnings.filterwarnings("ignore")

import os.path
import math
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import random
from collections import deque

# We will use tensorflow.keras in this notebook

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM, BatchNormalization, Flatten
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import model_from_json

from numpy.random import seed
seed(1)


### Series Data

In series data $\vec{X} = [x_1, x_2, ..., x_n]$ we assume that the data points are not independent. In essence, every $x_t$ has an effect on the next $x_{t+1}$ and knowing the previous data point allows us to get some information about the next data point. More loosely speaking, not only do the values carry information, but their order does as well.

One of the best examples is language, where the order of the words has a tremendous effect on the change in meaning. Consider the following two sentences:
1. A person made a neural network.
2. A neural network made a person.

The first sentence describes an ordinary occurrence in our classroom, the second sentence marks the beginning of SkyNet.

#### Effect on Neural Network design

From a functional perspective, we would like our neurons to be slightly modified. Whereas before each neuron simply took in a value, did some processing on it, and produced an output (right image), now we would like each neuron to take in a series. Furthermore, we'd like the output from each element in the series to be an input for the next element, and once the entire series has been fed in, we'd like to see the neuron fire (left image).

<img src="http://drive.google.com/uc?export=view&id=15kwpMDsF6yzZpTtaqXlZmO_c9ZOaU7tr" width=500 height=400>

<center> <i>(Image source: https://www.sciencedirect.com/science/article/pii/S088523081400093X#fig0010)</i> </center>

The basis for Recurrent Neural Networks (RNNs) is the Recurrent Neuron. We can think of each recurrent neuron as a series of neurons which take in a sequence ($x_0, x_1, ..., x_t$ in the image below), transmit a state along the sequence of neurons (the arrows going across the series of neurons $A$), and output either a final output ($h_t$), or a series of outputs for another recurrent neuron to consume ($h_0, h_1, ..., h_t$).

<img src="http://drive.google.com/uc?export=view&id=13jVNRNYmvirzCKZYld5KlbgXKtt4uJxI" height=400 width=600>

<center> <i>(Image source: https://machinelearning-blog.com/2018/02/21/recurrent-neural-networks/)</i> </center>
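
To make the idea of "transmitting a state along the sequence" concrete, here is a minimal NumPy sketch of a plain `tanh` recurrent layer. It is only an illustration under stated assumptions: the function name `simple_rnn_forward`, the weight names `W_x`, `W_h`, `b`, and the toy sizes are all hypothetical, and this is a simpler cell than the LSTM layers used later in this notebook. The `return_sequences` flag is named after the Keras argument that appears further down.

```python
import numpy as np

def simple_rnn_forward(x_seq, W_x, W_h, b, return_sequences=False):
    """Run a plain tanh recurrent layer over a single sequence.

    x_seq has shape (q, d): one row per step of the series.
    Returns every hidden state, shape (q, units), or only the last one, shape (units,).
    """
    units = W_h.shape[0]
    h = np.zeros(units)                  # the state starts out empty
    states = []
    for x_t in x_seq:
        # the new state depends on the current input AND the previous state
        h = np.tanh(x_t @ W_x + h @ W_h + b)
        states.append(h)
    return np.stack(states) if return_sequences else h

# Toy sizes (assumptions for illustration): 10 steps, 1 feature, 4 recurrent units
rng = np.random.default_rng(0)
q, d, units = 10, 1, 4
x_seq = rng.normal(size=(q, d))
W_x = rng.normal(size=(d, units))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)

all_states = simple_rnn_forward(x_seq, W_x, W_h, b, return_sequences=True)
last_state = simple_rnn_forward(x_seq, W_x, W_h, b)
print(all_states.shape)   # (10, 4)
print(last_state.shape)   # (4,)
```

The last row of `all_states` is the same vector as `last_state`, which corresponds to the two output options described above: the whole series $h_0, ..., h_t$, or only the final $h_t$.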


### Effect on Data

This modifies the way we feed the data into the network as well. For a regular Feed-forward NN, each observation would be composed of $d$ dimensions, and we would have a total of $n$ observations to create a dataset of $n \times d$. However, for recurrent networks, for each dimension/feature, we would have a sequence of $q$ observations, so that each data point is a $q \times d$ matrix. This means our entire dataset will be of shape $n \times q \times d$ (this 3D "matrix" is called a **Tensor**).

So for example, if we have a dataset where every data point is made up of 4 input features, each of which is a series of 20 measurements (e.g. 20 days), then every data point is described by a $20 \times 4$ matrix. And if we have 1,000 data points, our dataset will be of size $1000 \times 20 \times 4$.
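
As a quick sanity check on these shapes, the sketch below builds placeholder arrays (filled with zeros, purely for illustration) with the sizes from the example above.

```python
import numpy as np

n, q, d = 1000, 20, 4      # data points, sequence length, features

feedforward_data = np.zeros((n, d))      # feed-forward: one row of d features per data point
recurrent_data = np.zeros((n, q, d))     # recurrent: one q x d matrix per data point

print(feedforward_data.shape)    # (1000, 4)
print(recurrent_data.shape)      # (1000, 20, 4)
print(recurrent_data[0].shape)   # (20, 4), a single data point
```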

## Let's write some poetry! (or at least try)

Let's say we wanted to produce an RNN which can write poetry for us. We will give it some starting phrase, and we expect it to complete a poem for us. 
There are two ways to explore this problem. We can look at poems (and language) as a sequence of characters, or as a sequence of words.

We will start with the simple case of character sequences.

### Splitting the Data into Characters

#### Data Wrangling

The most important part of every data science project is getting the data and transforming it into the format we need. Let's grab a dataset of poetry.

In [2]:
# data was taken from https://www.kaggle.com/johnhallman/complete-poetryfoundationorg-dataset
poems_df = pd.read_csv('/Users/yuanyaning/Downloads/kaggle_poem_dataset.csv', index_col=0)
poems_df.head()

Unnamed: 0,Author,Title,Poetry Foundation ID,Content
0,Wendy Videlock,!,55489,"Dear Writers, I’m compiling the first in what ..."
1,Hailey Leithauser,0,41729,"Philosophic\nin its complex, ovoid emptiness,\..."
2,Jody Gladding,1-800-FEAR,57135,We'd like to talk with you about fear t...
3,Joseph Brodsky,1 January 1965,56736,The Wise Men will unlearn your name.\nAbove yo...
4,Ted Berrigan,3 Pages,51624,For Jack Collom\n10 Things I do Every Day\n\np...


Poetry is quite interesting because it is unique to every author, so let's see if we can find an author in this dataset with a large collection of poems.

In [3]:
poems_df['Author'].value_counts().head()

William Shakespeare      85
Anonymous                82
Alfred, Lord Tennyson    78
Rae Armantrout           62
William Wordsworth       59
Name: Author, dtype: int64

William Shakespeare... not terribly surprising. But in our case let's try a close runner-up, Alfred, Lord Tennyson (the Shakespeare poems have a few strange characters in them that make things harder).

In [5]:
tennyson_poems = poems_df[poems_df['Author'] == "Alfred, Lord Tennyson"]
tennyson_poems.head()

Unnamed: 0,Author,Title,Poetry Foundation ID,Content
1730,"Alfred, Lord Tennyson","Break, Break, Break",45318,"Break, break, break,\nOn thy cold gray stones,..."
2134,"Alfred, Lord Tennyson",The Charge of the Light Brigade,45319,"I\n\nHalf a league, half a league,\nHalf a lea..."
2315,"Alfred, Lord Tennyson",Claribel,45320,Where Claribel low-lieth\nThe breezes pause an...
2687,"Alfred, Lord Tennyson",Crossing the Bar\n \n \n \n Launch Audio in...,45321,"Sunset and evening star,\nAnd one clear call f..."
3529,"Alfred, Lord Tennyson",The Eagle,45322,He clasps the crag with crooked hands;\nClose ...


Let's build a dataset of the poems:

In [6]:
all_chars = []
dataset = []

# cycle through all the poems
for poem in tennyson_poems['Content']:
    
    # split the poem into its individual characters
    poem_characters = list(poem.lower())
    dataset.append(poem_characters)
    
    # Also create a list of all unique characters we saw
    for char in poem_characters:
        if char not in all_chars:
            all_chars.append(char)
    



Let's inspect our list of characters:

In [7]:
print(all_chars)

['b', 'r', 'e', 'a', 'k', ',', ' ', '\n', 'o', 'n', 't', 'h', 'y', 'c', 'l', 'd', 'g', 's', '!', 'i', 'w', 'u', 'm', '.', 'f', "'", 'p', 'v', ';', 'x', '“', '”', '?', 'j', '-', 'z', ':', '"', 'q', '—', '(', ')', 'ë', 'ï', 'æ', '’', 'ä', 'é', 'ö', 'è', '{', '}']


As well as our dataset:

In [9]:
print(dataset[0])

['b', 'r', 'e', 'a', 'k', ',', ' ', 'b', 'r', 'e', 'a', 'k', ',', ' ', 'b', 'r', 'e', 'a', 'k', ',', '\n', 'o', 'n', ' ', 't', 'h', 'y', ' ', 'c', 'o', 'l', 'd', ' ', 'g', 'r', 'a', 'y', ' ', 's', 't', 'o', 'n', 'e', 's', ',', ' ', 'o', ' ', 's', 'e', 'a', '!', '\n', 'a', 'n', 'd', ' ', 'i', ' ', 'w', 'o', 'u', 'l', 'd', ' ', 't', 'h', 'a', 't', ' ', 'm', 'y', ' ', 't', 'o', 'n', 'g', 'u', 'e', ' ', 'c', 'o', 'u', 'l', 'd', ' ', 'u', 't', 't', 'e', 'r', '\n', 't', 'h', 'e', ' ', 't', 'h', 'o', 'u', 'g', 'h', 't', 's', ' ', 't', 'h', 'a', 't', ' ', 'a', 'r', 'i', 's', 'e', ' ', 'i', 'n', ' ', 'm', 'e', '.', '\n', '\n', 'o', ',', ' ', 'w', 'e', 'l', 'l', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ', 'f', 'i', 's', 'h', 'e', 'r', 'm', 'a', 'n', "'", 's', ' ', 'b', 'o', 'y', ',', '\n', 't', 'h', 'a', 't', ' ', 'h', 'e', ' ', 's', 'h', 'o', 'u', 't', 's', ' ', 'w', 'i', 't', 'h', ' ', 'h', 'i', 's', ' ', 's', 'i', 's', 't', 'e', 'r', ' ', 'a', 't', ' ', 'p', 'l', 'a', 'y', '!', '\n', 'o', ',

Let's discuss what we have so far. We have a data collection where each element is a poem, split up into a list of characters. We want to transform it into a dataset $X$ and $y$, where each data point $x_i$ has a single feature containing a sequence of several characters, and each label $y_i$ is the character that should come after the sequence observed.

Suppose we decided on a sequence length of 10, and we explored the first poem:

```
Break, break, break,
On thy cold gray stones, O Sea!
And I would that my tongue could utter
The thoughts that arise in me.

O, well for the fisherman's boy,
That he shouts with his sister at play!
O, well for the sailor lad,
That he sings in his boat on the bay!

And the stately ships go on
To their haven under the hill;
But O for the touch of a vanish'd hand,
And the sound of a voice that is still!

Break, break, break
At the foot of thy crags, O Sea!
But the tender grace of a day that is dead
Will never come back to me.
```

We would like our dataset to look like this:

$x_0$: `Break, bre`<br>
$y_0$: `a`

$x_1$: `reak, brea`<br>
$y_1$: `k`

$x_2$: `eak, break`<br>
$y_2$: `,`

$\vdots$

Let's construct a loop to build this dataset for us. We will use the `collections.deque` object. It is like a list, but once it fills up and we add another element, it kicks the oldest element out. In essence, it is a list that can never be longer than `maxlen`.
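
Before using it on the poems, here is a tiny standalone illustration of that behaviour, with a `maxlen` of 3 chosen only to keep the output short:

```python
from collections import deque

window = deque(maxlen=3)
for char in "break":
    window.append(char)      # once full, appending drops the oldest element
    print(list(window))

# the final window holds the last three characters: ['e', 'a', 'k']
```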

In [10]:
SEQUENCE_LENGTH = 10

X = []
y = []

# for each poem
for poem in dataset:
    char_deque = deque(maxlen=SEQUENCE_LENGTH)
    
    # go through the characters and place them in a deque, once the deque fills up and we try to add
    # another character, the oldest character will be thrown out
    for i in range(len(poem)-1):
        char = poem[i]
        char_deque.append(char)
        
        if (len(char_deque) == SEQUENCE_LENGTH):
            X.append(list(char_deque))
            y.append(poem[i+1])
            


Let's inspect our $X$ and $y$

In [11]:
for i in range(5):
    print("X:",X[i])
    print("y:",y[i])
    print("*******")

X: ['b', 'r', 'e', 'a', 'k', ',', ' ', 'b', 'r', 'e']
y: a
*******
X: ['r', 'e', 'a', 'k', ',', ' ', 'b', 'r', 'e', 'a']
y: k
*******
X: ['e', 'a', 'k', ',', ' ', 'b', 'r', 'e', 'a', 'k']
y: ,
*******
X: ['a', 'k', ',', ' ', 'b', 'r', 'e', 'a', 'k', ',']
y:  
*******
X: ['k', ',', ' ', 'b', 'r', 'e', 'a', 'k', ',', ' ']
y: b
*******


#### Almost there...

There are a few more things we have to do, like convert the arrays into numpy arrays for our RNN to consume. However, recall that all machine learning techniques rely on the data being numeric. We will transform the data into numbers in a few stages.

First, we will prepare two dictionaries: one to convert characters to numbers, and one to do the conversion back.

In [14]:
number_to_char = {i: j for i,j in enumerate(all_chars)}
char_to_number = {j: i for i,j in enumerate(all_chars)}
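
To see how the two dictionaries work together, here is a small round-trip sketch. It uses a toy vocabulary (`toy_chars`) and two hypothetical helpers, `encode` and `decode`, which are not part of the notebook's own code.

```python
# Toy vocabulary for illustration; the notebook's real vocabulary is `all_chars`
toy_chars = ['b', 'r', 'e', 'a', 'k', ',', ' ']
toy_number_to_char = {i: c for i, c in enumerate(toy_chars)}
toy_char_to_number = {c: i for i, c in enumerate(toy_chars)}

def encode(text, mapping):
    """Turn a string into a list of integers."""
    return [mapping[c] for c in text]

def decode(numbers, mapping):
    """Turn a list of integers back into a string."""
    return "".join(mapping[n] for n in numbers)

encoded = encode("break, bre", toy_char_to_number)
print(encoded)                               # [0, 1, 2, 3, 4, 5, 6, 0, 1, 2]
print(decode(encoded, toy_number_to_char))   # break, bre
```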

Now we will convert every character in $X$ and $y$ into numbers.

In [15]:
for i in range(len(X)):
    for j in range(len(X[0])):
        X[i][j] = char_to_number[X[i][j]]
        
    y[i] = char_to_number[y[i]]

In [16]:
for i in range(5):
    print("X:",X[i])
    print("y:",y[i])
    print("*******")

X: [0, 1, 2, 3, 4, 5, 6, 0, 1, 2]
y: 3
*******
X: [1, 2, 3, 4, 5, 6, 0, 1, 2, 3]
y: 4
*******
X: [2, 3, 4, 5, 6, 0, 1, 2, 3, 4]
y: 5
*******
X: [3, 4, 5, 6, 0, 1, 2, 3, 4, 5]
y: 6
*******
X: [4, 5, 6, 0, 1, 2, 3, 4, 5, 6]
y: 0
*******


Now we will transform our data into numpy arrays.

In [17]:
X = np.array(X)
y = np.array(y)

In [18]:
# Let's look at the shapes
print(X.shape)
print(y.shape)

(185712, 10)
(185712,)


Recall that the data shape has to be $n \times q \times d$. So we need to reshape this data. 

### Thinking Exercise

1. How many features do we have?

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

We have one feature of 10 observations.

In [22]:
X = X.reshape((X.shape[0], X.shape[1], 1))
print(X.shape)
print(y.shape)

(185712, 10, 1)
(185712,)


We will shuffle the data quite a bit, just to introduce more randomness into how the NN sees the data, so let's create a function for that.

In [23]:
def shuffle_data(X_data, y_data):
    
    y_data = y_data.reshape((y_data.shape[0], 1, 1))
    combined_data = np.hstack((X_data, y_data))
    
    np.random.shuffle(combined_data)

    X_data = combined_data[:, :-1]
    y_data = combined_data[:, -1]
    
    return X_data, y_data.reshape(-1, 1)

In [24]:
X, y = shuffle_data(X, y)

In [25]:
print(X.shape)
print(y.shape)

(185712, 10, 1)
(185712, 1)


Note that this shuffles the order of the sequences (the rows of the dataset), not the order of the numbers (characters) within each sequence.
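
An alternative way to shuffle features and labels together, sketched here as an assumption rather than the notebook's method, is to draw one random permutation of the row indices and apply it to both arrays. The helper name `shuffle_in_unison` and the toy arrays are hypothetical. Unlike `shuffle_data` above, this version leaves the labels in whatever shape they came in.

```python
import numpy as np

def shuffle_in_unison(X_data, y_data, rng=None):
    """Shuffle the rows of X_data and y_data with the same random order."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(X_data.shape[0])
    return X_data[order], y_data[order]

# Toy data: 4 sequences of length 3; each label is tied to the first element of its sequence
X_toy = np.arange(12).reshape((4, 3, 1))
y_toy = X_toy[:, 0, 0] + 100

X_shuf, y_shuf = shuffle_in_unison(X_toy, y_toy, np.random.default_rng(0))
print(X_shuf[:, :, 0])
print(y_shuf)
```

Each printed row still counts upward by one, and each label still matches the first element of its row, so only the order of the rows has changed.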

### Train and Validation data

We're going to split the data into training and validation sets. We don't need a test set here because we don't need an unbiased estimate of how our model will perform on new data. Instead, we will judge the model informally by using it to generate some new poetry.

In [27]:
# Create training and validation sets
validate_set_size = int(0.1 * X.shape[0]) #10% for validation 

train_set_limit = X.shape[0] - validate_set_size # 90% for training

# Split train
train_X = X[:train_set_limit]
train_y = y[:train_set_limit]

# Split validation
validation_X = X[train_set_limit : ]
validation_y = y[train_set_limit : ]



In [28]:
print(train_X.shape)      
print(validation_X.shape)

(167141, 10, 1)
(18571, 10, 1)


First, let's try a regular neural network. To do so, we need to "flatten" each of our data points from a 2D tensor to a 1D tensor.

In [42]:
flat_train_X = np.reshape(train_X, (-1, train_X.shape[1]))
flat_validation_X = np.reshape(validation_X, (-1, validation_X.shape[1]))

print(flat_train_X.shape)
print(flat_validation_X.shape)

(167141, 10)
(18571, 10)


To help us visualize RNNs later, let's look at a visualization of our Feed-Forward Neural Network.

<img src="https://drive.google.com/uc?export=view&id=1ur2OAo1U5tMhzdlEhjiQckj2ZnazRdRq" alt="Drawing" style="width: 600px;"/>
 
<center> <i>(Image prepared using http://alexlenail.me/NN-SVG/)</i> </center>

Here $n$ is the size of our input.

The first layer will read our flattened input (the characters turned into numbers) and the network will perform a forward pass through the layers, until the output layer outputs a probability distribution over our 52 characters.

Since we will be constructing rather large networks, we will be working with pre-trained networks and simply load them to make them work, but we will still review the code to build the network.

First, let's declare a `Sequential` model. This means all layers will be added in sequence.

In [43]:
FNN_model = Sequential()

Now let's add our first layer. Since this is our first layer we need to explicitly specify the shape of the input (the shape of each data point).

In [44]:
FNN_model.add(Dense(1024, activation='relu', input_shape=(flat_train_X.shape[1:])))

Notice we are using the `relu` activation function. Recall that ReLU is a function which returns either 0 or $x$, whichever is larger.

$$ReLU(x) = \max(0, x)$$
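
In NumPy this is a one-liner. The sketch below is just an illustration of the formula, not the Keras implementation:

```python
import numpy as np

def relu(x):
    # element-wise maximum of 0 and x
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])).tolist())   # [0.0, 0.0, 0.0, 1.5]
```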

Let's also add some dropout and batch normalization. Dropout will randomly zero out a fraction of this layer's outputs at every training pass, and batch normalization will rescale the layer's output so that, within each batch, it has a mean of approximately 0 and a standard deviation of approximately 1.

In [45]:
FNN_model.add(Dropout(0.2))          # randomly drop 20% of the previous layer output
FNN_model.add(BatchNormalization())
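
To get a feel for what these two layers do during training, here is a rough NumPy sketch on a made-up batch of activations. It is a simplification under stated assumptions: the batch contents and sizes are invented, and the real `BatchNormalization` layer additionally learns a scale and shift and keeps moving averages for use at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(480, 8))   # made-up batch: 480 samples, 8 neurons

# Dropout at training time: zero out roughly 20% of the values,
# and scale the survivors by 1 / (1 - rate) so the expected total stays the same
rate = 0.2
keep_mask = rng.random(activations.shape) >= rate
dropped = activations * keep_mask / (1.0 - rate)

# Batch normalization at training time (before the learned scale and shift):
# standardize each neuron's output across the batch
eps = 1e-3
mean = dropped.mean(axis=0)
var = dropped.var(axis=0)
normalized = (dropped - mean) / np.sqrt(var + eps)

print(keep_mask.mean())                   # fraction kept, close to 0.8
print(normalized.mean(axis=0).round(3))   # close to 0 for every neuron
print(normalized.std(axis=0).round(3))    # close to 1 for every neuron
```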

Let's add another layer of `relu` neurons, along with more dropout and batch normalization.

In [46]:
FNN_model.add(Dense(612, activation='relu'))
FNN_model.add(Dropout(0.2))
FNN_model.add(BatchNormalization())

And add another layer of neurons.

In [47]:
FNN_model.add(Dense(32, activation='relu'))
FNN_model.add(Dropout(0.1))

Finally, we add an output layer that will give us a probability distribution. It will have as many neurons as we have classes (52 in this case), and will use the `softmax` function to make sure all outputs are fractions between 0 and 1, and they all add up to 1.

Recall that the softmax function is defined on a vector of outputs $\vec{O} = [o_1, o_2, o_3, ..., o_n]$

$$softmax(o_i) = \frac{e^{o_i}}{\sum_{j=1}^n e^{o_j}} $$
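
A small NumPy sketch of this formula is shown below. The input values are arbitrary, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(o):
    shifted = o - np.max(o)      # avoids overflow in exp; the ratios are unchanged
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.round(3))    # roughly 0.659, 0.242, 0.099
print(probs.sum())       # the outputs add up to 1 (up to floating point error)
```

The largest input receives the largest probability, but every class keeps a non-zero share.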

In [48]:
class_number = len(all_chars) #52

FNN_model.add(Dense(class_number, activation='softmax'))

Now we have to compile the model.

In [49]:

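# `learning_rate` replaces the deprecated `lr` argument; `decay=0.0` was the default and is omitted
sgd = SGD(learning_rate=0.01, momentum=0.0, nesterov=False, clipnorm=2.0)

# Compile model. We still need to compile even if we load the weights from an H5 file,
# since the model must be compiled before it can be trained or evaluated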
FNN_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=sgd,
    metrics=['accuracy']
)

In [50]:
# Return the model summary
FNN_model.summary()

# Save an image of the model's architecture to a file
plot_model(FNN_model, to_file='Feed Forward NN.png', show_shapes=True, show_layer_names=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 1024)              11264     
_________________________________________________________________
dropout_3 (Dropout)          (None, 1024)              0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 1024)              4096      
_________________________________________________________________
dense_6 (Dense)              (None, 612)               627300    
_________________________________________________________________
dropout_4 (Dropout)          (None, 612)               0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 612)               2448      
_________________________________________________________________
dense_7 (Dense)              (None, 32)                19616     
__________
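
The parameter counts in this summary can be reproduced by hand. The sketch below uses two hypothetical helper functions: a dense layer has one weight per input-unit pair plus one bias per unit, and a batch normalization layer stores four values per feature (scale, shift, moving mean, moving variance).

```python
def dense_params(n_inputs, n_units):
    # one weight per (input, unit) pair, plus one bias per unit
    return n_inputs * n_units + n_units

def batch_norm_params(n_features):
    # scale, shift, moving mean and moving variance for each feature
    return 4 * n_features

print(dense_params(10, 1024))     # 11264, the first Dense layer (10 inputs)
print(batch_norm_params(1024))    # 4096
print(dense_params(1024, 612))    # 627300
print(batch_norm_params(612))     # 2448
print(dense_params(612, 32))      # 19616
```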

We could train the model now, but since this could take a while (more so for the future model), we'll use a pre-trained model.

In [52]:
if not os.path.exists('FNN_model.h5'):
    EPOCHS = 40       # NNs operate in epochs, meaning this is how many times the neural network will go through the entire data
    BATCH_SIZE = 480   # at each epoch, it will split the data into units of 480 samples, and train on those

    FNN_model.fit(
        flat_train_X, train_y,
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        validation_data=(flat_validation_X, validation_y))
    
    # great way to save the model as a json/h5 file set 
    # https://machinelearningmastery.com/save-load-keras-deep-learning-models/
    model_json = FNN_model.to_json()
    with open("FNN_model.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    FNN_model.save_weights("FNN_model.h5")
    print("Saved model to disk")
else:
    FNN_model.load_weights("FNN_model.h5")
    print("Loaded weights model from disk") 
    print("No need to train, model is fully trained")
    loss_score, accuracy_score = FNN_model.evaluate(flat_validation_X, validation_y, batch_size=480, verbose=1)
    print("Accuracy score: "+str(round(accuracy_score*100,2))+"%")

Loaded weights model from disk
No need to train, model is fully trained
Accuracy score: 25.1%


Now let's check how well this performs. We'll build a loop that will continuously produce characters for us, based on a sequence we will "seed" the network with.
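
The key step in that loop is drawing the next character at random, weighted by the network's predicted probabilities. Here is that step in isolation, with a made-up three-character vocabulary and a made-up probability vector standing in for the network's output. For comparison, it also shows the "greedy" alternative of always picking the most likely character, which would produce the same continuation every time.

```python
import numpy as np

toy_chars = ['a', 'b', 'c']
toy_proba = np.array([0.7, 0.2, 0.1])   # stand-in for the network's softmax output

rng = np.random.default_rng(0)

# Draw many characters according to the predicted probabilities
sampled = rng.choice(toy_chars, size=1000, p=toy_proba)

# Greedy alternative: always take the single most likely character
greedy = toy_chars[int(np.argmax(toy_proba))]

for c in toy_chars:
    print(c, np.mean(sampled == c))     # frequencies land near 0.7, 0.2 and 0.1
print("greedy choice:", greedy)          # greedy choice: a
```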

In [53]:
input_phrase = "To be or not to be."

# we will predict 200 characters forward after the input_phrase
for i in range(200):
    
    # get the last 10 characters of our input_phrase and convert them to numbers
    network_input = list(input_phrase[-SEQUENCE_LENGTH:])
    for j in range(len(network_input)):
        network_input[j] = char_to_number[network_input[j]]
    # convert into an array, then reshape it into a batch of one flattened sequence
    network_input = np.array(network_input)
    network_input = network_input.reshape((1, SEQUENCE_LENGTH))

    # get probabilistic predictions from the neural network
    # randomly draw a single predicted character from the full list with their probabilities determined by the network's prediction
    predict_proba = FNN_model.predict(network_input)[0]
    predict_char = np.random.choice(all_chars, 1, p = predict_proba)[0]

    input_phrase += predict_char

print(input_phrase)


To be or not to be.rr mnredsäsel fhg,vnatend baoteaamtoodenntherso rea
anh nundse,vore'toma
herrldsdy,ohetdge ie deasdt kni syæee foe be d"
ote mtfnha dh ga
ssaeyi led wle yhe ald lopl aari th wtu sl tot
 figl rridg th 


### Hmmmm....

Not great results. This doesn't even look like English.

So now let's see how an RNN can help us. Let's inspect our series data again:

In [54]:
print(train_X.shape)
print(validation_X.shape)

(167141, 10, 1)
(18571, 10, 1)


And build our RNN model:

In [55]:
RNN_model1 = Sequential()

Now, let's add our first layer. We always need to specify an input size for our first layer.

In [56]:
RNN_model1.add(LSTM(1024, activation='relu', input_shape=(train_X.shape[1:]), return_sequences=True))

Notice something interesting, we are asking our recurrent layer to return a sequence (`return_sequences=True`). Why? 

Well, if we want to add a second recurrent layer, it must accept a sequence as an input from the first layer. In this case, each data point has 1 feature (our sequence) that contains a 10-measurement series, so the output of this layer for each data point will be of size $10 \times 1024$: one 1024-dimensional output per step of the series.

Let's add some dropout and batch normalization for good measure:

In [57]:
RNN_model1.add(Dropout(0.2))
RNN_model1.add(BatchNormalization())

Now let's add a second recurrent layer:

In [58]:
RNN_model1.add(LSTM(612, activation='relu'))
RNN_model1.add(Dropout(0.2))
RNN_model1.add(BatchNormalization())

Notice this time we are not returning a series, the output shape will be only $1 \times 612$ (rather than $10 \times 612$).

In [59]:
RNN_model1.add(Dense(32, activation='relu'))
RNN_model1.add(Dropout(0.1))

Finally, let's add an output layer. Since we are dealing with classification, our default activation is the softmax function:

In [60]:
RNN_model1.add(Dense(class_number, activation='softmax'))

And there we go! We have a model! 

<img src="https://drive.google.com/uc?export=view&id=1aRiD7_DYgn7h068DMtHrKmuV4GcnyQvV" alt="Drawing" style="width: 600px;"/>

<center> <i>Image prepared using http://alexlenail.me/NN-SVG/)</i> </center>

Here $n$ is again the size of our input, but notice that the first two layers now each process 2D tensors.

Again, since the model is big and takes a long time to train, we will load a pre-trained model.

In [64]:
sgd = SGD(learning_rate=0.01, momentum=0.0, nesterov=False, clipnorm=2.0)

# Compile model
RNN_model1.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=sgd,
    metrics=['accuracy']
)

In [65]:
# Display its summary
RNN_model1.summary()

# Save an image of its architecture to file
plot_model(RNN_model1, to_file='RNN_model1.png', show_shapes=True, show_layer_names=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 10, 1024)          4202496   
_________________________________________________________________
dropout_6 (Dropout)          (None, 10, 1024)          0         
_________________________________________________________________
batch_normalization_4 (Batch (None, 10, 1024)          4096      
_________________________________________________________________
lstm_1 (LSTM)                (None, 612)               4007376   
_________________________________________________________________
dropout_7 (Dropout)          (None, 612)               0         
_________________________________________________________________
batch_normalization_5 (Batch (None, 612)               2448      
_________________________________________________________________
dense_9 (Dense)              (None, 32)                19616     
__________
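
The LSTM parameter counts in this summary can also be checked by hand. A standard LSTM layer has four internal gates, and each gate has its own input weights, recurrent weights, and bias. The helper name `lstm_params` below is hypothetical.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and one bias per unit
    return 4 * (units * (input_dim + units) + units)

print(lstm_params(1, 1024))      # 4202496, first LSTM layer (1 input feature)
print(lstm_params(1024, 612))    # 4007376, second LSTM layer (1024 inputs from the first)
```

This is why the recurrent model has far more parameters than the feed-forward one, even though the layer widths are the same.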

In [67]:
if not os.path.exists('RNN_model1.h5'):
    EPOCHS = 3       # NNs operate in epochs, meaning this is how many times the neural network will go through 
                      # the entire data
    BATCH_SIZE = 480   # at each epoch, it will split the data into units of 480 samples, and train on those


    RNN_model1.fit(train_X, train_y,
                   batch_size=BATCH_SIZE,
                   epochs=EPOCHS,
                   validation_data=(validation_X, validation_y))
    
    model_json = RNN_model1.to_json()
    with open("RNN_model1.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    RNN_model1.save_weights("RNN_model1.h5")
    print("Saved model to disk")
else:
    # load weights into new model
    print("Loading weights from h5 file")
    RNN_model1.load_weights("RNN_model1.h5")
    print("Loaded weights from disk")
    print("No need to train, model is fully trained")
    loss_score, accuracy_score = RNN_model1.evaluate(validation_X, validation_y, batch_size=480, verbose=1)
    print("Accuracy score: "+str(round(accuracy_score*100,2))+"%")

Train on 167141 samples, validate on 18571 samples
Epoch 1/3
  1440/167141 [..............................] - ETA: 27:38 - loss: 4.1264 - acc: 0.0257

KeyboardInterrupt: 

Our model is trained, let's get some predictions:

In [42]:
input_phrase = "To be or not to be."

# we will predict 200 characters forward after the input_phrase
for i in range(200):
    
    # get the last 10 characters of our input_phrase and convert them to numbers
    network_input = list(input_phrase[-SEQUENCE_LENGTH:])
    for j in range(len(network_input)):
        network_input[j] = char_to_number[network_input[j]]
    # convert into an array then reshape it to explicitly have 1 feature
    network_input = np.array(network_input)
    network_input = network_input.reshape((1, SEQUENCE_LENGTH, 1))

    # get probabilistic predictions from the neural network
    # randomly draw a single predicted character from the full list with their probabilities determined by the network's prediction
    predict_proba = RNN_model1.predict(network_input)[0]
    predict_char = np.random.choice(all_chars, 1, p = predict_proba)[0]

    input_phrase += predict_char
    print(i, end="\r")
    
print(input_phrase)



To be or not to be.as that ituma oitdl the a wmgik bug the ratkt igsg fryisoerfdd
britnë(
berankirerr;ofel we gor, and wnri bocer n'ei—ruutl la winti the rd,s ofecomettowt, imlp
fe bearteik shum pitie tnand alh monr,
sa


This isn't much better. To get RNNs working better with text, we need to embed our vocabulary. Recall that word embedding means taking each word and embedding it into a high-dimensional space as a vector. It turns out we can do the same with characters, or really any finite collection.

Keras offers us an embedding layer as part of the neural network.

---
#### Exercise 1

1. Look at the Keras documentation for embedding layers, and find out how to add an embedding layer to our model before the recurrent LSTM layer. Use an embedding dimension of 30 (i.e. embed the characters into a 30-dimensional space). The embedding dimension is arbitrary, but it is usually chosen to be smaller than `input_dim`.

*Warning: The code below this exercise will not run if the embedding layer (part of this exercise) isn't added!* 


The Embedding layer learns a task-specific embedding of our vocabulary as part of the model.

It requires that the input data be integer encoded, so that each word (or, in our case, each character) is represented by a unique integer. For words, this data preparation step can be performed using the Tokenizer API also provided with Keras.

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used in a variety of ways, such as:
1. It can be used alone to learn a word embedding that can be saved and used in another model later.
2. It can be used as part of a deep learning model where the embedding is learned along with the model itself.
3. It can be used to load a pre-trained word embedding model, a type of transfer learning.

The Embedding layer is defined as the first hidden layer of a network. It must specify 3 arguments:
1. `input_dim`: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
2. `output_dim`: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
3. `input_length`: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.
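
Conceptually, an embedding is just a lookup table with one row per vocabulary item. The NumPy sketch below illustrates this with random values standing in for the learned weights. The variable names and the initialization range are assumptions made for illustration, and this is not how Keras implements the layer internally.

```python
import numpy as np

vocab_size, embedding_dim = 52, 30     # our 52 characters, embedded in 30 dimensions

rng = np.random.default_rng(0)
# one row of 30 numbers per character; in a real model these values are learned
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

encoded_sequence = np.array([0, 1, 2, 3, 4, 5, 6, 0, 1, 2])   # "break, bre" as integers
embedded = embedding_matrix[encoded_sequence]                  # look up one row per character

print(embedding_matrix.size)   # 1560 values in the table
print(embedded.shape)          # (10, 30)
```

The same character always maps to the same vector (here, positions 0 and 7 both hold `b`), and each 10-character sequence becomes a $10 \times 30$ matrix, which is exactly the kind of sequence input a recurrent layer expects.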

In [83]:
RNN_model2 = Sequential()
# add something here as part of exercise
RNN_model2.add(Embedding(52, 30, input_length=10))

RNN_model2.add(LSTM(1024, activation='relu', return_sequences=True)) #input_shape=(train_X.shape[1:]), is not necessary
RNN_model2.add(Dropout(0.2))
RNN_model2.add(BatchNormalization())

RNN_model2.add(LSTM(612, activation='relu'))
RNN_model2.add(Dropout(0.2))
RNN_model2.add(BatchNormalization())

RNN_model2.add(Dense(32, activation='relu'))
RNN_model2.add(Dropout(0.1))

RNN_model2.add(Dense(class_number, activation='softmax'))

The RNN with the embedding layer will look like this:

<img src="https://drive.google.com/uc?export=view&id=1hgdDOMNow465V4Bs6cOFYAY4Fon2cLO3" alt="Drawing" style="width: 600px;"/>

---

In [84]:
sgd = SGD(learning_rate=0.01, momentum=0.0, nesterov=False, clipnorm=2.0)

# Compile model
RNN_model2.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=sgd,
    metrics=['accuracy']
)

In [85]:
# Display its summary
RNN_model2.summary()

# Save an image of its architecture to file
plot_model(RNN_model2, to_file='RNN_model2.png', show_shapes=True, show_layer_names=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 10, 30)            1560      
_________________________________________________________________
lstm_8 (LSTM)                (None, 10, 1024)          4321280   
_________________________________________________________________
dropout_18 (Dropout)         (None, 10, 1024)          0         
_________________________________________________________________
batch_normalization_12 (Batc (None, 10, 1024)          4096      
_________________________________________________________________
lstm_9 (LSTM)                (None, 612)               4007376   
_________________________________________________________________
dropout_19 (Dropout)         (None, 612)               0         
_________________________________________________________________
batch_normalization_13 (Batc (None, 612)               2448      
__________

In [86]:
if not os.path.exists('RNN_model2.h5'):
    EPOCHS = 40       # NNs operate in epochs, meaning this is how many times the neural network will go through 
                      # the entire data
    BATCH_SIZE = 480   # at each epoch, it will split the data into units of 480 samples, and train on those

    train_X = train_X.reshape((-1, SEQUENCE_LENGTH))
    RNN_model2.fit(
        train_X, train_y,
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        validation_data=(validation_X.reshape((-1, SEQUENCE_LENGTH)), validation_y))
    
    model_json = RNN_model2.to_json()
    with open("RNN_model2.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    RNN_model2.save_weights("RNN_model2.h5")
    print("Saved model to disk")
else:
    print("Loading weights from h5 file")
    RNN_model2.load_weights("RNN_model2.h5")
    print("Loaded weights from disk")
    print("No need to train, model is fully trained")
    loss_score, accuracy_score = RNN_model2.evaluate(validation_X.reshape((-1, SEQUENCE_LENGTH)), 
                                                     validation_y, 
                                                     batch_size=480, 
                                                     verbose=1)
    print("Accuracy score: "+str(round(accuracy_score*100,2))+"%")

Loading weights from h5 file
Loaded weights from disk
No need to train, model is fully trained
Accuracy score: 38.15%


In [88]:
input_phrase = "To be or not to be."

# we will predict 200 characters forward after the input_phrase
for i in range(200):
    
    # get the last 10 characters of our input_phrase and convert them to numbers
    network_input = list(input_phrase[-SEQUENCE_LENGTH:])
    for j in range(len(network_input)):
        network_input[j] = char_to_number[network_input[j]]
    # convert into an array then reshape it to explicitly have 1 feature
    network_input = np.array(network_input)
    network_input = network_input.reshape((1, SEQUENCE_LENGTH))

    # get probabilistic predictions from the neural network
    # randomly draw a single predicted character from the full list with their probabilities determined by the network's prediction
    predict_proba = RNN_model2.predict(network_input)[0]
    predict_char = np.random.choice(all_chars, 1, p = predict_proba)[0]


    input_phrase += predict_char
    print(i, end="\r")

print(input_phrase)



To be or not to be.e
queile ar sakearns, unldd thy halk boddy legatl wheid cha moat.

a“ve }the roer that gron of lows crike ponortdedre in sa me? hyhe the, clapkreor he rihgt rad—odl onsleg and orf memenn’d "hetr.

ite


This looks a bit more like English, and we would likely get better results by training the model for longer. However, we can also approach the problem differently: we can split our data into words instead of characters.
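
A note on the sampling step in the generation loop above: drawing the next character with `np.random.choice` keeps the output varied, whereas always taking the most probable character (greedy decoding) is deterministic and can get stuck repeating the same characters. The sketch below contrasts the two on a made-up distribution; `toy_chars` and `toy_probs` are illustrative stand-ins for `all_chars` and the network's softmax output. The cast to `float64` followed by renormalization is a precaution, since `float32` probabilities may not sum to 1 closely enough for NumPy's check.

```python
import numpy as np

# Made-up vocabulary and probabilities (stand-ins for all_chars and predict_proba)
toy_chars = ["a", "e", "t", " "]
toy_probs = np.array([0.1, 0.5, 0.3, 0.1], dtype=np.float32)

# Renormalize in float64 so the probabilities sum to 1 within NumPy's tolerance
probs = toy_probs.astype(np.float64)
probs /= probs.sum()

# Greedy decoding: always pick the most probable character
greedy_char = toy_chars[int(np.argmax(probs))]

# Sampling: draw characters in proportion to their predicted probability
rng = np.random.default_rng(1)
sampled_chars = [str(rng.choice(toy_chars, p=probs)) for _ in range(10)]

print("greedy: ", greedy_char)
print("sampled:", "".join(sampled_chars))
```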

### Splitting the Data Into Words

So far, we've split the data character by character. This approach has two benefits. First, the embedding is small (only $52 \times 30$, or 1,560 parameters), so it is fast to learn. Second, the predictions are divided into only 52 classes.

However, the system spends the majority of its training time learning correct word spelling. While we could potentially train the system longer and employ some auto-correction to fix spelling errors, we can instead split the poems dataset into words.

In [89]:
import re
all_words = []
dataset = []

# cycle through all the poems
for poem in tennyson_poems['Content']:
    poem = poem.lower()
    
    # Regex method for removing special characters from a string.
    # [^A-Za-z0-9 \n]+ matches all characters that are NOT alphanumeric, a space, or '\n'
    # https://stackoverflow.com/questions/5843518/remove-all-special-characters-punctuation-and-spaces-from-string
    poem_transformed = re.sub('[^A-Za-z0-9 \n]+', '', poem)
    
        
    # Notice now we are splitting by spaces rather than turning the poem into a list of characters
    poem_transformed = poem_transformed.replace('\n', ' \n ').replace('  ', ' ')
    # split the poem into its individual words
    poem_words = poem_transformed.split(' ')
    dataset.append(poem_words)
    
    # Also create a list of all unique words we saw
    for word in poem_words:
        if word not in all_words:
            all_words.append(word)

    
print(dataset[0])


['break', 'break', 'break', '\n', 'on', 'thy', 'cold', 'gray', 'stones', 'o', 'sea', '\n', 'and', 'i', 'would', 'that', 'my', 'tongue', 'could', 'utter', '\n', 'the', 'thoughts', 'that', 'arise', 'in', 'me', '\n', '\n', 'o', 'well', 'for', 'the', 'fishermans', 'boy', '\n', 'that', 'he', 'shouts', 'with', 'his', 'sister', 'at', 'play', '\n', 'o', 'well', 'for', 'the', 'sailor', 'lad', '\n', 'that', 'he', 'sings', 'in', 'his', 'boat', 'on', 'the', 'bay', '\n', '\n', 'and', 'the', 'stately', 'ships', 'go', 'on', '\n', 'to', 'their', 'haven', 'under', 'the', 'hill', '\n', 'but', 'o', 'for', 'the', 'touch', 'of', 'a', 'vanishd', 'hand', '\n', 'and', 'the', 'sound', 'of', 'a', 'voice', 'that', 'is', 'still', '\n', '\n', 'break', 'break', 'break', '\n', 'at', 'the', 'foot', 'of', 'thy', 'crags', 'o', 'sea', '\n', 'but', 'the', 'tender', 'grace', 'of', 'a', 'day', 'that', 'is', 'dead', '\n', 'will', 'never', 'come', 'back', 'to', 'me']


Let's check how many classes we are working with when splitting by words rather than by characters.

In [90]:
print("class number when splitting into words:",len(all_words))

class number when splitting into words: 5318
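
The loop above tests `word not in all_words` against a list, which scans the whole list for every word. With a few thousand unique words this is fine, but for larger corpora a dictionary gives the same first-seen ordering with constant-time lookups on average. A minimal sketch, where `build_vocabulary` and the toy poems are illustrative names and not part of the notebook:

```python
def build_vocabulary(tokenized_poems):
    """Map each unique word to an index, in order of first appearance."""
    vocabulary = {}
    for poem_words in tokenized_poems:
        for word in poem_words:
            if word not in vocabulary:
                # The next free index equals the current vocabulary size
                vocabulary[word] = len(vocabulary)
    return vocabulary


toy_poems = [
    ["break", "break", "break", "\n", "on", "thy"],
    ["on", "the", "bay", "\n"],
]
word_to_index = build_vocabulary(toy_poems)
print(word_to_index)
print("class number:", len(word_to_index))
```

The resulting dictionary plays the role of `word_to_number`, and `list(word_to_index)` recovers the equivalent of `all_words`.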


Now that the poem dataset is split, we will convert it into the same format of $X$ and $y$ as we've had before. Notice we are also keeping the newline character (`\n`) as a separate word so the system will learn when to place a newline.

In [91]:
SEQUENCE_LENGTH = 10

X = []
y = []

# for each poem
for poem in dataset:
    word_deque = deque(maxlen=SEQUENCE_LENGTH)
    
    # go through the words and place them in a deque, once the deque fills up and we try to add
    # another word, the oldest word will be thrown out
    for i in range(len(poem)-1):
        word = poem[i]
        word_deque.append(word)
        
        if (len(word_deque) == SEQUENCE_LENGTH):
            X.append(list(word_deque))
            y.append(poem[i+1])

Inspect our $X$ and $y$, as always:

In [92]:
for i in range(5):
    print("X:",X[i])
    print("y:",y[i])
    print("*******")

X: ['break', 'break', 'break', '\n', 'on', 'thy', 'cold', 'gray', 'stones', 'o']
y: sea
*******
X: ['break', 'break', '\n', 'on', 'thy', 'cold', 'gray', 'stones', 'o', 'sea']
y: 

*******
X: ['break', '\n', 'on', 'thy', 'cold', 'gray', 'stones', 'o', 'sea', '\n']
y: and
*******
X: ['\n', 'on', 'thy', 'cold', 'gray', 'stones', 'o', 'sea', '\n', 'and']
y: i
*******
X: ['on', 'thy', 'cold', 'gray', 'stones', 'o', 'sea', '\n', 'and', 'i']
y: would
*******
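
The windowing logic is easier to follow on a short example. The sketch below wraps the same deque-based loop in a function (`make_windows` is an illustrative name) and uses a window of 3 instead of 10. A poem with `n` tokens yields `n - SEQUENCE_LENGTH` samples, and poems no longer than the window contribute none.

```python
from collections import deque


def make_windows(tokens, sequence_length):
    """Slide a fixed-size window over tokens; the target is the token that follows."""
    window = deque(maxlen=sequence_length)
    X, y = [], []
    for i in range(len(tokens) - 1):
        window.append(tokens[i])  # the oldest token drops out once the deque is full
        if len(window) == sequence_length:
            X.append(list(window))
            y.append(tokens[i + 1])
    return X, y


tokens = ["break", "break", "break", "\n", "on", "thy", "cold"]
X_toy, y_toy = make_windows(tokens, sequence_length=3)

for window, target in zip(X_toy, y_toy):
    print(window, "->", repr(target))
print("samples:", len(X_toy))
```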


Build our word-to-number and number-to-word converters:

In [93]:
number_to_word = {i: j for i,j in enumerate(all_words)}
word_to_number = {j: i for i,j in enumerate(all_words)}

In [94]:
for i in range(len(X)):
    for j in range(len(X[0])):
        X[i][j] = word_to_number[X[i][j]]
        
    y[i] = word_to_number[y[i]]

Validate our shapes, and split into a train and validation set

In [95]:
X = np.array(X)
y = np.array(y)
# Let's look at the shapes
print(X.shape)
print(y.shape)

X = X.reshape((X.shape[0], X.shape[1], 1))
print(X.shape)
print(y.shape)

(38865, 10)
(38865,)
(38865, 10, 1)
(38865,)


In [96]:
X, y = shuffle_data(X, y)
print(X.shape)
print(y.shape)

(38865, 10, 1)
(38865, 1)


In [97]:
# Create train and validation data
validate_set_size = int(0.1 * X.shape[0])

train_set_limit = X.shape[0] - validate_set_size

# Split train
train_X = X[:train_set_limit]
train_y = y[:train_set_limit]

# Split validation
validation_X = X[train_set_limit : ]
validation_y = y[train_set_limit : ]

print(train_X.shape)      
print(validation_X.shape)

(34979, 10, 1)
(3886, 10, 1)


As before, the model could take a very long time to train, so we will use a pre-trained model.

In [98]:
class_number = len(all_words)

    
RNN_model3 = Sequential()
# This time we will embed the words into a higher dimensional space, 300-dimensional
RNN_model3.add(Embedding(len(all_words), 300, input_length=SEQUENCE_LENGTH))

RNN_model3.add(LSTM(1024, activation='relu', return_sequences=True))
RNN_model3.add(Dropout(0.2))
RNN_model3.add(BatchNormalization())

RNN_model3.add(LSTM(612, activation='relu'))
RNN_model3.add(Dropout(0.2))
RNN_model3.add(BatchNormalization())

RNN_model3.add(Dense(32, activation='relu'))
RNN_model3.add(Dropout(0.1))

RNN_model3.add(Dense(class_number, activation='softmax'))

Our new network will look as follows. Notice that the only changes are the output layer size and the embedding size.

<img src="https://drive.google.com/uc?export=view&id=15UWSQFYNKEyKNVN7jTyuRlddHWITENlz" alt="Drawing" style="width: 600px;"/>

In [99]:
sgd = SGD(lr=0.01, decay=0.0, momentum=0.0, nesterov=False, clipnorm=2.0)

# Compile model
RNN_model3.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=sgd,
    metrics=['accuracy']
)

In [100]:
# Display its summary
RNN_model3.summary()

# Save an image of its architecture to file
plot_model(RNN_model3, to_file='RNN_model3.png', show_shapes=True, show_layer_names=True)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 10, 300)           1595400   
_________________________________________________________________
lstm_10 (LSTM)               (None, 10, 1024)          5427200   
_________________________________________________________________
dropout_21 (Dropout)         (None, 10, 1024)          0         
_________________________________________________________________
batch_normalization_14 (Batc (None, 10, 1024)          4096      
_________________________________________________________________
lstm_11 (LSTM)               (None, 612)               4007376   
_________________________________________________________________
dropout_22 (Dropout)         (None, 612)               0         
_________________________________________________________________
batch_normalization_15 (Batc (None, 612)               2448      
__________

In [101]:
if not os.path.exists('RNN_model3.json'):
    EPOCHS = 40       # NNs operate in epochs, meaning this is how many times the neural network will go through 
                      # the entire data
    BATCH_SIZE = 480   # at each epoch, it will split the data into units of 480 samples, and train on those

    train_X = train_X.reshape((-1, SEQUENCE_LENGTH))
    RNN_model3.fit(train_X, train_y,
                   batch_size=BATCH_SIZE,
                   epochs=EPOCHS,
                   validation_data=(validation_X.reshape((-1, SEQUENCE_LENGTH)), validation_y))
else:
    print("Loading weights from h5 file")
    RNN_model3.load_weights('RNN_model3.h5')
    print("Loaded weights from disk")
    print("No need to train, model is fully trained")
    loss_score, accuracy_score = RNN_model3.evaluate(validation_X.reshape((-1, SEQUENCE_LENGTH)), 
                                                     validation_y, 
                                                     batch_size=480, 
                                                     verbose=1)
    print("Accuracy score: "+str(round(accuracy_score*100,2))+"%")


Loading weights from h5 file
Loaded weights from disk
No need to train, model is fully trained
Accuracy score: 23.06%


Note that we now have 5,318 possible classes (one per word), so ~23% accuracy is a pretty good score for this problem!
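
To put the accuracy in context, it helps to compare against simple baselines. The sketch below computes the accuracy of uniform random guessing and of always predicting the most frequent label. The function names and the toy labels are illustrative; the majority-class figure for the actual data is not reported in this notebook and would come from running `majority_class_accuracy` on the validation labels.

```python
from collections import Counter


def chance_accuracy(num_classes):
    """Expected accuracy when guessing uniformly at random."""
    return 1.0 / num_classes


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


print(f"chance, 52 characters: {chance_accuracy(52):.2%}")
print(f"chance, 5318 words:    {chance_accuracy(5318):.4%}")

# Toy labels only; with the notebook's arrays, flatten first,
# e.g. validation_y.ravel().tolist()
toy_labels = [0, 0, 1, 2]
print(f"majority baseline (toy labels): {majority_class_accuracy(toy_labels):.0%}")
```

Because the newline token ends every line of every poem, the majority-class baseline on this dataset may be well above uniform chance, which makes it the more demanding comparison.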

Now that we have a model, we want to check its performance in completing a sentence.
Recall that earlier we used the phrase `To be or not to be.` as the seed and let the model complete the sentence from there, character by character.

If we use the same phrase, we need to pre-process it a bit first.

---

#### Exercise 2

1. Use the phrase below, and pre-process it so it can be used by the `RNN_model3` network we've built (remember the input must be individual words, and that the network expects an input of length 10 every time we request a word as output).
2. For all four network training exercises, we've carved out a validation set. What would we use it for during the training process (hint: it may be too costly to train the network for 100 epochs only to find out its peak performance is at 50 epochs)?

In [105]:
if not os.path.exists('RNN_model3.json'):
    EPOCHS = 40       # NNs operate in epochs, meaning this is how many times the neural network will go through 
                      # the entire data
    BATCH_SIZE = 480   # at each epoch, it will split the data into units of 480 samples, and train on those

    train_X = train_X.reshape((-1, SEQUENCE_LENGTH))
    RNN_model3.fit(
        train_X, train_y,
        batch_size=BATCH_SIZE,
        epochs=EPOCHS,
        validation_data=(validation_X.reshape((-1, SEQUENCE_LENGTH)), validation_y))
    
    model_json = RNN_model3.to_json()
    with open("RNN_model3.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    RNN_model3.save_weights("RNN_model3.h5")
    print("Saved model to disk")
else:
    print("Loading weights from h5 file")
    RNN_model3.load_weights("RNN_model3.h5")
    print("Loaded weights from disk")
    print("No need to train, model is fully trained")
    loss_score, accuracy_score = RNN_model3.evaluate(validation_X.reshape((-1, SEQUENCE_LENGTH)), 
                                                     validation_y, 
                                                     batch_size=480, 
                                                     verbose=1)
    print("Accuracy score: "+str(round(accuracy_score*100,2))+"%")

Loading weights from h5 file
Loaded weights from disk
No need to train, model is fully trained
Accuracy score: 23.06%


"dear sir I heard you like the sea"

In [109]:
input_phrase = "To be or not to be."

# we will predict 20 characters forward after the input_phrase
for i in range(20):
    
    # get the last 10 characters of our input_phrase and convert them to numbers
    network_input = list(input_phrase[-SEQUENCE_LENGTH:])
    for j in range(len(network_input)):
        network_input[j] = char_to_number[network_input[j]]
    # convert into an array then reshape it to explicitly have 1 feature
    network_input = np.array(network_input)
    network_input = network_input.reshape((1, SEQUENCE_LENGTH))

    # get probabilistic predictions from the neural network
    # randomly draw a single predicted character from the full list with their probabilities determined by the network's prediction
    predict_proba = RNN_model3.predict(network_input)[0]
    predict_char = np.random.choice(all_chars, 1, p = predict_proba)[0]


    input_phrase += predict_char
    print(i, end="\r")

print(input_phrase)

ValueError: 'a' and 'p' must have same size
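
The error above arises because the loop still uses the character-level helpers: `RNN_model3` returns one probability per word (5,318 values), while `all_chars` holds only the 52 characters, so `np.random.choice` receives arrays of different lengths. The input also has to be a sequence of word indices, not character indices. Below is one possible way to pre-process a phrase, offered as a sketch and not as the reference solution. It assumes that short phrases are left-padded with the newline token (an arbitrary choice, since the vocabulary has no dedicated padding word) and that unknown words raise an error; `phrase_to_ids` and the toy vocabulary are illustrative names.

```python
import re


def phrase_to_ids(phrase, word_to_number, sequence_length, pad_word="\n"):
    """Convert a phrase into exactly `sequence_length` word indices."""
    # Same cleaning as the training data: lowercase, strip punctuation
    cleaned = re.sub(r"[^a-z0-9 \n]+", "", phrase.lower())
    # Keep newlines as separate words, then drop empty strings left by extra spaces
    words = [w for w in cleaned.replace("\n", " \n ").split(" ") if w]

    unknown = [w for w in words if w not in word_to_number]
    if unknown:
        raise KeyError(f"words not in the vocabulary: {unknown}")

    # Keep the last `sequence_length` words; left-pad if the phrase is shorter
    # (padding with the newline token is an assumption of this sketch)
    words = words[-sequence_length:]
    padding = [pad_word] * (sequence_length - len(words))
    return [word_to_number[w] for w in padding + words]


# Toy vocabulary standing in for the notebook's word_to_number
toy_vocab = ["\n", "dear", "sir", "i", "heard", "you", "like", "the", "sea"]
toy_word_to_number = {word: index for index, word in enumerate(toy_vocab)}

ids = phrase_to_ids("dear sir I heard you like the sea", toy_word_to_number, 10)
print(ids)
```

In the generation loop, the resulting list would be reshaped to `(1, SEQUENCE_LENGTH)`, the next word drawn from `all_words` (not `all_chars`) using the predicted probabilities, and appended to the phrase with a separating space.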

<div id="container" style="position:relative;">
<div style="position:relative; float:right"><img style="height:25px; width:50px" src ="https://drive.google.com/uc?export=view&id=14VoXUJftgptWtdNhtNYVm6cjVmEWpki1" />
</div>
</div>