TODO:

* Brief intro section on what LSTMs are
* Go through coding up the math, step by step
* Go through the theory of how data must flow through them
* Go through the classes

# New - with classes

How do we design the classes for an RNN?

Before we do this, we have to know what it is we are designing. Let's set out what we do know about RNNs.

The goal of the model will be to predict the next character in a sequence. It will do this by taking as input a one-hot encoded version of the characters input sequence and comparing its predictions to a one hot encoded version of the characters in the output sequence, with all the characters shifted one forward. 

We know that these sequences will be fed into an "RNN Layer", one character at a time. Each time, the layers will feed back into themselves a "cell state" and a "hidden state", - a typical RNN would just feed back in a "hidden state", but when our cells are LSTM cells they feed back a "cell state" as well.

![](img/Olah_RNNs.png)

We can implement this as a series of nodes that maintain a hidden state and a cell state that get updated at each time step. 

The first node, the very first time the network is trained, will receive as information: 

* A one-hot encoded version of the first letter in the network
* An initial hidden state, which can be initialized to all zeros
* An initial cell state, which can also be initialized to all zeros

The result looks like this:

![](img/LSTM_1.png)

It will return the hidden state and the cell state to be passed on to the next node - keeping in mind that what is really happening is that we are passing the hidden state and cell state back into the RNN layer itself. So, equivalently to what is drawn above, we could draw:

![](img/LSTM_2.png)

That's the RNN-specific stuff. The rest of the stuff is neural net specific.

## Activations

In [7]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def dsigmoid(y):
    return y * (1 - y)


def tanh(x):
    return np.tanh(x)


def dtanh(y):
    return 1 - y * y


def softmax(x):
    return np.exp(x) / np.sum(np.exp(x)) #softmax

`LSTM_Param`

In [8]:
class LSTM_Param:
    def __init__(self, value):
        self.value = value
        self.deriv = np.zeros_like(value) #derivative
        self.momentum = np.zeros_like(value) #momentum for AdaGrad
        
    def clear_gradient(self):
        self.deriv = np.zeros_like(self.value) #derivative
        
    def clip_gradient(self):
        self.deriv = np.clip(self.deriv, -1, 1, out=self.deriv)
        
    def update(self, learning_rate):
        self.momentum += self.deriv * self.deriv # Calculate sum of gradients
        self.value += -(learning_rate * self.deriv / np.sqrt(self.momentum + 1e-8))
        
    def update_sgd(self, learning_rate):
        self.value -= learning_rate * self.deriv

`LSTM_Params`

In [9]:
class LSTM_Params:
    
    def __init__(self, hidden_size, vocab_size):
        self.stack_size = hidden_size + vocab_size
        
        self.W_f = LSTM_Param(np.random.normal(size=(self.stack_size, hidden_size), loc=0, scale=0.1))
        self.W_i = LSTM_Param(np.random.normal(size=(self.stack_size, hidden_size), loc=0, scale=0.1))
        self.W_c = LSTM_Param(np.random.normal(size=(self.stack_size, hidden_size), loc=0, scale=0.1))
        self.W_o = LSTM_Param(np.random.normal(size=(self.stack_size, hidden_size), loc=0, scale=0.1))
        self.W_v = LSTM_Param(np.random.normal(size=(hidden_size, vocab_size), loc=0, scale=0.1))
        
        self.B_f = LSTM_Param(np.zeros((1, hidden_size)))
        self.B_i = LSTM_Param(np.zeros((1, hidden_size)))
        self.B_c = LSTM_Param(np.zeros((1, hidden_size)))
        self.B_o = LSTM_Param(np.zeros((1, hidden_size)))
        self.B_v = LSTM_Param(np.zeros((1, vocab_size)))

        
    def all_params(self):
        return [self.W_f, self.W_i, self.W_c, self.W_o, self.W_v, 
                self.B_f, self.B_i, self.B_c, self.B_o, self.B_v]
        
    def clear_gradients(self):
        for param in self.all_params():
            param.clear_gradient()
        
    def clip_gradients(self):
        for param in self.all_params():
            param.clip_gradient()       
       
    def update_params(self, learning_rate, method="ada"):
        for param in self.all_params():
            if method == "ada":
                param.update(learning_rate)  
            elif method == "sgd":
                param.update_sgd(learning_rate)

In [10]:
# The cross entropy derivative, that doesn't work...
def cross_entropy_deriv(prediction, y):
    return np.array([-yi / predi + (1-yi) / (1-predi) for yi, predi in zip(y, prediction)])

Now we get to the LSTM-specific stuff.

We discussed above how the forward pass works. Now let's cover the tricky part: the backward pass, i.e. the "Backpropagation through time" algorithm.

Conceptually, this algorithm works the same way as a normal algorithm used to train neural nets: you have a series of quantities - neurons, weights - that have been done in the forward pass, defined by equations. You want to compute the amount that changing each of these quantities affects the loss. We'll go into the details of how to do this within each node, but at the level of the entire model:

* In the forward pass, each node sent forward a value for its hidden output and its cell state output. In the backward pass, therefore, each node will receive _gradients_ for its hidden outputs and cell state outputs that tell us how much these values ultimately impacted the loss. 

* In addition, recall that each node corresponds to a time step along the sequence being fed into the LSTM model. Thus, each node will receive the gradient from the actual loss - the softmax prediction over all possible characters compared with the one hot encoded version of the correct character.

* Similarly to before, we'll initialize these gradients to zero. Each node will output the gradient to be passed to the node _prior_ to it during the backward pass.

In [11]:
import numpy as np
np.zeros(2)

array([0., 0.])

In [112]:
class LSTM_Layer:
    def __init__(self, sequence_length, vocab_size, hidden_size, learning_rate):
        self.nodes = [LSTM_Node(hidden_size, vocab_size) for x in range(sequence_length)]
        self.sequence_length = sequence_length
        self.start_H = np.zeros((1, hidden_size))
        self.start_C = np.zeros((1, hidden_size))
        self.hidden_size = hidden_size
        self.params = LSTM_Params(hidden_size, vocab_size)

        
    def _initialize_seq_array(self):
        return np.array([[]])

    
    def _append_to_output_seq(self, output_seq, new_x_output):
        if output_seq.shape[1] == 0:
            output_seq = np.append(output_seq, new_x_output, axis=1)
        else:
            output_seq = np.append(output_seq, new_x_output, axis=0) 
        return output_seq

    
    def forward(self, x_seq_in, first_iter=False):
        '''
        Takes in a vector, outputs a vector of Xs
        '''
        
        x_seq_out = self._initialize_seq_array()
        
        H_in = self.start_H
        C_in = self.start_C
        
        for i in range(self.sequence_length):
#             print("In layer, forward through sequence element", i)
            x_in = np.array(x_seq_in[i], ndmin=2)
#             import pdb; pdb.set_trace()
            x_out, H_in, C_in = self.nodes[i].forward(x_in, H_in, C_in, self.params)
                    
            x_seq_out = self._append_to_output_seq(x_seq_out, x_out)
                
        return x_seq_out

    def backward(self, loss_grad):
        H_grad = np.zeros((1, self.hidden_size))
        C_grad = np.zeros((1, self.hidden_size))
        
        Y_seq_out = self._initialize_seq_array()
        
        for t in range(self.sequence_length-1, -1, -1):
#             print("In layer, backward through sequence element", t)            
            Y_grad_back = np.array(loss_grad[t], ndmin=2)
            Y_grad_out, H_grad, C_grad = self.nodes[t].backward(Y_grad_back, H_grad, C_grad, self.params)
            
            Y_seq_out = self._append_to_output_seq(Y_seq_out, Y_grad_out)
        
#         import pdb; pdb.set_trace()
        return Y_seq_out

In [113]:
class LSTM_Model:
    '''
    An LSTM model with one LSTM layer that feeds data through it and generates an output.
    '''
    def __init__(self, layers, sequence_length, vocab_size, hidden_size, learning_rate):
        '''
        Initialize list of nodes of length the sequence length
        List the vocab size and the hidden size 
        Initialize the params
        '''
        self.layers = [LSTM_Layer(sequence_length, vocab_size, hidden_size, learning_rate) for i in range(layers)]
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.start_H = np.zeros(hidden_size)
        self.start_C = np.zeros(hidden_size)
        self.learning_rate = learning_rate
    
    def forward(self, x_batch):

        for i, layer in enumerate(self.layers):
#             print("Forward through Layer", i)
            x_batch_out = layer.forward(x_batch)
                
        return x_batch_out
  

    def loss(self, prediction, y_batch, kind="mse"):
        if kind == "mse":
            return (prediction - y_batch) ** 2
        # TODO: other loss functions


    def loss_gradient(self, prediction, y_batch, kind="mse"):
        '''
        Return a gradient: how much our prediction influences how much we "missed" by.
        '''
        if kind == "mse":
            return -1.0 * (y_batch - prediction)
        # TODO: other loss functions 


    def backward(self, loss_grad):

        for i, layer in enumerate(list(reversed(self.layers))):
#             print("Backward through layer", i)
            loss_grad = layer.backward(loss_grad)
            

        return 


    def single_step(self, x_seq, y_seq, first_iter):
        prediction = self.forward(x_seq)
#         prediction = softmax(x_out)
        loss_gradient = self.loss_gradient(prediction, y_seq)
        loss = np.sum(self.loss(prediction, y_seq))
        self.backward(loss_gradient)
        # FOR LAYER IN LAYERS?
        for layer in self.layers:
            layer.params.update_params(self.learning_rate)
            layer.params.clear_gradients()
        
        return loss

TODO: break out `LSTM_Node` into separate notebook

In [114]:
class LSTM_Node:
    '''
    An LSTM Node that takes in input and generates output. 
    Has a size of its hidden layers and a vocabulary size it expects.
    '''
    def __init__(self, hidden_size, vocab_size):
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size


    def forward(self, x, h_prev, C_prev, LSTM_Params):

        self.C_prev = C_prev

        self.z = np.column_stack((x, h_prev))
        
        self.f = sigmoid(np.dot(self.z, LSTM_Params.W_f.value) + LSTM_Params.B_f.value)
        self.i = sigmoid(np.dot(self.z, LSTM_Params.W_i.value) + LSTM_Params.B_i.value)
        self.C_bar = tanh(np.dot(self.z, LSTM_Params.W_c.value) + LSTM_Params.B_c.value)

        self.C = self.f * C_prev + self.i * self.C_bar
        self.o = sigmoid(np.dot(self.z, LSTM_Params.W_o.value) + LSTM_Params.B_o.value)
        self.H = self.o * tanh(self.C)

        self.v = np.dot(self.H, LSTM_Params.W_v.value) + LSTM_Params.B_v.value
        
        return self.v, self.H, self.C 


    def backward(self, loss_grad, dh_next, dC_next, LSTM_Params):

        LSTM_Params.W_v.deriv += np.dot(self.H.T, loss_grad)
        LSTM_Params.B_v.deriv += loss_grad

        dh = np.dot(loss_grad, LSTM_Params.W_v.value.T)        
        dh += dh_next
        do = dh * tanh(self.C)
        do_int = dsigmoid(self.o) * do
        
        LSTM_Params.W_o.deriv += np.dot(self.z.T, do_int)
        LSTM_Params.B_o.deriv += do_int

        dC = np.copy(dC_next)
        dC += dh * self.o * dtanh(tanh(self.C))
        
        dC_bar = dC * self.i
        dC_bar = dtanh(self.C_bar) * dC_bar
        
        LSTM_Params.W_c.deriv += np.dot(self.z.T, dC_bar)
        LSTM_Params.B_c.deriv += dC_bar

        di = dC * self.C_bar
        di_int = dsigmoid(self.i) * di
        LSTM_Params.W_i.deriv += np.dot(self.z.T, di_int)
        LSTM_Params.B_i.deriv += di_int

        df = dC * self.C_prev
        df_int = dsigmoid(self.f) * df
        LSTM_Params.W_f.deriv += np.dot(self.z.T, df_int)
        LSTM_Params.B_f.deriv += df_int

        dz = (np.dot(df_int, LSTM_Params.W_f.value.T)
             + np.dot(di_int, LSTM_Params.W_i.value.T)
             + np.dot(dC_bar, LSTM_Params.W_c.value.T)
             + np.dot(do_int, LSTM_Params.W_o.value.T))

        dx_prev = dz[:, :self.vocab_size]
        dh_prev = dz[:, self.vocab_size:]

        dC_prev = self.f * dC

        return dx_prev, dh_prev, dC_prev

In [115]:
class Character_generator:
    
    def __init__(self, text_file, model):
        self.data = open(text_file, 'r').read()
        self.model = model
        self.chars = list(set(self.data))
        self.vocab_size = len(self.chars)
        self.char_to_idx = {ch:i for i,ch in enumerate(self.chars)}
        self.idx_to_char = {i:ch for i,ch in enumerate(self.chars)}
        self.iterations = 0
        self.start_pos = 0
        self.smooth_loss = -np.log(1.0 / self.vocab_size) * self.model.sequence_length


    def generate_sequences(self, start_pos, seq_length):
        input_sequence = ([self.char_to_idx[ch] 
                           for ch in self.data[start_pos:start_pos + seq_length]])
        target_sequence = ([self.char_to_idx[ch] 
                            for ch in self.data[start_pos+1:start_pos + seq_length+1]])
        return input_sequence, target_sequence
    

    def sequence_to_model_input(self, sequence, vocab_size):
        out_batch = np.zeros((len(sequence), vocab_size))
        for i, el in enumerate(sequence):
            out_batch[i, el] = 1        
        return out_batch

    
    def generate_batch(self, start_pos):
        input_sequence, target_sequence = self.generate_sequences(start_pos, self.model.sequence_length)
        return self.sequence_to_model_input(input_sequence, self.vocab_size), \
            self.sequence_to_model_input(target_sequence, self.vocab_size) 


    def train(self, steps, check_every, update_method="ada"):
        # TODO: break this up into multiple functions
        start_pos = 0
        while True:
            if start_pos + self.model.sequence_length >= len(self.data) or self.iterations == 0:
                g_h_prev = np.zeros((self.model.hidden_size, 1))
                g_C_prev = np.zeros((self.model.hidden_size, 1))

            x_batch, y_batch = self.generate_batch(start_pos)
            first_iter = True if self.iterations == 0 else False
            
            loss = self.model.single_step(x_batch, y_batch, first_iter)
            self.smooth_loss = self.smooth_loss * 0.999 + loss * 0.001
            if self.iterations % check_every == 0:
                print("Loss", self.smooth_loss)

            start_pos += self.model.sequence_length
            self.iterations += 1
            if start_pos + self.model.sequence_length > len(self.data):
                start_pos = 0
                
            if self.iterations > steps:
                break

In [116]:
mod = LSTM_Model(sequence_length=20, vocab_size=62, hidden_size=100, learning_rate=0.1, layers=2)
character_generator = Character_generator('input.txt', mod)
character_generator.train(10000, check_every=200, update_method="ada")

Loss 82.4816169393013
Loss 79.47612303898009
Loss 68.54495870003961
Loss 59.446624096503136
Loss 51.94253426893785
Loss 45.74520660824235
Loss 40.65686125469361
Loss 36.463563717108244
Loss 32.989857641300276
Loss 30.140074830660605
Loss 27.806228209703203
Loss 25.888363345816202
Loss 24.34103242512716
Loss 23.035837355607153
Loss 21.95528385229945
Loss 21.075042183243387
Loss 20.35322202897249
Loss 19.74840807308291
Loss 19.24599530052374
Loss 18.83394362360962
Loss 18.48799162364949
Loss 18.18620973289722
Loss 17.937781156103064
Loss 17.73414071497532
Loss 17.585850530343418
Loss 17.458996165757306
Loss 17.324239275830482
Loss 17.227398548324366
Loss 17.14302600410365
Loss 17.07589437258715
Loss 16.995659338600476
Loss 16.947016592962072
Loss 16.885917146428575
Loss 16.81460580787019
Loss 16.76024881764919
Loss 16.7299986214696
Loss 16.710543422840804
Loss 16.726004698750213


KeyboardInterrupt: 