# New - with classes

How do we design the classes for an RNN?

Before we do this, we have to know what it is we are designing. Let's set out what we do know about RNNs.

The goal of the model will be to predict the next character in a sequence. It will do this by taking as input a one-hot encoded version of the characters input sequence and comparing its predictions to a one hot encoded version of the characters in the output sequence, with all the characters shifted one forward. 

We know that these sequences will be fed into an "RNN Layer", one character at a time. Each time, the layers will feed back into themselves a "cell state" and a "hidden state", - a typical RNN would just feed back in a "hidden state", but when our cells are LSTM cells they feed back a "cell state" as well.

![](img/Olah_RNNs.png)

We can implement this as a series of nodes that maintain a hidden state and a cell state that get updated at each time step. 

The first node, the very first time the network is trained, will receive as information: 

* A one-hot encoded version of the first letter in the network
* An initial hidden state, which can be initialized to all zeros
* An initial cell state, which can also be initialized to all zeros

The result looks like this:

![](img/LSTM_1.png)

It will return the hidden state and the cell state to be passed on to the next node - keeping in mind that what is really happening is that we are passing the hidden state and cell state back into the RNN layer itself. So, equivalently to what is drawn above, we could draw:

![](img/LSTM_2.png)

That's the RNN-specific stuff. The rest of the stuff is neural net specific.

## Activations

In [22]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def dsigmoid(y):
    return y * (1 - y)


def tanh(x):
    return np.tanh(x)


def dtanh(y):
    return 1 - y * y


def softmax(x):
    return np.exp(x) / np.sum(np.exp(x)) #softmax

`LSTM_Param`

In [23]:
class LSTM_Param:
    def __init__(self, value):
        self.value = value
        self.deriv = np.zeros_like(value) #derivative
        self.momentum = np.zeros_like(value) #momentum for AdaGrad
        
    def clear_gradient(self):
        self.deriv = np.zeros_like(self.value) #derivative
        
    def clip_gradient(self):
        self.deriv = np.clip(self.deriv, -1, 1, out=self.deriv)
        
    def update(self, learning_rate):
        self.momentum += self.deriv * self.deriv # Calculate sum of gradients
        self.value += -(learning_rate * self.deriv / np.sqrt(self.momentum + 1e-8))
        
    def update_sgd(self, learning_rate):
        self.value -= learning_rate * self.deriv

`LSTM_Params`

In [24]:
class LSTM_Params:
    
    def __init__(self, hidden_size, vocab_size):
        self.stack_size = hidden_size + vocab_size
        
        self.W_f = LSTM_Param(np.random.normal(size=(self.stack_size, hidden_size), loc=0, scale=0.1))
        self.W_i = LSTM_Param(np.random.normal(size=(self.stack_size, hidden_size), loc=0, scale=0.1))
        self.W_c = LSTM_Param(np.random.normal(size=(self.stack_size, hidden_size), loc=0, scale=0.1))
        self.W_o = LSTM_Param(np.random.normal(size=(self.stack_size, hidden_size), loc=0, scale=0.1))
        self.W_v = LSTM_Param(np.random.normal(size=(hidden_size, vocab_size), loc=0, scale=0.1))
        
        self.B_f = LSTM_Param(np.zeros((1, hidden_size)))
        self.B_i = LSTM_Param(np.zeros((1, hidden_size)))
        self.B_c = LSTM_Param(np.zeros((1, hidden_size)))
        self.B_o = LSTM_Param(np.zeros((1, hidden_size)))
        self.B_v = LSTM_Param(np.zeros((1, vocab_size)))

        
    def all_params(self):
        return [self.W_f, self.W_i, self.W_c, self.W_o, self.W_v, 
                self.B_f, self.B_i, self.B_c, self.B_o, self.B_v]
        
    def clear_gradients(self):
        for param in self.all_params():
            param.clear_gradient()
        
    def clip_gradients(self):
        for param in self.all_params():
            param.clip_gradient()       
       
    def update_params(self, learning_rate, method="ada"):
        for param in self.all_params():
            if method == "ada":
                param.update(learning_rate)  
            elif method == "sgd":
                param.update_sgd(learning_rate)

In [25]:
# The cross entropy derivative, that doesn't work...
def cross_entropy_deriv(prediction, y):
    return np.array([-yi / predi + (1-yi) / (1-predi) for yi, predi in zip(y, prediction)])

Now we get to the LSTM-specific stuff.

We discussed above how the forward pass works. Now let's cover the tricky part: the backward pass, i.e. the "Backpropagation through time" algorithm.

Conceptually, this algorithm works the same way as a normal algorithm used to train neural nets: you have a series of quantities - neurons, weights - that have been done in the forward pass, defined by equations. You want to compute the amount that changing each of these quantities affects the loss. We'll go into the details of how to do this within each node, but at the level of the entire model:

* In the forward pass, each node sent forward a value for its hidden output and its cell state output. In the backward pass, therefore, each node will receive _gradients_ for its hidden outputs and cell state outputs that tell us how much these values ultimately impacted the loss. 

* In addition, recall that each node corresponds to a time step along the sequence being fed into the LSTM model. Thus, each node will receive the gradient from the actual loss - the softmax prediction over all possible characters compared with the one hot encoded version of the correct character.

* Similarly to before, we'll initialize these gradients to zero. Each node will output the gradient to be passed to the node _prior_ to it during the backward pass.

In [35]:
class LSTM_Model:
    '''
    An LSTM model with one LSTM layer that feeds data through it and generates an output.
    '''
    def __init__(self, sequence_length, vocab_size, hidden_size, learning_rate):
        '''
        Initialize list of nodes of length the sequence length
        List the vocab size and the hidden size 
        Initialize the params
        '''
        self.nodes = [LSTM_Node(hidden_size, vocab_size) for x in range(sequence_length)]
        self.sequence_length = sequence_length
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.start_H = np.zeros(hidden_size)
        self.start_C = np.zeros(hidden_size)
        self.params = LSTM_Params(hidden_size, vocab_size)
        self.learning_rate = learning_rate

    def forward(self, x_batch, first_iter=False):

        x_batch_out = np.array([[]])
        if first_iter:
            h_in = np.zeros((1, self.hidden_size))
            c_in = np.zeros((1, self.hidden_size))
        else:
            h_in = self.nodes[-1].H
            c_in = self.nodes[-1].C
        for i, node in enumerate(self.nodes):
            x_in = np.array(x_batch[i], ndmin=2)

            x_out, h_in, c_in = node.forward(x_in, h_in, c_in, self.params)

            if x_batch_out.shape[1] == 0:
                x_batch_out = np.append(x_batch_out, x_out, axis=1)
            else:
                x_batch_out = np.append(x_batch_out, x_out, axis=0)
                
        return x_batch_out 
    
    def loss(self, prediction, y_batch, kind="mse"):
        if kind == "mse":
            return (prediction - y_batch) ** 2
        # TODO: other loss functions


    def loss_gradient(self, prediction, y_batch, kind="mse"):
        '''
        Return a gradient: how much our prediction influences how much we "missed" by.
        '''
        if kind == "mse":
            return -1.0 * (y_batch - prediction)
        # TODO: other loss functions 

    def backward(self, loss_grad):

        H_grad = np.zeros((1, self.hidden_size))
        C_grad = np.zeros((1, self.hidden_size))
        Y_grad = loss_grad
        
        T = self.sequence_length - 1
        K = self.sequence_length # BPTT length
        
        # BPTT
        num_iterations = T - K

        for t in range(T, T-K, -1):

            Y_grad_in = np.array(Y_grad[T-t], ndmin=2)

            Y_grad_out, H_grad, C_grad = \
                self.nodes[T-t].backward(Y_grad_in, H_grad, C_grad, self.params)
                    
        return 

    def single_step(self, x_seq, y_seq, first_iter):
        prediction = self.forward(x_seq, first_iter)
#         prediction = softmax(x_out)
        loss_gradient = self.loss_gradient(prediction, y_seq)
        loss = np.sum(self.loss(prediction, y_seq))
        self.backward(loss_gradient)

        self.params.update_params(self.learning_rate)
        self.params.clear_gradients()  
        
        return loss

TODO: break out `LSTM_Node` into separate notebook

In [1]:
class LSTM_Node:
    '''
    An LSTM Node that takes in input and generates output. 
    Has a size of its hidden layers and a vocabulary size it expects.
    '''
    def __init__(self, hidden_size, vocab_size):
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size


    def forward(self, x, h_prev, C_prev, LSTM_Params):
#         assert x.shape == (self.vocab_size, 1)
#         assert h_prev.shape == (self.hidden_size, 1)
#         assert C_prev.shape == (self.hidden_size, 1)
#         print("X shape on in:", x.shape)
        self.C_prev = C_prev

        self.z = np.column_stack((x, h_prev))
        
        self.f = sigmoid(np.dot(self.z, LSTM_Params.W_f.value) + LSTM_Params.B_f.value)
        self.i = sigmoid(np.dot(self.z, LSTM_Params.W_i.value) + LSTM_Params.B_i.value)
        self.C_bar = tanh(np.dot(self.z, LSTM_Params.W_c.value) + LSTM_Params.B_c.value)

        self.C = self.f * C_prev + self.i * self.C_bar
        self.o = sigmoid(np.dot(self.z, LSTM_Params.W_o.value) + LSTM_Params.B_o.value)
        self.H = self.o * tanh(self.C)

        self.v = np.dot(self.H, LSTM_Params.W_v.value) + LSTM_Params.B_v.value
        
        return self.v, self.H, self.C 


    def backward(self, loss_grad, dh_next, dC_next, LSTM_Params):

        LSTM_Params.W_v.deriv += np.dot(self.H.T, loss_grad)
        LSTM_Params.B_v.deriv += loss_grad

        dh = np.dot(loss_grad, LSTM_Params.W_v.value.T)        
        dh += dh_next
        do = dh * tanh(self.C)
        do_int = dsigmoid(self.o) * do
        
        LSTM_Params.W_o.deriv += np.dot(self.z.T, do_int)
        LSTM_Params.B_o.deriv += do_int

        dC = np.copy(dC_next)
        dC += dh * self.o * dtanh(tanh(self.C))
        
        dC_bar = dC * self.i
        dC_bar = dtanh(self.C_bar) * dC_bar
        
        LSTM_Params.W_c.deriv += np.dot(self.z.T, dC_bar)
        LSTM_Params.B_c.deriv += dC_bar

        di = dC * self.C_bar
        di_int = dsigmoid(self.i) * di
        LSTM_Params.W_i.deriv += np.dot(self.z.T, di_int)
        LSTM_Params.B_i.deriv += di_int

        df = dC * self.C_prev
        df_int = dsigmoid(self.f) * df
        LSTM_Params.W_f.deriv += np.dot(self.z.T, df_int)
        LSTM_Params.B_f.deriv += df_int

        dz = (np.dot(df_int, LSTM_Params.W_f.value.T)
             + np.dot(di_int, LSTM_Params.W_i.value.T)
             + np.dot(dC_bar, LSTM_Params.W_c.value.T)
             + np.dot(do_int, LSTM_Params.W_o.value.T))

        dx_prev = dz[:, :self.vocab_size]
        dh_prev = dz[:, self.vocab_size:]

        dC_prev = self.f * dC

        return dx_prev, dh_prev, dC_prev

In [37]:
class Character_generator:
    
    def __init__(self, text_file, model):
        self.data = open(text_file, 'r').read()
        self.model = model
        self.chars = list(set(self.data))
        self.vocab_size = len(self.chars)
        self.char_to_idx = {ch:i for i,ch in enumerate(self.chars)}
        self.idx_to_char = {i:ch for i,ch in enumerate(self.chars)}
        self.iterations = 0
        self.start_pos = 0
        self.smooth_loss = -np.log(1.0 / self.vocab_size) * self.model.sequence_length

    def generate_sequences(self, start_pos, seq_length):
        input_sequence = ([self.char_to_idx[ch] 
                           for ch in self.data[start_pos:start_pos + seq_length]])
        target_sequence = ([self.char_to_idx[ch] 
                            for ch in self.data[start_pos+1:start_pos + seq_length+1]])
        return input_sequence, target_sequence

    def sequence_to_model_input(self, sequence, vocab_size):
        out_batch = np.zeros((len(sequence), vocab_size))
        for i, el in enumerate(sequence):
            out_batch[i, el] = 1        
        return out_batch
    
    def generate_batch(self, start_pos):
        input_sequence, target_sequence = self.generate_sequences(start_pos, self.model.sequence_length)
        return self.sequence_to_model_input(input_sequence, self.vocab_size), \
            self.sequence_to_model_input(target_sequence, self.vocab_size) 


    def train(self, steps, check_every, update_method="ada"):
        # TODO: break this up into multiple functions
        start_pos = 0
        while True:
            if start_pos + self.model.sequence_length >= len(self.data) or self.iterations == 0:
                g_h_prev = np.zeros((self.model.hidden_size, 1))
                g_C_prev = np.zeros((self.model.hidden_size, 1))

            x_batch, y_batch = self.generate_batch(start_pos)
            first_iter = True if self.iterations == 0 else False
            
            loss = self.model.sinxgle_step(x_batch, y_batch, first_iter)
            self.smooth_loss = self.smooth_loss * 0.999 + loss * 0.001
            if self.iterations % check_every == 0:
                print("Loss", self.smooth_loss)

            start_pos += self.model.sequence_length
            self.iterations += 1
            if start_pos + self.model.sequence_length > len(self.data):
                start_pos = 0
                
            if self.iterations > steps:
                break

In [38]:
mod = LSTM_Model(sequence_length=20, vocab_size=62, hidden_size=100, learning_rate=0.1)
character_generator = Character_generator('input.txt', mod)
character_generator.train(10000, check_every=200, update_method="ada")

Loss 82.48118475003945
Loss 77.3516037991388
Loss 66.94072482037333
Loss 58.29441905469024
Loss 51.1564512253889
Loss 45.312907638603065
Loss 40.49052297484099
Loss 36.52278065004479
Loss 33.2550695527356
Loss 30.556911788344255
Loss 28.362218734234656
Loss 26.523370266868458
Loss 25.059915795216018
Loss 23.841283182909685
Loss 22.793647297846977
Loss 21.95013404912796
Loss 21.27143312348296
Loss 20.67903270366561
Loss 20.213145427596853
Loss 19.815557597289033
Loss 19.484377165885835
Loss 19.201297342806615
Loss 18.950924216500226
Loss 18.756157967038902
Loss 18.60171089537948
Loss 18.4778805686488
Loss 18.34201187611537
Loss 18.24694220542492
Loss 18.15178635022785
Loss 18.074674176475845
Loss 18.00398575925051
Loss 17.94974214020912
Loss 17.889178950701115
Loss 17.82091725805772
Loss 17.769147182477994
Loss 17.741681245132167
Loss 17.711600436073315
Loss 17.72208013367575
Loss 17.711767037698483
Loss 17.678750208818286
Loss 17.66345669710529
Loss 17.669007799948858
Loss 17.647023016

KeyboardInterrupt: 

# Old

### Imports

In [None]:
import numpy as np
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from IPython import display
plt.style.use('seaborn-white')

### Read and process data

In [None]:
data = open('input.txt', 'r').read()

Process data and calculate indexes

In [None]:
chars = list(set(data))
data_size, X_size = len(data), len(chars)
print("data has %d characters, %d unique" % (data_size, X_size))
char_to_idx = {ch:i for i,ch in enumerate(chars)}
idx_to_char = {i:ch for i,ch in enumerate(chars)}

### Constants and Hyperparameters

In [None]:
H_size = 100 # Size of the hidden layer
T_steps = 25 # Number of time steps (length of the sequence) used for training
learning_rate = 1e-1 # Learning rate
weight_sd = 0.1 # Standard deviation of weights for initialization
z_size = H_size + X_size # Size of concatenate(H, X) vector

### Activation Functions and Derivatives

#### Sigmoid

\begin{align}
\sigma(x) &= \frac{1}{1 + e^{-x}}\\
\frac{d\sigma(x)}{dx} &= \sigma(x) \cdot (1 - \sigma(x))
\end{align}

#### Tanh

\begin{align}
\frac{d\text{tanh}(x)}{dx} &= 1 - \text{tanh}^2(x)
\end{align}

In [None]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def dsigmoid(y):
    return y * (1 - y)


def tanh(x):
    return np.tanh(x)


def dtanh(y):
    return 1 - y * y

### Parameters

In [None]:
class Param:
    def __init__(self, name, value):
        self.name = name
        self.v = value #parameter value
        self.d = np.zeros_like(value) #derivative
        self.m = np.zeros_like(value) #momentum for AdaGrad

We use random weights with normal distribution (`0`, `weight_sd`) for $tanh$ activation function and (`0.5`, `weight_sd`) for $sigmoid$ activation function.

Biases are initialized to zeros.

In [None]:
class Parameters:
    def __init__(self):
        self.W_f = Param('W_f', 
                         np.random.randn(H_size, z_size) * weight_sd + 0.5)
        self.b_f = Param('b_f',
                         np.zeros((H_size, 1)))

        self.W_i = Param('W_i',
                         np.random.randn(H_size, z_size) * weight_sd + 0.5)
        self.b_i = Param('b_i',
                         np.zeros((H_size, 1)))

        self.W_C = Param('W_C',
                         np.random.randn(H_size, z_size) * weight_sd)
        self.b_C = Param('b_C',
                         np.zeros((H_size, 1)))

        self.W_o = Param('W_o',
                         np.random.randn(H_size, z_size) * weight_sd + 0.5)
        self.b_o = Param('b_o',
                         np.zeros((H_size, 1)))

        #For final layer to predict the next character
        self.W_v = Param('W_v',
                         np.random.randn(X_size, H_size) * weight_sd)
        self.b_v = Param('b_v',
                         np.zeros((X_size, 1)))
        
    def all(self):
        return [self.W_f, self.W_i, self.W_C, self.W_o, self.W_v,
               self.b_f, self.b_i, self.b_C, self.b_o, self.b_v]
        
parameters = Parameters()

### Forward pass

![LSTM](http://blog.varunajayasiri.com/ml/lstm.svg)

*Operation $z$ is the concatenation of $x$ and $h_{t-1}$*

#### Concatenation of $h_{t-1}$ and $x_t$
\begin{align}
z & = [h_{t-1}, x_t] \\
\end{align}

#### LSTM functions
\begin{align}
f_t & = \sigma(W_f \cdot z + b_f) \\
i_t & = \sigma(W_i \cdot z + b_i) \\
\bar{C}_t & = tanh(W_C \cdot z + b_C) \\
C_t & = f_t * C_{t-1} + i_t * \bar{C}_t \\
o_t & = \sigma(W_o \cdot z + b_t) \\
h_t &= o_t * tanh(C_t) \\
\end{align}

#### Logits
\begin{align}
v_t &= W_v \cdot h_t + b_v \\
\end{align}

#### Softmax
\begin{align}
\hat{y_t} &= \text{softmax}(v_t)
\end{align}

$\hat{y_t}$ is `y` in code and $y_t$ is `targets`.


In [None]:
def forward(x, h_prev, C_prev, p = parameters):
    assert x.shape == (X_size, 1)
    assert h_prev.shape == (H_size, 1)
    assert C_prev.shape == (H_size, 1)
    
    z = np.row_stack((h_prev, x))
    f = sigmoid(np.dot(p.W_f.v, z) + p.b_f.v)
    i = sigmoid(np.dot(p.W_i.v, z) + p.b_i.v)
    C_bar = tanh(np.dot(p.W_C.v, z) + p.b_C.v)

    C = f * C_prev + i * C_bar
    o = sigmoid(np.dot(p.W_o.v, z) + p.b_o.v)
    h = o * tanh(C)

    v = np.dot(p.W_v.v, h) + p.b_v.v
    y = np.exp(v) / np.sum(np.exp(v)) #softmax

    return z, f, i, C_bar, C, o, h, v, y

### Backward pass

#### Loss

\begin{align}
L_k &= -\sum_{t=k}^T\sum_j y_{t,j} log \hat{y_{t,j}} \\
L &= L_1 \\
\end{align}

#### Gradients

\begin{align}
dv_t &= \hat{y_t} - y_t \\
dh_t &= dh'_t + W_y^T \cdot dv_t \\
do_t &= dh_t * \text{tanh}(C_t) \\
dC_t &= dC'_t + dh_t * o_t * (1 - \text{tanh}^2(C_t))\\
d\bar{C}_t &= dC_t * i_t \\
di_t &= dC_t * \bar{C}_t \\
df_t &= dC_t * C_{t-1} \\
\\
df'_t &= f_t * (1 - f_t) * df_t \\
di'_t &= i_t * (1 - i_t) * di_t \\
d\bar{C}'_{t-1} &= (1 - \bar{C}_t^2) * d\bar{C}_t \\
do'_t &= o_t * (1 - o_t) * do_t \\
dz_t &= W_f^T \cdot df'_t \\
     &+ W_i^T \cdot di_t \\
     &+ W_C^T \cdot d\bar{C}_t \\
     &+ W_o^T \cdot do_t \\
\\
[dh'_{t-1}, dx_t] &= dz_t \\
dC'_t &= f_t * dC_t
\end{align}

* $dC'_t = \frac{\partial L_{t+1}}{\partial C_t}$ and $dh'_t = \frac{\partial L_{t+1}}{\partial h_t}$
* $dC_t = \frac{\partial L}{\partial C_t} = \frac{\partial L_t}{\partial C_t}$ and $dh_t = \frac{\partial L}{\partial h_t} = \frac{\partial L_{t}}{\partial h_t}$
* All other derivatives are of $L$
* `target` is target character index $y_t$
* `dh_next` is $dh'_{t}$ (size H x 1)
* `dC_next` is $dC'_{t}$ (size H x 1)
* `C_prev` is $C_{t-1}$ (size H x 1)
* $df'_t$, $di'_t$, $d\bar{C}'_t$, and $do'_t$ are *also* assigned to `df`, `di`, `dC_bar`, and `do` in the **code**.
* *Returns* $dh_t$ and $dC_t$

#### Model parameter gradients

\begin{align}
dW_v &= dv_t \cdot h_t^T \\
db_v &= dv_t \\
\\
dW_f &= df'_t \cdot z^T \\
db_f &= df'_t \\
\\
dW_i &= di'_t \cdot z^T \\
db_i &= di'_t \\
\\
dW_C &= d\bar{C}'_t \cdot z^T \\
db_C &= d\bar{C}'_t \\
\\
dW_o &= do'_t \cdot z^T \\
db_o &= do'_t \\
\\
\end{align}

In [None]:
def backward(target, dh_next, dC_next, C_prev,
             z, f, i, C_bar, C, o, h, v, y,
             p = parameters):
    
    assert z.shape == (X_size + H_size, 1)
    assert v.shape == (X_size, 1)
    assert y.shape == (X_size, 1)
    
    for param in [dh_next, dC_next, C_prev, f, i, C_bar, C, o, h]:
        assert param.shape == (H_size, 1)
        
    dv = np.copy(y)
    dv[target] -= 1

    p.W_v.d += np.dot(dv, h.T)
    p.b_v.d += dv

    dh = np.dot(p.W_v.v.T, dv)        
    dh += dh_next
    do = dh * tanh(C)
    do = dsigmoid(o) * do
    p.W_o.d += np.dot(do, z.T)
    p.b_o.d += do

    dC = np.copy(dC_next)
    dC += dh * o * dtanh(tanh(C))
    dC_bar = dC * i
    dC_bar = dtanh(C_bar) * dC_bar
    p.W_C.d += np.dot(dC_bar, z.T)
    p.b_C.d += dC_bar

    di = dC * C_bar
    di = dsigmoid(i) * di
    p.W_i.d += np.dot(di, z.T)
    p.b_i.d += di

    df = dC * C_prev
    df = dsigmoid(f) * df
    p.W_f.d += np.dot(df, z.T)
    p.b_f.d += df

    dz = (np.dot(p.W_f.v.T, df)
         + np.dot(p.W_i.v.T, di)
         + np.dot(p.W_C.v.T, dC_bar)
         + np.dot(p.W_o.v.T, do))
    dh_prev = dz[:H_size, :]
    dC_prev = f * dC
    
    return dh_prev, dC_prev

### Forward Backward Pass

Clear gradients before each backward pass

In [None]:
def clear_gradients(params = parameters):
    for p in params.all():
        p.d.fill(0)

Clip gradients to mitigate exploding gradients

In [None]:
def clip_gradients(params = parameters):
    for p in params.all():
        np.clip(p.d, -1, 1, out=p.d)

Calculate and store the values in forward pass. Accumulate gradients in backward pass and clip gradients to avoid exploding gradients.

* `input`, `target` are list of integers, with character indexes.
* `h_prev` is the array of initial `h` at $h_{-1}$ (size H x 1)
* `C_prev` is the array of initial `C` at $C_{-1}$ (size H x 1)
* *Returns* loss, final $h_T$ and $C_T$

In [None]:
def forward_backward(inputs, targets, h_prev, C_prev):
    global paramters
    
    # To store the values for each time step
    x_s, z_s, f_s, i_s,  = {}, {}, {}, {}
    C_bar_s, C_s, o_s, h_s = {}, {}, {}, {}
    v_s, y_s =  {}, {}
    
    # Values at t - 1
    h_s[-1] = np.copy(h_prev)
    C_s[-1] = np.copy(C_prev)
    
    loss = 0
    # Loop through time steps
    assert len(inputs) == T_steps
    for t in range(len(inputs)):
        x_s[t] = np.zeros((X_size, 1))
        x_s[t][inputs[t]] = 1 # Input character
        
        (z_s[t], f_s[t], i_s[t],
        C_bar_s[t], C_s[t], o_s[t], h_s[t],
        v_s[t], y_s[t]) = \
            forward(x_s[t], h_s[t - 1], C_s[t - 1]) # Forward pass

        # The 0 included only because y_s is 2 dimensional (since we are using batch size 1)
        loss += -np.log(y_s[t][targets[t], 0]) # Loss for at t
        
    clear_gradients()

    dh_next = np.zeros_like(h_s[0]) #dh from the next character
    dC_next = np.zeros_like(C_s[0]) #dc from the next character

    for t in reversed(range(len(inputs))):
        # Backward pass
        dh_next, dC_next = \
            backward(target = targets[t], dh_next = dh_next,
                     dC_next = dC_next, C_prev = C_s[t-1],
                     z = z_s[t], f = f_s[t], i = i_s[t], C_bar = C_bar_s[t],
                     C = C_s[t], o = o_s[t], h = h_s[t], v = v_s[t],
                     y = y_s[t])

    clip_gradients()
        
    return loss, h_s[len(inputs) - 1], C_s[len(inputs) - 1]

### Sample the next character

In [None]:
def sample(h_prev, C_prev, first_char_idx, sentence_length):
    x = np.zeros((X_size, 1))
    x[first_char_idx] = 1

    h = h_prev
    C = C_prev

    indexes = []
    
    for t in range(sentence_length):
        _, _, _, _, C, _, h, _, p = forward(x, h, C)
        idx = np.random.choice(range(X_size), p=p.ravel())
        x = np.zeros((X_size, 1))
        x[idx] = 1
        indexes.append(idx)

    return indexes

## Training (Adagrad)

Update the graph and display a sample output

In [None]:
def update_status(inputs, h_prev, C_prev):
    #initialized later
    global plot_iter, plot_loss
    global smooth_loss
    
    # Get predictions for 200 letters with current model

    sample_idx = sample(h_prev, C_prev, inputs[0], 200)
    txt = ''.join(idx_to_char[idx] for idx in sample_idx)

    # Clear and plot
    plt.plot(plot_iter, plot_loss)
    display.clear_output(wait=True)
    plt.show()

    #Print prediction and loss
    print("----\n %s \n----" % (txt, ))
    print("iter %d, loss %f" % (iteration, smooth_loss))

Update parameters

\begin{align}
\theta_i &= \theta_i - \eta\frac{d\theta_i}{\sum dw_{\tau}^2} \\
d\theta_i &= \frac{\partial L}{\partial \theta_i}
\end{align}

In [None]:
def update_paramters(params = parameters):
    for p in params.all():
        p.m += p.d * p.d # Calculate sum of gradients
        #print(learning_rate * dparam)
        p.v += -(learning_rate * p.d / np.sqrt(p.m + 1e-8))

To delay the keyboard interrupt to prevent the training 
from stopping in the middle of an iteration 

In [None]:
import signal

class DelayedKeyboardInterrupt(object):
    def __enter__(self):
        self.signal_received = False
        self.old_handler = signal.signal(signal.SIGINT, self.handler)

    def handler(self, sig, frame):
        self.signal_received = (sig, frame)
        print('SIGINT received. Delaying KeyboardInterrupt.')

    def __exit__(self, type, value, traceback):
        signal.signal(signal.SIGINT, self.old_handler)
        if self.signal_received:
            self.old_handler(*self.signal_received)

In [None]:
# Exponential average of loss
# Initialize to a error of a random model
smooth_loss = -np.log(1.0 / X_size) * T_steps

iteration, pointer = 0, 0

# For the graph
plot_iter = np.zeros((0))
plot_loss = np.zeros((0))

Training loop

In [None]:
while True:
    try:
        with DelayedKeyboardInterrupt():
            # Reset
            if pointer + T_steps >= len(data) or iteration == 0:
                g_h_prev = np.zeros((H_size, 1))
                g_C_prev = np.zeros((H_size, 1))
                pointer = 0

            
            inputs = ([char_to_idx[ch] 
                       for ch in data[pointer: pointer + T_steps]])
            targets = ([char_to_idx[ch] 
                        for ch in data[pointer + 1: pointer + T_steps + 1]])
            
            loss, g_h_prev, g_C_prev = \
                forward_backward(inputs, targets, g_h_prev, g_C_prev)
            smooth_loss = smooth_loss * 0.999 + loss * 0.001

            # Print every hundred steps
            if iteration % 100 == 0:
                update_status(inputs, g_h_prev, g_C_prev)

            update_paramters()

            plot_iter = np.append(plot_iter, [iteration])
            plot_loss = np.append(plot_loss, [loss])

            pointer += T_steps
            iteration += 1
    except KeyboardInterrupt:
        update_status(inputs, g_h_prev, g_C_prev)
        break

### Gradient Check

Approximate the numerical gradients by changing parameters and running the model. Check if the approximated gradients are equal to the computed analytical gradients (by backpropagation).

Try this on `num_checks` individual paramters picked randomly for each weight matrix and bias vector.

In [None]:
from random import uniform

Calculate numerical gradient

In [None]:
def calc_numerical_gradient(param, idx, delta, inputs, target, h_prev, C_prev):
    old_val = param.v.flat[idx]
    
    # evaluate loss at [x + delta] and [x - delta]
    param.v.flat[idx] = old_val + delta
    loss_plus_delta, _, _ = forward_backward(inputs, targets,
                                             h_prev, C_prev)
    param.v.flat[idx] = old_val - delta
    loss_mins_delta, _, _ = forward_backward(inputs, targets, 
                                             h_prev, C_prev)
    
    param.v.flat[idx] = old_val #reset

    grad_numerical = (loss_plus_delta - loss_mins_delta) / (2 * delta)
    # Clip numerical error because analytical gradient is clipped
    [grad_numerical] = np.clip([grad_numerical], -1, 1) 
    
    return grad_numerical

Check gradient of each paramter matrix/vector at `num_checks` individual values

In [None]:
def gradient_check(num_checks, delta, inputs, target, h_prev, C_prev):
    global parameters
    
    # To calculate computed gradients
    _, _, _ =  forward_backward(inputs, targets, h_prev, C_prev)
    
    
    for param in parameters.all():
        #Make a copy because this will get modified
        d_copy = np.copy(param.d)

        # Test num_checks times
        for i in range(num_checks):
            # Pick a random index
            rnd_idx = int(uniform(0, param.v.size))
            
            grad_numerical = calc_numerical_gradient(param,
                                                     rnd_idx,
                                                     delta,
                                                     inputs,
                                                     target,
                                                     h_prev, C_prev)
            grad_analytical = d_copy.flat[rnd_idx]

            err_sum = abs(grad_numerical + grad_analytical) + 1e-09
            rel_error = abs(grad_analytical - grad_numerical) / err_sum
            
            # If relative error is greater than 1e-06
            if rel_error > 1e-06:
                print('%s (%e, %e) => %e'
                      % (param.name, grad_numerical, grad_analytical, rel_error))

In [None]:
gradient_check(10, 1e-5, inputs, targets, g_h_prev, g_C_prev)