In [1]:
import numpy as np
import tensorflow as tf

# TensorFlow Feed-Forward Neural Network

Instead of using NumPy to construct our neural network, we will use TensorFlow. The purpose of this notebook is to get familiar with TensorFlow's low-level functionality involving static graphs and sessions. As a result, we won't (yet) be using TensorFlow's higher-level conveniences (like eager execution or automatic differentiation), and every gradient is wired into the graph by hand. The result should closely mirror the neural network implemented with NumPy.

## Layers 

For our neural network, we want to abstract away from individual neurons and focus on layers. Each element of the network will be defined by a certain layer.

### Base Layer

The abstract base layer ensures that every method the `Sequential` model calls exists on each layer.

In [2]:
class Layer:
    """Base class for neural network layers."""

Some layers don't have variables to define in the `build` method, so we define an empty one here.

In [3]:
    def build(self):
        """Some layers do not have variables to define, so the default
        implementation does nothing.

        """

# Add this method to the Layer class
Layer.build = build

We supply a default `build_sgd_step` method because some layers have no weights to update (e.g. activation layers).

In [4]:
    def build_sgd_step(self, lrate):
        """Some layers do not have weights to update on gradient descent steps.
        
        Args:
            lrate (float): Learning rate for stochastic gradient descent.
            
        """

# Add this method to the Layer class
Layer.build_sgd_step = build_sgd_step

### Input Layer

This layer acts as a placeholder for data fed into our network.

In [5]:
class Input(Layer):
    """Placeholder for data inputs into the network."""

To set up this layer, we just need to know the number of input features.

In [6]:
    def __init__(self, m):
        """Initializes the layer's dimensions.
        
        Args:
            m (int): Number of input features to the network.
            
        """
        self.m = m
        
# Add this method to the Input layer class
Input.__init__ = __init__

Once the architecture is defined, we need to build the graph by connecting layers. This is accomplished through build methods. The first one, `build`, defines the variables (or, for this layer, the placeholder) within the layer.

In [7]:
    def build(self):
        """Creates the input layer placeholder in the current default graph."""
        with tf.variable_scope(name_or_scope=None, default_name='Input'):
            self.X = tf.placeholder(tf.float32, shape=(self.m, None), name='X')

# Add this method to the Input layer class
Input.build = build

The forward pass of this layer is simple: it just returns the inputs.

In [8]:
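    def build_forward(self, A=None):
        """Returns the forward tensor for the layer.

        Args:
            A (Tensor): Unused. The input layer has no previous layer, so it
                returns its own placeholder.

        Returns:
            Tensor: Result of forward operation.

        """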
        self.forward = self.X
        
        return self.forward
    
# Add this method to the Input layer class
Input.build_forward = build_forward

The backward pass returns the gradient unchanged because the layer is the identity function.

In [9]:
    def build_backward(self, dLdZ):
        """Returns the backward tensor of the layer.
        
        Args:
            dLdZ (Tensor): An n by b matrix of loss gradients with
                respect to the output of the layer.
        
        Returns:
            Tensor: An m by b matrix of loss gradients with respect
                to the input of the layer.
                
        """
        self.backward = dLdZ
        
        return self.backward

# Add this method to the Input layer class
Input.build_backward = build_backward

There are no weights to update, so this layer keeps the empty `build_sgd_step` from the base class.

### Linear Layer

This is the simplest layer that makes up the majority of our neural network.

In [10]:
class Linear(Layer):
    """A simple, fully-connected linear layer."""

To set up this layer, we need to know the input and output dimensions ahead of time. The weights are TensorFlow variables instead of NumPy arrays, and they are randomly initialized later in the `build` method, where `get_variable` defines a node on the default graph.

In [11]:
    def __init__(self, m, n):
        """Initializes the layer input and output dimensions. 

        Note: Kernel is initialized using normal distribution with mean 0 and 
        variance 1 / m. All biases are initialized to zero.

        Args:
            m (int): Number of inputs to the layer.
            n (int): Number of outputs from the layer.

        """
        self.m, self.n = m, n

# Add this method to the Linear layer class
Linear.__init__ = __init__

We create a build method to define the variables within the layer.

In [12]:
    def build(self):
        """Creates the layer variables in the current default graph."""
        with tf.variable_scope(name_or_scope=None, default_name='Linear'):

            # Biases start at zero; the kernel is drawn from N(0, 1 / m)
            self.W0 = tf.get_variable('W0', (self.n, 1), initializer=tf.zeros_initializer())
            self.W = tf.get_variable('W', (self.m, self.n), initializer=tf.random_normal_initializer(0.0, (1.0 / self.m) ** 0.5))
                
# Add this method to the Linear layer class
Linear.build = build

The `build_forward` method computes the output of the layer, $Z = W^T A + W_0$, given a set of $m$ inputs from the previous layer for a batch of size $b$.

In [13]:
    def build_forward(self, A):
        """Computes the forward pass through the linear network for a batch.

        Args:
            A (Tensor): An m by b matrix representing the m activations from the
                previous layer for a batch of size b.

        Returns:
            Tensor: An n by b matrix representing the result of passing the 
                activations through the network layer for a batch of size b.

        """
        self.A = A
        self.forward = tf.add(tf.matmul(tf.transpose(self.W), self.A), self.W0)
        
        return self.forward

# Add this method to the Linear layer class
Linear.build_forward = build_forward
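
As a quick shape check, the same forward computation can be sketched in NumPy rather than as TensorFlow graph ops. The sizes `m`, `n` and `b` and the random values below are made up for illustration; broadcasting adds the `(n, 1)` bias to every column of the batch.

```python
import numpy as np

rng = np.random.RandomState(0)
m, n, b = 4, 3, 5                                   # hypothetical layer and batch sizes

W = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))  # kernel with variance 1 / m
W0 = np.zeros((n, 1))                               # biases start at zero
A = rng.randn(m, b)                                 # activations from the previous layer

Z = W.T @ A + W0                                    # (n, m) @ (m, b) + (n, 1) -> (n, b)
print(Z.shape)                                      # (3, 5)
```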

The `build_backward` method computes the gradient of the loss with respect to the inputs to the layer for a batch of size $b$. Note: There is an implicit sum over all $b$ in the `dLdW` calculation.

In [14]:
    def build_backward(self, dLdZ):
        """Uses the gradient of loss with respect to outputs of the layer for a 
        batch to update the sum of gradients of the loss with respect to the 
        weights for the entire batch. Also returns the gradient of the loss with 
        respect to the inputs to the layer for a batch.

        Args:
            dLdZ (Tensor): An n by b matrix representing the gradient of the loss
                with respect to the layer outputs for a batch of size b.

        Returns:
            Tensor: An m by b matrix representing the gradient of the loss with 
                respect to the inputs to the layer for a batch of size b.

        """
        self.dLdW = tf.matmul(self.A, tf.transpose(dLdZ))  # Implicit sum over all b
        self.dLdW0 = tf.reduce_sum(dLdZ, axis=1, keepdims=True)

        self.backward = tf.matmul(self.W, dLdZ)
            
        return self.backward

# Add this method to the Linear layer class
Linear.build_backward = build_backward
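
One way to convince yourself that these three expressions are right is a finite-difference check. The sketch below uses NumPy instead of TensorFlow and a made-up loss that is linear in the layer output, so that the upstream gradient `G` is known exactly; the sizes and the perturbed entry are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
m, n, b = 4, 3, 5                         # hypothetical sizes
W, W0 = rng.randn(m, n), rng.randn(n, 1)
A, G = rng.randn(m, b), rng.randn(n, b)   # G plays the role of dLdZ

def loss(W):
    # Linear in Z = W^T A + W0, so dL/dZ is exactly G
    return np.sum((W.T @ A + W0) * G)

dLdW = A @ G.T                            # (m, b) @ (b, n): the product sums over the batch
dLdW0 = G.sum(axis=1, keepdims=True)      # explicit sum over the batch for the bias
dLdA = W @ G                              # gradient handed to the previous layer

# Central finite difference on a single kernel entry
eps = 1e-4
E = np.zeros_like(W)
E[1, 2] = eps
numeric = (loss(W + E) - loss(W - E)) / (2 * eps)
print(abs(numeric - dLdW[1, 2]) < 1e-6)
```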

Lastly, we need a method to update the weight matrices using the current weight gradients for a batch.

In [15]:
    def build_sgd_step(self, lrate):
        """Performs a single step of gradient descent to update the weights for a 
        single batch of points.

        Args:
            lrate (float): A learning rate to scale the gradient for the update.

        """
        self.sgd_step_W = self.W.assign(tf.subtract(self.W, tf.scalar_mul(lrate, self.dLdW)))
        self.sgd_step_W0 = self.W0.assign(tf.subtract(self.W0, tf.scalar_mul(lrate, self.dLdW0)))

        self.sgd_step = tf.group(self.sgd_step_W, self.sgd_step_W0)
        
        return self.sgd_step
        
# Add this method to the Linear layer class
Linear.build_sgd_step = build_sgd_step

### Hyperbolic Tangent Activation Layer

This layer encapsulates the hyperbolic tangent activation function.

In [16]:
class Tanh(Layer):
    """Hyperbolic tangent activation layer."""

The `build_forward` method takes a pre-activation from the previous layer and computes the activation using the hyperbolic tangent function.

In [17]:
    def build_forward(self, Z):
        """Computes the output of the hyperbolic tangent activation layer.

        Args:
            Z (Tensor): An n by b matrix representing the input pre-activations
                of the layer for a batch of size b.

        Returns:
            Tensor: An n by b matrix representing the output of the layer after
                using the hyperbolic tangent activation on all inputs for a batch
                of size b.

        """
        self.forward = tf.tanh(Z)

        return self.forward

# Add this method to the Tanh layer class
Tanh.build_forward = build_forward

The `build_backward` method computes the gradient of the loss with respect to the inputs to the activation layer.

In [18]:
    def build_backward(self, dLdA):
        """Computes the gradient of the loss with respect to the inputs to the
        layer using the gradient of the loss with respect to the outputs of the
        layer for a single batch.

        Args:
            dLdA (Tensor): An n by b matrix representing the gradient of the loss
                with respect to the outputs for the layer for a batch of size b.

        Returns:
            Tensor: An n by b matrix representing the gradient of the loss with
                respect to the inputs of the layer for a batch of size b.

        """
        self.backward = tf.multiply(tf.subtract(tf.constant(1.0), tf.square(self.forward)), dLdA)

        return self.backward
            
# Add this method to the Tanh layer class
Tanh.build_backward = build_backward
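
The backward pass relies on the identity $\frac{d}{dz}\tanh(z) = 1 - \tanh^2(z)$, which is why it can reuse the stored forward output. A small standalone check in plain Python (the helper name `tanh_backward` and the test point are chosen only for this sketch):

```python
import math

def tanh_backward(z, dLdA):
    """Gradient with respect to z, reusing the forward value a = tanh(z)."""
    a = math.tanh(z)
    return (1.0 - a * a) * dLdA

# Compare against a central finite difference at an arbitrary point
z, eps = 0.3, 1e-6
numeric = (math.tanh(z + eps) - math.tanh(z - eps)) / (2 * eps)
print(abs(numeric - tanh_backward(z, 1.0)) < 1e-8)
```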

This layer has no weights to update. Therefore, no `build_sgd_step` method is required.

### Rectified Linear Unit Activation Layer

This layer encapsulates the rectified linear unit activation function.

In [19]:
class ReLU(Layer):
    """Rectified linear unit layer."""

The `build_forward` method takes a pre-activation from the previous layer and computes the activation using the ReLU function.

In [20]:
    def build_forward(self, Z):
        """Computes the output of the rectified linear unit layer.
        
        Args:
            Z (Tensor): An n by b matrix representing the input pre-activations
                of the layer for a batch of size b.
        
        Returns:
            Tensor: An n by b matrix representing the output of the layer after
                using the rectified linear activation on all inputs for a batch
                of size b.
        
        """
        self.forward = tf.maximum(tf.constant(0.0), Z)
        
        return self.forward
    
# Add this method to the ReLU layer class
ReLU.build_forward = build_forward

The `build_backward` method computes the gradient of the loss with respect to the inputs to the activation layer.

In [21]:
    def build_backward(self, dLdA):
        """Computes the gradient of the loss with respect to the inputs to the
        layer using the gradient of the loss with respect to the outputs of the
        layer for a single batch.

        Args:
            dLdA (Tensor): An n by b matrix representing the gradient of the loss
                with respect to the outputs for the layer for a batch of size b.

        Returns:
            Tensor: An n by b matrix representing the gradient of the loss with
                respect to the inputs of the layer for a batch of size b.

        """
        self.backward = tf.multiply(tf.sign(self.forward), dLdA)
        
        return self.backward
    
# Add this method to the ReLU layer class
ReLU.build_backward = build_backward
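
Because the forward output is never negative, its sign is either 0 or 1, which is exactly the ReLU derivative (taking the derivative at zero to be 0). A NumPy sketch with made-up values shows the mask this produces:

```python
import numpy as np

Z = np.array([[-1.5, 0.0, 2.0],
              [3.0, -0.2, 0.5]])     # made-up pre-activations
dLdA = np.ones_like(Z)               # pretend upstream gradient of all ones

A = np.maximum(0.0, Z)               # forward pass
dLdZ = np.sign(A) * dLdA             # 1 where the unit was active, 0 elsewhere
print(dLdZ)
```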

This layer has no weights to update. Therefore, no `build_sgd_step` method is required.

### Softmax Activation Layer

This layer encapsulates the softmax activation function.

In [22]:
class SoftMax(Layer):
    """Softmax activation layer."""

The `build_forward` method takes a pre-activation from the previous layer and computes the activation using the softmax function.

In [23]:
    def build_forward(self, Z):
        """Computes the softmax activation given the inputs from the previous
        layer for a single batch.

        Args:
            Z (Tensor): An n by b matrix representing the inputs to the softmax
                layer for a batch of size b.

        Returns:
            Tensor: An n by b matrix of outputs from softmax for a batch of 
                size b.

        """
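        # Shift each column by its maximum so tf.exp cannot overflow
        exp = tf.exp(tf.subtract(Z, tf.reduce_max(Z, axis=0, keepdims=True)))

        # The small offset keeps a later tf.log away from log(0)
        self.forward = tf.add(1.e-8, tf.divide(exp, tf.reduce_sum(exp, axis=0, keepdims=True)))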
            
        return self.forward

# Add this method to the SoftMax layer class
SoftMax.build_forward = build_forward
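
Subtracting a constant from every entry of a column leaves the softmax of that column unchanged, which is what makes the max-shift safe. The NumPy sketch below (not the TensorFlow op itself) shifts each column by its own maximum and omits the small `1.e-8` offset; the input values are made up to show a case where a naive `exp` would overflow.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for an n by b matrix."""
    shifted = Z - Z.max(axis=0, keepdims=True)   # largest entry in each column becomes 0
    expZ = np.exp(shifted)
    return expZ / expZ.sum(axis=0, keepdims=True)

Z = np.array([[1000.0, 1.0],
              [1001.0, 2.0],
              [1002.0, 3.0]])     # the first column would overflow without the shift
A = softmax(Z)
print(A.sum(axis=0))              # every column sums to 1
```

Both columns differ only by a constant offset, so they produce the same probabilities.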

The `build_backward` method computes the gradient of the loss with respect to the inputs to the activation layer. Note that I *do not* assume that $\partial \mathrm{Loss} / \partial Z^L$ is passed in directly. More information on how this works can be found in the 'Einstein Summation' notebook.

In [24]:
    def build_backward(self, dLdA):
        """Computes the gradient of the loss with respect to the inputs to the
        layer using the gradient of the loss with respect to the outputs of the
        layer for a single batch.

        Args:
            dLdA (Tensor): An n by b matrix representing the gradient of the loss
                with respect to the outputs for the layer for a batch of size b.

        Returns:
            Tensor: An n by b matrix representing the gradient of the loss with
                respect to the inputs of the layer for a batch of size b.
                
        """
        n = int(dLdA.shape[0])  # Static number of classes as a Python int for tf.eye

        dAdZ = tf.add(tf.einsum('jk,jk,ji->ijk', self.forward, tf.subtract(tf.constant(1.0), self.forward), tf.eye(n)),
                      tf.einsum('jk,ik,ji->ijk', tf.negative(self.forward), self.forward, tf.subtract(tf.constant(1.0), tf.eye(n))))

        self.backward = tf.einsum('ikj,kj->ij', dAdZ, dLdA)
        
        return self.backward

# Add this method to the SoftMax layer class
SoftMax.build_backward = build_backward
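
The same Jacobian contraction can be sketched in NumPy. The example below builds the per-example Jacobian $\partial A_j / \partial Z_i = A_j(\delta_{ij} - A_i)$ with `np.einsum`, chains it with the gradient that a negative log-likelihood loss would pass in, and checks that the result matches the familiar shortcut $A - Y$. The sizes and random labels are assumptions made for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
n, b = 3, 4                                   # hypothetical classes and batch size
Z = rng.randn(n, b)
A = np.exp(Z) / np.exp(Z).sum(axis=0, keepdims=True)
Y = np.eye(n)[:, rng.randint(n, size=b)]      # one-hot labels, one column per example

# Jacobian per example: J[i, j, k] = A[j, k] * (delta_ij - A[i, k])
I = np.eye(n)
J = np.einsum('jk,ji->ijk', A, I) - np.einsum('jk,ik->ijk', A, A)

dLdA = -Y / A                                 # gradient arriving from the loss
dLdZ = np.einsum('ikj,kj->ij', J, dLdA)       # sum over the output index k per example j

print(np.allclose(dLdZ, A - Y))
```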

This layer has no weights to update. Therefore, no `build_sgd_step` method is required.

### Negative Log-Likelihood Multi-Class Loss Layer

This isn't really a layer, but it functions quite similarly to one. It will take predictions and actual labels and compute the categorical cross-entropy loss.

In [25]:
class NLL(Layer):
    """Negative log-likelihood loss layer."""

The `build_forward` method computes the loss between predicted and actual labels using categorical cross-entropy loss.

In [26]:
    def build_forward(self, A, Y):
        """Computes the loss given the predicted and actual results.

        Args:
            A (Tensor): An n by b matrix representing the predicted results
                from the network for a batch of size b.
            Y (Tensor): An n by b matrix representing the actual expected results
                for a batch of size b.

        Returns:
            Tensor: A scalar representing the total loss summed over all of
                the outputs in a batch of size b.

        """
        self.A = A
        self.Y = Y

        self.forward = tf.negative(tf.reduce_sum(tf.multiply(self.Y, tf.log(self.A))))
        
        return self.forward

# Add this method to the NLL layer class
NLL.build_forward = build_forward
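
With one-hot labels, only the predicted probability of the true class contributes to the sum. A NumPy sketch with made-up predictions for two examples:

```python
import numpy as np

A = np.array([[0.7, 0.1],
              [0.2, 0.8],
              [0.1, 0.1]])          # made-up predictions, one column per example
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])          # one-hot labels

loss = -np.sum(Y * np.log(A))       # equals -(log 0.7 + log 0.8)
print(loss)
```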

The `build_backward` method computes the gradient of the loss with respect to the predicted outputs from the network. (Note: this is *not* in terms of the pre-activations, but the actual activations. To learn more about this, look at the 'Einstein Summation' notebook.)

In [27]:
    def build_backward(self):
        """Computes the gradient of the loss with respect to predicted targets for
        a single batch.
        
        Returns:
            Tensor: An n by b matrix representing the gradient of loss with
                respect to predicted targets for a batch of size b.
                
        """
        self.backward = tf.negative(tf.divide(self.Y, self.A))
        
        return self.backward

# Add this method to the NLL layer class
NLL.build_backward = build_backward
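
The gradient $-Y / A$ is zero everywhere except at the true class, where it is $-1 / A$. The NumPy sketch below checks that entry against a finite difference for a single made-up example; the helper `nll` exists only for this check.

```python
import numpy as np

A = np.array([[0.7], [0.2], [0.1]])    # made-up prediction for one example
Y = np.array([[0.0], [1.0], [0.0]])    # the true class is the second row

def nll(A):
    return -np.sum(Y * np.log(A))

dLdA = -Y / A                           # non-zero only at the true class: -1 / 0.2

# Central finite difference on the true-class probability
eps = 1e-6
E = np.zeros_like(A)
E[1, 0] = eps
numeric = (nll(A + E) - nll(A - E)) / (2 * eps)
print(dLdA.ravel(), numeric)
```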

This layer has no weights to update. Therefore, no `build_sgd_step` method is required.

In [28]:
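class Accuracy(Layer):
    """Classification accuracy metric for one-hot labels."""

    def build_forward(self, A, Y):
        """Computes the fraction of columns whose predicted class is correct.

        Args:
            A (Tensor): An n by b matrix of predicted class probabilities.
            Y (Tensor): An n by b matrix of one-hot labels.

        """
        self.forward = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(Y, axis=0), tf.argmax(A, axis=0)), tf.float32))

        return self.forward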
    
    def build_backward(self):
        pass
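
The accuracy compares the row index of the largest entry in each column of the predictions with that of the labels. A NumPy sketch with three made-up examples, two of which are classified correctly:

```python
import numpy as np

A = np.array([[0.7, 0.1, 0.3],
              [0.2, 0.8, 0.3],
              [0.1, 0.1, 0.4]])     # made-up predictions for three examples
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])           # one-hot labels

# The third example is predicted as class 2 but labelled class 0
accuracy = np.mean(np.argmax(Y, axis=0) == np.argmax(A, axis=0))
print(accuracy)
```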

## Model

Now that we have all the components to make up a simple neural network, we can combine them together into a model.

### Sequential Model

This is the simplest type of model which just linearly stacks each layer together.

In [29]:
class Sequential:
    """A standard neural network model with linear stacked layers."""

Before we can do anything with the model, we need to know what layers should be included and what loss should be used to compute gradient updates.

In [32]:
    def __init__(self, input, layers, loss):
        """Initialize the layers and the loss for the network.
        
        Args:
            input (Layer): An input placeholder layer to begin the network.
            layers (list of Layers): A list of layers to make up the linear
                neural network.
            loss (Layer): A final layer to use to compute the loss of the
                neural network.
        
        """
        self.input = input
        self.layers = layers
        self.loss = loss

# Add this method to the Sequential model class
Sequential.__init__ = __init__

In [None]:
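    def build(self):
        """Builds the static graph for the whole network and opens a session."""
        self.graph = tf.Graph()

        with self.graph.as_default():

            # Define the placeholders and variables of every layer
            self.input.build()
            for layer in self.layers:
                layer.build()
            self.loss.build()

            # Connect the layers, then attach the loss to a label placeholder
            self.build_forward()
            self.Y = tf.placeholder(tf.float32, shape=self.forward.shape, name='Y')
            self.loss_forward = self.loss.build_forward(self.forward, self.Y)

            # Back-propagation and update ops; the learning rate is fed at run time
            self.build_backward()
            self.lrate = tf.placeholder(tf.float32, shape=(), name='lrate')
            self.build_sgd_step(self.lrate)

            init = tf.global_variables_initializer()

        self.sess = tf.Session(graph=self.graph)
        self.sess.run(init)

# Add this method to the Sequential model class
Sequential.build = build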

To make predictions with the network, we first connect the layers with the `build_forward` method. The resulting `forward` tensor passes the data through every layer and returns the result when it is evaluated in a session.

In [40]:
    def build_forward(self):
        """Connects the layers to compute the output for a 
        training input batch.

        Returns:
            Tensor: The output tensor of the final layer.

        """
        # The input layer has no upstream tensor, so it is passed None
        self.forward = self.input.build_forward(None)

        for layer in self.layers:
            self.forward = layer.build_forward(self.forward)

        return self.forward

# Add this method to the Sequential model class
Sequential.build_forward = build_forward

To train the network, we will use stochastic gradient descent. Before we define the stochastic gradient descent training loop, we have to back-propagate the error through the layers of the network. To do this, we use the `build_backward` method.

In [41]:
    def build_backward(self):
        """Connects the layers in reverse order to compute the gradients of
        the loss with respect to each weight in the neural network to prepare
        for stochastic gradient descent.

        Returns:
            Tensor: The gradient of the loss with respect to the network inputs.

        """
        self.backward = self.loss.build_backward()

        for layer in self.layers[::-1]:
            self.backward = layer.build_backward(self.backward)

        return self.backward

# Add this method to the Sequential model class
Sequential.build_backward = build_backward

Once the error is propagated through all the layers, each layer can update its weight matrices. For a single step, this is achieved through the `build_sgd_step` method.

In [42]:
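    def build_sgd_step(self, lrate):
        """Builds a single op that updates the weight matrices throughout the
        neural network using stochastic gradient descent.

        Args:
            lrate (float or scalar Tensor): Learning rate for the update step.

        Returns:
            Operation: An op that runs every layer's update when executed.

        """
        # Layers without weights return None, which tf.group does not accept
        steps = [layer.build_sgd_step(lrate) for layer in self.layers]
        self.sgd_step = tf.group(*[step for step in steps if step is not None])

        return self.sgd_step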

# Add this method to the Sequential model class
Sequential.build_sgd_step = build_sgd_step

Now we loop over the data applying many stochastic gradient descent update steps.

In [44]:
    def sgd(self, X, Y, iters=100, lrate=0.005):
        """Trains the neural network by running stochastic gradient descent.

        Args:
            X (ndarray): A d by n matrix representing n training data points
                each with d dimensions.
            Y (ndarray): A matrix of one-hot training labels with one column
                per training data point.
            iters (int): The number of iterations to run stochastic gradient
                descent.
            lrate (float): The step size for stochastic gradient descent.

        """
        _, n = X.shape

        for it in range(iters):

            # Sample a single training point for this step
            t = np.random.randint(n)

            Xt = X[:, t:t + 1]
            Yt = Y[:, t:t + 1]

            feed = {self.input.X: Xt, self.Y: Yt, self.lrate: lrate}

            # Evaluate the loss, report progress, then apply the update ops
            loss = self.sess.run(self.loss_forward, feed_dict=feed)

            self.print_accuracy(it, X, Y, loss)

            self.sess.run(self.sgd_step, feed_dict=feed)
            
# Add this method to the Sequential model class
Sequential.sgd = sgd
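
To see the shape of this training loop without any graph or session machinery, here is a minimal NumPy sketch of single-example SGD on a made-up two-class problem. It is an illustration under stated assumptions rather than the notebook's model: it uses one linear layer followed by softmax, and it applies the combined softmax plus negative log-likelihood gradient `A - Y` directly instead of chaining separate layer gradients. The data, learning rate and iteration count are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy data: two well-separated classes in two dimensions
n = 200
labels = rng.randint(2, size=n)
X = rng.randn(2, n) + 3.0 * (2 * labels - 1)   # d by n, one column per point
Y = np.eye(2)[:, labels]                       # one-hot labels, one column per point

W, W0 = np.zeros((2, 2)), np.zeros((2, 1))

def predict(X):
    """Linear layer followed by a column-wise softmax."""
    Z = W.T @ X + W0
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)

for it in range(500):
    t = rng.randint(n)                         # pick one training point at random
    Xt, Yt = X[:, t:t + 1], Y[:, t:t + 1]

    dLdZ = predict(Xt) - Yt                    # softmax + NLL gradient for this point
    W -= 0.05 * (Xt @ dLdZ.T)                  # kernel update: W <- W - lrate * dLdW
    W0 -= 0.05 * dLdZ                          # bias update: W0 <- W0 - lrate * dLdW0

accuracy = np.mean(np.argmax(predict(X), axis=0) == labels)
print(accuracy)
```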