# NumPy

We begin by importing the NumPy package. It is common practice to import it under the shorthand `np`.

In [4]:
import numpy as np


## Arrays

The most important object in NumPy is the array, which is an N-dimensional tensor. Each array has a shape (the size of each of its dimensions) and a data type (`dtype`). As in Python, casting happens automatically when an operation needs a wider type.

In [5]:
W = np.array([1, 2, 3]) # A vector
W.shape, W.dtype

((3,), dtype('int64'))

In [6]:
W = W / 2
W

array([ 0.5,  1. ,  1.5])

In [7]:
W.dtype

dtype('float64')
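
The automatic casting above happens because the division creates a new array. When the data type matters, it can also be set explicitly. A minimal sketch follows; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

ints = np.array([1, 2, 3])                      # integer dtype inferred from the values
floats = np.array([1, 2, 3], dtype=np.float64)  # dtype requested explicitly

halves = ints / 2               # true division returns a new float64 array
truncated = halves.astype(int)  # astype returns a converted copy, truncating toward zero

print(floats.dtype, halves.dtype, truncated)  # float64 float64 [0 1 1]
```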

In [8]:
W = np.arange(9).reshape((3, 3))
W

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

## Basics

Most algebraic operations are applied element-wise and return a new array instead of modifying their operands in place.

In [9]:
W + 1

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [10]:
W * W

array([[ 0,  1,  4],
       [ 9, 16, 25],
       [36, 49, 64]])
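
To make the difference between the two behaviours concrete, the sketch below contrasts `W + 1` with the in-place forms `W += 1` and `np.add(..., out=W)`. The names `original` and `shifted` are illustrative.

```python
import numpy as np

W = np.arange(9).reshape((3, 3))
original = W.copy()

shifted = W + 1                     # allocates a new array
assert np.array_equal(W, original)  # W still holds 0..8

W += 1                              # augmented assignment writes into W itself
np.add(W, 1, out=W)                 # the same kind of in-place update via `out`
print(W[0])                         # [2 3 4]
```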

## Broadcasting

NumPy performs broadcasting: when two arrays have different shapes, the shapes are aligned from the last dimension, and missing or length-1 dimensions are stretched to match the other array without copying data. This is useful, for example, when adding biases to a minibatch.

In [11]:
V = np.array([0, 1, 1])
W * V

array([[0, 1, 2],
       [0, 4, 5],
       [0, 7, 8]])

In [12]:
W.shape, V.shape

((3, 3), (3,))
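
The bias example mentioned above could look as follows. This is a sketch with made-up shapes (4 examples, 3 features); `batch`, `bias` and `per_example` are hypothetical names.

```python
import numpy as np

batch = np.zeros((4, 3))          # hypothetical minibatch: 4 examples, 3 features
bias = np.array([0.1, 0.2, 0.3])  # one bias per feature, shape (3,)

# Shapes are aligned from the right: (4, 3) and (3,), so bias is reused for every row
shifted = batch + bias
print(shifted.shape)              # (4, 3)

# To broadcast along the other axis, add an explicit length-1 dimension
per_example = np.arange(4.0)           # shape (4,)
scaled = batch + per_example[:, None]  # (4, 3) + (4, 1) gives (4, 3)
```

Adding `per_example` without the extra dimension would fail, because shapes `(4, 3)` and `(4,)` do not match in the last dimension.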

## Algebra

The most important operation is probably the matrix-matrix product.

In [13]:
V = np.array([1, 1, 1])
np.dot(W, V)  # A matrix-vector product; with V all ones this sums each row of W

array([ 3, 12, 21])

In [14]:
np.dot(W, W)  # Matrix-matrix product of W with itself i.e. W squared

array([[ 15,  18,  21],
       [ 42,  54,  66],
       [ 69,  90, 111]])

In [15]:
W.dot(W)  # Alternative notation, equal to np.dot(W, W)

array([[ 15,  18,  21],
       [ 42,  54,  66],
       [ 69,  90, 111]])

In [16]:
W.dot(W.T)  # W.T is the transpose of W, alternatively use W.transpose()

array([[  5,  14,  23],
       [ 14,  50,  86],
       [ 23,  86, 149]])
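
Python 3.5 and later also offer the `@` operator for matrix multiplication. For the 1-D and 2-D arrays used here it gives the same results as `np.dot`, as this short sketch checks:

```python
import numpy as np

W = np.arange(9).reshape((3, 3))
V = np.array([1, 1, 1])

# `@` is the matrix-multiplication operator
assert np.array_equal(W @ V, np.dot(W, V))
assert np.array_equal(W @ W, W.dot(W))

# `*` remains element-wise, so the two results differ in general
print(np.array_equal(W @ W, W * W))  # False
```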

In [17]:
a = np.array([[1, 2, 3]])
a.shape

(1, 3)

# Python

When coding neural networks, a useful programming abstraction is to create one class per layer that implements the forward propagation and the backward propagation, plus a method that updates the layer's parameters. In Python, classes are created as follows:

In [18]:
# Python 3 allows for the shorter `class Layer:` notation
class Layer(object):
    # A method is defined using `def`
    def forward_propagation(self, input_):
        # The variable `self` is the instance of this class
        # This returns the transformed input
        pass
    
    def backward_propagation(self, gradient_wrt_output):
        # This takes the gradient w.r.t. the output,
        # and returns the gradient w.r.t. the input
        pass
    
    def update_parameters(self, input_, gradient_wrt_output, learning_rate):
        # This takes the input, the gradient w.r.t. the output and
        # updates the parameters using the given learning rate
        pass

For example, a linear transformation looks like:

In [19]:
# This class inherits from Layer
class Linear(Layer):
    # __init__ is the constructor method of this class
    def __init__(self, input_dim, output_dim):
        self.W = np.random.randn(input_dim, output_dim)
        
    def forward_propagation(self, input_):
        return input_.dot(self.W)
    
    def backward_propagation(self, gradient_wrt_output):
        return gradient_wrt_output.dot(self.W.T)
    
    def update_parameters(self, input_, gradient_wrt_output, learning_rate):
        self.W -= learning_rate * input_.T.dot(gradient_wrt_output)

You can extend this implementation to account for biases, and implement layers for the sigmoid and softmax operations as well. They can then be strung together to form an MLP.
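
One possible sketch of the sigmoid and softmax layers is shown here. It is not the only way to write them: it assumes each layer caches its output during the forward pass so that the backward pass can reuse it, and it redefines a minimal `Layer` base class only to keep the block self-contained.

```python
import numpy as np


class Layer(object):
    # Minimal stand-in for the base class, repeated to keep this block self-contained
    def update_parameters(self, input_, gradient_wrt_output, learning_rate):
        pass  # layers without parameters have nothing to update


class Sigmoid(Layer):
    def forward_propagation(self, input_):
        # Cache the output, since the derivative is output * (1 - output)
        self.output = 1.0 / (1.0 + np.exp(-input_))
        return self.output

    def backward_propagation(self, gradient_wrt_output):
        return gradient_wrt_output * self.output * (1.0 - self.output)


class Softmax(Layer):
    def forward_propagation(self, input_):
        # Subtract the row-wise maximum for numerical stability
        shifted = input_ - input_.max(axis=1, keepdims=True)
        exponentials = np.exp(shifted)
        self.output = exponentials / exponentials.sum(axis=1, keepdims=True)
        return self.output

    def backward_propagation(self, gradient_wrt_output):
        # Jacobian-vector product of the softmax, computed for each row
        weighted = (gradient_wrt_output * self.output).sum(axis=1, keepdims=True)
        return self.output * (gradient_wrt_output - weighted)


probabilities = Softmax().forward_propagation(np.array([[1.0, 2.0, 3.0]]))
print(probabilities.sum(axis=1))  # each row sums to 1, up to rounding
```

Because these layers have no parameters, they inherit the no-op `update_parameters` from `Layer`.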

*Note: This is largely pseudo-code, and untested*

In [20]:
A = np.ones(10)
A[None, :, None].shape

(1, 10, 1)

In [21]:
# Define the network as a series of layers
# Assumes Sigmoid, Softmax, get_minibatch() and targets
# (an integer array of class labels) exist
layers = [Linear(784, 100), Sigmoid(), Linear(100, 10), Softmax()]

# Forward propagate the data, keeping the input of every layer
inputs = [get_minibatch()]
for layer in layers:
    inputs.append(layer.forward_propagation(inputs[-1]))

# Calculate the cost: mean negative log-probability of the target classes
activations = inputs[-1]
batch_size = activations.shape[0]
c = -np.log(activations[np.arange(batch_size), targets]).mean()

# Get the gradient of the cost with respect to the softmax output
# The gradient of the logarithm is 1 / x, the non-indexed entries are zero, hence
one_hot = np.arange(activations.shape[1]) == targets[:, None]
dc = -(one_hot / activations) / batch_size  # divide by batch_size because of the mean

# Now go layer by layer in reverse and perform the backward propagation
grads_wrt_inputs = [dc]
for layer, input_ in zip(layers[::-1], inputs[-2::-1]):
    grad_wrt_output = grads_wrt_inputs[-1]
    # Compute the gradient w.r.t. the input before the weights change
    grads_wrt_inputs.append(layer.backward_propagation(grad_wrt_output))
    layer.update_parameters(input_, grad_wrt_output, 0.001)

NameError: name 'Sigmoid' is not defined

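To check the gradient numerically, loop over each individual parameter, set it to $\theta - \delta$, calculate the cost, set it to $\theta + \delta$, calculate the cost again (then set it back to its original value). The gradient for $\theta$ is then approximately the cost at $\theta + \delta$ minus the cost at $\theta - \delta$, divided by $2\delta$. Comparing this estimate with the result of backward propagation is a simple way to catch mistakes in the analytical gradient.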
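
The finite-difference procedure described above can be sketched as follows. `numerical_gradient` is a hypothetical helper, not part of NumPy, and the value of `delta` is an assumption; the parameter array must have a floating-point dtype.

```python
import numpy as np


def numerical_gradient(cost, theta, delta=1e-5):
    """Central-difference estimate of d cost / d theta (hypothetical helper)."""
    gradient = np.zeros_like(theta)
    for index in np.ndindex(theta.shape):
        original = theta[index]
        theta[index] = original - delta
        cost_minus = cost(theta)
        theta[index] = original + delta
        cost_plus = cost(theta)
        theta[index] = original  # restore the parameter
        gradient[index] = (cost_plus - cost_minus) / (2 * delta)
    return gradient


# Toy check: the gradient of sum(theta ** 2) is 2 * theta
theta = np.array([[1.0, -2.0], [0.5, 3.0]])
estimate = numerical_gradient(lambda t: (t ** 2).sum(), theta)
print(np.allclose(estimate, 2 * theta))  # True
```

For a layer such as `Linear`, `theta` would be the weight matrix and `cost` a function that runs the forward pass and returns the scalar cost.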