# Yet another explanation of backprop

There are many tutorials on backpropagation out there. I've skimmed through a bunch of them, and overall my favorite was [this one](https://www.ritchievink.com/blog/2017/07/10/programming-a-neural-network-from-scratch/) by Ritchie Vink. I preferred because the code examples are of good quality and give a lot of leeway for improvement. [This](https://victorzhou.com/blog/intro-to-neural-networks/) blogpost by Victor Zhou also helped me develop a mental model of what's going on.

## Neural networks in a nutshell

A neural network is a sequence of layers. Every layer takes as input $x$ and outputs $z$. We can denote this by a function which we call $f$:

$$z = f(x)$$

Note that the input $x$ can be a set of features, as well as the output from another layer. In the case of a dense layer, $f$ is an affine transformation:

$$z = w x + b$$

When we stack layers, we are simply chaining functions:

$$\hat{y} = f(f(f(\dots(f(x)))))$$

In the case of dense layers, which are linear, chaining them essentially results in a linear function. This means that even if we have a million dense layers stacked together, we still won't be able to learn non-linear patterns such as the XOR function. To add non-linearity, we add an *activation function* after each layer. Let's call these activation functions $g$. The output from the activation functions will be called $a$.

$$a = g(f(x))$$

When we stack layers, our final output is:

$$\hat{y} = g(f(g(f(\dots(g(f(x)))))))$$

Of course there are many more flavors of neural networks but that's the general idea. In the case of using dense layers, we're looking to tune the weights $w$ and biases $b$. That's where backpropagation comes in.

## Backpropagation

First of all, let's get the chain rule out of the way. Say you have a function $f$, a function $g$, and an input $x$. If we compose our functions and apply them to $x$ we get $g(f(x))$. Now say we want to find the derivative of $g$ with respect to $x$. The trick is that there the function $f$ in between $g$ and $x$. In this case we use the chain rule, which gives us:

$$\frac{\partial g}{\partial x} = \frac{\partial g}{\partial f} \times \frac{\partial f}{\partial x}$$

In other words, in order to compute $\frac{\partial g}{\partial x}$, we have to compute $\frac{\partial g}{\partial f}$ and $\frac{\partial f}{\partial x}$ and multiply them together. The chain rule is thus just a tool that we can add to our toolkit. In the case of neural networks it's super useful because we're basically just chaining functions. 

Let's say we're looking at the weights of the final layer. We'll call them $w$. The output of the network is denoted as $\hat{y}$ whilst the ground truth is $y$. We have a loss function $L$ which indicates the error between $y$ and $\hat{y}$. To update the weights, we need to calculate the gradient of the loss function with respect to the weights:

$$\frac{\partial L}{\partial w}$$

In between $w_i$ and $L$, there is the application of the dense layer and the activation function. We can thus apply the chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \times \frac{\partial a}{\partial z} \times \frac{\partial z}{\partial w}$$

In the case where our loss function is the mean squared error, the derivative is:

$$\frac{\partial L}{\partial a} = 2 \times (a - y)$$

For a sigmoid activation function, the derivative is:

$$\frac{\partial a}{\partial z} = \sigma(z) (1 - \sigma(z))$$

where $\sigma$ is in fact the sigmoid function. In the case of a dense layer, the derivative is:

$$\frac{\partial z}{\partial w} = x$$

We simply have to multiply all these elements together in order to obtain $\frac{\partial L}{\partial w}$:

$$\frac{\partial L}{\partial w} = (2 \times (a - y)) \times (\sigma(z) (1 - \sigma(z))) \times x$$

Recall that $a$ is the output of the network after having been processed by the activation function. We could have as well called it $\hat{y}$ because we're looking at the final layer, but we use $a$ because it's more generic and applies to each layer in the network. $z$ is the output of the network *before* being processed by the activation function. Note that implementation wise we thus have to keep both in memory. We can't just obtain $a$ and erase $z$.

If we plug in a different activation function and/or a different loss function, then everything will still work as long as each element is differentiable. Note that if we use the identity activation function (which doesn't change the input and has a derivative of 1), then we're simply doing linear regression!

Now how about the weights of the penultimate layer (the one just before the last one). Well we "just" have write it down using the chain rule. Here goes:

$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial a_3} \times \frac{\partial a_3}{\partial z_3} \times \frac{\partial z_3}{\partial a_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial z_2}{\partial w_2}$$

We've indexed the $a$s and $z$s because we're looking at multiple layer. In this case $a_3$ is the output of the 3rd layer (we called it $a$ before) whilst $a_2$ is the output of the 2nd layer. An important thing to notice is that we're using $\frac{\partial L}{\partial a_3} \times \frac{\partial a_3}{\partial z_3}$, which we already calculated previously. We can exploit this when we implement backpropagation in order to speed up our code but also make it shorter.

Here is the gradients for the weights of the 1st layer:

$$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial a_3} \times \frac{\partial a_3}{\partial z_3} \times \frac{\partial z_3}{\partial a_2} \times \frac{\partial a_2}{\partial z_2} \times \frac{\partial z_2}{\partial a_1} \times \frac{\partial a_1}{\partial z_1} \times \frac{\partial z_1}{\partial w_1}$$

Again the first four elements of the product have already been computed.

How about the biases $b_i$? Well in a dense layer the derivative with respect to the biases is 1 (it was $x$ with respect to the weights). For the 3rd layer this will result in:

$$\frac{\partial L}{\partial b} = (2 \times (a - y)) \times (\sigma(z) (1 - \sigma(z))) \times 1$$

In [45]:
import numpy as np
from sklearn import datasets
from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing


class ReLU:
    """Rectified Linear Unit (ReLU) activation function."""

    @staticmethod
    def activation(z):
        z[z < 0] = 0
        return z

    @staticmethod
    def gradient(z):
        z[z < 0] = 0
        z[z > 0] = 1
        return z


class Sigmoid:
    """Sigmoid activation function."""

    @staticmethod
    def activation(z):
        return 1 / (1 + np.exp(-z))

    @staticmethod
    def gradient(z):
        s = Sigmoid.activation(z)
        return s * (1 - s)


class Identity:
    """Identity activation function."""

    @staticmethod
    def activation(z):
        return z

    @staticmethod
    def gradient(z):
        return np.ones_like(z)


class MSE:
    """Mean Squared Error (MSE) loss function."""

    @staticmethod
    def loss(y_true, y_pred):
        return np.mean((y_pred - y_true) ** 2)

    @staticmethod
    def gradient(y_true, y_pred):
        return 2 * (y_pred - y_true)


class SGD:
    """Stochastic Gradient Descent (SGD)."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def step(self, weights, gradients):
        weights -= self.learning_rate * gradients


class NN:
    """

    Parameters:
        dimensions (tuples of ints of length n_layers)

    """

    def __init__(self, dimensions, activations, loss, optimizer):
        self.n_layers = len(dimensions)
        self.loss = loss
        self.optimizer = optimizer

        # Weights and biases are initiated by index. For a one hidden layer net you will have a w[1] and w[2]
        self.w = {}
        self.b = {}

        # Activations are also initiated by index. For the example we will have activations[2] and activations[3]
        self.activations = {}
        for i in range(len(dimensions) - 1):
            self.w[i + 1] = np.random.randn(dimensions[i], dimensions[i + 1]) / np.sqrt(dimensions[i])
            self.b[i + 1] = np.zeros(dimensions[i + 1])
            self.activations[i + 2] = activations[i]

    def _feed_forward(self, X):
        """Executes a forward pass through the neural network.

        This will return the state at each layer of the network, which includes the output of the
        network.

        Parameters:
            X (array of shape (batch_size, n_features))

        """

        # z = w(x) + b
        z = {}

        # a = f(z)
        a = {1: X}  # First layer has no activations as input

        for i in range(2, self.n_layers + 1):
            z[i] = np.dot(a[i - 1], self.w[i - 1]) + self.b[i - 1]
            a[i] = self.activations[i].activation(z[i])

        return z, a

    def _backprop(self, z, a, y_true):
        """Backpropagation.

        Parameters:
            z (dict of length n_layers - 1):

                z = {
                    2: w1 * x + b1
                    3: w2 * (w1 * x + b1) + b2
                    4: w3 * (w2 * (w1 * x + b1) + b2) + b3
                    ...
                }

            a (dict of length n_layers):

                a = {
                    1: x,
                    2: f(w1 * x + b1)
                    3: f(w2 * (w1 * x + b1) + b2)
                    4: f(w3 * (w2 * (w1 * x + b1) + b2) + b3)
                    ...
                }

            y_true (array of shape (batch_size, n_targets))

        """

        # Determine the partial derivative and delta for the output layer
        y_pred = a[self.n_layers]
        final_activation = self.activations[self.n_layers]
        delta = self.loss.gradient(y_true, y_pred) * final_activation.gradient(y_pred)
        dw = np.dot(a[self.n_layers - 1].T, delta)

        update_params = {
            self.n_layers - 1: (dw, delta)
        }

        # Go through the layers in reverse order
        for i in range(self.n_layers - 2, 0, -1):
            delta = np.dot(delta, self.w[i + 1].T) * self.activations[i + 1].gradient(z[i + 1])
            dw = np.dot(a[i].T, delta)
            update_params[i] = (dw, delta)

        # Update the parameters
        for k, (dw, delta) in update_params.items():
            self.optimizer.step(weights=self.w[k], gradients=dw)
            self.optimizer.step(weights=self.b[k], gradients=np.mean(delta, axis=0))

    def fit(self, X, y, epochs, batch_size, print_every=np.inf):
        """Trains the neural network.

        Parameters:
            X (array of shape (n_samples, n_features))
            y (array of shape (n_samples, n_targets))
            epochs (int)
            batch_size (int)

        """

        # As a convention we expect y to be 2D, even if there is only one target to predict
        if y.ndim == 1:
            y = np.expand_dims(y, axis=1)

        # Go through the epochs
        for i in range(epochs):

            # Shuffle the data
            idx = np.arange(X.shape[0])
            np.random.shuffle(idx)
            x_ = X[idx]
            y_ = y[idx]

            # Iterate over the training data in mini-batches
            for j in range(X.shape[0] // batch_size):
                start = j * batch_size
                stop = (j + 1) * batch_size
                z, a = self._feed_forward(x_[start:stop])
                self._backprop(z, a, y_[start:stop])

            # Display the performance every print_every eooch
            if (i + 1) % print_every == 0:
                y_pred = self.predict(X)
                print(f'[{i+1}] train loss: {self.loss.loss(y, y_pred)}')

    def predict(self, X):
        """Predicts an output for each sample in X.

        Parameters:
            X (array of shape (n_samples, n_features))

        """
        _, a = self._feed_forward(X)
        return a[self.n_layers]

Boston.

In [46]:
np.random.seed(1)

X, y = datasets.load_boston(return_X_y=True)
X = preprocessing.scale(X)

# Split into train and test
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y,
    test_size=.3,
    shuffle=True,
    random_state=42
)

nn = NN(
    dimensions=(13, 10, 1),
    activations=(ReLU, Identity),
    loss=MSE,
    optimizer=SGD(learning_rate=1e-3)
)
nn.fit(X_train, y_train, epochs=30, batch_size=8, print_every=10)

y_pred = nn.predict(X_test)

print(metrics.mean_absolute_error(y_test, y_pred))

[10] train loss: 11.796707532482444
[20] train loss: 9.700941500985953
[30] train loss: 9.023612069639709
2.505495393489851


Digits.

In [47]:
np.random.seed(1)

X, y = datasets.load_digits(return_X_y=True)

# One-hot encode y
y = np.eye(10)[y]

# Split into train and test
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y,
    test_size=.3,
    shuffle=True,
    random_state=42
)

nn = NN(
    dimensions=(64, 15, 10),
    activations=(ReLU, Sigmoid),
    loss=MSE,
    optimizer=SGD(learning_rate=1e-3)
)
nn.fit(X_train, y_train, epochs=50, batch_size=16, print_every=10)

y_pred = nn.predict(X_test)

print(metrics.classification_report(y_test.argmax(1), y_pred.argmax(1)))

[10] train loss: 0.008308476136280957
[20] train loss: 0.004984925198988307
[30] train loss: 0.004102445263740696
[40] train loss: 0.0029634369443098745
[50] train loss: 0.0018708680417568045
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        53
           1       0.96      0.98      0.97        50
           2       0.94      1.00      0.97        47
           3       0.96      0.96      0.96        54
           4       0.98      1.00      0.99        60
           5       0.94      0.97      0.96        66
           6       0.98      0.98      0.98        53
           7       1.00      0.98      0.99        55
           8       1.00      0.93      0.96        43
           9       0.98      0.93      0.96        59

    accuracy                           0.97       540
   macro avg       0.98      0.97      0.97       540
weighted avg       0.97      0.97      0.97       540

