# Simple Convolutional Neural Network in Python - Step-by-Step

This notebook contains the step-by-step building blocks of a simple convolutional neural network.

**Functions**

- Convolution functions, including:
    - Zero Padding
    - Convolve window 
    - Convolution forward
    - Convolution backward
    
- Pooling functions, including:
    - Pooling forward
    - Create mask 
    - Distribute value
    - Pooling backward
    
All functions are implemented from scratch in `numpy`. 

**Notation**:

Superscript $[l]$ denotes an object of the $l^{th}$ layer e.g. $a^{[4]}$ is the $4^{th}$ layer activation; $W^{[5]}$ and $b^{[5]}$ are the $5^{th}$ layer parameters.


Superscript $(i)$ denotes an object from the $i^{th}$ example e.g. $x^{(i)}$ is the $i^{th}$ training example input.
    
    
Subscript $i$ denotes the $i^{th}$ entry of a vector e.g. $a^{[l]}_i$ denotes the $i^{th}$ entry of the activations in layer $l$, assuming this is a fully connected (FC) layer.
    
    
$n_H$, $n_W$ and $n_C$ denote respectively the height, width and number of channels of a given layer.


$n_{H_{prev}}$, $n_{W_{prev}}$ and $n_{C_{prev}}$ denote respectively the height, width and number of channels of the previous layer.


## Import Modules

In [None]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

## Zero-Padding

Zero-padding adds zeros around the border of an image; the main benefits of doing this are: 
- It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the "same" convolution, in which the height/width is exactly preserved after one layer. 

- It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.
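
As a rough illustration of the "same" convolution mentioned above: with a stride of 1 and an odd filter size $f$, a padding of $(f - 1) / 2$ preserves the height and width. The helper names `conv_output_size` and `same_pad` are hypothetical and exist only for this sketch.

```python
def conv_output_size(n_prev, f, pad, stride):
    """Output height (or width) of a convolution, using floor division."""
    return (n_prev - f + 2 * pad) // stride + 1


def same_pad(f):
    """Padding that preserves the size for stride 1; this sketch assumes f is odd."""
    if f % 2 == 0:
        raise ValueError("this sketch assumes an odd filter size")
    return (f - 1) // 2


# With "same" padding and stride 1, a 32-pixel side stays at 32 for every odd f
for f in (1, 3, 5, 7):
    pad = same_pad(f)
    print("f =", f, "pad =", pad, "output size =", conv_output_size(32, f, pad, stride=1))

# Without padding the same 3x3 filter shrinks the side from 32 to 30
print("no padding:", conv_output_size(32, 3, 0, stride=1))
```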

In [None]:
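def zero_pad(X, pad):
    """
    Pad all images of the dataset X with zeros
    
    Arguments:
    X = python numpy array of shape (m, n_H, n_W, n_C) representing a batch of m images
    pad = integer amount of padding added on each side of the height and width axes
    
    Returns:
    X_pad = padded images of shape (m, n_H + 2*pad, n_W + 2*pad, n_C)
    """
    # Pad only the height and width axes; the batch and channel axes are left untouched
    X_pad = np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode='constant', constant_values=0)
    
    return X_pad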

In [None]:
x = np.random.randn(4, 3, 3, 2)
x_pad = zero_pad(x, 2)

print ("x.shape =", x.shape)
print ("x_pad.shape =", x_pad.shape)
print ("x[1,1] =", x[1,1])
print ("x_pad[1,1] =", x_pad[1,1])

fig, axarr = plt.subplots(1, 2)
axarr[0].set_title('x')
axarr[0].imshow(x[0,:,:,0])
axarr[1].set_title('x_pad')
axarr[1].imshow(x_pad[0,:,:,0])
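
To make the effect of the padding easier to see than with random values, here is a small self-contained sketch that pads a single 2x2, one-channel image by one pixel. The names `x_small` and `x_small_pad` are illustrative only.

```python
import numpy as np

# One image (m = 1), 2x2 pixels, one channel, with values 1..4
x_small = np.arange(1, 5).reshape(1, 2, 2, 1)

# Pad height and width by 1 on each side; batch and channel axes are not padded
x_small_pad = np.pad(x_small, ((0, 0), (1, 1), (1, 1), (0, 0)),
                     mode='constant', constant_values=0)

print(x_small_pad.shape)        # (1, 4, 4, 1)
print(x_small_pad[0, :, :, 0])
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
```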

## Single Step of Convolution 

A single step of convolution applies the filter to a single position of the input. It is the building block of a convolutional unit, which:

- Takes an input volume 
- Applies a filter at every position of the input
- Outputs another volume (usually of a different size)

In [None]:
def conv_single_step(a_slice_prev, W, b):
    """
    Apply one filter defined by parameters W on a single slice (a_slice_prev) of the output activation 
    of the previous layer.
    
    Arguments:
    a_slice_prev = slice of input data; shape = (f, f, n_C_prev)
    W = weight parameters contained in a window; shape = (f, f, n_C_prev)
    b = bias parameter contained in a window; shape = (1, 1, 1)
    
    Returns:
    Z = scalar result of convolving the sliding window (W, b) on a slice of the input data
    """
    # Element-wise product between a_slice_prev and W
    s = np.multiply(a_slice_prev, W)
    
    # Sum over all entries of the volume s
    Z = np.sum(s)
    
    # Add bias b to Z; np.squeeze() turns the (1, 1, 1) array into a 0-d array,
    # which float() converts without the deprecation warning raised for arrays with ndim > 0
    Z = Z + float(np.squeeze(b))

    return Z

In [None]:
a_slice_prev = np.random.randn(4, 4, 3)
W = np.random.randn(4, 4, 3)
b = np.random.randn(1, 1, 1)
Z = conv_single_step(a_slice_prev, W, b)

print("Z =", Z)
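
Because the cell above uses random values, the result is hard to verify by eye. The following sketch repeats the same three operations (element-wise product, sum, add bias) on hand-picked numbers; all names ending in `_demo` are made up for this example.

```python
import numpy as np

# A 2x2 slice with a single channel, a matching filter, and a (1, 1, 1) bias
a_slice_demo = np.array([[1.0, 2.0], [3.0, 4.0]]).reshape(2, 2, 1)
W_demo = np.array([[1.0, 0.0], [0.0, -1.0]]).reshape(2, 2, 1)
b_demo = np.array([[[0.5]]])

# Element-wise product, sum over all entries, then add the scalar bias
z_demo = np.sum(a_slice_demo * W_demo) + b_demo.item()

print(z_demo)  # 1*1 + 2*0 + 3*0 + 4*(-1) + 0.5 = -2.5
```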

## Forward Pass

- Convolves many filters on the input
- Each convolution results in a 2D matrix output
- Outputs are stacked to get a 3D volume
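
The shape of the stacked output can be worked out from the hyperparameters before running the convolution. Assuming the usual floor convention for windows that do not fit exactly, with a filter size $f$, padding $pad$ and stride $stride$:

$$ n_H = \left\lfloor \frac{n_{H_{prev}} - f + 2 \times pad}{stride} \right\rfloor + 1 $$

$$ n_W = \left\lfloor \frac{n_{W_{prev}} - f + 2 \times pad}{stride} \right\rfloor + 1 $$

$$ n_C = \text{number of filters} $$

For example, with $n_{H_{prev}} = 4$, $f = 2$, $pad = 2$ and $stride = 2$, the output height is $\lfloor (4 - 2 + 4) / 2 \rfloor + 1 = 4$.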

In [None]:
def conv_forward(A_prev, W, b, hparameters):
    """
    Implements the forward propagation for a convolution function
    
    Arguments:
    A_prev = output activations of the previous layer; shape = (m, n_H_prev, n_W_prev, n_C_prev)
    W = weights; shape = (f, f, n_C_prev, n_C)
    b = biases; shape = (1, 1, 1, n_C)
    hparameters = python dictionary containing "stride" and "pad"
        
    Returns:
    Z = conv output; shape = (m, n_H, n_W, n_C), where
        n_H = floor((n_H_prev - f + (2 * pad)) / stride) + 1
        n_W = floor((n_W_prev - f + (2 * pad)) / stride) + 1
    cache = cache of values needed for the conv_backward() function
    """
    # Retrieve dimensions from the shape of A_prev
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
    
    # Retrieve dimensions from the shape of W
    (f, f, n_C_prev, n_C) = W.shape
    
    # Retrieve information from "hparameters"
    stride = hparameters["stride"]
    pad = hparameters["pad"]
    
    # Compute the dimensions of the CONV output volume
    n_H = int(((n_H_prev - f + (2 * pad)) / stride) + 1)
    n_W = int(((n_W_prev - f + (2 * pad)) / stride) + 1)
    
    # Initialize the output volume Z with zeros
    Z = np.zeros((m, n_H, n_W, n_C))
    
    # Create A_prev_pad by padding A_prev
    A_prev_pad = zero_pad(A_prev, pad)
    
    for i in range(0, m):                             # loop over the batch of training examples
        a_prev_pad = A_prev_pad[i, :, :, :]           # Select ith training example's padded activation
        for h in range(0, n_H):                       # loop over vertical axis of the output volume
            for w in range(0, n_W):                   # loop over horizontal axis of the output volume
                for c in range(0, n_C):               # loop over channels (= #filters) of the output volume
                    
                    # Find the corners of the current "slice"
                    vert_start = stride * h
                    vert_end = (stride * h) + f
                    horiz_start = stride * w
                    horiz_end = (stride * w) + f
                    
                    # Use the corners to define the (3D) slice of a_prev_pad
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]
                    
                    # Convolve the (3D) slice with the correct filter W and bias b, to get back one output neuron
                    Z[i, h, w, c] = conv_single_step(a_slice_prev, W[:, :, :, c], b[:, :, :, c])
                                        
    # Make sure output shape is correct
    assert(Z.shape == (m, n_H, n_W, n_C))
    
    # Save information in "cache" for backprop
    cache = (A_prev, W, b, hparameters)
    
    return Z, cache

In [None]:
A_prev = np.random.randn(10,4,4,3)
W = np.random.randn(2,2,3,8)
b = np.random.randn(1,1,1,8)
hparameters = {"pad" : 2,"stride": 2}
Z, cache_conv = conv_forward(A_prev, W, b, hparameters)

print("Z's mean =", np.mean(Z))
print("Z[3,2,1] =", Z[3,2,1])
print("cache_conv[0][1][2][3] =", cache_conv[0][1][2][3])

## Pooling Layer

The pooling (POOL) layer reduces the height and width of the input. This reduces computation and makes feature detectors more invariant to the position of a feature in the input. The two types of pooling layer are: 

- Max-pooling layer: slides an ($f, f$) window over the input and stores the max value of the window in the output.

- Average-pooling layer: slides an ($f, f$) window over the input and stores the average value of the window in the output.


These pooling layers have no parameters for backpropagation to train; however, they have hyperparameters such as the window size $f$, which specifies the height and width of the $f \times f$ window.
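
As a small worked example of the two pooling types described above, the sketch below pools one 4x4 matrix with $f = 2$ and a stride of 2, so each output entry comes from one non-overlapping 2x2 window. The matrix values and the `_demo` names are chosen purely for illustration.

```python
import numpy as np

x_demo = np.array([[1, 3, 2, 0],
                   [4, 2, 1, 5],
                   [0, 1, 3, 2],
                   [2, 6, 1, 1]], dtype=float)
f_demo, stride_demo = 2, 2
n_out_demo = (x_demo.shape[0] - f_demo) // stride_demo + 1   # 2 windows per axis

max_demo = np.zeros((n_out_demo, n_out_demo))
avg_demo = np.zeros((n_out_demo, n_out_demo))
for h in range(n_out_demo):
    for w in range(n_out_demo):
        # Corners of the current window scale with the stride
        window = x_demo[h * stride_demo:h * stride_demo + f_demo,
                        w * stride_demo:w * stride_demo + f_demo]
        max_demo[h, w] = window.max()
        avg_demo[h, w] = window.mean()

print(max_demo)  # [[4. 5.] [6. 3.]]
print(avg_demo)  # [[2.5 2.] [2.25 1.75]]
```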


In [None]:
def pool_forward(A_prev, hparameters, mode = "max"):
    """
    Implements the forward pass of the pooling layer
    
    Arguments:
    A_prev = input data; shape = (m, n_H_prev, n_W_prev, n_C_prev)
    hparameters = python dictionary containing "f" and "stride"
    mode = pooling mode defined as a string ("max" or "average")
    
    Returns:
    A = output of the pool layer; shape = (m, n_H, n_W, n_C)
    cache = cache used in the backward pass of the pooling layer; contains the input and hparameters 
    """
    # Retrieve dimensions
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
    
    # Retrieve hyperparameters from "hparameters"
    f = hparameters["f"]
    stride = hparameters["stride"]
    
    # Define dimensions of the output
    n_H = int(1 + (n_H_prev - f) / stride)
    n_W = int(1 + (n_W_prev - f) / stride)
    n_C = n_C_prev
    
    # Initialize output matrix A
    A = np.zeros((m, n_H, n_W, n_C))              
    
    for i in range(0, m):                           # loop over the training examples
        for h in range(0, n_H):                     # loop on the vertical axis of the output volume
            for w in range(0, n_W):                 # loop on the horizontal axis of the output volume
                for c in range(0, n_C):             # loop over the channels of the output volume
                    
                    # Find the corners of the current "slice"
                    vert_start = stride*h
                    vert_end = (stride*h) + f
                    horiz_start = stride*w
                    horiz_end = (stride*w) + f
                    
                    # Use the corners to define the current slice on the ith training example of A_prev, channel c
                    a_prev_slice = A_prev[i, vert_start:vert_end, horiz_start:horiz_end, c]
                    
                    # Compute the pooling operation on the slice
                    if mode == "max":
                        A[i, h, w, c] = np.max(a_prev_slice)
                    elif mode == "average":
                        A[i, h, w, c] = np.mean(a_prev_slice)
    
    # Store the input and hparameters in "cache" for pool_backward()
    cache = (A_prev, hparameters)
    
    # Make sure output shape is correct
    assert(A.shape == (m, n_H, n_W, n_C))
    
    return A, cache
    

In [None]:
A_prev = np.random.randn(2, 4, 4, 3)

hparameters = {"stride" : 2, "f": 3}

A, cache = pool_forward(A_prev, hparameters)
print("mode = max")
print("A =", A)
print()

A, cache = pool_forward(A_prev, hparameters, mode = "average")
print("mode = average")
print("A =", A)

## Backpropagation

In modern deep learning frameworks you only have to implement the forward pass and the framework takes care of the backward pass, so most deep learning engineers don't need to bother with the details of backpropagation. It is, however, included below for completeness, for those interested in understanding what happens inside those black boxes.

## Convolutional Layer Backward Pass 

This is the formula for computing $dA$, the gradient of the cost with respect to the input of the conv layer, for a certain filter $W_c$ and a given training example:


$$dA += \sum _{h=0} ^{n_H - 1} \sum_{w=0} ^{n_W - 1} W_c \times dZ_{hw}$$


Where $W_c$ is a filter and $dZ_{hw}$ is a scalar corresponding to the gradient of the cost with respect to the output of the conv layer $Z$ at the $h^{th}$ row and $w^{th}$ column (corresponding to the dot product taken at the $h^{th}$ stride down and $w^{th}$ stride across). Each term of the sum updates only the slice of $dA$ that the window at position $(h, w)$ covered in the forward pass. Note that each time, we multiply the same filter $W_c$ by a different $dZ_{hw}$ when updating $dA$. We do so because, in the forward propagation, the same filter is multiplied element-wise with, and summed over, a different a_slice at each position. Therefore when computing the backprop for $dA$, we are just adding up the gradients from all the a_slices. 

This is the formula for computing $dW_c$, the gradient of the cost with respect to one filter:

$$dW_c  += \sum _{h=0} ^{n_H - 1} \sum_{w=0} ^{n_W - 1} a_{slice} \times dZ_{hw}$$

Where $a_{slice}$ corresponds to the slice which was used to generate the output $Z_{hw}$. Hence, each term gives us the gradient for $W_c$ with respect to that slice. Since the same $W_c$ is applied at every position, we just add up all such gradients to get $dW_c$. 

This is the formula for computing $db$, the gradient of the cost with respect to the bias of a certain filter $W_c$:

$$db = \sum_h \sum_w dZ_{hw}$$

$db$ is computed by summing $dZ$. In this case, you are just summing over all the gradients of the cost with respect to the conv output ($Z$). 
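
One way to gain confidence in the three formulas above is a numerical gradient check. The sketch below is a simplified, self-contained setting (one example, one channel, one filter, stride 1, no padding) and does not use the notebook's own functions; every name ending in `_chk` is an assumption made for this example. The cost is defined as $\sum_{h,w} Z_{hw} \times dZ_{hw}$ so that its gradient with respect to $Z$ is exactly the chosen `dZ_chk`.

```python
import numpy as np

rng = np.random.default_rng(0)
A_chk = rng.standard_normal((5, 5))    # one example, one channel
W_chk = rng.standard_normal((3, 3))    # one 3x3 filter
b_chk = np.array([0.1])                # bias kept in an array so it can be perturbed in place
dZ_chk = rng.standard_normal((3, 3))   # stand-in upstream gradient
f_chk = W_chk.shape[0]
n_out_chk = A_chk.shape[0] - f_chk + 1  # stride 1, no padding


def cost_chk():
    """Scalar cost sum(Z * dZ_chk), so that dCost/dZ equals dZ_chk."""
    total = 0.0
    for h in range(n_out_chk):
        for w in range(n_out_chk):
            a_slice = A_chk[h:h + f_chk, w:w + f_chk]
            z_hw = np.sum(a_slice * W_chk) + b_chk[0]
            total += z_hw * dZ_chk[h, w]
    return total


def numerical_grad(x, eps=1e-6):
    """Central-difference gradient of cost_chk with respect to x (perturbed in place)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        plus = cost_chk()
        x[idx] = old - eps
        minus = cost_chk()
        x[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad


# Analytic gradients, following the dA, dW and db formulas
dA_chk = np.zeros_like(A_chk)
dW_chk = np.zeros_like(W_chk)
for h in range(n_out_chk):
    for w in range(n_out_chk):
        dA_chk[h:h + f_chk, w:w + f_chk] += W_chk * dZ_chk[h, w]
        dW_chk += A_chk[h:h + f_chk, w:w + f_chk] * dZ_chk[h, w]
db_chk = np.sum(dZ_chk)

print("dA matches:", np.allclose(dA_chk, numerical_grad(A_chk), atol=1e-6))
print("dW matches:", np.allclose(dW_chk, numerical_grad(W_chk), atol=1e-6))
print("db matches:", np.isclose(db_chk, numerical_grad(b_chk)[0], atol=1e-6))
```

The same idea extends to batches, several channels and other strides, at the cost of more loops.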

In [None]:
def conv_backward(dZ, cache):
    """
    Implement the backward propagation for a convolution function
    
    Arguments:
    dZ = gradient of the cost with respect to the output of the conv layer (Z); shape = (m, n_H, n_W, n_C)
    cache = cache of values needed for conv_backward(); output of conv_forward()
    
    Returns:
    dA_prev = gradient of the cost with respect to the input of the conv layer (A_prev); shape = (m, n_H_prev, n_W_prev, n_C_prev)
    dW = gradient of the cost with respect to the weights of the conv layer (W); shape = (f, f, n_C_prev, n_C)
    db = gradient of the cost with respect to the biases of the conv layer (b); shape = (1, 1, 1, n_C)
    """
    # Retrieve information from "cache"
    (A_prev, W, b, hparameters) = cache
    
    # Retrieve dimensions from the shape of A_prev
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
    
    # Retrieve dimensions from the shape of W
    (f, f, n_C_prev, n_C) = W.shape
    
    # Retrieve information from "hparameters"
    stride = hparameters['stride']
    pad = hparameters['pad']
    
    # Retrieve dimensions from the shape of dZ
    (m, n_H, n_W, n_C) = dZ.shape
    
    # Initialize dA_prev, dW, db with correct shapes
    dA_prev = np.zeros(A_prev.shape)                           
    dW = np.zeros(W.shape)
    db = np.zeros(b.shape)

    # Pad A_prev and dA_prev
    A_prev_pad = zero_pad(A_prev, pad)
    dA_prev_pad = zero_pad(dA_prev, pad)
    
    for i in range(m):                         # loop over the training examples
        a_prev_pad = A_prev_pad[i]
        da_prev_pad = dA_prev_pad[i]           # a view, so updates below accumulate in dA_prev_pad
        
        for h in range(n_H):                   # loop over vertical axis of the output volume
            for w in range(n_W):               # loop over horizontal axis of the output volume
                for c in range(n_C):           # loop over the channels of the output volume
                    
                    # Find the corners of the current "slice"
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f
                    
                    # Use the corners to define the slice from a_prev_pad
                    a_slice = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]

                    # Update gradients for the window and filter's parameters
                    da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += W[:,:,:,c] * dZ[i, h, w, c]
                    dW[:,:,:,c] += a_slice * dZ[i, h, w, c]
                    db[:,:,:,c] += dZ[i, h, w, c]
                    
        # Set the ith training example's dA_prev to the unpadded da_prev_pad.
        # Explicit end indices keep this correct when pad == 0 (pad:-pad would give an empty slice).
        dA_prev[i, :, :, :] = da_prev_pad[pad:pad + n_H_prev, pad:pad + n_W_prev, :]
    
    # Make sure output shape is correct
    assert(dA_prev.shape == (m, n_H_prev, n_W_prev, n_C_prev))
    
    return dA_prev, dW, db

In [None]:
dA, dW, db = conv_backward(Z, cache_conv)

print("dA_mean =", np.mean(dA))
print("dW_mean =", np.mean(dW))
print("db_mean =", np.mean(db))

## Pooling Layer Backward Pass

Next is the implementation for the backward pass of the pooling layer, starting with the MAX-POOL layer. Even though a pooling layer has no parameters for backprop to update, you still need to backpropagate the gradient through the pooling layer in order to compute gradients for layers that came before the pooling layer. 

## Max Pooling - Backward Pass  

The code in the cell below provides a helper function called `create_mask_from_window()` that creates a mask matrix which helps keep track of where the maximum of the matrix is.

$$ X = \begin{bmatrix}
1 & 3 \\
4 & 2
\end{bmatrix} \quad \rightarrow  \quad M =\begin{bmatrix}
0 & 0 \\
1 & 0
\end{bmatrix}$$

The 1 in $M$ marks the position of the maximum of $X$. Since the maximum is the only value that influenced the output, and therefore the cost, it is the only position to which backprop propagates the gradient.

In [None]:
def create_mask_from_window(x):
    """
    Creates a mask from an input matrix x to identify the max entry of x
    
    Arguments:
    x = Array of shape (f, f)
    
    Returns:
    mask = Array of the same shape as x; contains a True at the position corresponding to the max entry of x
    """
    # Compare every entry with the maximum; entries equal to it become True
    mask = x == np.max(x)
    
    return mask

In [None]:
x = np.random.randn(2,3)

mask = create_mask_from_window(x)

print('x = ', x)
print("mask = ", mask)
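
One detail worth knowing about a mask built with `x == np.max(x)`: if the maximum occurs more than once in the window, every tied position is marked. The sketch below shows this, together with one possible alternative that marks only the first maximum. The alternative is an assumption for illustration, not the method used in this notebook, and the names `x_tie`, `mask_tie` and `first_only` are made up.

```python
import numpy as np

# A window in which the maximum value 4.0 appears twice
x_tie = np.array([[1.0, 4.0],
                  [4.0, 2.0]])

# Equality-based mask: both tied positions are True
mask_tie = (x_tie == np.max(x_tie))
print(mask_tie)

# Alternative sketch: mark only the first maximum (in row-major order) via argmax
first_only = np.zeros(x_tie.shape, dtype=bool)
first_only[np.unravel_index(np.argmax(x_tie), x_tie.shape)] = True
print(first_only)
```

With continuous random inputs, ties are unlikely, so the two masks normally agree.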

## Average Pooling - Backward Pass 

The code in the cell below provides a helper function for average pooling. Since every element of the input window in average pooling has an equal influence on the output, the mask needs to be different from that used for max pooling: the gradient is split equally between all entries of the window. For a 2x2 window, each entry receives one quarter of it.

$$ dZ = 1 \quad \rightarrow  \quad dZ =\begin{bmatrix}
1/4 & 1/4 \\
1/4 & 1/4
\end{bmatrix}$$

In [None]:
def distribute_value(dz, shape):
    """
    Distributes the input value in the matrix of dimension shape
    
    Arguments:
    dz = input scalar
    shape = the shape (n_H, n_W) of the output matrix for which we want to distribute the value of dz
    
    Returns:
    a = Array of size (n_H, n_W) for which we distributed the value of dz
    """
    # Retrieve dimensions
    (n_H, n_W) = shape
    
    # Compute value to distribute on the matrix
    average = dz / (n_H * n_W)
    
    # Create a matrix where every entry is the "average" value
    a = np.ones(shape) * average
    
    return a

In [None]:
a = distribute_value(2, (2,2))

print('distributed value =', a)
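
A quick sanity check on this kind of distribution is that the entries add back up to the original value, so no gradient is created or lost. The sketch below is self-contained and uses `np.full` rather than the helper above; the names `dz_demo`, `shape_demo` and `spread_demo` are illustrative only.

```python
import numpy as np

dz_demo = 2.0
shape_demo = (2, 2)

# Every entry receives dz divided by the number of entries in the window
spread_demo = np.full(shape_demo, dz_demo / (shape_demo[0] * shape_demo[1]))

print(spread_demo)        # every entry is 0.5
print(spread_demo.sum())  # 2.0, the value that was distributed
```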

## Putting it together: Pooling Backward 

In [None]:
def pool_backward(dA, cache, mode = "max"):
    """
    Implements the backward pass of the pooling layer
    
    Arguments:
    dA = gradient of cost with respect to the output of the pooling layer; same shape as A
    cache = cache output from the forward pass of the pooling layer; contains the layer's input and hparameters 
    mode = the pooling mode defined as a string ("max" or "average")
    
    Returns:
    dA_prev = gradient of cost with respect to the input of the pooling layer, same shape as A_prev
    """
    # Retrieve information from "cache"
    (A_prev, hparameters) = cache
    
    # Retrieve hyperparameters from "hparameters"
    stride = hparameters["stride"]
    f = hparameters["f"]
    
    # Retrieve dimensions
    m, n_H_prev, n_W_prev, n_C_prev = A_prev.shape
    m, n_H, n_W, n_C = dA.shape
    
    # Initialize dA_prev with zeros
    dA_prev = np.zeros(A_prev.shape)
    
    for i in range(m):                         # loop over the training examples
        
        # Select training example from A_prev
        a_prev = A_prev[i]
        
        for h in range(n_H):                   # loop on the vertical axis
            for w in range(n_W):               # loop on the horizontal axis
                for c in range(n_C):           # loop over the channels (depth)
                    
                    # Find the corners of the current "slice"; they must scale with the
                    # stride so that they match the windows used in pool_forward()
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f
                    
                    # Compute the backward propagation in both modes
                    if mode == "max":
                        
                        # Use the corners and "c" to define the current slice from a_prev
                        a_prev_slice = a_prev[vert_start:vert_end, horiz_start:horiz_end, c]
                        
                        # Create the mask from a_prev_slice
                        mask = create_mask_from_window(a_prev_slice)
                        
                        # Set dA_prev to be dA_prev + (the mask multiplied by the correct entry of dA)
                        dA_prev[i, vert_start:vert_end, horiz_start:horiz_end, c] += np.multiply(mask, dA[i, h, w, c])
                        
                    elif mode == "average":
                        
                        # Get the value da from dA 
                        da = dA[i, h, w, c]
                        
                        # Define the shape of the filter as fxf
                        shape = (f, f)
                        
                        # Distribute it to get the correct slice of dA_prev
                        dA_prev[i, vert_start:vert_end, horiz_start:horiz_end, c] += distribute_value(da, shape)
                        
    # Make sure output shape is correct
    assert(dA_prev.shape == A_prev.shape)
    return dA_prev
    

In [None]:
A_prev = np.random.randn(5, 5, 3, 2)
hparameters = {"stride" : 1, "f": 2}
A, cache = pool_forward(A_prev, hparameters)
dA = np.random.randn(5, 4, 2, 2)

dA_prev = pool_backward(dA, cache, mode = "max")
print("mode = max")
print('mean of dA = ', np.mean(dA))
print('dA_prev[1,1] = ', dA_prev[1,1])  
print()

dA_prev = pool_backward(dA, cache, mode = "average")
print("mode = average")
print('mean of dA = ', np.mean(dA))
print('dA_prev[1,1] = ', dA_prev[1,1]) 