# Neural Network - Gradient Checking

## 0 - Preparation

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np

In [3]:
from testCases import *
from public_tests import *
from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector

## 1 - Math Background

Backward propagation is the most error-prone part of a neural network implementation, because a wrong gradient can still let training run without raising any error.

We can use a numerical approximation of the gradient:

$$ 
\frac{\partial \mathcal{L}}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{\mathcal{L}(\theta + \varepsilon) - \mathcal{L}(\theta - \varepsilon)}{2 \varepsilon} 
$$

to verify that our backward propagation implementation works as expected. In practice, $\varepsilon$ is a small fixed value such as $10^{-7}$ rather than a true limit.
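To illustrate why the centered (two-sided) form is used, the following sketch compares it with a one-sided difference on a toy function. The function $f(\theta) = \theta^3$ and the step size `1e-4` are illustrative assumptions, not part of this notebook.

```python
def f(theta: float) -> float:
    # Toy function with a known derivative f'(theta) = 3 * theta**2
    return theta ** 3

theta, epsilon = 2.0, 1e-4
exact = 3 * theta ** 2

# One-sided difference: error shrinks roughly like epsilon
one_sided = (f(theta + epsilon) - f(theta)) / epsilon
# Centered difference: error shrinks roughly like epsilon ** 2
centered = (f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon)

one_sided_error = abs(one_sided - exact)
centered_error = abs(centered - exact)
print(one_sided_error, centered_error)
```

For the same step size, the centered estimate should be several orders of magnitude closer to the exact derivative.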

## 2 - Gradient Check Implementation: 1D Data

`forward_propagation` computes $\mathcal{L} = x \cdot \theta$, where $x, \theta \in \mathbb{R}$.

In [4]:
def forward_propagation(x: float, theta: float):
    """
    Arguments:
    x: float to denote the input data
    theta: float to denote the weight parameter
    
    Returns:
    L: the result of the function
    """
    L = theta * x
    return L

`backward_propagation` computes the derivative of $\mathcal{L}$ with respect to $\theta$, which is $\frac{\partial \mathcal{L}}{\partial \theta} = x$.

In [5]:
def backward_propagation(x: float, theta: float):
    """
    Arguments:
    x: float to denote the input data
    theta: float to denote the weight parameter
    
    Returns:
    dtheta: the derivative of L with respect to theta
    """
    dtheta = x
    return dtheta

`gradient_check` approximates the derivative numerically and measures how far it is from the analytic derivative returned by `backward_propagation`:

- $d\theta$: analytic derivative from backward propagation
- $d\hat{\theta}$: numerical approximation
- $\text{diff} = \frac{||d\theta - d\hat{\theta}||_2}{||d\theta||_2 + ||d\hat{\theta}||_2}$

In [6]:
def gradient_check(x: float, theta: float, epsilon=1e-7):
    """
    Arguments:
    x: float to denote the input data
    theta: float to denote the weight parameter
    epsilon: small float for numerical computation of the gradient
    
    Returns:
    difference: the relative difference between the numerical and analytic gradients
    """
    # Numerical computation
    theta_plus = theta + epsilon
    theta_minus = theta - epsilon
    L_plus = forward_propagation(x, theta_plus)
    L_minus = forward_propagation(x, theta_minus)
    dtheta_hat = (L_plus - L_minus) / (2 * epsilon)
    # True gradient
    dtheta = backward_propagation(x, theta)
    # Calculate the difference
    numerator = np.linalg.norm(dtheta_hat - dtheta)
    denominator = np.linalg.norm(dtheta_hat) + np.linalg.norm(dtheta)
    difference = numerator / denominator
    return difference

`TEST`

In [7]:
x, theta = 2, 4
difference = gradient_check(x, theta)
difference

2.919335883291695e-10
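A check is only useful if it fails when the derivative is wrong. The sketch below repeats the 1D check with a deliberately incorrect derivative. The names `relative_difference`, `numerical_derivative`, `correct_backward`, and `buggy_backward` are hypothetical helpers introduced for illustration only.

```python
import numpy as np

def relative_difference(a, b):
    # ||a - b||_2 / (||a||_2 + ||b||_2), the same metric as in gradient_check
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

def numerical_derivative(x, theta, epsilon=1e-7):
    # Centered difference of L = theta * x with respect to theta
    return ((theta + epsilon) * x - (theta - epsilon) * x) / (2 * epsilon)

def correct_backward(x, theta):
    return x        # dL/dtheta = x

def buggy_backward(x, theta):
    return 2 * x    # deliberately wrong by a factor of 2

x, theta = 2.0, 4.0
dtheta_hat = numerical_derivative(x, theta)
good = relative_difference(dtheta_hat, correct_backward(x, theta))
bad = relative_difference(dtheta_hat, buggy_backward(x, theta))
print(good, bad)
```

The correct derivative should give a tiny difference, comparable to the value printed above, while the factor-of-2 bug gives a difference close to 1/3. A common rule of thumb is to be suspicious once the difference reaches about `1e-3` or more; this is a heuristic, not a guarantee.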

## 3 - Gradient Check Implementation: N-Dimensional Data

Consider a three-layer fully connected neural network: `linear` + `relu` -> `linear` + `relu` -> `linear` + `sigmoid`

`forward_propagation_n` runs the forward pass and computes the cross-entropy loss from the output of the last layer

In [8]:
def forward_propagation_n(X: np.ndarray, Y: np.ndarray, parameters: dict):
    """
    Arguments:
    X: (num_features, num_samples)
    Y: (1, num_samples)
    parameters: {"W1": W1, "b1": b1, "W2": W2, "b2": b2, "W3": W3, "b3": b3}
    
    Returns:
    loss: cross-entropy loss averaged over the samples, returned as a (1, 1) array
    cache: (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
    """
    # Get parameters
    num_samples = X.shape[1]
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    W3, b3 = parameters["W3"], parameters["b3"]
    # 3-layer: linear + relu -> linear + relu -> linear + sigmoid
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    # Loss
    loss = -1.0 / num_samples * (np.matmul(Y, np.log(A3).T) + np.matmul((1 - Y), np.log(1 - A3).T))
    # save for return
    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
    return loss, cache

`backward_propagation_n` performs the backward propagation

In [9]:
def backward_propagation_n(X: np.ndarray, Y: np.ndarray, cache: tuple):
    """
    Arguments:
    X: (num_features, num_samples)
    Y: (1, num_samples)
    cache: (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
    
    Returns:
    gradients: {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
    """
    # Get info
    num_samples = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    # Backward propagation - the last layer
    dZ3 = A3 - Y
    dW3 = 1 / num_samples * np.matmul(dZ3, A2.T)
    db3 = 1 / num_samples * np.sum(dZ3, axis=1, keepdims=True)
    # Backward propagation - the 2nd layer
    dA2 = np.matmul(W3.T, dZ3)
    dZ2 = dA2 * np.where(Z2 > 0, 1, 0)
    dW2 = 1 / num_samples * np.matmul(dZ2, A1.T)
    db2 = 1 / num_samples * np.sum(dZ2, axis=1, keepdims=True)
    # Backward propagation - the 1st layer
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = dA1 * np.where(Z1 > 0, 1, 0)
    dW1 = 1 / num_samples * np.dot(dZ1, X.T)
    db1 = 1 / num_samples * np.sum(dZ1, axis=1, keepdims=True)
    # Save for return
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,"dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, 
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients
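The expression `dZ3 = A3 - Y` comes from combining the sigmoid output with the cross-entropy loss. As a sketch of that step, write $a = \sigma(z)$ for the output of one sample, $y$ for its label, and $m$ for `num_samples` (this per-sample notation is assumed here, not taken from the code):

$$
\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} = -\frac{1}{m} \left( \frac{y}{a} - \frac{1 - y}{1 - a} \right) \cdot a (1 - a) = \frac{1}{m} (a - y)
$$

The code stores `dZ3 = A3 - Y` without the $\frac{1}{m}$ factor and applies that factor later, when computing `dW3`, `db3`, and the gradients of the earlier layers.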

`gradient_check_n` approximates the gradient numerically, one parameter at a time, and measures how far it is from the analytic gradient returned by `backward_propagation_n`:

- $d\theta$: analytic gradient from backward propagation
- $d\hat{\theta}$: numerical approximation
- $\text{diff} = \frac{||d\theta - d\hat{\theta}||_2}{||d\theta||_2 + ||d\hat{\theta}||_2}$

In [10]:
def gradient_check_n(parameters: dict, gradients: dict, X: np.ndarray, Y: np.ndarray, epsilon: float=1e-7):
    """
    Arguments:
    parameters: {"W1": W1, "b1": b1, "W2": W2, "b2": b2, "W3": W3, "b3": b3}
    gradients: {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
    X: (num_features, num_samples)
    Y: (1, num_samples)
    epsilon: small float number used to calculate the derivative
    
    Returns:
    difference: the relative difference between the numerical and analytic gradients
    """
    # Flatten the weight and bias parameter into a vector
    parameters_values, _ = dictionary_to_vector(parameters)
    # Flatten the gradients into a vector    
    grad = gradients_to_vector(gradients)
    # Get info
    num_parameters = parameters_values.shape[0]
    # Verify parameter and gradient vectors have the same dimension
    assert parameters_values.shape[0] == grad.shape[0], "parameters and grad must have the same number of entries."
    # Initialize to store results
    L_plus = np.zeros((num_parameters, 1))
    L_minus = np.zeros((num_parameters, 1))
    grad_approximate = np.zeros((num_parameters, 1))
    # Compute approximate gradients: we need to rely on explicit for loop to change one parameter at a time to get the derivative
    for i in range(num_parameters):
        # Calculate the loss function corresponding to theta_plus
        theta_plus = parameters_values.copy()
        theta_plus[i] = theta_plus[i] + epsilon
        L_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(theta_plus))
        # Calculate the loss function corresponding to theta_minus
        theta_minus = parameters_values.copy()
        theta_minus[i] = theta_minus[i] - epsilon
        L_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(theta_minus))
        # Numerical result of gradient
        grad_approximate[i] = (L_plus[i] - L_minus[i]) / (2*epsilon)
    # Calculate the difference
    numerator = np.linalg.norm(grad_approximate - grad)
    denominator = np.linalg.norm(grad_approximate) + np.linalg.norm(grad)
    difference = numerator / denominator
    return difference
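`gradient_check_n` relies on the `gc_utils` helpers to move between a parameter dictionary and a single column vector. Their source is not shown in this notebook, so the following is only a sketch of how such a round trip could work. `flatten_params` and `unflatten_params` are hypothetical names, not the `gc_utils` API, and the shapes are chosen arbitrarily.

```python
import numpy as np

def flatten_params(params, keys):
    """Stack params[k] for k in keys into one (n, 1) column vector."""
    shapes = {k: params[k].shape for k in keys}
    vector = np.concatenate([params[k].reshape(-1, 1) for k in keys], axis=0)
    return vector, shapes

def unflatten_params(vector, keys, shapes):
    """Inverse of flatten_params: slice the vector back into named arrays."""
    params, start = {}, 0
    for k in keys:
        size = int(np.prod(shapes[k]))
        params[k] = vector[start:start + size].reshape(shapes[k])
        start += size
    return params

keys = ["W1", "b1"]
rng = np.random.default_rng(0)
params = {"W1": rng.standard_normal((3, 2)), "b1": rng.standard_normal((3, 1))}

vector, shapes = flatten_params(params, keys)
restored = unflatten_params(vector, keys, shapes)
print(vector.shape)
```

Whatever the real helpers do, the flattened gradient has to list its entries in the same order as the flattened parameters. Otherwise the norm in the numerator compares unrelated entries.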

In [11]:
X, Y, parameters = gradient_check_n_test_case()

cost, cache = forward_propagation_n(X, Y, parameters)
gradients = backward_propagation_n(X, Y, cache)
difference = gradient_check_n(parameters, gradients, X, Y, 1e-7)

In [12]:
difference

1.1885552035482147e-07

## 4 - Conclusion

When implementing backward propagation:

- Use a numerical approximation to verify that the gradients are correct.
- Gradient checking relies on an explicit for loop that changes only one parameter at a time to obtain the corresponding derivative.
- Once the check passes, do not run it in the actual training loop: it needs two forward passes per parameter, so it is slow.
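The cost mentioned in the last point can be made concrete with a small sketch that counts loss evaluations. The toy loss $\frac{1}{2} \lVert \theta \rVert^2$, the helper `numerical_gradient`, and the `counter` dictionary are assumptions made for illustration.

```python
import numpy as np

def numerical_gradient(loss_fn, theta, epsilon=1e-7):
    """Centered-difference gradient of loss_fn at a 1-D float vector theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        # Perturb only entry i, leaving all other parameters unchanged
        theta_plus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon
        grad[i] = (loss_fn(theta_plus) - loss_fn(theta_minus)) / (2 * epsilon)
    return grad

counter = {"calls": 0}

def loss_fn(theta):
    # Toy loss 0.5 * ||theta||^2, whose analytic gradient is theta itself
    counter["calls"] += 1
    return 0.5 * float(np.sum(theta ** 2))

theta = np.array([0.5, -1.0, 2.0, 3.0, -0.25])
grad_approx = numerical_gradient(loss_fn, theta)
grad_exact = theta

difference = np.linalg.norm(grad_approx - grad_exact) / (
    np.linalg.norm(grad_approx) + np.linalg.norm(grad_exact)
)
print(counter["calls"], difference)
```

Each parameter costs two loss evaluations, so a network with millions of parameters would need millions of forward passes for a single check.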