# Gradient checking

Use _gradient checking_ to verify correctness when implementing backpropagation.

Gradient checking relies on the numerical approximation of gradients.

Remarks:
- Don't use during training. Use only to debug.
- If grad check doesn't match, inspect which gradient component diverges.
- If regularizing, include regularization term when calculating both $d\theta_\text{approx}$ and $d\theta$.
- Grad check doesn't work with dropout. Run grad check with dropout turned off (`keep_prob = 1.0`), then re-enable dropout once the gradients check out.
- Your backprop may be correct only near initialization. To confirm correctness, run grad check at random initialization and again after a few training steps. This is seldom done in practice.
- Gradient checking is slow, so run it only to verify that your code is correct. Then turn it off and use backprop for the actual learning process.
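The remark about inspecting which gradient component diverges can be sketched as follows, assuming the backprop gradient and its numerical approximation have both been flattened into vectors. The helper name `worst_components` and the sample vectors are illustrative assumptions, not part of these notes.

```python
import numpy as np

def worst_components(grad, gradapprox, k=3, tiny=1e-12):
    """Return the indices (and errors) of the k components with the largest relative error."""
    grad = np.asarray(grad, dtype=float).ravel()
    gradapprox = np.asarray(gradapprox, dtype=float).ravel()
    # element-wise relative error; `tiny` guards against division by zero
    rel_err = np.abs(grad - gradapprox) / (np.abs(grad) + np.abs(gradapprox) + tiny)
    order = np.argsort(rel_err)[::-1][:k]
    return order, rel_err[order]

# made-up vectors: component 2 of the backprop gradient is deliberately wrong
grad = np.array([0.5, -1.2, 3.0, 0.01])
gradapprox = np.array([0.5, -1.2, 1.5, 0.01])

idx, err = worst_components(grad, gradapprox, k=1)
print(idx, err)
```

Mapping the reported index back to a specific `dW` or `db` entry then narrows down which layer's backward step to review.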

## 1-D grad check

Definition of a derivative (or gradient):

$$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$$
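To see why the two-sided (centered) form is preferred, here is a small sketch comparing it with a one-sided difference. The toy function $J(\theta) = \theta^3$ and the step size are assumptions chosen only for illustration.

```python
def J(theta):
    return theta ** 3  # toy function; the exact derivative is 3 * theta**2

theta, eps = 1.0, 1e-2
exact = 3 * theta ** 2

# one-sided difference: error is proportional to eps
one_sided = (J(theta + eps) - J(theta)) / eps
# two-sided (centered) difference: error is proportional to eps**2
two_sided = (J(theta + eps) - J(theta - eps)) / (2 * eps)

err_one = abs(one_sided - exact)
err_two = abs(two_sided - exact)
print(err_one, err_two)
```

For this function the centered estimate is off by roughly `eps**2`, while the one-sided estimate is off by roughly `3 * eps`.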

#### 1-D forward prop
Implement the linear forward propagation, $J(\theta) = \theta x$.

In [2]:
def forward_propagation(x, theta):
    """    
    Arguments:
    x: a real-valued input
    theta: our parameter, a real number as well
    
    Returns:
    J: the value of function J, computed using the formula J(theta) = theta * x
    """
    
    J = theta * x
    
    return J

#### 1-D backward prop
Computes the derivative of J with respect to theta

In [3]:
def backward_propagation(x, theta):
    """
    Arguments:
    x: a real-valued input
    theta: our param, a real number as well
    
    Returns:
    dtheta: the gradient of the cost with respect to theta
    """
    
    dtheta = x
    
    return dtheta

#### 1-D gradcheck function

Implement gradcheck to show that `backward_propagation()` correctly computes the gradient $\frac{\partial J}{\partial \theta}$.

Grad check procedure:
1. Compute gradient approximation:
    1. $\theta^{+} = \theta + \varepsilon$
    2. $\theta^{-} = \theta - \varepsilon$
    3. $J^{+} = J(\theta^{+})$
    4. $J^{-} = J(\theta^{-})$
    5. $\text{gradapprox} = \frac{J^{+} - J^{-}}{2  \varepsilon}$
2. Compute the gradient using backward propagation, and store the result in a variable "grad".
3. Compute the relative difference between "gradapprox" and the "grad":
$$\text{difference} = \frac {\lVert \text{grad} - \text{gradapprox} \rVert_2}{\lVert \text{grad} \rVert_2 + \lVert \text{gradapprox} \rVert_2}$$
4. Check difference
   - $\text{difference} \approx 10^{-7}$: likely correct 
   - $\text{difference} \approx 10^{-5}$: okay but double-check
   - $\text{difference} \approx 10^{-3}$: worry, bug in backprop

How to compute step 3:
1. Compute the numerator using `np.linalg.norm(...)`.
2. Compute the denominator. You will need to call `np.linalg.norm(...)` twice.
3. Divide the numerator by the denominator.



In [6]:
import numpy as np

def gradient_check(x, theta, epsilon=1e-7, print_msg=False):
    """
    Arguments:
    x: a float input
    theta: our parameter, a float as well
    epsilon: tiny shift to the input to compute approx grad
    
    Returns:
    difference: relative difference between the approximated gradient and the backward propagation gradient (float)
    """
    
    # compute gradapprox
    theta_plus = theta + epsilon
    theta_minus = theta - epsilon
    J_plus = forward_propagation(x, theta_plus)
    J_minus = forward_propagation(x, theta_minus)
    gradapprox = (J_plus - J_minus)/(2*epsilon)

    
    # compute grad using backprop
    grad = backward_propagation(x, theta)

    # compute difference
    numerator = np.linalg.norm(grad-gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    difference = numerator/denominator
    
    if print_msg:
        if difference > 2e-7:
            print(f"\033[93mThere is a mistake in the backward propagation! difference = {difference}\033[0m")
        else:
            print(f"\033[92mYour backward propagation works perfectly fine! difference = {difference}\033[0m")
    
    return difference
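A quick way to build trust in the check is to run it against a gradient that is known to be wrong. The sketch below is self-contained and mirrors the 1-D procedure above; the function names `grad_correct`, `grad_buggy`, and `check`, and the inputs `x = 2`, `theta = 4`, are assumptions made for illustration.

```python
import numpy as np

def J(x, theta):
    return theta * x

def grad_correct(x, theta):
    return x        # dJ/dtheta = x

def grad_buggy(x, theta):
    return 2 * x    # deliberately wrong, to show what a failure looks like

def check(grad_fn, x, theta, epsilon=1e-7):
    # centered finite-difference approximation of dJ/dtheta
    gradapprox = (J(x, theta + epsilon) - J(x, theta - epsilon)) / (2 * epsilon)
    grad = grad_fn(x, theta)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

x, theta = 2.0, 4.0
diff_ok = check(grad_correct, x, theta)
diff_bad = check(grad_buggy, x, theta)
print(diff_ok, diff_bad)
```

The correct gradient yields a difference far below $10^{-7}$, while the buggy one yields about $1/3$, well past the $10^{-3}$ warning level.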

## N-D grad check

#### L-layer N-D forward prop
Implements forward propagation for an L-layer NN with ReLU activations and a sigmoid output.

In [7]:
def compute_cost(AL, Y):
    """
    Arguments:
    AL: probability vector corresponding to your label predictions, shape (1, number of examples)
    Y: true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)

    Returns:
    cost: cross-entropy cost
    """
    
    m = Y.shape[1]

    # compute loss from aL and y.
    logprobs = np.multiply(Y, np.log(AL)) + np.multiply((1-Y), np.log(1-AL))
    cost = -(1/m)*np.sum(logprobs)
    
    cost = np.squeeze(cost) # e.g. turns [[71]] into 71
    return cost

In [8]:
def forward_propagation_L(X, parameters):
    """
    Arguments:
    X: input data, shape (input size, number of examples)
    parameters: dict with W1, b1, ..., WL, bL

    Returns:
    AL: last activation value
    caches: list of caches: (Z, A_prev, W, b)
    """
    
    caches = []
    A = X
    L = len(parameters) // 2  # num of layers

    # hidden layers: LINEAR -> RELU
    for l in range(1, L):
        A_prev = A
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']
        Z = np.dot(W, A_prev) + b
        A = relu(Z)
        caches.append((Z, A_prev, W, b))

    # output layer: LINEAR -> SIGMOID
    WL = parameters[f'W{L}']
    bL = parameters[f'b{L}']
    ZL = np.dot(WL, A) + bL
    AL = sigmoid(ZL)
    caches.append((ZL, A, WL, bL))

    return AL, caches
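`forward_propagation_L` calls `relu` and `sigmoid`, which are not defined in these notes. A minimal sketch, assuming the usual element-wise definitions:

```python
import numpy as np

def relu(Z):
    """Element-wise max(0, Z)."""
    return np.maximum(0, Z)

def sigmoid(Z):
    """Element-wise logistic function 1 / (1 + exp(-Z))."""
    return 1 / (1 + np.exp(-Z))

Z = np.array([[-1.0, 0.0, 2.0]])
print(relu(Z))     # [[0. 0. 2.]]
print(sigmoid(Z))  # middle entry is exactly 0.5
```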

#### L-layer N-D backward prop

In [9]:
def relu_derivative(Z):
    return Z > 0

In [10]:
def sigmoid_derivative(Z):
    s = 1 / (1 + np.exp(-Z))
    return s * (1 - s)

In [11]:
def backward_propagation_L(AL, Y, caches):
    """
    Implements backward propagation for an L-layer NN.
    
    Arguments:
    AL: probability vector, output of the forward propagation
    Y: true "label" vector (same shape as AL)
    caches: list of caches from forward_propagation_L:
              each cache is (Z, A_prev, W, b)

    Returns:
    grads: dict with the grads dW and db for each layer
    """
    grads = {}
    L = len(caches)
    m = Y.shape[1]
    Y = Y.reshape(AL.shape)

    # init gradient from output layer (sigmoid)
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # output layer cache
    current_cache = caches[L-1]
    ZL, A_prev, WL, bL = current_cache
    dZL = dAL * sigmoid_derivative(ZL)
    grads[f'dW{L}'] = (1/m) * np.dot(dZL, A_prev.T)
    grads[f'db{L}'] = (1/m) * np.sum(dZL, axis=1, keepdims=True)
    dA_prev = np.dot(WL.T, dZL)

    # loop for L-1 to 1
    for l in reversed(range(L - 1)):
        Z, A_prev, W, b = caches[l]
        dZ = dA_prev * relu_derivative(Z)
        grads[f'dW{l+1}'] = (1/m) * np.dot(dZ, A_prev.T)
        grads[f'db{l+1}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
        dA_prev = np.dot(W.T, dZ)

    return grads

#### L-layer N-D grad check function

For each $i$ in `range(num_parameters)`:
1. Compute `J_plus[i]`:
    1. Set $\theta^{+}$ to `np.copy(parameters_vector)`
    2. Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
    3. Calculate $J^{+}_i$ using `forward_propagation_n(X, Y, vector_to_dictionary(`$\theta^{+}$`))`, which returns the cost as its first value.
2. Compute `J_minus[i]`: do the same thing with $\theta^{-}$
3. Compute $\text{gradapprox}[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$
4. After the loop, compute the relative difference between "gradapprox" and "grad":
$$\text{difference} = \frac {\lVert \text{grad} - \text{gradapprox} \rVert_2}{\lVert \text{grad} \rVert_2 + \lVert \text{gradapprox} \rVert_2}$$
5. Check difference
   - $\text{difference} \approx 10^{-7}$: likely correct 
   - $\text{difference} \approx 10^{-5}$: okay but double-check
   - $\text{difference} \approx 10^{-3}$: worry, bug in backprop
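The loop above works on a single flat parameter vector, so the parameter dictionary has to be flattened and later rebuilt. These notes call `dictionary_to_vector`, `vector_to_dictionary`, and `gradients_to_vector` without defining them. Below is a minimal sketch of such helpers. The names `params_to_vector`, `vector_to_params`, `grads_to_vector` and the `layout` argument are hypothetical, and unlike the helpers used in these notes, this version passes the shapes around explicitly.

```python
import numpy as np

def params_to_vector(parameters):
    """Flatten {W1, b1, ..., WL, bL} into one (n, 1) column vector.

    Also returns `layout`, a list of (key, shape) pairs needed to undo the flattening.
    """
    L = len(parameters) // 2
    keys = [f"{name}{l}" for l in range(1, L + 1) for name in ("W", "b")]
    layout = [(k, parameters[k].shape) for k in keys]
    vector = np.concatenate([parameters[k].reshape(-1, 1) for k in keys], axis=0)
    return vector, layout

def vector_to_params(vector, layout):
    """Rebuild the parameter dict from a flat vector and its layout."""
    parameters, start = {}, 0
    for key, shape in layout:
        size = int(np.prod(shape))
        parameters[key] = vector[start:start + size].reshape(shape)
        start += size
    return parameters

def grads_to_vector(grads, layout):
    """Flatten dW1, db1, ..., dWL, dbL in the same order as the parameters."""
    return np.concatenate([grads["d" + k].reshape(-1, 1) for k, _ in layout], axis=0)

params = {"W1": np.arange(6.0).reshape(2, 3), "b1": np.zeros((2, 1)),
          "W2": np.ones((1, 2)), "b2": np.zeros((1, 1))}
vec, layout = params_to_vector(params)
restored = vector_to_params(vec, layout)
print(vec.shape)  # (11, 1)
```

Keeping the parameter and gradient vectors in the same order matters: if they are flattened differently, the check compares unrelated components and fails even when backprop is correct.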

In [13]:
def gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7, print_msg=False):
    """
    Checks if backward_propagation_n computes correct gradients using numerical approximation.
    
    Arguments:
    parameters: dict of your L-layer neural network parameters (W1, b1, ..., WL, bL)
    gradients: output of backward_propagation_n, same structure as parameters
    X: input data (input size, number of examples)
    Y: true labels
    epsilon: small scalar used for finite differences
    
    Returns:
    difference: rel. diff. between analytical and numerical gradients
    """

    # flatten params and grads
    parameters_vector, _ = dictionary_to_vector(parameters)
    grad_vector = gradients_to_vector(gradients)
    
    num_parameters = parameters_vector.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))
    
    # compute numerical gradient approx
    for i in range(num_parameters):
        theta_plus = np.copy(parameters_vector)
        theta_plus[i][0] += epsilon
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(theta_plus))

        theta_minus = np.copy(parameters_vector)
        theta_minus[i][0] -= epsilon
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(theta_minus))
        
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
    
    # compute difference
    numerator = np.linalg.norm(grad_vector - gradapprox)
    denominator = np.linalg.norm(grad_vector) + np.linalg.norm(gradapprox)
    difference = numerator / denominator

    if print_msg:
        if difference > 2e-7:
            print(f"\033[93mWarning: gradient check failed! Difference = {difference}\033[0m")
        else:
            print(f"\033[92mGradient check passed! Difference = {difference}\033[0m")

    return difference
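For an end-to-end picture, here is a self-contained sketch of the N-D check on a tiny 2-layer network (LINEAR -> RELU -> LINEAR -> SIGMOID). It is a variation on the function above, not the same code: instead of flattening to a vector, it perturbs each parameter entry in place. All names (`forward_cost`, `backward`, `grad_check`), the layer sizes, and the random data are assumptions made for illustration.

```python
import numpy as np

def forward_cost(X, Y, params):
    """Forward pass plus cross-entropy cost for the 2-layer net."""
    Z1 = params["W1"] @ X + params["b1"]
    A1 = np.maximum(0, Z1)
    Z2 = params["W2"] @ A1 + params["b2"]
    A2 = 1 / (1 + np.exp(-Z2))
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
    return cost, (Z1, A1, A2)

def backward(X, Y, params, cache):
    """Backprop gradients, keyed like the parameters ("W1", not "dW1")."""
    Z1, A1, A2 = cache
    m = Y.shape[1]
    dZ2 = A2 - Y                                  # sigmoid + cross-entropy shortcut
    dZ1 = (params["W2"].T @ dZ2) * (Z1 > 0)       # ReLU derivative
    return {"W1": dZ1 @ X.T / m, "b1": dZ1.sum(axis=1, keepdims=True) / m,
            "W2": dZ2 @ A1.T / m, "b2": dZ2.sum(axis=1, keepdims=True) / m}

def grad_check(X, Y, params, grads, epsilon=1e-7):
    approx = {k: np.zeros_like(v) for k, v in params.items()}
    for k, v in params.items():
        for idx in np.ndindex(*v.shape):
            old = v[idx]
            v[idx] = old + epsilon
            J_plus, _ = forward_cost(X, Y, params)
            v[idx] = old - epsilon
            J_minus, _ = forward_cost(X, Y, params)
            v[idx] = old                          # restore the original value
            approx[k][idx] = (J_plus - J_minus) / (2 * epsilon)
    g = np.concatenate([grads[k].ravel() for k in params])
    a = np.concatenate([approx[k].ravel() for k in params])
    return np.linalg.norm(g - a) / (np.linalg.norm(g) + np.linalg.norm(a))

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
params = {"W1": rng.standard_normal((4, 3)), "b1": np.zeros((4, 1)),
          "W2": rng.standard_normal((1, 4)), "b2": np.zeros((1, 1))}

_, cache = forward_cost(X, Y, params)
grads = backward(X, Y, params, cache)
difference = grad_check(X, Y, params, grads)

buggy = dict(grads)
buggy["W1"] = 2 * grads["W1"]                     # inject a bug into dW1
bad_difference = grad_check(X, Y, params, buggy)
print(difference, bad_difference)
```

Each parameter entry costs two forward passes, which is why the check is only practical on small networks and small batches.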