In [93]:
import numpy as np
from testCases_copy_2 import *
from gc_utils import sigmoid, relu, dictionary_to_vector, vector_to_dictionary, gradients_to_vector

Implementing gradient checking from scratch.

We implement gradient-checking to make sure that our backward_propagation implementation is correct.

We use a finite-difference formula to check our backward_propagation implementation.

The results achieved from backward_propagation should be close to those achieved by the difference formula.

We also try to identify which parameter's gradient was computed erroneously.

The formula that we will use to verify that our gradients are correct:

- $\frac{\partial J}{\partial \theta}$ is what you want to make sure you're computing correctly. 
$$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$

- You can compute $J(\theta + \varepsilon)$ and $J(\theta - \varepsilon)$ (in the case that $\theta$ is a real number), since you're confident your implementation for $J$ is correct. 
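
As a quick numeric illustration of formula (1), the sketch below compares the two-sided difference with a one-sided difference on an example function $J(\theta) = \theta^3$, whose derivative $3\theta^2$ is known exactly. The function `cube` and the chosen values are illustrative assumptions, not part of this notebook.

```python
def cube(theta):
    # hypothetical example cost J(theta) = theta**3, chosen because dJ/dtheta = 3 * theta**2 is known
    return theta ** 3

theta, epsilon = 2.0, 1e-3
exact = 3 * theta ** 2  # 12.0

# two-sided (centered) difference, as in formula (1)
two_sided = (cube(theta + epsilon) - cube(theta - epsilon)) / (2 * epsilon)
# one-sided difference, for comparison
one_sided = (cube(theta + epsilon) - cube(theta)) / epsilon

print(abs(two_sided - exact))  # error on the order of epsilon**2
print(abs(one_sided - exact))  # error on the order of epsilon
```

For the same epsilon the two-sided estimate is much closer to the exact derivative, which is why it is the one used for checking.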

In [6]:
def forward_propagation(x, theta):
    
    # x -- a real-valued input
    # theta -- a real-valued parameter
    
    J = theta * x
    return J

In [8]:
J = forward_propagation(3,1)
print(J)

3


In [15]:
def backward_propagation(x, theta):
    
    # J = theta * x, so dJ/dtheta = x
    # (by convention, dtheta denotes the derivative of J with respect to theta)
    dtheta = x
    
    # The gradient with respect to the input x is not needed in supervised learning.
    return dtheta

In [16]:
x, theta = 2, 4
dtheta = backward_propagation(x, theta)
dtheta

2

Now we check whether the gradient computed by backward_propagation is correct by comparing it with the gradient approximated from the formula.

Steps followed:

1. compute $\theta^{+} = \theta + \varepsilon$
2. compute $\theta^{-} = \theta - \varepsilon$
3. compute $J(\theta^{+})$
4. compute $J(\theta^{-})$
5. compute gradapprox $= \frac{J(\theta^{+}) - J(\theta^{-})}{2 \varepsilon}$
 

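The following equation gives the relative difference between grad (from backward_propagation) and gradapprox:
$$ \text{difference} = \frac{\| \text{grad} - \text{gradapprox} \|_2}{\| \text{grad} \|_2 + \| \text{gradapprox} \|_2} \tag{2}$$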



In [47]:
def gradient_check(x, theta):
    
    epsilon = 0.00001
    
    theta_plus = theta + epsilon
    theta_minus = theta - epsilon
    J_theta_plus = forward_propagation(x, theta_plus)
    J_theta_minus = forward_propagation(x, theta_minus)
    
    grad_approx = (J_theta_plus - J_theta_minus)/(2*epsilon)
    grad = backward_propagation(x, theta)
    
    # compute the relative difference to see how well backward_propagation is doing
    numerator   = np.linalg.norm(grad - grad_approx)
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    difference  = numerator/denominator
    
    if(difference <= 1e-7):
        print("Gradients are correct ")
    else:
        print("Gradients are wrong ")
    
    return difference

In [48]:
print("difference " + str(gradient_check(x, theta)))
# The difference is far below the 1e-7 threshold, so backward_propagation is computing the correct gradient.

Gradients are correct 
difference 7.826683745609991e-12
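
To see the check fail, here is a small sketch with a deliberately wrong gradient. The names `cost`, `buggy_backward` and `relative_difference` are hypothetical and only mirror the functions above.

```python
import numpy as np

def cost(x, theta):
    # same linear cost as above: J = theta * x
    return theta * x

def buggy_backward(x, theta):
    # deliberately wrong: the derivative of theta * x with respect to theta is x, not 2 * x
    return 2 * x

def relative_difference(grad, grad_approx):
    # norm of the difference divided by the sum of the norms
    return np.linalg.norm(grad - grad_approx) / (np.linalg.norm(grad) + np.linalg.norm(grad_approx))

x, theta, epsilon = 2, 4, 1e-5
grad_approx = (cost(x, theta + epsilon) - cost(x, theta - epsilon)) / (2 * epsilon)
difference = relative_difference(buggy_backward(x, theta), grad_approx)
print(difference)  # about 1/3, far above the 1e-7 threshold
```

A factor-of-two bug gives a relative difference near 1/3, many orders of magnitude larger than what a correct gradient produces.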


#### Generalising gradient check for neural networks

In the general case, the cost function has more than one 1-D input, and the parameters theta contain many matrices.


We will now perform N-dimensional gradient checking.

In [74]:
# Try out dictionary_to_vector to see what it does.

W1 = np.random.randn(5,4)
b1 = np.zeros((5,1))
W2 = np.random.randn(3,5)
b2 = np.zeros((3,1))
W3 = np.random.randn(1,3)
b3 = np.zeros((1,1))

parameters = {"W1":W1, "b1":b1, "W2":W2, "b2":b2, "W3":W3, "b3":b3}
values, _  = dictionary_to_vector(parameters)  # returns (theta, keys); the keys are discarded here
values

array([[-0.18308927],
       [-1.1103054 ],
       [ 0.43067261],
       [ 0.79748214],
       [-1.03185549],
       [-0.97731724],
       [-1.25076174],
       [-0.49392573],
       [ 0.30469941],
       [ 1.84701491],
       [ 0.13996798],
       [ 0.06679919],
       [-0.51184194],
       [-1.26582374],
       [ 0.17603868],
       [-1.06989065],
       [-1.10929395],
       [ 1.95267687],
       [ 1.2991675 ],
       [ 0.01806047],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [-2.14957254],
       [-0.31620888],
       [-0.0873828 ],
       [-1.4873786 ],
       [ 1.19749602],
       [-1.17241613],
       [-0.6609025 ],
       [ 1.8389223 ],
       [-0.36039164],
       [-0.15709686],
       [ 1.20331536],
       [ 0.93063559],
       [ 0.98112898],
       [ 0.08076176],
       [ 0.7074636 ],
       [ 0.        ],
       [ 0.        ],
       [ 0.        ],
       [-0.21021307],
       [ 2.14242413],
       [ 0

In [78]:
parameters = vector_to_dictionary(values+1e-7)
parameters

{'W1': array([[-0.18308917, -1.1103053 ,  0.43067271,  0.79748224],
        [-1.03185539, -0.97731714, -1.25076164, -0.49392563],
        [ 0.30469951,  1.84701501,  0.13996808,  0.06679929],
        [-0.51184184, -1.26582364,  0.17603878, -1.06989055],
        [-1.10929385,  1.95267697,  1.2991676 ,  0.01806057]]),
 'b1': array([[1.e-07],
        [1.e-07],
        [1.e-07],
        [1.e-07],
        [1.e-07]]),
 'W2': array([[-2.14957244, -0.31620878, -0.0873827 , -1.4873785 ,  1.19749612],
        [-1.17241603, -0.6609024 ,  1.8389224 , -0.36039154, -0.15709676],
        [ 1.20331546,  0.93063569,  0.98112908,  0.08076186,  0.7074637 ]]),
 'b2': array([[1.e-07],
        [1.e-07],
        [1.e-07]]),
 'W3': array([[-0.21021297,  2.14242423,  0.9195034 ]]),
 'b3': array([[1.e-07]])}
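
The helpers in gc_utils are not shown in this notebook. A minimal sketch of the flatten and unflatten idea, assuming the shapes used above and row-major ordering, could look like this; `SHAPES`, `flatten_params` and `unflatten_params` are hypothetical names, not the gc_utils API.

```python
import numpy as np

# assumed layer shapes, taken from the parameters defined above
SHAPES = {"W1": (5, 4), "b1": (5, 1), "W2": (3, 5), "b2": (3, 1), "W3": (1, 3), "b3": (1, 1)}

def flatten_params(parameters):
    # stack every parameter, reshaped to a column, into one (n, 1) vector
    return np.concatenate([parameters[key].reshape(-1, 1) for key in SHAPES], axis=0)

def unflatten_params(theta):
    # slice the vector back into arrays of the original shapes
    parameters, start = {}, 0
    for key, shape in SHAPES.items():
        size = shape[0] * shape[1]
        parameters[key] = theta[start:start + size].reshape(shape)
        start += size
    return parameters

rng = np.random.default_rng(0)
demo_params = {key: rng.standard_normal(shape) for key, shape in SHAPES.items()}
demo_theta = flatten_params(demo_params)
print(demo_theta.shape)  # (47, 1)
restored = unflatten_params(demo_theta)
```

Working on one flat vector lets the checker perturb each of the 47 entries in turn without caring which matrix it belongs to.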

In [88]:
parameters_values = np.copy(values)
parameters_values.shape, values.shape

((47, 1), (47, 1))

In [109]:
def forward_propagation_n(X, Y, parameters):
    """
    Implements the forward propagation (and computes the cost) presented in Figure 3.
    
    Arguments:
    X -- training set for m examples
    Y -- labels for m examples 
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3":
                    W1 -- weight matrix of shape (5, 4)
                    b1 -- bias vector of shape (5, 1)
                    W2 -- weight matrix of shape (3, 5)
                    b2 -- bias vector of shape (3, 1)
                    W3 -- weight matrix of shape (1, 3)
                    b3 -- bias vector of shape (1, 1)
    
    Returns:
    cost -- the logistic cost, averaged over the m examples
    cache -- tuple of intermediate values (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
    """
    
    # retrieve parameters
    m = X.shape[1]
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]

    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)

    # Cost
    logprobs = np.multiply(-np.log(A3),Y) + np.multiply(-np.log(1 - A3), 1 - Y)
    cost = 1./m * np.sum(logprobs)
    
    cache = (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3)
    
    return cost, cache

In [110]:
x, y, parameters = gradient_check_n_test_case()
cost, _ = forward_propagation_n(x, y, parameters)

In [111]:
def backward_propagation_n(X, Y, cache):
    """
    Implement the backward propagation presented in figure 2.
    
    Arguments:
    X -- input data, of shape (input size, m)
    Y -- true labels for the m examples
    cache -- cache output from forward_propagation_n()
    
    Returns:
    gradients -- A dictionary with the gradients of the cost with respect to each parameter, activation and pre-activation variables.
    """
    
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    
    dZ3 = A3 - Y
    dW3 = 1./m * np.dot(dZ3, A2.T)
    db3 = 1./m * np.sum(dZ3, axis=1, keepdims = True)
    
    dA2 = np.dot(W3.T, dZ3)
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1./m * np.dot(dZ2, A1.T) 
    db2 = 1./m * np.sum(dZ2, axis=1, keepdims = True)
    
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1./m * np.dot(dZ1, X.T)
    db1 = 1./m * np.sum(dZ1, axis=1, keepdims = True)
    
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3,
                 "dA2": dA2, "dZ2": dZ2, "dW2": dW2, "db2": db2,
                 "dA1": dA1, "dZ1": dZ1, "dW1": dW1, "db1": db1}
    
    return gradients

In [112]:
a = np.random.randn(3,2)
print(a)

b = np.copy(a)
print(b)

[[ 0.88514116 -0.75439794]
 [ 1.25286816  0.51292982]
 [-0.29809284  0.48851815]]
[[ 0.88514116 -0.75439794]
 [ 1.25286816  0.51292982]
 [-0.29809284  0.48851815]]
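
The copy matters in the N-dimensional check: each parameter is perturbed in a fresh copy so that the original vector stays untouched. A minimal sketch of the difference between an alias and a copy (the variable names are illustrative):

```python
import numpy as np

theta = np.array([[1.0], [2.0], [3.0]])

alias = theta                # plain assignment: both names refer to the same array
alias[0] = alias[0] + 0.5
print(theta[0, 0])           # 1.5, the original changed too

theta_plus = np.copy(theta)  # independent copy
theta_plus[1] = theta_plus[1] + 0.5
print(theta[1, 0])           # 2.0, the original is unchanged
print(theta_plus[1, 0])      # 2.5
```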


In [182]:
def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    
    parameters_values,_ = dictionary_to_vector(parameters)
    num_parameters = parameters_values.shape[0] 
    grad = gradients_to_vector(gradients)
    
    grad_approx = np.zeros((num_parameters, 1))
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    
    
    for i in range(num_parameters):
        
        #compute J(theta_plus)[i]
        theta_plus = np.copy(parameters_values)
        theta_plus[i]  = theta_plus[i] + epsilon
        J_plus[i],_ = forward_propagation_n(X, Y, vector_to_dictionary(theta_plus))
        
        #compute J(theta_minus)[i]
        theta_minus = np.copy(parameters_values)
        theta_minus[i]  = theta_minus[i] - epsilon
        J_minus[i],_ = forward_propagation_n(X, Y, vector_to_dictionary(theta_minus))
        
        grad_approx[i] = (J_plus[i] - J_minus[i]) / (2*epsilon)
    
    # compute the relative difference between the two gradient vectors (a scalar)
    numerator = np.linalg.norm(grad - grad_approx)                    # norm of the difference
    denominator = np.linalg.norm(grad) + np.linalg.norm(grad_approx)  # sum of the two norms
    difference = numerator / denominator
    
    if(difference > 2e-7):
        print("\033[93m" +"there is something wrong in your backpropagation! " +str(difference))
    else:
        print("\033[92m" +"Your backward propagation works perfectly fine! " +str(difference))
    
    return difference

In [183]:
x, y, parameters = gradient_check_n_test_case()
cost, cache = forward_propagation_n(x, y, parameters)
gradients = backward_propagation_n(x, y, cache)
difference = gradient_check_n(parameters, gradients, x, y)

Your backward propagation works perfectly fine! 1.1885552035482147e-07


In [171]:
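# Per-parameter relative difference, to see which entries disagree.
# Note: grad and grad_approx are local to gradient_check_n, so they must be
# made available in this scope before this cell can run.
differences = np.zeros((47,1))
for i in range(47):
    
    numerator = np.linalg.norm(grad[i] - grad_approx[i])
    denominator = np.linalg.norm(grad[i]) + np.linalg.norm(grad_approx[i])
    if(denominator==0):
        # both grad[i] and grad_approx[i] are zero, so the division below is 0/0
        print("Here is the error")
    differences[i] = numerator/denominator

#differences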

Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
Here is the error
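
A denominator of zero means that both grad[i] and grad_approx[i] are exactly zero for that entry, so the two gradients agree there and the ratio is 0/0. One possible way to handle this, sketched below with a hypothetical helper `per_parameter_difference` and made-up toy vectors, is to report a difference of 0 for such entries.

```python
import numpy as np

def per_parameter_difference(grad, grad_approx):
    # element-wise relative difference; entries where both gradients are zero are reported as 0
    grad = np.asarray(grad, dtype=float)
    grad_approx = np.asarray(grad_approx, dtype=float)
    numerator = np.abs(grad - grad_approx)
    denominator = np.abs(grad) + np.abs(grad_approx)
    differences = np.zeros_like(numerator)
    nonzero = denominator > 0
    differences[nonzero] = numerator[nonzero] / denominator[nonzero]
    return differences

# toy data (not the notebook's gradients): the last entry is deliberately off by a factor of two
toy_grad = np.array([[1.0], [0.0], [2.0]])
toy_grad_approx = np.array([[1.0], [0.0], [4.0]])
toy_differences = per_parameter_difference(toy_grad, toy_grad_approx)
print(toy_differences.ravel())           # entries: 0, 0 and 1/3
print(int(np.argmax(toy_differences)))   # 2, the entry that disagrees most
```

The index of the largest entry points at the parameter whose gradient is most likely wrong.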


  


#### Takeaways

1. Gradient checking is slow.
2. Gradient checking cannot be used with dropout. We usually run it with dropout turned off and, once we have reassured ourselves that our backprop is correct, we turn dropout on.

3. We don't run it on every iteration of gradient descent because it is a slow process. For every single parameter, one check has to:
    - copy all parameters into theta_plus and theta_minus
    - run forward_propagation twice
    - compute the approximate gradient, and at the end the relative difference.

   We therefore run it only occasionally between iterations of gradient descent, for example once every 100 or 1000 iterations.
    
 

Exercise: try resolving the error above (the zero denominators) and finding out why gradient checking is not used with dropout.
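
On the dropout question, one way to see the problem is that dropout draws a new random mask on every forward pass, so two evaluations of the cost at the same parameters generally give different values, and the difference quotient then mixes the effect of epsilon with the effect of the mask. A minimal sketch, assuming inverted dropout on a toy linear cost; `cost_with_dropout` and all values are hypothetical:

```python
import numpy as np

def cost_with_dropout(theta, x, keep_prob, rng):
    # toy cost: sum of theta * x after inverted dropout with a freshly drawn mask
    mask = rng.random(x.shape) < keep_prob
    return float(np.sum(theta * x * mask / keep_prob))

x = np.random.default_rng(0).standard_normal(50)
theta, epsilon, keep_prob = 4.0, 1e-7, 0.5

# same theta, two forward passes with different masks
rng = np.random.default_rng(1)
j_first = cost_with_dropout(theta, x, keep_prob, rng)
j_second = cost_with_dropout(theta, x, keep_prob, rng)
print(j_first == j_second)  # False unless the two masks happen to coincide

# re-seeding before each pass reuses the same mask, so the cost is deterministic again
j_plus = cost_with_dropout(theta + epsilon, x, keep_prob, np.random.default_rng(2))
j_minus = cost_with_dropout(theta - epsilon, x, keep_prob, np.random.default_rng(2))
grad_approx = (j_plus - j_minus) / (2 * epsilon)
print(grad_approx)
```

With the mask held fixed the difference quotient is meaningful again, which is consistent with the usual advice to run the check with dropout turned off.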