# Building your Deep Neural Network: Step by Step

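Welcome to your week 4 assignment (part 1 of 2)! You have previously trained a 2-layer neural network (with a single hidden layer). This week, you will build a deep neural network with as many layers as you want!

In this notebook, you will implement all the functions required to build a deep neural network.
In the next assignment, you will use these functions to build a deep neural network for image classification.

**After this assignment you will be able to:**

- Use non-linear units like ReLU to improve your model
- Build a deeper neural network (with more than one hidden layer)
- Implement an easy-to-use neural network class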

---

**Notation**:
- Superscript $[l]$ denotes a quantity associated with the $l^{th}$ layer.
    - Example: $a^{[L]}$ is the $L^{th}$ layer activation. $W^{[L]}$ and $b^{[L]}$ are the $L^{th}$ layer parameters.
- Superscript $(i)$ denotes a quantity associated with the $i^{th}$ example.
    - Example: $x^{(i)}$ is the $i^{th}$ training example.
- Subscript $i$ denotes the $i^{th}$ entry of a vector.
    - Example: $a^{[l]}_i$ denotes the $i^{th}$ entry of the $l^{th}$ layer's activations.


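## 1 - Packages

First, import all the packages you will need for this assignment.
- [numpy](http://www.numpy.org) is a package for scientific computing.
- [matplotlib](http://matplotlib.org) is a library for plotting graphs.
- dnn_utils provides some necessary helper functions.
- testCases provides test cases for checking the correctness of your functions.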

In [1]:
import numpy as np
import h5py
import matplotlib.pyplot as plt
from testCases import *
from dnn_utils import sigmoid, sigmoid_backward, relu, relu_backward

%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

np.random.seed(1)

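## 2 - Initialization
### 2-1 2-layer neural network (one hidden layer)

The model's structure is LINEAR -> RELU -> LINEAR -> SIGMOID. The parameters have the following shapes: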

- W1 (n_h, n_x)
- b1 (n_h, 1)
- W2 (n_y, n_h)
- b2 (n_y, 1)

In [2]:
def initialize_parameters(n_x, n_h, n_y):
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))

    parameters = {"W1":W1, "b1":b1, "W2":W2, "b2":b2}
    return parameters

In [3]:
# Quick test
parameters = initialize_parameters(2,2,1)
print("W1 = ", parameters["W1"])
print("b1 = ", parameters["b1"])
print("W2 = ", parameters["W2"])
print("b2 = ", parameters["b2"])

W1 =  [[ 0.01624345 -0.00611756]
 [-0.00528172 -0.01072969]]
b1 =  [[0.]
 [0.]]
W2 =  [[ 0.00865408 -0.02301539]]
b2 =  [[0.]]


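### 2-2 L-layer neural network
Initializing a deeper L-layer neural network is more involved, because there are many more weight matrices and bias vectors.

When completing `initialize_parameters_deep`, make sure the dimensions of each layer match. Recall from the lectures that $n^{[l]}$ is the number of units in layer $l$. For example, if the input $X$ has shape $(12288, 209)$ (with $m=209$ examples), then:

| Layer | Shape of $W$ | Shape of $b$ | Linear step | Shape of $Z$ |
|---|---|---|---|---|
| **Layer L-1** | $(n^{[L-1]}, n^{[L-2]})$ | $(n^{[L-1]}, 1)$ | $Z^{[L-1]} = W^{[L-1]} A^{[L-2]} + b^{[L-1]}$ | $(n^{[L-1]}, 209)$ |
| **Layer L** | $(n^{[L]}, n^{[L-1]})$ | $(n^{[L]}, 1)$ | $Z^{[L]} = W^{[L]} A^{[L-1]} + b^{[L]}$ | $(n^{[L]}, 209)$ |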
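
The shape rules above can be checked mechanically. The sketch below is illustrative only: the helper name `expected_shapes` and the hidden-layer sizes `20` and `7` are assumptions made for this example; only the input size 12288 and $m = 209$ come from the text.

```python
def expected_shapes(layer_dims, m):
    """Return {layer index: (W shape, b shape, Z shape)}.

    layer_dims[0] is the input size n^[0]; m is the number of examples.
    """
    shapes = {}
    for l in range(1, len(layer_dims)):
        W_shape = (layer_dims[l], layer_dims[l - 1])  # (n^[l], n^[l-1])
        b_shape = (layer_dims[l], 1)                  # (n^[l], 1)
        Z_shape = (layer_dims[l], m)                  # (n^[l], m)
        shapes[l] = (W_shape, b_shape, Z_shape)
    return shapes

# Hypothetical layer sizes chosen for illustration.
layer_dims = [12288, 20, 7, 1]
shapes = expected_shapes(layer_dims, m=209)
for l, (W_shape, b_shape, Z_shape) in shapes.items():
    print(f"layer {l}: W {W_shape}, b {b_shape}, Z {Z_shape}")
```

Each $W^{[l]}$ has as many columns as the previous layer has units, which is exactly what makes the product $W^{[l]} A^{[l-1]}$ well defined.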

Remember that when we compute $W X + b$ in Python, NumPy carries out broadcasting. For example, if:

$$ W = \begin{bmatrix}
    j  & k  & l\\
    m  & n & o \\
    p  & q & r 
\end{bmatrix}\;\;\; X = \begin{bmatrix}
    a  & b  & c\\
    d  & e & f \\
    g  & h & i 
\end{bmatrix} \;\;\; b =\begin{bmatrix}
    s  \\
    t  \\
    u
\end{bmatrix}\tag{2}$$

Then $WX + b$ will be:

$$ WX + b = \begin{bmatrix}
    (ja + kd + lg) + s  & (jb + ke + lh) + s  & (jc + kf + li)+ s\\
    (ma + nd + og) + t & (mb + ne + oh) + t & (mc + nf + oi) + t\\
    (pa + qd + rg) + u & (pb + qe + rh) + u & (pc + qf + ri)+ u
\end{bmatrix}\tag{3}  $$
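
The same behaviour can be seen numerically. In this small sketch the concrete numbers are arbitrary stand-ins for the symbolic entries above; it compares the broadcast sum with an explicitly tiled bias.

```python
import numpy as np

W = np.arange(1.0, 10.0).reshape(3, 3)      # stands in for the entries j..r
X = np.arange(10.0, 19.0).reshape(3, 3)     # stands in for the entries a..i
b = np.array([[100.0], [200.0], [300.0]])   # column vector (s, t, u)

# b has shape (3, 1); NumPy stretches it across the 3 columns of W X.
Z = np.dot(W, X) + b

# The same computation with the broadcast written out explicitly.
Z_explicit = np.dot(W, X) + np.tile(b, (1, X.shape[1]))

print(Z.shape)
print(np.array_equal(Z, Z_explicit))
```

If `b` had shape `(3,)` instead of `(3, 1)`, NumPy would align it with the last axis and add one entry per column rather than one per row, which is why the biases are stored as column vectors.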

In [4]:
def initialize_parameters_deep(layer_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)

    for l in range(1,L):
        parameters["W"+str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])*0.01
        parameters["b"+str(l)] = np.zeros((layer_dims[l], 1))

    return parameters

In [5]:
# Quick test
parameters = initialize_parameters_deep([5,4,3])
print("W1 = ", parameters["W1"])
print("b1 = ", parameters["b1"])
print("W2 = ", parameters["W2"])
print("b2 = ", parameters["b2"])

W1 =  [[ 0.01788628  0.0043651   0.00096497 -0.01863493 -0.00277388]
 [-0.00354759 -0.00082741 -0.00627001 -0.00043818 -0.00477218]
 [-0.01313865  0.00884622  0.00881318  0.01709573  0.00050034]
 [-0.00404677 -0.0054536  -0.01546477  0.00982367 -0.01101068]]
b1 =  [[0.]
 [0.]
 [0.]
 [0.]]
W2 =  [[-0.01185047 -0.0020565   0.01486148  0.00236716]
 [-0.01023785 -0.00712993  0.00625245 -0.00160513]
 [-0.00768836 -0.00230031  0.00745056  0.01976111]]
b2 =  [[0.]
 [0.]
 [0.]]


## 3 - Forward propagation

### 3-1 Linear forward
$$Z^{[l]} = W^{[l]}A^{[l-1]} +b^{[l]}\tag{4}$$

- W1 (n_h, n_x)
- b1 (n_h, 1)
- W2 (n_y, n_h)
- b2 (n_y, 1)

In [6]:
def linear_forward(A, W, b):
    Z = np.dot(W, A) + b
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    return Z, cache

In [7]:
A, W, b = linear_forward_test_case()
Z, linear_cache = linear_forward(A, W, b)
print("Z=", Z)

Z= [[ 3.26295337 -1.23429987]]


For convenience, you will group the two steps (Linear and Activation) into a single function (LINEAR->ACTIVATION).
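
`sigmoid` and `relu` are imported from `dnn_utils`, which is not shown in this notebook. Judging from how they are called here, each returns the activation together with a cache. A minimal sketch under that assumption follows; the `_sketch` names are hypothetical, and using `Z` itself as the cache is an assumption.

```python
import numpy as np

def sigmoid_sketch(Z):
    """Sigmoid activation; returns (A, cache), with cache assumed to be Z."""
    A = 1.0 / (1.0 + np.exp(-Z))
    return A, Z

def relu_sketch(Z):
    """ReLU activation; returns (A, cache), with cache assumed to be Z."""
    A = np.maximum(0, Z)
    return A, Z

Z = np.array([[-2.0, 0.0, 3.0]])
A_sig, cache_sig = sigmoid_sketch(Z)
A_relu, cache_relu = relu_sketch(Z)
print("sigmoid:", A_sig)
print("relu:   ", A_relu)
```

Keeping `Z` in the cache is what later lets the backward pass evaluate $g'(Z^{[l]})$ without recomputing the forward step.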

In [8]:
def linear_activation_forward(A_prev, W, b, activation):
    # Linear step: Z = W A_prev + b
    Z, linear_cache = linear_forward(A_prev, W, b)
    # Activation step: A = g(Z)
    if activation == "sigmoid":
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        A, activation_cache = relu(Z)
    else:
        raise ValueError("activation must be 'sigmoid' or 'relu'")
    cache = (linear_cache, activation_cache)
    return A, cache

In [9]:
# Quick test
A_prev, W, b = linear_activation_forward_test_case()
A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation="sigmoid")
print("With sigmoid: A = ", A)
A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation="relu")
print("With ReLU: A = ", A)

With sigmoid: A =  [[0.96890023 0.11013289]]
With ReLU: A =  [[3.43896131 0.        ]]


Forward propagation for the L-layer model

In [10]:
def L_model_forward(X, parameters):
    caches = []
    A = X
    L = len(parameters) // 2
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters["W"+str(l)], parameters["b"+str(l)], activation="relu")
        caches.append(cache)
    
    AL, cache = linear_activation_forward(A, parameters["W"+str(L)], parameters["b"+str(L)], activation="sigmoid")
    caches.append(cache)

    assert(AL.shape == (1,X.shape[1]))
    return AL, caches

In [12]:
# Quick test
X, parameters = L_model_forward_test_case()
AL, caches = L_model_forward(X, parameters)
print("AL = ", AL)
print("Length of caches list = ", len(caches))

AL =  [[0.17007265 0.2524272 ]]
Length of caches list =  2


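Great! You now have a full forward propagation that takes the input X and outputs a row vector $A^{[L]}$ containing your predictions. It also records all intermediate values in "caches". Using $A^{[L]}$, you can compute the cost of your predictions.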

## 5 - Cost function

Now you will implement forward and backward propagation. You need to compute the cost, because you want to check whether your model is actually learning.

**Exercise**: Compute the cross-entropy cost $J$, using the following formula: $$-\frac{1}{m} \sum\limits_{i = 1}^{m} (y^{(i)}\log\left(a^{[L] (i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)) \tag{7}$$

In [13]:
def compute_cost(AL, Y):
    m = Y.shape[1]
    cost = -1/m * np.sum(Y*np.log(AL)+(1-Y)*np.log(1-AL))
    cost = np.squeeze(cost)
    return cost

In [14]:
# Quick test
Y, AL = compute_cost_test_case()
print("cost = ", compute_cost(AL, Y))

cost =  0.41493159961539694
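
`np.log` returns `-inf` at 0, so a cost computed this way becomes `inf` or `nan` if any entry of `AL` is exactly 0 or 1. One common guard, sketched below, clips the predictions before taking logs. The name `compute_cost_clipped`, the value of `eps`, and the sample inputs are assumptions for illustration and are not part of the assignment.

```python
import numpy as np

def compute_cost_clipped(AL, Y, eps=1e-12):
    """Cross-entropy cost (formula 7) with predictions clipped to [eps, 1 - eps]."""
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1.0 - eps)  # keep both log arguments strictly positive
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(np.squeeze(cost))

# Illustrative inputs (hypothetical values).
Y = np.array([[1, 1, 1]])
AL = np.array([[0.8, 0.9, 0.4]])
cost_regular = compute_cost_clipped(AL, Y)
print("cost =", cost_regular)

# Saturated predictions: the unclipped formula would hit log(0) here.
AL_saturated = np.array([[1.0, 0.0, 1.0]])
cost_saturated = compute_cost_clipped(AL_saturated, Y)
print("cost with saturated predictions =", cost_saturated)
```

Clipping changes the cost only when a prediction is within `eps` of 0 or 1, so ordinary predictions give the same value as the unclipped formula.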


## 6 - Backward propagation module

Just like with forward propagation, you will implement helper functions for backpropagation. Remember that back propagation is used to calculate the gradient of the loss function with respect to the parameters. 

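As with forward propagation, you will build backward propagation in three steps:

- LINEAR backward
- LINEAR -> ACTIVATION backward, where ACTIVATION computes the derivative of either the ReLU or the sigmoid activation
- [LINEAR -> RELU] $\times$ (L-1) -> LINEAR -> SIGMOID backward (whole model)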

The three outputs $(dW^{[l]}, db^{[l]}, dA^{[l-1]})$ are computed using the input $dZ^{[l]}$. Here are the formulas you need:
$$ dW^{[l]} = \frac{\partial \mathcal{L} }{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1] T} \tag{8}$$
$$ db^{[l]} = \frac{\partial \mathcal{L} }{\partial b^{[l]}} = \frac{1}{m} \sum_{i = 1}^{m} dZ^{[l](i)}\tag{9}$$
$$ dA^{[l-1]} = \frac{\partial \mathcal{L} }{\partial A^{[l-1]}} = W^{[l] T} dZ^{[l]} \tag{10}$$
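
Formulas (8) and (9) can be checked numerically with finite differences. The sketch below is a self-contained illustration, not part of the assignment: it uses a made-up toy cost $J = \frac{1}{m}\sum \frac{1}{2} Z^2$ (so that $dZ = Z$), random inputs, and a hypothetical helper `numeric_grad`.

```python
import numpy as np

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 4))   # 3 units in layer l-1, m = 4 examples
W = rng.standard_normal((2, 3))        # 2 units in layer l
b = rng.standard_normal((2, 1))
m = A_prev.shape[1]

def toy_cost(W, b):
    """Toy cost J = (1/m) * sum(0.5 * Z**2), chosen so that dZ = Z."""
    Z = np.dot(W, A_prev) + b
    return np.sum(0.5 * Z ** 2) / m

# Analytic gradients from formulas (8) and (9).
dZ = np.dot(W, A_prev) + b
dW = np.dot(dZ, A_prev.T) / m
db = np.sum(dZ, axis=1, keepdims=True) / m

def numeric_grad(f, P, eps=1e-6):
    """Central finite differences of the scalar function f with respect to P."""
    grad = np.zeros_like(P)
    for idx in np.ndindex(P.shape):
        P_plus, P_minus = P.copy(), P.copy()
        P_plus[idx] += eps
        P_minus[idx] -= eps
        grad[idx] = (f(P_plus) - f(P_minus)) / (2 * eps)
    return grad

dW_num = numeric_grad(lambda P: toy_cost(P, b), W)
db_num = numeric_grad(lambda P: toy_cost(W, P), b)

print("max |dW - dW_num| =", np.max(np.abs(dW - dW_num)))
print("max |db - db_num| =", np.max(np.abs(db - db_num)))
```

Both differences should be tiny, limited only by floating-point rounding. Formula (10) carries no $\frac{1}{m}$ factor because $dZ$ and $dA$ here are per-example quantities; the averaging over examples enters only in $dW$ and $db$.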

In [16]:
def linear_backward(dZ, cache):
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = 1/m * np.dot(dZ, A_prev.T)
    db = 1/m * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)

    return dA_prev, dW, db

In [17]:
# Quick test
dZ, linear_cache = linear_backward_test_case()

dA_prev, dW, db = linear_backward(dZ, linear_cache)
print("dA_prev = ", dA_prev)
print("dW = ", dW)
print("db = ", db)

dA_prev =  [[ 0.51822968 -0.19517421]
 [-0.40506361  0.15255393]
 [ 2.37496825 -0.89445391]]
dW =  [[-0.10076895  1.40685096  1.64992505]]
db =  [[0.50629448]]


### 6.2 - Linear-Activation backward

To help you implement `linear_activation_backward`, we provide two backward functions:

- **`sigmoid_backward`**: Implements the backward propagation for SIGMOID unit. You can call it as follows:

```python
dZ = sigmoid_backward(dA, activation_cache)
```

- **`relu_backward`**: Implements the backward propagation for RELU unit. You can call it as follows:

```python
dZ = relu_backward(dA, activation_cache)
```

If $g(.)$ is the activation function, 
`sigmoid_backward` and `relu_backward` compute $$dZ^{[l]} = dA^{[l]} * g'(Z^{[l]}) \tag{11}$$
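
The two helpers live in `dnn_utils`, which is not shown here. A minimal sketch of what formula (11) amounts to for each activation is given below, assuming the activation cache is simply `Z`; the `_sketch` names and the sample values are hypothetical.

```python
import numpy as np

def sigmoid_backward_sketch(dA, cache):
    """dZ = dA * s * (1 - s), where s = sigmoid(Z) and the cache is assumed to hold Z."""
    Z = cache
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)

def relu_backward_sketch(dA, cache):
    """dZ = dA where Z > 0 and 0 elsewhere (cache assumed to hold Z)."""
    Z = cache
    dZ = np.array(dA, copy=True)  # copy so the caller's dA is left untouched
    dZ[Z <= 0] = 0
    return dZ

Z = np.array([[-1.0, 0.0, 2.0]])
dA = np.array([[0.5, 0.5, 0.5]])
dZ_sigmoid = sigmoid_backward_sketch(dA, Z)
dZ_relu = relu_backward_sketch(dA, Z)
print("sigmoid:", dZ_sigmoid)
print("relu:   ", dZ_relu)
```

For ReLU the derivative is 1 where $Z > 0$ and is taken as 0 elsewhere, so the gradient either passes through unchanged or is blocked.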

**Exercise**: Implement the backpropagation for the *LINEAR->ACTIVATION* layer.

In [18]:
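def linear_activation_backward(dA, cache, activation):
    linear_cache, activation_cache = cache

    # Activation step: dZ = dA * g'(Z)
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
    else:
        raise ValueError("activation must be 'sigmoid' or 'relu'")

    # Linear step: gradients with respect to A_prev, W and b
    dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db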

In [19]:
# Quick test
AL, linear_activation_cache = linear_activation_backward_test_case()

dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation="sigmoid")
print ("sigmoid:")
print ("dA_prev = ", dA_prev)
print ("dW = ", dW)
print ("db = ", db)

dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation = "relu")
print ("relu:")
print ("dA_prev = ", dA_prev)
print ("dW = ", dW)
print ("db = ", db)

sigmoid:
dA_prev =  [[ 0.11017994  0.01105339]
 [ 0.09466817  0.00949723]
 [-0.05743092 -0.00576154]]
dW =  [[ 0.10266786  0.09778551 -0.01968084]]
db =  [[-0.05729622]]
relu:
dA_prev =  [[ 0.44090989 -0.        ]
 [ 0.37883606 -0.        ]
 [-0.2298228   0.        ]]
dW =  [[ 0.44513824  0.37371418 -0.10478989]]
db =  [[-0.20837892]]


### 6.3 - L-model backward

Now you will implement the backward function for the whole network. Recall that when you implemented the `L_model_forward` function, at each iteration, you stored a cache which contains (X, W, b, and z). In the backpropagation module, you will use those variables to compute the gradients. Therefore, in the `L_model_backward` function, you will iterate through all the hidden layers backward, starting from layer $L$. On each step, you will use the cached values for layer $l$ to backpropagate through layer $l$.

**Initializing backpropagation**:
To backpropagate through this network, we know that the output is, 
$A^{[L]} = \sigma(Z^{[L]})$. Your code thus needs to compute `dAL` $= \frac{\partial \mathcal{L}}{\partial A^{[L]}}$.
To do so, use this formula (derived using calculus which you don't need in-depth knowledge of):
```python
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL)) # derivative of cost with respect to AL
```
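
One way to sanity-check this expression: multiplying `dAL` by the sigmoid derivative $A^{[L]}(1 - A^{[L]})$ should collapse to the well-known result $dZ^{[L]} = A^{[L]} - Y$. The values below are made up for illustration.

```python
import numpy as np

Y = np.array([[1.0, 0.0, 1.0]])
AL = np.array([[0.8, 0.3, 0.6]])

# Derivative of the cross-entropy loss with respect to AL.
dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

# Chain rule through the sigmoid: g'(Z) = AL * (1 - AL).
dZL = dAL * AL * (1 - AL)

print(dZL)
print(np.allclose(dZL, AL - Y))
```

The second line printed is `True`: the division by `AL` and `1 - AL` cancels against the sigmoid derivative, leaving the prediction error `AL - Y`.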

You can then use this post-activation gradient `dAL` to keep going backward. As seen in Figure 5, you can now feed `dAL` into the LINEAR->SIGMOID backward function you implemented (which will use the cached values stored by the `L_model_forward` function). After that, you will have to use a `for` loop to iterate through all the other layers using the LINEAR->RELU backward function. You should store each dA, dW, and db in the grads dictionary. To do so, use this formula:

$$grads["dW" + str(l)] = dW^{[l]}\tag{15} $$

For example, for $l=3$ this would store $dW^{[l]}$ in `grads["dW3"]`.

**Exercise**: Implement backpropagation for the *[LINEAR->RELU] $\times$ (L-1) -> LINEAR -> SIGMOID* model.

In [21]:
def L_model_backward(AL, Y, caches):
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

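    # Layer L (output layer): LINEAR -> SIGMOID
    # Note: grads["dA" + str(l)] holds the gradient passed back to layer l-1 (dA_prev of layer l)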
    dAL = -np.divide(Y, AL) + np.divide(1-Y, 1-AL)
    current_cache = caches[L-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, 'sigmoid')

    # Propagate backward from layer L-1 down to layer 1: LINEAR -> RELU
    for l in reversed(range(L-1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA"+str(l+2)], current_cache, "relu")
        grads["dA"+str(l+1)] = dA_prev_temp
        grads["dW"+str(l+1)] = dW_temp
        grads["db"+str(l+1)] = db_temp
    return grads

In [22]:
# Quick test
AL, Y_assess, caches = L_model_backward_test_case()
grads = L_model_backward(AL, Y_assess, caches)
print("dW1 = ", grads["dW1"])
print("db1 = ", grads["db1"]) 
print("dA1 = ", grads["dA1"])

dW1 =  [[0.41010002 0.07807203 0.13798444 0.10502167]
 [0.         0.         0.         0.        ]
 [0.05283652 0.01005865 0.01777766 0.0135308 ]]
db1 =  [[-0.22007063]
 [ 0.        ]
 [-0.02835349]]
dA1 =  [[ 0.          0.52257901]
 [ 0.         -0.3269206 ]
 [ 0.         -0.32070404]
 [ 0.         -0.74079187]]


### 6.4 - Update parameters

In this section you will update the parameters of the model, using gradient descent: 

$$ W^{[l]} = W^{[l]} - \alpha \text{ } dW^{[l]} \tag{16}$$
$$ b^{[l]} = b^{[l]} - \alpha \text{ } db^{[l]} \tag{17}$$

where $\alpha$ is the learning rate. After computing the updated parameters, store them in the parameters dictionary. 
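
As a small worked instance of formulas (16) and (17), the sketch below applies one gradient descent step to a single layer. All numbers and variable names are made up for illustration.

```python
import numpy as np

learning_rate = 0.1

# Hypothetical parameters and gradients for one layer.
W1 = np.array([[1.0, 2.0]])
b1 = np.array([[0.5]])
dW1 = np.array([[0.5, -1.0]])
db1 = np.array([[2.0]])

# One gradient descent step, written out of place so W1 and b1 stay unchanged.
W1_new = W1 - learning_rate * dW1
b1_new = b1 - learning_rate * db1

print("W1_new =", W1_new)
print("b1_new =", b1_new)
```

Each entry moves against the sign of its gradient: the first weight decreases, the second increases, and the bias decreases. Note that an in-place update such as `parameters["W1"] -= ...` on NumPy arrays modifies the arrays the caller passed in, whereas the out-of-place form above creates new arrays.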

In [23]:
def update_parameters(parameters, grads, lr):
    L = len(parameters) // 2
    for l in range(L):
        parameters["W"+str(l+1)] -= lr*grads["dW"+str(l+1)]
        parameters["b"+str(l+1)] -= lr*grads["db"+str(l+1)]
    return parameters

In [24]:
# Quick test
parameters, grads = update_parameters_test_case()
parameters = update_parameters(parameters, grads, 0.1)
print("W1 = ", parameters["W1"])
print("b1 = ", parameters["b1"])
print("W2 = ", parameters["W2"])
print("b2 = ", parameters["b2"])

W1 =  [[-0.59562069 -0.09991781 -2.14584584  1.82662008]
 [-1.76569676 -0.80627147  0.51115557 -1.18258802]
 [-1.0535704  -0.86128581  0.68284052  2.20374577]]
b1 =  [[-0.04659241]
 [-1.28888275]
 [ 0.53405496]]
W2 =  [[-0.55569196  0.0354055   1.32964895]]
b2 =  [[-0.84610769]]



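## 7 - Conclusion

Congratulations on implementing all the functions required to build a deep neural network!

We know it was a long assignment, but it only gets better from here. The next part of the assignment is easier.

In the next assignment, you will put all of these together to build two models:
- A two-layer neural network
- An L-layer neural network

You will in fact use these models to classify cat vs. non-cat images!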