# Cost Function

## Cost Function and Backpropagation

### Cost Function

L = total number of layers in the network  
$s_l$ = number of units (**not counting bias unit**) in layer l  
K = number of output units/classes

![image.png](https://i.loli.net/2020/03/01/6BteYA3CuSDsXRM.png)

We denote $h_{\Theta}(x)_{k}$ as being a hypothesis that results in the $k^{th}$ output. Our cost function for neural networks is going to be a generalization of the one we used for logistic regression:

$J(\Theta)=-\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}\left[y_{k}^{(i)} \log \left(\left(h_{\Theta}\left(x^{(i)}\right)\right)_{k}\right)+\left(1-y_{k}^{(i)}\right) \log \left(1-\left(h_{\Theta}\left(x^{(i)}\right)\right)_{k}\right)\right]+\frac{\lambda}{2 m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_{l}} \sum_{j=1}^{s_{l+1}}\left(\Theta_{j, i}^{(l)}\right)^{2}$

Note:
- the double sum simply adds up the logistic regression costs calculated for each unit in the output layer
- the triple sum simply adds up the squares of all the individual Θs in the entire network, excluding the bias weights (the sum over $i$ starts at 1, so column 0 is skipped)
- the i in the triple sum does **not** refer to training example i
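
As a rough illustration of the formula above, here is a minimal sketch in plain Python. The names `nn_cost`, `H`, `Y`, `thetas` and `lam` are hypothetical; it assumes the hypothesis outputs have already been computed by forward propagation and that column 0 of each Θ matrix holds the bias weights.

```python
import math

def nn_cost(H, Y, thetas, lam):
    """Regularized neural-network cost (a sketch, not a reference implementation).

    H[i][k]  : hypothesis output k for training example i, strictly in (0, 1)
    Y[i][k]  : label for output k of example i (0 or 1)
    thetas   : list of weight matrices, each a list of rows; column 0 = bias
    lam      : regularization strength lambda
    """
    m = len(Y)

    # Double sum: logistic cost over every example and every output unit.
    data_term = 0.0
    for h_row, y_row in zip(H, Y):
        for h, y in zip(h_row, y_row):
            data_term += y * math.log(h) + (1 - y) * math.log(1 - h)

    # Triple sum: squares of all non-bias weights in every layer.
    reg_term = 0.0
    for theta in thetas:
        for row in theta:
            reg_term += sum(w * w for w in row[1:])  # row[0] is the bias weight

    return -data_term / m + lam / (2 * m) * reg_term
```

With `lam = 0` the second term vanishes and only the summed logistic costs remain.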

### Backpropagation Algorithm

"Backpropagation" is neural-network terminology for computing the partial derivatives of our cost function, which we then use to minimize it, just like what we were doing with gradient descent in logistic and linear regression.

![image.png](https://i.loli.net/2020/03/01/9mrCgqwX46W5ayc.png)

$D_{i, j}^{(l)}:=\frac{1}{m}\left(\Delta_{i, j}^{(l)}+\lambda \Theta_{i, j}^{(l)}\right), \text{ if } j \neq 0$  
$D_{i, j}^{(l)}:=\frac{1}{m} \Delta_{i, j}^{(l)}, \text{ if } j=0$

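$l$ is the index of the layer currently being computed.  
$j$ is the index of the activation unit in the current layer, which is also the index of the $j$-th input to the next layer.  
$i$ is the index of the error unit in the next layer, that is, the error unit affected by row $i$ of the weight matrix.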

The capital-delta matrix $\Delta$ is used as an "accumulator" to add up our values as we go along. Dividing it by $m$ (and adding the regularization term for $j \neq 0$) gives $D$, which is our partial derivative. Thus we get $\frac{\partial}{\partial \Theta_{i j}^{(l)}} J(\Theta)=D_{i j}^{(l)}$
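
The accumulate-then-average step can be sketched as follows. The function names are hypothetical, and the per-example update $\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$ is assumed to be the one shown in the figure above.

```python
def accumulate(Delta, delta_next, a):
    """Add one training example's contribution to the accumulator in place.

    Delta[i][j] += delta_next[i] * a[j], where a includes the bias unit a[0] = 1.
    """
    for i in range(len(Delta)):
        for j in range(len(Delta[i])):
            Delta[i][j] += delta_next[i] * a[j]

def gradient_from_accumulator(Delta, Theta, m, lam):
    """Turn the accumulator into D; column 0 (bias) is not regularized."""
    D = []
    for i, row in enumerate(Delta):
        D.append([
            (row[j] + (lam * Theta[i][j] if j != 0 else 0.0)) / m
            for j in range(len(row))
        ])
    return D
```

In a full implementation `accumulate` would be called once per training example for every layer, and `gradient_from_accumulator` once per layer after the loop.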

### Backpropagation Intuition

![image.png](https://i.loli.net/2020/03/01/tF6fm2r3HnqLWiT.png)

In the image above, to calculate $\delta_{2}^{(2)},$ we multiply the weights $\Theta_{12}^{(2)}$ and $\Theta_{22}^{(2)}$ by their respective $\delta$ values found to the right of each edge. So we get $\delta_{2}^{(2)}=\Theta_{12}^{(2)} * \delta_{1}^{(3)}+\Theta_{22}^{(2)} * \delta_{2}^{(3)} .$ To calculate every single possible $\delta_{j}^{(l)},$ we could start from the right of our diagram. We can think of our edges as our $\Theta_{i j} .$ Going from right to left, to calculate the value of $\delta_{j}^{(l)},$ you can just take the overall sum of each weight times the $\delta$ it is coming from. Hence, another example would be $\delta_{2}^{(3)}=\Theta_{12}^{(3)} * \delta_{1}^{(4)}$
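
The "sum of each weight times the $\delta$ it is coming from" rule can be sketched in a few lines. The function name is hypothetical, and this follows the simplified form used in the text: the full algorithm also multiplies each result by the activation derivative $a_j^{(l)}(1-a_j^{(l)})$, which is omitted here.

```python
def backward_delta(theta, delta_next):
    """Propagate error terms one layer to the left (simplified, no g'(z) factor).

    theta[i][j] : weight on the edge from unit j in layer l to unit i in layer l+1
    delta_next  : error terms of layer l+1, one per row of theta
    Returns one value per column of theta, including the bias column 0.
    """
    n_units = len(theta[0])
    return [
        sum(theta[i][j] * delta_next[i] for i in range(len(theta)))
        for j in range(n_units)
    ]

# Example: two units in the next layer, bias plus two units in this layer.
theta = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6]]
delta_next = [1.0, 2.0]
delta = backward_delta(theta, delta_next)
# delta[2] plays the role of delta_2 = Theta_12 * delta_1 + Theta_22 * delta_2
```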

## Backpropagation and Practice

### Implementation Note: Unrolling Parameters

![image.png](https://i.loli.net/2020/03/01/la6UQBDtJs471OM.png)
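
A minimal sketch of the idea, assuming the goal is to pass all weight matrices to an optimizer as one flat vector and to rebuild them afterwards. The helper names are hypothetical, and the row-by-row ordering is a choice made for this sketch (Octave's `(:)` operator, for instance, goes column by column).

```python
def unroll(matrices):
    """Flatten a list of matrices (lists of rows) into one vector, row by row."""
    return [w for matrix in matrices for row in matrix for w in row]

def reshape(vector, shapes):
    """Rebuild matrices with the given (rows, cols) shapes from a flat vector."""
    matrices, start = [], 0
    for rows, cols in shapes:
        matrices.append([
            vector[start + r * cols: start + (r + 1) * cols]
            for r in range(rows)
        ])
        start += rows * cols
    return matrices

theta1 = [[1, 2, 3], [4, 5, 6]]   # 2 x 3
theta2 = [[7, 8, 9]]              # 1 x 3
flat = unroll([theta1, theta2])
restored = reshape(flat, [(2, 3), (1, 3)])
```

Whatever ordering is used, `unroll` and `reshape` must agree with each other so that the round trip restores the original matrices.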

### Gradient Checking

Gradient checking helps verify that our backpropagation works as intended. We can approximate the derivative of our cost function with:

$\frac{\partial}{\partial \Theta} J(\Theta) \approx \frac{J(\Theta+\epsilon)-J(\Theta-\epsilon)}{2 \epsilon}$

With multiple theta matrices, we can approximate the derivative with respect to $\Theta_j$ as follows:

$\frac{\partial}{\partial \Theta_{j}} J(\Theta) \approx \frac{J\left(\Theta_{1}, \ldots, \Theta_{j}+\epsilon, \ldots, \Theta_{n}\right)-J\left(\Theta_{1}, \ldots, \Theta_{j}-\epsilon, \ldots, \Theta_{n}\right)}{2 \epsilon}$

A small value for $\epsilon$ (epsilon), such as $\epsilon=10^{-4}$, usually gives a good approximation. If the value for $\epsilon$ is too small, we can end up with numerical problems.

![image.png](https://i.loli.net/2020/03/01/rkvupRXyCD2LdeU.png)

Once you have verified **once** that your backpropagation algorithm is correct, you don't need to compute gradApprox again. The code to compute gradApprox can be very slow.
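
The two-sided difference above can be sketched as follows, assuming the parameters are given as one flat list and `J` is any function of that list. The names `numerical_gradient` and `J` are hypothetical.

```python
def numerical_gradient(J, theta, epsilon=1e-4):
    """Approximate dJ/dtheta_j for every j with a two-sided difference."""
    grad_approx = []
    for j in range(len(theta)):
        theta_plus = list(theta)
        theta_minus = list(theta)
        theta_plus[j] += epsilon
        theta_minus[j] -= epsilon
        grad_approx.append((J(theta_plus) - J(theta_minus)) / (2 * epsilon))
    return grad_approx

# Toy cost with a known gradient [2 * t0, 3], used only to exercise the check.
def toy_cost(t):
    return t[0] ** 2 + 3 * t[1]

approx = numerical_gradient(toy_cost, [2.0, 5.0])
```

In practice `approx` would be compared against the unrolled gradient from backpropagation; the two should agree to several decimal places. Note that `J` is evaluated twice per parameter, which is why this check is slow for large networks.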

### Random Initialization

Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. Instead we can randomly initialize our weights for our $\Theta$ matrices using the following method:

![image.png](https://i.loli.net/2020/03/01/xmCwNopjERdZtIk.png)

Hence, we initialize each $\Theta_{i j}^{(l)}$ to a random value in $[-\epsilon, \epsilon] .$ The formula above (scaling a uniform random number in $[0, 1]$ by $2\epsilon$ and subtracting $\epsilon$) guarantees that we get the desired bound. The same procedure applies to all the $\Theta$'s.

>Note: the epsilon used above is unrelated to the epsilon from Gradient Checking
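
A minimal sketch of this initialization in plain Python. The function name, the default `init_epsilon=0.12` and the layer sizes in the example are assumptions made for illustration, not values taken from these notes.

```python
import random

def rand_initialize_weights(n_out, n_in, init_epsilon=0.12, rng=random):
    """Return an n_out x (n_in + 1) matrix with entries in [-eps, eps).

    The extra column holds the weights for the bias unit.
    rng.random() is uniform in [0, 1), so r * 2 * eps - eps is in [-eps, eps).
    """
    return [
        [rng.random() * 2 * init_epsilon - init_epsilon for _ in range(n_in + 1)]
        for _ in range(n_out)
    ]

# Example: weights mapping a layer with 10 units to a layer with 5 units.
theta1 = rand_initialize_weights(5, 10)
```

Because every entry is drawn independently, the hidden units start with different weights and no longer receive identical updates.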

### Putting it Together

First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have.

- Number of input units = dimension of features $x^{(i)}$
- Number of output units = number of classes
- Number of hidden units per layer = usually the more the better (must be balanced against the cost of computation, which increases with more hidden units)
- Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.

#### Training a Neural Network

1. Randomly initialize the weights
2. Implement forward propagation to get $h_{\Theta}\left(x^{(i)}\right)$ for any $x^{(i)}$
3. Implement the cost function
4. Implement backpropagation to compute partial derivatives
5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
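
The six steps above can be sketched end to end for a tiny network with one hidden layer. This is an illustration under several assumptions: no regularization ($\lambda = 0$), sigmoid activations, plain batch gradient descent, a toy AND dataset, and hypothetical names and hyperparameters (`hidden`, `alpha`, `iters`, `seed`) throughout.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta1, theta2, x):
    """Step 2: forward propagation; returns activations of all three layers."""
    a1 = [1.0] + list(x)  # prepend bias unit
    a2 = [1.0] + [sigmoid(sum(w * a for w, a in zip(row, a1))) for row in theta1]
    a3 = [sigmoid(sum(w * a for w, a in zip(row, a2))) for row in theta2]
    return a1, a2, a3

def cost(theta1, theta2, X, Y):
    """Step 3: unregularized cost."""
    total = 0.0
    for x, y in zip(X, Y):
        h = forward(theta1, theta2, x)[2]
        total += sum(yk * math.log(hk) + (1 - yk) * math.log(1 - hk)
                     for hk, yk in zip(h, y))
    return -total / len(X)

def backprop(theta1, theta2, X, Y):
    """Step 4: partial derivatives D1, D2 averaged over the training set."""
    m = len(X)
    D1 = [[0.0] * len(row) for row in theta1]
    D2 = [[0.0] * len(row) for row in theta2]
    for x, y in zip(X, Y):
        a1, a2, a3 = forward(theta1, theta2, x)
        d3 = [hk - yk for hk, yk in zip(a3, y)]
        # Hidden-layer errors (bias unit skipped): (Theta2^T d3) * a2 * (1 - a2)
        d2 = [sum(theta2[i][j] * d3[i] for i in range(len(d3))) * a2[j] * (1 - a2[j])
              for j in range(1, len(a2))]
        for i in range(len(D2)):
            for j in range(len(a2)):
                D2[i][j] += d3[i] * a2[j] / m
        for i in range(len(D1)):
            for j in range(len(a1)):
                D1[i][j] += d2[i] * a1[j] / m
    return D1, D2

def numeric_partial(theta1, theta2, X, Y, i, j, eps=1e-4):
    """Step 5: two-sided difference for one entry of theta1."""
    original = theta1[i][j]
    theta1[i][j] = original + eps
    plus = cost(theta1, theta2, X, Y)
    theta1[i][j] = original - eps
    minus = cost(theta1, theta2, X, Y)
    theta1[i][j] = original
    return (plus - minus) / (2 * eps)

def train(X, Y, hidden=2, alpha=1.0, iters=500, seed=0):
    rng = random.Random(seed)
    eps = 0.12  # assumed initialization range
    # Step 1: random initialization.
    theta1 = [[rng.uniform(-eps, eps) for _ in range(len(X[0]) + 1)]
              for _ in range(hidden)]
    theta2 = [[rng.uniform(-eps, eps) for _ in range(hidden + 1)]
              for _ in range(len(Y[0]))]
    initial_cost = cost(theta1, theta2, X, Y)
    # Step 5: check one gradient entry once, then never again.
    D1, _ = backprop(theta1, theta2, X, Y)
    assert abs(D1[0][1] - numeric_partial(theta1, theta2, X, Y, 0, 1)) < 1e-6
    # Step 6: batch gradient descent.
    for _ in range(iters):
        D1, D2 = backprop(theta1, theta2, X, Y)
        theta1 = [[w - alpha * g for w, g in zip(rw, rg)] for rw, rg in zip(theta1, D1)]
        theta2 = [[w - alpha * g for w, g in zip(rw, rg)] for rw, rg in zip(theta2, D2)]
    return theta1, theta2, initial_cost, cost(theta1, theta2, X, Y)

# Toy data: logical AND of two binary inputs.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0], [0], [0], [1]]
theta1, theta2, initial_cost, final_cost = train(X, Y)
```

The sketch checks only a single gradient entry to stay short; a thorough check would compare every entry of both matrices before turning the check off.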

The following image gives us an intuition of what is happening as we are implementing our neural network:
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/hGk18LsaEea7TQ6MHcgMPA_8de173808f362583eb39cdd0c89ef43e_Screen-Shot-2016-12-05-at-10.40.35-AM.png?expiry=1579824000000&hmac=6ntXAW0gBobLIEQUDoaT6emueXFQ5OMaNf5MjGV-csM)