# Deep Learning: An Overview


## Neural Networks and Deep Learning

### Commonly Seen Networks
1. Standard Neural Network
2. CNN (Convolutional Neural Network)
3. RNN (Recurrent Neural Network)
4. Custom Hybrid

### Logistic Regression  
Given $x$, want  $\hat{y} = P(y=1 \mid x)$, $x \in \Re^{n_x}$  
Parameters: $w \in \Re^{n_x}, b \in \Re$  
Output: $\hat{y} = \sigma (w^Tx+b)$  
where $\sigma(z) = \frac{1}{1+e^{-z}}$

Loss (Error) Function  
$L(\hat{y}, y) = -[y\log\hat{y}+(1-y)\log(1-\hat{y})]$  
Cost Function  
$J(w,b) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})]$
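
The cost above can be evaluated in a few vectorized lines. The following is a minimal sketch, assuming the examples are stacked as columns of `X` (shape `(n_x, m)`) and the labels form a row `Y` (shape `(1, m)`); the function names and the toy data are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """Average cross-entropy loss J(w, b) over m examples.

    Assumed shapes: X is (n_x, m), Y is (1, m), w is (n_x, 1), b is a scalar.
    """
    m = X.shape[1]
    Y_hat = sigmoid(w.T @ X + b)  # predictions, shape (1, m)
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return float(np.sum(losses) / m)

# Toy data (illustrative only): 2 features, 3 examples.
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.0, 2.0]])
Y = np.array([[1.0, 0.0, 1.0]])
w = np.zeros((2, 1))
b = 0.0

# With w = 0 and b = 0 every prediction is 0.5, so J equals log(2).
J = cost(w, b, X, Y)
```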

### GD (Gradient Descent)  
Repeat {  
    $w := w - \alpha\frac{\partial J(w,b)}{\partial w}$  
    $b := b - \alpha\frac{\partial J(w,b)}{\partial b}$  
}
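
To see the update rule in action, here is a small sketch on a one-dimensional toy cost $J(w) = (w-3)^2$, whose derivative is $2(w-3)$. The toy cost, the helper name `gradient_descent`, and the step size are assumptions chosen for illustration.

```python
def gradient_descent(grad, w0, alpha=0.1, num_iters=200):
    """Repeat w := w - alpha * dJ/dw for a fixed number of iterations."""
    w = w0
    for _ in range(num_iters):
        w = w - alpha * grad(w)
    return w

# Toy cost J(w) = (w - 3)^2 has derivative 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

With `alpha = 0.1` the distance to the minimum shrinks by a factor of 0.8 per step; a much larger `alpha` (above 1.0 for this cost) would make the iterates diverge.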

### Normalizing  
$x_{normalized} =  \frac{x}{\left \| x \right \|}$  
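
A short sketch of this normalization, assuming `x` is a 2-D array and each row is normalized by its own L2 norm (the row-wise choice and the name `normalize_rows` are assumptions):

```python
import numpy as np

def normalize_rows(x):
    """Divide each row of x by its L2 norm so every row has unit length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)  # shape (rows, 1)
    return x / norms  # broadcasting divides each row by its own norm

x = np.array([[0.0, 3.0, 4.0],
              [1.0, 6.0, 4.0]])
x_normalized = normalize_rows(x)
```

`keepdims=True` keeps the norms as a column, which is what lets broadcasting line up each norm with its row.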


### Softmax  
For $x \in \Re^{1 \times n}$, $softmax(x) = softmax([x_1, \dots ,x_n])$  
$= [\frac{e^{x_1}}{\sum_{j}e^{x_j}}, \dots ,\frac{e^{x_n}}{\sum_{j}e^{x_j}}]$ 
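
A possible NumPy version of the softmax formula, applied to each row of a 2-D array. Subtracting the row maximum before exponentiating is an addition not in the formula above; it leaves the result unchanged mathematically and only guards against overflow.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax: e^{x_i} / sum_j e^{x_j} for each row of x."""
    # Assumption: shift by the row max for numerical stability.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

s = softmax(np.array([[1.0, 2.0, 3.0],
                      [1000.0, 1000.0, 1000.0]]))
```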

### Useful `np` Functions  
```python
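import numpy as np

# Small example arrays so every call below can be run as-is
x = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([[0.5, 1.0], [1.5, 2.0]])

# All of these are vectorized (they operate on whole arrays)
# e^x, element-wise
np.exp(x)

# ||x||: L2 norm of each row, kept as a column of shape (2, 1)
np.linalg.norm(x, axis=1, keepdims=True)

# sum of each row, kept as a column of shape (2, 1)
np.sum(x, axis=1, keepdims=True)

# dot product (matrix product for 2-D arrays)
np.dot(x, y)

# outer product (inputs are flattened first)
np.outer(x, y)

# element-wise multiplication
np.multiply(x, y)

# reshape to (rows, cols); the total number of elements must match
np.reshape(x, (4, 1))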
```

### Common Steps for Pre-processing a New Dataset  
1. Figure out the dimensions and shapes of the problem (m_train, m_test, num_px, ...)
2. Reshape the datasets so that each example is a vector of size (num_px\*num_px\*3, 1)
3. Standardize the data. (For images, divide every pixel value by 255, the maximum value of a color channel.)
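
The three steps can be sketched as follows, assuming the raw images arrive as an array of shape `(m_train, num_px, num_px, 3)` with 8-bit pixel values. The random data and the variable names are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dataset: 5 random RGB "images" of 4 x 4 pixels.
m_train, num_px = 5, 4
train_x_orig = rng.integers(0, 256, size=(m_train, num_px, num_px, 3), dtype=np.uint8)

# Step 1: inspect the dimensions.
shape_info = train_x_orig.shape  # (m_train, num_px, num_px, 3)

# Step 2: flatten each image into one column of length num_px * num_px * 3.
train_x_flat = train_x_orig.reshape(m_train, -1).T  # (num_px*num_px*3, m_train)

# Step 3: scale pixel values from [0, 255] to [0, 1].
train_x = train_x_flat / 255.0
```

Reshaping to `(m_train, -1)` first and then transposing keeps each image's pixels together in one column; reshaping directly to `(-1, m_train)` would mix pixels from different images.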

### To Implement a NN:  
1. Initialize (w,b)
2. Optimize the loss iteratively to learn the parameters (w,b)   
    - computing the cost and its gradient  
    - updating the parameters using gradient descent  
3. Use the learned (w,b) to predict the labels for a given set of examples
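
The steps above can be sketched for logistic regression, the simplest such model. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`propagate`, `train`, `predict`), the zero initialization, the 0.5 decision threshold, and the toy data are all chosen here for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    """Compute the cost and its gradients for logistic regression."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)  # predictions, shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dw = X @ (A - Y).T / m    # gradient w.r.t. w, shape (n_x, 1)
    db = np.sum(A - Y) / m    # gradient w.r.t. b, scalar
    return cost, dw, db

def train(X, Y, alpha=0.1, num_iters=1000):
    """Step 1: initialize; step 2: repeat cost/gradient + update."""
    w = np.zeros((X.shape[0], 1))
    b = 0.0
    costs = []
    for _ in range(num_iters):
        cost, dw, db = propagate(w, b, X, Y)
        w = w - alpha * dw
        b = b - alpha * db
        costs.append(cost)
    return w, b, costs

def predict(w, b, X):
    """Step 3: label an example 1 when the predicted probability exceeds 0.5."""
    return (sigmoid(w.T @ X + b) > 0.5).astype(int)

# Toy data: one feature, label is 1 when the feature is positive.
X = np.array([[-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
Y = np.array([[0, 0, 0, 1, 1, 1]])

w, b, costs = train(X, Y)
preds = predict(w, b, X)
```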

### For Better Algorithm Efficiency and Accuracy  
1. Preprocessing the dataset is important
2. Implement each function separately, then combine them into a model
3. Tuning the learning rate (a hyper-parameter) can make a big difference to the algorithm

### Shallow Neural Network  
Repeat for each layer $i$ (with $A^{[0]} = X$) {  
$Z^{[i]} = W^{[i]}A^{[i-1]}+b^{[i]},$  
$A^{[i]} = \sigma(Z^{[i]})$  
}

### Activation Functions
1. sigmoid:  
$$
a(x) = \frac{1}{1+e^{-x}}
$$
2. tanh:  
$$
a(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}
$$
3. ReLU(Rectified Linear Unit):  
$$
    a(x) =max(0,x)
$$
4. Leaky ReLU:  
$$
a(x) = max(0.01*x,x)
$$
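
The four activations translate directly into NumPy. A small sketch (the `slope` parameter name is an assumption; 0.01 is the value used in the formula above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # For x < 0, slope * x is larger than x, so the maximum picks the small negative value.
    return np.maximum(slope * x, x)

x = np.array([-2.0, 0.0, 2.0])
```

Note the use of `np.maximum` (element-wise maximum of two arrays) rather than `np.max` (maximum over an axis).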

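### Why do we need a non-linear activation function?  
If every layer uses a linear activation function, the whole network computes only a linear function of its input, no matter how many layers it has, so the hidden layers add no expressive power.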
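
This can be checked numerically. The sketch below, with randomly generated weights chosen only for illustration, shows that two stacked linear layers give the same output as a single linear layer with $W = W^{[2]}W^{[1]}$ and $b = W^{[2]}b^{[1]} + b^{[2]}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with identity (linear) activation; sizes are arbitrary.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
X = rng.normal(size=(3, 5))  # 5 examples with 3 features each

two_layer = W2 @ (W1 @ X + b1) + b2

# Equivalent single linear layer.
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ X + b_eq
```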

### Derivatives of activation functions
1. sigmoid:  
$$
g'(z)=g(z)[1-g(z)]
$$
2. tanh:
$$
g'(z)=1-[\tanh(z)]^2
$$
3. ReLU:  
$$
g'(z) = 
\left\{\begin{matrix}
0 & ,if\ z < 0 \\ 
1 & ,if\ z >0 \\
undefined & ,if\ z=0 
\end{matrix}\right.
$$
4. Leaky ReLU:
$$
g'(z) = 
\left\{\begin{matrix}
0.01 & ,if\ z < 0 \\ 
1 & ,if\ z >0 \\
undefined & ,if\ z=0 
\end{matrix}\right.
$$
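
One way to sanity-check these derivative formulas is to compare them with a central finite difference. The sketch below does that at a few points away from $z = 0$. Returning 0 (ReLU) or the slope (Leaky ReLU) at exactly $z = 0$ is a common convention adopted here as an assumption, since the derivative is undefined there.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # Convention (assumption): use 0 at z == 0.
    return (z > 0).astype(float)

def d_leaky_relu(z, slope=0.01):
    # Convention (assumption): use the negative-side slope at z == 0.
    return np.where(z > 0, 1.0, slope)

def numerical_derivative(f, z, eps=1e-6):
    """Central difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = np.array([-1.5, -0.3, 0.7, 2.0])  # test points that avoid z = 0

relu = lambda t: np.maximum(0.0, t)
leaky_relu = lambda t: np.maximum(0.01 * t, t)
```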

### General GD Algorithm
1. Forward Propagation (with $A^{[0]} = X$):
$$
Z^{[i]} = W^{[i]}A^{[i-1]} + b^{[i]},\\  
A^{[i]} = g^{[i]}(Z^{[i]})
$$
2. Backward Propagation ($*$ is the element-wise product):  
$$
dZ^{[i]} = W^{[i+1]T}dZ^{[i+1]} * g'^{[i]}(Z^{[i]})\\  
dW^{[i]} = \frac{1}{m}dZ^{[i]}A^{[i-1]T}\\ 
db^{[i]} = \frac{1}{m}\sum dZ^{[i]}
$$
3. Update Weights:
$$
W^{[i]} = W^{[i]} - \alpha * dW^{[i]}\\
b^{[i]} = b^{[i]} - \alpha * db^{[i]}
$$
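
The three steps can be put together for a network with one hidden layer. The following is a sketch under several assumptions that are not fixed by the notes: a tanh hidden layer, a sigmoid output with cross-entropy loss (which gives $dZ^{[2]} = A^{[2]} - Y$ for the last layer), small random initialization, and made-up toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward propagation: tanh hidden layer, sigmoid output layer."""
    Z1 = params["W1"] @ X + params["b1"]
    A1 = np.tanh(Z1)
    Z2 = params["W2"] @ A1 + params["b2"]
    A2 = sigmoid(Z2)
    return A1, A2

def cost(Y, A2):
    """Average cross-entropy loss."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

def backward(X, Y, params, A1, A2):
    """Backward propagation following the formulas above."""
    m = X.shape[1]
    dZ2 = A2 - Y  # sigmoid output combined with cross-entropy loss
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (params["W2"].T @ dZ2) * (1.0 - A1 ** 2)  # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

# Toy data (illustrative): label is 1 when the two features share a sign.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 8))
Y = (X[0:1] * X[1:2] > 0).astype(float)

params = {
    "W1": rng.normal(size=(4, 2)) * 0.5, "b1": np.zeros((4, 1)),
    "W2": rng.normal(size=(1, 4)) * 0.5, "b2": np.zeros((1, 1)),
}

alpha = 0.3
costs = []
for _ in range(300):
    A1, A2 = forward(X, params)
    costs.append(cost(Y, A2))
    grads = backward(X, Y, params, A1, A2)
    for key in params:  # update step: parameter := parameter - alpha * gradient
        params[key] = params[key] - alpha * grads[key]
```

Using `keepdims=True` in the bias gradients keeps them as column vectors, so they match the shapes of `b1` and `b2` during the update.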



## Improving Deep NN - Hyper-parameter Tuning, Regularization and Optimization  

### Setting Up Machine Learning Applications
#### Hyper-parameters
1. \# layers
2. \# hidden units
3. learning rate
4. activation functions

| Condition | Train Set (%) | Dev Set (Cross Validation) (%) | Test Set (Optional) (%) |
| :------------- | :------------- | :------------- | :------------- |
| Small amount of data | 60 | 20 | 20 |
| Big data | 98 | 1 | 1 |

#### Make sure the Dev and Test sets come from the same distribution.
The test set estimates how the model will perform on the target data, so the dev set used to tune the model must come from the same distribution as the test set.

#### Bias And Variance
- Bias is judged by the gap between train set performance and the baseline (generally human-level performance).
- Variance is judged by the gap between dev set performance and train set performance.

#### Solutions for High Bias and High Variance
1. High Bias:
    - Bigger Network
    - Train Longer
    - NN architecture research
2. High Variance:
    - More Data
    - Regularization
    - NN architecture research

### Regularization
- $L_2$ Regularization (squared $L_2$ norm of $w$):
$$
\frac{\lambda}{2m}\left \| w \right \|^{2}_{2} = \frac{\lambda}{2m}\sum_{j = 1}^{n_x} w_j^2 = \frac{\lambda}{2m}w^Tw
$$

- $L_1$ Regularization: 
$$
\frac{\lambda}{2m}\sum_{j = 1}^{n_x}|w_j|=\frac{\lambda}{2m}\left \| w \right \|_1
$$
- Frobenius Norm (for the weight matrix $W^{[l]}$ of layer $l$, of shape $(n^{[l]}, n^{[l-1]})$)
$$
\left \| W^{[l]} \right \|^2_F = \sum_{i = 1}^{n^{[l]}}\sum_{j = 1}^{n^{[l-1]}}(w_{ij}^{[l]})^2
$$
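
As a rough illustration, the penalty terms can be computed over a list of weight matrices and added to the unregularized cost. The helper names, the value of `lambd` (spelled this way because `lambda` is a Python keyword), and the small weight matrices are assumptions made for the example.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum of squared Frobenius norms of the weight matrices."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l1_penalty(weights, lambd, m):
    """(lambda / 2m) * sum of absolute values of all weights."""
    return (lambd / (2 * m)) * sum(np.sum(np.abs(W)) for W in weights)

# Illustrative weights for a tiny two-layer network.
W1 = np.array([[1.0, -2.0],
               [0.0, 3.0]])
W2 = np.array([[2.0, -1.0]])

# Squared entries sum to 14 + 5 = 19; absolute values sum to 6 + 3 = 9.
l2 = l2_penalty([W1, W2], lambd=0.1, m=10)
l1 = l1_penalty([W1, W2], lambd=0.1, m=10)
```

For the squared penalty, the gradient of each weight matrix gains an extra term $\frac{\lambda}{m}W^{[l]}$, which is why L2 regularization is also called weight decay.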

#### How does regularization prevent overfitting?
The penalty weakens the influence of some hidden units, which makes the network effectively simpler and reduces overfitting.  
Because the regularization term pushes the weights closer to zero, $z$ stays small, where activations such as tanh are nearly linear, so the learned function is closer to a linear function.

### Dropout Regularization
#### What's a dropout?
- Randomly eliminating some hidden units during training
- No dropout when making predictions 

#### Why does dropout work?
- A unit can't rely on any one input feature, because any input may be dropped at random, so the weights get spread out across inputs.  
- Choose a suitable `keep_prob` value for each layer (dropout is usually not applied to the input layer).
- If the model is not overfitting, there is no need for dropout.
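
A minimal sketch of dropout on one layer's activations. It uses the "inverted dropout" variant, which divides the surviving activations by `keep_prob` so that their expected value is unchanged; that scaling step, the function name, and the `training` flag are assumptions for illustration rather than something stated in the notes.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng, training=True):
    """Apply inverted dropout to activations A.

    Each unit is kept with probability keep_prob; kept values are scaled by
    1 / keep_prob. At prediction time (training=False) A is returned as-is.
    """
    if not training:
        return A
    mask = rng.random(A.shape) < keep_prob  # True where the unit is kept
    return A * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((50, 1000))  # placeholder activations

A_train = dropout_forward(A, keep_prob=0.8, rng=rng)
A_test = dropout_forward(A, keep_prob=0.8, rng=rng, training=False)
```

Because of the scaling, no extra correction is needed when dropout is switched off for prediction.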

### Other Regularization Methods
- Data Augmentation
    - flipping horizontally
    - random distortion/zooming
- Early Stopping

### Orthogonalization 
1. Optimize the cost function:
    - Gradient descent
    - Momentum
    - RMSprop
    - Adam
2. Avoid overfitting:
    - Regularization
    - Getting more data
3. Early stopping couples the two tasks above, so they can no longer be handled independently.

4. High bias $\to$ high error rate (underfitting); high variance $\to$ overfitting.
