# 1. Setting up your Machine Learning Application

## 1.1 Train/dev/test Sets

**Definition:**

- Training set

- Hold-out/cross validation/development set
    
    - used to compare candidate models and pick the one that performs best
    
- Test set

**Strategy in Splitting Data**

- Several years ago: 70/30 (train/test) or 60/20/20 (train/dev/test).

- Big data era: 98/1/1 (i.e. 10k samples each for the dev and test sets are enough for a data set of over 1m samples).

**Guideline**

- Make sure the dev and test sets come from the same distribution.

- In some cases, it is acceptable to have only a dev set and no test set.
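
As a rough illustration of the big-data split, the sketch below shuffles sample indices and holds out 1% each for the dev and test sets. The function name `split_indices` and its parameters are hypothetical, chosen only for this example.

```python
import numpy as np

def split_indices(m, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle m sample indices and split them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    n_dev = int(round(m * dev_frac))
    n_test = int(round(m * test_frac))
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]
    return train_idx, dev_idx, test_idx

# 1m samples: 10k for dev, 10k for test, the rest for training
train_idx, dev_idx, test_idx = split_indices(1_000_000)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Because the dev and test indices are drawn from the same shuffled pool, the two sets come from the same distribution by construction.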


## 1.2 Bias/Variance

**Using Train & Dev Set Errors to Determine:**

- Assume that the human classification error / optimal error / Bayes error is $\approx 0\%$.

- **High Variance (overfitting)**: 1% train set error but 11% dev set error

- **High Bias (underfitting)**: 15% train set error but 16% dev set error

- **High Bias & Variance**: 15% train set error and 30% dev set error

- **Low Bias & Variance**: 0.5% train set error and 1% dev set error


- Example: 

![](./imgs/bias-variance.jpg)

![](./imgs/high-bv.jpg)

- High bias: most of the decision boundary is close to linear, so it underfits the data.
- High variance: parts of the boundary are too flexible and bend to fit individual examples.
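
The rules of thumb above can be sketched as a small helper. The function `diagnose` and its 5% `gap` threshold are assumptions made for this example; in practice the threshold depends on the problem and on the Bayes error.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, gap=0.05):
    """Label bias/variance from error rates given as fractions (0.15 = 15%)."""
    # Bias: how far the training error is from the optimal (Bayes) error.
    high_bias = (train_err - bayes_err) > gap
    # Variance: how much worse the dev error is than the training error.
    high_variance = (dev_err - train_err) > gap
    if high_bias and high_variance:
        return "high bias & high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias & low variance"

for train_err, dev_err in [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]:
    print(train_err, dev_err, "->", diagnose(train_err, dev_err))
```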

## 1.3 Basic Recipe for Machine Learning

**Select the most efficient way to reduce high bias or high variance.**

- First, check whether the bias is high (look at training set performance). If so, try:
    - a bigger network (more layers and hidden units)
    - training longer
    - searching for a better NN architecture


- Then, check whether the variance is high (look at dev set performance). If so, try:
    - more data
    - regularization
    - NN architecture search


**"Bias Variance Trade-off"**

- In the earlier era of machine learning, one could rarely reduce either bias or variance without noticeably hurting the other.

- In the modern era, the recipe above can drive one of them down without necessarily hurting the other too much:
    - regularization (reduces variance)
    - a bigger network (reduces bias)


# 2. Regularizing your Neural Network

## 2.1 Regularization

**$L_{2}$ Regularization:**

In logistic regression, we minimize the regularized cost $J(w,b)$, where

$$ J(w,b) = \frac{1}{m} \; \sum_{i=1}^{m} \; \mathcal{L}(\hat{y}^{(i)}, \; y^{(i)}) + \frac{\lambda}{2m} \; ||w||^{2}_{2} $$

$$ \frac{\lambda}{2m} \; ||w||^{2}_{2} = \frac{\lambda}{2m} \; \sum_{j=1}^{n_{x}} w_{j}^{2} = \frac{\lambda}{2m} \; w^{T}w$$

- $\lambda$ is the **regularization parameter**. It is tuned on the dev set in order to prevent overfitting.
    - In Python, `lambda` is a reserved keyword, so use another name, e.g. `lambd`, for the regularization parameter.

Why omit the bias term $b$: 

- $w \in \mathbb{R}^{n_{x}}$ is a high-dimensional parameter vector while $b \in \mathbb{R}$ is a single number, so in practice regularizing $b$ as well makes little difference.


**$L_{1}$ Regularization:**

$$ \frac{\lambda}{2m} \; \sum_{j=1}^{n_{x}} |w_{j}| = \frac{\lambda}{2m} \; ||w||_{1} $$

- $w$ ends up sparse (many entries are exactly zero), which can compress the model and save some memory. In practice, it does not help much.

**Regularization in Neural Network:**

$$ J(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]}) = \frac{1}{m} \; \sum_{i=1}^{m} \; \mathcal{L} (\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \; \sum_{l=1}^{L} ||W^{[l]}||^{2}_{F} $$

where the squared Frobenius norm is

$$ ||W^{[l]}||^{2}_{F} = \sum_{i=1}^{n^{[l]}} \; \sum_{j=1}^{n^{[l-1]}} \; (W^{[l]}_{ij})^{2} $$

and $W^{[l]}$ has shape $(n^{[l]}, \; n^{[l-1]})$.
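
A minimal sketch of the regularization term, assuming the weight matrices are available as a list of NumPy arrays. The helper name `l2_penalty` is hypothetical.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Return (lambd / (2m)) * sum over layers of the squared Frobenius norm."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

W1 = np.array([[1.0, 2.0], [3.0, 4.0]])   # squared Frobenius norm: 30
W2 = np.array([[0.5, -0.5]])              # squared Frobenius norm: 0.5
penalty = l2_penalty([W1, W2], lambd=0.1, m=10)
print(penalty)
```

The bias vectors are deliberately left out of the sum, matching the formula above.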


**Updates in Gradient Descent:**

$$ dW^{[l]} = (\text{from back propagation}) + \frac{\lambda}{m} \; W^{[l]} $$

Thus, 

$$ W^{[l]} = W^{[l]} - \alpha \; dW^{[l]} $$

$$ W^{[l]} = (1- \frac{\alpha \lambda}{m}) W^{[l]} - \alpha \; (\text{from back propagation}) $$

Since $W^{[l]}$ is multiplied by a factor slightly smaller than 1 in each iteration, $L_{2}$ regularization is also called **"Weight Decay"**. 
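
The two forms of the update can be checked numerically. This is only a sketch: `dW_backprop` is a random stand-in for the gradient that back propagation would produce, and the values of `alpha`, `lambd` and `m` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
dW_backprop = rng.standard_normal((3, 4))  # stand-in for the backprop gradient
alpha, lambd, m = 0.1, 0.7, 50

# Form 1: add the regularization gradient, then take a gradient step.
dW = dW_backprop + (lambd / m) * W
W_form1 = W - alpha * dW

# Form 2: shrink W by (1 - alpha * lambd / m), then apply the backprop step.
W_form2 = (1 - alpha * lambd / m) * W - alpha * dW_backprop

print(np.allclose(W_form1, W_form2))
```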


## 2.2 Why does Regularization Reduce Overfitting?

**Assume $\lambda$ is very large**: 

- The penalty pushes the weights toward zero, $W^{[l]} \approx 0$. To some extent, the impact of many hidden units is nearly zeroed out, and the big neural network behaves like a much simpler model, close to logistic regression, which has very low variance but high bias. 

- By tuning the regularization parameter, we control how much the parameters influence the model, and we can look for an intermediate value where bias and variance are both small.

![](./imgs/large-lambda.jpg)
    
**Another interpretation:**

- When $\lambda$ is very large, the weights $W^{[l]}$ are small, so $Z^{[l]}$ takes values in a small range around 0. With the $tanh()$ activation, this is the region where $tanh$ is very close to a linear function. Hence, every layer is roughly linear and the whole neural network is not far from a linear model, which limits the flexibility of the model and prevents overfitting.

    ![](./imgs/large-lambda-tanh.jpg)

## 2.3 Dropout Regularization

**Idea:**

In each iteration, for each layer and each training sample, we randomly eliminate hidden units with some probability and train on that sample using the resulting thinned network.

**Implementing Dropout ("Inverted Dropout"):**

In one iteration, for layer i: 

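```python
import numpy as np

keep_prob = 0.8  # probability of keeping each unit (about 20% are dropped)

# Ai holds the activations of layer i, computed earlier in forward propagation
di = np.random.rand(Ai.shape[0], Ai.shape[1]) < keep_prob  # boolean mask: True = keep

Ai = np.multiply(Ai, di)  # zero out the dropped units

Ai = Ai / keep_prob  # inverted dropout: rescale the kept units
```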

The last line ("inverted dropout")

- scales up the remaining activations so that, in forward propagation, the expected value of Z does not change
- also makes **the test process easier** because no extra scaling is needed at test time.
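
To see the effect of the rescaling, the sketch below applies a dropout mask to a hypothetical activation matrix of ones and compares the mean before and after dividing by `keep_prob`. The matrix size and the seed are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
keep_prob = 0.8
A = np.ones((1000, 1000))                 # hypothetical activations, all equal to 1
D = rng.random(A.shape) < keep_prob       # mask: True where the unit is kept
A_dropped = A * D                         # mean drops to roughly keep_prob
A_inverted = A_dropped / keep_prob        # mean returns to roughly 1

print(A_dropped.mean(), A_inverted.mean())
```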


**Backward propagation:** 
- You had previously shut down some neurons during forward propagation, by applying a mask $D^{[i]}$ to `Ai`. In backpropagation, you will have to shut down the same neurons, by reapplying the same mask $D^{[i]}$ to `dAi`. 

- During forward propagation, you had divided `Ai` by `keep_prob`. In backpropagation, you'll therefore have to divide `dAi` by `keep_prob` again (the calculus interpretation is that if $A^{[i]}$ is scaled by `1/keep_prob`, then its derivative $dA^{[i]}$ is also scaled by the same `1/keep_prob`).
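
The forward and backward steps can be sketched as a pair of functions that share the mask. The names `dropout_forward` and `dropout_backward` are hypothetical and not taken from any library.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to A and return the result with the mask used."""
    D = rng.random(A.shape) < keep_prob
    return A * D / keep_prob, D

def dropout_backward(dA, D, keep_prob):
    """Reapply the same mask and the same 1/keep_prob scaling to the gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 5))
A_drop, D = dropout_forward(A, 0.8, rng)

dA = np.ones_like(A)                      # stand-in for the upstream gradient
dA_drop = dropout_backward(dA, D, 0.8)
```

Units that were dropped in the forward pass receive zero gradient, and the kept units have their gradient scaled by `1/keep_prob`.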


**Strategy:**

- Different layers can have different keep-prob. Usually, the larger the weight matrix, the lower the keep-prob. By doing this, we can solve the issue where some layers are more likely to be overfitting than others. 

- For layers that have only 1 or 2 hidden units, the keep-prob can be 1. 

    ![](./imgs/diff-keepprob.jpg)

**Making predictions at test time:**

- No dropout, because we don't want the output to be random.

- Because of the **`/= keep_prob`** step during training, **no extra scaling is needed at test time** to bring the activations back to their expected range.


## 2.4 Why does Dropout Prevent Overfitting?

**Intuition:**

- **Each hidden unit cannot rely on any one feature, so it has to spread out its weights** across all of its inputs. As a result, the norm of the weights shrinks.

    - Shrinking the weight norm is similar to $L_{2}$ regularization.
    
    - Dropout can be considered an adaptive form of regularization.
    
**Downside:**

- The cost function $J$ is no longer well-defined, so it may not decrease monotonically when dropout is on. 
    - Solution: before turning dropout on, plot the cost function against the number of iterations to check that the cost decreases as the iterations increase. 

## 2.5 Other Regularization Methods

**Data Augmentation:**

- Add more synthetic training samples by transforming, rotating, or distorting images.

    ![](./imgs/data-aug.jpg)


**Early Stopping:**

- Stop training partway, at the iteration where the dev set error is at its minimum.

    - At that point the Frobenius norm of the weights is mid-sized: it is close to 0 when training starts and becomes large as the model overfits.
    

- Advantage: a single training run passes through small, mid-sized, and large weight norms, so there is no need to try many values of $\lambda$ as in $L_{2}$ regularization. 
    
- Downside: it tries to address high bias and high variance at the same time, so the two can no longer be handled independently, which makes the problem more complicated. 

    ![](./imgs/early-stopping.jpg)
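
A minimal sketch of choosing the stopping point from a recorded dev error curve. The `patience` parameter (how many checks without improvement to tolerate) is an assumption added for this example, and the error values are made up.

```python
def early_stopping_point(dev_errors, patience=3):
    """Return (index, value) of the best dev error seen before stopping."""
    best_idx, best_err = 0, float("inf")
    for i, err in enumerate(dev_errors):
        if err < best_err:
            best_idx, best_err = i, err
        elif i - best_idx >= patience:
            break  # no improvement for `patience` checks: stop training
    return best_idx, best_err

dev_errors = [0.30, 0.22, 0.17, 0.15, 0.16, 0.18, 0.21, 0.25]  # made-up values
best_idx, best_err = early_stopping_point(dev_errors)
print(best_idx, best_err)
```

In a real training loop one would also save the weights at the best index so that they can be restored after stopping.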

# 3. Setting up your Optimization Problem

## 3.1 Normalizing Inputs

**Scale the test set using the same parameters $\mu,\sigma^{2}$ computed from the train set so that both sets go through the same transformation!**

$$ X = \frac{X - \mu}{\sigma} $$

![](./imgs/normalize.jpeg)
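
A short sketch of this rule, assuming the data is stored with one feature per row and one sample per column. The arrays here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
X_train = rng.normal(loc=5.0, scale=2.0, size=(2, 1000))  # features x samples
X_test = rng.normal(loc=5.0, scale=2.0, size=(2, 200))

# Statistics are computed on the training set only.
mu = X_train.mean(axis=1, keepdims=True)
sigma = X_train.std(axis=1, keepdims=True)

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma   # reuse the training statistics
```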

**Why Normalize Inputs?**

- Without normalization, features on very different scales make the cost function elongated and irregular, so it is harder to optimize and the learning rate has to be set very small. 

    ![](./imgs/why-norm.jpg)


## 3.2 Vanishing/Exploding Gradients

**Problem:**

- In very deep neural networks, the gradients may become exponentially large or small, which is called exploding / vanishing gradients. 

- The reason is that if each weight matrix is a little larger or smaller than the identity matrix, the product over $L$ layers grows or shrinks exponentially with depth, e.g. on the order of $2^{L}$ or $\frac{1}{2^{L}}$. 

    ![](./imgs/vanishing-exploding.jpg)
    
- Solution: **carefully initialize parameters**    
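
The exponential growth and decay can be illustrated with a toy network that has linear activations and the same weight matrix `scale * I` in every layer. The scales 1.5 and 0.5 and the depth of 50 layers are arbitrary choices for this sketch.

```python
import numpy as np

def forward_linear(scale, L, n=2):
    """Push a vector of ones through L linear layers with W = scale * I."""
    a = np.ones((n, 1))
    W = scale * np.eye(n)
    for _ in range(L):
        a = W @ a
    return a

L = 50
grown = forward_linear(1.5, L)    # each entry equals 1.5 ** L
shrunk = forward_linear(0.5, L)   # each entry equals 0.5 ** L
print(grown[0, 0], shrunk[0, 0])
```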


## 3.3 Weight Initialization for Deep Networks

**Change Initial Variance:**

- Change the initial variance of the weight matrix from 1 to $\frac{1}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of inputs to layer $l$, so that the linear combination $Z=WA+b$ won't become too large. 

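```python
import numpy as np

# shape = (n[l], n[l-1]), where n[l] is the number of units in layer l
# before: small fixed scale
W[l] = np.random.randn(*shape) * 0.01

# now: variance 1 / n[l-1], with n[l-1] the number of inputs to layer l
W[l] = np.random.randn(*shape) * np.sqrt(1 / n[l-1])
```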

In practice: 

- $ReLU()$ works well when the variance is set to $\frac{2}{n^{[l-1]}}$, i.e. the weights are scaled by $\sqrt{\frac{2}{n^{[l-1]}}}$.

- $tanh()$ can use a scale of $\sqrt{\frac{1}{n^{[l-1]}}}$ or $\sqrt{\frac{2}{n^{[l-1]} + n^{[l]}}}$.

- [See more details on this](https://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization?source=post_page---------------------------) 


One can also treat this variance as a hyperparameter to tune in the model.
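
The scaling choices above can be collected in one helper. This is a sketch only: the function `init_weights` and the method labels `"he"`, `"inverse_fan_in"` and `"fan_avg"` are names invented for this example.

```python
import numpy as np

def init_weights(n_out, n_in, method="he", rng=None):
    """Draw an (n_out, n_in) weight matrix with the chosen standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    if method == "he":                 # variance 2 / n_in, suggested for ReLU
        std = np.sqrt(2.0 / n_in)
    elif method == "inverse_fan_in":   # variance 1 / n_in, an option for tanh
        std = np.sqrt(1.0 / n_in)
    elif method == "fan_avg":          # variance 2 / (n_in + n_out), an option for tanh
        std = np.sqrt(2.0 / (n_in + n_out))
    else:
        raise ValueError(f"unknown method: {method}")
    return rng.standard_normal((n_out, n_in)) * std

W = init_weights(256, 512, method="he", rng=np.random.default_rng(4))
print(W.shape, W.std())
```

Here `n_in` plays the role of $n^{[l-1]}$ and `n_out` the role of $n^{[l]}$.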


## 3.4 Numerical Approximation of Gradients

**Mathematical Idea:**

$$ f^{'}(\theta) = \lim_{\epsilon \rightarrow 0} \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon } $$


**In Practice:**

- Two-Side Gradient Checking:

$$ f^{'}(\theta) \approx \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon } $$

- One-Side Gradient Checking:

$$ f^{'}(\theta) \approx \frac{f(\theta + \epsilon) - f(\theta)}{\epsilon} $$


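**Why use the two-sided rather than the one-sided grad check?**

- The approximation error of the two-sided check is $O(\epsilon^2)$ while that of the one-sided check is $O(\epsilon)$. For a small $\epsilon$ such as 0.01, $\epsilon^{2} = 0.0001$ is much smaller than $\epsilon$, so the two-sided estimate is far more accurate.  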


- Example:

    ![](./imgs/gradient-check.jpg)
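
The difference in accuracy can be checked directly. The sketch below assumes the test function $f(\theta) = \theta^{3}$ at $\theta = 1$ with $\epsilon = 0.01$, where the exact derivative is 3.

```python
def f(theta):
    return theta ** 3

theta, eps = 1.0, 0.01
exact = 3 * theta ** 2   # analytic derivative of theta^3

two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)
one_sided = (f(theta + eps) - f(theta)) / eps

err_two = abs(two_sided - exact)   # about 0.0001, on the order of eps^2
err_one = abs(one_sided - exact)   # about 0.03, on the order of eps
print(err_two, err_one)
```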


## 3.5 Gradient Checking

**Implementing Grad Check:**

- Reshape all weights $W^{[l]}, b^{[l]}$ into a giant vector $\theta$ and all gradients $dW^{[l]}, db^{[l]}$ into another giant vector $d\theta$. Now the cost function looks like: 

$$ J(\theta) = J(\theta_{1}, ..., \theta_{L}) $$

- For each $i$, compute:

$$ d\theta_{approx}[i] = \frac{J(\theta_{1}, ..., \theta_{i} + \epsilon, ...) - J(\theta_{1}, ..., \theta_{i} - \epsilon, ...)}{2\epsilon} $$

Ideally, as said above: 

$$ d\theta_{approx}[i] \approx d\theta[i] $$

- Use a threshold to check the relative difference: 

    $$ \frac{ ||d\theta_{approx} - d\theta||_{2}}{||d\theta_{approx}||_{2} + ||d\theta||_{2}} $$

    - if the difference $\approx 10^{-7}$, it's great!
    - if the difference $\approx 10^{-5}$, need to check!
    - if the difference $\geq 10^{-3}$, it's wrong!
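
The procedure can be sketched on a toy cost function. Everything here is illustrative: `J` and `grad_J` are made-up stand-ins for the network cost and for the gradient from back propagation, and `eps=1e-7` is a choice for this example.

```python
import numpy as np

def J(theta):
    """Toy cost used only for this sketch."""
    return np.sum(theta ** 2) + np.sum(np.sin(theta))

def grad_J(theta):
    """Analytic gradient of the toy cost (plays the role of backprop)."""
    return 2 * theta + np.cos(theta)

def grad_check(J, grad, theta, eps=1e-7):
    """Return the relative difference between grad and a two-sided estimate."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - grad)
    den = np.linalg.norm(approx) + np.linalg.norm(grad)
    return num / den

theta = np.array([0.5, -1.2, 2.0, 0.1])
diff_ok = grad_check(J, grad_J(theta), theta)
diff_bug = grad_check(J, 2 * theta, theta)   # "buggy" gradient missing the cos term
print(diff_ok, diff_bug)
```

The correct gradient gives a very small relative difference, while the deliberately buggy one lands well above the $10^{-3}$ threshold.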
    


**Notes:**

- Do not use grad check in training; use it only to debug.

- If the algorithm fails grad check, look at the individual components to try to identify the bug. 

- Remember regularization: if the cost function includes a regularization term, the gradients must include it too.

- Grad check doesn't work with dropout, which randomly eliminates hidden units. 

- Run it at random initialization; perhaps run it again after some training, once the parameters have wandered away from 0. 


----------

# Quiz

1. You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)

    **Increase the regularization parameter lambda.**

**Why:**

- When lambda is increased, the cost function puts more weight on the norm of the weights, so minimizing the cost drives the weight norm toward 0. This is also known as "weight decay". 

- Once the weight norm is around 0, many hidden units have almost no impact, so the deep neural network behaves more like a simple logistic regression, which has very high bias but little flexibility (underfitting). 

- So, **to move from overfitting toward underfitting, one can just increase the regularization parameter lambda**. 


2. With the inverted dropout technique, at test time:

    **You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training.**
    
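**Why?**

- Inverted dropout already divides the activations by `keep_prob` during training, so their expected value is unchanged. At test time all units are kept and no extra scaling is needed.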


----------

# Resources

- TensorFlow and Deep Learning without a PhD: [Part_1](https://www.youtube.com/watch?v=u4alGiomYP4) and [Part_2](https://www.youtube.com/watch?v=fTUwdXUFfI8)

- deeplearning.ai: [initialization](http://www.deeplearning.ai/ai-notes/initialization/)

- CS231n: [Setting up the data and the model](http://cs231n.github.io/neural-networks-2/#reg)

# Assignment

## 1. Initialization

- Zero Initialization: 

    - In general, initializing all the weights to zero results in the network **failing to break symmetry**. This means that **every neuron in each layer will learn the same thing**, and you might as well be training a neural network with $n^{[l]}=1$ for every layer, and the network is no more powerful than a linear classifier such as logistic regression. 
    
    ![](./imgs/hw1-zero.png)
    
    
- Large Random Initialization: 
    
    - **Random initialization is used to break symmetry** and make sure different hidden units can learn different things
    - Poor initialization can lead to vanishing/exploding gradients, which also slows down the optimization algorithm. 
    - If you train this network longer you will see better results, but initializing with overly large random numbers slows down the optimization.
    
    ![](./imgs/hw1-large.png)
    

- He Initialization: 

    - He initialization works well for networks with ReLU activations.

    ![](./imgs/hw1-he.png)
    

- Accuracy: 

    ![](./imgs/hw1-acc.jpg)

    
**What you should remember from this notebook**:
- Different initializations lead to different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Don't initialize to values that are too large
- He initialization works well for networks with ReLU activations. 

## 2. Regularization

**Problem:**

- You have just been hired as an AI expert by the French Football Corporation. They would like you to recommend positions where France's goal keeper should kick the ball so that the French team's players can then hit it with their head.

    ![](./imgs/hw2-prob.png)
  
  
**Data:** 

![](./imgs/hw2-data.png)
    
- Each dot corresponds to a position on the football field where a football player has hit the ball with his/her head after the French goal keeper has shot the ball from the left side of the football field.
    - If the dot is blue, it means the French player managed to hit the ball with his/her head
    - If the dot is red, it means the other team's player hit the ball with their head

**Goal**: 

- Use a deep learning model to find the positions on the field where the goalkeeper should kick the ball.

**Non-regularized Model:**

- The non-regularized model is obviously overfitting the training set. It is fitting the noisy points! 

    ![](./imgs/hw2-non-reg.png)
    

**$L_{2}$ Regularization:**

$$J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost} $$

- No overfitting any more! 
- The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
- **L2 regularization makes your decision boundary smoother**. If $\lambda$ is too large, it is also possible to "oversmooth", resulting in a model with high bias (underfitting).

    ![](./imgs/hw2-l2.png)


**Dropout:**

- **It randomly shuts down some neurons in each iteration.**
- Backward propagation: 
    - You had previously shut down some neurons during forward propagation, by applying a mask $D^{[1]}$ to `A1`. In backpropagation, you will have to shut down the same neurons, by reapplying the same mask $D^{[1]}$ to `dA1`. 
    - During forward propagation, you had divided `A1` by `keep_prob`. In backpropagation, you'll therefore have to divide `dA1` by `keep_prob` again (the calculus interpretation is that if $A^{[1]}$ is scaled by `1/keep_prob`, then its derivative $dA^{[1]}$ is also scaled by the same `1/keep_prob`).

- Dropout works great! The test accuracy has increased again (to 95%)! Your model is not overfitting the training set and does a great job on the test set.
- A **common mistake** when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only in training. 
- Deep learning frameworks like [tensorflow](https://www.tensorflow.org/api_docs/python/tf/nn/dropout), [PaddlePaddle](http://doc.paddlepaddle.org/release_doc/0.9.0/doc/ui/api/trainer_config_helpers/attrs.html), [keras](https://keras.io/layers/core/#dropout) or [caffe](http://caffe.berkeleyvision.org/tutorial/layers/dropout.html) come with a dropout layer implementation. Don't stress - you will soon learn some of these frameworks.

    ![](./imgs/hw2-dropout.png)


**Accuracy:**

![](./imgs/hw2-acc.jpg)

    
**What you should remember about dropout:**
- Dropout is a regularization technique.
- You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob is other values than 0.5.  

## 3. Gradient Checking

$$ \frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} \tag{1}$$

$$ grad = \frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon} = gradapprox\tag{2}$$

To check: 

$$ difference = \frac {\mid\mid grad - gradapprox \mid\mid_2}{\mid\mid grad \mid\mid_2 + \mid\mid gradapprox \mid\mid_2} \tag{3}$$


![](./imgs/hw3-gc.png)


**Notes:**
- Gradient Checking is slow! Approximating the gradient with $\frac{\partial J}{\partial \theta} \approx  \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$ is computationally costly. For this reason, we don't run gradient checking at every iteration during training. **Just a few times to check if the gradient is correct**. 
- Gradient Checking, at least as we've presented it, doesn't work with dropout. You would usually run the gradient check algorithm without dropout to make sure your backprop is correct, then add dropout. 


    
**What you should remember from this notebook**:
- Gradient checking verifies closeness between the gradients from backpropagation and the numerical approximation of the gradient (computed using forward propagation).
- Gradient checking is slow, so we don't run it in every iteration of training. You would usually run it only to make sure your code is correct, then turn it off and use backprop for the actual learning process. 