# Machine Learning Specialization
## Advanced Learning Algorithms
# Week 6  
  
## Terminology

| Term              | Explanation            |
|-------------------|------------------------|
| Diagnostic | A test you can run to gain insight into what is or isn't working with a learning algorithm |
| Training data | Data used to train (fit) your model |
| Test data | Data held out from training, used to test how well the final model works on new data |
| Cross validation data / validation set / development set / dev set | Data held out from training, used to compare candidate models and choose between them |
| $m_{train}$ | Number of training examples; the subscript emphasizes that you are using training data |
| $m_{test}$ | Number of test examples; the subscript emphasizes that you are using test data |
| $m_{cv}$ | Number of cross validation examples; the subscript emphasizes that you are using cross validation data |
| Generalization error | A number that tells you how large the error is on new, real-world data |
| High bias | A model that underfits |
| High variance | A model that overfits |
| Baseline level of performance | The level of error you can reasonably hope to reach |
| Synthetic data | Data you create from scratch rather than collect from the real world |


# How to run diagnostics
## Debugging a learning algorithm
You could take the following steps to debug a linear regression model that gives unacceptably large errors in its predictions:  
1. Get more training examples
2. Try smaller sets of features
3. Try getting additional features
4. Try adding polynomial features ($x_1^2, x_2^2, x_1x_2$, etc.)  
5. Try decreasing $\lambda$
6. Try increasing $\lambda$

However, it's important to figure out which of these changes will actually help for your data. For example, you could spend weeks collecting more data, only to find that more data was not what the model needed. That is why we need to be able to run diagnostics.

## How to evaluate a model

### Testing for linear regression (with squared error cost)  
When we think of a model that is overly complex like this one:  
![](https://github.com/DouweHorsthuis/machine-learning-cousera/blob/main/images/overfit.PNG?raw=true)  
We know that $f_{\overrightarrow{w},b}(x)=w_1x+w_2x^2+w_3x^3+w_4x^4+b$ fits the training data very well, but it is highly unlikely to work for new data, and thus will give poor predictions. With a single feature you can plot the model and see this. With more features that stops working: with 3 features you would need a 4 dimensional plot, and even if you could draw it, it would be hard to see whether new data fits well.  
A way to deal with this is to split your data into 2 subsets (for example 70%/30% or 80%/20%). With a 70%/30% split, you use 70% of the data as your training set and the remaining 30% as your test set.  
You first **fit the parameters by minimizing the cost function**  $$\min_{\overrightarrow{w},b} J(\overrightarrow{w},b)=\min_{\overrightarrow{w},b} \left[ \frac{1}{2m_{train}} \sum_{i=1}^{m_{train}} \left(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})-y^{(i)}\right)^2+\frac{\lambda}{2m_{train}}\sum_{j=1}^{n} w^2_j\right]$$   
After that you **compute the training error**  
$$J_{train}(\overrightarrow{w},b)=\frac{1}{2m_{train}}\left[ \sum_{i=1}^{m_{train}}\left(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}_{train})-y^{(i)}_{train}\right)^2\right]$$  
Then you **compute the test error**  
$$J_{test}(\overrightarrow{w},b)=\frac{1}{2m_{test}}\left[ \sum_{i=1}^{m_{test}}\left(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}_{test})-y^{(i)}_{test}\right)^2\right]$$  
  
In the example above, you should get a very low training error, since the model fits the training data almost perfectly. The test error, however, should be high, because the model does not generalize to data it has not seen.
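
The split-fit-evaluate steps above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the course material: the data is synthetic, the model is an unregularized degree-4 polynomial fitted with `np.polyfit`, and `squared_error_cost` is a helper name made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data (an assumption for illustration): a noisy quadratic trend.
x = rng.uniform(-1.0, 1.0, size=40)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.3, size=x.shape)

# Shuffle the indices, then split 70% / 30%.
idx = rng.permutation(len(x))
n_train = int(0.7 * len(x))
train_idx, test_idx = idx[:n_train], idx[n_train:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]


def squared_error_cost(coeffs, x, y):
    """J = 1/(2m) * sum((f(x) - y)^2), without the regularization term."""
    residual = np.polyval(coeffs, x) - y
    return float(np.sum(residual**2) / (2 * len(x)))


# Fit the degree-4 polynomial on the training set only.
coeffs = np.polyfit(x_train, y_train, deg=4)

j_train = squared_error_cost(coeffs, x_train, y_train)
j_test = squared_error_cost(coeffs, x_test, y_test)
print(f"J_train = {j_train:.4f}, J_test = {j_test:.4f}")
```

Note that neither error includes the regularization term: regularization is only part of the cost that is minimized during fitting.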

### Testing for classification problem  
In essence you do the same as for linear regression, but with the cost function for a classification problem. This means you **fit the parameters by minimizing the cost function**  $$J(\overrightarrow{w},b)=-\frac{1}{m} \sum_{i=1}^{m} \left[y^{(i)}\log\left(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})\right)+(1-y^{(i)})\log\left(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})\right) \right] +\frac{\lambda}{2m}\sum_{j=1}^nw^2_j$$  
After that you **compute the training error**  
$$J_{train}(\overrightarrow{w},b)=-\frac{1}{m_{train}} \sum_{i=1}^{m_{train}} \left[y^{(i)}_{train}\log\left(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}_{train})\right)+(1-y^{(i)}_{train})\log\left(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}_{train})\right) \right] $$  
Then you **compute the test error**  
$$J_{test}(\overrightarrow{w},b)=-\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left[y^{(i)}_{test}\log\left(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}_{test})\right)+(1-y^{(i)}_{test})\log\left(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}_{test})\right) \right] $$  


However, a more common approach for classification is to calculate the fraction of the test set and the fraction of the training set that the algorithm has misclassified. You do this by counting how many misclassifications were made on each set and dividing by the number of examples in that set.
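
As a small illustration, the misclassified fraction could be computed like this. The function name `misclassified_fraction`, the 0.5 threshold and the example numbers are assumptions made for this sketch, not something prescribed by the course.

```python
def misclassified_fraction(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    errors = sum(pred != label for pred, label in zip(predictions, labels))
    return errors / len(labels)


# Hypothetical model outputs f(x) and true labels for a small test set.
test_probs = [0.9, 0.2, 0.6, 0.4, 0.1]
test_labels = [1, 0, 0, 1, 0]
print(misclassified_fraction(test_probs, test_labels))  # 2 of 5 wrong -> 0.4
```

Running the same function on the training predictions gives the training counterpart.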

### Finding your best model 
We can now use the test error to figure out what is the best model to use. For example if you have the following models, but you are not sure which one is best:  
d=1 &nbsp; $f_{\overrightarrow{w},b}(x)=w_1x+b$  
d=2 &nbsp; $f_{\overrightarrow{w},b}(x)=w_1x+w_2x^2+b$  
d=3 &nbsp; $f_{\overrightarrow{w},b}(x)=w_1x+w_2x^2+w_3x^3+b$  
&nbsp; $\vdots$&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;$\vdots$  
d=10 &nbsp;  $f_{\overrightarrow{w},b}(x)=w_1x+w_2x^2+w_3x^3+\dots+w_{10}x^{10}+b$  
What you can do is compute the test error for each of them, and use the one that has the lowest test error.

### Cross validation
The problem with the approach above is that if you use $J_{test}$ to decide which model to use, $J_{test}$ will most likely be an overly optimistic estimate of the generalization error. The degree $d$ was effectively chosen using the test set, so the test set is no longer data the model has never seen. Instead, you should divide your data into 3 groups rather than 2. Something like:  
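60% training data: $(x^{(1)}, y^{(1)}), \dots, (x^{(m_{train})}, y^{(m_{train})})$  
20% cross validation data: $(x_{cv}^{(1)}, y_{cv}^{(1)}), \dots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$  
20% test data: $(x_{test}^{(1)}, y_{test}^{(1)}), \dots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$  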
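
A 60/20/20 split like this can be done by shuffling the example indices once and cutting the shuffled list in two places. The helper `split_indices` below is a hypothetical name used only for this sketch, and the fixed seed is there just to make the split reproducible.

```python
import random


def split_indices(m, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle the indices 0..m-1 and split them into train / cv / test lists."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)
    n_train = round(train_frac * m)
    n_cv = round(cv_frac * m)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_cv],
        indices[n_train + n_cv:],  # whatever is left becomes the test set
    )


train_idx, cv_idx, test_idx = split_indices(100)
print(len(train_idx), len(cv_idx), len(test_idx))  # 60 20 20
```

Every example ends up in exactly one of the three sets, which is what keeps the cross validation and test errors honest.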
  
The cross validation error is calculated just like the training and test error, just using the cross validation data. For the classification cost it is:  
$$J_{cv}(\overrightarrow{w},b)=-\frac{1}{m_{cv}} \sum_{i=1}^{m_{cv}} \left[y^{(i)}_{cv}\log\left(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}_{cv})\right)+(1-y^{(i)}_{cv})\log\left(1-f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}_{cv})\right) \right] $$  
  
**What you should do** is compare the models using your cross validation set instead of your test set: pick the model that has the lowest cross validation error. You then use your test data only to calculate the generalization error. This makes sure that you didn't accidentally pick a model that just happens to work well for your test data.  
  
**You can do the exact same for neural networks**
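
The selection procedure for the polynomial models d=1 to d=10 might look like the sketch below. It rests on assumptions that are not from the course: the data is synthetic, the fits are unregularized `np.polyfit` fits, and `cost` is a helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (an assumption for illustration): a noisy cubic trend.
x = rng.uniform(-1.0, 1.0, size=200)
y = x**3 - 0.5 * x + rng.normal(scale=0.1, size=x.shape)

# x was drawn at random, so slicing already gives a random 60/20/20 split.
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:160], y[120:160]
x_test, y_test = x[160:], y[160:]


def cost(coeffs, x, y):
    """Squared error cost 1/(2m) * sum((f(x) - y)^2)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2) / 2)


# Fit one polynomial per degree on the training set, score each on the CV set.
fits = {d: np.polyfit(x_train, y_train, deg=d) for d in range(1, 11)}
cv_errors = {d: cost(c, x_cv, y_cv) for d, c in fits.items()}
best_d = min(cv_errors, key=cv_errors.get)

# The test set is used once, only to report the generalization error.
j_test = cost(fits[best_d], x_test, y_test)
print(f"chosen d = {best_d}, J_cv = {cv_errors[best_d]:.4f}, J_test = {j_test:.4f}")
```

Because the test set played no part in choosing `best_d`, `j_test` remains a fair estimate of how the chosen model does on new data.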

## Bias/Variance  
A good way to judge your model is to look at its bias and variance.  
![](https://github.com/DouweHorsthuis/machine-learning-cousera/blob/main/images/bias_variance.PNG?raw=true)  
While the figure here is pretty clear, more complex models are harder to visualize. In those cases you can mostly rely on this:  
**High Bias (underfit)**  
$J_{train}$ will be high  
$J_{train}\approx J_{cv}$  
  
**High variance (overfit)**  
$J_{cv} \gg J_{train}$  
($J_{train}$ might be low)  
Human level performance error would be between the training error and the cross validation error.  
*High variance can often be reduced by getting more training data*
  
**High Bias (underfit) & High variance (overfit)**  
$J_{train}$ will be high  
and $J_{cv} \gg J_{train}$  
Human level performance error would be significantly lower than both.  
*High bias isn't solved by getting more training data*

| Solution | What it fixes |
|--|--|
| Get more training examples | High variance |
| Try a smaller set of features | High variance |
| Add more features | High bias |
| Try adding polynomial features ($x_1^2, x_2^2, x_1x_2$, etc.) | High bias |
| Decrease $\lambda$ | High bias |
| Increase $\lambda$ | High variance |

### The impact of regularization on bias and variance  
To recap, you use $\lambda$ to control regularization. You choose a value for $\lambda$ and add the term  
$$\frac{\lambda}{2m}\sum_{j=1}^n w_j^2$$  
to the end of your cost function. In code it would look like this:  
```python
from tensorflow.keras import layers, regularizers
layer = layers.Dense(units=25, activation="relu", kernel_regularizer=regularizers.L2(0.01))  # lambda = 0.01
```  
Where in `kernel_regularizer=regularizers.L2(0.01)` the 0.01 is the value you want to assign to $\lambda$.  
$\lambda$ controls how much impact your features have on the model. If you set $\lambda=10000$, the weights are pushed so close to 0 that you probably end up with a straight, nearly horizontal line, because your model basically becomes $f_{\overrightarrow{w},b}(\overrightarrow{x})\approx b$.  
At the other extreme, $\lambda=0$, there is no regularization, which increases the chance of overfitting.  
**So how do you find the right value for $\lambda$?**  
Well, you do the same as for choosing a model above. You start off with a low $\lambda$, fit the parameters and calculate the cross validation error. After that you increase $\lambda$ and do the same:  
1\.&nbsp;  Try $\lambda$=0.00 -> $J_{cv}(w^{<1>},b^{<1>})$  
2\.&nbsp;  Try $\lambda$=0.01 -> $J_{cv}(w^{<2>},b^{<2>})$  
3\.&nbsp;  Try $\lambda$=0.02 -> $J_{cv}(w^{<3>},b^{<3>})$  
4\.&nbsp;  Try $\lambda$=0.04 -> $J_{cv}(w^{<4>},b^{<4>})$  
5\.&nbsp;  Try $\lambda$=0.08 -> $J_{cv}(w^{<5>},b^{<5>})$  
&nbsp; &nbsp; &nbsp; $\vdots$ &nbsp;  &nbsp; &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;$\vdots$  
12\. Try $\lambda$=10 -> $J_{cv}(w^{<12>},b^{<12>})$  
  
After that you pick the $\lambda$ with the lowest cross validation error and calculate the generalization error of that model using the test set.
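
The search over $\lambda$ can be sketched with regularized linear regression, for which the minimizer of the cost has a closed form. Everything specific here is an assumption for illustration: the synthetic data, the helper names `fit_ridge` and `cost`, and the choice to solve the normal equations with NumPy instead of running gradient descent.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (an assumption): 3 features, linear target plus noise.
X = rng.normal(size=(150, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.5, size=150)
X_train, y_train = X[:90], y[:90]
X_cv, y_cv = X[90:120], y[90:120]
X_test, y_test = X[120:], y[120:]


def fit_ridge(X, y, lam):
    """Minimize 1/(2m)*sum((Xw+b-y)^2) + lam/(2m)*sum(w^2); b is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def cost(w, b, X, y):
    """Unregularized squared error cost, used for J_cv and J_test."""
    return float(np.mean((X @ w + b - y) ** 2) / 2)


# 12 candidate values: 0, then 0.01 doubled each step up to 10.24.
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]
cv_errors = []
for lam in lambdas:
    w, b = fit_ridge(X_train, y_train, lam)
    cv_errors.append(cost(w, b, X_cv, y_cv))

best_lam = lambdas[int(np.argmin(cv_errors))]
w_best, b_best = fit_ridge(X_train, y_train, best_lam)
print(f"best lambda = {best_lam}, J_test = {cost(w_best, b_best, X_test, y_test):.4f}")

# With a huge lambda the weights shrink toward 0, so f(x) is roughly b.
w_huge, b_huge = fit_ridge(X_train, y_train, 1e6)
print(np.round(w_huge, 4), round(float(b_huge), 3))
```

The last two lines illustrate the extreme case discussed earlier: a very large $\lambda$ leaves a model that predicts roughly the same value $b$ for every input.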


### Establishing a baseline level of performance  
To understand whether your training error (and thus your bias) is high or low, you need something to compare it to. One way is to compare it to human level performance. Another way is to compare it to the performance of a competing algorithm. You could also base a guess on your previous experience.  
Once you have a baseline performance, a training error and a cross validation error, you can compare them to see if you have a problem:  
Baseline performance&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; :10.6%  
Training error ($J_{train}$)&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; :10.8%  
Cross validation error ($J_{cv}$) &nbsp; :14.8%  
  
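You can now say that the gap between your training error and the baseline is small: 10.8% - 10.6% = 0.2%  
but the gap between your cross validation error and your training error is large: 14.8% - 10.8% = 4.0%  
So in this example you have a high variance problem, not a high bias problem.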
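
The comparison above is just two subtractions, which can be wrapped in a small helper. The function `diagnose` and its `tolerance` cut-off of 1 percentage point are assumptions made for this sketch; what counts as a "large" gap depends on your application.

```python
def diagnose(baseline, j_train, j_cv, tolerance=1.0):
    """Label bias/variance from error percentages.

    `tolerance` (in percentage points) is an assumed cut-off for a "large" gap.
    """
    bias_gap = j_train - baseline     # training error vs. baseline
    variance_gap = j_cv - j_train     # cross validation error vs. training error
    high_bias = bias_gap > tolerance
    high_variance = variance_gap > tolerance
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "neither"


print(diagnose(baseline=10.6, j_train=10.8, j_cv=14.8))  # high variance
```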

### Bias and variance in Neural networks
Large neural networks are low bias machines when trained on small to medium sized datasets. If you make your neural network large enough, you can almost always fit your training data well. So here you can apply a different approach. A simple recipe that doesn't always work, but is great when it does:  
Does it do well on the training set ($J_{train}(\overrightarrow{w},b)$) compared to baseline performance?  
If not, you have a high bias problem, so use a bigger network: more hidden layers or more units per layer. Keep going through this loop until the training error is acceptable.   
After that you check: does it do well on the cross validation set ($J_{cv}(\overrightarrow{w},b)$)?  
If not, you have a high variance problem, so you need to get more data.  
Then start again at the first check, and repeat until you are happy with both bias and variance.
![](https://github.com/DouweHorsthuis/machine-learning-cousera/blob/main/images/neural_nw_bias_variance.PNG?raw=true)

## Error Analysis  
Error analysis comes second, after the bias and variance diagnostics, as a way to see where your algorithm is going wrong. It means looking at the misclassified examples and working out what the mistake is in each case. Look for mistakes that share a cause and group them into common categories, then count how many examples fall into each category. That shows you which problems are worth focusing on first. When you have a big data set, let's say with 1000 misclassifications, you or your team might not have the time to go through all of them. In that case, **randomly** select 100 or a few hundred of them to review.
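
The sampling and counting part of error analysis is easy to script; the inspection itself stays manual. In the sketch below the misclassified examples and their tags are simulated, and the tag names are hypothetical. In practice you would first draw the random sample and then tag each example by hand.

```python
import random
from collections import Counter

random.seed(0)

# Simulated misclassified examples with hand-assigned tags (made up for
# illustration). One example can carry more than one tag.
possible_tags = ["pharma", "misspelling", "phishing", "unusual routing"]
misclassified = [
    {"id": i, "tags": random.sample(possible_tags, k=random.randint(1, 2))}
    for i in range(1000)
]

# Review a random subset instead of all 1000 mistakes.
sample = random.sample(misclassified, k=100)

# Count how often each category of mistake shows up in the sample.
tag_counts = Counter(tag for example in sample for tag in example["tags"])
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count} of {len(sample)}")
```

The categories at the top of the printed list are the ones most likely to be worth the effort of fixing first.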

### Adding data, but how?
Adding data seems like a way to get a better model practically always. But you might not want to simply get more of all data: it is not a bad idea, but it can be slow and/or expensive. Instead, if you can **figure out what type of data your model struggles with**, find more of that type so your model can train better on the cases it has a hard time with.  
Another way to go is to **create your own extra training data** (data augmentation). For example, suppose your model is looking at pictures of letters. You could take one of the pictures of the letter A and enlarge it, rotate it, or change its color or contrast. The result is still the letter A, so the model can learn from it. It is key that the distortion or "noise" you add is something that could really happen in the real world. You can also create **synthetic data**, which means creating data from scratch.
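
A minimal sketch of creating extra training examples from one image is shown below. The "image" is a random 28x28 array standing in for a real picture of a letter, and the specific distortions (a shift of up to 2 pixels, a contrast change of up to 20%, mild pixel noise) are assumptions chosen for illustration. Pick distortions that match what can really happen to your inputs.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a grayscale picture of a letter (assumption: 28x28, values in [0, 1]).
image = rng.uniform(0.0, 1.0, size=(28, 28))


def augment(image, rng):
    """Return a distorted copy: small shift, contrast change, mild pixel noise."""
    # Shift up to 2 pixels left or right (np.roll wraps pixels around the edge).
    shifted = np.roll(image, shift=rng.integers(-2, 3), axis=1)
    # Stretch or compress the values around mid-gray to change the contrast.
    contrast = rng.uniform(0.8, 1.2)
    adjusted = (shifted - 0.5) * contrast + 0.5
    noisy = adjusted + rng.normal(scale=0.02, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)


# Each augmented copy keeps the label of the original image.
augmented = [augment(image, rng) for _ in range(5)]
labels = ["A"] * len(augmented)
print(len(augmented), augmented[0].shape)
```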


## The full cycle of a machine learning project
1. Scope the project (decide what you want to work on and define the project)
2. Collect data
3. Train the model (training, error analysis, iterative improvement)
4. *Optional:* go back to collecting data and training the model until you are happy
5. Deploy in production (deploy, monitor and maintain it)