# 6. Machine Learning System Design

Suppose we decide to use our regularized linear regression algorithm to predict housing prices.

When we test our hypothesis on a new set of housing data, the prediction errors turn out to be unacceptably large. What should we do next?

* Get more training examples
* Try a smaller set of features (maybe our model is overfitting)
* Try a bigger set of features (which may require additional data collection)
* Try adding polynomial features ($x_1^2, x_2^2, x_1 x_2$, etc.)
* Try decreasing $\lambda$
* Try increasing $\lambda$

To help us make better decisions, we can use what we will call a **Machine Learning Diagnostic**: a test we run to gain insight into what is and is not working with a learning algorithm, and therefore guidance on how to improve its performance.

### Evaluating a Hypothesis

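How can we tell whether our hypothesis is overfitting? We can hold out part of our own dataset for testing: learn the parameters on the training set only, then measure the error on the test set.

A good rule of thumb is a training : testing ratio of 70:30.   
**Note**: Check whether the dataset is sorted (or has any inherent order). If it is, shuffle it before splitting.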
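The shuffle-then-split step can be sketched as follows. This is a minimal illustration with NumPy, not a prescribed implementation; the function name `train_test_split`, the seed, and the toy data are assumptions made for the example.

```python
import numpy as np

def train_test_split(X, y, train_fraction=0.7, seed=0):
    """Shuffle the examples, then split them into training and test sets."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    order = rng.permutation(m)                 # random order removes any inherent sorting
    m_train = int(round(train_fraction * m))   # 70% of the examples by default
    train_idx, test_idx = order[:m_train], order[m_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data (assumed for illustration), deliberately sorted by its single feature.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
X_train, y_train, X_test, y_test = train_test_split(X, y)
```

Shuffling a single index array and using it for both `X` and `y` keeps every example paired with its own label.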

For linear regression, the test set error is the cost function evaluated on the test set. For classification problems, we may also use the **misclassification error**.
In this case, the test error is:

$\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_{\theta}(x^{(i)}_{test}), y^{(i)}_{test})$

where the $err$ function is a piecewise function which evaluates to 1 if the classification is **wrong** and 0 otherwise. 
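As a rough sketch, the misclassification error is just the mean of the 0/1 $err$ values. The example below assumes a logistic hypothesis with a 0.5 decision threshold, which these notes do not specify; the parameter vector and the four test examples are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def misclassification_error(theta, X, y, threshold=0.5):
    """Fraction of examples whose predicted class differs from the label."""
    predictions = (sigmoid(X @ theta) >= threshold).astype(int)
    err = (predictions != y).astype(float)   # 1 if the classification is wrong, 0 otherwise
    return err.mean()

# Hypothetical test set: a bias column plus one feature.
X_test = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y_test = np.array([0, 1, 1, 1])
theta = np.array([0.0, 1.0])   # assumed parameters: predicts class 1 when the feature is >= 0
test_error = misclassification_error(theta, X_test, y_test)
```

Here only the second example is misclassified, so `test_error` is 0.25.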

### Model Selection

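Choosing a model (for example, the polynomial degree $d$) by its test set error makes that error an optimistic estimate of how well the model generalizes, so we set aside a third set for the selection. One way to break down our dataset into the three sets is: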

* Training set: **60%**
* Cross validation set: **20%**
* Test set: **20%**

We can now calculate three separate error values for the three different sets using the following method:

* Optimize the parameters in $\Theta$ using the training set for each polynomial degree.
* Find the polynomial degree $d$ with the lowest error using the cross validation set.
* Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$, where $\Theta^{(d)}$ are the parameters learned for the polynomial degree $d$ with the lowest cross validation error.
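The three steps above can be sketched with NumPy's `polyfit`. This is an illustration under stated assumptions: the data is synthetic, the candidate degrees 1 to 6 are arbitrary, and the cost is taken to be the squared error scaled by $\frac{1}{2m}$.

```python
import numpy as np

def cost(coeffs, x, y):
    """Squared-error cost 1/(2m) * sum((h(x) - y)^2), without regularization."""
    residuals = np.polyval(coeffs, x) - y
    return float(residuals @ residuals) / (2 * len(x))

# Synthetic data (assumed for illustration): a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=100)

# 60/20/20 split; x was drawn at random, so it is already in random order.
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]
x_test, y_test = x[80:], y[80:]

degrees = range(1, 7)
# Step 1: fit the parameters for each degree on the training set only.
models = {d: np.polyfit(x_train, y_train, d) for d in degrees}
# Step 2: pick the degree with the lowest cross validation error.
cv_errors = {d: cost(models[d], x_cv, y_cv) for d in degrees}
best_d = min(cv_errors, key=cv_errors.get)
# Step 3: estimate the generalization error on the untouched test set.
test_error = cost(models[best_d], x_test, y_test)
```

The test set plays no part in fitting the parameters or in choosing `best_d`, which is what makes `test_error` a fair estimate.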

### Bias vs. Variance

Suboptimal models fall into one of two classes: high bias (_= underfitting_) or high variance (_= overfitting_).

We can see this by plotting the error as a function of the polynomial degree $d$:

* On the training set, the error decreases as we increase $d$.
* On the cross validation set, the error usually assumes a **U-shape**: underfitting at low $d$ (high error / high bias) and overfitting at high $d$ (high error again / high variance).

![Errors as function of degree of polynomial](Figures/Errors_Degs.png)
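The two curves in such a plot can be computed with a short loop. The sketch below uses synthetic data and an arbitrary range of degrees, both assumptions for illustration; on a real dataset the cross validation curve is not guaranteed to form a clean U.

```python
import numpy as np

def cost(coeffs, x, y):
    """Squared-error cost 1/(2m) * sum((h(x) - y)^2)."""
    r = np.polyval(coeffs, x) - y
    return float(r @ r) / (2 * len(x))

# Synthetic data (assumed for illustration): a noisy sine wave.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=60)
x_train, y_train, x_cv, y_cv = x[:40], y[:40], x[40:], y[40:]

degrees = list(range(1, 9))
train_errors, cv_errors = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, d)              # fit on the training set only
    train_errors.append(cost(coeffs, x_train, y_train))   # cannot grow as d increases
    cv_errors.append(cost(coeffs, x_cv, y_cv))            # typically falls, then rises

for d, j_train, j_cv in zip(degrees, train_errors, cv_errors):
    print(f"d={d}  J_train={j_train:.4f}  J_cv={j_cv:.4f}")
```

A high training error together with a similar cross validation error points to high bias; a low training error with a much higher cross validation error points to high variance.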

### Regularization and Bias / Variance

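The regularization term is a useful tool to contain overfitting. Until now, the regularization parameter $\lambda$ was simply given. With the intuition above, we can choose it more rigorously: a large $\lambda$ pushes the model toward high bias, while a small $\lambda$ leaves it free to overfit (high variance).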

Practically, we can choose $\lambda$ as follows:

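1. Take 12 models with different values of $\lambda$ ($\lambda = 0, 0.01, 0.02, 0.04, \dots, 10$)
2. For each $\lambda$, minimize $J(\Theta)$ on the training set to learn $\Theta$
3. Use the $\Theta$ found in step 2 to compute $J_{cv}(\Theta)$ on the cross validation set, **without** the regularization term
4. Choose the $\lambda$ with the lowest $J_{cv}(\Theta)$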
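The four steps can be sketched as below. Several choices here are assumptions rather than part of the notes: the minimization uses the regularized normal equation instead of an iterative optimizer, the candidate list is built by doubling from 0.01 (which ends at 10.24 rather than exactly 10), and the data and the degree-8 polynomial features are synthetic.

```python
import numpy as np

def fit_regularized(X, y, lam):
    """Minimize the regularized cost with the normal equation (intercept not penalized)."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                       # do not regularize the bias term
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def cost_unregularized(theta, X, y):
    """Squared-error cost 1/(2m) * sum((X theta - y)^2), with no penalty term."""
    r = X @ theta - y
    return float(r @ r) / (2 * len(y))

# Synthetic data (assumed for illustration) with features 1, x, ..., x^8.
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=50)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=50)
X = np.vander(x, 9, increasing=True)
X_train, y_train, X_cv, y_cv = X[:30], y[:30], X[30:], y[30:]

# Step 1: 12 candidate values, assuming each one doubles the previous.
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]

cv_errors = []
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)             # step 2: train with lambda
    cv_errors.append(cost_unregularized(theta, X_cv, y_cv))    # step 3: evaluate without it
best_lambda = lambdas[int(np.argmin(cv_errors))]               # step 4: lowest J_cv wins
```

Evaluating without the penalty term keeps the cross validation errors comparable across different values of $\lambda$.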

### Learning Curves 

A learning curve plots the errors as a function of the training set size:

* Training error will increase with size
* Cross validation error will decrease with size

With high bias (underfitting), both errors quickly flatten out at a high value, so adding more training examples doesn't help.

With high variance (overfitting), a gap remains between a low training error and a higher cross validation error. More training data is likely to help, because it narrows this gap.

Error | High Bias | High Variance
------------ | ------------- | ----
Training | Increases, then flattens out | Increases slowly
Cross validation | Decreases, then flattens out | Decreases slowly
Do the curves converge? | Yes | No
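
The data behind a learning curve can be computed by training on growing subsets of the training set. The sketch below is an illustration under assumptions: the data is synthetic, and a straight-line model is fit to a sine wave on purpose so that the result resembles the high bias column.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit without regularization."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def cost(theta, X, y):
    """Squared-error cost 1/(2m) * sum((X theta - y)^2)."""
    r = X @ theta - y
    return float(r @ r) / (2 * len(y))

# Synthetic data (assumed for illustration): a noisy sine wave.
rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=120)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=120)
X = np.column_stack([np.ones_like(x), x])      # bias column plus the raw feature
X_train, y_train, X_cv, y_cv = X[:80], y[:80], X[80:], y[80:]

sizes = list(range(2, 81, 6))                  # training set sizes 2, 8, ..., 80
train_errors, cv_errors = [], []
for m in sizes:
    theta = fit_linear(X_train[:m], y_train[:m])
    train_errors.append(cost(theta, X_train[:m], y_train[:m]))  # error on the m examples used
    cv_errors.append(cost(theta, X_cv, y_cv))                   # error on the full CV set
```

Note that the training error is measured only on the `m` examples used for fitting, while the cross validation error is always measured on the whole cross validation set.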