# 6. Machine Learning System Design

Suppose we decide to use our regularized linear regression algorithm to predict housing prices.

When we test our hypothesis on a new set of housing data, the prediction errors turn out to be extremely large. What should we do next?

* Get more training examples
* Try a smaller set of features (maybe our model is overfitting)
* Try a bigger set of features (which includes additional data collection)
* Try adding polynomial features ($x_1^2, x_2^2, x_1 x_2$, etc.)
* Try decreasing $\lambda$
* Try increasing $\lambda$

To help us choose among these options, we can run what we will call a **machine learning diagnostic**: a test that gives insight into what is or is not working with a learning algorithm, and therefore guidance on how to improve its performance.

### Evaluating a Hypothesis

How can we tell whether our hypothesis is overfitting? We can hold out part of our own dataset and use it only for testing.

A good rule of thumb is a 70:30 split between training and test data.  
**Note**: Check whether the dataset is sorted (or has any inherent order). If it is, shuffle it before splitting.
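
The split described above can be sketched as follows. This is a minimal illustration, assuming NumPy arrays; the helper name `train_test_split`, the `seed` parameter, and the toy data are hypothetical and not part of the original notes.

```python
import numpy as np

def train_test_split(X, y, train_fraction=0.7, seed=0):
    """Shuffle the examples, then split them into training and test sets."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    # A random permutation removes any inherent ordering in the dataset.
    order = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Hypothetical data: 10 examples with one feature, sorted by target value.
X_all = np.arange(10, dtype=float).reshape(-1, 1)
y_all = np.arange(10, dtype=float)
X_tr, y_tr, X_te, y_te = train_test_split(X_all, y_all)
print(len(X_tr), len(X_te))  # 7 3
```

Without the shuffle, a sorted dataset would put all of the most expensive houses in the test set, for example.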

In addition to the test set error for linear regression, we may also use the **misclassification error** for classification problems. 
In this case, the test error is:

$ \frac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_{\theta}(x^{(i)}_{test}), y^{(i)}_{test})$

where the $err$ function is a piecewise function which evaluates to 1 if the classification is **wrong** and 0 otherwise. 
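
As a small sketch of this formula, the function below thresholds the hypothesis output and averages the 0/1 errors. The 0.5 threshold and the example values are assumptions made for illustration.

```python
import numpy as np

def misclassification_error(h, y, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    h = np.asarray(h, dtype=float)
    y = np.asarray(y)
    # Assumed convention: predict 1 when the hypothesis output is >= threshold.
    predictions = (h >= threshold).astype(int)
    # err(...) is 1 for each wrong prediction and 0 otherwise; the mean is the test error.
    return float(np.mean(predictions != y))

# Hypothetical hypothesis outputs and true labels for four test examples.
h_outputs = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]
print(misclassification_error(h_outputs, labels))  # 0.5
```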

### Model Selection

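To choose between candidate models (for example, different polynomial degrees) without biasing our estimate of the generalization error, we split the dataset into three sets. One possible split is: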

* Training set: **60%**
* Cross validation set: **20%**
* Test set: **20%**

We can now calculate a separate error value for each of the three sets using the following method:

* Optimize the parameters in $\Theta$ using the training set for each polynomial degree.
* Find the polynomial degree $d$ with the least error using the cross validation set.
* Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$, where $\Theta^{(d)}$ is the parameter vector learned for the degree $d$ with the lowest cross validation error.
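
The three steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions: the data is synthetic, the fit is unregularized (`np.polyfit`), and the error is assumed to be the usual average squared error with a $\frac{1}{2m}$ factor. The helper names are hypothetical.

```python
import numpy as np

def squared_error(coeffs, x, y):
    """Average squared error J = 1/(2m) * sum((h(x) - y)^2)."""
    residuals = np.polyval(coeffs, x) - y
    return float(np.sum(residuals ** 2) / (2 * len(x)))

def select_degree(x_train, y_train, x_cv, y_cv, degrees):
    """Fit one polynomial per degree on the training set, pick the lowest CV error."""
    models = {d: np.polyfit(x_train, y_train, d) for d in degrees}
    cv_errors = {d: squared_error(models[d], x_cv, y_cv) for d in degrees}
    best = min(cv_errors, key=cv_errors.get)
    return best, models[best], cv_errors

# Synthetic quadratic data with noise, split 60% / 20% / 20%.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0, 0.1, 200)
x_train, y_train = x[:120], y[:120]
x_cv, y_cv = x[120:160], y[120:160]
x_test, y_test = x[160:], y[160:]

best_d, best_model, cv_errors = select_degree(x_train, y_train, x_cv, y_cv, range(1, 7))
# The test set is used only once, to estimate the generalization error.
j_test = squared_error(best_model, x_test, y_test)
print(best_d, j_test)
```

Because $d$ was chosen using the cross validation set, the cross validation error of the selected model is an optimistic estimate. That is why the final number is reported on the untouched test set.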

### Bias vs. Variance

Suboptimal models fall into two classes: high bias (_= underfitting_) or high variance (_= overfitting_).

Plotting the error as a function of the polynomial degree $d$ shows that:

* The training error decreases as we increase $d$
* The cross validation error usually assumes a **U-shape**: underfitting at low $d$ (high error / high bias) and overfitting at high $d$ (high error again / high variance)

### Prioritizing What to Work On

**Example: SPAM Classifier** 

Feature selection: use the training set to identify the 10,000 to 50,000 most common words, and represent each email as a vector of 1/0 entries indicating whether each word is present (1) or not (0).
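
A toy version of this feature construction is sketched below. It assumes lowercasing and whitespace tokenization, uses a vocabulary of 3 words instead of 10,000, and the example emails and helper names are made up for illustration.

```python
from collections import Counter

def build_vocabulary(emails, size):
    """Return the `size` most common words across the training emails."""
    counts = Counter(word for email in emails for word in email.lower().split())
    return [word for word, _ in counts.most_common(size)]

def to_feature_vector(email, vocabulary):
    """One entry per vocabulary word: 1 if it appears in the email, 0 otherwise."""
    words = set(email.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical training emails.
training_emails = [
    "buy cheap watches now",
    "cheap deal buy now",
    "meeting agenda for monday",
]
vocabulary = build_vocabulary(training_emails, size=3)
features = to_feature_vector("Buy now before Monday", vocabulary)
print(dict(zip(vocabulary, features)))
```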

**Question: How to optimally spend our time?**

Options:

* Collect more data
* Develop sophisticated features
* Develop algorithms to process your input in different ways (e.g. watch / Watch / watches / w4tch)

### Error Analysis 

Recommended approach for an ML project:

1. Start with a simple algorithm
2. Implement it and test it on cross-validation data
3. Plot learning curves to decide the next step
4. Perform error analysis to see whether there are patterns in the errors
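
Step 3 can be sketched as below: the model is trained on growing subsets of the training set, and the training and cross-validation errors are recorded for each size. This is a minimal illustration, assuming a straight-line fit, synthetic data, and the average squared error with a $\frac{1}{2m}$ factor. The function name and subset sizes are hypothetical.

```python
import numpy as np

def learning_curve(x_train, y_train, x_cv, y_cv, sizes):
    """Training and CV error of a straight-line fit for growing training subsets."""
    train_errors, cv_errors = [], []
    for m in sizes:
        # Fit only on the first m training examples.
        coeffs = np.polyfit(x_train[:m], y_train[:m], 1)
        train_residuals = np.polyval(coeffs, x_train[:m]) - y_train[:m]
        cv_residuals = np.polyval(coeffs, x_cv) - y_cv
        train_errors.append(float(np.mean(train_residuals ** 2) / 2))
        cv_errors.append(float(np.mean(cv_residuals ** 2) / 2))
    return train_errors, cv_errors

# Synthetic linear data with noise.
rng = np.random.default_rng(1)
x_lc = rng.uniform(-1, 1, 150)
y_lc = 0.5 + 2.0 * x_lc + rng.normal(0, 0.3, 150)
sizes = [2, 5, 10, 50, 100]
train_errors, cv_errors = learning_curve(x_lc[:100], y_lc[:100], x_lc[100:], y_lc[100:], sizes)
for m, jt, jcv in zip(sizes, train_errors, cv_errors):
    print(m, round(jt, 4), round(jcv, 4))
```

The two lists can then be plotted against `sizes` with any plotting library.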

Let's assume that our spam classifier has a very high error rate on the cross-validation set. We could manually examine the misclassified emails and:

* Check what type of email each one is
* List the features that would have helped classify it more accurately

In this regard, **numerical evaluation** (having a single number to understand if we are moving in the right direction) is extremely useful. 

### Handling Skewed Data

Choosing an evaluation metric is particularly difficult in the case of _skewed classes_.

Simply put, this is the case where we are trying to detect a class that occurs in only a very small portion of the examples (e.g. 0.5% cancer diagnosis). A classifier that always predicts 0 would then reach 99.5% accuracy without being of any use.

 _ | Actual 1 | Actual 0
------------ | ------------- | ----
Predicted 1 | True positive | False positive
Predicted 0 | False negative | True negative

Let's now introduce more sophisticated measures of evaluation: **precision** and **recall**.

We define **precision** as:  $ \frac {\text {True positive}}{\text{True positive + False positive}}$

We define **recall** as:   $ \frac {\text {True positive}}{\text{True positive + False negative}}$
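
The two definitions can be checked on a small example. The sketch below uses hypothetical data matching the 0.5% case (1 positive among 200 examples) and a classifier that always predicts 0. Returning 0.0 when a denominator is zero is an assumed convention, not something the definitions prescribe.

```python
def confusion_counts(predicted, actual):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    return tp, fp, fn, tn

def precision_recall(predicted, actual):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(predicted, actual)
    # Assumed convention: report 0.0 when the denominator is zero.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Hypothetical skewed data: 1 positive among 200 examples (0.5%).
actual = [1] + [0] * 199
always_zero = [0] * 200
accuracy = sum(p == a for p, a in zip(always_zero, actual)) / len(actual)
print(accuracy)                               # 0.995
print(precision_recall(always_zero, actual))  # (0.0, 0.0)
```

The accuracy looks excellent, but the recall of 0 shows that the classifier never finds a positive case.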

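There is an intrinsic trade-off between precision and recall as our threshold varies: raising the threshold tends to increase precision and lower recall, and lowering it does the opposite.
The next big question is therefore: is there a way to choose our threshold (cut-off value) automatically?

It turns out there is: combine precision ($P$) and recall ($R$) into a single number, called the $F_1$ Score, and pick the threshold that maximizes it on the cross validation set.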

$F_1 \text{ Score} = 2 \frac{PR}{P+R}$ 
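
One way to use this formula for picking the threshold is sketched below: evaluate the $F_1$ Score on the cross validation set for several candidate thresholds and keep the best one. The probabilities, labels, and candidate thresholds are made-up values, and treating $F_1$ as 0 when $P + R = 0$ is an assumed convention.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R), taken to be 0 when both P and R are 0 (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(probabilities, actual, thresholds):
    """Return the candidate threshold with the highest F1 score, and that score."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        predicted = [1 if p >= t else 0 for p in probabilities]
        tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
        fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
        fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        score = f1_score(precision, recall)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Hypothetical cross validation outputs and labels.
cv_probabilities = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
cv_labels = [1, 1, 0, 1, 0, 0]
threshold, score = best_threshold(cv_probabilities, cv_labels, [0.3, 0.5, 0.7, 0.9])
print(threshold, round(score, 3))  # 0.7 0.8
```

Unlike the plain average $(P+R)/2$, the $F_1$ Score is low whenever either precision or recall is low, so a degenerate classifier such as "always predict 1" does not score well.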

### Data for Machine Learning

In a famous paper by Banko and Brill (2001), the authors compare the performance of 4 different algorithms as the size of the dataset increases. In all cases the performance improves, almost converging for 3 of them.


**Large data rationale**

1. An algorithm with many parameters will have a low training error
2. Using a large dataset will make sure that the training set error and the test set error are close to each other

Together, these two conditions should ensure that the test set error ends up being low.