# Advice for applying machine learning

# Debugging a learning algorithm

You've implemented regularized linear regression on housing prices. 

But it makes unacceptably large errors in predictions. What do you try next?

- Get more training examples

- Try smaller sets of features

- Try getting additional features

- Try adding polynomial features ($x_1^2, x_2^2, x_1x_2$, etc.)

- Try decreasing $\lambda$

- Try increasing $\lambda$


# Machine learning diagnostic

A test that you can run to gain insight into what is or isn't working with a learning algorithm, and to gain guidance on how best to improve its performance.

- A diagnostic can take time to implement, but doing so can be a very good use of your time.



# Evaluating a model

## For linear regression (with squared error cost)

Take 70% of the available data as the training set and the remaining 30% as the test set. An 80%/20% split is also common.

Train the model on the training set, and evaluate its performance on the test set.
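The split described above could be sketched as follows. This is a minimal NumPy version, not a prescribed implementation; the helper name `train_test_split`, the `seed` argument, and the toy data are assumptions made for illustration.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.3, seed=0):
    """Shuffle the examples, then split them into a training and a test set."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    order = rng.permutation(m)               # random order, so the split is not tied to row order
    m_test = int(round(test_fraction * m))   # number of examples held out for testing
    test_idx, train_idx = order[:m_test], order[m_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Hypothetical data: 10 examples with 2 features each
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

X_train, y_train, X_test, y_test = train_test_split(X, y, test_fraction=0.3)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Shuffling before splitting matters when the rows are stored in some meaningful order (for example, sorted by price), since otherwise the test set would not resemble the training set.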

Training set: Fit parameters by minimizing cost function $J(\vec w, b)$

$$J(\vec w, b) = \frac{1}{2m_{train}} \sum^{m_{train}}_{i=1} \left(f_{\vec w, b}(\vec x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m_{train}} \sum^n_{j=1} w_j^2$$

Test set: Compute test set error

$$J_{test}(\vec w, b) = \frac{1}{2m_{test}} \left[\sum^{m_{test}}_{i=1} \left(f_{\vec w, b}(\vec x^{(i)}) - y^{(i)}\right)^2\right]$$

**Note that the test error does not include the regularization term.**

Compute training error: 

$$J_{train}(\vec w, b) = \frac{1}{2m_{train}} \left[\sum^{m_{train}}_{i=1} \left(f_{\vec w, b}(\vec x^{(i)}) - y^{(i)}\right)^2\right]$$

If a model fits the training data very well but does not generalize well to new data, it is **overfitting**. In that case:

- $J_{train}(\vec w,b)$ will be low.

- $J_{test}(\vec w,b)$ will be high.
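To make the three formulas concrete, here is a hedged sketch in NumPy. It fits the regularized cost in closed form (via the normal equations, one possible choice rather than the method these notes prescribe) and then reports $J_{train}$ and $J_{test}$ without the regularization term. The function names, the synthetic data, and `lam=1.0` are assumptions for illustration only.

```python
import numpy as np

def squared_error_cost(X, y, w, b):
    """J = 1/(2m) * sum((f(x) - y)^2), with no regularization term."""
    m = X.shape[0]
    residual = X @ w + b - y
    return float(residual @ residual) / (2 * m)

def fit_regularized_linear(X, y, lam):
    """Minimize the regularized cost J(w, b); w is penalized, b is not."""
    m, n = X.shape
    Xb = np.hstack([X, np.ones((m, 1))])   # last column of ones carries the bias b
    penalty = lam * np.eye(n + 1)
    penalty[-1, -1] = 0.0                  # leave the bias unregularized
    theta = np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)
    return theta[:-1], theta[-1]

# Hypothetical synthetic data: 50 examples, 3 features, small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0 + rng.normal(0, 0.1, size=50)

# 70% / 30% split (rows are already in random order here)
X_train, y_train = X[:35], y[:35]
X_test, y_test = X[35:], y[35:]

w, b = fit_regularized_linear(X_train, y_train, lam=1.0)
j_train = squared_error_cost(X_train, y_train, w, b)
j_test = squared_error_cost(X_test, y_test, w, b)
print(f"J_train = {j_train:.4f}, J_test = {j_test:.4f}")
```

A low `j_train` together with a much higher `j_test` would be the overfitting signature described above.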

## Classification

Fit parameters by minimizing $J(\vec w, b)$ to find $\vec w, b$:

$$J(\vec w,b) = -\frac{1}{m_{train}} \sum^{m_{train}}_{i=1} \left[y^{(i)} \log (f_{\vec w,b}(\vec x^{(i)})) + (1-y^{(i)}) \log (1 - f_{\vec w,b}(\vec x^{(i)}))\right] + \frac{\lambda}{2m_{train}} \sum^n_{j=1} w_j^2$$

**Option 1**

Compute test error:

$$J_{test}(\vec w,b) = -\frac{1}{m_{test}} \sum^{m_{test}}_{i=1} [y^{(i)}_{test} \log (f_{\vec w,b}(\vec x^{(i)}_{test})) + (1-y^{(i)}_{test}) \log (1 - f_{\vec w,b}(\vec x^{(i)}_{test}))]$$

Compute training error:

$$J_{train}(\vec w,b) = -\frac{1}{m_{train}} \sum^{m_{train}}_{i=1} [y^{(i)}_{train} \log (f_{\vec w,b}(\vec x^{(i)}_{train})) + (1-y^{(i)}_{train}) \log (1 - f_{\vec w,b}(\vec x^{(i)}_{train}))]$$
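The logistic-loss error in Option 1 could be computed along these lines. This is a sketch that assumes a logistic regression model $f_{\vec w,b}(\vec x) = \sigma(\vec w \cdot \vec x + b)$; the function names and the clipping constant `eps` are illustrative choices, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(X, y, w, b, eps=1e-12):
    """Average logistic loss over the given set, with no regularization term."""
    f = sigmoid(X @ w + b)
    f = np.clip(f, eps, 1 - eps)   # keep log() away from exactly 0 or 1
    return float(-np.mean(y * np.log(f) + (1 - y) * np.log(1 - f)))

# Hypothetical check: with w = 0 and b = 0 every prediction is 0.5,
# so the loss is log(2) regardless of the labels.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_demo = np.array([0.0, 1.0, 1.0])
print(round(logistic_cost(X_demo, y_demo, np.zeros(2), 0.0), 4))  # 0.6931
```

Calling the same function on the test set gives $J_{test}$ and on the training set gives $J_{train}$.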

**Option 2**

Instead of using the logistic loss to compute the test and training errors, we can measure the fraction of the test set and the fraction of the training set that was misclassified.

$$\hat y = \begin{cases}
1 & \text{if } f_{\vec w,b}(\vec x^{(i)}) \geq 0.5 \\
0 & \text{if } f_{\vec w,b}(\vec x^{(i)}) < 0.5
\end{cases}$$

Count the examples for which $\hat y \neq y$.

$J_{test}(\vec w,b)$ is the fraction of the test set that has been misclassified.

$J_{train}(\vec w,b)$ is the fraction of the training set that has been misclassified.
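Option 2 can be sketched as a thresholded prediction followed by a mean over mismatches. As before, this assumes a logistic regression model; `misclassification_fraction` and the toy data are hypothetical names and values used only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def misclassification_fraction(X, y, w, b, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    f = sigmoid(X @ w + b)
    y_hat = (f >= threshold).astype(int)   # 1 if f >= 0.5, else 0
    return float(np.mean(y_hat != y))

# Hypothetical example: predictions are [0, 0, 1, 1], labels are [0, 1, 1, 0]
X_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([0, 1, 1, 0])
print(misclassification_fraction(X_demo, y_demo, np.array([1.0]), 0.0))  # 0.5
```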

# Model selection and training / cross validation / test sets

- Once the parameters $\vec w,b$ are fit to the training set, the training error $J_{train}(\vec w,b)$ is likely lower than the true generalization error.

- $J_{test}(\vec w,b)$ is a better estimate than $J_{train}(\vec w,b)$ of how well the model will generalize to new data.



**Model selection**

If we have 10 different models to choose from, we can train each of them on the training set, and evaluate the performance on the test set. Then we can choose the model that performed best on the test set (the one with the lowest $J_{test}(\vec w,b)$).

However, this procedure is flawed: the test set itself was used to choose the model, so the $J_{test}(\vec w,b)$ of the chosen model is likely an overly optimistic estimate of the generalization error.

**Cross validation**

We divide the available data into 3 parts:

- Training set: 60%

- Cross validation set (also called validation set, development set, or dev set): 20%. Used to check the accuracy of the different candidate models.

- Test set: 20%
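A three-way split like the one listed above might look like this in NumPy. It is only a sketch: the helper name `train_cv_test_split`, the `seed` argument, and the toy data are assumptions.

```python
import numpy as np

def train_cv_test_split(X, y, fractions=(0.6, 0.2), seed=0):
    """Shuffle, then cut into training, cross validation, and test sets.

    `fractions` gives the training and cross validation shares;
    whatever remains becomes the test set.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    order = rng.permutation(m)
    m_train = int(round(fractions[0] * m))
    m_cv = int(round(fractions[1] * m))
    # np.split cuts the shuffled indices at the two boundaries
    train, cv, test = np.split(order, [m_train, m_train + m_cv])
    return (X[train], y[train]), (X[cv], y[cv]), (X[test], y[test])

# Hypothetical data: 10 examples with 2 features each
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

(X_train, y_train), (X_cv, y_cv), (X_test, y_test) = train_cv_test_split(X, y)
print(len(y_train), len(y_cv), len(y_test))  # 6 2 2
```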


**Same formulas as before**

Training error:

$$J_{train}(\vec w,b) = \frac{1}{2m_{train}} \left[\sum^{m_{train}}_{i=1} \left(f_{\vec w, b}(\vec x^{(i)}) - y^{(i)}\right)^2\right]$$

Cross validation error:

$$J_{cv}(\vec w,b) = \frac{1}{2m_{cv}} \left[\sum^{m_{cv}}_{i=1} \left(f_{\vec w, b}(\vec x^{(i)}) - y^{(i)}\right)^2\right]$$

Test error:

$$J_{test}(\vec w,b) = \frac{1}{2m_{test}} \left[\sum^{m_{test}}_{i=1} \left(f_{\vec w, b}(\vec x^{(i)}) - y^{(i)}\right)^2\right]$$

**Model selection**

We train each of the 10 models on the training set, and evaluate the performance on the cross validation set. Then we choose the model that performed best on the cross validation set (the one with the lowest $J_{cv}(\vec w,b)$).

Finally, we estimate the generalization error of the chosen model on the test set, which played no part in fitting the parameters or in choosing the model.
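The selection procedure could be sketched as below. The notes do not say what the 10 models are, so this example assumes they are polynomial regressions of degree 1 to 10 on a single feature; the synthetic data, the helper names, and the unregularized least-squares fit are likewise illustrative assumptions.

```python
import numpy as np

def poly_features(x, degree):
    """Columns x, x^2, ..., x^degree for a 1-D input array."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

def fit_linear(X, y):
    """Least-squares fit of w and b (no regularization, to keep the sketch short)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return theta[:-1], theta[-1]

def cost(X, y, w, b):
    """Squared error cost 1/(2m) * sum((f(x) - y)^2)."""
    r = X @ w + b - y
    return float(r @ r) / (2 * X.shape[0])

# Hypothetical data: a noisy quadratic, already in random order
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.1, size=100)

# 60% / 20% / 20% split
x_train, x_cv, x_test = x[:60], x[60:80], x[80:]
y_train, y_cv, y_test = y[:60], y[60:80], y[80:]

# Fit every candidate on the training set, score it on the cross validation set
results = {}
for degree in range(1, 11):
    w, b = fit_linear(poly_features(x_train, degree), y_train)
    j_cv = cost(poly_features(x_cv, degree), y_cv, w, b)
    results[degree] = (j_cv, w, b)

# Pick the candidate with the lowest J_cv, then touch the test set exactly once
best_degree = min(results, key=lambda d: results[d][0])
best_j_cv, best_w, best_b = results[best_degree]
j_test = cost(poly_features(x_test, best_degree), y_test, best_w, best_b)
print(f"chosen degree = {best_degree}, J_cv = {best_j_cv:.4f}, J_test = {j_test:.4f}")
```

Because the test set is used only after the degree has been chosen, `j_test` is not biased by the selection step in the way `best_j_cv` is.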