# Evaluating a Learning Algorithm

## Evaluating a Hypothesis

Once we have done some troubleshooting for errors in our predictions by:
- Getting more training examples
- Trying smaller sets of features
- Trying additional features
- Trying polynomial features
- Increasing or decreasing λ

We can move on to evaluate our new hypothesis.

A hypothesis may have a low error for the training examples but still be inaccurate (because of overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up the data into two sets: a **training set** and a **test set**. Typically, the training set consists of 70% of your data and the test set is the remaining 30%.

The new procedure using these two sets is then:
1. Learn $\Theta$ and minimize $J_{\text {train}}(\Theta)$ using the training set
2. Compute the test set error $J_{\text {test }}(\Theta)$
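
As a minimal sketch of the 70/30 split described above, the helper below shuffles the examples before cutting them. The function name `train_test_split`, the fixed seed, and the toy data are assumptions made for illustration only.

```python
import random


def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    shuffled = list(examples)
    # Shuffle first so that any ordering in the raw data does not leak
    # into the split. The seed only makes the sketch reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy data set of (x, y) pairs, invented for this example.
data = [(x, 2 * x + 1) for x in range(10)]
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```

The parameters would then be learned on `train_set` only, and `test_set` would be held back for computing the test error.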

### The test set error

1. For linear regression: $J_{\text{test}}(\Theta)=\frac{1}{2 m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}}\left(h_{\Theta}\left(x_{\text{test}}^{(i)}\right)-y_{\text{test}}^{(i)}\right)^{2}$
2. For classification - Misclassification error (aka 0/1 misclassification error):  
$\operatorname{err}\left(h_{\Theta}(x), y\right)=\left\{\begin{array}{ll}{1} & {\text { if } (h_{\Theta}(x) \geq 0.5 \text { and } y=0) \text { or } (h_{\Theta}(x)<0.5 \text { and } y=1)} \\ {0} & {\text { otherwise }}\end{array}\right.$

This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is:

$\text { Test Error }=\frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \operatorname{err}\left(h_{\Theta}\left(x_{\text{test}}^{(i)}\right), y_{\text{test}}^{(i)}\right)$

This gives us the proportion of the test data that was misclassified.
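
The two test errors above can be sketched in plain Python as follows. The hypothesis `h` is passed in as an ordinary function, and the function names and toy numbers are assumptions chosen for the example.

```python
def linear_test_error(h, xs, ys):
    """Squared-error test cost: (1 / (2 * m_test)) * sum((h(x) - y)^2)."""
    m_test = len(xs)
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m_test)


def misclassification_error(h, xs, ys, threshold=0.5):
    """Fraction of examples whose thresholded prediction disagrees with y."""
    wrong = sum(
        1 for x, y in zip(xs, ys)
        if (h(x) >= threshold) != (y == 1)
    )
    return wrong / len(xs)


# Regression: a hypothetical hypothesis h(x) = 2x on three test points.
reg_error = linear_test_error(lambda x: 2 * x, [1, 2, 3], [2, 4, 8])

# Classification: here each x already plays the role of h(x), a probability.
clf_error = misclassification_error(
    lambda p: p, [0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]
)
print(reg_error, clf_error)
```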

## Model Selection and Train/Validation/Test Sets

Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis. It could overfit, and as a result your predictions on the test set would be poor. The error of your hypothesis as measured on the data set with which you trained the parameters will generally be lower than the error on any other data set.

Given many models with different polynomial degrees, we can use a systematic approach to identify the 'best' function. In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.

One way to break down our dataset into the three sets is:  
- Training set: 60%
- Cross validation set: 20%
- Test set: 20%

![image.png](https://i.loli.net/2020/03/01/1WJgxEtMNAfjvQw.png)

We can now calculate three separate error values for the three different sets using the following method:  
1. Optimize the parameters in $\Theta$ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set with $J_{\text{test}}\left(\Theta^{(d)}\right)$, where $\Theta^{(d)}$ is the parameter vector learned for the polynomial degree $d$ with the lowest cross validation error.

This way, the degree of the polynomial d has not been trained using the test set.
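
The three steps above can be sketched as follows. To keep the example free of external libraries, only degree 0 (a constant) and degree 1 (a line) are used as candidates, since both have simple closed-form fits. The helper names and the toy data are assumptions, and real data should be shuffled before it is split.

```python
def fit_constant(points):
    """Degree 0: always predict the mean of y."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def fit_line(points):
    """Degree 1: closed-form least squares for a single feature."""
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def cost(h, points):
    """Squared-error cost J = (1 / 2m) * sum((h(x) - y)^2)."""
    return sum((h(x) - y) ** 2 for x, y in points) / (2 * len(points))


def select_model(fitters, train, cv, test):
    # Step 1: fit every candidate degree on the training set only.
    models = {d: fit(train) for d, fit in fitters.items()}
    # Step 2: choose the degree with the lowest cross validation error.
    best_d = min(models, key=lambda d: cost(models[d], cv))
    # Step 3: report the test error of the chosen model only.
    return best_d, cost(models[best_d], test)


points = [(float(x), 3.0 * x + 1.0) for x in range(10)]
train, cv, test = points[:6], points[6:8], points[8:]  # 60% / 20% / 20%
best_degree, test_error = select_model(
    {0: fit_constant, 1: fit_line}, train, cv, test
)
print(best_degree, test_error)
```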

# Bias vs. Variance

## Diagnosing Bias vs. Variance

The training error will tend to decrease as we increase the degree d of the polynomial.  
At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.

High bias (underfitting): both $J_{\text {train }}(\Theta)$ and $J_{C V}(\Theta)$ will be high. Also, $J_{C V}(\Theta) \approx J_{\text {train }}(\Theta)$  
High variance (overfitting): $J_{\text {train}}(\Theta)$ will be low and $J_{C V}(\Theta)$ will be much greater than $J_{\text {train}}(\Theta)$

![image.png](https://i.loli.net/2020/03/01/a5DjfAUrLVNeEvR.png)
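
The rule of thumb above can be written as a small helper. The thresholds `acceptable_error` and `gap_factor` are hypothetical knobs introduced for this sketch, not values from the course. Sensible values depend on the problem and on the scale of the cost.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_factor=2.0):
    """Rough bias/variance diagnosis from training and CV errors."""
    if j_train > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "high bias"
    if j_cv > acceptable_error and j_cv > gap_factor * j_train:
        # Fits the training set well but does not generalize.
        return "high variance"
    return "ok"


print(diagnose(5.0, 5.5, acceptable_error=1.0))
print(diagnose(0.1, 4.0, acceptable_error=1.0))
print(diagnose(0.2, 0.3, acceptable_error=1.0))
```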

## Regularization and Bias/Variance

![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/3XyCytntEeataRJ74fuL6g_3b6c06d065d24e0bf8d557e59027e87a_Screenshot-2017-01-13-16.09.36.png?expiry=1580342400000&hmac=BtVVPsHlU5dZvuqeDEO6O1fvz_ecDoE3eKzCYqQxq5E)

![image.png](https://i.loli.net/2020/03/01/MmVcA9tRoiNQj2E.png)

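> Initially $\lambda$ is 0, which means no regularization is applied; overfitting then leads to high variance.  
> If $\lambda$ is too large, the penalty is too heavy, $\Theta$ is driven toward 0, the fitted curve approaches a horizontal line, and underfitting leads to high bias.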
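
To make the effect of $\lambda$ concrete, here is a minimal sketch for a deliberately simplified model with a single parameter and no intercept, $h_\theta(x)=\theta x$. This model is an assumption made so that the regularized cost has a closed-form minimizer, and the data points are invented.

```python
def fit_regularized_slope(points, lam):
    """Minimize (1 / 2m) * [sum((theta * x - y)^2) + lam * theta^2].

    Setting the derivative to zero gives
    theta = sum(x * y) / (sum(x^2) + lam).
    """
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)


points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
lambdas = [0, 1, 10, 100, 1000]
thetas = [fit_regularized_slope(points, lam) for lam in lambdas]
for lam, theta in zip(lambdas, thetas):
    print(f"lambda={lam:>5}  theta={theta:.4f}")
```

As $\lambda$ grows, the fitted $\theta$ shrinks toward 0, which is the flattening toward a horizontal line described in the note above.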

## Learning Curves

### Experiencing high bias:  

![image.png](https://i.loli.net/2020/03/01/hvBzoCxA7piw9fb.png)

**Low training set size**: causes $J_{\text{train}}(\Theta)$ to be low and $J_{CV}(\Theta)$ to be high.  
**Large training set size**: causes both $J_{\text{train}}(\Theta)$ and $J_{CV}(\Theta)$ to be high with $J_{\text{train}}(\Theta) \approx J_{CV}(\Theta)$  

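> When N is small, the few training examples can be fit well, so $J_{train}(\Theta)$ is low, whereas parameters fit to so little data may happen not to match the cross validation data, so $J_{CV}(\Theta)$ is high.  
> When N is large, high bias (underfitting) causes the $J_{train}(\Theta)$ error to rise, while the larger amount of data improves the overall fit, so the $J_{CV}(\Theta)$ error falls.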

If a learning algorithm is suffering from high bias, **getting more training data will not (by itself) help much**.

> Because of underfitting, the error levels off at a relatively high value.

### Experiencing high variance:

![image.png](https://i.loli.net/2020/03/01/1iEyHSmnUtG8NC7.png)

**Low training set size**: $J_{train}(\Theta)$ will be low and $J_{C V}(\Theta)$ will be high.  
**Large training set size**: $J_{train}(\Theta)$ increases with training set size and $J_{C V}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta)<J_{C V}(\Theta)$ but the difference between them remains significant.

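> With high variance (overfitting), the training data is fit very well, so the $J_{train}(\Theta)$ error rises only slowly. Likewise, the $J_{CV}(\Theta)$ error falls only slowly, and the gap between the two stays large.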

If a learning algorithm is suffering from high variance, **getting more training data is likely to help**.
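
A learning curve can be computed by refitting on the first $m$ training examples for increasing $m$. The sketch below uses a deliberately too-simple model (always predict the mean of $y$) on linear data, so it illustrates the high bias case. The helper names and the data are assumptions made for the example.

```python
def fit_mean(points):
    """An underpowered model: always predict the mean of y."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def cost(h, points):
    """Squared-error cost J = (1 / 2m) * sum((h(x) - y)^2)."""
    return sum((h(x) - y) ** 2 for x, y in points) / (2 * len(points))


def learning_curve(fit, train, cv):
    """Return (m, J_train, J_cv) for each training set size m."""
    curve = []
    for m in range(1, len(train) + 1):
        subset = train[:m]
        h = fit(subset)
        # J_train is measured on the m examples used for fitting,
        # J_cv always on the full cross validation set.
        curve.append((m, cost(h, subset), cost(h, cv)))
    return curve


train = [(float(x), 2.0 * x) for x in [0, 4, 1, 5, 2, 6, 3, 7]]
cv = [(0.5, 1.0), (3.5, 7.0), (6.5, 13.0)]
curve = learning_curve(fit_mean, train, cv)
for m, j_train, j_cv in curve:
    print(f"m={m}  J_train={j_train:.2f}  J_cv={j_cv:.2f}")
```

With this model both errors end up high and close together, matching the high bias picture. A high variance model would instead leave a persistent gap between the two columns.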

## Deciding What to Do Next Revisited

Our decision process can be broken down as follows:
- **Getting more training examples**: Fixes high variance
- **Trying smaller sets of features**: Fixes high variance
- **Adding features**: Fixes high bias
- **Adding polynomial features**: Fixes high bias
- **Decreasing λ**: Fixes high bias
- **Increasing λ**: Fixes high variance.

### Diagnosing Neural Networks

- A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
- A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase λ) to address the overfitting.

Using a single hidden layer is a good starting default. You can train neural networks with different numbers of hidden layers on the training set, evaluate each one on your cross validation set, and then select the one that performs best.

# Building a Spam Classifier

## Prioritizing What to Work On

Given a data set of emails, we could construct a vector for each email. Each entry in this vector represents a word. The vector normally contains 10,000 to 50,000 entries gathered by finding the most frequently used words in our data set. If a word is found in the email, we assign its respective entry a 1; otherwise that entry is 0. Once we have all our x vectors ready, we train our algorithm, and finally we can use it to classify whether an email is spam or not.

![image.png](https://i.loli.net/2020/03/01/ml7TrEOK8JIZb5F.png)
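
As a minimal sketch of this feature construction, the function below maps an email to a 0/1 vector over a vocabulary. The five-word vocabulary, the tokenization rule, and the function name are assumptions for illustration. A real vocabulary would hold the most frequent words in the data set.

```python
import re


def email_to_features(email_text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the email."""
    # Lowercase and split on anything that is not a letter, digit or
    # apostrophe. This is a simple tokenization chosen for the sketch.
    words = set(re.findall(r"[a-z0-9']+", email_text.lower()))
    return [1 if word in words else 0 for word in vocabulary]


vocabulary = ["buy", "deal", "discount", "meeting", "now"]
x = email_to_features("Buy now! Best DEAL of the year.", vocabulary)
print(x)
```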

So how could you spend your time to improve the accuracy of this classifier?

- Collect lots of data (for example, through a "honeypot" project, although more data does not always help)
- Develop sophisticated features (for example: using email header data in spam emails)
- Develop algorithms to process your input in different ways (recognizing misspellings in spam).

It is difficult to tell which of the options will be most helpful.

## Error Analysis

The recommended approach to solving machine learning problems is to:  
- Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
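
The manual error analysis in the last step often ends in a simple tally. The sketch below counts hand-assigned categories for misclassified cross validation emails. The records and category names are made up for illustration.

```python
from collections import Counter

# Hypothetical hand-labelled errors from the cross validation set.
misclassified = [
    {"id": 12, "category": "phishing"},
    {"id": 31, "category": "pharma"},
    {"id": 47, "category": "phishing"},
    {"id": 58, "category": "replica"},
    {"id": 73, "category": "phishing"},
]

counts = Counter(example["category"] for example in misclassified)
most_common_category, n_errors = counts.most_common(1)[0]
print(most_common_category, n_errors)
```

The category with the most errors is a reasonable first candidate for new features or more data.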

# Handling Skewed Data

## Error Metrics for Skewed Classes

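Accuracy alone is a flawed way to judge a model. Suppose we are doing cancer classification and the algorithm ends up with only 1% error, that is, 99% accuracy. That looks very high, but if only 0.5% of the patients in the training set actually have cancer, the 1% error rate no longer looks so impressive. To take a more extreme example, if the classifier outputs 0 (no cancer) for every input, its accuracy is 99.5%, yet this metric clearly fails to reflect the classifier's performance.

This happens because the two classes are very unevenly represented: cancer examples are so rare that predictions get pulled toward one extreme. We call this the skewed classes problem.

So we need a different evaluation method. One such pair of evaluation metrics is precision and recall.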

![image.png](https://i.loli.net/2020/03/01/yU1qTcMmA9Plued.png)

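If, as before, the prediction y is always 0, accuracy is 99.5% but recall is 0%. This makes recall very useful for judging whether the algorithm is actually doing its job.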
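
The always-predict-0 case can be checked with a short sketch using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). Returning 0.0 when a denominator is zero is a convention assumed here, and the 200-patient data set is invented to match the 0.5% positive rate.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Assumed convention: report 0.0 when the ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# 1 cancer patient out of 200 (0.5%), and a classifier that always says 0.
y_true = [1] + [0] * 199
y_pred = [0] * 200

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```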

## Trading Off Precision and Recall

![image.png](https://i.loli.net/2020/03/01/wMtAe2oknbu4aUm.png)

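Take cancer prediction again. If we want to predict cancer only when we are more confident, we raise the threshold from the previous 0.5 to something higher, such as 0.7 or even 0.9, so that positive predictions are more reliable. Precision is then higher and recall is lower.

But we also do not want to miss people who do have cancer. In that case we lower the threshold, for example from 0.5 to 0.3, so that people who truly have cancer are more easily flagged. Recall is then higher and precision is lower.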

![image.png](https://i.loli.net/2020/03/01/HrTiaDQYPq9WbGt.png)

Precision and recall cannot both be maximized at once, so the F score is used to evaluate the trade-off.
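
The sketch below sweeps the threshold and scores each setting with the standard $F_1 = \frac{2PR}{P+R}$. The predicted probabilities, the labels, and the helper names are invented for illustration, and in practice the threshold would be chosen on the cross validation set.

```python
def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier outputs h(x) and true labels.
probs = [0.95, 0.8, 0.6, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0]

results = {}
for threshold in [0.3, 0.5, 0.7, 0.9]:
    preds = [1 if p >= threshold else 0 for p in probs]
    precision, recall = precision_recall(labels, preds)
    results[threshold] = (precision, recall, f1_score(precision, recall))
    print(threshold, results[threshold])

best_threshold = max(results, key=lambda t: results[t][2])
print("best threshold by F1:", best_threshold)
```

Raising the threshold trades recall for precision, and the F score gives a single number for comparing the settings.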

# Using Large Data Sets

## Data For Machine Learning

![image.png](https://i.loli.net/2020/03/01/eD9hNB7QVMqmYau.png)

![image.png](https://i.loli.net/2020/03/01/CwcEAjOragyNI3H.png)

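First, because we use a large number of features (parameters) to fit the training data, the training error is very small, that is, $J_{train}(\Theta)$ is very small. Second, because we supply a large amount of training data, overfitting is unlikely, so $J_{train}(\Theta) \approx J_{test}(\Theta)$. The hypothesis then has neither high bias nor high variance, which is why training on large data sets tends to yield more accurate results.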