# Structuring Machine Learning Projects

## Orthogonalization

- To achieve good performance, you need to **tune** the right "knobs" of the machine learning system so that, in order, it:
  - Performs well on the training set.

  - Then performs well on the development set (dev set).

  - Then performs well on the test set.
  
  - Finally, performs well in the real-world application, leading to user satisfaction.

- **Orthogonalization** helps identify and adjust specific "knobs" for each objective:
  
  - If the algorithm does not perform well on the training set, it may be necessary to increase the network size or use a different optimization algorithm like Adam.
  
  - If the performance on the dev set is not good, techniques such as **regularization** may need to be applied to improve generalization.
  
  - If the algorithm performs well on the dev set but not on the test set, it has probably been over-tuned to the dev set; consider using a larger dev set.

![Image](./image/Orthogonalization.png)
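
The idea of one set of knobs per objective can be sketched as a simple lookup. This is only an illustration: the `KNOBS` table, the function name, and the error and target numbers are made up for the example, and the knobs listed are just the ones mentioned above.

```python
# Hypothetical mapping from the first failing stage to the knobs listed above.
KNOBS = {
    "train": ["bigger network", "different optimizer (e.g. Adam)"],
    "dev": ["regularization"],
    "test": ["bigger dev set"],
}

def first_failing_stage(errors, targets):
    """Return the first stage (train, then dev, then test) whose error
    exceeds its target, or None if every stage meets its target."""
    for stage in ("train", "dev", "test"):
        if errors[stage] > targets[stage]:
            return stage
    return None

# Made-up error rates and targets, for illustration only.
errors = {"train": 0.02, "dev": 0.09, "test": 0.10}
targets = {"train": 0.03, "dev": 0.05, "test": 0.05}

stage = first_failing_stage(errors, targets)
print(stage, KNOBS.get(stage, []))
```

Checking the stages in order reflects the point of orthogonalization: fix the earliest failing objective with its own knobs before moving to the next one.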


## Single Number Evaluation Metric

- Confusion matrix:

  |               | Predicted cat | Predicted non-cat |
  |---------------|---------------|-------------------|
  | Actual cat    | 3             | 2                 |
  | Actual non-cat| 1             | 4                 |

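- **Precision**: of the examples predicted as cats, the fraction that are actually cats: P = 3/(3 + 1) = 75%

- **Recall**: of the examples that are actually cats, the fraction correctly predicted as cats: R = 3/(3 + 2) = 60%
  
- **Accuracy**: the fraction of all examples classified correctly: (3 + 4)/10 = 70%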

  | Classifier | Precision | Recall |
  |------------|-----------|--------|
  | A          | 95%       | 90%    |
  | B          | 98%       | 85%    |

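- **F1-Score**: the harmonic mean of precision and recall, $F_1 = \frac{2}{(1/P) + (1/R)}$. It combines both into a single number, which makes it easy to rank classifiers such as A and B above.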
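
As a quick check of the formulas, the sketch below computes the metrics for the confusion matrix above and the F1 score for classifiers A and B. The function names are chosen for this example only.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 / (1 / precision + 1 / recall)

# Confusion matrix from the table: TP = 3, FN = 2, FP = 1, TN = 4.
tp, fn, fp, tn = 3, 2, 1, 4
p, r = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"P={p:.2f} R={r:.2f} accuracy={accuracy:.2f} F1={f1_score(p, r):.3f}")

# Classifiers A and B from the precision/recall table.
f1_a = f1_score(0.95, 0.90)
f1_b = f1_score(0.98, 0.85)
print(f"F1(A)={f1_a:.3f} F1(B)={f1_b:.3f}")
```

With these numbers, A has the higher F1 score (about 0.924 versus 0.910), even though B has the higher precision.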


## Train/dev/test distributions

- Dev and test sets have to come from the same distribution.

- Choose a dev set and test set that reflect the data you expect to get in the future and consider important to do well on.

- Setting up the dev set and the evaluation metric defines the target you are aiming at.

## Size of the dev and test sets

- An older way of splitting the data was 70% training and 30% test, or 60% training, 20% dev, and 20% test. This is still reasonable for smaller datasets (roughly fewer than 100,000 examples).

- In modern deep learning, if you have a million or more examples, a reasonable split is 98% training, 1% dev, 1% test.
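
To make the percentages concrete, here is a small sketch that turns a split into example counts. The helper `split_sizes` is hypothetical and uses integer percentages to avoid rounding surprises; the 10,000-example dataset is an arbitrary small example.

```python
def split_sizes(n_examples, train_pct, dev_pct):
    """Return (train, dev, test) counts; the test set takes whatever is left."""
    n_train = n_examples * train_pct // 100
    n_dev = n_examples * dev_pct // 100
    n_test = n_examples - n_train - n_dev
    return n_train, n_dev, n_test

# Small dataset, classic 60/20/20 split.
print(split_sizes(10_000, 60, 20))

# One million examples, 98/1/1 split.
print(split_sizes(1_000_000, 98, 1))
```

Even at 1%, the dev and test sets of the large dataset contain 10,000 examples each, which is why such a small fraction can be enough.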


## When to change dev/test sets and metrics

### Issue:

- Sometimes, the evaluation metric (like classification error) does not reflect the actual priorities or objectives of the project.

- An algorithm (A) might have a low error rate but allow undesirable images (such as pornographic content) to pass through.

- Another algorithm (B) with a higher error rate could actually be better because it prevents undesirable images from appearing.

### Solution:

- Consider revising or changing the evaluation metric when the current one is no longer appropriate.

- Add weights to the examples in the dev set so that the metric reflects the severity of each error (e.g., errors on pornographic content get a higher weight).

- Define the evaluation metric so that it reflects what the application needs to do well on, rather than relying only on the number of errors.

- Modify the dev set so that it more accurately reflects the data the algorithm needs to perform well on.

- Change the assessment method or test set if they do not align with the actual goals of the product or service.
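
A weighted error metric of this kind could look like the sketch below. The weight of 10 for undesirable images, the function name, and the toy labels and predictions are all assumptions made for illustration; the notes do not specify them.

```python
def weighted_error(y_true, y_pred, weights):
    """Weighted classification error: sum of weights of misclassified
    examples divided by the sum of all weights."""
    wrong = sum(w for yt, yp, w in zip(y_true, y_pred, weights) if yt != yp)
    return wrong / sum(weights)

# Toy dev set: the 4th example is an undesirable image, so it gets weight 10.
y_true = [1, 0, 1, 0, 0]
weights = [1, 1, 1, 10, 1]
uniform = [1] * len(y_true)  # plain error is the special case of equal weights

pred_a = [1, 0, 1, 1, 0]  # one mistake, but on the undesirable image
pred_b = [0, 0, 0, 0, 0]  # two mistakes, both on ordinary images

for name, pred in (("A", pred_a), ("B", pred_b)):
    plain = weighted_error(y_true, pred, uniform)
    weighted = weighted_error(y_true, pred, weights)
    print(f"{name}: plain error={plain:.2f}, weighted error={weighted:.2f}")
```

In this toy case A wins on plain error but B wins on weighted error, which matches the situation described under "Issue".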


## Why human-level performance?

- The **Bayes optimal error** is the lowest error rate that any classifier could theoretically achieve; no function mapping inputs to outputs can do better.

- The gap between human-level error and the Bayes optimal error is usually small.

- Humans excel in numerous tasks, and when machine learning algorithms perform worse than humans, one can:

  - Obtain labeled data from human annotators.

  - Analyze errors manually to understand the reasoning behind correct human predictions.
  
  - Analyze bias and variance more effectively.

![Image](./image/WhyHuman.png)

## Avoidable Bias

- You want the algorithm to perform well on the training set, but not "too well": human-level performance indicates how low the training error can reasonably go before further fitting is just overfitting.

| Human-level error | Training error | Dev error | Focus on |
| ----------------- | -------------- | --------- | -------- |
| 1%                | 8%             | 10%       | Bias     |
| 7.5%              | 8%             | 10%       | Variance |

- Human-level error is used as a proxy (estimate) for the Bayes optimal error. The Bayes optimal error is never higher than human-level error, but in most cases human-level error is not far from it.

- Avoidable bias is the difference between the training error and the Bayes optimal error (estimated or actual).
    - $\text{Avoidable bias} = \text{Training error} - \text{Human-level error (proxy for Bayes error)}$

- Variance is the gap between the dev error and the training error.
    - $\text{Variance} = \text{Dev error} - \text{Training error}$
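
Applying the two formulas to the rows of the table above gives the following. This is a minimal sketch: the function name `diagnose` is made up, errors are given in percent, and the rule "focus on the larger gap" is the one the table's last column implies.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (avoidable_bias, variance, focus), with errors in percent."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

# Rows of the table: (human-level, training, dev) error in percent.
for row in ((1, 8, 10), (7.5, 8, 10)):
    bias, var, focus = diagnose(*row)
    print(f"human={row[0]}%: avoidable bias={bias}%, variance={var}%, focus on {focus}")
```

The training and dev errors are identical in both rows; only the human-level estimate changes, and that alone flips the recommendation.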



## Understanding Human-Level Performance


- How you define human-level performance depends on what you want to achieve with the system.

- You might have several human-level performance figures, depending on the experience of the humans involved. Choose the one (as a proxy for Bayes error) that best suits the system you are trying to build.

- Improving a deep learning algorithm becomes harder once it reaches human-level performance.

- Summary of bias/variance with human-level performance:
  1. Human-level error (proxy for Bayes error) and training error
     
     - Calculate: 
         $\text{avoidable bias} = \text{training error} - \text{human-level error}$
     
     - If the **avoidable bias** is the larger of the two gaps, it is a **bias** problem and you should use bias-reduction strategies.
  2. Training error and dev error
     
     - Calculate: 
         $\text{variance} = \text{dev error} - \text{training error}$
     
     - If the **variance** is the larger of the two gaps, you should use variance-reduction strategies.

- Having an estimate of human-level performance gives you an estimate of Bayes error, which lets you decide more quickly whether to focus on reducing bias or reducing variance.
- These techniques tend to work well until you surpass human-level performance, at which point you may no longer have a good estimate of Bayes error to guide that decision.
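
The choice of proxy matters most when the algorithm is already close to human-level performance. The sketch below uses hypothetical numbers (two candidate human-level errors of 0.5% and 0.7%, and two made-up models) to show this; none of these figures come from the notes.

```python
def focus_for_proxy(bayes_proxy, train_error, dev_error):
    """Return "bias" if avoidable bias is the larger gap, else "variance"."""
    avoidable_bias = train_error - bayes_proxy
    variance = dev_error - train_error
    return "bias" if avoidable_bias > variance else "variance"

# Hypothetical human-level errors in percent, e.g. a team of experts vs. one expert.
proxies = (0.5, 0.7)

# Hypothetical models: (training error, dev error) in percent.
models = {"far from human-level": (5.0, 6.0), "near human-level": (0.7, 0.8)}

for name, (train_error, dev_error) in models.items():
    decisions = [focus_for_proxy(p, train_error, dev_error) for p in proxies]
    print(name, dict(zip(proxies, decisions)))
```

For the model far from human-level, both proxies point to bias. For the model near human-level, the decision depends on which proxy is used, which is why the estimate of Bayes error has to be chosen carefully at that stage.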
