# Evaluating a Model

<u>Regression:</u>
- Split data into training data and test data (Usually 70% to 30% or 80% to 20%)
- Then, train the model on the training data using gradient descent
- Then, after learning the parameters, use the cost function (without the regularization term) to see the error on the test data
- Also, you can find the error on the train data to compare to the test data
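
The regression steps above can be sketched as follows. This is a minimal illustration, assuming a single feature, synthetic data, and hand-picked values for the learning rate and step count; the names `train_linear` and `cost` are made up for this example.

```python
import random

def train_linear(xs, ys, lam=0.0, lr=0.5, steps=2000):
    """Fit y ~ w*x + b with batch gradient descent; lam is the L2 strength."""
    w, b, m = 0.0, 0.0, len(xs)
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errs, xs)) / m + (lam / m) * w
        grad_b = sum(errs) / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def cost(w, b, xs, ys):
    """Squared-error cost J = (1/2m) * sum(err^2), with no regularization term."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Synthetic data (an assumption for this sketch): y = 2x + 1 plus noise.
rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]

split = len(xs) * 7 // 10  # 70% train, 30% test
x_train, y_train = xs[:split], ys[:split]
x_test, y_test = xs[split:], ys[split:]

w, b = train_linear(x_train, y_train)
j_train = cost(w, b, x_train, y_train)
j_test = cost(w, b, x_test, y_test)
print(f"J_train = {j_train:.4f}, J_test = {j_test:.4f}")
```

Note that `lam` only affects training; both reported errors use the plain cost.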

<u>Classification:</u>
- Split data into training and test data (Usually 70% to 30% or 80% to 20%)
- Then, train the model on the training data using gradient descent
- Then, after learning the parameters, find the fraction of misclassified labels on the test data and that's your error
- Also, you can find the fraction on the train data (train data error) to compare to the test data
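
For classification, the error is just a count of wrong labels divided by the number of examples. A small sketch, with made-up labels and a hypothetical helper name:

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of examples whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Made-up test labels and predictions for illustration.
y_test = [1, 0, 1, 1, 0]
y_hat = [1, 0, 0, 1, 1]
test_error = misclassification_rate(y_test, y_hat)
print(test_error)  # 2 of 5 labels are wrong -> 0.4
```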

# Choosing a Model

<u>Regression:</u> (Linear or Polynomial? and what degree?)
- Split data into training, cross-validation, and test set (Usually 60%, 20%, 20%)
- Train model on the training data using gradient descent
- Then, after learning the parameters, use the cost function (without the regularization term) to find the error on the training and cross-validation sets
- Repeat the last two steps for polynomials of different degrees (degree 1 to degree 10 let's say), keeping the same split
- Choose the degree that results in the lowest cross-validation error, then report the error of that chosen model on the test set
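
A rough sketch of this selection loop is below. To keep it short, it fits each polynomial with closed-form least squares (normal equations) instead of gradient descent, uses synthetic quadratic data, and only tries degrees 1 to 5; all of these are assumptions of the example, not part of the procedure itself.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit by solving (A^T A) w = A^T y."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(ata[i][k] * w[k] for k in range(i + 1, n))
        w[i] = (aty[i] - s) / ata[i][i]
    return w

def cost(w, xs, ys):
    """J = (1/2m) * sum of squared errors, with no regularization term."""
    preds = [sum(c * x ** i for i, c in enumerate(w)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / (2 * len(xs))

rng = random.Random(0)
xs = [rng.uniform(-2, 2) for _ in range(100)]
ys = [x ** 2 - x + rng.gauss(0, 0.1) for x in xs]

# 60% train, 20% cross-validation, 20% test.
x_train, y_train = xs[:60], ys[:60]
x_cv, y_cv = xs[60:80], ys[60:80]
x_test, y_test = xs[80:], ys[80:]

models = {d: fit_polynomial(x_train, y_train, d) for d in range(1, 6)}
cv_errors = {d: cost(w, x_cv, y_cv) for d, w in models.items()}
best_degree = min(cv_errors, key=cv_errors.get)

# The test set is touched only once, for the chosen model.
j_test = cost(models[best_degree], x_test, y_test)
print(best_degree, round(cv_errors[best_degree], 4), round(j_test, 4))
```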

<u>Classification:</u> (Which neural network architecture to try, which classification model to try?)
- Split data into training, cross-validation, and test set (Usually 60%, 20%, 20%)
- Train model on the training data using gradient descent
- Then, after learning the parameters, find the fraction of misclassified labels on the training and cross-validation sets
- Repeat the last two steps for all the different classification models, keeping the same split
- Choose the model/architecture that results in the lowest cross-validation error, then report the error of that chosen model on the test set

NOTE: You can't choose between models by comparing their test set errors, because the test set should be used only for estimating the accuracy/error of the final model and should not be used for anything else. If you pick the model based on the test set, you have effectively "fit" that choice to the test set, and the reported test error becomes an overly optimistic estimate.

NOTE: In Machine Learning, it is considered best practice to use only the Training or Cross Validation Set to make decisions about your model, such as fitting parameters or choosing the model architecture, and not look at the test set at all when making these decisions. Then, once you've come up with your final model, you evaluate it on your test set.

# Improving Performance of a Model

$J_{train}$ = error on training data

$J_{cv}$ = error on cross-validation set

<u>High Bias (Underfitting):</u>
- $J_{train}$ is high
- $J_{cv}$ is high

<u>High Variance (Overfitting):</u>
- $J_{train}$ is low
- $J_{cv}$ is high

<u>Just Right:</u>
- $J_{train}$ is low
- $J_{cv}$ is low

![](2022-07-22-15-08-28.png)

### <u>Regularization</u>

![](2022-07-22-16-12-08.png)

How to find the regularization parameter $\lambda$:
- Try $\lambda = 0$ and observe $J_{cv}$
- Try $\lambda = 0.01$ and observe $J_{cv}$
- Try $\lambda = 0.02$ and observe $J_{cv}$
- Try $\lambda = 0.04$ and observe $J_{cv}$

    ...
- Try $\lambda = 10$ and observe $J_{cv}$
- Finally, use the lambda that results in the lowest $J_{cv}$
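
The search over $\lambda$ can be sketched like this. The example assumes a one-parameter model $y \approx wx$ so that the regularized fit has a closed form, and it assumes the "..." in the list means doubling $\lambda$ each time (ending near 10); both are simplifications made for illustration.

```python
import random

def fit_ridge(xs, ys, lam):
    """Closed-form minimizer of (1/2m)*sum((w*x - y)^2) + (lam/2m)*w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cost(w, xs, ys):
    """Unregularized squared-error cost, used for J_cv."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Synthetic data (an assumption for this sketch).
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [3 * x + rng.gauss(0, 0.5) for x in xs]
x_train, y_train = xs[:30], ys[:30]
x_cv, y_cv = xs[30:], ys[30:]

# 0, 0.01, 0.02, 0.04, ..., 10.24
lambdas = [0.0] + [0.01 * 2 ** k for k in range(11)]
j_cv = {lam: cost(fit_ridge(x_train, y_train, lam), x_cv, y_cv) for lam in lambdas}
best_lam = min(j_cv, key=j_cv.get)
print(best_lam, round(j_cv[best_lam], 4))
```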

![](2022-07-22-16-22-37.png)

### <u>Establish a Baseline Level of Performance</u>
- What is the level of error you can reasonably hope to get to?
    - Human level performance
    - Competing algorithms performance
    - Guess based on experience
- This baseline will help you to see whether the training error/CV error is low or high
    - Ex. Speech Recognition
        - Human performance error = 10.6%
        - Train error = 10.8%
        - CV error = 14.8%
        - Seen on its own, the train error may seem very high. However, it is only 0.2 percentage points above human performance, while the CV error is 4.0 points above the train error. Therefore, this model is more likely overfitting (high variance) than underfitting (high bias).
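
The comparison against a baseline can be written as a small helper. This is only a sketch: the function name and the cutoff `gap` (how many percentage points count as a "big" gap) are assumptions, and in practice the cutoff is a judgment call.

```python
def diagnose(baseline, j_train, j_cv, gap=1.0):
    """Label the dominant problem; `gap` (percentage points) is an assumed cutoff."""
    high_bias = (j_train - baseline) > gap       # far above the baseline
    high_variance = (j_cv - j_train) > gap       # CV error far above train error
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "just right"

# Speech recognition numbers from the example above.
print(diagnose(baseline=10.6, j_train=10.8, j_cv=14.8))  # high variance
```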

### <u>Learning Curves</u>

![](2022-07-22-15-58-29.png)
- The x-axis plots the size of the training set
- The y-axis plots the errors
- With high bias, both error curves will flatten out and there will be a big gap between $J_{train}$ and the baseline performance
- NOTE: With high bias, you should not try to get more training data because that doesn't help

![](2022-07-22-16-03-45.png)
- With high variance, $J_{train}$ stays low while $J_{cv}$ is much higher, so there will be a big gap between $J_{train}$ and $J_{cv}$
- Also, the human level (baseline) error may actually be above $J_{train}$, meaning the training error is unrealistically low
- NOTE: With high variance, you should try to get more training data as it will help because $J_{cv}$ will keep coming down approaching the $J_{train}$ curve
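
A learning curve is just $J_{train}$ and $J_{cv}$ recorded for growing training set sizes. The sketch below assumes synthetic quadratic data fitted with a straight line (a deliberately high bias setup) and uses a closed-form line fit instead of gradient descent to stay short.

```python
import random

def fit_line(xs, ys):
    """Closed-form least-squares line y ~ w*x + b."""
    m = len(xs)
    x_mean, y_mean = sum(xs) / m, sum(ys) / m
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / sxx
    return w, y_mean - w * x_mean

def cost(w, b, xs, ys):
    """Unregularized squared-error cost J = (1/2m) * sum(err^2)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

rng = random.Random(2)
xs = [rng.uniform(-2, 2) for _ in range(120)]
ys = [x ** 2 + rng.gauss(0, 0.2) for x in xs]  # quadratic data, so a line underfits
x_train, y_train = xs[:100], ys[:100]
x_cv, y_cv = xs[100:], ys[100:]

curve = []  # rows of (training set size, J_train, J_cv)
for m in (2, 5, 10, 25, 50, 100):
    w, b = fit_line(x_train[:m], y_train[:m])
    curve.append((m, cost(w, b, x_train[:m], y_train[:m]), cost(w, b, x_cv, y_cv)))

for m, j_train, j_cv in curve:
    print(f"m={m:3d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```

Plotting the last two columns against `m` gives the two curves in the figures above.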

### <u>Error Analysis</u>

Assume the cross-validation set has $N_{cv} = 5000$ examples and your algorithm misclassifies 1000 of them
- You can manually examine 100 or 200 of the misclassified examples and categorize them based on common traits
- This analysis can tell you if you're missing some feature
- This manual examination can also tell you which error categories occur rarely, so there's no point spending a lot of time fixing them
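
Tallying the categories can be as simple as a counter. The tags below are hypothetical and were chosen only to illustrate the bookkeeping; one example may carry several tags.

```python
from collections import Counter

# Hypothetical tags assigned by hand to a sample of misclassified emails.
examined = [
    {"pharma"}, {"phishing"}, {"pharma", "misspellings"}, {"phishing"},
    {"pharma"}, {"unusual routing"}, {"phishing", "misspellings"}, {"pharma"},
]

counts = Counter(tag for tags in examined for tag in tags)
for tag, n in counts.most_common():
    print(f"{tag}: {n} of {len(examined)}")
```

Categories at the top of the list are the ones worth working on first.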

Note that Error Analysis is easier if the job can be done by a human (seeing if an email is spam or not) but it's harder if the job cannot be easily done by a human (predicting which ad a user will click on).


### <u>What to try Next after the Diagnostics</u>

High Variance (Overfitting):
- Get more training examples
- Try smaller sets of features
- Try increasing $\lambda$

High Bias (Underfitting):
- Try getting additional features
- Try adding polynomial features
- Try decreasing $\lambda$
- NOTE: Reducing training set size doesn't help

How to add more data?
- Add specific data
    - Let's say error analysis tells you that 50 misclassified emails are from pharma companies
    - Then, you can add more pharma spam data into your training data so the model does better
- Data Augmentation
    - Modifying an existing training example to create a new training example
    - The new training example should be representative of the types of noise or distortions that come up in the test set
- Data Synthesis
    - Creating a brand new training example
    - Again, like Data Augmentation, the new training example should be representative of examples in the test set
- Transfer Learning
    - For when you cannot make/collect more data
    - Train your model on a different task with a large dataset
    - Then, use the trained model and train it again (fine-tune it) on the intended task
    - The intuition is that by pretraining your model, the parameters start off at a much better place, which can lead to a better final model
    - Note that there are many already trained neural networks online and you can use them to perform Transfer Learning
    - Note that the pretrained model should have the same type of input as your final model because otherwise pretraining won't do much good
        - Ex. You can pretrain on Cats & Dogs images and use that for training on Handwritten digit images (both inputs are images)

    ![](2022-07-22-17-53-11.png)

### <u>Bias/Variance for Neural Networks</u>

![](2022-07-22-16-37-26.png)
- Training a bigger neural network does reduce bias but it can get computationally expensive
- Also, sometimes you may not be able to get more data

<u>A large neural network will usually do as well or better than a smaller one so long as regularization is chosen appropriately.</u>
- Regularizing a neural network in code (in the case below, $\lambda$ is 0.01):

    ![](2022-07-22-16-42-07.png)

# Machine Learning Development Process

![](2022-07-22-16-49-57.png)

# Full Cycle of a Machine Learning Project

1. Define Scope of Project
    - Ex. Speech Recognition for Voice Search
2. Collect Data
3. Refine Data
4. Train model
5. Error analysis & iterative improvement (may loop back to collecting more data and retraining)
6. Deploy in Production and maintain system

# Imbalanced Data

<u>Confusion Matrix</u>

![](2022-07-22-18-41-04.png)

Let's say 1 means that the patient has the rare disease and 0 means the patient doesn't

<u>Error metric:</u> We use different error metrics called Precision and Recall for imbalanced data
- Precision = of all patients where we predicted y = 1, what fraction actually have the rare disease? $$\frac{\text{True positives}}{\text{Total Predicted Positive}} = \frac{\text{True positives}}{\text{True pos + False pos}}$$
- Recall = of all patients that actually have the rare disease, what fraction did we correctly detect as having it? $$\frac{\text{True positives}}{\text{Total Actual Positive}} = \frac{\text{True positives}}{\text{True pos + False neg}}$$
- With these metrics, you can see if your model is just predicting the majority class (y = 0) all the time: if it is, then True Positives and False Positives are both 0, making Precision undefined (0/0) and Recall = 0
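
Both metrics follow directly from the confusion matrix counts. A minimal sketch with made-up labels; returning `None` for an undefined ratio is a choice made for this example.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class y = 1; None if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
print(precision_recall(y_true, y_pred))        # TP=2, FP=1, FN=1

# A model that always predicts the majority class (0).
always_zero = [0] * len(y_true)
print(precision_recall(y_true, always_zero))   # (None, 0.0)
```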


<u>Precision and Recall Trade-off</u>

Usually, we want both Precision and Recall to be high but that cannot always happen

High Threshold
- Let's say that a disease is not very lethal but its treatment is very expensive
- In this case, you may want to raise your threshold so you predict 1 only when the predicted probability is 70% or higher
- In this case, precision increases and recall decreases

Low Threshold
- Let's say that a disease's treatment is not expensive but if not treated, it can be lethal
- In this case, you may want to lower your threshold so you predict 1 whenever the predicted probability is 30% or higher
- In this case, precision decreases and recall increases
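
The effect of the threshold can be seen on a toy example. The probabilities and labels below are invented for illustration, and the helper names are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class; None if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

def predict_with_threshold(probs, threshold):
    """Predict 1 when the model's probability is at or above the threshold."""
    return [1 if p >= threshold else 0 for p in probs]

# Made-up model outputs and true labels.
probs = [0.9, 0.8, 0.6, 0.4, 0.35, 0.2]
y_true = [1, 1, 0, 1, 0, 0]

high = precision_recall(y_true, predict_with_threshold(probs, 0.7))
low = precision_recall(y_true, predict_with_threshold(probs, 0.3))
print("threshold 0.7:", high)  # fewer positives predicted: higher precision
print("threshold 0.3:", low)   # more positives predicted: higher recall
```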

![](2022-07-22-18-57-54.png)

<u>F1-Score:</u> the harmonic mean of the precision (P) and recall (R) -- gives emphasis to whichever value is lower $$F_1 = \frac{1}{\frac{1}{2}\left(\frac{1}{P} + \frac{1}{R}\right)} = \frac{2PR}{P + R}$$
- This is because it turns out that if an algorithm has very low precision or recall, it is probably not very useful
- The F1 score combines precision and recall into a single number, which makes it easier to compare models that sit at different points of the trade-off
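
A quick numeric check shows how the harmonic mean punishes a low value. The two (P, R) pairs are illustrative, and returning 0 when both values are 0 is a convention assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.5, 0.4)    # both values moderate
lopsided = f1_score(0.02, 1.0)   # perfect recall, very low precision

print(round(balanced, 3))
print(round(lopsided, 3))
# The plain average of 0.02 and 1.0 would be 0.51, which hides the low precision.
```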