# Bias and Variance

# Diagnosing bias and variance

- High bias - underfitting

- High variance - overfitting


To find out whether your model has high bias or high variance, compare its error on the training set ($J_{train}$) with its error on the cross-validation (dev) set ($J_{cv}$).

- High bias: both $J_{train}$ and $J_{cv}$ will be high.

- High variance: $J_{train}$ will be low and $J_{cv}$ will be much greater than $J_{train}$.

- Just right: both $J_{train}$ and $J_{cv}$ will be low but $J_{cv}$ will be greater than $J_{train}$ by a small amount.

- High bias AND high variance: $J_{train}$ will be high and $J_{cv}$ will be much greater than $J_{train}$.
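The first two cases above can be reproduced with two deliberately extreme toy models. This is only a sketch: the data, the `cost` helper (assumed here to be the squared-error cost $\frac{1}{2m}\sum (f(x)-y)^2$) and both models are made up for illustration.

```python
def cost(predict, xs, ys):
    """Squared-error cost J = (1 / 2m) * sum((f(x) - y)^2), no regularization term."""
    m = len(xs)
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Made-up data: y = x^2, with cross-validation points between the training points.
x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [x ** 2 for x in x_train]
x_cv = [0.5, 1.5, 2.5, 3.5]
y_cv = [x ** 2 for x in x_cv]

# Underfit model (high bias): always predicts the mean training target.
mean_y = sum(y_train) / len(y_train)

def underfit(x):
    return mean_y

# Overfit model (high variance): memorizes the training set, predicts 0 for unseen inputs.
memory = dict(zip(x_train, y_train))

def overfit(x):
    return memory.get(x, 0.0)

for name, predict in [("underfit", underfit), ("overfit", overfit)]:
    j_train = cost(predict, x_train, y_train)
    j_cv = cost(predict, x_cv, y_cv)
    print(f"{name}: J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```

The memorizing model reaches zero training error but a large $J_{cv}$, while the constant model has a large error on both sets.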

# Regularization and bias/variance

**Very large $\lambda$:**
 
- A very large $\lambda$ pushes the parameters $w_j$ toward zero, because the regularization term dominates the cost function being minimized. Then $f_{\vec w,b}(\vec x)\approx b$, which is a nearly constant function (a horizontal line). So the model has high bias (underfitting).

- High bias (underfit)

**Very small $\lambda$:**

- A very small $\lambda$ (for example $\lambda = 0$) leaves the parameters essentially unregularized, so the $w_j$ are free to grow as large as needed to fit the training set. With a flexible model, $f_{\vec w,b}(\vec x)$ can then become a very complex function (a wiggly curve). So the model has high variance (overfitting).

- High variance (overfit)
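The shrinking effect of $\lambda$ can be checked on a one-feature linear model. This sketch assumes the regularized cost $\frac{1}{2m}\sum (wx+b-y)^2 + \frac{\lambda}{2m}w^2$, which has the closed-form minimizer $w = S_{xy}/(S_{xx}+\lambda)$, $b = \bar y - w\bar x$. The function name `fit_ridge_1d` and the data are hypothetical. A straight line cannot become wiggly, so this only shows the high-$\lambda$ end of the picture.

```python
def fit_ridge_1d(xs, ys, lam):
    """Minimize (1/2m) * sum((w*x + b - y)^2) + (lam/2m) * w^2 in closed form."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w = s_xy / (s_xx + lam)      # larger lam -> smaller w
    b = y_mean - w * x_mean      # b is not regularized
    return w, b

# Made-up, roughly linear data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

for lam in [0.0, 1.0, 100.0, 1e6]:
    w, b = fit_ridge_1d(xs, ys, lam)
    print(f"lambda={lam:>9}: w={w:.4f}, b={b:.4f}")
```

As $\lambda$ grows, $w$ approaches 0 and $b$ approaches the mean of the targets, so the prediction becomes a constant.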



## Choosing the regularization parameter $\lambda$

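Try different values of $\lambda$ on a grid. For each value, fit $\vec w, b$ on the training set and compute $J_{cv}(\vec w,b)$. Then pick the value of $\lambda$ that has the lowest $J_{cv}(\vec w,b)$.

Finally, evaluate the chosen model on the test set to estimate its generalization error. Because $\lambda$ was picked to minimize $J_{cv}$, $J_{cv}$ is an optimistic estimate, while the test set played no part in any choice.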
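A minimal sketch of that selection loop, again for a one-feature linear model with a closed-form fit. The helper names, the synthetic data and the grid of $\lambda$ values are all assumptions made for illustration. The costs used for comparison leave out the regularization term.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form fit of w*x + b with an L2 penalty (lam/2m) * w^2 on w only."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w = s_xy / (s_xx + lam)
    return w, y_mean - w * x_mean

def cost(w, b, xs, ys):
    """Unregularized squared-error cost, used for J_cv and J_test."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

random.seed(0)

def make_data(n):
    # Synthetic noisy linear data, made up for this example.
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + random.gauss(0, 2.0) for x in xs]
    return xs, ys

x_train, y_train = make_data(10)
x_cv, y_cv = make_data(10)
x_test, y_test = make_data(10)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical grid
results = []
for lam in lambdas:
    w, b = fit_ridge_1d(x_train, y_train, lam)             # fit on the training set
    results.append((cost(w, b, x_cv, y_cv), lam, w, b))    # score on the cv set

j_cv_best, best_lam, w_best, b_best = min(results)         # lowest J_cv wins
j_test = cost(w_best, b_best, x_test, y_test)              # reported once, at the end
print(f"best lambda={best_lam}, J_cv={j_cv_best:.3f}, J_test={j_test:.3f}")
```

The test set is touched only after $\lambda$ is fixed, so $J_{test}$ is not influenced by the selection.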

# Establishing a baseline level of performance

What is the level of error you can reasonably hope to get to?

- Human level performance

- Competing algorithms performance

- Guess based on experience


**Important**

Express the baseline performance, the training error and the validation error as error percentages. Then calculate two gaps: the gap between baseline performance and training error, and the gap between training and validation error. If the gap between baseline performance and training error is large, then you have a high bias problem. If the gap between training and validation error is large, then you have a high variance problem. Both gaps can be large at the same time.
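The gap comparison can be written as a small helper. In this sketch a large gap between baseline and training error is read as high bias, and a large gap between training and validation error as high variance. The function name `diagnose`, the `threshold` of 2 percentage points and the example error rates are assumptions, not fixed rules.

```python
def diagnose(baseline_err, train_err, cv_err, threshold=0.02):
    """Classify a model from its error rates (fractions, e.g. 0.108 for 10.8%).

    threshold is a hypothetical cut-off for what counts as a "large" gap.
    """
    bias_gap = train_err - baseline_err      # how far training error is from the baseline
    variance_gap = cv_err - train_err        # how much worse validation is than training
    high_bias = bias_gap > threshold
    high_variance = variance_gap > threshold
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "neither gap is large"

# Illustrative numbers only.
print(diagnose(0.106, 0.108, 0.148))  # gaps of 0.2 and 4.0 points
print(diagnose(0.106, 0.150, 0.155))  # gaps of 4.4 and 0.5 points
print(diagnose(0.106, 0.150, 0.197))  # gaps of 4.4 and 4.7 points
```

What counts as "large" depends on the application, so the threshold should be chosen per problem.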



# Learning Curves

Learning curves are plots of $J_{train}$ and $J_{cv}$ as a function of the number of training examples.

- y axis: error (both $J_{train}$ and $J_{cv}$)

- x axis: number of training examples

**$J_{train}$**

- When there are just a few training examples, the model can fit the training set very well (low error).

- As the number of training examples increases, the model cannot fit the training set as well (higher error).

- So $J_{train}$ will increase as the number of training examples increases.

**$J_{cv}$**

- When there are just a few training examples, the model cannot generalize well, so $J_{cv}$ will be high.

- As the number of training examples increases, the model can generalize better, so $J_{cv}$ will decrease.

- So $J_{cv}$ will decrease as the number of training examples increases.



**High bias**

- When the model has high bias, $J_{train}$ gets higher as the number of examples increases, then flattens out (plateaus). $J_{cv}$ gets lower as the number of examples increases, then also flattens out (plateaus), ending up close to $J_{train}$ with both at a high error.

- The model cannot improve beyond a certain point, no matter how many more examples you feed it.

- If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

- To fix this, you need to use a more sophisticated model (one with more parameters).
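The high-bias case can be sketched by fitting a straight line to data that is actually quadratic and recording both errors as the training set grows. Everything here (helper names, the quadratic `target`, the data points) is made up for illustration, and the exact shape of the curves depends on the order in which examples are added.

```python
def fit_line(xs, ys):
    """Ordinary least squares for f(x) = w*x + b (no regularization)."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w = s_xy / s_xx
    return w, y_mean - w * x_mean

def cost(w, b, xs, ys):
    """Squared-error cost J = (1 / 2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def target(x):
    # True relationship is quadratic, so a straight line underfits.
    return x ** 2

# 20 training inputs covering 0.0 .. 9.5 in a scrambled order, 20 cv inputs in between.
x_train_all = [((7 * i) % 20) / 2.0 for i in range(20)]
y_train_all = [target(x) for x in x_train_all]
x_cv = [0.25 + 0.5 * i for i in range(20)]
y_cv = [target(x) for x in x_cv]

sizes = list(range(2, 21))
j_train_curve, j_cv_curve = [], []
for m in sizes:
    w, b = fit_line(x_train_all[:m], y_train_all[:m])   # train on the first m examples
    j_train_curve.append(cost(w, b, x_train_all[:m], y_train_all[:m]))
    j_cv_curve.append(cost(w, b, x_cv, y_cv))            # always score on the full cv set

for m, jt, jc in zip(sizes, j_train_curve, j_cv_curve):
    print(f"m={m:2d}  J_train={jt:8.3f}  J_cv={jc:8.3f}")
```

With two examples the line fits the training set exactly. With all twenty, $J_{train}$ is well above zero and $J_{cv}$ has come down from its starting value, and adding more points from the same range would not let a straight line fit the curve any better.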

**High variance**

- When the model has high variance, $J_{train}$ still gets higher as the number of examples increases and $J_{cv}$ still gets lower, but $J_{cv}$ stays much higher than $J_{train}$: there is a large gap between the two curves.

- The baseline error might be higher than the training error. This is because the model is overfitting the training set, so it does unrealistically well on it.

- The model can improve with more data: as examples are added, $J_{cv}$ keeps coming down toward $J_{train}$, but it may take a lot of data to close the gap.

- If a learning algorithm is suffering from high variance, getting more training data is likely to help.

- You can also fix this by using a simpler model (one with fewer parameters) or stronger regularization.

# Deciding what to do next

**For high bias:**

- Try decreasing $\lambda$.

- Try adding more features.

- Try adding polynomial features ($x_1^2, x_2^2, x_1x_2, \dots$)


**For high variance:**

- Get more training data.

- Try increasing $\lambda$.

- Try smaller sets of features.

# Neural networks

- Bias-variance tradeoff: finding the balance between bias and variance that gives the best possible outcome.

**Neural networks**

- If you make your neural network large enough, you can almost always fit your training set well. But the larger the neural network, the more prone it is to overfitting. That is why you need to regularize properly to avoid overfitting.

**Steps**

1. Does it do well on the training set?

- If no: use a bigger network to reduce the high bias, then check again.

- If yes: go to step 2.

2. Does it do well on the dev set?

- If no: there's high variance. Get more data, then go back to step 1.

- If yes: Done!
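The two questions above amount to a short decision function. This is only a sketch: `next_step` and its boolean arguments are hypothetical, and deciding what "does well" means (for example, relative to a baseline) is left to the caller.

```python
def next_step(does_well_on_train, does_well_on_dev):
    """Return the suggested action for the two-question recipe."""
    if not does_well_on_train:
        # Step 1 failed: high bias.
        return "use a bigger network"
    if not does_well_on_dev:
        # Step 2 failed: high variance.
        return "get more data"
    return "done"

print(next_step(False, False))
print(next_step(True, False))
print(next_step(True, True))
```

The training-set check comes first on purpose: a dev-set result says little until the model fits the training set.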




**Important**

A large neural network will usually do as well or better than a smaller one, so long as regularization is chosen appropriately.

## Neural network regularization

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import L2

# Unregularized MNIST model
layer_1 = Dense(units=25, activation="relu")
layer_2 = Dense(units=15, activation="relu")
layer_3 = Dense(units=1, activation="sigmoid")
model = Sequential([layer_1, layer_2, layer_3])

# Regularized MNIST model
layer_1 = Dense(units=25, activation="relu", kernel_regularizer=L2(0.01))  # that's the value of lambda
layer_2 = Dense(units=15, activation="relu", kernel_regularizer=L2(0.01))
layer_3 = Dense(units=1, activation="sigmoid", kernel_regularizer=L2(0.01))
model = Sequential([layer_1, layer_2, layer_3])
```