# 05.01 - The Bias-Variance Tradeoff

## Bias and Variance

In model fitting, most problems we encounter trace back to either **high bias** or **high variance**; in other words, we are dealing with an **underfitting** or an **overfitting** problem. Locating a problem on the **bias-variance** spectrum guides the choice of strategy for improving the model's performance.

High **bias** is a sign of underfitting (the model is oversimplified), leading to high error on both the training and test data. High **variance** is a sign of overfitting (the model is overcomplicated), leading to high error on the test data but low error on the training data.

Here are some common strategies to handle **bias** and **variance**:

- Acquiring more data: This can help improve a model if we are experiencing high variance (overfitting). With more data, our model has more learning opportunities, which can lead to improved generalization.
- Improving the quality of data: Sometimes the problem isn't the quantity of data, but the quality. Cleaning up your dataset may lead to improvements in your model.
- Increasing model complexity: This is useful when we are suffering from high bias (underfitting). A more complex model might be able to better learn from the data and capture the patterns present.
- Decreasing model complexity: If our model is suffering from high variance (overfitting), reducing the complexity of the model can help. This means making the model simpler so it's less prone to learning the noise in the data.
- Regularizing: Regularization adds a penalty that constrains (shrinks) the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to reduce the risk of overfitting.
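
To make the shrinkage in the last bullet concrete, here is a minimal sketch using ridge (L2) regularization, which is only one common form of regularization and is chosen here as an assumption. The toy data, the degree-9 polynomial features, the helper `ridge_fit`, and the `lambdas` values are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy sine curve on [-1, 1].
x = rng.uniform(-1.0, 1.0, 30)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, 30)

# Degree-9 polynomial features: a deliberately flexible model.
X = np.vander(x, 10)  # columns are x^9, x^8, ..., x^0

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^(-1) X^T y.

    For brevity the penalty also applies to the intercept column.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# A larger penalty shrinks the coefficient vector towards zero.
lambdas = [1e-4, 1e-2, 1.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:g}  ||coefficients||={norm:.3f}")
```

The printed norms fall as `lam` grows: the fitted curve becomes smoother, trading some extra bias for lower variance.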

Our main goal with modeling is to make our predictions **generalizable**. Understanding the **bias-variance** tradeoff helps us grasp the concepts of **underfitting** and **overfitting**.

### Sum of Squared Errors (SSE)

The Sum of Squared Errors (SSE) is a commonly used measure to determine the difference between predicted and actual values in a model. It's a way to measure how well the model fits the data.

Let's break down this equation:

$SSE = \sum_{i=1}^{n}{(y_i - \hat{y}_i)^2}$

1. **$y_i$**: This represents the actual value from the data. The 'i' means that it's the value at the i-th position in our list of data. So, if we have 10 data points, 'i' could be any number from 1 to 10.
2. **$\hat{y}_i$**: This is the predicted value from the model for the i-th data point. The 'hat' over the y means it's an estimate.
3. **$y_i - \hat{y}_i$**: This is the difference between the actual and predicted value for the i-th data point. If our model is perfect, this difference would be zero because the predicted value would be exactly the same as the actual value.
4. **$(y_i - \hat{y}_i)^2$**: We square the difference. Squaring makes sure no difference is negative (the square of any number, positive or negative, is never below zero). This prevents positive and negative differences from cancelling each other out.
5. **$\sum_{i=1}^{n}$**: The big 'Σ' symbol means we add up these squared differences for all the data points in our dataset, with 'i' running from 1 to 'n', the total number of data points. This sum gives us a single number that tells us how well our model fits the data: the smaller the SSE, the better the fit.
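
The steps above translate almost line for line into code. The following is a small sketch; the function name `sum_of_squared_errors` and the four data points are made up for illustration.

```python
import numpy as np

def sum_of_squared_errors(y_true, y_pred):
    """Return the sum over all data points of (actual - predicted)^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred          # step 3: the differences
    return float(np.sum(residuals ** 2)) # steps 4 and 5: square, then sum

# Hypothetical actual values and model predictions.
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.0, 8.0, 9.5]

sse = sum_of_squared_errors(y, y_hat)
print(sse)  # 0.25 + 0.0 + 1.0 + 0.25 = 1.5
```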

### Decomposing Error, $E[SSE]$

The expected value of SSE can be broken down into 3 components. The expectation is the average error we would see if we refit our model many times, each time on a different training sample drawn from the same process.

1. **Bias squared** ($\text{bias}^2$)
$(E[\hat{y}] - y)^2$
    - **$E[\hat{y}]$**: This represents the average prediction of our model, taken over the many training samples it could have been fit on. The 'E' stands for expectation, which is a fancy word for average.
    - **$y$**: This is the true value we are trying to predict.
    - **$(E[\hat{y}] - y)^2$**: The difference $E[\hat{y}] - y$ between our average prediction and the true value is the bias: it represents how far off our predictions are from the truth, on average. We square it so that it is never negative, which gives the $\text{bias}^2$ term.
2. **Variance**
$E[(\hat{y} - E[\hat{y}])^2]$
    - **$\hat{y}$**: This is a predicted value from our model.
    - **$E[\hat{y}]$**: Once again, this is the average prediction of our model.
    - **$\hat{y} - E[\hat{y}]$**: This is the difference between a predicted value and the average prediction. It represents how much individual predictions vary around the average prediction.
    - **$E[(\hat{y} - E[\hat{y}])^2]$**: We square this difference and then take the average. The square has to sit inside the expectation, because the unsquared differences average to exactly zero. This gives us the variance, which represents how spread out our predictions are. If the variance is high, our model's predictions are very spread out and inconsistent.
3. **$\sigma^2$** - Irreducible Error
    - **$\sigma^2$**: This represents the irreducible error: error that we can't reduce no matter how good our model is. It comes from randomness or natural variability in the system we're trying to model, and no model can predict random noise.

Finally, we add up all these components to get the expected SSE:

$E[SSE] = \text{bias}^2 + \text{variance} + \sigma^2$

This equation tells us that the total error in our model is a combination of bias, variance, and irreducible error. Understanding these components can help us improve our model.
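
The decomposition can be checked with a small simulation. This is a sketch under stated assumptions, not a proof: the true function `true_f`, the noise level `sigma`, the test point `x0`, and the straight-line model are all hypothetical choices. The "truth" $y$ is taken to be the noise-free value `true_f(x0)`, and the error is measured at that single point rather than summed over a dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Hypothetical true relationship (unknown in practice)."""
    return np.sin(2 * np.pi * x)

sigma = 0.3       # standard deviation of the noise, so sigma**2 is irreducible
n_train = 30      # points in each training sample
n_trials = 2000   # number of times the model is refit on fresh data
degree = 1        # a straight line: too simple for a sine curve
x0 = 0.25         # the point at which predictions are compared

# Refit the model on many independent training samples and predict at x0.
preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(0.0, 1.0, n_train)
    y = true_f(x) + rng.normal(0.0, sigma, n_train)
    coeffs = np.polyfit(x, y, degree)
    preds[t] = np.polyval(coeffs, x0)

bias_sq = (preds.mean() - true_f(x0)) ** 2   # (E[y_hat] - y)^2
variance = preds.var()                       # E[(y_hat - E[y_hat])^2]

# Squared error against fresh noisy observations at x0.
y0 = true_f(x0) + rng.normal(0.0, sigma, n_trials)
expected_sq_error = np.mean((y0 - preds) ** 2)

print(f"bias^2 + variance + sigma^2 = {bias_sq + variance + sigma**2:.4f}")
print(f"simulated expected error    = {expected_sq_error:.4f}")
```

The two printed numbers should agree up to simulation noise. With a straight line fitted to a sine curve, most of the error comes from the $\text{bias}^2$ term.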

### Bias and Variance in Simple Terms

First, let's break down the concepts:

- **Bias**: Imagine you're playing darts. If you keep hitting the board to the left of the bullseye, that's bias. Your throws are consistently off-target. In terms of a model, a high bias means our model makes a lot of incorrect assumptions and misses the real trends. In other words, our model is oversimplified and has a high error rate.
- **Variance**: Back to the darts game. This time, your throws are scattered all over the place. Sometimes you hit the bullseye, sometimes you miss the board entirely. This is variance. In terms of a model, high variance means our model is too sensitive to the data. It's complicated and responds too much to the noise or fluctuations in the data, leading to a high error rate on new, unseen data.

The goal is to have both low bias and low variance. This would mean our model is right on target: it makes good assumptions and doesn't overreact to noise in the data.

### Underfitting vs Overfitting

These are two problems we can run into when training our model:

- **Underfitting** (High Bias): This is like trying to fit a straight line through a curved trend. The model is too simple to capture the pattern in the data. This leads to high bias, as our model is making incorrect assumptions about the data.
- **Overfitting** (High Variance): This is like trying to trace every single data point with a complicated curve. The model is too complex and is picking up on noise or random fluctuations in the data. This leads to high variance, as our model will not generalize well to new, unseen data.

So, in simple terms: underfitting is when our model is too simple and misses the trends, while overfitting is when our model is too complex and gets confused by the noise.

Remember: the aim is to find the sweet spot in the middle where our model is just right. This sweet spot balances bias and variance, leading to the most accurate predictions.
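
A short experiment makes the contrast visible. The sketch below assumes noisy samples of a sine curve and uses polynomial degree as a stand-in for model complexity. The degrees 1, 3, and 12 and the helper names `make_data` and `mse` are illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def make_data(n):
    """Hypothetical data: noisy samples of sin(2*pi*x) on [0, 1]."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(500)     # unseen data from the same process

def mse(model, x, y):
    """Mean squared error of a fitted model on (x, y)."""
    return float(np.mean((y - model(x)) ** 2))

results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    train_err, test_err = results[degree]
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The degree-1 model has high error on both sets (underfitting). The degree-12 model has the lowest training error but a clearly higher test error (overfitting). A moderate degree typically sits near the sweet spot.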

## Why Does Bias Occur?

![high_bias.png](./high_bias.png)

- Our model is too simple for our data.

This means that the mathematical equation we are using to predict values (our model, represented by $\hat{y}$) doesn't capture all the complexities of the actual data. Imagine trying to draw a straight line through a cloud of points that clearly forms a curve - the straight line is too simple to accurately represent the curve.

- On average, our model $\hat{y}$ is going to be far from the truth $y$.

In this case, $y$ represents the real, true values from our data. The difference between our model's average prediction $E[\hat{y}]$ and the true values $y$ is the bias. If our model is too simple, then its predictions are going to be far off from the truth on average, meaning the bias is high.

- For example, I tried to model a curved relationship with a straight line.

This is a concrete example of a model being too simple. If the relationship between the variables in our data is curved, but we try to fit a straight line to it, our model won't be accurate. The straight line is too simple to capture the curved relationship.

- When we rely on simplifying assumptions that aren't valid (e.g., linearity), we can run into high bias.

Sometimes, we make assumptions about our data to make it easier to work with. One common assumption is linearity, which means we assume the relationship between variables is a straight line. But, if this assumption is not valid (like in the previous example where the relationship is curved, not straight), our model will have high bias.

- Linear regression is a method where we might suffer from high bias.

Linear regression is a type of modeling where we try to fit a straight line to our data. As we've discussed, if the true relationship is not a straight line, this model can have high bias because it's too simple.
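
One way to recognize high bias is that collecting more data does not fix it. The following sketch assumes a hypothetical quadratic relationship with unit noise variance and compares a straight line against a quadratic fit as the training set grows. `make_data` and `heldout_mse` are illustrative helper names.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Hypothetical curved relationship: y = x^2 + noise (noise variance 1)."""
    x = rng.uniform(-3.0, 3.0, n)
    y = x ** 2 + rng.normal(0.0, 1.0, n)
    return x, y

x_test, y_test = make_data(5000)

def heldout_mse(degree, x_train, y_train):
    """Fit a polynomial of the given degree and score it on the test set."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((y_test - np.polyval(coeffs, x_test)) ** 2))

errors = {}
for n in (50, 500, 5000):
    x_train, y_train = make_data(n)
    errors[n] = (heldout_mse(1, x_train, y_train), heldout_mse(2, x_train, y_train))
    line_err, quad_err = errors[n]
    print(f"n={n:4d}  line MSE={line_err:.2f}  quadratic MSE={quad_err:.2f}")
```

The straight line's error stays far above the noise level no matter how large `n` gets, while the quadratic's error sits close to the noise variance of 1. The extra error of the line is bias, not a shortage of data.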

## Why Does Variance Occur?

![high_variance.png](./high_variance.png)

Variance occurs when our model is too complex. In other words, the model is overly sensitive to the specific details and noise in our data.

Let's break down these points further:

- **Our model $\hat{y}$ matches our data too closely.**
    
    This might sound like a good thing at first, but imagine you're trying to draw a smooth curve through a bunch of points. If you make your curve twist and turn to go through every single point, it's going to look more like a scribble than a curve. This is an example of a model that matches the data too closely. It's overcomplicated and it's probably picking up on the noise (random fluctuations) in the data, rather than the overall trend.
    
- **May not perform well on data it hasn't seen yet.**
    
    When a model is too complex, it's great at predicting the data it was trained on because it's learned all the small details and noise in that data. However, when we try to use the model to predict new data it hasn't seen before, it performs poorly. This is because the noise it learned from the training data doesn't apply to the new data. It's like studying for a test by memorizing the answers to the practice questions, instead of understanding the underlying concepts. If the questions on the test are different from the practice questions, you're going to do poorly.
    
- **We may not have enough data.**
    
    If we don't have a lot of data, our model might latch onto the noise in the data and mistake it for a real trend. This leads to a complex model that doesn't generalize well to new data. The more data we have, the easier it is for our model to figure out what's noise and what's an actual trend.
    
- **Our model may "fit" very well to data it's seen, but not generalize well to data it hasn't.**
    
    As mentioned before, a complex model will "fit" very well to the data it was trained on because it's learned all the small details and noise. But because those details and noise don't apply to new data, the model doesn't generalize well. "Generalizing" means being able to make accurate predictions on new, unseen data. A good model is one that can generalize well, not just fit well to the training data.
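
The effect of training-set size on variance can be sketched directly. The example below assumes a flexible degree-9 polynomial and noisy sine data (both hypothetical choices) and measures how much the prediction at one point `x0` changes from one training sample to the next.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def prediction_variance(n_train, degree=9, x0=0.5, n_trials=300, sigma=0.3):
    """Variance of the prediction at x0 across independently drawn training sets."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0.0, 1.0, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, n_train)
        preds[t] = Polynomial.fit(x, y, degree)(x0)
    return float(preds.var())

variances = {n: prediction_variance(n) for n in (15, 60, 240)}
for n, var in variances.items():
    print(f"n_train={n:3d}  variance of prediction at x0={var:.4f}")
```

With only 15 points the same flexible model gives very different predictions from sample to sample. As the training set grows, the predictions settle down, which is why acquiring more data helps with high variance.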
    

## The Tradeoff

When we are creating a predictive model (a mathematical formula that can predict some outcome based on certain inputs), one of our main goals is to make as few errors as possible. In other words, we want the predictions made by our model to be as close as possible to the actual outcomes.

But there's a problem. When we try to make our model perfectly fit our data, we might end up with a model that is too complex. This is like trying to hit a bullseye with a dart, but we keep adding more and more feathers to the dart until it's so heavy that it can't fly straight. This is what we call **overfitting**.

On the other hand, if our model is too simple, it might not capture all the important details in our data. This is like trying to hit a bullseye with a dart, but the dart is so light that it can't even reach the dartboard. This is what we call **underfitting**.

So, we need to find a balance between a model that is too simple (underfitting) and a model that is too complex (overfitting). This balance is often referred to as the **tradeoff**.

### Increasing Model Complexity

- As we increase the complexity of our model (add more variables, make the mathematical formula more complicated), our **bias** decreases. Bias is like consistent or systematic error. It's like always throwing the dart a little to the left of the bullseye. A lower bias means we are getting closer to hitting the bullseye on average.
- But as we increase the complexity of our model, our **variance** increases as well. Variance is like random or unpredictable error. It's like sometimes hitting the bullseye, sometimes hitting the edge of the dartboard, and sometimes missing the dartboard completely. A higher variance means our throws are less consistent. They might be closer to the bullseye on average (lower bias), but they are also more spread out (higher variance).

![tradeoff.png](./tradeoff.png)
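
The two trends can be estimated numerically. This is a sketch under simplifying assumptions: the training inputs `x` are held fixed and only the noise is redrawn on each trial, the true function is a sine curve, and polynomial degree stands in for model complexity. All of these are illustrative choices.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)
sigma, n_trials = 0.3, 300

# Fixed training inputs; only the noise changes between trials.
x = np.linspace(0.0, 1.0, 20)
f = np.sin(2 * np.pi * x)   # hypothetical true values at those inputs

summary = {}
for degree in (1, 3, 12):
    preds = np.empty((n_trials, x.size))
    for t in range(n_trials):
        y = f + rng.normal(0.0, sigma, x.size)
        preds[t] = Polynomial.fit(x, y, degree)(x)
    # Average both quantities over the 20 input points.
    bias_sq = float(np.mean((preds.mean(axis=0) - f) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    summary[degree] = (bias_sq, variance)
    print(f"degree={degree:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

As the degree rises, the estimated $\text{bias}^2$ falls while the variance climbs. Their sum is smallest at a moderate degree.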