<h2 id="Contents">Contents<a href="#Contents"></a></h2>
        <ol>
        <li><a class="" href="#Bias-and-Variance">Bias and Variance</a></li>
<ol><li><a class="" href="#Bias">Bias</a></li>
<li><a class="" href="#Variance">Variance</a></li>
<li><a class="" href="#Bias-Variance-Trade-Off">Bias-Variance Trade-Off</a></li>
</ol><li><a class="" href="#Bias-and-Variance-with-Model-Parameters">Bias and Variance with Model Parameters</a></li>
</ol>

# Bias and Variance

The prediction error for any machine learning algorithm can be broken down into three parts:
* Bias Error
* Variance Error
* Irreducible Error
$$
Err(x) = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Irreducible\ Error}
$$
The irreducible error cannot be reduced regardless of what algorithm is used. It is the error introduced from the chosen framing of the problem and may be caused by factors like unknown variables that influence the mapping of the input variables to the output variable.


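Suppose the data are generated by the following model, where $f$ is the true target function and $\epsilon$ is zero-mean random noise: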
$$
y = f(x) + \epsilon
$$
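
A short derivation shows where the three terms come from. It is a sketch under assumptions that are standard but introduced here: $\epsilon$ has mean zero and variance $\sigma^2$ (the symbol $\sigma^2$ is new notation), $\epsilon$ is independent of the prediction $\hat{y}$, and expectations are taken over training sets and noise at a fixed input $x$.

$$
\begin{aligned}
\mathbb{E}\left[(y - \hat{y})^2\right]
&= \mathbb{E}\left[(f(x) + \epsilon - \hat{y})^2\right] \\
&= \mathbb{E}\left[(f(x) - \hat{y})^2\right] + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}\left[f(x) - \hat{y}\right] + \mathbb{E}[\epsilon^2] \\
&= \mathbb{E}\left[(f(x) - \hat{y})^2\right] + \sigma^2 \\
&= \left(\mathbb{E}[\hat{y}] - f(x)\right)^2 + \mathbb{E}\left[(\hat{y} - \mathbb{E}[\hat{y}])^2\right] + \sigma^2
\end{aligned}
$$

The first term is the squared bias, the second is the variance of the prediction, and $\sigma^2$ is the irreducible error. The cross term in the last step vanishes because $\mathbb{E}\left[\hat{y} - \mathbb{E}[\hat{y}]\right] = 0$.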

## Bias

Bias refers to the simplifying assumptions a model makes so that the target function is easier to learn. High bias can cause an algorithm to miss relevant relations between features and target outputs (underfitting).
* **Low Bias:** Suggests fewer assumptions about the form of the target function.
* **High Bias:** Suggests more assumptions about the form of the target function.

Examples of low-bias machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

Examples of high-bias machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.

In the simplest terms, bias is the difference between the model's average prediction (averaged over training sets) and the expected value of the target. Using the above model, bias is:
$$
\mathrm{Bias} = \mathbb{E}[\hat{y}] - \mathbb{E}[y]
$$


## Variance

Variance is the amount by which the estimate of the target function would change if different training data were used. It is the error that comes from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).

* **Low Variance:** Suggests small changes to the estimate of the target function with changes to the training dataset.
* **High Variance:** Suggests large changes to the estimate of the target function with changes to the training dataset.

Examples of low-variance machine learning algorithms include: Linear Regression, Linear Discriminant Analysis and Logistic Regression.

Examples of high-variance machine learning algorithms include: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

A high-variance model fits not only the underlying pattern but also the fluctuations in the data, i.e. the noise. Using the above model, the variance of the prediction across training sets is:
$$
\mathrm{Variance} = \mathbb{E}[\hat{y}^2] - \mathbb{E}[\hat{y}]^2
$$

![](images/6.png)
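
The two formulas above can be estimated by simulation: draw many training sets from $y = f(x) + \epsilon$, refit the model on each, and look at how the predictions behave at fixed test inputs. The sketch below is only an illustration. The target function `true_f`, the noise level, the sample sizes, and the choice of polynomial fits of degree 1 and 6 are all assumptions made for this example, not part of the text above.

```python
import numpy as np

def true_f(x):
    """Assumed target function f(x); any smooth function would do."""
    return np.sin(2 * np.pi * x)

def estimate_bias_variance(degree, n_train=30, n_repeats=500, noise_sd=0.3, seed=0):
    """Refit a polynomial on many simulated training sets and return
    (squared bias, variance), each averaged over a grid of test inputs."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.0, 1.0, 50)
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Draw a fresh training set from y = f(x) + eps.
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_f(x) + rng.normal(0.0, noise_sd, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)                        # E[y_hat] at each test input
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)  # (E[y_hat] - f(x))^2
    variance = np.mean(preds.var(axis=0))                 # E[y_hat^2] - E[y_hat]^2
    return bias_sq, variance

for degree in (1, 6):
    bias_sq, variance = estimate_bias_variance(degree)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the straight-line fit should show the larger squared bias and the degree-6 fit the larger variance, which matches the high-bias and high-variance descriptions above.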

## Bias-Variance Trade-Off

Given the true model and infinite data to calibrate it, we should be able to reduce both the bias and variance terms to 0. However, in a world with imperfect models and finite data, there is a tradeoff between minimizing the bias and minimizing the variance.

>There is no escaping the relationship between bias and variance in machine learning.
Increasing the bias will decrease the variance.
Increasing the variance will decrease the bias.

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/08/eba93f5a75070f0fbb9d86bec8a009e9.png)

To understand it better, let's look at an example. Suppose that, given information about the wealth and religion of a group of people, we want to predict whom they will vote for. This can be done with k-nearest neighbors (KNN).

![](images/7.png)

Increasing k averages more voters into each prediction, which results in smoother prediction curves. With a k of 1, the separation between the two parties is very jagged, and there are "islands" of Democrats in generally Republican territory and vice versa. As k is increased to, say, 20, the transition becomes smoother, the islands disappear, and the split between the two parties does a good job of following the boundary line. As k becomes very large, say, 80, the distinction between the two categories becomes blurred and the predicted boundary no longer matches the boundary line well.

At small k's the jaggedness and islands are signs of variance. The locations of the islands and the exact curves of the boundaries will change radically as new data is gathered. On the other hand, at large k's the transition is very smooth so there isn't much variance, but the lack of a match to the boundary line is a sign of high bias.

What we observe here is that increasing k decreases variance and increases bias, while decreasing k increases variance and decreases bias. Take a look at how variable the predictions are for different data sets at low k. As k increases, this variability is reduced. But if we increase k too much, we no longer follow the true boundary line and we observe high bias. This is the nature of the bias-variance tradeoff.

![](images/8_1.png)
![](images/8_2.png)
![](images/8_3.png)
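
The same pattern can be reproduced numerically. The sketch below swaps the voting classification problem for a one-dimensional KNN regression, because bias and variance are easier to measure there. Everything in it is an assumption made for illustration: the target function `true_f`, the noise level, the fixed grid of 100 training inputs (only the noise is redrawn between training sets), and the helper name `knn_bias_variance`. The values k = 1, 20, 80 mirror the ones discussed above.

```python
import numpy as np

def true_f(x):
    """Assumed target function; stands in for the true boundary."""
    return np.sin(2 * np.pi * x)

def knn_bias_variance(k, n_train=100, n_repeats=300, noise_sd=0.3, seed=0):
    """Return (squared bias, variance) of k-nearest-neighbor regression,
    averaged over a grid of test inputs."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed inputs (simplifying assumption)
    x_test = np.linspace(0.0, 1.0, 50)
    # Indices of the k nearest training inputs for every test input.
    dist = np.abs(x_test[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]             # shape (n_test, k)
    # Each row of y is one simulated training set: same inputs, new noise.
    y = true_f(x_train) + rng.normal(0.0, noise_sd, (n_repeats, n_train))
    # A KNN prediction is the mean target of the k nearest neighbors.
    preds = y[:, nearest].mean(axis=2)                    # shape (n_repeats, n_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for k in (1, 20, 80):
    bias_sq, variance = knn_bias_variance(k)
    print(f"k={k:2d}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

With fixed training inputs the variance of each prediction is roughly the noise variance divided by k, so it shrinks as k grows, while the squared bias grows because distant neighbors are averaged in.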

[Reference 1](http://scott.fortmann-roe.com/docs/BiasVariance.html)

[Reference 2](https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/)

# Bias and Variance with Model Parameters

Whether a model overfits or underfits depends on the values of its hyperparameters. For example, in polynomial regression (a linear regression on polynomial features), the degree of the polynomial is a hyperparameter. If the degree is too low, the model will be underfitted. If the degree is too high, the model will be overfitted. Different settings therefore give different values of bias and variance. Here, we'll give some examples of how the bias and variance change with the model parameters.

**Linear Model**

In a regularized linear model, $\lambda$ is the regularization parameter. If $\lambda$ is too high, the model will be underfitted. If $\lambda$ is too low, the model will be overfitted. So,

Smaller $\lambda$ => Higher Variance, Lower Bias

Larger $\lambda$ => Lower Variance, Higher Bias
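
The effect of $\lambda$ can be checked with a small simulation. The sketch below assumes ridge (L2) regularization on degree-9 polynomial features, solved in closed form; the text above does not say which regularizer is meant, so this is one possible reading. The target function, noise level, fixed training inputs, and the helper name `ridge_bias_variance` are likewise assumptions for illustration.

```python
import numpy as np

def true_f(x):
    """Assumed target function."""
    return np.sin(2 * np.pi * x)

def ridge_bias_variance(lam, degree=9, n_train=30, n_repeats=500, noise_sd=0.3, seed=0):
    """Return (squared bias, variance) of ridge regression on polynomial
    features, averaged over a grid of test inputs."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed inputs (simplifying assumption)
    x_test = np.linspace(0.0, 1.0, 50)
    X_train = np.vander(x_train, degree + 1)  # columns x^9, ..., x, 1
    X_test = np.vander(x_test, degree + 1)
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y.
    # For simplicity the intercept column is penalized as well.
    A = X_train.T @ X_train + lam * np.eye(degree + 1)
    # Each row of y is one simulated training set: same inputs, new noise.
    y = true_f(x_train) + rng.normal(0.0, noise_sd, (n_repeats, n_train))
    W = np.linalg.solve(A, X_train.T @ y.T)   # one weight column per training set
    preds = (X_test @ W).T                    # shape (n_repeats, n_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for lam in (1e-6, 1e-2, 10.0):
    bias_sq, variance = ridge_bias_variance(lam)
    print(f"lambda={lam:g}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

In this setup the largest $\lambda$ shrinks the weights so strongly that the fit is almost flat (high bias, low variance), while the smallest $\lambda$ leaves the degree-9 polynomial nearly unregularized (low bias, higher variance).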

**KNN**

Smaller K => Higher Variance, Lower Bias

Large K => Lower Variance, Higher Bias

For higher values of k, more neighbors of the data point in question are averaged, including points that lie farther away. This raises the bias error and leads to underfitting, because the model can no longer capture local structure in the training set. In return, the predictions change less from one training set to another, so the variance error on unseen test data is lower.



**SVM**

<p>The parameter <code>C</code>, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low <code>C</code> makes the decision surface smooth, while a high <code>C</code> aims at classifying all training examples correctly. <code>gamma</code> defines how much influence a single training example has. The larger <code>gamma</code> is, the closer other examples must be to be affected.</p>

Smaller C => Lower Variance, Higher Bias

Smaller Gamma => Lower Variance, Higher Bias
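
One rough way to see these effects is to compare training and test accuracy: a large gap between the two suggests high variance, while low accuracy on both suggests high bias. The sketch below assumes scikit-learn is installed and uses its `SVC` classifier with an RBF kernel on the synthetic `make_moons` dataset. The dataset, the sample sizes, the specific `C` and `gamma` values, and the helper name `train_test_accuracy` are illustrative choices, not part of the text above.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def train_test_accuracy(C, gamma, seed=0):
    """Fit an RBF-kernel SVM on a small noisy dataset and return
    (training accuracy, test accuracy)."""
    X_train, y_train = make_moons(n_samples=200, noise=0.3, random_state=seed)
    X_test, y_test = make_moons(n_samples=1000, noise=0.3, random_state=seed + 1)
    model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

# Vary gamma at fixed C, then vary C at fixed gamma.
settings = [(1.0, 0.01), (1.0, 1.0), (1.0, 1000.0), (0.01, 1.0), (1000.0, 1.0)]
for C, gamma in settings:
    train_acc, test_acc = train_test_accuracy(C, gamma)
    print(f"C={C:g}, gamma={gamma:g}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A very large `gamma` lets each training example influence only its immediate surroundings, so the model can fit the training set closely while generalizing poorly. A very small `gamma` or `C` gives a smoother decision surface whose training and test accuracy stay close together.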