# Understanding Bias and Variance in Machine Learning

## Bias Explained

* **Definition:** Bias is the systematic error a model makes because its assumptions are too simple for the data. It measures how far the model's predictions, averaged over many possible training sets, fall from the correct values.
* **Underfitting (High Bias, Low Variance):**
    * **Training Data:** A model that is too simple cannot capture the underlying patterns in the training data, so training error stays high. This is underfitting.
    * **Test Data:** Underfitted models perform poorly on both the training and test data. The model never learned the pattern in the first place, so there is nothing useful to carry over to new data.
    * **Example:** A simple linear model that predicts house prices from square footage can miss a nonlinear relationship between the two. The resulting high bias shows up as systematic underestimation or overestimation over certain size ranges, and as poor performance on both datasets.
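
The underfitting pattern above can be sketched in a few lines. This is a minimal illustration on assumed synthetic data (a quadratic signal with Gaussian noise, not a real housing dataset), and the helper `train_test_mse` is a hypothetical name introduced here. A straight-line fit should leave both training and test error well above the noise level, while a degree-2 fit should come close to it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed synthetic data: quadratic signal plus Gaussian noise (std = 2).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + 2 * X[:, 0] + 5 + rng.normal(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def train_test_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    return (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )

linear_train, linear_test = train_test_mse(1)  # too simple for this data
quad_train, quad_test = train_test_mse(2)      # matches the assumed signal
print(f"degree 1: train MSE={linear_train:.2f}, test MSE={linear_test:.2f}")
print(f"degree 2: train MSE={quad_train:.2f}, test MSE={quad_test:.2f}")
```

The signature of high bias is that the error is high on both splits, not only on the held-out one.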

## Variance Explained

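* **Definition:** Variance refers to the model's sensitivity to the specific training data it was fit on, that is, how much its predictions would change if it were trained on a different sample. Low-variance models make similar predictions regardless of noise or small variations in the training data.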
* **Overfitting (Low Bias, High Variance):**
    * **Training Data:** Overfitted models fit the training data too closely, capturing noise and random fluctuations. In effect, the model memorizes the data rather than learning the underlying patterns.
    * **Test Data:** Overfitted models perform well on the training data but poorly on the test data, because what they learned does not generalize.
    * **Example:** A very complex model might fit the training data on house prices perfectly but perform poorly on new houses. It has fit the noise in the training data, which is what high variance looks like in practice.
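
Variance can be estimated directly by refitting the same model on many independently drawn training sets and measuring how much its predictions move. The sketch below assumes a simple linear data-generating process with Gaussian noise; the function name `prediction_spread` and all parameter values are illustrative choices, not part of any library.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_eval = np.linspace(1, 9, 50).reshape(-1, 1)  # fixed points to predict at

def prediction_spread(degree, n_repeats=200, n_samples=30):
    """Mean standard deviation of predictions across refits on fresh data."""
    preds = np.empty((n_repeats, len(x_eval)))
    for i in range(n_repeats):
        # Assumed data-generating process: linear trend plus noise (std = 2).
        x = rng.uniform(0, 10, size=(n_samples, 1))
        y = 2 * x[:, 0] + 5 + rng.normal(0, 2, size=n_samples)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[i] = model.fit(x, y).predict(x_eval)
    # Spread over training sets at each point, averaged over the points.
    return preds.std(axis=0).mean()

spread_simple = prediction_spread(degree=1)
spread_complex = prediction_spread(degree=10)
print(f"degree 1 spread:  {spread_simple:.2f}")
print(f"degree 10 spread: {spread_complex:.2f}")
```

The degree-10 model should show a noticeably larger spread: its predictions depend much more on which 30 points it happened to see.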

## The Bias-Variance Trade-off

The goal is to find a sweet spot between bias and variance:

* A model with both high bias and high variance will perform poorly.
* Ideally, we want a model with low bias and low variance.
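
The trade-off can be made precise with the standard decomposition of expected squared error at a point $x$. The notation here is introduced for illustration and is an assumption, not taken from the text above: $f$ is the true function, $\hat{f}_D$ is the model fit on training set $D$, and $y = f(x) + \varepsilon$ where $\varepsilon$ is zero-mean noise with variance $\sigma^2$.

$$
\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
= \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Only the first two terms depend on the model. The noise term sets a floor on the error that no choice of model can remove.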

**Complexity and the Trade-off:**

* As the model complexity increases, bias typically decreases (it can better fit the data), but variance increases (it becomes more sensitive to noise).
* Conversely, decreasing complexity increases bias (the model misses patterns) and decreases variance (less sensitive to noise).
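
One way to see this relationship is to sweep the polynomial degree and compare training and test error. The sketch below uses assumed synthetic data (a sine-shaped signal with Gaussian noise); the degrees, sample size, and seed are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Assumed synthetic data: a sine-shaped signal plus Gaussian noise (std = 1).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = 3 * np.sin(2 * X[:, 0]) + rng.normal(0, 1, size=60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1
)

degrees = [1, 3, 5, 9, 15]
train_mse, test_mse = {}, {}
for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[d] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[d] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {d:2d}: train MSE={train_mse[d]:10.2f}  test MSE={test_mse[d]:10.2f}")

# The degree with the lowest held-out error approximates the sweet spot.
best_degree = min(test_mse, key=test_mse.get)
print("lowest test error at degree", best_degree)
```

Training error keeps falling as the degree grows, while test error typically falls and then rises again. The gap between the two is a practical signal of overfitting.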

## Key Points to Remember

* **Model Complexity:** Simpler models are more prone to underfitting, while complex models are more likely to overfit.
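* **Data Quality and Quantity:** Clean, representative, and sufficient training data helps a model generalize. Noisy labels or unrepresentative samples hurt any model, regardless of its complexity.
* **Regularization Techniques:** Techniques like L1 and L2 regularization reduce overfitting by penalizing large coefficients, which limits the effective complexity of the model.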
* **Cross-Validation:** This technique helps assess the model's performance on unseen data and identify potential overfitting or underfitting.
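* **Feature Selection:** Reducing the number of features lowers model complexity and can reduce overfitting, although dropping informative features can increase bias.
* **More Data:** Additional training data mainly reduces variance, because a flexible model has less room to fit noise. It does not fix high bias: an underfitting model stays underfit no matter how much data it sees.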
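
Two of the points above, regularization and cross-validation, can be combined in one small experiment. The sketch below is illustrative only: it assumes synthetic linear data, a deliberately over-flexible degree-15 polynomial, and an arbitrary penalty strength `alpha=1.0`. The helper name `cv_mse` is hypothetical. It uses 5-fold cross-validation to compare an unregularized fit with an L2-penalized (ridge) fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed synthetic data: a linear trend plus Gaussian noise (std = 2).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = 2 * X[:, 0] + 5 + rng.normal(0, 2, size=30)

def cv_mse(regressor, degree=15):
    """Mean 5-fold cross-validated MSE for a polynomial model."""
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),  # put features on one scale so the penalty is even
        regressor,
    )
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_mean_squared_error"
    )
    return -scores.mean()  # scikit-learn reports negated MSE

mse_plain = cv_mse(LinearRegression())
mse_ridge = cv_mse(Ridge(alpha=1.0))
print(f"degree 15, no penalty: CV MSE = {mse_plain:.2f}")
print(f"degree 15, ridge:      CV MSE = {mse_ridge:.2f}")
```

In practice `alpha` would itself be chosen by cross-validation rather than fixed in advance.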


In [3]:
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from ipywidgets import interact

In [4]:
# Generate some noisy data
np.random.seed(42)
x = np.linspace(0, 10, 30)
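# Include a quadratic term so the degree-1 model below genuinely underfits;
# with a purely linear signal, a straight line would be the correct model
y = 0.5 * x**2 + 2 * x + 5 + np.random.randn(30) * 2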

# Split data into training and testing sets (fixed seed for a reproducible split)
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.2, random_state=42
)

# Create polynomial models of different degrees
def polynomial_regression(degree):
    return Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])

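def plot_regression(model_type, show_train=True, show_test=True):
    """Fit a polynomial model for the chosen regime and plot it over the data."""
    # Map each regime to a polynomial degree: low degree underfits, high degree overfits
    if model_type == 'Underfitting':
        degree = 1
    elif model_type == 'Balanced':
        degree = 5
    else:
        degree = 15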

    model = polynomial_regression(degree)
    model.fit(X_train, y_train)
    y_pred = model.predict(x[:, np.newaxis])

    plt.figure(figsize=(12, 6))
    plt.scatter(x, y, label='Data')
    plt.plot(x, y_pred, label=f'{model_type} Model')

    # Overlay the train/test points so the split is visible on the plot
    if show_train:
        plt.scatter(X_train, y_train, color='red', label='Training Data')
    if show_test:
        plt.scatter(X_test, y_test, color='purple', label='Testing Data')

    plt.xlabel('x')
    plt.ylabel('y')
    plt.legend()
    plt.title(f'{model_type} Model')
    plt.show()

# Interactive plot
interact(plot_regression, model_type=['Underfitting', 'Balanced', 'Overfitting'], show_train=True, show_test=True)
