# Bias-Variance Trade-off, Overfitting and Underfitting

![Cover](../Assets/bias_variance_tradeoff/capa3.png)

## 1. The Fundamental Problem

When we train a machine learning model, we want it to perform well on new data that it has never seen. To check this, we measure two kinds of error:

1. Error on the Training Set (training data)
2. Error on the Test Set (new data)

The ideal model has low error on both. In practice, getting there means balancing two sources of error: Bias and Variance.

## 2. What is Bias?

Bias is the error caused by wrong assumptions or an overly simple model.

### Practical Analogy

Imagine you are trying to hit a target with arrows:
- High Bias = Your arrows consistently hit far from the center (you are aiming wrong)
- Low Bias = Your arrows, on average, hit close to the center

### In Machine Learning

A model with high bias:
- Is too simple to capture the patterns in the data
- Makes strong assumptions about the relationship between X and y
- Results in Underfitting

Example: Using a straight line to model data with a curve

![High Bias](../Assets/bias_variance_tradeoff/1.jpg)

```
Model too simple!
Does not capture the real pattern.
```
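A minimal sketch of this effect, assuming synthetic data generated as $y = x^2$ plus Gaussian noise. The data, the seed, and the helper `fit_mse` are illustrative assumptions, not part of the original example. It compares the training error of a straight line with that of a parabola using NumPy's `polyfit`:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical curved data: y = x^2 plus Gaussian noise
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(scale=0.5, size=x.size)

def fit_mse(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x, y, deg=degree)
    predictions = np.polyval(coeffs, x)
    return float(np.mean((predictions - y) ** 2))

mse_line = fit_mse(1)      # straight line: cannot bend, so the error stays high
mse_parabola = fit_mse(2)  # parabola: matches the shape of the data

print(f"line MSE: {mse_line:.2f}, parabola MSE: {mse_parabola:.2f}")
```

The straight line should end up with a much larger training error than the parabola, because no choice of slope and intercept can follow the curvature.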

## 3. What is Variance?

Variance is the error caused by excessive sensitivity to the training data: small changes in the training set produce large changes in the fitted model.

### Practical Analogy

Continuing with the arrows:
- High Variance = Your arrows are scattered all over the place (inconsistent)
- Low Variance = Your arrows are grouped close to each other (consistent)

### In Machine Learning

A model with high variance:
- Is too complex and "memorizes" the training data
- Adapts too much to the noise in the data
- Results in Overfitting

Example: Using a degree 10 polynomial to model simple data

![High Variance](../Assets/bias_variance_tradeoff/2.png)

```
Model too complex!
Passes through all points but
does not generalize to new data.
```

## 4. Underfitting - High Bias

### What is it?

Underfitting happens when the model is too simple to capture the patterns in the data.

### Characteristics

- High error on training set
- High error on test set
- Model did not learn the basic pattern of the data

### Numerical Example

Dataset with quadratic relationship: $y = x^2 + \text{noise}$

| x  | y (actual) |
|----|----------|
| 1  | 1.2      |
| 2  | 4.1      |
| 3  | 9.3      |
| 4  | 16.2     |
| 5  | 25.1     |

Model 1: Straight line $h(x) = \theta_0 + \theta_1 x$

Result (illustrative values):
- Training Error: 45.2
- Test Error: 47.8

Why? A straight line cannot capture the curvature of the data!

### How to Identify Underfitting?

1. High training error, well above what is acceptable for the task (e.g. 40-50% where a few percent is achievable)
2. Test error similar to training (small difference)
3. Learning curve: both errors remain high even with more data
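The learning-curve symptom can be sketched as follows. This assumes synthetic quadratic data and a straight-line model; `make_data`, `mse`, and the chosen training sizes are hypothetical choices for illustration. The point is that adding examples does not help a model that is too simple:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical quadratic data: y = x^2 + Gaussian noise."""
    x = rng.uniform(-3, 3, size=n)
    y = x**2 + rng.normal(scale=0.5, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_test, y_test = make_data(1000)  # fixed held-out set

learning_curve = []
for m in [20, 100, 500, 2500]:
    x_train, y_train = make_data(m)
    coeffs = np.polyfit(x_train, y_train, deg=1)  # straight line: too simple
    learning_curve.append(
        (m, mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    )

for m, train_err, test_err in learning_curve:
    print(f"m={m:5d}  train MSE={train_err:6.2f}  test MSE={test_err:6.2f}")
```

In this setup both errors should stay far above the noise level (0.25 in MSE terms) however large `m` gets, and they converge toward each other.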

## 5. Overfitting - High Variance

### What is it?

Overfitting happens when the model is too complex and "memorizes" the training data, including the noise.

### Characteristics

- Low error on training set
- High error on test set
- Model memorized instead of learning

### Numerical Example

Dataset with quadratic relationship: $y = x^2 + \text{noise}$

| x  | y (train) | y (test) |
|----|------------|----------|
| 1  | 1.2        | 0.9      |
| 2  | 4.1        | 3.8      |
| 3  | 9.3        | 9.5      |
| 4  | 16.2       | 15.7     |
| 5  | 25.1       | 25.4     |

Model 2: Degree 10 polynomial $h(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + ... + \theta_{10} x^{10}$

Result (illustrative values):
- Training Error: 0.01 (practically zero!)
- Test Error: 152.7 (exploded!)

Why? The model fits the training points almost perfectly (noise included), but between and beyond those points the curve oscillates wildly, so it does not generalize to new data.

### How to Identify Overfitting?

1. Very low training error (e.g. 1-5%)
2. Test error much higher than training error (e.g. 10x greater)
3. Large gap between training and test error
4. Validation curve: as complexity or training time grows, training error keeps dropping while test error starts rising

## 6. The Ideal Model - Just Right (Goldilocks)

### Characteristics

- Low error on training set
- Low error on test set
- Small gap between the two

### Numerical Example

Model 3: Degree 2 polynomial $h(x) = \theta_0 + \theta_1 x + \theta_2 x^2$

Result (illustrative values):
- Training Error: 2.1
- Test Error: 2.8
- Gap: only 0.7

Perfect! Captures the real pattern (quadratic) without memorizing the noise.
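To see the three models side by side, here is a small sketch on synthetic data. It assumes $y = x^2$ plus Gaussian noise with 15 training points; these numbers, the seed, and the helper names are assumptions for illustration, so the errors it prints will not match the values quoted above.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """Hypothetical quadratic data: y = x^2 + Gaussian noise."""
    x = rng.uniform(-3, 3, size=n)
    y = x**2 + rng.normal(scale=1.0, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(500)    # new data from the same distribution

results = {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree {degree:2d}: train MSE={train_err:8.3f}  test MSE={test_err:10.3f}")
```

Training error can only go down as the degree grows, since each model contains the previous one. Test error is what separates the just-right model from the other two.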

## 7. Bias-Variance Trade-off

### The Total Error Equation

$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}$$

Where:
- Bias²: error from an overly simple model
- Variance: error from an overly sensitive model
- Irreducible noise: inherent error in the data (unavoidable)
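For squared-error loss, the equation above can be written more formally at a fixed input $x$. The notation here is an assumption of this sketch rather than something defined earlier: $f$ is the true function, $y = f(x) + \varepsilon$ with noise of mean zero and variance $\sigma^2$, and $\hat{h}$ is the model fitted on a randomly drawn training set. Expectations are taken over training sets and noise.

$$\mathbb{E}\big[(y - \hat{h}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{h}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{h}(x) - \mathbb{E}[\hat{h}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}$$

Bias measures how far the average fitted model is from the truth. Variance measures how much individual fitted models scatter around that average.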

### The Trade-off

![Bias-Variance Trade-off](../Assets/bias_variance_tradeoff/3.png)

### Inverse Relationship

Increasing complexity:
- Bias decreases (captures complex patterns)
- Variance increases (sensitive to noise)

Decreasing complexity:
- Bias increases (does not capture patterns)
- Variance decreases (more stable)
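The inverse relationship can be checked empirically with a small simulation. This is a simplified sketch under stated assumptions: the true function is taken to be $y = x^2$, the inputs are held fixed, and only the noise is redrawn for each simulated training set. The name `bias2_and_variance` and all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

x_train = np.linspace(-3, 3, 30)   # fixed inputs; only the noise is redrawn
f_true = x_train**2                # hypothetical ground truth y = x^2
noise = rng.normal(scale=1.0, size=(300, x_train.size))  # 300 simulated training sets

def bias2_and_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of this degree."""
    predictions = np.array([
        np.polyval(np.polyfit(x_train, f_true + eps, deg=degree), x_train)
        for eps in noise
    ])
    # Bias: distance between the average prediction and the truth
    bias2 = float(np.mean((predictions.mean(axis=0) - f_true) ** 2))
    # Variance: spread of the predictions around their own average
    variance = float(np.mean(predictions.var(axis=0)))
    return bias2, variance

estimates = {degree: bias2_and_variance(degree) for degree in (1, 2, 8)}
for degree, (bias2, variance) in estimates.items():
    print(f"degree {degree}: bias^2={bias2:.3f}  variance={variance:.3f}")
```

In this setup the squared bias should collapse once the degree reaches 2 (the true shape), while the variance keeps growing with every extra degree.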

## 8. How to Diagnose the Problem?

### Comparing Errors

| Situation | Training Error | Test Error | Gap | Diagnosis |
|----------|---------------|------------|-----|-------------|
| A        | 45%           | 47%        | 2%  | Underfitting (high bias) |
| B        | 2%            | 3%         | 1%  | Just Right |
| C        | 1%            | 25%        | 24% | Overfitting (high variance) |
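The table's logic can be written as a small helper. This is only a sketch: the function `diagnose` and its two thresholds are hypothetical, and sensible cut-offs depend on the task and on the error you could realistically achieve.

```python
def diagnose(train_error, test_error, acceptable_error=0.10, max_gap=0.05):
    """Classify a model from its training and test error (fractions in [0, 1]).

    Both thresholds are hypothetical defaults, not universal constants.
    """
    gap = test_error - train_error
    if train_error > acceptable_error:
        return "Underfitting (high bias)"
    if gap > max_gap:
        return "Overfitting (high variance)"
    return "Just Right"

# The three situations from the table above
situations = [("A", 0.45, 0.47), ("B", 0.02, 0.03), ("C", 0.01, 0.25)]
for name, train_error, test_error in situations:
    print(name, diagnose(train_error, test_error))
```

Checking the training error first mirrors the table: a model that cannot even fit its training data is underfitting, whatever the gap looks like.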

## 9. How to Fix Underfitting (High Bias)?

### Solutions

### 1. Increase Model Complexity

Before:
```python
# Model too simple: a straight line
def h(x, theta):
    return theta[0] + theta[1] * x
```

After:
```python
# More complex model: a parabola
def h(x, theta):
    return theta[0] + theta[1] * x + theta[2] * x**2
```

### 2. Add More Features

Before:
```python
# Only 1 feature
feature_names = ["size"]
```

After:
```python
# Multiple features
feature_names = ["size", "bedrooms", "age", "location"]
```

### 3. Feature Engineering

Create derived features:
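```python
import numpy as np

# Original feature (hypothetical house sizes)
size = np.array([50.0, 80.0, 120.0])

# Derived features
size_squared = size ** 2
size_cubed = size ** 3
size_sqrt = np.sqrt(size)
```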

### 4. Reduce or Remove Regularization

If you are using regularization (λ), decrease or remove it:
```python
# Before: λ too high
λ = 10  # Forces simple model

# After: λ smaller or zero
λ = 0  # Allows more flexible model
```

### 5. Train for Longer

For neural networks, increase epochs:
```python
# Before
epochs = 10  # Stopped early

# After
epochs = 100  # Trained longer
```

### Caution!

When fixing underfitting, you may cause overfitting. Always monitor the error on held-out data (ideally a separate validation set, so the test set stays untouched)!

## 10. How to Fix Overfitting (High Variance)?

### Solutions

### 1. Collect More Data

Often the most effective fix, when more data is available. More data helps the model generalize better.

Before:
```python
m = 100  # Few examples
```

After:
```python
m = 10000  # Many examples
```

Why does it work? With more data, the noise in individual examples averages out, so the model has less room to "memorize" it and is pushed toward the real patterns.

### 2. Reduce Model Complexity

Before:
```python
# Degree 10 polynomial: h(x) = θ₀ + θ₁x + θ₂x² + ... + θ₁₀x¹⁰
degree = 10
```

After:
```python
# Degree 2 polynomial: h(x) = θ₀ + θ₁x + θ₂x²
degree = 2
```

### 3. Regularization (L1 or L2)

Add penalty to large weights:

L2 Regularization (Ridge):
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

```python
# λ controls how much we penalize
λ = 0.1   # Moderate regularization
λ = 1.0   # Strong regularization
λ = 10.0  # Very strong regularization
```

L1 Regularization (Lasso):
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$

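Effect: Both penalties shrink the weights $\theta$ toward zero, making the model simpler. L1 can also drive some weights exactly to zero, which acts as a form of feature selection.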
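For the L2 case, the cost $J(\theta)$ above has a closed-form minimizer, which makes the shrinking effect easy to observe. The sketch below is an illustration under assumptions: `fit_ridge` is a hypothetical helper, the data is synthetic, and the intercept $\theta_0$ is left unpenalized because the penalty sum starts at $j = 1$. L1 has no closed form and is not shown.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize (1/2m) * sum((X @ theta - y)^2) + lam * sum(theta[1:]^2).

    X must have a leading column of ones; theta[0] (the intercept) is not penalized.
    """
    m, n = X.shape
    penalty = np.eye(n)
    penalty[0, 0] = 0.0  # the penalty sum starts at j = 1
    # Setting the gradient to zero gives (X^T X + 2*m*lam*penalty) theta = X^T y
    return np.linalg.solve(X.T @ X + 2 * m * lam * penalty, X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = x**2 + rng.normal(scale=0.1, size=x.size)
X = np.vander(x, N=11, increasing=True)  # columns 1, x, x^2, ..., x^10

for lam in (0.0, 0.1, 1.0, 10.0):
    theta = fit_ridge(X, y, lam)
    print(f"lambda={lam:5.1f}  sum of squared weights={np.sum(theta[1:] ** 2):.4f}")
```

As `lam` grows, the sum of squared weights should fall steadily, which is the "simpler model" effect described above.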

### 4. Feature Selection (Remove Features)

Before:
```python
# Many features (20)
feature_names = ["size", "bedrooms", "age", "location", ..., "feature_20"]
```

After:
```python
# Only important features (5)
feature_names = ["size", "bedrooms", "location", "age", "bathrooms"]
```

### 5. Cross-Validation

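Split the data into k folds, train on k-1 of them and validate on the remaining one, rotating until every fold has served as the validation set. Cross-validation does not reduce overfitting by itself, but it gives a more reliable estimate of generalization error, which you can use to choose model complexity or λ: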

![k-folds](../Assets/bias_variance_tradeoff/4.png)
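One way to put this into practice is scikit-learn's `KFold` and `cross_val_score`. The sketch below assumes synthetic one-feature data and a degree 10 polynomial with Ridge regularization; the candidate `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))  # hypothetical single feature
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # k = 5 folds

cv_mse = {}
for alpha in (0.01, 1.0, 100.0):
    model = make_pipeline(
        PolynomialFeatures(degree=10, include_bias=False),
        StandardScaler(),
        Ridge(alpha=alpha),
    )
    # scikit-learn reports negated MSE so that higher is better; flip the sign back
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[alpha] = float(-scores.mean())

best_alpha = min(cv_mse, key=cv_mse.get)
print(cv_mse, "best alpha:", best_alpha)
```

Each `alpha` is scored on data the model did not train on, so the test set is never touched during tuning.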

## 11. Summary - Decision Table

| Problem | Symptoms | Solutions |
|----------|----------|----------|
| Underfitting | High training error<br>High test error<br>Small gap | 1. Increase model complexity<br>2. Add more features<br>3. Feature engineering<br>4. Decrease regularization (λ)<br>5. Train longer |
| Overfitting | Low training error<br>High test error<br>Large gap | 1. Collect more data<br>2. Reduce model complexity<br>3. Add regularization (L1/L2)<br>4. Remove features<br>5. Cross-validation |
| Just Right | Low training error<br>Low test error<br>Small gap | Keep it up! |

## 12. Complete Practical Example

### Dataset: Predicting house prices

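```python
from sklearn.model_selection import train_test_split

# X, y: features and prices of 100 houses (assumed to be loaded already)
# Hold out 20 houses for testing and train on the other 80
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```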

### Attempt 1: Simple Line

```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()  # h(x) = θ₀ + θ₁x
```

Result:
- Training Error: 42%
- Test Error: 45%
- Diagnosis: UNDERFITTING

Action: Increase complexity

### Attempt 2: Degree 2 Polynomial

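```python
# scikit-learn has no PolynomialRegression class: chain PolynomialFeatures (sklearn.preprocessing) and LinearRegression with make_pipeline (sklearn.pipeline)
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())  # h(x) = θ₀ + θ₁x + θ₂x²
```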

Result:
- Training Error: 5%
- Test Error: 8%
- Diagnosis: JUST RIGHT

Action: Success! Balanced model.

### Attempt 3: Degree 10 Polynomial

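```python
# Same pipeline as Attempt 2 (PolynomialFeatures + LinearRegression via make_pipeline), with a higher degree
model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())
```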

Result:
- Training Error: 0.5%
- Test Error: 45%
- Diagnosis: OVERFITTING

Action: Apply regularization

### Attempt 4: Degree 10 Polynomial + Regularization

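```python
# Ridge (sklearn.linear_model) takes no degree argument: build the polynomial terms first
model = make_pipeline(PolynomialFeatures(degree=10), Ridge(alpha=1.0))  # α = λ (regularization)
```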

Result:
- Training Error: 4%
- Test Error: 6%
- Diagnosis: JUST RIGHT

Action: Success! Regularization solved it.
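The four attempts can be reproduced end to end with scikit-learn. The sketch below is self-contained but uses a synthetic stand-in for the house data. The price formula, the seed, and the helper `polynomial_model` are assumptions, so the errors it prints will differ from the percentages quoted above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the house data: 100 houses, one feature (size)
rng = np.random.default_rng(0)
size = rng.uniform(50, 250, size=(100, 1))
# Made-up quadratic price relationship plus noise
price = 0.02 * (size[:, 0] - 100) ** 2 + 100 + rng.normal(scale=20, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    size, price, test_size=0.2, random_state=0
)

def polynomial_model(degree, regressor):
    """Scale the feature, expand it to polynomial terms, then fit the regressor."""
    return make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        regressor,
    )

attempts = {
    "line": polynomial_model(1, LinearRegression()),
    "degree 2": polynomial_model(2, LinearRegression()),
    "degree 10": polynomial_model(10, LinearRegression()),
    "degree 10 + ridge": polynomial_model(10, Ridge(alpha=1.0)),
}

errors = {}
for name, model in attempts.items():
    model.fit(X_train, y_train)
    errors[name] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )
    print(f"{name:18s} train MSE={errors[name][0]:9.2f}  test MSE={errors[name][1]:9.2f}")
```

Note that scikit-learn's `Ridge` minimizes the sum of squared errors plus `alpha` times the squared weights, without the $\frac{1}{2m}$ factor, so `alpha` is not numerically identical to the λ in the cost function shown earlier.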

## 13. Final Tips

### Best Practices

1. Always separate train/test (80/20 or 70/30)
2. Use cross-validation to choose hyperparameters
3. Start simple, add complexity gradually
4. Monitor both errors (training and test)
5. Plot learning curves to visualize

### Common Mistakes

1. Not separating test set (training and testing on same data)
2. Using test set to tune model (data leakage)
3. Excessive complexity from the start
4. Ignoring training error (focusing only on test)
5. Not using regularization when appropriate

---

Remember:
_"The best model is not the one that best fits the training data, but the one that best generalizes to new data."_