# Concept 2: Underfitting vs Overfitting

## 📊 The Fitting Spectrum

![Three graphs showing underfitting, good fit, and overfitting with data points and curves, size 900x400](images/fitting_spectrum.png)

*"Finding the Goldilocks zone of model complexity"*

## 📉 Underfitting Characteristics

- 🎯 High training error AND high test error
- 📊 Model too simple to capture patterns
- 🔍 Poor performance on both seen and unseen data
- 💡 Solution: Increase model complexity

## 📈 Overfitting Characteristics

- 🎯 Low training error BUT high test error
- 📊 Model memorizes training data noise
- 🔍 Great on training, poor on new data
- 💡 Solution: Reduce complexity or add regularization
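To make the "add regularization" remedy concrete, here is a minimal sketch that fits the same degree-15 polynomial twice: once with plain least squares and once with an L2 penalty (`Ridge`). The synthetic noisy sine data, the `alpha=1.0` value, the `StandardScaler` step, and the helper name `make_model` are all assumptions chosen for illustration, not settings taken from this lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed synthetic data: a noisy sine wave on [0, 10]
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)


def make_model(regressor):
    """Hypothetical helper: degree-15 polynomial features, scaled, then a regressor."""
    return Pipeline([
        ("poly", PolynomialFeatures(degree=15, include_bias=False)),
        ("scale", StandardScaler()),  # puts all powers of x on a comparable scale
        ("reg", regressor),
    ])


results = {}
candidates = [("no penalty", LinearRegression()), ("ridge", Ridge(alpha=1.0))]
for name, regressor in candidates:
    model = make_model(regressor).fit(X_train, y_train)
    results[name] = {
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
        # Size of the fitted coefficient vector; the L2 penalty shrinks this
        "coef_norm": float(np.linalg.norm(model.named_steps["reg"].coef_)),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

The penalty trades a slightly higher training error for much smaller coefficients. Whether the test error actually improves depends on `alpha` and on the data, so `alpha` is usually tuned rather than fixed.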

## Detecting Over/Underfitting

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a sine wave plus Gaussian noise
np.random.seed(0)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.randn(100) * 0.5

# Polynomial degrees to compare, from too simple to too flexible
degrees = [1, 2, 5, 15]
train_errors = []
test_errors = []

# Hold out 30% of the data to estimate error on unseen points
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in degrees:
    # Expand x into polynomial features, then fit ordinary least squares
    poly_model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    poly_model.fit(X_train, y_train)
    train_pred = poly_model.predict(X_train)
    test_pred = poly_model.predict(X_test)
    train_error = mean_squared_error(y_train, train_pred)
    test_error = mean_squared_error(y_test, test_pred)
    train_errors.append(train_error)
    test_errors.append(test_error)
    print(f"Degree {degree}: Train MSE = {train_error:.3f}, Test MSE = {test_error:.3f}")
```
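A single train/test split can give a noisy picture, so one possible variation on the experiment above is to average the errors over several folds. The sketch below uses scikit-learn's `validation_curve` on the same synthetic data. The choice of 5 shuffled folds and the names `cv_train_mse` and `cv_val_mse` are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data as the experiment above
np.random.seed(0)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.randn(100) * 0.5

degrees = [1, 2, 5, 15]
model = Pipeline([
    ("poly", PolynomialFeatures()),
    ("linear", LinearRegression()),
])

# Shuffle before splitting: X is sorted, so unshuffled folds would
# force the model to extrapolate outside the range it was trained on
cv = KFold(n_splits=5, shuffle=True, random_state=0)

train_scores, val_scores = validation_curve(
    model, X, y,
    param_name="poly__degree",  # the 'degree' parameter of the 'poly' step
    param_range=degrees,
    cv=cv,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE with one column per fold; flip the sign and average
cv_train_mse = -train_scores.mean(axis=1)
cv_val_mse = -val_scores.mean(axis=1)

for degree, tr, va in zip(degrees, cv_train_mse, cv_val_mse):
    print(f"Degree {degree}: CV train MSE = {tr:.3f}, CV validation MSE = {va:.3f}")
```

Reading the two columns side by side gives the same diagnosis as before: both errors high suggests underfitting, while a low training error paired with a much higher validation error suggests overfitting.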

## 🎯 Recognition Patterns

- **Underfitting:** Training error and test error are both high, and close to each other
- **Overfitting:** Training error is low, but test error is much higher (a large gap between the two)
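The two patterns above can be written as a rough rule of thumb. The function below is only an illustrative sketch: `diagnose_fit`, the `acceptable_error` argument, and the `gap_ratio=1.5` default are hypothetical names and thresholds, and sensible values depend on the problem and the error metric.

```python
def diagnose_fit(train_error, test_error, acceptable_error, gap_ratio=1.5):
    """Label a model from its train/test errors using a rough heuristic.

    acceptable_error: largest training error considered "low" (assumed,
        problem-specific threshold).
    gap_ratio: how many times larger than the training error the test
        error must be before the gap counts as overfitting (assumed default).
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it has seen
        return "underfitting"
    if test_error > gap_ratio * train_error:
        # Fits the training data well but does not generalize
        return "overfitting"
    return "good fit"


# Made-up error values, purely to exercise each branch
print(diagnose_fit(0.60, 0.65, acceptable_error=0.30))
print(diagnose_fit(0.05, 0.90, acceptable_error=0.30))
print(diagnose_fit(0.25, 0.28, acceptable_error=0.30))
```

Checking the training error first mirrors the bullets: a high training error points to underfitting regardless of the gap, and only a model that fits its training data well is then checked for a train/test gap.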

💭 *Question: Can you think of a real-world scenario where overfitting would be problematic?*