# Understanding Underfitting and Overfitting in Machine Learning

## 📈 Underfitting vs Overfitting - The Goldilocks Problem

![Three bears representing underfitting (too simple), overfitting (too complex), and just right models with corresponding data fit curves. Size 800x600](images/goldilocks_models.png)

*Just like Goldilocks, we need our model to be "just right"!*

## 📉 Underfitting: When Your Model is Too Simple

- **Symptoms:** Poor performance on both training AND test data
- **Cause:** Model is too simple to capture underlying patterns
- **Example:** Using linear regression for clearly non-linear data
- **Signs:** High bias, low variance

![Scatter plot showing curved data with a straight line poorly fitting through it. Size 600x400](images/underfitting_example.png)
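
To make the linear-regression example concrete, here is a minimal sketch. The quadratic dataset, the noise level, and all variable names are assumptions chosen for illustration. It fits a straight line to curved data and then, for contrast, a degree-2 polynomial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumed for illustration): y is quadratic in x, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A straight line cannot follow the curve, so it should score poorly everywhere
line = LinearRegression().fit(X_train, y_train)
underfit_train_r2 = line.score(X_train, y_train)
underfit_test_r2 = line.score(X_test, y_test)

# Adding a squared feature gives the model enough capacity for this data
quad = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quad.fit(X_train, y_train)
quad_test_r2 = quad.score(X_test, y_test)

print(f"Line:      Train R^2={underfit_train_r2:.3f}, Test R^2={underfit_test_r2:.3f}")
print(f"Quadratic: Test R^2={quad_test_r2:.3f}")
```

The pattern to look for is a low score on both splits for the straight line. That is the "both low" symptom, as opposed to a gap between training and test.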

## 📈 Overfitting: When Your Model Memorizes Instead of Learning

- **Symptoms:** Great training performance, poor test performance
- **Cause:** Model is too complex, captures noise as patterns
- **Example:** Decision tree with 100 levels for simple data
- **Signs:** Low bias, high variance

![Scatter plot showing a very wavy line that passes through every training point but would poorly predict new data. Size 600x400](images/overfitting_example.png)
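
The decision-tree example can be sketched as follows. The noisy linear dataset and the variable names are assumptions for illustration, and `max_depth=None` (grow until every leaf is pure) stands in for the "100 levels" tree. A depth-3 tree is included for comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic "simple" data (assumed for illustration): a linear trend plus noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Unlimited depth: the tree can isolate every training point, noise included
deep_tree = DecisionTreeRegressor(max_depth=None, random_state=0)
deep_tree.fit(X_train, y_train)
deep_train_r2 = deep_tree.score(X_train, y_train)
deep_test_r2 = deep_tree.score(X_test, y_test)

# Limited depth: the tree is forced to average over noise
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
shallow_tree.fit(X_train, y_train)
shallow_train_r2 = shallow_tree.score(X_train, y_train)
shallow_test_r2 = shallow_tree.score(X_test, y_test)

print(f"Deep tree:    Train R^2={deep_train_r2:.3f}, Test R^2={deep_test_r2:.3f}")
print(f"Shallow tree: Train R^2={shallow_train_r2:.3f}, Test R^2={shallow_test_r2:.3f}")
```

The deep tree should score almost perfectly on the training data while dropping noticeably on the test data. That gap is the memorization symptom described above.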

## ⚖️ Visual Comparison

![Three side-by-side plots showing underfitting (straight line), good fit (smooth curve), and overfitting (jagged line through all points). Size 900x400](images/fitting_comparison.png)

**Training Accuracy vs Test Accuracy:**

- 🔴 Underfitting: Both low
- 🟢 Good fit: Both high and similar
- 🔴 Overfitting: Training high, test low
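
These three cases can be turned into a rough rule of thumb. The helper below is a hypothetical sketch: the name `diagnose_fit` and the thresholds `min_score` and `max_gap` are assumptions, not standard values, and sensible cutoffs depend on the task and metric.

```python
def diagnose_fit(train_score, test_score, min_score=0.7, max_gap=0.15):
    """Label a model from its training and test scores (higher is better).

    min_score and max_gap are arbitrary illustrative thresholds.
    """
    # Low training score: the model cannot even fit the data it has seen
    if train_score < min_score:
        return "underfitting"
    # High training score but a large drop on unseen data
    if train_score - test_score > max_gap:
        return "overfitting"
    # Both scores are high and close together
    return "good fit"


print(diagnose_fit(0.45, 0.42))
print(diagnose_fit(0.99, 0.60))
print(diagnose_fit(0.90, 0.88))
```

With the default thresholds, the three calls return `underfitting`, `overfitting`, and `good fit` respectively.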

## 💻 Code Example: Demonstrating Over/Underfitting

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Generate some non-linear data
np.random.seed(0)
X = np.sort(np.random.rand(100, 1) * 10, axis=0)
y = np.sin(X).ravel() + np.random.randn(100) * 0.2

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create polynomial degrees to test
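# For this sine-shaped data, degrees 1 and 2 will likely underfit, degree 5
# should fit more closely, and degree 15 is flexible enough that it may overfit
degrees = [1, 2, 5, 15]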

for degree in degrees:
    # Create polynomial pipeline
    poly_model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    
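    # Fit, then evaluate on both splits; for regressors, score() returns R^2
    # (1.0 is a perfect fit), not classification accuracy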
    poly_model.fit(X_train, y_train)
    train_score = poly_model.score(X_train, y_train)
    test_score = poly_model.score(X_test, y_test)
    
    print(f"Degree {degree}: Train={train_score:.3f}, Test={test_score:.3f}")