## 6. OVERFITTING

### Definition
**The model learns the noise and quirks of the training data** rather than the underlying pattern. It scores well on the training set but poorly on unseen test data.

```
Overfitting = Model memorizes training data
              instead of learning generalizable patterns
```
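
Memorization can be seen directly with an unconstrained decision tree, which is free to carve out one leaf per training point. The sketch below is only an illustration: the synthetic data (`y = x + noise`), the choice of `DecisionTreeRegressor`, and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed for illustration): y = x + Gaussian noise
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 200).reshape(-1, 1)
y_demo = X_demo.ravel() + rng.normal(0, 2, 200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# No depth limit: the tree keeps splitting until every leaf holds a single
# training point, so it reproduces the training targets exactly
memorizer = DecisionTreeRegressor(random_state=0)
memorizer.fit(Xd_train, yd_train)

memo_train_mse = mean_squared_error(yd_train, memorizer.predict(Xd_train))
memo_test_mse = mean_squared_error(yd_test, memorizer.predict(Xd_test))

print(f"Train MSE: {memo_train_mse:.3f}")  # 0.000: the training set is memorized
print(f"Test MSE:  {memo_test_mse:.3f}")   # far larger: the memorized noise does not transfer
```

A training error of zero alongside a much larger test error is the signature of overfitting; limiting `max_depth` would trade some training accuracy for better generalization.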

### Visual Explanation:

```
Training Data:  ●●●●●●●●●●
Underlying Pattern: Straight line (simple)

Underfitted Model:    ___________   (flat line: too simple, misses the trend)
Well-Fit Model:       ╱╱╱╱╱╱╱╱╱╱╱   (sloped straight line: captures the pattern)
Overfitted Model:     ~∧~∧~∧~∧~    (wiggly, memorizes noise)
```

### Code Example:


In [None]:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Generate simple data: y = x + noise
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y_true = X.ravel()
y = y_true + np.random.normal(0, 2, 100)  # Add noise

# Fix random_state so the split (and the printed errors) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Test different polynomial degrees
degrees = [1, 3, 5, 10, 20]
train_scores = []
test_scores = []

for degree in degrees:
    # Create polynomial features
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    
    # Train model
    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    
    # Evaluate
    train_pred = model.predict(X_train_poly)
    test_pred = model.predict(X_test_poly)
    
    train_mse = mean_squared_error(y_train, train_pred)
    test_mse = mean_squared_error(y_test, test_pred)
    
    train_scores.append(train_mse)
    test_scores.append(test_mse)
    
    print(f"Degree {degree:2d}: Train MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}")

# Plot: Shows overfitting
plt.figure(figsize=(12, 6))
plt.plot(degrees, train_scores, marker='o', label='Training Error', linewidth=2)
plt.plot(degrees, test_scores, marker='s', label='Test Error', linewidth=2)
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.title('Overfitting: Gap between training and test error')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

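# Output (exact numbers depend on the random split and the noise draw):
# - Training MSE generally falls as the degree increases, because a
#   higher-degree polynomial can fit the training points at least as well.
# - Test MSE is lowest near the true degree (1) and tends to rise for the
#   high-degree fits, which have started to model the noise.
# - The noise variance is 2**2 = 4, so a model that generalizes well should
#   reach a test MSE of roughly 4; training MSE far below that is a warning sign.
# - At degree 20 the features reach 10**20, so that fit is also numerically
#   ill-conditioned and its errors can be erratic.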
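
A single train/test split can be noisy, so one option for choosing the polynomial degree is to cross-validate every candidate. The following is a minimal sketch using scikit-learn's `validation_curve`. The candidate degrees, the `MinMaxScaler` step (added here only to keep high-degree features numerically stable), and the variable names are assumptions of this example rather than part of the code above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Same kind of data as above, regenerated so the block runs on its own
rng = np.random.default_rng(42)
X_cv = np.linspace(0, 10, 100).reshape(-1, 1)
y_cv = X_cv.ravel() + rng.normal(0, 2, 100)

# Scale x to [-1, 1] first so that x**12 stays well-conditioned
degree_pipeline = make_pipeline(
    MinMaxScaler(feature_range=(-1, 1)),
    PolynomialFeatures(include_bias=False),
    LinearRegression(),
)

candidate_degrees = [1, 2, 3, 5, 8, 12]
# Shuffle the folds: X is sorted, so unshuffled folds would test extrapolation
degree_cv = KFold(n_splits=5, shuffle=True, random_state=0)

train_scores_cv, val_scores_cv = validation_curve(
    degree_pipeline, X_cv, y_cv,
    param_name="polynomialfeatures__degree",
    param_range=candidate_degrees,
    cv=degree_cv,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE with shape (n_degrees, n_folds); flip sign, average folds
mean_train_mse = -train_scores_cv.mean(axis=1)
mean_val_mse = -val_scores_cv.mean(axis=1)

for d, tr, va in zip(candidate_degrees, mean_train_mse, mean_val_mse):
    print(f"Degree {d:2d}: mean train MSE {tr:.2f}, mean validation MSE {va:.2f}")

best_degree = candidate_degrees[int(np.argmin(mean_val_mse))]
print(f"Degree with the lowest validation MSE: {best_degree}")
```

Picking the degree by validation error rather than training error matters because training error can only reward extra complexity.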


### Causes of Overfitting:


In [None]:
# 1. Model too complex relative to data
#    Solution: Simplify model, reduce parameters

# 2. Training for too long (too many epochs or iterations)
#    Solution: Use early stopping

# 3. No regularization
#    Solution: Add L1/L2 regularization

# 4. Too little data
#    Solution: Collect more data, data augmentation

# 5. Noisy training data
#    Solution: Clean data, remove outliers

from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor  # Regressor, because y is continuous here
from tensorflow.keras.callbacks import EarlyStopping

# Solution 1: Regularization (Ridge = L2 penalty, Lasso = L1 penalty)
ridge = Ridge(alpha=1.0)  # Higher alpha = stronger regularization
ridge.fit(X_train, y_train)

lasso = Lasso(alpha=0.1)  # L1 can shrink some coefficients exactly to zero
lasso.fit(X_train, y_train)

# Solution 2: Reduce model complexity
simple_model = RandomForestRegressor(
    n_estimators=10,       # Fewer trees: cheaper, but does little against overfitting
    max_depth=5,           # Shallower trees
    min_samples_split=10,  # More samples required to split a node
    random_state=42
)
simple_model.fit(X_train, y_train)

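# Solution 3: Cross-validation
from sklearn.model_selection import KFold, cross_val_score

# X is sorted, so shuffle the folds; otherwise every fold tests on an unseen x-range
cv_folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(ridge, X, y, cv=cv_folds, scoring='neg_mean_squared_error')
print(f"CV MSE per fold: {-scores}")  # Much worse than the training MSE = overfitting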

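# Solution 4: Early stopping (Neural Networks)
from tensorflow.keras import Sequential, layers

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,  # Stop if val_loss has not improved for 5 consecutive epochs
    restore_best_weights=True  # Roll back to the weights of the best epoch
)

# Small regression network for the 1-feature data above
nn_model = Sequential([
    layers.Input(shape=(1,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1)
])
nn_model.compile(optimizer='adam', loss='mse')

nn_model.fit(
    X_train, y_train,
    validation_split=0.2,  # Last 20% of the training rows are used for val_loss
    callbacks=[early_stop],
    epochs=100,
    verbose=0
)

# Solution 5: Dropout (Neural Networks)
# Example architecture for a 10-class classifier (not fitted on the data above)
dropout_model = Sequential([
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),  # Zero out a random 30% of the activations during training
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax')
])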
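
To make the effect of `alpha` in the Ridge example more concrete, the sketch below fits a deliberately over-flexible degree-15 polynomial at several regularization strengths and prints the size of the learned coefficients. The data, the degree, the list of alphas, and the `MinMaxScaler` step are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Small noisy linear dataset (assumed for illustration)
rng = np.random.default_rng(42)
X_reg = np.linspace(0, 10, 40).reshape(-1, 1)
y_reg = X_reg.ravel() + rng.normal(0, 2, 40)

reg_alphas = [1e-4, 1e-2, 1.0, 100.0]
coef_norms = []

for reg_alpha in reg_alphas:
    reg_pipeline = make_pipeline(
        MinMaxScaler(feature_range=(-1, 1)),                # keep x**15 well-scaled
        PolynomialFeatures(degree=15, include_bias=False),  # far more flexible than needed
        Ridge(alpha=reg_alpha),
    )
    reg_pipeline.fit(X_reg, y_reg)

    # L2 norm of the polynomial coefficients: a rough measure of how wiggly the fit is
    coef_norm = np.linalg.norm(reg_pipeline.named_steps["ridge"].coef_)
    coef_norms.append(coef_norm)
    print(f"alpha={reg_alpha:g}: coefficient norm = {coef_norm:.3f}")
```

The coefficient norm shrinks as `alpha` grows. A tiny `alpha` leaves the polynomial free to chase noise, while a very large one flattens the fit toward underfitting, so `alpha` is usually tuned with cross-validation.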


---
