# Bias-Variance Tradeoff

### What Is the Bias-Variance Tradeoff?

The bias-variance tradeoff describes how the **expected prediction error of a model** (under squared-error loss) can be decomposed into three parts:

$$
\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
$$

* **Bias**: Error due to overly simplistic assumptions in the learning algorithm. High bias can cause the model to miss relevant relations — **underfitting**.
* **Variance**: Error due to the model's sensitivity to small fluctuations in the training data. High variance means the model learns noise — **overfitting**.
* **Irreducible Error**: Noise or randomness in the data that can't be removed even with a perfect model.
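
The three terms can be estimated numerically when the true function is known. The sketch below is only an illustration on synthetic data: the sine target, the noise level, and the helper name `bias_variance` are assumptions, not part of these notes. It refits polynomials of different degrees on many freshly drawn training sets and measures how far the average prediction is from the truth (bias²) and how much individual predictions scatter around that average (variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth signal; in real problems this is unknown
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.05, 0.95, 50)   # fixed evaluation points
noise_sd = 0.3                         # irreducible error is noise_sd**2
n_train, n_repeats = 30, 200

def bias_variance(degree):
    """Estimate bias^2 and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        # A fresh noisy training set on every repeat
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, noise_sd, n_train)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 4, 9)}
for degree, (bias_sq, variance) in results.items():
    print(f"degree={degree}: bias^2={bias_sq:.3f}, "
          f"variance={variance:.3f}, noise={noise_sd**2:.3f}")
```

In this setup the straight line should show the largest bias² and the degree-9 polynomial the largest variance, while the noise term stays the same for every model.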

---

### Visual Understanding

| Model Complexity | Bias     | Variance | Outcome      |
| ---------------- | -------- | -------- | ------------ |
| Low              | High     | Low      | Underfitting |
| Medium           | Moderate | Moderate | Optimal      |
| High             | Low      | High     | Overfitting  |
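
The pattern in the table can be reproduced with scikit-learn's `validation_curve`. This is a minimal sketch on synthetic data, using decision-tree depth as an assumed stand-in for "model complexity"; the data and the chosen depths are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold, validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic, illustrative data: a noisy sine wave
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

depths = [1, 2, 4, 8, 16]  # tree depth stands in for model complexity
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign to get errors
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
for depth, tr, va in zip(depths, train_mse, val_mse):
    print(f"max_depth={depth:2d}: train MSE={tr:.3f}, validation MSE={va:.3f}")
```

Training error keeps falling as depth grows, while validation error is typically lowest at an intermediate depth.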

---

### Why It Matters

The goal in machine learning is to find a model that generalizes well to **unseen data**. Too simple, and it won’t learn enough (high bias). Too complex, and it memorizes noise (high variance).

---

### 🔧 How to Manage the Tradeoff

* **Use cross-validation** to detect overfitting/underfitting.
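* **Regularization** (like L1, L2) to reduce variance, at the cost of slightly higher bias.
* **Ensemble methods**: bagging (e.g., Random Forest) mainly reduces variance, while boosting mainly reduces bias.
* **Get more data** to reduce variance.
* **Simplify the model** to reduce variance if overfitting, or **increase its complexity** to reduce bias if underfitting.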
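
As one illustration of the regularization point, the sketch below fits the same high-degree polynomial features with and without an L2 penalty. The data, the polynomial degree, and `alpha=1.0` are assumptions chosen for the example, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Small synthetic dataset so the flexible model has room to overfit
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(40, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, size=40)

def poly_model(regressor):
    # Same degree-12 features for both models; only the final regressor differs
    return make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        regressor,
    )

models = {
    "unregularized": poly_model(LinearRegression()),
    "ridge (L2)": poly_model(Ridge(alpha=1.0)),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)
coef_norms = {}
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    model.fit(X, y)
    coef_norms[name] = np.linalg.norm(model[-1].coef_)
    print(f"{name:14s} CV MSE={mse:.3f}  ||coef||={coef_norms[name]:.2f}")
```

The L2 penalty shrinks the coefficient vector, which is what limits the model's sensitivity to individual training points.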

---

### Example

Suppose you're fitting a model to predict housing prices:

* **Linear regression** might have high bias (can't capture non-linear trends).
* **Polynomial regression (degree 20)** might have low bias but high variance (overfits).
* **A well-tuned Random Forest** might balance bias and variance effectively.
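
The comparison above can be sketched in code. No real housing data is used here: the single scaled feature and the wavy price trend are synthetic stand-ins, and the Random Forest settings are plausible defaults, not a tuned configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for housing data: one scaled feature, non-linear trend
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(80, 1))
y = np.sin(4 * np.pi * X[:, 0]) + rng.normal(0, 0.2, size=80)

models = {
    "linear": LinearRegression(),
    "poly_20": make_pipeline(PolynomialFeatures(degree=20), LinearRegression()),
    "random_forest": RandomForestRegressor(
        n_estimators=100, min_samples_leaf=3, random_state=0
    ),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validated R^2 (cross_val_score fits clones, so models stay unfitted)
cv_r2 = {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}
# Training R^2 on the full dataset for comparison
train_r2 = {name: m.fit(X, y).score(X, y) for name, m in models.items()}

for name in models:
    print(f"{name:14s} train R^2={train_r2[name]:.2f}  CV R^2={cv_r2[name]:.2f}")
```

A large drop from training R² to CV R² points to high variance; low values on both point to high bias.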

# Cross-Validation

Cross-validation helps detect **overfitting** and **underfitting** by evaluating a model’s performance on **unseen (held-out) data**, not just on the training set.

---

### What Is Cross-Validation?

Cross-validation (CV) is a **model validation technique** that splits the dataset into several parts (called "folds"), trains the model on all but one fold, and tests it on the held-out fold. The most common form is **k-fold cross-validation**, which repeats this k times so that each fold serves as the test set exactly once, then averages the k scores.
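
To make the procedure concrete, here is a minimal sketch that runs the k-fold loop by hand and compares it with `cross_val_score`. The dataset and the choice of `LinearRegression` are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

manual_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, score (R^2) on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The library helper performs the same loop internally
library_scores = cross_val_score(LinearRegression(), X, y, cv=kf)

print("manual :", np.round(manual_scores, 3))
print("library:", np.round(library_scores, 3))
```

Both arrays hold one score per fold; the mean is the usual summary, and the spread across folds indicates how stable the estimate is.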

---

### How It Detects Overfitting and Underfitting

| Scenario         | Training Error | Validation Error | Diagnosis                            |
| ---------------- | -------------- | ---------------- | ------------------------------------ |
| **Underfitting** | High           | High             | Model is too simple (high bias)      |
| **Overfitting**  | Low            | High             | Model is too complex (high variance) |
| **Good Fit**     | Low–Moderate   | Low–Moderate     | Balanced model                       |

---

### Intuition

1. **Underfitting (High Bias)**:

   * Model performs poorly on both training and validation folds.
   * Cannot capture the underlying patterns.
   * CV shows **consistently high error** across all folds.

2. **Overfitting (High Variance)**:

   * Model performs very well on training folds but poorly on validation folds.
   * Learns noise and specific details of training data.
   * CV shows **large gap between training and validation scores**.

3. **Just Right**:

   * Training and validation scores are both good and close to each other.
   * CV shows **low and stable error** across folds.
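
The three cases can be observed side by side with `cross_validate(..., return_train_score=True)`. The sketch below assumes synthetic linear data from `make_regression`, so a depth-1 tree plays the underfitting role, an unrestricted tree the overfitting role, and linear regression the well-matched model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "stump (max_depth=1)": DecisionTreeRegressor(max_depth=1, random_state=0),
    "deep tree": DecisionTreeRegressor(random_state=0),
    "linear": LinearRegression(),
}

summary = {}
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv, return_train_score=True)
    train = res["train_score"].mean()
    val = res["test_score"].mean()
    summary[name] = (train, val, train - val)  # (train R^2, validation R^2, gap)
    print(f"{name:20s} train={train:.2f}  validation={val:.2f}  gap={train - val:.2f}")
```

Low scores on both sides suggest underfitting, a large gap suggests overfitting, and high scores with a small gap suggest a good fit.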

---

### Tips

* Always compare **training vs. cross-validation performance**.
* Use **learning curves** to visualize overfitting/underfitting.
* Adjust **model complexity** based on CV results.
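
For the learning-curve tip, scikit-learn provides `learning_curve`, which refits the model on growing subsets of the training data. A minimal sketch, assuming synthetic data and an unrestricted decision tree (plotting is omitted; the printed table carries the same information):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, learning_curve
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),   # fractions of the training folds
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)

# Rows are training-set sizes, columns are CV folds
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n_train={n:3d}: train R^2={tr:.2f}, validation R^2={va:.2f}")
```

A training score that stays near 1.0 while the validation score remains well below it is the overfitting signature; two curves that plateau together at a low score would indicate underfitting.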

In [None]:
# cross-validation

from sklearn.model_selection import cross_val_score, LeaveOneOut, StratifiedKFold, KFold
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Fixed seed so the scores are reproducible
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)

# Low bias, high variance model
tree = DecisionTreeRegressor()

# General-purpose (integer cv)
# Use with: 
# - moderate to large datasets
# - no need for fine control over the splits
# - cv=5 is shorthand for KFold(n_splits=5) (no shuffling) for regressors,
#   or StratifiedKFold(n_splits=5) for classifiers
tree_scores = cross_val_score(tree, X, y, cv=5)

# with KFold
# Use with: 
# - Regression tasks, 
# - class balance is not a concern, 
# - want manual control over shuffling, random state, or fold size
kf = KFold(n_splits=5, shuffle=True, random_state=42)
tree_scores_kf = cross_val_score(tree, X, y, cv=kf)

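# with LeaveOneOut
# Use with: 
# - dataset is very small (typically <100 samples); it needs one model fit per sample
# - when you want a nearly unbiased (but high-variance) estimate from few data points
# - High-stakes scenarios (e.g., biomedical applications) with few data points
# R^2 (the default regression score) is undefined on a single-sample test fold,
# so score with an error metric instead
loo = LeaveOneOut()
tree_scores_loo = cross_val_score(tree, X, y, cv=loo, scoring="neg_mean_squared_error")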

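# with StratifiedKFold
# Use with: 
# - Classification tasks, especially with class imbalance
# - maintaining the class distribution across folds is important
# StratifiedKFold needs discrete class labels and raises an error on the
# continuous regression target above, so a small imbalanced classification
# problem is used here instead
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_clf, y_clf = make_classification(n_samples=200, n_features=10, weights=[0.8, 0.2], random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
tree_scores_skf = cross_val_score(DecisionTreeClassifier(random_state=42), X_clf, y_clf, cv=skf)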


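# Compare training vs. CV scores (R^2) for both models
linear = LinearRegression()
linear_scores = cross_val_score(linear, X, y, cv=5)
tree_train_score = tree.fit(X, y).score(X, y)
linear_train_score = linear.fit(X, y).score(X, y)
print(f"Tree:   train={tree_train_score:.2f}, CV={tree_scores.mean():.2f}")
print(f"Linear: train={linear_train_score:.2f}, CV={linear_scores.mean():.2f}")

# Tree model has high training score but low CV score → **Overfitting**
# Linear model has low training and CV score → **Underfitting**
#   (this happens only when the true relationship is non-linear; make_regression
#   generates linear data, so here the linear model scores well on both)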

### Ensemble Techniques

| Technique    | Main Goal       | Bias               | Variance                 | Ideal For                                                |
| ------------ | --------------- | ------------------ | ------------------------ | -------------------------------------------------------- |
| **Bagging**  | Reduce variance | Keeps bias similar | Lowers variance          | High-variance, low-bias models (e.g., decision trees)    |
| **Boosting** | Reduce bias     | Lowers bias        | Can increase variance    | High-bias, low-variance models or underfitting scenarios |
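
Both rows of the table can be tried out with scikit-learn's ensemble estimators. The sketch below is illustrative only: it assumes the synthetic `make_friedman1` dataset, bags fully grown trees (high variance), and boosts depth-1 stumps (high bias); the estimator counts are arbitrary.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    # High-variance base learner and its bagged version
    "deep tree": DecisionTreeRegressor(random_state=0),
    "bagged deep trees": BaggingRegressor(
        DecisionTreeRegressor(), n_estimators=50, random_state=0
    ),
    # High-bias base learner and its boosted version
    "stump": DecisionTreeRegressor(max_depth=1, random_state=0),
    "boosted stumps": GradientBoostingRegressor(
        max_depth=1, n_estimators=200, random_state=0
    ),
}

cv_r2 = {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}
for name, score in cv_r2.items():
    print(f"{name:18s} CV R^2={score:.2f}")
```

Bagging averages many overfit trees to cut variance, while boosting adds weak learners sequentially to cut bias, so each ensemble should outscore its single base learner here.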
