# Overfitting, Underfitting, Bias, and Variance

## Key Formulas

* `Generalized_Model = Low Bias + Low Variance`
* `Overfitting = Low Bias + High Variance`
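* `Underfitting = High Bias + Low Variance` (test error is high as well, but it stays close to the training error, so the problem is bias rather than variance)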

## Summary

* **Machine Learning** workflows split a dataset into **Training**, **Validation**, and **Testing** sets so that performance is measured on data the model did not learn from.
* **Overfitting** occurs when a model performs exceptionally well on training data but fails to generalize to new, unseen test data.
* **Underfitting** happens when a model performs poorly on both the training data and the testing data, failing to capture the underlying patterns.
* A **Generalized Model** is the ideal state where the model performs well on both training and testing datasets.
* **Bias** shows up as the error rate on the training dataset, while **Variance** shows up as the gap between the training error and the testing error (how much worse the model does on unseen data).

## Exam Notes

### The Student Analogy for Model Performance

**Question**: How can you explain Overfitting, Underfitting, and Generalized Models using a real-world analogy?

**Answer**: Imagine a student preparing for a math exam.

* **Generalized Model (Ideal Student)**:  
  The student studies the textbook (Training Data) conceptually. When they take the exam (Test Data), they can solve new questions because they understand the logic. They score well in both practice and the final exam.

* **Overfitting (The Memorizer)**:  
  The student memorizes every specific question and answer in the textbook. They score 100% on practice questions (Training Data) but fail the exam (Test Data) because the questions are different.

* **Underfitting (The Slacker)**:  
  The student barely studies. They perform poorly on practice questions (Training Data) and fail the exam (Test Data) as well.

---

## Dataset Splitting

When building a model with a library such as **scikit-learn** (`sklearn`), the dataset is typically split so that the model can be evaluated on data it has not seen during training.

* **Training Data**: The subset of data used to train the model.
* **Test Data**: The subset of data kept separate to evaluate the model’s performance on unseen data.
* **Split Ratio**: Common practice is **70–80% training** and **20–30% testing**.
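As a minimal sketch of the split described above, the snippet below uses scikit-learn's `train_test_split` with an 80/20 ratio. The Iris dataset, the `random_state` value, and the use of `stratify` are illustrative assumptions, not part of the notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Small built-in dataset, used here only for illustration (150 samples).
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as test data; the remaining 80% is training data.
# random_state makes the split reproducible; stratify keeps class proportions
# similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```

The test subset should be left untouched until the final evaluation; otherwise it no longer counts as unseen data.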

## Bias and Variance

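To diagnose model performance, we use **Bias** and **Variance**. Formally, bias is the error that comes from a model being too simple for the data, and variance is how sensitive the model is to the particular training sample. In practice, both are estimated from the training and test errors.

* **Bias**: Estimated from the error on the **Training Dataset**
  * **Low Bias**: Low training error (high training accuracy)
  * **High Bias**: High training error (low training accuracy)

* **Variance**: Estimated from how much worse the error on the **Test Dataset** is than the training error
  * **Low Variance**: Test error is close to training error (small gap)
  * **High Variance**: Test error is much higher than training error (large gap)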
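The training and test errors used for this diagnosis can be computed directly from predictions. The sketch below is plain Python; the helper name `error_rate` and the label lists are hypothetical values chosen for illustration.

```python
def error_rate(y_true, y_pred):
    """Return the fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

# Hypothetical labels and predictions, for illustration only.
train_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
train_pred = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # 1 mistake out of 10
test_true = [1, 0, 0, 1, 1]
test_pred = [1, 1, 0, 0, 1]                  # 2 mistakes out of 5

train_error = error_rate(train_true, train_pred)  # error on training data
test_error = error_rate(test_true, test_pred)     # error on test data
gap = test_error - train_error                    # how much worse on unseen data

print(f"train error={train_error:.2f}, test error={test_error:.2f}, gap={gap:.2f}")
```

Accuracy is simply `1 - error_rate`, so the same numbers can be read either way.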

## Model Performance States

### 1. Generalized Model

A **Generalized Model** captures underlying patterns without memorizing noise.

* **Characteristics**: High accuracy in training and testing
* **Bias/Variance**: **Low Bias**, **Low Variance**
* **Example**: 90% training accuracy and 85% test accuracy — small gap indicates robustness

### 2. Overfitting

**Overfitting** happens when the model memorizes training data, including noise.

* **Characteristics**: Very high training accuracy, poor test accuracy
* **Bias/Variance**: **Low Bias**, **High Variance**
* **Example**: 99% training accuracy, 50% test accuracy — large gap indicates high variance
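One way to see this behaviour is to compare an unrestricted decision tree with a depth-limited one on noisy data. This is a sketch under assumptions: the synthetic dataset from `make_classification`, the 20% label noise (`flip_y`), the depth limit of 3, and the seeds are all illustrative choices, and the exact accuracies depend on them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with noisy labels: flip_y=0.2 assigns random labels to about
# 20% of the samples.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    # No depth limit: the tree can keep splitting until it fits every training row.
    "unrestricted": DecisionTreeClassifier(random_state=0),
    # Depth limit: the tree is forced to keep only coarse patterns.
    "depth 3": DecisionTreeClassifier(max_depth=3, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[name] = (train_acc, test_acc)
    print(f"{name}: train={train_acc:.2f}, test={test_acc:.2f}, "
          f"gap={train_acc - test_acc:.2f}")
```

The unrestricted tree memorizes the training rows, noise included, so its training accuracy is perfect; its test accuracy is typically much lower, which is the large gap described above.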

### 3. Underfitting

**Underfitting** occurs when the model is too simple to learn patterns.

* **Characteristics**: Low training accuracy and low test accuracy
* **Bias/Variance**: **High Bias**, **Low Variance** (test accuracy is low, but it stays close to training accuracy)
* **Example**: 40% training accuracy and 35% test accuracy — model failed to learn
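The three states can be told apart with a simple rule of thumb on training and test accuracy. The sketch below is hypothetical: the function name `diagnose` and the thresholds (`min_train_acc=0.8`, `max_gap=0.1`) are assumptions for illustration, and reasonable cut-offs depend on the problem.

```python
def diagnose(train_acc, test_acc, min_train_acc=0.8, max_gap=0.1):
    """Label a model from its training and test accuracy.

    The thresholds are illustrative assumptions, not fixed rules.
    """
    if train_acc < min_train_acc:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if train_acc - test_acc > max_gap:
        # Good on training data, much worse on unseen data.
        return "overfitting"
    return "generalized"

# The three examples from this section.
print(diagnose(0.90, 0.85))  # generalized
print(diagnose(0.99, 0.50))  # overfitting
print(diagnose(0.40, 0.35))  # underfitting
```

Checking the training accuracy first matters: an underfit model can have a small gap, so the gap alone does not show that a model generalizes.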
