# **1. Model Evaluation Metrics**

---

## **1. Confusion Matrix (The Foundation)**

For a **binary classification problem** (Positive vs Negative), we classify predictions into four categories:

|                     | **Predicted Positive** | **Predicted Negative** |
| ------------------- | ---------------------- | ---------------------- |
| **Actual Positive** | True Positive (TP)     | False Negative (FN)    |
| **Actual Negative** | False Positive (FP)    | True Negative (TN)     |
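
The four cells can be counted directly from paired labels. A minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the sample data are illustrative, not from any library:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for binary labels."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # actual positive, predicted positive
            else:
                fn += 1  # actual positive, predicted negative
        else:
            if predicted == positive:
                fp += 1  # actual negative, predicted positive
            else:
                tn += 1  # actual negative, predicted negative
    return tp, fn, fp, tn


# Hypothetical labels for eight samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)
```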

---

## **2. Accuracy**

* The ratio of correctly predicted samples to total samples.
  $$
  \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
  $$

✅ Good when: Classes are balanced.
❌ Misleading when: Dataset is **imbalanced** (e.g., fraud detection with 99% normal cases).
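
The imbalance problem is easy to reproduce. The sketch below assumes a hypothetical fraud dataset of 1,000 cases (990 normal, 10 fraud) and a degenerate model that always predicts "normal":

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts: the model never predicts the positive (fraud) class.
tp, fn = 0, 10    # all 10 fraud cases are missed
fp, tn = 0, 990   # all 990 normal cases are correct

always_normal_accuracy = accuracy(tp, tn, fp, fn)
fraud_caught = tp / (tp + fn)  # share of real fraud cases detected

print(always_normal_accuracy, fraud_caught)
```

The accuracy looks excellent even though the model detects no fraud at all, which is why accuracy alone is not a safe summary for imbalanced data.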

---

## **3. Precision (Positive Predictive Value)**

* Of all predicted positives, how many are truly positive?
  $$
  \text{Precision} = \frac{TP}{TP + FP}
  $$

✅ Important when **false positives are costly**.
📌 Example: Spam detection (you don’t want to mark genuine emails as spam).

---

## **4. Recall (Sensitivity / True Positive Rate)**

* Of all actual positives, how many did we correctly predict?
  $$
  \text{Recall} = \frac{TP}{TP + FN}
  $$

✅ Important when **false negatives are costly**.
📌 Example: Cancer detection (you don’t want to miss a real cancer case).

---

## **5. F1-Score (Harmonic Mean of Precision & Recall)**

* Balances precision and recall.
  $$
  F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
  $$

✅ Good when you need a balance between **precision and recall**.
📌 Example: Information retrieval, fraud detection.
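
Precision, recall, and F1 can all be computed from the confusion-matrix counts. A minimal sketch; returning 0.0 when a denominator is zero is one common convention and is an assumption here, and the counts are made up for illustration:

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 if nothing was predicted positive (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); 0.0 if there are no actual positives (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
p = precision(30, 10)
r = recall(30, 20)
f1 = f1_score(30, 10, 20)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because the harmonic mean is pulled toward the smaller value, F1 stays low whenever either precision or recall is low.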

---

## **6. ROC Curve (Receiver Operating Characteristic)**

* Plots **True Positive Rate (Recall)** vs **False Positive Rate (FPR = FP / (FP+TN))** at different thresholds.
* A good classifier → curve near top-left corner.
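
The curve is built by sweeping the decision threshold over the model's scores and recording one (FPR, TPR) point per threshold. A minimal sketch, assuming labels are 0/1, both classes are present, and a sample is predicted positive when its score is at or above the threshold; the data is hypothetical:

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    # Start above the highest score so the first point is (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical labels and predicted scores for six samples.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```

Lowering the threshold can only add predicted positives, so both FPR and TPR are non-decreasing along the curve.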

---

## **7. AUC (Area Under ROC Curve)**

* Measures classifier’s ability to distinguish between classes.
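* Ranges from 0 to 1: 0.5 corresponds to random guessing and 1.0 to a perfect classifier. Values below 0.5 mean the ranking is worse than random (the predictions are effectively inverted).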

✅ Higher AUC = Better model.
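
One way to read AUC is as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it by brute force over all positive/negative pairs, counting ties as half; it assumes 0/1 labels with both classes present, and the data is hypothetical:

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_pairwise(y_true, scores)
# Negating the scores reverses the ranking of every pair.
auc_flipped = auc_pairwise(y_true, [-s for s in scores])
print(round(auc, 3), round(auc_flipped, 3))
```

This quadratic loop is only meant to show the definition; libraries typically use a sort-based computation instead.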

---

## **8. When to Use Which?**

| Metric        | Best Used When                                                       |
| ------------- | -------------------------------------------------------------------- |
| **Accuracy**  | Balanced classes, general correctness                                |
| **Precision** | Cost of false positives is high (spam filters, recommendations)      |
| **Recall**    | Cost of false negatives is high (medical diagnosis, fraud detection) |
| **F1-Score**  | Need balance between precision & recall (imbalanced data)            |
| **ROC-AUC**   | Comparing models’ ability to separate classes (ranking performance)  |

---

## **9. Visualization (Conceptual)**

📌 Confusion Matrix Example (Cancer Detection):

```
                Predicted
             |  Positive   Negative
Actual   +   |    80         20
        -   |    10        890
```

* Precision = 80 / (80+10) = 0.89
* Recall = 80 / (80+20) = 0.80
* F1 = 0.84
* Accuracy = (80+890)/1000 = 0.97
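
As a worked check, the F1 value follows from plugging the precision and recall above into the harmonic-mean formula (intermediate values rounded to three decimals):

$$
F1 = 2 \cdot \frac{0.889 \cdot 0.80}{0.889 + 0.80} = \frac{1.422}{1.689} \approx 0.84
$$

The same result comes straight from the counts: $F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} = \frac{160}{190} \approx 0.84$.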

---

## **10. Key Takeaways**

* Accuracy is not always enough (especially with imbalanced data).
* Precision vs Recall depends on whether you care more about **false positives** or **false negatives**.
* F1-score = balance.
* ROC-AUC = ranking ability across thresholds.

---
---
---


# **2. Bias–Variance Tradeoff**

---

## **1. Sources of Error in ML Models**

When a model predicts, the total error can be decomposed into three parts:

$$
\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
$$

* **Bias:** Error from overly simple assumptions in the model.
* **Variance:** Error from sensitivity to the particular training data (overly complex model).
* **Irreducible Error (Noise):** Error due to randomness in data (cannot be fixed).

---

## **2. What is Bias?**

* **Bias = Difference between model’s average prediction and the true value.**
* High bias → model is too simple, underfitting.
* Example: Linear regression trying to fit non-linear data → biased predictions.

---

## **3. What is Variance?**

* **Variance = How much model predictions change if trained on different datasets.**
* High variance → model is too complex, overfitting.
* Example: A deep decision tree memorizes training data but performs poorly on test data.

---

## **4. Tradeoff**

* If you **increase complexity** → bias ↓ but variance ↑.
* If you **decrease complexity** → variance ↓ but bias ↑.
* Goal: Find the **sweet spot** where total error is minimized.
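
A small simulation can make the tradeoff concrete. The sketch below is an illustration under assumptions that are not part of these notes: the true function is `sin(2πx)`, the model is a 1-D k-nearest-neighbour regressor, inputs sit on a fixed grid of 20 points, and noise is Gaussian. It trains on many noisy datasets and measures bias² and variance of the prediction at one test point:

```python
import math
import random


def true_f(x):
    return math.sin(2 * math.pi * x)


def knn_predict(train_x, train_y, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k


def bias_variance(k, x0=0.25, n_points=20, noise_sd=0.3, n_datasets=500, seed=0):
    """Estimate (bias^2, variance) of the k-NN prediction at x0."""
    rng = random.Random(seed)
    train_x = [i / (n_points - 1) for i in range(n_points)]
    predictions = []
    for _ in range(n_datasets):
        # Same inputs, fresh noise: each pass acts as a different training set.
        train_y = [true_f(x) + rng.gauss(0, noise_sd) for x in train_x]
        predictions.append(knn_predict(train_x, train_y, x0, k))
    mean_pred = sum(predictions) / n_datasets
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in predictions) / n_datasets
    return bias_sq, variance


flexible = bias_variance(k=1)   # complex model: follows individual points
rigid = bias_variance(k=20)     # simple model: predicts the global average
print("k=1  bias^2=%.4f variance=%.4f" % flexible)
print("k=20 bias^2=%.4f variance=%.4f" % rigid)
```

With `k=1` the prediction tracks single noisy points (low bias, high variance); with `k=20` it collapses to the overall mean (high bias, low variance).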

---

## **5. Visualization (Conceptual)**

🎯 **Bias–Variance in terms of shooting arrows at a target:**

* **High Bias, Low Variance:** Shots are grouped but far from the bullseye (systematically wrong → underfit).
* **Low Bias, High Variance:** Shots are scattered around bullseye (inconsistent → overfit).
* **Low Bias, Low Variance:** Shots are tightly around the bullseye (ideal).
* **High Bias, High Variance:** Shots are scattered far away (worst).

---

## **6. Graph of Error vs Model Complexity**

```
Error
│        Bias Error  \
│                     \
│                      \
│                        \__ Total Error
│          Variance Error /
│         /              /
│        /              /
└───────────────────────────────▶ Model Complexity
```

* At low complexity → high bias, low variance.
* At high complexity → low bias, high variance.
* Optimal complexity = minimum total error.

---

## **7. Strategies to Manage Tradeoff**

* **Reduce Bias (Underfitting):**

  * Add more features.
  * Use more complex models.
  * Reduce regularization (lower λ in Ridge/Lasso).

* **Reduce Variance (Overfitting):**

  * Use simpler models.
  * Regularization (Ridge, Lasso, Dropout in NN).
  * Collect more training data.
  * Ensemble methods (Bagging, Random Forest).

---

## **8. Real-Life Examples**

* **High Bias (Underfitting):** Predicting house prices using only number of bedrooms (ignores location, size, etc.).
* **High Variance (Overfitting):** Memorizing exact past house sales → perfect training accuracy but poor test accuracy.
* **Balanced Model:** Captures main patterns without memorizing noise.

---

## **9. Key Takeaways**

* **Bias = Wrong assumptions → underfitting.**
* **Variance = Over-sensitivity → overfitting.**
* Tradeoff = Finding optimal model complexity.
* Goal: **Low bias + Low variance**. Neither can be driven to zero in practice, and the irreducible error remains regardless.

---
---
---

# **3. Cross-Validation**

---

## **1. Why Do We Need Cross-Validation?**

* Training accuracy alone doesn’t tell us if the model generalizes.
* A single **train-test split** might give misleading results (depending on the split).
* Cross-validation ensures the model is tested on **different subsets** of data, reducing variance in evaluation.

---

## **2. Basic Idea**

* Split data into multiple folds (subsets).
* Train model on **k-1 folds** and test on the remaining fold.
* Repeat until each fold has been used as test once.
* Average results across folds.

---

## **3. Types of Cross-Validation**

### 🔹 **Hold-Out Method (Train-Test Split)**

* Simplest: Split dataset (e.g., 80% train, 20% test).
* Fast but can give **biased/unstable results** depending on split.

---

### 🔹 **k-Fold Cross-Validation**

* Divide data into **k equal folds**.
* Train on k-1 folds, test on 1 fold.
* Repeat k times.
* Average metrics → more stable performance estimate.

📌 Common choice: **k = 5 or 10**.
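
The splitting step can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function name `k_fold_indices` and the choice to shuffle with a fixed seed are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(100, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

In a full evaluation loop, a model would be fit on `train_idx`, scored on `test_idx`, and the k scores averaged.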

---

### 🔹 **Stratified k-Fold**

* Like k-Fold, but ensures class proportions are preserved in each fold.
* Important for **imbalanced datasets** (e.g., fraud detection).
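
One simple way to preserve class proportions is to shuffle each class separately and deal its samples round-robin across the folds. A minimal sketch under that assumption (the function name `stratified_k_fold` and the 10%-positive dataset are hypothetical):

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Return k lists of test indices with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(idxs):
            folds[position % k].append(idx)
    return folds


# Hypothetical imbalanced dataset: 10 positives, 90 negatives.
labels = [1] * 10 + [0] * 90
folds = stratified_k_fold(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```

Each fold serves once as the test set, with the remaining folds as training data. A plain random split of this dataset could easily leave a fold with no positives at all.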

---

### 🔹 **Leave-One-Out Cross-Validation (LOOCV)**

* Extreme case of k-Fold where k = n (number of samples).
* Train on n-1 samples, test on 1 sample.
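* Gives a nearly unbiased estimate (each model trains on almost all the data), but the estimate can have high variance and requires n model fits, so it is computationally expensive.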

---

### 🔹 **Leave-P-Out Cross-Validation**

* Leave p samples out for testing, use rest for training.
* Generalization of LOOCV (p=1).
* Too costly for large datasets.

---

### 🔹 **Nested Cross-Validation**

* Used for **model selection + evaluation** simultaneously.
* Outer loop = evaluates model.
* Inner loop = tunes hyperparameters (Grid Search, Random Search, Bayesian Optimization).
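
The two loops can be sketched with a toy model. Everything specific here is an assumption for illustration: a 1-D k-nearest-neighbour classifier, a grid search over the number of neighbours, interleaved folds, and synthetic data.

```python
import random


def make_folds(indices, k):
    """Split indices into k interleaved folds of near-equal size."""
    return [indices[i::k] for i in range(k)]


def knn_classify(train, x0, k):
    """Majority vote among the k nearest (x, label) training pairs."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def cv_accuracy(data, k_neighbors, n_folds):
    """Plain cross-validated accuracy for one hyperparameter value."""
    correct = 0
    for test_idx in make_folds(list(range(len(data))), n_folds):
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        correct += sum(knn_classify(train, data[i][0], k_neighbors) == data[i][1]
                       for i in test_idx)
    return correct / len(data)


def nested_cv(data, candidate_ks, outer_folds=5, inner_folds=3):
    scores = []
    for test_idx in make_folds(list(range(len(data))), outer_folds):
        held_out = set(test_idx)
        outer_train = [data[i] for i in range(len(data)) if i not in held_out]
        # Inner loop: pick the hyperparameter using outer-training data only.
        best_k = max(candidate_ks, key=lambda k: cv_accuracy(outer_train, k, inner_folds))
        # Outer loop: score that choice on data the tuning never saw.
        hits = sum(knn_classify(outer_train, data[i][0], best_k) == data[i][1]
                   for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


rng = random.Random(0)
# Hypothetical 1-D data: class 1 tends to have larger x, with overlap.
data = [(rng.gauss(label * 2.0, 1.0), label) for label in [0, 1] * 30]
rng.shuffle(data)
estimate = nested_cv(data, candidate_ks=[1, 3, 5, 7])
print(round(estimate, 3))
```

Because the outer test fold is never used for tuning, the averaged outer score is not inflated by the hyperparameter search.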

---

## **4. Workflow Example (5-Fold CV)**

Dataset = 100 samples, k=5.

* Fold 1: Train (80), Test (20)
* Fold 2: Train (80), Test (20)
* Fold 3: Train (80), Test (20)
* Fold 4: Train (80), Test (20)
* Fold 5: Train (80), Test (20)

Final performance = average of all 5 test results.

---

## **5. Pros & Cons**

### ✅ Pros

* More reliable than single train-test split.
* Reduces the risk of an **overly optimistic or pessimistic estimate** caused by one lucky or unlucky split.
* Uses data efficiently (all points used for training & testing).

### ❌ Cons

* More computationally expensive (especially LOOCV).
* Not always suitable for **time series** (since order matters).

---

## **6. Special Case: Cross-Validation in Time Series**

* Cannot randomly shuffle (future depends on past).
* Use **rolling/expanding window CV**:

  * Train on first 70%, test on next 10%, move window forward.
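
An expanding-window split can be sketched as follows. The function name and parameters are illustrative; the sizes (70 initial training samples, test blocks of 10, out of 100) mirror the percentages in the bullet above:

```python
def expanding_window_splits(n_samples, initial_train, test_size):
    """Yield (train_idx, test_idx) where the test block always follows the train block."""
    train_end = initial_train
    while train_end + test_size <= n_samples:
        train_idx = list(range(0, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        train_end += test_size  # grow the training window forward in time


splits = list(expanding_window_splits(n_samples=100, initial_train=70, test_size=10))
for train_idx, test_idx in splits:
    print("train 0..%d, test %d..%d" % (train_idx[-1], test_idx[0], test_idx[-1]))
```

A rolling (fixed-length) window would instead drop the oldest samples as it moves, by starting `train_idx` at `train_end - initial_train`.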

---

## **7. Key Takeaways**

* Cross-validation provides a **robust estimate of model performance**.
* **k-Fold (k=5 or 10)** is the most commonly used.
* Use **Stratified CV** for imbalanced datasets.
* Use **Nested CV** when tuning hyperparameters.
* Use **Time Series CV** for sequential data.

---
---
---

# **4. Regularization Techniques in Machine Learning**

---

## **1. Why Do We Need Regularization?**

* In high-dimensional data (lots of features), models can overfit.
* Overfitting = memorizing noise instead of learning patterns.
* Regularization helps by **adding a penalty term** to the loss function → discourages overly complex models.

---

## **2. General Form**

For regression, the typical cost function is **Mean Squared Error (MSE)**, written here with a conventional factor of 1/2 that simplifies the gradient:

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2
$$

Regularization adds a **penalty**:

$$
J(\theta) = \text{Loss Function} + \lambda \cdot \text{Penalty}
$$

* **λ (lambda)** = regularization strength.

  * Large λ → stronger penalty → simpler model.
  * Small λ → weaker penalty → closer to normal regression.

---

## **3. Ridge Regression (L2 Regularization)**

* Adds penalty = sum of squared coefficients.

$$
J(\theta) = \text{MSE} + \lambda \sum_{j=1}^{n} \theta_j^2
$$

* Shrinks coefficients towards zero, but in general **not exactly to zero**.
* Works well when: Many small/medium correlated features.

✅ Pros: Prevents overfitting, stable solution.
❌ Cons: Doesn’t perform feature selection (all features remain).

📌 **Use Case:** Ridge is good when you believe **all features contribute** but just want to shrink their impact.
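
The shrinkage effect is easy to see in the simplest possible case. The sketch below assumes a single feature, no intercept, and the cost $\frac{1}{2m}\sum (y - \theta x)^2 + \lambda \theta^2$; setting the derivative to zero gives $\theta = \frac{\sum xy}{\sum x^2 + 2m\lambda}$. The data is made up:

```python
def ridge_1d(xs, ys, lam):
    """Closed-form minimizer of (1/2m) * sum((y - theta*x)^2) + lam * theta^2."""
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + 2 * m * lam)


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

thetas = [ridge_1d(xs, ys, lam) for lam in [0.0, 1.0, 10.0, 100.0]]
for lam, theta in zip([0.0, 1.0, 10.0, 100.0], thetas):
    print(lam, round(theta, 4))
```

The coefficient gets steadily smaller as λ grows, but for any finite λ it stays nonzero.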

---

## **4. Lasso Regression (L1 Regularization)**

* Adds penalty = sum of absolute values of coefficients.

$$
J(\theta) = \text{MSE} + \lambda \sum_{j=1}^{n} |\theta_j|
$$

* Can shrink some coefficients **exactly to zero** → feature selection.
* Works well when: Only a few features are important.

✅ Pros: Performs automatic **feature selection**.
❌ Cons: Can be unstable if features are highly correlated.

📌 **Use Case:** Lasso is good when you expect **sparsity** (only some features matter).
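
Why L1 produces exact zeros shows up in the one-feature case. The sketch below assumes a single feature, no intercept, and the cost $\frac{1}{2m}\sum (y - \theta x)^2 + \lambda |\theta|$, whose minimizer is given by the soft-thresholding operator. The data is made up:

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; return exactly 0.0 inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_1d(xs, ys, lam):
    """Minimizer of (1/2m) * sum((y - theta*x)^2) + lam * |theta|."""
    m = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / m   # feature/target correlation term
    z = sum(x * x for x in xs) / m                 # feature scale term
    return soft_threshold(rho, lam) / z


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

for lam in [0.0, 5.0, 15.0]:
    print(lam, round(lasso_1d(xs, ys, lam), 4))
```

Once λ exceeds the magnitude of the correlation term `rho`, the coefficient is set to exactly zero, which is the mechanism behind Lasso's feature selection.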

---

## **5. Elastic Net (Combination of L1 & L2)**

* Hybrid of Ridge + Lasso.

$$
J(\theta) = \text{MSE} + \lambda_1 \sum_{j=1}^{n} |\theta_j| + \lambda_2 \sum_{j=1}^{n} \theta_j^2
$$

* Balances **feature selection (L1)** and **stability (L2)**.
* Good when: Features are highly correlated, or you need both benefits.

✅ Pros: Flexible, handles correlated features better.
❌ Cons: Two hyperparameters to tune.

📌 **Use Case:** Best for **high-dimensional datasets** (like genomics, text classification).

---

## **6. Visualization (Effect on Coefficients)**

* **No Regularization:** Coefficients can grow very large.
* **Ridge:** Coefficients shrink, but none go to zero.
* **Lasso:** Some coefficients = 0 (feature elimination).
* **Elastic Net:** Combination of both: some coefficients shrink, some are eliminated.

---

## **7. Beyond Linear Models (Regularization in Other Models)**

* **Decision Trees / Ensembles**

  * Control depth, min samples, pruning (prevents overfitting).
* **Neural Networks**

  * L1/L2 penalties on weights.
  * **Dropout**: Randomly deactivate a fraction of neurons at each training step.
  * **Early Stopping**: Stop training when validation error increases.
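
Early stopping is usually implemented with some tolerance, since validation error can fluctuate from epoch to epoch. A minimal sketch, assuming a `patience` parameter (the number of non-improving epochs to tolerate) and a hypothetical list of per-epoch validation losses:

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before training is stopped."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop: validation loss has not improved for `patience` epochs
    return best_epoch


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.40]
print(early_stopping_epoch(val_losses, patience=2))
print(early_stopping_epoch(val_losses, patience=3))
```

In practice the model weights from the best epoch are kept. A larger `patience` trades extra training time for a lower chance of stopping on a temporary bump.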

---

## **8. Choosing λ (Regularization Strength)**

* If λ = 0 → no regularization (risk of overfitting).
* As λ → ∞, all coefficients shrink toward 0 (underfitting).
* Use **Cross-Validation (CV)** to find optimal λ.
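
Selecting λ by cross-validation amounts to scoring each candidate on held-out folds and keeping the one with the lowest error. The sketch below is a toy illustration under stated assumptions: one-feature ridge regression without an intercept, interleaved 5-fold splits, a hand-picked candidate grid, and synthetic data.

```python
import random


def ridge_1d(xs, ys, lam):
    """Closed-form minimizer of (1/2m) * sum((y - theta*x)^2) + lam * theta^2."""
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + 2 * m * lam)


def cv_mse(xs, ys, lam, k=5):
    """Mean squared error on held-out folds for one value of lam."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = list(range(fold, n, k))  # interleaved folds
        held_out = set(test_idx)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        theta = ridge_1d(train_x, train_y, lam)
        total += sum((ys[i] - theta * xs[i]) ** 2 for i in test_idx)
    return total / n


rng = random.Random(0)
# Hypothetical data: y = 2x plus Gaussian noise.
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=lambda lam: cv_mse(xs, ys, lam))
for lam in candidates:
    print(lam, round(cv_mse(xs, ys, lam), 4))
print("best:", best_lam)
```

Very large λ values score poorly here because they shrink the coefficient far below the true slope, which is the underfitting end of the range.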

---

## **9. Real-Life Applications**

* **Finance:** Prevents overfitting in stock price prediction with many indicators.
* **Genomics:** Lasso selects important genes for disease prediction.
* **Text Classification:** Elastic Net handles thousands of word features (TF-IDF).

---

## **10. Key Takeaways**

* **Ridge (L2):** Shrinks coefficients, keeps all features.
* **Lasso (L1):** Eliminates irrelevant features.
* **Elastic Net (L1+L2):** Best of both worlds, especially for correlated features.
* Regularization reduces **variance** at the cost of a small, controlled increase in **bias**.

---
---
---