# Underfitting vs. Overfitting in Machine Learning

**Key Problem Areas**
- **Underfitting (High Bias):** The model is too simple and fails to capture the underlying trend.
- **Overfitting (High Variance):** The model fits the training data too closely, capturing noise and failing to generalize.

---

## Regression Example: Predicting Housing Prices

- **Input Feature:** $x$ (size of the house)
- **Output Value:** $y$ (price of the house)

### Models Considered
1. **Simple Linear Model:**
   - **Equation:** $y = \theta_0 + \theta_1 x$
   - **Observation:** The straight-line fit doesn't capture the flattening trend seen in the data.
   - **Result:** **Underfitting (High Bias)** – the model has a strong preconception that the data is linear, so it ignores the curvature in the data.

2. **Quadratic Model:**
   - **Features:** $x$ and $x^2$
   - **Equation:** $y = \theta_0 + \theta_1 x + \theta_2 x^2$
   - **Observation:** The curve fits the data better without perfectly matching every training point.
   - **Result:** **Just Right** – balances fitting the training data while generalizing well to unseen examples.

3. **Fourth-Order Polynomial Model:**
   - **Features:** $x, x^2, x^3, x^4$
   - **Observation:** The model passes through all training points exactly.
   - **Result:** **Overfitting (High Variance)** – the model is too wiggly, capturing noise and unlikely to generalize.

### Visual Summary

| Model Type          | Features Used                | Fit to Training Data       | Generalization    |
|---------------------|------------------------------|----------------------------|-------------------|
| Underfitting        | Linear ($x$)                 | Poor (misses curvature)    | Poor              |
| Just Right          | Quadratic ($x$, $x^2$)         | Good (balanced fit)        | Good              |
| Overfitting         | Fourth-order ($x$, $x^2$, $x^3$, $x^4$) | Perfect fit (too complex) | Poor              |
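
The three fits in the table can be sketched with NumPy's `polyfit`. The five (size, price) pairs below are made-up values chosen to flatten out at larger sizes; they are an assumption for illustration, not data from the course.

```python
import numpy as np

# Hypothetical training set: house size (1000 sq ft) and price ($1000s).
# Prices rise with size but flatten out, with a little noise.
x_train = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y_train = np.array([200.0, 290.0, 330.0, 370.0, 380.0])


def train_mse(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    predictions = np.polyval(coeffs, x_train)
    return float(np.mean((predictions - y_train) ** 2))


# Degrees 1, 2 and 4 correspond to the three rows of the table.
errors = {degree: train_mse(degree) for degree in (1, 2, 4)}
for degree, mse in errors.items():
    print(f"degree {degree}: training MSE = {mse:.4f}")
```

The degree-4 training error is zero up to rounding, because a fourth-order polynomial has five coefficients and can pass through five points exactly. Training error alone therefore always favors the most complex model; judging generalization needs held-out data, which this sketch leaves out.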

---

## Classification Example: Classifying Tumors

**Features:**
- $x_1$: Tumor size
- $x_2$: Patient age

**Task:** 
- Classify tumors as malignant or benign.

### Logistic Regression Models
1. **Simple Logistic Regression:**
   - **Model Formulation:** 

$$ g(z) = \frac{1}{1+e^{-z}} \quad \text{where} \quad z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 $$

   - **Decision Boundary:** A straight line where $z=0$.
   - **Result:** **Underfitting (High Bias)** – the boundary is too simple and may misclassify data points.

2. **Quadratic Logistic Regression:**
   - **Extended Features:** Include quadratic terms (e.g., $x_1^2$, $x_2^2$, and possibly $x_1x_2$).
   - **Decision Boundary:** A curve (e.g., an ellipse) that better separates the classes.
   - **Result:** **Just Right** – offers a balance between fitting the data and generalizing well.

3. **High-Order Polynomial Logistic Regression:**
   - **Extended Features:** Many higher-order polynomial terms.
   - **Observation:** The decision boundary becomes very contorted to perfectly fit the training examples.
   - **Result:** **Overfitting (High Variance)** – although it classifies the training set perfectly, it is unlikely to perform well on new data.
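
A minimal sketch of the first two models above, in plain Python. The parameter values and the two sample points are hypothetical and assume features rescaled to a small range; the helper names (`linear_z`, `quadratic_z`, `predict`) are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_z(x1, x2, theta):
    """z = theta_0 + theta_1 * x1 + theta_2 * x2 (straight-line boundary)."""
    return theta[0] + theta[1] * x1 + theta[2] * x2


def quadratic_z(x1, x2, theta):
    """Adds x1^2, x2^2 and x1*x2 terms, so z = 0 can trace a curve."""
    features = [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]
    return sum(t * f for t, f in zip(theta, features))


def predict(z):
    """Predict class 1 when g(z) >= 0.5, which is the same as z >= 0."""
    return 1 if sigmoid(z) >= 0.5 else 0


# Hypothetical parameters whose boundary z = 0 is the circle x1^2 + x2^2 = 1.
theta_circle = [-1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
print(predict(quadratic_z(0.2, 0.3, theta_circle)))  # inside the circle: 0
print(predict(quadratic_z(1.5, 0.5, theta_circle)))  # outside the circle: 1
```

Adding still higher-order terms to the feature list is what lets the boundary contort around individual training examples.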

---

## The Goldilocks Analogy

- **Underfitting:** Like a bowl of porridge that's too cold – the model is too simple.
- **Overfitting:** Like a bowl of porridge that's too hot – the model is excessively complex.
- **Just Right:** Like a bowl of porridge that's perfectly warm – the model achieves the ideal balance between bias and variance.

---

## Key Takeaways

**Underfitting (High Bias):**
- Model cannot capture the underlying data trends.
- Simplistic assumptions (e.g., assuming data is purely linear).
- Poor performance on both training and new data.

**Overfitting (High Variance):**
- Model fits the training data excessively well, including noise.
- Highly sensitive to minor changes in the training data.
- Poor generalization to unseen data.

**Ideal Model (Just Right):**
- Achieves a balance between bias and variance.
- Captures essential patterns without being overly complex.
- Generalizes well to new, unseen examples.

# Strategies to Address Overfitting

Recall that overfitting occurs when a model fits the training data too closely, capturing noise rather than the underlying pattern, which leads to poor generalization to new data.

**Three Main Strategies to Combat Overfitting:**
1. **Collect More Training Data**
2. **Feature Selection (Use Fewer Features)**
3. **Regularization**

## 1. Collect More Training Data

Increasing the number of training examples helps the algorithm learn a function that is smoother and less wiggly.
- **Advantage:** A larger dataset provides a better representation of the true underlying distribution, which can prevent the model from fitting noise.
- **Limitation:** More data may not be available in some scenarios (e.g., limited sales data in a small market).

**Practical Example:**  

For housing price prediction: If you currently have limited data on house sizes and prices, adding more examples can help the model avoid overreacting to outliers or noise in a small dataset.

---

## 2. Feature Selection (Using Fewer Features)

By selecting only the most relevant features, you reduce the complexity of the model, thereby reducing its tendency to overfit.

- **Manual Feature Selection:**  Use your domain intuition to select features that are most relevant.
- **Pros:** Reduces overfitting by lowering model complexity.
- **Cons:** May discard potentially useful information.
- **Automated Methods:** Later in the course, you will explore algorithms that automatically choose the most appropriate set of features.

**Example Scenario:**  
- **With Many Features:** A model using 100 features (e.g., house size, number of bedrooms, age, income level in the neighborhood, distance to the nearest coffee shop, etc.) might overfit if there isn’t enough training data.
- **With Fewer Features:** Choosing a subset like **size**, **number of bedrooms**, and **age** might help the model generalize better.

---

## 3. Regularization

Regularization is a technique that shrinks the values of the model parameters, effectively reducing the impact of less important features without completely eliminating them.

When using polynomial features (e.g., $x$, $x^2$, $x^3$, etc.), the parameters (weights) for higher-order terms can become very large. **Regularization** adds a penalty for large parameter values, encouraging the model to keep the parameters small and the fitted curve (or, in classification, the decision boundary) smoother.

Imagine dimming a set of overly bright lights; you don't turn them off completely, but you reduce their intensity so they don't overwhelm the overall lighting.

**Mathematical Formulation (General Idea):**  

If the cost function for a model is:

$$ J(\theta) = \text{Loss}(\theta) $$

Regularization adds a term such as:

$$ J_{\text{reg}}(\theta) = \text{Loss}(\theta) + \lambda \sum_{j=1}^{n} \theta_j^2 $$

Here, $\lambda$ is a hyperparameter that controls the amount of regularization.
  
**Parameter Considerations:**
- Typically, only the weights $\theta_1, \theta_2, \dots, \theta_n$ (written $w_1, w_2, \dots, w_n$ in later sections) are regularized.
- The bias term $\theta_0$ (written $b$ in later sections) is usually excluded from the penalty.
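
As a small sketch of the general idea, the penalty can be computed from the weights alone and added to whatever loss the model already uses. The function names and the numbers are illustrative assumptions.

```python
def l2_penalty(weights, lam):
    """Return lam * sum(w_j^2) over the weights only (bias excluded)."""
    return lam * sum(w * w for w in weights)


def regularized_cost(loss, weights, lam):
    """J_reg = Loss + lam * sum_j w_j^2, with `loss` computed elsewhere."""
    return loss + l2_penalty(weights, lam)


weights = [3.0, -4.0]  # hypothetical weights w_1, w_2
# The bias is never passed to l2_penalty, so it is not penalized.
print(regularized_cost(loss=2.0, weights=weights, lam=0.5))  # 2.0 + 0.5 * 25 = 14.5
```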

**Benefits:**  
- Allows the use of all features while preventing any single feature from having an overly large impact.
- **Flexibility:** Unlike feature selection, regularization doesn't completely remove a feature; it just reduces its influence.

---

## Additional Learning Resources: Overfitting Lab

- **Interactive Lab Features:**
  - Visualize different examples of overfitting in both regression and classification.
  - Adjust the degree of the polynomial (e.g., $x$, $x^2$, $x^3$, etc.) and see how the model's fit changes.
  - Experiment with adding more training data and using feature selection to observe their effects on overfitting.

- **Purpose:**  
  This lab will help build intuition around:
  - How overfitting occurs.
  - The impact of various techniques (collecting more data, feature selection, and regularization) in mitigating overfitting.

---

## Summary

- **Addressing Overfitting:**
  1. **Collect More Data:**  
     More training examples help the model learn a smoother function that is less influenced by noise.
  2. **Feature Selection:**  
     Using a subset of relevant features reduces complexity, mitigating overfitting.
  3. **Regularization:**  
     Shrinks parameter values to prevent features from having an overly large impact, without completely discarding any feature.

- **Takeaway:**  
  In practice, the best strategy often involves a combination of these approaches. Understanding when and how to apply each technique is critical for building models that generalize well to new data.



# Regularization: Modifying the Cost Function

**Goal:** Keep the parameters ($w_1, w_2, \dots, w_n$, written $\theta_1, \dots, \theta_n$ in the formulas below) small to prevent the model from becoming overly complex and overfitting the training data.

**Example Recap:**  
- **Quadratic Fit:** A quadratic function can provide a good fit for the data.
- **High-Order Polynomial Fit:** A high-order polynomial may overfit the data by being too wiggly.

**Key Idea:** If you penalize large values for specific parameters (e.g., $w_3$ and $w_4$), they are forced to be close to zero, effectively reducing the contribution of higher-order (or less important) features.

---

## Modified Cost Function

**Standard Linear Regression Cost Function:**

$$ J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 $$

**Modified Cost Function:**

$$ J_{\text{reg}}(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right] $$
  
**Components:**
- **Mean Squared Error Term:** Encourages the model to fit the training data well.
- **Regularization Term:** $\lambda \sum_{j=1}^{n} \theta_j^2$ penalizes large parameter values, effectively shrinking them.
  
**Notes on Conventions:**
- The regularization term is divided by $2m$ as well, giving $\frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$, so both components are on a similar scale.
- By convention, **do not regularize the bias term** ($b$ or $\theta_0$), though some implementations might include it; the difference is usually minimal.
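
The modified cost function can be sketched in a few lines of plain Python. Here `w` holds $\theta_1, \dots, \theta_n$ and `b` plays the role of $\theta_0$; the tiny dataset is a made-up example.

```python
def predict(x, w, b):
    """Linear model: h(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def regularized_linear_cost(X, y, w, b, lam):
    """(1/2m) * [sum of squared errors + lam * sum_j w_j^2]; b is not penalized."""
    m = len(X)
    squared_error = sum((predict(xi, w, b) - yi) ** 2 for xi, yi in zip(X, y))
    penalty = lam * sum(wj * wj for wj in w)
    return (squared_error + penalty) / (2 * m)


# Hypothetical data: two examples with two features each.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 6.0]
w, b = [1.0, 1.0], 0.0
print(regularized_linear_cost(X, y, w, b, lam=0.0))  # 5 / 4 = 1.25
print(regularized_linear_cost(X, y, w, b, lam=2.0))  # (5 + 4) / 4 = 2.25
```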

---

## Trade-Off Controlled by $\lambda$

- **$\lambda = 0$:**
  - **Effect:** No regularization applied.
  - **Result:** Model may overfit, especially if it is overly complex.
  
- **$\lambda$ is Very Large (e.g., $10^{10}$):**
  - **Effect:** Heavy penalty on parameter sizes forces all $\theta_j$ (for $j \geq 1$) to be near 0.
  - **Result:** Model becomes overly simple (e.g., a horizontal line) and underfits the data.
  
- **Choosing $\lambda$:**
  - **Balance is Key:** A moderate value of $\lambda$ ensures the model fits the data well while keeping parameters small enough to avoid overfitting.
  - **Model Selection:** Later in the course, techniques for choosing the optimal $\lambda$ will be discussed.
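
The two extremes can be checked on a one-feature model, where the regularized cost has an exact minimizer. The closed form used below is not part of these notes; it follows from setting both partial derivatives of $J_{\text{reg}}$ to zero, and the data is a made-up example.

```python
def fit_one_feature(x, y, lam):
    """Minimize (1/2m) * [sum((w*x + b - y)^2) + lam * w^2] exactly.

    Setting both partial derivatives to zero gives
    w = S_xy / (S_xx + lam) and b = mean(y) - w * mean(x).
    """
    m = len(x)
    x_mean = sum(x) / m
    y_mean = sum(y) / m
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    w = s_xy / (s_xx + lam)
    b = y_mean - w * x_mean
    return w, b


# Hypothetical data lying exactly on y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

for lam in (0.0, 1.0, 1e10):
    w, b = fit_one_feature(x, y, lam)
    print(f"lambda = {lam:g}: w = {w:.6f}, b = {b:.6f}")
```

With $\lambda = 0$ the fit recovers $w = 2$, $b = 1$. With $\lambda = 10^{10}$ the slope is essentially zero and $b$ approaches the mean of $y$ (6 here), which is the horizontal line described above.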

---

## Summary and Takeaways

- **Regularization Mechanism:**
  - Penalizes large parameters by adding a term $\lambda \sum_{j=1}^{n} \theta_j^2$ to the cost function.
  - Encourages simpler models, similar in effect to reducing the number of features.

- **Trade-Off in the Cost Function:**
  - **Minimizing Mean Squared Error:** Fits the training data well.
  - **Minimizing the Regularization Term:** Keeps parameters small to reduce overfitting.
  
- **Overall Impact:** A well-chosen $\lambda$ leads to a model that is neither too complex (overfitting) nor too simple (underfitting).


# Gradient Descent with Regularized Linear Regression

The goal is to minimize a modified cost function that includes both the usual squared error term and an additional term to penalize large parameter values, thereby reducing overfitting.

---

## Regularized Cost Function

The **unregularized cost function** for linear regression is:

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
$$

The **modified (regularized) cost function** becomes:

$$
J_{\text{reg}}(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]
$$

**Components:**
- **Error Term:** Measures the squared difference between predictions and actual values.
- **Regularization Term:** $\lambda \sum_{j=1}^{n} \theta_j^2$ penalizes large weights to keep the model simpler.

**Conventions:**
- We scale by $\frac{1}{2m}$ for both terms to keep them on a similar scale.
- **Bias term ($b$ or $\theta_0$)** is not regularized.

---

## Gradient Descent Updates

For **standard gradient descent**:

**For each weight $w_j$ ($j = 1, 2, \dots, n$):**

$$
w_j := w_j - \alpha \cdot \frac{\partial J}{\partial w_j}
$$

**For bias $b$:**

$$
b := b - \alpha \cdot \frac{\partial J}{\partial b}
$$

**With regularization**, the derivative with respect to $w_j$ gains an extra term, while the derivative with respect to $b$ stays the same.

**New Partial Derivative for $w_j$:**

$$
  \frac{\partial J_{\text{reg}}}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)} + \frac{\lambda}{m}w_j
$$

**Update Rule for $w_j$:**

$$
w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)} + \frac{\lambda}{m}w_j \right]
$$

**Update Rule for $b$:**
$$
b := b - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
$$
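
The two update rules can be sketched as a single step in plain Python. The function name `gradient_step` and the two-example dataset are assumptions for illustration.

```python
def predict(x, w, b):
    """Linear model: h(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def gradient_step(X, y, w, b, alpha, lam):
    """One simultaneous gradient descent update for regularized linear regression."""
    m = len(X)
    errors = [predict(xi, w, b) - yi for xi, yi in zip(X, y)]
    # dJ/dw_j = (1/m) * sum(error * x_j) + (lam/m) * w_j
    grad_w = [
        sum(e * xi[j] for e, xi in zip(errors, X)) / m + (lam / m) * w[j]
        for j in range(len(w))
    ]
    # dJ/db = (1/m) * sum(error); there is no regularization term here.
    grad_b = sum(errors) / m
    # Both gradients were computed from the old values, so the update is simultaneous.
    new_w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
    new_b = b - alpha * grad_b
    return new_w, new_b


# Hypothetical data: one feature, two examples.
X = [[1.0], [2.0]]
y = [2.0, 4.0]
w, b = gradient_step(X, y, w=[1.0], b=0.0, alpha=0.1, lam=1.0)
print(w, b)  # roughly [1.2] and 0.15
```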

> **Note:** The bias term $b$ is not regularized.

---

## Intuition Behind the Update

The update rule for $w_j$ can be rearranged as follows:

$$
w_j := \left(1 - \alpha \frac{\lambda}{m}\right) w_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}
$$

**Interpretation:**
- **Shrinkage Factor:** The term $\left(1 - \alpha \frac{\lambda}{m}\right)$ multiplies $w_j$ at every iteration, gradually shrinking it.  

**Example:** With $\alpha = 0.01$, $\lambda = 1$, and $m = 50$, we get:

$$
1 - \alpha \frac{\lambda}{m} = 1 - 0.01 \cdot \frac{1}{50} = 1 - 0.0002 = 0.9998
$$

Thus, on each iteration, $w_j$ is scaled by approximately 0.9998.

- **Usual Gradient Descent Component:** The remaining term is the standard gradient descent update for unregularized linear regression.

**Effect:** Regularization slowly reduces the magnitude of the weights, helping to control model complexity and prevent overfitting.
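
The arithmetic above is easy to check directly. In this sketch the weight `w_j` and the value of the data gradient are arbitrary placeholders, used only to confirm that the original and rearranged forms of the update agree.

```python
alpha, lam, m = 0.01, 1.0, 50

shrink = 1 - alpha * lam / m
print(shrink)  # approximately 0.9998

# Both forms of the update give the same new weight.
w_j = 3.0
data_gradient = 0.4  # hypothetical value of (1/m) * sum(error * x_j)
original_form = w_j - alpha * (data_gradient + (lam / m) * w_j)
rearranged_form = shrink * w_j - alpha * data_gradient
print(abs(original_form - rearranged_form) < 1e-12)  # True

# Ignoring the data term, 1000 iterations compound to shrink ** 1000.
print(round(shrink ** 1000, 4))  # 0.8187
```

A factor of 0.9998 looks negligible for one step, but compounded over 1000 iterations it would remove about 18% of a weight's magnitude if the data term did not push back.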

---

## Derivative Calculation Overview

Here's a brief look at how the derivative with respect to $w_j$ is derived:
1. **Start with the cost function:**

$$
J_{\text{reg}}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$

2. **Differentiate with respect to $w_j$:**

For the error term, using the chain rule, you obtain:
$$
\frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}
$$

For the regularization term:
$$
\frac{\partial}{\partial w_j} \left( \frac{\lambda}{2m} w_j^2 \right) = \frac{\lambda}{m} w_j
$$

3. **Combine the two derivatives to yield:**

$$
\frac{\partial J_{\text{reg}}}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)} + \frac{\lambda}{m} w_j
$$
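
One way to gain confidence in the derivative is to compare it against a central finite difference of the cost. The sketch below does this on a small made-up dataset; the function names and all numeric values are illustrative assumptions.

```python
def predict(x, w, b):
    """Linear model: h(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def cost(X, y, w, b, lam):
    """J_reg = (1/2m) * sum(error^2) + (lam/2m) * sum_j w_j^2."""
    m = len(X)
    squared_error = sum((predict(xi, w, b) - yi) ** 2 for xi, yi in zip(X, y))
    return squared_error / (2 * m) + lam / (2 * m) * sum(wj * wj for wj in w)


def analytic_gradient(X, y, w, b, lam, j):
    """(1/m) * sum(error * x_j) + (lam/m) * w_j, matching the derivation."""
    m = len(X)
    data_term = sum((predict(xi, w, b) - yi) * xi[j] for xi, yi in zip(X, y)) / m
    return data_term + (lam / m) * w[j]


def numerical_gradient(X, y, w, b, lam, j, eps=1e-6):
    """Central finite difference of J_reg with respect to w_j."""
    w_plus = w[:j] + [w[j] + eps] + w[j + 1:]
    w_minus = w[:j] + [w[j] - eps] + w[j + 1:]
    return (cost(X, y, w_plus, b, lam) - cost(X, y, w_minus, b, lam)) / (2 * eps)


# Hypothetical data and parameters.
X = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]]
y = [1.0, 2.0, 3.0]
w, b, lam = [0.5, -0.3], 0.1, 0.7

for j in range(len(w)):
    gap = abs(analytic_gradient(X, y, w, b, lam, j)
              - numerical_gradient(X, y, w, b, lam, j))
    print(f"w_{j + 1}: |analytic - numerical| = {gap:.2e}")
```

Because the cost is quadratic in each weight, the central difference should match the analytic derivative up to floating-point rounding.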

> **Note:** The bias term $b$ is unaffected by regularization.

---

## Summary

- **Objective:** Modify gradient descent to work with a regularized cost function that prevents overfitting by keeping weights small.
- **Key Changes:**
  - **Cost Function:** Augmented with a regularization term: $\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$.
  - **Gradient Update for Weights:** Incorporates an extra term $\frac{\lambda}{m}w_j$, leading to a multiplicative shrinkage factor.
  - **Bias Term:** Remains unchanged (not regularized).

- **Impact on Learning:**
  - Regularization helps control model complexity.
  - Choosing appropriate $\lambda$ balances the trade-off between fitting the data and keeping the model simple.

# Regularized Logistic Regression

This section shows how to implement regularized logistic regression using gradient descent.

The approach is very similar to regularized linear regression, with the main difference being the use of the logistic (sigmoid) function for predictions.

- **Problem:** Logistic regression with many features (especially high-order polynomial features) can overfit the training data, resulting in an overly complex decision boundary.
- **Solution:** Apply regularization by adding a penalty term to the cost function to keep the parameter values small. This helps to generalize better to new data.

---

## Cost Function with Regularization

- **Unregularized Cost Function:**

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log f(z^{(i)}) + (1-y^{(i)}) \log (1 - f(z^{(i)})) \right]
$$

where:
- $f(z) = \frac{1}{1 + e^{-z}}$ (the sigmoid function)
- $z = w_1 x_1 + \dots + w_n x_n + b$, where the features $x_j$ may include high-order polynomial terms

**Modified Cost Function:**

$$
J_{\text{reg}}(\theta) = J(\theta) + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$

**Explanation:**
- The first part is the usual logistic regression cost.
- The second part is the regularization term:
  - $\lambda$ is the regularization parameter controlling the strength of the penalty.
  - The sum runs over all feature weights ($w_1, w_2, \dots, w_n$), excluding the bias term $b$.

**Effect:** Penalizes large weights, thereby simplifying the decision boundary and reducing overfitting.
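
A minimal sketch of this cost in plain Python follows. The dataset is made up, and the code assumes predictions never reach exactly 0 or 1, since `math.log` would fail there.

```python
import math


def sigmoid(z):
    """f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def regularized_logistic_cost(X, y, w, b, lam):
    """Cross-entropy cost plus (lam / 2m) * sum_j w_j^2; b is not penalized."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        f = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += yi * math.log(f) + (1 - yi) * math.log(1 - f)
    penalty = lam / (2 * m) * sum(wj * wj for wj in w)
    return -total / m + penalty


# Hypothetical data: four examples with two features each.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]]
y = [0, 0, 1, 1]

# With all parameters at zero every prediction is 0.5, so the cost is log(2).
print(regularized_logistic_cost(X, y, [0.0, 0.0], 0.0, lam=1.0))  # about 0.6931
```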

---

## Gradient Descent Updates for Regularized Logistic Regression

**For each weight $w_j$ ($j = 1, 2, \dots, n$):**

$$
w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f(z^{(i)}) - y^{(i)} \right)x_j^{(i)} + \frac{\lambda}{m}w_j \right]
$$

- The term $\frac{\lambda}{m}w_j$ is the additional derivative coming from the regularization term.
- This update has the same form as that of regularized linear regression, except that the prediction $f(z^{(i)})$ is the sigmoid of $z^{(i)}$ rather than the linear output $h_\theta(x^{(i)})$.

**For the bias term $b$:**
$$
b := b - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( f(z^{(i)}) - y^{(i)} \right) \right]
$$
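
The updates above can be sketched as follows. The names `logistic_gradients` and `gradient_step` and the four-example dataset are assumptions for illustration, not the course's provided code.

```python
import math


def sigmoid(z):
    """f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_gradients(X, y, w, b, lam):
    """Return (grad_w, grad_b) of the regularized logistic cost."""
    m = len(X)
    errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
              for xi, yi in zip(X, y)]
    grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / m + (lam / m) * w[j]
              for j in range(len(w))]
    grad_b = sum(errors) / m  # bias: no (lam / m) term
    return grad_w, grad_b


def gradient_step(X, y, w, b, alpha, lam):
    """Compute both gradients first, then update w and b together."""
    grad_w, grad_b = logistic_gradients(X, y, w, b, lam)
    return [wj - alpha * gj for wj, gj in zip(w, grad_w)], b - alpha * grad_b


# Hypothetical data: four examples with two features each.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]]
y = [0, 0, 1, 1]
w, b = gradient_step(X, y, w=[0.0, 0.0], b=0.0, alpha=0.1, lam=1.0)
print(w, b)  # roughly [0.0375, -0.01875] and 0.0
```

Starting from zero weights, the regularization term contributes nothing to the first step; its effect appears only once the weights are nonzero.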

> **Note:** The bias term is **not regularized.**

### Key Points

- **Similarity to Linear Regression:** The gradient update equations are nearly identical to those for regularized linear regression, with the only difference being the hypothesis function (sigmoid for logistic regression vs. linear for regression).
- **Regularization Effect:** Regularization shrinks the weights by adding a term proportional to the weight value itself, thereby reducing overfitting.

---

## Practical Implementation

**Implementation Tips:**
- Ensure that the updates for all parameters ($w_j$ and $b$) are done **simultaneously.**
- Choose an appropriate value for $\lambda$:
    - **Too low ($\lambda = 0$):** No regularization; the model may overfit.
    - **Too high (e.g., $\lambda \gg 1$):** Excessive shrinkage, leading to underfitting.
- In the practice lab, you will have the opportunity to experiment with different $\lambda$ values to observe their effect on the decision boundary.

- **Code Insight:**
  - Review the provided code for implementing regularized logistic regression.
  - Understand how the regularization term is integrated into the gradient descent update.
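
As a rough stand-in for that experiment, the sketch below trains on a tiny made-up dataset with several values of $\lambda$ and prints the size of the learned weights. The `train` function, the learning rate, and the step count are assumptions for illustration, not the lab's code.

```python
import math


def sigmoid(z):
    """f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, lam, alpha=0.1, steps=2000):
    """Gradient descent for regularized logistic regression, starting from zeros."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(steps):
        errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
                  for xi, yi in zip(X, y)]
        grad_w = [sum(e * xi[j] for e, xi in zip(errors, X)) / m + (lam / m) * w[j]
                  for j in range(n)]
        grad_b = sum(errors) / m
        # Gradients use only the old parameters, so w and b update simultaneously.
        w = [wj - alpha * gj for wj, gj in zip(w, grad_w)]
        b = b - alpha * grad_b
    return w, b


# Hypothetical data: four examples with two features each.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5]]
y = [0, 0, 1, 1]

for lam in (0.0, 1.0, 10.0):
    w, b = train(X, y, lam)
    norm = math.sqrt(sum(wj * wj for wj in w))
    print(f"lambda = {lam:4.1f}: ||w|| = {norm:.3f}, b = {b:.3f}")
```

Larger values of $\lambda$ should print a smaller weight norm. On real data, the value of $\lambda$ would be chosen by comparing performance on held-out examples rather than by inspecting the weights.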

---

## Final Thoughts and Next Steps

- **Real-World Impact:**  
  Understanding when and how to reduce overfitting is a crucial skill in machine learning, leading to models that perform better on unseen data.
  
- **Future Learning:**
  - This course has covered key concepts in linear and logistic regression.
  - In the next course, you will learn about neural networks (deep learning), which build upon these fundamental techniques (cost functions, gradient descent, and regularization).

- **Encouragement:**  
  Congratulations on mastering these foundational concepts! Keep practicing with the labs and quizzes, and prepare for more advanced topics like neural networks in the coming materials.

---

> **[!TIP] Real-World Application**
>  
> Many companies leverage regularized logistic regression to ensure that their models generalize well, which is critical for applications such as fraud detection, medical diagnosis, and more. Your ability to implement and tune these models can lead to significant real-world value.

---
