# 🧠 Week 03: Classification

This week, you learn about **classification**, where your output variable $y$ can take on only one of a small handful of possible values, instead of any number in an infinite range. It turns out that **linear regression is not a good algorithm** for classification problems. Let's take a look at why — this will lead us into a different algorithm called **logistic regression**.

---

## 🟢 Binary Classification

A classic example is classifying a tumor as **malignant** versus **not**. In each of these problems, the variable you want to predict ($y$) can only take on **two possible values**: **No** or **Yes**, **0** or **1**.

This type of classification problem is called **binary classification**, where the word *binary* refers to the fact that there are **only two classes or categories**.

In these problems, the terms **class** and **category** are often used interchangeably.

---

## ✅ Positive vs. Negative Class

One common convention is to label:

- The **false** or **0** class as the **negative class**  
- The **true** or **1** class as the **positive class**

For example:

- In spam classification:
  - A non-spam email → **Negative example** ($y = 0$)
  - A spam email → **Positive example** ($y = 1$)

This terminology helps clarify the nature of the outcome in binary classification.

---

## 🔢 Logistic Regression: Key Concepts

### 🧮 Sigmoid (Logistic) Function

The sigmoid function (also called the logistic function) maps any real number to a value strictly between 0 and 1:

$$
g(z) = \frac{1}{1 + e^{-z}}
$$

- If $z \gg 0$, then $g(z) \approx 1$
- If $z \ll 0$, then $g(z) \approx 0$
- If $z = 0$, then $g(z) = \frac{1}{2}$
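
The three limiting cases above can be checked numerically. This is a minimal sketch using only Python's standard library; the two-branch form is one common way to avoid overflow for large negative $z$, not the only one.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) could overflow, so use the equivalent
    # form e^z / (1 + e^z), where e^z is safely small.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Evaluate g(z) far to the left, at zero, and far to the right.
for z in (-10.0, 0.0, 10.0):
    print(f"g({z}) = {sigmoid(z):.5f}")
```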

---

### 🔁 Logistic Regression Model

Logistic regression uses the sigmoid function to map the linear combination of features to a probability between 0 and 1.

**Step 1: Compute linear combination**

$$
z = \mathbf{w} \cdot \mathbf{x} + b
$$

**Step 2: Apply sigmoid to compute prediction**

$$
f_{\mathbf{w}, b}(\mathbf{x}) = g(z) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}
$$

---

### 📊 Interpretation

The output of the logistic regression model is interpreted as the **probability that $y=1$ given input $\mathbf{x}$**:

$$
f_{\mathbf{w}, b}(\mathbf{x}) = P(y=1 \mid \mathbf{x}; \mathbf{w}, b)
$$

Therefore:

- If $f(\mathbf{x}) = 0.7$, then the model predicts a 70% chance that $y=1$ and a 30% chance that $y=0$.
- You can set a **decision threshold** (e.g., 0.5) to map probabilities to binary predictions.
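
The two steps of the model and the thresholding rule can be sketched as follows. The weights, bias, and input are hypothetical values chosen only for illustration, and `predict_proba` / `predict` are illustrative names rather than a library API.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, w, b):
    """Step 1: z = w . x + b.  Step 2: f(x) = g(z), read as P(y = 1 | x)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)

def predict(x, w, b, threshold=0.5):
    """Map the probability to a class label using a decision threshold."""
    return 1 if predict_proba(x, w, b) >= threshold else 0

# Hypothetical parameters and input (not from the notes).
w = [1.5, -0.5]
b = -1.0
x = [2.0, 1.0]

p = predict_proba(x, w, b)  # z = 3.0 - 0.5 - 1.0 = 1.5
print(f"P(y=1 | x) = {p:.3f}, P(y=0 | x) = {1 - p:.3f}")
print("label at threshold 0.5:", predict(x, w, b))
print("label at threshold 0.9:", predict(x, w, b, threshold=0.9))
```

Raising the threshold makes the model more conservative about predicting the positive class; 0.5 is a common default, not a requirement.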

---

### ✅ Summary

- Logistic regression models **binary classification** tasks ($y \in \{0, 1\}$).
- It applies a sigmoid function to a linear combination of inputs to output a probability.
- The model output is interpreted as $P(y=1 \mid \mathbf{x})$.

## 🧠 Logistic Regression with Multiple Features and Nonlinear Boundaries

Let's now explore **logistic regression** with **two features** ($x_1$, $x_2$), and see how it can define both **linear** and **nonlinear** decision boundaries.

---

### 🔴🔵 Binary Classification with Two Features

- Suppose you have a dataset with:
  - **Red crosses**: $y = 1$ (positive class)
  - **Blue circles**: $y = 0$ (negative class)

- The logistic regression model makes predictions as:

$$
f(\mathbf{x}) = g(z), \quad \text{where} \quad z = w_1x_1 + w_2x_2 + b
$$

- Example:  
  Let $w_1 = 1$, $w_2 = 1$, and $b = -3$.  
  Then:
  
  $$
  z = x_1 + x_2 - 3
  $$

---

### 🟨 Decision Boundary (Linear Case)

- The **decision boundary** occurs when $z = 0$:
  
  $$
  x_1 + x_2 - 3 = 0 \quad \Rightarrow \quad x_1 + x_2 = 3
  $$

- This is a **straight line** separating the feature space (using a threshold of 0.5):
  - **On or above the line** ($x_1 + x_2 \ge 3$) $\rightarrow$ predict $y = 1$
  - **Below the line** ($x_1 + x_2 < 3$) $\rightarrow$ predict $y = 0$
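
As a quick check of this boundary, the sketch below evaluates $z = x_1 + x_2 - 3$ at a few hand-picked points. It relies on the fact that $g(z) \ge 0.5$ exactly when $z \ge 0$, so the sign of $z$ decides the label; the test points are arbitrary.

```python
def z_linear(x1, x2, w1=1.0, w2=1.0, b=-3.0):
    """Linear score z = w1*x1 + w2*x2 + b with the example parameters."""
    return w1 * x1 + w2 * x2 + b

def predict(x1, x2):
    # g(z) >= 0.5 exactly when z >= 0, so no sigmoid call is needed here.
    return 1 if z_linear(x1, x2) >= 0 else 0

# Arbitrary test points: one below the line, one on it, one above it.
for point in [(1.0, 1.0), (1.5, 1.5), (2.0, 2.0)]:
    print(point, "z =", z_linear(*point), "-> predict", predict(*point))
```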

---

### 🔵 Non-Linear Decision Boundary (Using Polynomial Features)

Now consider using **non-linear features**, such as:

$$
z = w_1 x_1^2 + w_2 x_2^2 + b
$$

- Let $w_1 = 1$, $w_2 = 1$, and $b = -1$  
  Then:

  $$
  z = x_1^2 + x_2^2 - 1
  $$

- The decision boundary occurs when $z = 0$:

  $$
  x_1^2 + x_2^2 = 1
  $$

- This is a **circle** of radius 1 centered at (0,0):
  - **Inside the circle** $\rightarrow$ predict $y = 0$
  - **Outside the circle** $\rightarrow$ predict $y = 1$

---

### 🔷 Even More Complex Decision Boundaries

We can add more polynomial terms, such as cross terms and higher powers, to allow more complex shapes:

$$
z = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b
$$

- With appropriate parameters, logistic regression can define:
  - **Ellipses**
  - **Non-convex shapes**
  - **Irregular complex regions**
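
One way to see how polynomial terms produce curved boundaries is to treat them as extra features and keep the model linear in those features. The sketch below does this for the five terms in $z$ above, with a bias term $b$ added as in the earlier examples. Both parameter sets are hypothetical, chosen so the boundaries are an ellipse and the unit circle.

```python
def degree2_features(x1, x2):
    """Map (x1, x2) to [x1, x2, x1^2, x1*x2, x2^2]."""
    return [x1, x2, x1 * x1, x1 * x2, x2 * x2]

def predict(x1, x2, w, b):
    z = sum(wj * fj for wj, fj in zip(w, degree2_features(x1, x2))) + b
    return 1 if z >= 0 else 0  # g(z) >= 0.5 exactly when z >= 0

# Hypothetical parameters: boundary x1^2 + 4*x2^2 = 4 (an ellipse).
w_ellipse, b_ellipse = [0.0, 0.0, 1.0, 0.0, 4.0], -4.0
# Hypothetical parameters: boundary x1^2 + x2^2 = 1 (the unit circle).
w_circle, b_circle = [0.0, 0.0, 1.0, 0.0, 1.0], -1.0

print(predict(0.0, 0.0, w_ellipse, b_ellipse))  # point inside the ellipse
print(predict(3.0, 0.0, w_ellipse, b_ellipse))  # point outside the ellipse
print(predict(0.5, 0.5, w_circle, b_circle))    # point inside the circle
print(predict(1.0, 1.0, w_circle, b_circle))    # point outside the circle
```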

---

### 📝 Summary

- Logistic regression with only linear terms results in **linear boundaries**.
- By including **polynomial terms**, it can model **non-linear** and **complex regions**.
- This flexibility makes logistic regression a powerful tool for binary classification tasks.

In the next step, we’ll learn how to **train** logistic regression using a cost function and gradient descent.

## 🎯 Logistic Regression – Cost Function and Loss

### 🧪 Goal
Choose parameters $w$ and $b$ that **fit the training data well** for classification.

---

### ❌ Why Not Use Squared Error?

- The **squared error** cost function works well for linear regression (here $f^{(i)}$ is shorthand for the prediction $f(x^{(i)})$):

  $$
  J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(f^{(i)} - y^{(i)})^2
  $$

- But with **logistic regression**:
  
  $$
  f(x) = \frac{1}{1 + e^{-(w \cdot x + b)}}
  $$

  - Plugging this $f$ into the squared error gives a cost that is **non-convex** in $w$ and $b$.
  - A non-convex cost can have **many local minima**, so gradient descent may get stuck and become **unreliable**.
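
The non-convexity can be seen on a toy case. The sketch below assumes a single training example ($x = 1$, $y = 1$) with the bias fixed at 0, so the cost depends on one weight $w$. A convex function never rises above the average of its endpoint values at the midpoint, and the squared-error cost breaks that rule here. This only demonstrates non-convexity; it does not by itself exhibit multiple local minima.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def squared_error_cost(w):
    # Toy setting (an assumption): one example with x = 1, y = 1, and b = 0.
    return 0.5 * (sigmoid(w) - 1.0) ** 2

def log_loss_cost(w):
    # Same toy example, but with the loss -log(f) used when y = 1.
    return -math.log(sigmoid(w))

def violates_midpoint_convexity(cost, a, c):
    """True if cost(midpoint) exceeds the average of cost(a) and cost(c)."""
    mid = 0.5 * (a + c)
    return cost(mid) > 0.5 * (cost(a) + cost(c))

print("squared error:", violates_midpoint_convexity(squared_error_cost, -10.0, 0.0))
print("log loss:     ", violates_midpoint_convexity(log_loss_cost, -10.0, 0.0))
```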

---

### ✅ Logistic Regression Loss Function

We define a new **loss** function for a **single training example** $(x, y)$:

$$
\mathcal{L}(f(x), y) =
\begin{cases}
- \log(f(x)) & \text{if } y = 1 \\\\
- \log(1 - f(x)) & \text{if } y = 0
\end{cases}
$$

- Where $f(x)$ is the **predicted probability** from the sigmoid function.

---

### 📉 Intuition Behind the Loss

- If $y = 1$:
  - Predicting $f(x) \approx 1$ → ✅ **Low loss**
  - Predicting $f(x) \approx 0$ → ❌ **High loss → $\infty$**

- If $y = 0$:
  - Predicting $f(x) \approx 0$ → ✅ **Low loss**
  - Predicting $f(x) \approx 1$ → ❌ **High loss → $\infty$**

👉 The loss **heavily penalizes** wrong confident predictions.
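
To put numbers on this intuition, the sketch below evaluates the piecewise loss for a few hypothetical predicted probabilities. It assumes $f(x)$ lies strictly between 0 and 1, which the sigmoid guarantees in exact arithmetic.

```python
import math

def logistic_loss(f, y):
    """Piecewise loss: -log(f) if y == 1, -log(1 - f) if y == 0."""
    if y == 1:
        return -math.log(f)
    return -math.log(1.0 - f)

# Hypothetical predictions, from confidently right to confidently wrong for y = 1.
for f in (0.99, 0.5, 0.01):
    print(f"y=1, f={f}: loss = {logistic_loss(f, 1):.4f}")

# The same predictions scored against y = 0 reverse the ordering.
for f in (0.99, 0.5, 0.01):
    print(f"y=0, f={f}: loss = {logistic_loss(f, 0):.4f}")
```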

---

### 🧮 Logistic Regression Cost Function (Average Loss)

We define the **cost function** $J(w, b)$ as the **average loss over the $m$ training examples**:

$$
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(f^{(i)}, y^{(i)})
$$

This cost is:

- **Convex** ✅  
- Works well with **gradient descent** ✅

---

### 📌 Summary

- ❌ Squared error is **not appropriate** for logistic regression.
- ✅ New **log-loss** (or cross-entropy loss) is used:
  - Convex
  - Guides gradient descent properly
- The resulting cost surface is **smooth and convex**, so gradient descent can reach the global minimum

---

In the next section, we’ll simplify this cost function and begin **implementing gradient descent** to optimize parameters.

## 🧠 Logistic Regression – Simplified Loss and Cost Functions

---

### 🎯 Objective
Simplify the **loss** and **cost** functions for easier implementation during **gradient descent**.

---

### 🔁 Generalized Loss Function (Binary Classification)

Instead of writing two separate cases for $y = 1$ and $y = 0$, we can write the **loss** compactly as:

$$
\mathcal{L}(f(x), y) = -y \log(f(x)) - (1 - y)\log(1 - f(x))
$$

---

### ✅ Why This Works:

- When $y = 1$:
  - $1 - y = 0$, so the second term vanishes
  - $$
    \mathcal{L}(f(x), 1) = -\log(f(x))
    $$

- When $y = 0$:
  - The factor $y$ is 0, so the first term vanishes
  - $$
    \mathcal{L}(f(x), 0) = -\log(1 - f(x))
    $$

Thus, this single equation works **for both cases**.
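
The equivalence can also be confirmed numerically. The sketch below compares the two-case definition with the compact formula over a small grid of probabilities; the grid values are arbitrary.

```python
import math

def loss_piecewise(f, y):
    """Two-case definition of the logistic loss."""
    return -math.log(f) if y == 1 else -math.log(1.0 - f)

def loss_compact(f, y):
    """Single-expression form: -y*log(f) - (1 - y)*log(1 - f)."""
    return -y * math.log(f) - (1 - y) * math.log(1.0 - f)

# Largest disagreement between the two forms over the grid, for both labels.
max_gap = max(
    abs(loss_piecewise(f, y) - loss_compact(f, y))
    for f in (0.01, 0.25, 0.5, 0.75, 0.99)
    for y in (0, 1)
)
print("largest difference:", max_gap)
```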

---

### 💰 Cost Function (Average Loss Over Training Set)

If you have $m$ training examples, the total **cost function** becomes:

$$
J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(f^{(i)}, y^{(i)})
$$

Substitute the expression for $\mathcal{L}$:

$$
J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(f^{(i)}) + (1 - y^{(i)})\log(1 - f^{(i)}) \right]
$$

Where:

- $f^{(i)} = \frac{1}{1 + e^{-(w \cdot x^{(i)} + b)}}$ is the **sigmoid prediction**.
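
A direct translation of this cost into code might look like the sketch below. The six-example dataset is hypothetical. The small `eps` clamp is an added safeguard (an assumption, not part of the formula) that keeps `log` away from 0 when the sigmoid saturates in floating point.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def compute_cost(X, y, w, b, eps=1e-15):
    """Average logistic loss J(w, b) over the m training examples."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(wj * xj for wj, xj in zip(w, x_i)) + b
        f = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -y_i * math.log(f) - (1 - y_i) * math.log(1.0 - f)
    return total / m

# Hypothetical training set with two features per example.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]

print("cost at w=[0,0], b=0: ", compute_cost(X, y, [0.0, 0.0], 0.0))
print("cost at w=[1,1], b=-3:", compute_cost(X, y, [1.0, 1.0], -3.0))
```

With all parameters at zero every prediction is 0.5, so the cost equals $\log 2$ regardless of the labels, which is a convenient sanity check.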

---

### 📌 Notes:

- This cost function is **convex**, which makes **gradient descent reliable**.
- It's derived from **Maximum Likelihood Estimation (MLE)** — a common statistical principle for fitting models.
- This is the **standard loss used in binary logistic regression**.

---

### 🔧 Implementation Hint

In practice, this function is:

- Simple to code
- Easy to differentiate (for gradient descent)
- Robust for classification tasks

➡️ Let’s now move on to **implementing gradient descent** for this cost function.

## 🧮 Logistic Regression - Gradient Descent Implementation

To fit the parameters of a **logistic regression model**, we aim to **minimize the cost function**:

$$
J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(f_{\mathbf{w}, b}(\mathbf{x}^{(i)}), y^{(i)})
$$

Where:
- $\mathbf{w}$ are the weights
- $b$ is the bias
- $f(\mathbf{x})$ is the model output (sigmoid of $z = \mathbf{w} \cdot \mathbf{x} + b$)
- $\mathcal{L}$ is the **logistic loss**:
  $$
  \mathcal{L}(f, y) = -y \log(f) - (1 - y) \log(1 - f)
  $$

---

### 🔁 Gradient Descent Algorithm

To minimize the cost $J(\mathbf{w}, b)$, apply **gradient descent**:

#### 🧠 Derivatives:

- For each weight $w_j$:
  $$
  \frac{\partial J(\mathbf{w}, b)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\mathbf{w}, b}(\mathbf{x}^{(i)}) - y^{(i)}\right) \cdot x_j^{(i)}
  $$

- For the bias $b$:
  $$
  \frac{\partial J(\mathbf{w}, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\mathbf{w}, b}(\mathbf{x}^{(i)}) - y^{(i)}\right)
  $$
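
The two derivative formulas can be sketched in code and checked against a finite-difference approximation of the cost. The dataset and the evaluation point below are hypothetical, and `numerical_gradient_w` is a helper introduced only for this check.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def model(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def compute_cost(X, y, w, b):
    losses = [-y_i * math.log(model(x_i, w, b))
              - (1 - y_i) * math.log(1.0 - model(x_i, w, b))
              for x_i, y_i in zip(X, y)]
    return sum(losses) / len(X)

def compute_gradient(X, y, w, b):
    """Return (dJ/dw, dJ/db) using the formulas above."""
    m, n = len(X), len(w)
    dj_dw, dj_db = [0.0] * n, 0.0
    for x_i, y_i in zip(X, y):
        err = model(x_i, w, b) - y_i      # f(x^(i)) - y^(i)
        for j in range(n):
            dj_dw[j] += err * x_i[j]
        dj_db += err
    return [g / m for g in dj_dw], dj_db / m

def numerical_gradient_w(X, y, w, b, j, h=1e-6):
    """Central-difference estimate of dJ/dw_j (used only as a check)."""
    w_plus, w_minus = list(w), list(w)
    w_plus[j] += h
    w_minus[j] -= h
    return (compute_cost(X, y, w_plus, b) - compute_cost(X, y, w_minus, b)) / (2 * h)

# Hypothetical data and parameters.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]
w, b = [0.5, -0.5], 0.1

dj_dw, dj_db = compute_gradient(X, y, w, b)
for j in range(len(w)):
    print(f"dJ/dw_{j + 1}: analytic {dj_dw[j]:.6f}, "
          f"numeric {numerical_gradient_w(X, y, w, b, j):.6f}")
```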

---

### 🔄 Parameter Updates

Simultaneous updates:

$$
w_j := w_j - \alpha \cdot \frac{\partial J}{\partial w_j} \quad \text{(for all } j=1,\dots,n\text{)}
$$

$$
b := b - \alpha \cdot \frac{\partial J}{\partial b}
$$

Where:
- $\alpha$ is the **learning rate**
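
Putting the derivatives and the update rules together gives a loop like the sketch below. The dataset, learning rate, and iteration count are hypothetical choices. The point to notice is that both gradients are computed from the old parameters before either is updated, which is what "simultaneous" means here.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def model(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def compute_cost(X, y, w, b):
    return sum(-y_i * math.log(model(x_i, w, b))
               - (1 - y_i) * math.log(1.0 - model(x_i, w, b))
               for x_i, y_i in zip(X, y)) / len(X)

def compute_gradient(X, y, w, b):
    m, n = len(X), len(w)
    errors = [model(x_i, w, b) - y_i for x_i, y_i in zip(X, y)]
    dj_dw = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m for j in range(n)]
    dj_db = sum(errors) / m
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, num_iters):
    w = list(w)
    for _ in range(num_iters):
        # Both gradients come from the current (old) w and b ...
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        # ... and only then are all parameters updated together.
        w = [wj - alpha * g for wj, g in zip(w, dj_dw)]
        b = b - alpha * dj_db
    return w, b

# Hypothetical data and hyperparameters.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]

initial_cost = compute_cost(X, y, [0.0, 0.0], 0.0)
w_fit, b_fit = gradient_descent(X, y, [0.0, 0.0], 0.0, alpha=0.1, num_iters=1000)
final_cost = compute_cost(X, y, w_fit, b_fit)
print(f"cost before: {initial_cost:.4f}, cost after: {final_cost:.4f}")
```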

---

### ❗ Important Notes

- The equations **look similar** to those in linear regression, but:
  - In **linear regression**, $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$
  - In **logistic regression**, $f(\mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$ where $\sigma(z)$ is the **sigmoid function** (written $g(z)$ earlier)

- So the **behavior is different**, despite the gradient formulas being similar in form.

---

### ⚡ Optimization Tip: Feature Scaling

Feature scaling (e.g., normalizing inputs to range $[-1, 1]$) helps **speed up convergence** of gradient descent — just like in linear regression.
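
One simple way to bring a feature into $[-1, 1]$ is min-max rescaling, sketched below on a hypothetical feature column. This is only one of several scaling schemes (z-score normalization is another), and the function name is illustrative.

```python
def scale_to_unit_range(column):
    """Rescale values linearly so the minimum maps to -1 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0.
        return [0.0 for _ in column]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in column]

# Hypothetical raw feature values on a large scale.
sizes = [852.0, 1416.0, 1534.0, 2104.0]
scaled = scale_to_unit_range(sizes)
print(scaled)
```

When scoring new examples, the same `lo` and `hi` learned from the training set should be reused so that training and prediction see features on the same scale.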

---

### 🧪 Labs

You’ll implement this in the optional/practice labs:
- Visualize: sigmoid function, contour plots, 3D surface, learning curves
- Use **scikit-learn** for logistic regression

## 🧠 The Problem of Overfitting

### ✅ Concepts Introduced

- **Overfitting**: When a model fits the training data *too well*, including the noise — resulting in poor generalization to new/unseen data.
- **Underfitting**: When a model is *too simple* to capture the underlying trend in the data.
- **Generalization**: The ability of a model to perform well on **new data**, not just the training data.

---

### 📉 Example: Housing Prices (Regression)

#### Underfitting (High Bias)
- Model: Simple linear regression.
- Fits the data poorly.
- Misses obvious patterns.
- Cost is high.
- **Too simple.**

#### Just Right (Balanced Bias-Variance)
- Model: Quadratic regression (`x`, `x²`).
- Fits the data reasonably well.
- Generalizes well to new houses.
- **Best balance**.

#### Overfitting (High Variance)
- Model: 4th-degree polynomial (`x`, `x²`, `x³`, `x⁴`).
- Fits the training data *perfectly* (cost = 0).
- Very **wiggly**, poor predictions for new data.
- Sensitive to slight changes in training set.
- **Too complex.**
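
The "cost = 0" behaviour can be reproduced on a toy case. The sketch below assumes a training set of exactly five points with distinct $x$ values, so a 4th-degree polynomial can pass through all of them. It builds that polynomial with Lagrange interpolation rather than gradient descent, purely for brevity, and the data values are made up.

```python
def lagrange_predict(x, xs, ys):
    """Evaluate the unique degree-(n-1) polynomial through the n points (xs, ys)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def squared_error_cost(xs, ys, predict):
    m = len(xs)
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Hypothetical training data: house size (1000 sq ft) vs. price (100k).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.5, 2.9, 3.2, 4.6, 4.9]

def overfit_model(x):
    return lagrange_predict(x, xs, ys)

train_cost = squared_error_cost(xs, ys, overfit_model)
print("training cost:", train_cost)
# Predictions away from the training points are where the wiggles show up.
print("prediction at x = 5.5:", overfit_model(5.5))
print("prediction at x = 6.0:", overfit_model(6.0))
```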

---

### 📊 Classification Example (Logistic Regression)

- Inputs: `x₁ = tumor size`, `x₂ = patient age`
- Labels: Malignant (×) vs Benign (○)

| Model Type        | Decision Boundary          | Behavior      |
|-------------------|-----------------------------|---------------|
| **Underfit**      | Linear (`z = wᵀx + b`)       | High bias     |
| **Just Right**    | Quadratic (`x₁²`, `x₂²`, etc.) | Good fit      |
| **Overfit**       | High-order polynomial        | High variance |

---

### 🧩 Terminology

- **High Bias**: Model too simplistic; underfits data.
- **High Variance**: Model too complex; overfits data.
- **Just Right**: Good generalization, low cost on new data.

> 💬 Like the story of *Goldilocks*:  
> ❄ Too cold → Underfit (High Bias)  
> 🔥 Too hot → Overfit (High Variance)  
> 🍲 Just right → Balanced model

---

### 🛠️ What’s next?

In the next video:  
➡️ **Regularization**: A technique to reduce overfitting and help the model generalize better.

## 🛠️ Addressing Overfitting

Overfitting occurs when a model fits the training data *too well*, capturing noise and leading to poor generalization. In this video, three main strategies to reduce overfitting are introduced:

---

### 1. 📈 Collect More Training Data
- More data helps the model generalize better.
- Reduces the model’s tendency to memorize noise.
- Particularly useful for complex models with many features.
- **Limitation**: Not always feasible (e.g., limited housing sales in an area).

---

### 2. 🧩 Use Fewer Features (Feature Selection)
- Using too many features (especially irrelevant ones) can cause overfitting.
- Reducing the feature set helps simplify the model.
- This process is called **feature selection**.

#### Example:
Instead of using 100 features (size, bedrooms, floors, age, income, distance to coffee shop, etc.), choose a subset like:
- Size
- Number of bedrooms
- Age of the house

> ⚠️ Caveat: You might discard useful information by eliminating features.

---

### 3. 🧲 Apply Regularization
Regularization **shrinks parameter values** to reduce overfitting without eliminating features.

#### 🔎 Key Ideas:
- Overfit models often have **large weights** (`w₁, w₂, ..., wₙ`).
- Regularization encourages smaller weights.
- Keeps **all features**, but limits their impact.

#### ✅ Advantage:
- Smooths the model without dropping features completely.
- Especially useful in models like polynomial regression or logistic regression.

#### 📌 Convention:
- Regularize only the weights (`w₁...wₙ`), not the bias term `b`.
- Regularizing `b` makes minimal difference in practice.

---

### 📚 Summary: 3 Ways to Reduce Overfitting

| Strategy                | Description                                                     |
|-------------------------|-----------------------------------------------------------------|
| 1. More Data            | Improves generalization, reduces model variance.               |
| 2. Fewer Features       | Simplifies model, avoids irrelevant/noisy data.                |
| 3. Regularization       | Penalizes large weights, reduces model complexity smoothly.    |

---

### 🧪 Optional Lab
In the optional lab:
- Play with synthetic regression/classification data.
- Add/remove data points.
- Adjust polynomial degree (`x`, `x²`, `x³`, etc.).
- Try adding/removing features.
- **Visualize** the impact of overfitting and how each strategy mitigates it.

## 🧮 Cost Function with Regularization

### 🎯 Goal
Incorporate **regularization** into the model to prevent **overfitting** by penalizing large values of the parameters $w_j$.

---

### 🔁 Cost Function (Linear Regression)

The original (non-regularized) cost function is:

$$
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f^{(i)} - y^{(i)} \right)^2
$$

> where $f^{(i)} = w^T x^{(i)} + b$

---

### 🛡️ Regularization Term

We add a penalty on the size of the weights:

$$
\text{Regularization term} = \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$

- Penalizes large values of $w_j$
- The bias term $b$ is **not** typically regularized
- $\lambda$ is the **regularization parameter** that controls the strength of the penalty

---

### ✅ Regularized Cost Function

The full cost function with regularization becomes:

$$
J_{\text{reg}}(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f^{(i)} - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$

- The first term minimizes the prediction error
- The second term keeps the model weights small to reduce overfitting
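
The two terms can be computed separately and added, as in the sketch below. The tiny dataset is hypothetical and chosen so that the prediction error is exactly zero, which isolates the contribution of the penalty. `lambda_` has a trailing underscore because `lambda` is a reserved word in Python.

```python
def regularized_cost(X, y, w, b, lambda_):
    """Squared-error cost plus (lambda / 2m) * sum of squared weights."""
    m = len(X)
    squared_errors = sum(
        (sum(wj * xj for wj, xj in zip(w, x_i)) + b - y_i) ** 2
        for x_i, y_i in zip(X, y)
    )
    penalty = sum(wj ** 2 for wj in w)  # the bias b is not penalized
    return squared_errors / (2 * m) + (lambda_ / (2 * m)) * penalty

# Hypothetical data that the line f(x) = 2x fits exactly.
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]

print(regularized_cost(X, y, [2.0], 0.0, lambda_=0.0))  # error term only
print(regularized_cost(X, y, [2.0], 0.0, lambda_=3.0))  # adds (3 / 6) * 2^2
```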

---

### ⚖️ Effect of $\lambda$

| Value of $\lambda$         | Model Behavior                            |
|----------------------------|-------------------------------------------|
| $\lambda = 0$              | No regularization → ❗️ Risk of overfitting |
| $\lambda \to \infty$       | All weights $\to 0$ → ❗️ Underfitting      |
| Intermediate $\lambda$     | 🟢 Balanced fit with reduced complexity    |
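
The shrinking effect of $\lambda$ is easy to see in a simplified setting. The sketch below assumes a single feature and no bias term, in which case setting the derivative of the regularized cost to zero gives $w = \sum_i x^{(i)} y^{(i)} / \left(\sum_i (x^{(i)})^2 + \lambda\right)$. The data is hypothetical.

```python
def best_weight(xs, ys, lambda_):
    """Minimizer of (1/2m) * sum((w*x - y)^2) + (lambda/2m) * w^2 (one feature, no bias)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lambda_)

# Hypothetical data generated by y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for lambda_ in (0.0, 14.0, 1e6):
    print(f"lambda = {lambda_:g}: w = {best_weight(xs, ys, lambda_):.6f}")
```

As $\lambda$ grows the denominator dominates and $w$ is driven toward 0, which is the underfitting row of the table.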

---

### 📌 Notes

- Regularization is especially useful when you have **many features** and are unsure which ones are important
- It allows you to **keep all features** while reducing their influence
- By convention, **only** $w_1, ..., w_n$ are regularized (not $b$)
- The $1 / (2m)$ scaling helps make $\lambda$ more consistent across different dataset sizes

---

### 🧠 Intuition

Regularization ≈ Enforcing simpler models:

- Less “wiggly” / less complex functions
- Better generalization to new data
- Prevents the model from fitting to noise

---

## 🧮 Regularized Linear Regression

In regularized linear regression, we apply **gradient descent** to minimize the following cost function:

$$
J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f^{(i)} - y^{(i)} \right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$

> where $f^{(i)} = w^T x^{(i)} + b$

---

### 🔁 Gradient Descent Updates

For each parameter $w_j$ (with $j = 1, \dots, n$):

$$
w_j := w_j - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right)
$$

For the bias $b$ (not regularized):

$$
b := b - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) \right)
$$
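
A single step of these updates can be sketched as below. The second function rewrites the weight update as $w_j \left(1 - \alpha \frac{\lambda}{m}\right)$ minus the usual unregularized step, which makes the "shrink on every iteration" effect explicit. The data and hyperparameters are hypothetical.

```python
def errors(X, y, w, b):
    return [sum(wj * xj for wj, xj in zip(w, x_i)) + b - y_i
            for x_i, y_i in zip(X, y)]

def step_as_written(X, y, w, b, alpha, lambda_):
    """One update using the formulas exactly as written above."""
    m, err = len(X), errors(X, y, w, b)
    new_w = [w[j] - alpha * (sum(e * x_i[j] for e, x_i in zip(err, X)) / m
                             + (lambda_ / m) * w[j])
             for j in range(len(w))]
    new_b = b - alpha * sum(err) / m   # b is not regularized
    return new_w, new_b

def step_shrink_form(X, y, w, b, alpha, lambda_):
    """Same update, regrouped: shrink w_j first, then take the usual step."""
    m, err = len(X), errors(X, y, w, b)
    shrink = 1.0 - alpha * lambda_ / m
    new_w = [w[j] * shrink - alpha * sum(e * x_i[j] for e, x_i in zip(err, X)) / m
             for j in range(len(w))]
    new_b = b - alpha * sum(err) / m
    return new_w, new_b

# Hypothetical data and hyperparameters.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
y = [3.0, 4.0, 7.0]
w, b = [0.5, -0.2], 0.1

new_w_a, new_b_a = step_as_written(X, y, w, b, alpha=0.01, lambda_=1.0)
new_w_b, new_b_b = step_shrink_form(X, y, w, b, alpha=0.01, lambda_=1.0)
print(new_w_a, new_b_a)
print(new_w_b, new_b_b)
```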

---

### ✅ Summary

- Regularization adds a term to the gradient update that **shrinks** weights
- Helps prevent **overfitting** by discouraging large parameter values
- Bias term $b$ is **not** regularized by default
- Works especially well when you have **many features** and **not much data**

---

## 🧪 Regularized Logistic Regression

Logistic regression can **overfit** when trained with many features — especially **high-order polynomials**. Regularization helps reduce overfitting by penalizing large parameter values.

---

### 🎯 Regularized Cost Function

To regularize logistic regression, we modify the original cost function:

$$
J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(f(x^{(i)})) + (1 - y^{(i)}) \log(1 - f(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2
$$

> - $f(x^{(i)}) = \sigma(z^{(i)}) = \frac{1}{1 + e^{-z^{(i)}}}$  
> - $z^{(i)} = w^T x^{(i)} + b$

- The second term is the **regularization term**, which penalizes large values of $w_j$
- We do **not** regularize the bias term $b$

---

### 🔁 Gradient Descent Updates

We minimize $J(w, b)$ using gradient descent with the following **update rules**:

#### For each weight $w_j$:

$$
w_j := w_j - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} (f^{(i)} - y^{(i)}) x_j^{(i)} + \frac{\lambda}{m} w_j \right)
$$

#### For the bias $b$ (not regularized):

$$
b := b - \alpha \left( \frac{1}{m} \sum_{i=1}^{m} (f^{(i)} - y^{(i)}) \right)
$$

- $\alpha$ is the learning rate  
- $\lambda$ is the regularization parameter  
- $m$ is the number of training examples
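
The sketch below trains the same model twice, once with $\lambda = 0$ and once with $\lambda = 1$, and compares the size of the learned weights. The dataset, learning rate, iteration count, and $\lambda$ values are all hypothetical. `train` and `norm` are illustrative helper names.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train(X, y, alpha, lambda_, num_iters):
    """Gradient descent for logistic regression with an L2 penalty on w (not on b)."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(num_iters):
        errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, x_i)) + b) - y_i
                  for x_i, y_i in zip(X, y)]
        # Gradients use the old w and b, so the update below is simultaneous.
        dj_dw = [sum(e * x_i[j] for e, x_i in zip(errors, X)) / m
                 + (lambda_ / m) * w[j]
                 for j in range(n)]
        dj_db = sum(errors) / m
        w = [wj - alpha * g for wj, g in zip(w, dj_dw)]
        b = b - alpha * dj_db
    return w, b

def norm(w):
    return math.sqrt(sum(wj * wj for wj in w))

# Hypothetical training set with two features per example.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]

w_unreg, b_unreg = train(X, y, alpha=0.1, lambda_=0.0, num_iters=1000)
w_reg, b_reg = train(X, y, alpha=0.1, lambda_=1.0, num_iters=1000)
print(f"lambda = 0: |w| = {norm(w_unreg):.3f}")
print(f"lambda = 1: |w| = {norm(w_reg):.3f}")
```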

---

### ⚙️ Key Insights

- The update for $w_j$ is **identical** in form to regularized linear regression
- Only difference: $f^{(i)}$ is now the **sigmoid** of $z^{(i)}$, not a linear function
- Regularization **shrinks** the weights $w_j$, helping to prevent overfitting
- $b$ is **excluded** from regularization by convention; penalizing it makes little practical difference

---

### ✅ Summary

- Regularization improves generalization in logistic regression by **penalizing large weights**
- The effect is to smooth the decision boundary and **avoid overfitting**, even with many features
- Update rules are nearly identical to those for regularized linear regression

---

🔬 You’ll implement this in the practice lab. Try adjusting $\lambda$ interactively and observe how it affects the decision boundary.