# Logistic Regression: Mathematical Intuition and Cost Function


## Summary

* **Linear Regression** fails for classification because of **outliers** and unbounded outputs that can exceed 1 or drop below 0.
* **Logistic Regression** squashes the output between 0 and 1 using the **Sigmoid Activation function**.
* The standard linear regression cost function (squared error) becomes **non-convex** when combined with the sigmoid hypothesis, so it can have multiple **local minima**.
* To ensure a **convex function** with a single **global minimum**, Logistic Regression uses a **Log Loss** cost function.
* The **Convergence Algorithm (Gradient Descent)** iteratively minimizes the cost function by updating parameter values.

---

## Recap: Why Linear Regression Fails for Classification

Linear Regression is unsuitable for classification for two key reasons:

* **Outliers** can significantly shift the best fit line, altering predictions.
* The linear hypothesis produces unbounded outputs that may be **greater than 1** or **less than 0**.

To solve this, model outputs must be **bounded** within valid probability limits (0 to 1).
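
Both failure modes can be reproduced on a small example. The sketch below is illustrative only: the data (hours studied versus pass/fail) and the helper `fit_line` are hypothetical, and `np.polyfit` is used as a stand-in for an ordinary least-squares fit.

```python
import numpy as np

# Hypothetical toy data: hours studied (x) and pass/fail labels (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def fit_line(x, y):
    """Least-squares fit of y = theta0 + theta1 * x; returns (theta0, theta1)."""
    theta1, theta0 = np.polyfit(x, y, deg=1)  # polyfit returns highest degree first
    return theta0, theta1

theta0, theta1 = fit_line(x, y)

# Problem 1: outputs are unbounded, so inputs far from the data leave [0, 1].
pred_low = theta0 + theta1 * (-5.0)
pred_high = theta0 + theta1 * 15.0

# Problem 2: one extreme (but correctly labeled) point flattens the line.
x_out = np.append(x, 30.0)
y_out = np.append(y, 1.0)
theta0_out, theta1_out = fit_line(x_out, y_out)

# Prediction for x = 4 (a positive example) before and after adding the outlier.
pred_4_before = theta0 + theta1 * 4.0
pred_4_after = theta0_out + theta1_out * 4.0

print(pred_low, pred_high)
print(pred_4_before, pred_4_after)
```

On this data the prediction for `x = 4` drops from above 0.5 to below 0.5 once the outlier is added, even though the outlier itself is labeled correctly.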

---

## The Sigmoid Activation Function

The standard way to squash the linear output into the range 0 to 1 is the **Sigmoid Function**.

### Sigmoid Formula

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Key properties:

* The output is always strictly between **0 and 1**.
* If $z \ge 0$, then $\sigma(z) \ge 0.5$.
* If $z < 0$, then $\sigma(z) < 0.5$.
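
The properties above can be checked numerically. This is a minimal sketch using NumPy; the function name `sigmoid` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Return 1 / (1 + exp(-z)), applied elementwise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Sample inputs on both sides of zero.
z_values = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
outputs = sigmoid(z_values)

print(outputs)
print(sigmoid(0.0))  # the midpoint of the curve
```

Note that in floating-point arithmetic, inputs with very large magnitude saturate to exactly 0.0 or 1.0, so the strict bounds hold mathematically but not always numerically.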

---

## The Logistic Regression Hypothesis

In Linear Regression:

$$
h_\theta(x) = \theta_0 + \theta_1 x_1
$$

In Logistic Regression, we apply the sigmoid function:

$$
h_\theta(x) = \frac{1}{1 + e^{-z}}
$$

where:

$$
z = \theta_0 + \theta_1 x_1
$$

Thus,

$$
h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1)
$$

This transformation ensures predictions are bounded between 0 and 1.
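
As a rough sketch, the hypothesis for a single feature could be written as follows. The function names, the parameter values, and the 0.5 decision threshold in `predict_label` are assumptions made for illustration; the text above only defines the probability output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(x1, theta0, theta1):
    """h_theta(x) = sigmoid(theta0 + theta1 * x1)."""
    z = theta0 + theta1 * np.asarray(x1, dtype=float)
    return sigmoid(z)

def predict_label(x1, theta0, theta1, threshold=0.5):
    """Turn probabilities into 0/1 labels (threshold is an assumed convention)."""
    return (hypothesis(x1, theta0, theta1) >= threshold).astype(int)

# Hypothetical parameters: z = 0 (so h = 0.5) exactly at x1 = 4.
theta0, theta1 = -4.0, 1.0
x1 = np.array([2.0, 4.0, 6.0])

print(hypothesis(x1, theta0, theta1))
print(predict_label(x1, theta0, theta1))
```

With these parameters, inputs below 4 give $z < 0$ and a probability under 0.5, while inputs above 4 give a probability over 0.5.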

---

## The Cost Function and Convexity

In Linear Regression, the cost function is:

$$
J(\theta_0, \theta_1) =
\frac{1}{2m}
\sum_{i=1}^{m}
\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
$$

If this Mean Squared Error (MSE) is used with the sigmoid hypothesis, the resulting cost function becomes **non-convex**.

### Why is this a problem?

* A non-convex function can have multiple **local minima**.
* Gradient Descent may get stuck in a local minimum.
* The model may fail to reach the true **global minimum**.
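
One way to see the non-convexity is to test the midpoint inequality that every convex function must satisfy: the cost at the midpoint of two parameter values can be no larger than the average of the costs at those two values. The sketch below uses a single hypothetical training example ($x = 1$, $y = 1$) with $\theta_0$ fixed at 0. It shows that the squared-error cost breaks this inequality; it does not by itself exhibit multiple local minima.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_cost(theta1, x=1.0, y=1.0):
    """Squared-error cost for one example, with theta0 fixed at 0."""
    return 0.5 * (sigmoid(theta1 * x) - y) ** 2

def log_loss_cost(theta1, x=1.0, y=1.0):
    """Log loss for the same single example."""
    h = sigmoid(theta1 * x)
    return -(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def violates_midpoint_convexity(cost, a, b):
    """True if cost(midpoint) exceeds the average of cost(a) and cost(b)."""
    mid = 0.5 * (a + b)
    return bool(cost(mid) > 0.5 * (cost(a) + cost(b)))

# Compare both costs on the same pair of parameter values.
print(violates_midpoint_convexity(mse_cost, -6.0, 0.0))
print(violates_midpoint_convexity(log_loss_cost, -6.0, 0.0))
```

For this pair of points the squared-error cost violates the inequality while the log loss does not, which is consistent with the sigmoid flattening the squared error wherever the prediction saturates.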

---

## The Log Loss Cost Function

To ensure convexity, Logistic Regression uses **Log Loss (Binary Cross-Entropy)**.

### Conditional Form

If $y = 1$:

$$
\text{Cost} = -\log(h_\theta(x))
$$

If $y = 0$:

$$
\text{Cost} = -\log(1 - h_\theta(x))
$$

### Combined Form

$$
\text{Cost}(h_\theta(x), y)
=
- y \log(h_\theta(x))
- (1 - y) \log(1 - h_\theta(x))
$$
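
The combined form works because $y$ is either 0 or 1, so exactly one of the two terms survives for each example. A small sketch can confirm that both forms agree; the function names are illustrative, and the code assumes $0 < h_\theta(x) < 1$ and the natural logarithm.

```python
import math

def cost_conditional(h, y):
    """Piecewise form: one branch per label."""
    if y == 1:
        return -math.log(h)
    return -math.log(1.0 - h)

def cost_combined(h, y):
    """Single expression: the inactive term is multiplied by zero."""
    return -y * math.log(h) - (1 - y) * math.log(1.0 - h)

# Compare the two forms on a few predicted probabilities.
for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        print(h, y, cost_conditional(h, y), cost_combined(h, y))
```

For a positive example ($y = 1$), a confident correct prediction of 0.9 costs about 0.105, while a confident wrong prediction of 0.1 costs about 2.303, so the penalty grows sharply as the prediction moves away from the true label.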

### Final Logistic Regression Cost Function

$$
J(\theta_0, \theta_1)
=
-\frac{1}{m}
\sum_{i=1}^{m}
\left[
y^{(i)} \log(h_\theta(x^{(i)}))
+
(1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))
\right]
$$

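The cost function $J(\theta_0, \theta_1)$ is **convex** in the parameters, meaning:

* Every local minimum is also a **global minimum**
* There are no suboptimal local minima for Gradient Descent to get stuck in
* Optimization with Gradient Descent is reliable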
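
The final cost function above could be computed over a dataset roughly as follows. This is a sketch under stated assumptions: the toy data is hypothetical, and the clipping with `eps` is an added numerical safeguard against `log(0)` that is not part of the formula itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta0, theta1, x, y, eps=1e-12):
    """Average log loss J(theta0, theta1) over all m examples."""
    h = sigmoid(theta0 + theta1 * x)
    h = np.clip(h, eps, 1.0 - eps)  # assumption: guard against log(0)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

cost_uninformed = log_loss(0.0, 0.0, x, y)   # every prediction is 0.5
cost_better = log_loss(-5.0, 2.0, x, y)      # boundary placed at x = 2.5

print(cost_uninformed, cost_better)
```

With all parameters at zero every prediction is 0.5, so the cost equals $\log 2 \approx 0.693$. Parameters that separate the two classes give a lower cost.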

---

## Convergence Algorithm (Gradient Descent)

The goal of training is to **minimize the cost function** by updating parameters $\theta_0$ and $\theta_1$.

The update rule is:

$$
\theta_j
:=
\theta_j
-
\alpha
\frac{\partial}{\partial \theta_j}
J(\theta_0, \theta_1)
$$

Where:

* $\alpha$ = learning rate
* $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ = partial derivative of the cost function with respect to $\theta_j$

Because the Log Loss function is convex, Gradient Descent with a suitably small learning rate converges toward the **global minimum** rather than getting stuck in a local one.
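
The update rule can be sketched in code for the single-feature case. The partial derivatives are not derived in these notes; the sketch relies on the standard result for log loss with a sigmoid hypothesis, $\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$ with $x_0^{(i)} = 1$. The data, the learning rate, and the iteration count are hypothetical choices. The labels are deliberately not perfectly separable, so that a finite minimum exists.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(x, y, alpha=0.1, n_iters=10000):
    """Minimize log loss for h = sigmoid(theta0 + theta1 * x)."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        error = sigmoid(theta0 + theta1 * x) - y  # h_theta(x) - y per example
        grad0 = np.mean(error)                    # dJ/dtheta0
        grad1 = np.mean(error * x)                # dJ/dtheta1
        # Both gradients are computed before either parameter changes,
        # so this is a simultaneous update.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Hypothetical toy data with overlapping classes.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)
print(sigmoid(theta0 + theta1 * x))
```

If the learning rate is too large the updates can overshoot and diverge, and if it is too small convergence is slow, so $\alpha$ usually needs some tuning per dataset.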