## Concept Overview

In Logistic Regression, we want to predict the probability that a given input $x$ belongs to the positive class ($y = 1$):

$\hat{y} = P(y = 1 \mid x)$


Since probabilities must lie between $0$ and $1$, we cannot simply use a linear function such as

$$
w^T x + b,
$$

whose output can be any real number.

Instead, we first define the linear term $z = w^T x + b$ and then pass it through the **sigmoid function**, defined as

$$
\sigma(z) = \frac{1}{1 + e^{-z}}.
$$

This S-shaped curve acts as a squashing function:

- When $z$ is a very large positive number, $e^{-z} \to 0$, so  
  $$
  \sigma(z) \to 1.
  $$

- When $z$ is a very large negative number, $e^{-z}$ becomes very large, making the denominator grow significantly, so  
  $$
  \sigma(z) \to 0.
  $$

Thus, the sigmoid function ensures that the output always remains between $0$ and $1$, making it suitable for modeling probabilities.
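The two steps above (linear term, then sigmoid) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names `sigmoid` and `predict_proba` and the values of `w`, `b`, and `x` are assumptions made up for the example. The standard library is used instead of `torch` so that each arithmetic step stays visible.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z),
    # because math.exp(-z) would overflow for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x, b):
    """Compute y_hat = sigmoid(w^T x + b) for a single example."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical parameters and input, chosen only for illustration.
w = [0.5, -1.0]
b = 0.25
x = [2.0, 1.0]

y_hat = predict_proba(w, x, b)  # z = 0.5*2.0 - 1.0*1.0 + 0.25 = 0.25
print(y_hat)                    # roughly 0.562

# The squashing behaviour at the extremes:
print(sigmoid(50.0))   # very close to 1
print(sigmoid(-50.0))  # very close to 0
```

Splitting on the sign of `z` is a common way to keep the exponential from overflowing; for moderate inputs both branches give the same value as the textbook formula.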


### Sigmoid
![sigmoid-curve](../images/sigmoid-curve.png)

$\sigma(z) = \frac{1}{1 + e^{-z}}$

- Squashes input into the range (0, 1).
- Used for binary classification, where the output is the probability that the input belongs to the positive class.
- Suffers from vanishing gradients: when inputs are very positive or very negative, the gradient becomes extremely small, slowing or stopping learning.
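The vanishing-gradient point can be checked numerically. The sketch below relies on the standard identity $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, which is not derived in this notebook; the helper names `sigmoid` and `sigmoid_grad` are assumptions introduced for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Textbook logistic function; fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient peaks at z = 0 (value 0.25) and shrinks quickly on both sides.
for z in (-10.0, -4.0, 0.0, 4.0, 10.0):
    print(f"z = {z:6.1f}  sigma(z) = {sigmoid(z):.6f}  gradient = {sigmoid_grad(z):.6f}")
```

At $z = \pm 10$ the gradient is already below $10^{-4}$, so a weight update driven by it would be tiny.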

In [1]:
import torch

In [2]:
z_values = torch.arange(-4, 5, 2)
z_values

tensor([-4, -2,  0,  2,  4])

In [3]:
torch.sigmoid(z_values)

tensor([0.0180, 0.1192, 0.5000, 0.8808, 0.9820])

## Loss vs. Cost Function

The **Loss function** $\mathcal{L}(\hat{y}, y)$ measures error for a single training example, while the **Cost function** $J(w, b)$ measures the average error across the entire training set of $m$ examples.

**1. The Loss Function (Cross-Entropy)**

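To ensure a convex optimization surface, we use the cross-entropy loss rather than squared error, which becomes non-convex when combined with the sigmoid: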
$$\mathcal{L}(\hat{y}, y) = -(y \log \hat{y} + (1-y) \log(1-\hat{y}))$$

This formula automatically adapts based on the ground truth $y$:

- If $y = 1$: $\mathcal{L}(\hat{y}, y) = -\log \hat{y}$
  - Goal: $\hat{y} \rightarrow 1$ (Loss $\rightarrow 0$).
- If $y = 0$: $\mathcal{L}(\hat{y}, y) = -\log(1-\hat{y})$
  - Goal: $\hat{y} \rightarrow 0$ (Loss $\rightarrow 0$).
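To see how the two cases behave, here is a small sketch of the per-example loss in plain Python. The function name `cross_entropy_loss` and the sample predictions are assumptions chosen for illustration, and the code expects $\hat{y}$ strictly between 0 and 1 (the logarithm is undefined at the endpoints).

```python
import math

def cross_entropy_loss(y_hat: float, y: int) -> float:
    """Loss for one example: -(y*log(y_hat) + (1 - y)*log(1 - y_hat)).

    Assumes 0 < y_hat < 1 and y in {0, 1}.
    """
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# y = 1: only the -log(y_hat) term is active.
print(cross_entropy_loss(0.9, 1))  # confident and correct: about 0.105
print(cross_entropy_loss(0.1, 1))  # confident and wrong:   about 2.303

# y = 0: only the -log(1 - y_hat) term is active.
print(cross_entropy_loss(0.1, 0))  # confident and correct: about 0.105
print(cross_entropy_loss(0.9, 0))  # confident and wrong:   about 2.303
```

The loss is small when the prediction agrees with the label and grows rapidly as the prediction moves toward the wrong class.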

**2. The Cost Function**

The cost is the arithmetic mean of all individual losses:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$$

Full expanded form:

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1-y^{(i)}) \log(1-\hat{y}^{(i)}) \right]$$
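The averaging step can be sketched as follows. The function name `cost` and the four labels and predictions are hypothetical values made up for this example; in practice the predictions would come from $\sigma(w^T x^{(i)} + b)$.

```python
import math

def cost(y_hats, ys):
    """Mean cross-entropy over m examples (each y_hat must be in (0, 1))."""
    m = len(ys)
    total = 0.0
    for y_hat, y in zip(y_hats, ys):
        # Per-example loss, identical to the expanded form of J(w, b).
        total += -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))
    return total / m

# Hypothetical labels and predicted probabilities for m = 4 examples.
y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.2, 0.6, 0.4]

J = cost(y_pred, y_true)
print(J)  # about 0.3375
```

Here the individual losses are roughly 0.105, 0.223, 0.511, and 0.511, and the cost is their mean.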

**Why do we multiply the loss function by $-1$?**
- The logarithm of any number strictly between 0 and 1 is negative. Since Logistic Regression outputs probabilities in that range, both $\log(\hat{y})$ and $\log(1-\hat{y})$ are negative. Multiplying by $-1$ therefore turns the result into a positive loss that gradient descent can minimize.

In [6]:
torch.log(torch.tensor(0.9))

tensor(-0.1054)