# Logistic Regression

The sigmoid function is defined as:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where:
- $z$ is the input to the sigmoid function.
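
As a quick illustration of the definition, here is a minimal NumPy sketch that evaluates the sigmoid at a few points. The sample inputs are arbitrary values chosen for this example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example inputs (not taken from any dataset)
z = np.array([-2.0, 0.0, 2.0])
p = sigmoid(z)

# sigmoid(0) is exactly 0.5, and the outputs are symmetric around it:
# sigmoid(-z) = 1 - sigmoid(z)
print(p)
```

Large negative inputs map close to 0, large positive inputs map close to 1, which is what lets the output be read as a probability.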

The loss function is typically the binary cross-entropy:

$$
L = -\left( y \log(\sigma(z)) + (1 - y) \log(1 - \sigma(z)) \right)
$$


Here:
- $y$ represents the true label.
- $\sigma(z)$ is the predicted probability from the sigmoid function.
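
To get a feel for the formula, the sketch below evaluates the loss for a single sample. The helper name `bce` and the probabilities 0.9 and 0.1 are made up for illustration.

```python
import numpy as np

def bce(y, p):
    """Binary cross-entropy for one label y in {0, 1} and predicted probability p."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# True label is 1 in both cases; only the predicted probability differs.
loss_confident_right = bce(1, 0.9)  # reduces to -log(0.9), a small loss
loss_confident_wrong = bce(1, 0.1)  # reduces to -log(0.1), a large loss

print(loss_confident_right, loss_confident_wrong)
```

When $y = 1$ only the first term survives, and when $y = 0$ only the second does, so the loss always penalizes the probability assigned to the wrong class.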

Differentiating $L$ with respect to $z$ gives a remarkably simple gradient. The result is stated first, followed by the step-by-step derivation using the chain rule:

$$
\frac{dL}{dz} = \sigma(z) - y
$$



\begin{align*}
\sigma(z) &= \frac{1}{1 + e^{-z}} \\
L &= -\left( y \log(\sigma(z)) + (1 - y) \log(1 - \sigma(z)) \right) \\
\frac{dL}{d\sigma(z)} &= -\left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \\
\frac{d\sigma(z)}{dz} &= \sigma(z) (1 - \sigma(z)) \\
\frac{dL}{dz} &= \frac{dL}{d\sigma(z)} \cdot \frac{d\sigma(z)}{dz} \\
&= -\left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \cdot \sigma(z) (1 - \sigma(z)) \\
&= -\left( y (1 - \sigma(z)) - (1 - y) \sigma(z) \right) \\
&= \sigma(z) - y
\end{align*}
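
One way to sanity-check the derivation is to compare $\sigma(z) - y$ against a central finite difference of the loss. This is a sketch under arbitrary choices of `z`, `y`, and step size `eps`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Binary cross-entropy as a function of the pre-sigmoid input z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Arbitrary test point and step size (assumptions for this check)
z, y, eps = 0.7, 1.0, 1e-6

# Central difference approximation of dL/dz
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)

# Closed form from the derivation
analytic = sigmoid(z) - y

print(numeric, analytic)
```

The two values should agree to several decimal places; a large gap would point to a sign or algebra error in the derivation.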


**Notation:**
- $z$: Input to the sigmoid function.
- $y$: True label (target).
- $\sigma(z)$: Output of the sigmoid function, representing the predicted probability.
- $L$: Loss function (binary cross-entropy).
- $\frac{dL}{dz}$: Gradient of the loss with respect to $z$.


The formula for $z$ is given by:

$$
z = \mathbf{w}^\top \mathbf{x} + b
$$

**Notation:**
- $\mathbf{w}$: Weight vector.
- $\mathbf{x}$: Input feature vector.
- $b$: Bias term.
- $z$: Linear combination of inputs and weights plus bias.

**Explanation:**
- $\mathbf{w}^\top \mathbf{x}$ represents the dot product of the weight vector and the input feature vector.
- $b$ is the bias, which shifts $z$ independently of the input features.
- Together, $z$ serves as the input to the sigmoid function $\sigma(z)$, which produces the predicted probability.
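
A small numeric example may help here. The weights, features, and bias below are hypothetical values picked so the arithmetic is easy to follow.

```python
import numpy as np

w = np.array([0.5, -1.0])  # hypothetical weight vector
x = np.array([2.0, 1.0])   # hypothetical input features
b = 0.25                   # hypothetical bias

# Linear combination: 0.5*2.0 + (-1.0)*1.0 + 0.25 = 0.25
z = w @ x + b

# Pass z through the sigmoid to get the predicted probability
p = 1.0 / (1.0 + np.exp(-z))

print(z, p)
```

Because $z$ is slightly positive, the predicted probability lands slightly above 0.5.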


The gradients of the loss function $L$ with respect to the weight vector $\mathbf{w}$ and the bias $b$ are:

$$
\frac{\partial L}{\partial \mathbf{w}} = (\sigma(z) - y) \mathbf{x}
$$

$$
\frac{\partial L}{\partial b} = \sigma(z) - y
$$

**Notation:**
- $\mathbf{w}$: Weight vector.
- $b$: Bias term.
- $\mathbf{x}$: Input feature vector.
- $y$: True label.
- $\sigma(z)$: Output of the sigmoid function.
- $L$: Loss function (binary cross-entropy).

**Derivation:**

Given the loss function:
$$
L = -\left( y \log(\sigma(z)) + (1 - y) \log(1 - \sigma(z)) \right)
$$

And the linear combination:
$$
z = \mathbf{w}^\top \mathbf{x} + b
$$

We have already established that:
$$
\frac{dL}{dz} = \sigma(z) - y
$$

To find the gradients with respect to $\mathbf{w}$ and $b$, we apply the chain rule.

1. **Gradient with respect to the weight vector $\mathbf{w}$:**
   
   $$
   \frac{\partial L}{\partial \mathbf{w}} = \frac{dL}{dz} \cdot \frac{\partial z}{\partial \mathbf{w}} = (\sigma(z) - y) \mathbf{x}
   $$

   - **Explanation:** The gradient with respect to $\mathbf{w}$ is the product of the error term $(\sigma(z) - y)$ and the input feature vector $\mathbf{x}$.

2. **Gradient with respect to the bias $b$:**
   
   $$
   \frac{\partial L}{\partial b} = \frac{dL}{dz} \cdot \frac{\partial z}{\partial b} = \sigma(z) - y
   $$

   - **Explanation:** The gradient with respect to $b$ is simply the error term $(\sigma(z) - y)$ since the derivative of $z$ with respect to $b$ is 1.

**Summary:**
- The gradient with respect to the weights $\mathbf{w}$ is the input feature vector scaled by the prediction error.
- The gradient with respect to the bias $b$ is equal to the prediction error.

These gradients are used in optimization algorithms like Gradient Descent to update the model parameters $\mathbf{w}$ and $b$ in order to minimize the loss function $L$.
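
The sketch below checks both gradients numerically for a single sample and then applies one gradient descent step. The parameter values, the learning rate `lr`, and the step size `eps` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy for one sample as a function of the parameters."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical parameters and a single training sample
w = np.array([0.5, -1.0])
b = 0.25
x = np.array([2.0, 1.0])
y = 1.0

# Analytic gradients from the derivation above
err = sigmoid(w @ x + b) - y
grad_w = err * x
grad_b = err

# Central finite differences, perturbing one parameter at a time
eps = 1e-6
num_grad_w = np.array([
    (loss(w + eps * e, b, x, y) - loss(w - eps * e, b, x, y)) / (2 * eps)
    for e in np.eye(len(w))
])
num_grad_b = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

# One gradient descent step with an assumed learning rate
lr = 0.1
w_new = w - lr * grad_w
b_new = b - lr * grad_b

print(grad_w, num_grad_w)
print(grad_b, num_grad_b)
print(loss(w, b, x, y), loss(w_new, b_new, x, y))
```

The analytic and numeric gradients should match closely, and the loss after the update should be lower than before it.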


In [2]:
import numpy as np

In [None]:
def sigmoid(z):
    y_pred = 1 / (1 + np.exp(-z)) # applied element-wise
    return y_pred

class BCELoss: 
    """
    Binary cross-entropy loss.
    The likelihood of a sample is p if y = 1 and 1 - p if y = 0:

    likelihood = p**y * (1-p)**(1-y)
    loss = -log(likelihood) = -(y*log(p) + (1-y)*log(1-p))
    """

    def __call__(self, y_true, y_pred):
        loss = -(y_true * np.log(y_pred) + (1-y_true) * np.log(1-y_pred)) # don't forget the negative sign
        self.y_true, self.y_pred = y_true, y_pred
        return np.mean(loss) # average the per-sample losses into a single scalar
    
    def get_loss_grad(self):
        """
        Gradient of the mean loss with respect to z (the input to the sigmoid):
        loss gradient = (y_pred - y_true) / n
        The weight gradient X_transpose @ loss_grad is formed in LogisticModel.backward.
        # more detail here: https://classic.d2l.ai/chapter_linear-networks/softmax-regression.html#softmax-and-derivatives
        # note: softmax and linear regression yield the same form of loss gradient, and this is not a coincidence
        """
        loss_grad = (self.y_pred - self.y_true) / self.y_true.shape[0]
        return loss_grad


class LogisticModel:

    def __init__(self, input_dim, output_dim):
        self.weight = np.random.randn(input_dim, output_dim)
        self.bias = np.zeros((1, output_dim))
        # note: self.bias = np.zeros([0]) was a typo here; it creates an empty array and results in an empty ([]) loss

    def __call__(self, X):
        """
        X: (n, p), w: (p, 1)
        y: (n, 1)
        """
        self.X = X
        z = X @ self.weight + self.bias
        y_pred = sigmoid(z)
        return y_pred

    def backward(self, loss_grad):
        """calculate gradient
        weight gradient = X_transpose @ loss_grad, where loss_grad = (y_pred - y_true) / n
        bias gradient = sum of loss_grad over the samples
        """
        self.weight_grad = self.X.T @ loss_grad 
        self.bias_grad = loss_grad.sum(axis=0)

    def step(self, lr=0.01):
        """gradient descent"""
        self.weight = self.weight - self.weight_grad * lr 
        self.bias = self.bias - self.bias_grad * lr 
    

def one_epoch_train(X, y_true, epoch, model, loss_func):
    y_pred = model(X)
    loss = loss_func(y_true, y_pred)
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}, Weight: {model.weight[0][0]:.4f}, Bias: {model.bias[0][0]:.4f}")
    loss_grad = loss_func.get_loss_grad()
    model.backward(loss_grad)
    model.step()



def generate_data():
    n, d = 400, 1
    w_true, b_true = np.array([3]).reshape(-1, 1), np.array([0])
    X = np.random.uniform(-1, 1, (n, d)) # (n, p)
    y_prob = sigmoid(X @ w_true + b_true)
    y_true = (y_prob >= 0.5).astype(int)
    return X, y_true
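
The classes above operate on a batch of $n$ samples at once, while the earlier derivation covers a single sample. Assuming the batch loss is the mean of the per-sample losses (which is what `np.mean(loss)` computes), the single-sample gradients average into the following, where $X$ is the $(n, p)$ feature matrix and $\mathbf{y}$, $\sigma(\mathbf{z})$ are $(n, 1)$ column vectors (notation introduced here for illustration):

$$
\frac{\partial \bar{L}}{\partial \mathbf{w}} = \frac{1}{n} X^\top \left( \sigma(\mathbf{z}) - \mathbf{y} \right), \qquad
\frac{\partial \bar{L}}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} \left( \sigma(z_i) - y_i \right)
$$

This appears to be why `get_loss_grad` divides by `y_true.shape[0]`, and why `backward` multiplies by `self.X.T` for the weights and sums over `axis=0` for the bias.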

In [177]:
np.random.seed(10)
X, y_true = generate_data()
print(X.shape, y_true.shape)
model = LogisticModel(1, 1)
loss_func = BCELoss()

for i in range(500):
    one_epoch_train(X, y_true, i, model, loss_func)

y_pred = model(X)
acc = ((y_pred >= 0.5) == y_true).sum() / y_true.shape[0]
print(f"final acc: {acc}")

(400, 1) (400, 1)
Epoch 0, Loss: 0.5292, Weight: 0.7566, Bias: 0.0000
Epoch 10, Loss: 0.5258, Weight: 0.7752, Bias: 0.0009
Epoch 20, Loss: 0.5223, Weight: 0.7937, Bias: 0.0017
Epoch 30, Loss: 0.5190, Weight: 0.8121, Bias: 0.0026
Epoch 40, Loss: 0.5156, Weight: 0.8303, Bias: 0.0035
Epoch 50, Loss: 0.5124, Weight: 0.8484, Bias: 0.0043
Epoch 60, Loss: 0.5091, Weight: 0.8663, Bias: 0.0051
Epoch 70, Loss: 0.5060, Weight: 0.8841, Bias: 0.0060
Epoch 80, Loss: 0.5028, Weight: 0.9018, Bias: 0.0068
Epoch 90, Loss: 0.4997, Weight: 0.9194, Bias: 0.0076
Epoch 100, Loss: 0.4967, Weight: 0.9368, Bias: 0.0084
Epoch 110, Loss: 0.4937, Weight: 0.9541, Bias: 0.0093
Epoch 120, Loss: 0.4907, Weight: 0.9713, Bias: 0.0101
Epoch 130, Loss: 0.4878, Weight: 0.9883, Bias: 0.0109
Epoch 140, Loss: 0.4850, Weight: 1.0053, Bias: 0.0116
Epoch 150, Loss: 0.4821, Weight: 1.0221, Bias: 0.0124
Epoch 160, Loss: 0.4793, Weight: 1.0388, Bias: 0.0132
Epoch 170, Loss: 0.4766, Weight: 1.0554, Bias: 0.0140
Epoch 180, Loss: 0.47