# Logistic Regression

Logistic Regression is a supervised learning algorithm used for binary classification tasks. It predicts the probability that a given input belongs to a particular class.

## Sigmoid for Classification
- Logistic regression uses the **sigmoid** function to convert linear predictions into probabilities:
  
\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]

- It maps any real value \(z\) to the range \((0, 1)\), suitable for probabilities.
- The decision boundary is typically set at a probability of 0.5, which corresponds to \(z = 0\).
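As a minimal sketch of the points above, the sigmoid and a thresholded prediction could be written as follows. The helper names `sigmoid` and `predict_label` and the `threshold` parameter are illustrative choices, not part of any library.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores to probabilities in (0, 1)."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(z, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return (sigmoid(z) >= threshold).astype(int)

scores = np.array([-1.0, 0.0, 3.0])
probabilities = sigmoid(scores)
labels = predict_label(scores)
```

For very large negative inputs `np.exp(-z)` can overflow and emit a warning, so a production implementation would likely use a numerically stable variant.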

## Initialization of Weights and Bias
- Weights \(W\) are initialized with small random numbers so that \(Z\) starts near zero, where the sigmoid is not saturated; the bias \(b\) starts at zero:

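```python
import numpy as np

n_features = 3  # example value; set this to the number of input features
W = np.random.randn(1, n_features) * 0.01  # small random weights, shape (1, n_features)
b = 0.0  # bias starts at zero
```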

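**Symmetry problem:** In networks with several units per layer, initializing all weights identically (e.g., zeros) makes those units receive identical updates, so they never learn different features. Random initialization breaks this symmetry. Logistic regression has a single output unit, so there is no symmetry to break and zero initialization also works; small random values are a common convention carried over from neural networks.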

## Forward Propagation
- Compute linear combination and activation:

\[
Z = W X + b \\
A = \sigma(Z)
\]

- Forward propagation calculates predictions \(A\) given current weights and bias.
- \(Z\) is the linear combination of inputs and weights, transformed by the sigmoid activation to yield predicted probabilities \(A\).
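The forward pass above can be sketched in a few lines of NumPy. This assumes the shapes implied by the initialization code: `W` is `(1, n_features)` and `X` holds one example per column, i.e. `(n_features, m)`. The function name `forward` is illustrative.

```python
import numpy as np

def forward(W, b, X):
    """Return predicted probabilities A with shape (1, m)."""
    Z = W @ X + b                  # linear combination, shape (1, m)
    A = 1.0 / (1.0 + np.exp(-Z))   # sigmoid activation
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))        # 3 features, 5 examples
W = np.zeros((1, 3))               # zero weights give Z = 0 everywhere
A = forward(W, 0.0, X)
```

With all-zero parameters every prediction equals 0.5, which is a quick sanity check on the shapes and the activation.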

## Back Propagation
- Computes gradients (partial derivatives) for updating weights and biases by propagating the loss backward:

\[
dZ = A - Y \\
dW = \frac{1}{m}\, dZ\, X^{T} \\
db = \frac{1}{m}\sum_{i=1}^{m} dZ^{(i)}
\]

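- Gradients measure how the loss changes with respect to each parameter.
- \(dZ\) follows from applying the chain rule to the binary cross-entropy loss through the sigmoid: the sigmoid's derivative \(A(1 - A)\) cancels the denominators in the loss derivative, leaving the simple difference between predictions and actual labels.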
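One way to gain confidence in these formulas is to compare the analytic gradient with a finite-difference estimate of the binary cross-entropy loss. The sketch below assumes `X` has shape `(n_features, m)` and `Y` has shape `(1, m)`; the function names and the random toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, b, X, Y):
    """Mean binary cross-entropy of the current parameters."""
    A = sigmoid(W @ X + b)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

def gradients(W, b, X, Y):
    """Analytic gradients from the formulas above."""
    m = X.shape[1]
    A = sigmoid(W @ X + b)
    dZ = A - Y                     # shape (1, m)
    dW = (dZ @ X.T) / m            # shape (1, n_features)
    db = float(np.sum(dZ)) / m
    return dW, db

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))
Y = (rng.random((1, 20)) > 0.5).astype(float)
W = rng.normal(size=(1, 3)) * 0.01
b = 0.0

dW, db = gradients(W, b, X, Y)

# Central finite difference for the first weight only.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric_dW0 = (loss(W_plus, b, X, Y) - loss(W_minus, b, X, Y)) / (2 * eps)
```

If the derivation is right, `numeric_dW0` and `dW[0, 0]` should agree to several decimal places.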


TODO: compute partial derivatives

## Binary Cross-Entropy Loss
- Measures how far the predicted probabilities \(\hat{y}\) (the activations \(A\)) are from the actual binary labels \(y\), averaged over the \(m\) training examples:

\[
L(\hat{y}, y) = - \frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})]
\]

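- Minimizing this loss pushes predicted probabilities toward the true labels; confident but wrong predictions are penalized heavily because the logarithm diverges as its argument approaches 0.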
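A small sketch of the loss, assuming NumPy arrays of probabilities and labels. The `eps` clipping is an added safeguard (not part of the formula) so that `log(0)` never occurs; the function name is illustrative.

```python
import numpy as np

def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy; y_hat is clipped away from 0 and 1."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

y = np.array([1.0, 0.0, 1.0, 0.0])
confident_right = binary_cross_entropy(np.array([0.9, 0.1, 0.9, 0.1]), y)
uninformed = binary_cross_entropy(np.full(4, 0.5), y)
confident_wrong = binary_cross_entropy(np.array([0.1, 0.9, 0.1, 0.9]), y)
```

The three values come out ordered `confident_right < uninformed < confident_wrong`, and the uninformed case equals \(\log 2\).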

## Regularization
- Reduces overfitting by penalizing overly complex models (large weights):

**L2 Regularization:**
\[
L_{\text{regularized}} = L + \frac{\lambda}{2m}\sum_{j} W_j^2
\]

- \(\lambda\) controls the strength of regularization: larger values shrink the weights more, and \(\lambda = 0\) recovers the unregularized loss.
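As a rough sketch, the L2 term and its contribution to the weight gradient could be computed as below. Differentiating \(\frac{\lambda}{2m}\sum W^2\) with respect to \(W\) gives \(\frac{\lambda}{m}W\), which is added to \(dW\); the bias is left unregularized, matching the formula above. The function names and the example values are illustrative.

```python
import numpy as np

def l2_penalty(W, lam, m):
    """Penalty term added to the loss: lam / (2m) * sum(W^2)."""
    return lam / (2 * m) * float(np.sum(W ** 2))

def l2_gradient(W, lam, m):
    """Extra term added to dW: derivative of the penalty w.r.t. W."""
    return (lam / m) * W

W = np.array([[0.5, -1.0, 2.0]])
penalty = l2_penalty(W, lam=0.1, m=10)
grad = l2_gradient(W, lam=0.1, m=10)
```

Because the extra gradient is proportional to \(W\), each update pulls the weights slightly toward zero.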

**Dropout:**
TODO:

## Gradient Descent
- Updates parameters iteratively to minimize loss:

\[
W = W - \alpha \cdot dW \\
b = b - \alpha \cdot db
\]

- Learning rate \(\alpha\) controls the step size of each update. Smaller values tend to give more stable but slower convergence; larger values speed up training but may cause the loss to oscillate or diverge.
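Putting the pieces together, a minimal training loop might look like the following. It is a sketch without regularization, assuming `X` of shape `(n_features, m)` and `Y` of shape `(1, m)`; the function name `train`, the default `alpha` and `n_iters`, and the toy dataset are all illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, n_iters=1000, seed=0):
    """Batch gradient descent for logistic regression."""
    n_features, m = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(1, n_features)) * 0.01   # small random weights
    b = 0.0
    for _ in range(n_iters):
        A = sigmoid(W @ X + b)        # forward propagation
        dZ = A - Y                    # back propagation
        dW = (dZ @ X.T) / m
        db = float(np.sum(dZ)) / m
        W = W - alpha * dW            # gradient descent update
        b = b - alpha * db
    return W, b

# Toy data: the label is 1 when the single feature is positive.
X = np.array([[-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
Y = np.array([[0.0, 0.0, 0.0, 1.0, 1.0, 1.0]])
W, b = train(X, Y)
predictions = (sigmoid(W @ X + b) >= 0.5).astype(int)
```

On this linearly separable toy set the learned weight is positive and the thresholded predictions match the labels.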
