
# Logistic Regression

### 1. Overview

- **Type:** Supervised machine learning algorithm used for **classification** problems.
- Unlike linear regression, which predicts continuous values, logistic regression predicts the **probability** that an example belongs to a particular class (usually binary: 0 or 1).
- Generalizes to multiclass problems via extensions such as multinomial (softmax) regression or one-vs-rest.


### 2. Purpose and Use Cases

- Predicts categories such as:
    - Yes/No (binary classification)
    - Success/Failure
    - Spam/Not Spam
- Widely used in medical diagnosis, credit risk assessment, marketing, etc.


### 3. Core Mathematical Concepts

#### Logistic Function (Sigmoid)

- Transforms any real-valued number into an output strictly between 0 and 1.
- Formula:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Where:

$$
z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n
$$

- $\beta_0$ is the intercept (bias), $\beta_i$ are coefficients.
- Output $\sigma(z)$ represents $P(y=1|x)$, the probability that input $x$ belongs to class 1.
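
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `sigmoid` and `linear_score` are chosen here for the example and are not from any particular library. The branch on the sign of `z` is one common way to avoid overflow in `exp` for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) lies in (0, 1) and cannot overflow
    return ez / (1.0 + ez)

def linear_score(beta0, betas, x):
    """z = beta_0 + beta_1 * x_1 + ... + beta_n * x_n"""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))

# Illustrative values only: sigma(0) is exactly 0.5, and outputs stay in [0, 1].
print(sigmoid(0.0))
print(sigmoid(linear_score(0.5, [1.0, 2.0], [3.0, 4.0])))
```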


### 4. Model Equation

$$
P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \sum_{i=1}^n \beta_i x_i)}}
$$

- $P(y=0|x) = 1 - P(y=1|x)$
- To predict class, typically apply a threshold:
    - Predict 1 if $P(y=1|x) \geq 0.5$
    - Predict 0 otherwise
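
As a rough sketch of the prediction step, the snippet below computes $P(y=1|x)$ and applies the threshold. The coefficients are made-up values for illustration, not fitted parameters, and `predict_proba` / `predict_class` are hypothetical helper names. The 0.5 threshold is exposed as a parameter because other cut-offs may suit problems where false positives and false negatives have different costs.

```python
import math

def predict_proba(beta0, betas, x):
    """P(y=1|x) under the logistic model."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # overflow-safe branch for negative z
    return ez / (1.0 + ez)

def predict_class(beta0, betas, x, threshold=0.5):
    """Predict 1 if P(y=1|x) >= threshold, else 0."""
    return 1 if predict_proba(beta0, betas, x) >= threshold else 0

# Hypothetical coefficients, chosen only to illustrate the rule.
beta0, betas = -1.0, [2.0, -0.5]

# z = -1 + 2*1 - 0.5*2 = 0, so the probability is 0.5 and the prediction is 1.
print(predict_proba(beta0, betas, [1.0, 2.0]), predict_class(beta0, betas, [1.0, 2.0]))

# z = -1, so the probability is below 0.5 and the prediction is 0.
print(predict_proba(beta0, betas, [0.0, 0.0]), predict_class(beta0, betas, [0.0, 0.0]))
```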


### 5. Likelihood and Loss Function

- Logistic regression parameters $\beta$ are found by **maximizing the likelihood** of observed data.
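- Likelihood function for binary outcomes over $m$ training examples, where $x_i$ and $y_i$ here denote the $i$-th example and its label, and $p(x_i) = P(y_i=1|x_i)$:

$$
L(\beta) = \prod_{i=1}^m P(y_i|x_i) = \prod_{i=1}^m [p(x_i)]^{y_i} [1 - p(x_i)]^{1-y_i}
$$

- More commonly, optimize the **log-likelihood**, which turns the product into a sum and is numerically more stable:

$$
\log L(\beta) = \sum_{i=1}^m \left[y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i))\right]
$$

- Maximizing the log-likelihood is equivalent to minimizing its negative, the **binary cross-entropy loss** (log loss), usually averaged over the examples.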
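
To make the link between log-likelihood and log loss concrete, here is a small sketch that computes the average binary cross-entropy for a list of labels and predicted probabilities. The function name `log_loss` and the clipping constant `eps` are choices made for this example; clipping is a common safeguard against `log(0)`, not part of the definition.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy: the negative mean log-likelihood."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Confident, correct predictions give a lower loss than uninformative ones.
print(log_loss([1, 0], [0.9, 0.1]))
print(log_loss([1, 0], [0.5, 0.5]))
```

Predicting 0.5 for every example gives a loss of $\log 2$, which is a useful baseline when reading log-loss values.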


### 6. Training Algorithm

- Typically trained using **maximum likelihood estimation**, which has no closed-form solution for logistic regression.
- Optimization is done iteratively, with **gradient descent** (batch, stochastic, or mini-batch) or with second-order methods such as Newton's method.
- Each iteration updates the parameters to reduce the log loss.
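
The following is a minimal sketch of batch gradient descent on the average log loss, assuming the standard gradient $\frac{1}{m}\sum_i (p_i - y_i)\,x_{ij}$ for coefficient $\beta_j$ (with $x_{i0} = 1$ for the intercept). The function name `fit_logistic`, the learning rate, the iteration count, and the toy data are all illustrative assumptions; practical libraries typically use more sophisticated solvers and add regularization.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    """Batch gradient descent on the mean log loss. Returns (beta0, betas)."""
    m, n = len(X), len(X[0])
    beta0, betas = 0.0, [0.0] * n
    for _ in range(n_iters):
        grad0, grads = 0.0, [0.0] * n
        for xi, yi in zip(X, y):
            p = sigmoid(beta0 + sum(b * v for b, v in zip(betas, xi)))
            err = p - yi                 # derivative of the loss w.r.t. z
            grad0 += err
            for j in range(n):
                grads[j] += err * xi[j]
        # Step against the averaged gradient.
        beta0 -= lr * grad0 / m
        betas = [b - lr * g / m for b, g in zip(betas, grads)]
    return beta0, betas

# Toy one-feature data (made up): larger x tends to go with y = 1,
# but the classes overlap, so the estimates stay finite.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]
beta0, betas = fit_logistic(X, y)
print(beta0, betas)
```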


### 7. Model Interpretation

- Each coefficient $\beta_i$ is the change in the **log-odds** of the outcome for a one-unit increase in feature $x_i$, holding the other features fixed.
- Exponentiating a coefficient yields the **odds ratio** for that feature.
- The intercept $\beta_0$ is the log-odds of the outcome when all features are 0.
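
A short numerical check may help with the odds-ratio reading. The sketch below uses made-up values for $\beta_0$ and $\beta_1$ (they are not estimates from any dataset) and confirms that raising the feature by one unit multiplies the odds $p/(1-p)$ by $e^{\beta_1}$, regardless of the starting value.

```python
import math

def prob(beta0, beta1, x):
    """P(y=1|x) for a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients for illustration only.
beta0, beta1 = -1.0, 0.7

odds_ratio = math.exp(beta1)  # about 2.01: each extra unit roughly doubles the odds
ratio_from_model = odds(prob(beta0, beta1, 3.0)) / odds(prob(beta0, beta1, 2.0))

print(odds_ratio, ratio_from_model)
```

Note that the odds double here, not the probability; the change in probability depends on where on the sigmoid curve the example sits.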


### 8. Assumptions

- The log-odds of the outcome is a linear combination of features.
- Observations are independent.
- No exact multicollinearity among features.
- Large sample sizes generally needed for stable estimates.


### 9. Evaluation Metrics

| Metric | Description | Notes |
| :-- | :-- | :-- |
| Accuracy | % of correct predictions | Simple, but can be misleading with imbalanced data |
| Precision | True Positives / (True Positives + False Positives) | Measures relevance of positive predictions |
| Recall (Sensitivity) | True Positives / (True Positives + False Negatives) | Measures ability to detect positives |
| F1-Score | Harmonic mean of precision & recall | Balanced metric for imbalanced datasets |
| Confusion Matrix | Table of TP, FP, TN, FN | Detailed error analysis |
| ROC Curve & AUC | Plots true positive rate vs false positive rate across all thresholds; AUC is the area under that curve | Threshold-independent measure of model discrimination ability |
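
The threshold-based metrics in the table all derive from the four confusion-matrix counts. The sketch below computes them from lists of true and predicted labels; the helper names and the example labels are invented for illustration, and returning 0.0 when a denominator is zero is one convention among several.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 3 TP, 1 FP, 3 TN, 1 FN.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```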

### 10. Advantages and Disadvantages

| Advantages | Disadvantages |
| :-- | :-- |
| Simple, interpretable, and fast to train | Assumes linear relationship in log-odds |
| Outputs class probabilities that are often reasonably well calibrated | Can struggle with complex, non-linear data |
| Works well when classes are approximately linearly separable | Sensitive to outliers and multicollinearity |
| Foundation for more advanced classification methods | Requires careful feature engineering |
