# Logistic Regression

Despite its name, Logistic Regression is a **supervised classification** algorithm that predicts the probability that an observation belongs to a class. It is most commonly used for **binary classification** ($0$ or $1$, Yes or No).



### 1. The Core Idea
It combines a standard linear model with a "squashing" function to ensure the output is always a probability between $0$ and $1$.

**Step 1: The Linear Part**
Calculate the weighted sum of inputs (just like Linear Regression):
$$z = w \cdot x + b$$

**Step 2: The Sigmoid Function (Activation)**
Apply the **Sigmoid** function to $z$ to map it to the open interval $(0, 1)$:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Final Hypothesis:**
$$P(y=1|x) = \hat{y} = \frac{1}{1 + e^{-(w \cdot x + b)}}$$
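
The two steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the weights `w`, bias `b`, and input `x` are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Return P(y=1 | x) for a single feature vector x."""
    z = np.dot(w, x) + b   # Step 1: the linear part
    return sigmoid(z)      # Step 2: squash the score into a probability

# Hypothetical parameters and input, for illustration only
w = np.array([0.5, -0.25])
b = 0.1
x = np.array([2.0, 4.0])

p = predict_proba(x, w, b)  # here z = 1.0 - 1.0 + 0.1 = 0.1
print(p)
```

A score of $z = 0$ maps to exactly $0.5$; large positive scores approach $1$ and large negative scores approach $0$ without ever reaching them.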

---

### 2. Decision Boundary
The model outputs a probability (e.g., $0.85$). To make a classification decision, we apply a threshold (usually $0.5$).

* If $P(y=1|x) \ge 0.5 \rightarrow$ Predict **Class 1**.
* If $P(y=1|x) < 0.5 \rightarrow$ Predict **Class 0**.
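
As a small sketch of the thresholding rule, the helper below (a hypothetical `classify` function, not part of any library) turns probabilities into labels. The default threshold of $0.5$ matches the rule above; the stricter value of $0.9$ is an assumed example of a threshold that trades recall for precision.

```python
def classify(prob, threshold=0.5):
    """Return 1 if the predicted probability reaches the threshold, else 0."""
    return 1 if prob >= threshold else 0

probs = [0.85, 0.50, 0.49, 0.10]

labels = [classify(p) for p in probs]                  # [1, 1, 0, 0]
strict = [classify(p, threshold=0.9) for p in probs]   # [0, 0, 0, 0]

print(labels)
print(strict)
```

Since the Sigmoid equals $0.5$ exactly at $z = 0$, the default threshold is equivalent to checking the sign of the linear score.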

---

### 3. Loss Function: Binary Cross-Entropy (Log Loss)
We avoid Mean Squared Error (MSE) here: combined with the Sigmoid function, it makes the error surface "wavy" (non-convex) in the weights, so Gradient Descent is no longer guaranteed to reach the global minimum.

Instead, we use **Log Loss**, which penalizes confident wrong predictions heavily.

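$$J(w, b) = - \frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$$

Here $m$ is the number of training examples and $\hat{y}^{(i)}$ is the predicted probability for example $i$.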

* If actual $y=1$: We want $\hat{y} \approx 1$. If $\hat{y} \approx 0$, loss approaches $\infty$.
* If actual $y=0$: We want $\hat{y} \approx 0$. If $\hat{y} \approx 1$, loss approaches $\infty$.
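
The behaviour described above can be checked numerically. The sketch below implements the Log Loss formula directly; the clipping constant `eps` is an assumption added here to avoid `log(0)`, and the label and probability lists are made-up examples.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy averaged over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y = [1, 0, 1, 0]

good = log_loss(y, [0.9, 0.1, 0.8, 0.2])   # confident and correct
coin = log_loss(y, [0.5, 0.5, 0.5, 0.5])   # uninformative guess, equals ln(2)
bad = log_loss(y, [0.1, 0.9, 0.2, 0.8])    # confident and wrong

print(good, coin, bad)
```

The three values are ordered `good < coin < bad`, which is the "confident wrong predictions are penalized heavily" property in action.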

---

### 4. Why Not Linear Regression?
1.  **Unbounded Output:** Linear Regression can predict values like $1.5$ or $-0.2$, which don't make sense as probabilities.
2.  **Assumption Violation:** Linear Regression assumes normally distributed errors with constant variance; a binary target is Bernoulli distributed, so its errors are neither normal nor constant in variance.

---

### 5. Assumptions of Logistic Regression
1.  **Binary Outcome:** The target variable is binary (or converted to binary).
2.  **Independence:** Observations are independent of each other.
3.  **Linearity of Log-Odds:** The relationship between the independent variables and the **log-odds** of the dependent variable is linear.
    * $\ln(\frac{p}{1-p}) = wx + b$
4.  **No Multicollinearity:** Little to no correlation between independent variables.
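
To make the log-odds assumption concrete, the following sketch checks that the log-odds function undoes the Sigmoid, and that a one-unit increase in a feature multiplies the odds by $e^{w}$. The values of `w`, `b`, and `x` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds in favour of the positive class."""
    return p / (1.0 - p)

def log_odds(p):
    """The logit: natural log of the odds."""
    return math.log(odds(p))

# The logit is the inverse of the sigmoid, so we recover the linear score
z = 1.5
recovered = log_odds(sigmoid(z))

# Hypothetical single-feature model
w, b, x = 0.8, -1.0, 2.0
odds_ratio = odds(sigmoid(w * (x + 1) + b)) / odds(sigmoid(w * x + b))

print(recovered, odds_ratio, math.exp(w))
```

This is why coefficients are usually read as log-odds changes: `odds_ratio` matches `math.exp(w)` regardless of the starting value of `x`.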

---

### 6. Evaluation Metrics
Since we are doing classification, we use:
* **Accuracy**
* **Precision & Recall**
* **F1-Score** (Harmonic mean of Precision and Recall)
* **ROC Curve & AUC** (Area Under the ROC Curve)
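
The first three metrics can be computed from the confusion-matrix counts alone. The helper below is a plain-Python sketch (the function name and the label lists are made up for illustration); ROC and AUC are left out because they need predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example there are 3 true positives, 2 false positives, and 1 false negative, so precision is $3/5$ and recall is $3/4$.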

---

### 7. FAQ

**Q: Is Logistic Regression a linear model?**
**A:** Yes. Even though the output curve is non-linear (S-shape), the decision boundary it creates is a straight line (or plane) in the feature space. The relationship between features and the *log-odds* is linear.

**Q: Can it handle multi-class problems?**
**A:** Yes.
* **One-vs-Rest (OvR):** Trains one binary classifier for each class against all others.
* **Multinomial:** Uses the **Softmax** function instead of Sigmoid to output probabilities for $K$ classes that sum to $1$.
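
A minimal Softmax sketch, assuming one raw score per class (the scores below are arbitrary). Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Turn a vector of K raw scores into K probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)      # stability: largest exponent becomes 0
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probs = softmax(scores)

# With two classes and scores (0, z), softmax reduces to the sigmoid of z
two_class = softmax([0.0, 1.3])[1]
sigmoid_value = 1.0 / (1.0 + np.exp(-1.3))

print(probs, two_class, sigmoid_value)
```

The last two values agree, which shows that binary Logistic Regression is the $K = 2$ special case of the multinomial model.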

**Q: Why Log Loss instead of MSE?**
**A:** Log Loss is **convex** for Logistic Regression, so Gradient Descent (with a suitable learning rate) converges toward the global minimum. MSE combined with the Sigmoid is non-convex and can have local minima and flat regions.
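
The from-scratch implementation that follows relies on the Log Loss gradients $\frac{\partial J}{\partial w} = \frac{1}{n}\sum_i x_i(\hat{y}_i - y_i)$ and $\frac{\partial J}{\partial b} = \frac{1}{n}\sum_i (\hat{y}_i - y_i)$. As a hedged sanity check, the sketch below compares these formulas against central finite differences for a single feature; the parameter values `w`, `b` and the step `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, b, X, y):
    p = sigmoid(w * X + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradients(w, b, X, y):
    """Closed-form gradients of the log loss for one feature."""
    error = sigmoid(w * X + b) - y
    return np.mean(X * error), np.mean(error)

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
w, b = 0.3, -0.5   # arbitrary point at which to check the gradient
h = 1e-6           # finite-difference step

dw, db = analytic_gradients(w, b, X, y)
dw_num = (log_loss(w + h, b, X, y) - log_loss(w - h, b, X, y)) / (2 * h)
db_num = (log_loss(w, b + h, X, y) - log_loss(w, b - h, X, y)) / (2 * h)

print(dw, dw_num)
print(db, db_num)
```

Agreement between the analytic and numerical values is a quick way to catch sign or scaling mistakes before running Gradient Descent.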

In [1]:
import numpy as np

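class LogisticRegression:
    """Minimal logistic regression for a single feature (1-D X), trained by gradient descent."""

    def __init__(self, lr=0.01, epochs=1000):
        self.lr = lr          # learning rate (step size)
        self.epochs = epochs  # number of full passes over the data
        self.w = 0            # scalar weight, so X must be 1-D
        self.b = 0            # bias (intercept)

    def sigmoid(self, z):
        # Squash the linear score into (0, 1)
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n = len(X)  # number of training examples

        for _ in range(self.epochs):
            # Forward pass: linear score, then probability
            z = self.w * X + self.b
            y_pred = self.sigmoid(z)

            # Gradients of the Log Loss with respect to w and b
            dw = (1/n) * np.sum(X * (y_pred - y))
            db = (1/n) * np.sum(y_pred - y)

            # Gradient descent update
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        z = self.w * X + self.b
        y_pred = self.sigmoid(z)
        # Threshold the probabilities at 0.5 to get class labels
        return [1 if i >= 0.5 else 0 for i in y_pred]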


In [2]:
X = np.array([1, 2, 3, 4, 5])
y = np.array([0, 0, 0, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict(np.array([3, 5])))


[1, 1]


In [3]:
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[3], [5]]))
print(model.predict_proba([[3], [5]]))


[0 1]
[[0.64726666 0.35273334]
 [0.18436618 0.81563382]]


In [4]:
print(model.coef_)
print(model.intercept_)


[[1.0470438]]
[-3.74817743]


# Linear Discriminant Analysis (LDA)

LDA is a **supervised** dimensionality reduction and classification technique. Unlike PCA (which looks for variance), LDA looks for "separability."

**Goal:** Find a linear combination of features (a new axis) that best separates two or more classes.



### 1. The Core Idea: Fisher's Criterion
LDA tries to achieve two things simultaneously:
1.  **Maximize Between-Class Variance ($S_B$):** Push the centers (means) of different classes as far apart as possible.
2.  **Minimize Within-Class Variance ($S_W$):** Keep the data points of the same class clustered tightly together.

**The Objective Function ($J$):**
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
* We want to find the projection vector $w$ that maximizes this ratio.
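
To see the criterion at work, the sketch below evaluates $J(w)$ for two candidate directions on a made-up two-class dataset in which the classes differ along the first axis and are spread out along the second. The function name `fisher_ratio` and the data are illustrative assumptions.

```python
import numpy as np

def fisher_ratio(w, X, y):
    """Evaluate J(w) = (w^T S_B w) / (w^T S_W w) for a candidate direction w."""
    d = X.shape[1]
    mu = X.mean(axis=0)                 # overall mean
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)        # scatter inside class c
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)          # scatter of the class mean
    return float(w @ S_B @ w) / float(w @ S_W @ w)

# Hypothetical toy data
X = np.array([[1.0, 0.0], [1.0, 4.0], [2.0, 2.0],
              [5.0, 0.0], [5.0, 4.0], [6.0, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

j_first_axis = fisher_ratio(np.array([1.0, 0.0]), X, y)    # 18.0
j_second_axis = fisher_ratio(np.array([0.0, 1.0]), X, y)   # 0.0

print(j_first_axis, j_second_axis)
```

Projecting onto the first axis separates the class means while keeping each class tight, so it scores far higher than the second axis, where the class means coincide.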

---

### 2. Comparison: LDA vs. PCA

| Feature | **LDA** | **PCA** |
| :--- | :--- | :--- |
| **Type** | **Supervised** (Uses class labels $y$) | **Unsupervised** (Ignores labels, looks at $X$ only) |
| **Goal** | Maximize **Class Separability** | Maximize **Data Variance** |
| **Focus** | "Which direction separates Red vs. Blue?" | "Which direction has the most spread?" |
| **Output** | Axes for Classification | Axes for Reconstruction/Compression |
| **Use Case** | Feature Extraction & Classification | Dimensionality Reduction & Visualization |

---

### 3. Steps of LDA (The Algorithm)
1.  **Compute Mean Vectors:** Calculate the mean vector $\mu_k$ for each class.
2.  **Compute Scatter Matrices:**
    * **Within-Class Scatter ($S_W$):** How spread out is the data inside each class?
    * **Between-Class Scatter ($S_B$):** How far apart are the class means?
3.  **Solve Eigenvalue Problem:** We compute the eigenvectors of the matrix:
    $$A = S_W^{-1} S_B$$
4.  **Select Top Eigenvectors:** Choose the top $k$ eigenvectors (discriminants) to form the new subspace.
5.  **Project:** Transform the original data onto these new axes.
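
The five steps can be sketched in NumPy as follows. This is a bare-bones illustration under the assumption that $S_W$ is invertible; it is not how scikit-learn implements LDA internally, and the function name `lda_fit_transform` is made up here. The data reuses the small sample from this notebook.

```python
import numpy as np

def lda_fit_transform(X, y, n_components):
    """Project X onto the top discriminant directions (sketch, assumes invertible S_W)."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                    # Step 1: class mean
        S_W += (Xc - mu_c).T @ (Xc - mu_c)        # Step 2a: within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)          # Step 2b: between-class scatter
    # Step 3: eigen-decomposition of S_W^{-1} S_B (solve avoids an explicit inverse)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    eigvals, eigvecs = eigvals.real, eigvecs.real
    # Step 4: keep the eigenvectors with the largest eigenvalues
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:n_components]]
    # Step 5: project the data
    return X @ W, W

X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1])

Z, W = lda_fit_transform(X, y, n_components=1)
print(Z.ravel())
```

The sign and scale of an eigenvector are arbitrary, so the projected values may differ from a library's output by a constant factor while describing the same axis.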

---

### 4. LDA as a Classifier
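Once projected, LDA can classify a new data point by measuring its distance to the centroid (mean) of each class in the new space and assigning it to the closest one. This nearest-centroid rule holds when the class priors are equal; with unequal priors, a $\ln \pi_k$ term shifts the boundary toward the rarer class.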

**Assumptions:**
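1.  Features are **Normally Distributed** (Gaussian) within each class.
2.  **Homoscedasticity:** All classes share the same Covariance Matrix (spread shape).
3.  **Linear Boundary:** Under the two assumptions above the optimal boundary is linear, so LDA works best when the classes can be separated reasonably well by a straight line/plane.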

---

### 5. Pros & Cons

| Advantages | Disadvantages |
| :--- | :--- |
| Simple, fast, and easy to interpret. | Degrades when features are far from normally distributed. |
| Reduces dimensionality while preserving class info. | Suboptimal when classes have different covariance structures (spreads). |
| Works very well for small datasets. | Poor performance on non-linear data (curved boundaries). |

---

### 6. FAQ

**Q: Difference between LDA and Logistic Regression?**
**A:**
* **LDA** assumes data is Gaussian and uses the entire distribution to find the boundary. It is a "Generative" model.
* **Logistic Regression** makes fewer assumptions (does not assume normality) and focuses only on the boundary. It is a "Discriminative" model.

**Q: Can LDA be used for regression?**
**A:** No. It deals with class labels. For regression, you would use standard Linear Regression or PLS (Partial Least Squares).

**Q: When should I choose LDA over PCA?**
**A:** Use **LDA** when you have labels and your specific goal is to classify the data. Use **PCA** when you don't have labels (unsupervised) or just want to compress the data for storage/visualization.

> **One-Liner:**
> * **PCA:** "Show me the spread." (Max Variance)
> * **LDA:** "Show me the difference." (Max Separation)

# LDA: Deep Dive & Geometric Intuition

### 1. Geometric Intuition
**Goal:** Find a projection that maximizes class separability.

* **Think:** "Squeeze" the points of the same class together, and "Push" the centers of different classes apart.
* **The Result:** For $K$ classes, LDA finds at most $(K-1)$ discriminant axes (directions).



**Objective Function:**
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
* **$w$**: The projection vector (what we are solving for).
* **$S_B$**: Between-class scatter matrix (Measure of separation).
* **$S_W$**: Within-class scatter matrix (Measure of compactness).

---

### 2. When to Use What? (The Master Table)

| Condition | Use **LDA**? | Better Alternative |
| :--- | :--- | :--- |
| **Normal Features** | ✅ **Works Well** | -- |
| **Non-Normal** | ❌ Poor Performance | **Logistic Regression** |
| **Equal Variance** | ✅ **Optimal** | -- |
| **Unequal Variance** | ⚠️ Suboptimal | **QDA** (Quadratic DA) |
| **Linearly Separable** | ✅ **Excellent** | -- |
| **Non-Linear** | ❌ Fails | **SVM** (RBF Kernel) |
| **Many Features** | ❌ Singular $S_W$ | **Regularized LDA** |
| **Small $n$ relative to $p$** | ⚠️ Overfits | **Naive Bayes** |

---

### 3. Top Interview Questions (Q&A)

#### Q1: LDA vs. Logistic Regression - When to choose which?
Both are linear classifiers, but they make different assumptions.
* **Choose LDA when:**
    1.  Features are approximately **Normally Distributed**.
    2.  Classes have similar **Covariance** (spread).
    3.  **Small sample size** (LDA is more data-efficient).
    4.  You need probabilistic outputs with Gaussian assumptions.
* **Choose Logistic Regression when:**
    1.  Features are **not normal** (binary, counts, skewed).
    2.  You want interpretable coefficients (log-odds).
    3.  You need to handle many features (L1/L2 regularization handles this better).
* **Key Insight:** LDA models $P(X|Y)$ (Generative), while LR models $P(Y|X)$ (Discriminative).

#### Q2: Can LDA handle more than 2 classes? How?
**Yes.** LDA handles multiple classes naturally; it does not need One-vs-Rest.
* **Solution:** It finds at most $(K-1)$ discriminant axes.
* **Classification:** It projects data to this lower-dimensional space and uses Mahalanobis distance to find the closest class mean.
* **Example:** If you have 10 classes, LDA finds at most 9 discriminant components.

#### Q3: What happens when $S_W$ is singular? How to fix?
$S_W$ (Within-class scatter) becomes non-invertible (singular) when $N_{samples} < N_{features}$ or when features are perfectly correlated.
* **Fix 1 (Regularization):** Add $\lambda I$ to $S_W$ (Shrinkage LDA).
* **Fix 2 (PCA):** Run PCA first to reduce dimensions, then run LDA.
* **Fix 3 (Feature Selection):** Remove correlated features.
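
Fix 1 can be illustrated with a few lines of NumPy. The sketch below builds a scatter matrix from fewer samples than features, confirms that it is rank-deficient, and then adds $\lambda I$. The random data, the value `lam = 0.1`, and the rank tolerance are all assumptions for illustration; in practice $\lambda$ is usually tuned, for example by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical high-dimensional case: 5 samples, 8 features (n < p)
X = rng.normal(size=(5, 8))
Xc = X - X.mean(axis=0)          # centre the data
S_W = Xc.T @ Xc                  # 8x8 scatter matrix, rank at most n - 1 = 4

lam = 0.1                        # shrinkage strength (assumed value)
S_reg = S_W + lam * np.eye(8)    # add lambda to every diagonal entry

rank_before = np.linalg.matrix_rank(S_W, tol=1e-8)
rank_after = np.linalg.matrix_rank(S_reg, tol=1e-8)
smallest_eig = np.linalg.eigvalsh(S_reg).min()

print(rank_before, rank_after, smallest_eig)
```

Adding $\lambda I$ raises every eigenvalue by $\lambda$, so the smallest eigenvalue is at least $\lambda$ and the matrix becomes safely invertible.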

#### Q4: How does LDA differ from ANOVA?
* **ANOVA:** Univariate. It tests if group means are different for **one** feature at a time.
* **LDA:** Multivariate. It finds a linear combination of **all** features that maximizes separation.
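* **Think:** With a single feature, Fisher's ratio reduces to the between/within variance ratio behind ANOVA's F-statistic. LDA extends this idea to many features (as in MANOVA) and adds Dimensionality Reduction.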

#### Q5: Can LDA be used for feature selection?
**Yes.** The magnitude of the discriminant coefficients indicates feature importance (assuming data is scaled). You can also use Stepwise LDA to add/remove features based on their separation power.

---

### 4. Mathematical Summary

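1.  **Within-class scatter:**
    $$S_W = \sum_{c=1}^{K} \sum_{i:\, y_i = c} (x_i - \mu_c)(x_i - \mu_c)^T$$
2.  **Between-class scatter:**
    $$S_B = \sum_{c=1}^{K} n_c (\mu_c - \mu)(\mu_c - \mu)^T$$
    where $\mu_c$ and $n_c$ are the mean and sample count of class $c$, and $\mu$ is the overall mean.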
3.  **Optimization:**
    $$\max_w \frac{w^T S_B w}{w^T S_W w}$$
4.  **Solution:**
    Eigenvectors of $S_W^{-1} S_B$.
5.  **Components:**
    $\min(K-1, p)$ where $K=$ classes, $p=$ features.
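
The $\min(K-1, p)$ limit comes from the rank of $S_B$: the weighted deviations of the class means from the overall mean sum to zero, so they span at most $K-1$ directions. The sketch below checks this on synthetic data; the class means, sample sizes, and rank tolerance are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
K, p, n_per_class = 3, 4, 20     # 3 classes, 4 features

# Synthetic data: each class is a Gaussian blob around its own random mean
means = rng.normal(scale=3.0, size=(K, p))
X = np.vstack([rng.normal(loc=means[c], size=(n_per_class, p)) for c in range(K)])
y = np.repeat(np.arange(K), n_per_class)

mu = X.mean(axis=0)
S_B = np.zeros((p, p))
for c in range(K):
    Xc = X[y == c]
    diff = (Xc.mean(axis=0) - mu).reshape(-1, 1)
    S_B += len(Xc) * (diff @ diff.T)

rank_S_B = np.linalg.matrix_rank(S_B, tol=1e-8)
n_components = min(K - 1, p)

print(rank_S_B, n_components)
```

Only `rank_S_B` eigenvalues of $S_W^{-1} S_B$ are non-zero, so any further discriminant axes would carry no class information.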

---

### 5. Final Quick Comparison

* **LDA vs PCA?** $\rightarrow$ LDA: Supervised (Max Class Separation); PCA: Unsupervised (Max Variance).
* **LDA vs QDA?** $\rightarrow$ LDA: Linear boundary (Equal Covariance); QDA: Quadratic boundary (Different Covariances).
* **LDA vs Logistic Regression?** $\rightarrow$ LDA: Generative (Assumes Normality); LR: Discriminative (Robust to non-normality).
* **Assumption Checklist:** Normality, Equal Covariance, (approximately) Linear Boundary.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np

# Sample data
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1])

# Create LDA model
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Predict class
print(lda.predict([[3.0, 4.0]]))

# Transform data to lower dimension
X_new = lda.transform(X)
print(X_new)


| Feature               | LDA                              | QDA                                 |
| --------------------- | -------------------------------- | ----------------------------------- |
| Covariance assumption | Same for all classes             | Different for each class            |
| Decision boundary     | Linear                           | Quadratic                           |
| Flexibility           | Less                             | More                                |
| Number of parameters  | Fewer → less risk of overfitting | More → can overfit if small dataset |
| When to use           | Classes have similar variance    | Classes have different variance     |


# Quadratic Discriminant Analysis (QDA)

QDA is a supervised classification algorithm similar to LDA, but it is **more flexible**.

**The Main Difference:**
* **LDA** assumes all classes share the **same** Covariance Matrix ($\Sigma_{shared}$). It draws a **straight line**.
* **QDA** assumes each class has its **own** Covariance Matrix ($\Sigma_k$). It draws a **quadratic curve**.



### 1. Key Concepts

* **Per-Class Covariance:** It acknowledges that "Class A might be a tight circle, while Class B is a wide oval." It doesn't force them to have the same shape.
* **Quadratic Boundary:** Because the spreads are different, the optimal separating boundary becomes a curve (a parabola, hyperbola, or ellipse in 2D).

### 2. LDA vs. QDA Comparison

| Feature | **LDA** (Linear) | **QDA** (Quadratic) |
| :--- | :--- | :--- |
| **Covariance Matrix** | **Shared** ($\Sigma$) | **Specific per class** ($\Sigma_k$) |
| **Decision Boundary** | Linear (Straight line/plane) | Quadratic (Curved) |
| **Flexibility** | Low (High Bias) | High (Low Bias) |
| **Parameters** | Few (Less overfitting) | Many (Risk of overfitting) |
| **Best Used When** | Classes have similar spread. | Classes have **different spreads**. |

### 3. When to Use QDA?
1.  **Unequal Variances:** When one class is much more spread out than the other.
2.  **Non-Linear Separation:** When a straight line cannot separate the classes well.
3.  **Large Sample Size:** QDA needs more data than LDA because it has to estimate a full covariance matrix for *every* class.

---

### 4. Steps of QDA
1.  **Compute Means:** Calculate the mean vector $\mu_k$ for each class.
2.  **Compute Covariance Matrices:** Calculate a separate matrix $\Sigma_k$ for each class.
    * *Note: In LDA, we would average these into one. In QDA, we keep them separate.*
3.  **Compute Posterior:** Use Bayes' Theorem with a multivariate Gaussian formula.
4.  **Classify:** Assign the sample to the class with the highest posterior probability.

**The Math (Simplified):**
The discriminant function $\delta_k(x)$ includes a term like:
$$x^T \Sigma_k^{-1} x$$
Because $\Sigma_k$ is different for each class, this quadratic term in $x$ does not cancel when two classes are compared. It remains in the final equation, creating the curved boundary.

---

### 5. Pros & Cons

| Advantages | Disadvantages |
| :--- | :--- |
| **Better Fit:** Can model complex, non-linear boundaries. | **Overfitting:** Requires estimating many more parameters ($p(p+1)/2$ covariance parameters per class for $p$ features). |
| **Handles Variance:** Excellent when class spreads differ drastically. | **Data Hungry:** Needs more data samples to estimate those parameters accurately. |
| **Probabilistic:** Gives actual probabilities, not just labels. | **Assumptions:** Still assumes data is Normally Distributed. |

---

### 6. FAQ

**Q: Difference between LDA and QDA?**
**A:**
* **LDA:** Shared covariance $\rightarrow$ Linear boundary.
* **QDA:** Per-class covariance $\rightarrow$ Quadratic boundary.

**Q: When to prefer QDA over LDA?**
**A:** When you know (or suspect) that the variance/spread of the classes is very different. (e.g., The "Fraud" class is very spread out, but "Non-Fraud" is very tight).

**Q: Can QDA handle more than 2 classes?**
**A:** Yes, it naturally handles multi-class classification by calculating probabilities for each class and picking the winner.

**Q: Does QDA require feature scaling?**
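**A:** Not strictly. QDA estimates each class's mean and covariance, so without regularization its predictions are in principle unaffected by rescaling the features. Scaling can still improve numerical stability and is commonly recommended for Gaussian-based models.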

---

### Summary Rule of Thumb
* **LDA:** Simple, Robust, Low Variance (Use for small data / similar spreads).
* **QDA:** Flexible, Complex, Low Bias (Use for large data / different spreads).

# QDA: Deep Dive & Nuances

### 1. Geometric Intuition

* **The Core Idea:** QDA fits a **separate Gaussian distribution** to each class.
* **The Shape:** Unlike LDA, which forces all classes to have the same shape (shared covariance), QDA allows each class to have its own unique "shape" (mean + specific covariance matrix).
* **The Result:** The decision boundary emerges from comparing these overlapping multivariate Gaussians. It naturally forms curves (ellipses, parabolas, hyperbolas).



**The Math (Bayes' Theorem in Action):**
For class $k$, we assume the data follows a multivariate normal distribution:
$$P(X | Y = k) \sim N(\mu_k, \Sigma_k)$$

Using Bayes' Theorem:
$$P(Y = k | X) \propto P(X | Y = k) \cdot P(Y = k)$$

**The Discriminant Function ($\delta_k$):**
After taking logs and simplifying, we get the score for class $k$:
$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2} \ln|\Sigma_k| + \ln \pi_k$$

* **Note:** The term $x^T \Sigma_k^{-1} x$ is quadratic. In LDA, $\Sigma_k$ is the same for all classes, so the quadratic term cancels out. In QDA, it stays, creating the curve.
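
The discriminant function translates almost line by line into NumPy. The sketch below uses hypothetical parameters (a tight class around the origin and a wide class around $(3, 3)$, with equal priors) rather than values estimated from data, and the helper names are made up for illustration.

```python
import numpy as np

def qda_discriminant(x, mu, sigma, prior):
    """Score delta_k(x) for one class with mean mu, covariance sigma and prior."""
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(sigma)         # ln|Sigma|, computed stably
    return -0.5 * maha - 0.5 * logdet + np.log(prior)

# Hypothetical class parameters
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [0.5 * np.eye(2), 4.0 * np.eye(2)]      # tight class 0, wide class 1
priors = [0.5, 0.5]

def predict(x):
    """Assign x to the class with the highest discriminant score."""
    x = np.asarray(x, dtype=float)
    scores = [qda_discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))

print(predict([0.0, 0.0]), predict([3.0, 3.0]), predict([-5.0, -5.0]))
```

The point $(-5, -5)$ lies on the far side of class 0's mean, yet it is assigned to class 1 because the wide Gaussian dominates far from the origin in every direction. The boundary is therefore a closed curve around class 0, which a linear classifier cannot produce.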

---

### 2. Bias-Variance Tradeoff (The Parameter Explosion)

QDA is much more "data hungry" than LDA because it has to estimate many more numbers.

**Parameter Counting (for $K$ classes and $d$ features):**

* **LDA:** $K \cdot d$ (means) + $d(d+1)/2$ (shared covariance) + $K$ (priors).
* **QDA:** $K \cdot d$ (means) + $K \cdot d(d+1)/2$ (**separate covariances**) + $K$ (priors).

**Example ($d=10$ features, $K=3$ classes):**
* **LDA:** $30 + 55 + 3 = \mathbf{88}$ parameters.
* **QDA:** $30 + 165 + 3 = \mathbf{198}$ parameters.
* *Result:* QDA has ~2.25x more parameters to learn, meaning it has **Higher Variance** and needs more training data to avoid overfitting.
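
The counting formulas are easy to turn into two small helpers (the function names are made up here), which makes it simple to see how the gap grows with the number of features.

```python
def lda_param_count(K, d):
    """Means + one shared covariance + priors."""
    return K * d + d * (d + 1) // 2 + K

def qda_param_count(K, d):
    """Means + one covariance per class + priors."""
    return K * d + K * d * (d + 1) // 2 + K

lda_params = lda_param_count(K=3, d=10)   # 88
qda_params = qda_param_count(K=3, d=10)   # 198
ratio = qda_params / lda_params           # 2.25

print(lda_params, qda_params, ratio)

# The gap widens as the number of features grows
for d in (2, 10, 50):
    print(d, lda_param_count(3, d), qda_param_count(3, d))
```

Because the covariance term grows quadratically in $d$, the ratio approaches $K$ as $d$ increases.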

---

### 3. Top Interview Questions (Q&A)

#### Q1: When does QDA outperform LDA?
**A:** QDA wins when:
1.  **Covariances Differ:** Class spreads are significantly different (e.g., Class A is tight, Class B is wide). Check this with **Box's M test** (the multivariate generalization of Bartlett's test).
2.  **Sufficient Data:** You have enough samples to estimate those extra parameters (Rule of thumb: $10 \times \text{features}$ per class).
3.  **Non-Linearity:** The true decision boundary is curved.

#### Q2: What happens when covariance matrices are singular?
**A:** "Singularity" means the matrix cannot be inverted (determinant is 0). This happens if:
* $N_{samples} < N_{features}$ (High-dimensional, low-sample data).
* Features are perfectly correlated (collinearity).
* **Solution:** Use **Regularization**. Add a small value to the diagonal: $\Sigma_{new} = \Sigma + \lambda I$. This is often called "Regularized QDA."

#### Q3: Can QDA handle categorical features directly?
**A:** **No.** QDA assumes features are **Continuous** and **Gaussian**.
* *Workaround:* You can technically use One-Hot Encoding, but the resulting binary columns violate the Gaussian assumption and often make the per-class covariance matrices singular. Naive Bayes is usually a better fit for categorical data.

#### Q4: How do you visualize QDA boundaries?
**A:** Since boundaries are quadratic:
* **2D:** Plot contours of the discriminant function.
* **Higher Dimensions:** You cannot easily visualize the "curve" in 10D. You typically project data into 2D (using PCA or LDA) and then plot the QDA boundaries on top, though this is an approximation.

#### Q5: Can QDA be used for feature selection?
**A:** **Not really.** Unlike LDA (which gives you discriminant vectors with weights), QDA's parameters are hidden inside complex quadratic matrices. There are no simple coefficients like $w_1, w_2$ to rank features.

---

### 4. Summary Cheat Sheet

| Condition | **QDA** Recommendation |
| :--- | :--- |
| **Small Sample Size** |  **Avoid** (Use LDA or Naive Bayes) |
| **High Dimensions ($p \gg n$)** |  **Avoid** (Covariance becomes singular) |
| **Different Class Spreads** |  **Use** (Captures variance differences) |
| **Non-Linear Boundary** |  **Use** (Captures curves) |
| **Interpretability** |  **Low** (No simple coefficients) |

---

### 5. Mathematical Summary

1.  **Class-conditional density:** $$P(X|Y=k) = N(\mu_k, \Sigma_k)$$
2.  **Discriminant function:** $$\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{1}{2}\ln|\Sigma_k| + \ln\pi_k$$
3.  **Decision rule:** $$\hat{y} = \arg\max_k \delta_k(x)$$
4.  **Total Parameters:** $$K \left[ d + \frac{d(d+1)}{2} + 1 \right]$$

In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import numpy as np

# Sample data
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1])

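# Create QDA model
# Class 1 has only two samples, so its 2x2 covariance estimate is singular;
# reg_param regularizes each class covariance (0.1 is an arbitrary choice).
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)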
qda.fit(X, y)

# Predict class
print(qda.predict([[3.0, 4.0]]))

# Posterior probabilities
print(qda.predict_proba([[3.0, 4.0]]))
