# Logistic Regression


Logistic Regression is a supervised classification algorithm used to predict probabilities for binary outcomes (0 or 1).

Despite the name, it is used for classification, not regression.


Core Idea

Uses a linear combination of features (the linear model), then applies the sigmoid function to squash the output into a probability between 0 and 1.


Linear Part: z = wx + b

Sigmoid Function: σ(z) = 1 / (1 + e^(−z))

The sigmoid converts the linear output z into a probability.

Decision Boundary

if probability ≥ 0.5 → Class 1

else → Class 0
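
A minimal sketch of the linear part, the sigmoid, and the 0.5 threshold. The values of `w`, `b`, and `x` are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for a single feature (illustration only)
w, b = 1.0, -3.0
x = np.array([1.0, 3.0, 5.0])

z = w * x + b                    # linear part: [-2, 0, 2]
p = sigmoid(z)                   # probabilities in (0, 1)
labels = (p >= 0.5).astype(int)  # threshold at 0.5

print(p)
print(labels)
```

Since σ(0) = 0.5, the rule "probability ≥ 0.5" is the same as "z ≥ 0", so the decision boundary is the set of points where wx + b = 0.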

Loss Function

Binary Cross-Entropy (Log Loss)

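Loss = −[y log(p) + (1 − y) log(1 − p)]

MSE is not used because, composed with the sigmoid, it is non-convex in the weights, so gradient descent can stall in flat regions or poor local minima; log loss is convex for this model.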
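
A short sketch of the log loss for a single example, showing how the penalty grows as the prediction becomes confidently wrong. The `eps` clipping is an added safeguard against log(0), not part of the formula itself.

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy for true label y (0 or 1) and predicted probability p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = 1
for p in (0.9, 0.5, 0.1, 0.01):
    # The true label is 1, so smaller p means a more confident wrong prediction
    print(p, log_loss(y_true, p))
```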

How It Learns

Updates the parameters (w, b) to minimize log loss, typically with an optimization algorithm such as Gradient Descent.
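
For reference, the update rule can be written out for the single-feature model z = wx + b. The symbol η for the learning rate and the averaging over n examples are notational assumptions introduced here.

```latex
\begin{aligned}
z_i &= w x_i + b, \qquad p_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}} \\
L &= -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] \\
\frac{\partial L}{\partial w} &= \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)\, x_i, \qquad
\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i) \\
w &\leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
\end{aligned}
```

The gradients take this simple form because the derivative of the sigmoid, σ(z)(1 − σ(z)), cancels against the denominators that come from differentiating the logs.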

Why Not Linear Regression?

Linear regression can output values outside [0, 1], which cannot be read as probabilities

Poor for classification

Logistic Regression gives probability
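
A quick illustration of the problem, fitting an ordinary least-squares line to 0/1 labels with `np.polyfit`. The data and the query points are chosen only for illustration.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([0, 0, 0, 1, 1], dtype=float)

# Ordinary least-squares line: y ≈ slope * x + intercept
slope, intercept = np.polyfit(X, y, deg=1)

for x_new in (0.0, 3.0, 10.0):
    # Predictions at x = 0 and x = 10 fall outside [0, 1]
    print(x_new, slope * x_new + intercept)
```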

Evaluation Metrics

Accuracy

Precision

Recall

F1-Score

ROC-AUC
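
A small sketch of computing these metrics with scikit-learn. The labels and predicted probabilities are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)

y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.3, 0.7, 0.9]          # made-up predicted P(class 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # threshold at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
# ROC-AUC is computed from the probabilities, not the thresholded labels
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))
```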

Assumptions

1. Outcome is binary.

2. Observations are independent.

3. Linear relationship between log-odds and features.

4. Little or no multicollinearity among features.

Why sigmoid?

Converts output to probability

Why log loss?

Penalizes wrong confident predictions more

Is Logistic Regression linear?

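Yes, in the sense that the log-odds (and therefore the decision boundary) are linear in the features and parameters; the predicted probability is a non-linear (sigmoid) function of them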

Can it do multi-class?

Yes, using One-vs-Rest (one binary classifier per class) or multinomial (softmax) regression
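
A minimal One-vs-Rest sketch using scikit-learn's `OneVsRestClassifier` wrapper. The three-class toy data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up 1D data with three classes
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# One binary logistic regression is trained per class (class k vs. the rest)
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

print(len(ovr.estimators_))           # number of binary classifiers
print(ovr.predict([[1.5], [8.5]]))
print(ovr.predict_proba([[5.0]]))     # one column per class
```

scikit-learn's `LogisticRegression` can also fit multi-class targets directly, so the explicit wrapper is mainly useful for seeing the One-vs-Rest idea in isolation.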



In [1]:
import numpy as np

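class LogisticRegression:
    """Minimal single-feature logistic regression trained with gradient descent."""

    def __init__(self, lr=0.01, epochs=1000):
        self.lr = lr          # learning rate (step size)
        self.epochs = epochs  # number of full passes over the data
        self.w = 0            # weight for the single feature
        self.b = 0            # bias (intercept)

    def sigmoid(self, z):
        # Squash a real-valued score into a probability in (0, 1)
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        n = len(X)

        for _ in range(self.epochs):
            # Forward pass: linear score, then probability
            z = self.w * X + self.b
            y_pred = self.sigmoid(z)

            # Gradients of the mean log loss with respect to w and b
            dw = (1/n) * np.sum(X * (y_pred - y))
            db = (1/n) * np.sum(y_pred - y)

            # Gradient descent update
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        z = self.w * X + self.b
        y_pred = self.sigmoid(z)
        # Threshold probabilities at 0.5 to get class labels
        return [1 if i >= 0.5 else 0 for i in y_pred]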


In [2]:
X = np.array([1, 2, 3, 4, 5])
y = np.array([0, 0, 0, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict(np.array([3, 5])))


[1, 1]


In [3]:
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[3], [5]]))
print(model.predict_proba([[3], [5]]))


[0 1]
[[0.64726666 0.35273334]
 [0.18436618 0.81563382]]


In [4]:
print(model.coef_)
print(model.intercept_)


[[1.0470438]]
[-3.74817743]


# LDA

LDA is a supervised dimensionality reduction and classification technique.

It finds a linear combination of features that best separates two or more classes.

Mainly used for classification.

Maximize between-class variance and minimize within-class variance.

Key Concepts

Within-class variance → how spread out data is in a class

Between-class variance → how far apart class means are

Projection → LDA projects data into a lower-dimensional space (like 1D or 2D) that maximizes class separability

| Feature     | LDA                                        | PCA                                                      |
| ----------- | ------------------------------------------ | -------------------------------------------------------- |
| Supervision | Supervised (uses class labels)             | Unsupervised                                             |
| Goal        | Maximize class separability                | Maximize variance of data                                |
| Output      | Lower-dimensional space for classification | Lower-dimensional space for reconstruction / compression |
| Use case    | Classification                             | Dimensionality reduction                                 |

When to Use LDA

Classification tasks

Reduce dimensionality before classification

Small datasets (better than PCA for class separation)

Steps of LDA

Compute mean vectors for each class

Compute within-class scatter matrix (S_W)

Compute between-class scatter matrix (S_B)

Solve the eigenvalue problem for S_W^{-1} S_B

Select top eigenvectors → new subspace

Project data → classify using nearest centroid
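
The steps above can be sketched in NumPy as follows. This is an illustrative implementation on a small toy dataset, not scikit-learn's algorithm: the nearest-centroid rule here ignores class priors, so results may differ from `LinearDiscriminantAnalysis` in some cases.

```python
import numpy as np

# Small two-class toy dataset (illustration only)
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1])

classes = np.unique(y)
overall_mean = X.mean(axis=0)
d = X.shape[1]

# Steps 1-3: class means, within-class and between-class scatter
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for c in classes:
    X_c = X[y == c]
    mu_c = X_c.mean(axis=0)
    centered = X_c - mu_c
    S_W += centered.T @ centered
    diff = (mu_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (diff @ diff.T)

# Step 4: eigen-decomposition of S_W^{-1} S_B (solve avoids an explicit inverse)
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))

# Step 5: keep the top min(K-1, d) eigenvectors
order = np.argsort(eigvals.real)[::-1]
n_components = min(len(classes) - 1, d)
W = eigvecs[:, order[:n_components]].real

# Step 6: project and classify by nearest class centroid in the projected space
X_proj = X @ W
centroids = np.array([X_proj[y == c].mean(axis=0) for c in classes])

def predict(x):
    z = np.asarray(x, dtype=float) @ W
    return classes[np.argmin(np.linalg.norm(centroids - z, axis=1))]

print(predict([3.0, 4.0]))
```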

LDA as Classifier

After projecting data, assign a sample to the class with closest mean in the LDA space

Assumes features are normally distributed within each class

Advantages

Simple, fast, easy to interpret

Reduces dimensionality while improving class separation

Works well for linearly separable classes

Disadvantages

Assumes normal distribution of features

Assumes same covariance matrix for all classes

Poor performance if classes are not linearly separable


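Q: Difference between LDA and Logistic Regression?
LDA assumes normally distributed features within each class, Logistic Regression does not; LDA models the class distributions, while Logistic Regression models the class probability directly.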

Q: Difference between LDA and PCA?
LDA is supervised, PCA is unsupervised.

Q: Can LDA be used for regression?
No, only for classification (and supervised dimensionality reduction).

Q: When to use LDA?
Small datasets, linearly separable classes, feature reduction + classification.

PCA → “max variance” (unsupervised), LDA → “max class separation” (supervised)
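
A short side-by-side sketch of the two projections with scikit-learn, using the bundled Iris dataset as an assumed example. The only point is the API difference: PCA never sees the labels, while LDA requires them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes

# PCA: unsupervised, directions of maximum variance (y is not used)
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, directions of maximum class separation (y is required)
# With 3 classes, LDA can produce at most 3 - 1 = 2 components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```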



Geometric Intuition:

Goal: Find projection that maximizes class separability

Think: "Squeeze" same classes together, "push" different classes apart


For K classes, LDA finds at most (K-1) discriminant axes that maximize:

Objective: J(w) = (wᵀS_B w) / (wᵀS_W w)
where:

  w = projection vector (what we're solving for)

  S_B = Between-class scatter matrix (separation measure)

  S_W = Within-class scatter matrix (compactness measure)
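
A small numerical check of the objective, as a sketch on toy data. For two classes the maximizer is known in closed form, w ∝ S_W^{-1}(μ_1 − μ_0); the helper names `scatter_matrices` and `fisher_criterion` are hypothetical.

```python
import numpy as np

# Small two-class toy dataset (illustration only)
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1])

def scatter_matrices(X, y):
    """Return (S_W, S_B) for data X with class labels y."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)
        S_W += (X_c - mu_c).T @ (X_c - mu_c)
        S_B += len(X_c) * np.outer(mu_c - mu, mu_c - mu)
    return S_W, S_B

def fisher_criterion(w, S_W, S_B):
    """J(w) = (w^T S_B w) / (w^T S_W w)."""
    return (w @ S_B @ w) / (w @ S_W @ w)

S_W, S_B = scatter_matrices(X, y)

# Two-class closed form: w is proportional to S_W^{-1} (mu_1 - mu_0)
w_best = np.linalg.solve(S_W, X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))

candidates = [
    ("x-axis", np.array([1.0, 0.0])),
    ("y-axis", np.array([0.0, 1.0])),
    ("Fisher", w_best),
]
for name, w in candidates:
    print(name, fisher_criterion(w, S_W, S_B))
```

J(w) does not depend on the length of w, only on its direction, which is why the solution is reported as a direction (an eigenvector) rather than a unique vector.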

```text
┌─────────────────┬────────────────────────┬─────────────────────────┐
│ Condition       │ LDA behavior           │ Alternative             │
├─────────────────┼────────────────────────┼─────────────────────────┤
│ Normal features │ Works well             │                         │
│ Non-normal      │ Poor performance       │ Logistic Regression     │
│ Equal variance  │ Optimal                │                         │
│ Unequal var.    │ Suboptimal             │ QDA (Quadratic DA)      │
│ Linear sep.     │ Excellent              │                         │
│ Non-linear      │ Fails                  │ SVM with RBF kernel     │
│ Many features   │ Singular S_W           │ Regularized LDA         │
│ Very small n    │ Overfits               │ Naive Bayes             │
└─────────────────┴────────────────────────┴─────────────────────────┘
```

Q1: LDA vs Logistic Regression - When to choose which?

Both are linear classifiers, but different assumptions:

Choose LDA when:
1. Features are (approximately) normally distributed
2. Classes have similar covariance structures
3. Small sample size (LDA is more data-efficient)
4. Need probabilistic outputs with Gaussian assumption

Choose Logistic Regression when:
1. Features are not normal (binary, counts, etc.)
2. Want interpretable coefficients (log-odds)
3. Need to handle many features (L1/L2 regularization)
4. Model misspecification is a concern (LR is more robust)

Key insight: LDA models P(X|Y), LR models P(Y|X)

Q2: Can LDA handle more than 2 classes? How?

"Yes, LDA naturally extends to multi-class:

One-vs-Rest approach: Not needed, LDA handles multiple classes directly

Solution: Find (K-1) discriminant axes that maximize class separation

Classification: Project to (K-1) dimensional space, use Mahalanobis distance

Visualization: For K classes, you can visualize in up to (K-1) dimensions

Example: 10 classes → at most 9 discriminant components"

Q3: What happens when S_W is singular? How to fix?

"S_W becomes singular when:

n_samples < n_features (common in genomics, text)

Features are linearly dependent

Constant features or zero variance

Solutions:

Regularization: Add λI to S_W (shrinkage LDA)

PCA preprocessing: Reduce dimensions first

Feature selection: Remove correlated/irrelevant features

Use pseudoinverse: SVD-based solution

Kernel LDA: Map to higher dimension where data is linearly separable"
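
A sketch of the regularization option using scikit-learn's shrinkage LDA on synthetic data with more features than samples. The data, the random seed, and the sizes are assumptions made for illustration; `shrinkage` is only supported with the `lsqr` and `eigen` solvers.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data with n_samples (20) < n_features (50), so S_W is singular
rng = np.random.default_rng(0)
n_per_class, n_features = 10, 50
X0 = rng.normal(loc=0.0, size=(n_per_class, n_features))
X1 = rng.normal(loc=1.0, size=(n_per_class, n_features))
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# shrinkage="auto" picks the shrinkage intensity with the Ledoit-Wolf estimate
lda = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")
lda.fit(X, y)

print(lda.score(X, y))  # training accuracy, not a generalization estimate
```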

Q4: How does LDA differ from ANOVA?


"Both separate group means, but:

• ANOVA: Tests if group means are different (univariate, one feature)

• LDA: Finds linear combination that maximizes separation (multivariate)

Think: ANOVA is 1D LDA. LDA = multivariate ANOVA + dimension reduction."

Q5: Can LDA be used for feature selection?

"Yes, two ways:

Discriminant coefficients: Magnitude indicates feature importance

Stepwise LDA: Add/remove features based on discriminant power

But caution: LDA coefficients assume linear separability and equal variance."


When to Use LDA:

 Small to medium datasets

 Normally distributed features

 Linear class boundaries

 Need dimensionality reduction + classification

 Interpretable feature importance needed

When to Avoid LDA:

 Highly non-normal features

 Non-linear decision boundaries

 Very high-dimensional data (n_features >> n_samples)

 Classes with very different variances

 Need for non-linear interactions
 

1. Within-class scatter: S_W = Σ_c Σ_{x ∈ class c} (x - μ_c)(x - μ_c)ᵀ
2. Between-class scatter: S_B = Σ_c n_c (μ_c - μ)(μ_c - μ)ᵀ
3. Objective: max_w (wᵀS_B w) / (wᵀS_W w)
4. Solution: eigenvectors of S_W^{-1} S_B
5. Number of components: at most min(K-1, p), where K = classes, p = features

Q: LDA vs PCA? → LDA: supervised, max class separation; PCA: unsupervised, max variance

Q: LDA vs QDA? → LDA: linear, equal covariance; QDA: quadratic, different covariances

Q: LDA vs Logistic Regression? → LDA: generative, models P(X|Y); LR: discriminative, models P(Y|X)

Q: Assumptions? → Normality, equal covariance, linear separability

Q: Singular S_W fix? → Regularization, PCA first, feature selection

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
import numpy as np

# Sample data
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1])

# Create LDA model
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Predict class
print(lda.predict([[3.0, 4.0]]))

# Transform data to lower dimension
X_new = lda.transform(X)
print(X_new)


# QDA

QDA is a supervised classification algorithm like LDA.

Main difference: assumes each class has its own covariance matrix, instead of sharing one (as in LDA).

Decision boundary is quadratic instead of linear.

Allows different spreads for each class → more flexible decision boundary



Within-class covariance → each class has its own covariance matrix

Decision boundary → quadratic curve separating classes

Works best when class variances are different

When to Use QDA

Classification tasks

Classes have different spreads / covariance

Small or moderate-dimensional data (high-dimensional → overfitting risk)

Steps of QDA

Compute mean vector for each class

Compute covariance matrix for each class

Compute posterior probability for each class using Bayes’ theorem

Assign sample to class with highest posterior probability


Advantages

Can model non-linear boundaries

Better than LDA when class covariances differ

Probabilistic output (posterior probabilities)

Disadvantages

More parameters → risk of overfitting for small datasets

Assumes normal distribution of features

Less interpretable than LDA

Q: Difference between LDA and QDA?

LDA → shared covariance → linear boundary, QDA → per-class covariance → quadratic boundary

Q: When to prefer QDA over LDA?

Class variances are different

Q: Can QDA handle more than 2 classes?

Yes, multi-class classification

Q: Does QDA require feature scaling?

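Not strictly: the Gaussian model adapts to the scale of each feature, but scaling can help numerical stability, especially when regularization is used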




LDA → Linear boundary, shared covariance
QDA → Quadratic boundary, per-class covariance

| Feature               | LDA                              | QDA                                 |
| --------------------- | -------------------------------- | ----------------------------------- |
| Covariance assumption | Same for all classes             | Different for each class            |
| Decision boundary     | Linear                           | Quadratic                           |
| Flexibility           | Less                             | More                                |
| Number of parameters  | Fewer → less risk of overfitting | More → can overfit if small dataset |
| When to use           | Classes have similar variance    | Classes have different variance     |


Geometric Intuition:

QDA fits a separate Gaussian distribution to each class

Each class gets its own "shape" (mean + covariance matrix)

Decision boundary emerges from comparing these multivariate Gaussians

For class k, assume data follows multivariate normal distribution:


P(X | Y = k) ~ N(μ_k, Σ_k)

Using Bayes' Theorem:

P(Y = k | X) ∝ P(X | Y = k) * P(Y = k)

After taking logs and simplifying:

Discriminant function δ_k(x) = -½(x - μ_k)ᵀΣ_k⁻¹(x - μ_k) - ½log|Σ_k| + log π_k

where π_k = prior probability of class k
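
The discriminant function can be sketched directly in NumPy. The class parameters below (means, covariances, priors) are hypothetical values chosen for illustration, not estimates from data.

```python
import numpy as np

def qda_discriminant(x, mu, sigma, prior):
    """delta_k(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - 1/2 log|Sigma| + log(prior)."""
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(sigma)        # log-determinant, numerically stable
    return -0.5 * maha - 0.5 * logdet + np.log(prior)

# Hypothetical parameters: class 0 is tight, class 1 is widely spread
params = {
    0: dict(mu=np.array([0.0, 0.0]), sigma=np.eye(2), prior=0.5),
    1: dict(mu=np.array([3.0, 3.0]), sigma=4.0 * np.eye(2), prior=0.5),
}

def predict(x):
    """Return (predicted class, dict of discriminant scores) for a point x."""
    x = np.asarray(x, dtype=float)
    scores = {k: qda_discriminant(x, **p) for k, p in params.items()}
    return max(scores, key=scores.get), scores

print(predict([0.0, 0.0]))
print(predict([3.0, 3.0]))
```

The −½ log|Σ_k| term is what separates this from LDA: with a shared covariance it is the same for every class and cancels, leaving a linear boundary.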

Bias-Variance Tradeoff in QDA:

Number of parameters to estimate:

For K classes with d features:

LDA: K*d (means) + d(d+1)/2 (shared covariance) + K (priors)

QDA: K*d (means) + K*d(d+1)/2 (separate covariances) + K (priors)

Example: d=10, K=3

LDA: 30 + 55 + 3 = 88 parameters

QDA: 30 + 165 + 3 = 198 parameters

QDA has 2.25x more parameters → needs more data
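
The counts above can be reproduced with two small helper functions. The names `lda_param_count` and `qda_param_count` are hypothetical, and the priors are counted as K values to match the formulas in the text.

```python
def lda_param_count(K, d):
    """Means (K*d) + one shared symmetric covariance (d(d+1)/2) + priors (K)."""
    return K * d + d * (d + 1) // 2 + K

def qda_param_count(K, d):
    """Means (K*d) + K symmetric covariances (K*d(d+1)/2) + priors (K)."""
    return K * d + K * d * (d + 1) // 2 + K

K, d = 3, 10
print(lda_param_count(K, d))                          # 88
print(qda_param_count(K, d))                          # 198
print(qda_param_count(K, d) / lda_param_count(K, d))  # 2.25
```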


Q1: When does QDA outperform LDA?

"QDA outperforms LDA when:

Class covariance matrices are significantly different (check with Box's M test)

Sample size is large enough (rule of thumb: at least 10× features per class)

True decision boundary is quadratic/non-linear

Classes have different spreads or orientations

Example: Class A is tightly clustered, Class B is widely spread → QDA will capture this better."

Q2: How many parameters does QDA need to estimate?

"For K classes with d features:

Means: K × d parameters

Covariance matrices: K × d(d+1)/2 parameters (each symmetric)

Priors: K parameters

Total: K[d + d(d+1)/2 + 1] parameters

Example: 3 classes, 10 features → 3[10 + 55 + 1] = 198 parameters

This explains why QDA needs much more data than LDA!"

Q3: What happens when covariance matrices are singular?

"Singular covariance occurs when:

n_samples < n_features (common in high-dim low-sample settings)

Features are linearly dependent

Perfect collinearity

Solutions:

Regularization: Add λI to covariance matrices (in sklearn, reg_param shrinks each class covariance estimate toward the identity matrix)

Feature selection/reduction: Use PCA or filter methods first

Use diagonal QDA: Assumes features are independent (like Naive Bayes)

Shrinkage QDA: Shrink toward pooled covariance (like LDA-QDA hybrid)"

Q4: Can QDA handle categorical features?

"Directly? No. QDA assumes continuous, normally distributed features.

Workarounds:

Encoding: Use one-hot encoding, but beware of curse of dimensionality

Separate modeling: Model categorical features with different distributions

Mixed models: Use QDA for continuous, Naive Bayes for categorical

Kernel methods: Map to continuous space

Better: Use models designed for mixed data types."

Q5: How to visualize QDA decision boundaries?

"Three approaches:

2D feature space: Plot contours of discriminant functions

LD1-LD2 projection: Project to LDA space first, then apply QDA

Pairwise plots: For multi-class, plot each pair of features

Key insight: QDA boundaries can be ellipses, parabolas, or hyperbolas depending on covariance differences."

When to Use QDA:

Different class covariances/spreads

Moderate to large sample sizes

Quadratic/non-linear decision boundaries

Need probabilistic outputs

Visualization of class distributions

When to Avoid QDA:

 Very small sample sizes

 High-dimensional data (p >> n)

 Features not normally distributed

 Computational efficiency needed

 Need for interpretable linear coefficients

1. Class-conditional density: P(X|Y=k) = N(μ_k, Σ_k)

2. Discriminant function: δ_k(x) = -½(x-μ_k)ᵀΣ_k⁻¹(x-μ_k) - ½log|Σ_k| + logπ_k

3. Decision rule: ŷ = argmax_k δ_k(x)

4. Number of parameters: K[d + d(d+1)/2 + 1]


Q: QDA vs LDA? → QDA: separate covariances, quadratic boundaries; LDA: shared covariance, linear boundaries

Q: When QDA fails? → Small samples, high dimensions, non-normal data

Q: Regularization? → Add λI to covariance matrices to prevent singularity

Q: Multi-class? → Direct extension works naturally

Q: Feature importance? → Not directly available (unlike LDA coefficients)







In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import numpy as np

# Sample data
X = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
y = np.array([0, 0, 0, 1, 1])

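# Create QDA model
# Class 1 has only two samples, so its 2x2 covariance estimate is singular;
# reg_param (value chosen arbitrarily here) regularizes each class covariance.
qda = QuadraticDiscriminantAnalysis(reg_param=0.1)
qda.fit(X, y)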

# Predict class
print(qda.predict([[3.0, 4.0]]))

# Posterior probabilities
print(qda.predict_proba([[3.0, 4.0]]))
