# Topic 2B: Logistic Regression vs LDA (Pipeline)

**Corresponding script:**
- `scripts/exercises/log_reg_lda_ex1.py`

## Learning goals
- Understand LDA used as a **supervised transformer** inside a pipeline
- Compare **baseline Logistic Regression** vs **LDA → Logistic Regression**
- Reinforce the key rule: for binary classification, LDA can produce at most **1** component


In [6]:
import numpy as np

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

## 1) Dataset with redundancy (multicollinearity-like)

We generate a dataset with:
- many features (20)
- 10 informative
- 10 redundant (random linear combinations of the informative ones)

This is a situation where dimensionality reduction can, in principle, help:
- remove redundancy
- simplify the model’s input space
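One way to make the redundancy visible is to look at the singular values of the feature matrix. The sketch below is illustrative only: the names `X_demo` and `effective_rank` and the `1e-8` relative threshold are assumptions, not part of the exercise script. Since the redundant columns are linear combinations of the informative ones, the effective rank should come out well below the 20 nominal features.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same generator settings as the exercise; X_demo is an illustrative name.
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=10,
    random_state=7,
)

# Singular values, sorted from largest to smallest.
s = np.linalg.svd(X_demo, compute_uv=False)

# Count directions that carry non-negligible variance.
# The 1e-8 relative threshold is an arbitrary choice for this sketch.
effective_rank = int(np.sum(s > 1e-8 * s[0]))

print("Nominal number of features:", X_demo.shape[1])
print("Effective rank:", effective_rank)
```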


In [7]:
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=10,
    n_redundant=10,
    random_state=7
)

print("X shape:", X.shape)
print("Class counts:", np.bincount(y))

X shape: (1000, 20)
Class counts: [505 495]


## 2) Evaluation setup (Repeated Stratified K-Fold)

We use the same CV approach as in Topic 1:
- stratified folds
- repeated multiple times

With 10 folds and 3 repeats, each model is scored 30 times, which gives a more stable estimate of mean accuracy and its variability than a single 10-fold run.
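As a quick check of what this splitter produces, the sketch below counts the splits and inspects the class balance in each test fold. The labels are a synthetic stand-in with the same class counts as the dataset above; `cv_demo`, `y_demo`, and `test_fracs` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Stand-in labels with the same class counts as the exercise (505 / 495).
y_demo = np.array([0] * 505 + [1] * 495)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for splitting

cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Total number of train/test splits: n_splits * n_repeats.
n_fits = cv_demo.get_n_splits(X_demo, y_demo)

# Fraction of class 1 in every test fold; stratification keeps it near 0.495.
test_fracs = [
    y_demo[test_idx].mean() for _, test_idx in cv_demo.split(X_demo, y_demo)
]

print("Number of splits:", n_fits)
print(f"Class-1 fraction per test fold: {min(test_fracs):.3f} to {max(test_fracs):.3f}")
```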


In [8]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

## 3) Baseline: Logistic Regression on raw features

Logistic Regression is a linear classifier.

It computes a linear score:

$$
z = w^T x + b
$$

It then converts the score to a probability with the sigmoid function, $p = 1 / (1 + e^{-z})$ (binary case), and predicts class 1 when $p \ge 0.5$.
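The score-to-probability step described above can be sketched with plain NumPy. The weights, bias, and input below are made-up values for illustration, not coefficients fitted in this exercise.

```python
import numpy as np

def sigmoid(score):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical parameters and input, chosen only for illustration.
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.0, 2.0])

score = w @ x + b          # linear score w^T x + b
prob = sigmoid(score)      # probability of class 1
label = int(prob >= 0.5)   # threshold at 0.5, i.e. score >= 0

print(f"score={score:.3f}, prob={prob:.3f}, label={label}")
```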

This baseline answers:
> How well can a simple linear model work **without** dimensionality reduction?


In [9]:
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=2000))
])

scores_base = cross_val_score(baseline, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

print(f"Baseline LogReg accuracy: {scores_base.mean():.3f} ± {scores_base.std():.3f}")

Baseline LogReg accuracy: 0.825 ± 0.034


## 4) LDA → Logistic Regression pipeline

Now we insert LDA **before** Logistic Regression.

Important constraint:

$$
n_{\text{components}} \le C - 1
$$

Here we have **C = 2 classes**, so `n_components` can only be **1**.

So we build:

`StandardScaler → LDA(1) → LogisticRegression`
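The constraint can be checked directly. The sketch below uses a smaller synthetic dataset with illustrative names (`X_demo`, `Z`, `raised`). It assumes a reasonably recent scikit-learn, where asking for more components than `min(n_features, n_classes - 1)` raises a `ValueError` at fit time (older releases only warned).

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Small binary dataset; the names here are illustrative only.
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=20,
    n_informative=10,
    n_redundant=10,
    random_state=0,
)

# With two classes, LDA projects every sample onto a single axis (LD1).
Z = LinearDiscriminantAnalysis(n_components=1).fit_transform(X_demo, y_demo)
print("Shape before:", X_demo.shape, "-> after LDA:", Z.shape)

# Requesting two components for a binary problem is expected to fail.
try:
    LinearDiscriminantAnalysis(n_components=2).fit(X_demo, y_demo)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```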


In [10]:
lda_logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis(n_components=1)),
    ("logreg", LogisticRegression(max_iter=2000))
])

scores_lda = cross_val_score(lda_logreg, X, y, scoring="accuracy", cv=cv, n_jobs=-1)

print(f"LDA(1)+LogReg accuracy: {scores_lda.mean():.3f} ± {scores_lda.std():.3f}")
print(f"Delta (LDA - baseline): {(scores_lda.mean() - scores_base.mean()):.4f}")

LDA(1)+LogReg accuracy: 0.825 ± 0.034
Delta (LDA - baseline): 0.0000


## ✅ Interpretation of results (Topic 2B)

We compared two cross-validated pipelines:

- **Baseline:** StandardScaler → Logistic Regression
- **LDA pipeline:** StandardScaler → LDA(1) → Logistic Regression

Results:

- Baseline accuracy: **0.825 ± 0.034**
- LDA(1)+LogReg accuracy: **0.825 ± 0.034**
- Δ (LDA − baseline): **0.0000**
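Because both pipelines are scored with the same splitter and `random_state`, their scores can also be compared fold by fold rather than only through the two means. The following is a minimal sketch of that paired view; it rebuilds the dataset and pipelines so it runs on its own, and the helper `make_pipeline_demo` and the other `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7
)
cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

def make_pipeline_demo(use_lda):
    """Build the baseline pipeline, optionally with an LDA(1) step."""
    steps = [("scaler", StandardScaler())]
    if use_lda:
        steps.append(("lda", LinearDiscriminantAnalysis(n_components=1)))
    steps.append(("logreg", LogisticRegression(max_iter=2000)))
    return Pipeline(steps)

# A fixed random_state makes the splitter yield identical folds for both
# calls, so the two score arrays line up fold by fold.
base_scores = cross_val_score(make_pipeline_demo(False), X_demo, y_demo,
                              scoring="accuracy", cv=cv_demo)
lda_scores = cross_val_score(make_pipeline_demo(True), X_demo, y_demo,
                             scoring="accuracy", cv=cv_demo)

diffs = lda_scores - base_scores
print(f"Mean paired difference: {diffs.mean():.4f}")
print(f"Largest per-fold gap:   {np.abs(diffs).max():.4f}")
print(f"Folds with identical accuracy: {(diffs == 0).sum()} of {len(diffs)}")
```

A mean difference near zero together with small per-fold gaps would suggest the two pipelines behave alike on every split, not just on average.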

### What this means
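- LDA did **not** change mean accuracy on this dataset (the difference is zero to four decimal places).
- This is expected rather than surprising: LDA(1) followed by Logistic Regression is still a linear classifier in the original feature space, so it cannot represent a decision boundary that the baseline LogReg could not.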
- In binary classification, LDA can produce only **one** discriminant axis:

$$
n_{\text{components}} \le C-1 = 1
$$

So LDA compresses the 20 standardized features into a single score, **LD1**. If LD1 preserves nearly all class-separating information, accuracy stays the same.
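One way to probe whether LD1 keeps the class-separating information is to check how often the two pipelines make the same prediction on held-out data. The sketch below uses a single stratified train/test split for simplicity, which is an assumption of this example and not the evaluation protocol used above; all `*_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

baseline_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=2000)),
]).fit(X_tr, y_tr)

lda_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis(n_components=1)),
    ("logreg", LogisticRegression(max_iter=2000)),
]).fit(X_tr, y_tr)

# Fraction of held-out samples on which both pipelines predict the same class.
agreement = np.mean(baseline_demo.predict(X_te) == lda_demo.predict(X_te))
print(f"Prediction agreement on held-out data: {agreement:.3f}")
```

High agreement would be consistent with the interpretation that the one-dimensional projection loses little of what the baseline uses to separate the classes.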

### Takeaway
- LDA is **not guaranteed** to improve accuracy.
- Its benefit here is mainly **supervised dimensionality reduction** (20 features reduced to 1, a simpler representation), not better prediction.
