<h2>Dimensionality Reduction</h2>

Dimensionality reduction is the process of reducing the number of features (variables) in a dataset while preserving as much relevant information as possible.

It addresses several common problems:

    Curse of dimensionality: a smaller feature space is less sparse, which tends to help models generalize

    Overfitting: lower model complexity leaves fewer spurious correlations to fit

    Computational cost: fewer features speed up training and inference

    Hard-to-visualize data: projecting to 2D or 3D makes plotting possible

    Multicollinearity: methods such as PCA produce uncorrelated components

    Noisy or redundant features: uninformative features are dropped or down-weighted

<h3>⚙️ Two Main Categories of Dimensionality Reduction</h3>

🔹 <b>1. Feature Selection (Subset of original features)</b>

    Select important features using:

        Variance Threshold

        Mutual Information

        Recursive Feature Elimination (RFE)

        SelectKBest

    ✅ Interpretability is preserved, because the selected features are original columns rather than transformed ones.
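The selection methods listed above can be sketched with scikit-learn as follows. This is a minimal illustration, not a recommended recipe: the breast cancer dataset, the variance cutoff of 0.01, and the choice of 10 features are all assumptions made for the example.

```python
from functools import partial

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (
    RFE,
    SelectKBest,
    VarianceThreshold,
    mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

# Variance Threshold: unsupervised, drops features whose variance is below the cutoff.
# The cutoff (0.01) is an arbitrary value chosen for illustration.
vt = VarianceThreshold(threshold=0.01).fit(X)
kept_by_variance = names[vt.get_support()]

# SelectKBest scored by Mutual Information: keeps the k features that share
# the most information with the target. random_state fixes the estimator's noise.
score_func = partial(mutual_info_classif, random_state=0)
skb = SelectKBest(score_func=score_func, k=10).fit(X, y)
kept_by_mi = names[skb.get_support()]

# RFE: repeatedly fits a model and removes the weakest feature until 10 remain.
# Inputs are standardized so the logistic regression coefficients are comparable.
X_scaled = StandardScaler().fit_transform(X)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X_scaled, y)
kept_by_rfe = names[rfe.get_support()]

print("Variance Threshold kept:", len(kept_by_variance), "features")
print("SelectKBest (MI) kept:  ", list(kept_by_mi))
print("RFE kept:               ", list(kept_by_rfe))
```

In each case the result is a subset of the original column names, which is why the selected features stay directly interpretable.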

🔹 <b>2. Feature Extraction (Transform to new features)</b>

    Create new features that combine existing ones:

        PCA (Principal Component Analysis)

        t-SNE, UMAP (for visualization)

        LDA (Linear Discriminant Analysis; supervised, for classification)

        Autoencoders (Deep Learning)
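As a rough sketch of what extraction looks like in practice, the snippet below applies PCA (unsupervised) and LDA (supervised) to the same data. The dataset and the component counts are assumptions chosen for illustration. Note that LDA can produce at most one fewer component than the number of classes, so a two-class problem yields a single component.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# PCA ignores the labels and finds directions of maximum variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# LDA uses the labels and finds directions that best separate the classes.
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X_scaled, y)

print("Original shape:", X.shape)
print("PCA shape:     ", X_pca.shape)
print("LDA shape:     ", X_lda.shape)
print("Variance explained by 2 PCA components:", pca.explained_variance_ratio_.sum())
```

Unlike feature selection, the new columns are combinations of all the original features, so they no longer map to a single measured quantity.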

<h3>📌 Golden Principles</h3>

1. Don’t reduce before you understand your data. Dimensionality reduction can hide relationships and make models harder to interpret.

2. Scale before you reduce. PCA and other variance- or distance-based techniques are sensitive to feature scale, so standardize first (for example with StandardScaler).

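3. Keep enough variance. In PCA, a common rule of thumb is to keep enough components to explain about 95% of the variance; treat that as a starting point and confirm it against downstream model performance.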

4. Use reduction after cleaning, imputation, encoding, and scaling.

5. Different techniques serve different goals:

    PCA = compression

    t-SNE / UMAP = visualization

    RFE = feature ranking
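Principles 2 to 4 can be sketched as a single scikit-learn pipeline. This is a minimal example under stated assumptions: the breast cancer dataset, the 80/20 split, the 0.95 variance target, and logistic regression as the downstream model are all choices made for illustration. Fitting the scaler and PCA inside the pipeline means they only see the training data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Why scaling matters: without it, features with large numeric ranges
# dominate the first principal component.
raw_first = PCA(n_components=1).fit(X_train).explained_variance_ratio_[0]
scaled_first = (
    PCA(n_components=1)
    .fit(StandardScaler().fit_transform(X_train))
    .explained_variance_ratio_[0]
)

# Scale, then keep as many components as needed to explain 95% of the variance,
# then classify. A float n_components in (0, 1) is read as a variance target.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_
acc = accuracy_score(y_test, model.predict(X_test))

print("First-component variance share, unscaled:", raw_first)
print("First-component variance share, scaled:  ", scaled_first)
print("Components kept for 95% variance:", n_kept, "of", X.shape[1])
print("Test accuracy:", acc)
```

Comparing the two first-component shares shows how much a few large-valued features can distort an unscaled PCA.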

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score