# Principal Component Analysis (PCA)

PCA is a **dimensionality reduction** technique that transforms data into a new coordinate system:
- The first axis (**PC1**) captures the maximum variance.
- The second axis (**PC2**) captures the next highest variance (orthogonal to PC1).
- Each subsequent axis captures the largest remaining variance while staying orthogonal to all earlier axes.
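
The two properties above (mutually orthogonal axes, variance decreasing from one axis to the next) can be checked numerically. The following is a minimal sketch using scikit-learn on the same Digits data used later in this notebook; the variable names (`X_check`, `pca_check`) and the choice of 5 components are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the Digits features, then fit a small PCA (5 components is an arbitrary choice)
X_check = StandardScaler().fit_transform(load_digits().data)
pca_check = PCA(n_components=5, svd_solver="full").fit(X_check)

# Rows of components_ are the principal axes; for orthonormal axes
# their Gram matrix is the identity
gram = pca_check.components_ @ pca_check.components_.T
axes_orthonormal = np.allclose(gram, np.eye(5))
print("Axes orthonormal:", axes_orthonormal)

# The variance along successive axes should never increase
variances = pca_check.explained_variance_
variance_sorted = bool(np.all(np.diff(variances) <= 0))
print("Variance non-increasing:", variance_sorted)
```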

### Why PCA?
- Reduce dataset dimensions while keeping most information.
- Visualize high-dimensional data in 2D/3D.
- Remove noise and redundancy.

### Key Steps:
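1. Standardize the data (mean=0, variance=1 for each feature).
2. Compute the covariance matrix of the standardized features.
3. Find its eigenvalues and eigenvectors, sorted by decreasing eigenvalue.
4. Project the data onto the top eigenvectors (the principal components).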
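
The four steps above can be sketched directly in NumPy. This is an illustrative implementation, not how scikit-learn computes PCA internally (it uses an SVD-based solver); the function name `pca_from_scratch` and the synthetic `X_demo` data are assumptions made for the example.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Illustrative PCA via eigendecomposition of the covariance matrix."""
    # 1. Standardize: zero mean and unit variance per feature
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled to avoid dividing by zero
    X_std = (X - mean) / std

    # 2. Covariance matrix of the features (columns)
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigendecomposition; eigh suits symmetric matrices and returns
    #    eigenvalues in ascending order, so reverse them
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 4. Project onto the top n_components eigenvectors
    components = eigvecs[:, :n_components]
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return X_std @ components, explained_ratio

# Synthetic data: the third feature is almost a multiple of the first
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[:, 2] = 2 * X_demo[:, 0] + 0.1 * rng.normal(size=200)

Z, ratio = pca_from_scratch(X_demo, n_components=2)
print("Projected shape:", Z.shape)
print("Explained variance ratio:", ratio)
```

Because the projected columns are taken along eigenvectors of the covariance matrix, they are uncorrelated with each other, which is what makes each component carry non-redundant information.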


In [1]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [2]:
# Load dataset (Digits dataset - 8x8 images)
digits = load_digits()
X = digits.data  # Flattened images
y = digits.target

print("Original shape:", X.shape)

In [3]:
# Standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [4]:
# Apply PCA (keep 2 components for visualization)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Transformed shape:", X_pca.shape)

In [5]:
# Visualize PCA result
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=15)
plt.legend(*scatter.legend_elements(), title="Digits")
plt.title("PCA: Digits Dataset (2 Components)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()

In [6]:
# Explained variance ratio
print("Explained variance ratio (first 2 PCs):", pca.explained_variance_ratio_)
print("Total variance explained (2 PCs):", np.sum(pca.explained_variance_ratio_))

In [7]:
# PCA with more components (e.g., 50)
pca_full = PCA(n_components=50)
X_pca_full = pca_full.fit_transform(X_scaled)

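# Plot against component counts 1..50 so the x-axis matches its label
n_comps = np.arange(1, len(pca_full.explained_variance_ratio_) + 1)
plt.plot(n_comps, np.cumsum(pca_full.explained_variance_ratio_), marker='o')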
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.title("Choosing Optimal Number of Components")
plt.grid()
plt.show()
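
Instead of reading the number of components off the curve by eye, one option is to pick the smallest count whose cumulative explained variance reaches a target. The sketch below assumes a 95% target (an arbitrary choice, not something the data dictates) and compares a manual lookup with scikit-learn's built-in behaviour when `n_components` is a float between 0 and 1.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_digits().data)

threshold = 0.95  # assumed target fraction of variance to keep

# Fit with all components and accumulate the explained variance ratios
pca_all = PCA(svd_solver="full").fit(X_std)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)

# searchsorted gives the first index where cumulative >= threshold;
# add 1 to turn a zero-based index into a component count
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print("Components needed for 95% variance:", n_needed)

# A float n_components asks scikit-learn to make the same selection
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_std)
print("Components chosen by scikit-learn:", pca_auto.n_components_)
```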

### Key Takeaways:
- PCA reduces dimensionality while preserving as much variance as possible.
- It is often used as a preprocessing step before clustering or visualization.
- It helps remove noise and redundancy among correlated features.
- The explained variance ratio gives the fraction of total variance each component retains.
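
One way to make "how much information is kept" concrete is to project the data down, map it back with `inverse_transform`, and measure what was lost. The following is a rough sketch; the component counts (2, 10, 30) and the use of mean squared error on the standardized data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_digits().data)

errors = {}
for k in (2, 10, 30):
    pca_k = PCA(n_components=k, svd_solver="full").fit(X_std)
    # Compress to k dimensions, then map back to the original 64 features
    X_back = pca_k.inverse_transform(pca_k.transform(X_std))
    # Mean squared reconstruction error on the standardized data
    errors[k] = float(np.mean((X_std - X_back) ** 2))
    print(f"{k:>2} components -> reconstruction MSE: {errors[k]:.4f}")
```

The error shrinks as more components are kept, mirroring the rise of the cumulative explained variance curve.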
