# Dimensionality Reduction with PCA

High-dimensional data can be challenging for machine learning models because of:
- Higher computational cost
- Greater risk of overfitting
- Difficulty visualizing more than three dimensions

**Principal Component Analysis (PCA)** reduces the number of dimensions while retaining as much of the variance in the data as possible.

## 1. What is PCA?
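- PCA finds **new axes (principal components)**: mutually orthogonal directions, each a linear combination of the original features, along which the data varies the most.
- The first principal component explains the most variance, the second explains the most of the remaining variance while staying orthogonal to the first, and so on.
- Keeping only the first few components compresses the data while preserving its dominant patterns.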
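
To make the idea of "axes that maximize variance" concrete, here is a minimal sketch of PCA done by hand with NumPy: center the data, take the eigenvectors of the covariance matrix, and project onto the top two. The variable names (`X_np`, `X_manual`, and so on) are illustrative choices, and this is only one way to compute PCA; scikit-learn uses an SVD-based routine internally, so the comparison is made up to a sign flip per component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_np = load_iris().data                      # shape (150, 4)

# 1. Center each feature at zero mean.
X_centered = X_np - X_np.mean(axis=0)

# 2. Covariance matrix of the features (4 x 4).
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh is meant for symmetric matrices
#    and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Reorder so the largest-variance direction comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project the centered data onto the top two directions.
X_manual = X_centered @ eigvecs[:, :2]

# Compare with scikit-learn; the sign of each component is arbitrary,
# so compare absolute values.
X_sklearn = PCA(n_components=2).fit_transform(X_np)
print("Matches sklearn (up to sign):",
      np.allclose(np.abs(X_manual), np.abs(X_sklearn), atol=1e-6))
```

The eigenvalues are the variances along each new axis, which is why sorting them in descending order gives the components in order of importance.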

In [None]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

X.head()

## 2. Applying PCA
- We'll reduce the **4D Iris dataset** to **2D** for visualization.

In [None]:
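# Keep only the two components with the highest variance
pca = PCA(n_components=2)
# Fit on X (PCA centers the data internally) and project it to 2D
X_pca = pca.fit_transform(X)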

print("Original shape:", X.shape)
print("Transformed shape:", X_pca.shape)
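
All four Iris features are measured in centimeters, so running PCA on the raw values is reasonable here. When features are on very different scales, the largest-scale feature tends to dominate the components, and a common remedy is to standardize first. The sketch below shows one way to do that with a scikit-learn pipeline; the names `X_raw`, `scaled_pca`, and `X_scaled_pca` are illustrative and not part of the notebook above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data

# Standardize each feature to zero mean and unit variance, then apply PCA.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_scaled_pca = scaled_pca.fit_transform(X_raw)

print("Transformed shape:", X_scaled_pca.shape)
# make_pipeline names each step after its lowercased class name.
print("Explained variance ratio (standardized):",
      scaled_pca.named_steps["pca"].explained_variance_ratio_)
```

Whether to standardize is a modeling choice: it treats every feature as equally important, which may or may not be what you want.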

## 3. Visualizing PCA Results

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Iris Dataset')
plt.colorbar(label='Target Class')
plt.show()

## 4. Explained Variance
- A fitted PCA object reports the fraction of the total variance that each component explains in its `explained_variance_ratio_` attribute.

In [None]:
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())
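
The explained variance ratios can also guide how many components to keep. The sketch below fits PCA with all components, accumulates the ratios, and finds the smallest number of components that reaches a target. The 95% threshold is an assumed example value, not a rule, and the variable names are illustrative. scikit-learn can also do this selection itself when `n_components` is a float between 0 and 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_all = load_iris().data

# Fit with all components so every ratio is available.
pca_full = PCA().fit(X_all)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("Cumulative explained variance:", cumulative)

threshold = 0.95  # assumed target; pick what suits your problem
# Index of the first cumulative value >= threshold, converted to a count.
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print("Components needed:", n_needed)

# Equivalent shortcut: a float n_components asks PCA to keep just enough
# components to reach that share of the variance.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_all)
print("Components chosen by PCA:", pca_auto.n_components_)
```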

## ✅ Summary
- PCA reduces dimensionality while preserving as much variance as possible.
- Useful for visualization, noise reduction, and speeding up models.
- Choose `n_components` based on **explained variance**, for example by keeping enough components to reach a target cumulative ratio.
- PCA is unsupervised (doesn't use labels).

👉 Next steps: Try PCA on larger datasets (such as scikit-learn's digits dataset or MNIST) and compare how well models perform with and without the reduced dimensions.
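
As a starting point for that experiment, here is a minimal sketch on scikit-learn's digits dataset (64 features). The classifier (k-nearest neighbors), the choice of 16 components, and 5-fold cross-validation are all assumptions made for illustration; other models and settings may behave differently.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X_digits, y_digits = load_digits(return_X_y=True)   # 1797 samples x 64 features

# Baseline: classifier on all 64 features.
baseline = KNeighborsClassifier(n_neighbors=5)

# Reduced: PCA down to 16 components, then the same classifier.
# Putting PCA inside the pipeline means it is refit on each training fold,
# so no information leaks from the validation fold.
reduced = make_pipeline(
    PCA(n_components=16, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
)

baseline_acc = cross_val_score(baseline, X_digits, y_digits, cv=5).mean()
reduced_acc = cross_val_score(reduced, X_digits, y_digits, cv=5).mean()

print(f"Accuracy with 64 features:      {baseline_acc:.3f}")
print(f"Accuracy with 16 PCA components: {reduced_acc:.3f}")
```

Varying `n_components` and rerunning shows how accuracy trades off against the number of dimensions kept.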