# PCA from Scratch vs Sklearn

In this notebook, we demonstrate a Principal Component Analysis (PCA) implementation using NumPy and compare it with `scikit-learn`'s PCA.

---

### Objectives

- Understand the PCA algorithm step-by-step
- Implement PCA from scratch
- Visualize explained variance
- Compare projections from custom vs `sklearn` PCA


**Library Imports**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA as SklearnPCA
from pca import PCAFromScratch

## 1. Load and Standardize the Wine Dataset

In [None]:
X, y = load_wine(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

scaler = StandardScaler()
x_train_sd = scaler.fit_transform(x_train)
x_test_sd = scaler.transform(x_test)

## 2. Apply PCA from Scratch

We compute the eigenvectors and eigenvalues of the covariance matrix of the standardized data.

In [None]:
pca_scratch = PCAFromScratch(n_components=2)
x_train_pca = pca_scratch.fit_transform(x_train_sd)
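
The source of `PCAFromScratch` is not shown in this notebook, so the following is only a minimal sketch of the eigendecomposition approach described above, not the class's actual implementation. The helper name `pca_eig_sketch` and the synthetic `X_demo` data are assumptions made for illustration.

```python
import numpy as np

def pca_eig_sketch(X, n_components):
    """Hypothetical helper: PCA via eigendecomposition of the covariance matrix."""
    # Center each feature so the covariance is taken about the mean
    X_centered = X - X.mean(axis=0)
    # Sample covariance matrix, shape (n_features, n_features)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so the largest eigenvalue (most variance) comes first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep the leading eigenvectors and project the centered data onto them
    components = eigvecs[:, :n_components]
    return X_centered @ components, eigvals, components

# Synthetic data, used only to exercise the sketch
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
Z, eigvals, components = pca_eig_sketch(X_demo, n_components=2)
print(Z.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues are read as "explained variance".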

## 3. Explained Variance (Custom PCA)

Let's examine the amount of variance captured by each of the two components.


In [None]:
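explained = pca_scratch.explained_variance_
# Note: this normalizes by the sum of the values in explained_variance_.
# If that attribute holds only the retained components, the bars sum to 1
# rather than showing each component's share of the total variance.
plt.bar(range(1, len(explained)+1), explained / np.sum(explained))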
plt.title("Explained Variance Ratio (Custom PCA)")
plt.xlabel("Principal Components")
plt.ylabel("Ratio of Variance Explained")
plt.grid(True)
plt.show()
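
A share of *total* variance needs all eigenvalues in the denominator, not just the retained ones. The sketch below computes that ratio directly with NumPy on the same standardized training split. It does not use `PCAFromScratch`, and the variable names `all_eigvals`, `ratio`, and `cumulative` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized training split used in this notebook
X, y = load_wine(return_X_y=True)
x_train, _, _, _ = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
x_train_sd = StandardScaler().fit_transform(x_train)

# All eigenvalues of the covariance matrix, largest first
all_eigvals = np.linalg.eigvalsh(np.cov(x_train_sd, rowvar=False))[::-1]

# Share of total variance per component, and the running total
ratio = all_eigvals / all_eigvals.sum()
cumulative = np.cumsum(ratio)

print("first two ratios:", ratio[:2])
print("cumulative share of first two:", cumulative[1])
```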

## 4. Apply scikit-learn PCA for Comparison

In [None]:
pca_sklearn = SklearnPCA(n_components=2)
x_train_skpca = pca_sklearn.fit_transform(x_train_sd)

## 5. Visualize the Projections

Let's compare how well both PCA methods separate the wine classes in 2D space.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
colors = ['red', 'green', 'blue']
labels = ['Class 0', 'Class 1', 'Class 2']

# Custom PCA plot
for i, color in zip(np.unique(y_train), colors):
    axes[0].scatter(x_train_pca[y_train == i, 0], x_train_pca[y_train == i, 1],
                    color=color, label=labels[i])
axes[0].set_title("Custom PCA")
axes[0].legend()

# Sklearn PCA plot
for i, color in zip(np.unique(y_train), colors):
    axes[1].scatter(x_train_skpca[y_train == i, 0], x_train_skpca[y_train == i, 1],
                    color=color, label=labels[i])
axes[1].set_title("Sklearn PCA")
axes[1].legend()

plt.suptitle("PCA Comparison: Custom vs Sklearn")
plt.tight_layout()
plt.show()
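
Side-by-side plots give a visual comparison only. Because the sign of an eigenvector is arbitrary, two correct PCA projections can be mirror images along either axis. A numerical check therefore has to align signs first. The sketch below uses a plain NumPy eigendecomposition as a stand-in for the custom projection, since `PCAFromScratch` is not shown here. The names `Z_eig`, `Z_sk`, and `max_abs_diff` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the standardized training split used in this notebook
X, y = load_wine(return_X_y=True)
x_train, _, _, _ = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
x_train_sd = StandardScaler().fit_transform(x_train)

# Stand-in for the custom projection: top-2 eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(x_train_sd, rowvar=False))
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
Z_eig = (x_train_sd - x_train_sd.mean(axis=0)) @ top2

# Reference projection from scikit-learn
Z_sk = PCA(n_components=2).fit_transform(x_train_sd)

# Flip any column of Z_eig that points opposite to the scikit-learn column
signs = np.sign(np.sum(Z_eig * Z_sk, axis=0))
max_abs_diff = np.max(np.abs(Z_eig * signs - Z_sk))
print("largest absolute difference after sign alignment:", max_abs_diff)
```

Applying the same check to `x_train_pca` and `x_train_skpca` would test the custom class itself.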

## 🧩 Conclusion

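- Both PCA implementations reduce the standardized training data to 2 dimensions.
- The custom projection separates the three wine classes about as well as `scikit-learn`'s. The two plots may differ by a sign flip along either axis, since the sign of an eigenvector is arbitrary.
- This visual agreement supports our eigenvector-based method, though a numerical comparison of the two projections would be a stricter check.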
