# PCA From Scratch

In this notebook, I’ll implement **Principal Component Analysis (PCA)** completely from scratch, using only **NumPy**.  

**Goal:** understand *how PCA actually works under the hood*, not just how to call `sklearn.decomposition.PCA`.  

---

## Step-by-Step Implementation Plan  

### Step 1: Import & Load Data  
- Use the **Digits dataset** (64 features) or create a **synthetic dataset**.  

### Step 2: Standardization  
- Center the data (zero mean per feature).  
- (Optionally) scale to unit variance if features are on different scales.  
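
As a minimal sketch of this step, the snippet below centers a small made-up array and then optionally scales it. The toy values and the variable names (`X_centered`, `X_scaled`) are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0],
              [4.0, 500.0],
              [5.0, 400.0]])

# Centering: subtract the per-feature mean so every column averages to zero.
mean = X.mean(axis=0)
X_centered = X - mean

# Optional scaling: divide by the sample standard deviation (ddof=1 matches
# the n-1 denominator used for the covariance matrix).
std = X_centered.std(axis=0, ddof=1)
std[std == 0] = 1.0  # guard against constant features
X_scaled = X_centered / std
```

Scaling matters here because the second feature would otherwise dominate the variance purely because of its units.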

### Step 3: Compute Covariance Matrix  
- Derive the covariance manually using NumPy, where \(X_c\) is the centered data matrix (one row per sample, *n* rows):  
  \[
  \Sigma = \frac{1}{n-1} X_c^\top X_c
  \]  
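
The formula can be checked numerically. The sketch below uses random data (an assumption for illustration) and compares the manual product on centered data against `np.cov` with `rowvar=False`, which treats columns as features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # hypothetical data: 50 samples, 3 features

# The formula only holds for centered data, so subtract the column means first.
X_centered = X - X.mean(axis=0)
n_samples = X_centered.shape[0]

# Manual covariance: (1 / (n - 1)) * X_c^T X_c, shape (n_features, n_features).
cov_manual = X_centered.T @ X_centered / (n_samples - 1)

# Reference: np.cov centers internally and also divides by n - 1 by default.
cov_numpy = np.cov(X, rowvar=False)
```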

### Step 4: Eigen Decomposition  
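- Use `np.linalg.eigh` on the covariance matrix (it is symmetric, so `eigh` is preferable to `np.linalg.eig`), or `np.linalg.svd` on the centered data.  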
- Extract **eigenvalues (variance explained)** and **eigenvectors (directions)**.  
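
The two routes give the same variances. In the sketch below (random data, assumed for illustration), the eigenvalues of the covariance matrix equal the squared singular values of the centered data divided by `n - 1`. Note that `eigh` returns eigenvalues in ascending order, while `svd` returns singular values in descending order.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated data: 100 samples, 4 features.
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X_centered = X - X.mean(axis=0)
n_samples = X_centered.shape[0]

# Route 1: eigen decomposition of the covariance matrix (ascending eigenvalues).
cov = X_centered.T @ X_centered / (n_samples - 1)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Route 2: SVD of the centered data (descending singular values).
U, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
variances_from_svd = singular_values**2 / (n_samples - 1)
```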

### Step 5: Sort Eigenvalues  
- Sort in descending order.  
- Select top *k* eigenvectors (principal components).  
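
A small worked example of the sorting step, using made-up eigenpairs so the result is easy to follow by eye. The key detail is that the same index order must be applied to the eigenvalues and to the columns of the eigenvector matrix.

```python
import numpy as np

# Hypothetical eigenpairs, deliberately unsorted.
eigenvalues = np.array([0.5, 3.0, 1.5])
eigenvectors = np.eye(3)  # column i is paired with eigenvalues[i]

# Indices that sort the eigenvalues from largest to smallest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted = eigenvalues[order]
eigenvectors_sorted = eigenvectors[:, order]  # reorder columns, not rows

# Keep the top k directions.
k = 2
top_components = eigenvectors_sorted[:, :k]  # shape (n_features, k)

# Share of total variance carried by each component, and the running total.
explained_ratio = eigenvalues_sorted / eigenvalues_sorted.sum()
cumulative = np.cumsum(explained_ratio)  # [0.6, 0.9, 1.0]
```

Here the first two components carry 90% of the variance, which is the kind of threshold often used to choose *k*.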

### Step 6: Project Data  
- Transform original dataset into reduced space using the selected components.  
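
The projection is a single matrix product. The sketch below (random data and the names `W`, `Z`, `X_approx` are assumptions for illustration) projects onto the top two directions and then maps back to the original space to show what information is kept.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # hypothetical data: 20 samples, 3 features
mean = X.mean(axis=0)
X_centered = X - mean

# Eigenvectors of the covariance matrix; reverse columns for descending order.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
W = eigenvectors[:, ::-1][:, :2]  # top 2 directions, shape (3, 2)

# Projection into the reduced space: (20, 3) @ (3, 2) -> (20, 2).
Z = X_centered @ W

# Approximate reconstruction in the original space: (20, 2) @ (2, 3) -> (20, 3).
X_approx = Z @ W.T + mean
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is why eigenvalues are read as "variance explained".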

### Step 7: Compare with sklearn PCA  
- Validate results against `sklearn.decomposition.PCA`: explained variances should match, and components should agree up to a sign flip per component.  
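
Eigenvectors are only defined up to sign, so two correct implementations can return components that differ by a factor of -1. The sketch below shows one way to align signs before comparing. The helper `align_signs` is hypothetical, and an SVD-based result stands in for the reference here so the snippet needs only NumPy. When comparing against scikit-learn, keep in mind that its `components_` attribute has shape `(n_components, n_features)`, so it would need transposing first.

```python
import numpy as np

def align_signs(components, reference):
    """Flip columns of `components` so each points the same way as `reference`.

    Hypothetical helper: both arrays have shape (n_features, n_components).
    """
    # The sign of each column-wise dot product says whether the pair is flipped.
    signs = np.sign(np.sum(components * reference, axis=0))
    signs[signs == 0] = 1.0  # leave orthogonal (degenerate) columns untouched
    return components * signs

rng = np.random.default_rng(0)
# Hypothetical data with clearly separated variances per direction.
X = rng.normal(size=(60, 3)) * np.array([3.0, 2.0, 1.0])
X_centered = X - X.mean(axis=0)

# "Our" components: eigenvectors of the covariance matrix, descending order.
_, vecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
ours = vecs[:, ::-1]

# Stand-in reference: right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
reference = Vt.T

aligned = align_signs(ours, reference)
```

After alignment, `np.allclose(aligned, reference)` is a fair equality check. Comparing absolute values instead would also work but hides genuine sign errors in downstream projections.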

---

By the end, we’ll have a **working PCA implementation from scratch** and confirm that it matches scikit-learn’s PCA (up to the sign of each component).

In [1]:
import numpy as np

class MyPCA:
    """
    A simple PCA (Principal Component Analysis) implementation from scratch using NumPy.

    Parameters
    ----------
    n_components : int, optional
        Number of principal components to keep.
        If None, all components are kept.
    """

    def __init__(self, n_components=None):
        self.n_components = n_components
        self.components_ = None
        self.explained_variance_ = None
        self.mean_ = None

    def fit(self, X):
        """
        Fit the PCA model on dataset X.

        Steps:
        1. Center the data (subtract mean).
        2. Compute covariance matrix.
        3. Perform eigen decomposition.
        4. Sort eigenvalues & eigenvectors.
        5. Keep the top n_components.
        """

        # Step 1: Center the data
        self.mean_ = np.mean(X, axis=0)
        X_centered = X - self.mean_

        # Step 2: Covariance matrix
        cov_matrix = np.cov(X_centered, rowvar=False)

        # Step 3: Eigen decomposition (since covariance is symmetric, use eigh)
        eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

        # Step 4: Sort eigenvalues (largest -> smallest)
        sorted_idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[sorted_idx]
        eigenvectors = eigenvectors[:, sorted_idx]

        # Step 5: Store top n_components
        if self.n_components is not None:
            eigenvectors = eigenvectors[:, :self.n_components]
            eigenvalues = eigenvalues[:self.n_components]

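        # Columns are the principal directions: shape (n_features, n_components).
        # Note: scikit-learn stores the transpose, (n_components, n_features).
        self.components_ = eigenvectors
        self.explained_variance_ = eigenvalues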
        return self

    def transform(self, X):
        """
        Project the dataset X onto the principal components.
        """
        X_centered = X - self.mean_
        return np.dot(X_centered, self.components_)

    def fit_transform(self, X):
        """
        Fit the model and return the transformed dataset in one step.
        """
        return self.fit(X).transform(X)

## Compare MyPCA vs Scikit-learn PCA