# Principal Component Analysis (PCA) Implementation from Scratch

## Introduction to PCA

Principal Component Analysis (PCA) is a dimensionality reduction method that transforms a large set of possibly correlated variables into a smaller set of uncorrelated variables, the principal components, while retaining as much of the original variance as possible.

## Steps of PCA Algorithm

1. Standardize the dataset so that each feature has mean 0 and standard deviation 1.
2. Compute the covariance matrix of the standardized data.
3. Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvalues in descending order and reorder the eigenvectors to match.
5. Select the eigenvectors belonging to the k largest eigenvalues as the principal components.
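
The steps above leave open how to choose k. One common heuristic is to look at the share of total variance carried by each eigenvalue. The sketch below is illustrative only: the helper name `explained_variance_ratio` is an assumption, not part of the original notebook, and it assumes no feature is constant.

```python
import numpy as np


def explained_variance_ratio(data):
    """Hypothetical helper: fraction of total variance carried by each component."""
    data = np.asarray(data, dtype=float)
    # Standardize each feature (assumes no feature has zero standard deviation)
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # eigvalsh returns eigenvalues in ascending order, so reverse them
    eigenvalues = np.linalg.eigvalsh(np.cov(standardized, rowvar=False))[::-1]
    # Clip tiny negative values caused by floating-point round-off
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    return eigenvalues / eigenvalues.sum()


ratios = explained_variance_ratio([[4, 2, 1], [5, 6, 7], [9, 12, 1], [4, 6, 7]])
print(np.round(ratios, 4))   # per-component share of variance
print(np.cumsum(ratios))     # cumulative share, useful for picking k
```

A typical rule of thumb is to keep the smallest k whose cumulative share exceeds a chosen threshold such as 0.9, though the right threshold depends on the application.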

## Interpretation

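- Principal components are orthogonal directions in feature space, ordered so that the first captures the largest variance in the data, the second the largest remaining variance, and so on.
- Keeping only the first few components reduces dimensionality, which also makes high-dimensional data easier to visualize (for example with k = 2 or 3).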
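
The variance-maximizing property can be checked numerically. The sketch below uses synthetic data (an assumption made purely for illustration) and compares the variance along the first principal component with the variance along many random unit directions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, correlated 2-D data; purely illustrative
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + 0.3 * rng.normal(size=200)])
centered = data - data.mean(axis=0)

cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
first_pc = eigenvectors[:, -1]  # eigh returns eigenvalues in ascending order

# Variance of the data projected onto the first principal component
pc_variance = np.var(centered @ first_pc, ddof=1)

# Variance of the data projected onto 1000 random unit directions
directions = rng.normal(size=(1000, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
random_variances = np.var(centered @ directions.T, axis=0, ddof=1)

print(np.isclose(pc_variance, eigenvalues[-1]))
print(random_variances.max() <= pc_variance + 1e-12)
```

Both checks are expected to print `True`: the variance along the first component equals the largest eigenvalue, and no other direction exceeds it.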


In [7]:

import numpy as np


def pca(data, k):
    """
    Perform PCA from scratch on standardized features (mean 0, std 1).

    Parameters:
    - data (np.ndarray): shape (n_samples, n_features)
    - k (int): number of principal components to return

    Returns:
    - principal_components (np.ndarray): shape (n_features, k); each column
      is an eigenvector of the covariance matrix, ordered by descending
      eigenvalue and rounded to 4 decimals
    """
    # 1. Convert to float, then standardize each feature (mean 0, std 1)
    data = np.asarray(data, dtype=float)
    mean = np.mean(data, axis=0)
    std_dev = np.std(data, axis=0)
    std_dev[std_dev == 0] = 1.0  # constant features: avoid division by zero
    standardized_data = (data - mean) / std_dev

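    # 2. Compute covariance matrix (columns are features, hence rowvar=False).
    #    np.cov divides by n - 1 while np.std above divides by n; this only
    #    rescales the matrix and leaves the eigenvectors unchanged.
    covariance_matrix = np.cov(standardized_data, rowvar=False)

    # 3. Eigen-decomposition (eigh suits symmetric matrices and returns
    #    eigenvalues in ascending order)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)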

    # 4. Sort eigenvalues/eigenvectors in descending order of eigenvalues
    sorted_indices = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, sorted_indices]

    # 5. Keep top k eigenvectors
    principal_components = eigenvectors[:, :k]

    # 6. Fix signs for consistency: eigenvectors are only defined up to sign,
    #    so flip any eigenvector whose largest absolute-value entry is negative
    #    (optional, but makes the output reproducible)
    for i in range(principal_components.shape[1]):
        col = principal_components[:, i]
        if col[np.argmax(np.abs(col))] < 0:
            principal_components[:, i] *= -1

    # 7. Round to 4 decimals
    return np.round(principal_components, 4)


# Example usage
data = np.array([[1, 2], [3, 4], [5, 6]])
k = 1
principal_components = pca(data, k)
print("Principal Components:\n", principal_components)
print(pca(np.array([[4, 2, 1], [5, 6, 7], [9, 12, 1], [4, 6, 7]]), 2))


Principal Components:
 [[0.7071]
 [0.7071]]
[[ 0.6855  0.0776]
 [ 0.6202  0.4586]
 [-0.3814  0.8853]]
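
The `pca` function above returns only the component directions. To actually reduce the dimensionality of the data, the standardized samples still have to be projected onto those directions. A minimal sketch of that last step follows; the helper name `pca_transform` is hypothetical, and the code assumes no feature is constant.

```python
import numpy as np


def pca_transform(data, k):
    """Hypothetical helper: project standardized data onto its top-k components."""
    data = np.asarray(data, dtype=float)
    # Standardize each feature (assumes no feature has zero standard deviation)
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(standardized, rowvar=False))
    # Indices of eigenvalues from largest to smallest
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:k]]  # shape (n_features, k)
    # Each row becomes the sample's coordinates along the k components
    return standardized @ components         # shape (n_samples, k)


data = np.array([[4, 2, 1], [5, 6, 7], [9, 12, 1], [4, 6, 7]])
reduced = pca_transform(data, 2)
print(reduced.shape)
```

The projected columns are uncorrelated with each other, which is what makes the reduced representation convenient for plotting or as input to later models. Unlike `pca`, this sketch applies no sign fix, so individual columns may come out negated depending on the solver.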
