# Principal Component Analysis (PCA)

**Principal Component Analysis (PCA)** is a dimensionality reduction technique used in machine learning and statistics. It transforms high-dimensional data into a lower-dimensional form while retaining as much variance (information) as possible.

PCA is particularly useful when:

  + You have many features and want to reduce computational complexity
  + You need to visualize high-dimensional data in 2D or 3D
  + Features are correlated and you want to eliminate redundancy
  + You want to reduce noise and improve model performance

In this chapter, we will explore the mathematical foundations of PCA, connecting to concepts from linear algebra and calculus, and demonstrate how to apply PCA in practice.


## Why Dimensionality Reduction?

In many real-world datasets, we encounter the "curse of dimensionality": as the number of features increases,

  + **Computational cost** rises, and the amount of data needed to cover the feature space grows exponentially
  + **Visualization** can no longer be done directly beyond 3 dimensions
  + **Model complexity** increases, leading to overfitting
  + **Feature correlation** may introduce redundancy

PCA addresses these challenges by finding a new set of uncorrelated variables (principal components) that capture the maximum variance in the data.


## Mathematical Foundations of PCA

PCA relies on concepts from linear algebra and calculus. Understanding the mathematical foundation will help you appreciate how PCA works and when to use it.

### Variance and Covariance

**Variance** measures how much a single variable spreads out from its mean:

$$
\text{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2
$$

**Covariance** measures how two variables change together:

$$
\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
$$

Where:

  + $x_i$ and $y_i$ are individual data points
  + $\bar{x}$ and $\bar{y}$ are the means of $X$ and $Y$
  + $n$ is the number of observations

The **covariance matrix** $\Sigma$ for a dataset with $p$ features is a $p \times p$ symmetric matrix where:

$$
\Sigma = \begin{bmatrix}
\text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_p) \\
\text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_p) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(X_p, X_1) & \text{Cov}(X_p, X_2) & \cdots & \text{Var}(X_p)
\end{bmatrix}
$$
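As a quick numerical check, the sketch below computes the sample variance and covariance directly from the formulas above and compares them with NumPy's `np.cov`. The two short arrays are made-up illustrative values, not data from this chapter.

```python
import numpy as np

# Made-up illustrative data: two features, five observations.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

# Sample variance and covariance written out from the formulas (n - 1 denominator).
var_x = np.sum((x - x.mean()) ** 2) / (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov treats each row as a variable by default and also divides by n - 1.
sigma = np.cov(np.vstack([x, y]))

print(var_x, cov_xy)
print(sigma)
```

The diagonal of `sigma` holds the variances and the off-diagonal entries hold the covariance, matching the layout of $\Sigma$ above.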

### Eigenvalues and Eigenvectors

PCA finds the directions of maximum variance by computing the eigenvalues and eigenvectors of the covariance matrix.

For a square matrix $A$, a vector $\mathbf{v}$ is an **eigenvector** and $\lambda$ is its corresponding **eigenvalue** if:

$$
A\mathbf{v} = \lambda\mathbf{v}
$$

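This means that when matrix $A$ is applied to eigenvector $\mathbf{v}$, the vector is only scaled by $\lambda$: it stays on the same line through the origin (and is flipped if $\lambda$ is negative). Covariance matrices have no negative eigenvalues, so in PCA the direction is always preserved.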

To find eigenvalues, we solve the **characteristic equation**:

$$
\det(A - \lambda I) = 0
$$

Where:

  + $\det$ is the determinant
  + $I$ is the identity matrix
  + $\lambda$ are the eigenvalues that satisfy this equation

**In the context of PCA:**

  + **Eigenvectors** of the covariance matrix represent the directions (principal components) of maximum variance
  + **Eigenvalues** represent the amount of variance explained by each principal component
  + Larger eigenvalues correspond to more important principal components
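To make the definition concrete, here is a minimal sketch that computes the eigenpairs of a small symmetric matrix with `np.linalg.eigh` and verifies $A\mathbf{v} = \lambda\mathbf{v}$ for each pair. The matrix is hand-picked for illustration; its characteristic equation is $(2-\lambda)^2 - 1 = 0$, with roots $\lambda = 1$ and $\lambda = 3$.

```python
import numpy as np

# A small symmetric matrix, chosen by hand for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices. Eigenvalues come back in ascending
# order, and the eigenvectors are the columns of the second return value.
eigenvalues, eigenvectors = np.linalg.eigh(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # A @ v should equal lam * v up to floating-point error.
    print(lam, np.allclose(A @ v, lam * v))
```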


## The PCA Algorithm

PCA follows these steps to transform the data:

### Step 1: Standardize the Data

First, we center the data by subtracting the mean from each feature:

$$
X_{\text{centered}} = X - \mu
$$

Often, we also scale the data to unit variance (z-score normalization):

$$
X_{\text{standardized}} = \frac{X - \mu}{\sigma}
$$

Where $\mu$ is the mean vector and $\sigma$ is the standard deviation vector.

**Why standardize?** Variance depends on units, so features measured on larger scales would dominate the principal components. Standardization puts all features on a comparable scale, so that no feature dominates merely because of its units.
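Step 1 can be sketched in a few lines of NumPy. The data here are randomly generated placeholders with two features on deliberately different scales, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, two features on very different scales.
X = np.column_stack([rng.normal(50.0, 10.0, size=100),
                     rng.normal(0.5, 0.01, size=100)])

mu = X.mean(axis=0)               # mean of each feature
sigma = X.std(axis=0, ddof=1)     # sample standard deviation of each feature

X_centered = X - mu
X_standardized = (X - mu) / sigma

print(X_standardized.mean(axis=0))          # approximately 0 for each feature
print(X_standardized.std(axis=0, ddof=1))   # approximately 1 for each feature
```

Note that scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`) rather than the sample one, so its output differs from this sketch by a constant factor per feature. That factor does not change the principal directions.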

### Step 2: Compute the Covariance Matrix

Calculate the covariance matrix of the centered or standardized data (for standardized data, this is the correlation matrix of the original features):

$$
\Sigma = \frac{1}{n-1} X^T X
$$

Where $X$ is the centered/standardized data matrix ($n \times p$), with $n$ samples and $p$ features.
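The matrix formula can be checked against `np.cov` in a short sketch. The data are random placeholders, and `rowvar=False` tells NumPy that columns, not rows, are the variables.

```python
import numpy as np

rng = np.random.default_rng(1)
X_raw = rng.normal(size=(200, 3))     # hypothetical data: n = 200, p = 3
X = X_raw - X_raw.mean(axis=0)        # center each column

n = X.shape[0]
Sigma = X.T @ X / (n - 1)             # p x p covariance matrix

# np.cov centers the data itself, so it can be given the raw matrix.
print(np.allclose(Sigma, np.cov(X_raw, rowvar=False)))
```

The formula only gives the covariance matrix when $X$ has been centered first; applied to uncentered data, $X^T X$ also picks up the feature means.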

### Step 3: Compute Eigenvalues and Eigenvectors

Solve the eigenvalue problem for the covariance matrix:

$$
\Sigma \mathbf{v}_i = \lambda_i \mathbf{v}_i
$$

This yields $p$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ and corresponding eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_p$.

### Step 4: Sort Eigenvalues and Select Principal Components

Sort eigenvalues in descending order:

$$
\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p
$$

Select the top $k$ eigenvectors corresponding to the $k$ largest eigenvalues. These form the **principal components**.
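Steps 3 and 4 might look as follows in NumPy. Because `np.linalg.eigh` returns eigenvalues in ascending order, an explicit reordering is needed. The covariance matrix and the choice $k = 2$ are illustrative assumptions.

```python
import numpy as np

# A hand-picked symmetric covariance matrix for illustration.
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 0.0],
                  [0.0, 0.0, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)   # ascending order

# Reorder so the largest eigenvalue (and its eigenvector) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

k = 2
W_k = eigenvectors[:, :k]   # p x k matrix of the top-k eigenvectors

print(eigenvalues)
print(W_k.shape)
```

The eigenvectors must be reordered with the same index array as the eigenvalues, otherwise the pairs get mismatched.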

### Step 5: Transform the Data

Project the standardized data onto the selected principal components:

$$
Z = X W_k
$$

Where:

  + $Z$ is the transformed data ($n \times k$)
  + $X$ is the standardized original data ($n \times p$)
  + $W_k$ is the matrix of $k$ selected eigenvectors ($p \times k$)
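Putting the five steps together gives a compact from-scratch sketch. The function name `pca_project` and the synthetic data are hypothetical, introduced only for illustration; this is one straightforward way to write the algorithm, not an optimized implementation.

```python
import numpy as np

def pca_project(X, k):
    """Hypothetical helper: project X (n x p) onto its top-k principal components."""
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # Step 1: standardize
    Sigma = X_std.T @ X_std / (X_std.shape[0] - 1)         # Step 2: covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(Sigma)      # Step 3: eigendecomposition
    order = np.argsort(eigenvalues)[::-1]                  # Step 4: sort descending
    W_k = eigenvectors[:, order[:k]]
    return X_std @ W_k, eigenvalues[order]                 # Step 5: project

rng = np.random.default_rng(2)
# Hypothetical correlated data: the third feature is nearly the sum of the first two.
A = rng.normal(size=(150, 2))
X = np.column_stack([A, A.sum(axis=1) + 0.1 * rng.normal(size=150)])

Z, eigenvalues = pca_project(X, k=2)
print(Z.shape)        # (150, 2)
print(eigenvalues)
```

The sample variance of each column of `Z` equals the corresponding eigenvalue, which is a convenient sanity check on any implementation.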


## Variance Explained

An important aspect of PCA is understanding how much information (variance) is retained after dimensionality reduction.

The **proportion of variance explained** by the $i$-th principal component is:

$$
\text{Variance Explained}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}
$$

The **cumulative variance explained** by the first $k$ components is:

$$
\text{Cumulative Variance}_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j}
$$

**Rule of thumb:** Select the smallest number of principal components whose cumulative variance explained reaches a chosen threshold, commonly somewhere between 80% and 95% of the total variance.
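The two formulas and the rule of thumb translate into a few lines of NumPy. The eigenvalues and the 90% threshold below are made-up values for illustration.

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
eigenvalues = np.array([4.0, 2.5, 1.5, 1.2, 0.5, 0.3])

explained = eigenvalues / eigenvalues.sum()   # proportion per component
cumulative = np.cumsum(explained)             # cumulative proportion

threshold = 0.90
# Index of the first component at which the cumulative proportion
# reaches the threshold, converted to a count of components.
k = int(np.argmax(cumulative >= threshold)) + 1

print(explained)
print(cumulative)
print(k)
```

With these values the cumulative proportions are roughly 0.40, 0.65, 0.80, 0.92, 0.97 and 1.00, so four components are needed to reach 90%.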


## Geometric Interpretation

Geometrically, PCA can be understood as:

1. **Finding new axes:** The principal components represent new orthogonal (perpendicular) axes in the feature space
2. **Rotating the data:** PCA rotates the data so that the maximum variance lies along the first axis (PC1), the second maximum variance along the second axis (PC2), and so on
3. **Projection:** The transformed data are the projections of the original data points onto these new axes

This rotation aligns the data with the directions of maximum variance, making it easier to identify patterns and reduce dimensionality.
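The geometric picture can be verified numerically: the eigenvector matrix is orthogonal, and in the new coordinates the features are uncorrelated. The 2-D data below are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 2-D data with correlated features.
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.5], [1.5, 1.0]],
                            size=500)
X = X - X.mean(axis=0)

Sigma = np.cov(X, rowvar=False)
eigenvalues, W = np.linalg.eigh(Sigma)
eigenvalues, W = eigenvalues[::-1], W[:, ::-1]   # put PC1 first

Z = X @ W   # coordinates of each point along the new axes

# The new axes are orthonormal, so W^T W is the identity.
print(np.allclose(W.T @ W, np.eye(2)))
# In the new coordinates the covariance matrix is diagonal.
print(np.allclose(np.cov(Z, rowvar=False), np.diag(eigenvalues)))
```

Strictly speaking, an orthogonal matrix may include a reflection as well as a rotation, since the sign of each eigenvector is arbitrary. This does not affect the variances along the axes.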


## Connection to Calculus: Optimization Perspective

PCA can also be viewed as an optimization problem. We want to find the direction $\mathbf{w}$ that maximizes the variance of the projected data:

$$
\max_{\mathbf{w}} \mathbf{w}^T \Sigma \mathbf{w} \quad \text{subject to} \quad \|\mathbf{w}\| = 1
$$

Using **Lagrange multipliers** from calculus, we form the Lagrangian:

$$
L(\mathbf{w}, \lambda) = \mathbf{w}^T \Sigma \mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)
$$

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

$$
\frac{\partial L}{\partial \mathbf{w}} = 2\Sigma\mathbf{w} - 2\lambda\mathbf{w} = 0
$$

This simplifies to:

$$
\Sigma\mathbf{w} = \lambda\mathbf{w}
$$

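This is exactly the eigenvalue equation, so every stationary point is an eigenvector of $\Sigma$. Multiplying both sides on the left by $\mathbf{w}^T$ and using $\mathbf{w}^T\mathbf{w} = 1$ gives $\mathbf{w}^T \Sigma \mathbf{w} = \lambda$: the projected variance equals the eigenvalue. The maximum is therefore attained at the eigenvector corresponding to the largest eigenvalue. Subsequent principal components are found by maximizing variance in directions orthogonal to previous components.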
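A numerical sanity check of this result: sample many random unit vectors, evaluate the objective $\mathbf{w}^T \Sigma \mathbf{w}$ for each, and confirm that none exceeds the largest eigenvalue. The covariance matrix and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])     # hand-picked covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
lambda_max = eigenvalues[-1]       # eigh returns ascending order
w_star = eigenvectors[:, -1]       # eigenvector for the largest eigenvalue

# Sample random directions and normalize each row to unit length.
W = rng.normal(size=(10_000, 2))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Objective w^T Sigma w evaluated for every sampled direction.
variances = np.einsum("ij,jk,ik->i", W, Sigma, W)

print(variances.max() <= lambda_max + 1e-12)
print(np.isclose(w_star @ Sigma @ w_star, lambda_max))
```

Random sampling is only a check, not a proof, but it illustrates that the eigenvector attains a value no sampled direction beats.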


## PCA in Python

In Python, we can implement PCA using:

  + **Manual implementation:** Using NumPy to compute covariance matrix, eigenvalues, and eigenvectors
  + **Scikit-learn:** Using the [`PCA` class](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) for efficient implementation

### Basic Usage with Scikit-learn

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example data (an assumption for illustration); any array of shape
# (n_samples, n_features) can be used in its place
X = load_iris().data

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions
X_pca = pca.fit_transform(X_scaled)

# Access explained variance
print("Variance explained by each component:")
print(pca.explained_variance_ratio_)

# Access principal components (eigenvectors), one per row
print("\nPrincipal components:")
print(pca.components_)
```
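Instead of fixing the number of components, scikit-learn's `PCA` also accepts a fraction for `n_components` and then keeps as many components as are needed to reach that share of the variance. The sketch below assumes the Iris dataset purely as a convenient example; `svd_solver="full"` is passed explicitly because the fractional form is documented for that solver.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example dataset (an assumption for illustration): 150 samples, 4 features.
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

print(pca.n_components_)                        # number of components kept
print(pca.explained_variance_ratio_.cumsum())   # cumulative variance explained

# Map the reduced data back to the standardized feature space.
# The reconstruction is lossy because some components were dropped.
X_approx = pca.inverse_transform(X_pca)
print(X_approx.shape)
```

`inverse_transform` is useful for judging how much detail the discarded components carried, for example by comparing `X_approx` with `X_scaled`.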


## When to Use PCA

PCA is a good fit for the following tasks:

  + **Reducing computational cost:** With hundreds or thousands of features
  + **Visualization:** Reducing to 2-3 dimensions for plotting
  + **Removing multicollinearity:** When features are highly correlated
  + **Noise reduction:** As minor components often represent noise
  + **Feature extraction:** Creating new features that capture the most important patterns

**Important considerations:**

  + PCA assumes **linear relationships** between features
  + The transformed features (principal components) are **linear combinations** of original features, which may be harder to interpret
  + PCA is sensitive to **scaling**, so standardize first whenever features are measured in different units or on different scales
  + PCA is an **unsupervised** technique: it does not consider the target variable
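The sensitivity to scaling is easy to demonstrate. In the sketch below, two independent random features differ in scale by a factor of 100; the helper `first_component` is hypothetical and simply returns the leading eigenvector of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical data: two independent features, one with a much larger scale.
X = np.column_stack([rng.normal(0.0, 100.0, size=300),
                     rng.normal(0.0, 1.0, size=300)])

def first_component(data):
    """Hypothetical helper: unit eigenvector with the largest eigenvalue."""
    _, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return vecs[:, -1]   # eigh sorts ascending, so the last column is PC1

pc1_raw = first_component(X)
pc1_std = first_component((X - X.mean(axis=0)) / X.std(axis=0, ddof=1))

print(np.abs(pc1_raw))   # almost entirely the large-scale feature
print(np.abs(pc1_std))   # equal weight on both features
```

Without standardization, the first component points almost exactly along the large-scale feature, even though neither feature is more informative than the other.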


## Applications of PCA

PCA is widely used across various domains:

  + **Image Processing:** Facial recognition, image compression
  + **Genomics:** Analyzing gene expression data
  + **Finance:** Portfolio risk analysis, detecting patterns in stock prices
  + **Natural Language Processing:** Topic modeling, document similarity
  + **Computer Vision:** Object detection, image reconstruction


## Summary

Principal Component Analysis is a powerful dimensionality reduction technique that:

  + Transforms data into a new coordinate system where axes represent directions of maximum variance
  + Uses eigenvectors and eigenvalues of the covariance matrix to identify these directions
  + Can be understood through optimization using Lagrange multipliers from calculus
  + Helps reduce computational complexity, visualize data, and eliminate feature redundancy
  + Requires standardization and careful selection of the number of components to retain

In the next section, we will explore practical implementations of PCA, both from scratch and using Python libraries.