# What is PCA?

**Principal Component Analysis (PCA)** is a linear dimensionality-reduction technique: it projects data onto fewer dimensions while retaining as much of the variance (information) as possible.

It works by:
1. Centering the data (subtracting the mean)
2. Computing the covariance matrix
3. Finding the eigenvectors and eigenvalues of the covariance matrix
4. Sorting the eigenvectors by eigenvalue in descending order (highest variance first)
5. Projecting the data onto the top principal components

PCA transforms the data into a new coordinate system where the axes (called principal components) are ordered by importance. It is widely used in machine learning, data visualization, and noise reduction.
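
The five steps above can be sketched as a single function. This is a minimal illustration, not a reference implementation: the function name `pca_top_k`, the parameter `k`, and the synthetic 3D data are assumptions made for this example.

```python
import numpy as np

def pca_top_k(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)            # 1. center each feature
    cov = np.cov(X_centered, rowvar=False)     # 2. covariance (columns are features)
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigen decomposition of a symmetric matrix
    order = np.argsort(eigvals)[::-1]          # 4. sort by descending eigenvalue
    components = eigvecs[:, order[:k]]         #    keep the top-k directions
    return X_centered @ components, eigvals[order]  # 5. project

# Hypothetical data: three features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

scores, variances = pca_top_k(X, k=2)
print(scores.shape)   # (200, 2)
print(variances)      # eigenvalues, largest first
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is what "ordered by importance" means in practice.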

# PCA via Linear Algebra 


Before using libraries like `scikit-learn`, it's important to understand how **Principal Component Analysis (PCA)** works under the hood. In this example, we manually implement PCA using only NumPy and linear algebra concepts, such as:

- Centering the data
- Covariance matrix
- Eigenvalues and eigenvectors
- Projection using dot products

This helps us connect the math to the code.


### Step 1–2: Generate and Center the Data

We first create a 2D dataset with a clear linear pattern.  
Then we **center** the data by subtracting the mean from each column.  
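This step is necessary because PCA describes **variation around the mean**: the covariance matrix is built from deviations from the mean, and projecting uncentered data would mix the location of the data cloud with its spread.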

Mathematically, with $\mu$ the vector of column means:
$$
X_{\text{centered}} = X - \mu
$$
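
As a quick sanity check of the formula, the small array below (values chosen arbitrarily for illustration) shows that each column has zero mean after centering.

```python
import numpy as np

# Tiny hand-made dataset: 3 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

mu = X.mean(axis=0)      # per-column means: 2 and 30
X_centered = X - mu      # broadcasting subtracts mu from every row

print(mu)
print(X_centered.mean(axis=0))   # both column means are now 0
```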


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [None]:

# Step 1: Generate synthetic 2D data
np.random.seed(0)
x = 2 * np.random.rand(100, 1)
y = 0.5 * x + np.random.randn(100, 1) * 0.2
data = np.hstack((x, y))

# Step 2: Center the data
mean = np.mean(data, axis=0)
data_centered = data - mean



### Step 3: Compute the Covariance Matrix

The covariance matrix captures how the two dimensions (features) vary together.

In 2D:
$$
\text{Cov}(X) = \begin{bmatrix}
\text{Var}(x_1) & \text{Cov}(x_1, x_2) \\
\text{Cov}(x_2, x_1) & \text{Var}(x_2)
\end{bmatrix}
$$

We compute it with `np.cov(data_centered.T)`. In `data_centered`, each row is a sample and each column is a feature, but `np.cov` by default expects each row to be a variable, so we pass the transpose.
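
To connect the matrix above to code, the sketch below compares `np.cov` with the explicit formula $X_c^T X_c / (n - 1)$ on random data. The data and the variable names are assumptions made for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # 50 samples, 2 features
X_centered = X - X.mean(axis=0)

n = X_centered.shape[0]
cov_manual = X_centered.T @ X_centered / (n - 1)   # unbiased estimate, divides by n - 1
cov_numpy = np.cov(X_centered.T)                   # rows of the argument are features

print(np.allclose(cov_manual, cov_numpy))          # True
```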


In [None]:
# Step 3: Covariance matrix
cov_matrix = np.cov(data_centered.T)

### Step 4–5: Eigenvalues and Eigenvectors

We compute the **eigenvalues** and **eigenvectors** of the covariance matrix. Each eigenvector represents a direction in the data space, and the corresponding eigenvalue tells us **how much variance** is in that direction.
PCA keeps the directions (eigenvectors) with the largest eigenvalues.
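
A small worked example may help here. For the hand-picked symmetric matrix below (an assumption for illustration, not the covariance of our dataset), the eigenvalues are 1 and 3, so the leading direction accounts for 75% of the total variance.

```python
import numpy as np

cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# Each column v of eigvecs satisfies cov @ v == lambda * v
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(cov @ v, lam * v)

# Share of the total variance carried by each direction, largest first
explained_ratio = eigvals[::-1] / eigvals.sum()
print(eigvals)           # eigenvalues 1 and 3
print(explained_ratio)   # 0.75 and 0.25
```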

In [None]:
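# Step 4: Eigen decomposition
# np.linalg.eigh is intended for symmetric matrices such as a covariance matrix
# and always returns real eigenvalues (in ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)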

# Step 5: Sort eigenvalues and eigenvectors in descending order
idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]


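### Step 6–8: Projecting the Data and Plotting

Projecting the centered data onto the first principal component $v_1$ gives one coordinate per sample, $z = (X - \mu)\,v_1$. Multiplying $z$ by $v_1^T$ and adding the mean back places the projected points on the PC1 line in the original 2D space, which is what we plot.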

In [None]:
# Step 6: Projection onto the first principal component
pc1 = eigenvectors[:, 0]
projected = np.dot(data_centered, pc1.reshape(-1, 1))

# Step 7: Recover projected data in original space (for visualization)
projected_2D = np.dot(projected, pc1.reshape(1, -1)) + mean

# Step 8: Plot
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0], data[:, 1], alpha=0.4, label='Original data')
plt.scatter(projected_2D[:, 0], projected_2D[:, 1], color='orange', alpha=0.8, label='PCA projection', marker='x')
plt.quiver(*mean, *(pc1 * 3), color='purple', angles='xy', scale_units='xy', scale=1, width=0.01, label='First PC')
plt.scatter(*mean, color='black', s=60, label='Data mean')

plt.xlabel("x")
plt.ylabel("y")
plt.title("Understanding PCA via Linear Algebra")
plt.legend()
plt.grid(True)
plt.axis('equal')
plt.tight_layout()
plt.show()

## Summary: What Did We Learn?

By manually applying PCA:

- We centered the data using the mean
- We computed the covariance matrix
- We extracted the principal directions using eigenvectors
- We projected the data using dot products

This gives a deeper understanding of what PCA really does: it finds the direction (a new axis) along which the data varies the most.
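
The claim that the first principal component captures the most variance can be checked numerically. The sketch below rebuilds the same synthetic dataset and compares the variance along PC1 with the variance along unit vectors at many other angles. The angle grid is an arbitrary choice for this illustration.

```python
import numpy as np

# Same synthetic data as in the manual implementation
np.random.seed(0)
x = 2 * np.random.rand(100, 1)
y = 0.5 * x + np.random.randn(100, 1) * 0.2
data_centered = np.hstack((x, y))
data_centered -= data_centered.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data_centered.T))
pc1 = eigenvectors[:, -1]          # eigh sorts ascending, so the last column is PC1

# Variance of the data projected onto unit vectors at 181 angles in [0, pi]
angles = np.linspace(0, np.pi, 181)
directions = np.column_stack((np.cos(angles), np.sin(angles)))
variances = np.var(data_centered @ directions.T, axis=0, ddof=1)

# The variance along PC1 equals the largest eigenvalue ...
print(np.isclose(np.var(data_centered @ pc1, ddof=1), eigenvalues[-1]))   # True
# ... and no other direction exceeds it
print(variances.max() <= eigenvalues[-1] + 1e-12)                          # True
```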

This understanding is crucial for interpreting PCA results in machine learning and data science.

# PCA using scikit-learn 

In this section, we apply **Principal Component Analysis (PCA)** using the `scikit-learn` library. Unlike our previous implementation, where we performed every step manually with NumPy, `scikit-learn` performs all of the steps in a single call. This is useful when building machine learning pipelines or preprocessing real-world data.


In [None]:
# Step 1: Generate synthetic 2D data
np.random.seed(0)
x = 2 * np.random.rand(100, 1)
y = 0.5 * x + np.random.randn(100, 1) * 0.2
data = np.hstack((x, y))

# Step 2: Apply PCA using scikit-learn
pca = PCA(n_components=1)
projected = pca.fit_transform(data)  # this returns the 1D projection
projected_back = pca.inverse_transform(projected)  # back to 2D for visualization

# Step 3: Extract the principal component vector
pc1 = pca.components_[0]
mean = pca.mean_

# Step 4: Plot
plt.figure(figsize=(8, 6))
plt.scatter(data[:, 0], data[:, 1], alpha=0.4, label='Original data')
plt.scatter(projected_back[:, 0], projected_back[:, 1], color='orange', alpha=0.8, label='PCA projection', marker='x')
plt.quiver(*mean, *(pc1 * 3), color='purple', angles='xy', scale_units='xy', scale=1, width=0.01, label='PC1')
plt.scatter(*mean, color='black', s=60, label='Data mean')

plt.xlabel("x")
plt.ylabel("y")
plt.title("PCA using scikit-learn")
plt.legend()
plt.grid(True)
plt.axis('equal')
plt.tight_layout()
plt.show()


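### How does scikit-learn's `PCA` work internally?

Conceptually, `PCA.fit_transform(data)` performs the same steps as the manual implementation:

1. **Centering**: Subtracts the mean from each column of the data (the features are centered but not scaled)
2. **Covariance calculation**: Measures how the features vary together
3. **Eigen decomposition**: Extracts the principal directions and their variances
4. **Sorting**: Orders the components from highest to lowest variance
5. **Projection**: Projects the data onto the top `n_components`

In practice, scikit-learn usually obtains the components from a singular value decomposition (SVD) of the centered data rather than from an explicit eigen decomposition of the covariance matrix; the exact method depends on the `svd_solver` parameter and the library version. Both routes yield the same principal directions, up to a sign flip.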

We can also **reconstruct** the original data (approximately) using `.inverse_transform()`.
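
To check that the manual approach and scikit-learn agree, the sketch below compares three routes on the same synthetic data: eigen decomposition of the covariance matrix, an SVD of the centered data, and `PCA` itself. Directions are compared up to sign, since an eigenvector and its negation describe the same axis.

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
x = 2 * np.random.rand(100, 1)
y = 0.5 * x + np.random.randn(100, 1) * 0.2
data = np.hstack((x, y))
centered = data - data.mean(axis=0)

# Route 1: eigen decomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered.T))
pc1_manual = eigenvectors[:, -1]

# Route 2: SVD of the centered data (rows of vt are the principal directions)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Route 3: scikit-learn
pca = PCA(n_components=1).fit(data)
pc1_sklearn = pca.components_[0]

# Unit vectors that agree up to sign have an absolute dot product of 1
print(np.isclose(abs(pc1_manual @ pc1_sklearn), 1.0))            # True
print(np.isclose(abs(vt[0] @ pc1_sklearn), 1.0))                 # True

# Variances agree too: eigenvalue == s^2 / (n - 1) == explained_variance_
print(np.isclose(pca.explained_variance_[0], eigenvalues[-1]))   # True
print(np.isclose(singular_values[0] ** 2 / (len(data) - 1), eigenvalues[-1]))  # True
```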


### What are we visualizing?

In the plot:

- The **blue points** are the original 2D data.
- The **orange Xs** are the projections of the data onto the first principal component.
- The **purple arrow** shows the direction of the first principal component (PC1).
- The **black dot** is the mean of the data, used to center the points during PCA.

This visualization shows how PCA finds the "best-fit" direction (in terms of variance) to reduce dimensionality.


### When to use scikit-learn's `PCA`?

You should use `sklearn.decomposition.PCA` in real-world projects when:

- You want to **reduce the number of features** before applying machine learning
- You want to **remove redundancy** or noise from your dataset
- You want to **visualize high-dimensional data** in 2D or 3D
- You want to **automate** PCA inside a preprocessing pipeline
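
As one possible illustration of the pipeline use case, the sketch below chains standardization, PCA, and a classifier. The Iris dataset, the `StandardScaler` step, and the `LogisticRegression` model are assumptions chosen for this example; any estimator could take their place.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)    # 150 samples, 4 features

# Scale -> reduce to 2 components -> classify, all fitted in one call
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)

reduced = model[:-1].transform(X)    # output of the scaler + PCA steps only
print(reduced.shape)                 # (150, 2)
print(model.named_steps["pca"].explained_variance_ratio_)
```

Because PCA is fitted inside the pipeline, the same centering and projection learned on the training data are reused automatically whenever the pipeline transforms or predicts on new data.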

In educational settings, it's best to first **understand PCA with NumPy**, then move to `scikit-learn` to **apply it efficiently**.
