# Conceptual and Statistical Introduction

## The statistical problem of high dimensionality

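In high-dimensional biological data, many features are correlated, so the measurements carry redundant information. Two quantities describe this structure. Variance measures how spread out the values of a single feature are around its mean. Covariance measures how two features vary together: it is positive when they tend to rise and fall together, negative when one rises as the other falls, and near zero when they have no linear relationship.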
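As a minimal sketch of these two quantities, the snippet below computes sample variances, a covariance matrix, and a correlation matrix with NumPy. The matrix `toy` and its values are made up for illustration and are not data from this tutorial.

```python
import numpy as np

# Hypothetical toy data: 5 samples (rows) x 3 features (columns)
toy = np.array([
    [2.0, 4.1, 9.0],
    [3.0, 6.2, 7.5],
    [4.0, 7.9, 8.2],
    [5.0, 10.1, 7.9],
    [6.0, 12.0, 8.4],
])

# Sample variance of each feature (ddof=1 divides by n - 1)
variances = toy.var(axis=0, ddof=1)

# Covariance matrix: rowvar=False treats columns as features
cov = np.cov(toy, rowvar=False)

# Correlation matrix rescales covariance to the range [-1, 1]
corr = np.corrcoef(toy, rowvar=False)

print("Variances:", variances)
print("Covariance matrix:\n", cov)
print("Correlation matrix:\n", corr)
```

The diagonal of the covariance matrix holds the per-feature variances, and the off-diagonal entries hold the pairwise covariances. The first two columns were constructed to move together, so their correlation is close to 1.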

## Geometric intuition

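PCA rotates the coordinate system to align with the directions of maximum variance. These directions are the eigenvectors of the covariance matrix, and the corresponding eigenvalues give the variance captured along each direction. The eigenvectors with the largest eigenvalues are the dominant axes of variation.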
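The rotation described above can be sketched with an eigendecomposition of the covariance matrix. The data here are simulated (a hypothetical pair of correlated features), and `np.linalg.eigh` is just one reasonable way to obtain the eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2D data: the second feature follows the first plus noise
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])

# Center the data, then form the covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# Reorder so the largest-variance direction comes first
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Rotating the data into the eigenvector basis gives the principal component scores
scores = centered @ eigvecs

print("Variance along each principal axis:", eigvals)
print("Covariance of rotated data:\n", np.cov(scores, rowvar=False))
```

After the rotation, the covariance matrix of the scores is diagonal: the new axes are uncorrelated, and the diagonal entries equal the eigenvalues.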

## Drug development relevance

Omics datasets often contain thousands of variables but few samples. PCA helps separate signal from noise and identify dominant biological processes before downstream biomarker analysis.
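To illustrate the "many variables, few samples" setting, the sketch below runs PCA via the singular value decomposition on simulated noise. The sample and feature counts are assumptions chosen for illustration, not taken from a real omics dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical omics-like matrix: 10 samples, 2000 features (simulated noise)
n_samples, n_features = 10, 2000
data = rng.normal(size=(n_samples, n_features))

# Center each feature
centered = data - data.mean(axis=0)

# SVD avoids forming the 2000 x 2000 covariance matrix explicitly
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Eigenvalues of the covariance matrix are the squared singular values / (n - 1)
explained_variance = S**2 / (n_samples - 1)

# Centering removes one degree of freedom, so at most n - 1 components carry variance
n_nonzero = int(np.sum(explained_variance > 1e-10))
print("Components with nonzero variance:", n_nonzero)
```

With far fewer samples than features, the number of samples limits how many components can carry variance, regardless of the number of features.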

## Intuitive Understanding of PCA (Feynman-Style)

Before applying PCA to real biological data, we build intuition using a very small, concrete example.

**PCA asks a simple question:**

> In which direction are the data points most spread out?

That direction captures the most information and becomes the **first principal component**.

### A Simple Numeric Example

Consider the following four points:

- (1, 2)
- (2, 4)
- (3, 6)
- (4, 8)

As x increases by 1, y increases by 2, so the points lie exactly on the straight line y = 2x.

Intuitively, all of the variation lies along **one direction**, not two.
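A quick numerical check of this intuition, as a minimal sketch: the correlation between the two coordinates and the rank of the centered data both indicate how many directions carry variation. The variable names are illustrative.

```python
import numpy as np

# The four points from the example above
points = np.array([[1, 2], [2, 4], [3, 6], [4, 8]], dtype=float)

# Pearson correlation between the two coordinates
r = np.corrcoef(points[:, 0], points[:, 1])[0, 1]

# Rank of the centered data counts how many independent directions carry variance
centered = points - points.mean(axis=0)
rank = np.linalg.matrix_rank(centered)

print("Correlation:", r)
print("Rank of centered data:", rank)
```

A correlation of 1 (up to floating-point rounding) and a rank of 1 both say the same thing: the second feature adds no variation beyond what the first already describes.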

In [None]:
import numpy as np
# NumPy is used here to create and manipulate numerical arrays

import matplotlib.pyplot as plt
# Matplotlib is used to visualize point geometry and variance directions

# Define a small 2D dataset with strong linear correlation
# Each row represents one observation (sample)
# Each column represents one feature
X = np.array([
    [1, 2],
    [2, 4],
    [3, 6],
    [4, 8]
])

# Create a square figure
plt.figure(figsize=(6, 6))

# Scatter plot shows the raw geometry of the data points
plt.scatter(X[:, 0], X[:, 1])

# Equal axis scaling keeps distances and angles visually undistorted
plt.axis('equal')

# Label axes to clarify feature interpretation
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# Title explains what stage of the intuition this plot represents
plt.title('Original Data Points')

# Grid helps visually estimate distances and spread
plt.grid(True)

# Render the plot
plt.show()

### Why Mean-Centering Is Necessary

PCA measures variance **around the mean**.

If the data cloud is not centered at the origin, spread measured from the origin is dominated by the cloud's offset (its mean) rather than by the variation among the points.

Mean-centering shifts the cloud so its center lies at (0, 0), without changing its shape.
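The effect of skipping the centering step can be sketched on a shifted copy of the example line. The array `cloud` and the helper `top_direction` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Hypothetical cloud far from the origin: the example points shifted by (10, 0)
cloud = np.array([[11, 2], [12, 4], [13, 6], [14, 8]], dtype=float)
centered = cloud - cloud.mean(axis=0)

# Centering moves the mean to the origin...
print("Mean after centering:", centered.mean(axis=0))

# ...but leaves the covariance (the shape of the cloud) unchanged
print("Covariance unchanged:", np.allclose(np.cov(cloud, rowvar=False),
                                           np.cov(centered, rowvar=False)))

def top_direction(matrix):
    """Return the leading eigenvector of matrix.T @ matrix (no centering applied)."""
    eigvals, eigvecs = np.linalg.eigh(matrix.T @ matrix)
    v = eigvecs[:, -1]             # eigh sorts eigenvalues in ascending order
    return v if v[0] >= 0 else -v  # eigenvector sign is arbitrary; fix it

# Without centering, the leading direction is pulled toward the mean (12.5, 5)
print("Uncentered direction:", top_direction(cloud))

# With centering, it follows the spread of the points, proportional to (1, 2)
print("Centered direction:  ", top_direction(centered))
```

The uncentered direction points roughly from the origin toward the middle of the cloud, while the centered direction follows the line along which the points actually vary.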

In [None]:
# Compute the mean of each feature (column-wise mean)
# This represents the geometric center of the data cloud
mean = np.mean(X, axis=0)

# Subtract the mean from every data point
# This shifts the entire cloud so it is centered at the origin
X_centered = X - mean

# Create a new figure for the centered data
plt.figure(figsize=(6, 6))

# Plot the centered points
plt.scatter(X_centered[:, 0], X_centered[:, 1])

# Equal axis scaling keeps distances and angles visually undistorted
plt.axis('equal')

# Draw reference axes to show centering at (0, 0)
plt.axhline(0, color='gray', linestyle='--')
plt.axvline(0, color='gray', linestyle='--')

# Label axes to reflect centered feature space
plt.xlabel('Centered Feature 1')
plt.ylabel('Centered Feature 2')

# Title emphasizes the effect of mean-centering
plt.title('Mean-Centered Data')

# Grid improves spatial intuition
plt.grid(True)

# Render the plot
plt.show()

### Direction of Maximum Variance

The direction along which the projected points are most spread out captures the maximum variance.
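To make "spread of the projected points" concrete, the sketch below projects the centered example points onto a few hand-picked directions and compares the resulting variances. The helper `projected_variance` and the candidate directions are illustrative choices.

```python
import numpy as np

points = np.array([[1, 2], [2, 4], [3, 6], [4, 8]], dtype=float)
centered = points - points.mean(axis=0)

def projected_variance(direction):
    """Sample variance of the centered points projected onto a direction."""
    unit = np.asarray(direction, dtype=float)
    unit = unit / np.linalg.norm(unit)   # normalize so directions are comparable
    return np.var(centered @ unit, ddof=1)

# Candidate directions chosen by hand for illustration
for direction in [(1, 0), (0, 1), (1, 1), (1, 2), (2, -1)]:
    print(direction, round(projected_variance(direction), 3))
```

Among these candidates, the direction (1, 2) gives the largest projected variance (25/3, about 8.333), while the perpendicular direction (2, -1) gives essentially zero.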

In this example, that direction lies exactly along the line y = 2x, because all four points fall on that line.

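PCA finds this direction by maximizing the variance of the projected points, which is equivalent to taking the eigenvector of the covariance matrix with the largest eigenvalue.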
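As a sketch of this maximization, the snippet below compares a brute-force scan over directions with the leading eigenvector of the covariance matrix. The 0.1 degree scan resolution is an arbitrary choice for illustration.

```python
import numpy as np

points = np.array([[1, 2], [2, 4], [3, 6], [4, 8]], dtype=float)
centered = points - points.mean(axis=0)

# Brute force: try many unit directions and keep the one with the largest variance
angles = np.linspace(0, np.pi, 1801)                       # 0.1 degree steps
directions = np.column_stack([np.cos(angles), np.sin(angles)])
variances = np.var(centered @ directions.T, axis=0, ddof=1)
best = directions[np.argmax(variances)]

# Closed form: leading eigenvector of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
pc1 = eigvecs[:, -1]                   # eigh returns eigenvalues in ascending order
pc1 = pc1 if pc1[0] >= 0 else -pc1     # eigenvector sign is arbitrary; fix it

print("Best direction from scan:", best)
print("First principal component:", pc1)
print("Share of variance on PC1:", eigvals[-1] / eigvals.sum())
```

Both approaches agree up to the scan resolution, and the first principal component is proportional to (1, 2). Because the points are perfectly collinear, that single component accounts for all of the variance.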