# Conceptual and Statistical Introduction

## The statistical problem of high dimensionality

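In high-dimensional biological data, many features are correlated, so the data carry less independent information than the raw feature count suggests. Variance measures how spread out a feature's values are around their mean. Covariance measures how two features vary together: it is positive when they rise and fall together and negative when one rises as the other falls.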
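
As a minimal sketch of these two quantities, the snippet below computes per-feature variances and the covariance matrix with NumPy. The two "genes" and the 0.8 coupling between them are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples of two positively correlated features
gene_a = rng.normal(size=200)
gene_b = 0.8 * gene_a + 0.2 * rng.normal(size=200)
X = np.column_stack([gene_a, gene_b])  # rows = samples, columns = features

# Sample variance of each feature (ddof=1 matches np.cov's default)
variances = X.var(axis=0, ddof=1)

# Covariance matrix: diagonal holds the variances, off-diagonal the covariances
cov_matrix = np.cov(X, rowvar=False)

# Correlation is covariance rescaled to the range [-1, 1]
corr = np.corrcoef(X, rowvar=False)[0, 1]

print(variances)
print(cov_matrix)
print(corr)
```

The diagonal of `cov_matrix` reproduces `variances`, and the positive off-diagonal entry reflects the coupling built into `gene_b`.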

## Geometric intuition

PCA rotates the coordinate system to align with directions of maximum variance. These directions are the eigenvectors of the covariance matrix, and the corresponding eigenvalues give the variance along each direction. The leading eigenvectors are therefore the axes along which the data vary most.
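
The rotation can be sketched directly with NumPy's eigendecomposition. The two-feature dataset below is a hypothetical example; strictly speaking, the eigenvector matrix is an orthogonal change of basis, which may include a reflection as well as a rotation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D data with one dominant direction of variation
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + 0.3 * rng.normal(size=500)])

# PCA operates on mean-centered data
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# Reorder so the direction of largest variance comes first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Express the data in the new coordinate system
rotated = X_centered @ eigvecs

# In the rotated system the features are uncorrelated:
# the covariance matrix is diagonal, with the eigenvalues on the diagonal
rotated_cov = np.cov(rotated, rowvar=False)
print(rotated_cov)
```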

## Drug development relevance

Omics datasets often contain thousands of variables but few samples. PCA helps separate signal from noise and identify dominant biological processes before downstream biomarker analysis.
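
One consequence of having far fewer samples than variables can be illustrated numerically: after mean-centering, a dataset with n samples has rank at most n - 1, so no more than n - 1 principal components can carry variance. The sizes below (10 samples, 200 features) are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical "few samples, many features" setting
n_samples, n_features = 10, 200
X = rng.normal(size=(n_samples, n_features))

# Centering removes one degree of freedom from the rows
X_centered = X - X.mean(axis=0)

# The rank bounds the number of components with non-zero variance
rank = np.linalg.matrix_rank(X_centered)
print(rank, "<=", n_samples - 1)
```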

# PCA for Biomarker Discovery (Drug Development)

## 1. Motivation

High-dimensional biological data contains correlated features. PCA reduces dimensionality while preserving dominant variance patterns.

## 2. Generating Synthetic Biomarker Data

Rows represent samples and columns represent genes or proteins.

In [None]:
import numpy as np
# NumPy is used for numerical computation and random number generation

import pandas as pd
# Pandas provides labeled DataFrames, which are critical for tracking gene identities

np.random.seed(42)
# Fixing the random seed ensures reproducibility of the synthetic dataset
# Changing this value would generate a different dataset and slightly alter PCA results

num_samples = 50
# Number of biological samples (e.g., patients or experiments)
# In omics, samples are often far fewer than features

num_features = 100
# Number of genes or proteins measured per sample
# High feature dimensionality motivates dimensionality reduction

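data = np.random.rand(num_samples, num_features) * 10
# np.random.rand draws independent values uniformly from the range [0, 1)
# Multiplying by 10 rescales them to [0, 10) as a rough stand-in for measurement scale
# Note: the features are generated independently, so this synthetic dataset
# has no built-in correlation structure for PCA to recover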

df = pd.DataFrame(
    data,
    columns=[f'Gene_{i+1}' for i in range(num_features)]
)
# Assign explicit gene labels so PCA loadings remain interpretable

df.head()
# Display the first few rows to verify data structure and scale

## 3. Standardizing the Data

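PCA is sensitive to feature scale: features with larger numeric ranges have larger variances and therefore dominate the leading components. Standardization gives every feature zero mean and unit variance, so no feature dominates purely because of its units.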
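
The effect of scale can be sketched with plain NumPy. The two features below are hypothetical and independent, differing only in scale; `leading_variance_share` is an illustrative helper, not a library function. The manual z-score uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical features: independent, but on very different scales
small = rng.normal(loc=0.0, scale=1.0, size=n)
large = rng.normal(loc=0.0, scale=1000.0, size=n)
X = np.column_stack([small, large])


def leading_variance_share(matrix):
    """Fraction of total variance along the first principal axis."""
    eigvals = np.linalg.eigvalsh(np.cov(matrix, rowvar=False))
    return eigvals.max() / eigvals.sum()


# Without scaling, the large-scale feature accounts for almost all variance
share_raw = leading_variance_share(X)

# Manual standardization: subtract the mean, divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
share_std = leading_variance_share(X_std)

print(f"raw: {share_raw:.4f}, standardized: {share_std:.4f}")
```

On the raw data the first component is almost entirely the large-scale feature. After standardization the two independent features share the variance roughly evenly.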

In [None]:
from sklearn.preprocessing import StandardScaler
# StandardScaler centers features to mean = 0 and scales to unit variance
# Without this step, high-magnitude genes would dominate PCA directions

scaler = StandardScaler()
# Initializes the scaler object, which will learn mean and standard deviation

scaled_data = scaler.fit_transform(df)
# fit_transform first learns each gene's mean and standard deviation, then applies them
# Each gene now has mean 0 and unit variance; the result is a NumPy array

## 4. Performing PCA

Principal components capture orthogonal directions of maximum variance.
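
A minimal sketch of the underlying computation, using a singular value decomposition of the centered data in plain NumPy. The 30 x 5 random matrix is a hypothetical stand-in for a standardized dataset, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical dataset: 30 samples, 5 features
X = rng.normal(size=(30, 5))
X_centered = X - X.mean(axis=0)

# Thin SVD: X_centered = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt   # each row is one principal axis (unit length)
scores = U * S    # coordinates of the samples along those axes

# Variance along each axis, using the n - 1 sample-variance convention
explained_variance = S**2 / (X.shape[0] - 1)

# Cross-check against the eigenvalues of the covariance matrix
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X_centered, rowvar=False)))[::-1]

print(np.round(components @ components.T, 6))  # identity: axes are orthonormal
print(explained_variance)
print(eigvals)
```

The principal axes are mutually orthogonal, and the variances obtained from the singular values agree with the covariance eigenvalues, so the SVD and eigendecomposition views describe the same components.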

In [None]:
from sklearn.decomposition import PCA
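# scikit-learn's PCA centers the data and computes the principal axes via SVD,
# which is mathematically equivalent to eigendecomposition of the covariance matrix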

pca = PCA(n_components=10)
# n_components determines how many principal directions are retained
# Fewer components give stronger compression but discard more of the variance

principal_components = pca.fit_transform(scaled_data)
# fit_transform learns principal axes and projects data onto them

pc_df = pd.DataFrame(
    principal_components,
    columns=[f'PC_{i+1}' for i in range(pca.n_components_)]
)
# Store principal components in a labeled DataFrame for interpretation

pc_df.head()
# Inspect the transformed low-dimensional representation

## 5. Explained Variance Analysis

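The explained variance ratio is the fraction of the total variance captured by each component. Its cumulative sum shows how much variance the first k components retain together.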
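
The arithmetic can be sketched on a small set of made-up per-component variances. Both the values and the 80% retention threshold are illustrative assumptions, not recommendations.

```python
import numpy as np

# Hypothetical variances along five principal components (largest first)
explained_variance = np.array([4.0, 2.0, 1.0, 0.5, 0.5])

# Ratio: each component's share of the total variance
ratio = explained_variance / explained_variance.sum()

# Cumulative share retained by the first k components
cumulative = np.cumsum(ratio)

# Smallest k whose cumulative share reaches the (assumed) 80% threshold
threshold = 0.80
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(ratio)       # [0.5    0.25   0.125  0.0625 0.0625]
print(cumulative)  # [0.5    0.75   0.875  0.9375 1.    ]
print(n_keep)      # 3
```

Two components retain 75% of the variance, which falls short of the threshold, so three are needed.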

In [None]:
import matplotlib.pyplot as plt
# Matplotlib is used for visualizing variance trends

explained_variance_ratio = pca.explained_variance_ratio_
# Fraction of total variance explained by each principal component

cumulative_explained_variance = np.cumsum(explained_variance_ratio)
# Cumulative sum shows how variance accumulates as more components are added

plt.figure(figsize=(10, 6))
# figsize controls plot dimensions for readability

plt.plot(
    range(1, len(explained_variance_ratio)+1),
    explained_variance_ratio,
    'o-',
    label='Individual'
)
# Individual variance per component

plt.plot(
    range(1, len(explained_variance_ratio)+1),
    cumulative_explained_variance,
    'o-',
    label='Cumulative'
)
# Cumulative variance guides how many components to retain

plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.legend()
plt.show()
# Visualize variance retention to inform dimensionality choice