# Conceptual and Statistical Introduction

## The statistical problem of high dimensionality

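In high-dimensional biological data, many features are correlated, so the data carry less independent information than the raw feature count suggests. Variance measures how spread out the values of a single feature are around its mean. Covariance measures how two features vary together: it is positive when they rise and fall together and negative when one rises as the other falls.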
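
To make the two quantities concrete, here is a minimal sketch using NumPy. The toy matrix `X` is hypothetical (it is not taken from any dataset in this notebook); its second column is roughly twice the first, so the two features are strongly correlated.

```python
import numpy as np

# Hypothetical toy data: 5 samples (rows) x 2 features (columns).
# The second feature is roughly twice the first, so the two are correlated.
X = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, 6.2],
    [4.0, 7.8],
    [5.0, 10.0],
])

# Sample variance of each feature (ddof=1 divides by n - 1).
variances = X.var(axis=0, ddof=1)

# Covariance matrix; rowvar=False tells NumPy that columns are the variables.
cov = np.cov(X, rowvar=False)

# Correlation rescales covariance to the range [-1, 1].
corr = np.corrcoef(X, rowvar=False)

print("variances:", variances)
print("covariance matrix:\n", cov)
print("correlation between the two features:", corr[0, 1])
```

The diagonal of the covariance matrix holds the per-feature variances, and the off-diagonal entries hold the covariances that PCA exploits.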

## Geometric intuition

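PCA rotates the coordinate system of the mean-centered data so that the new axes align with the directions of maximum variance. These directions are the eigenvectors of the covariance matrix, and each eigenvalue gives the variance along its eigenvector. The leading axes capture the largest share of variance, which is often, though not always, the most informative variation.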
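
The rotation can be sketched directly with an eigendecomposition, without any PCA library. This is an illustration under stated assumptions: the two-feature data set below is hypothetical and generated only to have one dominant direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data: both features follow a shared latent variable z.
z = rng.normal(size=(500, 1))
X = np.hstack([
    z + 0.3 * rng.normal(size=(500, 1)),
    2.0 * z + 0.3 * rng.normal(size=(500, 1)),
])

# Center the data, then form the covariance matrix (columns are variables).
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# Reorder so the direction of largest variance comes first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotate the data onto the eigenvector axes.
scores = Xc @ eigvecs

# The variance along each new axis equals the corresponding eigenvalue.
score_var = scores.var(axis=0, ddof=1)
print("eigenvalues:       ", eigvals)
print("variance of scores:", score_var)
```

After the rotation the new coordinates are uncorrelated, which is why the covariance matrix of `scores` is diagonal up to rounding error.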

## Drug development relevance

Omics datasets often contain thousands of variables but comparatively few samples. PCA can help separate dominant structure from noise and highlight the main sources of variation before downstream biomarker analysis.
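
One practical consequence of having few samples and many features is that the number of components with non-zero variance is limited by the sample count, not the feature count. The sketch below illustrates this with scikit-learn; the matrix sizes (20 samples, 200 features) are hypothetical and chosen only to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical "wide" data: far more features than samples.
n_samples, n_features = 20, 200
X = rng.normal(size=(n_samples, n_features))

# With n_components left unset, PCA keeps min(n_samples, n_features) components.
pca = PCA()
pca.fit(X)

n_kept = pca.n_components_

# Centering removes one degree of freedom, so at most n_samples - 1
# components can carry non-zero variance.
n_nonzero = int(np.sum(pca.explained_variance_ > 1e-10))

print("components kept:", n_kept)
print("components with non-zero variance:", n_nonzero)
```

In this setting every component is estimated from very few samples, so the trailing components in particular should be interpreted with caution.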

# PCA for Biomarker Discovery (Drug Development)

## 1. Motivation

High-dimensional biological data contains correlated features. PCA reduces dimensionality while preserving dominant variance patterns.

## 2. Generating Synthetic Biomarker Data

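Rows represent samples and columns represent genes or proteins. The values below are drawn independently from a uniform distribution on [0, 10), so the features are uncorrelated by construction and the data serve only to demonstrate the mechanics of the workflow.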

In [None]:
import numpy as np
import pandas as pd

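# Fix the random seed so the synthetic data are reproducible
np.random.seed(42)

num_samples = 50
num_features = 100

# Independent uniform values in [0, 10): no correlation is built in
data = np.random.rand(num_samples, num_features) * 10

# Label the columns Gene_1 ... Gene_100
df = pd.DataFrame(data, columns=[f'Gene_{i+1}' for i in range(num_features)])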

df.head()
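
Because independent uniform values contain no shared structure, PCA has little to find in them. As an optional alternative, the sketch below generates data in which the features are driven by a few latent factors. Everything here is an assumption made for illustration: the number of factors (`num_factors = 3`), the random `loadings`, and the noise scale of 0.5 are hypothetical and not part of the original workflow.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

num_samples = 50
num_features = 100
num_factors = 3  # hypothetical number of underlying processes

# Each sample has a score on every latent factor.
factor_scores = rng.normal(size=(num_samples, num_factors))

# Each feature responds to the factors with a hypothetical weight.
loadings = rng.normal(size=(num_factors, num_features))

# Feature-level measurement noise (hypothetical scale).
noise = 0.5 * rng.normal(size=(num_samples, num_features))

structured = factor_scores @ loadings + noise
structured_df = pd.DataFrame(
    structured, columns=[f'Gene_{i+1}' for i in range(num_features)]
)

# Share of total variance carried by the three largest directions,
# computed from the singular values of the centered matrix.
svals = np.linalg.svd(structured - structured.mean(axis=0), compute_uv=False)
top3_share = (svals[:3] ** 2).sum() / (svals ** 2).sum()
print("variance share of the top 3 directions:", top3_share)
```

Running the rest of the notebook on `structured_df` instead of `df` should produce a steep explained-variance curve, in contrast to the nearly flat curve expected from uncorrelated data.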

## 3. Standardizing the Data

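PCA is sensitive to feature scale: features with larger variance dominate the leading components. Standardizing each feature to zero mean and unit variance lets all features contribute on an equal footing.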

In [None]:
from sklearn.preprocessing import StandardScaler

# Rescale every column to zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
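
For intuition about what the scaler does, the sketch below compares `StandardScaler` with a manual z-score on a small hypothetical matrix whose two columns sit on very different scales. The data are invented for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales.
X = np.column_stack([
    rng.normal(loc=5.0, scale=1.0, size=50),
    rng.normal(loc=5000.0, scale=800.0, size=50),
])

scaled = StandardScaler().fit_transform(X)

# Manual z-score. StandardScaler divides by the population standard
# deviation, which corresponds to ddof=0 in NumPy.
manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

print("column means after scaling:", scaled.mean(axis=0))
print("column std devs after scaling:", scaled.std(axis=0))
print("matches manual z-score:", np.allclose(scaled, manual))
```

Without this step, the second column would dominate the first principal component purely because of its units.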

## 4. Performing PCA

Each principal component is the direction of maximum remaining variance that is orthogonal to all earlier components.

In [None]:
from sklearn.decomposition import PCA

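n_components = 10
pca = PCA(n_components=n_components)
# Fit on the standardized data and project it onto the leading components
principal_components = pca.fit_transform(scaled_data)

pc_df = pd.DataFrame(principal_components, columns=[f'PC_{i+1}' for i in range(n_components)])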

pc_df.head()
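
The orthogonality of the components can be checked numerically. The sketch below is self-contained and regenerates comparable random data with NumPy's `default_rng`, so the values differ from the cells above; the variable names `gram` and `score_corr` are introduced here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Random data with the same shape as in this notebook, then standardized.
X = StandardScaler().fit_transform(rng.random((50, 100)) * 10)

pca = PCA(n_components=10)
scores = pca.fit_transform(X)

# components_ has shape (n_components, n_features); each row is one axis.
# For orthonormal axes, this product is the identity matrix.
gram = pca.components_ @ pca.components_.T

# The projected scores of different components are uncorrelated.
score_corr = np.corrcoef(scores, rowvar=False)
off_diag = score_corr - np.eye(10)

print("axes orthonormal:", np.allclose(gram, np.eye(10)))
print("largest off-diagonal score correlation:", np.abs(off_diag).max())
```

The rows of `pca.components_` also hold the feature loadings, which is where one would look to see which genes or proteins drive a given component.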

## 5. Explained Variance Analysis

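The explained variance ratio is the fraction of total variance captured by each component. The cumulative curve shows how much variance the first k components retain together.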

In [None]:
import matplotlib.pyplot as plt

# Fraction of total variance per component, and its running total
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance_ratio)

# Component indices start at 1 for readability
component_numbers = range(1, len(explained_variance_ratio) + 1)

plt.figure(figsize=(10, 6))
plt.plot(component_numbers, explained_variance_ratio, 'o-', label='Individual')
plt.plot(component_numbers, cumulative_explained_variance, 'o-', label='Cumulative')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.legend()
plt.show()
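
A common follow-up is to pick the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below shows one way to do that; the 90% threshold is a hypothetical choice, not a recommendation, and the data are regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Random data with the same shape as in this notebook, then standardized.
X = StandardScaler().fit_transform(rng.random((50, 100)) * 10)

# Fit with all available components so the cumulative curve reaches 1.
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.90  # hypothetical target for retained variance

# searchsorted returns the first index where cumulative >= threshold;
# adding 1 converts the zero-based index into a component count.
k = int(np.searchsorted(cumulative, threshold) + 1)

print("components needed:", k)
print("variance retained:", cumulative[k - 1])
```

On uncorrelated random data the count tends to be large, because no small set of directions dominates; data with real shared structure usually needs far fewer components.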