**Principal Component Analysis (PCA)** is a dimensionality reduction technique widely used in data analysis and machine learning. It transforms a dataset with possibly correlated features into a new set of **uncorrelated features**, called **principal components**, ordered by the amount of **variance** they capture in the data.

Put differently, PCA **finds the directions in which the data varies the most** and then **re-expresses** the data along those directions:

- Think of your dataset as a cloud of points in a high-dimensional space.
- PCA finds **new axes** (directions) such that:
  - The first axis explains as much **variance** in the data as possible.
  - The second axis is orthogonal (perpendicular) to the first and explains the next most variance, and so on.


### **Why use PCA?**

- Reduce the number of features while retaining most of the information (variance).
- Remove noise or redundant features.
- Visualize high-dimensional data in 2D or 3D.
- Speed up training and, in some cases, improve generalization of downstream machine learning models by reducing the number of input features.


Let’s say we have a dataset $ X \in \mathbb{R}^{n \times d} $, where:
- $ n $ is the number of samples,
- $ d $ is the number of features (dimensions).

Each row $ \mathbf{x}^{(i)} \in \mathbb{R}^d $ is one data point.


### Step 1: **Z-score Normalization**

**Z-score normalization**, also known as **standardization**, is a technique used to **rescale features** so that they have:

- a **mean of 0**, and  
- a **standard deviation of 1**.

Given a feature $ x_i $, its **Z-score normalized** value is:

$$
z_i = \frac{x_i - \mu_i}{\sigma_i}
$$

Where:
- $ \mu_i $ is the **mean** of the feature $ x_i $,
- $ \sigma_i $ is the **standard deviation** of the feature $ x_i $,
- $ z_i $ is the **standardized** value.

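Why use it for PCA?  
- PCA looks for directions of **maximum variance**, so a feature measured on a larger scale would otherwise dominate the components. Scaling every feature to unit variance puts them on an equal footing.
- Shifting the mean doesn't change the variance, but **we want the variance to be measured about the origin**, so that all our axes (principal components) pass through the center of the data cloud.
- With a mean of 0, the covariance formula reduces to a simple matrix product.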
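As a minimal sketch of the formula above, we can standardize a small array by hand with NumPy. The toy values (a height-like and an income-like column) are made up for illustration. Note that `np.std` defaults to the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Toy data (made up for illustration): 5 samples, 2 features on very different scales.
X = np.array([
    [170.0, 65000.0],
    [160.0, 48000.0],
    [180.0, 72000.0],
    [175.0, 51000.0],
    [165.0, 60000.0],
])

mu = X.mean(axis=0)       # per-feature mean
sigma = X.std(axis=0)     # per-feature standard deviation (population, ddof=0)
X_std = (X - mu) / sigma  # z-score every column

print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # approximately 1 for every feature
```

After this step both columns have comparable spread, so neither dominates the variance purely because of its units.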


### Step 2: **Compute the Covariance Matrix**

Covariance tells us how much two variables **change together**.

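For centered data $ X $ (here, the standardized data from Step 1), we compute:
$$
\Sigma = \frac{1}{n} X^T X
$$
This is a $ d \times d $ matrix where each entry $ \Sigma_{ij} $ measures how features $ i $ and $ j $ vary together. Some libraries divide by $ n - 1 $ instead of $ n $; this rescales the eigenvalues but leaves the principal directions unchanged.

- The covariance matrix summarizes the pairwise linear relationships between features. Because the features were standardized, its diagonal entries are all 1 and it coincides with the correlation matrix.
- If feature $ i $ tends to increase when feature $ j $ increases, $ \Sigma_{ij} > 0 $; if one tends to decrease as the other increases, $ \Sigma_{ij} < 0 $.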
- We're looking for the directions (linear combinations of features) where **data varies the most**.
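To make the covariance formula concrete, here is a small sketch on synthetic data (the random data and the 0.8 coupling are assumptions chosen only to create a visible positive correlation). It computes $ \frac{1}{n} X^T X $ directly and cross-checks it against `np.cov` with `bias=True`, which also divides by $ n $.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 3))
X_raw[:, 1] += 0.8 * X_raw[:, 0]  # make features 0 and 1 positively correlated

# Standardize (zero mean, unit variance per feature)
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

n = X.shape[0]
Sigma = X.T @ X / n  # d x d covariance matrix

print(Sigma.shape)       # (3, 3)
print(np.diag(Sigma))    # approximately 1 on the diagonal, since X is standardized
print(Sigma[0, 1] > 0)   # features 0 and 1 move together
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))
```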


### Step 3: **Eigen Decomposition**

Now we compute the **eigenvectors** and **eigenvalues** of the covariance matrix $ \Sigma $:
$$
\Sigma \mathbf{v} = \lambda \mathbf{v}
$$

Where:
- $ \mathbf{v} \in \mathbb{R}^d $ is an **eigenvector**,
- $ \lambda $ is the corresponding **eigenvalue**.

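Because $ \Sigma $ is symmetric and positive semi-definite, we get $ d $ real, non-negative eigenvalues and $ d $ eigenvectors that can be chosen to be mutually orthogonal.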

- Each eigenvector gives a **direction** in the feature space.
- The corresponding eigenvalue tells us **how much variance** the data has along that direction.
- So we sort eigenvectors by decreasing eigenvalues → biggest variance first.
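A minimal sketch of this step, using a hand-picked $ 2 \times 2 $ covariance matrix (an assumption for illustration) whose eigenvalues are easy to verify by hand: $ 1 + 0.8 $ and $ 1 - 0.8 $. We use `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, so we reorder them ourselves.

```python
import numpy as np

# Hand-picked covariance matrix of two standardized features with correlation 0.8
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# eigh handles symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Sort by decreasing eigenvalue; the columns of eigvecs are the eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

print(eigvals)        # should be close to [1.8, 0.2]
print(eigvecs[:, 0])  # first direction, proportional to (1, 1) up to sign
```

The sign of each eigenvector is arbitrary: $ \mathbf{v} $ and $ -\mathbf{v} $ describe the same axis.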


### Step 4: **Select Top $ k $ Components**

Choose the top $ k $ eigenvectors (those with the largest eigenvalues) to form a matrix:
$$
W_k = [\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k] \in \mathbb{R}^{d \times k}
$$

This will be our new **basis** for the data.

- We are **rotating** the data into a new coordinate system aligned with directions of greatest variance.
- We also **reduce dimensionality** by ignoring the directions with little variance (often interpreted as noise).
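Selecting the top $ k $ components amounts to slicing columns of the sorted eigenvector matrix. The sketch below uses synthetic data and $ k = 2 $, both chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 100 samples, 4 correlated features
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

Sigma = X.T @ X / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                      # number of components to keep (arbitrary here)
W_k = eigvecs[:, :k]       # first k eigenvectors as columns, shape (d, k)

print(W_k.shape)                            # (4, 2)
print(np.allclose(W_k.T @ W_k, np.eye(k)))  # columns are orthonormal
```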


### Step 5: **Project the Data**

We now project the centered (standardized) data $ X $ into the new space:
$$
Z = X W_k \in \mathbb{R}^{n \times k}
$$

Each row in $ Z $ is a **low-dimensional representation** of the original data point.

- We re-express each data point using the new axes that capture the most variation.
- This gives us a compressed version of the data, and a denoised one when the discarded directions contain mostly noise.
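The projection itself is a single matrix product. As a sketch on synthetic data (an assumption for illustration), the code below also checks a useful property: the projected features are uncorrelated, and their variances equal the top $ k $ eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
X[:, 2] = X[:, 0] + 0.1 * X[:, 2]        # feature 2 is nearly a copy of feature 0
X = (X - X.mean(axis=0)) / X.std(axis=0)

Sigma = X.T @ X / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
W_k = eigvecs[:, :k]
Z = X @ W_k                               # low-dimensional representation, shape (n, k)

cov_Z = Z.T @ Z / Z.shape[0]              # covariance of the projected data
print(Z.shape)                                   # (150, 2)
print(np.allclose(cov_Z, np.diag(eigvals[:k])))  # diagonal, with the top-k eigenvalues
```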


### Geometric Interpretation

Imagine a 3D point cloud where most of the spread is along a plane. PCA:
- Finds that plane.
- Projects the data onto that plane.
- Discards the small component orthogonal to the plane (less important).

This is why PCA can drastically **reduce dimensionality** without losing much **information**.
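The plane picture can be checked numerically. The sketch below builds a synthetic 3D cloud that lies close to the plane $ z = x + y $ (the plane, spreads, and noise level are all assumptions for illustration), keeps two components, and measures how much is lost. The data is only centered here, not rescaled, so the geometry of the cloud is preserved.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Two in-plane coordinates with large spread, plus small off-plane noise
coords = rng.normal(size=(n, 2)) * np.array([3.0, 2.0])
basis = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])          # both rows lie in the plane z = x + y
X = coords @ basis + 0.05 * rng.normal(size=(n, 3))
X = X - X.mean(axis=0)                       # centering only

Sigma = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

ratio = eigvals / eigvals.sum()              # share of variance per direction
W_2 = eigvecs[:, :2]
X_hat = (X @ W_2) @ W_2.T                    # project onto the plane, map back to 3D
rel_err = np.linalg.norm(X - X_hat) ** 2 / np.linalg.norm(X) ** 2

print(ratio)           # the third entry is tiny
print(rel_err)         # equals the third variance share
print(eigvecs[:, 2])   # close to the plane's normal, (1, 1, -1) / sqrt(3), up to sign
```

The relative reconstruction error is exactly the variance share of the discarded direction, which is what "not losing much information" means here.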



### **Implementation in Python (with `scikit-learn`)**

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd

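# Load your data (placeholder file name and 'target' column; adjust to your dataset)
df = pd.read_csv('your_data.csv')
X = df.drop(columns='target')  # keep only the feature columns

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA and project the data onto the first two principal components
pca = PCA(n_components=2)  # choose number of components
X_pca = pca.fit_transform(X_scaled)

# Fraction of the total variance explained by each kept component
print(pca.explained_variance_ratio_)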
```
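As a sanity check, the manual steps and `scikit-learn` should agree. The sketch below uses synthetic data as a stand-in for a real dataset (an assumption for illustration) and compares the two. The projections can differ by a sign flip per component, so we compare absolute values. `scikit-learn` divides by $ n - 1 $ when reporting `explained_variance_`, but the ratios are unaffected.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 4)) @ rng.normal(size=(4, 4))  # synthetic stand-in data

X_scaled = StandardScaler().fit_transform(X)

# Manual PCA: covariance, eigen decomposition, sort, project
Sigma = X_scaled.T @ X_scaled / X_scaled.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]       # descending order
Z_manual = X_scaled @ eigvecs[:, :2]

# scikit-learn PCA
pca = PCA(n_components=2)
Z_sklearn = pca.fit_transform(X_scaled)

same_projection = np.allclose(np.abs(Z_manual), np.abs(Z_sklearn), atol=1e-6)
same_ratio = np.allclose(eigvals[:2] / eigvals.sum(),
                         pca.explained_variance_ratio_, atol=1e-6)
print(same_projection, same_ratio)
```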


### **Choosing the Number of Components**
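Use the **explained variance ratio** to decide how many principal components to keep. A common heuristic is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold (for example, 95%):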
```python
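import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA with all components; X_scaled is the standardized data from the previous block
pca = PCA().fit(X_scaled)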
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_.cumsum())
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
```
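Instead of reading the value off the plot, the number of components can also be picked programmatically. The sketch below uses synthetic data and a 95% threshold (both assumptions for illustration). It computes $ k $ from the cumulative sum, then compares it with `scikit-learn`'s built-in option of passing a float between 0 and 1 as `n_components`, with `svd_solver="full"` set explicitly.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Synthetic data: 10 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

threshold = 0.95  # fraction of variance to retain (a choice, not a rule)

# Option 1: smallest k whose cumulative explained variance reaches the threshold
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, threshold) + 1)

# Option 2: let scikit-learn choose the number of components
pca_95 = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)

print(k, pca_95.n_components_)  # the two approaches should agree
```

The right threshold depends on the application; for 2D or 3D visualization, $ k $ is usually fixed in advance instead.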