PCA is a feature extraction technique: it reduces dimensionality by building a smaller set of new features from the original ones.

PCA maps high-dimensional data to a lower-dimensional space while keeping as much of the data's variance as possible.

Benefits:
1) Speed - algorithms run faster because there are fewer dimensions to process
2) Visualization - helps us visualize higher-dimensional data in 2D or 3D

### Geometric Intuition

When we select features for a model, we tend to prefer the ones with higher variance, because a feature that barely changes across samples carries little information for training.

## 🧠 Principal Component Analysis (PCA)

### 🔍 What is PCA?
Principal Component Analysis (PCA) is a **dimensionality reduction** technique that transforms a high-dimensional dataset into a lower-dimensional space while preserving as much **variance** (information) as possible.

---

### 📐 Geometric Intuition
Imagine data points scattered in a 2D space (e.g., a cloud of points in the XY plane). PCA works as follows:

1. It **finds new axes** (principal components) that best explain the **spread (variance)** of the data.
2. The **first principal component** is the direction (vector) in which the data varies the most.
3. The **second component** is orthogonal (perpendicular) to the first and captures the next highest variance, and so on.
4. These new axes are **linear combinations** of the original features.

#### 🧊 Analogy:
Think of PCA like fitting a line (or plane) through a cloud of points such that:
- The line captures the **most variance**.
- Data can be projected onto this line to reduce dimensions while keeping maximum information.

---

### 🛠️ How PCA Works (Step-by-Step)
Let $X$ be the dataset ($m$ samples × $n$ features):

1. **Standardize the data**:
   - Mean-center each feature (subtract mean).
   - Scale to unit variance (optional but recommended).

2. **Compute the covariance matrix** (this form assumes $X$ has already been mean-centered):
   $\Sigma = \frac{1}{m} X^T X$

3. **Compute eigenvalues and eigenvectors** of $\Sigma$:
   - Eigenvectors = principal components.
   - Eigenvalues = variance explained by each component.

4. **Sort eigenvectors** by eigenvalues (descending order).

5. **Select top-k components** to retain $k$ dimensions.

6. **Project data** onto new basis:
   $X_{\text{proj}} = X W_k$
   where $W_k$ = matrix with top $k$ eigenvectors as columns.
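
The steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `pca_eig`, its `scale` flag, and the synthetic `X_demo` data are assumptions made for this example, and the covariance uses the same $\frac{1}{m}$ convention as step 2.

```python
import numpy as np

def pca_eig(X, k, scale=False):
    """Project X (m samples x n features) onto its top-k principal components."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]

    # Step 1: mean-center each feature (and optionally scale to unit variance).
    Xc = X - X.mean(axis=0)
    if scale:
        Xc = Xc / Xc.std(axis=0)

    # Step 2: covariance matrix with the 1/m convention.
    cov = (Xc.T @ Xc) / m

    # Step 3: eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 4: reorder so the largest eigenvalue comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Steps 5-6: keep the top-k eigenvectors as columns of W_k and project.
    W_k = eigvecs[:, :k]
    return Xc @ W_k, W_k, eigvals

# Hypothetical data: 200 samples, 3 correlated features.
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
X_demo = rng.normal(size=(200, 3)) @ mixing

X_proj, W_k, eigvals = pca_eig(X_demo, k=2)
print(X_proj.shape)  # (200, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is a quick sanity check that the projection matches the decomposition.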

---

### 📊 Properties
- PCA is **unsupervised**.
- Sensitive to feature scaling.
- Component directions are **orthogonal**, and the projected coordinates are **uncorrelated**.
- PCA tries to **maximize variance**, not class separation.

---

### 📈 Variance Explained
You can compute the variance retained using:
$\text{Explained Variance Ratio} = \frac{\lambda_i}{\sum_j \lambda_j}$
Where $\lambda_i$ is the eigenvalue of the i-th component.

Use a **scree plot** to choose the number of components.
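
As a rough sketch of how the ratio can guide the choice of $k$, the snippet below reads scikit-learn's `explained_variance_ratio_` and picks the smallest number of components whose cumulative ratio reaches a target. The synthetic `X_demo` data and the 95% target are assumptions for illustration; plotting `ratios` against the component index would give the scree plot.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 5 independent features with decreasing spread.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Fit with all components so the ratios sum to 1.
pca = PCA().fit(X_demo)
ratios = pca.explained_variance_ratio_   # lambda_i / sum_j lambda_j
cumulative = np.cumsum(ratios)

# Smallest k whose cumulative explained variance reaches the target.
target = 0.95  # assumed threshold, adjust to taste
k = int(np.searchsorted(cumulative, target) + 1)

print(ratios.round(3))
print("components needed:", k)
```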

---

### 💡 Example
Suppose we have 2D data:
```
X = [[2.5, 2.4],
     [0.5, 0.7],
     [2.2, 2.9],
     [1.9, 2.2],
     [3.1, 3.0],
     [2.3, 2.7],
     [2.0, 1.6],
     [1.0, 1.1],
     [1.5, 1.6],
     [1.1, 0.9]]
```
- With mean-centered data, PCA finds that the first principal component lies approximately along the vector $[0.68, 0.74]$ (up to sign).
- Projecting onto this vector compresses the 2D data into 1D while retaining the maximum possible variance (about 96% here).
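
The example can be checked numerically. The sketch below mean-centers the ten points (no scaling, which is an assumption of this sketch), takes the eigenvector with the largest eigenvalue as the first principal component, and projects onto it. The sign of an eigenvector is arbitrary, so the absolute values are printed.

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Mean-center, then build the covariance matrix (1/m convention).
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / len(X)

# eigh returns eigenvalues in ascending order, so the last column is PC1.
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]

# 1D coordinates of each sample along PC1.
z = Xc @ pc1

print(np.abs(pc1).round(3))                     # approximately [0.678 0.735]
print(round(eigvals[-1] / eigvals.sum(), 3))    # approximately 0.963
```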

---

### 🔗 PCA vs SVD
PCA can also be computed via **Singular Value Decomposition (SVD)**:
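$X = U S V^T$ (with $X$ mean-centered; $S$ holds the singular values and is written $S$ here to avoid a clash with the covariance matrix $\Sigma$)
- Columns of $V$ are principal components.
- The eigenvalues of the covariance matrix follow from the singular values: $\lambda_i = s_i^2 / m$.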
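
To make the link concrete, here is a small check that both routes agree on the same centered data. The synthetic `X_demo` matrix is an assumption for illustration; directions are compared up to sign, since each eigenvector or singular vector can be flipped.

```python
import numpy as np

# Hypothetical data: 100 samples, 4 correlated features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Xc = X_demo - X_demo.mean(axis=0)
m = Xc.shape[0]

# Route 1: eigen decomposition of the covariance matrix (1/m convention).
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / m)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# Route 2: thin SVD of the centered data; singular values come back descending.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values divided by m match the eigenvalues.
same_variances = np.allclose(s**2 / m, eigvals)
# Rows of Vt match the eigenvectors up to sign.
same_directions = np.allclose(np.abs(Vt.T), np.abs(eigvecs), atol=1e-6)
print(same_variances, same_directions)
```

In practice the SVD route is often preferred because it avoids forming $X^T X$ explicitly, which can lose numerical precision.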

---

### 📌 Applications
- Visualization (2D/3D projections)
- Preprocessing for ML (noise reduction)
- Feature decorrelation
- Compression

---

### 🧪 In Practice (Scikit-learn)
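```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data (m samples x n features); substitute your own matrix.
X = np.random.default_rng(0).normal(size=(100, 5))

# Standardize so every feature contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep the two directions of highest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)  # shape: (100, 2)
```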

---

### 📚 Summary
- PCA finds **new orthogonal axes** maximizing **variance**.
- Good for **visualization** and **noise filtering**.
- It is based on **eigen decomposition** or **SVD**.
- PCA is **unsupervised**, linear, and interpretable.

Use PCA **when your goal is to reduce dimensionality without losing too much information**.

### How it works when the features have equal amounts of variance

Number of principal components ≤ number of original features

PCA finds a new set of coordinate axes such that the first axis captures the maximum variance

![image.png](attachment:image.png)

In the original axes, both features had equal variance, so there was no basis for dropping one over the other. In the new axes, PC1 has more variance than PC2, so PC2 can be dropped at a smaller cost.
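
This situation can be reproduced with synthetic data. The sketch below (all values are assumptions chosen for illustration) builds an elongated cloud, rotates it by 45 degrees so that both original features end up with roughly the same variance, and then compares the variance along the original axes with the variance along the principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Elongated cloud: standard deviation 3 along one direction, 0.5 along the other.
Z = rng.normal(size=(1000, 2)) * np.array([3.0, 0.5])

# Rotate by 45 degrees so the long direction lies along the diagonal.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_demo = Z @ R.T

# Variance along the original x and y axes: roughly equal.
var_original = X_demo.var(axis=0)

# Variance along the principal components: very unequal.
Xc = X_demo - X_demo.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
scores = Xc @ eigvecs[:, ::-1]        # columns ordered PC1, PC2
var_pcs = scores.var(axis=0)

print(var_original.round(2))
print(var_pcs.round(2))
```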

### Why is variance so important?

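Variance tells us about the spread of the data

Variance grows with the spread, but it is not the spread itself: it is the mean squared deviation from the mean, so it is measured in squared units

Variance is preferred over the mean absolute deviation because the absolute value (mod) function is not differentiable at zero, whereas the squared deviation is differentiable everywhere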

# 🔢 Variance

Variance measures how much the data varies around the mean. It quantifies the spread or dispersion of a set of data points.

## 🧮 Formula (for one feature):

Given $m$ data points $x_1, x_2, ..., x_m$, the variance is:

$$
\text{Var}(x) = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})^2
$$

where $\bar{x}$ is the mean of the data.
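
A short numeric check of the formula, using a made-up sample. Note that `np.var` divides by $m$ by default (`ddof=0`), which matches the formula above, while `ddof=1` gives the sample variance that divides by $m - 1$.

```python
import numpy as np

# Hypothetical sample of m = 8 values.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
m = len(x)

# Direct translation of the formula: mean of squared deviations.
x_bar = x.sum() / m
var_manual = ((x - x_bar) ** 2).sum() / m

print(var_manual)           # 4.0
print(np.var(x))            # 4.0, same 1/m convention
print(np.var(x, ddof=1))    # divides by m - 1 instead, so slightly larger
```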

## 📌 Intuition:

* A high variance means the data points are spread out.
* A low variance means the data points are close to the mean.

## 📎 In PCA:

* Variance tells us how much information or signal is present along a direction.
* PCA finds the directions (principal components) that maximize variance, i.e., directions along which the data is most spread out.
* These directions help capture the structure of the data efficiently when reducing dimensions.

![image.png](attachment:image.png)

When we project onto the axis along which the data has more variance, we lose little information about the data. For example, in the graph above, projecting onto the y-axis would make the two points look very close together, which is clearly not the case; projecting onto the x-axis preserves their separation.
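
The same effect can be seen with two made-up points (their coordinates are assumptions, not read off the figure): they are far apart along x but almost level in y, so only the x-axis projection keeps them apart.

```python
import numpy as np

# Two hypothetical points: far apart along x, nearly level in y.
p = np.array([1.0, 2.0])
q = np.array([9.0, 2.2])

true_dist = np.linalg.norm(p - q)   # distance in the original 2D space
dist_on_x = abs(p[0] - q[0])        # distance after keeping only x
dist_on_y = abs(p[1] - q[1])        # distance after keeping only y

print(true_dist, dist_on_x, dist_on_y)
```

The x-axis projection keeps almost all of the original distance, while the y-axis projection collapses the two points nearly on top of each other.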