The problem PCA is trying to solve:

In real datasets:
1) We have many features
2) Many of them are correlated
3) High dimensionality leads to:
    > slow models
    
    > overfitting
    
    > hard visualization
    
    > curse of dimensionality

Principal Component Analysis (PCA) is a dimensionality reduction technique that:
> Finds new axes (directions) in the data

> Chooses these axes so that they capture maximum variance

> Keeps the axes orthogonal, so the resulting components are uncorrelated with each other

### Geometric Intuition

> When a dataset has many features, it is hard to tell in advance which features to keep.

### PCA Steps

1. Collect the dataset with numerical features.
2. Clean the data (handle missing values and duplicates).
3. Standardize all features so they are on the same scale.
4. Compute the covariance matrix to understand relationships between features.
5. Calculate eigenvalues and eigenvectors of the covariance matrix.
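6. Sort the eigenvalues (together with their eigenvectors) in descending order; each eigenvalue is the variance explained by its component.
7. Select the top k principal components (the eigenvectors with the k largest eigenvalues).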
8. Project the original data onto the new component axes.
9. Analyze explained variance to check information retention.
10. Use the PCA-transformed data for visualization or modeling.
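
The numerical steps above (3 to 9) can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `pca`, the toy matrix `X`, and the choice `k=2` are assumptions made for the example, and the data is randomly generated.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # Step 3: standardize each feature to zero mean and unit variance.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 4: covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # Step 5: eigh is intended for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 6: reorder eigenvalues and their eigenvectors, largest first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 7: keep the top-k eigenvectors (as columns).
    components = eigvecs[:, :k]
    # Step 8: project the standardized data onto the new axes.
    Z = X_std @ components
    # Step 9: fraction of total variance retained by each kept component.
    explained_ratio = eigvals[:k] / eigvals.sum()
    return Z, components, explained_ratio

# Hypothetical toy data: feature 2 is almost a multiple of feature 1,
# feature 3 is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

Z, components, explained_ratio = pca(X, k=2)
print(Z.shape)
print(explained_ratio)
```

Because the first two toy features are strongly correlated, most of the variance should land on the first component, and the columns of `Z` should be uncorrelated with each other.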


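##### We make a new column from a combination of other columns -> feature construction (also called feature extraction) [in PCA]
##### PCA never chooses between features; it creates new ones that capture more variance.

> A simple feature-selection rule would keep whichever of two features has the higher variance. When the two have equal variance, that rule has no basis for choosing.
PCA sidesteps the choice: it does not select one feature; instead, it rotates the axes and creates a new feature (a linear combination of the originals) along the direction of maximum variance.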
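
A small sketch of the equal-variance case, assuming two features with variance 1 each and a covariance of 0.8 (numbers chosen only for illustration). The first principal direction comes out along the diagonal, mixing both features equally, and the variance along it is larger than that of either original feature.

```python
import numpy as np

# Assumed covariance matrix: both features have variance 1.0, covariance 0.8.
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])

# eigh returns eigenvalues in ascending order, so the last column is PC1.
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]
pc1_variance = eigvals[-1]

print(pc1)           # equal-magnitude weights on both features (sign may flip)
print(pc1_variance)  # larger than 1.0, the variance of each original feature
```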

### Variance — Why is it Important?

Variance measures how much the data spreads out from its mean.

In machine learning, variance is important because:
- High variance often indicates more information and diversity in the data.
- Low variance often means the feature is almost constant and carries little useful information.
- Features with zero or near-zero variance do not help models learn patterns.

In PCA:
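- Variance is treated as a proxy for information.
- Variance depends on a feature's units, which is why features are standardized before PCA.
- Directions with high variance capture the main structure of the data.
- Directions with low variance are assumed to mostly represent noise.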

Geometric intuition:
- A wider data spread means better separation between points.
- PCA keeps directions where points are far apart and discards directions where points collapse.

In simple terms:
- High variance = strong signal
- Low variance = weak signal or noise


### Variance (Mathematical Definition)

Variance measures the average squared deviation of data points from their mean.

For a dataset:
x₁, x₂, x₃, ..., xₙ

1. Compute the mean (average):
μ = (1/n) Σ xᵢ

2. Compute deviation from the mean:
(xᵢ − μ)

3. Square the deviations:
(xᵢ − μ)²

4. Take the average of squared deviations:

Population Variance:
σ² = (1/n) Σ (xᵢ − μ)²

Sample Variance (x̄ is the sample mean):
s² = (1/(n − 1)) Σ (xᵢ − x̄)²
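
The four steps can be followed by hand on a small list. The numbers below are made up for illustration; the standard-library `statistics` module is used only to cross-check the manual result.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values
n = len(data)

# Step 1: mean.
mean = sum(data) / n
# Steps 2 and 3: deviation from the mean, then squared.
squared_devs = [(x - mean) ** 2 for x in data]
# Step 4: average the squared deviations.
population_var = sum(squared_devs) / n        # divide by n
sample_var = sum(squared_devs) / (n - 1)      # divide by n - 1

print(mean, population_var, sample_var)
# Cross-check against the standard library.
print(statistics.pvariance(data), statistics.variance(data))
```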

Why square the deviations?
- Prevents positive and negative values from canceling out
- Penalizes larger deviations more
- Makes variance mathematically convenient for optimization

In PCA:
- Variance along a direction tells how much information is captured
- PCA finds the unit-length direction along which the projected data has maximum variance σ²
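
To make "variance along a direction" concrete, the sketch below projects centered 2D data onto a unit vector and measures the variance of the projection. The data, the helper name `variance_along`, and the angle grid are assumptions for illustration. Under these assumptions, no direction on the grid should beat the top eigenvector of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical correlated 2D data.
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 1], [1, 1]], size=500)
Xc = X - X.mean(axis=0)            # center the data
C = np.cov(Xc, rowvar=False)       # sample covariance matrix

def variance_along(w):
    """Sample variance of the data projected onto direction w."""
    w = w / np.linalg.norm(w)      # use a unit-length direction
    return np.var(Xc @ w, ddof=1)

# Top eigenvector = direction of maximum variance.
eigvals, eigvecs = np.linalg.eigh(C)
top = eigvecs[:, -1]

# Compare against directions sampled every degree from 0 to 180.
angles = np.linspace(0, np.pi, 181)
best_on_grid = max(variance_along(np.array([np.cos(a), np.sin(a)])) for a in angles)

print(variance_along(top), eigvals[-1], best_on_grid)
```

The variance along the top eigenvector matches the largest eigenvalue, which is why sorting eigenvalues ranks the directions by captured variance.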


### Spread vs Variance (Important Distinction)

Spread is a qualitative, geometric idea.
Variance is a quantitative, mathematical measure.

Variance does NOT equal spread.
If spread is measured by the standard deviation σ, variance is its square, σ².

Mathematically:
Variance = average of (distance from mean)²

If every data point moves twice as far from the mean,
the variance becomes four times as large.
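
A quick numerical check of this scaling, using made-up values: each point is moved twice as far from the mean, and the variance and standard deviation are compared before and after.

```python
import statistics

data = [1.0, 3.0, 5.0, 7.0]  # illustrative values
mean = statistics.fmean(data)

# Move every point twice as far from the mean.
stretched = [mean + 2 * (x - mean) for x in data]

var_ratio = statistics.pvariance(stretched) / statistics.pvariance(data)
std_ratio = statistics.pstdev(stretched) / statistics.pstdev(data)

print(var_ratio)  # variance scales with the square of the stretch
print(std_ratio)  # standard deviation scales linearly
```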

So:
- Larger spread → larger variance
- Smaller spread → smaller variance
- But they are not the same thing

Why PCA uses variance:
- Variance provides a precise, computable measure of spread
- Squaring emphasizes larger deviations
- Makes optimization and linear algebra tractable

Correct way to say it:
"Variance is a mathematical quantity that is proportional to the square of the data spread."

PCA intuition:
- PCA finds directions with maximum variance
- Which corresponds to directions with maximum data spread


##### Is the modulus (absolute value) function differentiable at zero?

The modulus function is:
f(x) = |x|

For x > 0:
f′(x) = 1

For x < 0:
f′(x) = −1

At x = 0:
- Left-hand derivative = −1
- Right-hand derivative = +1

Since the left-hand and right-hand derivatives are not equal,
the derivative at x = 0 does NOT exist.

Therefore:
The modulus (absolute value) function is NOT differentiable at zero.
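
The one-sided slopes can also be estimated numerically. This is only a finite-difference sketch (the helper `one_sided_slopes` and the step size `h` are assumptions for illustration), but it shows the contrast: for |x| the left and right slopes at zero disagree, while for x² both are close to zero.

```python
def one_sided_slopes(f, x, h=1e-6):
    """Finite-difference estimates of the left and right slopes of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

# Absolute value: slopes at 0 disagree, so no derivative exists there.
abs_left, abs_right = one_sided_slopes(abs, 0.0)

# Square: both slopes are near 0, consistent with a derivative of 0 at x = 0.
sq_left, sq_right = one_sided_slopes(lambda t: t * t, 0.0)

print(abs_left, abs_right)
print(sq_left, sq_right)
```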


> ML usually prefers variance over the mean absolute deviation because the squared term is smooth and differentiable everywhere, including at zero, which enables efficient gradient-based optimization.

>A good analogy is viewing the world from the right camera angle:
Just like a photographer chooses the best angle to capture the most meaningful view of a scene, PCA chooses the direction of highest variance so that when we reduce dimensions we retain as much of the important structure in the data as possible. This is why we say PCA tries to retain most of the variance while reducing dimensions.