<a href="https://colab.research.google.com/github/Shahsawar51/MY_DATA_SCIENCE_JOURNEY/blob/main/wk29_detailed_PCA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 1: PCA Basics, Motivations, and Core Steps
# What is PCA?

Definition: PCA is a statistical technique that reduces the dimensionality of a dataset while preserving most of its information (variance). It re-expresses the data in a new coordinate system whose axes are called principal components.  
Intuition: Imagine your data as a 3D cloud of points (e.g., height, weight, age). PCA finds the directions (principal components) where the cloud is most spread out and projects the data onto those directions.  
Example: We reduced our 3D dataset (height, weight, age) to 1D (PC1), capturing 97.96% of the variation.

# **Motivations for Dimensionality Reduction**
Why use PCA? Here are the key reasons:

Simplify Analysis: Fewer features make data easier to analyze and interpret (e.g., one PC1 score vs. three features).  
Faster Machine Learning: Fewer features speed up model training (e.g., a model predicting athletic ability using PC1 trains faster).  
Visualization: High-dimensional data (e.g., 1000D) can be plotted in 2D/1D (e.g., PC1 line plot).  
Noise Removal: PCA discards low-variance components (e.g., PC3 with 0 variance) that may contain noise.  
Storage Efficiency: Reduced data requires less memory (e.g., 24 numbers to 8 for our dataset).

Our Case: We used PCA to simplify our 3D dataset into 1D, making it easier for analysis, visualization, and modeling.

# The Curse of Dimensionality

What is it? When a dataset has too many features (e.g., 1000), it becomes sparse, causing problems for analysis and machine learning.  
Issues:  
Sparsity: Data points are far apart, making patterns hard to find.  
Computation Cost: High-dimensional matrices (e.g., 1000 × 1000 covariance) are slow to process.  
Overfitting: Models memorize noise, failing on new data.  
Visualization: Impossible to plot 1000D data directly.


PCA’s Role: Reduces dimensions (e.g., 1000 to 50) while keeping most info, mitigating the curse.  
Our Case: Our dataset had only 3 features, so the curse wasn’t an issue, but PCA still simplified it effectively.
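
The sparsity issue can be illustrated with a small experiment. The sketch below uses hypothetical uniform random data (not the notebook's dataset) and a hypothetical helper `distance_contrast`. It measures how much the farthest neighbour of a point differs from its nearest one; as the number of dimensions grows, that contrast shrinks, which is one reason patterns become hard to find.

```python
import numpy as np

def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min of the distances from the first point to all others."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, n_dims))    # hypothetical uniform data
    d = np.linalg.norm(X[1:] - X[0], axis=1)    # distances from point 0
    return float((d.max() - d.min()) / d.min())

# The contrast drops sharply as dimensionality increases.
for dims in (2, 10, 100, 1000):
    print(dims, round(distance_contrast(200, dims), 3))
```

A low contrast means every point is roughly equally far from every other point, so distance-based reasoning loses its power.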

# PCA Steps (Theory)

Let’s walk through PCA’s steps using our dataset:

# Step 1: Standardize the Data

What? Scale features to have mean = 0 and standard deviation = 1, as features have different units (cm, kg, years).  
How? $$\text{Standardized Value} = \frac{\text{Value} - \text{Mean}}{\text{Standard Deviation}}$$  
Our Dataset:  
Original: Height (160-180 cm), Weight (50-70 kg), Age (15-19 years).  
Standardized: Mean = 0, std = 1 for each.  
Example (Student 1): Height = $\frac{160 - 169}{6.264} \approx -1.437$.


Why? Prevents features with larger ranges (e.g., height) from dominating.
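
A minimal sketch of this step with NumPy follows. The array `X` is hypothetical (the notebook's raw rows are not reproduced here), and the population standard deviation (`ddof=0`) is assumed because it matches the $\frac{1}{n}$ covariance convention used in the next step.

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Column-wise z-score using the population standard deviation (ddof=0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=0)
    return (X - mean) / std

# Hypothetical rows of (height cm, weight kg, age years); not the notebook's data.
X = np.array([[160.0, 50.0, 15.0],
              [165.0, 55.0, 16.0],
              [170.0, 62.0, 17.0],
              [180.0, 70.0, 19.0]])
Z = standardize(X)
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))

# The worked example from the text: (160 - 169) / 6.264
print(round((160 - 169) / 6.264, 3))  # -1.437
```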

# **Step 2: Compute the Covariance Matrix**

What? A matrix whose diagonal entries are the variances of each feature and whose off-diagonal entries are the covariances between pairs of features.  
Formula: $$\text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})$$ For standardized data (mean = 0): $$\text{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^n X_i \cdot Y_i$$  
Our Case: $$C = \begin{bmatrix}1.000 & 1.000 & 0.954 \\ 1.000 & 1.000 & 0.954 \\ 0.954 & 0.954 & 1.000\end{bmatrix}$$  
Diagonal (1.000): Variance of each feature (standardized = 1).  
Off-diagonal (1.000, 0.954): Height and weight are perfectly correlated (1.000); age is strongly correlated with both (0.954).


Why? Captures the data’s spread and relationships.
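
The formula above can be sketched in a few lines of NumPy. The data here is hypothetical random input and `covariance_matrix` is an illustrative helper name; the point is only that, for standardized columns, the $\frac{1}{n}$ covariance matrix has ones on the diagonal and equals the correlation matrix.

```python
import numpy as np

def covariance_matrix(Z: np.ndarray) -> np.ndarray:
    """Covariance with the 1/n convention; Z is assumed to have zero-mean columns."""
    n = Z.shape[0]
    return (Z.T @ Z) / n

# Hypothetical data: 8 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize with population std

C = covariance_matrix(Z)
print(C.round(3))
```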

# Step 3: Compute Eigenvalues and Eigenvectors

What?  
Eigenvalues ($\lambda$): Quantify how much variance each principal component captures.  
Eigenvectors ($v$): Define the directions of principal components.


How? Solve the characteristic equation: $$\det(C - \lambda I) = 0$$ Then find the eigenvectors: $$(C - \lambda I)v = 0$$  
Our Case:  
Polynomial: $-\lambda^3 + 3\lambda^2 - 0.179768\lambda = 0$.  
Eigenvalues: $\lambda_1 \approx 2.9388$ (97.96%), $\lambda_2 \approx 0.0612$ (2.04%), $\lambda_3 = 0$ (0%).  
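Eigenvector for $\lambda_1$: $v_1 \approx \begin{bmatrix} 0.577 \\ 0.577 \\ 0.577 \end{bmatrix}$ (PC1), i.e., roughly equal weight on all three features. This equal-weight vector is a rounded approximation (the exact unit eigenvector of this $C$ is closer to $\begin{bmatrix} 0.580 \\ 0.580 \\ 0.571 \end{bmatrix}$); the approximation is the one used in the calculations that follow.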


Why? Eigenvalues show importance (variance), eigenvectors give the direction (new axes).

Intuition: The covariance matrix acts like a machine that transforms vectors. Eigenvectors are the special directions that it only stretches (by a factor of $\lambda$) without rotating them.
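
As a numerical check of this step, the covariance matrix quoted above can be passed to `numpy.linalg.eigh`, which is intended for symmetric matrices. This is a sketch rather than the notebook's own computation; the sign flip at the end is only a convention, since an eigenvector and its negative describe the same direction.

```python
import numpy as np

# Covariance matrix as quoted in the text (rows/columns: height, weight, age).
C = np.array([[1.000, 1.000, 0.954],
              [1.000, 1.000, 0.954],
              [0.954, 0.954, 1.000]])

eigenvalues, eigenvectors = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
order = np.argsort(eigenvalues)[::-1]           # re-sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]           # columns are the eigenvectors

v1 = eigenvectors[:, 0]
if v1.sum() < 0:                                # fix the arbitrary sign
    v1 = -v1

print(eigenvalues.round(4))                          # variance captured per component
print((eigenvalues / eigenvalues.sum()).round(4))    # explained variance ratios
print(v1.round(3))                                   # PC1 direction
```

The eigenvalues should come out close to 2.9388, 0.0612, and 0, and the PC1 direction close to equal weights on the three features.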

# Part 2: Projection, Evaluation, Variants, and Use Cases

# Step 4: Project Data onto Principal Components

What? Map the data onto selected principal components to reduce dimensions.  
How? Use the dot product: $$\text{PC1 Score} = x \cdot v_1 = x_1 \cdot 0.577 + x_2 \cdot 0.577 + x_3 \cdot 0.577$$ where $x$ is a standardized data point and $v_1$ is the PC1 eigenvector.  
Our Case:  
Standardized point 1: $\begin{bmatrix} -1.437 \\ -1.437 \\ -1.234 \end{bmatrix}$.  
PC1 score: $(-1.437 \cdot 0.577) + (-1.437 \cdot 0.577) + (-1.234 \cdot 0.577) \approx -2.370$.  
New dataset: 1D (PC1 scores) with 97.96% variance.  


| Student | PC1 |
|---|---|
| 1 | -2.370 |
| 2 | -1.009 |
| 3 | -0.089 |
| 4 | -0.020 |
| 5 | 1.706 |
| 6 | -2.002 |
| 7 | 3.066 |
| 8 | 0.716 |





Why? Reduces dimensions (3D to 1D) while retaining most information.
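
The projection is a single dot product per data point. The sketch below reuses the standardized point 1 and the rounded PC1 vector quoted in the text; `project` is a hypothetical helper showing the same operation for a whole matrix of points.

```python
import numpy as np

v1 = np.full(3, 0.577)                    # PC1 direction as quoted in the text
x = np.array([-1.437, -1.437, -1.234])    # standardized point 1 from the text

pc1_score = float(x @ v1)
print(round(pc1_score, 3))  # -2.37

def project(Z: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project rows of Z (n x d) onto the columns of `components` (d x k)."""
    return Z @ components

# Same result when the point is treated as a 1 x 3 matrix.
print(project(x[None, :], v1[:, None]).round(3))
```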

# Step 5: Evaluate PCA Performance

How can you check whether PCA worked well?  

Explained Variance Ratio:  
PC1 = 97.96%, which is excellent (a common rule of thumb is to keep at least 95%).  
Formula: $\frac{\lambda_1}{\sum \lambda_i}$.


Reconstruction Error:  
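Reconstruct: $\text{PC1 Score} \cdot v_1$.  
The mean squared error (MSE) is small because only 2.04% of the variance is discarded. For point 1, the per-feature MSE is $\frac{(-0.070)^2 + (-0.070)^2 + 0.133^2}{3} \approx 0.009$.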


Downstream Task Performance:  
A model trained on PC1 (e.g., for athletic prediction) should reach accuracy close to that of a model trained on the original features.  
A visualization (PC1 line plot) should still show clear patterns.


Scree Plot:  
Plot eigenvalues vs. component number. Elbow at PC1 confirms 1 component is enough.


Cross-Validation:  
A PC1-based model’s cross-validated accuracy should match that of a model trained on the original data.



Our Case: PCA was highly effective (97.96% variance, low error, good task performance).
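
The first two checks are linked: for standardized data, the mean squared reconstruction error equals the fraction of variance that was discarded. The sketch below demonstrates this on hypothetical correlated data; `evaluate_pca` is an illustrative helper, not a library function.

```python
import numpy as np

def evaluate_pca(Z: np.ndarray, k: int):
    """Return (explained variance ratio of the first k PCs, mean squared reconstruction error)."""
    n = Z.shape[0]
    C = (Z.T @ Z) / n
    eigenvalues, eigenvectors = np.linalg.eigh(C)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    W = eigenvectors[:, :k]                 # d x k projection matrix
    Z_hat = (Z @ W) @ W.T                   # project, then map back
    explained = eigenvalues[:k].sum() / eigenvalues.sum()
    mse = np.mean((Z - Z_hat) ** 2)
    return explained, mse

# Hypothetical data: three noisy copies of one underlying signal.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

explained, mse = evaluate_pca(Z, k=1)
print(round(float(explained), 4), round(float(mse), 4))
print(np.isclose(mse, 1 - explained))  # True
```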

# Can PCA Be Reversed?

Partially Possible: Reconstruct data using PC scores and eigenvectors: $$\text{Reconstructed Point} = (\text{PC1 Score}) \cdot v_1$$  
Example: PC1 = -2.370 → $\begin{bmatrix} -1.367 \\ -1.367 \\ -1.367 \end{bmatrix}$ (vs. original $\begin{bmatrix} -1.437 \\ -1.437 \\ -1.234 \end{bmatrix}$).


Why Not Fully? PC2 (2.04%) and PC3 (0%) info lost, as we kept only PC1.  
Fully Possible? If all PCs (PC1, PC2, PC3) kept, 100% reconstruction, but no dimensionality reduction benefit.  
Our Case: 2.04% loss due to PC1-only, so approximate reconstruction.
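
Both cases can be checked numerically. The sketch below uses the numbers quoted in the text for the PC1-only reconstruction, then assumes the covariance matrix from Step 2 to show that keeping all three components recovers the point exactly.

```python
import numpy as np

x = np.array([-1.437, -1.437, -1.234])   # standardized point 1 from the text
v1 = np.full(3, 0.577)                   # PC1 direction as quoted in the text
pc1_score = -2.370                       # PC1 score from the text

# PC1-only reconstruction: approximate.
x_hat = pc1_score * v1
mse = float(np.mean((x - x_hat) ** 2))
print(x_hat.round(3), round(mse, 4))

# Keeping all components: exact, because the eigenvector matrix is orthogonal.
C = np.array([[1.000, 1.000, 0.954],
              [1.000, 1.000, 0.954],
              [0.954, 0.954, 1.000]])
_, W = np.linalg.eigh(C)                 # columns are orthonormal eigenvectors
x_full = (x @ W) @ W.T                   # project onto all PCs, then map back
print(np.allclose(x_full, x))  # True
```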

# PCA for Nonlinear Datasets

Issue: PCA assumes linear relationships, ineffective for highly nonlinear data (e.g., weight = height²).  
Our Case: Height, weight, age were linearly correlated (covariance: 1.000, 0.954), so PCA worked well.  
Nonlinear Data: PCA misses curved patterns. For circular data, for example, no single direction captures most of the variance, so projecting onto one component discards the structure.  
Alternatives: t-SNE, UMAP, Kernel PCA for nonlinear patterns.
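
The circular case is easy to verify. In the sketch below (hypothetical points placed evenly on a unit circle), the two eigenvalues are equal, so PC1 explains only half of the variance even though the data really has a single underlying parameter, the angle.

```python
import numpy as np

# Hypothetical data: 200 points evenly spaced on the unit circle.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])

Xc = X - X.mean(axis=0)                      # center the data
C = (Xc.T @ Xc) / len(Xc)                    # 1/n covariance matrix
eigenvalues = np.linalg.eigvalsh(C)[::-1]    # largest first

ratio = eigenvalues / eigenvalues.sum()
print(ratio.round(3))  # each ratio is 0.5
```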

# PCA Variants and When to Use Them

Vanilla PCA:  
When? Small/medium linear datasets (e.g., our 8 rows, 3 features).  
Why? Exact results, simple, fits in memory.


Incremental PCA:  
When? Large datasets (millions of rows) or streaming data that don’t fit in memory.  
Why? Processes data in batches, memory-efficient.


Randomized PCA:  
When? Large, high-dimensional datasets (e.g., 1000 features, fast results needed).  
Why? Uses approximations for speed, slight accuracy loss.


Kernel PCA:  
When? Nonlinear datasets (e.g., image pixels, text embeddings).  
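Why? Implicitly maps data to a higher-dimensional feature space through a kernel so that nonlinear patterns can be captured, but it is computationally heavy because the kernel matrix grows with the square of the number of samples.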



Our Case: Vanilla PCA was ideal (small, linear dataset).
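
For reference, the four variants map onto scikit-learn estimators roughly as sketched below. This assumes scikit-learn is installed; the data is hypothetical random input, and the parameter values (`n_components=5`, `batch_size=100`, `gamma=0.05`) are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA, KernelPCA

# Hypothetical data: 500 rows, 20 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

reducers = {
    # Exact decomposition; suitable when the data fits in memory.
    "vanilla": PCA(n_components=5, svd_solver="full"),
    # Fits in mini-batches; suitable when the data does not fit in memory.
    "incremental": IncrementalPCA(n_components=5, batch_size=100),
    # Approximate solver that trades a little accuracy for speed.
    "randomized": PCA(n_components=5, svd_solver="randomized", random_state=0),
    # Kernel trick for nonlinear structure.
    "kernel": KernelPCA(n_components=5, kernel="rbf", gamma=0.05),
}

shapes = {name: reducer.fit_transform(X).shape for name, reducer in reducers.items()}
print(shapes)
```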

# Estimating Dimensions for 95% Variance

Scenario: 1000D dataset, want 95% variance.  
How? Choose $k$ PCs where: $$\frac{\lambda_1 + \dots + \lambda_k}{\sum \lambda_i} \geq 0.95$$  
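Our Case: 1 PC (97.96%) was enough for 3D. For a 1000D dataset the answer depends heavily on how correlated the features are: strongly correlated data may need only a few dozen PCs, weakly correlated data several hundred.  
Why Uncertain? The exact $k$ depends on the eigenvalue distribution (use a scree plot or the cumulative explained variance).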
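
The selection rule can be sketched as a cumulative sum over the sorted eigenvalues. `components_for_variance` is a hypothetical helper name, and the 1000-value spectrum at the end is made up purely to show the call on a larger input.

```python
import numpy as np

def components_for_variance(eigenvalues, target: float = 0.95) -> int:
    """Smallest k whose cumulative explained variance ratio reaches `target`."""
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # largest first
    cumulative = np.cumsum(ev) / ev.sum()
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(ev))   # guard against floating-point shortfall at target = 1.0

# Eigenvalues quoted in the text.
print(components_for_variance([2.9388, 0.0612, 0.0]))  # 1

# Hypothetical, quickly decaying 1000D spectrum.
spectrum = 1.0 / np.arange(1, 1001) ** 2
print(components_for_variance(spectrum))
```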

# Chaining Dimensionality Reduction Algorithms

Does It Make Sense? Sometimes, if algorithms complement each other.  
When?  
Coarse to Fine: PCA (1000D to 100D), then t-SNE (100D to 2D for visualization).  
Linear + Nonlinear: PCA for noise removal, Kernel PCA for nonlinear patterns.  
Efficiency: PCA simplifies data for complex methods (e.g., autoencoders).


Our Case: Linear data, so PCA alone was enough. Chaining (e.g., PCA + t-SNE) useful if nonlinear patterns existed.  
Risks: Info loss increases, pipeline complexity grows, redundant processing.
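
A coarse-to-fine chain might look like the sketch below, assuming scikit-learn is installed. The data is hypothetical random input, and the sizes (200 features reduced to 30, then to 2) and the perplexity value are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical data: 150 rows, 200 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 200))

# Stage 1: PCA as a cheap linear reduction.
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

# Stage 2: t-SNE on the reduced data, for 2D visualization only.
X_2d = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X_pca)

print(X_pca.shape, X_2d.shape)  # (150, 30) (150, 2)
```

Running t-SNE on the PCA output rather than on the raw features keeps the second stage cheaper, at the cost of whatever variance the first stage discarded.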

# Drawbacks of PCA

Info Loss: 2.04% lost in our case (PC2). Critical if PC2 had unique info.  
Interpretability: PCs (e.g., PC1 = height + weight + age mix) less intuitive than original features.  
Computation Cost: Heavy for large datasets (e.g., 1000D covariance matrix).  
Linear Assumption: Fails on nonlinear data (e.g., curved patterns).  
Choosing PCs: Tricky to decide how many PCs to keep (scree plot helps).

# Final Use Cases of PCA

Dimensionality Reduction: Simplified our 3D to 1D for analysis/ML.  
Visualization: PC1 line plot showed student “size” patterns.  
Noise Removal: Ignored PC2, PC3 (low/no variance).  
Feature Engineering: PC1 as input for ML models (e.g., athletic prediction).  
Compression: Reduced storage (24 to 8 numbers).

Our Case: PCA created a 1D dataset (PC1) representing “overall size,” ideal for ML, visualization, and storage.

# Conclusion
This notebook covered PCA’s theory in depth:  

Basics: What PCA is, why it’s used, curse of dimensionality.  
Steps: Standardization, covariance matrix, eigenvalues/eigenvectors, projection.  
Advanced: Evaluation, reversal, nonlinear data, variants, chaining, drawbacks.  
Our Dataset: 8 students, 3 features reduced to 1D (97.96% variance).
