# https://www.geeksforgeeks.org/principal-component-analysis-pca/


# Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique commonly used in machine learning and statistics. Its main objective is to transform high-dimensional data into a lower-dimensional representation that retains as much of the data's variance as possible. PCA achieves this by identifying the principal components: mutually orthogonal directions in the feature space, ordered by how much of the data's variance lies along each one.

### Steps in Principal Component Analysis (PCA):

1. **Standardization:**
   - Standardize the features (subtract the mean and divide by the standard deviation) so that all features have the same scale. Without this step, features with large numeric ranges dominate the principal components.

2. **Covariance Matrix Calculation:**
   - Compute the covariance matrix of the standardized data. The covariance matrix provides information about the relationships between different features.

3. **Eigendecomposition:**
   - Perform eigendecomposition on the covariance matrix to obtain the eigenvalues and eigenvectors. Each eigenvector represents a principal component, and the corresponding eigenvalue represents the amount of variance explained by that component.

4. **Sort Eigenvectors:**
   - Sort the eigenvectors in descending order based on their corresponding eigenvalues. This allows you to prioritize the principal components that capture the most variance.

5. **Select Principal Components:**
   - Choose the top \(k\) eigenvectors (principal components) based on the desired dimensionality reduction. The cumulative explained variance can be used to determine the appropriate number of components.

6. **Projection:**
   - Project the standardized data onto the selected principal components (multiply it by the matrix whose columns are the top \(k\) eigenvectors) to obtain the lower-dimensional representation.
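
The six steps above can be sketched directly in NumPy. This is a minimal illustration rather than a production implementation: the function name `pca_from_scratch`, the synthetic three-feature dataset, and the choice of `k=2` are all assumptions made for the example.

```python
import numpy as np

def pca_from_scratch(X, k):
    """Illustrative PCA (hypothetical helper, not a library function)."""
    # 1. Standardize: zero mean and unit variance for each feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # 2. Covariance matrix of the standardized data (features are columns)
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigendecomposition; eigh is intended for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Sort eigenpairs by eigenvalue, largest first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5. Keep the top k eigenvectors as the principal components
    components = eigvecs[:, :k]

    # 6. Project the standardized data onto those components
    return X_std @ components, eigvals, components

# Assumed synthetic data: two correlated features plus one independent feature
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

Z, eigvals, components = pca_from_scratch(X, k=2)
print("Projected shape:", Z.shape)
print("Explained variance ratio:", eigvals / eigvals.sum())
```

Because the data is standardized, the eigenvalues sum to the number of features, and the variance of each projected column equals the corresponding eigenvalue.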

### Example Using Python and Scikit-Learn:

Here's an example of applying PCA to a synthetic dataset using the `PCA` class from scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Generate synthetic data with two features
np.random.seed(42)
X = np.random.rand(100, 2) * 10  # Two-dimensional data

# Apply PCA with two components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Visualize the original and transformed data
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], alpha=0.8, s=50)
plt.title('Original Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.8, s=50)
plt.title('PCA Transformed Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

plt.tight_layout()
plt.show()
```

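In this example, synthetic data with two features is generated and PCA is applied with `n_components=2`. Because the number of components equals the number of features, no dimensions are dropped: the transformation centers the data and rotates it onto the principal axes. The scatter plots show the original data and the transformed data side by side. Note that scikit-learn's `PCA` centers the data but does not scale it, so standardization has to be applied separately when it is needed. The two features here are independent uniform draws with the same range, so neither component should be expected to dominate.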
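
Step 5 mentions using cumulative explained variance to choose the number of components. The sketch below shows one way to do that with scikit-learn's `explained_variance_ratio_` attribute. The five-feature dataset, the two latent factors, and the 95% threshold are assumptions chosen for illustration, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumed synthetic data: 5 observed features driven by 2 latent factors plus noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# Standardize first, since PCA itself only centers the data
X_std = StandardScaler().fit_transform(X)

# Fit with all components to inspect the variance explained by each
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the (arbitrary) threshold
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)

# Refit keeping only k components to get the reduced representation
X_reduced = PCA(n_components=k).fit_transform(X_std)

print("Cumulative explained variance:", cumulative)
print("Chosen k:", k, "Reduced shape:", X_reduced.shape)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it selects the number of components needed to reach that fraction of explained variance.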

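PCA can be understood through two equivalent objectives:

- Minimize the projection residuals (the squared distances between the data points and their projections)
- Maximize the variance of the projected data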
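
The two objectives listed above, minimizing projection residuals and maximizing variance, select the same direction. For centered data, each point's squared length splits into its squared projection plus its squared residual, so the two averages always add up to the same total. The sketch below checks this numerically by scanning unit directions in two dimensions. The dataset and the one-degree angle grid are assumptions made for the illustration.

```python
import numpy as np

# Assumed synthetic 2-D data with one clearly dominant direction
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])
Xc = X - X.mean(axis=0)  # center the data

# Candidate unit directions, one per degree over a half circle
angles = np.linspace(0, np.pi, 180, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

variance = np.empty(len(angles))
residual = np.empty(len(angles))
for i, u in enumerate(directions):
    scores = Xc @ u                      # 1-D coordinates along u
    reconstruction = np.outer(scores, u)  # projections back in 2-D
    variance[i] = np.mean(scores ** 2)
    residual[i] = np.mean(np.sum((Xc - reconstruction) ** 2, axis=1))

total = np.mean(np.sum(Xc ** 2, axis=1))
best_var = int(np.argmax(variance))
best_res = int(np.argmin(residual))

# Compare with the leading eigenvector of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
top = eigvecs[:, -1]  # eigh returns eigenvalues in ascending order

print("Max-variance angle (deg):", np.degrees(angles[best_var]))
print("Min-residual angle (deg):", np.degrees(angles[best_res]))
print("Alignment with top eigenvector:", abs(directions[best_var] @ top))
```

The direction found by the scan should agree with the leading eigenvector up to the grid resolution and sign.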