# **Principal Component Analysis Model Theory**


## Principal Component Analysis (PCA)

---

## Theory
Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much variance as possible. It achieves this by identifying the directions (principal components) in which the data varies the most. PCA is widely used in data visualization, noise reduction, and feature extraction.

The main idea is to:
- Compute the covariance matrix of the data.
- Perform eigenvalue decomposition to identify the principal components.
- Project the data onto the principal components to reduce dimensionality.

---

## Mathematical Foundation
- **Covariance Matrix**:
  The covariance matrix \( \Sigma \) of the data \( X \) is computed as:
  $$ \Sigma = \frac{1}{n-1} X^T X $$
  - \( X \): Data matrix with one row per data point, centered so that each feature has zero mean.
  - \( n \): Number of data points.

- **Eigenvalue Decomposition**:
  The covariance matrix is decomposed into eigenvalues \( \lambda_i \) and eigenvectors \( v_i \):
  $$ \Sigma v_i = \lambda_i v_i $$
  - \( \lambda_i \): Eigenvalues (represent the amount of variance explained by each principal component).
  - \( v_i \): Eigenvectors (represent the directions of the principal components).

- **Principal Components**:
  The principal components are the eigenvectors sorted by their corresponding eigenvalues in descending order.

- **Dimensionality Reduction**:
  The data is projected onto the top \( k \) principal components to obtain the reduced-dimensional representation:
  $$ Y = X V_k $$
  - \( Y \): Reduced-dimensional data.
  - \( V_k \): Matrix of the top \( k \) eigenvectors.
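
A short derivation, sketched here to show why the eigenvalues can be read as variances. It assumes the eigenvectors are chosen orthonormal (always possible because \( \Sigma \) is symmetric); \( \Lambda_k \) is notation introduced only for this sketch and denotes the diagonal matrix of the top \( k \) eigenvalues. Because \( X \) is centered, \( Y \) is centered too, so its covariance is:

$$ \operatorname{Cov}(Y) = \frac{1}{n-1} Y^T Y = \frac{1}{n-1} V_k^T X^T X V_k = V_k^T \Sigma V_k = \Lambda_k $$

The projected coordinates are therefore uncorrelated, and the \( i \)-th one has variance \( \lambda_i \).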

---

## Algorithm Steps
1. **Standardization**:
   - Standardize the data to have zero mean and unit variance.

2. **Covariance Matrix**:
   - Compute the covariance matrix of the standardized data.

3. **Eigenvalue Decomposition**:
   - Perform eigenvalue decomposition on the covariance matrix to obtain eigenvalues and eigenvectors.

4. **Sorting**:
   - Sort the eigenvectors by their corresponding eigenvalues in descending order.

5. **Projection**:
   - Project the data onto the top \( k \) eigenvectors to obtain the reduced-dimensional representation.
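
The five steps above can be sketched in NumPy roughly as follows. This is a minimal illustration, not a reference implementation: the function name `pca_fit_transform` and the toy data are assumptions made for the example, and the code assumes no feature is constant (otherwise standardization divides by zero).

```python
import numpy as np


def pca_fit_transform(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # 1. Standardize: zero mean and unit variance per feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # 2. Covariance matrix of the standardized data.
    n = X_std.shape[0]
    cov = X_std.T @ X_std / (n - 1)

    # 3. Eigendecomposition; eigh is meant for symmetric matrices and
    #    returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Sort eigenpairs by eigenvalue, largest first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5. Project onto the top-k eigenvectors.
    Y = X_std @ eigvecs[:, :k]
    return Y, eigvals, eigvecs


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Y, eigvals, eigvecs = pca_fit_transform(X, k=2)
print(Y.shape)  # (200, 2)
```

The variance of each column of `Y` equals the corresponding eigenvalue, which is a convenient sanity check on the sorting and projection steps.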

---

## Key Parameters
- **n_components**: The number of principal components to retain.
- **svd_solver**: The solver used for the underlying singular value decomposition (e.g., `auto`, `full`, `arpack`, `randomized`).
- **whiten**: Whether to whiten the data (scale the components to unit variance).
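
These parameter names match `sklearn.decomposition.PCA`, so a small usage sketch with scikit-learn may help. The random data and the specific values chosen below are illustrative assumptions only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))  # toy data, for illustration only

# Keep 3 components, use the exact (full) SVD solver, and whiten the output.
pca = PCA(n_components=3, svd_solver="full", whiten=True)
Y = pca.fit_transform(X)

print(Y.shape)                         # (100, 3)
print(pca.explained_variance_ratio_)   # one ratio per retained component
print(Y.var(axis=0, ddof=1))           # approximately 1 per component when whiten=True
```

With `whiten=False` the projected columns would instead keep variances equal to the retained eigenvalues.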

---

## Advantages
- Reduces dimensionality while preserving variance.
- Improves computational efficiency by reducing the number of features.
- Helps in visualizing high-dimensional data.
- Produces uncorrelated components, which removes multicollinearity among the transformed features.

---

## Disadvantages
- Assumes linear relationships between features.
- Sensitive to the scaling of features.
- May not capture complex structures in the data.
- Interpretability of principal components can be challenging.

---

## Implementation Tips
- **Standardize** the data before applying PCA to ensure equal contribution from all features.
- Use **scree plots** to determine the optimal number of principal components.
- Consider **Kernel PCA** for non-linear dimensionality reduction.
- Use **explained variance ratio** to understand the contribution of each principal component.

---

## Applications
- Data visualization (e.g., 2D/3D plots of high-dimensional data)
- Noise reduction
- Feature extraction for machine learning models
- Image compression
- Genomics and bioinformatics

PCA is a powerful and widely used technique for dimensionality reduction. While it has limitations, it is a valuable tool for many real-world applications.

# Model Evaluation for Principal Component Analysis (PCA)

---

### 1. Explained Variance Ratio
**Formula:**
$$
\text{Explained Variance Ratio} = \frac{\text{Variance of Principal Component}}{\text{Total Variance}}
$$
**Description:**
- Measures the proportion of the dataset's variance explained by each principal component.
- Cumulative explained variance indicates how much information is retained by the top \( k \) components.

**Interpretation:**
- Higher values indicate that the component captures more variance.
- Useful for determining the optimal number of components to retain.
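
As a rough sketch, the ratio can be computed directly from the eigenvalues of the covariance matrix. The toy data, with deliberately unequal feature spreads, is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data whose features have very different spreads.
X = rng.normal(size=(300, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

Xc = X - X.mean(axis=0)                      # center the data
cov = Xc.T @ Xc / (Xc.shape[0] - 1)          # covariance matrix
eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, largest first

# Each eigenvalue divided by the total variance (the sum of all eigenvalues).
explained_variance_ratio = eigvals / eigvals.sum()
print(explained_variance_ratio)
```

The ratios sum to 1, and here the first component dominates because one feature has a much larger spread than the others.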

---

### 2. Cumulative Explained Variance
**Formula:**
$$
\text{Cumulative Explained Variance} = \sum_{i=1}^k \text{Explained Variance Ratio}_i
$$
**Description:**
- Measures the total proportion of variance explained by the first \( k \) principal components.

**Interpretation:**
- Values close to 1 indicate that the top \( k \) components capture most of the variance.
- Helps decide how many components to keep for dimensionality reduction.
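
One common way to use this quantity is to pick the smallest \( k \) whose cumulative value reaches a target. The sketch below uses hypothetical ratios and an illustrative 90% threshold; neither comes from a real dataset.

```python
import numpy as np

# Hypothetical explained variance ratios, sorted from largest to smallest.
explained_variance_ratio = np.array([0.55, 0.25, 0.12, 0.05, 0.03])

cumulative = np.cumsum(explained_variance_ratio)

threshold = 0.90  # illustrative target, not a universal rule
# Index of the first cumulative value reaching the threshold, plus one for a count.
k = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(k)  # 3
```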

---

### 3. Scree Plot Analysis
**Description:**
- A graphical representation of the explained variance ratio for each principal component.
- Helps identify the "elbow point" where adding more components provides diminishing returns.

**Interpretation:**
- The elbow point suggests the optimal number of components to retain.
- Components after the elbow contribute little to explaining variance.

---

### 4. Reconstruction Error
**Formula:**
$$
\text{Reconstruction Error} = \frac{1}{N} \sum_{i=1}^N ||x_i - \hat{x}_i||^2
$$
**Description:**
- Measures the error between the original data and the data reconstructed from the principal components.
- Indicates how well the PCA preserves the original data structure.

**Interpretation:**
- Lower values indicate better reconstruction.
- Useful for assessing the quality of dimensionality reduction.
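
A minimal NumPy sketch of this formula, assuming the reconstruction \( \hat{x}_i \) is obtained by projecting onto the top \( k \) eigenvectors and mapping back. The helper `reconstruction_error` and the toy data are hypothetical names and values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data

mean = X.mean(axis=0)
Xc = X - mean
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (Xc.shape[0] - 1))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]


def reconstruction_error(k):
    """Mean squared distance between each x_i and its rank-k reconstruction."""
    V_k = eigvecs[:, :k]
    X_hat = (Xc @ V_k) @ V_k.T + mean       # project, then map back
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))


errors = [reconstruction_error(k) for k in range(1, 6)]
print(errors)  # decreases as k grows; close to zero when k equals the feature count
```

The error for a given \( k \) equals the sum of the discarded eigenvalues scaled by \( (N-1)/N \), which ties this metric back to explained variance.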

---

### 5. Singular Values
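**Formula:**
$$
\text{Singular Value}_i = \sqrt{(n-1)\,\lambda_i}
$$
**Description:**
- The singular values of the centered data matrix \( X \); \( \lambda_i \) are the eigenvalues of the covariance matrix and \( n \) is the number of data points.
- Represent the importance of each principal component.
- Larger singular values correspond to components that capture more variance.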

**Interpretation:**
- Useful for understanding the relative importance of each component.
- Helps identify components that can be discarded.
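
A quick numerical check, on assumed toy data, of how the singular values \( s_i \) of the centered data matrix relate to the covariance eigenvalues: \( \lambda_i = s_i^2 / (n-1) \).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))  # toy data

Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Singular values of the centered data, largest first.
singular_values = np.linalg.svd(Xc, compute_uv=False)

# Eigenvalues of the covariance matrix, reordered to largest first.
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]

print(np.allclose(singular_values**2 / (n - 1), eigvals))  # True
```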

---

### 6. Principal Component Loadings
**Description:**
- Represents the contribution of each original feature to a principal component.
- Loadings indicate the direction and magnitude of the feature's influence.

**Interpretation:**
- Higher absolute values indicate stronger influence.
- Useful for interpreting the meaning of principal components.
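
Conventions for "loadings" vary. The sketch below assumes the common definition in which each eigenvector is scaled by the square root of its eigenvalue; with standardized data, a loading then equals the correlation between a feature and a component's scores. The data is a toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))

# Standardize so that loadings can be read as correlations.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigvals, eigvecs = np.linalg.eigh(Z.T @ Z / (Z.shape[0] - 1))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rows are features, columns are components (assumed convention).
loadings = eigvecs * np.sqrt(eigvals)

scores = Z @ eigvecs
# Correlation of feature 0 with the first component's scores.
corr = np.corrcoef(Z[:, 0], scores[:, 0])[0, 1]
print(np.isclose(corr, loadings[0, 0]))  # True
```

Under this convention the squared loadings of each feature sum to 1 across all components, so each squared loading can be read as a share of that feature's variance.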

---

### 7. Correlation Circle (Biplot)
**Description:**
- A graphical representation of the relationships between original features and principal components.
- Helps visualize how features contribute to the components.

**Interpretation:**
- Features whose arrows reach close to the circle's edge are well represented by the plotted components and influence them strongly.
- Useful for feature selection and interpretation.

---

### 8. Dimensionality Reduction Effectiveness
**Description:**
- Evaluates the effectiveness of PCA in reducing the number of dimensions while retaining information.
- Measured by comparing performance metrics (e.g., classification accuracy) before and after PCA.

**Interpretation:**
- Higher retained performance indicates effective dimensionality reduction.
- Useful for assessing the trade-off between complexity and information loss.
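
One way to run such a comparison is sketched below with a nearest-centroid classifier written in NumPy. The synthetic two-class data, the 50/50 split, and the helper `nearest_centroid_accuracy` are all assumptions made for illustration; any classifier and metric could be substituted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, n_features = 200, 10

# Two synthetic classes separated along the first feature (illustrative data).
X0 = rng.normal(size=(n_per_class, n_features))
X1 = rng.normal(size=(n_per_class, n_features))
X0[:, 0] -= 3.0
X1[:, 0] += 3.0
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then split into train and test halves.
perm = rng.permutation(len(y))
X, y = X[perm], y[perm]
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]


def nearest_centroid_accuracy(A_train, A_test):
    """Fit class centroids on the training set and score on the test set."""
    centroids = np.stack([A_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(A_test[:, None, :] - centroids[None, :, :], axis=2)
    return np.mean(dists.argmin(axis=1) == y_test)


# Fit PCA on the training data only, then apply it to both splits.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
V_k = Vt[:2].T  # top-2 principal directions

acc_full = nearest_centroid_accuracy(X_train, X_test)
acc_pca = nearest_centroid_accuracy((X_train - mean) @ V_k, (X_test - mean) @ V_k)
print(acc_full, acc_pca)
```

Fitting the projection on the training split alone avoids leaking test information into the comparison.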

---

### 9. Computational Efficiency
**Description:**
- Evaluates the computational cost of performing PCA, including time and memory usage.
- Important for large datasets or real-time applications.

**Interpretation:**
- Lower computational cost indicates better scalability.
- Useful for assessing the practicality of PCA for specific use cases.

---

### 10. Outlier Detection
**Description:**
- Evaluates the ability of PCA to detect outliers by examining the reconstruction error or the distance of data points from the principal component subspace.

**Interpretation:**
- Points with high reconstruction error or large distances are potential outliers.
- Useful for identifying anomalies in the data.
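
A minimal sketch of the idea, assuming the "distance from the principal component subspace" is the squared norm of the residual left after projecting onto the top \( k \) components. The data is synthetic: points near a plane plus one injected point far from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 points lying close to the plane z = 0, plus one point far from it.
inliers = np.column_stack([
    rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(scale=0.01, size=200),
])
outlier = np.array([[0.0, 0.0, 5.0]])
X = np.vstack([inliers, outlier])

mean = X.mean(axis=0)
Xc = X - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:2].T  # two-dimensional principal subspace

# Squared distance from each point to the principal subspace.
residual = Xc - (Xc @ V_k) @ V_k.T
errors = np.sum(residual**2, axis=1)

print(int(np.argmax(errors)))  # 200, the index of the injected point
```

In practice a threshold on `errors` (for example a high percentile) would be needed to decide which points to flag; that choice is problem-specific.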

---

## sklearn template [scikit-learn: PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)


# Principal Component Analysis - Example

## Data loading

## Data processing

## Plotting data

## Model definition

## Model evaluation