# **PCA** - Unsupervised Machine Learning

The goal of PCA is to mitigate the curse of dimensionality.

## **Principal Component Analysis (PCA)**

Principal Component Analysis (PCA) is a powerful unsupervised dimensionality reduction technique. Its primary goal is **to transform a high-dimensional dataset into a lower-dimensional one** while retaining as much of the original variance as possible. This is achieved by identifying a new set of orthogonal axes (principal components) that capture the most significant variance in the data. The first principal component accounts for the largest possible variance, the second for the next largest, and so on.

## Pros
*   **Dimensionality Reduction:** Simplifies complex datasets, making them easier to visualize and analyze.
*   **Noise Reduction:** By focusing on the principal components that explain most of the variance, PCA can effectively filter out noise.
*   **Feature Extraction:** Creates new, uncorrelated features that can be used as input for other machine learning algorithms, often improving their performance and reducing training time.
*   **Data Visualization:** Enables visualization of high-dimensional data in 2D or 3D.
*   **Faster Execution:** With fewer features, downstream models train and predict faster.

## Limitations

1. **Interpretability:** The principal components are not always interpretable, as they are linear combinations of the original variables.
2. **Assumptions:** PCA assumes that the important structure in the data is linear. Each principal component is a linear (additive) combination of the original variables, so nonlinear relationships are not captured.
3. **Sensitivity to Outliers:** PCA is sensitive to outliers, as they can dominate the results and affect the principal components.
4. **Loss of Information:** PCA is a lossy compression technique, as it discards some of the variance in the data.
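5. **Computational Complexity:** PCA can be computationally expensive for large datasets, particularly those with many features, because it computes and decomposes a covariance matrix whose size grows with the square of the number of features.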
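
The outlier sensitivity mentioned in the list above can be seen in a small sketch. The data here is synthetic and chosen only for illustration: the points spread mostly along the x-axis, and then a single extreme point is added along the y-axis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Synthetic points spread mostly along the x-axis (std 3.0 vs 0.5)
X = np.column_stack([rng.normal(0, 3.0, 200), rng.normal(0, 0.5, 200)])

# Same data plus one extreme point far along the y-axis
X_outlier = np.vstack([X, [[0.0, 100.0]]])

# Direction of the first principal component in each case
pc1_clean = PCA(n_components=1).fit(X).components_[0]
pc1_outlier = PCA(n_components=1).fit(X_outlier).components_[0]

print("First component without outlier:", pc1_clean.round(3))
print("First component with outlier:   ", pc1_outlier.round(3))
```

Without the outlier, the first component points roughly along the x-axis. With it, the single point inflates the variance along y enough that the first component swings toward the y-axis.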


## How PCA Works

1. **Standardization:** PCA is sensitive to the variances of the initial variables. If there is a variable with a much larger variance than others, it dominates the results. Therefore, it is essential to standardize the data (subtract the mean and divide by the standard deviation) before applying PCA.
2. **Covariance Matrix:** PCA computes the covariance matrix of the standardized data. This matrix shows the relationships between variables.
3. **Eigenvalues and Eigenvectors:** The covariance matrix is decomposed into its eigenvalues and eigenvectors. Eigenvalues represent the amount of variance explained by each eigenvector (principal component). Eigenvectors are the directions of maximum variance.
4. **Principal Components:** The eigenvectors are sorted by their corresponding eigenvalues (from highest to lowest). The first principal component accounts for the largest possible variance, the second for the next largest, and so on.
5. **Projection:** The data is projected onto the principal components, resulting in a lower-dimensional representation while preserving most of the variance.
6. **Dimensionality Reduction:** The number of principal components can be selected based on the amount of variance retained (e.g., 95% of the total variance).
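
The six steps above can be sketched directly in NumPy. This is a minimal illustration and not how scikit-learn implements PCA internally; the function name `pca_from_scratch` and the `variance_to_keep` parameter are made up for this example.

```python
import numpy as np

def pca_from_scratch(X, variance_to_keep=0.95):
    """Illustrative PCA that follows the steps listed above."""
    # 1. Standardize: zero mean and unit standard deviation per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized data (features are columns)
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigen-decomposition; eigh is suitable because cov is symmetric
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort components from largest to smallest eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # 6. Pick the smallest k whose cumulative variance ratio reaches the target
    ratios = eigenvalues / eigenvalues.sum()
    k = int(np.searchsorted(np.cumsum(ratios), variance_to_keep) + 1)
    k = min(k, len(eigenvalues))  # guard against floating-point round-off

    # 5. Project the standardized data onto the first k components
    return X_std @ eigenvectors[:, :k], ratios[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # make two features strongly correlated

Z, ratios = pca_from_scratch(X)
print("Reduced shape:", Z.shape)
print("Variance ratio per kept component:", ratios.round(3))
```

Because two of the five features are nearly copies of each other, fewer than five components are enough to reach the 95% target.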

## Key Concepts

1. **Principal Components:** These are the new orthogonal axes that capture the most significant variance in the data. The first principal component accounts for the largest possible variance, the second for the next largest, and so on.
2. **Explained Variance:** The amount of variance explained by each principal component is given by its corresponding eigenvalue. The total variance in the data is the sum of all eigenvalues, so the explained variance ratio of a component is its eigenvalue divided by that sum.
3. **Cumulative Explained Variance:** This shows the cumulative percentage of variance explained by the principal components. It helps in selecting the number of principal components to retain.
4. **Scree Plot:** A plot of the eigenvalues (variances) against the principal components. It helps in visually selecting the number of principal components to retain.
5. **Data Compression:** PCA reduces the dimensionality of the data while preserving most of the variance. This makes it useful for data compression and visualization.
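
As a rough sketch of how explained variance and cumulative explained variance guide the choice of components, the snippet below fits PCA with all components kept and then finds how many are needed to reach 95%. The data is synthetic (10 features driven by 3 hidden factors plus a little noise), which is an assumption made only for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: 10 observed features generated from 3 latent factors plus noise
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)  # keep all components to inspect the full spectrum

# Eigenvalues of the covariance matrix: the y-values of a scree plot
eigenvalues = pca.explained_variance_

# Running total of the explained variance ratio
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
k = int(np.argmax(cumulative >= 0.95) + 1)

print("Eigenvalues:", eigenvalues.round(3))
print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 95%:", k)
```

Plotting `eigenvalues` against the component index gives the scree plot; the point where the curve flattens is the usual visual cue for how many components to keep.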

## Applications

1. **Image Compression:** PCA can be used to compress images by representing them in a lower-dimensional space while maintaining most of the visual information.
2. **Data Visualization:** PCA is often used to visualize high-dimensional data in 2D or 3D, making it easier to identify patterns and relationships.
3. **Feature Extraction:** PCA can be used to extract features from complex datasets, often improving the performance of machine learning algorithms.
4. **Noise Reduction:** By focusing on the principal components that explain most of the variance, PCA can effectively filter out noise from the data.
5. **Dimensionality Reduction:** PCA is particularly useful when working with high-dimensional datasets, as it can reduce the number of features while preserving most of the variance.
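
The compression and noise reduction applications above can be sketched with `inverse_transform`, which maps the reduced data back to the original space. The data is hypothetical: a clean 2-dimensional signal embedded in 20 features and then corrupted with noise. All features share the same scale here, so standardization is skipped.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: a clean rank-2 signal in 20 dimensions, then noise is added
clean = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 20))
noisy = clean + 0.3 * rng.normal(size=clean.shape)

pca = PCA(n_components=2)
compressed = pca.fit_transform(noisy)               # 20 numbers per row become 2
reconstructed = pca.inverse_transform(compressed)   # back to 20 dimensions (lossy)

# Compare both versions against the clean signal
err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((reconstructed - clean) ** 2)

print("Compressed shape:", compressed.shape)
print(f"MSE of noisy data vs clean signal:         {err_noisy:.4f}")
print(f"MSE of PCA reconstruction vs clean signal: {err_denoised:.4f}")
```

The reconstruction is closer to the clean signal than the noisy input is, because most of the noise lies in the 18 discarded directions.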






## Implementation in Python

Here's a simple example of how to implement PCA in Python using the scikit-learn library:


```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Generate some random data
data = np.random.rand(100, 5)       # 100 rows, 5 columns

# Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Apply PCA, keeping the first two principal components
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)

print(data_pca.shape)
print()

# Print the explained variance ratio
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print()

# Print the principal components (one row per component, one column per original feature)
print("Principal Components: \n", pca.components_)
print()
```

Output (the exact values vary between runs because the data is random):

```
(100, 2)

Explained Variance Ratio: [0.24846213 0.22260685]

Principal Components: 
 [[-0.09630283  0.46665079  0.63448315 -0.603602   -0.0778368 ]
 [ 0.38315862  0.54365468 -0.05432731  0.20991972  0.71457072]]
```



Suppose you are working on a dataset without the required domain knowledge. It has multiple features, you want to reduce its dimensionality, and you do not know which features are important. You can either acquire domain knowledge about the features (not efficient), or plot two features on a scatterplot and compare their **variance**. Provided the features are on comparable scales, the one with the higher variance is treated as the more important feature and is kept for training the ML model. **This is called Feature Selection.**

Now suppose you are working on a different dataset where both features have equal variance and are equally important. In that case you can derive a new feature from the two. **This is called Feature Extraction.**

*  For example, suppose the number of rooms and the number of washrooms are used to predict the price. You can combine the two into a new feature such as 'Total rooms and washrooms' / 'Total rooms and washrooms per sqft' / 'Total sqft' / 'Size of the flat'. After that, you can drop both original features and train the ML model on the new one.
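
PCA can do this kind of feature extraction automatically. The sketch below uses made-up housing data (the numbers are not from any real dataset) in which the number of washrooms tends to grow with the number of rooms, and replaces the two features with a single extracted one.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Made-up housing data: washrooms tend to increase with rooms
rooms = rng.integers(1, 7, size=200).astype(float)
washrooms = np.clip(np.round(rooms / 2 + rng.normal(0, 0.5, size=200)), 1, None)
X = np.column_stack([rooms, washrooms])

# Standardize, then keep a single principal component
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=1)
size_feature = pca.fit_transform(X_scaled)   # one extracted feature replaces two

print("New feature shape:", size_feature.shape)
print("Weights on (rooms, washrooms):", pca.components_[0].round(3))
print("Variance retained:", pca.explained_variance_ratio_[0].round(3))
```

The extracted feature weights both standardized inputs equally in magnitude, so it behaves like a rough "size of the flat" score. Its sign is arbitrary, which is worth keeping in mind when interpreting it.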

Variance is important in data analysis and machine learning for several reasons:

1.  **Measures Spread/Dispersion:** Variance quantifies how much individual data points deviate from the mean. A high variance indicates that data points are spread out over a wider range of values, while a low variance suggests that data points tend to be close to the mean.

2.  **Feature Selection:** In machine learning, features with very low or zero variance provide little to no information for a model to learn from, as they are essentially constant. Features with higher variance often contain more discriminatory power.

3.  **Understanding Data Distribution:** Variance (along with the mean) helps in understanding the shape and characteristics of a data distribution. For example, in a normal distribution, variance determines the width of the bell curve.

4.  **Risk Assessment:** In finance, variance is a key measure of risk. Higher variance in asset returns indicates greater volatility and thus higher risk.

5.  **Statistical Inference:** Variance is a critical component in many statistical tests and confidence interval calculations. For instance, the standard error of the mean depends on the sample variance.

6.  **Principal Component Analysis (PCA):** PCA relies heavily on variance. It aims to find new directions (principal components) that capture the maximum variance in the data. These components represent the most significant patterns in the dataset. By projecting data onto components with high variance, PCA reduces dimensionality while retaining as much information (spread) as possible.

7.  **Model Performance and Overfitting:** In some models, understanding the variance of predictions can help diagnose issues like overfitting (high variance in predictions on different subsets of training data). The bias-variance trade-off is a fundamental concept in supervised learning.
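
As a small illustration of the spread and feature selection points above, the sketch below computes per-feature variance and drops a constant feature with scikit-learn's `VarianceThreshold`. The three columns are synthetic and chosen only to make the contrast obvious. Note that raw variance depends on the units of each feature, so comparing it across features on different scales can mislead.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(7)

X = np.column_stack([
    rng.normal(0, 5.0, size=100),   # widely spread feature
    rng.normal(0, 1.0, size=100),   # moderately spread feature
    np.full(100, 3.0),              # constant feature: zero variance
])

print("Per-feature variance:", X.var(axis=0).round(3))

# Keep only features whose variance is above the threshold (here: non-zero)
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print("Kept columns:", selector.get_support())
print("Shape before/after:", X.shape, X_reduced.shape)
```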

In summary, variance provides crucial insights into the variability and information content within a dataset, making it a foundational concept in statistics, data science, and machine learning.
