# Principal Component Analysis (PCA)

### What is PCA?

Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of features in a dataset while preserving as much variance (information) as possible. It transforms the data into a new coordinate system defined by principal components (PCs), which are linear combinations of the original features.

### How PCA Works:

1. **Standardize the Data:** Rescale each feature to mean 0 and variance 1 so that features measured on larger scales do not dominate the components.
2. **Compute Covariance Matrix:** Calculate the covariance matrix of the dataset to understand relationships between features.
3. **Compute Eigenvectors and Eigenvalues:**
    - Eigenvectors determine the directions (principal components) of maximum variance.
    - Eigenvalues measure the amount of variance explained by each principal component.
4. **Sort Principal Components:** Rank components by their eigenvalues in descending order.
5. **Select Top Components:** Retain the first $ k $ principal components, with $ k $ chosen so that they explain most of the variance.
6. **Transform Data:** Project the original data onto the selected principal components.
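
The six steps above can be sketched with NumPy. This is a minimal illustration rather than a production implementation: the helper name `pca_eig` and the synthetic three-feature dataset are assumptions made for the example.

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (illustrative helper)."""
    # Step 1: standardize each feature to mean 0 and variance 1.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized data (features x features).
    cov = np.cov(X_std, rowvar=False)
    # Step 3: eigh handles symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: sort components by eigenvalue, largest first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep the top n_components eigenvectors as columns of W.
    W = eigvecs[:, :n_components]
    # Step 6: project the standardized data onto the selected components.
    return X_std @ W, eigvals, W

# Synthetic data (assumption): the third feature is almost a + b, so two
# components should capture nearly all of the variance.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 2))
X = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + A[:, 1] + 0.05 * rng.normal(size=200)])

Z, eigvals, W = pca_eig(X, n_components=2)
explained = eigvals / eigvals.sum()
print(Z.shape)
print(explained.round(3))
```

The eigendecomposition route is shown because it mirrors the steps one-to-one; library implementations typically use an SVD of the centered data instead, which is numerically more stable.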

### PCA Formula:

The projection of a centered data point $ x $ onto a principal component $ p $ is:

$ x' = x \cdot p $

where $ p $ is a unit-length eigenvector of the covariance matrix and $ x' $ is the coordinate of $ x $ along that component.
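
Stacking the top $ k $ eigenvectors as columns gives the projection for a whole dataset at once. As a sketch, with symbols introduced here for illustration ($ X $ is the $ n \times d $ matrix of centered data points, $ C $ its covariance matrix, $ \lambda_j $ the eigenvalue belonging to eigenvector $ p_j $, and $ W_k $ the matrix of the first $ k $ eigenvectors):

$ C = \frac{1}{n-1} X^\top X, \qquad C\,p_j = \lambda_j\,p_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d $

$ W_k = [\,p_1 \;\; p_2 \;\; \cdots \;\; p_k\,], \qquad Z = X W_k, \qquad \hat{X} = Z W_k^\top $

Each row of $ Z $ holds the $ k $ coordinates of one data point, and $ \hat{X} $ is its approximate reconstruction in the original (centered) feature space.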

### Key Properties of PCA:

- **Linear transformation:** PCA finds linear combinations of the original features.
- **Variance maximization:** The first principal component explains the most variance, the second explains the next most (orthogonal to the first), and so on.
- **Unsupervised:** PCA does not use labels.

### Applications of PCA:

- Reduce dimensionality for visualization (e.g., projecting data to 2D or 3D).
- Speed up machine learning models by removing redundant features.
- Denoise data by discarding components with low variance.
- Feature extraction and compression.

### Advantages of PCA:

- Can reduce overfitting by removing redundant, highly correlated features.
- Makes high-dimensional data easier to visualize and explore by reducing dimensions.
- Speeds up computations for high-dimensional data.

### Disadvantages of PCA:

- Linear method, so it struggles with non-linear relationships.
- Sensitive to scaling of data.
- Can lose interpretability of original features.

# t-Distributed Stochastic Neighbor Embedding (t-SNE)

### What is t-SNE?

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique primarily used for visualizing high-dimensional data in 2D or 3D. It focuses on preserving the local structure of the data.

### How t-SNE Works:

1. **Pairwise Similarities in High-Dimensional Space:**

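    Compute the conditional probability $ p_{j|i} $ that point $ i $ would pick point $ j $ as its neighbor in the original space, using a Gaussian centered on $ x_i $:

    $ p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} $

    The bandwidth $ \sigma_i $ is set per point so that the distribution matches the chosen perplexity. The symmetrized joint probability over $ N $ points is then:

    $ P_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} $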

2. **Pairwise Similarities in Low-Dimensional Space:**

    Compute the probability $ q_{ij} $ in the lower-dimensional space using a t-distribution:

    $ q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} $

3. **Minimize Divergence Between Distributions:**

    Use Kullback-Leibler (KL) Divergence as the objective function to match the high-dimensional probabilities ($ P $) with the low-dimensional ones ($ Q $):

    $ KL(P \| Q) = \sum_{i \neq j} P_{ij} \log \frac{P_{ij}}{q_{ij}} $

    Optimize using gradient descent to adjust the low-dimensional embeddings.
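
The three quantities above can be computed directly with NumPy. This is a rough sketch, not a full t-SNE: it uses one fixed `sigma` for every point instead of tuning a bandwidth per point, it evaluates the objective for a random embedding without running gradient descent, and the function names are made up for the example.

```python
import numpy as np

def squared_distances(A):
    """Pairwise squared Euclidean distances between the rows of A."""
    sq = np.sum(A**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * A @ A.T
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def joint_p(X, sigma=1.0):
    """Symmetrized high-dimensional similarities (fixed sigma is a simplification)."""
    n = X.shape[0]
    logits = -squared_distances(X) / (2.0 * sigma**2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)      # row i holds the conditional probabilities for point i
    return (cond + cond.T) / (2.0 * n)

def joint_q(Y):
    """Low-dimensional similarities from a Student-t kernel with one degree of freedom."""
    inv = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), skipping entries where P is exactly zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))  # toy high-dimensional points
Y = rng.normal(size=(30, 2))   # random (unoptimized) 2D embedding
P, Q = joint_p(X), joint_q(Y)
print(P.sum(), Q.sum(), kl_divergence(P, Q))
```

A real implementation would repeatedly update `Y` along the negative gradient of this divergence, so the printed value is only the starting cost of a random layout.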

### Key Properties of t-SNE:

- Focuses on preserving local neighborhoods.
- Uses a heavy-tailed t-distribution in the low-dimensional space, which lets moderately dissimilar points sit further apart and so alleviates crowding.
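
To see why the heavier tail spreads points out, compare the distance at which each kernel produces the same (unnormalized) similarity. The snippet below is a small sketch using only the standard library; the function names and the choice of `sigma = 1` are assumptions for illustration.

```python
import math

def gaussian_distance(s, sigma=1.0):
    """Distance at which exp(-d^2 / (2 sigma^2)) equals similarity s (0 < s < 1)."""
    return sigma * math.sqrt(-2.0 * math.log(s))

def student_t_distance(s):
    """Distance at which 1 / (1 + d^2) equals similarity s (0 < s < 1)."""
    return math.sqrt(1.0 / s - 1.0)

for s in (0.1, 0.01, 0.001):
    print(f"similarity={s}: gaussian d={gaussian_distance(s):.2f}, "
          f"student-t d={student_t_distance(s):.2f}")
```

For small similarities the Student-t kernel needs a noticeably larger distance than the Gaussian, which is the room that keeps separate neighborhoods from collapsing onto each other in 2D.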

### Applications of t-SNE:

- Visualizing high-dimensional datasets like word embeddings, image data, or gene expressions.
- Exploring data clusters or patterns before applying other algorithms.

### Advantages of t-SNE:

- Handles non-linear relationships well.
- Effective at revealing local patterns in data.
- Great for visualization of complex datasets.

### Disadvantages of t-SNE:

- Computationally expensive, especially for large datasets.
- Results can vary between runs (non-deterministic) unless the random seed is fixed.
- Poor at preserving global structures (focuses on local relationships).

# PCA vs. t-SNE

| Aspect         | PCA                              | t-SNE                               |
|----------------|----------------------------------|-------------------------------------|
| Type           | Linear dimensionality reduction. | Non-linear dimensionality reduction.|
| Goal           | Preserve variance in data.       | Preserve local neighborhood structure. |
| Output         | New axes (principal components). | Low-dimensional embedding (usually 2D or 3D). |
| Scalability    | Fast and scalable to large datasets. | Computationally expensive.          |
| Global Structure | Preserves global structure.     | Focuses on local structure.         |
| Deterministic  | Yes with an exact solver (up to the sign of each component). | No (varies between runs unless seeded). |
| Applications   | Feature extraction, compression, denoising. | Data visualization, cluster exploration. |

### Key Takeaway

- Use PCA for reducing dimensions while retaining variance.
- Use t-SNE for visualizing clusters and non-linear patterns in data.

### Example of PCA

**Scenario: Image Compression**

Imagine we have grayscale images of handwritten digits (like in the MNIST dataset). Each image is 28x28 pixels, meaning there are 784 features (one for each pixel).

The goal is to reduce the dimensions while retaining the most critical information.

**Steps:**

1. **Original Data:**

	Each image is represented as a vector of 784 features.

2. **Apply PCA:**

	- Compute the covariance matrix.
	- Find the eigenvectors and eigenvalues.
	- Select the top components (e.g., 50).
	- Project the data onto these 50 components.

3. **Result:**

	- Each image is now represented by only 50 features (instead of 784), while retaining most of the variance (the exact fraction depends on the dataset and on preprocessing).
	- The reduced data can be stored and transmitted more efficiently.

4. **Reconstruction:**

	When we want to reconstruct the image from the reduced data, it will look similar to the original but with some loss of detail.
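
The compress-then-reconstruct cycle can be sketched as follows. To keep the example self-contained it uses synthetic 64-feature data built from 10 latent factors rather than real MNIST images, and it only centers the data (no per-feature scaling), which is a common choice when all features are pixels on the same scale. The names `compress` and `reconstruct` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for image data (assumption, NOT real MNIST pixels):
# 500 samples, 64 features, generated from 10 latent factors plus noise.
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 64))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 64))

mean = X.mean(axis=0)
# SVD of the centered data: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

def compress(X, k):
    """Project centered data onto the first k principal directions."""
    return (X - mean) @ Vt[:k].T

def reconstruct(Z):
    """Map k-dimensional codes back to the original feature space."""
    k = Z.shape[1]
    return Z @ Vt[:k] + mean

for k in (2, 5, 10):
    Z = compress(X, k)
    mse = np.mean((X - reconstruct(Z)) ** 2)
    retained = np.sum(S[:k] ** 2) / np.sum(S ** 2)
    print(f"k={k}: variance retained={retained:.3f}, reconstruction MSE={mse:.4f}")
```

Keeping more components raises the retained variance and lowers the reconstruction error, at the cost of a larger compressed representation.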

**Visualization of PCA on 3D Data:**

Imagine a dataset of points in 3D space (features: height, weight, and age). PCA reduces it to 2D:

- Original data: (height, weight, age).
- After PCA: (PC1, PC2), where PC1 and PC2 capture most of the variance.

### Example of t-SNE

**Scenario: Visualizing Word Embeddings**

Suppose you have word embeddings generated by models like Word2Vec or GloVe. Each word is represented as a vector in 300-dimensional space.

**Steps:**

1. **High-Dimensional Space:**

	Words like “king,” “queen,” and “castle” are points in a 300-dimensional space, where similar words are closer together.

2. **Apply t-SNE:**

	- Compute pairwise similarities in 300D space using Gaussian probabilities.
	- Map the high-dimensional points to a 2D or 3D space while preserving local relationships.

3. **Result:**

	- Words with similar meanings (e.g., “king” and “queen”) will form clusters in the 2D plot.
	- You might see distinct clusters for animals, countries, professions, etc.

**Visualization of t-SNE on MNIST Dataset:**

- Input: High-dimensional pixel data of handwritten digits (e.g., 784D).
- Output: A 2D plot where points representing the same digit (e.g., “0” or “1”) tend to form distinct clusters.
  - Cluster 1: Mostly “0”s.
  - Cluster 2: Mostly “1”s.

### PCA vs. t-SNE Examples in Real-World

| Use Case              | PCA Example                                                   | t-SNE Example                                               |
|-----------------------|---------------------------------------------------------------|-------------------------------------------------------------|
| Gene Expression Data  | Reduce thousands of gene features to top components for further analysis. | Visualize clusters of similar gene expressions in 2D.       |
| Image Data            | Compress images by reducing pixel dimensions (e.g., 784D to 50D). | Explore clusters of images (e.g., faces, objects).          |
| Text Data             | Reduce a TF-IDF matrix before topic modeling (truncated SVD is the usual variant for sparse matrices). | Visualize clusters of documents or word embeddings.         |
| Customer Segmentation | Reduce purchase behavior features to key components.          | Visualize customer groups (e.g., high vs. low spenders).    |

**Key Observations:**

1. **PCA:** Focuses on linear transformations and variance preservation, ideal for preprocessing and compression.
2. **t-SNE:** Non-linear and primarily for visualization to explore local patterns and clusters.


# Interview Questions on PCA and t-SNE

## PCA Questions

1. **What is PCA?**
    - PCA is a dimensionality reduction technique that transforms data into a new coordinate system defined by principal components, preserving as much variance as possible.

2. **How does PCA work?**
    - PCA standardizes data, computes the covariance matrix, finds eigenvectors and eigenvalues, ranks them, and projects data onto the top principal components.

3. **What are principal components?**
    - Principal components are orthogonal vectors that represent directions of maximum variance in the data.

4. **What is the role of eigenvalues in PCA?**
    - Eigenvalues indicate the amount of variance explained by each principal component. Larger eigenvalues mean more variance is captured.

5. **What are the limitations of PCA?**
    - PCA assumes linearity, is sensitive to scaling, and may lose interpretability of original features.

6. **When would you use PCA?**
    - To reduce dimensionality, speed up computations, or visualize high-dimensional data.

7. **How do you decide the number of principal components to keep?**
    - Use the explained variance ratio and select enough components to retain ~90-95% of the variance.

8. **What are some practical applications of PCA?**
    - Image compression, feature extraction, noise reduction, and exploratory data analysis.

9. **How does PCA differ from linear regression?**
    - PCA identifies new axes that maximize variance, while linear regression predicts a target variable based on input features.

10. **What type of data is suitable for PCA?**
     - Numeric, continuous data. Categorical features must be encoded numerically first, and even then PCA may not represent them well.
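
For question 7, the explained variance ratio can be turned into a component count with a cumulative sum. The sketch below is one way to do it in NumPy; the helper name `components_for_variance`, the 95% threshold, and the synthetic data with hand-picked feature scales are assumptions for the example.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches `threshold`."""
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
    ratio = S**2 / np.sum(S**2)               # explained variance ratio per component
    cumulative = np.cumsum(ratio)
    # First index where the cumulative ratio reaches the threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

rng = np.random.default_rng(0)
scales = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.2, 0.1])
X = rng.normal(size=(300, 8)) * scales  # independent features with different spreads
k, cumulative = components_for_variance(X, threshold=0.95)
print(k, cumulative.round(3))
```

The threshold is a rule of thumb; plotting `cumulative` and looking for the point where the curve flattens is a common complement.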

## t-SNE Questions

1. **What is t-SNE?**
    - t-SNE is a non-linear dimensionality reduction technique used for visualizing high-dimensional data in 2D or 3D.

2. **How does t-SNE differ from PCA?**
    - PCA is linear and preserves global variance, while t-SNE is non-linear and focuses on preserving local neighborhood structure.

3. **How does t-SNE work?**
    - t-SNE converts pairwise distances in the high-dimensional space into probabilities, defines matching probabilities in the low-dimensional space, and moves the low-dimensional points to minimize the KL divergence between the two.

4. **What is the role of the t-distribution in t-SNE?**
    - The t-distribution prevents the crowding problem in low-dimensional space by spreading points apart.

5. **What are the main hyperparameters of t-SNE?**
    - Perplexity (controls local vs. global focus), learning rate, and number of iterations.

6. **What are the limitations of t-SNE?**
    - Computationally expensive, non-deterministic, and poor at preserving global structure.

7. **When would you use t-SNE?**
    - For visualizing high-dimensional data clusters, such as word embeddings or gene expression data.

8. **Why is t-SNE non-deterministic?**
    - It initializes embeddings randomly and uses gradient descent, leading to slightly different results on each run.

9. **How does perplexity affect t-SNE?**
    - Perplexity is roughly the effective number of neighbors each point considers; lower values emphasize local structure and higher values bring in more global structure. Common values are 5–50.

10. **Can t-SNE handle large datasets?**
     - Not efficiently. For large datasets, techniques like PCA+t-SNE (dimensionality reduction with PCA first) or UMAP are better options.
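
For questions 5 and 9, it can help to see how perplexity ties to the Gaussian bandwidth of a single point. The sketch below searches for the bandwidth whose neighbor distribution has a target perplexity, defined as 2 raised to the Shannon entropy in bits. The function names, the geometric bisection, and the search bounds are choices made for this illustration, not the internals of any particular library.

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """Neighbor probabilities for one point, given squared distances to the others."""
    logits = -sq_dists_i / (2.0 * sigma**2)
    logits = logits - logits.max()  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity 2**H(p), with H the Shannon entropy in bits."""
    nz = p[p > 0]
    return float(2.0 ** (-np.sum(nz * np.log2(nz))))

def sigma_for_perplexity(sq_dists_i, target, tol=1e-4, max_iter=100):
    """Bisection on a log scale; perplexity grows as sigma grows."""
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        sigma = np.sqrt(lo * hi)  # geometric midpoint of the bracket
        perp = perplexity(conditional_probs(sq_dists_i, sigma))
        if abs(perp - target) < tol:
            break
        if perp > target:
            hi = sigma  # distribution too flat: shrink the bandwidth
        else:
            lo = sigma  # distribution too peaked: widen the bandwidth
    return float(sigma)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
sq = np.sum((X - X[0]) ** 2, axis=1)[1:]  # squared distances from point 0 to the rest
sigma = sigma_for_perplexity(sq, target=10.0)
print(sigma, perplexity(conditional_probs(sq, sigma)))
```

The target has to lie between 1 and the number of other points, which is why very small datasets cannot use large perplexity values.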

## Advanced PCA and t-SNE Questions

1. **How do PCA and t-SNE complement each other?**
    - PCA is often used to reduce dimensions before applying t-SNE, speeding up computations and reducing noise.

2. **What are eigenvalues and eigenvectors, and how are they used in PCA?**
    - Eigenvalues represent variance captured by principal components; eigenvectors define the direction of those components.

3. **What are the alternatives to PCA and t-SNE?**
    - PCA: Linear Discriminant Analysis (LDA), Factor Analysis.
    - t-SNE: UMAP, Isomap.

4. **Why does t-SNE minimize KL divergence?**
    - To match pairwise probabilities between high- and low-dimensional spaces, preserving local structure.

5. **How does t-SNE handle high-dimensional noise?**
    - It is sensitive to noise. Preprocessing steps like PCA or filtering can improve t-SNE performance.

### Other dimensionality reduction techniques

Other techniques may be a better fit, depending on the type of data and the relationships you want to preserve. Below is a list of popular methods, categorized into linear and non-linear approaches:

### Linear Dimensionality Reduction Methods

1. **Linear Discriminant Analysis (LDA):**
	- **How it works:** Finds a linear combination of features that separates classes in a dataset. Maximizes the distance between class means and minimizes the spread within each class.
	- **Use case:** Supervised learning problems for classification tasks.
	- **Limitation:** Works only with labeled data, yields at most (number of classes minus one) components, and assumes roughly Gaussian classes with a shared covariance matrix.

2. **Factor Analysis:**
	- **How it works:** Assumes that observed data is generated by a set of unobserved latent variables and some noise. Reduces dimensions by modeling the covariance structure of the data.
	- **Use case:** Psychometrics, social sciences, and marketing data.
	- **Limitation:** Works best when the data fits the factor analysis model assumptions.

3. **Independent Component Analysis (ICA):**
	- **How it works:** Separates mixed signals into statistically independent components. Often used in signal processing or for identifying hidden factors.
	- **Use case:** Blind source separation, such as separating audio signals (e.g., cocktail party problem).
	- **Limitation:** Sensitive to noise and requires careful preprocessing.

4. **Multi-Dimensional Scaling (MDS):**
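	- **How it works:** Preserves pairwise distances between points in the high-dimensional space when mapping to a lower-dimensional space. Classical MDS is linear (with Euclidean distances it gives the same result as PCA); metric and non-metric variants are non-linear.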
	- **Use case:** Visualizing data with meaningful distance metrics.
	- **Limitation:** Computationally expensive for large datasets.
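
To make the contrast between an unsupervised and a supervised linear method concrete, the sketch below computes the two-class Fisher direction used by LDA and compares it with the first principal direction on toy data. Everything here is illustrative: the helper names, the synthetic Gaussian classes, and the `separation` score are assumptions, and the two-class case is the simplest special case of LDA.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Unit vector that maximizes between-class over within-class scatter (two classes)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

def top_pc(X):
    """First principal direction of X, ignoring labels (for comparison)."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return Vt[0]

def separation(w, X0, X1):
    """Squared gap between projected class means over the summed projected variances."""
    a, b = X0 @ w, X1 @ w
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var())

# Toy data (assumption): both classes are stretched along (1, 1) but
# differ along (1, -1), so the highest-variance axis is not the useful one.
rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X1 = rng.multivariate_normal([1.5, -1.5], cov, size=200)

w_lda = fisher_direction(X0, X1)
w_pca = top_pc(np.vstack([X0, X1]))
print("LDA separation:", separation(w_lda, X0, X1))
print("PCA separation:", separation(w_pca, X0, X1))
```

On data like this the label-aware direction separates the classes far better than the first principal component, which follows the shared elongation instead.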

### Non-Linear Dimensionality Reduction Methods

1. **t-SNE (t-Distributed Stochastic Neighbor Embedding):**
	- **How it works:** Focuses on preserving local structures by minimizing KL divergence between high- and low-dimensional distributions.
	- **Use case:** Visualization of clusters in high-dimensional datasets.
	- **Limitation:** Non-deterministic and computationally expensive.

2. **UMAP (Uniform Manifold Approximation and Projection):**
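	- **How it works:** Builds a neighborhood graph in the high-dimensional space and optimizes a low-dimensional layout to match it. Similar in spirit to t-SNE, but often faster, and it tends to preserve more of the global structure.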
	- **Use case:** Large-scale high-dimensional data visualization.
	- **Limitation:** Hyperparameters require careful tuning.

3. **Isomap (Isometric Mapping):**
	- **How it works:** Computes geodesic distances between points on a manifold and preserves them in the low-dimensional space.
	- **Use case:** Non-linear relationships in manifold-like data.
	- **Limitation:** Sensitive to noise and outliers.

4. **Autoencoders (Deep Learning-based):**
	- **How it works:** Neural networks learn to encode data into a compressed representation and then decode it back to the original input.
	- **Use case:** Complex non-linear dimensionality reduction for large datasets.
	- **Limitation:** Requires large datasets and significant training time.

5. **Locally Linear Embedding (LLE):**
	- **How it works:** Preserves local relationships by reconstructing each point using its neighbors.
	- **Use case:** Manifold learning in non-linear data.
	- **Limitation:** Computationally expensive and sensitive to noise.

6. **Kernel PCA:**
	- **How it works:** Extends PCA to non-linear relationships by using the kernel trick to perform PCA implicitly in a higher-dimensional feature space, without computing that mapping explicitly.
	- **Use case:** Non-linear data where standard PCA fails.
	- **Limitation:** Requires kernel selection and parameter tuning.

7. **Laplacian Eigenmaps:**
	- **How it works:** Uses graph-based methods to preserve local relationships between points by constructing a similarity graph.
	- **Use case:** Non-linear data with underlying graph-like structures.
	- **Limitation:** Sensitive to graph construction and requires careful tuning.
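
As one worked instance from this list, Kernel PCA can be sketched in a few lines of NumPy: build a kernel matrix, center it, and eigendecompose it. The function name `rbf_kernel_pca`, the RBF kernel, the value of `gamma`, and the two-ring toy data are assumptions for the example; whether the leading components separate the rings depends on `gamma`.

```python
import numpy as np

def rbf_kernel_pca(X, n_components, gamma):
    """Kernel PCA with an RBF kernel; returns training projections and eigenvalues."""
    sq = np.sum(X**2, axis=1)
    D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * D)  # kernel matrix: inner products in the implicit feature space
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    # Center the kernel matrix, which centers the data in feature space.
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(Kc)  # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Projection of training point i on component j: sqrt(eigenvalue_j) * eigvecs[i, j].
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0)), eigvals

# Toy data (assumption): two noisy concentric rings, which no single
# linear direction can separate.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
radii = np.where(np.arange(200) < 100, 1.0, 3.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X += 0.05 * rng.normal(size=X.shape)

Z, eigvals = rbf_kernel_pca(X, n_components=2, gamma=0.5)
print(Z.shape, eigvals.round(2))
```

The cost grows with the square of the number of samples because the full kernel matrix is stored and decomposed, which is one reason kernel choice and dataset size both matter in practice.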

### Comparison of Methods

| Method         | Linear/Non-Linear | Preserves                | Use Case                          |
|----------------|-------------------|--------------------------|-----------------------------------|
| PCA            | Linear            | Global variance          | Feature reduction, compression.   |
| LDA            | Linear            | Class separability       | Classification tasks.             |
| t-SNE          | Non-linear        | Local neighborhood       | Visualization of clusters.        |
| UMAP           | Non-linear        | Local & global structure | Large-scale data visualization.   |
| Isomap         | Non-linear        | Geodesic distances       | Manifold learning.                |
| Kernel PCA     | Non-linear        | Variance (in kernel space) | Non-linear feature extraction.    |
| Autoencoders   | Non-linear        | Learned representation   | Complex feature extraction.       |

### Key Takeaway:

- Use PCA, LDA, or ICA for linear problems.
- Use t-SNE, UMAP, or Isomap for non-linear relationships or visualization.
- Use Autoencoders for deep learning applications or very complex data.
