## MACHINE LEARNING DAY 24: Dimensionality Reduction

### What is Dimensionality Reduction?

Dimensionality reduction is a **technique used in machine learning and data preprocessing** to reduce the number of features (also known as variables, attributes, or dimensions) in a dataset, while **preserving the important patterns or structures**.

High-dimensional datasets (i.e., datasets with many features) can be complex, computationally expensive, and prone to overfitting. Dimensionality reduction helps address these challenges by simplifying the dataset, which often leads to:

* Reduced model complexity and training time
* Better generalization in many cases, since fewer features leave less room to fit noise
* Enhanced visualization and interpretability
* Less storage and memory usage
* Better handling of multicollinearity and noise

---

### Why is Dimensionality Reduction Needed?

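1. **Curse of Dimensionality**: As the number of features increases, the volume of the feature space grows exponentially. A fixed number of samples therefore becomes increasingly sparse, and distances between points become less informative, making it harder for models to find patterns.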
2. **Overfitting**: Too many features increase the risk that the model fits noise rather than signal.
3. **Computation**: High-dimensional data requires more memory, processing time, and power.
4. **Data Visualization**: It is hard to visualize datasets with more than 3 dimensions. Dimensionality reduction enables us to project high-dimensional data into 2D or 3D for better understanding.
5. **Noise Reduction**: Many features might be irrelevant or redundant. Removing them reduces noise.
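
To make the first point above concrete, here is a small sketch of how distances lose contrast as dimensions are added. It assumes uniformly random points and a hypothetical helper `relative_contrast`, so it illustrates the effect rather than measuring it on real data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def relative_contrast(n_features, n_points=500):
    """(farthest - nearest) / nearest distance from one query point."""
    points = rng.random((n_points, n_features))   # uniform points in the unit cube
    query = rng.random(n_features)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} features: relative contrast = {c:.2f}")
```

The contrast shrinks as the number of features grows: the nearest and farthest neighbours end up at almost the same distance, which is one reason distance-based models struggle in high dimensions.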

---

## Types of Dimensionality Reduction

There are **two major types** of dimensionality reduction techniques:

---

### 1. **Feature Selection**

Feature selection **selects a subset of relevant original features** from the dataset without altering them. It does not create new features but filters out those that are not useful for the model.

#### Key Characteristics:

* Keeps original features intact.
* Based on statistical tests or model performance.
* Easier to interpret since original meaning of variables is retained.

#### Methods of Feature Selection:

**a. Filter Methods** – Use statistical measures to score features without involving any machine learning algorithm:

* **Correlation Coefficient**: Select features that are highly correlated with the target and not highly correlated with each other.
* **Chi-Square Test**: Used when both features and targets are categorical.
* **ANOVA (Analysis of Variance)**: Used when features are numerical and the target is categorical.
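
A correlation-based filter like the one described above could look roughly like this. The synthetic data and the two thresholds (0.2 for relevance, 0.9 for redundancy) are assumptions chosen for illustration, not recommended defaults.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 5))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=n)   # feature 2 nearly duplicates feature 0
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)

# Relevance: absolute Pearson correlation of each feature with the target.
relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
# Redundancy: absolute pairwise correlation between features.
feature_corr = np.abs(np.corrcoef(X, rowvar=False))

selected = []
for j in np.argsort(relevance)[::-1]:            # most relevant first
    if relevance[j] < 0.2:                       # hypothetical relevance threshold
        continue
    # Keep the feature only if it is not too correlated with one already kept.
    if all(feature_corr[j, k] < 0.9 for k in selected):   # hypothetical redundancy threshold
        selected.append(int(j))

print("relevance:", relevance.round(2))
print("selected features:", sorted(selected))
```

No model is trained here; features are scored purely from statistics of the data, which is what makes filter methods cheap.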

**b. Wrapper Methods** – Use a machine learning model to evaluate feature subsets:

* **Recursive Feature Elimination (RFE)**: Recursively removes the least important features.
* **Forward Selection**: Starts with zero features and adds one at a time.
* **Backward Elimination**: Starts with all features and removes the least important ones.
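
As a sketch of a wrapper method, the snippet below runs scikit-learn's `RFE` around a logistic regression on synthetic data. The dataset, the choice of estimator, and keeping 3 features are all assumptions for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 carry signal (an assumption for the demo).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# RFE refits the model repeatedly, dropping the weakest feature each round.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X, y)

X_reduced = rfe.transform(X)
print("kept mask:", rfe.support_)
print("ranking (1 = kept):", rfe.ranking_)
print("reduced shape:", X_reduced.shape)
```

Because the model is refit after each elimination, wrapper methods tend to cost more than filter methods, but they score features in the context of the model that will actually be used.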

**c. Embedded Methods** – Perform feature selection as part of model training:

* **Lasso Regression (L1 Regularization)**: The L1 penalty shrinks some coefficients to exactly zero, effectively removing those features.
* **Tree-based Methods (e.g., Random Forest)**: Provide feature importance scores to rank and select variables.
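
Here is a minimal sketch of embedded selection with Lasso, assuming synthetic data where only features 0 and 3 influence the target. The value `alpha=0.5` is picked for illustration; in practice it would usually be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only features 0 and 3 drive the target (an assumption for the demo).
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + 0.5 * rng.normal(size=200)

# Scale first so the L1 penalty treats every feature equally.
X_scaled = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.5)          # illustrative penalty strength
lasso.fit(X_scaled, y)

kept = np.flatnonzero(lasso.coef_)   # indices of features with non-zero coefficients
print("coefficients:", lasso.coef_.round(2))
print("kept features:", kept)
```

Selection happens as a side effect of training: the features whose coefficients stay non-zero are the ones the model kept.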

---

### 2. **Feature Extraction**

Feature extraction **transforms the data** from the high-dimensional space to a lower-dimensional space by **creating new features** that are combinations of the original ones.

#### Key Characteristics:

* New features may not have direct interpretations.
* Aims to preserve most of the data's structure or variance in fewer dimensions.
* Especially useful when features are strongly correlated, or (with nonlinear methods) when feature relationships are complex.

#### Common Feature Extraction Methods:

**a. Principal Component Analysis (PCA)**

* Unsupervised linear transformation technique.
* Projects data onto orthogonal axes (principal components) that capture maximum variance.
* The first few components often capture most of the variability in the data, particularly when the original features are correlated.

**Steps in PCA**:

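1. Standardize the data (center each feature to zero mean and scale it to unit variance)
2. Compute the covariance matrix of the standardized data
3. Calculate its eigenvalues and eigenvectors
4. Sort the eigenvectors by decreasing eigenvalue and choose the top k (the principal components)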
5. Project the data onto the new subspace
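
The five steps above can be sketched directly in NumPy. The data here is synthetic (5 features driven by 2 hidden factors through a hand-picked `mixing` matrix), and `k = 2` is an assumption for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 5 observed features built from 2 hidden factors plus small noise.
factors = rng.normal(size=(300, 2))
mixing = np.array([[1.0, 0.8, -0.6, 0.3, 0.9],
                   [0.2, -0.7, 0.9, 1.0, -0.4]])
X = factors @ mixing + 0.1 * rng.normal(size=(300, 5))

# 1. Standardize: zero mean and unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)
# 3. Eigen-decomposition (eigh is meant for symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort by decreasing eigenvalue and keep the top k eigenvectors.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
k = 2
W = eigvecs[:, :k]
# 5. Project the data onto the new subspace.
X_pca = X_std @ W

explained = eigvals[:k].sum() / eigvals.sum()
print("projected shape:", X_pca.shape)
print(f"variance explained by {k} components: {explained:.3f}")
```

In practice a library implementation such as scikit-learn's `PCA` is usually preferable; writing the steps out is mainly useful for seeing what each one does.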

**Use Case**: Data compression, noise reduction, visualization.

---

**b. Linear Discriminant Analysis (LDA)**

* Supervised technique.
* Maximizes class separability by finding axes that maximize **between-class variance** and minimize **within-class variance**.
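* Works best when each class is roughly Gaussian and the classes share a similar covariance structure; it yields at most (number of classes minus 1) components.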

**Use Case**: Classification problems with labeled data.
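
A short sketch of LDA as a dimensionality reducer, using scikit-learn's `LinearDiscriminantAnalysis` on the bundled Iris dataset. The dataset choice is an assumption for illustration; with 3 classes, LDA can produce at most 2 components.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 150 samples, 4 features, 3 classes

# Unlike PCA, fitting LDA requires the class labels y.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("reduced shape:", X_lda.shape)
print("between-class variance ratio per axis:", lda.explained_variance_ratio_.round(4))
```

On this dataset the first discriminant axis accounts for nearly all of the between-class variance, so even a 1D projection separates the classes reasonably well.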

---

**c. t-SNE (t-distributed Stochastic Neighbor Embedding)**

* Non-linear dimensionality reduction for **visualization**.
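* Converts pairwise similarities in the high-dimensional space into probabilities and finds a low-dimensional embedding that matches them, preserving local neighborhoods. Distances between clusters and cluster sizes in the resulting plot are generally not meaningful.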

**Use Case**: Visualizing complex data in 2D or 3D (e.g., word embeddings, image clusters).
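
For a feel of how t-SNE is typically used, here is a sketch with scikit-learn's `TSNE` on a subset of the bundled digits dataset. The subset size (300 samples) and `perplexity=30` are assumptions to keep the demo fast, not tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 8x8 digit images flattened to 64 features
X_small, y_small = X[:300], y[:300]   # small subset so the demo runs quickly

# Perplexity roughly controls how many neighbours each point "pays attention" to.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X_small)

print("embedded shape:", X_2d.shape)
print("final KL divergence:", tsne.kl_divergence_)
```

The 2D coordinates in `X_2d` can then be passed to a scatter plot coloured by `y_small`. Note that t-SNE has no `transform` method for unseen data, which is one reason it is used for visualization rather than as a preprocessing step.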

---

**d. UMAP (Uniform Manifold Approximation and Projection)**

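* Similar in spirit to t-SNE, but typically faster and more scalable to large datasets.
* Preserves **local structure** and tends to retain more **global structure** than t-SNE, although distances between far-apart clusters should still be read with caution.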
* Useful for visualization and preprocessing.

---

**e. Autoencoders**

* Neural networks trained to reconstruct their input.
* The bottleneck layer (compressed representation) serves as a lower-dimensional encoding.
* Can capture nonlinear feature relationships.

**Use Case**: Deep learning-based dimensionality reduction, especially for image, audio, or time-series data.
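
To show the idea without a deep learning framework, below is a tiny one-hidden-layer autoencoder written in plain NumPy. Everything here is an assumption for the demo: the synthetic data (8 features generated from 2 hidden factors), the bottleneck width of 2, the `tanh` encoder with a linear decoder, and the learning rate. Real autoencoders are normally built with a framework such as PyTorch or Keras.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 8 observed features that are a nonlinear function of 2 factors.
factors = rng.normal(size=(500, 2))
X = np.tanh(factors @ rng.normal(size=(2, 8))) + 0.05 * rng.normal(size=(500, 8))
X = X - X.mean(axis=0)

n_features, n_code = X.shape[1], 2            # bottleneck width of 2
W_enc = 0.1 * rng.normal(size=(n_features, n_code))
b_enc = np.zeros(n_code)
W_dec = 0.1 * rng.normal(size=(n_code, n_features))
b_dec = np.zeros(n_features)

learning_rate = 0.1
losses = []
for step in range(3000):
    # Forward pass: encode to the bottleneck, then decode back to input space.
    code = np.tanh(X @ W_enc + b_enc)
    recon = code @ W_dec + b_dec
    error = recon - X
    losses.append(float(np.mean(error ** 2)))

    # Backward pass for the mean squared reconstruction error.
    grad_recon = 2.0 * error / error.size
    grad_W_dec = code.T @ grad_recon
    grad_b_dec = grad_recon.sum(axis=0)
    grad_pre = (grad_recon @ W_dec.T) * (1.0 - code ** 2)   # tanh derivative
    grad_W_enc = X.T @ grad_pre
    grad_b_enc = grad_pre.sum(axis=0)

    # Plain gradient descent update.
    W_enc -= learning_rate * grad_W_enc
    b_enc -= learning_rate * grad_b_enc
    W_dec -= learning_rate * grad_W_dec
    b_dec -= learning_rate * grad_b_dec

# The bottleneck activations are the reduced representation.
X_encoded = np.tanh(X @ W_enc + b_enc)
print("encoded shape:", X_encoded.shape)
print(f"reconstruction loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

After training, only the encoder half is needed to reduce new data; the decoder exists to give the network something to learn from.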

---

### Feature Selection vs. Feature Extraction – Summary

| Aspect                | Feature Selection            | Feature Extraction                    |
| --------------------- | ---------------------------- | ------------------------------------- |
| Output                | Subset of original features  | New features created from originals   |
| Interpretability      | High                         | Often low                             |
| Complexity            | Simpler                      | Often more complex (e.g., neural net) |
| Information preserved | May lose some                | Tries to preserve maximum information |
| Examples              | Correlation, Chi-Square, RFE | PCA, LDA, t-SNE, Autoencoders         |