# Dimensionality Reduction
Dimensionality reduction (**DR**) identifies the information a data set actually needs and pares the data down to it while retaining data integrity. From a mathematical standpoint, **DR** maps each data point to a representation with fewer dimensions, for example projecting 3D points onto a 2D plane.
<br>
<br> PCA (Principal Component Analysis) is an industry-standard technique for dimensionality (i.e. feature) reduction.
<br> LDA (Linear Discriminant Analysis) will also be covered.
<br>
<br>
Feature selection will be discussed:
* Filter methods
* Wrapper Methods

### Motivation
Big data tends to be high-dimensional. It includes structured data, images, audio, language/text, and sensor recordings.

* Images: each image has an x and y axis, and every pixel carries one or more values (e.g. color channels or a grayscale intensity), so each pixel value is a feature.
* Text: text, such as that found on the internet, is a sequence of characters or words, and each position or vocabulary entry can become a feature.
* Structured Data: data sets and tables load naturally into pandas DataFrames, with one feature per column.
* Audio: audio is a long sequence of samples recorded at a fixed rate, and each sample can be treated as a feature.
<br>


Why reduce dimensionality (number of features)?
* Visualization
* Improved model performance
* Lower computational complexity
* DR as unsupervised learning

#### Visualization
The human visual system is the most powerful perceptual system in the known universe, but it only works in up to 3 dimensions, so data must be reduced to 2 or 3 features before it can be plotted.

#### Improved Model Performance
* Denoising: Dropping low-information dimensions reduces clutter (noise) in data clusters.
* Disentangled (uncorrelated) features: Two correlated features can be so similar that it is redundant to keep both. Correlated features can adversely affect machine learning models.
* Fewer Model Parameters: \<we will cover this in the future>
* Less *overfitting*: Overfitting occurs when the model fits the training set too closely (including its noise), so it performs well on the training set but poorly on a test set.

#### Lower computational complexity
* Dimensionality reduction as compression (less memory load)
* Fewer computations (smaller matrix multiplication)

#### Dimensionality reduction as unsupervised learning
* High-dimensional data often lies on a *low-dimensional* manifold.
* By reducing the dimensionality, we might learn the true underlying structure.

### Principal Component Analysis (PCA)
**PCA Intuition**: 
<br> Data is plotted in a coordinate system whose perpendicular axes (say x and y) are two chosen features, so each observation becomes a point. Rotating the axes does not move the point; it only changes the coordinates that describe it. PCA chooses the rotation deliberately: it finds the 'trend line' along which the data varies the most and superimposes a new perpendicular coordinate system whose x-axis is that trend line.
<br>
<br>The trend-line x-axis is referred to as the *1st Principal Component*, or PC1. The perpendicular axis, which captures the largest share of the remaining variance, is the *2nd Principal Component*, or PC2. If the data varies very little along PC2, that axis can be dropped with little loss of information.

#### Projecting data onto the PCs
Project each data point (a vector) onto each PC (also a vector). Each projection is a **dot product**, so projecting many data points onto many PCs can be written as a single matrix multiplication.
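To make the dot-product view concrete, here is a minimal NumPy sketch. The toy data and the two PCs are assumptions chosen by hand for illustration; in practice the PCs would come from fitting PCA.

```python
import numpy as np

# Toy data: 4 points with d = 3 features (values are made up for illustration).
X = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 2.0],
])

# PCA works on centered data, so subtract the per-feature mean first.
X_centered = X - X.mean(axis=0)

# Two hand-picked, mutually perpendicular PCs stored as rows (k = 2, d = 3).
pcs = np.array([
    [1.0, -1.0, 0.0],
    [0.0, 0.0, 1.0],
])
pcs = pcs / np.linalg.norm(pcs, axis=1, keepdims=True)  # normalize to unit length

# One point onto one PC: a single dot product gives a single number.
single = np.dot(X_centered[0], pcs[0])

# All points onto all PCs: one matrix multiplication, result has shape (4, 2).
projected = X_centered @ pcs.T

print(single)
print(projected)
```

Entry `projected[i, j]` is the dot product of point `i` with PC `j`, so `projected[0, 0]` equals `single`.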

#### Picking the number of PCs
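* One strategy: keep the smallest number of PCs whose cumulative explained variance ratio reaches a chosen percentage of the total variance. A plot of cumulative explained variance ratio against the number of principal components makes this cutoff easy to see.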
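A possible way to apply this strategy with scikit-learn is sketched below. The iris data set and the 95% threshold are assumptions picked only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

# Fit PCA with all components so every explained variance ratio is available.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # assumed target share of total variance

# Index of the first cumulative value >= threshold, plus one, is the PC count.
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(n_components)
```

scikit-learn's `PCA` also documents passing a float between 0 and 1 as `n_components` to request this kind of cutoff directly; the manual version above just makes the rule explicit.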


#### Scaling the DataFrame before PCA
* PCA finds the directions with the highest variance.
* If some columns of your data have much higher variance than the others, they will dominate the leading PCs.
* The variance of these columns is often arbitrary (e.g. mm vs. m units).
* Treat each column as equally important by applying StandardScaler, which rescales every column to zero mean and unit variance.
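The effect of units can be sketched with a small synthetic DataFrame. The column names and the random data below are hypothetical and exist only to show how a millimetre column dominates PC1 until the data is standardized.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=200)

# Hypothetical columns: the first is in mm, so its variance is huge by construction.
df = pd.DataFrame({
    "height_mm": height_m * 1000,
    "arm_span_m": height_m + rng.normal(0, 0.05, size=200),
    "weight_kg": rng.normal(70, 10, size=200),
})

# PC1 on the raw data: almost entirely the high-variance mm column.
raw_pc1 = PCA(n_components=1).fit(df).components_[0]

# Standardize every column to zero mean and unit variance, then refit.
scaled = StandardScaler().fit_transform(df)
scaled_pc1 = PCA(n_components=1).fit(scaled).components_[0]

# Absolute weight of each original column in PC1, before and after scaling.
raw_loadings = pd.Series(np.abs(raw_pc1), index=df.columns)
scaled_loadings = pd.Series(np.abs(scaled_pc1), index=df.columns)

print(raw_loadings)
print(scaled_loadings)
```

After scaling, PC1 is no longer tied to whichever column happened to use the smallest unit.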

#### What do the PCs represent
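* Each PC is a vector of size *d* (the number of original features), i.e. a direction in the original space. Together, all *d* PCs span that space.
* Projecting a data point onto one PC gives one number.
    * Together, these numbers preserve as much of the data's variance as possible for the number of PCs kept.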
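These points can be checked on a fitted scikit-learn `PCA` object. The sketch below assumes the iris data set (so *d* = 4) and keeps 2 PCs; both choices are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # shape (150, 4), so d = 4
pca = PCA(n_components=2).fit(X)

# Each row of components_ is one PC: a vector of size d.
components = pca.components_          # shape (2, 4)

# The PCs are unit length and mutually perpendicular,
# so this product is approximately the identity matrix.
gram = components @ components.T

# Projecting a data point onto the 2 PCs gives 2 numbers per point.
Z = pca.transform(X)                  # shape (150, 2)
manual = (X - pca.mean_) @ components.T  # same result via explicit dot products

# Mapping back to d dimensions shows how much information was kept.
X_approx = pca.inverse_transform(Z)
reconstruction_error = np.mean((X - X_approx) ** 2)

print(components.shape, Z.shape)
print(reconstruction_error)
```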
    
### Linear Discriminant Analysis (LDA)
* Like PCA, LDA reduces dimensionality
* LDA projections:
    * Minimize intra-class variance
    * Maximize inter-class variance
* LDA is supervised (it uses the class labels), whereas PCA is unsupervised (it ignores them).
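The supervised/unsupervised difference shows up directly in the scikit-learn calls. This sketch assumes the iris data set; with 3 classes, LDA can produce at most 2 components (number of classes minus one).

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 4 features, 3 classes

# PCA is unsupervised: it only ever sees X.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: the labels y are required to separate the classes.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```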

## Feature Selection
* Dimensionality Reduction: Creates new features that are functions of the original ones (for PCA, linear combinations of the original ones).
* Feature Selection: Removes redundant features and keeps important ones
* Why ever use FS?: You want the resulting feature set to remain interpretable (i.e. no new, potentially hard-to-interpret features are created).
    * This means feature selection is mostly used when features are already interpretable (i.e. structured data like tables with column names).

### Filter Methods
* Measure each feature's relevance by its correlation with the dependent variable (target).
* If a feature is correlated with the target, keep it. Otherwise, discard it.
* Applied **before** training the ML model
* Advantages:
    * Fast, no training involved
* Disadvantages:
    * Ignores feature combinations
    * Keeps redundant features.
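A correlation-based filter can be sketched in a few lines of pandas. The synthetic columns and the 0.2 cutoff are assumptions for illustration; note how the near-duplicate column survives the filter, which is the redundancy drawback listed above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: two informative, one pure noise.
df = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# A near-duplicate of signal_a, i.e. a redundant feature.
df["signal_a_copy"] = df["signal_a"] + rng.normal(scale=0.01, size=n)

# The target depends only on signal_a and signal_b.
target = 2.0 * df["signal_a"] - 1.5 * df["signal_b"] + rng.normal(scale=0.5, size=n)

# Score every feature on its own by absolute correlation with the target.
correlations = df.corrwith(target).abs()

threshold = 0.2  # assumed cutoff
selected = correlations[correlations > threshold].index.tolist()

print(correlations)
print(selected)
```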

### Wrapper Methods
* Train the ML model with different subsets of features
* If a feature improves performance, add/keep it. Otherwise, ignore/remove it.
* Applied **during** training of the ML model
* Advantages:
    * Evaluates features in context of others
    * Performance-driven
* Disadvantages: 
    * Slow: the model must be retrained several times

#### Forward Selection Method
**1.** *SelectedFeatures = []*
<br>**2.** Find *F* in (AllFeatures - SelectedFeatures) that, if added to SelectedFeatures, best improves model performance
<br>**3.** If adding *F* improved performance more than some threshold, permanently add it to SelectedFeatures and go back to (2). Otherwise, stop.
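The three steps above can be sketched as follows. This is one possible reading, not a definitive implementation: the `score` helper, the linear model, cross-validated R² as the performance measure, the 0.01 threshold, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def score(X, y, features):
    """Hypothetical helper: mean 5-fold cross-validated R^2 on the given columns."""
    if not features:
        return 0.0  # assumed baseline for an empty feature set
    return cross_val_score(LinearRegression(), X[:, features], y, cv=5).mean()


def forward_selection(X, y, threshold=0.01):
    selected = []                                   # step 1
    best_score = score(X, y, selected)
    while len(selected) < X.shape[1]:
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        # Step 2: find the feature whose addition gives the best score.
        candidate_scores = {f: score(X, y, selected + [f]) for f in remaining}
        best_f = max(candidate_scores, key=candidate_scores.get)
        # Step 3: keep it only if the improvement exceeds the threshold.
        if candidate_scores[best_f] - best_score <= threshold:
            break
        selected.append(best_f)
        best_score = candidate_scores[best_f]
    return selected


# Synthetic data: only columns 0 and 1 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

selected = forward_selection(X, y)
print(selected)
```

Every pass through step 2 retrains the model once per remaining feature, which is where the "slow" disadvantage of wrapper methods comes from.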

#### Backward Elimination Method
**1.** *SelectedFeatures = AllFeatures*
<br>**2.** Find *F* in SelectedFeatures that, if removed from SelectedFeatures, decreases model performance the least
<br>**3.** If removing *F* decreased performance less than some threshold, permanently remove it from SelectedFeatures and go back to (2). Otherwise, stop.
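A matching sketch of the three steps above is shown here, under the same assumptions as before: a hypothetical `score` helper, a linear model, cross-validated R² as the performance measure, a 0.01 threshold, and synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def score(X, y, features):
    """Hypothetical helper: mean 5-fold cross-validated R^2 on the given columns."""
    return cross_val_score(LinearRegression(), X[:, features], y, cv=5).mean()


def backward_elimination(X, y, threshold=0.01):
    selected = list(range(X.shape[1]))              # step 1
    best_score = score(X, y, selected)
    while len(selected) > 1:
        # Step 2: find the feature whose removal hurts performance the least.
        candidate_scores = {
            f: score(X, y, [g for g in selected if g != f]) for f in selected
        }
        best_f = max(candidate_scores, key=candidate_scores.get)
        # Step 3: remove it only if the drop stays below the threshold.
        if best_score - candidate_scores[best_f] >= threshold:
            break
        selected.remove(best_f)
        best_score = candidate_scores[best_f]
    return selected


# Synthetic data: only columns 0 and 1 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

selected = backward_elimination(X, y)
print(selected)
```

Because the comparison is always against the score of the current feature set, many small drops can add up; comparing against the original full-feature score instead would be a reasonable alternative design.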

## PCA DEMO