# PCA

* PCA (or principal component analysis) is the first of the techniques you will see aimed at dimensionality reduction. This technique is about taking your full dataset and reducing it to only the parts that hold the most information.

# How deep will these notes go?

### The goal is to leave this lesson with an understanding of:
1. How PCA is used in the world
2. How to perform PCA in Python
3. How the algorithm works, at a conceptual level
4. How to interpret the results of PCA

If you want to dive deeper into the mathematics, there will be additional links provided, but it will not be a main focus of this lesson.

# PCA Lesson Topics

There is a lot to cover with Principal Component Analysis (PCA). By the end of this lesson you will have a solid understanding of it, gained by applying the technique in a couple of scenarios using scikit-learn and practicing interpreting the results.

We will also cover conceptually how the algorithm works, and I will provide links to explore what is happening mathematically in case you want to dive in deeper! Here is an outline of what you can expect in this lesson.

### 1. Dimensionality Reduction through Feature Selection and Feature Extraction
With large datasets we often suffer from what is known as the "curse of dimensionality" and need to reduce the number of features to develop a model effectively. Feature Selection and Feature Extraction are two general approaches for reducing dimensionality.

### 2. Feature Extraction Using PCA
Principal Component Analysis is a common method for extracting new "latent features" from our dataset, based on existing features.

### 3. Fitting PCA
In this part of the lesson, you will use PCA in scikit-learn to reduce the dimensionality of images of handwritten digits.

### 4. Interpreting Results
Once you are able to use PCA on a dataset, it is essential that you know how to interpret the results you get back. There are two main parts to interpreting your results - the principal components themselves and the variability of the original data captured by those components. You will get familiar with both.

### 5. Mini-Project
Finally, you will put your skills to work on a new dataset.

### 6. Quick Recap
We will do a quick recap, and you will be ready to use PCA for your own applications, as well as the project!

## Latent Features

<img src='lat_feat1.png' width=400px>
<img src='lat_feat2.png' width=400px>

Latent features are features that aren't explicitly in your dataset.

In this example, you saw that the following features are all related to the latent feature **home size**:

1. lot size
2. number of rooms
3. floor plan size
4. size of garage
5. number of bedrooms
6. number of bathrooms

Similarly, the following features could be reduced to a single latent feature of **home neighborhood**:

1. local crime rate
2. number of schools within five miles
3. property tax rate
4. local median income
5. average air quality index
6. distance to highway

So even if our original dataset has the 12 features listed, we might be able to reduce this to only 2 latent features relating to the home size and home neighborhood.
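
To make the 12-to-2 reduction concrete, here is a minimal sketch on made-up data. It does not use real housing data: the two hidden drivers (`home_size`, `neighborhood`), the noise level, and the sample size are all assumptions chosen for illustration. Each hidden driver generates six noisy observed features, and PCA (previewed here, covered later in the lesson) is asked for two components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_homes = 500  # hypothetical sample size

# Two hidden (latent) drivers that never appear as columns in the dataset.
home_size = rng.normal(size=n_homes)
neighborhood = rng.normal(size=n_homes)

# Six observed features per latent driver: the driver plus a little noise.
size_features = np.column_stack(
    [home_size + 0.3 * rng.normal(size=n_homes) for _ in range(6)]
)
neighborhood_features = np.column_stack(
    [neighborhood + 0.3 * rng.normal(size=n_homes) for _ in range(6)]
)
X = np.hstack([size_features, neighborhood_features])  # shape (500, 12)

# Standardize, then ask PCA for two latent features.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(X.shape, "->", X_reduced.shape)
print("share of variance kept:", pca.explained_variance_ratio_.sum())
```

Because the 12 columns were built from only two underlying signals, two components keep most of the variance. Real datasets are rarely this clean, so the share kept will usually be lower.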

#### How do these statements group?
<img src='lat_feat3.png' width=600px>

**Maybe like this**

<img src='lat_feat4.png' width=600px>

# Reducing the Number of Features - Dimensionality Reduction

Our real estate example is a good way to develop an understanding of feature reduction and latent features. But it has a fairly small number of features, so it's not obvious why reducing them is necessary. And in this case it wouldn't actually be required: we could handle all 12 original features to create a model.

The [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) becomes clearer when we're grappling with large real-world datasets that might involve hundreds or thousands of features. To develop a model effectively on such data, we really do need to reduce the number of dimensions.

## Two Approaches: Feature Selection and Feature Extraction
### Feature Selection
Feature Selection involves finding a **subset** of the original features of your data that you determine are most relevant and useful. In the example image below, taken from the video, notice that "floor plan size" and "local crime rate" are features that we have selected as a subset of the original data.
<img src='feat_sel1.png' width=600px>

#### Methods of feature selection
* Filter Methods:
Filtering approaches use a ranking or sorting algorithm to filter out the less useful features. In unsupervised learning, filter methods look for inherent correlations among the features themselves; in supervised settings, they look at correlations with the output variable. Filter methods are usually applied as a preprocessing step. Common tools for scoring features in filter methods include **Pearson's Correlation**, **Linear Discriminant Analysis (LDA)**, and **Analysis of Variance (ANOVA)**.

* Wrapper Methods:
Wrapper approaches generally select features by directly testing their impact on the performance of a model. The idea is to "wrap" this procedure around your algorithm: repeatedly call the algorithm with different subsets of features, measure the performance of each resulting model (typically with cross-validation), and select the features that produce the best models. This is a computationally expensive way to find the best-performing subset, since it requires many calls to the learning algorithm. Common examples of wrapper methods are **Forward Search**, **Backward Search**, and **Recursive Feature Elimination**.

**Scikit-learn** has a [feature selection module](https://scikit-learn.org/stable/modules/feature_selection.html) that offers a variety of methods that can improve a model's accuracy scores or boost its performance on very high-dimensional datasets.
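
As a rough illustration of the two families, the sketch below runs one filter method (an ANOVA F-test via `SelectKBest`) and one wrapper method (`RFE`, Recursive Feature Elimination) from that module. The wine dataset, the logistic regression model, and the choice to keep 5 features are assumptions made for this example only; they do not come from the lesson.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Example dataset (an assumption): 178 wines, 13 features, 3 classes.
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Filter method: score each feature on its own with an ANOVA F-test
# against the labels, then keep the 5 highest-scoring features.
filter_selector = SelectKBest(score_func=f_classif, k=5).fit(X_scaled, y)
filter_mask = filter_selector.get_support()

# Wrapper method: fit a model, drop the weakest feature, and repeat
# until only 5 features remain.
wrapper_selector = RFE(
    LogisticRegression(max_iter=1000), n_features_to_select=5
).fit(X_scaled, y)
wrapper_mask = wrapper_selector.get_support()

print("filter keeps columns: ", filter_mask.nonzero()[0])
print("wrapper keeps columns:", wrapper_mask.nonzero()[0])
```

Both selectors return a subset of the original columns, unchanged. Note that plain `RFE` does not cross-validate; scikit-learn's `RFECV` is the variant that chooses the number of features by cross-validation.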

### Feature Extraction
Feature Extraction involves extracting, or constructing, new features called **latent features**. In the example image below, taken from the video, "Size Feature" and "Neighborhood Quality Feature" are new latent features, extracted from the original input data.
<img src='feat_ext1.png' width=600px>
#### Methods of feature extraction
Constructing latent features is exactly the goal of **Principal Component Analysis (PCA)**, which we'll explore throughout the rest of this lesson.

Other methods for accomplishing Feature Extraction include **Independent Component Analysis (ICA)** and **Random Projection**, which we will study in the following lesson.
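
For a sense of how these three methods look in practice, here is a small sketch that applies each of them to scikit-learn's handwritten digits dataset. The choice of 10 output features and the `random_state` and `max_iter` values are arbitrary assumptions for illustration. All three follow the same `fit_transform` pattern and return brand-new columns rather than a subset of the originals.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, FastICA
from sklearn.random_projection import GaussianRandomProjection

# 1797 images of handwritten digits, 64 pixel features each.
X, _ = load_digits(return_X_y=True)

# Three feature extraction methods, each asked for 10 new features.
extractors = {
    "PCA": PCA(n_components=10, random_state=0),
    "ICA": FastICA(n_components=10, random_state=0, max_iter=1000),
    "Random Projection": GaussianRandomProjection(n_components=10, random_state=0),
}

reduced = {name: model.fit_transform(X) for name, model in extractors.items()}

for name, X_new in reduced.items():
    print(f"{name}: {X.shape} -> {X_new.shape}")
```

The output shapes are identical, but the new features differ in meaning because each method builds them by a different criterion.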

### Further Exploration
If you're interested in deeper study of these topics, here are a couple of helpful blog posts and a research paper:
* https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/
* https://elitedatascience.com/dimensionality-reduction-algorithms
* http://www.ai.mit.edu/projects/jmlr/papers/volume3/guyon03a/source/old/guyon03a.pdf

# Principal Components
Given data:
<img src='pca1.png' width=600px>
We can shrink this two-dimensional data onto a single line:
<img src='pca2.png' width=600px>
This looks very similar to regression, but we're not using the line for prediction:
<img src='pca3.png' width=600px>
The goal in the above is to shrink the space that our data lives in.

`1.` An advantage of Feature Extraction over Feature Selection is that each latent feature can combine data from multiple original features. It can therefore retain more of the information in the original inputs than Feature Selection, which discards whatever the dropped features contained.

`2.` **Principal components** are linear combinations of the original features in a dataset that aim to retain the most information in the original data.

`3.` You can think of a **principal component** in the same way that you think about a **latent feature**.

The general approach to this problem of high-dimensional datasets is to search for a **projection** of the data onto a smaller number of features which preserves the information as much as possible.
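
The idea that a principal component is a linear combination of the original features, and that reducing dimensions means projecting onto it, can be checked directly. This is a minimal sketch on made-up two-dimensional data (the data-generating numbers are assumptions). It compares scikit-learn's `transform` output with a weighted sum computed by hand from `components_` and `mean_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Made-up 2-D data: the second feature is roughly twice the first.
x1 = rng.normal(size=200)
X = np.column_stack([x1, 2 * x1 + 0.2 * rng.normal(size=200)])

pca = PCA(n_components=1).fit(X)
weights = pca.components_[0]  # one weight per original feature

# The component value for each row is a weighted sum of the centered
# original features, i.e. a projection onto the component's direction.
manual_scores = (X - pca.mean_) @ weights
sklearn_scores = pca.transform(X)[:, 0]

print("weights:", weights)
print("manual projection matches transform:", np.allclose(manual_scores, sklearn_scores))
```

The weights form a unit-length vector, so they describe a direction in the original feature space rather than a rescaling of it.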

We'll take a closer look in the rest of this lesson.

# Principal Component Properties
There are two main properties of principal components:

`1.` **They retain the most information in the dataset.** In this video, you saw that retaining the most information in the dataset meant finding a line that minimizes the total distance from the points to the component. This is similar in spirit to regression, although PCA measures the perpendicular distance to the line rather than the vertical distance to a predicted value.

`2.` **The created components are orthogonal to one another.** So far we have mostly focused on what the first component of a dataset looks like. However, when there are multiple components, each one is orthogonal to all the others. Depending on how the components are used, this can be a benefit. In regression, for example, we often want features that are not correlated with one another, and the component scores produced by PCA are uncorrelated by construction.
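
The orthogonality property is easy to verify numerically. The sketch below is an illustration only: the digits dataset, the choice of 5 components, and the explicit `svd_solver="full"` setting (used so the check is exact rather than approximate) are assumptions. It checks that the component directions are mutually perpendicular and that the transformed features are uncorrelated.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

# The exact solver is requested so the checks below are not affected by
# the approximation error of a randomized solver.
pca = PCA(n_components=5, svd_solver="full").fit(X)
components = pca.components_  # shape (5, 64), one row per component

# Dot products between every pair of components:
# 1 on the diagonal (unit length), 0 elsewhere (perpendicular).
gram = components @ components.T
print("components are orthonormal:", np.allclose(gram, np.eye(5)))

# Correlations between the new features: off-diagonal entries are ~0.
scores = pca.transform(X)
corr = np.corrcoef(scores, rowvar=False)
print("new features are uncorrelated:", np.allclose(corr, np.eye(5), atol=1e-6))
```

Zero correlation is what this check demonstrates; it does not by itself show that the new features are statistically independent.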

[A great answer about common PCA questions](https://stats.stackexchange.com/questions/110508/questions-on-pca-when-are-pcs-independent-why-is-pca-sensitive-to-scaling-why)

<img src='pca_prop1.png' width=600px>
By choosing components that span the largest variance in the dataset, you lose the least amount of information.

<img src='pca_prop2.png' width=600px>
The amount of information lost is measured by the total (squared) distance from the points to the line; of the two components shown, the one whose line lies closer to the points loses less information.
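
The link between distance to the line and information lost can be sketched numerically. The example below uses made-up 2-D data, and `squared_distance_to_line` is a hypothetical helper written for this illustration, not a scikit-learn function. It compares the total squared perpendicular distance to the first principal component with the distance to an arbitrary alternative line.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Made-up 2-D data with a clear dominant direction.
x1 = rng.normal(size=300)
X = np.column_stack([x1, 0.5 * x1 + 0.3 * rng.normal(size=300)])
X_centered = X - X.mean(axis=0)


def squared_distance_to_line(points, direction):
    """Total squared perpendicular distance from points to a line through the origin."""
    direction = direction / np.linalg.norm(direction)
    projections = np.outer(points @ direction, direction)
    return np.sum((points - projections) ** 2)


pca = PCA(n_components=1).fit(X)
first_component = pca.components_[0]
other_direction = np.array([0.0, 1.0])  # arbitrary comparison line (the vertical axis)

loss_pca = squared_distance_to_line(X_centered, first_component)
loss_other = squared_distance_to_line(X_centered, other_direction)

print("lost along first component:", loss_pca)
print("lost along other direction:", loss_other)
print("share of variance kept:", pca.explained_variance_ratio_[0])
```

The first component gives the smaller loss, and the fraction of the total squared distance that is not lost matches `explained_variance_ratio_`. That is the sense in which spanning the largest variance and losing the least information are the same thing.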

<img src='pca_prop3.png' width=600px>
The components must be at 90 degrees to one another. Note that orthogonal components are uncorrelated, which is a weaker guarantee than full statistical independence.

_______________

Some short items about PCA
<img src='pca_quiz1.png' width=600px>