
# What is PCA? #

The data matrix:
- features as a basis (or "description") for instances

A "change of basis" for the data matrix that has two useful properties:
- new features express maximum variation
- features are uncorrelated
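
To make the "change of basis" idea concrete, here is a minimal NumPy sketch. It assumes a small synthetic dataset with two correlated features (the data and variable names are made up for illustration), and it computes the new basis from the eigenvectors of the covariance matrix, which is one common way to do it, not necessarily how a given library implements PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two correlated features
x = rng.normal(size=500)
y = 0.5 * x + 0.1 * rng.normal(size=500)
X = np.column_stack([x, y])

# Center the data, then eigendecompose its covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort by descending variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order]

# Change of basis: express each instance in terms of the principal axes
X_pca = X_centered @ components

new_cov = np.cov(X_pca, rowvar=False)
print(np.round(new_cov, 6))  # off-diagonals are ~0; diagonal is descending
```

The covariance matrix of the new features is diagonal (the features are uncorrelated), and the diagonal entries are the variances, largest first.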

<mark>Figure: PCA</mark>

One of the limitations of PCA is that the new features might not have an obvious interpretation. What could it mean to add half of X and three Ys? But as a purely mathematical transformation, it is quite useful.

# Features are Uncorrelated #

Strong correlations between features can cause problems for machine learning models that rely on matrix computations (like linear regression and neural nets).

One solution would be to throw out features one by one until you only have low-correlation features remaining. The drawback here of course is that you would also be throwing out a lot of potentially useful information.

It's as if PCA takes all the "best" parts of each feature and recombines them into a new set of optimal features (at least as far as variance is concerned), which are, moreover, completely uncorrelated. As long as you keep all of the components, you can use this new set of features without losing any information.
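
Both claims can be checked with a quick sketch using scikit-learn's `PCA`. The data here is synthetic and only for illustration: three features, two of which are strongly correlated.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): three features, the
# first two strongly correlated with each other
a = rng.normal(size=300)
X = np.column_stack([
    a,
    a + 0.2 * rng.normal(size=300),
    rng.normal(size=300),
])

pca = PCA()  # keep every component
X_pca = pca.fit_transform(X)

# Correlations between the original features vs. between the components
corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(X_pca, rowvar=False)

# With all components kept, the original data can be recovered
X_restored = pca.inverse_transform(X_pca)
```

Here `corr_after` is the identity matrix up to floating-point error, and `X_restored` matches `X`, so nothing was lost in the transformation.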

# Dimensionality Reduction #

One solution to high dimensional data is to filter out all the features with low variation. You could choose to keep only the top 10 features ordered by variation. The drawback here is similar to before -- it's likely that you'll be losing a lot of useful information.

This is where the other property of PCA is useful: the components are ordered by how much variation they express. Often, almost all of a dataset's variation (say 95%) will be contained in the first few principal components. By keeping only those, we'd be losing a minimal amount of information.
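
One way to pick how many components to keep is to look at the cumulative share of variance. The sketch below assumes synthetic data (10 observed features generated from 2 underlying factors) and uses the 95% figure from above as the threshold; both are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 10 observed features
# generated from only 2 underlying factors, plus a little noise
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = factors @ mixing + 0.05 * rng.normal(size=(500, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)

X_reduced = PCA(n_components=n_keep).fit_transform(X)
print(n_keep, X_reduced.shape)
```

Most of the 10 columns can be dropped here because the variation is concentrated in the leading components.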

# Using PCA #

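There are a few things to keep in mind when using PCA. The first is that your features should all be centered and on the same scale. The amount of variance in a feature depends on how large its values are, so a feature measured in small units will appear to vary more than the same feature measured in large units. Observe,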

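In [None]:
import numpy as np

x_inches = np.array([24.0, 36.0, 48.0])  # three lengths, in inches
x_feet = x_inches / 12  # the same lengths, in feet

variance_inches = x_inches.var()  # 12**2 = 144 times variance_feet
variance_feet = x_feet.var()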

The best situation is when all your features are the same kind of thing, like signal measurements, and have the same units. In this case, you can pass the data directly to PCA (which will center it).

Next best is to standardize your data. The drawback here is that differences in the shape of distributions can distort the results, but generally this works well.
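
Here is a rough sketch of why scaling matters, assuming two independent synthetic features on very different scales. It uses scikit-learn's `StandardScaler` and `make_pipeline`; the data is made up for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two independent
# features on very different scales
X = np.column_stack([
    rng.normal(scale=1000.0, size=400),
    rng.normal(scale=1.0, size=400),
])

# Without scaling, the large-scale feature dominates the first component
raw_ratio = PCA().fit(X).explained_variance_ratio_

# Standardizing first puts both features on an equal footing
scaled = make_pipeline(StandardScaler(), PCA()).fit(X)
scaled_ratio = scaled.named_steps["pca"].explained_variance_ratio_

print(raw_ratio, scaled_ratio)
```

On the raw data the first component takes nearly all of the variance simply because of the units. After standardizing, the two components share it roughly evenly, as you'd expect for independent features.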

The second thing is that PCA should only be used on numerical features. Even if you've encoded your categorical variables in some way, like with one-hot encoding, it's still not a good idea. 

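The third thing is that PCA is a *dependent* transform: the components it produces are learned from the data it is fit on. You should `fit` on the training set and then `transform` the training and validation sets separately, rather than fitting on the whole dataset.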
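
In scikit-learn this looks something like the following. The random feature matrix and the 75/25 split are stand-ins chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # synthetic stand-in for a feature matrix

X_train, X_valid = train_test_split(X, test_size=0.25, random_state=0)

pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train)  # learn the components on training data only
X_valid_pca = pca.transform(X_valid)      # reuse those same components
```

The validation data is centered with the training mean and projected onto the training components, so no information from the validation set leaks into the transform.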

# Example - Abalone(???) #

Show column transformer. PolyFeatures. PCA. Pipeline.
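
As a placeholder until the real example is written, here is one possible shape for that pipeline. Everything about the data is an assumption: the column names (`length`, `diameter`, `weight`, `sex`) and the random values are hypothetical stand-ins, not the actual dataset. The choice of degree-2 polynomial features and 3 components is also just for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: the column names and values below are
# invented for illustration and are not the real dataset
df = pd.DataFrame({
    "length": rng.uniform(0.1, 0.8, size=100),
    "diameter": rng.uniform(0.1, 0.6, size=100),
    "weight": rng.uniform(0.01, 2.5, size=100),
    "sex": rng.choice(["M", "F", "I"], size=100),
})
numeric_cols = ["length", "diameter", "weight"]
categorical_cols = ["sex"]

# Numeric branch: expand with polynomial terms, rescale, then apply PCA
numeric_branch = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),
])

# PCA touches only the numeric columns; the categorical column is
# one-hot encoded and passed alongside the components
preprocessor = ColumnTransformer([
    ("num", numeric_branch, numeric_cols),
    ("cat", OneHotEncoder(), categorical_cols),
])

features = preprocessor.fit_transform(df)
print(features.shape)
```

Keeping PCA inside the numeric branch follows the advice above: it is applied only to numerical features, after they have been put on the same scale.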