Just as clustering partitions a dataset based on proximity, Principal Component Analysis (PCA) can be thought of as partitioning the variation in the data. PCA is a great tool for discovering important relationships in the data, and it can also be used to create more informative features.

* NOTE: PCA is typically applied to standardized data. With standardized data, "variation" means "correlation". With unstandardized data, "variation" means "covariance".
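
The note above can be checked numerically. The following is a minimal sketch, assuming synthetic data: the `height` and `diameter` arrays are made-up stand-ins, not the real Abalone measurements. It shows that the covariance matrix of standardized columns equals the correlation matrix of the original columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two correlated features on different scales.
height = rng.normal(0.14, 0.04, size=500)
diameter = 3.0 * height + rng.normal(0.0, 0.03, size=500)
X = np.column_stack([height, diameter])

# Standardize each column: zero mean, unit (sample) variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance of the standardized data vs. correlation of the original data.
cov_of_standardized = np.cov(X_std, rowvar=False)
corr_of_original = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_of_standardized, corr_of_original))
```

Because the two matrices agree, running PCA on standardized data amounts to analyzing correlations rather than raw covariances.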

### Principal Component Analysis

As an example, consider the Abalone dataset, using the Height and Diameter measurements of the abalone shells.

<img src="https://i.imgur.com/rr8NCDy.png">

The longer axis we might call the "Size" component: small height and small diameter (lower left) contrasted with large height and large diameter (upper right). The shorter axis we might call the "Shape" component: small height and large diameter (flat shape) contrasted with large height and small diameter (round shape).

Size and Shape are another way to describe the data (the shells). This is the idea of PCA: instead of describing the data with the original features, we describe it with its axes of variation. The axes of variation become the new features.

<img src='https://i.imgur.com/XQlRD1q.png'>

The new features PCA constructs are actually just linear combinations (weighted sums) of the original features:

* df["Size"] = 0.707 * X["Height"] + 0.707 * X["Diameter"]
* df["Shape"] = 0.707 * X["Height"] - 0.707 * X["Diameter"]
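
As a rough illustration of these weighted sums, here is a sketch that applies the same 0.707 weights to standardized columns. The data is synthetic (the `Height` and `Diameter` values below are assumptions generated for the example, not the Abalone data), so only the qualitative behavior matters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical, strongly correlated features standing in for the shell data.
base = rng.normal(size=300)
X = pd.DataFrame({
    "Height": base + 0.3 * rng.normal(size=300),
    "Diameter": base + 0.3 * rng.normal(size=300),
})

# Standardize so both columns have zero mean and unit variance.
X = (X - X.mean()) / X.std()

# The new features are weighted sums of the original features.
df = pd.DataFrame({
    "Size": 0.707 * X["Height"] + 0.707 * X["Diameter"],
    "Shape": 0.707 * X["Height"] - 0.707 * X["Diameter"],
})

print(df.var())                       # Size carries most of the variance
print(df["Size"].corr(df["Shape"]))   # the two new features are uncorrelated
```

With two standardized features, the sum and the difference are uncorrelated with each other, which is why they work as separate axes of variation.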

There will be as many principal components as there are features in the original dataset. The Size component captures the majority of the variation between Height and Diameter. It's important to remember that the amount of variance in a component doesn't necessarily correspond to how good it is as a predictor: it depends on what you're trying to predict.
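
To see where the weights and the variance split come from, the sketch below computes principal components with a plain NumPy SVD. This is one common way to compute PCA, chosen here for illustration and not necessarily what any particular library does internally. The data is again synthetic and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated features standing in for Height and Diameter.
base = rng.normal(size=300)
height = base + 0.3 * rng.normal(size=300)
diameter = base + 0.3 * rng.normal(size=300)
X = np.column_stack([height, diameter])

# Standardize before PCA.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Rows of Vt are the principal axes (the weights, often called loadings).
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
loadings = Vt

# Variance captured by each component, and its share of the total.
explained_variance = S**2 / (len(X_std) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# The components themselves: the data expressed along the new axes.
components = X_std @ Vt.T

print(loadings)                  # entries are about 0.707 in magnitude
print(explained_variance_ratio)  # first component dominates
```

Two features give two components, and for two standardized features the weights come out as 0.707 in magnitude. The sign of each row is arbitrary, so a component may appear flipped relative to the formulas above.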

### PCA as Feature Engineering

There are two ways to use PCA for feature engineering:

* The first way is to use it as a descriptive technique. Since the components tell us about the variation, we could compute the mutual information (MI) scores for the components and see what kind of variation is most predictive of our target. That could give us ideas for kinds of features to create: a product of Height and Diameter if Size is important, or a ratio of Height and Diameter if Shape is important. We could even try clustering on one or more of the high-scoring components.

* The second way is to use the components themselves as features. Because the components expose the variational structure of the data directly, they can often be more informative than the original features. Here are some use cases:
 - Dimensionality reduction: When our features are highly redundant (multicollinear), PCA will partition out the redundancy into one or more near-zero variance components, which we can then drop since they will contain little or no information.
 - Anomaly detection: Unusual variation, not apparent from the original features, will often show up in the low-variance components. These components could be highly informative in an anomaly or outlier detection task.
 - Noise reduction: A collection of sensor readings will often share some common