# What is PCA?

Learn how, when, and why to use Principal Component Analysis (PCA) as part of the Machine Learning lifecycle.

## Introduction to PCA

- In the world of Machine Learning, any model that we implement will be more valuable when the features are engineered to suit the question we’re trying to answer. 
- With many datasets, we can simply include all available features, which gives us the full picture about our observations.
- For example, it’s straightforward to see a correlation between height and weight for a patient dataset. 
- Some datasets, however, have very large numbers of features. 
- If our example patient dataset expanded to include 20 different features, how would we visualize and correlate this data? 
- When it comes time to actually process the data and train the model, we often hit computational or complexity limits. 
- How do we leverage correlations within the data to make fewer, better features while losing as little of the information in the dataset as possible?

<br>

- Situations like this are a great use case for implementing Principal Component Analysis. 
- PCA is a technique that reduces the number of features in a dataset while retaining most of the information (variance) the dataset contains. 
- Sounds pretty great, right? This article will cover various aspects of PCA, so let’s dive in.

## Laying the groundwork for PCA

- Before we dive into the specifics of PCA, we need to understand the importance of information. 
- In particular, we need to understand how variance plays into the level of information in a dataset. 
- For the purposes of this article, we will be looking at a synthetic dataset about local pizza stores. 
- Let’s see what data we have:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('pizza.csv')

print(df.columns)

Index(['revenue', 'total_customers', 'amt_flour', 'amt_tomatoes',
       'amt_cheese'],
      dtype='object')


- Each row of data pertains to an individual store, and gives information about how the store is doing overall with inventory and sales. 
- Suppose we look at just `revenue` and `total_customers`, and we see the following information:
    | revenue | total_customers |
    |---------|-----------------|
    | 12345   | 500             |
    | 13425   | 500             |
    | 10872   | 500             |
    | 9561    | 500             |
- In this scenario, `total_customers` is 500 for every row. 
- While every row has a value, and therefore has data, this column does not provide a lot of *information*, due to the lack of variance in the values. 
- While we could include this feature in our downstream analytics, it doesn’t provide any additional value, because each row would have the same data.
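
- As a quick check of this idea, the sketch below rebuilds the small example table as a DataFrame (the name `toy` is only for illustration) and computes the variance of each column. A constant column should come out with a variance of exactly zero.

```python
import pandas as pd

# Hypothetical toy frame mirroring the example table above
toy = pd.DataFrame({
    "revenue": [12345, 13425, 10872, 9561],
    "total_customers": [500, 500, 500, 500],
})

# Sample variance of each column (pandas uses ddof=1 by default)
variances = toy.var()
print(variances)

# Columns with zero variance cannot help tell the rows apart
constant_cols = variances[variances == 0].index.tolist()
print(constant_cols)
```

- Checking for zero-variance columns like this is a simple screening step that can be run before PCA.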

<br>

- Now let’s look at what the real data shows us for these two columns:

In [2]:
df[['revenue','total_customers']].head()

|   | revenue     | total_customers |
|---|-------------|-----------------|
| 0 | 2185.430535 | 207             |
| 1 | 4778.214379 | 143             |
| 2 | 3793.972738 | 76              |
| 3 | 3193.963179 | 188             |
| 4 | 1202.083882 | 51              |


- As we can see, the real dataset has far more variance in these two columns.
- Since each feature has significant variance, these features provide valuable information about our observations, and should therefore be included in our analysis.

<br>

- Variance alone is one indicator of the level of information in a dataset, but it is not the only factor. 
- To expand on the idea of variance within a dataset, we will look at the Coefficient of Variation, or CV for short. 
- The premise here is that variance must be interpreted in the context of the central tendency of that dataset. 
- For example, a variance of 5 means something very different for a dataset with a mean of 2 than for a dataset with a mean of 100.
$$ CV = \frac{\text{Standard Deviation}}{\text{Mean}} = \frac{\sigma}{\mu} $$
- In probability theory and statistics, the [**Coefficient of Variation (CV)**](https://en.wikipedia.org/wiki/Coefficient_of_variation), also known as **normalized root-mean-square deviation (NRMSD)**, **percent RMS**, and **relative standard deviation (RSD)**, is a standardized measure of dispersion of a probability distribution or frequency distribution. 
- It is defined as the ratio of the standard deviation $\sigma$ to the mean $\mu$ (or its absolute value, $\left|\mu\right|$), and is often expressed as a percentage ("%RSD").
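- As a worked example of why the mean matters, take the variance of 5 mentioned above, so that $\sigma = \sqrt{5} \approx 2.236$:
$$ CV_{\mu = 2} = \frac{2.236}{2} \approx 1.118 \; (111.8\%), \qquad CV_{\mu = 100} = \frac{2.236}{100} \approx 0.0224 \; (2.24\%) $$
- The same spread is large relative to a mean of 2, but small relative to a mean of 100.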

<br>

- Now, let’s actually calculate the Coefficient of Variation for each of our columns.

In [5]:
import numpy as np

# Define a function that returns the CV as a percentage (sample standard deviation, ddof=1)
cv = lambda x: np.std(x, ddof=1) / np.mean(x) * 100

print(df.apply(cv).sort_values(ascending=False))

total_customers    56.054917
revenue            48.561545
amt_tomatoes       48.139269
amt_cheese         46.915199
amt_flour          46.459940
dtype: float64


- All of the features in this dataset have enough variance to be useful in analysis. 
- Since variance is central to PCA, we can use it as a rough guide to rank features by the level of information they carry.
- For this dataset, ranking by CV gives, in order, `total_customers`, `revenue`, `amt_tomatoes`, `amt_cheese` and then `amt_flour`. 
- While the results of PCA won’t resemble our original features, they will be a mathematical representation of the information contained in the original features, which has value for analytical purposes.

## Coding Question 1

- The kind of information we have can vary from dataset to dataset, and so can the Coefficient of Variation. 
- Use what you just learned on a new set of synthetic pizza store data, `pizza_new.csv`. 
- Calculate the Coefficient of Variation for each feature in the dataset. 
- Then, create a ranked order Python list for the features in the dataset in terms of information for PCA, from most important to least important.

In [8]:
df_temp = pd.read_csv("pizza_new.csv")
df_temp.head()

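|   | revenue     | total_customers | amt_flour | amt_tomatoes |
|---|-------------|-----------------|-----------|--------------|
| 0 | 3878.463421 | 75              | 43.812033 | 20.764636    |
| 1 | 1064.688336 | 184             | 19.219871 | 22.641881    |
| 2 | 3977.471388 | 85              | 43.349558 | 24.235654    |
| 3 | 1566.199962 | 123             | 6.842027  | 9.580122     |
| 4 | 3548.269084 | 120             | 7.234767  | 11.25916     |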


- Calculate the Coefficient of Variation for every feature

In [None]:
cv_temp = lambda x: np.std(x, ddof=1) / np.mean(x) * 100
df_temp.apply(cv_temp).sort_values(ascending=False)

total_customers    55.682374
amt_flour          49.214791
revenue            48.151389
amt_tomatoes       47.523382
dtype: float64

- Rank order of importance from highest to lowest (in a list)

In [13]:
# Feature names ordered by CV, from highest to lowest
importance_rank = list(df_temp.apply(cv_temp).sort_values(ascending=False).index)
print(importance_rank)

['total_customers', 'amt_flour', 'revenue', 'amt_tomatoes']


## The Math Behind PCA

- At this point, we need to address how we can actually take information from multiple features and distill it down into a smaller number of features. 
- Let’s dive deeper into each of the steps that lead to PCA.

### Data Matrix

- First, we need to isolate a *Data Matrix*, another name for a dataset. 
- This data matrix holds all of the features and information that we are interested in. 
- Many datasets will have columns that hold information (i.e. features), and other columns that we want to predict (i.e. labels). 
- Using our previous pizza dataset, we have 5 features in our data matrix.
    | revenue | total_customers | amt_flour | amt_tomatoes | amt_cheese |
    |---------|-----------------|-----------|--------------|------------|
    | 9931.860710 | 615.336682 | 37.662830 | 174.102712 | 139.402208 |
    | 12397.798907 | 725.440590 | 44.424509 | 239.119556 | 168.425842 |
    | 11983.079340 | 630.987797 | 40.259276 | 224.084121 | 146.612426 |
    | 13910.984353 | 746.264763 | 43.633485 | 227.096619 | 170.726464 |
    | 13083.859701 | 689.060436 | 48.964844 | 221.383478 | 154.786070 |

### Covariance Matrix

- From here, the next step of PCA is to calculate a covariance matrix. 
- Essentially, a covariance matrix captures how much each feature changes together with every other feature, i.e., we’re looking at the joint variability of each pair of features. 
- Mathematically, the formula for covariance between two features `X` and `Y` is:
$$ Cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) $$
- We will apply this equation to every pair of features, ultimately resulting in a covariance matrix that shows relationships for the entire dataset. 
- Simplifying our example dataset, we could think about our pizza dataset having five individual features with the names `a`, `b`, `c`, `d`, and `e`. 
- Our ultimate covariance matrix, thus, would end up looking like this:
$$ \begin{bmatrix} Cov_{a,a} & Cov_{a,b} & Cov_{a,c} & Cov_{a,d} & Cov_{a,e} \\ Cov_{b,a} & Cov_{b,b} & Cov_{b,c} & Cov_{b,d} & Cov_{b,e} \\ Cov_{c,a} & Cov_{c,b} & Cov_{c,c} & Cov_{c,d} & Cov_{c,e} \\ Cov_{d,a} & Cov_{d,b} & Cov_{d,c} & Cov_{d,d} & Cov_{d,e} \\ Cov_{e,a} & Cov_{e,b} & Cov_{e,c} & Cov_{e,d} & Cov_{e,e} \end{bmatrix} $$
- Luckily, with the pandas package, we can calculate a covariance matrix with the `.cov()` method. 
- For our pizza dataset, we first compute the matrix by hand with the formula above, and then confirm that `.cov()` gives the same result:

In [23]:
def calc_cov(X, Y):
    """Sample covariance between two equal-length columns."""
    assert len(X) == len(Y), "X and Y must have the same number of elements"

    # Sum of the products of each column's deviations from its mean
    diff = np.sum((X - np.mean(X)) * (Y - np.mean(Y)))
    return diff / (len(X) - 1)


# Fill an n x n matrix with the covariance of every pair of features
n = df.columns.size
m = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        m[i, j] = calc_cov(df.iloc[:, i], df.iloc[:, j])

df_cov_matrix = pd.DataFrame(m, columns=df.columns, index=df.columns)
df_cov_matrix

|                 | revenue   | total_customers | amt_flour  | amt_tomatoes | amt_cheese |
|-----------------|-----------|-----------------|------------|--------------|------------|
| revenue         | 1752348.0 | -3671.655564    | 82.021059  | 73.3308      | 384.500535 |
| total_customers | -3671.656 | 7342.909472     | 3.612878   | -22.544257   | -6.924364  |
| amt_flour       | 82.02106  | 3.612878        | 164.933974 | -6.928481    | 1.917606   |
| amt_tomatoes    | 73.3308   | -22.544257      | -6.928481  | 62.209303    | -0.487155  |
| amt_cheese      | 384.5005  | -6.924364       | 1.917606   | -0.487155    | 26.211398  |


In [14]:
df.cov()

|                 | revenue   | total_customers | amt_flour  | amt_tomatoes | amt_cheese |
|-----------------|-----------|-----------------|------------|--------------|------------|
| revenue         | 1752348.0 | -3671.655564    | 82.021059  | 73.3308      | 384.500535 |
| total_customers | -3671.656 | 7342.909472     | 3.612878   | -22.544257   | -6.924364  |
| amt_flour       | 82.02106  | 3.612878        | 164.933974 | -6.928481    | 1.917606   |
| amt_tomatoes    | 73.3308   | -22.544257      | -6.928481  | 62.209303    | -0.487155  |
| amt_cheese      | 384.5005  | -6.924364       | 1.917606   | -0.487155    | 26.211398  |


- One important point to note is that along the primary diagonal (from top-left to bottom-right), each entry is the variance of a single feature, because the covariance of a feature with itself is its variance.
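
- The property of the diagonal can be checked directly. The sketch below uses a small made-up DataFrame (`demo`, with invented values standing in for the pizza data) and compares the diagonal of `.cov()` with the output of `.var()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pizza data (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "revenue": rng.normal(3000, 1300, size=200),
    "total_customers": rng.normal(150, 85, size=200),
    "amt_flour": rng.normal(25, 12, size=200),
})

cov_matrix = demo.cov().to_numpy()   # sample covariance (ddof=1)
variances = demo.var().to_numpy()    # sample variance (ddof=1)

# The diagonal of the covariance matrix holds each column's variance
diag_matches = np.allclose(np.diag(cov_matrix), variances)
# Covariance is symmetric: Cov(X, Y) equals Cov(Y, X)
is_symmetric = np.allclose(cov_matrix, cov_matrix.T)
print(diag_matches, is_symmetric)
```

- The symmetry check explains why the values above and below the diagonal mirror each other.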

### Matrix Factorization, Eigenvalues, and Eigenvectors

- We now have a matrix of variance and covariance values for our features. 
- The next step in PCA revolves around *matrix factorization*. 
- Without going into too much detail, our goal with matrix factorization is to break the covariance matrix down into simpler matrices whose product equals the original. 
- Another way of thinking about it: 
    - we want to find a smaller matrix that captures the majority of our information.

<br>

- An important part of this matrix factorization is the set of *eigenvectors*. 
- Eigenvectors are vectors (mathematical objects that have direction and magnitude) that do not change direction when a transformation is applied to them; they are only scaled.
- In the context of data matrices, these eigenvectors give us a direction to “rotate” the dataset in n-dimensional space so we can look at the entire dataset from a simplified perspective.
- The *eigenvalues* give the amount of variance captured along each eigenvector, and therefore by each principal component.
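- One common way to write this factorization is the eigendecomposition. Assuming we call the covariance matrix $A$, and introducing the symbols $V$ and $\Lambda$ here purely for illustration:
$$ A = V \Lambda V^{T} $$
- Here the columns of $V$ are the eigenvectors and $\Lambda$ is a diagonal matrix holding the eigenvalues. Keeping only the columns of $V$ with the largest eigenvalues gives the smaller matrix that captures most of the information.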

<br>

- For a square matrix `A`, an eigenvector $v$ and its eigenvalue $\lambda$ satisfy $Av = \lambda v$. The eigenvalues are the solutions to the following equation:
$$ \det(A - \lambda I) = 0 $$
- After some linear algebra, for our covariance matrix, we are looking for the values of $\lambda$ that solve the following equation; each eigenvalue then gives us a corresponding eigenvector.
$$ \det \begin{bmatrix} 1752348.4078824243 - \lambda & -3671.6555642109524 & 82.02105899969825 & 73.33079976049092 & 384.5005354863867 \\ -3671.6555642109524 & 7342.909472185136 - \lambda & 3.612878344589165 & -22.544256957087928 & -6.924364070540847 \\ 82.02105899969825 & 3.612878344589165 & 164.93397445921298 - \lambda & -6.928481092691553 & 1.9176056462812185 \\ 73.33079976049092 & -22.544256957087928 & -6.928481092691553 & 62.209302900386234 - \lambda & -0.48715455227517007 \\ 384.5005354863867 & -6.924364070540847 & 1.9176056462812185 & -0.48715455227517007 & 26.211397934710895 - \lambda \end{bmatrix} = 0 $$
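
- In practice we rarely solve this determinant by hand. As a minimal sketch, the block below copies the covariance values from the determinant above into a NumPy array and uses `np.linalg.eigh`, which is designed for symmetric matrices. The variable names are illustrative only.

```python
import numpy as np

# Covariance matrix of the pizza dataset, copied from the determinant above
cov = np.array([
    [1752348.4078824243, -3671.6555642109524, 82.02105899969825, 73.33079976049092, 384.5005354863867],
    [-3671.6555642109524, 7342.909472185136, 3.612878344589165, -22.544256957087928, -6.924364070540847],
    [82.02105899969825, 3.612878344589165, 164.93397445921298, -6.928481092691553, 1.9176056462812185],
    [73.33079976049092, -22.544256957087928, -6.928481092691553, 62.209302900386234, -0.48715455227517007],
    [384.5005354863867, -6.924364070540847, 1.9176056462812185, -0.48715455227517007, 26.211397934710895],
])

# eigh returns eigenvalues in ascending order, with eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the direction with the most variance comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance captured along each eigenvector
explained_ratio = eigenvalues / eigenvalues.sum()
print(eigenvalues)
print(explained_ratio)
```

- Because the data is unscaled and `revenue` has by far the largest variance, the first direction captures nearly all of the total variance here. Standardizing the features before PCA is a common way to balance their influence.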

## Principal Components

- All of the underlying math behind PCA results in principal components, but what exactly are they? 
- Each principal component is a linear combination of all the input features from the original dataset. 
- By using the eigenvectors we calculated earlier, we can “rotate” and project our dataset from an n-dimensional space into a lower-dimensional space (2 dimensions in our example), which is easier for us to understand and analyze.
- To illustrate this point, let’s return to our pizza dataset. We can observe the correlation between our `revenue` and `total_customers` features.

```python
import seaborn as sns

sns.scatterplot(x='total_customers', y='revenue', data=df)
```

<img src="Images/revenue-customer-scatterplot.webp" width="700">

- A scatterplot like this shows how the two features relate, and we could use that information to guide any analysis we perform. 
- We can also do correlation plots for every combination of features, like so:

```python
sns.pairplot(df)
```

<img src="Images/all-scatterplots.webp" width="800">

- Each individual combination of features will have its own correlation and variance, both of which provide valuable information about that relationship. 
- When comparing two features at a time, these relationships are more understandable.
- However, if we wanted to look at all of the feature relationships and information at once, it would be very difficult to decipher, as we cannot visualize data in a 5-dimensional space.
- By using PCA, we can reduce the dimensionality of our dataset to 2 dimensions, allowing for better visualization. 
- Let’s see the result.

In [20]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_array = pca.fit_transform(df)
pca_array

array([[-540.63114048,   53.00605385],
       [2052.28185925,   -5.5292275 ],
       [1068.18083613,  -74.57687627],
       ...,
       [ 219.74177294,  141.55706742],
       [1853.0866606 ,   45.00856271],
       [ 583.34077584, -103.63115492]], shape=(750, 2))

- As we can see, by running PCA on our original dataset, we were able to take our 5 features and reduce the dimensions down to 2 principal components. 
- With 2 dimensions, we can now plot the data on a single scatterplot:

```python
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=pca_array[:, 0], y=pca_array[:, 1])
```

<img src="Images/pca_pizza_plot.webp" width="700">

- While it can be difficult to interpret what this new data matrix is showing us, it does hold valuable information that can be used in a variety of contexts. 
- The axes of this chart are the two most impactful principal components as part of our analysis, and were the two that we decided to keep.
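
- To connect these results back to the math, the steps can be sketched with NumPy alone. The data below is randomly generated (not the pizza data), and all variable names are illustrative. Up to the sign of each component, the projection should be close to what `PCA(n_components=2)` returns for the same input, though that comparison is not run here.

```python
import numpy as np

# Hypothetical data: 300 observations, 5 features with different spreads
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5)) * np.array([50.0, 10.0, 5.0, 2.0, 1.0])

# 1) Center each feature on its mean
X_centered = X - X.mean(axis=0)

# 2) Covariance matrix of the features (columns)
cov = np.cov(X_centered, rowvar=False)

# 3) Eigendecomposition, sorted from largest to smallest eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4) Keep the top 2 eigenvectors and project the data onto them
components = eigenvectors[:, :2]         # shape (5, 2)
X_projected = X_centered @ components    # shape (300, 2)

# Each projected column is a linear combination of all 5 original features,
# and its variance equals the corresponding eigenvalue
projected_var = X_projected.var(axis=0, ddof=1)
print(X_projected.shape)
print(projected_var, eigenvalues[:2])
```

- The two projected columns are uncorrelated with each other, which is part of what makes principal components convenient inputs for later analysis.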

## Coding Question 2

For a given dataset, we start by calculating a `covariance matrix` for all of our features. Afterwards, we perform `matrix factorization`, which decomposes the covariance matrix and gives us two results:
1) `eigenvectors`, which give the directions of the Principal Components and define the "rotation" of our new data space
2) `eigenvalues`, which give the magnitude (the amount of variance) along each direction of that new data space

## The How, Where, and Why of PCA

- PCA serves an important role in many different parts of data science and analytics in general, as this process allows us to maximize the amount of information we can extract from data while reducing computational time down the line. 
- We just saw a common use case for PCA with our pizza dataset. 
- We took a higher dimensional dataset (5 dimensions in our case), and reduced it down to 2 dimensions. 
- This two-dimensional dataset can now be an input to a variety of Machine Learning models. 
- For example, we could use this new dataset as part of a forecasting model, or perform linear regression. 
- These techniques would have been much more difficult prior to performing PCA.

<br>

- PCA is also an unsupervised learning algorithm, and it can help reveal clusters in data. 
- Unlike the popular k-means algorithm, PCA does not assign observations to clusters; instead, it finds the directions along which the data varies the most. 
- When we set the number of principal components to keep, we are choosing how many of these “rotated” directions to retain, and groups of similar observations often become visible when the data is plotted along them. 
- Typically, practitioners implement PCA as a precursor to clustering algorithms to improve their results, but exploring cluster structure with PCA alone is an interesting application!

<br>

- Another, very powerful, application of PCA is with image processing. 
- Images hold a vast amount of information in each file, and analyzing this information can have very useful applications. 
- Image classification, for example, uses algorithms to detect the subject of an image, or find a particular object within the image. 
- Overall, it can be very costly to process image data, due to the high dimensionality it has. 
- By applying PCA, however, practitioners can reduce the number of features for the image with minimal information loss and continue their processing.
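
- The compress-then-reconstruct idea can be sketched on a small synthetic grayscale array. This is not the fruit image discussed in this section: the array, the function name `pca_reconstruct`, and the choices of `k` are assumptions for illustration. The sketch uses an SVD of the centered data, which yields the same principal directions as an eigendecomposition of the covariance matrix.

```python
import numpy as np

# Hypothetical grayscale "image": a smooth 64x64 pattern plus a little noise
rng = np.random.default_rng(0)
rows = np.linspace(0, 1, 64)[:, None]
cols = np.linspace(0, 1, 64)[None, :]
image = np.sin(4 * rows) * np.cos(6 * cols) + 0.05 * rng.normal(size=(64, 64))

def pca_reconstruct(img, k):
    """Project image rows onto the top-k principal directions and rebuild."""
    mean = img.mean(axis=0)
    centered = img - mean
    # Rows of vt are the principal directions, ordered by importance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]                   # shape (k, n_columns)
    scores = centered @ top.T      # compressed representation, shape (n_rows, k)
    return scores @ top + mean     # back to the original shape

# Mean squared reconstruction error for a few choices of k
for k in (1, 5, 20):
    error = np.mean((image - pca_reconstruct(image, k)) ** 2)
    print(k, error)
```

- The reconstruction error shrinks as more components are kept, which is the trade-off between compression and information loss.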

<br>

- Below are two images from a Fruit and Vegetable Image dataset. 
- The first image is the original image, and the second is the reconstructed image using the calculated eigenvectors after performing PCA. 
- Note that, in this very quick example, there is minimal information loss, meaning that data could be effectively used for analytics.

<img src="Images/original_pineapple.webp" width="400">
<img src="Images/reconstructed_pineapple.webp" width="400">

## Summary

- In this article, you learned the underlying mathematics behind Principal Component Analysis (PCA) and got a bird’s eye view of how, when, and where to implement PCA. 
- Principal Component Analysis is a powerful tool that provides a mechanism to reduce dimensionality and simplify datasets while retaining most of the valuable information they contain. 
- The following image summarizes the different applications of PCA and we are now ready to delve into these in detail!

<img src="Images/PCA Summary.webp" width="800">