# CHEM 60 - April 10th, 2024 (Principal component analysis)

Principal Component Analysis is another scientific computing classic. It goes by many names and has many close relatives (Singular Value Decomposition, eigenvalue decomposition, empirical orthogonal functions, etc.) and is used widely across pretty much all fields that deal with numerical data.

![](https://kavassalis.space/s/Dimensions_PCA_pubs_chart.png)

Look at the diversity of fields here!


The original dataset I wanted to use for this week's class was from this delightful chemistry paper: [Reviving degraded colors of yellow flowers in 17th century still life paintings with macro- and microscale chemical imaging](https://www.science.org/doi/10.1126/sciadv.abn6344). It involved x-ray powder diffraction data, image processing, PCA, and k-means clustering! Sadly, the raw data was not as available as I had hoped, but it's still a great example of this technique being applied to a - perhaps unexpected - chemical problem. We will look at one chemical application below, but there are lots of different applications out there (e.g. some [cool paleo chemistry thing?](https://www.science.org/doi/10.1126/sciadv.aba6883) or [studying novel environmental contaminants](https://www.science.org/doi/10.1126/science.aba7127) or for [drug discovery](https://www.science.org/doi/10.1126/sciadv.aao1551)).

Even the `RDKit` library you used on Monday has a [tutorial](https://chem-workflows.com/content/PCA_compounds.html) for applying PCA to chemical data because it is so routine.


---


For extra fun, given that over the last couple of weeks we have been thinking about how chemistry can inform computation (not just computation informing chemistry), here is a link to a delightful advance in PCA - [resonant quantum PCA](https://www.science.org/doi/10.1126/sciadv.abg2589). While PCA is an old and well established technique, new variants are still being developed! This one uses insights from chemistry to build a better algorithm!

---

Save your in-class notebook copy in your personal Drive as usual.


# Imports

Here are the Python imports that we will need today. We are using more pre-built functions than usual today. The default formatting stuff is here too.

Run the below code block to get started.

In [None]:
# Standard library imports
import math as m

# Third party imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns # for a faster plot one time
from sklearn.cluster import KMeans # for pre-built clustering
from sklearn.decomposition import PCA # and PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler # and normalization


# This part of the code block is telling matplotlib to make certain font sizes extra, extra large by default
params = {'legend.fontsize': 'xx-large',
         'axes.labelsize': 'xx-large',
         'axes.titlesize':'xx-large',
         'xtick.labelsize':'xx-large',
         'ytick.labelsize':'xx-large'}
# This line updates the default parameters of pyplot (to use our larger fonts)
plt.rcParams.update(params)

We are going to use real experimental data today, so we'll need access to the Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')



---



# What is Principal Component Analysis?

Principal Component Analysis (PCA) is a technique for simplifying high-dimensional data (i.e., a dataset with lots of different variables) while maintaining patterns. It is one of the most widely used *unsupervised machine learning* algorithms out there and is extremely powerful. The goal of PCA is to capture the data's most essential features and represent the data in lower dimensions, which are called the principal components.

To appreciate the math of PCA, you need a lot more linear algebra than we'd have time to cover in a course like this (🥲), but we'll go over the high-level way that PCA works (and use pre-built libraries today to actually do the calculations).

Taking data from higher to lower dimensions is kind of analogous to projecting the shadows of a three-dimensional object onto a two-dimensional surface (if you have read the 1884 hit, 'Flatland', this always comes to mind for me). You get the benefit of simplicity in a lower dimension (easy to make 2D plots! pretty much impossible to make 32D plots!), but there's a trade-off as some details are unavoidably lost in the process.

PCA begins by determining the **hyperplane** closest to the data and then simply projects the data onto it. What a fun and obvious sentence! Let's unpack that a bit... First, think of a hyperplane as a flat, infinitely extending 'space' within the bigger space where our data lies. In 3 dimensions, a hyperplane is just a usual, flat 2-dimensional plane; in 2 dimensions, it's a one-dimensional line. This 'hyperplane' term is used to describe such flat subspaces within spaces of any dimension (we could define a 10D 'flat surface' within our hypothetical 32D space, for example). The 2D and 1D spaces are easy to picture, the higher dimensional ones less so, but the concept is the same.

Now, the 'closest' hyperplane to our data is the plane where, if we were to project the 'shadow' of our data onto it, the 'shadow' would retain as much variation from the original data as possible (i.e., if you want the shadow of your hand to look like a hand and not a blob, only some angles of light and surfaces for the light to fall on work <- this analogy makes sense in my head, at least).

Being 'closest' means minimizing the sum of squared distances from each data point to the hyperplane (jargony, yes, but you are familiar with least squares methods!). The aim is to lose as little meaningful information as possible in transitioning from a higher- to a lower-dimensional space.

So, when we say, "PCA begins by determining the hyperplane closest to the data and then simply projects the data onto it," we mean that PCA starts by finding a more straightforward representation of our complex data, which sacrifices the least important information.
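To make the 'closest hyperplane' idea a little more concrete, here is a minimal sketch using made-up 2D data (the data, the seed, and the helper name `shadow_and_distance` are illustrative assumptions, not part of today's dataset). For any candidate line through the centre of the data, the spread kept in the 'shadow' plus the mean squared distance to the line adds up to the same total, so minimizing the distances and keeping the most variance are the same goal.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
# Made-up, correlated 2D data (purely illustrative)
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=200)])
centred = data - data.mean(axis=0)  # PCA works on mean-centred data


def shadow_and_distance(angle_deg):
    """Project the centred data onto a line at the given angle.

    Returns (mean squared length of the shadow, mean squared distance to the line).
    """
    theta = np.deg2rad(angle_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector along the line
    along = centred @ direction                            # coordinate along the line
    residual = centred - np.outer(along, direction)        # what the projection loses
    return np.mean(along**2), np.mean(np.sum(residual**2, axis=1))


total = np.mean(np.sum(centred**2, axis=1))
for angle in (0, 30, 60, 90):
    kept, lost = shadow_and_distance(angle)
    print(f"{angle:>3} deg: kept {kept:.3f}, lost {lost:.3f}, sum {kept + lost:.3f} (total {total:.3f})")
```

The 'kept' and 'lost' columns trade off against each other while their sum stays fixed, which is just the Pythagorean theorem applied to every point.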

The **first principal component** is the direction in the new multi-dimensional space (hyperplane) along which the data varies the most.

![PCA demo from Wikipedia - a scatter plot showing most of the data aligned](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/GaussianScatterPCA.svg/1920px-GaussianScatterPCA.svg.png)

Look at the above image (from Wikipedia) for example. We could describe each data point in this scatter plot by its x and y coordinate, but if we made a new axis through the main diagonal (the longer arrowed line), just its position on that line (so now just one coordinate) would tell us a whole lot about where it is within the dataset (because most of the variance is along that line).

Once determined, the second principal component is calculated within the sub-space perpendicular to the first. Again, it's the next direction along which the data shows the most considerable variance or spread. This process continues until we have as many components as there are variables in our original data.

This tweet (from old Twitter) is an extremely memorable and more fun way to think about it than the above scatter plot.

![Tweet from Allison Horst saying: "As a warm-up exercise to teach PCA I ask students to pretend they're a whale shark then ask them what angle they'd tilt their shark face if they were approaching a delicious krill swarm..."](https://kavassalis.space/s/allison_horst_PCA.png)

What angle to turn your whale shark mouth will capture the most krill?

![Tweet from Allison Horst showing krill on a diagonal](https://kavassalis.space/s/allison_horst_PCA1.png)

That is your first component!
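If you would like to try the whale shark question numerically, here is a rough sketch. The 'krill swarm' is made-up data, and sweeping whole-degree angles is my own brute-force stand-in for what PCA does with linear algebra; it is only meant to show that both land on the same direction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# A made-up diagonal "krill swarm" (illustrative data only)
x = rng.normal(size=300)
krill = np.column_stack([x, x + rng.normal(scale=0.4, size=300)])
centred = krill - krill.mean(axis=0)

# Try every whole-degree mouth angle and record the variance captured along it
angles = np.arange(0, 180)
directions = np.column_stack([np.cos(np.deg2rad(angles)), np.sin(np.deg2rad(angles))])
captured = (centred @ directions.T).var(axis=0)
best_angle = angles[np.argmax(captured)]

# Compare with the first principal component found by scikit-learn
pc1 = PCA(n_components=1).fit(krill).components_[0]
pca_angle = np.rad2deg(np.arctan2(pc1[1], pc1[0])) % 180  # the sign of a PC is arbitrary

print(f"Best angle from the sweep: {best_angle} degrees")
print(f"Angle of PC1 from PCA:     {pca_angle:.1f} degrees")
```

The two angles should agree to within the one-degree resolution of the sweep.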

Doing this process multiple times results in a set of orthogonal axes, the principal components, in descending order of the variance they capture. This new 'coordinate system' represents the same data; keeping only the first few axes repeats the basic *story* from the dataset but with fewer "words", or in this case, fewer dimensions.

Okay, that might have been a lot of reading. Let's jump to seeing what it does.

# Lost in chemical space?

I love a good story. We're going to do our last class paper reading together - this is a review paper, but it sets up nicely how PCA can be used to help discover novel chemistry.

Open up "[Lost in chemical space? Maps to support organometallic catalysis](https://bmcchem.biomedcentral.com/articles/10.1186/s13065-015-0104-5)" by Natalie Fey and the Google Doc for class note taking.

After we've read and shared, we'll work through an example the author shows in this paper.

## Load the data

Let's go through the process of loading a new data frame as a pandas object. This is the data referenced in Fig. 2 (originally from the Jover et al. 2010 paper).



In [None]:
# Jover J, Fey N, Harvey JN, Lloyd-Jones GC, Orpen AG, Owen-Smith GJJ, et al. Expansion of the ligand knowledge base for monodentate P-donor ligands (LKB-P). Organometallics. 2010;29:6245–58.
df = pd.read_csv('/content/gdrive/Shared drives/Chem_60_Spring_2024/In_Class_Notebooks/data/CHEM60_Class22_data_Jover_etal.csv')
df.head()

Okay. First thing to notice - pandas wants data frames to have an index - something that orders the data points. If you don't tell it what the index is, it'll make one for you (that's the left-most column).

In [None]:
df.index

This data comes with its own index, though (one that happens to look extremely similar to the one pandas assigned). Let's use the **No.** column for the index as the authors intended.

In [None]:
df.set_index('No.', inplace=True)
df.head()

We also can notice that there is a column that shouldn't be there - the far right column must have been loaded by mistake (ie. it has no name, and is full of NaNs). Let's get rid of it.

In [None]:
df = df.drop(['Unnamed: 33'], axis=1)
df.head()
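Dropping the column by its exact name works here, but stray empty columns like this often show up when a CSV has a trailing comma, and the number in the name depends on the file. As a small sketch (using a made-up `toy` table, not the class data), here are two more general ways to remove them.

```python
import numpy as np
import pandas as pd

# A small made-up table with the same problem: an all-NaN column
# that pandas has named 'Unnamed: <position>'
toy = pd.DataFrame({
    'No.': [1, 2, 3],
    'E(HOMO)': [-0.21, -0.19, -0.23],
    'Unnamed: 2': [np.nan, np.nan, np.nan],
})

# Option 1: drop every column that contains only NaNs
cleaned_by_content = toy.dropna(axis=1, how='all')

# Option 2: drop columns whose name starts with 'Unnamed'
cleaned_by_name = toy.loc[:, ~toy.columns.str.startswith('Unnamed')]

print(cleaned_by_content.columns.tolist())
print(cleaned_by_name.columns.tolist())
```

Option 1 is the safer of the two if a real column could ever lack a header, since it looks at the contents rather than the name.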

Good practice says, before we *use* data, we look at it! This data is a bit hard to visualize though given its size (so many different variables to plot - I don't personally find 32-dimensional spaces easy to visualize).

We also might not intuitively know what it *should* look like. Ideally, we should have some expectations going in when working with data - should some variables have some sort of a relationship? What kind of distributions might we expect?

HOMO (highest occupied molecular orbital) and LUMO (lowest unoccupied molecular orbital) energy should have *some* relationship, for example, so let's look at them.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 4))  # Create a subplot with 1 row, 2 columns

# On the left plot, we draw the scatter plot
axs[0].scatter(df['E(LUMO)'], df['E(HOMO)'], edgecolor='k')
axs[0].set_ylabel('E(HOMO) (units)')
axs[0].set_xlabel('E(LUMO) (units)')
axs[0].set_ylim(min(df['E(LUMO)'].min(), df['E(HOMO)'].min())-0.05, max(df['E(LUMO)'].max(), df['E(HOMO)'].max())+0.05)
axs[0].set_xlim(min(df['E(LUMO)'].min(), df['E(HOMO)'].min())-0.05, max(df['E(LUMO)'].max(), df['E(HOMO)'].max())+0.05)
xlims = axs[0].get_xlim(); axs[0].plot(xlims, xlims, color='k', linestyle='--', linewidth=2) # adding a 1-1 line

# On the right plot, we draw the histogram
axs[1].hist(df['E(HOMO)'], bins=10, edgecolor='k')
axs[1].set_xlabel('E(HOMO) (units)')
axs[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

The above looks reasonable-ish to me (i.e., the points are below the one-to-one line, there is a vague relationship between variables we think should have a relationship, and the distributions aren't normal, but not bimodal or something else unexpected).

Ideally, you should work with data you know a lot about. If not, work with someone who knows a lot about it! Computation is powerful when coupled with conceptual understanding - these need to go together (because sometimes, a dataset is nonsense!).

Okay, mini rant over.

# 1. Data Preprocessing

Once we feel okay about using the data, the next thing to check is if we are missing any values. It is easy to intentionally or unintentionally have NaNs (not-a-numbers) in a dataset. We just need to know before we use the data. PCA will require all of our variables to be numeric (floats or ints, without missing values).

In [None]:
df.isnull().values.any()

Okay, so that means no NaNs present. Next step, normalize the data.
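Had that check come back `True`, the next question would be *where* the missing values are. Here is a hedged sketch on a made-up `toy` table (the values and the text column are hypothetical) showing how you might count NaNs per column and keep only the complete, numeric rows.

```python
import numpy as np
import pandas as pd

# Made-up table with a couple of missing values and one non-numeric column
toy = pd.DataFrame({
    'E(HOMO)': [-0.21, np.nan, -0.23, -0.20],
    'E(LUMO)': [0.03, 0.05, 0.04, np.nan],
    'ligand': ['a', 'b', 'c', 'd'],  # hypothetical text column
})

print(toy.isnull().values.any())          # is anything missing at all?

missing_per_column = toy.isnull().sum()    # how many NaNs in each column
print(missing_per_column)

numeric_only = toy.select_dtypes(include='number')  # PCA needs numeric columns
complete_rows = numeric_only.dropna()                # drop rows with any NaN
print(complete_rows.shape)
```

Dropping rows is only one option; whether it is appropriate depends on why the values are missing.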

## Normalization

**Why do we need to normalize the data?** Well, for starters, we don't want units. But, the reasons are bigger than that. Imagine you're working with a data set that includes measurements in a mixture of units, for instance, millimetres and kilometres. You might want to put them all to the same units initially, but these differing scales could lead to biased results in the PCA owing to the simple fact that a kilometre is big and a millimetre is small. If you measured them in different units, they are likely going to be very different orders of magnitude when in the same units. This implies that without normalization, PCA might determine the direction of maximum variance based on the unit having larger variances when it's really relative variance that matters. Another way I have heard this explained that I kind of like - PCA may be unfairly influenced by the 'loud voices' in the data, while the 'quieter voices' might get overlooked. Normalization ensures that all 'voices' are heard at the same volume, leading to a more balanced and accurate analysis.
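Here is a small sketch of the 'loud voices' problem, using two made-up, unrelated measurements (the names `length_km` and `width_mm` and all the numbers are invented for illustration). Both have the same relative spread, but one is recorded in kilometres and the other in millimetres.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 500
# Two made-up, unrelated measurements with the same relative spread (20%)
length_km = rng.normal(loc=5.0, scale=1.0, size=n)        # recorded in kilometres
width_mm = rng.normal(loc=5000.0, scale=1000.0, size=n)   # recorded in millimetres
raw = np.column_stack([length_km, width_mm])

pca_raw = PCA(n_components=2).fit(raw)
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(raw))

print("Raw data:    ", pca_raw.explained_variance_ratio_.round(4))
print("Standardized:", pca_scaled.explained_variance_ratio_.round(4))
print("PC1 weights on (km column, mm column), raw data:", pca_raw.components_[0].round(4))
```

On the raw numbers, the first component is essentially just the millimetre column, simply because its numbers are bigger. After standardizing, the two columns share the variance roughly equally, as two unrelated variables should.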

There are actually a couple common ways to normalize data (what? not just one? no...) Let's look at three different approaches used in the data sciences.

### Standard Scaler

This is a commonly used normalization scheme, but perhaps not the 'standard' one you'd think about. Read up on it [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler). While this normalization method is very simple to write yourself (look at the formula!), we are going to use the pre-built function to do it to save time today.

In [None]:
scaler = StandardScaler()
df_StandardScaler = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)
df_StandardScaler.head()
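For reference, here is a sketch of the same formula written by hand with `numpy` on a tiny made-up array, compared against the pre-built function. The array values are invented, and the only subtlety I am relying on is that `StandardScaler` uses the population standard deviation (`ddof=0`), which is also `numpy`'s default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 260.0],
              [4.0, 240.0]])

# z = (x - mean) / standard deviation, computed column by column
z_by_hand = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std defaults to ddof=0

z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_by_hand, z_sklearn))
print(z_by_hand.mean(axis=0).round(6), z_by_hand.std(axis=0).round(6))
```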

## PRACTICE QUESTION

Do these numbers look like what you would have expected normalized numbers to look? Why or why not?



---



**notes**



---



Okay, but there are more options!

### Min-Max Scaler

Check out this one now, [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)

In [None]:
scaler = MinMaxScaler()
df_MinMaxScaler = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)
df_MinMaxScaler.head()

## PRACTICE QUESTION

Double check these two methods give... related variables. How could we check? What visualizations or statistics could help us understand the relationship between the Standard Scaler and Min-Max Scaler?


---



In [None]:
# code?

**notes**



---



### Robust Scaler

The final commonly used method is the Robust Scaler, described [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler).

In [None]:
scaler = RobustScaler()
df_RobustScaler = pd.DataFrame(scaler.fit_transform(df), columns = df.columns)
df_RobustScaler.head()
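As a starting point for comparing the methods, here is a sketch of the Min-Max and Robust formulas written by hand, applied to one made-up feature with a single large outlier. I am assuming the default settings of both scikit-learn scalers (a 0 to 1 range for Min-Max, and the 25th to 75th percentile range for Robust).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One made-up feature with a single large outlier (the last value)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max: (x - min) / (max - min), which squeezes everything into [0, 1]
minmax_by_hand = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Robust: (x - median) / IQR, where IQR = 75th percentile - 25th percentile
q25, q75 = np.percentile(x, [25, 75], axis=0)
robust_by_hand = (x - np.median(x, axis=0)) / (q75 - q25)

print(np.allclose(minmax_by_hand, MinMaxScaler().fit_transform(x)))
print(np.allclose(robust_by_hand, RobustScaler().fit_transform(x)))
print(minmax_by_hand.ravel().round(3))
print(robust_by_hand.ravel().round(3))
```

Look at what the outlier does to the other four points in each case; that difference is a good thing to keep in mind when choosing between the methods.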

In your homework, you'll have more time to reflect on these three methods and why some datasets might be better served by one or the other.

For now, we'll do the following examples with the Standard Scaler (it is... "standard", after all).

# 2. PCA time

While we are going to use the pre-built function to do this, I will just talk through the basic math that is going on. If these terms are totally unfamiliar, focus on qualitative understanding.

**Covariance Matrix computation**

The first thing calculated is a covariance matrix of the data to understand how different variables move together. The covariance matrix is a `p x p` matrix where each element represents the covariance between two features (`p` total features).

$$ cov(X, Y) = \frac{\sum{(x_i - \mu_x)(y_i - \mu_y)}}{n-1} $$

where:
- $X$ and $Y$ are two variables,
- $x_i$ and $y_i$ are the observations for each variable,
- $\mu_x$ and $\mu_y$ are the means of each variable, and
- $n$ is the number of data points.
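To connect the formula to code, here is a minimal sketch that computes the covariance of two short, made-up variables by hand and compares it with `np.cov` (the numbers are invented for illustration).

```python
import numpy as np

# Two short made-up variables
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)
# Sum of the products of deviations from the mean, divided by n - 1
cov_by_hand = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2 x 2 matrix; the off-diagonal entry is cov(x, y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_by_hand, cov_numpy)
```

The diagonal entries of the matrix returned by `np.cov` are each variable's covariance with itself, i.e., its variance.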

We can see what the matrix looks like for this data set by letting `numpy` calculate it for us.

In [None]:
cov_matrix = np.cov(df_StandardScaler, rowvar=False)

A common way to visualize things like this is through a plot called a heat map. To make this code tidy, we'll use Seaborn (`sns`)'s prebuilt function (you can make them in matplotlib too, but the code block would be bigger).

In [None]:
plt.figure(figsize=(20, 20))
sns.heatmap(cov_matrix, annot=True, fmt='.2f')
plt.show()

Here, we can see positive and negative correlations between the different variables (because we standardized the data first, this covariance matrix is essentially a correlation matrix). Seeing large magnitude correlations tells us PCA will be very effective. Why? Because correlated variables essentially tell us we have redundant data being stored here (PCA is a compression algorithm!). If two or more variables are highly correlated with each other, what if we just stored one variable that roughly captured the variance in both? That's PCA for you.
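A quick check of why it is fair to read this heat map in terms of correlations: for standardized data, the covariance matrix and the correlation matrix of the original data differ only by a constant factor. This sketch uses made-up data with three features to show it.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Made-up data: three features, the first two strongly related
a = rng.normal(size=100)
X = np.column_stack([a, 2.0 * a + rng.normal(scale=0.5, size=100), rng.normal(size=100)])

X_scaled = StandardScaler().fit_transform(X)
cov_of_scaled = np.cov(X_scaled, rowvar=False)
corr_of_raw = np.corrcoef(X, rowvar=False)

n = X.shape[0]
# StandardScaler divides by n inside its standard deviation, np.cov divides by n - 1,
# so the two matrices differ only by the factor n / (n - 1)
print(np.allclose(cov_of_scaled, corr_of_raw * n / (n - 1)))
print(corr_of_raw.round(2))
```

With a reasonable number of data points, `n / (n - 1)` is so close to 1 that the two matrices look the same on a heat map.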

## PRACTICE QUESTION

While Seaborn is very popular for its better prebuilt colour maps, the default one for heatmaps (above) is actually *bad* for correlation matrices. Why? What qualities of a colour map would be better for the above? You can learn about the heatmap syntax [here](https://seaborn.pydata.org/generated/seaborn.heatmap.html) and a reminder [link](https://colorbrewer2.org/) to a good resource for thinking about colour maps.



---



**notes**



---




The next thing the PCA algorithm does is **compute the eigenvalues and eigenvectors** of the covariance matrix. The eigenvectors represent the directions of the new sub-space (angle of shark face), and the eigenvalues represent their magnitude, i.e., how much variance lies along each direction (how wide a mouth of shark). We are not going to step through this math because it would take a couple classes on its own. Once we have the eigenvalues and eigenvectors, we **sort** them in decreasing order and choose the first few that contain the most information (variance). These comprise the new dimensions. Finally, **the original data is transformed onto the dimensions that we choose**, generating a new, reduced dataset.
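If you are curious what those steps look like in code, here is a rough sketch on made-up data using `numpy`'s eigen-solver, checked against scikit-learn. This is only an illustration of the recipe described above; it is not a claim about how scikit-learn computes things internally, and the data, seed, and choice of `k = 2` are all assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Made-up data: 4 features built from 2 hidden factors plus a little noise
hidden = rng.normal(size=(150, 2))
mixing = np.array([[1.0, 0.5, -1.0, 0.2],
                   [0.0, 1.0, 0.5, -0.8]])
X = hidden @ mixing + rng.normal(scale=0.1, size=(150, 4))
X_centred = X - X.mean(axis=0)

# 1. Covariance matrix (p x p)
cov = np.cov(X_centred, rowvar=False)
# 2. Eigenvalues and eigenvectors (eigh is for symmetric matrices, ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# 3. Sort in decreasing order of eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# 4. Keep the first k directions and project the data onto them
k = 2
scores_by_hand = X_centred @ eigenvectors[:, :k]

pca_check = PCA(n_components=k).fit(X)
scores_sklearn = pca_check.transform(X)

print(np.allclose(eigenvalues[:k], pca_check.explained_variance_))
# Each component's sign is arbitrary, so compare absolute values
print(np.allclose(np.abs(scores_by_hand), np.abs(scores_sklearn)))
```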

We'll let [Scikit learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) take the wheel today.

Some choices need to be made though.

1. What normalization scheme makes sense for the data (we said we'd use the Standard Scaler here)
2. How many components should we pick?

Let's pick all of them! (ie. N = 32)



In [None]:
pca = PCA(n_components=32)
principalComponents = pca.fit_transform(df_StandardScaler)

Wow, what did that do?

Let's ask for the amount of variance associated with each component (ie. how much of the data is it really describing). Using the pre-built function is nice, because `explained_variance_ratio_` is just an [attribute](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) of our `pca` object.

In [None]:
explained_variance = pca.explained_variance_ratio_
explained_variance

## Scree Plot
It might be hard to see what is important from just looking at those numbers, so let's plot it (this is called a Scree Plot)!

In [None]:
components = range(1, principalComponents.shape[1] + 1) # one label per component (number of columns in the transformed data)

plt.figure(figsize=(14, 4))
plt.plot(components, explained_variance, marker='o')
plt.ylim([0,max(explained_variance)+.05])
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.xticks(components)
plt.show()

What does this tell us? Well, that we could have picked a much smaller number! Why? Let's look at those numbers.

In [None]:
print("The First PC is responsible for " + str(round(explained_variance[0]*100,3)) + "% of the variance")

In [None]:
print("The First 2 PCs are responsible for " + str(round((explained_variance[0]+explained_variance[1])*100,3)) + "% of the variance")

In [None]:
print("The First 3 PCs are responsible for " + str(round((explained_variance[0]+explained_variance[1]+explained_variance[2])*100,3)) + "% of the variance")

In [None]:
print("The First 4 PCs are responsible for " + str(round((explained_variance[0]+explained_variance[1]+explained_variance[2]+explained_variance[3])*100,3)) + "% of the variance")

In [None]:
print("The 5th PC would only add " + str(round((explained_variance[4])*100,3)) + "% of the variance")

So, most of the variance in the data is captured by the first 4 components and the rest have very small contributions. We could also just re-do the PCA with only 4 components; the variance explained by each of those 4 is the same as before.

In [None]:
pca = PCA(n_components=4)
principalComponents = pca.fit_transform(df_StandardScaler)
explained_variance = pca.explained_variance_ratio_
explained_variance
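The running totals we printed one at a time earlier can also be computed in one step with `np.cumsum`. Here is a sketch using hypothetical ratios (the numbers in `toy_ratios` are made up, and the 85% threshold is an arbitrary choice for illustration, not a rule).

```python
import numpy as np

# Hypothetical explained-variance ratios (made up for illustration; they sum to 1)
toy_ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.04, 0.03, 0.02, 0.01])

cumulative = np.cumsum(toy_ratios)
for n_pcs, running_total in enumerate(cumulative, start=1):
    print(f"First {n_pcs} PC(s): {running_total * 100:.1f}% of the variance")

# Smallest number of components that reaches a chosen threshold
threshold = 0.85
n_needed = int(np.argmax(cumulative >= threshold)) + 1
print(f"{n_needed} components reach {threshold:.0%} of the variance")
```

To use this on real results, you would swap `toy_ratios` for the `explained_variance_ratio_` array from a PCA fitted with all components.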

## Let's plot the data


First, let's make a pandas dataframe to store the transformed dataset so it's easier to plot. We can also wrap our heads around what the data looks like now - no physically descriptive variable names, just our four components.


In [None]:
df_PCA = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2', 'PC3', 'PC4'])
df_PCA

Now, let's plot 2 of the dimensions.

In [None]:
plt.scatter(df_PCA['PC1'], df_PCA['PC2'], edgecolor='k', s=50)
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.show()

Well, this is honestly... not hugely informative on its own. We still have a 4D space, so this 2D plot doesn't capture everything. It is, however, a much better visualization of the entire dataset than the HOMO and LUMO energy plot above: those were 2 out of 32 variables, whereas PC1 and PC2 are 2 variables that represent >60% of the variance in the entire dataset (that's a big change).

What if we could say which data points were most alike in this transformed dataset? How can we use this new information to better explore the chemical space we're interested in?

# K-means Clustering

The hero of so many problems, the humble K-means clustering algorithm is back to help us make meaning of this data. Two assignments so far have included the code to do this ourselves, but we'll use scikit learn's [function](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) today.

First, we need an initial guess for the number of clusters in the data (how many genres of *things* are represented here?). It is okay to take a guess at this and try different values until one makes sense (ie. will it group things that have some chemical reason for being in a group together or not? Understanding the underlying meaning of data is always key).
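One common way to make 'try different values' a bit more systematic is to fit K-means for several guesses and watch how the inertia (the sum of squared distances from each point to its nearest cluster centre) drops. This is a sketch on made-up 2D data with three obvious groups; it is a numerical aid only and does not replace asking whether the groups make chemical sense.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Made-up 2D data with three obvious groups (illustrative only)
centres_true = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centres_true])

# Fit K-means for several guesses and record the inertia for each
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

for k, inertia in enumerate(inertias, start=1):
    print(f"k = {k}: inertia = {inertia:.1f}")
```

The inertia keeps shrinking as `k` grows, so the thing to look for is where the improvement levels off (here, after `k = 3`).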

In [None]:
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, n_init=10)
y_kmeans = kmeans.fit_predict(df_PCA) # what does this line do? Look at the documentation!
centres = kmeans.cluster_centers_ # what does this line do?

## PRACTICE QUESTION

What are the above lines of code returning? Why? Read the Scikit learn documentation.



---



**notes**



---

As always, plots help.

In [None]:
plt.scatter(df_PCA['PC1'], df_PCA['PC2'], c=y_kmeans, edgecolor='k', s=50, cmap='viridis')
plt.scatter(centres[:, 0], centres[:, 1], c=range(n_clusters), marker = "*", s=1000, alpha=0.5);
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.show()

How do we interpret this? We have to know what the members of the different groups have in common! Let's put this information back with our original dataframe.

In [None]:
df['Cluster'] = y_kmeans
df.head()
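One simple way to start asking what the members of each group have in common is to summarize the original variables cluster by cluster. Here is a sketch on a tiny made-up table (the values and cluster labels in `toy` are hypothetical stand-ins for the real dataframe).

```python
import pandas as pd

# Made-up stand-in for a dataframe that already has a 'Cluster' column
toy = pd.DataFrame({
    'E(HOMO)': [-0.21, -0.20, -0.30, -0.31, -0.25],
    'E(LUMO)': [0.05, 0.06, 0.01, 0.02, 0.03],
    'Cluster': [0, 0, 1, 1, 2],
})

cluster_sizes = toy['Cluster'].value_counts().sort_index()  # how many entries per cluster
cluster_means = toy.groupby('Cluster').mean()               # average of each variable per cluster

print(cluster_sizes)
print(cluster_means)
```

Variables whose averages differ a lot between clusters are good candidates for what distinguishes the groups.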

One really helpful piece of information that isn't in this original dataframe - coded information on the chemical structures that each of these entries is describing! This could be the name of a ligand or some other chemical descriptor. This kind of information is commonly included in the form of SMILES (you just learned about these)!

## Getting back to the meaning of the components

But what if we want to know what the different components physically represent? Well, they represent a combination of original variables. We can see what combination if we want.

In [None]:
features = df_StandardScaler.columns
components_df = pd.DataFrame(pca.components_, columns=features, index=[f'PCA{i+1}' for i in range(4)])
components_df.head()

Here, the values indicate how much each original feature contributes to the PC. The sign of the weights (values in the above dataframe) represents the direction of the contribution. A positive weight means that, with everything else held fixed, the PC's value increases when the feature's value increases. A negative weight means the opposite: as the feature's value increases, the PC's value decreases. Sometimes a component will be mostly one or two variables, sometimes a mix of everything.
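To see the 'combination of original variables' idea directly, this sketch rebuilds the PC values by hand as weighted sums of the mean-centred features and compares them with what scikit-learn returns. The data is made up, and `pca_toy` is a separate object so it does not interfere with the `pca` fitted above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
# Made-up, correlated features (illustrative only)
X = rng.normal(size=(50, 3)) @ np.array([[1.0, 0.8, 0.0],
                                         [0.0, 0.5, 0.3],
                                         [0.0, 0.0, 0.2]])

pca_toy = PCA(n_components=2)
scores = pca_toy.fit_transform(X)

# Each row of components_ holds the weights for one PC;
# a sample's PC value is the weighted sum of its mean-centred features
scores_by_hand = (X - pca_toy.mean_) @ pca_toy.components_.T

print(np.allclose(scores, scores_by_hand))
```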

If you square these values and sum them up for each row (each principal component), they should add up to 1. This is because each principal component is a unit vector, and the sum of squares of a unit vector's components is equal to 1.

We can test it out.

In [None]:
for i, sum_of_squares in enumerate(components_df.pow(2).sum(axis=1), start=1):
    print(f'The sum of squares for PCA{i} is {sum_of_squares}')

## PRACTICE QUESTION

Are these numbers exactly 1? Why/why not?



---



**notes**



---



In some, but not many, applications, assigning pseudo-physical meaning to the different principal components is important. In many (including this one), it's really not. The components are a mix of all the features. The insights come from being able to cluster our data in the new space.


Technically, you can perform K-means clustering on the original data; however, performing clustering after PCA has *significant* benefits.

1. **Noise reduction**: PCA can help remove noise from the data. Random fluctuations tend to be spread thinly across all directions, so most of that noise ends up in the low-variance components we discard. Noise from a systematic error source is different: it can look like real variance and survive in the leading components.

2. **Reduction of redundancy**: We saw with the covariance matrix that many features in our original data were highly correlated with each other so don't actually add much new information individually. Correlated features make clustering (and many types of learning) hard (if you have 10 pieces of information that are essentially the same and 2 that are unique, you run the risk of overemphasizing the 10). The principal components are uncorrelated with each other, so redundant information is not counted several times over, leading to more objectively shaped clusters.

3. **Computational efficiency**: Reducing the number of dimensions with PCA can greatly speed up clustering algorithms, especially on large datasets (this data set we are looking at is small - but this approach can be applied to big data too).

4. **Visualization**: One of the most significant benefits of PCA is that it allows you to visualize high-dimensional data. After PCA, you can plot the first two or three principal components and visualise the clusters, which, assuming the first couple of components explain most of the variance in the data, can let you see a whole lot.

If your original data is not high dimensional, or if the features are highly distinctive (low correlation), applying PCA might not be necessary - while it's a great tool, not every problem is the right fit (this is true for... all tools).
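As a small illustration of the noise reduction point, here is a sketch with made-up data where we know the 'clean' signal, which is never the case with real measurements. Ten features all follow one hidden signal plus random noise; we keep only the first component and then map back to the original ten features with `inverse_transform`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# Made-up data: 10 features that all follow one hidden signal, plus random noise
signal = rng.normal(size=(200, 1)) @ rng.normal(size=(1, 10))
noisy = signal + rng.normal(scale=0.3, size=signal.shape)

# Keep only the first component, then map back to the original 10 features
pca_toy = PCA(n_components=1).fit(noisy)
denoised = pca_toy.inverse_transform(pca_toy.transform(noisy))

error_before = np.mean((noisy - signal) ** 2)
error_after = np.mean((denoised - signal) ** 2)
print(f"Mean squared error vs. the clean signal: {error_before:.4f} before, {error_after:.4f} after")
```

This works so neatly here because the example was built so that a single direction carries all of the signal; real data is rarely this tidy.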


---

Your homework will let you dig in more on this data set and PCA.


# Submit your notebook

It's time to download your notebook and submit it on Canvas. Go to the File menu and click **Download** -> **Download .ipynb**

Then, go to **Canvas** and **submit your assignment** on the assignment page. Once it is submitted, swing over to the homework now and start working through the paper.