In [None]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False

In [None]:
import itertools

import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd

In [None]:
!which python3

In [None]:
mpl.rcParams['mathtext.fontset'] = 'stix'
mpl.rcParams['font.family'] = 'STIXGeneral'
mpl.rcParams['text.usetex'] = False
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
plt.rc('axes', labelsize=12)
mpl.rcParams['figure.dpi'] = 300

# Dimensionality Reduction

[Matthew R. Carbone](https://www.bnl.gov/staff/mcarbone) | _Assistant Computational Scientist, Computational Science Initiative, Brookhaven National Laboratory_

In this tutorial, you will learn what dimensionality reduction means, how you can use it to your advantage in your own work, and what some of the common methods for doing dimensionality reduction are. We will then implement these methods here on some real data.

## What is dimensionality reduction?

[Dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) is the process of transforming a dataset from a _high dimensional_ to a _low dimensional_ space. This begs the question, **what is dimension?**

## What is dimension?

- Colloquially, when we think of dimension, we think of the three spatial dimensions we live in
- More specifically, we are referring to three _degrees of freedom_ in which we can move
- A data point is a single example in a data set
- Each data point carries with it a number of properties. For example, in a data set of cars, each car will have properties like the number of wheels, the number of doors, miles per gallon, etc.
- The number of properties each data point has is its dimensionality
- Note: properties may or may not be independent!
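
As a minimal sketch of the bullet points above, here is a made-up table of cars (the column names and values are purely illustrative assumptions). Each row is a data point, each column is a property, and the number of columns is the dimensionality.

```python
import pandas as pd

# Hypothetical toy data set: each row is one data point (a car),
# each column is one property (feature).
cars = pd.DataFrame(
    {
        "wheels": [4, 4, 6, 4],
        "doors": [2, 4, 2, 4],
        "miles_per_gallon": [32.0, 27.5, 14.0, 41.0],
    }
)

# shape is (number of data points, number of properties)
n_points, n_dims = cars.shape
print(f"{n_points} data points, each with dimensionality {n_dims}")
```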

### Check your understanding

Let's look at the Palmer Penguins dataset we played around with in the last tutorial.

In [None]:
!pip install palmerpenguins

In [None]:
from palmerpenguins import load_penguins
penguins = load_penguins()

In [None]:
penguins

What is the dimension of the Palmer Penguins dataset?

Here's a trickier example. What is the dimensionality of a "2d color image"? Hint: it's not 2!

# Principal component analysis

[Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA) is a statistically rigorous method for reducing the dimensionality of a dataset. It is mostly used as a preliminary analysis and visualization tool. Here are some reference materials you can take a look at in your free time to get a better feel for what this method is and how it works!

- [Scikit Learn's decomposition reference](https://scikit-learn.org/stable/modules/decomposition.html#decompositions)
- [PCA Explained Visually with Zero Math](https://towardsdatascience.com/principal-component-analysis-pca-explained-visually-with-zero-math-1cbf392b9e7d)

## What is PCA?

PCA is arguably the simplest dimensionality reduction method. It maps each data point's $d$-dimensional feature vector onto a $d'$-dimensional feature vector, where $d' < d$. Each new "effective feature" is a linear combination of the original features.

To get into some of the details, we have to understand the [covariance matrix](https://en.wikipedia.org/wiki/Covariance_matrix). The covariance matrix $K$ is a square, symmetric, positive semi-definite matrix. The diagonals $K_{ii}$ are the variances of each feature. The off-diagonals $K_{ij}$ ($i \neq j$) are the covariances between different features. Formally, this means

$$K_{ij} = \mathrm{cov}(X_i, X_j) = \mathbb{E}[(X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j])].$$

While in general $X_i$ is a random variable, in the case of some fixed dataset, $X_i$ is usually a _feature_ of the data, and the expectations become averages over the data points. Thus, for a dataset of $N$ elements and $d$ features, the covariance matrix is $d \times d.$
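
To make these properties concrete, here is a small sketch on synthetic data (the random data and the names `X_toy` and `K_toy` are assumptions for illustration only). It replaces the expectations in the definition with sample averages, using the common $N - 1$ normalization, and checks the shape, symmetry, diagonal, and positive semi-definiteness numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data set (illustration only): N = 200 points, d = 3 features
N, d = 200, 3
X_toy = rng.normal(size=(N, d))
X_toy[:, 2] = X_toy[:, 0] + 0.1 * X_toy[:, 2]  # Make feature 2 track feature 0

# Translate the definition directly: expectations become sample averages
centered = X_toy - X_toy.mean(axis=0)
K_toy = centered.T @ centered / (N - 1)

is_symmetric = np.allclose(K_toy, K_toy.T)
diag_is_variance = np.allclose(np.diag(K_toy), X_toy.var(axis=0, ddof=1))
smallest_eigenvalue = np.linalg.eigvalsh(K_toy).min()

print("Shape:", K_toy.shape)  # One row and one column per feature
print("Symmetric:", is_symmetric)
print("Diagonal equals the per-feature variances:", diag_is_variance)
print("Smallest eigenvalue:", smallest_eigenvalue)
```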

PCA can effectively be reduced to **diagonalizing the covariance matrix**. The diagonal elements of the new form are the variances along each new axis, and each new axis is an eigenvector of $K.$

## "Iris" Dataset

The Iris dataset contains three different types of Iris flowers. There are 150 examples and each example has 4 attributes (4 _dimensions_). Let's see if we can use PCA to decompose the dataset from 4 dimensions into just 2! See [here](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html) and [here](https://en.wikipedia.org/wiki/Iris_flower_data_set) for more details on the dataset. For now, we'll use a simple processing script to turn the dictionary dataset into a Pandas DataFrame.

In [None]:
from sklearn import datasets

In [None]:
iris = datasets.load_iris()
# print(iris["DESCR"])  # For more information!

In [None]:
def process_iris(iris=iris):
    d = {
        "sepal length (cm)": iris["data"][:, 0],
        "sepal width (cm)": iris["data"][:, 1],
        "petal length (cm)": iris["data"][:, 2],
        "petal width (cm)": iris["data"][:, 3],
        "class": iris["target"]
    }
    return pd.DataFrame(d)

In [None]:
iris_data = process_iris(iris)

In [None]:
iris_data

Let's do PCA by hand first. Then we can find an easier way of doing it. Below, I've written a function `covariance_matrix`, which assumes an incoming matrix of shape `N` x `d`, where `N` is the number of data points and `d` is the dimensionality of each data point.

In [None]:
def covariance_matrix(X):
    """Let X be an N x d dataset."""
    
    mu = X.mean(axis=0)  # For each feature, find the mean
    X2 = X - mu          # Subtract off the mean from each element
    N = X.shape[0]       # Total number of data points
    return (X2.T @ X2) / (N - 1)

In [None]:
X = iris_data.iloc[:, :4].to_numpy()

In [None]:
Y = iris_data["class"].to_numpy()

We'll also want to scale our data to 0 mean and unit variance before we go forward.

In [None]:
X_scaled = (X - X.mean(axis=0, keepdims=True)) / (1e-8 + X.std(axis=0, keepdims=True))

In [None]:
K = covariance_matrix(X_scaled)
K

In [None]:
# np.cov expects one variable per row by default, hence the transpose
np.allclose(K, np.cov(X_scaled.T))
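
Two small details are worth a quick sketch, again on synthetic stand-in data (the array `X_toy` is an assumption, not the Iris data). First, `np.cov` accepts `rowvar=False` as an alternative to transposing. Second, because `std` defaults to the population convention while the covariance uses $N - 1$, the diagonal of the covariance of standardized data is $N / (N - 1)$ rather than exactly 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(150, 4))  # Synthetic stand-in, shape (N, d)
N = X_toy.shape[0]

# np.cov treats rows as variables by default; rowvar=False avoids the transpose
same = np.allclose(np.cov(X_toy.T), np.cov(X_toy, rowvar=False))

# Standardize with the population standard deviation (NumPy's default, ddof=0)
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The covariance uses the N - 1 convention, so the diagonal is N / (N - 1), not 1
diag = np.diag(np.cov(X_std, rowvar=False))
print(same, diag, N / (N - 1))
```

The difference is tiny for 150 points and does not change the principal directions, only the overall scale of the eigenvalues.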

Now that we have the covariance matrix, it's time to diagonalize it. What does this mean? Essentially, it means we are looking for a transformation:

$$V^T K V = K'$$

such that $K'$ is diagonal. We're not going to go into the details here, but the way to do this is via the eigendecomposition of the matrix. Luckily, we have packages for this. `np.linalg.eig` provides a convenient way to calculate the eigenvalues `w` and eigenvectors `V` (stored as the columns of a matrix) of any square matrix.

In [None]:
w, V = np.linalg.eig(K)
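
A side note, sketched on a small hand-written symmetric matrix (`K_toy` is an illustrative assumption): `np.linalg.eig` does not promise any particular ordering of the eigenvalues, whereas `np.linalg.eigh`, which is meant for symmetric matrices, returns them in ascending order. Since PCA ranks directions by decreasing variance, one possible approach is to sort explicitly.

```python
import numpy as np

# Small symmetric matrix standing in for a covariance matrix (illustration only)
K_toy = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 3.0]])

# eigh assumes a symmetric input and returns eigenvalues in ascending order
w_toy, V_toy = np.linalg.eigh(K_toy)

# Reorder so that the direction with the largest variance comes first
order = np.argsort(w_toy)[::-1]
w_sorted = w_toy[order]
V_sorted = V_toy[:, order]  # Eigenvectors are the columns of V

print(w_sorted)
```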

In [None]:
K_prime = V.T @ K @ V         # Test the above transformation!
K_prime[np.abs(K_prime) < 1e-14] = 0  # Zero out round-off noise (visualization only)

In [None]:
K_prime

Note that the eigenvalues `w` appear in the diagonals:

In [None]:
w

and that these eigenvectors are orthogonal:

In [None]:
for ii, jj in itertools.combinations(range(4), 2):
    assert np.abs(np.dot(V[:, ii], V[:, jj])) < 1e-14

So what now? Well, we have the eigenvectors of the covariance matrix, but what do we do with them? It turns out that these eigenvectors represent the linear combinations of the features of the original dataset such that the _first_ "direction" captures the most possible variance, the _second_ "direction" captures the second most possible variance, etc.

Another way to think about this procedure is that it takes linear combinations of features such that the new features have zero correlation ([though this does not necessarily imply independence](https://towardsdatascience.com/independence-covariance-and-correlation-between-two-random-variables-197022116f93)). These "directions" corresponding to linear combinations are "ranked" in order of the amount of variance they capture. We can execute this transformation via yet another simple matrix operation:

In [None]:
X_scaled.shape

In [None]:
V.shape

In [None]:
principal_values = X_scaled @ V
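
To see what "captures the most variance" means in numbers, here is a self-contained sketch that reloads Iris, repeats the projection, and compares the variance of each projected column with the eigenvalues. The variable names (`X_iris`, `Z`, `explained`) are assumptions for this example and are independent of the cells above.

```python
import numpy as np
from sklearn.datasets import load_iris

X_iris = load_iris()["data"]
X_iris = (X_iris - X_iris.mean(axis=0)) / X_iris.std(axis=0)

# Eigendecomposition of the covariance matrix, sorted by decreasing eigenvalue
w_iris, V_iris = np.linalg.eigh(np.cov(X_iris, rowvar=False))
order = np.argsort(w_iris)[::-1]
w_iris, V_iris = w_iris[order], V_iris[:, order]

Z = X_iris @ V_iris  # Project the data onto the principal directions

# The variance of each projected column equals the corresponding eigenvalue
variances_match = np.allclose(Z.var(axis=0, ddof=1), w_iris)

# Fraction of the total variance captured by each direction
explained = w_iris / w_iris.sum()
print(variances_match, explained.round(3))
```

The first two entries of `explained` account for most of the total, which is why a two-dimensional plot of this dataset is informative.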

Don't want to do all of this work? Don't worry! `scikit-learn` has you covered.

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(4)
principal_values_via_sklearn = pca.fit_transform(X_scaled)

In [None]:
# Eigenvectors are the same up to a +/- sign, which is fine
np.allclose(np.abs(principal_values), np.abs(principal_values_via_sklearn))

The key improvement that `scikit-learn` offers is that you don't have to compute all of the eigenvectors, which is expensive. For example, `pca = PCA(2)` tells `scikit-learn` to compute only the first two principal components, which can be done more cheaply. This is very useful when you're dealing with large datasets, where performing a full PCA is expensive.
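
As a short sketch of how the number of components can be chosen (using `StandardScaler` here in place of the manual scaling, which is an assumption of this example), `n_components` can be an integer, or a float between 0 and 1 that asks for enough components to reach that fraction of the variance.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris = StandardScaler().fit_transform(load_iris()["data"])

# Keep only the first two principal components
pca_2 = PCA(n_components=2).fit(X_iris)
print(pca_2.explained_variance_ratio_)

# Alternatively, ask for as many components as needed to reach 95% of the variance
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_iris)
print(pca_95.n_components_)
```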

### Inspect the Kernels

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(3, 2))

for ii in range(4):
    ax.plot(pca.components_[ii])  # Each row of components_ is one principal axis


### Make the "PCA plot"

The primary utility of PCA comes from its ability to decompose very high dimensional data into only a few dimensions efficiently. Usually, we choose two dimensions so we can make a scatter plot of the new, reduced-dimensional data.

In [None]:
principal_values_via_sklearn.shape

Usually, we color the points by their class. Let's do that here.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(2, 2))

ax.scatter(principal_values_via_sklearn[:, 0], principal_values_via_sklearn[:, 1], c=Y)

ax.set_xlabel("$z_1$")
ax.set_ylabel("$z_2$")

Clearly, we can see that the data "clusters" in a reasonable way. So what does this mean? It's a good indicator that a predictive algorithm will perform well. It also can give us an idea as to which labels in an arbitrary dataset "correlate" well with the features. In this case, there's only one label, but in general, there may be many labels you wish to predict.

## "Labeled Faces in the Wild" Dataset

Let's take a look at another example which can hopefully give you more of an idea as to how PCA works. We'll be using the Labeled Faces in the Wild (LFW) dataset, and following along with a sklearn tutorial [here](https://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html#sphx-glr-auto-examples-applications-plot-face-recognition-py).

In [None]:
from sklearn.datasets import fetch_lfw_people

In [None]:
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

y = lfw_people.target
target_names = lfw_people.target_names

In [None]:
print("Input shapes", lfw_people.images.shape)  # X still refers to the Iris data here

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(1, 1.5))

ii = 6
ax.imshow(lfw_people.images[ii], cmap=plt.cm.gray)
ax.set_title(target_names[y[ii]], fontsize=8)
ax.set_xticks([])
ax.set_yticks([])
ax.spines[['right', 'top', "left", "bottom"]].set_visible(False)

plt.show()

Each image is `50 x 37` pixels, for a whopping 1850 degrees of freedom. Note that although there are many degrees of freedom, these are images, which means that pixels are almost always locally correlated. Therefore, the effective dimensionality of the images is almost certainly much lower than this. While convolutional neural networks are the algorithm of choice for supervised problems dealing with images, we're going to simply perform PCA on the image data, and use the reduced-dimensional data to make predictions using a simpler algorithm.

In [None]:
N = lfw_people.images.shape[0]
image_shape = lfw_people.images[0].shape  # We'll need the image shape for later

First, we flatten the data (we will scale it after the train/test split). At this point, the data is no longer an "image"; it is just a long feature vector, where each index is a pixel.

In [None]:
X = lfw_people.images.reshape(N, -1)

We'll do a train/test split and then scale the data. Question: why do we "fit" the scaler on only the training data?

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Now we apply PCA!

In [None]:
n_components = 150
pca = PCA(n_components=n_components, svd_solver="randomized", whiten=True).fit(X_train)

In [None]:
eigenfaces = pca.components_.reshape(n_components, *image_shape)

In [None]:
fig, axs = plt.subplots(1, 10, figsize=(10, 1))

for ii, face in enumerate(eigenfaces[:10]):
    ax = axs[ii]
    ax.imshow(face, cmap=plt.cm.gray)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.spines[['right', 'top', "left", "bottom"]].set_visible(False)
    
plt.show()

(Creepy)

Here's something else we can do. Let's build up an image of George W. Bush from his PCA features!

In [None]:
# The PCA was fit on scaled data, so scale the image before transforming it
w_GWB = pca.transform(scaler.transform(X[6, :].reshape(1, -1)))

In [None]:
fig, axs = plt.subplots(1, 15, figsize=(10, 1))

for ii in range(15):
    ax = axs[ii]
    
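    k = (ii + 1) * 10  # Use the first 10, 20, ..., 150 components
    # Undo the whitening (whiten=True) before projecting back to (scaled) pixel space
    face = (w_GWB[:, :k] * np.sqrt(pca.explained_variance_[:k])) @ pca.components_[:k, :]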
    
    ax.imshow(face.reshape(*image_shape), cmap=plt.cm.gray)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.spines[['right', 'top', "left", "bottom"]].set_visible(False)
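
The same build-up can be checked numerically. The sketch below uses the small digits dataset bundled with `scikit-learn` as a stand-in for the faces (an assumption made so the example runs without a download). It shows that, with `whiten=True`, a manual reconstruction has to multiply the features by the square root of `explained_variance_` to agree with `inverse_transform`, and that the reconstruction error shrinks as more components are kept.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 8 x 8 digit images bundled with scikit-learn, standing in for the face images
X_digits = load_digits()["data"]  # Shape (1797, 64)

pca_w = PCA(n_components=32, whiten=True, random_state=0).fit(X_digits)
z = pca_w.transform(X_digits[:1])  # Whitened PCA features of the first image
scale = np.sqrt(pca_w.explained_variance_)

# Undo the whitening, rotate back to pixel space, and add the mean back
manual = (z * scale) @ pca_w.components_ + pca_w.mean_
matches = np.allclose(manual, pca_w.inverse_transform(z))

# Reconstruction error when only the first k components are kept
errors = []
for k in (4, 8, 16, 32):
    partial = (z[:, :k] * scale[:k]) @ pca_w.components_[:k] + pca_w.mean_
    errors.append(np.linalg.norm(partial - X_digits[:1]))

print(matches, np.round(errors, 2))
```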

Let's now use a simple SVC classifier to try and figure out which face belongs to whom using the decomposed representation!

In [None]:
w_train = pca.transform(X_train)

In [None]:
from sklearn.svm import SVC

In [None]:
clf = SVC(kernel="rbf", class_weight="balanced", C=76823, gamma=0.0034)  # Best hyperparameters from sklearn

In [None]:
clf.fit(w_train, y_train)

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
w_test = pca.transform(X_test)
y_pred = clf.predict(w_test)

print(classification_report(y_test, y_pred, target_names=target_names))
ConfusionMatrixDisplay.from_estimator(
    clf, w_test, y_test, display_labels=target_names, xticks_rotation="vertical"
)
plt.tight_layout()
plt.show()
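
One possible way to keep the scale, decompose, and classify steps together is a `scikit-learn` pipeline. The sketch below is not the workflow used above: it swaps in the bundled digits dataset, 30 components, and default `SVC` hyperparameters (all assumptions for illustration), but it shows the pattern. Because the pipeline is fit on the training split only, the scaler and the PCA never see the test data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=42
)

# Scaling, PCA and the classifier are fit together, on the training split only
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=30, random_state=0),
    SVC(kernel="rbf"),
)
model.fit(X_tr, y_tr)

accuracy = model.score(X_te, y_te)
print(f"Test accuracy: {accuracy:.3f}")
```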

# Real research example

We now follow along with the data in Torrisi _et al._ to show a way in which PCA can be used in real research. The data we pull below is available [open access](https://data.matr.io/4/).

S. B. Torrisi, M. R. Carbone, B. A. Rohr, J. H. Montoya, Y. Ha, J. Yano, S. K. Suram & L. Hung. [Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships.](https://www.nature.com/articles/s41524-020-00376-6) npj Comput. Mater. 6, 109 (2020).

See also a few of my and my BNL colleagues' works, all of which use PCA or some other forms of dimensionality reduction:

- M. R. Carbone, S. Yoo, M. Topsakal & D. Lu. [Classification of local chemical environments from x-ray absorption spectra using supervised machine learning](https://doi.org/10.1103/PhysRevMaterials.3.033604). Physical Review Materials 3, 033604 (2019).
- M. R. Carbone, M. Topsakal, D. Lu & S. Yoo. [Machine-learning X-ray absorption spectra to quantitative accuracy](https://doi.org/10.1103/PhysRevLett.124.156401). Physical Review Letters 124, 156401 (2020).
- E. J. Sturm, M. R. Carbone, D. Lu, A. Weichselbaum & R. M. Konik. [Computing Anderson Impurity Model Spectra Using Machine Learning](https://doi.org/10.1103/PhysRevB.103.245118). Physical Review B 103, 245118 (2021).
- C. Miles, M. R. Carbone, E. J. Sturm, D. Lu, A. Weichselbaum, K. Barros & R. M. Konik. [Machine learning of Kondo physics using variational autoencoders and symbolic regression](https://doi.org/10.1103/PhysRevB.104.235111). Physical Review B 104, 235111 (2021).
- A. Ghose, M. Segal, F. Meng, Z. Liang, M. S. Hybertsen, X. Qu, E. Stavitski, S. Yoo, D. Lu & M. R. Carbone. [Uncertainty-aware predictions of molecular X-ray absorption spectra using neural network ensembles](https://doi.org/10.1103/PhysRevResearch.5.013180). Physical Review Research 5, 013180 (2023).
- Z. Liang, M. R. Carbone, W. Chen, F. Meng, E. Stavitski, D. Lu, M. S. Hybertsen & X. Qu. [Decoding Structure-Spectrum Relationships with Physically Organized Latent Spaces](https://doi.org/10.1103/PhysRevMaterials.7.053802). Physical Review Materials 7, 053802 (2023).

First, we have to get the data. To do this, we use the `requests` module to directly pull the content of the webpage, and then parse that specific format (which despite the extension is not exactly JSON). It is not important to understand the particulars.

In [None]:
import json
import requests

In [None]:
url = "https://s3.amazonaws.com/publications.matr.io/4/deployment/data/files/spectral_data/Ti_XY.json"
r = requests.get(url)
text = r.text.split("\n")
data = [json.loads(xx) for xx in text[:-1]]
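
For the curious, the parsing step amounts to treating every line as its own JSON document. Here is a minimal offline sketch with a made-up two-record payload (the values are hypothetical; only the one-object-per-line layout and the `"E"` and `"mu"` keys mirror the cells in this notebook).

```python
import json

# Hypothetical two-record payload in the same one-object-per-line layout
payload = '{"E": [1.0, 2.0], "mu": [0.1, 0.2]}\n{"E": [1.0, 2.0], "mu": [0.3, 0.4]}\n'

# Parse each non-empty line as its own JSON document
records = [json.loads(line) for line in payload.splitlines() if line.strip()]

print(len(records), records[1]["mu"])
```

Skipping empty lines makes the parse independent of whether the text ends with a trailing newline.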

Get the inputs and outputs from this list of dictionaries.

In [None]:
e_grid = data[0]["E"]
spectra = np.array([
    dat["mu"] for dat in data
    if dat["one_hot_coord"] is not None
])
coordinations = np.array([
    dat["coordination"] for dat in data
    if dat["one_hot_coord"] is not None
])

The labels, `coordinations`, are the coordination numbers of the X-ray-absorbing atoms! If you don't know what this means, don't worry too much about it. Let's say the task at hand is to determine whether an X-ray absorption spectrum can be used to predict this number. We can use PCA as an indicator of how well a machine learning model may perform on this data.

In [None]:
np.unique(coordinations)

Here is what some of the spectra look like. These are our input features, while the classes above are our targets.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(2, 1))

for spec in spectra[::50]:
    ax.plot(e_grid, spec, color="black", alpha=0.1)

ax.set_ylabel(r"$\mu(E)$ / a.u.")  # Raw string so that \m is not read as an escape
ax.set_yticks([])
ax.set_xlabel("$E$ / eV")
plt.show()


In [None]:
n_components = 2
pca = PCA(n_components=n_components).fit(spectra)

In [None]:
w = pca.transform(spectra)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(2, 2))

ax.scatter(w[:, 0], w[:, 1], c=coordinations, s=0.5)

plt.show()
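
If you want a number to go with the visual impression, one option (not something done in the paper, just a sketch) is a silhouette score computed in the reduced space. The example below uses synthetic "spectra" rather than the Ti data: each one is a single Gaussian peak whose position depends on a made-up class label.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 100)

# Synthetic "spectra" (not the Ti data): the peak position depends on the class
centers = np.array([0.3, 0.5, 0.7])
labels = rng.integers(0, 3, size=300)
peaks = np.exp(-((grid[None, :] - centers[labels][:, None]) / 0.05) ** 2)
toy_spectra = peaks + 0.01 * rng.normal(size=peaks.shape)

z = PCA(n_components=2).fit_transform(toy_spectra)

# Close to 1 for tight, well-separated classes; near 0 or negative when they overlap
score = silhouette_score(z, labels)
print(f"Silhouette score in the 2D PCA space: {score:.2f}")
```

A high score suggests that even a simple classifier could separate the classes; a low score is less conclusive, since PCA only captures linear structure.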