# Physics 494/594

# Unsupervised Learning 

In [None]:
# %load ./include/header.py
import numpy as np
import matplotlib.pyplot as plt
import sys
from tqdm import trange,tqdm
sys.path.append('./include')
import ml4s

%matplotlib inline
%config InlineBackend.figure_format = 'svg'
plt.style.use('./include/notebook.mplstyle')
np.set_printoptions(linewidth=120)
ml4s.set_css_style('./include/bootstrap.css')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
π = np.pi

## Until Now

### Supervised Learning

- We have considered learning tasks (regression, classification) where there was labelled data (e.g. MNIST, the Ising model, etc.) and we could train a neural network to "learn" these features of the data. 
- Incredibly useful, but many problems in the physical sciences aren't necessarily about prediction. We usually want to **learn** something about the underlying distribution that generated the observation. 
- It is not useful in physics to make good predictions with the wrong model. 

## Today

- Our first foray into unsupervised learning, a large and exciting field.
- Concerned with discovering structure in unlabelled data.
- We will begin with dimensional reduction and clustering/visualization.

## Dimensional Reduction

Most of our data sets live in a very high-dimensional space (Ising model configurations were $30 \times 30$, so they live in a 900-dimensional space!).  Our goal will be to project (or embed) these observations into a lower-dimensional space.  This subspace is called the **latent space**.

Let's consider a very simple example of points in two spatial dimensions.

<!--
N = 2000
x = np.random.normal(loc=[0,0],scale=[1,0.4], size=[N,2])

θ = np.radians(35)
R = np.array([[np.cos(θ), -np.sin(θ)], [np.sin(θ), np.cos(θ)]])
for i in range(N):
    x[i] = R @ x[i]
np.savetxt('../data/scatter_2d_pca.dat',x)
-->

In [None]:
x = np.loadtxt('../data/scatter_2d_pca.dat')
plt.scatter(x[:,0],x[:,1], s=1)
plt.axis('equal')
plt.xticks([])
plt.yticks([]);

### Principal Component Analysis (PCA)

The goal of PCA is to identify the directions of largest variance, with the intuition that these correspond to **signal**, while any orthogonal spread in the data is due to **noise**.  It has an equivalent definition as the linear projection that minimizes the average projection cost (the mean squared distance between each original vector and its projection).


Consider our usual set of observations $\{ \boldsymbol{x}^{(n)} \}_{n=1}^{N}$ where now we do not have the associated targets or *labels* $y^{(n)}$.  Each $\boldsymbol{x}^{(n)} \in \mathbb{R}^D$ lives in a $D$-dimensional feature space.  Without loss of generality, we assume that the mean of the data set is zero: 

\begin{equation}
\langle \boldsymbol{x} \rangle = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}^{(n)} = (0,\dots,0).
\end{equation}

If this is not the case, we can simply subtract the mean from each data point.  Our goal is to project the data onto a latent space having dimensionality $M < D$.
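
As a minimal sketch of the centering step, assuming synthetic data (the array `X_demo` and its offsets are made up for illustration and are not part of the data set used in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: N = 500 observations in D = 3 dimensions with a non-zero mean
X_demo = rng.normal(loc=[2.0, -1.0, 0.5], scale=1.0, size=(500, 3))

# subtract the per-feature mean (the average is taken over the sample axis)
X_centered = X_demo - X_demo.mean(axis=0)

# each feature of the centered data now averages to zero (up to round-off)
print(np.allclose(X_centered.mean(axis=0), 0.0))
```

Centering shifts the cloud of points to the origin without changing its shape, so the directions of largest variance are unaffected.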

To begin, let's project onto a 1-dimensional space, i.e. $M=1$.  The direction of this space can be described by a single vector, $\boldsymbol{v}_1$, which we take to have unit norm: $\boldsymbol{v}_1 \cdot \boldsymbol{v}_1 = 1$.  We then project each data point $\boldsymbol{x}^{(n)}$ onto $\boldsymbol{v}_1$ via $\boldsymbol{v}_1^\top \boldsymbol{x}^{(n)}$. The variance of the resulting data set is given by:

\begin{align}
\sigma^2 \equiv \bigg \langle \big\lvert\boldsymbol{v}_1^{\top} \boldsymbol{x}^{(n)} - \boldsymbol{v}_1^{\top}\underbrace{\langle  \boldsymbol{x} \rangle}_{0} \big\rvert^2 \bigg \rangle  & = \frac{1}{N-1} \sum_{n=1}^{N} \boldsymbol{v}_1^{\top}  \boldsymbol{x}^{(n)}  \boldsymbol{x}^{(n)\top}  \boldsymbol{v}_1 \\
& = \boldsymbol{v}_1^{\top} \Sigma(\mathbf{X}) \boldsymbol{v}_1
\end{align}

where we have defined

\begin{equation}
\Sigma(\mathbf{X}) = \frac{1}{N-1} \mathbf{X}^{\top}\mathbf{X}
\end{equation}

to be the $D \times D$ covariance matrix of the data design matrix: 

\begin{equation}
\mathbf{X} = \left( \begin{array}{cccc}
        x_{1}^{(1)} & x_{2}^{(1)} & \cdots & x_{D}^{(1)} \\
\vdots        &      \vdots    & \ddots & \vdots \\
        x_{1}^{(N)} & x_{2}^{(N)} & \cdots & x_{D}^{(N)} \\
\end{array}
\right)\, .
\end{equation}
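
The following sketch checks these definitions numerically on synthetic data (the sizes `N`, `D` and the random direction `v` are arbitrary choices for illustration). It builds $\Sigma(\mathbf{X})$ from a centered design matrix, compares it with `np.cov`, and confirms that the variance of the projected data equals $\boldsymbol{v}^{\top} \Sigma(\mathbf{X}) \boldsymbol{v}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 1000, 4                      # arbitrary sizes for illustration

# design matrix: one observation per row, one feature per column
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                 # center the data

# D x D covariance matrix with the 1/(N-1) normalization
Sigma = X.T @ X / (N - 1)

# np.cov treats rows as variables by default, so we pass rowvar=False
print(Sigma.shape, np.allclose(Sigma, np.cov(X, rowvar=False)))

# variance of the data projected onto a random unit vector v
v = rng.normal(size=D)
v /= np.linalg.norm(v)
projected_variance = np.var(X @ v, ddof=1)
print(np.isclose(projected_variance, v @ Sigma @ v))
```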


We now want to maximize the variance subject to the constraint $\boldsymbol{v}_1 \cdot \boldsymbol{v}_1 = 1$ which we can do by adding a Lagrange multiplier $\lambda_1$:

\begin{align}
%\frac{\partial}{\partial \boldsymbol{v}_1^{\top}} 
\boldsymbol{\nabla}_{\boldsymbol{v}_1^{\top}} \left[\boldsymbol{v}_1^{\top} \Sigma(\mathbf{X}) \boldsymbol{v}_1 + \lambda_1 (1-\boldsymbol{v}_1^{\top} \boldsymbol{v}_1) \right] & = \Sigma(\mathbf{X}) \boldsymbol{v}_1 - \lambda_1 \boldsymbol{v}_1 = 0 \\
&\Rightarrow  \Sigma(\mathbf{X})\boldsymbol{v}_1 = \lambda_1 \boldsymbol{v}_1
\end{align}

which tells us that $\boldsymbol{v}_1$ is an eigenvector of $\Sigma(\mathbf{X})$ with eigenvalue $\lambda_1$.  If we multiply on the left by $\boldsymbol{v}_1^{\top}$ and use the unit norm we find:

\begin{equation}
\boxed{\lambda_1 = \boldsymbol{v}_1^{\top} \Sigma(\mathbf{X}) \boldsymbol{v}_1.}
\end{equation}

Thus the variance will be maximized when $\boldsymbol{v}_1$ is chosen to be the eigenvector with the largest eigenvalue $\lambda_1$.  It is known as the **first principal component**.  We can determine additional principal components iteratively, each time maximizing the projected variance amongst all possible directions orthogonal to the ones already considered.  This identifies the $M$ principal components as the eigenvectors $\boldsymbol{v}_j$ of $\Sigma(\mathbf{X})$ corresponding to its $M$ largest eigenvalues $\lambda_j$.
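
As a sanity check on this result, the sketch below scans over unit vectors in two dimensions and compares the direction of largest projected variance with the top eigenvector. The data are synthetic (an anisotropic Gaussian rotated by 35 degrees, chosen only for illustration) and the brute-force scan is not how PCA is computed in practice.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000

# synthetic anisotropic Gaussian cloud, rotated by 35 degrees
X = rng.normal(scale=[1.0, 0.4], size=(N, 2))
theta = np.radians(35)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T
X -= X.mean(axis=0)
Sigma = X.T @ X / (N - 1)

# brute force: evaluate v^T Sigma v for unit vectors at angles in [0, pi]
angles = np.linspace(0, np.pi, 1801)
units = np.stack([np.cos(angles), np.sin(angles)], axis=1)
variances = np.einsum('ai,ij,aj->a', units, Sigma, units)
best = units[np.argmax(variances)]

# eigen-decomposition: eigh returns eigenvalues in ascending order
evals, evecs = np.linalg.eigh(Sigma)
lam1, v1 = evals[-1], evecs[:, -1]

# the overlap is close to 1 (eigenvectors are only defined up to a sign)
print(abs(best @ v1))
# the largest scanned variance approaches the largest eigenvalue from below
print(variances.max(), lam1)
```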

In practice, we simply diagonalize the symmetric square matrix $\Sigma(\mathbf{X}) = \mathbf{X}^{\top}\mathbf{X}/(N-1)$.  The principal components are the eigenvectors $\boldsymbol{v}_j$, which we organize as the columns of a matrix $\boldsymbol{V}$, and the eigenvalues $\lambda_j$ are the variances along them.

We can do this with the `scipy.linalg` package.

In [None]:
import scipy.linalg
N = x.shape[0]
x -= np.average(x,axis=0)
Σ = x.T @ x / (N-1)
λ,V = scipy.linalg.eigh(Σ)

<div class="span alert alert-warning">
    Note: <code>eigh()</code> returns eigenvalues in ascending order!
</div>

In [None]:
λ = λ[::-1]
V = np.flip(V,axis=1)

print(f'λ = {λ}')
print(f'V = {V}')

An important quantity is the *fraction of the explained variance* carried by each component, defined by:

\begin{equation}
\text{PCA-j} = \frac{\lambda_j}{\sum_{k=1}^{D} \lambda_k}
\end{equation}
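
A common use of this quantity is to choose the latent dimension $M$. The sketch below assumes a made-up spectrum `lam` and a 90% threshold, both chosen purely for illustration, and finds the smallest number of components whose cumulative explained variance reaches the threshold.

```python
import numpy as np

# hypothetical eigenvalues, already sorted in descending order
lam = np.array([4.0, 1.0, 0.5, 0.25, 0.25])

# explained variance of each component and its running total
ratio = lam / lam.sum()
cumulative = np.cumsum(ratio)

# smallest M whose cumulative explained variance is at least the threshold
threshold = 0.90                    # arbitrary choice for illustration
M = int(np.searchsorted(cumulative, threshold)) + 1

print(ratio)
print(cumulative)
print(M)
```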

We can plot the resulting axes defined by the principal components, or alternatively, can define a projection operator (matrix):

\begin{equation}
\boldsymbol{P} = \sum_{j=1}^M\boldsymbol{v}_j\boldsymbol{v}_j^\mathsf{T}
\end{equation}

to plot with respect to the new axes.
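
The sketch below builds $\boldsymbol{P}$ for synthetic data (the sizes `N`, `D`, `M` and the per-feature scales are assumptions for illustration). It checks that $\boldsymbol{P}$ is a projector and that the mean squared reconstruction error equals the sum of the discarded eigenvalues, which is the projection cost that PCA minimizes.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, M = 500, 5, 2                 # arbitrary sizes for illustration

# synthetic data with a different spread along each feature
X = rng.normal(size=(N, D)) * np.array([3.0, 2.0, 1.0, 0.5, 0.2])
X -= X.mean(axis=0)
Sigma = X.T @ X / (N - 1)

# eigh returns ascending eigenvalues, so reverse to descending order
lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]

V_M = V[:, :M]                      # D x M matrix of the leading eigenvectors
P = V_M @ V_M.T                     # D x D projector, sum_j v_j v_j^T

# a projector is symmetric and idempotent
print(np.allclose(P, P.T), np.allclose(P @ P, P))

Z = X @ V_M                         # N x M coordinates in the latent space
X_hat = X @ P                       # N x D projected points in the original space

# reconstruction error equals the sum of the discarded eigenvalues
error = np.sum((X - X_hat) ** 2) / (N - 1)
print(np.isclose(error, lam[M:].sum()))
```

Note the difference between `Z`, which holds the $M$ latent coordinates, and `X_hat`, which holds the same points expressed back in the original $D$-dimensional space.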


In [None]:
fig,ax = plt.subplots(1,2,figsize=(8,4))

ax[0].scatter(x[:,0],x[:,1], s=1, alpha=0.5)

_x = np.linspace(-4,4,100)
_y = np.linspace(-0.5,0.5,100)
ax[0].plot(_x,V[1,0]/V[0,0]*_x, '-', color=colors[0], label=f'PCA-1 = {λ[0]/np.sum(λ):.2f}')
ax[0].plot(_y,V[1,1]/V[0,1]*_y, '-', color=colors[-2], label=f'PCA-2 = {λ[1]/np.sum(λ):.2f}')

ax[0].axis('equal')
ax[0].set_xticks([])
ax[0].set_yticks([])
ax[0].legend()

# perform the projection
px = x @ V

# this can alternatively be done via direct projection
direct = False
if direct:
    px = np.zeros_like(x)
    for n in range(x.shape[0]):
        for j in range(2):
            px[n,j] = np.dot(V[:,j],x[n,:])
    
ax[1].scatter(px[:,0],px[:,1], s=1, alpha=0.5, label=r'$\mathbf{X} \mathbf{V}$')
ax[1].set_xlabel(f'PCA-1 = {λ[0]/np.sum(λ):.2f}')
ax[1].set_ylabel(f'PCA-2 = {λ[1]/np.sum(λ):.2f}')
ax[1].axis('equal')
ax[1].legend()

fig.subplots_adjust(wspace=0.25)

### SKLearn Implementation

PCA is used often enough that we don't have to code it up from scratch every time we want the principal components.  It has a convenient implementation in `sklearn`.

Our first step is to scale the data (so-called standard or z-scaling) such that each feature $x_i$ has zero mean and unit variance:

\begin{equation}
z_i = \frac{x_i-\langle x_i \rangle}{\sqrt{\langle x_i^2 \rangle - \langle x_i \rangle^2}} \, .
\end{equation}
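
A minimal NumPy sketch of this scaling, applied feature by feature to synthetic data (the means, scales and correlation below are made up for illustration). It uses the population standard deviation (`ddof=0`), which is also the convention used by `StandardScaler`. One consequence worth knowing is that the covariance matrix of the scaled data is the correlation matrix of the original data.

```python
import numpy as np

rng = np.random.default_rng(4)

# synthetic data with very different offsets and scales per feature
X = rng.normal(loc=[1.0, -2.0], scale=[3.0, 0.5], size=(1000, 2))
X[:, 1] += 0.1 * X[:, 0]            # introduce some correlation

# z-scaling: subtract the mean and divide by the standard deviation per feature
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# each feature now has zero mean and unit variance
print(np.allclose(Z.mean(axis=0), 0.0), np.allclose(Z.std(axis=0), 1.0))

# the covariance of the scaled data is the correlation matrix of the original
C = Z.T @ Z / Z.shape[0]
print(np.allclose(C, np.corrcoef(X, rowvar=False)))
```

Because the scaling changes the relative spread of the features, the principal components of the scaled data generally differ from those of the unscaled data.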

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_x = scaler.fit_transform(x)

model = PCA(n_components=2)
XPCA = model.fit_transform(scaled_x)

λ = model.explained_variance_
V = model.components_.T
PCAj = model.explained_variance_ratio_

print(f'λ = {λ}')
print(f'V = {V}')
print(f'PCA-j = {PCAj}')

<div class="span alert alert-warning">
    Note: To be extra confusing, <code>model.components_</code> returns the principal vectors as rows! I take the transpose above to compare with our eigenvectors.
</div>
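
The sketch below compares the two routes directly on synthetic data (a rotated anisotropic Gaussian, chosen only for illustration, and without any rescaling). It assumes the eigenvalues are distinct, in which case the eigenvectors from `eigh` and the transposed rows of `components_` should agree up to an overall sign per vector.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
N = 2000

# synthetic anisotropic Gaussian cloud, rotated by 35 degrees
X = rng.normal(scale=[1.0, 0.4], size=(N, 2))
theta = np.radians(35)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T

# route 1: diagonalize the covariance matrix by hand
Xc = X - X.mean(axis=0)
lam, V = np.linalg.eigh(Xc.T @ Xc / (N - 1))
lam, V = lam[::-1], V[:, ::-1]      # descending order

# route 2: sklearn (PCA centers the data internally)
model = PCA(n_components=2).fit(X)
V_sk = model.components_.T          # transpose so the vectors are columns

# the eigenvalues agree with explained_variance_
print(np.allclose(lam, model.explained_variance_))

# the column-wise overlaps are +1 or -1, since the sign of each vector is arbitrary
overlaps = np.sum(V * V_sk, axis=0)
print(np.allclose(np.abs(overlaps), 1.0))
```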

In [None]:
fig,ax = plt.subplots(1,2,figsize=(8,4))

ax[0].scatter(scaled_x[:,0],scaled_x[:,1], s=1, alpha=0.5)

_x = np.linspace(-4,4,100)
_y = np.linspace(-0.5,0.5,100)
ax[0].plot(_x,V[1,0]/V[0,0]*_x, '-', color=colors[0], label=f'PCA-1 = {λ[0]/np.sum(λ):.2f}')
ax[0].plot(_y,V[1,1]/V[0,1]*_y, '-', color=colors[-2], label=f'PCA-2 = {λ[1]/np.sum(λ):.2f}')

ax[0].axis('equal')
#ax[0].set_xticks([])
#ax[0].set_yticks([])
ax[0].legend()

# perform the projection
px = scaled_x @ V
    
ax[1].scatter(px[:,0],px[:,1], s=1, alpha=0.5, label=r'$\mathbf{X} \mathbf{V}$')
ax[1].set_xlabel(f'PCA-1 = {λ[0]/np.sum(λ):.2f}')
ax[1].set_ylabel(f'PCA-2 = {λ[1]/np.sum(λ):.2f}')
ax[1].axis('equal')
ax[1].legend()

fig.subplots_adjust(wspace=0.25)