<hr/>

# Introduction to Data Science
**Tamás Budavári** - budavari@jhu.edu <br/>

- Laplacian eigenmaps
- Exercises

<hr/>

<h1><font color="darkblue">Spectral Methods</font></h1>


- Spectral embedding

> Construct the (latent) coordinates based on a given "similarity" graph or matrix
> 

- Spectral clustering

> Use these new coordinates as input to the usual methods <br>
> E.g., simple thresholding, K-means clustering

## Adjacency Matrix

- Are two objects "close"? Are the vertices connected?

> Encode it in an $(n\!\times\!n)$ **matrix** $A$

- The matrix elements

>$ a_{ij} = \left\{ \begin{array}{ll}
         1 & \mbox{if $i$ and $j$ are connected}\\
         0 & \mbox{otherwise}\end{array} \right.  $
         
- Symmetric matrix
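
As a minimal sketch of the definition above, the cell below builds $A$ for a small undirected graph. The edge list and the vertex count are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy graph: 4 vertices, undirected edges given as index pairs
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = 1  # i is connected to j
    A[j, i] = 1  # and j to i, because the graph is undirected

print(A)
print(np.array_equal(A, A.T))  # symmetry check
```

Filling both $a_{ij}$ and $a_{ji}$ for every edge is what makes the matrix symmetric.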
         

## Graph Laplacian

- The degree matrix $D$ is a diagonal matrix whose entries count the edges at each vertex

>$\displaystyle d_{ii} = \sum_{j=1}^n a_{ij} $

- The graph Laplacian

>$ L = D - A$
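
A small sketch of these two definitions, assuming a hypothetical path graph with 4 vertices (not one of the data sets used later):

```python
import numpy as np

# Hypothetical path graph 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

d = A.sum(axis=1)  # degree of each vertex (row sums of A)
D = np.diag(d)     # diagonal degree matrix
L = D - A          # graph Laplacian

print(d)
print(L)
print(L.sum(axis=1))  # every row of L sums to zero
```

Because each row of $L$ sums to zero, the constant vector is an eigenvector of $L$ with eigenvalue 0.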


## Weighted Edges

- Instead of $A$ we can use a weight matrix $W$

>$ L = D - W$ 
><br><br>
> where $D$ has diagonal elements
><br><br>
>$\displaystyle d_{ii} = \sum_j w_{ij}$

- Interesting property: for symmetric $W$ and any vector $x$

>$\displaystyle x^T L\,x = \frac{1}{2}\sum_{i,j=1}^n w_{ij}\,(x_i\!-\!x_j)^2 $ 
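
The identity can be checked numerically. The sketch below assumes a random symmetric weight matrix with zero diagonal and a random vector $x$. Both are made up for the test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Hypothetical symmetric weight matrix with zero diagonal
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)

L = np.diag(W.sum(axis=1)) - W  # weighted graph Laplacian
x = rng.normal(size=n)          # arbitrary test vector

lhs = x @ L @ x                        # quadratic form
diff = x[:, np.newaxis] - x[np.newaxis, :]
rhs = 0.5 * np.sum(W * diff**2)        # weighted sum of squared differences

print(lhs, rhs)
print(np.isclose(lhs, rhs))
```

With non-negative weights the right-hand side cannot be negative, so $L$ is positive semi-definite.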

## Minimization

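- Goal: coordinates $x$ that keep strongly connected points close

> Minimize $x^T L\,x$ subject to $x^T x = 1$

- Solution: the eigenvectors of $L$ that belong to its smallest eigenvalues

> The 1st eigenvector is the trivial solution (constant, with eigenvalue 0) <br>
> We use the 2nd eigenvector, and so on...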

- Laplacian eigenmaps

> Different similarity matrices to start with<br>
> Different normalizations
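
A minimal sketch of the idea, assuming a hypothetical toy graph of two 4-vertex cliques joined by a single edge. The sign of the 2nd eigenvector is used as a simple threshold.

```python
import numpy as np

# Hypothetical graph: two 4-vertex cliques joined by one bridge edge (3 - 4)
n = 8
A = np.zeros((n, n))
A[:4, :4] = 1
A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1

L = np.diag(A.sum(axis=1)) - A  # unnormalized graph Laplacian
w, v = np.linalg.eigh(L)        # eigenvalues in ascending order

print(w[:3])    # the first eigenvalue is zero up to round-off
print(v[:, 0])  # trivial solution: constant vector

second = v[:, 1]       # 2nd eigenvector
labels = second > 0    # simple thresholding on its sign
print(labels)
```

The overall sign of an eigenvector is arbitrary, so which clique gets `True` can differ between runs or platforms. Only the split itself is meaningful.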


In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Blobs 

- Calculate and show the adjacency matrix
- Solve for the first 3 non-trivial eigenvectors
- Plot the diffusion coordinates
- Plot the original coordinates colored by the eigenvectors


In [None]:
X = np.loadtxt('files/Class-Blobs.csv', delimiter=',')
X.shape

In [None]:
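# Pairwise squared Euclidean distances via broadcasting, shape (n, n)
d2 = np.square(X[np.newaxis,:,:]-X[:,np.newaxis,:]).sum(axis=2)
A = (d2<16).astype(np.float32) # connect points closer than 4, i.e., distance^2 < 16
np.fill_diagonal(A,0) # no self-loops
plt.spy(A);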

In [None]:
dd = A.sum(axis=0) # vertex degrees
D = np.diag(dd)

L = D - A # graph Laplacian

w, v = np.linalg.eigh(L) # eigenvalues in ascending order, eigenvectors in columns

print(w[0:4])
print(w.shape, v.shape)
plt.plot(w)

plt.figure()
plt.plot(v[:,1], 'x', alpha=0.5);

In [None]:
s = np.argsort(v[:,1]); 
plt.figure(); plt.plot(v[s,1],'bx',alpha=0.1);
plt.figure(); plt.plot(v[s,2],'rx',alpha=0.1);

In [None]:
plt.scatter(v[:,1],v[:,2], alpha=0.01);

In [None]:
plt.scatter(X[:,0], X[:,1], c=v[:,1], cmap=plt.cm.rainbow); plt.colorbar();

In [None]:
plt.scatter(X[:,0], X[:,1], c=v[:,2], cmap=plt.cm.rainbow); plt.colorbar();

In [None]:
plt.scatter(X[:,0], X[:,1], c=v[:,3], cmap=plt.cm.rainbow); plt.colorbar();

# Circles

Using weights

In [None]:
from sklearn import datasets
np.random.seed(3) # try other seeds, e.g., 0

X, c = datasets.make_circles(n_samples=1000, factor=0.6, noise=0.05)

plt.figure(); plt.subplot(111,aspect='equal'); 
plt.scatter(X[:,0], X[:,1], alpha=0.4, edgecolor='none');

In [None]:
# Weight matrix 
d2 =  np.square(X[np.newaxis,:,:]-X[:,np.newaxis,:]).sum(axis=2)

W = np.exp(-d2 / 0.016)
np.fill_diagonal(W,0)

# Laplacian
dd = W.sum(axis=0)
D = np.diag(dd)
L = D - W

# eigenproblem
w, v = np.linalg.eigh(L)
labels = v[:,1] > 0 # threshold the 2nd eigenvector at zero
print(w[:4])

# plots
plt.figure(figsize=(9,4)); plt.subplot(121);

s = np.argsort(v[:,1]); plt.plot(v[s,1], 'x', alpha=0.6);
plt.subplot(122,aspect='equal')
plt.scatter(X[:,0], X[:,1], c=labels, cmap=plt.cm.BrBG, alpha=0.5);

In [None]:
plt.figure(figsize=(13,3)); 

plt.subplot(131,aspect='equal'); plt.scatter(X[:,0],X[:,1],c=c,alpha=0.3); 
plt.colorbar(); plt.title('generated clusters');

plt.subplot(132,aspect='equal'); plt.scatter(X[:,0],X[:,1],c=v[:,1],cmap=plt.cm.seismic,alpha=0.3); 
plt.colorbar(); plt.title('colored by eigenvector');

plt.subplot(133,aspect='equal'); plt.scatter(X[:,0],X[:,1],c=(v[:,1]>0),cmap=plt.cm.bwr, alpha=0.3); 
plt.colorbar(); plt.title('derived clusters');

## Embedding coordinates

In [None]:
plt.figure(figsize=(13.5,4)); 

plt.subplot(131,aspect='equal');
plt.scatter(X[:,0],X[:,1],c='k',edgecolor='none',alpha=0.4); plt.title('orig');

plt.subplot(132,aspect='equal');
plt.scatter(v[:,1],v[:,2],c='k',edgecolor='none',alpha=0.4); plt.title('eig12');

plt.subplot(133,aspect='equal');
plt.scatter(v[:,2],v[:,3],c='k',edgecolor='none',alpha=0.4); plt.title('eig23');

## Embedding with scikit-learn

> See online [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.spectral_embedding.html)



In [None]:
from sklearn.manifold import spectral_embedding

e = spectral_embedding(adjacency=W, n_components=4, norm_laplacian=False,
                       drop_first=False)
s = np.argsort(e[:,1]) 

plt.figure(figsize=(9,4)) 
plt.subplot(121); plt.plot(e[s,1], 'xb', alpha=0.6)
plt.subplot(122, aspect='equal')
plt.scatter(X[:,0],X[:,1],c=(e[:,1]>0),cmap=plt.cm.BrBG,edgecolor='none',alpha=0.4);

In [None]:
plt.figure(figsize=(13.5,4)); 

plt.subplot(131,aspect='equal');
plt.scatter(X[:,0],X[:,1],c='k',edgecolor='none',alpha=0.4); plt.title('orig');

plt.subplot(132,aspect='equal');
plt.scatter(e[:,1],e[:,2],c='k',edgecolor='none',alpha=0.4); plt.title('eig12');

plt.subplot(133,aspect='equal');
plt.scatter(e[:,2],e[:,3],c='k',edgecolor='none',alpha=0.4); plt.title('eig23');

## Alternatively

> See online [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html)


In [None]:
from sklearn.manifold import SpectralEmbedding

se = SpectralEmbedding(n_components=3, n_neighbors=20)

f = se.fit_transform(X)

In [None]:
plt.figure(figsize=(9,9)); 

plt.subplot(221,aspect='equal');
plt.scatter(e[:,1],e[:,2],c='k',edgecolor='none',alpha=0.4); plt.title('sk-fun 12');

plt.subplot(223,aspect='equal');
plt.scatter(e[:,2],e[:,3],c='k',edgecolor='none',alpha=0.4); plt.title('sk-fun 23');

plt.subplot(222,aspect='equal');
plt.scatter(f[:,0],f[:,1],c='k',edgecolor='none',alpha=0.4); plt.title('sk-obj 12');

plt.subplot(224,aspect='equal');
plt.scatter(f[:,1],f[:,2],c='k',edgecolor='none',alpha=0.4); plt.title('sk-obj 23');

## Parameters

- Often we use a combination of two parameters

> $k$: number of neighbors to consider for similarity graph
><br>
> $\epsilon$: bandwidth of the $\exp\left(-d^2/\epsilon\right)$ similarity
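
One way to combine the two parameters is sketched below with plain NumPy. The random points and the values of `k` and `eps` are assumptions for illustration, not recommended settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))  # hypothetical sample points

k = 5      # number of neighbors (illustrative value)
eps = 0.5  # bandwidth of the exp(-d^2/eps) similarity (illustrative value)

# Pairwise squared distances; infinite diagonal excludes self-matches
d2 = np.square(X[np.newaxis,:,:]-X[:,np.newaxis,:]).sum(axis=2)
np.fill_diagonal(d2, np.inf)

# Indices of the k nearest neighbors of every point
nn = np.argsort(d2, axis=1)[:, :k]

# Boolean k-nearest-neighbor graph, symmetrized so that W is symmetric
mask = np.zeros(d2.shape, dtype=bool)
rows = np.repeat(np.arange(len(X)), k)
mask[rows, nn.ravel()] = True
mask = mask | mask.T

# Gaussian weights on the kept edges, zero elsewhere
W = np.where(mask, np.exp(-d2 / eps), 0.0)
print(W.shape, mask.sum(axis=1).min())
```

The resulting `W` can replace the dense weight matrix in the Laplacian $L = D - W$. After symmetrization a point may have more than $k$ neighbors.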

## Clustering with scikit-learn

> See online [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering)

In [None]:
from sklearn.cluster import SpectralClustering

sc = SpectralClustering(n_clusters=2)
clusters = sc.fit_predict(X)

plt.figure(figsize=(9,4)); 

plt.subplot(121,aspect='equal')
plt.scatter(X[:,0],X[:,1],c=c,edgecolor='none',alpha=0.4)
plt.title('orig');

plt.subplot(122,aspect='equal');
plt.scatter(X[:,0],X[:,1],c=clusters,cmap=plt.cm.BrBG,edgecolor='none',alpha=0.4)
plt.title('clusters');

## Exercise

- What's wrong with the above clustering?
- Read the documentation and fix the code
- If you found a fix, look for another