# NTDS assignment 3: spectral graph theory
[Michaël Defferrard](http://deff.ch), *PhD student*, [EPFL](http://epfl.ch) [LTS2](http://lts2.epfl.ch)

The first two assignments were designed to warm you up. This third assignment is closer to what you'll have to do for the projects. It only lacks the exploratory data analysis part (we'll do that later as an exercise). As such, this exercise is composed of two parts:
1. Data collection,
2. Data exploitation.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import sparse
import scipy.sparse.linalg

plt.rcParams['figure.figsize'] = (17, 5)

## 1 Data collection from the FMA

In the first part of the assignment, we are going to collect some data.

In [None]:
tracks = pd.read_csv('../data/fma_tracks.csv', index_col=0)
genres = pd.read_csv('../data/fma_genres.csv', index_col=0)
features = pd.read_csv('../data/fma_features.csv', index_col=0, header=[0, 1, 2])

#tracks.drop(20366, inplace=True)
#features.drop(20366, inplace=True)

#tracks = tracks[:1000]
#features = features[:1000]

In [None]:
genre1 = tracks['genre'] == 1235  # Instrumental
genre2 = tracks['genre'] == 21    # Hip-Hop

features = features.loc[genre1 | genre2, 'mfcc']
genres = tracks.loc[genre1 | genre2, 'genre']

features.shape, genres.shape

In [None]:
sum(genre1), sum(genre2)

Listen to the music

## 2 Feature extraction

As is often the case, the data at hand is too large to be dealt with directly. We have to represent it with a smaller set of features, chosen to be maximally relevant to the task. (Manual feature extraction can sometimes be replaced by end-to-end learning systems.)

For music, Mel-frequency cepstral coefficients (MFCCs) are often relevant spectral features.

Feature normalization

In [None]:
features -= features.mean(0)
features /= features.std(0)
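
As a minimal sketch of what this normalization does, the toy example below standardizes each feature (column) to zero mean and unit standard deviation. The array `toy` is made up for illustration. Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas NumPy defaults to `ddof=0`, hence the explicit argument.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# Standardize each column, mirroring the cell above.
# ddof=1 matches the pandas default used by `features.std(0)`.
toy_std = (toy - toy.mean(axis=0)) / toy.std(axis=0, ddof=1)

print(toy_std.mean(axis=0))         # approximately zero for every feature
print(toy_std.std(axis=0, ddof=1))  # approximately one for every feature
```

After this step both features contribute on a comparable scale to the Euclidean distances computed later.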

## 3 Graph construction

* Is the graph connected?
* Shall we use the un-normalized or normalized Laplacian? Choose and justify.

Compute the pairwise Euclidean ($\ell_2$) distances, or choose another metric.

Hints:
* Use the `distance.pdist()` function.

### 3.1 Compute distances

Metric

The Euclidean distance is defined as $$d(i,j) = \|x_i - x_j\|_2$$

In [None]:
from scipy.spatial import distance

distances = distance.pdist(features, metric='euclidean')
distances = distance.squareform(distances)
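
To make the two calls above concrete, here is a small sketch on three made-up points (`points` is hypothetical). It shows the condensed vector returned by `pdist` and the symmetric matrix produced by `squareform`, whose diagonal holds the distance of each point to itself.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical toy points in the plane.
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])

# pdist returns the condensed form: one entry per unordered pair (i < j).
condensed = distance.pdist(points, metric='euclidean')

# squareform expands it to a symmetric N x N matrix with a zero diagonal.
square = distance.squareform(condensed)

print(condensed)
print(square)
```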

In [None]:
plt.hist(distances.reshape(-1), bins=50);

Why are some distances equal to zero?

In [None]:
print('{} distances equal exactly zero. Why?'.format(np.sum(distances == 0)))

### 3.2 Compute the weight matrix

Gaussian kernel $$\mathbf{W}(i,j) = \exp \left( \frac{-d^2(i, j)}{\sigma^2} \right)$$

In [None]:
kernel_width = distances.mean()
weights = np.exp(-distances**2 / kernel_width**2)

np.fill_diagonal(weights, 0)
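
The effect of the kernel can be checked on a tiny example. The sketch below uses a made-up distance matrix (`toy_distances`) and an arbitrary width `sigma = 1`. Small distances map to weights close to one, and large distances map to weights close to zero.

```python
import numpy as np

# Hypothetical pairwise distances between three points.
toy_distances = np.array([[0.0, 1.0, 2.0],
                          [1.0, 0.0, 1.0],
                          [2.0, 1.0, 0.0]])

sigma = 1.0  # kernel width, chosen arbitrarily for this illustration
toy_weights = np.exp(-toy_distances**2 / sigma**2)

# Remove self-loops, as in the cell above.
np.fill_diagonal(toy_weights, 0)

print(toy_weights)
```

A larger `sigma` flattens the kernel (all weights approach one), while a smaller one makes only the closest pairs count.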

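What kind of graph is that? A fully connected one: every pair of distinct nodes is linked by a non-zero weight.

Sparsify the graph, either by keeping the $k$ nearest neighbors (kNN) of each node or by thresholding the weights ($\epsilon$-neighborhood). kNN is preferable to favor connectedness, as every node keeps at least $k$ edges.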

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(17, 8))
def plot(weights, axes):
    axes[0].spy(weights)
    axes[1].hist(weights[weights > 0].reshape(-1), bins=50);
plot(weights, axes[:, 0])

if False:
    epsilon = np.percentile(weights, 80)
    weights[weights < epsilon] = 0
else:
    NEIGHBORS = 10
    idx = np.argsort(weights)[:, :-NEIGHBORS]
    for i in range(weights.shape[0]):
        weights[i, idx[i, :]] = 0
    weights = np.maximum(weights, weights.T)

plot(weights, axes[:, 1])
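
The kNN branch above can be sketched as a standalone function and combined with a connectivity check. The helper name `knn_sparsify` and the random toy data are assumptions made for this illustration. `scipy.sparse.csgraph.connected_components` is used here as one possible way to count components. Keep in mind that kNN guarantees at least `k` edges per node but not a single connected component.

```python
import numpy as np
from scipy import sparse
from scipy.sparse import csgraph

def knn_sparsify(weights, k):
    """Keep the k largest weights of each row, then symmetrize (hypothetical helper)."""
    weights = weights.copy()
    # Column indices of all but the k largest weights of each row.
    idx = np.argsort(weights, axis=1)[:, :-k]
    np.put_along_axis(weights, idx, 0, axis=1)
    # An edge is kept if either of its endpoints selected it.
    return np.maximum(weights, weights.T)

# Toy data: 20 random points in the plane and their Gaussian kernel weights.
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
dense = np.exp(-dist**2 / dist.mean()**2)
np.fill_diagonal(dense, 0)

sparse_weights = knn_sparsify(dense, k=3)

# Count the connected components of the sparsified graph.
n_components, _ = csgraph.connected_components(
    sparse.csr_matrix(sparse_weights), directed=False)
print('{} connected component(s)'.format(n_components))
```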

### 3.3 Compute the Laplacian

In [None]:
degrees = weights.sum(0)

plt.hist(degrees, bins=50);

In [None]:
# Combinatorial Laplacian.
laplacian = np.diag(degrees) - weights

# Normalized Laplacian.
deg_inv = np.diag(1 / np.sqrt(degrees))
laplacian = deg_inv @ laplacian @ deg_inv

# Alternatively:
# laplacian = np.identity(weights.shape[0]) - deg_inv @ weights @ deg_inv

plt.spy(laplacian)
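
As a sanity check of the two formulas, the sketch below builds both Laplacians for a made-up three-node path graph (`W` is hypothetical) and verifies that the two expressions of the normalized Laplacian agree. It assumes that no node is isolated, since a zero degree would lead to a division by zero.

```python
import numpy as np

# Hypothetical weighted path graph on three nodes: 0 - 1 - 2.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
d = W.sum(axis=0)

# Combinatorial Laplacian: L = D - W.
L_comb = np.diag(d) - W

# Normalized Laplacian, written in the two equivalent ways used above.
D_inv_sqrt = np.diag(1 / np.sqrt(d))
L_norm = D_inv_sqrt @ L_comb @ D_inv_sqrt
L_alt = np.identity(3) - D_inv_sqrt @ W @ D_inv_sqrt

print(np.allclose(L_norm, L_alt))
print(np.linalg.eigvalsh(L_norm))  # eigenvalues lie in [0, 2]
```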

In [None]:
laplacian = sparse.csr_matrix(laplacian)

How many edges?

In [None]:
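# laplacian.nnz counts stored entries: each undirected edge appears twice, plus the diagonal.
n_edges = np.count_nonzero(weights) // 2
print('{} edges, {} non-zero Laplacian entries out of {} x {} = {}'.format(n_edges, laplacian.nnz, *weights.shape, weights.size))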

### 3.4 Bonus

Can you think of a way to observe if the two genres form clusters in the graph we created?

Hint: Use only the weight matrix / laplacian and the labels.

Sort the rows and columns given the labels.
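
A minimal sketch of this idea on a made-up graph: once rows and columns are reordered by label, well-separated clusters show up as dense diagonal blocks. The names `toy_labels` and `toy_weights` are hypothetical.

```python
import numpy as np

# Hypothetical toy graph: nodes 0, 2, 4 form one cluster, nodes 1, 3, 5 the other.
toy_labels = np.array([0, 1, 0, 1, 0, 1])
same = toy_labels[:, None] == toy_labels[None, :]
toy_weights = np.where(same, 1.0, 0.0)
np.fill_diagonal(toy_weights, 0)

# Reorder rows and columns so that nodes with the same label are contiguous.
order = np.argsort(toy_labels, kind='stable')
sorted_weights = toy_weights[order][:, order]

print(sorted_weights)  # block-diagonal structure
```

In the notebook, the same reordering could presumably be applied with `order = np.argsort(genres.values)` followed by `plt.spy(weights[order][:, order])`.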

## 4 Eigenvectors & eigenvalues

No need to compute the full Fourier basis; only the Fiedler vector is needed, i.e. the eigenvector associated with $\lambda_2$.

Use one of the following functions: `np.linalg.eig`, `np.linalg.eigh`, `sparse.linalg.eigs`, `sparse.linalg.eigsh`. Justify your choice.

In [None]:
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian, k=10, which='SM')

# That's much slower:
# eigenvalues, eigenvectors = np.linalg.eigh(laplacian.toarray())

In [None]:
plt.plot(eigenvalues, '.-');

Is the graph connected? Justify.

In [None]:
eigenvalues
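
The link between the spectrum and connectedness can be illustrated on a small example: the multiplicity of the eigenvalue zero of the Laplacian equals the number of connected components. The two graphs below are made up for the illustration, and the combinatorial Laplacian is used for simplicity.

```python
import numpy as np

def laplacian_of(W):
    """Combinatorial Laplacian L = D - W (hypothetical helper)."""
    return np.diag(W.sum(axis=0)) - W

def count_zero_eigenvalues(W, atol=1e-10):
    eigvals = np.linalg.eigvalsh(laplacian_of(W))
    return int(np.sum(np.isclose(eigvals, 0, atol=atol)))

# Hypothetical graph with two components: {0, 1} and {2, 3}.
W_two = np.array([[0, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

# Adding the edge 1 - 2 connects the graph.
W_one = W_two.copy()
W_one[1, 2] = W_one[2, 1] = 1

n_zero_two = count_zero_eigenvalues(W_two)
n_zero_one = count_zero_eigenvalues(W_one)
print(n_zero_two, n_zero_one)  # 2 1
```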

What do you expect as the result of the below computation? Justify. Do you get the value you expected? If not, why?

Note that `x @ y` is equivalent to `np.matmul(x, y)`. You should prefer the former as it makes it easier to read formulas.

In [None]:
np.sum(laplacian @ eigenvectors[:, 0])

**Your answer here.** We expect zero because the first eigenvalue is zero, hence $L u_1 = \lambda_1 u_1 = 0$. The small error is due to finite numerical precision.
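
The same check can be reproduced on a tiny graph. The sketch below (with a made-up weight matrix `W`) also shows that, for the normalized Laplacian of a connected graph, the eigenvector of eigenvalue zero is proportional to the square root of the degrees rather than constant.

```python
import numpy as np

# Hypothetical connected graph on three nodes.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
d = W.sum(axis=0)
D_inv_sqrt = np.diag(1 / np.sqrt(d))
L_norm = np.identity(3) - D_inv_sqrt @ W @ D_inv_sqrt

# eigh returns the eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(L_norm)
u0 = eigvecs[:, 0]

# L u0 = lambda_0 u0 = 0, up to numerical precision.
residual = L_norm @ u0
print(np.sum(residual))

# Up to sign, u0 is the normalized vector of square-rooted degrees.
expected = np.sqrt(d) / np.linalg.norm(np.sqrt(d))
print(np.allclose(np.abs(u0), expected))
```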

## 5 Clustering

Compare different techniques: PCA, the Fiedler vector, and spectral clustering.

Visualization with Laplacian eigenmaps

Principal component analysis (PCA), no graph.

In [None]:
import sklearn as skl
import sklearn.utils, sklearn.preprocessing, sklearn.decomposition

features_pca = skl.decomposition.PCA(n_components=2).fit_transform(features)

genres = skl.preprocessing.LabelEncoder().fit_transform(genres)

plt.scatter(features_pca[:,0], features_pca[:,1], c=genres, cmap='RdBu', alpha=0.5);

In [None]:
plt.scatter(eigenvectors[:, 1], eigenvectors[:, 2], c=genres, cmap='RdBu', alpha=0.5);

Cluster the tracks with the Fiedler vector. How many tracks were wrongly identified?

In [None]:
labels = (eigenvectors[:, 1] > 0)

plt.scatter(eigenvectors[:, 1], eigenvectors[:, 2], c=labels, cmap='RdBu', alpha=0.5);

In [None]:
err = np.sum(np.abs(labels - genres))
err = err if err < len(labels)/2 else len(labels) - err
print('{} errors ({}%)'.format(err, err/len(labels)*100))
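
The second line of the cell above accounts for the fact that the sign of an eigenvector is arbitrary, so the two cluster labels may be swapped. A minimal sketch of that logic, with a hypothetical helper `clustering_errors` and made-up values:

```python
import numpy as np

def clustering_errors(predicted, truth):
    """Number of misclassified samples, up to a swap of the two labels."""
    predicted = np.asarray(predicted, dtype=int)
    truth = np.asarray(truth, dtype=int)
    err = int(np.sum(predicted != truth))
    return min(err, len(truth) - err)

# Hypothetical ground truth and Fiedler vector values.
truth = np.array([0, 0, 0, 1, 1, 1])
fiedler = np.array([-0.4, -0.3, 0.1, 0.2, 0.3, 0.5])

errors = clustering_errors(fiedler > 0, truth)
errors_flipped = clustering_errors(-fiedler > 0, truth)
print(errors, errors_flipped)  # 1 1
```

Flipping the sign of the vector leaves the error count unchanged, which is the desired behavior.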

Tune some parameters (e.g. `kernel_width`, `NEIGHBORS`) to get fewer errors. You should get an error rate lower than 15% (i.e. fewer than 300 errors in total). Try to understand the effect of each parameter.