# NTDS assignment 3: spectral graph theory
[Michaël Defferrard](http://deff.ch), *PhD student*, [EPFL](http://epfl.ch) [LTS2](http://lts2.epfl.ch)

The first two assignments were designed to warm you up. This third assignment is closer to what you'll have to do for the projects. It only lacks the exploratory data analysis part (we'll do that later as an exercise). As such, this exercise is composed of two parts:
1. Data collection,
2. Data exploitation.

In [None]:
%matplotlib inline

import configparser
import os

import requests
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import sparse, stats
import scipy.sparse.linalg
import librosa
import IPython.display as ipd

plt.rcParams['figure.figsize'] = (17, 5)

If the above cell fails, it's most probably because you are missing a package. Install it with e.g. `conda install librosa` or `pip install librosa`.

## 1 Data collection

Like in any data project, the first part of the assignment is to collect some data. 

### 1.1 Get the genre of a single track

As often, we need an API key for certain operations. Add the following to your `credentials.ini` file. I gave a key during the lab on November 6. If you were not there, ask one of your classmates.
```
[freemusicarchive]
api_key = MY-KEY
```

In [None]:
# Read the confidential api key.
credentials = configparser.ConfigParser()
credentials.read(os.path.join('..', 'credentials.ini'))
api_key = credentials.get('freemusicarchive', 'api_key')

Your first task is to develop a function to retrieve the genre ID of a track given its track ID using the [FMA API](https://freemusicarchive.org/api).

Hints:
* A track might have multiple genres associated with it. Always return the first one and discard the others.
    * Note: you should never discard data blindly. I selected the tracks so that this is not a problem.
* The `get_genre` function takes an integer as input, the track ID, and returns another integer, the genre ID.

In [None]:
def get_genre(track_id):
    """Returns the genre of a track by querying the API."""
    BASE_URL = 'https://freemusicarchive.org/api/get/tracks.json'
    url = '{}?track_id={}&api_key={}'.format(BASE_URL, track_id, api_key)
    response = requests.get(url).json()
    return int(response['dataset'][0]['track_genres'][0]['genre_id'])

# A correct implementation should pass the below test.
assert get_genre(1219) == 38
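
To experiment without network access or an API key, the parsing step can be isolated from the HTTP request. The sketch below is only an illustration: `parse_first_genre` and `fake_response` are hypothetical names, and the payload merely mirrors the keys accessed by `get_genre` above (`dataset`, `track_genres`, `genre_id`). The real API response may contain more fields.

```python
def parse_first_genre(response):
    """Return the first genre ID of the first track in a decoded JSON response."""
    track = response['dataset'][0]
    genres = track.get('track_genres') or []
    if not genres:
        raise ValueError('The track has no genre.')
    # Convert to int regardless of whether the ID is encoded as a string.
    return int(genres[0]['genre_id'])

# Hypothetical payload, shaped after the keys used by get_genre.
fake_response = {
    'dataset': [{'track_genres': [{'genre_id': '38'}, {'genre_id': '12'}]}]
}
print(parse_first_genre(fake_response))
```

Raising an explicit error for a track without genre makes the failure easier to diagnose than an `IndexError` deep inside a loop over 2000 tracks.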

### 1.2 Create a table of tracks

The `fma_tracks.csv` file contains a list of 2'000 track IDs that we will use throughout this assignment.

In [None]:
tracks = pd.read_csv('../data/fma_tracks.csv', index_col=0)
print('You got {} track IDs.'.format(len(tracks)))

Once imported by pandas, each row of the DataFrame represents a track.

In [None]:
tracks.head(4)

### 1.3 Add the genre to the table

Your task is to add a `genre` column to the above created `tracks` DataFrame. The column should contain an integer which represents the genre of the track, i.e. the return value of the `get_genre` function you developed.

Hints:
* When developing, retrieve the genre of the first 10 tracks only. Run your code on all the tracks only once it works. That will save you time.
* As we want to apply one function (`get_genre`) to many data samples (the 2000 track IDs), try to use a functional approach. Check out `tracks.apply()` or the built-in `map`. In Python, you can declare an [anonymous function](https://en.wikipedia.org/wiki/Anonymous_function) as `lambda x: x + 1`.
* Your table should look like the below table, except with the correct number instead of 0.
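
As a minimal offline sketch of the functional approach mentioned in the hints, the example below applies a stand-in function to a toy table. `fake_genre` and `toy` are hypothetical and replace `get_genre` and `tracks` so that no API call is needed.

```python
import pandas as pd

def fake_genre(track_id):
    """Hypothetical stand-in for get_genre, so that the example runs offline."""
    return 21 if track_id % 2 else 12

# Toy table indexed by made-up track IDs.
toy = pd.DataFrame(index=pd.Index([2, 5, 10], name='track_id'))

# Built-in map over the index.
toy['genre'] = list(map(fake_genre, toy.index))

# Row-wise apply: row.name holds the index label, i.e. the track ID.
also = toy.apply(lambda row: fake_genre(row.name), axis=1)

print(toy)
```

Both variants give the same column. The `map` version avoids building one Series per row, which `apply(..., axis=1)` does.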

In [None]:
tracks['genre'] = 0
tracks.head(4)

In [None]:
tracks = tracks[:10]

#tracks['genre'] = tracks.apply(lambda track: get_genre(track.name), axis=1)

# Alternatively:
# tracks['genre'] = list(map(get_genre, tracks.index))

# Alternatively:
for tid in tqdm(tracks.index[:10]):
    tracks.at[tid, 'genre'] = get_genre(tid)

In [None]:
tracks.head(4)

### 1.4 Save the data

To avoid having to collect the data every time you restart the IPython kernel, save the DataFrame as a CSV file.

In [None]:
#tracks.to_csv('../data/fma_tracks_with_genre.csv')

You can now load it back with the following call instead of running the code in sections 1.1 to 1.3.

In [None]:
tracks = pd.read_csv('../data/fma_tracks_with_genre.csv', index_col=0)

### 1.5 Data cleaning

As always, data cleaning is necessary when dealing with real (as opposed to synthetic) data. In this case, we only need to "summarize the genres". The tracks I've selected for the assignment belong to one of the following two *top-level genres*: Rock (`genre_id=12`) and Hip-Hop (`genre_id=21`). Their *actual genre(s)* might however be more specific and be a sub-genre of those. For example, Punk is a sub-genre of Rock. You can explore the genre hierarchy on the [Free Music Archive](http://freemusicarchive.org/genre/Rock/). The below function will return the correct top-level genre for any of the sub-genres you'll encounter.

In [None]:
#tracks_m.loc[tracks.index, ('track', 'genres')].apply(lambda x: x[0]).unique()
#tracks['genre'] = tracks_m.loc[tracks.index, ('track', 'genres')].apply(lambda x: x[0])

In [None]:
def get_top_genre(genre_id):
    return 21 if genre_id in [21, 83, 100, 539, 542, 811] else 12

# Map only the genre column, so that any other column is left untouched.
tracks['genre'] = tracks['genre'].map(get_top_genre)
tracks.head(4)

If everything went fine, you should now have 1000 Rock (`genre_id=12`) and 1000 Hip-Hop (`genre_id=21`) tracks.

In [None]:
tracks['genre'].value_counts()
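
The mapping can be checked on a handful of IDs before applying it to the whole table. The sketch below assumes, like `get_top_genre` above, that any ID outside the listed Hip-Hop sub-genres belongs to Rock. `toy_genres` is made-up data, and the `isin` variant is merely a vectorized alternative, not the required approach.

```python
import numpy as np
import pandas as pd

HIP_HOP_IDS = [21, 83, 100, 539, 542, 811]  # IDs listed in get_top_genre

toy_genres = pd.Series([21, 83, 12, 25, 811], name='genre')  # made-up sample

# Element-wise mapping with a function.
mapped = toy_genres.map(lambda genre_id: 21 if genre_id in HIP_HOP_IDS else 12)

# Vectorized alternative: a single membership test for the whole column.
vectorized = pd.Series(np.where(toy_genres.isin(HIP_HOP_IDS), 21, 12), name='genre')

print(mapped.value_counts())
```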

## 2 Feature extraction

As is often the case, the data at hand is too large to be dealt with directly. We have to represent it with a smaller set of features, chosen to be maximally relevant to the task. (Manual feature extraction can sometimes be replaced by end-to-end learning systems.)

For music, Mel-frequency cepstral coefficients (MFCCs) are often relevant spectral features.

Listen to the music

In [None]:
PATH = '/data/research/projects/fma_dataset/data/fma_small'

def get_path(track_id):
    tid_str = '{:06d}'.format(track_id)
    return os.path.join(PATH, tid_str[:3], tid_str + '.mp3')

filepath = get_path(tracks.index[0])
print('File: {}'.format(filepath))

audio, sampling_rate = librosa.load(filepath, sr=None, mono=True)
print('Duration: {:.2f}s, {} samples'.format(audio.shape[-1] / sampling_rate, audio.size))

start, end = 7, 17
ipd.Audio(data=audio[start*sampling_rate:end*sampling_rate], rate=sampling_rate)

### 2.1 Computation

In [None]:
features = pd.read_csv('../data/fma_features.csv', index_col=0, header=[0, 1, 2])
features = features.loc[tracks.index, 'mfcc']
assert (tracks.index == features.index).all()

features.shape, tracks.shape

Hint:
* Use `tqdm` to show progress. For example `for i in tqdm(range(10)): print(i)`.

In [None]:
N_MFCC = 20

columns = []
for stat in ['mean', 'std', 'skew', 'kurtosis', 'median', 'min', 'max']:
    columns.extend(('mfcc', stat, '{:02d}'.format(i+1)) for i in range(N_MFCC))

names = ('feature', 'statistics', 'number')
columns = pd.MultiIndex.from_tuples(columns, names=names).sort_values()
features = pd.DataFrame(index=tracks.index, columns=columns, dtype=np.float32)

In [None]:
def compute_mfcc(track_id):
    audio, sampling_rate = librosa.load(get_path(track_id), sr=None, mono=True)
    return librosa.feature.mfcc(y=audio, sr=sampling_rate, n_mfcc=N_MFCC)

compute_mfcc(tracks.index[0]).shape

In [None]:
for tid in tqdm(tracks.index):
    mfcc = compute_mfcc(tid)
    # Each statistic is taken over time (axis=1) and fills N_MFCC columns at once.
    # `.loc` is used because `.at` is meant for scalar access only.
    features.loc[tid, ('mfcc', 'mean')] = np.mean(mfcc, axis=1)
    features.loc[tid, ('mfcc', 'std')] = np.std(mfcc, axis=1)
    features.loc[tid, ('mfcc', 'skew')] = stats.skew(mfcc, axis=1)
    features.loc[tid, ('mfcc', 'kurtosis')] = stats.kurtosis(mfcc, axis=1)
    features.loc[tid, ('mfcc', 'median')] = np.median(mfcc, axis=1)
    features.loc[tid, ('mfcc', 'min')] = np.min(mfcc, axis=1)
    features.loc[tid, ('mfcc', 'max')] = np.max(mfcc, axis=1)

In [None]:
print(features.shape)
features.head(4)
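
To see where the feature dimensionality comes from, the sketch below summarizes a random stand-in for an MFCC matrix (one row per coefficient, one column per frame) with the same seven statistics. It uses NumPy only: the skewness and excess kurtosis are written out from their moment definitions, which is what `stats.skew` and `stats.kurtosis` compute with their default arguments. `toy_mfcc` is synthetic data, not real audio features.

```python
import numpy as np

N_MFCC = 20
rng = np.random.default_rng(0)
toy_mfcc = rng.normal(size=(N_MFCC, 500))  # synthetic: 20 coefficients x 500 frames

mean = toy_mfcc.mean(axis=1)
std = toy_mfcc.std(axis=1)
z = (toy_mfcc - mean[:, None]) / std[:, None]  # standardized frames

summary = {
    'mean': mean,
    'std': std,
    'skew': (z ** 3).mean(axis=1),          # biased sample skewness
    'kurtosis': (z ** 4).mean(axis=1) - 3,  # excess (Fisher) kurtosis
    'median': np.median(toy_mfcc, axis=1),
    'min': toy_mfcc.min(axis=1),
    'max': toy_mfcc.max(axis=1),
}

# One value per statistic and coefficient: 7 x 20 = 140 features per track.
feature_vector = np.concatenate(list(summary.values()))
print(feature_vector.shape)
```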

### 2.2 Feature selection

### 2.3 Feature normalization

In [None]:
features -= features.mean(0)
features /= features.std(0)
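
The effect of this standardization can be verified on synthetic data. The sketch below uses NumPy on a made-up matrix `toy_x` whose columns have very different scales. Note that pandas' `std` uses the sample estimate (`ddof=1`) by default while NumPy's uses `ddof=0`, so `ddof=1` is passed explicitly to mirror the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic features: three columns with different means and spreads.
toy_x = rng.normal(loc=[0, 10, -5], scale=[1, 5, 0.1], size=(200, 3))

# Center each column, then scale it to unit standard deviation.
toy_z = (toy_x - toy_x.mean(axis=0)) / toy_x.std(axis=0, ddof=1)

print(toy_z.mean(axis=0).round(6), toy_z.std(axis=0, ddof=1).round(6))
```

Without this step, the column with the largest spread would dominate the Euclidean distances computed later.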

## 3 Graph construction

* Is the graph connected?
* Shall we use the un-normalized or normalized Laplacian? Choose and justify.

Compute the pairwise $\ell_2$ (Euclidean) distances between the feature vectors, or choose another metric.

Hints:
* Use the `distance.pdist()` function.

### 3.1 Compute distances

Metric

The Euclidean distance is defined as $$d(i,j) = \|x_i - x_j\|_2$$

In [None]:
from scipy.spatial import distance

distances = distance.pdist(features, metric='euclidean')
distances = distance.squareform(distances)
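
For intuition, the same matrix can be computed directly from the definition with NumPy broadcasting. This is a sketch on three made-up 2D points (`toy_points`), not a replacement for `pdist`, which returns the $n(n-1)/2$ distinct pairs in condensed form before `squareform` expands them to an $n \times n$ matrix.

```python
import numpy as np

toy_points = np.array([[0., 0.], [3., 4.], [6., 8.]])  # made-up samples

# diff[i, j] = x_i - x_j, obtained by broadcasting over two new axes.
diff = toy_points[:, None, :] - toy_points[None, :, :]
toy_dist = np.sqrt((diff ** 2).sum(axis=-1))

print(toy_dist)
```

The broadcasting version stores an intermediate array of size $n \times n \times d$, so it is only practical for small inputs.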

In [None]:
plt.hist(distances.reshape(-1), bins=50);

Why are some distances equal to zero?

In [None]:
print('{} distances equal exactly zero. Why?'.format(np.sum(distances == 0)))

### 3.2 Compute the weight matrix

Gaussian kernel $$\mathbf{W}(i,j) = \exp \left( \frac{-d^2(i, j)}{\sigma^2} \right)$$

In [None]:
kernel_width = distances.mean()
weights = np.exp(-distances**2 / kernel_width**2)

np.fill_diagonal(weights, 0)
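
The kernel can be tried on a tiny hand-written distance matrix to see how distances turn into similarities. In the sketch below, `toy_dist` is made up, and the kernel width follows the same heuristic as the cell above (the mean of all entries of the distance matrix).

```python
import numpy as np

# Made-up symmetric distance matrix between three points.
toy_dist = np.array([[0., 1., 2.],
                     [1., 0., 3.],
                     [2., 3., 0.]])

sigma = toy_dist.mean()  # mean over all nine entries, i.e. 4/3 here
toy_w = np.exp(-toy_dist**2 / sigma**2)
np.fill_diagonal(toy_w, 0)  # no self-loops

print(np.round(toy_w, 3))
```

Small distances map to weights close to one, large distances to weights close to zero. A larger `sigma` flattens the weights, a smaller one emphasizes the closest neighbors.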

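What kind of graph is that? It is fully connected: every pair of distinct nodes is linked by an edge with a positive weight.

Sparsify the graph, either by keeping the $k$ nearest neighbors of each node (kNN) or by discarding the weights below a threshold $\epsilon$. The kNN approach is better at preserving connectedness, as every node keeps at least $k$ edges.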

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(17, 8))

def plot(weights, axes):
    """Show the sparsity pattern and the histogram of the non-zero weights."""
    axes[0].spy(weights)
    axes[1].hist(weights[weights > 0].reshape(-1), bins=50)

plot(weights, axes[:, 0])

if False:
    epsilon = np.percentile(weights, 80)
    weights[weights < epsilon] = 0
else:
    NEIGHBORS = 10
    idx = np.argsort(weights)[:, :-NEIGHBORS]
    for i in range(weights.shape[0]):
        weights[i, idx[i, :]] = 0
    weights = np.maximum(weights, weights.T)

plot(weights, axes[:, 1])
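
The kNN branch above can be isolated in a small function and checked on a random matrix. This is a sketch under the same conventions (zero diagonal, symmetrization with an element-wise maximum). `knn_sparsify` and `toy_w` are hypothetical names introduced for the illustration.

```python
import numpy as np

def knn_sparsify(weights, k):
    """Keep the k largest weights of each row, then symmetrize with a max."""
    w = weights.copy()
    # Column indices of all but the k largest entries of each row.
    drop = np.argsort(w, axis=1)[:, :-k]
    np.put_along_axis(w, drop, 0, axis=1)
    # An edge is kept if either endpoint selected the other one.
    return np.maximum(w, w.T)

rng = np.random.default_rng(0)
a = rng.random((6, 6))
toy_w = (a + a.T) / 2       # random symmetric weights
np.fill_diagonal(toy_w, 0)  # no self-loops

toy_sparse = knn_sparsify(toy_w, k=2)
print((toy_sparse > 0).sum(axis=1))  # number of neighbors per node
```

Because of the symmetrization, a node can end up with more than `k` neighbors, but never with fewer.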

### 3.3 Compute the Laplacian

In [None]:
degrees = weights.sum(0)

plt.hist(degrees, bins=50);

In [None]:
# Combinatorial Laplacian.
laplacian = np.diag(degrees) - weights

# Normalized Laplacian.
deg_inv = np.diag(1 / np.sqrt(degrees))
laplacian = deg_inv @ laplacian @ deg_inv

# Alternatively:
# laplacian = np.identity(weights.shape[0]) - deg_inv @ weights @ deg_inv

plt.spy(laplacian)
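
Both Laplacians can be inspected by hand on a tiny graph. The sketch below builds them for a made-up path graph with three nodes (`toy_w`) and checks two properties: the rows of the combinatorial Laplacian sum to zero, and the two expressions of the normalized Laplacian agree.

```python
import numpy as np

# Path graph 0 - 1 - 2 with unit weights (made-up example).
toy_w = np.array([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
toy_degrees = toy_w.sum(axis=0)

# Combinatorial Laplacian: L = D - W.
toy_l = np.diag(toy_degrees) - toy_w

# Normalized Laplacian: D^(-1/2) L D^(-1/2) = I - D^(-1/2) W D^(-1/2).
d_inv_sqrt = np.diag(1 / np.sqrt(toy_degrees))
toy_l_norm = d_inv_sqrt @ toy_l @ d_inv_sqrt

print(toy_l.sum(axis=1))
print(np.linalg.eigvalsh(toy_l_norm).round(6))
```

The normalization divides by the square root of the degrees, so it is undefined for an isolated node (degree zero).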

In [None]:
laplacian = sparse.csr_matrix(laplacian)

How many edges?

In [None]:
print('{} edges out of {} x {} = {}'.format(laplacian.nnz, *weights.shape, weights.size))

### 3.4 Bonus

Can you think of a way to observe if the two genres form clusters in the graph we created?

Hint: Use only the weight matrix / laplacian and the labels.

Sort the rows and columns given the labels.
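
A minimal sketch of this reordering, on a made-up graph with four nodes and two labels: once the rows and columns are permuted so that nodes with the same label are contiguous, clusters show up as dense diagonal blocks. `toy_w` and `toy_labels` are hypothetical.

```python
import numpy as np

toy_labels = np.array([1, 0, 1, 0])
# Made-up graph: node 0 is linked to node 2, and node 1 to node 3.
toy_w = np.array([[0., 0., 1., 0.],
                  [0., 0., 0., 1.],
                  [1., 0., 0., 0.],
                  [0., 1., 0., 0.]])

# Permutation that groups the nodes by label.
order = np.argsort(toy_labels, kind='stable')
# Apply the same permutation to the rows and to the columns.
toy_w_sorted = toy_w[np.ix_(order, order)]

print(toy_w_sorted)
```

On the real data, `plt.spy` of the reordered matrix would then reveal whether most edges stay within a genre.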

## 4 Eigenvectors & eigenvalues

No need to compute the Fourier basis, only the Fiedler vector, i.e. the eigenvector associated with $\lambda_2$.

Use one of the following functions: `np.linalg.eig`, `np.linalg.eigh`, `sparse.linalg.eigs`, `sparse.linalg.eigsh`. Justify your choice.

In [None]:
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian, k=10, which='SM')

# That's much slower:
# eigenvalues, eigenvectors = np.linalg.eigh(laplacian.toarray())

In [None]:
plt.plot(eigenvalues, '.-');

Is the graph connected? Justify.

In [None]:
eigenvalues
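
One way to reason about connectedness from the spectrum: the multiplicity of the eigenvalue zero of a graph Laplacian equals the number of connected components. The sketch below illustrates this on a made-up graph made of two disjoint edges. It uses the dense solver `np.linalg.eigvalsh`, which is fine for a toy matrix, and the threshold `1e-10` is an arbitrary assumption to absorb rounding errors.

```python
import numpy as np

# Two disjoint edges: 0 - 1 and 2 - 3 (made-up example).
toy_w = np.array([[0., 1., 0., 0.],
                  [1., 0., 0., 0.],
                  [0., 0., 0., 1.],
                  [0., 0., 1., 0.]])
toy_l = np.diag(toy_w.sum(axis=0)) - toy_w

toy_eigenvalues = np.linalg.eigvalsh(toy_l)  # sorted in ascending order
n_components = int(np.sum(toy_eigenvalues < 1e-10))

print(toy_eigenvalues.round(6), n_components)
```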

What do you expect as the result of the below computation? Justify. Do you get the value you expected? If not, why?

Note that `x @ y` is equivalent to `np.matmul(x, y)`. You should prefer the former as it makes it easier to read formulas.

In [None]:
np.sum(laplacian @ eigenvectors[:, 0])

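**Your answer here.** We expect zero: the smallest eigenvalue of the Laplacian is $\lambda_1 = 0$, so $L u_1 = \lambda_1 u_1 = 0$ and the entries of that vector sum to zero. The small deviation is due to finite numerical precision.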

## 5 Clustering

Compare different techniques: PCA, the Fiedler vector, and spectral clustering.

Visualization with Laplacian eigenmaps

Principal component analysis (PCA), no graph.
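
As a reminder of what PCA computes, the sketch below projects synthetic data onto its first two principal directions with a plain SVD. It is an illustration on made-up data (`toy_x`). Up to the sign of each axis, the result should correspond to what `decomposition.PCA` returns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 5 features with decreasing spread.
toy_x = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

centered = toy_x - toy_x.mean(axis=0)
# The rows of vt are the principal directions, sorted by decreasing singular value.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
toy_pca = centered @ vt[:2].T  # coordinates along the first two directions

explained = s ** 2 / np.sum(s ** 2)  # fraction of variance per direction
print(toy_pca.shape, explained[:2].round(3))
```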

In [None]:
from sklearn import preprocessing, decomposition

features_pca = decomposition.PCA(n_components=2).fit_transform(features)

genres = preprocessing.LabelEncoder().fit_transform(tracks['genre'])

plt.scatter(features_pca[:,0], features_pca[:,1], c=genres, cmap='RdBu', alpha=0.5);

Note how well this plot summarizes 2 GB of data and 2000 tracks.

In [None]:
plt.scatter(eigenvectors[:, 1], eigenvectors[:, 2], c=genres, cmap='RdBu', alpha=0.5);

Cluster the tracks with the Fiedler vector.

In [None]:
labels = (eigenvectors[:, 1] > 0)

plt.scatter(eigenvectors[:, 1], eigenvectors[:, 2], c=labels, cmap='RdBu', alpha=0.5);
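
The sign of the Fiedler vector can be seen at work on a graph where the answer is obvious. The sketch below uses a made-up graph of two triangles joined by one weak edge, with the combinatorial Laplacian and a dense solver for simplicity. The names `toy_w`, `fiedler` and `toy_labels` are hypothetical.

```python
import numpy as np

# Two triangles {0, 1, 2} and {3, 4, 5} with unit weights.
toy_w = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    toy_w[i, j] = toy_w[j, i] = 1.0
# A single weak edge bridges the two groups.
toy_w[2, 3] = toy_w[3, 2] = 0.1

toy_l = np.diag(toy_w.sum(axis=0)) - toy_w
values, vectors = np.linalg.eigh(toy_l)  # eigenvalues in ascending order

fiedler = vectors[:, 1]   # eigenvector of the second smallest eigenvalue
toy_labels = fiedler > 0  # cut the graph at zero

print(toy_labels)
```

The overall sign of an eigenvector is arbitrary, so which triangle is labeled `True` may vary. Only the partition is meaningful.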

How many tracks were wrongly identified?

In [None]:
err = np.sum(np.abs(labels - genres))
err = err if err < len(labels)/2 else len(labels) - err
print('{} errors ({}%)'.format(err, err/len(labels)*100))
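
Since the cluster labels are only defined up to a swap, the error count has to take the better of the two possible assignments, which is what the second line of the cell above does. A small sketch with made-up labels, where `clustering_errors` is a hypothetical helper:

```python
import numpy as np

def clustering_errors(predicted, truth):
    """Count the mismatches between two binary labelings, up to a label swap."""
    predicted = np.asarray(predicted, dtype=int)
    truth = np.asarray(truth, dtype=int)
    mismatches = int(np.sum(predicted != truth))
    # Swapping the two cluster labels turns matches into mismatches.
    return min(mismatches, len(truth) - mismatches)

toy_truth = [0, 0, 0, 1, 1, 1]
toy_predicted = [1, 1, 1, 0, 0, 1]  # correct up to a swap, except the last track
print(clustering_errors(toy_predicted, toy_truth))
```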

Tune some parameters (e.g. `kernel_width`, `NEIGHBORS`) to get fewer errors. You should get an error rate lower than 15% (i.e. fewer than 300 errors in total). Try to understand the effect of each parameter.

## 6 Conclusion

Among other things, this assignment showed us that a graph can be useful for e.g. visualization or clustering, even when there is none in the original data. We exercised here two steps of the Data Science process: i) data collection, and ii) data exploitation. The exploitation of the data showed us that a machine can discern musical genres by looking at pairwise distances between spectral features extracted from audio recordings.

### 6.1 Bonus

What is the name of the technique we used to visualize the data in the last two plots? What does it try to preserve when reducing the dimensionality (of the ambient space) from 140 to 2?