#### COMSAR Tutorial 

# Train a Timbre Track SOM

This notebook assumes that you have successfully created a feature matrix from several ESRA Timbre Track files.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from apollon.som.som import IncrementalMap
import apollon.som.utilities as asu

## 1. Prepare your data set

In [None]:
sfm = pd.read_csv('../data/tt_example.sfm', index_col=0)
sfm

Different features tend to have different scales. In this example, the
*centroid* is in the $10^3$ range, whereas *roughness* lies somewhere around
$10^{-2}$. Hence, the scales of these features differ by about five orders
of magnitude. From a geometrical viewpoint, this means that the data space
the SOM has to learn is tremendously larger along the *centroid*
dimension than along the *roughness* dimension.
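
To make the effect of differing scales concrete, the following sketch computes the Euclidean distance between two hypothetical items. The values are made up to match the ranges named above and are not taken from the example data set.

```python
import numpy as np

# Two hypothetical items with made-up values in the ranges named above:
# column 0 mimics the centroid (~10^3), column 1 mimics roughness (~10^-2).
items = np.array([[1500.0, 0.010],
                  [1510.0, 0.090]])

diff = items[0] - items[1]
distance = np.linalg.norm(diff)

# Share of the squared Euclidean distance contributed by each feature.
share = diff**2 / np.sum(diff**2)

print(distance)
print(share)
```

Although roughness changes ninefold between the two items, almost the entire distance stems from the small relative change in the centroid.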

To ensure better and faster convergence of the learning algorithm, it is generally
good advice to normalize each feature to zero mean and unit standard deviation,
that is, to calculate [z-scores](https://en.wikipedia.org/wiki/Standard_score).
Note, however, that this is not mandatory for the SOM, as it is for other algorithms
such as PCA.

In [None]:
data = sfm.to_numpy()
scaled_data = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_data
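
As a sanity check, each scaled column should have a mean of about zero and a standard deviation of about one. The sketch below wraps the scaling in a hypothetical helper, `zscore`, and uses made-up numbers. Leaving constant columns merely centred is an assumption added here to avoid a division by zero; it is not part of the cell above.

```python
import numpy as np

def zscore(data):
    """Scale each column to zero mean and unit standard deviation."""
    data = np.asarray(data, dtype=float)
    std = data.std(axis=0)
    # Assumption: constant columns are only centred, not divided by zero.
    safe_std = np.where(std == 0, 1.0, std)
    return (data - data.mean(axis=0)) / safe_std

# Made-up feature matrix: 4 items, 3 features, the last one constant.
example = np.array([[1500.0, 0.01, 5.0],
                    [1800.0, 0.03, 5.0],
                    [2100.0, 0.02, 5.0],
                    [1200.0, 0.04, 5.0]])

scaled = zscore(example)
print(scaled.mean(axis=0))   # approximately zero in every column
print(scaled.std(axis=0))    # approximately one, except for the constant column
```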

## 2. Init the SOM

The SOM takes several model parameters. Check out 

T. Kohonen, "The Basic SOM," in *Self-Organizing Maps*, Springer, 1995.

and especially section 3.9 "Practical Advice for the Construction of Good Maps" for further information.

In [None]:
dims = (10, 10, data.shape[1])    # units on vertical axis, units on horizontal axis, number of features
n_iter = 100                      # number of iterations
eta = .01                         # initial learning rate
nhr = 7                           # initial radius of the map neighbourhood

s = IncrementalMap(dims, n_iter, eta, nhr)

## 3. Train the SOM

Training the SOM is fairly simple. Note, however, that this is only a toy
example for illustration. In general, it does not make sense to
train a SOM with 100 units on a data set with only 4 items.
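
For orientation, a single update step of an incremental SOM can be sketched in plain NumPy. This is a generic illustration of the textbook update rule, not apollon's implementation: the Gaussian neighbourhood, the random initialisation, and all names below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows, n_cols, n_feat = 10, 10, 4
# One weight vector per unit, stored row by row.
weights = rng.normal(size=(n_rows * n_cols, n_feat))
# Grid coordinates (row, column) of every unit, in the same order as `weights`.
grid = np.indices((n_rows, n_cols)).reshape(2, -1).T

def update(weights, sample, eta, radius):
    """Move the best-matching unit and its map neighbours towards `sample`."""
    # Best-matching unit: the unit whose weight vector is closest to the sample.
    bmu = np.argmin(np.linalg.norm(weights - sample, axis=1))
    # Gaussian neighbourhood centred on the BMU, measured on the map grid.
    grid_dist = np.linalg.norm(grid - grid[bmu], axis=1)
    influence = np.exp(-grid_dist**2 / (2 * radius**2))
    new_weights = weights + eta * influence[:, None] * (sample - weights)
    return new_weights, bmu

sample = rng.normal(size=n_feat)
new_weights, bmu = update(weights, sample, eta=0.01, radius=7)
```

In a full training run, a step like this would be repeated for every item and iteration, typically while the learning rate and the radius shrink over time.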

In [None]:
%time s.fit(scaled_data)

## 4. Examine the SOM

### 4.1 Quantization Error. How well does the SOM represent the data?

In [None]:
fig, ax = plt.subplots(1)
ax.set_xlabel('Number of iterations')
ax.set_ylabel('Mean error')
ax.plot(s.quantization_error);
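
One common definition of the quantization error is the mean distance between each data item and its best-matching unit. The sketch below assumes Euclidean distance and uses made-up weights and data. Whether `s.quantization_error` is computed in exactly this way is not confirmed here.

```python
import numpy as np

def quantization_error(weights, data):
    """Mean Euclidean distance between each item and its closest unit."""
    # Pairwise distances with shape (n_items, n_units).
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return dists.min(axis=1).mean()

# Made-up toy map with three units and three data items.
weights = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
data = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 0.5]])

qe = quantization_error(weights, data)
print(qe)
```

Here the three items lie at distances 0, 1, and 0.5 from their closest units, so the mean error is 0.5.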

### 4.2 u-Matrix. Are there possibly clusters in the data?

The u-Matrix represents the mean distance between a given unit and its
direct neighbours on the map. High values indicate heterogeneous
neighbourhoods and thus represent cluster borders. "Basins" of low values,
in turn, indicate clusters.
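
The idea can be sketched without any SOM library. The function below is a hypothetical, simplified stand-in that assumes Euclidean distance and a 4-connected neighbourhood; `asu.umatrix` may define neighbourhoods and distances differently.

```python
import numpy as np

def umatrix_sketch(weights, shape):
    """Mean Euclidean distance between each unit and its direct grid neighbours."""
    n_rows, n_cols = shape
    w = weights.reshape(n_rows, n_cols, -1)
    um = np.zeros(shape)
    for r in range(n_rows):
        for c in range(n_cols):
            dists = []
            # Assumption: 4-connected neighbourhood (up, down, left, right).
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    dists.append(np.linalg.norm(w[r, c] - w[rr, cc]))
            um[r, c] = np.mean(dists)
    return um

# Toy 3x4 map with one feature: the left half sits at 0, the right half at 5.
toy = np.zeros((3, 4, 1))
toy[:, 2:] = 5.0
um_toy = umatrix_sketch(toy.reshape(-1, 1), (3, 4))
print(um_toy)
```

The two outer columns form basins with a value of zero, while the two middle columns, where the halves meet, show raised values that mark the border.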

In [None]:
um = asu.umatrix(s.weights, s.shape, s.metric)

In [None]:
fig, ax = plt.subplots(1, figsize=(10, 10))
ax.set_xlabel('units')
ax.set_ylabel('units')
umx = ax.imshow(um, cmap='terrain', origin='lower', vmin=0, vmax=1);
fig.colorbar(umx, ax=ax)

### 4.3 Project data on the map

In [None]:
pos, err = asu.best_match(s.weights, scaled_data, s.metric)
py, px = np.unravel_index(pos, s.shape)
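
The call to `np.unravel_index` suggests that `pos` holds flat unit indices, which have to be converted to row and column positions before plotting. A small stand-alone example with made-up indices shows the conversion for a 10 by 10 map:

```python
import numpy as np

shape = (10, 10)
# Made-up flat unit indices, counted row by row.
flat_positions = np.array([0, 9, 10, 57])

rows, cols = np.unravel_index(flat_positions, shape)
print(rows)  # [0 0 1 5]
print(cols)  # [0 9 0 7]
```

Rows correspond to the vertical axis and columns to the horizontal axis, which is why the result above is unpacked as `py, px`.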

In [None]:
fig, ax = plt.subplots(1, figsize=(10, 10))
ax.set_xlabel('units')
ax.set_ylabel('units')
ax.imshow(um, cmap='terrain', origin='lower', vmin=0, vmax=1);
ax.scatter(px, py, s=100, c='r', marker='x')
for item, x, y in zip(sfm.index, px, py):
    ax.text(x+.3, y, item)

### 4.4 Component planes. Where are the features represented on the map?

Yellow = high feature weight, dark blue = low feature weight

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
for i, (feat, ax) in enumerate(zip(sfm.columns, axs.flat)):
    ax.set_title(feat)
    ax.imshow(s.weights[:, i].reshape(s.shape), origin='lower')