# Correlation clustering

In this notebook we cluster waterbodies using a correlation-based distance measure. The underlying correlation coefficient is taken from Basalto+07:

$$
    c_{ij} = \frac{
        \mathbb E[y_i y_j] - \mathbb E[y_i] \mathbb E[y_j]
    }{
        \sqrt{
            (\mathbb E[y_i^2] - \mathbb E[y_i]^2)
            (\mathbb E[y_j^2] - \mathbb E[y_j]^2)
        }
    }
$$

where $y_i(t) = x_i(t) - x_i(t - 1)$ is the first difference of the time series $x_i$. Basalto+07 actually take the logarithm of their data, but we can't do that here because our series reach zero. (Theirs technically could too.)
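
As a minimal sketch of the formula above for a single pair of series, the snippet below differences two toy series and evaluates $c_{ij}$ term by term. The function name `diff_correlation` and the random-walk data are made up for illustration; they are not part of this notebook. Because the formula is the Pearson correlation of the differenced series, the result should agree with `np.corrcoef`.

```python
import numpy as np

def diff_correlation(x_i, x_j):
    """Correlation coefficient c_ij between the first differences of two series."""
    y_i = np.diff(x_i)  # y_i(t) = x_i(t) - x_i(t - 1)
    y_j = np.diff(x_j)
    covariance = np.mean(y_i * y_j) - np.mean(y_i) * np.mean(y_j)
    var_i = np.mean(y_i ** 2) - np.mean(y_i) ** 2
    var_j = np.mean(y_j ** 2) - np.mean(y_j) ** 2
    return covariance / np.sqrt(var_i * var_j)

# Toy series (made-up values, not the waterbody data).
rng = np.random.default_rng(0)
x_a = np.cumsum(rng.normal(size=200))
x_b = x_a + rng.normal(scale=0.5, size=200)

c_ab = diff_correlation(x_a, x_b)
# The same quantity via NumPy's built-in Pearson correlation.
c_ref = np.corrcoef(np.diff(x_a), np.diff(x_b))[0, 1]
print(c_ab, c_ref)
```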

## Setup

### Load modules

In [83]:
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import scipy.optimize as opt
import matplotlib.colors
from tqdm.notebook import tqdm
import sklearn.cluster
import sklearn.decomposition

%matplotlib widget

### Load the data

This was generated in WaterbodyClustering.ipynb.

In [2]:
history = np.load('history_murray_full_norivers.npy')
times = np.load('time_axis_murray_full_norivers.npy').astype('datetime64[D]')
waterbodies = gpd.read_file('waterbodies_murray_norivers.geojson')

## Get the coefficients

We now calculate $c_{ij}$ for all pairs of waterbodies. This is very slow, so we subsample: we keep every 7th sample along the time axis (to get weekly data) and every 10th waterbody.

(Smoothed or unsmoothed? Let's try unsmoothed for now; smoothing might be worth trying at some point.)
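
To make the subsampling choice concrete, here is a small sketch contrasting strided subsampling (keep every 7th sample) with averaging each block of 7 consecutive samples, which is one simple form of smoothing. The array `toy_history` is made up, and the sketch assumes, as in this notebook, that rows are waterbodies and columns are evenly spaced time steps.

```python
import numpy as np

rng = np.random.default_rng(5)
toy_history = rng.random(size=(30, 70))  # 30 made-up series, 70 time steps

# Strided subsampling: every 10th series, every 7th time step.
strided = toy_history[::10, ::7]

# Alternative: average each block of 7 consecutive time steps instead.
n_weeks = toy_history.shape[1] // 7
trimmed = toy_history[::10, :n_weeks * 7]  # drop any incomplete final block
weekly_mean = trimmed.reshape(-1, n_weeks, 7).mean(axis=2)

print(strided.shape, weekly_mean.shape)
```

Both arrays have the same shape, so either could be passed to `np.diff(..., axis=1)` in the same way.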

In [28]:
diffs = np.diff(history[::10, ::7], axis=1)

In [29]:
diffs.shape

(909, 1769)

In [31]:
total_product = np.zeros((diffs.shape[0], diffs.shape[0]))
for i in tqdm(range(diffs.shape[1])):
    total_product += diffs[:, None, i] * diffs[None, :, i]
mean_product = total_product / diffs.shape[1]

In [33]:
mean_values = diffs.mean(axis=1)
mean_sq_values = (diffs ** 2).mean(axis=1)

In [34]:
numerator = total_product - mean_values[:, None] * mean_values[None, :]

In [35]:
denominator = np.sqrt((mean_sq_values - mean_values ** 2)[:, None] * (mean_sq_values - mean_values ** 2)[None, :])

In [43]:
correlation_coefficients = numerator / denominator

In [48]:
correlation_coefficients.max()

1770.1001789183952
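
A maximum of about 1770 is suspiciously close to the number of time steps (1769). One likely explanation, offered here as an assumption rather than a confirmed diagnosis, is that the numerator was built from the summed product (`total_product`) where the formula calls for the mean (`mean_product`). The sketch below reproduces both variants on made-up data (`toy_diffs`) and also replaces the per-time-step loop with a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(1)
toy_diffs = rng.normal(size=(5, 300))  # 5 made-up series, 300 time steps
n_steps = toy_diffs.shape[1]

# Sum over time of the outer products, written as one matrix product.
total_product = toy_diffs @ toy_diffs.T
mean_product = total_product / n_steps

mean_values = toy_diffs.mean(axis=1)
variances = (toy_diffs ** 2).mean(axis=1) - mean_values ** 2
denominator = np.sqrt(variances[:, None] * variances[None, :])
mean_outer = mean_values[:, None] * mean_values[None, :]

# Numerator from the mean product, as in the formula.
coeffs_from_mean = (mean_product - mean_outer) / denominator
# Numerator from the summed product, as in the cell above.
coeffs_from_total = (total_product - mean_outer) / denominator

print(coeffs_from_mean.max())   # about 1, attained on the diagonal
print(coeffs_from_total.max())  # about n_steps or slightly more
```

With the mean product the matrix matches `np.corrcoef(toy_diffs)` and stays within $[-1, 1]$.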

## Visualise the coefficients

We visualise the coefficients with PCA: treat each waterbody's row of coefficients as a feature vector, then project onto the first two principal components.

In [91]:
pca = sklearn.decomposition.PCA(n_components=2)
pca_f = pca.fit_transform(correlation_coefficients)

In [97]:
plt.figure()
plt.scatter(pca_f[:, 0], pca_f[:, 1], s=10, c=waterbodies.RivRegNum[::10].astype(int), cmap='tab20')

## Clustering with DBSCAN

Basalto+07 used Hausdorff agglomerative clustering. We have scikit-learn and a distance matrix, so we use DBSCAN instead. Agglomerative clustering performed quite poorly on this dataset, for reasons we have not pinned down.
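
For comparison, agglomerative clustering on a precomputed distance matrix can be sketched with SciPy. This is only an illustration on made-up series: it uses average linkage, not the Hausdorff linkage of Basalto+07, and the distance $1 - c_{ij}$ is an assumption chosen for simplicity.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(4)
base_a = rng.normal(size=200)
base_b = rng.normal(size=200)
# Two made-up groups of five series, each a noisy copy of a shared base.
toy_series = np.vstack(
    [base_a + 0.3 * rng.normal(size=200) for _ in range(5)]
    + [base_b + 0.3 * rng.normal(size=200) for _ in range(5)]
)

toy_dist = 1.0 - np.corrcoef(toy_series)
toy_dist = (toy_dist + toy_dist.T) / 2  # enforce exact symmetry
np.fill_diagonal(toy_dist, 0.0)         # enforce an exactly zero diagonal

# SciPy expects the condensed (upper-triangle) form of the distance matrix.
condensed = squareform(toy_dist, checks=False)
tree = linkage(condensed, method='average')
toy_labels = fcluster(tree, t=2, criterion='maxclust')
print(toy_labels)
```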

In [159]:
clusterer = sklearn.cluster.DBSCAN(metric='precomputed', eps=0.55)

In [160]:
distances = -correlation_coefficients
distances -= distances.min()
distances /= distances.max()
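
The rescaling above depends on the smallest and largest coefficients in the matrix. An alternative that is common for correlation matrices, shown here only as a sketch and not as what this notebook does, is $d_{ij} = \sqrt{2(1 - c_{ij})}$. It assumes the coefficients lie in $[-1, 1]$; the helper name `correlation_to_distance` and the toy data are hypothetical.

```python
import numpy as np

def correlation_to_distance(corr):
    """Map correlation coefficients in [-1, 1] to distances in [0, 2]."""
    corr = np.clip(corr, -1.0, 1.0)  # guard against floating-point overshoot
    return np.sqrt(2.0 * (1.0 - corr))

rng = np.random.default_rng(2)
toy_corr = np.corrcoef(rng.normal(size=(6, 100)))  # made-up correlation matrix
toy_dist = correlation_to_distance(toy_corr)

print(toy_dist.min(), toy_dist.max())
```

Perfectly correlated pairs get distance 0 and perfectly anti-correlated pairs get distance 2, independently of the other entries in the matrix.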

In [161]:
predictions = clusterer.fit_predict(distances)

In [162]:
predictions.max()

9
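
DBSCAN labels clusters from 0 upwards and marks noise points with -1, so a maximum label of 9 means ten clusters plus however many points were left as noise. The sketch below shows how to count both on made-up points (`points` and `toy_distances` are not from this notebook); it relies on scikit-learn's default `min_samples=5`.

```python
import numpy as np
import sklearn.cluster

rng = np.random.default_rng(3)
# Two tight made-up groups of points plus one far-away outlier.
points = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.05, size=(20, 2)),
    [[100.0, 100.0]],
])
# Pairwise Euclidean distances, passed in as a precomputed matrix.
toy_distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

clusterer = sklearn.cluster.DBSCAN(metric='precomputed', eps=1.0)
toy_labels = clusterer.fit_predict(toy_distances)

n_clusters = len(set(toy_labels) - {-1})     # exclude the noise label
n_noise = int(np.sum(toy_labels == -1))
print(n_clusters, n_noise)
```

Applied to `predictions`, the same two lines would give the number of clusters and the number of waterbodies that DBSCAN left unassigned.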

In [163]:
plt.figure()
for i in range(8):
    plt.subplot(4, 2, i + 1)
    plt.plot(times, history[::10][predictions == i][:100].T, c='k', alpha=max(1 / sum(predictions == i), 0.01))
    plt.plot(times, history[::10][predictions == i].mean(axis=0), c='k')
    plt.title('Cluster {} (n = {})'.format(i, sum(predictions == i)))
plt.tight_layout()

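This worked poorly, but I think the error is in the coefficients, which don't seem correct: they don't range from -1 to 1. The likely culprit is the numerator, which uses `total_product` where the formula calls for the mean, `mean_product`.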

In [164]:
plt.figure()
plt.scatter(pca_f[:, 0], pca_f[:, 1], s=10, c=predictions, cmap='tab20')
