# Olive oil data

This notebook reproduces the clustering results on the olive oil data from the paper [Stable and consistent density-based clustering](https://arxiv.org/abs/2005.09048).

In [1]:
import persistable
import json
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.metrics import confusion_matrix
from sklearn.metrics import adjusted_rand_score

## Load data

See the paper for references on the data set.

In [2]:
# olive_oil_scaled has been scaled by sklearn.preprocessing.StandardScaler
# Each feature is independently centered to have mean zero and scaled to unit variance
from olive_oil_data import olive_oil_scaled
from olive_oil_data import olive_oil_regions
from olive_oil_data import olive_oil_areas
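
The comments above state that the features were standardized with `sklearn.preprocessing.StandardScaler`. As a minimal sketch of what that preprocessing does, here it is applied to a small made-up array (`raw` is hypothetical and is not the olive oil data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical raw measurements: 4 samples, 2 features on very different scales
raw = np.array([[1.0, 200.0],
                [2.0, 180.0],
                [3.0, 240.0],
                [4.0, 220.0]])

# center each column to mean zero and scale it to unit variance
scaled = StandardScaler().fit_transform(raw)

column_means = scaled.mean(axis=0)
column_stds = scaled.std(axis=0)  # population standard deviation (ddof=0)
print('column means:', column_means)
print('column standard deviations:', column_stds)
```

After scaling, no single feature dominates the distances between points only because of its units.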

## Reproduce the instance of the Persistable interactive mode from the paper

### Basic usage of the Persistable interactive mode

- Run the cell below to open the graphical user interface.
- To see the Component Counting Function, click "Compute".
- To see the Prominence Vineyard, in the box "Interactive inputs selection", choose "Family of lines". The two lines that determine the Prominence Vineyard are now displayed. Next, click "Compute" under "Prominence Vineyard".
- To get a clustering, in the box "Parameter selection", choose "On".
- To re-create the clustering in the paper, select Line number 15 and Gap number 3 or 8. Then click "Choose parameter".
- To get the labels for this clustering, run the cell below the graphical user interface.

In [3]:
# to reproduce the instance of the Persistable interactive mode 
# from the paper, we load a state dictionary and pass it to start_ui

with open('olive-oil-data-state.json', 'r') as fp:
    state = json.load(fp)

# create Persistable object
p = persistable.Persistable(olive_oil_scaled, n_neighbors='all')

# start UI
pi = persistable.PersistableInteractive(p)
port = pi.start_ui(ui_state=state, jupyter_mode='inline')

In [7]:
# get clustering with parameters chosen via the interactive mode
cluster_labels = pi.cluster()
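
The evaluation cells in this notebook treat points with a label `>= 0` as clustered. A quick way to inspect such a label array is to count the points per label. The sketch below uses a hypothetical array `example_labels` in place of the real output of `pi.cluster()`:

```python
import numpy as np

# hypothetical label array standing in for the output of pi.cluster();
# negative labels mark points that were not assigned to any cluster
example_labels = np.array([0, 0, 1, -1, 2, 1, -1, 0])

# distinct labels and the number of points carrying each one
values, counts = np.unique(example_labels, return_counts=True)

n_clusters = int(np.sum(values >= 0))
n_unclustered = int(np.sum(example_labels < 0))

for value, count in zip(values, counts):
    print('label', value, 'has', count, 'points')
print('number of clusters:', n_clusters)
print('number of unclustered points:', n_unclustered)
```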

## Comparing with region labels

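Choose Gap number 3 in the instance of the interactive mode launched above and click "Choose parameter". Then re-run the cell that calls `pi.cluster()`, so that `cluster_labels` reflects this choice.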

In [6]:
# select labels to compare with
true_labels = np.asarray(olive_oil_regions)

# print confusion matrix
print('Confusion matrix:')
print(confusion_matrix(true_labels, cluster_labels))

# print adjusted rand index, and percentage of data points clustered
clustered_points = (cluster_labels >= 0)
ari = adjusted_rand_score(true_labels[clustered_points], cluster_labels[clustered_points])
pct_clustered = (np.sum(clustered_points) / true_labels.shape[0])
print('adjusted rand index: ' + str(ari))
print('percentage of data points clustered: ' + str(pct_clustered))

Confusion matrix:
[[  0   0   0   0]
 [ 30   0   0 293]
 [  1  97   0   0]
 [ 33   0 118   0]]
adjusted rand index: 1.0
percentage of data points clustered: 0.8881118881118881
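
To help read the output above: when no `labels` argument is given, `confusion_matrix` builds a square matrix over the sorted union of the labels found in both arrays, with rows for true labels and columns for cluster labels. A negative cluster label therefore gets its own row and column, and that row is all zeros when no true label takes the same value. The sketch below illustrates this on made-up arrays (`toy_true` and `toy_clusters` are hypothetical, not the olive oil data):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# hypothetical ground truth and cluster labels; -1 marks unclustered points
toy_true = np.array([0, 0, 0, 1, 1, 1])
toy_clusters = np.array([1, 1, -1, 0, 0, -1])

# rows and columns are both indexed by the sorted union of labels: -1, 0, 1
cm = confusion_matrix(toy_true, toy_clusters)
print(cm)

# the adjusted Rand index is computed on the clustered points only,
# and it does not depend on how the clusters are numbered
mask = toy_clusters >= 0
toy_ari = adjusted_rand_score(toy_true[mask], toy_clusters[mask])
print('adjusted rand index:', toy_ari)
```

Here the first column counts the unclustered points of each true class, and the first row is empty because no true label equals -1.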


## Comparing with area labels

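Choose Gap number 8 in the instance of the interactive mode launched above and click "Choose parameter". Then re-run the cell that calls `pi.cluster()`, so that `cluster_labels` reflects this choice.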

In [8]:
# select labels to compare with
true_labels = np.asarray(olive_oil_areas)

# print confusion matrix
print('Confusion matrix:')
print(confusion_matrix(true_labels, cluster_labels))

# print adjusted rand index, and percentage of data points clustered
clustered_points = (cluster_labels >= 0)
ari = adjusted_rand_score(true_labels[clustered_points], cluster_labels[clustered_points])
pct_clustered = (np.sum(clustered_points) / true_labels.shape[0])
print('adjusted rand index: ' + str(ari))
print('percentage of data points clustered: ' + str(pct_clustered))

Confusion matrix:
[[  0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0]
 [ 13   0   0   0   0   0   0   0  12   0   0]
 [ 48   0   0   7   1   0   0   0   0   0   0]
 [106   0   0   0 100   0   0   0   0   0   0]
 [ 33   0   0   0   0   0   0   0   3   0   0]
 [ 14   0  51   0   0   0   0   0   0   0   0]
 [ 14  19   0   0   0   0   0   0   0   0   0]
 [ 35   0   0   0   0  14   0   1   0   0   0]
 [ 21   0   0   0   0   0   0  29   0   0   0]
 [  9   0   0   0   0   0  42   0   0   0   0]]
adjusted rand index: 0.98526786530503
percentage of data points clustered: 0.48776223776223776
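
The all-zero rows and columns in the matrix above come from `confusion_matrix` using one shared label set for both axes. As a possible alternative (not used in the original notebook), `sklearn.metrics.cluster.contingency_matrix` keeps one row per true label and one column per cluster label. A minimal sketch on hypothetical arrays:

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# hypothetical ground truth (labels 1..3) and cluster labels (-1 = unclustered)
toy_true = np.array([1, 1, 2, 2, 2, 3])
toy_clusters = np.array([-1, 0, 0, 1, 1, -1])

# rows follow the sorted true labels, columns the sorted cluster labels
table = contingency_matrix(toy_true, toy_clusters)
row_labels = np.unique(toy_true)
col_labels = np.unique(toy_clusters)

print('true labels (rows):', row_labels)
print('cluster labels (columns):', col_labels)
print(table)
```

The resulting table is rectangular, so the number of true classes and the number of clusters can differ without padding.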


### Using exhaustive persistence-based flattening to cluster more data points

Again, choose Gap number 8 in the instance of the interactive mode launched above.

In [9]:
cluster_labels = pi.cluster(flattening_mode='exhaustive')

In [10]:
# select labels to compare with
true_labels = np.asarray(olive_oil_areas)

# print confusion matrix
print('Confusion matrix:')
print(confusion_matrix(true_labels, cluster_labels))

# print adjusted rand index, and percentage of data points clustered
clustered_points = (cluster_labels >= 0)
ari = adjusted_rand_score(true_labels[clustered_points], cluster_labels[clustered_points])
pct_clustered = (np.sum(clustered_points) / true_labels.shape[0])
print('adjusted rand index: ' + str(ari))
print('percentage of data points clustered: ' + str(pct_clustered))

Confusion matrix:
[[  0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0]
 [  2   0   0   0   0   0   2   0  21   0   0]
 [  2   0   0   0   3   0  50   1   0   0   0]
 [  9   0   0   0 196   0   1   0   0   0   0]
 [  3   0   0   0   7   0  20   0   6   0   0]
 [  0   0   0  65   0   0   0   0   0   0   0]
 [  0   0  31   2   0   0   0   0   0   0   0]
 [  7   0   0   0   0   2   0  41   0   0   0]
 [  3   0   0   0   0  47   0   0   0   0   0]
 [  1  48   0   0   0   0   0   2   0   0   0]]
adjusted rand index: 0.8995112949478538
percentage of data points clustered: 0.9527972027972028
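
The same evaluation steps appear in several cells of this notebook. One way to avoid the repetition is a small helper such as the sketch below; `summarize_clustering` is a hypothetical name and is not part of the `persistable` package. Note that the second value is a fraction between 0 and 1, as in the outputs above, and not a percentage.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score


def summarize_clustering(true_labels, cluster_labels):
    """Return (adjusted Rand index on clustered points, fraction of points clustered)."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    # points with a negative label are treated as unclustered
    clustered = cluster_labels >= 0
    ari = adjusted_rand_score(true_labels[clustered], cluster_labels[clustered])
    fraction_clustered = float(np.mean(clustered))
    return ari, fraction_clustered


# usage on hypothetical labels: four of six points are clustered
example_ari, example_fraction = summarize_clustering(
    [1, 1, 2, 2, 2, 3], [0, 0, 1, 1, -1, -1]
)
print('adjusted rand index:', example_ari)
print('fraction of data points clustered:', example_fraction)
```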
