# Using `thisnotthat` for examining clustered data and editing labels

Consider a dataset composed of a sequence of _records_, each consisting of a loose set of named _fields_. A good way to audit and understand such a dataset is to work out the correlations between records: which are similar, which are dissimilar, and why. While somebody with a monk's patience and tireless eyeballs can tease such correlations out of a spreadsheet view of the data, there is a more interesting approach:

1. Embed the records in a vector space.
1. Reduce the dimension of the record vectors to 2 and visualize them.
    - Bonus: use hover-like tooling to keep a description of the records handy during visualization.
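
As a rough illustration of these two steps, here is a minimal sketch using only NumPy. The toy records, the `embed` and `reduce_to_2d` helpers, and the one-hot embedding are all assumptions made for the example; any embedding that suits your data can take their place, and PCA merely stands in for whichever dimension reduction you prefer.

```python
import numpy as np

# Hypothetical toy records: loose sets of named fields.
records = [
    {"kind": "login", "user": "alice", "host": "web1"},
    {"kind": "login", "user": "bob", "host": "web1"},
    {"kind": "error", "code": "500", "host": "db1"},
    {"kind": "error", "code": "503", "host": "db1"},
]


def embed(records):
    """One-hot encode every field=value pair (an assumed, very naive embedding)."""
    vocabulary = sorted({f"{k}={v}" for r in records for k, v in r.items()})
    index = {token: j for j, token in enumerate(vocabulary)}
    vectors = np.zeros((len(records), len(vocabulary)))
    for i, r in enumerate(records):
        for k, v in r.items():
            vectors[i, index[f"{k}={v}"]] = 1.0
    return vectors


def reduce_to_2d(vectors):
    """Project onto the top two principal components (PCA via SVD)."""
    centered = vectors - vectors.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]


vectors = embed(records)
points = reduce_to_2d(vectors)  # shape (4, 2), ready for a scatter plot
hover = [str(r) for r in records]  # one description per point, for tooltips
```

In this toy case the first coordinate of `points` separates the two `login` records from the two `error` records, which is the kind of structure one hopes to see in the plot.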

Some of the natural record correlations will take the form of _clusters_ in the plot display, which can be visually appreciated. A small upgrade to this process involves exploring the similarities quantitatively, using a clustering algorithm. The exploration process then goes:

1. Embed the records in a vector space.
1. _Maybe_ reduce the dimension a little, just enough to compress the data a bit?
1. Run the clustering algorithm to discover the groups of similar records, according to a chosen _distance function_. These group identifiers become the groups' respective _labels_.
1. Reduce the dimension of the record vectors to 2 and visualize them, colored by label.
    - Bonus: use hover-like tooling to keep a description of the records handy during visualization.
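
The clustered variant of the process can be sketched as follows. This is an illustration under assumptions, not a prescribed pipeline: the synthetic vectors stand in for embedded records, `scipy.cluster.vq.kmeans2` (k-means, hence the Euclidean distance) stands in for whichever clustering algorithm and distance function you choose, and PCA stands in for the final reduction to 2 dimensions.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)

# Stand-in for embedded records: three blobs in a 10-dimensional space.
centers = rng.normal(scale=10.0, size=(3, 10))
vectors = np.vstack([c + rng.normal(size=(50, 10)) for c in centers])

# Cluster in the embedding space; the cluster identifiers become the labels.
_, cluster_ids = kmeans2(vectors, 3, minit="++")
labels = [f"cluster {i}" for i in cluster_ids]

# Reduce to 2 dimensions for display only (PCA via SVD).
centered = vectors - vectors.mean(axis=0)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
points = u[:, :2] * s[:2]
```

The resulting `points` array and `labels` list have the same shape as the `data` and `labels` arguments passed to `tnt.Labeler` later in this document.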

In both cases, plotting the dataset grants insights into the correlations and the reasons behind them. These insights can lead the scientist to confirm or break assumptions and prior knowledge about the phenomena the data is supposed to describe. A given cluster can correspond to one of these phenomena. Two or more clusters may describe the same phenomenon, even if the algorithm has marked them as distinct. Alternatively, a correlation between some records might be spurious with respect to a phenomenon of interest, so one may want to split a cluster into two or more. All these cases involve _editing labels_, a process by which one reconciles the information present in the data with the prior knowledge one possesses.

`thisnotthat` provides a data viewer and label editor as an ipywidget.

In [1]:
import thisnotthat as tnt

### Brief digression: getting "clustered" data to plot

Let's simulate a few Gaussian clusters centered respectively at regular intervals on the circumference of a circle.

Number of points to split among clusters.

In [2]:
NUM_POINTS = 1000
assert NUM_POINTS > 0

Cluster "distribution": this is a list with one number for each cluster to simulate. Each number is the "weight" of its cluster: for instance, a cluster of weight 3 receives approximately three times as many points as a cluster of weight 1. Under the hood, this list is $\ell_1$-normalized to yield the probabilities of a multinomial distribution.
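
To make the normalization concrete, here is a small worked example. It uses NumPy's `Generator.multinomial` rather than the `scipy.stats` call used further down, purely for brevity; the weights and the point count are the ones chosen in this notebook. With weights 10, 10, 20 and 30, the probabilities are 1/7, 1/7, 2/7 and 3/7, so 1000 points split into roughly 143, 143, 286 and 429 on average.

```python
import numpy as np

weights = np.array([10, 10, 20, 30])

# For positive weights, l1 normalization is just division by the sum.
probabilities = weights / weights.sum()
expected_sizes = 1000 * probabilities

# One random draw of cluster sizes; the sizes always add up to 1000.
rng = np.random.default_rng(0)
sizes = rng.multinomial(1000, probabilities)
```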

In [3]:
DISTRIBUTION_CLUSTERS = [10, 10, 20, 30]
assert len(DISTRIBUTION_CLUSTERS) > 0
assert all(p > 0.0 for p in DISTRIBUTION_CLUSTERS)

The cluster centers lie on a circle centered at the origin. This parameter sets the radius of that circle.

In [4]:
RADIUS = 5
assert RADIUS > 0.0

This _scale_ parameter determines how wide the Gaussian clusters will tend to be. The value is mostly relevant in comparison to the circle's radius: a high radius/scale ratio will more likely yield well-separated clusters; a low radius/scale ratio will yield confusion and chaos.

In [5]:
SCALE_CLUSTERS = 3
assert SCALE_CLUSTERS > 0.0
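
One rough way to reason about the radius/scale ratio is to compare the distance between neighboring cluster centers with the scale. The helper below is a hypothetical aid, not part of `thisnotthat` or of the simulation; it relies only on the fact that centers evenly spaced on a circle are separated by a chord of length $2R\sin(\pi/k)$, where $R$ is the radius and $k$ the number of clusters.

```python
import math


def neighbor_distance(radius, num_clusters):
    """Distance between adjacent centers evenly spaced on a circle (chord length)."""
    return 2 * radius * math.sin(math.pi / num_clusters)


d = neighbor_distance(5, 4)  # about 7.07 for RADIUS = 5 and four clusters
ratio = d / 3  # about 2.36 when compared with SCALE_CLUSTERS = 3
```

Adding clusters at a fixed radius shrinks this distance, so the radius has to grow with the number of clusters to keep them apart.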

In [6]:
import random
import numpy as np
import numpy.linalg as npl
from scipy.spatial.transform import Rotation
import scipy.stats as st

Put all the ingredients in a pot, bring to a boil, simmer for 30 ms.

In [7]:
num_clusters = len(DISTRIBUTION_CLUSTERS)
sizes = st.multinomial(
    n=NUM_POINTS,
    p=np.array(DISTRIBUTION_CLUSTERS) / npl.norm(DISTRIBUTION_CLUSTERS, ord=1)
).rvs().squeeze()

seeds = 2 * np.pi * np.arange(num_clusters) / num_clusters + st.uniform(scale=np.pi).rvs()
means = RADIUS * np.hstack([np.cos(seeds).reshape((-1, 1)), np.sin(seeds).reshape((-1, 1))])
stds = st.expon(scale=SCALE_CLUSTERS * np.outer(np.ones((num_clusters,)), np.ones((2,)))).rvs() / 3
angles = st.uniform(scale=np.pi * np.ones((num_clusters,))).rvs()

cluster = []
ez = np.array([0, 0, 1])
for i in range(num_clusters):
    S = np.diag(stds[i, :] ** 2)
    R = Rotation.from_rotvec(angles[i] * ez).as_matrix()[:2, :2]
    cluster.append(
        st.multivariate_normal(mean=means[i, :], cov=R @ S @ R.transpose()).rvs((sizes[i],))
    )
data = np.vstack(cluster)
assert data.shape == (NUM_POINTS, 2)
hover = list(range(NUM_POINTS))
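
The covariance built in the loop above, `R @ S @ R.transpose()`, describes an axis-aligned Gaussian rotated by an angle. The sketch below checks this reading on arbitrary illustrative values (the angle and standard deviations are assumptions, not values from the simulation): the x-y block of a rotation about the z axis is the familiar 2D rotation matrix, and the eigenvalues of the resulting covariance are the squared standard deviations.

```python
import numpy as np
from scipy.spatial.transform import Rotation

theta = 0.7  # an arbitrary angle, chosen for illustration
stds = np.array([2.0, 0.5])  # arbitrary standard deviations along the two axes

# Same construction as in the cell above: rotate about z, keep the x-y block.
R = Rotation.from_rotvec(theta * np.array([0, 0, 1])).as_matrix()[:2, :2]

# The plain 2D rotation matrix for the same angle.
R_plain = np.array([
    [np.cos(theta), -np.sin(theta)],
    [np.sin(theta), np.cos(theta)],
])

cov = R @ np.diag(stds ** 2) @ R.T
eigenvalues = np.linalg.eigvalsh(cov)  # returned in ascending order
```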

We determine a random word as label for each cluster, using the word file included in this repository.

In [8]:
with open("words.txt", "r", encoding="utf-8") as file:
    words = [line.strip() for line in file]

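cluster_words = random.sample(words, num_clusters)  # distinct words, so no two clusters share a label
labels = [word for word, s in zip(cluster_words, sizes) for _ in range(s)]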
assert len(labels) == NUM_POINTS

## Here's the labeler!

In [9]:
lb = tnt.Labeler(data, labels, hover=hover)
lb

Labeler(children=(Figure(axes=[Axis(scale=LinearScale()), Axis(orientation='vertical', scale=LinearScale())], …

It starts with the **Pick** tool selected. This is useful to roam around the plot and get information about each point (as a tooltip shown when hovering over the point). To look at a certain part of the plot more closely, use the **Pan/Zoom** tool: point somewhere and roll the mouse wheel to zoom in and out; drag the mouse to move the center of the plot. Clicking the **Reset** button will restore the original perspective.

In the upper right corner of the display, you have an editable legend of the plot: each legend element corresponds to the label associated with the points of the corresponding color. You can click on the colors to change them if they don't contrast to your taste. You can also edit the label names so they correspond to your understanding of the correlations.

The fun things occur when you _select_ some of the points. There are three ways to do so:
    
1. With the **Pick** tool, click on points that interest you (click again to deselect).
1. With the **Lasso** tool, draw a line around the points that interest you.
1. Click on the ☑ button by the name of a cluster to select all the points of that cluster (click again to deselect).

When you have points selected, all other points are faded out. You can combine selection actions in almost any order; the one exception is that lassoing points after picking resets the selection (likely a bug in the BQPlot library). You can also clear the selection by clicking the **Reset** button (click a second time to restore the perspective).

With points selected, two new actions become possible. The first consists in creating a new cluster composed of the points that are selected: do this by clicking the **Split** button. You will see the legend modified to describe your new cluster, newly drawn in its own color.

The second action consists in merging the selected points into an existing cluster. You do this by choosing the label of this cluster in the **Merge to** drop-down menu. Note that if _all_ the points of a cluster are selected and merged elsewhere, that cluster is left with no points, so it is removed from the display.
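
In terms of the label list, the two actions amount to the following. This is a conceptual sketch only, not the widget's implementation: the `split` and `merge` functions and the single-letter labels are hypothetical, and selections are represented as sets of point indices.

```python
def split(labels, selected, new_label):
    """Give the selected points a brand-new label."""
    if new_label in labels:
        raise ValueError(f"label {new_label!r} already exists")
    return [new_label if i in selected else label for i, label in enumerate(labels)]


def merge(labels, selected, target_label):
    """Move the selected points into an existing label."""
    if target_label not in labels:
        raise ValueError(f"label {target_label!r} does not exist")
    return [target_label if i in selected else label for i, label in enumerate(labels)]


labels = ["a", "a", "a", "b", "b"]
labels = split(labels, {0, 1}, "c")  # ['c', 'c', 'a', 'b', 'b']
labels = merge(labels, {3, 4}, "a")  # ['c', 'c', 'a', 'a', 'a']

# A label that no longer owns any point drops out of the legend.
legend = sorted(set(labels))  # ['a', 'c']
```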

The goal of all this is effectively to label the points to your heart's content. At any point during the process, you can look up the labels associated with your various data points (rows of the `data` matrix), as a list:

In [10]:
lb.labels_named[:10]

['palstave',
 'palstave',
 'palstave',
 'palstave',
 'palstave',
 'palstave',
 'palstave',
 'palstave',
 'palstave',
 'palstave']
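
Since the labels come back as a plain list with one entry per data point, ordinary Python tools apply to them. The sketch below works on a stand-in list rather than a live widget, and the word `gnomon` is made up for the example; with a labeler at hand, `lb.labels_named` would take the place of `labels_named`.

```python
from collections import Counter

# Stand-in for `lb.labels_named`: one label per data point.
labels_named = ["palstave"] * 3 + ["gnomon"] * 2

# Number of points per label, largest cluster first.
counts = Counter(labels_named)
for label, count in counts.most_common():
    print(f"{label}: {count}")

# Row indices of the points carrying a given label.
rows = [i for i, label in enumerate(labels_named) if label == "gnomon"]
```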