# Hand-crafted cuts
We have already seen that Tangles sometimes performs very well compared to SOE-kMeans. One advantage of the standard Tangles algorithm is that it does not need many cuts, as long as the few it gets are of high quality. Here we push this to the extreme: we provide only very few triplets, but hand-pick them to be of very high quality.

In [168]:
import sys
sys.path.append("..")
import numpy as np
import pandas as pd
import altair as alt
import cblearn.datasets as datasets
from data_generation import generate_gmm_data_fixed_means
from sklearn.metrics import pairwise_distances
from questionnaire import Questionnaire, unify_triplet_order 
from estimators import OrdinalTangles, SoeKmeans
from cblearn.embedding import SOE
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from plotting import AltairPlotter
from sklearn.neighbors import DistanceMetric

In [169]:
seed = 1
data = generate_gmm_data_fixed_means(n = 200, means = np.array([[-6, 3], [-6, -3], [6, 3]]), std=1.3, seed=1)

Taking a first look at the data, it seems pretty easy to cluster with the right information.

In [170]:
p = AltairPlotter()
p.assignments(data.xs, data.ys)

## Investigating the effect of density on clustering performance
### Clustering with a low number of triplets
We first cluster the data using the standard approach, but with a low number of triplets. We try both random triplets (for SOE-kMeans) and low-density draws.

Using David's formula

$$ \text{\# triplets} = O(n d \log(n)) $$

we would need about

$$ 200 \cdot 2 \cdot \log(200) \approx 2120 $$

triplets. Multiplying this by a factor of 3 to 10 is appropriate, so we would need about $6000$ to $20000$ triplets.
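The arithmetic above can be reproduced in a few lines. This is only a sketch: it assumes the natural logarithm and treats the constant hidden in the $O$-notation as 1. The variable names are illustrative and not part of the notebook.

```python
import math

n, d = 200, 2  # number of points and embedding dimension used in the estimate above

# n * d * log(n) with the natural logarithm and an implicit constant of 1
base = n * d * math.log(n)

# rule-of-thumb factor of 3 to 10 on top of the base estimate
low, high = 3 * base, 10 * base

print(f"base estimate: {base:.0f} triplets")
print(f"with factor 3 to 10: {low:.0f} to {high:.0f} triplets")
```

With these assumptions the range comes out slightly above the rounded $6000$ to $20000$ quoted above.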

For SOE-kMeans we will try both random triplet draws as well as systematic ones.
There will be some variation between the methods, which we ascribe to randomness (after all, in the very low triplet regime, it greatly matters which triplets one draws).

In [171]:
q = Questionnaire.from_metric(data.xs, density=0.001, seed=1, verbose=False)
num_triplets = q.values.size
tangles = OrdinalTangles(agreement=100, verbose=False)
ys_tangles = tangles.fit_predict(q.values, data.ys)

soe_kmeans = SoeKmeans(embedding_dimension=2, n_clusters=3)
ys_soe_kmeans = soe_kmeans.fit_predict(*q.to_bool_array())

t_random, r_random = datasets.make_random_triplets(data.xs, size=num_triplets, result_format="list-boolean")
ys_soe_kmeans_random = soe_kmeans.fit_predict(t_random, r_random)

print(f"Using {num_triplets} triplets")
print(f"Tangles score: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")
print(f"SOE-kMeans score: {normalized_mutual_info_score(ys_soe_kmeans, data.ys)}")
print(f"SOE-kMeans random triplets score: {normalized_mutual_info_score(ys_soe_kmeans_random, data.ys)}")

Using 108000 triplets
Tangles score: 0.9660168044526185 (3 clusters)
SOE-kMeans score: 0.9593738837538622
SOE-kMeans random triplets score: 0.9593738837538622


As we can see, both methods cluster the data well.

As an aside, repeating this with fewer triplets already makes Tangles greatly outperform SOE-kMeans. We tried this with multiple seeds, and most of the time Tangles is a lot better.


In [172]:
q = Questionnaire.from_metric(data.xs, density=0.00005, seed=1, verbose=False)
num_triplets = q.values.size
tangles = OrdinalTangles(agreement=100, verbose=False)
ys_tangles = tangles.fit_predict(q.values, data.ys)

soe_kmeans = SoeKmeans(embedding_dimension=2, n_clusters=3)
ys_soe_kmeans = soe_kmeans.fit_predict(*q.to_bool_array())

t_random, r_random = datasets.make_random_triplets(data.xs, size=num_triplets, result_format="list-boolean")
ys_soe_kmeans_random = soe_kmeans.fit_predict(t_random, r_random)

print(f"Using {num_triplets} triplets")
print(f"Tangles score: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")
print(f"SOE-kMeans score: {normalized_mutual_info_score(ys_soe_kmeans, data.ys)}")
print(f"SOE-kMeans random triplets score: {normalized_mutual_info_score(ys_soe_kmeans_random, data.ys)}")

Using 5400 triplets
Tangles score: 0.947501408273783 (3 clusters)
SOE-kMeans score: 0.7713571172529203
SOE-kMeans random triplets score: 0.7070149858417031


Funnily enough, once we lower the density even further, SOE-kMeans on the systematic triplets is better again, while Tangles collapses to a single cluster.

In [173]:
q = Questionnaire.from_metric(data.xs, density=0.00001, seed=1, verbose=False)
num_triplets = q.values.size
tangles = OrdinalTangles(agreement=100, verbose=False)
ys_tangles = tangles.fit_predict(q.values, data.ys)

soe_kmeans = SoeKmeans(embedding_dimension=2, n_clusters=3)
ys_soe_kmeans = soe_kmeans.fit_predict(*q.to_bool_array())

t_random, r_random = datasets.make_random_triplets(data.xs, size=num_triplets, result_format="list-boolean")
ys_soe_kmeans_random = soe_kmeans.fit_predict(t_random, r_random)

print(f"Using {num_triplets} triplets")
print(f"Tangles score: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")
print(f"SOE-kMeans score: {normalized_mutual_info_score(ys_soe_kmeans, data.ys)}")
print(f"SOE-kMeans random triplets score: {normalized_mutual_info_score(ys_soe_kmeans_random, data.ys)}")

Using 1200 triplets
Tangles score: 0.0 (1 clusters)
SOE-kMeans score: 0.6816237172899461
SOE-kMeans random triplets score: 0.005349325115180935


### Picking good cuts
We will now go a step further and see whether picking smart cuts makes Tangles competitive with SOE-kMeans again. In a perfect world, we would only need 2 cuts that cleanly separate the clusters, which amounts to $2 \times 600 = 1200$ triplets (one answer per data point and cut).

In [174]:
def get_pivot(xs, point):
    """Returns closest point in dataset to the given point."""
    # index into the xs argument, not the global data.xs
    return xs[np.argmin(np.linalg.norm(xs - point, axis=1))]


def cut_between_pivots(xs, a, b):
    """Returns a boolean cut: True where a point is closer to pivot a than to pivot b."""
    dists_to_a = np.linalg.norm(xs - get_pivot(xs, a), axis=1)
    dists_to_b = np.linalg.norm(xs - get_pivot(xs, b), axis=1)
    return dists_to_a < dists_to_b


def plot_cut(xs, c):
    p = AltairPlotter()
    return p.assignments(xs, c.astype(int)).properties(width=300, height=240)

# first cut between orange and green
orange_green_cut = cut_between_pivots(data.xs, np.array([-6, -6]), np.array([-6, 6]))

# second cut between green and violet
green_violet_cut = cut_between_pivots(data.xs, np.array([-6, 3]), np.array([6, 3]))
plot_cut(data.xs, orange_green_cut) | plot_cut(data.xs, green_violet_cut)

The cuts look pretty good. Let's see how the clustering works.

Note that, due to the pruning mechanism of Tangles, we add each cut multiple times so that the paths get long enough. This does not add any information, though.
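To make the point about repeated cuts concrete, the following sketch counts how many distinct cuts a stacked cut matrix actually contains. The toy values are hypothetical and not taken from the notebook's data.

```python
import numpy as np

# two toy cuts over six points (hypothetical values)
cut_a = np.array([True, True, True, False, False, False])
cut_b = np.array([True, True, False, False, True, True])

# stack each cut twice, mirroring how the cuts are repeated for Tangles
cuts = np.array([cut_a, cut_b, cut_a, cut_b]).T

# distinct columns are the cuts that actually carry information
n_distinct_cuts = np.unique(cuts.astype(np.uint8), axis=1).shape[1]
n_informative_triplets = n_distinct_cuts * cuts.shape[0]

print(cuts.shape)              # (6, 4)
print(n_distinct_cuts)         # 2
print(n_informative_triplets)  # 12
```

This matches the `cuts.size / 2` used when reporting the number of triplets: four columns are stored, but only two of them are distinct.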

In [175]:
cuts = np.array([orange_green_cut, green_violet_cut, orange_green_cut, green_violet_cut]).T
tangles = OrdinalTangles(agreement=100, verbose=False)
ys_tangles = tangles.fit_predict(cuts)
print(f"Using {int(cuts.size/2)} triplets.")
print(f"Tangles score: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")

Using 1200 triplets.
Tangles score: 0.9639021522110472 (3 clusters)


Next we will see what fraction of the answers in each cut we can corrupt while still getting acceptable performance.

In [176]:
num_corrupted_per = 50
orange_green_corrupted = orange_green_cut.copy()
orange_green_corrupted[np.random.choice(600, size=num_corrupted_per)] = np.random.random(size=num_corrupted_per) > 0.5
green_violet_corrupted = green_violet_cut.copy()
green_violet_corrupted[np.random.choice(600, size=num_corrupted_per)] = np.random.random(size=num_corrupted_per) > 0.5
# repeat the corrupted cuts so the Tangles paths get long enough
cuts = np.array([orange_green_corrupted, green_violet_corrupted, orange_green_corrupted,green_violet_corrupted]).T
tangles = OrdinalTangles(agreement=100, verbose=False)
ys_tangles = tangles.fit_predict(cuts)
print(f"Using {int(cuts.size/2 - num_corrupted_per * 2)} triplets.")
print(f"Tangles score: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")

Using 1100 triplets.
Tangles score: 0.7319138296041795 (3 clusters)


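It turns out that the answer is "basically none", since we use so few cuts: overwriting 50 randomly chosen answers per cut with coin flips already drops the score from about 0.96 to about 0.73.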
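One caveat about the corruption step: `np.random.choice` draws with replacement by default, and each selected answer is overwritten with a coin flip that may equal the original value. The sketch below, which uses a seeded `np.random.default_rng` instead of the notebook's legacy `np.random` calls and a random stand-in cut, counts how many answers actually change and shows one possible way to flip an exact number of distinct answers. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator; the seed is arbitrary

n_points, num_corrupted = 600, 50
cut = rng.random(n_points) > 0.5  # stand-in for a real cut

corrupted = cut.copy()
idx = rng.choice(n_points, size=num_corrupted)    # samples WITH replacement by default
corrupted[idx] = rng.random(num_corrupted) > 0.5  # coin flip, may equal the original

n_touched = np.unique(idx).size            # distinct positions overwritten
n_changed = int(np.sum(corrupted != cut))  # answers that actually differ

print(f"positions touched: {n_touched}, answers changed: {n_changed}")

# alternative: flip exactly num_corrupted distinct answers
flipped = cut.copy()
flip_idx = rng.choice(n_points, size=num_corrupted, replace=False)
flipped[flip_idx] = ~flipped[flip_idx]
n_flipped = int(np.sum(flipped != cut))
print(f"answers flipped: {n_flipped}")
```

Under the coin-flip scheme, only about half of the touched answers are expected to change, so the effective corruption is smaller than `num_corrupted_per` suggests.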

Let us now see how SOE-kMeans fares with the original, uncorrupted set of cuts.

In [177]:
def get_pivot_idx(xs, point):
    """Returns the index of the closest point in the dataset to the given point."""
    return np.argmin(np.linalg.norm(xs - point, axis=1))
orange_green_b = get_pivot_idx(data.xs, np.array([-6, -6]))
orange_green_c = get_pivot_idx(data.xs, np.array([-6, 6]))
green_violet_b = get_pivot_idx(data.xs, np.array([-6, 3]))
green_violet_c = get_pivot_idx(data.xs, np.array([6, 3]))

cuts = np.array([orange_green_cut, green_violet_cut]).T
labels = [(orange_green_b, orange_green_c), (green_violet_b, green_violet_c)]
q = Questionnaire(cuts, labels)
soe_kmeans = SoeKmeans(embedding_dimension=2, n_clusters=3, seed=1)
ys_soe_kmeans = soe_kmeans.fit_predict(*q.to_bool_array())
print(f"SOE-kMeans score: {normalized_mutual_info_score(ys_soe_kmeans, data.ys)}")

SOE-kMeans score: 0.7019390560184521


As we can see, Tangles performs a lot better here. Note that the seed used here is a favorable one for SOE-kMeans; we often get scores around 0.2 as well.
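To feed a cut to a triplet-based method, each entry of the cut can be read as one triplet answer: for point `i` and pivot pair `(b, c)`, "is `i` closer to `b` than to `c`?". The sketch below illustrates this reading. `cut_to_triplets` is a hypothetical helper, and it is only an assumption that `Questionnaire.to_bool_array` does something comparable, since its implementation is not shown here.

```python
import numpy as np


def cut_to_triplets(cut, b, c):
    """Hypothetical helper: turn one cut into (i, b, c) triplets plus boolean answers.

    responses[i] is True when point i is closer to pivot b than to pivot c.
    """
    n = cut.shape[0]
    triplets = np.column_stack([np.arange(n), np.full(n, b), np.full(n, c)])
    return triplets, cut.astype(bool)


# toy 1-D data: the pivots are the points at index 0 and index 4
xs = np.array([[0.0], [1.0], [2.0], [9.0], [10.0]])
b, c = 0, 4
cut = np.abs(xs[:, 0] - xs[b, 0]) < np.abs(xs[:, 0] - xs[c, 0])

triplets, responses = cut_to_triplets(cut, b, c)
print(triplets.shape)  # (5, 3)
print(responses)       # [ True  True  True False False]
```

The sketch keeps the degenerate rows in which `i` is itself one of the pivots; a real conversion may want to drop them.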

### Picking imperfect cuts
If we imagine applying the above procedure to, e.g., ordinal embeddings of animal pictures, the requirement seems pretty stringent. We not only need to pick two pivots that perfectly separate two groups (e.g. one very central cat and one very central dog, such that all cats are closer to the cat than to the dog and all dogs are closer to the dog than to the cat), but those pivots also have to separate the other groups cleanly (e.g. all elephants have to be closer to the cat than to the dog).

Assuming this isn't the case, can we still cluster perfectly? To investigate this, we set the answers of the cluster not involved in each cut to random.

In [178]:
orange_green_randomized = orange_green_cut.copy()
orange_green_randomized[data.ys == 2] = np.random.random(size=200) > 0.5
green_violet_randomized = green_violet_cut.copy()
green_violet_randomized[data.ys == 1] = np.random.random(size=200) > 0.5

cuts = np.array([orange_green_randomized, green_violet_randomized, orange_green_randomized, green_violet_randomized]).T
tangles = OrdinalTangles(agreement=100, verbose=False)
ys_tangles = tangles.fit_predict(cuts)
print(f"Using {int(cuts.size/2)} triplets.")
print(f"Tangles score: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")

Using 1200 triplets.
Tangles score: 0.5194110320246351 (2 clusters)


We can only distinguish two clusters now. Suppose we add another cut between violet and orange:

In [179]:
orange_violet_cut = cut_between_pivots(data.xs, np.array([-4, -1.8]), np.array([6,0]))
orange_violet_randomized = orange_violet_cut.copy()
orange_violet_randomized[data.ys == 0] = np.random.random(size=200) > 0.5

cuts = np.array([orange_green_randomized, green_violet_randomized, orange_violet_randomized, orange_green_randomized, green_violet_randomized, orange_violet_randomized]).T
tangles = OrdinalTangles(agreement=100, verbose=False)
ys_tangles = tangles.fit_predict(cuts)
print(f"Using {int(cuts.size/2)} triplets.")
print(f"Tangles score: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")

Using 1800 triplets.
Tangles score: 0.7192612425206788 (3 clusters)


The randomization is probably too systematic, since each randomized cut is simply repeated. We can instead draw fresh random answers for the repeated cuts:

In [180]:
orange_green_randomized2 = orange_green_cut.copy()
orange_green_randomized2[data.ys == 2] = np.random.random(size=200) > 0.5
green_violet_randomized2 = green_violet_cut.copy()
green_violet_randomized2[data.ys == 1] = np.random.random(size=200) > 0.5
orange_violet_randomized2 = orange_violet_cut.copy()
orange_violet_randomized2[data.ys == 0] = np.random.random(size=200) > 0.5

cuts = np.array([orange_green_randomized, green_violet_randomized, orange_violet_randomized, orange_green_randomized2, green_violet_randomized2, orange_violet_randomized2]).T
tangles = OrdinalTangles(agreement=100, verbose=False)
ys_tangles = tangles.fit_predict(cuts)
print(f"Using {int(cuts.size/2)} triplets.")
print(f"Tangles score: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")

Using 1800 triplets.
Tangles score: 0.5532081014106206 (2 clusters)


This still doesn't work. On further thought, this makes sense, as Tangles requires somewhat "hierarchical" cuts: one cut first has to separate {green, orange} from {violet}, then a second cut has to separate {green} from {orange}. This second cut is free to carry random information for violet.

We will test this:

In [181]:
cuts = np.array([green_violet_cut, orange_green_randomized, green_violet_cut, orange_green_randomized]).T
tangles = OrdinalTangles(agreement=100, verbose=False)
ys_tangles = tangles.fit_predict(cuts)
print(f"Using {int(cuts.size/2)} triplets.")
print(f"Tangles score: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")

Using 1200 triplets.
Tangles score: 0.9734533486498999 (3 clusters)


We note that the orange-green cut carries completely random information on the violet cluster, so we really only have 1000 informative triplets.

Again, we compare to SOE-kMeans:

In [182]:
cuts = np.array([orange_green_randomized, green_violet_cut]).T
labels = [(orange_green_b, orange_green_c), (green_violet_b, green_violet_c)]
q = Questionnaire(cuts, labels)
soe_kmeans = SoeKmeans(embedding_dimension=2, n_clusters=3, seed=1)
ys_soe_kmeans = soe_kmeans.fit_predict(*q.to_bool_array())
print(f"SOE-kMeans score: {normalized_mutual_info_score(ys_soe_kmeans, data.ys)}")

SOE-kMeans score: 0.39155920636760577


Again, SOE-kMeans does a lot worse.

## Intuition on selecting pivot points for Tangles
Assume we want to design a study whose results are to be clustered with Tangles.
To get the best possible result with the fewest cuts, we want to select pivots that
divide the data in a hierarchical fashion, as the Tangles algorithm does. Assume that we have a set of images with three groups, {tigers, dogs, house cats}.

We might first select an image pair {dog, house cat} as pivots. The resulting cut should separate {tiger, house cat} from {dog}. Next, we would select an image pair {tiger, house cat}. This should separate {tiger} from {house cat}, while the answers that users give when asked "is this dog closer to the house cat or the tiger?" are completely irrelevant (and don't affect clustering performance).
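The following toy sketch illustrates why the dogs' answers to the second question do not matter. It uses a plain two-level readout of the cuts, which is not the Tangles algorithm, and it assumes idealized answers for the first cut. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# toy ground truth: 0 = tiger, 1 = dog, 2 = house cat (four images each)
species = np.repeat([0, 1, 2], 4)

# cut 1, pivots {dog, house cat}: True = "closer to the house cat"
# assumed ideal answers: tigers and house cats say True, dogs say False
cut_dog_vs_cat = species != 1

# cut 2, pivots {tiger, house cat}: True = "closer to the tiger"
# only tigers and house cats need to answer correctly; dogs answer at random
cut_tiger_vs_cat = species == 0
cut_tiger_vs_cat[species == 1] = rng.random(4) > 0.5


def read_hierarchy(first_cut, second_cut):
    """Plain two-level readout (NOT the Tangles algorithm): apply the second
    cut only on the side of the first cut that still needs splitting."""
    labels = np.zeros(first_cut.shape[0], dtype=int)  # 0: split off by the first cut
    labels[first_cut & second_cut] = 1                # here: tigers
    labels[first_cut & ~second_cut] = 2               # here: house cats
    return labels


labels = read_hierarchy(cut_dog_vs_cat, cut_tiger_vs_cat)
print(labels)
```

Whatever the dogs answer to the second question, the first cut has already separated them, so the resulting three groups are the same.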