# Tweaking Euclidean data

To get a more realistic picture of how Tangles performs, we want to tweak the data a bit. So far we have been very "nice" to SOE-kMeans: embedding Euclidean data into a space of the same dimension is not a particularly hard task.

We have two ideas: different metrics for triplet generation, and higher-dimensional spaces.

## Other metrics

As we already noticed, embedding Euclidean data into a Euclidean space using triplets derived from Euclidean distances might give an unfair advantage to the SOE-kMeans algorithm.

It might be advantageous to generate the triplets with another metric, such as the Minkowski metric.
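
For reference, the Minkowski distance of order $p$ between two points $x$ and $y$ is $\left(\sum_i |x_i - y_i|^p\right)^{1/p}$; $p = 2$ gives the Euclidean distance and $p = 1$ the Manhattan distance. The following is a minimal, dependency-free sketch of why this matters for triplets: the answer to "is a closer to b than to c?" can flip when $p$ changes. The helper names `minkowski` and `triplet_answer` are made up for this illustration and are not part of the project code.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)


def triplet_answer(a, b, c, p):
    """True if a is closer to b than to c under the Minkowski-p metric."""
    return minkowski(a, b, p) < minkowski(a, c, p)


a = (0.0, 0.0)
b = (3.0, 3.0)  # diagonal neighbour: its distance to a shrinks as p grows
c = (5.0, 0.0)  # axis-aligned neighbour: its distance to a is 5 for every p

for p in (1.0, 1.5, 2.0, 3.0):
    print(f"p = {p}: a closer to b than to c? {triplet_answer(a, b, c, p)}")
```

For these three points the answer is `False` for p = 1 and `True` for the larger values, so the same data can produce different triplets depending on the metric.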

In [24]:
import sys
sys.path.append("..")
import numpy as np
import pandas as pd
import altair as alt
import cblearn.datasets as datasets
from data_generation import generate_gmm_data_fixed_means
from sklearn.metrics import pairwise_distances
from questionnaire import Questionnaire, unify_triplet_order 
from estimators import OrdinalTangles
from cblearn.embedding import SOE
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score
from plotting import AltairPlotter
from sklearn.neighbors import DistanceMetric

In [2]:
seed = 9
data = generate_gmm_data_fixed_means(n=100, means=np.array([[-6,3], [6,3], [-6,-3]]), std=2, seed=seed)
minkowski_1_5 = DistanceMetric.get_metric("minkowski", p=1.5)
questionnaire = Questionnaire.from_metric(data.xs, metric=minkowski_1_5, density=0.01, seed=seed)

Generating questionnaire...
Generating question set...
Filling out questionnaire...


100%|██████████| 300/300 [00:00<00:00, 4812.26it/s]


In [3]:
tangles = OrdinalTangles(agreement=35)
ys_tangles = tangles.fit_predict(questionnaire.values)

soe = SOE(n_components=2, random_state=seed)
kmeans = KMeans(3)
ys_soe = kmeans.fit_predict(soe.fit_transform(*questionnaire.to_bool_array()))
print(f"NMI Tangles: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")
print(f"NMI SOE: {normalized_mutual_info_score(ys_soe, data.ys)}")

NMI Tangles: 0.8236059492065199 (3 clusters)
NMI SOE: 0.8112617504898255
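
As a reminder of what the NMI numbers mean: NMI compares two labelings up to renaming of the clusters and does not require both to have the same number of clusters. Below is a small pure-Python sketch, assuming arithmetic-mean normalisation (which, as far as I know, is the scikit-learn default). The functions `entropy` and `nmi` are illustrative helpers, not the implementation used in the cells here.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Mutual information divided by the mean of the two entropies."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = (entropy(labels_a) + entropy(labels_b)) / 2
    # Two trivial one-cluster labelings are treated as a perfect match.
    return mi / denom if denom > 0 else 1.0


truth = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]   # same partition, different label names
oversplit = [0, 0, 1, 1, 2, 3]  # one true cluster split in two

print(nmi(truth, renamed))
print(nmi(truth, oversplit))
```

The renamed labeling scores 1 (up to floating-point error), while the over-split one is penalised but still scores above 0.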


It seems Tangles can perform competitively against SOE. Let us see what this looks like across different values of p in the Minkowski metric.

In [4]:
ps = np.arange(1.0, 3.01, 0.1)
rows = []  # one dict per run; DataFrame.append was removed in pandas 2.0
seed = None
for p in ps:
    print("p =", p)
    for i in range(50):
        # data generation
        data = generate_gmm_data_fixed_means(n=100, means=np.array([[-6,3], [6,3], [-6,-3]]), std=2, seed=seed)
        minkowski_p = DistanceMetric.get_metric("minkowski", p=p)
        questionnaire = Questionnaire.from_metric(data.xs, metric=minkowski_p, density=0.01, seed=seed, verbose=0)

        # results
        tangles = OrdinalTangles(agreement=35, verbose=0)
        ys_tangles = tangles.fit_predict(questionnaire.values)

        soe = SOE(n_components=2, random_state=seed)
        kmeans = KMeans(3)
        ys_soe = kmeans.fit_predict(soe.fit_transform(*questionnaire.to_bool_array()))
        rows.append({"p": p, "NMI Tangles": normalized_mutual_info_score(ys_tangles, data.ys), "NMI SOE": normalized_mutual_info_score(ys_soe, data.ys), "run": i})
# build the frame once at the end instead of growing it row by row
df_other_metric = pd.DataFrame(rows)
df_other_metric.to_csv("results/18_tweaking_euclidean_data.csv")

p = 1.0
p = 1.1
p = 1.2000000000000002
p = 1.3000000000000003
p = 1.4000000000000004
p = 1.5000000000000004
p = 1.6000000000000005
p = 1.7000000000000006
p = 1.8000000000000007
p = 1.9000000000000008
p = 2.000000000000001
p = 2.100000000000001
p = 2.200000000000001
p = 2.300000000000001
p = 2.4000000000000012
p = 2.5000000000000013
p = 2.6000000000000014
p = 2.7000000000000015
p = 2.8000000000000016
p = 2.9000000000000017
p = 3.0000000000000018
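
The long decimals in the output above are floating-point artefacts of stepping by 0.1. A small sketch of one way to get clean grid values, using plain Python instead of `np.arange` (a suggestion only, not what this notebook ran):

```python
# Build the grid from integer steps and round to one decimal, so the
# values print as 1.0, 1.1, ..., 3.0 and make tidy group keys later on.
ps = [round(1.0 + 0.1 * k, 1) for k in range(21)]
print(ps)
```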


In [19]:
df_avg = df_other_metric.groupby("p").mean().reset_index()
alt.Chart(df_avg).transform_fold(["NMI SOE", "NMI Tangles"]).mark_point(size=5).encode(x="p", y="value:Q", color="key:N").interactive()

## Higher dimensional spaces

Real data usually comes from a higher-dimensional space (images, for example). We want to see whether we still get acceptable performance when we have to embed into a lower-dimensional space.

We realise this by taking data from a 5-dimensional Gaussian mixture with six components and asking SOE-kMeans to embed it into a 2-dimensional space.
To make it even harder, we keep the Minkowski metric, on which Tangles performed reasonably well.

In [71]:
seed = 8
data = generate_gmm_data_fixed_means(n=100, means=np.array([[-6,3,6,-2,-3], [6,3,-6,-2,3], [-6,-3,6,2,4], [-6,-3,-6,2,-4], [6,3,6,2,4], [-6,-3,-6,-2,-4]]), std=2, seed=seed)
minkowski_1_5 = DistanceMetric.get_metric("minkowski", p=1.5)
questionnaire_low = Questionnaire.from_metric(data.xs, metric=minkowski_1_5, density=0.0001, seed=seed)
questionnaire_high = Questionnaire.from_metric(data.xs, metric=minkowski_1_5, density=0.001, seed=seed)

Generating questionnaire...
Generating question set...
Filling out questionnaire...


100%|██████████| 600/600 [00:00<00:00, 79977.83it/s]


Generating questionnaire...
Generating question set...
Filling out questionnaire...


100%|██████████| 600/600 [00:00<00:00, 11928.91it/s]


In [80]:
soe_embedding_dim = 2
#
tangles = OrdinalTangles(agreement=30)
ys_tangles = tangles.fit_predict(questionnaire_low.values)

soe = SOE(n_components=soe_embedding_dim, random_state=seed)
kmeans = KMeans(6)
ys_soe = kmeans.fit_predict(soe.fit_transform(*questionnaire_low.to_bool_array()))
print("FOR p = 1.5")
print(f"For low density (0.0001):")
print(f"\tNMI Tangles:\t {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")
print(f"\tNMI SOE:\t {normalized_mutual_info_score(ys_soe, data.ys)}")

tangles = OrdinalTangles(agreement=30)
ys_tangles = tangles.fit_predict(questionnaire_high.values)

soe = SOE(n_components=soe_embedding_dim, random_state=seed)
kmeans = KMeans(6)
ys_soe = kmeans.fit_predict(soe.fit_transform(*questionnaire_high.to_bool_array()))
print(f"For high density (0.001):")
print(f"\tNMI Tangles:\t {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")
print(f"\tNMI SOE:\t {normalized_mutual_info_score(ys_soe, data.ys)}")

FOR p = 1.5
For low density (0.0001):
	NMI Tangles:	 0.6121033811131482 (7 clusters)
	NMI SOE:	 0.5504339949868314
For high density (0.001):
	NMI Tangles:	 0.7917505156964619 (6 clusters)
	NMI SOE:	 0.83536875257996


Very interesting: with a low density (0.0001), Tangles actually performs _better_, while with a higher density (0.001), SOE-kMeans performs better. I also tried different seeds, and this does not seem to be purely by chance.

As an aside, if you repeat the same experiment with a different Minkowski distance (e.g. p = 2), SOE catches up at low density, while the Tangles scores in the run below are unchanged.

In [81]:
minkowski_2 = DistanceMetric.get_metric("minkowski", p=2)
questionnaire_low = Questionnaire.from_metric(data.xs, metric=minkowski_2, density=0.0001, seed=seed, verbose=0)
questionnaire_high = Questionnaire.from_metric(data.xs, metric=minkowski_2, density=0.001, seed=seed, verbose=0)

# 
tangles = OrdinalTangles(agreement=30)
ys_tangles = tangles.fit_predict(questionnaire_low.values)

soe = SOE(n_components=soe_embedding_dim, random_state=seed)
kmeans = KMeans(6)
ys_soe = kmeans.fit_predict(soe.fit_transform(*questionnaire_low.to_bool_array()))
print("FOR p = 2")
print(f"For low density (0.0001):")
print(f"\tNMI Tangles:\t {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")
print(f"\tNMI SOE:\t {normalized_mutual_info_score(ys_soe, data.ys)}")

tangles = OrdinalTangles(agreement=30)
ys_tangles = tangles.fit_predict(questionnaire_high.values)

soe = SOE(n_components=soe_embedding_dim, random_state=seed)
kmeans = KMeans(6)
ys_soe = kmeans.fit_predict(soe.fit_transform(*questionnaire_high.to_bool_array()))
print(f"For high density (0.001):")
print(f"\tNMI Tangles:\t {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")
print(f"\tNMI SOE:\t {normalized_mutual_info_score(ys_soe, data.ys)}")

FOR p = 2
For low density (0.0001):
	NMI Tangles:	 0.6121033811131482 (7 clusters)
	NMI SOE:	 0.6146421654981518
For high density (0.001):
	NMI Tangles:	 0.7917505156964619 (6 clusters)
	NMI SOE:	 0.8329846335451797


As an aside, if we let SOE embed into the true dimension of the data instead of 2, SOE-kMeans gets well ahead again.

### With majority cuts 


In [46]:
seed = 9
data = generate_gmm_data_fixed_means(n=100, means=np.array([[-6,3,6,-2,-3], [6,3,-6,-2,3], [-6,-3,6,2,4], [-6,-3,-6,2,-4], [6,3,6,2,4], [-6,-3,-6,-2,-4]]), std=2, seed=seed)
triplets, responses = datasets.make_random_triplets(data.xs, result_format="list-boolean", size=5000000)
unified_triplets = unify_triplet_order(triplets, responses)

In [63]:
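# tangles
# NOTE: triplets_to_majority_neighbour_cuts is not imported in the import cell above;
# import it from the project module that defines it before running this cell.
cuts = triplets_to_majority_neighbour_cuts(unified_triplets, radius=1)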
tangles = OrdinalTangles(agreement=30)
ys_tangles = tangles.fit_predict(cuts)

soe = SOE(n_components=2, random_state=seed)
kmeans = KMeans(6)
ys_soe = kmeans.fit_predict(soe.fit_transform(triplets[:10000], responses[:10000]))
print(f"NMI Tangles: {normalized_mutual_info_score(ys_tangles, data.ys)} ({np.unique(ys_tangles).size} clusters)")
print(f"NMI SOE: {normalized_mutual_info_score(ys_soe, data.ys)}")

NMI Tangles: 0.5174528143170685 (4 clusters)
NMI SOE: 0.18883892495428412


It somewhat works, but it requires far too many triplets (with 1/10 of that number, the algorithm breaks down). Majority neighbour cuts are not great here; they are not Tangles' strong suit. Note that we cannot give SOE many more triplets than this, as it crashes beyond that.