# Hierarchy models
In the following, we will investigate the hierarchy block models that
are described in Ghoshdastidar et al., 2019. 

We contrast a Tangles-based approach with ComparisonHC.

In [1]:
import sys
sys.path.append("..")
from data_generation import generate_planted_hierarchy
from comparison_hc import ComparisonHC
from estimators import LandmarkTangles, SoeKmeans, MajorityTangles
from questionnaire import Questionnaire
import pandas as pd
import altair as alt
import numpy as np

In [2]:
def eval_hierarchical(noise=0.0, density=0.1, hier_noise=0.0, n_runs=1):
    """Return the mean score of each method over n_runs freshly generated datasets."""
    l, s, m, c = [], [], [], []
    for _ in range(n_runs):
        data = generate_planted_hierarchy(2, 10, 5, 1, hier_noise)
        # Build the triplet questionnaire from the similarities; missing answers are imputed at random.
        q = Questionnaire.from_precomputed(
            data.xs, density=density, use_similarities=True, noise=noise, verbose=False
        ).impute("random")
        t, r = q.to_bool_array()
        l.append(LandmarkTangles(4).score(t, r, data.ys))
        s.append(SoeKmeans(2, 4).score(t, r, data.ys))
        m.append(MajorityTangles(4).score(t, r, data.ys))
        c.append(ComparisonHC(4).score(t, r, data.ys))
    return np.mean(l), np.mean(s), np.mean(m), np.mean(c)
landmark, soe, maj, comp = eval_hierarchical()
print(f"Landmark: {landmark}, SoeKmeans: {soe}, MajorityTangles: {maj}, ComparisonHC: {comp}")

Landmark: 1.0, SoeKmeans: 1.0, MajorityTangles: 0.5174755217178904, ComparisonHC: 0.5876810155547433
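
The plots further down label these scores as NMI. As a reminder of what that quantity measures, here is a minimal pure-Python sketch of normalized mutual information between two labelings. The arithmetic-mean normalization and the helper name `nmi` are assumptions made for illustration; the `score` methods of the estimators used here may compute the value differently.

```python
import math
from collections import Counter


def nmi(labels_true, labels_pred):
    """Normalized mutual information, normalized by the mean of the two entropies."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy of the empirical label distribution.
        return -sum((k / n) * math.log(k / n) for k in counts.values())

    h_true, h_pred = entropy(count_true), entropy(count_pred)
    # Mutual information: sum over joint cells of p(a, b) * log(p(a, b) / (p(a) * p(b))).
    mi = sum(
        (k / n) * math.log(k * n / (count_true[a] * count_pred[b]))
        for (a, b), k in count_joint.items()
    )
    denom = (h_true + h_pred) / 2
    # Two trivial one-cluster labelings are treated as a perfect match.
    return 1.0 if denom == 0 else mi / denom


# Relabelled but identical partition versus a partition that is independent of the truth.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

The score only depends on how the objects are grouped, not on the cluster names, which is why a relabelled copy of the ground truth counts as a perfect recovery.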


We again run into the same problem that we have observed before:
as soon as we add noise to the hierarchy,
Tangles can no longer cope with it (because it needs a cut that cleanly separates one cluster from the rest).

See the graphics in `20_hierarchical_clusters.ipynb`.

If we instead add noise to the triplets directly, the algorithm performs
much better, with Tangles outperforming the other algorithms
or being on par with them.

Arguments can be made for both noise models. Consider two top-level categories,
fruit and vegetables, where fruit consists of apples and pears,
and vegetables consist of tomatoes and carrots.

Adding noise to the hierarchy corresponds to the view that
objects from different categories all have individual, unrelated distances to
each other. This means that if I pick an apple, a tomato and a carrot,
it is completely random whether the apple is closer to the tomato or to the carrot.

Adding noise to the triplets corresponds to the view that there is a tendency to always answer the same way
(apples are always closer to carrots!), but with some noise in how each individual question is answered (so _some_ apples might be closer to tomatoes than to carrots).
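
To make the two views concrete, here is a small self-contained sketch. It does not reproduce `generate_planted_hierarchy` or `Questionnaire`; the similarity values, the helper names and the Gaussian perturbation are all assumptions chosen purely for illustration.

```python
import random

# Hypothetical toy objects; each name is an index into the similarity matrix below.
APPLE, PEAR, TOMATO, CARROT = range(4)

# Made-up similarities: high within a category, low across categories,
# with a slight tendency for the apple to be closer to the carrot than to the tomato.
SIM = [
    [1.0, 0.8, 0.2, 0.3],
    [0.8, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.8],
    [0.3, 0.2, 0.8, 1.0],
]


def triplet_answer(sim, a, b, c):
    """True if object a is more similar to b than to c."""
    return sim[a][b] > sim[a][c]


def noisy_triplet_answer(sim, a, b, c, flip_prob, rng):
    """Triplet noise: the ground-truth answer is flipped with probability flip_prob."""
    answer = triplet_answer(sim, a, b, c)
    return (not answer) if rng.random() < flip_prob else answer


def perturb_similarities(sim, scale, rng):
    """Hierarchy noise (illustrative): add symmetric Gaussian noise to every pair."""
    n = len(sim)
    noisy = [row[:] for row in sim]
    for i in range(n):
        for j in range(i + 1, n):
            noisy[i][j] = noisy[j][i] = sim[i][j] + rng.gauss(0.0, scale)
    return noisy


rng = random.Random(0)

# Triplet noise: the same question mostly gets the same answer.
votes = [noisy_triplet_answer(SIM, APPLE, CARROT, TOMATO, 0.2, rng) for _ in range(1000)]
print("triplet noise, share answering 'carrot':", sum(votes) / len(votes))

# Hierarchy noise: every redraw of the similarities fixes its own answer.
votes = [
    triplet_answer(perturb_similarities(SIM, 1.0, rng), APPLE, CARROT, TOMATO)
    for _ in range(1000)
]
print("hierarchy noise, share answering 'carrot':", sum(votes) / len(votes))
```

With these made-up numbers, the first share should stay near the underlying tendency, while the second should drift towards one half once the perturbation is large compared to the gap between the two similarities.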

# Adding triplet noise 

In [3]:
def df_add_triplet_noise(density):
    columns = ["Triplet Noise", "Landmark", "SoeKmeans", "MajorityTangles", "ComparisonHC"]
    rows = []
    for tn in [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]:
        # Use the density requested by the caller.
        l, s, m, c = eval_hierarchical(noise=tn, n_runs=5, density=density)
        rows.append([tn, l, s, m, c])
    # Build the frame once from the collected rows (DataFrame.append is unavailable in pandas >= 2.0).
    df = pd.DataFrame(rows, columns=columns)
    return df.melt(id_vars=["Triplet Noise"], var_name="Method", value_vars=list(df.columns[1:]), value_name="NMI")
df_01 = df_add_triplet_noise(0.1)
df_005 = df_add_triplet_noise(0.05)

In [4]:
alt.Chart(df_01).mark_line(point={}).encode(x="Triplet Noise:Q", y="NMI:Q", color="Method:N").properties(title="Triplet Noise with density 0.1").display()
alt.Chart(df_005).mark_line(point={}).encode(x="Triplet Noise:Q", y="NMI:Q", color="Method:N").properties(title="Triplet Noise with density 0.05")

# Varying density

In [5]:
columns = ["Density", "Landmark", "SoeKmeans", "MajorityTangles", "ComparisonHC"]
rows = []
for density in [0.1, 0.05, 0.01, 0.005, 0.001]:
    l, s, m, c = eval_hierarchical(density=density, n_runs=5)
    rows.append([density, l, s, m, c])
# Build the frame once from the collected rows (DataFrame.append is unavailable in pandas >= 2.0).
df = pd.DataFrame(rows, columns=columns)
df = df.melt(id_vars=["Density"], var_name="Method", value_vars=list(df.columns[1:]), value_name="NMI")

In [6]:
alt.Chart(df).mark_line(point={}).encode(x="Density", y="NMI:Q", color="Method:N")

# Adding hierarchy noise

In [7]:
columns = ["Hierarchy Noise", "Landmark", "SoeKmeans", "MajorityTangles", "ComparisonHC"]
rows = []
for hn in [0.0, 0.25, 0.5, 0.75, 1.0, 2.0, 3.0, 4.0]:
    l, s, m, c = eval_hierarchical(hier_noise=hn, n_runs=5)
    rows.append([hn, l, s, m, c])
# Build the frame once from the collected rows (DataFrame.append is unavailable in pandas >= 2.0).
df = pd.DataFrame(rows, columns=columns)
df = df.melt(id_vars=["Hierarchy Noise"], var_name="Method", value_vars=list(df.columns[1:]), value_name="NMI")

In [8]:
alt.Chart(df).mark_line(point={}).encode(x="Hierarchy Noise", y="NMI:Q", color="Method:N")

# Recovering the hierarchy
As it stands, all the algorithms only recover the flat cluster structure, not the hierarchy.
Tangles and ComparisonHC can also discover the hierarchy of the underlying structure, so we want to add this as well and see how Tangles holds up.