# Evaluate Similarity Grouping

In this notebook, we evaluate how effectively a relation can be integrated using the NoiseAwareGroupBy operator.
For this purpose, we use the [Music Brainz 20K](https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution) dataset.

The dataset contains song records from different sources that were modified using the DAPO data generator.
The goal is to group records of the same song into buckets. For example, the records {"title": "Daniel Balavoine - L'enfant aux yeux d'Italie", "artist": null, "album": "De vous à elle en passant par moi", ...} and {"name": "L'enfant aux yeux d'Italie - De vous à elle en passant par moi", "artist": "Daniel Balavoine", "album": null} describe the same song.

The column "CID" holds the ground-truth cluster of each record. Using the `SoftAggregateScikit` operator, we determine clusters and compare them with "CID" using the following metrics:
* Adjusted Rand Index (ARI)
* Normalized Mutual Information (NMI)
* Fowlkes-Mallows Index (FMI)
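
ARI and FMI are both pair-counting measures: they look at every pair of records and check whether the ground truth and the prediction agree on placing the pair in the same cluster. The following is a minimal, dependency-free sketch of that idea. The helper names are illustrative, and the notebook itself relies on the scikit-learn implementations. The last example also shows why the plain Rand index can stay high when the prediction consists of tiny clusters, while ARI drops to zero.

```python
from itertools import combinations
from math import sqrt


def pair_counts(true_labels, pred_labels):
    """Count record pairs as (tp, fp, fn, tn) with respect to 'same cluster'."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(true_labels, pred_labels):
    """Fraction of pairs on which both labelings agree."""
    tp, fp, fn, tn = pair_counts(true_labels, pred_labels)
    return (tp + tn) / (tp + fp + fn + tn)


def adjusted_rand(true_labels, pred_labels):
    """Rand index corrected for chance agreement (pair-counting form)."""
    tp, fp, fn, tn = pair_counts(true_labels, pred_labels)
    if fp == 0 and fn == 0:
        return 1.0
    return 2.0 * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))


def fowlkes_mallows(true_labels, pred_labels):
    """Geometric mean of pairwise precision and recall."""
    tp, fp, fn, _ = pair_counts(true_labels, pred_labels)
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


truth = [0, 0, 1, 1]
renamed = [1, 1, 0, 0]      # same grouping, different label names
singletons = [0, 1, 2, 3]   # every record in its own cluster

print(adjusted_rand(truth, renamed), fowlkes_mallows(truth, renamed))        # 1.0 1.0
print(adjusted_rand(truth, singletons), fowlkes_mallows(truth, singletons))  # 0.0 0.0
print(round(rand_index(truth, singletons), 3))                               # 0.667
```

The pairwise loop is quadratic in the number of records, so it is only meant for small illustrations like this one.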


In [1]:
%%capture
!pip3 install faiss-gpu-cu12
!pip3 install pgvector

In [1]:
%%capture
!rm -rf SofteningQueryEvaluation
!git clone https://github.com/HackerBschor/SofteningQueryEvaluation
%cd SofteningQueryEvaluation

In [2]:
from huggingface_hub import notebook_login
notebook_login()

In [3]:
import pandas as pd
import time

from models import ModelMgr
from models.embedding.SentenceTransformer import SentenceTransformerEmbeddingModel

from db.operators import Dummy, SoftAggregateScikit
from db.operators.Aggregate import SetAggregation
from sklearn.cluster import KMeans, DBSCAN, HDBSCAN

from sklearn.metrics import rand_score, adjusted_rand_score
from sklearn.metrics import fowlkes_mallows_score
from sklearn.metrics import mutual_info_score, adjusted_mutual_info_score, normalized_mutual_info_score
from sklearn.metrics import homogeneity_score, completeness_score
from sklearn.metrics import v_measure_score, homogeneity_completeness_v_measure

In [4]:
m = ModelMgr()
stem = SentenceTransformerEmbeddingModel(m)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [7]:
df_music = pd.read_csv("musicbrainz-20-A01.csv", index_col=0).drop(columns=["length"])
significant_cols = ["title", "artist", "album", "year", "language"]
df_music.head()

TID,CID,CTID,SourceID,id,number,title,artist,album,year,language
1,1,1,2,MBox7368722-HH,9,Daniel Balavoine - L'enfant aux yeux d'Italie,,De vous à elle en passant par moi,75.0,French
2,2512,5,4,139137-A047,7,007,[unknown],Cantigas de roda (unknown),,Por.
3,2,1,2,MBox38440522-HH,17,Action PAINTING! - Mustard Gas,,There and Back Again Lane,95.0,English
4,3,1,5,4489993,10,Your Grace,Kathy Troccoli,Comfort,2005.0,English
5,4,1,5,10339621,2,Well You Needn't,Ernie Stadler Jazz Quintet,First Down,2010.0,English


In [70]:
def evaluate(df, cluster_columns, id_column, cluster_class, cluster_params, serialization_mode, reduce_dimensions, drop_na):
    """Cluster `df` with SoftAggregateScikit and score the result against `id_column`.

    Returns the configuration key and a dict with the metrics, the runtime in seconds,
    and the predicted labels.
    """
    key = (str(cluster_class), str(serialization_mode), str(reduce_dimensions), str(drop_na))

    if drop_na:
        df = df.dropna()

    # KMeans needs the number of clusters up front; use the ground-truth count
    if cluster_class == KMeans:
        cluster_params = {"n_clusters": len(df[id_column].unique())}

    columns = [col.strip() for col in df.columns]
    data = [[str(y) for y in x] for x in df.itertuples(name=None)]

    d = Dummy("data", ["tid"] + columns, data).open()
    agg = SoftAggregateScikit(
        d,
        cluster_columns,
        [SetAggregation("tid", "ids")],
        em=stem,
        cluster_class = cluster_class,
        cluster_params = cluster_params,
        serialization_mode = serialization_mode,
        reduce_dimensions = reduce_dimensions
    )

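    # Measure the wall-clock time of the soft aggregation
    tic = time.time()
    result = agg.open().fetch_all()
    toc = time.time()

    # Each result row is one predicted cluster; label every record id in it with the row index
    predictions = []
    for i, row in enumerate(result):
        predictions.append(pd.Series([i for _ in range(len(row["ids"]))], index=[int(idx) for idx in row["ids"]]))

    # Sort both label series by record id so that they are aligned
    predicted_labels = pd.concat(predictions).sort_index()
    true_labels = df[id_column].sort_index()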

    return key, {
        "rand_score": rand_score(true_labels, predicted_labels),
        "adjusted_rand_score": adjusted_rand_score(true_labels, predicted_labels),
        "fowlkes_mallows_score": fowlkes_mallows_score(true_labels, predicted_labels),
        "mutual_info_score": mutual_info_score(true_labels, predicted_labels),
        "adjusted_mutual_info_score": adjusted_mutual_info_score(true_labels, predicted_labels),
        "normalized_mutual_info_score": normalized_mutual_info_score(true_labels, predicted_labels),
        "homogeneity_score": homogeneity_score(true_labels, predicted_labels),
        "completeness_score": completeness_score(true_labels, predicted_labels),
        "v_measure_score": v_measure_score(true_labels, predicted_labels),
        "homogeneity_completeness_v_measure": homogeneity_completeness_v_measure(true_labels, predicted_labels),
        "runtime": toc-tic,
        "pred": predicted_labels
    }
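
The label reconstruction in `evaluate` can be hard to picture: each result row is one predicted group holding the ids of its members, and the position of the row becomes the cluster label. Below is a small standalone sketch of that step, using plain dictionaries instead of pandas. The function name and the ids are made up for illustration.

```python
def labels_from_groups(groups):
    """Assign each record id the index of the group that contains it.

    `groups` is a list of id collections, one per predicted cluster
    (a hypothetical stand-in for the "ids" field of the result rows).
    Returns the sorted ids and their labels in the same order.
    """
    label_of = {}
    for label, ids in enumerate(groups):
        for record_id in ids:
            label_of[int(record_id)] = label
    ordered_ids = sorted(label_of)
    return ordered_ids, [label_of[i] for i in ordered_ids]


# Ids arrive as strings because the records are stringified before grouping
groups = [{"4", "19319"}, {"5"}, {"9", "10"}]
ids, labels = labels_from_groups(groups)
print(ids)     # [4, 5, 9, 10, 19319]
print(labels)  # [0, 1, 2, 2, 0]
```

Sorting by id matters because the clustering metrics compare the two label sequences position by position.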

In [71]:
overall_results = {}

cluster_classes = [
    (KMeans, None),
    (DBSCAN, {"eps": 0.1, "min_samples": 1}),
    (HDBSCAN, {"min_cluster_size": 2}),
]

for cc, cp in cluster_classes:
    for sm in ["FIELD_SERIALIZED", "FULL_SERIALIZED"]:
        for dn in [True, False]:
            for dim in [None, 2, 10, 50, 100]:
                res = evaluate(df_music, significant_cols, "CID", cluster_class=cc, cluster_params=cp, serialization_mode = sm, reduce_dimensions = dim, drop_na = dn)
                overall_results[res[0]] = res[1]
                print(res[0], res[1]["adjusted_rand_score"])

("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', 'None', 'True') 0.7227659245977957
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', '2', 'True') -2.4531393179505998e-05
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', '10', 'True') 0.008484851851779879
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', '50', 'True') 0.017518879204225565
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', '100', 'True') 0.007973073401744864
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', 'None', 'False') 0.1447399509695958
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', '2', 'False') 0.00046501468502420455
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', '10', 'False') 0.003652624300078057
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', '50', 'False') 0.003690991858100789
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FIELD_SERIALIZED', '100', 'False') 0.0033327104396177476
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', 'None', 'True') 0.7476574574975191
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', '2', 'True') -2.3753909189524985e-05
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', '10', 'True') 0.015123853205574708
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', '50', 'True') 0.052088402676638994
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', '100', 'True') 0.060579675365317415
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', 'None', 'False') 0.8193905275310946
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', '2', 'False') 0.15029748845678173
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', '10', 'False') 0.42161984326588114
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', '50', 'False') 0.43902107002161744
("<class 'sklearn.cluster._kmeans.KMeans'>", 'FULL_SERIALIZED', '100', 'False') 0.4326432332993688
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', 'None', 'True') 0.0
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', '2', 'True') 0.0003469658669101768
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', '10', 'True') 0.004838811796534633
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', '50', 'True') 0.006914720431063362
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', '100', 'True') 0.010136855032518616
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', 'None', 'False') 0.0008610930005805606
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', '2', 'False') -3.818446852587077e-05
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', '10', 'False') 0.000176200479341946
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', '50', 'False') 0.00018356149484793842
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FIELD_SERIALIZED', '100', 'False') 0.00022871936935967528
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', 'None', 'True') 0.2782563569718429
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', '2', 'True') 0.00022497755620687652
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', '10', 'True') 0.01243016962611369
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', '50', 'True') 0.01418800554008786
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', '100', 'True') 0.018164975897916633
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', 'None', 'False') 0.004664718769860029
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', '2', 'False') 0.0002749628346233918
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', '10', 'False') 0.04135354932494886
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', '50', 'False') 0.05198700452933587
("<class 'sklearn.cluster._dbscan.DBSCAN'>", 'FULL_SERIALIZED', '100', 'False') 0.050894188928070704
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', 'None', 'True') 0.05192791393826211
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', '2', 'True') 0.005532233275780948
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', '10', 'True') 0.004969767810254126
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', '50', 'True') 0.004349075702464014
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', '100', 'True') 0.011665111011044752
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', 'None', 'False') 0.265633749556373
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', '2', 'False') 0.0014350551739791815
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', '10', 'False') 0.003283994488971422
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', '50', 'False') 0.0029068538852484176
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FIELD_SERIALIZED', '100', 'False') 0.0007078001477686686
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', 'None', 'True') 0.020980038415771833
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', '2', 'True') 0.020844990212210073
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', '10', 'True') 0.02280014168060225
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', '50', 'True') 0.0291117279723106
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', '100', 'True') 0.02375896405903127
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', 'None', 'False') 0.871507546722749
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', '2', 'False') 0.19357111522398596
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', '10', 'False') 0.5092503647132791
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', '50', 'False') 0.511497910753623
("<class 'sklearn.cluster._hdbscan.hdbscan.HDBSCAN'>", 'FULL_SERIALIZED', '100', 'False') 0.5013349419763493
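
The four nested loops in the cell above enumerate a full grid of 3 × 2 × 2 × 5 = 60 configurations. As a sketch, the same grid can be generated with `itertools.product`, which keeps the loop flat. The clustering classes are replaced by their names here so that the snippet runs without scikit-learn, and the variable names are illustrative.

```python
from itertools import product

cluster_names = ["KMeans", "DBSCAN", "HDBSCAN"]
serialization_modes = ["FIELD_SERIALIZED", "FULL_SERIALIZED"]
drop_na_options = [True, False]
dimensions = [None, 2, 10, 50, 100]

# product() varies the last argument fastest, matching the nested loop order
grid = list(product(cluster_names, serialization_modes, drop_na_options, dimensions))

print(len(grid))  # 60
print(grid[0])    # ('KMeans', 'FIELD_SERIALIZED', True, None)
print(grid[-1])   # ('HDBSCAN', 'FULL_SERIALIZED', False, 100)
```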


In [72]:
keys = ["cluster", "serialization", "dimension", "drop_na"]
evaluation_results_list = [v | {ki: vi for ki, vi in zip(keys, k)} for k, v in overall_results.items()]
df_evaluation_results = pd.DataFrame.from_records(evaluation_results_list)
df_evaluation_results["cluster"] = df_evaluation_results["cluster"].apply(lambda x: x.split(".")[-1].replace("'>", ""))
df_evaluation_results = df_evaluation_results.set_index(keys)
df_evaluation_results.head()

cluster,serialization,dimension,drop_na,rand_score,adjusted_rand_score,fowlkes_mallows_score,mutual_info_score,adjusted_mutual_info_score,normalized_mutual_info_score,homogeneity_score,completeness_score,v_measure_score,homogeneity_completeness_v_measure,runtime,pred
KMeans,FIELD_SERIALIZED,,True,0.999987,0.722766,0.722914,7.937553,0.736875,0.998469,0.998429,0.998509,0.998469,"(0.9984286289107432, 0.9985094085043371, 0.998...",110.525394,4 0 5 1 9 2 10 ...
KMeans,FIELD_SERIALIZED,2.0,True,0.999951,-2.5e-05,0.0,7.901016,-2.6e-05,0.994026,0.993833,0.99422,0.994026,"(0.9938327384406908, 0.9942198343219951, 0.994...",53.860461,4 0 5 1 9 2 10 ...
KMeans,FIELD_SERIALIZED,10.0,True,0.999947,0.008485,0.008618,7.899162,0.00957,0.993938,0.9936,0.994277,0.993938,"(0.9935995138490926, 0.9942769177684889, 0.993...",54.437845,4 0 5 1 9 2 10 ...
KMeans,FIELD_SERIALIZED,50.0,True,0.999949,0.017519,0.017698,7.900311,0.019306,0.99404,0.993744,0.994336,0.99404,"(0.9937440661637883, 0.9943361763255057, 0.994...",56.846164,4 0 5 1 9 2 10 ...
KMeans,FIELD_SERIALIZED,100.0,True,0.999944,0.007973,0.008179,7.897666,0.009422,0.993843,0.993411,0.994276,0.993843,"(0.9934113839666784, 0.9942758401487835, 0.993...",60.259333,4 0 5 1 9 2 10 ...
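
The reshaping step merges each configuration key (a tuple) into its metric dictionary and shortens the class representation to the bare class name. Below is a small self-contained sketch of both steps on a single toy entry. The helper name `short_class_name` is illustrative, the score is rounded from the run above, and the `|` dict union requires Python 3.9 or newer.

```python
keys = ["cluster", "serialization", "dimension", "drop_na"]

# Toy stand-in for `overall_results`: one configuration key with one metric
overall = {
    ("<class 'sklearn.cluster._kmeans.KMeans'>", "FIELD_SERIALIZED", "None", "True"): {
        "adjusted_rand_score": 0.7228,
    },
}


def short_class_name(class_repr):
    """Reduce "<class 'pkg.mod.Name'>" to "Name"."""
    return class_repr.split(".")[-1].replace("'>", "")


# Merge the key fields into each metric dict, one flat record per configuration
records = [metrics | dict(zip(keys, key)) for key, metrics in overall.items()]
for record in records:
    record["cluster"] = short_class_name(record["cluster"])

print(records[0]["cluster"])  # KMeans
print(sorted(records[0]))     # ['adjusted_rand_score', 'cluster', 'dimension', 'drop_na', 'serialization']
```

Note that the key fields are strings at this point (for example `'None'` and `'True'`), because the configuration key was built with `str(...)`.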


In [81]:
df_evaluation_results.to_pickle("EvaluateClustering.pkl")
df_evaluation_results.drop(columns=["pred"]).to_csv("EvaluateClustering.csv")

In [80]:
# Inspect the predicted labels of the first configuration
print(df_evaluation_results["pred"].iloc[0])

4           0
5           1
9           2
10          3
19          4
         ... 
19296    2867
19302    2868
19319      83
19332    2869
19341    2870
Length: 2969, dtype: int64
