# Clustering pipeline

This notebook groups contextless name attestations based on how similar their strings are.

## Prep: Launching dependencies

To start clustering, we need a few base imports; `os` and `sys` are standard modules. To avoid overloading this notebook with large code blocks, the tasks of the pipeline are outsourced to separate files, each of which takes care of one task. These so-called *utility* classes are imported as well and do most of the heavy lifting for you.

This clustering pipeline supports CUDA acceleration (NVIDIA), but it also works on CPU-only devices. The notebook automatically looks for a free GPU to use. If none is found, inference runs on the CPU.

In [1]:
#base imports - nothing fancy here.
import os
import pandas as pd
import sys

#reusable component utilities:
sys.path.append('utils')
import gpu_manager  #hardware interaction 
import callables as c   # getters/setters for pipeline
## multiple 'steps' each have their own layer ==> data exchange using callables
from vectorization_layer import TransformerGUI
from dimred_layer import DimensionalityReductionGUI
from clustering_layer import Clustermachine
from noise_extension_layer import NoiseExtender
from dimviz import DimensionVisualizer
from cluster_inspector import ClusterInspectorGUI


In [2]:
max_gpus = 1        #(INT): how many GPUs is this notebook allowed to use at most?

######### Leave this block of code as is: #########
gpu_manager.pick_gpu(1, 'auto', int(max_gpus))
os.environ["TOKENIZERS_PARALLELISM"] = "true"
devices = gpu_manager.pick_gpu(mode = 'report', verbosity=0)
if not bool(devices): 
    device = 'cpu'
else:
    device = 'cuda'

print(f"Using {device}")

[INFO] Found 0 free GPUs: []
[WARN] No free GPUs found. CUDA_VISIBLE_DEVICES not set.
Using cpu
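
The internals of `gpu_manager` are not shown in this notebook. As a rough illustration of how a "free GPU, else CPU" check could work, the sketch below queries `nvidia-smi` and treats a GPU as free when its used memory is below a threshold. The function names and the 100 MiB threshold are assumptions made for this example; they are not taken from `gpu_manager`.

```python
import shutil
import subprocess


def parse_free_gpus(smi_output, max_used_mib=100):
    """Return indices of GPUs whose used memory is at or below the threshold.

    Expects lines of the form "index, memory.used" (MiB, without units).
    """
    free = []
    for line in smi_output.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not all(p.isdigit() for p in parts):
            continue  # skip blank or malformed lines
        index, used = int(parts[0]), int(parts[1])
        if used <= max_used_mib:  # assumption: "free" means almost no memory in use
            free.append(index)
    return free


def find_free_gpus(max_used_mib=100):
    """Query nvidia-smi if it is installed; return [] on CPU-only machines."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_free_gpus(result.stdout, max_used_mib)


free = find_free_gpus()
device = "cuda" if free else "cpu"
print(f"Using {device}")
```

On a machine without an NVIDIA driver, `find_free_gpus()` returns an empty list and the sketch falls back to `cpu`, mirroring the output shown above.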


## Step1: Data embedding
Upload the data you want to cluster as a flat file (.txt, .csv or .tsv) using the GUI element. Each entry must be on its own line. Entries may contain quotes; the UI has options that help you clean these up. If your exported file has a header on the first line, leave the option "Drop first row (header)" ticked; otherwise, untick it.
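
To make the expected input format concrete, here is a minimal sketch of the kind of clean-up described above: one entry per line, an optional header row, and optional quote stripping. `load_entries` and its parameters are hypothetical and only illustrate the idea; they are not the code behind the upload widget, and the sample names are made up.

```python
def load_entries(text, drop_header=True, strip_quotes=True):
    """Split a flat file into one entry per line, with optional clean-up."""
    lines = text.splitlines()
    if drop_header and lines:
        lines = lines[1:]  # first row is a header, not data
    entries = []
    for line in lines:
        entry = line.strip()
        if strip_quotes:
            entry = entry.strip("\"'").strip()  # remove surrounding quotes
        if entry:  # skip empty lines
            entries.append(entry)
    return entries


sample = 'name\n"Jan van Eyck"\nJohannes de Eyck\n\n\'J. v. Eyck\'\n'
print(load_entries(sample))
# ['Jan van Eyck', 'Johannes de Eyck', 'J. v. Eyck']
```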

Once the notebook has parsed the uploaded data, you can select an embedding model. The tool provides a dropdown with some pre-selected options. One of the options is labeled **Custom**: when you select it, you can provide a reference to any openly accessible model on [Huggingface.co](https://huggingface.co/) and use it in this notebook.

The reference for a model is easy to find on Hugging Face. Let's assume you want to use one of the KaLM-embedding models, such as [KaLM-embedding-multilingual-mini-instruct-v2.5](https://huggingface.co/KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5). The reference for this model is shown at the top of the model page and has a click-to-copy button right next to it. In this case, the custom value you need to provide is: `KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5`

Once you have chosen an embedding model, uploaded your data and selected the necessary pre-processing options, you can click 'Get embeddings'.

Models that are cached on your system are used immediately. If a model isn't cached, it will be downloaded; be aware that you need enough storage space and that the download may take a while, depending on your bandwidth.

In [3]:
TransformerGUI(
    config_path = "transformerconfig.json", 
    device = device, 
    on_result = c.receive_embeddings
).display()

VBox(children=(HBox(children=(FileUpload(value=(), accept='.csv,.txt,.tsv', description='Upload data'), Checkb…

Batches:   0%|          | 0/50 [00:00<?, ?it/s]

## Step2: Dimension reduction
Embedding models generate a high-dimensional dataset. A dimension can be interpreted as an axis onto which you project your data. In 1D, the space is a single line with points scattered along it. In 2D, it is a flat plane where each point has an x and a y coordinate. 3D is still quite intuitive: you essentially have a cube in which each point floats. Anything beyond this becomes harder to picture. Embedding models easily produce hundreds of dimensions; LaBSE, a BERT-based model, produces 768 dimensions.

High-dimensional data such as this is difficult and slow to cluster; dimensionality reduction is one way to address this. However, there is a balance to be found. Reduce your dimensions too far (say, to a single dimension) and you lose a lot of nuance. Keep too many dimensions and your clustering algorithm will be slow and ineffective. The right number of dimensions depends on your dataset, the chosen vectorization model and the clustering algorithm you want to use in the end. Finding the right settings for your dataset involves some trial and error, so you may need to revisit this step a few times.
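
One reason clustering becomes ineffective in many dimensions is that distances tend to look alike: the farthest point is barely farther away than the nearest one. The sketch below illustrates this with uniformly random points, which is an assumption made purely for the demonstration (real embeddings are not uniform noise). `distance_contrast` is a hypothetical helper, not part of the pipeline.

```python
import math
import random


def distance_contrast(n_points, n_dims, seed=0):
    """Ratio of the farthest to the nearest distance from the first point to the rest."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    origin, others = points[0], points[1:]
    distances = [math.dist(origin, p) for p in others]
    return max(distances) / min(distances)


# The contrast shrinks towards 1 as the number of dimensions grows.
for dims in (2, 32, 768):
    print(dims, round(distance_contrast(200, dims), 2))
```

A ratio close to 1 means that "near" and "far" are hard to tell apart, which is what density-based clustering relies on.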

In [4]:
DimensionalityReductionGUI(
    c.fetch_vectors,
    config_path = "dimredconfig.json", 
    on_result = c.receive_reduced
).display()

VBox(children=(Dropdown(description='Dimensionality reduction method:', layout=Layout(width='initial'), option…

## Step3: Multidimensionality exploration

Visualize the normalized variance of each dimension to get a grasp of how well the embeddings actually represent clusters. When sorting on one of the dimensions, you would like to see 'vertical groups' of data; the larger your input dataset, the more difficult it is to actually see this pattern.
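
How `DimensionVisualizer` normalizes the data is not shown here. As one possible reading, the sketch below min-max scales every dimension to the range [0, 1] and then computes its variance, so that dimensions with different scales become comparable. `normalized_variances` is a hypothetical helper and the toy vectors are made up.

```python
from statistics import pvariance


def normalized_variances(vectors):
    """Min-max scale every dimension to [0, 1], then return its variance."""
    variances = []
    for column in zip(*vectors):  # iterate over dimensions
        low, high = min(column), max(column)
        span = high - low
        if span == 0:
            variances.append(0.0)  # constant dimension carries no information
            continue
        scaled = [(value - low) / span for value in column]
        variances.append(pvariance(scaled))
    return variances


# Toy reduced vectors: dimension 0 separates two groups, dimension 1 is constant.
reduced = [[0.0, 5.0], [0.1, 5.0], [9.9, 5.0], [10.0, 5.0]]
print(normalized_variances(reduced))
```

With this scaling, the variance of a dimension can be at most 0.25, which is reached when the points split evenly between the two extremes.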

In [5]:
DimensionVisualizer(
    c.get_reduced, 
    normalize=True
).display()

VBox(children=(HBox(children=(Button(button_style='primary', description='Update visualization', icon='refresh…

## Step4: Clustering
With the vectors in a reduced state, use a clustering algorithm to extract the clusters. HDBSCAN is a good starting point for larger datasets. If your dataset is not too big, you can try OPTICS: it typically produces smaller, more granular clusters, but it doesn't scale well to larger datasets.
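
Both HDBSCAN and OPTICS build on the density idea introduced by plain DBSCAN: points with enough close neighbours form clusters, and isolated points are labeled as noise. The sketch below implements plain DBSCAN, not HDBSCAN or OPTICS, and is not the code inside `Clustermachine`. It only illustrates where clusters and noise come from. The noise label `-1` follows a common convention and is an assumption here.

```python
import math

NOISE = -1  # assumption: noise is marked with -1


def dbscan(points, eps, min_samples):
    """Plain DBSCAN: returns one label per point, with -1 marking noise."""
    labels = [None] * len(points)  # None = not visited yet

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may still be claimed as a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # core point: keep growing the cluster
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=3))
# [0, 0, 0, 1, 1, 1, -1]
```

The last point has no neighbours within `eps`, so it ends up as noise.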

In [None]:
Clustermachine(
    c.get_reduced, 
    config_path = "clusterconfig.json",
    on_result = c.receive_vectorlabels, 
    stringgetter = c.get_input_list
).display()

Tab(children=(VBox(children=(Dropdown(description='Clustering:', options=(('HDBSCAN (density-based)', 'HDBSCAN…

## Step5: Noise reduction
This step does not produce new clusters, but it extends existing clusters with points that were labeled as noise. Two methods are provided:
1) Nearest neighbor (slow): this method iterates over the reduced-dimension vector embeddings, picks the noise point that lies closest to a non-noise point and assigns it to that point's cluster. It repeats this process until all noise has been assigned to a cluster.
2) Nearest neighbor (fast): this method takes all noise points and assigns each one to the cluster it lies closest to, in a single pass.
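
The difference between the two methods can be sketched in a few lines. This is an illustration under assumptions, not the implementation in `NoiseExtender`: noise is assumed to carry the label `-1`, "closest cluster" is interpreted as the cluster of the nearest already-clustered point, and the function names are hypothetical.

```python
import math

NOISE = -1  # assumption: noise points carry the label -1


def extend_fast(vectors, labels):
    """Single pass: each noise point takes the label of its nearest clustered point."""
    clustered = [i for i, label in enumerate(labels) if label != NOISE]
    if not clustered:
        return list(labels)  # nothing to extend from
    extended = list(labels)
    for i, label in enumerate(labels):
        if label == NOISE:
            nearest = min(clustered, key=lambda j: math.dist(vectors[i], vectors[j]))
            extended[i] = labels[nearest]
    return extended


def extend_slow(vectors, labels):
    """Iterative: repeatedly absorb the noise point closest to any clustered point."""
    extended = list(labels)
    clustered = [i for i, label in enumerate(extended) if label != NOISE]
    noise = [i for i, label in enumerate(extended) if label == NOISE]
    while noise and clustered:
        # Find the (noise, clustered) pair with the smallest distance.
        n, c = min(
            ((n, c) for n in noise for c in clustered),
            key=lambda pair: math.dist(vectors[pair[0]], vectors[pair[1]]),
        )
        extended[n] = extended[c]
        noise.remove(n)
        clustered.append(n)  # newly assigned points can attract further noise
    return extended


vectors = [(0.0,), (10.0,), (3.0,), (5.5,)]
labels = [0, 1, NOISE, NOISE]
print(extend_fast(vectors, labels))  # [0, 1, 0, 1]
print(extend_slow(vectors, labels))  # [0, 1, 0, 0]
```

In the toy data, the point at 5.5 is closer to the original cluster 1 (at 10.0), so the fast method assigns it there. The slow method first absorbs the point at 3.0 into cluster 0, after which 5.5 is closer to that newly assigned point and follows it. The slow variant compares every remaining noise point with every clustered point in each round, which is why it scales poorly.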

In [7]:
NoiseExtender(
    c.get_reduced, 
    c.get_cluster_labels, 
    "noise_extension_methods.json", 
    c.receive_denoised_results
).display()

VBox(children=(Dropdown(description='Extension method:', options=(('Nearest neighbor (slow, precise)', 'neares…

## Step6: Inspect clustering output
You can look for specific mentions from your original dataset and see which other mentions share the same cluster label. Alternatively, you can pick random labels and inspect them.
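
Both kinds of inspection boil down to grouping the input strings by label. The sketch below shows the idea with made-up data; `group_by_label` and `cluster_mates` are hypothetical helpers and not part of `ClusterInspectorGUI`.

```python
import random
from collections import defaultdict


def group_by_label(entries, labels):
    """Map every cluster label to the entries that carry it."""
    groups = defaultdict(list)
    for entry, label in zip(entries, labels):
        groups[label].append(entry)
    return dict(groups)


def cluster_mates(entries, labels, query):
    """Return all entries that share a cluster with an exact match of `query`."""
    groups = group_by_label(entries, labels)
    for entry, label in zip(entries, labels):
        if entry == query:
            return groups[label]
    return []  # query not found in the dataset


entries = ["Jan van Eyck", "Johannes de Eyck", "Hubert van Eyck", "J. v. Eyck"]
labels = [0, 0, 1, 0]
print(cluster_mates(entries, labels, "J. v. Eyck"))
# ['Jan van Eyck', 'Johannes de Eyck', 'J. v. Eyck']

# Pick a random label and list its members.
random_label = random.choice(sorted(set(labels)))
print(random_label, group_by_label(entries, labels)[random_label])
```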

In [8]:
ClusterInspectorGUI(
    c.get_input_list,
    c.get_reduced,
    c.get_cluster_labels,
    c.get_denoised_labels,
    c.get_denoised_sources
).display()

VBox(children=(Dropdown(description='Use labels:', options=(('Base labels', 'base'), ('Extended labels', 'exte…