## kmeans clustering using `faiss`

`faiss` ships with a fast kmeans implementation, so I thought I'd give it a go. kmeans isn't ideal for us, as we have to specify the number of clusters up front (and, with the approach used here, the cluster size too).

In [28]:
%load_ext autoreload
%autoreload 2

import utils

import faiss
import numpy as np

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
# load embeddings matrix using KGEmbeddingStore class
emb_store = utils.load_embedding_store()
X = emb_store.ent_embedding_matrix
dim = X.shape[1]

dim

800
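
faiss works on C-contiguous `float32` arrays, so it is worth checking `X` before training or indexing. A minimal sketch on a stand-in array (`X_toy` is hypothetical; the real `X` may well be in the right format already):

```python
import numpy as np

# Hypothetical stand-in for the embedding matrix; NumPy defaults to float64
X_toy = np.random.default_rng(0).normal(size=(4, 800))

# Convert to the dtype and memory layout faiss expects
X32 = np.ascontiguousarray(X_toy, dtype=np.float32)

print(X_toy.dtype, "->", X32.dtype, "| C-contiguous:", X32.flags["C_CONTIGUOUS"])
```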

In [6]:
# get cluster centroids using faiss
N_CENTROIDS = 50
N_ITER = 20
VERBOSE = True

kmeans = faiss.Kmeans(dim, N_CENTROIDS, niter=N_ITER, verbose=VERBOSE)
kmeans.train(X)

Sampling a subset of 12800 / 645565 for training
Clustering 12800 points in 800D to 50 clusters, redo 1 times, 20 iterations
  Preprocessing in 0.87 s
  Iteration 19 (0.83 s, search 0.79 s): objective=1718.5 imbalance=1.273 nsplit=0        

1718.496337890625
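
The `Sampling a subset of 12800 / 645565` line in the log is worth a note. As far as I know, faiss caps the training data at `max_points_per_centroid` points per centroid (256 by default), so with 50 centroids only 12,800 of the 645,565 embeddings are used to fit the centroids. A quick check of that arithmetic (the constant is the assumed faiss default, not something read from this notebook):

```python
# Assumed faiss default for ClusteringParameters.max_points_per_centroid
MAX_POINTS_PER_CENTROID = 256

n_points = 645565   # rows in X, taken from the training log
n_centroids = 50    # N_CENTROIDS

# faiss subsamples when there are more points than the cap allows
n_train = min(n_points, MAX_POINTS_PER_CENTROID * n_centroids)
print(f"{n_train} / {n_points} points used for training ({n_train / n_points:.1%})")
```

If training on more of the data is wanted, `faiss.Kmeans` should accept a larger `max_points_per_centroid` as a keyword argument, at the cost of longer training.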

In [36]:
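# get clusters of fixed size around each centroid
CLUSTER_SIZE = 30

# exact (brute-force) L2 index over all entity embeddings
index = faiss.IndexFlatL2(dim)
index.add(X)
# D: squared L2 distances, I: row indices into X; both have shape (N_CENTROIDS, CLUSTER_SIZE)
D, I = index.search(kmeans.centroids, CLUSTER_SIZE)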
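
Taking the `CLUSTER_SIZE` nearest neighbours of each centroid is not the same as a kmeans assignment: most entities end up in no cluster at all, and an entity can appear in more than one. Here is a rough NumPy sketch of what the exact search computes, plus a coverage/overlap check, on toy data (`top_k_l2`, `points` and `centroids` are hypothetical stand-ins, not faiss code):

```python
import numpy as np

def top_k_l2(queries, points, k):
    """Return (squared L2 distances, indices) of the k nearest points per query."""
    # ||q - p||^2 = ||q||^2 - 2 q.p + ||p||^2, computed for all pairs at once
    sq_dists = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * queries @ points.T
        + (points ** 2).sum(axis=1)
    )
    idxs = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, idxs, axis=1), idxs

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 8))   # toy stand-in for X
centroids = points[:3] + 0.01        # toy stand-in for kmeans.centroids
D_toy, I_toy = top_k_l2(centroids, points, k=5)

# How many distinct points are covered, and how many sit in more than one cluster?
unique_idxs, counts = np.unique(I_toy, return_counts=True)
coverage = unique_idxs.size / points.shape[0]
n_shared = int((counts > 1).sum())
print(f"coverage={coverage:.1%}, shared between clusters={n_shared}")
```

Running the same `np.unique` check on the real `I` would show how much of the collection the 50 x 30 clusters actually cover.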

In [37]:
# print clusters
for cluster_num, cluster_idxs in enumerate(I, start=1):
    # map row indices back to entity URIs, then look up their labels
    cluster_ents = emb_store.idxs_to_entities(cluster_idxs)
    cluster_label_mapping = utils.get_labels(cluster_ents)
    cluster_labels = [cluster_label_mapping.get(ent, "<no label>") for ent in cluster_ents]

    print(f"--- Cluster {cluster_num} ---")
    print("\n".join(f"{label} - {ent}" for label, ent in zip(cluster_labels, cluster_ents)))

--- Cluster 1 ---
Prototype ‘Manpack’ ground terminals, 1980-1989 - https://collection.sciencemuseumgroup.org.uk/objects/co8359823
Prototype ‘Manpack’ ground terminal, 1980-1989 - https://collection.sciencemuseumgroup.org.uk/objects/co8359824
First sealed-off travelling wave tube, 1945-1946 - https://collection.sciencemuseumgroup.org.uk/objects/co30803
Ferranti AC/DC Broadcast Receiver - https://collection.sciencemuseumgroup.org.uk/objects/co35745
Part of Rugby transatlantic telephone transmitter, 1927-1956 - https://collection.sciencemuseumgroup.org.uk/objects/co35831
Main frame, for Automatic Computing Engine (ACE) pilot model, 1949 - https://collection.sciencemuseumgroup.org.uk/objects/co62378
RCA Theremin, USA, 1929 - https://collection.sciencemuseumgroup.org.uk/objects/co8593059
Pye model 1108 broadcast receiver, Serial No. 6164 - https://collection.sciencemuseumgroup.org.uk/objects/co35928
Home-constructed Williamson amplifier, c. 1949 - https://collection.sciencemuseumgroup.org.