Simple script to create clusters from embeddings in word2vec format

GateNLP/cluster-embeddings

Tool for clustering embeddings

So far, just a simple script to run a bunch of sklearn clustering algorithms on embeddings in word2vec format.
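For orientation, the word2vec text format (also used by fastText `.vec` files) starts with a header line giving the vocabulary size and the vector dimension, followed by one line per word with its space-separated vector components. The following is a minimal sketch of a reader for that format; the function name `read_word2vec_text` is hypothetical and is not taken from the script itself.

```python
import io


def read_word2vec_text(stream):
    """Read embeddings in word2vec text format from an open text stream.

    Hypothetical helper for illustration, not part of cluster-embs.py.
    Returns (words, vectors), where vectors is a list of float lists.
    """
    header = stream.readline().split()
    n_words, dim = int(header[0]), int(header[1])  # n_words is informational only
    words, vectors = [], []
    for line in stream:
        parts = line.rstrip().split(" ")
        # Skip lines that do not have exactly one token plus `dim` numbers
        # (for example tokens that themselves contain a space).
        if len(parts) != dim + 1:
            continue
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, vectors


# Usage with an in-memory sample instead of a real .vec file
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3 \nworld 1 2 3\n")
words, vectors = read_word2vec_text(sample)
```

For a real file, the stream would typically be opened with `open(path, encoding="utf-8")`.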

Some results on usefulness:

Usage

Get usage information by running:

  • python3 ./python/cluster-embs.py -h

Currently the following clustering algorithms are supported (using the sklearn back-end). For some of the algorithms, information about the clusters is written to a file with extension info.json:

  • MiniBatchKMeans (default): k-means clustering using minibatch SGD. Info: cluster_centers, inertia, counts, n_iter
  • KMeans: k-means clustering. Info: cluster_centers, inertia, n_iter
  • AgglomerativeClusteringWard: agglomerative clustering using ward linkage
  • AgglomerativeClusteringAverageEuclidean: agglomerative clustering using average linkage
  • Birch
  • SpectralClustering: spectral clustering with rbf affinity
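The list above could map onto sklearn estimators roughly as in the sketch below. This is an illustration under assumptions, not the script's actual code: the names `make_estimator`, `cluster` and `write_info`, the `seed` parameter and the output file prefix are hypothetical, and `counts` is computed here from the cluster labels rather than read from an estimator attribute.

```python
import json

import numpy as np
from sklearn.cluster import (
    AgglomerativeClustering,
    Birch,
    KMeans,
    MiniBatchKMeans,
    SpectralClustering,
)


def make_estimator(name, k, seed=0):
    """Hypothetical mapping from the algorithm names above to sklearn estimators."""
    factories = {
        "MiniBatchKMeans": lambda: MiniBatchKMeans(n_clusters=k, n_init=3, random_state=seed),
        "KMeans": lambda: KMeans(n_clusters=k, n_init=10, random_state=seed),
        "AgglomerativeClusteringWard": lambda: AgglomerativeClustering(n_clusters=k, linkage="ward"),
        # Average linkage; sklearn's default distance here is Euclidean.
        "AgglomerativeClusteringAverageEuclidean": lambda: AgglomerativeClustering(n_clusters=k, linkage="average"),
        "Birch": lambda: Birch(n_clusters=k),
        "SpectralClustering": lambda: SpectralClustering(n_clusters=k, affinity="rbf", random_state=seed),
    }
    return factories[name]()


def cluster(vectors, name="MiniBatchKMeans", k=100, seed=0):
    """Fit the chosen estimator; return (labels, info) with info as a JSON-ready dict."""
    est = make_estimator(name, k, seed)
    labels = est.fit_predict(vectors)
    info = {}
    if name in ("MiniBatchKMeans", "KMeans"):
        info = {
            "cluster_centers": est.cluster_centers_.tolist(),
            "inertia": float(est.inertia_),
            "n_iter": int(est.n_iter_),
        }
    if name == "MiniBatchKMeans":
        # Assumption: cluster sizes derived from the labels.
        info["counts"] = np.bincount(labels, minlength=k).tolist()
    return labels, info


def write_info(info, prefix):
    """Write the info dict to <prefix>.info.json (the prefix is a hypothetical parameter)."""
    with open(prefix + ".info.json", "w", encoding="utf-8") as out:
        json.dump(info, out)
```

In this sketch, algorithms without centroids return an empty info dict, which mirrors the statement that cluster information is only written for some of the algorithms.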

For MiniBatchKMeans and KMeans, the 20 elements most similar to each of the k cluster centroids are stored in a file with the extension mostsimilar.json.
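One way to compute such a list is sketched below. It assumes cosine similarity between each centroid and every embedding; the similarity measure the script actually uses is not stated here, and the function name `most_similar_to_centroids` is hypothetical.

```python
import numpy as np


def most_similar_to_centroids(words, vectors, centers, topn=20):
    """For each centroid, return the topn (word, similarity) pairs.

    Assumption: similarity is cosine similarity. Hypothetical helper,
    not taken from cluster-embs.py.
    """
    vectors = np.asarray(vectors, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # L2-normalize rows so that a dot product equals the cosine similarity.
    v = vectors / np.maximum(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12)
    c = centers / np.maximum(np.linalg.norm(centers, axis=1, keepdims=True), 1e-12)
    n = min(topn, len(words))
    result = []
    # One centroid at a time keeps memory at O(vocabulary size)
    # instead of materializing a k x vocabulary matrix.
    for center in c:
        sims = v @ center
        best = np.argsort(-sims)[:n]
        result.append([(words[j], float(sims[j])) for j in best])
    return result


# Usage with toy data: one centroid close to the vector for "a"
words = ["a", "b", "c"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = most_similar_to_centroids(words, vectors, [[1.0, 0.1]], topn=2)
```

The resulting nested list can be passed to `json.dump` to produce a file comparable to mostsimilar.json, although the exact layout of that file is not documented here.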

Clustering times

| embeddings | machine | alg | k | elapsed time, total | elapsed time, clustering |
|---|---|---|---|---|---|
| fasttext wiki.en.vec | derwent | KMeans | 100 | 5:21:16 | ??? |
| fasttext wiki.en.vec | derwent | MiniBatchKMeans | 100 | 0:16:37 | 0:00:38 |
| fasttext wiki.en.vec | derwent | MiniBatchKMeans | 500 | 0:22:18 | 0:02:22 |
| fasttext wiki.bg.vec | derwent | MiniBatchKMeans | 100 | 0:02:27 | 0:00:13 |
| fasttext wiki.bg.vec | derwent | MiniBatchKMeans | 500 | 0:04:42 | 0:01:18 |
