CLUMPY is a parallel implementation of the Lloyd's k-means algorithm.
Clustering with CLUMPY is incredibly easy, as shown in this example:
from clumpy import CLUMPY
# create a default CLUMPY object (i.e., 2000 points in 2D) and 5 clusters
clumpy = CLUMPY(5)
#create clusters
clumpy.cluster()
#show clusters
clumpy.plot()
The example above will generate 2000 random points in 2 dimensions and create 5 clusters from them, which will be then plotted.
Do you want to use your own file containg points in any dimension? No problem, as shown in this example:
from clumpy import CLUMPY
# create a default CLUMPY object (i.e., 2000 points in 2D) and 5 clusters
clumpy = CLUMPY(5, file="datasets/202d")
#create clusters
clumpy.cluster()
#show clusters
clumpy.plot()
You can see the resulting clusters from the first example below:
In order to make CLUMPY works you will need:
- Linux OS
- Python 3
- Numpy
- Matplotlib
- SharedArray
- A multi-core architecture (optional, but highly reccomended)
Seed initialization methods implemented in CLUMPY are:
- Random points.
- K-means++
Termination conditions are:
- Unchanged centroids.
- Number of iterations.
- Clusters energy below a given threshold (todo).
Files are read using numpy.loadtxt()
, where you can specify your own delimiter
(see below for details).
CLUMPY can be used from command line, where:
python3 clumpy/clumpy.py --help
Will print the command line manual. Of course, you can import it as shown in the examples above.
This is the obtained speedup by a quad-core Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz, where n=5000
, d=2
, k=7
and processes
goes from 1 to 4 (each configuration is run 10 times and then averaged):
As we can see, we reach almost optimal speedup even by using a low-end processor. By using 4 processes, CLUMPY takes less than 3 seconds on average to obtain the optimal clusters for the configuration above!