CLUMPY: CLUstering with Multicores and PYthon

Introduction:

CLUMPY is a parallel implementation of the Lloyd's k-means algorithm.

Clustering with CLUMPY is incredibly easy, as shown in this example:

from clumpy import CLUMPY

# create a default CLUMPY object (i.e., 2000 points in 2D) and 5 clusters
clumpy = CLUMPY(5)
#create clusters
clumpy.cluster()
#show clusters
clumpy.plot()

The example above will generate 2000 random points in 2 dimensions and create 5 clusters from them, which will be then plotted.

Do you want to use your own file containg points in any dimension? No problem, as shown in this example:

from clumpy import CLUMPY

# create a default CLUMPY object (i.e., 2000 points in 2D) and 5 clusters
clumpy = CLUMPY(5, file="datasets/202d")
#create clusters
clumpy.cluster()
#show clusters
clumpy.plot()

You can see the resulting clusters from the first example below:

Requirements:

In order to make CLUMPY works you will need:

Linux OS
Python 3
Numpy
Matplotlib
SharedArray
A multi-core architecture (optional, but highly reccomended)

User Manual:

Seed initialization methods implemented in CLUMPY are:

Random points.
K-means++

Termination conditions are:

Unchanged centroids.
Number of iterations.
Clusters energy below a given threshold (todo).

Files are read using numpy.loadtxt(), where you can specify your own delimiter (see below for details).

CLUMPY can be used from command line, where:

python3 clumpy/clumpy.py --help

Will print the command line manual. Of course, you can import it as shown in the examples above.

Performance

This is the obtained speedup by a quad-core Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz, where n=5000, d=2, k=7 and processes goes from 1 to 4 (each configuration is run 10 times and then averaged):

As we can see, we reach almost optimal speedup even by using a low-end processor. By using 4 processes, CLUMPY takes less than 3 seconds on average to obtain the optimal clusters for the configuration above!

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
clumpy		clumpy
datasets		datasets
figures		figures
.gitignore		.gitignore
README.md		README.md
clumpy_file_example.py		clumpy_file_example.py
clumpy_random_example.py		clumpy_random_example.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

clumpy

clumpy

datasets

datasets

figures

figures

.gitignore

.gitignore

README.md

README.md

clumpy_file_example.py

clumpy_file_example.py

clumpy_random_example.py

clumpy_random_example.py

Repository files navigation

CLUMPY: CLUstering with Multicores and PYthon

Introduction:

Requirements:

User Manual:

Performance

About

Releases

Packages

Languages

LucaLovagnini/CLUMPY

Folders and files

Latest commit

History

Repository files navigation

CLUMPY: CLUstering with Multicores and PYthon

Introduction:

Requirements:

User Manual:

Performance

About

Topics

Resources

Stars

Watchers

Forks

Languages