
Mlj port #42


Merged

merged 15 commits on Apr 9, 2020
Changes from all commits
11 changes: 7 additions & 4 deletions Project.toml
@@ -4,16 +4,19 @@ authors = ["Bernard Brenyah", "Andrey Oskin"]
version = "0.1.0"

[deps]
Distances = "b4f34e82-e78d-54a5-968a-f98e89d6e8f7"
MLJModelInterface = "e80e1ace-859a-464e-9ed9-23947d8ae3ea"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"

[compat]
StatsBase = "0.32"
julia = "1.3"
StatsBase = "0.32, 0.33"
julia = "1.3, 1.4"

[extras]
MLJBase = "a7f614a8-145f-11e9-1d2a-a57a1082229d"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
Suppressor = "fd094767-a336-5f1f-9728-57cf17d0bbfb"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test", "Random", "Suppressor"]
test = ["Test", "Random", "Suppressor", "MLJBase"]
5 changes: 5 additions & 0 deletions README.md
@@ -10,6 +10,7 @@
_________________________________________________________________________________________________________

## Table Of Contents

1. [Documentation](#Documentation)
2. [Installation](#Installation)
3. [Features](#Features)
@@ -18,13 +19,15 @@
_________________________________________________________________________________________________________

### Documentation

- Stable Documentation: [![Stable](https://img.shields.io/badge/docs-stable-blue.svg)](https://PyDataBlog.github.io/ParallelKMeans.jl/stable)

- Experimental Documentation: [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://PyDataBlog.github.io/ParallelKMeans.jl/dev)

_________________________________________________________________________________________________________

### Installation

You can grab the latest stable version of this package by simply running the following in Julia.
Don't forget to enter Julia's package manager with `]`.

```julia
pkg> dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
```

Don't forget to check out the experimental branch and you are good to go with bleeding-edge features (and breakages)!

```bash
git checkout experimental
```

_________________________________________________________________________________________________________

### Features
8 changes: 4 additions & 4 deletions docs/Manifest.toml
@@ -19,9 +19,9 @@ version = "0.8.1"

[[Documenter]]
deps = ["Base64", "Dates", "DocStringExtensions", "InteractiveUtils", "JSON", "LibGit2", "Logging", "Markdown", "REPL", "Test", "Unicode"]
git-tree-sha1 = "d497bcc45bb98a1fbe19445a774cfafeabc6c6df"
git-tree-sha1 = "646ebc3db49889ffeb4c36f89e5d82c6a26295ff"
uuid = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
version = "0.24.5"
version = "0.24.7"

[[InteractiveUtils]]
deps = ["Markdown"]
@@ -51,9 +51,9 @@ uuid = "a63ad114-7e13-5084-954f-fe012c677804"

[[Parsers]]
deps = ["Dates", "Test"]
git-tree-sha1 = "d112c19ccca00924d5d3a38b11ae2b4b268dda39"
git-tree-sha1 = "0c16b3179190d3046c073440d94172cfc3bb0553"
uuid = "69de0a69-1ddd-5017-9359-2bf0b02dc9f0"
version = "0.3.11"
version = "0.3.12"

[[Pkg]]
deps = ["Dates", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Test", "UUIDs"]
52 changes: 28 additions & 24 deletions docs/src/index.md
@@ -5,8 +5,9 @@ Depth = 4

## Motivation

It's actually a funny story that led to the development of this package.
What started off as a personal toy project trying to reconstruct the K-Means algorithm in native Julia blew up after a heated discussion on the Julia Discourse forum when I asked for Julia optimization tips. Long story short, the Julia community is an amazing one! Andrey offered his help and together we decided to push the speed limits of Julia with a parallel implementation of the most famous clustering algorithm. The initial results were mind-blowing, so we decided to tidy up the implementation and share it with the world as a maintained Julia package.

Say hello to `ParallelKMeans`!

@@ -15,16 +16,18 @@ This package aims to utilize the speed of Julia and parallelization (both CPU &
In short, we hope this package will eventually mature as the "one stop" shop for everything KMeans on both CPUs and GPUs.

## K-Means Algorithm Implementation Notes

Since Julia is a column-major language, the input (design matrix) is expected by the package in the following format:

- Design matrix X of size n×m, where the i-th column of X (`X[:, i]`) is a single data point in n-dimensional space.
- Thus, the rows of the design matrix represent the feature space, with the columns representing all the training examples in this feature space (a short conversion sketch follows this list).
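
As a quick illustration (a sketch based on the description above, not taken verbatim from the package docs), data stored in the more common observations-as-rows layout can be converted into the expected layout like this:

```julia
using ParallelKMeans

# 100 observations with 4 features each, stored the "usual" way (rows = observations)
obs_as_rows = rand(100, 4)

# Transpose so that each column becomes one data point, and `collect` to
# materialise a dense Matrix rather than a lazy adjoint
X = collect(obs_as_rows')   # size(X) == (4, 100)

result = kmeans(X, 3)       # cluster the 100 points into 3 groups
```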

One of the pitfalls of the K-Means algorithm is that it can fall into a local minimum.
This implementation inherits that problem, like every implementation does.
As a result, it is useful in practice to restart it several times to get good results.

## Installation

You can grab the latest stable version of this package from the Julia registries by simply running:

*NB:* Don't forget to enter Julia's package manager with `]`.
```julia
dev git@github.com:PyDataBlog/ParallelKMeans.jl.git
```

Don't forget to check out the experimental branch and you are good to go with bleeding-edge features (and breakages)!

```bash
git checkout experimental
```

## Features

- Lightning-fast implementation of the K-Means clustering algorithm even on a single thread in native Julia.
- Support for a multi-threaded implementation of the K-Means clustering algorithm (a short threading sketch follows this list).
- 'K-means++' initialization for faster and better convergence.
- Modified version of Elkan's triangle inequality to speed up the K-Means algorithm.
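
A minimal sketch of the threading switch (this assumes the exported `kmeans` forwards the `n_threads` keyword that appears in the `kmeans!` docstring further down this diff):

```julia
using ParallelKMeans

X = rand(10, 10_000)                  # 10 features × 10_000 points, columns are observations

single = kmeans(X, 5; n_threads = 1)  # force a single-threaded run
multi  = kmeans(X, 5)                 # defaults to all available Julia threads
```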


## Pending Features

- [X] Implementation of [Hamerly's algorithm](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster).
- [ ] Full implementation of the triangle inequality based on [Elkan 2003, "Using the Triangle Inequality to Accelerate k-Means"](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf).
- [ ] Implementation of [Geometric methods to accelerate k-means algorithm](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf).
- [ ] Support for DataFrame inputs.
@@ -63,9 +68,9 @@
- [ ] Improved Documentation
- [ ] More benchmark tests


## How To Use

Taking advantage of Julia's brilliant multiple dispatch system, the package exposes a very easy-to-use API.

```julia
using ParallelKMeans
```

The main design goal is to offer all available variations of the KMeans algorithm:

```julia
some_results = kmeans([algo], input_matrix, k; kwargs)

# example
r = kmeans(Lloyd(), X, 3) # same result as the default
```

```julia
r.iterations # number of elapsed iterations
r.converged # whether the procedure converged
```

### Supported KMeans algorithm variations

- [Lloyd()](https://cs.nyu.edu/~roweis/csc2515-2006/readings/lloyd57.pdf)
- [Hamerly()](https://www.researchgate.net/publication/220906984_Making_k-means_Even_Faster) (see the usage sketch after this list)
- [Geometric()](http://cs.baylor.edu/~hamerly/papers/sdm2016_rysavy_hamerly.pdf) - (Coming soon)
- [Elkan()](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf) - (Coming soon)
- [MiniBatch()](https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf) - (Coming soon)
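
As a sketch of switching between variants (only `Lloyd`, `LightElkan` and `Hamerly` are exported at this point in the diff; the others are marked as coming soon):

```julia
using ParallelKMeans

X = rand(5, 1_000)

r_default = kmeans(X, 4)             # Lloyd's algorithm is the default
r_hamerly = kmeans(Hamerly(), X, 4)  # explicitly request Hamerly's accelerated variant
```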


### Practical Usage Examples

Some of the common usage examples of this package are as follows:

#### Clustering With A Desired Number Of Groups

```julia
using ParallelKMeans, RDatasets, Plots

# load the data
iris = dataset("datasets", "iris");

# features to use for clustering
features = collect(Matrix(iris[:, 1:4])');

# various artifacts can be accessed from the result, i.e. assigned labels, cost value, etc.
result = kmeans(features, 3);

# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments)
# (additional styling keyword arguments are collapsed in this diff view)
```
![Image description](iris_example.jpg)

#### Elbow Method For The Selection Of The Optimal Number Of Clusters

```julia
using ParallelKMeans

# X is your n×m design matrix (columns are observations); compute the total
# cost for each candidate number of clusters
c = [ParallelKMeans.kmeans(X, i; tol=1e-6, max_iters=300, verbose=false).totalcost for i = 2:10]

```
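
One way to read off the elbow is simply to plot the total cost against the number of clusters. This is a sketch assuming `Plots.jl` (already used in the clustering example above) and the `c` vector computed in the snippet above:

```julia
using Plots

ks = 2:10   # candidate numbers of clusters, matching the comprehension above
plot(ks, c, xlabel = "k", ylabel = "total cost", marker = :circle, legend = false)
```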


## Benchmarks

Currently, this package is benchmarked against similar implementations in both Python and Julia. All reproducible benchmarks can be found in the [ParallelKMeans/extras](https://github.com/PyDataBlog/ParallelKMeans.jl/tree/master/extras) directory. More tests in various languages are planned beyond the initial release version (`0.1.0`).

*Note*: All benchmark tests are made on the same computer to help eliminate any bias.

The benchmark speed tests are based on the search for the optimal number of clusters using the [Elbow Method](https://en.wikipedia.org/wiki/Elbow_method_(clustering)), since this is a practical use case for most practitioners employing the K-Means algorithm.

### Benchmark Results

![benchmark_image.png](benchmark_image.png)


_________________________________________________________________________________________________________

| 1 million (ms) | 100k (ms) | 10k (ms) | 1k (ms) | package | language |
|---------------:|----------:|---------:|--------:|:--------|:---------|

_________________________________________________________________________________________________________

## Release History

- 0.1.0 Initial release


## Contributing

Ultimately, we see this package as potentially the one-stop shop for everything related to the KMeans algorithm and its speed-up variants. We are open to new implementations and ideas from anyone interested in this project.

Detailed contribution guidelines will be added in upcoming releases.
3 changes: 3 additions & 0 deletions src/ParallelKMeans.jl
@@ -1,13 +1,16 @@
module ParallelKMeans

using StatsBase
using MLJModelInterface
import Base.Threads: @spawn
import Distances

include("seeding.jl")
include("kmeans.jl")
include("lloyd.jl")
include("light_elkan.jl")
include("hamerly.jl")
include("mlj_interface.jl")

export kmeans
export Lloyd, LightElkan, Hamerly
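
For context, a hedged sketch of how the new `mlj_interface.jl` glue might be exercised from MLJBase (which this PR adds as a test dependency). The model name `KMeans` and its `k` hyperparameter are illustrative assumptions; only the wiring shown in the diff above is confirmed by the PR:

```julia
using MLJBase, ParallelKMeans

# MLJ models consume Tables.jl-compatible data rather than raw matrices
X = MLJBase.table(rand(100, 4))

model = ParallelKMeans.KMeans(k = 3)  # hypothetical constructor defined in mlj_interface.jl
mach  = machine(model, X)
fit!(mach)

# Depending on how the interface is implemented, cluster labels are typically
# recovered via `predict` and/or `transform`
labels = predict(mach, X)
```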
2 changes: 1 addition & 1 deletion src/kmeans.jl
@@ -173,7 +173,7 @@ end


"""
Kmeans!(alg::AbstractKMeansAlg, containers, design_matrix, k; n_threads = nthreads(), k_init="k-means++", max_iters=300, tol=1e-6, verbose=true)
Kmeans!(alg::AbstractKMeansAlg, containers, design_matrix, k; n_threads = nthreads(), k_init="k-means++", max_iters=300, tol=1e-6, verbose=false)

Mutable version of `kmeans` function. Definition of arguments and results can be
found in `kmeans`.
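
A small sketch of what the changed default means in practice (assuming the exported `kmeans` wrapper shares the `verbose = false` default documented here):

```julia
using ParallelKMeans

X = rand(3, 1_000)

r_quiet = kmeans(X, 4)                  # no per-iteration output by default
r_loud  = kmeans(X, 4; verbose = true)  # opt back in to convergence logging
```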