Change output type of clusters-kmeans #850

DenisCarriere · 2017-07-20T04:39:07Z

Updates to `@turf/clusters-kmeans`

Outputs simple FeatureCollection (with additional properties cluster)
Dropped numberOfClusters can\'t be greater than the number of points (number of Clusters fallbacks to the count instead of throwing an error since the user might be passing multiple featureCollections and the total number of features is unknown/variable)
Update typescript definition
Update tests
Dropped making centroids in index.js
Added clone to prevent any input mutation
Add Centroid & Concave polygons to test outputs using clusterEach (visual/testing purposes only)
Added Input mutation tests & add mutate param to allow input to be mutated
(Optional) Allow user input to be mutated via parameter (mutation should be disabled by default)?

JSDocs

/**
 * Takes a set of {@link Point|points} and partition them into clusters using the k-mean .
 * It uses the [k-means algorithm](https://en.wikipedia.org/wiki/K-means_clustering)
 *
 * @name clustersKmeans
 * @param {FeatureCollection<Point>} points to be clustered
 * @param {number} [numberOfClusters=Math.sqrt(numberOfPoints/2)] numberOfClusters that will be generated
 * @param {boolean} [mutate=false] allows GeoJSON input to be mutated (significant performance increase if true)
 * @returns {FeatureCollection<Point>} Clustered Points with an additional two properties associated to each Feature:
 * - {number} cluster - the associated clusterId
 * - {[number, number]} centroid - Centroid of the cluster [Longitude, Latitude]
 * @example
 * // create random points with random z-values in their properties
 * var points = turf.random('point', 100, {
 *   bbox: [0, 30, 20, 50]
 * });
 * var numberOfClusters = 7;
 * var clustered = turf.clustersKmeans(points, numberOfClusters);
 *
 * //addToMap
 * var addToMap = clustered;
 */
module.exports = function (points, numberOfClusters, mutate) {

Example

DenisCarriere · 2017-07-20T04:58:08Z

@morganherlocker These changes might help understand the other modules (getCluster, clusterEach, clusterReduce) might of been confusing if you looked at those methods without these changes.

stebogit · 2017-07-20T05:35:01Z

I love the colors in the test output, so festive! 🎉 😄

stebogit · 2017-07-20T07:14:59Z

@DenisCarriere although the module's output might still be acceptable, there is an issue I have been thinking about in the last few days.
Currently the choice of the initial centroids is not random as it's supposed to be in the actual K-means algorithm; in fact I set them arbitrarily as the first numberOfClusters points of the input. This guarantees a consistent output for each input (thus all the tests pass).
This algorithm though is known to be extremely fast (compared to other algorithms like DBSCAN for instance) but also likely to converge to a local optimum (basically a "wrong" clustering). This in practice means the output is generally not the same.
There are some solutions that attempt to improve the probability for the algorithm to reach the global optimum selecting an appropriate set of initial centroids, so I suggested the author of skmeans to add the K-means++ algorithm (which seems one of the most popular and effective) to the library. That feature is now available, although it doesn't seem to bring significant improvements in our test cases (maybe because the implementation has not been extensively tested or maybe because our tests are not enough complex to appreciate its benefit).

The above to say that I believe we should at least use the default random selection of initial centroids (provided by skmeans) or use the optional k-means++ algorithm to truly implement the algorithm that names this Turf module. However this would require a different smart (and less pretty 😞 😅 ) way to test the output since it's most likely to vary all the time. I believe for this purpose the new @turf/clusters module will come in handy.

We might define valid an output that:

have the expected amount of centroids
in which (if I'm not mistaken) the centroids coincide to the centers of mass of the correspondent cluster
labels all the points

which is basically all you can/should expect from k-means.

stebogit

neat! 👍

stebogit · 2017-07-20T07:39:53Z

packages/turf-clusters-kmeans/index.js

+    if (numberOfClusters > count) numberOfClusters = count;
+
+    // Clone points to prevent any mutations
+    points = clone(points, true);


Shouldn't we let the user decide if the input should change or (by default) not, like we did for the transform modules? I believe it's likely the user wants the input to be added the clustering properties

We could always add a mutate parameter (to allow input mutations).

It will significantly increase the performance if the user doesn't mind having the input be added.

We might be able to do handle this without the use of clone, but as a safe measure it was added.

Benchmark results

without clone

fiji x 260,987 ops/sec ±1.03% (87 runs sampled) many-points x 173 ops/sec ±7.86% (77 runs sampled) points-with-properties x 245,562 ops/sec ±1.54% (86 runs sampled) points1 x 71,572 ops/sec ±0.86% (88 runs sampled) points2 x 40,784 ops/sec ±1.06% (86 runs sampled)

with clone (~6 times slower)

fiji x 41,566 ops/sec ±1.18% (89 runs sampled) many-points x 145 ops/sec ±2.42% (79 runs sampled) points-with-properties x 38,453 ops/sec ±1.22% (83 runs sampled) points1 x 12,598 ops/sec ±1.09% (91 runs sampled) points2 x 8,124 ops/sec ±0.97% (89 runs sampled)

Yes that's what I had in mind :+1

👍 yep, I didn't realize how "slow" it was after adding clone, that JSON.parse + JSON.stringify really takes a toll on the performance benchmarks.

stebogit · 2017-07-20T07:43:01Z

packages/turf-clusters-kmeans/index.js

-        points: points,
-        centroids: featureCollection(centroids)
-    };
+    return points;


I see the reasons why we want to output GeoJSON objects, however I believe the centroids in this particular clustering method are relevant and would be beneficial to the user having them already available in the output (as it's a common pattern in ALL the kmeans library available).
Couldn't we add them as a centroids property of the output? It would still be a valid GeoJSON format, wouldn't it?

👍 How about... we add [number, number] as a new centroid property, that way it doesn't deviate too much from the GeoJSON specification.

DenisCarriere · 2017-07-20T13:54:48Z

@stebogit Added the centroid back to the test cases using the (new) centroid property.

stebogit · 2017-07-21T05:01:06Z

@DenisCarriere did you read my comment? Just curious about your point of view.

DenisCarriere · 2017-07-21T05:32:55Z

That was a big paragraph... :)

You might of not noticed it, but I started added bullet points of To-Do's in the first comment #850 (comment)

Things I might of missed:

suggested the author of skmeans to add the K-means++ algorithm (which seems one of the most popular and effective) to the library. That feature is now available, although it doesn't seem to bring significant improvements in our test cases

It would be interesting to see the difference between K-means++, I personally don't know the difference, I'd have to dig deeper into this.

👍 We can include the improved K-means++

default random selection of initial centroids
However this would require a different smart (and less pretty 😞 😅 ) way to test the output since it's most likely to vary all the time.

We should use the default random selection of centroids if it's recommended by K-means to do so.

As for the test cases, I'm sure we can figure out a clever way to test this, we can still output the results using write.sync() except we don't need to test if they are an exact match, we could do the following:

Test if clustered has Features (clustered.length > 0)
Test if first feature is not undefined
Does contain a cluster property

And the ones you mentioned:

have the expected amount of centroids
in which (if I'm not mistaken) the centroids coincide to the centers of mass of the correspondent cluster
labels all the points

Those should be enough tests to make sure the module is working correctly.

Try to summarize them in a concise "to-do" list and we can tackle them in a new PR.

stebogit · 2017-07-21T08:16:33Z

Sorry @DenisCarriere I didn't noticed your changes.
I'll think about how to approach the tests and create the PR with the summary of the above. 👍

Change output type of clusters-kmeans

5466e27

DenisCarriere requested a review from stebogit July 20, 2017 04:39

DenisCarriere added this to the 4.6.0 milestone Jul 20, 2017

DenisCarriere added new-module enhancement and removed new-module labels Jul 20, 2017

Fix clone require

4317194

DenisCarriere self-assigned this Jul 20, 2017

stebogit approved these changes Jul 20, 2017

View reviewed changes

DenisCarriere added 3 commits July 20, 2017 09:07

Add tests -- prevent input mutation

dd43078

Add centroid property & mutate param

2433f76

Add Centroid to tests

8afdd1b

DenisCarriere added 3 commits July 20, 2017 10:12

Merge branch 'master' into update-clusters-kmeans

7908556

Add concave & centroid using clusterEach

651830d

Merge branch 'master' into update-clusters-kmeans

9053dee

DenisCarriere merged commit ab7b472 into master Jul 21, 2017

DenisCarriere deleted the update-clusters-kmeans branch July 21, 2017 04:01

DenisCarriere mentioned this pull request Aug 2, 2017

New minor release! Turf 4.6.0 🎉 #884

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change output type of clusters-kmeans #850

Change output type of clusters-kmeans #850

DenisCarriere commented Jul 20, 2017 •

edited

DenisCarriere commented Jul 20, 2017

stebogit commented Jul 20, 2017

stebogit commented Jul 20, 2017 •

edited

stebogit left a comment

stebogit Jul 20, 2017

DenisCarriere Jul 20, 2017

stebogit Jul 20, 2017

DenisCarriere Jul 20, 2017

stebogit Jul 20, 2017 •

edited

DenisCarriere Jul 20, 2017

DenisCarriere commented Jul 20, 2017

stebogit commented Jul 21, 2017

DenisCarriere commented Jul 21, 2017

stebogit commented Jul 21, 2017

Change output type of clusters-kmeans #850

Change output type of clusters-kmeans #850

Conversation

DenisCarriere commented Jul 20, 2017 • edited

Updates to @turf/clusters-kmeans

JSDocs

Example

DenisCarriere commented Jul 20, 2017

stebogit commented Jul 20, 2017

stebogit commented Jul 20, 2017 • edited

stebogit left a comment

Choose a reason for hiding this comment

stebogit Jul 20, 2017

Choose a reason for hiding this comment

DenisCarriere Jul 20, 2017

Choose a reason for hiding this comment

Benchmark results

stebogit Jul 20, 2017

Choose a reason for hiding this comment

DenisCarriere Jul 20, 2017

Choose a reason for hiding this comment

stebogit Jul 20, 2017 • edited

Choose a reason for hiding this comment

DenisCarriere Jul 20, 2017

Choose a reason for hiding this comment

DenisCarriere commented Jul 20, 2017

stebogit commented Jul 21, 2017

DenisCarriere commented Jul 21, 2017

stebogit commented Jul 21, 2017

DenisCarriere commented Jul 20, 2017 •

edited

Updates to `@turf/clusters-kmeans`

stebogit commented Jul 20, 2017 •

edited

stebogit Jul 20, 2017 •

edited