Implement `@turf/clusters-dbscan` module #812

DenisCarriere · 2017-06-21T22:09:56Z

First draft of `@turf/clusters-dbscan` Ref. #811

Note: this module might be merged into clusters and support both distance & numberOfClusters params (still up in the air).

Compared to the kmeans cluster this is roughly 5x-50x faster (the kdbush index helps a lot; performance will most likely slow down once weighted clusters are implemented).

To-Do

Translate existing GeoJSON properties
Weighted clusters based on distance (right now the first point from each new cluster is the "center")
Not too sure if we should be adding centroid/points in the outputs (instead of simply a FeatureCollection<Point>). Using the cluster property param is already enough to be able to create a centroid/center/centerOfMass.
Only supports kilometers as distance

More To-Dos

change library name to clusters-dbscan
add a dbscan property
drop the centroid
Prevent input mutation

Examples

points1.geojson

points2.geojson

stebogit · 2017-07-05T06:11:41Z

@DenisCarriere how can I help?

Not too sure if we should be adding centroid/points in the outputs (instead of simply a FeatureCollection). Using the cluster property param is already enough to be able to create a centroid/center/centerOfMass.

I believe including the centroids in the output might actually be useful, as sometimes you are interested in those points, and calculating them while processing the points is faster than doing it afterwards.

DenisCarriere · 2017-07-05T13:10:11Z

👍 Agreed about including the centroid.

As for help, when the cluster IDs are being associated, each point only matches the first cluster found and then skips all the other ones. What would need to happen is to sort/group the clusters by closest distance (not the first match).
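A minimal sketch of the closest-match idea, assuming planar `[x, y]` coordinates and a precomputed list of cluster centers. `assignToClosest` and its arguments are hypothetical names for illustration, not part of the module:

```javascript
// Hypothetical helper: assign each point to the closest center within
// maxDistance, instead of to the first center that happens to match.
// Coordinates are treated as planar [x, y] pairs for simplicity.
function assignToClosest(points, centers, maxDistance) {
    return points.map(function (point) {
        var best = -1;
        var bestDistance = Infinity;
        centers.forEach(function (center, id) {
            var dx = point[0] - center[0];
            var dy = point[1] - center[1];
            var distance = Math.sqrt(dx * dx + dy * dy);
            if (distance <= maxDistance && distance < bestDistance) {
                best = id;
                bestDistance = distance;
            }
        });
        return best; // -1 means no center is within maxDistance
    });
}
```

A real implementation would use a geographic distance and a spatial index rather than this brute-force scan.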

stebogit · 2017-07-05T16:04:19Z

@DenisCarriere I don't know if this is exactly the issue you were referring to, but I feel there is something not right with the approach we are using for this clustering.
Take a look at the gif image that shows the greedy clustering algorithm (which we are basically using here) of supercluster; I added a few points to show the issue I'm referring to:

As you can see, all points are grouped (which is the goal of that clustering method); however, the green points 2 and 3 technically belong to two different maxDistance circles/clusters, but they are associated with the (maxDistance) cluster that is calculated first.

The problem, I think, is that for @turf/clusters-distance the circles (i.e. clusters) should not overlap, but it's impossible to completely cover an area using non-overlapping circles...

DenisCarriere · 2017-07-05T21:02:41Z

Yes this is the issue I was referencing, I actually used supercluster as inspiration for this module (removed all the WebGL & vector tile grouping).

What if we define the cluster param as clusters: Array<number> instead?

That way the points 2 & 3 would have this as a param clusters: [2, 3]

This approach might help solve any ambiguous cluster matching.

Points that only have 1 cluster would still be an Array except with only 1 number.
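As a rough illustration of the `clusters: Array<number>` idea, the sketch below collects every cluster whose center lies within `maxDistance` of a point. It assumes planar coordinates, and `matchingClusters` is a hypothetical name:

```javascript
// Hypothetical sketch: collect every cluster whose center lies within
// maxDistance of the point, so ambiguous points keep all candidates.
function matchingClusters(point, centers, maxDistance) {
    var clusters = [];
    centers.forEach(function (center, id) {
        var dx = point[0] - center[0];
        var dy = point[1] - center[1];
        if (Math.sqrt(dx * dx + dy * dy) <= maxDistance) clusters.push(id);
    });
    return clusters;
}
```

A point covered by a single circle still gets an array, just with one entry.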

@stebogit thoughts?

DenisCarriere · 2017-07-05T21:05:46Z

By the way... Thanks @mourner for another great library 👍 (supercluster)

stebogit · 2017-07-06T16:08:07Z

@DenisCarriere I think we should better define the expected result of the module.
It seems to me that we are kind of mixing the (current) @turf/cluster functionality/output (a partitional clustering) with the supercluster one (a hierarchical clustering).

What if we assigned the clusters (array) parameter to each point, where the array contains the indexes/references of all the points that are within the maxDistance?
There wouldn't be an actual centroid for each cluster, as actually there wouldn't be any cluster (so this method might not be called clusters-)...

I have no idea if this would be of any utility though.

DenisCarriere · 2017-07-06T16:13:29Z

Well how would one go about to create clusters based on a distance?

Right now we can create clusters based on kmeans, but if you wanted to cluster points based on a maxDistance, then this would be the module for it.

No clue what the perfect outcome should be, however I'm sure we can come up with a creative way to make such module.

removed id attribute from output points;

DenisCarriere · 2017-07-07T18:46:47Z

packages/turf-clusters-distance/index.js

@@ -25,17 +25,12 @@ module.exports = function (points, maxDistance) {
    collectionOf(points, 'Point', 'Input must contain Points');

    // Create index
-    const load = points.features.map(function (point, index) {


👍 Love it! I think my first approach used the id fields to do the matching, however I might have dropped that workflow midway and forgot to remove the .map().

@stebogit Good catch!

DenisCarriere · 2017-07-07T18:47:18Z

packages/turf-clusters-distance/bench.js

+ * points-with-properties: 0.164ms
+ * points1: 0.087ms
+ * points2: 0.694ms
+ * fiji x 1,320,371 ops/sec ±1.72% (80 runs sampled)


🚗 💨 Zoom Zoom!

stebogit · 2017-07-11T08:01:16Z

Well how would one go about to create clusters based on a distance?
Right now we can create clusters based on kmeans, but if you wanted to cluster points based on a maxDistance, then this would be the module for it.

@DenisCarriere I've been thinking a lot about this clustering problem and I keep wondering: maxDistance from what? There's no reference from which to calculate the distance.

Should we add a (required) centroids input and assign a centroid/cluster to each point within maxDistance from that centroid?
This would not necessarily assign a cluster to all the points in the set (see partial clustering).
However, would this be a useful/common clustering method, popular enough to deserve a dedicated module?

In my research on clustering methods I keep seeing these three as most popular (i.e. I guess mostly used/useful) methods:

K-means (each point of a cluster is closer to its own centroid than to any other centroid, thus this is a clustering method based on distance)
Agglomerative Hierarchical Clustering (similar to supercluster and the current implementation of this module)
DBSCAN (density based, clusters are defined as regions of higher density, i.e. number of points per unit area, than the remainder of the data set)

DenisCarriere · 2017-07-11T20:06:51Z

Thanks @stebogit for that initial research.

For me, my initial intent with this distance cluster would most likely reflect DBSCAN's clustering algorithm. This approach makes the most sense to me and would be "relatively" easy to implement using geokdbush, I think I've already got an idea how to do it.

Using this approach would probably solve your question about (maxDistance from what?)

In this diagram, minPts = 4. Point A and the other red points are core points, because the area surrounding these points in an ε radius contain at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly-reachable.
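To make the core/edge/noise distinction concrete, here is a minimal DBSCAN sketch. It is not the module's code: it assumes planar `[x, y]` points with Euclidean distance and a brute-force neighbor search, and the names `dbscan`, `epsilon` and `minPts` are chosen for illustration only.

```javascript
// Minimal DBSCAN sketch on planar [x, y] points (not the module's code).
// Returns one label per point: {cluster: number, dbscan: 'core' | 'edge'}
// or {dbscan: 'noise'} for points that belong to no cluster.
function dbscan(points, epsilon, minPts) {
    var labels = points.map(function () { return null; });
    var cluster = -1;

    // Indexes of all points within epsilon of points[i], including i itself
    function neighbors(i) {
        var found = [];
        for (var n = 0; n < points.length; n++) {
            var dx = points[i][0] - points[n][0];
            var dy = points[i][1] - points[n][1];
            if (Math.sqrt(dx * dx + dy * dy) <= epsilon) found.push(n);
        }
        return found;
    }

    for (var i = 0; i < points.length; i++) {
        if (labels[i] !== null) continue; // already visited
        var seeds = neighbors(i);
        if (seeds.length < minPts) {
            labels[i] = {dbscan: 'noise'}; // may be upgraded to 'edge' later
            continue;
        }
        cluster++;
        labels[i] = {cluster: cluster, dbscan: 'core'};
        // Expand the cluster through every reachable core point
        for (var k = 0; k < seeds.length; k++) {
            var j = seeds[k];
            if (labels[j] !== null && labels[j].dbscan !== 'noise') continue;
            var reachable = neighbors(j);
            if (reachable.length >= minPts) {
                labels[j] = {cluster: cluster, dbscan: 'core'};
                seeds = seeds.concat(reachable);
            } else {
                labels[j] = {cluster: cluster, dbscan: 'edge'};
            }
        }
    }
    return labels;
}
```

With `minPts = 4`, a tight group of four points is labelled `core`, a point reachable from only one of them becomes `edge`, and an isolated point stays `noise`.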

@stebogit Are you on the same page if we use the DBSCAN approach for @turf/clusters-distance?

CC: @morganherlocker & @mourner Feel free to chime in anytime if you feel this is going in the wrong direction.

DenisCarriere · 2017-07-12T06:33:41Z

@stebogit I couldn't help myself and got this done in one night.

Check out the new implementation, it's pretty solid and it uses some of the DBSCAN logic (except that noise points are not returned; they are simply excluded).

Next Steps

Optimize code for performance
Add maxPoints as param??

Many Points

Clusters will keep growing as long as clusters are within maximum distance. Used minPoints=3 to exclude smaller clusters.

Points 2

Outliers can be identified here by showing clusters of a single point (can be removed if minPoints param is defined greater than 1).

stebogit · 2017-07-12T09:04:30Z

👍 👏 @DenisCarriere I'll take a look at this tomorrow.

Quick thought, we could call this module @turf/clusters-density, to differentiate it from kmeans which does cluster points based on distance from the centroid.
Or we might just stick with the algorithm name, which is probably more clear for everybody, so renaming @turf/clusters to @turf/clusters-kmeans and this to @turf/clusters-dbscan.

DenisCarriere · 2017-07-12T15:34:03Z

👍 For @turf/clusters-density name, I don't think the average user would know what dbscan means, keeping the module names simple is best.

👍 renaming @turf/clusters to @turf/clusters-kmeans, that way the cluster process is better defined/scoped by using the name of the module.

DenisCarriere · 2017-07-12T15:36:45Z

The reason why I wouldn't call it @turf/clusters-dbscan is because this is more of an "inspiration" of dbscan and not a complete implementation of it.

We can always improve on this at a later date...

stebogit · 2017-07-14T15:55:26Z

@DenisCarriere why don't we return this instead:

return {
    points: FeatureCollection<Points>, // with `clusterId` property
    edges: Array<Array<number>>, // edges ids
    centroids: Array<Array<number>>, // centroids ids
    noise: Array<Array<number>> // noise ids
};

This would speed up the calculation (no feature creation) and slim the output, which seems now a little bloated to me, while still allowing iteration through each cluster or point type.

Edit:
With this in mind, I would probably include a clusters: Array<Array<number>> field in the @turf/clusters-kmeans output as well, to allow easier iteration.

stebogit · 2017-07-14T15:58:50Z

packages/turf-clusters-distance/index.js

        noise.push(noisePoint);
    });

    return {
        points: featureCollection(newPoints),
+        edeges: featureCollection(edges),


typo:

- edeges: featureCollection(edges),
+ edges: featureCollection(edges),

DenisCarriere · 2017-07-14T16:07:39Z

Shall we rename also @turf/clusters-kmeans now or wait until next major release?

We just recently published that module... We could change it to kmeans at the next major release, or include both (for now): clusters would simply import the new module name clusters-kmeans, and we can add a console.warn() message when using it.
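One way the "include both for now" option could look is a thin wrapper that warns once and forwards to the renamed implementation. This is only a sketch: `deprecate` is a hypothetical helper, and `clustersKmeans` below is a stand-in for the real module, which would be loaded with `require` instead.

```javascript
// Hypothetical sketch: keep the old entry point alive as a thin wrapper
// that warns once and forwards to the renamed implementation.
function deprecate(fn, message) {
    var warned = false;
    return function () {
        if (!warned) {
            console.warn(message);
            warned = true;
        }
        return fn.apply(this, arguments);
    };
}

// Stand-in for the renamed module; a real wrapper would require it instead.
function clustersKmeans(points, numberOfClusters) {
    return {points: points, numberOfClusters: numberOfClusters};
}

// Old name keeps working but tells users to migrate.
var clusters = deprecate(clustersKmeans, '@turf/clusters is deprecated, use @turf/clusters-kmeans');
```

Warning only once keeps logs readable when the function is called in a loop.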

DenisCarriere · 2017-07-14T16:09:57Z

This would speed up the calculation (no feature creation) and slim the output, which seems now a little bloated to me, while still allowing iteration through each cluster or point type.

This is already deviating A LOT from the main purpose of TurfJS, all of the inputs/outputs should be in GeoJSON.

We shouldn't even be pushing out an Array of FeatureCollection (if we don't have to).

DenisCarriere · 2017-07-14T16:11:04Z

I like simply including the dbscan property, it's easy to understand the output.

Quick update on the Typescript definition

interface Point extends GeoJSON.Feature<GeoJSON.Point> {
    properties: {
        dbscan?: 'core' | 'edge' | 'noise';
        [key: string]: any;
    }
}

interface Output {
    type: 'FeatureCollection'
    features: Point[];
}
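With a single FeatureCollection tagged this way, splitting the output by point type stays a one-pass operation. A small sketch, assuming features shaped like the definition above (`groupByDbscan` is a hypothetical helper name):

```javascript
// Group the features of a clustered FeatureCollection by their `dbscan` tag.
// Features without a recognised tag are ignored.
function groupByDbscan(collection) {
    var groups = {core: [], edge: [], noise: []};
    collection.features.forEach(function (feature) {
        var tag = feature.properties && feature.properties.dbscan;
        if (Object.prototype.hasOwnProperty.call(groups, tag)) {
            groups[tag].push(feature);
        }
    });
    return groups;
}
```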

DenisCarriere · 2017-07-14T16:27:47Z

@stebogit Next commit should include a lot of changes, have a review:

Note: The Stars are edges (not center)

Dropped Center coordinates
Update typescript definition
Only output single FeatureCollection
Tag properties 'core' | 'edge' | 'noise'
Change minPoint default to 3 (recommended for DBSCAN to perform any meaningful clustering; 1 would cluster everything)

CC: @stebogit

- Update typescript definition - Only output single FeatureCollection - Tag properties 'core' | 'edge' | 'noise' - Change minPoint default to 3

DenisCarriere · 2017-07-14T17:01:49Z

@stebogit This would speed up the calculation (no feature creation) and slim the output, which seems now a little bloated to me, while still allowing iteration through each cluster or point type.

These types of modules are probably better to be abstracted out of TurfJS (only dealing with 2D points) and afterwards TurfJS would wrap it to simplify the GeoJSON integration.

As a good example, @mourner's libraries are mostly 2D (Array<number>), which can easily be used in all sorts of different libraries, whereas TurfJS is mostly focused on having pure GeoJSON outputs.
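A sketch of that split, assuming a generic 2D clustering function that takes an array of `[x, y]` pairs and returns one cluster id (or `undefined`) per point. `clusterGeoJSON` and `cluster2d` are hypothetical names; the wrapper only handles the GeoJSON side and copies properties so the input is not mutated:

```javascript
// Hypothetical wrapper: run any 2D clustering function over the coordinates
// of a FeatureCollection<Point> and copy its labels back as properties,
// without mutating the input features.
function clusterGeoJSON(collection, cluster2d) {
    var coords = collection.features.map(function (feature) {
        return feature.geometry.coordinates;
    });
    var labels = cluster2d(coords); // one cluster id (or undefined) per point
    return {
        type: 'FeatureCollection',
        features: collection.features.map(function (feature, index) {
            // Shallow-copy existing properties so they carry over to the output
            var properties = {};
            Object.keys(feature.properties || {}).forEach(function (key) {
                properties[key] = feature.properties[key];
            });
            if (labels[index] !== undefined) properties.cluster = labels[index];
            return {type: 'Feature', properties: properties, geometry: feature.geometry};
        })
    };
}
```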

stebogit · 2017-07-14T16:50:29Z

packages/turf-clusters-distance/index.js

    // handle noise points, if any
+    // Skip Noise if cluster is already associated
+    // This might be a slight deviation of DBSCAN (or a bug in the library)


I'd drop this (since it's not a bug) and maybe add an explanation like
// edge points are tagged by DBSCAN as both 'noise' and 'cluster' because they can "reach" fewer than 'minPoints' points

stebogit · 2017-07-14T17:04:50Z

packages/turf-clusters-distance/test.js

-        point.properties['marker-symbol'] = 'circle-stroked';
-        point.properties['marker-size'] = 'medium';
-        points.push(point);
+    featureEach(clustered, function (point) {


I'd change this to:

switch (point.properties.dbscan) {
case 'core':
case 'edge': {
    const coreColor = colours[point.properties.cluster];
    const edgeColor = chromatism.brightness(-20, colours[point.properties.cluster]).hex;
    point.properties['marker-color'] = (point.properties.dbscan === 'core') ? coreColor : edgeColor;
    point.properties['marker-size'] = 'small';
    points.push(point);
    break;
}
case 'noise': {
    point.properties['marker-color'] = '#AEAEAE';
    point.properties['marker-symbol'] = 'circle-stroked';
    point.properties['marker-size'] = 'medium';
    points.push(point);
}
}

as edges are not really a major feature to highlight.
But it's just a finesse 😄

lol Yep! Looks good.

I really like this Switch statement, makes it really easy to control those clustered points.

stebogit · 2017-07-14T17:07:53Z

packages/turf-clusters-distance/index.js

@@ -17,7 +15,7 @@ var featureCollection = helpers.featureCollection;
 * @param {FeatureCollection<Point>} points to be clustered
 * @param {number} maxDistance Maximum Distance between any point of the cluster to generate the clusters (kilometers only)
 * @param {string} [units=kilometers] in which `maxDistance` is expressed, can be degrees, radians, miles, or kilometers
- * @param {number} [minPoints=1] Minimum number of points to generate a single cluster, points will be excluded if the
+ * @param {number} [minPoints=3] Minimum number of points to generate a single cluster, points will be excluded if the


stebogit · 2017-07-14T17:12:43Z

packages/turf-clusters-distance/index.js

@@ -17,7 +15,7 @@ var featureCollection = helpers.featureCollection;
 * @param {FeatureCollection<Point>} points to be clustered
 * @param {number} maxDistance Maximum Distance between any point of the cluster to generate the clusters (kilometers only)
 * @param {string} [units=kilometers] in which `maxDistance` is expressed, can be degrees, radians, miles, or kilometers
- * @param {number} [minPoints=1] Minimum number of points to generate a single cluster, points will be excluded if the
+ * @param {number} [minPoints=3] Minimum number of points to generate a single cluster, points will be excluded if the
 *     cluster does not meet the minimum amounts of points.
 * @returns {Object} an object containing a `points` FeatureCollection, the input points where each Point
 *     has given a `cluster` property with the cluster number it belongs, a `centroids` FeatureCollection of


Should be updated with the new dbscan property

👍 Yep! It's defined in the Typescript definition, but yes it needs to be documented

DenisCarriere · 2017-07-14T18:21:25Z

packages/turf-clusters-distance/test.js

+ * @example
+ * var centroids = centroidFromProperty(points, 'cluster');
+ */
+function centroidFromProperty(geojson, property, properties) {


@stebogit Added back your centroids points 😄

Would be interesting to know the benchmark results on that centroidFromProperty method (I'm sure it's really fast... it just scans the FeatureCollection once and then applies clusters based on those bins).

Mmmh... in density clustering the centroids are less useful/identifying/important than in k-means, I guess. 🤔
But I might be wrong.

Oh, I see now, this is only in test.js. Good 👍

Yes! :) Only for test.js. This is where a centroidByProperty module would be used... we wouldn't apply this directly in the modules, only for visual purposes.

Also this can be applied against Polygons or any Geometry Types, finding the "centroid" of stuff based on properties is quite useful.
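The single-scan binning described in this thread could look roughly like the sketch below. It is not the test helper from the PR: `meanPointByProperty` is a hypothetical name, it only handles Point features, and it uses a plain coordinate average as a stand-in for a proper centroid.

```javascript
// Hypothetical sketch: scan a FeatureCollection<Point> once, bin the points
// by a property value, and emit one averaged point per bin.
function meanPointByProperty(collection, property) {
    var bins = Object.create(null);
    collection.features.forEach(function (feature) {
        var value = feature.properties && feature.properties[property];
        if (value === undefined) return; // e.g. noise points without a cluster
        var bin = bins[value] || (bins[value] = {value: value, x: 0, y: 0, count: 0});
        bin.x += feature.geometry.coordinates[0];
        bin.y += feature.geometry.coordinates[1];
        bin.count++;
    });
    return {
        type: 'FeatureCollection',
        features: Object.keys(bins).map(function (key) {
            var bin = bins[key];
            var properties = {};
            properties[property] = bin.value;
            return {
                type: 'Feature',
                properties: properties,
                geometry: {type: 'Point', coordinates: [bin.x / bin.count, bin.y / bin.count]}
            };
        })
    };
}
```

Supporting Polygons or other geometry types, as suggested above, would need a real centroid calculation per bin instead of the simple average.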

stebogit · 2017-07-14T19:03:29Z

Cool! 😃 🎊 🚀

DenisCarriere · 2017-07-14T19:05:49Z

29 commits later.. 👍

DenisCarriere added 2 commits June 21, 2017 17:54

Implement @turf/clusters-distance module

d0d8970

Update yarn lock

24cf66d

DenisCarriere added the new-module label Jun 21, 2017

DenisCarriere added this to the 4.5.0 milestone Jun 21, 2017

DenisCarriere self-assigned this Jun 21, 2017

Update debug file

8669d30

DenisCarriere modified the milestones: 4.6.0, 4.5.0 Jun 30, 2017

DenisCarriere added the need-help label Jun 30, 2017

simplified calculation (run almost x2 faster);

e5fb827

removed id attribute from output points;

DenisCarriere commented Jul 7, 2017


DenisCarriere added 3 commits July 11, 2017 21:49

Convert index.js to ES5

17ca549

Publish new clusters-distance approach

ed0e48a

Add minPoints to tests param

88d3953

Update Typescript tests

cf53d66

Merge branch 'master' into clusters-distance

4636f3a

stebogit reviewed Jul 14, 2017


DenisCarriere added 2 commits July 14, 2017 12:29

Major changes

8a45889

- Update typescript definition - Only output single FeatureCollection - Tag properties 'core' | 'edge' | 'noise' - Change minPoint default to 3

Create a set of clusters to colorize

c2be7be

Define edges with cross

01e92de

stebogit approved these changes Jul 14, 2017


DenisCarriere added 5 commits July 14, 2017 13:56

Add CentroidFromProperty to tests

01bc50b

Updates based on @stebogit comments

d3a3166

Update Readme

28436e7

Update benchmark results & drop geokdbush

d926702

Add noisePoint.properties fallback incase no props

7fe4acf

DenisCarriere commented Jul 14, 2017


DenisCarriere added 4 commits July 14, 2017 14:34

Added Array of Features handling

67e9071

Update library to clusters-dbscan

e7fb4a8

Rename folder to clusters-dbscan

9bdd634

Update readme to clusters-dbscan

af57608

stebogit mentioned this pull request Jul 14, 2017

Create Centroids from Property Name (new module proposal) #841

Closed

3 tasks

DenisCarriere merged commit 59a2b65 into master Jul 14, 2017

DenisCarriere deleted the clusters-distance branch July 14, 2017 18:58

stebogit mentioned this pull request Jul 15, 2017

Proposal new module distance-clustering #811

Closed

2 tasks

DenisCarriere mentioned this pull request Jul 16, 2017

Cluster modules - New modules/features #845

Closed

10 tasks

DenisCarriere changed the title ~~Implement @turf/clusters-distance module~~ Implement @turf/clusters-dbscan module Jul 18, 2017

DenisCarriere mentioned this pull request Aug 2, 2017

New minor release! Turf 4.6.0 🎉 #884

Closed
