Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement @turf/clusters-dbscan module #812

Merged
merged 33 commits into from
Jul 14, 2017
Merged

Conversation

DenisCarriere
Copy link
Member

@DenisCarriere DenisCarriere commented Jun 21, 2017

First draft of @turf/clusters-dbscan Ref. #811

Note: this module might be merged into clusters and support both distance & numberOfClusters params (still up in the air).

Compared to the kmeans cluster this is roughly 5x-50x times faster (the kdbush index does help a lot -- The performance will most likely slow down as weighted clusters becomes implemented).

To-Do

  • Translate existing GeoJSON properties
  • Weighted clusters based on distance (right now the first point from each new cluster is the "center")
  • Not too sure if we should be adding centroid/points in the outputs (instead of simply a FeatureCollection<Point>). Using the cluster property param is already enough to be able to create a centroid/center/centerOfMass.
  • Only supports kilometers as distance

More To-Dos

  • change library name to clusters-dbscan
  • add a dbscan property
  • drop the centroid
  • Prevent input mutation

Examples

points1.geojson

image

points2.geojson

image

@DenisCarriere DenisCarriere added this to the 4.5.0 milestone Jun 21, 2017
@DenisCarriere DenisCarriere self-assigned this Jun 21, 2017
@DenisCarriere DenisCarriere modified the milestones: 4.6.0, 4.5.0 Jun 30, 2017
@stebogit
Copy link
Collaborator

stebogit commented Jul 5, 2017

@DenisCarriere how can I help?

Not too sure if we should be adding centroid/points in the outputs (instead of simply a FeatureCollection). Using the cluster property param is already enough to be able to create a centroid/center/centerOfMass.

I believe including the centroids in the output might result actually useful as sometimes you are interested in those points and calculating them while processing the points is faster than doing it afterwards.

@DenisCarriere
Copy link
Member Author

DenisCarriere commented Jul 5, 2017

👍 Agreed about including the centroid.

As for help, when the cluster ID's are being associated, it only matches the first found and then skips all the other ones. What would need to happen is to sort/group the clusters by closest distance (not the first match).

@stebogit
Copy link
Collaborator

stebogit commented Jul 5, 2017

@DenisCarriere I don't know if this is exactly the issue you were referring to, but I feel there is something not right with the approach we are using for this clustering.
Take a look at the gif image that shows the greedy clustering algorithm (which we are basically using here) of supercluster; I added a few points to show the issue I'm referring to:
screen shot 2017-07-05 at 7 58 26 am
As you can see all points are grouped (which is the goal of that clustering method), however the green points 2 and 3 technically belong to two different maxDistance circles/cluster, but they are associated with a (maxDistance) cluster that is first calculated.

The problem I think is that for the @turf/clusters-distance circles (i.e. clusters) should not overlap, but it's impossible to completely cover an area using circles...

@DenisCarriere
Copy link
Member Author

DenisCarriere commented Jul 5, 2017

Yes this is the issue I was referencing, I actually used supercluster as inspiration for this module (removed all the WebGL & vector tile grouping).

What if we define the cluster param as clusters: Array<number> instead?

That way the points 2 & 3 would have this as a param clusters: [2, 3]

This approach might help solve any ambiguous cluster matching.

Points that only have 1 cluster would still be an Array except with only 1 number.

@stebogit thoughts?

@DenisCarriere
Copy link
Member Author

By the way... Thanks @mourner for another great library 👍 (supercluster)

@stebogit
Copy link
Collaborator

stebogit commented Jul 6, 2017

@DenisCarriere I think we should better define the expected result of the module.
It seems to me that we are kind of mixing the (current) @turf/cluster functionality/output (a partitional clustering) with the supercluster one (a hierarchical clustering).

What if we assigned the clusters (array) parameter to each point, where the array contains all the index/reference to al the points that are within the maxDistance?
There wouldn't be an actual centroid for each cluster, as actually there wouldn't be any cluster (so this method might not be called clusters-)...

I have no idea if this would be of any utility though.

@DenisCarriere
Copy link
Member Author

Well how would one go about to create clusters based on a distance?

Right now we can create clusters based on kmeans, but if you wanted to cluster points based on a maxDistance, then this would be the module for it.

No clue what the perfect outcome should be, however I'm sure we can come up with a creative way to make such module.

removed id attribute from output points;
@@ -25,17 +25,12 @@ module.exports = function (points, maxDistance) {
collectionOf(points, 'Point', 'Input must contain Points');

// Create index
const load = points.features.map(function (point, index) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Love it! I think my first approached used the id fields to do the matching, however I might of dropped that workflow mid way and forgot to remove the .map().

@stebogit Good catch!

* points-with-properties: 0.164ms
* points1: 0.087ms
* points2: 0.694ms
* fiji x 1,320,371 ops/sec ±1.72% (80 runs sampled)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚗 💨 Zoom Zoom!

@stebogit
Copy link
Collaborator

Well how would one go about to create clusters based on a distance?
Right now we can create clusters based on kmeans, but if you wanted to cluster points based on a maxDistance, then this would be the module for it.

@DenisCarriere I've been thinking a lot about this clustering problem and I keep wondering: maxDistance from what? There's no reference from which to calculate the distance.

Should we add a (required) centroids input and assign a centroid/cluster to each point within maxDistance from that centroid?
This would not necessary assign a cluster to all the points in the set (see partial clustering).
However, would this be a useful/common clustering method, that 'popular' to deserve a dedicated module?

In my research on clustering methods I keep seeing these three as most popular (i.e. I guess mostly used/useful) methods:

  • K-means (each point of the cluster is closest to its centroid that any other centroid, thus this is a clustering method based on distance)
  • Agglomerative Hierarchical Clustering (similar to supercluster and the current implementation of this module)
  • DBSCAN (density based, clusters are defined as regions of higher density, i.e. number of points per unit area, than the remainder of the data set)

@DenisCarriere
Copy link
Member Author

Thanks @stebogit for that initial research.

For me, my initial intent with this distance cluster would most likely reflect DBSCAN's clustering algorithm. This approach makes the most sense to me and would be "relatively" easy to implement using geokdbush, I think I've already got an idea how to do it.

Using this approach would probably solve your question about (maxDistance from what?)
image

In this diagram, minPts = 4. Point A and the other red points are core points, because the area surrounding these points in an ε radius contain at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor directly-reachable.

@stebogit Are you on the same page if we use the DBSCAN approach for @turf/clusters-distance?

CC: @morganherlocker & @mourner Feel free to shim in anytime if you feel this is going in the wrong direction.

@DenisCarriere
Copy link
Member Author

DenisCarriere commented Jul 12, 2017

@stebogit I couldn't help myself to get this done in one night.

Check out the new implementation, it's pretty solid and it uses some of the DBSCAN logic (minus not providing noise points - they are simply excluded).

Next Steps

  • Optimize code for performance
  • Add maxPoints as param??

Many Points

Clusters will keep growing as long as clusters are within maximum distance. Used minPoints=3 to exclude smaller clusters.

image

Points 2

Outliers can be identified here by showing clusters of a single point (can be removed if minPoints param is defined greater than 1).

image

@stebogit
Copy link
Collaborator

stebogit commented Jul 12, 2017

👍 👏 @DenisCarriere I'll take a look at this tomorrow.

Quick thought, we could call this module @turf/clusters-density, to differentiate it from kmeans which does cluster points based on distance from the centroid.
Or we might just stick with the algorithm name, which is probably more clear for everybody, so renaming @turf/clusters to @turf/clusters-kmeans and this to @turf/clusters-dbscan.

@DenisCarriere
Copy link
Member Author

👍 For @turf/clusters-density name, I don't think the average user would know what dbscan means, keeping the module names simple is best.

👍 renaming @turf/clusters to @turf/clusters-kmeans, that way the cluster process is better defined/scoped by using the name of the module.

@DenisCarriere
Copy link
Member Author

The reason why I wouldn't call it @turf/clusters-dbscan is because this is more of an "inspiration" of dbscan and not a complete implementation of it.

We can always improve on this at a later date...

@stebogit
Copy link
Collaborator

stebogit commented Jul 14, 2017

@DenisCarriere why don't we return this instead:

return {
    points: FeatureCollection<Points>, // with `clusterId` property
    edges: Array<Array<number>>, // edges ids
    centroids: Array<Array<number>>, // centroids ids
    noise: Array<Array<number>> // noise ids
};

This would speed up the calculation (no feature creation) and slim the output, which seems now a little bloated to me, while still allowing iteration through each cluster or point type.

Edit:
With this in mind I would probably include a clusters: Array<Array<number>> field in the @turf/clusters-kmeans output as well, for allowing easier iteration

noise.push(noisePoint);
});

return {
points: featureCollection(newPoints),
edeges: featureCollection(edges),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo:

- edeges: featureCollection(edges),
+ edges: featureCollection(edges),

@DenisCarriere
Copy link
Member Author

Shall we rename also @turf/clusters-kmeans now or wait until next major release?

We just recently published that module... We change it to kmeans at the next major release, or include both (for now) and clusters will simply import the new module name clusters-kmeans and we can add a console.warning() message when using it.

@DenisCarriere
Copy link
Member Author

This would speed up the calculation (no feature creation) and slim the output, which seems now a little bloated to me, while still allowing iteration through each cluster or point type.

This is already deviating A LOT from the main purpose of TurfJS, all of the inputs/outputs should be in GeoJSON.

We shouldn't even be pushing out an Array of FeatureCollection (if we don't have too).

@DenisCarriere
Copy link
Member Author

I like simply including the dbscan property, it's easy to understand the output.

Quick update on the Typescript definition

interface Point extends GeoJSON.Feature<GeoJSON.Point> {
    properties: {
        dbscan?: 'core' | 'edge' | 'noise';
        [key: string]: any;
    }
}

interface Output {
    type: 'FeatureCollection'
    features: Point[];
}

@DenisCarriere
Copy link
Member Author

DenisCarriere commented Jul 14, 2017

@stebogit Next commit should include a lot of changes, have a review:

Note: The Stars are edges (not center)

  • Dropped Center coordinates
  • Update typescript definition
  • Only output single FeatureCollection
  • Tag properties 'core' | 'edge' | 'noise'
  • Change minPoint default to 3 (recommendations DBSCAN to perform any clustering | 1 would cluster everything)

image

CC: @stebogit

- Update typescript definition
- Only output single FeatureCollection
- Tag properties 'core' | 'edge' | 'noise'
- Change minPoint default to 3
@DenisCarriere
Copy link
Member Author

DenisCarriere commented Jul 14, 2017

@stebogit This would speed up the calculation (no feature creation) and slim the output, which seems now a little bloated to me, while still allowing iteration through each cluster or point type.

These types of modules are probably better to be abstracted out of TurfJS (only dealing with 2D points) and afterwards TurfJS would wrap it to simplify the GeoJSON integration.

As a good example, @mourner's library are mainly all 2D (Array<number>) which can easily be used in all sorts of different libraries, whereas TurfJS is mostly focused on having pure GeoJSON outputs.

// handle noise points, if any
// Skip Noise if cluster is already associated
// This might be a slight deviation of DBSCAN (or a bug in the library)
Copy link
Collaborator

@stebogit stebogit Jul 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd drop this (since it's not a bug) and maybe add an explanation like
// edges points are tagged by DBSCAN as both 'noise' and 'cluster' as they can "reach" less than 'minPoints' number of points

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

point.properties['marker-symbol'] = 'circle-stroked';
point.properties['marker-size'] = 'medium';
points.push(point);
featureEach(clustered, function (point) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd change this to:

switch (point.properties.dbscan) {
    case 'core':
    case 'edge': {
        const coreColor = colours[point.properties.cluster];
        const edgeColor = chromatism.brightness(-20, colours[point.properties.cluster]).hex;
        point.properties['marker-color'] = (point.properties.dbscan === 'core') ? coreColor : edgeColor;
        point.properties['marker-size'] = 'small';
        points.push(point);
        break;
    }
    case 'noise': {
        point.properties['marker-color'] = '#AEAEAE';
        point.properties['marker-symbol'] = 'circle-stroked';
        point.properties['marker-size'] = 'medium';
        points.push(point);
    }
}

as edges are not really a major feature to highlight.
But it's just a finesse 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol Yep! Looks good.

I really like this Switch statement, makes it really easy to control those clustered points.

@@ -17,7 +15,7 @@ var featureCollection = helpers.featureCollection;
* @param {FeatureCollection<Point>} points to be clustered
* @param {number} maxDistance Maximum Distance between any point of the cluster to generate the clusters (kilometers only)
* @param {string} [units=kilometers] in which `maxDistance` is expressed, can be degrees, radians, miles, or kilometers
* @param {number} [minPoints=1] Minimum number of points to generate a single cluster, points will be excluded if the
* @param {number} [minPoints=3] Minimum number of points to generate a single cluster, points will be excluded if the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -17,7 +15,7 @@ var featureCollection = helpers.featureCollection;
* @param {FeatureCollection<Point>} points to be clustered
* @param {number} maxDistance Maximum Distance between any point of the cluster to generate the clusters (kilometers only)
* @param {string} [units=kilometers] in which `maxDistance` is expressed, can be degrees, radians, miles, or kilometers
* @param {number} [minPoints=1] Minimum number of points to generate a single cluster, points will be excluded if the
* @param {number} [minPoints=3] Minimum number of points to generate a single cluster, points will be excluded if the
* cluster does not meet the minimum amounts of points.
* @returns {Object} an object containing a `points` FeatureCollection, the input points where each Point
* has given a `cluster` property with the cluster number it belongs, a `centroids` FeatureCollection of
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be updated with the new dbscan property

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Yep! It's defined in the Typescript definition, but yes it needs to be documented

* @example
* var centroids = centroidFromProperty(points, 'cluster');
*/
function centroidFromProperty(geojson, property, properties) {
Copy link
Member Author

@DenisCarriere DenisCarriere Jul 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stebogit Added back your centroids points 😄

Would be interesting to know the benchmark results on that centroidFromProperty method (i'm sure it's really fast... just quickly scans the FeatureCollection once and then applies clusters based on those bins).

image

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmmh... in density clustering the centroids are less useful/identifying/important than in k-means, I guess. 🤔
But I might be wrong.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see now, this is only in test.js. Good 👍

Copy link
Member Author

@DenisCarriere DenisCarriere Jul 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! :) only for test.js This where centroidByProperty module would be used... we wouldn't apply this directly in the modules, but for visual purposes.

Also this can be applied against Polygons or any Geometry Types, finding the "centroid" of stuff based on properties is quite useful.

@DenisCarriere DenisCarriere merged commit 59a2b65 into master Jul 14, 2017
@DenisCarriere DenisCarriere deleted the clusters-distance branch July 14, 2017 18:58
@stebogit
Copy link
Collaborator

Cool! 😃 🎊 🚀

@DenisCarriere
Copy link
Member Author

29 commits later.. 👍

@DenisCarriere DenisCarriere changed the title Implement @turf/clusters-distance module Implement @turf/clusters-dbscan module Jul 18, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants