-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(bq,sf,rs|clustering): improve how ST_CLUSTERKMEANS deals with duplicates #495
fix(bq,sf,rs|clustering): improve how ST_CLUSTERKMEANS deals with duplicates #495
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting workaround for the turf issue. It seems correct but I'll let @malgar to review and put the final ✔️ before merging
} | ||
|
||
// Sort unique values alphabetically | ||
uniqueValues.sort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vdelacruzb Are you sorting the locations (unique values and duplicatedValues) so the result is deterministic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ey @malgar, indeed on the one hand by splitting and concatenating uniques and duplicates we get rid of the problem of getting less clusters than expected.
And on the other, the sorting just ensures that the result is deterministic.
Both are not dependant.
I would say just the alphabetical sorting translates into lng/lat sorting. And this kind of issues could happen. I could try sort by distance to a point but I think the problem would be similar that the present one. Which kind of sorting do you think could work in this case?
The final approach has been removing any kind of alphabetical sorting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vdelacruzb LGTM. I just added a small note to remove misleading comments regarding previous lexicographic sorting that no longer apply
Description
Shortcut
The
ST_CLUSTERKMEANS
functions have seen been modified to try to ensure as much as possible te requested number of clusters. In this algorithm is important that the points used for initialization are distinct. For this reason the approach followed is the next:This way the beginning of the array will only contain unique elements. Everything is sorted so the response is deterministic.
Also fixed a bug in redshift that returns all the points with cluster 0 if the requested number of clusters is bigger than the distinct values at the array.
Type of change
Acceptance
Bigquery
Snowflake
Redshift