Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(bq,sf,rs|clustering): improve how ST_CLUSTERKMEANS deals with duplicates #495

Merged

Conversation

vdelacruzb
Copy link
Contributor

@vdelacruzb vdelacruzb commented Apr 16, 2024

Description

Shortcut

The ST_CLUSTERKMEANS functions have seen been modified to try to ensure as much as possible te requested number of clusters. In this algorithm is important that the points used for initialization are distinct. For this reason the approach followed is the next:

  1. Follow the array of coords and split the first appearance of each coord and the successive ones into two arrays.
  2. Order both.
  3. Concatenate.

This way the beginning of the array will only contain unique elements. Everything is sorted so the response is deterministic.

Also fixed a bug in redshift that returns all the points with cluster 0 if the requested number of clusters is bigger than the distinct values at the array.

Type of change

  • Fix

Acceptance

Bigquery

-- Now we allow to create as many clusters as distinct values in the table
SELECT count(DISTINCT ST_ASBINARY(geom))
from `carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded`;
-- 38101
with a as (
  SELECT ST_ASBINARY(geom) binary_geom FROM 
`carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded` group by binary_geom ORDER BY binary_geom limit 39000
),
b as (
SELECT `cartodb-data-engineering-team`.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(ST_GEOGFROMWKB(binary_geom) ORDER BY binary_geom), 39000) arr
from a
)
SELECT count(DISTINCT element.cluster) from b, UNNEST(arr) element;
-- 38101

with a as (
SELECT `cartodb-data-engineering-team`.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(geom ORDER BY ST_ASBINARY(geom)), 200) arr
FROM `carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded`
)
SELECT count(DISTINCT element.cluster) from a, UNNEST(arr) element;
-- 200

-- No elements are removed
SELECT count(*)
from `carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded`;
-- 46916

with a as (
SELECT `cartodb-data-engineering-team`.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(geom ORDER BY ST_ASBINARY(geom)), 200) arr
FROM `carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded`
)
SELECT count(*) from a, UNNEST(arr) element;
-- 46916

Snowflake

CREATE or replace table CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.clustering_table_duplicateds AS
SELECT ST_GEOGFROMTEXT('POINT(-72.3539 -37.47262)') as geom, 1 as id
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-72.3539 -37.47262)'), 2
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.61442 -35.39392)'), 3
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.61442 -35.39392)'), 4
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.61442 -35.39392)'), 5
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.33815 -29.9541)'), 6;

-- No rows are removed
with a as (
SELECT CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(ST_ASGEOJSON(geom)::STRING) WITHIN GROUP (ORDER BY ID ASC), 2) arr
from CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.clustering_table_duplicateds
)
select count(*) FROM a, LATERAL FLATTEN(input=>arr);
-- 6 rows

-- We create the exact name of clusters even if the input is inserted in an order containing duplicateds
with a as (
SELECT CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(ST_ASGEOJSON(geom)::STRING) WITHIN GROUP (ORDER BY ID ASC), 3) arr
from CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.clustering_table_duplicateds
)
select COUNT(DISTINCT "VALUE":cluster) FROM a, LATERAL FLATTEN(input=>arr);
-- 3

Redshift

-- No elements are removed
SELECT get_array_length(vdelacruz_carto.ST_CLUSTERKMEANS(ST_GEOMFROMTEXT('MULTIPOINT ((-72.3539 -37.47262), (-72.3539 -37.47262), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.33815 -29.9541))', 3)));
-- 6

SELECT vdelacruz_carto.ST_CLUSTERKMEANS(ST_GEOMFROMTEXT('MULTIPOINT ((-72.3539 -37.47262), (-72.3539 -37.47262), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.33815 -29.9541))'), 3);
-- By looking to the output we can observe that 3 clusters where created  even if the input is inserted in an order containing duplicateds

Copy link
Member

@Jesus89 Jesus89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting workaround for the turf issue. It seems correct but I'll let @malgar to review and put the final ✔️ before merging

}

// Sort unique values alphabetically
uniqueValues.sort();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vdelacruzb Are you sorting the locations (unique values and duplicatedValues) so the result is deterministic?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the alphabetical sorting for GeoJSON features translate into a sorting based on lat/lon? I'm asking because I see some risk in doing so, especially when clustering small amounts of points (see image)

image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ey @malgar, indeed on the one hand by splitting and concatenating uniques and duplicates we get rid of the problem of getting less clusters than expected.
And on the other, the sorting just ensures that the result is deterministic.
Both are not dependant.

I would say just the alphabetical sorting translates into lng/lat sorting. And this kind of issues could happen. I could try sort by distance to a point but I think the problem would be similar that the present one. Which kind of sorting do you think could work in this case?

@vdelacruzb
Copy link
Contributor Author

The final approach has been removing any kind of alphabetical sorting.
Also have added a test on each provider that should fail if we wouldn't split into unique, duplicateds.

@vdelacruzb vdelacruzb requested a review from malgar April 18, 2024 08:40
Copy link

@malgar malgar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vdelacruzb LGTM. I just added a small note to remove misleading comments regarding previous lexicographic sorting that no longer apply

@vdelacruzb vdelacruzb merged commit 8a29098 into main Apr 18, 2024
17 checks passed
@vdelacruzb vdelacruzb deleted the bug/sc-402642/restore-st-clusterkmeans-and-solve-duplicated branch April 18, 2024 09:40
@vdelacruzb vdelacruzb mentioned this pull request Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants