fix(bq,sf,rs|clustering): improve how ST_CLUSTERKMEANS deals with duplicates #495

vdelacruzb · 2024-04-16T15:09:46Z

Description

Shortcut

Story: https://app.shortcut.com/cartoteam/story/402642/restore-st-clusterkmeans-and-solve-duplicated-points-bug
Autolink: [sc-402642]

The ST_CLUSTERKMEANS functions have been modified so that they return the requested number of clusters whenever possible. The algorithm requires the points used for initialization to be distinct, so the approach is as follows:

Walk the array of coords, putting the first appearance of each coord in one array and every later repeat in a second array.
Sort both arrays.
Concatenate them, uniques first.

This way the beginning of the array contains only unique elements. Everything is sorted, so the response is deterministic.
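The three steps above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `unique_first` is hypothetical, and it sorts numeric tuples, whereas the libraries work on serialized values.

```python
def unique_first(coords):
    """Reorder coords so first occurrences come before repeats (hypothetical helper)."""
    seen = set()
    uniques, duplicates = [], []
    for coord in coords:
        if coord in seen:
            duplicates.append(coord)  # a repeat of an already seen coord
        else:
            seen.add(coord)
            uniques.append(coord)     # first appearance of this coord
    # Sort both halves so the result does not depend on the input order.
    uniques.sort()
    duplicates.sort()
    # Uniques go first, so the leading elements are all distinct.
    return uniques + duplicates


# Same points as the Snowflake and Redshift acceptance examples.
points = [
    (-72.3539, -37.47262),
    (-72.3539, -37.47262),
    (-71.61442, -35.39392),
    (-71.61442, -35.39392),
    (-71.61442, -35.39392),
    (-71.33815, -29.9541),
]
reordered = unique_first(points)
```

No point is dropped: `reordered` still has six elements, and its first three are the three distinct coords.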

Also fixed a bug in Redshift that returned all the points with cluster 0 when the requested number of clusters was bigger than the number of distinct values in the array.

Type of change

Fix

Acceptance

BigQuery

-- We can now create as many clusters as there are distinct values in the table
SELECT count(DISTINCT ST_ASBINARY(geom))
from `carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded`;
-- 38101
with a as (
  SELECT ST_ASBINARY(geom) binary_geom FROM 
`carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded` group by binary_geom ORDER BY binary_geom limit 39000
),
b as (
SELECT `cartodb-data-engineering-team`.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(ST_GEOGFROMWKB(binary_geom) ORDER BY binary_geom), 39000) arr
from a
)
SELECT count(DISTINCT element.cluster) from b, UNNEST(arr) element;
-- 38101

with a as (
SELECT `cartodb-data-engineering-team`.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(geom ORDER BY ST_ASBINARY(geom)), 200) arr
FROM `carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded`
)
SELECT count(DISTINCT element.cluster) from a, UNNEST(arr) element;
-- 200

-- No elements are removed
SELECT count(*)
from `carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded`;
-- 46916

with a as (
SELECT `cartodb-data-engineering-team`.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(geom ORDER BY ST_ASBINARY(geom)), 200) arr
FROM `carto-dw-ac-gs6bn7ia.shared_us.Locations-geocoded`
)
SELECT count(*) from a, UNNEST(arr) element;
-- 46916

Snowflake

CREATE or replace table CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.clustering_table_duplicateds AS
SELECT ST_GEOGFROMTEXT('POINT(-72.3539 -37.47262)') as geom, 1 as id
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-72.3539 -37.47262)'), 2
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.61442 -35.39392)'), 3
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.61442 -35.39392)'), 4
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.61442 -35.39392)'), 5
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.33815 -29.9541)'), 6;

-- No rows are removed
with a as (
SELECT CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(ST_ASGEOJSON(geom)::STRING) WITHIN GROUP (ORDER BY ID ASC), 2) arr
from CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.clustering_table_duplicateds
)
select count(*) FROM a, LATERAL FLATTEN(input=>arr);
-- 6 rows

-- We create the exact number of clusters even if the input arrives in an order containing duplicates
with a as (
SELECT CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(ST_ASGEOJSON(geom)::STRING) WITHIN GROUP (ORDER BY ID ASC), 3) arr
from CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.clustering_table_duplicateds
)
select COUNT(DISTINCT "VALUE":cluster) FROM a, LATERAL FLATTEN(input=>arr);
-- 3

Redshift

-- No elements are removed
SELECT get_array_length(vdelacruz_carto.ST_CLUSTERKMEANS(ST_GEOMFROMTEXT('MULTIPOINT ((-72.3539 -37.47262), (-72.3539 -37.47262), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.33815 -29.9541))'), 3));
-- 6

SELECT vdelacruz_carto.ST_CLUSTERKMEANS(ST_GEOMFROMTEXT('MULTIPOINT ((-72.3539 -37.47262), (-72.3539 -37.47262), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.33815 -29.9541))'), 3);
-- Looking at the output, we can see that 3 clusters were created even though the input arrives in an order containing duplicates

Jesus89

Interesting workaround for the turf issue. It seems correct, but I'll let @malgar review it and give the final ✔️ before merging.

malgar · 2024-04-17T12:52:25Z

clouds/bigquery/libraries/javascript/src/clustering.js

+    }
+
+    // Sort unique values alphabetically
+    uniqueValues.sort();


@vdelacruzb Are you sorting the locations (unique values and duplicatedValues) so the result is deterministic?

Does the alphabetical sorting for GeoJSON features translate into a sorting based on lat/lon? I'm asking because I see some risk in doing so, especially when clustering small amounts of points (see image)

Hey @malgar, indeed. On the one hand, splitting and concatenating uniques and duplicates gets rid of the problem of getting fewer clusters than expected.
On the other, the sorting just ensures that the result is deterministic.
The two are independent.

I would say the alphabetical sorting does translate into lng/lat sorting, so this kind of issue could happen. I could try sorting by distance to a point, but I think the problem would be similar to the present one. Which kind of sorting do you think could work in this case?
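For reference, a small sketch of how string ordering of GeoJSON-like features compares with numeric longitude ordering. It assumes each feature is serialized with the same key order and with the longitude first, as `json.dumps` does here; the exact serialization used by each provider may differ. The sample longitudes are partly taken from the acceptance examples and partly made up.

```python
import json

# Longitude/latitude pairs; the two positive longitudes are invented for illustration.
points = [
    (-72.3539, -37.47262),
    (-71.61442, -35.39392),
    (10.0, 45.0),
    (9.0, 45.0),
]
features = [json.dumps({"type": "Point", "coordinates": list(p)}) for p in points]

# Alphabetical (lexicographic) ordering of the serialized features.
by_string = sorted(features)
# Numeric ordering by longitude.
by_longitude = sorted(features, key=lambda f: json.loads(f)["coordinates"][0])

lons_by_string = [json.loads(f)["coordinates"][0] for f in by_string]
lons_by_longitude = [json.loads(f)["coordinates"][0] for f in by_longitude]

print(lons_by_string)     # [-71.61442, -72.3539, 10.0, 9.0]
print(lons_by_longitude)  # [-72.3539, -71.61442, 9.0, 10.0]
```

The string order groups features by the leading characters of the longitude, so it follows longitude only approximately. Either way, neighbouring entries in the sorted array tend to be close in longitude, which is the risk raised above when the leading points seed the clusters.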

vdelacruzb · 2024-04-18T08:40:50Z

The final approach removes any kind of alphabetical sorting.
I have also added a test on each provider that should fail if we did not split the input into uniques and duplicates.
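A rough sketch of the idea behind such a test, not the tests added in this PR. It assumes, as the description implies, that the first k points of the array seed the clusters; `unique_first_stable` and `distinct_seeds` are hypothetical names.

```python
def unique_first_stable(coords):
    """Move repeats after first occurrences, keeping the input order (no sorting)."""
    seen = set()
    uniques, duplicates = [], []
    for coord in coords:
        (duplicates if coord in seen else uniques).append(coord)
        seen.add(coord)
    return uniques + duplicates


def distinct_seeds(coords, k):
    """Count distinct points among the first k, assumed here to seed the clusters."""
    return len(set(coords[:k]))


# Input deliberately starts with a duplicated point.
points = [
    (-72.3539, -37.47262),
    (-72.3539, -37.47262),
    (-71.61442, -35.39392),
    (-71.61442, -35.39392),
    (-71.61442, -35.39392),
    (-71.33815, -29.9541),
]
k = 3

seeds_without_split = distinct_seeds(points, k)
seeds_with_split = distinct_seeds(unique_first_stable(points), k)
```

Without the split only two of the three seeds are distinct, so a check for exactly `k` clusters would be expected to fail; with the split all three seeds differ.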

clouds/redshift/libraries/python/lib/clustering/__init__.py

malgar

@vdelacruzb LGTM. I just added a small note to remove misleading comments about the previous lexicographic sorting, which no longer applies.

vdelacruzb added 4 commits April 16, 2024 16:38

update doc

d84e397

fix clustering in bq

6f7f574

fix clustering in sf

f871201

fix clustering in rs

0600b89

vdelacruzb requested review from malgar and Jesus89 April 16, 2024 15:10

Jesus89 approved these changes Apr 16, 2024

malgar reviewed Apr 17, 2024

vdelacruzb added 2 commits April 18, 2024 09:52

avoid sorting output

1dab6ab

add tests

2ad1113

vdelacruzb requested a review from malgar April 18, 2024 08:40

lint redshift

8b665c1

malgar reviewed Apr 18, 2024

malgar reviewed Apr 18, 2024

malgar approved these changes Apr 18, 2024

remove comments

9b90acf

vdelacruzb merged commit 8a29098 into main Apr 18, 2024
17 checks passed

vdelacruzb deleted the bug/sc-402642/restore-st-clusterkmeans-and-solve-duplicated branch April 18, 2024 09:40

vdelacruzb mentioned this pull request Apr 18, 2024

release: 2024-04-18 #497

Merged
