Landcover based sampling strategy. #29

yellowcap · 2023-11-13T18:30:10Z

The current sampling strategy is kept purposefully simple. We can later expand it to criteria beyond landcover, and potentially use a cluster-based approach to replace the human-biased perspective of the current implementation.

Refs #28

yellowcap · 2023-11-15T14:15:35Z

The v0 sampling scripts are complete and should be fully reproducible. The current scripts select 950 tiles based on landcover, making sure we capture diverse tiles and a good representation of every class. The resulting tiles are visualized in the map below.

Magenta tiles are the selected ones; the others are colored by the number of land cover classes present in them.
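As a rough illustration of the "number of land cover classes present" value used to color the map, here is a minimal sketch. It assumes a stats table with one row per tile and one pixel-count column per class; the tile names, column names, and numbers are made up for the example and are not taken from mgrs_stats.fgb.

```python
import pandas as pd

# Hypothetical per-tile pixel counts; the real stats file schema may differ.
stats = pd.DataFrame(
    {
        "name": ["tile_a", "tile_b", "tile_c"],
        "trees": [900, 0, 400],
        "water": [100, 500, 0],
        "urban": [0, 500, 100],
        "cropland": [0, 0, 500],
    }
).set_index("name")


def count_classes(table: pd.DataFrame) -> pd.Series:
    """Count the classes that have at least one pixel in each tile."""
    return (table > 0).sum(axis=1)


n_classes = count_classes(stats)
```

Sorting the tiles by this count is one simple way to favor diverse tiles, although the actual script may define diversity differently.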


weiji14 · 2023-11-15T22:20:28Z

scripts/landcover.py

+    100 samples from all tiles with water between 30% and 70% (making sure we
+    capture some, but exclude only purely water so we catch coasts)
+    """
+    data = geopandas.read_file(Path(wd, "mgrs_stats.fgb"))


Could you share this mgrs_stats.fgb file please? I got a permission error trying to access s3://esa-worldcover/v200/2021/map for some reason, and downloading from https://worldcover2021.esa.int/downloader is taking a long time!

Strange, s3 sync s3://esa-worldcover/v200/2021/map $wd/esa-worldcover-v200-2021-map --no-sign-request works perfectly for me. But yes, it's quite a lot of data (120GB), so no need for everyone to reproduce the stats file! Attaching below.
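The water criterion from the docstring quoted in this thread (tiles with water between 30% and 70%, so that coasts are captured but pure-water tiles are not) could be sketched as follows. This is only an illustration under assumptions: the column names `water` and `pixel_count`, the function name `sample_coastal`, and the toy data are hypothetical and not read from mgrs_stats.fgb.

```python
import pandas as pd


def sample_coastal(
    stats: pd.DataFrame,
    n: int,
    low: float = 0.3,
    high: float = 0.7,
    seed: int = 42,
) -> pd.DataFrame:
    """Randomly pick up to n tiles whose water share lies between low and high."""
    water_share = stats["water"] / stats["pixel_count"]
    candidates = stats[(water_share > low) & (water_share < high)]
    # Never ask for more rows than there are candidates.
    return candidates.sample(n=min(n, len(candidates)), random_state=seed)


# Toy table: water pixels out of 1000 pixels per tile (made-up values).
stats = pd.DataFrame(
    {
        "name": ["tile_a", "tile_b", "tile_c", "tile_d", "tile_e"],
        "water": [0, 350, 500, 690, 1000],
        "pixel_count": [1000, 1000, 1000, 1000, 1000],
    }
)

picked = sample_coastal(stats, n=2)
```

Fixing `random_state` keeps the selection reproducible between runs, which matches the stated goal of reproducible sampling scripts.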

weiji14 · 2023-11-15T22:33:46Z

scripts/landcover.py

+    result = pandas.concat(
+        [
+            diversity,
+            urban,
+            wetland,
+            mangroves,
+            moss,
+            cropland,
+            trees,
+            shrubland,
+            grassland,
+            bare,
+            snow,
+            water,
+        ]
+    )


Just to understand, this sampling function is independently getting the highest values for each category (plus some extra MGRS tiles for diversity and water areas), and then concatenating those rows together into a single dataframe?

Plotting your mgrs_sample.geojson file from #29 (comment), I see a few cases where the same MGRS tile is sampled more than once. E.g.:

MGRS tile 56VLM - sampled 3 times:

MGRS tile 17RMQ - sampled 2 times:

MGRS tile 32TLP - sampled 2 times:

The duplicates might be due to the independent random sampling per-category and then concatenation. Perhaps we could remove such duplicate rows before saving out the GeoJSON file?
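To illustrate how independent per-category selection followed by concatenation can pick the same tile more than once, here is a small sketch. The `top_tiles` helper, the column names, and the values are assumptions made for the example; the actual selection logic in scripts/landcover.py may differ.

```python
import pandas as pd

# Made-up class shares per tile; not taken from the real stats file.
stats = pd.DataFrame(
    {
        "name": ["tile_a", "tile_b", "tile_c", "tile_d"],
        "urban": [0.9, 0.7, 0.1, 0.0],
        "trees": [0.8, 0.1, 0.6, 0.0],
    }
)


def top_tiles(table: pd.DataFrame, column: str, n: int) -> pd.DataFrame:
    """Select the n tiles with the highest value in one category."""
    return table.nlargest(n, column)


# Each category is selected independently, then the rows are stacked.
result = pd.concat([top_tiles(stats, "urban", 2), top_tiles(stats, "trees", 2)])

# Tiles that rank high in several categories appear once per category.
counts = result["name"].value_counts()
repeated = counts[counts > 1]
```

Here `tile_a` ranks first for both categories, so it ends up in `result` twice, which is the same pattern as the repeated tiles listed above.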

Yes, your assessment is correct. I had an earlier version that would split the selected rows, but that got dropped along the way. So, good catch; I will drop duplicates as part of the script.
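A minimal sketch of dropping duplicates, assuming the concatenated result identifies each tile in a column called `name` (a hypothetical column name). The repeat counts mirror the three tiles reported in this thread; the row order is arbitrary and the fix actually committed to the script may look different.

```python
import pandas as pd

# Concatenated selection in which some tiles appear several times.
result = pd.DataFrame(
    {"name": ["56VLM", "17RMQ", "56VLM", "32TLP", "17RMQ", "56VLM", "32TLP"]}
)

# Keep the first occurrence of each tile and renumber the rows.
deduplicated = result.drop_duplicates(subset="name").reset_index(drop=True)
```

Since a geopandas GeoDataFrame is a pandas DataFrame subclass, the same call should apply to the sampled tiles before they are written out as GeoJSON.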

yellowcap · 2023-11-16T12:22:48Z

Sharing the stats file here @weiji14
mgrs_stats.zip

yellowcap · 2023-11-16T12:32:43Z

Attaching the updated sampled MGRS tiles as GeoJSON, this time without duplicates. Thanks @weiji14 for the catch. We can use this as input for the datacubes, cc @lillythomas

mgrs_sample.zip

Closes #28

yellowcap self-assigned this Nov 13, 2023

This was referenced Nov 14, 2023

Datacube #27

Merged

Add geopandas-base #34

Merged

yellowcap force-pushed the worldcover-sampling branch from 321730b to 9e3a219 Compare November 15, 2023 14:08

yellowcap marked this pull request as ready for review November 15, 2023 14:15

yellowcap requested a review from a team November 15, 2023 14:15

yellowcap force-pushed the worldcover-sampling branch 3 times, most recently from 7abd947 to 18e9bd1 Compare November 15, 2023 14:29

srmsoumya approved these changes Nov 15, 2023


weiji14 reviewed Nov 15, 2023


weiji14 linked an issue Nov 16, 2023 that may be closed by this pull request

Develop geographic sampling strategy based on Worldcover #28

Closed

2 tasks

yellowcap added 3 commits November 16, 2023 15:41

Add landcover based sampling scripts

97978d4

Closes #28

Drop duplicates, fix typo, uncomment compute_stats function.

723081b

Fix comment that was out of sync with code

20a6682

yellowcap force-pushed the worldcover-sampling branch from 356cc0e to 20a6682 Compare November 16, 2023 15:42

yellowcap merged commit d525d59 into main Nov 16, 2023
2 checks passed

yellowcap deleted the worldcover-sampling branch November 16, 2023 15:46

weiji14 added the data-pipeline Pull Requests about the data pipeline label Nov 20, 2023

weiji14 mentioned this pull request Nov 21, 2023

Bump conda-lock to 2.5.1, add fiona and h5netcdf #46

Merged

weiji14 mentioned this pull request Dec 2, 2023

Send early sample of embeddings #35

Closed
