Landcover based sampling strategy. #29

Merged
merged 3 commits into from
Nov 16, 2023

Conversation

yellowcap
Member

@yellowcap yellowcap commented Nov 13, 2023

The current sampling strategy is kept purposefully simple. We could extend it with criteria beyond landcover, and potentially use a cluster-based approach to replace the human-biased perspective of the current implementation.

Refs #28

@yellowcap yellowcap self-assigned this Nov 13, 2023
This was referenced Nov 14, 2023
@yellowcap
Member Author

The v0 sampling scripts are complete and should be fully reproducible. The current scripts select 950 tiles based on landcover, making sure we capture diverse tiles and a good representation of every class. The resulting tiles are visualized in the map below.

image

Magenta tiles are the selected ones; the others are colored by the number of land cover classes present in them.

@yellowcap yellowcap marked this pull request as ready for review November 15, 2023 14:15
@yellowcap yellowcap requested a review from a team November 15, 2023 14:15
@yellowcap yellowcap force-pushed the worldcover-sampling branch 3 times, most recently from 7abd947 to 18e9bd1 Compare November 15, 2023 14:29
scripts/landcover.py Outdated
100 samples from all tiles with water between 30% and 70% (making sure we
capture some, but exclude purely water tiles so we catch coasts)
"""
data = geopandas.read_file(Path(wd, "mgrs_stats.fgb"))
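
As a rough illustration of the water rule described in the docstring above, here is a minimal sketch in plain pandas. The column names `name` and `water` (a fraction between 0 and 1), the toy values, and the fixed `random_state` are assumptions made for this example; they are not taken from scripts/landcover.py.

```python
import pandas

# Toy stand-in for the per-tile statistics table (hypothetical columns).
stats = pandas.DataFrame(
    {
        "name": ["A", "B", "C", "D", "E"],
        "water": [0.05, 0.35, 0.50, 0.69, 0.98],
    }
)

# Keep tiles that are partly, but not purely, water (bounds are inclusive).
coastal = stats[stats["water"].between(0.3, 0.7)]

# Draw up to n rows; a fixed seed keeps the draw reproducible.
n = 2  # the docstring above uses 100
water_sample = coastal.sample(n=min(n, len(coastal)), random_state=42)
print(sorted(coastal["name"]))
```

Capping `n` at the number of candidate rows avoids an error when fewer tiles pass the filter than requested.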
Contributor

Could you share this mgrs_stats.fgb file please? I got a permission error trying to access s3://esa-worldcover/v200/2021/map for some reason, and downloading from https://worldcover2021.esa.int/downloader is taking a long time!

Member Author

Strange, s3 sync s3://esa-worldcover/v200/2021/map $wd/esa-worldcover-v200-2021-map --no-sign-request works perfectly for me. But yes, it's quite a lot of data (120GB), so there is no need for everyone to reproduce the stats file! Attaching it below.

Comment on lines +181 to +198
result = pandas.concat(
    [
        diversity,
        urban,
        wetland,
        mangroves,
        moss,
        cropland,
        trees,
        shrubland,
        grassland,
        bare,
        snow,
        water,
    ]
)
Contributor

Just to understand, this sampling function is independently getting the highest values for each category (plus some extra MGRS tiles for diversity and water areas), and then concatenating those rows together into a single dataframe?
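
To make the question concrete, here is a small sketch of the pattern being described: each class is selected independently and the selections are concatenated. The columns `name`, `urban`, and `trees`, the toy values, and the use of `nlargest` are assumptions for illustration only; the actual selection logic in scripts/landcover.py may differ.

```python
import pandas

# Toy per-tile statistics (hypothetical column names and values).
stats = pandas.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "urban": [0.9, 0.1, 0.8, 0.0],
        "trees": [0.8, 0.7, 0.1, 0.2],
    }
)

# Select the top rows for each class independently of the others.
urban = stats.nlargest(2, "urban")
trees = stats.nlargest(2, "trees")

# Concatenation keeps every selected row, so a tile picked by two
# selections appears twice in the result.
result = pandas.concat([urban, trees])
counts = result["name"].value_counts()
repeated = sorted(counts[counts > 1].index)
print(repeated)
```

In this toy table, tile "A" ranks highly for both classes, so it ends up in the concatenated frame twice.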

Plotting your mgrs_sample.geojson file from #29 (comment), I see a few cases where the exact MGRS tile is sampled more than once. E.g.:

MGRS tile 56VLM - sampled 3 times:

image

MGRS tile 17RMQ - sampled 2 times:

image

MGRS tile 32TLP - sampled 2 times:

image

The duplicates might be due to the independent random sampling per-category and then concatenation. Perhaps we could remove such duplicate rows before saving out the GeoJSON file?
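
A minimal sketch of that suggestion, assuming the concatenated frame has a column holding the MGRS tile identifier. The column name `name` is hypothetical; the tile IDs and their counts are the ones listed above.

```python
import pandas

# Stand-in for the concatenated sample, with the repeats reported above.
result = pandas.DataFrame(
    {
        "name": [
            "56VLM", "17RMQ", "56VLM", "32TLP",
            "17RMQ", "56VLM", "32TLP",
        ]
    }
)

# Keep the first occurrence of every tile and renumber the rows.
deduplicated = result.drop_duplicates(subset="name", keep="first")
deduplicated = deduplicated.reset_index(drop=True)
print(list(deduplicated["name"]))
```

Deduplicating on the tile identifier rather than on whole rows means a tile is kept once even if the rows selected by different categories differ in other columns.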

Member Author

@yellowcap yellowcap Nov 16, 2023

Yes, your assessment is correct. I had an earlier version that would split the selected rows, but that got dropped along the way. Good catch, I will drop duplicates as part of the script.

@weiji14 weiji14 linked an issue Nov 16, 2023 that may be closed by this pull request
2 tasks
@yellowcap
Member Author

Sharing the stats file here @weiji14
mgrs_stats.zip

@yellowcap
Member Author

Attaching the updated sampled MGRS tiles as GeoJSON, this time without duplicates. Thanks @weiji14 for the catch. We can use this as input for the datacubes, cc @lillythomas

mgrs_sample.zip

@yellowcap yellowcap merged commit d525d59 into main Nov 16, 2023
2 checks passed
@yellowcap yellowcap deleted the worldcover-sampling branch November 16, 2023 15:46
@weiji14 weiji14 added the data-pipeline Pull Requests about the data pipeline label Nov 20, 2023
Development

Successfully merging this pull request may close these issues.

Develop geographic sampling strategy based on Worldcover
3 participants