Documented v1 sampling strategy (#249)
* Documented v1 sampling strategy

* Included suggestions from review.
yellowcap committed May 24, 2024
1 parent bb38678 commit e7cdb7f
Showing 1 changed file with 118 additions and 26 deletions.
144 changes: 118 additions & 26 deletions docs/data_sampling.md
@@ -1,6 +1,43 @@
# Data sampling strategy
# Training Data

To create a balanced dataset for model training, we used a sampling strategy
This section describes how we created the training dataset for the clay model.

## Data sources selection

The goal for the Clay model is to be as general as possible: it should accept data from any satellite, aerial, or drone platform. The model design is the basis for this. Drawing inspiration from earlier work on foundation models such as Prithvi, SatMAE, ScaleMAE, DOFA, and SpectralGPT, we have developed a model architecture that accepts inputs of different sizes with diverse spectral bands and resolutions.

To train such a model, it is necessary to create a training dataset
that contains data from multiple platforms, and is as varied as possible in terms of

- spectral band definitions
- spatial distribution
- temporal distribution
- ground sampling distance

To achieve this, we first compiled a [list of possible input platforms](https://github.com/Clay-foundation/model/issues/128). The list of candidate systems is rather long and will grow in the future. To reduce complexity, we converged on a shorter list of platforms for the first round of model training.

The selection criteria were availability in the cloud, the existence of STAC catalogs, and cloud-optimized formats. This resulted in the following list of systems that we
included in the training for Clay v1:


| Platform | Spatial Coverage | Spectral bands | GSD (meters) |
|---------|------------------|----------------|--------------|
| Landsat 8 and 9 | Global | 6 optical bands | 30 |
| Sentinel 2 L2A | Global | 10 optical bands | 10 |
| Sentinel 1 RTC | Global | 2 radar bands | 10 |
| NAIP | USA | 4 optical bands | < 1 |
| LINZ | New Zealand | 3 optical bands | < 0.5 |


## Sampling strategy

Once imagery sources are selected, the next step is to develop a sampling strategy. We are not able to process the entire archive, and so it is important to select the right subset of the archives for training.

Our driving principle is that the model should learn natural features as well as human-made features. Human-made features are often smaller and less evenly distributed. This has driven some of the sampling decisions described below.

### Global sampling

We created a single sampling strategy for all four global satellite systems that we included in the model training (Sentinel 1 and 2, and Landsat 8 and 9). To create a balanced dataset for model training, we used a sampling strategy
based on land cover classes from the [ESA WorldCover](https://esa-worldcover.org/)
layer.

@@ -26,8 +63,87 @@ After selecting MGRS tiles for each of these criteria, we removed duplicates.

The following table summarizes the selection criteria for each class.

| Class | Nr of Tiles | From highest |
|---|---|---|
Diversity | 400 | 2000
Built-up | 300 | 300
Built-up | 1000 | 1500
Herbaceous wetland | 50 | 500
Mangroves | 50 | 500
Moss and lichen | 50 | 500
Cropland | 800 | 3600
Tree cover | 150 | 750
Shrubland | 100 | 500
Grassland | 200 | 500
Bare / sparse vegetation | 50 | 500
Snow and Ice | 25 | 500
Permanent water bodies | 50 | 1000

This resulted in a sample of 2728 MGRS tiles in total. The resulting sample file can be downloaded from the following link:

https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample_v02.fgb

We used these locations for all of the global platforms. For more details about how exactly we implemented the sample selection, review the corresponding [stacchip processors](https://github.com/Clay-foundation/stacchip/blob/main/stacchip/processors/).
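The selection table above can be read as "draw *Nr of Tiles* tiles at random from the *From highest* top-ranked tiles for that class, then remove duplicates". The sketch below illustrates that reading with the standard library only. It is an assumption about the procedure, not the stacchip implementation: the `coverage` input (a per-class score for each MGRS tile) and the function name `select_tiles` are hypothetical.

```python
import random

# (class, number of tiles to draw, size of the top-ranked pool), copied from the table above.
# "Built-up" appears twice in the table, so a list of tuples is used instead of a dict.
SELECTION = [
    ("Diversity", 400, 2000),
    ("Built-up", 300, 300),
    ("Built-up", 1000, 1500),
    ("Herbaceous wetland", 50, 500),
    ("Mangroves", 50, 500),
    ("Moss and lichen", 50, 500),
    ("Cropland", 800, 3600),
    ("Tree cover", 150, 750),
    ("Shrubland", 100, 500),
    ("Grassland", 200, 500),
    ("Bare / sparse vegetation", 50, 500),
    ("Snow and Ice", 25, 500),
    ("Permanent water bodies", 50, 1000),
]


def select_tiles(coverage, selection=SELECTION, seed=0):
    """Return the de-duplicated set of selected tile ids.

    coverage: hypothetical input mapping class name -> {tile_id: score},
    where a higher score means more of that class (or more diversity).
    """
    rng = random.Random(seed)
    selected = set()
    for name, nr_tiles, from_highest in selection:
        scores = coverage.get(name, {})
        # Rank tiles by score and keep only the top `from_highest` as the pool.
        ranked = sorted(scores, key=scores.get, reverse=True)
        pool = ranked[:from_highest]
        # Draw a random sample from the pool; the set removes duplicates across classes.
        selected.update(rng.sample(pool, min(nr_tiles, len(pool))))
    return selected
```

The "Nr of Tiles" column sums to 3225 draws, so a final count of 2728 tiles is consistent with tiles being selected under more than one criterion and then de-duplicated.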

### Landsat 8 and 9 sampling strategy

To further increase variety in the dataset, we used both L1 and L2 products for training. For each location and each level of the platform, we selected one random year between 2018 and 2023, and used the least cloudy scenes in each quarter of the selected year.

### Sentinel-2 sampling strategy

For each location we selected two random years between 2018 and 2023, and for each year we used the least cloudy scene in each quarter.
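As a rough illustration of the per-location scene selection described for Sentinel-2 (two random years) and Landsat (one random year), the sketch below picks random years and keeps the least cloudy scene per quarter. The scene tuples and the function name `pick_scenes` are hypothetical; in practice the scenes would come from a STAC search, which is not shown here.

```python
import random
from datetime import date


def pick_scenes(scenes, n_years, first_year=2018, last_year=2023, seed=0):
    """Keep the least cloudy scene per quarter for `n_years` random years.

    scenes: list of (scene_id, acquisition_date, cloud_cover_percent) tuples
    for a single location (hypothetical structure).
    """
    rng = random.Random(seed)
    years = set(rng.sample(range(first_year, last_year + 1), n_years))
    best = {}
    for scene_id, acquired, cloud in scenes:
        if acquired.year not in years:
            continue
        quarter = (acquired.month - 1) // 3 + 1
        key = (acquired.year, quarter)
        # Keep the scene with the lowest cloud cover seen so far for this quarter.
        if key not in best or cloud < best[key][2]:
            best[key] = (scene_id, acquired, cloud)
    return [best[key] for key in sorted(best)]


# Toy usage: two scenes with made-up cloud cover values in a single quarter.
example = [
    ("scene-a", date(2020, 1, 10), 35.0),
    ("scene-b", date(2020, 2, 14), 5.0),
]
print(pick_scenes(example, n_years=6))
```

With `n_years=2` this yields at most eight scenes per location (four quarters for each of two years); with `n_years=1` it yields at most four.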


### NAIP sampling strategy

The sampling strategy for [NAIP](https://catalog.data.gov/dataset/national-agriculture-imagery-program-naip) was based on [Natural Earth](https://www.naturalearthdata.com) data. The sample includes all populated places, protected
areas and parks, airports, and ports. In addition, we sampled one random point
along each river, and one random location within each lake that is registered
in Natural Earth. Finally, we sampled 4000 random points. All data was
filtered to be within the CONUS region.

### LINZ sampling strategy

For [LINZ](https://github.com/linz/imagery) we used simple random subsampling because there is no STAC API for spatial search. We selected a random subset of all scenes for the different sub-collections that are available for LINZ.

More specifically, we randomly selected 50% of the scenes, with a minimum of 10
and a maximum of 2000 scenes for each catalog that was included.
We selected the latest imagery for each of the available regions
of New Zealand. The list of catalogs is in the LINZ processor file.
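The 50% rule with its lower and upper bounds can be sketched as follows. This is a minimal illustration, not the code in the LINZ processor: the function names are hypothetical, and both the rounding (truncation) and the handling of catalogs with fewer than 10 scenes (take everything) are assumptions.

```python
import random


def sample_size(n_scenes, fraction=0.5, minimum=10, maximum=2000):
    """Number of scenes to draw from a catalog holding `n_scenes` scenes."""
    target = int(n_scenes * fraction)  # assumption: truncate rather than round
    target = max(minimum, min(maximum, target))
    # Assumption: a catalog smaller than the minimum is used in full.
    return min(target, n_scenes)


def subsample(scene_ids, seed=0):
    """Draw a random subset of scene ids (a list) according to sample_size."""
    rng = random.Random(seed)
    return rng.sample(scene_ids, sample_size(len(scene_ids)))


for n in (6, 14, 100, 10_000):
    print(n, "->", sample_size(n))
```

The bounds only matter at the extremes: catalogs with fewer than 20 scenes are raised to the minimum of 10 (or used in full if smaller), and catalogs with more than 4000 scenes are capped at 2000.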


## Data preparation

To include multiple platforms in model training, we standardised the processing pipeline. The goal was to develop a framework that can collect data from a large variety of formats and locations in a consistent way. For this we developed [stacchip](https://clay-foundation.github.io/stacchip/), a library that helps prepare training data images. Consult the library's documentation for details; at a high level, the goals of stacchip are:

- Keeping the data in its original format for as long as possible
- Scalable, extendable indexing of chips
- Indexing processors for different platforms
- A chipping utility that takes the index and dynamically creates images for training
- Using geoparquet as a fast storage option that makes it easy to combine indexes from different platforms
- Usable for training and inference on the fly

## Dataset size

Using stacchip, we created a dataset of 33.8 TB of imagery, comprising about 70 million chips. The following table shows the distribution of imagery chips used for Clay v1 training.

| Source | Number of chips |
| ------ | --------------- |
| NAIP | 20984171 |
| LINZ | 3299006 |
| Sentinel-2-l2a | 18683945 |
| Landsat-c2l1 | 5827333 |
| Landsat-c2l2-sr | 5790651 |
| Sentinel-1-rtc | 16133394 |
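As a quick check of the "about 70 million" figure, the chip counts in the table can be summed directly. The snippet below only restates the numbers from the table; the variable names are illustrative.

```python
# Chip counts per source, copied from the table above.
CHIPS = {
    "NAIP": 20_984_171,
    "LINZ": 3_299_006,
    "Sentinel-2-l2a": 18_683_945,
    "Landsat-c2l1": 5_827_333,
    "Landsat-c2l2-sr": 5_790_651,
    "Sentinel-1-rtc": 16_133_394,
}

total = sum(CHIPS.values())
shares = {source: count / total for source, count in CHIPS.items()}

print(f"total chips: {total:,}")  # total chips: 70,718,500
for source, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{source:<16} {share:6.1%}")
```

The total is 70,718,500 chips, with NAIP, Sentinel-2 and Sentinel-1 contributing the largest shares.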

# Older versions

For older versions of the model we used the following sampling strategies.

## For model version v0.1

For v0.1 we used a smaller sample that was slightly less focused on human landscapes. The distribution of the MGRS tiles we used was as follows:

| Class | Nr of Tiles | From highest |
|---|---|---|
Diversity | 500 | 3000
@@ -48,27 +164,3 @@ This resulted in a sample of 1517 MGRS tiles total in our sample.
The resulting sample file can be downloaded from the following link

https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample.fgb

## For model version v0.2

| Class | Nr of Tiles | From highest |
|---|---|---|
Diversity | 400 | 2000
Built-up | 300 | 300
Built-up | 1000 | 1500
Herbaceous wetland | 50 | 500
Mangroves | 50 | 500
Moss and lichen | 50 | 500
Cropland | 800 | 3600
Tree cover | 150 | 750
Shrubland | 100 | 500
Grassland | 200 | 500
Bare / sparse vegetation | 50 | 500
Snow and Ice | 25 | 500
Permanent water bodies | 50 | 1000

This resulted in a sample of 2728 MGRS tiles total in our sample.

The resulting sample file can be downloaded from the following link

https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample_v02.fgb
