Documented v1 sampling strategy (#249)

* Documented v1 sampling strategy
* Included suggestions from review.
Clay-foundation · May 24, 2024 · e7cdb7f
1 parent bb38678
commit e7cdb7f
Showing 1 changed file with 118 additions and 26 deletions.
diff --git a/docs/data_sampling.md b/docs/data_sampling.md
@@ -1,6 +1,43 @@
-# Data sampling strategy
+# Training Data
 
-To create a balanced dataset for model training, we used a sampling strategy
+This section describes how we created the training dataset for the Clay model.
+
+## Data sources selection
+
+The goal for the Clay model is to be as general as possible. It should be able to accept data from any satellite, aerial, or drone platform. The model design is the basis for making this possible. Drawing inspiration from earlier work on foundation models such as Prithvi, SatMAE, ScaleMAE, DOFA, and SpectralGPT, we have developed a model architecture capable of accepting inputs with diverse spectral bands and resolutions, in different sizes.
+
+To train such a model, it is necessary to create a training dataset
+that contains data from multiple platforms and is as varied as possible in terms of:
+
+- spectral band definitions
+- spatial distribution
+- temporal distribution
+- ground sampling distance
+
+To achieve this, we first compiled a [list of possible input platforms](https://github.com/Clay-foundation/model/issues/128). The list of candidate systems is rather long and will grow in the future. To reduce complexity, we converged on a shorter list of platforms for the first round of model training.
+
+The selection criteria were availability in the cloud, existence of STAC catalogs, and cloud-optimized formats. This resulted in the following list of systems that we
+included in the training for Clay v1:
+
+
+| Platform | Spatial Coverage | Spectral bands | GSD (meters) |
+|---------|------------------|----------------|--------------|
+| Landsat 8 and 9 | Global | 6 optical bands | 30 |
+| Sentinel 2 L2A | Global | 10 optical bands | 10 |
+| Sentinel 1 RTC | Global | 2 radar bands | 10 |
+| NAIP | USA | 4 optical bands | < 1 |
+| LINZ | New Zealand | 3 optical bands | < 0.5 |
+
+
+## Sampling strategy
+
+Once imagery sources are selected, the next step is to develop a sampling strategy. We are not able to process the entire archive, and so it is important to select the right subset of the archives for training.
+
+Our driving principle is that the model should learn natural features as well as human-made features. In many cases, human-made features are smaller and less evenly distributed. This has driven some of the sampling decisions, as described below.
+
+### Global sampling
+
+We created a single sampling strategy for all four global satellite systems that we included in the model training (Sentinel 1 and 2, and Landsat 8 and 9). To create a balanced dataset for model training, we used a sampling strategy
 based on land cover classes from the [ESA WorldCover](https://esa-worldcover.org/)
 layer.
 
@@ -26,8 +63,87 @@ After selecting MGRS tiles for each of these criteria, we removed duplicates.
 
 The following table summarizes the selection criteria for each class.
 
+| Class | Nr of Tiles | From highest |
+|---|---|---|
+Diversity | 400 | 2000
+Built-up | 300 | 300
+Built-up | 1000 | 1500
+Herbaceous wetland | 50 | 500
+Mangroves | 50 | 500
+Moss and lichen | 50 | 500
+Cropland | 800 | 3600
+Tree cover | 150 | 750
+Shrubland | 100 | 500
+Grassland | 200 | 500
+Bare / sparse vegetation | 50 | 500
+Snow and Ice | 25 | 500
+Permanent water bodies | 50 | 1000
+
+This resulted in a sample of 2728 MGRS tiles in total. The resulting sample file can be downloaded from the following link:
+
+https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample_v02.fgb
+
+We used these locations for all of the global platforms. For more details about how exactly we implemented the sample selection, review the corresponding [stacchip processors](https://github.com/Clay-foundation/stacchip/blob/main/stacchip/processors/).
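
As a rough cross-check of the table above, the per-class tile counts can be summed and compared with the 2728 tiles that remain after duplicates are removed. This is only a sketch: it assumes that every criterion contributed exactly the number of tiles listed, so the difference is an estimate of the overlap between criteria, not a figure reported by the project.

```python
# Tile counts per selection criterion, copied from the table above.
# "Built-up" appears twice because the table lists two built-up criteria.
tiles_per_criterion = [
    ("Diversity", 400),
    ("Built-up", 300),
    ("Built-up", 1000),
    ("Herbaceous wetland", 50),
    ("Mangroves", 50),
    ("Moss and lichen", 50),
    ("Cropland", 800),
    ("Tree cover", 150),
    ("Shrubland", 100),
    ("Grassland", 200),
    ("Bare / sparse vegetation", 50),
    ("Snow and Ice", 25),
    ("Permanent water bodies", 50),
]

# Number of unique MGRS tiles reported after de-duplication.
unique_tiles = 2728

selected_before_dedup = sum(count for _, count in tiles_per_criterion)

# Assumption: each criterion yielded its full tile count, so the gap
# corresponds to tiles that were selected by more than one criterion.
estimated_duplicates = selected_before_dedup - unique_tiles

print(f"Selected before de-duplication: {selected_before_dedup}")
print(f"Estimated duplicates removed:   {estimated_duplicates}")
```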
+
+### Landsat 8 and 9 sampling strategy
+
+To further increase variety in the dataset, we used both Level 1 (L1) and Level 2 (L2) products for training. For each location and each processing level, we selected one random year between 2018 and 2023, and used the least cloudy scene from each quarter of the selected year.
+
+### Sentinel-2 sampling strategy
+
+For each location we selected two random years between 2018 and 2023, and for each year we used the least cloudy scene in each quarter.
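
The temporal selection described above can be sketched in a few lines. This is an illustration under stated assumptions, not the stacchip implementation: the `pick_scenes` helper and the `date` / `cloud_cover` keys of each scene record are hypothetical, and the real processors query STAC catalogs rather than filtering an in-memory list.

```python
import random
from datetime import date


def pick_scenes(scenes, n_years=2, first_year=2018, last_year=2023, seed=None):
    """Pick the least cloudy scene per quarter for n_years random years.

    `scenes` is a list of dicts with hypothetical keys "date" (datetime.date)
    and "cloud_cover" (percent) describing the scenes found at one location.
    """
    rng = random.Random(seed)
    # Draw the years without replacement from the inclusive range.
    years = rng.sample(range(first_year, last_year + 1), n_years)

    best = {}  # (year, quarter) -> least cloudy scene seen so far
    for scene in scenes:
        scene_date = scene["date"]
        if scene_date.year not in years:
            continue
        quarter = (scene_date.month - 1) // 3 + 1
        key = (scene_date.year, quarter)
        if key not in best or scene["cloud_cover"] < best[key]["cloud_cover"]:
            best[key] = scene

    # Return the selected scenes in chronological order.
    return [best[key] for key in sorted(best)]


# Usage with made-up scene records for a single location.
example_scenes = [
    {"date": date(year, month, 15), "cloud_cover": float((year * month) % 40)}
    for year in range(2018, 2024)
    for month in range(1, 13)
]
selected = pick_scenes(example_scenes, n_years=2, seed=42)
```

With `n_years=2` this mirrors the Sentinel-2 rule (up to eight scenes per location), and with `n_years=1` it mirrors the Landsat rule applied once per processing level.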
+
+
+### NAIP sampling strategy
+
+The sampling strategy for [NAIP](https://catalog.data.gov/dataset/national-agriculture-imagery-program-naip) was based on [Natural Earth](https://www.naturalearthdata.com) data. The sample includes all populated places, protected
+areas and parks, airports, and ports. In addition, we sampled one random point
+along each river, and one random location within each lake that is registered
+in Natural Earth. Finally, we sampled 4000 random points. All data was
+filtered to be within the CONUS region.
+
+### LINZ sampling strategy
+
+For [LINZ](https://github.com/linz/imagery) we used simple random subsampling because there is no STAC API available for spatial search. We selected a random subset of all scenes for the different sub-collections that are available for LINZ.
+
+More specifically, we randomly selected 50% of the scenes, with a minimum of 10
+and a maximum of 2000 scenes for each catalog that was included.
+We selected the latest imagery for each of the available regions
+of New Zealand. The list of catalogs is in the LINZ processor file.
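
The 50% rule with its lower and upper bounds amounts to clamping the sample size before drawing. The sketch below is a minimal illustration and not the code from the LINZ processor: the `subsample_catalog` helper is hypothetical, rounding down is an assumption, and so is the choice to use a catalog in full when it holds fewer scenes than the minimum.

```python
import random


def subsample_catalog(scene_ids, fraction=0.5, min_scenes=10, max_scenes=2000, seed=None):
    """Randomly select a clamped fraction of the scenes in one catalog."""
    scene_ids = list(scene_ids)

    # Start from the requested fraction (assumption: rounded down).
    target = int(len(scene_ids) * fraction)

    # Clamp to the [min_scenes, max_scenes] range described in the text.
    target = max(min_scenes, min(max_scenes, target))

    # Assumption: a catalog smaller than the minimum is used in full.
    target = min(target, len(scene_ids))

    # Sample without replacement so that no scene is selected twice.
    return random.Random(seed).sample(scene_ids, target)


# Usage with made-up scene identifiers.
small = subsample_catalog(range(15), seed=0)      # 15 scenes -> 10 (minimum applies)
medium = subsample_catalog(range(100), seed=0)    # 100 scenes -> 50 (half)
large = subsample_catalog(range(10000), seed=0)   # 10000 scenes -> 2000 (maximum applies)
```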
+
+
+## Data preparation
+
+To be able to include multiple platforms in model training, we worked on a standardisation of the processing pipeline. The goal was to develop a framework that can collect data from a large variety of formats and locations in a consistent way. For this we developed [stacchip](https://clay-foundation.github.io/stacchip/), a library that helps prepare training data images. Please consult the documentation of the library to learn more. At a high level, the goals of stacchip are:
+
+- Keeping the data in original format for as long as possible
+- Scalable, extensible indexing of chips
+- Indexing processors for different platforms
+- Chipping utility that takes the index and dynamically creates images for training
+- Use GeoParquet: fast storage option and easy to combine indexes from platforms
+- Can be used for training and inference on the fly
+
+## Dataset size
+
+Using stacchip, we created a dataset with a size of 33.8 TB of imagery, with about 70 million chips created. The following table shows the distribution of imagery chips used for Clay v1 training.
+
+| Source | Number of chips |
+| ------ | --------------- |
+| NAIP           | 20984171 |
+| LINZ            | 3299006 |
+| Sentinel-2-l2a | 18683945 |
+| Landsat-c2l1    | 5827333 |
+| Landsat-c2l2-sr | 5790651 |
+| Sentinel-1-rtc | 16133394 |
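
The "about 70 million" figure can be checked directly against the table. The short sketch below only does arithmetic on the chip counts listed above; the variable names are illustrative.

```python
# Chip counts per source, copied from the table above.
chips = {
    "NAIP": 20984171,
    "LINZ": 3299006,
    "Sentinel-2-l2a": 18683945,
    "Landsat-c2l1": 5827333,
    "Landsat-c2l2-sr": 5790651,
    "Sentinel-1-rtc": 16133394,
}

total = sum(chips.values())

# Share of each source in the full training set.
shares = {source: count / total for source, count in chips.items()}

print(f"Total chips: {total}")
for source, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{source:<16} {share:6.1%}")
```

The total comes to 70,718,500 chips, with NAIP as the largest single source.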
+
+# Older versions
+
+For older versions of the model we used the following sampling strategies.
+
 ## For model version v0.1
 
+For v0.1 we used a smaller sample that was slightly less focused on human landscapes. The distribution of the MGRS tiles we used was as follows:
+
 | Class | Nr of Tiles | From highest |
 |---|---|---|
 Diversity | 500 | 3000
@@ -48,27 +164,3 @@ This resulted in a sample of 1517 MGRS tiles total in our sample.
 The resulting sample file can be downloaded from the following link
 
 https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample.fgb
-
-## For model version v0.2
-
-| Class | Nr of Tiles | From highest |
-|---|---|---|
-Diversity | 400 | 2000
-Built-up | 300 | 300
-Built-up | 1000 | 1500
-Herbaceous wetland | 50 | 500
-Mangroves | 50 | 500
-Moss and lichen | 50 | 500
-Cropland | 800 | 3600
-Tree cover | 150 | 750
-Shrubland | 100 | 500
-Grassland | 200 | 500
-Bare / sparse vegetation | 50 | 500
-Snow and Ice | 25 | 500
-Permanent water bodies | 50 | 1000
-
-This resulted in a sample of 2728 MGRS tiles total in our sample.
-
-The resulting sample file can be downloaded from the following link
-
-https://clay-mgrs-samples.s3.amazonaws.com/mgrs_sample_v02.fgb