## The most basic end-to-end pipeline 

Required user inputs:

- Which constellation (Sentinel-1, Sentinel-2). Also provide support (i.e. default STAC metadata) for other collections?
- Which bands --> for S2, allow the option to choose distance_to_cloud or scl_dilation_mask as a band
- Temporal extent
- Reference dataset --> Require standardized input format?
- Patch (if so, which size/shape?) or point?
- Which openEO backend(s)
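
The inputs listed above could be gathered in a single configuration object. The following is only a sketch: `ExtractionConfig`, its field names, and the `reference.geojson` path are hypothetical and not part of GFMap or openEO.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ExtractionConfig:
    """Hypothetical container for the user inputs (not a GFMap API)."""

    collection: str                    # e.g. "SENTINEL2_L2A"
    bands: List[str]                   # e.g. ["B02", "B03", "B04"]
    temporal_extent: Tuple[str, str]   # ISO dates: (start, end)
    reference_path: str                # path to the reference dataset
    backend: str                       # openEO backend URL
    patch_size: Optional[int] = None   # pixels per side; None means point extraction

    def __post_init__(self) -> None:
        start, end = self.temporal_extent
        # ISO-8601 date strings sort chronologically, so comparing strings is enough.
        if start >= end:
            raise ValueError("temporal_extent start must precede end")
        if not self.bands:
            raise ValueError("at least one band is required")
        if self.patch_size is not None and self.patch_size <= 0:
            raise ValueError("patch_size must be positive")

    @property
    def is_point_extraction(self) -> bool:
        return self.patch_size is None


config = ExtractionConfig(
    collection="SENTINEL2_L2A",
    bands=["B02", "B03", "B04"],
    temporal_extent=("2019-01-01", "2019-07-01"),
    reference_path="reference.geojson",  # placeholder path
    backend="openeo.vito.be",
)
```

Validating the inputs once, at construction time, means a malformed request fails before any job is submitted to a backend.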

## Reference data

First, the reference data needs to be read in and used to split the extraction into jobs.
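
One possible way to split the jobs, sketched under the assumption that the reference data are points in longitude/latitude, is to group them by a coarse grid so that each job covers a compact area. The function `split_into_jobs`, the one-degree cell size, and the sample coordinates are illustrative choices, not an existing GFMap routine.

```python
from collections import defaultdict
from math import floor
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees
Cell = Tuple[int, int]       # integer grid cell index


def split_into_jobs(points: List[Point], cell_size_deg: float = 1.0) -> Dict[Cell, List[Point]]:
    """Group points by the grid cell they fall in; each group becomes one job."""
    jobs = defaultdict(list)
    for lon, lat in points:
        cell = (floor(lon / cell_size_deg), floor(lat / cell_size_deg))
        jobs[cell].append((lon, lat))
    return dict(jobs)


# Made-up reference points, for illustration only.
reference_points = [(4.35, 50.85), (4.70, 50.88), (5.57, 50.63), (-0.12, 51.51)]
jobs = split_into_jobs(reference_points)
for cell, members in sorted(jobs.items()):
    print(cell, len(members))
```

Other splitting criteria (per Sentinel-2 tile, per backend, per year) would follow the same pattern with a different grouping key.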

## Extractions

First we need to do some extractions based on the reference data. A lot of variety is already possible here. We distinguish between point and patch extractions.

### Point Extractions

In [1]:
# Some user defined parameters
temporal_extent = ["2019-01-01", "2019-07-01"]
collection = "SENTINEL2_L2A"  # These names are standardized in GFMap
bands = ["B02", "B03", "B04"]  # These names are standardized in GFMap
backend = "openeo.vito.be"  # This name is standardized in GFMap

In [3]:
import openeo
c = openeo.connect(backend).authenticate_oidc()
raw_extraction = c.load_collection(collection_id=collection, 
                                   bands=bands, 
                                   temporal_extent=temporal_extent)

Authenticated using refresh token.


Here the user already has a lot of choices:
- Do they want to extract the raw data without any form of preprocessing? 
- Do they want to perform some form of compositing?
- Do they want to perform some cloud masking?
- Do they want to apply extra features on top of the vanilla bands?

So in GFMap there should be the option to immediately do a raw extraction in patch or point form, where only the raw bands are extracted (possibly with distance_to_cloud and cloud_mask added). In this option the user extracts data that still has a time dimension.
There should also be the option to first perform preprocessing and feature computation (do we set a default, or does the user have to specify? What about bespoke methods like Presto?) and only then do the extractions. In this option the user extracts data without a time dimension.
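
To make the difference between the two options concrete, here is a minimal local sketch that uses NumPy instead of openEO. The pixel values, the boolean cloud flags, and the fixed three-observation window are all made up for illustration. A real pipeline would run these steps on the backend.

```python
import numpy as np

# Hypothetical raw extraction for one pixel: 6 observations x 2 bands.
raw = np.array([
    [100.0, 200.0],
    [110.0, 210.0],
    [900.0, 950.0],  # cloudy observation
    [120.0, 220.0],
    [130.0, 230.0],
    [140.0, 240.0],
])
cloudy = np.array([False, False, True, False, False, False])

# Cloud masking: set cloudy observations to NaN so later steps ignore them.
masked = np.where(cloudy[:, None], np.nan, raw)

# Compositing: median over fixed windows of 3 observations
# (a stand-in for e.g. monthly compositing).
window = 3
composites = np.stack([
    np.nanmedian(masked[i:i + window], axis=0)
    for i in range(0, len(masked), window)
])

# Feature computation: collapse the remaining time steps into one value per band.
features = np.nanmedian(composites, axis=0)

print(raw.shape)         # raw option: full time dimension kept
print(composites.shape)  # after compositing: one time step per window
print(features.shape)    # after feature computation: no time dimension
```

The raw option corresponds to exporting `raw` as is, while the preprocessing option exports something like `features`, where the time axis is gone.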

It should also be possible (as in WorldCereal, for example) for the user to do both of the above: first extract the raw data, then load it in again with `load_stac`, and only then do feature computation and extract again.

For now, let's do feature computation first and only then the extractions.

In [4]:
from openeo.processes import ProcessBuilder, array_concat

# For each band, compute the 10%, 50% and 90% quantiles and the standard deviation
def compute_features(input_timeseries: ProcessBuilder):
    return array_concat(
        input_timeseries.quantiles(probabilities=[0.1, 0.5, 0.9]),
        input_timeseries.sd(),
    )

# Use apply_dimension to remove the time dimension and map the statistics to the band dimension.
# linear_scale_range(0, 250, 0, 250) clips values to [0, 250], so they fit in an unsigned 8-bit integer (uint8)
features = raw_extraction.apply_dimension(
    dimension="t", target_dimension="bands", process=compute_features
).apply(lambda x: x.linear_scale_range(0, 250, 0, 250))

# Finally, rename the bands to reflect the computed features
all_bands = [band + "_" + stat for band in raw_extraction.metadata.band_names for stat in ["p10","p50","p90","sd"]]
features = features.rename_labels('bands',all_bands)

In [5]:
all_bands

['B02_p10',
 'B02_p50',
 'B02_p90',
 'B02_sd',
 'B03_p10',
 'B03_p50',
 'B03_p90',
 'B03_sd',
 'B04_p10',
 'B04_p50',
 'B04_p90',
 'B04_sd']
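
As a sanity check on the band naming, the same statistics can be reproduced locally with NumPy on a made-up time series. This is only an analogy to the openEO graph above: `compute_features_local` and the sample values are hypothetical, the sample standard deviation (`ddof=1`) is an assumption, and the quantile interpolation on a given backend may differ from NumPy's default.

```python
import numpy as np

band_names = ["B02", "B03", "B04"]
stats = ["p10", "p50", "p90", "sd"]

# Hypothetical time series for one pixel: 5 time steps x 3 bands.
timeseries = np.array([
    [10.0, 20.0, 30.0],
    [20.0, 30.0, 40.0],
    [30.0, 40.0, 50.0],
    [40.0, 50.0, 60.0],
    [50.0, 60.0, 70.0],
])


def compute_features_local(series: np.ndarray) -> np.ndarray:
    """Return [p10, p50, p90, sd] for a 1-D time series."""
    quantiles = np.quantile(series, [0.1, 0.5, 0.9])
    sd = np.std(series, ddof=1)  # assumption: sample standard deviation
    return np.concatenate([quantiles, [sd]])


# One feature vector per band, concatenated band by band so that the
# values line up with labels such as B02_p10, B02_p50, ...
features_local = np.concatenate(
    [compute_features_local(timeseries[:, b]) for b in range(len(band_names))]
)
labels = [f"{band}_{stat}" for band in band_names for stat in stats]
named = dict(zip(labels, features_local))
```

Looking up `named["B03_p50"]` then returns the median of the second band, which mirrors how the renamed bands would be addressed after extraction.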