## The most basic end-to-end pipeline 

Required user inputs:

- Which constellation (Sentinel-1, Sentinel-2). Should other collections also be supported, i.e. with default STAC metadata?
- Which bands. For Sentinel-2, allow the option to choose `distance_to_cloud` or `scl_dilation_mask` as a band.
- Temporal extent
- Reference dataset. Should a standardized input format be required?
- Patch (if so, which size and shape?) or point?
- Which openEO backend(s)
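
The inputs listed above could be gathered in a single configuration object. The sketch below is one possible shape, not an existing GFMap API: `ExtractionConfig`, its field names and `SUPPORTED_COLLECTIONS` are hypothetical, and `SENTINEL1_GRD` is assumed as the Sentinel-1 collection name.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical: the set of collections GFMap would support out of the box.
SUPPORTED_COLLECTIONS = {"SENTINEL1_GRD", "SENTINEL2_L2A"}


@dataclass(frozen=True)
class ExtractionConfig:
    """Hypothetical container for the required user inputs."""

    collection: str
    bands: List[str]
    temporal_extent: Tuple[str, str]  # ISO dates: (start, end)
    reference_path: str
    backend: str
    patch_size: Optional[int] = None  # None means point extraction

    def __post_init__(self):
        if self.collection not in SUPPORTED_COLLECTIONS:
            raise ValueError(f"Unsupported collection: {self.collection}")
        if not self.bands:
            raise ValueError("At least one band is required")
        start, end = (date.fromisoformat(d) for d in self.temporal_extent)
        if start >= end:
            raise ValueError("temporal_extent must be (start, end) with start < end")
        if self.patch_size is not None and self.patch_size <= 0:
            raise ValueError("patch_size must be a positive number of pixels")

    @property
    def is_point_extraction(self) -> bool:
        return self.patch_size is None


config = ExtractionConfig(
    collection="SENTINEL2_L2A",
    bands=["B02", "B03", "B04"],
    temporal_extent=("2019-01-01", "2019-07-01"),
    reference_path="./ref_data/sample_jobs.geojson",
    backend="openeo.vito.be",
)
```

Validating at construction time means a mistyped collection name or a reversed temporal extent fails before any backend job is submitted.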

## Reference data

First, the reference data is read in. In the example we use 20 geometries belonging to the same S2 tile.

A `normalize_reference_data` helper could be added to make sure the reference data is in the expected format.
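
A minimal sketch of what such a helper might check is given below. To stay dependency-free it works on a GeoJSON-style `FeatureCollection` dict rather than a GeoDataFrame. Only the name `normalize_reference_data` comes from the text; the set of required properties and the rules for skipping features are assumptions.

```python
from datetime import date

# Assumption: the minimum set of properties every reference sample must carry.
REQUIRED_PROPERTIES = ("sample_id", "valid_date")


def normalize_reference_data(feature_collection: dict) -> dict:
    """Return a new collection of valid, unique features.

    Raises ValueError when a required property is missing or a date is malformed.
    """
    normalized = []
    seen_ids = set()
    for feature in feature_collection["features"]:
        props = feature.get("properties") or {}
        missing = [key for key in REQUIRED_PROPERTIES if props.get(key) is None]
        if missing:
            raise ValueError(f"Feature {feature.get('id')} lacks {missing}")
        # Raises ValueError if the date is not ISO formatted (YYYY-MM-DD).
        date.fromisoformat(props["valid_date"])
        if feature.get("geometry") is None:
            continue  # a sample without geometry cannot be extracted
        if props["sample_id"] in seen_ids:
            continue  # keep only the first occurrence of a duplicated sample
        seen_ids.add(props["sample_id"])
        normalized.append(feature)
    return {"type": "FeatureCollection", "features": normalized}
```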

In [1]:
import geopandas as gpd

base_df = gpd.read_file('./ref_data/sample_jobs.geojson')
base_df

| | sample_id | landcover_label | croptype_label | irrigation_label | valid_date | ref_id | tile | geometry |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021_BE_Flanders_full_2195082011 | 11 | 1520 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (2.70009 51.08117) |
| 1 | 2021_BE_Flanders_full_1010873978 | 11 | 1120 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (3.05771 51.20072) |
| 2 | 2021_BE_Flanders_full_2195311171 | 11 | 1120 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (2.65647 50.92515) |
| 3 | 2021_BE_Flanders_full_936816805 | 11 | 1520 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (3.03073 51.19602) |
| 4 | 2021_BE_Flanders_full_1961534808 | 11 | 1120 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (3.02891 51.19640) |
| 5 | 2021_BE_Flanders_full_2194143333 | 11 | 1520 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (2.57456 51.06969) |
| 6 | 2021_BE_Flanders_full_2075647224 | 11 | 1520 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (2.72138 51.13372) |
| 7 | 2021_BE_Flanders_full_2191190590 | 11 | 1120 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (2.84396 51.09410) |
| 8 | 2021_BE_Flanders_full_935326035 | 11 | 1520 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (2.63926 50.91796) |
| 9 | 2021_BE_Flanders_full_2195520329 | 11 | 1120 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | POINT (2.87549 51.08698) |


## Extractions

First we need to do some extractions based on the reference data. A lot of variety is already possible here. We distinguish between point and patch extractions.

### Point Extractions

In [2]:
# Some user defined parameters
temporal_extent = ["2019-01-01", "2019-07-01"]
collection = "SENTINEL2_L2A"  # These names are standardized in GFMap
bands = ["B02", "B03", "B04"]  # These names are standardized in GFMap
backend = "openeo.vito.be"  # This name is standardized in GFMap

In [3]:
import openeo
c = openeo.connect(backend).authenticate_oidc()
raw_extraction = c.load_collection(collection_id=collection, 
                                   bands=bands, 
                                   temporal_extent=temporal_extent)

Authenticated using refresh token.


Here the user already has a lot of choices:
- Do they want to extract the raw data without any form of preprocessing? 
- Do they want to perform some form of compositing?
- Do they want to perform some cloud masking?
- Do they want to apply extra features on top of the vanilla bands?

GFMap should therefore offer the option to immediately do a raw extraction in patch or point form, where only the raw bands are extracted (optionally with `distance_to_cloud` and `cloud_mask` added). With this option, the extracted data still has a time dimension.
There should also be the option to first perform preprocessing and feature computation, and only then do the extractions. Open questions: do we set a default or does the user have to specify the method, and what about bespoke methods like Presto? With this option, the extracted data has no time dimension.

It should also be possible, as in WorldCereal for example, to combine both: first extract the raw data, then load it again with `load_stac`, and only then do feature computation and extract again.

For now, let's do feature computation first and only then the extractions.

In [4]:
from openeo.processes import ProcessBuilder, array_concat

# For each band, calculate the 10%, 50% and 90% quantiles and the standard deviation
def compute_features(input_timeseries: ProcessBuilder):
    return array_concat(
        input_timeseries.quantiles(probabilities=[0.1, 0.5, 0.9]),
        input_timeseries.sd(),
    )

# Use apply_dimension to remove the time dimension and map the statistics to the band dimension
# linear_scale_range keeps the values within [0, 250] so that they fit in uint8
features = raw_extraction.apply_dimension(
    dimension='t', target_dimension='bands', process=compute_features
).apply(lambda x: x.linear_scale_range(0, 250, 0, 250))

# Finally, rename the bands to reflect the computed features
stats = ["p10", "p50", "p90", "sd"]
all_bands = [band + "_" + stat for band in raw_extraction.metadata.band_names for stat in stats]
features = features.rename_labels('bands', all_bands)
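
To make the result of the feature computation concrete, here is a small local mirror of `compute_features` for a single pixel and band, in plain Python. It is a sketch only. The time series values are made up, `quantile` and `compute_features_locally` are hypothetical helpers, and the linear interpolation rule is an assumption about how the backend computes quantiles. `statistics.stdev` is the sample standard deviation.

```python
import math
import statistics


def quantile(values, probability):
    """Quantile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    position = probability * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def compute_features_locally(timeseries):
    """Return [p10, p50, p90, sd] for one pixel and one band."""
    quantiles = [quantile(timeseries, p) for p in (0.1, 0.5, 0.9)]
    return quantiles + [statistics.stdev(timeseries)]


# Made-up values standing in for one band's time series at one pixel.
example_timeseries = [1, 2, 3, 4, 5]
example_features = compute_features_locally(example_timeseries)

# Three bands with four statistics each give twelve feature bands.
band_names = ["B02", "B03", "B04"]
stat_names = ["p10", "p50", "p90", "sd"]
feature_names = [f"{band}_{stat}" for band in band_names for stat in stat_names]
```

The time dimension disappears because every time series is collapsed into four numbers, and the band dimension grows from 3 to 12 entries.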

Based on the reference data, we need to sample one point per `sample_id`. The geometries are currently already Points, so no conversion is needed. The commented-out code below would convert them to Point geometries if they were not.

In [5]:
# # Reproject to UTM to calculate centroids accurately
# proj_base_df = base_df.to_crs(base_df.estimate_utm_crs())
# # Take the centroid of the geometry as Point --> could also make a random point within the geometry
# proj_base_df.geometry = proj_base_df.geometry.centroid
# # Convert back to WGS84, as required by GeoJSON
# proj_base_df = proj_base_df.to_crs(epsg=4326)

import json

# to_json() writes True, False and None as true, false and null, which eval() cannot parse.
# json.loads maps them back to Python values, so no properties need to be removed.
geom = json.loads(base_df.to_json())

geom

{'type': 'FeatureCollection',
 'features': [{'id': '0',
   'type': 'Feature',
   'properties': {'sample_id': '2021_BE_Flanders_full_2195082011',
    'landcover_label': 11,
    'croptype_label': 1520,
    'irrigation_label': 0,
    'valid_date': '2021-06-01',
    'ref_id': '2021_EUR_DEMO_POLY_110',
    'tile': '31UDS'},
   'geometry': {'type': 'Point',
    'coordinates': [2.700085999992169, 51.081172216239715]}},
  {'id': '1',
   'type': 'Feature',
   'properties': {'sample_id': '2021_BE_Flanders_full_1010873978',
    'landcover_label': 11,
    'croptype_label': 1120,
    'irrigation_label': 0,
    'valid_date': '2021-06-01',
    'ref_id': '2021_EUR_DEMO_POLY_110',
    'tile': '31UDS'},
   'geometry': {'type': 'Point',
    'coordinates': [3.057707187625569, 51.20072475737662]}},
  {'id': '2',
   'type': 'Feature',
   'properties': {'sample_id': '2021_BE_Flanders_full_2195311171',
    'landcover_label': 11,
    'croptype_label': 1120,
    'irrigation_label': 0,
    'valid_date': '2021-06
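
The JSON literals `true`, `false` and `null` are not valid Python names, which is why evaluating the serialized GeoJSON as Python code can fail. The sketch below shows the difference on a hand-written feature. The `sample_id`, `extract` and `confidence` values are illustrative and not taken from the sample file.

```python
import json

# Hand-written GeoJSON text containing JSON-only literals (true, null).
geojson_text = '''{
  "type": "FeatureCollection",
  "features": [{
    "type": "Feature",
    "properties": {"sample_id": "demo_1", "extract": true, "confidence": null},
    "geometry": {"type": "Point", "coordinates": [2.7, 51.08]}
  }]
}'''

# Evaluating the text as Python fails on the first JSON-only literal.
try:
    eval(geojson_text)
    eval_error = None
except NameError as error:
    eval_error = str(error)

# json.loads maps true/false/null to True/False/None.
parsed = json.loads(geojson_text)
properties = parsed["features"][0]["properties"]
```

Besides being robust to these literals, `json.loads` avoids executing arbitrary text as code.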

Aggregate the features to the point geometries and save the result as GeoParquet. Two VectorCube-related issues in the GeoPySpark driver need to be solved first:

- [723](https://github.com/Open-EO/openeo-geopyspark-driver/issues/723): Band names are lost after `aggregate_spatial`
- [620](https://github.com/Open-EO/openeo-geopyspark-driver/issues/620): Second error mentioned in this issue (no time dimension): GeoParquet is rejected as an unsupported output format, even though it is supported

In [6]:
point_features = features.aggregate_spatial(geometries=geom, reducer="mean")  # The reducer is not relevant for point aggregation
# point_features = point_features.rename_labels('bands', all_bands)  # We want to support this in the future, or make sure band names are not lost after aggregate_spatial

# point_features.execute_batch('point_features.parquet')

# The below also doesn't work yet
# job = point_features.execute_batch(
#     title='Point feature extraction',
#     out_format='GeoParquet',
# )
# results = job.get_results()
# results.download_file('point_features.geoparquet')

In [8]:
import geopandas as gpd
test = gpd.read_parquet('point_features.parquet')
test.head()

| | geometry | sample_id | landcover_label | croptype_label | irrigation_label | valid_date | ref_id | tile | feature_index | avg_band_0_ | ... | avg_band_2_ | avg_band_3_ | avg_band_4_ | avg_band_5_ | avg_band_6_ | avg_band_7_ | avg_band_8_ | avg_band_9_ | avg_band_10_ | avg_band_11_ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | POINT (2.70009 51.08117) | 2021_BE_Flanders_full_2195082011 | 11 | 1520 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | 0 | 250.0 | ... | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 221.0 | 250.0 | 250.0 | 250.0 |
| 1 | POINT (3.05771 51.20072) | 2021_BE_Flanders_full_1010873978 | 11 | 1120 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | 1 | 250.0 | ... | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 |
| 2 | POINT (2.65647 50.92515) | 2021_BE_Flanders_full_2195311171 | 11 | 1120 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | 2 | 220.0 | ... | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 227.0 | 250.0 | 250.0 | 250.0 |
| 3 | POINT (3.03073 51.19602) | 2021_BE_Flanders_full_936816805 | 11 | 1520 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | 3 | 250.0 | ... | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 |
| 4 | POINT (3.02891 51.19640) | 2021_BE_Flanders_full_1961534808 | 11 | 1120 | 0 | 2021-06-01 | 2021_EUR_DEMO_POLY_110 | 31UDS | 4 | 250.0 | ... | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 | 250.0 |
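
Until band names survive `aggregate_spatial`, the generic `avg_band_<i>_` columns could be mapped back to feature names on the client side. The sketch below builds such a mapping. `band_column_mapping` is a hypothetical helper, and it assumes that the column index `<i>` follows the order of the feature band names, which should be verified against the backend output.

```python
import re

# Same naming scheme as the feature bands: three bands with four statistics each.
band_names = ["B02", "B03", "B04"]
stat_names = ["p10", "p50", "p90", "sd"]
feature_names = [f"{band}_{stat}" for band in band_names for stat in stat_names]


def band_column_mapping(columns, names):
    """Map each 'avg_band_<i>_' column to names[i]; other columns are left out."""
    mapping = {}
    for column in columns:
        match = re.fullmatch(r"avg_band_(\d+)_", column)
        if match:
            # Assumption: the index in the column name follows the band order.
            mapping[column] = names[int(match.group(1))]
    return mapping


example_columns = ["geometry", "sample_id", "avg_band_0_", "avg_band_8_", "avg_band_11_"]
example_mapping = band_column_mapping(example_columns, feature_names)
```

With the dataframe read above, the mapping could be applied as `test = test.rename(columns=band_column_mapping(test.columns, feature_names))`.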


### Patch Extractions

TODO