## Point feature extraction using GFMap

### Reference data

Start by reading in the reference data.

Once the reference data has been read, the batch jobs can be prepared. The jobs can be split per H3 hexagon or per Sentinel-2 (S2) tile. The result is a list of GeoDataFrames, where each item holds the points of one batch job, all located in the same H3 cell or S2 tile.

In [1]:
# Configuring the logging for the openeo_gfmap package
from openeo_gfmap.manager import _log
import logging

_log.setLevel(logging.DEBUG)

stream_handler = logging.StreamHandler()
_log.addHandler(stream_handler)

formatter = logging.Formatter('%(asctime)s|%(name)s|%(levelname)s:  %(message)s')
stream_handler.setFormatter(formatter)

# Exclude the other loggers from other libraries
class MyLoggerFilter(logging.Filter):
    def filter(self, record):
        return record.name == _log.name

stream_handler.addFilter(MyLoggerFilter())

In [2]:
from openeo_gfmap.manager.job_splitters import split_job_s2grid
import geopandas as gpd

base_df = gpd.read_file("https://artifactory.vgt.vito.be/artifactory/auxdata-public/gfmap/DEMO_CROPTYPE.gpkg")

split_jobs = split_job_s2grid(
    base_df, max_points=60
)

# Drop the jobs that contain no points flagged with "extract"
split_jobs = [
    job for job in split_jobs if job.extract.any()
]
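
The effect of `max_points` and of the `extract` filter can be illustrated on a toy table. The sketch below is a pandas-only illustration and is not how `split_job_s2grid` is implemented: the helper `split_by_tile` and the toy data are hypothetical, and the only assumption is that each point carries a `tile` label and a boolean `extract` flag, as the surrounding cells use.

```python
import pandas as pd


def split_by_tile(df, tile_col="tile", max_points=60):
    """Hypothetical helper: group rows by tile, then cut each group
    into chunks of at most `max_points` rows."""
    chunks = []
    for _, group in df.groupby(tile_col, sort=True):
        for start in range(0, len(group), max_points):
            chunk = group.iloc[start:start + max_points]
            chunks.append(chunk.reset_index(drop=True))
    return chunks


# Toy data: 5 points in one tile, 3 points in another
toy = pd.DataFrame(
    {
        "tile": ["31UDS"] * 5 + ["31UES"] * 3,
        "extract": [True, False, True, True, False, False, False, False],
    }
)

all_jobs = split_by_tile(toy, max_points=2)

# Keep only the jobs with at least one point flagged for extraction
toy_jobs = [job for job in all_jobs if job.extract.any()]
```

With `max_points=2`, the two tiles yield five chunks in total, and the filter keeps only the two chunks that contain a flagged point. A single tile can therefore produce several jobs.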





Next, from this list, a new dataframe can be constructed, where each row represents a batch job. The user can customize all the columns to be added to this job dataframe.

In [3]:
import pandas as pd
from openeo_gfmap import Backend 
from typing import List

def create_job_dataframe(
    backend: Backend, split_jobs: List[gpd.GeoDataFrame], prefix: str = "S2-L2A-features"
) -> pd.DataFrame:
    """Create a dataframe from the split jobs, containing all the information needed to run each job."""
    columns = [
        "backend_name",
        "out_prefix",
        "out_extension",
        "start_date",
        "end_date",
        "s2_tile",
        "geometry",
    ]
    rows = []
    for job in split_jobs:
        # Compute the mean valid date and build a window of about 1.5 years centred on it
        mean_time = pd.to_datetime(job.valid_date).mean()
        start_date = mean_time - pd.Timedelta(days=275)  # A bit more than 9 months
        end_date = mean_time + pd.Timedelta(days=275)  # A bit more than 9 months
        s2_tile = job.tile.iloc[0] 
        rows.append(
            pd.Series(
                dict(
                    zip(
                        columns,
                        [
                            backend.value,
                            prefix,
                            ".parquet",
                            start_date.strftime("%Y-%m-%d"),
                            end_date.strftime("%Y-%m-%d"),
                            s2_tile,
                            job.to_json(),
                        ],
                    )
                )
            )
        )

    return pd.DataFrame(rows)

job_df = create_job_dataframe(Backend.CDSE, split_jobs)

# For the sake of example, we will only run the first 5 batch jobs
job_df = job_df.head(5)
job_df

|   | backend_name | out_prefix | out_extension | start_date | end_date | s2_tile | geometry |
|---|---|---|---|---|---|---|---|
| 0 | cdse | S2-L2A-features | .parquet | 2020-08-30 | 2022-03-03 | 31UDS | `{"type": "FeatureCollection", "features": [{"i...` |
| 1 | cdse | S2-L2A-features | .parquet | 2020-08-30 | 2022-03-03 | 31UES | `{"type": "FeatureCollection", "features": [{"i...` |
| 2 | cdse | S2-L2A-features | .parquet | 2020-08-30 | 2022-03-03 | 31UES | `{"type": "FeatureCollection", "features": [{"i...` |
| 3 | cdse | S2-L2A-features | .parquet | 2020-08-30 | 2022-03-03 | 31UFS | `{"type": "FeatureCollection", "features": [{"i...` |
| 4 | cdse | S2-L2A-features | .parquet | 2020-08-30 | 2022-03-03 | 31UFS | `{"type": "FeatureCollection", "features": [{"i...` |
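
The temporal window of each job follows directly from the arithmetic in `create_job_dataframe`: the mean of the valid dates, plus and minus 275 days. A minimal sketch of that computation is given below; the two valid dates are hypothetical and were chosen only so that their mean falls on 2021-06-01.

```python
import pandas as pd

# Hypothetical valid dates of the points in one job
valid_dates = pd.Series(["2021-05-31", "2021-06-02"])

# Centre of the window: the mean of the valid dates
center = pd.to_datetime(valid_dates).mean()

# 275 days on each side gives a window of 550 days, about 1.5 years
start_date = center - pd.Timedelta(days=275)
end_date = center + pd.Timedelta(days=275)

window = (start_date.strftime("%Y-%m-%d"), end_date.strftime("%Y-%m-%d"))
window_days = (end_date - start_date).days
```

For a mean valid date of 2021-06-01, the window runs from 2020-08-30 to 2022-03-03 and spans 550 days.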


### Feature extraction

Now that the reference data has been read and the batch jobs have been split by S2 tile, the user can define which datacube each batch job constructs, i.e. which features it calculates. This is done with a function that takes a row of `job_df` as input and returns a batch job.

Define a helper function to create a pre-masked DataCube:

In [4]:
import geojson
import openeo 
from openeo_gfmap import Backend, BackendContext, FetchType, TemporalContext
from openeo_gfmap.fetching.s2 import build_sentinel2_l2a_extractor
from typing import List, Union

def masked_cube(connection: openeo.Connection,
                 bands: List[str],
                 temporal_extent: TemporalContext,
                 spatial_extent: Union[geojson.FeatureCollection, dict],
                 backend_context: BackendContext,
                 fetch_type: FetchType):
    
    # Extract the SCL collection only and calculate the dilation mask
    scl_cube_properties = {"eo:cloud_cover": lambda val: val <= 95.0}

    scl_cube = connection.load_collection(
        collection_id="SENTINEL2_L2A",
        bands=["SCL"],
        temporal_extent=[temporal_extent.start_date, temporal_extent.end_date],
        spatial_extent=dict(spatial_extent) if fetch_type == FetchType.TILE else None,
        properties=scl_cube_properties,
    )

    # Resample to 10m resolution for the SCL layer
    scl_cube = scl_cube.resample_spatial(10)

    # Compute the SCL dilation mask
    scl_dilated_mask = scl_cube.process(
        "to_scl_dilation_mask",
        data=scl_cube,
        scl_band_name="SCL",
        kernel1_size=17,  # 17px dilation on a 10m layer
        kernel2_size=77,  # 77px dilation on a 10m layer
        mask1_values=[2, 4, 5, 6, 7],
        mask2_values=[3, 8, 9, 10, 11],
        erosion_kernel_size=3,
    ).rename_labels("bands", ["S2-L2A-SCL_DILATED_MASK"])

    # Create the job to extract S2
    extraction_parameters = {
        "target_resolution": 10,  
        "load_collection": {
            "eo:cloud_cover": lambda val: val <= 95.0,
        },
    }

    # Immediately apply the mask 
    extraction_parameters["pre_mask"] = scl_dilated_mask

    extractor = build_sentinel2_l2a_extractor(
        backend_context,
        bands=bands,
        fetch_type=fetch_type,
        **extraction_parameters,
    )

    return extractor.get_cube(connection, spatial_extent, temporal_extent)

Next, define a function that performs compositing and calculates some features:

In [5]:
from openeo_gfmap.preprocessing import median_compositing, linear_interpolation

def create_datacube(
    row: pd.Series,
    connection: openeo.Connection,
    provider=None,
    connection_provider=None,
):
    temporal_extent = TemporalContext(row.start_date, row.end_date)
    spatial_extent = geojson.loads(row.geometry)

    backend = Backend(row.backend_name)
    backend_context = BackendContext(backend)

    # Select some bands to download (an arbitrary selection for this example)
    bands_to_download = [
        "S2-L2A-B04",
        "S2-L2A-B08",
        "S2-L2A-B8A",
        "S2-L2A-B09",
        "S2-L2A-B11",
        "S2-L2A-B12",
    ]

    fetch_type = FetchType.POINT 

    cube = masked_cube(connection=connection,
                       bands=bands_to_download,
                       temporal_extent=temporal_extent,
                       spatial_extent=spatial_extent,
                       backend_context=backend_context,
                       fetch_type=fetch_type)
    
    # # Calculate the NDVI and add it to the cube
    # ndvi = cube.ndvi(nir="S2-L2A-B08", red="S2-L2A-B04")
    # ndvi.add_dimension("bands", ["S2-L2A-NDVI"], "bands")
    # # ndvi.rename_labels("bands", ["S2-L2A-NDVI"])
    # cube = cube.merge_cubes(ndvi)

    # Create monthly median composites
    cube = median_compositing(cube=cube,
                              period="month")
    # Perform linear interpolation
    cube = linear_interpolation(cube)

    # The features are the temporal mean of each band (and of the NDVI, if the optional block above is enabled):
    cube = cube.reduce_dimension(dimension="t", reducer="mean")

    # Finally, create a vector cube based on the Point geometries
    cube = cube.aggregate_spatial(geometries=spatial_extent, reducer="mean")

    job_options = {
        "executor-memory": "5G",
        "executor-memoryOverhead": "2G",
    }

    return cube.create_job(
        out_format="Parquet",
        title=f"GFMAP_Feature_Extraction_S2_{row.s2_tile}",
        job_options=job_options
    )
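
The chain of monthly median compositing, linear interpolation and temporal mean can be followed on a single toy time series. The sketch below uses pandas on hypothetical reflectance values for one band at one point. It only illustrates the numerical idea; the openEO processes run on the backend and may differ in details such as how the composite periods are labelled.

```python
import pandas as pd

# Hypothetical observations for one band at one point:
# three in January, none in February (e.g. all masked as cloudy), one in March
obs = pd.Series(
    [0.10, 0.30, 0.20, 0.60],
    index=pd.to_datetime(["2021-01-05", "2021-01-20", "2021-01-25", "2021-03-10"]),
)

# Monthly median composite; months without observations become NaN
monthly = obs.resample("MS").median()

# Fill the gaps by linear interpolation between neighbouring composites
filled = monthly.interpolate(method="linear")

# Reduce the time dimension to a single feature value
feature = filled.mean()
```

The January composite is 0.2, February is empty and is interpolated to 0.4, and March is 0.6, so the resulting feature is 0.4.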

     

Once the features to be calculated have been defined, use the GFMAPJobManager to run the extractions.

First, define a function that generates the output path for each extraction. It is a callable that takes the root directory, the geometry index and the row of the corresponding batch job as input.

In [6]:
from pathlib import Path

def generate_output_path(
    root_folder: Path, geometry_index: int, row: pd.Series
):
    features = geojson.loads(row.geometry)
    sample_id = features[geometry_index].properties.get("sample_id", None)
    if sample_id is None:
        sample_id = features[geometry_index].properties["sampleID"]

    s2_tile_id = row.s2_tile
    
    subfolder = root_folder / s2_tile_id 
    return (
        subfolder
        / f"{row.out_prefix}_{sample_id}{row.out_extension}"
    )
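
The resulting folder layout can be checked without contacting a backend. The sketch below is a standard-library mirror of the logic above: `toy_output_path`, the root folder `out` and the two sample identifiers are hypothetical, and `json` plus a `SimpleNamespace` stand in for `geojson` and the `pd.Series` row.

```python
import json
from pathlib import Path
from types import SimpleNamespace


def toy_output_path(root_folder, geometry_index, row):
    """Hypothetical stdlib-only mirror of generate_output_path."""
    features = json.loads(row.geometry)["features"]
    properties = features[geometry_index]["properties"]
    # Fall back to "sampleID" when "sample_id" is absent
    sample_id = properties.get("sample_id")
    if sample_id is None:
        sample_id = properties["sampleID"]
    return root_folder / row.s2_tile / f"{row.out_prefix}_{sample_id}{row.out_extension}"


point = {"type": "Point", "coordinates": [4.5, 51.0]}
toy_row = SimpleNamespace(
    geometry=json.dumps(
        {
            "type": "FeatureCollection",
            "features": [
                {"type": "Feature", "properties": {"sample_id": "abc"}, "geometry": point},
                {"type": "Feature", "properties": {"sampleID": "xyz"}, "geometry": point},
            ],
        }
    ),
    s2_tile="31UDS",
    out_prefix="S2-L2A-features",
    out_extension=".parquet",
)

first = toy_output_path(Path("out"), 0, toy_row)
second = toy_output_path(Path("out"), 1, toy_row)
```

Each sample ends up in a subfolder named after its S2 tile, with the sample identifier embedded in the file name, whichever of the two property names the reference data uses.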

In [7]:
from openeo_gfmap.manager.job_manager import GFMAPJobManager
from openeo_gfmap.backend import cdse_connection

output_dir = Path("/data/users/Public/vincent.verelst/gfmap_feature_extractions/")

manager = GFMAPJobManager(
    output_dir=output_dir,
    output_path_generator=generate_output_path,
    collection_id="SENTINEL2-FEATURE-EXTRACTION",
    collection_description="Sentinel-2 basic point feature extraction",
    poll_sleep=60,
    n_threads=2,
    post_job_params={},
)

manager.add_backend(Backend.CDSE.value, cdse_connection, parallel_jobs=2)

# CSV file used to track the jobs, stored next to the extractions
tracking_df_path = str(output_dir / "job_tracking.csv")
manager.run_jobs(job_df, create_datacube, tracking_df_path)

2024-05-16 15:55:20,517|openeo_gfmap.manager|INFO:  Starting a fresh STAC collection.
2024-05-16 15:55:20,520|openeo_gfmap.manager|INFO:  Starting ThreadPoolExecutor with 2 workers.
2024-05-16 15:55:20,522|openeo_gfmap.manager|INFO:  Creating and running jobs.
2024-05-16 15:55:20,819|openeo_gfmap.manager|DEBUG:  Normalizing dataframe. Columns: Index(['backend_name', 'out_prefix', 'out_extension', 'start_date', 'end_date',
       's2_tile', 'geometry', 'status', 'id', 'start_time', 'cpu', 'memory',
       'duration', 'description', 'costs'],
      dtype='object')


Authenticated using refresh token.


2024-05-16 15:57:23,800|openeo_gfmap.manager|DEBUG:  Status of job j-24051601b1ca47178f251dbfeceeb6f7 is running (on backend cdse).
2024-05-16 15:57:27,485|openeo_gfmap.manager|DEBUG:  Status of job j-240516eb37e64a8989dcd2c1140ce1f6 is running (on backend cdse).
2024-05-16 15:58:32,942|openeo_gfmap.manager|DEBUG:  Status of job j-24051601b1ca47178f251dbfeceeb6f7 is running (on backend cdse).
2024-05-16 15:58:39,212|openeo_gfmap.manager|DEBUG:  Status of job j-240516eb37e64a8989dcd2c1140ce1f6 is running (on backend cdse).
2024-05-16 15:59:46,773|openeo_gfmap.manager|DEBUG:  Status of job j-24051601b1ca47178f251dbfeceeb6f7 is running (on backend cdse).
2024-05-16 15:59:50,294|openeo_gfmap.manager|DEBUG:  Status of job j-240516eb37e64a8989dcd2c1140ce1f6 is running (on backend cdse).
2024-05-16 16:00:57,543|openeo_gfmap.manager|DEBUG:  Status of job j-24051601b1ca47178f251dbfeceeb6f7 is running (on backend cdse).
2024-05-16 16:01:05,576|openeo_gfmap.manager|DEBUG:  Status of job j-240516e