# EOEPCA+ Use Case: NO2 Tropospheric Content Cloud Filtering - Register Input Data

![end2end_workflow](img/end2end_workflow.png)

`micromamba create -n eoepca_end2end -c conda-forge pystac pystac-client odc odc-stac openeo xarray rioxarray rasterio geopandas pyproj numpy folium shapely pip jupyterlab`

In [None]:
from pystac_client import Client
from odc.stac import stac_load
import pystac
from pystac import Collection, Catalog, Extent, SpatialExtent, TemporalExtent, Item, Asset
from datetime import datetime
import numpy as np
import openeo

## Cloud Fraction

Get cloud fraction data from the DLR GeoService STAC API.
- S5P Cloud Fraction Inpuls L3: EOC Geoservice Sentinel-5P TROPOMI L3 Daily Composites - Cloud Fraction (CF)
- https://geoservice.dlr.de/eoc/ogc/stac/v1/collections/S5P_TROPOMI_L3_P1D_CF

### Request

In [None]:
url = "https://geoservice.dlr.de/eoc/ogc/stac/v1/"
catalog = Client.open(url)

In [None]:
collection_id = "S5P_TROPOMI_L3_P1D_CF"
bbox = [-10.0, 35.0, 30.0, 70.0]  # Europe
date_time = "2023-08-01T00:00:00Z/2023-12-31T23:59:59Z"

In [None]:
search = catalog.search(
    collections=[collection_id],
    bbox=bbox,
    datetime=date_time,
    limit=400  # adjust as needed
)

In [None]:
items = list(search.items())

### Check Data
Load the data and check that it's valid.

In [None]:
ds = stac_load(
    items,
    #bands=["CF"], 
    crs="EPSG:4326",
    resolution=0.1,
    bbox=bbox,
    chunks={"time": 1} 
)


In [None]:
ds

In [None]:
monthly_mean_cf = ds['cf'].groupby('time.month').median(dim='time')

In [None]:
monthly_mean_cf.plot(col="month",
    col_wrap=3,
    cmap="viridis",
    vmin=0,
    vmax=1,
    figsize=(12, 6),
    cbar_kwargs={"label": "Cloud Fraction"})

### Register to EOEPCA via Registration BB

For registration in EOEPCA create a catalogue and collection from these items.

In [None]:
print(len(items))
print(items[0])
print(items[-1])

**To Do DataCubeAccess BB**: Adapt itmes to best practices. Best Practice document WIP available [here](https://github.com/EOEPCA/datacube-access/blob/main/best_practices/stac_best_practices.md).

In [None]:
items[0].stac_extensions

Items into Collection

and 

**To Do DataCubeAccess BB:** Adapt Collection to best practices

In [None]:
#example of some necessary adaptions...
#item_dates = [item.datetime for item in items]
item_dates = [item.datetime for item in items if isinstance(item.datetime, datetime)]
start = min(item_dates)
end = max(item_dates)
temp_extent = TemporalExtent([[start, end]])
req_bbox = SpatialExtent([bbox])
ori_bbox = SpatialExtent([items[0].bbox])
extent = Extent(spatial=ori_bbox, temporal=temp_extent)
extensions = items[0].stac_extensions

collection = Collection(
    id="s5p-cloud-fraction-2023-aug-dec",
    description="Subset of Sentinel-5P Cloud Fraction L3 data for August-December 2023, from DLR Geoservice STAC API.",
    extent=extent,
    license="proprietary",
    keywords=["Sentinel-5P", "TROPOMI", "Cloud Fraction", "Europe", "DLR"],
    providers=[],
    summaries={},
    #stac_extensions=[extensions]
)


In [22]:
from pystac import Item, Collection, Catalog, Extent, SpatialExtent, TemporalExtent, Asset
import numpy as np
import pandas as pd
import datetime

source_items = list(search.items())

times = ds['time'].values
start = np.min(times)
end = np.max(times)
bbox = [float(ds.longitude.min()), float(ds.latitude.min()), float(ds.longitude.max()), float(ds.latitude.max())]
extent = Extent(
    spatial=SpatialExtent([bbox]),
    temporal=TemporalExtent([[pd.to_datetime(start).to_pydatetime(), pd.to_datetime(end).to_pydatetime()]])
)
collection = Collection(
    id="s5p-cloud-fraction-2023-aug-dec",
    description="Sentinel-5P Cloud Fraction L3 data (Aug-Dec 2023), generated from xarray Dataset.",
    extent=extent,
    license="proprietary",
    stac_extensions=[
        "https://stac-extensions.github.io/projection/v2.0.0/schema.json",
        "https://stac-extensions.github.io/datacube/v1.0.0/schema.json",
        "https://stac-extensions.github.io/raster/v1.1.0/schema.json",
        "https://stac-extensions.github.io/eo/v1.1.0/schema.json"
    ]
)

collection.extra_fields["cube:dimensions"] = {
    "x": {
        "type": "spatial",
        "axis": "x",
        "extent": [float(ds.longitude.min()), float(ds.longitude.max())]
    },
    "y": {
        "type": "spatial",
        "axis": "y",
        "extent": [float(ds.latitude.min()), float(ds.latitude.max())]
    },
    "t": {
        "type": "temporal",
        "extent": [str(ds.time.values[0]), str(ds.time.values[-1])]
    }
}

catalog = Catalog(id="s5p-bp-stac-catalog", description="Root catalog")
catalog.add_child(collection)

for i, src_item in enumerate(source_items):
    t = ds['time'].values[i]
    timestamp = np.datetime_as_string(t, 's')
    cf_data = ds['cf'].isel(time=i).values
    nodata = (
        ds.attrs.get("nodata") or
        ds.attrs.get("_FillValue") or
        "nan"
    )
    safe_stat = lambda val: float(val) if np.isfinite(val) else None

    stats = {
        "minimum": safe_stat(np.nanmin(cf_data)),
        "maximum": safe_stat(np.nanmax(cf_data)),
        "mean": safe_stat(np.nanmean(cf_data)),
        "stddev": safe_stat(np.nanstd(cf_data))
    }

    proj = {
        "proj:epsg": 4326,
        "proj:shape": [ds.latitude.size, ds.longitude.size],
        "proj:transform": [
            float(ds.longitude[1] - ds.longitude[0]), 0.0, float(ds.longitude.min()),
            0.0, float(ds.latitude[1] - ds.latitude[0]), float(ds.latitude.min())
        ]
    }

    datacube = {
        "cube:dimensions": {
            "x": {
                "type": "spatial",
                "axis": "x",
                "extent": [float(ds.longitude.min()), float(ds.longitude.max())]
            },
            "y": {
                "type": "spatial",
                "axis": "y",
                "extent": [float(ds.latitude.min()), float(ds.latitude.max())]
            },
            "t": {
                "type": "temporal",
                "extent": [str(t), str(t)]
            }
        }
    }

    properties = dict(src_item.properties)
    if "license" in properties and properties["license"] == "CC-BY 4.0":
        properties["license"] = "CC-BY-4.0"
    if "instruments" in properties and isinstance(properties["instruments"], str):
        properties["instruments"] = [properties["instruments"]]
    if "sci:doi" in properties and properties["sci:doi"] == "N/A":
        properties["sci:doi"] = "10.xxxx/xxxxxx"
    properties.update(proj)
    properties.update(datacube)

    original_exts = set(src_item.stac_extensions or [])
    new_exts = {
        "https://stac-extensions.github.io/projection/v2.0.0/schema.json",
        "https://stac-extensions.github.io/datacube/v1.0.0/schema.json",
        "https://stac-extensions.github.io/raster/v1.1.0/schema.json",
        "https://stac-extensions.github.io/eo/v1.1.0/schema.json"
    }

    view_ext_url = "https://stac-extensions.github.io/view/v1.0.0/schema.json"
    all_exts = original_exts | new_exts
    stac_extensions = [ext for ext in all_exts if ext != view_ext_url]

    item = Item(
        id=src_item.id,
        geometry=src_item.geometry,
        bbox=src_item.bbox,
        datetime=pd.to_datetime(t).to_pydatetime(),
        properties=properties,
        stac_extensions=stac_extensions
    )

    for key, asset in src_item.assets.items():
        item.add_asset(key, asset)

    if "cf" in item.assets and "cf" in src_item.assets:
        src_cf = src_item.assets["cf"]
    # Copy raster:bands if present in the original asset
    if "raster:bands" in src_cf.extra_fields:
        item.assets["cf"].extra_fields["raster:bands"] = src_cf.extra_fields["raster:bands"]
    else:
        # fallback to your calculated bands if needed
        item.assets["cf"].extra_fields["raster:bands"] = [{
            "nodata": nodata,
            "sampling": "point",
            "data_type": "float32",
            "spatial_resolution": float(abs(ds.longitude[1] - ds.longitude[0]))
        }]
    # Copy proj:epsg if present
    if "proj:epsg" in src_cf.extra_fields:
        item.assets["cf"].extra_fields["proj:epsg"] = src_cf.extra_fields["proj:epsg"]
    else:
        item.assets["cf"].extra_fields["proj:epsg"] = 4326

    collection.add_item(item)

catalog.normalize_and_save("s5p-bp-stac-catalog", catalog_type="SELF_CONTAINED")

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  return ufunc.reduce(obj, axis, dtype, out, **passkwarg

In [23]:
collection = pystac.Catalog.from_file("s5p-bp-stac-catalog/catalog.json")

print("Collection Extensions:", collection.stac_extensions)
print("Collection Metadata keys:", collection.extra_fields.keys())
print("Collection valid:", collection.validate())


for item in collection.get_all_items():
    print("\nItem ID:", item.id)
    print("  Extensions:", item.stac_extensions)
    print("  Properties:", item.properties.keys())
    print("  Datetime:", item.datetime)
    print("  BBox:", item.bbox)
    print("  Valid:", item.validate())

    for asset_key, asset in item.assets.items():
        print(f"    Asset key: {asset_key}")
        print(f"      HREF: {asset.href}")
        print(f"      Media type: {asset.media_type}")
        print(f"      Roles: {asset.roles}")
        print(f"      Title: {asset.title}")
        print(f"      Extra fields: {asset.extra_fields}")

valid = collection.validate()
valid

Collection Extensions: []
Collection Metadata keys: dict_keys(['type'])
Collection valid: ['https://schemas.stacspec.org/v1.1.0/catalog-spec/json-schema/catalog.json']

Item ID: S5P_DLR_NRTI_01_040201_L3_CF_20231231
  Extensions: ['https://stac-extensions.github.io/raster/v1.1.0/schema.json', 'https://stac-extensions.github.io/datacube/v2.2.0/schema.json', 'https://stac-extensions.github.io/projection/v2.0.0/schema.json', 'https://stac-extensions.github.io/eo/v1.1.0/schema.json', 'https://stac-extensions.github.io/scientific/v1.0.0/schema.json', 'https://stac-extensions.github.io/processing/v1.0.0/schema.json']
  Properties: dict_keys(['created', 'updated', 'datetime', 'start_datetime', 'end_datetime', 'platform', 'constellation', 'instruments', 'license', 'sci:doi', 'processing:facility', 'processing:level', 'processing:software', 'product:type', 'proj:bbox', 'proj:shape', 's5p:collection_identifier', 's5p:datasource', 's5p:head_facility', 's5p:l2_algorithm_version', 's5p:product_name

['https://schemas.stacspec.org/v1.1.0/catalog-spec/json-schema/catalog.json']

Collections into Catalogue

In [None]:
catalog = Catalog(
    id="s5p-cloud-fraction-europe",
    description="Catalog of Sentinel-5P L3 Cloud Fraction data August-December 2023)"
)

# Link the collection to the catalog
catalog.add_child(collection)

# Add all items to the collection
for item in items:
    collection.add_item(item)

In [None]:
catalog

**To Do Workspace BB**: Save json to Workspace BB or devcluster object storage.

In [None]:
output_dir = "s5p-stac-catalog" # adapt to workspace or dev cluster object storage

catalog.normalize_and_save(
    root_href=output_dir, 
    catalog_type="SELF_CONTAINED"
)

**To Do Registration BB**: Have Registration BB - Harvester add the catalogue to EOEPCA STAC API

In [None]:
#  https://github.com/EOEPCA/demo/blob/main/demoroot/notebooks/06%20Resource%20Registration%20Harvester.ipynb

**To Do Registration BB:** Replicate workflow with [eodm](https://github.com/geopython/eodm).

- As long as the corrected STAC Items and Collection are in memory, they can be registered using eodm [`load_stac_api_collections()`](stactools-sentinel2/examples/s2_dateline at s2_dateline · DLR-terrabyte/stactools-sentinel2) and [`load_stac_api_items()`](https://github.com/geopython/eodm/blob/main/src/eodm/load.py#L9)
- The target should be the URL of the EOEPCA STAC API **--> Which one would that be currently?**
- This would be a shortcut by not storing the jsons and not using the Registration BB Harvester.

In [None]:
# https://github.com/geopython/eodm

## Tropospheric NO2 - Terrascope STAC API

**To Do:** Evaluate whether it makes sense to follow Terrascope STAC API approach or if openEO makes more sense.

Get Tropospheric NO2 Data from a publicly available STAC API: S5P NO2 Troposphere L2: Sentinel-5P Nitrogen Dioxide tropospheric column

CDSE: not well filled for NO2
- Offline: https://browser.stac.dataspace.copernicus.eu/collections/sentinel-5p-l2-no2-offl
- Near Real Time: https://browser.stac.dataspace.copernicus.eu/collections/sentinel-5p-l2-no2-nrti?.language=de

Terrascope: need special credentials
- https://services.terrascope.be/stac/collections/urn:eop:VITO:TERRASCOPE_S5P_L3_NO2_TD_V1/items
- https://docs.terrascope.be/Developers/WebServices/TerraCatalogue/STACAPI.html
- https://docs.terrascope.be/Developers/WebServices/TerraCatalogue/ProductDownload.html#authentication

Request

In [None]:
#url = "https://stac.dataspace.copernicus.eu/v1"
url = "https://services.terrascope.be/stac/"
catalog = Client.open(url)

In [None]:
#collection_id = "sentinel-5p-l2-no2-offl"
collection_id = "urn:eop:VITO:TERRASCOPE_S5P_L3_NO2_TD_V2"

In [None]:
search = catalog.search(
    collections=[collection_id],
    bbox=bbox,
    datetime=date_time,
    #limit=1000 # adjust as needed
)

In [None]:
items_no2 = list(search.items())

In [None]:
print(len(items_no2))
print(items_no2[0])
print(items_no2[-1])

In [None]:
items_no2[0]

Check data

In [None]:
ds_no2 = stac_load(
    items_no2,
    #bands=["NO2"], 
    crs="EPSG:4326",
    resolution=0.1,
    bbox=bbox,
    chunks={"time": 1}  # Enable Dask chunking
)

In [None]:
ds_no2 # lazy

In [None]:
monthly_mean_no2 = ds_no2['NO2'].groupby('time.month').median(dim='time') #lazy

In [None]:
monthly_mean_no2 # lazy

To actually access data authentication is needed. **This is probably not the right way to get data from terrascope (ideally it would be analog to the example above).**

In [None]:
import requests
import xarray as xr
import rioxarray
from rasterio.io import MemoryFile

def get_terrascope_token(username: str, password: str) -> str:
    url = "https://sso.terrascope.be/auth/realms/terrascope/protocol/openid-connect/token"
    data = {
        "grant_type": "password",
        "client_id": "public",
        "username": username,
        "password": password
    }
    response = requests.post(url, data=data)
    response.raise_for_status()
    return response.json()["access_token"]

def load_no2_from_items(items, token, asset_key="NO2"):
    """Takes a list of STAC items and loads the NO2 band from each into a time-stacked xarray DataArray."""
    datasets = []
    for item in items:
        try:
            url = item.assets[asset_key].href
            headers = {"Authorization": f"Bearer {token}"}
            r = requests.get(url, headers=headers)
            r.raise_for_status()

            with MemoryFile(r.content) as memfile:
                with memfile.open() as dataset:
                    da = rioxarray.open_rasterio(dataset).squeeze("band", drop=True)
                    da = da.rio.write_crs("EPSG:4326")
                    da = da.expand_dims(time=[item.datetime])
                    datasets.append(da)
        except Exception as e:
            print(f"Failed to load {item.id}: {e}")

    if datasets:
        return xr.concat(datasets, dim="time").sortby("time")
    else:
        print("No valid datasets loaded.")
        return None

In [None]:
import getpass

username = "peter.zellner"
password = getpass.getpass("Terrascope password: ")

token = get_terrascope_token(username, password)

Trying to simulate how the data access would look like after registering the STAC Metadata via the Registration BB...

In [None]:
no2_data = load_no2_from_items(items_no2, token)

if no2_data is not None:
    print(no2_data)
    no2_data.mean(dim="time").plot(cmap="viridis", robust=True)

## Tropospheric NO2 - CDSE aggregator openEO

- CDSE openEO aggregator with terrascope
- https://openeofed.dataspace.copernicus.eu/

In [None]:
# Option A: Save files to eopca workspace, adapt asset path in STAC
# Option B: Register files with original href -> Authentication at access?? -> Don't get the original terrascope STAC Items from openEO

In [None]:
import openeo
connection = openeo.connect("openeofed.dataspace.copernicus.eu").authenticate_oidc()

Using openEO the data has to be retrieved/downloaded directly. STAC items are created for the results.

In [None]:
bbox

In [None]:
%%time
load = connection.load_collection(collection_id = "TERRASCOPE_S5P_L3_NO2_TD", 
                                  spatial_extent = {"west": bbox[0], "east": bbox[1], "south": bbox[2], "north": bbox[3]}, 
                                  temporal_extent = ["2023-08-01T00:00:00Z", "2023-12-31T00:00:00Z"], 
                                  bands = ["NO2"])
save = load.save_result(format = "GTIFF")

job = save.create_job()
job.start_and_wait()

# The process can be executed synchronously (see below), as batch job or as web service now
#result = connection.execute(save2)

These files could be downloaded and stored alongside with the created STAC metadata for the registration BB. Probably there's a more elegant solution...

In [None]:
job.get_results()

In [None]:
%%time
job.get_results().download_files("output")