# STAC Geoparquet with DuckDB

## A short word on STAC

STAC (SpatioTemporal Asset Catalog) is the central way of accessing spatio-temporal data on terrabyte. For an introduction and the detailed specification, see:
* https://stacspec.org/en
* https://github.com/radiantearth/stac-spec

In principle, data is offered through a *catalog* containing data from various sources. This catalog is further subdivided into *collections*.
A collection could, for example, contain a certain satellite data product such as Sentinel-1 GRD, Sentinel-1 SLC or Sentinel-2 L2A.
Each collection consists of multiple *items*, which might represent individual satellite scenes or product tiles (e.g. the MGRS tiles of Sentinel-2).
Each item consists of one or more *assets*, which contain links to the actual data, for example individual GeoTIFF files for each band.
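
To make this hierarchy concrete, the following sketch models a catalog, one collection, one item and its assets as nested Python dictionaries and walks down to the asset links. It is a simplification: all ids and URLs are invented placeholders (not real terrabyte entries), real STAC objects carry more required fields, and real catalogs connect the levels through `links` rather than by nesting.

```python
# Simplified model of the STAC hierarchy using plain dictionaries.
# All ids and hrefs are hypothetical placeholders, not real terrabyte data.
catalog = {
    "id": "example-catalog",
    "collections": [
        {
            "id": "example-collection",
            "items": [
                {
                    "id": "example-item",
                    "properties": {"datetime": "2023-02-10T11:20:49Z"},
                    "assets": {
                        "red": {"href": "https://example.com/example-item/red.tif"},
                        "nir": {"href": "https://example.com/example-item/nir.tif"},
                    },
                }
            ],
        }
    ],
}


def iter_asset_hrefs(catalog):
    """Yield (collection id, item id, asset key, href) for every asset."""
    for collection in catalog["collections"]:
        for item in collection["items"]:
            for key, asset in item["assets"].items():
                yield collection["id"], item["id"], key, asset["href"]


asset_rows = list(iter_asset_hrefs(catalog))
for row in asset_rows:
    print(row)
```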

## STAC Items as Geoparquet files

Besides the STAC API (https://stac.terrabyte.lrz.de/public/api), we provide an export of the STAC Items of each collection as Geoparquet files. You can query these Geoparquet files with the in-memory database [DuckDB](https://duckdb.org/). This lets you query and filter STAC items with your own computing resources, without interacting with the terrabyte STAC API.

## Requirements

For this example, you need the Python libraries `duckdb` (version 1.0.0 is currently still required), `pygeofilter-duckdb` (https://github.com/DLR-terrabyte/pygeofilter-duckdb), and `stac-geoparquet`. The later cells additionally import `pystac`, `geopandas`, and `folium`.

In [1]:
#!pip install duckdb==1.0.0 stac-geoparquet git+https://github.com/DLR-terrabyte/pygeofilter-duckdb.git
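
Because the example depends on a pinned DuckDB version, it can help to check the environment before running the remaining cells. The following is a minimal sketch using only the standard library; the helper name `check_pin` is made up for illustration, and the distribution names are assumed to match those in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package, expected=None):
    """Return 'missing', 'mismatch (<installed>)' or 'ok' for a package."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return "missing"
    if expected is not None and installed != expected:
        return f"mismatch ({installed})"
    return "ok"


# Distribution names as used in the pip command; only duckdb is pinned.
for package, expected in [("duckdb", "1.0.0"), ("stac-geoparquet", None)]:
    print(f"{package}: {check_pin(package, expected)}")
```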

## Import of libraries

In [2]:
import json
import pystac
import duckdb
from stac_geoparquet.arrow._api import stac_table_to_items

# Install and load DuckDB spatial extension
duckdb.install_extension('spatial')
duckdb.load_extension('spatial')

# Use pygeofilter library to convert between CQL2-JSON/Text and SQL query
from pygeofilter.parsers.cql2_json import parse as json_parse
from pygeofilter.backends.duckdb import to_sql_where
from pygeofilter.util import IdempotentDict

## Define STAC Item export as Geoparquet files

In [3]:
geoparquet_files = {
    'sentinel-1-grd': '/dss/dsstbyfs03/pn56su/pn56su-dss-0022/Sentinel-1/GRD/geoparquet/*.parquet',
    'sentinel-1-slc': '/dss/dsstbyfs03/pn56su/pn56su-dss-0022/Sentinel-1/SLC/geoparquet/*.parquet',
    'sentinel-2-c1-l1c': '/dss/dsstbyfs01/pn56su/pn56su-dss-0008/Sentinel-2-Col-1/L1C/geoparquet/*.parquet',
    'sentinel-2-c1-l2a': '/dss/dsstbyfs01/pn56su/pn56su-dss-0008/Sentinel-2-Col-1/L2A/geoparquet/*.parquet',
    'sentinel-3-olci-l1-efr': '/dss/dsstbyfs01/pn56su/pn56su-dss-0008/Sentinel-3/OLCI/OL_1_EFR___/geoparquet/*.parquet',
    'landsat-tm-c2-l2': '/dss/dsstbyfs01/pn56su/pn56su-dss-0008/Landsat/collection-2/level-2/standard/tm/geoparquet/*.parquet',
    'landsat-etm-c2-l2': '/dss/dsstbyfs01/pn56su/pn56su-dss-0008/Landsat/collection-2/level-2/standard/etm/geoparquet/*.parquet',
    'landsat-ot-c2-l2': '/dss/dsstbyfs01/pn56su/pn56su-dss-0008/Landsat/collection-2/level-2/standard/oli-tirs/geoparquet/*.parquet',
}
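
The paths above are glob patterns on the terrabyte file system, so a query only works where that file system is mounted. As a quick sanity check, the sketch below counts how many files each pattern matches. The helper `count_matches` and the example pattern are hypothetical; on terrabyte you would pass the `geoparquet_files` dictionary instead.

```python
import glob


def count_matches(patterns):
    """Map each collection id to the number of files its glob pattern matches."""
    return {name: len(glob.glob(pattern)) for name, pattern in patterns.items()}


# Hypothetical stand-in; on terrabyte, pass the geoparquet_files dict from above.
example_patterns = {"example-collection": "/path/that/does/not/exist/*.parquet"}
for name, n_files in count_matches(example_patterns).items():
    print(f"{name}: {n_files} file(s)")
```

A count of zero for a collection usually means the path is not reachable from the current machine.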

## Define query as CQL2-JSON

In [4]:
start = '2023-02-01T00:00:00Z'
end = '2023-02-28T23:59:59Z'
collection = 'sentinel-2-c1-l2a'

cql2_filter = {
  "op": "and",
  "args": [
    {
      "op": "between",
      "args": [
        {
          "property": "eo:cloud_cover"
        },
        [0, 21]
      ]
    },
    {
      "op": "between",
      "args": [
        {
          "property": "datetime"
        },
        [start, end]
      ]
    },
    {
      "op": "s_intersects",
      "args": [
        {
          "property": "geometry"
        },
        {
          "type": "Polygon",  # Baden-Württemberg
          "coordinates": [[
            [7.5113934084, 47.5338000528],
            [10.4918239143, 47.5338000528],
            [10.4918239143, 49.7913749328],
            [7.5113934084, 49.7913749328],
            [7.5113934084, 47.5338000528]
          ]]
        }
      ]
    }
  ]
}

sql_where = to_sql_where(json_parse(cql2_filter), IdempotentDict())
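
To illustrate what this conversion does conceptually, here is a deliberately small translator that handles only the `and` and `between` operators in the argument layout used above. It is a hypothetical sketch, not the pygeofilter implementation: it has no spatial operators, the function names are made up, and its naive quoting is not safe for untrusted input.

```python
def literal_to_sql(value):
    """Render a Python literal as a SQL literal (naive quoting, sketch only)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def cql2_to_where(node):
    """Translate a tiny CQL2-JSON subset ('and', 'between') to a SQL WHERE string."""
    op, args = node["op"], node["args"]
    if op == "and":
        return "(" + " AND ".join(cql2_to_where(arg) for arg in args) + ")"
    if op == "between":
        prop, (low, high) = args[0]["property"], args[1]
        return f'("{prop}" BETWEEN {literal_to_sql(low)} AND {literal_to_sql(high)})'
    raise NotImplementedError(f"operator not covered by this sketch: {op}")


example_filter = {
    "op": "and",
    "args": [
        {"op": "between", "args": [{"property": "eo:cloud_cover"}, [0, 21]]},
        {"op": "between", "args": [{"property": "datetime"},
                                   ["2023-02-01T00:00:00Z", "2023-02-28T23:59:59Z"]]},
    ],
}
where_clause = cql2_to_where(example_filter)
print(where_clause)
```

The printed clause contains the same two `BETWEEN` conditions as the non-spatial part of the filter defined above.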

## Query data with DuckDB

In [5]:
%%time
# Define geoparquet files
geoparquet = geoparquet_files[collection]

# Define and execute query
# Note: union_by_name=True slows down the query, but is necessary when some properties
# are not available in all STAC items. It is disabled here.
sql_query = f"SELECT * FROM read_parquet('{geoparquet}', union_by_name=False) WHERE {sql_where}"
print(f"DuckDB Query:\n{sql_query}\n")
db = duckdb.query(sql_query)

# Convert DuckDB result to Arrow table
table = db.fetch_arrow_table()

# Convert Arrow table to list of PySTAC Items
items = []
for item in stac_table_to_items(table): 
    item['assets'] = json.loads(item['assets'])
    items.append(pystac.Item.from_dict(item))

print("%s items found\n" % len(items))

DuckDB Query:
SELECT * FROM read_parquet('/dss/dsstbyfs01/pn56su/pn56su-dss-0008/Sentinel-2-Col-1/L2A/geoparquet/*.parquet', union_by_name=False) WHERE ((("eo:cloud_cover" BETWEEN 0 AND 21) AND ("datetime" BETWEEN '2023-02-01T00:00:00Z' AND '2023-02-28T23:59:59Z')) AND ST_Intersects(ST_GeomFromWKB(geometry),ST_GeomFromHEXEWKB('0103000000010000000500000034DFB1B6AA0B1E4085B0648F53C44740509E1658D0FB244085B0648F53C44740509E1658D0FB244006A017C64BE5484034DFB1B6AA0B1E4006A017C64BE5484034DFB1B6AA0B1E4085B0648F53C44740')))

35 items found

CPU times: user 29.8 s, sys: 1min 54s, total: 2min 24s
Wall time: 19.3 s
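
The long hexadecimal string passed to `ST_GeomFromHEXEWKB` in the printed query is the filter polygon in hex-encoded well-known binary (WKB) form. The sketch below encodes and decodes a single-ring polygon with the standard library, assuming little-endian byte order and no SRID, in which case the layout is plain WKB. The function names are made up for illustration, and the code is not what pygeofilter or DuckDB run internally.

```python
import struct


def polygon_to_wkb_hex(ring):
    """Encode a single-ring polygon as little-endian WKB, returned as upper-case hex."""
    wkb = struct.pack("<BII", 1, 3, 1)   # byte order, geometry type 3 = Polygon, 1 ring
    wkb += struct.pack("<I", len(ring))  # number of points in the ring
    for x, y in ring:
        wkb += struct.pack("<dd", x, y)  # each point as two 64-bit floats
    return wkb.hex().upper()


def wkb_hex_to_ring(wkb_hex):
    """Decode the hex string produced above back into a list of (x, y) tuples."""
    wkb = bytes.fromhex(wkb_hex)
    byte_order, geom_type, n_rings, n_points = struct.unpack_from("<BIII", wkb, 0)
    assert (byte_order, geom_type, n_rings) == (1, 3, 1), "sketch handles one-ring polygons only"
    return [struct.unpack_from("<dd", wkb, 13 + 16 * i) for i in range(n_points)]


# Same bounding polygon as in the CQL2 filter above.
ring = [
    (7.5113934084, 47.5338000528),
    (10.4918239143, 47.5338000528),
    (10.4918239143, 49.7913749328),
    (7.5113934084, 49.7913749328),
    (7.5113934084, 47.5338000528),
]
wkb_hex = polygon_to_wkb_hex(ring)
decoded_ring = wkb_hex_to_ring(wkb_hex)
print(wkb_hex)
print(decoded_ring)
```

The first 26 hex characters are the header (byte order, geometry type, ring count, point count), which matches the start of the string in the printed query; the coordinates follow as pairs of 64-bit floats.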


## Show first item

In [6]:
items[0]

## List items as Pandas GeoDataFrame

In [7]:
import geopandas as gpd
dataframe = gpd.GeoDataFrame.from_features(items)
dataframe

| | geometry | constellation | eo:cloud_cover | grid:code | instruments | mgrs:grid_square | mgrs:latitude_band | mgrs:utm_zone | platform | proj:centroid | ... | s2:water_percentage | sat:orbit_state | sat:relative_orbit | terrabyte:stactools_id | view:azimuth | view:incidence_angle | view:sun_azimuth | view:sun_elevation | created | datetime |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | POLYGON ((7.57204 48.74304, 6.28064 48.72091, ... | sentinel-2 | 7.996309 | MGRS-32ULU | [msi] | LU | U | 32 | sentinel-2a | {'lat': 48.27686, 'lon': 6.83282} | ... | 0.222254 | descending | 8 | S2A_T32ULU_20230211T104302_L2A | 288.551167 | 8.836733 | 164.146943 | 26.377804 | 2024-05-20T16:52:14.584596Z | 2023-02-11T11:41:51.024000Z |
| 1 | POLYGON ((7.75088 46.85850, 9.12806 46.86563, ... | sentinel-2 | 3.603994 | MGRS-32TMT | [msi] | MT | T | 32 | sentinel-2b | {'lat': 47.33211, 'lon': 8.52417} | ... | 4.238812 | descending | 65 | S2B_T32TMT_20230210T102051_L2A | 105.386213 | 8.32656 | 160.311293 | 26.162376 | 2024-05-20T18:44:59.158467Z | 2023-02-10T11:20:49.024000Z |
| 2 | POLYGON ((8.99973 47.85370, 8.99974 46.86570, ... | sentinel-2 | 12.093081 | MGRS-32TNT | [msi] | NT | T | 32 | sentinel-2b | {'lat': 47.35663, 'lon': 9.72673} | ... | 4.375875 | descending | 65 | S2B_T32TNT_20230210T102051_L2A | 152.988379 | 2.96218 | 161.691112 | 26.452285 | 2024-05-20T18:44:59.851764Z | 2023-02-10T11:20:49.024000Z |
| 3 | POLYGON ((11.75191 46.84844, 11.80285 47.81947... | sentinel-2 | 1.939114 | MGRS-32TPT | [msi] | PT | T | 32 | sentinel-2b | {'lat': 47.3406, 'lon': 11.05059} | ... | 0.904028 | descending | 65 | S2B_T32TPT_20230210T102051_L2A | 285.73679 | 6.642694 | 163.076612 | 26.736361 | 2024-05-20T18:45:00.391288Z | 2023-02-10T11:20:49.024000Z |
| 4 | POLYGON ((8.07584 47.75956, 9.13025 47.76509, ... | sentinel-2 | 0.597896 | MGRS-32UMU | [msi] | MU | U | 32 | sentinel-2b | {'lat': 48.22222, 'lon': 8.68893} | ... | 0.434245 | descending | 65 | S2B_T32UMU_20230210T102051_L2A | 106.277863 | 9.309764 | 160.382539 | 25.299265 | 2024-05-20T18:45:00.672984Z | 2023-02-10T11:20:49.024000Z |
| 5 | POLYGON ((8.41111 48.66112, 9.13255 48.66495, ... | sentinel-2 | 16.493364 | MGRS-32UMV | [msi] | MV | U | 32 | sentinel-2b | {'lat': 49.09985, 'lon': 8.85464} | ... | 0.173872 | descending | 65 | S2B_T32UMV_20230210T102051_L2A | 105.922841 | 10.255324 | 160.447572 | 24.435612 | 2024-05-20T18:45:00.753325Z | 2023-02-10T11:20:49.024000Z |
| 6 | POLYGON ((8.99973 48.75301, 8.99973 47.76517, ... | sentinel-2 | 3.414702 | MGRS-32UNU | [msi] | NU | U | 32 | sentinel-2b | {'lat': 48.25592, 'lon': 9.73938} | ... | 0.563901 | descending | 65 | S2B_T32UNU_20230210T102051_L2A | 115.135146 | 3.93679 | 161.778088 | 25.588092 | 2024-05-20T18:45:01.190912Z | 2023-02-10T11:20:49.024000Z |
| 7 | POLYGON ((10.36028 48.74498, 10.33433 47.75741... | sentinel-2 | 4.758111 | MGRS-32UPU | [msi] | PU | U | 32 | sentinel-2b | {'lat': 48.23937, 'lon': 11.0863} | ... | 1.520126 | descending | 65 | S2B_T32UPU_20230210T102051_L2A | 269.702743 | 4.830915 | 163.179111 | 25.871311 | 2024-05-20T18:45:01.848092Z | 2023-02-10T11:20:49.024000Z |
| 8 | POLYGON ((6.23438 49.55801, 6.23590 49.53173, ... | sentinel-2 | 0.471554 | MGRS-32ULA | [msi] | LA | U | 32 | sentinel-2a | {'lat': 50.01406, 'lon': 7.07489} | ... | 0.806024 | descending | 108 | S2A_T32ULA_20230208T103552_L2A | 104.936734 | 8.202581 | 161.814501 | 23.196688 | 2024-05-20T16:35:12.733615Z | 2023-02-08T11:32:11.024000Z |
| 9 | POLYGON ((6.28064 48.72091, 6.33246 47.73416, ... | sentinel-2 | 1.271214 | MGRS-32ULU | [msi] | LU | U | 32 | sentinel-2a | {'lat': 48.24168, 'lon': 7.04561} | ... | 0.747165 | descending | 108 | S2A_T32ULU_20230208T103552_L2A | 105.077485 | 4.6685 | 161.762893 | 24.94453 | 2024-05-20T16:35:13.023435Z | 2023-02-08T11:32:11.024000Z |


## Visualize the covered area

In [8]:
import folium
import folium.plugins as folium_plugins

# Use a dedicated name so the Python builtin `map` is not shadowed
footprint_map = folium.Map()
layer_control = folium.LayerControl(position='topright', collapsed=True)
fullscreen = folium_plugins.Fullscreen()
style = {'fillColor': '#00000000', "color": "#0000ff", "weight": 1}

footprints = folium.GeoJson(
    dataframe.to_json(),
    name='STAC Item footprints',
    style_function=lambda x: style,
    control=True
)

footprints.add_to(footprint_map)
layer_control.add_to(footprint_map)
fullscreen.add_to(footprint_map)
footprint_map.fit_bounds(footprint_map.get_bounds())
footprint_map