In [1]:
import pystac
import pystac_client
import stac2dcache

In [2]:
stac2dcache.__version__

'0.2.0'

# Sentinel-2 data from Earth Search STAC endpoint (AWS Open Datasets)

In this notebook we:
* query the public STAC endpoint that lists the Sentinel-2 Open Datasets on AWS for all assets corresponding to one MGRS tile;
* add the returned items to a new catalog;
* save the catalog (metadata only, with links to the "real" data on AWS) to the SURF dCache storage system. 

## Earth Search STAC endpoint

The [Earth Search STAC endpoint](https://www.element84.com/earth-search/) offers access to the following Open Data collections on AWS:

* Sentinel-2, Level 1C (`sentinel-s2-l1c`) and Level 2A (`sentinel-s2-l2a`), info [here](https://registry.opendata.aws/sentinel-2/), data available **only as requester-pays**;
* Sentinel-2, Level 2A (`sentinel-s2-l2a-cogs`), saved as cloud-optimized GeoTIFFs (COGs), info [here](https://registry.opendata.aws/sentinel-2-l2a-cogs/), public data;
* Landsat 8, Collection 1 Level 1 (`landsat-8-l1-c1`), info [here](https://registry.opendata.aws/usgs-landsat/index.html), public data.

The STAC API endpoint is available at the following link:

In [3]:
api_url = "https://earth-search.aws.element84.com/v0"

The endpoint can also be browsed [here](https://stacindex.org/catalogs/earth-search#/).

## Search for available assets

We look for the Sentinel-2 scenes that are available for the Red Glacier (Alaska). We define the area of interest using its MGRS tile:

In [4]:
utm_zone = "5"
latitude_band = "V"
grid_square = "MG"
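
The three components concatenate into the tile identifier that appears in the item IDs returned by the searches below (e.g. `S2A_5VMG_...`). A minimal sketch; the name `tile_id` is introduced here only for illustration:

```python
# Components of the MGRS tile, as defined in the notebook.
utm_zone = "5"
latitude_band = "V"
grid_square = "MG"

# Concatenate zone, latitude band and grid square into one identifier.
tile_id = f"{utm_zone}{latitude_band}{grid_square}"
print(tile_id)
```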

We use [PySTAC Client](https://pystac-client.readthedocs.io) to query the STAC endpoint:

In [5]:
client = pystac_client.Client.open(api_url)
client

<Client id=earth-search>

We start by looking for data processed at Level-1C (top-of-atmosphere reflectance), which has collection ID:

In [6]:
collection_id = "sentinel-s2-l1c"

In [7]:
search = client.search(
    collections=[collection_id],
    query=[
        f"sentinel:utm_zone={utm_zone}",
        f"sentinel:latitude_band={latitude_band}",
        f"sentinel:grid_square={grid_square}"   
    ]
)
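
The same three property filters are reused for the second search further down. As a sketch, they could be generated from a mapping by a small helper; `build_query` is a hypothetical name, not part of PySTAC Client:

```python
def build_query(properties):
    """Turn a {property: value} mapping into 'property=value' filter strings."""
    return [f"{name}={value}" for name, value in properties.items()]


# Same filters as in the search above, expressed as a mapping.
query = build_query({
    "sentinel:utm_zone": "5",
    "sentinel:latitude_band": "V",
    "sentinel:grid_square": "MG",
})
print(query)
```

The resulting list has the same string form as the `query=` argument used above, so it could be passed to `client.search` unchanged.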

In [8]:
# print how many items are found
search.matched()

627

In [None]:
items_l1c = search.get_all_items()

`items_l1c` is a PySTAC `ItemCollection` object. To list all items:

In [None]:
items_l1c.items

[<Item id=S2A_5VMG_20210417_0_L1C>,
 <Item id=S2B_5VMG_20210415_0_L1C>,
 <Item id=S2A_5VMG_20210413_0_L1C>,
 <Item id=S2B_5VMG_20210412_0_L1C>,
 <Item id=S2A_5VMG_20210410_0_L1C>,
 <Item id=S2B_5VMG_20210408_0_L1C>,
 <Item id=S2A_5VMG_20210407_0_L1C>,
 <Item id=S2B_5VMG_20210405_0_L1C>,
 <Item id=S2A_5VMG_20210403_0_L1C>,
 <Item id=S2B_5VMG_20210402_0_L1C>,
 <Item id=S2A_5VMG_20210331_0_L1C>,
 <Item id=S2B_5VMG_20210329_0_L1C>,
 <Item id=S2B_5VMG_20210326_0_L1C>,
 <Item id=S2A_5VMG_20210324_0_L1C>,
 <Item id=S2B_5VMG_20210323_0_L1C>,
 <Item id=S2A_5VMG_20210321_0_L1C>,
 <Item id=S2B_5VMG_20210319_0_L1C>,
 <Item id=S2A_5VMG_20210318_0_L1C>,
 <Item id=S2B_5VMG_20210316_0_L1C>,
 <Item id=S2A_5VMG_20210314_0_L1C>,
 <Item id=S2B_5VMG_20210313_0_L1C>,
 <Item id=S2A_5VMG_20210311_0_L1C>,
 <Item id=S2B_5VMG_20210309_0_L1C>,
 <Item id=S2A_5VMG_20210308_0_L1C>,
 <Item id=S2B_5VMG_20210306_0_L1C>,
 <Item id=S2A_5VMG_20210304_0_L1C>,
 <Item id=S2B_5VMG_20210303_0_L1C>,
 <Item id=S2A_5VMG_20210301_
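
The item IDs printed above appear to encode platform, tile, acquisition date and processing level. The sketch below splits an ID into those parts. The layout is inferred from the printed IDs only, and the meaning of the fourth field (called `index` here) is an assumption; `parse_item_id` is a hypothetical helper.

```python
import re
from datetime import date, datetime

# Assumed layout: <platform>_<tile>_<YYYYMMDD>_<index>_<level>
ITEM_ID_PATTERN = re.compile(
    r"^(?P<platform>S2[AB])_(?P<tile>[0-9A-Z]+)_(?P<date>\d{8})"
    r"_(?P<index>\d+)_(?P<level>L1C|L2A)$"
)


def parse_item_id(item_id):
    """Split an item ID into its fields, converting date and index."""
    match = ITEM_ID_PATTERN.match(item_id)
    if match is None:
        raise ValueError(f"unexpected item ID layout: {item_id!r}")
    fields = match.groupdict()
    fields["date"] = datetime.strptime(fields["date"], "%Y%m%d").date()
    fields["index"] = int(fields["index"])
    return fields


parsed = parse_item_id("S2A_5VMG_20210417_0_L1C")
print(parsed)
```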

We repeat the search for the Level-2A (bottom-of-atmosphere reflectance) dataset, in which the images have been converted to cloud-optimized GeoTIFF format and are publicly available from AWS:

In [None]:
collection_id = "sentinel-s2-l2a-cogs"

In [None]:
search = client.search(
    collections=[collection_id],
    query=[
        f"sentinel:utm_zone={utm_zone}",
        f"sentinel:latitude_band={latitude_band}",
        f"sentinel:grid_square={grid_square}"
    ]
)

In [None]:
# print how many items are found
search.matched()

765

In [14]:
items_l2a = search.get_all_items()

Again, print the list of all items:

In [15]:
items_l2a.items

[<Item id=S2A_5VMG_20211116_0_L2A>,
 <Item id=S2B_5VMG_20211114_0_L2A>,
 <Item id=S2A_5VMG_20211113_0_L2A>,
 <Item id=S2B_5VMG_20211111_0_L2A>,
 <Item id=S2A_5VMG_20211109_0_L2A>,
 <Item id=S2B_5VMG_20211108_0_L2A>,
 <Item id=S2A_5VMG_20211106_0_L2A>,
 <Item id=S2B_5VMG_20211104_0_L2A>,
 <Item id=S2A_5VMG_20211103_0_L2A>,
 <Item id=S2B_5VMG_20211101_0_L2A>,
 <Item id=S2A_5VMG_20211030_0_L2A>,
 <Item id=S2B_5VMG_20211029_0_L2A>,
 <Item id=S2A_5VMG_20211027_0_L2A>,
 <Item id=S2B_5VMG_20211025_0_L2A>,
 <Item id=S2A_5VMG_20211024_0_L2A>,
 <Item id=S2B_5VMG_20211022_0_L2A>,
 <Item id=S2A_5VMG_20211020_0_L2A>,
 <Item id=S2B_5VMG_20211019_0_L2A>,
 <Item id=S2A_5VMG_20211017_0_L2A>,
 <Item id=S2B_5VMG_20211015_0_L2A>,
 <Item id=S2A_5VMG_20211014_0_L2A>,
 <Item id=S2B_5VMG_20211012_0_L2A>,
 <Item id=S2A_5VMG_20211010_0_L2A>,
 <Item id=S2B_5VMG_20211009_0_L2A>,
 <Item id=S2A_5VMG_20211007_0_L2A>,
 <Item id=S2B_5VMG_20211005_0_L2A>,
 <Item id=S2A_5VMG_20211004_0_L2A>,
 <Item id=S2B_5VMG_20211002_

Find out which items are present in the L2A collection but not in the L1C one:

In [16]:
# collect the L1C datetimes in a set for fast membership tests
l1c_dates = {item.properties["datetime"] for item in items_l1c}
missing = [item for item in items_l2a if item.properties["datetime"] not in l1c_dates]
missing

[<Item id=S2A_5VMG_20211116_0_L2A>,
 <Item id=S2B_5VMG_20211114_0_L2A>,
 <Item id=S2A_5VMG_20211113_0_L2A>,
 <Item id=S2B_5VMG_20211111_0_L2A>,
 <Item id=S2A_5VMG_20211109_0_L2A>,
 <Item id=S2B_5VMG_20211108_0_L2A>,
 <Item id=S2A_5VMG_20211106_0_L2A>,
 <Item id=S2B_5VMG_20211104_0_L2A>,
 <Item id=S2A_5VMG_20211103_0_L2A>,
 <Item id=S2B_5VMG_20211101_0_L2A>,
 <Item id=S2A_5VMG_20211030_0_L2A>,
 <Item id=S2B_5VMG_20211029_0_L2A>,
 <Item id=S2A_5VMG_20211027_0_L2A>,
 <Item id=S2B_5VMG_20211025_0_L2A>,
 <Item id=S2A_5VMG_20211024_0_L2A>,
 <Item id=S2B_5VMG_20211022_0_L2A>,
 <Item id=S2A_5VMG_20211020_0_L2A>,
 <Item id=S2B_5VMG_20211019_0_L2A>,
 <Item id=S2A_5VMG_20211017_0_L2A>,
 <Item id=S2B_5VMG_20211015_0_L2A>,
 <Item id=S2A_5VMG_20211014_0_L2A>,
 <Item id=S2B_5VMG_20211012_0_L2A>,
 <Item id=S2A_5VMG_20211010_0_L2A>,
 <Item id=S2B_5VMG_20211009_0_L2A>,
 <Item id=S2A_5VMG_20211007_0_L2A>,
 <Item id=S2B_5VMG_20211005_0_L2A>,
 <Item id=S2A_5VMG_20211004_0_L2A>,
 <Item id=S2B_5VMG_20211002_

The oldest missing assets seem to be the ones whose cloud cover is reported as "invalid" in the Sentinel-2 metadata:

In [17]:
invalid_cloud_cover = [item for item in items_l2a if not item.properties['sentinel:valid_cloud_cover']]
invalid_cloud_cover

[<Item id=S2B_5VMG_20210624_0_L2A>,
 <Item id=S2A_5VMG_20210530_0_L2A>,
 <Item id=S2B_5VMG_20210522_0_L2A>,
 <Item id=S2A_5VMG_20210520_0_L2A>,
 <Item id=S2B_5VMG_20210518_0_L2A>,
 <Item id=S2B_5VMG_20210515_0_L2A>,
 <Item id=S2A_5VMG_20210513_0_L2A>,
 <Item id=S2B_5VMG_20210512_0_L2A>,
 <Item id=S2A_5VMG_20210510_0_L2A>,
 <Item id=S2A_5VMG_20210503_0_L2A>,
 <Item id=S2A_5VMG_20210328_0_L2A>,
 <Item id=S2B_5VMG_20210115_0_L2A>,
 <Item id=S2B_5VMG_20201010_0_L2A>,
 <Item id=S2B_5VMG_20200904_0_L2A>,
 <Item id=S2B_5VMG_20200831_0_L2A>,
 <Item id=S2B_5VMG_20200828_0_L2A>,
 <Item id=S2A_5VMG_20200823_0_L2A>,
 <Item id=S2A_5VMG_20200820_0_L2A>]

However, assets seem to be missing from ~June 2021 onwards. We raised an issue [here](https://github.com/stac-utils/stac-sentinel/issues/4).
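
One way to see where a gap starts is to tally the missing items by year and month. The sketch below does this for a few IDs copied from the `missing` output above, assuming the third underscore-separated field of an ID is the acquisition date (`YYYYMMDD`); `year_month` is a hypothetical helper.

```python
from collections import Counter

# A few item IDs copied from the "missing" output above.
missing_ids = [
    "S2A_5VMG_20211116_0_L2A",
    "S2B_5VMG_20211114_0_L2A",
    "S2A_5VMG_20211030_0_L2A",
    "S2A_5VMG_20211004_0_L2A",
]


def year_month(item_id):
    """Return 'YYYY-MM' from the date field of an item ID (assumed layout)."""
    stamp = item_id.split("_")[2]
    return f"{stamp[:4]}-{stamp[4:6]}"


# Count the items per month and print the months in chronological order.
per_month = Counter(year_month(item_id) for item_id in missing_ids)
for month, count in sorted(per_month.items()):
    print(month, count)
```

With the full result, the IDs would come from `[item.id for item in missing]` instead of the hard-coded list.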

## Create catalog with search results

We store both sets of items in a single catalog, using the [PySTAC](https://pystac.readthedocs.io/en/latest/) library:

In [19]:
catalog_id = "red-glacier_earth-search"

In [20]:
catalog = pystac.Catalog(
    id=catalog_id,
    description=("This catalog contains MGRS tiles that include "
                 "the Red Glacier (Alaska) as retrieved from the "
                 "Earth Search STAC endpoint.")
)
catalog

<Catalog id=red-glacier_earth-search>

In [21]:
# add search results to catalog
for items in (items_l1c, items_l2a):
    catalog.add_items(items)

In [22]:
# replace the items' remote self-links with relative links
catalog.normalize_hrefs(catalog_id)

We reorganize the catalog using the following template:

In [23]:
template = "${collection}/${year}/${month}/${day}"
catalog.generate_subcatalogs(template)

[<Catalog id=sentinel-s2-l1c>,
 <Catalog id=2021>,
 <Catalog id=4>,
 <Catalog id=17>,
 <Catalog id=15>,
 <Catalog id=13>,
 <Catalog id=12>,
 <Catalog id=10>,
 <Catalog id=8>,
 <Catalog id=7>,
 <Catalog id=5>,
 <Catalog id=3>,
 <Catalog id=2>,
 <Catalog id=3>,
 <Catalog id=31>,
 <Catalog id=29>,
 <Catalog id=26>,
 <Catalog id=24>,
 <Catalog id=23>,
 <Catalog id=21>,
 <Catalog id=19>,
 <Catalog id=18>,
 <Catalog id=16>,
 <Catalog id=14>,
 <Catalog id=13>,
 <Catalog id=11>,
 <Catalog id=9>,
 <Catalog id=8>,
 <Catalog id=6>,
 <Catalog id=4>,
 <Catalog id=3>,
 <Catalog id=1>,
 <Catalog id=2>,
 <Catalog id=27>,
 <Catalog id=26>,
 <Catalog id=24>,
 <Catalog id=22>,
 <Catalog id=21>,
 <Catalog id=19>,
 <Catalog id=17>,
 <Catalog id=16>,
 <Catalog id=14>,
 <Catalog id=12>,
 <Catalog id=11>,
 <Catalog id=9>,
 <Catalog id=7>,
 <Catalog id=6>,
 <Catalog id=4>,
 <Catalog id=2>,
 <Catalog id=1>,
 <Catalog id=1>,
 <Catalog id=30>,
 <Catalog id=28>,
 <Catalog id=27>,
 <Catalog id=25>,
 <Catalog id=23>
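
To make the template concrete, the sketch below expands it for a single item using the standard library's `string.Template`. This only illustrates the substitution. It is not how PySTAC implements layout templates, and the names `layout`, `acquired` and `subpath` are introduced here for the example.

```python
from datetime import datetime
from string import Template

# Same template string as above.
layout = Template("${collection}/${year}/${month}/${day}")

# Values for item S2A_5VMG_20210417_0_L1C.
collection_id = "sentinel-s2-l1c"
acquired = datetime(2021, 4, 17)

subpath = layout.substitute(
    collection=collection_id,
    year=acquired.year,
    month=acquired.month,
    day=acquired.day,
)
print(subpath)  # sentinel-s2-l1c/2021/4/17
```

The month and day are not zero-padded, which matches the subcatalog IDs listed above (`4`, `17`, ...).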

Display the current catalog structure:

In [24]:
catalog.describe()

* <Catalog id=red-glacier_earth-search>
    * <Catalog id=sentinel-s2-l1c>
        * <Catalog id=2021>
            * <Catalog id=4>
                * <Catalog id=17>
                  * <Item id=S2A_5VMG_20210417_0_L1C>
                * <Catalog id=15>
                  * <Item id=S2B_5VMG_20210415_0_L1C>
                * <Catalog id=13>
                  * <Item id=S2A_5VMG_20210413_0_L1C>
                * <Catalog id=12>
                  * <Item id=S2B_5VMG_20210412_0_L1C>
                * <Catalog id=10>
                  * <Item id=S2A_5VMG_20210410_0_L1C>
                * <Catalog id=8>
                  * <Item id=S2B_5VMG_20210408_0_L1C>
                * <Catalog id=7>
                  * <Item id=S2A_5VMG_20210407_0_L1C>
                * <Catalog id=5>
                  * <Item id=S2B_5VMG_20210405_0_L1C>
                * <Catalog id=3>
                  * <Item id=S2A_5VMG_20210403_0_L1C>
                * <Catalog id=2>
                  * <Item id=S2B_5VMG_20210402_

## Save the catalog

To save the catalog (only metadata) locally:

In [25]:
catalog.normalize_and_save(
    f"./{catalog_id}",
    catalog_type='SELF_CONTAINED'
)
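
In a self-contained catalog the hierarchy links (`root`, `parent`, `child`, `item`) are expected to be relative, while asset links still point to the data on AWS. The sketch below scans saved JSON files for hierarchy links that are absolute. `absolute_structural_links` is a hypothetical helper, and the demo runs on a throwaway directory with made-up files rather than on the real catalog.

```python
import json
import tempfile
from pathlib import Path

# Link relations that define the catalog hierarchy.
STRUCTURAL_RELS = {"root", "parent", "child", "item"}


def absolute_structural_links(root_dir):
    """Return (rel, href) pairs for hierarchy links that are not relative."""
    offenders = []
    for path in sorted(Path(root_dir).rglob("*.json")):
        document = json.loads(path.read_text())
        for link in document.get("links", []):
            href = link.get("href", "")
            is_absolute = "://" in href or href.startswith("/")
            if link.get("rel") in STRUCTURAL_RELS and is_absolute:
                offenders.append((link["rel"], href))
    return offenders


# Demo: one file with a relative link, one with an absolute link.
with tempfile.TemporaryDirectory() as tmp:
    good = {"links": [{"rel": "child", "href": "./sentinel-s2-l1c/catalog.json"}]}
    bad = {"links": [{"rel": "item", "href": "https://example.com/item.json"}]}
    (Path(tmp) / "catalog.json").write_text(json.dumps(good))
    (Path(tmp) / "other.json").write_text(json.dumps(bad))
    offenders = absolute_structural_links(tmp)

print(offenders)
```

For the catalog saved above, the call would be `absolute_structural_links(f"./{catalog_id}")`, and an empty list would indicate that all hierarchy links are relative.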

To save the catalog on the dCache storage system, we use [STAC2dCache](https://github.com/NLeSC-GO-common-infrastructure/stac2dcache). To authenticate with dCache, we use a macaroon, which we have saved in a plain-text file.

In [26]:
url = (f"https://webdav.grid.surfsara.nl:2880/pnfs/"
       f"grid.sara.nl/data/eratosthenes/disk/{catalog_id}")

In [27]:
# configure PySTAC to read from/write to dCache
fs = stac2dcache.configure_filesystem(
    filesystem="dcache", 
    token_filename="macaroon.dat"
)
stac_io = stac2dcache.configure_stac_io(fs)
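
For readers unfamiliar with macaroons: over HTTP/WebDAV a macaroon is typically presented as a bearer token. The sketch below reads a token from a plain-text file and builds the corresponding header. It assumes the file contains only the token, and it does not show what STAC2dCache does internally; `bearer_headers` is a hypothetical helper and the token is a dummy value.

```python
import tempfile
from pathlib import Path


def bearer_headers(token_filename):
    """Read a token from a plain-text file and build an Authorization header."""
    token = Path(token_filename).read_text().strip()
    if not token:
        raise ValueError(f"no token found in {token_filename}")
    return {"Authorization": f"Bearer {token}"}


# Demo with a dummy token written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    token_file = Path(tmp) / "macaroon.dat"
    token_file.write_text("dummy-macaroon\n")
    headers = bearer_headers(token_file)

print(headers)
```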

In [29]:
# save catalog to storage
catalog._stac_io = stac_io
catalog.normalize_and_save(url, catalog_type='SELF_CONTAINED')