In [1]:
# using sat-search from fork: https://github.com/trailbehind/sat-search - see issues below
!pip install --ignore-installed https://github.com/trailbehind/sat-search/archive/fix/paginator.tar.gz

Collecting https://github.com/trailbehind/sat-search/archive/fix/paginator.tar.gz
  Using cached https://github.com/trailbehind/sat-search/archive/fix/paginator.tar.gz
Collecting sat-stac~=0.4.0
  Using cached sat_stac-0.4.1-py3-none-any.whl
Collecting requests>=2.19.1
  Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting python-dateutil~=2.7.5
  Using cached python_dateutil-2.7.5-py2.py3-none-any.whl (225 kB)
Collecting six>=1.5
  Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting chardet<5,>=3.0.2
  Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Collecting idna<3,>=2.5
  Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached urllib3-1.26.4-py2.py3-none-any.whl (153 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2020.12.5-py2.py3-none-any.whl (147 kB)
Building wheels for collected packages: sat-search
  Building wheel for sat-search (setup.py) ... done
  Created wheel for sat-sea

In [20]:
import pystac
import satsearch
import stac2dcache

In [3]:
satsearch.__version__

'0.3.0'

In [4]:
stac2dcache.__version__

'0.1.0'

# Search for Sentinel-2 data in AWS Public Datasets

## Search for available assets

We look for the Sentinel-2 scenes that are available for the Red Glacier (Alaska). We define the area of interest using its MGRS tile:

In [5]:
utm_zone = "5"
latitude_band = "V"
grid_square = "MG"
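The three components combine into the tile identifier that appears in the item IDs returned by the search. A minimal sketch of that relationship, assuming the ID layout `<platform>_<tile>_<date>_<sequence>_<level>` inferred from the item IDs listed in the search output:

```python
# Components of the MGRS tile, as defined in the cell above
utm_zone = "5"
latitude_band = "V"
grid_square = "MG"

# Concatenating them gives the tile identifier
tile_id = f"{utm_zone}{latitude_band}{grid_square}"
print(tile_id)

# The same identifier is embedded in the item IDs.
# The field names below are an assumption based on the IDs in the search output.
example_id = "S2B_5VMG_20210329_0_L1C"
platform, tile, date, sequence, level = example_id.split("_")
print(tile == tile_id)
```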

We use `sat-search` to query the STAC catalog for the [AWS Sentinel-2 Datasets](https://registry.opendata.aws/sentinel-2), whose API URL is:

In [6]:
api_url = "https://earth-search.aws.element84.com/v0"

We start by looking for data processed at Level-1C (top-of-atmosphere reflectance), which has collection ID:

In [7]:
collection_id = "sentinel-s2-l1c"

In [8]:
search_kwargs = dict(
    url=api_url,
    collections=[collection_id],
    query=[
        f"sentinel:utm_zone={utm_zone}",
        f"sentinel:latitude_band={latitude_band}",
        f"sentinel:grid_square={grid_square}"
    ]
)
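Each entry of `query` is a `property<operator>value` string, which ends up as a nested dictionary in the request sent to the API. A minimal sketch of that translation, assuming the operator names below (the helper `to_query_dict` is hypothetical and not part of `sat-search`):

```python
import json

# Assumed operator mapping; two-character operators come first so that
# ">=" is matched before "=" or ">"
OPERATORS = {">=": "gte", "<=": "lte", "=": "eq", ">": "gt", "<": "lt"}


def to_query_dict(filters):
    """Convert strings such as "key=value" into {"key": {"eq": "value"}}."""
    query = {}
    for expression in filters:
        for symbol, name in OPERATORS.items():
            parts = expression.split(symbol)
            if len(parts) == 2:
                key, value = parts
                # several conditions on the same property are merged
                query.setdefault(key, {})[name] = value
                break
    return query


query = to_query_dict([
    "sentinel:utm_zone=5",
    "sentinel:latitude_band=V",
    "sentinel:grid_square=MG",
])
print(json.dumps(query, indent=2))
```

Note that in this sketch the values stay strings, so the UTM zone is sent as `"5"` rather than as the number `5`.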

In [9]:
search = satsearch.Search.search(**search_kwargs)

In [10]:
# print how many items are found
search.found()

616

As of 2021-03-31, there is a bug in the paging of sat-server: `search.items()` returns multiple copies of the first 500 items, up to a total of 10000 (see [this pull request](https://github.com/sat-utils/sat-search/pull/107)). We therefore use a [fork of sat-search](https://github.com/trailbehind/sat-search) that implements a workaround.

In [11]:
items_l1c = search.items()
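Given the paging bug described above, it is worth checking that the retrieved items contain no duplicates. A minimal sketch of such a check on a list of item IDs (the helper `duplicated_ids` is hypothetical, and the sample list deliberately contains a repeated ID for illustration):

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the IDs that occur more than once, in order of first appearance."""
    counts = Counter(ids)
    return [item_id for item_id, n in counts.items() if n > 1]


# Stand-in for `[item.id for item in items_l1c]`; the first ID is repeated on purpose
ids = [
    "S2B_5VMG_20210329_0_L1C",
    "S2B_5VMG_20210326_0_L1C",
    "S2B_5VMG_20210329_0_L1C",
]
print(duplicated_ids(ids))
```

With the workaround in place, the list of duplicates should be empty and the number of unique IDs should match the value returned by `search.found()`.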

Print data summary:

In [12]:
print(items_l1c.summary(params=["date", "id", "sentinel:data_coverage", "eo:cloud_cover"]))

Items (616):
date                      id                        sentinel:data_coverage    eo:cloud_cover            
2021-03-29                S2B_5VMG_20210329_0_L1C   20.11                     99.99                     
2021-03-26                S2B_5VMG_20210326_0_L1C   100                       61.37                     
2021-03-24                S2A_5VMG_20210324_0_L1C   20.07                     100                       
2021-03-23                S2B_5VMG_20210323_0_L1C   99.08                     99.98                     
2021-03-21                S2A_5VMG_20210321_0_L1C   100                       67.48                     
2021-03-19                S2B_5VMG_20210319_0_L1C   20.04                     69.51                     
2021-03-18                S2A_5VMG_20210318_0_L1C   99.1                      78.54                     
2021-03-16                S2B_5VMG_20210316_0_L1C   100                       37.93                     
2021-03-14                S2A_5VMG_2021031

We repeat the search for the Level-2A (bottom-of-atmosphere reflectance) dataset, whose images have been converted to cloud-optimized GeoTIFF format and are publicly available from AWS:

In [13]:
collection_id = "sentinel-s2-l2a-cogs"

In [14]:
search_kwargs.update(collections=[collection_id])

In [15]:
search = satsearch.Search.search(**search_kwargs)

In [16]:
# print how many items are found
search.found()

624

In [17]:
items_l2a = search.items()

Again, print data summary:

In [18]:
print(items_l2a.summary(params=["date", "id", "sentinel:data_coverage", "eo:cloud_cover"]))

Items (624):
date                      id                        sentinel:data_coverage    eo:cloud_cover            
2021-03-29                S2B_5VMG_20210329_0_L2A   20.11                     99.99                     
2021-03-28                S2A_5VMG_20210328_0_L2A   99.06                     0                         
2021-03-26                S2B_5VMG_20210326_0_L2A   100                       61.37                     
2021-03-24                S2A_5VMG_20210324_0_L2A   20.07                     100                       
2021-03-23                S2B_5VMG_20210323_0_L2A   99.08                     99.98                     
2021-03-21                S2A_5VMG_20210321_0_L2A   100                       67.48                     
2021-03-19                S2B_5VMG_20210319_0_L2A   20.04                     69.51                     
2021-03-18                S2A_5VMG_20210318_0_L2A   99.1                      78.54                     
2021-03-16                S2B_5VMG_2021031

Find out which items are present in the L2A collection but not in the L1C one, comparing them by acquisition date:

In [19]:
l1c_dates = set(items_l1c.dates())  # compute the L1C dates only once
missing = [item for item in items_l2a if item.date not in l1c_dates]
missing

[S2A_5VMG_20210328_0_L2A,
 S2B_5VMG_20210115_0_L2A,
 S2B_5VMG_20201010_0_L2A,
 S2B_5VMG_20200904_0_L2A,
 S2B_5VMG_20200831_0_L2A,
 S2B_5VMG_20200828_0_L2A,
 S2A_5VMG_20200823_0_L2A,
 S2A_5VMG_20200820_0_L2A]
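The comparison above matches items by acquisition date. A minimal, self-contained sketch of the same idea on a few IDs taken from the two summaries; here the date is parsed from the ID, which is an assumption made for illustration (the notebook itself uses `item.date`):

```python
from datetime import datetime


def acquisition_date(item_id):
    """Extract the date from an ID such as S2B_5VMG_20210329_0_L1C."""
    return datetime.strptime(item_id.split("_")[2], "%Y%m%d").date()


# First entries of the L1C and L2A summaries printed above
l1c_ids = [
    "S2B_5VMG_20210329_0_L1C",
    "S2B_5VMG_20210326_0_L1C",
    "S2A_5VMG_20210324_0_L1C",
]
l2a_ids = [
    "S2B_5VMG_20210329_0_L2A",
    "S2A_5VMG_20210328_0_L2A",
    "S2B_5VMG_20210326_0_L2A",
    "S2A_5VMG_20210324_0_L2A",
]

# A set makes each membership test constant-time
l1c_dates = {acquisition_date(item_id) for item_id in l1c_ids}
missing_ids = [i for i in l2a_ids if acquisition_date(i) not in l1c_dates]
print(missing_ids)  # ['S2A_5VMG_20210328_0_L2A']
```

Matching by date alone would not distinguish two acquisitions of the same tile on the same day, so this check is only a first approximation.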

In [21]:
# select the items whose 'sentinel:valid_cloud_cover' property is false
invalid_cloud_cover = [item for item in items_l2a if not item.properties['sentinel:valid_cloud_cover']]
invalid_cloud_cover

[S2A_5VMG_20210328_0_L2A,
 S2B_5VMG_20210115_0_L2A,
 S2B_5VMG_20201010_0_L2A,
 S2B_5VMG_20200904_0_L2A,
 S2B_5VMG_20200831_0_L2A,
 S2B_5VMG_20200828_0_L2A,
 S2A_5VMG_20200823_0_L2A,
 S2A_5VMG_20200820_0_L2A]

## Create catalog with search results

We store both sets of items in a single catalog, using the [`PySTAC`](https://pystac.readthedocs.io/en/latest/) library:

In [74]:
catalog_id = "red-glacier_sentinel-2"

In [75]:
catalog = pystac.Catalog(
    id=catalog_id,
    description='This catalog contains Sentinel-2 tiles for the Red Glacier (Alaska)'
)
catalog

<Catalog id=red-glacier_sentinel-2>

In [76]:
# add search results to catalog
for item_collection in (items_l1c, items_l2a):
    items = (pystac.Item.from_dict(item._data) for item in item_collection)
    catalog.add_items(items)

In [77]:
# replace the items' remote self-links
# with relative links
catalog.normalize_hrefs(catalog_id)

<Catalog id=red-glacier_sentinel-2>

We reorganize the catalog using the following template:

In [78]:
template = "${collection}/${year}/${month}/${day}"
catalog.generate_subcatalogs(template)

[<Catalog id=sentinel-s2-l1c>,
 <Catalog id=2021>,
 <Catalog id=3>,
 <Catalog id=29>,
 <Catalog id=26>,
 <Catalog id=24>,
 <Catalog id=23>,
 <Catalog id=21>,
 <Catalog id=19>,
 <Catalog id=18>,
 <Catalog id=16>,
 <Catalog id=14>,
 <Catalog id=13>,
 <Catalog id=11>,
 <Catalog id=9>,
 <Catalog id=8>,
 <Catalog id=6>,
 <Catalog id=4>,
 <Catalog id=3>,
 <Catalog id=1>,
 <Catalog id=2>,
 <Catalog id=27>,
 <Catalog id=26>,
 <Catalog id=24>,
 <Catalog id=22>,
 <Catalog id=21>,
 <Catalog id=19>,
 <Catalog id=17>,
 <Catalog id=16>,
 <Catalog id=14>,
 <Catalog id=12>,
 <Catalog id=11>,
 <Catalog id=9>,
 <Catalog id=7>,
 <Catalog id=6>,
 <Catalog id=4>,
 <Catalog id=2>,
 <Catalog id=1>,
 <Catalog id=1>,
 <Catalog id=30>,
 <Catalog id=28>,
 <Catalog id=27>,
 <Catalog id=25>,
 <Catalog id=23>,
 <Catalog id=22>,
 <Catalog id=20>,
 <Catalog id=18>,
 <Catalog id=17>,
 <Catalog id=13>,
 <Catalog id=12>,
 <Catalog id=2020>,
 <Catalog id=12>,
 <Catalog id=3>,
 <Catalog id=1>,
 <Catalog id=11>,
 <Catalog 

In [79]:
catalog.describe()

* <Catalog id=red-glacier_sentinel-2>
    * <Catalog id=sentinel-s2-l1c>
        * <Catalog id=2021>
            * <Catalog id=3>
                * <Catalog id=29>
                  * <Item id=S2B_5VMG_20210329_0_L1C>
                * <Catalog id=26>
                  * <Item id=S2B_5VMG_20210326_0_L1C>
                * <Catalog id=24>
                  * <Item id=S2A_5VMG_20210324_0_L1C>
                * <Catalog id=23>
                  * <Item id=S2B_5VMG_20210323_0_L1C>
                * <Catalog id=21>
                  * <Item id=S2A_5VMG_20210321_0_L1C>
                * <Catalog id=19>
                  * <Item id=S2B_5VMG_20210319_0_L1C>
                * <Catalog id=18>
                  * <Item id=S2A_5VMG_20210318_0_L1C>
                * <Catalog id=16>
                  * <Item id=S2B_5VMG_20210316_0_L1C>
                * <Catalog id=14>
                  * <Item id=S2A_5VMG_20210314_0_L1C>
                * <Catalog id=13>
                  * <Item id=S2B_5VMG_202103
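The hierarchy printed above follows from the template: each `${...}` placeholder becomes one level of subcatalogs, and the month and day are not zero-padded (for instance `3`, not `03`). A minimal sketch of the expansion for a single item, using `string.Template` from the standard library as a stand-in for PySTAC's own template handling (the helper `subcatalog_path` is hypothetical):

```python
from datetime import datetime
from string import Template

# Same template string as in the notebook; string.Template happens to use
# the same ${name} placeholder syntax
template = Template("${collection}/${year}/${month}/${day}")


def subcatalog_path(collection, timestamp):
    """Expand the layout template for one item (values are not zero-padded)."""
    return template.substitute(
        collection=collection,
        year=timestamp.year,
        month=timestamp.month,
        day=timestamp.day,
    )


# Item S2B_5VMG_20210329_0_L1C from the listing above
print(subcatalog_path("sentinel-s2-l1c", datetime(2021, 3, 29)))
# sentinel-s2-l1c/2021/3/29
```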

## Save the catalog to dCache

We save the catalog on the dCache storage system, using a macaroon to authenticate access.

In [80]:
url = ("https://webdav.grid.surfsara.nl:2880/pnfs/"
       f"grid.sara.nl/data/eratosthenes/disk/{catalog_id}")

In [81]:
# configure connection to dCache
dcache = stac2dcache.configure(
    filesystem="dcache", 
    token_filename="macaroon.dat"
)
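`stac2dcache` takes care of the authentication details. As a rough illustration of what token-based access involves, the sketch below reads a token from a file and builds an HTTP header, assuming the macaroon is sent as a bearer token. The helper `bearer_headers` is hypothetical and does not reflect how `stac2dcache` is implemented; the token used here is a dummy value.

```python
import tempfile
from pathlib import Path


def bearer_headers(token_filename):
    """Read a token from a file and build an HTTP Authorization header."""
    token = Path(token_filename).read_text().strip()
    return {"Authorization": f"Bearer {token}"}


# Demonstration with a throwaway file and a dummy token
with tempfile.TemporaryDirectory() as tmp:
    token_file = Path(tmp) / "macaroon.dat"
    token_file.write_text("dummy-token\n")
    headers = bearer_headers(token_file)

print(headers)  # {'Authorization': 'Bearer dummy-token'}
```

Since the macaroon grants access to the storage, the token file should be kept out of version control.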

In [82]:
# save catalog to storage
catalog.normalize_and_save(url, catalog_type=pystac.CatalogType.SELF_CONTAINED)
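In a self-contained catalog, the links between catalogs and items are relative, so the whole tree can be moved or copied without breaking them. A minimal sketch of how such relative links resolve, assuming that each item is stored in a directory named after its ID and each catalog in a `catalog.json` file (the paths below are illustrative, and `relative_link` is a hypothetical helper):

```python
import posixpath


def relative_link(source_href, target_href):
    """Return target_href relative to the directory containing source_href."""
    return posixpath.relpath(target_href, start=posixpath.dirname(source_href))


# Assumed file layout for one item, its parent subcatalog and the root catalog
item = (
    "red-glacier_sentinel-2/sentinel-s2-l1c/2021/3/29/"
    "S2B_5VMG_20210329_0_L1C/S2B_5VMG_20210329_0_L1C.json"
)
parent = "red-glacier_sentinel-2/sentinel-s2-l1c/2021/3/29/catalog.json"
root = "red-glacier_sentinel-2/catalog.json"

print(relative_link(item, parent))  # ../catalog.json
print(relative_link(item, root))    # ../../../../../catalog.json
```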