# Using the PySTAC API

There is an abundance of data searchable through NASA's [Earthdata Search website](https://search.earthdata.nasa.gov). The preceding link connects to a GUI for searching [SpatioTemporal Asset Catalogs (STAC)](https://stacspec.org/) by specifying an *Area of Interest (AOI)* and a *time-window* or *range of dates*.

For the sake of reproducibility, we want to be able to search asset catalogs programmatically. This is where the [PySTAC](https://pystac.readthedocs.io/en/stable/) library and its companion package `pystac_client` come in.

---

## Defining AOI & range of dates

In this notebook, we'll set up a DataFrame to process results retrieved when searching relevant OPERA DSWx-HLS data catalogs. Let's start by considering a particular example. [Heavy rains severely impacted Southeast Texas in May 2024](https://www.texastribune.org/2024/05/03/texas-floods-weather-harris-county/), resulting in [flooding and causing significant damage to property and human life](https://www.texastribune.org/series/east-texas-floods-2024/).
 

Let's start with relevant imports.

In [1]:
from warnings import filterwarnings
filterwarnings('ignore')
# data wrangling imports
import numpy as np
import pandas as pd
import xarray as xr
import rioxarray as rio
import rasterio

In [2]:
# Imports for plotting
import hvplot.pandas
import geoviews as gv
from geoviews import opts
gv.extension('bokeh')

In [3]:
# STAC imports to retrieve cloud data
from pystac_client import Client
from osgeo import gdal
# GDAL setup for accessing cloud data
gdal.SetConfigOption('GDAL_HTTP_COOKIEFILE','~/.cookies.txt')
gdal.SetConfigOption('GDAL_HTTP_COOKIEJAR', '~/.cookies.txt')
gdal.SetConfigOption('GDAL_DISABLE_READDIR_ON_OPEN','EMPTY_DIR')
gdal.SetConfigOption('CPL_VSIL_CURL_ALLOWED_EXTENSIONS','TIF, TIFF')

Next, let's define search parameters so we can retrieve data pertinent to that flooding event. This involves specifying an *area of interest (AOI)* and a *range of dates*.
+ The AOI is specified as a rectangle of longitude-latitude coordinates in a single 4-tuple of the form
  $$({\mathtt{longitude}}_{\mathrm{min}},{\mathtt{latitude}}_{\mathrm{min}},{\mathtt{longitude}}_{\mathrm{max}},{\mathtt{latitude}}_{\mathrm{max}}),$$
  i.e., the lower-left corner coordinates followed by the upper-right corner coordinates.
+ The range of dates is specified as a string of the form
  $$ {\mathtt{date}_{\mathrm{start}}}/{\mathtt{date}_{\mathrm{end}}}, $$
  where dates are specified in standard ISO 8601 format `YYYY-MM-DD`.
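
Before sending a query, it can help to sanity-check both values. The sketch below is only an illustration: the helper name `validate_search_window` is hypothetical (it is not part of PySTAC), and it assumes the AOI does not cross the antimeridian, so that the minimum longitude is always smaller than the maximum.

```python
from datetime import date

def validate_search_window(bbox, date_range):
    """Check a (lon_min, lat_min, lon_max, lat_max) tuple and a 'start/end' string.

    Hypothetical helper for illustration; returns the parsed (start, end) dates.
    """
    lon_min, lat_min, lon_max, lat_max = bbox
    # Assumption: the AOI does not straddle the antimeridian
    if not (-180 <= lon_min < lon_max <= 180):
        raise ValueError(f"longitudes out of order or out of range: {lon_min}, {lon_max}")
    if not (-90 <= lat_min < lat_max <= 90):
        raise ValueError(f"latitudes out of order or out of range: {lat_min}, {lat_max}")
    # The date range is two ISO 8601 dates (YYYY-MM-DD) separated by a slash
    start_text, end_text = date_range.split('/')
    start, end = date.fromisoformat(start_text), date.fromisoformat(end_text)
    if start > end:
        raise ValueError(f"start date {start} is after end date {end}")
    return start, end

start, end = validate_search_window((-95.34, 30.565, -94.84, 30.815),
                                    '2024-04-01/2024-05-31')
print((end - start).days)  # length of the time window in days
```

A check like this catches the common mistake of supplying coordinates in (lat, lon) order or swapping the two corners.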

In [4]:
# Study location
livingston_tx_lonlat = (-95.09,30.69) # (lon, lat) form

In [5]:
# simple utility to make a rectangle centered at pt of width dx & height dy
def make_bbox(pt,dx,dy):
    '''Returns bounding-box represented as tuple (x_lo, y_lo, x_hi, y_hi)
    given inputs pt=(x, y), width & height dx & dy respectively,
    where x_lo = x-dx/2, x_hi=x+dx/2, y_lo = y-dy/2, y_hi = y+dy/2.
    '''
    return tuple(coord+sgn*delta for sgn in (-1,+1) for coord,delta in zip(pt, (dx/2,dy/2)))

In [6]:
# simple utility to plot a bounding box & its centre point
def plot_bbox(bbox):
    '''Returns GeoViews plot of a Rectangle & its centre Point given
    + bbox: bounding-box specified as (lon_min, lat_min, lon_max, lat_max)
    '''
    # These plot options are fixed but can be over-ridden
    point_opts = opts.Points(size=12, alpha=0.25, color='blue')
    rect_opts = opts.Rectangles(line_width=0, alpha=0.1, color='red')
    lon_lat = (0.5*sum(bbox[::2]), 0.5*sum(bbox[1::2]))
    return (gv.Points([lon_lat]) * gv.Rectangles([bbox])).opts(point_opts, rect_opts)

In [7]:
AOI = make_bbox(livingston_tx_lonlat, 0.5, 0.25)
basemap = gv.tile_sources.OSM.opts(width=500, height=500)
plot_bbox(AOI) * basemap

Let's add a date range. The flooding happened primarily between April 30th & May 2nd; we'll set a longer time window covering the months of April & May.

In [8]:
start_date, stop_date = '2024-04-01', '2024-05-31'
DATE_RANGE = f'{start_date}/{stop_date}'

Finally, let's create a dictionary `search_params` that stores the AOI and the range of dates.

In [9]:
search_params = dict(bbox=AOI, datetime=DATE_RANGE)
print(search_params)

{'bbox': (-95.34, 30.565, -94.84, 30.815), 'datetime': '2024-04-01/2024-05-31'}


---

## Executing a search with the PySTAC API

Three other pieces of information are required to initiate a search for data: the *Endpoint* (a URL), the *Provider* (a string representing a path extending the Endpoint), & the *Collection identifiers* (a list of strings referring to specific catalogs). We generally need to experiment with NASA's [Earthdata Search website](https://search.earthdata.nasa.gov) to determine these values correctly for the specific data products we want to retrieve.

For the search for DSWx data products that we want to execute, these parameters are as defined in the next code cell.

In [10]:
ENDPOINT = 'https://cmr.earthdata.nasa.gov/stac' # base URL for the STAC to search
PROVIDER = 'POCLOUD'
# Update the dictionary opts with list of collections to search
COLLECTIONS = ["OPERA_L3_DSWX-HLS_V1_1.0"]
search_params.update(collections=COLLECTIONS)
print(search_params)

{'bbox': (-95.34, 30.565, -94.84, 30.815), 'datetime': '2024-04-01/2024-05-31', 'collections': ['OPERA_L3_DSWX-HLS_V1_1.0']}


Having defined the search parameters in the Python dictionary `search_params`, we can instantiate a `Client` and search the spatio-temporal asset catalog using the `Client.search` method.

In [11]:
catalog = Client.open(f'{ENDPOINT}/{PROVIDER}/')
search_results = catalog.search(**search_params)
print(f'{type(search_results)=}\n',search_results)

type(search_results)=<class 'pystac_client.item_search.ItemSearch'>
 <pystac_client.item_search.ItemSearch object at 0x7f7438937fe0>


The object `search_results` returned by calling the `search` method is of type `ItemSearch`. To retrieve the results, we invoke the `items` method and cast the result as a Python `list` we'll bind to the identifier `granules`.

In [12]:
%%time
granules = list(search_results.items())
print(f"Number of granules found with tiles overlapping given AOI: {len(granules)}")

Number of granules found with tiles overlapping given AOI: 95
CPU times: user 160 ms, sys: 155 μs, total: 160 ms
Wall time: 12.8 s


Let's examine the contents of the list `granules`.

In [13]:
granule = granules[0]
print(f'{type(granule)=}')

type(granule)=<class 'pystac.item.Item'>


In [14]:
granule

The object `granule` has a rich output representation in this Jupyter notebook. We can expand the attributes in the output cell by clicking the triangles.

![](../assets/granule_output_repr.png)

The term *granule* refers to a collection of data files (raster data in this case) all associated with raw data acquired by a particular satellite at a fixed timestamp over a particular geographic tile. There are a number of interesting attributes associated with this granule.
+ `properties['datetime']`: a string representing the time of data acquisition for the raster data files in this granule;
+ `properties['eo:cloud_cover']`: the percentage of pixels obscured by cloud and cloud shadow in this granule's raster data files; and
+ `assets`: a Python `dict` whose values summarize the bands or levels of raster data associated with this granule.
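
As a rough sketch of how these properties might be used, the snippet below parses the timestamp string and applies a cloud-cover threshold to a plain dictionary that mimics `granule.properties`. The helper names and the threshold of 50 are assumptions chosen for illustration; a missing `eo:cloud_cover` entry is treated as "not known to be clear".

```python
from datetime import datetime

def parse_stac_datetime(text):
    """Parse a timestamp such as '2024-04-02T16:43:40.705000Z' into an aware datetime."""
    # Replace the trailing 'Z' with an explicit UTC offset, which
    # datetime.fromisoformat accepts on older Python versions too
    return datetime.fromisoformat(text.replace('Z', '+00:00'))

def is_clear_enough(properties, max_cloud_cover=50):
    """True when 'eo:cloud_cover' is present and at or below the threshold."""
    cloud = properties.get('eo:cloud_cover')
    return cloud is not None and cloud <= max_cloud_cover

# A stand-in for granule.properties, using the two fields discussed above
example_properties = {'datetime': '2024-04-02T16:43:40.705000Z', 'eo:cloud_cover': 79}
acquired = parse_stac_datetime(example_properties['datetime'])
print(acquired.date(), is_clear_enough(example_properties))
```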

In [15]:
print(f"{type(granule.properties)=}\n")
print(f"{granule.properties['datetime']=}\n")
print(f"{granule.properties['eo:cloud_cover']=}\n")
print(f"{type(granule.assets)=}\n")
print(f"{granule.assets.keys()=}\n")

type(granule.properties)=<class 'dict'>

granule.properties['datetime']='2024-04-02T16:43:40.705000Z'

granule.properties['eo:cloud_cover']=79

type(granule.assets)=<class 'dict'>

granule.assets.keys()=dict_keys(['browse', 'thumbnail_0', 'thumbnail_1', '0_B01_WTR', '0_B02_BWTR', '0_B03_CONF', '0_B04_DIAG', '0_B05_WTR-1', '0_B06_WTR-2', '0_B07_LAND', '0_B08_SHAD', '0_B09_CLOUD', '0_B10_DEM', 'metadata'])



Each value in `granule.assets` is an instance of the `Asset` class with an `href` attribute. The `href` attribute tells us where to locate the file (a GeoTIFF for each raster band) associated with that asset of this granule.

In [16]:
for a in granule.assets:
    print(f"{a=}\t{type(granule.assets[a])=}")
    print(f"{granule.assets[a].href=}\n\n")

a='browse'	type(granule.assets[a])=<class 'pystac.asset.Asset'>
granule.assets[a].href='https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/OPERA_L3_DSWX-HLS_PROVISIONAL_V1/OPERA_L3_DSWx-HLS_T15RUQ_20240402T164340Z_20240404T080942Z_L8_30_v1.0_BROWSE.png'


a='thumbnail_0'	type(granule.assets[a])=<class 'pystac.asset.Asset'>
granule.assets[a].href='https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/OPERA_L3_DSWX-HLS_PROVISIONAL_V1/OPERA_L3_DSWx-HLS_T15RUQ_20240402T164340Z_20240404T080942Z_L8_30_v1.0_BROWSE.png'


a='thumbnail_1'	type(granule.assets[a])=<class 'pystac.asset.Asset'>
granule.assets[a].href='s3://podaac-ops-cumulus-public/OPERA_L3_DSWX-HLS_PROVISIONAL_V1/OPERA_L3_DSWx-HLS_T15RUQ_20240402T164340Z_20240404T080942Z_L8_30_v1.0_BROWSE.png'


a='0_B01_WTR'	type(granule.assets[a])=<class 'pystac.asset.Asset'>
granule.assets[a].href='https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/OPERA_L3_DSWX-HLS_PROVISIONAL_V1/OPERA_L3_DSWx

---

## Summarizing search results in a DataFrame

The details of the search results are complicated to parse in this manner. Let's extract a few particular fields from the granules obtained into a Pandas `DataFrame` using a convenient Python function. We'll define the function here and re-use it in later notebooks.

In [17]:
def search_to_dataframe(search, filter_assets=True):
    '''Constructs Pandas DataFrame from PySTAC Earthdata search results.
    DataFrame columns are determined from search item properties and assets.
    'asset': string identifying an Asset type associated with a granule
    'href': data URL for file associated with the Asset in a given row.
    If filter_assets is True (default), only assets whose labels start with
    '0_' (the raster bands) are kept; other assets (e.g., 'metadata',
    'thumbnail_1', etc.) are ignored.'''
    granules = list(search.items())
    props = list({prop for g in granules for prop in g.properties.keys()})
    rows = (([g.properties.get(k, None) for k in props] + [a, g.assets[a].href])
                for g in granules for a in g.assets )
    df = pd.concat(map(lambda x: pd.DataFrame(x, index=props+['asset','href']).T, rows),
                   axis=0, ignore_index=True)
    if filter_assets:
        df = df.loc[df.asset.str.startswith('0_')]
    return df

This first invocation preserves unwanted rows (that correspond to thumbnail images or metadata).

In [18]:
df = search_to_dataframe(search_results, filter_assets=False)
df.head()

Unnamed: 0,start_datetime,datetime,end_datetime,eo:cloud_cover,asset,href
0,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,browse,https://archive.podaac.earthdata.nasa.gov/poda...
1,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,thumbnail_0,https://archive.podaac.earthdata.nasa.gov/poda...
2,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,thumbnail_1,s3://podaac-ops-cumulus-public/OPERA_L3_DSWX-H...
3,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...
4,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,0_B02_BWTR,https://archive.podaac.earthdata.nasa.gov/poda...


The default parameter `filter_assets=True` ensures that we keep only rows corresponding to useful raster data.

In [19]:
df = search_to_dataframe(search_results)
df.head()

Unnamed: 0,start_datetime,datetime,end_datetime,eo:cloud_cover,asset,href
3,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...
4,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,0_B02_BWTR,https://archive.podaac.earthdata.nasa.gov/poda...
5,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,0_B03_CONF,https://archive.podaac.earthdata.nasa.gov/poda...
6,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,0_B04_DIAG,https://archive.podaac.earthdata.nasa.gov/poda...
7,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,2024-04-02T16:43:40.705Z,79,0_B05_WTR-1,https://archive.podaac.earthdata.nasa.gov/poda...


The `DataFrame.info` method allows us to examine the schema associated with this `DataFrame`.

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 950 entries, 3 to 1328
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   start_datetime  950 non-null    object
 1   datetime        950 non-null    object
 2   end_datetime    950 non-null    object
 3   eo:cloud_cover  600 non-null    object
 4   asset           950 non-null    object
 5   href            950 non-null    object
dtypes: object(6)
memory usage: 52.0+ KB


Let's clean up the DataFrame of search results. First, for these results, only one `datetime` column is necessary (the three timestamps coincide in the rows shown above); we can drop the others.

In [21]:
df = df.drop(['start_datetime', 'end_datetime'], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 950 entries, 3 to 1328
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   datetime        950 non-null    object
 1   eo:cloud_cover  600 non-null    object
 2   asset           950 non-null    object
 3   href            950 non-null    object
dtypes: object(4)
memory usage: 37.1+ KB


Remember, the filename associated with OPERA products includes an identifier for an MGRS geographic tile. We can extract that filename and that identifier by applying Python string manipulations to the `href` column. Let's do that and store the result in a new `tile_id` column.

In [22]:
df['tile_id'] = df.href.map(lambda x: x.split('/')[-1].split('_')[3])
display(df.tail())
df.info()

Unnamed: 0,datetime,eo:cloud_cover,asset,href,tile_id
1324,2024-05-30T17:15:25.056Z,,0_B06_WTR-2,https://archive.podaac.earthdata.nasa.gov/poda...,T15RTP
1325,2024-05-30T17:15:25.056Z,,0_B07_LAND,https://archive.podaac.earthdata.nasa.gov/poda...,T15RTP
1326,2024-05-30T17:15:25.056Z,,0_B08_SHAD,https://archive.podaac.earthdata.nasa.gov/poda...,T15RTP
1327,2024-05-30T17:15:25.056Z,,0_B09_CLOUD,https://archive.podaac.earthdata.nasa.gov/poda...,T15RTP
1328,2024-05-30T17:15:25.056Z,,0_B10_DEM,https://archive.podaac.earthdata.nasa.gov/poda...,T15RTP


<class 'pandas.core.frame.DataFrame'>
Index: 950 entries, 3 to 1328
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   datetime        950 non-null    object
 1   eo:cloud_cover  600 non-null    object
 2   asset           950 non-null    object
 3   href            950 non-null    object
 4   tile_id         950 non-null    object
dtypes: object(5)
memory usage: 44.5+ KB
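
To make the string manipulation concrete, here is a small sketch that splits one of the URLs from the search results into named pieces. Only the tile identifier (field 3) is used in this notebook; reading field 4 as the acquisition time and field 6 as the sensor is an assumption inferred from the example filenames, and the function name and dictionary keys are hypothetical.

```python
from datetime import datetime, timezone

def parse_dswx_filename(href):
    """Split a DSWx-HLS file URL into a few named fields (illustrative only)."""
    filename = href.split('/')[-1]   # drop the directory part of the URL
    fields = filename.split('_')     # underscore-separated fields
    # Assumption: field 4 is the acquisition timestamp, e.g. '20240402T164340Z'
    acquired = datetime.strptime(fields[4], '%Y%m%dT%H%M%SZ').replace(tzinfo=timezone.utc)
    return {
        'filename': filename,
        'tile_id': fields[3],        # MGRS tile, e.g. 'T15RUQ'
        'acquired': acquired,
        'sensor': fields[6],         # assumption: e.g. 'L8' or 'S2B'
    }

href = ('https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/'
        'OPERA_L3_DSWX-HLS_PROVISIONAL_V1/'
        'OPERA_L3_DSWx-HLS_T15RUQ_20240402T164340Z_20240404T080942Z_L8_30_v1.0_B01_WTR.tif')
info = parse_dswx_filename(href)
print(info['tile_id'], info['sensor'], info['acquired'].isoformat())
```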


Finally, let's fix the schema of the `DataFrame` `df` by casting the columns as sensible data types.

In [23]:
df['datetime'] = pd.DatetimeIndex(df['datetime'])
df['eo:cloud_cover'] = df['eo:cloud_cover'].astype(np.float16)
str_cols = ['asset', 'href', 'tile_id']
for col in str_cols:
    df[col] = df[col].astype(pd.StringDtype())

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 950 entries, 3 to 1328
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   datetime        950 non-null    datetime64[ns, UTC]
 1   eo:cloud_cover  600 non-null    float16            
 2   asset           950 non-null    string             
 3   href            950 non-null    string             
 4   tile_id         950 non-null    string             
dtypes: datetime64[ns, UTC](1), float16(1), string(3)
memory usage: 39.0 KB


Bundling the STAC search results into a Pandas `DataFrame` sensibly is a bit tricky. But, as we shall see, having the results in a `DataFrame` makes later manipulations a lot easier.
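
For comparison, here is a sketch of a different way to do the bundling: build one dictionary per (granule, asset) pair and hand the whole list to the `DataFrame` constructor. This is not the function defined above. The name `items_to_records` is hypothetical, the `SimpleNamespace` objects are stand-ins for real PySTAC items so the example runs offline, and the `example.com` URLs are placeholders. Unlike `search_to_dataframe`, this approach lets Pandas infer column dtypes instead of leaving every column as `object`.

```python
from types import SimpleNamespace
import pandas as pd

def items_to_records(items):
    """One dict per (item, asset) pair: the item's properties plus 'asset' and 'href'."""
    return [{**item.properties, 'asset': name, 'href': asset.href}
            for item in items
            for name, asset in item.assets.items()]

# Stand-in objects exposing the same .properties / .assets / .href attributes
mock_items = [
    SimpleNamespace(
        properties={'datetime': '2024-04-02T16:43:40.705Z', 'eo:cloud_cover': 79},
        assets={'browse': SimpleNamespace(href='https://example.com/tile_a_BROWSE.png'),
                '0_B01_WTR': SimpleNamespace(href='https://example.com/tile_a_B01_WTR.tif')}),
    SimpleNamespace(
        properties={'datetime': '2024-05-30T17:15:25.056Z'},  # no cloud-cover entry
        assets={'0_B01_WTR': SimpleNamespace(href='https://example.com/tile_b_B01_WTR.tif')}),
]

# Keys missing from a record become NaN in the corresponding column
records_df = pd.DataFrame(items_to_records(mock_items))
# Keep only the raster bands, as filter_assets=True does above
raster_df = records_df.loc[records_df['asset'].str.startswith('0_')]
print(raster_df)
```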

---

## Exploring the search results

For this search query, there are 95 DSWx-HLS granules, each of which comprises raster data for ten bands or levels. We can see this by applying the Pandas `Series.value_counts` method to the `asset` column.

In [25]:
df.asset.value_counts()

asset
0_B01_WTR      95
0_B02_BWTR     95
0_B03_CONF     95
0_B04_DIAG     95
0_B05_WTR-1    95
0_B06_WTR-2    95
0_B07_LAND     95
0_B08_SHAD     95
0_B09_CLOUD    95
0_B10_DEM      95
Name: count, dtype: Int64

Let's select the rows that correspond to the band `B01_WTR` of the DSWx data product. The Pandas `Series.str` accessor makes this operation simple. We'll call the filtered `DataFrame` `b01_wtr`.

In [26]:
b01_wtr = df.loc[df.asset.str.contains('B01_WTR')]
b01_wtr.info()
b01_wtr.asset.value_counts()

<class 'pandas.core.frame.DataFrame'>
Index: 95 entries, 3 to 1319
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   datetime        95 non-null     datetime64[ns, UTC]
 1   eo:cloud_cover  60 non-null     float16            
 2   asset           95 non-null     string             
 3   href            95 non-null     string             
 4   tile_id         95 non-null     string             
dtypes: datetime64[ns, UTC](1), float16(1), string(3)
memory usage: 3.9 KB


asset
0_B01_WTR    95
Name: count, dtype: Int64

In [27]:
b01_wtr = b01_wtr.sort_values(by='datetime')
b01_wtr

Unnamed: 0,datetime,eo:cloud_cover,asset,href,tile_id
3,2024-04-02 16:43:40.705000+00:00,79.0,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...,T15RUQ
17,2024-04-02 16:44:04.596000+00:00,99.0,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...,T15RUP
31,2024-04-02 17:05:06.709000+00:00,73.0,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...,T15RUQ
45,2024-04-02 17:05:11.720000+00:00,68.0,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...,T15RTQ
59,2024-04-02 17:05:21.087000+00:00,59.0,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...,T15RUP
...,...,...,...,...,...
1263,2024-05-27 17:05:31.338000+00:00,,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...,T15RTP
1277,2024-05-30 17:15:06.786000+00:00,,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...,T15RUQ
1291,2024-05-30 17:15:10.768000+00:00,,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...,T15RTQ
1305,2024-05-30 17:15:20.500000+00:00,,0_B01_WTR,https://archive.podaac.earthdata.nasa.gov/poda...,T15RUP


In [28]:
b01_wtr.tile_id.value_counts()

tile_id
T15RUQ    26
T15RUP    26
T15RTP    22
T15RTQ    21
Name: count, dtype: Int64

In [29]:
b01_wtr['eo:cloud_cover'].agg(['min','mean','max'])

min       0.0000
mean     61.0625
max     100.0000
Name: eo:cloud_cover, dtype: float16

In [30]:
b01_wtr.set_index('datetime').resample('1d').href.count()

datetime
2024-04-02 00:00:00+00:00    6
2024-04-03 00:00:00+00:00    0
2024-04-04 00:00:00+00:00    0
2024-04-05 00:00:00+00:00    4
2024-04-06 00:00:00+00:00    0
2024-04-07 00:00:00+00:00    4
2024-04-08 00:00:00+00:00    0
2024-04-09 00:00:00+00:00    1
2024-04-10 00:00:00+00:00    4
2024-04-11 00:00:00+00:00    0
2024-04-12 00:00:00+00:00    4
2024-04-13 00:00:00+00:00    0
2024-04-14 00:00:00+00:00    0
2024-04-15 00:00:00+00:00    0
2024-04-16 00:00:00+00:00    0
2024-04-17 00:00:00+00:00    0
2024-04-18 00:00:00+00:00    2
2024-04-19 00:00:00+00:00    0
2024-04-20 00:00:00+00:00    2
2024-04-21 00:00:00+00:00    0
2024-04-22 00:00:00+00:00    4
2024-04-23 00:00:00+00:00    0
2024-04-24 00:00:00+00:00    0
2024-04-25 00:00:00+00:00    2
2024-04-26 00:00:00+00:00    0
2024-04-27 00:00:00+00:00    4
2024-04-28 00:00:00+00:00    0
2024-04-29 00:00:00+00:00    0
2024-04-30 00:00:00+00:00    4
2024-05-01 00:00:00+00:00    0
2024-05-02 00:00:00+00:00    2
2024-05-03 00:00:00+00:00    0

In [31]:
b01_wtr.info()

<class 'pandas.core.frame.DataFrame'>
Index: 95 entries, 3 to 1319
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   datetime        95 non-null     datetime64[ns, UTC]
 1   eo:cloud_cover  60 non-null     float16            
 2   asset           95 non-null     string             
 3   href            95 non-null     string             
 4   tile_id         95 non-null     string             
dtypes: datetime64[ns, UTC](1), float16(1), string(3)
memory usage: 3.9 KB


In [None]:
df.datetime

The timestamps in the `datetime` column of `df` are not unique: each granule contributes one row per band, and every band derived from a single satellite acquisition over a tile shares that acquisition's timestamp.

In [None]:
df.datetime.nunique() / len(df) # Notice the timestamps are not all unique, i.e., some are repeated

The values in the `href` column (i.e., the URIs or URLs pointed to in a given row of `df`) are unique, telling us that the rows refer to distinct data products or bands derived from each data acquisition even if the timestamps match.

In [None]:
df.href.nunique() / len(df) # Make sure all the hrefs are unique

Let's get a sense of how many granules are available for each day. Note, we don't yet know how many of these tiles contain cloud cover obscuring features of interest.

The next few lines do some Pandas manipulations of the DataFrame `b01_wtr` (which has one row per granule) to yield a line plot showing what dates are associated with the most granules.

In [None]:
granules_by_day = b01_wtr.set_index('datetime')[['href']].resample('1d')  # Grouping by day, i.e., "resampling"

In [None]:
granule_counts = granules_by_day.count() # Aggregating counts

In [None]:
# Ignore the days with no associated granules
granule_counts = granule_counts[granule_counts.href > 0]

In [None]:
# Relabel the index & column of the DataFrame
granule_counts.index.name = 'Date'
granule_counts = granule_counts.rename({'href':'Granule count'}, axis=1)

In [None]:
count_title = '# of DSWx-HLS granules available / day'
granule_counts.hvplot.line(title=count_title, grid=True, frame_height=300, frame_width=600)

The floods primarily occurred between April 30th and May 2nd. Unfortunately, there are few granules associated with those particular days. We can, in principle, use the URIs stored in this DataFrame to set up analysis of the data associated with this event; we'll do so in other examples with better data available.
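
As a sketch of how rows for a specific window could be pulled out of such a DataFrame, the snippet below applies a boolean mask to a timezone-aware `datetime` column. The toy DataFrame, its timestamps, and its `example.com` URLs are made up for illustration; the window corresponds to the April 30th to May 2nd period mentioned when the date range was defined.

```python
import pandas as pd

# Toy stand-in for a DataFrame of granules (values are invented)
toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2024-04-27T17:05:00Z', '2024-04-30T17:15:00Z',
                                '2024-05-02T16:43:00Z', '2024-05-05T17:05:00Z']),
    'href': ['https://example.com/a.tif', 'https://example.com/b.tif',
             'https://example.com/c.tif', 'https://example.com/d.tif'],
})

# The column is timezone-aware (UTC), so the window bounds must be too
window_start = pd.Timestamp('2024-04-30', tz='UTC')
window_stop = pd.Timestamp('2024-05-03', tz='UTC')   # exclusive upper bound

in_window = toy.loc[(toy['datetime'] >= window_start) & (toy['datetime'] < window_stop)]
print(in_window)
```

Using an exclusive upper bound one day past the end of the window keeps every acquisition made on May 2nd, whatever its time of day.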

---

We could go further to download data from the URIs provided but we won't with this example. This notebook primarily provides an example to show how to use the PySTAC API.

In subsequent notebooks, we'll use this general workflow:

1. Set up a search query by identifying a particular AOI and range of dates.
2. Identify a suitable asset catalog and execute the search using `pystac_client.Client`.
3. Convert the search results into a Pandas DataFrame containing the principal fields of interest.
4. Use the resulting DataFrame to access relevant remote data for analysis and/or visualization.

In [32]:
b01_wtr.iloc[0].href

'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/OPERA_L3_DSWX-HLS_PROVISIONAL_V1/OPERA_L3_DSWx-HLS_T15RUQ_20240402T164340Z_20240404T080942Z_L8_30_v1.0_B01_WTR.tif'

In [44]:
smaller = b01_wtr.loc[b01_wtr.tile_id=='T15RTQ']

In [53]:
s = []
for i, row in smaller.iterrows():
    print(i)
    da = rio.open_rasterio(row.href)
    s_df = pd.DataFrame(pd.Series(da.values.flatten()).value_counts().sort_index()).T
    s_df.index = [i]
    s.append(s_df)

pd.concat(s)
#URL = 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/OPERA_L3_DSWX-HLS_PROVISIONAL_V1/OPERA_L3_DSWx-HLS_T15RTQ_20240405T165849Z_20240418T141046Z_S2B_30_v1.0_B01_WTR.tif'
#smaller.iloc[0].href
#print(URL)

45
101
157
227
283
395
451
479
535
591
647
703
759
815
857
899
983
1039
1095
1151
1291


Unnamed: 0,0,1,2,252,253,255
45,469784.0,56444.0,4268.0,1321.0,1141763,11722020
101,12785308.0,398432.0,77344.0,30918.0,103594,4
157,373231.0,770.0,2533.0,158.0,1311092,11707816
227,5734.0,171.0,663.0,12545.0,13376479,8
283,1450164.0,189844.0,8723.0,1511.0,154176,11591182
395,1491142.0,270843.0,8208.0,2235.0,33262,11589910
451,,,,,3850321,9545279
479,45874.0,1173.0,284.0,875.0,1807641,11539753
535,12179940.0,103957.0,148538.0,159625.0,803532,8
591,356138.0,11310.0,26089.0,,9836963,3165100


In [54]:
summary = _

In [66]:
for k in [0,1,2]:
    print(k, summary.loc[:,k].max(), summary.loc[:,k].idxmax())
#summary.iloc[:,1:3].sum(axis=1)


0 12785308.0 101
1 398432.0 101
2 225317.0 899


In [79]:
URL = smaller.loc[101].href
print(URL)
#da1 = rio.open_rasterio(URL).squeeze()
#da1.hvplot.image()

https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/OPERA_L3_DSWX-HLS_PROVISIONAL_V1/OPERA_L3_DSWx-HLS_T15RTQ_20240405T165849Z_20240418T141046Z_S2B_30_v1.0_B01_WTR.tif


In [36]:
def urls_to_stack(granule_dataframe):
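    '''Processes DataFrame of PySTAC search results (with OPERA tile URLs) &
    returns stacked Xarray DataArray (dimensions time, lat, & lon). The lat &
    lon coordinates are in each raster's native CRS; the time coordinate
    takes its values from the DataFrame index.'''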
    
    stack = []
    for i, row in granule_dataframe.iterrows():
        with rasterio.open(row.href) as ds:
            # extract CRS string
            crs = str(ds.crs).split(':')[-1]
            # extract the image spatial extent (xmin, ymin, xmax, ymax)
            xmin, ymin, xmax, ymax = ds.bounds
            # the x and y resolution of the image is available in image metadata
            x_res = np.abs(ds.transform[0])
            y_res = np.abs(ds.transform[4])
            # read the data 
            img = ds.read()
            # Ensure img has three dimensions (bands, y, x)
            if img.ndim == 2:
                img = np.expand_dims(img, axis=0) 
            lon = np.arange(xmin, xmax, x_res)
            lat = np.arange(ymax, ymin, -y_res)
            bands = np.arange(img.shape[0])
            da = xr.DataArray(
                                data=img,
                                dims=["band", "lat", "lon"],
                                coords=dict(
                                            lon=(["lon"], lon),
                                            lat=(["lat"], lat),
                                            time=i,
                                            band=bands
                                            ),
                                attrs=dict(
                                            description="OPERA DSWx B01",
                                            units=None,
                                          ),
                             )
            da.rio.write_crs(crs, inplace=True)   
            stack.append(da)
    return xr.concat(stack, dim='time').squeeze()

In [42]:
da2 = urls_to_stack(smaller.iloc[1:2])

In [43]:
pd.Series(da2.values.flatten()).value_counts()

0      12785308
1        398432
253      103594
2         77344
252       30918
255           4
Name: count, dtype: int64