# STAC operations in the terminal

One of the great ideas of cloud-native geospatial is to make data easily discoverable in a consistent, standardized way. A key piece of this is ensuring that the data itself does not need to be downloaded and accessed to determine if it is relevant or usable, hence STAC. 

We can work with STAC directly in the terminal with the right tools. The purpose of this notebook is to demonstrate the processes and tools that make doing so possible.

Tools used include:
* pystac-client (Python)
* stacterm (Python)
* jq
* GDAL

We'll start by querying STAC APIs for items and interacting with the search results. Then we'll work with data from an item to find the elevation of the summit of Mt. Hood in Oregon, and see how a cloud-optimized data format like COG can make accessing data more efficient.

## Querying STAC APIs from the command line

We can use the `stac-client` command to interact with STAC APIs, and `jq` to extract the data we want from the JSON responses. We can also use the `stacterm` command to visualize the distribution of items returned by a search in some interesting and useful ways.

In [None]:
# Set our STAC API URL
STAC_API=https://earth-search.aws.element84.com/v1            # Earth Search
#STAC_API=https://planetarycomputer.microsoft.com/api/stac/v1  # Planetary Computer
#STAC_API=https://landsatlook.usgs.gov/stac-server             # USGS Landsat

# Set our AOI geojson file
AOI="../aois/mthood.geojson"

### Getting collections

If we know a thing or two about STAC APIs, we can interact with them using something as simple (or complicated, depending on the frame of reference) as `curl`. Something like listing collections isn't particularly difficult, so let's see an example.

In [None]:
curl -s $STAC_API/collections | jq '.collections[].id'

We can do the same thing with the `collections` subcommand of the `stac-client` command (from the Python package `pystac-client`). The advantage of `stac-client` over `curl` isn't readily apparent when listing collections; it becomes clear when we get to item searches.

In [None]:
# See what collections we have in the catalog
stac-client collections $STAC_API | jq '.[].id'

# Try the above command without the `| jq '.[].id'` to see the whole output

In [None]:
# Pick a collection name from the output above
COLLECTION="sentinel-2-l1c"

In [None]:
# Get the metadata for the specified collection
stac-client collections $STAC_API | jq '.[] | select(.id == "'$COLLECTION'")'

### Searching for items

Here we can start to see why we want to use a purpose-built tool like `stac-client` over something like `curl`, as the former greatly simplifies search queries.

In [None]:
# See how many items are in the specified collection
# (note: this does not work with all STAC API implementations)
stac-client search $STAC_API --collection $COLLECTION --matched

In [None]:
# The same search, but limited to an AOI
# (note: this does not work with all STAC API implementations)
stac-client search $STAC_API --collection $COLLECTION --intersects $AOI --matched

In [None]:
# We can continue to refine this search with additional
# parameters like a date range and a cloud cover threshold
stac-client search $STAC_API --collection $COLLECTION --intersects $AOI \
    --query 'eo:cloud_cover<20' \
    --datetime '2019-01-01/2019-05-01' \
    --matched

#### Follow-on questions

* What might one of these complex queries look like with curl?
* What happens if you choose a different AOI or target collection?
* Dig into the help for the `stac-client search` command. What other interesting parameters are supported?

### Using `stacterm` to visualize search results

Sometimes aggregating and visualizing STAC search results can help answer certain questions or refine search parameters. The `stacterm` tool provides a mechanism to do this from the shell.

In [None]:
# We can inspect all the scenes from our AOI search more closely,
# so let's save the search results to a file for reuse.
stac-client search $STAC_API --intersects $AOI --datetime '2019-01-01/2020-01-01' > items.json

In [None]:
# We can see what platforms collected in our AOI by date
<items.json stacterm cal --label-field platform

In [None]:
# Maybe we just want to see Sentinel 2 items?
# We can pre-filter with jq!
<items.json jq '.features |= map(select(.collection == "sentinel-2-l1c"))' | stacterm cal --label-field platform

In [None]:
# We can even make a histogram of Sentinel 2 scenes by percent cloud cover
<items.json jq '.features |= map(select(.collection == "sentinel-2-l1c"))' | stacterm hist eo:cloud_cover

# Or plot cloud cover over time
<items.json jq '.features |= map(select(.collection == "sentinel-2-l1c"))' | stacterm plot datetime eo:cloud_cover
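
The jq filters above use the update-assignment operator `|=` rather than a plain pipe. A minimal sketch of why, using a tiny made-up ItemCollection (the item ids and the second collection name are placeholders, not search results): `|=` rewrites `.features` in place and keeps the surrounding FeatureCollection, which is the shape these pipelines pass on to `stacterm`.

```shell
# A tiny, made-up ItemCollection with two items from different collections
ITEMS='{"type":"FeatureCollection","features":[
  {"id":"a","collection":"sentinel-2-l1c"},
  {"id":"b","collection":"landsat-c2-l2"}]}'

# |= rewrites .features in place, so the FeatureCollection wrapper survives
FILTERED="$(jq -c '.features |= map(select(.collection == "sentinel-2-l1c"))' <<<"$ITEMS")"
echo "$FILTERED"

# Contrast: a plain pipe emits only the bare array and drops the wrapper
BARE="$(jq -c '.features | map(select(.collection == "sentinel-2-l1c"))' <<<"$ITEMS")"
echo "$BARE"
```

The first command prints a complete FeatureCollection containing only item `a`; the second prints just a JSON array.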

#### Follow-on questions

* How do the results change for different time frames or AOIs?
* What other filters can subset the result data in interesting ways?
* Try running `stacterm -h` and digging into the options for each subcommand. What other interesting visualizations can you come up with?

## Finding the elevation of Mt. Hood

Cloud-native geospatial is not just about metadata; data formats also play a crucial role. One such cloud-native geospatial data format is the "cloud-optimized GeoTIFF" (COG).

COGs are structured in a way that allows a user to access part of the file when only a subset (by space or resolution) is required. This means that users can save time, money, and compute and storage resources by only accessing the relevant part of a larger file.

We'll leverage the COG format to do a short analysis to see if we can find the elevation of Mt. Hood in Oregon using a 30-meter DEM dataset.

In [None]:
# Set our STAC API URL and collection name
STAC_API="https://earth-search.aws.element84.com/v1"
COLLECTION="cop-dem-glo-30"

# Set our AOI geojson file
AOI="../aois/mthood.geojson"

# Disable AWS client authentication
export AWS_NO_SIGN_REQUEST="YES"

In [None]:
# Find the DEM tile that intersects Mt. Hood AOI
ITEM="$(stac-client search $STAC_API --intersects $AOI --collection $COLLECTION)"
<<<$ITEM jq .

In [None]:
# Extract the geotransform values from the item's projection metadata.
# jq joins the values onto one line so awk can read them as fields $1..$6,
# regardless of how the shell treats newlines in the here-string.
TRANSFORM="$(<<<"$ITEM" jq -r '.features[].properties."proj:transform" | map(tostring) | join(" ")')"
eval $(
    <<<"$TRANSFORM" awk '
        {print "PX_WIDTH="tolower($1)};
        {print "ROW_ROT="tolower($2)};
        {print "UP_LEFT_LONG="tolower($3)};
        {print "COL_ROT="tolower($4)};
        {print "PX_HEIGHT="tolower($5)};
        {print "UP_LEFT_LAT="tolower($6)};
    '
)

# We can print out the upper left corner coordinates to see what they were set to
echo $UP_LEFT_LONG, $UP_LEFT_LAT

In [None]:
# The summit of Mt. Hood is at -121.695833, 45.373611 (https://en.wikipedia.org/wiki/Mount_Hood).
SUMMIT_LONG=-121.695833
SUMMIT_LAT=45.373611

# Calculate the pixel coords of the summit based on the item's geotransform
# (we use python for the arithmetic because bash doesn't support floats)
# (the cut command effectively floors the result by truncating to an int)
SUMMIT_COL=$(python -c "print(($SUMMIT_LONG - $UP_LEFT_LONG) / $PX_WIDTH)" | cut -d '.' -f 1)
SUMMIT_ROW=$(python -c "print(($SUMMIT_LAT - $UP_LEFT_LAT) / $PX_HEIGHT)" | cut -d '.' -f 1)

# Again, let's see what values we got
echo $SUMMIT_COL, $SUMMIT_ROW
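
As a worked check of the arithmetic above, the sketch below repeats the calculation using only awk. The geotransform values are illustrative assumptions for a 1-arc-second tile whose upper-left corner sits near -122, 46; they are not read from the item, so substitute the values extracted earlier. It also assumes both rotation terms are zero, as the calculation above does. The `pixel_index` helper is a name made up for this example.

```shell
# Hypothetical geotransform values (assumed for illustration, not read from the item)
UP_LEFT_LONG=-122.0001389
UP_LEFT_LAT=46.0001389
PX_WIDTH=0.0002777777777778
PX_HEIGHT=-0.0002777777777778

SUMMIT_LONG=-121.695833
SUMMIT_LAT=45.373611

# index = floor((coordinate - origin) / pixel_size)
# awk's int() truncates toward zero, so step down by one for negative fractions.
pixel_index() {
    awk -v v="$1" -v o="$2" -v s="$3" 'BEGIN {
        q = (v - o) / s
        f = int(q)
        if (f > q) f--
        print f
    }'
}

SUMMIT_COL=$(pixel_index "$SUMMIT_LONG" "$UP_LEFT_LONG" "$PX_WIDTH")
SUMMIT_ROW=$(pixel_index "$SUMMIT_LAT" "$UP_LEFT_LAT" "$PX_HEIGHT")
echo "$SUMMIT_COL, $SUMMIT_ROW"
```

With these assumed inputs the summit lands at column 1095, row 2255. Because the pixel height is negative, the row index grows as latitude decreases from the upper-left corner.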

In [None]:
# Extract the href for the item's data asset, replacing the scheme for use with GDAL
HREF="$(<<<$ITEM jq -r '.features[].assets.data.href' | sed 's|^s3://|/vsis3/|')"

echo $HREF
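
Other STAC APIs may return `https://` hrefs rather than `s3://` ones. A minimal sketch of a more general rewrite follows; the `to_vsi` helper and the example hrefs are hypothetical, and only the `s3://` case comes from this notebook. The `https://` case is an assumed extension using GDAL's `/vsicurl/` prefix, and some providers also require signed URLs, which this does not handle.

```shell
# Map an asset href to a GDAL virtual file system path.
# Only the s3:// case comes from the notebook; the http(s) case is an
# assumed extension using GDAL's /vsicurl/ prefix.
to_vsi() {
    case "$1" in
        s3://*)              printf '/vsis3/%s\n' "${1#s3://}" ;;
        http://*|https://*)  printf '/vsicurl/%s\n' "$1" ;;
        *)                   printf '%s\n' "$1" ;;
    esac
}

# Placeholder hrefs, not real assets
to_vsi "s3://example-bucket/dem/tile.tif"
to_vsi "https://example.com/dem/tile.tif"
```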

In [None]:
# We can use GDAL to get the value for the summit cell directly from the remote asset
time gdallocationinfo "$HREF" $SUMMIT_COL $SUMMIT_ROW

Note the time taken and the value retrieved. Let's see how that time compares to using `gdalinfo` to fetch just the COG header information.

In [None]:
time gdalinfo $HREF

This operation was faster, yeah? That's because when we ran `gdallocationinfo` we had to make this same request to get the COG header information. That header info gave us what we needed to calculate the offset in the file for the tile containing the cell in question, which we were able to download via a second request.
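
The offset lookup described above can be sketched with integer arithmetic. A tiled TIFF numbers its tiles left to right, top to bottom, and the header stores one byte offset and one byte count per tile. The 512-pixel block size and the pixel coordinates below are assumptions for illustration; the `gdalinfo` output reports the real block size of the asset.

```shell
# Assumed values for illustration; check gdalinfo for the real block size
IMG_WIDTH=3600
BLOCK=512
COL=1095
ROW=2255

# Number of tiles across the image (ceiling division)
TILES_ACROSS=$(( (IMG_WIDTH + BLOCK - 1) / BLOCK ))

# Tile containing the pixel, and its row-major index into the
# header's tile offset and byte count arrays
TILE_COL=$(( COL / BLOCK ))
TILE_ROW=$(( ROW / BLOCK ))
TILE_INDEX=$(( TILE_ROW * TILES_ACROSS + TILE_COL ))

echo "tile ($TILE_COL, $TILE_ROW) is entry $TILE_INDEX in the offsets array"
```

Under these assumptions the pixel falls in tile (2, 4), entry 34 of the array. A reader then only needs to fetch that one tile's byte range instead of the whole file.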

Let's see what happens if we do something that requires downloading the entire COG. Asking GDAL to calculate the statistics would necessitate fetching all the data.

In [None]:
time gdalinfo -stats $HREF

A lot slower, right?

Some of the extra time might be because we're calculating stats, but most of it is simply data transfer time. We can test this theory by downloading the file and running the command again on the local version.

In [None]:
time gdalmanage copy $HREF fullsize.tif
time gdalinfo -stats fullsize.tif

Calculating the stats on the local copy was much faster. We can see most of the time spent here was in the download phase, which took even longer than we've seen before because we had to copy even more than just the full resolution data.

Why is that? One aspect of COGs that also helps with speed is the fact that COGs support overviews, or reduced resolution copies of the data. So when we copied the file we actually copied the data multiple times, just at different resolutions.

Okay, but why would we want to have lower resolution copies of the data? Isn't that just inefficient duplication?

Let's use `gdal_translate` to show how overviews are useful. We'll reduce the resolution to 1/10th of the input, and run the operation with debug logging enabled so we can see which overview GDAL reads from.

In [None]:
time gdal_translate -outsize 10% 10% -of COG --debug on $HREF reduced.tif

Notice the lines like `GTiff: Opened ....x.... overview.`? Our input data is 3600x3600, so with an output target of 10% we want to make a 360x360 image. GDAL can speed up the resampling operation by reading from the smallest overview that is still at least as large as our target size.

We see GDAL ends up selecting the 450x450 overview of the data for this reason. Doing so made the operation more efficient: it took only a fraction of the time that downloading the full-resolution data would have required.
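
The selection rule described above can be sketched as a simple loop. This is a simplified illustration, not GDAL's actual implementation: it assumes each overview level halves the previous one and that the file contains every level down to the one chosen.

```shell
FULL=3600      # full-resolution width, from the text above
TARGET=360     # 10% of the full width

# Assume each overview halves the previous level: 1800, 900, 450, 225, ...
# Keep stepping down while the next level is still at least the target size.
SIZE=$FULL
CHOSEN=$FULL
while [ $(( SIZE / 2 )) -ge "$TARGET" ]; do
    SIZE=$(( SIZE / 2 ))
    CHOSEN=$SIZE
done

echo "read from the ${CHOSEN}x${CHOSEN} level"
```

The loop stops at 450 because the next level down, 225, is smaller than the 360-pixel target and would require upsampling.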

Just for fun, run `gdalinfo` on the resized output to see how the metadata has changed.

In [None]:
gdalinfo reduced.tif

#### Follow-on questions

* The summit of Mt. Hood has an elevation of 3428.8 meters, per the [National Geodetic Survey](https://www.ngs.noaa.gov/cgi-bin/ds_mark.prl?PidBox=RC2244). What explains the difference between that value and the value we extracted from the DEM?
* How might the COG data format, with its support for random access and overviews, be useful for web mapping?