# Querying Satellite Data

## ❓ Questions
-  Where can I find open-access satellite data?
-  How do I search for satellite imagery?
-  How do I fetch remote raster datasets using Python?


## ❗ Objectives
-  Search public STAC repositories of satellite imagery using Python.
-  Inspect a search result’s metadata.
-  Download (a subset of) the assets available for a satellite scene.
-  Open satellite imagery as raster data and save it to disk.

---

# Introduction
A number of satellites take snapshots of the Earth’s surface from space. The images recorded by these remote sensors represent a very precious data source for any activity that involves monitoring changes on Earth.

Satellite imagery is typically provided in the form of geo-spatial raster data, with the measurements in each grid cell (“pixel”) being associated to accurate geographic coordinate information.

In this lesson we will explore how to access open satellite data using Python. In particular, we will consider the Sentinel-2 data collections hosted at SARA and AWS. This dataset consists of multi-band optical images acquired by the two satellites of the Sentinel-2 mission and it is continuously updated with new images.

## APIs
An API is an Application Programming Interface.   

It lets one application talk to (interface with) another application in a pre-defined way. In this lesson, that conversation happens over web addresses (URLs) and JSON data, and it is handled for us by our library, `pystac-client`.   

We will:
1. First initialise `pystac-client`.
2. Give it the information it needs.
3. Ask it to send the request.
4. Investigate the results.
5. Download some of the results.

A useful resource will be the [pystac-client documentation](https://pystac-client.readthedocs.io/en/stable/quickstart.html).


# Initial setup
First, we set up some parameters that we'll need throughout the lesson.

In [23]:
import os
from os.path import join

from google.colab import drive
google_dir = '/content/drive'
drive.mount(google_dir)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
os.listdir(join(google_dir, 'MyDrive'))

['Corporate-Powerpoint-Template.pptx',
 'Show and Tell.pptx',
 'CIC drop in session - Keegan.gdoc',
 'Quick_meeting_slides_2021-05-09.pptx',
 'asdaf',
 'pipe3_20k.mp4',
 'chl_7_bin_1_stride_4fps.mkv',
 'WAT_Waste_City_of_canning_Project_Lessons_Learnt.xlsx',
 'slides',
 'WCCC_yolov4-unseen.mp4',
 'CIC Carpentries Collaborative Google Doc.gdoc',
 'RezBaz 22.gslides',
 'S&T',
 'ASDAF_BMT_UHI_JIRA_New_Project_Questionnaire.xlsx',
 'CHL_Weekly_2018_1440p.mp4',
 'Chlor_a_Weekly_2018_1440p.mp4',
 'UChl_abs_Weekly_2018_1440p.mp4',
 'solo work',
 'Untitled form.gform',
 'CIDS Computational Resources 2024-03-22.gslides',
 'Colab Notebooks',
 'CIC_Carpentries_Python-master',
 'workshop_google',
 '202404_Intro_Rrs.gslides']

In [12]:
project_dir = join(google_dir, 'MyDrive', "workshop_google")
storage_location = join(project_dir, "workshop_data")

os.makedirs(storage_location, exist_ok=True)

In [21]:
!ls {project_dir}

data		 google_requirements.txt  notebook_pictures  notebooks_colab  workshop_data
environment.yml  LICENSE		  notebooks	     README.md


In [22]:
!pip install -r {project_dir}/google_requirements.txt

Collecting rioxarray (from -r /content/drive/MyDrive/workshop_google/google_requirements.txt (line 2))
  Downloading rioxarray-0.15.5-py3-none-any.whl (60 kB)
Collecting earthpy (from -r /content/drive/MyDrive/workshop_google/google_requirements.txt (line 4))
  Downloading earthpy-0.9.4-py3-none-any.whl (1.4 MB)
Collecting pystac-client (from -r /content/drive/MyDrive/workshop_google/google_requirements.txt (line 11))
  Downloading pystac_client-0.8.2-py3-none-any.whl (33 kB)
Collecting rasterio>=1.3 (from rioxarray->-r /content/drive/MyDrive/workshop_google/google_requirements.txt (line 2))
  Downloading rasterio-1.3.10-cp310-cp310-manylinux2014_x86_64.whl (21.5 MB)

# Initialising PySTAC

In [31]:
api_url = "https://earth-search.aws.element84.com/v1"

from pystac_client import Client
import pystac_client
print(pystac_client.__version__)

client = Client.open(api_url)

0.8.2


In the following, we ask for scenes belonging to the sentinel-2-l2a collection. This dataset includes Sentinel-2 data products pre-processed at level 2A (bottom-of-atmosphere reflectance) and saved in Cloud Optimized GeoTIFF (COG) format:

In [25]:
collection = "sentinel-2-l2a"  # Sentinel-2, Level 2A, Cloud Optimized GeoTiffs (COGs)

We also ask for scenes intersecting a geometry defined using the shapely library (in this case, a point):

In [32]:
from shapely.geometry import Point
point = Point(116.06, -31.87)

Note: at this stage, we are only dealing with metadata, so no image is going to be downloaded yet. But even metadata can be quite bulky if a large number of scenes match our search! For this reason, we limit the search result to 10 items:

In [37]:
search = client.search(
    collections=[collection],
    intersects=point,
    max_items=10,
    datetime=["2023-03-01", "2023-10-10"]
)
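To make the "web addresses and JSON data" idea concrete, the sketch below writes out roughly what such a search looks like as a STAC API request body. This is an illustrative reconstruction, not the exact payload: `pystac-client` assembles the real request internally, the expanded `datetime` interval is an assumption about how partial dates are filled in, and `limit` is the server-side page size, whereas `max_items` is a cap applied on the client side.

```python
import json

# Hypothetical reconstruction of the request body; the real payload is
# built internally by pystac-client and may differ in detail.
search_body = {
    "collections": ["sentinel-2-l2a"],
    # GeoJSON geometry: coordinates are (longitude, latitude)
    "intersects": {"type": "Point", "coordinates": [116.06, -31.87]},
    # Closed interval written as "start/end" (assumed expansion of the dates)
    "datetime": "2023-03-01T00:00:00Z/2023-10-10T23:59:59Z",
    # Page size requested from the server (assumed value)
    "limit": 10,
}

# The body would be sent as JSON text to "<api_url>/search"
payload = json.dumps(search_body, indent=2)
print(payload)
```

Note that the point is given as longitude first, then latitude, which matches the `Point(116.06, -31.87)` definition used earlier.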

We submit the query and find out how many scenes match our search criteria (please note that this output can be different as more data is added to the catalog):

In [38]:
print(search.matched())

39


Finally, we retrieve the metadata of the search results:

In [39]:
items = search.item_collection()

The variable `items` is an `ItemCollection` object. We can check its size with `len`:

In [40]:
print(len(items))

10


which is consistent with the maximum number of items that we have set in the search criteria. We can iterate over the returned items and print these to show their IDs:

In [41]:
for item in items:
    print(item)

<Item id=S2B_50HMK_20231008_0_L2A>
<Item id=S2A_50HMK_20231003_0_L2A>
<Item id=S2B_50HMK_20230928_0_L2A>
<Item id=S2A_50HMK_20230923_0_L2A>
<Item id=S2B_50HMK_20230918_0_L2A>
<Item id=S2B_50HMK_20230908_0_L2A>
<Item id=S2A_50HMK_20230903_0_L2A>
<Item id=S2B_50HMK_20230829_0_L2A>
<Item id=S2A_50HMK_20230824_0_L2A>
<Item id=S2B_50HMK_20230819_0_L2A>
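The IDs printed above appear to follow a regular pattern. The sketch below splits one ID into its parts, assuming the layout `<platform>_<MGRS tile>_<YYYYMMDD>_<sequence>_<level>`. That layout is inferred from the output and is not documented in this lesson, and `parse_item_id` is a hypothetical helper; the item's `properties` remain the authoritative source for these values.

```python
from datetime import datetime


def parse_item_id(item_id):
    """Split an item ID into its parts.

    Assumes the pattern <platform>_<MGRS tile>_<YYYYMMDD>_<sequence>_<level>,
    which is inferred from the IDs printed in this lesson.
    """
    platform, tile, day, sequence, level = item_id.split("_")
    return {
        "platform": platform,
        "mgrs_tile": tile,
        "date": datetime.strptime(day, "%Y%m%d").date(),
        "sequence": int(sequence),
        "level": level,
    }


info = parse_item_id("S2B_50HMK_20231008_0_L2A")
print(info["platform"], info["mgrs_tile"], info["date"])
```

Reading the list this way shows that every returned scene covers the same tile (`50HMK`), with acquisitions alternating between the two satellites.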


In [42]:
from pprint import pprint
item = items[0]
print(item.datetime)
pprint(item.geometry)
pprint(item.properties)

2023-10-08 02:26:37.653000+00:00
{'coordinates': [[[115.94511956423673, -31.630657005038636],
                  [115.93367210864315, -32.62014210354955],
                  [116.94321976449075, -32.62447638927652],
                  [117.10333381557261, -32.01352934399153],
                  [117.10291301200607, -31.63497332014516],
                  [115.94511956423673, -31.630657005038636]]],
 'type': 'Polygon'}
{'constellation': 'sentinel-2',
 'created': '2023-10-08T10:21:38.129Z',
 'datetime': '2023-10-08T02:26:37.653000Z',
 'earthsearch:boa_offset_applied': True,
 'earthsearch:payload_id': 'roda-sentinel2/workflow-sentinel2-to-stac/4d5a9ccb116712bba799d88d480ebfc4',
 'earthsearch:s3_path': 's3://sentinel-cogs/sentinel-s2-l2a-cogs/50/H/MK/2023/10/S2B_50HMK_20231008_0_L2A',
 'eo:cloud_cover': 9.782664,
 'grid:code': 'MGRS-50HMK',
 'instruments': ['msi'],
 'mgrs:grid_square': 'MK',
 'mgrs:latitude_band': 'H',
 'mgrs:utm_zone': 50,
 'platform': 'sentinel-2b',
 'processing:software': {'

# ✏ Exercise: Search satellite scenes using metadata filters


Search for all the available Sentinel-2 scenes in the sentinel-2-l2a collection that satisfy the following criteria:

- intersect a provided bounding box (use ±0.01 deg in lat/lon from the previously defined point);
- have been recorded between 20 March 2020 and 30 March 2020;
- have a cloud coverage smaller than 10% (hint: use the `query` input argument of `client.search`).

How many scenes are available? Save the search results in GeoJSON format.

Hint: Buffer the previous point, and use the `bbox` search parameter.
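As a starting point, the sketch below builds the bounding box and collects the search arguments in a dictionary. `bbox_around` and `search_kwargs` are hypothetical names introduced for illustration. With shapely, `point.buffer(0.01).bounds` should give an equivalent box. The string form of the cloud-cover filter is one way of writing the `query` argument; treat it as an assumption to check against the `pystac-client` documentation.

```python
def bbox_around(lon, lat, half_width=0.01):
    """Return (min_lon, min_lat, max_lon, max_lat) centred on a point."""
    return (lon - half_width, lat - half_width,
            lon + half_width, lat + half_width)


bbox = bbox_around(116.06, -31.87)

# Keyword arguments that could be passed on as client.search(**search_kwargs)
search_kwargs = {
    "collections": ["sentinel-2-l2a"],
    "bbox": bbox,
    "datetime": "2020-03-20/2020-03-30",
    "query": ["eo:cloud_cover<10"],  # assumed filter syntax
}
print(search_kwargs)
```

The remaining steps (running the search, counting the matches, and saving the result as GeoJSON) are left as the exercise.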


# Access the assets
So far we have only discussed metadata, but how can one get to the actual images of a satellite scene (the “assets” in the STAC nomenclature)? These can be reached via links that are made available through the item’s `assets` attribute.

In [43]:
assets = items[0].assets  # first item's asset dictionary
print(assets.keys())

dict_keys(['aot', 'blue', 'coastal', 'granule_metadata', 'green', 'nir', 'nir08', 'nir09', 'red', 'rededge1', 'rededge2', 'rededge3', 'scl', 'swir16', 'swir22', 'thumbnail', 'tileinfo_metadata', 'visual', 'wvp', 'aot-jp2', 'blue-jp2', 'coastal-jp2', 'green-jp2', 'nir-jp2', 'nir08-jp2', 'nir09-jp2', 'red-jp2', 'rededge1-jp2', 'rededge2-jp2', 'rededge3-jp2', 'scl-jp2', 'swir16-jp2', 'swir22-jp2', 'visual-jp2', 'wvp-jp2'])


We can print a minimal description of the available assets:

In [45]:
for key, asset in assets.items():
    print(f"{key}: {asset.title}")

aot: Aerosol optical thickness (AOT)
blue: Blue (band 2) - 10m
coastal: Coastal aerosol (band 1) - 60m
granule_metadata: None
green: Green (band 3) - 10m
nir: NIR 1 (band 8) - 10m
nir08: NIR 2 (band 8A) - 20m
nir09: NIR 3 (band 9) - 60m
red: Red (band 4) - 10m
rededge1: Red edge 1 (band 5) - 20m
rededge2: Red edge 2 (band 6) - 20m
rededge3: Red edge 3 (band 7) - 20m
scl: Scene classification map (SCL)
swir16: SWIR 1 (band 11) - 20m
swir22: SWIR 2 (band 12) - 20m
thumbnail: Thumbnail image
tileinfo_metadata: None
visual: True color image
wvp: Water vapour (WVP)
aot-jp2: Aerosol optical thickness (AOT)
blue-jp2: Blue (band 2) - 10m
coastal-jp2: Coastal aerosol (band 1) - 60m
green-jp2: Green (band 3) - 10m
nir-jp2: NIR 1 (band 8) - 10m
nir08-jp2: NIR 2 (band 8A) - 20m
nir09-jp2: NIR 3 (band 9) - 60m
red-jp2: Red (band 4) - 10m
rededge1-jp2: Red edge 1 (band 5) - 20m
rededge2-jp2: Red edge 2 (band 6) - 20m
rededge3-jp2: Red edge 3 (band 7) - 20m
scl-jp2: Scene classification map (SCL)
swi
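The listing above shows most bands twice: once under a plain key and once under a key ending in `-jp2`. The sketch below separates the two groups by key name. It uses a hand-copied subset of the keys so that it runs on its own; in the notebook you would start from `list(assets.keys())` instead. That the `-jp2` entries are JPEG 2000 versions of the same bands is an inference from the key names.

```python
# Subset of the asset keys printed above, copied by hand for illustration
asset_keys = [
    "aot", "blue", "coastal", "granule_metadata", "green", "nir",
    "thumbnail", "visual", "aot-jp2", "blue-jp2", "coastal-jp2",
    "green-jp2", "nir-jp2", "visual-jp2",
]

# Keys without the "-jp2" suffix (the COG bands plus metadata and thumbnail)
cog_keys = [key for key in asset_keys if not key.endswith("-jp2")]
# Keys with the "-jp2" suffix
jp2_keys = [key for key in asset_keys if key.endswith("-jp2")]

print(cog_keys)
print(jp2_keys)
```

In the rest of this lesson we only use the plain keys, such as `nir`, `red`, and `visual`.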

In [46]:
print(assets["thumbnail"].href)

https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/50/H/MK/2023/10/S2B_50HMK_20231008_0_L2A/thumbnail.jpg


Remote raster data can be directly opened via the rioxarray library. We will learn more about this library in the next episodes.

In [47]:
import rioxarray
nir_href = assets["nir"].href
nir = rioxarray.open_rasterio(nir_href)
print(nir)

<xarray.DataArray (band: 1, y: 10980, x: 10980)>
[120560400 values with dtype=uint16]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 4e+05 4e+05 4e+05 ... 5.097e+05 5.097e+05 5.098e+05
  * y            (y) float64 6.5e+06 6.5e+06 6.5e+06 ... 6.39e+06 6.39e+06
    spatial_ref  int64 0
Attributes:
    AREA_OR_POINT:       Area
    OVR_RESAMPLING_ALG:  AVERAGE
    _FillValue:          0
    scale_factor:        1.0
    add_offset:          0.0


We can then save the data to disk:

In [48]:
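# Build the output path and make sure its directory exists
import os

nir_fpath = os.path.join(storage_location, "S2", "nir.tif")
os.makedirs(os.path.dirname(nir_fpath), exist_ok=True)

# Uncomment to save the whole image to disk (slow: the full band is large)
# nir.rio.to_raster(nir_fpath)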

Saving the whole image might take a while, since the 10 m NIR band contains more than 10,000 × 10,000 pixels, i.e. over a hundred million. Instead, you can take a smaller subset before downloading it. Because the raster is a COG, we can download just what we need!

Here, we select the first (and only) band in the TIFF file, and a slice of the height (y) and width (x) dimensions.

In [49]:
nir[0,2200:3200,1000:2200].rio.to_raster(nir_fpath)

On disk, the full image takes about 155 MB, whereas the subset takes only about 4 MB.
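To see how small the subset is relative to the full scene, the arithmetic can be sketched as follows. The shapes come from the `DataArray` printed earlier and from the slices `2200:3200` (rows) and `1000:2200` (columns). The 10 m pixel size is the nominal resolution given in the asset title, so the extent is an approximation.

```python
full_shape = (10980, 10980)                 # (y, x) of the full NIR band
subset_shape = (3200 - 2200, 2200 - 1000)   # rows 2200:3200, columns 1000:2200

full_pixels = full_shape[0] * full_shape[1]
subset_pixels = subset_shape[0] * subset_shape[1]
fraction = subset_pixels / full_pixels

pixel_size_m = 10  # nominal resolution of the NIR band (assumed exact here)
height_km = subset_shape[0] * pixel_size_m / 1000
width_km = subset_shape[1] * pixel_size_m / 1000

print(f"full image: {full_pixels:,} pixels")
print(f"subset:     {subset_pixels:,} pixels ({fraction:.1%} of the scene)")
print(f"subset extent: about {width_km:.0f} km wide by {height_km:.0f} km tall")
```

The subset holds roughly one percent of the pixels, which is why it downloads so much faster.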

# Other bands
While we're at it, let's also download some other useful bands:

In [50]:
band_name = "red"
band_href = assets[band_name].href
band_xr = rioxarray.open_rasterio(band_href)

band_fpath = os.path.join(storage_location, "S2", f"{band_name}.tif")
band_xr[0, 2200:3200, 1000:2200].rio.to_raster(band_fpath)

In [51]:
band_name = "visual"
band_href = assets[band_name].href
band_xr = rioxarray.open_rasterio(band_href)

band_fpath = os.path.join(storage_location, "S2", f"{band_name}.tif")
band_xr[:, 2200:3200, 1000:2200].rio.to_raster(band_fpath)
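The two cells above repeat the same steps, so they could be folded into a loop. The sketch below is one possible way to do that; `subset_path` and `save_band_subsets` are hypothetical helpers, not part of any library. It uses `isel` on the `y` and `x` dimensions so that the same code handles single-band and three-band assets. The download call is left commented out because it needs network access and the `assets` dictionary from the notebook.

```python
import os


def subset_path(storage_location, band_name):
    """Build the output path used in this lesson for a given band."""
    return os.path.join(storage_location, "S2", f"{band_name}.tif")


def save_band_subsets(assets, band_names, storage_location,
                      rows=slice(2200, 3200), cols=slice(1000, 2200)):
    """Save the same pixel window of several bands as GeoTIFF files."""
    import rioxarray  # imported here so subset_path works without it

    for band_name in band_names:
        band = rioxarray.open_rasterio(assets[band_name].href)
        # isel keeps the band dimension, so 1-band and RGB assets both work
        window = band.isel(y=rows, x=cols)
        fpath = subset_path(storage_location, band_name)
        os.makedirs(os.path.dirname(fpath), exist_ok=True)
        window.rio.to_raster(fpath)


# Example call (needs network access and the notebook's variables):
# save_band_subsets(assets, ["red", "visual"], storage_location)
print(subset_path("workshop_data", "red"))
```

Keep in mind that a fixed pixel window only covers the same ground area for bands that share the same resolution; the 20 m and 60 m bands would need different indices.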

In [52]:
band_xr

# Public catalogs, protected data

Publicly accessible catalogs and STAC endpoints do not necessarily imply publicly accessible data. Data providers, in fact, may limit data access to specific infrastructures and/or require authentication. For instance, the NASA CMR STAC endpoint offers publicly accessible metadata for the HLS collection, but most of the linked assets are available only to registered users (the thumbnail is publicly accessible).

The authentication procedure for datasets with restricted access might differ depending on the data provider. For the NASA CMR, follow these steps in order to access data using Python:

- Create a NASA Earthdata login account [here](https://urs.earthdata.nasa.gov/);
- Set up a netrc file with your credentials, e.g. by using this [script](https://git.earthdata.nasa.gov/projects/LPDUR/repos/daac_data_download_python/browse/EarthdataLoginSetup.py);
- Define the following environment variables:
```
import os
os.environ["GDAL_HTTP_COOKIEFILE"] = "./cookies.txt"
os.environ["GDAL_HTTP_COOKIEJAR"] = "./cookies.txt"
```
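For reference, a netrc entry for Earthdata login generally has the shape shown in the sketch below. The username and password are placeholders, and the file is written to a temporary directory purely for demonstration; the real file normally lives at `~/.netrc` and should be readable only by you. The standard-library `netrc` module is used here only to check that the entry parses.

```python
import netrc
import os
import tempfile

# Placeholder credentials: replace with your own Earthdata login
entry = "machine urs.earthdata.nasa.gov login my_username password my_password\n"

# Written to a temporary directory here; normally the file is ~/.netrc
with tempfile.TemporaryDirectory() as tmp_dir:
    netrc_path = os.path.join(tmp_dir, ".netrc")
    with open(netrc_path, "w") as f:
        f.write(entry)
    os.chmod(netrc_path, 0o600)  # restrict access to the file owner

    # Parse the file back to confirm that it is well formed
    parsed = netrc.netrc(netrc_path)
    login, _, password = parsed.authenticators("urs.earthdata.nasa.gov")

print(login)
```

The linked setup script is expected to produce an equivalent entry for you, so writing the file by hand is optional.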


## Storing our downloaded filenames for future notebooks
Let's now store the path of the directory that holds our downloaded files in a text file for future use.

In [53]:
product_directory = os.path.join(storage_location, "S2")
dir_text_filename = "product_dir.txt"

# Write the directory path as a single string
with open(dir_text_filename, 'w') as f:
    f.write(product_directory)
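A later notebook can then read the path back. The sketch below shows the round trip in a temporary directory so that it runs on its own; the `product_directory` value is a stand-in for the one defined above, and in practice you would read `product_dir.txt` from wherever it was written.

```python
import os
import tempfile
from pathlib import Path

product_directory = os.path.join("workshop_data", "S2")  # stand-in value

with tempfile.TemporaryDirectory() as tmp_dir:
    dir_text_file = Path(tmp_dir) / "product_dir.txt"

    # Store the directory path as plain text
    dir_text_file.write_text(product_directory)

    # Later, e.g. in another notebook: read the path back
    restored = dir_text_file.read_text().strip()

print(restored == product_directory)
```

Stripping whitespace on read guards against a trailing newline if the file is ever edited by hand.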

# Other options
Another valid choice for downloading satellite data using Python is [eodag](https://eodag.readthedocs.io/en/stable/).

PySTAC takes a bit more know-how of APIs to use, but gives you more freedom as well.  
PySTAC also supports COGs, which are currently only supported by the bleeding edge `eodag-cube` library rather than `eodag` itself.  
SARA and NCI (the locations we'd get the satellite data from for eodag) supply Sentinel-2 data as zip-files rather than COGs.