# Read relevant portions of the HLS dataset from NASA Earthdata

## Imports

In [None]:
from pathlib import Path

import earthaccess as ea
import geopandas as gpd
import numpy as np
import rasterio as rio

from envrs import hls_tools as hls
from envrs import rio_tools as rt
from envrs.download_path import make_url

## Login

To download the data you need an account on NASA Earthdata. You can register at the following link:

https://urs.earthdata.nasa.gov/

<img src="../images/portal_eartdata_login.png" width="700" alt="Earthdata login">

To log you in, the next cell will ask for the username and password that you set during registration. These are used to create the `~/.netrc` file that the library uses to grant access to the data.

In [None]:
# Set the netrc path
netrc_path = Path("~/.netrc").expanduser()


# Ensure .netrc is written where it should
def write_netrc(auth, netrc_path=netrc_path):
    username = auth.username
    password = auth.password
    text = f"machine urs.earthdata.nasa.gov\n\tlogin {username}\n\tpassword {password}"
    return netrc_path.write_text(text)


# are we authenticated?
auth = ea.login()

if not auth.authenticated:
    # ask for credentials and persist them in a .netrc file
    auth.login(strategy="interactive", persist=True)

# Write the ~/.netrc file if it does not exist
if not netrc_path.exists():
    write_netrc(auth=auth, netrc_path=netrc_path)
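
Because the `.netrc` file stores the password in plain text, it may be worth restricting its permissions and checking that it parses correctly. The following is a minimal sketch using only the standard library; the temporary path and the credentials (`demo_user`, `demo_pass`) are placeholders for illustration and are not part of the notebook.

```python
import netrc
import stat
import tempfile
from pathlib import Path

HOST = "urs.earthdata.nasa.gov"

# Placeholder location and credentials, used only for this illustration
demo_path = Path(tempfile.mkdtemp()) / ".netrc"
demo_path.write_text(f"machine {HOST}\n\tlogin demo_user\n\tpassword demo_pass\n")

# Restrict the file to its owner (read/write), as is customary for .netrc
demo_path.chmod(stat.S_IRUSR | stat.S_IWUSR)

# Parse the file back to confirm that the entry for the host is readable
login, _, password = netrc.netrc(str(demo_path)).authenticators(HOST)
print(login)
```

The same check could be pointed at `netrc_path` once the real file has been written.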

## Identify the target datasets

The `earthaccess` package lets us check which datasets are available in the [NASA Earthdata repository](https://search.earthdata.nasa.gov/search) and download them. Here we look for datasets whose name starts with `HLS`, an acronym for "Harmonized Landsat/Sentinel-2". Both datasets and images are described using a standard called the [Unified Metadata Model](https://www.earthdata.nasa.gov/fr/about/esdis/esco/standards-practices/unified-metadata-model) ("UMM"), which is why the `"umm"` key appears many times throughout the notebook.

In [None]:
dataset_tree = ea.search_datasets(keyword="HLS*30")

The result is a [`list`](https://realpython.com/python-data-structures/#list-mutable-dynamic-arrays) of [`dictionaries`](https://realpython.com/python-data-structures/#list-mutable-dynamic-arrays), each containing the details of one dataset. Here we display only the `"ShortName"` (within the `umm` portion) to see which datasets matched the `"HLS*30"` keyword.

In [None]:
[d["umm"]["ShortName"] for d in dataset_tree]

`HLSS30` is the dataset whose images originate from Sentinel-2, and `HLSL30` is the equivalent for images from Landsat 8/9. They are kept separate because some spectral bands are present in only one of them, and the band numbering/naming varies across sensors. The search also returned, unintentionally, a couple of additional datasets ending in `*_VI`, which are vegetation indices derived from `HLSS30` and `HLSL30`.

## Query NASA datasets to identify overlapping datasets

We will use the `HLSextent.geojson` file to select the images covering our study area. For this, we read the file, take the geometry of the first row (in this case, the only one), and calculate the enclosing rectangle containing the polygon (bounding box, shortened to `bbox`).

In [None]:
aoi_uri = make_url("HLSextent.geojson")

aoi = gpd.read_file(aoi_uri)
geometry = aoi.loc[0, "geometry"]

bbox = geometry.bounds
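
The `bounds` attribute of a geometry returns the tuple `(minx, miny, maxx, maxy)`. For coordinates in longitude/latitude this corresponds to the lower-left and upper-right corners of the rectangle, which, to our understanding, is the order the `bounding_box` argument of `earthaccess` expects. The sketch below reproduces that calculation in plain Python; the vertex coordinates and the helper `bounds_of` are hypothetical and only illustrate the ordering.

```python
# Hypothetical polygon vertices as (longitude, latitude) pairs
ring = [(35.10, -18.95), (35.42, -18.90), (35.38, -18.61), (35.05, -18.70)]


def bounds_of(coords):
    """Return (min_lon, min_lat, max_lon, max_lat) for a list of vertices."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    return (min(lons), min(lats), max(lons), max(lats))


demo_bbox = bounds_of(ring)
print(demo_bbox)
```

If the AOI file were stored in a projected coordinate system, it would need to be reprojected to geographic coordinates before taking the bounds.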

Then we use `ea.search_data` to find the images that overlap the area described in the spatial file (`bounding_box`) and were acquired during our period of interest (`temporal`).

In [None]:
granule_pile = ea.search_data(
    short_name=["HLSL30", "HLSS30"],
    cloud_hosted=True,
    temporal=("2018-01-01T00:00:00", "2024-12-31T23:59:59"),
    bounding_box=bbox,
)

## Examine the contents of a granule

The result of our search is stored in `granule_pile`, a Python `list` where each entry is an object describing one image (also called a `granule`) and all the files available for it. Some of these files are bands (`B*` suffix), whereas `Fmask` is the cloud mask and the ones ending in `*A` are sun or view angles. In a notebook, these objects render a display with a small color composite of the image and some buttons to download the files. Most of the time, however, it is more convenient to do this in an automated way (keep reading).

In [None]:
first_granule = granule_pile[0]
first_granule

The next cell shows how we can access the information describing the granule. Once again, the `"umm"` entry holds the relevant data we may need, such as the [MGRS tile](https://hls.gsfc.nasa.gov/products-description/tiling-system/) of the files (`"MGRS_TILE_ID"`) or their cloud cover (`"CLOUD_COVERAGE"`).

In [None]:
dict(first_granule.items())["umm"]

## Retain only one of the tiles, when the cloud cover is <90%

We will use the information about the granules to decide which images to download. We prepared the function `hls.extract_extra_attrs` to extract the relevant attributes. With them, we keep only the images corresponding to the [MGRS tile](https://hls.gsfc.nasa.gov/products-description/tiling-system/) `"T36KXE"` (Sentinel-2 tiles have some overlap). We also exclude granules where `"CLOUD_COVERAGE"` is greater than 90%, because little of the ground would be visible.

In [None]:
min_valid = 10.0
max_cloud = 100 - min_valid
target_tile = "T36KXE"

In [None]:
layer_pile, name_pile = [], []
for granule in granule_pile:
    # Extract the relevant attributes
    granule_attrs = hls.extract_extra_attrs(granule)

    # Skip if the target tile does not match
    if granule_attrs["MGRS_TILE_ID"] not in target_tile:
        continue

    # Skip if the cloud cover is too high
    if float(granule_attrs["CLOUD_COVERAGE"]) > max_cloud:
        continue

    # Add the URIs of the selected rasters to the pile
    layer_pile.extend(granule.data_links())
    name_pile.append(granule)
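
The two conditions in the loop can also be read as a single predicate. The sketch below applies it to mock attribute dictionaries; the helper `keep_granule` and the attribute values are hypothetical, and we assume here that the tile identifier is stored with the same `T` prefix as the target (the real output of `hls.extract_extra_attrs` may differ).

```python
def keep_granule(attrs, tile, max_cloud):
    """Return True when the granule is on the target tile and clear enough."""
    same_tile = attrs["MGRS_TILE_ID"] == tile
    # Attribute values are treated as strings here, hence the conversion
    clear_enough = float(attrs["CLOUD_COVERAGE"]) <= max_cloud
    return same_tile and clear_enough


# Mock attributes for three granules (illustrative values only)
mock_attrs = [
    {"MGRS_TILE_ID": "T36KXE", "CLOUD_COVERAGE": "12"},
    {"MGRS_TILE_ID": "T36KXE", "CLOUD_COVERAGE": "97"},
    {"MGRS_TILE_ID": "T36KYE", "CLOUD_COVERAGE": "5"},
]

kept = [a for a in mock_attrs if keep_granule(a, tile="T36KXE", max_cloud=90.0)]
print(len(kept))
```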

## Format the dataframe of the downloads

Again we use a couple of helper functions to clean up the result of our search. `hls.tabulate_hls_uris` formats the results as a table ([dataframe](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe)) in which the column `uri` holds the link to every file available for the selected granules, and the remaining columns hold information parsed from the file name.

In [None]:
sel_frame = hls.tabulate_hls_uris(layer_pile)
sel_frame.head()
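
To illustrate what information a file name carries, the sketch below parses one HLS v2.0 name with a regular expression. It is not the implementation of `hls.tabulate_hls_uris`: the function `parse_hls_uri`, the field names, and the example URL are assumptions made for this illustration.

```python
import re
from datetime import datetime

# Assumed layout: HLS.<sensor>.<tile>.<YYYYDDDTHHMMSS>.v<version>.<suffix>.tif
HLS_NAME = re.compile(
    r"(?P<stem>HLS\.(?P<sensor>[LS]30)\.(?P<tile>T\w{5})"
    r"\.(?P<stamp>\d{7}T\d{6})\.v\d+\.\d+)"
    r"\.(?P<suffix>\w+)\.tif$"
)


def parse_hls_uri(uri):
    """Split an HLS file URI into sensor, tile, acquisition time and suffix."""
    match = HLS_NAME.search(uri)
    if match is None:
        raise ValueError(f"Not an HLS file name: {uri}")
    fields = match.groupdict()
    # The time stamp uses the year plus the day of the year (%j)
    fields["acquired"] = datetime.strptime(fields.pop("stamp"), "%Y%jT%H%M%S")
    fields["uri"] = uri
    return fields


# Placeholder URL; only the file name at the end matters
demo_uri = "https://example.com/HLS.S30.T36KXE.2020274T074309.v2.0.B04.tif"
record = parse_hls_uri(demo_uri)
print(record["sensor"], record["tile"], record["suffix"], record["acquired"])
```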

It is important to note that band numbers are [NOT shared](https://www.earthdata.nasa.gov/data/projects/hls/spectral-bands) by the `L30` and `S30` products. The same number does not necessarily point to the same spectral band, and some bands are entirely absent from one product or the other (`S30` does not have thermal bands, `L30` does not have red edge bands).

In [None]:
sel_frame.groupby(["sensor", "suffix"]).size().unstack()
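
The mismatch can be made explicit with a lookup table from common band names to the band code used by each product. The sketch below covers only the bands shared by both sensors, following the HLS spectral band listing linked above; the dictionary `BAND_CODES` and the helper `common_name` are hypothetical and are not how the `hls` helpers necessarily work.

```python
# Band code used by each product for the bands both sensors share
BAND_CODES = {
    "L30": {
        "CoastalAerosol": "B01", "Blue": "B02", "Green": "B03", "Red": "B04",
        "NIRnarrow": "B05", "SWIR1": "B06", "SWIR2": "B07",
    },
    "S30": {
        "CoastalAerosol": "B01", "Blue": "B02", "Green": "B03", "Red": "B04",
        "NIRnarrow": "B8A", "SWIR1": "B11", "SWIR2": "B12",
    },
}


def common_name(sensor, suffix):
    """Return the shared band name for a sensor/suffix pair, or None."""
    lookup = {code: name for name, code in BAND_CODES[sensor].items()}
    return lookup.get(suffix)


# B05 is the narrow NIR band in L30 but a red edge band in S30
print(common_name("L30", "B05"), common_name("S30", "B05"))
```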

To make things easier we have prepared `hls.harmonize_hls_frame`, which reformats our table so every column is named after a specific portion of the spectrum. Each column contains the links to the `L30` and `S30` files covering that portion. That way we can download just the bands shared by both sensors, in a format that is easy to understand.

In [None]:
harmonized_frame = hls.harmonize_hls_frame(sel_frame)
harmonized_frame.head()

Now every row contains `stem`, the base name we will use for our output files, and the rest of the columns containing the links to the datasets in the granule.

* `CoastalAerosol`, `Blue`, `Green`, `Red`, `NIRnarrow`, `SWIR1`, `SWIR2` are the bands
* `Fmask` contains information to mask the presence of clouds, shadows, water or snow
* The rest are the sun azimuth and zenith angles (`SAA`, `SZA`) and the view azimuth and zenith angles (`VAA`, `VZA`)

## Drop the coastal aerosol and the angle bands: they will not be used

In [None]:
reduced_frame = harmonized_frame.drop(
    columns=["CoastalAerosol", "SZA", "SAA", "VZA", "VAA"]
)

## Perform download with conditions

### Set the environment for the download:

This cell sets a series of [environment variables](https://en.wikipedia.org/wiki/Environment_variable) necessary to be able to read the image data over the internet using the [geospatial data abstraction library (GDAL)](https://courses.spatialthoughts.com/gdal-tools.html#introduction), an extremely powerful piece of software for working with spatial data.

In [None]:
# https://github.com/nasa/HLS-Data-Resources/blob/main/python/tutorials/HLS_Tutorial.ipynb
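env_pairs = {
    # GDAL does not expand "~", so resolve the home directory here
    "GDAL_HTTP_COOKIEFILE": str(Path("~/cookies.txt").expanduser()),
    "GDAL_HTTP_COOKIEJAR": str(Path("~/cookies.txt").expanduser()),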
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": "TIF",
    "GDAL_HTTP_UNSAFESSL": "YES",
    "GDAL_HTTP_MAX_RETRY": "10",
    "GDAL_HTTP_RETRY_DELAY": "0.5",
}

### Set the output directory

In [None]:
OUT_DIR = Path(r"~/Downloads/hls/automated").expanduser()
OUT_DIR.mkdir(parents=True, exist_ok=True)

### Select the bands we want to download

In [None]:
selected_columns = ["Blue", "Green", "Red", "NIRnarrow", "SWIR1", "SWIR2"]

### And download (but!)

This cell is where the magic happens. We read a portion of the `Fmask` dataset over the internet and, if too many pixels are flagged as cloudy, we do not download anything else. If the image is clear enough, we also download the relevant portion of the bands. The cell also skips files that have already been downloaded, and only reads the portion of each image covered by our AOI. The idea is to reduce the data volume and the wait time required to acquire the data.

In [None]:
with rio.Env(**env_pairs) as download_env:
    for row_idx, (product_prefix, product_uris) in enumerate(reduced_frame.iterrows()):
        # Print progress
        if row_idx % 100 == 0:
            print(f"{row_idx:>04}", "/", reduced_frame.shape[0], product_prefix)

        # check if the fmask file exists, skip if that's the case
        fmask_path = OUT_DIR / f"{product_prefix}_fmask.tif"
        if fmask_path.is_file():
            continue

        # If it does not exist, download and save
        fmask_pairs, fmask_profiles, fmask_tags = rt.clipped_read(
            product_uris[["Fmask"]], aoi
        )
        rt.write_raster(fmask_pairs, fmask_profiles, fmask_tags, fmask_path)

        # Decompose the bit flags
        fmask_number = next(iter(fmask_pairs.values()))
        fmask_flags = np.flip(
            np.unpackbits(np.expand_dims(fmask_number, axis=0), axis=0), axis=0
        )

        # Estimate the percentage of clear pixels
        # (bits 0-3 unset: no cirrus, cloud, cloud-adjacent or cloud shadow)
        is_clear = np.all(fmask_flags[0:4] == 0, axis=0)
        perc_valid = 100 * np.sum(is_clear) / fmask_number.size

        # Skip band download if the percentage of valid pixels is under threshold
        if perc_valid < min_valid:
            continue

        # check if the band file exists, skip if that's the case
        band_path = OUT_DIR / f"{product_prefix}_bands.tif"
        if band_path.is_file():
            continue

        # read the bands
        band_pairs, band_profiles, band_tags = rt.clipped_read(
            product_uris[selected_columns], aoi
        )

        # Write to the hard drive
        rt.write_raster(band_pairs, band_profiles, band_tags, band_path)
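
The bit decomposition used in the loop can be checked on a tiny array. The sketch below assumes the `Fmask` layer is stored as unsigned 8-bit integers with bits 0 to 3 flagging cirrus, cloud, cloud-adjacent and cloud shadow pixels, as the comments in the cell above describe. It also shows that a plain bit mask gives the same result as unpacking the bits; the array values are made up for the illustration.

```python
import numpy as np

# Made-up Fmask values: 0 = clear, 2 = cloud (bit 1),
# 64 = aerosol level only (bit 6), 8 = cloud shadow (bit 3)
fmask_demo = np.array([[0, 2], [64, 8]], dtype=np.uint8)

# Same decomposition as in the download loop: one layer per bit, bit 0 first
flags = np.flip(
    np.unpackbits(np.expand_dims(fmask_demo, axis=0), axis=0), axis=0
)
is_clear = np.all(flags[0:4] == 0, axis=0)

# Equivalent test with a bit mask covering bits 0 to 3
is_clear_mask = (fmask_demo & 0b00001111) == 0

perc_valid_demo = 100 * np.sum(is_clear) / fmask_demo.size
print(is_clear.tolist(), perc_valid_demo)
```

The bit mask version avoids building the eight-layer array, which may matter for large clips, although both approaches give the same answer.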