## This notebook demonstrates a case where loading HLS data fails because of the GDAL configuration


## Background

To analyze HLS data, we first need to load it into memory. This step requires reading the data from the S3 server.

There are known issues with retrieving HLS data through remote access via vsicurl (see this issue: [link](https://forum.earthdata.nasa.gov/viewtopic.php?t=5207&sid=15f046472f28eb9c21604f2cf8b87f79&start=10)).

The workaround is therefore to call ```.load()``` inside a loop. If an error is raised because of the GDAL issue, the rasterio environment is reconfigured and the load is attempted again.

The example covers **Houston, TX** from 2015 to 2024.

# 1. Getting started


## 1.1. Python Environment and Packages

A compatible python environment can be created by following the Python Environment setup instructions, activating that environment and adding the pystac-client and odc-stac packages:



In [None]:
mamba create -n lpdaac_vitals -c conda-forge --yes python=3.10 gdal=3.7.2 hvplot geoviews rioxarray rasterio geopandas fiona=1.9.4 \
jupyter earthaccess jupyter_bokeh h5py h5netcdf spectral scikit-image seaborn \
jupyterlab dask ray-default ray-dashboard pystac-client odc-stac

In [2]:
mamba activate lpdaac_vitals

Run 'mamba init' to be able to run mamba activate/deactivate
and start a new shell session. Or use conda to activate/deactivate.


Note: you may need to restart the kernel to use updated packages.


In [1]:
%matplotlib inline

import os
from datetime import datetime
import numpy as np
import pandas as pd
import geopandas as gp
from skimage import io
import matplotlib.pyplot as plt
from osgeo import gdal
import xarray as xr
import rioxarray as rxr
import hvplot.xarray
import hvplot.pandas
import earthaccess
import pystac_client
import dask.distributed
import odc.stac
from matplotlib.colors import ListedColormap
from matplotlib.patches import Patch
%run -i ./tools/plotting.ipynb
%run -i ./tools/data_access.ipynb
%run -i ./tools/ultilities.ipynb

## 1.2. Earthdata Login

We will use the earthaccess package for authentication. <font color="blue">[earthaccess](https://github.com/nsidc/earthaccess#readme)</font> can either create a new local .netrc file to store credentials or validate that one already exists in your user profile. If you do not have a .netrc file, you will be prompted for your credentials and one will be created.

In [2]:
earthaccess.login(persist=True)

Enter your Earthdata Login username:  trangthuyvo
Enter your Earthdata password:  ········


<earthaccess.auth.Auth at 0x7fdcc90c7220>

## 1.4. Configure GDAL Options and rio environment

In [3]:
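# GDAL configurations used to successfully access LP DAAC Cloud Assets via vsicurl
# GDAL does not expand '~', so resolve the cookie file path explicitly
cookie_file = os.path.expanduser('~/cookies.txt')
gdal.SetConfigOption('GDAL_HTTP_COOKIEFILE', cookie_file)
gdal.SetConfigOption('GDAL_HTTP_COOKIEJAR', cookie_file)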
gdal.SetConfigOption('GDAL_DISABLE_READDIR_ON_OPEN','EMPTY_DIR')
gdal.SetConfigOption('CPL_VSIL_CURL_ALLOWED_EXTENSIONS','TIF')
gdal.SetConfigOption('GDAL_HTTP_UNSAFESSL', 'YES')
gdal.SetConfigOption('GDAL_HTTP_MAX_RETRY', '10')
gdal.SetConfigOption('GDAL_HTTP_RETRY_DELAY', '0.5')
gdal.SetConfigOption('CPL_VSIL_CURL_USE_HEAD', 'FALSE')

# 2. CMR-STAC Search

To find the HLS data needed for the analysis, the following cell sets these parameters:

* ```lat```: The central latitude to analyse.
* ```lon```: The central longitude to analyse.
* ```buffer```: The distance, in degrees, to load on each side of the central latitude and longitude. For reasonable loading times, set this to 0.1 or lower.
* ```baseline_year```: The year to use as the baseline (or starting time) of urbanisation.
* ```analysis_year```: The analysis year to analyse the change in urbanisation


In [None]:
# Read shapefile of the AOI from U.S. 2020 Census Urban Area
city_name = 'Houston, TX'
df_geo = get_geometry_clip(city_name)
points = df_geo.geometry.centroid

In [9]:
# Alter the lat and lon to suit your study area
lon_offset = 0
lat_offset = 0
lat, lon = points.y.values[0] + lat_offset, points.x.values[0] + lon_offset

# Provide your area of extent here
# lat, lon = 33.349478, -96.554486
buffer = 0.4

# Combine central lat,lon with buffer to get area of interest
lat_range = (lat - buffer, lat + buffer)
lon_range = (lon - buffer, lon + buffer)

baseline_year = 2015
analysis_year = 2024

The next cell will display the selected area on an interactive map. Feel free to zoom in and out to get a better understanding of the area you'll be analysing. Clicking on any point of the map will reveal the latitude and longitude coordinates of that point.


In [10]:
display_map(lon_range, lat_range)

To find HLS data, we will use the pystac_client Python library to search NASA's Common Metadata Repository SpatioTemporal Asset Catalog (CMR-STAC).

Define the collection, datetime range, results limit, and bounding box, and store these as search parameters. Then run a STAC search against the LPCLOUD STAC endpoint and return the query results as a list of items.
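
The search parameters described above can be sketched as follows. This is an illustration only, not the `search_cmr_stac` helper from `./tools`: the function names, the collection IDs, the `limit` value, and the endpoint URL are assumptions and should be checked against the current CMR-STAC catalog before use.

```python
# Hypothetical sketch of a per-year CMR-STAC query; names and IDs are assumptions.
LPCLOUD_STAC_URL = "https://cmr.earthdata.nasa.gov/stac/LPCLOUD"  # assumed endpoint


def build_search_params(year, lat_range, lon_range,
                        collections=("HLSL30.v2.0", "HLSS30.v2.0"),  # assumed IDs
                        limit=100):
    """Return keyword arguments for a STAC search covering one calendar year."""
    # STAC bounding boxes are ordered [min_lon, min_lat, max_lon, max_lat]
    bbox = [min(lon_range), min(lat_range), max(lon_range), max(lat_range)]
    return {
        "collections": list(collections),
        "bbox": bbox,
        "datetime": f"{year}-01-01/{year}-12-31",
        "limit": limit,
    }


def search_year(year, lat_range, lon_range, url=LPCLOUD_STAC_URL):
    """Run the search (needs network access) and return a list of items."""
    from pystac_client import Client  # imported lazily so the sketch loads offline

    catalog = Client.open(url)
    search = catalog.search(**build_search_params(year, lat_range, lon_range))
    return list(search.items())


params = build_search_params(2015, (29.4, 30.2), (-95.7, -94.9))
print(params["bbox"], params["datetime"])
```

Building the parameters separately from the network call makes it easy to inspect the bounding box and date range before any request is sent.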


In [11]:
item_list = search_cmr_stac(baseline_year,analysis_year,lat_range,lon_range)

2015
Found 128 granules at point [-95.7377, 29.412568500000003, -94.93769999999999, 30.2125685] from 2015-01-01 to 2015-12-31
2016
Found 319 granules at point [-95.7377, 29.412568500000003, -94.93769999999999, 30.2125685] from 2016-01-01 to 2016-12-31
2017
Found 450 granules at point [-95.7377, 29.412568500000003, -94.93769999999999, 30.2125685] from 2017-01-01 to 2017-12-31
2018
Found 552 granules at point [-95.7377, 29.412568500000003, -94.93769999999999, 30.2125685] from 2018-01-01 to 2018-12-31
2019
Found 554 granules at point [-95.7377, 29.412568500000003, -94.93769999999999, 30.2125685] from 2019-01-01 to 2019-12-31
2020
Found 530 granules at point [-95.7377, 29.412568500000003, -94.93769999999999, 30.2125685] from 2020-01-01 to 2020-12-31
2021
Found 600 granules at point [-95.7377, 29.412568500000003, -94.93769999999999, 30.2125685] from 2021-01-01 to 2021-12-31
2022
Found 759 granules at point [-95.7377, 29.412568500000003, -94.93769999999999, 30.2125685] from 2022-01-01 to 202

## 2.1. Rename Common Bands
To calculate the VI for each granule, we need the **NIR, SWIR1, SWIR2, Red, Blue, and Green** bands. The table below lists the band numbers for each of the two products.

**Table**: HLS spectral bands nomenclature

|HLSL30 Band Name|HLSS30 Band Name|Band|Wavelength (micrometers)|
|:---|:---:|---:|---:|
|B01|B01|Coastal Aerosol|0.43 – 0.45|
|B02|B02|Blue|0.45 – 0.51|
|B03|B03|Green|0.53 – 0.59|
|B04|B04|Red|0.64 – 0.67|
|-|B05|Red-Edge 1|0.69 – 0.71|
|-|B06|Red-Edge 2|0.73 – 0.75|
|-|B07|Red-Edge 3|0.77 – 0.79|
|-|B08|NIR Broad|0.78 – 0.88|
|B05|B8A|NIR Narrow|0.85 – 0.88|
|B06|B11|SWIR 1|1.57 – 1.65|
|B07|B12|SWIR 2|2.11 – 2.29|
|-|B09|Water Vapor|0.93 – 0.95|
|B10|-|Thermal Infrared 1|10.60 – 11.19|
|B11|-|Thermal Infrared 2|11.50 – 12.51|

Source: [HLS User Guide V2](https://lpdaac.usgs.gov/documents/1698/HLS_User_Guide_V2.pdf)

To stack data from the Landsat and Sentinel instruments, we need common band names. For example, HLSL30 B05 and HLSS30 B8A both correspond to NIR Narrow.
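
The band mapping in the table can be expressed as a small lookup. The sketch below is hypothetical and is not the `rename_common_bands` helper from `./tools`: it works on plain dictionaries rather than STAC items, and the common name `NIR` is an assumption (the other names follow the `bands` list used later in this notebook).

```python
# Hypothetical mapping from product-specific band IDs to common names,
# taken from the HLS band table above.
COMMON_BAND_NAMES = {
    "HLSL30": {"B02": "Blue", "B03": "Green", "B04": "Red",
               "B05": "NIR", "B06": "SWIR_1", "B07": "SWIR_2"},
    "HLSS30": {"B02": "Blue", "B03": "Green", "B04": "Red",
               "B8A": "NIR", "B11": "SWIR_1", "B12": "SWIR_2"},
}


def rename_assets(product, assets):
    """Return a copy of `assets` with band keys replaced by common names.

    Keys without a common name (for example a quality mask) are kept as-is.
    """
    mapping = COMMON_BAND_NAMES[product]
    return {mapping.get(name, name): href for name, href in assets.items()}


landsat = rename_assets("HLSL30", {"B06": "l30_b06.tif", "Fmask": "l30_fmask.tif"})
sentinel = rename_assets("HLSS30", {"B11": "s30_b11.tif"})
print(sorted(landsat), sorted(sentinel))
```

After renaming, both products expose the same key (`SWIR_1` here), so their assets can be stacked along the time dimension.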

In [12]:
# Rename HLSS B11 and HLSL B06 to common band name SWIR1
item_list_rename = rename_common_bands(item_list)

# 3. Loading HLS data using ODC-STAC

Now that we have a list of items from the CMR-STAC search, the function ```odc.stac.stac_load``` can load the HLS data lazily as a parallel Dask operation. A few additional variables need to be defined:
- ```crs```: projection of the dataset e.g., 'utm'
- ```spatial_res```: expected spatial resolution e.g., 30 for HLS data
- ```bands```: a list of desired bands to load 
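
The variables listed above map onto `odc.stac.load` arguments roughly as sketched below. The `load_odc_stac` helper used in this notebook lives in `./tools` and is not shown, so this is only an assumption about what such a wrapper might do; `build_load_kwargs`, `load_hls_lazily`, and the 512-pixel chunk size are hypothetical.

```python
# Hypothetical wrapper around odc.stac.load; not the notebook's load_odc_stac helper.
def build_load_kwargs(bands, bbox, crs="utm", spatial_res=30, chunk_size=512):
    """Translate the notebook's variables into odc.stac.load keyword arguments."""
    return {
        "bands": list(bands),
        "crs": crs,                    # "utm" lets odc-stac pick a UTM zone
        "resolution": spatial_res,     # in units of the output CRS (metres for UTM)
        "bbox": list(bbox),            # [min_lon, min_lat, max_lon, max_lat]
        "chunks": {"x": chunk_size, "y": chunk_size},  # passing chunks gives Dask arrays
    }


def load_hls_lazily(items, bands, bbox, **options):
    """Build a lazy, Dask-backed xarray.Dataset from STAC items."""
    import odc.stac  # imported lazily so the sketch loads without odc-stac installed

    return odc.stac.load(items, **build_load_kwargs(bands, bbox, **options))


kwargs = build_load_kwargs(["SWIR_1", "Red"], [-95.7, 29.4, -94.9, 30.2])
print(kwargs["crs"], kwargs["resolution"], kwargs["chunks"])
```

Because `chunks` is set, no pixel data is read at this point; the reads happen later, when the dataset is computed or loaded.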

> Loading the data might take a few minutes.

In [13]:
bbox = [min(lon_range), min(lat_range), max(lon_range), max(lat_range)]
bands = ['SWIR_1','SWIR_2','Red','Green','Blue']
ds = load_odc_stac('utm',bands,30,item_list_rename,bbox)

Previewing the data shows that the HLS dataset over Houston, TX from 2015 to 2024 contains 4413 granules, with x and y dimensions of 7179 and 9848, respectively.

# 4. Urbanization Change Detection Analysis 

## 4.1. Load a Subset of the Dataset over an AOI
Before running the analysis, we extract the data over the domain and time period of interest.
> Because retrieving HLS data from the S3 bucket is slow, we limit the spatial domain as much as possible.

In this case, we analyze Houston, TX, using the U.S. 2020 Census Urban Area as the reference boundary, and compare 2015 with 2024 to evaluate the urbanization rate.


In [14]:
ds_mask = ds.rio.clip(df_geo.geometry.values, df_geo.crs, all_touched=True)

In [15]:
ds_median = ds_mask.groupby('time.year').median()

We then select only the 2015 and 2024 data for the analysis.

In [16]:
ds_median = ds_median.sel(year = [baseline_year,analysis_year])

## 4.2. Rechunk the dataset into a good chunk size

> Refer to this [article](https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes) for the definition of a **good** chunk size.
> Generally, a chunk should be neither too small nor too big (100 MB to 1 GB).
> For a small subset of the data (3000 x 3000 pixels), a chunk size of 512 x 512 is considered reasonable.

In [17]:
ds_median = ds_median.chunk({'y':512,'x':512})
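
To see what a 512 x 512 chunk means in practice, the arithmetic can be checked by hand. The sketch below is an illustration with a hypothetical helper, `chunk_summary`; it uses the shape that this notebook reports for the clipped two-year dataset, (2, 3009, 2638), and assumes 4-byte `float32` values.

```python
import math


def chunk_summary(shape, chunks, itemsize=4):
    """Return (number of chunks, MiB per chunk, MiB for the whole array)."""
    # Each dimension is split into ceil(size / chunk) pieces
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    chunk_mib = math.prod(chunks) * itemsize / 2**20
    total_mib = math.prod(shape) * itemsize / 2**20
    return n_chunks, chunk_mib, total_mib


n_chunks, chunk_mib, total_mib = chunk_summary((2, 3009, 2638), (1, 512, 512))
print(f"{n_chunks} chunks, {chunk_mib:.2f} MiB per chunk, {total_mib:.2f} MiB in total")
```

Each chunk here is about 1 MiB, well under the 100 MB guideline quoted above, so every variable is read as many small tasks rather than a few large ones.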

## 4.3. Scale the data
In some HLSL30 granules, the ```scale_factor``` information is present in the file metadata but missing from the band metadata, so it is not applied automatically. We therefore scale each data array manually by the scale factor.



In [18]:
ds_mask_scaled = scale_hls_data(ds_median,bands)

## 4.4. Load the data 
This is the most time-consuming step because, up to this point, the data has only been loaded lazily and no computation has been performed. For instance, the SWIR1 variable is shown only as a dask.array, with no actual data inside.

To perform the analysis, we need to load the data into memory. This step requires reading the data from the S3 server.

There are known issues with retrieving HLS data through remote access via vsicurl (see this issue: [link](https://forum.earthdata.nasa.gov/viewtopic.php?t=5207&sid=15f046472f28eb9c21604f2cf8b87f79&start=10)).

The workaround is therefore to call ```.load()``` inside a loop. If an error is raised because of the GDAL issue, the rasterio environment is reconfigured and the load is attempted again.
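
The retry idea described above can be sketched as follows. This is a minimal illustration and not the notebook's `load_data_into_memory` helper: the name `load_with_retry`, its parameters, and the default attempt count and delay are all hypothetical.

```python
import time


def load_with_retry(load, reconfigure, max_attempts=5, delay=1.0,
                    retry_on=(Exception,)):
    """Call `load()` until it succeeds, reconfiguring between failed attempts.

    `load` and `reconfigure` are zero-argument callables, for example a
    dataset's `.load` method and a function that resets the GDAL settings.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return load()
        except retry_on as err:
            last_error = err
            print(f"Attempt {attempt} of {max_attempts} failed: {err}")
            if attempt < max_attempts:
                reconfigure()      # re-apply the GDAL/rasterio configuration
                time.sleep(delay)  # brief pause before the next attempt
    raise RuntimeError(f"Load failed after {max_attempts} attempts") from last_error


# Possible usage in this notebook (assumes the names defined elsewhere in it):
# from rasterio.errors import RasterioIOError
# loaded = load_with_retry(ds_mask_scaled.load, configure_gdal_rasterio_dask,
#                          retry_on=(RasterioIOError,))
```

Limiting `retry_on` to the read error of interest avoids silently retrying unrelated failures, and the attempt cap keeps the loop from running forever when the problem is persistent.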

In [19]:
ds_mask_scaled

Each data variable in the dataset is a lazy Dask array with the same layout:

| |Array|Chunk|
|:---|:---|:---|
|Bytes|60.56 MiB|1.00 MiB|
|Shape|(2, 3009, 2638)|(1, 512, 512)|
|Dask graph|72 chunks in 67 graph layers|72 chunks in 67 graph layers|
|Data type|float32 numpy.ndarray|float32 numpy.ndarray|


When you call .load() on a Dask‑backed Xarray object, Xarray schedules many small read‑tasks on your cluster (one per chunk), but by that point you’re no longer “inside” the rasterio.Env context that you used when you opened the file. You have three main ways to re‑inject your GDAL/Rasterio settings into those reads:

1. Monkey‑patch Xarray’s .load() to wrap every read in your Env
2. Monkey‑patch rasterio.open itself
3. Ensure every Dask worker has the GDAL env set before they read

These settings can be configured by calling the ```configure_gdal_rasterio_dask()``` function.
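
The third option above (setting the GDAL environment on every Dask worker) can be sketched as follows. This is a hedged illustration, not the `_setup_gdal` or `configure_gdal_rasterio_dask` helpers from `./tools`: the name `set_gdal_env` is hypothetical, and the approach relies on GDAL falling back to environment variables for options that were not set explicitly in the process.

```python
import os

# Same options as in section 1.4; '~' is expanded because GDAL will not do it.
GDAL_OPTIONS = {
    "GDAL_HTTP_COOKIEFILE": os.path.expanduser("~/cookies.txt"),
    "GDAL_HTTP_COOKIEJAR": os.path.expanduser("~/cookies.txt"),
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": "TIF",
    "GDAL_HTTP_UNSAFESSL": "YES",
    "GDAL_HTTP_MAX_RETRY": "10",
    "GDAL_HTTP_RETRY_DELAY": "0.5",
    "CPL_VSIL_CURL_USE_HEAD": "FALSE",
}


def set_gdal_env(options=None):
    """Export GDAL options as environment variables in the current process."""
    opts = GDAL_OPTIONS if options is None else options
    os.environ.update(opts)
    return sorted(opts)  # names of the options that were set


# Apply locally; on a cluster, run the same function on every worker:
# client.run(set_gdal_env)
applied = set_gdal_env()
print(f"{len(applied)} GDAL options exported")
```

With `LocalCluster(processes=False)` the workers share the notebook's process, so a single call already covers them; with separate worker processes, `client.run` is what carries the settings across.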

In [20]:
# Configure the GDAL settings for rasterio and xarray
configure_gdal_rasterio_dask()

In [21]:
from dask.distributed import Client, LocalCluster
# cluster = LocalCluster(n_workers=8, threads_per_worker=2,memory_limit = "15GB",processes=False,local_directory='/tmp')
cluster = LocalCluster(processes=False, local_directory='/tmp') 

client = Client(cluster)
from rasterio.env import Env
client.run(_setup_gdal)

{'inproc://192.168.15.31/5135/4': None}

Aborting load due to failure while reading: https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T15RTN.2024236T165023.v2.0/HLS.L30.T15RTN.2024236T165023.v2.0.B04.tif:1
2025-05-08 14:15:07,540 - distributed.worker - ERROR - Compute Failed
Key:       ('Red-5a7570b6c7a42f1d23eeab677a51baae', 3439, 2, 2)
State:     executing
Function:  _dask_loader_tyx
args:      ([[<odc.loader._rio.RioReader object at 0x7fdc5e049180>, <odc.loader._rio.RioReader object at 0x7fdba992bcd0>]], Tiles: 6x6|512x512px => 3009x2638px, (2, 2), (), (), RasterLoadParams(dtype='float32', fill_value=None, src_nodata_fallback=None, src_nodata_override=None, use_overviews=True, resampling='nearest', fail_on_error=True, dims=()), <odc.loader._rio.RioDriver object at 0x7fdcbc6eb1c0>, {'GDAL_DISABLE_READDIR_ON_OPEN': 'EMPTY_DIR', 'GDAL_HTTP_MAX_RETRY': '10', 'GDAL_HTTP_RETRY_DELAY': '0.5'}, <odc.loader._rio.RioReader.LoaderState object at 0x7fdc801b7850>, None)
kwargs:    {}
Exception: 'Rasterio

In [22]:
client

|Property|Value|
|:---|:---|
|Connection method|Cluster object|
|Cluster type|distributed.LocalCluster|
|Dashboard|/user/trangthuyvo0109/proxy/8787/status|
|Workers|1|
|Total threads|16|
|Total memory|121.25 GiB|
|Status|running|
|Using processes|False|
|Scheduler comm|inproc://192.168.15.31/5135/1|
|Worker comm|inproc://192.168.15.31/5135/4|
|Worker dashboard|/user/trangthuyvo0109/proxy/46553/status|
|Nanny|None|
|Local directory|/tmp/dask-scratch-space/worker-_ebyz_xy|


In [23]:
# Re-executing this cell each time the vsicurl error occurs seems to resolve it
import rasterio
import xarray as xr
from rasterio.env import Env

After modifying the GDAL configuration, we can apply the load function to read the data from the remote server into memory in parallel.

In [24]:
import time
# Assume ds is already chunked
start = time.time()
ds_mask_scaled_sel = load_data_into_memory(ds_mask_scaled)
end = time.time()
print(f"⏱️ Computation time: {end - start:.2f} seconds")

This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


0 0 0


This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.
See also https://docs.dask.org/en/stable/best-practices.html#load-data-with-dask for more information.


RasterioIOError: '/vsicurl/https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T15RTP.2024172T164944.v2.0/HLS.L30.T15RTP.2024172T164944.v2.0.B07.tif' not recognized as a supported file format.



> Visualizing the true color composite for the two years

