# Tutorial: Accessing ICESat-2 Data

icepyx, sliderule, h5coro

## Part 2: SlideRule

SlideRule is a collaborative effort between NASA Goddard Space Flight Center (GSFC) and the University of Washington, funded by the ICESat-2 program. It provides an on-demand science data processing service for ICESat-2 and GEDI data that runs on Amazon Web Services (AWS) and responds to REST-like API calls to process and return science results. This _science-data-as-a-service_ model is a new way for researchers to work with and analyze data, giving them low-latency access to custom-generated, high-level data products.

SlideRule users provide specific parameters at the time of the request to compute products that fit their science needs. SlideRule then uses cloud-optimized versions of computational algorithms and a scalable cluster of EC2 instances to process data efficiently. All data is then returned to the user as a `geopandas` GeoDataFrame.

![SlideRule Overview](sliderule_overview_medium.png)

### For more information
***Website***: https://slideruleearth.io  
***Documentation***: https://slideruleearth.io/web/rtd/  
***GitHub***: https://github.com/SlideRuleEarth/sliderule  
***Examples***: https://github.com/SlideRuleEarth/sliderule-python  
***Contact***: support@mail.slideruleearth.io

In [None]:
# To use the latest version of the sliderule client, run this cell.
# It will install the sliderule Python client into your current conda environment.
# You will then need to restart your kernel to have the changes take effect.
%pip install --quiet "sliderule>=4.6"

### Example 1: Just Get Me Some Data

In [None]:
# (1) Import the client
from sliderule import sliderule, icesat2

In [None]:
# (2) Initialize the client
sliderule.init("slideruleearth.io");

In [None]:
# (3) Define an area of interest
region = sliderule.toregion("grandmesa.geojson");

In [None]:
# (4) Specify the processing parameters
parms = {
    "poly": region["poly"],
    "srt": icesat2.SRT_LAND,
    "len": 20.0,
    "res": 100.0
}

In [None]:
# (5) Make the processing request
gdf = icesat2.atl06p(parms)

#### Display the results

In [None]:
gdf

#### Plot the results

In [None]:
import matplotlib.pyplot as plt

In [None]:
region_lon = [e["lon"] for e in region["poly"]]
region_lat = [e["lat"] for e in region["poly"]]

In [None]:
f, ax = plt.subplots()
ax.set_title("ATL06-SR Points")
ax.set_aspect('equal')
gdf.plot(ax=ax, column='h_mean', cmap='inferno', s=0.1)
ax.plot(region_lon, region_lat, linewidth=1, color='g');

#### Explanation of what happened

#### (1) Import the client

The SlideRule Python client is broken up into different modules:
* `sliderule`: core general functionality
* `icesat2`: ICESat-2 on-demand, subsetting, raster sampling products
* `gedi`: GEDI subsetting, and raster sampling products
* `h5`: direct HDF5 data access
* `earthdata`: CMR, CMR-STAC, TNM helper functions _(use `earthaccess` instead)_
* `io`: reading and writing results to/from local files
* `ipysliderule`: toolbox for building SlideRule interfaces in a Jupyter notebook

#### (2) Initialize the client

Configure the client settings:
* `url`: address of sliderule service (default = "slideruleearth.io")
* `verbose`: display messages from server (default = False)
* `loglevel`: criticality of log messages to display (default = logging.INFO)
* `organization`: selection of cluster, used for private clusters (default = "sliderule")
* `desired_nodes`: number of nodes to run in a private cluster (default = None)
* `time_to_live`: how long to deploy a private cluster (default = 60 minutes)
* `bypass_dns`: query the provisioning system for IP address and don't use DNS lookup hostname (default = False)
* `plugins`: check if plugin is present (default = [])
* `trust_env`: use netrc file for authentication (default = False)
* `log_handler`: attach handler to client logging (default = None)
* `rethrow`: immediately rethrow any caught exception inside of the client (default = None)

#### (3) Define an area of interest

SlideRule uses an area of interest to determine which dataset resources to process, and then subsets those resources so that only data inside the area of interest is returned.  The `sliderule.toregion` function converts multiple input types into a format understood by SlideRule.  The input types supported are: geojson, shapefile, GeoDataFrame, list of coordinates, and a dictionary of coordinates.
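
As a rough illustration of the list-of-coordinates form, the sketch below builds a rectangular polygon from a bounding box using only the standard library. The `lon`/`lat` dictionary keys match what the plotting cell in Example 1 reads from `region["poly"]`. The helper name `bbox_to_poly`, the counter-clockwise winding with a repeated closing point, and the bounding-box values are assumptions made for illustration, so check the user's guide before passing such a list to the service.

```python
def bbox_to_poly(min_lon, min_lat, max_lon, max_lat):
    """Return a closed rectangle as a list of {"lon", "lat"} dictionaries.

    Hypothetical helper for illustration; not part of the sliderule client.
    """
    corners = [
        (min_lon, min_lat),  # lower left
        (max_lon, min_lat),  # lower right
        (max_lon, max_lat),  # upper right
        (min_lon, max_lat),  # upper left
        (min_lon, min_lat),  # repeat the first point to close the ring
    ]
    return [{"lon": lon, "lat": lat} for lon, lat in corners]


# Illustrative bounding box (values chosen arbitrarily for this sketch)
poly = bbox_to_poly(-108.3, 38.8, -107.7, 39.2)

# Same list comprehensions the plotting cell uses to draw the region outline
region_lon = [p["lon"] for p in poly]
region_lat = [p["lat"] for p in poly]
print(len(poly), poly[0] == poly[-1])
```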

The resources (e.g. granules) to process can always be supplied in any of the processing APIs. But if they are not supplied (which is typical), then to determine which resources to process, the SlideRule server-side code uses the area of interest to make requests to NASA's Common Metadata Repository (CMR) legacy and STAC interfaces, along with USGS's The National Map interface. The server code automatically determines which interfaces should be queried and the parameters of the query needed for properly filtering results. 

In rare cases when the area of interest is very complex (e.g. a bunch of islands, or a polygon with an extremely high vertex count), the user can request that the server rasterize the area of interest and use it as a mask for determining which data to process.  See https://slideruleearth.io/web/rtd/user_guide/SlideRule.html#geojson for more details.

#### (4) Specify the processing parameters

There is a multitude of processing parameters that are available to each API.  The ones used here are:
* `poly`: area of interest
* `srt`: surface reference type; if set to -1 (or `icesat2.SRT_DYNAMIC`), then all surface types are used
* `len`: length of the extent (or variable-length segment) of along-track photon clouds to use in processing each posting
* `res`: the step size between postings

See user's guide for additional parameters: https://slideruleearth.io/web/rtd/index.html
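
To make the relationship between `len` and `res` concrete, here is a simplified model in plain Python. It assumes each posting uses one along-track window of `len` meters and that consecutive windows start `res` meters apart. The function name `extents` is hypothetical, and how the server actually anchors each extent relative to its posting is not covered here.

```python
def extents(track_start, track_end, length, step):
    """List the (start, end) along-track windows, one per posting.

    Simplified model: windows start every `step` meters and span `length` meters.
    """
    windows = []
    start = track_start
    while start + length <= track_end:
        windows.append((start, start + length))
        start += step
    return windows


# len=20, res=100 (as in Example 1): windows are separated by 80 m gaps
sparse = extents(0.0, 300.0, length=20.0, step=100.0)

# len=40, res=20: each window overlaps the next one by 20 m
overlapping = extents(0.0, 100.0, length=40.0, step=20.0)

print(sparse)
print(overlapping)
```

When `len` is smaller than `res`, some photons between postings are never used. When `len` is larger than `res`, neighboring postings share photons and are therefore not independent.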

#### (5) Make the processing request

Under the hood, this makes an HTTP request to the SlideRule service running in AWS, which runs the ATL06 surface-finding algorithm on ATL03 photons to produce elevations, and then collects the results into a `geopandas` GeoDataFrame.

The different ICESat-2 APIs available are:
* `atl03sp`: subset and filter ATL03 photons; provide custom YAPC and ATL08 classifications
* `atl03v`: fast segment level subsetting of ATL03 photons
* `atl06s`: subset the ATL06 land ice elevation product
* `atl06p`: dynamically generate ATL06 surface elevation product
* `atl08p`: dynamically generate the ATL08 vegetation density product (PhoREAL)
* `atl13p`: subset the ATL13 inland water product

### Example 2: Sample GEDI Elevation Product at ICESat-2 Dynamically Generated Postings

In [None]:
from sliderule import sliderule, icesat2, gedi

In [None]:
sliderule.init("slideruleearth.io", verbose=True);

In [None]:
parms = {
    "poly": sliderule.toregion('grandmesa.geojson')['poly'],
    "t0": '2019-11-14T00:00:00Z',
    "t1": '2019-11-15T00:00:00Z',
    "srt": icesat2.SRT_LAND,
    "len": 100,
    "res": 100,
    "pass_invalid": False, 
    "atl08_class": ["atl08_ground", "atl08_canopy", "atl08_top_of_canopy"],
    "atl08_fields": ["h_dif_ref"],
    "phoreal": {"binsize": 1.0, "geoloc": "center", "use_abs_h": False, "send_waveform": False},
    "samples": {"gedi": {"asset": "gedil3-elevation"}}
};

In [None]:
atl08 = icesat2.atl08p(parms)

In [None]:
atl08

#### Plot the results

In [None]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
plt.figure(figsize=[8,6])

d0=np.min(atl08['x_atc'])

plt.plot(atl08['x_atc']-d0, atl08['h_te_median'], 'o',  markersize=1, color='green', label='h_te_median')
plt.plot(atl08['x_atc']-d0, atl08['gedi.value'], 'o',  markersize=1, color='gray', label='gedi elevation')
hl=plt.legend(loc=3, frameon=False, markerscale=5)

plt.gca().set_ylim([1500, 3500])

#### Explanation of what's new

* The on-demand ATL08 product (different from the ICESat-2 Standard Data Product) was generated and streamed back to the user.  The ATL08 on-demand product uses the University of Texas at Austin's PhoREAL algorithm, which was integrated into SlideRule to generate customizable vegetation metrics from ATL03 photon data.

* A time range was specified in the request limiting the results to data collected only between the start and stop times supplied.

* The `"atl08_class"` parameter specified that only photons in ATL03 that were classified as `"atl08_ground"`, `"atl08_canopy"`, or `"atl08_top_of_canopy"` in the ATL08 standard data product are to be supplied to the PhoREAL algorithm and used in the results.

* The `"atl08_fields"` parameter specifies that the  `"h_dif_ref"` variable from the ATL08 standard data product is to be associated with each result returned by SlideRule.  SlideRule attempts to find the value of the variable closest in time to the dynamically generated result.

* The `"phoreal"` parameter provides the processing parameters for the PhoREAL algorithm.
  
* The `"samples"` parameter provides a list of raster datasets that SlideRule should sample at each generated result.  So for each 100m segment that PhoREAL processes, the server-side code will also sample the `gedil3-elevation` product at the latitude and longitude of that segment and return the value with the results.

For a list of raster datasets that are available to sample in SlideRule, see: https://slideruleearth.io/web/rtd/user_guide/GeoRaster.html#asset-directory
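
The time range in the request above is given as a pair of strings. As a small sketch, the strings can be generated from `datetime` objects instead of being typed by hand. The helper name `to_request_time` is hypothetical, and the format string is simply inferred from the `t0` and `t1` values used in this example.

```python
from datetime import datetime, timedelta, timezone


def to_request_time(moment):
    """Format an aware datetime like the t0/t1 strings used in this example."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# One-day window starting at midnight UTC on 2019-11-14
start = datetime(2019, 11, 14, tzinfo=timezone.utc)
stop = start + timedelta(days=1)

# These two entries could be merged into a request parameter dictionary
time_range = {"t0": to_request_time(start), "t1": to_request_time(stop)}
print(time_range)
```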

### Example 3: Produce GeoParquet of Summer and Winter ICESat-2 Tracks

In [None]:
from sliderule import sliderule, icesat2
import geopandas as gpd

In [None]:
sliderule.init(verbose=True)

In [None]:
region = sliderule.toregion("bathy.geojson");

In [None]:
# ATL03 subsetting request parameters
parms = {
    "poly": region['poly'],
    "srt": icesat2.SRT_DYNAMIC,
    "len": 100,
    "res": 100,
    "pass_invalid": True,
    "output": {
        "asset":"sliderule-stage",
        "format": "parquet",
        "as_geo": True,
        "open_on_complete": False
    }    
}

In [None]:
atl03_url = icesat2.atl03sp(parms, resources=['ATL03_20230213042035_08341807_006_02.h5'])

In [None]:
atl03_url = "s3://sliderule-public/sliderule.0000000D832589A9.geoparquet"

In [None]:
atl03 = gpd.read_parquet(atl03_url)

In [None]:
atl03.keys()

#### Plot the results

In [None]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
df = atl03
df = df[df["pair"] == icesat2.LEFT_PAIR]
df = df[df["track"] == 3]
plt.figure(figsize=[8,6])
plt.plot(df['x_atc']+df['segment_dist'], df['height'], 'o',  markersize=1, color='blue')

## Part 3: H5Coro - The HDF5 Cloud-Optimized Read-Only Python Package

`h5coro` is a pure Python implementation of a subset of the HDF5 specification that has been optimized for reading data out of S3. 

The project has its roots in SlideRule, where a new C++ implementation of the HDF5 specification was developed for performant read access to Earth science datasets stored in AWS S3. Over time, users of SlideRule began requesting the ability to performantly read HDF5 and NetCDF files out of S3 from their own Python scripts. The result is `h5coro`: a re-implementation in Python of the core HDF5 reading logic that exists in SlideRule. Since then, `h5coro` has become its own project, which will continue to grow and diverge in functionality from its parent implementation.

`h5coro` is optimized for reading HDF5 data in high-latency high-throughput environments. It accomplishes this through a few key design decisions:

* __All reads are concurrent.__ Each dataset and/or attribute read by h5coro is performed in its own thread.
* __Intelligent range gets__ are used to read as many dataset chunks as possible in each read operation. This drastically reduces the number of HTTP requests to S3 and means there is no longer a need to re-chunk the data (it actually works better on smaller chunk sizes due to the granularity of the request).
* __Block caching__ is used to minimize the number of GET requests made to S3. S3 has a large first-byte latency (we've measured it at ~60ms on our systems), which means there is a large penalty for each read operation performed. h5coro performs all reads to S3 as large block reads and then maintains data in a local cache for access to smaller amounts of data within those blocks.
* The system is __serverless__ and does not depend on any external services to read the data. This means it scales naturally as the user application scales, and it reduces overall system complexity.
* __No metadata repository is needed.__ The structure of the file is cached as it is read, so successive reads of other datasets in the same file do not have to re-read and re-build the directory structure of the file.
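
The block-caching idea can be sketched in a few lines of plain Python. This is an illustrative toy, not h5coro's implementation: the `BlockCache` class, the `fake_range_get` function, and the 1024-byte block size are all made up for this example, and an in-memory `bytes` object stands in for an object in S3.

```python
class BlockCache:
    """Serve small reads from large cached blocks to limit remote requests."""

    def __init__(self, fetch, block_size):
        self.fetch = fetch            # callable(offset, size) -> bytes
        self.block_size = block_size
        self.blocks = {}              # block index -> cached bytes
        self.requests = 0             # number of remote fetches performed

    def read(self, offset, size):
        """Return `size` bytes at `offset` (size must be positive)."""
        first = offset // self.block_size
        last = (offset + size - 1) // self.block_size
        data = bytearray()
        for index in range(first, last + 1):
            if index not in self.blocks:
                # Cache miss: fetch the whole block in one request
                self.blocks[index] = self.fetch(index * self.block_size, self.block_size)
                self.requests += 1
            data += self.blocks[index]
        skip = offset - first * self.block_size
        return bytes(data[skip:skip + size])


# Stand-in for a remote object: 4096 bytes held in memory
remote_object = bytes(range(256)) * 16


def fake_range_get(offset, size):
    """Pretend to be a ranged GET against remote storage."""
    return remote_object[offset:offset + size]


cache = BlockCache(fake_range_get, block_size=1024)
a = cache.read(10, 4)     # miss: fetches block 0
b = cache.read(500, 8)    # hit: block 0 is already cached
c = cache.read(1020, 8)   # spans blocks 0 and 1: fetches only block 1
print(cache.requests)
```

Three reads cost two fetches here. With a first-byte latency in the tens of milliseconds per request, avoiding even a few requests per dataset adds up quickly.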

### For more information:
***GitHub***: https://github.com/SlideRuleEarth/h5coro

### Example 1: Read ATL03 variables for bathymetry

In [24]:
# (1) Import modules
from h5coro import h5coro, s3driver
import earthaccess

In [25]:
# (2) Authenticate to Earth Data Login
auth = earthaccess.login()
s3_creds = auth.get_s3_credentials(daac="NSIDC")

In [28]:
# (3) Initialize h5coro object
granule = "nsidc-cumulus-prod-protected/ATLAS/ATL03/006/2023/02/13/ATL03_20230213042035_08341807_006_02.h5"
h5obj = h5coro.H5Coro(granule, s3driver.S3Driver, errorChecking=True, verbose=False, credentials=s3_creds, multiProcess=False)

In [29]:
# (4) Read the data
variables = ["/gt3l/heights/h_ph", "/gt3l/heights/dist_ph_along", "/gt3l/geolocation/segment_dist_x", "/gt3l/geolocation/segment_ph_cnt"]
promise = h5obj.readDatasets(variables, block=True, enableAttributes=False)
for variable in promise:
    print(f'{variable}: {promise[variable][0:10]}')

gt3l/heights/h_ph: [-47.941536 -51.9231   -48.09843  -47.873924 -48.12945  -48.118694
 -48.308052 -48.208042 -47.802708 -48.004234]
gt3l/heights/dist_ph_along: [0.7542868  0.76623714 1.4717534  2.187351   2.1880984  2.9048157
 2.905563   3.621905   4.337497   5.0545855 ]
gt3l/geolocation/segment_dist_x: [17068770.48934802 17068790.54479094 17068810.60023396 17068830.65567708
 17068850.7111203  17068870.76656362 17068890.82200704 17068910.87745056
 17068930.93289417 17068950.98833789]
gt3l/geolocation/segment_ph_cnt: [37 25 44 39 22 40 37 42 35 38]


#### Explanation of what happened

#### (1) Import the necessary packages to use `h5coro`. 

`h5coro` relies on `earthaccess` for authenticating to Earth Data Login.  The modules a user might want to import are:
* `s3driver`: for reading data out of an s3 bucket
* `filedriver`: for reading data out of a local file
* `webdriver`: for reading data directly over https (including objects in s3 buckets)
* `logger`: for configuring the logging in h5coro

#### (2) Authenticate to Earth Data Login

On my system, I have a `.netrc` file set up with the following line:
```
machine urs.earthdata.nasa.gov login <my_user_name> password <my_password>
```
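
If authentication fails, one quick check is to confirm that the entry parses as expected. The sketch below uses Python's standard `netrc` module on a temporary file with placeholder credentials (`example_user` and `example_password` are made up), so it does not touch your real `.netrc`.

```python
import netrc
import os
import tempfile

# Placeholder entry in the same format as the line above
entry = "machine urs.earthdata.nasa.gov login example_user password example_password\n"

# Write the entry to a temporary file so the real ~/.netrc is left alone
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as handle:
    handle.write(entry)
    path = handle.name

try:
    # authenticators() returns (login, account, password) for a known host
    credentials = netrc.netrc(path).authenticators("urs.earthdata.nasa.gov")
finally:
    os.remove(path)

login, _account, password = credentials
print(login)
```

Calling `netrc.netrc()` with no argument reads the `.netrc` file in your home directory instead.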

#### (3) Create an h5coro object for the granule that you want to read

`h5coro` is object oriented, so all context information associated with the provided granule is stored in the object.  Note that the full path to the granule is needed, including the s3 bucket.

#### (4) Read the data

`h5coro` implements an asynchronous I/O interface, meaning that when the `readDatasets` function is called, it makes a read "request" in the background and returns immediately back to the caller.  The caller receives something called a "promise" (or "future") which is a promise that data will be there in the future at some point.  You then can do other things while you wait, and when you finally need the data, you have to "block" or wait for it to be available.

In this example, I set the "block" parameter to True so that it would wait right away.  But in more sophisticated examples, other work could have been done by the notebook while waiting for the results of the read.
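
The promise pattern itself can be shown with Python's standard `concurrent.futures` module. This is a stand-in for illustration only and does not use the `h5coro` API: `slow_read` is a made-up function that simulates a high-latency read, and the variable names are shortened versions of the ones read above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_read(name):
    """Simulate a high-latency read of one dataset."""
    time.sleep(0.1)
    return f"data for {name}"


variables = ["h_ph", "dist_ph_along"]

with ThreadPoolExecutor(max_workers=len(variables)) as pool:
    # Submitting returns immediately; each future is a promise of a result
    futures = {name: pool.submit(slow_read, name) for name in variables}

    # Other work can proceed while the reads run in the background
    other_work = sum(range(1000))

    # result() blocks until the corresponding read has finished
    results = {name: future.result() for name, future in futures.items()}

print(results)
```

Because both reads run concurrently, waiting for them takes roughly as long as one read rather than the sum of the two.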