# Demonstration Notebook: GEDI L4A Subsetting on High End Computing Capability (HECC) and AWS Cloud platforms

This demonstration of a new capability under development begins after running the prerequisite and authentication/authorization sections below.

## Prerequisite: Install Python packages

To run this notebook, we need to install the following Python packages, which might not already be installed in your environment:

Required
- `requests`
- `multiprocessing_logging`
- `boto3`
- `awscli`

Optional, to visualize your result
- `geopandas`
- `matplotlib`
- `folium`
- `mapclassify`

You may be using a JupyterLab container that has these dependencies already installed.  If not, you may run the following cell to get them installed.

In [None]:
%%bash
# python -m venv ./venv/notebook
# . ./venv/notebook/bin/activate
# python -m pip install --upgrade pip
# pip install requests multiprocessing_logging boto3 awscli geopandas matplotlib folium mapclassify
conda install -c conda-forge -y --file requirements.txt

## Authentication and Authorization
In the following cell, we set up the credentials required to access our AWS and MAAP services.  These settings are used in subsequent cells below.  Eventually, we expect that none of these settings will be required and that the system will automatically propagate the required credentials after an initial login.

In [1]:
# MCP/AWS short-term keys are only needed if you want to use your own S3 bucket to stage out the results
# from the job. If you use the shared bucket that we provide for you, then no credentials are needed.
#s3_url = ""
#aws_access_key_id = ""
#aws_secret_access_key = ""
#aws_session_token = ""

# Configuration for long term access key.  This can be removed now that the ADES is updated to inject the 
# LTAK into every job's workspace.
#aws_config = {"class": "Directory", "path": "../../.aws"}

# Set MAAP Proxy Granting Ticket (PGT).  This is needed in order to access the input GEDI data files.
# This can be removed once MAAP is updated to automatically inject the PGT for use in running jobs.
maappgt = ""
print ("Credentials set up")

Credentials set up
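
Pasting a PGT into a notebook cell makes it easy to commit the ticket to version control by accident. As a minimal sketch, assuming a hypothetical environment variable named `MAAP_PGT` (this name is not defined by MAAP or by this notebook), the ticket could instead be read from the environment:

```python
import os


def load_pgt(env_var: str = "MAAP_PGT") -> str:
    """Return the PGT stored in `env_var`, or "" if the variable is unset.

    `MAAP_PGT` is a hypothetical variable name chosen for this sketch.
    """
    return os.environ.get(env_var, "").strip()


maappgt = load_pgt()
print("PGT loaded" if maappgt else "PGT not set")
```

An empty value leaves `maappgt` in the same state as the cell above, so the remaining cells behave as before.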


# Subsetting GEDI L4A data on NASA Pleiades High End Computing Capability (HECC) and AWS Cloud platforms

In this notebook we demonstrate how algorithms can be run on the High End Computing Capability (HECC) and Cloud platforms without requiring the user to understand the complexities of those platforms.  The same algorithm can be run across the two platforms without modification.  Each platform has its own advantages and drawbacks in terms of cost, throughput, etc., so each project can optimize placement of their algorithms based on their specific needs.  Furthermore, the interface to deploy and execute jobs on these remote processing clusters is compliant with international standards to ensure **portability** and **extensibility** so that configured jobs will work without modification on all supported environments. 

This notebook demonstrates how to run scalable subsetting of NASA and University of Maryland's [Global Ecosystem Dynamics Investigation (GEDI)](https://gedi.umd.edu/) L4A granules as jobs running across the [NASA Pleiades Supercomputer](https://www.nas.nasa.gov/hecc/resources/pleiades.html) and the [Amazon Web Services (AWS) Cloud](https://aws.amazon.com/).  In the Cloud, we leverage the [AWS Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/) to run the computations.  We also leverage software and interfaces developed by the MAAP HEC-AWS project to enable science users to run their notebooks in this way without knowledge of the underlying platforms (HECC or AWS).

This is a general capability that can be applied to any science algorithm notebook, but we will use the GEDI L4A subsetting algorithm as an example.  Building and deploying the algorithms themselves to the remote platforms is not covered in this notebook, although automation tools exist to make both steps easy.  We assume that an algorithm in a notebook has already been built and deployed to the remote processing cluster.  This notebook demonstrates:
1. Execution of a job on remote processing clusters on HECC or AWS
1. Monitoring of submitted jobs
1. Visualization of job output

## OGC standard approach to deploy, run, and monitor jobs on the HECC (Pleiades) and Cloud (AWS) platforms

The `ADES_WPST_SQS` interface used in the cells below is a convenience wrapper to simplify deploying algorithms and running them across High End Computing (Pleiades) and Cloud Computing (AWS) platforms.  Read on to learn about the implementation details, but that level of understanding is not necessary since the provided interface makes it entirely transparent to the end user. 

The current implementation communicates with services running on the underlying compute platforms according to an international standard called [Web Processing Service - Transactional (WPS-T)](https://eoepca.github.io/master-system-design/published/v1.0/#_wps_t_restjson), curated by the [Open Geospatial Consortium (OGC)](https://www.ogc.org/).  The [AWS Simple Queue Service (SQS)](https://aws.amazon.com/sqs/) is used as a messaging broker or intermediary, but that is done in a way that is seamless and transparent so that the complexities of that communication process are completely hidden from the end user. 
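
To make the broker idea concrete, the following is a minimal sketch of a request/reply exchange in which each request carries an identifier that is used to match it with its reply. It is not the `ADES_WPST_SQS` implementation: in-memory `queue.Queue` objects stand in for the SQS request and response queues, and the helper names (`send_request`, `serve_one`, `receive_reply`) and message layout are assumptions made for illustration.

```python
import json
import queue
import uuid


def send_request(request_queue: queue.Queue, job_type: str, **payload) -> str:
    """Put a JSON request on the queue and return its correlation id."""
    message_id = uuid.uuid4().hex
    body = {"message_id": message_id, "job_type": job_type, **payload}
    request_queue.put(json.dumps(body))
    return message_id


def serve_one(request_queue: queue.Queue, reply_queue: queue.Queue, handlers: dict) -> None:
    """Stand-in for the remote service: handle one request and post the reply."""
    request = json.loads(request_queue.get_nowait())
    result = handlers[request["job_type"]](request)
    reply_queue.put(json.dumps({"message_id": request["message_id"], "result": result}))


def receive_reply(reply_queue: queue.Queue, message_id: str):
    """Return the result whose id matches; put any other replies back on the queue."""
    unmatched = []
    try:
        while True:
            reply = json.loads(reply_queue.get_nowait())  # raises queue.Empty if no match
            if reply["message_id"] == message_id:
                return reply["result"]
            unmatched.append(reply)
    finally:
        for reply in unmatched:
            reply_queue.put(json.dumps(reply))


# Usage: one round trip through the two queues.
request_q, reply_q = queue.Queue(), queue.Queue()
handlers = {"getLandingPage": lambda request: {"api_version": "1.0"}}
msg_id = send_request(request_q, "getLandingPage")
serve_one(request_q, reply_q, handlers)
print(receive_reply(reply_q, msg_id))
```

Matching on an identifier lets several callers share one reply queue without reading each other's answers.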

In addition, the output of jobs will be staged out to an [AWS S3](https://aws.amazon.com/s3/) bucket.  We provide a default bucket with credentials pre-configured, so if you choose to use that one then you don't need to provide any configuration.  An endpoint is provided to get the location of a job's output so that you can retrieve it.  You also have the option to provide your own S3 bucket for your output, but this "bring your own bucket" option requires you to provide AWS access keys and a session token for secure access.  You can request these credentials from your cloud provider or system administrator.

For now, an additional credential is needed to support access to the GEDI data to be subsetted, but development is underway that will eliminate this requirement and automatically inject this credential into your job execution request.  This credential is required because the input data files needed for the processing are automatically downloaded on the compute platform via an API provided by NASA's [Multi-Mission Algorithm and Analysis Platform (MAAP)](https://scimaap.net/) project.  These are NASA data and require a credential called a Proxy Granting Ticket (PGT) to authenticate access.  The PGT is associated with your MAAP account when you log in to the `ops` environment.  Until that development is complete, the PGT must be passed as an input to the job.

## Interface to deploy, execute, and monitor algorithms running on the compute platforms
In the notebook cells below, we use an interface to easily deploy algorithms, execute jobs, and monitor status on the remote compute platforms.  As described above, the deployment and execution environment on each remote platform is compliant with the Open Geospatial Consortium (OGC) Web Processing Service - Transactional (WPS-T) interface, so this functionality is supported in a manner **compliant with international standards**.  As a result, this approach is **portable** to multiple processing platforms regardless of their underlying computational architecture.  Thus the system is **extensible** to additional remote platforms whether they are on-premises, in a public or private cloud, or at a high end computing center.
 
The `ADES_WPST_SQS` class, used in the cells below, wraps the various [WPS-T](https://eoepca.github.io/master-system-design/published/v1.0/#_wps_t_restjson) endpoints to perform operations on the selected remote system.  For instance, the `execute` method lets us provide input values and run a job on the remote platform.  Methods (and endpoints) are available to interact with both "processes", defined to be an algorithm and associated version, and "jobs", defined to be an instantiation of a particular algorithm that includes a set of input parameter values to be applied when the job is run.  The following list names each available endpoint, which is also the name of the `ADES_WPST_SQS` method that wraps it.

- `getLandingPage`: Describe the endpoints available on the platform
- `deployProcess`: Deploy a process
- `undeployProcess`: Undeploy a process
- `getProcesses`: Describe the processes currently deployed on a platform
- `getProcessDescription`: Get information about a particular process
- `execute`: Execute a job
- `dismiss`: Dismiss a job that has been submitted but not yet run
- `getJobList`: Describe the jobs that are currently running for a particular process
- `getStatus`: Get status and information about a particular job
- `getResult`: Get the location where results of a job can be retrieved
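
As a small illustration of how method names relate to REST routes, the sketch below records only the four verb and path pairs that are fully visible in the `getLandingPage` output shown later in this notebook. The `WPST_ROUTES` table and the `route_for` helper are hypothetical names for this example; routes for the other methods are omitted because they are not visible in that output.

```python
from typing import Tuple

# (HTTP verb, path template) pairs copied from the getLandingPage output in this notebook.
WPST_ROUTES = {
    "getLandingPage": ("GET", "/"),
    "getProcesses": ("GET", "/processes"),
    "deployProcess": ("POST", "/processes"),
    "getProcessDescription": ("GET", "/processes/<procID>"),
}


def route_for(method_name: str, proc_id: str = "") -> Tuple[str, str]:
    """Return the (verb, path) for a method, filling in the process id if needed."""
    verb, template = WPST_ROUTES[method_name]
    if "<procID>" in template:
        if not proc_id:
            raise ValueError(f"{method_name} requires a process id")
        template = template.replace("<procID>", proc_id)
    return verb, template


print(route_for("getProcessDescription", "jplzhan.gedi-subset.hec-1.0.0"))
```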

## Connect to distributed processing clusters on NASA High End Computing (Pleiades) and AWS Kubernetes (EKS)

For this example, we will connect to two servers running on two different systems: [NASA's Pleiades Supercomputer](https://www.nas.nasa.gov/hecc/resources/pleiades.html) and [AWS/EKS](https://aws.amazon.com/eks/).  The services running on both of these platforms are secured behind network firewalls.  As described above, we will use the `ADES_WPST_SQS` interface, provided by the MAAP HEC-AWS project, to deploy the GEDI Subset algorithm, run jobs, and monitor the execution status of those jobs.  This interface enables us to manage workloads on each secure platform in a way that is seamless and transparent.

### Connect to Processing Cluster on NASA High End Computing (Pleiades)

In [6]:
from ADES_WPST_SQS import ADES_WPST_SQS

# Connect to ADES-PBS on NASA Pleiades (HECC)
config_file_pbs = "./sqsconfig-pbs.py"
wpst_pbs = ADES_WPST_SQS(config_file=config_file_pbs)
print("Connected to ADES-PBS on NASA Pleiades (HECC)")

self._request_queue_name : ades-pbs-maaphec-dev-002-wpst-request
self._reply_queue_name : ades-pbs-maaphec-dev-005-wpst-response
Connected to ADES-PBS on NASA Pleiades (HECC)


### Connect to Processing Cluster on AWS Kubernetes (EKS)

In [7]:
from ADES_WPST_SQS import ADES_WPST_SQS

# Connect to ADES-K8s on AWS EKS (Kubernetes)
config_file_k8s = "./sqsconfig-k8s.py"
wpst_eks = ADES_WPST_SQS(config_file=config_file_k8s)
print("Connected to ADES-K8s on AWS EKS (Kubernetes)")

self._request_queue_name : ades-eks-maaphec-dev-001-wpst-request
self._reply_queue_name : ades-eks-maaphec-dev-002-wpst-response
Connected to ADES-K8s on AWS EKS (Kubernetes)


### [OPTIONAL] Check which WPS-T endpoints are available
Let's check the landing pages on our HECC and AWS platforms to confirm our connection has been established successfully.

In [8]:
wpst_pbs.getLandingPage()

{'job_type': 'getLandingPage'}
{'QueueUrl': 'https://sqs.us-west-2.amazonaws.com/043575651191/ades-pbs-maaphec-dev-005-wpst-response', 'ResponseMetadata': {'RequestId': 'bfde00d7-a1ad-5dbd-a7e5-5c96fe75146c', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': 'bfde00d7-a1ad-5dbd-a7e5-5c96fe75146c', 'date': 'Wed, 18 Jan 2023 02:46:48 GMT', 'content-type': 'text/xml', 'content-length': '358'}, 'RetryAttempts': 0}}
_reply_queue : ades-pbs-maaphec-dev-005-wpst-response
message_id : 155647650149547774371021128130113786754


{'ades_id': 'ades-pbs-dev-jjacob-01',
 'api_version': '1.0',
 'landingPage': {'links': [{'example': 'curl http://127.0.0.1:5000/',
    'href': '/',
    'parameters': '',
    'payload': '',
    'title': 'getLandingPage',
    'type': 'GET'},
   {'example': 'curl http://127.0.0.1:5000/processes',
    'href': '/processes',
    'parameters': '',
    'payload': '',
    'title': 'getProcesses',
    'type': 'GET'},
   {'example': 'curl -X POST http://127.0.0.1:5000/processes/proc=https://public-url/to-your-application-descriptor.json',
    'href': '/processes',
    'parameters': 'proc=<url-to-app.json>',
    'payload': '',
    'title': 'deployProcess',
    'type': 'POST'},
   {'example': 'curl http://127.0.0.1:5000/processes/<your-process-id-from-getProcesses>',
    'href': '/processes/<procID>',
    'parameters': '',
    'payload': '',
    'title': 'getProcessDescription',
    'type': 'GET'},
   {'example': 'curl -X DELETE http://127.0.0.1:5000/processes/<your-process-id-from-getProcesses>',


In [9]:
wpst_eks.getLandingPage()

{'job_type': 'getLandingPage'}
{'QueueUrl': 'https://sqs.us-west-2.amazonaws.com/043575651191/ades-eks-maaphec-dev-002-wpst-response', 'ResponseMetadata': {'RequestId': '38da312a-53fe-599a-b44d-6d6b0c19672b', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '38da312a-53fe-599a-b44d-6d6b0c19672b', 'date': 'Wed, 18 Jan 2023 02:47:03 GMT', 'content-type': 'text/xml', 'content-length': '358'}, 'RetryAttempts': 0}}
_reply_queue : ades-eks-maaphec-dev-002-wpst-response
message_id : 281088674356381489982272573544132313613


{'ades_id': 'maap-hec-ades-k8s',
 'api_version': '1.0',
 'landingPage': {'links': [{'example': 'curl http://127.0.0.1:5000/',
    'href': '/',
    'parameters': '',
    'payload': '',
    'title': 'getLandingPage',
    'type': 'GET'},
   {'example': 'curl http://127.0.0.1:5000/processes',
    'href': '/processes',
    'parameters': '',
    'payload': '',
    'title': 'getProcesses',
    'type': 'GET'},
   {'example': 'curl -X POST http://127.0.0.1:5000/processes/proc=https://public-url/to-your-application-descriptor.json',
    'href': '/processes',
    'parameters': 'proc=<url-to-app.json>',
    'payload': '',
    'title': 'deployProcess',
    'type': 'POST'},
   {'example': 'curl http://127.0.0.1:5000/processes/<your-process-id-from-getProcesses>',
    'href': '/processes/<procID>',
    'parameters': '',
    'payload': '',
    'title': 'getProcessDescription',
    'type': 'GET'},
   {'example': 'curl -X DELETE http://127.0.0.1:5000/processes/<your-process-id-from-getProcesses>',
    '

### [OPTIONAL] See Present Processes and Jobs Running

In [None]:
wpst_eks.fullResult()

In [None]:
wpst_pbs.fullResult()

### Set up a GEDI Subset Job

In order to submit a job, you need to provide the workflow's input values.  The inputs get "spoon-fed" into the algorithm on the remote platform and the outputs from the job will be staged out to a bucket on AWS S3.  As mentioned above, the `stage_out` section of the inputs is only required if you would like to provide credentials to use your own S3 bucket.  If you use the bucket provided as part of our system, then there is no need to provide the credentials since access to that bucket is pre-configured.

The branch is indicated in the `process_id` in the cells below.  You can run either the `hec` or `main` branch.  The `hec` branch supports caching of the
GEDI files on ADES-PBS.  The `main` branch does not support caching of the GEDI files.  Run just one of the two cells below, depending on which branch you
want to use.

#### Example main branch job

The `main` branch does not support caching and stages in only the GeoJSON file
that describes the area of interest (AOI).  The GEDI files that intersect the
AOI are searched for and localized in the "process" step (not in "stage-in").

In [None]:
# The branch is indicated in the process_id below.
# The hec branch supports caching of GEDI files on ADES-PBS.
# The main branch does not support caching of GEDI files.
process_id = "jplzhan.gedi-subset.main-1.0.0"
job_inputs = {
#   "stage_out": {
#      "s3_url": s3_url,
#      "aws_access_key_id": aws_access_key_id,
#      "aws_secret_access_key": aws_secret_access_key,
#      "aws_session_token": aws_session_token,
#      "region": "us-west-2",
#      "aws_config": aws_config,
#   },
   "parameters": {
      "aoi": {
         "url": "https://github.com/wmgeolab/geoBoundaries/raw/9f8c9e0f3aa13c5d07efaf10a829e3be024973fa/releaseData/gbOpen/GAB/ADM0/geoBoundaries-GAB-ADM0.geojson"
      },
      "columns": "agbd, agbd_se, l2_quality_flag, l4_quality_flag, sensitivity, sensitivity_a2, lon_lowestmode, lat_lowestmode",
      "query": "l2_quality_flag == 1 and l4_quality_flag == 1 and sensitivity > 0.95 and sensitivity_a2 > 0.95",
#      "limit": 10000,
#      "limit": 100,
#      "limit": 10,
      "limit": 2,
      "maappgt": maappgt
   }
}
print(job_inputs)

#### Example hec branch job

The `hec` branch supports caching on ADES-PBS and localizes the GEDI input 
files during the "stage_in" step.  The URLs for the GEDI files must be specified
as job parameters.

In [10]:
# The branch is indicated in the process_id below.
# The hec branch supports caching of GEDI files on ADES-PBS.
# The main branch does not support caching of GEDI files.
process_id = "jplzhan.gedi-subset.hec-1.0.0"
job_inputs = {
#   "stage_out": {
#      "s3_url": s3_url,
#      "aws_access_key_id": aws_access_key_id,
#      "aws_secret_access_key": aws_secret_access_key,
#      "aws_session_token": aws_session_token,
#      "region": "us-west-2",
#      "aws_config": aws_config,    
#    },
    "parameters": {
        "aoi": {
            "url": "https://github.com/wmgeolab/geoBoundaries/raw/9f8c9e0f3aa13c5d07efaf10a829e3be024973fa/releaseData/gbOpen/GAB/ADM0/geoBoundaries-GAB-ADM0.geojson"
        },
        "granules": {
            "url": [
                "https://data.ornldaac.earthdata.nasa.gov/protected/gedi/GEDI_L4A_AGB_Density_V2_1/data/GEDI04_A_2019146134206_O02558_01_T05641_02_002_02_V002.h5",
                "https://data.ornldaac.earthdata.nasa.gov/protected/gedi/GEDI_L4A_AGB_Density_V2_1/data/GEDI04_A_2019160192444_O02779_04_T02282_02_002_02_V002.h5"
            ],
            "maap_pgt": maappgt
        },
        "columns": "agbd, agbd_se, l2_quality_flag, l4_quality_flag, sensitivity, geolocation/sensitivity_a2",
        "query": "l2_quality_flag == 1 and l4_quality_flag == 1 and sensitivity > 0.95 and `geolocation/sensitivity_a2` > 0.95",
        "lat": "lat_lowestmode",
        "lon": "lon_lowestmode",
        "beams": "all"
    }
}
print(job_inputs)

{'parameters': {'aoi': {'url': 'https://github.com/wmgeolab/geoBoundaries/raw/9f8c9e0f3aa13c5d07efaf10a829e3be024973fa/releaseData/gbOpen/GAB/ADM0/geoBoundaries-GAB-ADM0.geojson'}, 'granules': {'url': ['https://data.ornldaac.earthdata.nasa.gov/protected/gedi/GEDI_L4A_AGB_Density_V2_1/data/GEDI04_A_2019146134206_O02558_01_T05641_02_002_02_V002.h5', 'https://data.ornldaac.earthdata.nasa.gov/protected/gedi/GEDI_L4A_AGB_Density_V2_1/data/GEDI04_A_2019160192444_O02779_04_T02282_02_002_02_V002.h5'], 'maap_pgt': ''}, 'columns': 'agbd, agbd_se, l2_quality_flag, l4_quality_flag, sensitivity, geolocation/sensitivity_a2', 'query': 'l2_quality_flag == 1 and l4_quality_flag == 1 and sensitivity > 0.95 and `geolocation/sensitivity_a2` > 0.95', 'lat': 'lat_lowestmode', 'lon': 'lon_lowestmode', 'beams': 'all'}}
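
The optional `stage_out` section is easy to get wrong when it is toggled by commenting lines in and out. A minimal sketch of a helper that assembles `job_inputs` programmatically follows. The function name `build_job_inputs` is hypothetical, and treating all five `stage_out` keys from the commented block above as mandatory is an assumption of this sketch, not a documented requirement of the service.

```python
from typing import Optional

# Keys taken from the commented-out "stage_out" block in the cells above.
STAGE_OUT_KEYS = (
    "s3_url",
    "aws_access_key_id",
    "aws_secret_access_key",
    "aws_session_token",
    "region",
)


def build_job_inputs(parameters: dict, stage_out: Optional[dict] = None) -> dict:
    """Return a job_inputs dict, adding "stage_out" only when one is supplied."""
    job_inputs = {"parameters": dict(parameters)}
    if stage_out is not None:
        # Assumption: every key listed above must be present and non-empty.
        missing = [key for key in STAGE_OUT_KEYS if not stage_out.get(key)]
        if missing:
            raise ValueError(f"stage_out is missing: {', '.join(missing)}")
        job_inputs["stage_out"] = dict(stage_out)
    return job_inputs


# Usage with the shared bucket: no stage_out section is emitted.
print(build_job_inputs({"limit": 2}))
```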


### Submit a GEDI Subset job to Pleiades
We will use the job inputs defined above and submit the job to run on the remote Pleiades supercomputer.  When a job is run, it is automatically assigned a unique identifier that can be used in subsequent calls to check the status of the job.

In [11]:
import json

job_id_pbs = wpst_pbs.execute(process_id, job_inputs)
print(job_id_pbs)

jplzhan.gedi-subset.hec-1.0.0
{
  "parameters": {
    "aoi": {
      "url": "https://github.com/wmgeolab/geoBoundaries/raw/9f8c9e0f3aa13c5d07efaf10a829e3be024973fa/releaseData/gbOpen/GAB/ADM0/geoBoundaries-GAB-ADM0.geojson"
    },
    "granules": {
      "url": [
        "https://data.ornldaac.earthdata.nasa.gov/protected/gedi/GEDI_L4A_AGB_Density_V2_1/data/GEDI04_A_2019146134206_O02558_01_T05641_02_002_02_V002.h5",
        "https://data.ornldaac.earthdata.nasa.gov/protected/gedi/GEDI_L4A_AGB_Density_V2_1/data/GEDI04_A_2019160192444_O02779_04_T02282_02_002_02_V002.h5"
      ],
      "maap_pgt": ""
    },
    "columns": "agbd, agbd_se, l2_quality_flag, l4_quality_flag, sensitivity, geolocation/sensitivity_a2",
    "query": "l2_quality_flag == 1 and l4_quality_flag == 1 and sensitivity > 0.95 and `geolocation/sensitivity_a2` > 0.95",
    "lat": "lat_lowestmode",
    "lon": "lon_lowestmode",
    "beams": "all"
  }
}
{'QueueUrl': 'https://sqs.us-west-2.amazonaws.com/043575651191/ades-pbs-m

### Check Status of the Job Submitted in Pleiades
Check the status of the submitted job by using the job identifier returned when the job was submitted for execution. Run this repeatedly until you see the status is `successful`.

In [None]:
import json
def job_status_for(process_id:str, job_id: str) -> str:
    response = wpst_pbs.getStatus(process_id, job_id)
    
    print(json.dumps(response, indent=2))
    return response["statusInfo"]["status"]

job_status = job_status_for(process_id, job_id_pbs)
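
Rerunning the status cell by hand can be replaced with a polling loop. The sketch below is one way to do that under stated assumptions: `wait_for_job` is a hypothetical helper, the default interval and timeout are arbitrary, and the terminal states other than `successful` (`failed`, `dismissed`) follow the OGC API - Processes status codes, which this service may or may not report verbatim.

```python
import time
from typing import Callable

# "successful" appears in this notebook; "failed" and "dismissed" are assumed
# from the OGC API - Processes status codes.
TERMINAL_STATES = {"successful", "failed", "dismissed"}


def wait_for_job(
    get_status: Callable[[], str],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
) -> str:
    """Call get_status() until it returns a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s} s")
        time.sleep(interval_s)


# In this notebook the call could look like:
#   final_status = wait_for_job(lambda: job_status_for(process_id, job_id_pbs))
```

Because the helper takes a callable rather than a client object, the same loop works for either platform.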

### Submit a Job to EKS
Next, we will submit the same job to a Kubernetes cluster running in the AWS Cloud.

In [None]:
job_id_k8s = wpst_eks.execute(process_id, job_inputs)
print(job_id_k8s)
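
Since the same `process_id` and `job_inputs` are accepted by both platforms, submission can be written once and applied to any number of clients. This is a minimal sketch with a hypothetical helper, `submit_to_all`. It assumes only that each client exposes the `execute(process_id, job_inputs)` method used above, and it records an exception instead of a job identifier when a platform rejects the request.

```python
def submit_to_all(clients: dict, process_id: str, job_inputs: dict) -> dict:
    """Call execute() on every client; map platform name to job id or exception."""
    results = {}
    for name, client in clients.items():
        try:
            results[name] = client.execute(process_id, job_inputs)
        except Exception as exc:  # keep going so one platform cannot block the other
            results[name] = exc
    return results


# In this notebook the call could look like:
#   job_ids = submit_to_all({"pbs": wpst_pbs, "eks": wpst_eks}, process_id, job_inputs)
```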

### Verify Job Status
Check the status of the submitted job by using the job identifier returned when the job was submitted for execution.  Run this repeatedly until the `status` is `successful`.

In [None]:
import json
def job_status_for(process_id:str, job_id: str) -> str:
    response = wpst_eks.getStatus(process_id, job_id)
    
    print(json.dumps(response, indent=2))
    return response["statusInfo"]["status"]
    
job_status = job_status_for(process_id, job_id_k8s)

### Get location of job results

If the job has completed successfully, you can get the location on S3 where the job's output was staged for you to access it.

In [None]:
job_result = wpst_pbs.getResult(process_id, job_id_pbs)
print(job_result)

## Visually Verify the Results

### Localize the output file

Use the AWS CLI to download the output file from S3.

In [None]:
!aws s3 cp s3://maap-hec-ades-out-dev/jplzhan/gedi_subset/jplzhan.gedi-subset.main-1.0.0-6b86b462751c7983d3c2fb01141e7aa45d667e5d/output/gedi_subset.gpkg gedi_subset.gpkg
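
The same download can be done from Python with `boto3`, which is already in the required package list. The sketch below assumes that you have extracted an `s3://` URL string from the `getResult` response yourself (the response layout is not shown in this notebook), and both helper names are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(s3_url: str) -> Tuple[str, str]:
    """Split "s3://bucket/key/parts" into ("bucket", "key/parts")."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_s3_object(s3_url: str, local_path: str) -> None:
    """Download one object using the credentials boto3 finds in the environment."""
    import boto3  # imported lazily so split_s3_url works without boto3 installed

    bucket, key = split_s3_url(s3_url)
    boto3.client("s3").download_file(bucket, key, local_path)


print(split_s3_url("s3://maap-hec-ades-out-dev/jplzhan/gedi_subset/example/output/gedi_subset.gpkg"))
```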

### Static visualization with geopandas
If you installed the `geopandas` and `matplotlib` Python packages, you can visually verify the output file by running the following cell.

In [None]:
try:
    import geopandas as gpd
    import matplotlib.pyplot as plt  # confirms that the plotting backend is installed
except ImportError:
    print(
        "If you wish to visually verify your output file, "
        "you must install the `geopandas` and `matplotlib` packages."
    )
else:
    gedi_gdf = gpd.read_file("gedi_subset.gpkg")
    # Plot at most 1000 randomly chosen rows; sample() raises if n exceeds the row count.
    gedi_gdf = gedi_gdf.sample(n=min(1000, len(gedi_gdf)))
    # gedi_gdf = gpd.read_file("gedi_subset.gpkg", rows=1000)
    print("done reading")
    # Pass the colormap by name; plt.cm.get_cmap is deprecated since Matplotlib 3.7.
    gedi_gdf.plot(column="agbd", cmap="viridis_r")

### Dynamic visualization with Folium/Leaflet

Folium provides a more interactive visualization experience that lets you pan and zoom with the output data overlaid on a map.

In [None]:
# !pip install fiona geopandas folium matplotlib mapclassify
import folium
m = gedi_gdf.explore(
     #m=m, # pass the map object
     color="red", # use red color on all points
     marker_kwds=dict(radius=5, fill=True), # make marker radius 5px with fill
     tooltip="filename", # show "filename" column in the tooltip
     tooltip_kwds=dict(labels=False), # do not show column label in the tooltip
     name="gedi_subset" # name of the layer in the map
)

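# Stamen tiles are no longer built into recent folium releases, so a built-in CartoDB layer is used here.
folium.TileLayer("CartoDB positron", control=True).add_to(m)  # use folium to add alternative tiles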
folium.LayerControl().add_to(m)  # use folium to add layer control
m

### Direct view

The data can also be viewed directly in a tabular format.

In [None]:
gedi_gdf