# Harmonizing data located within and outside of the NASA Earthdata Cloud

---

## Timing

- Exercise: 45 min


---

## Summary

This tutorial combines several workflow steps and components from the previous days. It demonstrates how to use the geolocation of data stored outside the Earthdata Cloud to access coincident variables from cloud-hosted data. This is likely to be a common use case as NASA Earthdata continues to migrate to the cloud, producing a "hybrid" data archive across Amazon Web Services (AWS) and the original on-premise data storage systems. You may also want to combine field measurements with remote sensing data available on the Earthdata Cloud.

This specific example explores the harmonization of the ICESat-2 ATL03 data product, currently (as of November 2021) available publicly via direct download at the NSIDC DAAC, with Sea Surface Temperature variables available from PO.DAAC on the Earthdata Cloud. 


### Objectives

[TODO]

---

### Import packages

In [40]:
# import getpass
# from requests.auth import HTTPBasicAuth
import os
from pathlib import Path
# from xml.etree import ElementTree as ET
# import time
# import zipfile
# import io
# import shutil

import netrc
import requests
from pprint import pprint

# import json
# from urllib import request

import xarray as xr
# import icepyx as ipx

### Determine storage location of datasets of interest

First, let's see whether our datasets of interest reside in the Earthdata Cloud or "on premise" (also called "on prem") at a local data center.

Background from CMR API [TODO: consider removing]:
The cloud_hosted parameter can be set to “true” or “false”. When true, the results will be restricted to collections that have a DirectDistributionInformation element or have been tagged with gov.nasa.earthdatacloud.s3.

We are building off of the CMR introductory tutorial, beginning with a collection search.

In [4]:
cmr_search_url = 'https://cmr.earthdata.nasa.gov/search'

We want to search by collection to inspect the access and service options that exist:

In [6]:
cmr_collection_url = f'{cmr_search_url}/collections'

In the CMR introduction tutorial, we explored cloud-hosted collections from different DAAC providers, and identified the CMR concept-id for a given dataset id (also referred to as a short_name). Here we'll start with two datasets that we want to explore over a coincident area and time:

In [7]:
modis_name = 'MODIS_A-JPL-L2P-v2019.0'
icesat2_name = 'ATL03'

As in the intro tutorial, we first determine which concept-ids are returned for the MODIS dataset. Start by retrieving collection results based on the MODIS `short_name`:

In [10]:
response = requests.get(cmr_collection_url, 
                        params={
                            'short_name': modis_name,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
response = response.json()

For each collection result, print out the CMR concept-id and version:

In [14]:
collections = response['feed']['entry']

for collection in collections:
    print(f'{collection["id"]} {"version:"}{collection["version_id"]}')

C1940473819-POCLOUD version:2019.0
C1693233348-PODAAC version:2019.0


Two collections are returned, both at version 2019.0. The suffix of each id shows that one belongs to the "POCLOUD" provider and the other to "PODAAC". That gives us a clue about where the data are hosted, but we can also set the `cloud_hosted` parameter to `True` to confirm.

In [17]:
response = requests.get(cmr_collection_url, 
                        params={
                            'short_name': modis_name,
                            'cloud_hosted': 'True',
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
response = response.json()

In [18]:
collections = response['feed']['entry']

for collection in collections:
    print(f'{collection["id"]} {"version:"}{collection["version_id"]}')

C1940473819-POCLOUD version:2019.0


We will save this concept-id to use later on when we access the data granules.

In [24]:
modis_cloud_id = collections[0]["id"]

Now we will check which ids are returned for our ICESat-2 dataset name.

In [28]:
response = requests.get(cmr_collection_url, 
                        params={
                            'short_name': icesat2_name,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
response = response.json()

In [29]:
collections = response['feed']['entry']

for collection in collections:
    print(f'{collection["id"]} {"version:"}{collection["version_id"]}')

C1705401930-NSIDC_ECS version:003
C1997321091-NSIDC_ECS version:004


Two separate datasets exist in the CMR, one at version 3 and one at version 4. Let's see if these are `cloud_hosted`:

In [38]:
response = requests.get(cmr_collection_url, 
                        params={
                            'short_name': icesat2_name,
                            'cloud_hosted': 'True',
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
response = response.json()

In [39]:
collections = response['feed']['entry']

for collection in collections:
    print(f'{collection["id"]} {"version:"}{collection["version_id"]}')

Nothing is returned! We have now determined that we have a copy of the MODIS dataset in the cloud, whereas the ICESat-2 dataset (both versions) remains "on premise", residing in a local data center. 
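
The checks above can be condensed into a small helper. The following is a minimal sketch, not part of the original workflow: `summarize_collections` is a hypothetical name, and `sample_response` is a hand-built stand-in that mimics only the `feed`/`entry` fields used in this tutorial, so it runs without a network call.

```python
def summarize_collections(cmr_json):
    """Return (concept_id, provider, version) for each collection entry."""
    rows = []
    for entry in cmr_json['feed']['entry']:
        concept_id = entry['id']
        # The provider is the text after the first hyphen, e.g. 'POCLOUD'
        provider = concept_id.split('-', 1)[1]
        rows.append((concept_id, provider, entry['version_id']))
    return rows


# Hand-built stand-in for a parsed CMR JSON response (ids copied from the MODIS output above)
sample_response = {
    'feed': {
        'entry': [
            {'id': 'C1940473819-POCLOUD', 'version_id': '2019.0'},
            {'id': 'C1693233348-PODAAC', 'version_id': '2019.0'},
        ]
    }
}

summary = summarize_collections(sample_response)
for concept_id, provider, version in summary:
    print(f'{concept_id} provider:{provider} version:{version}')
```

In the notebook, the same function could be applied to the dictionary returned by `response.json()` for either dataset.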

### Determine size of ATL03 data over area of interest 
(determine large size; need to keep geographic region small)

#### Specify time and area of interest 

The `bounding_box` and `temporal` variables defined here are used for data search, subsetting, and access below.

(Quick demo of OpenAltimetry or Earthdata Search for exploration of ICESat-2 or both datasets??)

In [None]:
# # Bounding Box spatial parameter in decimal degree 'W,S,E,N' format. TODO: Show on a simple map??
# bounding_box = '-31.68073,61.21566,-12.15967,83.56771'

# # Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
# temporal = '2021-01-01T00:00:00Z,2021-01-01T23:59:59Z'


# Bounding Box spatial parameter in decimal degree 'W,S,E,N' format.
bounding_box = '-62.8,81.7,-56.4,83'

# Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
temporal = '2019-06-22T00:00:00Z,2019-06-22T23:59:59Z'
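
Because both parameters are plain strings, a typo can go unnoticed until a search returns nothing. The sketch below shows one possible way to sanity-check them before sending a request. The helper names `parse_bounding_box` and `parse_temporal` are hypothetical, and the checks are deliberately loose (for example, west may be greater than east for a box that crosses the antimeridian).

```python
from datetime import datetime


def parse_bounding_box(bbox):
    """Split a 'W,S,E,N' string into four floats and check basic ranges."""
    west, south, east, north = (float(x) for x in bbox.split(','))
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError('longitude must be between -180 and 180')
    if not (-90 <= south <= north <= 90):
        raise ValueError('latitude must satisfy -90 <= south <= north <= 90')
    return west, south, east, north


def parse_temporal(temporal_range):
    """Split a 'start,end' string into two datetime objects."""
    start, end = (datetime.strptime(t, '%Y-%m-%dT%H:%M:%SZ')
                  for t in temporal_range.split(','))
    if end < start:
        raise ValueError('end time is before start time')
    return start, end


# Same values as the tutorial's bounding_box and temporal variables
bbox_values = parse_bounding_box('-62.8,81.7,-56.4,83')
time_values = parse_temporal('2019-06-22T00:00:00Z,2019-06-22T23:59:59Z')
print(bbox_values)
print(time_values)
```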

In [None]:
url = f'{cmr_search_url}/granules'
response = requests.get(url, 
                        params={
                            'concept_id': 'C1997321091-NSIDC_ECS',
                            'temporal': temporal,
                            'bounding_box': bounding_box,
                            'page_size': 200,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
print(response.headers['CMR-Hits'])

In [None]:
granules = response.json()['feed']['entry']

for granule in granules:
    print(f'{granule["producer_granule_id"]} {granule["granule_size"]} {granule["links"][0]["href"]}')
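
To get a single number for the data volume, the per-granule sizes can be summed. This is a minimal sketch under two assumptions: `granule_size` is a string holding a size in MB (as the size calculation later in this notebook treats it), and some entries may lack the field. The helper name `total_size_mb` and the sample entries are made up for illustration and are not real granules.

```python
def total_size_mb(granule_entries):
    """Sum the 'granule_size' field, skipping entries that lack it."""
    return sum(float(g['granule_size'])
               for g in granule_entries
               if 'granule_size' in g)


# Made-up entries that mimic the fields used above (not real granules)
sample_granules = [
    {'producer_granule_id': 'example_granule_1.h5', 'granule_size': '1500.5'},
    {'producer_granule_id': 'example_granule_2.h5', 'granule_size': '2000.25'},
    {'producer_granule_id': 'example_granule_3.h5'},
]

total = total_size_mb(sample_granules)
print(f'{len(sample_granules)} granules, {total:.2f} MB in total')
```

In the notebook, `total_size_mb(granules)` would report the volume for the search above.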

### Download ICESat-2 ATL03 granule
We've found 2 granules. We'll download the first one and write it to a file with the same name as the `producer_granule_id`.

We also need the URL for the granule. This is the first `href` link we printed out above.

In [None]:
icesat_id = granules[0]["producer_granule_id"]
icesat_url = granules[0]['links'][0]['href']

You need Earthdata Login credentials to download data from NASA DAACs. These are the credentials you stored in the `.netrc` file you set up in previous tutorials.

We'll use the `netrc` package to retrieve your login and password without exposing them.

In [None]:
info = netrc.netrc()
login, account, password = info.authenticators('urs.earthdata.nasa.gov')
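
If no entry exists for the host, `authenticators` returns `None` and the tuple unpacking above fails with a `TypeError`. The sketch below illustrates both cases using a throwaway file with placeholder credentials (`demo_user` and `demo_pass` are invented), so it does not touch your real `.netrc`.

```python
import netrc
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a temporary netrc-style file with placeholder credentials
    demo_path = os.path.join(tmp_dir, 'demo_netrc')
    with open(demo_path, 'w') as f:
        f.write('machine urs.earthdata.nasa.gov login demo_user password demo_pass\n')

    demo_info = netrc.netrc(demo_path)
    # A matching host returns a (login, account, password) tuple
    demo_login, _, demo_password = demo_info.authenticators('urs.earthdata.nasa.gov')
    # A host with no entry (and no default entry) returns None
    missing = demo_info.authenticators('example.invalid')

print(demo_login)
print(missing)
```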

To retrieve the granule data, we use the `requests.get()` method, passing Earthdata login credentials as a `tuple` using the `auth` keyword.

In [None]:
r = requests.get(icesat_url, auth=(login, password))

The response returned by `requests` has the same structure as all the other responses: a header and contents. The header holds information about the response, including the size in bytes of the data we downloaded.

In [None]:
for k, v in r.headers.items():
    print(f'{k}: {v}')
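
The size is usually reported in the `Content-Length` header, although a server may omit it. As a small sketch, the hypothetical helper below converts that value to megabytes. The `sample_headers` dictionary is invented for illustration; in the notebook you would pass `r.headers`, which supports the same `get` method.

```python
def content_length_mb(headers):
    """Return the Content-Length header in megabytes, or None if it is absent."""
    length = headers.get('Content-Length')
    if length is None:
        return None
    return int(length) / 1024 ** 2


# Invented header values, for illustration only
sample_headers = {
    'Content-Type': 'application/octet-stream',
    'Content-Length': '2097152',
}

size_mb = content_length_mb(sample_headers)
print(f'{size_mb:.1f} MB')
```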

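The contents need to be saved to a file. To keep the directory clean, we will create a `downloads` directory to store the file. We can use a shell command to do this or use the `makedirs` function from the `os` package. Passing `exist_ok=True` avoids an error if the directory already exists, for example when the cell is run a second time.

In [None]:
os.makedirs('downloads', exist_ok=True)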

You should see a `downloads` directory in the file browser.

To write the data to a file, we use `open` to open a file. We need to specify that the file is open for writing by using the _write-mode_ `w`. We also need to specify that we want to write bytes by setting the _binary-mode_ `b`. This is important because the response contents are bytes and the default mode for `open` is _text-mode_, so make sure you use `b`.

We'll use the `with` statement _context-manager_ to open the file, write the contents of the response, and then close the file. Once the data in `r.content` is written successfully to the file, or if there is an error, the file is closed by the _context-manager_.

We also need to prepend the `downloads` path to the filename.  We do this using `Path` from the `pathlib` package in the standard library.

In [None]:
outfile = Path('downloads', icesat_id)
with open(outfile, 'wb') as f:
    f.write(r.content)
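
Writing `r.content` holds the whole granule in memory first. For large files, one alternative is to write the data in pieces. The sketch below shows only the file-writing half, using in-memory byte chunks so that it runs offline; `save_chunks` is a hypothetical helper. With `requests`, the chunks could come from `r.iter_content(chunk_size=...)` on a response requested with `stream=True`, which is not exercised here.

```python
import tempfile
from pathlib import Path


def save_chunks(chunks, outfile):
    """Write an iterable of byte chunks to outfile and return the byte count."""
    written = 0
    # 'wb' opens the file for writing in binary mode
    with open(outfile, 'wb') as f:
        for chunk in chunks:
            written += f.write(chunk)
    return written


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir, 'demo.bin')
    # Stand-in chunks; a real download would yield much larger pieces
    n_bytes = save_chunks([b'abc', b'defg', b''], demo_file)
    round_trip = demo_file.read_bytes()

print(n_bytes)
```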

Check to make sure it is downloaded.

In [None]:
ls -l ./downloads

`ATL03_20190622061415_12980304_004_01.h5` is an HDF5 file. `xarray` can open this, but you need to tell it which group to read the data from. In this case we read the height data for ground track 1, left beam. Because the file was saved in the `downloads` directory, we pass the full `outfile` path rather than the bare file name.

In [None]:
ds = xr.open_dataset(outfile, group='/gt1l/heights')
ds
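
ICESat-2 records six beams, named by pair number (1 to 3) and side (`l` or `r`). Assuming the other beams follow the same group layout as `gt1l`, the group paths can be generated rather than typed out. This is a sketch only; a given granule may not contain data for every beam.

```python
# Beam names: three pairs, each with a left and a right beam
ground_tracks = [f'gt{pair}{side}' for pair in (1, 2, 3) for side in ('l', 'r')]

# Group path of the photon heights for each beam (layout assumed from '/gt1l/heights')
height_groups = [f'/{track}/heights' for track in ground_tracks]

for group in height_groups:
    print(group)
```

Each path could then be passed as the `group` argument of `xr.open_dataset`.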

### Determine subsetting capabilities of ATL03

Consider removing since we see basic service info in the collection-level tags (has_spatial_subsetting, etc.)

In [None]:
# # CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
# # url = f'{CMR_OPS}/{"collections"}'

# response = requests.get(url, 
#                         params={
#                             'concept_id': 'C1997321091-NSIDC_ECS',
# #                            'cloud_hosted': 'True',
#                             },
#                         headers={
#                             'Accept': 'application/json'
#                             }
#                        )
# response = response.json()
# services = response['feed']['entry'][0]['associations']['services']
# output_format = "umm_json"
# service_url = "https://cmr.earthdata.nasa.gov/search/services"
# for i in range(len(services)):
#     response = requests.get(f"{service_url}.{output_format}?concept-id={services[i]}")
#     response = response.json()
#     if 'ServiceOptions' in response['items'][0]['umm']: pprint(response['items'][0]['umm']['ServiceOptions'])

## Direct Download of ATL03 

- Describe what services are available, including icepyx (provide references), but just direct download for simplicity

- Describe that this is being "downloaded" to our cloud environment - what does that mean in terms of cost, etc.

Icepyx workflow below...

In [None]:
# # # Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
# # temporal = '2021-01-01T00:00:00Z,2021-01-31T23:59:59Z'

# #icepyx params

# #convert bounding_box which is a string, to a list of floats

# to_list = bounding_box.split(",")
# icepyx_spatial = [float(x) for x in to_list]
# print(icepyx_spatial)

# # New date range since icepyx provides dates as YYYY-MM-DD

# date_range = ['2019-06-22','2019-06-22']

# region_a = ipx.Query(icesat2_name, icepyx_spatial, date_range)

In [None]:
# region_a.visualize_spatial_extent()

In [None]:
# region_a.avail_granules(ids=True)

In [None]:
# region_a.subsetparams()

In [None]:
# earthdata_uid = 'uid'
# email = 'email'
# region_a.earthdata_login(earthdata_uid, email)

In [None]:
# region_a.order_granules()

In [None]:
# path = './download'
# region_a.download_granules(path)

In [None]:
# #Set NSIDC data access base URL
# base_url = 'https://n5eil02u.ecs.nsidc.org/egi/request'

# # bounding box search and subset:
# param_dict = {'short_name': 'ATL03', 
#               'version': '004', 
#               'temporal': temporal, 
#               'bounding_box': bounding_box, 
#               'bbox': bounding_box, 
#               'coverage': '/gt1r/heights/h_ph,/gt1l/heights/h_ph,/gt2r/heights/h_ph,/gt2l/heights/h_ph,/gt1r/heights/lat_ph,/gt1l/heights/lat_ph,/gt2r/heights/lat_ph,/gt2l/heights/lat_ph,/gt1r/heights/lon_ph,/gt1l/heights/lon_ph,/gt2r/heights/lon_ph,/gt2l/heights/lon_ph',
#               'page_size': '10', 
#               'request_mode': 'async',
# #              'token' : _token,
#              }

# #Remove blank key-value-pairs
# param_dict = {k: v for k, v in param_dict.items() if v != ''}

# #Convert to string
# param_string = '&'.join("{!s}={!r}".format(k,v) for (k,v) in param_dict.items())
# param_string = param_string.replace("'","")

# API_request = f'{base_url}?{param_string}'
# print(API_request)

In [None]:
# #TODO: Need to make code much simpler!!

# # Create an output folder if the folder does not already exist.

# path = str(os.getcwd() + '/Outputs')
# if not os.path.exists(path):
#     os.mkdir(path)

# # For all requests other than spatial file upload, use get function
# session = requests.session()
# request = session.get(base_url, params=param_dict)

# print('Request HTTP response: ', request.status_code)

# # Raise bad request: Loop will stop for bad response code.
# request.raise_for_status()
# print('Order request URL: ', request.url)
# esir_root = ET.fromstring(request.content)
# print('Order request response XML content: ', request.content)

# #Look up order ID
# orderlist = []   
# for order in esir_root.findall("./order/"):
#     orderlist.append(order.text)
# orderID = orderlist[0]
# print('order ID: ', orderID)

# #Create status URL
# statusURL = base_url + '/' + orderID
# print('status URL: ', statusURL)

# #Find order status
# request_response = session.get(statusURL)    
# print('HTTP response from order response URL: ', request_response.status_code)

# # Raise bad request: Loop will stop for bad response code.
# request_response.raise_for_status()
# request_root = ET.fromstring(request_response.content)
# statuslist = []
# for status in request_root.findall("./requestStatus/"):
#     statuslist.append(status.text)
# status = statuslist[0]
# print('Data request is submitting...')
# print('Initial request status is ', status)

# #Continue loop while request is still processing
# while status == 'pending' or status == 'processing': 
#     print('Status is not complete. Trying again.')
#     time.sleep(10)
#     loop_response = session.get(statusURL)

# # Raise bad request: Loop will stop for bad response code.
#     loop_response.raise_for_status()
#     loop_root = ET.fromstring(loop_response.content)

# #find status
#     statuslist = []
#     for status in loop_root.findall("./requestStatus/"):
#         statuslist.append(status.text)
#     status = statuslist[0]
#     print('Retry request status is: ', status)
#     if status == 'pending' or status == 'processing':
#         continue

# #Order can either complete, complete_with_errors, or fail:
# # Provide complete_with_errors error message:
# if status == 'complete_with_errors' or status == 'failed':
#     messagelist = []
#     for message in loop_root.findall("./processInfo/"):
#         messagelist.append(message.text)
#     print('error messages:')
#     pprint(messagelist)

# # Download zipped order if status is complete or complete_with_errors
# if status == 'complete' or status == 'complete_with_errors':
#     downloadURL = 'https://n5eil02u.ecs.nsidc.org/esir/' + orderID + '.zip'
#     print('Zip download URL: ', downloadURL)
#     print('Beginning download of zipped output...')
#     zip_response = session.get(downloadURL)
#     # Raise bad request: Loop will stop for bad response code.
#     zip_response.raise_for_status()
#     with zipfile.ZipFile(io.BytesIO(zip_response.content)) as z:
#         z.extractall(path)
#     print('Data request is complete.')
# else: print('Request failed.')

#### Clean up Outputs folder by removing individual granule folders 

In [None]:
# for root, dirs, files in os.walk(path, topdown=False):
#     for file in files:
#         try:
#             shutil.move(os.path.join(root, file), path)
#         except OSError:
#             pass
#     for name in dirs:
#         os.rmdir(os.path.join(root, name))

### Determine size of SAR data

In [None]:
# CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
# url = f'{CMR_OPS}/{"granules"}'

# response = requests.get(url, 
#                         params={
#                             'concept_id': 'C1214472336-ASF',
# #                             'concept_id': 'C1997321091-NSIDC_ECS',
#                             'bounding_box': bounding_box,
#                             'temporal': temporal,
#                             },
#                         headers={
#                             'Accept': 'application/json'
#                             }
#                        )
# granules = response.json()['feed']['entry']
# #pprint(granules)

# results = json.loads(response.content)
# granules = []
# granules.extend(results['feed']['entry'])
# hits = int(response.headers['CMR-Hits'])
# print(f"Found {hits} files")
# granule_sizes = [float(granule['granule_size']) for granule in granules]
# print(f"The total size of all files is {sum(granule_sizes):.2f} MB")

### Determine variables of interest: SST, ocean color, chemistry...

In [None]:
# CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
# url = f'{CMR_OPS}/{"collections"}'

response = requests.get(cmr_collection_url, 
                        params={
                            'concept_id': 'C1940475563-POCLOUD',
                            'cloud_hosted': 'True',
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
response = response.json()
variables = response['feed']['entry'][0]['associations']['variables']
output_format = "umm_json"
var_url = "https://cmr.earthdata.nasa.gov/search/variables"
for variable_id in variables:
    # Look up the UMM record for each variable associated with the collection
    var_response = requests.get(f"{var_url}.{output_format}?concept-id={variable_id}")
    umm = var_response.json()['items'][0]['umm']
    # print(umm)
    if 'Name' in umm:
        pprint(umm['Name'])

### Pull those variables into xarray "in place"

#### First, we need to determine the granules returned from our time and area of interest

In [None]:
gran_url = f'{cmr_search_url}/granules'

response = requests.get(gran_url, 
                        params={
                            'concept_id': 'C1940475563-POCLOUD',
                            'temporal': temporal,
                            'bounding_box': bounding_box,
                            'page_size': 200,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
print(response.status_code)

granules = response.json()['feed']['entry']
print(response.headers['CMR-Hits'])

for granule in granules:
    print(f'{granule["dataset_id"]} {granule["id"]} {granule["links"][0]["href"]}')

### Use geolocation of ICESat-2 to define the single transect used to pull coincident ocean data out from array
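
One possible approach, sketched here with plain Python and invented toy values, is to take the nearest grid cell for each point along the track. The helper names `nearest_index` and `sample_along_track`, the grid, and the track coordinates are all hypothetical. With `xarray`, a comparable lookup is available through `sel(method='nearest')`. The sketch ignores longitude wrap-around at the antimeridian.

```python
def nearest_index(axis_values, target):
    """Return the index of the axis value closest to target."""
    return min(range(len(axis_values)), key=lambda i: abs(axis_values[i] - target))


def sample_along_track(grid_lats, grid_lons, field, track_lats, track_lons):
    """Pick the nearest grid cell of a 2D field for each (lat, lon) track point."""
    samples = []
    for lat, lon in zip(track_lats, track_lons):
        i = nearest_index(grid_lats, lat)
        j = nearest_index(grid_lons, lon)
        samples.append(field[i][j])
    return samples


# Toy grid: each cell holds 10 * row + column so results are easy to check
grid_lats = [81.5, 82.0, 82.5, 83.0]
grid_lons = [-63.0, -61.0, -59.0, -57.0]
field = [[10 * i + j for j in range(len(grid_lons))] for i in range(len(grid_lats))]

# Invented track points inside the tutorial's bounding box
track_lats = [81.7, 82.4, 82.9]
track_lons = [-62.8, -60.1, -56.4]

transect = sample_along_track(grid_lats, grid_lons, field, track_lats, track_lons)
print(transect)
```

Nearest-neighbor sampling keeps the example short; interpolation between cells would be another option.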

### Create a plot of the single transect of gridded data 

(bonus: time series) - describe what this means to egress out of the cloud versus pulling the original data down (benefit to processing in the cloud)

## Exercise

In [None]:
# maybe an exercise that builds off of previous tutorials? Discover services or storage in CMR?

---

## Resources (optional)

---

## Conclusion