# Harmonizing data located within and outside of the NASA Earthdata Cloud

---

## Timing

- Exercise: 45 min


---

## Summary

This tutorial combines several workflow steps and components from the previous days. It demonstrates how to use the geolocation of data stored outside the Earthdata Cloud to access coincident variables from cloud-hosted data. This is likely to be a common use case as NASA Earthdata continues to migrate to the cloud, producing a "hybrid" data archive split between Amazon Web Services (AWS) and the original on-premise data storage systems. You may also want to combine field measurements with remote sensing data available on the Earthdata Cloud.

This specific example explores the harmonization of the ICESat-2 ATL03 data product, currently (as of November 2021) available publicly via direct download from the NSIDC DAAC, with sea surface temperature variables available from PO.DAAC on the Earthdata Cloud.


### Objectives




---

### Import packages

In [4]:
# pip install icepyx

In [5]:
# import getpass
# from requests.auth import HTTPBasicAuth

import requests
import netrc
from pprint import pprint
import os
from xml.etree import ElementTree as ET
import time
import zipfile
import io
import shutil
import json
from urllib import request

import xarray as xr
import h5py
# import icepyx as ipx

### Determine storage location of datasets of interest

First, let's see whether our datasets of interest reside in the Earthdata Cloud or "on premise" ("on prem") at a local data center.

Background from CMR API (consider removing):
The `cloud_hosted` parameter can be set to "true" or "false". When true, the results will be restricted to collections that have a `DirectDistributionInformation` element or have been tagged with `gov.nasa.earthdatacloud.s3`.
`curl "https://cmr.earthdata.nasa.gov/search/collections?cloud_hosted=true"`
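
The `curl` query above can also be expressed in Python. The following is a minimal sketch that only builds the request URL with the standard library, so it runs without network access. The helper name `build_collections_url` is hypothetical and not part of CMR or any library.

```python
from urllib.parse import urlencode

CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'

def build_collections_url(cloud_hosted=True, **params):
    """Build a CMR collection-search URL (hypothetical helper)."""
    # Use the lowercase "true"/"false" spelling shown in the CMR documentation
    query = {'cloud_hosted': str(cloud_hosted).lower(), **params}
    return f'{CMR_OPS}/collections?{urlencode(query)}'

print(build_collections_url())
print(build_collections_url(short_name='MODIS_T-JPL-L2P-v2019.0'))
```

The resulting string could be passed to `requests.get()` directly, although passing a `params` dictionary, as in the cells below, achieves the same thing.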

### Declare datasets of interest

Identify the dataset ID (the `concept_id`) that CMR uses internally to designate each dataset.

In [6]:
# sentinel_name = 'SENTINEL-1A_DP_META_GRD_MEDIUM'
# # sentinel_id = 'C1214472336-ASF'

modis_name = 'MODIS_T-JPL-L2P-v2019.0'

icesat2_name = 'ATL03'
# icesat2_id = 'C1997321091-NSIDC_ECS'

In [7]:
CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
url = f'{CMR_OPS}/collections'

# Search CMR for every collection that shares the MODIS short name
response = requests.get(url, 
                        params={
                            'short_name': modis_name,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
collections = response.json()['feed']['entry']

# Print the concept ID and version of each matching collection
for collection in collections:
    print(f'{collection["id"]} version:{collection["version_id"]}')

C1940475563-POCLOUD version:2019.0
C1693233387-PODAAC version:2019.0


Start with the MODIS dataset, setting the `cloud_hosted` parameter to True:

In [8]:
# CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
# url = f'{CMR_OPS}/{"collections"}'

response = requests.get(url, 
                        params={
                            'concept_id': 'C1940475563-POCLOUD',
                            'cloud_hosted': 'True',
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
collections = response.json()['feed']['entry']
pprint(collections)

[{'archive_center': 'NASA/JPL/PODAAC',
  'associations': {'services': ['S1962070864-POCLOUD', 'S2004184019-POCLOUD'],
                   'tools': ['TL2108419875-POCLOUD', 'TL2092786348-POCLOUD'],
                   'variables': ['V1997811750-POCLOUD',
                                 'V1997811794-POCLOUD',
                                 'V2112014697-POCLOUD',
                                 'V1997811877-POCLOUD',
                                 'V1997811902-POCLOUD',
                                 'V2028668027-POCLOUD',
                                 'V2028632036-POCLOUD',
                                 'V1997811775-POCLOUD',
                                 'V1997811764-POCLOUD',
                                 'V1997811783-POCLOUD',
                                 'V2112014702-POCLOUD',
                                 'V2028632034-POCLOUD',
                                 'V2112014700-POCLOUD',
                                 'V1997811759-POCLOUD',
                    

Now we will try our ICESat-2 dataset to see which IDs are returned for a given dataset name.

In [9]:
response = requests.get(url, 
                        params={
                            'short_name': icesat2_name,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
collections = response.json()['feed']['entry']
# pprint(collections)

for collection in collections:
    print(f'{collection["id"]} {"version:"}{collection["version_id"]}')

C1705401930-NSIDC_ECS version:003
C1997321091-NSIDC_ECS version:004


Two versions of the dataset exist in CMR. Now let's take the ID of the latest version (004) and set the `cloud_hosted` parameter to True to find out whether it is cloud-hosted:

In [10]:
CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
url = f'{CMR_OPS}/{"collections"}'

response = requests.get(url, 
                        params={
                            'concept_id': 'C1997321091-NSIDC_ECS',
                            # 'cloud_hosted': 'True',
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
collections = response.json()['feed']['entry']
pprint(collections)

[{'archive_center': 'NASA NSIDC DAAC',
  'associations': {'services': ['S1568899363-NSIDC_ECS',
                                'S1613689509-NSIDC_ECS',
                                'S1977894169-NSIDC_ECS',
                                'S2013502342-NSIDC_ECS'],
                   'tools': ['TL1950215144-NSIDC_ECS',
                             'TL1977971361-NSIDC_ECS',
                             'TL1993837300-NSIDC_ECS',
                             'TL1952642907-NSIDC_ECS']},
  'boxes': ['-90 -180 90 180'],
  'browse_flag': False,
  'coordinate_system': 'CARTESIAN',
  'data_center': 'NSIDC_ECS',
  'dataset_id': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004',
  'has_formats': True,
  'has_spatial_subsetting': True,
  'has_temporal_subsetting': True,
  'has_transforms': False,
  'has_variables': True,
  'id': 'C1997321091-NSIDC_ECS',
  'links': [{'href': 'https://n5eil01u.ecs.nsidc.org/ATLAS/ATL03.004/',
             'hreflang': 'en-US',
             'length': '0.0KB',


What happens if we comment out this parameter? Do we see results returned?
To comment or uncomment a line in Jupyter, place the cursor on it and press Cmd + / (Ctrl + / on Windows and Linux).

Now we have determined that our MODIS dataset is provided in the cloud, whereas the ICESat-2 dataset remains "on premise", residing in a local data center.
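
The check performed above can be wrapped in a small function. This is a sketch rather than an official client: `has_entries` and `is_cloud_hosted` are hypothetical names, and the sketch assumes that a collection is on premise whenever a `cloud_hosted=true` search for its concept ID returns no entries. The network call uses only the standard library and is defined but not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'

def has_entries(cmr_json):
    """Return True if a CMR JSON search response contains at least one entry."""
    return len(cmr_json.get('feed', {}).get('entry', [])) > 0

def is_cloud_hosted(concept_id):
    """Query CMR with cloud_hosted=true for one collection (needs network)."""
    query = urlencode({'concept_id': concept_id, 'cloud_hosted': 'true'})
    req = Request(f'{CMR_OPS}/collections?{query}',
                  headers={'Accept': 'application/json'})
    with urlopen(req) as resp:
        return has_entries(json.load(resp))

# Offline illustration with hand-written responses shaped like CMR JSON
cloud_like = {'feed': {'entry': [{'id': 'C1940475563-POCLOUD'}]}}
empty_like = {'feed': {'entry': []}}
print(has_entries(cloud_like))
print(has_entries(empty_like))
```

With network access, `is_cloud_hosted('C1997321091-NSIDC_ECS')` would run the same query as the ICESat-2 cell above.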

### Determine size of ATL03 data over area of interest 
ATL03 granules are large, so we need to keep the geographic region small.

#### Specify time and area of interest 

These `bounding_box` and `temporal` variables will be used for data search, subsetting, and access below.

(Quick demo of OpenAltimetry or Earthdata Search for exploration of ICESat-2 or both datasets??)

In [11]:
# # Bounding Box spatial parameter in decimal degree 'W,S,E,N' format. TODO: Show on a simple map??
# bounding_box = '-31.68073,61.21566,-12.15967,83.56771'

# # Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
# temporal = '2021-01-01T00:00:00Z,2021-01-01T23:59:59Z'


# Bounding Box spatial parameter in decimal degree 'W,S,E,N' format.
bounding_box = '-62.8,81.7,-56.4,83'

# Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
temporal = '2019-06-22T00:00:00Z,2019-06-22T23:59:59Z'
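
Both parameters are plain strings, so a typo in either one only shows up as an empty or unexpected search result. A minimal sketch of validating them before searching is shown below. `parse_bounding_box` and `parse_temporal` are hypothetical helpers, not part of CMR or any library.

```python
from datetime import datetime

def parse_bounding_box(bounding_box):
    """Split a 'W,S,E,N' string into four floats and sanity-check the values."""
    west, south, east, north = (float(x) for x in bounding_box.split(','))
    # West may exceed east when a box crosses the antimeridian, so only ranges are checked
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError('longitudes must be within [-180, 180]')
    if not (-90 <= south <= north <= 90):
        raise ValueError('latitudes must satisfy -90 <= S <= N <= 90')
    return west, south, east, north

def parse_temporal(temporal):
    """Split a 'start,end' string into two datetime objects."""
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    start, end = (datetime.strptime(t, fmt) for t in temporal.split(','))
    if start > end:
        raise ValueError('start must not be later than end')
    return start, end

print(parse_bounding_box('-62.8,81.7,-56.4,83'))
print(parse_temporal('2019-06-22T00:00:00Z,2019-06-22T23:59:59Z'))
```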

In [12]:
url = f'{CMR_OPS}/{"granules"}'
response = requests.get(url, 
                        params={
                            'concept_id': 'C1997321091-NSIDC_ECS',
                            'temporal': temporal,
                            'bounding_box': bounding_box,
                            'page_size': 200,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
print(response.headers['CMR-Hits'])

2
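
The `CMR-Hits` header reports the total number of matches, which can exceed `page_size`. Here the 2 hits fit on a single page, but larger searches have to be requested page by page using the CMR `page_num` parameter. The sketch below only computes which page numbers are needed; `pages_needed` is a hypothetical helper.

```python
import math

def pages_needed(hits, page_size):
    """Return the 1-based page numbers required to retrieve `hits` results."""
    if page_size <= 0:
        raise ValueError('page_size must be positive')
    return list(range(1, math.ceil(hits / page_size) + 1))

print(pages_needed(2, 200))    # [1]
print(pages_needed(450, 200))  # [1, 2, 3]
```

Each page number would then be passed as `'page_num'` alongside `'page_size'` in the `params` dictionary.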


In [13]:
granules = response.json()['feed']['entry']

for granule in granules:
    print(f'{granule["producer_granule_id"]} {granule["granule_size"]} {granule["links"][0]["href"]}')

ATL03_20190622061415_12980304_004_01.h5 1825.3746356964 https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL03.004/2019.06.22/ATL03_20190622061415_12980304_004_01.h5
ATL03_20190622202251_13070304_004_01.h5 3035.5987443924 https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL03.004/2019.06.22/ATL03_20190622202251_13070304_004_01.h5
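
Summing the `granule_size` values gives the total volume that a download would transfer. A short sketch using the two sizes printed above, assuming the values are in megabytes:

```python
# Sizes copied from the granule listing above (assumed to be in MB)
granule_sizes = [1825.3746356964, 3035.5987443924]

total_mb = sum(granule_sizes)
print(f'{len(granule_sizes)} granules, {total_mb:.2f} MB ({total_mb / 1024:.2f} GB)')
```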


### Download ICESat-2 ATL03 granule
We've found 2 granules. We'll download the first one and write it to a file named after its `producer_granule_id`.

We also need the URL for the granule, which is found in the `links` section.

In [17]:
icesat_id = granules[0]["producer_granule_id"]
icesat_url = granules[0]['links'][0]['href']
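
Taking `links[0]` assumes the data file is always listed first, which is not guaranteed: a granule's first link can point to a checksum or metadata file instead. A more defensive sketch selects the link by its `rel` attribute. It assumes data links carry a `rel` value ending in `/data#`; `find_data_url` is a hypothetical helper, and the sample granule is hand-written for illustration with placeholder `example.com` URLs.

```python
def find_data_url(granule, extensions=('.h5', '.nc')):
    """Return the first data link whose href ends with a known data extension."""
    for link in granule.get('links', []):
        is_data = link.get('rel', '').endswith('/data#')
        inherited = link.get('inherited', False)
        if is_data and not inherited and link['href'].endswith(extensions):
            return link['href']
    raise LookupError('no data link found for granule')

# Hand-written sample shaped like a CMR granule entry (illustrative only)
sample_granule = {
    'producer_granule_id': 'ATL03_20190622061415_12980304_004_01.h5',
    'links': [
        {'rel': 'http://esipfed.org/ns/fedsearch/1.1/metadata#',
         'href': 'https://example.com/ATL03_20190622061415_12980304_004_01.h5.md5'},
        {'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#',
         'href': 'https://example.com/ATL03_20190622061415_12980304_004_01.h5'},
    ],
}

print(find_data_url(sample_granule))
```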

You need Earthdata Login credentials. These are stored in the `.netrc` file you set up. We'll use the `netrc` package to get the login and password.

In [12]:
info = netrc.netrc()
login, account, password = info.authenticators('urs.earthdata.nasa.gov')
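
`authenticators()` returns `None` when the `.netrc` file has no entry for the host, in which case the tuple unpacking above fails with a `TypeError`. The sketch below adds an explicit check. `read_credentials` is a hypothetical helper, and the demonstration uses a temporary file with placeholder credentials rather than your real `.netrc`.

```python
import netrc
import os
import tempfile

def read_credentials(host, netrc_path=None):
    """Return (login, password) for `host`, or raise if no entry exists."""
    auth = netrc.netrc(netrc_path).authenticators(host)
    if auth is None:
        raise LookupError(f'no .netrc entry found for {host}')
    login, _account, password = auth
    return login, password

# Demonstration with a throwaway file and placeholder credentials
with tempfile.NamedTemporaryFile('w', suffix='.netrc', delete=False) as tmp:
    tmp.write('machine urs.earthdata.nasa.gov login demo_user password demo_pass\n')

try:
    credentials = read_credentials('urs.earthdata.nasa.gov', tmp.name)
    print(credentials)
finally:
    os.remove(tmp.name)
```

Calling `read_credentials('urs.earthdata.nasa.gov')` without a path reads the default `~/.netrc`.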

We'll then use the `requests.get()` method to retrieve the file contents.  These are stored in the response.

In [13]:
r = requests.get(icesat_url, auth=(login, password))

The contents of the response can then be written to a file. The file is opened for writing (`w`) in binary mode (`b`). The default is text mode, so make sure you set `b`.

In [14]:
with open(icesat_id, 'wb') as f:
    f.write(r.content)
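
`r.content` holds the whole file in memory, which is a lot for a granule of nearly 2 GB. One alternative is to write the file in chunks. The sketch below shows only the writing side, using a hypothetical `save_chunks` helper and in-memory sample data. With `requests`, the chunks would come from a request made with `stream=True`, followed by `r.iter_content(chunk_size=...)`.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path` and return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += f.write(chunk)
    return written

# Stand-in for r.iter_content(...): a few small byte chunks
sample_chunks = [b'HDF', b'', b'5 data']
out_path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
n_bytes = save_chunks(sample_chunks, out_path)
print(n_bytes, os.path.getsize(out_path))
```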

`ATL03_20190622061415_12980304_004_01.h5` is an HDF5 file. `xarray` can open this, but you need to tell it which group to read. In this case we read the height data for the left beam of ground track 1 (`/gt1l/heights`).

In [16]:
ds = xr.open_dataset(icesat_id, group='/gt1l/heights')
ds
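
ATL03 stores photon heights separately for each of its six beams, in groups named `gt1l` through `gt3r`. A small sketch that builds the `group` path for every beam follows. Each path could then be passed to `xr.open_dataset` one at a time, assuming that beam is present in the file.

```python
# Three ground-track pairs, each with a left (l) and right (r) beam
beam_groups = [f'/gt{pair}{side}/heights' for pair in (1, 2, 3) for side in 'lr']

for group in beam_groups:
    print(group)
```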

### Determine subsetting capabilities of ATL03

Consider removing, since we see basic service info in the collection-level tags (`has_spatial_subsetting`, etc.).

In [22]:
# # CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
# # url = f'{CMR_OPS}/{"collections"}'

# response = requests.get(url, 
#                         params={
#                             'concept_id': 'C1997321091-NSIDC_ECS',
# #                            'cloud_hosted': 'True',
#                             },
#                         headers={
#                             'Accept': 'application/json'
#                             }
#                        )
# response = response.json()
# services = response['feed']['entry'][0]['associations']['services']
# output_format = "umm_json"
# service_url = "https://cmr.earthdata.nasa.gov/search/services"
# for i in range(len(services)):
#     response = requests.get(f"{service_url}.{output_format}?concept-id={services[i]}")
#     response = response.json()
#     if 'ServiceOptions' in response['items'][0]['umm']: pprint(response['items'][0]['umm']['ServiceOptions'])

## Direct Download of ATL03 

- Describe what services are available, including icepyx (provide references), but just direct download for simplicity

- Describe that this is being "downloaded" to our cloud environment - what does that mean in terms of cost, etc.

Icepyx workflow below...

In [11]:
# # # Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
# # temporal = '2021-01-01T00:00:00Z,2021-01-31T23:59:59Z'

# #icepyx params

# #convert bounding_box which is a string, to a list of floats

# to_list = bounding_box.split(",")
# icepyx_spatial = [float(x) for x in to_list]
# print(icepyx_spatial)

# # New date range since icepyx provides dates as YYYY-MM-DD

# date_range = ['2019-06-22','2019-06-22']

# region_a = ipx.Query(icesat2_name, icepyx_spatial, date_range)

In [10]:
# region_a.visualize_spatial_extent()

In [12]:
# region_a.avail_granules(ids=True)

In [13]:
# region_a.subsetparams()

In [14]:
# earthdata_uid = 'uid'
# email = 'email'
# region_a.earthdata_login(earthdata_uid, email)

In [16]:
# region_a.order_granules()

In [17]:
# path = './download'
# region_a.download_granules(path)

In [None]:
# #Set NSIDC data access base URL
# base_url = 'https://n5eil02u.ecs.nsidc.org/egi/request'

# # bounding box search and subset:
# param_dict = {'short_name': 'ATL03', 
#               'version': '004', 
#               'temporal': temporal, 
#               'bounding_box': bounding_box, 
#               'bbox': bounding_box, 
#               'coverage': '/gt1r/heights/h_ph,/gt1l/heights/h_ph,/gt2r/heights/h_ph,/gt2l/heights/h_ph,/gt1r/heights/lat_ph,/gt1l/heights/lat_ph,/gt2r/heights/lat_ph,/gt2l/heights/lat_ph,/gt1r/heights/lon_ph,/gt1l/heights/lon_ph,/gt2r/heights/lon_ph,/gt2l/heights/lon_ph',
#               'page_size': '10', 
#               'request_mode': 'async',
# #              'token' : _token,
#              }

# #Remove blank key-value-pairs
# param_dict = {k: v for k, v in param_dict.items() if v != ''}

# #Convert to string
# param_string = '&'.join("{!s}={!r}".format(k,v) for (k,v) in param_dict.items())
# param_string = param_string.replace("'","")

# API_request = api_request = f'{base_url}?{param_string}'
# print(API_request)

In [None]:
# #TODO: Need to make code much simpler!!

# # Create an output folder if the folder does not already exist.

# path = str(os.getcwd() + '/Outputs')
# if not os.path.exists(path):
#     os.mkdir(path)

# # For all requests other than spatial file upload, use get function
# session = requests.session()
# request = session.get(base_url, params=param_dict)

# print('Request HTTP response: ', request.status_code)

# # Raise bad request: Loop will stop for bad response code.
# request.raise_for_status()
# print('Order request URL: ', request.url)
# esir_root = ET.fromstring(request.content)
# print('Order request response XML content: ', request.content)

# #Look up order ID
# orderlist = []   
# for order in esir_root.findall("./order/"):
#     orderlist.append(order.text)
# orderID = orderlist[0]
# print('order ID: ', orderID)

# #Create status URL
# statusURL = base_url + '/' + orderID
# print('status URL: ', statusURL)

# #Find order status
# request_response = session.get(statusURL)    
# print('HTTP response from order response URL: ', request_response.status_code)

# # Raise bad request: Loop will stop for bad response code.
# request_response.raise_for_status()
# request_root = ET.fromstring(request_response.content)
# statuslist = []
# for status in request_root.findall("./requestStatus/"):
#     statuslist.append(status.text)
# status = statuslist[0]
# print('Data request is submitting...')
# print('Initial request status is ', status)

# #Continue loop while request is still processing
# while status == 'pending' or status == 'processing': 
#     print('Status is not complete. Trying again.')
#     time.sleep(10)
#     loop_response = session.get(statusURL)

# # Raise bad request: Loop will stop for bad response code.
#     loop_response.raise_for_status()
#     loop_root = ET.fromstring(loop_response.content)

# #find status
#     statuslist = []
#     for status in loop_root.findall("./requestStatus/"):
#         statuslist.append(status.text)
#     status = statuslist[0]
#     print('Retry request status is: ', status)
#     if status == 'pending' or status == 'processing':
#         continue

# #Order can either complete, complete_with_errors, or fail:
# # Provide complete_with_errors error message:
# if status == 'complete_with_errors' or status == 'failed':
#     messagelist = []
#     for message in loop_root.findall("./processInfo/"):
#         messagelist.append(message.text)
#     print('error messages:')
#     pprint(messagelist)

# # Download zipped order if status is complete or complete_with_errors
# if status == 'complete' or status == 'complete_with_errors':
#     downloadURL = 'https://n5eil02u.ecs.nsidc.org/esir/' + orderID + '.zip'
#     print('Zip download URL: ', downloadURL)
#     print('Beginning download of zipped output...')
#     zip_response = session.get(downloadURL)
#     # Raise bad request: Loop will stop for bad response code.
#     zip_response.raise_for_status()
#     with zipfile.ZipFile(io.BytesIO(zip_response.content)) as z:
#         z.extractall(path)
#     print('Data request is complete.')
# else: print('Request failed.')

#### Clean up Outputs folder by removing individual granule folders 

In [None]:
# for root, dirs, files in os.walk(path, topdown=False):
#     for file in files:
#         try:
#             shutil.move(os.path.join(root, file), path)
#         except OSError:
#             pass
#     for name in dirs:
#         os.rmdir(os.path.join(root, name))

### Determine size of SAR data

In [None]:
# CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
# url = f'{CMR_OPS}/{"granules"}'

# response = requests.get(url, 
#                         params={
#                             'concept_id': 'C1214472336-ASF',
# #                             'concept_id': 'C1997321091-NSIDC_ECS',
#                             'bounding_box': bounding_box,
#                             'temporal': temporal,
#                             },
#                         headers={
#                             'Accept': 'application/json'
#                             }
#                        )
# granules = response.json()['feed']['entry']
# #pprint(granules)

# results = json.loads(response.content)
# granules = []
# granules.extend(results['feed']['entry'])
# hits = int(response.headers['CMR-Hits'])
# print(f"Found {hits} files")
# granule_sizes = [float(granule['granule_size']) for granule in granules]
# print(f"The total size of all files is {sum(granule_sizes):.2f} MB")

### Determine variables of interest: SST, ocean color, chemistry...

In [6]:
# Point `url` back at the collections endpoint; it was last set to the granules endpoint
url = f'{CMR_OPS}/collections'

response = requests.get(url, 
                        params={
                            'concept_id': 'C1940475563-POCLOUD',
                            'cloud_hosted': 'True',
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
response = response.json()
variables = response['feed']['entry'][0]['associations']['variables']
output_format = "umm_json"
var_url = "https://cmr.earthdata.nasa.gov/search/variables"
# Look up each associated variable by concept ID and print its name
for variable in variables:
    response = requests.get(f"{var_url}.{output_format}?concept-id={variable}")
    umm = response.json()['items'][0]['umm']
    if 'Name' in umm:
        pprint(umm['Name'])

'sea_surface_temperature'
'l2p_flags'
'lat'
'wind_speed'
'dt_analysis'
'quality_level_4um'
'sses_bias_4um'
'sses_bias'
'quality_level'
'sses_standard_deviation'
'time'
'sea_surface_temperature_4um'
'lon'
'sst_dtime'
'sses_standard_deviation_4um'


### Pull those variables into xarray "in place"

#### First, we need to determine the granules returned from our time and area of interest

In [56]:
gran_url = f'{CMR_OPS}/{"granules"}'

response = requests.get(gran_url, 
                        params={
                            'concept_id': 'C1940475563-POCLOUD',
                            'temporal': temporal,
                            'bounding_box': bounding_box,
                            'page_size': 200,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
print(response.status_code)

granules = response.json()['feed']['entry']
print(response.headers['CMR-Hits'])
for granule in granules:
    print(f'{granule["dataset_id"]} {granule["id"]} {granule["links"][0]["href"]}')

200
11
GHRSST Level 2P Global Sea Surface Skin Temperature from the Moderate Resolution Imaging Spectroradiometer (MODIS) on the NASA Terra satellite (GDS2) G1956337241-POCLOUD https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20190622061001-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc
GHRSST Level 2P Global Sea Surface Skin Temperature from the Moderate Resolution Imaging Spectroradiometer (MODIS) on the NASA Terra satellite (GDS2) G1956082976-POCLOUD https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-public/MODIS_T-JPL-L2P-v2019.0/20190622093001-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc.md5
GHRSST Level 2P Global Sea Surface Skin Temperature from the Moderate Resolution Imaging Spectroradiometer (MODIS) on the NASA Terra satellite (GDS2) G1956324753-POCLOUD https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20190622111001-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc
GHRSS

### Use geolocation of ICESat-2 to define the single transect used to pull coincident ocean data out from array
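
One possible approach, sketched under strong simplifying assumptions: for each ICESat-2 photon location, find the nearest SST pixel by great-circle distance and take its value. The code below is a brute-force, pure-Python illustration with made-up coordinates and temperatures. `haversine_km` and `nearest_value` are hypothetical helpers; a real workflow would read the `lat_ph` and `lon_ph` variables and the SST arrays from the files and would likely use a vectorized search instead.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearest_value(lat, lon, pixels):
    """Return the value of the (lat, lon, value) pixel closest to the given point."""
    return min(pixels, key=lambda p: haversine_km(lat, lon, p[0], p[1]))[2]

# Made-up SST pixels (lat, lon, temperature in kelvin) and track points (lat, lon)
sst_pixels = [(82.1, -60.0, 271.0), (82.0, -58.0, 272.0), (82.5, -61.0, 270.5)]
track = [(82.0, -60.0), (82.45, -60.9)]

# One SST value per track point, in track order
transect = [nearest_value(lat, lon, sst_pixels) for lat, lon in track]
print(transect)
```

Great-circle distance is used rather than a difference in degrees because, at these latitudes (around 82°N), one degree of longitude covers far less ground than one degree of latitude.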

### Create a plot of the single transect of gridded data 

(bonus: time series) - describe what this means to egress out of the cloud versus pulling the original data down (benefit to processing in the cloud)

## Exercise

In [None]:
# maybe an exercise that builds off of previous tutorials? Discover services or storage in CMR?

---

## Resources (optional)

---

## Conclusion