# Harmonizing data located within and outside of the NASA Earthdata Cloud

## Summary

This tutorial combines several workflow steps and components from the previous days. It demonstrates how to use the geolocation of data available outside of the Earthdata Cloud to access coincident variables from cloud-accessible data. This is likely to be a common use case as NASA Earthdata continues to migrate to the cloud, producing a "hybrid" data archive split between Amazon Web Services (AWS) and the original on-premise data storage systems. You may also want to combine field measurements with remote sensing data available on the Earthdata Cloud.

This specific example explores the harmonization of the ICESat-2 ATL03 data product, which is currently (as of November 2021) publicly available via direct download from the NSIDC DAAC, with sea surface temperature (SST) variables available from PO.DAAC on the Earthdata Cloud.


### Objectives

[TODO]

---

### Import packages

In [None]:
#

### Determine storage location of datasets of interest

First, let's see whether our datasets of interest reside in the Earthdata Cloud or whether they reside on premise ("on prem") at a local data center.

Background from CMR API [TODO: consider removing]:
The `cloud_hosted` parameter can be set to "true" or "false". When true, the results are restricted to collections that have a `DirectDistributionInformation` element or that have been tagged with `gov.nasa.earthdatacloud.s3`.

We are building off of the CMR introductory tutorial, beginning with a collection search.

In [None]:
#
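
As a minimal sketch of such a collection search, the query can be assembled with the standard library alone. The endpoint `https://cmr.earthdata.nasa.gov/search/collections.json` and the helper names `build_collection_query` and `collection_search_url` are assumptions made for illustration; the notebook's own cell may build the request differently.

```python
from urllib.parse import urlencode

# Assumed CMR collection search endpoint (JSON output).
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_collection_query(short_name, cloud_hosted=None):
    """Build CMR collection search parameters (hypothetical helper).

    cloud_hosted=None leaves the filter out; True or False is sent to CMR
    as the lowercase string "true" or "false".
    """
    params = {"short_name": short_name}
    if cloud_hosted is not None:
        params["cloud_hosted"] = "true" if cloud_hosted else "false"
    return params


def collection_search_url(short_name, cloud_hosted=None):
    """Return the full search URL for the given parameters."""
    query = urlencode(build_collection_query(short_name, cloud_hosted))
    return f"{CMR_COLLECTIONS_URL}?{query}"


# Build (but do not send) a query for on-premise ATL03 collections.
print(collection_search_url("ATL03", cloud_hosted=False))
```

The resulting URL can be passed to any HTTP client; no Earthdata login is needed for a public collection search.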

We want to search by collection to inspect the access and service options that exist:

In [None]:
#

In the CMR introduction tutorial, we explored cloud-hosted collections from different DAAC providers and identified the CMR concept-id for a given dataset id (also referred to as a `short_name`). Here we'll start with two datasets that we want to explore over a coincident area and time:

In [None]:
#

As in the intro tutorial, we first determine which concept-ids are returned for the MODIS dataset. Start by retrieving collection results based on the MODIS `short_name`:

In [None]:
#

For each collection result, print out the CMR concept-id and version:

In [None]:
#
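
One way to do this, sketched under the assumption that the search was made against the CMR JSON endpoint and the response has already been parsed into a dictionary with a `feed` / `entry` structure. The helper name `summarize_collections` is hypothetical, and the sample ids below are placeholders, not real concept-ids.

```python
def summarize_collections(cmr_response):
    """Return (concept_id, version) pairs from a parsed CMR JSON response."""
    entries = cmr_response.get("feed", {}).get("entry", [])
    return [(entry["id"], entry.get("version_id")) for entry in entries]


# Hand-written stand-in for a parsed response. The ids and versions are
# placeholders for illustration only, not values returned by CMR.
sample_response = {
    "feed": {
        "entry": [
            {"id": "C0000000001-PROVIDER_A", "version_id": "1"},
            {"id": "C0000000002-PROVIDER_B", "version_id": "1"},
        ]
    }
}

for concept_id, version in summarize_collections(sample_response):
    print(f"{concept_id}, Version {version}")
```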

Two collections are returned, both at version 2019.0. The suffix of each id shows that one is associated with the "POCLOUD" provider and the other with "PODAAC". That gives us a clue about where the data are hosted, but we can also set the `cloud_hosted` parameter to `True` to confirm.

In [None]:
#

In [None]:
#

We will save this concept-id to use later on when we access the data granules.

In [None]:
#

Now we will try our ICESat-2 dataset to see which ids are returned for a given dataset name.

In [None]:
#

In [None]:
#

Two separate datasets exist in the CMR, one at version 3 and one at version 4. Let's see if these are `cloud_hosted`:

In [None]:
#

In [None]:
#

When `cloud_hosted` is set to `False`, we get our collections back. We have now determined that a copy of the MODIS dataset is in the cloud, whereas the ICESat-2 dataset (both versions) remains "on premise", residing in a local data center.

Save the ATL03 concept ID and the MODIS GHRSST concept ID to variables:

In [None]:
#

#### Specify time range and area of interest 

We are going to focus on getting data for an area north of Greenland for a single day in June.

These `bounding_box` and `temporal` variables will be used for data search, subsetting, and access below:

In [None]:
#
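
A minimal sketch of how these two variables could be built in the string formats CMR expects: `west,south,east,north` for `bounding_box` and `start,end` for `temporal`. The coordinates are illustrative assumptions for an area north of Greenland, and the date is taken from the ATL03 file name used later in this tutorial. The helper names are hypothetical.

```python
from datetime import datetime


def format_bounding_box(west, south, east, north):
    """Format a bounding box as 'west,south,east,north' in decimal degrees."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be within [-180, 180]")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    return f"{west},{south},{east},{north}"


def format_temporal(start, end):
    """Format a 'start,end' range; naive datetimes are treated as UTC."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return f"{start.strftime(fmt)},{end.strftime(fmt)}"


# Illustrative values only: the box coordinates are an assumption.
bounding_box = format_bounding_box(-62.8, 81.7, -56.4, 83.0)
temporal = format_temporal(
    datetime(2019, 6, 22, 0, 0, 0),
    datetime(2019, 6, 22, 23, 59, 59),
)

print(bounding_box)
print(temporal)
```

Keeping both values as plain strings means they can be reused unchanged as request parameters in the granule searches that follow.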

Perform a granule search over our time and area of interest. How many granules are returned?

In [None]:
#

In [None]:
#

Print the file names, size, and links:

In [None]:
#
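
As a sketch of what this step might look like, the helper below pulls the name, size, and data links out of a parsed CMR granule search response. It assumes the JSON response format, in which each entry carries `producer_granule_id`, `granule_size` (in MB, as a string), and a `links` list whose data links have a `rel` ending in `/data#`. The helper name `list_granules` is hypothetical, and the sample size and URL are placeholders.

```python
def list_granules(cmr_response):
    """Return name, size, and data links for each granule entry."""
    granules = []
    for entry in cmr_response.get("feed", {}).get("entry", []):
        # Keep only direct data links; skip links inherited from the collection.
        data_links = [
            link["href"]
            for link in entry.get("links", [])
            if link.get("rel", "").endswith("/data#")
            and not link.get("inherited", False)
        ]
        size = entry.get("granule_size")
        granules.append(
            {
                "name": entry.get("producer_granule_id"),
                "size_mb": float(size) if size is not None else None,
                "links": data_links,
            }
        )
    return granules


# Hand-written stand-in for a parsed response. The size and URL are
# placeholders, not real values returned by CMR.
sample_granules = {
    "feed": {
        "entry": [
            {
                "producer_granule_id": "ATL03_20190622061415_12980304_004_01.h5",
                "granule_size": "1.0",
                "links": [
                    {
                        "rel": "http://esipfed.org/ns/fedsearch/1.1/data#",
                        "href": "https://example.com/ATL03_20190622061415_12980304_004_01.h5",
                    },
                    {
                        "rel": "http://esipfed.org/ns/fedsearch/1.1/documentation#",
                        "href": "https://example.com/docs",
                    },
                ],
            }
        ]
    }
}

for granule in list_granules(sample_granules):
    print(granule["name"], granule["size_mb"], granule["links"])
```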

### Download ICESat-2 ATL03 granule
[TODO] Describe what services are available, including icepyx (provide references), but just direct download for simplicity. Describe that this is being "downloaded" to our cloud environment - what does that mean in terms of cost, etc.

We've found 2 granules. We'll download the first one and write it to a file with the same name as the `producer_granule_id`.

We also need the URL for the granule, which is one of the `href` links we printed out above.

In [None]:
#

You need Earthdata Login credentials to download data from NASA DAACs. These are the credentials you stored in the `.netrc` file you set up in previous tutorials.

We'll use the `netrc` package to retrieve your login and password without exposing them.

In [None]:
#
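
A minimal sketch of that lookup, assuming your `.netrc` has an entry for the Earthdata Login host `urs.earthdata.nasa.gov`. The helper name `get_earthdata_credentials` and its `netrc_file` argument are hypothetical conveniences.

```python
import netrc


def get_earthdata_credentials(netrc_file=None, host="urs.earthdata.nasa.gov"):
    """Return (login, password) for `host` from a netrc file.

    netrc_file=None makes the netrc module read ~/.netrc.
    """
    auth = netrc.netrc(netrc_file).authenticators(host)
    if auth is None:
        raise LookupError(f"no entry for {host} in the netrc file")
    login, _account, password = auth
    return login, password
```

Calling `get_earthdata_credentials()` with no arguments reads the default `~/.netrc`. The returned values can be passed on to a request without ever being printed or typed into the notebook.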

To retrieve the granule data, we use the `requests.get()` method, passing Earthdata login credentials as a `tuple` using the `auth` keyword.

In [None]:
#
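
The call described above could be wrapped as follows. This is a sketch rather than the notebook's own cell: the function name `download_granule` and its `get` argument are hypothetical, with `get` existing only so the function can be exercised without network access. By default it falls back to `requests.get`.

```python
def download_granule(url, username, password, get=None):
    """Fetch `url` with HTTP basic auth and return the response object."""
    if get is None:
        # Imported lazily so the module loads even without requests installed.
        import requests

        get = requests.get
    response = get(url, auth=(username, password))
    # Raise an exception on 4xx/5xx responses instead of silently saving an error page.
    response.raise_for_status()
    return response
```

With a real response `r`, the raw bytes are in `r.content` and the response headers in `r.headers`.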

The response returned by `requests` has the same structure as all the other responses: a header and contents. The header contains information about the response, including the size in bytes of the data we downloaded.

In [None]:
#

The contents need to be saved to a file. To keep the directory clean, we will create a `downloads` directory to store the file. We can use a shell command to do this or use the `mkdir` function from the `os` package.

In [None]:
#

You should see a `downloads` directory in the file browser.

To write the data to a file, we use `open` to open a file. We need to specify that the file is open for writing by using the _write-mode_ `w`. We also need to specify that we want to write bytes by setting the _binary-mode_ `b`. This is important because the response contents are bytes, and the default mode for `open` is _text-mode_. So make sure you use `b`.

We'll use the `with` statement _context-manager_ to open the file, write the contents of the response, and then close the file. Once the data in `r.content` is written successfully to the file, or if there is an error, the file is closed by the _context-manager_.

We also need to prepend the `downloads` path to the filename.  We do this using `Path` from the `pathlib` package in the standard library.

In [None]:
#
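
Putting these pieces together, a minimal sketch might look like the following. The helper name `save_content` is hypothetical; in the tutorial, `content` would be `r.content` and `filename` the granule's `producer_granule_id`.

```python
from pathlib import Path


def save_content(content, filename, directory="downloads"):
    """Write bytes to directory/filename and return the resulting path."""
    target_dir = Path(directory)
    # Create the directory if needed; do nothing if it already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    filepath = target_dir / filename
    # "wb" = write mode + binary mode, because the content is bytes.
    with open(filepath, "wb") as f:
        f.write(content)
    return filepath
```

Using `Path.mkdir(exist_ok=True)` instead of `os.mkdir` means the cell can be re-run without raising an error when the directory is already there.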

Check to make sure it is downloaded.

In [None]:
#

`ATL03_20190622061415_12980304_004_01.h5` is an HDF5 file. `xarray` can open it, but you need to tell it which group to read the data from. In this case, we read the height data for the left beam of ground track 1.

In [None]:
#
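
A sketch of how the group could be selected, assuming the ATL03 layout in which each beam's photon heights live under a group such as `gt1l/heights`. The helper names `heights_group` and `open_heights` are hypothetical, and `open_heights` needs `xarray` plus an HDF5-capable backend installed.

```python
def heights_group(track, side):
    """Return the HDF5 group path for a beam, e.g. 'gt1l/heights'."""
    if track not in (1, 2, 3):
        raise ValueError("track must be 1, 2, or 3")
    if side not in ("l", "r"):
        raise ValueError("side must be 'l' (left) or 'r' (right)")
    return f"gt{track}{side}/heights"


def open_heights(filepath, track=1, side="l"):
    """Open the heights group of one beam as an xarray Dataset."""
    # Imported lazily so heights_group can be used without xarray installed.
    import xarray as xr

    return xr.open_dataset(filepath, group=heights_group(track, side))
```

For the file above, `open_heights(filepath)` with the default arguments would read ground track 1, left beam.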

### Determine variables of interest: SST, ocean color, chemistry...

In [None]:
#

### Pull those variables into xarray "in place"

#### First, we need to determine the granules returned from our time and area of interest

In [None]:
#

In [None]:
#

In [None]:
#

### Get S3 credentials

In [None]:
#

In [None]:
#
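
As a rough sketch of this step: temporary S3 credentials are requested from a DAAC endpoint (for PO.DAAC, assumed here to be `https://archive.podaac.earthdata.nasa.gov/s3credentials`) and then handed to `s3fs`. The response keys `accessKeyId`, `secretAccessKey`, and `sessionToken` are assumptions about the shape of that response, the helper names are hypothetical, and the sample values are placeholders.

```python
def s3fs_kwargs(credentials):
    """Map temporary S3 credentials onto s3fs.S3FileSystem keyword arguments."""
    return {
        "key": credentials["accessKeyId"],
        "secret": credentials["secretAccessKey"],
        "token": credentials["sessionToken"],
    }


def open_s3_filesystem(credentials):
    """Create an authenticated S3 file system (requires the s3fs package)."""
    # Imported lazily so s3fs_kwargs can be used without s3fs installed.
    import s3fs

    return s3fs.S3FileSystem(anon=False, **s3fs_kwargs(credentials))


# Placeholder values only; real credentials come from the DAAC endpoint
# and expire after a limited time.
sample_credentials = {
    "accessKeyId": "PLACEHOLDER_KEY",
    "secretAccessKey": "PLACEHOLDER_SECRET",
    "sessionToken": "PLACEHOLDER_TOKEN",
}

print(sorted(s3fs_kwargs(sample_credentials)))
```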

### Open an S3 file

In [None]:
#

### Use geolocation of ICESat-2 to define the single transect used to pull coincident ocean data out from array


### Create a plot of the single transect of gridded data 

(bonus: time series) - describe what this means to egress out of the cloud versus pulling the original data down (benefit to processing in the cloud)


## Download MODIS GHRSST data from Cloud

In [None]:
#

In [None]:
#

In [None]:
#

In [None]:
#

## Resources (optional)

---

## Conclusion
