<img src='./img/nsidc_logo.png'/>

# Data Discovery and Access via the **earthdata** library


### Credits
* Notebook by: Luis Lopez and Mikala Beig
* Source material: [earthdata demo notebook](https://github.com/nsidc/earthdata) by Luis Lopez

## Objective

* Use programmatic data access to discover and access NASA DAAC data using the **earthdata** library.

---

## Motivation and Background
**earthdata** uses NASA APIs to search, preview, and access NASA datasets on-prem and in the cloud (with 4 lines of Python!).

There are many ways to access NASA datasets. We can use the Earthdata Search portal. We can use DAAC-specific portals or tools.
We could even use data.gov! These web portals are great, but they are not designed for programmatic access and reproducible workflows,
both of which are extremely important in the age of the cloud and reproducible open science.

The good news is that NASA also exposes APIs that allow us to search, transform, and access data in a programmatic way.
There are already some very useful client libraries for these APIs:

* python-cmr
* eo-metadata-tools
* harmony-py
* Hyrax (OpenDAP)
* cmr-stac
* others

Each of these libraries has amazing features and some similarities. 
* [cmr-stac](https://medium.com/pangeo/intake-stac-nasa-4cd78d6246b7) is probably the best option for a streamlined workflow from dataset search and discovery to efficiently loading data using Python libraries like pandas or xarray.
* [*Harmony-py*](https://harmony.earthdata.nasa.gov/) is the more capable client if we want to preprocess the data beforehand (reformat NetCDF to Zarr, reproject, subset). Unfortunately, not all datasets are covered by Harmony yet.

In this context, **earthdata** aims to be a simple library that can deal with the important parts of the metadata so we can access or download data without having to worry if a given dataset is on-prem or in the cloud.


### Key Steps for Programmatic Data Access

There are a few key steps for accessing data from the NASA DAAC APIs:
1. Authenticate with NASA Earthdata Login (and, for cloud-hosted data, with AWS access keys and a token).
2. Query CMR to find data using spatial and temporal filters.
3. Order and download your data.


We'll go through each of these steps during this tutorial, at the end summarizing how `earthdata` streamlines this process into a minimal number of lines of code.

___
___


### **Step 0. Import classes**

In [1]:
# Import classes from earthdata

from earthdata import Auth, DataCollections, DataGranules, Accessor

### **Step 1. Earthdata login**

To access data using the **earthdata** library, you need to log into [Earthdata Login](https://urs.earthdata.nasa.gov/). To do this, execute the following code cell and enter your NASA Earthdata credentials when prompted.

**Note**: If you don't have NASA Earthdata credentials, you must first register at the link above. You don't need to be a NASA employee to register with NASA Earthdata! If you did not enter your Earthdata Login username and email into the form in the pre-Hackweek email, you will not be on the ICESat-2 cloud data early access list and will not have access to ICESat-2 data in the cloud. You will still have access to all publicly available data sets.



In [2]:
# Instantiate an Auth object and log in; login() prompts for your Earthdata Login username and password.
auth = Auth()
auth.login()

Enter your Earthdata Login username:  mbeig
Enter your Earthdata password:  ···············


You're now authenticated with NASA Earthdata Login


True

### **Step 2. Query the Common Metadata Repository (CMR)**

### Query CMR for Data Collections


You can narrow your search in CMR by filtering on keyword, temporal range, area of interest, and data provider (among other things), e.g.:
- temporal("2016-01-01", "2020-12-12")
- keyword('fire')
- bounding_box(-134.7,58.9,-133.9,59.2)
- provider("NSIDC_ECS")
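
As a rough illustration of what these filters amount to, the sketch below assembles a CMR collection-search URL by hand using only the standard library. `build_cmr_query` is a hypothetical helper, not part of **earthdata**. The parameter names follow the public CMR search API, and expanding date-only strings to full-day timestamps is an assumption made for this example. No request is sent.

```python
from urllib.parse import urlencode

# Public CMR collection search endpoint (JSON response format).
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_cmr_query(keyword=None, temporal=None, bounding_box=None, provider=None):
    """Assemble a CMR collection search URL from optional filters (hypothetical helper)."""
    params = {}
    if keyword:
        params["keyword"] = keyword
    if temporal:
        start, end = temporal
        # Assumption: date-only strings are expanded to cover whole days in UTC.
        params["temporal"] = f"{start}T00:00:00Z,{end}T23:59:59Z"
    if bounding_box:
        # Order: west, south, east, north (lower-left corner, then upper-right).
        params["bounding_box"] = ",".join(str(value) for value in bounding_box)
    if provider:
        params["provider"] = provider
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params)}"


url = build_cmr_query(
    keyword="fire",
    temporal=("2016-01-01", "2020-12-12"),
    bounding_box=(-134.7, 58.9, -133.9, 59.2),
    provider="NSIDC_ECS",
)
print(url)
```

The chained methods in the notebook play the same role as the keyword arguments here: each one narrows the search by adding one more filter.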


Example: Set up a Query object that will return all NSIDC cloud-hosted collections (NSIDC_CPRD = cloud-hosted, NSIDC_ECS = on-prem). **Important note:** Because the NSIDC cloud data are behind restricted access, you need to pass the `auth` instance even when only querying CMR.

In [None]:
Query = DataCollections(auth).provider("NSIDC_CPRD")

Find out how many Query hits were returned using the `hits` method.

In [None]:
print(f'Collections found: {Query.hits()}') 

Oh no!  What!?  Zero hits? :(   

The `hits` method above prints the number of query hits, but only for publicly available data sets.
Because cloud-hosted ICESat-2 data are not yet publicly available, CMR returns 0 hits if you filter DataCollections by provider = NSIDC_CPRD.
For now we need an alternative way to see how many cloud data sets are available at NSIDC. We can create a collections object (we're going to want one of
these soon anyhow) and print its `len()` to see the true number of hits.

Create a collections object

In [None]:
collections = Query.get()

# Print the length of the object to determine the number of hits.

print(len(collections))

Identify the data sets returned by your query. You can extract information from CMR about the data sets by using the `fields` method. You can choose which UMM fields to print, e.g. ShortName or Abstract, or you can grab all of the UMM fields using "*".

In [None]:
collections = Query.fields(['ShortName','Abstract']).get()
# Inspect 5 results printing just the ShortName and Abstract
print(collections[0:5])

Typically we'd like to narrow down our search by more than just 'provider'.  Examples below employ more methods to filter data.

In [None]:
# Query = DataCollections(auth).keyword('fire').temporal("2016-01-01", "2020-12-12")
#Query = DataCollections(auth).keyword('GEDI').bounding_box(-134.7,58.9,-133.9,59.2)
#Query = DataCollections(auth).keyword('elevation').bounding_box(-134.7,58.9,-133.9,59.2).provider("NSIDC_ECS")
Query = DataCollections(auth).keyword('ice').bounding_box(-134.7,58.9,-133.9,59.2).provider("NSIDC_ECS")

print(f'Collections found: {Query.hits()}')

collections = Query.fields(['ShortName']).get()
# Inspect 5 results, printing just the ShortName
print(collections[0:5])

The results from DataCollections (and DataGranules) are enhanced Python dict objects, so we can also select which CMR metadata fields to display by key.

In [None]:
collections[0]["umm.ShortName"]
#collections[0]["meta.concept-id"]
#collections[0]["umm.RelatedUrls"]
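
To make the dotted-key lookup above less mysterious, here is a minimal sketch of how a `dict` subclass could resolve keys such as `"umm.ShortName"` by walking nested dictionaries. `DottedDict` and the sample record are hypothetical and only illustrate the idea; this is not the library's actual implementation.

```python
class DottedDict(dict):
    """dict that also resolves 'a.b.c' keys by walking nested dicts (illustrative only)."""

    def __getitem__(self, key):
        # Only treat the key as a path if it is not a literal key of this dict.
        if isinstance(key, str) and "." in key and key not in self:
            node = self
            for part in key.split("."):
                # Use the plain dict lookup at every level to avoid recursion.
                node = dict.__getitem__(node, part)
            return node
        return dict.__getitem__(self, key)


# Hypothetical record shaped like a CMR result, with "meta" and "umm" sections.
record = DottedDict(
    {
        "meta": {"concept-id": "C0000000000-EXAMPLE"},
        "umm": {"ShortName": "EXAMPLE", "Version": "001"},
    }
)

print(record["umm.ShortName"])         # dotted access
print(record["meta"]["concept-id"])    # regular nested access still works
```

Both access styles return the same values, so the dotted form is purely a convenience for deeply nested metadata.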

The DataCollections class returns python dictionaries with some handy methods.

```python 
collection.concept_id() # returns the concept-id, used to search for data granules
collection.abstract() # returns the abstract
collection.landing_page() # returns the landing page if present in the UMM fields
collection.get_data() # returns the portal where data can be accessed.
```

The same results can be obtained using the `dict` syntax:

```python
collection["meta"]["concept-id"] # concept-id
collection["umm"]["RelatedUrls"] # URLs, with GET DATA, LANDING PAGE etc
```


In [None]:
collections[0]["meta"]["concept-id"]

In [None]:
collections[0].concept_id()

In [None]:
#  earthdata allows discovery and access to data at any DAAC, and to data on prem or in the cloud.   

# Query = DataCollections(auth).provider('POCLOUD')
Query = DataCollections().provider("POCLOUD")

print(f'Collections found: {Query.hits()}')
collections = Query.fields(['ShortName']).get(20)
# Printing 3 collections
collections[0:3]

In [None]:
# Collect the concept-id for the first 10 collections (printed in the next cell)
concept_ids = [collection.concept_id() for collection in collections[0:10]]

In [None]:
print(concept_ids)

### Query CMR for Data Granules

The DataGranules class provides functionality similar to the DataCollections class. Concept IDs are unique identifiers for data sets (collections). To query for granules from the exact data set and version you are interested in, query granules by concept-id.
You can search for data granules by short name, but that could (and most likely will) return different versions of the same data granules. Even when you specify both short name and version number, a query won't distinguish between on-prem and cloud-hosted granules.
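
Because the provider is embedded in the concept-id, the identifier itself shows where a collection lives. The sketch below splits a concept-id into its parts with a regular expression. `split_concept_id` is a hypothetical helper, and the pattern assumes the usual CMR layout of a type letter, a number, a hyphen, and a provider name.

```python
import re

# Assumed layout: <type letter(s)><digits>-<provider>, e.g. C2153572325-NSIDC_CPRD
CONCEPT_ID_PATTERN = re.compile(r"^(?P<kind>[A-Z]+)(?P<number>\d+)-(?P<provider>\w+)$")


def split_concept_id(concept_id):
    """Return the kind, number, and provider parts of a concept-id (hypothetical helper)."""
    match = CONCEPT_ID_PATTERN.match(concept_id)
    if match is None:
        raise ValueError(f"Not a recognizable concept-id: {concept_id!r}")
    return match.groupdict()


# Both concept-ids appear elsewhere in this notebook.
for concept_id in ("C2153572325-NSIDC_CPRD", "C2162009387-NSIDC_ECS"):
    parts = split_concept_id(concept_id)
    print(concept_id, "-> provider:", parts["provider"])
```

With the provider names used in this notebook, `NSIDC_CPRD` indicates cloud-hosted data and `NSIDC_ECS` indicates on-prem data.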

In this example we're querying for 20 data granules from the ICESat-2 [ATL03](https://nsidc.org/data/ATL03/versions/) version `005` dataset.

In [None]:
# Generally speaking we won't need the auth instance for queries to collections and granules
# Query = DataGranules().short_name('ATL03').version("003").bounding_box(-134.7,58.9,-133.9,59.2)

Query = DataGranules(auth).short_name('ATL03').version("005").bounding_box(-134.7,58.9,-133.9,59.2)
granules = Query.get(20)
print(granules[0:10])


In this query we use the concept-id for cloud-hosted ATL03 v005 granules.

In [None]:
Query = DataGranules(auth).concept_id('C2153572325-NSIDC_CPRD')
granules = Query.get(10)
print(Query.hits())
print(len(granules))

## Pretty printing data granules

Since we are in a notebook, we can use the built-in function `display` to see a more user-friendly version of the granules.
This renders a browse image for the granule if one is available, and will eventually offer a representation similar to the one in the Earthdata Search portal.
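
One common way for an object to get a rich rendering in Jupyter is to define a `_repr_html_` method, which `display` picks up when present. The sketch below shows the mechanism with a hypothetical `GranulePreview` class; the field names and HTML layout are assumptions for illustration and do not reproduce the library's output.

```python
from html import escape


class GranulePreview:
    """Hypothetical stand-in for a granule record with a notebook-friendly view."""

    def __init__(self, name, size_mb, browse_url=None):
        self.name = name
        self.size_mb = size_mb
        self.browse_url = browse_url

    def _repr_html_(self):
        # Jupyter calls this method, when present, to render the object as HTML.
        if self.browse_url:
            image = f'<img src="{escape(self.browse_url)}" width="200"/>'
        else:
            image = "<em>no browse image</em>"
        return f"<div><b>{escape(self.name)}</b> ({self.size_mb:.1f} MB)<br/>{image}</div>"


preview = GranulePreview("example_granule.h5", 123.4)
# Outside a notebook we can still inspect the generated HTML directly.
print(preview._repr_html_())
```

In a notebook, calling `display(preview)` or leaving `preview` as the last expression of a cell would show the rendered HTML instead of the raw string.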

In [None]:
# Print 2 granules using display
for granule in granules[0:2]:
    display(granule)

### Spatiotemporal queries

As mentioned above, the granule and collection classes accept the same spatial and temporal arguments as CMR, so we can search for granules that match spatiotemporal criteria. Below are some additional examples of applying these methods.
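
Swapped coordinates or dates are an easy mistake to make before a query is ever sent, so it can help to sanity-check the arguments first. The following sketch uses two hypothetical helpers, `validate_bounding_box` and `validate_temporal`, and assumes the (west, south, east, north) ordering that the coordinates in this notebook suggest.

```python
from datetime import date


def validate_bounding_box(west, south, east, north):
    """Check a (west, south, east, north) box in decimal degrees (hypothetical helper)."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("Longitudes must be between -180 and 180 degrees")
    if not (-90 <= south <= north <= 90):
        raise ValueError("Latitudes must satisfy -90 <= south <= north <= 90")
    # west > east is allowed on purpose: such a box crosses the antimeridian.
    return (west, south, east, north)


def validate_temporal(start, end):
    """Parse two ISO dates (YYYY-MM-DD) and check their order (hypothetical helper)."""
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError("The start date must not be after the end date")
    return start_date, end_date


box = validate_bounding_box(-134.7, 58.9, -133.9, 59.2)
period = validate_temporal("2020-03-01", "2020-03-30")
print(box, period)
```

The validated values can then be passed on to `bounding_box(...)` and `temporal(...)` exactly as in the cells that follow.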



In [None]:
Query = DataGranules().short_name("ATL03").temporal("2020-03-01", "2020-03-30").bounding_box(-134.7,58.9,-133.9,59.2).version("005")
# Always inspect the hits before retrieving the granule metadata, because the metadata is very verbose.
print(f"Granules found: {Query.hits()}")

In [None]:
# Bounding box over the Alaska/Canada border for cloud-hosted ATL03 v005
Query = DataGranules(auth).concept_id("C2153572325-NSIDC_CPRD").temporal("2020-03-01", "2020-03-30").bounding_box(-134.7,58.9,-133.9,59.2)
#bounding box over north Greenland/Arctic Ocean for ATL10 v005 
#Query = DataGranules(auth).concept_id("C2162009387-NSIDC_ECS").temporal("2020-03-01", "2020-03-30").bounding_box(-113.9,81.6,-49.2,86.7)
# Always inspect the hits before retrieving the granule metadata, because the metadata is very verbose.
print(f"Granules found: {Query.hits()}")
granules = Query.get(10)
print(len(granules))

In [None]:
# Now we can print some info about these granules using the built-in methods
granules = Query.get(4)
data_links = [{'links': g.data_links(), 'size (MB):': g.size()} for g in granules]
data_links

In [None]:
# More datasets to try

# C1908348134-LPDAAC_ECS: GEDI L2A Elevation and Height Metrics Data Global Footprint Level V002
# C1968980609-POCLOUD: Sentinel-6A MF Jason-CS L2 P4 Altimeter Low Resolution (LR) STC Ocean Surface Topography
# C1575731655-LPDAAC_ECS: ASTER Global Digital Elevation Model NetCDF V003
# Query = DataGranules(auth).short_name('ATL03').version("003")
Query = DataGranules().short_name('ATL03').version("004").bounding_box(-134.7,58.9,-133.9,59.2)
# Query = DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)
print(f"Granules found: {Query.hits()}")

In [None]:
# Not all granules have data previews. If they do, the granule class shows up to 2 preview images when rendered with Jupyter's display() function.
granules = Query.get(10)
for g in granules[0:5]:
    display(g)

In [None]:
# Granules are python dictionaries, with fancy nested key/value notation and some extra built-in methods.
granules[0]["umm.TemporalExtent.RangeDateTime"]

In [None]:
# Size in MB
data_links = [{'links': g.data_links(), 'size (MB):': g.size()} for g in granules]
data_links
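
Knowing the size of each granule makes it possible to estimate the download volume before fetching anything. The sketch below works on hypothetical records shaped like the `data_links` list above (the URLs and sizes are invented for illustration) and shows one way to total the sizes and to pick the smallest granules that fit within a budget.

```python
# Hypothetical records shaped like the `data_links` list built above.
sample_links = [
    {"links": ["https://example.com/granule_a.h5"], "size (MB):": 1250.5},
    {"links": ["https://example.com/granule_b.h5"], "size (MB):": 310.0},
    {"links": ["https://example.com/granule_c.h5"], "size (MB):": 2048.25},
]


def total_size_mb(records, size_key="size (MB):"):
    """Sum the sizes of all records, in MB."""
    return sum(record[size_key] for record in records)


def within_budget(records, budget_mb, size_key="size (MB):"):
    """Select records from smallest to largest until the budget would be exceeded."""
    selected, used = [], 0.0
    for record in sorted(records, key=lambda r: r[size_key]):
        if used + record[size_key] > budget_mb:
            break
        selected.append(record)
        used += record[size_key]
    return selected


print(f"Total: {total_size_mb(sample_links)} MB")
print(f"Granules that fit in 2000 MB: {len(within_budget(sample_links, 2000))}")
```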

## **Step 3. Accessing the data**

The cloud is not something magical, but on-demand infrastructure is quite handy for many scientific workflows, especially if the data already lives in "the cloud".
As for NASA, data migration started in 2020 and will continue into the foreseeable future. Most, though not all, NASA data will be available in Amazon Simple Storage Service (S3).

To work with this data, the first thing we need is the proper credentials for accessing the S3 buckets. These credentials are issued on a per-DAAC basis and last only 1 hour. In the near future the Auth class will keep track of this and regenerate the credentials as needed.
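
The idea of regenerating short-lived credentials on demand can be sketched as a small time-based cache. Everything in the block below is hypothetical: `CachedCredentials` is not part of the library, `fake_fetch` stands in for a real request to a DAAC's credentials endpoint, and the clock is injected so the example runs instantly.

```python
import time


class CachedCredentials:
    """Cache a credentials object and refetch it once it expires (illustrative only)."""

    def __init__(self, fetch, lifetime_seconds=3600, clock=time.monotonic):
        self._fetch = fetch                # callable that returns fresh credentials
        self._lifetime = lifetime_seconds  # 1 hour, as described above
        self._clock = clock                # injectable so the demo needs no waiting
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._lifetime
        return self._value


calls = []


def fake_fetch():
    # Stand-in for a real credentials request; returns a made-up token.
    calls.append(1)
    return {"token": f"token-{len(calls)}"}


fake_now = [0.0]
creds = CachedCredentials(fake_fetch, lifetime_seconds=3600, clock=lambda: fake_now[0])

first = creds.get()      # fetched
fake_now[0] = 1800.0
second = creds.get()     # still valid, served from the cache
fake_now[0] = 3600.0
third = creds.get()      # expired, fetched again
print(first, second, third)
```

Keeping one such cache per DAAC would match the per-DAAC nature of the credentials.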

With `earthdata`, a researcher can get the files with the same API call regardless of whether they are on-prem or cloud-based. An important consideration is that if we want to access data in the cloud, we must run the code in the cloud. This is because some S3 buckets are configured to only allow direct access (s3:// links) if the requester is in the same region, `us-west-2`.
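
The practical consequence is that a granule may advertise both `s3://` and `https://` links, and which one is usable depends on where the code runs. The sketch below shows one possible selection rule. `pick_link` is a hypothetical helper, the links are made up, and whether the code is running in-region is passed in as a plain flag rather than detected.

```python
def pick_link(links, in_region):
    """Prefer s3:// links when running in-region, otherwise use https:// (hypothetical helper)."""
    # Direct S3 access is only attempted when the caller says we are in-region.
    schemes = ("s3://", "https://") if in_region else ("https://",)
    for scheme in schemes:
        for link in links:
            if link.startswith(scheme):
                return link
    return None  # nothing usable from this location


# Made-up links for illustration.
example_links = [
    "https://example.com/data/granule.h5",
    "s3://example-bucket/data/granule.h5",
]

print(pick_link(example_links, in_region=True))
print(pick_link(example_links, in_region=False))
```

Returning `None` when only `s3://` links exist and the code is running elsewhere makes the limitation explicit instead of failing later during the transfer.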

## On-prem access  📡

DAAC hosted data

In [None]:
# If you start the notebook from this section, run this cell first to import the classes and log in.
# Accessing data does not necessarily mean downloading it, especially in the cloud.
from earthdata import Auth, DataGranules, DataCollections, Accessor
auth = Auth()
auth.login()
access = Accessor(auth)

In [None]:
#bounding box is over big chunk of western Canada
Query = DataGranules().concept_id("C1997321091-NSIDC_ECS").bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {Query.hits()}")
# getting more than 6,000 metadata records for demo purposes is going to slow us down a bit so let's get only 100
granules = Query.get(100)

In [None]:
# No need to call display() here: Jupyter renders the last expression in a cell automatically.

granules[0]

In [None]:
%%time
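# Accessing on-prem data means downloading it if we are in a local environment, or "uploading" it if we are in the cloud.
# Note: granules_by_size is defined in the "Handy methods that work on DataGranules" cells at the end of this notebook; run those first.
files = access.get(granules_by_size[4:5], "./data/demo-atl03")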

In [None]:
import xarray as xr

ds = xr.open_mfdataset('./data/demo-atl03/ATL03_20181021222823_03560103_004_01.h5')

ds

## Cloud access ☁️

Same API, just a different place

In [None]:
Query = DataGranules().concept_id("C1968980609-POCLOUD").bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {Query.hits()}")
cloud_granules = Query.get(100)
# is this a cloud hosted data granule?
cloud_granules[0].cloud_hosted

### **Handy methods that work on DataGranules**

In [None]:
# Does this granule belong to a cloud-based collection?
granules[0].cloud_hosted

In [None]:
# since the response is an array of dictionaries we can do pythonic things like ordering the granules by size
import operator
granules_by_size = sorted(granules, key=operator.itemgetter("size"))
# Now our array is sorted by size from smallest to largest. Let's print the first 5
granules_by_size[0:5]