# NASA Earthdata API Client 🌍

## Overview

> TL;DR: **earthdata** uses NASA APIs to search, preview and access NASA datasets on-prem and in the cloud with 4 lines of Python.

There are many ways to access NASA datasets: we can use the Earthdata search portal, or DAAC-specific portals and tools.
We could even use data.gov! These web portals are great, but they are not designed for programmatic access and reproducible workflows.
Programmatic access is extremely important in the age of the cloud and reproducible open science.

The good news is that NASA also exposes APIs that allow us to search, transform and access data programmatically.
There are already some very useful client libraries for these APIs:

* python-cmr
* eo-metadata-tools
* harmony-py
* Hyrax
* others

Each of these libraries has great features, and they share some similarities, but they lack the glue that takes a researcher all the way from searching to getting the data. *Harmony-py* is probably the most complete client on the list and the future for cloud-based workflows, but as of today it only serves a small number of datasets.


### Data Formats and Cloud Access



## Querying for data collections
The `DataCollections` client can query CMR for any collection using all of CMR's query parameters and has built-in accessors for the common ones.
This makes it ideal for one-liners and more compact notation.
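
To make the mapping to CMR more concrete, here is a minimal sketch of how a keyword and a bounding box could be turned into a CMR collection search URL. It does not use `earthdata` and sends no request. The helper name `build_cmr_collection_url` is hypothetical; the endpoint and the `keyword`, `bounding_box` and `page_size` parameters are assumed from the public CMR search API, not from how `earthdata` works internally.

```python
from urllib.parse import urlencode

# Public CMR collection search endpoint (UMM-JSON results).
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json"


def build_cmr_collection_url(keyword=None, bounding_box=None, page_size=10):
    """Hypothetical helper: assemble a CMR collection search URL.

    bounding_box is (west, south, east, north) in decimal degrees.
    """
    params = {"page_size": page_size}
    if keyword is not None:
        params["keyword"] = keyword
    if bounding_box is not None:
        # CMR expects the box as a single comma-separated string.
        params["bounding_box"] = ",".join(str(v) for v in bounding_box)
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params)}"


url = build_cmr_collection_url("elevation", (-134.7, 58.9, -133.9, 59.2))
print(url)
```

The chained calls in the next cells express the same kind of query without any manual string handling.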

In [None]:
# We import the classes from earthdata
from earthdata import Auth, DataCollections, DataGranules, Accessor
auth = Auth()

In [None]:
# We can now search for collections using a pythonic API client for CMR.
# Query = DataCollections(auth).keyword('fire').temporal("2016-01-01", "2020-12-12")
# Query = DataCollections(auth).keyword('GEDI').bounding_box(-134.7,58.9,-133.9,59.2)

Query = DataCollections(auth).keyword('elevation').bounding_box(-134.7,58.9,-133.9,59.2)

print(f'Collections found: {Query.hits()}')

# Filter which UMM fields to return
collections = Query.fields(['ShortName','Abstract']).get(10)
# Inspect the first 3 results, showing just the ShortName and Abstract
collections[0:3]

In [None]:
# The results from DataCollections and DataGranules are enhanced Python dict objects, so we can still get all the fields from CMR
collections[0]["umm.ShortName"]

The DataCollections class returns python dictionaries with some handy methods.

```python 
collection.concept_id() # returns the concept-id, used to search for data granules
collection.abstract() # returns the abstract
collection.landing_page() # returns the landing page if present in the UMM fields
collection.get_data() # returns the portal where data can be accessed.
```

The same results can be obtained using the `dict` syntax:

```python
collection["meta"]["concept-id"] # concept-id
collection["umm"]["RelatedUrls"] # URLs, with GET DATA, LANDING PAGE etc
```
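
The dotted notation used earlier (`"umm.ShortName"`) and the nested `dict` syntax reach the same value. As an illustration only, a dictionary that resolves dotted keys could be sketched like this; `DottedDict` is a hypothetical class written for this example, not the class `earthdata` uses, and the record is a made-up placeholder.

```python
class DottedDict(dict):
    """Illustrative dict that also resolves 'a.b.c' style keys."""

    def __getitem__(self, key):
        # Only walk the nesting when the dotted key is not a literal key.
        if isinstance(key, str) and "." in key and key not in self:
            value = self
            for part in key.split("."):
                value = value[part]
            return value
        return super().__getitem__(key)


# Placeholder record; real CMR responses carry many more fields.
record = DottedDict(
    {
        "meta": {"concept-id": "C0000000000-EXAMPLE"},
        "umm": {"ShortName": "EXAMPLE_SHORT_NAME"},
    }
)

print(record["umm.ShortName"])           # dotted access
print(record["umm"]["ShortName"])        # plain nested access
print(record["meta.concept-id"])
```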


In [None]:
# We can also restrict the search to a single provider.
Query = DataCollections(auth).provider("POCLOUD")

print(f'Collections found: {Query.hits()}')
collections = Query.fields(['ShortName']).get(20)
# Printing the first collection
collections[0]

In [None]:
# Printing the concept-id for the first 10 collections
[collection.concept_id() for collection in collections[0:10]]

## Querying for data granules

The DataGranules class provides similar functionality to the collections class. To query for granules reliably, use the collection's concept-id as the main key.
You can also search for data granules using a short name, but that could (and most likely will) return different versions of the same data granules.

In this example we're querying for 20 data granules from the ICESat-2 [ATL03](https://nsidc.org/data/ATL03/versions/) version `003` dataset.
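
To see why a short name alone is ambiguous, the sketch below counts versions in a handful of made-up, heavily simplified records. The field names `short_name` and `version` are assumptions chosen for readability and do not mirror the real UMM structure.

```python
from collections import Counter

# Made-up, simplified records standing in for granule metadata.
granule_records = [
    {"short_name": "ATL03", "version": "003"},
    {"short_name": "ATL03", "version": "004"},
    {"short_name": "ATL03", "version": "003"},
]

# Searching by short name alone mixes every version together.
versions = Counter(
    r["version"] for r in granule_records if r["short_name"] == "ATL03"
)
print(versions)

# Pinning the version (or using a concept-id) removes the ambiguity.
only_v003 = [r for r in granule_records if r["version"] == "003"]
print(len(only_v003))
```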

In [None]:
# Generally speaking we won't need the auth instance for queries to collections and granules
# Query = DataGranules().short_name('ATL03').version("003").bounding_box(-134.7,58.9,-133.9,59.2)

Query = DataGranules(auth).short_name('ATL03').version("003").bounding_box(-134.7,58.9,-133.9,59.2)
granules = Query.get(20)
print(granules[0:2])

## Pretty printing data granules

Since we are in a notebook, we can use the built-in function `display` to see a more user-friendly version of the granules.
This will render a browse image for the granule if one is available, and eventually it will look similar to the representation in the Earthdata search portal.

In [None]:
# printing 2 granules using display
[display(granule) for granule in granules[0:2]]

### Spatiotemporal queries

Our granule and collection classes accept the same spatial and temporal arguments as CMR, so we can search for granules that match spatiotemporal criteria.
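
As a rough sketch of what those arguments look like once serialized, the helpers below validate a bounding box and a date range and format them as comma-separated strings, which is the style the CMR search API accepts. Both function names are hypothetical and the validation rules are assumptions made for this example; `earthdata` may check or format its inputs differently.

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def format_bounding_box(west, south, east, north):
    """Validate a bounding box and format it as 'west,south,east,north'."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be within [-180, 180]")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    return f"{west},{south},{east},{north}"


def format_temporal(start, end):
    """Format two 'YYYY-MM-DD' dates as an ISO 8601 'start,end' range."""
    start_dt = datetime.strptime(start, "%Y-%m-%d")
    end_dt = datetime.strptime(end, "%Y-%m-%d")
    if end_dt < start_dt:
        raise ValueError("end date precedes start date")
    return f"{start_dt.strftime(ISO_FORMAT)},{end_dt.strftime(ISO_FORMAT)}"


bbox = format_bounding_box(-134.7, 58.9, -133.9, 59.2)
temporal = format_temporal("2020-03-01", "2020-03-30")
print(bbox)
print(temporal)
```

West is allowed to be greater than east here so that boxes crossing the antimeridian are not rejected.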



In [None]:
Query = DataGranules().short_name("ATL03").temporal("2020-03-01", "2020-03-30").bounding_box(-134.7,58.9,-133.9,59.2).version("004")
print(f"Granules found: {Query.hits()}")

In [None]:
# Now we can print some info about these granules using the built-in methods
granules = Query.get(4)
data_links = [{'links': g.data_links(), 'size (MB):': g.size()} for g in granules]
data_links

In [None]:
# More datasets to try

# C1908348134-LPDAAC_ECS: GEDI L2A Elevation and Height Metrics Data Global Footprint Level V002
# C1968980609-POCLOUD: Sentinel-6A MF Jason-CS L2 P4 Altimeter Low Resolution (LR) STC Ocean Surface Topography
# C1575731655-LPDAAC_ECS: ASTER Global Digital Elevation Model NetCDF V003
# Query = DataGranules(auth).short_name('ATL03').version("003")
Query = DataGranules(auth).short_name('ATL03').version("004").bounding_box(-134.7,58.9,-133.9,59.2)
# Query = DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)
print(f"Granules found: {Query.hits()}")

In [None]:
# Not all granules have data previews. If they do, the granule class will show up to 2 preview images when using Jupyter's display() function
granules = Query.get(10)
[display(g) for g in granules[0:5]]

In [None]:
# Granules are python dictionaries, with fancy nested key/value notation and some extra built-in methods.
granules[0]["umm.TemporalExtent.RangeDateTime"]

In [None]:
# Size in MB
data_links = [{'links': g.data_links(), 'size (MB):': g.size()} for g in granules]
data_links
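
Before downloading anything it can be useful to know the total volume. The sketch below sums the sizes of a list shaped like `data_links` above; the entries and URLs are placeholders invented for the example, and the variable is named `sample_links` so it does not overwrite the real results.

```python
# Placeholder entries mirroring the shape of `data_links`.
sample_links = [
    {"links": ["https://example.com/granule_a.h5"], "size (MB):": 69.4},
    {"links": ["https://example.com/granule_b.h5"], "size (MB):": 84.6},
]

total_mb = sum(entry["size (MB):"] for entry in sample_links)
file_count = sum(len(entry["links"]) for entry in sample_links)
print(f"{file_count} files, {total_mb:.1f} MB ({total_mb / 1024:.2f} GB)")
```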

## **Accessing the data**

The cloud is not something magical, but having infrastructure on demand is quite handy for many scientific workflows, especially if the data already lives in "the cloud".
As for NASA, a data migration started in 2020 and will continue for the foreseeable future. Most, but not all, NASA data will be available on S3, the AWS object storage system.

To work with this data, the first thing we need to do is get the proper credentials for accessing data in the S3 buckets. These credentials are issued on a per-DAAC basis and last only 1 hour. In the near future the Auth class will keep track of this and regenerate the credentials.
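
One way such tracking could work is a small per-DAAC cache that re-fetches credentials shortly before the hour runs out. This is only a sketch under stated assumptions: `CredentialCache`, its parameters and the `fake_fetch` function are hypothetical and are not part of `earthdata`; a real implementation would call the DAAC's credentials endpoint inside `fetch`.

```python
import time


class CredentialCache:
    """Illustrative per-DAAC cache that refreshes credentials before expiry."""

    def __init__(self, fetch, lifetime_s=3600, margin_s=300, clock=time.monotonic):
        self._fetch = fetch            # callable: daac name -> credentials
        self._lifetime_s = lifetime_s  # credentials are valid for 1 hour
        self._margin_s = margin_s      # refresh this long before expiry
        self._clock = clock
        self._store = {}               # daac name -> (credentials, fetched_at)

    def get(self, daac):
        now = self._clock()
        cached = self._store.get(daac)
        if cached is not None:
            credentials, fetched_at = cached
            if now - fetched_at < self._lifetime_s - self._margin_s:
                return credentials
        credentials = self._fetch(daac)
        self._store[daac] = (credentials, now)
        return credentials


# Demo with a fake fetcher and a controllable clock (no network involved).
fetch_calls = []
fake_now = [0.0]


def fake_fetch(daac):
    fetch_calls.append(daac)
    return {"daac": daac, "token": f"placeholder-{len(fetch_calls)}"}


cache = CredentialCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get("POCLOUD")
fake_now[0] = 1000.0   # well inside the hour: the cached value is reused
second = cache.get("POCLOUD")
fake_now[0] = 3400.0   # inside the refresh margin: credentials are re-fetched
third = cache.get("POCLOUD")
print(len(fetch_calls))
```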

With `earthdata` a researcher can get the files with the same API call regardless of whether they are on-prem or cloud-based. An important consideration is that if we want to access data in the cloud, we must run the code in the cloud. This is because some S3 buckets are configured to only allow egress if the requester is in the same region, `us-west-2`.
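
A quick, imperfect way to check where the code is running is to look at the AWS region environment variables. The helper below is a hypothetical sketch: `AWS_REGION` and `AWS_DEFAULT_REGION` are standard AWS variables, but they are not guaranteed to be set on every cloud machine, so a `False` result does not prove we are outside `us-west-2`.

```python
import os


def running_in_region(region="us-west-2", env=None):
    """Return True if the AWS region environment variables match `region`."""
    env = os.environ if env is None else env
    current = env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
    return current == region


if running_in_region():
    print("Looks like us-west-2: direct S3 access may work.")
else:
    print("Region not detected as us-west-2: expect cloud access to be denied.")
```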

## On-prem access

In [1]:
# If we want to start the notebook from here we need to execute this cell and uncomment the lines below
# Accessing does not necessarily mean downloading, especially in the cloud.
from earthdata import Auth, DataGranules, DataCollections, Accessor
auth = Auth()
access = Accessor(auth)

Enter your Earthdata Login username:  betolink
Enter your Earthdata password:  ········


You're now authenticated with NASA Earthdata Login


In [2]:
Query = DataGranules(auth).concept_id("C1997321091-NSIDC_ECS").bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {Query.hits()}")

Granule hits: 6230


In [6]:
# getting more than 6,000 metadata records for demo purposes is going to slow us down a bit so let's get only 100
granules = Query.get(100)
# since the response is an array of dictionaries we can do pythonic things like ordering the granules by size
import operator
granules_by_size = sorted(granules, key=operator.itemgetter("size"))
# now our array is sorted by size from smallest to largest. Let's print the first 5
granules_by_size[0:5]

[Collection: {'EntryTitle': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004'}
 Spatial coverage: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -104.98588858627349, 'StartLatitude': 59.5, 'StartDirection': 'A', 'EndLatitude': 80.0, 'EndDirection': 'A'}}}
 Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2018-10-21T22:33:47.886Z', 'EndingDateTime': '2018-10-21T22:33:49.005Z'}}
 Size(MB): 69.4352331161
 Data: ['https://n5eil01u.ecs.nsidc.org/DP9/ATLAS/ATL03.004/2018.10.21/ATL03_20181021222823_03560103_004_01.h5'],
 Collection: {'EntryTitle': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V004'}
 Spatial coverage: {'HorizontalSpatialDomain': {'Orbit': {'AscendingCrossing': -110.69603039866222, 'StartLatitude': 27.0, 'StartDirection': 'A', 'EndLatitude': 59.5, 'EndDirection': 'A'}}}
 Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2018-10-20T22:54:00.883Z', 'EndingDateTime': '2018-10-20T22:54:02.716Z'}}
 Size(MB): 84.6457881927
 Data: ['https:/

In [5]:
%%time
# accessing the data on-prem means downloading the files if we are in a local environment or "uploading" them if we are in the cloud.
files = access.get(granules_by_size[0:3], "./data/demo-atl03")

SUBMITTING | :   0%|          | 0/3 [00:00<?, ?it/s]

PROCESSING | :   0%|          | 0/3 [00:00<?, ?it/s]

COLLECTING | :   0%|          | 0/3 [00:00<?, ?it/s]

## Cloud access

Same API, just a different place

In [12]:
Query = DataGranules(auth).concept_id("C1968980609-POCLOUD").bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {Query.hits()}")

Granule hits: 605


In [14]:
import operator

granules = Query.get(100)
cloud_granules_by_size = sorted(granules, key=operator.itemgetter("size"))
# now our array is sorted by size from smallest to largest. Let's print the first 3
cloud_granules_by_size[0:3]

[Collection: {'Version': 'F', 'ShortName': 'JASON_CS_S6A_L2_ALT_LR_STD_OST_STC_F'}
 Spatial coverage: {'HorizontalSpatialDomain': {'Geometry': {'Lines': [{'Points': [{'Latitude': -65.651299, 'Longitude': 62.291185}, {'Latitude': -62.668781, 'Longitude': 90.38672}, {'Latitude': -55.249185, 'Longitude': 109.742412}, {'Latitude': -45.587235, 'Longitude': 121.805687}, {'Latitude': -34.84169, 'Longitude': 129.80194}, {'Latitude': -19.907954, 'Longitude': 137.2323}, {'Latitude': -8.237538, 'Longitude': 141.77837}, {'Latitude': 3.531965, 'Longitude': 146.004336}, {'Latitude': 15.288458, 'Longitude': 150.365636}, {'Latitude': 26.915319, 'Longitude': 155.364525}, {'Latitude': 38.24408, 'Longitude': 161.7267}, {'Latitude': 48.952765, 'Longitude': 170.698818}, {'Latitude': 58.344339, 'Longitude': -175.462384}, {'Latitude': 64.949148, 'Longitude': -153.453304}, {'Latitude': 66.647046, 'Longitude': -131.873032}, {'Latitude': 65.650127, 'Longitude': -131.794596}, {'Latitude': 63.952229, 'Longitude':

In [15]:
%%time

# If we get
files = access.get(cloud_granules_by_size[0:3], "./data/demo-POCLOUD")

PermissionError: Access Denied
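
An `Access Denied` like the one above is consistent with the two constraints mentioned earlier: running outside `us-west-2`, or using credentials older than 1 hour. A thin wrapper can turn the bare error into a more actionable message. The sketch below uses hypothetical names (`get_or_explain`, `denied_fetch`) and a stand-in function instead of the real `access.get`, so it runs anywhere.

```python
def get_or_explain(fetch, granules, target_dir):
    """Call `fetch` and turn a PermissionError into an actionable message."""
    try:
        return fetch(granules, target_dir)
    except PermissionError as err:
        raise RuntimeError(
            "Access denied by the data provider. Check that this code runs in "
            "AWS us-west-2 and that the S3 credentials are less than 1 hour old."
        ) from err


def denied_fetch(granules, target_dir):
    """Stand-in for a download call that the bucket rejects."""
    raise PermissionError("Access Denied")


try:
    get_or_explain(denied_fetch, ["granule-1"], "./data/demo")
except RuntimeError as err:
    message = str(err)
    print(message)
```

In the notebook, `fetch` would be `access.get`, assuming it raises `PermissionError` as shown in the output above.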

In [None]:
import xarray as xr


ds = xr.open_dataset('./data/ASTGTMV003_N59W112_dem.nc')
ds

In [None]:
ds.ASTER_GDEM_DEM.plot()


## Recap

```python
from earthdata import Auth, DataGranules, DataCollections, Accessor
auth = Auth()
access = Accessor(auth)

Query = DataGranules(auth).concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)
granules = Query.get(10)
# preview the data granules
granules 
# get the files
files = access.get(granules)


```

**Wait, we said 4 lines of python, we meant 3!**

```python

from earthdata import Auth, DataGranules, DataCollections, Accessor
auth = Auth()
files = Accessor(auth).get(DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2).get(10))

# Profit!
```

In [None]:
from earthdata import Auth, DataGranules, DataCollections, Accessor
auth = Auth()
files = Accessor(auth).get(DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2).get(10))


### Related links

**CMR** API documentation: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

**EDL** API documentation: https://urs.earthdata.nasa.gov/

NASA OpenScapes: https://nasa-openscapes.github.io/earthdata-cloud-cookbook/

NSIDC: https://nsidc.org