# <img src="https://logos-world.net/wp-content/uploads/2020/05/NASA-Logo-1959-present.png" width="100px" align="middle" /> NASA Earthdata API Client 🌍

## Overview

#### TL;DR: [**earthdata**](https://github.com/nsidc/earthdata) is a Python package to search, preview and access NASA datasets (on-prem or in the cloud) with a few lines of code.

```python

from earthdata import Auth, DataGranules, Store

# first we authenticate with NASA EDL
auth = Auth().login(strategy="netrc")

# Then we build a Query with spatiotemporal parameters
GranuleQuery = DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)

# We get the metadata records from CMR
granules = GranuleQuery.get()

# Now it's time to download (or open) our data granules list with get()
files = Store(auth).get(granules, local_path='./data')

# Now to the important science!
```


### Why use the `earthdata` Python package?

There are many ways to access NASA datasets. We can use the [Earthdata search portal](https://search.earthdata.nasa.gov/), DAAC-specific portals or tools, or even [data.gov](https://data.gov)! Web portals are great, but they are not designed for programmatic access and reproducible workflows, both of which matter greatly in the age of the cloud and open science.

The good news is that NASA also exposes APIs that allow us to search, transform and access data programmatically. Many of the libraries that use these APIs (compared in the table below) contain amazing features and share some similarities. In this context, **earthdata** aims to be a simple library that deals with the important parts of the metadata so we can access or download data without having to worry about whether a given dataset is on-prem or in the cloud.


Library | Language Agnostic | On-Prem Access | Cloud Access | Programmatic | Subsetting | GIS Operations | Authentication | Full Archive Coverage|
-------- | ----- | -------- | ----- | -------- | ----- | -------- | ----- | -----
**earthdata** | Python| ✅ | ✅ | ✅ | No | No | ✅ | ✅
**HarmonyPy** | Python* | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | No
**OpenDAP** | ✅ | ✅ | No | ✅ | ✅ | No | No |  ✅ 
**cmr-stac** | Python | ✅ | ✅ |✅ | No | No | No |  ✅ 
**Earthdata Portal** | ✅ | ✅ | ✅| No | No | No | ✅ |  ✅ 
**GDAL** | ✅* | ✅ | ✅ |✅ | No | ✅* | ✅* |  ✅ 
**rsat** | R | ✅ | No |✅ | No | ✅* | ✅* |  No
**getSpatialData** | R | ✅ | No |✅ | No | ✅* | ✅* |  No

: Earthdata Access Comparison {tbl-colwidths="[50,50]"}


### Installing earthdata with conda/mamba

```bash
conda install -c conda-forge earthdata
```

### NASA EDL and the Auth class

What is Earthdata Login (EDL)?

Earthdata Login provides free and immediate access to thousands of EOSDIS data products covering all Earth science disciplines and topic areas for researchers, applied science users, application developers, and the general public. For more information about Earthdata Login benefits, features, and terms of service, see "What do I need to know about Earthdata Login". To learn more about EOSDIS and its mission to meet the needs of diverse users, please visit the [Earthdata Website](https://urs.earthdata.nasa.gov/documentation/for_users/welcome).


The `Auth` class handles authentication with NASA Earthdata for both on-prem and cloud-hosted datasets.

```python
# We import the classes from earthdata
from earthdata import Auth, DataCollections, DataGranules, Store

auth = Auth()

# First we try to use a .netrc file; if it's not present we fall back to the interactive login
if not auth.login(strategy="netrc"):
    auth.login(strategy="interactive")
```
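
The `netrc` strategy relies on a `.netrc` file that contains an entry for Earthdata Login. As a minimal sketch, and assuming the EDL host name is `urs.earthdata.nasa.gov` (the host used in the links in this document), the standard library can tell us whether such an entry exists before we call `auth.login`. The helper name `has_edl_credentials` and the demo credentials are made up for illustration; they are not part of `earthdata`.

```python
import netrc
import os
import tempfile

# Assumption: Earthdata Login credentials are stored under this host name.
EDL_HOST = "urs.earthdata.nasa.gov"


def has_edl_credentials(netrc_path):
    """Return True if the netrc file has a login and password for EDL_HOST."""
    try:
        entry = netrc.netrc(netrc_path).authenticators(EDL_HOST)
    except (OSError, netrc.NetrcParseError):
        # Missing, unreadable or malformed file: treat as "no credentials".
        return False
    # authenticators() returns a (login, account, password) tuple or None.
    return entry is not None and bool(entry[0]) and bool(entry[2])


# Demo with a throwaway file so a real ~/.netrc is never touched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_netrc")
with open(demo_path, "w") as f:
    f.write(f"machine {EDL_HOST} login demo_user password demo_password\n")

print(has_edl_credentials(demo_path))
print(has_edl_credentials(os.path.join(demo_dir, "missing_netrc")))
```

A check like this only confirms that an entry is present; whether the credentials are valid is still decided when `auth.login` contacts EDL.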

## 1. **Querying for data collections** (datasets)

The `DataCollections` client can query CMR for any collection using all of CMR's query parameters and has built-in accessors for the common ones. This makes it ideal for one-liners and easier search notation.

> Note: use bbox finder to get bounding box coordinates of an area of interest, [bboxfinder](http://bboxfinder.com/#0.000000,0.000000,0.000000,0.000000)

**[CMR API Documentation](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html)**
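
For context, a query built with `DataCollections` ends up as a request to the CMR search API. The sketch below builds, but does not send, a roughly equivalent raw URL using only the standard library. The endpoint path and the parameter names (`keyword`, `bounding_box`, `page_size`) are assumptions based on the CMR API documentation linked above, and `build_collection_search_url` is a hypothetical helper, not something `earthdata` provides.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: endpoint and parameter names as described in the CMR search API docs.
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.umm_json"


def build_collection_search_url(keyword, bounding_box, page_size=5):
    """Build (but do not send) a CMR collection search URL."""
    params = {
        "keyword": keyword,
        # west, south, east, north joined into a comma-separated string
        "bounding_box": ",".join(str(value) for value in bounding_box),
        "page_size": page_size,
    }
    # Keep commas readable instead of percent-encoding them.
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params, safe=',')}"


url = build_collection_search_url("earth wind fire", (-134.7, 58.9, -133.9, 59.2))
print(url)

# Round-trip the query string to see the individual parameters.
print(parse_qs(urlsplit(url).query))
```

The client saves us from writing and maintaining this kind of string handling by hand.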

```python
from pprint import pprint
# We can now search for collections using a pythonic API client for CMR.

# CollectionQuery = DataCollections().keyword('elevation change').bounding_box(-134.7,58.9,-133.9,59.2).temporal("2016-01-01", "2020-12-12")

CollectionQuery = DataCollections().parameters(
    keyword = 'earth wind fire',
    bounding_box = (-134.7,58.9,-133.9,59.2),
    temporal = ("2016-01-01", "2020-12-12")
)

print(f'Collections found: {CollectionQuery.hits()}')

# Filter which UMM fields are returned using fields(); meta is always included
collections = CollectionQuery.fields(['ShortName', 'Abstract', 'Version']).get(5)
# Inspect 5 results, printing just the concept-id and version
for collection in collections:
    # print(collection["meta"]["concept-id"])
    print(collection.concept_id(), collection.version())
    # pprint(collection)
    # display(collection)
```

The `DataCollections` class returns Python dictionaries with some handy methods.

```python 
collection.concept_id() # returns the concept-id, used to search for data granules
collection.abstract() # returns the abstract
collection.landing_page() # returns the landing page if present in the UMM fields
collection.get_data() # returns the portal where data can be accessed.
```

The same results can be obtained using the `dict` syntax:

```python
collection["meta"]["concept-id"] # concept-id
collection["umm"]["RelatedUrls"] # URLs, with GET DATA, LANDING PAGE etc
```
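
To illustrate the `dict` syntax without querying CMR, here is a small sketch that filters related URLs by type. The record below is hand-written sample data, not real CMR metadata, and the `URL` and `Type` keys inside `RelatedUrls` are assumptions about the record layout. `urls_of_type` is a hypothetical helper.

```python
# Hand-written sample shaped like a collection record; not real CMR metadata.
sample_collection = {
    "meta": {"concept-id": "C0000000000-EXAMPLE"},
    "umm": {
        "ShortName": "EXAMPLE_DATASET",
        "RelatedUrls": [
            {"URL": "https://example.com/data", "Type": "GET DATA"},
            {"URL": "https://example.com/about", "Type": "LANDING PAGE"},
        ],
    },
}


def urls_of_type(collection, url_type):
    """Return the URLs in umm.RelatedUrls whose Type equals url_type."""
    # .get() with defaults avoids a KeyError when a field is missing.
    related = collection.get("umm", {}).get("RelatedUrls", [])
    return [item["URL"] for item in related if item.get("Type") == url_type]


print(sample_collection["meta"]["concept-id"])
print(urls_of_type(sample_collection, "GET DATA"))
```

Using `.get()` matters because not every collection is guaranteed to carry every field.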


```python
# What if we want cloud collections?
CollectionQuery = DataCollections().daac("PODAAC").cloud_hosted(True)

print(f'Collections found: {CollectionQuery.hits()}')
collections = CollectionQuery.fields(['ShortName']).get(10)
# Displaying the first collection
collections[0]
```

```python
# Printing the concept-id for the first 10 collections
[collection.concept_id() for collection in collections]
```

### Cloud or On-prem with a simple parameter

* `cloud_hosted(True)` will return cloud collections
* `cloud_hosted(False)` will return on-prem collections

```python
ShortName = "SMAP_JPL_L3_SSS_CAP_8DAY-RUNNINGMEAN_V5"

collections = DataCollections().short_name(ShortName).cloud_hosted(True).get()

for collection in collections:
    concept_id = collection.concept_id()
    print(concept_id)
```
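
The concept IDs used in this document, such as `C2208422957-POCLOUD`, appear to follow a `<letter><number>-<provider>` pattern. As a sketch under that assumption, a small helper can split an ID into its parts, which is handy when we need the provider name. `split_concept_id` is a hypothetical helper and the regular expression is a guess at the format, not an official definition.

```python
import re

# Assumed pattern: one uppercase letter, digits, a dash, then the provider ID.
CONCEPT_ID_PATTERN = re.compile(r"^([A-Z])(\d+)-([A-Za-z0-9_]+)$")


def split_concept_id(concept_id):
    """Split a concept ID into (kind, number, provider)."""
    match = CONCEPT_ID_PATTERN.match(concept_id)
    if match is None:
        raise ValueError(f"unexpected concept ID format: {concept_id!r}")
    kind, number, provider = match.groups()
    return kind, int(number), provider


for cid in ("C2208422957-POCLOUD", "C1972955240-PODAAC", "C1575731655-LPDAAC_ECS"):
    print(split_concept_id(cid))
```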

## 2. **Querying for data granules**

The `DataGranules` class provides functionality similar to the collections class. The most reliable way to query for granules is by concept-id.
You can also search for data granules using a short name, but that could (and more likely will) return different versions of the same data granules.

In this example we're querying for 3 data granules from the ICESat-2 [ATL03](https://nsidc.org/data/ATL03/versions/) version `005` dataset.

```python
# Generally speaking we won't need the auth instance for queries to collections and granules
# Query = DataGranules().short_name('ATL03').version("005").bounding_box(-134.7,58.9,-133.9,59.2)

GranuleQuery = DataGranules().parameters(
    short_name = "ATL03",
    version = "005",
    bounding_box = (-134.7,58.9,-133.9,59.2),
    # day_night_flag = "day",
    # cloud_cover = (0,25),
    # instrument = "MODIS",
    # platform = "TERRA"
)

granules = GranuleQuery.get(3)

for granule in granules:
    # print(granule)
    # pprint(granule)
    display(granule)
```
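
The `bounding_box` values used in the queries above are ordered west, south, east, north (lower-left longitude and latitude, then upper-right longitude and latitude). Swapping latitude and longitude is an easy mistake to make, so a quick check before querying can help. This is only a sketch: `validate_bounding_box` is a hypothetical helper, not part of `earthdata`.

```python
def validate_bounding_box(bbox):
    """Check a (west, south, east, north) tuple given in decimal degrees."""
    if len(bbox) != 4:
        raise ValueError("expected 4 values: west, south, east, north")
    west, south, east, north = (float(value) for value in bbox)
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be between -180 and 180")
    if not (-90 <= south <= 90 and -90 <= north <= 90):
        raise ValueError("latitudes must be between -90 and 90")
    if south > north:
        raise ValueError("south must not be greater than north")
    # west may exceed east when a box crosses the antimeridian,
    # so that case is deliberately not rejected here.
    return west, south, east, north


print(validate_bounding_box((-134.7, 58.9, -133.9, 59.2)))
```

With the values from this document the check passes. Passing the same box with latitude first, as in `(58.9, -134.7, 59.2, -133.9)`, raises a `ValueError` because `-134.7` is not a valid latitude.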

## 3. **Accessing the data**

With `earthdata` a researcher can get the files with the same API call regardless of whether they are on-prem or cloud-based. One important consideration is that to access data in the cloud directly we must run the code in the cloud, because some S3 buckets are configured to only allow direct access (`s3://` links) if the requester is in the same AWS region, `us-west-2`.
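
To make the in-region rule concrete, here is a minimal sketch of how one might pick between `s3://` and `https://` links for a granule. It is not how `earthdata` decides internally: `choose_links` is a hypothetical helper, the caller supplies the `in_region` flag, and the links are made-up examples.

```python
def choose_links(links, in_region):
    """Prefer s3:// links when running in-region, otherwise https:// links."""
    wanted_prefix = "s3://" if in_region else "https://"
    return [link for link in links if link.startswith(wanted_prefix)]


# Made-up links for illustration only.
example_links = [
    "s3://example-bucket/granule_001.nc",
    "https://example.com/granule_001.nc",
]

print(choose_links(example_links, in_region=True))
print(choose_links(example_links, in_region=False))
```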

### **On-prem access**  📡

DAAC hosted data

```python
# C2208422957-POCLOUD, cloud hosted

cloud_collection = "C2208422957-POCLOUD"
onprem_collection = "C1972955240-PODAAC"

# The store class will get us the granules from their location
store = Store(auth)

GranuleQuery = DataGranules().concept_id(onprem_collection).bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {GranuleQuery.hits()}")
# getting more than 6,000 metadata records for demo purposes is going to slow us down a bit so let's get only 10
onprem_granules = GranuleQuery.get(10)
# Does this granule belong to a cloud-based collection?
onprem_granules[0].cloud_hosted
```

```python
%%time
# accessing the data on prem means downloading it if we are in a local environment or "uploading" it if we are in the cloud.
files = store.get(onprem_granules[0:3], "./data/")
```

### **Cloud access** ☁️

Same API, just a different place

```python
GranuleQuery = DataGranules().concept_id(cloud_collection).bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {GranuleQuery.hits()}")
cloud_granules = GranuleQuery.get(10)
# is this a cloud hosted data granule?
cloud_granules[0].cloud_hosted
```

```python
%%time
# Same call as for the on-prem granules; only the location of the data is different
files = store.get(cloud_granules[0:3], "./data/")
```

### ☁️ **Cloud Access Part II: Opening files with S3FS**

Being in the cloud allows us to stream data as if we were using it locally. Pairing gridded datasets on S3 with xarray is a very useful pattern when we deal with a lot of data.

```python
# data_links = cloud_granules[0].data_links(s3_only=True)
https_links = []
s3_links = []

# TODO: earthdata should be able to infer the provider based on URL,
# it does it when we use the DataGranules but not with plain URLs
fs = Store(auth).get_s3fs_session('POCLOUD')

for granule in cloud_granules[0:10]:
    https_links.append(granule.data_links()[0])
    s3_links.append(granule.data_links(s3_only=True)[0])

fileset = [fs.open(s3_granule) for s3_granule in s3_links[0:5]]
```

```python
%%time

import xarray as xr

lonrange = [-160, -130]
latrange = [68, 80]

ds_smap_L3 = xr.open_mfdataset(
    fileset,
    combine='nested',
    concat_dim='time',
    decode_cf=True,
    coords='minimal',
    chunks={'time': 1}
    ).sel(longitude=slice(lonrange[0],lonrange[1]), latitude=slice(latrange[1],latrange[0]))
ds_smap_L3
```

## Recap

**Wait, we said 4 lines of Python**

```python

from earthdata import Auth, DataGranules, Store
auth = Auth().login(strategy="netrc")
granules = DataGranules().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2).get()
files = Store(auth).get(granules, local_path='./data')

# Now to the important science!
```

### Related links

**CMR** API documentation: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

**EDL** API documentation: https://urs.earthdata.nasa.gov/

NASA OpenScapes: https://nasa-openscapes.github.io/earthdata-cloud-cookbook/

GitHub repository: https://github.com/nsidc/earthdata


Contact: luis.lopez@nsidc.org
