# Accessing CMR Cloud Collections 

References:
* https://github.com/nsidc/earthdata/blob/main/notebooks/Demo.ipynb
* https://nasa-openscapes.github.io/earthdata-cloud-cookbook/examples/NSIDC/ICESat2-CMR-AWS-S3.html
* https://nasa-openscapes.github.io/2021-Cloud-Workshop-AGU/how-tos/Earthdata_Cloud__Single_File__Direct_S3_Access_NetCDF4_Example.html

Programmatic access to NSIDC data can happen in two ways:

```text
Search -> Download -> Process -> Research
```

<img src="https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/download-model.png" width="35%"/>

```text
Search -> Process in the cloud -> Research
```

<img src="https://raw.githubusercontent.com/NASA-Openscapes/earthdata-cloud-cookbook/main/examples/NSIDC/img/cloud-model.png" width="35%"/>

> **Credit**: Open Architecture for scalable cloud-based data analytics. From Abernathey, Ryan (2020): Data Access Modes in Science.


## The big picture: 

There is nothing wrong with downloading data to our local machine, but that can get complicated or even impossible if a dataset is too large.
For this reason, NSIDC, along with other NASA data centers, started to collocate or migrate their dataset holdings to the cloud.

### Steps

1. Authenticate with the NASA Earthdata Login API (EDL).
2. Search granules/collections using a CMR client that supports authentication
3. Parse CMR responses and get AWS S3 URLs
4. Access the data granules using temporary AWS credentials given by the NSIDC cloud credentials endpoint
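
Step 3 can be sketched without any client library. The snippet below is a minimal illustration, assuming the granule record follows the UMM-G layout in which links live under `umm.RelatedUrls`, each with a `URL` and a `Type`. The helper name `s3_links` and the sample record are hypothetical and are not part of the `earthdata` library.

```python
def s3_links(granule):
    """Return the s3:// URLs listed in one UMM-G style granule record."""
    related = granule.get("umm", {}).get("RelatedUrls", [])
    return [r["URL"] for r in related if r.get("URL", "").startswith("s3://")]


# Hypothetical record, trimmed to the fields the helper reads.
sample = {
    "umm": {
        "RelatedUrls": [
            {"URL": "https://example.org/data/granule.h5", "Type": "GET DATA"},
            {"URL": "s3://example-bucket/data/granule.h5",
             "Type": "GET DATA VIA DIRECT ACCESS"},
        ]
    }
}

print(s3_links(sample))
```

Filtering on the URL scheme rather than on `Type` keeps the helper tolerant of records that label their links differently.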

### Data used:

*  ICESat-2 [ATL03](https://nsidc.org/data/ATL03/versions/4): This data set contains height above the WGS 84 ellipsoid (ITRF2014 reference frame), latitude, longitude, and time for all photons.

### Requirements

* [NASA Earthdata Login (EDL) credentials](https://urs.earthdata.nasa.gov/)
* Python libraries:
  - h5py
  - matplotlib
  - xarray
  - s3fs
  - https://github.com/nsidc/earthdata/tree/main
  
  
Other Python tutorials and tools for finding data, for example with the Python `requests` library:
* https://nasa-openscapes.github.io/2021-Cloud-Hackathon/tutorials/01_Data_Discovery_CMR.html
* https://github.com/nasa/eo-metadata-tools/tree/master/CMR/python

In [2]:
#!pip install earthdata
#!pip install 's3fs<2022.0.0,>=2021.8.1'

## Querying CMR for NSIDC data in the cloud

Most collections at NSIDC have not been migrated to the cloud and can be found using CMR with no authentication at all. Here is a simple example for
altimeter data (ATL03) coming from the ICESat-2 mission. First we'll search the regular collection and then we'll do the same using the cloud collection.
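
For public collections, an unauthenticated search comes down to a plain HTTP query. The sketch below only builds the request URL, assuming the public CMR granule search endpoint and its `short_name`, `bounding_box` and `page_size` parameters. The function name `granule_search_url` is hypothetical, and the bounding box is the one used later in this notebook.

```python
from urllib.parse import urlencode

CMR_GRANULES = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"


def granule_search_url(short_name, bounding_box, page_size=10):
    """Build a CMR granule search URL; bounding_box is (west, south, east, north)."""
    params = {
        "short_name": short_name,
        "bounding_box": ",".join(str(v) for v in bounding_box),
        "page_size": page_size,
    }
    return f"{CMR_GRANULES}?{urlencode(params)}"


url = granule_search_url("ATL03", (-10, 20, 10, 50))
print(url)
```

The resulting URL could then be fetched with something like `requests.get(url).json()`. The `earthdata` client used in the rest of this notebook wraps this kind of query behind its query classes.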

**Note**: This notebook uses CMR to search and locate the data granules, but this is not the only workflow for data access and discovery.

* **HarmonyPy**: Uses **Harmony**, the NASA API to search, subset and transform the data in the cloud.
* **cmr-stac**: A "static" metadata catalog that can be read by **Intake** and other client libraries to optimize the access of files in the cloud.



## Cloud Collections

Some NSIDC cloud collections are not yet public, which means that temporarily you'll have to request access by emailing nsidc@nsidc.org so your Earthdata Login is in the authorized list for early users.

In [3]:
from earthdata import Auth, DataGranules, DataCollections, Accessor

auth = Auth() # if we want to access NASA DATA in the cloud
auth.login()

Enter your Earthdata Login username:  aimeeb
Enter your Earthdata password:  ··················


You're now authenticated with NASA Earthdata Login


True

In [9]:
collections = DataCollections(auth).keyword('ICESat-2').get(10)
collections[0]['meta']

{'revision-id': 2, 'deleted': False, 'format': 'application/iso19115+xml', 'provider-id': 'NSIDC_ECS', 'user-id': 'jbehnke', 'has-formats': True, 'associations': {'services': ['S1977894169-NSIDC_ECS', 'S1568899363-NSIDC_ECS', 'S1613689509-NSIDC_ECS', 'S2013502342-NSIDC_ECS', 'S2008499525-NSIDC_ECS'], 'tools': ['TL1950215144-NSIDC_ECS', 'TL1994100033-NSIDC_ECS', 'TL2012682515-NSIDC_ECS', 'TL1993837687-NSIDC_ECS', 'TL1977912846-NSIDC_ECS', 'TL1977971361-NSIDC_ECS', 'TL2011654705-NSIDC_ECS', 'TL2000645101-NSIDC_ECS', 'TL1956087574-NSIDC_ECS', 'TL1993837300-NSIDC_ECS', 'TL1995279987-NSIDC_ECS', 'TL1993841373-NSIDC_ECS', 'TL1952642907-NSIDC_ECS', 'TL1956550964-NSIDC_ECS']}, 'has-spatial-subsetting': True, 'native-id': 'ATLAS/ICESat-2 L2A Global Geolocated Photon Data V005', 'has-transforms': False, 'has-variables': True, 'concept-id': 'C2120512202-NSIDC_ECS', 'revision-date': '2021-11-29T18:34:40.321Z', 'granule-count': 242341, 'has-temporal-subsetting': True, 'concept-type': 'collection'}

In [10]:
granules = DataGranules(auth).short_name('ATL08').bounding_box(-10,20,10,50).get(5)
[display(g) for g in granules[0:5]]

[None, None, None, None, None]

## Data Access using AWS S3

* **IMPORTANT**: This section will only work if this notebook is running in the AWS **us-west-2** region

There is more than one way of accessing data on AWS S3: either downloading it to your local machine using the official client library, or using a Python library.

**Performance tip**: using the HTTPS URLs will decrease access performance, since these links have to be processed internally by AWS's content delivery system (CloudFront). To get better performance we should access the `s3://` URLs with boto3 or a high-level S3-enabled library (e.g. s3fs).
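
The two kinds of library address objects differently: s3fs accepts the full `s3://` URL, while boto3 calls such as `get_object` take a bucket name and an object key. A minimal sketch of splitting a URL into those two parts follows. The helper name `split_s3_url` and the example URL are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_url(s3_url):
    """Split 's3://bucket/path/to/object' into (bucket, key)."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URL: {s3_url!r}")
    # The bucket is the network location; the key is the path without
    # its leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url("s3://example-bucket/some/prefix/granule.h5")
print(bucket, key)
```

Raising on anything that is not an `s3://` URL makes it harder to pass an HTTPS link by accident and silently lose the direct-access path.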


Related links:
* [HDF in the Cloud challenges and solutions for scientific data](http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud)
* [Cloud Storage (Amazon S3) HDF5 Connector](https://www.hdfgroup.org/solutions/enterprise-support/cloud-amazon-s3-storage-hdf5-connector/)


In [13]:
import s3fs
import h5py

In [12]:
access = Accessor(auth)

In [14]:
Query = DataGranules().concept_id("C2144424132-NSIDC_ECS")

In [15]:
Query.hits()

178839

In [16]:
granules = Query.get(100)

In [17]:
granules[0].cloud_hosted

False

In [31]:
import os
import glob

atl08_dir = 'data/demo-atl08'
files = glob.glob(f'{atl08_dir}/*.h5')

for f in files:
    try:
        os.remove(f)
    except OSError as e:
        print("Error: %s : %s" % (f, e.strerror))

In [32]:
%%time
files = access.get(granules[0:3], atl08_dir)

SUBMITTING | :   0%|          | 0/3 [00:00<?, ?it/s]

PROCESSING | :   0%|          | 0/3 [00:00<?, ?it/s]

COLLECTING | :   0%|          | 0/3 [00:00<?, ?it/s]

CPU times: user 337 ms, sys: 79.4 ms, total: 417 ms
Wall time: 3.39 s


In [None]:
import h5py

file = 'data/demo-atl08/ATL08_20181014001920_02350103_005_01.h5'
ds = h5py.File(file, 'r')
ds

## Cloud-Hosted Granules

In [34]:
Query = DataGranules().concept_id("C1968980609-POCLOUD").bounding_box(-134.7,54.9,-100.9,69.2)
print(f"Granule hits: {Query.hits()}")
cloud_granules = Query.get(100)
# is this a cloud hosted data granule?
cloud_granules[0].cloud_hosted

Granule hits: 1576


True

## Using xarray to open files on S3

ATL data is stored in nested HDF5 groups, so xarray doesn't know how to extract the important bits out of it on its own.
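
One way to work with such files is to open one HDF5 group at a time, since `xarray.open_dataset` accepts a `group` argument. The sketch below only builds candidate group paths, assuming the ATL08 layout of six ground-track groups (`gt1l` to `gt3r`), each holding a `land_segments` subgroup. The helper name is hypothetical and the paths should be checked against the actual file.

```python
# Assumed ground-track group names in an ATL08 file.
BEAMS = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]


def atl08_group_paths(subgroup="land_segments"):
    """Return one HDF5 group path per ground track, e.g. '/gt1l/land_segments'."""
    return [f"/{beam}/{subgroup}" for beam in BEAMS]


for path in atl08_group_paths():
    print(path)
```

Each path could then be passed along the lines of `xr.open_dataset(file, group=path, engine='h5netcdf')`, which yields one dataset per ground track rather than a single merged one.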

In [36]:
# Let's order the granules by size.
import operator
cloud_granules_by_size = sorted(cloud_granules, key=operator.itemgetter("size"))
# now our list is sorted by size from smallest to largest. Let's print the first 2
cloud_granules_by_size[0:2]

[Collection: {'Version': 'F', 'ShortName': 'JASON_CS_S6A_L2_ALT_LR_STD_OST_STC_F'}
Spatial coverage: {'HorizontalSpatialDomain': {'Geometry': {'Lines': [{'Points': [{'Latitude': -65.651299, 'Longitude': 62.291185}, {'Latitude': -62.668781, 'Longitude': 90.38672}, {'Latitude': -55.249185, 'Longitude': 109.742412}, {'Latitude': -45.587235, 'Longitude': 121.805687}, {'Latitude': -34.84169, 'Longitude': 129.80194}, {'Latitude': -19.907954, 'Longitude': 137.2323}, {'Latitude': -8.237538, 'Longitude': 141.77837}, {'Latitude': 3.531965, 'Longitude': 146.004336}, {'Latitude': 15.288458, 'Longitude': 150.365636}, {'Latitude': 26.915319, 'Longitude': 155.364525}, {'Latitude': 38.24408, 'Longitude': 161.7267}, {'Latitude': 48.952765, 'Longitude': 170.698818}, {'Latitude': 58.344339, 'Longitude': -175.462384}, {'Latitude': 64.949148, 'Longitude': -153.453304}, {'Latitude': 66.647046, 'Longitude': -131.873032}, {'Latitude': 65.650127, 'Longitude': -131.794596}, {'Latitude': 63.952229, 'Longitude': 

In [79]:
%%time

jason_dir = 'data/jason'
#files = access.get(cloud_granules_by_size[0:2], jason_dir)

CPU times: user 0 ns, sys: 3 µs, total: 3 µs
Wall time: 5.48 µs


In [51]:
s3_cred_endpoint = {
    'podaac':'https://archive.podaac.earthdata.nasa.gov/s3credentials',
    'gesdisc': 'https://data.gesdisc.earthdata.nasa.gov/s3credentials',
    'lpdaac':'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials',
    'ornldaac': 'https://data.ornldaac.earthdata.nasa.gov/s3credentials',
    'ghrcdaac': 'https://data.ghrc.earthdata.nasa.gov/s3credentials'
}

In [50]:
import requests

def get_temp_creds(daac):
    # Request temporary S3 credentials from the endpoint of the given DAAC
    temp_creds_url = s3_cred_endpoint[daac]
    return requests.get(temp_creds_url).json()

In [52]:
import requests
temp_creds_req = get_temp_creds('podaac')

In [62]:
#!pip install boto3
import os
import requests 
import boto3
from osgeo import gdal
import rasterio as rio
from rasterio.session import AWSSession
import rioxarray
import hvplot.xarray
import holoviews as hv

In [81]:
fs_s3 = s3fs.S3FileSystem(anon=False, 
                          key=temp_creds_req['accessKeyId'], 
                          secret=temp_creds_req['secretAccessKey'], 
                          token=temp_creds_req['sessionToken'])
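
Because these credentials are temporary, a long-running session may need to request new ones. The sketch below maps a credentials response onto the keyword arguments used in the `S3FileSystem` call above and checks whether the credentials have lapsed. It assumes the response also carries an `expiration` field holding an ISO 8601 timestamp with a UTC offset, which this notebook does not show. Both helper names and all sample values are made up.

```python
from datetime import datetime, timezone


def s3fs_kwargs(creds):
    """Map a credentials response onto s3fs.S3FileSystem keyword arguments."""
    return {
        "anon": False,
        "key": creds["accessKeyId"],
        "secret": creds["secretAccessKey"],
        "token": creds["sessionToken"],
    }


def is_expired(creds, now=None):
    """True if the (assumed) 'expiration' timestamp is not in the future."""
    now = now or datetime.now(timezone.utc)
    return datetime.fromisoformat(creds["expiration"]) <= now


# Made-up values shaped like the response used above.
fake = {
    "accessKeyId": "AKIAEXAMPLE",
    "secretAccessKey": "secret",
    "sessionToken": "token",
    "expiration": "2030-01-01 00:00:00+00:00",
}
print(s3fs_kwargs(fake)["key"],
      is_expired(fake, datetime(2029, 1, 1, tzinfo=timezone.utc)))
```

With real credentials, the file system could be built as `s3fs.S3FileSystem(**s3fs_kwargs(temp_creds_req))`, and `get_temp_creds` called again once `is_expired` returns `True`.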

In [87]:
#s3_url = cloud_granules[0].data_links()[0]
s3_url = 's3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_2015-01_ECCO_V4r4_latlon_0p50deg.nc'

In [84]:
s3_file_obj = fs_s3.open(s3_url, mode='rb')

In [85]:
s3_file_obj

<File-like object S3FileSystem, podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_2015-01_ECCO_V4r4_latlon_0p50deg.nc>

In [88]:
import xarray as xr
ssh_ds = xr.open_dataset(s3_file_obj, engine='h5netcdf')
ssh_ds

In [89]:
ssh_da = ssh_ds.SSH
ssh_da

In [94]:
ssh_da.hvplot.image(x='longitude', y='latitude', cmap='Spectral_r', aspect='equal').opts(clim=(ssh_da.attrs['valid_min'],ssh_da.attrs['valid_max']))

:DynamicMap   [time]