# [Data discovery with NASA's CMR](https://openscapes.2i2c.cloud/hub/user-redirect/lab/tree/2022-Fall-ECOSTRESS-Cloud-Workshop/how-tos/data-discovery/Data_Discovery_CMR_API.ipynb)

## Summary

In this notebook, we will walk through how to search for Earthdata data collections and granules. Along the way we will explore the available search parameters, the information returned, and specific constraints when using the CMR API.

We will query CMR for ECOSTRESS version 2 collections and granules. Our objective is to identify assets that we would download, or access directly from S3, within an analysis workflow.

## Learning Objectives

- Understand what CMR and the CMR API are and what they can be used for
- Use the `requests` package to search data collections and granules
- Parse the results of these searches

## What is CMR
CMR is the Common Metadata Repository.  It catalogs all data for NASA's Earth Observing System Data and Information System (EOSDIS).  It is the backend of [Earthdata Search](https://search.earthdata.nasa.gov/search), the GUI search interface.  More information about CMR can be found [here](https://earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr).

Unfortunately, the GUI for Earthdata Search is not accessible from a cloud instance - at least not without some work.  Earthdata Search is also not immediately reproducible.  What I mean by that is if you create a search using the GUI you would have to note the search criteria (date range, search area, collection name, etc), take a screenshot, copy the search url, or save the list of data granules returned by the search, in order to recreate the search.  This information would have to be re-entered each time you or someone else wanted to do the search.  You could make typos or other mistakes.  A cleaner, reproducible solution is to search CMR programmatically using the CMR API.

## What is the CMR API
API stands for Application Programming Interface.  It allows applications (software, services, etc) to send information to each other.  A helpful analogy is a waiter in a restaurant.  The waiter takes your drink or food order that you select from the menu, often translated into short-hand, to the bar or kitchen, and then returns (hopefully) with what you ordered when it is ready.

The CMR API accepts search terms such as collection name, keywords, datetime range, and location, queries the CMR database and returns the results.
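
To make this concrete, a CMR search ultimately boils down to a URL with query parameters appended. The sketch below is a minimal, offline illustration using only the standard library. `build_query_url` is a hypothetical helper (not part of CMR or `requests`), and the `provider` and `project` values are the ones used later in this notebook.

```python
from urllib.parse import urlencode

# Root CMR search endpoint (the same one used later in this notebook).
CMR_SEARCH = 'https://cmr.earthdata.nasa.gov/search'


def build_query_url(resource, **params):
    """Hypothetical helper: return the URL that a search would request.

    resource is 'collections' or 'granules'; params are CMR search terms.
    """
    return f'{CMR_SEARCH}/{resource}?{urlencode(params)}'


query_url = build_query_url('collections', provider='LPCLOUD', project='ECOSTRESS')
print(query_url)
```

In practice `requests` builds this URL for us from the `params` dictionary, so we rarely need to assemble it by hand.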

---

## Getting Started: How to search CMR from Python
The first step is to import Python packages.  We will use:  
- `requests` to access the CMR API using HTTP methods; this package does most of the work for us.
- `json` from the standard library for working with JSON data.
- `pprint` to _pretty print_ the results of the search.  

A more in-depth tutorial on `requests` is [here](https://realpython.com/python-requests/).

In [1]:
import requests
import json
from pprint import pprint

To conduct a search using the CMR API, `requests` needs the URL for the root CMR search endpoint. We'll assign this URL to a Python variable as a _string_.

In [2]:
CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'

CMR allows search by __collections__, which are datasets, and __granules__, which are files that contain data. Many of the same search parameters can be used for collections and granules, but the type of results returned differs. Search parameters can be found in the [API Documentation](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html).  

Whether we search __collections__ or __granules__ is distinguished by adding `"collections"` or `"granules"` to the end of the CMR endpoint URL.  

We are going to search collections first, so we add `"collections"` to the URL. We are using a Python f-string (formatted string literal) in the examples below.

In [3]:
url = f'{CMR_OPS}/{"collections"}'
url

'https://cmr.earthdata.nasa.gov/search/collections'

In this first example, we want to retrieve a list of __ECOSTRESS__ collections in the Earthdata Cloud. This includes ECOSTRESS collections for build 7.1 data, which recently became publicly available, so you do not need to generate a token to access the data.
Before the public release, you had to be on the access list to access the data. Because of that, each CMR request needed an extra `token` parameter, generated using your Earthdata Login credentials, to indicate that you were a valid user.

We want to retrieve the collections that are hosted in the cloud (`'cloud_hosted': 'True'`) and that have granules available (`'has_granules': 'True'`). We also want to get the content in `json` (pronounced "jason") format, so I pass a dictionary to the `headers` keyword argument to say that I want results returned as `json` (`'Accept': 'application/json'`).

The `.get()` method is used to send this information to the CMR API. `get()` calls the HTTP method __GET__. 

In [4]:
response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                        },
                        headers={
                            'Accept': 'application/json',
                        }
                       )

The request returns a `Response` object.    

To check that our request was successful we can print the `response` variable we saved the request to.

In [5]:
response

<Response [200]>

A __200__ response is what we want. This means that the request was successful. For more information on HTTP status codes see <https://en.wikipedia.org/wiki/List_of_HTTP_status_codes>.

A more explicit way to check the status code is to use the `status_code` attribute. Both methods show the HTTP status code.

In [6]:
response.status_code

200
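
If you want a human-readable label for a status code, Python's standard library can provide one. This is a small sketch: `describe_status` is a hypothetical helper name, and treating any 2xx code as success is a simplifying assumption.

```python
from http import HTTPStatus


def describe_status(code):
    """Hypothetical helper: return (label, ok) for an HTTP status code."""
    label = f'{code} {HTTPStatus(code).phrase}'
    ok = 200 <= code < 300  # assumption: any 2xx code counts as success
    return label, ok


for code in (200, 401, 404):
    print(describe_status(code))
```

In a script, calling `response.raise_for_status()` is a common way to stop early when a request returns a 4xx or 5xx code.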

The `Response` object returned by `requests.get` contains the results of the search, along with metadata about those results in the `headers`.  

More information about the `response` object can be found by typing `help(response)`.

`headers` contains useful information in a case-insensitive dictionary. This metadata is separate from the search results themselves, which we asked CMR (above) to return as json. We'll iterate through the headers dictionary, looping through each field (`k`) and its associated value (`v`). For more on iterating through dictionary objects, click [here](https://realpython.com/iterate-through-dictionary-python/).

In [7]:
for k, v in response.headers.items():
    print(f'{k}: {v}')

Content-Type: application/json;charset=utf-8
Content-Length: 4204
Connection: keep-alive
Date: Tue, 15 Nov 2022 18:08:14 GMT
X-Frame-Options: SAMEORIGIN
Access-Control-Allow-Origin: *
X-XSS-Protection: 1; mode=block
CMR-Request-Id: cf80d8ad-a428-4ca4-a85e-998ec5b0c02f
Strict-Transport-Security: max-age=31536000
CMR-Search-After: [0.0,23600.0,"SENTINEL-1A_DP_META_GRD_HIGH","1",1214470576,826]
CMR-Hits: 1674
Access-Control-Expose-Headers: CMR-Hits, CMR-Request-Id, X-Request-Id, CMR-Scroll-Id, CMR-Search-After, CMR-Timed-Out, CMR-Shapefile-Original-Point-Count, CMR-Shapefile-Simplified-Point-Count
X-Content-Type-Options: nosniff
CMR-Took: 237
X-Request-Id: RjtJpW50AJUR058fFO1oB1ULN3tF72Vo-ffvb9N7FDYx7GjZPHVE1w==
Vary: Accept-Encoding, User-Agent
Content-Encoding: gzip
Server: ServerTokens ProductOnly
X-Cache: Miss from cloudfront
Via: 1.1 aa0280f933863b8ffd5ff636330f4170.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: HIO50-C2
X-Amz-Cf-Id: RjtJpW50AJUR058fFO1oB1ULN3tF72Vo-ffvb9N7FDYx7GjZPHVE1w=
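
The CMR-specific headers are usually the ones most relevant to a search. As a sketch, the snippet below filters them out of a plain dictionary that stands in for `response.headers` (the sample values are copied from the output above). `cmr_headers` is a hypothetical helper; with a real response you could pass `response.headers` instead.

```python
# Stand-in for response.headers, using a few values from the output above.
sample_headers = {
    'Content-Type': 'application/json;charset=utf-8',
    'CMR-Request-Id': 'cf80d8ad-a428-4ca4-a85e-998ec5b0c02f',
    'CMR-Hits': '1674',
    'CMR-Took': '237',
    'Content-Encoding': 'gzip',
}


def cmr_headers(headers):
    """Hypothetical helper: keep only headers whose name starts with 'CMR-'."""
    return {k: v for k, v in headers.items() if k.lower().startswith('cmr-')}


for k, v in cmr_headers(sample_headers).items():
    print(f'{k}: {v}')
```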

Each item in the dictionary can be accessed in the normal way you access a Python dictionary, except that the keys are, unusually, case-insensitive. Let's take a look at the commonly used `CMR-Hits` key.



In [8]:
response.headers['CMR-Hits']

'1674'

Note that "cmr-hits" works as well!

In [9]:
response.headers['cmr-hits']

'1674'

In some situations the response to your query can return a very large number of results, some of which may not be relevant. We can add additional [query parameters](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html) to restrict the information returned. We're going to restrict the search by the `provider` parameter.

You can modify the code below to explore all Earthdata data products hosted by the various providers. When searching by provider, use _Cloud Provider_ to search for cloud-hosted datasets and _On-Premises Provider_ to search for datasets archived at the DAACs. A partial list of providers is given below.

DAAC      | Full Name                               | Cloud Provider | On-Premises Provider  
----------|-----------------------------------------|----------------|----------------------  
NSIDC     | National Snow and Ice Data Center       | NSIDC_CPRD     | NSIDC_ECS  
GHRC DAAC | Global Hydrometeorology Resource Center | GHRC_DAAC      | GHRC_DAAC  
PO DAAC   | Physical Oceanography Distributed Active Archive Center | POCLOUD | PODAAC  
ASF       | Alaska Satellite Facility | ASF | ASF  
ORNL DAAC | Oak Ridge National Laboratory | ORNL_CLOUD | ORNL_DAAC  
LP DAAC   | Land Processes Distributed Active Archive Center | LPCLOUD | LPDAAC_ECS
GES DISC  | NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) | GES_DISC | GES_DISC
OB DAAC   | NASA's Ocean Biology Distributed Active Archive Center |   | OB_DAAC
SEDAC     | NASA's Socioeconomic Data and Applications Center |   | SEDAC

We'll assign the provider to a variable as a _string_ and insert the variable into the parameter argument in the request. We'll also assign the term 'ECOSTRESS' to a variable so we don't need to repeatedly add it to the request parameters. 

In [10]:
provider = 'LPCLOUD'
project = 'ECOSTRESS'

In [11]:
headers = {
    #'Authorization': f'Bearer {token}',
    'Accept': 'application/json',
}
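
If you do need to authenticate (for example, for collections that are not yet public), the commented-out `Authorization` line above suggests one way to pass a token. The sketch below wraps that choice in a hypothetical `build_headers` helper; the token value shown is a placeholder, not a real credential.

```python
def build_headers(token=None):
    """Hypothetical helper: request headers, with an optional bearer token."""
    headers = {'Accept': 'application/json'}
    if token:
        # Only added when a token is supplied; public data does not need it.
        headers['Authorization'] = f'Bearer {token}'
    return headers


print(build_headers())
print(build_headers(token='my-placeholder-token'))
```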

In [12]:
response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                            'provider': provider,
                            'project': project,
                        },
                        headers=headers
                       )
response

<Response [200]>

In [13]:
response.headers['cmr-hits']

'5'

Search results are contained in the __content__ part of the Response object. However, `response.content` returns information in bytes.

In [14]:
response.content

b'{"feed":{"updated":"2022-11-15T18:08:14.505Z","id":"https://cmr.earthdata.nasa.gov:443/search/collections.json?cloud_hosted=True&has_granules=True&provider=LPCLOUD&project=ECOSTRESS","title":"ECHO dataset metadata","entry":[{"processing_level_id":"2","cloud_hosted":true,"boxes":["-90 -180 90 180"],"time_start":"2018-07-09T00:00:00.000Z","version_id":"002","updated":"2021-06-23T16:50:51.108Z","dataset_id":"ECOSTRESS Tiled Land Surface Temperature and Emissivity Instantaneous L2 Global 70 m V002","has_spatial_subsetting":false,"has_transforms":false,"has_variables":false,"data_center":"LPCLOUD","short_name":"ECO_L2T_LSTE","organizations":["LP DAAC","NASA/JPL/ECOSTRESS"],"title":"ECOSTRESS Tiled Land Surface Temperature and Emissivity Instantaneous L2 Global 70 m V002","coordinate_system":"CARTESIAN","summary":"The ECOsystem Spaceborne Thermal Radiometer Experiment on Space Station (ECOSTRESS) mission measures the temperature of plants to better understand how much water plants need and

A more convenient way to work with this information is to use `json` formatted data. I'm using pretty print (`pprint`) to print the data in an easy-to-read way.    

**Note**
- `response.json()` parses the `json` content of the response into Python dictionaries and lists
- `['feed']['entry']` returns all entries that CMR returned in the request (not the same as __CMR-Hits__)
- `[0]` returns the first entry. Reminder that python starts indexing at 0, not 1!
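
Under the hood, `response.json()` is roughly equivalent to parsing the bytes in `response.content` with the standard `json` module. The sketch below shows this on a heavily abridged stand-in for the bytes printed above (only a few fields are kept, so it is not the full CMR response).

```python
import json

# Abridged stand-in for response.content (most fields omitted).
content = (b'{"feed":{"title":"ECHO dataset metadata",'
           b'"entry":[{"short_name":"ECO_L2T_LSTE","version_id":"002",'
           b'"cloud_hosted":true}]}}')

parsed = json.loads(content)          # bytes -> Python dict
first_entry = parsed['feed']['entry'][0]

print(type(parsed).__name__)          # dict
print(first_entry['short_name'])      # ECO_L2T_LSTE
print(first_entry['cloud_hosted'])    # True (JSON true becomes Python True)
```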

In [15]:
pprint(response.json()['feed']['entry'][0])

{'archive_center': 'LP DAAC',
 'boxes': ['-90 -180 90 180'],
 'browse_flag': True,
 'cloud_hosted': True,
 'collection_data_type': 'SCIENCE_QUALITY',
 'consortiums': ['GEOSS', 'EOSDIS'],
 'coordinate_system': 'CARTESIAN',
 'data_center': 'LPCLOUD',
 'dataset_id': 'ECOSTRESS Tiled Land Surface Temperature and Emissivity '
               'Instantaneous L2 Global 70 m V002',
 'has_formats': False,
 'has_spatial_subsetting': False,
 'has_temporal_subsetting': False,
 'has_transforms': False,
 'has_variables': False,
 'id': 'C2076090826-LPCLOUD',
 'links': [{'href': 'https://search.earthdata.nasa.gov/search?q=C2076090826-LPCLOUD',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#'},
           {'href': 'https://doi.org/10.5067/ECOSTRESS/ECO_L2T_LSTE.002',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsearch/1.1/metadata#'},
           {'href': 'https://lpdaac.usgs.gov/',
            'hreflang': 'en-US',
           

The first entry contains a lot more information than we need. We'll narrow in on a few fields to get a feel for what we have. We'll print the archive center (`archive_center`), the name of the dataset (`dataset_id`), the short name (`short_name`), and the concept id (`id`). We can build the print statement with an f-string, like we did above for the `url` variable. 

In [16]:
collections = response.json()['feed']['entry']

In [17]:
for collection in collections:
    print(f'{collection["archive_center"]} | {collection["dataset_id"]} | {collection["short_name"]} |{collection["id"]}')

LP DAAC | ECOSTRESS Tiled Land Surface Temperature and Emissivity Instantaneous L2 Global 70 m V002 | ECO_L2T_LSTE |C2076090826-LPCLOUD
LP DAAC | ECOSTRESS Swath Geolocation Instantaneous L1B Global 70 m V002 | ECO_L1B_GEO |C2076087338-LPCLOUD
LP DAAC | ECOSTRESS Swath Top of Atmosphere Calibrated Radiance Instantaneous L1B Global 70 m | ECO_L1B_RAD |C2076116385-LPCLOUD
LP DAAC | ECOSTRESS Swath Cloud Mask Instantaneous L2 Global 70 m V002 | ECO_L2_CLOUD |C2076115306-LPCLOUD
LP DAAC | ECOSTRESS Swath Land Surface Temperature and Emissivity Instantaneous L2 Global 70 m V002 | ECO_L2_LSTE |C2076114664-LPCLOUD
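
Since later searches need a collection's concept id, it can be handy to build a lookup from short name to id. This is a sketch on a stand-in list holding two of the entries printed above, reduced to the two fields used; with live results you would pass the `collections` list instead. `sample_collections` and `concept_ids` are hypothetical names.

```python
# Stand-in for `collections`, reduced to the fields used here.
sample_collections = [
    {'short_name': 'ECO_L2T_LSTE', 'id': 'C2076090826-LPCLOUD'},
    {'short_name': 'ECO_L2_CLOUD', 'id': 'C2076115306-LPCLOUD'},
]

# Map each short name to its concept id.
concept_ids = {c['short_name']: c['id'] for c in sample_collections}

print(concept_ids['ECO_L2T_LSTE'])
```

Note that this simple mapping assumes each short name appears only once in the results; if several versions of a collection were returned, later entries would overwrite earlier ones.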


We know from `CMR-Hits` that there are 5 datasets, and all 5 were returned. In some situations, however, CMR restricts the number of results returned by each query: the default is 10, but `page_size` can be set to a maximum of 2000. 
If we search for all datasets distributed by the `LPCLOUD` provider, we get more results. We can set the `page_size` parameter to 50 (higher than the number of results) so that all results are returned in a single query.

In [18]:
response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                            'provider': provider,
                            #'project': project,
                            'page_size': 50
                        },
                        headers=headers
                       )
response

<Response [200]>

In [19]:
response.headers['cmr-hits']

'41'
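
When `CMR-Hits` is larger than `page_size`, a single request does not return everything. The arithmetic below is a small sketch of how many requests would be needed. `pages_needed` is a hypothetical helper, and the default of 10 and maximum of 2000 are the values quoted above.

```python
import math

MAX_PAGE_SIZE = 2000  # maximum page_size quoted above


def pages_needed(cmr_hits, page_size=10):
    """Hypothetical helper: number of requests needed to fetch all results.

    cmr_hits may be the raw header string, e.g. response.headers['CMR-Hits'].
    """
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f'page_size must be between 1 and {MAX_PAGE_SIZE}')
    return math.ceil(int(cmr_hits) / page_size)


print(pages_needed('41'))                    # 5 requests at the default page size
print(pages_needed('41', page_size=50))      # 1 request
print(pages_needed('1674', page_size=2000))  # 1 request
```

The CMR API documentation describes how to retrieve later pages (for example with a `page_num` parameter); check it for the currently recommended approach.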

Now, when we re-run our for loop, all of the available collections are listed.

In [20]:
collections = response.json()['feed']['entry']
for collection in collections:
    print(f'{collection["archive_center"]} | {collection["dataset_id"]} | {collection["short_name"]} |{collection["id"]}')

LP DAAC | HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0 | HLSS30 |C2021957295-LPCLOUD
LP DAAC | HLS Landsat Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m v2.0 | HLSL30 |C2021957657-LPCLOUD
LP DAAC | ASTER Global Digital Elevation Model V003 | ASTGTM |C1711961296-LPCLOUD
LP DAAC | MODIS/Aqua Land Surface Temperature/Emissivity 5-Min L2 Swath 1km V061 | MYD11_L2 |C2343114808-LPCLOUD
LP DAAC | MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061 | MOD13Q1 |C1748066515-LPCLOUD
LP DAAC | MODIS/Terra Land Surface Temperature/Emissivity 8-Day L3 Global 1km SIN Grid V061 | MOD11A2 |C2269056084-LPCLOUD
LP DAAC | MODIS/Terra Vegetation Indices Monthly L3 Global 1km SIN Grid V061 | MOD13A3 |C2327962326-LPCLOUD
LP DAAC | MODIS/Terra Surface Reflectance Daily L2G Global 1km and 500m SIN Grid V061 | MOD09GA |C2202497474-LPCLOUD
LP DAAC | MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 1km SIN Grid V0

## Searching for Granules
In NASA speak, granules are files or groups of files. In this example, we will search for ECO_L2T_LSTE version 2 granules for a specified region of interest and datetime range.  

We need to change the resource URL to look for __granules__ instead of collections.

In [21]:
url = f'{CMR_OPS}/{"granules"}'
url

'https://cmr.earthdata.nasa.gov/search/granules'

We will search by `concept_id`, `temporal`, and `bounding_box`.  Details about these search parameters can be found in the CMR API Documentation.

The formatting of the values for each parameter is quite specific.  
__Temporal parameters__ are in ISO 8601 format `yyyy-MM-ddTHH:mm:ssZ`.  
__Bounding box coordinates__ are lower left longitude, lower left latitude, upper right longitude, upper right latitude. 
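
Because the formatting is strict, it can help to build these strings from Python values rather than typing them by hand. The sketch below uses two hypothetical helpers, `temporal_range` and `bbox_string`; it assumes the datetimes are already in UTC and only checks that the latitudes are in the right order.

```python
from datetime import datetime


def temporal_range(start, end):
    """Hypothetical helper: format two datetimes (assumed UTC) for `temporal`."""
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return f'{start.strftime(fmt)},{end.strftime(fmt)}'


def bbox_string(west, south, east, north):
    """Hypothetical helper: format corner coordinates for `bounding_box`.

    Order is lower left longitude, lower left latitude,
    upper right longitude, upper right latitude.
    """
    if south > north:
        raise ValueError('south latitude must not exceed north latitude')
    return f'{west},{south},{east},{north}'


print(temporal_range(datetime(2022, 10, 20), datetime(2022, 11, 14, 23, 59, 59)))
print(bbox_string(-120.295181, 34.210026, -119.526215, 35.225021))
```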

In [22]:
collection_id = 'C2076090826-LPCLOUD'
date_range = '2022-10-20T00:00:00Z,2022-11-14T23:59:59Z'
bbox = '-120.295181,34.210026,-119.526215,35.225021'


In [23]:
response = requests.get(url, 
                        params={
                            'concept_id': collection_id,
                            'temporal': date_range,
                            'bounding_box': bbox,
                            #'token': token,
                            'page_size': 200
                            },
                        headers=headers
                       )
print(response.status_code)

200


In [24]:
print(response.headers['CMR-Hits'])

47


In [25]:
granules = response.json()['feed']['entry']
for granule in granules:
    print(f'{granule["data_center"]} | {granule["title"]} | {granule["id"]}')

LPCLOUD | ECOv002_L2T_LSTE_24418_001_11SKT_20221026T105945_0710_01 | G2530780237-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24418_001_10SGC_20221026T105945_0710_01 | G2530780962-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24418_001_10SGD_20221026T105945_0710_01 | G2530781111-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24418_002_10SGD_20221026T110036_0710_01 | G2530775818-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24418_002_10SGE_20221026T110036_0710_01 | G2530778344-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24418_002_11SKV_20221026T110036_0710_01 | G2530780217-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24418_002_11SKU_20221026T110036_0710_01 | G2530780251-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24418_002_11SKT_20221026T110036_0710_01 | G2530780282-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24418_002_10SGC_20221026T110036_0710_01 | G2530780296-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24479_001_10SGE_20221030T092522_0710_01 | G2535607120-LPCLOUD
LPCLOUD | ECOv002_L2T_LSTE_24479_001_10SGD_20221030T092522_0710_01 | G2535607552-LPCLOUD
LPCLOUD | ECOv002_L2T
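
The granule titles encode useful information such as the tile and acquisition time. The sketch below splits a title into its underscore-separated parts. `parse_granule_title` is a hypothetical helper, and the field names (`orbit`, `scene`, `tile`, `build`, `iteration`) are assumptions inferred from the titles printed above, so check the product documentation before relying on them.

```python
from datetime import datetime


def parse_granule_title(title):
    """Hypothetical helper: split an ECO_L2T_LSTE granule title into fields.

    Field names are assumptions inferred from the titles printed above.
    """
    parts = title.split('_')
    if len(parts) != 9:
        raise ValueError(f'unexpected granule title: {title}')
    # parts[0:3] is the product prefix, e.g. ['ECOv002', 'L2T', 'LSTE']
    orbit, scene, tile, timestamp, build, iteration = parts[3:9]
    return {
        'orbit': orbit,
        'scene': scene,
        'tile': tile,
        'acquired': datetime.strptime(timestamp, '%Y%m%dT%H%M%S'),
        'build': build,
        'iteration': iteration,
    }


info = parse_granule_title('ECOv002_L2T_LSTE_24418_001_11SKT_20221026T105945_0710_01')
print(info['tile'], info['acquired'])
```

With live results, the same function could be applied to `granule["title"]` for each entry, for example to keep only the granules for one tile.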

In [26]:
pprint(granules[0])

{'boxes': ['33.309906 -120.259598 34.3242 -119.044289'],
 'browse_flag': True,
 'collection_concept_id': 'C2076090826-LPCLOUD',
 'coordinate_system': 'GEODETIC',
 'data_center': 'LPCLOUD',
 'dataset_id': 'ECOSTRESS Tiled Land Surface Temperature and Emissivity '
               'Instantaneous L2 Global 70 m V002',
 'day_night_flag': 'NIGHT',
 'granule_size': '3.36234',
 'id': 'G2530780237-LPCLOUD',
 'links': [{'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24418_001_11SKT_20221026T105945_0710_01/ECOv002_L2T_LSTE_24418_001_11SKT_20221026T105945_0710_01_water.tif',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#',
            'title': 'Download '
                     'ECOv002_L2T_LSTE_24418_001_11SKT_20221026T105945_0710_01_water.tif'},
           {'href': 's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24418_001_11SKT_20221026T105945_0710_01/ECOv002_L2T_LSTE_24418_001_11SKT_

## Get URLs to cloud data assets

In [27]:
# Keep only the HTTPS links that point to GeoTIFF assets
https_urls = [link['href'] for link in granules[13]['links'] if link['href'].startswith('https') and link['href'].endswith('.tif')]
https_urls

['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_water.tif',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_cloud.tif',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_height.tif',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_QC.tif',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_202

In [28]:
# Keep only the S3 links that point to GeoTIFF assets
s3_urls = [link['href'] for link in granules[13]['links'] if link['href'].startswith('s3://') and link['href'].endswith('.tif')]
s3_urls

['s3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_water.tif',
 's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_cloud.tif',
 's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_height.tif',
 's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_QC.tif',
 's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_LST.tif',
 's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_LST_err.
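
Each granule has one GeoTIFF per layer, so it can be convenient to look up a URL by layer name. The sketch below is one possible approach on two of the S3 URLs listed above. `layer_name` is a hypothetical helper, and it assumes the layout seen in these URLs, where the file name is the granule title followed by `_<layer>.tif`.

```python
import posixpath


def layer_name(url):
    """Hypothetical helper: extract the layer suffix (e.g. 'LST') from an asset URL.

    Assumes the layout .../<granule title>/<granule title>_<layer>.tif
    """
    title = posixpath.basename(posixpath.dirname(url))
    stem, _ext = posixpath.splitext(posixpath.basename(url))
    if not stem.startswith(title + '_'):
        raise ValueError(f'unexpected asset name: {url}')
    return stem[len(title) + 1:]


# Two of the S3 URLs listed above.
sample_urls = [
    's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_water.tif',
    's3://lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01/ECOv002_L2T_LSTE_24479_001_11SKU_20221030T092522_0710_01_LST.tif',
]

# Map each layer name to its URL.
layers = {layer_name(u): u for u in sample_urls}
print(sorted(layers))
```

With live results, passing `s3_urls` or `https_urls` instead of `sample_urls` would give a dictionary such as `layers['LST']` for direct access to a single layer.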