# 01. Data discovery with CMR

In this tutorial you will learn:
- what CMR is;  
- how to use the `requests` package to search data collections and granules;  
- how to parse the results of these searches.


We will focus on datasets in the cloud.  Currently, the DAACs with data in the cloud are 'ASF', 'GES_DISC', 'GHRC_DAAC', 'LPCLOUD', 'ORNL_CLOUD', and 'POCLOUD'.

## What is CMR
CMR is the Common Metadata Repository.  It catalogs all data for NASA's Earth Observing System Data and Information System (EOSDIS).  It is the backend of [Earthdata Search](https://search.earthdata.nasa.gov/search), the GUI search interface you are probably familiar with.  More information about CMR can be found [here](https://earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr).

Unfortunately, the GUI for Earthdata Search is not accessible from a cloud instance, at least not without some work.  Earthdata Search is also not immediately reproducible.  To recreate a search made in the GUI, you would have to note the search criteria (date range, search area, collection name, etc.), take a screenshot, copy the search URL, or save the list of data granules returned by the search.  The criteria would then have to be re-entered each time you or someone else wanted to repeat the search, which invites typos and other mistakes.  A cleaner, reproducible solution is to search CMR programmatically using the CMR API.

## What is the CMR API
API stands for Application Programming Interface.  It allows applications (software, services, etc.) to send information to each other.  A helpful analogy is a waiter in a restaurant.  The waiter takes the drink or food order you select from the menu, often translated into shorthand, to the bar or kitchen, and then returns (hopefully) with what you ordered when it is ready.

The CMR API accepts search terms such as collection name, keywords, datetime range, and location, queries the CMR database, and returns the results.


## How to search CMR from Python
The first step is to import Python packages.  We will use:  
- `requests`, which does most of the work of accessing the CMR API using HTTP methods;  
- `pprint`, to _pretty print_ the results of the search.  

A more in-depth tutorial on `requests` is [here](https://realpython.com/python-requests/).
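A minimal sketch of the imports described above. The version printout is only an illustrative sanity check and is not part of the original notebook.

```python
import requests            # sends HTTP requests to the CMR API
from pprint import pprint  # prints nested dictionaries and lists readably

# Optional sanity check that the package imported correctly
print(requests.__version__)
```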

In [None]:
#

Then we need to authenticate with Earthdata Login.  Since we already set this up in the previous lesson, you only need to enter your username before executing the cell.

To conduct a search using the CMR API, `requests` needs the URL for the root CMR search endpoint. 
We'll store this URL in a Python variable.
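A sketch of that variable is shown here. The name `CMR_OPS` is the one this tutorial refers to later; the URL itself is assumed to be the root of the API documentation link, so check it against the CMR documentation.

```python
# Root of the CMR search API (assumed operational endpoint)
CMR_OPS = "https://cmr.earthdata.nasa.gov/search"
print(CMR_OPS)
```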

In [None]:
#

CMR allows search by __collections__, which are datasets, and __granules__, which are files that contain data.  Many of the same search parameters can be used for collections and granules, but the type of results returned differs.  Search parameters can be found in the [API Documentation](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html).

Whether we search __collections__ or __granules__ is distinguished by adding `"collections"` or `"granules"` to the url for the root CMR endpoint.

We are going to search collections first, so we add `"collections"` to the url.  I'm using a Python format string here.
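One way to write this, assuming the format string is an f-string and that `CMR_OPS` holds the root endpoint:

```python
CMR_OPS = "https://cmr.earthdata.nasa.gov/search"  # assumed root endpoint

# Append the resource type to the root endpoint with an f-string
url = f"{CMR_OPS}/{'collections'}"
print(url)  # https://cmr.earthdata.nasa.gov/search/collections
```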

In [None]:
#

In this first example, I want to retrieve a list of collections that are hosted in the cloud.  Each collection has a `cloud_hosted` parameter that is `True` if the collection is in the cloud and `False` if it is not.  The migration of NASA data to the cloud is a work in progress, so not all collections tagged as `cloud_hosted` have granules.  To search for only `cloud_hosted` datasets with granules, I also set `has_granules` to `True`.

I also want to get the content in `json` (pronounced "jason") format, so I pass a dictionary to the `headers` keyword argument to say that I want results returned as `json`.

The `.get()` method is used to send this information to the CMR API.  `get()` calls the HTTP method __GET__. 
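The request described above might look like the sketch below. The helper name `search_collections` is hypothetical, and the network call is wrapped in a function so that the block itself only previews the URL that `requests` would build.

```python
import requests

CMR_OPS = "https://cmr.earthdata.nasa.gov/search"  # assumed root endpoint
url = f"{CMR_OPS}/collections"

# Search parameters: only cloud-hosted collections that have granules
params = {"cloud_hosted": "True", "has_granules": "True"}
# Ask CMR to return the results as JSON
headers = {"Accept": "application/json"}


def search_collections(params):
    """Send a GET request to the CMR collections endpoint (hypothetical helper)."""
    return requests.get(url, params=params, headers=headers)


# Preview the URL that requests would build, without contacting the server
preview = requests.Request("GET", url, params=params, headers=headers).prepare()
print(preview.url)
```

Calling `response = search_collections(params)` sends the request and returns the `Response` object.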

In [None]:
#

`requests` returns a `Response` object.  

Often, we want to check that our request was successful.  In a notebook or some other interactive environment, we can just type the name of the variable we saved the `requests` Response to, in this case `response`.

In [None]:
#

A cleaner and more understandable method is to check the `status_code` attribute.  Both methods return an HTTP status code.  You've probably seen a 404 error when you have tried to access a website that doesn't exist.
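As a sketch of how `status_code` can be used in code, the hypothetical helper `describe_status` below turns the code into a message. The two `Response` objects are built by hand as stand-ins so the example runs offline; in the tutorial, `response` comes from `requests.get`.

```python
import requests


def describe_status(response):
    """Return a short message based on the HTTP status code (hypothetical helper)."""
    if response.status_code == requests.codes.ok:  # requests.codes.ok is 200
        return "Request succeeded"
    return f"Request failed with status code {response.status_code}"


# Hand-built stand-ins for real responses (assumption for offline use)
ok_response = requests.Response()
ok_response.status_code = 200
missing_response = requests.Response()
missing_response.status_code = 404

print(describe_status(ok_response))
print(describe_status(missing_response))
```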

In [None]:
#

Try changing `CMR_OPS` to `https://cmr.earthdata.nasa.gov/searches` and run `requests.get` again.  __Don't forget to rerun the cell that assigns the `url` variable.__


The response from `requests.get` returns the results of the search and metadata about those results in the `headers`.  

More information about the `response` object can be found by typing `help(response)`.

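`headers` contains useful information in a case-insensitive dictionary.  This information is printed below.  Iterating over `response.headers.items()` yields one key-value pair per header, and each pair is unpacked into two loop variables, conventionally named `k` (key) and `v` (value).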
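A sketch of such a loop follows. The dictionary is a hand-built stand-in for `response.headers` holding only the two values quoted in this tutorial, and the exact `Content-Type` string is an assumption.

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for response.headers (assumption for offline use)
headers = CaseInsensitiveDict(
    {"Content-Type": "application/json;charset=utf-8", "CMR-Hits": "919"}
)

# .items() yields (key, value) pairs; each pair is unpacked into k and v
for k, v in headers.items():
    print(f"{k}: {v}")
```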

In [None]:
#

We can see that the content returned is in `json` format in the UTF-8 character set.  We can also see from `CMR-Hits` that 919 collections were found.

Each item in the dictionary can be accessed in the normal way you access a `python` dictionary but because it is case-insensitive, both

In [None]:
#

and

In [None]:
#

work.
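The case-insensitive lookup can be sketched as below, again with a hand-built stand-in for `response.headers`. Header values are strings, so the hit count is converted to an integer before it is used as a number.

```python
from requests.structures import CaseInsensitiveDict

# Stand-in for response.headers (assumption for offline use)
headers = CaseInsensitiveDict({"CMR-Hits": "919"})

hits_upper = headers["CMR-Hits"]
hits_lower = headers["cmr-hits"]
print(hits_upper == hits_lower)  # True

# Convert the string value to an integer for later arithmetic
n_hits = int(headers["cmr-hits"])
```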

This is a large number of datasets.  I'm going to restrict the search to cloud-hosted datasets from ASF (Alaska Satellite Facility) because I'm interested in SAR images of sea ice.  To do this, I set the `provider` parameter to `ASF`.

You can modify the code below to explore all of the cloud-hosted datasets or cloud-hosted datasets from other providers.  A partial list of providers is given below.

DAAC      | Full Name                                                                     | Cloud Provider | On-Premises Provider
----------|-------------------------------------------------------------------------------|----------------|---------------------
NSIDC     | National Snow and Ice Data Center                                             | NSIDC_CPRD     | NSIDC_ECS
GHRC DAAC | Global Hydrometeorology Resource Center                                       | GHRC_DAAC      | GHRC_DAAC
PO DAAC   | Physical Oceanography Distributed Active Archive Center                       | POCLOUD        | PODAAC
ASF       | Alaska Satellite Facility                                                     | ASF            | ASF
ORNL DAAC | Oak Ridge National Laboratory                                                 | ORNL_CLOUD     | ORNL_DAAC
LP DAAC   | Land Processes Distributed Active Archive Center                              | LPCLOUD        | LPDAAC_ECS
GES DISC  | NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) | GES_DISC       | GES_DISC
OB DAAC   | NASA's Ocean Biology Distributed Active Archive Center                        |                | OB_DAAC
SEDAC     | NASA's Socioeconomic Data and Applications Center                             |                | SEDAC

When searching by provider, use _Cloud Provider_ to search for cloud-hosted datasets and _On-Premises Provider_ to search for datasets archived at the DAACs.
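A sketch of a provider-restricted search follows. The helper name `provider_params` is hypothetical; swap `"ASF"` for another value from the _Cloud Provider_ column to explore other DAACs. Only the URL is previewed here, so nothing is sent to CMR.

```python
import requests

CMR_OPS = "https://cmr.earthdata.nasa.gov/search"  # assumed root endpoint
url = f"{CMR_OPS}/collections"


def provider_params(provider):
    """Build parameters for cloud-hosted collections from one provider (hypothetical helper)."""
    return {"cloud_hosted": "True", "has_granules": "True", "provider": provider}


params = provider_params("ASF")

# Preview the URL without contacting the server
preview = requests.Request("GET", url, params=params).prepare()
print(preview.url)
```

Passing `params` to `requests.get(url, params=params, headers={"Accept": "application/json"})` runs the search.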

In [None]:
#

In [None]:
#

Search results are contained in the content part of the Response object.  However, `response.content` returns information in bytes.
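To illustrate the difference between bytes and parsed data, the sketch below uses a made-up bytes payload as a stand-in for `response.content`. Parsing it with the standard `json` module gives nested dictionaries and lists, much like `response.json()` does.

```python
import json

# Made-up payload standing in for response.content (not real CMR output)
content = b'{"feed": {"entry": []}}'
print(type(content))  # <class 'bytes'>

# Parsing the bytes gives Python dictionaries and lists
parsed = json.loads(content)
print(type(parsed))   # <class 'dict'>
```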

In [None]:
#

It is more convenient to work with `json` formatted data.  I'm using pretty print (`pprint`) to print the data in an easy-to-read way.  

__Step through `response.json()`, then to `response.json()['feed']['entry'][0]`__. A reminder that Python starts indexing at 0, not 1!

In [None]:
#

The first entry is not the result I am looking for, but it does show a few fields that we can use to further refine the search.  To see what else was found, I want to print the name of each dataset (`dataset_id`) and its concept id (`id`).  We can build the printed string with a format string, as we did above for the `url` variable.  In the parsed `json`, the search results sit under the top-level `feed` key, and `feed['entry']` is a list with one dictionary per collection.
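A sketch of such a loop is shown here. The `result` dictionary is a made-up stand-in for `response.json()`, and its ids and names are placeholders rather than real CMR records.

```python
# Made-up stand-in for response.json(); ids and names are placeholders
result = {
    "feed": {
        "entry": [
            {"id": "C0000000001-EXAMPLE", "dataset_id": "Example dataset one"},
            {"id": "C0000000002-EXAMPLE", "dataset_id": "Example dataset two"},
        ]
    }
}

# Each item in feed -> entry is one collection
collections = result["feed"]["entry"]

# Build one line per collection with an f-string, then print them
lines = [f"{c['id']} {c['dataset_id']}" for c in collections]
for line in lines:
    print(line)
```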

In [None]:
#

In [None]:
#

But there is a problem.  We know from `CMR-Hits` that there are 49 datasets but only 10 are printed.  This is because CMR restricts the number of results returned by a query.  The default is 10 but it can be set to a maximum of 2000.  Knowing that there were 49 'hits', I'll set `page_size` to 49. Then, we can re-run our for loop for the collections.
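Setting `page_size` might be sketched as follows. The constant `MAX_PAGE_SIZE` is a name introduced here for the 2000 limit mentioned above, and `hits` is hard-coded to the 49 reported by `CMR-Hits` rather than read from a live response.

```python
import requests

CMR_OPS = "https://cmr.earthdata.nasa.gov/search"  # assumed root endpoint
url = f"{CMR_OPS}/collections"

MAX_PAGE_SIZE = 2000  # largest page size CMR allows, per the text above
hits = 49             # value reported by CMR-Hits in the previous search

params = {
    "cloud_hosted": "True",
    "has_granules": "True",
    "provider": "ASF",
    # Request all hits in one page, capped at the CMR maximum
    "page_size": min(hits, MAX_PAGE_SIZE),
}

# Preview the URL without contacting the server
preview = requests.Request("GET", url, params=params).prepare()
print(preview.url)
```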

In [None]:
#

In [None]:
#

## Granule Search
In NASA speak, granules are files.  In this example, we will search for recent Sentinel-1 Ground Range Detected (GRD) Medium Resolution Synthetic Aperture Radar images over the east coast of Greenland.  The data in these files are most useful for sea ice mapping.

I'll use the date range 2021-10-17 00:00 to 2021-10-18 23:59:59.

I'll use a simple bounding box to search, with corners given as latitude, longitude.
- SW: 76.08166,-67.1746
- NE: 88.19689,21.04862

From the collections search, I know the concept ids for Sentinel-1A and Sentinel-1B GRD medium resolution are
- C1214472336-ASF
- C1327985578-ASF

We need to change the resource URL to look for granules instead of collections.
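Assuming the same root endpoint and f-string pattern as for collections, the granule URL can be sketched as:

```python
CMR_OPS = "https://cmr.earthdata.nasa.gov/search"  # assumed root endpoint

# Point the search at granules instead of collections
url = f"{CMR_OPS}/{'granules'}"
print(url)  # https://cmr.earthdata.nasa.gov/search/granules
```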

In [None]:
#

We will search by `concept_id`, `temporal`, and `bounding_box`.  Details about these search parameters can be found in the CMR API Documentation.

The formatting of the values for each parameter is quite specific.  
Temporal parameters are in ISO 8601 format `yyyy-MM-ddTHH:mm:ssZ`.  
Bounding box coordinates are lower left longitude, lower left latitude, upper right longitude, upper right latitude. 
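The parameter formatting can be sketched as below, using the concept ids, date range, and bounding box given above. The helpers `format_temporal` and `format_bounding_box` are hypothetical, the datetimes are assumed to be UTC, and the `concept_id[]` array form for multiple ids is an assumption to check against the CMR API documentation. The block only previews the URL.

```python
from datetime import datetime

import requests

CMR_OPS = "https://cmr.earthdata.nasa.gov/search"  # assumed root endpoint
url = f"{CMR_OPS}/granules"


def format_temporal(start, end):
    """Format start and end (assumed UTC) as 'start,end' in ISO 8601."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return f"{start.strftime(fmt)},{end.strftime(fmt)}"


def format_bounding_box(west, south, east, north):
    """Order: lower-left lon, lower-left lat, upper-right lon, upper-right lat."""
    return f"{west},{south},{east},{north}"


params = {
    # Array form for several concept ids (assumed CMR convention)
    "concept_id[]": ["C1214472336-ASF", "C1327985578-ASF"],
    "temporal": format_temporal(
        datetime(2021, 10, 17, 0, 0, 0), datetime(2021, 10, 18, 23, 59, 59)
    ),
    "bounding_box": format_bounding_box(-67.1746, 76.08166, 21.04862, 88.19689),
}

# Preview the URL without contacting the server
preview = requests.Request(
    "GET", url, params=params, headers={"Accept": "application/json"}
).prepare()
print(preview.url)
```

Note that the bounding box lists longitude before latitude, the reverse of the corner coordinates quoted earlier.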

In [None]:
#

In [None]:
#

In [None]:
#

In [None]:
#