# How to: Find and Access EMIT Data

**Summary**  

There are currently 4 ways to find EMIT data:

1. [EarthData Search](https://search.earthdata.nasa.gov/search)
2. [NASA's CMR API](https://www.earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr) (`earthaccess` uses this)
3. [NASA's CMR-STAC API](https://cmr.earthdata.nasa.gov/search/site/docs/search/stac)
4. [Visions Open Access Data Portal](https://earth.jpl.nasa.gov/emit/data/data-portal/coverage-and-forecasts/)

This notebook explains how to access Earth Surface Mineral Dust Source Investigation (EMIT) data programmatically using the [earthaccess Python library](https://github.com/nsidc/earthaccess). `earthaccess` is an easy-to-use library that reduces finding and downloading or streaming data over HTTPS or S3 to only a few lines of code. `earthaccess` searches NASA's Common Metadata Repository (CMR), a metadata system that catalogs Earth Science data and associated metadata records, and can then be used to download granules or generate lists of granule search result URLs.

**Requirements:**
- A NASA [Earthdata Login](https://urs.earthdata.nasa.gov/) account is required to download EMIT data   
- *No Python setup requirements if connected to the workshop cloud instance!*
- **Local Only** Set up Python Environment - See **setup_instructions.md** in the `/setup/` folder to set up a local compatible Python environment

**Learning Objectives**  
- How to get information about data collections using `earthaccess`
- How to search and access EMIT data using `earthaccess`

## Setup
Import the required packages

In [1]:
import os
import earthaccess
import numpy as np
import pandas as pd
import geopandas as gp
from shapely.geometry.polygon import orient
import xarray as xr
import sys
sys.path.append('../modules/')
from emit_tools import emit_xarray


## Authentication

`earthaccess` creates and leverages Earthdata Login tokens to authenticate with NASA systems. Earthdata Login tokens expire after a month. To retrieve a token from Earthdata Login, you can either enter your username and password each time you use `earthaccess`, or use a `.netrc` file. A `.netrc` file is a configuration file that is commonly used to store login credentials for remote systems. If you don't have a `.netrc` or don't know if you have one or not, you can use the `persist` argument with the `login` function below to create or update an existing one, then use it for authentication.

If you do not have an Earthdata Account, you can create one [here](https://urs.earthdata.nasa.gov/home). 

In [2]:
auth = earthaccess.login(persist=True)
print(auth.authenticated)

True
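To make the `.netrc` discussion above concrete, here is a minimal sketch that checks whether a netrc-style file already holds an entry for the Earthdata Login host. It uses only the standard-library `netrc` module. The helper name `has_earthdata_entry` and the demo credentials are illustrative assumptions, not part of `earthaccess`, and the demo writes to a temporary file rather than touching a real `~/.netrc`.

```python
import netrc
import os
import tempfile

# Earthdata Login host, taken from the account links in this notebook
EDL_HOST = "urs.earthdata.nasa.gov"


def has_earthdata_entry(netrc_path):
    """Return True if the file at netrc_path has credentials for EDL_HOST.

    Hypothetical helper for illustration; earthaccess does its own lookup.
    """
    if not os.path.isfile(netrc_path):
        return False
    try:
        entries = netrc.netrc(netrc_path)
    except netrc.NetrcParseError:
        # A malformed file is treated the same as a missing entry
        return False
    return entries.authenticators(EDL_HOST) is not None


# Demo with a throwaway file and made-up credentials
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_netrc")
    with open(demo_path, "w") as f:
        f.write(f"machine {EDL_HOST} login demo_user password demo_pass\n")
    found = has_earthdata_entry(demo_path)
    missing = has_earthdata_entry(os.path.join(tmp_dir, "absent_netrc"))

print(found, missing)
```

On a real system you would point the helper at your own file, typically `os.path.expanduser('~/.netrc')` on Linux and macOS (that default location is an assumption about your setup).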


If you receive a message that your token has expired, use `refresh_tokens()` as shown below to generate a new one.

In [3]:
# auth.refresh_tokens()

## Searching for Collections

The EMIT mission produces several collections or datasets available via the LP DAAC cloud archive.

To view what's available, we can use the `search_datasets` function with the `keyword` and `provider` arguments. The `provider` is the data location, in this case `LPCLOUD`. Specifying the provider isn't required, but the "emit" keyword also appears in the metadata of some other datasets, so omitting it may return additional collections.


In [4]:
# Retrieve Collections
collections = earthaccess.search_datasets(provider='LPCLOUD', keyword='emit')
# Print Quantity of Results
print(f'Collections found: {len(collections)}')

Collections found: 12


If you print the `collections` object you can explore all of the JSON metadata.

In [5]:
# Print collections
collections

[{
   "meta": {
     "revision-id": 61,
     "deleted": false,
     "format": "application/vnd.nasa.cmr.umm+json",
     "provider-id": "LPCLOUD",
     "has-combine": false,
     "user-id": "arketch",
     "has-formats": false,
     "associations": {
       "variables": [
         "V3204053518-LPCLOUD",
         "V3204053593-LPCLOUD",
         "V3204053601-LPCLOUD",
         "V3204053636-LPCLOUD",
         "V3204053653-LPCLOUD",
         "V3204053647-LPCLOUD",
         "V3204053597-LPCLOUD",
         "V3204053667-LPCLOUD",
         "V3204053710-LPCLOUD",
         "V3204053623-LPCLOUD",
         "V3204053606-LPCLOUD",
         "V3204053685-LPCLOUD",
         "V3204053614-LPCLOUD"
       ],
       "tools": [
         "TL1860232272-LPDAAC_ECS"
       ]
     },
     "s3-links": [
       "s3://lp-prod-protected/EMITL2ARFL.001",
       "s3://lp-prod-public/EMITL2ARFL.001"
     ],
     "has-spatial-subsetting": false,
     "native-id": "mmt_collection_24331",
     "has-transforms": false,
    

We can also create a list of the `short-name`, `concept-id`, and `version` of each result collection using a list comprehension. These fields are important for specifying and searching for data within collections.

In [6]:
collections_info = [
    {
        'short_name': c.summary()['short-name'],
        'collection_concept_id': c.summary()['concept-id'],
        'version': c.summary()['version'],
        'entry_title': c['umm']['EntryTitle']
    }
    for c in collections
]
pd.set_option('display.max_colwidth', 150)
collections_info = pd.DataFrame(collections_info)
collections_info

| |short_name|collection_concept_id|version|entry_title|
|:---|:---|:---|:---|:---|
|0|EMITL2ARFL|C2408750690-LPCLOUD|1|EMIT L2A Estimated Surface Reflectance and Uncertainty and Masks 60 m V001|
|1|EMITL1BRAD|C2408009906-LPCLOUD|1|EMIT L1B At-Sensor Calibrated Radiance and Geolocation Data 60 m V001|
|2|EMITL2BCH4PLM|C2748088093-LPCLOUD|1|EMIT L2B Estimated Methane Plume Complexes 60 m V001|
|3|EMITL2BMIN|C2408034484-LPCLOUD|1|EMIT L2B Estimated Mineral Identification and Band Depth and Uncertainty 60 m V001|
|4|EMITL4ESM|C2408755900-LPCLOUD|1|EMIT L4 Earth System Model Products V001|
|5|EMITL2BCH4ENH|C3242680113-LPCLOUD|2|EMIT L2B Methane Enhancement Data 60 m V002|
|6|EMITL2BCH4ENH|C2748097305-LPCLOUD|1|EMIT L2B Methane Enhancement Data 60 m V001|
|7|EMITL2BCO2ENH|C3243477145-LPCLOUD|2|EMIT L2B Carbon Dioxide Enhancement Data 60 m V002|
|8|EMITL2BCO2ENH|C2872578364-LPCLOUD|1|EMIT L2B Carbon Dioxide Enhancement Data 60 m V001|
|9|EMITL2BCO2PLM|C2867824144-LPCLOUD|1|EMIT L2B Estimated Carbon Dioxide Plume Complexes 60 m V001|


The collection `concept-id` is the best way to search for data within a collection because it is unique to each collection. The `short-name` can also be used, but the `version` should be passed with it, since multiple versions can share the same short name. After finding the collection you want to search, you can use the `concept-id` to search for granules within that collection.
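As a small illustration of why the version matters, the sketch below looks up concept IDs from a few records copied from the collections table. The helper `find_concept_ids` is a hypothetical name, and the version values are written as they appear in the table; live metadata may format them differently (for example `'001'`), so the comparison is done on integers.

```python
# Records copied from the collections table in this notebook
collection_records = [
    {"short_name": "EMITL2ARFL", "collection_concept_id": "C2408750690-LPCLOUD", "version": "1"},
    {"short_name": "EMITL2BCH4ENH", "collection_concept_id": "C3242680113-LPCLOUD", "version": "2"},
    {"short_name": "EMITL2BCH4ENH", "collection_concept_id": "C2748097305-LPCLOUD", "version": "1"},
]


def find_concept_ids(records, short_name, version=None):
    """Return the concept IDs matching a short name and, optionally, a version.

    Versions are compared as integers so that '1', '001' and 1 all match.
    """
    return [
        r["collection_concept_id"]
        for r in records
        if r["short_name"] == short_name
        and (version is None or int(r["version"]) == int(version))
    ]


# The short name alone is ambiguous for the methane enhancement product
ambiguous = find_concept_ids(collection_records, "EMITL2BCH4ENH")
# Adding the version narrows the match to a single collection
v2_only = find_concept_ids(collection_records, "EMITL2BCH4ENH", version=2)
v1_only = find_concept_ids(collection_records, "EMITL2BCH4ENH", version="001")

print(ambiguous)
print(v2_only, v1_only)
```

A concept ID selected this way could then be supplied through the `concept_id` search parameter listed in the granule search table instead of `short_name`.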

## Searching for Granules

A `granule` can be thought of as a unique spatiotemporal grouping within a collection. To search for `granules`, we can use the `search_data` function from `earthaccess` and provide the arguments for our search. It's possible to narrow a search using several criteria, shown in the table below:

|dataset origin and location|spatiotemporal parameters|dataset metadata parameters|
|:---|:---|:---|
|archive_center|bounding_box|concept_id|
|data_center|temporal|entry_title|
|daac|point|keyword|
|provider|polygon|version|
|cloud_hosted|line|short_name|

### Point Search

In this case, we specify the `short_name`, `point` coordinates, `temporal` range, and min and max `cloud_cover` percentages, as well as `count`, which limits the maximum number of results returned.

In [7]:
# Search example using a Point
results = earthaccess.search_data(
    short_name='EMITL2ARFL',
    point=(-62.1123,-39.89402),
    temporal=('2022-09-03','2022-09-04'),
    cloud_cover=(0,90),
    count=100
)

In [8]:
results

[Collection: {'ShortName': 'EMITL2ARFL', 'Version': '001'}
 Spatial coverage: {'HorizontalSpatialDomain': {'Geometry': {'GPolygons': [{'Boundary': {'Points': [{'Longitude': -62.08873748779297, 'Latitude': -39.2420539855957}, {'Longitude': -62.51200866699219, 'Latitude': -39.943302154541016}, {'Longitude': -61.660133361816406, 'Latitude': -40.45749282836914}, {'Longitude': -61.23686218261719, 'Latitude': -39.75624465942383}, {'Longitude': -62.08873748779297, 'Latitude': -39.2420539855957}]}}]}}}
 Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '2022-09-03T16:31:29Z', 'EndingDateTime': '2022-09-03T16:31:41Z'}}
 Size(MB): 3578.7448024749756
 Data: ['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc', 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFLUNCERT_001_20220903T163129_

### Bounding Box Search

You can also use a bounding box to search. To do this, we first open a GeoJSON file containing our region of interest (ROI), then simplify it to a bounding box using the `total_bounds` property. The bounds are placed in a Python tuple, which is the data type expected by the `bounding_box` parameter of the `earthaccess` `search_data` function.

In [9]:
geojson = gp.read_file('../../data/isla_gaviota.geojson')
geojson.geometry

0    POLYGON ((-62.14758 -39.88951, -62.16900 -39.87694, -62.19419 -39.90642, -62.20427 -39.94071, -62.13184 -39.95230, -62.11609 -39.92091, -62.12554 ...
Name: geometry, dtype: geometry

In [10]:
bbox = tuple(list(geojson.total_bounds))
bbox

(-62.20427143422259,
 -39.95230375907932,
 -62.11609022486114,
 -39.87693893067732)
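The tuple above is ordered as (minimum longitude, minimum latitude, maximum longitude, maximum latitude). As a sketch of what `total_bounds` computes, the same values can be derived with plain Python from the exterior ring vertices of this ROI. The helper name `bounding_box_from_coords` is an illustrative assumption.

```python
# Exterior ring vertices of the ROI as (longitude, latitude) pairs
roi_coords = [
    (-62.147583513919045, -39.88950549416461),
    (-62.16899895047814, -39.87693893067732),
    (-62.19419358172446, -39.90641838472922),
    (-62.20427143422259, -39.94071456822524),
    (-62.1318368693898, -39.95230375907932),
    (-62.11609022486114, -39.92091182572591),
    (-62.125538211578245, -39.895787912197314),
    (-62.147583513919045, -39.88950549416461),
]


def bounding_box_from_coords(coords):
    """Return (min_lon, min_lat, max_lon, max_lat) for (lon, lat) pairs."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    return (min(lons), min(lats), max(lons), max(lats))


bbox_sketch = bounding_box_from_coords(roi_coords)
print(bbox_sketch)
```

This simple min/max approach assumes the region does not cross the antimeridian, which holds for this ROI.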

Now we can search for granules using the bounding box.

In [11]:
# Search example using bounding box
results = earthaccess.search_data(
    short_name='EMITL2ARFL',
    bounding_box=bbox,
    temporal=('2022-09-03','2022-09-04'),
    cloud_cover=(0,90),
    count=100
)


### Polygon Search

A polygon can also be used to search. For a simple polygon without holes we can take the geojson we opened and grab the coordinates of the exterior ring vertices and place them in a list. Note that this list of vertices must be in **counter-clockwise order** to be accepted by the `search_data` function. If necessary, the external ring vertices of your polygon can be reordered using the `orient` function from the shapely library.

In [12]:
# Orient External Ring Vertices
oriented = orient(geojson.geometry[0], sign=1.0)
# Create List of External Ring vertices coordinates
polygon = list(oriented.exterior.coords)
polygon

[(-62.147583513919045, -39.88950549416461),
 (-62.16899895047814, -39.87693893067732),
 (-62.19419358172446, -39.90641838472922),
 (-62.20427143422259, -39.94071456822524),
 (-62.1318368693898, -39.95230375907932),
 (-62.11609022486114, -39.92091182572591),
 (-62.125538211578245, -39.895787912197314),
 (-62.147583513919045, -39.88950549416461)]
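If you want to verify the winding order yourself, one common approach is the shoelace formula: a positive signed area means the ring is counter-clockwise. The sketch below is a plain-Python illustration, not what `orient` does internally; the helper names are assumptions, and the coordinates are treated as planar, which is adequate for a small polygon like this one.

```python
def signed_area(coords):
    """Shoelace signed area of a ring of (x, y) pairs; positive means counter-clockwise."""
    coords = list(coords)
    total = 0.0
    # Pair each vertex with the next one, wrapping around at the end
    for (x1, y1), (x2, y2) in zip(coords, coords[1:] + coords[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_counter_clockwise(coords):
    return signed_area(coords) > 0


def ensure_counter_clockwise(coords):
    """Return the ring unchanged if it is counter-clockwise, otherwise reversed."""
    coords = list(coords)
    return coords if is_counter_clockwise(coords) else coords[::-1]


# Oriented exterior ring from this notebook
roi_ring = [
    (-62.147583513919045, -39.88950549416461),
    (-62.16899895047814, -39.87693893067732),
    (-62.19419358172446, -39.90641838472922),
    (-62.20427143422259, -39.94071456822524),
    (-62.1318368693898, -39.95230375907932),
    (-62.11609022486114, -39.92091182572591),
    (-62.125538211578245, -39.895787912197314),
    (-62.147583513919045, -39.88950549416461),
]

print(is_counter_clockwise(roi_ring))        # oriented ring
print(is_counter_clockwise(roi_ring[::-1]))  # same ring, reversed
```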

With this list of coordinate pairs we can use the `polygon` parameter for our search. 
> Note that we overwrite the `results` object because, for this example, all 3 types of spatial search return the same results.

In [13]:
# Search Example using a Polygon
results = earthaccess.search_data(
    short_name='EMITL2ARFL',
    polygon=polygon,
    temporal=('2022-09-03','2022-09-04'),
    cloud_cover=(0,90),
    count=100
)

## Working with Search Results

All three of these examples return the same result, since the spatiotemporal parameters fall within the same single granule. `results` is a `list`, so we can use an index to view a single result.

In [14]:
result = results[0]
result

We can also retrieve specific metadata for a result using `.keys()` since this object also acts as a dictionary.

In [15]:
result.keys()

dict_keys(['meta', 'umm', 'size'])

Look at each of the keys to see what is available.

In [16]:
result['meta']

{'concept-type': 'granule',
 'concept-id': 'G2544480896-LPCLOUD',
 'revision-id': 4,
 'native-id': 'EMIT_L2A_RFL_001_20220903T163129_2224611_012',
 'collection-concept-id': 'C2408750690-LPCLOUD',
 'provider-id': 'LPCLOUD',
 'format': 'application/vnd.nasa.cmr.umm+json',
 'revision-date': '2024-10-24T23:31:09.035Z'}

In [17]:
result['size']

3578.7448024749756

The `umm` metadata contains a lot of fields, so instead of printing the entire object, we can just look at the keys. 

In [18]:
result['umm'].keys()

dict_keys(['TemporalExtent', 'GranuleUR', 'AdditionalAttributes', 'SpatialExtent', 'ProviderDates', 'CollectionReference', 'PGEVersionClass', 'RelatedUrls', 'CloudCover', 'DataGranule', 'Platforms', 'MetadataSpecification'])

One important piece of information here is the cloud cover percentage.

In [19]:
result['umm']['CloudCover']

82

Another of note is the `AdditionalAttributes` key, which contains other useful information about the EMIT granule, like solar zenith and azimuth.

In [20]:
result['umm']['AdditionalAttributes']

[{'Name': 'SOFTWARE_BUILD_VERSION', 'Values': ['010603']},
 {'Name': 'SOFTWARE_DELIVERY_VERSION', 'Values': ['010610']},
 {'Name': 'Identifier_product_doi_authority', 'Values': ['https://doi.org']},
 {'Name': 'Identifier_product_doi', 'Values': ['10.5067/EMIT/EMITL2ARFL.001']},
 {'Name': 'ORBIT', 'Values': ['2224611']},
 {'Name': 'ORBIT_SEGMENT', 'Values': ['0']},
 {'Name': 'SCENE', 'Values': ['12']},
 {'Name': 'SOLAR_ZENITH', 'Values': ['9.27']},
 {'Name': 'SOLAR_AZIMUTH', 'Values': ['128.65']}]
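Because `AdditionalAttributes` is a list of `Name`/`Values` pairs, looking up a single attribute takes a little work. A minimal sketch, using values copied from the output above and a hypothetical helper `attributes_to_dict`, flattens the list into a dictionary:

```python
# Subset of the AdditionalAttributes output shown in this notebook
additional_attributes = [
    {'Name': 'ORBIT', 'Values': ['2224611']},
    {'Name': 'ORBIT_SEGMENT', 'Values': ['0']},
    {'Name': 'SCENE', 'Values': ['12']},
    {'Name': 'SOLAR_ZENITH', 'Values': ['9.27']},
    {'Name': 'SOLAR_AZIMUTH', 'Values': ['128.65']},
]


def attributes_to_dict(attributes):
    """Map each attribute name to its value, unwrapping single-item lists."""
    return {
        attr['Name']: (attr['Values'][0] if len(attr['Values']) == 1 else attr['Values'])
        for attr in attributes
    }


attrs = attributes_to_dict(additional_attributes)
# Values are stored as strings, so convert before doing arithmetic
solar_zenith = float(attrs['SOLAR_ZENITH'])
print(attrs['ORBIT'], attrs['SCENE'], solar_zenith)
```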

From here, we can do other things, such as convert the results to a `pandas` dataframe, or filter down your results further using string matching and list comprehension.

In [21]:
pd.json_normalize(results)

Unnamed: 0,size,meta.concept-type,meta.concept-id,meta.revision-id,meta.native-id,meta.collection-concept-id,meta.provider-id,meta.format,meta.revision-date,umm.TemporalExtent.RangeDateTime.BeginningDateTime,...,umm.PGEVersionClass.PGEVersion,umm.RelatedUrls,umm.CloudCover,umm.DataGranule.DayNightFlag,umm.DataGranule.ArchiveAndDistributionInformation,umm.DataGranule.ProductionDateTime,umm.Platforms,umm.MetadataSpecification.URL,umm.MetadataSpecification.Name,umm.MetadataSpecification.Version
0,3578.744802,granule,G2544480896-LPCLOUD,4,EMIT_L2A_RFL_001_20220903T163129_2224611_012,C2408750690-LPCLOUD,LPCLOUD,application/vnd.nasa.cmr.umm+json,2024-10-24T23:31:09.035Z,2022-09-03T16:31:29Z,...,v1.3.3,"[{'URL': 'https://opendap.earthdata.nasa.gov/collections/C2408750690-LPCLOUD/granules/EMIT_L2A_RFL_001_20220903T163129_2224611_012', 'Type': 'USE ...",82,Day,"[{'Name': 'EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc', 'SizeInBytes': 1851082967, 'Format': 'NETCDF-4', 'Checksum': {'Value': 'bb789c229358a...",2023-03-20T21:50:48Z,"[{'ShortName': 'ISS', 'Instruments': [{'ShortName': 'EMIT Imaging Spectrometer'}]}]",https://cdn.earthdata.nasa.gov/umm/granule/v1.6.6,UMM-G,1.6.6
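As one example of filtering results with a list comprehension, the sketch below keeps only granules at or below a cloud cover threshold. It runs on plain dictionaries that mimic the `meta`/`umm` layout shown earlier; the first record uses values from this notebook, while the second is a made-up placeholder so the filter has something to remove. The helper name `filter_by_cloud_cover` is an assumption.

```python
# Dictionaries mimicking the structure of search results
sample_granules = [
    {
        'meta': {'native-id': 'EMIT_L2A_RFL_001_20220903T163129_2224611_012'},
        'umm': {'CloudCover': 82},
    },
    {
        # Hypothetical granule, included only to demonstrate the filter
        'meta': {'native-id': 'HYPOTHETICAL_GRANULE'},
        'umm': {'CloudCover': 10},
    },
]


def filter_by_cloud_cover(granules, max_cloud_cover):
    """Keep granules whose CloudCover is present and <= max_cloud_cover."""
    return [
        g for g in granules
        if g['umm'].get('CloudCover') is not None
        and g['umm']['CloudCover'] <= max_cloud_cover
    ]


clear_enough = filter_by_cloud_cover(sample_granules, max_cloud_cover=50)
clear_ids = [g['meta']['native-id'] for g in clear_enough]
print(clear_ids)
```

Since each search result also acts as a dictionary, the same comprehension should work on `results` directly, though that has not been run here.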


## Downloading or Streaming Data

After we have our results, there are 3 ways we can work with the data:

1. Download All Assets
2. Selectively Download Assets
3. Access in place / Stream the data. 

To download the data, we can use the `download` function. This retrieves all assets associated with a granule, which is convenient if you plan to work with local copies and need every asset included with the product. For the EMIT L2A Reflectance, this includes the Uncertainty and Masks files.

In [22]:
# earthaccess.download(results, '../../data/')

If we want to stream the data or further filter the assets for download, we first create a list of URLs nested by granule using a list comprehension.

In [23]:
emit_results_urls = [granule.data_links() for granule in results]
emit_results_urls

[['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc',
  'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFLUNCERT_001_20220903T163129_2224611_012.nc',
  'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_MASK_001_20220903T163129_2224611_012.nc']]

Now we can also split these into results for specific assets or filter out an asset using the following. In this example, we only want to access or download reflectance.

In [24]:
filtered_asset_links = []
# Pick Desired Assets - Use underscores to aid in string matching of the filenames (_RFL_, _RFLUNCERT_, _MASK_)
desired_assets = ['_RFL_']
# Step through each sublist (granule) and filter based on desired assets.
for granule in emit_results_urls:
    for url in granule: 
        asset_name = url.split('/')[-1]
        if any(asset in asset_name for asset in desired_assets):
            filtered_asset_links.append(url)
filtered_asset_links

['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc']
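The file names matched above appear to encode the product, version, acquisition time, orbit, and scene, which agrees with the `ORBIT` and `SCENE` attributes and the temporal coverage seen earlier. The sketch below parses those pieces with a regular expression. The pattern is inferred from the file names in this notebook and is an assumption, not an official specification; `parse_emit_asset_name` is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern inferred from names such as
# EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc (an assumption, not a spec)
EMIT_NAME_PATTERN = re.compile(
    r'^EMIT_(?P<level>L\d[A-Z]?)_(?P<product>[A-Z0-9]+)_(?P<version>\d{3})_'
    r'(?P<acquired>\d{8}T\d{6})_(?P<orbit>\d+)_(?P<scene>\d+)\.nc$'
)


def parse_emit_asset_name(url_or_name):
    """Split an EMIT asset name into its parts, or return None if it doesn't match."""
    name = url_or_name.split('/')[-1]
    match = EMIT_NAME_PATTERN.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts['acquired'] = datetime.strptime(parts['acquired'], '%Y%m%dT%H%M%S')
    parts['scene'] = int(parts['scene'])
    return parts


info = parse_emit_asset_name(
    'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/'
    'EMIT_L2A_RFL_001_20220903T163129_2224611_012/'
    'EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc'
)
print(info['product'], info['acquired'].isoformat(), info['orbit'], info['scene'])
```

Parsed fields like `product` offer an alternative to substring matching when you need to group or sort assets.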

After we have our filtered list, we can stream the reflectance asset or download it. Start an HTTPS session, then open the URL to stream the data, or download it to save the file.

#### Stream Data

This may take a while to load the dataset.

In [25]:
# Get Https Session using Earthdata Login Info
fs = earthaccess.get_fsspec_https_session()
# Retrieve granule asset ID from URL (to maintain existing naming convention)
url = filtered_asset_links[0]
granule_asset_id = url.split('/')[-1]
# Open the remote file as a file-like object (nothing is saved locally)
fp = fs.open(url)
# Open with `emit_xarray` function
ds = emit_xarray(fp)
ds


#### Download Filtered 

In [None]:
# Get requests https Session using Earthdata Login Info
fs = earthaccess.get_requests_https_session()
# Retrieve granule asset ID from URL (to maintain existing naming convention)
for url in filtered_asset_links:
    granule_asset_id = url.split('/')[-1]
    # Define Local Filepath
    fp = f'../../data/{granule_asset_id}'
    # Download the Granule Asset if it doesn't exist
    if not os.path.isfile(fp):
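        with fs.get(url,stream=True) as src:
            # Raise on HTTP errors (e.g. failed authentication) rather than saving an error page
            src.raise_for_status()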
            with open(fp,'wb') as dst:
                for chunk in src.iter_content(chunk_size=64*1024*1024):
                    dst.write(chunk)
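One caveat with the existence check above: if a download is interrupted, a partial file is left behind and will be skipped on the next run. A possible refinement, sketched here under the assumption that you are free to change how the file is written, is to stream into a temporary `.part` file and rename it only once every chunk has arrived. The helper `write_chunks_atomically` is hypothetical, and the demo uses in-memory byte chunks instead of a real HTTP response.

```python
import os
import tempfile


def write_chunks_atomically(chunks, dest_path):
    """Write chunks to dest_path via a temporary '.part' file.

    dest_path only appears once every chunk has been written.
    """
    tmp_path = dest_path + '.part'
    try:
        with open(tmp_path, 'wb') as dst:
            for chunk in chunks:
                dst.write(chunk)
        # Rename only after a complete write
        os.replace(tmp_path, dest_path)
    finally:
        # Clean up the partial file if anything went wrong
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


def failing_chunks():
    """Simulate a connection that drops after the first chunk."""
    yield b'abc'
    raise ConnectionError('simulated dropped connection')


with tempfile.TemporaryDirectory() as tmp_dir:
    ok_path = os.path.join(tmp_dir, 'complete.nc')
    write_chunks_atomically([b'abc', b'def'], ok_path)
    with open(ok_path, 'rb') as f:
        complete_bytes = f.read()

    bad_path = os.path.join(tmp_dir, 'interrupted.nc')
    try:
        write_chunks_atomically(failing_chunks(), bad_path)
    except ConnectionError:
        pass
    files_left = sorted(os.listdir(tmp_dir))

print(complete_bytes, files_left)
```

In the download loop, the chunk iterator would be `src.iter_content(chunk_size=...)` and `dest_path` would be `fp`.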

## Contact Info:  

Email: LPDAAC@usgs.gov  
Voice: +1-866-573-3222  
Organization: Land Processes Distributed Active Archive Center (LP DAAC)¹  
Website: <https://lpdaac.usgs.gov/>  
Date last modified: 11-06-2024  

¹Work performed under USGS contract G15PD00467 for NASA contract NNG14HH33I. 