# How to: Find and Access EMIT Data

**Summary**  

This notebook explains how to access Earth Surface Mineral Dust Source Investigation (EMIT) data programmatically using NASA's CMR API. The Common Metadata Repository (CMR) is a metadata system that catalogs Earth Science data and associated metadata records. The CMR Application Programming Interface (API) provides programmatic search capabilities through CMR's vast metadata holdings using various parameters and keywords. When querying NASA's CMR, there is a limit of 1 million granules matched and only 2000 granules returned per page. 

**Requirements:**
+ A NASA [Earthdata Login](https://urs.earthdata.nasa.gov/) account is required to download EMIT data   
+ Select the `emit_tutorials` environment as the kernel for this notebook.
  + For instructions on setting up the environment, follow the `setup_instructions.md` included in the `/setup/` folder of the repository.  

**Learning Objectives**  
- How to find EMIT data using NASA's CMR API
- How to download EMIT data programmatically

Import the required packages

In [2]:
import os
import requests
import earthaccess
import pandas as pd
import datetime as dt
import geopandas
from shapely.geometry import MultiPolygon, Polygon, box



In [19]:
%pip install geopandas

Note: you may need to restart the kernel to use updated packages.


## Obtaining the Concept ID

NASA Earthdata's unique ID for this dataset (called the Concept ID) is needed to search the dataset. The dataset's Digital Object Identifier (DOI) can be used to obtain the Concept ID. DOIs can be found by clicking the `Citation` link on the LP DAAC's [EMIT Product Pages](https://lpdaac.usgs.gov/product_search/?query=emit&view=cards&sort=title).

In [3]:
# DOI of the EMIT L2A Reflectance collection
doi = '10.5067/EMIT/EMITL2ARFL.001'

# CMR API base url
cmrurl = 'https://cmr.earthdata.nasa.gov/search/'

doisearch = cmrurl + 'collections.json?doi=' + doi
concept_id = requests.get(doisearch).json()['feed']['entry'][0]['id']
print(concept_id)

C2408750690-LPCLOUD


This is the unique NASA-given concept ID for the EMIT L2A Reflectance dataset, which can be used to retrieve relevant files (or granules).
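
The lookup above indexes straight into the response, so a mistyped DOI surfaces as an `IndexError`. A minimal sketch of a more defensive version follows; `doi_search_url` and `extract_concept_id` are hypothetical helper names, and `sample` is a hand-written stand-in for `requests.get(doisearch).json()`, not a real response.

```python
def doi_search_url(doi, cmr_url="https://cmr.earthdata.nasa.gov/search/"):
    """Build the collection search URL for a DOI (same pattern as the cell above)."""
    return f"{cmr_url}collections.json?doi={doi}"


def extract_concept_id(response_json):
    """Return the first collection's concept ID, or raise if nothing matched."""
    entries = response_json.get("feed", {}).get("entry", [])
    if not entries:
        raise ValueError("No collection matched the DOI; check the DOI string.")
    return entries[0]["id"]


# Hand-written stand-in for the parsed JSON response (assumption, for illustration)
sample = {"feed": {"entry": [{"id": "C2408750690-LPCLOUD"}]}}
print(doi_search_url("10.5067/EMIT/EMITL2ARFL.001"))
print(extract_concept_id(sample))
```

In the notebook this could be used as `extract_concept_id(requests.get(doi_search_url(doi)).json())`.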

## Searching using CMR API

When searching the CMR API, users can provide spatial bounds and date-time ranges to narrow their search. The spatial bounds can be a point, a bounding box, or a polygon. 

Specify the start and end dates and times, then reformat them to the structure CMR requires.

In [4]:
# Temporal Bound - Year, month, day. Hour, minutes, and seconds (ZULU) can also be included 
start_date = dt.datetime(2022, 9, 3)
end_date = dt.datetime(2022, 9, 3, 23, 23, 59)  

# CMR formatted start and end times
dt_format = '%Y-%m-%dT%H:%M:%SZ'
temporal_str = start_date.strftime(dt_format) + ',' + end_date.strftime(dt_format)
print(temporal_str)

2022-09-03T00:00:00Z,2022-09-03T23:23:59Z
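
If the temporal range is built in more than one place, the formatting step can be wrapped in a small function. This is only a sketch: `cmr_temporal` is a hypothetical helper, and it assumes the `datetime` values passed in are already in UTC, since the format string appends `Z` without converting time zones.

```python
import datetime as dt

CMR_DT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def cmr_temporal(start, end):
    """Format a start/end pair as the 'start,end' string used in the search."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return f"{start.strftime(CMR_DT_FORMAT)},{end.strftime(CMR_DT_FORMAT)}"


# Same dates as the cell above
print(cmr_temporal(dt.datetime(2022, 9, 3), dt.datetime(2022, 9, 3, 23, 23, 59)))
```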


The CMR API only allows 2000 results to be shown at a time. Using `page_num` allows a user to loop through the search result pages. The sections below walk through using Points, Bounding Boxes, and Polygons to spatially constrain a search made using the CMR API. 
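
To get a feel for how many pages a search will take, the page count can be worked out from the number of matches. The sketch below uses the two limits stated in the summary; `pages_needed` is a hypothetical helper, and reading the match count from the `CMR-Hits` response header is an assumption to check against the CMR documentation.

```python
import math

CMR_PAGE_SIZE_LIMIT = 2000      # granules returned per page
CMR_MAX_MATCHES = 1_000_000     # granules matched per query


def pages_needed(hits, page_size=CMR_PAGE_SIZE_LIMIT):
    """Number of pages required to read `hits` matches at `page_size` per page."""
    if not 1 <= page_size <= CMR_PAGE_SIZE_LIMIT:
        raise ValueError("page_size must be between 1 and 2000")
    reachable = min(hits, CMR_MAX_MATCHES)  # matches beyond the cap are not reachable
    return math.ceil(reachable / page_size)


for hits in (1, 2000, 2001):
    print(hits, pages_needed(hits))
```

In the notebook, `hits` might be read with something like `int(response.headers["CMR-Hits"])`.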

### Search using Points

To search using a point we specify a latitude and longitude.

In [5]:
lon = -62.1123
lat = -39.89402
point_str = f"{lon},{lat}"

page_num = 1
page_size = 2000 # CMR page size limit

granule_arr = []

while True:
    # Define the search parameters
    cmr_param = {
        "collection_concept_id": concept_id,
        "page_size": page_size,
        "page_num": page_num,
        "temporal": temporal_str,
        "point": point_str
    }

    granulesearch = cmrurl + 'granules.json'
    response = requests.post(granulesearch, data=cmr_param)
    granules = response.json()['feed']['entry']

    if granules:
        for g in granules:
            granule_urls = ''
            granule_poly = ''

            # Read cloud cover
            cloud_cover = g['cloud_cover']
            # Read bounding geometries; each ring is a string of "lat lon lat lon ..."
            if 'polygons' in g:
                polygons = g['polygons']
                multipolygons = []
                for poly in polygons:
                    i = iter(poly[0].split(" "))
                    ltln = list(map(" ".join, zip(i, i)))
                    # Swap to (lon, lat) order for shapely
                    multipolygons.append(Polygon([[float(p.split(" ")[1]), float(p.split(" ")[0])] for p in ltln]))
                granule_poly = MultiPolygon(multipolygons)
            # Get https URLs to .nc files and exclude .dmrpp files
            granule_urls = [x['href'] for x in g['links'] if 'https' in x['href'] and '.nc' in x['href'] and '.dmrpp' not in x['href']]
            # Add to list
            granule_arr.append([granule_urls, cloud_cover, granule_poly])

        page_num += 1
    else:
        # Stop once a page comes back empty
        break

print(granule_arr)

[[['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc', 'https://data.lpdaac.earthdatacloud.nasa.
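
CMR returns each granule footprint as a flat string of alternating latitude and longitude values, which is why the loop pairs the values up and swaps them before building a `Polygon`. The same step can be written as a standalone function, sketched here in plain Python; `parse_cmr_ring` is a hypothetical helper, and the sample ring is the footprint of the granule found by this search.

```python
def parse_cmr_ring(ring):
    """Convert a 'lat lon lat lon ...' string into a list of (lon, lat) tuples."""
    values = [float(v) for v in ring.split()]
    if len(values) % 2:
        raise ValueError("expected an even number of values (lat/lon pairs)")
    # Even positions are latitudes, odd positions are longitudes
    return [(lon, lat) for lat, lon in zip(values[0::2], values[1::2])]


ring = ("-39.242054 -62.0887375 -39.9433022 -62.5120087 -40.4574928 -61.6601334 "
        "-39.7562447 -61.2368622 -39.242054 -62.0887375")
coords = parse_cmr_ring(ring)
print(coords[0])
print(len(coords))
```

The resulting list is in the (x, y) order that `shapely.geometry.Polygon` expects, so `Polygon(parse_cmr_ring(poly[0]))` could stand in for the list comprehension in the loop.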

### Search using a bounding box
For this we'll use a bounding box along the coast of Argentina with a bottom left corner of -62.1123 Longitude, -39.89402 Latitude, and a top right corner of -61.70801 Longitude and -39.57769 Latitude.

In [43]:
# Search Using a Bounding Box
bound = (-62.1123, -39.89402, -61.70801, -39.57769) 
bound_str = ','.join(map(str,bound))

page_num = 1
page_size = 2000 # CMR page size limit

granule_arr = []

while True:
    # Define the search parameters
    cmr_param = {
        "collection_concept_id": concept_id,
        "page_size": page_size,
        "page_num": page_num,
        "temporal": temporal_str,
        "bounding_box[]": bound_str
    }

    granulesearch = cmrurl + 'granules.json'
    response = requests.post(granulesearch, data=cmr_param)
    granules = response.json()['feed']['entry']

    if granules:
        for g in granules:
            granule_urls = ''
            granule_poly = ''

            # Read cloud cover
            cloud_cover = g['cloud_cover']

            # Read bounding geometries; each ring is a string of "lat lon lat lon ..."
            if 'polygons' in g:
                polygons = g['polygons']
                multipolygons = []
                for poly in polygons:
                    i = iter(poly[0].split(" "))
                    ltln = list(map(" ".join, zip(i, i)))
                    # Swap to (lon, lat) order for shapely
                    multipolygons.append(Polygon([[float(p.split(" ")[1]), float(p.split(" ")[0])] for p in ltln]))
                granule_poly = MultiPolygon(multipolygons)

            # Get https URLs to .nc files and exclude .dmrpp files
            granule_urls = [x['href'] for x in g['links'] if 'https' in x['href'] and '.nc' in x['href'] and '.dmrpp' not in x['href']]
            # Add to list
            granule_arr.append([granule_urls, cloud_cover, granule_poly])

        page_num += 1
    else:
        # Stop once a page comes back empty. Without this break the loop keeps
        # requesting pages until CMR returns an error response with no 'feed' key.
        break

print(granule_arr)
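
The bounding box string lists the lower-left corner first and the upper-right corner second, that is west, south, east, north. A small check before sending the request can catch swapped values. This is a sketch only; `bounding_box_str` is a hypothetical helper, and it deliberately does not require west to be less than east, on the assumption that a box crossing the antimeridian may be valid.

```python
def bounding_box_str(west, south, east, north):
    """Validate corner values and join them as 'west,south,east,north'."""
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must be between -180 and 180")
    if not (-90 <= south <= north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south <= north <= 90")
    return ",".join(str(v) for v in (west, south, east, north))


# Same corners as the bounding box used in this section
print(bounding_box_str(-62.1123, -39.89402, -61.70801, -39.57769))
```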

### Search a Polygon

A polygon can also be used to spatially search using the CMR API. A shapefile, GeoJSON, or other format can be opened as a geopandas dataframe, then reformatted to GeoJSON to be sent as a parameter in the CMR search. Note that very complex shapefiles must be simplified because CMR enforces a 5000-coordinate limit.

In [12]:
# Search using a Polygon
polygon = geopandas.read_file('../../data/isla_gaviota.geojson')
geojson = {"shapefile": ("isla_gaviota.geojson", polygon.geometry.to_json(), "application/geo+json")}

page_num = 1
page_size = 2000 # CMR page size limit

granule_arr = []

while True:
    
    # Define the search parameters
    cmr_param = {
        "collection_concept_id": concept_id, 
        "page_size": page_size,
        "page_num": page_num,
        "temporal": temporal_str,
        "simplify-shapefile": 'true' # this is needed to bypass 5000 coordinates limit of CMR
    }

    granulesearch = cmrurl + 'granules.json'
    response = requests.post(granulesearch, data=cmr_param, files=geojson)
    granules = response.json()['feed']['entry']
       
    if granules:
        for g in granules:
            granule_urls = ''
            granule_poly = ''
                       
            # read granule title and cloud cover
            granule_name = g['title']
            cloud_cover = g['cloud_cover']
    
            # reading bounding geometries
            if 'polygons' in g:
                polygons= g['polygons']
                multipolygons = []
                for poly in polygons:
                    i=iter(poly[0].split (" "))
                    ltln = list(map(" ".join,zip(i,i)))
                    multipolygons.append(Polygon([[float(p.split(" ")[1]), float(p.split(" ")[0])] for p in ltln]))
                granule_poly = MultiPolygon(multipolygons)
            
            # Get https URLs to .nc files and exclude .dmrpp files
            granule_urls = [x['href'] for x in g['links'] if 'https' in x['href'] and '.nc' in x['href'] and '.dmrpp' not in x['href']]
            # Add to list
            granule_arr.append([granule_urls, cloud_cover, granule_poly])
                           
        page_num += 1
    else: 
        break
 
print(granule_arr)

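
To see how close a geometry is to the 5000-coordinate limit before sending it, the coordinate pairs can be counted directly from the GeoJSON. The sketch below is plain Python; `count_coordinates` is a hypothetical helper and the square used to exercise it is made up for illustration.

```python
def _count_positions(coords):
    """Count positions in a nested GeoJSON coordinates array."""
    if coords and isinstance(coords[0], (int, float)):
        return 1  # a single [x, y] position
    return sum(_count_positions(c) for c in coords)


def count_coordinates(obj):
    """Count coordinate pairs in a GeoJSON FeatureCollection, Feature, or geometry."""
    if obj is None:
        return 0  # a Feature may carry a null geometry
    kind = obj.get("type")
    if kind == "FeatureCollection":
        return sum(count_coordinates(f) for f in obj["features"])
    if kind == "Feature":
        return count_coordinates(obj["geometry"])
    if kind == "GeometryCollection":
        return sum(count_coordinates(g) for g in obj["geometries"])
    return _count_positions(obj["coordinates"])


# Made-up square, for illustration only
square = {"type": "Polygon",
          "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}
collection = {"type": "FeatureCollection",
              "features": [{"type": "Feature", "properties": {}, "geometry": square}]}
print(count_coordinates(square), count_coordinates(collection))
```

With the geopandas dataframe from this section, the count could be obtained with `count_coordinates(json.loads(polygon.to_json()))` after importing `json`.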

> Note: At the time this tutorial was made, all three searches (point, bounding box, and polygon) should return the same assets.

### Creating a Dataframe with the resulting Links

A `pandas.DataFrame` can be used to store the download URLs and geometries of each file. The EMIT L2A Reflectance collection contains 3 assets per granule (reflectance, reflectance uncertainty, and masks). Printing the list shows that three assets correspond to a single polygon. In the next step we place these into a dataframe and 'explode' it so that each asset gets its own row. If we only want a subset of these assets, we can filter them out afterwards. 

In [7]:
cmr_results_df = pd.DataFrame(granule_arr, columns=["asset_url", "cloud_cover", "granule_poly"])
cmr_results_df = cmr_results_df[cmr_results_df['granule_poly'] != '']
cmr_results_df = cmr_results_df.explode('asset_url')
cmr_results_df.insert(0,'asset_name', cmr_results_df.asset_url.str.split('/',n=-1).str.get(-1))
cmr_results_df

| | asset_name | asset_url | cloud_cover | granule_poly |
|---|---|---|---|---|
| 0 | EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc | https://data.lpdaac.earthdatacloud.nasa.gov/lp... | 82 | MULTIPOLYGON (((-62.0887375 -39.242054, -62.51... |
| 0 | EMIT_L2A_RFLUNCERT_001_20220903T163129_2224611... | https://data.lpdaac.earthdatacloud.nasa.gov/lp... | 82 | MULTIPOLYGON (((-62.0887375 -39.242054, -62.51... |
| 0 | EMIT_L2A_MASK_001_20220903T163129_2224611_012.nc | https://data.lpdaac.earthdatacloud.nasa.gov/lp... | 82 | MULTIPOLYGON (((-62.0887375 -39.242054, -62.51... |


In [16]:
print(granule_arr)

[[['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc', 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_RFLUNCERT_001_20220903T163129_2224611_012.nc', 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/EMITL2ARFL.001/EMIT_L2A_RFL_001_20220903T163129_2224611_012/EMIT_L2A_MASK_001_20220903T163129_2224611_012.nc'], '82', <MULTIPOLYGON (((-62.089 -39.242, -62.512 -39.943, -61.66 -40.457, -61.237 -...>]]


In [15]:
cmr_results_df.to_csv("output.csv", index=False)

At this stage we can filter based on the assets that we want or on the cloud cover. For this example, let's say we are only interested in the Reflectance and the Mask. To filter by asset, we can match strings included in the asset name. 

In [25]:
cmr_results_df = cmr_results_df[cmr_results_df.asset_name.str.contains('_RFL_') | cmr_results_df.asset_name.str.contains('MASK')]
cmr_results_df

| | asset_name | asset_url | cloud_cover | granule_poly |
|---|---|---|---|---|
| 0 | EMIT_L2A_RFL_001_20220903T163129_2224611_012.nc | https://data.lpdaac.earthdatacloud.nasa.gov/lp... | 82 | MULTIPOLYGON (((-62.0887375 -39.242054, -62.51... |
| 0 | EMIT_L2A_MASK_001_20220903T163129_2224611_012.nc | https://data.lpdaac.earthdatacloud.nasa.gov/lp... | 82 | MULTIPOLYGON (((-62.0887375 -39.242054, -62.51... |
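
Filtering on cloud cover needs one extra step, because the value arrives as a string (`'82'` in the printed `granule_arr`). The sketch below works on the `[urls, cloud_cover, polygon]` records directly in plain Python; `filter_by_cloud_cover` is a hypothetical helper and the sample records are made up, apart from the `'82'` value.

```python
def filter_by_cloud_cover(granules, max_cloud_cover):
    """Keep [urls, cloud_cover, polygon] records at or below a cloud cover threshold."""
    kept = []
    for urls, cloud_cover, poly in granules:
        try:
            value = float(cloud_cover)  # CMR gives the value as a string
        except (TypeError, ValueError):
            continue  # skip records with a missing or non-numeric value
        if value <= max_cloud_cover:
            kept.append([urls, cloud_cover, poly])
    return kept


# Made-up records for illustration; only the '82' value comes from the search above
records = [[["a.nc"], "82", None], [["b.nc"], "10", None], [["c.nc"], None, None]]
print(filter_by_cloud_cover(records, 50))
```

On the dataframe itself, a comparable filter might convert the column with `pd.to_numeric(cmr_results_df.cloud_cover)` before comparing.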


After filtering down to the assets you want, you can output a text file with the asset URLs or save the entire dataframe, then use a utility such as `wget` or the DAAC Data Download Tool to download the files. To download, you will need to set up NASA Earthdata Login authentication using a `.netrc` file. 
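
Before starting a download, it can help to confirm that the `.netrc` file actually holds an Earthdata Login entry. The following is a sketch using Python's standard `netrc` module; `has_earthdata_login` is a hypothetical helper, the host name is taken from the Earthdata Login link in the requirements, and the demo credentials are placeholders written to a temporary file.

```python
import netrc
import os
import tempfile

EDL_HOST = "urs.earthdata.nasa.gov"


def has_earthdata_login(netrc_path):
    """Return True if the given .netrc file has an entry for Earthdata Login."""
    try:
        auth = netrc.netrc(netrc_path).authenticators(EDL_HOST)
    except (FileNotFoundError, netrc.NetrcParseError):
        return False
    return auth is not None


# Demonstrate on a temporary file with placeholder credentials
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".netrc")
    with open(demo_path, "w") as f:
        f.write(f"machine {EDL_HOST} login demo_user password demo_pass\n")
    found = has_earthdata_login(demo_path)
print(found)
```

For a real check, the path would typically be `os.path.expanduser("~/.netrc")`.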

Save the asset URLs to a text file in the `/data/` folder.

In [30]:
# Save text file of asset urls
cmr_results_df = cmr_results_df.drop_duplicates(subset=['asset_url']) # Remove any duplicates
cmr_results_df.to_csv('../../data/emit_asset_urls.txt', columns=['asset_url'], index=False, header=False)


## Downloading Files using the list of URLs/Text File

To download the files using Python, you can run the cell below. 

In [27]:
# Define input filepath
url_list_filepath = "../../data/emit_asset_urls.txt"
# Define output directory
output_directory = "../../data/"

# Read the asset URLs, one per line
with open(url_list_filepath, "r") as file:
    file_list = file.read().splitlines()

# EDL Authentication/Create .netrc if necessary
earthaccess.login(persist=True)
# Get requests https Session using Earthdata Login Info
fs = earthaccess.get_requests_https_session()
for url in file_list:
    # Retrieve granule asset ID from URL (to maintain existing naming convention)
    granule_asset_id = url.split("/")[-1]
    # Define Local Filepath
    fp = os.path.join(output_directory, granule_asset_id)
    # Skip the Granule Asset if it already exists locally
    if os.path.isfile(fp):
        print(f"{granule_asset_id} already exists, skipping.")
        continue
    print(f"Downloading {granule_asset_id}...")
    with fs.get(url, stream=True) as src:
        # Stop on HTTP errors instead of saving an error page as a .nc file
        src.raise_for_status()
        with open(fp, "wb") as dst:
            for chunk in src.iter_content(chunk_size=64 * 1024 * 1024):
                dst.write(chunk)

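
Because the download loop skips any file that already exists, an interrupted download can leave a partial file that is never retried. One possible way around this, sketched below, is to write to a temporary `.part` name and rename it only after every chunk has been written. `write_chunks_atomically` is a hypothetical helper, and the byte strings stand in for the chunks yielded by `src.iter_content(...)`.

```python
import os
import tempfile


def write_chunks_atomically(chunks, filepath):
    """Write chunks to '<filepath>.part', then rename to filepath once complete."""
    part_path = filepath + ".part"
    with open(part_path, "wb") as dst:
        for chunk in chunks:
            dst.write(chunk)
    # Reached only if every chunk was written without an exception
    os.replace(part_path, filepath)


# Demonstrate with in-memory chunks instead of a network response
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "demo.nc")
    write_chunks_atomically([b"abc", b"def"], target)
    print(os.path.getsize(target), os.path.exists(target + ".part"))
```

In the download loop this could replace the inner `with open(fp, "wb")` block with `write_chunks_atomically(src.iter_content(chunk_size=64 * 1024 * 1024), fp)`.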

To download using `wget`, run the following cell. From a terminal, use the same command without the leading `!`.

In [None]:
!wget -P ../../data/ -i ../../data/emit_asset_urls.txt

## Contact Info:  

Email: LPDAAC@usgs.gov  
Voice: +1-866-573-3222  
Organization: Land Processes Distributed Active Archive Center (LP DAAC)¹  
Website: <https://lpdaac.usgs.gov/>  
Date last modified: 03-22-2024  

¹Work performed under USGS contract G15PD00467 for NASA contract NNG14HH33I. 