# Using the CMR API and asyncio for fast CMR Queries  

---

## Summary  

This tutorial demonstrates how to efficiently query the Common Metadata Repository (CMR) and extract data download Uniform Resource Locators (URLs) for every metadata record within a NASA Earthdata collection. Two examples are shown. The first makes sequential requests for the data URLs associated with specified collections. The second demonstrates how to leverage Python's `asyncio` package to perform bulk parallel requests for the same information and highlights the resulting increase in speed. The NASA Earthdata collections highlighted here are Harmonized Landsat Sentinel-2 Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m ([HLSL30.002](https://doi.org/10.5067/HLS/HLSL30.002)) and
Harmonized Landsat Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m ([HLSS30.002](https://doi.org/10.5067/HLS/HLSS30.002)).  

### What is CMR?  

The CMR is a metadata system that catalogs NASA's Earth Observing System Data and Information System (EOSDIS) data and associated metadata. The CMR Application Programming Interface (API) provides programmatic search capabilities through CMR's vast metadata holdings using various parameters and keywords. When querying NASA's CMR, a search can match at most 1 million granules, and only 2000 granules are returned per page. This guide shows how to search for CMR records using the CMR API and create a list of download URLs. It also shows how to leverage asynchronous, or parallel, requests to speed up this process. The example below leverages the Harmonized Landsat Sentinel-2 collection archived by NASA's LP DAAC to demonstrate how to use Python's `asyncio` to perform large queries against NASA's CMR.  

## Objectives  

+ Use the CMR API and Python to perform large queries (requests that return more than 2000 granules) against NASA's CMR.  
+ Prepare a list of URLs to access or download assets associated with those granules.  
+ Utilize asynchronous/parallel requests to increase speed of query and list construction.  

---

## Getting Started  

Import the required packages.


In [3]:
import requests
import math
import aiohttp
import asyncio
import time
import earthaccess

auth = earthaccess.login(persist=True)



## Searching the CMR

Set the CMR API Endpoint. This is the URL that we'll use to search through the CMR.

In [4]:
CMR_OPS = 'https://cmr.earthdata.nasa.gov/search' # CMR API Endpoint
url = f'{CMR_OPS}/{"granules"}'

To search the CMR we need to set our parameters. In this example we'll narrow our search using collection IDs, a range of dates and times, a bounding box, and the number of results we want to show per page. Other spatial filters can also be used to narrow searches (example shown in [HLS_Tutorial](https://git.earthdata.nasa.gov/projects/LPDUR/repos/hls-tutorial/browse/HLS_Tutorial.ipynb)). 

Specify the `collections` to search, set a `datetime_range` and a `bbox` (west, south, east, north in decimal degrees), and set the quantity of results to return per page using the `page_size` parameter as shown below.  

In [94]:
collections = ['C1748058432-LPCLOUD'] # Collection concept_id in LP DAAC's cloud archive (the results below identify it as MOD11A1 v061)
datetime_range = '2000-02-24T00:00:00Z,2025-07-08T00:00:00Z'
page_size = 2000
bbox = '3.82715,2.29382,14.36961,15.00814'

A CMR search can find up to 1 million items or granules, but the number returned per page is limited to 2000, meaning large searches may have several pages of results. By default, `page_size` is set to 10.
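The number of pages follows from a ceiling division of the hit count by the page size. As a small sketch (the helper name `pages_needed` is illustrative and not part of the CMR API), the hit count reported later in this tutorial works out as follows:

```python
import math

CMR_MAX_PAGE_SIZE = 2000  # per-page limit described above


def pages_needed(hits: int, page_size: int = CMR_MAX_PAGE_SIZE) -> int:
    """Return how many result pages are needed to cover `hits` granules."""
    if page_size < 1 or page_size > CMR_MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {CMR_MAX_PAGE_SIZE}")
    return math.ceil(hits / page_size)


print(pages_needed(36754))      # 19 pages at 2000 granules per page
print(pages_needed(36754, 10))  # 3676 pages at the default page size
```

Using the largest allowed page size keeps the number of requests as low as possible.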

## Submitting Requests

Using the above search criteria we can make a request using the `requests.get()` function. Submit a request and print the `response.status_code`.


In [141]:
response = requests.get(url, 
                        params={
                            'concept_id': collections,
                            'temporal': datetime_range,
                            'bounding_box': bbox,
                            'page_size': page_size
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
print(response.status_code)

200


A status code of 200 indicates the request has succeeded. 

To see the number of results, print the `CMR-Hits` found in the returned header.

In [142]:
print(response.headers['CMR-Hits']) # Resulting quantity of granules/items.

36754
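Header values arrive as strings, so the hit count needs converting before any arithmetic. A minimal sketch, assuming a hypothetical helper named `parse_hits` that accepts any mapping of header names to values (such as `response.headers`):

```python
from typing import Mapping, Optional


def parse_hits(headers: Mapping[str, str]) -> Optional[int]:
    """Return the CMR-Hits header as an int, or None if missing or malformed."""
    value = headers.get("CMR-Hits")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


print(parse_hits({"CMR-Hits": "36754"}))  # 36754
print(parse_hits({}))                      # None
```

Note that `response.headers` from `requests` is case-insensitive, while the plain dictionaries used in this sketch are not.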


## Building a List of File URLs

We can build a list of URLs to data assets using our search results. Notice this only uses the first page of results.

In [143]:
granules = response.json()['feed']['entry']
len(granules) # Resulting quantity of granules on page one.

2000

[{'producer_granule_id': 'MOD11A1.A2000055.h18v07.061.2020043120835',
  'time_start': '2000-02-24T00:00:00.000Z',
  'cloud_cover': '64.0',
  'updated': '2020-03-25T02:29:27.229Z',
  'dataset_id': 'MODIS/Terra Land Surface Temperature/Emissivity Daily L3 Global 1km SIN Grid V061',
  'data_center': 'LPCLOUD',
  'title': 'MOD11A1.A2000055.h18v07.061.2020043120835',
  'coordinate_system': 'GEODETIC',
  'day_night_flag': 'BOTH',
  'time_end': '2000-02-24T23:59:59.000Z',
  'id': 'G2182561954-LPCLOUD',
  'original_format': 'ECHO10',
  'granule_size': '3.03519',
  'browse_flag': True,
  'polygons': [['10.0041667 -0.0042208 10.0041667 10.1502892 19.9958333 10.6373196 19.9958333 -0.0044399 10.0041667 -0.0042208']],
  'collection_concept_id': 'C1748058432-LPCLOUD',
  'online_access_flag': True,
  'links': [{'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#',
    'title': 'Download MOD11A1.A2000055.h18v07.061.2020043120835.hdf',
    'hreflang': 'en-US',
    'href': 'https://data.lpdaac.earthdatacl

In [89]:
file_list = []
for g in granules:
    file_list.extend([x['href'] for x in g['links'] if 'https' in x['href'] and '.hdf' in x['href']])
len(file_list) # Total number of assets from page one of granules.

200

Print part of the URLs list. 

In [91]:
file_list[:25]

['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h18v07.061.2020043120835/MOD11A1.A2000055.h18v07.061.2020043120835.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h19v07.061.2020043120933/MOD11A1.A2000055.h19v07.061.2020043120933.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h18v08.061.2020043121044/MOD11A1.A2000055.h18v08.061.2020043121044.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h19v08.061.2020043120847/MOD11A1.A2000055.h19v08.061.2020043120847.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000056.h18v07.061.2020043120835/MOD11A1.A2000056.h18v07.061.2020043120835.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000056.h19v07.061.2020043120933/MOD11A1.A2000056.h19v07.061.2020043120933.hdf',
 'ht

This process can be extended to all pages of search results to build a complete list of asset URLs. 
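Since the same link filtering is repeated for every page, it can be wrapped in a small helper. The sketch below is one possible version, not the tutorial's own code: the name `extract_data_urls` is hypothetical, the sample entries are hand-made, and the match is slightly stricter than the filter above (the URL must start with `https://` and end with the suffix).

```python
from typing import Iterable, List


def extract_data_urls(granules: Iterable[dict], suffix: str = ".hdf") -> List[str]:
    """Collect HTTPS links ending in `suffix` from CMR granule entries."""
    urls = []
    for granule in granules:
        for link in granule.get("links", []):
            href = link.get("href", "")
            if href.startswith("https://") and href.endswith(suffix):
                urls.append(href)
    return urls


# Hand-made sample entries (hypothetical), shaped like the JSON shown above.
sample = [
    {"links": [
        {"href": "https://example.com/data/tile_a.hdf"},
        {"href": "s3://bucket/data/tile_a.hdf"},
        {"href": "https://example.com/browse/tile_a.jpg"},
    ]},
    {"title": "entry without links"},
]
print(extract_data_urls(sample))  # ['https://example.com/data/tile_a.hdf']
```

Using `.get()` with defaults means an entry without a `links` key is skipped instead of raising a `KeyError`.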

## Creating a List from Multiple Results Pages

To create a list from multiple results pages, we first define a function that computes the total number of pages from the number of results.

In [118]:
def get_page_total(collections, datetime_range, page_size):
    hits = requests.get(url, 
                        params={
                            'concept_id': collections,
                            'temporal': datetime_range,
                            'bounding_box': bbox,
                            'page_size': page_size,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       ).headers['CMR-Hits']
    return math.ceil(int(hits)/page_size)

Then we build a list of pages called `page_numbers`.

In [119]:
page_numbers = list(range(1, get_page_total(collections, datetime_range, page_size)+1))
len(page_numbers) # Total number of pages to iterate through.

19

After we have a list of pages we can iterate through page by page to make a complete list of assets matching our search.

In [120]:
data_urls = [] # empty list
start = time.time() # Begin timer
for n in page_numbers: # Iterate through requests page by page sequentially
    print(f'Page: {n}') # Print Page Number
    response = requests.get(url, # Same request function as used previously
                            params={
                                'concept_id': collections,
                                'temporal': datetime_range,
                                'page_size': page_size,
                                'bounding_box': bbox,
                                'page_num': n
                            },
                            headers={
                                'Accept': 'application/json'
                            }
                           )
    print(f'Page {n} Response Code: {response.status_code}') # Show the response code for each page
    
    granules = response.json()['feed']['entry']
    print(f'Number of Granules: {len(granules)}') # Show the number of granules on each page
    
    for g in granules:
        data_urls.extend([x['href'] for x in g['links'] if 'https' in x['href'] and '.hdf' in x['href']])
end = time.time()
print(f'Total time: {end-start}') # Record the total time taken

Page: 1
Page 1 Response Code: 200
Number of Granules: 2000
Page: 2
Page 2 Response Code: 200
Number of Granules: 2000
Page: 3
Page 3 Response Code: 200
Number of Granules: 2000
Page: 4
Page 4 Response Code: 200
Number of Granules: 2000
Page: 5
Page 5 Response Code: 200
Number of Granules: 2000
Page: 6
Page 6 Response Code: 200
Number of Granules: 2000
Page: 7
Page 7 Response Code: 200
Number of Granules: 2000
Page: 8
Page 8 Response Code: 200
Number of Granules: 2000
Page: 9
Page 9 Response Code: 200
Number of Granules: 2000
Page: 10
Page 10 Response Code: 200
Number of Granules: 2000
Page: 11
Page 11 Response Code: 200
Number of Granules: 2000
Page: 12
Page 12 Response Code: 200
Number of Granules: 2000
Page: 13
Page 13 Response Code: 200
Number of Granules: 2000
Page: 14
Page 14 Response Code: 200
Number of Granules: 2000
Page: 15
Page 15 Response Code: 200
Number of Granules: 2000
Page: 16
Page 16 Response Code: 200
Number of Granules: 2000
Page: 17
Page 17 Response Code: 200
Number of Granules: 200

Show the total quantity of assets in our list matching search parameters.

In [121]:
len(data_urls)

36754

In [123]:
data_urls[-4:]

['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2025189.h19v08.061.2025190103844/MOD11A1.A2025189.h19v08.061.2025190103844.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2025189.h19v07.061.2025190103926/MOD11A1.A2025189.h19v07.061.2025190103926.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2025189.h18v08.061.2025190103851/MOD11A1.A2025189.h18v08.061.2025190103851.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2025189.h18v07.061.2025190103901/MOD11A1.A2025189.h18v07.061.2025190103901.hdf']

We can also confirm that the first 25 asset URLs match those from our earlier search of the first page only.

In [124]:
file_list[:25]==data_urls[:25]

True
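The page loop above can also be written so that it stops on its own when a page comes back short, rather than relying on a precomputed page count. This is a sketch under stated assumptions: `fetch_page` is a hypothetical callable standing in for the `requests.get(...).json()['feed']['entry']` call, and a fake in-memory version is used here so the example runs without network access.

```python
from typing import Callable, Iterator, List


def iter_granules(fetch_page: Callable[[int], List[dict]], page_size: int) -> Iterator[dict]:
    """Yield granules page by page until a short (or empty) page is returned."""
    page_num = 1
    while True:
        entries = fetch_page(page_num)
        yield from entries
        if len(entries) < page_size:
            break
        page_num += 1


# Fake data source: 5 granules served 2 per page (stand-in for CMR).
FAKE = [{"id": f"G{i}"} for i in range(5)]


def fake_fetch(page_num: int, page_size: int = 2) -> List[dict]:
    start = (page_num - 1) * page_size
    return FAKE[start:start + page_size]


ids = [g["id"] for g in iter_granules(fake_fetch, page_size=2)]
print(ids)  # ['G0', 'G1', 'G2', 'G3', 'G4']
```

When the hit count is an exact multiple of the page size, this approach makes one extra request that returns an empty page; the precomputed page list used above avoids that at the cost of an initial request for the hit count.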



## Improve speed using Asynchronous Requests

You may have noticed the total time the loop above took to run. For searches with a large quantity of results, we can query and build a list of asset URLs more quickly by utilizing asynchronous requests. Asynchronous requests can be run concurrently, which typically decreases the total time of operations because each request can be sent without waiting for the response to the previous one. This time we'll use a similar approach as before, except we will build a list of page URLs that can be used in asynchronous requests to populate our list of asset URLs more quickly.
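To make the timing difference concrete without touching the network, the sketch below simulates each request with `asyncio.sleep()`. The function names and the 0.2 second delay are illustrative assumptions, not measurements of CMR. It uses `asyncio.run()` so that it works as a standalone script; inside a notebook, where an event loop is already running, the coroutines would be awaited directly instead.

```python
import asyncio
import time


async def fake_request(page: int, delay: float = 0.2) -> int:
    """Stand-in for a network call: wait `delay` seconds, then return the page number."""
    await asyncio.sleep(delay)
    return page


async def run_sequential(pages):
    # Each request waits for the previous one to finish.
    return [await fake_request(p) for p in pages]


async def run_concurrent(pages):
    # All requests are started together and awaited as a group.
    return await asyncio.gather(*(fake_request(p) for p in pages))


def timed(coro_fn, pages):
    start = time.perf_counter()
    result = asyncio.run(coro_fn(pages))
    return result, time.perf_counter() - start


pages = [1, 2, 3, 4, 5]
seq_result, seq_time = timed(run_sequential, pages)
con_result, con_time = timed(run_concurrent, pages)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

The sequential version takes roughly the sum of the delays, while the concurrent version takes roughly the longest single delay. `asyncio.gather()` returns results in the order the coroutines were passed in, regardless of which finishes first.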

First we define a new function `get_cmr_pages_urls()` to create a list of results pages URLs, not just the page numbers like we did before, then build that list.


In [125]:
def get_cmr_pages_urls(collections, datetime_range, page_size): 
    response = requests.get(url,
                       params={
                           'concept_id': collections,
                           'temporal': datetime_range,
                           'bounding_box': bbox,
                           'page_size': page_size,
                       },
                       headers={
                           'Accept': 'application/json'
                       }
                      )
    hits = int(response.headers['CMR-Hits'])
    n_pages = math.ceil(hits/page_size)
    cmr_pages_urls = [f'{response.url}&page_num={x}'.replace('granules?', 'granules.json?') for x in list(range(1,n_pages+1))]
    return cmr_pages_urls

In [127]:
urls = get_cmr_pages_urls(collections, datetime_range, page_size)
urls

['https://cmr.earthdata.nasa.gov/search/granules.json?concept_id=C1748058432-LPCLOUD&temporal=2000-02-24T00%3A00%3A00Z%2C2025-07-08T00%3A00%3A00Z&bounding_box=3.82715%2C2.29382%2C14.36961%2C15.00814&page_size=2000&page_num=1',
 'https://cmr.earthdata.nasa.gov/search/granules.json?concept_id=C1748058432-LPCLOUD&temporal=2000-02-24T00%3A00%3A00Z%2C2025-07-08T00%3A00%3A00Z&bounding_box=3.82715%2C2.29382%2C14.36961%2C15.00814&page_size=2000&page_num=2',
 'https://cmr.earthdata.nasa.gov/search/granules.json?concept_id=C1748058432-LPCLOUD&temporal=2000-02-24T00%3A00%3A00Z%2C2025-07-08T00%3A00%3A00Z&bounding_box=3.82715%2C2.29382%2C14.36961%2C15.00814&page_size=2000&page_num=3',
 'https://cmr.earthdata.nasa.gov/search/granules.json?concept_id=C1748058432-LPCLOUD&temporal=2000-02-24T00%3A00%3A00Z%2C2025-07-08T00%3A00%3A00Z&bounding_box=3.82715%2C2.29382%2C14.36961%2C15.00814&page_size=2000&page_num=4',
 'https://cmr.earthdata.nasa.gov/search/granules.json?concept_id=C1748058432-LPCLOUD&tempora
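As an alternative to appending `&page_num=` to `response.url` and patching the path with `.replace()`, the page URLs can be assembled directly with the standard library. This is a sketch rather than the tutorial's method; `build_page_urls` is a hypothetical helper, and the page count of 19 is taken from the results above.

```python
from urllib.parse import urlencode


def build_page_urls(base_url: str, params: dict, n_pages: int) -> list:
    """Return one fully encoded search URL per results page."""
    urls = []
    for page_num in range(1, n_pages + 1):
        # doseq=True expands list values (e.g. several concept IDs) into repeated keys
        query = urlencode({**params, "page_num": page_num}, doseq=True)
        urls.append(f"{base_url}?{query}")
    return urls


params = {
    "concept_id": ["C1748058432-LPCLOUD"],
    "temporal": "2000-02-24T00:00:00Z,2025-07-08T00:00:00Z",
    "bounding_box": "3.82715,2.29382,14.36961,15.00814",
    "page_size": 2000,
}
page_urls = build_page_urls("https://cmr.earthdata.nasa.gov/search/granules.json", params, 19)
print(len(page_urls))
print(page_urls[0])
```

Building the URLs this way avoids the extra request that is otherwise needed just to read back `response.url`, although a request is still required to learn the hit count.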

Then we define a function `get_tasks()` to build a list of request tasks, one for each page URL, so that the requests for each page can be made in parallel with one another.

In [128]:
def get_tasks(session):
    tasks = []
    for l in urls:
        tasks.append(session.get(l))
    return tasks

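Next we define the coroutines `fetch_granules()` and `fetch_all_granules()` and run them to submit asynchronous/parallel requests for each page of results. Each page returns a dictionary holding the `.hdf` links and the start dates of its granules.

This is much faster than before, and the first asset URL matches what we retrieved with the sequential approach.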

In [None]:
import aiohttp
import asyncio
import time

# --- Define Async Functions ---
async def fetch_granules(session, url, params):
    try:
        async with session.get(url, params=params, headers={'Accept': 'application/json'}) as response:
            print(f"Page {params['page_num']} Response Code: {response.status}")
            if response.status == 200:
                json_data = await response.json()
                granules = json_data['feed']['entry']
                print(f"Number of Granules: {len(granules)}")


                # Collect HTTPS links to .hdf assets for every granule on this page
                links = [
                    link['href'] for g in granules for link in g['links']
                    if 'https' in link['href'] and '.hdf' in link['href']
                ]

                # Start time of each granule on this page
                date = [g['time_start'] for g in granules if 'time_start' in g]

                return {'links': links, 'date': date}
            
            else:
                return []
    except Exception as e:
        print(f"Error on page {params['page_num']}: {e}")
        return []

async def fetch_all_granules(url, collections, datetime_range, page_size, page_numbers):
    timeout = aiohttp.ClientTimeout(total=600)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [
            fetch_granules(session, url, {
                'concept_id': collections,
                'temporal': datetime_range,
                'bounding_box': bbox,
                'page_size': page_size,
                'page_num': n
            }) for n in page_numbers
        ]
        results = await asyncio.gather(*tasks)
        return results

# --- Use in Notebook Cell ---
# Example usage in a notebook cell:
start = time.time()

# Make sure url, collections, datetime_range, page_size, and page_numbers are already defined
data_urls = await fetch_all_granules(url, collections, datetime_range, page_size, page_numbers)

end = time.time()
print(f"Total time: {end - start:.2f} seconds")
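# data_urls holds one result per page, so count the links inside each page
print(f"Total .hdf URLs: {sum(len(page['links']) for page in data_urls if page)}")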


Page 19 Response Code: 200
Page 2 Response Code: 200
Number of Granules: 754
Page 1 Response Code: 200
Page 17 Response Code: 200
Number of Granules: 2000
Page 4 Response Code: 200
Page 3 Response Code: 200
Number of Granules: 2000
Page 6 Response Code: 200
Number of Granules: 2000
Number of Granules: 2000
Number of Granules: 2000
Number of Granules: 2000
Page 7 Response Code: 200
Page 8 Response Code: 200
Page 13 Response Code: 200
Page 10 Response Code: 200
Page 5 Response Code: 200
Number of Granules: 2000
Page 9 Response Code: 200
Page 14 Response Code: 200
Number of Granules: 2000
Number of Granules: 2000
Number of Granules: 2000
Page 11 Response Code: 200
Page 12 Response Code: 200
Number of Granules: 2000
Number of Granules: 2000
Number of Granules: 2000
Page 15 Response Code: 200
Number of Granules: 2000
Number of Granules: 2000
Number of Granules: 2000
Page 16 Response Code: 200
Page 18 Response Code: 200
Number of Granules: 2000
Number of Granules: 2000
Total time: 5.26 secon

In [162]:
len(data_urls)

19
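The length here is 19 because each element of `data_urls` is one page of results, not a single URL. To get one flat list, the per-page dictionaries can be merged. A minimal sketch, assuming a hypothetical helper `flatten_results` and hand-made sample pages shaped like the values returned by `fetch_all_granules()`:

```python
def flatten_results(pages):
    """Merge per-page results into flat lists of links and dates.

    Pages that failed (returned as an empty list) are skipped.
    """
    links, dates = [], []
    for page in pages:
        if not page:
            continue
        links.extend(page["links"])
        dates.extend(page["date"])
    return links, dates


# Hypothetical sample shaped like the list returned by fetch_all_granules().
sample_pages = [
    {"links": ["https://example.com/a.hdf", "https://example.com/b.hdf"],
     "date": ["2000-02-24T00:00:00.000Z", "2000-02-24T00:00:00.000Z"]},
    [],  # a page that failed
    {"links": ["https://example.com/c.hdf"], "date": ["2000-02-25T00:00:00.000Z"]},
]
all_links, all_dates = flatten_results(sample_pages)
print(len(all_links), len(all_dates))  # 3 3
```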

In [172]:
# Show the first URL and start date from the first page of results
print(data_urls[0]['links'][0])
print(data_urls[0]['date'][0])

https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h18v07.061.2020043120835/MOD11A1.A2000055.h18v07.061.2020043120835.hdf
2000-02-24T00:00:00.000Z
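`asyncio.gather()` starts every page request at once. For queries spanning many pages, it may be worth capping how many requests are in flight at the same time. The sketch below shows one common pattern using `asyncio.Semaphore`; it is an optional addition rather than something the code above does, the limit of 5 is an arbitrary assumption, and the fetch is simulated so the example runs offline. As a standalone script it uses `asyncio.run()`; in a notebook the call would be awaited instead.

```python
import asyncio


async def bounded_gather(coro_fns, limit: int = 5):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(fn):
        async with semaphore:
            return await fn()

    return await asyncio.gather(*(run_one(fn) for fn in coro_fns))


# Simulated page fetches that record how many are running at once.
in_flight = 0
peak = 0


async def fake_fetch(page: int) -> int:
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.05)
    in_flight -= 1
    return page


def make_job(page):
    # Bind the page number now so each job fetches its own page
    return lambda: fake_fetch(page)


results = asyncio.run(bounded_gather([make_job(p) for p in range(1, 20)], limit=5))
print(results == list(range(1, 20)), peak)
```

With real requests, `fake_fetch` would be replaced by a coroutine such as `fetch_granules()` bound to a session and page parameters.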


In [None]:
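# Flatten the per-page results into a single list of URLs, then download the first file
all_data_urls = [link for page in data_urls if page for link in page['links']]
earthaccess.download(all_data_urls[0], local_path='.')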

In [1]:
# analyse .hdf file to check if correct files are downloaded
import rioxarray
import matplotlib.pyplot as plt
import numpy as np

# define the path to the downloaded file
hdf_file = "/Users/lshms102/Documents/MPCP_lassa_sentinel/MOD11A1.A2000055.h12v10.061.2020043121102.hdf"

try:
    data_layer_str = f'HDF4_EOS:EOS_GRID:"{hdf_file}":MODIS_Grid_Daily_1km_LST:LST_Day_1km'

    # Open the specific subdataset using rioxarray
    rds = rioxarray.open_rasterio(data_layer_str)

except Exception as e:
    print(f"Error opening file: {e}")
    print("\nPlease ensure the HDF file is in the same directory as the script or provide the full path.")
    print("Also, ensure your GDAL installation includes the HDF4 driver, which is needed to read this file.")
    raise  # re-raise instead of exit(), which would shut down the notebook kernel

# The data is loaded as an xarray.DataArray. Let's inspect it.
print("--- DataArray Structure ---")
print(rds)
print("\n--- Coordinate Reference System (CRS) ---")
print(rds.rio.crs)



--- DataArray Structure ---
<xarray.DataArray (band: 1, y: 1200, x: 1200)> Size: 3MB
[1440000 values with dtype=uint16]
Coordinates:
  * band         (band) int64 8B 1
  * x            (x) float64 10kB -6.671e+06 -6.67e+06 ... -5.561e+06 -5.56e+06
  * y            (y) float64 10kB -1.112e+06 -1.113e+06 ... -2.223e+06
    spatial_ref  int64 8B 0
Attributes: (12/85)
    add_offset_err:                     0
    ALGORITHMPACKAGEACCEPTANCEDATE:     102004
    ALGORITHMPACKAGEMATURITYCODE:       Normal
    ALGORITHMPACKAGENAME:               MOD_PR11A
    ALGORITHMPACKAGEVERSION:            6
    ASSOCIATEDINSTRUMENTSHORTNAME.1:    MODIS
    ...                                 ...
    valid_range:                        7500, 65535
    VERSIONID:                          61
    VERTICALTILENUMBER:                 10
    WESTBOUNDINGCOORDINATE:             -63.855103293672
    _FillValue:                         0
    add_offset:                         0.0

--- Coordinate Reference System (
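The attributes printed above show that `LST_Day_1km` is stored as unsigned 16-bit integers with `_FillValue` 0 and `add_offset` 0.0, so the raw values need scaling before they can be read as temperatures. The sketch below assumes a scale factor of 0.02, the value given in the MOD11A1 product documentation; it is not visible in the truncated attribute list above, so check `rds.attrs` before relying on it. The helper name `raw_to_celsius` is hypothetical.

```python
from typing import Optional

SCALE_FACTOR = 0.02   # assumed from the MOD11A1 product documentation
ADD_OFFSET = 0.0      # shown in the attributes above
FILL_VALUE = 0        # shown in the attributes above


def raw_to_celsius(raw: int) -> Optional[float]:
    """Convert one raw LST value to degrees Celsius, or None for fill pixels."""
    if raw == FILL_VALUE:
        return None
    kelvin = raw * SCALE_FACTOR + ADD_OFFSET
    return kelvin - 273.15


print(raw_to_celsius(15000))  # 15000 * 0.02 = 300 K, about 26.85 degrees Celsius
print(raw_to_celsius(0))      # None (fill value)
```

The same arithmetic can be applied to the whole array at once, for example `rds.where(rds != 0) * 0.02 - 273.15`, which masks fill pixels as NaN before scaling.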