# Using the CMR API and asyncio for fast CMR Queries  

---

## Summary  

This tutorial demonstrates how to effectively perform queries and extract data download Uniform Resource Locators (URLs) for every Common Metadata Repository (CMR) metadata record within a NASA Earthdata collection. Two examples are shown. The first makes sequential requests for the data URLs associated with specified collections. The second demonstrates how to leverage Python's `asyncio` package to perform bulk parallel requests for the same information and highlights the resulting increase in speed. The NASA Earthdata collections highlighted here are Harmonized Landsat Sentinel-2 Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m ([HLSL30.002](https://doi.org/10.5067/HLS/HLSL30.002)) and Harmonized Landsat Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m ([HLSS30.002](https://doi.org/10.5067/HLS/HLSS30.002)).  

### What is CMR?  

The CMR is a metadata system that catalogs NASA's Earth Observing System Data and Information System (EOSDIS) data and associated metadata. The CMR Application Programming Interface (API) provides programmatic search capabilities through CMR's vast metadata holdings using various parameters and keywords. When querying NASA's CMR, a search is limited to 1 million matched granules, with at most 2000 granules returned per page. This guide shows how to search for CMR records using the CMR API and create a list of download URLs. It also shows how to leverage asynchronous, or parallel, requests to increase the speed of this process. The example below leverages the Harmonized Landsat Sentinel-2 collection archived by NASA's LP DAAC to demonstrate how to use Python's `asyncio` to perform large queries against NASA's CMR.  

## Objectives  

+ Use the CMR API and Python to perform large queries (requests that return more than 2000 granules) against NASA's CMR.  
+ Prepare a list of URLs to access or download assets associated with those granules.  
+ Utilize asynchronous/parallel requests to increase speed of query and list construction.  

---

## Getting Started  

Import the required packages.


In [49]:
import requests
import math
import aiohttp
import asyncio
import time
import earthaccess

auth = earthaccess.login(persist=True)

## Searching the CMR

Set the CMR API Endpoint. This is the URL that we'll use to search through the CMR.

In [2]:
CMR_OPS = 'https://cmr.earthdata.nasa.gov/search' # CMR API Endpoint
url = f'{CMR_OPS}/{"granules"}'

To search the CMR we need to set our parameters. In this example we'll narrow our search using Collection IDs, a range of dates and times, and the number of results we want to show per page. Spatial areas can also be used to narrow searches (example shown in [HLS_Tutorial](https://git.earthdata.nasa.gov/projects/LPDUR/repos/hls-tutorial/browse/HLS_Tutorial.ipynb)). 

Here, we search a single collection over the date range February 24, 2000 to July 8, 2025. Specify the `collections` to search, set a `datetime_range`, and set the number of results to return per page using the `page_size` parameter as shown below.  

In [3]:
collections = ['C1748058432-LPCLOUD'] # concept_id of the LP DAAC collection to search (the URLs returned below belong to MOD11A1.061)
datetime_range = '2000-02-24T00:00:00Z,2025-07-08T00:00:00Z'
page_size = 2000

A CMR search can find up to 1 million items or granules, but the number returned per page is limited to 2000, meaning large searches may have several pages of results. By default, `page_size` is set to 10.
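To make the search parameters more concrete, the sketch below builds the query string that a request with these parameters would roughly correspond to, without contacting the CMR. It uses only the standard library; the variable names `query` and `full_url` are ours, and the exact encoding produced by `requests` may differ in small details.

```python
from urllib.parse import urlencode

CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'  # CMR API endpoint from above
url = f'{CMR_OPS}/granules'

params = {
    'concept_id': ['C1748058432-LPCLOUD'],
    'temporal': '2000-02-24T00:00:00Z,2025-07-08T00:00:00Z',
    'page_size': 2000,
}

# doseq=True repeats the key once per list item, so a list of
# concept IDs becomes concept_id=...&concept_id=...
query = urlencode(params, doseq=True)
full_url = f'{url}?{query}'
print(full_url)
```

Printing the URL this way is a quick check that each parameter is spelled and formatted as intended before any request is sent.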

## Submitting Requests

Using the above search criteria we can make a request using the `requests.get()` function. Submit a request and print the `response.status_code`.


In [4]:
response = requests.get(url, 
                        params={
                            'concept_id': collections,
                            'temporal': datetime_range,
                            'page_size': page_size,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
print(response.status_code)

200


A status code of 200 indicates the request has succeeded. 

To see the number of results, print the `CMR-Hits` found in the returned header.

In [5]:
print(response.headers['CMR-Hits']) # Resulting quantity of granules/items.

2906771
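The hit count determines how many pages of results there are. The sketch below works through that arithmetic for the count printed above. The helper names `page_count` and `reachable_pages` are hypothetical, and `reachable_pages` simply applies the 1 million granule limit described earlier as an assumption; it does not show how the CMR enforces that limit.

```python
import math

def page_count(hits: int, page_size: int) -> int:
    """Number of result pages needed to cover `hits` granules."""
    return math.ceil(hits / page_size)

def reachable_pages(hits: int, page_size: int, max_results: int = 1_000_000) -> int:
    """Pages that fall within an assumed cap of `max_results` granules."""
    return page_count(min(hits, max_results), page_size)

print(page_count(2_906_771, 2000))       # 1454
print(reachable_pages(2_906_771, 2000))  # 500
print(page_count(25, 10))                # 3
```

With 2000 granules per page, a query this large spans far more pages than the first page we have looked at so far.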


## Building a List of File URLs

We can build a list of URLs to data assets using our search results. Notice this only uses the first page of results.

In [6]:
granules = response.json()['feed']['entry']
len(granules) # Resulting quantity of granules on page one.

2000

In [7]:
file_list = []
for g in granules:
    file_list.extend([x['href'] for x in g['links'] if 'https' in x['href'] and '.hdf' in x['href']])
len(file_list) # Total number of assets from page one of granules.

2000

Print part of the URLs list. 

In [8]:
file_list[:25]

['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h12v10.061.2020043121102/MOD11A1.A2000055.h12v10.061.2020043121102.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h08v05.061.2020043120932/MOD11A1.A2000055.h08v05.061.2020043120932.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h33v08.061.2020043121151/MOD11A1.A2000055.h33v08.061.2020043121151.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h20v13.061.2020043121116/MOD11A1.A2000055.h20v13.061.2020043121116.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h23v16.061.2020043121123/MOD11A1.A2000055.h23v16.061.2020043121123.hdf',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/MOD11A1.061/MOD11A1.A2000055.h08v03.061.2020043121115/MOD11A1.A2000055.h08v03.061.2020043121115.hdf',
 ...]

This process can be extended to all pages of search results to build a complete list of asset URLs. 
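Since the same link filter is applied to every page, it can help to factor it into a small helper. The sketch below is one way to do that; `extract_data_urls` is a hypothetical name, and the sample granule entries are made up for illustration rather than taken from a CMR response. It uses `startswith` and `endswith`, which is slightly stricter than the substring checks used above.

```python
def extract_data_urls(granules, suffix='.hdf'):
    """Collect https links ending in `suffix` from a list of granule entries."""
    urls = []
    for g in granules:
        for link in g.get('links', []):   # tolerate entries without links
            href = link.get('href', '')
            if href.startswith('https') and href.endswith(suffix):
                urls.append(href)
    return urls

# Made-up entries that mimic the shape of response.json()['feed']['entry']
sample_granules = [
    {'links': [
        {'href': 'https://example.com/data/granule_a.hdf'},
        {'href': 's3://example-bucket/data/granule_a.hdf'},
        {'href': 'https://example.com/browse/granule_a.jpg'},
    ]},
    {'links': [{'href': 'https://example.com/data/granule_b.hdf'}]},
    {},  # entry with no links at all
]

sample_urls = extract_data_urls(sample_granules)
print(sample_urls)
```

Only the two `https` links ending in `.hdf` are kept; the `s3` link and the browse image are skipped.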

## Creating a List from Multiple Results Pages

To create a list from multiple results pages, we first define a function to build a list of pages based upon the number of results.

In [9]:
def get_page_total(collections, datetime_range, page_size):
    hits = requests.get(url, 
                        params={
                            'concept_id': collections,
                            'temporal': datetime_range,
                            'page_size': page_size,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       ).headers['CMR-Hits']
    return math.ceil(int(hits)/page_size)

Then we build a list of pages called `page_numbers`.

In [10]:
page_numbers = list(range(1, get_page_total(collections, datetime_range, page_size)+1))
page_numbers

[1, 2, 3, ..., 185 ... (output truncated)

After we have a list of pages we can iterate through page by page to make a complete list of assets matching our search.

In [27]:
data_urls = [] # empty list
start = time.time() # Begin timer
for n in page_numbers: # Iterate through requests page by page sequentially
    print(f'Page: {n}') # Print Page Number
    response = requests.get(url, # Same request function as used previously
                            params={
                                'concept_id': collections,
                                'temporal': datetime_range,
                                'page_size': page_size,
                                'page_num': n
                            },
                            headers={
                                'Accept': 'application/json'
                            }
                           )
    print(f'Page {n} Response Code: {response.status_code}') # Show the response code for each page
    
    granules = response.json()['feed']['entry']
    print(f'Number of Granules: {len(granules)}') # Show the number of granules on each page
    
    for g in granules:
        data_urls.extend([x['href'] for x in g['links'] if 'https' in x['href'] and '.hdf' in x['href']])
end = time.time()
print(f'Total time: {end-start}') # Report the total time taken

Page: 1
Page 1 Response Code: 200
Number of Granules: 2000
Page: 2
Page 2 Response Code: 200
Number of Granules: 2000
Page: 3
Page 3 Response Code: 200
Number of Granules: 2000
...
Page: 40
Page 40 Response Code: 200
...

KeyboardInterrupt: 

Show the total quantity of assets in our list. Because the run above was interrupted, this count covers only the pages retrieved before the interrupt.

In [28]:
len(data_urls)

158000

We can also see that the first 25 assets match the results from our first-page-only search.

In [29]:
file_list[:25]==data_urls[:25]

True



## Improve speed using Asynchronous Requests

You may have noticed the total time the function above took to run. For searches with a large quantity of results, we can query and build a list of asset URLs more quickly by utilizing asynchronous requests. Asynchronous requests can be run concurrently or in parallel, which typically decreases the total time of operations because a response is not needed for the prior request before a subsequent request is made. This time we'll use a similar approach as before, except we will build a list of page URLs that can be used in asynchronous requests to populate our list of asset URLs more quickly.
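To see why this helps, here is a minimal sketch that stands in for real HTTP requests with `asyncio.sleep`, so it runs without contacting the CMR. The names `fake_request`, `run_sequential`, `run_concurrent`, and `compare` are hypothetical and are not used elsewhere in this tutorial.

```python
import asyncio
import time

async def fake_request(page_num, delay=0.1):
    """Stand-in for one page request: wait `delay` seconds, return the page number."""
    await asyncio.sleep(delay)
    return page_num

async def run_sequential(pages):
    # Each request starts only after the previous one has finished
    return [await fake_request(n) for n in pages]

async def run_concurrent(pages):
    # All requests are started together and awaited as a group
    return list(await asyncio.gather(*(fake_request(n) for n in pages)))

async def compare(pages):
    t0 = time.perf_counter()
    seq = await run_sequential(pages)
    t1 = time.perf_counter()
    con = await run_concurrent(pages)
    t2 = time.perf_counter()
    return seq, con, t1 - t0, t2 - t1

seq, con, seq_time, con_time = asyncio.run(compare(range(1, 6)))
print(f'sequential: {seq_time:.2f} s, concurrent: {con_time:.2f} s')
```

Both versions return the pages in the same order, but the concurrent version waits roughly as long as the slowest single request instead of the sum of all of them. In a notebook, where an event loop is already running, use `await compare(range(1, 6))` instead of `asyncio.run(...)`.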

First we define a new function `get_cmr_pages_urls()` to create a list of results page URLs, not just the page numbers like we did before, then build that list.


In [14]:
def get_cmr_pages_urls(collections, datetime_range, page_size): 
    response = requests.get(url,
                       params={
                           'concept_id': collections,
                           'temporal': datetime_range,
                           'page_size': page_size,
                       },
                       headers={
                           'Accept': 'application/json'
                       }
                      )
    hits = int(response.headers['CMR-Hits'])
    n_pages = math.ceil(hits/page_size)
    cmr_pages_urls = [f'{response.url}&page_num={x}'.replace('granules?', 'granules.json?') for x in list(range(1,n_pages+1))]
    return cmr_pages_urls

In [None]:
urls = get_cmr_pages_urls(collections, datetime_range, page_size)
len(urls)

1454

Next, we create an empty list to populate with our asset URLs.

In [16]:
results = []

Then we define a function `get_tasks()` to build a list of tasks for each page number URL and a function `get_url()` to make the requests for each page in parallel with one another.

In [24]:
def get_tasks(session):
    tasks = []
    for l in urls:
        tasks.append(session.get(l))
    return tasks

In [25]:
async def get_url():
    # Set a longer timeout (e.g., 10 minutes)
    timeout = aiohttp.ClientTimeout(total=600)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = get_tasks(session)
        responses = await asyncio.gather(*tasks)
        for response in responses:
            if response.status == 200:
                res = await response.json()
                results.extend([l['href'] for g in res['feed']['entry'] for l in g['links'] if 'https' in l['href'] and '.hdf' in l['href']])
            else:
                print(f"Error: Received status {response.status} for {response.url}")


Run the functions to submit asynchronous/parallel requests for each page of results.

In [26]:
start = time.time() 

await get_url()

end = time.time()

total_time = end - start
total_time

TimeoutError: 

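The run shown above ended in a `TimeoutError`, most likely because all 1454 page requests were launched at once and the batch did not finish within the 600-second session timeout. When the requests do complete, this approach is typically much faster than the sequential loop, and we can check that the quantity of results and a subsample of the asset URLs match what we retrieved before.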

In [None]:
len(results)

In [None]:
data_urls[2025:2125] == results[2025:2125]
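Launching one request per page all at once, as `get_url()` does, can run into timeouts like the one shown above when there are many pages. One common way to reduce that risk is to cap how many requests are in flight with an `asyncio.Semaphore`. The sketch below simulates requests with `asyncio.sleep`; `fake_fetch`, `fetch_all`, and the limit of 5 are hypothetical choices for illustration, not values recommended by the CMR.

```python
import asyncio

in_flight = 0  # number of simulated requests currently running
peak = 0       # highest value in_flight reached

async def fake_fetch(page_num):
    """Stand-in for one page request that tracks how many run at once."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return page_num

async def fetch_all(page_numbers, limit=5):
    sem = asyncio.Semaphore(limit)  # at most `limit` requests at a time

    async def guarded(n):
        async with sem:             # wait here until a slot is free
            return await fake_fetch(n)

    return await asyncio.gather(*(guarded(n) for n in page_numbers))

pages = asyncio.run(fetch_all(range(1, 21), limit=5))
print(pages == list(range(1, 21)), peak)
```

The results still come back in page order, while `peak` never exceeds the limit. In a notebook, replace `asyncio.run(...)` with `await fetch_all(...)`.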

In [33]:
async def fetch_page(session, url, collections, datetime_range, page_size, page_num):
    """
    Asynchronously fetches a single page of data and extracts granule URLs.
    
    Args:
        session (aiohttp.ClientSession): The aiohttp session to use for the request.
        url (str): The base URL for the API endpoint.
        collections (str): The concept ID for the collection.
        datetime_range (str): The temporal range for the data.
        page_size (int): The number of results per page.
        page_num (int): The page number to fetch.
        
    Returns:
        list: A list of data URLs (.hdf) found on the page, or an empty list if an error occurs.
    """
    params = {
        'concept_id': collections,
        'temporal': datetime_range,
        'page_size': page_size,
        'page_num': page_num
    }
    headers = {
        'Accept': 'application/json'
    }
    
    print(f"Requesting Page: {page_num}")
    try:
        # 'async with' releases the connection when the block exits, even on error
        async with session.get(url, params=params, headers=headers) as response:
            # Raise an exception for bad status codes (4xx or 5xx)
            response.raise_for_status()
            print(f"Page {page_num} Response Code: {response.status}")
            
            # The 'await' keyword pauses the function until the JSON data is received
            data = await response.json()
            
            granules = data.get('feed', {}).get('entry', [])
            print(f"Page {page_num}: Found {len(granules)} Granules")
            
            page_data_urls = []
            for g in granules:
                # Extract links that contain 'https' and end with '.hdf'
                page_data_urls.extend([
                    x['href'] for x in g.get('links', []) 
                    if 'https' in x.get('href', '') and x.get('href', '').endswith('.hdf')
                ])
            return page_data_urls
            
    except (aiohttp.ClientError, asyncio.TimeoutError) as e:
        print(f"Error fetching page {page_num}: {e}")
        return [] # Return an empty list on error

async def main():
    """
    Main asynchronous function to orchestrate the fetching of all pages.
    """
    # --- Configuration (replace with your actual values) ---
    CMR_OPS = 'https://cmr.earthdata.nasa.gov/search' # CMR API Endpoint
    url = f'{CMR_OPS}/{"granules"}'
    
    collections = ['C1748058432-LPCLOUD'] # concept_id of the LP DAAC collection to search
    datetime_range = '2000-02-24T00:00:00Z,2025-07-08T00:00:00Z'
    page_size = 2000
    
    # --- Execution ---
    start_time = time.time()
    all_data_urls = []
    
    # Use a single session for all requests for efficiency
    async with aiohttp.ClientSession() as session:
        
        # 1. Make an initial request to get the total number of hits
        print("--- Determining total number of pages ---")
        initial_params = {
            'concept_id': collections,
            'temporal': datetime_range,
            'page_size': 1 # We only need the headers, so request a small page size
        }
        async with session.get(url, params=initial_params) as response:
            response.raise_for_status()
            hits = int(response.headers.get('CMR-Hits', 0))
            if hits == 0:
                print("No granules found for the given criteria.")
                return

            n_pages = math.ceil(hits / page_size)
            print(f"Found {hits} total granules across {n_pages} pages.")

        # 2. Create and run tasks for all pages concurrently
        print("\n--- Fetching all pages concurrently ---")
        page_numbers = range(1, n_pages + 1)
        tasks = [
            fetch_page(session, url, collections, datetime_range, page_size, n)
            for n in page_numbers
        ]
        
        # asyncio.gather runs all tasks concurrently and collects their results
        results_from_pages = await asyncio.gather(*tasks)

        # 3. Flatten the list of lists into a single list of URLs
        all_data_urls = [u for sublist in results_from_pages for u in sublist]

    end_time = time.time()
    
    print("\n--- Summary ---")
    print(f"Total Granule URLs found: {len(all_data_urls)}")
    print(f"Total time taken: {end_time - start_time:.2f} seconds")

    # Return only after the summary has printed
    return all_data_urls

In [34]:
all_data_urls = await main()

--- Determining total number of pages ---
Found 2906771 total granules across 1454 pages.

--- Fetching all pages concurrently ---
Requesting Page: 1
Requesting Page: 2
Requesting Page: 3
...
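Note that `fetch_page()` returns an empty list when a request fails, so a failed page silently drops out of the final list. One option is to retry a failed call a few times before giving up. The sketch below shows the idea with a simulated flaky call; `with_retries`, `flaky_fetch`, and the delay values are hypothetical and would need tuning for real requests.

```python
import asyncio

async def with_retries(make_call, attempts=3, base_delay=0.01):
    """Await make_call(), retrying with a growing delay if it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            # Wait longer after each failure: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))

calls = {'count': 0}

async def flaky_fetch():
    """Simulated request that fails twice, then succeeds."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated failure')
    return ['https://example.com/data/granule_a.hdf']

result = asyncio.run(with_retries(flaky_fetch))
print(result, calls['count'])
```

In a real run, `make_call` could be something like `lambda: fetch_page(session, url, collections, datetime_range, page_size, n)`, provided `fetch_page` is changed to raise on failure instead of returning an empty list.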

In [1]:
print(all_data_urls[0])

In [52]:
earthaccess.download(all_data_urls[-1], local_path='.',)

QUEUEING TASKS | : 100%|██████████| 1/1 [00:00<00:00, 1574.44it/s]
PROCESSING TASKS | : 100%|██████████| 1/1 [00:02<00:00,  2.91s/it]
COLLECTING RESULTS | : 100%|██████████| 1/1 [00:00<00:00, 9986.44it/s]


['MOD11A1.A2008352.h33v07.061.2021123004920.hdf']

In [None]:
# analyse .hdf file to check if correct files are downloaded
import rioxarray
import matplotlib.pyplot as plt
import numpy as np

# define the path to the downloaded file
hdf_file = "/Users/lshms102/Documents/MPCP_lassa_sentinel/MOD11A1.A2000055.h12v10.061.2020043121102.hdf"

# GDAL subdataset name that selects the daytime LST layer inside the HDF4-EOS file
data_layer_str = f'HDF4_EOS:EOS_GRID:"{hdf_file}":MODIS_Grid_Daily_1km_LST:LST_Day_1km'

try:
    # Open the specific subdataset (not the bare file path) using rioxarray
    rds = rioxarray.open_rasterio(data_layer_str)
except Exception as e:
    print(f"Error opening file: {e}")
    print("\nPlease check that the path to the HDF file is correct.")
    print("Also check that your GDAL installation includes the HDF4 driver.")
    raise  # Stop here; the rest of the cell needs `rds`

# The data is loaded as an xarray.DataArray. Let's inspect it.
print("--- DataArray Structure ---")
print(rds)
print("\n--- Coordinate Reference System (CRS) ---")
print(rds.rio.crs)

# Remove the singleton 'band' dimension to make it a 2D array
data = rds.squeeze()

# The raw data is stored as integers and must be converted to scientific units.
# First, handle the NoData values, which rioxarray often reads automatically.
# We replace them with NaN (Not a Number) for accurate calculations.
data = data.where(data != rds.rio.nodata)

# According to the MOD11A1 product user guide, we must apply a scale factor.
# The scale factor for LST is 0.02, which converts the digital numbers to Kelvin.
scale_factor = 0.02
data_kelvin = data * scale_factor

# For easier interpretation, convert the temperature from Kelvin to Celsius
data_celsius = data_kelvin - 273.15

print("\n--- Basic Statistics (Temperature in Celsius) ---")
print(f"Min: {np.nanmin(data_celsius.values):.2f}°C")
print(f"Max: {np.nanmax(data_celsius.values):.2f}°C")
print(f"Mean: {np.nanmean(data_celsius.values):.2f}°C")

# --- Visualize the data ---
print("\nGenerating plot...")
plt.figure(figsize=(10, 8))

# Use imshow to create a plot of the 2D data array.
# The 'cmap' argument sets the color scheme. 'viridis' is a good choice for sequential data.
im = plt.imshow(data_celsius, cmap='viridis')

plt.title('MODIS Land Surface Temperature (LST_Day_1km)', fontsize=16)
plt.xlabel('Pixel Column')
plt.ylabel('Pixel Row')

# Add a colorbar to show the mapping of colors to temperature values.
cbar = plt.colorbar(im, fraction=0.046, pad=0.04)
cbar.set_label('Temperature (°C)', fontsize=12)

plt.show()
print("Plot displayed.")
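The unit conversion in the cell above can be checked by hand on a few values. The sketch below applies the same 0.02 scale factor and Kelvin-to-Celsius offset to single digital numbers in plain Python. The function name `lst_to_celsius` is hypothetical, and treating 0 as the NoData value is an assumption made for this example; check the product user guide for the actual fill value.

```python
def lst_to_celsius(dn, scale_factor=0.02, nodata=0):
    """Convert a raw LST digital number to degrees Celsius (None for NoData)."""
    if dn == nodata:          # assumed fill value, see the note above
        return None
    kelvin = dn * scale_factor
    return kelvin - 273.15

for dn in (0, 14000, 15000):
    print(dn, lst_to_celsius(dn))
```

A digital number of 15000 corresponds to 300 K, or about 26.85 °C, which is a useful sanity check on the statistics printed for a full tile.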

In [1]:
import os 
os.getcwd()

'/Users/lshms102/Documents/MPCP_lassa_sentinel'