# Accessing Data from NASA's CMR in R

Authors: Harshini Girish (UAH), Sheyenne Kirkland (UAH), Alex Mandel (DevSeed), Henry Rodman (DevSeed), Zac Deziel (DevSeed)

Date: March 24, 2025

Description: This notebook serves as a follow-up to ["Searching for Data in NASA's CMR in R"](https://docs.maap-project.org/en/develop/technical_tutorials/working_with_r/cmr_search_in_r.html). In this guide, users will learn how to:
- Access data from a NASA Distributed Active Archive Center (DAAC) directly.
- Use `paws` to download data from a NASA DAAC locally.

## Additional Resources
- [Working with R in MAAP](https://docs.maap-project.org/en/develop/technical_tutorials/working_with_r.html)  
  - Current R Documentation within the MAAP Docs.
- [NASA's Operational CMR (MAAP Docs)](https://docs.maap-project.org/en/latest/technical_tutorials/search/catalog.html#nasa-s-operational-cmr)  
  - A section in the MAAP Docs offering an overview of resources to search and access NASA's CMR.
- [`ncdf4` Reference Manual](https://cran.r-project.org/web/packages/ncdf4/ncdf4.pdf)
  - Documentation for reading and writing netCDF files using the `ncdf4` package.
- [GDAL Raster Drivers](https://gdal.org/en/latest/drivers/raster/index.html)
  - A list of drivers for raster data.
- [`paws` Reference Manual](https://cran.r-project.org/web/packages/paws/paws.pdf)
  - Documentation for using the `paws` package.

## Run This Notebook
To access and run this tutorial within MAAP’s Algorithm Development Environment (ADE), please refer to the [“Getting started with the MAAP”](https://docs.maap-project.org/en/latest/getting_started/getting_started.html) section of our documentation.

Disclaimer: it is highly recommended to run a tutorial within MAAP’s ADE, which already includes packages specific to MAAP. Running the tutorial outside of the MAAP ADE may lead to errors. Users should work within an "R/Python" workspace.

## Install and Load Required Libraries
Let's load the packages needed for this notebook.

In [89]:
library("reticulate")       
library("paws")
library("ncdf4")

Additionally, we'll invoke the `MAAP` constructor. This will allow us to use the python-based `maapy-py` library from R.

In [90]:
maap_py <- import("maap.maap")
maap <- maap_py$MAAP()

 ## Searching for Data

In the example below, we'll demonstrate searching and accessing data from ORNL DAAC. We'll search for a GEDI L4B dataset, extract the associated links to access the data, and then open a file.

In [91]:
# Search for a dataset in NASA's CMR
gedi_collection <- maap$searchCollection(
  short_name = "GEDI_L4B_Gridded_Biomass_V2_1_2299",  
  cmr_host = "cmr.earthdata.nasa.gov",
  cloud_hosted = "true"
)

# Extract the collection’s concept ID
collection_id <- gedi_collection[[1]]["concept-id"]
print(paste("Collection ID:", collection_id))

# Retrieve granules (up to 5 granules)
gedi_granules <- maap$searchGranule(
  concept_id = collection_id,
  limit = as.integer(5),
  cmr_host = "cmr.earthdata.nasa.gov"
)

[1] "Collection ID: C2792577683-ORNL_CLOUD"


Now that we have our granules, let's extract the URLs associated with the first granule. There are two links: an S3 link, and an https link.

In [92]:
http_link <- gedi_granules[[1]]["Granule"]["OnlineAccessURLs"][[1]][0]["URL"]
print(paste("https Link:", http_link))
s3_link <- gedi_granules[[1]]["Granule"]["OnlineAccessURLs"][[1]][2]["URL"]
print(paste("S3 Link:", s3_link))

[1] "https Link: https://data.ornldaac.earthdata.nasa.gov/protected/gedi/GEDI_L4B_Gridded_Biomass_V2_1/data/GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif"
[1] "S3 Link: s3://ornl-cumulus-prod-protected/gedi/GEDI_L4B_Gridded_Biomass_V2_1/data/GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif"


## Data Access

### Direct Access

Let's use the `sf` package to open the TIFF file above. To read an item from S3 directly, `/vsis3/` needs to precede the S3 path. To do this, we'll use the `sub` function to replace `s3://` with `/vsis3/`.

In [93]:
tiff_path <- sub("s3://", "/vsis3/", s3_link)

tiff_read <- sf::gdal_utils("info", tiff_path)
tiff_read

Driver: GTiff/GeoTIFF
Files: /vsis3/ornl-cumulus-prod-protected/gedi/GEDI_L4B_Gridded_Biomass_V2_1/data/GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif
Size is 34704, 14616
Coordinate System is:
PROJCRS["WGS 84 / NSIDC EASE-Grid 2.0 Global",
    BASEGEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["US NSIDC EASE-Grid

### Download a File Locally

When data cannot or should not be directly accessed, the file can also be downloaded locally. For this example, let's search for a MODIS dataset provided by LP DAAC. Similar to above, we'll search for the collection and retrieve the associated granules, then extract the S3 link from the first granule.

In [94]:
# Search for a dataset in NASA's CMR
modis_collection <- maap$searchCollection(
  short_name = "MOD13A1",  
  cmr_host = "cmr.earthdata.nasa.gov",
  cloud_hosted = "true"
)

# Extract the collection’s concept ID
collection_id <- modis_collection[[1]]["concept-id"]

# Retrieve granules (up to 5 granules)
modis_granules <- maap$searchGranule(
  concept_id = collection_id,
  limit = as.integer(5),
  cmr_host = "cmr.earthdata.nasa.gov"
)

# Retrieve S3 link
s3_link <- modis_granules[[1]]["Granule"]["OnlineAccessURLs"][[1]][1]["URL"]
print(paste("S3 Link:", http_link))

[1] "S3 Link: https://data.ornldaac.earthdata.nasa.gov/protected/gedi/GEDI_L4B_Gridded_Biomass_V2_1/data/GEDI04_B_MW019MW223_02_002_02_R01000M_SE.tif"


To download the data locally, temporary credentials for LP DAAC are needed. 

In [95]:
# Get AWS S3 credentials for LP DAAC
credentials <- maap$aws$earthdata_s3_credentials(
  "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials"
)

# Configure AWS S3 client using paws
s3 <- paws::s3(
  credentials = list(
    creds = list(
      access_key_id = credentials$accessKeyId,
      secret_access_key = credentials$secretAccessKey,
      session_token = credentials$sessionToken
    )),
  region = "us-west-2")

Before downloading, let's do some final prepping. First, we'll create a directory to download our file to. Then, from our S3 link, we can get the bucket, key, and a filename.

In [None]:
# Create directory
dir.create("./data")

In [97]:
# Create file name for download
filename <- strsplit(s3_link, "/", fixed = TRUE)[[1]] |> tail(n = 1)
filename

# Get bucket from file path
bucket <- strsplit(s3_link, "/", fixed = TRUE)[[1]] |> head(n = 3)
bucket <- bucket[3]
bucket

# Get key from file path
key <- strsplit(s3_link, "/", fixed = TRUE)[[1]] |> tail(n = 3)
key <- paste(key[1], key[2], key[3], sep = "/")
key

Now we can download our file.

In [98]:
modis_file <- s3$download_file(Bucket = bucket, Key = key, Filename = paste("./data/", filename))

### Access the Downloaded File

The data has been downloaded and we can open the file. Since this is an HDF4 file, we can use the `ncdf4` package to open and work with it.

In [99]:
modis_file <- nc_open(paste("./data/", filename))

The desired information can now be obtained from the opened file. For example, let's print the variable names.

In [100]:
names(modis_file$var)