---
title: Create STAC metadata for a CMR collection, using GPM IMERG
description: Tutorial for creating STAC metadata for a collection in CMR
author: Aimee Barciauskas
date: June 9, 2025
execute:
  freeze: true
  cache: true
---

## Run this notebook

This notebook was written on a VEDA JupyterHub instance.

See [VEDA Interactive Compute and Processing Environment docs](https://docs.openveda.cloud/user-guide/scientific-computing/getting-access.html) for information about how to gain access.

## Approach

This notebook creates STAC collection metadata for the [GPM IMERG Final Precipitation L3 1 day 0.1 degree x 0.1 degree V07 (GPM_3IMERGDF) at GES DISC](https://search.earthdata.nasa.gov/search/granules?p=C2723754864-GES_DISC) dataset. 

## Step 1: Install and import necessary libraries

In [1]:
%%capture
!pip install xstac

In [2]:
import earthaccess
import json
from datetime import datetime
import pandas as pd
import pystac
import requests
import s3fs
import xstac
import xarray as xr

## Step 2: Get Collection metadata from CMR

In [3]:
earthaccess.login()

Enter your Earthdata Login username:  aimeeb
Enter your Earthdata password:  ········


<earthaccess.auth.Auth at 0x7fe42566dbb0>

In [4]:
collection_identifier = "GPM_3IMERGDF.07"
collection_config = {
    "short_name": "GPM_3IMERGDF",
    "version": "07",
    "temporal_step": "P1D",
    "variables": {
        "precipitation": {"rescale": [[0, 48]], "colormap": "cfastie"},
        "MWprecipitation": {"rescale": [[0, 50]], "colormap": "cfastie"},
        "randomError": {"rescale": [[0, 17469]], "colormap": "reds"},
    },
    "reference_system": "4326",
}
short_name, version, temporal_step, variables, reference_system = (
    collection_config.values()
)
collection_query = earthaccess.collection_query()
r = collection_query.short_name(short_name).version(version)
cmr_collection = r.get(1)[0]
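
The tuple unpacking above relies on `collection_config.values()` yielding values in insertion order, so reordering the keys would silently assign the wrong values. A minimal sketch of an order-independent alternative follows, using `operator.itemgetter` from the standard library. The trimmed `config` dictionary and the `get_fields` name are illustrative assumptions, not part of the notebook.

```python
from operator import itemgetter

# Illustrative copy of the configuration, with keys deliberately reordered.
config = {
    "version": "07",
    "short_name": "GPM_3IMERGDF",
    "reference_system": "4326",
    "temporal_step": "P1D",
    "variables": {"precipitation": {"rescale": [[0, 48]], "colormap": "cfastie"}},
}

# itemgetter looks each key up by name, so dictionary order no longer matters.
get_fields = itemgetter(
    "short_name", "version", "temporal_step", "variables", "reference_system"
)
short_name, version, temporal_step, variables, reference_system = get_fields(config)

print(short_name, version, temporal_step, reference_system)
```

A missing key raises `KeyError` at the lookup, which surfaces a typo in the configuration early.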

Pick one granule to open so that the data cube dimensions and variables can be read from it.

In [5]:
first_result = earthaccess.search_data(
    short_name=short_name, version=version, cloud_hosted=True, count=1
)

In [6]:
first_result

[Collection: {'ShortName': 'GPM_3IMERGDF', 'Version': '07'}
 Spatial coverage: {'HorizontalSpatialDomain': {'Geometry': {'BoundingRectangles': [{'WestBoundingCoordinate': -180.0, 'EastBoundingCoordinate': 180.0, 'NorthBoundingCoordinate': 90.0, 'SouthBoundingCoordinate': -90.0}]}}}
 Temporal coverage: {'RangeDateTime': {'BeginningDateTime': '1998-01-01T00:00:00.000Z', 'EndingDateTime': '1998-01-01T23:59:59.999Z'}}
 Size(MB): 26.3265686035156
 Data: ['https://data.gesdisc.earthdata.nasa.gov/data/GPM_L3/GPM_3IMERGDF.07/1998/01/3B-DAY.MS.MRG.3IMERG.19980101-S000000-E235959.V07B.nc4']]

In [7]:
s3_link = first_result[0].data_links(access="direct")[0]
s3_link

's3://gesdisc-cumulus-prod-protected/GPM_L3/GPM_3IMERGDF.07/1998/01/3B-DAY.MS.MRG.3IMERG.19980101-S000000-E235959.V07B.nc4'

In [8]:
%%time
fs = s3fs.S3FileSystem(anon=False)
ds_s3 = xr.open_dataset(fs.open(s3_link), engine="h5netcdf", chunks={})

CPU times: user 2.47 s, sys: 312 ms, total: 2.78 s
Wall time: 3.61 s


## Step 3: Generate STAC metadata

The spatial and temporal extents are extracted from the CMR collection metadata.

In [9]:
spatial_extent = cmr_collection["umm"]["SpatialExtent"]
bounding_rectangle = spatial_extent["HorizontalSpatialDomain"]["Geometry"][
    "BoundingRectangles"
][0]
extent_list = [
    bounding_rectangle["WestBoundingCoordinate"],
    bounding_rectangle["SouthBoundingCoordinate"],
    bounding_rectangle["EastBoundingCoordinate"],
    bounding_rectangle["NorthBoundingCoordinate"],
]
# Use float rather than int so fractional bounds are not truncated.
spatial_extent = list(map(float, extent_list))

temporal_extent = cmr_collection["umm"]["TemporalExtents"][0]["RangeDateTimes"][0]
start = temporal_extent["BeginningDateTime"]
end = temporal_extent.get("EndingDateTime", None)

extent = pystac.Extent(
    spatial=pystac.SpatialExtent(bboxes=[spatial_extent]),
    temporal=pystac.TemporalExtent([[pd.to_datetime(start), pd.to_datetime(end)]]),
)
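
To make the shape of the CMR record concrete, here is a sketch of the same extraction run against a hand-written dictionary. The helper names (`bbox_from_umm`, `temporal_range_from_umm`) and the `sample_umm` values are assumptions for illustration. The sample only mimics the structure of `cmr_collection["umm"]`, and it omits `EndingDateTime` to show the open-ended case.

```python
def bbox_from_umm(umm):
    """Return [west, south, east, north] from a UMM-C style dictionary."""
    rectangle = umm["SpatialExtent"]["HorizontalSpatialDomain"]["Geometry"][
        "BoundingRectangles"
    ][0]
    sides = ("West", "South", "East", "North")
    return [float(rectangle[f"{side}BoundingCoordinate"]) for side in sides]


def temporal_range_from_umm(umm):
    """Return (start, end); end is None when EndingDateTime is absent."""
    range_date_time = umm["TemporalExtents"][0]["RangeDateTimes"][0]
    return range_date_time["BeginningDateTime"], range_date_time.get("EndingDateTime")


# Hand-written stand-in for cmr_collection["umm"]; values are illustrative.
sample_umm = {
    "SpatialExtent": {
        "HorizontalSpatialDomain": {
            "Geometry": {
                "BoundingRectangles": [
                    {
                        "WestBoundingCoordinate": -180.0,
                        "EastBoundingCoordinate": 180.0,
                        "NorthBoundingCoordinate": 90.0,
                        "SouthBoundingCoordinate": -90.0,
                    }
                ]
            }
        }
    },
    "TemporalExtents": [
        {"RangeDateTimes": [{"BeginningDateTime": "1998-01-01T00:00:00.000Z"}]}
    ],
}

bbox = bbox_from_umm(sample_umm)
start, end = temporal_range_from_umm(sample_umm)
print(bbox, start, end)
```

The bounding box is ordered west, south, east, north, which is the order STAC expects for a 2D `bbox`.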

Add the provider information from CMR.

In [10]:
cmr_roles_to_pystac_roles = {
    "PROCESSOR": pystac.ProviderRole.PROCESSOR,
    "DISTRIBUTOR": pystac.ProviderRole.HOST,
}


def create_providers_from_data_centers(data_centers):
    providers = []

    for center in data_centers:
        # Extracting necessary information from each data center
        short_name = center.get("ShortName", "")
        long_name = center.get("LongName", "")
        roles = []
        for role in center.get("Roles", []):
            if role in cmr_roles_to_pystac_roles:
                roles.append(cmr_roles_to_pystac_roles[role])
        url = next(
            (
                url_info["URL"]
                for url_info in center.get("ContactInformation", {}).get(
                    "RelatedUrls", []
                )
                if url_info.get("URLContentType") == "DataCenterURL"
            ),
            None,
        )

        # Creating a PySTAC Provider object
        provider = pystac.Provider(
            name=short_name, description=long_name, roles=roles, url=url
        )
        providers.append(provider)

    return providers


data_centers = cmr_collection["umm"]["DataCenters"]
providers = create_providers_from_data_centers(data_centers)
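
The role mapping above keeps only `PROCESSOR` and `DISTRIBUTOR`, so any other CMR role is dropped. The sketch below shows that behavior on a made-up data center entry, using plain dictionaries and STAC role strings instead of `pystac` objects so that it runs without extra dependencies. `CMR_TO_STAC_ROLE`, `provider_dict`, and the `EXAMPLE_DAAC` record are hypothetical names introduced for illustration.

```python
# CMR role -> STAC provider role string, mirroring the two entries used above.
CMR_TO_STAC_ROLE = {"PROCESSOR": "processor", "DISTRIBUTOR": "host"}


def provider_dict(center):
    """Build a plain STAC provider dictionary from one CMR data center entry."""
    related_urls = center.get("ContactInformation", {}).get("RelatedUrls", [])
    # Take the first URL flagged as the data center's own URL, if there is one.
    url = next(
        (u["URL"] for u in related_urls if u.get("URLContentType") == "DataCenterURL"),
        None,
    )
    roles = [
        CMR_TO_STAC_ROLE[role]
        for role in center.get("Roles", [])
        if role in CMR_TO_STAC_ROLE
    ]
    return {
        "name": center.get("ShortName", ""),
        "description": center.get("LongName", ""),
        "roles": roles,
        "url": url,
    }


# Hypothetical data center entry; not taken from the real CMR record.
example_center = {
    "ShortName": "EXAMPLE_DAAC",
    "LongName": "Example Data Center",
    "Roles": ["ARCHIVER", "DISTRIBUTOR"],
    "ContactInformation": {
        "RelatedUrls": [
            {"URLContentType": "DataCenterURL", "URL": "https://example.org/"}
        ]
    },
}

provider = provider_dict(example_center)
print(provider)
```

In this example `ARCHIVER` has no mapping, so the resulting provider carries only the `host` role.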

Put it all together to initialize a `pystac.Collection` instance.

In [11]:
_id = short_name.replace(".", "_")
description = cmr_collection["umm"]["Abstract"]
concept_id = cmr_collection["meta"]["concept-id"]
pystac_collection = pystac.Collection(
    id=_id,
    extent=extent,
    description=description,
    providers=providers,
    stac_extensions=["https://stac-extensions.github.io/datacube/v2.0.0/schema.json"],
    license="CC0-1.0",
    extra_fields={"collection_concept_id": concept_id},
)

That collection instance is used by `xstac` to generate additional metadata, specifically for the [`datacube extension`](https://github.com/stac-extensions/datacube) information.

In [12]:
collection_template = pystac_collection.to_dict()

# see https://github.com/stac-utils/xstac/issues/30
for k, v in ds_s3.variables.items():
    attrs = {
        name: xr.backends.zarr.encode_zarr_attr_value(value)
        for name, value in v.attrs.items()
    }
    ds_s3[k].attrs = attrs

collection = xstac.xarray_to_stac(
    ds_s3,
    collection_template,
    temporal_dimension="time",
    temporal_step=temporal_step,
    x_dimension="lon",
    y_dimension="lat",
    reference_system=reference_system,
    validate=False,
)

collection.validate()

['https://schemas.stacspec.org/v1.0.0/collection-spec/json-schema/collection.json',
 'https://stac-extensions.github.io/datacube/v2.2.0/schema.json']

Set the second value of the time extent to `None` because the dataset is ongoing. Otherwise the extent would cover only the single granule opened above, which is the first file in the collection.

In [13]:
# Edit extra_fields directly rather than relying on to_dict() returning shared references.
collection.extra_fields["cube:dimensions"]["time"]["extent"][1] = None

In [14]:
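# The time axis (assumed to be the first dimension) grows as new granules arrive,
# so leave its length unset.
cube_variables = collection.extra_fields["cube:variables"]
for variable in cube_variables:
    cube_variables[variable]["shape"][0] = None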
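
The two cells above assume that time is the first axis of every variable. As a hedged alternative, the sketch below looks up the position of the time dimension for each variable and skips variables that do not have one. It works on a hand-written fragment of datacube fields. The `open_time_axis` helper and the sample shapes are illustrative assumptions, not values read from the granule.

```python
import copy


def open_time_axis(cube_fields, time_dim="time"):
    """Return a copy with an open-ended time extent and unset time lengths."""
    result = copy.deepcopy(cube_fields)
    result["cube:dimensions"][time_dim]["extent"][1] = None
    for variable in result["cube:variables"].values():
        dims = variable["dimensions"]
        if time_dim in dims:
            # Only the entry that corresponds to the time axis is cleared.
            variable["shape"][dims.index(time_dim)] = None
    return result


# Hand-written fragment shaped like the datacube extension fields.
sample_fields = {
    "cube:dimensions": {
        "time": {
            "type": "temporal",
            "extent": ["1998-01-01T00:00:00Z", "1998-01-01T00:00:00Z"],
            "step": "P1D",
        }
    },
    "cube:variables": {
        "precipitation": {
            "type": "data",
            "dimensions": ["time", "lon", "lat"],
            "shape": [1, 3600, 1800],
        },
        "lon_bnds": {"type": "data", "dimensions": ["lon", "bnds"], "shape": [3600, 2]},
    },
}

opened = open_time_axis(sample_fields)
print(opened["cube:dimensions"]["time"]["extent"])
print({name: v["shape"] for name, v in opened["cube:variables"].items()})
```

Working on a deep copy leaves the input untouched, which makes it easy to compare the fields before and after.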

Add the [renders](https://github.com/stac-extensions/render) extension.

In [15]:
collection.extra_fields["renders"] = {}
for vname, vvalue in variables.items():
    collection.extra_fields["renders"][vname] = {
        "title": f"Renders configuration for {vname}",
        "resampling": "average",
        "colormap_name": vvalue["colormap"],
        "rescale": vvalue["rescale"],
        "backend": "xarray",
    }
    # Link each data cube variable to its render configuration by name.
    collection.extra_fields["cube:variables"][vname]["renders"] = vname

Add dashboard fields.

In [16]:
collection.extra_fields["dashboard:is_periodic"] = True
collection.extra_fields["dashboard:time_density"] = "day"

## Step 4: Write to JSON

In [18]:
with open(f"../ingestion-data/staging/collections/{collection.id}.json", "w+") as f:
    f.write(json.dumps(collection.to_dict(), indent=2))
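
The write above fails if the target directory does not exist. The sketch below shows one way to create the directory first and to confirm the file round-trips, using a temporary directory and a hand-written stand-in for `collection.to_dict()`. The `write_collection` helper and the `sample` dictionary are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_collection(collection_dict, directory):
    """Write a collection dictionary to <directory>/<id>.json and return the path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{collection_dict['id']}.json"
    with path.open("w") as f:
        json.dump(collection_dict, f, indent=2)
    return path


# Hand-written stand-in for collection.to_dict(); only a few fields are shown.
sample = {"type": "Collection", "id": "GPM_3IMERGDF", "dashboard:time_density": "day"}

with tempfile.TemporaryDirectory() as tmp:
    written = write_collection(sample, Path(tmp) / "collections")
    # Read the file back to confirm that it contains valid, identical JSON.
    reloaded = json.loads(written.read_text())

print(written.name, reloaded == sample)
```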