In [1]:
import calendar
import datetime
import os

import pyproj
import pystac
import shapely.geometry
import stac2dcache

from pystac.extensions.projection import ProjectionExtension
from pystac.extensions.scientific import ScientificExtension

from stac2dcache.utils import copy_asset

# Download the Daymet dataset as a STAC catalog

## 1. Introduction

### 1.1 Overview

In this notebook we retrieve part of the Daymet dataset, which is distributed by [NASA's Distributed Active Archive Center (DAAC) at Oak Ridge National Laboratory (ORNL)](https://daac.ornl.gov), and download it to the [SURF dCache storage](http://doc.grid.surfsara.nl/en/stable/Pages/Service/system_specifications/dcache_specs.html). The [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org) specification is used to store the dataset metadata and to organize the files within the storage system.

### 1.2 The dataset

The Daymet dataset includes daily surface weather data for North America, starting from January 1, 1980 (1950 for Puerto Rico). The dataset consists of a set of netCDF files that include gridded estimates of 7 parameters on a 1-km grid. More information on the dataset can be found [here](https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1840) (dataset version 4.2, https://doi.org/10.3334/ORNLDAAC/1840).

### 1.3 Before running this notebook

The dataset and its metadata are stored on the SURF dCache system, which we access via bearer-token authentication with a macaroon. The macaroon, generated using [this script](https://github.com/sara-nl/GridScripts/blob/master/get-macaroon), is stored together with other configuration parameters in a JSON fsspec configuration file (see also the [STAC2dCache tutorial](https://github.com/NLeSC-GO-common-infrastructure/stac2dcache/blob/main/notebooks/tutorial.ipynb) and the [fsspec documentation](https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration) for more information):
```json
{
    "dcache": {
        "token": "<MACAROON_STRING_HERE>",
        "api_url": "https://dcacheview.grid.surfsara.nl:22880/api/v1",
        "webdav_url": "https://webdav.grid.surfsara.nl:2880",
        "block_size": 0,
        "request_kwargs": {
            "timeout": 3600
        }
    }
}
```
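
Before running the rest of the notebook, it can help to check that the configuration file parses and that the macaroon placeholder has been replaced. The following is a minimal sketch using only the standard library. The helper name `check_dcache_config` and the choice of required keys are assumptions made for illustration; they are not part of STAC2dCache or fsspec.

```python
import json

# Keys shown in the example configuration above (assumed to be required here).
REQUIRED_KEYS = ("token", "api_url", "webdav_url")


def check_dcache_config(text):
    """Parse a JSON fsspec configuration and return its "dcache" section.

    Hypothetical helper: raises ValueError if a key is missing or if the
    macaroon placeholder was left in place.
    """
    config = json.loads(text)
    section = config.get("dcache")
    if section is None:
        raise ValueError('missing "dcache" section')
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if section["token"].startswith("<"):
        raise ValueError("the macaroon placeholder has not been replaced")
    return section


# Example input with a dummy token (not a real macaroon).
example = """
{
    "dcache": {
        "token": "example-macaroon",
        "api_url": "https://dcacheview.grid.surfsara.nl:22880/api/v1",
        "webdav_url": "https://webdav.grid.surfsara.nl:2880",
        "block_size": 0,
        "request_kwargs": {"timeout": 3600}
    }
}
"""
section = check_dcache_config(example)
print(section["request_kwargs"]["timeout"])
```

In practice the text would be read from the configuration file rather than from an inline string.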

## 2. Daymet as a STAC catalog

### 2.1 Overview

Let's start by creating a STAC catalog for the Daymet dataset and by saving it to the dCache storage. Note that the catalog itself only contains metadata and links to the relevant remote dataset files. These links will be used to retrieve the actual data in the following section.

### 2.2 Metadata 

The catalog metadata are populated using the following information, which we have extracted from the dataset [user guide](https://daac.ornl.gov/DAYMET/guides/Daymet_Daily_V4.html).

In [2]:
title = (
    "Daymet: Daily Surface Weather Data on a 1-km Grid "
    "for North America, Version 4"
)

In [3]:
# Summary
description = (
    "This dataset provides Daymet Version 4 data as gridded "
    "estimates of daily weather parameters for North America, "
    "Hawaii, and Puerto Rico. Daymet variables include the "
    "following parameters: minimum temperature, maximum "
    "temperature, precipitation, shortwave radiation, vapor "
    "pressure, snow water equivalent, and day length. The dataset "
    "covers the period from January 1, 1980, to December 31 (or "
    "December 30 in leap years) of the most recent full calendar "
    "year for the Continental North America and Hawaii spatial "
    "regions. Data for Puerto Rico is available starting in 1950. "
    "Each subsequent year is processed individually at the close of "
    "a calendar year. Daymet variables are provided as individual "
    "files, by variable and year, at a 1 km x 1 km spatial "
    "resolution and a daily temporal resolution. Areas of Hawaii "
    "and Puerto Rico are available as files separate from the "
    "continental North America. Data are in a North America Lambert "
    "Conformal Conic projection and are distributed in a "
    "standardized Climate and Forecast (CF)-compliant netCDF file "
    "format."
)
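
The description states that each year ends on December 31, or on December 30 in leap years, which implies 365 daily time steps per yearly file. The sketch below illustrates this calendar convention with the standard library. The helper name `daymet_last_day` is hypothetical, and the assumption of exactly 365 steps starting on January 1 is inferred from the description above.

```python
import calendar
import datetime


def daymet_last_day(year):
    """Return the last calendar day covered by the Daymet file of `year`.

    Assumes 365 daily steps per year starting on January 1, so that leap
    years end on December 30 (hypothetical helper, for illustration only).
    """
    return datetime.date(year, 1, 1) + datetime.timedelta(days=364)


for year in (2019, 2020):
    print(year, calendar.isleap(year), daymet_last_day(year))
```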

In [4]:
# Citation
doi = "10.3334/ORNLDAAC/1840"

In [5]:
# Temporal Coverage and Study Areas
# (all latitude and longitude given in decimal degrees)
regions = {
    "na": {
        "full_name": "Continental North America", 
        "bbox": (-178.1333, 14.0749, -53.0567, 82.9143),
        "start_year": 1980,
        "end_year": 2021,
    },
    "pr": {
        "full_name": "Puerto Rico", 
        "bbox": (-67.9927, 16.8444, -64.1196, 19.9382),
        "start_year": 1950,
        "end_year": 2021,
    },
    "hi": {
        "full_name": "Hawaii", 
        "bbox": (-160.3056, 17.9539, -154.772, 23.5186),
        "start_year": 1980,
        "end_year": 2021,
    },
}
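
As a quick sanity check on the size of the catalog, the year ranges above determine how many yearly entries each region contributes. The sketch below repeats the start and end years from the `regions` dictionary and assumes one entry per region and year with 7 parameter files each, as described in Section 2.3. The variable names are chosen for illustration only.

```python
# Start and end years as listed in the `regions` dictionary above.
year_ranges = {"na": (1980, 2021), "pr": (1950, 2021), "hi": (1980, 2021)}
n_parameters = 7  # dayl, prcp, srad, swe, tmax, tmin, vp

# Both the start and the end year are included in each range.
n_items = {key: end - start + 1 for key, (start, end) in year_ranges.items()}
n_assets = n_parameters * sum(n_items.values())

print(n_items)
print(n_assets)
```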

In [6]:
# Parameters, abbreviations, units, and descriptions.
parameters = {
    "dayl": {
        "title": "Day length",
        "description": (
            "Duration of the daylight period in seconds per day. " 
            "This calculation is based on the period of the day "
            "during which the sun is above a hypothetical flat "
            "horizon"
        ),
        "units": "s/day",
    }, 
    "prcp": {
        "title": "Precipitation",
        "description": (
            "Daily total precipitation in millimeters. Sum of all "
            "forms of precipitation converted to a water-equivalent "
            "depth."
        ),
        "units": "mm",
    }, 
    "srad": {
        "title": "Shortwave radiation",
        "description": (
            "Incident shortwave radiation flux density in watts per "
            "square meter, taken as an average over the daylight "
            "period of the day. Note: Daily total radiation "
            "(MJ/m2/day) can be calculated as follows: "
            "((srad (W/m2) * dayl (s/day)) / 1,000,000)"
        ),
        "units": "W/m2",
    }, 
    "swe": {
        "title": "Snow water equivalent",
        "description": (
            "Snow water equivalent in kilograms per square meter. "
            "The amount of water contained within the snowpack."
        ),
        "units": "kg/m2",
    }, 
    "tmax": {
        "title": "Maximum air temperature",
        "description": (
            "Daily maximum 2 m air temperature in degrees Celsius."
        ),
        "units": "degrees C",
    }, 
    "tmin": {
        "title": "Minimum air temperature",
        "description": (
            "Daily minimum 2 m air temperature in degrees Celsius."
        ),
        "units": "degrees C",
    }, 
    "vp": {
        "title": "Water vapor pressure",
        "description": (
            "Water vapor pressure in pascals. Daily average partial "
            "pressure of water vapor."
        ),
        "units": "Pa",
    },
}
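
The description of `srad` notes that the daily total radiation in MJ/m2/day can be obtained by multiplying `srad` by `dayl` and dividing by 1,000,000. A minimal worked example of that conversion follows. The function name and the input values are illustrative assumptions, not values taken from the dataset.

```python
def daily_total_radiation(srad, dayl):
    """Convert Daymet `srad` (W/m2) and `dayl` (s/day) to MJ/m2/day."""
    return srad * dayl / 1_000_000


# Illustrative values (not taken from the dataset): an average flux of
# 300 W/m2 over a 12-hour daylight period.
total = daily_total_radiation(srad=300.0, dayl=12 * 3600)
print(f"{total:.2f} MJ/m2/day")
```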

In [7]:
# Coordinate Reference System
proj4_string = (
    "+proj=lcc +lat_1=25 +lat_2=60 +lat_0=42.5 +lon_0=-100 "
    "+x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs"
)

In [8]:
filename_format = "daymet_v4_daily_{region}_{parameter}_{year}.nc"
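
To make the naming scheme concrete, the following sketch fills the template for one region, parameter, and year, and joins the result with the ORNL DAAC file server root that is used later in this notebook. The variable names `root` and `url` are chosen for illustration only.

```python
filename_format = "daymet_v4_daily_{region}_{parameter}_{year}.nc"

# Root of the ORNL DAAC file server, as used later in this notebook.
root = "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/1840"

filename = filename_format.format(region="na", parameter="tmax", year=1980)

# Stripping a possible trailing slash keeps the join free of double
# slashes, whether or not the root ends with one.
url = f"{root.rstrip('/')}/{filename}"

print(filename)
print(url)
```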

### 2.3 Create STAC Catalog

Here we define the catalog structure. One STAC item is created for each year and region, with the 7 parameter files of the dataset linked to it as assets. All items of a region are grouped in a sub-catalog, defined as a STAC collection.

Together with the metadata listed above we add information about the Daymet license and about the data provider (ORNL-DAAC): 

In [9]:
# link to the dataset license
license = pystac.Link(
    rel=pystac.RelType.LICENSE,
    target=(
        "https://science.nasa.gov/earth-science/"
        "earth-science-data/data-information-policy"
    ),
    title="NASA's Earth Science program Data and Information Policy",
)

In [10]:
# data provider, with citation
provider = pystac.Provider(
    name="ORNL DAAC",
    roles=[pystac.ProviderRole.PRODUCER],
    url="https://doi.org/10.3334/ORNLDAAC/1840",
)

The following cell contains the parameters that set the catalog name and the dCache path where the catalog root directory is created: 

In [11]:
catalog_id = "daymet-daily-v4"
dcache_root = "dcache://pnfs/grid.sara.nl/data/remotesensing/disk"
ornl_daac_root = (
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/1840"
)

In [12]:
# convert proj4 string to projjson to store CRS within items
crs = pyproj.CRS.from_proj4(proj4_string)
crs_projjson = crs.to_json_dict()

In [13]:
# Create Daymet catalog
catalog = pystac.Catalog(
    id=catalog_id,
    description=description,
    title=title,
)

for region_key, region_val in regions.items():

    region_full_name = region_val["full_name"]
    bbox = region_val["bbox"]
    start_year = region_val["start_year"]
    end_year = region_val["end_year"]

    geometry = shapely.geometry.mapping(
        shapely.geometry.Polygon.from_bounds(*bbox)
    )
    
    items = []
    for year in range(start_year, end_year+1):
    
        # For each region and year, create a STAC item
        item = pystac.Item(
            id=f"{region_key}-{year}",
            geometry=geometry,
            bbox=bbox,
            datetime=datetime.datetime(year, 1, 1),
            properties={
                "gsd": 1000
            }
        )
        
        # Add projection information to the items
        ext = ProjectionExtension.ext(item, add_if_missing=True)
        ext.epsg = None
        ext.projjson = crs_projjson
        
        for parameter_key, parameter_val in parameters.items():
            
            filename = filename_format.format(
                parameter=parameter_key,
                region=region_key,
                year=year,
            )
         
            # For each parameter, create a STAC asset
            asset = pystac.Asset(
                href=f"{ornl_daac_root}/{filename}",
                title=parameter_val["title"],
                description=parameter_val["description"],
                media_type=pystac.MediaType.HDF5,
                roles=["data"],
                extra_fields={"units": parameter_val["units"]},
            )
            item.add_asset(parameter_key, asset)
            
        items.append(item)
            
    # Group items corresponding to a region in a STAC collection
    collection = pystac.Collection(
        id=f"region-{region_key}",
        description=f"Daymet dataset for {region_full_name}",
        extent=pystac.Extent.from_items(items),
        license="proprietary",
        providers=[provider]
    )

    # Add DOI to collections
    ext = ScientificExtension.ext(collection, add_if_missing=True)
    ext.doi = doi
    
    collection.add_items(items)
    catalog.add_child(collection)
    
# Add link to license
catalog.add_link(license)

In [14]:
# get overview of the catalog
catalog.describe()

* <Catalog id=daymet-daily-v4>
    * <Collection id=region-na>
      * <Item id=na-1980>
      * <Item id=na-1981>
      * <Item id=na-1982>
      * <Item id=na-1983>
      * <Item id=na-1984>
      * <Item id=na-1985>
      * <Item id=na-1986>
      * <Item id=na-1987>
      * <Item id=na-1988>
      * <Item id=na-1989>
      * <Item id=na-1990>
      * <Item id=na-1991>
      * <Item id=na-1992>
      * <Item id=na-1993>
      * <Item id=na-1994>
      * <Item id=na-1995>
      * <Item id=na-1996>
      * <Item id=na-1997>
      * <Item id=na-1998>
      * <Item id=na-1999>
      * <Item id=na-2000>
      * <Item id=na-2001>
      * <Item id=na-2002>
      * <Item id=na-2003>
      * <Item id=na-2004>
      * <Item id=na-2005>
      * <Item id=na-2006>
      * <Item id=na-2007>
      * <Item id=na-2008>
      * <Item id=na-2009>
      * <Item id=na-2010>
      * <Item id=na-2011>
      * <Item id=na-2012>
      * <Item id=na-2013>
      * <Item id=na-2014>
      * <Item id=na-2015>
 

In [15]:
# save catalog to storage
catalog.normalize_and_save(
    f"{dcache_root}/{catalog_id}",
    catalog_type=pystac.CatalogType.SELF_CONTAINED,
)

## 3. Retrieve data for spring index calculation

### 3.1 Overview 

After having created a STAC catalog with the metadata of the Daymet dataset, we now proceed to retrieve the files that are relevant for the extended Spring Index model calculations over our area of interest, i.e. continental North America. For this region only, we retrieve the assets corresponding to the following three parameters: maximum temperature, minimum temperature, and duration of the daylight period.

The data files are stored beside the metadata files on the dCache storage. Links within the catalog are updated to point to the copies that have been retrieved to the SURF infrastructure.

### 3.2 Download

We first select the sub-catalog that contains the items for continental North America (collection ID: `region-na`):

In [16]:
collection_na = catalog.get_child("region-na")

We then download the assets to the storage and save the catalog with the updated links:

In [None]:
for asset_key in ("tmin", "tmax", "dayl"):
    copy_asset(
        catalog=collection_na,
        asset_key=asset_key,
        update_catalog=True,
        max_workers=4
    )

In [None]:
catalog.normalize_and_save(f"{dcache_root}/{catalog_id}")

The data downloaded amounts to ~940 GiB. Using 4 workers to retrieve the items' assets in parallel, the cells above complete in ~7 hours. 
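
These figures can be turned into rough per-file and throughput estimates. The sketch below is back-of-the-envelope arithmetic based only on the approximate numbers quoted above and on the assumption that one file is retrieved per parameter and year for the `region-na` collection (1980 to 2021).

```python
n_files = 3 * (2021 - 1980 + 1)  # tmin, tmax, dayl for each year of "na"
total_gib = 940                  # approximate size quoted above
elapsed_hours = 7                # approximate run time quoted above

gib_per_file = total_gib / n_files
mib_per_second = total_gib * 1024 / (elapsed_hours * 3600)

print(n_files)
print(f"{gib_per_file:.1f} GiB per file")
print(f"{mib_per_second:.0f} MiB/s")
```

Under these assumptions the download covers 126 files of roughly 7.5 GiB each, at an aggregate rate of roughly 38 MiB/s across the 4 workers.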