
Data Access Patterns


Working with the CONUS hydrofabric:

A hydrofabric suitable for the NextGen Water Resources Framework is made available to the community as a cloud resource1.

In addition to the core hydrofabric products, a wide range of ancillary data can be associated with the fabrics to provide more opportunity for science, research, ML, and regionalization.

This wiki demonstrates how to find a subset of the CONUS dataset, and how to add data from (1) the NextGen core models, (2) HydroATLAS, (3) EPA StreamCat, (4) USGS basin characteristics, and (5) gridded resources such as NWM and climateR-catalog data.

Each source requires slightly different handling, but they all share similar data access patterns (a general sketch follows the library load below).

To start, we need to load the associated library:

library(hydrofabric)

#> ── Attaching packages ────────────────────────────────────── hydrofabric0.0.6 ──
#> ✔ dplyr         1.1.3        ✔ nhdplusTools  1.0.1   
#> ✔ terra         1.7.55       ✔ hydrofab      0.5.0   
#> ✔ ngen.hydrofab 0.0.3        ✔ zonal         0.0.2   
#> ✔ climateR      0.3.1.4      ✔ glue          1.6.2   
#> ✔ sf            1.0.14       ✔ arrow         13.0.0.1
#> ── Conflicts ──────────────────────────────────────── hydrofabric_conflicts() ──
#> ✖ arrow::buffer()    masks terra::buffer()
#> ✖ terra::intersect() masks dplyr::intersect()
#> ✖ glue::trim()       masks terra::trim()
#> ✖ terra::union()     masks dplyr::union()
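
Every example on this page follows the same basic pattern: lazily open a (remote) dataset with arrow, filter to the features in the subset, collect, and join to the hydrofabric geometries. A minimal sketch of that pattern (the helper name and parquet path are hypothetical placeholders):

# Hypothetical helper illustrating the shared pattern; each section below
# substitutes a real parquet file and join column
append_attributes = function(divides, parquet_uri) {
  attrs = open_dataset(parquet_uri) %>%            # lazy connection; nothing downloaded yet
    filter(divide_id %in% divides$divide_id) %>%   # push the filter to the remote file
    collect()                                      # pull down only the matching rows
  left_join(divides, attrs, by = "divide_id")      # join attributes to the geometries
}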

Next, define our:

  1. desired outlet (in this case by hydrolocation),
  2. location to write our subset network, and
  3. s3 endpoint we will operate from:
# A hydrolocation URI
hl = 'Gages-04185000'

# The output directory
o = "data/gray_test.gpkg"

# s3 endpoint
s3_base = "s3://lynker-spatial/

NOTE: If you don't have (or lose) access to an s3 endpoint, or want faster repeated access, the base_dir option allows all utilities here to work off of a local copy of the data.
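
For example, a fully local call to the subset_network() utility used below might look like the following sketch (the directory path is hypothetical; it assumes a mirrored copy of the bucket contents):

# Hypothetical local mirror of the lynker-spatial data
subset_network(hl_uri = hl, base_dir = "data/lynker-spatial", outfile = o)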

Building a subset

Our first task is to build a subset from the desired outlet. While this example defines an outlet using a hydrolocation URI, you can also provide any of the following (a sketch of the alternative calls follows the build step below):

  1. comid,
  2. hl_uri,
  3. XY coordinate pair,
  4. nldi_feature,
  5. NextGen ID, or
  6. geoconnex URL
# Build subset
## caching the downloaded VPU files to "data" and writing all layers to "o"
subset_network(hl_uri = hl, cache_dir = "data", outfile = o)
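
The same subset can be requested with any of the identifier types listed above. A sketch (the COMID is a hypothetical placeholder, and the argument names are assumed to mirror that list):

# Hypothetical equivalents using other outlet identifiers
subset_network(comid = 101, cache_dir = "data", outfile = o)
subset_network(nldi_feature = list(featureSource = "nwissite", featureID = "USGS-04185000"),
               cache_dir = "data", outfile = o)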


Adding CFE and NOAH-OWP data

Now that we have a geopackage with the subset layers, it is possible to append data that has been summarized to the NextGen fabric, to the reference fabric, or to an associated data product.

In all cases the ancillary data is hosted in the cloud, either by Lynker or by the USGS. Access to this data is most direct through the network layer of a hydrofabric. In the following example we show how to append the CFE/NOAH-OWP data to the subset divides:

  1. Read the divides layer from the subset geopackage
  2. Define the s3 file holding the model attributes
  3. Establish a connection to the remote file
  4. Filter to only include those divides in the subset network
  5. Pull down the data
  6. Join it to the divides and plot
# Read the subset divides
div = read_sf(o, "divides")

# Lazily open the remote model attribute file and filter to our divides
par = open_dataset(glue('{s3_base}v20/conus_model_attributes.parquet')) %>% 
  filter(divide_id %in% div$divide_id) %>% 
  collect()

dim(par)
#> [1] 140  43

# Join the attributes to the divide geometries
div = left_join(div, par, by = "divide_id")

plot(div['smcwlt_soil_layers_stag=2'])


Adding HydroAtlas data

HydroATLAS offers a global compendium of hydro-environmental characteristics for all sub-basins of HydroBASINS. We have indexed the reference features to the HydroATLAS/HydroBASINS divides, and provide both the attribute data and this crosswalk among the cloud resources.

Since HydroBASINS and the reference fabric stem from different source DEMs, there is no scalable relation between the two. Further, HydroATLAS is much coarser than the NextGen fabric, making a reference fabric centroid <--> HydroATLAS basin match about as good as it gets.

In the following example we append hydroAtlas data to the subset network:

  1. Define the variables of interest
  2. Access the hydrofabric network layer to identify a representative hf_id for each divide
  3. Define the s3 file holding the HydroATLAS attributes
  4. Establish a connection to the remote file
  5. Filter to only include those divides in the subset network, and select only the desired columns
  6. Pull down the data and join it to the network object (establishing a divide --> variable relation)
  7. Join it to the divides and plot
# Define HydroATLAS variables you want (can skip this and remove `select("hf_id", any_of(vars))` to get all)
vars = 'pet_mm_s01'

## READ NETWORK
net = read_sf(o, "network") %>% 
  select(divide_id, hf_id) %>% 
  filter(complete.cases(.)) %>% 
  group_by(divide_id) %>% 
  slice(1)

## EXTRACT HydroATLAS vars

ha = open_dataset(glue('{s3_base}hydroATLAS/hydroatlas_vars.parquet')) %>% 
  filter(hf_id %in% net$hf_id) %>% 
  select("hf_id", any_of(vars)) %>% 
  collect() %>% 
  right_join(net, by = "hf_id")

## JOIN features and variables

div = left_join(read_sf(o, "divides"), ha, by = "divide_id")

## PLOT
plot(div[c(vars)])


Adding StreamCAT

StreamCat contains over 600 metrics that include local catchment (Cat), watershed (Ws), and special metrics. The special metrics were derived through modeling or by combining other StreamCat metrics, and include predicted water temperature, predicted biological condition, and the indices of catchment and watershed integrity.

The catchment level is associated with the reference fabric used to build our products; an area-weighted scaling can therefore be performed between the smaller (StreamCat) and larger (NextGen) basins using zonal.

In the following example we append StreamCat STATSGO data to the subset network:

  1. List the streamcat parquet files in the s3 account (138 in total) and determine those containing statsgo data (n = 2)
  2. Access the hydrofabric network layer to identify the representative hf_ids and areas in each divide
  3. Establish a connection to the remote parquet file
  4. Filter to only include those reference features in the subset network, and select only the desired columns
  5. Pull down the data and join it to the network object (establishing a divide/area --> ref/area/value relation)
  6. Rescale the values using zonal::aggregate_zones
  7. Join it to the divides and plot
library(aws.s3)

sc_files = get_bucket_df(bucket = s3_base, prefix = 'streamcats', region = "us-west-2") %>% 
  filter(!grepl("old|data", Key)) %>% 
  select(Key)

dim(sc_files)
#> [1] 138   1

(statsgo = filter(sc_files, grepl('STATSGO', Key)))
#>                                             Key
#> 1 streamcats/streamcat_STATSGO_Set1_cat.parquet
#> 2 streamcats/streamcat_STATSGO_Set2_cat.parquet

ref =  select(read_sf(o, "network"), hf_id, divide_id, areasqkm)

sc = open_dataset(glue('{s3_base}{statsgo$Key[1]}')) %>% 
  select(hf_id = COMID, ClayCat, SandCat, s_areasqkm = CatAreaSqKm) %>% 
  filter(hf_id %in% ref$hf_id) %>% 
  collect() %>% 
  left_join(ref, by = "hf_id")

sc = aggregate_zones(data = sc, 
                     geom = NULL,  
                     crosswalk = select(sc, hf_id, divide_id, areasqkm, s_areasqkm),
                     ID = "divide_id") 

  
div = left_join(read_sf(o, "divides"), sc,  by = "divide_id")

plot(div[c('ClayCat', 'SandCat')])


Adding USGS Summaries

Mike Wieczorek and his team at the USGS regularly produce high-quality summaries of critical environmental data for the NHDPlus. A cached copy of the 2021 release2 has been made available on ScienceBase.

We use the nhdplusTools index of these to find the proper URLs for a desired attribute (e.g. TWI). From there, a very similar rescaling process can take place. The exception is that, unlike StreamCat, the base parquet files do not store the reference feature areas. To get around this, we need to add the reference feature areas to the network table using nhdplusTools::get_vaa().

TODO: perhaps we should add hf_area to the network tables?

  1. Identify the USGS TWI parquet file
  2. Access the hydrofabric network layer to identify the representative hf_ids and area of each divide.
  3. Join this to a table of reference feature areas.
  4. Establish a connection to the remote parquet file
  5. Filter to only include those reference features in the subset network, and select only the desired columns
  6. Pull down the data and join it to the network object (establishing a divide/area --> ref/area/value relation)
  7. Rescale the values using zonal::aggregate_zones
  8. Join it to the divides and plot
usgs_c = get_characteristics_metadata() %>% 
  filter(ID == 'TOT_TWI')

ref =  select(read_sf(o, "network"), hf_id, divide_id, areasqkm) %>% 
  left_join(rename(get_vaa('areasqkm'), s_areasqkm = areasqkm), by = c('hf_id' = 'comid')) %>% 
  filter(complete.cases(.))


usgs_c = open_dataset(usgs_c$s3_url) %>% 
  select(hf_id = COMID, TOT_TWI) %>% 
  filter(hf_id %in% ref$hf_id) %>% 
  collect() %>% 
  left_join(ref, by = "hf_id")

usgs_c = aggregate_zones(data = usgs_c, 
                     geom = NULL,  
                     crosswalk = select(usgs_c, hf_id, divide_id, areasqkm, s_areasqkm),
                     ID = "divide_id") 


div = left_join(read_sf(o, "divides"), usgs_c,  by = "divide_id")

plot(div["TOT_TWI"])


Adding NWM Gridded Data

Lastly, there are times when gridded data needs to be accessed. As a simple example, we have reformatted the NWM 3.0 data needed to run key NextGen modules. This has been saved in the gridded-resources directory of our s3 account.

This data can be downloaded and accessed by the divides in the subset network:

# Variables to extract (these names match the layers in the output below)
vars = c("slope", "ISLTYP", "IVGTYP")

r = rast(glue('{s3_base}nwm/conus/{vars}.tif'))
d = crop(r, project(vect(div), crs(r)), snap = "out", mask = TRUE)

plot(d)

r2 = correct_nwm_spatial(glue('{s3_base}ngen_gridded_data.nc'), lyrs = vars)
crop(r2, project(vect(div), crs(r2)), snap = "out")
#> class       : SpatRaster 
#> dimensions  : 60, 31, 3  (nrow, ncol, nlyr)
#> resolution  : 1000, 1000  (x, y)
#> extent      : 998000.4, 1029000, 238999.3, 298999.3  (xmin, xmax, ymin, ymax)
#> coord. ref. : Sphere_Lambert_Conformal_Conic 
#> source(s)   : memory
#> names       :       slope, ISLTYP, IVGTYP 
#> min values  : 0.007807857,      3,      1 
#> max values  : 0.297914475,     14,     16


Adding climateR-catalog summaries

climateR-catalogs is a monthly-updating archive of web-accessible data that currently catalogs over 100,000 datasets from over 2,000 providers.

This catalog can be accessed in JSON or parquet format and is used not only for NextGen activities, but by the USGS for a range of activities (e.g. gdptools).

The catalog is shipped with climateR, which provides generalized data access patterns for geospatial data (on the web and local).
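
Because the catalog ships as a data object with climateR (attached above via library(hydrofabric)), the filter below operates on it directly. A quick way to inspect it:

# `catalog` is a lazy-loaded tibble shipped with climateR
dplyr::glimpse(catalog)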

For this example, let's say we are interested in the mean sand percentage in our divides across the 6 depth layers provided by POLARIS.

First we would need to reduce the catalog:

cat = filter(catalog, id == "polaris", grepl('mean sand', varname))

#> # A tibble: 6 × 29
#>   id      asset URL      type  varname variable description units model ensemble
#>   <fct>   <chr> <chr>    <fct> <chr>   <fct>    <chr>       <chr> <chr> <chr>   
#> 1 polaris <NA>  /vsicur… vrt   mean s… mean sa… sand perce… %     <NA>  <NA>    
#> 2 polaris <NA>  /vsicur… vrt   mean s… mean sa… sand perce… %     <NA>  <NA>    
#> 3 polaris <NA>  /vsicur… vrt   mean s… mean sa… sand perce… %     <NA>  <NA>    
#> 4 polaris <NA>  /vsicur… vrt   mean s… mean sa… sand perce… %     <NA>  <NA>    
#> 5 polaris <NA>  /vsicur… vrt   mean s… mean sa… sand perce… %     <NA>  <NA>    
#> 6 polaris <NA>  /vsicur… vrt   mean s… mean sa… sand perce… %     <NA>  <NA>    
#> # ℹ 19 more variables: scenario <chr>, T_name <chr>, duration <chr>,
#> #   interval <chr>, nT <int>, X_name <chr>, Y_name <chr>, X1 <dbl>, Xn <dbl>,
#> #   Y1 <dbl>, Yn <dbl>, resX <dbl>, resY <dbl>, ncols <int>, nrows <int>,
#> #   crs <chr>, toptobottom <lgl>, tiled <fct>, dim_order <fct>

Then, access the gridded data for these resources and our AOI:

div = read_sf(o, "divides")
sand_cat = dap(catalog = cat, AOI = div)
names(sand_cat) = paste0('layer', 1:6)

#> class       : SpatRaster 
#> dimensions  : 1978, 1321, 6  (nrow, ncol, nlyr)
#> resolution  : 0.0002777778, 0.0002777778  (x, y)
#> extent      : -84.51528, -84.14833, 41.48556, 42.035  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#> source(s)   : memory
#> names       :    layer1,    layer2,    layer3,    layer4,    layer5,    layer6 
#> min values  :  5.791708,  5.620117,  4.984344,  3.536133,  2.602639,  3.738049 
#> max values  : 88.154297, 88.329582, 90.301567, 92.887917, 94.666504, 95.716583

And finally, determine the per-divide mean using zonal:

sand = execute_zonal(data = sand_cat, geom = div, ID = "id")


Adding with climateR shortcuts

In addition to the general dap utilities provided by climateR, the package provides a suite of "shortcuts" for common datasets. For example, we can use climateR to (1) extract and (2) process a 65-year, monthly data product in a matter of seconds:

div = read_sf(o, "divides")

## 65 years of monthly 4km data
system.time({
  tmax = getTerraClim(AOI = div,  varname = "tmax")  
})
#>    user  system elapsed 
#>   0.524   0.051  20.557

#> $tmax
#> class       : SpatRaster 
#> dimensions  : 14, 10, 780  (nrow, ncol, nlyr)
#> resolution  : 0.04166667, 0.04166667  (x, y)
#> extent      : -84.54167, -84.125, 41.45833, 42.04167  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=longlat +ellps=WGS84 +no_defs 
#> source(s)   : memory
#> names       : tmax_~total, tmax_~total, tmax_~total, tmax_~total, tmax_~total, tmax_~total, ... 
#> min values  :       -0.76,       -2.77,        5.05,       14.88,       21.57,       22.97, ... 
#> max values  :        0.04,       -1.90,        5.75,       15.91,       22.58,       23.94, ... 
#> unit        :        degC,        degC,        degC,        degC,        degC,        degC, ... 
#> time        : 1958-01-01 to 2022-12-01 UTC

Once accessed, the data can be summarized using zonal for the divides of interest:

system.time({
  z = execute_zonal(tmax, geom = div, ID = "id", fun = "mean")
})
#>    user  system elapsed 
#>   0.462   0.024   0.491

Adding alternative summaries

Let's say that, instead of a mean value for every month in the 65-year record, you wanted the mean monthly normal (e.g. mean January temperature). Since the data is in memory, this modification is quite simple:

system.time({
  # Define a month index (1-12, repeating) for the monthly layers
  ind = rep(1:12, times = nlyr(tmax[[1]]) / 12)
  # Average all layers belonging to each month
  x <- rast(lapply(1:12, function(i) mean(tmax[[1]][[which(ind == i)]])))
  names(x) = month.abb
  # Summarize to the divides
  z.mean = execute_zonal(x, geom = div, ID = "id", fun = "mean")
  z.max = execute_zonal(x, geom = div, ID = "id", fun = "max")
})
#>    user  system elapsed 
#>   0.285   0.007   0.296
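
As an aside, terra's tapp() collapses the index-and-average step above into one call, grouping layers by an index and applying a function per group; a sketch:

# Equivalent monthly normals: group the 780 layers by month and average
x = tapp(tmax[[1]], index = ind, fun = mean)
names(x) = month.abb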


Adding alternative shortcuts

In many cases, DEM data is needed. We store many variations of DEMs in the climateR-catalog, including 3DEP, NASADEM, GEBCO, and some OpenTopography sets.

Here, we quickly demonstrate accessing a 3DEP DEM for our divides:

dem = climateR::get3DEP(div)
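
Like any other grid on this page, the returned DEM can then be summarized to the divides with zonal; a sketch following the tmax pattern above:

# Mean elevation per divide
elev = execute_zonal(dem, geom = div, ID = "id", fun = "mean")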


Adding external data

While we cover a huge number of resources in the catalog, there are always datasets we don't. Fortunately, the dap utilities can read from remote, non-cataloged resources as well. For example, we can point to an OPeNDAP endpoint:

system.time({
  file <- dap(
    URL = "https://cida.usgs.gov/thredds/dodsC/bcsd_obs",
    AOI = div,
    startDate = "1995-01-01",
    verbose = FALSE,
    varname  = "pr"
  ) |>
    execute_zonal(geom = div, fun = "mean", ID = "id", join = TRUE)
})
#>    user  system elapsed 
#>   0.444   0.049   2.733


References

1 Johnson, J. M. (2022). National Hydrologic Geospatial Fabric (hydrofabric) for the Next Generation (NextGen) Hydrologic Modeling Framework, HydroShare, http://www.hydroshare.org/resource/129787b468aa4d55ace7b124ed27dbde

2 Wieczorek, M.E., Jackson, S.E., and Schwarz, G.E., 2018, Select Attributes for NHDPlus Version 2.1 Reach Catchments and Modified Network Routed Upstream Watersheds for the Conterminous United States (ver. 3.0, January 2021): U.S. Geological Survey data release, doi:10.5066/F7765D7V