
Ability to split / aggregate characteristics based on hydrofab refactor workflow. #303

Closed
dblodgett-usgs opened this issue Oct 14, 2022 · 31 comments

Comments

@dblodgett-usgs
Collaborator

I will document the idea later on, but we need to create a function to take attributes from one scale of catchments and split and/or aggregate them to another set of catchments.

@mikejohnson51
Contributor

haha, we are working towards this too - we are testing out the StreamCat data (see here: s3://nextgen-hydrofabric/streamcats/) with a zonal-esque approach that uses data.table/collapse:

https://github.com/mikejohnson51/zonal/blob/18ef9f684bfdf32f732efa69747023526e504803/R/weights.R#L68
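
The general pattern behind that weights-based approach is to precompute, for each target unit, the fraction of each source unit that falls inside it, and then take a weighted aggregate. A minimal sketch with made-up data (not the actual zonal code linked above):

library(data.table)

# toy weights table: fraction of each source unit (comid) inside each target id
w <- data.table(id    = c(1, 1, 2, 2, 2),
                comid = c(101, 102, 102, 103, 104),
                w     = c(1.0, 0.4, 0.6, 1.0, 1.0))

# toy attribute values on the source units
vals <- data.table(comid = 101:104, value = c(10, 20, 30, 40))

# weighted mean of `value` for each target id
w[vals, on = "comid"][, .(value_wtd = sum(value * w) / sum(w)), by = id]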

@mjcashman

Definitely interested in this idea - I was already thinking about how to do this for estimating Mike W's upstream accumulated basin characteristics, but at a split bottom catchment. We have a potential need/use case here.

@lekoenig

lekoenig commented Nov 16, 2022

For PUMP-ML workflows we've aggregated characteristics to the NHM reference fabric. We haven't dealt with split catchments, but I'll start by describing our workflow for aggregation using just the tabular data.

library(tidyverse)
library(nhdplusTools)

# First, define a function to grab characteristics of interest. So far we've been downloading the data ourselves,
# but I'll use nhdplusTools helpers in the future now that I know about their capabilities for interfacing with the NLDI.

#' @param varname character vector of desired variables retrieved from \link{discover_nldi_characteristics}
#' @param ids numeric vector of identifiers from the specified reference fabric
#' @param reference_fabric (not used) will be used to allow future specification of alternate reference fabrics
#' @importFrom dplyr bind_rows filter select
#' @importFrom tidyselect everything
get_catchment_characteristics <- function(varname, ids, reference_fabric = "nhdplusv2"){
  nldi_vars <- bind_rows(
    lapply(ids, function(x){
      nldi_ft <- list(featureSource = "comid", featureID = x)
      nldi_vars_ft <- bind_rows(get_nldi_characteristics(nldi_ft, type = "all"))
      nldi_vars_ft_subset <- filter(nldi_vars_ft, characteristic_id %in% varname)
      nldi_vars_ft_out <- select(mutate(nldi_vars_ft_subset,
                                        comid = as.numeric(x),
                                        characteristic_value = as.numeric(characteristic_value)),
                                 comid, everything())
      return(nldi_vars_ft_out)
    })
  )
  return(nldi_vars)
}

# Next, define a function to aggregate characteristics assuming we have a crosswalk of ids between the two fabrics
# and preferences from the user for which summary statistics to use when aggregating.

#' @param vars data frame containing two columns: "characteristic_id", which is a character vector of desired 
#' variables retrieved from \link{discover_nldi_characteristics}; and "aggregation_statistic", which is a character 
#' vector indicating the summary statistic that should be applied to each variable. Acceptable values for 
#' "aggregation_statistic" include "sum", "length_weighted_mean", "area_weighted_mean", "min", and "max".
#' @param ids_xwalk data frame containing two columns: "id", which is a numeric vector of identifiers at the 
#' desired scale and "comid" which is a numeric vector of NHDPlusv2 identifiers.
#' @importFrom sf st_drop_geometry
#' @importFrom dplyr select left_join group_by summarize ungroup across
#' @importFrom tidyr pivot_wider
#' @importFrom tidyselect any_of
aggregate_catchment_characteristics <- function(vars, ids_xwalk){

  # download nldi characteristics for the requested comids
  nhd_vars <- get_catchment_characteristics(varname = vars$characteristic_id,
                                            ids = ids_xwalk$comid)

  # download nhdplus subset so that we have areasqkm and lengthkm
  # cols (there's probably a slicker way to do this)
  nhd_lines <- sf::st_drop_geometry(get_nhdplus(comid = c(ids_xwalk$comid), realization = "flowline"))
  nhd_subset <- select(nhd_lines, comid, areasqkm, lengthkm)

  # format characteristics data frame
  nhd_subset_w_vars <- left_join(x = left_join(x = nhd_subset, y = nhd_vars, by = "comid"),
                                 y = ids_xwalk, by = "comid")
  nhd_subset_w_vars_wide <- pivot_wider(nhd_subset_w_vars, !percent_nodata,
                                        names_from = characteristic_id,
                                        values_from = characteristic_value)

  # assign columns based on the desired summary operation
  cols_sum <- vars$characteristic_id[vars$aggregation_statistic == "sum"]
  cols_area_wtd_mean <- vars$characteristic_id[vars$aggregation_statistic == "area_weighted_mean"]
  cols_length_wtd_mean <- vars$characteristic_id[vars$aggregation_statistic == "length_weighted_mean"]
  cols_min <- vars$characteristic_id[vars$aggregation_statistic == "min"]
  cols_max <- vars$characteristic_id[vars$aggregation_statistic == "max"]

  # aggregate NHDPlusv2 attributes to desired catchments
  nhd_vars_aggregated <- nhd_subset_w_vars_wide |>
    group_by(id) |>
    summarize(
      areasqkm_sum = sum(areasqkm),
      lengthkm_sum = sum(lengthkm),
      across(any_of(cols_area_wtd_mean), weighted.mean, w = areasqkm, na.rm = TRUE, .names = "{col}_area_wtd"),
      across(any_of(cols_length_wtd_mean), weighted.mean, w = lengthkm, na.rm = TRUE, .names = "{col}_length_wtd"),
      across(any_of(cols_sum), sum, na.rm = TRUE, .names = "{col}_sum"),
      across(any_of(cols_min), min, na.rm = TRUE, .names = "{col}_min"),
      across(any_of(cols_max), max, na.rm = TRUE, .names = "{col}_max")
    ) |>
    ungroup()

  return(nhd_vars_aggregated)
}

# Apply the functions above to an example. 
# define the desired characteristics and their corresponding summary statistics
vars <- data.frame(characteristic_id = c("CAT_EWT","CAT_WILDFIRE_2011"),
                   aggregation_statistic = c("area_weighted_mean", "min"))

# we use a crosswalk that tells us how the ids from the desired fabric (e.g. NHM fabric)
# correspond to NHDPlusv2 ids. Include crosswalk for one segment as an example:
catchment_xwalk <- data.frame(id = rep(1637, 8),
                              comid = c(4146594, 4146596, 4147370, 4147376,
                                        4147378, 4147380, 4147382, 4147396))
> aggregate_catchment_characteristics(vars, catchment_xwalk)
# A tibble: 1 x 5
     id areasqkm_sum lengthkm_sum CAT_EWT_area_wtd CAT_WILDFIRE_2011_min
  <dbl>        <dbl>        <dbl>            <dbl>                 <dbl>
1  1637         47.3         23.1            -23.3                     0
> 

Is this roughly what you had in mind? The use case to split catchments is less obvious to me right now given the tabular data, I'll check out the zonal stats approach Mike linked to above.

@dblodgett-usgs
Collaborator Author

This is exactly what I had in mind. The catchment splitting is going to require a little bit more in the main pipeline and we'll have to bring in a lookup table, but what you've got here is most of it and gives me a good sense for how this should work.
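
For reference, the kind of lookup table being described ends up looking roughly like this - whole catchments map one-to-one, while split catchments carry a suffixed member COMID (the values here are illustrative, taken from the worked example further down this thread):

# illustrative lookup rows: two whole catchments and one piece of a split catchment
lookup_example <- data.frame(
  ID           = c(10012268, 10012268, 10024047),
  comid        = c(4146596, 4147382, 4147396),
  member_comid = c("4146596", "4147382", "4147396.1")
)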

@ajsekell

I've got a script (below) running for crosswalking nhdv2 catchment-level variables to the nhgf refactor. It aggregates and area-weights split catchments when it needs to, using sum or mean. I tested with basin area and the numbers add up, even with nhgf divides that include split catchments.

Dave B graciously provided me with a refactor nhgf that includes a split_divides layer so we can get those split catchment areas. That was the missing piece.

So that's the good news; the bad news is that it is ridiculously slow. I'm doing everything in an admittedly tedious way just to get the concepts working.

Since I'm not a whiz at "doing things faster"...perhaps someone can have a look and see a way to optimize this? Like...a lot? It's so slow. And maybe test another variable or two?

library(sf)
library(beepr)
library(dplyr)
library(nhdplusTools)
library(sbtools)

#make a crosswalk from data compiled for nhdv2 catchments to nhgf
#works great! just unbearably slow.
#so i guess that makes it not great.
#
#11/1/2022 andrew sekellick
#updated 11/17/2022
#
#for nhdv2 to nhgf we have aggregated catchments and split catchments (or both, or just 1)
#
#this code works on the refactor nhgf and requires the version that includes the "split_divide" layer
#
#using nhdplusTools::get_nldi_characteristics() to get local basin area from the web is 
#not much different in speed than having a table on file. possibly even faster this way?
#i realize with the test variable being basin area i could have just used that but whatever, i'm trying stuff. 
#this way you don't have to keep basin area on hand for other variables
#
#using catchment area as test variable and the final numbers look good.

workdir <- "C://nhgf"

#i'm using basin area as my test variable for crosswalking
#get it from here: https://www.sciencebase.gov/catalog/item/57976a0ce4b021cadec97890
#or use the snippet below to download and unzip
item_file_download('57976a0ce4b021cadec97890', destinations = file.path(workdir, 'basin_char.zip'), names = 'BASIN_CHAR_CAT_CONUS.zip')
unzip(file.path(workdir, 'basin_char.zip'), exdir = workdir)
file.remove(file.path(workdir, 'basin_char.zip'))

#test variable
bareavar <- read.csv(file.path(workdir, 'BASIN_CHAR_CAT_CONUS.txt')) %>% 
  subset(select = c(COMID, CAT_BASIN_AREA))

#the nhgf refactor geopackage, stored locally as a dataframe.
#get it from here: https://www.sciencebase.gov/catalog/item/61fbfdced34e622189cb1b0a
#i was having a look at geometry, but I don't really need it.
divideswithgeom <- as.data.frame(st_read("C:\\nhgf\\refactor_02.gpkg", layer='refactored_divides'))
divides <- subset(divideswithgeom, select = -c(geom))

#this is the split divides layer
splitdivideswithgeom <- as.data.frame(st_read("C:\\nhgf\\refactor_02.gpkg", layer='split_divides'))
splitdivides <- subset(splitdivideswithgeom, select = -c(geom))

#choose sum or avg method (don't want to use sum for temperature).
#i haven't really tested avg
method <- "sum"
#method <- "avg"

starttime <- Sys.time()

#making an empty data frame to fill in later.
alldata <- data.frame()

#loop through all rows in the nhgf divides layer. 
#you can do a test run with x number of rows.

for (row in 1:100)
  #for (row in 1:nrow(divides))
{
  print(row)
  #select the nhgf row
  tmpdivide <- divides[row,]
  #split the member_COMID column by commas to get list of COMIDs in the selected NHGF catchment
  tmpmembers <- unlist(strsplit(tmpdivide$member_COMID, ","))
  #making an empty dataframe
  tmptotal <- data.frame() 
  #this loop checks each item in the list of COMIDs per nhgf divide
  for (membs in tmpmembers)
  {
    if (grepl(".", membs, fixed = TRUE) == TRUE)
    {
      #the if looks for a ".", which means it's a split catchment. 
      #so we need to split to get that COMID value.
      testsplit <- unlist(strsplit(membs, ".", fixed = TRUE))
      testcomid <- testsplit[1]
      #this next chunk gets the total comid catchment area from nhdplustools online
      unsplit_area <-
        as.data.frame(get_nldi_characteristics(list(featureSource = "comid", featureID = testcomid), type = "local")) %>%
        filter(local.characteristic_id == "CAT_BASIN_AREA") %>%
        select(local.characteristic_value) %>%
        unlist()
      #this gets the area of just the split catchment, thanks to Dave B for providing this.
      split_area <- filter(splitdivides, comid_part == membs)
      if (method == "avg")
      {
        newvalue <- (filter(bareavar, COMID == testcomid))
      } else if (method == "sum")
      {
        #this is the area weighting part! 
        #it uses area of split_catchment divided by area of full nhd catchment, multiplied by the variable value
        newvalue <- (filter(bareavar, COMID == testcomid)) * (as.numeric(split_area) / as.numeric(unsplit_area))
      }
      #tmptotal is the total from all the member comids in that nhgf divide.
      #so it can be a couple of rows per divide
      tmptotal <- bind_rows(tmptotal, newvalue)
    }
    else
      #the else here is for unsplit catchments (no ".")
    {
      newvalue <- filter(bareavar, COMID == membs)
      tmptotal <- bind_rows(tmptotal, newvalue)
    }
  }
  #now we're summarizing all the pieces to get 1 value per nhgf catchment
  if (method == "avg") 
  {
    tmptotalsum <- summarize_all(tmptotal, mean)
  } else if ( method == "sum") 
  {
    tmptotalsum <- summarize_all(tmptotal, sum)
  }
  tmptotalsum <- subset(tmptotalsum, select = -c(COMID))
  tmpdivide <- as.data.frame(c(tmpdivide, tmptotalsum))
  #add the row that's the result of this loop to dataframe.
  alldata <- bind_rows(alldata, tmpdivide)
}

endtime <- Sys.time()

endtime - starttime


#beep
beep(2)

@ajsekell

Here's a link to the refactor gpkg that includes the split_divide layer:
https://doimspp-my.sharepoint.com/:u:/g/personal/ajsekell_usgs_gov/Effo4XP4-fpOnktiz-Ay_ywBeR-B903AR3ag07N_HK2fXA?e=1lPLJX

(Dave, I assume it's okay to share that.)

@lekoenig

lekoenig commented Jan 4, 2023

OK, I took a look at the refactored hydrofabric and @ajsekell's code and came up with the following solution to rescale (i.e., split or aggregate) catchment characteristics:

# The code below calls a function that grabs the catchment characteristics. I haven't defined
# the function here, but it is included in my previous comment within this GitHub issue.

# Define a function to rescale the catchment attributes using the tabular data.
# This function is a combination of Andrew Sekellick's workflow to accommodate
# split catchments and a function I previously included in this GitHub issue
# to aggregate NHDPlusv2 attributes to new fabrics.

#' @param vars data frame containing two columns: "characteristic_id" is a character
#' vector of desired variables to retrieve from \link{discover_nldi_characteristics};
#' "summary_statistic", is a character vector indicating which summary statistic
#' should be applied to rescale each characteristic. Accepted values include
#' "sum," "length_weighted_mean," "area_weighted_mean," "min," and "max."
#' @param lookup_table data frame containing at least three columns: "ID" is a
#' numeric vector of identifiers at the desired scale; "comid" is a numeric vector
#' of NHDPlusv2 identifiers; "member_comid" contains the formatted NHDPlusv2 COMIDs,
#' for example, in the case of split catchments. If catchments have not been split,
#' the columns "comid" and "member_comid" should be identical. 
#' @param refactored_areas data frame containing columns "FEATUREID" and "areasqkm."
#' Used to retrieve adjusted catchment areas in the case of split catchments.
#'
#' @importFrom dplyr left_join rename_with mutate across group_by summarize ungroup distinct
#' @importFrom tidyr pivot_wider
#' @importFrom tidyselect starts_with any_of
#'
rescale_nldi_characteristics <- function(vars, lookup_table, refactored_areas){

  # check that the inputs match what we are expecting
  if(!all(c("characteristic_id", "summary_statistic") %in% names(vars))){
    stop("Check that vars contains columns 'characteristic_id' and 'summary_statistic'")
  }
  summary_stat_acceptable <- vars$summary_statistic %in% c("min", "max", "sum",
                                                           "length_weighted_mean",
                                                           "area_weighted_mean")
  if(!all(summary_stat_acceptable)){
    stop("Check that all entries in vars$summary_statistic match accepted values")
  }
  if(!all(c("comid", "member_comid", "ID") %in% names(lookup_table))){
    stop("Check that lookup_table contains columns 'comid,' 'member_comid,' and 'ID'")
  }

  # omit any duplicated rows in the lookup table
  lookup_table <- distinct(lookup_table)

  # download nldi characteristics for the requested comids and pivot to wide format
  nhd_vars <- get_catchment_characteristics(varname = vars$characteristic_id,
                                            ids = unique(lookup_table$comid))
  nhd_vars_wide <- pivot_wider(nhd_vars, !percent_nodata,
                               names_from = characteristic_id,
                               values_from = characteristic_value)

  # get comid catchment areas, adjusting area for catchments that have been "split"
  nhd_subset <- get_catchment_areas(unique(lookup_table$member_comid), refactored_areas)

  # combine the nldi characteristics with the catchment identifier and basin area
  cat_vars_unscaled <- lookup_table |>
    left_join(y = nhd_subset, by = c("member_comid","comid")) |>
    left_join(y = nhd_vars_wide, by = "comid")

  # rescale the nldi characteristics if needed (i.e., for split catchments)
  if(!all(cat_vars_unscaled$comid == cat_vars_unscaled$member_comid)){
    var_names <- unique(nhd_vars$characteristic_id)
    cat_vars_scaled <- mutate(cat_vars_unscaled, across(any_of(var_names), ~.x*split_area_prop))
  } else {
    cat_vars_scaled <- cat_vars_unscaled
  }

  # assign columns based on the desired summary operation
  cols_sum <- vars$characteristic_id[vars$summary_statistic == "sum"]
  cols_area_wtd_mean <- vars$characteristic_id[vars$summary_statistic == "area_weighted_mean"]
  cols_length_wtd_mean <- vars$characteristic_id[vars$summary_statistic == "length_weighted_mean"]
  cols_min <- vars$characteristic_id[vars$summary_statistic == "min"]
  cols_max <- vars$characteristic_id[vars$summary_statistic == "max"]

  # rescale NHDPlusv2 attributes to desired catchments
  # !note that there are currently no adjustments made for the length of split flowlines
  nhd_vars_rescaled <- cat_vars_scaled |>
    group_by(ID) |>
    summarize(
      areasqkm_sum = sum(split_basin_areasqkm),
      lengthkm_sum = sum(lengthkm),
      across(any_of(cols_area_wtd_mean), weighted.mean, w = split_basin_areasqkm, na.rm = TRUE, .names = "{col}_area_wtd"),
      across(any_of(cols_length_wtd_mean), weighted.mean, w = lengthkm, na.rm = TRUE, .names = "{col}_length_wtd"),
      across(any_of(cols_sum), sum, na.rm = TRUE, .names = "{col}_sum"),
      across(any_of(cols_min), min, na.rm = TRUE, .names = "{col}_min"),
      across(any_of(cols_max), max, na.rm = TRUE, .names = "{col}_max")
    ) |>
    ungroup()

  return(nhd_vars_rescaled)
}

# Define the function `get_catchment_areas` that's called in `rescale_nldi_characteristics`

#' @title Get basin area and flowline length for NHDPlusV2 COMIDs
#'
#' @description
#' Get catchment area and flowline length for NHDPlusv2 COMID(s) of interest.
#' If any COMID represents a split catchment, the split catchment areas are
#' returned, along with a column that represents the proportion of the original
#' NHDPlusv2 catchment area that is covered by the split catchment.
#'
#' @param comids integer vector or character vector containing NHDPlusV2 identifiers
#' @param refactored_areas data frame containing columns "FEATUREID" and
#' "areasqkm." Used to retrieve adjusted catchment areas in the case of split
#' catchments.
#'
#' @importFrom sf st_drop_geometry
#' @importFrom dplyr mutate select right_join left_join filter rename bind_rows
#'
get_catchment_areas <- function(comids, refactored_areas){

  # format comids (omit suffix "." in the case of split catchments)
  comids_fmt <- data.frame(member_comid = comids) |>
    mutate(comid = gsub("\\..*", "", member_comid))

  # fetch basin area for all comids
  nhd_lines <- sf::st_drop_geometry(get_nhdplus(comid = c(unique(comids_fmt$comid))))
  nhd_subset <- nhd_lines |>
    mutate(comid = as.character(comid)) |>
    select(comid, areasqkm, lengthkm) |>
    right_join(x = comids_fmt, by = "comid")

  # handle "split" catchments (if applicable)
  if(all(comids_fmt$member_comid == comids_fmt$comid)){
    nhd_subset_out <- nhd_subset |>
      mutate(comid = as.integer(comid))
  } else {
    # create data frame containing the "split" catchments
    comids_to_split <- filter(nhd_subset, grepl(".", member_comid, fixed = TRUE))

    # get the basin area for the split catchments
    split_areas <- refactored_areas |>
      filter(FEATUREID %in% comids_to_split$member_comid) |>
      rename(split_basin_areasqkm = areasqkm)
    split_areas_fmt <- left_join(x = comids_to_split, y = split_areas, by = c("member_comid" = "FEATUREID"))

    # create data frame containing the unsplit catchments
    comids_no_split <- filter(nhd_subset, !grepl(".", member_comid, fixed = TRUE))
    unsplit_areas_fmt <- mutate(comids_no_split, split_basin_areasqkm = areasqkm)

    # bind split and unsplit catchments back together
    nhd_subset_out <- bind_rows(unsplit_areas_fmt, split_areas_fmt) |>
      mutate(comid = as.integer(comid),
             split_area_prop = split_basin_areasqkm/areasqkm)
  }

  return(nhd_subset_out)
}


# Apply the functions above to an example.

# First, define the attributes of interest and how they should be
# summarized when aggregated or split.
vars <- data.frame(characteristic_id = c("CAT_EWT","CAT_WILDFIRE_2011", "CAT_BASIN_AREA"),
                   summary_statistic = c("area_weighted_mean", "min","sum"))

# Next, get the lookup table from the refactored hydrofabric that
# contains information about how NHDPlusv2 catchments are split.
file_name <- "refactor_02.gpkg"
file_path <- file.path(tempdir(), file_name)
gpkg_file <- sbtools::item_file_download("61fbfdced34e622189cb1b0a",
                                         destinations = file_path,
                                         names = file_name,
                                         overwrite_file = TRUE)
lookup_table <- sf::st_read(gpkg_file, layer= "lookup_table") |>
  rename(comid = NHDPlusV2_COMID, member_comid = member_COMID, ID = reconciled_ID) |>
  # subset lookup table for this example
  filter(ID %in% c(10012268, 10012979, 10024047, 10024048, 10024049, 10024050))
lookup_table
> lookup_table
    comid member_comid       ID LevelPathID
1 4146596      4146596 10012268     2132667
2 4147382      4147382 10012268     2132667
3 4147378      4147378 10012979     2130154
4 4147370      4147370 10012979     2130154
5 4147396    4147396.1 10024047     2123206
6 4147396    4147396.2 10024048     2123206
7 4147380    4147380.1 10024049     2123206
8 4147380    4147380.2 10024050     2123206

# We will also need the dataset "split_divides" that gives the basin
# area for the split catchments.
split_divides <- sf::st_read(gpkg_file, layer = "split_divides")

# rescale the characteristics to the refactored hydrofabric
rescale_nldi_characteristics(vars, lookup_table, refactored_areas = split_divides)
># A tibble: 6 x 6
        ID areasqkm_sum lengthkm_sum CAT_EWT_area_wtd CAT_BASIN_AREA_sum CAT_WILDFIRE_2011_min
     <dbl>        <dbl>        <dbl>            <dbl>              <dbl>                 <dbl>
1 10012268        12.9          6.30           -25.5               12.9                      0
2 10012979         4.78         3.31           -11.9                4.77                     0
3 10024047         6.51         4.31           -24.6                6.52                     0
4 10024048         1.44         4.31            -5.44               1.44                     0
5 10024049         5.84         5.10           -10.8                5.84                     0
6 10024050         7.56         5.10           -14.0                7.56                     0
>

I downloaded the refactored hydrofabric from ScienceBase and tested this function on the same subset of 100 IDs that @ajsekell used, and got approximately the same values for the summed CAT_BASIN_AREA (within rounding error; r2 = 1, slope = 1, intercept = 0). I'd appreciate any help with further testing though, especially for split catchments - @ajsekell or @mjcashman-usgs, maybe you have some examples in mind?

Re: run time, this still isn't super quick, and rescaling 100 IDs takes ~3 minutes. We may be able to speed that up, but some initial profiling suggests most of the time is spent downloading the NLDI characteristics.
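
One way to check that is to time the download step separately (a quick sketch using the objects defined above):

# compare time spent downloading characteristics vs. the full rescaling call
system.time(
  get_catchment_characteristics(varname = vars$characteristic_id,
                                ids = unique(lookup_table$comid))
)
system.time(
  rescale_nldi_characteristics(vars, lookup_table, refactored_areas = split_divides)
)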

@dblodgett-usgs
Collaborator Author

dblodgett-usgs commented Jan 5, 2023

This is great! I'm almost wrapped up with the hydroloom refactor and will turn attention to this as soon as I can.

@lekoenig

lekoenig commented Jan 5, 2023

Sounds good, thanks @dblodgett-usgs!

@ajsekell

ajsekell commented Jan 6, 2023

Awesome! That's almost a minute and a half faster than my script for 100 IDs.

I'll try to run yours with something like temperature to test a weighted average instead of sum but won't get to it until Monday at the earliest.

Sort of on topic: for aggregating up to a POI hydrofabric, we'd need to aggregate to the refactor fabric first (with this code) and then aggregate again to the POI fabric, right? I don't think we could easily go directly from NHD to a POI fabric...

@dblodgett-usgs
Collaborator Author

Cool -- progress.

I refactored your example a bit and think I have a good base implementation now, @lekoenig.

Note that I pulled out a standalone function that runs the actual statistics and kept all the data munging separate. I'll try to find some time to work this into nhdplusTools in the next few days, but this function layout should do the trick if you all want to try scaling this up.

library(tidyverse)
library(nhdplusTools)

get_characteristics_metadata <- function() {

  u <- "https://prod-is-usgs-sb-prod-publish.s3.amazonaws.com/5669a79ee4b08895842a1d47/metadata_table.tsv"

  f <- file.path(nhdplusTools_data_dir(), "metadata_table.tsv")

  if(!file.exists(f)) resp <- httr::RETRY("GET", u, httr::write_disk(f))

  read.delim(f, sep = "\t")
}

#' @param varname character vector of desired variables
#' @param ids numeric vector of identifiers from the specified reference fabric
#' @param reference_fabric (not used) will be used to allow future specification of alternate reference fabrics
#' @importFrom dplyr bind_rows filter select
#' @importFrom tidyselect everything
get_catchment_characteristics <- function(varname, ids, reference_fabric = "nhdplusv2"){

  metadata <- get_characteristics_metadata()

  bind_rows(
    lapply(varname, function(x) {
      i <- metadata[metadata$ID == x,]

      id <- gsub("https://www.sciencebase.gov/catalog/item/", "", i$datasetURL)

      bucket <- arrow::s3_bucket("s3://prod-is-usgs-sb-prod-publish", anonymous = TRUE, region = "us-west-2")

      end <- ifelse(grepl("CAT", x, ), "_cat.parquet",
                    ifelse(grep("TOT", x), "_tot_parquet",
                           "_acc.parquet"))

      ds <- arrow::open_dataset(paste0("s3://anonymous@prod-is-usgs-sb-prod-publish/",
                                       id, "/", id, end, "?region=us-west-2"))

      sub <- filter(ds, COMID %in% ids)

      att <- collect(sub)

      att$characteristic_id <- x
      att$percent_nodata <- 0

      select(att, all_of(c(characteristic_id = "characteristic_id", comid = "COMID",
                           characteristic_value = x, percent_nodata = "percent_nodata")))
    })
  )
}


rescale_characteristics <- function(vars, cat_vars_scaled) {
  # assign columns based on the desired summary operation
  cols_sum <- vars$characteristic_id[vars$summary_statistic == "sum"]
  cols_area_wtd_mean <- vars$characteristic_id[vars$summary_statistic == "area_weighted_mean"]
  cols_length_wtd_mean <- vars$characteristic_id[vars$summary_statistic == "length_weighted_mean"]
  cols_min <- vars$characteristic_id[vars$summary_statistic == "min"]
  cols_max <- vars$characteristic_id[vars$summary_statistic == "max"]
  
  # rescale NHDPlusv2 attributes to desired catchments
  # !note that there are currently no adjustments made for the length of split flowlines
  nhd_vars_rescaled <- cat_vars_scaled |>
    group_by(ID) |>
    summarize(
      areasqkm_sum = sum(split_basin_areasqkm),
      lengthkm_sum = sum(lengthkm),
      across(any_of(cols_area_wtd_mean), weighted.mean, w = split_basin_areasqkm, na.rm = TRUE, .names = "{col}_area_wtd"),
      across(any_of(cols_length_wtd_mean), weighted.mean, w = lengthkm, na.rm = TRUE, .names = "{col}_length_wtd"),
      across(any_of(cols_sum), sum, na.rm = TRUE, .names = "{col}_sum"),
      across(any_of(cols_min), min, na.rm = TRUE, .names = "{col}_min"),
      across(any_of(cols_max), max, na.rm = TRUE, .names = "{col}_max")
    ) |>
    ungroup()
  
  return(nhd_vars_rescaled)
}

# The code below calls a function that grabs the catchment characteristics. I haven't defined
# the function here, but it is included in my previous comment within this GitHub issue.

# Define a function to rescale the catchment attributes using the tabular data.
# This function is a combination of Andrew Sekellick's workflow to accommodate
# split catchments and a function I previously included in this GitHub issue
# to aggregate NHDPlusv2 attributes to new fabrics.

#' @param vars data frame containing two columns: "characteristic_id" is a character
#' vector of desired variables to retrieve from \link{discover_nldi_characteristics};
#' "summary_statistic", is a character vector indicating which summary statistic
#' should be applied to rescale each characteristic. Accepted values include
#' "sum," "length_weighted_mean," "area_weighted_mean," "min," and "max."
#' @param lookup_table data frame containing at least three columns: "ID" is a
#' numeric vector of identifiers at the desired scale; "comid" is a numeric vector
#' of NHDPlusv2 identifiers; "member_comid" contains the formatted NHDPlusv2 COMIDs,
#' for example, in the case of split catchments. If catchments have not been split,
#' the columns "comid" and "member_comid" should be identical.
#' @param refactored_areas data frame containing columns "FEATUREID" and "areasqkm."
#' Used to retrieve adjusted catchment areas in the case of split catchments.
#'
#' @importFrom dplyr left_join rename_with mutate across group_by summarize ungroup distinct
#' @importFrom tidyr pivot_wider
#' @importFrom tidyselect starts_with any_of
#'
rescale_nldi_characteristics <- function(vars, lookup_table, refactored_areas){

  # check that the inputs match what we are expecting
  if(!all(c("characteristic_id", "summary_statistic") %in% names(vars))){
    stop("Check that vars contains columns 'characteristic_id' and 'summary_statistic'")
  }
  summary_stat_acceptable <- vars$summary_statistic %in% c("min", "max", "sum",
                                                           "length_weighted_mean",
                                                           "area_weighted_mean")
  if(!all(summary_stat_acceptable)){
    stop("Check that all entries in vars$summary_statistic match accepted values")
  }
  if(!all(c("comid", "member_comid", "ID") %in% names(lookup_table))){
    stop("Check that lookup_table contains columns 'comid,' 'member_comid,' and 'ID'")
  }

  # omit any duplicated rows in the lookup table
  lookup_table <- distinct(lookup_table)

  # download nldi characteristics for the requested comids and pivot to wide format
  nhd_vars <- get_catchment_characteristics(varname = vars$characteristic_id,
                                            ids = unique(lookup_table$comid))

  nhd_vars_wide <- pivot_wider(nhd_vars, !percent_nodata,
                               names_from = characteristic_id,
                               values_from = characteristic_value)

  # get comid catchment areas, adjusting area for catchments that have been "split"
  nhd_subset <- get_catchment_areas(unique(lookup_table$member_comid), refactored_areas)

  # combine the nldi characteristics with the catchment identifier and basin area
  cat_vars_unscaled <- lookup_table |>
    left_join(y = nhd_subset, by = c("member_comid","comid")) |>
    left_join(y = nhd_vars_wide, by = "comid")

  # rescale the nldi characteristics if needed (i.e., for split catchments)
  if(!all(cat_vars_unscaled$comid == cat_vars_unscaled$member_comid)){
    var_names <- unique(nhd_vars$characteristic_id)
    cat_vars_scaled <- mutate(cat_vars_unscaled, across(any_of(var_names), ~.x*split_area_prop))
  } else {
    cat_vars_scaled <- cat_vars_unscaled
  }

  return(rescale_characteristics(vars, cat_vars_scaled))
}

# Define the function `get_catchment_areas` that's called in `rescale_nldi_characteristics`

#' @title Get basin area and flowline length for NHDPlusV2 COMIDs
#'
#' @description
#' Get catchment area and flowline length for NHDPlusv2 COMID(s) of interest.
#' If any COMID represents a split catchment, the split catchment areas are
#' returned, along with a column that represents the proportion of the original
#' NHDPlusv2 catchment area that is covered by the split catchment.
#'
#' @param comids integer vector or character vector containing NHDPlusV2 identifiers
#' @param refactored_areas data frame containing columns "FEATUREID" and
#' "areasqkm." Used to retrieve adjusted catchment areas in the case of split
#' catchments.
#'
#' @importFrom sf st_drop_geometry
#' @importFrom dplyr mutate select right_join left_join filter rename bind_rows
#'
get_catchment_areas <- function(comids, refactored_areas){

  # format comids (omit suffix "." in the case of split catchments)
  comids_fmt <- data.frame(member_comid = comids) |>
    mutate(comid = as.integer(member_comid))

  # fetch basin area for all comids
  nhd_subset <- nhdplusTools::get_vaa(atts = c("comid", "areasqkm", "lengthkm")) |>
    select(comid, areasqkm, lengthkm) |>
    right_join(x = comids_fmt, by = "comid")

  # handle "split" catchments (if applicable)
  if(all(comids_fmt$member_comid == as.character(comids_fmt$comid))) {
    nhd_subset_out <- nhd_subset |>
      mutate(comid = as.integer(comid))
  } else {
    # create data frame containing the "split" catchments
    comids_to_split <- filter(nhd_subset, grepl(".", member_comid, fixed = TRUE))

    # get the basin area for the split catchments
    split_areas <- refactored_areas |>
      filter(FEATUREID %in% comids_to_split$member_comid) |>
      rename(split_basin_areasqkm = areasqkm)

    split_areas_fmt <- left_join(x = comids_to_split,
                                 y = split_areas,
                                 by = c("member_comid" = "FEATUREID"))

    # create data frame containing the unsplit catchments
    comids_no_split <- filter(nhd_subset, !grepl(".", member_comid, fixed = TRUE))
    unsplit_areas_fmt <- mutate(comids_no_split, split_basin_areasqkm = areasqkm)

    # bind split and unsplit catchments back together
    nhd_subset_out <- bind_rows(unsplit_areas_fmt, split_areas_fmt) |>
      mutate(comid = as.integer(comid),
             split_area_prop = split_basin_areasqkm/areasqkm)
  }

  return(nhd_subset_out)
}


# Apply the functions above to an example.

# First, define the attributes of interest and how they should be
# summarized when aggregated or split.
vars <- data.frame(characteristic_id = c("CAT_EWT","CAT_WILDFIRE_2011", "CAT_BASIN_AREA"),
                   summary_statistic = c("area_weighted_mean", "min","sum"))

# Next, get the lookup table from the refactored hydrofabric that
# contains information about how NHDPlusv2 catchments are split.
file_name <- "refactor_02.gpkg"
file_path <- file.path(tempdir(), file_name)
gpkg_file <- sbtools::item_file_download("61fbfdced34e622189cb1b0a",
                                         destinations = file_path,
                                         names = file_name,
                                         overwrite_file = TRUE)
lookup_table <- sf::st_read(gpkg_file, layer= "lookup_table") |>
  dplyr::rename(comid = NHDPlusV2_COMID, member_comid = member_COMID, ID = reconciled_ID) |>
  # subset lookup table for this example
  dplyr::filter(ID %in% c(10012268, 10012979, 10024047, 10024048, 10024049, 10024050))
lookup_table
# > lookup_table
# comid member_comid       ID LevelPathID
# 1 4146596      4146596 10012268     2132667
# 2 4147382      4147382 10012268     2132667
# 3 4147378      4147378 10012979     2130154
# 4 4147370      4147370 10012979     2130154
# 5 4147396    4147396.1 10024047     2123206
# 6 4147396    4147396.2 10024048     2123206
# 7 4147380    4147380.1 10024049     2123206
# 8 4147380    4147380.2 10024050     2123206

# We will also need the dataset "split_divides" that gives the basin
# area for the split catchments.
split_divides <- sf::st_read(gpkg_file, layer = "split_divides")

# rescale the characteristics to the refactored hydrofabric
rescale_nldi_characteristics(vars, lookup_table, refactored_areas = split_divides)
# A tibble: 6 x 6
#   ID areasqkm_sum lengthkm_sum CAT_EWT_area_wtd CAT_BASIN_AREA_sum CAT_WILDFIRE_2011_min
# <dbl>        <dbl>        <dbl>            <dbl>              <dbl>                 <dbl>
#   1 10012268        12.9          6.30           -25.5               12.9                      0
# 2 10012979         4.78         3.31           -11.9                4.77                     0
# 3 10024047         6.51         4.31           -24.6                6.52                     0
# 4 10024048         1.44         4.31            -5.44               1.44                     0
# 5 10024049         5.84         5.10           -10.8                5.84                     0
# 6 10024050         7.56         5.10           -14.0                7.56                     0
# >

@lekoenig

Sort of on topic, for aggregating up to a POI hydrofabric, we'd need to aggregate to the refactor fabric first (with this code) and then aggregate again to the POI fabric right? Don't think we could easily go directly from NHD to a POI fabric...

I like these edits, @dblodgett-usgs, thanks! I haven't used the refactored hydrofabrics much yet - do you have thoughts on @ajsekell's question about the POI fabric? I think that if you had a similar lookup table and a table with the refactored areas you could go from NHD to a POI fabric with this code. Does that sound right?
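
Put differently, if the refactor lookup and a refactor-to-POI crosswalk are first composed into a single comid-to-POI table, the same call should work in one pass. A sketch under that assumption (the aggregate-fabric table and its column names here are illustrative, not an existing layer):

library(dplyr)

# refactor_lookup: comid / member_comid -> refactored ID (as used above)
# aggregate_lookup: refactored ID -> POI/aggregate ID (illustrative names)
poi_lookup <- refactor_lookup |>
  left_join(aggregate_lookup, by = c("ID" = "reconciled_ID")) |>
  transmute(ID = aggregate_ID, comid, member_comid)

# rescale directly from NHDPlusV2 to the POI fabric
rescale_nldi_characteristics(vars, poi_lookup, refactored_areas = split_divides)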

@dblodgett-usgs
Collaborator Author

I think so, yes. My next step is to build some tests around this based on real data from a refactored fabric to make sure I'm really following what's going on.

@lekoenig

lekoenig commented Feb 1, 2023

Sounds good, @dblodgett-usgs. We may start trying to scale this up using the functions in your comment above, and then replace with the nhdplusTools implementation down the line. Also, a random thought - should we note somewhere in the documentation that this process of splitting catchments may not work well for attributes that are not spatially uniform?

@ajsekell

ajsekell commented Feb 1, 2023

I would suggest including a comment like that in the documentation. I'd like to recalculate something like NLCD directly from the rasters to a POI fabric and see if there's a significant difference between that and compiling NHD-level values up to the NHGF POI fabric with this script.
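
For that raster-based check, one route (a sketch only; the file paths and layer name are placeholders for whatever NLCD grid and POI divides you have on hand) would be an exact zonal summary:

library(terra)
library(sf)
library(exactextractr)

# placeholder inputs - swap in the actual NLCD raster and POI divides
nlcd    <- rast("nlcd_2011_impervious.tif")
divides <- st_read("poi_fabric.gpkg", layer = "divides")

# area-weighted mean of the raster within each POI divide,
# to compare against the table-based crosswalk values
divides$imperv_raster_mean <- exact_extract(nlcd, divides, "mean")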

@dblodgett-usgs
Collaborator Author

If you install from the hydroloom branch,

remotes::install_github("doi-usgs/nhdplusTools@hydroloom"),

you'll see I got two new catchment access functions incorporated. I am working on incorporating the rest of the functions above over the next couple of days.

See #304 (comment) for how those functions currently work.

@dblodgett-usgs
Collaborator Author

dblodgett-usgs commented Feb 3, 2023

This is now complete in #324 -- @lekoenig can you kick the tires?

remotes::install_github("doi-usgs/nhdplusTools@hydroloom")
??rescale_catchment_characteristics

You can see what I added here: 04b195f

Todo on this is to

  • deal with some outstanding warnings about deprecated function usage,
  • check over documentation and write some examples, and
  • flesh out tests to include some proper unit tests that verify the calculations do what we expect on a onesy-twosy basis, or just spot-check what's in there and verify the numbers in some test expectations (a starting sketch is below).

I'll deal with the warnings but would welcome help on the other two.
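
For the test item, something like this could be a starting point (a sketch only; it leans on get_catchment_characteristics and rescale_catchment_characteristics from the hydroloom branch and the two-comid lookup used earlier in this thread):

library(testthat)
library(nhdplusTools)

test_that("summed characteristic matches a hand calculation", {
  vars <- data.frame(characteristic_id = "CAT_BASIN_AREA",
                     summary_statistic = "sum")
  lookup_table <- data.frame(id = rep(10012268, 2),
                             comid = c(4146596, 4147382),
                             member_comid = c(4146596, 4147382))

  out <- rescale_catchment_characteristics(vars, lookup_table)

  # hand-calculate the expected sum from the per-comid values
  vals <- get_catchment_characteristics(varname = "CAT_BASIN_AREA",
                                        ids = lookup_table$comid)

  expect_equal(out$CAT_BASIN_AREA_sum, sum(vals$characteristic_value))
})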

@lekoenig

To test the rescaling functions I've tried to request characteristics that differ from the ones I was using to develop the code above. I've run into an edge case with CAT_IMPV11 where rescale_catchment_characteristics throws an error:

# rescale NHDPlusv2 attributes by aggregating comid values to new "id" values
vars <- data.frame(characteristic_id = c('CAT_IMPV11',"CAT_SANDAVE", "CAT_BASIN_AREA"),
                   summary_statistic = c("area_weighted_mean", "min","sum"))
lookup_table <- data.frame(id = rep(10012268, 2),
                           comid = c(4146596, 4147382),
                           member_comid = c(4146596, 4147382))
rescale_catchment_characteristics(vars, lookup_table)    
                   
Error in `summarize()`:
i In argument: `across(...)`.
i In group 1: `id = 10012268`.
Caused by error in `across()`:
! Can't compute column `CAT_IMPV11_area_wtd`.
Caused by error in `x * w`:
! non-numeric argument to binary operator
Run `rlang::last_error()` to see where the error occurred.
Warning message:
Values from `characteristic_value` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(comid, characteristic_id) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L) 
>

The dplyr warnings are pretty helpful, actually. There are multiple rows with the same characteristic_id and characteristic_value that cause problems when trying to pivot the data to wide format. This is seen in rows 1 and 3 in the returned data frame below. My guess is that CAT_IMPV11 is included in more than one of the attribute tables and so is getting returned twice. Maybe a simple fix like adding a distinct() call within rescale_catchment_characteristics before pivoting the data to wide format would do the trick?
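
The distinct() idea would look something like this (a sketch; it assumes the long-format characteristics table is pivoted inside rescale_catchment_characteristics as in the code earlier in this thread):

# drop exact-duplicate characteristic rows before pivoting to wide format
nhd_vars <- dplyr::distinct(nhd_vars)
nhd_vars_wide <- tidyr::pivot_wider(nhd_vars, !percent_nodata,
                                    names_from = characteristic_id,
                                    values_from = characteristic_value)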

get_catchment_characteristics(varname = vars$characteristic_id,
                              ids = unique(lookup_table$comid))

   characteristic_id   comid characteristic_value percent_nodata
1:        CAT_IMPV11 4146596                 0.00              0
2:        CAT_IMPV11 4147382                 0.00              0
3:        CAT_IMPV11 4146596                 0.00              0
4:        CAT_IMPV11 4147382                 0.00              0
5:       CAT_SANDAVE 4146596                24.65              0
6:       CAT_SANDAVE 4147382                24.64              0
7:    CAT_BASIN_AREA 4146596                 0.96              0
8:    CAT_BASIN_AREA 4147382                11.98              0
>

Actually, I'm not sure about that. I'm testing river segments in the DRB and it looks like sometimes two different values get returned for CAT_IMPV11. Any idea what's going on there?

vars <- data.frame(characteristic_id = "CAT_IMPV11", summary_statistic = "area_weighted_mean")
lookup_table <- data.frame(id = 1435, comid = 1748535, member_comid = 1748535)
get_catchment_characteristics(varname = vars$characteristic_id, ids = unique(lookup_table$comid))

   characteristic_id   comid characteristic_value percent_nodata
1:        CAT_IMPV11 1748535                 0.00              0
2:        CAT_IMPV11 1748535                 0.25              0

@dblodgett-usgs
Collaborator Author

dblodgett-usgs commented Feb 15, 2023

I tracked this down to a bad assumption. The CAT_IMPV11 data has duplicates for these comids. It looks like you found the same thing (now that I read your issue more closely).

@mewieczo can you look at what's up with https://www.sciencebase.gov/catalog/item/570577fee4b0d4e2b7571d7b? I'm seeing duplicate rows in the cat parquet file. Is that also in the csv, or is something wrong with the parquet conversion? Why are we seeing different values for the same COMID?

In the meantime, if you update from the hydroloom branch, this should work.

remotes::install_github("doi-usgs/nhdplusTools@hydroloom")

library(nhdplusTools)
vars <- data.frame(characteristic_id = c('CAT_IMPV11',"CAT_SANDAVE", "CAT_BASIN_AREA"),
                   summary_statistic = c("area_weighted_mean", "min","sum"))
lookup_table <- data.frame(id = rep(10012268, 2),
                           comid = c(4146596, 4147382),
                           member_comid = c(4146596, 4147382))
rescale_catchment_characteristics(vars, lookup_table)
#> # A tibble: 1 × 6
#>         id areasqkm_sum lengthkm_sum CAT_IMPV11_area_wtd CAT_BASIN_ARE…¹ CAT_S…²
#>      <dbl>        <dbl>        <dbl>               <dbl>           <dbl>   <dbl>
#> 1 10012268         12.9         6.30                   0            12.9    24.6
#> # … with abbreviated variable names ¹​CAT_BASIN_AREA_sum, ²​CAT_SANDAVE_min

Created on 2023-02-14 with reprex v2.0.2

@lekoenig

lekoenig commented Feb 15, 2023

Ah, OK - thanks for looking into that IMPV11 issue! I've tested these functions out on a larger population of stream reaches (~15K NHDPlusV2 COMIDs in the DRB) and an expanded set of ~30 attributes to compare against our previous PUMP workflow. Overall, the new nhdplusTools rescale workflow works well! I've added some suggestions for documentation (including examples) and fixes for the dplyr function usage in a branch on my fork. I can open a PR if that fits with your workflow, @dblodgett-usgs.

I've also found a few other potential edge cases that we could consider:

  1. The CAT_sinuosity variable gives me trouble. Based on the error message, is it possible this variable has a different name in the metadata table and the parquet files?
library(nhdplusTools)
lookup_table <- data.frame(id = rep(10012268, 2), comid = c(4146596, 4147382), member_comid = c(4146596, 4147382))
vars <- data.frame(characteristic_id = "CAT_sinuosity", summary_statistic = "length_weighted_mean")
rescale_catchment_characteristics(vars, lookup_table)
#> Error in UseMethod("pivot_wider") : 
#>   no applicable method for 'pivot_wider' applied to an object of class "NULL"
#> In addition: Warning message:
#> In get_catchment_characteristics(varname = vars$characteristic_id,  :
#>   Issue getting characteristics. Error was: 
#> Problem while evaluating `all_of(...)`.Issue getting characteristics. Error was: 
...
  2. For rescaled catchments that only contain zero-area COMIDs, the area-weighted values return NaN. It makes sense, and I think I'm OK with that but thought I'd mention it.
library(nhdplusTools)
vars <- data.frame(characteristic_id = "CAT_STRM_DENS", summary_statistic = "area_weighted_mean")
lookup_table <- data.frame(id = 1721, comid = 4188275, member_comid = 4188275)
rescale_catchment_characteristics(vars, lookup_table)
#> # A tibble: 1 x 4
#>      id areasqkm_sum lengthkm_sum CAT_STRM_DENS_area_wtd
#>   <dbl>        <dbl>        <dbl>                  <dbl>
#> 1  1721            0        0.048                    NaN

Created on 2023-02-15 by the reprex package (v2.0.1)

  3. -9999 values in the characteristics tables cause weird rescaled values, and we should probably replace them with NA before calling rescale_characteristics().
library(nhdplusTools)
vars <- data.frame(characteristic_id = "CAT_EWT", summary_statistic = "area_weighted_mean")
lookup_table <- data.frame(id = 4203, 
                           comid = c(8074226,8075830,932040118,932040186,932040187), 
                           member_comid = c(8074226,8075830,932040118,932040186,932040187))
get_catchment_characteristics(varname = unique(vars$characteristic_id), ids = lookup_table$comid)
#>    characteristic_id     comid characteristic_value percent_nodata
#> 1:           CAT_EWT   8074226                    0              0
#> 2:           CAT_EWT   8075830                    0              0
#> 3:           CAT_EWT 932040118                    0              0
#> 4:           CAT_EWT 932040186                -9999              0
#> 5:           CAT_EWT 932040187                -9999              0
rescale_catchment_characteristics(vars, lookup_table)
#> # A tibble: 1 x 4
#>      id areasqkm_sum lengthkm_sum CAT_EWT_area_wtd
#>   <dbl>        <dbl>        <dbl>            <dbl>
#> 1  4203         22.8         10.3           -5057.

Created on 2023-02-15 by the reprex package (v2.0.1)

  4. I was working with the split catchments to develop additional unit tests based on expected output. I'm wondering whether we actually want to apply the proportional-area correction to all vars at the end of rescale_catchment_characteristics(), or whether it'd be better to make that adjustment within rescale_characteristics() and only for the area-weighted cols. Here's a reprex to show what I mean:
library(nhdplusTools)
library(dplyr)
vars <- data.frame(characteristic_id = c("CAT_ELEV_MIN"),
                   summary_statistic = c("min"))
lookup_table <- data.frame(id = rep(10012268, 2), comid = c(4146596, 4147382), member_comid = c(4146596, 4147382))

# works ok for aggregation only
get_catchment_characteristics(varname = unique(vars$characteristic_id), ids = unique(lookup_table$comid))
#>    characteristic_id   comid characteristic_value percent_nodata
#> 1:      CAT_ELEV_MIN 4146596               796.09              0
#> 2:      CAT_ELEV_MIN 4147382               530.76              0
rescale_catchment_characteristics(vars, lookup_table)
#> # A tibble: 1 x 4
#>         id areasqkm_sum lengthkm_sum CAT_ELEV_MIN_min
#>      <dbl>        <dbl>        <dbl>            <dbl>
#> 1 10012268         12.9         6.30             531.

# but for split catchments?
d <- readRDS(list.files(pattern = "rescale_data.rds", recursive = TRUE, full.names = TRUE))

# smallest value for cat_elev_min between comids belonging to id 10024047 or 10024048 is 487.49
get_catchment_characteristics(varname = unique(vars$characteristic_id), ids = unique(d$lookup_table$comid)) |>
  left_join(d$lookup_table, by = "comid", multiple = "all")
#>   characteristic_id   comid characteristic_value percent_nodata member_comid       id LevelPathID
#> 1:      CAT_ELEV_MIN 4146596               796.09              0      4146596 10012268     2132667
#> 2:      CAT_ELEV_MIN 4147370               616.66              0      4147370 10012979     2130154
#> 3:      CAT_ELEV_MIN 4147378               574.88              0      4147378 10012979     2130154
#> 4:      CAT_ELEV_MIN 4147380               530.24              0    4147380.1 10024049     2123206
#> 5:      CAT_ELEV_MIN 4147380               530.24              0    4147380.2 10024050     2123206
#> 6:      CAT_ELEV_MIN 4147382               530.76              0      4147382 10012268     2132667
#> 7:      CAT_ELEV_MIN 4147396               487.49              0    4147396.1 10024047     2123206
#> 8:      CAT_ELEV_MIN 4147396               487.49              0    4147396.2 10024048     2123206

# yet rescaled value of cat_elev_min for id 10024048 is 88.5
rescale_catchment_characteristics(vars, d$lookup_table, refactored_areas = d$split_divides)
#> # A tibble: 6 x 4
#>         id areasqkm_sum lengthkm_sum CAT_ELEV_MIN_min 
#>      <dbl>        <dbl>        <dbl>            <dbl>            
#> 1 10012268        12.9          6.30            531.             
#> 2 10012979         4.78         3.31            575.             
#> 3 10024047         6.51         4.31            400.              
#> 4 10024048         1.44         4.31             88.5             
#> 5 10024049         5.84         5.10            231.              
#> 6 10024050         7.56         5.10            299.              

Finally, it occurred to me that I've ignored the percent_nodata column from get_catchment_characteristics() up until now. I think it could be useful to retain and possibly rescale it. I have an idea to do that if it seems worthwhile.
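
One way that could look (a sketch of the idea only, not a description of any existing implementation) is to carry percent_nodata through as its own area-weighted column, so users can see how much of each rescaled value rests on nodata cells:

# sketch: assumes `cat_vars` has one row per comid x characteristic with
# columns comid, characteristic_id, percent_nodata, areasqkm, and id
nodata_summary <- cat_vars |>
  dplyr::group_by(id, characteristic_id) |>
  dplyr::summarize(
    percent_nodata_wtd = weighted.mean(percent_nodata, w = areasqkm, na.rm = TRUE),
    .groups = "drop"
  )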

@dblodgett-usgs
Collaborator Author

Thanks for the doco, @lekoenig -- this is getting solid. I'll keep this issue in my todo queue and look at your comments.

If you think you know how to work in the percent_nodata, that would be awesome. I've kind of been overlooking it as well.

@dblodgett-usgs
Collaborator Author

CAT_sinuosity is in the "tot" file for some reason? The metadata does have the information needed for that and that case is now covered.

I found a case where there is no "percent_nodata" (for ACC_STRM_DENS). I will just fill in 0 in that case.
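
Presumably something along these lines inside the download helper (a sketch; att is the per-characteristic attribute table from the earlier code):

# if the source table lacks a percent_nodata column, assume 0
if (!"percent_nodata" %in% names(att)) {
  att$percent_nodata <- 0
}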

For a divide by 0, I think NaN is the best choice.

I now do att <- dplyr::mutate_all(att, ~ifelse(. == -9999, NA, .)). @mewieczo -- do you know if there are other "nodata" values that I will need to parse out?

The issue you highlight in (4) just looks like a bug? I'm not totally following what's going on though. Do you want to dig further and get that fixed up, or should I, @lekoenig?

dblodgett-usgs added a commit that referenced this issue Feb 20, 2023
@dblodgett-usgs
Collaborator Author

Reinstall from hydroloom to get the little fixes above.

remotes::install_github("doi-usgs/nhdplusTools@hydroloom")

@lekoenig

I now do att <- dplyr::mutate_all(att, ~ifelse(. == -9999, NA, .))

Great, thanks for doing that. If mutate_all is ever deprecated, we could switch to the across usage, e.g. something like: dplyr::mutate(att, dplyr::across(dplyr::where(is.numeric), ~dplyr::na_if(., -9999))).

The issue in (4) above is really more of a conceptual question than a bug. For split catchments, should we be rescaling all characteristics? In that example, if a user requests the "min" statistic for the attribute CAT_ELEV_MIN, the value that gets returned is lower than any of the contributing catchments (because we rescaled CAT_ELEV_MIN by the proportion of contributing area). I'm not sure that's what we want, but would appreciate a second set of eyes in thinking this through. To avoid rescaling characteristics by proportional area when the characteristics will be used to calculate "min" or "max" summary statistics, a potential fix could be to move this line:

  # rescale the nldi characteristics if needed (i.e., for split catchments)
  if(!all(lookup_table$comid == lookup_table$member_comid)){
    lookup_table <- mutate(lookup_table, across(any_of(var_names), ~.x*.data$split_area_prop))
  }

from rescale_catchment_characteristics to rescale_characteristics and then only apply that correction to certain columns (instead of var_names):

  # rescale the nldi characteristics if needed (i.e., for split catchments)
  if(!all(lookup_table$comid == lookup_table$member_comid)){
    lookup_table <- mutate(lookup_table, across(any_of(c(cols_area_wtd_mean, cols_sum)), ~.x*.data$split_area_prop))
  }

Does that make sense?

@lekoenig

If you think you know how to work in the percent_nodata, that would be awesome. I've kind of been overlooking it as well.

I've suggested some changes in #331 to retain and rescale the percent_nodata values.

@dblodgett-usgs
Collaborator Author

@lekoenig -- based on what we have now, do you think this could be closed?

@lekoenig

lekoenig commented Mar 9, 2023

I was just looking at this last night to see if we could close this loop. I think the only thing remaining is to resolve (4) described in #303 (comment). I'd appreciate it if you or @ajsekell could take a look at the example above as a check on my logic, but either way I'll open a PR today with my proposed change.

@ajsekell

ajsekell commented Mar 9, 2023

I think I'm following (4) in #303... yeah, that's a problem. You're right, the rescaling by proportion of area isn't necessarily needed for all variables (like min elevation, or maybe some of the climate variables too). Your suggestion makes sense, I think, and some of these details will be important in the help/documentation.

@lekoenig

@lekoenig -- based on what we have now, do you think this could be closed?

Following up on this: after the changes merged in #337, I think this issue can be closed. In the next couple of months we will be working with this code and scaling up these workflows to generate rescaled attributes for CONUS. I'll open a new issue if we find any bugs or have other questions.

@dblodgett-usgs
Collaborator Author

Great. As I have time, I am working on getting hydroloom documented and ready for CRAN. Once it's out, nhdplusTools will start to migrate to that back end on the main branch. Until then, we'll just have to install from the hydroloom branch.

@dblodgett-usgs
Collaborator Author

See https://doi-usgs.github.io/nhdplusTools/news/index.html

This is now on the main branch and on its way to CRAN.
