
Download, process, and covariate calculation package - next steps #5

Closed
mitchellmanware opened this issue Jan 16, 2024 · 22 comments

@mitchellmanware
Collaborator

mitchellmanware commented Jan 16, 2024

Updates, additions, and next steps for the suite of functions designed to download, process, and calculate covariates from geospatial environmental data sources.
@sigmafelix @Spatiotemporal-Exposures-and-Toxicology

@mitchellmanware
Collaborator Author

mitchellmanware commented Jan 16, 2024

Download Functions

Processing/Covariate Calculation Functions

  • Currently, some of the data sources have separate functions for processing and calculating covariates while others do not (e.g. process_hms > calc_hms vs. calc_geos alone). Should we use consistent terms and separate functions based on their position within the progression "data exists online > data is downloaded to machine > data is subset to variables/locations of interest > data imported into R" (example only)?
  • Change calc_data to covariate_data for the generation of covariates.
    • calc_data can be created for calculating temporal mean/median/range etc., but covariate_data is more specific to the extraction of data at point locations.

Discussion

  1. Should new data sources be included in first version of package, or should we focus on developing a full suite of processing/calculation functions around the data download functions we already have?
  2. Structure and names of the processing and covariate calculation functions.

@kyle-messier
Collaborator

Thanks Mitchell

  1. Yes to code naming consistency
  2. It looks like you are considering adding other covariates - that is a good list. If it is straightforward to do so, then go ahead, but that can be a future release too.
  3. We still need package names!

@mitchellmanware
Collaborator Author

Whether the additional data sources are included in the first or a later version, I think our immediate priority is cleaning up what we have. It will be easier to add additional download, process, and calculation functions once we have standard function names and more robust unit/integration tests.

@mitchellmanware
Collaborator Author

The code naming consistency can be based on the output of each function.

The download... functions return raw, unprocessed downloaded data from various sources.

The process... functions clean these raw data (i.e. variable selection from netCDF, spatial subsetting) and return a single SpatRaster/SpatVector object. The user would need to use only two functions to have cleaned, interpretable geospatial data in the R environment.

The calc... functions can be renamed to covariate... because the extraction of data at points or polygons is too limited in scope for the name "calc". calc... functions can then be created to calculate spatiotemporal summary statistics (i.e. average temperature within counties/census tracts or a user-defined set of polygons).
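A minimal sketch of the naming scheme described above, end to end. All function names, arguments, and paths here are illustrative proposals, not the package's final API; `sites` and `counties` stand in for user-supplied SpatVector objects.

```r
# Hypothetical workflow under the proposed naming scheme (illustrative only).
# download... : raw data to disk
download_narr(
  variable = "omega",
  date_start = "2018-01-01",
  date_end = "2018-01-01",
  directory_to_save = "./data/narr/"
)

# process... : clean the raw files into a single SpatRaster
omega <- process_narr(
  variable = "omega",
  date_start = "2018-01-01",
  date_end = "2018-01-01",
  directory_with_data = "./data/narr/"
)

# covariate... : extract values at point locations (sites: user SpatVector)
covar <- covariate_narr(data = omega, sites = sites)

# a future calc... : spatiotemporal summaries over polygons
county_means <- calc_narr(data = omega, polygons = counties, fun = "mean")
```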

@mitchellmanware
Collaborator Author

mitchellmanware commented Jan 17, 2024

Mitchell To Do (January 17, 2024) -

  • Restructure calc_narr and calc_geos to have separate process... and covariate... functions
    • As all covariates are already calculated, temporarily breaking these functions should not cause major productivity issues
  • Identify standard parameters for process... functions
    • Examples may be crop =, variables =, and temporal = (for GEOS-CF, as daily data may be more useful than hourly)
  • Unit tests for new process... and covariate... functions
  • Integration tests for download... > process... > covariate... pipeline
  • Create spatiotemporal summary statistics functions (new calc...)
  • Package name 🌋🌋🌋

@mitchellmanware
Collaborator Author

mitchellmanware commented Jan 18, 2024

process...

Parameters

  • Temporal range (date/year start and end)
  • Variables
  • Crop (user provides SpatVector/sf object)
  • Directory with data
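The parameters above might translate into a shared signature like the following sketch. The argument names are a proposal only, and the body is elided; the point is that every process_* function would accept the same core set.

```r
# Sketch of a shared process_* signature (argument names assumed, not final).
process_geos <- function(
  date_start = "2018-01-01",
  date_end = "2018-01-01",
  variable = NULL,
  crop = NULL,                 # optional SpatVector/sf for spatial subset
  directory_with_data = NULL
) {
  stopifnot(!is.null(variable), dir.exists(directory_with_data))
  # ... read, subset, and clean files; return a single SpatRaster ...
}
```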

Unit Tests

  • Return object is SpatRaster/SpatVector
  • Return object has:
    • coordinate reference system
    • extent
    • time
  • Variable names are set (especially for NARR pressure levels variables)
  • Correct number of layers based on temporal range and resolution (hourly, daily, monthly, annual)
  • Extent > 1 x 1
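The unit tests listed above could be expressed with testthat roughly as follows. The function `process_narr`, the test data path, and the expected layer count are illustrative assumptions.

```r
# Sketch of the proposed unit tests (object/function names illustrative).
library(testthat)
library(terra)

test_that("process_narr returns a cleaned SpatRaster", {
  n <- process_narr(
    date_start = "2018-01-01",
    date_end = "2018-01-02",
    variable = "omega",
    directory_with_data = "../testdata/narr/omega/"
  )
  expect_s4_class(n, "SpatRaster")                # return class
  expect_false(terra::crs(n) == "")               # has a CRS
  expect_false(all(is.na(terra::time(n))))        # has time values
  expect_true(all(grepl("omega", names(n))))      # variable names set
  expect_true(terra::xmax(n) - terra::xmin(n) > 1)  # extent > 1 x 1
  expect_true(terra::ymax(n) - terra::ymin(n) > 1)
})
```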

@sigmafelix
Collaborator

calc_, process_, or covariate_ functions may need to include an extent/tile range argument for flexibility. The functions in this package, at least calc_modis and its dependencies, are quite generic: they accept file paths and select one day's file paths to process.

We mainly use circular buffer polygons, generated inside each calc_ function, to extract values from raster datasets. Generic polygon inputs need to be supported for general cases of summary (jurisdictions, isochrones, etc.). One option is to treat circular buffers as the default.
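One way to sketch this: generate circular buffers by default when the input is points and a radius is given, and otherwise pass user-supplied polygons through untouched. The helper name is hypothetical.

```r
# Sketch: circular buffers as default, generic polygons supported (terra).
library(terra)

get_extraction_zones <- function(locs, radius = NULL) {
  # locs: SpatVector of points or polygons; radius: buffer width in meters
  if (terra::geomtype(locs) == "points" && !is.null(radius)) {
    return(terra::buffer(locs, width = radius))  # default circular buffers
  }
  locs  # user polygons (jurisdictions, isochrones, ...) pass through as-is
}
```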

@mitchellmanware
Collaborator Author

Discussion 2024/01/18 @sigmafelix @Spatiotemporal-Exposures-and-Toxicology

  • Refactor existing calc_* functions into separate process_ and covariate_ functions
  • Test performance of process_ function w/o spatial subset
  • Implement unit and integration tests for download_ > process_ pipeline
  • Identify and create additional "helper" functions based on repeated code in download_ and process_ functions
  • Use the same/similar parameters in process_ functions so users can do.call both download_ and process_ functions sequentially with the same parameters; also helpful for integration testing
  • Redesign calculate_ function based on uniform output from process_
  • Include buffer = as standard parameter for calculate_ function(s)
  • Uniform function notes (for us)
  • Uniform function cat(...) statements (for users)
  • Clean Roxygen descriptions
  • Expand to other data sources (those without R packages)
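The do.call point above might look like the following sketch: one shared argument list drives both steps. Function and argument names are assumptions.

```r
# Sketch: one shared argument list for download_ and process_ via do.call.
args_common <- list(
  date_start = "2018-01-01",
  date_end = "2018-01-02",
  variable = "O3",
  directory_with_data = "./data/geos/"
)

do.call(download_geos, args_common)       # download raw files
g <- do.call(process_geos, args_common)   # clean into one SpatRaster
```

Keeping the argument names identical is what makes this pattern work; any divergence between the two signatures would force per-function argument munging.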

@sigmafelix
Collaborator

For Point 1, MODIS-related auxiliary functions are separated from the main functions and placed in R/calculate_covariates_support.R. The file and function names will be changed in the next PR. I guess covariate_ is supposed to be calculate_?

For Point 4 (helper function), I think a spatial data class conversion function will be necessary. Functions written earlier (mostly by me) usually accept sf or SpatVector in sites argument, whereas the later versions accept stdt.

For Point 5, I think unifying common argument names across process_ functions needs to be prioritized. Before then, we need to think about Point 6 and the format of covariate source datasets. What would be a proper form: an sf/stars/Spat* object or file path(s)? The intermediate product from process_ could be vector or raster, so we might classify processing functions into two groups and design uniform arguments in calculate_ functions accordingly.

@mitchellmanware
Collaborator Author

@sigmafelix
I agree with all your points.
For point 5/6, my thought was that the process_ functions would return a cleaned Spat* object into the user's R environment. By "cleaned" I mean

  • Variable names set (i.e. changing DOY numbering to date in NARR: air_1 -> air_20180101)
  • Assign coordinate reference system if empty
  • Assign time variable if empty
  • Subset to individual variable within the data set

The "cleaning" does not have to include manipulation/calculation/alteration of the data values in any way (ignore the contradiction with daily = in process_geos 😅), which would make the functions simpler and reliant on fewer arguments. The essential arguments that come to mind are temporal range (start and end), variable, and directory with data. Other source-specific items (i.e. collection) can be verified by checking for the presence of the variable within the data's varnames rather than being a separate argument.
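The cleaning steps listed above might be sketched with terra as below. The file name and layer-name pattern are hypothetical; this only illustrates the kind of metadata-only operations involved (no data values are altered).

```r
# Sketch of the metadata-only "cleaning" steps (file/names illustrative).
library(terra)

r <- terra::rast("narr_air_201801.nc")            # hypothetical raw file
r <- r[[grep("^air", names(r))]]                  # subset to one variable
names(r) <- paste0("air_", format(terra::time(r), "%Y%m%d"))  # DOY -> date
if (terra::crs(r) == "") {
  terra::crs(r) <- "EPSG:4326"                    # assign CRS if empty
}
if (all(is.na(terra::time(r)))) {                 # assign time if empty
  terra::time(r) <- seq(as.Date("2018-01-01"), by = "day",
                        length.out = terra::nlyr(r))
}
```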

I think generating cleaned data in a two step workflow from download_data(...) > process_(...) is useful enough to the users where it saves lots of time, but also general enough so they/we can run our own calc_/covar_ functions or proceed with other analyses in R.

Lots of rambling thoughts so I will refine and post again with an example.

@mitchellmanware
Collaborator Author

import_geos and import_narr have been simplified to use only the date_start, date_end, variable, and directory_with_data parameters. The functions import and clean data, providing the user with a single SpatRaster object for their desired temporal range and variable. The data are not changed in any way; only the CRS, time, and variable names are cleaned for information retention (i.e. pressure levels in variable names).

https://github.com/Spatiotemporal-Exposures-and-Toxicology/NRTAPmodel/blob/mm-process-functions/R/import.R

> d <- "2018-01-01"
> dwd <- "../testdata/"
> g <- import_geos(
+   date_start = d,
+   date_end = d,
+   variable = "O3",
+   paste0(dwd, "geos/aqc_tavg_1hr_g1440x721_v1"))
Cleaning O3 data for 2018-01-01 00:30:00...
Cleaning O3 data for 2018-01-01 01:30:00...
Cleaning O3 data for 2018-01-01 02:30:00...
Cleaning O3 data for 2018-01-01 03:30:00...
Cleaning O3 data for 2018-01-01 04:30:00...
Cleaning O3 data for 2018-01-01 05:30:00...
Cleaning O3 data for 2018-01-01 06:30:00...
Cleaning O3 data for 2018-01-01 07:30:00...
Cleaning O3 data for 2018-01-01 08:30:00...
Cleaning O3 data for 2018-01-01 09:30:00...
Cleaning O3 data for 2018-01-01 10:30:00...
Cleaning O3 data for 2018-01-01 11:30:00...
Cleaning O3 data for 2018-01-01 12:30:00...
Cleaning O3 data for 2018-01-01 13:30:00...
Cleaning O3 data for 2018-01-01 14:30:00...
Cleaning O3 data for 2018-01-01 15:30:00...
Cleaning O3 data for 2018-01-01 16:30:00...
Cleaning O3 data for 2018-01-01 17:30:00...
Cleaning O3 data for 2018-01-01 18:30:00...
Cleaning O3 data for 2018-01-01 19:30:00...
Cleaning O3 data for 2018-01-01 20:30:00...
Cleaning O3 data for 2018-01-01 21:30:00...
Cleaning O3 data for 2018-01-01 22:30:00...
Cleaning O3 data for 2018-01-01 23:30:00...
Returning hourly O3 data from 2018-01-01 to 2018-01-01.
> g
class       : SpatRaster 
dimensions  : 721, 1440, 24  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : -180.125, 179.875, -90.125, 90.125  (xmin, xmax, ymin, ymax)
coord. ref. :  
sources     : GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20180101_0030z.nc4:O3  
              GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20180101_0130z.nc4:O3  
              GEOS-CF.v01.rpl.aqc_tavg_1hr_g1440x721_v1.20180101_0230z.nc4:O3  
              ... and 21 more source(s)
varnames    : O3 (Ozone (O3, MW = 48.00 g mol-1) volume mixing ratio dry air) 
              O3 (Ozone (O3, MW = 48.00 g mol-1) volume mixing ratio dry air) 
              O3 (Ozone (O3, MW = 48.00 g mol-1) volume mixing ratio dry air) 
              ...
names       : O3_le~03000, O3_le~13000, O3_le~23000, O3_le~33000, O3_le~43000, O3_le~53000, ... 
unit        :   mol mol-1,   mol mol-1,   mol mol-1,   mol mol-1,   mol mol-1,   mol mol-1, ... 
time        : 2018-01-01 00:30:00 to 2018-01-01 23:30:00 UTC 
> names(g)[1:10]
 [1] "O3_lev=72_20180101_003000" "O3_lev=72_20180101_013000"
 [3] "O3_lev=72_20180101_023000" "O3_lev=72_20180101_033000"
 [5] "O3_lev=72_20180101_043000" "O3_lev=72_20180101_053000"
 [7] "O3_lev=72_20180101_063000" "O3_lev=72_20180101_073000"
 [9] "O3_lev=72_20180101_083000" "O3_lev=72_20180101_093000"
> time(g)[1:10]
 [1] "2018-01-01 00:30:00 UTC" "2018-01-01 01:30:00 UTC"
 [3] "2018-01-01 02:30:00 UTC" "2018-01-01 03:30:00 UTC"
 [5] "2018-01-01 04:30:00 UTC" "2018-01-01 05:30:00 UTC"
 [7] "2018-01-01 06:30:00 UTC" "2018-01-01 07:30:00 UTC"
 [9] "2018-01-01 08:30:00 UTC" "2018-01-01 09:30:00 UTC"
> n <- import_narr(
+   date_start = d,
+   date_end = d,
+   variable = "omega",
+   paste0(dwd, "narr/omega/"))
Cleaning data for year 2018...
Returning daily omega data from 20180101 to 20180101.
> n
class       : SpatRaster 
dimensions  : 277, 349, 29  (nrow, ncol, nlyr)
resolution  : 32462.99, 32463  (x, y)
extent      : -16231.49, 11313351, -16231.5, 8976020  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=lcc +lat_0=50 +lon_0=-107 +lat_1=50 +lat_2=50 +x_0=5632642.22547 +y_0=4612545.65137 +datum=WGS84 +units=m +no_defs 
source      : omega.201801.nc:omega 
varname     : omega (Daily Omega on Pressure Levels) 
names       : omega~80101, omega~80101, omega~80101, omega~80101, omega~80101, omega~80101, ... 
unit        :    Pascal/s,    Pascal/s,    Pascal/s,    Pascal/s,    Pascal/s,    Pascal/s, ... 
time        : 2018-01-01 UTC 
> names(n)[1:10]
 [1] "omega_level=1000_20180101" "omega_level=975_20180101" 
 [3] "omega_level=950_20180101"  "omega_level=925_20180101" 
 [5] "omega_level=900_20180101"  "omega_level=875_20180101" 
 [7] "omega_level=850_20180101"  "omega_level=825_20180101" 
 [9] "omega_level=800_20180101"  "omega_level=775_20180101" 
> time(n)[1:10]
 [1] "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC"
 [5] "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC" "2018-01-01 UTC"
 [9] "2018-01-01 UTC" "2018-01-01 UTC"
> 

@sigmafelix
Is this more along the lines of "common argument names"?

@mitchellmanware
Collaborator Author

mitchellmanware commented Jan 26, 2024

Performance checks for spatial subsetting with import_* functions

import_geos is the function as it currently exists on the mm-process-functions branch; spatial subsetting occurs external to the function after the data have been cleaned. import_geos_crop_final crops the final single SpatRaster after all data have been cleaned. import_geos_crop_layer crops each layer of the data individually after cleaning but before merging into the final SpatRaster.

Test with collection aqc_tavg_1hr_g1440x721_v1 (single level):

> microbenchmark(
  crop_external = import_geos(
    variable = "O3",
    directory_with_data = d
  ) %>>% terra::crop(
    project(
      ct,
      "EPSG:4326"
    )
  ),
  crop_final = import_geos_crop_final(
    variable = "O3",
    directory_with_data = d,
    crop = ct
  ),
  crop_layer = import_geos_crop_layer(
    variable = "O3",
    directory_with_data = d,
    crop = ct
  ),
  times = 5
)
Unit: seconds
          expr      min       lq     mean   median       uq      max neval cld
 crop_external 2.469672 2.484679 2.492802 2.485250 2.497332 2.527078     5   a
    crop_final 2.472164 2.472217 2.488969 2.473530 2.498845 2.528091     5   a
    crop_layer 2.636986 2.655052 2.912806 2.661639 2.692731 3.917623     5   a

Test with collection chm_inst_1hr_g1440x721_p23 (pressure levels):

> microbenchmark(
+   crop_external = import_geos(
+     variable = "O3",
+     directory_with_data = d
+   ) %>>% terra::crop(
+     project(
+       ct,
+       "EPSG:4326"
+     )
+   ),
+   crop_final = import_geos_crop_final(
+     variable = "O3",
+     directory_with_data = d,
+     crop = ct
+   ),
+   crop_layer = import_geos_crop_layer(
+     variable = "O3",
+     directory_with_data = d,
+     crop = ct
+   ),
+   times = 5
+ )
Unit: seconds
          expr       min       lq     mean   median       uq      max neval cld
 crop_external  9.940528 10.03314 10.17560 10.19378 10.19413 10.51641     5   a
    crop_final 10.039182 10.10433 10.82902 10.12391 10.52642 13.35124     5   a
    crop_layer 10.357362 10.50860 10.56581 10.55911 10.62703 10.77697     5   a

At least for GEOS-CF data, performance between these functions was similar, so we can discuss the inclusion of crop = or subset = as a standard parameter in the import_* functions.

Local
When run locally, spatial subsetting after layer-specific cleaning is faster by 10-15 seconds.

> microbenchmark(
+   crop_external = import_geos(
+     variable = "O3",
+     directory_with_data = d
+   ) %>>% terra::crop(
+     project(
+       ct,
+       "EPSG:4326"
+     )
+   ),
+   crop_final = import_geos_crop_final(
+     variable = "O3",
+     directory_with_data = d,
+     crop = ct
+   ),
+   crop_layer = import_geos_crop_layer(
+     variable = "O3",
+     directory_with_data = d,
+     crop = ct
+   ),
+   times = 2
+ )
Unit: seconds
          expr      min       lq     mean   median       uq      max neval
 crop_external 464.0877 464.0877 464.4559 464.4559 464.8242 464.8242     2
    crop_final 458.4853 458.4853 460.2615 460.2615 462.0377 462.0377     2
    crop_layer 446.7328 446.7328 449.1123 449.1123 451.4919 451.4919     2
>

@sigmafelix sigmafelix transferred this issue from NIEHS/beethoven Feb 10, 2024
@sigmafelix
Collaborator

sigmafelix commented Feb 10, 2024

process_* functions will do much the same work as modis_*. I will work on separating the processing parts from calc_* first, then unify and minimize the arguments of the process_* functions.

TODO

  • process_ecoregion from calc_ecoregion
  • process_koppen_geiger from calc_koppen_geiger
  • process_nei from calc_nei
  • process_nlcd_ratio from calc_nlcd_ratio
  • process_tri from calc_tri
  • Simplify calc_modis
  • Rename modis_* to process_modis_*
  • Finalize a pull request

cf. Argument lists by functions in R/calculate_covariates.R and R/calculate_covariates_support.R

# calc_* function arguments 
                    fun path sites id_col product name_covariates radius subdataset fun_summary nthreads package_list_add
1        calc_ecoregion TRUE  TRUE   TRUE      NA              NA     NA         NA          NA       NA               NA
2    calc_koppen_geiger TRUE  TRUE   TRUE      NA              NA     NA         NA          NA       NA               NA
3            calc_modis TRUE  TRUE   TRUE    TRUE            TRUE   TRUE       TRUE        TRUE     TRUE             TRUE
4              calc_nei TRUE  TRUE   TRUE      NA              NA     NA         NA          NA       NA               NA
5       calc_nlcd_ratio TRUE  TRUE     NA      NA              NA   TRUE         NA          NA       NA               NA
6 calc_temporal_dummies   NA  TRUE   TRUE      NA              NA     NA         NA          NA       NA               NA
7              calc_tri TRUE  TRUE   TRUE      NA              NA   TRUE         NA          NA       NA               NA
  export_list_add year county_shp sites_epsg domain_year
1              NA   NA         NA         NA          NA
2              NA   NA         NA         NA          NA
3            TRUE   NA         NA         NA          NA
4              NA TRUE       TRUE       TRUE          NA
5              NA TRUE         NA         NA          NA
6              NA   NA         NA         NA        TRUE
7              NA   NA         NA       TRUE        TRUE

# modis_* function arguments 
                     fun path product nsds fun_agg paths date_in regex_sds  foo get_var resolution custom_sel subdataset crs_ref
1    modis_aggregate_sds TRUE    TRUE TRUE    TRUE    NA      NA        NA   NA      NA         NA         NA         NA      NA
2          modis_get_vrt   NA    TRUE   NA      NA  TRUE    TRUE      TRUE TRUE      NA         NA         NA         NA      NA
3     modis_mosaic_mod06   NA      NA   NA      NA  TRUE    TRUE        NA   NA    TRUE       TRUE         NA         NA      NA
4    modis_prefilter_sds   NA    TRUE   NA      NA    NA      NA        NA   NA      NA         NA       TRUE         NA      NA
5 modis_preprocess_vnp46   NA      NA   NA      NA  TRUE    TRUE        NA   NA      NA         NA         NA       TRUE    TRUE
6       modis_warp_stars TRUE      NA   NA      NA    NA      NA        NA   NA      NA         NA         NA         NA      NA
7           modis_worker   NA    TRUE   NA      NA    NA      NA        NA   NA      NA         NA         NA         NA      NA
  cellsize threshold crs_out raster date sites_in name_extracted fun_summary_raster id_col radius
1       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
2       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
3       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
4       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
5       NA        NA      NA     NA   NA       NA             NA                 NA     NA     NA
6     TRUE      TRUE    TRUE     NA   NA       NA             NA                 NA     NA     NA
7       NA        NA      NA   TRUE TRUE     TRUE           TRUE               TRUE   TRUE   TRUE

@sigmafelix
Collaborator

@mitchellmanware
As per our plan to split calc_* into process_* and calculate_* and to generalize functions, we may need to rename common arguments in these functions. At least in the functions I've been working on, path and sites are usually used. I think generic names such as x and y, or locs and from, which convey "calculate covariates at x/locs from from", would be preferable; sites sounds a little specific to AQS cases. I'd be happy to discuss more generic argument names.
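Under the proposed generic names, a calculate_* function might reduce to something like this sketch (the function name is hypothetical; from/locs are the proposed argument names):

```r
# Sketch of generic argument names for calculate_* functions.
library(terra)

calc_generic <- function(from, locs, locs_id = "id") {
  # from: processed SpatRaster; locs: SpatVector of generic locations
  # (monitoring sites, jurisdictions, isochrones, ...), not just AQS sites
  terra::extract(from, locs, ID = FALSE)
}
```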

@kyle-messier
Collaborator

@sigmafelix It's always nice to have as much internal consistency as possible - this could potentially even be made into a special use-case advanced linter (e.g. argument_name_linter)

@mitchellmanware
Collaborator Author

I agree these general names are more broadly applicable. I think there could be more variability in the process_ function arguments, as the raw data can take many forms, but we can implement vectorized arguments for collection/variable or statistic/resolution to retain consistency.

E.g. for the GMTED processing function:

gmted <-
  import_gmted(
    variable = c("Breakline Emphasis", "7.5 arc seconds"),
    directory_with_data = "../testdata/gmted/gmted"
  )

The variable argument takes both statistic and resolution but conforms to the simplified argument names.

@mitchellmanware
Collaborator Author

We can also apply the path argument to the download functions. I do not recall why I created separate directory_to_download and directory_to_save arguments.

@sigmafelix
Collaborator

sigmafelix commented Feb 13, 2024

@mitchellmanware The separation was for distinguishing a directory of compressed raw data from a directory of the unzipped data (what we actually use to process and calculate). directory_to_download was for the former, whereas directory_to_save was for the latter. One way to simplify the directory setting could be to force creating two directories under directory_to_save, say zips and raw, inside the download functions. Exceptions could be handled with the extension in the download links.
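The forced two-directory layout could be as simple as the following sketch (helper and directory names are the suggestion above, not existing package code):

```r
# Sketch: create zips/ and raw/ under directory_to_save (names assumed).
setup_download_dirs <- function(directory_to_save) {
  dir_zip <- file.path(directory_to_save, "zips")  # compressed downloads
  dir_raw <- file.path(directory_to_save, "raw")   # unzipped, usable data
  for (d in c(dir_zip, dir_raw)) {
    if (!dir.exists(d)) dir.create(d, recursive = TRUE)
  }
  list(zips = dir_zip, raw = dir_raw)
}
```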

@Spatiotemporal-Exposures-and-Toxicology Yes, a custom linter will help a lot! I will work on adding one.

@mitchellmanware
Collaborator Author

Right, thanks for the reminder. That would allow for succinct folder arguments across functions which is good, but seems low priority for now. I will revisit next week.

@sigmafelix
Collaborator

#13 is merged into main. It reflects what we agreed on in our discussion. Per my comment in #7, sites is replaced with locs; path remains in process_* functions and is replaced with from in calc_* functions.

@mitchellmanware
Collaborator Author

Great, thank you Insang. I am very close to done with my function migration and will create a pull request once finished.

@kyle-messier
Collaborator

@mitchellmanware closing - reopen if needed. 🤙
