first take at expanding predictSolute #219

Merged: 16 commits, Aug 25, 2017
Conversation

@wdwatkins (Contributor)

#199

One warning is for the vignette; there is also an existing S3 consistency warning from predictSolute.loadComp.

R/loadComp.R Outdated
#use aggregate solute to aggregate to agg.by, but warn and return NA for uncertainty
if(agg.by != "unit") {
preds <- aggregateSolute(preds, metadata = getMetadata(load.model), agg.by = agg.by,
format = flux.or.conc)
@wdwatkins (Contributor, Author), Aug 17, 2017

@aappling-usgs aggregateSolute has both flux rate and flux total options, which of those is appropriate for these? Should one of them be eliminated?

As written, I think it resolves to flux rate via partial matching.

@aappling-usgs (Contributor)

Good Q. I like your suggestion to take out flux total entirely and just accept conc or flux [interpreted as flux rate] instead. Let's eliminate flux total everywhere in this function, and add a warning just before

format <- match.arg.loadflex(format, c("conc", "flux rate", "flux total"))

(line ~126; specifics of the above line will change to only accept 'conc' & 'flux'). The warning should check for 'flux total' and, if that was the format passed in, explain that the option is deprecated, and you can multiply flux rate by the duration of each period if a total flux is desired. This will catch very few people now that you've made aggregateSolute internal, but I think it'll help keep things clear for the next year or so.
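The deprecation path described above might look something like the sketch below. This is illustrative only: match.arg.loadflex is the package's internal matcher named in the thread, and the exact warning wording is a placeholder, not what was merged.

```r
# Sketch of the proposed deprecation check (illustrative; exact wording
# and placement are assumptions based on the review comment above)
if(identical(format, "flux total")) {
  warning("format='flux total' is deprecated; flux rates are returned instead. ",
          "Multiply each flux rate by the duration of its period ",
          "if you need a total flux.")
  format <- "flux"
}
format <- match.arg.loadflex(format, c("conc", "flux"))
```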

@aappling-usgs (Contributor)

I see this S3 consistency warning in the Appveyor build:

* checking S3 generic/method consistency ... WARNING
predictSolute:
  function(load.model, flux.or.conc, newdata, interval, level,
           lin.or.log, se.fit, se.pred, date, attach.units, agg.by,
           ...)
predictSolute.loadComp:
  function(load.model, flux.or.conc, newdata, interval, level,
           lin.or.log, se.fit, se.pred, date, attach.units, fit.reg,
           fit.resid, fit.resid.raw, agg.by, ...)
See section 'Generic functions and methods' in the 'Writing R
Extensions' manual.

Is this warning related to the one you're seeing in the vignette? Could you copy that warning here so I can see?

@wdwatkins (Contributor, Author) commented Aug 17, 2017

The vignette warning is just because I made aggregateSolute internal. We'll just need to adjust the vignette to use the new predictSolute argument instead.

@aappling-usgs (Contributor)

OK, could you create an issue for updating the vignette (unless it's easy enough to do in this PR)?

@aappling-usgs (Contributor)

Also, when I check locally, I'm seeing an error rather than a warning in the vignette. Is this different from what you're seeing?

Quitting from lines 120-124 (intro_to_loadflex.Rmd) 
Error: processing vignette 'intro_to_loadflex.Rmd' failed with diagnostics:
could not find function "aggregateSolute"

@wdwatkins (Contributor, Author) commented Aug 17, 2017

Yeah, that is the one. Apparently errors in vignettes generate warnings in R CMD check:

checking re-building of vignette outputs ... WARNING
Error in re-building vignettes:

@aappling-usgs (Contributor) left a review comment

Good work on a tricky refactor. I've spun off a few additional issues I think we can address in separate PRs. The comments I left in here are things we should probably address in this PR. The big ones will be fixing the S3 consistency warning, deprecating the flux total option, and ensuring that there's a date column for any set of arguments to predLoad/predConc. And then some more minor clarifications and tidyings.

"In the meantime, please consider reporting instantaneous uncertainties only, ",
"or using predLoad(getFittedModel(load.model), by=[format]) ",
"In the meantime, please consider reporting instantaneous uncertainties only",
"(by setting agg.by = \"unit\"), or using predLoad(getFittedModel(load.model), by=[format]) ",
@aappling-usgs (Contributor)
👍

agg_preds$CI_lower <- agg_preds$Value - CI_quantile*agg_preds$SE
agg_preds$CI_upper <- agg_preds$Value + CI_quantile*agg_preds$SE
#returning NAs since this is not accurate
agg_preds$CI_lower <- NA
@aappling-usgs (Contributor)

lines 284-285 have similar needs, and all these calculations are now extraneous all the time, right? I think it's time to trust git to keep a copy of this code for us, and just replace the whole if(ci.agg) block with

if(ci.agg) {
  agg_preds$CI_lower <- NA
  agg_preds$CI_upper <- NA
}

@aappling-usgs (Contributor)
(spinning off some additional cleanup tasks into #220)

@@ -227,7 +228,7 @@ aggregateSolute <- function(
agg_preds <- as.data.frame(summarise(
preds_filt,
Value = mean(preds),
SE = if(se.agg | ci.agg) SEofSum(dates, se.preds, cormat.function) else NA, # why isn't SEofSum divided by n()??
SE = NA, # why isn't SEofSum divided by n()?? #returning NAs since this is not accurate
@aappling-usgs (Contributor)

You can take out the "# why isn't SEofSum divided by n()??" comment.

@@ -101,8 +102,7 @@
#' time.step=as.difftime(1, units="days"), max.tao=as.difftime(10, units="days"))
#' aggregateSolute(preds_example, metadata=metadata_example, format="conc", agg.by="month",
#' cormat.function=new_correlation_assumption)
#'
#' @export
#' }
@aappling-usgs (Contributor)

spinning off some thoughts about deprecation messaging into issue #221 for later cleanup

# Get around data size constraints of predLoad and predConc by splitting the
# prediction into smaller chunks as necessary. chunk.size is the maximum
# number of values per chunk; rloadest can handle at most 176000 values by
# default. nchunks is the number of chunks required.
chunk.size <- 176000
chunk.size <- 175872 #multiple of 96, so breaks iv data on day boundaries
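For context on the new constant, here is the arithmetic behind it (my own check, assuming 15-minute instantaneous data, which is what the "multiple of 96" comment implies):

```r
# 96 fifteen-minute observations per day, so any chunk size that is a
# multiple of 96 ends each chunk exactly at a day boundary
96 * 1832      # 175872, which stays under rloadest's 176000-row default
175872 %% 96   # 0, confirming chunks break on whole days
```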
@aappling-usgs (Contributor)

good idea (and inspired #224)

R/loadReg2.R Outdated
nchunks <- ceiling(nrow(newdata) / chunk.size)
datachunks <- lapply(1:nchunks, function(i) {
newdata[((i-1)*chunk.size + 1):min(i*chunk.size, nrow(newdata)),]
})

browser()
@aappling-usgs (Contributor)

definitely don't want this in here

@wdwatkins (Contributor, Author)

would be nice if R CMD check caught those

@aappling-usgs (Contributor)

so true!

R/loadReg2.R Outdated
agg.by = agg.by)
agg.by = "total" #going to rloadest
if(nrow(newdata) > 176000) {
stop(paste(strwrap("Sorry, rloadest can't handle more than 176,000 data points,
@aappling-usgs (Contributor)

change to 175,872 to reflect new chunk.size below?

R/loadReg2.R Outdated
agg.by = "total" #going to rloadest
if(nrow(newdata) > 176000) {
stop(paste(strwrap("Sorry, rloadest can't handle more than 176,000 data points,
and loadflex does not currently support aggregating uncertainties.
@aappling-usgs (Contributor)

this line will probably confuse users who don't understand how rloadest and loadflex interact. I think you could change this whole message to:

Sorry, rloadest can't aggregate more than 176,000 data points at a time.
Please change the agg.by argument to a shorter period, or else
supply a shorter period of data for newdata

R/loadReg2.R Outdated
# Add dates if requested
if(date) {
if(!is.data.frame(preds)) {
preds <- data.frame(fit=preds)
}
# prepend the date column
preds <- data.frame(date=getCol(metadata, newdata, "date"), preds)
#preds <- data.frame(date=getCol(metadata, newdata, "date"), preds)
#predLoad returns the dates, don't need to get them from metadata
@aappling-usgs (Contributor)

are you sure? if we just created the data.frame in line 468 above, then preds will be a data.frame with only a fit column, no dates...

(however, if that's the only case when a date column needs to be added, it does look like these lines could be moved into the if(!is.data.frame(preds)) block just above, right?)

@wdwatkins (Contributor, Author)

Hmm, I see that, although predLoad and predConc both always return data frames containing the dates, so I wasn't sure why they would ever need to be reattached.

@aappling-usgs (Contributor)

I thought I remembered that with certain arguments predLoad just returned a vector... let me see if I can recreate that. Maybe I'm misremembering.

@aappling-usgs (Contributor)

ok, I've looked and still think the date column needs to be added - always, in fact. preds is not the direct output from predLoad or predConc. It's either a vector (https://github.com/wdwatkins/loadflex/blob/9e4e520e0b88a04ce594aa1e7b5048e26dae0e2a/R/loadReg2.R#L394-L397) or a data.frame with columns fit, lwr, and upr (https://github.com/wdwatkins/loadflex/blob/9e4e520e0b88a04ce594aa1e7b5048e26dae0e2a/R/loadReg2.R#L406-L420)

@wdwatkins (Contributor, Author)

Ok, I will leave it then. I think this could be streamlined in the future, though; I still don't see a reason for dropping the dates and then appending them again later.

@aappling-usgs (Contributor)

that's a reasonable suggestion.

@wdwatkins (Contributor, Author)

should probably add/change some tests so the new agg.by arguments get used.

@aappling-usgs (Contributor)

when you said, "should probably add/change some tests", did you mean that you planned to do so for this PR? if not, could you create an issue for it?

could you also add updates or an issue for the vignette? would rather not have a broken vignette on the master branch for long.

@wdwatkins (Contributor, Author)

Might as well do it now.

@wdwatkins (Contributor, Author)

@aappling-usgs actually it might be better if you do the vignette; you would probably have a better idea of how to realistically incorporate this change into the workflow. I will make an issue.

@wdwatkins (Contributor, Author)

@aappling-usgs this should be good to go now, pending any additional review

@aappling-usgs (Contributor)

cool! good work.

@aappling-usgs aappling-usgs merged commit 286a7c2 into DOI-USGS:master Aug 25, 2017