-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add optparse to workflow.R #2626
Conversation
<!--- Provide a general summary of your changes in the Title above --> <!--- Please select appropriate Priority, Status,and Type labels--> <!--- If you do not have permission to select labels please state which labels you would like --> ## Description <!--- Describe your changes in detail --> The current version of workflow.R incorrectly parses the command-line arguments for the given XML configuration file. Looking at the following section: ``` args <- commandArgs(trailingOnly = TRUE) if (is.na(args[1])) { settings <- PEcAn.settings::read.settings("pecan.xml") } else { settings_file <- args[1] settings <- PEcAn.settings::read.settings(settings_file) } ``` Per the documentation on `commandArgs`, `args` will be: A character vector containing the name of the executable and the user-supplied command line arguments. The first element is the name of the executable by which R was invoked. The exact form of this element is platform dependent: it may be the fully qualified name, or simply the last component (or basename) of the application, or for an embedded R it can be anything the programmer supplied. This program is intended to be executed like so: ./workflow.R --settings input.xml This code is using `args[1]` which will actually be the path to the R executable, not the name of the "settings" XML document. The only reason the program succeeds in using the given XML is because the "read.settings" function in "base/settings/R/read.settings.R" separately parses the `commandArgs` for the presence of a "--settings" option and grabs the "input.xml" value. FWIW, I would strongly argue AGAINST this as this hidden behavior has allowed this bug to persist. The function should only use and validate the given "inputfile" and should definitely NOT parse the command-line arguments OR inspect Sys.getenv("PECAN_SETTINGS"), but that is another battle. ## Motivation and Context <!--- Why is this change required? What problem does it solve? --> <!--- If it fixes an open issue, please link to the issue here. --> The "optparse" establishes the parameters for a program in a way that both validates the inputs and generates documentation. This program also, apparently, accepts a "--continue" flag, but the only way to know this is to read the source code itself. As written, this program could be run with any other arbitrary arguments without error. By describing the parameters to "optparse," the program will now reject unknown arguments and also generate a "usage" statement when executed with "-h" or "--help": ``` $ ./workflow.R -h Usage: ./workflow.R [options] Options: -s FILE, --settings=FILE Settings XML file -c, --continue Continue processing -h, --help Show this help message and exit ``` Note that unknown arguments are rejected: ``` $ ./workflow.R --foo bar Error in getopt_options(object, args) : Error in getopt(spec = spec, opt = args) : long flag "foo" is invalid Calls: main ... get_args -> parse_args -> parse_options -> getopt_options Execution halted ``` And invalid file arguments for "--settings" are rejected: ``` $ ./workflow.R --settings foo.xml Usage: ./workflow.R [options] Options: -s FILE, --settings=FILE Settings XML file -c, --continue Continue processing -h, --help Show this help message and exit Error in get_args() : --settings "foo.xml" not a valid file Calls: main -> get_args Execution halted ``` ## Review Time Estimate <!---When do you want your code reviewed by?--> - [ ] Immediately - [ ] Within one week - [x] When possible ## Types of changes <!--- What types of changes does your code introduce? Put an `x` in all the boxes that apply: --> - [ ] Bug fix (non-breaking change which fixes an issue) <!-- please add issue number --> - [x] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) ## Checklist: <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> <!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! --> - [ ] My change requires a change to the documentation. - [ ] I have updated the CHANGELOG.md. - [ ] I have updated the documentation accordingly. - [ ] I have read the **CONTRIBUTING** document. - [ ] I have added tests to cover my changes. - [ ] All new and existing tests passed. <!--this template is from https://www.talater.com/open-source-templates/#/page/99-->
web/workflow.R
Outdated
suppressMessages(library("PEcAn.all")) | ||
suppressMessages(library("PEcAn.utils")) | ||
suppressMessages(library("RCurl")) | ||
suppressMessages(library("optparse")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a package that needs to to be installed first. I'm tempted to suggest adding it to base/utils since that is also where the status functions are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robkooper did we ever document when a package should be added as 'Imports' vs 'Suggests'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the parsing is neatly encapsulated already, I suggest moving the definition of get_args
into a package (utils or workflow seem like the obvious candidates) and calling it from there — then folks reading the workflow file just see eg args <- PEcAn.utils::get_args(commandArgs())
and don’t have to grapple with the full function definition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I entirely follow this, but I think I disagree. The code in get_args()
could easily live in the main body of the program (or the main()
that I wrote), but it needs to be a part of this program. It's not a generic function. It describes and validates the parameters to this program only.
FWIW, I'm cargo-culting my ideas from Python. I've written extensively on these ideas (see Tiny Python Projects), and Kristina saw me present on the first chapter where I spell out these ideas of encapsulating ideas into functions which hopefully can be tested, documenting command-line arguments, running integration tests for a program, etc. Many Perl/Python/R programs write all the code in the "main" namespace and don't bother with functions. This has the effect, e.g., of making all the variables global to the program which is not a best practice. I try to write small (<50 lines), pure (no global/environmental variables, hidden state, etc.) functions (and unit tests!) to try to break down my programs into understandable functions that can be composed.
Every command-line program I write in Python starts in a main()
function and almost always has a get_args()
function for handling the command-line arguments. I have written a Python program called new.py
to create new programs with these functions:
https://github.com/kyclark/tiny_python_projects/blob/master/bin/new.py
The idea is to bring a standard interface to my programs and to adhere to the Python community's best practices. The "argparse" module is standard and widely used. Most programmers (esp coming from C languages, Java, etc.) would immediately understand starting in a main()
function. Small functions tend to limit scoping of variables and make programs easier to understand AND TEST.
I feel strongly that every command-line program ought to use the same interface for documenting and validating the arguments. R's offerings in this space seem limited to "optparse" which I have demonstrated here. It's lacking in many of the features from Python's "argparse" such as automatically validating that an argument is a readable text file so you note that I had to check this manually. There appears to be an R package called "argparse" which in some way seems to rely on the existence of Python to handle the argument processing, so I'm not sure if this is a better or worse option.
In addition to validating the arguments, every command-line program ought to generate a "usage" statement when run with the standard "-h" and "--help" flags. The only reason I found out about the "--continue" flag was because I read the source code, which is an unreasonable expectation for the user/programmer. Additionally, it turns out that the read.settings()
function ALSO parses the commandArgs()
which seems highly unusual and undesirable as it could create other hidden, undocumented options for the program.
Only by manually tracing and reading all the source code could a person ever know all the options that the code accepts. This is not a best practice. ALL of the options should be explicitly stated in documentation/usage and should be VALIDATED IMMEDIATELY upon program execution. Every program should strive to fail as early as possible, e.g., if some file is required then verify both that the argument is present and is actually a readable file.
I would be happy to demonstrate my ideas using Python and trying to find reasonable parallels in R. I could try to explain unit and integration tests, and point out some of the flaws in dynamically typed languages like R/Python that can easily lead to unmaintainable code. I could explain some of the errors that I've seen in the current codebase with things like scoping of variables, impure and extremely long functions with far too many parameters, fatal errors that are ignored, lack of exceptions and traceback information that seems to make debugging R a particular bear.
In short, I would recommend adding "optparse" as a standard module for the project and encouraging every command-line program to adopt this. It's not necessary to have main()
and get_args()
functions, but just defining the program parameters can easily take up 100+ lines for a complicated program, so it could make the main body of the code easier to read if it's a separate function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm going to concur with @infotroph here. Definitely like the get_args
function. Agree that it sees like it should go in a package, such as settings
, which likewise only makes sense in the context of PEcAn's workflow. Agree that I'd prefer to not put the workflow in a main()
. Agree that I don't want to suppress warnings. Would also add a request that command line args be documented in the model documentation as part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a generic function. It describes and validates the parameters to this program only.
This is a very good point that I wasn't thinking about! I completely understand the philosophical objection, but pragmatically there's plenty of precedent in PEcAn for choosing encapsulation over inline usage even when it's not fully generic -- c.f. PEcAn.DB::betyConnect
, which hard-codes a very specific (and arguably Rube-Goldberg, but still useful!) set of database parameters.
Let's discuss your other points below and keep the line comments focused on how to handle this specific line of code.
can you update the changelog as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robkooper to clarify are you requesting that Ken
- Add optparse to Imports in base/utils/DESCRIPTION (or to base/workflow/DESCRIPTION if we move functions there?)
- add
get_args()
andworkflow.R
to a new file like base/workflow/R/workflow.R? - Update this to call
PEcAn.workflow::workflow()
?
If so, @kyclark let me know. I'm happy to do this rather than ask you to learn the mechanics of r package development!
web/workflow.R
Outdated
suppressMessages(library("PEcAn.all")) | ||
suppressMessages(library("PEcAn.utils")) | ||
suppressMessages(library("RCurl")) | ||
suppressMessages(library("optparse")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robkooper did we ever document when a package should be added as 'Imports' vs 'Suggests'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support fixing the argument parsing and like this implementation. But I’d pretty strongly prefer that the body of the workflow not be wrapped in a function, for two reasons: 1) I often find it handy to step through the workflow interactively and this makes that harder; 2)
adding another call level adds a lot of complexity to the error tracebacks (this is an indictment of R’s tracebacks IMO), and I’m not completely sure the error handlers will work right with it either.
I also mildly prefer that package load messages not be suppressed, but this is mostly an aesthetic opinion.
@robkooper I'm definitely unsure how these ideas ought to be adopted. I barely understand this codebase and how it is used, e.g., @infotroph thinks that adding a However the code is broken up, there will still be command-line programs that need to respond to and use command-line arguments, produce documentation, etc. These programs, e.g., @infotroph notes the desire to run R code interactively, which I never do. I'm looking at this as a developer who wants to run the code from the command-line with all the arguments and see that it works. These two goals may not be compatible. I'm not in the position to say. Whatever makes the code more transparent and easily tested is what should be done. As for suppressing the import messages, I find the output of most R programs completely maddening. I cannot understand how one could possibly pick out the relevant problems when programs emit hundreds and thousands of lines of output. Are error messages at least segregated to STDERR? |
Let's focus the discussion here: @kyclark, you've identified a real bug in the argument parsing and provided a great-looking implementation of a fix. We appreciate this and want to merge your patch! The question is how to handle the added You're also proposing a fundamental change in how the workflow works and what uses it supports. This is a big topic that we've discussed a fair amount before (e.g. #1866 and links from it) and we'll need to very carefully consider everyone's use cases before any changes happen, so I suggest keeping this PR focused on fixing the argument parsing rather than try to resolve both at once. |
I can implement the changes but just want to make sure it is clear:
Is that correct? |
1: yes (I'm also not sure which steps of the workflow need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything additional that needs to be discussed on this PR, or are we just waiting for the requested changes to be implemented?
@kyclark checking in again to see when you'll be able to implement the requested changes on this PR |
Sorry, I did not realize my participation in this was still required. I feel very out of my depth with trying to suggest how "optparse" should be integrated. I believe the way that I structure and run code as a command-line, Python software developer is very different from how statisticians/researchers structure and use R code. For instance, I'm ultimately concerned with breaking code into the smallest possible reusable and testable functions and hiding data/variables in the smallest possible scope. Most of the R code I've seen in my life takes almost the opposite approach with no/very few functions and all data/vars in a global, mutable space. Whereas I tend to run programs in their entirety with all parameters provided via arguments, R programmers seem to use code interactively most esp from w/in RStudio where various sections can be run at-will, in varying orders, while mutating and inspecting values during run-time. Due to these differences, I think my suggestions to add "optparse" dependency and the "get_args()" function directly to the "workflow.R" seem to be at odds with how most people want the code organized. I feel strongly that each command-line program that accepts arguments should be structured in the same way; that is, each one should import the "optparse" module and use either a "get_args()" function or have that code incorporated into the body of the program itself. The "get_args()" function I showed is not a generic function that can be used by other modules/programs (unless, of course, all these programs have the same arguments). I would be happy to perhaps participate in a discussion about these ideas, but I think I can do no better than my original submission. I feel strongly that every command-line program ought to start with a "main()" function so as to move away from global variables. Programs should be broken into far smaller functions preferable that include tests. The reason to have a "get_args()" function is to encapsulate those ideas, hiding as much of the data and code as possible, and shorten the "main()" function. If these principles conflict with how the code is currently structured and developed, then I'm not really in a position to dictate how you incorporate the use of the "optparse" module. |
@kyclark I think the final disposition of the requested changes were summarized in the last two comments from @dlebauer and @infotroph.
None of us question the usefulness of optparse or get_args. If you feel strongly enough about the addition of |
Sorry, I don't quite understand "1. put get_args() in the settings package and add optparse under Imports". Does this mean to move "get_args()" outside the "workflow.R" file? I can submit a version with 2 and 3 implemented. |
Yes, we envisioned placing get_args in with the package that also reads the xml settings file used by workflow.R. If that's a sticking point for you we can focus on 2 and 3 and revisit this later. |
I've assigned to myself and will implement changes summarized by @mdietze #2626 (comment) |
I've sent a review of "read.settings()" to @dlebauer to discuss ideas in more detail. |
I've taken a first pass at fixing the changes from #2626 (comment). However I don't have PEcAn installed on my computer so I haven't run tests locally. |
base/settings/R/get_args.R
Outdated
default = ifelse(Sys.getenv("PECAN_SETTINGS") != "", | ||
Sys.getenv("PECAN_SETTINGS"), "pecan.xml"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sys.getenv can itself take a default value, so this simplifies to
default = ifelse(Sys.getenv("PECAN_SETTINGS") != "", | |
Sys.getenv("PECAN_SETTINGS"), "pecan.xml"), | |
default = Sys.getenv("PECAN_SETTINGS", "pecan.xml"), |
But this does miss one edge case: Sys.getenv only uses its default when the variable is unset; in the (perverse?) case where the var is set but its value is the empty string, it will return the empty string. So if it's important that PECAN_SETTINGS="" should result in using pecan.xml, reject this suggestion and keep the existing ifelse.
@dlebauer you need to update Roxygen
|
/document |
/document |
1 similar comment
/document |
This is now failing at the step where it's installing new dependencies:
I suspect this is related to @infotroph recent PR that added a dependency install |
d'oh! Yep, fixed in #2724 |
This is now failing with the following Warning
and I'm not sure how to change it to fix the build |
} | ||
|
||
if ((length(which(commandArgs() == "--advanced")) != 0) | ||
&& (PEcAn.utils::status.check("ADVANCED") == 0)) { | ||
&& (PEcAn.utils::status.check("ADVANCED") == 0)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like a strange indentation to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree and wish styler wouldn't do it, but am carefully not volunteering to figure out how 😉
Description
The current version of workflow.R incorrectly parses the command-line arguments for the given XML configuration file. Looking at the following section:
Per the documentation on
commandArgs
,args
will be:This program is intended to be executed like so:
This code is using
args[1]
which will actually be the path to the R executable, not the name of the "settings" XML document.The only reason the program succeeds in using the given XML is because the "read.settings" function in "base/settings/R/read.settings.R" separately parses the
commandArgs
for the presence of a "--settings" option and grabs the "input.xml" value.FWIW, I would strongly argue AGAINST this as this hidden behavior has allowed this bug to persist. The function should only use and validate the given "inputfile" and should definitely NOT parse the command-line arguments OR inspect Sys.getenv("PECAN_SETTINGS"), but that is another battle.
Motivation and Context
The "optparse" establishes the parameters for a program in a way that both validates the inputs and generates documentation. This program also, apparently, accepts a "--continue" flag, but the only way to know this is to read the source code itself. As written, this program could be run with any other arbitrary arguments without error.
By describing the parameters to "optparse," the program will now reject unknown arguments and also generate a "usage" statement when executed with "-h" or "--help":
Note that unknown arguments are rejected:
And invalid file arguments for "--settings" are rejected:
Review Time Estimate
Types of changes
Checklist:
Description
Motivation and Context
Review Time Estimate
Types of changes
Checklist: