Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optparse to workflow.R #2626

Merged
merged 26 commits into from
Oct 29, 2020
Merged

Add optparse to workflow.R #2626

merged 26 commits into from
Oct 29, 2020

Conversation

kyclark
Copy link
Contributor

@kyclark kyclark commented Jun 3, 2020

Description

The current version of workflow.R incorrectly parses the command-line arguments for the given XML configuration file. Looking at the following section:

args <- commandArgs(trailingOnly = TRUE)
if (is.na(args[1])) {
  settings <- PEcAn.settings::read.settings("pecan.xml")
} else {
  settings_file <- args[1]
  settings <- PEcAn.settings::read.settings(settings_file)
}

Per the documentation on commandArgs, args will be:

 A character vector containing the name of the executable and the
 user-supplied command line arguments.  The first element is the
 name of the executable by which R was invoked.  The exact form of
 this element is platform dependent: it may be the fully qualified
 name, or simply the last component (or basename) of the
 application, or for an embedded R it can be anything the
 programmer supplied.

This program is intended to be executed like so:

./workflow.R --settings input.xml

This code is using args[1] which will actually be the path to the R executable, not the name of the "settings" XML document.

The only reason the program succeeds in using the given XML is because the "read.settings" function in "base/settings/R/read.settings.R" separately parses the commandArgs for the presence of a "--settings" option and grabs the "input.xml" value.

FWIW, I would strongly argue AGAINST this as this hidden behavior has allowed this bug to persist. The function should only use and validate the given "inputfile" and should definitely NOT parse the command-line arguments OR inspect Sys.getenv("PECAN_SETTINGS"), but that is another battle.

Motivation and Context

The "optparse" establishes the parameters for a program in a way that both validates the inputs and generates documentation. This program also, apparently, accepts a "--continue" flag, but the only way to know this is to read the source code itself. As written, this program could be run with any other arbitrary arguments without error.

By describing the parameters to "optparse," the program will now reject unknown arguments and also generate a "usage" statement when executed with "-h" or "--help":

$ ./workflow.R -h
Usage: ./workflow.R [options]


Options:
	-s FILE, --settings=FILE
		Settings XML file

	-c, --continue
		Continue processing

	-h, --help
		Show this help message and exit

Note that unknown arguments are rejected:

$ ./workflow.R --foo bar
Error in getopt_options(object, args) :
  Error in getopt(spec = spec, opt = args) : long flag "foo" is invalid
Calls: main ... get_args -> parse_args -> parse_options -> getopt_options
Execution halted

And invalid file arguments for "--settings" are rejected:

$ ./workflow.R --settings foo.xml
Usage: ./workflow.R [options]


Options:
	-s FILE, --settings=FILE
		Settings XML file

	-c, --continue
		Continue processing

	-h, --help
		Show this help message and exit


Error in get_args() : --settings "foo.xml" not a valid file
Calls: main -> get_args
Execution halted

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Description

Motivation and Context

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

<!--- Provide a general summary of your changes in the Title above -->
<!--- Please select appropriate Priority, Status,and Type labels-->
<!--- If you do not have permission to select labels please state which labels you would like -->

## Description
<!--- Describe your changes in detail -->
The current version of workflow.R incorrectly parses the command-line arguments for the given XML configuration file. Looking at the following section:

```
args <- commandArgs(trailingOnly = TRUE)
if (is.na(args[1])) {
  settings <- PEcAn.settings::read.settings("pecan.xml")
} else {
  settings_file <- args[1]
  settings <- PEcAn.settings::read.settings(settings_file)
}
```
Per the documentation on `commandArgs`, `args` will be:

     A character vector containing the name of the executable and the
     user-supplied command line arguments.  The first element is the
     name of the executable by which R was invoked.  The exact form of
     this element is platform dependent: it may be the fully qualified
     name, or simply the last component (or basename) of the
     application, or for an embedded R it can be anything the
     programmer supplied.

This program is intended to be executed like so:

    ./workflow.R --settings input.xml

This code is using `args[1]` which will actually be the path to the R executable, not the name of the "settings" XML document.

The only reason the program succeeds in using the given XML is because the "read.settings" function in "base/settings/R/read.settings.R"  separately parses the `commandArgs` for the presence of a "--settings" option and grabs the "input.xml" value. 

FWIW, I would strongly argue AGAINST this as this hidden behavior has allowed this bug to persist. The function should only use and validate the given "inputfile" and should definitely NOT parse the command-line arguments OR inspect Sys.getenv("PECAN_SETTINGS"), but that is another battle.

## Motivation and Context
<!--- Why is this change required? What problem does it solve? -->
<!--- If it fixes an open issue, please link to the issue here. -->

The "optparse" establishes the parameters for a program in a way that both validates the inputs and generates documentation. This program also, apparently, accepts a "--continue" flag, but the only way to know this is to read the source code itself. As written, this program could be run with any other arbitrary arguments without error. 

By describing the parameters to "optparse," the program will now reject unknown arguments and also generate a "usage" statement when executed with "-h" or "--help":

```
$ ./workflow.R -h
Usage: ./workflow.R [options]


Options:
	-s FILE, --settings=FILE
		Settings XML file

	-c, --continue
		Continue processing

	-h, --help
		Show this help message and exit

```

Note that unknown arguments are rejected:

```
$ ./workflow.R --foo bar
Error in getopt_options(object, args) :
  Error in getopt(spec = spec, opt = args) : long flag "foo" is invalid
Calls: main ... get_args -> parse_args -> parse_options -> getopt_options
Execution halted
```

And invalid file arguments for "--settings" are rejected:

```
$ ./workflow.R --settings foo.xml
Usage: ./workflow.R [options]


Options:
	-s FILE, --settings=FILE
		Settings XML file

	-c, --continue
		Continue processing

	-h, --help
		Show this help message and exit


Error in get_args() : --settings "foo.xml" not a valid file
Calls: main -> get_args
Execution halted
```

## Review Time Estimate
<!---When do you want your code reviewed by?-->
- [ ] Immediately
- [ ] Within one week
- [x] When possible
## Types of changes
<!--- What types of changes does your code introduce? Put an `x` in all the boxes that apply: -->
- [ ] Bug fix (non-breaking change which fixes an issue) <!-- please add issue number -->
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)

## Checklist:
<!--- Go over all the following points, and put an `x` in all the boxes that apply. -->
<!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! -->
- [ ] My change requires a change to the documentation.
- [ ] I have updated the CHANGELOG.md.
- [ ] I have updated the documentation accordingly.
- [ ] I have read the **CONTRIBUTING** document.
- [ ] I have added tests to cover my changes.
- [ ] All new and existing tests passed.

<!--this template is from https://www.talater.com/open-source-templates/#/page/99-->
web/workflow.R Outdated
suppressMessages(library("PEcAn.all"))
suppressMessages(library("PEcAn.utils"))
suppressMessages(library("RCurl"))
suppressMessages(library("optparse"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a package that needs to to be installed first. I'm tempted to suggest adding it to base/utils since that is also where the status functions are.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robkooper did we ever document when a package should be added as 'Imports' vs 'Suggests'?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the parsing is neatly encapsulated already, I suggest moving the definition of get_args into a package (utils or workflow seem like the obvious candidates) and calling it from there — then folks reading the workflow file just see eg args <- PEcAn.utils::get_args(commandArgs()) and don’t have to grapple with the full function definition

Copy link
Contributor Author

@kyclark kyclark Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I entirely follow this, but I think I disagree. The code in get_args() could easily live in the main body of the program (or the main() that I wrote), but it needs to be a part of this program. It's not a generic function. It describes and validates the parameters to this program only.

FWIW, I'm cargo-culting my ideas from Python. I've written extensively on these ideas (see Tiny Python Projects), and Kristina saw me present on the first chapter where I spell out these ideas of encapsulating ideas into functions which hopefully can be tested, documenting command-line arguments, running integration tests for a program, etc. Many Perl/Python/R programs write all the code in the "main" namespace and don't bother with functions. This has the effect, e.g., of making all the variables global to the program which is not a best practice. I try to write small (<50 lines), pure (no global/environmental variables, hidden state, etc.) functions (and unit tests!) to try to break down my programs into understandable functions that can be composed.

Every command-line program I write in Python starts in a main() function and almost always has a get_args() function for handling the command-line arguments. I have written a Python program called new.py to create new programs with these functions:

https://github.com/kyclark/tiny_python_projects/blob/master/bin/new.py

The idea is to bring a standard interface to my programs and to adhere to the Python community's best practices. The "argparse" module is standard and widely used. Most programmers (esp coming from C languages, Java, etc.) would immediately understand starting in a main() function. Small functions tend to limit scoping of variables and make programs easier to understand AND TEST.

I feel strongly that every command-line program ought to use the same interface for documenting and validating the arguments. R's offerings in this space seem limited to "optparse" which I have demonstrated here. It's lacking in many of the features from Python's "argparse" such as automatically validating that an argument is a readable text file so you note that I had to check this manually. There appears to be an R package called "argparse" which in some way seems to rely on the existence of Python to handle the argument processing, so I'm not sure if this is a better or worse option.

In addition to validating the arguments, every command-line program ought to generate a "usage" statement when run with the standard "-h" and "--help" flags. The only reason I found out about the "--continue" flag was because I read the source code, which is an unreasonable expectation for the user/programmer. Additionally, it turns out that the read.settings() function ALSO parses the commandArgs() which seems highly unusual and undesirable as it could create other hidden, undocumented options for the program.

Only by manually tracing and reading all the source code could a person ever know all the options that the code accepts. This is not a best practice. ALL of the options should be explicitly stated in documentation/usage and should be VALIDATED IMMEDIATELY upon program execution. Every program should strive to fail as early as possible, e.g., if some file is required then verify both that the argument is present and is actually a readable file.

I would be happy to demonstrate my ideas using Python and trying to find reasonable parallels in R. I could try to explain unit and integration tests, and point out some of the flaws in dynamically typed languages like R/Python that can easily lead to unmaintainable code. I could explain some of the errors that I've seen in the current codebase with things like scoping of variables, impure and extremely long functions with far too many parameters, fatal errors that are ignored, lack of exceptions and traceback information that seems to make debugging R a particular bear.

In short, I would recommend adding "optparse" as a standard module for the project and encouraging every command-line program to adopt this. It's not necessary to have main() and get_args() functions, but just defining the program parameters can easily take up 100+ lines for a complicated program, so it could make the main body of the code easier to read if it's a separate function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm going to concur with @infotroph here. Definitely like the get_args function. Agree that it sees like it should go in a package, such as settings, which likewise only makes sense in the context of PEcAn's workflow. Agree that I'd prefer to not put the workflow in a main(). Agree that I don't want to suppress warnings. Would also add a request that command line args be documented in the model documentation as part of this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a generic function. It describes and validates the parameters to this program only.

This is a very good point that I wasn't thinking about! I completely understand the philosophical objection, but pragmatically there's plenty of precedent in PEcAn for choosing encapsulation over inline usage even when it's not fully generic -- c.f. PEcAn.DB::betyConnect, which hard-codes a very specific (and arguably Rube-Goldberg, but still useful!) set of database parameters.

Let's discuss your other points below and keep the line comments focused on how to handle this specific line of code.

web/workflow.R Outdated Show resolved Hide resolved
web/workflow.R Outdated Show resolved Hide resolved
@robkooper
Copy link
Member

can you update the changelog as well.

Copy link
Member

@dlebauer dlebauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robkooper to clarify are you requesting that Ken

  1. Add optparse to Imports in base/utils/DESCRIPTION (or to base/workflow/DESCRIPTION if we move functions there?)
  2. add get_args() and workflow.R to a new file like base/workflow/R/workflow.R?
  3. Update this to call PEcAn.workflow::workflow()?

If so, @kyclark let me know. I'm happy to do this rather than ask you to learn the mechanics of r package development!

web/workflow.R Outdated Show resolved Hide resolved
web/workflow.R Outdated Show resolved Hide resolved
web/workflow.R Outdated Show resolved Hide resolved
web/workflow.R Outdated
suppressMessages(library("PEcAn.all"))
suppressMessages(library("PEcAn.utils"))
suppressMessages(library("RCurl"))
suppressMessages(library("optparse"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robkooper did we ever document when a package should be added as 'Imports' vs 'Suggests'?

web/workflow.R Outdated Show resolved Hide resolved
web/workflow.R Outdated Show resolved Hide resolved
web/workflow.R Outdated Show resolved Hide resolved
Copy link
Member

@infotroph infotroph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support fixing the argument parsing and like this implementation. But I’d pretty strongly prefer that the body of the workflow not be wrapped in a function, for two reasons: 1) I often find it handy to step through the workflow interactively and this makes that harder; 2)
adding another call level adds a lot of complexity to the error tracebacks (this is an indictment of R’s tracebacks IMO), and I’m not completely sure the error handlers will work right with it either.

I also mildly prefer that package load messages not be suppressed, but this is mostly an aesthetic opinion.

@kyclark
Copy link
Contributor Author

kyclark commented Jun 4, 2020

@robkooper I'm definitely unsure how these ideas ought to be adopted. I barely understand this codebase and how it is used, e.g., @infotroph thinks that adding a main() function would make this code harder to use interactively. Maybe all the code in this program belongs in some library for actually running the workflow, but it would still require creating some function with the parameters explicitly being passed rather than manually (AND INCORRECTLY!) parsing commandArgs() which is currently the case with workflow.R.

However the code is broken up, there will still be command-line programs that need to respond to and use command-line arguments, produce documentation, etc. These programs, e.g., workflow.R should have "optparse" (or something like it) to handle the validation and docs. They may pass those arguments to libraries for execution, but those libraries ought not to also parse commandArgs(). This creates hidden and unexpected behavior and will make any testing a nightmare. Functions should only act upon the parameters explicitly passed and should never rely on global or environmental variables.

@infotroph notes the desire to run R code interactively, which I never do. I'm looking at this as a developer who wants to run the code from the command-line with all the arguments and see that it works. These two goals may not be compatible. I'm not in the position to say. Whatever makes the code more transparent and easily tested is what should be done.

As for suppressing the import messages, I find the output of most R programs completely maddening. I cannot understand how one could possibly pick out the relevant problems when programs emit hundreds and thousands of lines of output. Are error messages at least segregated to STDERR?

@infotroph
Copy link
Member

Let's focus the discussion here:

@kyclark, you've identified a real bug in the argument parsing and provided a great-looking implementation of a fix. We appreciate this and want to merge your patch! The question is how to handle the added optparse dependency, and in PEcAn the answer is clear: We handle dependencies by declaring them in the appropriate PEcAn package.

You're also proposing a fundamental change in how the workflow works and what uses it supports. This is a big topic that we've discussed a fair amount before (e.g. #1866 and links from it) and we'll need to very carefully consider everyone's use cases before any changes happen, so I suggest keeping this PR focused on fixing the argument parsing rather than try to resolve both at once.

@dlebauer
Copy link
Member

dlebauer commented Jun 5, 2020

I can implement the changes but just want to make sure it is clear:

  1. put get_args() in the settings package and add optparse under Imports
  2. remove the main <- function(){} wrapper around the workflow()
  3. replace library(optparse) with optparse::function syntax

Is that correct?

@infotroph
Copy link
Member

1: yes
2: yes
3: yes, and it looks like get_args uses optparse:: everywhere already, so 👍
4. remove the suppressMessages calls, and while we're at it drop library(PEcAn.utils) -- PEcAn.all loads it anyway.

(I'm also not sure which steps of the workflow need library(RCurl), but that's probably a separate PR.)

Copy link
Member

@mdietze mdietze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything additional that needs to be discussed on this PR, or are we just waiting for the requested changes to be implemented?

@mdietze
Copy link
Member

mdietze commented Jun 29, 2020

@kyclark checking in again to see when you'll be able to implement the requested changes on this PR

@kyclark
Copy link
Contributor Author

kyclark commented Jun 30, 2020

Sorry, I did not realize my participation in this was still required. I feel very out of my depth with trying to suggest how "optparse" should be integrated. I believe the way that I structure and run code as a command-line, Python software developer is very different from how statisticians/researchers structure and use R code. For instance, I'm ultimately concerned with breaking code into the smallest possible reusable and testable functions and hiding data/variables in the smallest possible scope. Most of the R code I've seen in my life takes almost the opposite approach with no/very few functions and all data/vars in a global, mutable space. Whereas I tend to run programs in their entirety with all parameters provided via arguments, R programmers seem to use code interactively most esp from w/in RStudio where various sections can be run at-will, in varying orders, while mutating and inspecting values during run-time.

Due to these differences, I think my suggestions to add "optparse" dependency and the "get_args()" function directly to the "workflow.R" seem to be at odds with how most people want the code organized. I feel strongly that each command-line program that accepts arguments should be structured in the same way; that is, each one should import the "optparse" module and use either a "get_args()" function or have that code incorporated into the body of the program itself. The "get_args()" function I showed is not a generic function that can be used by other modules/programs (unless, of course, all these programs have the same arguments).

I would be happy to perhaps participate in a discussion about these ideas, but I think I can do no better than my original submission. I feel strongly that every command-line program ought to start with a "main()" function so as to move away from global variables. Programs should be broken into far smaller functions preferable that include tests. The reason to have a "get_args()" function is to encapsulate those ideas, hiding as much of the data and code as possible, and shorten the "main()" function. If these principles conflict with how the code is currently structured and developed, then I'm not really in a position to dictate how you incorporate the use of the "optparse" module.

@mdietze
Copy link
Member

mdietze commented Jun 30, 2020

@kyclark I think the final disposition of the requested changes were summarized in the last two comments from @dlebauer and @infotroph.

  1. put get_args() in the settings package and add optparse under Imports
  2. remove the main <- function(){} wrapper around the workflow
  3. remove the suppressMessages calls, and drop library(PEcAn.utils)

None of us question the usefulness of optparse or get_args. If you feel strongly enough about the addition of main() and suppressMessages that you are unwilling to separate that request from optparse then you should close this PR, but I personally don't see why those issues need to be bundled with an improvement to arg parsing.

@kyclark
Copy link
Contributor Author

kyclark commented Jun 30, 2020

Sorry, I don't quite understand "1. put get_args() in the settings package and add optparse under Imports". Does this mean to move "get_args()" outside the "workflow.R" file? I can submit a version with 2 and 3 implemented.

@mdietze
Copy link
Member

mdietze commented Jun 30, 2020

Yes, we envisioned placing get_args in with the package that also reads the xml settings file used by workflow.R. If that's a sticking point for you we can focus on 2 and 3 and revisit this later.

@dlebauer dlebauer self-assigned this Jun 30, 2020
@dlebauer
Copy link
Member

I've assigned to myself and will implement changes summarized by @mdietze #2626 (comment)

@kyclark
Copy link
Contributor Author

kyclark commented Jul 1, 2020

I've sent a review of "read.settings()" to @dlebauer to discuss ideas in more detail.

@dlebauer
Copy link
Member

dlebauer commented Aug 7, 2020

I've taken a first pass at fixing the changes from #2626 (comment). However I don't have PEcAn installed on my computer so I haven't run tests locally.

Comment on lines 15 to 16
default = ifelse(Sys.getenv("PECAN_SETTINGS") != "",
Sys.getenv("PECAN_SETTINGS"), "pecan.xml"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sys.getenv can itself take a default value, so this simplifies to

Suggested change
default = ifelse(Sys.getenv("PECAN_SETTINGS") != "",
Sys.getenv("PECAN_SETTINGS"), "pecan.xml"),
default = Sys.getenv("PECAN_SETTINGS", "pecan.xml"),

But this does miss one edge case: Sys.getenv only uses its default when the variable is unset; in the (perverse?) case where the var is set but its value is the empty string, it will return the empty string. So if it's important that PECAN_SETTINGS="" should result in using pecan.xml, reject this suggestion and keep the existing ifelse.

@mdietze
Copy link
Member

mdietze commented Aug 8, 2020

@dlebauer you need to update Roxygen

check for out-of-date Rd files
 M base/settings/NAMESPACE
 M docker/depends/pecan.depends
?? base/settings/man/get_args.Rd

@infotroph
Copy link
Member

/document

@robkooper
Copy link
Member

/document

1 similar comment
@robkooper
Copy link
Member

/document

@mdietze
Copy link
Member

mdietze commented Oct 27, 2020

This is now failing at the step where it's installing new dependencies:

Warning in install.packages(missing, lib = rlib) :
  'lib = "~/R/x86_64-pc-linux-gnu-library/4.0"' is not writable
Error in install.packages(missing, lib = rlib) : 
  unable to install packages
Execution halted

I suspect this is related to @infotroph recent PR that added a dependency install

@infotroph
Copy link
Member

I suspect this is related to @infotroph recent PR that added a dependency install

d'oh! Yep, fixed in #2724

@mdietze
Copy link
Member

mdietze commented Oct 28, 2020

This is now failing with the following Warning

prepare_Rd: get_args.Rd:9-11: Dropping empty section \value

and I'm not sure how to change it to fix the build

@infotroph
Copy link
Member

@mdietze Heeeeyyyy, the build is back to catching real code errors! This was because of the empty @return in the Roxygen comments.

I took at stab at fixing this (f3f9e19) and ran styler on the file while I was at it (51e4bbd); @dlebauer does the new description look OK?

@mdietze mdietze merged commit 0cb58b5 into develop Oct 29, 2020
}

if ((length(which(commandArgs() == "--advanced")) != 0)
&& (PEcAn.utils::status.check("ADVANCED") == 0)) {
&& (PEcAn.utils::status.check("ADVANCED") == 0)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a strange indentation to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree and wish styler wouldn't do it, but am carefully not volunteering to figure out how 😉

@infotroph infotroph deleted the kyclark-workflow.R branch November 2, 2020 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants