Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed approach for testing CLI arg parsing #1566

Merged
merged 5 commits into from Mar 17, 2024

Conversation

veekaybee
Copy link
Contributor

See discussion here: #1518

Here's an approach to start testing CLI argument parsing:

  1. Separate out setting up the argument parser in parse_eval_args into a separate method, setup_parser that gets called in parse_eval_args
  2. Create unit tests that call the parser for each of the command line arguments
  3. Adding specific TypeError exceptions at each argument entrypoint in the cli_evaluate method

Let me know what you think about this approach. If it seems reasonable, I'll add the tests for the rest of the methods and exceptions where it's reasonable.

@LSinev @haileyschoelkopf

@LSinev
Copy link
Contributor

LSinev commented Mar 12, 2024

Combination of HFArgumentParser from transformers with args setup through dataclass like https://github.com/huggingface/transformers/blob/main/examples/research_projects/wav2vec2/run_asr.py#L343 and the __post_init__ value check like in video (link with timecode) https://youtu.be/zN4VCb0LbQI?t=592
But this still may not solve points that follow.

As for the current code, testing the parser the way presented seems like testing the argument parser, not the code of this repo module. We put 5 to something that should be a number and it works. In this case it might be useful to check that it always fails if the input is like --numshots five. What are the cases, which will fail at new written tests, which will not fail inside ArgumentParser?

The `try... except' example here seems to be overreacting to an already solved case — no prevention of new failures. Some future failures may be prevented (though this hypothesis should be tested by turning on failed code and rechecking) after mypy checks are turned back on (even for tests).

@veekaybee
Copy link
Contributor Author

Thanks for the feedback @LSinev ! You're right that these cases don't necessarily cover what we'd like. After thinking about this and checking the videos and the links, I decided to take a different approach and unit test whether each CLI argument, with the exception of booleans, has a type.

That way, if you input one without a type unit tests won't pass and if it's a boolean you'll have to delcare a default anyway. Let me know what you think about this approach.

@LSinev
Copy link
Contributor

LSinev commented Mar 14, 2024

This seems to be a much better approach.
By the way, some boolean cli arguments may be also set like

    parser.add_argument(
        "--some_boolean_arg",
        type=bool,
        default=True,
        help="do something good",
        action=argparse.BooleanOptionalAction,  # type: ignore[attr-defined]
    )

which also adds --no-some_boolean_arg. Mentioning this way in case you want check those too.

@veekaybee
Copy link
Contributor Author

Thanks!

parser.add_argument(
"--some_boolean_arg",
type=bool,
default=True,
help="do something good",
action=argparse.BooleanOptionalAction, # type: ignore[attr-defined]
)

I checked these and decided not to add a test for them since we use the store_true pattern generally in all our arguments and it makes sense to standardize on this, what do you think?

@veekaybee veekaybee marked this pull request as ready for review March 14, 2024 16:06
@LSinev
Copy link
Contributor

LSinev commented Mar 14, 2024

Standardization is good for future improvements and development. Even more, after reading the documentation I see that BooleanOptionalAction is only available since python 3.9, so it is of no use as this repo should support 3.8 as well. But I am not sure if this store_true pattern with default=True is OK:

   parser.add_argument(
       "--trust_remote_code",
       default=True,
       action="store_true",
       help="Sets trust_remote_code to True to execute code to create HF Datasets from the Hub",
   )

with or without this argument, the code is trusted by default. I don't know if this pattern adds an option of --no-trust_remote_code (and also if it depends on the Python version).

@veekaybee
Copy link
Contributor Author

The behavior of store_true seems somewhat confusing in general. We override to true in the case of the default and respect the user's settings, but if we don't set the default to True, then it defaults to False, at least in 3.9: https://gist.github.com/veekaybee/2c8769789a90f219dc83a9e681773000

Ironically this is the default behavior of the module 😅 . I figured from that perspective, it was better to explicitly set it (explicit is better than implicit, etc, zen of python) even though we handle it later downstream. I can also check 3.8 if that. helps

@LSinev
Copy link
Contributor

LSinev commented Mar 14, 2024

I think, your gist example/test may be more insightful with parsing of same set of args (and also setup when no args is provided) by all three defined parsers.

I am a bit confused here. As far as I understand now, after this PR (with default=True for some boolean store_true arguments) merged, calling lm_eval with some arguments from commandline, considering --trust_remote_code will have effect on datasets invocation like:

Command Trust to remote code state
(some arguments but no --trust_remote_code at all) True
--trust_remote_code True
--trust_remote_code false False
--trust_remote_code true True
--trust_remote_code 0 False
--trust_remote_code 1 True

Also I suppose (in this case) user or any system calling from commandline consider equivalent all typical ways of setting True (1, true, T, True, TRUE, on, On, ON, Y, y, yes, Yes, YES) and False (0, false, F, False, FALSE, off, Off, OFF, N, n, no, No, NO). Is this implied somehow, or may be tested also?

I checked one with ipython 3.8

In [1]: import argparse

In [2]: import os

In [3]: parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)

In [4]: parser.add_argument(
   ...:        "--trust_remote_code",
   ...:               default=True,
   ...:                      action="store_true",
   ...:                             help="Sets trust_remote_code to True to execute code to create HF Datasets from the
   ...:  Hub",
   ...:                                )
Out[4]: _StoreTrueAction(option_strings=['--trust_remote_code'], dest='trust_remote_code', nargs=0, const=True, default=True, type=None, choices=None, help='Sets trust_remote_code to True to execute code to create HF Datasets from the Hub', metavar=None)

and then

In [20]: parser.parse_args([])
Out[20]: Namespace(trust_remote_code=True)

In [21]: parser.parse_args(['--trust_remote_code'])
Out[21]: Namespace(trust_remote_code=True)

Seems, there is no way to turn off trust in remote code. I tried some ways to set false and didn't find any.

Without default=True I thought is like

Command Trust to remote code state
(some arguments but no --trust_remote_code at all) False
--trust_remote_code True

No confusion for user, just having key/argument set — turns something on, and no trust to remote code if not specified.

Found big discussion with many ways to implement (still no pre-commit check which I was actually looking for): https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse

@veekaybee
Copy link
Contributor Author

veekaybee commented Mar 14, 2024

It seems like you mention, we might want to test this flag specifically - I'll see what I can add from a testing perspective to cover store_true flags as they are currently implemented (with the intention of keeping code/behavior changes as minimal as possible)

Based on the thread you posted, this looks like the easiest and most accepted answer https://stackoverflow.com/a/59579733.

In looking at how HF implements this, they take a similar approach:https://github.com/huggingface/transformers/blob/11bbb505c77a1d29370cf16a964cfe73b7a76340/src/transformers/hf_argparser.py#L34C5-L34C19

so we could go this way too if we wanted.

@LSinev
Copy link
Contributor

LSinev commented Mar 14, 2024

the easiest and most accepted answer

If sorted by highest score, there are more interesting answers.

Leaving argument like it was before, for me seems the best for now

    parser.add_argument(
        "--trust_remote_code",
        action="store_true",
        help="Sets trust_remote_code to True to execute code to create HF Datasets from the Hub",
    )

no argument — no trust, and that's all.

@veekaybee
Copy link
Contributor Author

👍 Works for me, I just changed the two args that take it, but am keeping the ones added for args in cases where action is not store_true, such as model.

@haileyschoelkopf
Copy link
Contributor

Hi @veekaybee ! This approach looks good to me.

And agree we should leave the store_true args as is, as was decided here! The desired behavior is for passing --trust_remote_code to set it to True and if not provided to be False otherwise.

@veekaybee
Copy link
Contributor Author

Thanks so much both for your discussion and comments! This PR is now ready for review.

Copy link
Contributor

@haileyschoelkopf haileyschoelkopf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything LGTM!

@haileyschoelkopf haileyschoelkopf merged commit 92f30af into EleutherAI:main Mar 17, 2024
8 checks passed
artemorloff pushed a commit to artemorloff/lm-evaluation-harness that referenced this pull request Apr 4, 2024
* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true
nightingal3 pushed a commit to mycoalchen/lm-evaluation-harness that referenced this pull request May 2, 2024
* New tests for CLI args

* fix spacing

* change tests for parsing

* add tests, fix parser

* remove defaults for store_true
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants