Simplify test data groups #418
I apologize if some of these points are already addressed in other parts of the spec.
| This means that for any two test cases, if their input, output validator arguments and the contents of their `.ans` files are equivalent, then the test cases are equivalent. | ||
| The assumption of determinism means that a judge system could choose to reuse the result of a previous run, or to re-run the equivalent test case. | ||
| Test cases and [test data groups](#test-data-groups) will be used in lexicographical order on base name. |
Should we change "...will be used..." to "...will be used/displayed..."?
I don't think so. I feel that "display" is out of scope; what I intended to convey with "used" is that test cases will be "run" in that order. But then @RagnarGrootKoerkamp and @eldering will complain (and they will be correct), because you are absolutely allowed to run test cases (somewhat speculatively) in parallel. So technically it should say something like: test cases should be run, and the results used, "as if they were run in that order". I have not found a way to write that without it being much harder to understand.
...
Maybe we should just say, "will be run in lexicographical order".
and then add:
A system using a problem package is free to run things in any order, for example when running things in parallel, but the result must be the same as if the test cases were run in the expected order.
?
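To make the "as if they were run in that order" idea concrete, here is a minimal sketch, not anything from the spec. It assumes a hypothetical `run_case` callback that returns a verdict string, with `"AC"` meaning accepted, and it uses Python's default string ordering as a stand-in for lexicographical order on base name. The cases execute in parallel, but the reported result is the first non-accepted case in sorted order.

```python
from concurrent.futures import ThreadPoolExecutor


def first_failure(test_cases, run_case, max_workers=4):
    """Run cases in parallel; report the result as if run in sorted order.

    `run_case` and the "AC" verdict string are illustrative assumptions.
    """
    ordered = sorted(test_cases)  # stand-in for lexicographical order on base name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whatever the completion order.
        verdicts = list(pool.map(run_case, ordered))
    for name, verdict in zip(ordered, verdicts):
        if verdict != "AC":
            return name, verdict
    return None, "AC"
```

A real judge could additionally cancel later cases once an earlier one fails; the sketch omits that to keep the ordering guarantee easy to see.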
Maybe "will be used in lexicographical order for determining the result"?
| #### Maximum Score Inference | ||
| For `secret`, all test data groups, and every test case in a group with `sum` or `min` aggregation, there is a maximum possible score. | ||
| The default value of `score` for `secret` is 100. | ||
| The default value of `score` for test data groups is `unbounded`. |
In a future PR, I think we should consider not having a default value and forcing judges to explicitly assign scores.
I was thinking something similar when writing this. I would leave unbounded as a default though, because then you can make score just an integer.
Also, it makes sense to me that "everything is just unbounded" is the result of not specifying any max score. Not having a max score kinda literally means that it's unbounded.
| For `secret`, all test data groups, and every test case in a group with `sum` or `min` aggregation, there is a maximum possible score. | ||
| The default value of `score` for `secret` is 100. | ||
| The default value of `score` for test data groups is `unbounded`. | ||
| Test data groups may only have `unbounded` maximum score if `secret` is unbounded. |
We should probably disallow "unreasonable" score assignments. If we accept score > max_score as a guarantee (there seems to be a strong consensus that score should either act as clamp or judge error), then we could catch trivial mistakes such as the root having score 100 and sum aggregation, but the sum of all testgroups being smaller or larger than 100.
Yes, agreed.
If we chose "clamp", then there are no unreasonable assignments (kinda); we would just clamp it.
If we choose JE (which is the current state), then the following should be disallowed:
- aggregation `min` and any child is greater
- aggregation `sum` and sum of children is greater
Should we also disallow the non-breaking, but arguably somewhat unreasonable things like
- aggregation `min` and all children are strictly smaller
- aggregation `sum` and sum of children is strictly smaller
?
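For illustration, the four cases could be checked roughly like this. This is only a sketch under assumptions: all scores are plain numbers (no `unbounded`), and the function name, parameters, and message strings are hypothetical, not part of the spec or any existing tool.

```python
def check_group_scores(aggregation, group_score, child_scores):
    """Return a list of problems with a group's declared maximum score.

    Hypothetical helper; assumes numeric scores only (no `unbounded`).
    """
    problems = []
    if aggregation == "sum":
        total = sum(child_scores)
        if total > group_score:
            problems.append("sum of children is greater")
        elif total < group_score:
            problems.append("sum of children is strictly smaller")
    elif aggregation == "min":
        if any(child > group_score for child in child_scores):
            problems.append("a child is greater")
        # Guard against an empty group, where all() would be vacuously true.
        if child_scores and all(child < group_score for child in child_scores):
            problems.append("all children are strictly smaller")
    else:
        raise ValueError(f"unknown aggregation: {aggregation!r}")
    return problems
```

For example, `check_group_scores("sum", 100, [40, 50])` reports that the sum of the children is strictly smaller, which is the "root has score 100 but the groups do not add up" mistake mentioned above.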
I definitely think we should disallow all 4 cases
Another thing to consider: if you take "forcing judges to explicitly assign group scores" to its logical conclusion, is there even a point in specifying a score for secret? I would say there still is, as it allows us to do the 4 above sanity checks.
I agree, for exactly that reason.
| as well as the `args` sequence in the `.yaml` file, then the input of the two test cases is equivalent. | ||
| This means that for any two test cases, if their input, output validator arguments | ||
| and the contents of their `.ans` files are equivalent, then the test cases are equivalent. | ||
| This means that for any two test cases, if their input, output validator arguments and the contents of their `.ans` files are equivalent, then the test cases are equivalent. |
Is there a reason to write "equivalent" instead of "equal" here?
No, I think "equal" would be better.
Also, row 557 is kinda repeating what's in 555-556.
This PR does the following:
- `/secret` (Restricting depth of test data tree #410)
- `test_group.yaml` is not inherited
- `hint` and `description` from `test_group.yaml`
- `score`
- `score`, i.e. no max score inference (Consider dropping maximum score inference #414)

Some oddities due to these changes:
- `test_group.yaml` is not really a good name anymore.
- `/secret` (because they become groups)
- `/secret` a group or only the groups under `/secret`, currently the usage is mixed.

I think the oddities can (and should) be fixed in follow-ups, if we feel that the large strokes of this is the right way to go.
Closes: #410
Closes: #414