Simplify test data groups #418

niemela · 2025-04-03T19:33:03Z

This PR does the following:

groups can only contain groups or cases, not both (Restricting depth of test data tree #410)
there is only one level of groups below /secret (Restricting depth of test data tree #410)
test_group.yaml is not inherited
remove hint and description from test_group.yaml
require static cases to always specify score
groups must specify score, i.e. no max score inference (Consider dropping maximum score inference #414)
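
As an illustration of the structural rules above, here is a minimal sketch of a layout check. It is not taken from the spec or from any existing tool: the in-memory `Tree` model and the names `layout_problems` and `cases_in` are hypothetical, and it assumes a test case is identified by its `.in` file.

```python
from typing import Optional

# Hypothetical in-memory model of a directory tree (not from the spec):
# a directory is a dict mapping entry names to subtrees, a file is None.
Tree = dict[str, Optional["Tree"]]


def layout_problems(secret: Tree) -> list[str]:
    """Return problems with the direct children of `secret`."""
    groups = sorted(name for name, sub in secret.items() if sub is not None)
    # Assumption: a test case is identified by its `.in` file.
    cases = sorted(name for name, sub in secret.items()
                   if sub is None and name.endswith(".in"))
    problems = []
    if groups and cases:
        problems.append(f"secret mixes groups {groups} and cases {cases}")
    return problems


def cases_in(group: Tree, prefix: str = "") -> list[str]:
    """Collect case base names below a group.

    Directories inside a group only organize cases; they are not groups.
    """
    found = []
    for name, sub in group.items():
        if sub is not None:
            found.extend(cases_in(sub, f"{prefix}{name}/"))
        elif name.endswith(".in"):
            found.append(f"{prefix}{name[:-3]}")
    return sorted(found)
```

With this model, a directory directly under `secret` is always a group, while a directory inside a group just contributes its cases to that group.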

Some oddities due to these changes:

test_group.yaml is not really a good name anymore.
you can organize your test cases in directories freely in groups, but not in /secret (because they become groups)
the terminology used for groups is a little inconsistent right now. Specifically, is /secret a group, or only the groups under /secret? Currently the usage is mixed.

I think the oddities can (and should) be fixed in follow-ups, if we feel that the broad strokes of this are the right way to go.

Closes: #410
Closes: #414

Matistjati

I apologize if some of these points are already addressed in other parts of the spec.

spec/2023-07-draft.md

Matistjati · 2025-04-04T00:53:28Z

spec/2023-07-draft.md

+This means that for any two test cases, if their input, output validator arguments and the contents of their `.ans` files are equivalent, then the test cases are equivalent.
 The assumption of determinism means that a judge system could choose to reuse the result of a previous run, or to re-run the equivalent test case.

+Test cases and [test data groups](#test-data-groups) will be used in lexicographical order on base name.


Should we change "...will be used..." to "...will be used/displayed..."?

I don't think so. I feel that "display" is out of scope; what I intended to convey with "used" is that test cases will be "run" in that order. But then @RagnarGrootKoerkamp and @eldering will complain (and they will be correct), because you are absolutely allowed to run test cases (somewhat speculatively) in parallel. So technically it should say something like: test cases should be run, and the results used, "as if they were run in that order". I have not found a way to write that without making it much harder to understand.

...

Maybe we should just say, "will be run in lexicographical order".

and then add:

A system using a problem package is free to run things in any order, for example when running things in parallel, but the result must be the same as if the test cases were run in the expected order.

?

Maybe "will be used in lexicographical order for determining the result"?
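
To make the "as if they were run in that order" idea concrete, here is a minimal sketch, not a proposal for spec wording. The function `judge`, the `run_case` callback, and the verdict strings are hypothetical, and it assumes that the first non-accepted verdict in order decides the result and that Python's default string ordering is an acceptable stand-in for the intended lexicographical order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def judge(cases: list[str], run_case: Callable[[str], str]) -> str:
    """Run cases concurrently, then report as if they ran in sorted order."""
    # Assumption: plain string comparison models "lexicographical order
    # on base name".
    ordered = sorted(cases)
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in the order of `ordered`,
        # regardless of which case happens to finish first.
        verdicts = list(pool.map(run_case, ordered))
    # Assumption: the first non-accepted verdict, in order, is the result.
    for verdict in verdicts:
        if verdict != "AC":
            return verdict
    return "AC"
```

The execution order is left to the thread pool; only the order in which the verdicts are consumed is fixed, which is the property the proposed wording tries to capture.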

Matistjati · 2025-04-04T01:01:11Z

spec/2023-07-draft.md

-#### Maximum Score Inference
+For `secret`, all test data groups, and every test case in a group with `sum` or `min` aggregation, there is a maximum possible score.
+The default value of `score` for `secret` is 100.
+The default value of `score` for test data groups is `unbounded`.


In a future PR, I think we should consider not having a default value and forcing judges to explicitly assign scores.

I was thinking something similar when writing this. I would leave unbounded as a default though, because then you can make score just an integer.

Also, it makes sense to me that "everything is just unbounded" is the result of not specifying any max score. Not having a max score kind of literally means that it's unbounded.

Matistjati · 2025-04-04T01:02:44Z

spec/2023-07-draft.md

+For `secret`, all test data groups, and every test case in a group with `sum` or `min` aggregation, there is a maximum possible score.
+The default value of `score` for `secret` is 100.
+The default value of `score` for test data groups is `unbounded`.
+Test data groups may only have `unbounded` maximum score if `secret` is unbounded.


We should probably disallow "unreasonable" score assignments. If we accept score > max_score as a guarantee (there seems to be a strong consensus that score should act either as a clamp or as a judge error), then we could catch trivial mistakes such as the root having score 100 and sum aggregation, but the sum of all test groups being smaller or larger than 100.

Yes, agreed.

If we choose "clamp", then there are no unreasonable assignments (kind of); we will just clamp it.

If we choose JE (which is the current state), then the following should be disallowed:

aggregation min and any child is greater

aggregation sum and sum of children is greater

Should we also disallow the non-breaking, but arguably somewhat unreasonable, things like the following?

aggregation min and all children are strictly smaller

aggregation sum and sum of children is strictly smaller

I definitely think we should disallow all 4 cases

Another thing to consider: if you take "forcing judges to explicitly assign group scores" to its logical conclusion, is there even a point in specifying a score for secret? I would say there still is, as it allows us to do the 4 above sanity checks.

I agree, for exactly that reason.
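
The four checks discussed above could be sketched roughly as follows. This is only an illustration of the cases as worded in this thread, not agreed spec behavior: the function `score_problems` is hypothetical, and modelling `unbounded` as `math.inf` is an assumption.

```python
import math

UNBOUNDED = math.inf  # assumption: model `unbounded` as infinity


def score_problems(aggregation: str, parent: float,
                   children: list[float]) -> list[str]:
    """Return violations of the four checks, given maximum scores."""
    if not children:
        return []
    problems = []
    if aggregation == "sum":
        total = sum(children)
        if total > parent:
            problems.append(f"sum of children ({total}) is greater than {parent}")
        elif total < parent:
            problems.append(f"sum of children ({total}) is smaller than {parent}")
    elif aggregation == "min":
        # Follows the thread's wording: flag when any child is greater.
        if any(child > parent for child in children):
            problems.append(f"a child is greater than {parent}")
        elif all(child < parent for child in children):
            problems.append(f"all children are smaller than {parent}")
    else:
        raise ValueError(f"unknown aggregation: {aggregation}")
    return problems
```

Under this model, a bounded `secret` with an unbounded group is reported by the same checks, which matches the rule that groups may only be `unbounded` if `secret` is.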

eldering · 2025-04-06T12:17:18Z

spec/2023-07-draft.md

+This means that for any two test cases, if their input, output validator arguments and the contents of their `.ans` files are equivalent, then the test cases are equivalent.
 The assumption of determinism means that a judge system could choose to reuse the result of a previous run, or to re-run the equivalent test case.

+Test cases and [test data groups](#test-data-groups) will be used in lexicographical order on base name.


Maybe "will be used in lexicographical order for determining the result"?

eldering · 2025-04-06T12:17:46Z

spec/2023-07-draft.md

 as well as the `args` sequence in the `.yaml` file, then the input of the two test cases is equivalent.
-This means that for any two test cases, if their input, output validator arguments
-and the contents of their `.ans` files are equivalent, then the test cases are equivalent.
+This means that for any two test cases, if their input, output validator arguments and the contents of their `.ans` files are equivalent, then the test cases are equivalent.


Is there a reason to write "equivalent" instead of "equal" here?

No, I think "equal" would be better.

Also, row 557 is kinda repeating what's in 555-556.
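
For context on why equality of test cases matters, here is a minimal sketch of how a judge system might key a result cache on the fields named in the quoted spec text. The function `case_key` and its parameters are hypothetical; it assumes a test case's input can be reduced to the `.in` contents plus the `args` sequence, which simplifies whatever the full spec counts as input.

```python
import hashlib


def _feed(digest, chunk: bytes) -> None:
    # Length-prefix each chunk so that field boundaries are unambiguous.
    digest.update(len(chunk).to_bytes(8, "big"))
    digest.update(chunk)


def case_key(input_data: bytes, args: list[str],
             validator_args: list[str], answer: bytes) -> str:
    """Return a key that is identical for equal test cases."""
    digest = hashlib.sha256()
    _feed(digest, input_data)
    for sequence in (args, validator_args):
        # Record the item count so items cannot shift between sequences.
        _feed(digest, str(len(sequence)).encode())
        for item in sequence:
            _feed(digest, item.encode())
    _feed(digest, answer)
    return digest.hexdigest()
```

A judge could then reuse a stored result whenever two test cases produce the same key, or ignore the cache and re-run; the assumption of determinism allows either.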

Matistjati reviewed Apr 4, 2025

niemela added 8 commits April 4, 2025 10:32

Reorder test data sections

f77c7e9

Split out test_group.yaml section

33eb667

Drop hint and description from test_group.yaml

6722011

Rewrite test data section

cf4df3e

Remove max score inference

a0b66c8

Rewrite case and group scoring

7130d10

Require score to be specified for static validators

8d8c2dd

Fix comments from Joshua

5ab340c

niemela force-pushed the simplify-groups branch from 077e1a2 to 5ab340c Compare April 4, 2025 08:32

niemela merged commit 0a7c170 into master Apr 5, 2025

niemela deleted the simplify-groups branch April 5, 2025 05:11

eldering reviewed Apr 6, 2025

niemela mentioned this pull request Apr 7, 2025

When should test group results be shown to the end user? #412

Closed
