@niemela
Member

@niemela niemela commented Apr 3, 2025

This PR does the following:

Some oddities due to these changes:

  • test_group.yaml is not really a good name anymore.
  • You can organize your test cases freely in directories inside groups, but not directly in /secret (because directories there become groups).
  • The terminology used for groups is a little bit inconsistent right now. Specifically, is /secret itself a group, or are only the groups under /secret groups? Currently the usage is mixed.

I think the oddities can (and should) be fixed in follow-ups, if we feel that the broad strokes of this are the right way to go.

Closes: #410
Closes: #414

Collaborator

@Matistjati Matistjati left a comment

I apologize if some of these points are already addressed in other parts of the spec.

This means that for any two test cases, if their input, output validator arguments and the contents of their `.ans` files are equivalent, then the test cases are equivalent.
The assumption of determinism means that a judge system could choose to reuse the result of a previous run, or to re-run the equivalent test case.
Test cases and [test data groups](#test-data-groups) will be used in lexicographical order on base name.
Collaborator

Should we change "...will be used..." to "...will be used/displayed..."?

Member Author

I don't think so. I feel that "display" is outside the scope. What I intended to convey with "used" is that test cases will be "run" in that order. But then @RagnarGrootKoerkamp and @eldering will complain (and they will be correct), because you are absolutely allowed to run test cases (somewhat speculatively) in parallel. So I think it technically should be something like: test cases are run, and the results used, "as if they were run in that order". I have not found a way to write that without it being much harder to understand.

...

Maybe we should just say, "will be run in lexicographical order".

and then add:

A system using a problem package is free to run things in any order, for example when running things in parallel, but the result must be the same as if the test cases were run in the expected order.

?

Collaborator

Maybe "will be used in lexicographical order for determining the result"?
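
To make the "as if they were run in that order" idea concrete, here is a minimal sketch, not anything the spec prescribes. It assumes test cases are identified by base name, that `run_one` is a hypothetical callable returning a verdict string, that `"AC"` means accepted, and that the first non-accepted test case in lexicographical order decides the result. All of these are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def judge(test_cases, run_one):
    """Run test cases in parallel, then decide the result in lexicographical order.

    test_cases: iterable of base names (hypothetical representation).
    run_one: hypothetical callable mapping a base name to a verdict such as "AC".
    """
    ordered = sorted(test_cases)  # lexicographical order on base name
    with ThreadPoolExecutor() as pool:
        # Executor.map yields results in input order, whatever order the runs finish in.
        verdicts = list(pool.map(run_one, ordered))
    for name, verdict in zip(ordered, verdicts):
        if verdict != "AC":
            # Assumption: the first failure in lexicographical order decides the result.
            return name, verdict
    return None, "AC"
```

The runs may finish in any order, but the result only depends on the sorted sequence, which is the property the proposed wording is trying to pin down.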

#### Maximum Score Inference
For `secret`, all test data groups, and every test case in a group with `sum` or `min` aggregation, there is a maximum possible score.
The default value of `score` for `secret` is 100.
The default value of `score` for test data groups is `unbounded`.
Collaborator

In a future PR, I think we should consider not having a default value and forcing judges to explicitly assign scores.

Member Author

@niemela niemela Apr 4, 2025

I was thinking something similar when writing this. I would leave unbounded as a default though, because then you can make score just an integer.

Also, it makes sense to me that "everything is just unbounded" is the result of not specifying any max score. Not having a max score kinda literally means that it's unbounded.

For `secret`, all test data groups, and every test case in a group with `sum` or `min` aggregation, there is a maximum possible score.
The default value of `score` for `secret` is 100.
The default value of `score` for test data groups is `unbounded`.
Test data groups may only have `unbounded` maximum score if `secret` is unbounded.
Collaborator

We should probably disallow "unreasonable" score assignments. If we accept score > max_score as a guarantee (there seems to be a strong consensus that score should either act as clamp or judge error), then we could catch trivial mistakes such as the root having score 100 and sum aggregation, but the sum of all testgroups being smaller or larger than 100.

Member Author

Yes, agreed.

If we choose "clamp", then there are no unreasonable assignments (kinda); we will just clamp it.

If we choose JE (which is the current state), then the following should be disallowed:

  • aggregation min and any child is greater
  • aggregation sum and sum of children is greater

Should we also disallow the non-breaking, but arguably somewhat unreasonable, things like the following?

  • aggregation min and all children are strictly smaller
  • aggregation sum and sum of children is strictly smaller

Collaborator

I definitely think we should disallow all 4 cases
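
For reference, the four cases discussed in this thread could be checked roughly as follows. This is only a sketch under stated assumptions: `check_scores` is a hypothetical helper, `None` stands in for `unbounded`, and "greater" and "smaller" are read as comparisons against the parent's maximum score. It also folds in the rule quoted from the diff that a group may only be `unbounded` if its parent is.

```python
from typing import Optional


def check_scores(aggregation: str, parent: Optional[int],
                 children: list[Optional[int]]) -> list[str]:
    """Return a list of problems with a score assignment (empty if it looks sane).

    None stands for `unbounded` (an assumption of this sketch).
    """
    if parent is None:
        return []  # unbounded parent: nothing to compare against
    if any(c is None for c in children):
        return ["a child is unbounded but the parent is not"]
    problems = []
    if aggregation == "min":
        if any(c > parent for c in children):
            problems.append("min: a child is greater than the parent")
        if children and all(c < parent for c in children):
            problems.append("min: all children are strictly smaller than the parent")
    elif aggregation == "sum":
        total = sum(children)
        if total > parent:
            problems.append("sum: sum of children is greater than the parent")
        elif total < parent:
            problems.append("sum: sum of children is strictly smaller than the parent")
    return problems
```

For example, `check_scores("sum", 100, [40, 60])` reports nothing, while `check_scores("sum", 100, [40, 50])` flags the trivial mistake mentioned above, where the root has score 100 but the groups do not add up to it.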

Collaborator

Another thing to consider: if you take "forcing judges to explicitly assign group scores" to its logical conclusion, is there even a point in specifying a score for secret? I would say there still is, as it allows us to do the 4 above sanity checks.

Member Author

I agree, for exactly that reason.

@niemela niemela merged commit 0a7c170 into master Apr 5, 2025
@niemela niemela deleted the simplify-groups branch April 5, 2025 05:11

as well as the `args` sequence in the `.yaml` file, then the input of the two test cases is equivalent.
This means that for any two test cases, if their input, output validator arguments
and the contents of their `.ans` files are equivalent, then the test cases are equivalent.
This means that for any two test cases, if their input, output validator arguments and the contents of their `.ans` files are equivalent, then the test cases are equivalent.
Collaborator

Is there a reason to write "equivalent" instead of "equal" here?

Member Author

No, I think "equal" would be better.

Also, row 557 is kinda repeating what's in 555-556.
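
As an illustration of why equality (rather than a looser notion of equivalence) is convenient here, a judge system could key a result cache on the exact bytes involved. The sketch below is an assumption-laden example, not part of the spec: `case_key` and `run_cached` are hypothetical names, `input_data` is treated as an opaque byte string covering the whole input of the test case, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib

_results = {}  # cache of previous results, keyed by case_key()


def case_key(input_data: bytes, validator_args: list[str], ans_data: bytes) -> str:
    """Build a key that is the same exactly when all three components are equal."""
    parts = [input_data, ans_data, *(arg.encode() for arg in validator_args)]
    digest = hashlib.sha256()
    for part in parts:
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


def run_cached(key: str, run):
    """Reuse a previous result for an equal test case, or run it once and store it."""
    if key not in _results:
        _results[key] = run()
    return _results[key]
```

Because the submission is assumed to be deterministic, reusing the stored result and re-running the test case are both valid choices for the judge system.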

Development

Successfully merging this pull request may close these issues.

  • Consider dropping maximum score inference
  • Restricting depth of test data tree

3 participants