Implement Aggregates Collection and Usage #323
Conversation
pkg/linter/linter.go
```go
// `count(input.aggregate) == 0` and `not input.aggregate` will not evaluate.
// For simplicity and to avoid error-prone code, ensure it's always defined as an array,
// so that authors can always check `count(input.aggregate)`
// without worrying about nil/null/undefined nuances.
```
---
I'm not sure this concern is common in Rego, as it's rather normal to lean in on undefined terminating evaluation. But having it defined is OK too.
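For readers less familiar with the idiom, a minimal sketch (hypothetical rule and package names, not from this PR) of how undefined terminates evaluation in Rego:

```rego
package example

# Hypothetical report rule. If `input.aggregate` is undefined, the
# `some ... in` iteration below is undefined too: the rule body stops
# evaluating and `report` is simply empty; no error is raised.
report contains violation if {
	some entry in input.aggregate
	violation := sprintf("aggregate from %s", [entry.file])
}
```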
---
The rationale for ensuring it's always defined as part of the public contract is to avoid common mistakes (that I've seen) like this one:

1. A Rego author accidentally breaks the collection of some aggregate data, so `input.aggregate` becomes undefined.
2. Violations using `input.aggregate[_]...` will not evaluate, and the linter will pass.

By making #2 impossible, we simplify the work/validations Rego authors need to make.
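To make the failure mode above concrete, a sketch (hypothetical rule, assuming `input.aggregate` is NOT guaranteed to exist):

```rego
report contains violation if {
	# If collection broke and `input.aggregate` became undefined, this
	# reference is undefined as well: the body never evaluates, no
	# violation is reported, and the linter silently passes.
	some entry in input.aggregate
	violation := result.fail(rego.metadata.chain(), {})
}
```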
```go
t.Run("One violation expected", func(t *testing.T) {
	stdout := bytes.Buffer{}
	stderr := bytes.Buffer{}
	// By sending a single file to the command, we skip the aggregates computation, so we expect one violation
```
---
I understand why this happens, but it's a bit of a mind-bend to me to have a test fail by not running aggregate computation. I'd expect `report` rules that need aggregates to just fail out as undefined, likely by referencing `input.aggregate`, which would be undefined if no aggregate rules had run?

The way it's done now, there's no way for a report rule to know whether aggregates have been collected or not, or is there?
---
> The way it's done now, there's no way for a report rule to know if aggregates have been collected or not, or is there?

There is: `input.aggregate` is always defined, which allows devs to check whether any aggregate was collected simply via `count(input.aggregate) > 0`, without worrying about whether it's defined or not. This also allows a kind of typed check, e.g. "check that there are N aggregates of a specific type", for example:
```rego
aggregate contains entry if {
	entry := {"file": input.regal.file.name}
}

report contains violation if {
	# Report a violation if the number of processed Rego files is not 2 (this rule doesn't
	# make much sense; it's just to show how to check that we have N aggregates with a specific shape)
	not count([x | x = input.aggregate[_].file]) == 2
	violation := result.fail(rego.metadata.chain(), {})
}
```
The test code linked to this comment sends a single file to the linter, and that is why the violation is reported. Again, a business rule saying "fail if we don't have exactly two Rego files" doesn't make a lot of sense; I've used it to reach complete coverage (i.e. to check that violations are reported).
---
Right, but that'll only tell you if any of the aggregate rules has something to report — it's not going to tell you that the aggregate rules were not evaluated at all, which is why I'd prefer it if `input.aggregate` was undefined in that case.
---
> it's not going to tell you that the aggregate rules were not evaluated at all

I did not fully follow this. Maybe some code example? I don't know if by "aggregate rules" you mean the rules collecting aggregates (i.e. `aggregate contains entry if`) or the rules evaluating aggregates (i.e. `report contains violation if { ... input.aggregate ... }`).

> I'd prefer it if input.aggregate was undefined in that case.

The rationale for merging `not input.aggregate` and `count(input.aggregate) == 0` into the latter was explained in the preliminary PR (in summary: to simplify how one can check aggregates), but that's just my personal preference. If this is a fundamental problem blocking a merge, I don't mind tweaking the PR as needed.

What do you have in mind? Would you like to default to having it `undefined` (if the array is empty), or something different?
---
Sorry, I should have elaborated more on the why, but I was short on time :)

Consider a rule like this: in package `foo` (which could be distributed across multiple files) I need to ensure that there is exactly one rule named `bar`. Some pseudo-code below, but bear with me:
```rego
aggregate contains entry if {
	input["package"].name == "foo"
	some rule in input.rules
	rule.head.name == "bar"
	entry := {
		"file": input.regal.file.name,
		"message": "found rule named 'bar'",
	}
}

report contains violation if {
	count(input.aggregate) != 1
	violation := result.fail(rego.metadata.chain(), {})
}
```
If `input.aggregate` is always an array, this rule will fail even when aggregate collection hasn't happened, like when evaluating only a single file, or when it has perhaps been disabled by other means. In this case, `count(input.aggregate)` will give me `0`, which is of course not equal to 1. If `input.aggregate` is undefined, evaluation will stop at that line, and nothing will be reported.
---
> is always an array, this rule will fail even when aggregate collection hasn't happened, like when evaluating only a single file

😮 This discussion has surfaced a bug: your example above makes sense, and (regardless of the original point of the discussion) it seems valid to me to write an aggregate that I want to collect always. The whole point of aggregates was to write business rules I can enforce without worrying about how I package the code across multiple `.rego` files. In the current version, a refactor merging all files into one would suddenly cause aggregates not to be collected, and thus violations could be missed, violating the principle of least surprise. I think we should either:

1. When the docs are added, clearly state that aggregates are collected only if files > 1. Ideally, display a warning if we see aggregate rules go unused.
2. Alternatively, be more consistent: remove the optimisation and always collect aggregates.

I prefer #2. If you agree I'll make the change. If you prefer #1, no worries (however, I can't commit to making those extra changes soon in follow-up PRs due to time constraints).
Now, coming back to the original point, i.e. should it be undefined or not: I think that if we encourage writing rules that attempt to find data more specifically, for example:
```rego
aggregate contains entry if {
	input["package"].name == "foo"
	some rule in input.rules
	rule.head.name == "bar"
	entry := {"bar_rule_found": input.regal.file.name}
}

report contains violation if {
	not count([x | x = input.aggregate[_].bar_rule_found]) == 1
	violation := result.fail(rego.metadata.chain(), {})
}
```
... the logic will work OK for both cases (empty array or undefined).

While I'm still (slightly) inclined to keep it as an empty array, I don't feel strongly about this one (and I'm here to help the project by contributing code and following your lead). I'd be happy to change things. Let me know what changes you'd like and I'll make them this week.
Thanks 🙏
---
> The whole point of aggregates was to write business rules I can enforce without concerning how I package the code in multiple .rego files.

The point of aggregate rules is to provide a way to lint Rego at the level of a project, which is something we can't currently do. If your project consists of a single file, there's really no point in using aggregate rules, as there's nothing to aggregate. That case is covered by regular rules. Indeed, this distinction will need to be well documented, and I'd be happy to flesh that out in follow-up PRs.

See the Regal integration that went out in the Rego Playground yesterday for an example of single-file linting. This will be increasingly common as we start to look into editor integrations later.
> In the current version, a refactor merging all files into one would suddenly cause aggregates not being collected, and thus violations could be missed, violating the principle of least surprise.

Yes, a refactor merging all files into one would render aggregate rules moot. That's not surprising to me, although I'd be very surprised if someone did that and called it refactoring :)
> I think that if we encourage writing rules that attempt to find data more specifically

Agreed! Even better than encouraging, I think, is a helper function, similar to how we provide `result.fail`. I suggested this previously, but much has been said since then, so it's understandable that it got lost along the way :)
Rather than:

```rego
entry := {"bar_rule_found": input.regal.file.name}
```

we'd provide a `result.aggregate` helper:

```rego
entry := result.aggregate(rego.metadata.chain(), {"some_other_interesting": "value"})
```
Since we pass `rego.metadata.chain()`, this helper would be able to find the name of the rule from the annotation and group the aggregates accordingly, i.e. `input.aggregate["my-rule"]`. It would also add `input.regal.file.name` to each aggregate, and anything else we think might be useful. The second argument would be for the rule author to populate with whatever is useful in the context of their rule.
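If such a helper existed, the input a report rule sees might (speculatively; the exact shape is not settled in this thread) be grouped by rule name, something like:

```rego
# Illustrative input shape only, assuming grouping by rule name:
#
#   input.aggregate == {
#       "my-rule": [
#           {"file": "policy1.rego", "some_other_interesting": "value"},
#           {"file": "policy2.rego", "some_other_interesting": "value"},
#       ],
#   }
#
report contains violation if {
	count(input.aggregate["my-rule"]) != 1
	violation := result.fail(rego.metadata.chain(), {})
}
```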
So yeah, 1. would be my preferred choice here:

> When the docs are added, clearly state that aggregates are collected only if files > 1. Ideally, display a warning if we see aggregate rules unused.

And I'd be happy to help fill in some of the blanks in follow-up fixes, like docs, the above-mentioned helper function, (optionally) displaying warnings if aggregate rules are skipped, and so on. Just let me know how you want to proceed 😃
---
> Just let me know how you want to proceed 😃

IIUC, we have two changes:

1. Make aggregates undefined if not collected, or collected but still of zero length (i.e. normalise to undefined).
2. Add the helper function `result.aggregate(...)`.

IMHO, we could leave #2 for later. Initially, Rego authors can put anything they want in the entry. Later, the project can release the helper function to make it easier to auto-insert the rule name, etc.

I'm happy to help with the minimum change you'd accept for a merge, leaving any optionals/nice-to-haves for post-merge. If you think we cannot merge this without #2, no worries... I'll commit to adding it (not soon, because I have too much on my plate recently, but I still want to finish what I started).

The reason I'm leaning towards speed over coding optionals is that the PR has been open for a while and I now have non-trivial conflicts to merge. I'll work on them while waiting for your answer.
Looks great! I'm not sure about the logic around evaluation of
Hi @anderseknert I've pushed an update. Changes:

The only missing thing is deciding on the helper function
Awesome! Let's get this merged, and I'll create some follow-up issues to address before the next release, but I don't expect it to be much. Great work on this @sesponda 👏
There were a couple of issues open with the aggregate label previously, and I've just created some more for things I think we'll need to consider for this type of rule. I may also add a few more later when I've had more time to think about this. I don't consider any of them critical, but perhaps we should call this feature a beta in the next release, so that we at least have some freedom to work on the design in the following releases. But first, let's see how much we are able to accomplish until then :) If you have the bandwidth to work on any issue in the list, that's great! But regardless of that, I'd value your thoughts on the issues that involve some design decisions. Please leave a comment on those if you can, and let me know if you want any issue assigned to you.
Feature described by: #282

Summary:

This change allows custom rules to collect aggregate data while processing a file AST, which can be used later by other linter rules that require a global context.

Highlights of the change:

- `Aggregate`: collects data while processing a file AST and can be used later by other rules that need a global context.
- `lintWithRegoRules`: function updated to include the aggregates in the evaluation process.
- `evalAndCollect`: extracted common code that processes each file in a goroutine.