Implement Aggregates Collection and Usage #323

sesponda · 2023-09-15T04:06:18Z

Feature described by: #282

Summary:

This change allows custom rules to collect aggregate data while processing a file AST, which can be used later by other linter rules that require a global context.

Highlights of the change:

Added a new feature in the linter to collect aggregates. This is done by creating a new type Aggregate that collects data while processing a file AST and can be used later by other rules that need a global context.
Modified the lintWithRegoRules function to include the aggregates in the evaluation process.
Created a helper function evalAndCollect to extract common code that processes each file in a goroutine.
Updated the end-to-end testing to include the aggregates in the testing process. This includes positive and negative tests.

anderseknert · 2023-09-15T13:10:08Z

pkg/linter/linter.go

+				// `count(input.aggregate) == 0` and `not input.aggregate` will not evaluate.
+				// For simplicity and to avoid error-prone code, ensure it's always defined as an array,
+				// so that authors can always check `count(input.aggregate)`
+				// not worrying about nil/null/undefined nuances.


I'm not sure this concern is common in Rego, as it's rather normal to lean in on undefined terminating evaluation. But having it defined is OK too.

The rationales of ensuring it's always defined as part of the public contract is to avoid common mistakes (as I've seen) like this one:

A Rego author accidentally breaks the collection of some aggregate data.

input.aggregate becomes undefined.

Violations using input.aggregate[_].... will not evaluate, and the linter will pass.

By making #2 impossible, we simplify the work/validations Rego authors need to make.

anderseknert · 2023-09-15T13:25:03Z

e2e/cli_test.go

+	t.Run("One violation expected", func(t *testing.T) {
+		stdout := bytes.Buffer{}
+		stderr := bytes.Buffer{}
+		// By sending a single file to the command, we skip the aggregates computation, so we expect one violation


I understand why this happens, but it's a bit of a mind bend to me having test fail by not running aggregate computation. I'd expect report rules that need aggregates to just fail out as undefined, likely by referencing input.aggregate which would be undefined if no aggregate rules had run?

The way it's done now, there's no way fo a report rule to know if aggregates have been collected or not, or is there?

The way it's done now, there's no way for a report rule to know if aggregates have been collected or not, or is there?

It is. input.aggregate is always defined, which allows devs to check if any aggregate is collected simply via count(input.aggregate[_]) > 0 without worrying if it's defined or not. This also allows a kind of typed check, e.g. "check that there are N aggregates of specific type", for example:

aggregate contains entry if { entry := { "file" : input.regal.file.name } } report contains violation if { # Report violation if the number of processed Rego files is not 2 (this rule makes not much sense, # this is just to show how it's possible to check that we have N aggregates with a specific shape not count([x | x = input.aggregate[_].file]) == 2 violation := result.fail(rego.metadata.chain(), {}) }

The test code linked to this comment sends a single file to the linter, and this is why the violation is reported. Again, a business rule saying "fail if we don't have exactly two Rego files" doesn't make a lot of sense, I've used this to reach complete coverage (i.e. check that violations are reported).

Right, but that'll only tell you if any of the aggregate rules has something to report — it's not going to tell you that the aggregate rules were not evaluated at all, which is why I'd prefer it if input.aggregate was undefined in that case.

it's not going to tell you that the aggregate rules were not evaluated at all

I did not fully follow this. Maybe some code example? I don't know if by "aggregate rules" you mean rules collecting aggregates (i.e. aggregate contains entry if) or the rules evaluating aggregates (i.e. report contains violation if { .... input.aggregate ... }

I'd prefer it if input.aggregate was undefined in that case.

The rationale of merging not input.aggregates and count(input.aggregates[_]) == 0 into the latter was explained in the preliminary PR (in summary: simplify the how one can check aggregates), but that's just my personal preference. If this is a fundamental problem blocking a merge, I don't mind tweaking the PR as needed.
What do you have in mind? Would you like to default on having it undefined (if the array is zero), or something different?

Sorry, I should have elaborated more on the why, but I was short on time :)

Consider a rule like this: In package foo (which could be distributed across multiple files) I need to ensure that there is exactly one rule named bar. Some pseudo-code below, but bear with me:

aggregate contains entry if { input["package"].name == "foo" some rule in input.rules rule.head.name == "bar" entry := { "file" : input.regal.file.name, "message": "found rule named 'bar'" } } report contains violation if { count(input.aggregate) != 1 violation := result.fail(rego.metadata.chain(), {}) }

If input.aggregate is always an array, this rule will fail even when aggregate collection hasn't happened, like when evaluating only a single file, or pehaps has been disabled by other means. In this case, count(input.aggregate) will give me 0, which is of course not equal to 1. If input.aggregate is undefined, evluation will stop at that line, and nothing will be reported.

is always an array, this rule will fail even when aggregate collection hasn't happened, like when evaluating only a single file

😮 This discussion has surfaced a bug: your example above makes sense and (regardless of the original point of the discussion), it seems valid to me to write an aggregate that I want to collect always. The whole point of aggregates was to write business rules I can enforce without concerning how I package the code in multiple .rego files. In the current version, a refactor merging all files into one would suddenly cause aggregates not being collected, and thus violations could be missed, violating the principle of least surprise. I think we should:

When the docs are added, clearly state that aggregates are collected only if files > 1. Ideally, display a warning if we see aggregate rules unused.

Alternatively... be more consistent and remove the optimisation, and always collect aggregates.

I prefer #2. If you agree I'll make the change. If you prefer #1, no worries (however, I can't commit to making those extra changes soon in follow up PRs due to time constraints).

Now, coming back to the original point, i.e. should be undefined or not:

I think that if we encourage writing rules that attempt to find data more specifically, for example:

aggregate contains entry if { input["package"].name == "foo" some rule in input.rules rule.head.name == "bar" entry := { "bar_rule_found" : input.regal.file.name } } report contains violation if { not count([x | x = input.aggregate[_].bar_rule_found]) == 1 violation := result.fail(rego.metadata.chain(), {}) }

... the logic will work OK for both cases (empty array or undefined).

While I'm still (slightly) inclined to keep it as an empty array, I'm not strong on this one (and I'm here to help the project contributing code and following your lead). I'd be happy to change things. Let me know what changes you'd like and I'll make then soon this week.

Thanks 🙏

The whole point of aggregates was to write business rules I can enforce without concerning how I package the code in multiple .rego files.

The point of aggregate rules is to provide a way to lint Rego at the level of a project, which is something we can't currently do. If your project consists of a single file, there's really no point in using aggregate rules, as there's nothing to aggregate. That case is covered by regular rules. Indeed, this distinction will need to be well documented, and I'd be happy to flesh that out in follow-up PRs.

See the Regal integration that went out in the Rego Playground yesterday for an example of single file linting. This will be increasingly common as we start to look into editor integrations later.

In the current version, a refactor merging all files into one would suddenly cause aggregates not being collected, and thus violations could be missed, violating the principle of least surprise.

Yes, a refactor merging all files into one would render aggregate rules moot. That's not surprising to me, although I'd be very surprised if someone did that and called it refactoring :)

I think that if we encourage writing rules that attempt to find data more specifically

Agreed! Even better than encouraging, I think, is a helper function, similar to how we provide result.fail. I suggested this previously, but much have been said since then, so that it got lost along the way is understandable :)

Rather than:

entry := { "bar_rule_found" : input.regal.file.name }

We'd provide a result.aggregate helper:

entry := result.aggregate(rego.metadata.chain(), {"some_other_interesting": "value"})

Since we pass rego.metadata.chain(), this helper would be able to find the name of the rule from the annotation, and group the aggregates accordingly, i.e. input.aggregate["my-rule"]. It would also add input.regal.file.name to each aggregate, and anything else we could think might be useful. The second argument would be for the rule author to populate with whatever is useful in the context of their rule.

So yeah 1. would be me preferred choice here:

When the docs are added, clearly state that aggregates are collected only if files > 1. Ideally, display a warning if we see aggregate rules unused.

And I'd be happy to help fill in some of the blanks in follow-up fixes, like docs, the above mentioned helper function, (optionally) display warnings if aggregate rules are skipped, and so on. Just let me know how you want to proceed 😃

Just let me know how you want to proceed 😃

IIUC, we have two changes:

Make aggregates undefined if not collected, or collected for still of zero-length (i.e. normalise to undefined)

Add the helper function result.aggregate(...)

IMHO, we could leave 2 for later. Initially, Rego authors can put anything they want in the entry. Later, the project can release the helper function to make it easier to auto-insert the rule name, etc.

I'm happy to help with the minimum change you'd accept a merge, leaving any optionals/wish-to-have for a post-merge. If you think we cannot merge this without #2, no worries... I'll commit to add it (not soon because I have too much on my plate recently, but I still want to end what I started).

The reason I'm leaning towards speed over coding optionals is that the PR has been open for a a while and I have now non-trivial conflicts to merge. I'll work on them while waiting for your answer.

anderseknert · 2023-09-15T13:26:35Z

Looks great! I'm not sure about the logic around evaluation of report without aggregate rules having been collected, but besides that... awesome! 👏

…er-scope-for-custom-rules

sesponda · 2023-09-22T05:43:47Z

Hi @anderseknert I've pushed an update. Changes:

Aggregates not collected now means the collection will be undefined.
Addressed merge conflicts. I've undo the refactoring extracting common code, because the other feature merged (time metrics) was too attached to the lintWithRegoRules method, and I did not want to risk an accidental change in the logic.

The only missing think is deciding if the helper function result.aggregate(...)is a must-have or can be done port-merge at another time as a future improvement, feel free to decide and I'll help. Cheers,

anderseknert · 2023-09-23T13:12:07Z

Awesome! Let's get this merged, and I'll create some follow-up issues to address before the next release, but I don't expect it to be much. Great work on this @sesponda 👏

anderseknert · 2023-09-23T21:08:41Z

@sesponda

There were a couple of issues open with the aggregate label previously, and I've just created some more for things I think we'll need to consider for this type of rule. I may also add a few more later when I've had more time to think about this. I don't consider any of them critical, but perhaps we should call this feature a beta in the next release, so that we at least have some freedom to work on the design in the following releases. But first, let's see how much we are be able to accomplish until then :)

If you have the bandwidth to work on any issue in the list, that's great! But regardless of that, I'd value your thoughts on the issues that involve some design decisions. Please leave a comment on those if you can, and let me know if you want any issue assigned to you.

Implement aggregates feature: #282

9c581df

sesponda mentioned this pull request Sep 15, 2023

Introduce broader scope for custom rules #306

Closed

Fix: field is used by evalAndCollect so it should not be ignored.

ac297c8

anderseknert reviewed Sep 15, 2023

View reviewed changes

sesponda and others added 3 commits September 18, 2023 11:46

Merge branch 'main' into issue-topic-282-broader-scope-for-custom-rules

611470b

Merge remote-tracking branch 'origin/main' into issue-topic-282-broad…

668075d

…er-scope-for-custom-rules

Fix code style.

ea0d793

anderseknert approved these changes Sep 23, 2023

View reviewed changes

anderseknert merged commit 0098ff7 into StyraInc:main Sep 23, 2023
1 check passed

anderseknert mentioned this pull request Sep 23, 2023

Allow a broader scope for custom rules #282

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Aggregates Collection and Usage #323

Implement Aggregates Collection and Usage #323

sesponda commented Sep 15, 2023

anderseknert Sep 15, 2023

sesponda Sep 18, 2023 •

edited

Loading

anderseknert Sep 15, 2023

sesponda Sep 18, 2023 •

edited

Loading

anderseknert Sep 18, 2023

sesponda Sep 19, 2023 •

edited

Loading

anderseknert Sep 19, 2023

sesponda Sep 20, 2023

anderseknert Sep 20, 2023

sesponda Sep 22, 2023

anderseknert commented Sep 15, 2023

sesponda commented Sep 22, 2023

anderseknert commented Sep 23, 2023

anderseknert commented Sep 23, 2023

Implement Aggregates Collection and Usage #323

Implement Aggregates Collection and Usage #323

Conversation

sesponda commented Sep 15, 2023

anderseknert Sep 15, 2023

Choose a reason for hiding this comment

sesponda Sep 18, 2023 • edited Loading

Choose a reason for hiding this comment

anderseknert Sep 15, 2023

Choose a reason for hiding this comment

sesponda Sep 18, 2023 • edited Loading

Choose a reason for hiding this comment

anderseknert Sep 18, 2023

Choose a reason for hiding this comment

sesponda Sep 19, 2023 • edited Loading

Choose a reason for hiding this comment

anderseknert Sep 19, 2023

Choose a reason for hiding this comment

sesponda Sep 20, 2023

Choose a reason for hiding this comment

anderseknert Sep 20, 2023

Choose a reason for hiding this comment

sesponda Sep 22, 2023

Choose a reason for hiding this comment

anderseknert commented Sep 15, 2023

sesponda commented Sep 22, 2023

anderseknert commented Sep 23, 2023

anderseknert commented Sep 23, 2023

sesponda Sep 18, 2023 •

edited

Loading

sesponda Sep 18, 2023 •

edited

Loading

sesponda Sep 19, 2023 •

edited

Loading