Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Aggregates Collection and Usage #323

Merged
merged 5 commits into from
Sep 23, 2023
Merged

Implement Aggregates Collection and Usage #323

merged 5 commits into from
Sep 23, 2023

Conversation

sesponda
Copy link
Contributor

Feature described by: #282

Summary:

This change allows custom rules to collect aggregate data while processing a file AST, which can be used later by other linter rules that require a global context.

Highlights of the change:

  1. Added a new feature in the linter to collect aggregates. This is done by creating a new type Aggregate that collects data while processing a file AST and can be used later by other rules that need a global context.
  2. Modified the lintWithRegoRules function to include the aggregates in the evaluation process.
  3. Created a helper function evalAndCollect to extract common code that processes each file in a goroutine.
  4. Updated the end-to-end testing to include the aggregates in the testing process. This includes positive and negative tests.

// `count(input.aggregate) == 0` and `not input.aggregate` will not evaluate.
// For simplicity and to avoid error-prone code, ensure it's always defined as an array,
// so that authors can always check `count(input.aggregate)`
// not worrying about nil/null/undefined nuances.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this concern is common in Rego, as it's rather normal to lean in on undefined terminating evaluation. But having it defined is OK too.

Copy link
Contributor Author

@sesponda sesponda Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rationales of ensuring it's always defined as part of the public contract is to avoid common mistakes (as I've seen) like this one:

  1. A Rego author accidentally breaks the collection of some aggregate data.
  2. input.aggregate becomes undefined.
  3. Violations using input.aggregate[_].... will not evaluate, and the linter will pass.

By making #2 impossible, we simplify the work/validations Rego authors need to make.

t.Run("One violation expected", func(t *testing.T) {
stdout := bytes.Buffer{}
stderr := bytes.Buffer{}
// By sending a single file to the command, we skip the aggregates computation, so we expect one violation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why this happens, but it's a bit of a mind bend to me having test fail by not running aggregate computation. I'd expect report rules that need aggregates to just fail out as undefined, likely by referencing input.aggregate which would be undefined if no aggregate rules had run?

The way it's done now, there's no way fo a report rule to know if aggregates have been collected or not, or is there?

Copy link
Contributor Author

@sesponda sesponda Sep 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way it's done now, there's no way for a report rule to know if aggregates have been collected or not, or is there?

It is. input.aggregate is always defined, which allows devs to check if any aggregate is collected simply via count(input.aggregate[_]) > 0 without worrying if it's defined or not. This also allows a kind of typed check, e.g. "check that there are N aggregates of specific type", for example:

aggregate contains entry if {
    entry := { "file" : input.regal.file.name }
}

report contains violation if {
    # Report violation if the number of processed Rego files is not 2 (this rule makes not much sense, 
    # this is just to show how it's possible to check that we have N aggregates with a specific shape
    not count([x | x = input.aggregate[_].file]) == 2
    violation := result.fail(rego.metadata.chain(), {})
}

The test code linked to this comment sends a single file to the linter, and this is why the violation is reported. Again, a business rule saying "fail if we don't have exactly two Rego files" doesn't make a lot of sense, I've used this to reach complete coverage (i.e. check that violations are reported).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but that'll only tell you if any of the aggregate rules has something to report — it's not going to tell you that the aggregate rules were not evaluated at all, which is why I'd prefer it if input.aggregate was undefined in that case.

Copy link
Contributor Author

@sesponda sesponda Sep 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not going to tell you that the aggregate rules were not evaluated at all

I did not fully follow this. Maybe some code example? I don't know if by "aggregate rules" you mean rules collecting aggregates (i.e. aggregate contains entry if) or the rules evaluating aggregates (i.e. report contains violation if { .... input.aggregate ... }

I'd prefer it if input.aggregate was undefined in that case.

The rationale of merging not input.aggregates and count(input.aggregates[_]) == 0 into the latter was explained in the preliminary PR (in summary: simplify the how one can check aggregates), but that's just my personal preference. If this is a fundamental problem blocking a merge, I don't mind tweaking the PR as needed.
What do you have in mind? Would you like to default on having it undefined (if the array is zero), or something different?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have elaborated more on the why, but I was short on time :)

Consider a rule like this: In package foo (which could be distributed across multiple files) I need to ensure that there is exactly one rule named bar. Some pseudo-code below, but bear with me:

aggregate contains entry if {
    input["package"].name == "foo"
    some rule in input.rules

    rule.head.name == "bar"

    entry := {
        "file" : input.regal.file.name,
        "message": "found rule named 'bar'"
    }
}

report contains violation if {
    count(input.aggregate) != 1
    violation := result.fail(rego.metadata.chain(), {})
}

If input.aggregate is always an array, this rule will fail even when aggregate collection hasn't happened, like when evaluating only a single file, or pehaps has been disabled by other means. In this case, count(input.aggregate) will give me 0, which is of course not equal to 1. If input.aggregate is undefined, evluation will stop at that line, and nothing will be reported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is always an array, this rule will fail even when aggregate collection hasn't happened, like when evaluating only a single file

😮 This discussion has surfaced a bug: your example above makes sense and (regardless of the original point of the discussion), it seems valid to me to write an aggregate that I want to collect always. The whole point of aggregates was to write business rules I can enforce without concerning how I package the code in multiple .rego files. In the current version, a refactor merging all files into one would suddenly cause aggregates not being collected, and thus violations could be missed, violating the principle of least surprise. I think we should:

  1. When the docs are added, clearly state that aggregates are collected only if files > 1. Ideally, display a warning if we see aggregate rules unused.
  2. Alternatively... be more consistent and remove the optimisation, and always collect aggregates.

I prefer #2. If you agree I'll make the change. If you prefer #1, no worries (however, I can't commit to making those extra changes soon in follow up PRs due to time constraints).

Now, coming back to the original point, i.e. should be undefined or not:

I think that if we encourage writing rules that attempt to find data more specifically, for example:

aggregate contains entry if {
    input["package"].name == "foo"
    some rule in input.rules
    rule.head.name == "bar"
    entry := {
        "bar_rule_found" : input.regal.file.name
    }
}

report contains violation if {
    not count([x | x = input.aggregate[_].bar_rule_found]) == 1
    violation := result.fail(rego.metadata.chain(), {})
}

... the logic will work OK for both cases (empty array or undefined).

While I'm still (slightly) inclined to keep it as an empty array, I'm not strong on this one (and I'm here to help the project contributing code and following your lead). I'd be happy to change things. Let me know what changes you'd like and I'll make then soon this week.

Thanks 🙏

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of aggregates was to write business rules I can enforce without concerning how I package the code in multiple .rego files.

The point of aggregate rules is to provide a way to lint Rego at the level of a project, which is something we can't currently do. If your project consists of a single file, there's really no point in using aggregate rules, as there's nothing to aggregate. That case is covered by regular rules. Indeed, this distinction will need to be well documented, and I'd be happy to flesh that out in follow-up PRs.

See the Regal integration that went out in the Rego Playground yesterday for an example of single file linting. This will be increasingly common as we start to look into editor integrations later.

In the current version, a refactor merging all files into one would suddenly cause aggregates not being collected, and thus violations could be missed, violating the principle of least surprise.

Yes, a refactor merging all files into one would render aggregate rules moot. That's not surprising to me, although I'd be very surprised if someone did that and called it refactoring :)

I think that if we encourage writing rules that attempt to find data more specifically

Agreed! Even better than encouraging, I think, is a helper function, similar to how we provide result.fail. I suggested this previously, but much have been said since then, so that it got lost along the way is understandable :)

Rather than:

entry := {
    "bar_rule_found" : input.regal.file.name
}

We'd provide a result.aggregate helper:

entry := result.aggregate(rego.metadata.chain(), {"some_other_interesting": "value"})

Since we pass rego.metadata.chain(), this helper would be able to find the name of the rule from the annotation, and group the aggregates accordingly, i.e. input.aggregate["my-rule"]. It would also add input.regal.file.name to each aggregate, and anything else we could think might be useful. The second argument would be for the rule author to populate with whatever is useful in the context of their rule.

So yeah 1. would be me preferred choice here:

  1. When the docs are added, clearly state that aggregates are collected only if files > 1. Ideally, display a warning if we see aggregate rules unused.

And I'd be happy to help fill in some of the blanks in follow-up fixes, like docs, the above mentioned helper function, (optionally) display warnings if aggregate rules are skipped, and so on. Just let me know how you want to proceed 😃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just let me know how you want to proceed 😃

IIUC, we have two changes:

  1. Make aggregates undefined if not collected, or collected for still of zero-length (i.e. normalise to undefined)
  2. Add the helper function result.aggregate(...)

IMHO, we could leave 2 for later. Initially, Rego authors can put anything they want in the entry. Later, the project can release the helper function to make it easier to auto-insert the rule name, etc.

I'm happy to help with the minimum change you'd accept a merge, leaving any optionals/wish-to-have for a post-merge. If you think we cannot merge this without #2, no worries... I'll commit to add it (not soon because I have too much on my plate recently, but I still want to end what I started).

The reason I'm leaning towards speed over coding optionals is that the PR has been open for a a while and I have now non-trivial conflicts to merge. I'll work on them while waiting for your answer.

@anderseknert
Copy link
Member

Looks great! I'm not sure about the logic around evaluation of report without aggregate rules having been collected, but besides that... awesome! 👏

@sesponda
Copy link
Contributor Author

Hi @anderseknert I've pushed an update. Changes:

  • Aggregates not collected now means the collection will be undefined.
  • Addressed merge conflicts. I've undo the refactoring extracting common code, because the other feature merged (time metrics) was too attached to the lintWithRegoRules method, and I did not want to risk an accidental change in the logic.

The only missing think is deciding if the helper function result.aggregate(...)is a must-have or can be done port-merge at another time as a future improvement, feel free to decide and I'll help. Cheers,

@anderseknert
Copy link
Member

Awesome! Let's get this merged, and I'll create some follow-up issues to address before the next release, but I don't expect it to be much. Great work on this @sesponda 👏

@anderseknert anderseknert merged commit 0098ff7 into StyraInc:main Sep 23, 2023
1 check passed
@anderseknert
Copy link
Member

@sesponda

There were a couple of issues open with the aggregate label previously, and I've just created some more for things I think we'll need to consider for this type of rule. I may also add a few more later when I've had more time to think about this. I don't consider any of them critical, but perhaps we should call this feature a beta in the next release, so that we at least have some freedom to work on the design in the following releases. But first, let's see how much we are be able to accomplish until then :)

If you have the bandwidth to work on any issue in the list, that's great! But regardless of that, I'd value your thoughts on the issues that involve some design decisions. Please leave a comment on those if you can, and let me know if you want any issue assigned to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants