Implement Aggregates Collection and Usage #323

Merged: 5 commits, Sep 23, 2023

Changes from all commits
7 changes: 7 additions & 0 deletions bundle/regal/main.rego
@@ -51,6 +51,13 @@ report contains violation if {
	not ignored(violation, ignore_directives)
}

aggregate contains aggregate if {
	some category, title
	config.for_rule(category, title).level != "ignore"
	not config.excluded_file(category, title, input.regal.file.name)
	aggregate := data.custom.regal.rules[category][title].aggregate[_]
}

ignored(violation, directives) if {
	ignored_rules := directives[violation.location.row]
	violation.title in ignored_rules
46 changes: 46 additions & 0 deletions e2e/cli_test.go
@@ -319,6 +319,52 @@ func TestLintRuleNamingConventionFromCustomCategory(t *testing.T) {
	}
}

func TestAggregatesAreCollectedAndUsed(t *testing.T) {
	t.Parallel()
	cwd := must(os.Getwd)
	basedir := cwd + "/testdata/aggregates"

	t.Run("Zero violations expected", func(t *testing.T) {
		stdout := bytes.Buffer{}
		stderr := bytes.Buffer{}

		err := regal(&stdout, &stderr)("lint", "--format", "json", basedir+"/rego", "--rules", basedir+"/rules/custom_rules_using_aggregates.rego")

		if exp, act := 0, ExitStatus(err); exp != act {
			t.Errorf("expected exit status %d, got %d", exp, act)
		}

		if exp, act := "", stderr.String(); exp != act {
			t.Errorf("expected stderr %q, got %q", exp, act)
		}
	})

	t.Run("One violation expected", func(t *testing.T) {
		stdout := bytes.Buffer{}
		stderr := bytes.Buffer{}
		// By sending a single file to the command, we skip the aggregates computation, so we expect one violation
Member commented:
I understand why this happens, but it's a bit of a mind bend to me having a test fail by not running aggregate computation. I'd expect report rules that need aggregates to just fail out as undefined, likely by referencing input.aggregate, which would be undefined if no aggregate rules had run?

The way it's done now, there's no way for a report rule to know if aggregates have been collected or not, or is there?
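For example (an illustrative sketch, not code from this PR), a report rule written that way would simply stop evaluating when input.aggregate is absent:

report contains violation if {
    # If no aggregate rules ran, input.aggregate would be undefined:
    # iteration fails here and the rule reports nothing.
    some entry in input.aggregate
    # ... checks against the collected entries would go here ...
    violation := result.fail(rego.metadata.chain(), {})
}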

sesponda (Contributor, Author) commented on Sep 18, 2023:

> The way it's done now, there's no way for a report rule to know if aggregates have been collected or not, or is there?

There is: input.aggregate is always defined, which allows devs to check whether any aggregate was collected simply via count(input.aggregate[_]) > 0, without worrying about whether it's defined. This also allows a kind of typed check, e.g. "check that there are N aggregates of a specific type", for example:

aggregate contains entry if {
    entry := { "file" : input.regal.file.name }
}

report contains violation if {
    # Report a violation if the number of processed Rego files is not 2 (this rule doesn't make much
    # sense; it's just here to show how to check that we have N aggregates with a specific shape)
    not count([x | x = input.aggregate[_].file]) == 2
    violation := result.fail(rego.metadata.chain(), {})
}

The test code linked to this comment sends a single file to the linter, and this is why the violation is reported. Again, a business rule saying "fail if we don't have exactly two Rego files" doesn't make a lot of sense; I've used it to reach complete coverage (i.e. to check that violations are reported).

Member commented:

Right, but that'll only tell you if any of the aggregate rules has something to report — it's not going to tell you that the aggregate rules were not evaluated at all, which is why I'd prefer it if input.aggregate was undefined in that case.

sesponda (Contributor, Author) commented on Sep 19, 2023:

> it's not going to tell you that the aggregate rules were not evaluated at all

I did not fully follow this. Maybe a code example? I don't know if by "aggregate rules" you mean rules collecting aggregates (i.e. aggregate contains entry if) or the rules evaluating aggregates (i.e. report contains violation if { .... input.aggregate ... }).

> I'd prefer it if input.aggregate was undefined in that case.

The rationale for merging not input.aggregates and count(input.aggregates[_]) == 0 into the latter was explained in the preliminary PR (in summary: to simplify how one can check aggregates), but that's just my personal preference. If this is a fundamental problem blocking a merge, I don't mind tweaking the PR as needed.
What do you have in mind? Would you like to default to having it undefined (if the array is empty), or something different?

Member commented:

Sorry, I should have elaborated more on the why, but I was short on time :)

Consider a rule like this: in package foo (which could be distributed across multiple files) I need to ensure that there is exactly one rule named bar. Some pseudo-code below, but bear with me:

aggregate contains entry if {
    input["package"].name == "foo"
    some rule in input.rules

    rule.head.name == "bar"

    entry := {
        "file" : input.regal.file.name,
        "message": "found rule named 'bar'"
    }
}

report contains violation if {
    count(input.aggregate) != 1
    violation := result.fail(rego.metadata.chain(), {})
}

If input.aggregate is always an array, this rule will fail even when aggregate collection hasn't happened, like when evaluating only a single file, or perhaps when it has been disabled by other means. In this case, count(input.aggregate) will give me 0, which is of course not equal to 1. If input.aggregate is undefined, evaluation will stop at that line, and nothing will be reported.
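Spelled out against the rule above (a sketch of the two semantics under discussion):

report contains violation if {
    # Always-defined: nothing collected means count(input.aggregate) is 0,
    # 0 != 1 holds, and a violation is reported even though aggregation never ran.
    # Undefined-when-skipped: count(input.aggregate) is itself undefined,
    # so the body stops here and nothing is reported.
    count(input.aggregate) != 1
    violation := result.fail(rego.metadata.chain(), {})
}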

sesponda (Contributor, Author) commented:

> is always an array, this rule will fail even when aggregate collection hasn't happened, like when evaluating only a single file

😮 This discussion has surfaced a bug: your example above makes sense, and (regardless of the original point of the discussion) it seems valid to me to write an aggregate that I always want collected. The whole point of aggregates was to write business rules I can enforce without worrying about how I package the code across multiple .rego files. In the current version, a refactor merging all files into one would suddenly cause aggregates not to be collected, and thus violations could be missed, violating the principle of least surprise. I think we should:

  1. When the docs are added, clearly state that aggregates are collected only if files > 1. Ideally, display a warning if we see unused aggregate rules.
  2. Alternatively... be more consistent: remove the optimisation and always collect aggregates.

I prefer #2. If you agree I'll make the change. If you prefer #1, no worries (however, I can't commit to making those extra changes soon in follow up PRs due to time constraints).

Now, coming back to the original point, i.e. whether it should be undefined or not:

I think that if we encourage writing rules that attempt to find data more specifically, for example:

aggregate contains entry if {
    input["package"].name == "foo"
    some rule in input.rules
    rule.head.name == "bar"
    entry := {
        "bar_rule_found" : input.regal.file.name
    }
}

report contains violation if {
    not count([x | x = input.aggregate[_].bar_rule_found]) == 1
    violation := result.fail(rego.metadata.chain(), {})
}

... the logic will work OK for both cases (empty array or undefined).

While I'm still (slightly) inclined to keep it as an empty array, I don't feel strongly about this one (and I'm here to help the project by contributing code and following your lead). I'd be happy to change things. Let me know what changes you'd like and I'll make them this week.

Thanks 🙏

Member commented:

> The whole point of aggregates was to write business rules I can enforce without worrying about how I package the code across multiple .rego files.

The point of aggregate rules is to provide a way to lint Rego at the level of a project, which is something we can't currently do. If your project consists of a single file, there's really no point in using aggregate rules, as there's nothing to aggregate. That case is covered by regular rules. Indeed, this distinction will need to be well documented, and I'd be happy to flesh that out in follow-up PRs.

See the Regal integration that went out in the Rego Playground yesterday for an example of single file linting. This will be increasingly common as we start to look into editor integrations later.

> In the current version, a refactor merging all files into one would suddenly cause aggregates not to be collected, and thus violations could be missed, violating the principle of least surprise.

Yes, a refactor merging all files into one would render aggregate rules moot. That's not surprising to me, although I'd be very surprised if someone did that and called it refactoring :)

> I think that if we encourage writing rules that attempt to find data more specifically

Agreed! Even better than encouraging it, I think, is a helper function, similar to how we provide result.fail. I suggested this previously, but much has been said since then, so it's understandable that it got lost along the way :)

Rather than:

entry := {
    "bar_rule_found" : input.regal.file.name
}

We'd provide a result.aggregate helper:

entry := result.aggregate(rego.metadata.chain(), {"some_other_interesting": "value"})

Since we pass rego.metadata.chain(), this helper would be able to find the name of the rule from the annotation, and group the aggregates accordingly, i.e. input.aggregate["my-rule"]. It would also add input.regal.file.name to each aggregate, and anything else we think might be useful. The second argument would be for the rule author to populate with whatever is useful in the context of their rule.
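A rough sketch of how such a helper might look (hypothetical: the metadata-chain handling and the exact fields added are assumptions, not a settled API):

package regal.result

import future.keywords

# Hypothetical result.aggregate helper: the first link in rego.metadata.chain()
# is the calling rule itself, with a path like
# ["custom", "regal", "rules", <category>, <title>, "aggregate"].
aggregate(chain, payload) := entry if {
    path := chain[0].path
    category := path[count(path) - 3]
    title := path[count(path) - 2]

    entry := object.union(payload, {
        "rule": sprintf("%s/%s", [category, title]),
        "file": input.regal.file.name,
    })
}

Rule authors would then only populate the payload argument with data specific to their rule.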

So yeah, 1. would be my preferred choice here:

> 1. When the docs are added, clearly state that aggregates are collected only if files > 1. Ideally, display a warning if we see unused aggregate rules.

And I'd be happy to help fill in some of the blanks in follow-up fixes: docs, the above-mentioned helper function, (optionally) displaying warnings if aggregate rules are skipped, and so on. Just let me know how you want to proceed 😃

sesponda (Contributor, Author) commented:

> Just let me know how you want to proceed 😃

IIUC, we have two changes:

  1. Make aggregates undefined if not collected, or collected but of zero length (i.e. normalise to undefined)
  2. Add the helper function result.aggregate(...)

IMHO, we could leave 2 for later. Initially, Rego authors can put anything they want in the entry. Later, the project can release the helper function to make it easier to auto-insert the rule name, etc.

I'm happy to make the minimum change you'd accept for a merge, leaving any optional/nice-to-have items for post-merge. If you think we cannot merge this without #2, no worries... I'll commit to adding it (not soon, as I have too much on my plate at the moment, but I still want to finish what I started).

The reason I'm leaning towards speed over coding optionals is that the PR has been open for a while and I now have non-trivial conflicts to merge. I'll work on them while waiting for your answer.

		err := regal(&stdout, &stderr)("lint", "--format", "json", basedir+"/rego/policy_1.rego", "--rules", basedir+"/rules/custom_rules_using_aggregates.rego")

		if exp, act := 3, ExitStatus(err); exp != act {
			t.Errorf("expected exit status %d, got %d", exp, act)
		}

		if exp, act := "", stderr.String(); exp != act {
			t.Errorf("expected stderr %q, got %q", exp, act)
		}

		var rep report.Report

		if err = json.Unmarshal(stdout.Bytes(), &rep); err != nil {
			t.Fatalf("expected JSON response, got %v", stdout.String())
		}

		if rep.Summary.NumViolations != 1 {
			t.Errorf("expected 1 violation, got %d", rep.Summary.NumViolations)
		}
	})
}

func TestTestRegalBundledBundle(t *testing.T) {
	t.Parallel()

3 changes: 3 additions & 0 deletions e2e/testdata/aggregates/rego/policy_1.rego
@@ -0,0 +1,3 @@
package mypolicy1.public

my_policy_1 := true
3 changes: 3 additions & 0 deletions e2e/testdata/aggregates/rego/policy_2.rego
@@ -0,0 +1,3 @@
package mypolicy2.public

export := []
20 changes: 20 additions & 0 deletions e2e/testdata/aggregates/rules/custom_rules_using_aggregates.rego
@@ -0,0 +1,20 @@
# METADATA
# description: Collect data in aggregates and validate it
package custom.regal.rules.testcase["aggregates"]

import future.keywords
import data.regal.result

aggregate contains entry if {
	entry := {"file": input.regal.file.name}
}

report contains violation if {
	not two_files_processed
	violation := result.fail(rego.metadata.chain(), {})
}

two_files_processed if {
	files := [x | x = input.aggregate[_].file]
	count(files) == 2
}
145 changes: 133 additions & 12 deletions pkg/linter/linter.go
@@ -51,6 +51,10 @@ type Linter struct {
	metrics metrics.Metrics
}

+type QueryInputBuilder func(name string, content string, module *ast.Module) (map[string]any, error)
+
+type ReportCollector func(report report.Report)

const regalUserConfig = "regal_user_config"

// NewLinter creates a new Regal linter.
@@ -180,7 +184,7 @@ var query = ast.MustParseBody("violations = data.regal.main.report") //nolint:go
func (l Linter) Lint(ctx context.Context) (report.Report, error) {
	l.startTimer(regalmetrics.RegalLint)

-	aggregate := report.Report{}
+	aggregateReport := report.Report{}

	if len(l.inputPaths) == 0 && l.inputModules == nil {
		return report.Report{}, errors.New("nothing provided to lint")
@@ -240,29 +244,39 @@ func (l Linter) Lint(ctx context.Context) (report.Report, error) {
		return report.Report{}, fmt.Errorf("failed to lint using Go rules: %w", err)
	}

-	aggregate.Violations = append(aggregate.Violations, goReport.Violations...)
+	aggregateReport.Violations = append(aggregateReport.Violations, goReport.Violations...)

+	var aggregates []report.Aggregate
+
+	if len(input.Modules) > 1 {
+		// No need to collect aggregates if there's only one file
+		aggregates, err = l.collectAggregates(ctx, input)
+		if err != nil {
+			return report.Report{}, fmt.Errorf("failed to collect aggregates using Rego rules: %w", err)
+		}
+	}
+
-	regoReport, err := l.lintWithRegoRules(ctx, input)
+	regoReport, err := l.lintWithRegoRules(ctx, input, aggregates)
	if err != nil {
		return report.Report{}, fmt.Errorf("failed to lint using Rego rules: %w", err)
	}

-	aggregate.Violations = append(aggregate.Violations, regoReport.Violations...)
+	aggregateReport.Violations = append(aggregateReport.Violations, regoReport.Violations...)

-	aggregate.Summary = report.Summary{
+	aggregateReport.Summary = report.Summary{
		FilesScanned: len(input.FileNames),
-		FilesFailed:  len(aggregate.ViolationsFileCount()),
+		FilesFailed:  len(aggregateReport.ViolationsFileCount()),
		FilesSkipped: 0,
-		NumViolations: len(aggregate.Violations),
+		NumViolations: len(aggregateReport.Violations),
	}

	if l.metrics != nil {
		l.metrics.Timer(regalmetrics.RegalLint).Stop()

-		aggregate.Metrics = l.metrics.All()
+		aggregateReport.Metrics = l.metrics.All()
	}

-	return aggregate, nil
+	return aggregateReport, nil
}

func (l Linter) lintWithGoRules(ctx context.Context, input rules.Input) (report.Report, error) {
@@ -414,7 +428,7 @@ func (l Linter) paramsToRulesConfig() map[string]any {
	}
}

-func (l Linter) prepareRegoArgs() []func(*rego.Rego) {
+func (l Linter) prepareRegoArgs(query ast.Body) []func(*rego.Rego) {
	var regoArgs []func(*rego.Rego)

	roots := []string{"eval"}
@@ -466,14 +480,16 @@
	return regoArgs
}

-func (l Linter) lintWithRegoRules(ctx context.Context, input rules.Input) (report.Report, error) {
+func (l Linter) lintWithRegoRules(
+	ctx context.Context, input rules.Input, aggregates []report.Aggregate,
+) (report.Report, error) {
	l.startTimer(regalmetrics.RegalLintRego)
	defer l.stopTimer(regalmetrics.RegalLintRego)

	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

-	regoArgs := l.prepareRegoArgs()
+	regoArgs := l.prepareRegoArgs(query)

	linterQuery, err := rego.New(regoArgs...).PrepareForEval(ctx)
	if err != nil {
Expand Down Expand Up @@ -502,6 +518,10 @@ func (l Linter) lintWithRegoRules(ctx context.Context, input rules.Input) (repor
			return
		}

+		if len(aggregates) > 0 {
+			enhancedAST["aggregate"] = aggregates
+		}
+
		evalArgs := []rego.EvalOption{
			rego.EvalInput(enhancedAST),
		}
Expand Down Expand Up @@ -738,6 +758,107 @@ func (l Linter) getBundleByName(name string) (*bundle.Bundle, error) {
	return nil, fmt.Errorf("no regal bundle found")
}

func (l Linter) collectAggregates(ctx context.Context, input rules.Input) ([]report.Aggregate, error) {
	var result []report.Aggregate

	regoArgs := l.prepareRegoArgs(ast.MustParseBody("aggregates = data.regal.main.aggregate"))

	var linterQuery rego.PreparedEvalQuery

	var err error

	if linterQuery, err = rego.New(regoArgs...).PrepareForEval(ctx); err != nil {
		return []report.Aggregate{}, fmt.Errorf("failed preparing query for linting: %w", err)
	}

	if err = l.evalAndCollect(ctx, input, linterQuery,
		// query input builder
		func(name string, content string, module *ast.Module) (map[string]any, error) {
			result, err := parse.EnhanceAST(name, input.FileContent[name], input.Modules[name])
			if err != nil {
				return nil,
					fmt.Errorf("could not enhance AST when building input during lint with Rego rules: %w", err)
			}

			return result, nil
		},
		// result collector
		func(report report.Report) {
			result = append(result, report.Aggregates...)
		},
	); err != nil {
		return nil, err
	}

	return result, nil
}

// Process each file in input.FileNames in a goroutine, with the given Rego query, building the eval input using the
// provided function. Collects the results via the provided collector. The collector is guaranteed to
// run sequentially via a mutex.
func (l Linter) evalAndCollect(ctx context.Context, input rules.Input, query rego.PreparedEvalQuery,
	queryInputBuilder QueryInputBuilder,
	reportCollector ReportCollector,
) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	var wg sync.WaitGroup

	var mu sync.Mutex

	errCh := make(chan error)

	doneCh := make(chan bool)

	for _, name := range input.FileNames {
		wg.Add(1)

		go func(name string) {
			defer wg.Done()

			queryInput, err := queryInputBuilder(name, input.FileContent[name], input.Modules[name])
			if err != nil {
				errCh <- fmt.Errorf("failed building query input: %w", err)

				return
			}

			resultSet, err := query.Eval(ctx, rego.EvalInput(queryInput))
			if err != nil {
				errCh <- fmt.Errorf("error encountered in query evaluation: %w", err)

				return
			}

			result, err := resultSetToReport(resultSet)
			if err != nil {
				errCh <- fmt.Errorf("failed to convert result set to report: %w", err)

				return
			}

			mu.Lock()
			reportCollector(result)
			mu.Unlock()
		}(name)
	}

	go func() {
		wg.Wait()
		doneCh <- true
	}()

	select {
	case <-ctx.Done():
		return fmt.Errorf("context cancelled: %w", ctx.Err())
	case err := <-errCh:
		return fmt.Errorf("error encountered in rule evaluation: %w", err)
	case <-doneCh:
		return nil
	}
}

func (l Linter) startTimer(name string) {
	if l.metrics != nil {
		l.metrics.Timer(name).Start()
11 changes: 10 additions & 1 deletion pkg/report/report.go
@@ -29,6 +29,12 @@ type Violation struct {
	Location Location `json:"location,omitempty"`
}

// An Aggregate is data collected by some rule while processing a file AST, to be used later by other rules needing a
// global context (i.e. broader than per-file).
// Rule authors are expected to collect the minimum needed data, to avoid performance problems
// while working with large Rego code repositories.
type Aggregate map[string]any

type Summary struct {
	FilesScanned int `json:"files_scanned"`
	FilesFailed  int `json:"files_failed"`
@@ -38,7 +44,10 @@ type Summary struct {

// Report aggregate of Violation as returned by a linter run.
type Report struct {
-	Violations []Violation `json:"violations"`
+	Violations []Violation `json:"violations"`
+	// We don't have aggregates when publishing the final report (see JSONReporter), so omitempty is needed here
+	// to avoid surfacing a null/empty field.
+	Aggregates []Aggregate `json:"aggregates,omitempty"`
	Summary    Summary        `json:"summary"`
	Metrics    map[string]any `json:"metrics,omitempty"`
}