Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ValidatingAdmissionPolicy: Cost estimator for extended library was not registered #124542

Open
cici37 opened this issue Apr 25, 2024 · 15 comments · May be fixed by #124675
Open

ValidatingAdmissionPolicy: Cost estimator for extended library was not registered #124542

cici37 opened this issue Apr 25, 2024 · 15 comments · May be fixed by #124675
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@cici37
Copy link
Contributor

cici37 commented Apr 25, 2024

What happened?

Thanks for @benluddy who raised the issue and investigated!

We didn't register the cost estimator for extended library hence the cost calculation is inaccurate for expression including extended library.

What did you expect to happen?

The cost should reflect the extended library

/cc @jpbetz @benluddy
/sig api-machinery
/triage accepted

How can we reproduce it (as minimally and precisely as possible)?

An example would be:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: costly-policy
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   [""]
      apiVersions: ["v1"]
      operations:  ["CREATE"]
      resources:   ["namespaces"]
  validations:
    - expression: "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10].all(x, authorizer.requestResource.check('create').allowed())"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "costly-binding"
spec:
  policyName: "costly-policy"
  validationActions: [Deny]

The cost of the above expression exceeds the per expression limit but didn't fail.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@cici37 cici37 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 25, 2024
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Apr 25, 2024
@jpbetz
Copy link
Contributor

jpbetz commented Apr 25, 2024

This is unfortunate. Now that this feature is GA, adding the per-expression cost has the potential to break existing users.

I don't see per-expression cost mentioned in the docs: https://github.com/kubernetes/website/blob/main/content/en/docs/reference/access-authn-authz/validating-admission-policy.md

I do see per-expression cost mentioned in the KEP but nowhere else.

I agree we should consider keeping this as-is and not adding the per-expression cost..

@benluddy
Copy link
Contributor

benluddy commented Apr 26, 2024

I agree we should consider keeping this as-is and not adding the per-expression cost.

I agree, I think the compatibility concern is enough to leave as-is and document the difference from CR validation rules. Edit: Based on our new understanding of this issue, we allow practically unlimited authorizer checks (as the worst example) and we need to fix it.

A per-policy (per-binding?) limit protects the apiserver from expensive policies. I suppose a per-expression limit puts an upper bound on the cost of the wasted work of evaluating a policy that fails due to its runtime budget? It is unfortunate that a single expression making 29 secondary authorization checks will incur the cost of 28 checks before exhausting its budget on the final check. With a per-expression limit, a determined policy author could still spread their evaluation across multiple variable or validation expressions, but it would be much harder to do by mistake.

@benluddy
Copy link
Contributor

I opened #124569 to add integration tests for runtime cost, specifically around the intentionally-large costs we chose for secondary authorization checks, and even the per-policy limit is not reached with 29 checks (I expected the limit to be floor(10M/350k) or 28 checks).

It looks like our library.CostEstimator, which computes the runtime and estimated costs for our library functions, is not wired at all for ValidatingAdmissionPolicy. On my machine I see that

callCost := c.Estimator.CallCost(call.Function(), call.OverloadID(), args, result)
is never being executed, so we're falling back to a default cost of 1.

If I wire the cel.CostTracking(&library.CostEstimator{}) cel.ProgramOption, then I see even per-expression limits are being enforced (more than 2 authz checks in a single-expression policy hits the runtime cost limit as expected).

@cici37 cici37 changed the title ValidatingAdmissionPolicy: Have a cost limit per policy but no cost limit per expression ValidatingAdmissionPolicy: Cost estimator for extended library was not registered Apr 26, 2024
@cici37
Copy link
Contributor Author

cici37 commented Apr 26, 2024

@benluddy Thank you for the investigation!
I thought we have all the previous config(

prog, err := ruleEnv.Program(ast,
cel.CostLimit(perCallLimit),
cel.CostTracking(estimator),
cel.InterruptCheckFrequency(celconfig.CheckFrequency),
) added while building env but turned out it only applied in crd path. Thank you for catching it!

There is a gap in the existing test suite. When we test cost in cost_test.go, we wired the cost estimator specifically:

est := &CostEstimator{SizeEstimator: &testCostEstimator{}}

When we test runtime cost budget in
func TestRuntimeCELCostBudget(t *testing.T) {
, we didn't add test case for extended library.

Adding test case in

func TestRuntimeCELCostBudget(t *testing.T) {
should be able to catch the issue and apparently we have gap in our test cases...

The concern I have for adding the env option now is the backward compatibility since it is a GA feature. It might break existing expression. We might have to have a ratcheting mechanism in place. Maybe ignore the previous existing and not changed expression for exceeding cost limit err?

@benluddy
Copy link
Contributor

benluddy commented Apr 26, 2024

In addition to fixing the unit test, resolving this definitely needs to include at least some integration and possibly E2E tests. Secondary authorization is the easy way to exercise it since its cost was designed to allow a certain number of checks per expression.

@liggitt
Copy link
Member

liggitt commented Apr 30, 2024

I agree, I think the compatibility concern is enough to leave as-is and document the difference from CR validation rules.

I'm not sure I agree... I don't think the ability to make the API server do unbounded work is acceptable.

A ratcheting approach to keep existing persisted VAPs from getting stuck would make sense, but this should be ratcheted down for new VAPs

edit: well, not completely unbounded, I guess... the overall limit still applies... but I'd still push for converging this to cost limits

@benluddy
Copy link
Contributor

benluddy commented Apr 30, 2024

I'm not sure I agree... I don't think the ability to make the API server do unbounded work is acceptable.

Our understanding of the issue changed since that comment (the issue has been edited), I think everyone agrees that we need to fix this now.

the overall limit still applies

If we had in fact been in the situation where only per-policy limits are enforced, we would have then run into the problem that per-policy limits are only checked between expression evaluations (i.e. the evaluation is not interrupted if the per-policy limit is reached partway). I think on reflection that situation would have been more dangerous than the issue we do have.

@cici37
Copy link
Contributor Author

cici37 commented Apr 30, 2024

The trick is that it happens in runtime where cel is calculating the cost and reject based on the limit we set... SO simply ratcheting with checking if it is existing expression might not be easy in this case..

@liggitt
Copy link
Member

liggitt commented Apr 30, 2024

hmm... what if we fix the cost wiring in 1.31 / 1.30.x / 1.29.x with a gated opt-out escape hatch if an admin has existing use they need to audit / resolve first?

is there a reasonable way they could check if there are existing VAPs that would exceed correct cost?

@thockin
Copy link
Member

thockin commented Apr 30, 2024

Something like:

1.30.1 : Add a feature gate "AllowOverCostValidatingAdmissionPolicy", set "Deprecated", and on-by-default (maybe off by default?)
1.30.1 : Add a status indication for "needs over-cost allowance".
1.31 : default gate to off if needed
1.32 or 1.33 : remove gate

? The sooner we set that gate as deprecated, the less likely we are to break people, but no matter what there's a chance

Is it possible to always add a cost indicator to status? Or to validate the cost at CREATE/UPDATE rather than runtime?

@benluddy
Copy link
Contributor

benluddy commented Apr 30, 2024

is there a reasonable way they could check if there are existing VAPs that would exceed correct cost?

Since the cost depends on the object being admitted (you might be expanding a macro over a list field, for example), I don't think it can be done cheaply in the general case. It should be possible to decide whether a given expression is "definitely not affected" versus "possibly affected" by seeing if the expression references any library functions whose cost should differ from the default cost.

For policies that might be impacted, would it be possible to send per-object, per-operation dry run requests? It could look sort of like a storage version migration.

@cici37
Copy link
Contributor Author

cici37 commented May 1, 2024

A deprecating feature gate suggested by @thockin above sounds like a working solution moving forward. We should encourage admins to turn off the flag as early as possible. The cost error is in runtime, setting validationaction to warn/audit could help admins to have a better idea if existing expressions cost exceed the default cost limit without affecting the cluster much.

@jpbetz
Copy link
Contributor

jpbetz commented May 1, 2024

Brainstorming on the ratcheting idea suggested above.. could we somehow set a strictCostEnforcement: false status field for all already stored VAP resources, but is somehow lock the field to true for all newly created resources (or don't even include it)?

The idea is that we ratchet by allowing grandfather in ALL resources written to storage before the fix to this (1.30.1 or 1.31 or whatever we choose) to continue bypass the fixed cost code. I'd love to do better and identify what specific resources actually need the ratcheting, but if that's not possible, maybe this is?

@liggitt
Copy link
Member

liggitt commented May 1, 2024

could we somehow set a costEnforcement: false status field for all already stored VAP resources, but is somehow lock the field to true for all newly created resources (or don't even include it)?

I like the per object approach, though depending on a new API field means <=1.30 won't be protected at all. I really really want a way to protect clusters at least from >= 1.30.x

One possibility would be modeling the strictness as a condition in VAP status, which already exists in the schema. It's a little strange to populate a condition in status on creation server-side, but there's some precedent in CRD.

A possible sketch of the behavior:

  • on VAP create, in validatingAdmissionPolicyStrategy#PrepareForCreate:
    • populate status with a type=StrictCostEnforcement,status=True condition
  • on VAP status update, in validatingAdmissionPolicyStatusStrategy#PrepareForUpdate:
    • calculate existing enforcement from the existing VAP object (if there's a StrictCostEnforcement condition with True or False status, use it, otherwise default to type=StrictCostEnforcement,status=False)
    • if the incoming object is missing the condition, preserve the calculated existing condition
    • if the incoming object has the condition, disallow True → False transitions
      • Disallowing True → False transitions could be painful for admins who have existing VAP objects and need to try making them strict and revert if requests start exceeding cost... maybe disallowing this transition is a thing to allow opting back into via a deprecated gate and eventually disallow entirely once there's been an opportunity to make existing policies strict.
  • in VAP enforcement when compiling expressions, compile with extended cost registered if the read object has a type=StrictCostEnforcement,status=True condition, use existing lenient cost otherwise (if the condition is missing or status != True)

@cici37
Copy link
Contributor Author

cici37 commented May 1, 2024

Unfortunately this issue is going to affect all places in K8s using CEL through base environment except for CRD validation rules. The features which are affected the most would be the GAed features like ValidatingAdmissionPolicy and MatchConditions. The above approach works great for VAP but I am not so sure about the matchConditions added in Webhook. I don't think we have status for webhook currently...

@cici37 cici37 linked a pull request May 2, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants