ValidatingAdmissionPolicy: Cost estimator for extended library was not registered #124542

cici37 · 2024-04-25T23:23:06Z

What happened?

Thanks to @benluddy, who raised the issue and investigated!

We didn't register the cost estimator for the extended library, so the cost calculation is inaccurate for any expression that uses extended library functions.

What did you expect to happen?

The cost should reflect the extended library

/cc @jpbetz @benluddy
/sig api-machinery
/triage accepted

How can we reproduce it (as minimally and precisely as possible)?

An example would be:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: costly-policy
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   [""]
      apiVersions: ["v1"]
      operations:  ["CREATE"]
      resources:   ["namespaces"]
  validations:
    - expression: "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10].all(x, authorizer.requestResource.check('create').allowed())"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "costly-binding"
spec:
  policyName: "costly-policy"
  validationActions: [Deny]

The cost of the above expression exceeds the per-expression limit, but evaluation does not fail.

Anything else we need to know?

No response

jpbetz · 2024-04-25T23:35:18Z

This is unfortunate. Now that this feature is GA, adding the per-expression cost has the potential to break existing users.

I don't see per-expression cost mentioned in the docs: https://github.com/kubernetes/website/blob/main/content/en/docs/reference/access-authn-authz/validating-admission-policy.md

I do see per-expression cost mentioned in the KEP but nowhere else.

I agree we should consider keeping this as-is and not adding the per-expression cost.

benluddy · 2024-04-26T13:05:31Z

> I agree we should consider keeping this as-is and not adding the per-expression cost.

~~I agree, I think the compatibility concern is enough to leave as-is and document the difference from CR validation rules.~~ Edit: Based on our new understanding of this issue, we allow practically unlimited authorizer checks (as the worst example) and we need to fix it.

A per-policy (per-binding?) limit protects the apiserver from expensive policies. I suppose a per-expression limit puts an upper bound on the cost of the wasted work of evaluating a policy that fails due to its runtime budget? It is unfortunate that a single expression making 29 secondary authorization checks will incur the cost of 28 checks before exhausting its budget on the final check. With a per-expression limit, a determined policy author could still spread their evaluation across multiple variable or validation expressions, but it would be much harder to do by mistake.

benluddy · 2024-04-26T16:03:20Z

I opened #124569 to add integration tests for runtime cost, specifically around the intentionally-large costs we chose for secondary authorization checks, and even the per-policy limit is not reached with 29 checks (I expected the limit to be floor(10M/350k) or 28 checks).
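The expected limit can be checked with a short sketch. The 10M per-policy budget and the 350k cost per secondary authorization check are the figures quoted above; the constant and function names are hypothetical and are not taken from the Kubernetes source.

```go
package main

import "fmt"

// Figures quoted in the thread; the names are illustrative only.
const (
	perPolicyBudget uint64 = 10_000_000
	authzCheckCost  uint64 = 350_000
)

// maxCalls returns how many calls of costPerCall fit within budget.
func maxCalls(budget, costPerCall uint64) uint64 {
	if costPerCall == 0 {
		return 0 // guard against division by zero
	}
	return budget / costPerCall
}

// exceeds reports whether n calls of costPerCall cost more than budget.
func exceeds(n, costPerCall, budget uint64) bool {
	return n*costPerCall > budget
}

func main() {
	fmt.Println(maxCalls(perPolicyBudget, authzCheckCost))    // 28
	fmt.Println(exceeds(28, authzCheckCost, perPolicyBudget)) // false
	fmt.Println(exceeds(29, authzCheckCost, perPolicyBudget)) // true
}
```

Under these figures, 28 checks cost 9.8M and fit within the budget, while the 29th check brings the total to 10.15M, so a correctly costed policy should fail at that point.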

It looks like our library.CostEstimator, which computes the runtime and estimated costs for our library functions, is not wired at all for ValidatingAdmissionPolicy. On my machine I see that

kubernetes/vendor/github.com/google/cel-go/interpreter/runtimecost.go, line 190 in d07a576:

    callCost := c.Estimator.CallCost(call.Function(), call.OverloadID(), args, result)

is never being executed, so we're falling back to a default cost of 1.

If I wire the cel.CostTracking(&library.CostEstimator{}) cel.ProgramOption, then I see even per-expression limits are being enforced (more than 2 authz checks in a single-expression policy hits the runtime cost limit as expected).
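The effect of the missing wiring can be illustrated with a toy model. This is a minimal sketch and not the cel-go API: the `estimator` and `tracker` types are hypothetical, the 350k figure is the authorization check cost quoted in this thread, and the 1M per-expression limit is an assumption chosen to be consistent with the "more than 2 authz checks" observation above.

```go
package main

import (
	"errors"
	"fmt"
)

// errLimit is returned once the accumulated cost exceeds the limit.
var errLimit = errors.New("runtime cost limit exceeded")

// estimator is a hypothetical stand-in for a call-cost estimator. It returns
// ok=false for functions it has no estimate for.
type estimator func(function string) (cost uint64, ok bool)

// tracker accumulates runtime cost and enforces a limit.
type tracker struct {
	est   estimator // nil models "no estimator wired"
	limit uint64
	spent uint64
}

// call charges one function call and fails once the limit is exceeded.
func (t *tracker) call(function string) error {
	cost := uint64(1) // default cost when no estimate is available
	if t.est != nil {
		if c, ok := t.est(function); ok {
			cost = c
		}
	}
	t.spent += cost
	if t.spent > t.limit {
		return errLimit
	}
	return nil
}

// run makes up to n calls and returns how many completed within the limit.
func run(t *tracker, function string, n int) (int, error) {
	for i := 0; i < n; i++ {
		if err := t.call(function); err != nil {
			return i, err
		}
	}
	return n, nil
}

// libraryCosts models an estimator that knows the authz check cost.
func libraryCosts(function string) (uint64, bool) {
	if function == "check" {
		return 350_000, true
	}
	return 0, false
}

func main() {
	const limit = 1_000_000 // assumed per-expression limit

	unwired := &tracker{limit: limit}
	n, err := run(unwired, "check", 10)
	fmt.Println(n, unwired.spent, err) // 10 10 <nil>

	wired := &tracker{est: libraryCosts, limit: limit}
	n, err = run(wired, "check", 10)
	fmt.Println(n, wired.spent, err) // 2 1050000 runtime cost limit exceeded
}
```

Without an estimator, ten checks are charged a total cost of 10 and pass; with one, the third check pushes the total past the limit.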

cici37 · 2024-04-26T17:32:51Z

@benluddy Thank you for the investigation!
I thought we had all the previous config (kubernetes/staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/schema/cel/compilation.go, lines 232 to 235 in bae8300):

    prog, err := ruleEnv.Program(ast,
        cel.CostLimit(perCallLimit),
        cel.CostTracking(estimator),
        cel.InterruptCheckFrequency(celconfig.CheckFrequency),

added while building the env, but it turned out it was only applied in the CRD path. Thank you for catching it!

There is a gap in the existing test suite. When we test cost in cost_test.go, we wire the cost estimator explicitly (kubernetes/staging/src/k8s.io/apiserver/pkg/cel/library/cost_test.go, line 1056 in bae8300):

    est := &CostEstimator{SizeEstimator: &testCostEstimator{}}

When we test the runtime cost budget (kubernetes/staging/src/k8s.io/apiserver/pkg/admission/plugin/cel/filter_test.go, line 807 in bae8300):

    func TestRuntimeCELCostBudget(t *testing.T) {

we didn't add a test case for the extended library. Adding a test case to TestRuntimeCELCostBudget should be able to catch the issue; apparently we have a gap in our test cases...

The concern I have with adding the env option now is backward compatibility, since this is a GA feature. It might break existing expressions. We might need a ratcheting mechanism in place. Maybe ignore the cost-limit error for expressions that already existed and have not changed?

benluddy · 2024-04-26T17:44:34Z

In addition to fixing the unit test, resolving this definitely needs to include at least some integration and possibly E2E tests. Secondary authorization is the easy way to exercise it since its cost was designed to allow a certain number of checks per expression.

liggitt · 2024-04-30T17:11:21Z

> I agree, I think the compatibility concern is enough to leave as-is and document the difference from CR validation rules.

I'm not sure I agree... I don't think the ability to make the API server do unbounded work is acceptable.

A ratcheting approach to keep existing persisted VAPs from getting stuck would make sense, but this should be ratcheted down for new VAPs

edit: well, not completely unbounded, I guess... the overall limit still applies... but I'd still push for converging this to cost limits

benluddy · 2024-04-30T17:13:30Z

> I'm not sure I agree... I don't think the ability to make the API server do unbounded work is acceptable.

Our understanding of the issue has changed since that comment (the issue has been edited). I think everyone agrees that we need to fix this now.

> the overall limit still applies

If we had in fact been in the situation where only per-policy limits are enforced, we would have then run into the problem that per-policy limits are only checked between expression evaluations (i.e. the evaluation is not interrupted if the per-policy limit is reached partway). I think on reflection that situation would have been more dangerous than the issue we do have.
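The difference between checking a budget only between expressions and interrupting evaluation partway can be sketched as follows. This is a toy model under stated assumptions: each expression is represented as a list of call costs, and both function names are hypothetical rather than taken from the apiserver.

```go
package main

import "fmt"

// evalCheckedBetween compares the budget only after each expression has run
// to completion, so a single expensive expression can overshoot the budget.
func evalCheckedBetween(exprs [][]uint64, budget uint64) (spent uint64, ok bool) {
	for _, calls := range exprs {
		for _, c := range calls {
			spent += c
		}
		if spent > budget {
			return spent, false
		}
	}
	return spent, true
}

// evalInterrupted compares the budget after every call and stops immediately
// once it is exceeded.
func evalInterrupted(exprs [][]uint64, budget uint64) (spent uint64, ok bool) {
	for _, calls := range exprs {
		for _, c := range calls {
			spent += c
			if spent > budget {
				return spent, false
			}
		}
	}
	return spent, true
}

// repeat builds a list of n calls that each cost the same amount.
func repeat(cost uint64, n int) []uint64 {
	out := make([]uint64, n)
	for i := range out {
		out[i] = cost
	}
	return out
}

func main() {
	// One expression making 100 authz checks at the quoted 350k cost.
	exprs := [][]uint64{repeat(350_000, 100)}

	spent, ok := evalCheckedBetween(exprs, 10_000_000)
	fmt.Println(spent, ok) // 35000000 false

	spent, ok = evalInterrupted(exprs, 10_000_000)
	fmt.Println(spent, ok) // 10150000 false
}
```

Both variants reject the policy, but the first performs all 100 checks before noticing, while the second stops at the 29th.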

cici37 · 2024-04-30T17:13:33Z

The trick is that this happens at runtime, where CEL calculates the cost and rejects based on the limit we set. So simply ratcheting by checking whether an expression already exists might not be easy in this case.

liggitt · 2024-04-30T17:26:49Z

hmm... what if we fix the cost wiring in 1.31 / 1.30.x / 1.29.x with a gated opt-out escape hatch if an admin has existing use they need to audit / resolve first?

is there a reasonable way they could check if there are existing VAPs that would exceed correct cost?

thockin · 2024-04-30T18:38:48Z

Something like:

1.30.1 : Add a feature gate "AllowOverCostValidatingAdmissionPolicy", set "Deprecated", and on-by-default (maybe off by default?)
1.30.1 : Add a status indication for "needs over-cost allowance".
1.31 : default gate to off if needed
1.32 or 1.33 : remove gate

? The sooner we set that gate as deprecated, the less likely we are to break people, but no matter what there's a chance

Is it possible to always add a cost indicator to status? Or to validate the cost at CREATE/UPDATE rather than runtime?

benluddy · 2024-04-30T19:13:54Z

> is there a reasonable way they could check if there are existing VAPs that would exceed correct cost?

Since the cost depends on the object being admitted (you might be expanding a macro over a list field, for example), I don't think it can be done cheaply in the general case. It should be possible to decide whether a given expression is "definitely not affected" versus "possibly affected" by seeing if the expression references any library functions whose cost should differ from the default cost.
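A naive version of that classification could look like the sketch below. It is only an illustration: it scans the expression text with a regular expression instead of walking the CEL AST (so it can be fooled by string literals), and the set of costly function names passed in is hypothetical. The `check` entry reflects the authorization check discussed in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// callPattern matches an identifier immediately followed by "(".
var callPattern = regexp.MustCompile(`([A-Za-z_][A-Za-z0-9_]*)\s*\(`)

// possiblyAffected reports whether expr appears to call any function listed
// in costly. A false result means "definitely not affected" only to the
// extent that costly is complete and the textual scan is accurate.
func possiblyAffected(expr string, costly map[string]bool) bool {
	for _, m := range callPattern.FindAllStringSubmatch(expr, -1) {
		if costly[m[1]] {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical list of library functions with non-default cost.
	costly := map[string]bool{"check": true}

	authz := "[1, 2, 3].all(x, authorizer.requestResource.check('create').allowed())"
	plain := "object.metadata.name.startsWith('team-')"

	fmt.Println(possiblyAffected(authz, costly)) // true
	fmt.Println(possiblyAffected(plain, costly)) // false
}
```

A real implementation would more likely inspect the compiled AST for the relevant overloads, which avoids false matches inside string literals.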

For policies that might be impacted, would it be possible to send per-object, per-operation dry run requests? It could look sort of like a storage version migration.

cici37 · 2024-05-01T16:29:21Z

A deprecated feature gate, as suggested by @thockin above, sounds like a workable solution moving forward. We should encourage admins to turn off the gate as early as possible. The cost error occurs at runtime, so setting validationActions to Warn/Audit could help admins see whether existing expressions exceed the default cost limit without affecting the cluster much.

jpbetz · 2024-05-01T17:04:47Z

Brainstorming on the ratcheting idea suggested above: could we somehow set a strictCostEnforcement: false status field for all already stored VAP resources, but somehow lock the field to true for all newly created resources (or not even include it)?

The idea is that we ratchet by grandfathering in ALL resources written to storage before the fix (1.30.1 or 1.31 or whatever we choose), so that they continue to bypass the fixed cost code. I'd love to do better and identify which specific resources actually need the ratcheting, but if that's not possible, maybe this is?

liggitt · 2024-05-01T17:30:27Z

> could we somehow set a costEnforcement: false status field for all already stored VAP resources, but somehow lock the field to true for all newly created resources (or don't even include it)?

I like the per object approach, though depending on a new API field means <=1.30 won't be protected at all. I really really want a way to protect clusters at least from >= 1.30.x

One possibility would be modeling the strictness as a condition in VAP status, which already exists in the schema. It's a little strange to populate a condition in status on creation server-side, but there's some precedent in CRD.

A possible sketch of the behavior:

on VAP create, in validatingAdmissionPolicyStrategy#PrepareForCreate:
- populate status with a type=StrictCostEnforcement,status=True condition
on VAP status update, in validatingAdmissionPolicyStatusStrategy#PrepareForUpdate:
- calculate existing enforcement from the existing VAP object (if there's a StrictCostEnforcement condition with True or False status, use it, otherwise default to type=StrictCostEnforcement,status=False)
- if the incoming object is missing the condition, preserve the calculated existing condition
- if the incoming object has the condition, disallow True → False transitions
  - Disallowing True → False transitions could be painful for admins who have existing VAP objects and need to try making them strict and revert if requests start exceeding cost... maybe disallowing this transition is a thing to allow opting back into via a deprecated gate and eventually disallow entirely once there's been an opportunity to make existing policies strict.
in VAP enforcement when compiling expressions, compile with extended cost registered if the read object has a type=StrictCostEnforcement,status=True condition, use existing lenient cost otherwise (if the condition is missing or status != True)
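The condition handling sketched above can be modeled roughly as follows. This is a simplified illustration and not apiserver code: the `condition` type stands in for a real status condition, and every function name here is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

const strictType = "StrictCostEnforcement"

// condition is a simplified stand-in for a status condition.
type condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// find returns the StrictCostEnforcement condition, if present.
func find(conds []condition) (condition, bool) {
	for _, c := range conds {
		if c.Type == strictType {
			return c, true
		}
	}
	return condition{}, false
}

// prepareForCreate drops any client-supplied value and marks the new policy strict.
func prepareForCreate(conds []condition) []condition {
	out := []condition{}
	for _, c := range conds {
		if c.Type != strictType {
			out = append(out, c)
		}
	}
	return append(out, condition{Type: strictType, Status: "True"})
}

// existingEnforcement defaults to False unless the stored object says True or False.
func existingEnforcement(old []condition) condition {
	if c, ok := find(old); ok && (c.Status == "True" || c.Status == "False") {
		return c
	}
	return condition{Type: strictType, Status: "False"}
}

// prepareForStatusUpdate preserves a missing condition and rejects True -> False.
func prepareForStatusUpdate(old, incoming []condition) ([]condition, error) {
	existing := existingEnforcement(old)
	in, ok := find(incoming)
	if !ok {
		out := append([]condition{}, incoming...) // copy to avoid aliasing the caller's slice
		return append(out, existing), nil
	}
	if existing.Status == "True" && in.Status == "False" {
		return nil, errors.New("StrictCostEnforcement may not change from True to False")
	}
	return incoming, nil
}

// strict reports whether expressions should be compiled with the extended cost registered.
func strict(conds []condition) bool {
	c, ok := find(conds)
	return ok && c.Status == "True"
}

func main() {
	created := prepareForCreate(nil)
	fmt.Println(strict(created)) // true

	_, err := prepareForStatusUpdate(created, []condition{{Type: strictType, Status: "False"}})
	fmt.Println(err != nil) // true

	legacy, _ := prepareForStatusUpdate(nil, nil)
	fmt.Println(strict(legacy)) // false
}
```

Policies stored before the fix carry no condition, so they default to lenient cost until something sets the condition to True.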

cici37 · 2024-05-01T18:58:14Z

Unfortunately, this issue affects every place in K8s that uses CEL through the base environment, except for CRD validation rules. The most affected features would be GA features like ValidatingAdmissionPolicy and matchConditions. The above approach works great for VAP, but I am not so sure about the matchConditions added to webhooks. I don't think webhooks currently have a status...

cici37 added the kind/bug Categorizes issue or PR as related to a bug. label Apr 25, 2024

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Apr 25, 2024

cici37 changed the title ~~ValidatingAdmissionPolicy: Have a cost limit per policy but no cost limit per expression~~ ValidatingAdmissionPolicy: Cost estimator for extended library was not registered Apr 26, 2024

cici37 linked a pull request May 2, 2024 that will close this issue

Adding a deprecating feature gate to fix cost #124675

Open
