
Create a proposal for Buffers API. #8151


Open
jbtk wants to merge 1 commit into master from buffersproposal

Conversation

jbtk
Contributor

@jbtk jbtk commented May 21, 2025

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Adds proposal for Buffers API

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 21, 2025
@k8s-ci-robot
Contributor

Hi @jbtk. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 21, 2025
@k8s-ci-robot k8s-ci-robot requested a review from vadasambar May 21, 2025 09:36
@k8s-ci-robot k8s-ci-robot requested a review from x13n May 21, 2025 09:36
@jbtk
Contributor Author

jbtk commented May 21, 2025

The initial discussion about this proposal was in a doc: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?tab=t.0

(there are still unresolved conversations there)

@jbtk jbtk force-pushed the buffersproposal branch from d304178 to 4bebd67 Compare May 22, 2025 10:01
@jbtk
Contributor Author

jbtk commented May 22, 2025

Summary of open topics:

  • controller vs library used by autoscalers for translating buffers to pod spec
  • k8s quotas limitations (lacking in the proposal, will research that and fill in the blanks in the proposal)

@jbtk jbtk force-pushed the buffersproposal branch from 4bebd67 to ee9915b Compare May 27, 2025 12:03
@raywainman
Member

raywainman commented May 27, 2025

/assign @jackfrancis

@jackfrancis - adding you here because I know this is related to things you've thought about in the past with regards to pre-provisioning capacity.

@jbtk
Contributor Author

jbtk commented Jun 2, 2025

Update:

  1. added k8s resource quota handling to proposal
  2. after some discussions I am keeping the proposal to have a separate controller. When implementing we will reconsider embedding pod spec (due to embedding and validation that needs to be copied and maintained)

Collaborator

@towca towca left a comment


The overall idea and direction looks good to me, but IMO we should clarify some fine details in the proposal.

@jackfrancis @gjtempleton Could you give the proposal a high-level review?

@jonathan-innis Could you review the proposal? It's translated from the doc we've already discussed.

If the user uses autoscaling the cluster size will be adjusted to the number of
currently running pods. This means that often when a new pod is created its
scheduling time will need to include the time of creating a new node and
attaching it to the cluster. Many users care deeply about pod scheduling time
Contributor

There are a nontrivial number of writes (events, etc.) associated with evicting, deleting, creating, and scheduling pods. etcd is pretty sensitive to writes -- I suspect there are pretty huge perf benefits to this if you were to measure it.

* User pods can use the available capacity without any configuration changes
(strictly additive, can be managed by the user who owns the pod as well as
cluster administrators)
* Open possibility for cloud providers to offer different types of buffers if
Contributor

@ellistarn ellistarn Jun 5, 2025


I'm struggling with this use case. Abstraction/extensibility often comes at the cost of complexity. I'd be wary of expanding to this scope unless there was a very concrete use case that could not be met in a vendor agnostic way.

Contributor Author

As part of the proposal I am adding a buffer type field (string). It will allow offering different types of buffers, both vendor agnostic and vendor specific.

Ideas that are floating around:

  • with the resource hotplug proposal we could provide a buffer that a node can be resized into, so that new capacity comes up faster (KEP-3953: Node Resource Hot Plug, kubernetes/enhancements#3955)
  • offering a buffer of stopped nodes that can be restarted faster than creating a new node.

I do not expect that there will be any more complexity added as the complexity would be in the controller and autoscaler and can be vendor or autoscaler specific.
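As an illustration only, a minimal sketch of how such a type field could sit on a buffer object; the kind, field, and value names here are assumptions made for the example, not the agreed API:

```yaml
apiVersion: autoscaling.x-k8s.io/v1alpha1
kind: Buffer                    # final kind name is still under discussion in this thread
metadata:
  name: fast-scaleup-buffer
spec:
  bufferType: stopped-vms       # hypothetical value; omitting the field would mean the
                                # default type, i.e. regular active capacity
  # ... the rest of the buffer spec (target, capacity) stays the same regardless of type
```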


I do not expect that there will be any more complexity added as the complexity would be in the controller and autoscaler and can be vendor or autoscaler specific.

It's less about complexity on the implementation and more about complexity and cognitive burden of the different "types" of buffers on the user -- if I have multiple buffer types that orchestrate buffers in completely different ways but are using the same GVK, I feel like that's asking for users to get really really confused around what's happening where and why.

At that point, you're better off just creating a separate API yourself. The other option here is that we just allow users to convert the API into pods however they want, and they can just plug on to the buffer using annotations. As we get more concrete use cases for what folks are adding annotations for, we can start to add features.


Contributor Author

  1. Actually I think it is a good thing that there is a single API surface to manage different types of buffers: the user can test and compare how different buffers behave. Example: I have a workload, and I can set up an active buffer or a stopped-VMs buffer. With a single field change it is possible to flip the buffer type and compare performance and cost. Yes - this means that this API needs to be flexible, but since it comes down to pod-like chunks of capacity it should be relatively easy to translate it to any kind of buffer.
  2. About users getting confused - yes, buffers may be able to provide various buffer types in the future, but there will be a single surface to view all the possible overprovisioning mechanisms.

I think that annotations are a good idea if someone adds a new buffer type and the API is not enough for them. I would want to avoid a situation, though, where we create something that the user did not intend because something failed to parse an annotation (because an older version of the buffer controller did not support it). Therefore I think that adding a field for the type ensures that we can fail fast in case the configuration tries to accomplish something that is not necessarily possible.

additional pods that are representing a buffer (for example treating them as
virtual pods) so that the capacity in the cluster becomes available.

### How do we determine the pod spec for targets using scale subresource?
Contributor

Why not use a pod template spec inline? Imagine the case where you want to leave room for a CI job, but the job doesn't exist in the cluster yet, so you have nothing to reference.

Contributor Author

The proposal has both options: referencing an object that implements the scale subresource, or referencing a pod template. The user will be able to choose one or the other.

Note that referencing a pod template is tricky. Not every controller requires the user to define one (for example, a deployment just embeds it, and copy-pasting is prone to errors because one can change and not the other) and some pods can change over time (for example if VPA is used and resizes the pods).
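For illustration, a rough sketch of what the two targeting options could look like side by side; the kind and field names loosely follow the snippets quoted elsewhere in this thread and are assumptions rather than the final API:

```yaml
# Option 1: reference an object that implements the scale subresource
apiVersion: autoscaling.x-k8s.io/v1alpha1
kind: Buffer
metadata:
  name: my-deployment-buffer
spec:
  replicaBasedBufferSpec:
    targetRef:                  # hypothetical field name
      apiVersion: apps/v1
      kind: Deployment
      name: my-deployment
---
# Option 2: reference a pod template directly
apiVersion: autoscaling.x-k8s.io/v1alpha1
kind: Buffer
metadata:
  name: my-template-buffer
spec:
  podTemplateRef:               # hypothetical field name
    name: my-pod-template
```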

require more implementation work within the cluster autoscaler in order to
support virtual pods with DRA resources.
* More features that further help users to manage the buffers:
* Buffer schedule - option to define a buffer that is active only during
Contributor

If Buffer supported the scale subresource, you could hook it up to things like https://keda.sh/docs/2.11/scalers/cron/ to achieve this use case.
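For illustration, a sketch of how that could look if Buffer exposed the scale subresource; the Buffer kind and name are placeholders, and the trigger follows KEDA's cron scaler as documented at the link above:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: buffer-schedule
spec:
  scaleTargetRef:                 # only possible if Buffer implements the scale subresource
    apiVersion: autoscaling.x-k8s.io/v1alpha1
    kind: Buffer                  # hypothetical kind/name
    name: my-deployment-buffer
  triggers:
  - type: cron
    metadata:
      timezone: Europe/Warsaw
      start: 0 8 * * 1-5          # scale the buffer up at 08:00 on weekdays
      end: 0 18 * * 1-5           # scale it back down at 18:00
      desiredReplicas: "5"
```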

Contributor Author

This is a potential feature for the future. On the original doc there is discussion about this: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?disco=AAABjKBExBs

Once I implement buffers I will look into this more: how hard it would be to achieve this with existing tools vs a dedicated field. Thanks for the link to the tool - I will check it out in addition to testing with plain k8s CronJobs.

Also, implementing the scale subresource would mean that we could use things like HPA to adjust the size of the buffer, which could be useful as well.

// Depending on the type of referenced object one of the configurations below will be used:
// - for a crd implementing scale subresource (like deployment, job, replicaset)
// the ReplicaBasedBufferSpec will be used
// - for GKE CCC or Karpenter NodePool the NodeClassBufferSpec will be used
Contributor

For node-centric headroom, I suspect that this will be implemented in the NodePool API in Karpenter, and not part of this buffer API.

Contributor Author

I am not sure what the plans of the Karpenter team are. We talked with them about this proposal, and they commented on the original doc (https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?pli=1&disco=AAABjKBEw_Q).

There is also a separate section on why we do not plan to expose an API of "I just want to have 2 extra nodes"; most likely that would make sense in the context of the NodePool API.

Contributor Author

Just to clarify:

  • buffers focus on the capacity in the cluster; we want to offer users a single API surface to define spare capacity - whether it is based on a single workload or shared by workloads that run on a single CCC/Karpenter pool
  • in GKE, CCC focuses on setting up a policy/configuration. This is why we wanted to have a separate, provider-agnostic way of expressing extra capacity that should be provisioned. I am not sure about the Karpenter node pool philosophy.
  • Because CCC is a policy and the node class buffer (name TBD ;P) is a way to have a shared buffer, they may have different access control. Also, there may be many workloads on a single pool, and it may be much more convenient to specify two separate buffers of different chunk shapes rather than a single buffer, which would make more sense from the NodePool API perspective

@towca as FYI

@jbtk jbtk force-pushed the buffersproposal branch from ee9915b to a8fb68a Compare June 9, 2025 11:13
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jbtk
Once this PR has been reviewed and has the lgtm label, please ask for approval from jackfrancis. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jbtk jbtk force-pushed the buffersproposal branch 3 times, most recently from b2f35b9 to 0c81ab0 Compare June 13, 2025 11:49
@ellistarn
Contributor

Thanks @jbtk, I'll review again soon when I get the chance.

name: my-deployment-buffer
namespace: my-namespace
spec:
replicaBasedBufferSpec:
Contributor

I'd love to spend some time refining these API specs. I think there are some good opportunities to simplify structure and come up with better naming. Do we think this KEP is the right forum for these discussions, or should I wait for the PR w/ the actual go structs?

For example:

We stutter spec and replicas with spec.replicaBasedBufferSpec.replicasCountPercentage

We could combine replicasCountPercentage and minCount into:

replicas: 10% # IntOrString

Contributor Author

This was the original proposal, but I got feedback that dual-purpose value fields are not ideal, especially when one field works with one option and not the other. This embedding was proposed during the review here: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?disco=AAABjQp5l44

Collaborator

@towca towca Jun 25, 2025


I don't personally have a strong preference here, IMO we should follow Kubernetes API best practices. Do we know of any other similar fields in the K8s API?

@jackfrancis @gjtempleton WDYT?

@ellistarn
Contributor

@jbtk, thanks for taking my earlier feedback to heart. I like the direction this is going. I'm not quite grokking the current state of the proposal, as the Go structs do not match the YAML specs. Could you make these agree and I'll do another pass? In the meantime, I have a few thoughts:

  1. I like the more specific name that conceptually scopes to "Node". I think we might've gotten a bit too specific, as "NodeBuffer" conveys the same information and is easier to read/write/say. I'd also advocate for NodeHeadroom or NodeReservation ("reserved space in a node, which can be fulfilled by pods").

  2. Have we thought about the permission model of the arbitrary targetRefs? It means that administrators may need to continuously add new permission types to their buffer controller. This gives a slight advantage

  3. I don't understand why it's necessary to include a target inside a resources buffer. Couldn't you let the autoscaler implementation decide which one to choose based off of the scheduling algorithm? For example, in Karpenter, you could use the NodeSelector field in a resources buffer to do:

nodeSelector:
  - key: "karpenter.sh/node-pool"
    value: "default"
    operator: In

It strikes me as much more powerful to use nodeSelector here than to force a target. In Karpenter, we would want the default behavior to simply let the normal provision algorithm decide which NodePool to choose based off of cost/scheduling/limits, etc -- users would only constrain things for specific situations.

  4. I'm still not understanding the point of buffer type. As far as I understand, this feature only has one value (default) today, and the other use cases are a bit speculative. I'm not confident that the use cases theorized wouldn't require additional API surface anyway. I'd recommend dropping it and re-adding it once we have more clarity and are directly implementing the feature. It's very common for controllers to implement implementation-specific features like this using annotations, which unblocks providers from experimenting without baking concepts into the API until they're well understood.

@jbtk jbtk force-pushed the buffersproposal branch from fcade29 to dc04dad Compare June 23, 2025 07:58
@jbtk
Contributor Author

jbtk commented Jun 23, 2025

@ellistarn responding

  1. It feels to me that NodeBuffer could incorrectly suggest that the buffer is empty nodes, and this is not the intention: it is just spare capacity on the nodes (including spare capacity on the existing nodes). Therefore NodeBuffer feels misleading and does not represent well what it does. NodeAutoscalingBuffer, however, feels like a better fit, expressing that this is a buffer that is maintained when autoscaling the nodes. I am happy to come up with a better name (I agree it is not great), but I do not think that NodeBuffer is what we need.
  2. What do you mean by "permission model for arbitrary target refs"? If someone wants to support some additional target refs they will need to deploy their own buffer controller(s). Or do you mean some mechanism for earlier validation of which target refs are available, so that people learn earlier if their buffer is incorrect?
  3. I think there are a couple of reasons, none of them particularly strong, so happy to discuss:
    1. node selectors are very expressive - if we want to offer users good observability, it would be more convenient for them to reference an object directly; if you decide that you want to do this for multiple Karpenter pools you can use node selectors and leave the single target empty.
    2. if people want to write buffer controllers for their custom CRDs, referencing an object directly will be much easier to handle - not everything related to custom CRDs would be possible to express in the form of node selectors.
  4. The use case of "extensibility for different cloud providers to offer custom types" will always be speculative until we add a field and someone starts using it. With defaulting, the user will not need to specify any additional field if they use the default type. If we say that all of this has to be done through annotations, we can end up with users later copy-pasting the buffer between clouds and being surprised that it was misinterpreted because the annotations were ignored. It is possible that for certain custom buffer types there will be a need for additional fields and the only way of adding them will be annotations, but it will be clear that these types have special handling (and the buffer controller can clearly communicate that a given buffer type is not supported).

@ellistarn
Contributor

Thanks @jbtk,

For 1, I get your point about the potential confusion. Any thoughts on my suggested name, NodeReservation? This one especially makes sense to me if we decide to support node hibernation with this API. It might also be worth exploring a "Pod" prefix (e.g. PodBuffer, PodReservation) for these names to help clarify the subject.

For 2, apologies -- this thought was half baked and the last sentence incomplete. My thought was that arbitrary targets might require customized RBAC, but the design already assumes that each provider will have a separate implementation w/ separate RBAC, so I think this is solved.


For 3/4, I think these can be summarized as "speculative features that try to address future extensibility". This is philosophical, so I can understand if you feel differently, but I'd recommend that we delay the addition of these additional fields/features until the use case is concrete.

For example, if CAS/GKE/Karpenter need targetRef to support something that cannot be easily supported with NodeSelector, let's add a follow-on that builds the extensibility and maps it directly to the use case.

In my experience, building speculative extensibility carries risk that the feature will not fully meet the future requirements once they're known. This can lead to awkward/unnecessary fields that are confusing to customers. We can easily add to APIs with new features, but removing from APIs is much more costly if not impossible. It's valuable to think of potential extension points (e.g. our discussion on fulfilledBy), even if we don't implement them immediately.

@jbtk
Contributor Author

jbtk commented Jun 24, 2025

@ellistarn

  • 1 Reservations are expected to be added at the scheduler level (some docs: an older one, and a newer approach that was rejected but may come back in a similar form: KEP-5147: DRA Node Reservations via ResourceClaims, kubernetes/enhancements#5149). Ultimately we do not reserve the space - we only create it and keep it up (there is no guarantee that it will not move between nodes etc., as the scheduler is unaware of this space being taken). As I said - I do not want to create the impression that Buffers operate on a node level, as this is not the intention of this API. I would rather go with ResourceBuffer, but "resource" is again a pretty ambiguous word. Though the resource word matches the resources fields in the pod spec, which makes it potentially a match.

  • 3 For the target ref in resources capacity - this is not something I have a strong feeling about. I think that for any custom object that is a single thing (a GKE custom compute class or any other custom CRD that users would like to write custom buffers for) it is much more convenient to point directly to that object rather than reference it by node selectors (for CCC it happens to be possible, but for a custom CRD not necessarily, making it almost the same as the pod-based config with only the feature of splitting the resources between chunks of the buffer). Also, with a direct ref we could automatically use it for the owner reference metadata field. Since this is not required, as CCC can be referenced by a node selector (not direct and nice, but working), I can remove this if others also feel that it is not a good idea.

  • 4 As for the type - this is something that has very concrete ideas behind it (like stopped VMs or resizability) and does not have a workaround other than an annotation, which carries the risk of a buffer controller not supporting it and acting as if the buffer was for active capacity. This one I would rather leave as is.

Since 3 and 4 are a matter of opinion it would be great for others to chime in. @towca - are you the right person, or is there someone else who may have an opinion here?

@towca
Collaborator

towca commented Jun 25, 2025

@jbtk thanks for addressing my comments! I feel like we're very close to alignment, let's finalize this so that we can start implementing something.

IIUC these are the contentious parts that are preventing us from approving the proposal:

  1. The Buffer name.

    • We need to distinguish that it's a Node-autoscaling-related (vs workload-autoscaling) object, so Buffer is too ambiguous.
    • NodeBuffer suggests that the Buffer is "Node-centric" (i.e. "Buffer consisting of Nodes"), while the semantics are actually more "workload-centric" (i.e. "Buffer for a workload"). I agree that this is confusing and would prefer to avoid this option. NodeHeadroom has the same problem.
    • Including Reservation in the name is IMO confusing because nothing will actually be "reserved" for Buffers until we implement the fulfilledBy semantics. Also there's the name collision with scheduler reservations that are being discussed separately, as @jbtk mentioned.
    • So we need a name that is accurate, distinguishes from workload-autoscaling, doesn't suggest that it's "Node-centric", and doesn't have collisions with other k8s concepts. IMO NodeAutoscalingBuffer, or ResourceBuffer both satisfy all points (the accuracy part is of course subjective).
    • IMO CapacityBuffer would be slightly better than ResourceBuffer. I'd also like to resurface my early suggestion of node-autoscaling.x-k8s.io/Buffer which also seems to satisfy all requirements.
  2. Whether to include TargetRef in ResourceBufferCapacity.

    • @jbtk Do I understand correctly that ResourceBufferCapacity is supposed to mean "buffer for total amount of resources, and the arbitrary CRD just specifies where to create the resources"? If so, I'd name the field something else than TargetRef since the "target" here seems to be the total resources+chunk config (following the ReplicaBufferCapacity pattern where "target" is a workload). Maybe NodeSelectorRef/NodeSelectorObject or something similar?
    • @ellistarn Having the reference to the object gives us better observability (we could use it in kubectl etc). It's also easier for the controller to identify the object to translate (imagine if e.g. we have different default chunking strategies depending on the object). NodeSelector can work for this, but it's imprecise and essentially hacking the semantics.
    • Maybe it would be clearer if we put NodeSelector and the Ref in a separate subobject meant to select where to create the capacity? @ellistarn could you expand on what is your concern here?
  3. The format of replica count/percentage in ReplicaBufferCapacity.

    • As mentioned in the other comment, for this one I'd defer to prior art in K8s API, or an opinion of a K8s API expert.
    • Maybe we can get Tim to do another review pass?
  4. Whether to include the ProvisioningStrategy field.

    • As @jbtk mentioned, we do have concrete use-cases in mind that we can talk about right now. The alternative is coming back to this discussion in a couple of months. I'd prefer to avoid that, as any iteration seems to have a non-trivial overhead of kicking off and maintaining the discussion.
    • We can focus on the simple use-case that @jbtk mentioned - "stand-by"/"stopped" VMs that can be quickly resumed. Having the ProvisioningStrategy field allows the user to specify that the Buffer should be created using such VMs.
    • @ellistarn I agree that we might need more discussion and design later when the use-cases are more concrete, but it seems safe to assume that we'll need to at least be able to distinguish from the default behavior. Which is all that the field does. Are you concerned that we'd want a completely separate API for this use-case?

A lot of this is subjective so it'd be good to hear more opinions. @gjtempleton @jackfrancis @x13n could you weigh in on the points listed above?

@jbtk jbtk force-pushed the buffersproposal branch from dc04dad to 6531fa2 Compare June 26, 2025 07:32
@jbtk
Contributor Author

jbtk commented Jun 26, 2025

@towca @ellistarn

Yesterday I met with @liggitt to discuss embedding the pod spec. Following his advice I decided to get rid of the embedding; in the buffer status we will reference the pod template and save its generation number to ensure that the template was not modified by the user. This way we will be able to ensure that we do not over-provision if we reach resource quota limits. The pod template will have the owner reference set, so when the buffer is deleted it will be deleted as well.

I also used the opportunity to consult on the IntOrString type. @liggitt told me that in newer APIs we try to avoid these multipurpose fields as they make the processing harder (harder to validate, and some fields work with one value type but not the other). Taking this into account I would rather leave these fields as they were.
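A rough sketch of how the status could record the referenced PodTemplate and its generation under the approach described above; the field names are illustrative assumptions, not the final API:

```yaml
status:
  podTemplateRef:
    name: my-deployment-buffer-template   # PodTemplate created for the buffer, owned by it
                                          # via ownerReferences so it is garbage-collected
  podTemplateGeneration: 3                # metadata.generation observed when the template was
                                          # written; a mismatch signals user modification
```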

@jbtk jbtk force-pushed the buffersproposal branch from 6531fa2 to cdec8ce Compare June 26, 2025 09:05

type NodeAutoscalingBufferStatus struct {
PodBufferCapacity PodBufferCapacity
PodTemplateGeneration int
Contributor

@ellistarn ellistarn Jun 26, 2025


Why isn't generation inside PodBufferCapacity?

Contributor Author

The generation is relevant for the status only. I do not see any good use case for giving the user the option to define the generation in the config, and therefore I did not want to increase the number of fields that the user may set (and therefore needs to understand).

PodBufferCapacity PodBufferCapacity
PodTemplateGeneration int
Conditions []BufferCondition
LastFullyProvisioned *metav1.Time
Contributor

How does this timestamp differ from the timestamps in .status.conditions? Usually time related information is available there.

Contributor Author

We will not have a condition for a fully provisioned buffer. We only have a condition for the fact that the cluster is scaling for this buffer.

Why?
We could have an additional condition for fully provisioned, but this information is very volatile - it is enough for a single pod to get scheduled in this space for it to become stale (and we do not have a good way to set this condition to false right away). Therefore I did not want to create a condition that looks like a way to check whether a buffer is fully provisioned when in fact only the timestamp there is useful, and decided to go with just the timestamp field.

Member

And how do you want to keep LastFullyProvisioned up to date? Will it be updated every second if there are no pods scheduled into the buffer? With a condition it is actually clear: condition FullyProvisioned: true with last transition timestamp T1 would mean that as of T1 the buffer was fully provisioned. It can later change to FullyProvisioned: false with timestamp T2, meaning that as of T2 the buffer was not fully provisioned (and either the cluster was scaling or there was some reason on the condition explaining why scaling was not happening as of T2).
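For illustration, a sketch of the condition pattern being suggested; the type and reason names are assumptions, following the standard metav1.Condition shape:

```yaml
status:
  conditions:
  - type: FullyProvisioned                       # hypothetical condition type
    status: "True"
    lastTransitionTime: "2025-06-26T10:00:00Z"   # T1: fully provisioned as of this time
    reason: CapacityAvailable
  # later, once pods schedule into the buffer space and scaling starts again:
  # - type: FullyProvisioned
  #   status: "False"
  #   lastTransitionTime: "2025-06-26T10:05:00Z" # T2: not fully provisioned, reason explains why
  #   reason: ScalingUp
```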

Contributor Author

If I were the user and could write an if-condition on FullyProvisioned, it is very likely that I would totally forget to check the timestamp. It just feels very misleading to offer something like that which we know can become stale very easily. But maybe k8s users are used to conditions like this and I am being pessimistic here.

On the other hand, you are correct that I did not think about how often we would update the buffers in a cluster with low churn once the buffers are set up. I am removing this field and will reconsider it when we implement metrics and monitoring after the alpha launch.

apiVersion: apps/v1
kind: Deployment
name: my-deployment
percent:
Contributor

@ellistarn ellistarn Jun 26, 2025


There are a few unusual things about this structure to me.

  1. Percent nested under another percent
  2. targetRef nested under replicas -- it's logically a layer above
  3. minCount implies that there's something dynamic (i.e. could be above that value), is that right?

I'd suggest:

capacity:
  targetRef: ...
  replicas: 1 # mutually exclusive with percent
  percent: 10 # mutually exclusive with replicas

Contributor Author

The tricky part with such a setup is that min and max apply only if the percent is used and not when a constant number of replicas is set - splitting it into a separate object clearly indicates that (though I agree that the double percent is weird). The min value is mostly for cases where users say that they want to have 10% of their workload, but at least 1 replica. This is dynamic, as it is a percent of the current size of the workload (for example a deployment).

We cannot put the replicas config directly under capacity, as this is an object with three options: replicas, resources and pods. The target ref was there originally and it was moved down based on your comments - we need to have the split and the "one of" choice somewhere.

If I am not getting something, please let me know and show how your proposal would look for other buffers (for example one pointing to a pod template).
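For context, a rough sketch of the nested structure being debated here; the nesting and field names are approximations of the proposal at this point in the review, not the final API:

```yaml
spec:
  replicaBasedBufferSpec:
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-deployment
    percent:                  # one-of: either this block...
      percent: 10             # ...hence the "percent under percent" stutter mentioned above
      minReplicas: 1          # min/max only make sense for the percent variant,
      maxReplicas: 20         # which is why they are nested under it
    # replicas: 3             # ...or a fixed replica count
```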

Member

I agree the current structure is a bit weird.

Would it make sense to get rid of minReplicas if the percent was always rounded up to full replicas? Do we really need maxReplicas? If we do, can limits be moved to top level (right now there is this one here and CPU/memory totals elsewhere)?

Contributor Author

Where at the top level? The min and max only make sense for a buffer defined as a percent of the workload. Which top level do you mean, and what would it mean there?

Contributor

Is there any way we can drop the min/max entirely? I assume the purpose of this is to address the following:

  • Referring to a deployment w/ 1 replica
  • Maintain a +10% buffer

Ergo we don't actuate the buffer controller to accommodate additional capacity for a new pod replica until the deployment replica count reaches 5 (assuming basic rounding up to the nearest int). Is this actually a problem in practice? In the case of a 10% Buffer that references a deployment pod spec that uses 1 core and hasn't yet replicated out to 5, we would still be guaranteeing node capacity for 100 millicores (assuming 1 replica).

tl;dr I'm not sure we really need the extra min/max. I understand that at small ints this might cause issues, but are we really solving for that use case? I think the canonical use case is the scale use case; or, if indeed there are deployment scaling use cases that we're solving for where the deployment replica flaps between small numbers, like 1 and 3, we can simply use a buffer with a high percent value like 100%.

Thoughts?

scheduling time will need to include the time of creating a new node and
attaching it to the cluster. Many users care deeply about pod scheduling time
and therefore would prefer to keep spare capacity in the cluster in order to
speed up start time of the newly created pods.


Do we want to talk not just about the headroom use-cases but also the baseline capacity use-cases here? I know that we are expecting to address both with the API below, so it might be nice to have that covered in the summary too.

Contributor Author

Baseline capacity is not covered here yet, as we removed, for now, the feature of having a buffer that is filled in by something. I will come back to it in another PR (updating this AEP) once this one is approved, so that we can start implementation of this initial, simple use case.

Comment on lines +25 to +33
* Today it is possible to create balloon pods and deployments, but they
require maintenance (sizing according to the pods user wants to run) and
with high numbers they can negatively impact the scheduling speed (as
the scheduler has to preempt the balloon first before scheduling the
actual pod). Due to the latter many customers run a balloon per machine,
but this requires adjusting the size to the machine per nodepool and
with custom compute classes (CCC) in GKE or karpenter pools the machine
size is not constant anymore.


It would be cool if, when we get a POC built out, we could show the scheduling performance difference between using the negative priority pods and the buffer overhead.

Contributor Author

What would you like to compare - scheduling time on empty capacity vs preempting pods (which can be done without any buffers - just a cluster with a min size where it has empty space or space is filled in with low priority pods) or something else that is buffer related?

but this requires adjusting the size to the machine per nodepool and
with custom compute classes (CCC) in GKE or karpenter pools the machine
size is not constant anymore.
* Introducing a new API allows us to reference other existing objects to


Can you give an example of this? Is your point that balloon pods are often just a representation of a deployment that has yet to be applied to the cluster and that we can simplify this by referencing the same object? It's not entirely clear to me how this API is going to help us with this significantly

Contributor Author

Today, to have a balloon pod/deployment you need to copy-paste the pod spec from the workload that you want to buffer for. If the original deployment changes you need to apply the same changes to the balloon. With a buffer you will be able to point to the deployment, and the size of the buffer will change automatically when the underlying deployment changes.

size is not constant anymore.
* Introducing a new API allows us to reference other existing objects to
simplify the configuration
* Having a separate object we can add convenience configuration options in


The counterpoint to this is that you could just orchestrate this through other mechanisms: CronJobs for scale-down of balloon pods for example. There is still definitely a value prop of less maintenance/being able to define all of this in one place, but these things aren't impossible to achieve without this API

Contributor Author

Yes, you are right - this feature was discussed (including whether it is worth introducing even in the buffer itself) here: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?disco=AAABjKBExBs. I think in general a big part of this API is making things more convenient and then building out from that (for example new buffer types like a buffer of stopped VMs).

the cluster based on the buffer spec that was provided. Also if for example
the target deployment changes it will be visible whether the change was
already picked up.
* Extendability: this design will allow for the users to either opt in the


I'm worried that we are trying to do this but aren't really sure how it would be handled concretely -- there isn't a strong use-case that I can think of for having divergent behavior on the pods that you generate from a given template -- I actually think that this makes things more confusing for a user who has to reason about what kind of output they will get based on the components that they install.

Contributor Author

I think that this split will also benefit OSS: with a separate controller we will be able to offer a guaranteed buffer, once reservations are offered in the scheduler, without any autoscaler.

About divergent behavior - with both the controller and the library solution we will provide a library to use. In the controller case this library will be used within the controller. In the end it will be the same library.


### Why not?

* Encapsulate the translation logic in a single controller that can be


Anytime I see the phrase "if the need arises", that means there isn't a need right now and therefore we are overshooting the mark with what we are doing for today. Since it's always possible to add this later and it's not possible to go back once it's added, I'm still strongly in favor of us starting with the library-based approach before we go with the status-based approach

Contributor Author

"If the need arises" meaning: if the user wants to implement their own custom logic, for example having a buffer for their own custom CRD.

with a user extended controller as long as it gets the list of buffer pod
specs from the buffer objects. This allows users to define buffers for their
custom CRDs without changing the autoscaler in any way
* Debuggability - splitting translation from provisioning of the capacity


This is the strongest argument to me for putting the result of the pods that we are going to provision for in the status -- now, I still think it won't be the easiest thing to read in the world because we're going to be putting a full pod spec into the status of something, but this is the most convincing thing to me

Contributor Author

In the end we will not be putting the spec there, to avoid pod spec embedding - the buffer will create a pod template that it will reference.

Additionally adding a feature to buffers (like scheduled buffers or a new buffer
type) would require again touching every single object.

## No controller


I think I've mentioned this in a few previous discussions that we have had, but I really really think that we need some concrete use-cases/reasons that we think the controller-based solution is significantly better than the no-controller solution. I can't think of strong extensions that would be useful to have the pod-based approach in this moment so I'd love to hear some if you are thinking of any

Contributor Author

I think that there are two strongest reasons for me:

  • split of the responsibility including resource quotas that are today totally outside of the autoscaling scope (as they are fully handled on admission)
  • once reservations are delivered in scheduler we will be able to offer buffers that are backed by scheduler reservations without even autoscaler involved.

Without a controller we are limiting people to only the buffers that are pulled in as part of their autoscaling solution, and I do not see a big gain from getting rid of the controller. Can you say what concerns you most about the separate controller (which can run as a subprocess of the autoscaler/Karpenter, not necessarily as a separate pod in the cluster)?

appropriate reservation underneath. Binding this logic to autoscaler does
not feel right.

Note that while some of these arguments are not very strong the cost of having a


There's a lot of arguments here around "the cost of doing this is low" -- agreed that the thing is easy to build probably, but it's not just about this -- there's a cognitive and maintenance cost to adding these things in and I'd rather take decisions that we can iterate on than decisions that are more one-way

Contributor Author

Why do you think that the cognitive cost to users is higher for a controller that runs as a subprocess in the autoscaler vs a library that is called within the autoscaler? I see one thing - the fact that the buffer is in multiple states. But even with the library the buffer can fail in multiple ways, so the user will need to understand that first the buffer is translated and only later is it provisioned.

@jbtk jbtk force-pushed the buffersproposal branch from cdec8ce to 9e3d518 Compare June 27, 2025 12:41

```
apiversion: autoscaling.x-k8s.io/v1alpha1
kind: NodeAutoscalingBuffer
```
Member

If we want to eventually support DRA, it will no longer be only about pods I think. Would something like WorkloadBuffer be more in line with the workload definition in https://docs.google.com/document/d/1pZaQ4q9nhENFFIy-WjUhb89WCgJ7ZPcrKx-PIU8rOqA/edit?tab=t.0#heading=h.ve4ahfbzr475?



We gathered feedback from the customers that maintenance of balloon pods and
sizing them is something that should be simplified. Additionally if we have a
buffer object we can add additional feature that are requested for example:
Member

I think this is also doable with CronJob balloon pods, so not really an argument against them.

Contributor Author

Removed. Though using a CronJob makes buffers/balloons hard to observe: you need to combine information from multiple objects to understand what was happening and why.

@jbtk jbtk force-pushed the buffersproposal branch 3 times, most recently from 4cfbebb to ca9853e Compare June 27, 2025 20:52
@towca
Collaborator

towca commented Jun 30, 2025


Based on the recent comments, there's another point of contention:

  1. Do we need the controller part for translating a Buffer Spec into the virtual Pods that autoscalers inject?
    • The alternative is to have the translation as part of Node autoscaler responsibilities, leveraging a shared library for the implementation.
    • A hard requirement for the translation is extensibility. Regardless of the solution, we need to be able to translate cloud-provider-specific objects on top of the shared part. We can do that both via a controller and a library with similar ease.
    • Using the controller approach we could move the translation logic out of Node autoscalers to reduce their overall scope. We could also have users run controllers for translating their own custom objects. I agree that both these points are pretty speculative, the first iteration wouldn't be utilizing either.
    • When we take quotas into account, the controller approach allows us to similarly not have the quota logic in Node autoscalers. This is out of scope for this first iteration, but it should be a fast follow and probably a requirement for beta - so I wouldn't call that part speculative. The quota logic is also likely to be more involved and have less affinity to Node autoscalers than the translation logic.
    • We've decided on referencing PodTemplate objects instead of inlining them on the advice of k8s API experts. So if we don't go with the controller approach but still want to have the pod spec in the status for observability, the shared library would have to create PodTemplate objects. This is pretty natural for a controller, but seems weird for such a library.
    • IMO it's pretty subjective which solution is "cleaner" or incurs less cognitive burden. I see the argument for "the proposal is less complex when we take the controller part out" (this was my first intuition as well), but on the other hand I definitely see the argument for "the simpler proposal without controller further bloats Node autoscaler scope, resulting in a more complex end-state". The ever-expanding scope is one of the biggest problems we see in CA, so delegating new not-strictly-autoscaling responsibilities to other components (even if it's just separating them under the same binary to start with) seems worthwhile.
    • IMO out of all the above, the quota part seems the most convincing. We'll have to implement quota logic soon, and I'd strongly prefer not to have to add it to CA directly.

@towca
Collaborator

towca commented Jun 30, 2025

On a separate note - IMO both this and previous CA/Karpenter API alignment attempts (e.g. kubernetes/kubernetes#124800) turned out to be surprisingly difficult. We seem to frequently discuss things in circles, which then either makes the proposal die if it's low priority (like safe-to-evict/do-not-disrupt), or makes it hard to judge alignment progress and plan implementation if it's high priority (like this one).

I wonder what we could change to improve the process, especially given that we're likely going to have increasingly more proposals of similar scope in the future. WDYT about doing some sort of retrospective about this proposal after we're done?

@jbtk jbtk force-pushed the buffersproposal branch from ca9853e to dd1bc1d Compare July 1, 2025 08:51
@jbtk
Contributor Author

jbtk commented Jul 1, 2025

@towca

For the format of ReplicaBufferCapacity - based on the discussion with @liggitt it seems that an int-or-string field is not the easiest to maintain over time, so I would prefer to stay with separate fields.

As for the structure: following a suggestion from @x13n I changed the names - percent to value, and replicas in ReplicaBasedCapacity to controllerBased - so that we do not have percent.percent or replicas.replicas fields.
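A loose sketch of what the structure might look like after these renames; only the field names value and controllerBased come from the comment above, and the rest of the nesting is an illustrative assumption:

```yaml
spec:
  controllerBased:            # was: replicas
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-deployment
    percent:
      value: 10               # was: percent.percent
      minReplicas: 1
      maxReplicas: 20
```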

@jbtk jbtk force-pushed the buffersproposal branch from dd1bc1d to 7aee04e Compare July 1, 2025 09:50
@jbtk
Contributor Author

jbtk commented Jul 9, 2025

For the controller discussion there is one more argument. For buffers targeting objects that implement the scale subresource, if we go with a library and no pod template in the status, we would be able to buffer capacity only if at least one pod is running in the cluster. This is a limitation for people who want buffers for jobs that run only during part of the day.

@ellistarn @towca @jonathan-innis as FYI

In case of the cluster autoscaler we would deploy the controller as an optional
subprocess:

![Figure](./images/buffers-component-structure.jpg)
Contributor

Would it be clearer to describe the buffers controller as

Cluster autoscaler
(buffers loop)

To emphasize that we aren't proposing a discrete controller (e.g., a node-autoscaling-buffer-controller-manager pod)
