Create a proposal for Buffers API. #8151
Conversation
The initial discussion about this proposal was on a doc: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?tab=t.0 (there are still unresolved conversations there)
Summary of open topics:
/assign @jackfrancis
@jackfrancis - adding you here because I know this is related to things you've thought about in the past with regard to pre-provisioning capacity.
Update:
The overall idea and direction look good to me, but IMO we should clarify some fine details in the proposal.
@jackfrancis @gjtempleton Could you give the proposal a high-level review?
@jonathan-innis Could you review the proposal? It's translated from the doc we've already discussed.
If the user uses autoscaling the cluster size will be adjusted to the number of currently running pods. This means that often when a new pod is created its scheduling time will need to include the time of creating a new node and attaching it to the cluster. Many users care deeply about pod scheduling time
There are a nontrivial number of writes (events, etc.) associated with evicting, deleting, creating, and scheduling pods. etcd is pretty sensitive to writes -- I suspect there are pretty huge perf benefits to this if you were to measure it.
* User pods can use the available capacity without any configuration changes (strictly additive, can be managed by the user who owns the pod as well as cluster administrators)
* Open possibility for cloud providers to offer different types of buffers if
I'm struggling with this use case. Abstraction/extensibility often comes at the cost of complexity. I'd be wary of expanding to this scope unless there was a very concrete use case that could not be met in a vendor agnostic way.
As part of the proposal I am adding a buffer type field (string). It will allow offering different types of buffers, both vendor-agnostic and vendor-specific.
Ideas that are floating around:
- with the resource hotplug proposal we could provide a buffer that a node can be resized into, so that new capacity comes up faster (KEP: KEP-3953: Node Resource Hot Plug enhancements#3955)
- offering a buffer of stopped nodes that can be restarted faster than creating a new node.
I do not expect any more complexity to be added, as the complexity would live in the controller and autoscaler and can be vendor- or autoscaler-specific.
I do not expect that there will be any more complexity added as the complexity would be in the controller and autoscaler and can be vendor or autoscaler specific.
It's less about complexity on the implementation and more about the complexity and cognitive burden of the different "types" of buffers on the user -- if I have multiple buffer types that orchestrate buffers in completely different ways but are using the same GVK, I feel like that's asking for users to get really confused about what's happening where and why.
At that point, you're better off just creating a separate API yourself. The other option here is that we just allow users to convert the API into pods however they want and they can just plug on to the buffer using annotations. As we get more concrete use-cases for what folks are adding annotations for, we can start to add features.
- Actually I think it is a good thing that there is a single API surface to manage different types of buffers, so the user can test and compare how different buffers behave. Example: I have a workload; I can set up an active buffer or a stopped-VMs buffer. With a single field change it is possible to flip the buffer type and compare performance and cost. Yes, this means that this API needs to be flexible, but since it comes down to pod-like chunks of capacity it should be relatively easy to translate it to any kind of buffer.
- About users getting confused: yes, buffers may be able to provide various buffer types in the future, but there will be a single surface to view all the possible overprovisioning mechanisms.
I think that annotations are a good idea if someone adds a new buffer type and the API is not enough for them. I would want to avoid a situation, though, where we create something the user did not intend because an annotation failed to parse (for example because an older version of the buffer controller did not support it). Therefore I think that adding a field for type ensures that we can fail fast in case the configuration tries to accomplish something that is not necessarily possible.
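To make the direction concrete, here is a minimal sketch of what an explicit buffer type could look like in a manifest (the type field and its values are illustrative only; nothing here is settled API):

```yaml
apiVersion: autoscaling.x-k8s.io/v1alpha1
kind: NodeAutoscalingBuffer
metadata:
  name: fast-scale-buffer
  namespace: my-namespace
spec:
  # Hypothetical field: "active" capacity today; providers could later add
  # other values such as "stopped-vms" without changing the rest of the spec.
  type: active
  replicaBasedBufferSpec:
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-deployment
```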
additional pods that are representing a buffer (for example treating them as virtual pods) so that the capacity in the cluster becomes available.

### How do we determine the pod spec for targets using scale subresource?
Why not use a pod template spec inline? Imagine the case where you want to leave room for a CI job, but the job doesn't exist in the cluster yet, so you have nothing to reference.
The proposal has both the option to reference an object that implements the scale subresource and the option to reference a pod template. The user will be able to choose one or the other.
Note that referencing a pod template is tricky. Not every controller requires the user to define one (a deployment, for example, just has it embedded - copy-pasting is prone to errors coming from the fact that one changes and not the other) and some pods can change over time (for example if VPA is used and scales the pods).
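A minimal sketch of the two mutually exclusive ways to describe the buffered pod shape mentioned above (exact field names, including podTemplateRef, are assumptions rather than final API):

```yaml
# Option 1: point at an object implementing the scale subresource;
# the pod shape is derived from the target's own template.
spec:
  replicaBasedBufferSpec:
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-deployment
---
# Option 2: point at a PodTemplate directly, e.g. for a CI job that
# does not exist in the cluster yet.
spec:
  replicaBasedBufferSpec:
    podTemplateRef:
      name: ci-job-template
```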
require more implementation work within the cluster autoscaler in order to support virtual pods with DRA resources.
* More features that further help users to manage the buffers:
  * Buffer schedule - option to define a buffer that is active only during
If Buffer supported the scale subresource, you could hook it up to things like https://keda.sh/docs/2.11/scalers/cron/ to achieve this use case.
This is a potential feature for the future. On the original doc there is a discussion about this: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?disco=AAABjKBExBs
Once I implement buffers I will look into this more: how hard it would be to have it using just existing tools vs a dedicated field. Thanks for the link to the tool - I will check it out besides testing with just k8s CronJobs.
Also, implementing the scale subresource would mean that we could use things like HPA to adjust the size of the buffer, which could be useful as well.
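If the Buffer ever implemented the scale subresource, a KEDA cron ScaledObject could drive it along these lines (purely a sketch: scale support on NodeAutoscalingBuffer is hypothetical, while the cron trigger fields follow the KEDA docs linked above):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: business-hours-buffer
spec:
  scaleTargetRef:
    apiVersion: autoscaling.x-k8s.io/v1alpha1  # assumes Buffer exposes /scale
    kind: NodeAutoscalingBuffer
    name: my-deployment-buffer
  triggers:
    - type: cron
      metadata:
        timezone: Europe/Warsaw
        start: 0 6 * * 1-5        # grow the buffer at 06:00 on weekdays
        end: 0 20 * * 1-5         # shrink it again at 20:00
        desiredReplicas: "10"
```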
// Depending on the type of referenced object one of the configurations below will be used:
// - for a crd implementing scale subresource (like deployment, job, replicaset)
// the ReplicaBasedBufferSpec will be used
// - for GKE CCC or Karpenter NodePool the NodeClassBufferSpec will be used
For node-centric headroom, I suspect that this will be implemented in the NodePool API in Karpenter, and not part of this buffer API.
I am not sure what the plans of the Karpenter team are. We talked with them about this proposal; they commented on the original doc (https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?pli=1&disco=AAABjKBEw_Q).
There is also a separate section on why we do not plan to expose an API of "I just want to have 2 extra nodes", and most likely that would make sense in the context of the NodePool API.
Just to clarify:
- Buffers are focusing on the capacity in the cluster; we want to offer users a single API surface to define spare capacity - whether it is based on a single workload or shared by workloads that run on a single CCC/Karpenter pool.
- In GKE, CCC focuses on setting up a policy/configuration. This is why we wanted to have a separate, provider-agnostic way of expressing extra capacity that should be provisioned. I am not sure about the Karpenter node pool philosophy.
- Because CCC is a policy and the node class buffer (tbd the name ;P) is a way to have a shared buffer, they may have different access control. Also, there may be many workloads on a single pool, and it may be much more convenient to specify two separate buffers of different chunk shapes rather than a single buffer, which would make more sense from the NodePool API perspective.
@towca as FYI
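As a very rough illustration of the "shared buffer of pod-shaped chunks on a pool" idea above (everything below, including the nodeClassBufferSpec shape and the chunk fields, is a hypothetical sketch, not proposed API):

```yaml
apiVersion: autoscaling.x-k8s.io/v1alpha1
kind: NodeAutoscalingBuffer
metadata:
  name: shared-pool-buffer
spec:
  nodeClassBufferSpec:            # name taken from the quoted comment; exact shape TBD
    targetRef:
      apiVersion: karpenter.sh/v1
      kind: NodePool
      name: general-purpose
    # Hypothetical: reserve room for 4 pod-shaped chunks of 2 CPU / 8Gi each.
    # A second buffer with a different chunk shape could target the same pool.
    chunks: 4
    chunkShape:
      requests:
        cpu: "2"
        memory: 8Gi
```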
Thanks @jbtk, I'll review again soon when I get the chance.
name: my-deployment-buffer
namespace: my-namespace
spec:
replicaBasedBufferSpec:
I'd love to spend some time refining these API specs. I think there are some good opportunities to simplify structure and come up with better naming. Do we think this KEP is the right forum for these discussions, or should I wait for the PR w/ the actual go structs?
For example:
We stutter spec and replicas with spec.replicaBasedBufferSpec.replicasCountPercentage.
We could combine replicasCountPercentage and minCount into:
  replicas: 10%  # IntOrString
This was the original proposal, but I got feedback that a field accepting two kinds of values is not the best, especially when another field works with one option and not the other. This embedding was proposed during the review here: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?disco=AAABjQp5l44
I don't personally have a strong preference here, IMO we should follow Kubernetes API best practices. Do we know of any other similar fields in the K8s API?
@jackfrancis @gjtempleton WDYT?
@jbtk, thanks for taking my earlier feedback to heart. I like the direction this is going. I'm not quite grokking the current state of the proposal, as the go structs do not match the yaml specs. Could you make these agree and I'll do another pass? In the meantime, I have a few thoughts:
It strikes me as much more powerful to use
@ellistarn responding
Thanks @jbtk, for taking my earlier feedback to heart.
For 1, I get your point about the potential confusion. Any thoughts on my suggested name, NodeReservation? This one especially makes sense to me if we decide to support node hibernation with this API. It might also be worth exploring a "Pod" prefix (e.g. PodBuffer, PodReservation) for these names to help clarify the subject.
For 2, apologies -- this thought was half baked and the last sentence incomplete. My thought was that arbitrary targets might require customized RBAC, but the design already assumes that each provider will have a separate implementation w/ separate RBAC, so I think this is solved.
For 3/4, I think these can be summarized as "speculative features that try to address future extensibility". This is philosophical, so I can understand if you feel differently, but I'd recommend that we delay the addition of these additional fields/features until the use case is concrete. For example, if CAS/GKE/Karpenter need targetRef to support something that cannot be easily supported with NodeSelector, let's add a follow-on that builds the extensibility and maps it directly to the use case. In my experience, building speculative extensibility carries the risk that the feature will not fully meet the future requirements once they're known. This can lead to awkward/unnecessary fields that are confusing to customers. We can easily add new features to APIs, but removing from APIs is much more costly if not impossible. It's valuable to think of potential extension points (e.g. our discussion on fulfilledBy), even if we don't implement them immediately.
Since 3 and 4 are a matter of opinion, it would be great for others to chime in. @towca - are you the right person, or is there someone else who may have an opinion here?
@jbtk thanks for addressing my comments! I feel like we're very close to alignment, let's finalize this so that we can start implementing something. IIUC these are the contentious parts that are preventing us from approving the proposal:
A lot of this is subjective so it'd be good to hear more opinions. @gjtempleton @jackfrancis @x13n could you weigh in on the points listed above?
Yesterday I met with @liggitt to discuss embedding the pod spec. Following his advice I decided to get rid of embedding; instead, in the buffer status we will use a pod template and save its generation number to ensure that the template was not modified by the user. This way we will be able to ensure that we do not overprovision if we reach resource quota limits. The pod template will have the owner reference set, so when the buffer is deleted it will be deleted as well. I also used the opportunity to consult on the IntOrString type. @liggitt told me that in newer APIs we try to avoid these multipurpose fields as they make processing harder (harder to do validation, and some fields work with one value but not the other). Taking this into account I would rather leave these fields as they were.
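A sketch of what such a status could look like (field names loosely follow the Go types quoted below; the condition type and the reference field are illustrative):

```yaml
status:
  podBufferCapacity:
    replicas: 3
    podTemplateRef:
      name: my-deployment-buffer-template   # PodTemplate owned by the buffer via ownerReference
  # Generation of the referenced PodTemplate observed at translation time;
  # a mismatch indicates the template was modified outside the controller.
  podTemplateGeneration: 2
  conditions:
    - type: ScalingUp              # illustrative condition name
      status: "True"
```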
type NodeAutoscalingBufferStatus struct {
	PodBufferCapacity PodBufferCapacity
	PodTemplateGeneration int
Why isn't generation inside PodBufferCapacity?
The generation is relevant for the status only. I do not see any good use case for giving the user the option to define the generation in the config, and therefore I did not want to increase the number of fields that the user may set (and therefore needs to understand).
PodBufferCapacity PodBufferCapacity
PodTemplateGeneration int
Conditions []BufferCondition
LastFullyProvisioned *metav1.Time
How does this timestamp differ from the timestamps in .status.conditions? Usually time-related information is available there.
We will not have a condition for a fully provisioned buffer. We only have a condition for the fact that the cluster is scaling for this buffer.
Why?
We could have an additional condition for fully provisioned, but this information is very volatile - it is enough for a single pod to get scheduled in this space for it to be stale (and we do not have a good way to set this condition to false right away). Therefore I did not want to create a condition that looks like a way to check whether a buffer is fully provisioned when in fact only the timestamp there is useful, and decided to go with just the timestamp field.
And how do you want to keep LastFullyProvisioned up to date? Will it be updated every second if there are no pods scheduled into the buffer? With a condition it is actually clear: condition FullyProvisioned: true with last transition timestamp T1 would mean that as of T1 the buffer was fully provisioned. It can later change to FullyProvisioned: false with timestamp T2, meaning as of T2 the buffer was not fully provisioned (and either the cluster was scaling or there was some reason on the condition explaining why scaling was not happening as of T2).
If I were the user and could create an if condition on FullyProvisioned, it is very likely that I would totally forget to check the timestamp. It just feels very misleading to offer something that we know can become stale very easily. But maybe k8s users are used to conditions like this and I am being pessimistic here.
On the other hand you are correct that I did not think about how often we would update the buffers in a cluster with low churn and the buffers already set up. I am removing this field and will reconsider when we implement metrics and monitoring after the alpha launch.
apiVersion: apps/v1
kind: Deployment
name: my-deployment
percent:
There are a few unusual things about this structure to me.
- Percent nested under another percent
- targetRef nested under replicas -- it's logically a layer above
- minCount implies that there's something dynamic (i.e. could be above that value), is that right?
I'd suggest:
  capacity:
    targetRef: ...
    replicas: 1 # mutually exclusive with percent
    percent: 10 # mutually exclusive with replicas
The tricky part with such a setup is that min and max apply only if the percent is used and not when a constant number of replicas is set - splitting it into a separate object clearly indicates that (though I agree that the double percent is weird). The min value is mostly for cases where users say that they want to have 10% of their workload, but at least 1 replica. This is dynamic, as it is a percent of the current size of the workload (for example a deployment).
We cannot put the replicas config directly under capacity, as capacity is an object with three options: replicas, resources and pods. The target ref was there originally and was moved down based on your comments - we need to have the split and "one of" choice somewhere.
If I am not getting something, please let me know and show how your proposal would look for other buffers (for example one pointing to a pod template).
I agree the current structure is a bit weird.
Would it make sense to get rid of minReplicas if the percent was always rounded up to full replicas? Do we really need maxReplicas? If we do, can limits be moved to the top level (right now there is this one here and CPU/memory totals elsewhere)?
Where at the top level? The min and max only make sense for a buffer defined as a percent of the workload. Which top level do you mean, and what would it mean there?
Is there any way we can drop the min/max entirely? I assume the purpose of this is to address the following:
- Referring to a deployment w/ 1 replica
- Maintain a +10% buffer
Ergo we don't actuate the buffer controller to accommodate additional capacity for a new pod replica until the deployment replica count reaches 5 (assuming basic rounding up to the nearest int). Is this actually a problem in practice? In the case of a 10% Buffer that references a deployment pod spec that uses 1 core and hasn't yet replicated out to 5, we would still be guaranteeing node capacity for 100 millicores (assuming 1 replica).
tl;dr I'm not sure we really need the extra min/max. I understand that at small ints this might cause issues, but are we really solving for that use case? I think the canonical use case is the scale use case; or, if indeed there are deployment scaling use cases that we're solving for where the deployment replica count flaps between small numbers, like 1 and 3, we can simply use a buffer with a high percent value like 100%.
Thoughts?
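A small worked example of the percent-plus-minimum semantics under discussion (field names follow the earlier YAML in this thread and are illustrative, not final):

```yaml
spec:
  replicaBasedBufferSpec:
    targetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: my-deployment   # e.g. currently running 12 replicas
    percent:
      value: 10             # 10% of the current replica count (1.2 pods' worth here)
      minCount: 1           # never fewer than 1 buffered replica, even if the percentage rounds to 0
```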
scheduling time will need to include the time of creating a new node and attaching it to the cluster. Many users care deeply about pod scheduling time and therefore would prefer to keep spare capacity in the cluster in order to speed up start time of the newly created pods.
Do we want to not just talk about the headroom use-cases but also the baseline capacity use-cases here? I know that we are expecting to address both with the API below, so it might be nice to have that covered in the summary too.
Baseline capacity is not covered here yet, as we removed the feature of having a buffer that is filled in by something for now. I will come back to it in another PR (updating this AEP) once this one is approved, so that we can start implementation of this initial, simple use case.
* Today it is possible to create balloon pods and deployments, but they require maintenance (sizing according to the pods the user wants to run) and with high numbers they can negatively impact the scheduling speed (as the scheduler has to preempt the balloon first before scheduling the actual pod). Due to the latter many customers run a balloon per machine, but this requires adjusting the size to the machine per nodepool and with custom compute classes (CCC) in GKE or karpenter pools the machine size is not constant anymore.
It would be cool if, when we get a POC built out, we could show the scheduling performance difference between using the negative-priority pods and the buffer overhead.
What would you like to compare - scheduling time on empty capacity vs preempting pods (which can be done without any buffers - just a cluster with a min size where it has empty space, or space filled in with low priority pods), or something else that is buffer related?
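For context, the balloon pattern being compared against is usually a negative-priority PriorityClass plus a placeholder deployment, roughly like this (names and sizes are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon
value: -10                       # lower than real workloads, so balloons are preempted first
preemptionPolicy: Never          # balloons never preempt anything themselves
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon
spec:
  replicas: 3                    # has to be re-sized by hand as the buffered workload changes
  selector:
    matchLabels: {app: balloon}
  template:
    metadata:
      labels: {app: balloon}
    spec:
      priorityClassName: balloon
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests: {cpu: "1", memory: 2Gi}   # must mirror the real pod's requests
```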
but this requires adjusting the size to the machine per nodepool and with custom compute classes (CCC) in GKE or karpenter pools the machine size is not constant anymore.
* Introducing a new API allows us to reference other existing objects to
Can you give an example of this? Is your point that balloon pods are often just a representation of a deployment that has yet to be applied to the cluster and that we can simplify this by referencing the same object? It's not entirely clear to me how this API is going to help us with this significantly
Today to have a balloon pod/deployment you need to copy-paste the pod spec from the workload that you want to buffer for. If the original deployment changes you need to apply the same changes to the balloon. With a buffer you will be able to point to the deployment, and the size of the buffer will change automatically when the underlying deployment changes.
size is not constant anymore.
* Introducing a new API allows us to reference other existing objects to simplify the configuration
* Having a separate object we can add convenience configuration options in
The counterpoint to this is that you could just orchestrate this through other mechanisms: CronJobs for scale-down of balloon pods for example. There is still definitely a value prop of less maintenance/being able to define all of this in one place, but these things aren't impossible to achieve without this API
Yes, you are right - this feature was discussed (whether it is worth introducing even in the buffer itself) here: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?disco=AAABjKBExBs. I think in general a big part of this API is making things more convenient and then building out from that (for example new buffer types, like a buffer of stopped VMs).
the cluster based on the buffer spec that was provided. Also if for example the target deployment changes it will be visible whether the change was already picked up.
* Extendability: this design will allow for the users to either opt in the
I'm worried that we are trying to do this but aren't really sure how this would be handled concretely -- there isn't a strong use-case that I can think of for having divergent behavior on the pods that you gen from a given template -- I actually think that this makes things more confusing for a user who has to reason about what kind of output that they will get based on the components that they install
I think that this split will also benefit OSS: with a separate controller we will be able to offer a guaranteed buffer, once reservations are offered in the scheduler, without any autoscaler.
About divergent behavior - both with a controller and with a library solution we will provide a library to use. In the controller case this library will be used within the controller. In the end it will be the same library.
### Why not?

* Encapsulate the translation logic in a single controller that can be
Anytime I see the phrase "if the need arises", that means there isn't a need right now and therefore we are overshooting the mark with what we are doing for today. Since it's always possible to add this later and it's not possible to go back once it's added, I'm still strongly in favor of us starting with the library-based approach before we go with the status-based approach
"If the need arises" means: if the user wants to implement their own custom logic, for example having a buffer for their own custom CRD.
with a user extended controller as long as it gets the list of buffer pod specs from the buffer objects. This allows users to define buffers for their custom CRDs without changing the autoscaler in any way
* Debuggability - splitting translation from provisioning of the capacity
This is the strongest argument to me for putting the result of the pods that we are going to provision for in the status -- now, I still think it won't be the easiest thing to read in the world because we're going to be putting a full pod spec into the status of something, but this is the most convincing thing to me
In the end we will not be putting the spec in the status, to avoid pod spec embedding - the buffer will create a pod template that it will reference.
Additionally adding a feature to buffers (like scheduled buffers or a new buffer type) would require again touching every single object.

## No controller
I think I've mentioned this in a few previous discussions that we have had, but I really really think that we need some concrete use-cases/reasons that we think the controller-based solution is significantly better than the no-controller solution. I can't think of strong extensions that would be useful to have the pod-based approach in this moment so I'd love to hear some if you are thinking of any
I think the two strongest reasons for me are:
- the split of responsibility, including resource quotas that are today totally outside of the autoscaling scope (as they are fully handled on admission)
- once reservations are delivered in the scheduler we will be able to offer buffers that are backed by scheduler reservations without even the autoscaler involved.
Without a controller we are limiting people to only buffers that are pulled in as a part of their autoscaling solution, and I do not see a big gain from getting rid of the controller. Can you say what concerns you the most about the separate controller (which can run as a subprocess of the autoscaler/Karpenter, not necessarily as a separate pod in the cluster)?
appropriate reservation underneath. Binding this logic to autoscaler does not feel right.

Note that while some of these arguments are not very strong the cost of having a
There's a lot of arguments here around "the cost of doing this is low" -- agreed that the thing is easy to build probably, but it's not just about this -- there's a cognitive and maintenance cost to adding these things in and I'd rather take decisions that we can iterate on than decisions that are more one-way
Why do you think the cognitive cost to users is higher for a controller that is a subprocess in the autoscaler vs a library that is called within the autoscaler? I see one thing - the fact that the buffer is in multiple states. But even with the library the buffer can fail in multiple ways, so the user will need to understand that first the buffer is translated and only later it is provisioned.
```
apiversion: autoscaling.x-k8s.io/v1alpha1
kind: NodeAutoscalingBuffer
If we want to eventually support DRA, it will no longer be only about pods I think. Would something like WorkloadBuffer be more in line with the workload definition in https://docs.google.com/document/d/1pZaQ4q9nhENFFIy-WjUhb89WCgJ7ZPcrKx-PIU8rOqA/edit?tab=t.0#heading=h.ve4ahfbzr475?
We gathered feedback from the customers that maintenance of balloon pods and sizing them is something that should be simplified. Additionally, if we have a buffer object we can add additional features that are requested, for example:
I think this is also doable with CronJob balloon pods, so not really an argument against them.
Removed. Though using a CronJob makes buffers/balloons hard to observe - you need to combine information from multiple objects to understand what was happening and why.
Based on the recent comments, there's another point of contention:
On a separate note - IMO both this and previous CA/Karpenter API alignment attempts (e.g. kubernetes/kubernetes#124800) turned out to be surprisingly difficult. We seem to frequently discuss things in circles, which then either makes the proposal die if it's low priority (like safe-to-evict/do-not-disrupt), or makes it hard to judge alignment progress and plan implementation if it's high priority (like this one). I wonder what we could change to improve the process, especially given that we're likely going to have increasingly more proposals of similar scope in the future. WDYT about doing some sort of retrospective about this proposal after we're done?
For the format of the ReplicaBufferCapacity - based on the discussion with @liggitt it seems that an int-or-string field is not the easiest to maintain over time, so I would prefer to stay with separate fields. As for the structure: following a suggestion from @x13n I changed the names - percent to value, and in ReplicaBasedCapacity replicas to controllerBased - so that we do not have percent.percent or replica.replica fields.
For the controller discussion there is one more argument. For buffers for objects implementing the scale subresource, if we go with a library and no pod template in the status, we would be able to buffer capacity only if at least one pod is running in the cluster. This is a limitation for people who want buffers for jobs that run only during some part of the day. @ellistarn @towca @jonathan-innis as FYI
In case of the cluster autoscaler we would deploy the controller as an optional subprocess:


Would it be clearer to describe the buffers controller as "Cluster autoscaler (buffers loop)"? To emphasize that we aren't proposing a discrete controller (e.g., a node-autoscaling-buffer-controller-manager pod).
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Adds proposal for Buffers API
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: