Extensible/Pluggable data layer proposal #1023
Conversation
Hi @elevran. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
status check: does the proposal make sense? Can I start implementing based on the phases suggested?
/ok-to-test
taking a look
> pluggability effort. For example, extending the system to support the above
> should be through implementing well-defined Plugin interfaces and registering
> them in the GIE Data Layer subsystem; any configuration would be done in the
> same way (e.g., code and/or configuration file), etc.
> configuration file

Just to sanity check, we would extend the same API, correct?
Yes, that's the intent.
Any data layer plugin configuration would be done via the same API and configuration mechanism.
```go
// Data Sources (interface defined below) in the system.
// It is accompanied by functions (not shown) to register
// and retrieve sources.
type DataLayerSourcesRegistry map[string]DataSource
```
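Since the registration and retrieval functions are described as "not shown", here is a minimal sketch of what they might look like; the function names and duplicate-handling behavior are assumptions, not part of the proposal:

```go
package datalayer

import (
	"fmt"
	"sync"
)

// DataSource stands in for the interface defined in the proposal;
// it is left empty here only to keep the sketch self-contained.
type DataSource interface{}

// DataLayerSourcesRegistry is repeated from the excerpt above.
type DataLayerSourcesRegistry map[string]DataSource

var (
	mu      sync.RWMutex
	sources = DataLayerSourcesRegistry{}
)

// RegisterSource adds a named data source, rejecting duplicate names.
func RegisterSource(name string, src DataSource) error {
	mu.Lock()
	defer mu.Unlock()
	if _, exists := sources[name]; exists {
		return fmt.Errorf("data source %q already registered", name)
	}
	sources[name] = src
	return nil
}

// GetSource retrieves a previously registered data source by name.
func GetSource(name string) (DataSource, bool) {
	mu.RLock()
	defer mu.RUnlock()
	src, ok := sources[name]
	return src, ok
}
```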
Just thinking aloud. Should we generate a new `string` type here to be very concrete about valid DataLayerSources? i.e. `type DataSource string`
Sounds reasonable. What assurances would we want from the type?
We expect a new Data Layer source to define its "tag" when registered in the system, and the same name would be expected by Plugins (to access the extended data).
I can see using a hierarchical name (e.g., with a "DNS domain" prefix) to reduce the chance of collisions (not sure we'll reach that point...), so maybe something like k8s NamespacedName could be useful?
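To make the hierarchical-name idea concrete, a possible sketch; the `<domain>/<name>` convention and the `Validate` helper are illustrative assumptions:

```go
package datalayer

import (
	"fmt"
	"strings"
)

// DataSourceName identifies a data source in the registry. A distinct
// type (rather than a bare string) makes intent explicit at registration
// and lookup sites.
type DataSourceName string

// Validate enforces a hypothetical hierarchical "<domain>/<name>"
// convention (e.g., "example.com/lora-metrics") to reduce collisions.
func (n DataSourceName) Validate() error {
	domain, name, found := strings.Cut(string(n), "/")
	if !found || domain == "" || name == "" {
		return fmt.Errorf("data source name %q must be of the form <domain>/<name>", n)
	}
	return nil
}
```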
```go
// UpdateEndpoints replaces the set of pods/resources tracked by
// this source.
// Alternative: add/remove individual endpoints?
UpdateEndpoints(epIDs []string) error
```
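For comparison, the "add/remove individual endpoints" alternative mentioned in the comment could look roughly like this (names are assumptions):

```go
// Illustrative only: incremental membership updates instead of
// replacing the whole tracked set via UpdateEndpoints.
type EndpointMembership interface {
	AddEndpoint(epID string) error
	RemoveEndpoint(epID string) error
}
```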
Out of scope for this proposal since we are going to iterate, but should we always assume that data collection is for endpoints? That seems reasonable for now. I can't think of something explicit that would be collected pool-wide. Again, just open thoughts.
I've seen some discussion of using CRDs (to define LoRA loading and availability) or receiving scheduling hints from an external optimizer process (over k8s CRs or gRPC, possibly providing hints for which inference pool is optimal in case of multiple pools on different HW).
These would not be directly from endpoints or even attached to endpoints.
Having all of our data layer storage keyed off of endpoints would be a challenge for those use cases.
Not sure how to model that yet; we probably need to see the real use cases and refactor accordingly.
Some grammatical catches and some non-blocking open-ended questions. Overall looks good. Will stamp after the minor grammatical fixes are in. Thanks!
@kfswain thanks for the review and catching those typos.
> - (Future) Allow non-uniform data collection (i.e., not all endpoints share the
>   same data).
>
> ## Non-Goals
Worth clarifying as a non-goal: I assume we don't intend to introduce a new mechanism for scraping. That is, the one scraper thread per endpoint remains the common framework for collecting data from the endpoint; the extensibility we are introducing here is not meant to allow other ways to scrape, but to extend what can be scraped via this framework.
If I understand your statement correctly, then yes: the DataSource implementation would spawn a goroutine per endpoint. So we would have one goroutine per data source, per endpoint.
Or did you mean that the same goroutine would be used for all data sources operating on an endpoint?
I think the shared goroutine per endpoint is doable but adds complexity (i.e., since it runs each source in turn, delays/errors in one could impact other sources). It's a complexity vs. resource overheads tradeoff.
I was hoping for the latter, one goroutine per endpoint. But we can discuss during implementation.
Sounds good. Will implement one goroutine for all DataSources operating on an endpoint, and we can refine if needed during implementation.
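A rough sketch of the agreed model, one goroutine per endpoint driving all of its data sources in turn; the `Endpoint` and `DataSource` shapes here are stand-ins, not the final API:

```go
package datalayer

import (
	"context"
	"log"
	"time"
)

// Endpoint and DataSource are placeholder shapes; the real types are
// defined by the proposal and may differ.
type Endpoint string

type DataSource interface {
	Name() string
	Collect(ctx context.Context, ep Endpoint) error
}

// collectLoop is run once per endpoint. It drives every registered data
// source in turn, so an error in one source never aborts the others
// (though a slow source can delay them, the tradeoff discussed above).
func collectLoop(ctx context.Context, ep Endpoint, sources []DataSource, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, src := range sources {
				if err := src.Collect(ctx, ep); err != nil {
					// Log and continue: failures are isolated per source.
					log.Printf("data source %q failed for endpoint %s: %v", src.Name(), ep, err)
				}
			}
		}
	}
}
```

Under this model, the error-isolation concern raised above is handled inside the loop: a failing source is logged and skipped rather than aborting the pass over the remaining sources.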
```go
// Extract is called by data sources with (possibly) updated
// data per endpoint. Extracted attributes are added to the
// Endpoint.
Extract(ep Endpoint, data interface{}) error // or Collect?
```
Shouldn't this be a generic `Source` type instead of `Endpoint`? The collector would know what to cast it to based on the data source it registered for. This is to allow data sources other than endpoints.
A generic source could be more flexible - I'll think about that more.
However, I realize there may be some confusing naming on my part - let me try to clarify.
Regarding the specific interface function: `Extract()` receives relevant data from a source (e.g., the response to `GET /metrics`), extracts the needed attributes (e.g., the specific metrics), and then stores them for later use on `ep`.
The current design uses a shared storage layer, with standard data per endpoint (e.g., `scheduling/types.Pod` referencing `backendmetrics.MetricsState` and `backend.Pod`, IIUC). If we want to continue using this approach (and I'm suggesting we do, at least initially), the extended data would need to be stored in the same place (e.g., a new `backend.Extended` object or whatever we converge on).
Another option is for each new DataCollection to maintain its own internal storage scheme and only store "handles" (or tags) on the shared object (e.g., `backend.Pod`). The caller would use the handle (a data collection name and a tag) to retrieve the data from the relevant DataCollection. This adds flexibility but also complexity (more moving parts, a more complicated mental model, more indirection and calls). It could also make non-`Endpoint` `Source`s easier to support (Plugins would retrieve a reference to a DataCollection and get whatever data they need).
Thoughts?
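For concreteness, the "handles" alternative might look roughly like this; every name below is illustrative, not a committed API:

```go
package datalayer

// DataCollection owns its internal storage and is looked up by name.
type DataCollection interface {
	Name() string
	// Get returns data previously stored under tag.
	Get(tag string) (any, bool)
}

// Handle is what would be stored on the shared object (e.g., backend.Pod):
// just enough to locate the data, not the data itself.
type Handle struct {
	Collection string // DataCollection name
	Tag        string // key within that collection
}

// Resolve follows a handle through a registry of collections.
func Resolve(registry map[string]DataCollection, h Handle) (any, bool) {
	c, ok := registry[h.Collection]
	if !ok {
		return nil, false
	}
	return c.Get(h.Tag)
}
```

This is the indirection cost mentioned above: every consumer performs a registry lookup plus a collection lookup, instead of reading a field directly off the shared endpoint object.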
I am OK to start with endpoint-focused interfaces and adjust them in a follow-up as we learn what the implementation will look like. My sense is that to support data source types other than Endpoints we will need to change this.
+1. I think there is agreement on the general direction. I'm in favor of starting to implement and iterating as needed.
Same here - I share the hesitation with `Endpoint`, but I think it's fine for the initial implementation.
/lgtm
/approve
Unhold at your leisure, thanks!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: elevran, kfswain. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/unhold
* initial proposal content Signed-off-by: Etai Lev Ran <elevran@gmail.com> * renamed based on assigned PR Signed-off-by: Etai Lev Ran <elevran@gmail.com> * Plugin suffix unused - removing open Signed-off-by: Etai Lev Ran <elevran@gmail.com> * address grammatical errors from review Signed-off-by: Etai Lev Ran <elevran@gmail.com> --------- Signed-off-by: Etai Lev Ran <elevran@gmail.com>
- Explicitly include change to scraping mode as non-goal (feedback on kubernetes-sigs#1023) - Spelling errors Signed-off-by: Etai Lev Ran <elevran@gmail.com>
This proposal outlines a plan for making the EPP data layer extensible, allowing custom attribute collection and storage for specific use cases.
Many extensions to scheduling will require changes to ingested metrics and attributes. As such, the data layer should be built to be extended, supporting bespoke use cases and experimentation.
The gist of the proposal is to enable configurable data sources and data collections (from existing or new sources) with no code changes to core components.
Ref #703
Follow up on https://docs.google.com/document/d/1eCCuyB_VW08ik_jqPC1__z6FzeWO_VOlPDUpN85g9Ww/edit?usp=sharing
@ahg-g @kfswain @nirrozenbaum @liu-cong