Extensible/Pluggable data layer proposal #1023


Merged

Conversation

elevran
Contributor

@elevran elevran commented Jun 19, 2025

This proposal outlines the plan for making the EPP data layer extensible and allowing custom attribute collection and storage for specific use cases.

Many extensions to scheduling will require changes to ingested metrics and attributes. As such, the data layer should be built to be extended to support bespoke use cases and experimentation.
The gist of the proposal is to enable configurable data sources and data collections (from existing or new sources) with no code changes to core components.

Ref #703

Follow-up on https://docs.google.com/document/d/1eCCuyB_VW08ik_jqPC1__z6FzeWO_VOlPDUpN85g9Ww/edit?usp=sharing

@ahg-g @kfswain @nirrozenbaum @liu-cong

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

netlify bot commented Jun 19, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 52ca32d
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/685c0be02c7e9300085971bb
😎 Deploy Preview https://deploy-preview-1023--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and Jeffwan June 19, 2025 18:02
@k8s-ci-robot
Contributor

Hi @elevran. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 19, 2025
Signed-off-by: Etai Lev Ran <elevran@gmail.com>
Signed-off-by: Etai Lev Ran <elevran@gmail.com>
@elevran
Contributor Author

elevran commented Jun 24, 2025

status check: does the proposal make sense? Can I start implementing based on the phases suggested?

@nirrozenbaum
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 24, 2025
@kfswain
Collaborator

kfswain commented Jun 24, 2025

status check: does the proposal make sense? Can I start implementing based on the phases suggested?

taking a look

pluggability effort. For example, extending the system to support the above
should be through implementing well defined Plugin interfaces and registering
them in the GIE Data Layer subsystem; any configuration would be done in the
same way (e.g., code and/or configuration file), etc.
Collaborator

configuration file

Just to sanity check, we would extend the same API, correct?

Contributor Author

Yes, that's the intent.
Any data layer plugin configuration would be done via the same API and configuration mechanism.
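To make the "same API and configuration mechanism" point concrete, here is a purely illustrative sketch of how a data layer plugin might be declared alongside other plugins in the EPP configuration file. The plugin type name and its parameters below are hypothetical, not part of the existing schema:

```yaml
# Hypothetical sketch only: the plugin type and parameter names are
# illustrative, not the actual EPP configuration schema.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: custom-metrics-datasource   # hypothetical data layer plugin
    parameters:
      endpointPath: /metrics
      scrapeInterval: 5s
```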

// Data Sources (interface defined below) in the system.
// It is accompanied by functions (not shown) to register
// and retrieve sources
type DataLayerSourcesRegistry map[string]DataSource
Collaborator

Just thinking aloud. Should we generate a new string type here to be very concrete about valid DataLayerSources? i.e. type DataSource string

Contributor Author

Sounds reasonable. What assurances would we want from the type?

We expect a new data layer source to define its "tag" when registered in the system, and the same name would be expected by Plugins (to access the extended data).
I can see using hierarchical names (e.g., with "DNS domain" prefixes) to reduce the chance of collisions (not sure we'll reach that point...), so maybe something like the k8s NamespacedName could be useful?
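As a minimal sketch of the idea discussed above (a typed, hierarchical source name plus a registry that rejects collisions), assuming illustrative names that are not the proposal's final API:

```go
package main

import "fmt"

// DataSourceName is a hypothetical typed name for data sources.
// A hierarchical, DNS-domain style prefix reduces the chance of
// collisions between independently developed sources.
type DataSourceName string

// DataSource is a minimal stand-in for the proposal's interface.
type DataSource interface {
	Name() DataSourceName
}

// DataLayerSourcesRegistry maps names to registered sources,
// mirroring the proposal's map type but keyed by the typed name.
type DataLayerSourcesRegistry map[DataSourceName]DataSource

// Register adds a source, rejecting duplicate names so that two
// extensions cannot silently shadow each other.
func (r DataLayerSourcesRegistry) Register(s DataSource) error {
	if _, ok := r[s.Name()]; ok {
		return fmt.Errorf("data source %q already registered", s.Name())
	}
	r[s.Name()] = s
	return nil
}

// metricsSource is a toy source using a DNS-domain prefixed name.
type metricsSource struct{}

func (metricsSource) Name() DataSourceName { return "example.com/vllm-metrics" }

func main() {
	reg := DataLayerSourcesRegistry{}
	fmt.Println(reg.Register(metricsSource{}) == nil) // first registration succeeds
	fmt.Println(reg.Register(metricsSource{}) == nil) // duplicate is rejected
}
```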

// UpdateEndpoints replaces the set of pods/resources tracked by
// this source.
// Alternative: add/remove individual endpoints?
UpdateEndpoints(epIDs []string) error
Collaborator
Out of scope for this proposal since we are going to iterate, but should we always assume that data collection is for endpoints? That seems reasonable for now; I can't think of something explicit that would be collected pool-wide. Again, just open thoughts.

Contributor Author

I've seen some discussion of using CRDs (to define LoRA loading and availability) or receiving scheduling hints from an external optimizer process (over k8s CRs or gRPC, possibly providing hints for which inference pool is optimal in the case of multiple pools on different HW).
These would not come directly from endpoints or even be attached to endpoints.

Having all of our data layer storage keyed off of endpoints would be a challenge for those use cases.
Not sure how to model that yet; we probably need to see the real use cases and refactor accordingly.
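As a sketch of the replace-the-set semantics in the quoted `UpdateEndpoints` snippet (the type and print statements below are illustrative stand-ins, not the proposal's implementation), a source can diff the new set against its current one to decide which per-endpoint collectors to start and stop:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeSource is a hypothetical endpoint-scoped data source that
// tracks the set of endpoint IDs it collects from.
type fakeSource struct {
	mu        sync.Mutex
	endpoints map[string]struct{}
}

// UpdateEndpoints replaces the tracked set, mirroring the proposed
// interface. Diffing against the old set tells the source which
// collectors to start and which to stop.
func (s *fakeSource) UpdateEndpoints(epIDs []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	next := make(map[string]struct{}, len(epIDs))
	for _, id := range epIDs {
		next[id] = struct{}{}
		if _, ok := s.endpoints[id]; !ok {
			fmt.Println("start collecting:", id) // e.g., spawn a collector goroutine
		}
	}
	for id := range s.endpoints {
		if _, ok := next[id]; !ok {
			fmt.Println("stop collecting:", id) // e.g., cancel its context
		}
	}
	s.endpoints = next
	return nil
}

func main() {
	s := &fakeSource{endpoints: map[string]struct{}{}}
	_ = s.UpdateEndpoints([]string{"pod-a", "pod-b"}) // starts a, b
	_ = s.UpdateEndpoints([]string{"pod-b", "pod-c"}) // starts c, stops a
}
```

The replace-the-set shape keeps the interface small; the add/remove alternative mentioned in the snippet would push the diffing onto the caller instead.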

@kfswain
Collaborator

kfswain commented Jun 24, 2025

Some grammatical catches and some non-blocking open ended questions. Overall looks good. Will stamp after minor grammatical fixes are in. Thanks!

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
@elevran
Contributor Author

elevran commented Jun 25, 2025

@kfswain thanks for the review and catching those typos.
I've pushed a new commit fixing them and will share my thoughts regarding the open ended questions.

- (Future) Allow non-uniform data collection (i.e., not all endpoints share the
same data).

## Non-Goals
Contributor

Worth clarifying as a non-goal: I assume we don't intend to introduce a new mechanism for scraping. That is, the one scraper thread per endpoint remains the common framework for collecting data from the endpoint, and the extensibility we are introducing here is not meant to allow other ways to scrape, but to extend what can be scraped via this framework.

Contributor Author

@elevran elevran Jun 25, 2025

If I understand your statement correctly, then yes: the DataSource implementation would spawn a goroutine per endpoint.
So we would have one goroutine per data source, per endpoint.
Or did you mean that the same goroutine would be used for all data sources operating on an endpoint?

I think the shared goroutine per endpoint is doable but adds complexity (i.e., since it runs each source in turn, delays/errors in one could impact other sources). It's a complexity vs. resource overhead tradeoff.

Contributor

I was hoping for the latter, one goroutine per endpoint. But we can discuss during implementation.

Contributor Author

Sounds good. I'll implement one goroutine for all DataSources operating on an endpoint, and we can refine as needed during implementation.

// Extract is called by data sources with (possibly) updated
// data per endpoint. Extracted attributes are added to the
// Endpoint.
Extract(ep Endpoint, data interface{}) error // or Collect?
Contributor

@ahg-g ahg-g Jun 25, 2025
Shouldn't this be a generic Source type instead of Endpoint? The collector would know what to cast it to based on the data source it registered for. This is to allow data sources other than endpoints.

Contributor Author

A generic source type could be more flexible - I'll think about that more.

However, I realize I may have used confusing naming - let me try to clarify.

Regarding the specific interface function:
Extract() receives relevant data from a source (e.g., the response to GET /metrics), extracts the needed attributes (e.g., the specific metrics), and then stores them for later use on ep.

The current design uses a shared storage layer, with standard data per endpoint (e.g., scheduling/types.Pod referencing backendmetrics.MetricsState and backend.Pod, IIUC).

If we want to continue using this approach (and I'm suggesting we do, at least initially), the extended data would need to be stored in the same place (e.g., a new backend.Extended object or whatever we converge on).

Another option is for each new DataCollection to maintain its own internal storage scheme and only store "handles" (or tags) on the shared object (e.g., backend.Pod). The caller would use the handle (a data collection name and a tag) to retrieve the data from the relevant DataCollection.
This adds flexibility but also complexity (more moving parts, a more complicated mental model, more indirection and calls). It could also make non-Endpoint sources easier to support (Plugins would retrieve a reference to a DataCollection and get whatever data they need).

Thoughts?

Contributor

I am OK starting with endpoint-focused interfaces and adjusting them in a follow-up as we learn what the implementation will look like. My sense is that to support data source types other than Endpoints we will need to change this.

Contributor

+1. I think there is an agreement on the general direction. I’m in favor of starting to implement and iterate as needed.

Collaborator

Same - I share the hesitation about Endpoint, but I think it's fine for the initial implementation.
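For reference, a rough sketch of the handle-based alternative described in the thread above (illustrative names only; the discussion settled on starting with the simpler shared-storage approach first):

```go
package main

import "fmt"

// Handle points from the shared endpoint object into a
// DataCollection's private storage: a collection name plus a tag.
type Handle struct {
	Collection string
	Tag        string
}

// DataCollection owns its internal storage scheme.
type DataCollection interface {
	Get(tag string) (any, bool)
}

// mapCollection is a trivial in-memory DataCollection.
type mapCollection map[string]any

func (m mapCollection) Get(tag string) (any, bool) { v, ok := m[tag]; return v, ok }

// Endpoint is the shared object; it stores only handles, not the
// extended data itself.
type Endpoint struct {
	ID      string
	Handles []Handle
}

// Resolve follows a handle through a registry of collections --
// the extra indirection the comment calls out as added complexity.
func Resolve(reg map[string]DataCollection, h Handle) (any, bool) {
	c, ok := reg[h.Collection]
	if !ok {
		return nil, false
	}
	return c.Get(h.Tag)
}

func main() {
	reg := map[string]DataCollection{
		"lora-state": mapCollection{"pod-a": []string{"adapter-1"}},
	}
	ep := Endpoint{ID: "pod-a", Handles: []Handle{{Collection: "lora-state", Tag: "pod-a"}}}
	v, ok := Resolve(reg, ep.Handles[0])
	fmt.Println(ok, v)
}
```

Because plugins reach data only through `Resolve`, non-endpoint data (e.g., pool-wide hints from an external optimizer) could live in its own collection without changing the shared object.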

@ahg-g
Contributor

ahg-g commented Jun 25, 2025

/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 25, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 25, 2025
@kfswain
Collaborator

kfswain commented Jun 25, 2025

/approve

Unhold at your leisure, thanks!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elevran, kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 25, 2025
@nirrozenbaum
Contributor

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 25, 2025
@k8s-ci-robot k8s-ci-robot merged commit 9111ef2 into kubernetes-sigs:main Jun 25, 2025
8 checks passed
rlakhtakia pushed a commit to rlakhtakia/gateway-api-inference-extension that referenced this pull request Jun 26, 2025
* initial proposal content

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* renamed based on assigned PR

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* Plugin suffix unused - removing open

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* address grammatical errors from review

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

---------

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
shmuelk pushed a commit to shmuelk/gateway-api-inference-extension that referenced this pull request Jun 26, 2025
* initial proposal content

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* renamed based on assigned PR

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* Plugin suffix unused - removing open

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* address grammatical errors from review

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

---------

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
elevran added a commit to elevran/gateway-api-inference-extension that referenced this pull request Jun 27, 2025
- Explicitly include change to scraping mode as non-goal (feedback on kubernetes-sigs#1023)
- Spelling errors

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
elevran added a commit to elevran/gateway-api-inference-extension that referenced this pull request Jun 28, 2025
- Explicitly include change to scraping mode as non-goal (feedback on kubernetes-sigs#1023)
- Spelling errors

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
k8s-ci-robot pushed a commit that referenced this pull request Jun 28, 2025
- Explicitly include change to scraping mode as non-goal (feedback on #1023)
- Spelling errors

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
@elevran elevran deleted the data-layer-architecture-proposal branch July 2, 2025 12:27
EyalPazz pushed a commit to EyalPazz/gateway-api-inference-extension that referenced this pull request Jul 9, 2025
* initial proposal content

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* renamed based on assigned PR

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* Plugin suffix unused - removing open

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* address grammatical errors from review

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

---------

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
EyalPazz pushed a commit to EyalPazz/gateway-api-inference-extension that referenced this pull request Jul 9, 2025
- Explicitly include change to scraping mode as non-goal (feedback on kubernetes-sigs#1023)
- Spelling errors

Signed-off-by: Etai Lev Ran <elevran@gmail.com>