Extensible/Pluggable data layer proposal #1023


Merged

Conversation

elevran
Contributor

@elevran elevran commented Jun 19, 2025

This proposal outlines the plan for making the EPP data layer extensible and allowing custom attribute collection and storage for specific use cases.

Many extensions to scheduling will require changes to ingested metrics and attributes. As such, the data layer should be built to be extended to support bespoke use cases and experimentation.
The gist of the proposal is to enable configurable data sources and data collections (from existing or new sources) with no code changes to core components.

Ref #703

Follow-up on https://docs.google.com/document/d/1eCCuyB_VW08ik_jqPC1__z6FzeWO_VOlPDUpN85g9Ww/edit?usp=sharing

@ahg-g @kfswain @nirrozenbaum @liu-cong

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

netlify bot commented Jun 19, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 52ca32d
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/685c0be02c7e9300085971bb
😎 Deploy Preview https://deploy-preview-1023--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and Jeffwan June 19, 2025 18:02
@k8s-ci-robot
Contributor

Hi @elevran. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 19, 2025
Signed-off-by: Etai Lev Ran <elevran@gmail.com>
Signed-off-by: Etai Lev Ran <elevran@gmail.com>
@elevran
Contributor Author

elevran commented Jun 24, 2025

status check: does the proposal make sense? Can I start implementing based on the phases suggested?

@nirrozenbaum
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 24, 2025
@kfswain
Collaborator

kfswain commented Jun 24, 2025

status check: does the proposal make sense? Can I start implementing based on the phases suggested?

taking a look

pluggability effort. For example, extending the system to support the above
should be through implementing well defined Plugin interfaces and registering
them in the GIE Data Layer subsystem; any configuration would be done in the
same way (e.g., code and/or configuration file), etc.
Collaborator

configuration file

Just to sanity check, we would extend the same API, correct?

Contributor Author

Yes, that's the intent.
Any data layer plugin configuration would be done via the same API and configuration mechanism.
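To make the "same API and configuration mechanism" point concrete, here is a purely illustrative sketch of how a data layer plugin might be declared alongside other plugins in the EPP configuration file. The plugin type name and its parameters below are hypothetical, not part of the existing schema:

```yaml
# Hypothetical sketch only: the plugin type and parameter names are
# illustrative, not the actual EPP configuration schema.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: custom-metrics-datasource   # hypothetical data layer plugin
    parameters:
      endpointPath: /metrics
      scrapeInterval: 5s
```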

// Data Sources (interface defined below) in the system.
// It is accompanied by functions (not shown) to register
// and retrieve sources
type DataLayerSourcesRegistry map[string]DataSource
Collaborator

Just thinking aloud. Should we generate a new string type here to be very concrete about valid DataLayerSources? i.e. type DataSource string

Contributor Author

Sounds reasonable. What assurances would we want from the type?

We expect a new data layer source to define its "tag" when registered in the system, and the same name would be expected by Plugins (to access the extended data).
I can see using hierarchical names (e.g., with "DNS domain" prefixes) to reduce the chance of collisions (not sure we'll reach that point...), so maybe something like the k8s NamespacedName could be useful?
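As a minimal sketch of the idea discussed above (a typed, hierarchical source name plus a registry that rejects collisions), assuming illustrative names that are not the proposal's final API:

```go
package main

import "fmt"

// DataSourceName is a hypothetical typed name for data sources.
// A hierarchical, DNS-domain style prefix reduces the chance of
// collisions between independently developed sources.
type DataSourceName string

// DataSource is a minimal stand-in for the proposal's interface.
type DataSource interface {
	Name() DataSourceName
}

// DataLayerSourcesRegistry maps names to registered sources,
// mirroring the proposal's map type but keyed by the typed name.
type DataLayerSourcesRegistry map[DataSourceName]DataSource

// Register adds a source, rejecting duplicate names so that two
// extensions cannot silently shadow each other.
func (r DataLayerSourcesRegistry) Register(s DataSource) error {
	if _, ok := r[s.Name()]; ok {
		return fmt.Errorf("data source %q already registered", s.Name())
	}
	r[s.Name()] = s
	return nil
}

// metricsSource is a toy source using a DNS-domain prefixed name.
type metricsSource struct{}

func (metricsSource) Name() DataSourceName { return "example.com/vllm-metrics" }

func main() {
	reg := DataLayerSourcesRegistry{}
	fmt.Println(reg.Register(metricsSource{}) == nil) // first registration succeeds
	fmt.Println(reg.Register(metricsSource{}) == nil) // duplicate is rejected
}
```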

// UpdateEndpoints replaces the set of pods/resources tracked by
// this source.
// Alternative: add/remove individual endpoints?
UpdateEndpoints(epIDs []string) error
Collaborator
Out of scope for this proposal since we are going to iterate, but should we always assume that data collection is for endpoints? That seems reasonable for now; I can't think of something explicit that would be collected pool-wide. Again, just open thoughts.

Contributor Author

I've seen some discussion of using CRDs (to define LoRA loading and availability) or receiving scheduling hints from an external optimizer process (over k8s CRs or gRPC, possibly providing hints for which inference pool is optimal in the case of multiple pools on different HW).
These would not come directly from endpoints or even be attached to endpoints.

Having all of our data layer storage keyed off of endpoints would be a challenge for those use cases.
Not sure how to model that yet; we probably need to see the real use cases and refactor accordingly.
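As a sketch of the replace-the-set semantics in the quoted `UpdateEndpoints` snippet (the type and print statements below are illustrative stand-ins, not the proposal's implementation), a source can diff the new set against its current one to decide which per-endpoint collectors to start and stop:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeSource is a hypothetical endpoint-scoped data source that
// tracks the set of endpoint IDs it collects from.
type fakeSource struct {
	mu        sync.Mutex
	endpoints map[string]struct{}
}

// UpdateEndpoints replaces the tracked set, mirroring the proposed
// interface. Diffing against the old set tells the source which
// collectors to start and which to stop.
func (s *fakeSource) UpdateEndpoints(epIDs []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	next := make(map[string]struct{}, len(epIDs))
	for _, id := range epIDs {
		next[id] = struct{}{}
		if _, ok := s.endpoints[id]; !ok {
			fmt.Println("start collecting:", id) // e.g., spawn a collector goroutine
		}
	}
	for id := range s.endpoints {
		if _, ok := next[id]; !ok {
			fmt.Println("stop collecting:", id) // e.g., cancel its context
		}
	}
	s.endpoints = next
	return nil
}

func main() {
	s := &fakeSource{endpoints: map[string]struct{}{}}
	_ = s.UpdateEndpoints([]string{"pod-a", "pod-b"}) // starts a, b
	_ = s.UpdateEndpoints([]string{"pod-b", "pod-c"}) // starts c, stops a
}
```

The replace-the-set shape keeps the interface small; the add/remove alternative mentioned in the snippet would push the diffing onto the caller instead.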

@kfswain
Collaborator

kfswain commented Jun 24, 2025

Some grammatical catches and some non-blocking open ended questions. Overall looks good. Will stamp after minor grammatical fixes are in. Thanks!

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
@elevran
Contributor Author

elevran commented Jun 25, 2025

@kfswain thanks for the review and catching those typos.
I've pushed a new commit fixing them and will share my thoughts regarding the open ended questions.

- (Future) Allow non-uniform data collection (i.e., not all endpoints share the
same data).

## Non-Goals
Contributor

Worth clarifying as a non-goal: I assume we don't intend to introduce a new mechanism for scraping. That is, the one scraper thread per endpoint remains the common framework for collecting data from the endpoint, and the extensibility we are introducing here is not meant to allow other ways to scrape, but to extend what can be scraped via this framework.

Contributor Author

@elevran elevran Jun 25, 2025

If I understand your statement correctly, then yes: the DataSource implementation would spawn a goroutine per endpoint.
So we would have one goroutine per data source, per endpoint.
Or did you mean that the same goroutine would be used for all data sources operating on an endpoint?

I think the shared goroutine per endpoint is doable but adds complexity (i.e., since it runs each source in turn, delays/errors in one could impact other sources). It's a complexity vs. resource overhead tradeoff.

Contributor

I was hoping for the latter, one goroutine per endpoint. But we can discuss during implementation.

Contributor Author

Sounds good. I'll implement one goroutine for all DataSources operating on an endpoint, and we can refine as needed during implementation.

// Extract is called by data sources with (possibly) updated
// data per endpoint. Extracted attributes are added to the
// Endpoint.
Extract(ep Endpoint, data interface{}) error // or Collect?
Contributor

@ahg-g ahg-g Jun 25, 2025
Shouldn't this be a generic Source type instead of Endpoint? The collector would know what to cast it to based on the data source it registered for. This is to allow data sources other than endpoints.

Contributor Author

A generic source type could be more flexible - I'll think about that more.

However, I realize I may have used confusing naming - let me try to clarify.

Regarding the specific interface function:
Extract() receives relevant data from a source (e.g., the response to GET /metrics), extracts the needed attributes (e.g., the specific metrics), and then stores them for later use on ep.

The current design uses a shared storage layer, with standard data per endpoint (e.g., scheduling/types.Pod referencing backendmetrics.MetricsState and backend.Pod, IIUC).

If we want to continue using this approach (and I'm suggesting we do, at least initially), the extended data would need to be stored in the same place (e.g., a new backend.Extended object or whatever we converge on).

Another option is for each new DataCollection to maintain its own internal storage scheme and only store "handles" (or tags) on the shared object (e.g., backend.Pod). The caller would use the handle (a data collection name and a tag) to retrieve the data from the relevant DataCollection.
This adds flexibility but also complexity (more moving parts, a more complicated mental model, more indirection and calls). It could also make non-Endpoint sources easier to support (Plugins would retrieve a reference to a DataCollection and get whatever data they need).

Thoughts?

Contributor

I am OK starting with endpoint-focused interfaces and adjusting them in a follow-up as we learn what the implementation will look like. My sense is that to support data source types other than Endpoints we will need to change this.

Contributor

+1. I think there is an agreement on the general direction. I’m in favor of starting to implement and iterate as needed.

Collaborator

Same - I share the hesitation about Endpoint, but I think it's fine for the initial implementation.
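For reference, a rough sketch of the handle-based alternative described in the thread above (illustrative names only; the discussion settled on starting with the simpler shared-storage approach first):

```go
package main

import "fmt"

// Handle points from the shared endpoint object into a
// DataCollection's private storage: a collection name plus a tag.
type Handle struct {
	Collection string
	Tag        string
}

// DataCollection owns its internal storage scheme.
type DataCollection interface {
	Get(tag string) (any, bool)
}

// mapCollection is a trivial in-memory DataCollection.
type mapCollection map[string]any

func (m mapCollection) Get(tag string) (any, bool) { v, ok := m[tag]; return v, ok }

// Endpoint is the shared object; it stores only handles, not the
// extended data itself.
type Endpoint struct {
	ID      string
	Handles []Handle
}

// Resolve follows a handle through a registry of collections --
// the extra indirection the comment calls out as added complexity.
func Resolve(reg map[string]DataCollection, h Handle) (any, bool) {
	c, ok := reg[h.Collection]
	if !ok {
		return nil, false
	}
	return c.Get(h.Tag)
}

func main() {
	reg := map[string]DataCollection{
		"lora-state": mapCollection{"pod-a": []string{"adapter-1"}},
	}
	ep := Endpoint{ID: "pod-a", Handles: []Handle{{Collection: "lora-state", Tag: "pod-a"}}}
	v, ok := Resolve(reg, ep.Handles[0])
	fmt.Println(ok, v)
}
```

Because plugins reach data only through `Resolve`, non-endpoint data (e.g., pool-wide hints from an external optimizer) could live in its own collection without changing the shared object.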

@ahg-g
Contributor

ahg-g commented Jun 25, 2025

/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 25, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 25, 2025
@kfswain
Collaborator

kfswain commented Jun 25, 2025

/approve

Unhold at your leisure, thanks!

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elevran, kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 25, 2025
@nirrozenbaum
Contributor

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 25, 2025
@k8s-ci-robot k8s-ci-robot merged commit 9111ef2 into kubernetes-sigs:main Jun 25, 2025
8 checks passed
rlakhtakia pushed a commit to rlakhtakia/gateway-api-inference-extension that referenced this pull request Jun 26, 2025
* initial proposal content

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* renamed based on assigned PR

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* Plugin suffix unused - removing open

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* address grammatical errors from review

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

---------

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
shmuelk pushed a commit to shmuelk/gateway-api-inference-extension that referenced this pull request Jun 26, 2025
* initial proposal content

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* renamed based on assigned PR

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* Plugin suffix unused - removing open

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* address grammatical errors from review

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

---------

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
elevran added a commit to elevran/gateway-api-inference-extension that referenced this pull request Jun 27, 2025
- Explicitly include change to scraping mode as non-goal (feedback on kubernetes-sigs#1023)
- Spelling errors

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
elevran added a commit to elevran/gateway-api-inference-extension that referenced this pull request Jun 28, 2025
- Explicitly include change to scraping mode as non-goal (feedback on kubernetes-sigs#1023)
- Spelling errors

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
k8s-ci-robot pushed a commit that referenced this pull request Jun 28, 2025
- Explicitly include change to scraping mode as non-goal (feedback on #1023)
- Spelling errors

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
@elevran elevran deleted the data-layer-architecture-proposal branch July 2, 2025 12:27
EyalPazz pushed a commit to EyalPazz/gateway-api-inference-extension that referenced this pull request Jul 9, 2025
* initial proposal content

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* renamed based on assigned PR

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* Plugin suffix unused - removing open

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

* address grammatical errors from review

Signed-off-by: Etai Lev Ran <elevran@gmail.com>

---------

Signed-off-by: Etai Lev Ran <elevran@gmail.com>
EyalPazz pushed a commit to EyalPazz/gateway-api-inference-extension that referenced this pull request Jul 9, 2025
- Explicitly include change to scraping mode as non-goal (feedback on kubernetes-sigs#1023)
- Spelling errors

Signed-off-by: Etai Lev Ran <elevran@gmail.com>