feat: fetch partition routing and on-demand metadata refresh #126

Merged: novatechflow merged 2 commits into KafScale:main from klaudworks:feat/fetch-partition-routing on Mar 2, 2026

@klaudworks (Collaborator) commented Mar 1, 2026

Merge #125 first (group coordination routing). This PR is stacked on top.
Closes #115

Summary

  • Fetch requests are now routed to the broker that owns the requested partitions, similar to how produce requests already work.
  • The proxy no longer polls for metadata every 3 seconds. Instead, it fetches metadata only when it encounters a broker or topic it doesn't recognize. Multiple simultaneous cache misses share a single metadata request.
  • The readiness probe no longer makes a network call on every check. It returns healthy immediately if recent metadata is cached, and only reaches out to the metadata store if the cache has gone stale.
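
To make the first bullet concrete, here is a minimal sketch of splitting a fetch by owning broker. It is not the code in this PR: `TopicPartition`, `splitByOwner`, `sortedBrokerIDs`, and the `owners` lookup table are hypothetical names, and the sketch assumes ownership is available as a simple map built from cached metadata.

```go
package main

import "sort"

// TopicPartition identifies one partition in a fetch request (hypothetical type).
type TopicPartition struct {
	Topic     string
	Partition int32
}

// splitByOwner groups requested partitions by the ID of the broker that owns
// them. owners is an assumed lookup table derived from cached metadata.
// Partitions without a known owner are returned separately so the caller can
// refresh metadata before deciding where to send them.
func splitByOwner(req []TopicPartition, owners map[TopicPartition]int32) (map[int32][]TopicPartition, []TopicPartition) {
	byBroker := make(map[int32][]TopicPartition)
	var unknown []TopicPartition
	for _, tp := range req {
		id, ok := owners[tp]
		if !ok {
			unknown = append(unknown, tp)
			continue
		}
		byBroker[id] = append(byBroker[id], tp)
	}
	return byBroker, unknown
}

// sortedBrokerIDs returns the broker IDs in ascending order so that
// sub-requests are issued and merged in a deterministic order.
func sortedBrokerIDs(byBroker map[int32][]TopicPartition) []int32 {
	ids := make([]int32, 0, len(byBroker))
	for id := range byBroker {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}
```

Each group would then be forwarded to its broker and the per-broker responses merged back into a single fetch response; the `unknown` slice is where an on-demand metadata refresh would kick in.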

Outlook

Consumers will now fetch data directly from brokers that already have it in their segment cache. Slow S3 requests are needed only to reprocess older data, for example after a partition reassignment.

@klaudworks changed the title from "feat: partition-aware fetch routing and on-demand metadata caching" to "feat: fetch partition routing and on-demand metadata refresh" on Mar 1, 2026
@klaudworks requested a review from novatechflow on March 1, 2026 13:02
novatechflow previously approved these changes Mar 2, 2026

@novatechflow (Collaborator)

@klaudworks - can you please resolve the conflicts?

Split fetch requests by owning broker, forward concurrently, and merge
responses. Retry partitions rejected with NOT_LEADER_OR_FOLLOWER up to
3 times. For v12+ requests that use topic IDs instead of names, resolve
IDs via a metadata-refreshed cache and use a collision-safe key to
prevent unresolved topics from merging silently.
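
The retry and keying behaviour described above could look roughly like the sketch below. This is an illustration, not the PR's implementation: `fetchWithRetry`, `topicKey`, `PartitionResult`, and the callback signatures are hypothetical, and it assumes "up to 3 times" means three retries after the initial attempt. The error code 6 is the Kafka protocol value for NOT_LEADER_OR_FOLLOWER. The `id:` prefix works as a collision-safe key because `:` is not a legal character in Kafka topic names.

```go
package main

import "encoding/hex"

const (
	errNotLeaderOrFollower int16 = 6 // Kafka protocol error code
	maxRetries                   = 3 // assumption: retries after the first attempt
)

// TopicPartition and PartitionResult are hypothetical types for this sketch.
type TopicPartition struct {
	Topic     string
	Partition int32
}

type PartitionResult struct {
	TP        TopicPartition
	ErrorCode int16
}

// fetchWithRetry calls fetch for all partitions, then re-fetches only those
// rejected with NOT_LEADER_OR_FOLLOWER, calling refresh before each retry so
// the next attempt can be routed with fresh metadata. After maxRetries the
// last error is passed through to the client unchanged.
func fetchWithRetry(parts []TopicPartition, fetch func([]TopicPartition) []PartitionResult, refresh func()) []PartitionResult {
	var done []PartitionResult
	pending := parts
	for attempt := 0; ; attempt++ {
		var retry []TopicPartition
		for _, r := range fetch(pending) {
			if r.ErrorCode == errNotLeaderOrFollower && attempt < maxRetries {
				retry = append(retry, r.TP)
				continue
			}
			done = append(done, r)
		}
		if len(retry) == 0 {
			return done
		}
		refresh()
		pending = retry
	}
}

// topicKey returns a map key for a topic. Resolved topics are keyed by name;
// unresolved v12+ topics (empty name) are keyed by their 16-byte topic ID so
// that two different unresolved topics never share the empty-string key.
func topicKey(name string, id [16]byte) string {
	if name != "" {
		return name
	}
	return "id:" + hex.EncodeToString(id[:])
}
```
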

Adds EncodeFetchRequest and ParseFetchResponse codecs with round-trip
and kmsg validation tests.

Remove the 3-second polling ticker and refresh broker/topic caches on
demand when a lookup misses. Concurrent misses are coalesced via
singleflight to avoid thundering herd metadata fetches.
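
For readers unfamiliar with the coalescing pattern, here is a dependency-free sketch of the idea. The PR itself uses singleflight; the hand-rolled `refresher` below is a stand-in written only for illustration, and its names (`refresher`, `Refresh`, `coalesced`, the `[]string` result) are hypothetical.

```go
package main

import "sync"

// call tracks one in-flight metadata fetch and its result.
type call struct {
	wg  sync.WaitGroup
	val []string
	err error
}

// refresher coalesces concurrent refresh requests into a single fetch.
// It is a simplified stand-in for a singleflight group with one key.
type refresher struct {
	mu        sync.Mutex
	inflight  *call
	coalesced int // callers that shared another caller's fetch (for metrics)
	fetch     func() ([]string, error)
}

// Refresh runs fetch unless one is already in flight, in which case it waits
// for that fetch and returns the same result.
func (r *refresher) Refresh() ([]string, error) {
	r.mu.Lock()
	if c := r.inflight; c != nil {
		r.coalesced++
		r.mu.Unlock()
		c.wg.Wait() // share the leader's result instead of fetching again
		return c.val, c.err
	}
	c := &call{}
	c.wg.Add(1)
	r.inflight = c
	r.mu.Unlock()

	c.val, c.err = r.fetch()

	r.mu.Lock()
	r.inflight = nil // later cache misses start a new fetch
	r.mu.Unlock()
	c.wg.Done()
	return c.val, c.err
}
```

With this shape, ten goroutines that miss the cache at the same moment produce one metadata request instead of ten, while a miss that arrives after the fetch completes triggers a fresh one.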

Readiness probe now checks cached state first (fast path), falling back
to a live metadata fetch only when the cache TTL expires. Static
backends are always ready.
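
A minimal sketch of the fast-path/slow-path readiness check described above, under stated assumptions: `readiness`, `Check`, `ErrNotReady`, and `defaultTTL` are hypothetical names, the 30-second TTL is an arbitrary placeholder rather than the value used in the PR, and concurrent stale checks are not coalesced here.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrNotReady is returned when the cache is stale and a live refresh fails.
var ErrNotReady = errors.New("metadata cache stale and refresh failed")

// defaultTTL is a placeholder value chosen for this sketch.
const defaultTTL = 30 * time.Second

// readiness tracks when metadata was last fetched successfully.
type readiness struct {
	mu          sync.Mutex
	static      bool // static backends are always ready
	lastRefresh time.Time
	ttl         time.Duration
	fetch       func() error // live metadata fetch (slow path)
}

func newReadiness(ttl time.Duration, fetch func() error) *readiness {
	return &readiness{ttl: ttl, fetch: fetch}
}

// Check returns nil when the proxy is ready. It makes no network call while
// cached metadata is younger than ttl.
func (r *readiness) Check() error {
	if r.static {
		return nil
	}
	r.mu.Lock()
	fresh := !r.lastRefresh.IsZero() && time.Since(r.lastRefresh) < r.ttl
	r.mu.Unlock()
	if fresh {
		return nil // fast path: recent metadata is cached
	}
	// Slow path: the cache is empty or stale, so ask the metadata store.
	if err := r.fetch(); err != nil {
		return fmt.Errorf("%w: %v", ErrNotReady, err)
	}
	r.mu.Lock()
	r.lastRefresh = time.Now()
	r.mu.Unlock()
	return nil
}
```

A probe handler would map a nil result to a healthy status and any error to an unhealthy one.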

Clean up comments per AGENTS.md: remove low-value comments, condense
verbose doc comments.

@klaudworks force-pushed the feat/fetch-partition-routing branch from 4acb9c5 to fa6938f on March 2, 2026 11:37

@klaudworks (Collaborator, Author)

@novatechflow Unfortunately, GitHub has no support for stacked PRs, so I have to rebase each PR after the prior one is merged. Once this is approved and merged, I'll rebase the next one.

@klaudworks requested a review from novatechflow on March 2, 2026 12:28
@novatechflow merged commit 5162304 into KafScale:main on Mar 2, 2026
9 checks passed
Development

Successfully merging this pull request may close these issues.

perf: reduce e2e produce + fetch latency on AWS to < 100ms
