
cluster: use CTP as the transport for compute and storage protocol connections #32897


Draft: @teskje wants to merge 9 commits into main

Conversation

teskje
Contributor

@teskje teskje commented Jul 2, 2025

(The first 5 commits are from #32330 and can be ignored here.)

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

teskje added 4 commits July 2, 2025 10:34
This commit introduces a new transport protocol for bidirectional
message-stream communication between controllers and cluster replicas,
called CTP. The intention is to replace gRPC with a much simpler
protocol and implementation, to reduce complexity.
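
The commit message does not spell out the CTP wire format. As a rough illustration of what a simple message-stream transport over a byte stream can look like, here is a minimal sketch that assumes length-prefixed framing (a 4-byte big-endian length followed by the payload). The names `write_frame`, `read_frame`, and `MAX_FRAME_LEN` are hypothetical and not taken from this PR.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound on one frame, to avoid unbounded allocations.
const MAX_FRAME_LEN: usize = 16 * 1024 * 1024;

/// Writes one message as a 4-byte big-endian length followed by the payload.
/// This framing is an assumption for illustration, not the actual CTP format.
pub fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    let len = payload.len() as u32;
    writer.write_all(&len.to_be_bytes())?;
    writer.write_all(payload)?;
    writer.flush()
}

/// Reads one frame written by `write_frame`.
/// Fails with `UnexpectedEof` if the stream ends before a full frame arrives.
pub fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

Because both directions of a connection can use the same two functions, a framing like this is naturally bidirectional; serialization of the payload itself is out of scope for the sketch.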
This commit adds server FQDN validation to CTP. Having this is important
to avoid confusion when clients connect to the wrong server endpoint
because of stale DNS caches.
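
To make the idea concrete, the following sketch assumes that the client announces the FQDN it intended to reach during a handshake and that the server compares it against its own name. Which side performs the check, and the names `validate_server_fqdn` and `HandshakeError`, are assumptions for illustration; the commit message does not specify them.

```rust
/// Normalizes a name for comparison: DNS names are case-insensitive and may
/// be written with a trailing dot.
fn normalize_fqdn(name: &str) -> String {
    name.trim().trim_end_matches('.').to_ascii_lowercase()
}

/// Hypothetical handshake error type.
#[derive(Debug, PartialEq)]
pub enum HandshakeError {
    WrongServer { expected: String, actual: String },
}

/// Checks that the FQDN the client meant to reach matches the server's own.
/// A mismatch suggests the client resolved a stale address.
pub fn validate_server_fqdn(
    client_expected: &str,
    server_fqdn: &str,
) -> Result<(), HandshakeError> {
    let expected = normalize_fqdn(client_expected);
    let actual = normalize_fqdn(server_fqdn);
    if expected == actual {
        Ok(())
    } else {
        Err(HandshakeError::WrongServer { expected, actual })
    }
}
```

Rejecting the connection at handshake time means a misrouted client fails fast instead of exchanging protocol messages with the wrong replica.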
This commit adds timeouts to CTP connections. Each read or write from
the underlying stream is wrapped in a timeout, and the connection fails
if a timeout is triggered. To avoid triggering timeouts on idle but
healthy connections, keepalive messages are introduced.
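
The interplay between timeouts and keepalives can be sketched as a small state machine. This models only the timing decisions, not the actual stream I/O, and it assumes a keepalive interval shorter than the timeout. The `Liveness` type, its methods, and the `Action` enum are hypothetical names, not part of this PR.

```rust
use std::time::Duration;

/// What the connection task should do after some time has passed.
#[derive(Debug, PartialEq)]
pub enum Action {
    Idle,
    SendKeepalive,
    Fail,
}

/// Tracks how long ago the connection last sent and received anything.
pub struct Liveness {
    keepalive_interval: Duration,
    timeout: Duration,
    since_last_send: Duration,
    since_last_recv: Duration,
}

impl Liveness {
    pub fn new(keepalive_interval: Duration, timeout: Duration) -> Self {
        Liveness {
            keepalive_interval,
            timeout,
            since_last_send: Duration::from_secs(0),
            since_last_recv: Duration::from_secs(0),
        }
    }

    /// Call after any message (including a keepalive) was written.
    pub fn on_send(&mut self) {
        self.since_last_send = Duration::from_secs(0);
    }

    /// Call after any message (including a keepalive) was read.
    pub fn on_recv(&mut self) {
        self.since_last_recv = Duration::from_secs(0);
    }

    /// Advances the clock and decides what to do next.
    pub fn tick(&mut self, elapsed: Duration) -> Action {
        self.since_last_send += elapsed;
        self.since_last_recv += elapsed;
        if self.since_last_recv >= self.timeout {
            Action::Fail
        } else if self.since_last_send >= self.keepalive_interval {
            Action::SendKeepalive
        } else {
            Action::Idle
        }
    }
}
```

As long as each peer sends a keepalive more often than the other side's timeout, an idle but healthy connection never trips the timeout, while a dead peer is detected within one timeout period.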
This commit makes the CTP server bound the number of active client
connections to one, by canceling the current connection upon accepting a
new one.
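
One way to picture "cancel the current connection upon accepting a new one" is a server that hands each accepted connection a cancellation token and cancels the previous token on every accept. The sketch below uses only the standard library; `CancelToken` and `Server` are hypothetical names, and the real implementation may well use a different cancellation mechanism.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// A shared flag that the connection task polls to learn it was superseded.
#[derive(Clone)]
pub struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn new() -> Self {
        CancelToken(Arc::new(AtomicBool::new(false)))
    }

    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_canceled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Keeps at most one active connection.
pub struct Server {
    active: Option<CancelToken>,
}

impl Server {
    pub fn new() -> Self {
        Server { active: None }
    }

    /// Registers a newly accepted connection, canceling the previous one.
    pub fn accept(&mut self) -> CancelToken {
        if let Some(previous) = self.active.take() {
            previous.cancel();
        }
        let token = CancelToken::new();
        self.active = Some(token.clone());
        token
    }
}
```

Preferring the newest connection means a controller that reconnects after a network fault is not blocked by its own stale connection.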
@teskje teskje mentioned this pull request Jul 2, 2025
5 tasks
teskje added 4 commits July 2, 2025 12:56
This commit adds support for connection metrics tracking to CTP. For now
the reported metrics are limited to the number of bytes sent and
received, together with the corresponding payloads.
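
Byte counting of this kind is commonly done by wrapping the underlying stream. The following sketch assumes a wrapper that implements `Read` and `Write` and updates counters on every call; `Metered` and `ConnMetrics` are hypothetical names and do not reflect the metrics interface added by this commit.

```rust
use std::io::{self, Read, Write};

/// Counters for one connection.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct ConnMetrics {
    pub bytes_sent: u64,
    pub bytes_received: u64,
}

/// Wraps a stream and counts the bytes that pass through it.
pub struct Metered<S> {
    inner: S,
    metrics: ConnMetrics,
}

impl<S> Metered<S> {
    pub fn new(inner: S) -> Self {
        Metered { inner, metrics: ConnMetrics::default() }
    }

    /// Returns a snapshot of the counters.
    pub fn metrics(&self) -> ConnMetrics {
        self.metrics
    }
}

impl<S: Read> Read for Metered<S> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.metrics.bytes_received += n as u64;
        Ok(n)
    }
}

impl<S: Write> Write for Metered<S> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Count only the bytes the inner stream actually accepted.
        let n = self.inner.write(buf)?;
        self.metrics.bytes_sent += n as u64;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

Counting at the stream layer keeps the protocol code unaware of metrics; the counters could then be exported to whatever metrics system the process already uses.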
CTP requires all messages to implement the serde traits, so this commit
adds the implementations where they were missing.
This commit extends the replica server code with support for CTP as the
controller connection transport, in addition to the existing gRPC
transport. It introduces a new command line flag, `--use-ctp`, to select
between the two.
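
As a rough sketch of the selection logic, the snippet below assumes `--use-ctp` is a plain boolean flag whose presence selects CTP and whose absence keeps gRPC. Only the flag name comes from the commit message; the `Transport` enum and `select_transport` function are hypothetical, and the real replica presumably parses its flags with its existing argument parser.

```rust
/// The two controller connection transports mentioned in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Transport {
    Grpc,
    Ctp,
}

/// Picks the transport based on whether `--use-ctp` appears in the arguments.
/// Treating the flag as presence-only is an assumption.
pub fn select_transport<I, S>(args: I) -> Transport
where
    I: IntoIterator<Item = S>,
    S: AsRef<str>,
{
    if args.into_iter().any(|arg| arg.as_ref() == "--use-ctp") {
        Transport::Ctp
    } else {
        Transport::Grpc
    }
}
```

In a binary this would typically be called as `select_transport(std::env::args())`, after which the server starts the listener for the chosen transport.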
This commit makes the controller use CTP for controller connections by
default. Both the replica tasks and the orchestrator are configured
accordingly. For now it is still possible to revert to gRPC using a
feature flag.
@teskje teskje force-pushed the ctp-for-compute-storage branch from 118aaea to 7f1ee96 July 2, 2025 10:58