-
Notifications
You must be signed in to change notification settings - Fork 455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
orchestrator-process: store PID metadata in temp directory #15810
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ci looks mad but this was shockingly painless to do it appears!
203ae33
to
89de11d
Compare
Yep, this part was easy. Sadly the rest of #15725 is hard! |
#[clap( | ||
long, | ||
env = "ORCHESTRATOR_PROCESSDATA_DIRECTORY", | ||
env = "ORCHESTRATOR_PROCESS_DATA_DIRECTORY", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable was used with its old name in mzcompose/services.py . I guess, to be safe, services.py needs to set the variable with both its old and new spelling.
I pushed #15800 as a temporary work-around until this is sorted out. |
89de11d
to
8b365e0
Compare
8b365e0
to
f4377ac
Compare
db3ace4
to
1439b5f
Compare
1357799
to
4870fe1
Compare
Add `SocketAddr`, `Listener`, and `Stream` types to the `mz_ore::netio` module, which abstract over TCP sockets and Unix domain sockets. Then teach storaged and computed to accept their listen addresses using the new `SocketAddr` types, which allows them to bind to either TCP or Unix domain sockets, as desired. This is a key part step towards fixing the process orchestrator (MaterializeInc#15725), as it will allow multiple copies of Materialize to be run concurrently without competing for access to the same ports.
4870fe1
to
dbfd8a1
Compare
PID files are not valid after a reboot of a machine. In the best case, the referenced PIDs do not exist, and the process orchestrator correctly recreates the services; in the worst case, the PIDs have been reused by different processes entirely, and the process orchestrator incorrectly thinks the services are already running. The worst case scenario is almost a guarantee with containers, where there are only a few processes using the low-numbered containers. This commit fixes the problem by moving the PID metadata files into $TMPDIR/environment-$ID. $TMPDIR is cleared on restart, so the stale PID files will correctly vanish after a restart. Naming the directory after the environment ID ensures that environmentd can find its metadata after a process restart without a machine restart, but allows multiple `environmentd` processes to co-exist, as long as they use different environment IDs. Things work correctly with the `--reset` option to bin/environment, too, as this option generates a new environmentd ID. Touches MaterializeInc#15725. Would close MaterializeInc#15800.
Rewrite the process orchestrator to be more reliable: * Use Unix domain sockets rather than TCP, to avoid competing with other `environmentd` processes (e.g., in tests) for a very limited range of TCP ports. * Restructure process supervision so that existing processes that are adopted at startup are monitored for failures and restarted if necessary. * Enable process status updates. * Enable process metrics collection. Fix MaterializeInc#15725. Fix MaterializeInc#15336.
dbfd8a1
to
a233f31
Compare
@jkosh44 just a heads up that the process orchestrator looks very different as a result of this PR. So that you're not surprised when you next look at it. |
PID files are not valid after a reboot of a machine. In the best case, the referenced PIDs do not exist, and the process orchestrator correctly recreates the services; in the worst case, the PIDs have been reused by different processes entirely, and the process orchestrator incorrectly thinks the services are already running. The worst case scenario is almost a guarantee with containers, where there are only a few processes using the low-numbered containers.
This commit fixes the problem by moving the PID metadata files into $TMPDIR/environment-$ID. $TMPDIR is cleared on restart, so the stale PID files will correctly vanish after a restart. Naming the directory after the environment ID ensures that environmentd can find its metadata after a process restart without a machine restart, but allows multiple
environmentd
processes to co-exist, as long as they use different environment IDs. Things work correctly with the--reset
option to bin/environment, too, as this option generates a new environmentd ID.Touches #15725.
Would close #15800.
Motivation
Checklist
This PR has adequate test coverage / QA involvement has been duly considered.
This PR evolves an existing
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way) and therefore is tagged with aT-proto
label.This PR includes the following user-facing behavior changes: