Fix up the process orchestrator #15725
This bug actually prevents running any tests with SIZE '4-4' reliably in the CI, let alone the entire CI. The sheer number of processes that need to be started properly is such that sporadic failures are guaranteed to spoil the CI experience for everyone. Both the "storaged was never restarted" and the "cannot assign requested address" sub-issues are seen. @benesch could this be a potential solution:
Yup, I think that would do it!
In order to work around MaterializeInc#15725, clean up the process orchestrator's information on PIDs and TCP ports on restart. As the tests are running in containers, the processes that may have held those PIDs and ports are gone anyway by the time environmentd restarts, as they lived in the same container. Relates to MaterializeInc#15725, MaterializeInc#15155
@benesch is ephemeral storage the same as …
Even more ephemeral! Anonymous volumes are still volumes that can persist across container restarts. Ephemeral storage is the stuff that vanishes on container exit. Like, writing to /home, unless you've mounted a volume at /home, for example.
Due to MaterializeInc#15725, environmentd may fail to spawn all the required computeds unless the process orchestrator metadata is wiped in advance. We consolidate the wiping procedure in `up()` so that all testing frameworks that happen to restart Mz can benefit.
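A minimal sketch of the wipe-then-start idea described above, not the actual mzcompose code. The `.pid` and `.ports` suffixes, the `wipe_orchestrator_metadata` helper, and the `start_services` callback are all hypothetical names chosen for illustration; the real metadata layout may differ.

```python
from pathlib import Path
from typing import Callable, List

# Hypothetical suffixes for the orchestrator's bookkeeping files.
# The real process orchestrator's on-disk layout may differ.
STALE_SUFFIXES = (".pid", ".ports")


def wipe_orchestrator_metadata(data_dir: Path) -> List[str]:
    """Delete stale PID and port bookkeeping files; return the names removed."""
    removed = []
    for path in sorted(data_dir.iterdir()):
        if path.is_file() and path.suffix in STALE_SUFFIXES:
            path.unlink()
            removed.append(path.name)
    return removed


def up(data_dir: Path, start_services: Callable[[], None]) -> List[str]:
    """Hypothetical stand-in for `up()`: always wipe before starting services."""
    removed = wipe_orchestrator_metadata(data_dir)
    start_services()
    return removed
```

Doing the wipe inside `up()` rather than in each test means every framework that restarts Mz gets the cleanup without having to remember it. Files with other suffixes in the same directory are left untouched.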
…-wipe mzcompose: Fortify mzcompose against #15725
Add `SocketAddr`, `Listener`, and `Stream` types to the `mz_ore::netio` module, which abstract over TCP sockets and Unix domain sockets. Then teach storaged and computed to accept their listen addresses using the new `SocketAddr` types, which allows them to bind to either TCP or Unix domain sockets, as desired. This is a key step towards fixing the process orchestrator (#15725), as it will allow multiple copies of Materialize to be run concurrently without competing for access to the same ports.
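To illustrate what abstracting over the two socket families can look like, here is a rough Python sketch. It is not the `mz_ore::netio` API; `parse_listen_addr` and `listen` are hypothetical helpers, and the rule that a leading `/` means a Unix domain socket path is an assumption made for this example. The TCP branch handles IPv4 `host:port` only.

```python
import socket
from typing import Tuple, Union

Address = Union[Tuple[str, int], str]


def parse_listen_addr(addr: str) -> Tuple[str, Address]:
    """Classify a listen address as a Unix socket path or a TCP host:port.

    Assumed convention for this sketch: a leading "/" marks a Unix domain
    socket path; anything else must look like "host:port".
    """
    if addr.startswith("/"):
        return ("unix", addr)
    host, sep, port = addr.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"invalid listen address: {addr!r}")
    return ("tcp", (host, int(port)))


def listen(addr: str) -> socket.socket:
    """Bind and listen on either kind of address behind one entry point."""
    kind, target = parse_listen_addr(addr)
    family = socket.AF_UNIX if kind == "unix" else socket.AF_INET
    sock = socket.socket(family, socket.SOCK_STREAM)
    try:
        sock.bind(target)
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock
```

With Unix socket paths placed in per-instance directories, two copies of a service no longer contend for one TCP port number, which is the property the commit message is after.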
PID files are not valid after a reboot of a machine. In the best case, the referenced PIDs do not exist, and the process orchestrator correctly recreates the services; in the worst case, the PIDs have been reused by different processes entirely, and the process orchestrator incorrectly thinks the services are already running. The worst case is almost guaranteed with containers, where only a few processes exist and they all use low-numbered PIDs. This commit fixes the problem by moving the PID metadata files into $TMPDIR/environment-$ID. $TMPDIR is cleared on restart, so the stale PID files will correctly vanish after a restart. Naming the directory after the environment ID ensures that environmentd can find its metadata after a process restart without a machine restart, but allows multiple `environmentd` processes to co-exist, as long as they use different environment IDs. Things work correctly with the `--reset` option to bin/environmentd, too, as this option generates a new environment ID. Touches #15725. Would close #15800.
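The directory scheme can be sketched as follows. This is an illustration under assumptions, not the commit's code: the helper names and the `<service>.pid` file naming are hypothetical; only the `$TMPDIR/environment-$ID` layout comes from the text above.

```python
import tempfile
from pathlib import Path
from typing import Optional


def metadata_dir(environment_id: str, tmpdir: Optional[str] = None) -> Path:
    """Return <tmpdir>/environment-<id>; defaults to the system temp dir,
    which honors $TMPDIR."""
    base = Path(tmpdir if tmpdir is not None else tempfile.gettempdir())
    return base / f"environment-{environment_id}"


def record_pid(environment_id: str, service: str, pid: int,
               tmpdir: Optional[str] = None) -> Path:
    """Write a PID file for a service (hypothetical `<service>.pid` naming)."""
    directory = metadata_dir(environment_id, tmpdir)
    directory.mkdir(parents=True, exist_ok=True)
    pid_file = directory / f"{service}.pid"
    pid_file.write_text(str(pid))
    return pid_file


def recorded_pid(environment_id: str, service: str,
                 tmpdir: Optional[str] = None) -> Optional[int]:
    """Return the recorded PID, or None if no metadata exists, for example
    because the temp dir was cleared or the environment ID changed."""
    pid_file = metadata_dir(environment_id, tmpdir) / f"{service}.pid"
    try:
        return int(pid_file.read_text())
    except FileNotFoundError:
        return None
```

A lookup under a different environment ID finds nothing, so a fresh ID behaves like a clean start, while a restart with the same ID and an intact temp directory still finds the earlier metadata.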
There are two known problems with the process orchestrator:
We've talked about simplifying the process orchestrator (e.g., removing all support for re-adopting processes after an `environmentd` restart), but there are at least two good reasons to want to keep the process orchestrator in parity with the Kubernetes orchestrator:
… `envd` and have it adopt existing `storaged` and `computed` processes.
I think there are three well-scoped improvements we can make.
… `cargo-nextest` to run Rust tests faster (ci: pull cargo test out of the build CI job #13035).
… `supervise` that watches for the external process to crash.
Tossing this one on @guswynn and @jkosh44's radar, but the process orchestrator isn't really owned by any particular team.
cc @chaas @uce