Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

orchestrator-process: store PID metadata in temp directory #15810

Merged
merged 3 commits into from
Dec 4, 2022

Conversation

benesch
Copy link
Member

@benesch benesch commented Oct 31, 2022

PID files are not valid after a reboot of a machine. In the best case, the referenced PIDs do not exist, and the process orchestrator correctly recreates the services; in the worst case, the PIDs have been reused by different processes entirely, and the process orchestrator incorrectly thinks the services are already running. The worst case scenario is almost a guarantee with containers, where there are only a few processes using the low-numbered containers.

This commit fixes the problem by moving the PID metadata files into $TMPDIR/environment-$ID. $TMPDIR is cleared on restart, so the stale PID files will correctly vanish after a restart. Naming the directory after the environment ID ensures that environmentd can find its metadata after a process restart without a machine restart, but allows multiple environmentd processes to co-exist, as long as they use different environment IDs. Things work correctly with the --reset option to bin/environment, too, as this option generates a new environmentd ID.

Touches #15725.
Would close #15800.

Motivation

  • This PR takes a step towards fixing a recognized bug.

Checklist

Copy link
Contributor

@guswynn guswynn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ci looks mad but this was shockingly painless to do it appears!

@benesch benesch force-pushed the pid-metadata-tmp branch 3 times, most recently from 203ae33 to 89de11d Compare November 1, 2022 03:20
@benesch benesch enabled auto-merge (rebase) November 1, 2022 03:22
@benesch
Copy link
Member Author

benesch commented Nov 1, 2022

Yep, this part was easy. Sadly the rest of #15725 is hard!

#[clap(
long,
env = "ORCHESTRATOR_PROCESSDATA_DIRECTORY",
env = "ORCHESTRATOR_PROCESS_DATA_DIRECTORY",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable was used with its old name in mzcompose/services.py . I guess, to be safe, services.py needs to set the variable with both its old and new spelling.

@philip-stoev
Copy link
Contributor

I pushed #15800 as a temporary work-around until this is sorted out.

Add `SocketAddr`, `Listener`, and `Stream` types to the `mz_ore::netio`
module, which abstract over TCP sockets and Unix domain sockets. Then
teach storaged and computed to accept their listen addresses using the
new `SocketAddr` types, which allows them to bind to either TCP or Unix
domain sockets, as desired.

This is a key part step towards fixing the process orchestrator
(MaterializeInc#15725), as it will allow multiple copies of Materialize to be run
concurrently without competing for access to the same ports.
PID files are not valid after a reboot of a machine. In the best case,
the referenced PIDs do not exist, and the process orchestrator correctly
recreates the services; in the worst case, the PIDs have been reused by
different processes entirely, and the process orchestrator incorrectly
thinks the services are already running. The worst case scenario is
almost a guarantee with containers, where there are only a few processes
using the low-numbered containers.

This commit fixes the problem by moving  the PID metadata files into
$TMPDIR/environment-$ID. $TMPDIR is cleared on restart, so the stale PID
files will correctly vanish after a restart. Naming the directory after
the environment ID ensures that environmentd can find its metadata after
a process restart without a machine restart, but allows multiple
`environmentd` processes to co-exist, as long as they use different
environment IDs. Things work correctly with the `--reset` option to
bin/environment, too, as this option generates a new environmentd ID.

Touches MaterializeInc#15725.
Would close MaterializeInc#15800.
Rewrite the process orchestrator to be more reliable:

  * Use Unix domain sockets rather than TCP, to avoid competing with
    other `environmentd` processes (e.g., in tests) for a very limited
    range of TCP ports.

  * Restructure process supervision so that existing processes that are
    adopted at startup are monitored for failures and restarted if
    necessary.

  * Enable process status updates.

  * Enable process metrics collection.

Fix MaterializeInc#15725.
Fix MaterializeInc#15336.
@benesch
Copy link
Member Author

benesch commented Dec 4, 2022

@jkosh44 just a heads up that the process orchestrator looks very different as a result of this PR. So that you're not surprised when you next look at it.

@benesch benesch merged commit ac1028d into MaterializeInc:main Dec 4, 2022
@benesch benesch deleted the pid-metadata-tmp branch December 4, 2022 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants