Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

environmentd: Have environmentd panic on child crashes during testing #16347

Merged

Conversation

philip-stoev
Copy link
Contributor

@philip-stoev philip-stoev commented Nov 29, 2022

By default, the process orchestrator will restart any computed and storageds that it manages. During testing, this will unfortunately mask panics that should have caused the test to fail instead.

  • Add a new environmentd-command line option that will cause environmentd to panic if any of its children exit with an exit code that implies a crash or a panic happened on the child.
  • Make sure the new behavior is in effect for sqllogictests, cargo unit tests and mzcompose-based tests
  • do not panic environmentd in mzcompose workflows that explicitly use mz_panic() to panic a computed
  • Remove the KillReplica Action from Zippy, that was using mz_panic() internally. The original idea of this Action was to panic a single replica, but in practice mz_panic() is evaluated by all replicas and they all panic together.

Motivation

  • This PR fixes a previously unreported bug.

The fact that the process orchestrator is restarting panicked children masks bugs that should have been exposed during testing.

@philip-stoev
Copy link
Contributor Author

With this PR #16318 is visible directly in the main CI, not even Nightly is required:

https://buildkite.com/materialize/nightlies/builds/1422#0184c2d2-25f0-4a4e-983c-9335a512c612

@philip-stoev philip-stoev marked this pull request as ready for review November 29, 2022 11:50
Copy link
Member

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, except that I think the ReusePort is hiding a deeper bug! I think it's time for the process orchestrator to learn to use Unix domain sockets instead of TCP connections.

src/compute/src/bin/computed.rs Outdated Show resolved Hide resolved
src/orchestrator-process/src/lib.rs Outdated Show resolved Hide resolved
@philip-stoev philip-stoev force-pushed the environmentd-panic-on-panic branch 8 times, most recently from ce5186a to 222de04 Compare December 1, 2022 12:50
@philip-stoev
Copy link
Contributor Author

@benesch I would like to move this forward as soon as possible, it is now or never while the builds are green even with panic-on-crash enabled.

@benesch
Copy link
Member

benesch commented Dec 4, 2022

Yes, apologies, I just can't get behind setting SO_REUSEPORT! It has me spooked. I hear you on getting this in ASAP, though. I just pushed up a big rewrite of the process orchestrator in #15810 that will hopefully avoid all these problems. 🤞🏽

@benesch
Copy link
Member

benesch commented Dec 4, 2022

#15810 just merged. I've rebased this on top of #15810 (with a few modifications to naming/comments). 🤞🏽 that it passes!

@benesch benesch enabled auto-merge (rebase) December 4, 2022 18:38
@benesch
Copy link
Member

benesch commented Dec 5, 2022

@philip-stoev I'm so sorry but there appears to be at least one real bug here. The legacy upgrade tests are crashing with:

kafka-u32: some element of the Sink as_of frontier is too far advanced for our output-gating timestamp: as_of Antichain { elements: [1670181633879] }, gate_ts: 1670181537332', src/storage/src/sink/kafka.rs:1063:13

I'm not sure what's going wrong in cargo-test. It's of course absolutely impossible to see what's going on there since the logs are disabled. Sigh.

@philip-stoev
Copy link
Contributor Author

#15810 just merged. I've rebased this on top of #15810 (with a few modifications to naming/comments). 🤞🏽 that it passes!

@benesch can you push your rebase to GitHub?

@benesch
Copy link
Member

benesch commented Dec 5, 2022

I already did?

@philip-stoev
Copy link
Contributor Author

Oh, my local branch must have messed up then . Let me reset it.

@philip-stoev philip-stoev force-pushed the environmentd-panic-on-panic branch 2 times, most recently from 4e2b0f1 to 3ed6700 Compare December 5, 2022 14:04
@benesch
Copy link
Member

benesch commented Dec 5, 2022

Aha! It's test_github_12546 that's to blame. Which makes sense! That test calls mz_panic. I'm working on a fix now.

By default, the process orchestrator will restart any computed and
storageds that it manages. During testing, this will unfortunately
mask panics that should have caused the test to fail instead.

- Add a new environmentd-command line option that will cause environmentd
  to panic if any of its children exit with an exit code that implies
  a crash or a panic happened on the child.
- Make sure the new behavior is in effect for sqllogictests, cargo
  unit tests and mzcompose-based tests
- do not panic environmentd in mzcompose workflows that
  explicitly use mz_panic() to panic a computed
- Remove the KillReplica Action from Zippy, that was using mz_panic()
  internally. The original idea of this Action was to panic a single replica,
  but in practice mz_panic() is evaludated by all replicas and they all panic
  together.

Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
@benesch
Copy link
Member

benesch commented Dec 5, 2022

Just pushed up a fix to disable propagate_crashes in test_github_12546. I'm feeling good about this one!

@benesch benesch merged commit f260e88 into MaterializeInc:main Dec 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants