test: Fix process orchestrator port re-use #15336

Closed
jkosh44 opened this issue Oct 11, 2022 · 1 comment
Labels
C-bug Category: something is broken T-testing Theme: tests or test infrastructure

jkosh44 commented Oct 11, 2022

What version of Materialize are you using?

main

How did you install Materialize?

Built from source

What is the issue?

In https://github.com/MaterializeInc/materialize/pull/15316/files#diff-818d26f34fdb3ab6b0892e482b0a1231780a2321f7981fc96912e022bc1d6a6d we discovered that the process orchestrator was panicking when it attempted to re-use ports. As a stopgap, we now kill the existing process and restart it on another port. We should still figure out why the port re-use was happening in the first place and fix it.

As per the PR comment:

I also hacked in a fix for the process orchestrator that just kills the existing process and finds it new ports when it falls out of sync during port allocation. This worked well enough in my local testing—and also made the basic case of killall -9 environmentd still work as expected; all clusters are readopted. We should revisit this soon too, though.
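To make the "find new ports" half of that workaround concrete, here is a minimal Python sketch. It is not the process orchestrator's code: the function names `port_is_free`, `pick_free_port`, and `ensure_port` are hypothetical, and killing the stale process is left out. It only shows one way to detect that an assigned port is no longer bindable and fall back to a port chosen by the OS.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket already holds the port.
            return False
    return True


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def ensure_port(assigned: int, host: str = "127.0.0.1") -> int:
    """Keep the assigned port if it is still bindable, else pick a new one.

    Hypothetical helper: a real orchestrator would also stop the process
    that was supposed to own `assigned` before restarting it elsewhere.
    """
    if port_is_free(assigned, host):
        return assigned
    return pick_free_port(host)
```

Note that probing and then binding later is inherently racy: another process can take the port between the check and the real `bind()`, so a sketch like this reduces collisions but does not rule them out.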

Relevant log output

No response

@jkosh44 jkosh44 added C-bug Category: something is broken T-testing Theme: tests or test infrastructure C-triage labels Oct 11, 2022
@philip-stoev
Contributor

This is a product-side issue, so I am removing the QA team. A workaround has been implemented on the testing side in #15800.

@ggnall ggnall removed the C-triage label Nov 22, 2022
philip-stoev added a commit to philip-stoev/materialize that referenced this issue Nov 29, 2022
In scenarios where replicas and sources are rapidly killed and restarted,
computed and storaged may fail to bind to their assigned HTTP port
if that port has just been freed by some other process.

Previously, this would cause the process to panic and be restarted
by the process orchestrator. Now that all panics are fatal, set the
SO_REUSEADDR socket option so that bind() succeeds instead.

Relates to: MaterializeInc#15336
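
For readers unfamiliar with the socket option named in the commit message, the following is a minimal Python sketch of setting SO_REUSEADDR before bind(). It is an illustration only, not the code from the commit; the helper name `bind_listener` and the loopback address are assumptions made for the example.

```python
import socket


def bind_listener(port: int, reuse_addr: bool = True) -> socket.socket:
    """Create a TCP listener on 127.0.0.1:port, optionally with SO_REUSEADDR."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_addr:
            # The option must be set before bind() to have any effect.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        # Do not leak the file descriptor if bind() or listen() fails.
        sock.close()
        raise
    return sock
```

A caller would pass the assigned port, for example `listener = bind_listener(port)`. On typical Unix-like systems the option lets bind() succeed while old connections on that port are still in TIME_WAIT; it generally does not let two live listeners share the same address, and its semantics differ on Windows.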
philip-stoev added eight more commits to philip-stoev/materialize that referenced this issue, each with the same message as above: three on Nov 30, 2022 and five on Dec 1, 2022.
@benesch benesch closed this as completed in ac1028d Dec 4, 2022