
environmentd: panicked at 'port re-use', src/orchestrator-process/src/lib.rs:185:37 when restarted #13235

Closed
philip-stoev opened this issue Jun 23, 2022 · 9 comments
Labels
C-bug Category: something is broken

Comments

@philip-stoev
Contributor

philip-stoev commented Jun 23, 2022

What version of Materialize are you using?

v0.26.1-dev

How did you install Materialize?

Docker image

What is the issue?

If environmentd is restarted after a CREATE CLUSTER command has been executed, it panics with 'port re-use' during startup.

Relevant log output

   0: rust_begin_unwind
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panicking.rs:143:14
   2: <mz_orchestrator_process::NamespacedProcessOrchestrator as mz_orchestrator::NamespacedOrchestrator>::ensure_service::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0857afdeab3487c2d-1/materialize/tests/src/orchestrator-process/src/lib.rs:185:37
   3: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91:19
   4: <core::pin::Pin<P> as core::future::future::Future>::poll
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/future.rs:124:9
   5: <mz_orchestrator_tracing::NamespacedTracingOrchestrator as mz_orchestrator::NamespacedOrchestrator>::ensure_service::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0857afdeab3487c2d-1/materialize/tests/src/orchestrator-tracing/src/lib.rs:321:54
   6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91:19
   7: <core::pin::Pin<P> as core::future::future::Future>::poll
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/future.rs:124:9
   8: mz_dataflow_types::client::controller::Controller<T>::add_replica_to_instance::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0857afdeab3487c2d-1/materialize/tests/src/dataflow-types/src/client/controller.rs:317:22
   9: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91:19
  10: mz_coord::coord::Coordinator<S>::bootstrap::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0857afdeab3487c2d-1/materialize/tests/src/coord/src/coord.rs:678:86
  11: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91:19
  12: tokio::park::thread::CachedParkThread::block_on::{{closure}}
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/park/thread.rs:263:54
  13: tokio::coop::with_budget::{{closure}}
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/coop.rs:102:9
  14: std::thread::local::LocalKey<T>::try_with
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/thread/local.rs:442:16
  15: std::thread::local::LocalKey<T>::with
             at ./rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/thread/local.rs:418:9
  16: tokio::coop::with_budget
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/coop.rs:95:5
  17: tokio::coop::budget
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/coop.rs:72:5
  18: tokio::park::thread::CachedParkThread::block_on
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/park/thread.rs:263:31
worker 1:	error connecting to worker 0: Cannot assign requested address (os error 99); retrying
  19: tokio::runtime::enter::Enter::block_on
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/enter.rs:151:13
  20: tokio::runtime::handle::Handle::block_on
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/handle.rs:296:9
  21: mz_coord::coord::serve::{{closure}}::{{closure}}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0857afdeab3487c2d-1/materialize/tests/src/coord/src/coord.rs:5445:29
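For context, the panic fires in the process orchestrator's ensure_service path when a port recorded for a service is already tracked. A minimal sketch of the failing pattern, assuming the orchestrator keeps claimed ports in a set (the names and structure below are illustrative assumptions, not the actual Materialize code):

use std::collections::HashSet;

// Hypothetical bookkeeping resembling what a process orchestrator might
// do when it (re)creates a service and claims its listen ports.
struct PortRegistry {
    in_use: HashSet<u16>,
}

impl PortRegistry {
    fn claim(&mut self, port: u16) {
        // insert() returns false if the port is already present, e.g.
        // because a stale metadata file from a previous run still lists
        // it; the assertion then panics like the trace above.
        assert!(self.in_use.insert(port), "port re-use");
    }
}

fn main() {
    let mut registry = PortRegistry { in_use: HashSet::new() };
    registry.claim(2100);
    registry.claim(2100); // panics: "port re-use"
}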
@philip-stoev philip-stoev added the C-bug Category: something is broken label Jun 23, 2022
@philip-stoev
Contributor Author

To reproduce, run this test a couple of times:

from materialize.mzcompose import Composition
from materialize.mzcompose.services import Materialized

SERVICES = [
    Materialized(),
]


def workflow_default(c: Composition) -> None:
    c.up("materialized")
    c.wait_for_materialized()
    c.sql("CREATE CLUSTER create_cluster1 REPLICAS (replica1 (SIZE '2-2'));")
    c.kill("materialized")
    # Restart; environmentd panics with 'port re-use' during bootstrap.
    c.up("materialized")
    c.wait_for_materialized()

@philip-stoev
Contributor Author

I tried adding ample sleeps between the steps, but this does not fix the issue.

@jkosh44
Contributor

jkosh44 commented Jun 23, 2022

I have not been able to reproduce this locally. Does it work if you remove the CREATE CLUSTER command?

@philip-stoev
Contributor Author

@jkosh44 Can you please try this test case:

cd test/platform-checks
./mzcompose run default --check DropCluster --scenario RestartMz

philip-stoev added a commit to philip-stoev/materialize that referenced this issue Jul 5, 2022
Disable the ClusterCreate/ClusterDrop/ReplicateCreate/ReplicaDrop
due to MaterializeInc#13235 that makes those checks incompatible with whole-Mz restart
@jkosh44
Contributor

jkosh44 commented Jul 5, 2022

@philip-stoev Is this on main? I'm getting an error about RestartMz not being a valid scenario

mzcompose: test case workflow-default failed:
Traceback (most recent call last):
  File "/home/joe/materialize/misc/python/materialize/mzcompose/__init__.py", line 381, in test_case
    yield
  File "/home/joe/materialize/misc/python/materialize/cli/mzcompose.py", line 577, in handle_composition
    composition.workflow(args.workflow, *args.unknown_subargs[1:])
  File "/home/joe/materialize/misc/python/materialize/mzcompose/__init__.py", line 293, in workflow
    func(self, parser)
  File "/home/joe/materialize/test/platform-checks/mzcompose.py", line 79, in workflow_default
    [globals()[args.scenario]] if args.scenario else Scenario.__subclasses__()
KeyError: 'RestartMz'
mzcompose: error: at least one test case failed

@philip-stoev
Contributor Author

It is now called RestartEntireMz. Apologies!

@jkosh44
Contributor

jkosh44 commented Jul 5, 2022

No worries, that worked for me. That test also passes for me locally though.

(venv) joe@joe-ThinkPad-X1-Extreme-Gen-3:~/materialize/test/platform-checks (main)$ ./mzcompose down -v; ./mzcompose --dev run default --check DropCluster --scenario RestartEntireMz
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-4THNJPSPB4YHYJOUIJTR42EIS76SSFGH
materialize/materialized:mzbuild-NBHTK7VYUQKAHR5Z55MLWVVDJY3J6ISO
materialize/testdrive:mzbuild-MPMEYXIUASWESPLT7B2ZKKD4DK6SJVWI
==> Delegating to Docker Compose
$ docker compose down -v --remove-orphans
[+] Running 8/8
 ⠿ Container platform-checks-materialized-1  Removed                                                                                                                                                     0.7s
 ⠿ Container platform-checks-testdrive-1     Removed                                                                                                                                                     0.4s
 ⠿ Container platform-checks-redpanda-1      Removed                                                                                                                                                     0.7s
 ⠿ Volume platform-checks_tmp                Removed                                                                                                                                                     0.0s
 ⠿ Volume platform-checks_mzdata             Removed                                                                                                                                                     0.0s
 ⠿ Network platform-checks_default           Removed                                                                                                                                                     0.2s
 ⠿ Volume platform-checks_pgdata             Removed                                                                                                                                                     0.1s
 ⠿ Volume platform-checks_mydata             Removed                                                                                                                                                     0.1s
==> Collecting mzbuild images
materialize/ubuntu-base:mzbuild-4THNJPSPB4YHYJOUIJTR42EIS76SSFGH
materialize/materialized:mzbuild-BFDZXS7LW43U24BIIXPGUYVLM5CSA5ZJ
materialize/testdrive:mzbuild-GB4VPXQ73ION5WGRSP2LFX2GXU3UJKEI
==> Running test case workflow-default
==> Running workflow default
$ docker compose up --detach testdrive
[+] Running 6/6
 ⠿ Network platform-checks_default        Created                                                                                                                                                        0.0s
 ⠿ Volume "platform-checks_mydata"        Created                                                                                                                                                        0.0s
 ⠿ Volume "platform-checks_tmp"           Created                                                                                                                                                        0.0s
 ⠿ Volume "platform-checks_mzdata"        Created                                                                                                                                                        0.0s
 ⠿ Volume "platform-checks_pgdata"        Created                                                                                                                                                        0.0s
 ⠿ Container platform-checks-testdrive-1  Started                                                                                                                                                        0.4s
$ docker compose up --detach redpanda
[+] Running 1/1
 ⠿ Container platform-checks-redpanda-1  Started                                                                                                                                                         0.4s
C> waiting for redpanda:9092 success!
C> waiting for redpanda:8081 success!
Testing upgrade scenario <class 'materialize.checks.scenarios.RestartEntireMz'>
$ docker compose up --detach materialized
[+] Running 1/1
 ⠿ Container platform-checks-materialized-1  Started                                                                                                                                                     0.5s
$ docker compose port materialized 6875
C> waiting for dbname=materialize host=localhost port=49181 user=materialize password=postgres to handle 'SELECT 1' 59 59 59 59 59 59 59 59 59 59 58 58 58 58 58 58 58 58 58 58 57 57 57 57 57 57success!
Running initialize() from <materialize.checks.cluster.DropCluster object at 0x7f1700918fd0>
$ docker compose exec -T testdrive testdrive --kafka-addr=kafka:9092 --schema-registry-url=http://schema-registry:8081 --materialize-url=postgres://materialize@materialized:6875 --aws-endpoint=http://localstack:4566 --validate-data-dir=/mzdata --no-reset --default-timeout=300s --seed=1
--- <stdin>
+++ testdrive completed successfully.
$ docker compose kill -sSIGKILL materialized
[+] Running 1/1
 ⠿ Container platform-checks-materialized-1  Killed                                                                                                                                                      0.3s
$ docker compose up --detach materialized
[+] Running 1/1
 ⠿ Container platform-checks-materialized-1  Started                                                                                                                                                     0.4s
$ docker compose port materialized 6875
C> waiting for dbname=materialize host=localhost port=49184 user=materialize password=postgres to handle 'SELECT 1' 59 59 59 59 59 59 59 59 59 59 58 58 58 58 58 58 58 58 58 58 57 57 57 57 57 57success!
Running manipulate() from <materialize.checks.cluster.DropCluster object at 0x7f1700936d30>
$ docker compose exec -T testdrive testdrive --kafka-addr=kafka:9092 --schema-registry-url=http://schema-registry:8081 --materialize-url=postgres://materialize@materialized:6875 --aws-endpoint=http://localstack:4566 --validate-data-dir=/mzdata --no-reset --default-timeout=300s --seed=1
--- <stdin>
> CREATE TABLE drop_cluster1_table (f1 INTEGER);
rows match; continuing at ts 1657038594.0260918
> CREATE TABLE drop_cluster2_table (f1 INTEGER);
rows match; continuing at ts 1657038594.1014009
> INSERT INTO drop_cluster1_table VALUES (123);
rows match; continuing at ts 1657038594.146248
> INSERT INTO drop_cluster2_table VALUES (234);
rows match; continuing at ts 1657038594.1865683
> CREATE CLUSTER drop_cluster1 REPLICAS (replica1 (SIZE '2-2'));
rows match; continuing at ts 1657038594.3066893
> CREATE CLUSTER drop_cluster2 REPLICAS (replica1 (SIZE '2-2'));
rows match; continuing at ts 1657038594.456528
> SET cluster=drop_cluster1
rows match; continuing at ts 1657038594.5022218
> CREATE DEFAULT INDEX ON drop_cluster1_table;
rows match; continuing at ts 1657038594.5507228
> CREATE MATERIALIZED VIEW drop_cluster1_view AS SELECT SUM(f1) FROM drop_cluster1_table;
rows match; continuing at ts 1657038594.6595042
> SET cluster=drop_cluster2
rows match; continuing at ts 1657038594.7040997
> CREATE DEFAULT INDEX ON drop_cluster2_table;
rows match; continuing at ts 1657038594.7474813
> CREATE MATERIALIZED VIEW drop_cluster2_view AS SELECT SUM(f1) FROM drop_cluster2_table;
rows match; continuing at ts 1657038594.8669329
> DROP CLUSTER drop_cluster1 CASCADE;
rows match; continuing at ts 1657038594.9871464
+++ testdrive completed successfully.
$ docker compose kill -sSIGKILL materialized
[+] Running 1/1
 ⠿ Container platform-checks-materialized-1  Killed                                                                                                                                                      0.3s
$ docker compose up --detach materialized
[+] Running 1/1
 ⠿ Container platform-checks-materialized-1  Started                                                                                                                                                     0.3s
$ docker compose port materialized 6875
C> waiting for dbname=materialize host=localhost port=49187 user=materialize password=postgres to handle 'SELECT 1' 59 59 59 59 59 59 59 59 59 59 58 58 58 58 58 58 58 58 58 58 57 57 57 57 57 57success!
Running manipulate() from <materialize.checks.cluster.DropCluster object at 0x7f1700936c70>
$ docker compose exec -T testdrive testdrive --kafka-addr=kafka:9092 --schema-registry-url=http://schema-registry:8081 --materialize-url=postgres://materialize@materialized:6875 --aws-endpoint=http://localstack:4566 --validate-data-dir=/mzdata --no-reset --default-timeout=300s --seed=1
--- <stdin>
> DROP CLUSTER drop_cluster2 CASCADE;
rows match; continuing at ts 1657038600.313309
+++ testdrive completed successfully.
$ docker compose kill -sSIGKILL materialized
[+] Running 1/1
 ⠿ Container platform-checks-materialized-1  Killed                                                                                                                                                      0.3s
$ docker compose up --detach materialized
[+] Running 1/1
 ⠿ Container platform-checks-materialized-1  Started                                                                                                                                                     0.4s
$ docker compose port materialized 6875
C> waiting for dbname=materialize host=localhost port=49190 user=materialize password=postgres to handle 'SELECT 1' 59 59 59 59 59 59 59 59 59 59 58 58 58 58 58 58 58 58 58 58 57 57 57 57 57 57success!
Running validate() from <materialize.checks.cluster.DropCluster object at 0x7f170092a220>
$ docker compose exec -T testdrive testdrive --kafka-addr=kafka:9092 --schema-registry-url=http://schema-registry:8081 --materialize-url=postgres://materialize@materialized:6875 --aws-endpoint=http://localstack:4566 --validate-data-dir=/mzdata --no-reset --default-timeout=300s --seed=1
--- <stdin>
> SET cluster=drop_cluster1
rows match; continuing at ts 1657038605.172902
> SET cluster=create_cluster2
rows match; continuing at ts 1657038605.17517
> SET cluster=default
rows match; continuing at ts 1657038605.1768987
> SELECT * FROM drop_cluster1_table;
rows match; continuing at ts 1657038605.2213411
> SELECT * FROM drop_cluster1_view;
rows match; continuing at ts 1657038605.2611177
> SELECT * FROM drop_cluster2_table;
rows match; continuing at ts 1657038605.296245
> SELECT * FROM drop_cluster2_view;
rows match; continuing at ts 1657038605.3391085
+++ testdrive completed successfully.
==> mzcompose: test case workflow-default succeeded

@jkosh44
Contributor

jkosh44 commented Jul 5, 2022

I think the cause of the panic would be that there are multiple metadata files, each with the same ports in them. I'm not sure how we'd end up in that scenario though. One possibility is that we just need to call fsync after dropping the old files.
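As a sketch of that idea: durably replacing a metadata file typically means fsyncing both the file and its containing directory, so that a crash cannot leave the old and new copies visible side by side. The helper below is a hypothetical illustration under those assumptions, not the actual Materialize code:

use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

// Hypothetical helper: write a service metadata file such that, after
// a crash, at most one copy of it is visible on disk.
fn write_metadata_durably(dir: &Path, name: &str, contents: &[u8]) -> std::io::Result<()> {
    let tmp = dir.join(format!("{name}.tmp"));
    let mut f = File::create(&tmp)?;
    f.write_all(contents)?;
    f.sync_all()?; // flush the file data to disk
    fs::rename(&tmp, dir.join(name))?; // atomically replace the old file
    File::open(dir)?.sync_all()?; // fsync the directory entry (Unix)
    Ok(())
}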

@benesch
Member

benesch commented Jan 23, 2023

Presumably fixed by #15810.

@benesch benesch closed this as completed Jan 23, 2023
def- added a commit to antiguru/materialize that referenced this issue Jun 2, 2023
Also reenable CreateCluster, DropCluster, CreateReplica, DropReplica as
they appear fixed, see MaterializeInc#13235
antiguru pushed a commit to antiguru/materialize that referenced this issue Jun 5, 2023
Also reenable CreateCluster, DropCluster, CreateReplica, DropReplica as
they appear fixed, see MaterializeInc#13235
antiguru pushed a commit to antiguru/materialize that referenced this issue Jun 6, 2023
Also reenable CreateCluster, DropCluster, CreateReplica, DropReplica as
they appear fixed, see MaterializeInc#13235