
storage-client: don't panic if reading Kafka secret fails #18563

Merged: 4 commits into MaterializeInc:main from secret-reading-errors on Apr 18, 2023

Conversation

@benesch (Member) commented Apr 3, 2023

The surrounding code paths have been sufficiently refactored such that it is now possible to gracefully return an error here.

Fix #18438.
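For illustration, a minimal sketch of the shape of the change, using a hypothetical `SecretsReader` trait and `anyhow` rather than the real storage-client types:

```rust
use anyhow::{anyhow, Context};

/// Hypothetical secrets API, standing in for the real secrets reader.
trait SecretsReader {
    fn read(&self, id: &str) -> Result<Vec<u8>, anyhow::Error>;
}

/// Before: any failure to read the secret aborted the process.
fn kafka_password_panicking(reader: &dyn SecretsReader, id: &str) -> String {
    let bytes = reader
        .read(id)
        .expect("reading kafka secret unexpectedly failed");
    String::from_utf8(bytes).expect("kafka secret is not valid UTF-8")
}

/// After: the failure is returned to the caller, which the refactored
/// surrounding code paths can now surface as an ordinary source error.
fn kafka_password(reader: &dyn SecretsReader, id: &str) -> Result<String, anyhow::Error> {
    let bytes = reader
        .read(id)
        .with_context(|| format!("reading kafka secret {id} failed"))?;
    String::from_utf8(bytes).map_err(|_| anyhow!("kafka secret {id} is not valid UTF-8"))
}
```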

Motivation

  • This PR fixes a recognized bug.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • Too small a bug to mention IMO.

@benesch benesch requested review from aljoscha, petrosagg, philip-stoev and a team April 3, 2023 15:20
@benesch (Member, Author) commented Apr 3, 2023

@philip-stoev would you like to try to test this somehow?

@aljoscha (Contributor) left a comment:


The change itself looks good! @philip-stoev We don't yet have infra for futzing with the secrets store for tests, do we?

@philip-stoev (Contributor) commented:

Let me cook up a test today; give me 1 hour.

@philip-stoev (Contributor) commented:

A fix is needed on the Postgres side too:

thread 'timely:work-0' panicked at 'Postgres connection unexpectedly missing secrets: ApiError: secrets "user-managed-u1" not found: NotFound (ErrorResponse { status: "Failure", message: "secrets \"user-managed-u1\" not found", reason: "NotFound", code: 404 })

@aljoscha (Contributor) commented Apr 4, 2023

Do you have a test ready that provokes this? It seems

@benesch (Member, Author) commented Apr 4, 2023

Updated with a fix for PostgreSQL too. @petrosagg you may want to take a look. The future::pending pattern is pretty gross; I feel like we should be able to do better.
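For readers following along, a rough sketch of the `future::pending` pattern referred to here, with hypothetical function names standing in for the real Postgres source rendering and health-reporting code:

```rust
use std::future;

/// Hypothetical stand-ins for loading the connection config (which requires
/// reading secrets) and for reporting a definite source error.
async fn load_postgres_connection() -> Result<String, anyhow::Error> {
    Err(anyhow::anyhow!("secrets \"user-managed-u1\" not found"))
}

async fn report_source_error(err: &anyhow::Error) {
    eprintln!("source entered errored state: {err:#}");
}

/// If the secrets backing the connection can't be read, report the error and
/// then suspend forever instead of panicking: the source produces no data,
/// but the rest of the dataflow keeps running and the error stays visible.
async fn render_postgres_source() {
    let _connection = match load_postgres_connection().await {
        Ok(conn) => conn,
        Err(err) => {
            report_source_error(&err).await;
            // `future::pending` never resolves, so this task parks here forever.
            return future::pending::<()>().await;
        }
    };
    // ... continue rendering the source with `_connection` ...
}
```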

@philip-stoev (Contributor) commented:

I am terribly sorry, but there is still a panic when trying a DROP CLUSTER on a cluster that has affected sources:

thread 'coordinator' panicked at 'Postgres source u6 missing secrets: ApiError: secrets "user-managed-u1" not found: NotFound (ErrorResponse { status: "Failure", message: "secrets \"user-managed-u1\" not found", reason: "NotFound", code: 404 })', /var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/ddl.rs:121:49
stack backtrace:
   0: rust_begin_unwind
             at ./rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at ./rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/panicking.rs:64:14
   2: <mz_adapter::coord::Coordinator>::catalog_transact_with::<<mz_adapter::coord::Coordinator>::catalog_transact::{closure#0}::{closure#0}::{closure#0}, ()>::{closure#0}::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/ddl.rs:121:49
   3: <core::result::Result<mz_postgres_util::Config, anyhow::Error>>::unwrap_or_else::<<mz_adapter::coord::Coordinator>::catalog_transact_with<<mz_adapter::coord::Coordinator>::catalog_transact::{closure#0}::{closure#0}::{closure#0}, ()>::{closure#0}::{closure#0}::{closure#0}>
             at ./rustc/fc594f15669680fa70d255faec3ca3fb507c3405/library/core/src/result.rs:1504:23
   4: <mz_adapter::coord::Coordinator>::catalog_transact_with::<<mz_adapter::coord::Coordinator>::catalog_transact::{closure#0}::{closure#0}::{closure#0}, ()>::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/ddl.rs:116:54
   5: <mz_adapter::coord::Coordinator>::catalog_transact_with::<<mz_adapter::coord::Coordinator>::catalog_transact::{closure#0}::{closure#0}::{closure#0}, ()>::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/ddl.rs:76:5
   6: <mz_adapter::coord::Coordinator>::catalog_transact::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/ddl.rs:63:61
   7: <mz_adapter::coord::Coordinator>::catalog_transact::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/ddl.rs:57:5
   8: <mz_adapter::coord::Coordinator>::sequence_drop_clusters::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/sequencer/inner.rs:1537:50
   9: <mz_adapter::coord::Coordinator>::sequence_plan::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/sequencer.rs:195:68
  10: <mz_adapter::coord::Coordinator>::sequence_plan::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/sequencer.rs:56:5
  11: <mz_adapter::coord::Coordinator>::handle_execute_inner::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/command_handler.rs:529:78
  12: <mz_adapter::coord::Coordinator>::handle_execute_inner::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/command_handler.rs:328:5
  13: <mz_adapter::coord::Coordinator>::handle_execute::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/command_handler.rs:325:61
  14: <mz_adapter::coord::Coordinator>::handle_execute::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/command_handler.rs:284:5
  15: <tracing::instrument::Instrumented<<mz_adapter::coord::Coordinator>::handle_execute::{closure#0}> as core::future::future::Future>::poll
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tracing-0.1.37/src/instrument.rs:272:9
  16: <mz_adapter::coord::Coordinator>::handle_command::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/command_handler.rs:75:21
  17: <mz_adapter::coord::Coordinator>::message_command::{closure#0}::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/message_handler.rs:248:33
  18: <mz_adapter::coord::Coordinator>::message_command::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/message_handler.rs:245:5
  19: <mz_adapter::coord::Coordinator>::handle_message::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord/message_handler.rs:44:63
  20: <mz_adapter::coord::Coordinator>::serve::{closure#0}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord.rs:1247:37
  21: <tokio::runtime::park::CachedParkThread>::block_on::<<mz_adapter::coord::Coordinator>::serve::{closure#0}>::{closure#0}
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.2/src/runtime/park.rs:283:63
  22: tokio::runtime::coop::with_budget::<core::task::poll::Poll<()>, <tokio::runtime::park::CachedParkThread>::block_on<<mz_adapter::coord::Coordinator>::serve::{closure#0}>::{closure#0}>
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.2/src/runtime/coop.rs:102:5
  23: tokio::runtime::coop::budget::<core::task::poll::Poll<()>, <tokio::runtime::park::CachedParkThread>::block_on<<mz_adapter::coord::Coordinator>::serve::{closure#0}>::{closure#0}>
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.2/src/runtime/coop.rs:68:5
  24: <tokio::runtime::park::CachedParkThread>::block_on::<<mz_adapter::coord::Coordinator>::serve::{closure#0}>
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.2/src/runtime/park.rs:283:31
  25: <tokio::runtime::context::BlockingRegionGuard>::block_on::<<mz_adapter::coord::Coordinator>::serve::{closure#0}>
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.2/src/runtime/context.rs:315:13
  26: <tokio::runtime::handle::Handle>::block_on::<<mz_adapter::coord::Coordinator>::serve::{closure#0}>
             at ./cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.24.2/src/runtime/handle.rs:264:9
  27: mz_adapter::coord::serve::{closure#0}::{closure#1}
             at ./var/lib/buildkite-agent/builds/buildkite-builders-d43b1b5-i-0c88690e6388bbf2d-1/materialize/tests/src/adapter/src/coord.rs:1456:17
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

@philip-stoev (Contributor) commented:

I pushed a cloudtest as a separate commit -- it currently fails because of the panic reported above.

@benesch (Member, Author) commented Apr 6, 2023

Thanks for the test, @philip-stoev. Aljoscha: I likely need someone else to take this over. Is someone from storage available?

@aljoscha (Contributor) commented Apr 6, 2023

I'll take it over.

@aljoscha aljoscha requested a review from a team as a code owner April 14, 2023 10:36
@aljoscha (Contributor) commented:

I rebased and fixed up based on the latest display_with_causes changes. I also added a commit that removes the panic, but I can't test it because I haven't yet gotten cloudtest to work on my machine.

A Contributor left a comment:

Surface changes look good!

@aljoscha aljoscha force-pushed the secret-reading-errors branch 4 times, most recently from c1005bc to d524b5f on April 14, 2023 18:15
@aljoscha (Contributor) commented:

@philip-stoev I fixed up our behaviour and the cloud test:

  • We now fail when dropping the cluster. We report back the errors, but the cluster is not dropped.
  • I have a test that reinstates the secrets and verifies that we can then drop the cluster.

@benesch At first glance, it might seem bad that you can't drop a cluster when secrets are not accessible. On the other hand, I think users should never see this because they normally can't mess with secret storage.

The other option is to go through with dropping the cluster. This will mean, however, that we leak the postgres resources that we are trying to clean up using those missing secrets.

wdyt?
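As a rough sketch of the behaviour described above (hypothetical names; the real change lives in the coordinator's DROP CLUSTER cleanup path):

```rust
use anyhow::Context;

/// Hypothetical stand-ins for a Postgres source's connection config
/// (`mz_postgres_util::Config` in the stack trace above) and its cleanup.
struct PostgresConfig;

fn load_config(source_id: &str) -> anyhow::Result<PostgresConfig> {
    // In the failing cloudtest, the secret backing this config was deleted.
    anyhow::bail!("secrets \"user-managed-{source_id}\" not found")
}

fn drop_replication_slots(_config: PostgresConfig) -> anyhow::Result<()> {
    Ok(())
}

/// If any source's secrets can't be read, the whole DROP CLUSTER returns an
/// error and nothing is dropped, rather than panicking or silently leaking
/// the upstream Postgres resources.
fn cleanup_postgres_sources(source_ids: &[&str]) -> anyhow::Result<()> {
    for id in source_ids {
        let config = load_config(id)
            .with_context(|| format!("Postgres source {id} missing secrets"))?;
        drop_replication_slots(config)?;
    }
    Ok(())
}
```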

@benesch (Member, Author) commented Apr 17, 2023

> @benesch At first glance, it might seem bad that you can't drop a cluster when secrets are not accessible. On the other hand, I think users should never see this because they normally can't mess with secret storage.
>
> The other option is to go through with dropping the cluster. This will mean, however, that we leak the postgres resources that we are trying to clean up using those missing secrets.
>
> wdyt?

I would much rather stick with the behavior that you've already got (not dropping a cluster with inaccessible secrets)!

@benesch (Member, Author) commented Apr 17, 2023

Thanks very much for pushing this over the line.

@aljoscha (Contributor) commented:

> I would much rather stick with the behavior that you've already got (not dropping a cluster with inaccessible secrets)!

Oh yes, absolutely! Sorry if this wasn't clear from how I described it.

@benesch (Member, Author) commented Apr 18, 2023

> > I would much rather stick with the behavior that you've already got (not dropping a cluster with inaccessible secrets)!
>
> Oh yes, absolutely! Sorry if this wasn't clear from how I described it.

It was! I was just trying to express enthusiasm for your choice!

@aljoscha aljoscha merged commit 7f6379d into MaterializeInc:main Apr 18, 2023
Merging this pull request closes issue #18438: "reading kafka secret unexpectedly failed" is a panic, rather than a source error.