Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Improve the usability and reliability of SSH TUNNELs #18490

Closed
9 of 23 tasks
guswynn opened this issue Mar 29, 2023 · 8 comments
Closed
9 of 23 tasks

[Epic] Improve the usability and reliability of SSH TUNNELs #18490

guswynn opened this issue Mar 29, 2023 · 8 comments
Labels
A-storage Area: storage Epic [Auto] Multi-stage issue T-stability Theme: robustness, resilience, and graceful failure
Milestone

Comments

@guswynn
Copy link
Contributor

guswynn commented Mar 29, 2023

Product Outcome

SSH tunnel connections are resilient and observable

Context

Over the past few weeks, its become clear there are usability, reliability, and debuggability issues with SSH TUNNELs, particularly when used with kafka. This issue tracks a set of improvements that can be made SSH TUNNELs:

Tasklist

Reliability

Usability

Debuggability (and documentation)

@guswynn guswynn added A-storage Area: storage T-stability Theme: robustness, resilience, and graceful failure labels Mar 29, 2023
@guswynn guswynn self-assigned this Mar 29, 2023
@benesch
Copy link
Member

benesch commented Mar 29, 2023

Harden tunnels by supporting multiple bastions

I'd propose skipping this one for now! It's a big lift to make it work reliably (how do you detect if a connection has stalled and needs to be restarted in the other SSH bastion?), and I think we have a good workaround (asking customers to ping their SSH bastions and restart them if they fail).

@guswynn
Copy link
Contributor Author

guswynn commented Mar 29, 2023

@benesch I mentioned the order of priority here: https://materializeinc.slack.com/archives/C04DJT084KA/p1680117095320179, ill transfer this context to the issue!

@benesch
Copy link
Member

benesch commented Mar 29, 2023

Awesome, thanks!

@sentry-io
Copy link

sentry-io bot commented Apr 3, 2023

Sentry issue: DATABASE-BACKEND-2P2

@dseisun-materialize dseisun-materialize added the Epic [Auto] Multi-stage issue label Apr 3, 2023
@dseisun-materialize dseisun-materialize modified the milestones: 2. Next, 1. Now Apr 3, 2023
@benesch
Copy link
Member

benesch commented Apr 4, 2023

Wanted to record some thoughts from debugging SSH tunnels over the weekend:

  • We need a more principled take on when and how to report SSH tunnel errors. Right now we'll report an initial SSH tunnel error, but not any future SSH tunnel errors.
    • Chatted with Gus about this a bit today. There's a perhaps not so bad path to improving this, starting by consistentizing Kafka errors. Everywhere we report a Kafka error, we should a) use a metadata fetch request and report only the errors from that API, and b) always prefer to report any SSH tunnel error rather than a Kafka error.
  • The reported error when a PostgreSQL connection fails is terrible (often just "failed to connect to remote host" regardless of whether the SSH tunnel failed, or the SSH credentials were wrong, or the PostgreSQL server was down, or the PostgreSQL credentials were wrong). AFAICT, the error message details are available, but we store the error in PlanError::FetchingPostgresPublicationInfoFailed, which embeds a PostgresError::Generic, which embeds an anyhow::Error, which contains "failed to connect to remote host" as the top-level error, with the true cause embedded within. Convincing this error to print its causal chain is REALLY hard. You can't convert Arc<PostgresError> to an anyhow::Error, which means you can't use anyhow's support for printing causal chains. I think we need a rule like "you can never put an anyhow::Error instead another error type; you can only use them at the top level" since otherwise it all falls apart.

@aljoscha
Copy link
Contributor

aljoscha commented Apr 4, 2023

yes! There is #17826, which we still need to flesh out

@aljoscha
Copy link
Contributor

aljoscha commented Apr 4, 2023

* The reported error when a PostgreSQL connection fails is terrible (often just "failed to connect to remote host" regardless of whether the SSH tunnel failed, or the SSH credentials were wrong, or the PostgreSQL server was down, or the PostgreSQL credentials were wrong). AFAICT, the error message details _are_ available, but we store the error in [`PlanError::FetchingPostgresPublicationInfoFailed`](https://github.com/MaterializeInc/materialize/blob/bc93581a3680371a8f51345284a6674003823d76/src/sql/src/plan/error.rs#L110), which embeds a `PostgresError::Generic`, which embeds an `anyhow::Error`, which contains "failed to connect to remote host" as the top-level error, with the true cause _embedded within_. Convincing this error to print its causal chain is REALLY hard. You can't convert `Arc<PostgresError>` to an `anyhow::Error`, which means you can't use anyhow's support for printing causal chains. I think we need a rule like "you can never put an anyhow::Error instead another error type; you can only use them at the top level" since otherwise it all falls apart.

I think we can actually print the chain of errors, see #18583. We might have to do some more work to ensure that all the errors are actually there.

For PostgresError::Generic this works because the variants are all #[error(transparent)] so we get the Display impl of the wrapped error. @benesch I think your example uses PostgresError::Ssh or PostgresError::SshIo, for which this doesn't work. We'd have to write a custom Display impl for PostgresError to get those to work nicely, I'm afraid (typing that up quickly, to see).

@dseisun-materialize dseisun-materialize changed the title Mini Epic: Improve the usability and reliability of SSH TUNNELs [Epic] Improve the usability and reliability of SSH TUNNELs Sep 26, 2023
@dseisun-materialize dseisun-materialize modified the milestones: 1. Now, 3. Later Oct 18, 2023
@benesch
Copy link
Member

benesch commented Jan 28, 2024

Closing as this is now complete enough!

@benesch benesch closed this as completed Jan 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Area: storage Epic [Auto] Multi-stage issue T-stability Theme: robustness, resilience, and graceful failure
Projects
None yet
Development

No branches or pull requests

4 participants