[sinks] Fail open when we can't fetch Kafka config #18853

Merged
merged 4 commits into from
Apr 21, 2023


bkirwi
Contributor

@bkirwi bkirwi commented Apr 19, 2023

Motivation

This PR fixes a previously unreported bug.

Kafka allows specifying the partition count and replication factor when creating a topic. When -1 is specified, it will use a broker-wide default value. However, older versions of Kafka don't have the -1 defaulting behaviour, so as a workaround we fetch the broker config and apply the defaults manually.

If fetching the config fails, we fail the entire ensure-creation request, even though it is actually fairly likely to succeed: either because the topic exists, or because the Kafka version is recent enough to handle these defaulted configs. This PR "fails open", replacing the error with a warning log and continuing on.
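The fail-open behaviour described above might be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `TopicSettings`, `resolve_settings`, and `pick` are hypothetical names, the fetch result is modelled as a plain `Result<_, String>`, and `eprintln!` stands in for a `warn!` log line.

```rust
/// Topic settings as requested by the user; -1 means "use the broker default".
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TopicSettings {
    pub partition_count: i32,
    pub replication_factor: i32,
}

/// Fail-open resolution: an error while fetching the broker defaults is
/// logged and otherwise ignored, instead of failing the whole request.
pub fn resolve_settings(
    requested: TopicSettings,
    fetched_defaults: Result<TopicSettings, String>,
) -> TopicSettings {
    let defaults = match fetched_defaults {
        Ok(defaults) => defaults,
        Err(e) => {
            // Stand-in for a `warn!` log line.
            eprintln!("failed to discover default topic configs: {e}");
            // Keep any -1 placeholders: the topic may already exist, or the
            // broker may be recent enough to apply its own defaults.
            return requested;
        }
    };
    TopicSettings {
        partition_count: pick(requested.partition_count, defaults.partition_count),
        replication_factor: pick(requested.replication_factor, defaults.replication_factor),
    }
}

/// Uses the fetched default only when the requested value is the -1 placeholder.
fn pick(requested: i32, default: i32) -> i32 {
    if requested == -1 {
        default
    } else {
        requested
    }
}
```

With this shape, a failed fetch returns the requested settings unchanged, so topic creation is still attempted with whatever the user asked for.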

Tips for reviewer

This is mostly code movement; you may want to hide whitespace changes with ?w=1.

I'll try and reproduce this locally before merging.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@bkirwi bkirwi requested review from a team, benesch and sploiselle April 19, 2023 21:35
Member

@benesch benesch left a comment

Code change LGTM. Do you think it's possible to write a test for this?

@bkirwi
Contributor Author

bkirwi commented Apr 19, 2023

Thanks for the look!

> Do you think it's possible to write a test for this?

Theoretically! Though I'll be more confident once I've been able to reproduce this locally.

@bkirwi
Contributor Author

bkirwi commented Apr 19, 2023

Okay, I have reproduced this locally.

  • I could not reproduce this by leaving the config unset in the broker configuration. Even when the explicit config is removed from server.properties, it appears to fall back to a value of 1.
  • However! I did manage to produce this error by blocking DescribeConfigs on the cluster resource via ACLs. (I think the fact that we're getting an unhelpful error may be a rdkafka bug, but this looks like it would fail either way.) This seems to be the only place where we need this ACL, at least in the sink; with the patch on this branch, the sink is able to start and make progress.

Writing an automated test for this will be annoying and I don't think I can get it out today, though I should be able to follow up on it later.

@benesch
Member

benesch commented Apr 20, 2023

> Writing an automated test for this will be annoying and I don't think I can get it out today, though I should be able to follow up on it later.

@philip-stoev or someone else from the QA team can probably help with this!

@philip-stoev philip-stoev self-requested a review April 20, 2023 06:19
@philip-stoev
Contributor

@bkirwi can you share how you blocked DescribeConfigs when you ran things manually?

@benesch
Member

benesch commented Apr 20, 2023

@philip-stoev here's a good guide on the subject: https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-demo-acl-authorization.html

tl;dr is set authorizer.class.name=kafka.security.authorizer.AclAuthorizer in the properties, and then add an ACL that allows writing to a topic but not describing configs. I think something like this would do the trick:

kafka-acls.sh \
  --bootstrap-server :9093 \
  --add \
  --allow-principal User:CN=whoever \
  --operation Write \
  --topic '*' \
  --command-config /tmp/kafka-ssl-demo/root.properties

Probably want to adapt one of the existing kafka-auth tests, since ACLs only work in tandem with authentication.

@philip-stoev
Contributor

I gave a Deny ACL to "ALL" against all topics for a user, and yet CREATE SINK fails all the way at:

materialize=> CREATE SINK no_describe_config_sink FROM no_describe_config_view INTO KAFKA CONNECTION no_describe_config_kafka_conn (TOPIC 'testdrive-no-describe-config-XXXXX', replication factor = -1 , partition count = -1 ) FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY CONNECTION no_describe_config_csr_conn ENVELOPE DEBEZIUM;
ERROR:  error registering kafka topic for sink: Error creating topic testdrive-no-describe-config-XXXXX for sink: unable to fetch topic metadata after creation

which I think is too late. Please stand by.

@philip-stoev
Contributor

philip-stoev commented Apr 20, 2023

Ok so I was able to create a Kafka user that is not allowed to run DescribeConfigs against the cluster resource:

kafka-acls \
  --bootstrap-server kafka:9092 \
  --add \
  --deny-principal User:CN=no_describe_config \
  --operation DescribeConfigs \
  --cluster \
  --command-config //tmp/foo.properties

However, this does not cause the expected warning path to trigger. In the discover_topic_configs method, I get the following:

For the configs variable:

[Ok(ConfigResource { specifier: Broker(1), entries: [] })]

For the config variable:

ConfigResource { specifier: Broker(1), entries: [] }

In other words, with a restrictive ACL, client.describe_configs returns an empty list of entries rather than an Err or some other failure. (With permissive ACLs, the entire configuration is returned, as expected.)

Therefore, none of the error paths in the method will trigger. The for entry in config.entries loop exits immediately because there are no entries, so nothing is matched, and the method returns the -1 defaults for both settings (as defined at the top of the method) wrapped in an Ok(). So the:

warn!("Failed to discover default values for topic configs: {e}");

is never reached.
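The control flow described above can be illustrated with a small stand-alone sketch. It is not the actual discover_topic_configs method: `ConfigEntry` and `discover_defaults` are hypothetical stand-ins, and the assumption that the loop matches on the broker settings `num.partitions` and `default.replication.factor` is mine rather than something stated in this thread.

```rust
/// Hypothetical stand-in for one entry in a describe-configs response.
pub struct ConfigEntry {
    pub name: String,
    pub value: Option<String>,
}

/// Returns (partition_count, replication_factor); -1 means "not discovered".
pub fn discover_defaults(entries: &[ConfigEntry]) -> Result<(i32, i32), String> {
    // Both values start out as the -1 placeholder.
    let mut partition_count: i32 = -1;
    let mut replication_factor: i32 = -1;

    // With an empty `entries` slice this loop body never runs, so no error
    // path (the `?` on a failed parse) can trigger.
    for entry in entries {
        match (entry.name.as_str(), entry.value.as_deref()) {
            ("num.partitions", Some(v)) => {
                partition_count = v.parse::<i32>().map_err(|e| e.to_string())?;
            }
            ("default.replication.factor", Some(v)) => {
                replication_factor = v.parse::<i32>().map_err(|e| e.to_string())?;
            }
            _ => {}
        }
    }

    // For empty input this is Ok((-1, -1)): a success carrying no information.
    Ok((partition_count, replication_factor))
}
```

Calling `discover_defaults(&[])` returns `Ok((-1, -1))`, which is why a caller that only logs on `Err` would never emit its warning in this situation.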

In fact, from a cursory examination of the code, rdkafka's describe_configs method is unlikely to return an Err; there is only one ? operator in the entire thing.

That said, I have what I think are all the pieces required to construct an mzcompose test, so I will push one tomorrow.

@bkirwi
Contributor Author

bkirwi commented Apr 20, 2023

Yep, that sounds like the error condition I was seeing! My hypothesis is that the broker is passing along an error code that rdkafka is ignoring. (When I make the equivalent call from a JVM client, I get an exception.) But it seems clear that either way this array is empty.

> However, this does not cause the expected warning path to trigger.

Is the surprise that the log entry is not showing up? I removed those errors since we're failing open anyways. But I can put them back if that would be more clear!

In either case, it sounds like your test would fail to create the sink on main but pass on this branch? If so that definitely sounds worth merging in.

It would be very strange if the actual cluster config was empty, so this
will log an error in that case. (Since we fail open, the overall
behaviour is unchanged.)
@bkirwi
Contributor Author

bkirwi commented Apr 20, 2023

> But I can put them back if that would be more clear!

I pushed a change to error when the resulting config list for our cluster is empty, since it seems unlikely that this will ever happen in a non-error case. Let me know if this is what you had in mind!
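One way to picture the empty-config check is the sketch below. It is an assumption-laden illustration rather than the change that was pushed: `ResourceResult` and `usable_entries` are hypothetical names, config entries are reduced to (name, value) string pairs, and `eprintln!` stands in for an `error!` log line.

```rust
/// Hypothetical per-resource outcome of a describe-configs call:
/// the entries as (name, value) pairs, or an error message.
pub type ResourceResult = Result<Vec<(String, String)>, String>;

/// Returns the entries when they look usable, or None when the caller
/// should keep the -1 placeholders and carry on (fail open).
pub fn usable_entries(result: ResourceResult) -> Option<Vec<(String, String)>> {
    match result {
        // A real cluster config should never be empty, so treat this as a
        // likely permissions problem and say so loudly.
        Ok(entries) if entries.is_empty() => {
            eprintln!("cluster config is empty; is DescribeConfigs permitted?");
            None
        }
        Ok(entries) => Some(entries),
        Err(e) => {
            eprintln!("failed to fetch cluster config: {e}");
            None
        }
    }
}
```

Both the empty case and the error case return `None`, so the overall behaviour stays the same as before; only the logging differs.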

@dseisun-materialize dseisun-materialize added the release-blocker Critical issue that should block *any* release if not fixed label Apr 20, 2023
Contributor

@philip-stoev philip-stoev left a comment

I pushed a test that passes with this patch but fails on main.

@aljoscha
Contributor

tyvm, @philip-stoev! I was hoping you already had a test up your sleeve 😅

@philip-stoev
Contributor

I did not have a test, but I have gained something -- a deep, simmering hatred for Kafka authentication.

@petrosagg petrosagg merged commit 4bc81cb into MaterializeInc:main Apr 21, 2023
@bkirwi bkirwi deleted the try-configs branch April 21, 2023 15:41

6 participants