-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
Forward kafka cluster options to clients #21275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the syntax errors, this seems fine to me.
Deferring to @mattrobenolt and @tkaemming as our resident Kafka experts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks legit to me. Still waiting to hear back from @mattrobenolt and @tkaemming before merging though.
Thanks a lot for looking into this @alexef!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @alexef, happy to see you here as well :D
Thanks for doing this on Sentry as well.
There are a couple of things to figure out.
- the order of the override, please see the comment inline.
- there is a potential backward compatibility issue I would like @BYK to chime in on because it impact self hosted users. Today we are already applying the kafka connection parameters from the settings for producers (https://github.com/getsentry/sentry/blob/master/src/sentry/utils/kafka.py#L34-L35) Which means self hosted users may have already overridden their connection parameters. With this PR, at the next release, those overridden parameters will quietly start being applied to consumers as well. Most likely this is not an issue, but it is impossible to guarantee that nobody overrode a parameter that is fine for producers but breaks everything if applied to their consumers as well. @BYK what do you think? Is this something we have to watch out or communicate in the onpremise installation?
@fpacifici - good point that said we pre-set this value in on-premise and don't provide support for external Kafka instances. What's the worst-case outcome if anyone modified these settings for their setup? |
@BYK the worst case scenario is very hard to estimate because the amount of possible interactions to rule out between parameters when applied to producers and consumers is huge. It is much less likely that somebody would have overridden a parameter for the producer that actually would not work for the consumer as in the end it is the same broker, but it is not impossible. To be 100% sure we would have to separate producer from consumer config, which is kind of wrong anyway because a lot of it is supposed to be in common. I would suggest we do the followings:
@BYK do you think it would be acceptable? At least we would be making it clear and explicit when an unsupported override is happening. |
@fpacifici your plan sounds like it covers all the cases. It also seems like a longer term plan though. I think:
What do you think? |
@BYK, unfortunately I found a case where the settings for the producer may not be safe on the consumer that we use internally. What about we changed the structure of KAFKA_CLUSTERS into
We would do like on snuba, the priority of parameters being loaded wouldl be:
I would suggest we build two simple method to load the consumer and the producer config that applies overrides according to the priority above, then replacing the direct references to settings.KAFKA_CLUSTERS with calls to those methods will be pretty easy since there are only few reference in the code base. @alexef What do you think about this ? |
Updated. I couldn't find any code overrides for configs, this is why I didn't add it to the config getter. @fpacifici let me know what you think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this goes in the right direction.
A few comments inline.
Could you also write a short test for your config loader methods? This does not need to be as exhaustive as the one you built for snuba.
Up for another check :) |
/gcbrun |
I found the issue that causes the cloudbuild to fail:
Thats' because the PR applies the list of allowed parameters not only to the new configuration (correct) but also to the legacy configuration (incorrect). The legacy configuration should not be validated since we do not have control on what was added so far. This issue shows up on cloud build (fortunately it got caught there) because it runs the onpremise version of sentry and that one overrides that parameter by default: https://github.com/getsentry/onpremise/blob/master/sentry/sentry.conf.example.py#L119 If you want to reproduce this:
@BYK The celery worker logs are not shown in cloudbuild upon failure, that means every errors happening in sentry during ingestion is hidden. That covers all the steps from the ingestion consumer (after relay) and before eventstream (where we produce to kafka for snuba) |
Quite sure the failure is due to this config: I think if we cannot make this backward-compatible with these existing configurations out in the wild, we should rename this config. |
Not sure if we can do something clever here to auto-migration/compatibility: https://github.com/getsentry/sentry/blob/master/docker/sentry.conf.py |
@BYK, let me clarify. There is nothing wrong with I tagged you because of the missing logs in the cloudbuild results. I think we should expose the celery worker logs as well, because there is where a big chunk of the ingestion logic in the sentry code base happens. |
@fpacifici - sorry, I wrote my comment before seeing yours. I get the issue now. I've also submitted a PR to make the logs available. |
thanks |
/gcbuild |
/gcbrun |
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one small issue in the kafka_config, but that is fairly important, it took me a while to figure out why it was working in a certain way. Then everything else pass, tomorrow I will merge if it is fixed.
src/sentry/utils/kafka_config.py
Outdated
if with_legacy and legacy_options: | ||
# producer uses all legacy_options | ||
options.update(legacy_options) | ||
elif "bootstrap.servers" in legacy_options: | ||
# legacy override of bootstrap.servers should be preserved | ||
options["bootstrap.servers"] = legacy_options["bootstrap.servers"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be very hard to grasp. We need to change how this is structured.
Basically what you are trying to do is to pick all legacy options for producers (if there are legacy options) and, for consumers, pick only bootstrap.servers from legacy options.
Why not rewriting it as
#Rename with_legacy into only_bootstrap and invert
if legacy_options:
if only_bootstrap:
assert "bootstrap.servers" in legacy_options
options["bootstrap.servers"] = legacy_options["bootstrap.servers"]
else :
options.update(legacy_options)
else:
# same as before.
So it should be clear that picking the bootstrap only from the legacy option is not a way to override other options but it is meant to build legacy options that discard everything that is not legacy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, your version is way more readable. this ended up like this due to many rewrittings.
updated
/gcbrun |
Thanks a lot @alexef for following along for all this time! |
… then we must have bootstrap.servers
/gcbrun |
@BYK I can't wait to see this merged so we can upgrade our onprem installation. The biggest challenge was keeping my dev environment usable - python2.7 + mac does not play well, but this should be addressed soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this will be very useful.
Fixes: #14123. Related to: getsentry/snuba#1406