
rgw/notification: Fix the caching issues of notification brokers, where the cache was not invalidated if topic attributes were changed #57537

Open
kchheda3 wants to merge 2 commits into main from wip_fix_notification_caching

Conversation

kchheda3 (Contributor)
Fixes https://tracker.ceph.com/issues/66036

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@kchheda3 kchheda3 requested a review from a team as a code owner May 17, 2024 16:47
@kchheda3 kchheda3 requested review from cbodley and yuvalif May 17, 2024 16:47
@github-actions github-actions bot added the rgw label May 17, 2024
@kchheda3 kchheda3 force-pushed the wip_fix_notification_caching branch from 02558db to a4307f1 Compare May 17, 2024 17:01
EntryProcessingResult process_entry(const ConfigProxy &conf,
                                    persistency_tracker &entry_persistency_tracker,
                                    const cls_queue_entry &entry,
                                    RGWPubSubEndpoint *push_endpoint,
Contributor:
nit: const RGWPubSubEndpoint* push_endpoint

Contributor Author:
done.

int create_push_endpoint(const std::string &queue_name,
                         const std::vector<cls_queue_entry> &entries,
                         spawn::yield_context yield,
                         RGWPubSubEndpoint::Ptr *push_endpoint) {
Contributor:
Why pass a pointer to a unique_ptr?
You can just pass the bare pointer from the unique_ptr.

Contributor:

Or even better: since the caller ignores the return value from this function, you can return the unique_ptr directly.
You can return a unique_ptr initialized with nullptr in case of error.

Contributor Author:

The caller does honor the return value; we do not process the entries if we cannot create the push_endpoint.
I am passing the unique_ptr by reference, and it is created inside the function.
I do not want to check for success based on whether the function returned nullptr.

Contributor:

OK, so please pass by reference, not a pointer. IMO, a pointer to a unique_ptr is too confusing.
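
For illustration, a minimal sketch of the two signatures under discussion (hypothetical names, not the PR's actual code); the by-reference version avoids the extra level of indirection while still letting the caller test the return code instead of checking for nullptr:

```cpp
#include <memory>
#include <string>

struct Endpoint { std::string name; };

// confusing: a pointer to a unique_ptr forces callers to write &ptr
int create_endpoint_ptr(const std::string& name, std::unique_ptr<Endpoint>* out) {
  *out = std::make_unique<Endpoint>(Endpoint{name});
  return 0;
}

// clearer: pass the unique_ptr by reference; the function fills it in
int create_endpoint_ref(const std::string& name, std::unique_ptr<Endpoint>& out) {
  out = std::make_unique<Endpoint>(Endpoint{name});
  return 0;  // non-zero would signal an error
}

int main() {
  std::unique_ptr<Endpoint> ep;
  if (create_endpoint_ref("kafka://localhost:9092", ep) == 0) {
    // use ep->name ...
  }
}
```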

Contributor Author:

removed the code that passes the unique_ptr.

RGWPubSubEndpoint::Ptr push_endpoint;
if (create_push_endpoint(queue_name, entries, yield, &push_endpoint) < 0) {
  return;
Contributor:

This means that you stop processing the queue. I think we should continue queue processing unless we lose ownership or the queue is removed.
This will allow users to fix configuration errors.

Contributor Author:

How can we continue processing the queue entries when we were not able to create the endpoint?
Since no entries can be processed, should we just exit from the function?

Contributor:

If you exit from the function, nothing will reschedule the processing of the queue, and the RGW will release ownership.
Handling should be similar to the other error cases (i.e. "continue"), waiting for the problem to be resolved.
Note that we will not be scheduling the coroutine too often in this case, thanks to the timer we have in line 426.
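
A self-contained sketch of the "continue" handling being suggested (simplified stand-ins for the RGW coroutine and timer, purely illustrative):

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <thread>

struct Endpoint {};  // stand-in for RGWPubSubEndpoint

std::unique_ptr<Endpoint> create_endpoint(bool config_ok) {
  if (!config_ok) return nullptr;  // e.g. bad broker host/port in the topic
  return std::make_unique<Endpoint>();
}

int main() {
  bool config_ok = false;  // simulate a misconfigured topic
  for (int cycle = 0; cycle < 3; ++cycle) {  // stand-in for the queue-processing loop
    auto ep = create_endpoint(config_ok);
    if (!ep) {
      // "continue": keep ownership of the queue and retry on the next cycle,
      // instead of returning and never rescheduling the coroutine
      std::this_thread::sleep_for(std::chrono::milliseconds(10));  // stand-in for the timer
      config_ok = true;  // simulate the user fixing the topic attributes
      continue;
    }
    std::cout << "processing entries with a valid endpoint\n";
  }
}
```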

Contributor Author:

done

@@ -53,14 +53,29 @@ static const int RGW_AMQP_NO_REPLY_CODE = 0x0;

// the amqp_connection_info struct does not hold any memory and just points to the URL string
// so, strings are copied into connection_id_t
connection_id_t::connection_id_t(const amqp_connection_info& info, const std::string& _exchange)
: host(info.host), port(info.port), vhost(info.vhost), exchange(_exchange), ssl(info.ssl) {}
connection_id_t::connection_id_t(
Contributor:

Why change the amqp code?

Contributor Author:

Because amqp faces the same issue: since the connection is pooled, if the userid and password are created or changed (or there is a typo in them), a new event never uses the new userid and password, because the old connection is still cached.

Contributor:

The amqp code needs a major rewrite (there are several issues fixed for kafka that still exist in amqp).
Also, we would need to test that.
I would recommend splitting the amqp work into another PR.

Contributor Author:

I am afraid this will be forgotten while we refactor the amqp code.
Since we are doing this for kafka, why not just do it for amqp as well, given that amqp faces the same issue? The code changes are similar, so it seems logical for them to be part of the same commit.

Contributor:

Agreed that this is needed there, but there are many other (critical) fixes we made to the kafka code that are needed there as well.
The main reason I want it in a separate PR is testing.

Contributor Author:

done, removed amqp code.

@yuvalif (Contributor) commented May 21, 2024:

please add tests for both fixes:

  • for the notification caching fix, you can define an invalid kafka port or host on persistent notifications, and then modify the topic to the correct values
  • for the connection caching fix, you can define two topics pointing to the same kafka broker, one with SSL and one without, and define the with-SSL topic first.
    Before your fix, all notifications to both topics would fail (as they share the same connection, which tries to open a secure connection with the broker). After the fix, only notifications from the first topic would fail, and the second topic should be successful.


This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@@ -342,7 +338,7 @@ class RGWPubSubKafkaEndpoint : public RGWPubSubEndpoint {

std::string to_str() const override {
std::string str("Kafka Endpoint");
str += "\nBroker: " + conn_name;
str += "\nBroker: " + conn_id.broker;
Contributor:

Suggested change
str += "\nBroker: " + conn_id.broker;
str += "\nBroker: " + to_string(conn_id);

@yuvalif (Contributor) commented May 23, 2024:

please rebase and resolve conflicts only after #57630 is merged (otherwise you will be blocked from testing the kafka code).

@kchheda3 (Contributor Author):
please rebase and resolve conflicts only after #57630 is merged (otherwise you will be blocked from testing the kafka code).

Yes, I was planning to take one final look at #57630 and also do some local testing again.

…ate Pushendpoint and stop calling RGWPubSubEndpoint::create for every event.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
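
As a rough illustration of what this commit describes (hypothetical types, not the PR's actual code), the push endpoint is built once per batch of queue entries and reused for every event, instead of calling the factory per event:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Endpoint {  // stand-in for RGWPubSubEndpoint
  void send(const std::string& event) { std::cout << "push: " << event << "\n"; }
};

std::unique_ptr<Endpoint> make_endpoint(/* topic attributes */) {
  return std::make_unique<Endpoint>();
}

void process_batch(const std::vector<std::string>& entries) {
  auto ep = make_endpoint();  // created once per batch...
  for (const auto& e : entries) {
    ep->send(e);              // ...and reused for every event
  }
}

int main() { process_batch({"event-1", "event-2"}); }
```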
@kchheda3 kchheda3 force-pushed the wip_fix_notification_caching branch from a4307f1 to fbb6ad5 Compare May 28, 2024 21:30
… key along with endpoint for connection pooling.

For kafka, connection pooling is currently done based on the endpoint alone, so all events with the same endpoint share the same connection. This causes issues when a userid & password is created or changed for the endpoint: the old connection is cached in the broker manager, so when a new event with the updated/new userid-password is sent, the broker still uses the old connection that was created with the old (or no) userid/password, because only the `endpoint` is the key to the connection pool.
To fix this, use all the topic attributes that are part of the connection as the key to the connection pool, and create a new kafka connection if any of the attributes changes. The attributes include userid, password, ssl, ca_location.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
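
A minimal sketch of the keying idea described in the commit message (illustrative names, not the PR's actual types): every attribute that affects the broker connection becomes part of the pool key, so changing any of them yields a fresh connection instead of a stale cached one:

```cpp
#include <map>
#include <string>
#include <tuple>

struct connection_key_t {
  std::string broker;    // endpoint host:port
  std::string user;
  std::string password;
  bool use_ssl = false;
  std::string ca_location;

  bool operator<(const connection_key_t& rhs) const {
    return std::tie(broker, user, password, use_ssl, ca_location) <
           std::tie(rhs.broker, rhs.user, rhs.password, rhs.use_ssl,
                    rhs.ca_location);
  }
};

struct connection_t {};  // stand-in for the pooled kafka connection

int main() {
  std::map<connection_key_t, connection_t> pool;
  connection_key_t key{"localhost:9092", "alice", "old-password", false, ""};
  pool.emplace(key, connection_t{});
  key.password = "new-password";      // changed attribute -> different key
  pool.emplace(key, connection_t{});  // a fresh connection gets created
  return pool.size() == 2 ? 0 : 1;
}
```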
@kchheda3 kchheda3 force-pushed the wip_fix_notification_caching branch from fbb6ad5 to abaf8b8 Compare May 28, 2024 21:33
@kchheda3 (Contributor Author) commented May 28, 2024:

please add tests for both fixes:

  • for the notification caching fix, you can define an invalid kafka port or host on persistent notifications, and then modify the topic to the correct values
  • for the connection caching fix, you can define two topics pointing to the same kafka broker, one with SSL and one without, and define the with-SSL topic first.
    Before your fix, all notifications to both topics would fail (as they share the same connection, which tries to open a secure connection with the broker). After the fix, only notifications from the first topic would fail, and the second topic should be successful.

Added both test cases.
However, FYI, both test cases need PR #57536 to be merged in order to run successfully.

@kchheda3 (Contributor Author):
please rebase and resolve conflicts only after #57630 is merged (otherwise you will be blocked from testing the kafka code).

rebased and updated the commits.

@kchheda3 kchheda3 requested a review from yuvalif May 28, 2024 21:35