
rgw/notification: Fix the caching issues of notification brokers, where the cache was not invalidated if topic attributes were changed #57537

Open
kchheda3 wants to merge 2 commits into main from wip_fix_notification_caching

Conversation

kchheda3 (Contributor)
Fixes https://tracker.ceph.com/issues/66036

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@kchheda3 kchheda3 requested a review from a team as a code owner May 17, 2024 16:47
@kchheda3 kchheda3 requested review from cbodley and yuvalif May 17, 2024 16:47
@github-actions github-actions bot added the rgw label May 17, 2024
@kchheda3 kchheda3 force-pushed the wip_fix_notification_caching branch from 02558db to a4307f1 Compare May 17, 2024 17:01
EntryProcessingResult process_entry(const ConfigProxy &conf,
                                    persistency_tracker &entry_persistency_tracker,
                                    const cls_queue_entry &entry,
                                    RGWPubSubEndpoint *push_endpoint,
Contributor:
nit: const RGWPubSubEndpoint* push_endpoint

Contributor Author:
done.

int create_push_endpoint(const std::string &queue_name,
                         const std::vector<cls_queue_entry> &entries,
                         spawn::yield_context yield,
                         RGWPubSubEndpoint::Ptr *push_endpoint) {
Contributor:
Why pass a pointer to a unique_ptr?
You can just pass the bare pointer from the unique_ptr.

Contributor:

Or even better: since the caller ignores the return value from this function, you can return the unique_ptr directly.
You can return a unique_ptr initialized with nullptr in case of error.

Contributor Author:

The caller does honor the return value; we do not process the entries if we cannot create the push_endpoint.
I am passing the unique_ptr by reference, and it is created inside the function.
I do not want to check for success based on whether the function returned nullptr.

Contributor:

OK, so please pass by reference, not a pointer. IMO, a pointer to a unique_ptr is too confusing.
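
For illustration, a minimal sketch of the two signatures under discussion (hypothetical names, not the PR's actual code); the by-reference version avoids the extra level of indirection while still letting the caller test the return code instead of checking for nullptr:

```cpp
#include <memory>
#include <string>

struct Endpoint { std::string name; };

// confusing: a pointer to a unique_ptr forces callers to write &ptr
int create_endpoint_ptr(const std::string& name, std::unique_ptr<Endpoint>* out) {
  *out = std::make_unique<Endpoint>(Endpoint{name});
  return 0;
}

// clearer: pass the unique_ptr by reference; the function fills it in
int create_endpoint_ref(const std::string& name, std::unique_ptr<Endpoint>& out) {
  out = std::make_unique<Endpoint>(Endpoint{name});
  return 0;  // non-zero would signal an error
}

int main() {
  std::unique_ptr<Endpoint> ep;
  if (create_endpoint_ref("kafka://localhost:9092", ep) == 0) {
    // use ep->name ...
  }
}
```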

Contributor Author:

removed the code that passes the unique_ptr.

RGWPubSubEndpoint::Ptr push_endpoint;
if (create_push_endpoint(queue_name, entries, yield, &push_endpoint) < 0) {
  return;
Contributor:

This means that you stop processing the queue. I think we should continue queue processing unless we lose ownership or the queue is removed.
This will allow users to fix configuration errors.

Contributor Author:

How can we continue processing the queue entries when we were not able to create the endpoint?
Since no entries can be processed, should we just exit from the function?

Contributor:

If you exit from the function, nothing will reschedule the processing of the queue, and the RGW will release ownership.
Handling should be similar to the other error cases (i.e. "continue"), waiting for the problem to be resolved.
Note that we will not be scheduling the coroutine too often in this case, thanks to the timer we have in line 426.
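
A self-contained sketch of the "continue" handling being suggested (simplified stand-ins for the RGW coroutine and timer, purely illustrative):

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <thread>

struct Endpoint {};  // stand-in for RGWPubSubEndpoint

std::unique_ptr<Endpoint> create_endpoint(bool config_ok) {
  if (!config_ok) return nullptr;  // e.g. bad broker host/port in the topic
  return std::make_unique<Endpoint>();
}

int main() {
  bool config_ok = false;  // simulate a misconfigured topic
  for (int cycle = 0; cycle < 3; ++cycle) {  // stand-in for the queue-processing loop
    auto ep = create_endpoint(config_ok);
    if (!ep) {
      // "continue": keep ownership of the queue and retry on the next cycle,
      // instead of returning and never rescheduling the coroutine
      std::this_thread::sleep_for(std::chrono::milliseconds(10));  // stand-in for the timer
      config_ok = true;  // simulate the user fixing the topic attributes
      continue;
    }
    std::cout << "processing entries with a valid endpoint\n";
  }
}
```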

Contributor Author:

done

@@ -53,14 +53,29 @@ static const int RGW_AMQP_NO_REPLY_CODE = 0x0;

// the amqp_connection_info struct does not hold any memory and just points to the URL string
// so, strings are copied into connection_id_t
connection_id_t::connection_id_t(const amqp_connection_info& info, const std::string& _exchange)
: host(info.host), port(info.port), vhost(info.vhost), exchange(_exchange), ssl(info.ssl) {}
connection_id_t::connection_id_t(
Contributor:

Why change the amqp code?

Contributor Author:

Because amqp faces the same issue: since the connection is pooled, if the userid and password are created or changed (or there is a typo in them), a new event never uses the new userid and password, because the old connection is still cached.

Contributor:

The amqp code needs a major rewrite (there are several issues fixed for kafka that still exist in amqp).
Also, we would need to test that.
I would recommend splitting the amqp work into another PR.

Contributor Author:

I am afraid this will be forgotten while we refactor the amqp code.
Since we are doing this for kafka, why not just do it for amqp as well, given that amqp faces the same issue? The code changes are similar, so it seems logical for them to be part of the same commit.

Contributor:

Agreed that this is needed there, but there are many other (critical) fixes we made to the kafka code that are needed there as well.
The main reason I want it in a separate PR is testing.

Contributor Author:

done, removed amqp code.

@yuvalif (Contributor) commented May 21, 2024:

please add tests for both fixes:

  • for the notification caching fix, you can define an invalid kafka port or host on persistent notifications, and then modify the topic to the correct values
  • for the connection caching fix, you can define two topics pointing to the same kafka broker, one with SSL and one without, and define the with-SSL topic first.
    Before your fix, all notifications to both topics would fail (as they share the same connection, which tries to open a secure connection with the broker). After the fix, only notifications from the first topic would fail, and the second topic should be successful.


This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@@ -342,7 +338,7 @@ class RGWPubSubKafkaEndpoint : public RGWPubSubEndpoint {

std::string to_str() const override {
std::string str("Kafka Endpoint");
str += "\nBroker: " + conn_name;
str += "\nBroker: " + conn_id.broker;
Contributor:

Suggested change
str += "\nBroker: " + conn_id.broker;
str += "\nBroker: " + to_string(conn_id);

@yuvalif (Contributor) commented May 23, 2024:

please rebase and resolve conflicts only after #57630 is merged (otherwise you will be blocked from testing the kafka code).

@kchheda3 (Contributor Author):
please rebase and resolve conflicts only after #57630 is merged (otherwise you will be blocked from testing the kafka code).

Yes, I was planning to take one final look at #57630 and also do some local testing again.

…ate Pushendpoint and stop calling RGWPubSubEndpoint::create for every event.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
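
As a rough illustration of what this commit describes (hypothetical types, not the PR's actual code), the push endpoint is built once per batch of queue entries and reused for every event, instead of calling the factory per event:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct Endpoint {  // stand-in for RGWPubSubEndpoint
  void send(const std::string& event) { std::cout << "push: " << event << "\n"; }
};

std::unique_ptr<Endpoint> make_endpoint(/* topic attributes */) {
  return std::make_unique<Endpoint>();
}

void process_batch(const std::vector<std::string>& entries) {
  auto ep = make_endpoint();  // created once per batch...
  for (const auto& e : entries) {
    ep->send(e);              // ...and reused for every event
  }
}

int main() { process_batch({"event-1", "event-2"}); }
```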
@kchheda3 kchheda3 force-pushed the wip_fix_notification_caching branch from a4307f1 to fbb6ad5 Compare May 28, 2024 21:30
… key along with endpoint for connection pooling.

For kafka, connection pooling is currently done based on the endpoint alone, so all events with the same endpoint share the same connection. This causes issues when a userid & password is created or changed for the endpoint: the old connection is cached in the broker manager, so when a new event with the updated/new userid-password is sent, the broker still uses the old connection that was created with the old (or no) userid/password, because only the `endpoint` is the key to the connection pool.
To fix this, use all the topic attributes that are part of the connection as the key to the connection pool, and create a new kafka connection if any of the attributes changes. The attributes include userid, password, ssl, ca_location.

Signed-off-by: kchheda3 <kchheda3@bloomberg.net>
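
A minimal sketch of the keying idea described in the commit message (illustrative names, not the PR's actual types): every attribute that affects the broker connection becomes part of the pool key, so changing any of them yields a fresh connection instead of a stale cached one:

```cpp
#include <map>
#include <string>
#include <tuple>

struct connection_key_t {
  std::string broker;    // endpoint host:port
  std::string user;
  std::string password;
  bool use_ssl = false;
  std::string ca_location;

  bool operator<(const connection_key_t& rhs) const {
    return std::tie(broker, user, password, use_ssl, ca_location) <
           std::tie(rhs.broker, rhs.user, rhs.password, rhs.use_ssl,
                    rhs.ca_location);
  }
};

struct connection_t {};  // stand-in for the pooled kafka connection

int main() {
  std::map<connection_key_t, connection_t> pool;
  connection_key_t key{"localhost:9092", "alice", "old-password", false, ""};
  pool.emplace(key, connection_t{});
  key.password = "new-password";      // changed attribute -> different key
  pool.emplace(key, connection_t{});  // a fresh connection gets created
  return pool.size() == 2 ? 0 : 1;
}
```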
@kchheda3 kchheda3 force-pushed the wip_fix_notification_caching branch from fbb6ad5 to abaf8b8 Compare May 28, 2024 21:33
@kchheda3 (Contributor Author) commented May 28, 2024:

please add tests for both fixes:

  • for the notification caching fix, you can define an invalid kafka port or host on persistent notifications, and then modify the topic to the correct values
  • for the connection caching fix, you can define two topics pointing to the same kafka broker, one with SSL and one without, and define the with-SSL topic first.
    Before your fix, all notifications to both topics would fail (as they share the same connection, which tries to open a secure connection with the broker). After the fix, only notifications from the first topic would fail, and the second topic should be successful.

Added both test cases.
However, FYI, both test cases need PR #57536 to be merged in order to run successfully.

@kchheda3 (Contributor Author):
please rebase and resolve conflicts only after #57630 is merged (otherwise you will be blocked from testing the kafka code).

rebased and updated the commits.

@kchheda3 kchheda3 requested a review from yuvalif May 28, 2024 21:35