RabbitMQ OCF RA M/S HA cluster agent migration #1698

Merged
188 commits merged on Nov 4, 2021

Conversation

bogdando
Contributor

@bogdando bogdando commented Oct 1, 2021

New home for rabbitmq-server-ha RA OCF m/s HA

Moving it from repo:
https://github.com/rabbitmq/rabbitmq-server
The original path:
scripts/rabbitmq-server-ha.ocf

Also preserve the list of authors and the change history since its
very first commit in the Fuel for OpenStack project,
now archived: https://github.com/openstack-archive/fuel-library

To get the history use:
$ git log --follow heartbeat/rabbitmq-server-ha.ocf

Reasoning: the OCF RA script provides a master/slave (M/S) HA
Pacemaker resource for a RabbitMQ cluster and fits better
in this repository.

Background

It has been actively maintained for years.
It now needs a new home at the request of the RabbitMQ team,
since it is no longer possible to run CI tests for changes
proposed against it at the old location.

The TripleO upstream project and its layered RH OSP product plan
to adopt this OCF RA. That guarantees future
maintenance and support for it.

How it works

Documentation is kept at its original upstream location:
https://www.rabbitmq.com/pacemaker.html#auto-pacemaker

Future Plans

Once it is here, the RDO package builds will pick up
changes to this OCF RA and make sure they are CI'ed, also in TripleO
and OSP.

Status Quo

Until the adoption completes, I'm planning to test changes proposed
against this new location in my fork, with GitHub Actions, like [0].
The CI runs on pre-built images [1] and Vagrant scripts [2] that
I maintain for (more or less) recent Pacemaker and RabbitMQ
builds. The test coverage includes a simple cluster-assembly smoke
test and a sophisticated Jepsen test case that verifies auto-healing
of the Pacemaker cluster resource managed by this OCF RA.

[0] https://github.com/bogdando/rabbitmq-server/runs/3757495446
[1] https://hub.docker.com/r/bogdando/rabbitmq-cluster-ocf
[2] https://github.com/bogdando/rabbitmq-cluster-ocf-vagrant

Signed-off-by: Bogdan Dobrelya bdobreli@redhat.com

Vladimir Kuklin and others added 30 commits April 16, 2015 11:23
based on Change-Id: Ie759857fb94db9aa94aaeaeda2c6ab5bb159cc9e
All the work done for fuel-library packaging

Should be overridden by the change above after we switch
CI to package-based

implements blueprint: package-fuel-components

Change-Id: I48ed37a009b42f0a9a21cc869a869edb505b39c3
1) Package fuel library into three different
packages:
RPM: fuel-library6.1
ALL: fuel-ha-utils, fuel-misc

2) Install packages onto slave nodes
implements blueprint: package-fuel-components

Change-Id: Ie759857fb94db9aa94aaeaeda2c6ab5bb159cc9e
This commit makes the post-start notify action check that
the hostlist of nodes to be joined to the cluster
contains not only nodes that will be started but
also ones that are already started. This fixes
the case where Pacemaker sends notifies only for
the latest event, so a node which is not
included in the start list would not join the
cluster. It also checks whether the node is
already clustered and skips the join if it
is not needed.

Change-Id: Ibe8ecdcfe42c14228350b1eb3c9d08b1a64e117d
Closes-bug: #1455761
There is a mistake in the OCF logic which tries
to start the rabbitmq app without a running beam
after a Mnesia reset, getting into a loop
which constantly fails until it times out.

Change-Id: Id096961e206a083b51978fc5034f99d04715d7ea
Related-bug: #1436812
W/o this patch, the code in the OCF script from the
deployment/ dir will never get into the fuel-library
packages, which are built from the files/ and debian/
dirs only.

The solution is:
1) sync the diverged code to the files/ and debian/ dirs
2) either remove the source OCF file or
update the way files are linked.

This patch does only step 1, as it is not yet
decided how to deal with step 2.

Related-bug: #1457441
Related-bug: #184966

Change-Id: Ied86640e8e853de99bcd26f1ae726fc8272b6db7
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix, when the rabbit app cannot start due to
a corrupted mnesia state, mnesia would not be cleaned
completely. This may prevent the rabbit app from
starting and take the node out of the cluster permanently.

The solution is to remove all rabbit node related mnesia
files.

Closes-bug: #1457766

Change-Id: I680efbf573c22aa9a13d8429d985b5a57235b2bf
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
* When the rabbit node went down, its status remains 'running'
  in the mnesia db for a while, so a few retries (50 sec in total) are
  required in order to kick and forget this node from the cluster.
  This also requires +50 sec for the stop & demote action timeouts.
* The rabbit master score in the CIB is retained after the current
  master moved manually. This is wrong and the score must be reset
  ASAP for post-demote and post-stop as well.
* The demoted node must be kicked from cluster by other nodes
  on post-demote processing.
* Post-demote should stop the rabbit app at the node being demoted as
  this node should be kicked from the cluster by other nodes.
  Instead, it stops the app at the *other* nodes and brings full
  cluster downtime.
* The check to join should be only done at the post-start and not at
  the post-promote, otherwise the node being promoted may think it
  is clustered with some node while the join check reports it as
  already clustered with another one.
  (the regression was caused by https://review.openstack.org/184671)
* Change `hostname` call to `crm_node -n` via $THIS_PCMK_NODE
  everywhere to ensure we are using correct pacemaker node name
* Handle empty values for OCF_RESKEY_CRM_meta_notify_* by reporting
  the resource as not running. This will rerun resource and restore
  its state, eventually.

Closes-bug: #1436812
Closes-bug: #1455761

Change-Id: Ib01c1731b4f06e6b643a4bca845828f7db507ad3
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix, the failover time was longer than expected,
as rabbit nodes were able to query corosync nodes that had left the
cluster and also tried to join them to the rabbit cluster, ending
up being reset and rejoining the alive nodes later.
1) Add functions:
  a) to get all alive nodes in the partition
  b) to get all nodes
This fixes get_monitor behaviour so that it ignores
attributes of dead nodes, as crm_node behaviour
changed with the pacemaker upgrade. So rabbit nodes will
never try to join the dead ones.

2) Fix bash scopes for local variables
Minor change removing unexpected behavior when a local variable
impacts the global scope.

Related-bug: #1436812

Change-Id: I89b716b4cd007572bb6832365d4424669921f057
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix, rabbitmqctl may sometimes hang, failing
many commands. This is a problem as it brings the rabbit node
into an unresponsive and broken state. It may also affect
entire cluster operations, for example, when the failed command is
forget_cluster_node.

The solution is to check for the cases when the rabbitmqctl
list_channels command timed out and was killed or terminated with
exit code 137 or 124, and return a generic error.
The related confusing error message "get_status() returns generic
error", which may be logged when the rabbit node is running out of the
cluster, is fixed as well.

Closes-bug: #1459173

Change-Id: Ia52fc5f2ab7adb36252a7194f9209ab87ce487de
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
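A minimal sketch of the kind of exit-code check described above (the command
invocation and variable names are illustrative, not the agent's actual code):

  # run the potentially hanging command under a watchdog timeout
  timeout 30 rabbitmqctl list_channels >/dev/null 2>&1
  rc=$?
  # 124: killed by timeout(1); 137: terminated by SIGKILL (128+9)
  if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
      return "$OCF_ERR_GENERIC"
  fi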
This commit checks whether there is a running
rabbitmq cluster and whether the rabbitmq app is running
on the node, and exits with a non-zero code if the
current node is not running rabbitmq but should
do so.

Change-Id: I2098405b39ade7325b94781aeb997de0937bdf4c
Closes-bug: #1458828
W/o this fix, the situation is possible when a
rabbit node gets stuck in a start/stop loop, failing
to join the cluster with an error:
"no_running_cluster_nodes, You cannot leave a cluster
if no online nodes are present."

This is an issue because the rabbit node should always
be able to join the cluster if it was ordered to start
by the pacemaker RA.

The solution is to force a mnesia reset if the
rabbit node cannot join the cluster on the post-start
notify. Note that for a starting master, the node
wouldn't be reset. So mnesia will be kept intact
at least on the resource master.

Partial-bug: #1461509

Change-Id: I69bc13266a1dc784681b2677ae5616bfc28cf54f
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix a dead-end situation is possible
when a rabbit node has no free memory left
and the cluster blocks all publishing, by design.
But the app thinks "let's wait for the publish block to be
lifted" and cannot recover.

The workaround is to monitor the results
of crucial rabbitmqctl commands and restart the rabbit node
if queues/channels/alarms cannot be listed or if there are
memory alarms found.
This is similar to the logic we have for the cases when
rabbitmqctl list_channels hangs. But the channels check is also
fixed to verify the exit code > 0 when the rabbit app is
running.

Additional checks added to the monitor also require extending
the timeout window for the monitor action from 60 to 180 seconds.

Besides that, this patch makes the monitor action gather the
rabbit status and runtime stats, like memory consumed by all
queues out of total Mem+Swap, total messages in all queues and
average queue consumer utilization. This info should help to
troubleshoot failures better.

DocImpact: ops guide. If any rabbitmq node exceeds its memory
threshold, publishing becomes blocked cluster-wide, by design.
For such cases, this rabbit node is recovered from the
raised memory alert and immediately stopped, to be restarted
later by pacemaker. Otherwise, the blocked publishing state
might never be lifted if the pressure from the
OpenStack apps side persists.

Closes-bug: #1463433

Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix, the list of file names not
accessible by the rabbitmq user is treated
as multiple arguments to the if command, causing
it to throw a "too many arguments" error and
the chown command to be skipped.

This is a problem as it might prevent the rabbitmq
server from starting because of bad file ownership.

The solution is to pass the list of files as a single
argument, "${foo}".

Closes-bug: #1472175

Change-Id: I1d00ec3f31cd0f023bd58a4e11e5b31659977229
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
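A minimal sketch of the quoting fix (the paths and variable names here are
illustrative, not the agent's actual code):

  # find files under the mnesia dir that are not owned by the rabbitmq user
  foo=$(find /var/lib/rabbitmq/mnesia ! -user rabbitmq 2>/dev/null)
  # quoted "${foo}" is a single argument, so [ -n ... ] no longer fails
  # with "too many arguments" when several file names are returned
  if [ -n "${foo}" ]; then
      chown -R rabbitmq:rabbitmq /var/lib/rabbitmq/mnesia
  fi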
W/o this fix a situation is possible when the
rabbit OCF returns OCF_NOT_RUNNING in the hope of a
future restart of the resource by pacemaker.

But in fact pacemaker will not trigger a restart action
if monitor returns "not running". This is an issue
as we want the resource restarted.

The solution is to return OCF_ERR_GENERIC instead of
OCF_NOT_RUNNING when we expect the resource to be restarted
(which is action stop plus action start).

Closes-bug: #1472230

Change-Id: I10c6e43d92cb23596636d86932674b36864d1595
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
This change leverages the rabbitmq management plugin to dump
exchanges, queues, bindings, users, virtual hosts, permissions and
parameters from the running system. Specifically this change adds the
following:

* Dumping the RabbitMQ definitions (users/vhosts/exchanges/etc.) at
  the end of the deployment
* The ability to restore definitions via the rabbitmq-server OCF
  script during RabbitMQ startup.
* Enables the rabbitmq admin plugin, but restricts it to localhost traffic.
  This reverts Ic01c26200f6019a8112b1c5fb04a282e64b3b3e6 but adds
  firewall rules to mitigate the issue.

DocImpact: The dump_rabbit_definitions task can be used to back up the
rabbitmq definitions. If custom definitions (users/vhosts/etc.) are
created, it must be run, or the changes may be lost during rabbitmq
failover via pacemaker.

Change-Id: I715f7c2ae527f7e105b9f6b7d82c443e8accf178
Closes-bug: #1383258
Related-bug: #1450443
Co-Authored-By: Alex Schultz <aschultz@mirantis.com>
Previously we were sending the JSON backup data on the command line,
which fails when the dataset is large. This change updates the command
line options for curl to pass the filename directly and let curl handle
the reading of the data.

Change-Id: I37f298279beca06df41fb08e1745602976c6a776
Closes-Bug: 1383258
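A hedged sketch of the difference (endpoint, credentials and file path are
examples only):

  # before: definitions JSON expanded onto the command line, breaks for large dumps
  curl -s -u "$user:$pass" -H "Content-Type: application/json" -X POST \
      -d "$(cat /tmp/definitions.json)" http://localhost:15672/api/definitions
  # after: curl reads the file itself
  curl -s -u "$user:$pass" -H "Content-Type: application/json" -X POST \
      -d @/tmp/definitions.json http://localhost:15672/api/definitions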
It's really hard to debug when get_status() returns only $OCF_NOT_RUNNING
and loses the exit code and error output.

Added more logs to avoid this situation.

Related-Bug: #1488999

Change-Id: Id0999235d7be688f55799e2952fe22e97b678ce7
W/o this patch, a race condition is possible
when there are no running rabbit nodes or no resource
master. As a result the rabbit nodes will start/stop in an
endless loop, introducing full downtime
for the AMQP cluster and the cloud control plane.

The solution is:
* On post-start/post-promote notify, do nothing if
  either of the following is true:
  - there are no rabbit resources running or no master
  - the list of rabbit resources being started/promoted
    is reported empty
* For such cases, do not report a resource failure and delegate
  recovery, if needed, to the "running out of the cluster"
  monitor logic.
* Additionally, report a last man standing when
  there are no running rabbit resources around.

Closes-bug: #1491306

Change-Id: If1c62fac26b63410636413c49fce55c35e53dc5f
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
The change makes the OCF script ignore a small number of rabbitmqctl
timeouts for 'heavy' operations: list_channels, get_alarms and list_queues.
The number of tolerated timeouts in a row is configured through a new variable
'max_rabbitmqctl_timeouts'. By default it is set to 1, i.e. rabbitmqctl
timeouts are not tolerated at all.

Bug #1487517 is fixed by separating the declaration of the local variables
'rc_alarms' and 'rc_queues' from the assignment operations.


Text for Operations Guide:

If, on a node where RabbitMQ is deployed,
other processes consume a significant part of the CPU, RabbitMQ starts
responding slowly to queries by the 'rabbitmqctl' utility. The utility is
used by RabbitMQ's OCF script to monitor the state of RabbitMQ.
When the utility fails to return within the pre-defined timeout, the OCF
script considers RabbitMQ to be down and restarts it, which might lead to
a limited (several minutes) OpenStack downtime. Such restarts
are undesirable as they cause downtime without benefit. To
mitigate the issue, the OCF script can be told to tolerate a
certain number of rabbitmqctl timeouts in a row using the following
command:
  crm_resource --resource p_rabbitmq-server --set-parameter \
      max_rabbitmqctl_timeouts --parameter-value N

Here N should be replaced with the number of timeouts. For instance,
if it is set to 3, the OCF script will tolerate two rabbitmqctl
timeouts in a row, but fail if the third one occurs.

By default the parameter is set to 1, i.e. a rabbitmqctl timeout is not
tolerated at all. The downside of increasing the parameter is that
if a real issue occurs which causes a rabbitmqctl timeout, the OCF script
will detect it only after N monitor runs, so the restart, which
might fix the issue, will be delayed.

To understand whether RabbitMQ's restart was caused by a rabbitmqctl timeout,
you should examine lrmd.log of the corresponding controller on the Fuel
master node in the /var/log/docker-logs/remote/ directory. There, lines like
"the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels ..."

indicate a rabbitmqctl timeout. The next line will explain whether it
caused a restart or not. For example:
"rabbitmqctl timed out 2 of max. 3 time(s) in a row. Doing nothing for now."

DocImpact: user-guide, operations-guide

Closes-Bug: #1479815
Closes-Bug: #1487517
Change-Id: I9dec06fc08dbeefbc67249b9e9633c8aab5e09ca
Change get_status to return NOT_RUNNING when
beam is not running. Otherwise, pacemaker
will get stuck during rabbitmq failover and
will not attempt to restart the failed resource.

Change-Id: I926a3eafa9968abdf07baa5f2d5c22480300fb30
Closes-bug: #1484280
On notify, if we detect that we are a part of a cluster we still
need to start the RabbitMQ application, because it is always
down after action_start finishes.

Closes-Bug: #1496386
Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432
When the data returned from 'rabbitmqctl list_queues' grows a lot
and awk sums up all the rows, especially for the memory calculation, it
returns the sum in scientific notation (the example from the bug
was .15997e+09). Later, when we want to calculate the memory in
MB instead of bytes, the bash division does not like this string.

We can simply avoid the situation by doing the division into MB
in awk itself, since we don't need the memory in bytes anyway.

Closes-Bug: #1503331
Change-Id: I38d25406b84d0f70ed62101d5fb5ba108bcab8bd
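A sketch of the approach (the column selection is simplified), assuming
'memory' is one of the requested list_queues columns:

  # sum per-queue memory and convert to MB inside awk, so bash never sees
  # a value in scientific notation
  rabbitmqctl list_queues memory 2>/dev/null | \
      awk '/^[0-9]/ {sum += $1} END {printf "%d\n", sum / 1048576}'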
Added a new OCF key stop_time (corresponding to start_time).
Added a wait_sync function which tries for up to start_time/2
for the queues on a stopped/demoted node to reach the synced state.

Added an optional [-t timeout] to the su_rabbit_cmd function to
provide an arbitrary timeout.
Change-Id: Iae2211b3d477a9603a58d5eacb12e0fba924861a
Closes-Bug: #1464637
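A heavily simplified sketch of what such a waiting loop can look like (the
real function's success criteria and helpers differ; the 'state' column and
the 'syncing' match below are assumptions for illustration only):

  wait_sync() {
      local timeout=$1
      local waited=0
      while [ "$waited" -lt "$timeout" ]; do
          # stop waiting once no queue on this node reports a syncing state
          if ! rabbitmqctl list_queues name state 2>/dev/null | grep -q 'syncing'; then
              return 0
          fi
          sleep 5
          waited=$((waited + 5))
      done
      return 1
  }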
Sync upstream changes back to Fuel downstream
Source https://github.com/rabbitmq/rabbitmq-server
version stable/fedfefebaa39a0aeb41cf9328ba44c3a458e4614

Related blueprint upstream-rabbit-ocf
Closes-bug: #1473015

Change-Id: Ie19c2f071c53b873a359c6c5134e9498c6391e66
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
And fix local bashisms as a little bonus
Upstream patch rabbitmq/rabbitmq-server#374

Related-bug: #1464637

Change-Id: I13189de9f8abce23673c031d11132e495e1972e3
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
mbaldessari and others added 10 commits March 1, 2021 21:55
…mq nodes

We introduce the OCF_RESKEY_allowed_cluster_node parameter which can be used to specify
which nodes of the cluster rabbitmq is expected to run on. When this variable is not
set, the resource agent assumes that all nodes of the cluster (the output of crm_node -l)
are eligible to run rabbitmq. The use case here is clusters that have a large
number of nodes, where only a specific subset is used for rabbitmq (usually this is
done with some constraints).
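A sketch of that fallback logic (variable names are illustrative; see the
actual agent for the real parameter handling):

  # use the configured node list if given, otherwise all pacemaker nodes
  if [ -n "${OCF_RESKEY_allowed_cluster_node}" ]; then
      allowed_nodes="${OCF_RESKEY_allowed_cluster_node}"
  else
      # crm_node -l lists the cluster nodes; the second field is the node name
      allowed_nodes=$(crm_node -l | awk '{print $2}')
  fi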

Tested in a 9-node cluster as follows:
[root@messaging-0 ~]# pcs resource config rabbitmq
 Resource: rabbitmq (class=ocf provider=rabbitmq type=rabbitmq-server-ha)
  Attributes: allowed_cluster_nodes="messaging-0 messaging-1 messaging-2" avoid_using_iptables=true
  Meta Attrs: container-attribute-target=host master-max=3 notify=true ordered=true
  Operations: demote interval=0s timeout=30 (rabbitmq-demote-interval-0s)
              monitor interval=5 timeout=30 (rabbitmq-monitor-interval-5)
              monitor interval=3 role=Master timeout=30 (rabbitmq-monitor-interval-3)
              notify interval=0s timeout=20 (rabbitmq-notify-interval-0s)
              promote interval=0s timeout=60s (rabbitmq-promote-interval-0s)
              start interval=0s timeout=200s (rabbitmq-start-interval-0s)
              stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)

[root@messaging-0 ~]# pcs status |grep -e rabbitmq -e messaging
  * Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
...
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
    * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
    * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2
Currently every call to unblock_client_access() is followed by a log line
showing which function requested the unblocking. When we pass the parameter
OCF_RESKEY_avoid_using_iptables=true, it makes no sense to log the
unblocking of iptables since it is effectively a no-op.

Let's move that logging inside the unblock_client_access() function,
allowing a parameter to log which function called it.

Tested on a cluster with rabbitmq bundles with avoid_using_iptables=true
and observed no spurious logging any longer:

[root@messaging-0 ~]# journalctl |grep 'unblocked access to RMQ port' |wc -l
0
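A sketch of the refactored helper (simplified; the rule spec mirrors the one
quoted later in this review, and the exact logging format is illustrative):

  unblock_client_access() {
      local caller="$1"
      # no-op when iptables blocking is disabled, so log nothing either
      if [ "${OCF_RESKEY_avoid_using_iptables}" = "true" ]; then
          return 0
      fi
      iptables --wait -D INPUT -p tcp -m tcp --dport "${OCF_RESKEY_node_port}" \
          -m state --state NEW,RELATED,ESTABLISHED \
          -m comment --comment 'temporary RMQ block' -j REJECT --reject-with tcp-reset 2>/dev/null
      ocf_log info "unblocked access to RMQ port${caller:+ (requested by ${caller})}"
  }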
RABBITMQ_NODE_PORT is exported by default and set to 5672. Re-exporting it in that
case will actually break the case where we set up rabbit with tls on the default port:

  2021-02-28 07:44:10.732 [error] <0.453.0> Failed to start Ranch listener
  {acceptor,{172,17,1,93},5672} in ranch_ssl:listen([{cacerts,'...'},{key,'...'},{cert,'...'},{ip,{172,17,1,93}},{port,5672},
  inet,{keepalive,true}, {versions,['tlsv1.1','tlsv1.2']},{certfile,"/etc/pki/tls/certs/rabbitmq.crt"},{keyfile,"/etc/pki/tls/private/rabbitmq.key"},
  {depth,1},{secure_renegotiate,true},{reuse_sessions,true},{honor_cipher_order,true},{verify,verify_none},{fail_if_no_peer_cert,false}])
  for reason eaddrinuse (address already in use)

This is because by explicitly always exporting it, we force rabbit to listen on
that port via TCP, and that is a problem when we want to do SSL on that port.
Since 5672 is the default port already, we can just avoid exporting this port when
the user does not customize it.

Tested both in a non-TLS env (A) and in a TLS-env (B) successfully:
(A) Non-TLS
[root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq
[root@messaging-0 /]#
[root@messaging-0 /]# pcs status |grep rabbitmq
    * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
    * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
    * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2

(B) TLS
[root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq/ |head -n3
/etc/rabbitmq/rabbitmq.config:  {ssl, [{versions, ['tlsv1.1', 'tlsv1.2']}]},
/etc/rabbitmq/rabbitmq.config:    {ssl_listeners, [{"172.17.1.48", 5672}]},
/etc/rabbitmq/rabbitmq.config:    {ssl_options, [

[root@messaging-0 ~]# pcs status |grep rabbitmq
    * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
    * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
    * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2

Note: I don't believe we should export RABBITMQ_NODE_PORT at all, since you can specify all ports
in the rabbit configuration anyway, but I prefer to play it safe here as folks might rely on being
able to customize this.

Signed-off-by: Michele Baldessari <michele@acksyn.org>
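A minimal sketch of the conditional export described above (the exact
condition is illustrative):

  # only override the node port when the user explicitly configured one
  if [ -n "${OCF_RESKEY_node_port}" ] && [ "${OCF_RESKEY_node_port}" != "5672" ]; then
      export RABBITMQ_NODE_PORT="${OCF_RESKEY_node_port}"
  fi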
Upgrade Lager to 3.9 for OTP 24 compatibility
In newer Erlang, beam.smp no longer writes a pidfile until the rabbit
application starts. It also no longer passes -mnesia dir and -sname,
which are required in order to start the node while only delaying
the application start-up.
Handle that so the Pacemaker HA setup keeps working with newer Erlang
and rabbitmq-server versions.

Fix '[ x == x ]' bashisms as well to silence errors in the RA logs.

Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
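For reference, the bashism in question and its portable replacement (values
are placeholders):

  rc=0
  # bashism: '==' inside [ ] is not POSIX and triggers errors in some shells
  #   [ "x$rc" == "x0" ] && echo "ok"
  # portable form:
  [ "x$rc" = "x0" ] && echo "ok"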
Fuel for OpenStack originated the rabbitmq OCF RA.
Restore the history of changes for it.

This commit is empty and only sets a milestone for its Fuel history
ending at Tue May 10 15:27:53 2016.

Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
Every time we recompile the erlang/elixir/rebar/rabbitmq stack there are
one or more fresh new warnings that completely trip up any parsing
of these commands. Most end up being bugs that get fixed later on [1].

Since stderr is rarely interesting and just holds any rebase up, let's
ignore it when running these rabbitmqctl commands.
[1] https://elixirforum.com/t/mix-local-hex-warning-authenticity-is-not-established-by-certificate-path-validation/39665

Authored-by: Michele Baldessari <michele@acksyn.org>
Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
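A minimal sketch of the change (the command and columns are examples only):

  # discard stderr so stray build/runtime warnings don't break output parsing
  queues=$(rabbitmqctl list_queues name messages 2>/dev/null)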
@knet-ci-bot

Can one of the admins verify this patch?

@bogdando bogdando changed the title RabbitMQ OCF RA M/S HA cluster agent migration [WIP] RabbitMQ OCF RA M/S HA cluster agent migration Oct 1, 2021
@bogdando bogdando changed the title [WIP] RabbitMQ OCF RA M/S HA cluster agent migration RabbitMQ OCF RA M/S HA cluster agent migration Oct 1, 2021
@bogdando
Contributor Author

bogdando commented Oct 1, 2021

Now it works as expected, see an example commit from history

@bogdando
Contributor Author

bogdando commented Oct 1, 2021

Please note that this PR intentionally includes commits all the way back to 2015, to preserve the authored changes and link to their authors whenever possible.

Contributor

@oalbrigt oalbrigt left a comment


Looks good in general.

I've added some comments of what I think should be improved.

You should also check what parts of the logging to INFO should stay INFO, and which can be changed to DEBUG (to avoid spamming the logs too much).

Consider switching license to the same as the other agents, unless you got a good reason for using the current license:
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc#licensing

local LH="${LL} my_host():"

hostname=$(process_fqdn $(get_hostname))
ocf_log info "${LH} hostlist is: $hostlist"
Contributor

This should be debug, so we don't spam the logs unless we want it.

# remove all temporary RMQ blocking rules, if there are more than one exist
for i in $(iptables -nvL --wait --line-numbers | awk '/temporary RMQ block/ {print $1}'); do
iptables --wait -D INPUT -p tcp -m tcp --dport ${OCF_RESKEY_node_port} -m state --state NEW,RELATED,ESTABLISHED \
-m comment --comment 'temporary RMQ block' -j REJECT --reject-with tcp-reset
Contributor

Indent a step more to make it more visible that it's part of line above.


elif [ -n "${alarms}" ]; then
for node in ${alarms}; do
name=`echo ${node} | perl -n -e "m/memory,'(?<n>\S+)+'/ && print \"$+{n}\n\""`
Contributor

We should use sed or awk here.

Contributor Author

Good point, I'd fix that in a follow-up. Since that might regress the agent code and requires an extra run of functional testing, I'd run it in my forked repo with GitHub Actions once I point it at the "new home".
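For illustration, a sed-based replacement might look like this (untested
sketch, assuming the same "memory,'<node>'" input format):

  name=$(echo "${node}" | sed -n "s/.*memory,'\([^']*\)'.*/\1/p")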

@bogdando
Contributor Author

bogdando commented Oct 1, 2021

@jeckersb PTAL

@bogdando
Contributor Author

bogdando commented Oct 1, 2021

@oalbrigt thank you for reviewing! Do you think it would make sense to land this first, then address your comments with small patches, which I could test on my own, as individual changes?

@oalbrigt
Contributor

oalbrigt commented Oct 1, 2021

@oalbrigt thank you for reviewing! Do you think it would make sense to land this first, then address your comments with small patches, which I could test on my own, as individual changes?

We can just fix it here, as I'm in no hurry to merge it, so we can leave it open for a week or 3.

Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
@bogdando
Contributor Author

bogdando commented Nov 4, 2021

It's good to go. I tested it with the GitHub workflows set up for my fork; it passed the smoke and Jepsen tests without regressions (in my test suite Jepsen always reports failures because messages are expectedly lost, so that's the expected result).

@oalbrigt oalbrigt merged commit 65bed16 into ClusterLabs:master Nov 4, 2021