RabbitMQ OCF RA M/S HA cluster agent migration #1698

Merged
188 commits merged on Nov 4, 2021

Conversation

bogdando
Contributor

@bogdando bogdando commented Oct 1, 2021

New home for rabbitmq-server-ha RA OCF m/s HA

Moving it from repo:
https://github.com/rabbitmq/rabbitmq-server
The original path:
scripts/rabbitmq-server-ha.ocf

Also preserve the list of authors and the change history since its
very first commit in the Fuel for OpenStack project,
now archived: https://github.com/openstack-archive/fuel-library

To get the history use:
$ git log --follow heartbeat/rabbitmq-server-ha.ocf

Reasoning: the OCF RA script provides a master/slave (M/S) HA
Pacemaker resource for a RabbitMQ cluster and fits better
in this repository.

Background

It has been actively maintained for years.
It now needs a new home at the request of the RabbitMQ team,
since it is no longer possible to run CI tests for changes
proposed against it at the old location.

The TripleO upstream project and its layered RH OSP product plan
to adopt this OCF RA. That guarantees future
maintenance and support for it.

How it works

Documentation is kept at its original upstream location:
https://www.rabbitmq.com/pacemaker.html#auto-pacemaker

Future Plans

Once it is here, the RDO package builds will pick up
changes to this OCF RA and make sure they are CI'ed, also in TripleO
and OSP.

Status Quo

Until the adoption completes, I'm planning to test changes proposed
against this new location in my fork, with GitHub Actions, like [0].
The CI runs on pre-built images [1] and Vagrant scripts [2] that
I maintain for (more or less) recent Pacemaker and RabbitMQ
builds. The test coverage includes a simple cluster-assembly smoke
test and a sophisticated Jepsen test case that verifies auto-healing
of the Pacemaker cluster resource managed by this OCF RA.

[0] https://github.com/bogdando/rabbitmq-server/runs/3757495446
[1] https://hub.docker.com/r/bogdando/rabbitmq-cluster-ocf
[2] https://github.com/bogdando/rabbitmq-cluster-ocf-vagrant

Signed-off-by: Bogdan Dobrelya bdobreli@redhat.com

Vladimir Kuklin and others added 30 commits April 16, 2015 11:23
based on Change-Id: Ie759857fb94db9aa94aaeaeda2c6ab5bb159cc9e
All the work done for fuel-library packaging

Should be overridden by the change above after we switch
CI to package-based

implements blueprint: package-fuel-components

Change-Id: I48ed37a009b42f0a9a21cc869a869edb505b39c3
1) Package fuel library into three different
packages:
RPM: fuel-library6.1
ALL: fuel-ha-utils, fuel-misc

2) Install packages onto slave nodes
implements blueprint: package-fuel-components

Change-Id: Ie759857fb94db9aa94aaeaeda2c6ab5bb159cc9e
This commit makes the post-start notify action check that
the hostlist of nodes to be joined to the cluster
contains not only nodes that will be started but
also ones that are already started. This fixes
the case where Pacemaker sends notifies only for
the latest event, so a node which is not
included in the start list would not join the
cluster. It also checks whether the node is
already clustered and skips the join if it
is not needed.

Change-Id: Ibe8ecdcfe42c14228350b1eb3c9d08b1a64e117d
Closes-bug: #1455761
There is a mistake in the OCF logic which tries
to start the rabbitmq app without a running beam
after a Mnesia reset, getting into a loop
which constantly fails until it times out.

Change-Id: Id096961e206a083b51978fc5034f99d04715d7ea
Related-bug: #1436812
W/o this patch, the code in the OCF script from the
deployment/ dir will never get into the fuel-library
packages, which are built from the files/ and debian/
dirs only.

The solution is:
1) sync the diverged code to the files/ and debian/ dirs
2) either remove the source OCF file or
update the way files are linked.

This patch does only step 1, as it is not yet
decided how to deal with step 2.

Related-bug: #1457441
Related-bug: #184966

Change-Id: Ied86640e8e853de99bcd26f1ae726fc8272b6db7
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix, when the rabbit app cannot start due to
a corrupted mnesia state, mnesia would not be cleaned
completely. This may prevent the rabbit app from
starting and take the node out of the cluster permanently.

The solution is to remove all rabbit node related mnesia
files.

Closes-bug: #1457766

Change-Id: I680efbf573c22aa9a13d8429d985b5a57235b2bf
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
* When the rabbit node went down, its status remains 'running'
  in the mnesia db for a while, so a few retries (50 sec in total) are
  required in order to kick and forget this node from the cluster.
  This also requires +50 sec for the stop & demote action timeouts.
* The rabbit master score in the CIB is retained after the current
  master moved manually. This is wrong and the score must be reset
  ASAP for post-demote and post-stop as well.
* The demoted node must be kicked from cluster by other nodes
  on post-demote processing.
* Post-demote should stop the rabbit app at the node being demoted as
  this node should be kicked from the cluster by other nodes.
  Instead, it stops the app at the *other* nodes and brings full
  cluster downtime.
* The check to join should be only done at the post-start and not at
  the post-promote, otherwise the node being promoted may think it
  is clustered with some node while the join check reports it as
  already clustered with another one.
  (the regression was caused by https://review.openstack.org/184671)
* Change `hostname` call to `crm_node -n` via $THIS_PCMK_NODE
  everywhere to ensure we are using correct pacemaker node name
* Handle empty values for OCF_RESKEY_CRM_meta_notify_* by reporting
  the resource as not running. This will rerun resource and restore
  its state, eventually.

Closes-bug: #1436812
Closes-bug: #1455761

Change-Id: Ib01c1731b4f06e6b643a4bca845828f7db507ad3
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix, the failover time was longer than expected,
as rabbit nodes were able to query corosync nodes that had left the
cluster and also tried to join them to the rabbit cluster, ending
up being reset and rejoining the alive nodes later.
1) Add functions:
  a) to get all alive nodes in the partition
  b) to get all nodes
This fixes get_monitor behaviour so that it ignores
attributes of dead nodes, as crm_node behaviour
changed with the pacemaker upgrade. So rabbit nodes will
never try to join the dead ones.

2) Fix bash scopes for local variables
Minor change removing unexpected behavior when a local variable
impacts the global scope.

Related-bug: #1436812

Change-Id: I89b716b4cd007572bb6832365d4424669921f057
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix, rabbitmqctl may sometimes hang, failing
many commands. This is a problem as it brings the rabbit node
into an unresponsive and broken state. It may also affect
entire cluster operations, for example, when the failed command is
forget_cluster_node.

The solution is to check for the cases when the rabbitmqctl
list_channels command timed out and was killed or terminated with
exit code 137 or 124, and return a generic error.
The related confusing error message "get_status() returns generic
error", which may be logged when the rabbit node is running out of the
cluster, is fixed as well.

Closes-bug: #1459173

Change-Id: Ia52fc5f2ab7adb36252a7194f9209ab87ce487de
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
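A minimal sketch of the kind of exit-code check described above (the command
invocation and variable names are illustrative, not the agent's actual code):

  # run the potentially hanging command under a watchdog timeout
  timeout 30 rabbitmqctl list_channels >/dev/null 2>&1
  rc=$?
  # 124: killed by timeout(1); 137: terminated by SIGKILL (128+9)
  if [ "$rc" -eq 124 ] || [ "$rc" -eq 137 ]; then
      return "$OCF_ERR_GENERIC"
  fi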
This commit checks whether there is a running
rabbitmq cluster and whether the rabbitmq app is running
on the node, and exits with a non-zero code if the
current node is not running rabbitmq but should
do so.

Change-Id: I2098405b39ade7325b94781aeb997de0937bdf4c
Closes-bug: #1458828
W/o this fix, the situation is possible when a
rabbit node gets stuck in a start/stop loop, failing
to join the cluster with an error:
"no_running_cluster_nodes, You cannot leave a cluster
if no online nodes are present."

This is an issue because the rabbit node should always
be able to join the cluster if it was ordered to start
by the pacemaker RA.

The solution is to force a mnesia reset if the
rabbit node cannot join the cluster on the post-start
notify. Note that for a starting master, the node
wouldn't be reset. So mnesia will be kept intact
at least on the resource master.

Partial-bug: #1461509

Change-Id: I69bc13266a1dc784681b2677ae5616bfc28cf54f
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix a dead-end situation is possible
when a rabbit node has no free memory left
and the cluster blocks all publishing, by design.
But the app thinks "let's wait for the publish block to be
lifted" and cannot recover.

The workaround is to monitor the results
of crucial rabbitmqctl commands and restart the rabbit node
if queues/channels/alarms cannot be listed or if there are
memory alarms found.
This is similar to the logic we have for the cases when
rabbitmqctl list_channels hangs. But the channels check is also
fixed to verify the exit code > 0 when the rabbit app is
running.

Additional checks added to the monitor also require extending
the timeout window for the monitor action from 60 to 180 seconds.

Besides that, this patch makes the monitor action gather the
rabbit status and runtime stats, like memory consumed by all
queues out of total Mem+Swap, total messages in all queues and
average queue consumer utilization. This info should help to
troubleshoot failures better.

DocImpact: ops guide. If any rabbitmq node exceeds its memory
threshold, publishing becomes blocked cluster-wide, by design.
For such cases, this rabbit node is recovered from the
raised memory alert and immediately stopped, to be restarted
later by pacemaker. Otherwise, the blocked publishing state
might never be lifted if the pressure from the
OpenStack apps side persists.

Closes-bug: #1463433

Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
W/o this fix, the list of file names not
accessible by the rabbitmq user is treated
as multiple arguments to the if command, causing
it to throw a "too many arguments" error and
the chown command to be skipped.

This is a problem as it might prevent the rabbitmq
server from starting because of bad file ownership.

The solution is to pass the list of files as a single
argument, "${foo}".

Closes-bug: #1472175

Change-Id: I1d00ec3f31cd0f023bd58a4e11e5b31659977229
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
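A minimal sketch of the quoting fix (the paths and variable names here are
illustrative, not the agent's actual code):

  # find files under the mnesia dir that are not owned by the rabbitmq user
  foo=$(find /var/lib/rabbitmq/mnesia ! -user rabbitmq 2>/dev/null)
  # quoted "${foo}" is a single argument, so [ -n ... ] no longer fails
  # with "too many arguments" when several file names are returned
  if [ -n "${foo}" ]; then
      chown -R rabbitmq:rabbitmq /var/lib/rabbitmq/mnesia
  fi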
W/o this fix a situation is possible when the
rabbit OCF returns OCF_NOT_RUNNING in the hope of a
future restart of the resource by pacemaker.

But in fact pacemaker will not trigger a restart action
if monitor returns "not running". This is an issue
as we want the resource restarted.

The solution is to return OCF_ERR_GENERIC instead of
OCF_NOT_RUNNING when we expect the resource to be restarted
(which is action stop plus action start).

Closes-bug: #1472230

Change-Id: I10c6e43d92cb23596636d86932674b36864d1595
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
This change leverages the rabbitmq management plugin to dump
exchanges, queues, bindings, users, virtual hosts, permissions and
parameters from the running system. Specifically this change adds the
following:

* Dumping the RabbitMQ definitions (users/vhosts/exchanges/etc.) at
  the end of the deployment
* The ability to restore definitions via the rabbitmq-server OCF
  script during RabbitMQ startup.
* Enables the rabbitmq admin plugin, but restricts it to localhost traffic.
  This reverts Ic01c26200f6019a8112b1c5fb04a282e64b3b3e6 but adds
  firewall rules to mitigate the issue.

DocImpact: The dump_rabbit_definitions task can be used to back up the
rabbitmq definitions. If custom definitions (users/vhosts/etc.) are
created, it must be run, or the changes may be lost during rabbitmq
failover via pacemaker.

Change-Id: I715f7c2ae527f7e105b9f6b7d82c443e8accf178
Closes-bug: #1383258
Related-bug: #1450443
Co-Authored-By: Alex Schultz <aschultz@mirantis.com>
Previously we were sending the JSON backup data on the command line,
which fails when the dataset is large. This change updates the command
line options for curl to pass the filename directly and let curl handle
the reading of the data.

Change-Id: I37f298279beca06df41fb08e1745602976c6a776
Closes-Bug: 1383258
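A hedged sketch of the difference (endpoint, credentials and file path are
examples only):

  # before: definitions JSON expanded onto the command line, breaks for large dumps
  curl -s -u "$user:$pass" -H "Content-Type: application/json" -X POST \
      -d "$(cat /tmp/definitions.json)" http://localhost:15672/api/definitions
  # after: curl reads the file itself
  curl -s -u "$user:$pass" -H "Content-Type: application/json" -X POST \
      -d @/tmp/definitions.json http://localhost:15672/api/definitions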
It's really hard to debug when get_status() returns only $OCF_NOT_RUNNING
and loses the exit code and error output.

Added more logs to avoid this situation.

Related-Bug: #1488999

Change-Id: Id0999235d7be688f55799e2952fe22e97b678ce7
W/o this patch, a race condition is possible
when there are no running rabbit nodes or no resource
master. As a result the rabbit nodes will start/stop in an
endless loop, introducing full downtime
for the AMQP cluster and the cloud control plane.

The solution is:
* On post-start/post-promote notify, do nothing if
  either of the following is true:
  - there are no rabbit resources running or no master
  - the list of rabbit resources being started/promoted
    is reported empty
* For such cases, do not report a resource failure and delegate
  recovery, if needed, to the "running out of the cluster"
  monitor logic.
* Additionally, report a last man standing when
  there are no running rabbit resources around.

Closes-bug: #1491306

Change-Id: If1c62fac26b63410636413c49fce55c35e53dc5f
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
The change makes the OCF script ignore a small number of rabbitmqctl
timeouts for 'heavy' operations: list_channels, get_alarms and list_queues.
The number of tolerated timeouts in a row is configured through a new variable
'max_rabbitmqctl_timeouts'. By default it is set to 1, i.e. rabbitmqctl
timeouts are not tolerated at all.

Bug #1487517 is fixed by separating the declaration of the local variables
'rc_alarms' and 'rc_queues' from the assignment operations.


Text for Operations Guide:

If, on a node where RabbitMQ is deployed,
other processes consume a significant part of the CPU, RabbitMQ starts
responding slowly to queries by the 'rabbitmqctl' utility. The utility is
used by RabbitMQ's OCF script to monitor the state of RabbitMQ.
When the utility fails to return within the pre-defined timeout, the OCF
script considers RabbitMQ to be down and restarts it, which might lead to
a limited (several minutes) OpenStack downtime. Such restarts
are undesirable as they cause downtime without benefit. To
mitigate the issue, the OCF script can be told to tolerate a
certain number of rabbitmqctl timeouts in a row using the following
command:
  crm_resource --resource p_rabbitmq-server --set-parameter \
      max_rabbitmqctl_timeouts --parameter-value N

Here N should be replaced with the number of timeouts. For instance,
if it is set to 3, the OCF script will tolerate two rabbitmqctl
timeouts in a row, but fail if the third one occurs.

By default the parameter is set to 1, i.e. a rabbitmqctl timeout is not
tolerated at all. The downside of increasing the parameter is that
if a real issue occurs which causes a rabbitmqctl timeout, the OCF script
will detect it only after N monitor runs, so the restart, which
might fix the issue, will be delayed.

To understand whether RabbitMQ's restart was caused by a rabbitmqctl timeout,
you should examine lrmd.log of the corresponding controller on the Fuel
master node in the /var/log/docker-logs/remote/ directory. There, lines like
"the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels ..."

indicate a rabbitmqctl timeout. The next line will explain whether it
caused a restart or not. For example:
"rabbitmqctl timed out 2 of max. 3 time(s) in a row. Doing nothing for now."

DocImpact: user-guide, operations-guide

Closes-Bug: #1479815
Closes-Bug: #1487517
Change-Id: I9dec06fc08dbeefbc67249b9e9633c8aab5e09ca
Change get_status to return NOT_RUNNING when
beam is not running. Otherwise, pacemaker
will get stuck during rabbitmq failover and
will not attempt to restart the failed resource.

Change-Id: I926a3eafa9968abdf07baa5f2d5c22480300fb30
Closes-bug: #1484280
On notify, if we detect that we are a part of a cluster we still
need to start the RabbitMQ application, because it is always
down after action_start finishes.

Closes-Bug: #1496386
Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432
When the data returned from 'rabbitmqctl list_queues' grows a lot
and awk sums up all the rows, especially for the memory calculation, it
returns the sum in scientific notation (the example from the bug
was .15997e+09). Later, when we want to calculate the memory in
MB instead of bytes, the bash division does not like this string.

We can simply avoid the situation by doing the division into MB
in awk itself, since we don't need the memory in bytes anyway.

Closes-Bug: #1503331
Change-Id: I38d25406b84d0f70ed62101d5fb5ba108bcab8bd
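A sketch of the approach (the column selection is simplified), assuming
'memory' is one of the requested list_queues columns:

  # sum per-queue memory and convert to MB inside awk, so bash never sees
  # a value in scientific notation
  rabbitmqctl list_queues memory 2>/dev/null | \
      awk '/^[0-9]/ {sum += $1} END {printf "%d\n", sum / 1048576}'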
Added a new OCF key stop_time (corresponding to start_time).
Added a wait_sync function which tries for up to start_time/2
for the queues on a stopped/demoted node to reach the synced state.

Added an optional [-t timeout] to the su_rabbit_cmd function to
provide an arbitrary timeout.
Change-Id: Iae2211b3d477a9603a58d5eacb12e0fba924861a
Closes-Bug: #1464637
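A heavily simplified sketch of what such a waiting loop can look like (the
real function's success criteria and helpers differ; the 'state' column and
the 'syncing' match below are assumptions for illustration only):

  wait_sync() {
      local timeout=$1
      local waited=0
      while [ "$waited" -lt "$timeout" ]; do
          # stop waiting once no queue on this node reports a syncing state
          if ! rabbitmqctl list_queues name state 2>/dev/null | grep -q 'syncing'; then
              return 0
          fi
          sleep 5
          waited=$((waited + 5))
      done
      return 1
  }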
Sync upstream changes back to Fuel downstream
Source https://github.com/rabbitmq/rabbitmq-server
version stable/fedfefebaa39a0aeb41cf9328ba44c3a458e4614

Related blueprint upstream-rabbit-ocf
Closes-bug: #1473015

Change-Id: Ie19c2f071c53b873a359c6c5134e9498c6391e66
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
And fix local bashisms as a little bonus
Upstream patch rabbitmq/rabbitmq-server#374

Related-bug: #1464637

Change-Id: I13189de9f8abce23673c031d11132e495e1972e3
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
mbaldessari and others added 10 commits March 1, 2021 21:55
…mq nodes

We introduce the OCF_RESKEY_allowed_cluster_node parameter which can be used to specify
which nodes of the cluster rabbitmq is expected to run on. When this variable is not
set, the resource agent assumes that all nodes of the cluster (the output of crm_node -l)
are eligible to run rabbitmq. The use case here is clusters that have a large
number of nodes, where only a specific subset is used for rabbitmq (usually this is
done with some constraints).
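A sketch of that fallback logic (variable names are illustrative; see the
actual agent for the real parameter handling):

  # use the configured node list if given, otherwise all pacemaker nodes
  if [ -n "${OCF_RESKEY_allowed_cluster_node}" ]; then
      allowed_nodes="${OCF_RESKEY_allowed_cluster_node}"
  else
      # crm_node -l lists the cluster nodes; the second field is the node name
      allowed_nodes=$(crm_node -l | awk '{print $2}')
  fi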

Tested in a 9-node cluster as follows:
[root@messaging-0 ~]# pcs resource config rabbitmq
 Resource: rabbitmq (class=ocf provider=rabbitmq type=rabbitmq-server-ha)
  Attributes: allowed_cluster_nodes="messaging-0 messaging-1 messaging-2" avoid_using_iptables=true
  Meta Attrs: container-attribute-target=host master-max=3 notify=true ordered=true
  Operations: demote interval=0s timeout=30 (rabbitmq-demote-interval-0s)
              monitor interval=5 timeout=30 (rabbitmq-monitor-interval-5)
              monitor interval=3 role=Master timeout=30 (rabbitmq-monitor-interval-3)
              notify interval=0s timeout=20 (rabbitmq-notify-interval-0s)
              promote interval=0s timeout=60s (rabbitmq-promote-interval-0s)
              start interval=0s timeout=200s (rabbitmq-start-interval-0s)
              stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)

[root@messaging-0 ~]# pcs status |grep -e rabbitmq -e messaging
  * Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
...
  * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
    * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
    * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
    * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2
Currently every call to unblock_client_access() is followed by a log line
showing which function requested the unblocking. When we pass the parameter
OCF_RESKEY_avoid_using_iptables=true, it makes no sense to log the
unblocking of iptables since it is effectively a no-op.

Let's move that logging inside the unblock_client_access() function,
allowing a parameter to log which function called it.

Tested on a cluster with rabbitmq bundles with avoid_using_iptables=true
and observed no spurious logging any longer:

[root@messaging-0 ~]# journalctl |grep 'unblocked access to RMQ port' |wc -l
0
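A sketch of the refactored helper (simplified; the rule spec mirrors the one
quoted later in this review, and the exact logging format is illustrative):

  unblock_client_access() {
      local caller="$1"
      # no-op when iptables blocking is disabled, so log nothing either
      if [ "${OCF_RESKEY_avoid_using_iptables}" = "true" ]; then
          return 0
      fi
      iptables --wait -D INPUT -p tcp -m tcp --dport "${OCF_RESKEY_node_port}" \
          -m state --state NEW,RELATED,ESTABLISHED \
          -m comment --comment 'temporary RMQ block' -j REJECT --reject-with tcp-reset 2>/dev/null
      ocf_log info "unblocked access to RMQ port${caller:+ (requested by ${caller})}"
  }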
RABBITMQ_NODE_PORT is exported by default and set to 5672. Re-exporting it in that
case will actually break the case where we set up rabbit with tls on the default port:

  2021-02-28 07:44:10.732 [error] <0.453.0> Failed to start Ranch listener
  {acceptor,{172,17,1,93},5672} in ranch_ssl:listen([{cacerts,'...'},{key,'...'},{cert,'...'},{ip,{172,17,1,93}},{port,5672},
  inet,{keepalive,true}, {versions,['tlsv1.1','tlsv1.2']},{certfile,"/etc/pki/tls/certs/rabbitmq.crt"},{keyfile,"/etc/pki/tls/private/rabbitmq.key"},
  {depth,1},{secure_renegotiate,true},{reuse_sessions,true},{honor_cipher_order,true},{verify,verify_none},{fail_if_no_peer_cert,false}])
  for reason eaddrinuse (address already in use)

This is because by explicitly always exporting it, we force rabbit to listen on
that port via TCP, and that is a problem when we want to do SSL on that port.
Since 5672 is the default port already, we can just avoid exporting this port when
the user does not customize it.

Tested both in a non-TLS env (A) and in a TLS-env (B) successfully:
(A) Non-TLS
[root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq
[root@messaging-0 /]#
[root@messaging-0 /]# pcs status |grep rabbitmq
    * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
    * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
    * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2

(B) TLS
[root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq/ |head -n3
/etc/rabbitmq/rabbitmq.config:  {ssl, [{versions, ['tlsv1.1', 'tlsv1.2']}]},
/etc/rabbitmq/rabbitmq.config:    {ssl_listeners, [{"172.17.1.48", 5672}]},
/etc/rabbitmq/rabbitmq.config:    {ssl_options, [

[root@messaging-0 ~]# pcs status |grep rabbitmq
    * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
    * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
    * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2

Note: I don't believe we should export RABBITMQ_NODE_PORT at all, since you can specify all ports
in the rabbit configuration anyway, but I prefer to play it safe here as folks might rely on being
able to customize this.

Signed-off-by: Michele Baldessari <michele@acksyn.org>
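A minimal sketch of the conditional export described above (the exact
condition is illustrative):

  # only override the node port when the user explicitly configured one
  if [ -n "${OCF_RESKEY_node_port}" ] && [ "${OCF_RESKEY_node_port}" != "5672" ]; then
      export RABBITMQ_NODE_PORT="${OCF_RESKEY_node_port}"
  fi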
Upgrade Lager to 3.9 for OTP 24 compatibility
In newer Erlang, beam.smp no longer writes a pidfile until the rabbit
application starts. It also no longer passes -mnesia dir and -sname,
which are required in order to start the node while only delaying
the application start-up.
Handle that so the Pacemaker HA setup keeps working with newer Erlang
and rabbitmq-server versions.

Fix '[ x == x ]' bashisms as well to silence errors in the RA logs.

Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
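For reference, the bashism in question and its portable replacement (values
are placeholders):

  rc=0
  # bashism: '==' inside [ ] is not POSIX and triggers errors in some shells
  #   [ "x$rc" == "x0" ] && echo "ok"
  # portable form:
  [ "x$rc" = "x0" ] && echo "ok"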
Fuel for OpenStack originated the rabbitmq OCF RA.
Restore the history of changes for it.

This commit is empty and only sets a milestone for its Fuel history
ending at Tue May 10 15:27:53 2016.

Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
Every time we recompile the erlang/elixir/rebar/rabbitmq stack there are
one or more fresh new warnings that completely trip up any parsing
of these commands. Most end up being bugs that get fixed later on [1].

Since stderr is rarely interesting and just holds any rebase up, let's
ignore it when running these rabbitmqctl commands.
[1] https://elixirforum.com/t/mix-local-hex-warning-authenticity-is-not-established-by-certificate-path-validation/39665

Authored-by: Michele Baldessari <michele@acksyn.org>
Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
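A minimal sketch of the change (the command and columns are examples only):

  # discard stderr so stray build/runtime warnings don't break output parsing
  queues=$(rabbitmqctl list_queues name messages 2>/dev/null)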
@knet-ci-bot

Can one of the admins verify this patch?

@bogdando bogdando changed the title RabbitMQ OCF RA M/S HA cluster agent migration [WIP] RabbitMQ OCF RA M/S HA cluster agent migration Oct 1, 2021
@bogdando bogdando changed the title [WIP] RabbitMQ OCF RA M/S HA cluster agent migration RabbitMQ OCF RA M/S HA cluster agent migration Oct 1, 2021
@bogdando
Contributor Author

bogdando commented Oct 1, 2021

Now it works as expected, see an example commit from history

@bogdando
Contributor Author

bogdando commented Oct 1, 2021

Please note that this PR intentionally includes commits all the way back to 2015, to preserve the authored changes and link to their authors whenever possible.

Contributor

@oalbrigt oalbrigt left a comment


Looks good in general.

I've added some comments of what I think should be improved.

You should also check what parts of the logging to INFO should stay INFO, and which can be changed to DEBUG (to avoid spamming the logs too much).

Consider switching license to the same as the other agents, unless you got a good reason for using the current license:
https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc#licensing

local LH="${LL} my_host():"

hostname=$(process_fqdn $(get_hostname))
ocf_log info "${LH} hostlist is: $hostlist"
Contributor

This should be debug, so we don't spam the logs unless we want it.

# remove all temporary RMQ blocking rules, if there are more than one exist
for i in $(iptables -nvL --wait --line-numbers | awk '/temporary RMQ block/ {print $1}'); do
iptables --wait -D INPUT -p tcp -m tcp --dport ${OCF_RESKEY_node_port} -m state --state NEW,RELATED,ESTABLISHED \
-m comment --comment 'temporary RMQ block' -j REJECT --reject-with tcp-reset
Contributor

Indent a step more to make it more visible that it's part of line above.


elif [ -n "${alarms}" ]; then
for node in ${alarms}; do
name=`echo ${node} | perl -n -e "m/memory,'(?<n>\S+)+'/ && print \"$+{n}\n\""`
Contributor

We should use sed or awk here.

Contributor Author

Good point, I'd fix that in a follow-up. Since that might regress the agent code and requires an extra run of functional testing, I'd run it in my forked repo with GitHub Actions once I point it at the "new home".
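For illustration, a sed-based replacement might look like this (untested
sketch, assuming the same "memory,'<node>'" input format):

  name=$(echo "${node}" | sed -n "s/.*memory,'\([^']*\)'.*/\1/p")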

@bogdando
Contributor Author

bogdando commented Oct 1, 2021

@jeckersb PTAL

@bogdando
Contributor Author

bogdando commented Oct 1, 2021

@oalbrigt thank you for reviewing! Do you think it would make sense to land this first, then address your comments with small patches, which I could test on my own, as individual changes?

@oalbrigt
Contributor

oalbrigt commented Oct 1, 2021

@oalbrigt thank you for reviewing! Do you think it would make sense to land this first, then address your comments with small patches, which I could test on my own, as individual changes?

We can just fix it here, as I'm in no hurry to merge it, so we can leave it open for a week or 3.

Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
@bogdando
Contributor Author

bogdando commented Nov 4, 2021

It's good to go. I tested it with the GitHub workflows set up for my fork; it passed the smoke and Jepsen tests without regressions (in my test suite Jepsen always reports failures because messages are expectedly lost, so that's the expected result).

@oalbrigt oalbrigt merged commit 65bed16 into ClusterLabs:master Nov 4, 2021