RabbitMQ OCF RA M/S HA cluster agent migration #1698

Merged
merged 188 commits into from
Nov 4, 2021

Commits on Apr 16, 2015

  1. Backward-compatible commit for packaging of fuel-library

    based on Change-Id: Ie759857fb94db9aa94aaeaeda2c6ab5bb159cc9e
    All the work done for fuel-library packaging
    
    Should be overridden by the change above after we switch
    CI to package-based
    
    implements blueprint: package-fuel-components
    
    Change-Id: I48ed37a009b42f0a9a21cc869a869edb505b39c3
    Vladimir Kuklin committed Apr 16, 2015
    f5e2cc0

Commits on May 14, 2015

  1. All the work done for fuel-library packaging

    1) Package fuel library into three different
    packages:
    RPM: fuel-library6.1
    ALL: fuel-ha-utils, fuel-misc
    
    2) Install packages onto slave nodes
    implements blueprint: package-fuel-components
    
    Change-Id: Ie759857fb94db9aa94aaeaeda2c6ab5bb159cc9e
    Vladimir Kuklin committed May 14, 2015
    ec56073

Commits on May 21, 2015

  1. Check hostlist against starting and active resources

    This commit makes the post-start notify action check that the
    hostlist of nodes to be joined to the cluster contains not only
    the nodes that will be started but also the ones that are already
    started. This fixes the case when Pacemaker sends notifications
    only for the latest event, so a node which is not included in
    the start list would never join the cluster. It also checks
    whether the node is already clustered and skips the join if it
    is not needed.
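A minimal sketch of the hostlist check described above, assuming the standard Pacemaker clone-notify variables `OCF_RESKEY_CRM_meta_notify_start_uname` and `OCF_RESKEY_CRM_meta_notify_active_uname` (space-separated node names). The helper names `notify_hostlist` and `needs_join` are hypothetical and not taken from the agent.

```shell
#!/bin/sh
# Hypothetical helpers, not the agent's own functions.

# Print the nodes to consider on post-start notify: those being started
# plus those already active, one per line, without duplicates.
notify_hostlist() {
    # Unquoted on purpose: each space-separated name becomes one argument.
    printf '%s\n' ${OCF_RESKEY_CRM_meta_notify_start_uname} \
        ${OCF_RESKEY_CRM_meta_notify_active_uname} | sed '/^$/d' | sort -u
}

# Succeed (join needed) unless node $1 is already in the space-separated
# list of clustered nodes given as $2.
needs_join() {
    case " $2 " in
        *" $1 "*) return 1 ;;
        *) return 0 ;;
    esac
}
```

Merging both lists means a node that missed the "start" notification still sees every peer it could join, and `needs_join` turns an already-clustered node into a no-op.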
    
    Change-Id: Ibe8ecdcfe42c14228350b1eb3c9d08b1a64e117d
    Closes-bug: #1455761
    Vladimir Kuklin committed May 21, 2015
    bbb3793
  2. Check whether beam is started before running start_app

    There is a mistake in the OCF logic: after a Mnesia reset it tries
    to start the rabbitmq app without a running beam, getting into a
    loop which constantly fails until it times out.
    
    Change-Id: Id096961e206a083b51978fc5034f99d04715d7ea
    Related-bug: #1436812
    Vladimir Kuklin committed May 21, 2015
    f97fb5c

Commits on May 22, 2015

  1. Sync rabbit OCF code diverge to packages

    W/o this patch, the code in the OCF script from the
    deployment/ dir will never get into the fuel-library
    packages, which are built from the files/ and debian/
    dirs only.
    
    The solution is:
    1) sync the diverged code to files/ and debian/
    2) either remove the source OCF file or
    update the way the files are linked.
    
    This patch covers only step 1, as it has not yet been
    decided how to deal with step 2.
    
    Related-bug: #1457441
    Related-bug: #184966
    
    Change-Id: Ied86640e8e853de99bcd26f1ae726fc8272b6db7
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed May 22, 2015
    8a5d91b
  2. Fix rabbit OCF reset_mnesia

    W/o this fix, when the rabbit app cannot start due to
    a corrupted mnesia state, the mnesia would not be cleaned
    completely. This may prevent the rabbit app from starting
    and take the node out of the cluster permanently.
    
    The solution is to remove all mnesia files related to the
    rabbit node.
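A hedged sketch of "remove all mnesia files related to the rabbit node". The on-disk layout is an assumption (a directory named after the node plus siblings such as `<node>-plugins-expand` or `<node>.pid`), and `reset_node_mnesia_files` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical sketch: delete every Mnesia artifact of one RabbitMQ node.
# Assumed layout: a directory named after the node plus siblings that
# start with "<node>-" or "<node>." (e.g. "-plugins-expand", ".pid").
reset_node_mnesia_files() {
    local base="$1" node="$2"
    # Never run with empty arguments: that could expand to "/*".
    if [ -z "$base" ] || [ -z "$node" ]; then
        return 1
    fi
    # The quoted prefix is literal; only the trailing pattern is a glob.
    # "[.-]*" keeps "rabbit@node-10" safe when node is "rabbit@node-1".
    rm -rf -- "${base}/${node}" "${base}/${node}"[.-]*
}
```

Matching on the node-name prefix, rather than wiping the whole base directory, leaves data belonging to other node names untouched.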
    
    Closes-bug: #1457766
    
    Change-Id: I680efbf573c22aa9a13d8429d985b5a57235b2bf
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed May 22, 2015
    bae52d6

Commits on May 25, 2015

  1. Fix rabbit OCF demote/stop/promote actions

    * When the rabbit node goes down, its status remains 'running'
      in the mnesia db for a while, so a few retries (50 sec in total)
      are required in order to kick and forget this node from the
      cluster. This also requires adding 50 sec to the stop and demote
      action timeouts.
    * The rabbit master score in the CIB is retained after the current
      master is moved manually. This is wrong, and the score must be
      reset ASAP for post-demote and post-stop as well.
    * The demoted node must be kicked from the cluster by other nodes
      on post-demote processing.
    * Post-demote should stop the rabbit app at the node being demoted, as
      this node should be kicked from the cluster by other nodes.
      Instead, it stops the app at the *other* nodes, causing full
      cluster downtime.
    * The check to join should be done only at post-start and not at
      post-promote; otherwise the node being promoted may think it
      is clustered with some node while the join check reports it as
      already clustered with another one.
      (the regression was caused by https://review.openstack.org/184671)
    * Change the `hostname` call to `crm_node -n` via $THIS_PCMK_NODE
      everywhere to ensure we are using the correct pacemaker node name
    * Handle empty values for OCF_RESKEY_CRM_meta_notify_* by reporting
      the resource as not running. This will rerun the resource and
      eventually restore its state.
    
    Closes-bug: #1436812
    Closes-bug: #1455761
    
    Change-Id: Ib01c1731b4f06e6b643a4bca845828f7db507ad3
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed May 25, 2015
    7af8c43
  2. Add rabbit OCF functions to get pacemaker node names

    W/o this fix, the failover time was longer than expected,
    as rabbit nodes were able to query corosync nodes that had left
    the cluster and try to join them, ending up being reset and
    rejoining the alive nodes later.
    1) Add functions:
      a) to get all alive nodes in the partition
      b) to get all nodes
    This fixes the get_monitor behaviour so that it ignores
    attributes for dead nodes, as crm_node behaviour
    changed with the upgrade of pacemaker. So rabbit nodes will
    never try to join the dead ones.
    
    2) Fix bash scopes for local variables
    A minor change removing unexpected behavior when a local variable
    impacts the global scope.
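A sketch of filtering alive Pacemaker nodes, under the assumption that the node list arrives as `<id> <name> <state>` lines (the shape `crm_node -l` prints on Pacemaker 1.1; verify against your version). `alive_nodes_but` is a hypothetical name; it reads stdin so the parsing can be tested without a cluster, e.g. `crm_node -l | alive_nodes_but "$(crm_node -n)"`.

```shell
#!/bin/sh
# Hypothetical sketch: print the names of nodes whose state column is
# "member", optionally excluding one node name given as $1.
# Input format "<id> <name> <state>" per line is an assumption.
alive_nodes_but() {
    awk -v skip="$1" '$3 == "member" && $2 != skip { print $2 }'
}
```

Keying on the state column means nodes that left the cluster (state other than `member`) never end up in the join candidates.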
    
    Related-bug: #1436812
    
    Change-Id: I89b716b4cd007572bb6832365d4424669921f057
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed May 25, 2015
    85dabaa

Commits on May 27, 2015

  1. Check if the rabbitmqctl command is responding

    W/o this fix, rabbitmqctl may sometimes hang, failing
    many commands. This is a problem, as it brings the rabbit node
    to an unresponsive and broken state. This may also affect
    entire cluster operations, for example, when the failed command is
    forget_cluster_node.
    
    The solution is to check for the cases when the rabbitmqctl
    list_channels command timed out and was killed or terminated with
    exit code 137 or 124, and return a generic error.
    A related confusing error message, "get_status() returns generic
    error", which may be logged when the rabbit node is running out of
    the cluster, is fixed as well.
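A small sketch of the exit-code check. GNU `timeout` exits with 124 when the time limit expires and with 137 (128 + 9) when the command is killed with SIGKILL. The helper `ctl_status` and its labels are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: classify a rabbitmqctl exit status.
# 124: GNU `timeout` hit its limit; 137: killed by SIGKILL (128 + 9).
# Both mean "hung" and are reported upward as a generic error.
ctl_status() {
    case "$1" in
        0)       echo ok ;;
        124|137) echo hung ;;
        *)       echo failed ;;
    esac
}

# Example (not executed here):
#   timeout 30 rabbitmqctl list_channels >/dev/null 2>&1
#   [ "$(ctl_status $?)" = hung ] && echo "return OCF_ERR_GENERIC"
```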
    
    Closes-bug: #1459173
    
    Change-Id: Ia52fc5f2ab7adb36252a7194f9209ab87ce487de
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed May 27, 2015
    b2b60c5
  2. Add second monitor operation to check RabbitMQ

    This commit checks whether there is a running
    cluster of rabbitmq and whether the rabbitmq app is running
    on the node, and exits with a non-zero code if the
    current node is not running rabbitmq but should
    be.
    
    Change-Id: I2098405b39ade7325b94781aeb997de0937bdf4c
    Closes-bug: #1458828
    Vladimir Kuklin committed May 27, 2015
    d0f4a4c

Commits on Jun 3, 2015

  1. Erase mnesia if a rabbit node cannot join the cluster

    W/o this fix, a situation is possible when a
    rabbit node gets stuck in a start/stop loop, failing
    to join the cluster with the error:
    "no_running_cluster_nodes, You cannot leave a cluster
    if no online nodes are present."
    
    This is an issue because the rabbit node should always
    be able to join the cluster if it was ordered to start
    by the pacemaker RA.
    
    The solution is to force the mnesia reset if the
    rabbit node cannot join the cluster on post-start
    notify. Note that a starting master node is not
    reset, so the mnesia will be kept intact
    at least on the resource master.
    
    Partial-bug: #1461509
    
    Change-Id: I69bc13266a1dc784681b2677ae5616bfc28cf54f
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jun 3, 2015
    a57312c

Commits on Jun 12, 2015

  1. Restart rabbit if can't list queues or found memory alert

    W/o this fix, a dead-end situation is possible
    when the rabbit node has no free memory left
    and the cluster blocks all publishing, by design.
    But the app thinks "let's wait for the publish block to be
    lifted" and cannot recover.
    
    The workaround is to monitor the results
    of crucial rabbitmqctl commands and restart the rabbit node
    if queues/channels/alarms cannot be listed or if
    memory alarms are found.
    This is similar to the logic we have for the cases when
    rabbitmqctl list_channels hangs. The channels check is also
    fixed to verify whether the exit code is >0 while the rabbit app
    is running.
    
    The additional checks added to the monitor also require extending
    the timeout window for the monitor action from 60 to 180 seconds.
    
    Besides that, this patch makes the monitor action gather the
    rabbit status and runtime stats, such as the memory consumed by all
    queues out of total Mem+Swap, the total messages in all queues and
    the average queue consumer utilization. This info should help to
    troubleshoot failures better.
    
    DocImpact: ops guide. If any rabbitmq node exceeds its memory
    threshold, publishing becomes blocked cluster-wide, by design.
    In such cases, this rabbit node is recovered from the
    raised memory alert by being stopped immediately, to be restarted
    later by pacemaker. Otherwise, this blocked publishing state
    might never be lifted if the pressure from the OpenStack apps
    side persists.
    
    Closes-bug: #1463433
    
    Change-Id: I91dec2d30d77b166ff9fe88109f3acdd19ce9ff9
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jun 12, 2015
    5415505

Commits on Jul 7, 2015

  1. Fix chowning for rabbit OCF

    W/o this fix, the list of file names not
    accessible by the rabbitmq user will be treated
    as multiple arguments to the test in the if statement,
    causing it to throw the "too many arguments" error and
    the chown command to be skipped.
    
    This is a problem, as it might prevent the rabbitmq
    server from starting because of bad file ownership.
    
    The solution is to pass the list of files as a single
    argument "${foo}".
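A sketch of the quoting fix. How the list of wrongly owned files is collected is left out; `fix_ownership`, the `CHOWN` override (so the sketch can be dry-run with `CHOWN=echo`) and the `rabbitmq:rabbitmq` owner are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sketch. $1 is a space-separated list of paths with the
# wrong owner; CHOWN defaults to `chown` but can be set to `echo`.
fix_ownership() {
    local files="$1"
    # Quoted: `[` always receives exactly two arguments, however many
    # paths $files holds. Unquoted, two or more paths make `[` fail
    # ("too many arguments" / "binary operator expected").
    if [ -n "${files}" ]; then
        # Unquoted on purpose: each path becomes its own chown argument.
        ${CHOWN:-chown} rabbitmq:rabbitmq ${files}
    fi
}
```

Only the test needs the single-argument form; the chown call itself still wants one argument per path.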
    
    Closes-bug: #1472175
    
    Change-Id: I1d00ec3f31cd0f023bd58a4e11e5b31659977229
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jul 7, 2015
    4a7a8e0
  2. Fix error return codes for rabbit OCF

    W/o this fix, a situation is possible when
    rabbit OCF returns OCF_NOT_RUNNING in the hope of a
    future restart of the resource by pacemaker.
    
    But in fact, pacemaker will not trigger a restart action
    if monitor returns "not running". This is an issue,
    as we want the resource restarted.
    
    The solution is to return OCF_ERR_GENERIC instead of
    OCF_NOT_RUNNING when we expect the resource to be restarted
    (which is action stop plus action start).
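A sketch of choosing the monitor result along the lines of this commit and of the later "Return NOT_RUNNING when beam is not RUNNING" commit in this list. The OCF codes are the standard ones (0 success, 1 generic error, 7 not running); `monitor_result` and its yes/no inputs are hypothetical.

```shell
#!/bin/sh
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical sketch: $1 = is beam running (yes/no),
# $2 = is the rabbit app running (yes/no).
monitor_result() {
    local beam_running="$1" app_running="$2"
    if [ "$beam_running" != yes ]; then
        # Nothing is running at all: report a plain stop.
        return "$OCF_NOT_RUNNING"
    fi
    if [ "$app_running" != yes ]; then
        # Beam is up but the app is not: ask for stop + start.
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}
```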
    
    Closes-bug: #1472230
    
    Change-Id: I10c6e43d92cb23596636d86932674b36864d1595
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jul 7, 2015
    64e2098

Commits on Jul 9, 2015

  1. c87ebae

Commits on Jul 13, 2015

  1. Merge "Fix error return codes for rabbit OCF"

    Jenkins authored and openstack-gerrit committed Jul 13, 2015
    f9a87be

Commits on Jul 21, 2015

  1. Implement the dumping of rabbitMQ definitions

    This change leverages the rabbitmq management plugin to dump
    exchanges, queues, bindings, users, virtual hosts, permissions and
    parameters from the running system. Specifically, this change adds the
    following:
    
    * Dumping of rabbitMQ definitions (users/vhosts/exchanges/etc) at
      the end of the deployment
    * The ability for the rabbitmq-server OCF script to restore the
      definitions during rabbitMQ startup.
    * Enabling the rabbitmq admin plugin, but restricting it to localhost
      traffic. This reverts Ic01c26200f6019a8112b1c5fb04a282e64b3b3e6 but
      adds firewall rules to mitigate the issue.
    
    DocImpact: The dump_rabbit_definitions task can be used to back up the
    rabbitmq definitions, and if custom definitions (users/vhosts/etc) are
    created, it must be run or the changes may be lost during a rabbitmq
    failover via pacemaker.
    
    Change-Id: I715f7c2ae527f7e105b9f6b7d82c443e8accf178
    Closes-bug: #1383258
    Related-bug: #1450443
    Co-Authored-By: Alex Schultz <aschultz@mirantis.com>
    stamak and Alex Schultz committed Jul 21, 2015
    a9d8664

Commits on Aug 13, 2015

  1. Fix rabbitmq data restore for large datasets

    Previously we were sending the json backup data on the command line
    which fails when the dataset is large. This change updates the command
    line options for curl to pass the filename directly and let it handle
    the reading of the data.
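A sketch of passing the backup by file name. `--data-binary @FILE` makes curl read the request body from FILE, so the dump size is no longer bounded by the maximum command-line length. `POST /api/definitions` is the RabbitMQ management API endpoint for uploading definitions. The function name, credentials, default host and port 15672 are placeholders, and `CURL` can be overridden for a dry run.

```shell
#!/bin/sh
# Hypothetical sketch: upload a definitions backup without putting the
# JSON itself on the command line.
restore_definitions() {
    local file="$1" user="$2" pass="$3" host="${4:-127.0.0.1}"
    ${CURL:-curl} --silent --show-error --fail \
        -u "${user}:${pass}" \
        -H "Content-Type: application/json" \
        -X POST --data-binary "@${file}" \
        "http://${host}:15672/api/definitions"
}
```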
    
    Change-Id: I37f298279beca06df41fb08e1745602976c6a776
    Closes-Bug: 1383258
    Alex Schultz committed Aug 13, 2015
    44c24cd

Commits on Aug 27, 2015

  1. Add more logs to rabbitmq get_status function

    It's really hard to debug when get_status() returns only
    $OCF_NOT_RUNNING and loses the exit code and error output.
    
    More logs are added to avoid this situation.
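A sketch of keeping both the exit code and the error output of a status command, so the log explains *why* "not running" was reported. `run_and_log` is hypothetical; it logs to stderr where a real agent would use `ocf_log`.

```shell
#!/bin/sh
# Hypothetical sketch: run a command, capture stdout+stderr and the exit
# code, and log both when the command fails.
run_and_log() {
    local out rc
    out=$("$@" 2>&1)
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "'$*' exited ${rc}: ${out}" >&2
    fi
    return "$rc"
}
```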
    
    Related-Bug: #1488999
    
    Change-Id: Id0999235d7be688f55799e2952fe22e97b678ce7
    vikt0rs committed Aug 27, 2015
    fb89b78

Commits on Sep 3, 2015

  1. Detect a last man standing for rabbit OCF agent

    W/o this patch, a race condition is possible
    when there are no running rabbit nodes or no resource
    master. The rabbit nodes will start/stop in an
    endless loop as a result, introducing full downtime
    for the AMQP cluster and the cloud control plane.
    
    The solution is:
    * On post-start/post-promote notify, do nothing if
      either of the following is true:
      - there are no rabbit resources running or no master
      - the list of rabbit resources being started/promoted
        is reported empty
    * For such cases, do not report a resource failure and delegate
      recovery, if needed, to the "running out of the cluster"
      monitor logic.
    * Additionally, report a last man standing when
      there are no running rabbit resources around.
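A sketch of the notify guard. The arguments would typically come from Pacemaker's `OCF_RESKEY_CRM_meta_notify_active_uname`, `OCF_RESKEY_CRM_meta_notify_master_uname` and `OCF_RESKEY_CRM_meta_notify_start_uname` (or `..._promote_uname`); the helper `should_skip_notify` itself is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: succeed (skip the notify) when there are no active
# instances, no master, or an empty list of nodes being started/promoted.
# $1 = active nodes, $2 = master nodes, $3 = nodes being started/promoted.
should_skip_notify() {
    local active="$1" master="$2" changing="$3"
    [ -z "$active" ] || [ -z "$master" ] || [ -z "$changing" ]
}
```

In the agent this would sit at the top of the post-start/post-promote branch, returning success early so recovery is left to the monitor logic.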
    
    Closes-bug: #1491306
    
    Change-Id: If1c62fac26b63410636413c49fce55c35e53dc5f
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Sep 3, 2015
    f72a006

Commits on Sep 4, 2015

  1. Make RabbitMQ OCF script tolerate rabbitmqctl timeouts

    This change makes the OCF script ignore a small number of rabbitmqctl
    timeouts for 'heavy' operations: list_channels, get_alarms and list_queues.
    The number of tolerated timeouts in a row is configured through a new
    variable 'max_rabbitmqctl_timeouts'. By default it is set to 1, i.e.
    rabbitmqctl timeouts are not tolerated at all.
    
    Bug #1487517 is fixed by separating the declarations of the local
    variables 'rc_alarms' and 'rc_queues' from their assignments.
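A sketch of the bug class behind #1487517. `local` is itself a command, so `local var=$(cmd)` leaves `$?` set to the status of `local` (0), not of `cmd`. Declaring first and assigning separately preserves the command's status. The function names are illustrative only.

```shell
#!/bin/sh
# Illustrative only: `local` swallows the exit status of the substitution.
masked() {
    local out=$(false)
    echo $?          # status of `local`, i.e. 0
}

# Declaring first and assigning separately keeps the real status.
preserved() {
    local out
    out=$(false)
    echo $?          # status of `false`, i.e. 1
}
```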
    
    
    Text for Operations Guide:
    
    If other processes on a node where RabbitMQ is deployed
    consume a significant part of the CPU, RabbitMQ starts
    responding slowly to queries by the 'rabbitmqctl' utility. The
    utility is used by RabbitMQ's OCF script to monitor the state of
    RabbitMQ. When the utility fails to return within a pre-defined
    timeout, the OCF script considers RabbitMQ to be down and restarts
    it, which might lead to a limited (several minutes) OpenStack
    downtime. Such restarts are undesirable, as they cause downtime
    without benefit. To mitigate the issue, the OCF script can be told
    to tolerate a certain number of rabbitmqctl timeouts in a row using
    the following command:
      crm_resource --resource p_rabbitmq-server --set-parameter \
          max_rabbitmqctl_timeouts --parameter-value N
    
    Here N should be replaced with the number of timeouts. For instance,
    if it is set to 3, the OCF script will tolerate two rabbitmqctl
    timeouts in a row, but fail if the third one occurs.
    
    By default the parameter is set to 1, i.e. a rabbitmqctl timeout is
    not tolerated at all. The downside of increasing the parameter is that
    if a real issue occurs which causes a rabbitmqctl timeout, the OCF
    script will detect it only after N monitor runs, so the restart, which
    might fix the issue, will be delayed.
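A sketch of the "tolerate N-1 timeouts in a row" rule: the counter grows on each timeout, resets on success, and the check fails once it reaches `max_rabbitmqctl_timeouts`. The real agent keeps this state between monitor runs. Here it lives in a shell variable, and `tolerate_timeout` is a hypothetical name. The log text mirrors the message quoted in the description.

```shell
#!/bin/sh
max_rabbitmqctl_timeouts=${max_rabbitmqctl_timeouts:-1}
timeouts_in_a_row=0

# Hypothetical sketch. $1: "timeout" or "ok".
# Returns 0 to keep running, 1 to report a failure.
tolerate_timeout() {
    if [ "$1" != timeout ]; then
        timeouts_in_a_row=0
        return 0
    fi
    timeouts_in_a_row=$((timeouts_in_a_row + 1))
    if [ "$timeouts_in_a_row" -ge "$max_rabbitmqctl_timeouts" ]; then
        return 1
    fi
    echo "rabbitmqctl timed out ${timeouts_in_a_row} of max. ${max_rabbitmqctl_timeouts} time(s) in a row. Doing nothing for now." >&2
    return 0
}
```

With the default of 1, the first timeout already fails, which matches "not tolerated at all".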
    
    To determine whether a RabbitMQ restart was caused by a rabbitmqctl
    timeout, examine the lrmd.log of the corresponding controller on the
    Fuel master node in the /var/log/docker-logs/remote/ directory. Lines
    like
    "the invoked command exited 137: /usr/sbin/rabbitmqctl list_channels ..."
    indicate a rabbitmqctl timeout. The next line explains whether it
    caused a restart or not. For example:
    "rabbitmqctl timed out 2 of max. 3 time(s) in a row. Doing nothing for now."
    
    DocImpact: user-guide, operations-guide
    
    Closes-Bug: #1479815
    Closes-Bug: #1487517
    Change-Id: I9dec06fc08dbeefbc67249b9e9633c8aab5e09ca
    dmitrymex committed Sep 4, 2015
    8a2afcb
  2. 96d0a34

Commits on Sep 15, 2015

  1. Return NOT_RUNNING when beam is not RUNNING

    Change get_status to return NOT_RUNNING when
    beam is not running. Otherwise, pacemaker
    will get stuck during rabbitmq failover and
    will not attempt to restart the failed resource.
    
    Change-Id: I926a3eafa9968abdf07baa5f2d5c22480300fb30
    Closes-bug: #1484280
    Vladimir Kuklin committed Sep 15, 2015
    4f15f6b

Commits on Sep 22, 2015

  1. Start RabbitMQ app on notify

    On notify, if we detect that we are a part of a cluster we still
    need to start the RabbitMQ application, because it is always
    down after action_start finishes.
    
    Closes-Bug: #1496386
    Change-Id: I307452b687a6100cc4489c8decebbc3dccdbc432
    dmitrymex committed Sep 22, 2015
    64285b3

Commits on Oct 9, 2015

  1. Avoid division operation in shell

    When the data returned from 'rabbitmqctl list_queues' grows a lot
    and awk sums up all the rows, especially for the memory calculation,
    it returns the sum in scientific notation (the example from the bug
    was .15997e+09). Later, when we want to calculate the memory in
    MB instead of bytes, bash arithmetic cannot handle this string.
    
    We can simply avoid the situation by doing the division into MB
    in awk itself, since we don't need the memory in bytes anyway.
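A sketch of doing the conversion inside awk, so the shell only ever sees a small integer. The input is assumed to be one byte count per line (for example the memory column of `rabbitmqctl list_queues memory` with informational lines stripped). `sum_mb` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical sketch: sum byte counts and print whole megabytes.
# printf "%d" truncates, and dividing first keeps the value small, so it
# never reaches the shell in scientific notation.
sum_mb() {
    awk '{ total += $1 } END { printf "%d\n", total / 1048576 }'
}
```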
    
    Closes-Bug: #1503331
    Change-Id: I38d25406b84d0f70ed62101d5fb5ba108bcab8bd
    dims committed Oct 9, 2015
    3b4a81c
  2. Wait for rabbitmq sync before stop/demote actions

    Added a new OCF key stop_time (corresponding to start_time).
    Added a wait_sync function which tries until start_time/2
    for queues on the stopped/demoted node to reach the synced state.
    
    Added an optional [-t timeout] to the su_rabbit_cmd function to
    provide an arbitrary timeout.
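A sketch of a bounded wait in the spirit of wait_sync: poll a check command until it succeeds or a time budget (in seconds) runs out. `wait_until` is hypothetical, and the check it polls, e.g. a `queues_synced` helper, would also be hypothetical. It is called as `wait_until "$budget" 1 queues_synced`.

```shell
#!/bin/sh
# Hypothetical sketch: wait_until BUDGET INTERVAL CMD [ARGS...]
# Runs CMD at least once, then every INTERVAL seconds until it succeeds
# (return 0) or the BUDGET is used up (return 1).
wait_until() {
    local budget="$1" interval="$2"
    shift 2
    # Guard against a zero/negative interval turning into a busy loop.
    [ "$interval" -gt 0 ] || interval=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$budget" -le 0 ]; then
            return 1
        fi
        sleep "$interval"
        budget=$((budget - interval))
    done
}
```

Returning 1 on an exhausted budget lets the caller decide whether to proceed with stop/demote anyway rather than hanging until the Pacemaker action timeout.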
    
    Change-Id: Iae2211b3d477a9603a58d5eacb12e0fba924861a
    Closes-Bug: #1464637
    mattymo committed Oct 9, 2015
    806cfd2

Commits on Oct 12, 2015

  1. Merge "Avoid division operation in shell"

    Jenkins authored and openstack-gerrit committed Oct 12, 2015
    2b3f58e

Commits on Oct 16, 2015

  1. Sync rabbitmq OCF from upstream

    Sync upstream changes back to Fuel downstream
    Source https://github.com/rabbitmq/rabbitmq-server
    version stable/fedfefebaa39a0aeb41cf9328ba44c3a458e4614
    
    Related blueprint upstream-rabbit-ocf
    Closes-bug: #1473015
    
    Change-Id: Ie19c2f071c53b873a359c6c5134e9498c6391e66
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Oct 16, 2015
    da604f9

Commits on Oct 20, 2015

  1. Packages are now "self-hosted": no need for the packaging dir

    ... in the source distribution anymore
    dumbbell committed Oct 20, 2015
    a796ec8

Commits on Oct 21, 2015

  1. Fix the timeout arg for the su_rabbit_cmd

    And fix local bashisms as a little bonus
    Upstream patch rabbitmq/rabbitmq-server#374
    
    Related-bug: #1464637
    
    Change-Id: I13189de9f8abce23673c031d11132e495e1972e3
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Oct 21, 2015
    27a0454

Commits on Oct 22, 2015

  1. Fix piped exit codes expectations and count processing

    * Fix the return codes of get_all_pacemaker_nodes() and
      get_alive_pacemaker_nodes_but() so that none are
      provided, as they are ignored anyway.
    * Fix the return code expectation of the fetched count attribute
      in check_timeouts().
    Upstream patch rabbitmq/rabbitmq-server#374
    
    Closes-bug: #1506440
    
    Change-Id: I44a6cff2ccba1ba53a18da90c9d74cbb6084ca0c
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Oct 22, 2015
    8ca9174

Commits on Oct 23, 2015

  1. b867ae0

Commits on Oct 26, 2015

  1. f0ff141

Commits on Nov 5, 2015

  1. Don't update .erlang.cookie on every run

    The update happens even during no-op commands like 'meta-data' or
    'usage'. During this update there is a short window for a race
    condition: a shell redirection truncates the cookie file, and echo
    writes data there only after a brief period of time. So Erlang may
    read the still-empty file and die with the error "Too short cookie
    string".
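A sketch of an idempotent, race-free cookie update. The file is skipped entirely when its content already matches. Otherwise the new value goes to a temporary file in the same directory, which is then renamed into place, so a concurrent reader sees either the old or the new cookie but never an empty file. `update_cookie` is hypothetical. A real agent would also set the file owner to the rabbitmq user.

```shell
#!/bin/sh
# Hypothetical sketch: update_cookie FILE COOKIE
update_cookie() {
    local cookie_file="$1" cookie="$2" tmp
    # Content already matches: do not touch the file at all.
    if [ -f "$cookie_file" ] && [ "$(cat "$cookie_file")" = "$cookie" ]; then
        return 0
    fi
    # Write next to the target, then rename; a rename within one
    # filesystem replaces the file atomically.
    tmp=$(mktemp "${cookie_file}.XXXXXX") || return 1
    if printf '%s' "$cookie" > "$tmp" && chmod 600 "$tmp" &&
        mv -f "$tmp" "$cookie_file"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}
```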
    
    Change-Id: I4c3201617669f3872145048b77337632cb93558c
    Closes-Bug: #1512754
    binarin committed Nov 5, 2015
    d048c74

Commits on Nov 9, 2015

  1. Fix metadata in OCF HA script

    Looks like copy-paste has gone wrong.
    binarin committed Nov 9, 2015
    05f33de
  2. Don't update cookie on every run of HA OCF script

    The update happens even during no-op commands like 'meta-data' or
    'usage'. During this update there is a short window for a race
    condition: a shell redirection truncates the cookie file, and echo
    writes data there only after a brief period of time. So Erlang may
    read the still-empty file and die with the error "Too short cookie
    string".
    binarin committed Nov 9, 2015
    38afe77
  3. Merge pull request ClusterLabs#411 from binarin/rabbitmq-server-ocf-idempotent-cookie
    
    Don't update cookie on every run of HA OCF script
    michaelklishin committed Nov 9, 2015
    6d8c983

Commits on Nov 12, 2015

  1. Bind rabbitmq, epmd, and management plugin to internal IP

    RabbitMQ itself was already listening on the correct IP
    for controllers, but epmd and management plugin listened
    everywhere (although management was covered by firewall
    rules).
    
    This covers all RabbitMQ server connection binding so that
    all connections are done on the same IP address (with the
    unfortunate side effect of blocking localhost connections).
    
    Removed unused parameter rabbitmq_host from
    nailgun::rabbitmq.
    
    Change-Id: I9bfb8bc85fcd6d4711c4ca9d79745ad2ce7e673a
    Closes-Bug: #1501731
    mattymo committed Nov 12, 2015
    0432228

Commits on Nov 16, 2015

  1. 937890f
  2. Add host_ip field

    Working with RMQ definitions via management plugin
    requires knowing the IP address where it listens.
    
    host_ip parameter will default to 127.0.0.1, but is
    configurable.
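A sketch of the default described above. Pacemaker passes resource parameters to agents as `OCF_RESKEY_<name>`, so `host_ip` arrives as `OCF_RESKEY_host_ip`. Port 15672 is RabbitMQ's default management port and is assumed unchanged. `mgmt_api_url` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical sketch: build a management API URL from host_ip,
# falling back to 127.0.0.1 when the parameter is not set.
mgmt_api_url() {
    echo "http://${OCF_RESKEY_host_ip:-127.0.0.1}:15672/api/$1"
}
```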
    mattymo authored and michaelklishin committed Nov 16, 2015
    222ffcd

Commits on Dec 9, 2015

  1. Merge branch 'stable'

    michaelklishin committed Dec 9, 2015
    9777d8e

Commits on Dec 11, 2015

  1. Add ability to disable HA for RabbitMQ queues

    Add two flags:
     * enable_rpc_ha which enables queue mirroring for RPC queues
     * enable_notifications_ha which enables queue mirroring for
       Ceilometer queues
    
    Since the feature is experimental, both flags are set to true by
    default to preserve current behaviour.
    
    The change is implemented in several steps:
     * the upstream script is changed so that it allows extending the
       list of parameters and uses a policy file to define RabbitMQ
       policies.
     * we add our own version of the OCF script, which wraps around the
       upstream one. It defines the new enable_rpc_ha and
       enable_notifications_ha parameters and passes their values to the
       upstream script.
     * we add our policy file, where we use the introduced parameters
       to decide which policies we should set.
    
    So we will have two OCF scripts for RabbitMQ in our deployment:
     * rabbitmq-server-upstream - the upstream version
     * rabbitmq-server - our extension, which will be used in the
       environment
    
    The upstream version of the script is pushed upstream
    along with an empty policy file, so that other users can define their
    own policies or extend the script if needed. Here are the
    corresponding pull requests:
      rabbitmq/rabbitmq-server#480
      rabbitmq/rabbitmq-server#482
    (both are already merged)
    
    Text for Operations Guide
    
    It is possible to significantly reduce the load which OpenStack puts
    on RabbitMQ by disabling queue mirroring. This can be done separately
    for RPC queues and Ceilometer ones. To disable mirroring for RPC
    queues, execute the following command on one of the controllers:
    
        crm_resource --resource p_rabbitmq-server --set-parameter \
            enable_rpc_ha --parameter-value false
    
    To disable mirroring for Ceilometer queues, execute the following
    command on one of the controllers:
    
        crm_resource --resource p_rabbitmq-server --set-parameter \
            enable_notifications_ha --parameter-value false
    
    In order for any of the changes to take effect, the RabbitMQ service
    must be restarted. To do that, first execute
    
        pcs resource disable master_p_rabbitmq-server
    
    Then monitor RabbitMQ state using command
    
        pcs resource
    
    until it shows that all RabbitMQ nodes are stopped. Once they are,
    execute the following command to start RabbitMQ:
    
        pcs resource enable master_p_rabbitmq-server
    
    Beware: during restart all messages accumulated in RabbitMQ will be
    lost. Also, OpenStack will stop functioning until RabbitMQ is up
    again, so plan accordingly.
    
    Note that it is not yet well tested how this configuration affects
    failover when some cluster nodes go down. Hence it is experimental;
    use at your own risk!
    
    DocImpact:  ops-guide
    
    Implements: blueprint rabbitmq-disable-mirroring-for-rpc
    Change-Id: I80ae231ca64e2a903b0968d36ba0e85ca9cc9891
    dmitrymex committed Dec 11, 2015
    129dbce

Commits on Dec 14, 2015

  1. Merge branch 'stable'

    dumbbell committed Dec 14, 2015
    2d05408
  2. Fix default value for 'use_fqdn' in meta_data

    This change fixes a copy-paste error and pulls in the upstream
    rabbitmq commit c85fdd0f5c54f312fc2147dad2b956961aae3f12.
    
    Closes-Bug: #1526062
    Change-Id: I49e45cd893af8c65ed5ddd3efb834e38737a69a2
    binarin authored and Alex Schultz committed Dec 14, 2015
    2a52905

Commits on Dec 30, 2015

  1. Fix stop conditions for the rabbit OCF resource

    * Fix get_status() unexpectedly reporting a generic error
      instead of "not running"
    * Add proc_stop and proc_kill functions
      (TODO: these should eventually move to external common OCF helpers)
    * Rework stop_server_process()
      - make it return SUCCESS/ERROR as expected
      - grant "rabbitmqctl stop" a graceful termination window and only
        then ensure the beam process is terminated and the pidfile removed
      - return the actual status with get_status()
    * Rework kill_rmq_and_remove_pid()
      - use proc_stop to try to kill by pgrp with -TERM, then -KILL, or
        by matching the beam process name if there is no PID
      - make it return SUCCESS/ERROR
    * Fix action_stop()
      - fail early based on the stop_server_process() result, without
        additional rabbitmqctl invocations in the get_status() call
      - replace the hard-coded sleep 10 with the graceful stop window in
        stop_server_process()
      - ensure rabbit-start-time is removed from the CIB before trying to
        stop the server process
      - issue the "stop: action end" log record before the actual end
    * Add comments and make logs more informative
    
    Related Fuel bug https://bugs.launchpad.net/fuel/+bug/1529897
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Co-authored-by: Alex Schultz <aschultz@mirantis.com>
    Bogdan Dobrelya and Alex Schultz committed Dec 30, 2015
    df33e89
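The escalation described in the commit above (a graceful TERM, a grace window, then KILL) can be illustrated with a minimal POSIX shell sketch. The function stop_pid_escalating and its arguments are hypothetical; the agent's actual proc_stop() also signals process groups and falls back to name matching, which this sketch leaves out.

```shell
#!/bin/sh
# Hypothetical helper (not the agent's proc_stop): stop a single PID
# gracefully and escalate to SIGKILL once the grace window is used up.
#   $1 - PID to stop
#   $2 - grace window in seconds after SIGTERM
# Returns 0 if the process is gone afterwards, 1 otherwise.
stop_pid_escalating() {
    pid=$1
    grace=$2

    # Already gone: nothing to do.
    if ! kill -0 "$pid" 2>/dev/null; then
        return 0
    fi

    # Ask politely first; '|| true' covers the race where it just exited.
    kill -TERM "$pid" 2>/dev/null || true
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        if ! kill -0 "$pid" 2>/dev/null; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done

    # Grace window exhausted: SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null || true
    sleep 1
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}
```

Polling with kill -0 instead of sleeping for a fixed interval lets the function return as soon as the process exits, which is the same motivation as replacing the hard-coded sleep 10.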

Commits on Dec 31, 2015

  1. Ensure rabbit node uptime is reset in the CIB for OCF resource

    * Add ocf_run wrappers and info log messages for CIB attribute events
    * Move "fast" CIB attribute updates before "heavy" operations like
      start/stop/wait, to keep the CIB consistent even if the timeouts
      for those operations are exceeded
    * Delete the master and start time attributes from the CIB on
      action_start, to ensure rabbit node uptimes are evaluated correctly
      when electing a new master for the corresponding Pacemaker resources
    * For post-demote notify and action_demote(), delete the master
      attribute from the CIB as well.
    * For post-start notify, update the start time in the CIB even when
      the node is already clustered. Otherwise it would keep running
      in the cluster without a registered start time, which disrupts
      new master elections.
    * Fix a wrong log message when a node joins
    
    Related Fuel bug https://bugs.launchpad.net/fuel/+bug/1530150
    https://bugs.launchpad.net/fuel/+bug/1530296
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Dec 31, 2015
    48d7106
  2. Fix rabbit OCF log message when joining by a node

    Closes-bug: #1530296
    
    Change-Id: Id2258da4f272dc8eca92130d45ecb69a16ed7c35
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Dec 31, 2015
    6382b99

Commits on Jan 7, 2016

  1. Remove unneeded sleep for a graceful stop by PID

    The sleep is not needed; according to
    https://www.rabbitmq.com/man/rabbitmqctl.1.man.html,
    "If a pid_file is specified, also waits for the process
    specified there to terminate."
    
    Related Fuel bug https://launchpad.net/bugs/1529897
    Related PR
    rabbitmq/rabbitmq-server#523
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jan 7, 2016
    b2cea03

Commits on Jan 11, 2016

  1. Syntax and local vars usage fixes to OCF HA

    Related Fuel bug:
    https://launchpad.net/bugs/1529897
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jan 11, 2016
    e4255b2
  2. Fix proc_kill then there is no pid found

    Without this fix, the rabbit OCF cannot make proc_stop try to
    kill a pid-less beam process by name matching, because
    proc_kill()'s first parameter cannot be passed empty.
    
    The fix is to pass the value "none" when the pid-less
    process must be matched by service_name instead.
    
    Also, fix proc_kill to handle multi-process pid files as well
    (which contain several space-separated PIDs).
    
    Related Fuel bugs:
    https://launchpad.net/bugs/1529897
    https://launchpad.net/bugs/1532723
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jan 11, 2016
    051514a
  3. Fix get_status, action_stop, proc_stop then beam's unresponsive

    * Fix get_status() to catch beam state and output errors
    * Fix action_stop() to force name-based matching when there is no
    pidfile and beam is unresponsive
    * Fix proc_stop to use name-based matching if no pidfile is
    found
    * Fix proc_stop to retry sending the signal when using the name-based
    match as well
    
    Without this patch, the following situations are possible:
    - beam is running but cannot process signals, yet get_status() reports
    it as "not running", while it should be reported as a generic error
    - which_applications() returns an error, but its output is still
    parsed for the "what" match, which it should not be
    - action_stop and proc_stop give up when there is no pidfile and beam
    is running but unresponsive.
    
    The solution is to make get_status return a generic error and make
    action_stop kill the rabbit process by name matching.
    
    Related Fuel bug:
    https://bugs.launchpad.net/fuel/+bug/1529897
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jan 11, 2016
    0f975e5
  4. Fix monitor/stop operations for the rabbit OCF resource

    Without this fix, the following situations are possible:
    - beam is running but cannot process signals, yet get_status() reports
    it as "not running", while it should be reported as a generic error
    - which_applications() returns an error, but its output is still
    parsed for the "what" match, which it should not be
    - action_stop and proc_stop give up when there is no pidfile and beam
    is running but unresponsive.
    
    The solution is to make get_status return a generic error and make
    action_stop kill the rabbit process by name matching. These and
    other related fixes are listed below (tl;dr)
    
    * Fix get_status, action_stop and proc_stop when beam is unresponsive
      (i.e. fails to process signals, or does so very slowly)
      - Fix get_status() to catch beam state and output errors
      - Fix action_stop() to force name-based matching when there is no
        pidfile and beam is unresponsive
      - Fix proc_stop to use name-based matching if no pidfile is
        found
      - Fix proc_stop to retry sending the signal when using the name-based
        match as well
    * Fix get_status() unexpectedly reporting a generic error
      instead of "not running"
    * Add reworked proc_stop and proc_kill functions from
      ocf-fuel-funcs
    * Rework stop_server_process()
      - make it return SUCCESS/ERROR as expected
      - grant "rabbitmqctl stop" a graceful termination window and only
        then ensure the beam process is terminated and the pidfile removed
      - return the actual status with get_status()
    * Rework kill_rmq_and_remove_pid()
      - use proc_stop to try to kill by pgrp with -TERM, then -KILL, or
        by matching the beam process name if there is no PID
      - make it return SUCCESS/ERROR
    * Fix action_stop()
      - fail early based on the stop_server_process() result, without
        additional rabbitmqctl invocations in the get_status() call
      - replace the hard-coded sleep 10 with the graceful stop window in
        stop_server_process()
      - ensure rabbit-start-time is removed from the CIB before trying to
        stop the server process
      - issue the "stop: action end" log record before the actual end
    * Add comments, adjust log levels and make logs more informative
    
    Upstream PRs
    rabbitmq/rabbitmq-server#523
    rabbitmq/rabbitmq-server#532
    rabbitmq/rabbitmq-server#538
    rabbitmq/rabbitmq-server#540
    
    Closes-bug: #1529897
    
    Change-Id: I1c382e3cf004630847b6626fabaecaa0094ee271
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jan 11, 2016
    40fc6d9
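The get_status() rules described in these commits can be summarized as a small decision function. This is a sketch under stated assumptions: classify_rabbit_status is a hypothetical name, the caller is assumed to have already looked for a beam process and run the which_applications() probe, and returning OCF_NOT_RUNNING when the probe succeeds but does not list the rabbit application is this sketch's choice rather than something the commit messages state. The OCF_* values are the standard OCF resource agent exit codes.

```shell
#!/bin/sh
# Standard OCF resource agent exit codes used below.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical decision helper mirroring the rules described above.
#   $1 - 1 if a beam process was found, 0 otherwise
#   $2 - exit code of the which_applications() probe
#   $3 - probe output; only inspected when the probe succeeded
classify_rabbit_status() {
    beam_found=$1
    probe_rc=$2
    probe_out=$3

    # No beam at all: the resource is genuinely not running.
    if [ "$beam_found" -eq 0 ]; then
        return "$OCF_NOT_RUNNING"
    fi

    # beam exists but the probe failed (e.g. beam is unresponsive): do not
    # parse the output and do not claim "not running"; report an error.
    if [ "$probe_rc" -ne 0 ]; then
        return "$OCF_ERR_GENERIC"
    fi

    # Probe succeeded: look for the rabbit application tuple in its output.
    case "$probe_out" in
        *'{rabbit,'*) return "$OCF_SUCCESS" ;;
        *) return "$OCF_NOT_RUNNING" ;;
    esac
}
```

Reporting OCF_ERR_GENERIC for a live but unresponsive beam is what lets the stop path go on to kill it, instead of Pacemaker assuming there is nothing left to stop.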

Commits on Jan 12, 2016

  1. Ensure rabbit node uptime is reset in the CIB for OCF resource

    * Add ocf_run wrappers and info log messages for CIB attribute events
    * Move "fast" CIB attribute updates before "heavy" operations like
      start/stop/wait, to keep the CIB consistent even if the timeouts
      for those operations are exceeded
    * Delete the master and start time attributes from the CIB on
      action_start, to ensure rabbit node uptimes are evaluated correctly
      when electing a new master for the corresponding Pacemaker resources
    * For post-demote notify and action_demote(), delete the master
      attribute from the CIB as well.
    * For post-start notify, update the start time in the CIB even when
      the node is already clustered. Otherwise it would keep running
      in the cluster without a registered start time, which disrupts
      new master elections.
    
    Upstream PR rabbitmq/rabbitmq-server#524
    Closes-bug: #1530150
    
    Change-Id: I9db3c819031cef620377b4fee08ea92e90b11c70
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jan 12, 2016
    8c4d847

Commits on Jan 14, 2016

  1. Fix rabbitMQ OCF monitor detection of running master

    When the monitor detected the node as OCF_RUNNING_MASTER, this status
    could be lost while the monitor checks were still in progress.
    * Replace prev_rc with rc_check to fix this.
    * Also add an info log message when the node is detected as the
      running master.
    * Break out of the monitor check loop early if the node is going to
      exit and be restarted by Pacemaker.
    * Do not recheck the master status and do not update the master score
      if the monitor already detected the node as OCF_RUNNING_MASTER.
      By that point, the running and healthy master should not be checked
      against other nodes' uptimes, as that is pointless and only makes
      the monitor action take more time and resources to finish.
    * Fail early if the monitor detected the node as OCF_RUNNING_MASTER
      but the rabbit beam process is not running
    * For OCF_CHECK_LEVEL>20, exclude the current node from the check
      loop, as it has already been checked
    
    Closes-bug: #1531838
    
    Change-Id: I319db307c73ef24d829be44eeb63d1f52f4180fa
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jan 14, 2016
    a6bcc9a
  2. Fix rabbitMQ OCF monitor detection of running master

    When the monitor detected the node as OCF_RUNNING_MASTER, this status
    could be lost while the monitor checks were still in progress.
    * Replace prev_rc with rc_check to fix this.
    * Also add an info log message when the node is detected as the
      running master.
    * Break out of the monitor check loop early if the node is going to
      exit and be restarted by Pacemaker.
    * Do not recheck the master status and do not update the master score
      if the monitor already detected the node as OCF_RUNNING_MASTER.
      By that point, the running and healthy master should not be checked
      against other nodes' uptimes, as that is pointless and only makes
      the monitor action take more time and resources to finish.
    * Fail early if the monitor detected the node as OCF_RUNNING_MASTER
      but the rabbit beam process is not running
    * For OCF_CHECK_LEVEL>20, exclude the current node from the check
      loop, as it has already been checked
    
    Related Fuel bug:
    https://launchpad.net/bugs/1531838
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Jan 14, 2016
    b7815e4
  3. Introduce node name prefix for mgmt/messaging IPs

    RabbitMQ will resolve <prefix>-<fqdn> hostnames to valid mgmt/messaging IPs
    
    Change-Id: Ifc2af16b08663655d365587ea6f45c87bfc68698
    Depends-On: I9813fa8c20d47e0ef1e251fe5ac8d01d08fe7703
    Closes-bug: #1528707
    galanoff committed Jan 14, 2016
    c4c4ae7

Commits on Jan 18, 2016

  1. Add optional prefix for RabbitMQ node FQDNs

    This allows instantiating multiple rabbit clusters built
    from prefix-based instances of rabbit nodes.
    galanoff committed Jan 18, 2016
    5a9d7ce
  2. a88ec6c
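As a rough illustration of the prefix scheme from the two commits above, the helper below builds a node name from an optional prefix. It is hypothetical: the rabbit@ base name is the usual RabbitMQ default and is assumed here, and making "<prefix>-<fqdn>" resolve to the mgmt/messaging IP is left to DNS or /etc/hosts.

```shell
#!/bin/sh
# Hypothetical helper: build a RabbitMQ node name from an optional prefix
# and the host FQDN. The "rabbit" base name is an assumed default; name
# resolution of "<prefix>-<fqdn>" is expected to be set up elsewhere.
rabbit_node_name() {
    prefix=$1
    fqdn=$2
    if [ -n "$prefix" ]; then
        printf 'rabbit@%s-%s\n' "$prefix" "$fqdn"
    else
        printf 'rabbit@%s\n' "$fqdn"
    fi
}
```

Instances started with different prefixes get distinct node names on the same host, which is what allows separate clusters to be assembled from them.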

Commits on Jan 19, 2016

  1. Reset master score if we decide to restart RabbitMQ on timeout

    Otherwise the restart might not be triggered even though it is
    clearly needed.
    dmitrymex committed Jan 19, 2016
    e47d4a9
  2. Reset master score if we decide to restart RabbitMQ on timeout

    Otherwise the restart might not be triggered even though it is
    clearly needed.
    
    Upstream PR: rabbitmq/rabbitmq-server#560
    
    Change-Id: I480ebaddc98fa0784098efbf0c5ab8c512c8661d
    Closes-Bug: #1513421
    dmitrymex committed Jan 19, 2016
    f958037

Commits on Jan 20, 2016

  1. Improve rabbitmq OCF script diagnostics

    Currently, a timeout when running 'rabbitmqctl list_channels' is
    treated as a sign that the current node is unhealthy. But that may not
    be the case, as the hanging channel could actually be on some other
    node. Given that we currently have more than one bug related to
    'list_channels', it makes sense to improve diagnostics here.
    
    This patch doesn't change any behaviour; it only improves logging
    after a timeout happens. If timeouts continue to occur (even with the
    latest rabbitmq versions or with backported fixes), we could switch to
    this improved list_channels and kill rabbitmq only if stuck channels
    are located on the current node. But I hope that all related rabbitmq
    bugs have already been closed.
    binarin committed Jan 20, 2016
    c0c6480
  2. Improve 'list_channels' diagnostics in OCF

    The timeout(1) manpage mentions 124 as another valid return code from timeout, in addition to 128 + signal-number.
    binarin committed Jan 20, 2016
    f79c7a6
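The return-code handling mentioned above can be sketched as a helper that recognises both forms of timeout status. is_timeout_rc is hypothetical; GNU timeout exits with 124 when the command times out, and with 128+9=137 when the KILL signal has to be used. Treating every status above 128 as a timeout is a simplification made for this sketch.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an exit status from timeout(1) indicates
# that the wrapped command timed out.
#   124        - timed out (GNU timeout, --preserve-status not given)
#   128 + N    - terminated by signal N (e.g. 137 when KILL was used)
is_timeout_rc() {
    if [ "$1" -eq 124 ]; then
        return 0
    fi
    if [ "$1" -gt 128 ]; then
        return 0
    fi
    return 1
}
```

A caller could then run something like `rc=0; timeout 30 rabbitmqctl list_channels >/dev/null || rc=$?` and log extra diagnostics when `is_timeout_rc "$rc"` succeeds; the 30-second limit is only illustrative.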

Commits on Jan 21, 2016

  1. Merge pull request #563 from binarin/rabbitmq-server-ocf-list-channels-diagnostics
    
    Improve OCF script diagnostics for timed-out 'list_channels'
    michaelklishin committed Jan 21, 2016
    4440c79
  2. aaade82
  3. Fix uninitialized variable in rabbitmq script

    Upstream: rabbitmq/rabbitmq-server#571
    
    The shell sometimes complained at line 1447 due to an empty `rc_check`
    
    Change-Id: I9411fbc41f8ebf6ac41504ff7456ee7952485564
    Partial-Bug: #1531838
    binarin committed Jan 21, 2016
    505f048

Commits on Jan 25, 2016

  1. Improve OCF script diagnostics for timed-out 'list_channels'

    Upstream PR: rabbitmq/rabbitmq-server#563
    
    Currently, a timeout when running 'rabbitmqctl list_channels' is
    treated as a sign that the current node is unhealthy. But that may not
    be the case, as the hanging channel could actually be on some other
    node. Given that we have seen more than one bug related to
    'list_channels', it makes sense to improve diagnostics here.
    
    This patch doesn't change any behaviour; it only improves logging
    after a timeout happens. If timeouts continue to occur (even with the
    latest rabbitmq versions or with backported fixes), we could switch to
    this improved list_channels and kill rabbitmq only if stuck channels
    are located on the current node. But I hope that all related rabbitmq
    bugs have already been closed.
    
    Change-Id: I4746d3a4e85dc2a51af581034ae09a1cf0eefce2
    Partial-Bug: #1515223
    Partial-Bug: #1513511
    binarin committed Jan 25, 2016
    ffe2ad4

Commits on Jan 26, 2016

  1. 0cc3bb6

Commits on Feb 2, 2016

  1. Suppress curl progress indicator in rabbit OCF

    curl is used by the OCF script to fetch definitions (queues etc.), but
    the output of that invocation shows up as garbage in the Pacemaker
    logs: a progress indicator makes no sense in logs.
    
    According to the curl manpage, the combination "--silent --show-error"
    should be used: it suppresses only the progress indicator, while errors
    are still shown.
    
    Also, other short curl options are replaced with their long
    counterparts for improved readability.
    binarin committed Feb 2, 2016
    afc03f6
  2. Fix uninitialized status_master

    Fix multiple nodes possibly being reported in logs as the running master
    
    Related Fuel bug https://bugs.launchpad.net/bugs/1540936
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Feb 2, 2016
    f1a4c73
  3. Fix cluster membership check for running master

    The running master is always inside its own cluster.
    Fix the cluster membership check when a node is the master.
    bogdando committed Feb 2, 2016
    2078fa9

Commits on Feb 3, 2016

  1. Fix uninitialized status_master

    Fix multiple nodes possibly being reported in logs as the running master
    
    Closes-bug: #1540936
    
    Change-Id: Ic2dfe7b2ba657b9bf06d97f49ddb4b69f2f4e063
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya authored and dmitrymex committed Feb 3, 2016
    37cc8b4
  2. Streamline checking for cluster partitioning

    Move the check for whether we are the current cluster master to an
    earlier place in the code. That way we avoid unnecessary operations
    in the master case.
    dmitrymex committed Feb 3, 2016
    6183b23

Commits on Feb 4, 2016

  1. Fix action_stop for the rabbit OCF

    action_stop may sometimes stop rabbitmq-server gracefully by PID,
    but leave unresponsive beam.smp processes running and spoiling
    rabbits. Those must be stopped as well. The solution is:
    - make proc_stop() accept pid=none to use name matching instead
    - make kill_rmq_and_remove_pid() also stop processes by matching the beam process name
    - fix stop_server_process() to ensure no beam process is left running
    
    Closes-bug: #1541029
    
    Change-Id: Ib9669d15bb714be8a88fd65d7f1815173da788d3
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Feb 4, 2016
    9b7d9c5
  2. Fix action_stop for the rabbit OCF

    action_stop may sometimes stop rabbitmq-server gracefully by PID,
    but leave unresponsive beam.smp processes running and spoiling
    rabbits. Those must be stopped as well. The solution is:
    - make proc_stop() accept pid=none to use name matching instead
    - make kill_rmq_and_remove_pid() also stop processes by matching the beam process name
    - fix stop_server_process() to ensure no beam process is left running
    
    Related Fuel bug: https://launchpad.net/bugs/1541029
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Feb 4, 2016
    19e931a
  3. Merge "Fix action_stop for the rabbit OCF"

    Jenkins authored and openstack-gerrit committed Feb 4, 2016
    77fdb12
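The pid=none convention from the commits above (and from the earlier "Fix proc_kill then there is no pid found" change) amounts to a small target-resolution step. The sketch below is a hypothetical illustration, not the real proc_kill() interface: resolve_kill_targets only prints what would be signalled.

```shell
#!/bin/sh
# Hypothetical sketch of how a proc_kill-style helper can pick its targets.
#   $1 - "none", or one or more space-separated PIDs read from a pidfile
#   $2 - service/process name to match when no PID is available
# Prints one target per line: "pid:<N>" or "name:<service>".
resolve_kill_targets() {
    pids=$1
    service_name=$2

    # "none" stands in for "no PID", since the commit notes that an empty
    # first parameter cannot be passed.
    if [ "$pids" = "none" ]; then
        printf 'name:%s\n' "$service_name"
        return 0
    fi

    # Multi-process pidfiles hold several space-separated PIDs; the unquoted
    # expansion splits them into separate words on purpose.
    for p in $pids; do
        printf 'pid:%s\n' "$p"
    done
}
```

A name target could then be handed to a name-matching tool such as pkill -f, while PID targets are signalled individually.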

Commits on Feb 10, 2016

  1. Do not check cluster health if master is not elected

    Doing otherwise causes the node to restart when get_monitor is called
    within action_promote: it does not find a master and assumes that
    it is running outside the cluster.
    
    Also, the code is refactored a little: a new function returning
    the current master is added and used in the changed code.
    
    Closes-Bug: #1543154
    Change-Id: If14fcfc915d76c9580be0a097b250d79cf953b9e
    dmitrymex committed Feb 10, 2016
    f5ed86e
  2. Exit waiting loop once node has unjoined

    Without the break we always wait for 50 seconds, even if we don't need
    to wait at all.
    
    Change-Id: Ib361fbac714d61056f4b9d71f23bb74af33abf77
    dmitrymex committed Feb 10, 2016
    95a2b63
  3. On neighbor promotion do nothing if we are already clustered

     + extracted a function checking whether we are in the same cluster
       as a given node
    
     + made post-promote ignore promotion of self. Previously it was
       done inside jjj_join, but now we need to do that before the
       new check.
    
     + now we write the "post-promote end" log entry at the very
       end of post-promote, not somewhere in the middle.
    
    Closes-Bug: #1544036
    Change-Id: Id28d6c94abe5d96452f7ecba2b3fe022f40afa0d
    dmitrymex committed Feb 10, 2016
    0e0feb6
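The post-promote ordering described in the last commit above (ignore our own promotion first, then skip nodes that are already clustered) can be expressed as a small decision helper. post_promote_action and its arguments are hypothetical; in the agent, the "same cluster" input would come from the extracted membership-check function.

```shell
#!/bin/sh
# Hypothetical decision helper for a post-promote notification.
#   $1 - name of the node that was just promoted
#   $2 - name of the local node
#   $3 - 1 if the local node is already in the same cluster as $1, else 0
# Prints the action to take.
post_promote_action() {
    promoted=$1
    self=$2
    same_cluster=$3

    # Our own promotion: there is nothing to join. Checked first.
    if [ "$promoted" = "$self" ]; then
        echo "ignore-self"
        return 0
    fi
    # Already clustered with the new master: do nothing.
    if [ "$same_cluster" -eq 1 ]; then
        echo "already-clustered"
        return 0
    fi
    echo "join"
}
```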

Commits on Feb 16, 2016

  1. Merge "Exit waiting loop once node has unjoined"

    Jenkins authored and openstack-gerrit committed Feb 16, 2016
    3f88cd2
  2. Exit waiting loop once node has unjoined

    Without the break we always wait for 50 seconds, even if we don't need
    to wait at all.
    dmitrymex committed Feb 16, 2016
    e99b09a
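The fix above replaces an unconditional wait with a loop that exits once the node has unjoined. A generic version of that pattern is sketched below; wait_until is a hypothetical helper that takes the condition as a command.

```shell
#!/bin/sh
# Hypothetical polling helper: run a condition command once per second for
# at most max_wait seconds and return as soon as it succeeds, instead of
# always sleeping through the whole window.
#   $1   - maximum number of seconds to wait
#   $2.. - condition command and its arguments
wait_until() {
    max_wait=$1
    shift
    waited=0
    while [ "$waited" -lt "$max_wait" ]; do
        if "$@"; then
            return 0   # condition met: stop waiting early
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1           # window exhausted without success
}
```

In the agent's case the condition would be a check that the node has left the cluster, for example a hypothetical node_has_unjoined function called as `wait_until 50 node_has_unjoined`.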

Commits on Feb 19, 2016

  1. Private attributes usage in rabbitmq script

    There are three types of rabbitmq attributes for pacemaker nodes:
      - 'rabbit-master'
      - 'rabbit-start-time'
      - timeouts:
        - 'rabbit_list_channels_timeouts'
        - 'rabbit_get_alarms_timeouts'
        - 'rabbit_list_queues_timeouts'
    
    The 'rabbit-master' and 'rabbit-start-time' attributes should be
    public because our script monitors these attributes in a loop for
    all nodes.
    
    All timeout attributes were changed to private to avoid unnecessary
    transitions.
    
    Also, the --lifetime and --node options were removed from the
    attrd_updater calls, as the lifetime for this command is always
    'reboot' and 'node' defaults to the local node.
    lefremova committed Feb 19, 2016
    d0e7389
  2. Merge pull request #639 from lefremova/stable

    Private attributes usage in rabbitmq script
    michaelklishin committed Feb 19, 2016
    277a1d4

Commits on Feb 24, 2016

  1. Private attributes usage in rabbitmq script

    There are three types of rabbitmq attributes for pacemaker nodes:
      - 'rabbit-master'
      - 'rabbit-start-time'
      - timeouts:
        - 'rabbit_list_channels_timeouts'
        - 'rabbit_get_alarms_timeouts'
        - 'rabbit_list_queues_timeouts'
    
    The 'rabbit-master' and 'rabbit-start-time' attributes should be
    public because our script monitors these attributes in a loop for
    all nodes.
    
    All timeout attributes were changed to private to avoid unnecessary
    transitions.
    
    Also, the --lifetime and --node options were removed from the
    attrd_updater calls, as the lifetime for this command is always
    'reboot' and 'node' defaults to the local node.
    
    Closes-bug: #1524672
    Change-Id: Ie45ae3a82b8daa35dbdd977dc894877160af457b
    lefremova committed Feb 24, 2016
    478fd4a

Commits on Feb 25, 2016

  1. [OCF HA] Increase tolerable number of rabbitmqctl timeouts

    We still see that rabbitmqctl list_channels times out from time
    to time, even though the RabbitMQ cluster is perfectly healthy in
    every other respect.
    
    Setting max_rabbitmqctl_timeouts to 3 seems to be a sane default
    to help avoid unnecessary restarts.
    dmitrymex committed Feb 25, 2016
    9c8e3da
  2. [OCF HA] Log process id in RabbitMQ OCF script

    Several OCF calls might run simultaneously. For example, it often
    happens that two monitor calls overlap. Logging the current process
    id on each line helps distinguish the logs of different calls.
    
    Also aligned get_status() logging with the format used in all other
    parts of the script.
    dmitrymex committed Feb 25, 2016
    88afa77
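The tolerance added in the first commit above can be sketched as a consecutive-timeout counter. Both helpers are hypothetical; persisting the counter between monitor calls (the agent keeps such counts in node attributes like rabbit_list_channels_timeouts) is omitted, and whether the agent fails at exactly max_rabbitmqctl_timeouts or only after exceeding it is an assumption here.

```shell
#!/bin/sh
# Tolerance threshold; 3 is the default chosen in the commit above.
max_rabbitmqctl_timeouts=3

# Hypothetical helper: compute the new consecutive-timeout counter.
#   $1 - current counter
#   $2 - 1 if the latest rabbitmqctl call timed out, 0 otherwise
# A successful call resets the counter to 0.
next_timeout_count() {
    if [ "$2" -eq 1 ]; then
        echo $(( $1 + 1 ))
    else
        echo 0
    fi
}

# Hypothetical helper: succeed once the counter reaches the threshold,
# i.e. when the node should be treated as failed (>= is assumed).
too_many_timeouts() {
    [ "$1" -ge "$max_rabbitmqctl_timeouts" ]
}
```

Resetting on success means only an unbroken run of timeouts triggers a restart, so an occasional slow list_channels call is tolerated.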

Commits on Feb 26, 2016

  1. [OCF HA] Do not check cluster health if master is not elected

    Doing otherwise causes the node to restart when get_monitor is called
    within action_promote: it does not find a master and assumes that
    it is running outside the cluster.
    
    Also, the code is refactored a little: a new function returning
    the current master is added and used in the changed code.
    dmitrymex committed Feb 26, 2016
    3ac28c4
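The guard described above relies on a helper that returns the current master. A minimal sketch follows; get_current_master and cluster_health_check_needed are hypothetical, and both the "<node> <value>" input format and the value "true" for the 'rabbit-master' attribute are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the currently elected master from
# "<node> <rabbit-master value>" lines on stdin. Prints nothing when no
# master is elected.
get_current_master() {
    while read -r node is_master; do
        if [ "$is_master" = "true" ]; then
            printf '%s\n' "$node"
            return 0
        fi
    done
    return 0
}

# Hypothetical guard: only check cluster health once a master exists.
cluster_health_check_needed() {
    [ -n "$1" ]
}
```

With this guard, a monitor call made from action_promote before any master is elected skips the cluster-health check instead of concluding that the node is running outside the cluster.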

Commits on Feb 29, 2016

  1. Increase tolerable number of rabbitmqctl timeouts

    We still see that rabbitmqctl list_channels times out from time
    to time, even though the RabbitMQ cluster is perfectly healthy in
    every other respect.
    
    Setting max_rabbitmqctl_timeouts to 3 seems to be a sane default
    to help avoid unnecessary restarts.
    
    Upstream PR: rabbitmq/rabbitmq-server#650
    
    Closes-Bug: #1550293
    Change-Id: I6b0686ef66ba3966e03c8706594f473e9ab01145
    dmitrymex committed Feb 29, 2016
    86d375f
  2. [OCF HA] On neighbor promotion do nothing if we are already clustered

     + extracted a function checking whether we are in the same cluster
       as a given node
    
     + made post-promote ignore promotion of self. Previously it was
       done inside jjj_join, but now we need to do that before the
       new check.
    
     + now we write the "post-promote end" log entry at the very
       end of post-promote, not somewhere in the middle.
    dmitrymex committed Feb 29, 2016
    7a03700
  3. Suppress curl progress indicator in rabbit OCF

    Upstream PR: rabbitmq/rabbitmq-server#597
    
    curl is used by the OCF script to fetch definitions (queues etc.), but
    the output of that invocation shows up as garbage in the Pacemaker
    logs: a progress indicator makes no sense in logs.
    
    According to the curl manpage, the combination "--silent --show-error"
    should be used: it suppresses only the progress indicator, while errors
    are still shown.
    
    Also, other short curl options are replaced with their long
    counterparts for improved readability.
    
    Change-Id: I5ae35b3f76dc33be68c79f5dc983f0c779529fb9
    Closes-Bug: #1540831
    binarin committed Feb 29, 2016
    b52f1ed
  4. 492c853
  5. b90d128
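
The max_rabbitmqctl_timeouts idea from the first commit in this list can be sketched as a counter of consecutive timeouts. This is a minimal sketch under assumptions, not the RA's code: the function name `check_timeouts_sketch`, the in-memory counter, and treating exit code 124 (what coreutils `timeout` returns when it kills a command) as "timed out" are all hypothetical; the real script keeps its counters in Pacemaker node attributes.

```shell
#!/bin/sh
# Minimal sketch (hypothetical names): tolerate a few consecutive
# rabbitmqctl timeouts before treating the node as failed.
MAX_RABBITMQCTL_TIMEOUTS=3   # the default proposed in the commit
timeouts=0                   # assumption: kept in memory, not in attrd

# $1: exit code of the last rabbitmqctl call.
# Assumption: 124 means the call was killed by timeout(1).
check_timeouts_sketch() {
    if [ "$1" -ne 124 ]; then
        timeouts=0           # a call that did not time out resets the counter
        return 0
    fi
    timeouts=$((timeouts + 1))
    if [ "$timeouts" -gt "$MAX_RABBITMQCTL_TIMEOUTS" ]; then
        return 1             # too many timeouts in a row: report failure
    fi
    return 0                 # tolerate this timeout
}
```

With this rule, three timed-out calls in a row are tolerated and the fourth one fails the check; any call that completes resets the count.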

Commits on Mar 2, 2016

  1. 3b56284

Commits on Mar 4, 2016

  1. Log process id in RabbitMQ OCF script

    Several OCF calls might run simultaneously. For example, two monitor
    calls often overlap. Logging the current process id on each line
    helps distinguish the logs of different calls.
    
    Also aligned get_status() logging with the format used in all other
    parts of the script.
    
    Upstream PR: rabbitmq/rabbitmq-server#653
    
    Closes-Bug: 1553089
    Change-Id: Icbaeb560021f70ef13e062cb79fe2cba84e33dce
    dmitrymex committed Mar 4, 2016
    f165474
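
A minimal sketch of the idea behind "Log process id in RabbitMQ OCF script": prefix each log line with `$$`, the PID of the shell running the current OCF action, so that two overlapping monitor calls can be told apart. `log_with_pid` is a hypothetical helper; the real script sends its messages through `ocf_log`.

```shell
#!/bin/sh
# Hypothetical helper: prepend the PID of the current OCF invocation.
# $$ expands to the PID of the script's shell, even inside $(...).
log_with_pid() {
    printf '[%s] %s\n' "$$" "$*"
}

log_with_pid "monitor: checking rabbit app status"
```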

Commits on Mar 7, 2016

  1. Merge pull request #653 from dmitrymex/log-pid

    [OCF HA] Log process id in RabbitMQ OCF script
    michaelklishin committed Mar 7, 2016
    7979d71

Commits on Mar 9, 2016

  1. e6adbe9

Commits on Mar 11, 2016

  1. Revert "Merge "Private attributes usage in rabbitmq script""

    This reverts commit 686bed1b4f090d7f6fd368b94a5ced12c8e28744, reversing
    changes made to d42a753d75dc419c123de257a974ca9c175789f7.
    
    Change-Id: I56ce3671558cf12ab7ce7d616e14cf27f3adb5f1
    Closes-bug: #1556123
    Bogdan Dobrelya committed Mar 11, 2016
    e9b3c7d
  2. Revert "Private attributes usage in rabbitmq script"

    This reverts commit 4aeaa79bc566c81bc7f5c20d7afbe39c32771aba.
    
    Related Fuel bug https://bugs.launchpad.net/fuel/+bug/1556123
    Bogdan Dobrelya committed Mar 11, 2016
    214275b
  3. Revert "Private attributes usage in rabbitmq script"

    This reverts commit 4aeaa79bc566c81bc7f5c20d7afbe39c32771aba.
    
    Related Fuel bug https://bugs.launchpad.net/fuel/+bug/1556123
    Bogdan Dobrelya committed Mar 11, 2016
    f6bdfe7

Commits on Mar 14, 2016

  1. Merge pull request #686 from bogdando/master

    Revert "Private attributes usage in rabbitmq script"
    michaelklishin committed Mar 14, 2016
    e0a81d1

Commits on Mar 24, 2016

  1. Put the RabbitMQ OCF RA policy to /usr/sbin

    * Fix the failing pcs resource list command
      and move the policy file from the ocf dir to the policy dir
    * Configure the custom policy file to be picked up
      from /usr/sbin/set_rabbitmq_policy, where the
      fuel-libraryX package installs it.
    * As the upstream rabbitmq-server package does not
      install one, use /usr/local/sbin/... as the default
      value of the policy path OCF param
    * Add the policy_file param and unit tests to
      cluster::rabbitmq_ocf
    
    Closes-bug: #1558627
    
    Change-Id: I4937bde611b06c3e39385a322053610c98584d79
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Mar 24, 2016
    d1c8e6b
  2. Put the RabbitMQ OCF RA policy to /usr/sbin

    * Fix failing pcs resource list command
    * Move policy file to examples in docs dirs
    
    Related Fuel bug: https://bugs.launchpad.net/fuel/+bug/1558627
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Mar 24, 2016
    2ed9efd
  3. Merge branch 'stable'

    dumbbell committed Mar 24, 2016
    b050d93

Commits on Apr 4, 2016

  1. Fix half-hearted attempt to erase mnesia in OCF RA

    ocf_run does `"$@"`, so the glob in "${MNESIA_FILES}/*" was never expanded
    and the mnesia directory wasn't actually cleaned up
    
    Fuel bug: https://bugs.launchpad.net/fuel/+bug/1565868
    binarin committed Apr 4, 2016
    1a970ad
  2. Fix half-hearted attempt to erase mnesia in OCF RA

    ocf_run does $("$@"), so the glob in "${MNESIA_FILES}/*" was never expanded
    and the mnesia directory wasn't actually cleaned up
    
    It's safe to remove that directory completely - it will be re-created
    automatically by mnesia.
    
    Upstream rabbitmq/rabbitmq-server#724
    
    Change-Id: I0aa47f61e03c99ee6ebb56b833463cdf4ccd243e
    Closes-Bug: 1565868
    binarin committed Apr 4, 2016
    d53418e
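
Why the quoted glob in the two mnesia commits above was never expanded can be reproduced with a stand-in for `ocf_run`. The `run` function below is hypothetical and mimics only the relevant property: it executes `"$@"` as given, so a quoted `*` reaches `rm` literally and no pathname expansion takes place.

```shell
#!/bin/sh
# Hypothetical stand-in for ocf_run: runs its arguments exactly as passed.
run() { "$@"; }

dir=$(mktemp -d)
touch "$dir/a" "$dir/b"

# The '*' is quoted at the call site, so rm receives the literal path
# "$dir/*" and (thanks to -f) silently does nothing.
run rm -f "$dir/*"
survived=no
[ -e "$dir/a" ] && survived=yes
echo "files survived: $survived"

# The fix from the second commit: remove the directory itself.
run rm -rf "$dir"
```

Removing the whole directory sidesteps globbing entirely, which is safe here because, as the commit notes, mnesia re-creates the directory.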

Commits on Apr 5, 2016

  1. Merge branch 'stable'

    michaelklishin committed Apr 5, 2016
    ed9056e

Commits on Apr 7, 2016

  1. Stop a rabbitmq pacemaker resource when monitor fails

    Related Fuel bug https://bugs.launchpad.net/fuel/+bug/1567355
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Apr 7, 2016
    3480cea
  2. Stop a rabbitmq pacemaker resource when monitor fails

    Upstream PR rabbitmq/rabbitmq-server#731
    Closes-bug: #1567355
    
    Change-Id: I83415e0e2a40f0e99e7baa26e35b6f7463c52928
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Apr 7, 2016
    6000a31
  3. Merge branch 'stable'

    michaelklishin committed Apr 7, 2016
    985f90d

Commits on Apr 8, 2016

  1. 742e8c2

Commits on Apr 19, 2016

  1. Stop process when rabbit is running but is not connected to master.

    It should go down to avoid a split brain.
    
    Change-Id: I4c51f8608702f2284d835ba9c3c9070b2c329ed8
    Closes-Bug: #1541471
    Upstream PR: rabbitmq/rabbitmq-server#758
    Maciej Relewicz committed Apr 19, 2016
    4a4e013

Commits on Apr 20, 2016

  1. Stop process when rabbit is running but is not connected to master.

    It should go down to avoid a split brain.
    
    Related Fuel bug: https://bugs.launchpad.net/fuel/+bug/1541471
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Co-authored-by: Maciej Relewicz <mrelewicz@mirantis.com>
    Bogdan Dobrelya and Maciej Relewicz committed Apr 20, 2016
    33b9f40
  2. Merge branch 'stable'

    michaelklishin committed Apr 20, 2016
    bec20d0

Commits on May 10, 2016

  1. Private attributes usage in rabbitmq script

    There are three types of rabbitmq attributes for pacemaker nodes:
    	-'rabbit-master'
    	-'rabbit-start-time'
    	- timeouts:
    		-'rabbit_list_channels_timeouts'
    		-'rabbit_get_alarms_timeouts'
    		-'rabbit_list_queues_timeouts'
    
    Attributes named 'rabbit-master' and 'rabbit-start-time' should be
    public because the script monitors these attributes in a loop for all
    nodes.
    
    All timeout attributes were changed to private to avoid unnecessary
    transitions.
    
    Also, the --lifetime and --node options were removed from attrd_updater
    calls, as 'lifetime' for this command is always 'reboot' and 'node'
    defaults to the local node.
    
    This reverts commit b2b191d2e28b96c9f9a6ea440a383cf4f691d8ad
    (now that the pacemaker version has been updated).
    
    Closes-bug: #1524672
    
    Change-Id: I6f0d4a99641b847321754d75605a78fbbc96ddad
    lefremova committed May 10, 2016
    b8e9513

Commits on May 12, 2016

  1. Private attributes usage in rabbitmq script

    Requires Pacemaker >= 1.1.13
    (the 'attrd_updater' command has the '-p' option only since this version).
    
    There are three types of rabbitmq attributes for pacemaker nodes:
    	-'rabbit-master'
    	-'rabbit-start-time'
    	- timeouts:
    		-'rabbit_list_channels_timeouts'
    		-'rabbit_get_alarms_timeouts'
    		-'rabbit_list_queues_timeouts'
    
    Attributes named 'rabbit-master' and 'rabbit-start-time' should be
    public because the script monitors these attributes in a loop for all
    nodes. All timeout attributes were changed to private to avoid
    unnecessary transitions.
    
    Also, the --lifetime and --node options were removed from attrd_updater
    calls, as 'lifetime' for this command is always 'reboot' and 'node'
    defaults to the local node.
    lefremova authored and Liubov Efremova committed May 12, 2016
    216e164

Commits on May 13, 2016

  1. Merge branch 'stable'

    michaelklishin committed May 13, 2016
    951a2d4

Commits on Jun 3, 2016

  1. Check cluster_status liveness during OCF checks

    We've observed an `autoheal` bug that made `cluster_status` become
    stuck forever.
    binarin committed Jun 3, 2016
    8bdfa3e

Commits on Jun 7, 2016

  1. Merge branch 'stable'

    michaelklishin committed Jun 7, 2016
    385afe5

Commits on Jun 8, 2016

  1. Fix bashisms in OCF HA script

    `-` is not allowed in function names by POSIX, and some
    shells (e.g. `dash`) will consider this as a syntax error.
    binarin committed Jun 8, 2016
    d9c434d
  2. Merge branch 'stable'

    michaelklishin committed Jun 8, 2016
    73368d2
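
The "Fix bashisms in OCF HA script" change comes down to naming. POSIX function names follow the rules for shell names (letters, digits and underscores, not starting with a digit), so a dashed name is accepted by bash but rejected by dash as a syntax error. The function name below is illustrative, not one taken from the script.

```shell
#!/bin/sh
# Non-portable: bash accepts this, dash rejects it as a syntax error.
#   check-node() { echo ok; }
#
# Portable: underscores are valid in POSIX names, dashes are not.
# (check_node is a hypothetical example name.)
check_node() {
    echo ok
}

check_node
```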

Commits on Aug 15, 2016

  1. Update iptables calls with --wait

    If iptables is being run outside of the OCF script at the same time,
    the script's iptables call fails because it cannot get the lock. This
    change adds the -w flag to the iptables call, so it waits until the
    lock can be acquired instead of just exiting with an error.
    Alex Schultz committed Aug 15, 2016
    53e9b1a
  2. Update iptables calls with --wait

    If iptables is being run outside of the OCF script at the same time,
    the script's iptables call fails because it cannot get the lock. This
    change adds the -w flag to the iptables call, so it waits until the
    lock can be acquired instead of just exiting with an error.
    Alex Schultz committed Aug 15, 2016
    dce9ea0

Commits on Aug 16, 2016

  1. Merge branch 'stable'

    michaelklishin committed Aug 16, 2016
    470a2a9

Commits on Aug 18, 2016

  1. Fix bashisms in rabbitmq OCF RA

    Change the "printf %b" usage so that it passes checkbashisms.
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Aug 18, 2016
    2f7b806
  2. Merge branch 'stable'

    michaelklishin committed Aug 18, 2016
    63cc485
  3. [OCF HA] Add ocf_get_private_attr function to RabbitMQ OCF script

    The function is extracted from check_timeouts to be re-used later
    in other parts of the script. Also, switch check_timeouts to use the
    existing ocf_update_private_attr function.
    dmitrymex committed Aug 18, 2016
    abda1ca

Commits on Aug 19, 2016

  1. Merge branch 'stable'

    dcorbacho committed Aug 19, 2016
    69c13d4

Commits on Aug 22, 2016

  1. [OCF HA] Rank master score based on start time

    Right now we assign 1000 to the oldest nodes and 1 to the others. That
    creates a problem when the Master restarts and no node is promoted until
    that node comes back. In that case the returning node will have a score
    of 1, like all the other slaves, and Pacemaker will select it for
    promotion again. That node is completely empty, and afterwards the other
    slaves join it, wiping their data as well. As a result, we lose all the
    messages.
    
    The new algorithm actually ranks nodes, not just selects the oldest
    one. It also maintains the invariant that if node A started later
    than node B, then node A score must be smaller than that of
    node B. As a result, freshly started node has no chance of being
    selected in preference to older node. If several nodes start
    simultaneously, among them an older node might temporarily receive
    lower score than a younger one, but that is negligible.
    
    Also remove any action on demote or demote notification - all of
    these duplicate actions done in stop or stop notification. With these
    removed, changing master on a running cluster does not affect RabbitMQ
    cluster in any way - we just declare another node master and that is
    it. It is important for the current change because master score might
    change after initial cluster start up causing master migration from
    one node to another.
    
    This fix is a prerequisite for fixing Fuel bugs
    https://bugs.launchpad.net/fuel/+bug/1559136
    https://bugs.launchpad.net/mos/+bug/1561894
    dmitrymex committed Aug 22, 2016
    091a028
  2. [OCF HA] Enhance split-brain detection logic

    Previous split brain logic worked as follows: each slave checked
    that it is connected to master. If check fails, slave restarts. The
    ultimate flaw in that logic is that there is little guarantee that
    master is alive at the moment. Moreover, if master dies, it is very
    probable that during the next monitor check slaves will detect its
    death and restart, causing complete RabbitMQ cluster downtime.
    
    With the new approach master node checks that slaves are connected to
    it and orders them to restart if they are not. The check is performed
    after master node health check, meaning that at least that node
    survives. Also, orders expire in one minute, and a freshly started node
    ignores orders to restart for three minutes to give the cluster time to
    stabilize.
    
    Also corrected the problem where a node starts and is already clustered.
    In that case the OCF script forgot to start the RabbitMQ app, causing
    subsequent restart. Now we ensure that RabbitMQ app is running.
    
    The two introduced attributes rabbit-start-phase-1-time and
    rabbit-ordered-to-restart are made private. In order to allow master
    to set node's order to restart, both ocf_update_private_attr and
    ocf_get_private_attr signatures are expanded to allow passing
    node name.
    
    Finally, a bug is fixed in ocf_get_private_attr. Unlike crm_attribute,
    attrd_updater returns an empty string instead of "(null)" when an
    attribute is not defined on the requested node but is defined on some
    other node. Correspondingly, the code now expects an empty string, not
    "(null)".
    
    This change fixes Fuel bugs
    https://bugs.launchpad.net/fuel/+bug/1559136
    https://bugs.launchpad.net/mos/+bug/1561894
    dmitrymex committed Aug 22, 2016
    ab1a510
  3. Merge branch 'stable'

    michaelklishin committed Aug 22, 2016
    7199f04
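
The ranking rule from "[OCF HA] Rank master score based on start time" above can be sketched as follows. This is an illustration under assumptions, not the RA's code: the function name, passing start times as arguments, and the exact formula (a fixed maximum minus the number of strictly older nodes) are hypothetical. The formula does keep the stated invariant: a node that started later than node B has B and every node older than B ahead of it, so its score is strictly lower than B's.

```shell
#!/bin/sh
# Hypothetical sketch: derive a master score from node start times.
# A node's score is MAX_SCORE minus the number of nodes that started
# strictly earlier, so an older node always outranks a younger one.
MAX_SCORE=1000

# $1:   this node's start time (epoch seconds)
# $2..: start times of all nodes in the cluster, including this one
master_score_sketch() {
    mine=$1
    shift
    earlier=0
    for t in "$@"; do
        if [ "$t" -lt "$mine" ]; then
            earlier=$((earlier + 1))
        fi
    done
    echo $((MAX_SCORE - earlier))
}

# Three nodes started at t=100, 200 and 300:
master_score_sketch 100 100 200 300   # oldest node
master_score_sketch 300 100 200 300   # youngest node
```

Under this formula, nodes with identical start times receive identical scores, which is the situation the later "Avoid promoting nodes with same start time as master" commit deals with.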

Commits on Aug 23, 2016

  1. Monitor rabbitmq from OCF with less overhead

    This will stop wasting network bandwidth for monitoring.
    
    E.g. a 200-node OpenStack installation produces around 10k queues and
    10k channels. A single list_queues/list_channels call across the cluster
    in this environment results in 27k TCP packets and around 12 megabytes
    of network traffic. Given that these calls happen ~10 times a minute with
    3 controllers, this results in pretty significant overhead.
    
    To enable these features you should have a rabbitmq containing the
    following patches:
    - rabbitmq/rabbitmq-server#883
    - rabbitmq/rabbitmq-server#911
    - rabbitmq/rabbitmq-server#915
    binarin committed Aug 23, 2016
    f75cdde
  2. Merge branch 'stable'

    michaelklishin committed Aug 23, 2016
    2d0c979

Commits on Aug 26, 2016

  1. Perform partition checks from OCF HA script

    Partitioned nodes are ordered to restart by master. It may sound like
    `autoheal`, but the problem is that OCF script and `autoheal` are not
    compatible because concepts of master in pacemaker and winner in
    autoheal are completely unrelated.
    binarin committed Aug 26, 2016
    6d7c0f2
  2. 6e54e16

Commits on Aug 31, 2016

  1. Merge branch 'stable'

    michaelklishin committed Aug 31, 2016
    208bb82

Commits on Sep 6, 2016

  1. [OCF HA] Do not suggest to run the second monitor action

    Right now we suggest that users run a second monitor for slaves
    with depth=30. That made sense previously, when there was an additional
    check at that depth. Right now we don't have any depth-specific
    checks, and hence it does not make sense to run the second monitor.
    Moreover, removing the second monitor fixes an issue with Pacemaker
    not reacting to a failing monitor if it takes more than a minute. For
    details see Fuel bug https://launchpad.net/bugs/1618843
    dmitrymex committed Sep 6, 2016
    ad4c5d0

Commits on Sep 7, 2016

  1. merge branch stable

    michaelklishin committed Sep 7, 2016
    fa69a41
  2. [OCF HA] Delete Mnesia schema on mnesia reset

    Not doing so leaves the RabbitMQ node half-stuck in the cluster. As a
    result, it can't cleanly join back and constantly fails. Details can
    be found in the following Fuel bug:
    https://bugs.launchpad.net/fuel/+bug/1620649
    dmitrymex committed Sep 7, 2016
    c5f0563

Commits on Sep 12, 2016

  1. Merge branch 'stable'

    dcorbacho committed Sep 12, 2016
    1f62529

Commits on Sep 16, 2016

  1. Fix stdout/stderr redirects

    Related Fuel bug https://bugs.launchpad.net/fuel/+bug/1506423
    
    Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
    Bogdan Dobrelya committed Sep 16, 2016
    9e4db7d
  2. Merge branch 'stable'

    michaelklishin committed Sep 16, 2016
    523ce6b

Commits on Sep 21, 2016

  1. 7e39ca9
  2. Move all release handling bits to rabbitmq-release

    [#130659985]
    dumbbell committed Sep 21, 2016
    38b51b3

Commits on Sep 23, 2016

  1. Merge branch 'stable'

    dumbbell committed Sep 23, 2016
    731d972

Commits on Sep 29, 2016

  1. OCF RA: Check partitions on non-master nodes

    Partitions reported by `rabbit_node_monitor:partitions/0` are not
    commutative (i.e. node1 can report itself as partitioned with node2, but
    not vice versa).
    
    Given that we now have a strong notion of master in the OCF script, we can
    check for those fishy situations during master health check, and order
    damaged nodes to restart.
    
    Fuel bug: https://bugs.launchpad.net/fuel/+bug/1628487
    binarin committed Sep 29, 2016
    63bf153
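
The "not commutative" observation in "OCF RA: Check partitions on non-master nodes" can be illustrated with a small filter. The input format is an assumption for this sketch: one line per report, "A B" meaning node A lists B among its partitions. The function prints the reporting node for every pair that is not mirrored by the other side; how the real master health check decides which nodes to order to restart is not modelled here.

```shell
#!/bin/sh
# Hypothetical sketch: find one-sided partition reports.
# stdin:  lines "A B", meaning node A reports itself partitioned with B
# stdout: the reporting node A for every pair with no matching "B A"
one_sided_partitions() {
    awk '
        { seen[$1 " " $2] = 1; a[NR] = $1; b[NR] = $2 }
        END {
            for (i = 1; i <= NR; i++)
                if (!((b[i] " " a[i]) in seen))
                    print a[i]
        }'
}

printf '%s\n' "rabbit@n1 rabbit@n2" "rabbit@n3 rabbit@n4" "rabbit@n4 rabbit@n3" |
    one_sided_partitions
```

Here n3 and n4 agree that they are partitioned, while only n1 reports a partition with n2, so n1 is printed.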

Commits on Oct 17, 2016

  1. Correctly return exit code from stop

    Panicking and returning non-success on stop often leads to the resource
    becoming unmanaged on that node.
    
    Previously we called get_status to verify that RabbitMQ is dead. But
    sometimes it returns error even though RabbitMQ is not running. There
    is no reason to call it - we will just verify that there is no beam
    process running.
    
    Related fuel bug - https://bugs.launchpad.net/fuel/+bug/1626933
    dmitrymex committed Oct 17, 2016
    2f6ec13

Commits on Mar 31, 2017

  1. OCF RA: Don't hardcode primitive name in rabbitmq-server-ha.ocf

    We can compute the name of the primitive automatically from environment
    variables, instead of hard-coding p_rabbitmq-server; this makes the
    resource agent more flexible.
    
    Closes rabbitmq/rabbitmq-server-release#23
    vuntz committed Mar 31, 2017
    fffe28a
  2. OCF RA: Don't hardcode primitive name in rabbitmq-server-ha.ocf

    We can compute the name of the primitive automatically from environment
    variables, instead of hard-coding p_rabbitmq-server; this makes the
    resource agent more flexible.
    
    Closes rabbitmq/rabbitmq-server-release#23
    vuntz authored and michaelklishin committed Mar 31, 2017
    ccfc617
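
One way to derive the primitive name instead of hard-coding p_rabbitmq-server, assuming Pacemaker passes clone instances in `OCF_RESOURCE_INSTANCE` with a `:N` instance suffix (for example `p_rabbitmq-server:0`), is to strip everything from the first colon. The helper name is hypothetical, and the commits do not say this is exactly how the RA computes the name.

```shell
#!/bin/sh
# Hypothetical helper: primitive name from the clone instance name,
# e.g. "p_rabbitmq-server:0" -> "p_rabbitmq-server".
# A name without a colon is returned unchanged.
primitive_name() {
    printf '%s\n' "${OCF_RESOURCE_INSTANCE%%:*}"
}

OCF_RESOURCE_INSTANCE="p_rabbitmq-server:0"
primitive_name
```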

Commits on Apr 2, 2017

  1. Merge branch 'stable'

    michaelklishin committed Apr 2, 2017
    bb39f85

Commits on Apr 4, 2017

  1. OCF RA: Add default_vhost parameter to rabbitmq-server-ha.ocf

    This enables the cluster to focus on a vhost that is not /, in case the
    most important vhost is something else.
    
    For reference, other vhosts may exist in the cluster, but these are not
    guaranteed to be safe from data loss. This patch doesn't address
    this issue.
    
    Closes rabbitmq/rabbitmq-server-release#22
    vuntz committed Apr 4, 2017
    c6f95aa
  2. OCF RA: Add new limit_nofile parameter to rabbitmq-server-ha OCF RA

    This makes it possible to change the open files limit, as the default on
    distributions is usually too low for rabbitmq. Default is 65535.
    vuntz committed Apr 4, 2017
    c3434f1
  3. Merge pull request #24 from vuntz/ocf-vhost

    OCF RA: Add vhost parameter to rabbitmq-server-ha.ocf
    michaelklishin committed Apr 4, 2017
    10cf912
  4. OCF RA: Only set limit for open files when higher than current value

    This allows the limit to be set some other way.
    vuntz committed Apr 4, 2017
    17eb6d8
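
The limit_nofile behaviour described in the two commits above (raise the open-files limit, but only when the requested value is higher than the current one) can be sketched with `ulimit -n`. The function name is hypothetical, and skipping an "unlimited" current limit is an assumption of this sketch rather than something the commits describe.

```shell
#!/bin/sh
# Hypothetical sketch of limit_nofile handling: never lower the limit.
# $1: requested open-files limit (the commit gives 65535 as the default)
raise_nofile_limit() {
    current=$(ulimit -n)
    if [ "$current" = unlimited ]; then
        return 0                  # nothing to raise
    fi
    if [ "$1" -gt "$current" ]; then
        ulimit -n "$1"            # may fail without sufficient privileges
    fi
}
```

Leaving the limit alone when it is already high enough is what allows it to be set some other way without the RA overriding it.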

Commits on Apr 5, 2017

  1. Merge pull request #21 from vuntz/ocf-limit_nofile

    OCF RA: Add new limit_nofile parameter to both OCF resource agents
    michaelklishin committed Apr 5, 2017
    9acc77b
  2. b29f0e5
  3. Merge branch 'stable'

    michaelklishin committed Apr 5, 2017
    4743057
  4. 51fe230
  5. Merge branch 'stable'

    Conflicts:
    	scripts/rabbitmq-server.ocf
    michaelklishin committed Apr 5, 2017
    695dd24

Commits on May 9, 2017

  1. Fix HA OCF script

    Some parts of #21 have not been added to the stable branch. This change
    fixes the issue by adding missing changes to rabbitmq-server-ha.ocf and
    also fixing rabbitmq-server.ocf
    matelakat committed May 9, 2017
    e154327

Commits on May 16, 2017

  1. Merge branch 'stable'

    dumbbell committed May 16, 2017
    ba1479f

Commits on Dec 8, 2017

  1. OCF RA: Avoid promoting nodes with same start time as master

    It may happen that two nodes have the same start time, and one of these
    is the master. When this happens, the node actually gets the same score
    as the master and can get promoted. There's no reason to avoid being
    stable here, so let's keep the same master in that scenario.
    vuntz committed Dec 8, 2017
    cb09e8f
  2. OCF RA: Fix test for no node in start notification handler

    If there's nothing starting and nothing active, then we do a -z " ",
    which doesn't have the same result as -z "". Instead, just test for
    emptiness for each set of nodes.
    vuntz committed Dec 8, 2017
    431644a
  3. OCF RA: Do not start rabbitmq if notification of start is not about us

    Right now, every time we get a start notification, all nodes will ensure
    the rabbitmq app is started. This makes little sense, as nodes that are
    already active don't need to do that.
    
    On top of that, this had the side effect of updating the start time for
    each of these nodes, which could result in the master moving to another
    node.
    vuntz committed Dec 8, 2017
    263047c
  4. OCF RA: Fix logging in start notification handler

    The "post-start end" log message was written too early (some things were
    still done afterwards), and not in all cases (it was inside an if
    statement).
    vuntz committed Dec 8, 2017
    044250f
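
The empty-test bug fixed in "OCF RA: Fix test for no node in start notification handler" is easy to reproduce. Joining two empty node lists with a space yields a single space, which is not an empty string, so `-z` fails. The variable names below are illustrative stand-ins for the node lists the notification provides.

```shell
#!/bin/sh
# Illustrative stand-ins for the "starting" and "active" node lists.
start_nodes=""
active_nodes=""

# Buggy: the combined string is " " (one space), so -z is false.
if [ -z "$start_nodes $active_nodes" ]; then
    buggy=empty
else
    buggy=not-empty
fi

# Fixed: test each list for emptiness separately.
if [ -z "$start_nodes" ] && [ -z "$active_nodes" ]; then
    fixed=empty
else
    fixed=not-empty
fi

echo "buggy: $buggy, fixed: $fixed"   # prints "buggy: not-empty, fixed: empty"
```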

Commits on Dec 12, 2017

  1. Merge pull request #64 from vuntz/ocf-fix-notify-start

    OCF RA: Fix various issues with start notification handler
    michaelklishin committed Dec 12, 2017
    3183a2c

Commits on Dec 14, 2017

  1. OCF RA: Avoid promoting nodes with same start time as master

    It may happen that two nodes have the same start time, and one of these
    is the master. When this happens, the node actually gets the same score
    as the master and can get promoted. There's no reason to avoid being
    stable here, so let's keep the same master in that scenario.
    
    (cherry picked from commit 62a4f7561171328cd1d62cab394d0bba269ea7ad)
    (cherry picked from commit 861f2a57f916a9829e9a11092ada2bb52bdaf028)
    vuntz authored and michaelklishin committed Dec 14, 2017
    9a95a2c
  2. OCF RA: Fix syntax error

    (cherry picked from commit a9b4a4ff97a96e798de51933fc44f61aa6bc88a3)
    vuntz authored and michaelklishin committed Dec 14, 2017
    c0688a9
  3. OCF RA: Fix syntax error

    (cherry picked from commit a9b4a4ff97a96e798de51933fc44f61aa6bc88a3)
    vuntz authored and michaelklishin committed Dec 14, 2017
    46c3fd2
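
The tie-break described in "OCF RA: Avoid promoting nodes with same start time as master" comes down to using a strict comparison. Below is a minimal sketch, not the RA's actual scoring code: the helper name `outranks_master` is hypothetical, and it assumes, as a simplification, that the node with the earlier start time is the preferred one.

```shell
# Hypothetical helper (not the RA's scoring code): succeed only if a
# candidate node should outrank the current master, given both start
# times in epoch seconds. Assumes an earlier start time is preferred.
outranks_master() {
    candidate_start=$1
    master_start=$2
    # Strict "less than": equal start times never displace the master.
    [ "$candidate_start" -lt "$master_start" ]
}

# Same start time as the master: the current master is kept.
if outranks_master 1513238400 1513238400; then
    echo "promote candidate"
else
    echo "keep current master"
fi
```

Because the comparison is strict, a candidate whose start time equals the master's never outranks it, which is the stability the commit asks for.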

Commits on Dec 18, 2017

  1. Merge branch 'rabbitmq-server-release-153734997' into rabbitmq-server-release-153734997-master
    lukebakken committed Dec 18, 2017
    a0d992f

Commits on Dec 20, 2017

  1. OCF RA: Do not consider local failures as remote node problems

    In is_clustered_with(), the commands we run to check whether the node is
    clustered with us, or partitioned with us, may fail. When they fail, that
    actually tells us nothing about the remote node.
    
    Until now, we were considering such failures as hints that the remote
    node is not in a sane state with us. But doing so has a pretty negative
    impact, as it can cause rabbitmq to get restarted on the remote node,
    causing quite some disruption.
    
    So instead of doing this, ignore the error (it's still logged).
    
    There was a comment in the code wondering what the best behavior is;
    based on experience, I think preferring stability is the slightly more
    acceptable poison of the two options.
    vuntz committed Dec 20, 2017
    fac5c26
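
The error handling from "OCF RA: Do not consider local failures as remote node problems" can be sketched as follows. This is an illustration only: `CLUSTER_STATUS_CMD` is a hypothetical stand-in for the rabbitmqctl query the RA actually runs, and the membership check is deliberately simplified to a word match on its output.

```shell
# Hypothetical sketch: CLUSTER_STATUS_CMD stands in for the rabbitmqctl
# query the RA really runs; it is expected to print known node names.
is_clustered_with() {
    remote=$1
    # Capture stdout and stderr so a failure can be logged verbatim.
    if ! output=$($CLUSTER_STATUS_CMD 2>&1); then
        # A local failure says nothing about the remote node: log it and
        # report success so that the remote node is not restarted.
        echo "WARNING: local cluster check failed, ignoring: $output" >&2
        return 0
    fi
    # The command worked: the remote node must appear in its output.
    case " $output " in
        *" $remote "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

The key design choice is the early `return 0` when the local command fails: the failure is still logged, but it is no longer treated as evidence against the remote node, so it cannot trigger a restart there.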

Commits on Nov 19, 2018

  1. Use ocf_attribute_target instead of crm_node

    Instead of calling crm_node directly, it is preferable to use the
    ocf_attribute_target function. This function will return crm_node -n
    as usual, except when run inside a bundle (aka container in pcmk
    language). Inside a bundle it will return the bundle name or, if the
    meta attribute meta_container_attribute_target is set to 'host', it
    will return the physical node name where the bundle is running.
    
    Typically when running a rabbitmq cluster inside containers it is
    desired to set 'meta_container_attribute_target=host' on the rabbit
    cluster resource so that the RA is aware of which host it is running on.
    
    Tested both on baremetal (without containers):
     Master/Slave Set: rabbitmq-master [rabbitmq]
         Masters: [ controller-0 controller-1 controller-2 ]
    
    And with bundles as well.
    
    Co-Authored-By: Damien Ciabrini <dciabrin@redhat.com>
    mbaldessari and dciabrin committed Nov 19, 2018
    478442b
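
The node-name lookup from "Use ocf_attribute_target instead of crm_node" can be sketched as a small wrapper that falls back to `crm_node -n` when `ocf_attribute_target` (provided by the resource-agents ocf-shellfuncs library) is not defined. The wrapper name `get_attribute_node_name` is hypothetical, and the fallback is an assumption about how one might stay compatible with older installations, not something the commit states.

```shell
# Hypothetical wrapper: prefer ocf_attribute_target (from resource-agents'
# ocf-shellfuncs) and fall back to crm_node -n if it is not defined.
get_attribute_node_name() {
    if command -v ocf_attribute_target >/dev/null 2>&1; then
        # Bundle name, or the physical host when the bundle's
        # container-attribute-target meta attribute is set to "host".
        ocf_attribute_target
    else
        # Bare-metal behaviour: the local Pacemaker node name.
        crm_node -n
    fi
}
```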

Commits on Mar 21, 2019

  1. URL Cleanup

    This commit updates URLs to prefer the https protocol. Redirects are not followed to avoid accidentally expanding intentionally shortened URLs (i.e. if using a URL shortener).
    
    # Fixed URLs
    
    ## Fixed Success
    These URLs were switched to an https URL with a 2xx status. While the status was successful, your review is still recommended.
    
    * [ ] http://www.apache.org/licenses/LICENSE-2.0 with 1 occurrence migrated to:
      https://www.apache.org/licenses/LICENSE-2.0 ([https](https://www.apache.org/licenses/LICENSE-2.0) result 200).
    spring-operator committed Mar 21, 2019
    b2788dd

Commits on Jan 31, 2020

  1. Allow operator to disable iptables client blocking

    Currently the resource agent hard-codes iptables calls to block off
    client access before the resource becomes master. This was done
    historically because many libraries were fairly buggy at detecting a
    not-yet-functional rabbitmq, so they were helped by getting
    a TCP RST packet, after which they would go on to try their next configured
    server.
    
    It makes sense to be able to disable this behaviour because
    most libraries by now have gotten better at detecting timeouts when
    talking to rabbit and because when you run rabbitmq inside a bundle
    (pacemaker term for a container with an OCF resource inside) you
    normally do not have access to iptables.
    
    Tested by creating a three-node bundle cluster inside a container:
     Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]
       Replica[0]
          rabbitmq-bundle-podman-0  (ocf::heartbeat:podman):        Started controller-0
          rabbitmq-bundle-0 (ocf::pacemaker:remote):        Started controller-0
          rabbitmq  (ocf::rabbitmq:rabbitmq-server-ha):     Master rabbitmq-bundle-0
       Replica[1]
          rabbitmq-bundle-podman-1  (ocf::heartbeat:podman):        Started controller-1
          rabbitmq-bundle-1 (ocf::pacemaker:remote):        Started controller-1
          rabbitmq  (ocf::rabbitmq:rabbitmq-server-ha):     Master rabbitmq-bundle-1
       Replica[2]
          rabbitmq-bundle-podman-2  (ocf::heartbeat:podman):        Started controller-2
          rabbitmq-bundle-2 (ocf::pacemaker:remote):        Started controller-2
          rabbitmq  (ocf::rabbitmq:rabbitmq-server-ha):     Master rabbitmq-bundle-2
    
    The ocf resource was created inside a bundle with:
    pcs resource create rabbitmq ocf:rabbitmq:rabbitmq-server-ha avoid_using_iptables="true" \
      meta notify=true container-attribute-target=host master-max=3 ordered=true \
      op start timeout=200s stop timeout=200s promote timeout=60s bundle rabbitmq-bundle
    
    Signed-off-by: Michele Baldessari <michele@acksyn.org>
    mbaldessari committed Jan 31, 2020
    f489110
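
The effect of `avoid_using_iptables` from "Allow operator to disable iptables client blocking" can be sketched like this. It is not the RA's code: `is_true` only roughly mimics `ocf_is_true`, and the iptables rule and the 5672 default for `OCF_RESKEY_node_port` are assumptions chosen to illustrate the REJECT-with-RST behaviour the commit describes.

```shell
# Rough stand-in for ocf_is_true from ocf-shellfuncs.
is_true() {
    case "$1" in
        [Yy][Ee][Ss]|[Tt][Rr][Uu][Ee]|1|[Oo][Nn]) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical sketch of blocking client access before promotion.
block_client_access() {
    if is_true "${OCF_RESKEY_avoid_using_iptables:-false}"; then
        # e.g. inside a bundle, where iptables is usually unavailable
        return 0
    fi
    # Reject client traffic on the AMQP port with a TCP RST so that
    # clients move on to their next configured server.
    iptables -I INPUT -p tcp -m tcp --dport "${OCF_RESKEY_node_port:-5672}" \
        -m state --state NEW,RELATED,ESTABLISHED \
        -j REJECT --reject-with tcp-reset
}
```

With `avoid_using_iptables="true"` set on the resource, as in the pcs command above, the function returns immediately and never touches iptables.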

Commits on Nov 13, 2020

  1. Merge remote-tracking branch 'rabbitmq_server_release/master'

    Corresponding to master at 7b25a1cdb1bf9e5920f4394efc3096fbcf09de1f
    pjk25 committed Nov 13, 2020
    9214437

Commits on Feb 28, 2021

  1. Allow rabbitmq to run in a larger cluster that also has non-rabbitmq nodes

    We introduce the OCF_RESKEY_allowed_cluster_nodes parameter, which can be used to specify
    which nodes of the cluster rabbitmq is expected to run on. When this variable is not
    set, the resource agent assumes that all nodes of the cluster (output of crm_node -l)
    are eligible to run rabbitmq. The use case is clusters with a large
    number of nodes, where only a specific subset is used for rabbitmq (usually this is
    done with some constraints).
    
    Tested in a 9-node cluster as follows:
    [root@messaging-0 ~]# pcs resource config rabbitmq
     Resource: rabbitmq (class=ocf provider=rabbitmq type=rabbitmq-server-ha)
      Attributes: allowed_cluster_nodes="messaging-0 messaging-1 messaging-2" avoid_using_iptables=true
      Meta Attrs: container-attribute-target=host master-max=3 notify=true ordered=true
      Operations: demote interval=0s timeout=30 (rabbitmq-demote-interval-0s)
                  monitor interval=5 timeout=30 (rabbitmq-monitor-interval-5)
                  monitor interval=3 role=Master timeout=30 (rabbitmq-monitor-interval-3)
                  notify interval=0s timeout=20 (rabbitmq-notify-interval-0s)
                  promote interval=0s timeout=60s (rabbitmq-promote-interval-0s)
                  start interval=0s timeout=200s (rabbitmq-start-interval-0s)
                  stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)
    
    [root@messaging-0 ~]# pcs status |grep -e rabbitmq -e messaging
      * Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
    ...
      * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
        * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
        * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
        * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2
    mbaldessari committed Feb 28, 2021
    3a9253a
  2. Stop logging unblock client access unconditionally

    Currently every call to unblock_client_access() is followed by a log line
    showing which function requested the unblocking. When we pass the parameter
    OCF_RESKEY_avoid_using_iptables=true, it makes no sense to log the
    unblocking of iptables, since it is effectively a no-op.
    
    Let's move that logging inside the unblock_client_access() function,
    passing a parameter that records which function called it.
    
    Tested on a cluster with rabbitmq bundles with avoid_using_iptables=true
    and observed no spurious logging any longer:
    
    [root@messaging-0 ~]# journalctl |grep 'unblocked access to RMQ port' |wc -l
    0
    mbaldessari committed Feb 28, 2021
    4d68998
  3. Only export RABBITMQ_NODE_PORT when it is not the default

    RABBITMQ_NODE_PORT is exported by default and set to 5672. Re-exporting it in that
    case actually breaks setups where rabbit is configured with TLS on the default port:
    
      2021-02-28 07:44:10.732 [error] <0.453.0> Failed to start Ranch listener
      {acceptor,{172,17,1,93},5672} in ranch_ssl:listen([{cacerts,'...'},{key,'...'},{cert,'...'},{ip,{172,17,1,93}},{port,5672},
      inet,{keepalive,true}, {versions,['tlsv1.1','tlsv1.2']},{certfile,"/etc/pki/tls/certs/rabbitmq.crt"},{keyfile,"/etc/pki/tls/private/rabbitmq.key"},
      {depth,1},{secure_renegotiate,true},{reuse_sessions,true},{honor_cipher_order,true},{verify,verify_none},{fail_if_no_peer_cert,false}])
      for reason eaddrinuse (address already in use)
    
    This is because by always explicitly exporting it, we force rabbit to listen on
    that port via TCP, and that is a problem when we want to do SSL on that port.
    Since 5672 is already the default port, we can simply avoid exporting it when
    the user does not customize the port.
    
    Tested both in a non-TLS env (A) and in a TLS-env (B) successfully:
    (A) Non-TLS
    [root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq
    [root@messaging-0 /]#
    [root@messaging-0 /]# pcs status |grep rabbitmq
        * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
        * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
        * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2
    
    (B) TLS
    [root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq/ |head -n3
    /etc/rabbitmq/rabbitmq.config:  {ssl, [{versions, ['tlsv1.1', 'tlsv1.2']}]},
    /etc/rabbitmq/rabbitmq.config:    {ssl_listeners, [{"172.17.1.48", 5672}]},
    /etc/rabbitmq/rabbitmq.config:    {ssl_options, [
    
    [root@messaging-0 ~]# pcs status |grep rabbitmq
        * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
        * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
        * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2
    
    Note: I don't believe we should export RABBITMQ_NODE_PORT at all, since you can specify all ports
    in the rabbit configuration anyway, but I prefer to play it safe here as folks might rely on being
    able to customize this.
    
    Signed-off-by: Michele Baldessari <michele@acksyn.org>
    mbaldessari committed Feb 28, 2021
    c61b5df
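
The three Feb 28, 2021 changes can be illustrated together in one sketch. Only `OCF_RESKEY_allowed_cluster_nodes`, `OCF_RESKEY_avoid_using_iptables`, `RABBITMQ_NODE_PORT` and the log text "unblocked access to RMQ port" come from the commit messages; the function names, `log_info`, the `OCF_RESKEY_node_port` parameter name, and the assumed `crm_node -l` output format (node name in the second column) are illustrative assumptions.

```shell
# Stand-in for ocf_log info (assumption).
log_info() { echo "INFO: $*" >&2; }

# 1. Nodes eligible to run rabbitmq: the explicit list if given,
#    otherwise every cluster node. Assumes crm_node -l prints one node
#    per line with the node name in the second column.
get_rabbit_nodes() {
    if [ -n "${OCF_RESKEY_allowed_cluster_nodes:-}" ]; then
        echo "$OCF_RESKEY_allowed_cluster_nodes"
    else
        crm_node -l | awk '{print $2}'
    fi
}

# 2. The log line lives inside the function, so nothing is logged when
#    iptables is not used. $1 names the requesting function.
unblock_client_access() {
    requester=${1:-unknown}
    case "${OCF_RESKEY_avoid_using_iptables:-false}" in
        true|yes|1) return 0 ;;
    esac
    # Remove the REJECT rule inserted when access was blocked.
    iptables -D INPUT -p tcp -m tcp --dport "${OCF_RESKEY_node_port:-5672}" \
        -m state --state NEW,RELATED,ESTABLISHED \
        -j REJECT --reject-with tcp-reset
    log_info "${requester}: unblocked access to RMQ port"
}

# 3. Export RABBITMQ_NODE_PORT only for a non-default port, so a TLS
#    listener on 5672 does not collide with a plain TCP one.
maybe_export_node_port() {
    if [ -n "${OCF_RESKEY_node_port:-}" ] && [ "$OCF_RESKEY_node_port" != 5672 ]; then
        export RABBITMQ_NODE_PORT="$OCF_RESKEY_node_port"
    fi
}
```

Keeping the default port out of the environment lets the rabbit configuration decide what listens on 5672, which is what makes the TLS-on-5672 setup above work.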

Commits on Mar 1, 2021

  1. Allow rabbitmq to run in a larger cluster that also has non-rabbitmq nodes

    We introduce the OCF_RESKEY_allowed_cluster_nodes parameter, which can be used to specify
    which nodes of the cluster rabbitmq is expected to run on. When this variable is not
    set, the resource agent assumes that all nodes of the cluster (output of crm_node -l)
    are eligible to run rabbitmq. The use case is clusters with a large
    number of nodes, where only a specific subset is used for rabbitmq (usually this is
    done with some constraints).
    
    Tested in a 9-node cluster as follows:
    [root@messaging-0 ~]# pcs resource config rabbitmq
     Resource: rabbitmq (class=ocf provider=rabbitmq type=rabbitmq-server-ha)
      Attributes: allowed_cluster_nodes="messaging-0 messaging-1 messaging-2" avoid_using_iptables=true
      Meta Attrs: container-attribute-target=host master-max=3 notify=true ordered=true
      Operations: demote interval=0s timeout=30 (rabbitmq-demote-interval-0s)
                  monitor interval=5 timeout=30 (rabbitmq-monitor-interval-5)
                  monitor interval=3 role=Master timeout=30 (rabbitmq-monitor-interval-3)
                  notify interval=0s timeout=20 (rabbitmq-notify-interval-0s)
                  promote interval=0s timeout=60s (rabbitmq-promote-interval-0s)
                  start interval=0s timeout=200s (rabbitmq-start-interval-0s)
                  stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)
    
    [root@messaging-0 ~]# pcs status |grep -e rabbitmq -e messaging
      * Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
    ...
      * Container bundle set: rabbitmq-bundle [cluster.common.tag/rhosp16-openstack-rabbitmq:pcmklatest]:
        * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
        * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
        * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2
    mbaldessari authored and michaelklishin committed Mar 1, 2021
    54d190d
  2. Stop logging unblock client access unconditionally

    Currently every call to unblock_client_access() is followed by a log line
    showing which function requested the unblocking. When we pass the parameter
    OCF_RESKEY_avoid_using_iptables=true, it makes no sense to log the
    unblocking of iptables, since it is effectively a no-op.
    
    Let's move that logging inside the unblock_client_access() function,
    passing a parameter that records which function called it.
    
    Tested on a cluster with rabbitmq bundles with avoid_using_iptables=true
    and observed no spurious logging any longer:
    
    [root@messaging-0 ~]# journalctl |grep 'unblocked access to RMQ port' |wc -l
    0
    mbaldessari authored and michaelklishin committed Mar 1, 2021
    8c4055c
  3. Only export RABBITMQ_NODE_PORT when it is not the default

    RABBITMQ_NODE_PORT is exported by default and set to 5672. Re-exporting it in that
    case actually breaks setups where rabbit is configured with TLS on the default port:
    
      2021-02-28 07:44:10.732 [error] <0.453.0> Failed to start Ranch listener
      {acceptor,{172,17,1,93},5672} in ranch_ssl:listen([{cacerts,'...'},{key,'...'},{cert,'...'},{ip,{172,17,1,93}},{port,5672},
      inet,{keepalive,true}, {versions,['tlsv1.1','tlsv1.2']},{certfile,"/etc/pki/tls/certs/rabbitmq.crt"},{keyfile,"/etc/pki/tls/private/rabbitmq.key"},
      {depth,1},{secure_renegotiate,true},{reuse_sessions,true},{honor_cipher_order,true},{verify,verify_none},{fail_if_no_peer_cert,false}])
      for reason eaddrinuse (address already in use)
    
    This is because by always explicitly exporting it, we force rabbit to listen on
    that port via TCP, and that is a problem when we want to do SSL on that port.
    Since 5672 is already the default port, we can simply avoid exporting it when
    the user does not customize the port.
    
    Tested both in a non-TLS env (A) and in a TLS-env (B) successfully:
    (A) Non-TLS
    [root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq
    [root@messaging-0 /]#
    [root@messaging-0 /]# pcs status |grep rabbitmq
        * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
        * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
        * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2
    
    (B) TLS
    [root@messaging-0 /]# grep -ir -e tls -e ssl /etc/rabbitmq/ |head -n3
    /etc/rabbitmq/rabbitmq.config:  {ssl, [{versions, ['tlsv1.1', 'tlsv1.2']}]},
    /etc/rabbitmq/rabbitmq.config:    {ssl_listeners, [{"172.17.1.48", 5672}]},
    /etc/rabbitmq/rabbitmq.config:    {ssl_options, [
    
    [root@messaging-0 ~]# pcs status |grep rabbitmq
        * rabbitmq-bundle-0 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-0
        * rabbitmq-bundle-1 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-1
        * rabbitmq-bundle-2 (ocf::rabbitmq:rabbitmq-server-ha):      Master messaging-2
    
    Note: I don't believe we should export RABBITMQ_NODE_PORT at all, since you can specify all ports
    in the rabbit configuration anyway, but I prefer to play it safe here as folks might rely on being
    able to customize this.
    
    Signed-off-by: Michele Baldessari <michele@acksyn.org>
    mbaldessari authored and michaelklishin committed Mar 1, 2021
    7410979

Commits on Mar 4, 2021

  1. Merge pull request #2864 from rabbitmq/mk-lager-3-9-0

    Upgrade Lager to 3.9 for OTP 24 compatibility
    michaelklishin committed Mar 4, 2021
    bff3727

Commits on Jun 17, 2021

  1. 8799071

Commits on Jun 30, 2021

  1. OCF RA: fix start/stop handling

    In newer Erlang, beam.smp no longer writes a pidfile until the rabbit
    application starts. It also no longer passes -mnesia dir and -sname,
    which are required in order to start the node while only delaying
    the application startup.
    Handle that so the Pacemaker HA setup keeps working with newer Erlang
    and rabbitmq-server versions.
    
    Fix '[ x == x ]' bashisms as well to silence errors in the RA logs.
    
    Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
    bogdando committed Jun 30, 2021
    06aa9a8
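
The '[ x == x ]' fix in "OCF RA: fix start/stop handling" amounts to using the POSIX `=` operator for string comparison. A minimal sketch, assuming the agent runs under a POSIX `/bin/sh` such as dash, where `==` inside `[ ]` is typically rejected with an "unexpected operator" error; `is_master_role` is a hypothetical helper.

```shell
# Portable string comparison for a POSIX /bin/sh (e.g. dash).
# Bashism to avoid:  [ "$1" == "master" ]
is_master_role() {
    # POSIX test(1) defines "=" for string equality
    [ "$1" = "master" ]
}

role="master"
if is_master_role "$role"; then
    echo "running as master"
fi
```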

Commits on Oct 1, 2021

  1. 07301d2
  2. Milestone history for rabbitmq OCF RA from Fuel

    The rabbitmq OCF RA originated in Fuel for OpenStack.
    Restore its history of changes.
    
    This commit is empty and only sets a milestone for its Fuel history,
    ending at Tue May 10 15:27:53 2016.
    
    Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
    bogdando committed Oct 1, 2021
    006f625
  3. Ignore stderr when calling rabbitmqctl eval()

    Every time we recompile the erlang/elixir/rebar/rabbitmq stack, there are
    one or more new warnings that completely trip up any parsing of the output
    of these commands. Most of them end up being bugs that get fixed later on [1].
    
    Since stderr is rarely interesting and only holds up any rebase, let's
    ignore it when running these rabbitmqctl commands.
    
    [1] https://elixirforum.com/t/mix-local-hex-warning-authenticity-is-not-established-by-certificate-path-validation/39665
    
    Authored-by: Michele Baldessari <michele@acksyn.org>
    Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
    bogdando committed Oct 1, 2021
    a935311
  4. ecc7231
  5. New home for rabbitmq-server-ha RA OCF m/s HA

    Moving it from repo:
    https://github.com/rabbitmq/rabbitmq-server
    The original path:
    scripts/rabbitmq-server-ha.ocf
    
    Also preserve the list of authors and the change history since its
    very first commit in the Fuel for OpenStack project,
    now archived: https://github.com/openstack-archive/fuel-library
    
    To get the history use:
    $ git log --follow heartbeat/rabbitmq-server-ha.ocf
    
    Reasoning: the OCF RA script provides an M/S HA
    Pacemaker resource for a RabbitMQ cluster and fits better
    in this place.
    
    Background
    ==========
    It has been actively maintained for years,
    and now it needs a new home at the request of the RabbitMQ team,
    since it is no longer possible to run CI tests for changes
    proposed against it in the old location.
    
    The TripleO upstream project and its layered RH OSP product plan
    to adopt this OCF RA. That guarantees its future
    maintenance and support.
    
    How it works
    ============
    Documentation is kept by its original upstream location:
    https://www.rabbitmq.com/pacemaker.html#auto-pacemaker
    
    Future Plans
    ============
    Once it is there, the RDO package builds will pick up
    changes to this OCF RA and make sure it is CI'ed, also in TripleO
    and OSP.
    
    Status Quo
    ==========
    Until the adoption completes, I'm planning to test changes proposed
    against this new location in my fork, with GitHub Actions, like [0].
    The CI runs on pre-built images [1] and Vagrant scripts [2] that
    I maintain for (more or less) recent Pacemaker and RabbitMQ
    builds. The test coverage includes a simple cluster assembly smoke
    test and a sophisticated Jepsen test case that verifies auto-healing
    of the cluster resource managed in Pacemaker by this OCF RA.
    
    [0] https://github.com/bogdando/rabbitmq-server/runs/3757495446
    [1] https://hub.docker.com/r/bogdando/rabbitmq-cluster-ocf
    [2] https://github.com/bogdando/rabbitmq-cluster-ocf-vagrant
    
    Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
    bogdando committed Oct 1, 2021
    8d154d8
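
The stderr handling from "Ignore stderr when calling rabbitmqctl eval()" can be sketched as a small wrapper. `rmq_eval` and `get_rmq_node_name` are hypothetical names, not the RA's functions, and the Erlang expression `node().` is only an illustrative example; the `2>/dev/null` redirect is the point.

```shell
# Hypothetical wrapper: run "rabbitmqctl eval" and keep only stdout,
# so toolchain warnings printed on stderr never reach the parser.
rmq_eval() {
    # stderr is discarded; the exit status is preserved for the caller.
    rabbitmqctl eval "$1" 2>/dev/null
}

# Example caller: use the result only if the command succeeded.
get_rmq_node_name() {
    if ! result=$(rmq_eval 'node().'); then
        return 1
    fi
    echo "$result"
}
```

The trade-off is that genuine error messages are discarded along with the warnings, so callers should keep checking the exit status, as `get_rmq_node_name` does.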

Commits on Nov 3, 2021

  1. Fix OCF params and bashisms

    Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>
    bogdando committed Nov 3, 2021
    3555d9e