Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High: cumulative patchset to fix CVE-2019-3885, CVE-2018-16877, CVE-2018-16878 + additional unmasked null pointer deref #1749

Merged
merged 8 commits into from Apr 17, 2019

Conversation

jnpkrn
Copy link
Contributor

@jnpkrn jnpkrn commented Apr 17, 2019

See also:
https://www.openwall.com/lists/oss-security/2019/04/17/1

People are strongly suggested to get these patches in.

This could possibly lead to unsolicited information disclosure by the
means of standard output of the immediately preceding agent/resource
execution leaking into the log stream under some circumstances.
It was hence assigned CVE-2019-3885.

The provoked pathological state of pacemaker-execd daemon progresses
towards crashing it for hitting segmentation fault.
[0/4: make crm_pid_active more precise as to when detections fail]

It would be bad if the function claimed the process is not active
when the only obstacle in the detection process was that none of the
detection methods worked for a plain lack of permissions to apply
them.  Also, do some other minor cleanup of the function and add its
documentation.  As an additional measure, log spamming is kept at
minimum for repeated queries about the same PID.
[1/4: new helpers to allow IPC client side to authenticate the server]

The title problem here could possibly lead to local privilege escalation
up to the root's level (and implicitly unguarded by some additional
protection layers like SELinux unless the defaults constrained further).

Main problem is that the authenticity assumptions were built on two,
seemingly mutually supporting legs leading to two CVEs assigned:

* procfs (mere process existence and right path to binary check)
  used to verify (this part was assigned CVE-2018-16878), and

* one-way only client-server authentication, putting the client
  here at the mercy of the server not necessarily cross-verified
  per the above point if at all (this part was assigned
  CVE-2018-16877)

whereas these two were in fact orthogonal, tearing security
assumptions about the "passive influencers" in the pacemaker's daemon
resilience-friendly constellation (orchestrated by the main of them,
pacemakerd) apart.  Moreover, procfs-based approach is discouraged
for other reasons.

The layout of the basic fix is as follows:
* 1/4: new helpers to allow IPC client side to authenticate the server
       (this commit, along with unifying subsequent solution for
       both CVEs)
* 2/4: pacemakerd to trust pre-existing processes via new checks instead
       (along with unifying solution for both CVEs)
* 3/4: other daemons to authenticate IPC servers of fellow processes
       (along with addressing CVE-2018-16877 alone, for parts of
       pacemaker not covered earlier)
* 4/4: CPG users to be careful about now-more-probable rival processes
       (this is merely to mitigate corner case fallout from the new
       approaches taken to face CVE-2018-16878 in particular;
       courtesy of Yan Gao of SUSE for the reporting this)

With "basic", it is meant that it constitutes a self-contained best
effort solution with some compromises that can only be overcome with the
assistance of IPC library, libqb, as is also elaborated in messages of
remaining "fix" commits.  Beside that, also conventional encapsulation
of server-by-client authentication would be useful, but lack thereof
is not an obstacle (more so should there by any security related
neglectations on the IPC client side and its connection initiating
arrangement within libqb that has a potential to strike as early as
when the authenticity of the server side is yet to be examined).

One extra kludge that's introduced for FreeBSD lacking Unix socket to
remote peer PID mapping is masquerading such an unspecified PID with
value of 1, since that shall always be around as "init" task and,
deferring to proof by contradiction, cannot be pacemakerd-spawned
child either even if PID 1 was pacemakerd (and running such a child
alone is rather nonsensical).  The code making decisions based on that
value must acknowledge this craze and refrain from killing/signalling
the underlying process on this platform (but shall in general follow
the same elsewhere, keep in mind systemd socket-based activation for
instance, which would end up in such a situation easily!).
[2/4: pacemakerd to trust pre-existing processes via new checks instead]

In pacemakerd in the context of entrusting pre-existing processes,
we now resort to procfs-based solution only in boundary, fouled cases,
and primarily examine the credentials of the processes already
occupying known IPC end-points before adopting them.

The commit applies the new helpers from 1/1 so as to close the both
related sensitive problems, CVE-2018-16877 and CVE-2018-16878, in
a unified manner, this time limited to the main daemon of pacemaker
(pacemakerd).

To be noted that it is clearly not 100% for this purpose for still
allowing for TOCTTOU, but that's what commit (3/3) is meant to solve
for the most part, plus there may be optimizations solving this concern
as a side effect, but that requires an active assistance on the libqb
side (ClusterLabs/libqb#325) since any
improvement on pacemaker side in isolation would be very
cumbersome if generally possible at all, but either way
means a new, soft compatibility encumberment.

As a follow-up to what was put in preceding 1/3 commit, PID of 1 tracked
as child's identification on FreeBSD (or when socket-based activation is
used with systemd) is treated specially, incl. special precaution with
child's PID discovered as 1 elsewhere.

v2: courtesy of Yan Gao of SUSE for early discovery and report for
    what's primarily solved with 4/4 commit, in extension, child
    daemons in the initialization phase coinciding with IPC-feasibility
    based process scan in pacemakerd in a way that those are missed
    (although they are to come up fully just moments later only
    to interfere with naturally spawned ones) are now considered so
    that if any native children later fail for said clash, the
    pre-existing counterpart may get adopted instead of ending up
    with repeated spawn-bury loop ad nauseam without real progress
    (note that PCMK_fail_fast=true could possibly help, but that's
    rather a big hammer not suitable for all the use cases, not
    the ones we try to deal with gracefully here)
[3/4: other daemons to authenticate IPC servers of fellow processes]

Now that CVE-2018-16877 issue alone is still only partially covered
based on the preceding commits in the set, put the server-by-client
authentication (enabled and 1/3 and partially sported in 2/3) into
practice widely amongst the communicating pacemaker child daemons and
towards CPG API provided by 3rd party but principally using the same
underlying IPC mechanism facilitated by libqb, and consequently close
the remaining "big gap".

As a small justification to introducing yet another "return
value" int variable, type-correctness is restored for those
that shall be cs_error_t to begin with.
[4/4: CPG users to be careful about now-more-probable rival processes]

In essence, this comes down to pacemaker confusing at-node CPG members
with effectively the only plausible to co-exist at particular node,
which doesn't hold and asks for a wider reconciliation of this
reality-check.

However, in practical terms, since there are two factors lowering the
priority of doing so:

1/ possibly the only non-self-inflicted scenario is either that
   some of the cluster stack processes fail -- this the problem
   that shall rather be deferred to arranged node disarming/fencing
   to stay on the safe side with 100% certainty, at the cost of
   possibly long-lasting failover process at other nodes
   (for other possibility, someone running some of these by accident
   so they effectively become rival processes, it's like getting
   hands cut when playing with a lawnmower in an unintended way)

2/ for state tracking of the peer nodes, it may possibly cause troubles
   in case the process observed as left wasn't the last for the
   particular node, even if presumably just temporary, since the
   situation may eventually resolve with imposed serialization of
   the rival processes via API end-point singleton restriction (this
   is also the most likely cause of why such non-final leave gets
   observed in the first place), except in one case -- the legitimate
   API end-point carrier won't possibly acknowledged as returned
   by its peers, at least not immediately, unless it tries to join
   anew, which verges on undefined behaviour (at least per corosync
   documentation)

we make do just with a light code change so as to

* limit 1/ some more with in-daemon self-check for pre-existing
  end-point existence (this is to complement the checks already made in
  the parent daemon prior to spawning new instances, only some moments
  later; note that we don't have any lock file etc. mechanisms to
  prevent parallel runs of the same daemons, and people could run these
  on their own deliberation), and to

* guard against the interferences from the rivals at the same node
  per 2/ with ignoring their non-final leave messages altogether.

Note that CPG at this point is already expected to be authenticity-safe.

Regarding now-more-probable part, we actually traded the inherently racy
procfs scanning for something (exactly that singleton mentioned above)
rather firm (and unfakeable), but we admittedly got lost track of
processes that are after CPG membership (that is, another form of
a shared state) prior to (or in non-deterministic order allowing for
the same) carring about publishing the end-point.

Big thanks is owed to Yan Gao of SUSE, for early discovery and reporting
this discrepancy arising from the earlier commits in the set.
This is now more likely triggerable once the problems related to
CVE-2018-16878 are avoided.
@jnpkrn
Copy link
Contributor Author

jnpkrn commented Apr 17, 2019

Oops, Travis is failing in part because now we require resolution of
hacluster/haclient, but arguably, the environment not having those
is rather hostile for the cluster stack (basically an error in the
deployment), so I suggest to deal with this afterwards.

For illustration:

(lrmd_ipc_connect) 	info: Connecting to executor
(crm_user_lookup) 	info: User hacluster lookup: Invalid argument
(crm_client_new) 	warning: Could not find user and group IDs for user hacluster

@jnpkrn
Copy link
Contributor Author

jnpkrn commented Apr 17, 2019

Last patch is a tentative attempt to calm Travis down.

@jnpkrn jnpkrn force-pushed the high-sec-master branch 2 times, most recently from b3edc9c to 4a904d4 Compare April 17, 2019 15:40
@jnpkrn
Copy link
Contributor Author

jnpkrn commented Apr 17, 2019

No longer tentative, it apparently helped.

@gao-yan
Copy link
Member

gao-yan commented Apr 17, 2019

Nice work! Thanks.

@kgaillot kgaillot merged commit bab7de5 into ClusterLabs:master Apr 17, 2019
@jnpkrn
Copy link
Contributor Author

jnpkrn commented Apr 18, 2019

Just to rectify Med: controld: fix possible NULL pointer dereference
commit a bit after-the-fact (there was no time for detailed analysis
yesterday given the disclosure rush), it was related to the situation
that for some reason neither of these daemons runs:

  • corosync
  • pacemakerd
  • pacemaker-based
    (or this one runs but is not authentic [or name service for user name
    to UID translation et al. has just fallen apart], which amounts to the
    same now, hence there's still a bit of truth about "more likely
    triggerable", though CVE-2018-1687 should have been mentioned instead,
    mea culpa, again, not enough time for fact checking)

and pacemaker-controld just comes to life

(very contrieved scenario:

  • corosync up, pacemakerd has just spawned pacemaker-controld
  • at this point, corosync crashes
  • the above somehow (possibly with the involvement of the supervisor
    above pacemakerd since the prerequisite, corosync, ceased to run,,
    or it could be CPG API both of these use that could trigger something
    unexpected on corosync's death) causes both pacemakerd and
    pacemaker-based disappear, while pacemaker-controld continues to
    run (wasn't connected to CPG yet)
  • pacemaker-controld would attempt to connect to corosync, and after
    a while falls on its nose for the lack of guard the commit just added
    [though it now seems -- also? -- as shenanigans in the FSM transitions
    once 30 attempts to connect to CIB fall short -- it's all very messy
    and undocumented in actual words what the controlled failure
    modes are with intended behaviours and recovery, sadly sadly sadly])

More realistic scenario is that the admin decided for some reason to
run /usr/lib/exec/pacemaker-controld in isolation without corosync
(and pacemaker-based) running (perhaps to test pacemakerd's ability
to adopt pre-existing processes but simply getting the timing wrong).

All in all, that commit is also a very desirable, even if it rather
fixes a long-standing flaw, good that my colleague Martin Juříček
conducting some tests discovered that (thanks).

@code-share1
Copy link

Does these CVE issues CVE-2019-3885, CVE-2018-16877, CVE-2018-16878 exist in pacemaker 1.1.19? Thanks,

@jnpkrn
Copy link
Contributor Author

jnpkrn commented May 28, 2019

@code-share1
As discussed at the 1.1 branch counterpart of these patches [*], you are
best served with a peek at https://wiki.clusterlabs.org/wiki/Security

(You would then know the answer to your question is: yes)

[*] #1750 (comment)

@kgaillot
Copy link
Contributor

For a complete list of commits related to the security issues, see:
https://lists.clusterlabs.org/pipermail/users/2019-May/025822.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants