Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1.1] High: cumulative patchset to fix CVE-2019-3885, CVE-2018-16877, CVE-2018-16878 + additional unmasked null pointer deref #1750

Merged
merged 8 commits into from Apr 17, 2019

Conversation

Projects
None yet
4 participants
@jnpkrn
Copy link
Contributor

commented Apr 17, 2019

See also:
https://www.openwall.com/lists/oss-security/2019/04/17/1

People are strongly suggested to get these patches in.

jnpkrn added some commits Apr 2, 2019

High: libservices: fix use-after-free wrt. alert handling
This could possibly lead to unsolicited information disclosure by the
means of standard output of the immediately preceding agent/resource
execution leaking into the log stream under some circumstances.
It was hence assigned CVE-2019-3885.

The provoked pathological state of pacemaker-execd daemon progresses
towards crashing it for hitting segmentation fault.
High: pacemakerd vs. IPC/procfs confused deputy authenticity issue (0/4)
[0/4: make crm_pid_active more precise as to when detections fail]

It would be bad if the function claimed the process is not active
when the only obstacle in the detection process was that none of the
detection methods worked for a plain lack of permissions to apply
them.  Also, do some other minor cleanup of the function and add its
documentation.  As an additional measure, log spamming is kept at
minimum for repeated queries about the same PID.
High: pacemakerd vs. IPC/procfs confused deputy authenticity issue (1/4)
[1/4: new helpers to allow IPC client side to authenticate the server]

The title problem here could possibly lead to local privilege escalation
up to the root's level (and implicitly unguarded by some additional
protection layers like SELinux unless the defaults constrained further).

Main problem is that the authenticity assumptions were built on two,
seemingly mutually supporting legs leading to two CVEs assigned:

* procfs (mere process existence and right path to binary check)
  used to verify (this part was assigned CVE-2018-16878), and

* one-way only client-server authentication, putting the client
  here at the mercy of the server not necessarily cross-verified
  per the above point if at all (this part was assigned
  CVE-2018-16877)

whereas these two were in fact orthogonal, tearing security
assumptions about the "passive influencers" in the pacemaker's daemon
resilience-friendly constellation (orchestrated by the main of them,
pacemakerd) apart.  Moreover, procfs-based approach is discouraged
for other reasons.

The layout of the basic fix is as follows:
* 1/4: new helpers to allow IPC client side to authenticate the server
       (this commit, along with unifying subsequent solution for
       both CVEs)
* 2/4: pacemakerd to trust pre-existing processes via new checks instead
       (along with unifying solution for both CVEs)
* 3/4: other daemons to authenticate IPC servers of fellow processes
       (along with addressing CVE-2018-16877 alone, for parts of
       pacemaker not covered earlier)
* 4/4: CPG users to be careful about now-more-probable rival processes
       (this is merely to mitigate corner case fallout from the new
       approaches taken to face CVE-2018-16878 in particular;
       courtesy of Yan Gao of SUSE for reporting this)

With "basic", it is meant that it constitutes a self-contained best
effort solution with some compromises that can only be overcome with the
assistance of IPC library, libqb, as is also elaborated in messages of
remaining "fix" commits.  Beside that, also conventional encapsulation
of server-by-client authentication would be useful, but lack thereof
is not an obstacle (more so should there by any security related
neglectations on the IPC client side and its connection initiating
arrangement within libqb that has a potential to strike as early as
when the authenticity of the server side is yet to be examined).

One extra kludge that's introduced for FreeBSD lacking Unix socket to
remote peer PID mapping is masquerading such an unspecified PID with
value of 1, since that shall always be around as "init" task and,
deferring to proof by contradiction, cannot be pacemakerd-spawned
child either even if PID 1 was pacemakerd (and running such a child
alone is rather nonsensical).  The code making decisions based on that
value must acknowledge this craze and refrain from killing/signalling
the underlying process on this platform (but shall in general follow
the same elsewhere, keep in mind systemd socket-based activation for
instance, which would end up in such a situation easily!).
High: pacemakerd vs. IPC/procfs confused deputy authenticity issue (2/4)
[2/4: pacemakerd to trust pre-existing processes via new checks instead]

In pacemakerd in the context of entrusting pre-existing processes,
we now resort to procfs-based solution only in boundary, fouled cases,
and primarily examine the credentials of the processes already
occupying known IPC end-points before adopting them.

The commit applies the new helpers from 1/1 so as to close the both
related sensitive problems, CVE-2018-16877 and CVE-2018-16878, in
a unified manner, this time limited to the main daemon of pacemaker
(pacemakerd).

To be noted that it is clearly not 100% for this purpose for still
allowing for TOCTTOU, but that's what commit (3/3) is meant to solve
for the most part, plus there may be optimizations solving this concern
as a side effect, but that requires an active assistance on the libqb
side (ClusterLabs/libqb#325) since any
improvement on pacemaker side in isolation would be very
cumbersome if generally possible at all, but either way
means a new, soft compatibility encumberment.

As a follow-up to what was put in preceding 1/3 commit, PID of 1 tracked
as child's identification on FreeBSD (or when socket-based activation is
used with systemd) is treated specially, incl. special precaution with
child's PID discovered as 1 elsewhere.

v2: courtesy of Yan Gao of SUSE for early discovery and report for
    what's primarily solved with 4/4 commit, in extension, child
    daemons in the initialization phase coinciding with IPC-feasibility
    based process scan in pacemakerd in a way that those are missed
    (although they are to come up fully just moments later only
    to interfere with naturally spawned ones) are now considered so
    that if any native children later fail for said clash, the
    pre-existing counterpart may get adopted instead of ending up
    with repeated spawn-bury loop ad nauseam without real progress
    (note that PCMK_fail_fast=true could possibly help, but that's
    rather a big hammer not suitable for all the use cases, not
    the ones we try to deal with gracefully here)
High: pacemakerd vs. IPC/procfs confused deputy authenticity issue (3/4)
[3/4: other daemons to authenticate IPC servers of fellow processes]

Now that CVE-2018-16877 issue alone is still only partially covered
based on the preceding commits in the set, put the server-by-client
authentication (enabled and 1/3 and partially sported in 2/3) into
practice widely amongst the communicating pacemaker child daemons and
towards CPG API provided by 3rd party but principally using the same
underlying IPC mechanism facilitated by libqb, and consequently close
the remaining "big gap".

As a small justification to introducing yet another "return
value" int variable, type-correctness is restored for those
that shall be cs_error_t to begin with.
High: pacemakerd vs. IPC/procfs confused deputy authenticity issue (4/4)
[4/4: CPG users to be careful about now-more-probable rival processes]

In essence, this comes down to pacemaker confusing at-node CPG members
with effectively the only plausible to co-exist at particular node,
which doesn't hold and asks for a wider reconciliation of this
reality-check.

However, in practical terms, since there are two factors lowering the
priority of doing so:

1/ possibly the only non-self-inflicted scenario is either that
   some of the cluster stack processes fail -- this the problem
   that shall rather be deferred to arranged node disarming/fencing
   to stay on the safe side with 100% certainty, at the cost of
   possibly long-lasting failover process at other nodes
   (for other possibility, someone running some of these by accident
   so they effectively become rival processes, it's like getting
   hands cut when playing with a lawnmower in an unintended way)

2/ for state tracking of the peer nodes, it may possibly cause troubles
   in case the process observed as left wasn't the last for the
   particular node, even if presumably just temporary, since the
   situation may eventually resolve with imposed serialization of
   the rival processes via API end-point singleton restriction (this
   is also the most likely cause of why such non-final leave gets
   observed in the first place), except in one case -- the legitimate
   API end-point carrier won't possibly acknowledged as returned
   by its peers, at least not immediately, unless it tries to join
   anew, which verges on undefined behaviour (at least per corosync
   documentation)

we make do just with a light code change so as to

* limit 1/ some more with in-daemon self-check for pre-existing
  end-point existence (this is to complement the checks already made in
  the parent daemon prior to spawning new instances, only some moments
  later; note that we don't have any lock file etc. mechanisms to
  prevent parallel runs of the same daemons, and people could run these
  on their own deliberation), and to

* guard against the interferences from the rivals at the same node
  per 2/ with ignoring their non-final leave messages altogether.

Note that CPG at this point is already expected to be authenticity-safe.

Regarding now-more-probable part, we actually traded the inherently racy
procfs scanning for something (exactly that singleton mentioned above)
rather firm (and unfakeable), but we admittedly got lost track of
processes that are after CPG membership (that is, another form of
a shared state) prior to (or in non-deterministic order allowing for
the same) carring about publishing the end-point.

Big thanks is owed to Yan Gao of SUSE, for early discovery and reporting
this discrepancy arising from the earlier commits in the set.
Med: controld: fix possible NULL pointer dereference
This is now more likely triggerable once the problems related to
CVE-2018-16878 are avoided.

@kgaillot kgaillot merged commit 158c3e3 into ClusterLabs:1.1 Apr 17, 2019

1 of 2 checks passed

default Build finished.
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@wferi

This comment has been minimized.

Copy link
Contributor

commented May 20, 2019

Hi, I'm trying to backport these fixes to Pacemaker 1.1.16 to provide a security update for Debian stable, but it's proving difficult due to all the code changes in the affected areas. To start off my questions: are all these problems present in 1.1.16? What is the precise use-after-free scenario in the alert handling? Or has anybody already done such a backport by chance?

@gao-yan

This comment has been minimized.

Copy link
Member

commented May 20, 2019

This is not purely based on 1.1.16 but probably more or less helpful:
https://download.opensuse.org/update/leap/42.3/oss/src/pacemaker-1.1.16-4.12.1.src.rpm

@jnpkrn

This comment has been minimized.

@kgaillot

This comment has been minimized.

Copy link
Collaborator

commented May 20, 2019

FYI there were several subsequent related commits that are important. I'm hoping to release 1.1/2.0 rc's in the next couple of days, and then I plan to email the user list with a list of all the commits related to the security fixes, for anyone doing such backporting.

@wferi

This comment has been minimized.

Copy link
Contributor

commented May 21, 2019

Thanks everybody for their input, I successfully applied the confused deputy and the NULL deref fixes using your previous backports. The use-after-free in the alert handling is still unclear, though: the restructuring done there between 1.1.17 and 1.1.18 obstructs a straightforward backport of the simplest patch in this bunch. How does this use-after-free happen? Understanding the scenario would probably help me thread the change back to 1.1.16.

@jnpkrn

This comment has been minimized.

Copy link
Contributor Author

commented May 21, 2019

@wferi, sorry, missed the 1.1.16 precondition; to the best of my knowledge
that version doesn't suffer from that particular instance of use-after-free
(unless you got it there through backport), see also my tracking attempt:
https://wiki.clusterlabs.org/wiki/Security

@wferi

This comment has been minimized.

Copy link
Contributor

commented May 22, 2019

Thanks for the info, @jnpkrn, that part is set then.
@gao-yan, what was the conclusion with respect to the behavior change you reported in https://bugzilla.suse.com/show_bug.cgi?id=1131356#c7?

@gao-yan

This comment has been minimized.

Copy link
Member

commented May 22, 2019

It's already addressed by @jnpkrn especially with this following-up patch: d324e40

@wferi

This comment has been minimized.

Copy link
Contributor

commented May 22, 2019

Ah, great, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.