
ceph-disk: partprobe should block udev induced BLKRRPART #9330

Merged
merged 1 commit into ceph:master on Jul 4, 2016

Conversation


@ghost ghost commented May 25, 2016

@ghost ghost added bug-fix core labels May 25, 2016
@idryomov
Contributor

I would suggest a shorter sleep and more iterations, at least 50/5 or 100/2-3. Given that this race is so easy to hit that you thought the device had stayed busy, 25 might not cut it. I also think it's worth printing an info message every 10th or 20th iteration.
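For illustration, a shell sketch of the retry shape being suggested (the real change lives in ceph-disk; /dev/vdc is a stand-in device):

    # sketch only: up to 100 attempts, 2 second sleep, progress note every 20th try
    for i in $(seq 100) ; do
        if partprobe /dev/vdc > /dev/null 2>&1 ; then
            break
        fi
        if [ $((i % 20)) -eq 0 ] ; then
            echo "INFO: partprobe /dev/vdc still failing after $i attempts"
        fi
        sleep 2
    done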

@ghost
Author

ghost commented May 25, 2016

@idryomov thanks for your quick review :-) Is this better?

@idryomov
Contributor

idryomov commented May 25, 2016

Looks good to me. Did you try it on that box?

@ghost
Author

ghost commented May 25, 2016

@idryomov not yet, I will

@ghost
Author

ghost commented May 26, 2016

I ran

while : ; do
    for i in $(seq 50) ; do
        if partprobe /dev/vdc > /dev/null 2>&1 ; then
            echo " $i"
            break
        else
            echo -n .
            sleep 5
        fi
    done
done

to verify, on the machine where the problem was originally observed, that retrying partprobe up to 50 times is enough. The highest number of retries so far is 14 (which is really high...).

@ghost
Author

ghost commented May 26, 2016

A better option could be to block the udev induced BLKRRPART with flock -s /dev/vdd partprobe /dev/vdd.
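To see why even a shared lock is enough (a sketch with flock(1); as established below, udev only issues BLKRRPART after taking a non-blocking exclusive lock, which any concurrently held lock defeats):

    # hold a shared lock on the device for a minute
    flock -s /dev/vdd sleep 60 &

    # meanwhile a non-blocking exclusive lock, as udev attempts it, fails
    flock -x -n /dev/vdd true || echo "flock(LOCK_EX|LOCK_NB) failed: udev would skip BLKRRPART"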

@ghost
Author

ghost commented May 26, 2016

@idryomov instead of retrying, partprobe is now wrapped in flock (b7fff6a). Do you think this is sane? I wrote a long explanation with pointers in the b7fff6a commit message. I've been running

[root@dhcp-126-126 ~]# while : ; do flock -s /dev/vdc partprobe /dev/vdc || break ; done

in a loop and saw no failure so far. The idea was suggested by Marius Vollmer and Brian C. Lane. It's simple and elegant and I think it fits well with how ceph-disk is currently using partprobe and sgdisk.

I have also verified that the flock in udev was there from the start (i.e. udev 214, a tag that includes 02ba8fb3357daf57f6120ac512fb464a4c623419, which adds the flock).

@ghost ghost changed the title from "ceph-disk: try partprobe more and more often" to "ceph-disk: partprobe should block udev induced BLKRRPART" on May 26, 2016
@idryomov
Contributor

I had flock on my mind too - that's what I meant by "udev backs off" in my reply on the tracker, but the problem is that udev started doing flock not too long ago. If we care about distros with an older udev (< 214), flock won't work. One such distro is Ubuntu 14.04, with parted 2.3 and a pre-systemd udev. Have you ever observed failures there, or is this specific to parted 3.1?
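For reference, a quick way to check which side of that line a given distro is on (udevadm --version prints the udev version number):

    udevadm --version    # < 214 predates the flock behavior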

@ghost
Author

ghost commented May 26, 2016

I had flock on my mind too - that's what I meant by "udev backs off" in my reply on the tracker

I was too uneducated at the time to understand what you meant, sorry about that :-(

If we care about distros with an older udev (< 214), flock won't work.

Prior to udev 214, no BLKRRPART was issued when a device opened in write mode was closed. Therefore I don't think we need to worry about that. Or am I missing something?

@ghost
Author

ghost commented May 26, 2016

test this please

@idryomov
Contributor

In that case, I'd call this resolved ;)

(I lost track of all the udev-related problems, see http://tracker.ceph.com/issues/14737 for an example.)

@ghost
Author

ghost commented May 26, 2016

ceph-workbench ceph-qa-suite --verbose --simultaneous-jobs 9 --ceph-qa-suite-git-url http://github.com/dachary/ceph-qa-suite --suite-branch wip-ceph-disk --ceph-git-url http://github.com/dachary/ceph --ceph wip-15176-partprobe --suite ceph-disk --upload --filter ubuntu_14

http://167.114.237.66:8081/ubuntu-2016-05-26_15:45:06-ceph-disk-wip-15176-partprobe---basic-openstack/

@ghost
Author

ghost commented May 27, 2016

The bluestore part of the ceph-disk suite fails on master, and therefore also fails with this patch. The rest of the suite passes on Ubuntu 14.04. Now trying with CentOS 7.2.

@ghost
Author

ghost commented May 27, 2016

It also passes the ceph-disk suite on CentOS 7.2 (except for the bluestore and multipath tests, which also fail on master).

@idealguo

idealguo commented Jul 2, 2016

@dachary, I tested the patch several times on jewel 10.2.2 and it resolves bug #15176. Do you have any plans to merge it into the jewel branch?

@ghost
Author

ghost commented Jul 2, 2016

I'm waiting for someone to review this pull request. I should ask explicitly now. @tchaikov would you have time for that?

@tchaikov tchaikov self-assigned this Jul 2, 2016
@tchaikov
Contributor

tchaikov commented Jul 2, 2016

sure, lemme take a look at it.

@tchaikov
Contributor

tchaikov commented Jul 4, 2016

Note to myself, from trying to understand the race here:

  • ceph-disk: write(/dev/sdc), close(/dev/sdc) => inotify =>
    • systemd/udev: with flock(/dev/sdc, exclusive=True): ioctl(BLKRRPART)
    • ceph-disk: partprobe: BLKPG commands

Guarding partprobe with flock() fixes this.
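In shell terms, the shape of the fix (a sketch, with /dev/sdc as the example device):

    # Hold a shared lock for the duration of partprobe. udev needs
    # flock(LOCK_EX|LOCK_NB) on the device before it issues BLKRRPART,
    # so it backs off while partprobe sends its BLKPG commands.
    flock -s /dev/sdc partprobe /dev/sdc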

@tchaikov
Contributor

tchaikov commented Jul 4, 2016

@dachary could you please add "Fixes: http://tracker.ceph.com/issues/15176" to your commit message? Then it is good to merge.

@tchaikov tchaikov assigned ghost and unassigned tchaikov Jul 4, 2016
Wrap partprobe with flock to stop udev from issuing BLKRRPART because
this is racy and frequently fails with a message like:

    Error: Error informing the kernel about modifications to partition
    /dev/vdc1 -- Device or resource busy.  This means Linux won't know about
    any changes you made to /dev/vdc1 until you reboot -- so you shouldn't
    mount it or use it in any way before rebooting.

Opening a device (/dev/vdc for instance) in write mode indirectly
triggers a BLKRRPART ioctl from udev (starting version 214 and up)
when the device is closed (see below for the udev release note).

However, if udev fails to acquire an exclusive lock (with
flock(fd, LOCK_EX|LOCK_NB); ) the BLKRRPART ioctl is not issued.

https://github.com/systemd/systemd/blob/045e00cf16c47bc516c0823d059b7548f3ce9c7c/src/udev/udevd.c#L1042

Acquiring an exclusive lock before running the process that opens the
device in write mode is therefore an effective way to control this
behavior.

git clone git://anonscm.debian.org/pkg-systemd/systemd.git
systemd/NEWS:
CHANGES WITH 214:

  * As an experimental feature, udev now tries to lock the
       disk device node (flock(LOCK_SH|LOCK_NB)) while it
       executes events for the disk or any of its partitions.
       Applications like partitioning programs can lock the
       disk device node (flock(LOCK_EX)) and claim temporary
       device ownership that way; udev will entirely skip all event
       handling for this disk and its partitions. If the disk
       was opened for writing, the close will trigger a partition
       table rescan in udev's "watch" facility, and if needed
       synthesize "change" events for the disk and all its partitions.
       This is now unconditionally enabled, and if it turns out to
       cause major problems, we might turn it on only for specific
       devices, or might need to disable it entirely. Device Mapper
       devices are excluded from this logic.

Fixes: http://tracker.ceph.com/issues/15176

Signed-off-by: Marius Vollmer <marius.vollmer@redhat.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
@ghost
Author

ghost commented Jul 4, 2016

@tchaikov thanks for your review. I amended the commit message and will merge as soon as make check returns.

@tchaikov tchaikov merged commit 042c0aa into ceph:master Jul 4, 2016
@tchaikov tchaikov assigned tchaikov and unassigned ghost Jul 4, 2016
@pcahyna pcahyna mentioned this pull request Oct 18, 2017