ceph-disk: partprobe should block udev induced BLKRRPART #9330
I would suggest a shorter sleep and more iterations, at least 50/5 or 100/2-3. Given that this race is so easy to hit that you thought the device had stayed busy, 25 might not cut it. I also think it's worth printing an info message on each 10th or 20th iteration. |
@idryomov thanks for your quick review :-) Is this better ? |
Looks good to me. Did you try it on that box? |
@idryomov not yet, I will |
Running while : ; do for i in $(seq 50) ; do if partprobe /dev/vdc > /dev/null 2>&1 ; then echo " $i" ; break ; else echo -n . ; sleep 5 ; fi ; done ; done to verify, on a machine where the problem was originally observed, that running partprobe 50 times is enough. The highest number of retries so far is 14 (which is really high...). |
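For reference, the retry approach discussed above could be sketched as a small shell function. This is only an illustration, not the code from the patch: the name retry_cmd, its argument order, and the message wording are hypothetical, and the progress note on every 10th attempt follows the suggestion made earlier in the review.

```shell
#!/bin/bash
# retry_cmd MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_TRIES
# attempts have failed, sleeping DELAY_SECONDS after each failure.
retry_cmd() {
    local max_tries=$1 delay=$2 i
    shift 2
    for i in $(seq "$max_tries"); do
        if "$@" > /dev/null 2>&1; then
            echo "succeeded on attempt $i"
            return 0
        fi
        # Print a progress note on every 10th failed attempt.
        if [ $((i % 10)) -eq 0 ]; then
            echo "still failing after $i attempts" >&2
        fi
        sleep "$delay"
    done
    echo "giving up after $max_tries attempts" >&2
    return 1
}

# Example (needs root and a real disk; /dev/vdc is the device used in this thread):
# retry_cmd 50 5 partprobe /dev/vdc
```

A bounded loop like this only makes the race less likely to surface; it does not stop udev from issuing BLKRRPART in the first place.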
A better option could be to block udev BLKRRPART with flock -s /dev/vdd partprobe /dev/vdd. |
@idryomov instead of retrying, partprobe is now wrapped in flock (b7fff6a). Do you think this is sane? I wrote a long explanation with pointers in the b7fff6a commit message. I've been running [root@dhcp-126-126 ~]# while : ; do flock -s /dev/vdc partprobe /dev/vdc || break ; done in a loop and have seen no failure so far. The idea was suggested by Marius Vollmer and Brian C. Lane. It's simple and elegant and I think it fits well with how ceph-disk is currently using partprobe and sgdisk. I have also verified that the flock in udev was there from the start (i.e. udev 214, a tag that includes 02ba8fb3357daf57f6120ac512fb464a4c623419, which adds the flock). |
I had flock on my mind too - that's what I meant by "udev backs off" in my reply on the tracker, but the problem is that udev started doing flock not too long ago. If we care about distros with an older udev (< 214), flock won't work. One such distro is Ubuntu 14.04, with parted 2.3 and a pre-systemd udev. Have you ever observed failures there, or is this specific to parted 3.1? |
I was too uneducated at the time to understand what you meant, sorry about that :-(
Prior to udev 214, no BLKRRPART was issued when a device opened in write mode was closed. Therefore I don't think we need to worry about that. Or am I missing something? |
test this please |
In that case, I'd call this resolved ;) (I lost track of all the udev-related problems, see http://tracker.ceph.com/issues/14737 for an example.) |
The bluestore part of the ceph-disk suite fails on master, and therefore also fails with this patch. The rest of the suite passes on Ubuntu 14.04. Now trying with CentOS 7.2. |
It also passes the ceph-disk suite on CentOS 7.2 (except for the bluestore and multipath tests, which also fail on master). |
I'm waiting for someone to review this pull request. I should ask explicitly now. @tchaikov would you have time for that ? |
sure, lemme take a look at it. |
note for myself, when trying to understand the racing here.
guard partprobe with |
@dachary could you please add "Fixes: http://tracker.ceph.com/issues/15176" in your commit message. then it is good to merge. |
Wrap partprobe with flock to stop udev from issuing BLKRRPART because this is racy and frequently fails with a message like:
Error: Error informing the kernel about modifications to partition /dev/vdc1 -- Device or resource busy. This means Linux won't know about any changes you made to /dev/vdc1 until you reboot -- so you shouldn't mount it or use it in any way before rebooting.
Opening a device (/dev/vdc for instance) in write mode indirectly triggers a BLKRRPART ioctl from udev (starting version 214 and up) when the device is closed (see below for the udev release note). However, if udev fails to acquire an exclusive lock (with flock(fd, LOCK_EX|LOCK_NB); ) the BLKRRPART ioctl is not issued.
https://github.com/systemd/systemd/blob/045e00cf16c47bc516c0823d059b7548f3ce9c7c/src/udev/udevd.c#L1042
Acquiring an exclusive lock before running the process that opens the device in write mode is therefore an effective way to control this behavior.
git clone git://anonscm.debian.org/pkg-systemd/systemd.git
systemd/NEWS:
CHANGES WITH 214:
* As an experimental feature, udev now tries to lock the disk device node (flock(LOCK_SH|LOCK_NB)) while it executes events for the disk or any of its partitions. Applications like partitioning programs can lock the disk device node (flock(LOCK_EX)) and claim temporary device ownership that way; udev will entirely skip all event handling for this disk and its partitions. If the disk was opened for writing, the close will trigger a partition table rescan in udev's "watch" facility, and if needed synthesize "change" events for the disk and all its partitions. This is now unconditionally enabled, and if it turns out to cause major problems, we might turn it on only for specific devices, or might need to disable it entirely. Device Mapper devices are excluded from this logic.
Fixes: http://tracker.ceph.com/issues/15176
Signed-off-by: Marius Vollmer <marius.vollmer@redhat.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
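The locking rule the commit message relies on can be tried out without a block device. The sketch below is an illustration only: it uses a scratch file from mktemp in place of /dev/vdc, the variable names are made up for the example, and it assumes the flock utility from util-linux is installed. It shows that while a shared lock is held (as flock -s does for the lifetime of the wrapped command), a non-blocking exclusive request of the kind the commit message attributes to udev is refused.

```shell
#!/bin/bash
# Scratch file standing in for the disk device node (assumption for the demo).
lockfile=$(mktemp)

# Hold a shared lock on file descriptor 9, as "flock -s DEVICE COMMAND"
# does while COMMAND runs.
exec 9< "$lockfile"
flock -s 9

# A separate process asking for a non-blocking exclusive lock is refused.
# Per the commit message, this is the case where udev skips BLKRRPART.
if flock -x -n "$lockfile" true; then
    exclusive_while_shared=granted
else
    exclusive_while_shared=refused
fi

# Another shared lock is still granted while the first one is held.
if flock -s -n "$lockfile" true; then
    shared_while_shared=granted
else
    shared_while_shared=refused
fi

# Release the shared lock; the exclusive request now succeeds.
flock -u 9
exec 9<&-
if flock -x -n "$lockfile" true; then
    exclusive_after_release=granted
else
    exclusive_after_release=refused
fi

rm -f "$lockfile"
echo "exclusive while shared held: $exclusive_while_shared"
echo "shared while shared held:    $shared_while_shared"
echo "exclusive after release:     $exclusive_after_release"
```

This is why a shared lock (flock -s) is enough for the wrapper: it conflicts with the exclusive lock udev tries to take, without requiring exclusive ownership of the device.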
@tchaikov thanks for your review. I amended the commit message and will merge as soon as make check returns. |
Fixes http://tracker.ceph.com/issues/15176