PARTITION_STYLE="msdos" broken on storage with 4k blocks #2

dpavlin · 2018-10-04T10:48:24Z

debootstrap on nodes with 4k storage blocks will create invalid partition table which won't boot inside kvm since it expects partition table with 512b blocks.

PARTITION_ALIGNMENT is currently broken and I opted to remove it all together because with 4k blocks it has VERY different behavior if you are using 2048 or 1M as argument. Since this is triggered only when PARTITION_STYLE="msdos" is used, and 1Mb os enough space i decided to hard-code that.

I'm open to suggestions how to fix this better or in cleaner way.

First commit is change which comes from Debian and is required for recent sfdisk versions, I don't know why this repository doesn't have it. It comes from https://sources.debian.org/patches/ganeti-instance-debootstrap/0.16-2.1/

googlebot · 2018-10-04T10:48:27Z

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.

What to do if you already signed the CLA

Individual signers

It's possible we don't have your GitHub username or you're using a different email address on your commit. Check your existing CLA data and verify that your email is set on your git commits.

Corporate signers

Your company has a Point of Contact who decides which employees are authorized to participate. Ask your POC to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the Google project maintainer to go/cla#troubleshoot (Public version).
The email used to register you as an authorized contributor must be the email used for the Git commit. Check your existing CLA data and verify that your email is set on your git commits.
The email used to register you as an authorized contributor must also be attached to your GitHub account.

dpavlin · 2018-10-04T10:52:50Z

I signed it!

googlebot · 2018-10-04T10:52:52Z

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

dpavlin · 2018-10-04T10:55:05Z

Oh, it seems that @apoikos first has to submit his part of change because I can't sign licence agreement for him.

iustin · 2018-10-04T12:54:50Z

@apoikos - good question, do you have any other pending patches in the Debian packages?

apoikos · 2018-10-04T13:25:10Z

@iustin: no, that's the only patch left. I'm fine with including it in this PR.

apoikos · 2018-10-04T13:25:52Z

Of course I mean the only patch left for instance-debootstrap. There are many more for ganeti itself which I'll be upstreaming slowly.

iustin · 2018-10-04T15:04:23Z

Hmm, I'm not in best position to approve a patch with cla not able to sign it. @apoikos, would you mind much to send your patch separately?

apoikos · 2018-10-05T07:53:50Z

@iustin, submitted #3.

iustin · 2018-10-05T20:56:56Z

@apoikos, thanks, #3 is merged now. @dpavlin, can you resubmit with just your changes?

googlebot · 2018-10-06T06:16:17Z

CLAs look good, thanks!

Since sfdisk doesn't supprot forced chs geometry any more, in situations where underlaying storage has 4k block, this script will create correct partition table for host, but not for guest which expects 512b blocks. This will make machine unbootable, since once kvm starts partition will be at wrong offset and it won't boot. Solution is to re-create partition table at end using losetup which has 512b blocks. Since PARTITION_ALIGNMENT is created using 4k block as base, it will create filesystem at wrong place if you are using default value of 2048. This hard-codes host partition table to 1M (which will work on hosts with both 512b and 4k blocks) and then use 2048 when using 512b blocks.

iustin · 2018-10-06T06:31:00Z

common.sh.in

  sfdisk $ARGS --quiet "$1" <<EOF
-${PARTITION_ALIGNMENT},,L,*
+1M,,L,*


The sfdisk man page says:

The default start offset for the first partition is 1MiB.

What do you think about leaving the value entirely out? And when/if sfdisk defaults change again, this code won't need change too.

If sfdisk changes default, we would need to change at last re-paritition offset (2048), so I would prefer to have exact values here to be on safe side.

I don't understand. I'm talking about removing the value, i.e. changing this into:

sfdisk $ARGS --quiet "$1" <<EOF ,,,L,*

Which makes sfdisk enter its value. We wouldn't have any PARTITION_OFFSET value anymore.

iustin · 2018-10-06T13:08:43Z

common.sh.in

@@ -127,6 +127,18 @@ map_disk0() {

 unmap_disk0() {
  kpartx -d -p- $1
+


Uh, why do you reformat the disk in the umap function? This should just unmap any partitions, but not touch the disk.

It seems to me that it's right place. I don't want to touch disk until kernel has view of partitions via kpartx, so I opted to put it in unmap. I considered pushing it into CLEANUP, but losetup puts losetup -d in CLEANUP in create and using CLEANUP would involve modifying create also, so it seemed to me like cleaner and smaller patch.

Uh, sorry to ask, but do you understand how the sequence of actions happen here? In unmap the OS is already installed. Basically this happens:

format disk

map partitions

install OS

run cleanup, which includes unmap.

Why do you want to repartition in cleanup/unmap?

This re-partition after host has finished with filesystem is the fix for 4k problems.

4k drives need 256 as aligment to create first partition at 1M which turns info offset 2048 with 512b blocks which makes them work in kvm.

I wrote a bit of documentation about it at https://github.com/ffzg/gnt-info/blob/master/doc/deboostrap-4k-msdos.txt

Does this explains it better then my attempts so far?

iustin · 2018-10-06T13:09:33Z

common.sh.in

@@ -161,7 +173,6 @@ fi
 : ${VARIANTS_DIR:="@sysconfdir@/ganeti/instance-debootstrap/variants"}
 : ${GENERATE_CACHE:="yes"}
 : ${CLEAN_CACHE:="14"} # number of days to keep a cache file
-: ${PARTITION_ALIGNMENT:=2048} # sectors to align partition (1Mib default)


Or you could just change here PARTITION_ALIGNMENT to 1M, and not need any changes - plus it'd remain configurable.

I though about this, but in my setup, I had PARTITION_ALIGMENT set to 2048, so we would also need to somehow update existing configuration or convert number (2048) to 1M (or any value which user specified). This seems to me more fragile, and since nobody noticed that it doesn't work, my assumption is that it's not heavily used, and having only one option to configure instead of two seems like simpler solution. Other than rbd which has 4Mb logical IO, I can't think of other storage which would create optimal layout with hard-coded 1M value, and to be honest, I prefer to have partition at 1M offset even on rbd.

Having said all this, I'm willing to write conversion from number to xM notation and keep PARTITION_ALIGMENT in code if you prefer that.

My proposal is to make PARTITION_ALIGNMENT default to (string value) "1m", instead of hardcoding it as 1m in the sfdisk invocation. Why would it need conversion at all?

iustin · 2018-10-08T19:04:16Z

Replying on the general thread since I guess the individual comments will go away once resolved.

Thanks for the doc you pointed to (https://github.com/ffzg/gnt-info/blob/master/doc/deboostrap-4k-msdos.txt), I understand now what and how you're trying to fix.

If I understand correctly, the root of the problem is that a 4K-sector drive (under Linux) is still exposed as a 512b-sector drive when booted from via KVM. And you plan to do this by repartitioning the first partition to the same absolute byte offset (1M) after it being mounted by losetup since that exposes a 512b sector similar to how kvm expected. In other words, the partition as first created by the initial sfdisk would have 256 sector offset (256*4K=1MB), after redoing over losetup/512b the offset will be 2048 (2048*512b=same 1MB). Correct?

I see a few problems with this approach.

The first one is that this relies on KVM forever using 512b; if at one point it changes and actually picks up the backing device sector size, you'll break existing instances.

Second, you're also relying on losetup matching the same 512b. If it changes behaviour, you'll again break things because it won't necessarily match KVM. You could pass the -b argument to it, but this seems to only be supported in recent kernels… In any case, you should still pass 1MB to the sfdisk-over-losetup invocation, hardcoding 2048 here doesn't make sense.

And last, is that this is forcing everybody to use 512b sectors, even if not using KVM. The fix should at least add a configurable key to say "force 512b-sector partitions".

IMO the proper fix here would be to make KVM pick up the correct block size (see https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg04458.html for example), but if you want to make this purely on the instance-debootstrap side, something along the following lines would be much cleaner:

replace 2048 with 1M, or even better, make the value default to empty string, so that sfdisk applies its default offset
add a setting to always use losetup, with the appropriate documentation, so instead of partitioning with 4K, formatting, installing, then repartitioning, it simpy wraps everything under losetup so the partitioning is only done once.

What do you think?

apoikos · 2018-10-10T08:04:39Z

I'd like to add a couple of notes:

First of all, this issue affects only setups that use a 4K logical block size. Setups with 4K physical block size and 512-byte logical block size work fine (at least ours does).

The root cause for this issue is that the partition table does not store any information about the sector size; the latter is provided by the disk itself and has to be looked up in order to convert LBA addresses to byte offsets. In our case, the issue arises because we partition the disk on the host (which sees the hardware's 4K logical block size), but go on to use it in the guest (which only sees a 512-byte sector); if we were running debootstrap/partitioning from inside the guest, then no problem would arise — other than the guest trying to do suboptimal 512-byte I/O.

The solution proposed by @dpavlin is to re-partition the device using 512-byte sectors through losetup. This works, but has the following drawbacks:

The guest will continue to do suboptimal 512-byte I/O
The guest's disk requires special handling to be accessed from inside the host, as kpartx will not work, possibly breaking instance-debootstrap's workflow — for example, instance renames or disk exports will not work as far as I can tell.

Instead of trying to settle this disagreement between guest and host, we could try to make sure that there is no disagreement at all! qemu has supported variable block sizes since 0.13, via the physical_block_size and logical_block_size blockdev options. In theory, all we would need to do is check the backing store's block size — if it's a raw device — using the respective sysfs attributes and add the relevant options to the device in KVM's command line. However, this also has its downsides:

It might break existing instances that somehow got a working partition table (e.g. if the OS provider does partitioning from within the instance), so we'd need some way of turning it off
I have no idea how this will interact with instance mobility. Imagine for instance a DRBD setup where one side has a 4K backing store and the other one a 512-byte. I'm not sure if DRBD would work in this case, but if it did we might end up using different logical block sizes pre- and post-failover/migration. To really protect against this, we would need to persist the logical block size that was used during instance creation in the instance's config and use it henceforth.

Overall, settling for 512-byte sectors universally is not a bad idea: it's the least common denominator and offers the best compatibility; I guess that's why Qemu does not pass the sector size through by default. To do this however, we must take care to always access the disk using 512-byte sectors, especially from within the host.

dpavlin force-pushed the 4k-msdos-partitions branch from 13ab06f to de37352 Compare October 6, 2018 06:16

dpavlin force-pushed the 4k-msdos-partitions branch from de37352 to d228d45 Compare October 6, 2018 06:19

dpavlin force-pushed the 4k-msdos-partitions branch from d228d45 to ef85e0b Compare October 6, 2018 06:24

iustin reviewed Oct 6, 2018

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PARTITION_STYLE="msdos" broken on storage with 4k blocks #2

PARTITION_STYLE="msdos" broken on storage with 4k blocks #2

dpavlin commented Oct 4, 2018

googlebot commented Oct 4, 2018

dpavlin commented Oct 4, 2018

googlebot commented Oct 4, 2018

dpavlin commented Oct 4, 2018

iustin commented Oct 4, 2018

apoikos commented Oct 4, 2018

apoikos commented Oct 4, 2018

iustin commented Oct 4, 2018

apoikos commented Oct 5, 2018

iustin commented Oct 5, 2018

googlebot commented Oct 6, 2018

iustin Oct 6, 2018

dpavlin Oct 6, 2018

iustin Oct 7, 2018

iustin Oct 6, 2018

dpavlin Oct 6, 2018

iustin Oct 7, 2018

dpavlin Oct 7, 2018

dpavlin Oct 7, 2018

iustin Oct 6, 2018

dpavlin Oct 6, 2018

iustin Oct 7, 2018

iustin commented Oct 8, 2018 •

edited

apoikos commented Oct 10, 2018

		@@ -127,6 +127,18 @@ map_disk0() {

		unmap_disk0() {
		kpartx -d -p- $1

PARTITION_STYLE="msdos" broken on storage with 4k blocks #2

Are you sure you want to change the base?

PARTITION_STYLE="msdos" broken on storage with 4k blocks #2

Conversation

dpavlin commented Oct 4, 2018

googlebot commented Oct 4, 2018

What to do if you already signed the CLA

Individual signers

Corporate signers

dpavlin commented Oct 4, 2018

googlebot commented Oct 4, 2018

dpavlin commented Oct 4, 2018

iustin commented Oct 4, 2018

apoikos commented Oct 4, 2018

apoikos commented Oct 4, 2018

iustin commented Oct 4, 2018

apoikos commented Oct 5, 2018

iustin commented Oct 5, 2018

googlebot commented Oct 6, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iustin commented Oct 8, 2018 • edited

apoikos commented Oct 10, 2018

iustin commented Oct 8, 2018 •

edited