Support 4k storage #4974
Comments
Sector size is advertised by the block backend in xenstore. This issue is really unfortunate, because a lot of places in Qubes assume you can freely transfer a disk image and it will work just fine. This includes cloning VMs (including cloning to a different storage pool), backup/restore, etc. So, the solution here should be either:

- making disk images sector-size agnostic, so the same image works regardless of the backend's sector size, or
- emulating a constant sector size regardless of the underlying storage.

The second one may come with a performance penalty. The first one would not have this problem, but I'm not sure it's possible. I'm fine with making the partition table 4K aligned, as long as it will also work with a 512 sector size. But it isn't clear to me that it would be enough. The partition table and filesystem are built here: https://github.com/QubesOS/qubes-linux-template-builder/blob/master/prepare_image#L63-L83

Another idea would be to revert to a filesystem directly on /dev/xvda (without any partition table). This may not be as simple as it sounds, because we need to fit grub somewhere (in the HVM with in-VM kernel case). But all this may not work for other cases, including other OSes. Imagine installing some OS (Linux, Windows, whatever) in a standalone HVM and then moving it to another storage pool (or restoring a backup on another machine). Those cases may require emulating a constant sector size.

Sadly, I don't have any hardware with a 4K physical sector size to test on. I'll try to find a way to emulate one.

BTW, another issue arising from the 4K sector size is 8GB of swap instead of 1GB. But this should be easy to fix in this script.
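The swap discrepancy follows directly from the size being defined as a fixed count of sectors. A quick sketch of the arithmetic (the sector count below is illustrative, not taken from the actual build script):

```shell
# Illustrative: a swap area defined as a fixed number of sectors, as in
# a template build script that assumes 512B sectors.
swap_sectors=2097152                       # 2097152 * 512B = 1 GiB

size_512=$(( swap_sectors * 512 ))         # bytes with 512B sectors
size_4096=$(( swap_sectors * 4096 ))       # bytes with 4K sectors

echo "512B sectors: $(( size_512 / 1073741824 )) GiB"
echo "4K sectors:   $(( size_4096 / 1073741824 )) GiB"
```

The same sector count yields 1 GiB with 512B sectors but 8 GiB with 4K sectors, which matches the 1GB-vs-8GB swap observation above.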
A lot of useful info: https://superuser.com/questions/679725/how-to-correct-512-byte-sector-mbr-on-a-4096-byte-sector-disk
There's not much to worry about regarding 4K alignment, it is already there in the template: from what I gathered, the partition table tools nowadays will enforce at least 4K alignment and will warn if that would be violated (some might do even larger alignment). This is why I managed to rewrite the template's partition table so easily in the first place (except for the truncate issue).

I don't think forcing a 512 sector size itself would come with a large penalty, as in practice the filesystems inside will use something larger than 512 (depending on how all the relevant block stuff handles the larger contiguous units, of course, but I'd guess that would not cause performance problems). So it would be mostly relevant for booting up correctly.

What I'd rather avoid, though, is forcing my drive's firmware to use a 512 sector size, as it would explore less tested corners of the firmware and possibly have a significant performance impact too (I know my NVMe drive could do 512, but I don't know if all 4K drives are able to, so this needs to be handled anyway).
Like I said, libvirt supposedly has a way to configure logical_block_size. I'll probably try to use the file backend (that's what was used in R3.2, right?) for the main system for now (the NVMe drive should clone fast anyway :-) so the biggest downside I know of is a non-issue). Can the installer do that automatically if I simply reinstall (that is, how does it choose which type of storage pool to use by default), or do I have to manually set everything up afterwards, skipping the firstboot stuff to avoid it failing?

I can then look into the 4K stuff while other stuff keeps working fine with 512. That would also allow me to easily test copying across 512 and 4K, but that looks rather scary to begin with; so far just about nothing seems portable from one sector size to another from what I've read.

If the partition table were removed from xvda, grub might have a similar 4K vs 512 issue anyway, so that might not solve anything (a sector size was mentioned somewhere when I tried to look into what kind of information format it uses, which sounds bad), but this needs a deeper investigation.
losetup seems able to fake the logical block size.
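For reference, a minimal sketch of that approach with losetup (util-linux 2.30 or later; the image path is made up, and attaching the loop device requires root so it's only shown in a comment):

```shell
# Create a small sparse test image.
truncate -s 16M /tmp/disk-4k.img

# Attach it as a loop device with a 4096-byte logical sector size.
# (Requires root; shown for illustration only.)
#   losetup --sector-size 4096 --find --show /tmp/disk-4k.img
# The kernel then reports 4096 for the loop device via:
#   blockdev --getss /dev/loopN

# The image file itself is unchanged; only the attachment differs.
ls -l /tmp/disk-4k.img
```

This makes it possible to experiment with 4K logical sectors without owning a 4Kn drive.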
Yes, I think it's KVM only.
If you choose btrfs during installation, Qubes will use that instead of LVM.
Couldn't the faking be done the other way around? My intuition is that, in block code, log4096 -> phy512 is far simpler than log512 -> phy4096. Or is there some particular reason why 512 is still needed for the VM disk format, which is almost internal to Qubes? VMs will obviously see the end result, but they should have little reason to change how the sector size is defined by the "internal" format. Or is there some other OS that only works with 512? That would leave just a few things to address:
I'm not sure about disks emulated by QEMU. And then there are the Windows PV drivers. Recently I've seen some patches flying around fixing a 512 sector size assumption somewhere there, so there may still be more issues like this.
Can I throw in another alignment data point to consider: the LVM chunk_size, which can range from 64KB to 1MB. Policy-wise, Qubes may want to consider ensuring that any physical partitions (or partitions inside LVM LVs) that are created by Qubes tools and/or the installer are 1MB aligned, primarily for performance reasons. Probably not as critical as the baseline fixes to ensure 4K logical sector drives work, but since that requires changes anyway, consider enforcing a much stricter alignment going forward (see the volatile volume issue #5151). Brendan
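The alignment check itself is trivial to express. A hedged sketch (the helper name is made up, not from Qubes code):

```shell
# Hypothetical helper: check that a partition's start offset (in bytes)
# is aligned to 1 MiB, the largest LVM chunk size mentioned above.
is_1mib_aligned() {
    [ $(( $1 % (1024 * 1024) )) -eq 0 ]
}

# Typical modern layout: first partition at sector 2048 of a 512B-sector
# disk, i.e. exactly 1 MiB in.
start_bytes=$(( 2048 * 512 ))
if is_1mib_aligned "$start_bytes"; then
    echo "aligned"
else
    echo "misaligned"
fi
```

Note that 1 MiB alignment automatically implies 4K alignment, so enforcing the stricter rule covers both concerns.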
If anyone needs 4Kn templates right now, you can use my patch from https://gist.github.com/arno01/ae31e1e9098591dadde3d1fc8c707000

I have also found that

And there is some interesting discussion about the 4Kn sector disks.
This will become a bigger problem with R4.2, where cryptsetup >= 2.4.0 (Fedora >= 35) will automatically create dm-crypt with 4K logical sectors even on 512e drives. Ideas:
I wonder if Qubes pools should specify the sector size of their underlying storage technology, and whether importing volumes should involve a conversion step? B
Conversion during import would mean parsing VM data in dom0 😬 Or in a DisposableVM, I guess.
OK, someone should definitely write a DisposableVM-powered converter for common volume contents. But automatic conversion won't be possible in all cases (like standalone HVMs, where a volume could contain anything, e.g. block-size dependent filesystems like XFS that might not be straightforward to upgrade), so even with a very good converter there's still a need for
Interesting. Do you know how they implement this? I thought this direction is the tricky one, because a block device should guarantee atomic writes per sector (in other words, you should always see either the version before the write or a fully updated sector, but never a mix). So a proper implementation likely needs a journal.
At least at the dm level, support seems to exist: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-ebs.html
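For the record, a hedged sketch of what a dm-ebs mapping for this direction (emulating 512B logical blocks on a 4K volume) might look like. Per the dm-ebs documentation, the target takes `<dev path> <offset> <emulated sectors> [<underlying sectors>]`, all counted in 512-byte units; the device name and size here are made up:

```shell
# Sketch (untested): build a dm-ebs table emulating 512-byte logical
# blocks on top of a device with 4K blocks.
dev=/dev/mapper/some-4k-volume          # made-up device name
size_sectors=$(( 2 * 1024 * 1024 ))     # device size in 512B sectors (1 GiB)

# <start> <length> ebs <dev path> <offset> <emulated sectors> <underlying sectors>
table="0 $size_sectors ebs $dev 0 1 8"
echo "$table"

# Creating the actual mapping requires root:
#   echo "$table" | dmsetup create xvda-512e
```

Whether dm-ebs journals the read-modify-write cycle (and thus preserves per-sector atomicity across power failure) is exactly the open question raised above.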
I'm kinda curious too about how writes really work for kernel -> 512e drive communication. Pure speculation: Since both the kernel and the drive know that the drive's physical block size is 4K, maybe the kernel just always writes batches of 8 * 512B logical blocks - and when the drive sees logical blocks coming in fast enough, one immediately following another, it figures out that read-modify-write can be avoided? Or there could be some explicit way for the kernel to signal to the drive that it's aware of 512e and that it guarantees to send 4K blocks merely encoded as batches of 512B blocks.
Huh. Thanks! Wonder if that's better than a loop device.
I can think of at least two solutions:
@DemiMarie I don't get (1). Would there be some script in the VM's initrd to rewrite the partition table ("activating" the stashed away 512B or 4K version) depending on xvda's current logical block size? Dynamically switching back and forth between 512B and 4K partitioning in general seems like it could make resizing the volume ( |
Generally, I'd try to avoid any kind of conversion at startup and go for emulation when necessary. That means:
And then, either build templates with two flavors, or convert at the install time (as part of Can we get away without emulating 4k bs on 512 bs devices? |
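One conceivable shape for such an install-time conversion: dump the 512B-sector partition table as text and divide every start/size value (counted in sectors) by 8 before re-applying it on a 4K device. A hedged sketch; the dump text is a made-up sample in sfdisk's dump format, not taken from a real template image:

```shell
# Rescale a 512B-sector sfdisk-style dump for a 4K device by dividing
# every start=/size= value (given in sectors) by 8.
dump='label: gpt
start=2048, size=1048576, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4
start=1050624, size=2097152, type=0FC63DAF-8483-4772-8E79-3D69D8477DE4'

scaled=$(echo "$dump" | awk '{
    for (i = 1; i <= NF; i++) {
        # Rewrite "start=N," and "size=N," fields; leave the rest alone.
        if ($i ~ /^(start|size)=[0-9]+,?$/) {
            split($i, kv, /[=,]/)
            comma = ($i ~ /,$/) ? "," : ""
            $i = kv[1] "=" kv[2] / 8 comma
        }
    }
    print
}')
echo "$scaled"
```

In a real converter the dump would come from `sfdisk --dump` on the image, and re-applying it would need a tool aware of the target sector size; this only illustrates the factor-of-8 rescaling.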
Once there's a way to attach a volume as 4K, why even bother building (or converting to) 512B templates?
Forcing I'd guess almost all of those drives (that make sense to install a Qubes storage pool on) actually have 4K physical sectors anyway, but it's misreported by shoddy firmware or an adapter. |
Thanks! Latest updates to the proposal:
Here is openQA run with 4kn emulated disk: https://openqa.qubes-os.org/tests/52712
And every VM fails to start, as expected. Note to self: set
Given the failure mode on 4.2 is worse than on 4.1, I think we should have it in 4.2. The plan outlined by @rustybird in #4974 (comment) looks good. @rustybird are you up for implementing this?
@marmarek: what about write tearing? 4K sector writes on a 512e drive are not guaranteed to be atomic, and IIRC are not atomic on some low-end SSDs in the event of power failure. XFS takes precautions against this.
Maybe we should get the R4.2 regression affecting 512e disks out of the way first, by hacking the installer to use

But yes, I'm also still interested in starting on the proposal's phase 1 and 2 at least, to fix the existing lvm_thin (and zfs driver too?) incompatibility with 4Kn drives. Not sure how long it will take, though.

Phases 3 and 4 are where things would get spicy, with the whole atomicity question. If a Qubes storage volume is exposed to the VM as a 4K block device even though the disk hardware might not provide atomic 4K writes, will this cause the filesystem on the volume to falsely rely on 4K writes being atomic when it otherwise wouldn't have, either for its own purposes in attempting to preserve its data structures' integrity, or as some sort of ineffective guarantee to an application writing data into a file? TBD...

There's an interesting writeup: https://stackoverflow.com/a/61832882
See patch description for details QubesOS/qubes-issues#4974
I tried in QubesOS/qubes-anaconda#28
I think I did set it correctly:
Any ideas?
The test had Hardcoding |
@marmarek: I think overlaying two different GPTs is the only reasonable approach here. At some point everyone will be using bcachefs or another filesystem that does not care about sector size, but we are not there yet.
Is it only a partition table issue? What about the filesystem?
Yup, it’s just about the partition table, and we can avoid the dynamic resize problem by changing the partition table with our own tools that understand the different layout.
@rustybird what do you think about adjusting the partition table in initramfs?
I like the simplicity of it, especially compared to my 4-phase slog of a proposal. The adjustment script should probably bail out early if the root volume looks too nonstandard? E.g. if xvda3 is not an ext4 filesystem (or another filesystem type that's whitelisted as known to be logical-block-size agnostic).
Which filesystem types should be on the allowlist? |
IIRC ext3 and btrfs are fine too, xfs definitely isn't |
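The bail-out check described above could be sketched roughly like this (a hedged illustration; the function name and device path are made up, not from any Qubes initramfs code):

```shell
# Hypothetical allowlist check: only adjust the partition table if xvda3
# carries a filesystem known to be logical-block-size agnostic.
fs_allowed() {
    case "$1" in
        ext3|ext4|btrfs) return 0 ;;
        *)               return 1 ;;   # e.g. xfs, which depends on block size
    esac
}

# In a real initramfs this would come from something like:
#   fstype=$(blkid -o value -s TYPE /dev/xvda3)
fstype=ext4

if fs_allowed "$fstype"; then
    echo "adjusting partition table"
else
    echo "bailing out: $fstype may depend on the block size"
fi
```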
Setting aside the question of what a good default would be for Qubes OS, using btrfs instead of LVM + ext4 actually does work around this issue on my 1 TB NVMe SSD (the model is a Sandisk Corp WD Blue SN570). It reports 4096 for both logical and physical sector size, due to me following the instructions in the ArchWiki entry for Advanced Format a while back.

Anyway, I ran into the same issue that OP mentioned while installing R4.2 on my desktop using the default disk configuration scheme, but everything works OK so far on a reinstall using btrfs. I wouldn't have even thought to do this if not for this issue and this comment on a duplicate issue.

It makes me wonder whether other filesystems that have similar properties to btrfs would work, such as ZFS (I'm not suggesting Qubes OS needs to adopt ZFS; I'm just thinking out loud here). Seems like adjusting the drive to emulate a sector size of 512 again may not be a viable workaround, based on others' comments here, but I haven't tested it.
That’s really interesting, but it actually makes sense: since BTRFS is copy-on-write, it can (at the expense of performance) make arbitrarily small writes atomic. |
Qubes OS version
R4.0
Affected component(s) or functionality
VMs not working/starting right from a fresh install.
Brief summary
Right after a fresh install, all VMs fail to mount root and therefore fail to start beyond the point where they expect /dev/xvda3 to be available. This happens on a device that has 4kB logical and physical block sizes (an NVMe drive). This was not a problem in R3.2 (as it used files by default for VM storage).
To Reproduce
Steps to reproduce the behavior:
Expected behavior
VMs would start. Firstboot stuff would work. Drives with 4kB sector size would work.
Additional context
I've tracked this down to the handling of the partition table. With 512B sectors, the location of the GPT differs from its location with 4kB sectors, and therefore VMs fail to find the correct partition table on xvda. Obviously the partition start/end values will also be off by a factor of 8, because the templates are built(?) with an assumption of a 512B sector size.
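The GPT location mismatch can be made concrete with a little arithmetic: the primary GPT header lives at LBA 1 (a byte offset equal to the logical sector size) and the backup header at the last LBA, so both move when the sector size changes. Illustrative numbers for a made-up 10 MiB disk:

```shell
# Byte offsets of the GPT headers for two logical sector sizes.
disk_bytes=$(( 10 * 1024 * 1024 ))         # illustrative 10 MiB disk

for bs in 512 4096; do
    primary=$(( bs * 1 ))                  # byte offset of LBA 1
    backup=$(( disk_bytes - bs ))          # byte offset of the last LBA
    echo "bs=$bs: primary header at byte $primary, backup at byte $backup"
done
```

A kernel reading the disk with 4K sectors looks for the header at byte 4096, while a 512B-built image placed it at byte 512, so the table simply isn't found.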
I'm not sure if there are other assumptions based on 512B sectors with the other /dev/xvd* drives.
Solutions you've tried
I cloned a template and tried to manually fix the partition table of the clone (in dom0, through /dev/qubes_dom0/...). There was plenty of space before the first partition; however, at the end the drive is so tight on space that the secondary GPT won't fit, so the xvda3 partition's tail was truncated slightly, and I didn't try to resize its filesystem first (this probably causes some problems, potentially corruption?). With such a fixed partition table, I could start VMs (but there are then some other problems/oddities that might be due to an incomplete firstboot or the non-fixed Fedora template; I only fixed the Debian one, which I mainly use normally). I could possibly enlarge the relevant LV slightly to avoid the truncation problem at the tail of xvda3, but I haven't tried that yet.
I tried to look into whether I could somehow force the pv/vg/lv chain to fake the logical sector size, but couldn't find anything in the manpages.
Libvirt might be able to fake the logical_block_size, but I've not yet tried that.
Relevant documentation you've consulted
During install, I used the custom install steps to create manual partitioning (but I think it is irrelevant).
Related, non-duplicate issues
None I could find, some other issues included failure to mount root successfully but the causes are different.