Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OverlayFS broken on NixOS Kernel 4.19 and 4.20 #54509

Closed
bachp opened this issue Jan 23, 2019 · 26 comments
Closed

OverlayFS broken on NixOS Kernel 4.19 and 4.20 #54509

bachp opened this issue Jan 23, 2019 · 26 comments

Comments

@bachp
Copy link
Member

bachp commented Jan 23, 2019

Issue description

When using overlayfs with NixOS and Kernel 4.19 it is not possible to overwrite a file that already exists in the lower directory with new content. Instead if attempted the file appears empty. (See how to reproduce for details).

It seems like the revert of an upstream commit in 4.19 #52942 breaks something fundamentally in overlayfs.

EDIT: Reverting #52942 fixes the behavior. But I'm unable to run the tests as the VM kernel pancis :(

This also affects docker if used with one of the overlay drivers.

Steps to reproduce

Via the test in #54508

or

Manually:

mkdir -p /tmp/mnt/upper /tmp/mnt/lower /tmp/mnt/work /tmp/mnt/merged

# Create an existing file in the lower directory
echo 'Existing' > /tmp/mnt/lower/existing.txt

# Mount overlayfs
mount -t overlay overlay -o lowerdir=/tmp/mnt/lower,upperdir=/tmp/mnt/upper,workdir=/tmp/mnt/work /tmp/mnt/merged
 
# Prints "Existing" - OK
cat /tmp/mnt/merged/existing.txt

# Write some new content
"echo 'New' > /tmp/mnt/merged/existing.txt",

# Prints "" (empty) but it should be New - FAIL
cat /tmp/mnt/merged/existing.txt

Technical details

Please run nix-shell -p nix-info --run "nix-info -m" and paste the
results.

 - system: `"x86_64-linux"`
 - host os: `Linux 4.19.16, NixOS, 19.03.git.5d42db2 (Koi)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.2`
 - channels(root): `"nixos-17.03pre95306.a24728f"`
 - channels(pascal): `"nixos-19.03pre167251.20343f0ab47"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
@bachp
Copy link
Member Author

bachp commented Jan 23, 2019

/cc @aszlig and @samueldr as you probably know the most about this.

@samueldr samueldr added this to the 19.03 milestone Jan 23, 2019
@samueldr
Copy link
Member

Hmm, this is a blocker for 19.03, can't have either side of the current situation while keeping the latest LTS kernel.

Thanks for the (wip) test, I bet it'll be useful to check for the same kind of regression.

cc @lheckemann to keep in mind.

@aszlig was the issue we were working around reported upstream? I guess it hasn't been fixed since the patch (a partial revert) still applies.

@bachp
Copy link
Member Author

bachp commented Jan 26, 2019

I'm trying to debug the issue but I'm confused right now.

I tried to run the tests without the patches in #52942 but I then get the error:

switch_root: can't execute '/nix/store/i9rxcghqk9glylabsdl5wmnwv7w853wj-nixos-system-machine-19.03.git.4fb8bc8/init': Operation not permitted

I'm not sure this is actually the issue #52942 is working around?

I was first suspecting it has something to do with the interaction between 9p and overlayfs but then I made a curious observation that when I run a VM produced via:

nix-build nixos/ -A vm -I nixpkgs=/root/nixpkgs

It runs without issue using the same overlayfs over 9p setup.

Any ideas what is different between running a test VM and my second directly built VM?

@samueldr
Copy link
Member

samueldr commented Jan 26, 2019

IIRC, yes, Operation not permitted on switching root is the issue the patch is used to work around.

@bachp
Copy link
Member Author

bachp commented Jan 26, 2019

Ok but I still don't understand why it works in the non test case VM.

@samueldr
Copy link
Member

samueldr commented Jan 27, 2019

Ok but I still don't understand why it works in the non test case VM.

It does for me, from your commit, reverted the patch. (Also added a workaround for the gdk_pixbuf mismatch between 18.09 and unstable #54278.)

  • QEMU_KERNEL_PARAMS='console=ttyS0' result/bin/run-DUFFMAN-vm
  • View → serial0
<<< NixOS Stage 1 >>>

loading module virtio_balloon...
loading module virtio_console...
loading module virtio_rng...
loading module dm_mod...
running udev...
kbd_mode: KDSKBMODE: Inappropriate ioctl for device
starting device mapper and LVM...
checking /dev/vda...
fsck (busybox 1.29.3)
[fsck.ext4 (1) -- /mnt-root/] fsck.ext4 -a /dev/vda
/dev/vda: recovering journal
/dev/vda: clean, 11/32768 files, 6353/131072 blocks
mounting /dev/vda on /...
mounting store on /nix/.ro-store...
mounting tmpfs on /nix/.rw-store...
mounting shared on /tmp/shared...
mounting xchg on /tmp/xchg...
mounting overlay filesystem on /nix/store...
switch_root: can't execute '/nix/store/9h3bsjav4cxlb6xn7a4zjwvb1r71sc3m-nixos-system-DUFFMAN-19.03.git.b2669a2/init': Operation not permitted
[    1.170907] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[    1.170907] 
[    1.172524] CPU: 0 PID: 1 Comm: switch_root Not tainted 4.19.17 #1-NixOS
[    1.173684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
[    1.175602] Call Trace:
[    1.176171]  dump_stack+0x5c/0x7b
[    1.176901]  panic+0xe4/0x242
[    1.177541]  do_exit+0xaed/0xaf0
[    1.178209]  ? handle_mm_fault+0xdc/0x210
[    1.179004]  do_group_exit+0x3a/0xa0
[    1.180014]  __x64_sys_exit_group+0x14/0x20
[    1.180832]  do_syscall_64+0x4e/0x100
[    1.181589]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[    1.182502] RIP: 0033:0x7f1357beb646
[    1.183222] Code: Bad RIP value.
[    1.183887] RSP: 002b:00007ffdda70f9d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[    1.185284] RAX: ffffffffffffffda RBX: 00007f1357ed45a0 RCX: 00007f1357beb646
[    1.186477] RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001
[    1.187727] RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffff80
[    1.188917] R10: 0000000000000073 R11: 0000000000000246 R12: 00007f1357ed45a0
[    1.190161] R13: 0000000000000001 R14: 00007f1357edd488 R15: 0000000000000000
[    1.191539] Kernel Offset: 0x4200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    1.193362] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[    1.193362]  ]---

boot.kernelPackages = pkgs.linuxPackages_4_19; in my configuration.nix, maybe you're using an older LTS pinned by its version number?

@peti
Copy link
Member

peti commented Jan 29, 2019

For what it's worth, I can confirm the issue. Pretty much all my docker containers are broken after the recent channel update to 19.03pre166987.bc41317e243 a.k.a. nixos-unstable.

@samueldr
Copy link
Member

Hmm, looks like we're between a rock and a hard place, something changed in the kernel breaking our tests (not sure if it's userland that broke) and the revert to make the tests pass breaks overlayfs.

The simple solution, and what probably will happen for 19.03 unless something is figured out, is to revert back to the previous LTS as default, but this is not an ideal solution.

@ElvishJerricco
Copy link
Contributor

Should we revert #52942? Maybe there's a better way to fix the test failures.

@aszlig
Copy link
Member

aszlig commented Jan 29, 2019

The report has been posted to unionfs-linux and should show up in mailinglist archives soon, I'll post a link as soon as that's the case.

@peti
Copy link
Member

peti commented Jan 31, 2019

Should we revert #52942?

Yes, please. As much as it sucks, a situation where docker is completely broken on NixOS is not really tenable.

@bachp
Copy link
Member Author

bachp commented Jan 31, 2019

I think reverting #52942 and got back to 4.14 as the default would be the most reasonable thing to do until the issue is resolved.

@peti
Copy link
Member

peti commented Feb 2, 2019

Whatever we do, we should do it soon, because the current situation is really bad and IMHO we shouldn't have kept master in that broken state as long as we did already. Waiting another couple of days before we do something doesn't feel like a good idea.

@ElvishJerricco
Copy link
Contributor

I agree with the suggestion to revert #52942 and go back to 4.14. Then we can see about alternative solutions to the test failures with 4.19.

@aszlig
Copy link
Member

aszlig commented Feb 2, 2019

Hm, it seems my mail (Message-ID: <20190129194114.GA7785@dnyarri>) never reached the list for some reason, even though the MTA has accepted it:

DD42F980DD95A: to=<linux-unionfs@vger.kernel.org>, relay=vger.kernel.org[209.132.180.67]:25, delay=687, delays=621/0.01/0.63/65, dsn=2.7.0, status=sent (250 2.7.0 nothing apparently wrong in the message. BF:<H 0>; S1728330AbfA2Tww)

(This was on Jan 29 20:52:53 CET)

@szmi: Is the linux-unionfs mailing list moderated?

edit: Just sent it again without attachments, let's hope this was the culprit.

samueldr added a commit to samueldr/nixpkgs that referenced this issue Feb 2, 2019
This reverts commit de86af4.

(Manual revert due to conflicts.)

See NixOS#54509

The patch is causing overlayfs to misbehave.
samueldr added a commit to samueldr/nixpkgs that referenced this issue Feb 2, 2019
This reverts commit b861ebb.

The current issues (See NixOS#54509 and NixOS#48828) are causing headaches to
users of the unstable branches.
@aszlig
Copy link
Member

aszlig commented Feb 2, 2019

Okay, worked without attachments: https://www.spinics.net/lists/linux-unionfs/msg06627.html

@NeQuissimus
Copy link
Member

There are a bunch of overlay fixes in 4.20.7, can you try the latest kernels?

@samueldr
Copy link
Member

samueldr commented Feb 6, 2019

Same fixes with 4.19 (which is desired), looking this evening.

@samueldr
Copy link
Member

samueldr commented Feb 7, 2019

machine# Probing EDD (edd=off to disable)... ok
machine# c[    0.000000] Linux version 4.19.20 (nixbld@localhost) (gcc version 7.4.0 (GCC)) #1-NixOS SMP Wed Feb 6 16:30:16 UTC 2019
machine# [    0.000000] Command line: loglevel=7 console=ttyS0 panic=1 boot.panic_on_fail init=/nix/store/xsimkh58w0k6jmn6jfxf5arsk9xkdhk0-nixos-system-machine
-19.03.git.47aad6e/init regInfo=/nix/store/a6vn3r6600p5z216iwbjh81vxkhbz3cw-closure-info/registration console=ttyS0
...
machine# mounting overlay filesystem on /nix/store...
machine# switch_root: can't execute '/nix/store/xsimkh58w0k6jmn6jfxf5arsk9xkdhk0-nixos-system-machine-19.03.git.47aad6e/init': Operation not permitted
machine# [    1.508713] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
machine# [    1.508713]
machine# [    1.510084] CPU: 0 PID: 1 Comm: switch_root Not tainted 4.19.20 #1-NixOS
machine# [    1.511062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
machine# [    1.512691] Call Trace:
machine# [    1.513078]  dump_stack+0x5c/0x7b
machine# [    1.513568]  panic+0xe4/0x242
machine# [    1.514020]  do_exit+0xb52/0xb60
machine# [    1.514499]  ? handle_mm_fault+0xdc/0x210
machine# [    1.515097]  do_group_exit+0x3a/0xa0
machine# [    1.515622]  __x64_sys_exit_group+0x14/0x20
machine# [    1.516248]  do_syscall_64+0x4e/0x100
machine# [    1.516782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
machine# [    1.517526] RIP: 0033:0x7f5827cd9646
machine# [    1.518048] Code: Bad RIP value.
machine# [    1.518549] RSP: 002b:00007ffc3718aa98 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
machine# [    1.519685] RAX: ffffffffffffffda RBX: 00007f5827fc25a0 RCX: 00007f5827cd9646
machine# [    1.520726] RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001
machine# [    1.521771] RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffff80
machine# [    1.522807] R10: 0000000000000073 R11: 0000000000000246 R12: 00007f5827fc25a0
machine# [    1.523845] R13: 0000000000000001 R14: 00007f5827fcb488 R15: 0000000000000000
machine# [    1.525021] Kernel Offset: 0xbe00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
machine# [    1.526585] Rebooting in 1 seconds..

:(

@NeQuissimus
Copy link
Member

WTF

@samueldr
Copy link
Member

samueldr commented Feb 7, 2019

In case you didn't realise, this is the same initial issue which caused us to add a patch reverting part of the changes to make the tests pass. This is the subject of the thread. So it looks like there were other changes in passing, unrelated to our problem.

@bachp
Copy link
Member Author

bachp commented Feb 7, 2019

Is somebody able to create the setup as it is in during the startup of the nixos test VM in a simplified version. One that would allow to have access to the shell right befor the switch_root get's executed?

@tbenst
Copy link
Contributor

tbenst commented Feb 24, 2019

For a new nixos-install of 18.09 I’m seeing something similar, with the main difference that except after switch_root it says “No such file or directory”...is this the same issue?

@samueldr
Copy link
Member

samueldr commented Feb 24, 2019

@tbenst this particular issue shouldn't affect 18.09, and what you are experiencing shouldn't be this issue considering (1) the faulty revert patch was reverted (2) the current issue affects the testing infra. (To be fair, your comment could be about the testing infra, it's unspecified if it's your system boot or a VM.)

Though, I'm not downplaying whatever issue you're facing! In fact, let's try and figure out what's going on. Ideally, open an issue with the following information (and mention me and this issue in the body).

  • Do previous generations boot fine?
  • This was after a fresh update of 18.09, right?
  • Which kernel is your system configured to use? (nix eval --raw '(import <nixos/nixos> {}).options.boot.kernelPackages.value.kernel')
  • What's the output of nix-info?

@tbenst
Copy link
Contributor

tbenst commented Feb 26, 2019

Thanks, @samueldr for the very kind comment! I figured out the issue thankfully--as soon as I enabled systemd-boot everything worked.

@aszlig
Copy link
Member

aszlig commented Mar 14, 2019

Just to keep everyone outside of #nixos-dev up to date, we do have a fix for this. Here is the upstream post and patch: https://www.spinics.net/lists/linux-unionfs/msg06733.html

@aszlig aszlig closed this as completed in 4c1ddb3 Mar 18, 2019
aszlig added a commit that referenced this issue Mar 18, 2019
In Linux 4.19 there has been a major rework of the overlayfs
implementation and it now opens files in lowerdir with O_NOATIME, which
in turn caused issues in our VM tests because the process owner of QEMU
doesn't match the file owner of the lowerdir.

The crux here is that 9p propagates the O_NOATIME flag to the host and
the guest kernel has no way of verifying whether that flag will lead to
any problems beforehand.

There is ongoing work to possibly fix this in the kernel, but it will
take a while until there is a working patch and consensus.

So in order to bring our default kernel back to 4.19 and of course make
it possible to run newer kernels in VM tests, I'm merging a small QEMU
patch as an interim solution, which we can drop once we have a working
fix in the next round of stable kernels.

Now we already had Linux 4.19 set as the default kernel, but that was
subsequently reverted in 048c36c
because the patch we have used was the revert of the commit I bisected a
while ago.

This patch broke overlayfs in other ways, so I'm also merging in a VM
test by @bachp, which only tests whether overlayfs is working, just to
be on the safe side that something like this won't happen in the future.

Even though this change could be considered a moderate mass-rebuild at
least for GNU/Linux, I'm merging this to master, mainly to give us some
time to get it into the current 19.03 release branch (and subsequent
testing window) once we got no new breaking builds from Hydra.

Cc: @samueldr, @lheckemann

Fixes: #54509
Fixes: #48828
Merges: #57641
Merges: #54508
aszlig added a commit that referenced this issue Mar 21, 2019
In Linux 4.19 there has been a major rework of the overlayfs
implementation and it now opens files in lowerdir with O_NOATIME, which
in turn caused issues in our VM tests because the process owner of QEMU
doesn't match the file owner of the lowerdir.

The crux here is that 9p propagates the O_NOATIME flag to the host and
the guest kernel has no way of verifying whether that flag will lead to
any problems beforehand.

There is ongoing work to possibly fix this in the kernel, but it will
take a while until there is a working patch and consensus.

So in order to bring our default kernel back to 4.19 and of course make
it possible to run newer kernels in VM tests, I'm merging a small QEMU
patch as an interim solution, which we can drop once we have a working
fix in the next round of stable kernels.

Now we already had Linux 4.19 set as the default kernel, but that was
subsequently reverted in 048c36c
because the patch we have used was the revert of the commit I bisected a
while ago.

This patch broke overlayfs in other ways, so I'm also merging in a VM
test by @bachp, which only tests whether overlayfs is working, just to
be on the safe side that something like this won't happen in the future.

Even though this change could be considered a moderate mass-rebuild at
least for GNU/Linux, I'm merging this to master, mainly to give us some
time to get it into the current 19.03 release branch (and subsequent
testing window) once we got no new breaking builds from Hydra.

Cc: @samueldr, @lheckemann

Fixes: #54509
Fixes: #48828
Merges: #57641
Merges: #54508
(cherry picked from commit 12efcc2)
qemu-deploy pushed a commit to qemu/qemu that referenced this issue May 14, 2020
QEMU's local 9pfs server passes through O_NOATIME from the client. If
the QEMU process doesn't have permissions to use O_NOATIME (namely, it
does not own the file nor have the CAP_FOWNER capability), the open will
fail. This causes issues when from the client's point of view, it
believes it has permissions to use O_NOATIME (e.g., a process running as
root in the virtual machine). Additionally, overlayfs on Linux opens
files on the lower layer using O_NOATIME, so in this case a 9pfs mount
can't be used as a lower layer for overlayfs (cf.
https://github.com/osandov/drgn/blob/dabfe1971951701da13863dbe6d8a1d172ad9650/vmtest/onoatimehack.c
and NixOS/nixpkgs#54509).

Luckily, O_NOATIME is effectively a hint, and is often ignored by, e.g.,
network filesystems. open(2) notes that O_NOATIME "may not be effective
on all filesystems. One example is NFS, where the server maintains the
access time." This means that we can honor it when possible but fall
back to ignoring it.

Acked-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Message-Id: <e9bee604e8df528584693a4ec474ded6295ce8ad.1587149256.git.osandov@fb.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
mdroth pushed a commit to mdroth/qemu that referenced this issue Jun 16, 2020
QEMU's local 9pfs server passes through O_NOATIME from the client. If
the QEMU process doesn't have permissions to use O_NOATIME (namely, it
does not own the file nor have the CAP_FOWNER capability), the open will
fail. This causes issues when from the client's point of view, it
believes it has permissions to use O_NOATIME (e.g., a process running as
root in the virtual machine). Additionally, overlayfs on Linux opens
files on the lower layer using O_NOATIME, so in this case a 9pfs mount
can't be used as a lower layer for overlayfs (cf.
https://github.com/osandov/drgn/blob/dabfe1971951701da13863dbe6d8a1d172ad9650/vmtest/onoatimehack.c
and NixOS/nixpkgs#54509).

Luckily, O_NOATIME is effectively a hint, and is often ignored by, e.g.,
network filesystems. open(2) notes that O_NOATIME "may not be effective
on all filesystems. One example is NFS, where the server maintains the
access time." This means that we can honor it when possible but fall
back to ignoring it.

Acked-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Message-Id: <e9bee604e8df528584693a4ec474ded6295ce8ad.1587149256.git.osandov@fb.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
(cherry picked from commit a5804fc)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
mdroth pushed a commit to mdroth/qemu that referenced this issue Sep 2, 2020
QEMU's local 9pfs server passes through O_NOATIME from the client. If
the QEMU process doesn't have permissions to use O_NOATIME (namely, it
does not own the file nor have the CAP_FOWNER capability), the open will
fail. This causes issues when from the client's point of view, it
believes it has permissions to use O_NOATIME (e.g., a process running as
root in the virtual machine). Additionally, overlayfs on Linux opens
files on the lower layer using O_NOATIME, so in this case a 9pfs mount
can't be used as a lower layer for overlayfs (cf.
https://github.com/osandov/drgn/blob/dabfe1971951701da13863dbe6d8a1d172ad9650/vmtest/onoatimehack.c
and NixOS/nixpkgs#54509).

Luckily, O_NOATIME is effectively a hint, and is often ignored by, e.g.,
network filesystems. open(2) notes that O_NOATIME "may not be effective
on all filesystems. One example is NFS, where the server maintains the
access time." This means that we can honor it when possible but fall
back to ignoring it.

Acked-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Message-Id: <e9bee604e8df528584693a4ec474ded6295ce8ad.1587149256.git.osandov@fb.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
(cherry picked from commit a5804fc)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants