Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nixos/systemd|filesystems: mount and evacuate /sys/fs/pstore using systemd-pstore #85073

Merged
merged 2 commits into from May 16, 2021

Conversation

hyperfekt
Copy link
Contributor

@hyperfekt hyperfekt commented Apr 12, 2020

Motivation for this change

Currently limited size pstores eventually fill up and logs of any panics stop being saved, making it harder to diagnose them. This enables the systemd-pstore service, which frees them automatically and adds them to the journal.

Design

We can't mount /sys/fs/pstore with the other special file systems, because at that time the kernel hasn't loaded the pstore module yet.
systemd cannot mount virtual file systems that expose APIs, so we can't use a mount unit.
Instead we just create a very simple systemd service that does the job.

On systems without pstore I expect the modprobe service to fail, which doesn't look nice in the journal, but I'm not aware of a good alternative.

Left for another time is making usage of the EFI variable pstore optional, which it currently is not even without this PR (this one just adds the vacuuming instead of only dumping into it).

Things done
  • Tested using sandboxing (nix.useSandbox on NixOS, or option sandbox in nix.conf on non-NixOS linux)
  • Built on platform(s)
    • NixOS
    • macOS
    • other Linux distributions
  • Tested via one or more NixOS test(s) if existing and applicable for the change (look inside nixos/tests)
  • Tested compilation of all pkgs that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review wip"
  • Tested execution of all binary files (usually in ./result/bin/)
  • Determined the impact on package closure size (by running nix path-info -S before and after)
  • Ensured that relevant documentation is up to date
  • Fits CONTRIBUTING.md.

@hyperfekt

This comment has been minimized.

@hyperfekt hyperfekt changed the title nixos/systemd: enable systemd-pstore service nixos/filesystems: mount and evacuate /sys/fs/pstore using systemd-pstore Apr 12, 2020
@hyperfekt hyperfekt marked this pull request as ready for review April 12, 2020 20:55
@hyperfekt
Copy link
Contributor Author

hyperfekt commented May 27, 2020

For some reason the dmsg's cleaned up by this are not showing up in the journal but only in /var/lib/systemd/pstore; which is the one configuration that's supposedly impossible according to the man page.
My current assumption is that upstream is broken somehow, and be it only in not reporting an error when it fails to write to the journal.

@stale
Copy link

stale bot commented Nov 24, 2020

I marked this as stale due to inactivity. → More info

@stale stale bot added the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Nov 24, 2020
@stale stale bot removed the 2.status: stale https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md label Dec 1, 2020
@hyperfekt hyperfekt force-pushed the systemd-pstore branch 2 times, most recently from 76c692d to 86ef73b Compare December 1, 2020 23:03
@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

After bisecting my system configuration it does look like this PR broke something. The error messages I get from switching are:

warning: the following units failed: mount-pstore.service

● mount-pstore.service
     Loaded: loaded (/nix/store/...-unit-mount-pstore.service/mount-pstore.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2021-05-18 08:48:56 EDT; 1s ago
    Process: 40869 ExecStart=/nix/store/...-util-linux-2.36.2-bin/bin/mount -t pstore -o nosuid,noexec,nodev pstore /sys/fs/pstore (code=exited, status=32)
   Main PID: 40869 (code=exited, status=32)
         IP: 0B in, 0B out
        CPU: 1ms

May 18 08:48:56 albatross systemd[1]: Starting mount-pstore.service...
May 18 08:48:56 albatross mount[40869]: mount: /sys/fs/pstore: pstore already mounted on /sys/fs/pstore.
May 18 08:48:56 albatross systemd[1]: mount-pstore.service: Main process exited, code=exited, status=32/n/a
May 18 08:48:56 albatross systemd[1]: mount-pstore.service: Failed with result 'exit-code'.
May 18 08:48:56 albatross systemd[1]: Failed to start mount-pstore.service.

@hyperfekt
Copy link
Contributor Author

hyperfekt commented May 18, 2021

You have a service mounting the persistent store already? Is it to the same place or somewhere else? You can simply set systemd.services.mount-pstore.enable = false to prevent that failure (which doesn't affect anything else - the only thing 'broken' or 'failing' is that very service).

Alternatively you can disable whatever is mounting the persistent store (if feasible, curious what that is) and leave it to this service.

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

Let me see if I can figure out what is mounting pstore. I wasn't even aware of pstore until this morning!

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

Interesting, it doesn't look like anything depends on it, at least from the systemd perspective:

❯ sudo systemctl list-dependencies --reverse mount-pstore.service
mount-pstore.service

@hyperfekt
Copy link
Contributor Author

The only other thing that could be going wrong is that the ExecStartPost stanza is failing (it really shouldn't be able to!) and what you're seeing is the effect of systemd trying to restart the service. Can you make sure you get us logs from all attempts to start the service?

@hyperfekt
Copy link
Contributor Author

it doesn't look like anything depends on it, at least from the systemd perspective

That really shouldn't be. If nothing depends on it, how did it get started??

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

@hyperfekt Is there a way to use nix why-depends here to discover dependents of mount-pstore.service?

@cpcloud
Copy link
Contributor

cpcloud commented May 18, 2021

Ok, after setting

systemd.services.mount-pstore.enable = false

and deploying, and then setting it back to true (and deploying) I cannot reproduce this anymore.

A couple of things I noticed:

  1. The permissions of the existing /sys/fs/pstore mount were different from the permissions of the same mount after deploying from this PR's commit. The original permissions were: drwxr-xr-x, while the new permissions are drwxr-x---.
  2. The result of the command I ran above to show dependents of mount-pstore.service is different upon deploying from this commit:
❯ sudo systemctl list-dependencies --reverse mount-pstore.service
mount-pstore.service
● └─systemd-pstore.service

@wizardwatch
Copy link
Contributor

I also experience the same bug where mount-pstore fails to start. I have not attempted to fix it yet.

@zhaofengli
Copy link
Member

I encountered the same problem as well on my machines. If I'm understanding correctly, it appears that systemd mounts pstore automatically.

@ymatsiuk
Copy link
Contributor

Seems like systemd added support for pstore few months after this PR has been submitted. @hyperfekt do you still think this PR is required?

iagocq added a commit to iagocq/nixos-config that referenced this pull request May 21, 2021
@superherointj
Copy link
Contributor

superherointj commented May 21, 2021

It affected me on unstable and master. On nixos-rebuild switch:

● mount-pstore.service
     Loaded: loaded (/nix/store/rszfkvs6jjz6xi2016n48aay23a5n75m-unit-mount-pstore.service/mount-pstore.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2021-05-21 07:28:55 -03; 155ms ago
    Process: 77852 ExecStart=/nix/store/5kw5x24rbwv8gp2mgr75vd0i4nkp2zvz-util-linux-2.36.2-bin/bin/mount -t pstore -o nosuid,noexec,nodev pstore /sys/fs/pstore (code=exited, status=32)
   Main PID: 77852 (code=exited, status=32)
         IP: 0B in, 0B out
        CPU: 10ms

mai 21 07:28:55 k1 systemd[1]: Starting mount-pstore.service...
mai 21 07:28:55 k1 mount[77852]: mount: /sys/fs/pstore: pstore already mounted on /sys/fs/pstore.
mai 21 07:28:55 k1 systemd[1]: mount-pstore.service: Main process exited, code=exited, status=32/n/a
mai 21 07:28:55 k1 systemd[1]: mount-pstore.service: Failed with result 'exit-code'.
mai 21 07:28:55 k1 systemd[1]: Failed to start mount-pstore.service.
warning: error(s) occurred while switching to the new configuration

Tried workaround: systemd.services.mount-pstore.enable = false;
It silenced error messages but on reactivation it reappeared.

Shouldn't a new issue be opened to track this? This PR is merged and we have an issue.

ymatsiuk added a commit to ymatsiuk/nixpkgs that referenced this pull request May 21, 2021
@ymatsiuk ymatsiuk mentioned this pull request May 21, 2021
9 tasks
@ymatsiuk
Copy link
Contributor

In case decision is made to revert this, I've created this #123889

@hyperfekt
Copy link
Contributor Author

hyperfekt commented May 21, 2021

I'm currently investigating this.
It definitely isn't because systemd is already mounting pstore, because that line has not just been part of systemd for 8 years, it also doesn't have an effect on NixOS - which is why we have a separate stanza in one of those files for reimplementing that part of systemd in NixOS, from which /sys/fs/pstore is missing.
On my system, /sys/fs/pstore is not mounted without this PR, and I am now upgrading it to unstable to see if something goes wrong specifically during activation (attempting to start it twice maybe?).
If that's not the case, something else must be mounting /sys/fs/pstore and we need to find out what it is, because it's only present for a subset of users. One way to do so would be to start running auditd early in the boot.

@superherointj
Copy link
Contributor

On my system, /sys/fs/pstore is not mounted without this PR, and I am now upgrading it to unstable to see if something goes wrong specifically during activation (attempting to start it twice maybe?).

As your PR targets master, you should be testing it/your system on master.

@hyperfekt
Copy link
Contributor Author

I agree, I probably should have tested the activation itself in a VM instead of relying on the ISO (where no switch occurs) to show any problem.

@ymatsiuk
Copy link
Contributor

I can't reproduce this in the VM (pstore is not present there). And I fixed it just by umount and running rebuild again. Now everything works. So I would expect we could've avoided this in the future with simple check in ExecStartPre that does umount in case it is mounted (for the specific use case when someone/something did mount pstore)

@hyperfekt
Copy link
Contributor Author

hyperfekt commented May 21, 2021

It's been common in the past for some units to fail upon activation after a nixos-rebuild switch, which is why I wasn't concerned about avoiding every instance of it. Of course it shouldn't be happening to every user of NixOS, which it appears this does.
I've tried this out and while I can reproduce it happening when switching, it also isn't mounted without the service, and I can't see any hint as to why it might already be. On a subsequent switch that does nothing but enable the unit, no such failure is observed. Unfortunately I can't seem to get auditd to pick up the mount syscalls, so I'm currently a bit stumped as to why it happens. I will however open a PR to introduce a condition that checks if the persistent store is already mounted at that location, so noone else who upgrades in the future sees the associated noise (and has to wonder whether their system is now broken).

@zhaofengli
Copy link
Member

For me, the failure also occurs on boot, so it's not just happening on a switch for "people with existing /sys/fs/pstore".

@superherointj
Copy link
Contributor

Wouldn't be better to reverse it? So you can calmly investigate it?

@hyperfekt
Copy link
Contributor Author

hyperfekt commented May 21, 2021

I don't see any reason to revert it if the issue is solved by the new PR. In general no breakage is involved, just log noise. And if there is a solution necessary as a result of the investigation, it would be additive, not alternative.

@superherointj
Copy link
Contributor

On the workaround that umount in case it is mounted, it seems to be dealing with symptoms rather than root cause.

@hyperfekt
Copy link
Contributor Author

If you check #123902 you will see that's not what I did.

@hyperfekt
Copy link
Contributor Author

For me, the failure also occurs on boot

@zhaofengli If you find out what is responsible for having /sys/fs/pstore mounted the first time, I'd be glad. I am still curious about any instance where this is the case and it's from a package instead of an individual modification.

@zhaofengli
Copy link
Member

Finally have some time to poke at this. It appears that /sys/fs/pstore will automatically be mounted if pstore is built into the kernel (CONFIG_PSTORE=y) or it's loaded prior to /sys being mounted the first time (in stage 1, boot.kernelModules are loaded after the special filesystems are mounted).

The machine I initially encountered the on-boot failure is an aarch64 host running a custom kernel with CONFIG_PSTORE=y. With mount-pstore disabled:

[root@serotina:~]# uname -a
Linux serotina 5.12.5 #1-NixOS SMP Tue Jan 1 00:00:00 UTC 1980 aarch64 GNU/Linux

[root@serotina:~]# systemctl status mount-pstore
● mount-pstore.service
     Loaded: masked (Reason: Unit mount-pstore.service is masked.)
     Active: inactive (dead)

[root@serotina:~]# mount | grep pstore
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)

[root@serotina:~]# zcat /proc/config.gz | grep CONFIG_PSTORE=
CONFIG_PSTORE=y

On the regular kernel, pstore is built as a module:

[root@malus:~]# uname -a
Linux malus 5.10.37 #1-NixOS SMP Fri May 14 07:50:46 UTC 2021 x86_64 GNU/Linux

[root@malus:~]# systemctl status mount-pstore
● mount-pstore.service
     Loaded: masked (Reason: Unit mount-pstore.service is masked.)
     Active: inactive (dead)

[root@malus:~]# mount | grep pstore

[root@malus:~]# zcat /proc/config.gz | grep CONFIG_PSTORE=
CONFIG_PSTORE=m

[root@malus:~]# lsmod | grep pstore
efi_pstore             16384  0
pstore                 28672  1 efi_pstore

On a machine with CONFIG_PSTORE=y, I used an audit rule to trace all mount syscalls:

Audit configurations
{
  boot.kernelParams = [ "audit=1" "audit_backlog_limit=500" ];
  security.audit = {
    enable = true;
    rules = [
      "-a exit,always -S mount"
    ];
  };
  security.auditd.enable = true;
  # Add rule before systemd starts
  boot.initrd.postDeviceCommands = ''
    ${pkgs.audit}/bin/auditctl -a exit,always -S mount
  '';
}

Looking at the audit logs I couldn't find a call that specifically mounted /sys/fs/pstore so I don't think it's coming from userspace. If I modify stage 1 and nixos/modules/tasks/filesystems.nix to mount /sys after boot.kernelModules are loaded, /sys/fs/pstore is automatically mounted.

However, this doesn't explain why the initial activation will cause the failure. Remounting /sys after pstore is loaded doesn't appear to automatically mount /sys/fs/pstore, so we're missing something here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants