MacOS /nix unmount when reboot. /nix ownership change to root #4640

Open
OliverKoo opened this issue Mar 15, 2021 · 24 comments
@OliverKoo

OliverKoo commented Mar 15, 2021

Describe the bug

On an AWS EC2 Mac instance running Catalina 10.15.7, I installed Nix with the recommended approach:

sh <(curl -L https://nixos.org/nix/install) --darwin-use-unencrypted-nix-store-volume

This works great. You can see that /nix is mounted:

ec2-user@ip-10-249-8-250 ~ % nix --version
nix (Nix) 2.3.10
ec2-user@ip-10-249-8-250 ~ % diskutil apfs list 
    +-> Volume disk2s6 7420B953-17CE-4369-B12E-7910CB17CE7A
        ---------------------------------------------------
        APFS Volume Disk (Role):   disk2s6 (No specific role)
        Name:                      Nix Store (Case-insensitive)
        Mount Point:               /nix
        Capacity Consumed:         329789440 B (329.8 MB)
        FileVault:                 No

and /nix is owned by ec2-user:

ec2-user@ip-10-249-8-250 ~ % ls -la /nix
total 0
drwxrwxr-x   5 ec2-user  staff   160 Mar 15 17:53 .
drwxr-xr-x  22 root      wheel   704 Feb 10 01:37 ..
drwx------  34 ec2-user  staff  1088 Mar 15 17:53 .fseventsd
drwxr-xr-x  59 ec2-user  staff  1888 Mar 15 17:53 store
drwxr-xr-x   4 ec2-user  staff   128 Mar 15 17:53 var
ec2-user@ip-10-249-8-250 ~ % nix --version
nix (Nix) 2.3.10
---

Problem
However, when I reboot, the Nix volume doesn't auto-mount (maybe /etc/fstab is no longer used by Catalina?)
and /nix is now owned by root:

ec2-user@ip-10-249-8-250 ~ % diskutil apfs list
APFS Container (1 found)
|
+-- Container disk2 7867D1D1-A318-4F69-BE7A-2C9DEF37A5BC
    ====================================================
    APFS Container Reference:     disk2
    Size (Capacity Ceiling):      274668150784 B (274.7 GB)
    Capacity In Use By Volumes:   38961782784 B (39.0 GB) (14.2% used)
    Capacity Not Allocated:       235706368000 B (235.7 GB) (85.8% free)
    |
    +-< Physical Store disk1s2 7E102836-D259-4625-A9AB-A33559D758B9
    |   -----------------------------------------------------------
    |   APFS Physical Store Disk:   disk1s2
    |   Size:                       274668150784 B (274.7 GB)
    |
    +-> Volume disk2s1 047551A9-1611-4846-90E4-DF0B2D32BDFA
    |   ---------------------------------------------------
    |   APFS Volume Disk (Role):   disk2s1 (Data)
    |   Name:                      Macintosh HD - Data (Case-insensitive)
    |   Mount Point:               /System/Volumes/Data
    |   Capacity Consumed:         24701288448 B (24.7 GB)
    |   FileVault:                 No
    |
    +-> Volume disk2s2 60805369-595C-484A-AA04-A6FD1B1C133E
    |   ---------------------------------------------------
    |   APFS Volume Disk (Role):   disk2s2 (Preboot)
    |   Name:                      Preboot (Case-insensitive)
    |   Mount Point:               Not Mounted
    |   Capacity Consumed:         79278080 B (79.3 MB)
    |   FileVault:                 No
    |
    +-> Volume disk2s3 CA71C970-9205-4BD0-8580-57EC6277A512
    |   ---------------------------------------------------
    |   APFS Volume Disk (Role):   disk2s3 (Recovery)
    |   Name:                      Recovery (Case-insensitive)
    |   Mount Point:               Not Mounted
    |   Capacity Consumed:         528957440 B (529.0 MB)
    |   FileVault:                 No
    |
    +-> Volume disk2s4 DDC89C2A-7772-45DE-B74E-CD6570BCEB30
    |   ---------------------------------------------------
    |   APFS Volume Disk (Role):   disk2s4 (VM)
    |   Name:                      VM (Case-insensitive)
    |   Mount Point:               /private/var/vm
    |   Capacity Consumed:         2147504128 B (2.1 GB)
    |   FileVault:                 No
    |
    +-> Volume disk2s5 72E57EA3-53F3-4AA0-8C1F-375C722C86B4
    |   ---------------------------------------------------
    |   APFS Volume Disk (Role):   disk2s5 (System)
    |   Name:                      Macintosh HD (Case-insensitive)
    |   Mount Point:               /
    |   Capacity Consumed:         11034324992 B (11.0 GB)
    |   FileVault:                 No
    |
    +-> Volume disk2s6 7420B953-17CE-4369-B12E-7910CB17CE7A
        ---------------------------------------------------
        APFS Volume Disk (Role):   disk2s6 (No specific role)
        Name:                      Nix Store (Case-insensitive)
        Mount Point:               Not Mounted
        Capacity Consumed:         329789440 B (329.8 MB)
        FileVault:                 No
ec2-user@ip-10-249-8-250 ~ % ls -la /nix
total 0
drwxr-xr-x   2 root  wheel   64 Mar 15 18:08 .
drwxr-xr-x  22 root  wheel  704 Feb 10 01:37 ..
ec2-user@ip-10-249-8-250 ~ % mount -a
mount_apfs: volume could not be mounted: Operation not permitted
mount: / failed with 77
mount_apfs: volume could not be mounted: Operation not permitted
mount: /nix failed with 77

I can work around it with sudo mount_apfs disk2s6 /nix, but
I am using these Mac EC2 instances for CI, where the process fails with:

/Users/ec2-user/.nix-profile/etc/profile.d/nix.sh: Operation not permitted
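
For a CI host, one way to fail early with a clearer message is to check that something is actually mounted at /nix before sourcing nix.sh. This is a minimal sketch, not part of Nix: `is_mounted_at` is a hypothetical helper name, and it assumes `mount` prints lines shaped like `<device> on <path> (<options>)`, as it does on macOS.

```shell
# Hypothetical helper (not part of Nix or macOS).
# Reads `mount`-style output on stdin and succeeds only if a line
# reports a filesystem mounted exactly at the given path.
is_mounted_at() {
  target="$1"
  while IFS= read -r line; do
    case "$line" in
      *" on $target ("*) return 0 ;;
    esac
  done
  return 1
}

# Possible use in a CI bootstrap step (assumed layout, adjust as needed):
#   if mount | is_mounted_at /nix; then
#     . "$HOME/.nix-profile/etc/profile.d/nix.sh"
#   else
#     echo "Nix Store volume is not mounted at /nix" >&2
#   fi
```

The match includes the opening parenthesis so that a mount at a longer path such as /nix/something is not mistaken for /nix itself.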

Steps To Reproduce

described above

Expected behavior

The Nix volume is mounted at boot, and /nix is owned by the user who ran the install script.
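
The ownership half of that expectation can be checked from a script. A small sketch, assuming only `ls`, `awk`, and `id`; `owner_of` and `owned_by_current_user` are hypothetical helper names, and the owner is read from `ls -ld` to avoid the `stat` flag differences between macOS and Linux.

```shell
# Hypothetical helper: prints the owner of a path as shown in the
# third column of `ls -ld`.
owner_of() {
  ls -ld "$1" | awk '{ print $3 }'
}

# Succeeds only if the path is owned by the user running this script.
owned_by_current_user() {
  [ "$(owner_of "$1")" = "$(id -un)" ]
}

# Possible use after boot:
#   owned_by_current_user /nix || echo "/nix is owned by $(owner_of /nix)" >&2
```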

nix-env --version output

nix (Nix) 2.3.10

Additional context

I am running these on AWS EC2 mac1.metal instances.

@OliverKoo OliverKoo added the bug label Mar 15, 2021
@abathur
Member

abathur commented Mar 15, 2021

Can you see what sysadminctl -secureTokenStatus ec2-user says?

(I'll go into a little more detail later...)

@abathur
Member

abathur commented Mar 16, 2021

Edits:

  • Mar 17 2021 - added an IRC log I missed previously.
  • Mar 24 2021 - added IRC logs for troubleshooting sessions w/ Oliver

I don't recall seeing this specific problem before, but we have had at least a few people turn up with nix+macOS+VM issues over the past few months. Nothing solid has come out of these reports yet, in part because everyone works around their immediate problem and forgets or loses interest in debugging it.

My general hunch for a while has been that some cloud providers are doing something ~weird (i.e., that desktop users don't do) when they set up their VMs, and that this is what causes the trouble.

I tried Nix out in a VM several weeks back while I was suffering through trying to debug an unrelated issue that required repeatedly updating macOS. While it was a complete pain, I didn't run into any of the issues described so far. But, in the process of researching that, I did stumble on this: https://mrmacintosh.com/securetoken-documentation/

From that description, it's at least plausible that they're setting up accounts without SecureToken, and that this causes the trouble. But, we still need to validate that thesis, find a fix, and figure out if it's practical for us to fix it automagically on install or if it's the sort of thing we'll just have to test for and complain about.

Since there's nothing solid to point to, I'll collect some links to existing discussions about this:

For completeness, here are IRC logs covering this troubleshooting attempt:

However, when I reboot, the Nix volume doesn't auto-mount (maybe /etc/fstab is no longer used by Catalina?)
and /nix is now owned by root:

fstab still works fine in Catalina (and in Big Sur). Your /nix is probably owned by root because nothing is mounted there. AFAIK, any mount point described by /etc/synthetic.conf will be owned by root until/unless some other user successfully mounts something over it.
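
To see what fstab actually says about the mount point, the relevant line can be pulled out by matching on its second field. A small sketch, assuming the usual fstab column layout (device, mount point, type, options); `fstab_entry_for` is a hypothetical helper, not a system tool.

```shell
# Hypothetical helper: reads fstab-formatted text on stdin and prints
# every non-comment line whose second field is the given mount point.
fstab_entry_for() {
  awk -v mp="$1" '!/^[[:space:]]*#/ && $2 == mp { print }'
}

# Possible use:
#   fstab_entry_for /nix < /etc/fstab
```

An empty result would mean there is no entry for /nix at all, which is a different problem from an entry that exists but fails to mount.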

@OliverKoo
Author

OliverKoo commented Mar 16, 2021

@abathur thanks for getting back to me so quickly.
This is what I get:

ec2-user@ip-10-249-9-18 ~ % sysadminctl -secureTokenStatus ec2-user
2021-03-16 04:19:36.035 sysadminctl[2351:21074] Secure token is DISABLED for user ec2-user

I am not very familiar with Nix, or with macOS in general. I am using it now for my new job.
But I am happy to test drive or debug whatever you need and then document the process.

Correction to my initial report: when I said "reboot", I actually meant starting a brand new mac1.metal instance from the AMI (Amazon Machine Image) created from the first machine. Basically, I am using Packer to provision Nix onto a machine that is later baked into an AMI.

@abathur
Member

abathur commented Mar 16, 2021

Do you have a "password" for that account? With the caveat that I don't really know what we're doing here, I think you can enable this with something like sysadminctl -secureTokenOn ec2-user -password interactive, at which point it should prompt you for it.

(I'm basing this on sysadminctl --help, but the usage isn't terribly clear.)

If sysadminctl -secureTokenStatus confirms that it is enabled afterwards, I'm curious if the store will mount on reboot without any further changes.
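
For scripted checks (for example in an image build), the status word can be extracted from the log line format shown earlier in this thread. This is a sketch only: `secure_token_state` is a hypothetical helper, and it assumes the line keeps the form "Secure token is ENABLED|DISABLED for user NAME".

```shell
# Hypothetical helper: reads sysadminctl output on stdin and prints the
# first status word found (ENABLED or DISABLED), or nothing if no line
# matches the expected format.
secure_token_state() {
  sed -n 's/.*Secure token is \([A-Z]*\) for user.*/\1/p' | head -n 1
}

# Possible use. The 2>&1 is an assumption that sysadminctl writes this
# line to stderr rather than stdout:
#   sysadminctl -secureTokenStatus ec2-user 2>&1 | secure_token_state
```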

@OliverKoo
Author

OliverKoo commented Mar 16, 2021

So, after enabling the secure token

(run as root)

sh-3.2# sudo sysadminctl -secureTokenOn ec2-user -password - -adminUser ec2-user -adminPassword -
Enter password for ec2-user :
Enter password for ec2-user :
2021-03-16 16:00:02.676 sysadminctl[42542:216568] - Done!
sh-3.2# sysadminctl interactive -secureTokenStatus ec2-user
2021-03-16 16:00:18.307 sysadminctl[42590:216908] Secure token is ENABLED for user ec2-user

After a reboot, the Nix volume is mounted. (I'm not sure whether this will work with the AMI process, though, i.e. creating an image that already has the secure token generated.)

Now I am seeing a different issue on the build machine.

dyld: Library not loaded: /nix/store/i1cg0wfns9j4lmfzvx5dz6rc436vs6ms-libsodium-1.0.18/lib/libsodium.23.dylib
  Referenced from: /Users/ec2-user/.nix-profile/bin/nix
  Reason: no suitable image found.  Did find:
	file system sandbox blocked open() of '/nix/store/i1cg0wfns9j4lmfzvx5dz6rc436vs6ms-libsodium-1.0.18/lib/libsodium.23.dylib'
	/nix/store/i1cg0wfns9j4lmfzvx5dz6rc436vs6ms-libsodium-1.0.18/lib/libsodium.23.dylib: stat() failed with errno=1
	file system sandbox blocked open() of '/nix/store/i1cg0wfns9j4lmfzvx5dz6rc436vs6ms-libsodium-1.0.18/lib/libsodium.23.dylib'
/bin/bash: line 1:  1188 Abort trap: 6           nix --version

I am running the build agent as a daemon via a plist in /Library/LaunchDaemons. The build agent is launched when my mac1.metal EC2 machine boots.

If I log in and run the build agent locally, though, then nix --version works as expected.

(I tried this solution of setting sandbox and extra paths in my nix.conf.)

@abathur
Member

abathur commented Mar 16, 2021

Progress! This latter issue is at least something others have reported.

I'm curious about the build agent and what you're using it for here? There's also a --daemon install of Nix (which will likely become the only supported install after #4289) that'll run as root and use nixbld users for builds, but maybe you've already ruled it out in your situation?

@abathur
Member

abathur commented Mar 16, 2021

If you know you need a distinct build agent, I'm curious how your launchdaemon differs from the one a daemon install would use, and whether those differences matter here:

https://github.com/NixOS/nix/blob/master/misc/launchd/org.nixos.nix-daemon.plist.in

If you want something less async, we can also talk in #nix-darwin on IRC

@OliverKoo
Author

OliverKoo commented Mar 16, 2021

I am using Buildkite, so the buildkite-agent is the one invoking nix. I basically set all the Nix environment variables in the plist so the agent shell has the Nix context (the equivalent of doing . /Users/buildkite-agent/.nix-profile/etc/profile.d/nix.sh).

(I have been using ec2-user in our conversation so far to simplify the discussion, since ec2-user is the default user, but it seems the Nix issue I am seeing is not system-wide. Something is off with Nix when I run the buildkite-agent daemon.) You can replace every instance of ec2-user above with buildkite-agent. I installed and ran Nix as buildkite-agent.

This is the buildkite-agent plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!--
  A launchd config for loading buildkite-agent on system boot on OS X
  systems, and runs in GUI mode (which allows Xcode UI testing but requires
  the user to login)
-->
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>com.buildkite.buildkite-agent</string>

    <key>ProgramArguments</key>
    <array>
      <string>/usr/local/bin/buildkite-agent</string>
      <string>start</string>
    </array>

    <key>KeepAlive</key>
    <dict>
      <key>SuccessfulExit</key>
      <false/>
    </dict>

    <key>RunAtLoad</key>
    <true/>

    <key>ProcessType</key>
    <string>Interactive</string>

    <key>UserName</key>
    <string>buildkite-agent</string>

    <key>ThrottleInterval</key>
    <integer>30</integer>

    <key>StandardOutPath</key>
    <string>/usr/local/var/log/buildkite-agent.log</string>

    <key>StandardErrorPath</key>
    <string>/usr/local/var/log/buildkite-agent.error.log</string>

    <key>EnvironmentVariables</key>
    <dict>
      <key>PATH</key>
      <string>/Users/buildkite-agent/.nix-profile/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin</string>

      <key>HOME</key>
      <string>/Users/buildkite-agent</string>

      <key>BUILDKITE_AGENT_CONFIG</key>
      <string>/usr/local/etc/buildkite-agent/buildkite-agent.cfg</string>

      <key>USER</key>
      <string>buildkite-agent</string>

      <key>AWS_REGION</key>
      <string>us-east-1</string>

      <key>NIX_PATH</key>
      <string>/Users/buildkite-agent/.nix-defexpr/channels</string>

      <key>NIX_PROFILES</key>
      <string>/nix/var/nix/profiles/default /Users/buildkite-agent/.nix-profile</string>

      <key>NIX_SSL_CERT_FILE</key>
      <string>/Users/buildkite-agent/.nix-profile/etc/ssl/certs/ca-bundle.crt</string>
    </dict>
  </dict>
</plist>

If I launch the daemon with sudo launchctl load -w /Library/LaunchDaemons/com.buildkite.buildkite-agent.plist, then I see the dyld: Library not loaded error posted above when building,

but I can build successfully with Nix if I invoke the agent directly after SSHing into the machine:
sudo su - buildkite-agent, then /usr/local/bin/buildkite-agent start


I will also give the multi user install a shot and report back


If you want something less async, we can also talk in #nix-darwin on IRC

That would be great. Can you give me directions or a link to the chat?

@abathur
Member

abathur commented Mar 16, 2021

If you already have an IRC client, you can find us on freenode. If you don't, I gather you can use the webchat via https://webchat.freenode.net/#nix-darwin

@OliverKoo
Author

After setting BUILDKITE_SHELL to use /bin/sh, the dyld error went away.

Now I am seeing:


$ trap 'kill -- $$' INT TERM QUIT; nix --version
  | /bin/sh: trap 'kill -- $$' INT TERM QUIT; nix --version: No such file or directory
  | 🚨 Error: The command exited with status 127
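
One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: if the agent appends the command string directly to whatever BUILDKITE_SHELL contains, then a value of /bin/sh with no -c makes sh treat the whole string as the name of a script file. The sketch below reproduces that difference with plain sh, without involving Buildkite at all; `show_shell_arg_difference` is a hypothetical name.

```shell
# Demonstrates how sh interprets its first argument with and without -c.
show_shell_arg_difference() {
  # Without -c, sh looks for a script file literally named "echo hello".
  sh 'echo hello' 2>/dev/null || echo "without -c: failed with status $?"
  # With -c, the same string is run as a command.
  sh -c 'echo hello'
}

show_shell_arg_difference
```

If that is what is happening here, a shell value that ends in -c (I believe Buildkite's default is along the lines of /bin/bash -e -c, but please check the agent documentation) would be worth trying.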


@OliverKoo
Author

OliverKoo commented Mar 16, 2021

After setting /bin/sh as the BUILDKITE_SHELL env var, the Nix env vars are somehow not set (they are still in the plist).
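
To find out which variables actually go missing, one option is to dump the environment in both contexts (for example `env > /tmp/login.env` in the SSH session and `env > /tmp/daemon.env` as a build step) and compare the dumps. A minimal sketch; the file names and the `env_only_in_first` helper are hypothetical.

```shell
# Hypothetical helper: prints NAME=value lines that appear in the first
# env dump but not (or not with the same value) in the second.
env_only_in_first() {
  a_sorted=$(mktemp) || return 1
  b_sorted=$(mktemp) || return 1
  sort "$1" > "$a_sorted"
  sort "$2" > "$b_sorted"
  # comm -23 keeps only the lines unique to the first sorted file.
  comm -23 "$a_sorted" "$b_sorted"
  rm -f "$a_sorted" "$b_sorted"
}

# Possible use:
#   env_only_in_first /tmp/login.env /tmp/daemon.env
```

Swapping the two arguments shows the opposite direction, i.e. what the daemon shell has that the login shell does not.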

@OliverKoo
Author

For my personal notes, a summary of yesterday's discussion:

  1. Nix needs /bin/sh to have Full Disk Access. People seem to work around this by using a GUI session to add a security exemption (literally VNC into the desktop, open System Preferences > Security & Privacy > Privacy > Full Disk Access, unlock, and then add /bin/sh). The Buildkite agent also seems to use /bin/sh internally.

  2. The Nix volume not auto-mounting after boot seems to be fixed by generating a secure token for the buildkite-agent user and then rebooting. However, I also found that running

sudo mount_apfs disk2s6 /nix
sudo diskutil enableOwnership /nix
sudo chown -R buildkite-agent /nix

and then rebooting also fixes the issue without a secure token.

  3. Buildkite's bootstrap shell behaves differently when run by launchctl than when run locally by a user. The Nix context seems not to be fully inherited in the daemon shell.
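
The manual workaround in item 2 can be wrapped in a small script with a dry-run switch, so it can be reviewed before it touches the machine. This is only a sketch of the three commands quoted above: `run`, `remount_nix_store`, and `DRY_RUN` are hypothetical names, and the device identifier (disk2s6 in this thread) is passed in because it may differ between machines.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN=1, otherwise runs it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Mounts the Nix Store volume, enables ownership on it, and hands /nix
# to the given user. Stops at the first command that fails.
remount_nix_store() {
  device="$1"
  owner="$2"
  run sudo mount_apfs "$device" /nix &&
    run sudo diskutil enableOwnership /nix &&
    run sudo chown -R "$owner" /nix
}

# Possible use (prints the commands without running them):
#   DRY_RUN=1; remount_nix_store disk2s6 buildkite-agent
```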

@abathur
Member

abathur commented Mar 17, 2021

  1. The Nix volume not auto-mounting after boot seems to be fixed by generating a secure token for the buildkite-agent user and then rebooting. However, I also found that running
sudo mount_apfs disk2s6 /nix
sudo diskutil enableOwnership /nix
sudo chown -R buildkite-agent /nix

and then rebooting also fixes the issue without a secure token.

Interesting. I do have some comments in my installer PR about enableOwnership.

Have you re-tried with my hosted installer, or is this still with the official one? I'm curious what the nix volume line in /etc/fstab says.

Will also ping you on IRC.

@OliverKoo
Author

Summary of yesterday's IRC chat, for documentation purposes:

  1. Installed the --daemon install from "darwin: encrypt nix volume if filevault is enabled" (#4289) onto the AMI and enabled the buildkite-agent secure token. Instances booted from this AMI behave as follows:
    the Nix volume is still unmounted, and the buildkite-agent secure token is disabled again. After SSHing into the machine, enabling the token, and rebooting, the Nix volume mounts correctly.

  2. Nix seems to need Full Disk Access (FDA) (@abathur is that right?). A buildkite-agent run directly from a local login session via SSH seems to have some permission that the buildkite-agent started from the launchd daemon bootstrap shell lacks.

@abathur
Member

abathur commented Mar 18, 2021

  1. Nix seems to need Full Disk Access (FDA) (@abathur is that right?). A buildkite-agent run directly from a local login session via SSH seems to have some permission that the buildkite-agent started from the launchd daemon bootstrap shell lacks.

I'm not sure whether Nix does or doesn't in this context. My understanding is that a few people have worked around issues like this by adding an FDA exemption for /bin/sh (because the launchdaemon for nix-daemon uses /bin/sh).

The last comment in the other thread asked about whether you'd added the FDA exemption for buildkite-agent. I'm not sure if that is a documented expectation on their end or not. The exemption should propagate to some degree, so I'd try that one first.

I had some thoughts late yesterday about removing/replacing the remaining homedir references from your launchd plist. I'm curious if you did try that (don't feel obliged, mainly wondering if we should follow up on that possibility).

@OliverKoo
Author

I had some thoughts late yesterday about removing/replacing the remaining homedir references from your launchd plist. I'm curious if you did try that (don't feel obliged, mainly wondering if we should follow up on that possibility).

Do you mean this or something else? I am willing to test drive it.

@abathur
Member

abathur commented Mar 18, 2021

@OliverKoo
Author

OliverKoo commented Mar 19, 2021

Current state of the plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!--
  A launchd config for loading buildkite-agent on system boot on OS X
  systems, and runs in GUI mode (which allows Xcode UI testing but requires
  the user to login)
-->
<plist version="1.0">
  <dict>
    <key>Label</key>
    <string>com.buildkite.buildkite-agent</string>

    <key>ProgramArguments</key>
    <array>
      <string>/usr/local/bin/buildkite-agent</string>
      <string>start</string>
    </array>

    <key>KeepAlive</key>
    <dict>
      <key>SuccessfulExit</key>
      <false/>
    </dict>

    <key>RunAtLoad</key>
    <true/>

    <key>ProcessType</key>
    <string>Interactive</string>

    <key>UserName</key>
    <string>buildkite-agent</string>

    <key>ThrottleInterval</key>
    <integer>30</integer>

    <key>StandardOutPath</key>
    <string>/usr/local/var/log/buildkite-agent.log</string>

    <key>StandardErrorPath</key>
    <string>/usr/local/var/log/buildkite-agent.error.log</string>

    <key>EnvironmentVariables</key>
    <dict>
      <key>PATH</key>
      <string>/nix/var/nix/profiles/default/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin</string>

      <key>BUILDKITE_AGENT_CONFIG</key>
      <string>/usr/local/etc/buildkite-agent/buildkite-agent.cfg</string>

      <key>USER</key>
      <string>buildkite-agent</string>

      <key>AWS_REGION</key>
      <string>us-east-1</string>

      <key>NIX_PROFILES</key>
      <string>/nix/var/nix/profiles/default</string>

      <key>NIX_SSL_CERT_FILE</key>
      <string>/nix/var/nix/profiles/default/etc/ssl/certs/ca-bundle.crt</string>
    </dict>
  </dict>
</plist>

@abathur
Member

abathur commented Mar 19, 2021

@klardotsh I think we've exhausted our current ideas for getting this to work without the full-disk-access security exemption--do you remember what hoops you needed to jump through to get a VNC session?

@OliverKoo
Author

👋 @klardotsh, for the above reference

@OliverKoo
Author

Here are the instructions for getting a VNC session to mac1.metal instances: https://gist.github.com/sebsto/6af5bf3acaf25c00dd938c3bbe722cc1

@klardotsh

Those instructions look roughly like what I followed (which was https://www.lets-talk-about.tech/2020/12/aws-create-macos-desktop.html), so they should work. I've sadly been juggling a lot of things so haven't dived too far into the Nix-on-Mac rabbit hole lately (a coworker got Nix working on his Mac with far fewer issues than on my EC2 instance, so it fell down the priority list a bit)

@stale

stale bot commented Sep 21, 2021

I marked this as stale due to inactivity. → More info

@leowini

leowini commented May 22, 2022

I also ran into build issues due to my /nix being owned by root. I'm not sure if it was initially owned by my user, because I didn't check when I installed it.

I am on an M1 MacBook Pro (not a VM).

I will try to reinstall and see if the owner is still root.
