deployment.autoLuks deprecation #62211

Open
flokli opened this issue May 29, 2019 · 2 comments

Comments

@flokli
Contributor

commented May 29, 2019

Issue description

This issue is a placeholder for all those users raising their voices in response to the deprecation of NixOps deployment.autoLuks introduced in #61321 (and backported to 19.03).

Please let us know if you are using the feature!

NixOps deployment.autoLuks is a feature that automatically sets up block devices with LUKS encryption without storing secrets on the target devices.

Even in its current state it seems to be halfway broken (e.g. removing a LUKS device panics systemd), and people expressed doubts on whether it's being used at all.

Looking at the NixOps repository and searching for public infrastructure repositories didn't yield a large (or any) userbase of the feature. Thus we are asking for feedback if you are using it.

The changes previously made in our systemd fork included changes to the startup unit ordering. Local filesystems were no longer part of the very basic system init, which allowed sshd and similar processes to start before all mount units had finished.

Due to those relaxed boot requirements a bunch of errors with state and runtime directories appeared. There were some fixes but they are still incomplete (e.g. nixos-rebuild switch regenerates all the state directories but reboots do not have the same guarantee).
Backing out of these changes and restoring a sane boot order, at the price of a few more lines of configuration in NixOps setups, seems like a reasonable tradeoff.

Why did this become necessary?

In the past our systemd fork carried a patch (NixOS/systemd@ce79214) that removed local-fs.target from sysinit.target. This allowed services such as sshd to start before all of the local filesystems were mounted, making it possible to send over keys using sshd.service. While probably a plausible workaround at the time, this caused some weird behavior down the road.

Systemd didn't support _netdev and subsequently struggled with all kinds of network block devices until roughly 2014.

Since systemd started managing StateDirectory, RuntimeDirectory, etc. (https://www.freedesktop.org/software/systemd/man/systemd.exec.html) and systemd-tmpfiles (https://www.freedesktop.org/software/systemd/man/systemd-tmpfiles.html), and as their usage has increased (even inside systemd itself), the number of unexpected side effects has grown.

While probably not noticeable for most people, there is a race condition between the directories in /run/ and /var/lib/ being generated and the rest of the system coming up. In many cases we might just be lucky that all the directories exist. In general it led to many PreStart scripts that create those directories if they are missing. Those scripts in turn had to be privileged, since most daemons are not run with root privileges. The option we used to turn those scripts into privileged scripts is now deprecated. We have an ongoing effort to replace them where possible (#56265, #62050, …).
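
As a rough illustration of the direction those replacements take, a service can ask systemd to create its directories instead of creating them in a privileged PreStart script. This is only a sketch: the service name `exampled` and its `ExecStart` command are hypothetical placeholders, and the real modules touched in the referenced PRs may look different.

```nix
{ pkgs, ... }:
{
  # "exampled" is a hypothetical service used only for illustration.
  systemd.services.exampled = {
    wantedBy = [ "multi-user.target" ];
    serviceConfig = {
      # Placeholder command standing in for a real daemon.
      ExecStart = "${pkgs.coreutils}/bin/sleep infinity";
      # Run unprivileged; systemd allocates the user at start time.
      DynamicUser = true;
      # systemd creates /var/lib/exampled and /run/exampled with the
      # right ownership before the service starts, so no privileged
      # PreStart script is needed to create them.
      StateDirectory = "exampled";
      RuntimeDirectory = "exampled";
    };
  };
}
```

With this approach the directories are created each time the unit starts, including after a reboot, rather than only during activation.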

Besides that, we are trying to reduce the number of custom patches applied to systemd. In the long run this should make our systemd package easier to maintain. Eventually we would like to upstream some of our changes in a portable way. Things that aren't strictly required for systemd to work on NixOS should therefore go away.

What can I do to make it work again?

Make sure you add _netdev to the mount options of all the filesystems you are mounting via the autoLuks module. Adding that option moves them from local-fs.target to remote-fs.target, which allows your system to start sshd even without the LUKS volumes. Afterwards you can use nixops send-keys again.
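
A minimal sketch of what this could look like in a NixOps machine definition follows. The volume name `secretdisk`, the device path and the mount point are placeholders, other autoLuks settings are omitted, and the option named in the error message is not shown here and still has to be set as that message describes.

```nix
{
  # Placeholder LUKS volume; adjust the name and device to your setup.
  deployment.autoLuks.secretdisk = {
    device = "/dev/xvdf";
  };

  fileSystems."/data" = {
    # The opened LUKS volume shows up under /dev/mapper/<name>.
    device = "/dev/mapper/secretdisk";
    fsType = "ext4";
    # _netdev orders the mount with remote-fs.target instead of
    # local-fs.target, so sshd can start before the volume is unlocked.
    options = [ "_netdev" ];
  };
}
```

After deploying such a configuration, the machine should boot far enough to accept SSH connections, and nixops send-keys can then deliver the keys.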

Do not forget to read the error message you got and set the option that was mentioned there.

@andir andir referenced this issue May 29, 2019
4 of 10 tasks complete
@AmineChikhaoui

Member

commented Jun 3, 2019

We're using this feature at Infor for hundreds of deployments. Not all of them are on 19.03 yet, but this will become a problem soon.

@JohnAZoidberg

Member

commented Jul 4, 2019

Adding _netdev to the autoLuks devices seems like a good option. We just need to document it together with the autoLuks option.

This is what mount(8) says about it:

_netdev
The filesystem resides on a device that requires network access (used to prevent the system from attempting to mount these filesystems until the network has been enabled on the system).
